Code

The Code node prepares code-based text for further processing, such as embedding or classification. It segments code into logical chunks by detecting syntax boundaries and respecting language-specific structures.
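
To make "detecting syntax boundaries" concrete, here is a minimal sketch that splits Python source into one chunk per top-level statement using the standard-library ast module. It illustrates the general idea only and is not the node's actual implementation; the function name split_top_level and the sample input are hypothetical.

```python
import ast


def split_top_level(source: str) -> list[str]:
    """Split Python source into one chunk per top-level statement.

    Hypothetical helper for illustration; not part of the Code node.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        # Decorators sit above the "def"/"class" line, so start from the
        # earliest decorator line when the node has any.
        decorator_lines = [d.lineno for d in getattr(node, "decorator_list", [])]
        start = min([node.lineno] + decorator_lines)
        chunks.append("\n".join(lines[start - 1:node.end_lineno]))
    return chunks


SAMPLE = '''import os

def greet(name):
    return "hi " + name

class Box:
    def __init__(self, v):
        self.v = v
'''

chunks = split_top_level(SAMPLE)
for chunk in chunks:
    print(chunk)
    print("-----")
```

Splitting on parsed syntax nodes rather than on a fixed character count keeps each function or class intact. That is the property the description above refers to as respecting language-specific structures.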

Inputs

  • Text – Receives plain code or code-formatted text. This is the raw content to be split and transformed.
  • Documents – Document objects containing code to be preprocessed

Outputs

  • Text – Preprocessed code content
  • Documents – Emits structured chunks of code as processed document segments. These segments are optimized for downstream components such as vector embeddings or LLMs.

Configuration

Code Normalization Options

  • Language (Code Splitter Profile) – Programming language
    • Default – Auto
    • Notes – Automatically detect or specify manually
  • Remove Comments – Remove code comments
    • Default – True
    • Notes – Can be configured to preserve docstrings
  • Format Code – Apply code formatting
    • Default – True
    • Notes – Language-specific formatting rules
  • Remove Whitespace – Normalize whitespace
    • Default – True
    • Notes – Preserves indentation structure
  • Normalize Identifiers – Standardize variable/function names
    • Default – False
    • Notes – Useful for code comparison
  • Maximum String Length – Specify the maximum number of characters allowed in each chunk
    • Example – 512
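
As an illustration of how a maximum chunk length might be enforced, the sketch below packs whole lines into chunks of at most max_chars characters and hard-splits only lines that exceed the limit on their own. The function chunk_by_length is a hypothetical helper. How the node itself applies Maximum String Length is not documented here and may differ.

```python
def chunk_by_length(source: str, max_chars: int = 512) -> list[str]:
    """Pack whole lines into chunks no longer than max_chars characters.

    Hypothetical helper for illustration; not part of the Code node.
    """
    chunks = []
    current = ""
    for line in source.splitlines(keepends=True):
        # A single line longer than the limit has to be hard-split.
        while len(line) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(line[:max_chars])
            line = line[max_chars:]
        # Start a new chunk when adding this line would exceed the limit.
        if len(current) + len(line) > max_chars:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks


text = "a = 1\n" * 10
small_chunks = chunk_by_length(text, max_chars=20)
long_line_chunks = chunk_by_length("x" * 50, max_chars=20)
```

Breaking on line boundaries where possible avoids cutting a statement in half, which tends to matter for embedding quality.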

Advanced Settings

  • Preserve Docstrings – Keep documentation strings
    • Default – True
    • Notes – Only applies when Remove Comments is True
  • Extract Functions – Extract function definitions
    • Default – False
    • Notes – Creates separate outputs for each function
  • Extract Classes – Extract class definitions
    • Default – False
    • Notes – Creates separate outputs for each class
  • Custom Patterns – Apply custom regex patterns
    • Default – None
    • Notes – For specialized code cleaning
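
The Extract Functions and Extract Classes options can be pictured with the following sketch, which collects top-level function and class definitions from Python source using the standard-library ast module. This is an assumption about the general technique, not the node's implementation; extract_definitions and the sample source are hypothetical.

```python
import ast


def extract_definitions(source: str) -> dict[str, list[str]]:
    """Collect the source text of top-level functions and classes.

    Hypothetical helper for illustration; not part of the Code node.
    """
    tree = ast.parse(source)
    found = {"functions": [], "classes": []}
    # Only top-level nodes are visited, so methods stay inside their class.
    for node in tree.body:
        segment = ast.get_source_segment(source, node)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            found["functions"].append(segment)
        elif isinstance(node, ast.ClassDef):
            found["classes"].append(segment)
    return found


SOURCE = '''import math

def area(r):
    """Return the area of a circle."""
    return math.pi * r * r

class Circle:
    def __init__(self, r):
        self.r = r
'''

defs = extract_definitions(SOURCE)
```

Note that ast.get_source_segment starts at the def or class keyword, so decorators above a definition are not included in the extracted text.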

Example Usage

Basic Code Cleaning

This example shows how to configure the Code node for basic code cleaning:
{
  "language": "python",
  "removeComments": true,
  "formatCode": true,
  "removeWhitespace": true,
  "normalizeIdentifiers": false,
  "preserveDocstrings": true
}

Advanced Code Processing

For more advanced code processing with extraction features:
{
  "language": "javascript",
  "removeComments": true,
  "formatCode": true,
  "removeWhitespace": true,
  "normalizeIdentifiers": true,
  "preserveDocstrings": false,
  "extractFunctions": true,
  "extractClasses": true,
  "customPatterns": [
    {"pattern": "console\\.log\\([^)]*\\);", "replacement": ""},
    {"pattern": "debugger;", "replacement": ""}
  ]
}
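
To show what the two customPatterns entries above would do to a piece of JavaScript, the sketch below applies them in order with Python's re.sub. The helper apply_custom_patterns is hypothetical, and it assumes the node treats each entry as a regular-expression substitution, which the configuration suggests but does not confirm.

```python
import re

# The same rules as in the configuration above, written as Python raw strings.
CUSTOM_PATTERNS = [
    {"pattern": r"console\.log\([^)]*\);", "replacement": ""},
    {"pattern": r"debugger;", "replacement": ""},
]


def apply_custom_patterns(code: str, patterns: list[dict]) -> str:
    """Apply each pattern/replacement pair in order (hypothetical helper)."""
    for rule in patterns:
        code = re.sub(rule["pattern"], rule["replacement"], code)
    return code


js = (
    "function add(a, b) {\n"
    '  console.log("adding", a, b);\n'
    "  debugger;\n"
    "  return a + b;\n"
    "}\n"
)
cleaned = apply_custom_patterns(js, CUSTOM_PATTERNS)
print(cleaned)

# Limitation of the first pattern: [^)]* cannot cross a ")", so a call
# with nested parentheses is left untouched.
nested = apply_custom_patterns("console.log(f(x));", CUSTOM_PATTERNS)
```

Because the first pattern stops at the first closing parenthesis, log calls that contain nested calls are not removed. A parser-based cleanup would be needed to handle those reliably.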

Best Practices

Code Preparation

  • Specify the programming language explicitly for best results
  • Consider preserving docstrings for documentation-heavy code
  • Use function extraction for analyzing specific code components
  • Normalize identifiers when comparing code functionality across different implementations

Performance Considerations

  • Code formatting can be resource-intensive for large files
  • Processing very large codebases may require batch processing
  • Consider memory usage when processing large volumes of code
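
One way to apply the batch-processing advice above is to feed files to the pipeline in fixed-size groups so that only one group is held in memory at a time. The sketch below is a generic pattern; batched_items and the file names are hypothetical and not part of the node.

```python
from itertools import islice
from typing import Iterable, Iterator


def batched_items(items: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield successive lists of at most `size` items (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    # islice pulls the next `size` items; an empty list ends the loop.
    while batch := list(islice(iterator, size)):
        yield batch


paths = ["a.py", "b.py", "c.py", "d.py", "e.py"]
batches = list(batched_items(paths, 2))
for batch in batches:
    print(batch)
```

Because the helper is a generator, it also works with lazily produced inputs such as a directory walk, so the full file list never needs to be materialized.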

Troubleshooting

Processing Problems

  • Incorrect language detection – Specify language explicitly instead of using Auto
  • Important comments lost – Enable preserveDocstrings option
  • Code structure altered – Disable formatting if precise structure preservation is needed

Performance Issues

  • Slow processing – Disable resource-intensive options like formatting for large codebases
  • Memory errors – Process code in smaller batches

Technical Reference

For detailed technical information, refer to:

  • Code Preprocessor Source Code – /../../../aparavi-connectors/connectors/preprocessor/code.py