The Code node prepares source code for downstream processing such as embedding or classification. It segments code into logical chunks by detecting syntax boundaries and respecting language-specific structures.
Inputs
- Text – Receives plain code or code-formatted text. This is the raw content to be split and transformed.
- Documents – Receives document objects containing code to be preprocessed.
Outputs
- Text – Emits the preprocessed code content.
- Documents – Emits structured chunks of code as processed document segments. These segments are optimized for downstream components such as vector embeddings or LLMs.
Configuration
Code Normalization Options
- Language (Code Splitter Profile) – Programming language
  - Default – Auto
  - Notes – Automatically detect or specify manually
- Remove Comments – Remove code comments
  - Default – True
  - Notes – Can be configured to preserve docstrings
- Format Code – Apply code formatting
  - Default – True
  - Notes – Language-specific formatting rules
- Remove Whitespace – Normalize whitespace
  - Default – True
  - Notes – Preserves indentation structure
- Normalize Identifiers – Standardize variable/function names
  - Default – False
  - Notes – Useful for code comparison
- Maximum String Length – Maximum number of characters allowed in each chunk
  - Example – 512
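To make the Maximum String Length setting concrete, here is a minimal Python sketch of length-bounded chunking. It is an illustration only: the function name `chunk_code` and the line-based splitting are assumptions, and the node itself is described as splitting on syntax boundaries rather than on plain line breaks.

```python
def chunk_code(source: str, max_chars: int = 512) -> list[str]:
    """Split source into chunks of at most max_chars, breaking at line ends where possible."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks: list[str] = []
    current = ""
    for line in source.splitlines(keepends=True):
        # A single line longer than the limit has to be hard-split.
        while len(line) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(line[:max_chars])
            line = line[max_chars:]
        # Start a new chunk when adding this line would exceed the limit.
        if len(current) + len(line) > max_chars:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks


sample = "a = 1\nb = 2\nc = 3\n"
print(chunk_code(sample, max_chars=12))
```

Joining the returned chunks reproduces the input exactly, which makes it easy to check that no code was dropped.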
Advanced Settings
- Preserve Docstrings – Keep documentation strings
  - Default – True
  - Notes – Only applies when Remove Comments is True
- Extract Functions – Extract function definitions
  - Default – False
  - Notes – Creates separate outputs for each function
- Extract Classes – Extract class definitions
  - Default – False
  - Notes – Creates separate outputs for each class
- Custom Patterns – Apply custom regex patterns
  - Default – None
  - Notes – For specialized code cleaning
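Extract Functions and Extract Classes can be pictured with Python's standard `ast` module. The sketch below is an assumption about the general approach, not the node's implementation; `extract_definitions` is a hypothetical helper that only handles top-level definitions in Python input.

```python
import ast


def extract_definitions(source: str) -> dict[str, list[str]]:
    """Return the source text of each top-level function and class."""
    found: dict[str, list[str]] = {"functions": [], "classes": []}
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            found["functions"].append(ast.get_source_segment(source, node))
        elif isinstance(node, ast.ClassDef):
            found["classes"].append(ast.get_source_segment(source, node))
    return found


sample = (
    "import os\n"
    "\n"
    "def greet(name):\n"
    "    return 'hello ' + name\n"
    "\n"
    "class Greeter:\n"
    "    def run(self):\n"
    "        return greet('world')\n"
)
for kind, segments in extract_definitions(sample).items():
    print(kind, len(segments))
```

Each extracted segment could then be emitted as its own output, which matches the "separate outputs" behavior described in the notes above.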
Example Usage
Basic Code Cleaning
This example shows how to configure the Code node for basic code cleaning:
{
  "language": "python",
  "removeComments": true,
  "formatCode": true,
  "removeWhitespace": true,
  "normalizeIdentifiers": false,
  "preserveDocstrings": true
}
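For Python input, the combined effect of removeComments and preserveDocstrings can be approximated with the standard `tokenize` module. This is a minimal sketch under the assumption that comments are dropped at the token level; `strip_comments` is a hypothetical name, and the node's actual behavior may differ.

```python
import io
import tokenize


def strip_comments(source: str) -> str:
    """Remove '#' comments from Python source while leaving docstrings intact."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    # Docstrings are STRING tokens, so filtering COMMENT tokens never touches them.
    kept = [tok for tok in tokens if tok.type != tokenize.COMMENT]
    rebuilt = tokenize.untokenize(kept)
    # untokenize pads the gaps left by removed comments; trim that trailing whitespace.
    return "\n".join(line.rstrip() for line in rebuilt.splitlines()) + "\n"


sample = (
    "def add(a, b):\n"
    '    """Add two numbers."""\n'
    "    # inline helper comment\n"
    "    return a + b  # trailing comment\n"
)
print(strip_comments(sample))
```

Note that the trailing-whitespace trim also affects lines inside multi-line string literals, so this simplification is not suitable where such whitespace is significant.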
Advanced Code Processing
For more advanced code processing with extraction features:
{
  "language": "javascript",
  "removeComments": true,
  "formatCode": true,
  "removeWhitespace": true,
  "normalizeIdentifiers": true,
  "preserveDocstrings": false,
  "extractFunctions": true,
  "extractClasses": true,
  "customPatterns": [
    {"pattern": "console\\.log\\([^)]*\\);", "replacement": ""},
    {"pattern": "debugger;", "replacement": ""}
  ]
}
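The customPatterns entries read as regex search-and-replace rules. Assuming each rule is applied in order as a plain substitution (an assumption about the node, not confirmed behavior), their effect can be previewed in Python before committing them to the configuration. The helper name `apply_custom_patterns` is hypothetical.

```python
import json
import re

# The same rules as in the configuration above, kept as JSON so the
# backslash escaping matches what the config file contains.
PATTERNS_JSON = r"""
[
  {"pattern": "console\\.log\\([^)]*\\);", "replacement": ""},
  {"pattern": "debugger;", "replacement": ""}
]
"""


def apply_custom_patterns(code: str, patterns: list[dict]) -> str:
    """Apply each pattern/replacement pair to the code, in order."""
    for rule in patterns:
        code = re.sub(rule["pattern"], rule["replacement"], code)
    return code


sample = 'function f(x) {\n  console.log("x", x);\n  debugger;\n  return x + 1;\n}\n'
print(apply_custom_patterns(sample, json.loads(PATTERNS_JSON)))
```

One limitation worth knowing: `[^)]*` stops at the first closing parenthesis, so a call with nested parentheses such as `console.log(f(x));` is left untouched by the first rule.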
Best Practices
Code Preparation
- Specify the programming language explicitly for best results
- Consider preserving docstrings for documentation-heavy code
- Use function extraction for analyzing specific code components
- Normalize identifiers when comparing code functionality across different implementations
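To show why identifier normalization helps when comparing implementations, the sketch below replaces every user-chosen name in Python source with a positional placeholder and compares the resulting token lists. It is an illustration under assumptions: `normalized_tokens` is a hypothetical helper, and the node's own normalization scheme is not documented here.

```python
import builtins
import io
import keyword
import tokenize

SKIPPED = {tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER}
LAYOUT = {tokenize.INDENT, tokenize.DEDENT}


def normalized_tokens(source: str) -> list[str]:
    """Tokenize Python source, renaming identifiers to id_0, id_1, ... in order of first use."""
    placeholders: dict[str, str] = {}
    result: list[str] = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIPPED:
            continue
        text = tok.string
        if tok.type in LAYOUT:
            # Compare block structure, not the exact indentation characters.
            text = tokenize.tok_name[tok.type]
        elif (
            tok.type == tokenize.NAME
            and not keyword.iskeyword(text)
            and not hasattr(builtins, text)
        ):
            text = placeholders.setdefault(text, f"id_{len(placeholders)}")
        result.append(text)
    return result


first = "def total(items):\n    result = 0\n    for item in items:\n        result += item\n    return result\n"
second = "def sum_all(xs):\n    acc = 0\n    for x in xs:\n        acc += x\n    return acc\n"
print(normalized_tokens(first) == normalized_tokens(second))
```

The two functions differ only in naming, so their normalized token lists match, while a function with different logic would not.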
Performance Considerations
- Code formatting can be resource-intensive for large files
- Processing very large codebases may require batch processing
- Consider memory usage when processing large volumes of code
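Batch processing can be as simple as feeding the node a fixed number of files at a time. The generator below is a minimal sketch of that idea; `iter_batches` and the batch size are illustrative choices, not part of the node's configuration.

```python
from itertools import islice
from typing import Iterable, Iterator


def iter_batches(items: Iterable[str], batch_size: int) -> Iterator[list[str]]:
    """Yield successive lists of at most batch_size items without loading everything at once."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


paths = ["a.py", "b.py", "c.py", "d.py", "e.py"]
for batch in iter_batches(paths, batch_size=2):
    print(batch)
```

Because the generator pulls from an iterator, the input can itself be lazy (for example, a directory walk), so only one batch needs to be held in memory at a time.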
Troubleshooting
Processing Problems
- Incorrect language detection – Specify the language explicitly instead of using Auto
- Important comments lost – Enable the preserveDocstrings option to keep docstrings, or disable removeComments to keep all comments
- Code structure altered – Disable formatCode if the original structure must be preserved exactly
Performance Issues
- Slow processing – Disable resource-intensive options such as formatCode for large codebases
- Memory errors – Process code in smaller batches
Technical Reference
For detailed technical information, refer to:
- Code Preprocessor Source Code: /../../../aparavi-connectors/connectors/preprocessor/code.py