The General Text node splits unstructured or semi-structured text into smaller document chunks suitable for indexing, embedding, or LLM input. It supports table and plain text formats and applies configurable splitting logic.
Inputs
- Text – Accepts unstructured or semi-structured free-form text for processing
- Documents – Document objects containing text to be preprocessed
- Table – Accepts structured data in table format (e.g., CSV or tabular JSON) and prepares it for segmentation
Outputs
- Text – Preprocessed text content
- Documents – Emits a list of segmented text blocks as structured documents
Configuration
Splitter Options
- Text Splitter – Select the splitting method
  - Default – Default Text Splitter (general-purpose; best balance of structure and size)
- Split by – Choose the splitting strategy. Options include:
  - String length
  - Line count
  - Punctuation
- String Length – Set the character limit per chunk when using string length as the splitting rule
  - Example – 512
Text Normalization Options
- Remove Whitespace – Remove excessive whitespace
  - Default – True
  - Notes – Preserves single spaces between words
- Remove Punctuation – Remove punctuation marks
  - Default – False
  - Notes – Can be configured to preserve specific marks
- Lowercase – Convert text to lowercase
  - Default – True
  - Notes – Improves matching consistency by removing case distinctions
- Remove Numbers – Remove numerical digits
  - Default – False
  - Notes – Can be configured to preserve specific patterns
- Remove Stop Words – Remove common stop words
  - Default – False
  - Notes – Language-dependent
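Taken together, these normalization options amount to a pipeline like the sketch below. The function name, defaults, and the tiny stop-word set are illustrative only, mirroring the options above rather than the node's real API.

```python
import re
import string

# Illustrative defaults mirroring the normalization options; not the node's API.
DEFAULT_STOP_WORDS = frozenset({"the", "a", "an", "and", "or"})

def normalize(text: str,
              remove_whitespace: bool = True,
              remove_punctuation: bool = False,
              lowercase: bool = True,
              remove_numbers: bool = False,
              remove_stop_words: bool = False,
              stop_words: frozenset = DEFAULT_STOP_WORDS) -> str:
    if lowercase:
        text = text.lower()
    if remove_punctuation:
        text = text.translate(str.maketrans("", "", string.punctuation))
    if remove_numbers:
        text = re.sub(r"\d+", "", text)
    if remove_whitespace:
        # Collapse runs of whitespace, preserving single spaces between words
        text = re.sub(r"\s+", " ", text).strip()
    if remove_stop_words:
        text = " ".join(w for w in text.split() if w not in stop_words)
    return text
```

For example, with the defaults above, `normalize("Hello,   World!")` yields `"hello, world!"`.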
Advanced Settings
- Language – Select the language used for language-specific operations
  - Default – English
  - Notes – Affects stop words, stemming, etc.
- Stemming – Apply stemming to words
  - Default – False
  - Notes – Reduces words to their root form
- Lemmatization – Apply lemmatization to words
  - Default – False
  - Notes – More accurate than stemming but slower
- Custom Regex – Apply custom regex patterns
  - Default – None
  - Notes – For specialized text cleaning
Example Usage
Basic Text Cleaning
This example shows how to configure the Preprocessor for basic text cleaning:
```json
{
  "removeWhitespace": true,
  "removePunctuation": true,
  "lowercase": true,
  "removeNumbers": false,
  "removeStopWords": false
}
```
Advanced Text Processing
For more advanced text processing with language-specific features:
```json
{
  "removeWhitespace": true,
  "removePunctuation": true,
  "lowercase": true,
  "removeNumbers": true,
  "removeStopWords": true,
  "language": "english",
  "stemming": false,
  "lemmatization": true,
  "customRegex": [
    {"pattern": "http\\S+", "replacement": ""},
    {"pattern": "@\\w+", "replacement": ""}
  ]
}
```
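The customRegex entries above are applied in order, each pattern substituted with its replacement string. A minimal sketch of that behavior (the helper name is hypothetical):

```python
import re

# Same shape as the customRegex setting: URLs and @-mentions stripped out.
patterns = [
    {"pattern": r"http\S+", "replacement": ""},
    {"pattern": r"@\w+", "replacement": ""},
]

def apply_custom_regex(text: str, rules: list[dict]) -> str:
    # Apply each rule in order; later rules see earlier rules' output.
    for rule in rules:
        text = re.sub(rule["pattern"], rule["replacement"], text)
    return text
```

Because rules run in sequence, an earlier replacement can affect what a later pattern matches, so rule order matters.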
Best Practices
Text Preparation
- Consider your downstream tasks when selecting preprocessing options
- For search applications, removing stop words can improve results
- For sentiment analysis, preserving punctuation may be important
- Use lemmatization for more accurate results when processing time is not critical
Performance Considerations
- Lemmatization is more resource-intensive than stemming
- Processing very large documents may require batch processing
- Consider memory usage when processing large volumes of text
Troubleshooting
Processing Problems
- Text not properly normalized – Check language settings and normalization options
- Important information lost – Adjust settings to preserve critical content (e.g., keep numbers for financial text)
- Unexpected output – Review custom regex patterns for errors
Performance Issues
- Slow processing – Disable resource-intensive options like lemmatization for large volumes
- Memory errors – Process text in smaller batches
Technical Reference
For detailed technical information, refer to:
- Preprocessor Source Code ../../../aparavi-connectors/connectors/preprocessor/general.py