General Text

The General Text node splits unstructured or semi-structured text into smaller document chunks suitable for indexing, embedding, or LLM input. It supports table and plain text formats and applies configurable splitting logic.

Inputs

  • Text – Accepts unstructured or semi-structured free-form text for processing
  • Documents – Document objects containing text to be preprocessed
  • Table – Accepts structured data in table format (e.g., CSV or tabular JSON) and prepares it for segmentation

Outputs

  • Text – Preprocessed text content
  • Documents – Emits a list of segmented text blocks as structured documents

Configuration

Splitter Options

  • Text Splitter – Select the splitting method.
    • Default option – Default Text Splitter (General-purpose, best balance of structure and size)
  • Split by – Choose the splitting strategy
    • Options may include
      • String length
      • Line count
      • Punctuation
  • String Length – Set the character limit per chunk when using string length as the splitting rule
    • Example – 512
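
The three Split by strategies can be illustrated with a minimal Python sketch. This is not the node's actual implementation: the function name split_text, the option strings, and the lines_per_chunk parameter are hypothetical names chosen for the example, and the punctuation rule (split after ".", "!" or "?") is an assumption about what such a strategy typically does.

```python
import re
from typing import List


def split_text(text: str, split_by: str = "string_length",
               string_length: int = 512, lines_per_chunk: int = 10) -> List[str]:
    """Hypothetical helper showing one way each 'Split by' strategy could work."""
    if split_by == "string_length":
        # Fixed-size character windows; the last chunk may be shorter.
        return [text[i:i + string_length]
                for i in range(0, len(text), string_length)]
    if split_by == "line_count":
        # Group consecutive lines into chunks of lines_per_chunk lines.
        lines = text.splitlines()
        return ["\n".join(lines[i:i + lines_per_chunk])
                for i in range(0, len(lines), lines_per_chunk)]
    if split_by == "punctuation":
        # Split on whitespace that follows sentence-ending punctuation.
        parts = re.split(r"(?<=[.!?])\s+", text.strip())
        return [part for part in parts if part]
    raise ValueError(f"Unknown split strategy: {split_by}")


chunks = split_text("a" * 1000, split_by="string_length", string_length=512)
print([len(chunk) for chunk in chunks])  # [512, 488]
```

With a String Length of 512, a 1000-character input yields two chunks; a smaller limit produces more, shorter chunks, which usually suits embedding models with tight input limits.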

Text Normalization Options

  • Remove Whitespace – Remove excessive whitespace
    • Default – True
    • Notes – Preserves single spaces between words
  • Remove Punctuation – Remove punctuation marks
    • Default – False
    • Notes – Can be configured to preserve specific marks
  • Lowercase – Convert text to lowercase
    • Default – True
    • Notes – Ensures case-sensitive downstream operations (e.g., exact matching) treat "Word" and "word" as the same token
  • Remove Numbers – Remove numerical digits
    • Default – False
    • Notes – Can be configured to preserve specific patterns
  • Remove Stop Words – Remove common stop words
    • Default – False
    • Notes – Language-dependent
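
To make the effect of each normalization option concrete, here is a rough Python sketch that mirrors the options and defaults listed above. The function normalize, its parameter names, the order of the steps, and the tiny STOP_WORDS set are all assumptions made for illustration; the node's real stop word lists and processing order may differ.

```python
import re
import string

# Tiny illustrative stop word list (assumption); real lists are language-dependent.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in"}


def normalize(text: str, remove_whitespace: bool = True,
              remove_punctuation: bool = False, lowercase: bool = True,
              remove_numbers: bool = False,
              remove_stop_words: bool = False) -> str:
    """Hypothetical normalizer whose defaults follow the option table above."""
    if lowercase:
        text = text.lower()
    if remove_punctuation:
        # Drop every ASCII punctuation character.
        text = text.translate(str.maketrans("", "", string.punctuation))
    if remove_numbers:
        text = re.sub(r"\d+", "", text)
    if remove_stop_words:
        # Note: rejoining on spaces also collapses whitespace as a side effect.
        text = " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)
    if remove_whitespace:
        # Collapse whitespace runs to single spaces and trim the ends.
        text = re.sub(r"\s+", " ", text).strip()
    return text


print(normalize("  Hello,   World 42! "))  # hello, world 42!
```

With the defaults, only case and whitespace change; punctuation and digits survive until the corresponding options are enabled.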

Advanced Settings

  • Language – Language for language-specific operations
    • Default – English
    • Notes – Affects stop words, stemming, etc.
  • Stemming – Apply stemming to words
    • Default – False
    • Notes – Reduces words to their root form
  • Lemmatization – Apply lemmatization to words
    • Default – False
    • Notes – More accurate than stemming but slower
  • Custom Regex – Apply custom regex patterns
    • Default – None
    • Notes – For specialized text cleaning

Example Usage

Basic Text Cleaning

This example shows how to configure the General Text node for basic text cleaning:

{
  "removeWhitespace": true,
  "removePunctuation": true,
  "lowercase": true,
  "removeNumbers": false,
  "removeStopWords": false
}

Advanced Text Processing

For more advanced text processing with language-specific features:

{
  "removeWhitespace": true,
  "removePunctuation": true,
  "lowercase": true,
  "removeNumbers": true,
  "removeStopWords": true,
  "language": "english",
  "stemming": false,
  "lemmatization": true,
  "customRegex": [
    {"pattern": "http\\S+", "replacement": ""},
    {"pattern": "@\\w+", "replacement": ""}
  ]
}
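
The two customRegex rules in this configuration remove URLs (http\S+) and @-mentions (@\w+). The Python sketch below shows how pattern/replacement pairs of this shape could be applied in order. The helper apply_custom_regex is a hypothetical name, and applying the rules sequentially with re.sub is an assumption about the behavior, not a description of the node's internals.

```python
import json
import re
from typing import Dict, List

# The customRegex portion of the configuration above, kept as raw JSON text
# so the doubled backslashes are decoded by the JSON parser, not by Python.
config_json = r'''
{
  "customRegex": [
    {"pattern": "http\\S+", "replacement": ""},
    {"pattern": "@\\w+", "replacement": ""}
  ]
}
'''


def apply_custom_regex(text: str, rules: List[Dict[str, str]]) -> str:
    """Hypothetical helper: apply each pattern/replacement pair in list order."""
    for rule in rules:
        text = re.sub(rule["pattern"], rule["replacement"], text)
    return text


rules = json.loads(config_json)["customRegex"]
cleaned = apply_custom_regex(
    "Thanks @alice, see https://example.com/docs for details", rules)
print(cleaned)
```

Removing a match leaves the surrounding spaces and punctuation in place, so pairing custom regex rules with Remove Whitespace is usually worthwhile.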

Best Practices

Text Preparation

  • Consider your downstream tasks when selecting preprocessing options
  • For search applications, removing stop words can improve results
  • For sentiment analysis, preserving punctuation may be important
  • Use lemmatization for more accurate results when processing time is not critical

Performance Considerations

  • Lemmatization is more resource-intensive than stemming
  • Processing very large documents may require batch processing
  • Consider memory usage when processing large volumes of text
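
Batch processing can be sketched as a generator that processes a fixed number of documents at a time, so only one batch of results is held in memory. The names process_in_batches and batch_size are hypothetical, and str.lower stands in for whatever preprocessing function is actually applied.

```python
from typing import Callable, Iterable, Iterator, List


def process_in_batches(documents: Iterable[str],
                       process: Callable[[str], str],
                       batch_size: int = 100) -> Iterator[List[str]]:
    """Hypothetical helper: yield processed documents in lists of batch_size."""
    batch: List[str] = []
    for doc in documents:
        batch.append(process(doc))
        if len(batch) == batch_size:
            yield batch
            batch = []  # start a fresh list so the yielded batch is not mutated
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


for batch in process_in_batches(["Doc A", "Doc B", "Doc C"], str.lower, batch_size=2):
    print(batch)
```

Because the input is consumed lazily, documents can be streamed from a file or database cursor rather than loaded all at once.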

Troubleshooting

Processing Problems

  • Text not properly normalized – Check language settings and normalization options
  • Important information lost – Adjust settings to preserve critical content (e.g., keep numbers for financial text)
  • Unexpected output – Review custom regex patterns for errors

Performance Issues

  • Slow processing – Disable resource-intensive options like lemmatization for large volumes
  • Memory errors – Process text in smaller batches

Technical Reference

For detailed technical information, refer to:

  • Preprocessor Source Code – ../../../aparavi-connectors/connectors/preprocessor/general.py