Parsing/Preprocessing/Embedding

This component combines document parsing, text preprocessing, and embedding generation in a single node. It provides an end-to-end solution for converting raw documents into vector representations suitable for semantic search and analysis.

Use a Parsing/Preprocessing/Embedding component in a flow

To use a Parsing/Preprocessing/Embedding component in your workflow, follow these steps:

  1. Drag the component from the Other section onto your canvas
  2. Connect document source inputs to the input port
  3. Configure the parsing, preprocessing, and embedding settings
  4. Connect the output ports to downstream processing nodes

Inputs

  • Source – Raw document files or URLs to process
  • Options – Optional configuration parameters

Outputs

  • Documents – Structured document objects with embeddings
  • Vectors – Generated vector embeddings
  • Text – Preprocessed text content
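Conceptually, the component maps one source into the three outputs above. The following Python sketch illustrates that data flow; the stage functions here are simplified stand-in stubs for illustration, not the component's actual implementation:

```python
# Conceptual sketch of how the three output ports relate.
# parse_document, preprocess, and embed are hypothetical stubs.
def parse_document(source: str) -> list[str]:
    return [source]  # stub: treat the input as a single text chunk

def preprocess(text: str) -> str:
    return " ".join(text.split()).lower()  # stub: collapse whitespace, lowercase

def embed(texts: list[str]) -> list[list[float]]:
    return [[float(len(t))] for t in texts]  # stub: toy 1-dimensional "embedding"

def process(source: str):
    chunks = parse_document(source)           # parsing: source -> text chunks
    texts = [preprocess(c) for c in chunks]   # preprocessing: cleaned text
    vectors = embed(texts)                    # embedding: one vector per chunk
    documents = [{"text": t, "embedding": v}  # Documents output
                 for t, v in zip(texts, vectors)]
    return documents, vectors, texts          # Documents / Vectors / Text ports
```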

Configuration

Parsing Settings

  • Document Types – Supported document formats
    • Default – ["Text", "Office", "PDF"]
    • Notes – Multiple formats supported
  • Extract Images – Extract and process images
    • Default – false
    • Notes – Increases processing time
  • OCR – Apply optical character recognition
    • Default – false
    • Notes – For image-based content
  • Chunk Size – Size of text chunks
    • Default – 1000
    • Notes – Characters per chunk
  • Overlap – Characters shared between consecutive chunks
    • Default – 200
    • Notes – Preserves context across chunk boundaries
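The chunking behavior implied by Chunk Size (together with the overlap parameter used in the configuration examples below) can be sketched as a fixed-size character chunker. This is an illustrative sketch, not the component's exact algorithm:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character chunks that share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping the overlap region
    return chunks
```

With the defaults, a 2,500-character document yields four chunks starting at offsets 0, 800, 1600, and 2400.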

Preprocessing Settings

  • Remove Whitespace – Remove excessive whitespace
    • Default – true
    • Notes – Collapses runs of whitespace to single spaces
  • Lowercase – Convert text to lowercase
    • Default – true
    • Notes – Improves consistency
  • Remove Punctuation – Remove punctuation marks
    • Default – false
    • Notes – May discard meaningful symbols; enable only when punctuation is noise
  • Remove Stop Words – Remove common stop words
    • Default – false
    • Notes – Language-dependent
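A minimal sketch of how these four options might be applied in sequence; the stop-word set here is a tiny illustrative list, not the component's real language-dependent one:

```python
import re
import string

# Illustrative stop-word set; a real component would use a language-specific list.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}

def preprocess(text: str, remove_whitespace: bool = True, lowercase: bool = True,
               remove_punctuation: bool = False, remove_stop_words: bool = False) -> str:
    if remove_whitespace:
        text = re.sub(r"\s+", " ", text).strip()  # collapse runs to single spaces
    if lowercase:
        text = text.lower()
    if remove_punctuation:
        text = text.translate(str.maketrans("", "", string.punctuation))
    if remove_stop_words:
        text = " ".join(w for w in text.split() if w not in STOP_WORDS)
    return text
```

For example, with the defaults `"The  quick,  brown Fox."` becomes `"the quick, brown fox."`, keeping punctuation and stop words intact.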

Embedding Settings

  • Model – Embedding model to use
    • Default – "sentence-transformers/all-MiniLM-L6-v2"
    • Notes – Other models can be selected; see the advanced example
  • Dimensions – Vector dimensions
    • Default – 384
    • Notes – Must match the selected model (e.g., 1536 for text-embedding-ada-002)
  • Batch Size – Number of texts to embed at once
    • Default – 32
    • Notes – Affects memory usage
  • Normalize – Normalize vector lengths
    • Default – true
    • Notes – Improves similarity calculations
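Batch Size and Normalize interact as sketched below: texts are embedded `batch_size` at a time to bound memory, and each vector is scaled to unit (L2) length so that a dot product equals cosine similarity. `embed_fn` is a stand-in for whatever model backend is configured:

```python
import math

def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def embed_in_batches(texts, embed_fn, batch_size: int = 32,
                     normalize_output: bool = True) -> list[list[float]]:
    """Embed texts batch_size at a time to bound peak memory usage."""
    vectors = []
    for i in range(0, len(texts), batch_size):
        batch_vecs = embed_fn(texts[i:i + batch_size])  # one model call per batch
        if normalize_output:
            batch_vecs = [normalize(v) for v in batch_vecs]
        vectors.extend(batch_vecs)
    return vectors
```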

Example Usage

Basic Document to Vector Pipeline

This example shows how to configure the component for a basic document-to-vector pipeline:

{
  "parsing": {
    "documentTypes": ["Text", "Office", "PDF"],
    "extractImages": false,
    "ocr": false,
    "chunkSize": 1000,
    "overlap": 200
  },
  "preprocessing": {
    "removeWhitespace": true,
    "lowercase": true,
    "removePunctuation": false,
    "removeStopWords": false
  },
  "embedding": {
    "model": "sentence-transformers/all-MiniLM-L6-v2",
    "dimensions": 384,
    "batchSize": 32,
    "normalize": true
  }
}

Advanced RAG Pipeline Configuration

For a more advanced configuration optimized for RAG applications:

{
  "parsing": {
    "documentTypes": ["Text", "Office", "PDF", "HTML", "Email"],
    "extractImages": true,
    "ocr": true,
    "chunkSize": 500,
    "overlap": 100,
    "tableExtraction": true
  },
  "preprocessing": {
    "removeWhitespace": true,
    "lowercase": true,
    "removePunctuation": true,
    "removeStopWords": true,
    "language": "english",
    "stemming": false,
    "lemmatization": true
  },
  "embedding": {
    "model": "openai/text-embedding-ada-002",
    "dimensions": 1536,
    "batchSize": 8,
    "normalize": true,
    "cache": true,
    "instruction": "Represent this document for retrieval:"
  }
}

Best Practices

Pipeline Optimization

  • Adjust chunk size based on your retrieval needs
  • Select preprocessing options that preserve important content
  • Choose embedding models appropriate for your domain
  • Balance processing quality with performance requirements

Performance Considerations

  • Process documents in batches for large collections
  • Disable resource-intensive features such as OCR and image extraction when speed matters
  • Use smaller embedding models for speed, larger for accuracy
  • Enable caching for repeated processing of similar content
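One way caching for repeated content could work is keying embeddings by a hash of the text, so identical chunks are embedded only once. This is an in-memory sketch, not the component's actual cache; `embed_fn` is a hypothetical batch embedder:

```python
import hashlib

class EmbeddingCache:
    """Cache embeddings by content hash so repeated texts embed only once."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.store = {}  # content hash -> embedding vector

    def get(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self.store:
            self.store[key] = self.embed_fn([text])[0]  # embed on first sight only
        return self.store[key]
```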

Troubleshooting

Processing Problems

  • Parsing failures – Check document formats and integrity
  • Missing content – Enable OCR for image-based text
  • Poor embedding quality – Try different preprocessing options or embedding models

Performance Issues

  • Slow processing – Adjust batch sizes or disable intensive features
  • Memory errors – Reduce chunk size or batch size
  • High resource usage – Use more efficient models or processing options