This component combines document parsing, text preprocessing, and embedding generation in a single node. It provides an end-to-end solution for converting raw documents into vector representations suitable for semantic search and analysis.
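Conceptually, the component chains three stages: parse the raw document into text, split it into chunks, clean each chunk, and turn it into a vector. The sketch below is a simplified stand-in for those stages; the helper functions are illustrative only, not the component's actual internals.

def parse(raw: str) -> str:
    # Real parsing handles PDF, Office, and other formats; plain text is assumed here.
    return raw

def preprocess(text: str) -> str:
    # Mirrors the preprocessing defaults: collapse whitespace and lowercase.
    return " ".join(text.split()).lower()

def embed(text: str) -> list[float]:
    # Placeholder embedding; the component uses a real model (see Embedding Settings).
    return [float(ord(c)) for c in text[:8]]

def document_to_vectors(raw: str, chunk_size: int = 1000) -> list[list[float]]:
    text = parse(raw)
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return [embed(preprocess(chunk)) for chunk in chunks]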
Use a Parsing/Preprocessing/Embedding component in a flow
To use a Parsing/Preprocessing/Embedding component in your workflow, follow these steps:
- Drag the component from the Other section onto your canvas
- Connect your document sources to the Source input port
- Configure the parsing, preprocessing, and embedding settings
- Connect the output ports to downstream processing nodes
Inputs
- Source – Raw document files or URLs to process
- Options – Optional configuration parameters
Outputs
- Documents – Structured document objects with embeddings (an illustrative shape follows this list)
- Vectors – Generated vector embeddings
- Text – Preprocessed text content
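The exact schema of these outputs depends on the host platform; the TypedDict below is only a hypothetical illustration of what a structured document object with an embedding might carry.

from typing import TypedDict

class DocumentChunk(TypedDict):
    text: str               # preprocessed text content (Text output)
    embedding: list[float]  # vector for this chunk (Vectors output)
    source: str             # originating file or URL from the Source input

# The Documents output would then be a list of such chunks.
Documents = list[DocumentChunk]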
Configuration
Parsing Settings
- Document Types – Supported document formats
  - Default – ["Text", "Office", "PDF"]
  - Notes – Multiple formats supported
- Extract Images – Extract and process images
  - Default – false
  - Notes – Increases processing time
- OCR – Apply optical character recognition
  - Default – false
  - Notes – For image-based content
- Chunk Size – Size of text chunks (see the chunking sketch after this list)
  - Default – 1000
  - Notes – Characters per chunk
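A minimal character-based chunker showing how Chunk Size interacts with the overlap value that appears in the Example Usage configs; this is an illustration, not the component's implementation.

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    # Each chunk starts (chunk_size - overlap) characters after the previous one,
    # so neighbouring chunks share `overlap` characters of context.
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("word " * 600)        # roughly 3000 characters of input
print(len(chunks), len(chunks[0]))        # 4 chunks, the first 1000 characters long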
Preprocessing Settings
- Remove Whitespace – Remove excessive whitespace (see the preprocessing sketch after this list)
  - Default – true
  - Notes – Preserves single spaces
- Lowercase – Convert text to lowercase
  - Default – true
  - Notes – Improves consistency
- Remove Punctuation – Remove punctuation marks
  - Default – false
  - Notes – Can be configured
- Remove Stop Words – Remove common stop words
  - Default – false
  - Notes – Language-dependent
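The sketch below shows one way the four preprocessing flags could be applied using only the standard library; the stop-word set is a tiny illustrative subset, since real stop-word removal is language-dependent.

import re
import string

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in"}  # illustrative subset

def preprocess(text: str, remove_whitespace: bool = True, lowercase: bool = True,
               remove_punctuation: bool = False, remove_stop_words: bool = False) -> str:
    if remove_whitespace:
        text = re.sub(r"\s+", " ", text).strip()   # collapse runs, preserve single spaces
    if lowercase:
        text = text.lower()
    if remove_punctuation:
        text = text.translate(str.maketrans("", "", string.punctuation))
    if remove_stop_words:
        text = " ".join(w for w in text.split() if w not in STOP_WORDS)
    return text

print(preprocess("  The   quick,  brown fox!  "))   # "the quick, brown fox!"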
Embedding Settings
- Model – Embedding model to use
  - Default – "sentence-transformers/all-MiniLM-L6-v2"
  - Notes – Various models available
- Dimensions – Vector dimensions
  - Default – 384
  - Notes – Model-dependent
- Batch Size – Number of texts to embed at once (see the embedding sketch after this list)
  - Default – 32
  - Notes – Affects memory usage
- Normalize – Normalize vector lengths
  - Default – true
  - Notes – Improves similarity calculations
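If the default model is served through the sentence-transformers package, the batch size and normalization settings map directly onto its encode call, as in this sketch (assumes sentence-transformers is installed).

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
texts = ["first chunk of text", "second chunk of text"]

vectors = model.encode(
    texts,
    batch_size=32,               # Batch Size: larger batches use more memory
    normalize_embeddings=True,   # Normalize: unit-length vectors suit cosine similarity
)
print(vectors.shape)             # (2, 384) – Dimensions is fixed by the chosen model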
Example Usage
Basic Document to Vector Pipeline
{
"parsing": { "documentTypes": ["Text", "Office", "PDF"], "extractImages": false, "ocr": false, "chunkSize": 1000, "overlap": 200 },
"preprocessing": { "removeWhitespace": true, "lowercase": true, "removePunctuation": false, "removeStopWords": false },
"embedding": { "model": "sentence-transformers/all-MiniLM-L6-v2", "dimensions": 384, "batchSize": 32, "normalize": true }
}
Advanced Multi-Format Pipeline
{
"parsing": { "documentTypes": ["Text", "Office", "PDF", "HTML", "Email"], "extractImages": true, "ocr": true, "chunkSize": 500, "overlap": 100, "tableExtraction": true },
"preprocessing": { "removeWhitespace": true, "lowercase": true, "removePunctuation": true, "removeStopWords": true, "language": "english", "stemming": false, "lemmatization": true },
"embedding": { "model": "openai/text-embedding-ada-002", "dimensions": 1536, "batchSize": 8, "normalize": true, "cache": true, "instruction": "Represent this document for retrieval:" }
}
Pipeline Optimization
- Adjust chunk size based on your retrieval needs
- Select preprocessing options that preserve important content
- Choose embedding models appropriate for your domain
- Balance processing quality with performance requirements
Performance Considerations
- Process documents in batches for large collections
- Disable resource-intensive features for faster processing
- Use smaller embedding models for speed, larger for accuracy
- Enable caching for repeated processing of similar content (a minimal caching sketch follows)
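One simple way to cache embeddings for repeated or near-duplicate content is to key them by a content hash, as in this sketch; it is not the component's built-in cache, and embed_fn stands in for whatever embedding call you use.

import hashlib

_cache: dict[str, list[float]] = {}

def embed_with_cache(text: str, embed_fn) -> list[float]:
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = embed_fn(text)   # compute embeddings only for unseen text
    return _cache[key]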
Troubleshooting
Processing Problems
- Parsing failures – Check document formats and integrity
- Missing content – Enable OCR for image-based text
- Poor embedding quality – Try different preprocessing options or embedding models
Performance Issues
- Slow processing – Adjust batch sizes or disable intensive features
- Memory errors – Reduce chunk size or batch size
- High resource usage – Use more efficient models or processing options