This component extracts and structures content from various document formats. It processes raw document files and outputs structured document objects with extracted text and metadata.
Use a Document Parser in a flow
To use a Document Parser in your workflow, follow these steps:
- Drag the Document Parser component from the Other section onto your canvas
- Connect document source inputs (file paths, URLs, etc.) to the input port
- Configure the parser settings based on your document types
- Connect the output ports to downstream processing nodes
Inputs
- Source – Raw document files or URLs to process
- Options – Optional configuration parameters
Outputs
- Documents – Structured document objects with metadata
- Text – Plain text extracted from documents
- Metadata – Document metadata (author, date, etc.)
Configuration
Document Types
The Document Parser supports various document formats:
- Text – .txt, .md, .csv
- Office – .docx, .xlsx, .pptx
- PDF – .pdf
- HTML – .html, .htm
- Email – .eml, .msg
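The format list above can be read as a lookup from file extension to document type. The following is a minimal sketch of that mapping, not the parser's actual implementation; the `DOCUMENT_TYPES` table and the `document_type_for` helper are hypothetical names introduced here for illustration.

```python
from pathlib import Path

# Hypothetical lookup table built from the format list above.
DOCUMENT_TYPES = {
    "Text": {".txt", ".md", ".csv"},
    "Office": {".docx", ".xlsx", ".pptx"},
    "PDF": {".pdf"},
    "HTML": {".html", ".htm"},
    "Email": {".eml", ".msg"},
}


def document_type_for(path):
    """Return the document type name for a path, or None if unsupported."""
    suffix = Path(path).suffix.lower()  # compare extensions case-insensitively
    for type_name, extensions in DOCUMENT_TYPES.items():
        if suffix in extensions:
            return type_name
    return None


print(document_type_for("reports/Q3.PDF"))  # PDF
print(document_type_for("archive.zip"))     # None
```

A check like this could run before files reach the parser, so that unsupported formats are reported early instead of failing during parsing.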
Parsing Options
- Extract Images – Extract and process images
- Default – False
- Notes – Increases processing time
- OCR – Apply optical character recognition
- Default – False
- Notes – For image-based content
- Table Extraction – Extract tables as structured data
- Default – True
- Notes – For documents with tabular data
- Header/Footer – Process headers and footers
- Default – True
- Notes – For documents with page elements
- Max File Size – Maximum file size to process
- Default – 50MB
- Notes – Limits for large files
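As an illustration of how the defaults listed above combine with user overrides, here is a rough sketch. The key names follow the JSON examples in this document where they exist; `maxFileSizeMB`, `resolve_options`, and `within_size_limit` are hypothetical names, and treating 50MB as 50 × 1024 × 1024 bytes is an assumption.

```python
# Defaults copied from the Parsing Options list.
# "maxFileSizeMB" is a hypothetical key name used only in this sketch.
DEFAULT_OPTIONS = {
    "extractImages": False,
    "ocr": False,
    "tableExtraction": True,
    "headerFooter": True,
    "maxFileSizeMB": 50,
}


def resolve_options(overrides=None):
    """Merge user overrides onto the defaults, rejecting unknown keys."""
    overrides = overrides or {}
    unknown = sorted(set(overrides) - set(DEFAULT_OPTIONS))
    if unknown:
        raise ValueError(f"unknown options: {unknown}")
    return {**DEFAULT_OPTIONS, **overrides}


def within_size_limit(size_bytes, options):
    """Check a file size against the limit (assumes 1 MB = 1024 * 1024 bytes)."""
    return size_bytes <= options["maxFileSizeMB"] * 1024 * 1024


options = resolve_options({"ocr": True})
print(options["ocr"], options["tableExtraction"])      # True True
print(within_size_limit(60 * 1024 * 1024, options))    # False
```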
Advanced Settings
- Chunk Size – Size of text chunks
- Default – 1000
- Notes – Characters per chunk
- Overlap – Overlap between chunks
- Default – 200
- Notes – Characters of overlap
- Metadata Filters – Filter which metadata to extract
- Default – All
- Notes – Customize metadata extraction
- Language Detection – Detect document language
- Default – True
- Notes – For multilingual processing
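To make the interaction between Chunk Size and Overlap concrete, the sketch below splits text with a simple sliding window in which each chunk starts `chunk_size - overlap` characters after the previous one. This is an assumption about how the two settings relate, not a description of the parser's actual algorithm; `chunk_text` is a hypothetical helper.

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into character chunks that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # distance between consecutive chunk starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```

With the defaults (1000 and 200), a 2,500-character text would yield three chunks under this scheme, starting at offsets 0, 800, and 1600.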
Example Usage
Basic Document Processing
{
"documentTypes": ["Text", "Office", "PDF"],
"extractImages": false,
"tableExtraction": true,
"chunkSize": 1000,
"overlap": 200
}
Comprehensive Document Analysis
{
"documentTypes": ["Text", "Office", "PDF", "HTML", "Email"],
"extractImages": true,
"ocr": true,
"tableExtraction": true,
"headerFooter": true,
"languageDetection": true,
"chunkSize": 500,
"overlap": 100,
"metadataFilters": ["author", "creationDate", "title", "subject", "keywords"]
}
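Before passing a configuration like the ones above to the component, it can help to sanity-check it. The following sketch is one possible check, under the assumption that overlap must be smaller than the chunk size; `validate_config` and `KNOWN_TYPES` are hypothetical names, and the fallback values mirror the defaults listed under Advanced Settings.

```python
import json

# Document type names as used in the examples above.
KNOWN_TYPES = {"Text", "Office", "PDF", "HTML", "Email"}


def validate_config(config):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    unknown = sorted(set(config.get("documentTypes", [])) - KNOWN_TYPES)
    if unknown:
        problems.append(f"unknown document types: {unknown}")

    chunk_size = config.get("chunkSize", 1000)
    overlap = config.get("overlap", 200)
    if chunk_size <= 0:
        problems.append("chunkSize must be positive")
    elif not 0 <= overlap < chunk_size:
        problems.append("overlap must be >= 0 and smaller than chunkSize")
    return problems


config = json.loads("""
{
  "documentTypes": ["Text", "Office", "PDF"],
  "extractImages": false,
  "tableExtraction": true,
  "chunkSize": 1000,
  "overlap": 200
}
""")
print(validate_config(config))  # []
```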
Best Practices
Document Preparation
- Ensure documents are in supported formats
- Check for corruption or password protection
- Consider pre-processing very large documents
Chunking Strategy
- Use smaller chunks for precise retrieval
- Use larger chunks for maintaining context
- Adjust overlap based on content complexity
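The trade-off between smaller and larger chunks can be estimated before processing. The sketch below assumes a sliding-window scheme in which each chunk starts `chunk_size - overlap` characters after the previous one; the parser's real behavior may differ, and `estimate_chunk_count` is a hypothetical helper.

```python
import math


def estimate_chunk_count(total_chars, chunk_size, overlap):
    """Estimate how many chunks a sliding window would produce."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    if total_chars <= 0:
        return 0
    if total_chars <= chunk_size:
        return 1
    step = chunk_size - overlap  # new characters contributed by each extra chunk
    return 1 + math.ceil((total_chars - chunk_size) / step)


# A 10,000-character document under the two example configurations.
print(estimate_chunk_count(10_000, 1000, 200))  # 13
print(estimate_chunk_count(10_000, 500, 100))   # 25
```

Under these assumptions, halving the chunk size roughly doubles the number of chunks that downstream nodes have to handle.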
Performance Considerations
- Disable unnecessary features for faster processing
- Enable OCR only when needed for image-based text
- Consider batch processing for large document collections
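For large collections, one simple approach to batch processing is to feed the component a fixed number of files at a time. The sketch below only shows the grouping step; `in_batches` is a hypothetical helper, and how each batch is submitted to the parser is left open.

```python
from itertools import islice


def in_batches(paths, batch_size):
    """Yield successive lists of at most `batch_size` items from `paths`."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(paths)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # input exhausted
        yield batch


files = ["a.pdf", "b.docx", "c.txt", "d.html", "e.eml"]
for batch in in_batches(files, 2):
    print(batch)
# ['a.pdf', 'b.docx']
# ['c.txt', 'd.html']
# ['e.eml']
```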
Troubleshooting
Parsing Problems
- Failed to parse document: Check document format and integrity
- Missing text: Content may be in images; enable OCR option
- Garbled text: Check document encoding
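When garbled text points to an encoding problem, a quick way to investigate a plain-text file is to try a short list of candidate encodings in order. This is a generic diagnostic sketch, not part of the Document Parser; `decode_with_fallback` and the candidate list are assumptions chosen for illustration.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252")):
    """Decode bytes with the first encoding that works.

    Returns (text, encoding_used). Falls back to latin-1, which accepts
    every byte value and therefore never raises.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return raw.decode("latin-1"), "latin-1"


print(decode_with_fallback("café".encode("utf-8")))   # ('café', 'utf-8')
print(decode_with_fallback("café".encode("cp1252")))  # ('café', 'cp1252')
```

A successful decode does not prove the guess is right, so the result is still worth checking by eye against the original document.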
Performance Issues
- Slow processing: Caused by large documents or by OCR being enabled; adjust chunk size or disable OCR
- Memory errors: Document too large; lower the max file size limit so oversized files are skipped, or pre-split documents
- Incomplete extraction: Complex document structure; try different parsing options
Technical Reference
For detailed technical information, refer to:
- Document Parser API Reference
- Supported File Formats
- Document Parser Source Code /../../aparavi-connectors/connectors/document-parser/parser.py