Document Parser

This component extracts and structures content from various document formats. It processes raw document files and outputs structured document objects with extracted text and metadata.

Use a Document Parser in a flow

To use a Document Parser in your workflow, follow these steps:

  1. Drag the Document Parser component from the Other section onto your canvas
  2. Connect document source inputs (file paths, URLs, etc.) to the input port
  3. Configure the parser settings based on your document types
  4. Connect the output ports to downstream processing nodes

Inputs

  • Source – Raw document files or URLs to process
  • Options – Optional configuration parameters

Outputs

  • Documents – Structured document objects with metadata
  • Text – Plain text extracted from documents
  • Metadata – Document metadata (author, date, etc.)

Configuration

Document Types

The Document Parser supports various document formats:

  • Text – .txt, .md, .csv
  • Office – .docx, .xlsx, .pptx
  • PDF – .pdf
  • HTML – .html, .htm
  • Email – .eml, .msg
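
As a rough illustration of how the format groups above could be used to route files before parsing, the sketch below maps a file extension to one of the listed document types. The `DOCUMENT_TYPES` table and the `classify_document` helper are hypothetical names introduced for this example; they are not part of the Document Parser itself.

```python
from pathlib import Path
from typing import Optional

# Extension groups copied from the list above.
# The dictionary and the function name are illustrative assumptions.
DOCUMENT_TYPES = {
    "Text": {".txt", ".md", ".csv"},
    "Office": {".docx", ".xlsx", ".pptx"},
    "PDF": {".pdf"},
    "HTML": {".html", ".htm"},
    "Email": {".eml", ".msg"},
}


def classify_document(path: str) -> Optional[str]:
    """Return the document type for a path, or None if the extension is unsupported."""
    suffix = Path(path).suffix.lower()  # compare case-insensitively
    for doc_type, extensions in DOCUMENT_TYPES.items():
        if suffix in extensions:
            return doc_type
    return None
```

A check like this only looks at the file name; it cannot tell whether the file content actually matches its extension.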

Parsing Options

  • Extract Images – Extract and process images
    • Default – False
    • Notes – Increases processing time
  • OCR – Apply optical character recognition
    • Default – False
    • Notes – For image-based content
  • Table Extraction – Extract tables as structured data
    • Default – True
    • Notes – For documents with tabular data
  • Header/Footer – Process headers and footers
    • Default – True
    • Notes – For documents with page elements
  • Max File Size – Maximum file size to process
    • Default – 50MB
    • Notes – Limits for large files

Advanced Settings

  • Chunk Size – Size of text chunks
    • Default – 1000
    • Notes – Characters per chunk
  • Overlap – Overlap between chunks
    • Default – 200
    • Notes – Characters of overlap
  • Metadata Filters – Filter which metadata to extract
    • Default – All
    • Notes – Customize metadata extraction
  • Language Detection – Detect document language
    • Default – True
    • Notes – For multilingual processing
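
To make the interaction between Chunk Size and Overlap concrete, here is a minimal sketch of character-based chunking with overlap. It assumes that each chunk starts `chunk_size - overlap` characters after the previous one; the component's actual splitting logic is not documented here and may differ (for example, by respecting word or sentence boundaries). The function name `chunk_text` is hypothetical.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters. Defaults mirror the
    documented defaults (1000 and 200); the algorithm is an assumption.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap  # distance between chunk start positions
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks
```

With the default settings, a 2,500-character document would produce three chunks under this scheme, starting at offsets 0, 800, and 1600.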

Example Usage

Basic Document Processing

This example shows how to configure the Document Parser for standard document processing:

{
  "documentTypes": ["Text", "Office", "PDF"],
  "extractImages": false,
  "tableExtraction": true,
  "chunkSize": 1000,
  "overlap": 200
}
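
The basic configuration leaves several options unset. Presumably those fall back to the defaults listed under Parsing Options and Advanced Settings. The sketch below shows one way to resolve a partial configuration against those defaults and to catch an overlap that is not smaller than the chunk size. The key names follow the examples in this document, except `maxFileSizeMB`, which is a hypothetical key; `resolve_config` is likewise an illustrative helper, not a documented function.

```python
import json

# Defaults taken from the Parsing Options and Advanced Settings lists.
# "maxFileSizeMB" is a hypothetical key name; the others appear in the examples.
DEFAULTS = {
    "extractImages": False,
    "ocr": False,
    "tableExtraction": True,
    "headerFooter": True,
    "maxFileSizeMB": 50,
    "chunkSize": 1000,
    "overlap": 200,
    "metadataFilters": "All",
    "languageDetection": True,
}


def resolve_config(user_config: dict) -> dict:
    """Overlay user settings on the defaults and run a basic sanity check."""
    config = {**DEFAULTS, **user_config}  # user values win over defaults
    if config["overlap"] >= config["chunkSize"]:
        raise ValueError("overlap must be smaller than chunkSize")
    return config


basic = json.loads("""
{
  "documentTypes": ["Text", "Office", "PDF"],
  "extractImages": false,
  "tableExtraction": true,
  "chunkSize": 1000,
  "overlap": 200
}
""")
resolved = resolve_config(basic)
```

Here `resolved` contains every option, with `ocr` left at its default of false and `headerFooter` at its default of true.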

Comprehensive Document Analysis

For detailed document analysis with maximum information extraction:

{
  "documentTypes": ["Text", "Office", "PDF", "HTML", "Email"],
  "extractImages": true,
  "ocr": true,
  "tableExtraction": true,
  "headerFooter": true,
  "languageDetection": true,
  "chunkSize": 500,
  "overlap": 100,
  "metadataFilters": ["author", "creationDate", "title", "subject", "keywords"]
}

Best Practices

Document Preparation

  • Ensure documents are in supported formats
  • Check for corruption or password protection
  • Consider pre-processing very large documents
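
Some of these checks can be automated before a file reaches the parser. The sketch below flags missing, empty, and oversized files. It assumes the 50MB default means 50 × 1024 × 1024 bytes, which this document does not specify, and `preflight_issues` is a hypothetical helper. Detecting corruption or password protection depends on the format and is not covered here.

```python
import os
from typing import List

# Assumption: the documented 50MB default is interpreted as 50 * 1024 * 1024 bytes.
MAX_FILE_SIZE_BYTES = 50 * 1024 * 1024


def preflight_issues(path: str, max_bytes: int = MAX_FILE_SIZE_BYTES) -> List[str]:
    """Return a list of problems found for a file; an empty list means none found."""
    if not os.path.isfile(path):
        return ["file not found"]

    issues = []
    size = os.path.getsize(path)
    if size == 0:
        issues.append("file is empty")
    if size > max_bytes:
        issues.append(f"file is {size} bytes, above the {max_bytes}-byte limit")
    return issues
```

Files that exceed the limit are candidates for splitting before they are sent to the parser.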

Chunking Strategy

  • Use smaller chunks for precise retrieval
  • Use larger chunks for maintaining context
  • Adjust overlap based on content complexity
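
The trade-off can be quantified with a little arithmetic. Assuming chunks advance by `chunk_size - overlap` characters (an assumption about the splitting scheme, not documented behavior), the following sketch estimates how many chunks a document of a given length produces. `estimate_chunk_count` is a hypothetical helper.

```python
import math


def estimate_chunk_count(text_length: int, chunk_size: int, overlap: int) -> int:
    """Estimate the number of chunks for a text of text_length characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if text_length <= 0:
        return 0
    if text_length <= chunk_size:
        return 1
    step = chunk_size - overlap  # characters of new content per additional chunk
    return 1 + math.ceil((text_length - chunk_size) / step)


# Compare the two configurations used in the Example Usage section.
for chunk_size, overlap in [(1000, 200), (500, 100)]:
    count = estimate_chunk_count(10_000, chunk_size, overlap)
    print(f"chunkSize={chunk_size}, overlap={overlap}: {count} chunks")
```

For a 10,000-character document, this gives 13 chunks at 1000/200 and 25 chunks at 500/100, so halving the chunk size roughly doubles the number of items that downstream nodes must handle.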

Performance Considerations

  • Disable unnecessary features for faster processing
  • Enable OCR only when needed for image-based text
  • Consider batch processing for large document collections
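
For large collections, one simple approach to batch processing is to feed the parser fixed-size groups of paths rather than the whole collection at once. The sketch below shows only the grouping step; `batched_paths` is a hypothetical helper, and how each batch is then handed to the Document Parser depends on your flow.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched_paths(paths: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most batch_size paths."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(paths)
    while True:
        batch = list(islice(iterator, batch_size))  # take the next group lazily
        if not batch:
            return
        yield batch
```

Because the function consumes its input lazily, it also works with generators that walk a directory tree, so the full list of paths never has to be held in memory.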

Troubleshooting

Parsing Problems

  • Failed to parse document: Check document format and integrity
  • Missing text: Content may be embedded in images; enable the OCR option
  • Garbled text: Check document encoding
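
For plain-text inputs, garbled output often means the bytes were decoded with the wrong encoding. One way to check is to try a short list of candidate encodings outside the parser and see which one decodes cleanly. The sketch below is an illustrative diagnostic, not the parser's own behavior; `decode_with_fallback` and the candidate list are assumptions.

```python
from typing import Sequence, Tuple


def decode_with_fallback(
    data: bytes,
    encodings: Sequence[str] = ("utf-8", "cp1252", "latin-1"),
) -> Tuple[str, str]:
    """Decode bytes with the first encoding that succeeds.

    Returns the decoded text and the name of the encoding that worked.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings could decode the data")
```

A clean decode does not prove the encoding is correct: latin-1 accepts any byte sequence, so it should stay last in the list and its result should be inspected by eye.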

Performance Issues

  • Slow processing: Usually caused by large documents or by OCR being enabled; adjust chunk size or disable OCR when it is not needed
  • Memory errors: Document too large; lower the Max File Size limit so oversized files are rejected, or pre-split documents
  • Incomplete extraction: Complex document structure; try different parsing options

Technical Reference

For detailed technical information, refer to: