This node enables policy-based classification of textual input according to predefined privacy, compliance, or national security categories. It supports multiple classification contexts, including country-specific policies.
Inputs
- Documents – Document objects to be classified
- Text – Raw or preprocessed text to be classified
Outputs
- Documents – Document objects with classification metadata added
- Categories – Classification results as category objects
- ClassificationContext – Metadata describing the classification environment or policy scope (example – U.S. Privacy, Germany Data Act)
- Classifications – Detected classification labels matching the input policies (example – SSN, Passport, Sensitive Data)
Configuration
GUI
- Select applicable classification policies
- Expand each country or domain section.
- Check individual policies you want applied to the input text.
- Examples
- United States > Social Security Number (SSN) and Taxpayer ID Policy
- Germany (no subcategories shown in screenshot)
- Review and confirm selections
Note
- All selected policies will be used as context for the classification engine.
- Ensure you only enable policies relevant to your regulatory or operational domain.
Classification Model
- Model Type – Type of classification model
- Default – “ml”
- Options – ml, rules, hybrid
- Predefined Model – Use a predefined classification model
- Default – null
- Notes – Available predefined models
- Custom Model Path – Path to custom model file
- Default – null
- Notes – For user-trained models
Classification Categories
- Categories – List of classification categories
- Default – []
- Notes – Required for rule-based classification
- Hierarchical – Enable hierarchical classification
- Default – false
- Notes – For nested category structures
- Multi-label – Allow multiple categories per document
- Default – false
- Notes – For non-exclusive categorization
Advanced Settings
- Confidence Threshold – Minimum confidence score
- Default – 0.7
- Notes – Higher values increase precision but can reduce recall
- Feature Extraction – Text feature extraction method
- Default – “tfidf”
- Options – tfidf, embeddings, custom
- Max Categories – Maximum categories per document
- Default – 3
- Notes – Only applies when multi-label is true
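The settings and defaults listed above can be summarized in a small Python sketch. This is illustrative only: the class name `ClassifierConfig`, the snake_case field names, and the assumption that the confidence threshold lies between 0 and 1 are not taken from the node's actual implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Option lists as documented above.
MODEL_TYPES = {"ml", "rules", "hybrid"}
FEATURE_METHODS = {"tfidf", "embeddings", "custom"}


@dataclass
class ClassifierConfig:
    """Hypothetical container mirroring the documented settings and defaults."""
    model_type: str = "ml"
    predefined_model: Optional[str] = None
    custom_model_path: Optional[str] = None
    categories: List[dict] = field(default_factory=list)
    hierarchical: bool = False
    multi_label: bool = False
    confidence_threshold: float = 0.7
    feature_extraction: str = "tfidf"
    max_categories: int = 3

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the settings look consistent."""
        problems = []
        if self.model_type not in MODEL_TYPES:
            problems.append(f"model_type must be one of {sorted(MODEL_TYPES)}")
        if self.feature_extraction not in FEATURE_METHODS:
            problems.append(f"feature_extraction must be one of {sorted(FEATURE_METHODS)}")
        # Assumption: confidence scores are expressed on a 0..1 scale.
        if not 0.0 <= self.confidence_threshold <= 1.0:
            problems.append("confidence_threshold must be between 0 and 1")
        # Documented: categories are required for rule-based classification.
        if self.model_type == "rules" and not self.categories:
            problems.append("categories is required for rule-based classification")
        if self.max_categories < 1:
            problems.append("max_categories must be at least 1")
        return problems
```

Calling `ClassifierConfig().validate()` on the defaults returns an empty list, while a `rules` configuration without categories reports the missing list.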
Example Usage
Basic Document Categorization
This example shows how to configure the Document Classification node for basic categorization:
{
"modelType": "ml",
"predefinedModel": "general-document-classifier",
"confidenceThreshold": 0.7,
"multiLabel": false
}
Custom Classification Rules
For rule-based classification with custom categories:
{
"modelType": "rules",
"categories": [
  { "name": "Financial", "rules": [
    {"type": "keyword", "terms": ["invoice", "payment", "transaction", "bank"]},
    {"type": "regex", "pattern": "\\$\\d+([.,]\\d{2})"}
  ] },
  { "name": "Legal", "rules": [
    {"type": "keyword", "terms": ["contract", "agreement", "legal", "law"]},
    {"type": "regex", "pattern": "section \\d+\\.\\d+"}
  ] },
  { "name": "Technical", "rules": [
    {"type": "keyword", "terms": ["software", "hardware", "system", "code"]},
    {"type": "regex", "pattern": "[a-zA-Z0-9_]+\\.[a-zA-Z0-9_]+\\("}
  ] }
],
"hierarchical": false,
"multiLabel": true,
"maxCategories": 2,
"confidenceThreshold": 0.6
}
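To make the rule configuration above more concrete, here is a minimal Python sketch of how keyword and regex rules could be evaluated. The documentation does not describe how the engine scores rules, so the scoring used here (the fraction of a category's rules that fire) is an assumption, as are the function names `match_rule` and `classify`. Keyword matching is a simple case-insensitive substring check.

```python
import re

# Same rules as the JSON configuration above, written as a Python dict.
CONFIG = {
    "modelType": "rules",
    "categories": [
        {"name": "Financial", "rules": [
            {"type": "keyword", "terms": ["invoice", "payment", "transaction", "bank"]},
            {"type": "regex", "pattern": r"\$\d+([.,]\d{2})"},
        ]},
        {"name": "Legal", "rules": [
            {"type": "keyword", "terms": ["contract", "agreement", "legal", "law"]},
            {"type": "regex", "pattern": r"section \d+\.\d+"},
        ]},
        {"name": "Technical", "rules": [
            {"type": "keyword", "terms": ["software", "hardware", "system", "code"]},
            {"type": "regex", "pattern": r"[a-zA-Z0-9_]+\.[a-zA-Z0-9_]+\("},
        ]},
    ],
    "hierarchical": False,
    "multiLabel": True,
    "maxCategories": 2,
    "confidenceThreshold": 0.6,
}


def match_rule(rule: dict, text: str) -> bool:
    """Return True if a single keyword or regex rule fires on the text."""
    if rule["type"] == "keyword":
        lowered = text.lower()
        return any(term.lower() in lowered for term in rule["terms"])
    if rule["type"] == "regex":
        return re.search(rule["pattern"], text) is not None
    raise ValueError(f"unknown rule type: {rule['type']}")


def classify(text: str, config: dict) -> list:
    """Score each category and return (name, score) pairs that pass the threshold."""
    scored = []
    for category in config["categories"]:
        rules = category["rules"]
        fired = sum(match_rule(rule, text) for rule in rules)
        # Assumed scoring: share of the category's rules that matched.
        score = fired / len(rules) if rules else 0.0
        if score >= config.get("confidenceThreshold", 0.7):
            scored.append((category["name"], score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    if not config.get("multiLabel", False):
        return scored[:1]
    return scored[: config.get("maxCategories", 3)]
```

Under this scoring, a text such as "Invoice total: $120.50, payment due per section 4.2 of the contract." triggers both rules of the Financial and Legal categories, and both are returned because `multiLabel` is true and `maxCategories` is 2.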
Policy-Based Classification Example
- Input Text – “John Doe, SSN: 123-45-6789, lives in California.”
- ClassificationContext – United States
- Classifications – SSN, Personal Data
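The example above can be approximated with a short Python sketch. The regular expression, the function name `detect_classifications`, the output keys, and the mapping from an SSN match to the "Personal Data" label are all assumptions made for illustration; the node's real policy engine may detect these values differently.

```python
import re

# Assumed pattern: three digits, two digits, four digits, separated by hyphens.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def detect_classifications(text: str, context: str) -> dict:
    """Return a result shaped like the example: a context plus matching labels."""
    labels = []
    # Only apply the U.S. SSN policy when the United States context is selected.
    if context == "United States" and SSN_PATTERN.search(text):
        labels.extend(["SSN", "Personal Data"])
    return {"classificationContext": context, "classifications": labels}


result = detect_classifications(
    "John Doe, SSN: 123-45-6789, lives in California.", "United States"
)
print(result["classifications"])
```

Text without a matching pattern, or text evaluated under a different context, yields an empty label list in this sketch.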
Best Practices
Model Selection
- Use predefined models for common classification tasks
- Train custom models for domain-specific categorization
- Consider rule-based classification for transparent, explainable results
- Use hybrid approach for complex classification needs
Performance Optimization
- Adjust confidence threshold based on precision vs. recall requirements
- Use hierarchical classification for large category sets
- Limit max categories in multi-label mode to improve precision
- Preprocess documents to remove noise that might affect classification
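The effect of the confidence threshold and the max categories limit can be illustrated with a small Python sketch. The function `select_labels` and the sample scores are hypothetical; they only show how raising the threshold or lowering the limit narrows the returned labels.

```python
def select_labels(scores: dict, threshold: float = 0.7,
                  multi_label: bool = False, max_categories: int = 3) -> list:
    """Keep labels at or above the threshold, best first, capped by the mode's limit."""
    kept = sorted(
        ((label, score) for label, score in scores.items() if score >= threshold),
        key=lambda pair: pair[1],
        reverse=True,
    )
    # Single-label mode returns at most one category.
    limit = max_categories if multi_label else 1
    return [label for label, _ in kept[:limit]]


# Hypothetical per-category confidence scores for one document.
scores = {"Financial": 0.82, "Legal": 0.65, "Technical": 0.40}

print(select_labels(scores, threshold=0.7, multi_label=True))  # stricter: fewer labels
print(select_labels(scores, threshold=0.6, multi_label=True))  # looser: more labels
print(select_labels(scores, threshold=0.9, multi_label=True))  # too strict: no labels
```

With these sample scores, a threshold of 0.9 returns nothing, which is the situation described under "No classifications" in the troubleshooting notes.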
Troubleshooting
Classification Problems
- Low accuracy – Train on more domain-specific data or adjust confidence threshold
- Misclassifications – Review and refine rules or model training data
- No classifications – Check if confidence threshold is too high
Performance Issues
- Slow processing – Use a simpler model or feature extraction method
- Memory errors – Process documents in smaller batches
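For the memory-error case, one way to process documents in smaller batches is a simple generator such as the sketch below. The helper name `batched` and the batch size are assumptions; how batches are handed to the node depends on your pipeline.

```python
from typing import Iterable, Iterator, List


def batched(items: Iterable, batch_size: int) -> Iterator[List]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Emit the final, possibly shorter, batch.
    if batch:
        yield batch


# Example: split five placeholder documents into batches of two.
documents = ["doc1", "doc2", "doc3", "doc4", "doc5"]
for group in batched(documents, 2):
    print(len(group), group)
```

Because the generator holds only one batch at a time, peak memory depends on the batch size instead of the total number of documents.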
Technical Reference
For detailed technical information, refer to:
- Document Classification API Reference
- Predefined Classification Models
- Classification Source Code – /../../../aparavi-connectors/connectors/classification/classifier.py