Document Processor is an AI-driven platform that automatically classifies your documents and extracts information from them. Upload documents, then analyze their content with custom classifiers and extractors.

Getting Started

1. Upload Your Documents

Use the Files List panel on the left to upload documents (PDF, Word, text files, etc.). Your uploaded files will appear in the list where you can select them for processing.

2. Create Classifiers

Go to the Classifiers tab to create document classification rules:

  • Classifier Sets: Groups of related classification categories
  • Categories: Individual classification types (e.g., "Invoice", "Contract", "Report")
  • Terms: Keywords and phrases that help identify each category
  • Weights & Distance: Fine-tune how strictly terms must match

Using Wildcards in Terms

Make your classifier terms more flexible with wildcards:

  • * (asterisk): Matches any word or number
    Example: "invoice *" matches "invoice 123", "invoice total", "invoice amount"
  • ? (question mark): Matches any single word
    Example: "contract ?" matches "contract date", "contract terms" but not "contract end date"
  • # (hash): Matches any number
    Example: "total #" matches "total 1500", "total 99.99" but not "total amount"
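
To make the wildcard rules above concrete, here is a minimal Python sketch that translates a term into a regular expression. The product's actual matching engine is not documented here, so the token patterns, the case-insensitive whole-phrase matching, and the helper names term_to_regex and term_matches are all assumptions, chosen only to reproduce the examples listed above.

```python
import re

# Assumed token patterns; the real matcher may tokenize differently.
WILDCARD_PATTERNS = {
    "*": r"\S+",               # any single word or number
    "?": r"[A-Za-z][\w'-]*",   # any single word (assumed to start with a letter)
    "#": r"\d+(?:\.\d+)?",     # any integer or decimal number
}


def term_to_regex(term: str) -> "re.Pattern[str]":
    """Translate a classifier term with wildcards into a compiled regex."""
    parts = [WILDCARD_PATTERNS.get(token, re.escape(token)) for token in term.split()]
    return re.compile(r"\s+".join(parts), re.IGNORECASE)


def term_matches(term: str, phrase: str) -> bool:
    """Return True when the whole phrase matches the term (hypothetical helper)."""
    return term_to_regex(term).fullmatch(phrase.strip()) is not None


print(term_matches("invoice *", "invoice 123"))         # True
print(term_matches("contract ?", "contract end date"))  # False
print(term_matches("total #", "total 99.99"))           # True
```

Matching the whole phrase is what makes "contract ?" reject "contract end date": the single-word wildcard cannot absorb two words.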

Run classifiers against your files to automatically categorize documents.

3. Build Extractors

Visit the Extractors tab to create information extraction tools:

  • Prompt: Describe what information you want to extract
  • Fields: Define specific data points to extract (names, dates, amounts, etc.)
  • Testing: Run extractors against selected files to pull out structured data

Extractors work well for pulling out key information such as invoice amounts, contract dates, or contact details.

Pro Tips

  • Start Simple: Begin with basic classifiers and extractors, then refine them based on results
  • Use Multiple Files: Test your tools against several documents to ensure accuracy
  • Iterate: Adjust terms, prompts, and fields based on the results you see
  • Collapse Results: Use the arrow buttons to hide result panels when editing
  • Bulk Operations: Select multiple files to process them all at once

Typical Workflow

  1. Upload a batch of documents you want to process
  2. Create a classifier to categorize document types
  3. Run the classifier to see how well it identifies your document categories
  4. Create extractors for each document type to pull out specific information
  5. Test and refine your extractors until they capture the data you need
  6. Process new documents using your trained classifiers and extractors

Need Help?

If you encounter any issues or have questions about using Document Processor, don't hesitate to reach out for support. The system is designed to learn and improve with your feedback.

API Credentials

Use these credentials to authenticate with the API endpoints. Your API secret should be kept secure.

API Reference

Below are the available classifiers and extractors with their IDs for use with the API endpoints.

API Documentation

Use the following endpoints with HTTP Basic Authentication using your API key as username and API secret as password:
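
For clients that do not have a built-in equivalent of curl's -u option, the Basic Authentication header can be built by hand. The sketch below uses only the Python standard library; the function name basic_auth_header is illustrative, and the credentials are the same placeholders used in the curl examples.

```python
import base64


def basic_auth_header(api_key: str, api_secret: str) -> dict:
    """Build the Authorization header for HTTP Basic Authentication."""
    # Basic auth is the base64 encoding of "username:password".
    credentials = f"{api_key}:{api_secret}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Placeholder credentials; substitute your own API key and secret.
headers = basic_auth_header("your-api-key", "your-api-secret")
print(headers["Authorization"])
```

Because base64 is an encoding and not encryption, the header should only ever be sent over HTTPS.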

📋 Quick Start Workflow

Choose your extraction approach:

  • 🔄 Synchronous (GET endpoint): Best for testing, debugging, or simple integrations where you can wait for results
  • ⚡ Asynchronous (POST endpoint): Best for production systems, batch processing, or when handling multiple documents

Typical workflow:

  1. List available extractors/classifiers → get IDs to use
  2. Upload a document → get file_id
  3. Run extraction (sync or async) → get extraction results
  4. Download marked PDF (optional) → get highlighted document
  5. Clean up files when done (optional)
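
The five steps above can be sketched as one Python function. To keep the example independent of any particular HTTP client, it takes a caller-supplied send(method, path, payload) callable, which is a hypothetical stand-in for your own transport. It also assumes the upload response carries the new ID under an "id" key, by analogy with the Markdown upload response; check the real response before relying on that.

```python
from typing import Any, Callable, Optional

# Hypothetical transport: takes (method, path, payload), returns the decoded response.
Transport = Callable[[str, str, Optional[dict]], Any]


def run_workflow(send: Transport, extractor_name: str, upload: dict) -> dict:
    """Walk through the five workflow steps using a caller-supplied transport."""
    # 1. List extractors and pick one by name.
    extractors = send("GET", "/service/extractors", None)
    extractor_id = next(e["id"] for e in extractors if e["name"] == extractor_name)

    # 2. Upload the document (assumed: the response holds the ID under "id").
    file_id = send("POST", "/service/file", upload)["id"]

    # 3. Run a synchronous extraction.
    result = send("GET", f"/service/extractor/{extractor_id}/{file_id}", None)

    # 4. Download the marked PDF only if one was generated.
    marked_pdf = None
    if result.get("marked_pdf_available"):
        marked_pdf = send("GET", f"/service/marked-pdf/{extractor_id}/{file_id}", None)

    # 5. Clean up the uploaded file.
    send("DELETE", f"/service/file/{file_id}", None)
    return {"result": result, "marked_pdf": marked_pdf}
```

Injecting the transport also makes the sequence easy to exercise against a fake before pointing it at the real service.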

File Upload

POST /service/file

Upload a document file for processing. Returns the file ID, which the other endpoints take as file_id.

curl -X POST -u "your-api-key:your-api-secret" \
  -F "file=@document.pdf" \
  service/file
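
The -F "file=@document.pdf" option makes curl send a multipart/form-data body with a single part named file. If you need to build that body yourself, a standard-library sketch might look like the following. The helper name encode_multipart and the application/octet-stream part type are assumptions; the service may also accept or expect a more specific content type.

```python
import uuid
from typing import Tuple


def encode_multipart(field: str, filename: str, data: bytes) -> Tuple[bytes, str]:
    """Return (body, content_type) for a one-file multipart/form-data upload."""
    boundary = uuid.uuid4().hex  # random separator unlikely to occur in the file
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


# Stand-in bytes; in practice, read them from the file you want to upload.
body, content_type = encode_multipart("file", "document.pdf", b"%PDF-1.4 example")
print(content_type)
```

The returned body and content type can then be sent as the payload and Content-Type header of the POST request.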

Markdown Upload

PUT /service/file/markdown

Upload Markdown content as a document. The first line of the content is used as the filename.

Request Parameters:
  • content (required): The Markdown content to upload
Usage Notes:
  • First line becomes the filename (markdown formatting like # is stripped)
  • Filename is automatically sanitized and given .md extension
  • Content is processed through the same extraction pipeline as file uploads
  • Returns document ID and generated filename
  • Compatible with SmolPageBot Chrome extension for easy web page capture
Example Request:
curl -X PUT -u "your-api-key:your-api-secret" \
  -H "Content-Type: application/json" \
  -d '{
    "content": "# My Document Title\n\nThis is the content of my markdown document.\n\n## Section 1\n\nSome more content here."
  }' \
  service/file/markdown
Example Response:
{
  "id": 124,
  "filename": "My-Document-Title.md"
}
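
As a rough illustration of the filename rule, the sketch below predicts the generated name from the first line of the content and builds the JSON request body. The exact sanitization rules are not documented; the steps here (strip heading markers, drop punctuation, turn spaces into hyphens) are assumptions that merely reproduce the example response above, and both function names are hypothetical.

```python
import json
import re


def predicted_filename(content: str) -> str:
    """Guess the filename the service derives from the first content line."""
    first_line = content.splitlines()[0] if content else ""
    title = first_line.lstrip("#").strip()   # strip Markdown heading markers
    title = re.sub(r"[^\w\s-]", "", title)   # drop punctuation (assumed rule)
    title = re.sub(r"\s+", "-", title)       # spaces become hyphens (assumed rule)
    return f"{title}.md"


def markdown_request_body(content: str) -> bytes:
    """Encode the JSON body for PUT /service/file/markdown."""
    return json.dumps({"content": content}).encode("utf-8")


content = "# My Document Title\n\nThis is the content of my markdown document."
print(predicted_filename(content))  # My-Document-Title.md
```

Treat the filename in the actual response as authoritative; the prediction is only useful for rough checks.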

List Available Classifiers

GET /service/classifiers

Get a list of all classifier sets available to your account. Use the returned IDs with the classification endpoint.

Response Format:
[
  {
    "id": 1,
    "name": "Document Type Classifier"
  },
  {
    "id": 2,
    "name": "Invoice vs Receipt Classifier"
  }
]
Example Request:
curl -u "your-api-key:your-api-secret" \
  service/classifiers

List Available Extractors

GET /service/extractors

Get a list of all extractors available to your account. Use the returned IDs with the extraction endpoints.

Response Format:
[
  {
    "id": 1,
    "name": "Invoice Data Extractor"
  },
  {
    "id": 2,
    "name": "Contract Terms Extractor"
  },
  {
    "id": 3,
    "name": "Contact Information Extractor"
  }
]
Example Request:
curl -u "your-api-key:your-api-secret" \
  service/extractors

Classification

GET /service/classifier/{classifier_id}/{file_id}

Classify a document using the specified classifier. Use the classifiers endpoint above to get available classifier IDs.

curl -u "your-api-key:your-api-secret" \
  service/classifier/1/123

Extraction (Asynchronous)

POST /service/extractor

Extract information from a document using the specified extractor. This operation runs asynchronously in the background and returns results via webhook.

Request Parameters:
  • extractor_id (required): The ID of the extractor to use (get from extractors endpoint above)
  • file_id (required): The ID of the uploaded document file
  • web_hook (required): URL where extraction results will be sent via POST
  • csrf_token (optional): CSRF protection token for webhook validation
Webhook Details:

When extraction completes, results are sent to your webhook URL via POST request:

  • The webhook receives the extracted data as JSON
  • Include a CSRF token to verify webhook authenticity
  • Ensure your webhook endpoint can handle POST requests
  • Webhook should return 200 status for successful receipt
CSRF Token Usage:

The CSRF token provides security verification for webhook calls:

  • Generate a unique token for each extraction request
  • Store the token to validate incoming webhook calls
  • The system will include this token in the webhook payload
  • Verify the token matches before processing webhook data
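
A minimal sketch of the token lifecycle described above, using Python's standard library. The in-memory dictionary and the function names issue_token and verify_token are illustrative assumptions; a production system would more likely persist tokens in a database and expire them.

```python
import hmac
import secrets

# In-memory store keyed by file ID (assumption; use durable storage in practice).
pending_tokens: dict = {}


def issue_token(file_id: int) -> str:
    """Generate and remember a unique token for one extraction request."""
    token = secrets.token_urlsafe(32)
    pending_tokens[file_id] = token
    return token


def verify_token(file_id: int, received: str) -> bool:
    """Check a webhook's token against the stored one in constant time."""
    expected = pending_tokens.get(file_id)
    if expected is None or not isinstance(received, str):
        return False
    return hmac.compare_digest(expected.encode("utf-8"), received.encode("utf-8"))
```

Using hmac.compare_digest avoids leaking how many leading characters of a guessed token were correct through timing differences.
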
Marked PDF Generation:

When extraction finds data with citations, the system automatically creates a marked-up PDF:

  • Citations from extracted fields are highlighted in yellow
  • Non-PDF documents are automatically converted to PDF first
  • Each extractor creates its own marked version
  • Webhook payload includes marked_pdf_available flag
  • Use marked PDF endpoints to download highlighted versions
Example Request:
curl -X POST -u "your-api-key:your-api-secret" \
  -H "Content-Type: application/json" \
  -d '{
    "extractor_id": 1, 
    "file_id": 123, 
    "web_hook": "https://your-domain.com/api/extraction-webhook", 
    "csrf_token": "abc123-secure-token-xyz789"
  }' \
  service/extractor
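
The same request body can be assembled programmatically. In this sketch the function name async_extraction_body and the client-side check that web_hook is an http(s) URL are assumptions added for illustration; the service may apply its own validation.

```python
import json
from typing import Optional
from urllib.parse import urlparse


def async_extraction_body(extractor_id: int, file_id: int, web_hook: str,
                          csrf_token: Optional[str] = None) -> bytes:
    """Build the JSON body for POST /service/extractor."""
    # Client-side sanity check (assumption, not a documented server rule).
    if urlparse(web_hook).scheme not in ("http", "https"):
        raise ValueError("web_hook must be an http(s) URL")
    payload = {"extractor_id": extractor_id, "file_id": file_id, "web_hook": web_hook}
    if csrf_token is not None:
        payload["csrf_token"] = csrf_token  # optional parameter
    return json.dumps(payload).encode("utf-8")


body = async_extraction_body(
    1, 123, "https://your-domain.com/api/extraction-webhook",
    csrf_token="abc123-secure-token-xyz789",
)
print(body.decode("utf-8"))
```
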
Example Webhook Payload:

Your webhook will receive a POST request similar to:

{
  "result": {
    "confidence": 1,
    "found": true,
    "explanation": "Explanation ...",
    "extracted_data": {
      "field_1": {
        "value": "value 1",
        "citation": ["Quote from document supporting field_1"]
      },
      "field_2": {
        "value": "value 2", 
        "citation": ["Quote 1", "Quote 2"]
      }
    }
  },
  "file_name": "document.pdf",
  "document_id": 123,
  "csrf_token": "abc123-secure-token-xyz789",
  "marked_pdf_available": true,
  "marked_pdf_path": "/path/to/document.marked.1.pdf"
}
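
A webhook receiver typically validates the token first and only then reads the extracted fields. The following sketch handles a payload shaped like the example above; the function name parse_webhook and the choice to return a flat field-to-value mapping are assumptions, and wiring it into a web framework is left out.

```python
import hmac
import json


def parse_webhook(body: bytes, expected_token: str) -> dict:
    """Validate csrf_token, then flatten extracted_data into {field: value}."""
    payload = json.loads(body)
    received = str(payload.get("csrf_token", ""))
    # Constant-time comparison of the received and expected tokens.
    if not hmac.compare_digest(received.encode("utf-8"), expected_token.encode("utf-8")):
        raise PermissionError("csrf_token mismatch; ignoring webhook")
    result = payload["result"]
    if not result.get("found"):
        return {}
    return {name: field["value"] for name, field in result["extracted_data"].items()}
```

Citations are dropped here for brevity; keep field["citation"] if you need to show supporting quotes.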

Extraction (Synchronous)

GET /service/extractor/{extractor_id}/{document_id}

Extract information from a document using the specified extractor. This operation runs synchronously and returns results immediately in the response.

Path Parameters:
  • extractor_id (required): The ID of the extractor to use (get from extractors endpoint above)
  • document_id (required): The ID of the uploaded document file
When to Use:
  • Simple integrations that need immediate results
  • Interactive applications where users wait for results
  • Testing and debugging extractions
  • When webhook setup is not feasible
Comparison with Asynchronous Extraction:
Aspect     | Asynchronous (POST)          | Synchronous (GET)
Response   | {"status": "started"}        | Complete results immediately
Webhook    | Required                     | Not needed
Use Case   | High volume, fire-and-forget | Interactive, immediate results
PDF Markup | ✅ Generated                 | ✅ Generated
Example Request:
curl -u "your-api-key:your-api-secret" \
  service/extractor/1/123
Example Response:
{
  "extractor_id": 1,
  "document_id": 123,
  "file_name": "document.pdf",
  "extraction_result": {
    "confidence": 1,
    "found": true,
    "explanation": "Successfully extracted the following information...",
    "extracted_data": {
      "field_1": {
        "value": "extracted value 1",
        "citation": ["Quote from document supporting field_1"]
      },
      "field_2": {
        "value": "extracted value 2",
        "citation": ["Quote 1", "Quote 2"]
      }
    }
  },
  "marked_pdf_available": true,
  "marked_pdf_path": "/path/to/document.marked.1.pdf",
  "success": true
}
Error Responses:
  • 404: Extractor or document not found
  • 400: Document has no content available for extraction
  • 500: Extraction process failed
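
One way to surface these errors in client code is to map the status codes to the meanings listed above. The sketch below uses Python's urllib; base_url and auth_header are caller-supplied values (the base URL is not given in this document), and the function names are illustrative.

```python
import json
import urllib.error
import urllib.request

# Status codes and meanings as listed above.
ERROR_MEANINGS = {
    404: "Extractor or document not found",
    400: "Document has no content available for extraction",
    500: "Extraction process failed",
}


def describe_error(status: int) -> str:
    """Translate an HTTP status into the documented meaning, if any."""
    return ERROR_MEANINGS.get(status, f"Unexpected HTTP status {status}")


def extract_sync(base_url: str, auth_header: str, extractor_id: int, document_id: int) -> dict:
    """Call the synchronous endpoint and raise a readable error on failure."""
    url = f"{base_url}/service/extractor/{extractor_id}/{document_id}"
    request = urllib.request.Request(url, headers={"Authorization": auth_header})
    try:
        with urllib.request.urlopen(request) as response:
            return json.load(response)
    except urllib.error.HTTPError as err:
        raise RuntimeError(describe_error(err.code)) from err
```
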

Marked PDF Download

GET /service/marked-pdf/{extractor_id}/{file_id}

Download a marked-up PDF with highlighted citations from an extraction. The PDF contains visual highlights showing where extracted information was found in the original document.

Usage:
  • Only available after successful extraction with citations
  • Automatically converts non-PDF documents to PDF before marking
  • Returns 404 if no marked version exists
  • Each extractor creates its own marked version
Example:
curl -u "your-api-key:your-api-secret" \
  --output "marked_document.pdf" \
  service/marked-pdf/1/123

Marked PDF Status

GET /service/marked-pdf-status/{file_id}

Check which marked PDF versions are available for a document across all your extractors.

Response Format:
{
  "file_id": 123,
  "pdf_available": true,
  "marked_versions": [
    {
      "extractor_id": 1,
      "extractor_name": "Invoice Extractor",
      "file_size": 245760
    },
    {
      "extractor_id": 2, 
      "extractor_name": "Contract Extractor",
      "file_size": 238950
    }
  ],
  "total_marked_versions": 2
}
Example:
curl -u "your-api-key:your-api-secret" \
  service/marked-pdf-status/123
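
The status response pairs naturally with the download endpoint. This small sketch, with a hypothetical helper name, turns a decoded status response into the list of download paths for each marked version.

```python
def marked_pdf_paths(status: dict) -> list:
    """Build download paths for every marked version in a status response."""
    file_id = status["file_id"]
    return [
        f"/service/marked-pdf/{version['extractor_id']}/{file_id}"
        for version in status.get("marked_versions", [])
    ]


status = {
    "file_id": 123,
    "pdf_available": True,
    "marked_versions": [
        {"extractor_id": 1, "extractor_name": "Invoice Extractor", "file_size": 245760},
        {"extractor_id": 2, "extractor_name": "Contract Extractor", "file_size": 238950},
    ],
    "total_marked_versions": 2,
}
print(marked_pdf_paths(status))
# ['/service/marked-pdf/1/123', '/service/marked-pdf/2/123']
```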

File Cleanup

DELETE /service/file/{file_id}

Remove an uploaded file from the system. This also deletes any associated marked PDF versions.

curl -X DELETE -u "your-api-key:your-api-secret" \
  service/file/123