
Ingestion & Chunking

When you upload a document, Airbeeps runs an ingestion pipeline that extracts content, splits it into chunks, and stores vectors for retrieval.

Pipeline overview

┌──────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐
│  Upload  │───▶│  Parse  │───▶│  Chunk  │───▶│  Embed  │───▶│ Upsert  │
└──────────┘    └─────────┘    └─────────┘    └─────────┘    └─────────┘
                   15%             20%            55%            10%

Each stage reports progress and can be tracked via the admin UI or API. Ingestion runs through a job queue that supports cancellation.

Content extraction

Airbeeps uses specialized extractors for each file type:

| File type | Extractor | Notes |
|-----------|-----------|-------|
| PDF | PyMuPDF | Page-level extraction with page-range support |
| DOCX | markitdown | Paragraphs and tables |
| PPTX | markitdown | Presentation slides |
| TXT/MD | Direct read | UTF-8 encoding |
| Excel/CSV | pandas | Row-based processing |

PDF processing

PDFs are processed page by page. The extractor:

  1. Extracts raw text from each page
  2. Tracks page numbers in chunk metadata
  3. Handles multi-column layouts (best effort)
  4. Supports page-range selection (e.g., `1-5,8,10-12`)
  5. Can truncate to a max page count
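
A page-range spec like `1-5,8,10-12` can be expanded into concrete page numbers with a small helper. This is a sketch; the function name is hypothetical and not part of the Airbeeps API:

```python
def parse_page_range(spec: str) -> list[int]:
    """Expand a page-range spec like '1-5,8,10-12' into a sorted page list."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-", 1)
            pages.update(range(int(start), int(end) + 1))
        else:
            pages.add(int(part))
    return sorted(pages)

parse_page_range("1-5,8,10-12")  # → [1, 2, 3, 4, 5, 8, 10, 11, 12]
```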

INFO

Scanned PDFs without embedded text are not currently supported. Use OCR preprocessing if needed.

Chunking strategies

Airbeeps uses LlamaIndex node parsers for chunking. Three strategies are available, controlled by feature flags in the configuration.

Hierarchical chunking (default)

Creates a parent-child-leaf structure with multiple chunk sizes:

Parent chunk (2048 tokens)
├── Child chunk (512 tokens)
│   ├── Leaf chunk (128 tokens)
│   └── Leaf chunk (128 tokens)
└── Child chunk (512 tokens)
    └── ...

This enables auto-merging during retrieval — when multiple child chunks from the same parent are retrieved, they can be merged back for better context.

```yaml
# Default hierarchical sizes (configurable)
RAG_HIERARCHICAL_CHUNK_SIZES: [2048, 512, 128]
```

Enable/disable: `AIRBEEPS_RAG_ENABLE_HIERARCHICAL=true`
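
The parent/child/leaf structure can be illustrated with a simplified recursive splitter. This is a toy sketch using word counts instead of tokens (Airbeeps itself uses LlamaIndex node parsers for this):

```python
def hierarchical_chunks(words: list[str], sizes: list[int]) -> list[dict]:
    """Recursively split a word list into nested chunks of the given sizes."""
    size, *rest = sizes
    nodes = []
    for i in range(0, len(words), size):
        piece = words[i:i + size]
        node = {"text": " ".join(piece)}
        if rest:
            # Each chunk is further split into smaller child chunks.
            node["children"] = hierarchical_chunks(piece, rest)
        nodes.append(node)
    return nodes

words = [f"w{i}" for i in range(32)]
tree = hierarchical_chunks(words, [16, 4])  # toy sizes standing in for [2048, 512, 128]
```

Each leaf keeps a path back to its parent, which is what makes auto-merging at retrieval time possible.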

Semantic chunking

Splits documents based on embedding similarity rather than fixed sizes. Uses SemanticSplitterNodeParser:

  1. Computes embeddings for sentence groups
  2. Identifies semantic breakpoints where meaning shifts
  3. Splits at those boundaries

```yaml
# Configuration
RAG_SEMANTIC_BREAKPOINT_THRESHOLD: 95 # Percentile for semantic splits
RAG_SEMANTIC_BUFFER_SIZE: 1 # Sentences to buffer for context
```

Enable/disable: `AIRBEEPS_RAG_ENABLE_SEMANTIC_CHUNKING=true`
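
The breakpoint idea can be sketched in a few lines, assuming sentence embeddings are already computed. This is an illustration of the percentile threshold, not the `SemanticSplitterNodeParser` internals:

```python
import math

def _norm(v):
    return math.sqrt(sum(x * x for x in v))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (_norm(a) * _norm(b))

def semantic_breakpoints(embeddings, breakpoint_percentile=95):
    """Indices i where a split falls between sentence i and sentence i+1."""
    dists = [cosine_distance(embeddings[i], embeddings[i + 1])
             for i in range(len(embeddings) - 1)]
    # Cut at the configured percentile of neighbor distances:
    # only the most abrupt shifts in meaning become boundaries.
    cutoff = sorted(dists)[min(len(dists) - 1,
                               int(len(dists) * breakpoint_percentile / 100))]
    return [i for i, d in enumerate(dists) if d >= cutoff]
```

With toy 2-d "embeddings" where sentences 0-1 point one way and 2-3 another, the single breakpoint lands between sentences 1 and 2.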

TIP

Semantic chunking requires an embedding model. If none is available, it falls back to sentence-based chunking automatically.

Sentence-based chunking (fallback)

Uses SentenceSplitter with configurable chunk size and overlap. This is the fallback when hierarchical and semantic chunking are both disabled or unavailable.

Code block preservation

For documents containing fenced code blocks (```), Airbeeps can preserve them as atomic units — code blocks are never split mid-block. Surrounding text is chunked normally.
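
A minimal sketch of how such preservation can work: scan for fenced spans, emit them as atomic segments, and hand only the prose in between to the normal chunker. The function name is hypothetical:

```python
import re

# A fenced code block: three backticks, anything (non-greedy), three backticks.
FENCE = re.compile(r"`{3}.*?`{3}", re.DOTALL)

def split_preserving_code(text: str) -> list[tuple[bool, str]]:
    """Split text into (is_code, segment) pairs; fenced blocks stay atomic."""
    parts, pos = [], 0
    for m in FENCE.finditer(text):
        if m.start() > pos:
            parts.append((False, text[pos:m.start()]))  # prose before the fence
        parts.append((True, m.group(0)))                # the fence, untouched
        pos = m.end()
    if pos < len(text):
        parts.append((False, text[pos:]))               # trailing prose
    return parts
```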

Row-wise chunking (Excel/CSV)

Tabular files get special treatment:

  • Each row becomes a separate chunk
  • Column headers are prepended: Column: Value
  • Empty cells are skipped
  • Ingestion profiles can customize column selection and text templates

This preserves the structure and enables per-row citations.
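
The row-to-chunk mapping can be sketched with the stdlib `csv` module (the actual extractor is pandas, and the `"; "` separator between `Column: Value` pairs here is an assumption):

```python
import csv
import io

def rows_to_chunks(csv_text: str) -> list[dict]:
    """One chunk per row; headers prepended as 'Column: Value'; empty cells skipped."""
    reader = csv.DictReader(io.StringIO(csv_text))
    chunks = []
    for row_number, row in enumerate(reader, start=1):
        parts = [f"{col}: {val}" for col, val in row.items() if val]
        chunks.append({"row_number": row_number, "text": "; ".join(parts)})
    return chunks

rows_to_chunks("name,city,notes\nAda,London,\nAlan,,Enigma")
```

Keeping the row number in each chunk is what makes per-row citations possible later.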

Chunk metadata

Every chunk includes rich metadata stored in the vector store:

```json
{
  "chunk_id": "uuid",
  "document_id": "uuid",
  "knowledge_base_id": "uuid",
  "chunk_index": 0,
  "title": "Document title",
  "file_path": "files/abc123.pdf",
  "file_type": "pdf",
  "embedding_model_id": "uuid",
  "embedding_model_name": "text-embedding-3-small"
}
```

For hierarchical chunks, parent-child relationships are tracked via LlamaIndex node relationships.

For Excel files, additional fields:

```json
{
  "sheet": "Sheet1",
  "row_number": 42,
  "original_filename": "data.xlsx"
}
```

For PDFs, page numbers are tracked:

```json
{
  "page_number": 5
}
```

Token counting

Airbeeps uses tiktoken with the cl100k_base encoding (GPT-4 tokenizer) to count tokens consistently. This ensures chunk sizes work well with most modern LLMs.

Ingestion profiles

For advanced tabular processing, you can define ingestion profiles that control:

  • Which columns to include
  • Custom text templates per row
  • Metadata extraction rules
  • Preprocessing options

Profiles are configured via the API and applied during ingestion. They are associated with a knowledge base and can be matched by file type.
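
Conceptually, applying a profile to a row might look like the sketch below. The field names `columns` and `template` are hypothetical, chosen for illustration; the real profile schema is defined by the API:

```python
def apply_profile(row: dict, profile: dict) -> str:
    """Render one row as chunk text: select columns, fill the template."""
    selected = {col: row.get(col, "") for col in profile["columns"]}
    return profile["template"].format(**selected)

profile = {"columns": ["name", "city"],
           "template": "{name} is based in {city}."}
apply_profile({"name": "Ada", "city": "London", "age": 36}, profile)
# → "Ada is based in London."
```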

Monitoring ingestion

Document ingestion runs through a job queue with progress tracking. Each ingestion job has:

  • Stage: PARSING, CHUNKING, EMBEDDING, UPSERTING
  • Progress: Percentage within the current stage
  • Events: Detailed log of what happened at each step

Job status values

| Status | Meaning |
|--------|---------|
| QUEUED | Waiting to be processed |
| INDEXING | Currently processing |
| ACTIVE | Successfully indexed |
| FAILED | Indexing failed (check logs) |
| CANCELLING | Cancel requested |
| CANCELLED | Job was cancelled |
| DELETED | Soft-deleted |

Check the admin UI or API for failed documents and review backend logs for details.

Data cleaning

Airbeeps applies optional cleaners during ingestion:

  • Unicode normalization
  • Whitespace collapse
  • HTML tag stripping (if present)

Enable via the clean_data flag during upload.
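
A rough approximation of these three cleaners, assuming NFKC normalization and regex-based tag stripping (the source does not specify the exact implementation):

```python
import re
import unicodedata

def clean_text(text: str) -> str:
    """Sketch of the optional cleaners: normalize, strip tags, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)  # e.g. NBSP -> plain space
    text = re.sub(r"<[^>]+>", " ", text)        # naive HTML tag stripping
    return re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace

clean_text("<p>Hello\u00a0  world</p>")  # → "Hello world"
```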
