
Ingestion & Chunking

When you upload a document, Airbeeps runs an ingestion pipeline that extracts content, splits it into chunks, embeds each chunk, and stores the vectors in ChromaDB for retrieval.

Pipeline overview

┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐
│ Upload  │───▶│ Extract │───▶│  Chunk  │───▶│  Embed  │───▶ ChromaDB
└─────────┘    └─────────┘    └─────────┘    └─────────┘
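Conceptually, the stages chain together as below. This is an illustrative sketch, not Airbeeps' actual code: extract and chunk are trivial placeholders, and embedding is delegated to ChromaDB's default embedding function rather than Airbeeps' configured embedding model.

python
# Illustrative sketch only -- extract/chunk are placeholders, and Chroma's
# default embedding function stands in for the configured embedding model
import chromadb

def extract(path: str) -> str:
    return open(path, encoding="utf-8").read()

def chunk(text: str) -> list[str]:
    # Naive fixed-size chunks with overlap; the real splitter is recursive
    return [text[i:i + 1000] for i in range(0, len(text), 800)]

client = chromadb.Client()
collection = client.create_collection("docs")
chunks = chunk(extract("notes.txt"))
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,  # Chroma embeds these itself when no vectors are passed
)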

Content extraction

Airbeeps uses specialized extractors for each file type:

File type | Extractor   | Notes
----------|-------------|---------------------------------------
PDF       | PyMuPDF     | Extracts text, preserves layout hints
DOCX      | python-docx | Paragraphs and tables
TXT/MD    | Direct read | UTF-8 encoding
Excel/CSV | pandas      | Row-based processing
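
A dispatcher along these lines (simplified; not the actual Airbeeps code) routes each file to its extractor. Excel/CSV files take the row-wise path described further below.

python
# Simplified extractor dispatch by file extension -- illustrative only
import pathlib

def extract_text(path: str) -> str:
    suffix = pathlib.Path(path).suffix.lower()
    if suffix == ".pdf":
        import fitz  # PyMuPDF
        return "\n\n".join(page.get_text() for page in fitz.open(path))
    if suffix == ".docx":
        import docx  # python-docx
        return "\n\n".join(p.text for p in docx.Document(path).paragraphs)
    if suffix in (".txt", ".md"):
        return pathlib.Path(path).read_text(encoding="utf-8")
    raise ValueError(f"Unsupported file type: {suffix}")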

PDF processing

PDFs are processed page by page. The extractor:

  1. Extracts raw text from each page
  2. Preserves paragraph boundaries where possible
  3. Handles multi-column layouts (best effort)
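
One way to approximate these steps with PyMuPDF is block-level extraction, which keeps paragraph boundaries and allows a best-effort reading order for multi-column pages. Whether Airbeeps does exactly this is an internal detail; treat the sketch as illustrative.

python
# Page-by-page extraction via PyMuPDF text blocks -- illustrative sketch
import fitz  # PyMuPDF

def extract_pdf(path: str) -> str:
    pages = []
    with fitz.open(path) as doc:
        for page in doc:
            # Each block is (x0, y0, x1, y1, text, block_no, block_type);
            # sorting top-to-bottom, left-to-right is the best-effort part
            blocks = sorted(page.get_text("blocks"), key=lambda b: (b[1], b[0]))
            pages.append("\n\n".join(b[4].strip() for b in blocks if b[6] == 0))
    return "\n\n".join(pages)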

INFO: Scanned PDFs without embedded text are not currently supported. Use OCR preprocessing if needed.

Chunking strategies

Text-based chunking (default)

For most documents, Airbeeps uses recursive character splitting:

  1. Split on paragraph boundaries (\n\n)
  2. If chunks are still too large, split on sentences
  3. If still too large, split on words
  4. Add overlap between adjacent chunks

python
# Conceptual example
chunks = chunker.chunk_document(
    content,
    chunk_size=1000,            # Target tokens per chunk
    chunk_overlap=200,          # Tokens shared between adjacent chunks
    max_tokens_per_chunk=1000,  # Hard cap on any single chunk
)
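
A minimal version of the recursive strategy might look like the sketch below. It splits on characters rather than tokens and omits overlap, both of which the real splitter handles; the separator cascade mirrors steps 1-3 above.

python
# Minimal recursive splitter sketch -- counts characters, not tokens,
# and omits the overlap step
def recursive_split(text: str, max_chars: int,
                    seps: tuple[str, ...] = ("\n\n", ". ", " ")) -> list[str]:
    # Base case: the text fits, or there is no finer separator left
    if len(text) <= max_chars or not seps:
        return [text]
    sep, finer = seps[0], seps[1:]
    chunks, buf = [], ""
    for part in text.split(sep):
        candidate = f"{buf}{sep}{part}" if buf else part
        if len(candidate) <= max_chars:
            buf = candidate          # keep growing the current chunk
        elif len(part) > max_chars:
            if buf:
                chunks.append(buf)
            # This piece alone is too big: recurse with the next separator
            chunks.extend(recursive_split(part, max_chars, finer))
            buf = ""
        else:
            if buf:
                chunks.append(buf)
            buf = part               # start a fresh chunk with this piece
    if buf:
        chunks.append(buf)
    return chunks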

Row-wise chunking (Excel/CSV)

Tabular files get special treatment:

  • Each row becomes a separate chunk
  • Each value is prefixed with its column header (Column: Value)
  • Empty cells are skipped
  • Token limit is still enforced (long rows are truncated)

This preserves the structure and enables per-row citations.
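
In spirit, the row-to-chunk conversion works like this sketch (truncation to the token limit and Excel sheet handling are omitted):

python
# Sketch: one CSV row -> one "Column: Value" chunk, skipping empty cells
import pandas as pd

def rows_to_chunks(path: str) -> list[str]:
    df = pd.read_csv(path)
    chunks = []
    for _, row in df.iterrows():
        parts = [f"{col}: {val}" for col, val in row.items() if pd.notna(val)]
        chunks.append("\n".join(parts))
    return chunks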

Chunk metadata

Every chunk includes rich metadata:

json
{
  "chunk_id": "uuid",
  "document_id": "uuid",
  "knowledge_base_id": "uuid",
  "chunk_index": 0,
  "title": "Document title",
  "file_path": "files/abc123.pdf",
  "file_type": "pdf",
  "embedding_model_id": "uuid",
  "embedding_model_name": "text-embedding-3-small"
}

For Excel files, additional fields are included:

json
{
  "sheet": "Sheet1",
  "row_number": 42,
  "original_filename": "data.xlsx"
}

Token counting

Airbeeps uses tiktoken with the cl100k_base encoding (GPT-4 tokenizer) to count tokens consistently. This ensures chunk sizes work well with most modern LLMs.
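
Counting tokens the same way is straightforward:

python
# Token counting with tiktoken's cl100k_base encoding
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
print(len(enc.encode("Airbeeps splits documents into chunks.")))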

Ingestion profiles (experimental)

For advanced tabular processing, you can define ingestion profiles that control:

  • Which columns to include
  • Custom text templates per row
  • Metadata extraction rules

Profiles are configured via the API and applied during ingestion.
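
The schema below is purely hypothetical (the field names are invented for illustration; the real schema is defined by the API), but a profile covering the three controls above might look like:

json
{
  "include_columns": ["name", "description", "price"],
  "row_template": "{name}: {description} (price: {price})",
  "metadata_rules": {"category": "column:category"}
}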

Monitoring ingestion

Document status values:

Status   | Meaning
---------|------------------------------
INDEXING | Currently processing
ACTIVE   | Successfully indexed
FAILED   | Indexing failed (check logs)
DELETED  | Soft-deleted

Check the admin UI or API for failed documents and review backend logs for details.
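
For example, a polling script against a hypothetical documents endpoint (the actual route and response shape may differ; consult the API reference):

python
# Hypothetical status check -- endpoint path and fields are illustrative
import requests

resp = requests.get(
    "https://airbeeps.example.com/api/documents",  # hypothetical route
    headers={"Authorization": "Bearer <token>"},
)
for doc in resp.json():
    if doc["status"] == "FAILED":
        print(doc["document_id"], doc.get("title"))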

Data cleaning

Airbeeps applies optional cleaners during ingestion:

  • Unicode normalization
  • Whitespace collapse
  • HTML tag stripping (if present)

Enable via the clean_data flag during upload.
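
The three cleaners roughly correspond to this sketch (the normalization form and tag-stripping approach are implementation details; NFKC and a regex are shown for illustration):

python
# Sketch of the optional cleaners -- not the exact Airbeeps implementation
import re
import unicodedata

def clean(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)  # Unicode normalization
    text = re.sub(r"<[^>]+>", " ", text)        # strip HTML tags, if present
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace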
