Ingestion & Chunking
When you upload a document, Airbeeps runs an ingestion pipeline that extracts content, splits it into chunks, and stores vectors for retrieval.
Pipeline overview
┌──────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ Upload │───▶│ Parse │───▶│ Chunk │───▶│ Embed │───▶│ Upsert │
└──────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘
                   15%           20%           55%          10%

Each stage reports progress and can be tracked via the admin UI or API. Ingestion runs through a job queue that supports cancellation.
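Given the stage weights in the diagram, an overall percentage can be derived from the completed stages plus progress within the active one. A minimal sketch (the weights match the diagram, but the function name and stage keys are illustrative, not Airbeeps internals):

```python
# Combine per-stage progress into one overall percentage, assuming the
# stage weights shown above (Parse 15%, Chunk 20%, Embed 55%, Upsert 10%).
STAGE_WEIGHTS = [("parse", 0.15), ("chunk", 0.20), ("embed", 0.55), ("upsert", 0.10)]

def overall_progress(current_stage: str, stage_pct: float) -> float:
    """Return overall pipeline progress (0-100) given the active stage
    and the percentage completed within that stage."""
    done = 0.0
    for name, weight in STAGE_WEIGHTS:
        if name == current_stage:
            return round((done + weight * stage_pct / 100) * 100, 1)
        done += weight  # earlier stages are fully complete
    raise ValueError(f"unknown stage: {current_stage}")

print(overall_progress("embed", 50))  # parse+chunk done, half of embed -> 62.5
```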
Content extraction
Airbeeps uses specialized extractors for each file type:
| File type | Extractor | Notes |
|---|---|---|
| PDF | PyMuPDF | Page-level extraction with page-range support |
| DOCX | markitdown | Paragraphs and tables |
| PPTX | markitdown | Presentation slides |
| TXT/MD | Direct read | UTF-8 encoding |
| Excel/CSV | pandas | Row-based processing |
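Choosing an extractor comes down to a lookup on the file extension. The mapping below mirrors the table, but the function name and routing details are assumptions for illustration:

```python
# Illustrative extension-based extractor dispatch; the actual Airbeeps
# routing logic is internal, so names below are assumptions.
from pathlib import Path

EXTRACTORS = {
    ".pdf": "pymupdf",
    ".docx": "markitdown",
    ".pptx": "markitdown",
    ".txt": "direct",
    ".md": "direct",
    ".xlsx": "pandas",
    ".csv": "pandas",
}

def pick_extractor(filename: str) -> str:
    ext = Path(filename).suffix.lower()
    try:
        return EXTRACTORS[ext]
    except KeyError:
        raise ValueError(f"unsupported file type: {ext}") from None

print(pick_extractor("report.PDF"))  # "pymupdf"
```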
PDF processing
PDFs are processed page by page. The extractor:
- Extracts raw text from each page
- Tracks page numbers in chunk metadata
- Handles multi-column layouts (best effort)
- Supports page-range selection (e.g., 1-5,8,10-12)
- Can truncate to a max page count
INFO
Scanned PDFs without embedded text are not currently supported. Use OCR preprocessing if needed.
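The page-range syntax shown above can be parsed in a few lines. A minimal sketch, assuming 1-based page numbers (the function name is illustrative, not the actual parser):

```python
# Parse a page-range string like "1-5,8,10-12" into a sorted page list.
def parse_page_range(spec: str) -> list[int]:
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-", 1)
            pages.update(range(int(start), int(end) + 1))  # inclusive range
        else:
            pages.add(int(part))  # single page
    return sorted(pages)

print(parse_page_range("1-5,8,10-12"))  # [1, 2, 3, 4, 5, 8, 10, 11, 12]
```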
Chunking strategies
Airbeeps uses LlamaIndex node parsers for chunking. Three strategies are available, controlled by feature flags in the configuration.
Hierarchical chunking (default)
Creates a parent-child-leaf structure with multiple chunk sizes:
Parent chunk (2048 tokens)
├── Child chunk (512 tokens)
│ ├── Leaf chunk (128 tokens)
│ └── Leaf chunk (128 tokens)
└── Child chunk (512 tokens)
    └── ...

This enables auto-merging during retrieval — when multiple child chunks from the same parent are retrieved, they can be merged back for better context.
```yaml
# Default hierarchical sizes (configurable)
RAG_HIERARCHICAL_CHUNK_SIZES: [2048, 512, 128]
```

Enable/disable: AIRBEEPS_RAG_ENABLE_HIERARCHICAL=true
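The auto-merging idea fits in a few lines: if enough retrieved leaf chunks share a parent, swap them for the parent chunk. The 0.5 merge ratio and the data shapes below are illustrative, not LlamaIndex's internals:

```python
# Sketch of auto-merging: replace sibling chunks with their parent when
# enough of the parent's children were retrieved.
from collections import defaultdict

def auto_merge(retrieved: list[dict], children_per_parent: dict[str, int],
               ratio: float = 0.5) -> list[str]:
    """retrieved: chunks shaped like {"id": ..., "parent": ...}."""
    by_parent = defaultdict(list)
    for chunk in retrieved:
        by_parent[chunk["parent"]].append(chunk["id"])
    result = []
    for parent, child_ids in by_parent.items():
        if len(child_ids) / children_per_parent[parent] >= ratio:
            result.append(parent)        # merge back into the parent
        else:
            result.extend(child_ids)     # keep the individual chunks
    return result

hits = [{"id": "c1", "parent": "p1"}, {"id": "c2", "parent": "p1"},
        {"id": "c9", "parent": "p2"}]
print(auto_merge(hits, {"p1": 3, "p2": 4}))  # ['p1', 'c9']
```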
Semantic chunking
Splits documents based on embedding similarity rather than fixed sizes. Uses SemanticSplitterNodeParser:
- Computes embeddings for sentence groups
- Identifies semantic breakpoints where meaning shifts
- Splits at those boundaries
```yaml
# Configuration
RAG_SEMANTIC_BREAKPOINT_THRESHOLD: 95  # Percentile for semantic splits
RAG_SEMANTIC_BUFFER_SIZE: 1            # Sentences to buffer for context
```

Enable/disable: AIRBEEPS_RAG_ENABLE_SEMANTIC_CHUNKING=true
TIP
Semantic chunking requires an embedding model. If none is available, it falls back to sentence-based chunking automatically.
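The breakpoint logic can be illustrated with toy embeddings: measure the cosine distance between consecutive sentence embeddings and split where the distance sits above the configured percentile. This is a sketch of the idea, not the SemanticSplitterNodeParser implementation:

```python
# Percentile-based semantic splitting on fake, low-dimensional embeddings.
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (na * nb)

def breakpoints(embeddings, percentile=95):
    """Return sentence indices after which to split."""
    dists = [cosine_distance(a, b) for a, b in zip(embeddings, embeddings[1:])]
    ordered = sorted(dists)
    threshold = ordered[min(len(ordered) - 1, int(len(ordered) * percentile / 100))]
    # Split after sentence i when the jump to sentence i+1 is unusually large.
    return [i for i, d in enumerate(dists) if d >= threshold]

# Sentences 0-1 point one way, 2-3 another: the split lands between them.
print(breakpoints([[1, 0], [0.9, 0.1], [0, 1], [0.1, 1]]))  # [1]
```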
Sentence-based chunking (fallback)
Uses SentenceSplitter with configurable chunk size and overlap. This is the fallback when hierarchical and semantic chunking are both disabled or unavailable.
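A minimal stand-in for this fallback: pack sentences into chunks up to a token budget (words here, for simplicity) and carry a few sentences of overlap into the next chunk. Parameter names are illustrative, not Airbeeps configuration keys:

```python
# Sentence packing with overlap, a simplified SentenceSplitter analogue.
def sentence_chunks(sentences: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    chunks, current, tokens = [], [], 0
    for sent in sentences:
        n = len(sent.split())  # crude "token" count: whitespace words
        if current and tokens + n > chunk_size:
            chunks.append(current)
            current = current[-overlap:] if overlap else []  # carry overlap
            tokens = sum(len(s.split()) for s in current)
        current.append(sent)
        tokens += n
    if current:
        chunks.append(current)
    return chunks

print(sentence_chunks(["a b c", "d e", "f g h i"], chunk_size=5, overlap=1))
# [['a b c', 'd e'], ['d e', 'f g h i']]
```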
Code block preservation
For documents containing fenced code blocks (```), Airbeeps can preserve them as atomic units — code blocks are never split mid-block. Surrounding text is chunked normally.
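One way to sketch this behavior: split the text on fenced blocks, keep each block whole, and hand the prose between them to the normal chunker (stubbed here as plain strings). An illustration, not the actual implementation:

```python
# Keep fenced code blocks atomic while prose around them is chunked.
import re

FENCE = re.compile(r"`{3}.*?`{3}", re.DOTALL)  # matches ```...``` blocks

def split_preserving_code(text: str) -> list[str]:
    parts, last = [], 0
    for match in FENCE.finditer(text):
        before = text[last:match.start()].strip()
        if before:
            parts.append(before)          # prose: would go to the normal chunker
        parts.append(match.group(0))      # code block: kept as one atomic unit
        last = match.end()
    tail = text[last:].strip()
    if tail:
        parts.append(tail)
    return parts
```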
Row-wise chunking (Excel/CSV)
Tabular files get special treatment:
- Each row becomes a separate chunk
- Column headers are prepended: Column: Value
- Empty cells are skipped
- Ingestion profiles can customize column selection and text templates
This preserves the structure and enables per-row citations.
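The row-to-chunk transformation can be sketched with the stdlib csv module (the pipeline itself uses pandas, per the table above); the function name is illustrative:

```python
# One chunk per row, "Column: Value" lines, empty cells skipped.
import csv
import io

def rows_to_chunks(csv_text: str) -> list[str]:
    reader = csv.DictReader(io.StringIO(csv_text))
    chunks = []
    for row in reader:
        lines = [f"{col}: {val}" for col, val in row.items() if val]
        chunks.append("\n".join(lines))
    return chunks

print(rows_to_chunks("name,city\nAda,London\nAlan,\n"))
# ['name: Ada\ncity: London', 'name: Alan']
```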
Chunk metadata
Every chunk includes rich metadata stored in the vector store:
```json
{
  "chunk_id": "uuid",
  "document_id": "uuid",
  "knowledge_base_id": "uuid",
  "chunk_index": 0,
  "title": "Document title",
  "file_path": "files/abc123.pdf",
  "file_type": "pdf",
  "embedding_model_id": "uuid",
  "embedding_model_name": "text-embedding-3-small"
}
```

For hierarchical chunks, parent-child relationships are tracked via LlamaIndex node relationships.
For Excel files, additional fields:
```json
{
  "sheet": "Sheet1",
  "row_number": 42,
  "original_filename": "data.xlsx"
}
```

For PDFs, page numbers are tracked:
```json
{
  "page_number": 5
}
```

Token counting
Airbeeps uses tiktoken with the cl100k_base encoding (GPT-4 tokenizer) to count tokens consistently. This ensures chunk sizes work well with most modern LLMs.
Ingestion profiles
For advanced tabular processing, you can define ingestion profiles that control:
- Which columns to include
- Custom text templates per row
- Metadata extraction rules
- Preprocessing options
Profiles are configured via the API and applied during ingestion. They are associated with a knowledge base and can be matched by file type.
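The exact profile schema is defined by the API; purely as an illustration of the options listed above, a tabular profile might look something like this (every field name here is hypothetical, not the real payload):

```json
{
  "knowledge_base_id": "uuid",
  "file_types": ["xlsx", "csv"],
  "include_columns": ["name", "city"],
  "row_template": "{name} is based in {city}",
  "metadata_columns": ["region"]
}
```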
Monitoring ingestion
Document ingestion runs through a job queue with progress tracking. Each ingestion job has:
- Stage: PARSING, CHUNKING, EMBEDDING, UPSERTING
- Progress: Percentage within the current stage
- Events: Detailed log of what happened at each step
Job status values
| Status | Meaning |
|---|---|
| QUEUED | Waiting to be processed |
| INDEXING | Currently processing |
| ACTIVE | Successfully indexed |
| FAILED | Indexing failed (check logs) |
| CANCELLING | Cancel requested |
| CANCELLED | Job was cancelled |
| DELETED | Soft-deleted |
Check the admin UI or API for failed documents and review backend logs for details.
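Polling until a terminal status can be sketched as below; fetch_status stands in for whatever API call returns the job's current status, since the endpoint shape isn't documented here:

```python
# Poll a job until it reaches a terminal status from the table above.
import time

TERMINAL = {"ACTIVE", "FAILED", "CANCELLED", "DELETED"}

def wait_for_job(fetch_status, interval: float = 1.0, max_polls: int = 100) -> str:
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL:
            return status
        time.sleep(interval)
    raise TimeoutError("job did not finish in time")
```

In practice, fetch_status would wrap the documents API; with a stub it behaves like `wait_for_job(lambda: "ACTIVE") == "ACTIVE"`.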
Data cleaning
Airbeeps applies optional cleaners during ingestion:
- Unicode normalization
- Whitespace collapse
- HTML tag stripping (if present)
Enable via the clean_data flag during upload.
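The three cleaners can be sketched with the stdlib; the ordering and the regex-based tag strip here are assumptions, not the internal implementation:

```python
# Optional cleaners: NFC normalization, HTML tag strip, whitespace collapse.
import re
import unicodedata

TAG = re.compile(r"<[^>]+>")

def clean(text: str) -> str:
    text = unicodedata.normalize("NFC", text)  # unify composed/decomposed forms
    text = TAG.sub(" ", text)                  # strip HTML tags if present
    text = re.sub(r"\s+", " ", text)           # collapse runs of whitespace
    return text.strip()

print(clean("<p>Hello   world</p>"))  # "Hello world"
```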