Knowledge Bases
A knowledge base is a collection of documents that an assistant can search when answering questions. Each knowledge base has its own vector store collection and configuration.
Key concepts
Documents
A document is a single file or piece of content you upload. Supported formats:
| Format | Extensions | Notes |
|---|---|---|
| PDF | .pdf | Full text extraction with page-level tracking |
| Word | .doc, .docx | Microsoft Word documents |
| PowerPoint | .pptx | Presentation slides |
| Plain text | .txt, .md, .rtf | Markdown supported |
| Spreadsheets | .xlsx, .xls, .csv | Row-wise chunking for citations |
Chunks
Documents are split into chunks — smaller segments optimized for retrieval. Each chunk becomes a vector in the embedding space.
Airbeeps supports three chunking strategies:
| Strategy | Description | Default |
|---|---|---|
| Hierarchical | Parent-child-leaf structure at sizes [2048, 512, 128] | ✅ Enabled |
| Semantic | Splits by embedding similarity using SemanticSplitterNodeParser | ✅ Enabled |
| Sentence | Fallback sentence-based splitting | Used as fallback |
TIP
Hierarchical chunking creates multi-level chunks that enable auto-merging during retrieval — child chunks can be merged back into parent chunks for better context. This is the recommended default.
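The parent-child-leaf idea can be sketched in a few lines. This is a minimal illustration, not Airbeeps' actual node parser: the fixed-size character split and the function names are stand-ins, and the real implementation tracks parent-child links for auto-merging.

```python
def split_fixed(text: str, size: int) -> list[str]:
    """Split text into fixed-size windows (a stand-in for a real node parser)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def hierarchical_chunks(text: str, sizes=(2048, 512, 128)) -> dict[int, list[str]]:
    """Build one chunk list per level: parents (2048), children (512), leaves (128)."""
    return {size: split_fixed(text, size) for size in sizes}

doc = "x" * 4096
levels = hierarchical_chunks(doc)
# Leaves are the retrieval units; parents supply the merged-back context.
print(len(levels[2048]), len(levels[512]), len(levels[128]))  # 2 8 32
```

Each leaf chunk sits inside exactly one child and one parent, which is what makes merging children back into their parent well-defined at retrieval time.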
Embeddings
Each chunk is converted to a vector embedding — a numerical representation of its semantic meaning. Airbeeps supports multiple embedding models:
- OpenAI — `text-embedding-3-small` / `text-embedding-3-large`
- HuggingFace models via `sentence-transformers`
- DashScope embedding models
- Any OpenAI-compatible embedding API
The embedding model is configured per knowledge base. Changing it requires reindexing.
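Retrieval works by comparing chunk vectors to the query vector, typically with cosine similarity. A minimal pure-Python sketch (independent of any particular embedding model) also shows why changing models forces reindexing: vectors from different models live in different spaces and their similarities are not comparable.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Vectors pointing the same way score 1.0; orthogonal vectors score 0.0.
print(round(cosine_similarity([1.0, 0.0], [2.0, 0.0]), 3))  # 1.0
print(round(cosine_similarity([1.0, 0.0], [0.0, 5.0]), 3))  # 0.0
```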
Vector stores
Each knowledge base stores its vectors in the configured vector store:
| Store | Description |
|---|---|
| Qdrant (default) | High-performance vector search with gRPC support |
| ChromaDB | Embedded or server mode |
| PGVector | PostgreSQL extension — uses your existing database |
| Milvus | Distributed vector database for large-scale deployments |
The vector store type is set globally via `AIRBEEPS_VECTOR_STORE_TYPE` or per knowledge base during creation.
Creating a knowledge base
1. Navigate to `/admin/kbs` in the admin UI
2. Click Create Knowledge Base
3. Configure:
   - Name — descriptive identifier
   - Embedding model — select from configured models
   - Vector store type — optionally override the global default
   - Retrieval config — customize retrieval parameters
4. Save and start uploading documents
Uploading documents
You can upload files through:
- Admin UI — drag and drop at `/admin/kbs/{id}`
- API — `POST /api/v1/rag/knowledge-bases/{kb_id}/documents/upload`
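The upload endpoint can be called from any HTTP client. A hedged sketch for building the URL with Python's standard library — the base URL is a placeholder for your deployment, and the multipart field name mentioned in the comment is an assumption, not a documented contract:

```python
from urllib.parse import urljoin

BASE_URL = "https://airbeeps.example.com"  # placeholder host, adapt to your deployment

def upload_endpoint(kb_id: str) -> str:
    """Build the document-upload URL for a knowledge base."""
    return urljoin(BASE_URL, f"/api/v1/rag/knowledge-bases/{kb_id}/documents/upload")

print(upload_endpoint("kb_123"))
# POST a multipart/form-data body with the file to this URL, plus your auth header,
# e.g. with requests: requests.post(url, files={"file": open(path, "rb")}, headers=auth)
```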
Deduplication strategies
When uploading a file that already exists (by hash or filename):
| Strategy | Behavior |
|---|---|
| `replace` | Delete existing document, add new one |
| `skip` | Keep existing, ignore upload |
| `version` | Add as new version with `(v2)` suffix |
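The three strategies reduce to a small decision function. This sketch assumes duplicates are detected by content hash or filename, as described above; the helper names and the exact `(v2)` renaming are illustrative, not Airbeeps' internal code:

```python
import hashlib

def detect_duplicate(existing: dict[str, str], filename: str, content: bytes) -> bool:
    """A file is a duplicate if its SHA-256 hash or its filename is already known."""
    digest = hashlib.sha256(content).hexdigest()
    return digest in existing.values() or filename in existing

def resolve(strategy: str, filename: str, is_duplicate: bool) -> str:
    """Return the action the ingestion layer would take for an upload."""
    if not is_duplicate:
        return f"add {filename}"
    if strategy == "replace":
        return f"delete existing, add {filename}"
    if strategy == "skip":
        return f"keep existing, ignore {filename}"
    if strategy == "version":
        stem, dot, ext = filename.rpartition(".")
        return f"add {stem} (v2).{ext}" if dot else f"add {filename} (v2)"
    raise ValueError(f"unknown strategy: {strategy}")

existing = {"report.pdf": hashlib.sha256(b"old contents").hexdigest()}
dup = detect_duplicate(existing, "report.pdf", b"new contents")
print(resolve("version", "report.pdf", dup))  # add report (v2).pdf
```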
Ingestion pipeline
Documents are processed through a job queue with four stages:
Upload → Parse → Chunk → Embed → Upsert to vector store

Each stage reports progress, and you can track the status of ingestion jobs in the admin UI.
- Parse — Extract content from the file (PDF pages, DOCX paragraphs, etc.)
- Chunk — Split into nodes using the selected strategy (hierarchical, semantic, or sentence)
- Embed — Generate vector embeddings for each chunk
- Upsert — Store vectors and metadata in the vector store
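The four stages above can be sketched as a pipeline where each stage consumes the previous stage's output. Everything here is a toy stand-in (UTF-8 decode for parsing, fixed-size slices for chunking, a dummy one-dimensional "embedding"), meant only to show the data flow:

```python
def ingest(raw: bytes) -> dict:
    """Minimal four-stage pipeline: Parse -> Chunk -> Embed -> Upsert."""
    text = raw.decode("utf-8")                                   # Parse: extract text
    chunks = [text[i:i + 16] for i in range(0, len(text), 16)]   # Chunk: fixed-size stand-in
    vectors = [[float(len(c))] for c in chunks]                  # Embed: dummy 1-d vectors
    store = {i: (c, v) for i, (c, v) in enumerate(zip(chunks, vectors))}  # Upsert
    return {"status": "ACTIVE", "chunks": len(store)}

print(ingest(b"a" * 40))  # {'status': 'ACTIVE', 'chunks': 3}
```

In the real pipeline each stage reports progress back to the job queue, which is what the status values in the next section track.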
Excel/CSV special handling
Tabular files use row-wise chunking instead of text-based chunking:
- Each row becomes a separate chunk
- Column headers are included in chunk text
- Row numbers are preserved in metadata for citations
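Row-wise chunking can be illustrated with Python's `csv` module: each data row becomes one chunk whose text repeats the column headers, and the row number is carried in metadata for citations. The chunk dictionary shape is a sketch, not Airbeeps' exact schema:

```python
import csv
import io

def rowwise_chunks(csv_text: str) -> list[dict]:
    """One chunk per data row; headers embedded in the text, row number in metadata."""
    reader = csv.reader(io.StringIO(csv_text))
    headers = next(reader)
    chunks = []
    for row_num, row in enumerate(reader, start=2):  # row 1 is the header line
        text = "; ".join(f"{h}: {v}" for h, v in zip(headers, row))
        chunks.append({"text": text, "metadata": {"row": row_num}})
    return chunks

data = "name,price\nwidget,9.99\ngadget,19.99\n"
for chunk in rowwise_chunks(data):
    print(chunk)
```

Embedding headers into every chunk means a retrieved row is self-describing even without the rest of the table.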
Ingestion job status
| Status | Meaning |
|---|---|
| `QUEUED` | Waiting to be processed |
| `INDEXING` | Currently processing |
| `ACTIVE` | Successfully indexed |
| `FAILED` | Indexing failed (check logs) |
| `CANCELLING` | Cancel requested |
| `CANCELLED` | Job was cancelled |
| `DELETED` | Soft-deleted |
Reindexing
If you change the embedding model or chunk settings, existing documents need reindexing:
```
# Via API
POST /api/v1/rag/knowledge-bases/{kb_id}/reindex
```

Or use the Reindex button in the admin UI.
WARNING
Reindexing regenerates all embeddings. This can be slow and expensive for large knowledge bases with paid embedding APIs.
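A back-of-envelope estimate makes the cost concrete. The chunk count, average token length, and per-token price below are all hypothetical — substitute your own knowledge base size and your embedding provider's actual pricing:

```python
def reindex_cost(num_chunks: int, avg_tokens_per_chunk: int,
                 usd_per_million_tokens: float) -> float:
    """Rough embedding cost for a full reindex: total tokens times unit price."""
    total_tokens = num_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * usd_per_million_tokens

# e.g. 100k chunks of ~400 tokens at a hypothetical $0.02 per 1M tokens
print(f"${reindex_cost(100_000, 400, 0.02):.2f}")  # $0.80
```

Even at low unit prices, the token count scales linearly with corpus size, so large knowledge bases pay the full embedding bill again on every reindex.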
Best practices
- One topic per KB — group related documents together
- Use descriptive names — makes assistant configuration easier
- Monitor ingestion jobs — check for `FAILED` documents after upload
- Test retrieval — use the search endpoint to verify chunk quality
- Choose the right vector store — Qdrant for most cases, PGVector if you want to consolidate with your database