
Knowledge Bases

A knowledge base is a collection of documents that an assistant can search when answering questions. Each knowledge base has its own vector store collection and configuration.

Key concepts

Documents

A document is a single file or piece of content you upload. Supported formats:

| Format | Extensions | Notes |
| --- | --- | --- |
| PDF | `.pdf` | Full text extraction with page-level tracking |
| Word | `.doc`, `.docx` | Microsoft Word documents |
| PowerPoint | `.pptx` | Presentation slides |
| Plain text | `.txt`, `.md`, `.rtf` | Markdown supported |
| Spreadsheets | `.xlsx`, `.xls`, `.csv` | Row-wise chunking for citations |

Chunks

Documents are split into chunks — smaller segments optimized for retrieval. Each chunk becomes a vector in the embedding space.

Airbeeps supports three chunking strategies:

| Strategy | Description | Default |
| --- | --- | --- |
| Hierarchical | Parent-child-leaf structure at sizes [2048, 512, 128] | ✅ Enabled |
| Semantic | Splits by embedding similarity using SemanticSplitterNodeParser | ✅ Enabled |
| Sentence | Fallback sentence-based splitting | Used as fallback |

TIP

Hierarchical chunking creates multi-level chunks that enable auto-merging during retrieval — child chunks can be merged back into parent chunks for better context. This is the recommended default.
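The parent-child-leaf idea can be sketched as follows. This is a simplified, character-based illustration of the structure, not Airbeeps' actual node parser; the sizes come from the table above, and the `id`/`parent` bookkeeping is what makes auto-merging possible.

```python
def split_fixed(text: str, size: int) -> list[str]:
    """Split text into fixed-size character windows (illustrative only)."""
    return [text[i:i + size] for i in range(0, len(text), size)] or [""]

def hierarchical_chunks(text: str, sizes=(2048, 512, 128)) -> list[dict]:
    """Build parent/child/leaf chunks; each child records its parent's id
    so retrieval can merge children back into their parent for context."""
    chunks, next_id = [], 0

    def build(segment: str, level: int, parent_id):
        nonlocal next_id
        for piece in split_fixed(segment, sizes[level]):
            cid = next_id
            next_id += 1
            chunks.append({"id": cid, "level": level,
                           "parent": parent_id, "text": piece})
            if level + 1 < len(sizes):
                build(piece, level + 1, cid)

    build(text, 0, None)
    return chunks
```

At retrieval time, when several leaves with the same `parent` match a query, the retriever can return the parent chunk instead of the individual leaves.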

Embeddings

Each chunk is converted to a vector embedding — a numerical representation of its semantic meaning. Airbeeps supports multiple embedding models:

  • OpenAI text-embedding-3-small / text-embedding-3-large
  • HuggingFace models via sentence-transformers
  • DashScope embedding models
  • Any OpenAI-compatible embedding API

The embedding model is configured per knowledge base. Changing it requires reindexing.
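The reindexing requirement follows from how embeddings are compared. A minimal sketch of cosine similarity shows why: vectors from different models have different dimensions (and different geometry), so old and new vectors cannot be meaningfully compared.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Similarity of two embedding vectors. Vectors produced by different
    embedding models are incompatible, which is why changing the model
    forces a full reindex of the knowledge base."""
    if len(a) != len(b):
        raise ValueError("embedding dimensions differ; vectors are incompatible")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```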

Vector stores

Each knowledge base stores its vectors in the configured vector store:

| Store | Description |
| --- | --- |
| Qdrant (default) | High-performance vector search with gRPC support |
| ChromaDB | Embedded or server mode |
| PGVector | PostgreSQL extension — uses your existing database |
| Milvus | Distributed vector database for large-scale deployments |

The vector store type is set globally via AIRBEEPS_VECTOR_STORE_TYPE or per knowledge base during creation.

Creating a knowledge base

  1. Navigate to /admin/kbs in the admin UI
  2. Click Create Knowledge Base
  3. Configure:
    • Name — descriptive identifier
    • Embedding model — select from configured models
    • Vector store type — optionally override the global default
    • Retrieval config — customize retrieval parameters
  4. Save and start uploading documents

Uploading documents

You can upload files through:

  1. Admin UI — drag and drop at /admin/kbs/{id}
  2. API — POST /api/v1/rag/knowledge-bases/{kb_id}/documents/upload
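A sketch of calling the upload endpoint from code. The URL template comes from this page; the base URL, multipart field names, and form parameters are assumptions — check your deployment's API reference.

```python
def upload_endpoint(base_url: str, kb_id: str) -> str:
    """Build the document-upload URL from the endpoint template above."""
    return (f"{base_url.rstrip('/')}"
            f"/api/v1/rag/knowledge-bases/{kb_id}/documents/upload")

# With requests (hypothetical field names, shown for illustration only):
# import requests
# requests.post(
#     upload_endpoint("https://airbeeps.example.com", kb_id),
#     files={"file": open("report.pdf", "rb")},
#     data={"dedup_strategy": "replace"},
# )
```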

Deduplication strategies

When uploading a file that already exists (by hash or filename):

| Strategy | Behavior |
| --- | --- |
| `replace` | Delete the existing document, add the new one |
| `skip` | Keep the existing document, ignore the upload |
| `version` | Add as a new version with a `(v2)` suffix |
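The decision logic can be sketched as follows. This illustration matches by content hash only; per the page, Airbeeps also matches by filename, and the action labels here are placeholders.

```python
import hashlib

def dedup_action(existing_hashes: set[str], new_bytes: bytes,
                 strategy: str) -> str:
    """Decide what to do with an upload whose content hash may already
    exist in the knowledge base (hash-only sketch)."""
    digest = hashlib.sha256(new_bytes).hexdigest()
    if digest not in existing_hashes:
        return "ingest"  # brand-new document, no dedup needed
    return {"replace": "delete-then-ingest",
            "skip": "ignore-upload",
            "version": "ingest-as-v2"}[strategy]
```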

Ingestion pipeline

After upload, documents are processed through a job queue in four stages:

Upload → Parse → Chunk → Embed → Upsert to vector store

Each stage reports progress, and you can track the status of ingestion jobs in the admin UI.

  1. Parse — Extract content from the file (PDF pages, DOCX paragraphs, etc.)
  2. Chunk — Split into nodes using the selected strategy (hierarchical, semantic, or sentence)
  3. Embed — Generate vector embeddings for each chunk
  4. Upsert — Store vectors and metadata in the vector store
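The four stages above can be sketched as a simple driver. The stage names come from this page; the callables are placeholders for the real parser, chunker, embedder, and vector store client.

```python
def run_ingestion(file_bytes: bytes, parse, chunk, embed, upsert) -> list[str]:
    """Drive one document through the four pipeline stages, recording
    a progress status after each stage completes."""
    statuses = []
    text = parse(file_bytes)      # 1. extract content from the file
    statuses.append("parsed")
    nodes = chunk(text)           # 2. split into chunks/nodes
    statuses.append("chunked")
    vectors = embed(nodes)        # 3. generate one vector per chunk
    statuses.append("embedded")
    upsert(nodes, vectors)        # 4. store vectors + metadata
    statuses.append("upserted")
    return statuses
```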

Excel/CSV special handling

Tabular files use row-wise chunking instead of text-based chunking:

  • Each row becomes a separate chunk
  • Column headers are included in chunk text
  • Row numbers are preserved in metadata for citations
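The three bullets above can be sketched for a CSV file like this. The chunk-text format (`header: value` pairs) is an assumption; the essentials are one chunk per row, headers repeated in the text, and the row number kept in metadata.

```python
import csv
import io

def csv_row_chunks(csv_text: str) -> list[dict]:
    """One chunk per data row; headers are repeated in each chunk's text
    and the 1-based row number is kept in metadata for citations."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    headers, data = rows[0], rows[1:]
    return [{"text": "; ".join(f"{h}: {v}" for h, v in zip(headers, row)),
             "metadata": {"row": i}}
            for i, row in enumerate(data, start=1)]
```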

Ingestion job status

| Status | Meaning |
| --- | --- |
| QUEUED | Waiting to be processed |
| INDEXING | Currently processing |
| ACTIVE | Successfully indexed |
| FAILED | Indexing failed (check logs) |
| CANCELLING | Cancel requested |
| CANCELLED | Job was cancelled |
| DELETED | Soft-deleted |
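One plausible way to read the table is as a small state machine. The transition map below is an inference from the status meanings, not a documented contract — the actual allowed transitions may differ.

```python
# Hypothetical transitions inferred from the status table above.
TRANSITIONS = {
    "QUEUED":     {"INDEXING", "CANCELLING"},
    "INDEXING":   {"ACTIVE", "FAILED", "CANCELLING"},
    "CANCELLING": {"CANCELLED"},
    "ACTIVE":     {"DELETED"},
    "FAILED":     {"QUEUED"},  # retry after failure — an assumption
}

def can_transition(src: str, dst: str) -> bool:
    """True if dst is a plausible next status for a job in src."""
    return dst in TRANSITIONS.get(src, set())
```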

Reindexing

If you change the embedding model or chunk settings, existing documents need reindexing:

```bash
# Via API
POST /api/v1/rag/knowledge-bases/{kb_id}/reindex
```

Or use the Reindex button in the admin UI.

WARNING

Reindexing regenerates all embeddings. This can be slow and expensive for large knowledge bases with paid embedding APIs.

Best practices

  1. One topic per KB — group related documents together
  2. Use descriptive names — makes assistant configuration easier
  3. Monitor ingestion jobs — check for FAILED documents after upload
  4. Test retrieval — use the search endpoint to verify chunk quality
  5. Choose the right vector store — Qdrant for most cases, PGVector if you want to consolidate with your database
