
Knowledge Bases

A knowledge base is a collection of documents that an assistant can search when answering questions. Each knowledge base has its own vector store collection and configuration.

Key concepts

Documents

A document is a single file or piece of content you upload. Supported formats:

| Format | Extensions | Notes |
| --- | --- | --- |
| PDF | `.pdf` | Full text extraction with page-level tracking |
| Word | `.doc`, `.docx` | Microsoft Word documents |
| PowerPoint | `.pptx` | Presentation slides |
| Plain text | `.txt`, `.md`, `.rtf` | Markdown supported |
| Spreadsheets | `.xlsx`, `.xls`, `.csv` | Row-wise chunking for citations |

Chunks

Documents are split into chunks — smaller segments optimized for retrieval. Each chunk becomes a vector in the embedding space.

Airbeeps supports three chunking strategies:

| Strategy | Description | Default |
| --- | --- | --- |
| Hierarchical | Parent-child-leaf structure at sizes [2048, 512, 128] | ✅ Enabled |
| Semantic | Splits by embedding similarity using SemanticSplitterNodeParser | ✅ Enabled |
| Sentence | Fallback sentence-based splitting | Used as fallback |

TIP

Hierarchical chunking creates multi-level chunks that enable auto-merging during retrieval — child chunks can be merged back into parent chunks for better context. This is the recommended default.
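The parent-child-leaf idea can be sketched as follows. This is a simplified, character-based illustration of the structure, not Airbeeps' actual node parser; the sizes come from the table above, and the `id`/`parent` bookkeeping is what makes auto-merging possible.

```python
def split_fixed(text: str, size: int) -> list[str]:
    """Split text into fixed-size character windows (illustrative only)."""
    return [text[i:i + size] for i in range(0, len(text), size)] or [""]

def hierarchical_chunks(text: str, sizes=(2048, 512, 128)) -> list[dict]:
    """Build parent/child/leaf chunks; each child records its parent's id
    so retrieval can merge children back into their parent for context."""
    chunks, next_id = [], 0

    def build(segment: str, level: int, parent_id):
        nonlocal next_id
        for piece in split_fixed(segment, sizes[level]):
            cid = next_id
            next_id += 1
            chunks.append({"id": cid, "level": level,
                           "parent": parent_id, "text": piece})
            if level + 1 < len(sizes):
                build(piece, level + 1, cid)

    build(text, 0, None)
    return chunks
```

At retrieval time, when several leaves with the same `parent` match a query, the retriever can return the parent chunk instead of the individual leaves.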

Embeddings

Each chunk is converted to a vector embedding — a numerical representation of its semantic meaning. Airbeeps supports multiple embedding models:

  • OpenAI text-embedding-3-small / text-embedding-3-large
  • HuggingFace models via sentence-transformers
  • DashScope embedding models
  • Any OpenAI-compatible embedding API

The embedding model is configured per knowledge base. Changing it requires reindexing.
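The reindexing requirement follows from how embeddings are compared. A minimal sketch of cosine similarity shows why: vectors from different models have different dimensions (and different geometry), so old and new vectors cannot be meaningfully compared.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Similarity of two embedding vectors. Vectors produced by different
    embedding models are incompatible, which is why changing the model
    forces a full reindex of the knowledge base."""
    if len(a) != len(b):
        raise ValueError("embedding dimensions differ; vectors are incompatible")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```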

Vector stores

Each knowledge base stores its vectors in the configured vector store:

| Store | Description |
| --- | --- |
| Qdrant (default) | High-performance vector search with gRPC support |
| ChromaDB | Embedded or server mode |
| PGVector | PostgreSQL extension — uses your existing database |
| Milvus | Distributed vector database for large-scale deployments |

The vector store type is set globally via AIRBEEPS_VECTOR_STORE_TYPE or per knowledge base during creation.

Creating a knowledge base

  1. Navigate to /admin/kbs in the admin UI
  2. Click Create Knowledge Base
  3. Configure:
    • Name — descriptive identifier
    • Embedding model — select from configured models
    • Vector store type — optionally override the global default
    • Retrieval config — customize retrieval parameters
  4. Save and start uploading documents

Uploading documents

You can upload files through:

  1. Admin UI — drag and drop at /admin/kbs/{id}
  2. API — POST /api/v1/rag/knowledge-bases/{kb_id}/documents/upload
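A sketch of calling the upload endpoint from code. The URL template comes from this page; the base URL, multipart field names, and form parameters are assumptions — check your deployment's API reference.

```python
def upload_endpoint(base_url: str, kb_id: str) -> str:
    """Build the document-upload URL from the endpoint template above."""
    return (f"{base_url.rstrip('/')}"
            f"/api/v1/rag/knowledge-bases/{kb_id}/documents/upload")

# With requests (hypothetical field names, shown for illustration only):
# import requests
# requests.post(
#     upload_endpoint("https://airbeeps.example.com", kb_id),
#     files={"file": open("report.pdf", "rb")},
#     data={"dedup_strategy": "replace"},
# )
```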

Deduplication strategies

When uploading a file that already exists (by hash or filename):

| Strategy | Behavior |
| --- | --- |
| `replace` | Delete the existing document, add the new one |
| `skip` | Keep the existing document, ignore the upload |
| `version` | Add as a new version with a `(v2)` suffix |
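The decision logic can be sketched as follows. This illustration matches by content hash only; per the page, Airbeeps also matches by filename, and the action labels here are placeholders.

```python
import hashlib

def dedup_action(existing_hashes: set[str], new_bytes: bytes,
                 strategy: str) -> str:
    """Decide what to do with an upload whose content hash may already
    exist in the knowledge base (hash-only sketch)."""
    digest = hashlib.sha256(new_bytes).hexdigest()
    if digest not in existing_hashes:
        return "ingest"  # brand-new document, no dedup needed
    return {"replace": "delete-then-ingest",
            "skip": "ignore-upload",
            "version": "ingest-as-v2"}[strategy]
```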

Ingestion pipeline

After upload, documents are processed through a job queue in four stages:

Upload → Parse → Chunk → Embed → Upsert to vector store

Each stage reports progress, and you can track the status of ingestion jobs in the admin UI.

  1. Parse — Extract content from the file (PDF pages, DOCX paragraphs, etc.)
  2. Chunk — Split into nodes using the selected strategy (hierarchical, semantic, or sentence)
  3. Embed — Generate vector embeddings for each chunk
  4. Upsert — Store vectors and metadata in the vector store
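The four stages above can be sketched as a simple driver. The stage names come from this page; the callables are placeholders for the real parser, chunker, embedder, and vector store client.

```python
def run_ingestion(file_bytes: bytes, parse, chunk, embed, upsert) -> list[str]:
    """Drive one document through the four pipeline stages, recording
    a progress status after each stage completes."""
    statuses = []
    text = parse(file_bytes)      # 1. extract content from the file
    statuses.append("parsed")
    nodes = chunk(text)           # 2. split into chunks/nodes
    statuses.append("chunked")
    vectors = embed(nodes)        # 3. generate one vector per chunk
    statuses.append("embedded")
    upsert(nodes, vectors)        # 4. store vectors + metadata
    statuses.append("upserted")
    return statuses
```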

Excel/CSV special handling

Tabular files use row-wise chunking instead of text-based chunking:

  • Each row becomes a separate chunk
  • Column headers are included in chunk text
  • Row numbers are preserved in metadata for citations
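The three bullets above can be sketched for a CSV file like this. The chunk-text format (`header: value` pairs) is an assumption; the essentials are one chunk per row, headers repeated in the text, and the row number kept in metadata.

```python
import csv
import io

def csv_row_chunks(csv_text: str) -> list[dict]:
    """One chunk per data row; headers are repeated in each chunk's text
    and the 1-based row number is kept in metadata for citations."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    headers, data = rows[0], rows[1:]
    return [{"text": "; ".join(f"{h}: {v}" for h, v in zip(headers, row)),
             "metadata": {"row": i}}
            for i, row in enumerate(data, start=1)]
```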

Ingestion job status

| Status | Meaning |
| --- | --- |
| QUEUED | Waiting to be processed |
| INDEXING | Currently processing |
| ACTIVE | Successfully indexed |
| FAILED | Indexing failed (check logs) |
| CANCELLING | Cancel requested |
| CANCELLED | Job was cancelled |
| DELETED | Soft-deleted |
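One plausible way to read the table is as a small state machine. The transition map below is an inference from the status meanings, not a documented contract — the actual allowed transitions may differ.

```python
# Hypothetical transitions inferred from the status table above.
TRANSITIONS = {
    "QUEUED":     {"INDEXING", "CANCELLING"},
    "INDEXING":   {"ACTIVE", "FAILED", "CANCELLING"},
    "CANCELLING": {"CANCELLED"},
    "ACTIVE":     {"DELETED"},
    "FAILED":     {"QUEUED"},  # retry after failure — an assumption
}

def can_transition(src: str, dst: str) -> bool:
    """True if dst is a plausible next status for a job in src."""
    return dst in TRANSITIONS.get(src, set())
```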

Reindexing

If you change the embedding model or chunk settings, existing documents need reindexing:

```bash
# Via API
POST /api/v1/rag/knowledge-bases/{kb_id}/reindex
```

Or use the Reindex button in the admin UI.

WARNING

Reindexing regenerates all embeddings. This can be slow and expensive for large knowledge bases with paid embedding APIs.

Best practices

  1. One topic per KB — group related documents together
  2. Use descriptive names — makes assistant configuration easier
  3. Monitor ingestion jobs — check for FAILED documents after upload
  4. Test retrieval — use the search endpoint to verify chunk quality
  5. Choose the right vector store — Qdrant for most cases, PGVector if you want to consolidate with your database
