Architecture
Overview
Browser (Vue 3 SPA)
|
nginx (static + /api proxy)
|
FastAPI backend
├── BM25Index (in-process, rank-bm25)
├── Retriever (BM25 + optional vector)
├── Synthesizer (LLMRouter → Ollama)
└── SQLite (page_chunks + metadata)
+
sqlite-vec (vectors)
Ingest pipeline
PDF / EPUB file
│
├─ PDFExtractor (pdfminer + OCR fallback) ← circuitforge_core
│ or
└─ EPUBExtractor (BeautifulSoup + heading chunking)
│
text_clean.py (strip artifacts)
│
INSERT INTO page_chunks
│
Ollama embed (batches of 64) ← BYOK gate
│
sqlite-vec upsert
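
The batching step in the pipeline above can be sketched as follows. This is a minimal illustration, not Pagepiper's actual code: the `batched` helper and the `EMBED_BATCH_SIZE` constant are hypothetical names, and only the batch size of 64 comes from the diagram.

```python
from typing import Iterator, List, Sequence

EMBED_BATCH_SIZE = 64  # batch size stated in the pipeline diagram


def batched(chunks: Sequence[str], size: int = EMBED_BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of `chunks`, each at most `size` items long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(chunks), size):
        yield list(chunks[start:start + size])


# Hypothetical input: 150 cleaned page chunks awaiting embedding.
chunks = [f"chunk {i}" for i in range(150)]
batch_sizes = [len(batch) for batch in batched(chunks)]  # [64, 64, 22]
```

Each yielded batch would then go to the embedding call and on to the sqlite-vec upsert; the final batch is simply shorter when the chunk count is not a multiple of 64.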
Retrieval
Hybrid search merges BM25 and semantic results with a 50/50 score blend:
- BM25 queries the in-process index (no round-trip to the database).
- The semantic query embeds the user query via Ollama, fetches the `top_k * 20` nearest vectors, and filters by `doc_id` in Python.
- Hits are merged: BM25 scores and vector scores are combined, and BM25 hits take priority.
- The top `k` results are ranked, then adjacent pages (page ± 1) are fetched to restore context for mid-sentence chunk boundaries.
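
The merge and neighbor-expansion steps might look roughly like the sketch below. Several details are assumptions rather than documented behavior: hits are keyed by a hypothetical `(doc_id, page)` tuple, both score sets are min-max normalized before blending, vector scores are treated as similarities (higher is better), "BM25 hits take priority" is read as a tie-break, and pages are numbered from 1.

```python
from typing import Dict, List, Tuple

Key = Tuple[int, int]  # (doc_id, page); hypothetical hit identifier


def normalize(scores: Dict[Key, float]) -> Dict[Key, float]:
    """Min-max scale scores to [0, 1]; a constant score set maps to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 1.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def blend(bm25: Dict[Key, float], vec: Dict[Key, float], k: int) -> List[Key]:
    """Rank hits by 0.5 * BM25 + 0.5 * vector score; ties favor BM25 hits."""
    b, v = normalize(bm25), normalize(vec)
    combined = {
        key: 0.5 * b.get(key, 0.0) + 0.5 * v.get(key, 0.0)
        for key in set(b) | set(v)
    }
    # Sort by score descending, then BM25 membership, then key for determinism.
    ranked = sorted(combined, key=lambda key: (-combined[key], key not in b, key))
    return ranked[:k]


def with_neighbors(hits: List[Key]) -> List[Key]:
    """Expand each hit to page - 1, page, page + 1, skipping duplicates and page < 1."""
    seen, out = set(), []
    for doc_id, page in hits:
        for p in (page - 1, page, page + 1):
            if p >= 1 and (doc_id, p) not in seen:
                seen.add((doc_id, p))
                out.append((doc_id, p))
    return out


# Made-up scores for illustration only.
bm25_hits = {(1, 3): 7.2, (1, 9): 2.1}
vec_hits = {(1, 3): 0.91, (1, 4): 0.80}
top = blend(bm25_hits, vec_hits, k=2)  # [(1, 3), (1, 9)]
pages = with_neighbors(top)
```

Normalizing first keeps the 50/50 weights meaningful, since raw BM25 scores and vector similarities sit on different scales.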
Storage
| File | Format | Contents |
|---|---|---|
| `pagepiper.db` | SQLite | `documents`, `page_chunks`, `chat_feedback` |
| `pagepiper_vecs.db` | sqlite-vec | `page_vecs` virtual table + `page_vecs_meta` |
The vector database stores one row per page chunk. If the embedding model changes, Pagepiper detects the dimension mismatch at startup by reading the `CREATE VIRTUAL TABLE` DDL from `sqlite_master`. It then deletes the vector database and queues a background re-embed.
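
A dimension check along these lines could be written as follows. This is a sketch under assumptions, not the project's implementation: it assumes the vector column is declared with sqlite-vec's `float[N]` syntax, the column name `embedding` in the example DDL is hypothetical, and the function names are invented for illustration.

```python
import re
import sqlite3
from typing import Optional

# Assumes the column is declared as float[N], as in sqlite-vec vec0 tables.
_DIM_RE = re.compile(r"float\[(\d+)\]", re.IGNORECASE)


def dim_from_ddl(ddl: str) -> Optional[int]:
    """Return the vector dimension declared in a CREATE VIRTUAL TABLE statement."""
    match = _DIM_RE.search(ddl)
    return int(match.group(1)) if match else None


def stored_dim(conn: sqlite3.Connection, table: str = "page_vecs") -> Optional[int]:
    """Look up the table's DDL in sqlite_master and extract its dimension."""
    row = conn.execute(
        "SELECT sql FROM sqlite_master WHERE name = ?", (table,)
    ).fetchone()
    return dim_from_ddl(row[0]) if row and row[0] else None


def needs_reembed(conn: sqlite3.Connection, model_dim: int) -> bool:
    """True when a stored dimension exists and differs from the model's."""
    current = stored_dim(conn)
    return current is not None and current != model_dim


# Example DDL; the column name "embedding" is a placeholder.
ddl = "CREATE VIRTUAL TABLE page_vecs USING vec0(embedding float[768])"
parsed = dim_from_ddl(ddl)
```

Reading the DDL avoids loading any vectors at startup; a fresh database with no `page_vecs` table yields `None`, so no re-embed is triggered in that case.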
Licensing boundary
| Component | License |
|---|---|
| BM25 search, ingest pipeline, library API | MIT |
| Hybrid vector search, RAG chat, embedding | BSL 1.1 (BYOK unlocked on Free tier) |