AI / RAG System

Knowledge Hub - AI-Powered Document Management

100% Local / No Cloud

Hybrid Search Mode

RAG With Citations

A document manager I built for my USC coursework. It does OCR on handwritten notes and PDFs, runs semantic search with pgvector, and answers questions about your docs using a local LLM.

Python
Flask
PostgreSQL
pgvector
Docker
OCR
AI/ML
Vector Search
RAG
Academic Tools
01

System Overview

Knowledge Hub - How It Works Dashboard
02

Project Story

I built Knowledge Hub because I was tired of digging through hundreds of course PDFs and lecture notes during my MS in CS at USC. I wanted one place where I could dump all my documents and actually find what I needed quickly.

The system takes in PDFs and images, runs OCR to extract text (even from handwritten notes), chunks everything up, and stores vector embeddings in PostgreSQL with pgvector. That gives me two ways to search: regular full-text search and semantic search, where I can ask a question in plain English and get back the most relevant passages.

The part I'm most proud of is the Q&A feature. It uses RAG with a local LLM running on Ollama (gemma3:1b) to answer questions about my documents and cite exactly where the answer came from. No API keys, no cloud dependency, everything runs locally.

The whole thing is a Flask API with SQLAlchemy, containerized with Docker so setup is just `docker-compose up`. It's genuinely useful. I still use it to prep for exams and review research papers.
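
To make the "answer with citations" step concrete, here is a minimal sketch of how retrieved passages could be packed into a prompt for the local model. It is an illustration, not the project's actual code: the function name `build_rag_prompt`, the passage fields (`doc`, `page`, `text`), and the instruction wording are all assumptions.

```python
def build_rag_prompt(question, passages):
    """Assemble a citation-aware prompt from retrieved passages.

    `passages` is a list of dicts with hypothetical keys
    'doc', 'page', and 'text' (assumed shape, not the project's schema).
    """
    sources = []
    for i, p in enumerate(passages, start=1):
        # Each source gets a numeric tag the model can cite, e.g. [1].
        sources.append(f"[{i}] {p['doc']}, p. {p['page']}\n{p['text']}")
    context = "\n\n".join(sources)
    return (
        "Answer the question using only the sources below. "
        "Cite sources by their bracketed number. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


passages = [
    {"doc": "ml_notes.pdf", "page": 3, "text": "Gradient descent updates weights."},
    {"doc": "lecture5.pdf", "page": 12, "text": "Learning rate controls step size."},
]
prompt = build_rag_prompt("What does the learning rate do?", passages)
```

The resulting string would then be sent to the local model. Numbering the sources in the prompt is what lets the answer point back to a specific document and page.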

03

Upload & Processing

Knowledge Hub - Document Upload Interface
Knowledge Hub - Document Processing and OCR
04

AI Question Answering

Knowledge Hub - AI-Powered Question Answering Interface
Knowledge Hub - LLM Response with Citations
05

Key Features

Document Management

Upload docs, and the system pulls out metadata and organizes everything automatically

AI-Powered OCR

Extracts text from PDFs and images using OpenCV, PyMuPDF, and Tesseract

Semantic Search

Find related content with vector similarity search, powered by pgvector and Sentence-Transformers
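
For intuition, vector similarity search boils down to comparing a query embedding against stored chunk embeddings. The sketch below does this in plain Python with cosine similarity; in the real system the comparison happens inside Postgres via pgvector, and the names `cosine_similarity` and `top_k` plus the toy 2-D vectors are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=3):
    """Return the k (chunk_id, score) pairs most similar to the query."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 2-D embeddings; real sentence embeddings have hundreds of dimensions.
chunks = {"x": [1.0, 0.0], "y": [0.0, 1.0], "z": [1.0, 1.0]}
best = top_k([1.0, 0.1], chunks, k=2)
```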

Question Answering

Ask questions about your docs and get answers with citations, powered by a local LLM through RAG

06

Technical Stack

implementation.notes
01 Flask API with SQLAlchemy for the backend
02 PostgreSQL + pgvector for vector similarity search
03 Dockerized the whole stack for easy setup
04 OCR pipeline using OpenCV, PyMuPDF, and Tesseract
05 Sentence-Transformers (all-MiniLM-L6-v2) for generating vector embeddings
06 Ollama running gemma3:1b locally for Q&A
07 Hybrid search that combines full-text and semantic results
08 Document chunking at 300-700 tokens with overlap
09 OCR results ranked by confidence score
10 REST API with endpoints for upload, search, and Q&A
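
Chunking with overlap, as listed above, can be sketched as a sliding window. This is a simplified illustration under stated assumptions: tokens are approximated by whitespace-separated words rather than a real tokenizer, and the name `chunk_words` and the default sizes (500 with 50 overlap) are hypothetical picks inside the 300-700 range, not the project's settings.

```python
def chunk_words(text, max_tokens=500, overlap=50):
    """Split text into overlapping windows of whitespace-separated words."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    words = text.split()
    step = max_tokens - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # this window already reached the end of the text
    return chunks


# A 1200-word dummy document: windows start at words 0, 450, and 900.
doc = " ".join(str(i) for i in range(1200))
chunks = chunk_words(doc, max_tokens=500, overlap=50)
```

The overlap means the last 50 words of one chunk reappear at the start of the next, so a sentence that straddles a boundary is still retrievable in full from at least one chunk.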
07

Friction & Takeaways

Friction

  • pgvector queries got slow with larger doc collections, so I had to tune indexing and query parameters
  • OCR quality varied a lot depending on scan quality and handwriting, so I added confidence scoring to filter out bad results
  • Combining full-text and semantic search results into a single ranked list took a lot of trial and error
  • Running embeddings and an LLM locally eats RAM, so I had to be careful about batch sizes and model selection
  • Figuring out the right chunk size and overlap for different document types without losing context
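
The confidence filtering mentioned in the list above might look roughly like this. The sketch assumes word-level OCR output shaped like a dict with parallel `text` and `conf` lists (the shape Tesseract wrappers commonly return, with `-1` marking non-word boxes); the function name `filter_by_confidence` and the threshold of 60 are assumptions, not the project's values.

```python
def filter_by_confidence(ocr_data, min_conf=60.0):
    """Keep only words whose OCR confidence meets the threshold.

    Returns the filtered text and the mean confidence of the kept words.
    """
    kept = []
    for word, conf in zip(ocr_data["text"], ocr_data["conf"]):
        score = float(conf)  # confidences may arrive as strings
        if word.strip() and score >= min_conf:
            kept.append((word, score))
    text = " ".join(word for word, _ in kept)
    mean_conf = sum(score for _, score in kept) / len(kept) if kept else 0.0
    return text, mean_conf


# Hand-written sample standing in for real OCR output.
sample = {
    "text": ["Gradient", "", "desc3nt", "works"],
    "conf": ["96", "-1", "41", "88.5"],
}
clean_text, avg_conf = filter_by_confidence(sample)
```

The mean confidence of the surviving words could then serve as a per-page score for ranking or for flagging scans that need a second look.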

Takeaways

  • pgvector is powerful but you need to think about indexing strategy early
  • Wiring up ML models into a web app has more plumbing than you'd expect
  • OCR is never as clean as you want it to be, but confidence-based filtering helps a lot
  • Chunk size matters more than the embedding model for search quality
  • RAG is only as good as your retrieval step: garbage in, garbage out
  • Docker Compose makes multi-service setups way less painful
Technical Deep Dive

Architecture, OCR Pipeline & Hybrid Ranking

Every moving part explained: system architecture diagram, ingestion pipeline, full-text search in Postgres, pgvector semantic search with IVFFlat, z-score hybrid ranking, and the RAG prompt structure — with interactive SVG diagrams.
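
As a taste of the z-score hybrid ranking idea, the sketch below normalizes full-text and semantic scores to z-scores so they share a scale, then blends them. It is one plausible reading of the technique, not the deep dive's implementation: the function names, the equal 0.5 weighting, and the choice to give a chunk missing from one list that list's lowest z-score are all assumptions.

```python
from statistics import mean, pstdev


def zscores(scores):
    """Map {chunk_id: raw_score} to {chunk_id: z-score}."""
    if not scores:
        return {}
    values = list(scores.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return {cid: 0.0 for cid in scores}  # all scores identical
    return {cid: (value - mu) / sigma for cid, value in scores.items()}


def hybrid_rank(fulltext, semantic, weight=0.5):
    """Blend two score sets after z-score normalization, best first."""
    ft_z, sem_z = zscores(fulltext), zscores(semantic)
    # Assumption: a chunk absent from one list gets that list's lowest z-score.
    ft_floor = min(ft_z.values(), default=0.0)
    sem_floor = min(sem_z.values(), default=0.0)
    combined = {
        cid: weight * ft_z.get(cid, ft_floor) + (1 - weight) * sem_z.get(cid, sem_floor)
        for cid in set(ft_z) | set(sem_z)
    }
    return sorted(combined.items(), key=lambda pair: pair[1], reverse=True)


# Raw scores on very different scales (text rank vs. cosine similarity).
fulltext = {"a": 3.0, "b": 1.0, "c": 2.0}
semantic = {"a": 0.9, "b": 0.5, "d": 0.7}
ranked = hybrid_rank(fulltext, semantic)
```

Normalizing first matters because full-text rank scores and cosine similarities live on unrelated scales, so adding them directly would let one signal dominate.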
