
Knowledge Hub - AI-Powered Document Management System

A local-first document management system built with Flask, SQLAlchemy, and Postgres+pgvector. Supports OCR, semantic search, and LLM-powered Q&A for academic research.

Python
Flask
PostgreSQL
pgvector
Docker
OCR
AI/ML
Vector Search
RAG
Academic Tools

How It Works

[Screenshot: Knowledge Hub "How It Works" dashboard]

Project Overview

Knowledge Hub is a document management system I built to handle my MS in CS coursework at USC. The goal was simple: make it easy to search through course materials, research papers, and notes without digging through folders. It runs on Flask and PostgreSQL with the pgvector extension, so it supports both regular full-text search and vector-based semantic search. I added OCR processing for PDFs and images, document chunking, and hooked it up to a local LLM (Ollama) for question answering over uploaded docs. The whole thing is containerized with Docker. It has made finding relevant information across my coursework way faster than manual searching.

Document Upload & Processing

[Screenshots: document upload interface; document processing and OCR view]

Key Features

Document Management

Upload, store, and organize documents with automatic metadata extraction and categorization

AI-Powered OCR

Automatic text extraction from PDFs and images using OpenCV, PyMuPDF, and Tesseract

Semantic Search

Vector-based similarity search using pgvector and Sentence-Transformers for intelligent content discovery

Question Answering

RAG-powered Q&A system with local LLM integration for contextual answers with citations
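Before embedding, OCR text is split into overlapping chunks (the implementation notes below mention 300-700 token windows). A minimal sketch of that overlap logic, assuming the text has already been tokenized into a list; the window and overlap sizes here are illustrative:

```python
# Overlapping-window chunking sketch. Sizes are illustrative; the project
# targets 300-700 token chunks.

def chunk_tokens(tokens, max_len=500, overlap=50):
    """Split a token list into windows of max_len that share `overlap` tokens."""
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window already covers the tail
    return chunks
```

The overlap means a sentence that straddles a chunk boundary stays retrievable from both neighboring chunks instead of being cut in half.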

AI-Powered Question Answering

[Screenshots: AI-powered question answering interface; LLM response with citations]

Technical Implementation

  • Flask REST API with SQLAlchemy ORM for backend services
  • PostgreSQL with pgvector extension for vector similarity search
  • Docker containerization for easy deployment and scaling
  • OCR processing with OpenCV, PyMuPDF, and Tesseract
  • Vector embeddings using Sentence-Transformers (all-MiniLM-L6-v2)
  • Local LLM integration with Ollama (gemma3:1b)
  • Hybrid search combining full-text and semantic search
  • Intelligent document chunking (300-700 tokens, with overlap between adjacent chunks)
  • Confidence-aware ranking for OCR results
  • RESTful API for document upload, search, and management
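The pgvector query side can be sketched as below. This assumes a `chunks` table with an `embedding vector(384)` column (all-MiniLM-L6-v2 produces 384-dimensional embeddings); `<=>` is pgvector's cosine-distance operator, and the formatting helper is a hypothetical stand-in for however the app binds vector parameters:

```python
# Semantic-search query sketch against a hypothetical `chunks` table with an
# `embedding vector(384)` column. `<=>` is pgvector's cosine-distance operator.

def to_pgvector_literal(embedding):
    """Format a list of floats as a pgvector input literal, e.g. '[0.5,1.0]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"

SEMANTIC_SEARCH_SQL = """
SELECT id, document_id, content,
       1 - (embedding <=> %(query_vec)s::vector) AS similarity
FROM chunks
ORDER BY embedding <=> %(query_vec)s::vector
LIMIT %(limit)s;
"""
```

At query time the search string would be embedded with the same Sentence-Transformers model used at index time and passed in as `query_vec`, so query and chunk vectors live in the same space.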

Challenges Faced

  • Implementing efficient vector search with large document collections
  • Optimizing OCR processing for various document types and quality
  • Designing hybrid search algorithms for optimal relevance ranking
  • Managing memory requirements for vector embeddings and LLM inference
  • Creating intuitive API design for complex document operations
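One common way to merge a full-text ranking with a semantic ranking is reciprocal rank fusion (RRF). The page doesn't specify which fusion strategy the project settled on, so this is an illustrative sketch, not the project's actual algorithm:

```python
# Reciprocal rank fusion (RRF) sketch: merges several ranked lists of ids by
# scoring each id as the sum of 1/(k + rank) over the lists it appears in.
# k=60 is the conventional smoothing constant. Illustrative only -- not
# necessarily the fusion the project uses.

def reciprocal_rank_fusion(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF only needs rank positions, not raw scores, which sidesteps the problem that full-text relevance scores and cosine similarities live on incomparable scales.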

Key Learnings

  • Advanced database design with vector extensions
  • AI/ML integration in production web applications
  • Document processing and OCR optimization techniques
  • Vector database operations and similarity search algorithms
  • RAG (Retrieval-Augmented Generation) system implementation
  • Container orchestration and deployment strategies
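The RAG answer step can be sketched against Ollama's standard `/api/generate` endpoint with the `gemma3:1b` model mentioned above. The prompt wording and chunk-numbering scheme here are illustrative assumptions, not the project's exact prompt:

```python
import json
import urllib.request

# RAG answer-step sketch. /api/generate and the model/prompt/stream fields are
# Ollama's standard generate API; the prompt text below is illustrative.

def build_prompt(question, chunks):
    """Number the retrieved chunks so the model can cite them as [1], [2], ..."""
    sources = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, start=1))
    return (
        "Answer the question using only the sources below, and cite the "
        "source number for each claim.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer:"
    )

def ask(question, chunks, model="gemma3:1b", base="http://localhost:11434"):
    """Send the grounded prompt to a local Ollama server and return its answer."""
    payload = json.dumps(
        {"model": model, "prompt": build_prompt(question, chunks), "stream": False}
    ).encode()
    req = urllib.request.Request(
        f"{base}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Numbering the chunks in the prompt is what makes citation-style answers possible: the model refers back to `[1]`, `[2]`, and the app can map those back to source documents.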

Technical Documentation

Knowledge Hub Technical Deep Dive

Technical documentation covering architecture, implementation details, and design decisions.

PDF • 827 KB
Download PDF
View Source Code