
Knowledge Hub - AI-Powered Document Management System

A local-first document management system built with Flask, SQLAlchemy, and Postgres+pgvector. Supports OCR, semantic search, and LLM-powered Q&A for academic research.

Python
Flask
PostgreSQL
pgvector
Docker
OCR
AI/ML
Vector Search
RAG
Academic Tools

How It Works

[Screenshot: Knowledge Hub "How It Works" dashboard]

Project Overview

Knowledge Hub is a document management system I built to handle my MS in CS coursework at USC. The goal was simple: make it easy to search through course materials, research papers, and notes without digging through folders. It runs on Flask and PostgreSQL with the pgvector extension, so it supports both regular full-text search and vector-based semantic search. I added OCR processing for PDFs and images, document chunking, and hooked it up to a local LLM (Ollama) for question answering over uploaded docs. The whole thing is containerized with Docker. It has made finding relevant information across my coursework way faster than manual searching.

Document Upload & Processing

[Screenshots: document upload interface; document processing and OCR view]

Key Features

Document Management

Upload, store, and organize documents with automatic metadata extraction and categorization

AI-Powered OCR

Automatic text extraction from PDFs and images using OpenCV, PyMuPDF, and Tesseract

Semantic Search

Vector-based similarity search using pgvector and Sentence-Transformers for intelligent content discovery

Question Answering

RAG-powered Q&A system with local LLM integration for contextual answers with citations
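Before embedding, OCR text is split into overlapping chunks (the implementation notes below mention 300-700 token windows). A minimal sketch of that overlap logic, assuming the text has already been tokenized into a list; the window and overlap sizes here are illustrative:

```python
# Overlapping-window chunking sketch. Sizes are illustrative; the project
# targets 300-700 token chunks.

def chunk_tokens(tokens, max_len=500, overlap=50):
    """Split a token list into windows of max_len that share `overlap` tokens."""
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window already covers the tail
    return chunks
```

The overlap means a sentence that straddles a chunk boundary stays retrievable from both neighboring chunks instead of being cut in half.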

AI-Powered Question Answering

[Screenshots: AI-powered question answering interface; LLM response with citations]

Technical Implementation

  • Flask REST API with SQLAlchemy ORM for backend services
  • PostgreSQL with pgvector extension for vector similarity search
  • Docker containerization for easy deployment and scaling
  • OCR processing with OpenCV, PyMuPDF, and Tesseract
  • Vector embeddings using Sentence-Transformers (all-MiniLM-L6-v2)
  • Local LLM integration with Ollama (gemma3:1b)
  • Hybrid search combining full-text and semantic search
  • Intelligent document chunking (300-700 tokens, with overlap between adjacent chunks)
  • Confidence-aware ranking for OCR results
  • RESTful API for document upload, search, and management
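The pgvector query side can be sketched as below. This assumes a `chunks` table with an `embedding vector(384)` column (all-MiniLM-L6-v2 produces 384-dimensional embeddings); `<=>` is pgvector's cosine-distance operator, and the formatting helper is a hypothetical stand-in for however the app binds vector parameters:

```python
# Semantic-search query sketch against a hypothetical `chunks` table with an
# `embedding vector(384)` column. `<=>` is pgvector's cosine-distance operator.

def to_pgvector_literal(embedding):
    """Format a list of floats as a pgvector input literal, e.g. '[0.5,1.0]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"

SEMANTIC_SEARCH_SQL = """
SELECT id, document_id, content,
       1 - (embedding <=> %(query_vec)s::vector) AS similarity
FROM chunks
ORDER BY embedding <=> %(query_vec)s::vector
LIMIT %(limit)s;
"""
```

At query time the search string would be embedded with the same Sentence-Transformers model used at index time and passed in as `query_vec`, so query and chunk vectors live in the same space.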

Challenges Faced

  • Implementing efficient vector search with large document collections
  • Optimizing OCR processing for various document types and quality
  • Designing hybrid search algorithms for optimal relevance ranking
  • Managing memory requirements for vector embeddings and LLM inference
  • Creating intuitive API design for complex document operations
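One common way to merge a full-text ranking with a semantic ranking is reciprocal rank fusion (RRF). The page doesn't specify which fusion strategy the project settled on, so this is an illustrative sketch, not the project's actual algorithm:

```python
# Reciprocal rank fusion (RRF) sketch: merges several ranked lists of ids by
# scoring each id as the sum of 1/(k + rank) over the lists it appears in.
# k=60 is the conventional smoothing constant. Illustrative only -- not
# necessarily the fusion the project uses.

def reciprocal_rank_fusion(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF only needs rank positions, not raw scores, which sidesteps the problem that full-text relevance scores and cosine similarities live on incomparable scales.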

Key Learnings

  • Advanced database design with vector extensions
  • AI/ML integration in production web applications
  • Document processing and OCR optimization techniques
  • Vector database operations and similarity search algorithms
  • RAG (Retrieval-Augmented Generation) system implementation
  • Container orchestration and deployment strategies
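The RAG answer step can be sketched against Ollama's standard `/api/generate` endpoint with the `gemma3:1b` model mentioned above. The prompt wording and chunk-numbering scheme here are illustrative assumptions, not the project's exact prompt:

```python
import json
import urllib.request

# RAG answer-step sketch. /api/generate and the model/prompt/stream fields are
# Ollama's standard generate API; the prompt text below is illustrative.

def build_prompt(question, chunks):
    """Number the retrieved chunks so the model can cite them as [1], [2], ..."""
    sources = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, start=1))
    return (
        "Answer the question using only the sources below, and cite the "
        "source number for each claim.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer:"
    )

def ask(question, chunks, model="gemma3:1b", base="http://localhost:11434"):
    """Send the grounded prompt to a local Ollama server and return its answer."""
    payload = json.dumps(
        {"model": model, "prompt": build_prompt(question, chunks), "stream": False}
    ).encode()
    req = urllib.request.Request(
        f"{base}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Numbering the chunks in the prompt is what makes citation-style answers possible: the model refers back to `[1]`, `[2]`, and the app can map those back to source documents.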

Technical Documentation

Knowledge Hub Technical Deep Dive

Technical documentation covering architecture, implementation details, and design decisions.

PDF • 827 KB
Download PDF
View Source Code