AI / News Aggregator

Tech-updates (Personal Tech News Aggregator)

I wanted my own tech news feed, so I built one. It scrapes articles from Medium, YC, and Crunchbase, then uses Azure OpenAI to sort everything into categories.

React
Vite
Python
Flask
Azure OpenAI
Qdrant (vector DB)
PostgreSQL
Web Scraping
AI
01

Preview

Tech-updates (Personal Tech News Aggregator)
02

Overview

I got tired of checking five different sites every morning for tech news, so I built a thing that does it for me. Tech-updates scrapes articles from Medium, Y Combinator's Hacker News, and Crunchbase on a schedule, then pipes them through Azure OpenAI to auto-categorize everything (AI/ML, startups, web dev, etc.). The backend is Flask + PostgreSQL for the core API and data storage.

The cool part is the Qdrant vector database: every article gets embedded and stored as a vector, so I can do similarity search. If you're reading about LLMs, it'll surface related articles you might've missed. It's not just keyword matching; it actually understands the content.

The frontend is React + Vite. Nothing groundbreaking there, but it's fast and the UI updates as new articles come in. I built this mostly because I wanted to work with vector databases and see how well GPT-based models handle content categorization at scale. Turns out, pretty well.

03

Key Features

AI Categorization

Articles go through Azure OpenAI, which tags them by topic so I don't have to read everything myself
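Roughly, the categorization step boils down to a prompt that demands structured JSON, plus defensive parsing of whatever comes back. This is an illustrative sketch: the category list, prompt wording, and `parse_response` fallback here are assumptions, not the production prompt.

```python
import json

# Hypothetical category set; the real label list is an assumption.
CATEGORIES = ["AI/ML", "Startups", "Web Dev", "Security", "Other"]

def build_prompt(title, snippet):
    """Build a categorization prompt that asks the model for structured JSON."""
    return (
        "Classify the article into exactly one of these categories: "
        + ", ".join(CATEGORIES)
        + '. Respond with JSON like {"category": "...", "summary": "..."}.\n'
        + f"Title: {title}\nSnippet: {snippet}"
    )

def parse_response(raw):
    """Parse the model's JSON reply, falling back to 'Other' on bad output."""
    try:
        data = json.loads(raw)
        if data.get("category") in CATEGORIES:
            return data["category"], data.get("summary", "")
    except (json.JSONDecodeError, TypeError):
        pass
    return "Other", ""
```

Forcing a JSON shape and validating the category against a fixed list is what keeps model output usable downstream; free-form answers drift too much.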

Vector Database

Every article gets embedded in Qdrant, so 'find me more like this' actually works
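Under the hood, 'find me more like this' is nearest-neighbor search over embedding vectors. A stripped-down, in-memory sketch of the math Qdrant is doing (Qdrant itself adds indexing, persistence, and scale):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def most_similar(query_vec, store, top_k=3):
    """store: list of (article_id, embedding) pairs.
    Returns article ids ranked by similarity to the query vector."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [article_id for article_id, _ in ranked[:top_k]]
```

Real embeddings have hundreds or thousands of dimensions, which is exactly why a proper vector DB beats a brute-force loop like this one.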

Multi-Source Scraping

Scrapers run on a schedule, pulling articles from Medium, YC, and Crunchbase automatically
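The Medium scraper is the easy one because RSS is structured XML. A minimal sketch of that path using only the standard library (the sample feed here is made up; the real scraper fetches live feeds):

```python
import xml.etree.ElementTree as ET

# Illustrative feed; the real scraper fetches Medium's RSS over HTTP.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item>
      <title>First Post</title>
      <link>https://example.com/first</link>
      <pubDate>Mon, 01 Jan 2024 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second Post</title>
      <link>https://example.com/second</link>
      <pubDate>Tue, 02 Jan 2024 00:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""

def parse_rss(xml_text):
    """Extract title/link/date from each <item> in an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    return [
        {
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "published": item.findtext("pubDate"),
        }
        for item in root.iter("item")
    ]
```

The YC and Crunchbase scrapers don't get off this easy; they parse HTML, which is the part that breaks on redesigns.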

Real-time Updates

Flask API + Postgres on the backend. New articles show up in the feed as they're scraped

04

Technical Stack

implementation.notes
01 React + Vite frontend with category filtering and search
02 Flask REST API handling scraping triggers, CRUD, and AI calls
03 PostgreSQL for article storage, user prefs, and scraping metadata
04 Azure OpenAI (GPT-4) for categorizing and summarizing articles
05 Qdrant vector DB storing article embeddings for similarity search
06 Custom scrapers for Medium (RSS parsing), YC (HTML scraping), and Crunchbase
07 Cron-based scraping schedule so the feed stays fresh
08 Mobile-friendly layout built with Tailwind
09 Filter by category, source, or date range
10 Similar article recommendations powered by cosine similarity on embeddings
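The category/source/date filtering is just predicate checks over the article rows. A sketch of the idea (field names are illustrative, not the actual schema):

```python
from datetime import date

def filter_articles(articles, category=None, source=None, start=None, end=None):
    """Filter article dicts by category, source, and published date range.
    Each filter is optional; None means 'don't filter on this'."""
    out = []
    for art in articles:
        if category and art["category"] != category:
            continue
        if source and art["source"] != source:
            continue
        if start and art["published"] < start:
            continue
        if end and art["published"] > end:
            continue
        out.append(art)
    return out
```

In production this is a SQL WHERE clause rather than a Python loop, but the logic is the same.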
05

Friction & Takeaways

Friction

  • Every site structures its HTML differently, so each scraper needed custom parsing logic that breaks when they redesign
  • Azure OpenAI rate limits hit hard when you're categorizing hundreds of articles at once. Had to add batching and retry logic
  • Qdrant was new to me, and figuring out the right embedding dimensions and distance metrics took real experimentation
  • Deduplication is tricky when the same story gets covered by multiple sources with different titles
  • Keeping the scraping schedule reliable without hammering the source sites or getting IP-blocked
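For the dedup problem, the first line of defense is normalizing titles before comparing them. A sketch of that pass (it only catches near-identical titles; genuinely different headlines for the same story need embedding similarity instead):

```python
import re

def normalize_title(title):
    """Lowercase, strip punctuation, collapse whitespace."""
    title = re.sub(r"[^\w\s]", "", title.lower())
    return " ".join(title.split())

def dedupe(articles):
    """Keep the first article per normalized title.
    articles: list of dicts with a 'title' key."""
    seen = set()
    unique = []
    for art in articles:
        key = normalize_title(art["title"])
        if key not in seen:
            seen.add(key)
            unique.append(art)
    return unique
```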

Takeaways

  • LLMs are surprisingly good at categorization if you write clear prompts and give them structured output formats
  • Vector databases aren't just hype. Similarity search on embeddings works way better than keyword search for articles
  • Web scraping is fragile by nature. You need good error handling and alerts for when a scraper silently breaks
  • Rate limiting and backoff aren't optional when you're calling external APIs in a loop
  • Building something I actually use every day kept me motivated in ways a tutorial project never would
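The retry-with-backoff pattern that the rate-limit lesson boils down to fits in a few lines. A generic sketch (the production version wraps the Azure OpenAI calls; the injectable `sleep` here is just to make it testable):

```python
import time

def with_backoff(call, max_retries=5, base_delay=1.0,
                 retryable=(Exception,), sleep=time.sleep):
    """Call `call()` with exponential backoff on retryable errors.
    Delays double each attempt: base, 2*base, 4*base, ..."""
    for attempt in range(max_retries):
        try:
            return call()
        except retryable:
            if attempt == max_retries - 1:
                raise  # out of retries; let the caller see the error
            sleep(base_delay * (2 ** attempt))
```

Pair this with batching (categorize N articles per request instead of one) and the rate limits stop being a daily fire.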