AI / News Aggregator

Tech-updates (Personal Tech News Aggregator)

I wanted my own tech news feed, so I built one. It scrapes articles from Medium, YC, and Crunchbase, then uses Azure OpenAI to sort everything into categories.

React
Vite
Python
Flask
Azure OpenAI
Qdrant (vector DB)
PostgreSQL
Web Scraping
AI
01

Preview

Tech-updates (Personal Tech News Aggregator)
02

Overview

I got tired of checking five different sites every morning for tech news, so I built a thing that does it for me. Tech-updates scrapes articles from Medium, Y Combinator's Hacker News, and Crunchbase on a schedule, then pipes them through Azure OpenAI to auto-categorize everything (AI/ML, startups, web dev, etc.). The backend is Flask + PostgreSQL for the core API and data storage.

The cool part is the Qdrant vector database: every article gets embedded and stored as a vector, so I can do similarity search. If you're reading about LLMs, it'll surface related articles you might've missed. It's not just keyword matching; it actually understands the content.

The frontend is React + Vite. Nothing groundbreaking there, but it's fast and the UI updates as new articles come in. I built this mostly because I wanted to work with vector databases and see how well GPT-based models handle content categorization at scale. Turns out, pretty well.

03

Key Features

AI Categorization

Articles go through Azure OpenAI, which tags them by topic so I don't have to read everything myself
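Roughly, the categorization step boils down to a prompt that demands structured JSON, plus defensive parsing of whatever comes back. This is an illustrative sketch: the category list, prompt wording, and `parse_response` fallback here are assumptions, not the production prompt.

```python
import json

# Hypothetical category set; the real label list is an assumption.
CATEGORIES = ["AI/ML", "Startups", "Web Dev", "Security", "Other"]

def build_prompt(title, snippet):
    """Build a categorization prompt that asks the model for structured JSON."""
    return (
        "Classify the article into exactly one of these categories: "
        + ", ".join(CATEGORIES)
        + '. Respond with JSON like {"category": "...", "summary": "..."}.\n'
        + f"Title: {title}\nSnippet: {snippet}"
    )

def parse_response(raw):
    """Parse the model's JSON reply, falling back to 'Other' on bad output."""
    try:
        data = json.loads(raw)
        if data.get("category") in CATEGORIES:
            return data["category"], data.get("summary", "")
    except (json.JSONDecodeError, TypeError):
        pass
    return "Other", ""
```

Forcing a JSON shape and validating the category against a fixed list is what keeps model output usable downstream; free-form answers drift too much.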

Vector Database

Every article gets embedded in Qdrant, so 'find me more like this' actually works
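Under the hood, 'find me more like this' is nearest-neighbor search over embedding vectors. A stripped-down, in-memory sketch of the math Qdrant is doing (Qdrant itself adds indexing, persistence, and scale):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def most_similar(query_vec, store, top_k=3):
    """store: list of (article_id, embedding) pairs.
    Returns article ids ranked by similarity to the query vector."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [article_id for article_id, _ in ranked[:top_k]]
```

Real embeddings have hundreds or thousands of dimensions, which is exactly why a proper vector DB beats a brute-force loop like this one.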

Multi-Source Scraping

Scrapers run on a schedule, pulling articles from Medium, YC, and Crunchbase automatically
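The Medium scraper is the easy one because RSS is structured XML. A minimal sketch of that path using only the standard library (the sample feed here is made up; the real scraper fetches live feeds):

```python
import xml.etree.ElementTree as ET

# Illustrative feed; the real scraper fetches Medium's RSS over HTTP.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item>
      <title>First Post</title>
      <link>https://example.com/first</link>
      <pubDate>Mon, 01 Jan 2024 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second Post</title>
      <link>https://example.com/second</link>
      <pubDate>Tue, 02 Jan 2024 00:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""

def parse_rss(xml_text):
    """Extract title/link/date from each <item> in an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    return [
        {
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "published": item.findtext("pubDate"),
        }
        for item in root.iter("item")
    ]
```

The YC and Crunchbase scrapers don't get off this easy; they parse HTML, which is the part that breaks on redesigns.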

Real-time Updates

Flask API + Postgres on the backend. New articles show up in the feed as they're scraped

04

Technical Stack

implementation.notes
01 React + Vite frontend with category filtering and search
02 Flask REST API handling scraping triggers, CRUD, and AI calls
03 PostgreSQL for article storage, user prefs, and scraping metadata
04 Azure OpenAI (GPT-4) for categorizing and summarizing articles
05 Qdrant vector DB storing article embeddings for similarity search
06 Custom scrapers for Medium (RSS parsing), YC (HTML scraping), and Crunchbase
07 Cron-based scraping schedule so the feed stays fresh
08 Mobile-friendly layout built with Tailwind
09 Filter by category, source, or date range
10 Similar article recommendations powered by cosine similarity on embeddings
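The category/source/date filtering is just predicate checks over the article rows. A sketch of the idea (field names are illustrative, not the actual schema):

```python
from datetime import date

def filter_articles(articles, category=None, source=None, start=None, end=None):
    """Filter article dicts by category, source, and published date range.
    Each filter is optional; None means 'don't filter on this'."""
    out = []
    for art in articles:
        if category and art["category"] != category:
            continue
        if source and art["source"] != source:
            continue
        if start and art["published"] < start:
            continue
        if end and art["published"] > end:
            continue
        out.append(art)
    return out
```

In production this is a SQL WHERE clause rather than a Python loop, but the logic is the same.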
05

Friction & Takeaways

Friction

  • Every site structures its HTML differently, so each scraper needed custom parsing logic that breaks when they redesign
  • Azure OpenAI rate limits hit hard when you're categorizing hundreds of articles at once. Had to add batching and retry logic
  • Qdrant was new to me, and figuring out the right embedding dimensions and distance metrics took real experimentation
  • Deduplication is tricky when the same story gets covered by multiple sources with different titles
  • Keeping the scraping schedule reliable without hammering the source sites or getting IP-blocked
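For the dedup problem, the first line of defense is normalizing titles before comparing them. A sketch of that pass (it only catches near-identical titles; genuinely different headlines for the same story need embedding similarity instead):

```python
import re

def normalize_title(title):
    """Lowercase, strip punctuation, collapse whitespace."""
    title = re.sub(r"[^\w\s]", "", title.lower())
    return " ".join(title.split())

def dedupe(articles):
    """Keep the first article per normalized title.
    articles: list of dicts with a 'title' key."""
    seen = set()
    unique = []
    for art in articles:
        key = normalize_title(art["title"])
        if key not in seen:
            seen.add(key)
            unique.append(art)
    return unique
```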

Takeaways

  • LLMs are surprisingly good at categorization if you write clear prompts and give them structured output formats
  • Vector databases aren't just hype. Similarity search on embeddings works way better than keyword search for articles
  • Web scraping is fragile by nature. You need good error handling and alerts for when a scraper silently breaks
  • Rate limiting and backoff aren't optional when you're calling external APIs in a loop
  • Building something I actually use every day kept me motivated in ways a tutorial project never would
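The retry-with-backoff pattern that the rate-limit lesson boils down to fits in a few lines. A generic sketch (the production version wraps the Azure OpenAI calls; the injectable `sleep` here is just to make it testable):

```python
import time

def with_backoff(call, max_retries=5, base_delay=1.0,
                 retryable=(Exception,), sleep=time.sleep):
    """Call `call()` with exponential backoff on retryable errors.
    Delays double each attempt: base, 2*base, 4*base, ..."""
    for attempt in range(max_retries):
        try:
            return call()
        except retryable:
            if attempt == max_retries - 1:
                raise  # out of retries; let the caller see the error
            sleep(base_delay * (2 ** attempt))
```

Pair this with batching (categorize N articles per request instead of one) and the rate limits stop being a daily fire.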