Generates captions for images using CNN feature extraction and Transformer-based text generation. Built with TensorFlow and Streamlit.

An image captioning system that extracts image features with a VGG-16 CNN and generates captions with both LSTM and Transformer decoders. The Transformer reached a BLEU score of 0.80 versus 0.65 for the LSTM, illustrating how much attention mechanisms improve caption quality. The frontend is a Streamlit app: upload an image and get a caption back almost instantly. The main challenges were optimizing inference speed, handling varied image types, and keeping the model small enough for real-time use.
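The feature-extraction step can be sketched as follows. This is a minimal illustration, not the project's actual code: it loads VGG-16 without its classification head and pools the convolutional output into one feature vector per image (here `weights=None` to avoid a download; a real pipeline would use `weights="imagenet"`).

```python
import numpy as np
import tensorflow as tf

# Minimal sketch: VGG-16 as a frozen feature extractor for captioning.
# weights=None keeps the example self-contained; the real project would
# load pretrained ImageNet weights.
base = tf.keras.applications.VGG16(
    include_top=False, weights=None, pooling="avg", input_shape=(224, 224, 3)
)
base.trainable = False

# One dummy 224x224 RGB image, preprocessed the way VGG-16 expects.
img = np.random.randint(0, 256, (1, 224, 224, 3)).astype("float32")
features = base(tf.keras.applications.vgg16.preprocess_input(img))
print(features.shape)  # (1, 512): one 512-d feature vector per image
```

The resulting 512-dimensional vector is what the LSTM or Transformer decoder would condition on when generating each caption token.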
Implemented VGG-16 CNN feature extraction with LSTM and Transformer decoders for caption generation
Achieved BLEU scores of 0.65 (LSTM) and 0.80 (Transformer) for caption quality
Built with Streamlit for easy image upload and instant caption generation
Optimized inference for real-time caption generation
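The BLEU scores above could be reproduced with a standard toolkit such as NLTK; the snippet below is a hedged sketch using illustrative captions, not the project's dataset or evaluation script.

```python
from nltk.translate.bleu_score import SmoothingFunction, corpus_bleu

# Illustrative only: one reference caption and one generated caption,
# both tokenized into word lists.
references = [[["a", "dog", "runs", "on", "the", "beach"]]]
candidates = [["a", "dog", "runs", "on", "the", "beach"]]

# Smoothing avoids zero scores on short captions with missing n-grams.
smooth = SmoothingFunction().method1
score = corpus_bleu(references, candidates, smoothing_function=smooth)
print(round(score, 2))  # 1.0 for an exact match
```

Averaging `corpus_bleu` over a held-out caption set is one common way to obtain summary numbers like the 0.65 (LSTM) and 0.80 (Transformer) reported here.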