0.80 BLEU (Transformer) · 0.65 BLEU (LSTM baseline) · +23% improvement
Trained a CNN + Transformer pipeline that generates captions for images. The Transformer model scored 0.80 BLEU. Built a Streamlit UI so you can try it yourself.

I wanted to build something that could look at an image and describe what's in it. The idea was straightforward, but getting it to actually work well took some effort. I used VGG-16 as the image encoder to extract feature vectors, then fed those into two different decoders: an LSTM and a Transformer. The LSTM got a BLEU score of 0.65, which was decent, but the Transformer with attention hit 0.80 and generated noticeably better captions. The attention mechanism really helped the model focus on the right parts of the image.

I wrapped the whole thing in a Streamlit app so you can upload any image and get a caption back in a few seconds. It's a simple UI, but it makes the model feel real instead of just numbers in a notebook.

The hardest part was honestly the training pipeline. VGG-16 is memory-hungry, the Transformer needed careful hyperparameter tuning, and I spent more time on data preprocessing than I'd like to admit.
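For the curious, the encoder/decoder split looks roughly like the sketch below. It's a minimal PyTorch version, assuming a recent torchvision with pretrained VGG-16 weights; the layer sizes, sequence length, and other hyperparameters are illustrative rather than the exact values I trained with.

```python
import torch
import torch.nn as nn
from torchvision import models

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=3, max_len=40):
        super().__init__()
        # Frozen VGG-16 backbone used purely as a feature extractor
        vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
        self.encoder = vgg.features
        for p in self.encoder.parameters():
            p.requires_grad = False
        self.proj = nn.Linear(512, d_model)  # map VGG channels to decoder width

        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))  # learned positions
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, 224, 224) -> (B, 512, 7, 7) feature grid -> (B, 49, 512)
        feats = self.encoder(images).flatten(2).transpose(1, 2)
        memory = self.proj(feats)

        # Token embeddings plus positions, decoded with a causal mask so each
        # word attends only to earlier words and to the image features
        seq_len = captions.size(1)
        tgt = self.embed(captions) + self.pos[:, :seq_len]
        mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=captions.device),
            diagonal=1,
        )
        hidden = self.decoder(tgt, memory, tgt_mask=mask)
        return self.out(hidden)  # (B, seq_len, vocab_size) logits
```

The cross-attention over the 7×7 VGG feature grid is what lets the decoder look at specific image regions while emitting each word, which is exactly where the LSTM-only version fell short.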
VGG-16 for image feature extraction, then LSTM and Transformer decoders for generating captions
LSTM hit 0.65 BLEU, Transformer got 0.80. The attention mechanism made a real difference
Streamlit app where you upload an image and get a caption back in seconds (sketched below)
Tuned the pipeline so inference runs fast enough to feel instant
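The Streamlit side is genuinely small. Below is a stripped-down sketch of the plumbing; the model path and the generate_caption() helper are placeholders standing in for the real loading and decoding code.

```python
import streamlit as st
import torch
from PIL import Image


@st.cache_resource  # load weights once per session, not on every rerun
def load_model():
    model = torch.load("caption_model.pt", map_location="cpu")  # path is illustrative
    model.eval()
    return model


def generate_caption(model, image: Image.Image) -> str:
    # Hypothetical stand-in: preprocess the image, run the Transformer decoder
    # greedily from the start token, and detokenize the result into a sentence.
    raise NotImplementedError


st.title("Image Captioning Demo")
uploaded = st.file_uploader("Upload an image", type=["jpg", "jpeg", "png"])

if uploaded is not None:
    image = Image.open(uploaded).convert("RGB")
    st.image(image, caption="Your upload")
    with st.spinner("Generating caption..."):
        caption = generate_caption(load_model(), image)
    st.success(caption)
```

Caching the model with st.cache_resource means the weights load once per session instead of on every interaction, which helps keep the upload-to-caption round trip short.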
Friction
Takeaways