0.80 BLEU (Transformer) · 0.65 BLEU (LSTM baseline) · +23% improvement
Trained a CNN + Transformer pipeline that generates captions for images. The Transformer model scored 0.80 BLEU. Built a Streamlit UI so you can try it yourself.

I wanted to build something that could look at an image and describe what's in it. The idea was straightforward, but getting it to actually work well took some effort. I used VGG-16 as the image encoder to extract feature vectors, then fed those into two different decoders: an LSTM and a Transformer. The LSTM got a BLEU score of 0.65, which was decent, but the Transformer with attention hit 0.80 and generated noticeably better captions. The attention mechanism really helped the model focus on the right parts of the image.

I wrapped the whole thing in a Streamlit app so you can upload any image and get a caption back in a few seconds. It's a simple UI, but it makes the model feel real instead of just numbers in a notebook.

The hardest part was honestly the training pipeline. VGG-16 is memory-hungry, the Transformer needed careful hyperparameter tuning, and I spent more time on data preprocessing than I'd like to admit.
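For the curious, the encoder/decoder split looks roughly like the sketch below. It's a minimal PyTorch version, assuming a recent torchvision with pretrained VGG-16 weights; the layer sizes, sequence length, and other hyperparameters are illustrative rather than the exact values I trained with.

```python
import torch
import torch.nn as nn
from torchvision import models

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=3, max_len=40):
        super().__init__()
        # Frozen VGG-16 backbone used purely as a feature extractor
        vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
        self.encoder = vgg.features
        for p in self.encoder.parameters():
            p.requires_grad = False
        self.proj = nn.Linear(512, d_model)  # map VGG channels to decoder width

        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))  # learned positions
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, 224, 224) -> (B, 512, 7, 7) feature grid -> (B, 49, 512)
        feats = self.encoder(images).flatten(2).transpose(1, 2)
        memory = self.proj(feats)

        # Token embeddings plus positions, decoded with a causal mask so each
        # word attends only to earlier words and to the image features
        seq_len = captions.size(1)
        tgt = self.embed(captions) + self.pos[:, :seq_len]
        mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=captions.device),
            diagonal=1,
        )
        hidden = self.decoder(tgt, memory, tgt_mask=mask)
        return self.out(hidden)  # (B, seq_len, vocab_size) logits
```

The cross-attention over the 7×7 VGG feature grid is what lets the decoder look at specific image regions while emitting each word, which is exactly where the LSTM-only version fell short.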
VGG-16 for image feature extraction, then LSTM and Transformer decoders for generating captions
LSTM hit 0.65 BLEU, Transformer got 0.80. The attention mechanism made a real difference
Streamlit app where you upload an image and get a caption back in seconds (sketched below)
Tuned the pipeline so inference runs fast enough to feel instant
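The Streamlit side is genuinely small. Below is a stripped-down sketch of the plumbing; the model path and the generate_caption() helper are placeholders standing in for the real loading and decoding code.

```python
import streamlit as st
import torch
from PIL import Image


@st.cache_resource  # load weights once per session, not on every rerun
def load_model():
    model = torch.load("caption_model.pt", map_location="cpu")  # path is illustrative
    model.eval()
    return model


def generate_caption(model, image: Image.Image) -> str:
    # Hypothetical stand-in: preprocess the image, run the Transformer decoder
    # greedily from the start token, and detokenize the result into a sentence.
    raise NotImplementedError


st.title("Image Captioning Demo")
uploaded = st.file_uploader("Upload an image", type=["jpg", "jpeg", "png"])

if uploaded is not None:
    image = Image.open(uploaded).convert("RGB")
    st.image(image, caption="Your upload")
    with st.spinner("Generating caption..."):
        caption = generate_caption(load_model(), image)
    st.success(caption)
```

Caching the model with st.cache_resource means the weights load once per session instead of on every interaction, which helps keep the upload-to-caption round trip short.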
Friction
Takeaways