ML / Computer Vision

Image Feature Detection & Captioning

0.80 BLEU Score

0.65 LSTM Baseline

+23% Improvement

Built a CNN encoder + Transformer decoder pipeline that generates captions for images, scoring 0.80 BLEU against a 0.65 LSTM baseline. Wrapped it in a Streamlit UI so you can try it yourself.

Python
TensorFlow
CNN
Transformer
LSTM
Streamlit
Computer Vision
NLP
01

Preview

Image Feature Detection & Captioning
02

Overview

I wanted to build something that could look at an image and describe what's in it. The idea was straightforward, but getting it to actually work well took some effort. I used VGG-16 as the image encoder to extract feature vectors, then fed those into two different decoders: an LSTM and a Transformer. The LSTM got a BLEU score of 0.65, which was decent, but the Transformer with attention hit 0.80 and generated noticeably better captions. The attention mechanism really helped the model focus on the right parts of the image. I wrapped the whole thing in a Streamlit app so you can upload any image and get a caption back in a few seconds. It's a simple UI, but it makes the model feel real instead of just numbers in a notebook. The hardest part was honestly the training pipeline. VGG-16 is memory-hungry, the Transformer needed careful hyperparameter tuning, and I spent more time on data preprocessing than I'd like to admit.
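If you want to see roughly what the encoder step looks like, here's a minimal Keras sketch. It assumes the stock ImageNet-pretrained VGG-16 with the fc2 layer as the feature source; the function name and structure are illustrative, not lifted from the actual project code.

```python
# A minimal sketch of the encoder step, assuming the standard Keras VGG-16.
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model

# Use the 4096-d fc2 activations as the image feature vector,
# dropping VGG-16's final classification layer.
base = VGG16(weights="imagenet")
encoder = Model(inputs=base.input, outputs=base.get_layer("fc2").output)

def extract_features(image_path: str) -> np.ndarray:
    """Load an image, resize to VGG-16's expected 224x224 input,
    and return a (4096,) feature vector for the caption decoder."""
    img = tf.keras.utils.load_img(image_path, target_size=(224, 224))
    x = tf.keras.utils.img_to_array(img)
    x = preprocess_input(x[np.newaxis, ...])  # add batch dim, scale to VGG's range
    return encoder.predict(x, verbose=0)[0]
```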

03

Key Features

Advanced AI Models

VGG-16 for image feature extraction, then LSTM and Transformer decoders for generating captions

High Performance

The LSTM hit 0.65 BLEU; the Transformer got 0.80. The attention mechanism made a real difference

User-Friendly Interface

Streamlit app where you upload an image and get a caption back in seconds
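The UI is only a few lines of Streamlit. Here's a rough sketch of the shape of it; `extract_features` and `generate_caption` are hypothetical stand-ins for the real encoder/decoder code:

```python
# A minimal Streamlit sketch; the model module and its two helpers are
# assumed stand-ins, not the project's actual code.
import streamlit as st
from PIL import Image

from model import extract_features, generate_caption  # hypothetical module

st.title("Image Captioning Demo")

uploaded = st.file_uploader("Upload an image", type=["jpg", "jpeg", "png"])
if uploaded is not None:
    image = Image.open(uploaded).convert("RGB")
    st.image(image)
    with st.spinner("Generating caption..."):
        features = extract_features(image)    # VGG-16 feature vector
        caption = generate_caption(features)  # Transformer decoder
    st.success(caption)
```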

Real-time Processing

Tuned the pipeline so inference runs fast enough to feel instant

04

Technical Stack

implementation.notes
01 CNN encoder (VGG-16) for pulling feature vectors out of images
02 LSTM decoder with attention
03 Transformer decoder that outperformed the LSTM by 0.15 BLEU (0.65 → 0.80)
04 Evaluated with BLEU scores across both architectures (see the sketch after this list)
05 Streamlit frontend for uploading images and viewing captions
06 Image preprocessing and augmentation during training
07 Model optimization to keep inference time reasonable
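For item 04, the scoring itself is straightforward with NLTK's corpus_bleu. A minimal sketch of the general approach (the data layout here is assumed, not the project's exact evaluation harness):

```python
# BLEU evaluation sketch using NLTK; smoothing avoids zero scores
# when short captions have no higher-order n-gram overlap.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def evaluate(references, hypotheses):
    """references: one list of tokenized reference captions per image;
    hypotheses: one tokenized model caption per image."""
    smooth = SmoothingFunction().method1
    return corpus_bleu(references, hypotheses, smoothing_function=smooth)

# Example with one image and two reference captions:
refs = [[["a", "dog", "runs", "on", "grass"],
         ["a", "dog", "running", "across", "a", "field"]]]
hyps = [["a", "dog", "runs", "across", "the", "grass"]]
print(f"BLEU: {evaluate(refs, hyps):.2f}")
```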
05

Friction & Takeaways

Friction

  • The Transformer gave better captions but was noticeably slower; I had to find the right size/speed tradeoff
  • The model struggled with unusual image compositions that weren't well represented in the training data
  • Getting BLEU above 0.70 on the LSTM took a lot of hyperparameter tuning before I switched to the Transformer
  • Making the Streamlit UI responsive enough that it didn't feel like you were waiting forever
  • VGG-16 is memory-hungry; I had to be strategic about batch sizes during training

Takeaways

  • Transformers really do outperform LSTMs on sequence tasks once you get the training right
  • Connecting a vision encoder to a language decoder is tricky; the feature vector interface matters a lot
  • Attention maps are useful for debugging, not just for boosting scores (see the sketch below)
  • Even a simple Streamlit UI makes a model way more convincing in a demo
  • BLEU is a useful metric but doesn't always match how good a caption actually reads
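On that attention-map point: a quick way to eyeball what the decoder is attending to, assuming spatial encoder features on a 7x7 grid and a hypothetical attn array of per-word cross-attention weights:

```python
# Attention-map debugging sketch; the (num_words, 49) attn layout is an
# assumption based on VGG-16's 7x7 conv grid, not the project's actual code.
import matplotlib.pyplot as plt
import numpy as np

def show_attention(image, caption_tokens, attn):
    """Overlay each word's cross-attention weights on the input image.

    image: HxWx3 numpy array; caption_tokens: list of words;
    attn: (num_words, 49) array of per-word attention weights.
    """
    fig, axes = plt.subplots(1, len(caption_tokens),
                             figsize=(3 * len(caption_tokens), 3))
    for ax, word, weights in zip(np.atleast_1d(axes), caption_tokens, attn):
        ax.imshow(image)
        # Reshape the 49 weights back onto the 7x7 feature grid and
        # stretch the heatmap over the full image.
        heat = weights.reshape(7, 7)
        ax.imshow(heat, cmap="jet", alpha=0.4,
                  extent=(0, image.shape[1], image.shape[0], 0))
        ax.set_title(word)
        ax.axis("off")
    plt.show()
```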