---
title: Image-Attention-Visualizer
emoji: 🔥
colorFrom: blue
colorTo: purple
sdk: gradio
app_file: app.py
license: mit
pinned: true
tags:
- gradio
- pytorch
- computer-vision
- nlp
- multimodal
- vision-language
- image-to-text
- attention
- attention-visualization
- interpretability
- explainability
- xai
- demo
---
GitHub repo: https://github.com/devMuniz02/Image-Attention-Visualizer
TRY IT NOW ON HUGGING FACE SPACES!!
Image-Attention-Visualizer
Image Attention Visualizer is an interactive Gradio app that visualizes cross-modal attention between image tokens and generated text tokens in a custom multimodal model. It allows researchers and developers to see how different parts of an image influence the model’s textual output, token by token.
Image-to-Text Attention Visualizer (Gradio)
An interactive Gradio app to generate text from an image using a custom multimodal model and visualize attention in real time. It provides 3 synchronized views — original image, attention overlay, and heatmap — plus a word-level visualization showing how each generated word attends to visual regions.
✨ What the app does
Generates text from an image input using your custom model (`create_complete_model`).
Displays three synchronized views:
- 🖼️ Original image
- 🔥 Overlay (original + attention heatmap)
- 🌈 Heatmap alone
Word-level attention viewer: select any generated word to see how its attention is distributed across the image and previously generated words.
Works directly with your custom tokenizer (`model.decoder.tokenizer`).
Fixed-length 1024 image tokens (32×32 grid) projected as a visual heatmap (see the sketch after this list).
Adjustable options: Layer, Head, or Mean Across Layers/Heads.
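As a rough illustration of the 32×32 projection mentioned above, the sketch below reshapes a per-token attention vector over the 1024 image tokens into a grid, normalizes it, and blends it with the input image. The function and variable names are placeholders, not the app's actual API; the real logic lives in `_make_overlay`.

```python
# Sketch: turn a 1024-long attention vector into a 32x32 heatmap overlay.
# `sketch_overlay`, `attn_vector`, and `base_image` are hypothetical names,
# not the app's API; the app's own logic lives in _make_overlay.
import numpy as np
from PIL import Image

def sketch_overlay(attn_vector: np.ndarray, base_image: Image.Image, alpha: float = 0.5) -> Image.Image:
    # Reshape the flat attention over image tokens into the 32x32 patch grid.
    grid = attn_vector.reshape(32, 32)
    # Normalize to [0, 1] so the heatmap uses the full intensity range.
    grid = (grid - grid.min()) / (grid.max() - grid.min() + 1e-8)
    # Upsample the grid to the image resolution with bilinear interpolation.
    heat = Image.fromarray((grid * 255).astype(np.uint8)).resize(base_image.size, Image.BILINEAR)
    # Use the heatmap as the red channel and blend it with the original image.
    zeros = Image.new("L", base_image.size, 0)
    heat_rgb = Image.merge("RGB", (heat, zeros, zeros))
    return Image.blend(base_image.convert("RGB"), heat_rgb, alpha)
```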
🚀 Quickstart
1) Clone
git clone https://github.com/devMuniz02/Image-Attention-Visualizer
cd Image-Attention-Visualizer
2) (Optional) Create a virtual environment
Windows (PowerShell):
python -m venv venv
.\venv\Scripts\Activate.ps1
macOS / Linux (bash/zsh):
python3 -m venv venv
source venv/bin/activate
3) Install requirements
pip install -r requirements.txt
4) Run the app
python app.py
You should see something like:
Running on local URL: http://127.0.0.1:7860
5) Open in your browser
Navigate to http://127.0.0.1:7860 to use the app.
🧭 How to use
Upload an image or load a random sample from your dataset folder.
Set generation parameters:
- Max New Tokens
- Layer/Head selection (or average across all; see the sketch at the end of this section)
Click Generate — the model will produce a textual description or continuation.
Select a generated word from the list:
The top row will show:
- Left → Original image
- Center → Overlay (attention on image regions)
- Right → Colored heatmap
The bottom section highlights attention strength over the generated words.
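For orientation, here is a hedged sketch of how the Layer/Head selection (or averaging across all layers and heads) could reduce the stored attentions to a single map for one generated token. It assumes `attentions[step][layer]` tensors shaped `(batch, heads, tgt, src)`, as in Hugging Face-style generate outputs; the app's internal layout may differ.

```python
# Sketch of how the Layer/Head options might be reduced to one attention map.
# Assumes `attentions[step][layer]` is a torch.Tensor shaped (batch, heads, tgt, src);
# the app's internal storage may differ.
from typing import Optional
import torch

def reduce_attention(attentions, step: int, layer: Optional[int], head: Optional[int]) -> torch.Tensor:
    # Stack all layers for this generation step: (layers, batch, heads, tgt, src).
    stacked = torch.stack(list(attentions[step]))
    per_layer = stacked.mean(dim=0) if layer is None else stacked[layer]      # mean over layers or pick one
    per_head = per_layer.mean(dim=1) if head is None else per_layer[:, head]  # mean over heads or pick one
    # Attention of the newest token over all source positions; the first 1024
    # source positions would correspond to the 32x32 grid of image tokens.
    return per_head[0, -1]
```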
🧩 Files
- `app.py`: Main Gradio interface and visualization logic.
- `utils/models/complete_model.py`: Model definition and generation method.
- `utils/processing.py`: Image preprocessing utilities.
- `requirements.txt`: Dependencies.
- `README.md`: This file.
🛠️ Troubleshooting
- Black or blank heatmap: Ensure your model returns `output_attentions=True` in `.generate()` (see the sanity check below).
- Low resolution or distortion: Adjust `img_size` or the interpolation method inside `_make_overlay`.
- Tokenizer error: Make sure `model.decoder.tokenizer` exists and is loaded correctly.
- OOM errors: Reduce `max_new_tokens` or use a smaller model checkpoint.
- Color or shape mismatch: Verify that your image token length is 1024 (for a 32×32 layout).
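If the heatmap stays blank, a quick check along these lines can confirm that attentions are actually returned and that the source length covers the expected 1024 image tokens. It assumes the `(gen_ids, gen_text, attentions)` return convention described in the integration notes below; adjust names to your model.

```python
# Rough sanity check following the (gen_ids, gen_text, attentions) convention
# described in this README; exact names and shapes depend on your model.
def check_attentions(model, pixel_values):
    # Generate with attentions enabled and verify they were actually returned.
    gen_ids, gen_text, attentions = model.generate(
        pixel_values, max_new_tokens=8, output_attentions=True
    )
    assert attentions, "No attentions returned; check output_attentions=True in generate()"
    last_step = attentions[-1][-1]                # last generation step, last layer
    print("attention shape:", tuple(last_step.shape))
    print("source length:", last_step.shape[-1])  # should include the 1024 image tokens (32x32)
```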
🧪 Model integration notes
The app is compatible with any encoder–decoder or vision–language model that:
- Accepts `pixel_values` as input.
- Returns `(gen_ids, gen_text, attentions)` from `generate(..., output_attentions=True)`.
- Uses the tokenizer from `model.decoder.tokenizer`.
Designed for research in vision-language interpretability, cross-modal explainability, and attention visualization. A minimal interface sketch follows.
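To make that interface concrete, here is a minimal, illustrative wrapper that satisfies the requirements above. It is not the actual `create_complete_model` from `utils/models/complete_model.py`; the encoder and decoder internals are assumptions.

```python
# Illustrative wrapper showing the interface the app expects; this is NOT the
# actual create_complete_model from utils/models/complete_model.py.
import torch
import torch.nn as nn

class CompatibleModel(nn.Module):
    def __init__(self, encoder, decoder, tokenizer):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.decoder.tokenizer = tokenizer  # the app reads model.decoder.tokenizer

    @torch.no_grad()
    def generate(self, pixel_values, max_new_tokens=32, output_attentions=False, **kwargs):
        # Encode the image into 1024 visual tokens (a 32x32 patch grid).
        visual_tokens = self.encoder(pixel_values)
        # Decode text autoregressively, collecting attention weights if requested.
        gen_ids, attentions = self.decoder.generate(
            visual_tokens, max_new_tokens=max_new_tokens, output_attentions=output_attentions
        )
        gen_text = self.decoder.tokenizer.decode(gen_ids[0], skip_special_tokens=True)
        return gen_ids, gen_text, attentions
```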
📣 Acknowledgments
- Built with Gradio and Hugging Face Transformers.
- Inspired by the original Token-Attention-Viewer project.
- Special thanks to the open-source community advancing vision-language interpretability.
