---
title: Image-Attention-Visualizer
emoji: 🔥
colorFrom: blue
colorTo: purple
sdk: gradio
app_file: app.py
license: mit
pinned: true
tags:
- gradio
- pytorch
- computer-vision
- nlp
- multimodal
- vision-language
- image-to-text
- attention
- attention-visualization
- interpretability
- explainability
- xai
- demo
---
GitHub repo: https://github.com/devMuniz02/Image-Attention-Visualizer
TRY IT NOW ON HUGGING FACE SPACES!!
Image-Attention-Visualizer
Image Attention Visualizer is an interactive Gradio app that visualizes cross-modal attention between image tokens and generated text tokens in a custom multimodal model. It allows researchers and developers to see how different parts of an image influence the model’s textual output, token by token.
Image-to-Text Attention Visualizer (Gradio)
An interactive Gradio app to generate text from an image using a custom multimodal model and visualize attention in real time. It provides 3 synchronized views — original image, attention overlay, and heatmap — plus a word-level visualization showing how each generated word attends to visual regions.
✨ What the app does
Generates text from an image input using your custom model (`create_complete_model`).
Displays three synchronized views:
- 🖼️ Original image
- 🔥 Overlay (original + attention heatmap)
- 🌈 Heatmap alone
Word-level attention viewer: select any generated word to see how its attention is distributed across the image and previously generated words.
Works directly with your custom tokenizer (`model.decoder.tokenizer`).
Fixed-length 1024 image tokens (32×32 grid) projected as a visual heatmap (see the sketch after this list).
Adjustable options: Layer, Head, or Mean Across Layers/Heads.
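As a rough illustration of the 32×32 projection mentioned above, the sketch below reshapes a per-token attention vector over the 1024 image tokens into a grid, normalizes it, and blends it with the input image. The function and variable names are placeholders, not the app's actual API; the real logic lives in `_make_overlay`.

```python
# Sketch: turn a 1024-long attention vector into a 32x32 heatmap overlay.
# `sketch_overlay`, `attn_vector`, and `base_image` are hypothetical names,
# not the app's API; the app's own logic lives in _make_overlay.
import numpy as np
from PIL import Image

def sketch_overlay(attn_vector: np.ndarray, base_image: Image.Image, alpha: float = 0.5) -> Image.Image:
    # Reshape the flat attention over image tokens into the 32x32 patch grid.
    grid = attn_vector.reshape(32, 32)
    # Normalize to [0, 1] so the heatmap uses the full intensity range.
    grid = (grid - grid.min()) / (grid.max() - grid.min() + 1e-8)
    # Upsample the grid to the image resolution with bilinear interpolation.
    heat = Image.fromarray((grid * 255).astype(np.uint8)).resize(base_image.size, Image.BILINEAR)
    # Use the heatmap as the red channel and blend it with the original image.
    zeros = Image.new("L", base_image.size, 0)
    heat_rgb = Image.merge("RGB", (heat, zeros, zeros))
    return Image.blend(base_image.convert("RGB"), heat_rgb, alpha)
```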
🚀 Quickstart
1) Clone
git clone https://github.com/devMuniz02/Image-Attention-Visualizer
cd Image-Attention-Visualizer
2) (Optional) Create a virtual environment
Windows (PowerShell):
python -m venv venv
.\venv\Scripts\Activate.ps1
macOS / Linux (bash/zsh):
python3 -m venv venv
source venv/bin/activate
3) Install requirements
pip install -r requirements.txt
4) Run the app
python app.py
You should see something like:
Running on local URL: http://127.0.0.1:7860
5) Open in your browser
Navigate to http://127.0.0.1:7860 to use the app.
🧭 How to use
Upload an image or load a random sample from your dataset folder.
Set generation parameters:
- Max New Tokens
- Layer/Head selection (or average across all; see the sketch at the end of this section)
Click Generate — the model will produce a textual description or continuation.
Select a generated word from the list:
The top row will show:
- Left → Original image
- Center → Overlay (attention on image regions)
- Right → Colored heatmap
The bottom section highlights attention strength over the generated words.
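For orientation, here is a hedged sketch of how the Layer/Head selection (or averaging across all layers and heads) could reduce the stored attentions to a single map for one generated token. It assumes `attentions[step][layer]` tensors shaped `(batch, heads, tgt, src)`, as in Hugging Face-style generate outputs; the app's internal layout may differ.

```python
# Sketch of how the Layer/Head options might be reduced to one attention map.
# Assumes `attentions[step][layer]` is a torch.Tensor shaped (batch, heads, tgt, src);
# the app's internal storage may differ.
from typing import Optional
import torch

def reduce_attention(attentions, step: int, layer: Optional[int], head: Optional[int]) -> torch.Tensor:
    # Stack all layers for this generation step: (layers, batch, heads, tgt, src).
    stacked = torch.stack(list(attentions[step]))
    per_layer = stacked.mean(dim=0) if layer is None else stacked[layer]      # mean over layers or pick one
    per_head = per_layer.mean(dim=1) if head is None else per_layer[:, head]  # mean over heads or pick one
    # Attention of the newest token over all source positions; the first 1024
    # source positions would correspond to the 32x32 grid of image tokens.
    return per_head[0, -1]
```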
🧩 Files
- `app.py`: Main Gradio interface and visualization logic.
- `utils/models/complete_model.py`: Model definition and generation method.
- `utils/processing.py`: Image preprocessing utilities.
- `requirements.txt`: Dependencies.
- `README.md`: This file.
🛠️ Troubleshooting
- Black or blank heatmap: Ensure your model returns `output_attentions=True` in `.generate()` (see the sanity check below).
- Low resolution or distortion: Adjust `img_size` or the interpolation method inside `_make_overlay`.
- Tokenizer error: Make sure `model.decoder.tokenizer` exists and is loaded correctly.
- OOM errors: Reduce `max_new_tokens` or use a smaller model checkpoint.
- Color or shape mismatch: Verify that your image token length is 1024 (for a 32×32 layout).
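If the heatmap stays blank, a quick check along these lines can confirm that attentions are actually returned and that the source length covers the expected 1024 image tokens. It assumes the `(gen_ids, gen_text, attentions)` return convention described in the integration notes below; adjust names to your model.

```python
# Rough sanity check following the (gen_ids, gen_text, attentions) convention
# described in this README; exact names and shapes depend on your model.
def check_attentions(model, pixel_values):
    # Generate with attentions enabled and verify they were actually returned.
    gen_ids, gen_text, attentions = model.generate(
        pixel_values, max_new_tokens=8, output_attentions=True
    )
    assert attentions, "No attentions returned; check output_attentions=True in generate()"
    last_step = attentions[-1][-1]                # last generation step, last layer
    print("attention shape:", tuple(last_step.shape))
    print("source length:", last_step.shape[-1])  # should include the 1024 image tokens (32x32)
```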
🧪 Model integration notes
The app is compatible with any encoder–decoder or vision–language model that:
- Accepts `pixel_values` as input.
- Returns `(gen_ids, gen_text, attentions)` from `generate(..., output_attentions=True)`.
- Uses the tokenizer from `model.decoder.tokenizer`.
Designed for research in vision-language interpretability, cross-modal explainability, and attention visualization. A minimal interface sketch follows.
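To make that interface concrete, here is a minimal, illustrative wrapper that satisfies the requirements above. It is not the actual `create_complete_model` from `utils/models/complete_model.py`; the encoder and decoder internals are assumptions.

```python
# Illustrative wrapper showing the interface the app expects; this is NOT the
# actual create_complete_model from utils/models/complete_model.py.
import torch
import torch.nn as nn

class CompatibleModel(nn.Module):
    def __init__(self, encoder, decoder, tokenizer):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.decoder.tokenizer = tokenizer  # the app reads model.decoder.tokenizer

    @torch.no_grad()
    def generate(self, pixel_values, max_new_tokens=32, output_attentions=False, **kwargs):
        # Encode the image into 1024 visual tokens (a 32x32 patch grid).
        visual_tokens = self.encoder(pixel_values)
        # Decode text autoregressively, collecting attention weights if requested.
        gen_ids, attentions = self.decoder.generate(
            visual_tokens, max_new_tokens=max_new_tokens, output_attentions=output_attentions
        )
        gen_text = self.decoder.tokenizer.decode(gen_ids[0], skip_special_tokens=True)
        return gen_ids, gen_text, attentions
```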
📣 Acknowledgments
- Built with Gradio and Hugging Face Transformers.
- Inspired by the original Token-Attention-Viewer project.
- Special thanks to the open-source community advancing vision-language interpretability.
