---
license: other
license_name: myvlm-snap-license
license_link: https://github.com/snap-research/MyVLM/blob/master/LICENSE
---

# MyVLM

**Paper:** https://arxiv.org/abs/2403.14599

**Project Page:** https://snap-research.github.io/MyVLM/

**Code:** https://github.com/snap-research/MyVLM

# MyVLM Concept Heads & Concept Embeddings

As part of our [MyVLM code](https://github.com/snap-research/MyVLM) release, we have also released pretrained concept heads and concept embeddings for all 29 objects used in the paper.

These can be loaded using the `CLIPConceptHead` class and used with our inference scripts to reproduce the results reported in the paper.

This repository contains five concept heads for each object, corresponding to five different training seeds and sets of images used for training.

## Concept Heads

<p align="center">
<img src="docs/concept_head.jpg" width="200px"/>
For each user-specific concept, we introduce an external concept head designed to identify the presence of the concept within an image.
</p>

As mentioned in the paper, we use two types of concept heads:
1. A facial recognition model for recognizing individuals
2. A CLIP-based concept head for recognizing user-specific objects

For faces, we use the `buffalo_l` face detection and face recognition model from [insightface](https://github.com/deepinsight/insightface/tree/master).
See `concept_heads/face_recognition/head.py` for usage.
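
For reference, the snippet below is a minimal sketch of how `buffalo_l` is typically driven through insightface's `FaceAnalysis` API. The reference-embedding file is a hypothetical placeholder, and the actual wrapper in `concept_heads/face_recognition/head.py` may differ in its details.

```python
# Minimal sketch (not the released head): detect faces with buffalo_l and
# compare their embeddings against a stored reference embedding.
import cv2
import numpy as np
from insightface.app import FaceAnalysis

app = FaceAnalysis(name="buffalo_l")        # bundled detection + recognition models
app.prepare(ctx_id=0, det_size=(640, 640))  # ctx_id=0 -> first GPU, -1 -> CPU

image = cv2.imread("query.jpg")             # insightface expects a BGR numpy array
faces = app.get(image)                      # detect and embed every face in the image

# Hypothetical reference embedding, e.g. averaged over the user's training images.
reference = np.load("concept_face_embedding.npy")
for face in faces:
    # normed_embedding is L2-normalized, so the dot product is cosine similarity.
    similarity = float(np.dot(face.normed_embedding, reference))
    print("face similarity:", similarity)   # threshold this to decide concept presence
```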

For objects, we train a single linear layer over features extracted from a CLIP ViT-H/14 model (`DFN5B-CLIP-ViT-H-14-384`).
See `concept_heads/clip/head.py` for usage.
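
The snippet below sketches this idea: frozen CLIP image features followed by a single linear layer. It assumes the open_clip hub name `hf-hub:apple/DFN5B-CLIP-ViT-H-14-384`, a 1024-dimensional feature space, a two-way output, and a hypothetical checkpoint path; the released `CLIPConceptHead` class may differ in these details.

```python
# Minimal sketch (not the released CLIPConceptHead): a linear probe over
# frozen DFN5B-CLIP-ViT-H-14-384 image features.
import torch
import torch.nn as nn
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "hf-hub:apple/DFN5B-CLIP-ViT-H-14-384"
)
model.eval()

# ViT-H/14 image embeddings are 1024-dimensional; two logits: concept vs. background.
head = nn.Linear(1024, 2)
head.load_state_dict(torch.load("concept_head_seed0.pt"))  # hypothetical filename
head.eval()

image = preprocess(Image.open("query.jpg")).unsqueeze(0)
with torch.no_grad():
    features = model.encode_image(image).float()
    logits = head(features)
print("concept present:", bool(logits.argmax(dim=-1).item()))
```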

## Concept Embeddings

<p align="center">
<img src="docs/method.jpg" width="800px"/>
Having identified the presence of a user-specific concept within an image, a learned concept embedding representing the object or individual is used to guide the LLM in incorporating the concept into its personalized textual response.
</p>

The concept embeddings are saved as `.pt` files in the following format:

```
{
    10: {
        "keys": torch.Tensor(),    # the keys used for optimizing the concept embedding
        "values": torch.Tensor(),  # the concept embedding itself
    },
    ...
    20: {
        "keys": torch.Tensor(),
        "values": torch.Tensor(),
    },
    ...
}
```

where each entry in the dictionary represents a different checkpoint during the optimization process.
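
For example, such a file can be loaded and inspected as follows (the file name below is a hypothetical placeholder):

```python
# Minimal sketch: load a concept embedding file and pick one checkpoint.
import torch

checkpoints = torch.load("concept_embedding.pt")

step = max(checkpoints)                   # e.g., take the latest optimization step
keys = checkpoints[step]["keys"]          # keys used for optimizing the embedding
values = checkpoints[step]["values"]      # the concept embedding itself
print(f"step {step}: keys {tuple(keys.shape)}, values {tuple(values.shape)}")
```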

We provide the concept embeddings for personalized captioning using both BLIP-2 and LLaVA.

## License

This sample code is made available by Snap Inc. for non-commercial, academic purposes only.
Please see the full license [here](https://github.com/snap-research/MyVLM/blob/master/LICENSE).