---
license: other
license_name: myvlm-snap-license
license_link: https://github.com/snap-research/MyVLM/blob/master/LICENSE
---
# MyVLM

**Paper:** https://arxiv.org/abs/2403.14599

**Project Page:** https://snap-research.github.io/MyVLM/

**Code:** https://github.com/snap-research/MyVLM


# MyVLM Concept Heads & Concept Embeddings
As part of our [MyVLM code](https://github.com/snap-research/MyVLM) release, we have also released pretrained concept heads and concept embeddings for all 29 objects used in the paper.

These can be loaded using the `CLIPConceptHead` class together with our inference scripts to reproduce the results reported in the paper.

This repository contains five concept heads for each object, corresponding to five different training seeds and the sets of images used for training.
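
Since the heads and embeddings are hosted in this repository, one possible way to fetch them before running the inference scripts is with `huggingface_hub`; the repository id below is a placeholder, not a value from this README:

```
# Minimal sketch: download the released concept heads and embeddings locally.
# Replace the placeholder repo_id with this repository's id on the Hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="<user-or-org>/<this-repo>")
print(local_dir)  # directory containing the concept head / embedding files
```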

## Concept Heads

<p align="center">
<img src="docs/concept_head.jpg" width="200px"/>
For each user-specific concept, we introduce an external concept head designed to identify the presence of the concept within an image.
</p>


As mentioned in the paper, we have two types of concept heads:
1. A facial recognition model for recognizing individuals
2. A CLIP-based concept head for recognizing user-specific objects

For faces, we use the `buffalo_l` face detection and face recognition models from [insightface](https://github.com/deepinsight/insightface/tree/master).
See `concept_heads/face_recognition/head.py` for usage.
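
As a rough illustration of the idea (not the code in `concept_heads/face_recognition/head.py`), the identity check can be sketched with insightface directly; the file names and similarity threshold below are placeholders:

```
# Minimal sketch: detect faces with insightface's buffalo_l models and compare
# identity embeddings against a reference image via cosine similarity.
# File names and the 0.5 threshold are placeholders, not values from the paper.
import cv2
import numpy as np
from insightface.app import FaceAnalysis

app = FaceAnalysis(name="buffalo_l")        # detection + recognition model pack
app.prepare(ctx_id=0, det_size=(640, 640))  # ctx_id=0 -> first GPU, -1 -> CPU

reference = cv2.imread("reference_face.jpg")  # image of the target individual
query = cv2.imread("query_image.jpg")         # image to test for the concept

ref_embedding = app.get(reference)[0].normed_embedding  # assumes a face is found
similarities = [float(np.dot(ref_embedding, face.normed_embedding))
                for face in app.get(query)]

# The concept (the individual) is considered present if any detected face is
# sufficiently similar to the reference identity.
concept_present = any(sim > 0.5 for sim in similarities)
```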

For objects, we train a single linear layer over features extracted from a CLIP ViT-H/14 model (`DFN5B-CLIP-ViT-H-14-384`).
See `concept_heads/clip/head.py` for usage.
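
As a rough sketch of this idea (not the `CLIPConceptHead` implementation itself), a single linear layer can be placed on top of frozen CLIP image features; the Hugging Face hub id and image path below are assumptions for illustration:

```
# Minimal sketch: a single linear layer over frozen CLIP ViT-H/14 image features
# acting as a binary "concept present?" classifier. The hub id and image path
# are illustrative; the released heads use their own trained weights.
import torch
import torch.nn as nn
from PIL import Image
from open_clip import create_model_from_pretrained

clip_model, preprocess = create_model_from_pretrained(
    "hf-hub:apple/DFN5B-CLIP-ViT-H-14-384"
)
clip_model.eval()

image = preprocess(Image.open("my_object.jpg")).unsqueeze(0)
with torch.no_grad():
    features = clip_model.encode_image(image)
    features = features / features.norm(dim=-1, keepdim=True)  # L2-normalize

# The trainable part: one linear layer mapping image features -> concept logit.
linear_head = nn.Linear(features.shape[-1], 1)
probability = torch.sigmoid(linear_head(features))  # > threshold -> concept present
```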


## Concept Embeddings
<p align="center">
<img src="docs/method.jpg" width="800px"/>
Having identified the presence of a user-specific concept within an image, a learned concept embedding representing an object or individual is used to guide the LLM in incorporating the concept into its personalized textual response.
</p>


The concept embeddings are saved as `.pt` files in the following format:

```
{
    10: {
        "keys": torch.Tensor(),    # the keys used for optimizing the concept embedding
        "values": torch.Tensor(),  # the concept embedding itself
    },
    ...
    20: {
        "keys": torch.Tensor(),
        "values": torch.Tensor(),
    },
    ...
}
```
where each entry in the dictionary represents a different checkpoint during the optimization process.
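
For example, a released embedding file can be inspected directly with `torch.load`; the file name below is illustrative:

```
# Minimal sketch: load a concept embedding file and pick one optimization checkpoint.
# The file name is illustrative; use one of the released .pt files.
import torch

checkpoints = torch.load("concept_embeddings.pt", map_location="cpu")
print(sorted(checkpoints.keys()))        # available checkpoints, e.g. [10, 20, ...]

step = sorted(checkpoints.keys())[0]
keys = checkpoints[step]["keys"]         # keys used for optimizing the concept embedding
values = checkpoints[step]["values"]     # the concept embedding itself
print(keys.shape, values.shape)
```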

We provide the concept embeddings for personalized captioning using both BLIP-2 and LLaVA.


## License
This sample code is made available by Snap Inc. for non-commercial, academic purposes only.
Please see the full license [here](https://github.com/snap-research/MyVLM/blob/master/LICENSE).