Disty0 committed (verified)
Commit: d07b527
Parent(s): f8d26b7

Upload 28 files

.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer_2/tokenizer.json filter=lfs diff=lfs merge=lfs -text
+ tokenizer/tokenizer.json filter=lfs diff=lfs merge=lfs -text
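
Both `tokenizer.json` files are now tracked by Git LFS, so the Git history keeps only small pointer files while the actual blobs live in LFS storage. A minimal sketch of fetching one of them with `huggingface_hub`, which resolves LFS pointers transparently; the repository id below is a placeholder, not the real repo name:

```python
# Minimal sketch: download one of the newly LFS-tracked files.
# "your-namespace/your-repo" is a placeholder repository id.
from huggingface_hub import hf_hub_download

local_path = hf_hub_download(
    repo_id="your-namespace/your-repo",      # placeholder, replace with the actual repo
    filename="tokenizer_2/tokenizer.json",   # path as tracked in .gitattributes
)
print(local_path)
```
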
LICENSE.md ADDED
@@ -0,0 +1,84 @@
+ # Newbie Non-Commercial Community License (Newbie-NC-1.0)
+
+ **Version 1.0**
+
+ ## Preamble
+
+ The licensor ("Newbie Team" or the "Licensor") releases the model weights and parameters (the "Model") under this **Newbie Non-Commercial Community License** to foster an open, collaborative, and strictly non-commercial research environment.
+
+ **Note on Training Code:** The source code used to train this Model is separately licensed under the **Apache License 2.0**. This License applies solely to the Model weights, parameters, and any Derivatives thereof.
+
+ By exercising the Licensed Rights (defined below), You accept and agree to be bound by the terms and conditions of this License.
+
+ ---
+
+ ## Section I: Definitions
+
+ 1. **"Model"** means the specific machine learning model weights, parameters, and configuration files released by the Licensor under this License.
+ 2. **"Derivatives"** means any modifications to the Model, including but not limited to:
+    * Fine-tuned versions (including LoRA, Dreambooth, Hypernetworks).
+    * Merged models where this Model is a component (weighted or unweighted).
+    * Quantized, distilled, or pruned versions of the Model.
+    * Any model that initializes its weights from the Model.
+ 3. **"Output"** means the content (images, text, etc.) generated by the Model or its Derivatives.
+ 4. **"Commercial Use"** means any usage primarily intended for or directed towards commercial advantage or monetary compensation.
+
+ ---
+
+ ## Section II: Grant of Rights
+
+ Subject to the terms and conditions of this License, the Licensor grants You a worldwide, royalty-free, non-exclusive, perpetual license to:
+ 1. **Use** the Model for non-commercial research, personal, and educational purposes.
+ 2. **Create Derivatives** of the Model for non-commercial purposes.
+ 3. **Distribute** the Model and Derivatives, provided that strictly non-commercial conditions are met and the "Community Contribution" requirements (Section V) are fulfilled.
+
+ ---
+
+ ## Section III: Usage Restrictions
+
+ Your use of the Model and Derivatives must strictly comply with the following restrictions. You agree **NOT** to use the Model or Derivatives:
+
+ 1. **Illegal Activities:** To violate any applicable law or regulation in Your jurisdiction.
+ 2. **Harmful Content:** To generate, disseminate, or promote content that creates a risk of harm, loss, physical or mental injury, emotional distress, or death to any person or animal.
+ 3. **Harassment & Hate Speech:** To generate or disseminate content that harms, defames, disparages, intimidates, or harasses individuals or groups based on race, ethnicity, religion, gender, sexual orientation, or disability.
+ 4. **Misinformation:** To generate or disseminate verifiable false information with the purpose of harming others or spreading propaganda.
+ 5. **Sexual Violence:** To generate content that depicts sexual violence, non-consensual sexual content, or deepfakes of real individuals without their consent.
+
+ ---
+
+ ## Section IV: Commercial Prohibition
+
+ **This License enforces a strict Non-Commercial policy.**
+
+ 1. **Prohibition on Model & Outputs:** You may **NOT** use the Model, Derivatives, or Output for any Commercial Use.
+ 2. **Prohibition on Derivatives (Strict Inheritance):**
+    * **Any Derivative Model** created based on this Model (including Merges and LoRAs) **inherits this Commercial Prohibition**.
+    * You may **NOT** use, sell, rent, or license any Derivative Model for commercial purposes, even if you have added your own training data or significant effort.
+    * Embedding a Derivative Model into a paid software, app, or service is strictly prohibited.
+ 3. **No Monetization:** You may not place the Model or Derivatives behind a paywall (e.g., Patreon, Ko-fi memberships, paid Discord channels) or charge fees for generation services using the Model.
+
+ ---
+
+ ## Section V: Open Source Community & Share-Alike (The "Newbie" Clause)
+
+ To ensure the growth of the open-source community, if You distribute, host, or publish the Model or any Derivatives, You **MUST** comply with the following conditions:
+
+ 1. **Open Source Requirement (Copyleft):** Any Derivative You create and distribute must be released under **this same License (Newbie Non-Commercial Community License)**. You cannot re-license a Derivative under a permissive or commercial license.
+ 2. **Transparency & Reproducibility:** You must make the following information publicly available along with the Derivative:
+    * The complete training configuration or synthesis formula.
+    * Representative prompts and workflows used to verify the model.
+ 3. **Credit:** You must prominently display the following notice: *"This model is a derivative of [Model Name] by Newbie Team, licensed under the Newbie-NC-1.0 License."*
+
+ *Failure to comply with this section automatically terminates Your rights under this License.*
+
+ ---
+
+ ## Section VI: Disclaimer of Warranty and Limitation of Liability
+
+ 1. **DISCLAIMER:** THE MODEL IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGEMENT.
+ 2. **Risk Assumption:** The Model may produce Output that is inaccurate, offensive, or otherwise objectionable. You assume all risks associated with the use of the Model.
+ 3. **Limitation of Liability:** IN NO EVENT SHALL THE LICENSOR (NEWBIE TEAM) BE LIABLE FOR ANY CLAIM, DAMAGES, OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF, OR IN CONNECTION WITH THE MODEL, DERIVATIVES, OR THE USE OR OTHER DEALINGS IN THE MODEL.
+
+ ---
+
+ **End of License**
model_index.json ADDED
@@ -0,0 +1,32 @@
+ {
+   "_class_name": "NewbiePipeline",
+   "_diffusers_version": "0.36.0.dev0",
+   "scheduler": [
+     "diffusers",
+     "FlowMatchEulerDiscreteScheduler"
+   ],
+   "text_encoder": [
+     "transformers",
+     "Gemma3Model"
+   ],
+   "text_encoder_2": [
+     "transformers",
+     "PreTrainedModel"
+   ],
+   "tokenizer": [
+     "transformers",
+     "GemmaTokenizerFast"
+   ],
+   "tokenizer_2": [
+     "transformers",
+     "XLMRobertaTokenizerFast"
+   ],
+   "transformer": [
+     "diffusers",
+     "Lumina2Transformer2DModel"
+   ],
+   "vae": [
+     "diffusers",
+     "AutoencoderKL"
+   ]
+ }
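
`model_index.json` is the pipeline manifest that diffusers reads to decide which library and class to instantiate for each subfolder. Since `_class_name` is the custom `NewbiePipeline` rather than a stock diffusers class, a safe first step is simply to inspect the manifest; a minimal sketch, assuming a local checkout of this repository:

```python
# Minimal sketch: list the pipeline components declared in model_index.json.
import json

with open("model_index.json") as f:
    index = json.load(f)

print(index["_class_name"], index["_diffusers_version"])
for name, value in index.items():
    # component entries are [library, class] pairs, e.g. ["diffusers", "AutoencoderKL"]
    if isinstance(value, list) and len(value) == 2:
        library, cls = value
        print(f"{name:>15} -> {library}.{cls}")
```
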
scheduler/scheduler_config.json ADDED
@@ -0,0 +1,16 @@
+ {
+   "_class_name": "FlowMatchEulerDiscreteScheduler",
+   "_diffusers_version": "0.33.0.dev0",
+   "base_image_seq_len": 256,
+   "base_shift": 0.5,
+   "invert_sigmas": false,
+   "max_image_seq_len": 4096,
+   "max_shift": 1.15,
+   "num_train_timesteps": 1000,
+   "shift": 6.0,
+   "shift_terminal": null,
+   "use_beta_sigmas": false,
+   "use_dynamic_shifting": false,
+   "use_exponential_sigmas": false,
+   "use_karras_sigmas": false
+ }
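
The scheduler is a flow-matching Euler scheduler with a fixed `shift` of 6.0 and dynamic shifting disabled. A minimal sketch of instantiating it from this config with diffusers; `path/to/repo` is a placeholder for a local checkout:

```python
# Minimal sketch: build the scheduler from scheduler/scheduler_config.json.
from diffusers import FlowMatchEulerDiscreteScheduler

scheduler = FlowMatchEulerDiscreteScheduler.from_pretrained(
    "path/to/repo",          # placeholder: local checkout of this repository
    subfolder="scheduler",
)
print(scheduler.config.shift)  # 6.0 per the config above
```
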
text_encoder/config.json ADDED
@@ -0,0 +1,98 @@
+ {
+   "architectures": [
+     "Gemma3Model"
+   ],
+   "boi_token_index": 255999,
+   "dtype": "bfloat16",
+   "eoi_token_index": 256000,
+   "eos_token_id": [
+     1,
+     106
+   ],
+   "image_token_index": 262144,
+   "initializer_range": 0.02,
+   "mm_tokens_per_image": 256,
+   "model_type": "gemma3",
+   "text_config": {
+     "_sliding_window_pattern": 6,
+     "attention_bias": false,
+     "attention_dropout": 0.0,
+     "attn_logit_softcapping": null,
+     "dtype": "bfloat16",
+     "final_logit_softcapping": null,
+     "head_dim": 256,
+     "hidden_activation": "gelu_pytorch_tanh",
+     "hidden_size": 2560,
+     "initializer_range": 0.02,
+     "intermediate_size": 10240,
+     "layer_types": [
+       "sliding_attention",
+       "sliding_attention",
+       "sliding_attention",
+       "sliding_attention",
+       "sliding_attention",
+       "full_attention",
+       "sliding_attention",
+       "sliding_attention",
+       "sliding_attention",
+       "sliding_attention",
+       "sliding_attention",
+       "full_attention",
+       "sliding_attention",
+       "sliding_attention",
+       "sliding_attention",
+       "sliding_attention",
+       "sliding_attention",
+       "full_attention",
+       "sliding_attention",
+       "sliding_attention",
+       "sliding_attention",
+       "sliding_attention",
+       "sliding_attention",
+       "full_attention",
+       "sliding_attention",
+       "sliding_attention",
+       "sliding_attention",
+       "sliding_attention",
+       "sliding_attention",
+       "full_attention",
+       "sliding_attention",
+       "sliding_attention",
+       "sliding_attention",
+       "sliding_attention"
+     ],
+     "max_position_embeddings": 131072,
+     "model_type": "gemma3_text",
+     "num_attention_heads": 8,
+     "num_hidden_layers": 34,
+     "num_key_value_heads": 4,
+     "query_pre_attn_scalar": 256,
+     "rms_norm_eps": 1e-06,
+     "rope_local_base_freq": 10000.0,
+     "rope_scaling": {
+       "factor": 8.0,
+       "rope_type": "linear"
+     },
+     "rope_theta": 1000000.0,
+     "sliding_window": 1024,
+     "use_bidirectional_attention": false,
+     "use_cache": true,
+     "vocab_size": 262208
+   },
+   "transformers_version": "4.57.3",
+   "vision_config": {
+     "attention_dropout": 0.0,
+     "dtype": "bfloat16",
+     "hidden_act": "gelu_pytorch_tanh",
+     "hidden_size": 1152,
+     "image_size": 896,
+     "intermediate_size": 4304,
+     "layer_norm_eps": 1e-06,
+     "model_type": "siglip_vision_model",
+     "num_attention_heads": 16,
+     "num_channels": 3,
+     "num_hidden_layers": 27,
+     "patch_size": 14,
+     "vision_use_head": false
+   }
+ }
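
The primary text encoder is a Gemma 3 model (34 text layers, hidden size 2560, 262k-token vocabulary, plus a SigLIP vision tower) stored in bfloat16; the config records transformers version 4.57.3. A minimal sketch of loading it from the `text_encoder` subfolder; `path/to/repo` is a placeholder for a local checkout:

```python
# Minimal sketch: load the Gemma3 text encoder declared in text_encoder/config.json.
import torch
from transformers import Gemma3Model

text_encoder = Gemma3Model.from_pretrained(
    "path/to/repo",            # placeholder: local checkout of this repository
    subfolder="text_encoder",
    torch_dtype=torch.bfloat16,
)
print(text_encoder.config.text_config.num_hidden_layers)  # 34
```
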
text_encoder/model-00001-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:eb5fd5e97ddd07b56778733e9653c07312529cb00980a318fc3e1c4e3b5a8f1f
+ size 4961251752
text_encoder/model-00002-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fdde0e5aa5ced0fa203b3d50f4ab78168b7e3a3e08c6349f5cc9326666e1bb13
+ size 3639026128
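
These two entries are Git LFS pointer files: each records the LFS spec version, the SHA-256 digest (`oid`) of the real shard, and its size in bytes (roughly 5.0 GB and 3.6 GB). A minimal sketch of checking a downloaded shard against its pointer metadata; the path is a placeholder for the resolved blob, not the pointer file itself:

```python
# Minimal sketch: verify a downloaded shard against its LFS pointer (oid = SHA-256, size = bytes).
import hashlib
import os

path = "text_encoder/model-00001-of-00002.safetensors"  # the real blob, not the pointer
expected_oid = "eb5fd5e97ddd07b56778733e9653c07312529cb00980a318fc3e1c4e3b5a8f1f"
expected_size = 4961251752

sha = hashlib.sha256()
with open(path, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        sha.update(chunk)

assert os.path.getsize(path) == expected_size
assert sha.hexdigest() == expected_oid
print("shard matches its LFS pointer")
```
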
text_encoder/model.safetensors.index.json ADDED
@@ -0,0 +1,891 @@
1
+ {
2
+ "metadata": {
3
+ "total_parameters": 4300079472,
4
+ "total_size": 8600158944
5
+ },
6
+ "weight_map": {
7
+ "language_model.model.embed_tokens.weight": "model-00001-of-00002.safetensors",
8
+ "language_model.model.layers.0.input_layernorm.weight": "model-00001-of-00002.safetensors",
9
+ "language_model.model.layers.0.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
10
+ "language_model.model.layers.0.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
11
+ "language_model.model.layers.0.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
12
+ "language_model.model.layers.0.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
13
+ "language_model.model.layers.0.post_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
14
+ "language_model.model.layers.0.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
15
+ "language_model.model.layers.0.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
16
+ "language_model.model.layers.0.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
17
+ "language_model.model.layers.0.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
18
+ "language_model.model.layers.0.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
19
+ "language_model.model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
20
+ "language_model.model.layers.0.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
21
+ "language_model.model.layers.1.input_layernorm.weight": "model-00001-of-00002.safetensors",
22
+ "language_model.model.layers.1.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
23
+ "language_model.model.layers.1.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
24
+ "language_model.model.layers.1.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
25
+ "language_model.model.layers.1.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
26
+ "language_model.model.layers.1.post_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
27
+ "language_model.model.layers.1.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
28
+ "language_model.model.layers.1.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
29
+ "language_model.model.layers.1.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
30
+ "language_model.model.layers.1.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
31
+ "language_model.model.layers.1.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
32
+ "language_model.model.layers.1.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
33
+ "language_model.model.layers.1.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
34
+ "language_model.model.layers.10.input_layernorm.weight": "model-00001-of-00002.safetensors",
35
+ "language_model.model.layers.10.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
36
+ "language_model.model.layers.10.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
37
+ "language_model.model.layers.10.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
38
+ "language_model.model.layers.10.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
39
+ "language_model.model.layers.10.post_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
40
+ "language_model.model.layers.10.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
41
+ "language_model.model.layers.10.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
42
+ "language_model.model.layers.10.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
43
+ "language_model.model.layers.10.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
44
+ "language_model.model.layers.10.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
45
+ "language_model.model.layers.10.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
46
+ "language_model.model.layers.10.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
47
+ "language_model.model.layers.11.input_layernorm.weight": "model-00001-of-00002.safetensors",
48
+ "language_model.model.layers.11.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
49
+ "language_model.model.layers.11.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
50
+ "language_model.model.layers.11.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
51
+ "language_model.model.layers.11.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
52
+ "language_model.model.layers.11.post_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
53
+ "language_model.model.layers.11.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
54
+ "language_model.model.layers.11.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
55
+ "language_model.model.layers.11.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
56
+ "language_model.model.layers.11.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
57
+ "language_model.model.layers.11.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
58
+ "language_model.model.layers.11.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
59
+ "language_model.model.layers.11.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
60
+ "language_model.model.layers.12.input_layernorm.weight": "model-00001-of-00002.safetensors",
61
+ "language_model.model.layers.12.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
62
+ "language_model.model.layers.12.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
63
+ "language_model.model.layers.12.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
64
+ "language_model.model.layers.12.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
65
+ "language_model.model.layers.12.post_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
66
+ "language_model.model.layers.12.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
67
+ "language_model.model.layers.12.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
68
+ "language_model.model.layers.12.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
69
+ "language_model.model.layers.12.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
70
+ "language_model.model.layers.12.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
71
+ "language_model.model.layers.12.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
72
+ "language_model.model.layers.12.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
73
+ "language_model.model.layers.13.input_layernorm.weight": "model-00001-of-00002.safetensors",
74
+ "language_model.model.layers.13.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
75
+ "language_model.model.layers.13.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
76
+ "language_model.model.layers.13.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
77
+ "language_model.model.layers.13.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
78
+ "language_model.model.layers.13.post_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
79
+ "language_model.model.layers.13.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
80
+ "language_model.model.layers.13.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
81
+ "language_model.model.layers.13.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
82
+ "language_model.model.layers.13.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
83
+ "language_model.model.layers.13.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
84
+ "language_model.model.layers.13.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
85
+ "language_model.model.layers.13.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
86
+ "language_model.model.layers.14.input_layernorm.weight": "model-00002-of-00002.safetensors",
87
+ "language_model.model.layers.14.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
88
+ "language_model.model.layers.14.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
89
+ "language_model.model.layers.14.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
90
+ "language_model.model.layers.14.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
91
+ "language_model.model.layers.14.post_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
92
+ "language_model.model.layers.14.pre_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
93
+ "language_model.model.layers.14.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
94
+ "language_model.model.layers.14.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
95
+ "language_model.model.layers.14.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
96
+ "language_model.model.layers.14.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
97
+ "language_model.model.layers.14.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
98
+ "language_model.model.layers.14.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
99
+ "language_model.model.layers.15.input_layernorm.weight": "model-00002-of-00002.safetensors",
100
+ "language_model.model.layers.15.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
101
+ "language_model.model.layers.15.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
102
+ "language_model.model.layers.15.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
103
+ "language_model.model.layers.15.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
104
+ "language_model.model.layers.15.post_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
105
+ "language_model.model.layers.15.pre_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
106
+ "language_model.model.layers.15.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
107
+ "language_model.model.layers.15.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
108
+ "language_model.model.layers.15.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
109
+ "language_model.model.layers.15.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
110
+ "language_model.model.layers.15.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
111
+ "language_model.model.layers.15.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
112
+ "language_model.model.layers.16.input_layernorm.weight": "model-00002-of-00002.safetensors",
113
+ "language_model.model.layers.16.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
114
+ "language_model.model.layers.16.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
115
+ "language_model.model.layers.16.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
116
+ "language_model.model.layers.16.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
117
+ "language_model.model.layers.16.post_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
118
+ "language_model.model.layers.16.pre_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
119
+ "language_model.model.layers.16.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
120
+ "language_model.model.layers.16.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
121
+ "language_model.model.layers.16.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
122
+ "language_model.model.layers.16.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
123
+ "language_model.model.layers.16.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
124
+ "language_model.model.layers.16.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
125
+ "language_model.model.layers.17.input_layernorm.weight": "model-00002-of-00002.safetensors",
126
+ "language_model.model.layers.17.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
127
+ "language_model.model.layers.17.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
128
+ "language_model.model.layers.17.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
129
+ "language_model.model.layers.17.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
130
+ "language_model.model.layers.17.post_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
131
+ "language_model.model.layers.17.pre_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
132
+ "language_model.model.layers.17.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
133
+ "language_model.model.layers.17.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
134
+ "language_model.model.layers.17.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
135
+ "language_model.model.layers.17.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
136
+ "language_model.model.layers.17.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
137
+ "language_model.model.layers.17.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
138
+ "language_model.model.layers.18.input_layernorm.weight": "model-00002-of-00002.safetensors",
139
+ "language_model.model.layers.18.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
140
+ "language_model.model.layers.18.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
141
+ "language_model.model.layers.18.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
142
+ "language_model.model.layers.18.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
143
+ "language_model.model.layers.18.post_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
144
+ "language_model.model.layers.18.pre_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
145
+ "language_model.model.layers.18.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
146
+ "language_model.model.layers.18.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
147
+ "language_model.model.layers.18.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
148
+ "language_model.model.layers.18.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
149
+ "language_model.model.layers.18.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
150
+ "language_model.model.layers.18.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
151
+ "language_model.model.layers.19.input_layernorm.weight": "model-00002-of-00002.safetensors",
152
+ "language_model.model.layers.19.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
153
+ "language_model.model.layers.19.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
154
+ "language_model.model.layers.19.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
155
+ "language_model.model.layers.19.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
156
+ "language_model.model.layers.19.post_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
157
+ "language_model.model.layers.19.pre_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
158
+ "language_model.model.layers.19.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
159
+ "language_model.model.layers.19.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
160
+ "language_model.model.layers.19.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
161
+ "language_model.model.layers.19.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
162
+ "language_model.model.layers.19.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
163
+ "language_model.model.layers.19.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
164
+ "language_model.model.layers.2.input_layernorm.weight": "model-00001-of-00002.safetensors",
165
+ "language_model.model.layers.2.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
166
+ "language_model.model.layers.2.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
167
+ "language_model.model.layers.2.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
168
+ "language_model.model.layers.2.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
169
+ "language_model.model.layers.2.post_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
170
+ "language_model.model.layers.2.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
171
+ "language_model.model.layers.2.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
172
+ "language_model.model.layers.2.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
173
+ "language_model.model.layers.2.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
174
+ "language_model.model.layers.2.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
175
+ "language_model.model.layers.2.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
176
+ "language_model.model.layers.2.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
177
+ "language_model.model.layers.20.input_layernorm.weight": "model-00002-of-00002.safetensors",
178
+ "language_model.model.layers.20.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
179
+ "language_model.model.layers.20.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
180
+ "language_model.model.layers.20.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
181
+ "language_model.model.layers.20.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
182
+ "language_model.model.layers.20.post_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
183
+ "language_model.model.layers.20.pre_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
184
+ "language_model.model.layers.20.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
185
+ "language_model.model.layers.20.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
186
+ "language_model.model.layers.20.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
187
+ "language_model.model.layers.20.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
188
+ "language_model.model.layers.20.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
189
+ "language_model.model.layers.20.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
190
+ "language_model.model.layers.21.input_layernorm.weight": "model-00002-of-00002.safetensors",
191
+ "language_model.model.layers.21.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
192
+ "language_model.model.layers.21.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
193
+ "language_model.model.layers.21.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
194
+ "language_model.model.layers.21.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
195
+ "language_model.model.layers.21.post_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
196
+ "language_model.model.layers.21.pre_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
197
+ "language_model.model.layers.21.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
198
+ "language_model.model.layers.21.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
199
+ "language_model.model.layers.21.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
200
+ "language_model.model.layers.21.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
201
+ "language_model.model.layers.21.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
202
+ "language_model.model.layers.21.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
203
+ "language_model.model.layers.22.input_layernorm.weight": "model-00002-of-00002.safetensors",
204
+ "language_model.model.layers.22.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
205
+ "language_model.model.layers.22.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
206
+ "language_model.model.layers.22.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
207
+ "language_model.model.layers.22.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
208
+ "language_model.model.layers.22.post_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
209
+ "language_model.model.layers.22.pre_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
210
+ "language_model.model.layers.22.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
211
+ "language_model.model.layers.22.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
212
+ "language_model.model.layers.22.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
213
+ "language_model.model.layers.22.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
214
+ "language_model.model.layers.22.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
215
+ "language_model.model.layers.22.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
216
+ "language_model.model.layers.23.input_layernorm.weight": "model-00002-of-00002.safetensors",
217
+ "language_model.model.layers.23.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
218
+ "language_model.model.layers.23.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
219
+ "language_model.model.layers.23.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
220
+ "language_model.model.layers.23.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
221
+ "language_model.model.layers.23.post_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
222
+ "language_model.model.layers.23.pre_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
223
+ "language_model.model.layers.23.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
224
+ "language_model.model.layers.23.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
225
+ "language_model.model.layers.23.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
226
+ "language_model.model.layers.23.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
227
+ "language_model.model.layers.23.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
228
+ "language_model.model.layers.23.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
229
+ "language_model.model.layers.24.input_layernorm.weight": "model-00002-of-00002.safetensors",
230
+ "language_model.model.layers.24.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
231
+ "language_model.model.layers.24.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
232
+ "language_model.model.layers.24.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
233
+ "language_model.model.layers.24.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
234
+ "language_model.model.layers.24.post_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
235
+ "language_model.model.layers.24.pre_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
236
+ "language_model.model.layers.24.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
237
+ "language_model.model.layers.24.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
238
+ "language_model.model.layers.24.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
239
+ "language_model.model.layers.24.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
240
+ "language_model.model.layers.24.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
241
+ "language_model.model.layers.24.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
242
+ "language_model.model.layers.25.input_layernorm.weight": "model-00002-of-00002.safetensors",
243
+ "language_model.model.layers.25.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
244
+ "language_model.model.layers.25.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
245
+ "language_model.model.layers.25.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
246
+ "language_model.model.layers.25.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
247
+ "language_model.model.layers.25.post_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
248
+ "language_model.model.layers.25.pre_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
249
+ "language_model.model.layers.25.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
250
+ "language_model.model.layers.25.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
251
+ "language_model.model.layers.25.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
252
+ "language_model.model.layers.25.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
253
+ "language_model.model.layers.25.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
254
+ "language_model.model.layers.25.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
255
+ "language_model.model.layers.26.input_layernorm.weight": "model-00002-of-00002.safetensors",
256
+ "language_model.model.layers.26.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
257
+ "language_model.model.layers.26.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
258
+ "language_model.model.layers.26.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
259
+ "language_model.model.layers.26.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
260
+ "language_model.model.layers.26.post_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
261
+ "language_model.model.layers.26.pre_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
262
+ "language_model.model.layers.26.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
263
+ "language_model.model.layers.26.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
264
+ "language_model.model.layers.26.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
265
+ "language_model.model.layers.26.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
266
+ "language_model.model.layers.26.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
267
+ "language_model.model.layers.26.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
268
+ "language_model.model.layers.27.input_layernorm.weight": "model-00002-of-00002.safetensors",
269
+ "language_model.model.layers.27.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
270
+ "language_model.model.layers.27.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
271
+ "language_model.model.layers.27.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
272
+ "language_model.model.layers.27.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
273
+ "language_model.model.layers.27.post_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
274
+ "language_model.model.layers.27.pre_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
275
+ "language_model.model.layers.27.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
276
+ "language_model.model.layers.27.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
277
+ "language_model.model.layers.27.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
278
+ "language_model.model.layers.27.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
279
+ "language_model.model.layers.27.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
280
+ "language_model.model.layers.27.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
281
+ "language_model.model.layers.28.input_layernorm.weight": "model-00002-of-00002.safetensors",
282
+ "language_model.model.layers.28.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
283
+ "language_model.model.layers.28.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
284
+ "language_model.model.layers.28.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
285
+ "language_model.model.layers.28.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
286
+ "language_model.model.layers.28.post_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
287
+ "language_model.model.layers.28.pre_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
288
+ "language_model.model.layers.28.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
289
+ "language_model.model.layers.28.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
290
+ "language_model.model.layers.28.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
291
+ "language_model.model.layers.28.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
292
+ "language_model.model.layers.28.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
293
+ "language_model.model.layers.28.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
294
+ "language_model.model.layers.29.input_layernorm.weight": "model-00002-of-00002.safetensors",
295
+ "language_model.model.layers.29.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
296
+ "language_model.model.layers.29.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
297
+ "language_model.model.layers.29.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
298
+ "language_model.model.layers.29.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
299
+ "language_model.model.layers.29.post_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
300
+ "language_model.model.layers.29.pre_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
301
+ "language_model.model.layers.29.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
302
+ "language_model.model.layers.29.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
303
+ "language_model.model.layers.29.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
304
+ "language_model.model.layers.29.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
305
+ "language_model.model.layers.29.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
306
+ "language_model.model.layers.29.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
307
+ "language_model.model.layers.3.input_layernorm.weight": "model-00001-of-00002.safetensors",
308
+ "language_model.model.layers.3.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
309
+ "language_model.model.layers.3.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
310
+ "language_model.model.layers.3.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
311
+ "language_model.model.layers.3.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
312
+ "language_model.model.layers.3.post_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
313
+ "language_model.model.layers.3.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
314
+ "language_model.model.layers.3.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
315
+ "language_model.model.layers.3.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
316
+ "language_model.model.layers.3.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
317
+ "language_model.model.layers.3.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
318
+ "language_model.model.layers.3.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
319
+ "language_model.model.layers.3.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
320
+ "language_model.model.layers.30.input_layernorm.weight": "model-00002-of-00002.safetensors",
321
+ "language_model.model.layers.30.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
322
+ "language_model.model.layers.30.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
323
+ "language_model.model.layers.30.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
324
+ "language_model.model.layers.30.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
325
+ "language_model.model.layers.30.post_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
326
+ "language_model.model.layers.30.pre_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
327
+ "language_model.model.layers.30.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
328
+ "language_model.model.layers.30.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
329
+ "language_model.model.layers.30.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
330
+ "language_model.model.layers.30.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
331
+ "language_model.model.layers.30.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
332
+ "language_model.model.layers.30.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
333
+ "language_model.model.layers.31.input_layernorm.weight": "model-00002-of-00002.safetensors",
334
+ "language_model.model.layers.31.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
335
+ "language_model.model.layers.31.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
336
+ "language_model.model.layers.31.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
337
+ "language_model.model.layers.31.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
338
+ "language_model.model.layers.31.post_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
339
+ "language_model.model.layers.31.pre_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
340
+ "language_model.model.layers.31.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
341
+ "language_model.model.layers.31.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
342
+ "language_model.model.layers.31.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
343
+ "language_model.model.layers.31.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
344
+ "language_model.model.layers.31.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
345
+ "language_model.model.layers.31.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
346
+ "language_model.model.layers.32.input_layernorm.weight": "model-00002-of-00002.safetensors",
347
+ "language_model.model.layers.32.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
348
+ "language_model.model.layers.32.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
349
+ "language_model.model.layers.32.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
350
+ "language_model.model.layers.32.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
351
+ "language_model.model.layers.32.post_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
352
+ "language_model.model.layers.32.pre_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
353
+ "language_model.model.layers.32.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
354
+ "language_model.model.layers.32.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
355
+ "language_model.model.layers.32.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
356
+ "language_model.model.layers.32.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
357
+ "language_model.model.layers.32.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
358
+ "language_model.model.layers.32.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
359
+ "language_model.model.layers.33.input_layernorm.weight": "model-00002-of-00002.safetensors",
360
+ "language_model.model.layers.33.mlp.down_proj.weight": "model-00002-of-00002.safetensors",
361
+ "language_model.model.layers.33.mlp.gate_proj.weight": "model-00002-of-00002.safetensors",
362
+ "language_model.model.layers.33.mlp.up_proj.weight": "model-00002-of-00002.safetensors",
363
+ "language_model.model.layers.33.post_attention_layernorm.weight": "model-00002-of-00002.safetensors",
364
+ "language_model.model.layers.33.post_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
365
+ "language_model.model.layers.33.pre_feedforward_layernorm.weight": "model-00002-of-00002.safetensors",
366
+ "language_model.model.layers.33.self_attn.k_norm.weight": "model-00002-of-00002.safetensors",
367
+ "language_model.model.layers.33.self_attn.k_proj.weight": "model-00002-of-00002.safetensors",
368
+ "language_model.model.layers.33.self_attn.o_proj.weight": "model-00002-of-00002.safetensors",
369
+ "language_model.model.layers.33.self_attn.q_norm.weight": "model-00002-of-00002.safetensors",
370
+ "language_model.model.layers.33.self_attn.q_proj.weight": "model-00002-of-00002.safetensors",
371
+ "language_model.model.layers.33.self_attn.v_proj.weight": "model-00002-of-00002.safetensors",
372
+ "language_model.model.layers.4.input_layernorm.weight": "model-00001-of-00002.safetensors",
373
+ "language_model.model.layers.4.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
374
+ "language_model.model.layers.4.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
375
+ "language_model.model.layers.4.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
376
+ "language_model.model.layers.4.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
377
+ "language_model.model.layers.4.post_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
378
+ "language_model.model.layers.4.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
379
+ "language_model.model.layers.4.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
380
+ "language_model.model.layers.4.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
381
+ "language_model.model.layers.4.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
382
+ "language_model.model.layers.4.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
383
+ "language_model.model.layers.4.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
384
+ "language_model.model.layers.4.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
385
+ "language_model.model.layers.5.input_layernorm.weight": "model-00001-of-00002.safetensors",
386
+ "language_model.model.layers.5.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
387
+ "language_model.model.layers.5.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
388
+ "language_model.model.layers.5.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
389
+ "language_model.model.layers.5.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
390
+ "language_model.model.layers.5.post_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
391
+ "language_model.model.layers.5.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
392
+ "language_model.model.layers.5.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
393
+ "language_model.model.layers.5.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
394
+ "language_model.model.layers.5.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
395
+ "language_model.model.layers.5.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
396
+ "language_model.model.layers.5.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
397
+ "language_model.model.layers.5.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
398
+ "language_model.model.layers.6.input_layernorm.weight": "model-00001-of-00002.safetensors",
399
+ "language_model.model.layers.6.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
400
+ "language_model.model.layers.6.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
401
+ "language_model.model.layers.6.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
402
+ "language_model.model.layers.6.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
403
+ "language_model.model.layers.6.post_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
404
+ "language_model.model.layers.6.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
405
+ "language_model.model.layers.6.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
406
+ "language_model.model.layers.6.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
407
+ "language_model.model.layers.6.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
408
+ "language_model.model.layers.6.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
409
+ "language_model.model.layers.6.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
410
+ "language_model.model.layers.6.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
411
+ "language_model.model.layers.7.input_layernorm.weight": "model-00001-of-00002.safetensors",
412
+ "language_model.model.layers.7.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
413
+ "language_model.model.layers.7.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
414
+ "language_model.model.layers.7.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
415
+ "language_model.model.layers.7.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
416
+ "language_model.model.layers.7.post_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
417
+ "language_model.model.layers.7.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
418
+ "language_model.model.layers.7.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
419
+ "language_model.model.layers.7.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
420
+ "language_model.model.layers.7.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
421
+ "language_model.model.layers.7.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
422
+ "language_model.model.layers.7.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
423
+ "language_model.model.layers.7.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
424
+ "language_model.model.layers.8.input_layernorm.weight": "model-00001-of-00002.safetensors",
425
+ "language_model.model.layers.8.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
426
+ "language_model.model.layers.8.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
427
+ "language_model.model.layers.8.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
428
+ "language_model.model.layers.8.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
429
+ "language_model.model.layers.8.post_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
430
+ "language_model.model.layers.8.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
431
+ "language_model.model.layers.8.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
432
+ "language_model.model.layers.8.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
433
+ "language_model.model.layers.8.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
434
+ "language_model.model.layers.8.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
435
+ "language_model.model.layers.8.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
436
+ "language_model.model.layers.8.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
437
+ "language_model.model.layers.9.input_layernorm.weight": "model-00001-of-00002.safetensors",
438
+ "language_model.model.layers.9.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
439
+ "language_model.model.layers.9.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
440
+ "language_model.model.layers.9.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
441
+ "language_model.model.layers.9.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
442
+ "language_model.model.layers.9.post_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
443
+ "language_model.model.layers.9.pre_feedforward_layernorm.weight": "model-00001-of-00002.safetensors",
444
+ "language_model.model.layers.9.self_attn.k_norm.weight": "model-00001-of-00002.safetensors",
445
+ "language_model.model.layers.9.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
446
+ "language_model.model.layers.9.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
447
+ "language_model.model.layers.9.self_attn.q_norm.weight": "model-00001-of-00002.safetensors",
448
+ "language_model.model.layers.9.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
449
+ "language_model.model.layers.9.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
450
+ "language_model.model.norm.weight": "model-00002-of-00002.safetensors",
451
+ "multi_modal_projector.mm_input_projection_weight": "model-00001-of-00002.safetensors",
452
+ "multi_modal_projector.mm_soft_emb_norm.weight": "model-00001-of-00002.safetensors",
453
+ "vision_tower.vision_model.embeddings.patch_embedding.bias": "model-00001-of-00002.safetensors",
454
+ "vision_tower.vision_model.embeddings.patch_embedding.weight": "model-00001-of-00002.safetensors",
455
+ "vision_tower.vision_model.embeddings.position_embedding.weight": "model-00001-of-00002.safetensors",
456
+ "vision_tower.vision_model.encoder.layers.0.layer_norm1.bias": "model-00001-of-00002.safetensors",
457
+ "vision_tower.vision_model.encoder.layers.0.layer_norm1.weight": "model-00001-of-00002.safetensors",
458
+ "vision_tower.vision_model.encoder.layers.0.layer_norm2.bias": "model-00001-of-00002.safetensors",
459
+ "vision_tower.vision_model.encoder.layers.0.layer_norm2.weight": "model-00001-of-00002.safetensors",
460
+ "vision_tower.vision_model.encoder.layers.0.mlp.fc1.bias": "model-00001-of-00002.safetensors",
461
+ "vision_tower.vision_model.encoder.layers.0.mlp.fc1.weight": "model-00001-of-00002.safetensors",
462
+ "vision_tower.vision_model.encoder.layers.0.mlp.fc2.bias": "model-00001-of-00002.safetensors",
463
+ "vision_tower.vision_model.encoder.layers.0.mlp.fc2.weight": "model-00001-of-00002.safetensors",
464
+ "vision_tower.vision_model.encoder.layers.0.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
465
+ "vision_tower.vision_model.encoder.layers.0.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
466
+ "vision_tower.vision_model.encoder.layers.0.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
467
+ "vision_tower.vision_model.encoder.layers.0.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
468
+ "vision_tower.vision_model.encoder.layers.0.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
469
+ "vision_tower.vision_model.encoder.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
470
+ "vision_tower.vision_model.encoder.layers.0.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
471
+ "vision_tower.vision_model.encoder.layers.0.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
472
+ "vision_tower.vision_model.encoder.layers.1.layer_norm1.bias": "model-00001-of-00002.safetensors",
473
+ "vision_tower.vision_model.encoder.layers.1.layer_norm1.weight": "model-00001-of-00002.safetensors",
474
+ "vision_tower.vision_model.encoder.layers.1.layer_norm2.bias": "model-00001-of-00002.safetensors",
475
+ "vision_tower.vision_model.encoder.layers.1.layer_norm2.weight": "model-00001-of-00002.safetensors",
476
+ "vision_tower.vision_model.encoder.layers.1.mlp.fc1.bias": "model-00001-of-00002.safetensors",
477
+ "vision_tower.vision_model.encoder.layers.1.mlp.fc1.weight": "model-00001-of-00002.safetensors",
478
+ "vision_tower.vision_model.encoder.layers.1.mlp.fc2.bias": "model-00001-of-00002.safetensors",
479
+ "vision_tower.vision_model.encoder.layers.1.mlp.fc2.weight": "model-00001-of-00002.safetensors",
480
+ "vision_tower.vision_model.encoder.layers.1.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
481
+ "vision_tower.vision_model.encoder.layers.1.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
482
+ "vision_tower.vision_model.encoder.layers.1.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
483
+ "vision_tower.vision_model.encoder.layers.1.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
484
+ "vision_tower.vision_model.encoder.layers.1.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
485
+ "vision_tower.vision_model.encoder.layers.1.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
486
+ "vision_tower.vision_model.encoder.layers.1.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
487
+ "vision_tower.vision_model.encoder.layers.1.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
488
+ "vision_tower.vision_model.encoder.layers.10.layer_norm1.bias": "model-00001-of-00002.safetensors",
489
+ "vision_tower.vision_model.encoder.layers.10.layer_norm1.weight": "model-00001-of-00002.safetensors",
490
+ "vision_tower.vision_model.encoder.layers.10.layer_norm2.bias": "model-00001-of-00002.safetensors",
491
+ "vision_tower.vision_model.encoder.layers.10.layer_norm2.weight": "model-00001-of-00002.safetensors",
492
+ "vision_tower.vision_model.encoder.layers.10.mlp.fc1.bias": "model-00001-of-00002.safetensors",
493
+ "vision_tower.vision_model.encoder.layers.10.mlp.fc1.weight": "model-00001-of-00002.safetensors",
494
+ "vision_tower.vision_model.encoder.layers.10.mlp.fc2.bias": "model-00001-of-00002.safetensors",
495
+ "vision_tower.vision_model.encoder.layers.10.mlp.fc2.weight": "model-00001-of-00002.safetensors",
496
+ "vision_tower.vision_model.encoder.layers.10.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
497
+ "vision_tower.vision_model.encoder.layers.10.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
498
+ "vision_tower.vision_model.encoder.layers.10.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
499
+ "vision_tower.vision_model.encoder.layers.10.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
500
+ "vision_tower.vision_model.encoder.layers.10.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
501
+ "vision_tower.vision_model.encoder.layers.10.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
502
+ "vision_tower.vision_model.encoder.layers.10.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
503
+ "vision_tower.vision_model.encoder.layers.10.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
504
+ "vision_tower.vision_model.encoder.layers.11.layer_norm1.bias": "model-00001-of-00002.safetensors",
505
+ "vision_tower.vision_model.encoder.layers.11.layer_norm1.weight": "model-00001-of-00002.safetensors",
506
+ "vision_tower.vision_model.encoder.layers.11.layer_norm2.bias": "model-00001-of-00002.safetensors",
507
+ "vision_tower.vision_model.encoder.layers.11.layer_norm2.weight": "model-00001-of-00002.safetensors",
508
+ "vision_tower.vision_model.encoder.layers.11.mlp.fc1.bias": "model-00001-of-00002.safetensors",
509
+ "vision_tower.vision_model.encoder.layers.11.mlp.fc1.weight": "model-00001-of-00002.safetensors",
510
+ "vision_tower.vision_model.encoder.layers.11.mlp.fc2.bias": "model-00001-of-00002.safetensors",
511
+ "vision_tower.vision_model.encoder.layers.11.mlp.fc2.weight": "model-00001-of-00002.safetensors",
512
+ "vision_tower.vision_model.encoder.layers.11.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
513
+ "vision_tower.vision_model.encoder.layers.11.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
514
+ "vision_tower.vision_model.encoder.layers.11.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
515
+ "vision_tower.vision_model.encoder.layers.11.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
516
+ "vision_tower.vision_model.encoder.layers.11.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
517
+ "vision_tower.vision_model.encoder.layers.11.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
518
+ "vision_tower.vision_model.encoder.layers.11.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
519
+ "vision_tower.vision_model.encoder.layers.11.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
520
+ "vision_tower.vision_model.encoder.layers.12.layer_norm1.bias": "model-00001-of-00002.safetensors",
521
+ "vision_tower.vision_model.encoder.layers.12.layer_norm1.weight": "model-00001-of-00002.safetensors",
522
+ "vision_tower.vision_model.encoder.layers.12.layer_norm2.bias": "model-00001-of-00002.safetensors",
523
+ "vision_tower.vision_model.encoder.layers.12.layer_norm2.weight": "model-00001-of-00002.safetensors",
524
+ "vision_tower.vision_model.encoder.layers.12.mlp.fc1.bias": "model-00001-of-00002.safetensors",
525
+ "vision_tower.vision_model.encoder.layers.12.mlp.fc1.weight": "model-00001-of-00002.safetensors",
526
+ "vision_tower.vision_model.encoder.layers.12.mlp.fc2.bias": "model-00001-of-00002.safetensors",
527
+ "vision_tower.vision_model.encoder.layers.12.mlp.fc2.weight": "model-00001-of-00002.safetensors",
528
+ "vision_tower.vision_model.encoder.layers.12.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
529
+ "vision_tower.vision_model.encoder.layers.12.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
530
+ "vision_tower.vision_model.encoder.layers.12.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
531
+ "vision_tower.vision_model.encoder.layers.12.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
532
+ "vision_tower.vision_model.encoder.layers.12.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
533
+ "vision_tower.vision_model.encoder.layers.12.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
534
+ "vision_tower.vision_model.encoder.layers.12.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
535
+ "vision_tower.vision_model.encoder.layers.12.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
536
+ "vision_tower.vision_model.encoder.layers.13.layer_norm1.bias": "model-00001-of-00002.safetensors",
537
+ "vision_tower.vision_model.encoder.layers.13.layer_norm1.weight": "model-00001-of-00002.safetensors",
538
+ "vision_tower.vision_model.encoder.layers.13.layer_norm2.bias": "model-00001-of-00002.safetensors",
539
+ "vision_tower.vision_model.encoder.layers.13.layer_norm2.weight": "model-00001-of-00002.safetensors",
540
+ "vision_tower.vision_model.encoder.layers.13.mlp.fc1.bias": "model-00001-of-00002.safetensors",
541
+ "vision_tower.vision_model.encoder.layers.13.mlp.fc1.weight": "model-00001-of-00002.safetensors",
542
+ "vision_tower.vision_model.encoder.layers.13.mlp.fc2.bias": "model-00001-of-00002.safetensors",
543
+ "vision_tower.vision_model.encoder.layers.13.mlp.fc2.weight": "model-00001-of-00002.safetensors",
544
+ "vision_tower.vision_model.encoder.layers.13.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
545
+ "vision_tower.vision_model.encoder.layers.13.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
546
+ "vision_tower.vision_model.encoder.layers.13.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
547
+ "vision_tower.vision_model.encoder.layers.13.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
548
+ "vision_tower.vision_model.encoder.layers.13.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
549
+ "vision_tower.vision_model.encoder.layers.13.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
550
+ "vision_tower.vision_model.encoder.layers.13.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
551
+ "vision_tower.vision_model.encoder.layers.13.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
552
+ "vision_tower.vision_model.encoder.layers.14.layer_norm1.bias": "model-00001-of-00002.safetensors",
553
+ "vision_tower.vision_model.encoder.layers.14.layer_norm1.weight": "model-00001-of-00002.safetensors",
554
+ "vision_tower.vision_model.encoder.layers.14.layer_norm2.bias": "model-00001-of-00002.safetensors",
555
+ "vision_tower.vision_model.encoder.layers.14.layer_norm2.weight": "model-00001-of-00002.safetensors",
556
+ "vision_tower.vision_model.encoder.layers.14.mlp.fc1.bias": "model-00001-of-00002.safetensors",
557
+ "vision_tower.vision_model.encoder.layers.14.mlp.fc1.weight": "model-00001-of-00002.safetensors",
558
+ "vision_tower.vision_model.encoder.layers.14.mlp.fc2.bias": "model-00001-of-00002.safetensors",
559
+ "vision_tower.vision_model.encoder.layers.14.mlp.fc2.weight": "model-00001-of-00002.safetensors",
560
+ "vision_tower.vision_model.encoder.layers.14.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
561
+ "vision_tower.vision_model.encoder.layers.14.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
562
+ "vision_tower.vision_model.encoder.layers.14.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
563
+ "vision_tower.vision_model.encoder.layers.14.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
564
+ "vision_tower.vision_model.encoder.layers.14.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
565
+ "vision_tower.vision_model.encoder.layers.14.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
566
+ "vision_tower.vision_model.encoder.layers.14.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
567
+ "vision_tower.vision_model.encoder.layers.14.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
568
+ "vision_tower.vision_model.encoder.layers.15.layer_norm1.bias": "model-00001-of-00002.safetensors",
569
+ "vision_tower.vision_model.encoder.layers.15.layer_norm1.weight": "model-00001-of-00002.safetensors",
570
+ "vision_tower.vision_model.encoder.layers.15.layer_norm2.bias": "model-00001-of-00002.safetensors",
571
+ "vision_tower.vision_model.encoder.layers.15.layer_norm2.weight": "model-00001-of-00002.safetensors",
572
+ "vision_tower.vision_model.encoder.layers.15.mlp.fc1.bias": "model-00001-of-00002.safetensors",
573
+ "vision_tower.vision_model.encoder.layers.15.mlp.fc1.weight": "model-00001-of-00002.safetensors",
574
+ "vision_tower.vision_model.encoder.layers.15.mlp.fc2.bias": "model-00001-of-00002.safetensors",
575
+ "vision_tower.vision_model.encoder.layers.15.mlp.fc2.weight": "model-00001-of-00002.safetensors",
576
+ "vision_tower.vision_model.encoder.layers.15.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
577
+ "vision_tower.vision_model.encoder.layers.15.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
578
+ "vision_tower.vision_model.encoder.layers.15.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
579
+ "vision_tower.vision_model.encoder.layers.15.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
580
+ "vision_tower.vision_model.encoder.layers.15.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
581
+ "vision_tower.vision_model.encoder.layers.15.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
582
+ "vision_tower.vision_model.encoder.layers.15.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
583
+ "vision_tower.vision_model.encoder.layers.15.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
584
+ "vision_tower.vision_model.encoder.layers.16.layer_norm1.bias": "model-00001-of-00002.safetensors",
585
+ "vision_tower.vision_model.encoder.layers.16.layer_norm1.weight": "model-00001-of-00002.safetensors",
586
+ "vision_tower.vision_model.encoder.layers.16.layer_norm2.bias": "model-00001-of-00002.safetensors",
587
+ "vision_tower.vision_model.encoder.layers.16.layer_norm2.weight": "model-00001-of-00002.safetensors",
588
+ "vision_tower.vision_model.encoder.layers.16.mlp.fc1.bias": "model-00001-of-00002.safetensors",
589
+ "vision_tower.vision_model.encoder.layers.16.mlp.fc1.weight": "model-00001-of-00002.safetensors",
590
+ "vision_tower.vision_model.encoder.layers.16.mlp.fc2.bias": "model-00001-of-00002.safetensors",
591
+ "vision_tower.vision_model.encoder.layers.16.mlp.fc2.weight": "model-00001-of-00002.safetensors",
592
+ "vision_tower.vision_model.encoder.layers.16.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
593
+ "vision_tower.vision_model.encoder.layers.16.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
594
+ "vision_tower.vision_model.encoder.layers.16.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
595
+ "vision_tower.vision_model.encoder.layers.16.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
596
+ "vision_tower.vision_model.encoder.layers.16.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
597
+ "vision_tower.vision_model.encoder.layers.16.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
598
+ "vision_tower.vision_model.encoder.layers.16.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
599
+ "vision_tower.vision_model.encoder.layers.16.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
600
+ "vision_tower.vision_model.encoder.layers.17.layer_norm1.bias": "model-00001-of-00002.safetensors",
601
+ "vision_tower.vision_model.encoder.layers.17.layer_norm1.weight": "model-00001-of-00002.safetensors",
602
+ "vision_tower.vision_model.encoder.layers.17.layer_norm2.bias": "model-00001-of-00002.safetensors",
603
+ "vision_tower.vision_model.encoder.layers.17.layer_norm2.weight": "model-00001-of-00002.safetensors",
604
+ "vision_tower.vision_model.encoder.layers.17.mlp.fc1.bias": "model-00001-of-00002.safetensors",
605
+ "vision_tower.vision_model.encoder.layers.17.mlp.fc1.weight": "model-00001-of-00002.safetensors",
606
+ "vision_tower.vision_model.encoder.layers.17.mlp.fc2.bias": "model-00001-of-00002.safetensors",
607
+ "vision_tower.vision_model.encoder.layers.17.mlp.fc2.weight": "model-00001-of-00002.safetensors",
608
+ "vision_tower.vision_model.encoder.layers.17.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
609
+ "vision_tower.vision_model.encoder.layers.17.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
610
+ "vision_tower.vision_model.encoder.layers.17.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
611
+ "vision_tower.vision_model.encoder.layers.17.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
612
+ "vision_tower.vision_model.encoder.layers.17.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
613
+ "vision_tower.vision_model.encoder.layers.17.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
614
+ "vision_tower.vision_model.encoder.layers.17.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
615
+ "vision_tower.vision_model.encoder.layers.17.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
616
+ "vision_tower.vision_model.encoder.layers.18.layer_norm1.bias": "model-00001-of-00002.safetensors",
617
+ "vision_tower.vision_model.encoder.layers.18.layer_norm1.weight": "model-00001-of-00002.safetensors",
618
+ "vision_tower.vision_model.encoder.layers.18.layer_norm2.bias": "model-00001-of-00002.safetensors",
619
+ "vision_tower.vision_model.encoder.layers.18.layer_norm2.weight": "model-00001-of-00002.safetensors",
620
+ "vision_tower.vision_model.encoder.layers.18.mlp.fc1.bias": "model-00001-of-00002.safetensors",
621
+ "vision_tower.vision_model.encoder.layers.18.mlp.fc1.weight": "model-00001-of-00002.safetensors",
622
+ "vision_tower.vision_model.encoder.layers.18.mlp.fc2.bias": "model-00001-of-00002.safetensors",
623
+ "vision_tower.vision_model.encoder.layers.18.mlp.fc2.weight": "model-00001-of-00002.safetensors",
624
+ "vision_tower.vision_model.encoder.layers.18.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
625
+ "vision_tower.vision_model.encoder.layers.18.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
626
+ "vision_tower.vision_model.encoder.layers.18.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
627
+ "vision_tower.vision_model.encoder.layers.18.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
628
+ "vision_tower.vision_model.encoder.layers.18.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
629
+ "vision_tower.vision_model.encoder.layers.18.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
630
+ "vision_tower.vision_model.encoder.layers.18.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
631
+ "vision_tower.vision_model.encoder.layers.18.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
632
+ "vision_tower.vision_model.encoder.layers.19.layer_norm1.bias": "model-00001-of-00002.safetensors",
633
+ "vision_tower.vision_model.encoder.layers.19.layer_norm1.weight": "model-00001-of-00002.safetensors",
634
+ "vision_tower.vision_model.encoder.layers.19.layer_norm2.bias": "model-00001-of-00002.safetensors",
635
+ "vision_tower.vision_model.encoder.layers.19.layer_norm2.weight": "model-00001-of-00002.safetensors",
636
+ "vision_tower.vision_model.encoder.layers.19.mlp.fc1.bias": "model-00001-of-00002.safetensors",
637
+ "vision_tower.vision_model.encoder.layers.19.mlp.fc1.weight": "model-00001-of-00002.safetensors",
638
+ "vision_tower.vision_model.encoder.layers.19.mlp.fc2.bias": "model-00001-of-00002.safetensors",
639
+ "vision_tower.vision_model.encoder.layers.19.mlp.fc2.weight": "model-00001-of-00002.safetensors",
640
+ "vision_tower.vision_model.encoder.layers.19.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
641
+ "vision_tower.vision_model.encoder.layers.19.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
642
+ "vision_tower.vision_model.encoder.layers.19.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
643
+ "vision_tower.vision_model.encoder.layers.19.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
644
+ "vision_tower.vision_model.encoder.layers.19.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
645
+ "vision_tower.vision_model.encoder.layers.19.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
646
+ "vision_tower.vision_model.encoder.layers.19.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
647
+ "vision_tower.vision_model.encoder.layers.19.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
648
+ "vision_tower.vision_model.encoder.layers.2.layer_norm1.bias": "model-00001-of-00002.safetensors",
649
+ "vision_tower.vision_model.encoder.layers.2.layer_norm1.weight": "model-00001-of-00002.safetensors",
650
+ "vision_tower.vision_model.encoder.layers.2.layer_norm2.bias": "model-00001-of-00002.safetensors",
651
+ "vision_tower.vision_model.encoder.layers.2.layer_norm2.weight": "model-00001-of-00002.safetensors",
652
+ "vision_tower.vision_model.encoder.layers.2.mlp.fc1.bias": "model-00001-of-00002.safetensors",
653
+ "vision_tower.vision_model.encoder.layers.2.mlp.fc1.weight": "model-00001-of-00002.safetensors",
654
+ "vision_tower.vision_model.encoder.layers.2.mlp.fc2.bias": "model-00001-of-00002.safetensors",
655
+ "vision_tower.vision_model.encoder.layers.2.mlp.fc2.weight": "model-00001-of-00002.safetensors",
656
+ "vision_tower.vision_model.encoder.layers.2.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
657
+ "vision_tower.vision_model.encoder.layers.2.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
658
+ "vision_tower.vision_model.encoder.layers.2.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
659
+ "vision_tower.vision_model.encoder.layers.2.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
660
+ "vision_tower.vision_model.encoder.layers.2.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
661
+ "vision_tower.vision_model.encoder.layers.2.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
662
+ "vision_tower.vision_model.encoder.layers.2.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
663
+ "vision_tower.vision_model.encoder.layers.2.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
664
+ "vision_tower.vision_model.encoder.layers.20.layer_norm1.bias": "model-00001-of-00002.safetensors",
665
+ "vision_tower.vision_model.encoder.layers.20.layer_norm1.weight": "model-00001-of-00002.safetensors",
666
+ "vision_tower.vision_model.encoder.layers.20.layer_norm2.bias": "model-00001-of-00002.safetensors",
667
+ "vision_tower.vision_model.encoder.layers.20.layer_norm2.weight": "model-00001-of-00002.safetensors",
668
+ "vision_tower.vision_model.encoder.layers.20.mlp.fc1.bias": "model-00001-of-00002.safetensors",
669
+ "vision_tower.vision_model.encoder.layers.20.mlp.fc1.weight": "model-00001-of-00002.safetensors",
670
+ "vision_tower.vision_model.encoder.layers.20.mlp.fc2.bias": "model-00001-of-00002.safetensors",
671
+ "vision_tower.vision_model.encoder.layers.20.mlp.fc2.weight": "model-00001-of-00002.safetensors",
672
+ "vision_tower.vision_model.encoder.layers.20.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
673
+ "vision_tower.vision_model.encoder.layers.20.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
674
+ "vision_tower.vision_model.encoder.layers.20.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
675
+ "vision_tower.vision_model.encoder.layers.20.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
676
+ "vision_tower.vision_model.encoder.layers.20.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
677
+ "vision_tower.vision_model.encoder.layers.20.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
678
+ "vision_tower.vision_model.encoder.layers.20.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
679
+ "vision_tower.vision_model.encoder.layers.20.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
680
+ "vision_tower.vision_model.encoder.layers.21.layer_norm1.bias": "model-00001-of-00002.safetensors",
681
+ "vision_tower.vision_model.encoder.layers.21.layer_norm1.weight": "model-00001-of-00002.safetensors",
682
+ "vision_tower.vision_model.encoder.layers.21.layer_norm2.bias": "model-00001-of-00002.safetensors",
683
+ "vision_tower.vision_model.encoder.layers.21.layer_norm2.weight": "model-00001-of-00002.safetensors",
684
+ "vision_tower.vision_model.encoder.layers.21.mlp.fc1.bias": "model-00001-of-00002.safetensors",
685
+ "vision_tower.vision_model.encoder.layers.21.mlp.fc1.weight": "model-00001-of-00002.safetensors",
686
+ "vision_tower.vision_model.encoder.layers.21.mlp.fc2.bias": "model-00001-of-00002.safetensors",
687
+ "vision_tower.vision_model.encoder.layers.21.mlp.fc2.weight": "model-00001-of-00002.safetensors",
688
+ "vision_tower.vision_model.encoder.layers.21.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
689
+ "vision_tower.vision_model.encoder.layers.21.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
690
+ "vision_tower.vision_model.encoder.layers.21.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
691
+ "vision_tower.vision_model.encoder.layers.21.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
692
+ "vision_tower.vision_model.encoder.layers.21.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
693
+ "vision_tower.vision_model.encoder.layers.21.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
694
+ "vision_tower.vision_model.encoder.layers.21.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
695
+ "vision_tower.vision_model.encoder.layers.21.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
696
+ "vision_tower.vision_model.encoder.layers.22.layer_norm1.bias": "model-00001-of-00002.safetensors",
697
+ "vision_tower.vision_model.encoder.layers.22.layer_norm1.weight": "model-00001-of-00002.safetensors",
698
+ "vision_tower.vision_model.encoder.layers.22.layer_norm2.bias": "model-00001-of-00002.safetensors",
699
+ "vision_tower.vision_model.encoder.layers.22.layer_norm2.weight": "model-00001-of-00002.safetensors",
700
+ "vision_tower.vision_model.encoder.layers.22.mlp.fc1.bias": "model-00001-of-00002.safetensors",
701
+ "vision_tower.vision_model.encoder.layers.22.mlp.fc1.weight": "model-00001-of-00002.safetensors",
702
+ "vision_tower.vision_model.encoder.layers.22.mlp.fc2.bias": "model-00001-of-00002.safetensors",
703
+ "vision_tower.vision_model.encoder.layers.22.mlp.fc2.weight": "model-00001-of-00002.safetensors",
704
+ "vision_tower.vision_model.encoder.layers.22.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
705
+ "vision_tower.vision_model.encoder.layers.22.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
706
+ "vision_tower.vision_model.encoder.layers.22.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
707
+ "vision_tower.vision_model.encoder.layers.22.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
708
+ "vision_tower.vision_model.encoder.layers.22.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
709
+ "vision_tower.vision_model.encoder.layers.22.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
710
+ "vision_tower.vision_model.encoder.layers.22.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
711
+ "vision_tower.vision_model.encoder.layers.22.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
712
+ "vision_tower.vision_model.encoder.layers.23.layer_norm1.bias": "model-00001-of-00002.safetensors",
713
+ "vision_tower.vision_model.encoder.layers.23.layer_norm1.weight": "model-00001-of-00002.safetensors",
714
+ "vision_tower.vision_model.encoder.layers.23.layer_norm2.bias": "model-00001-of-00002.safetensors",
715
+ "vision_tower.vision_model.encoder.layers.23.layer_norm2.weight": "model-00001-of-00002.safetensors",
716
+ "vision_tower.vision_model.encoder.layers.23.mlp.fc1.bias": "model-00001-of-00002.safetensors",
717
+ "vision_tower.vision_model.encoder.layers.23.mlp.fc1.weight": "model-00001-of-00002.safetensors",
718
+ "vision_tower.vision_model.encoder.layers.23.mlp.fc2.bias": "model-00001-of-00002.safetensors",
719
+ "vision_tower.vision_model.encoder.layers.23.mlp.fc2.weight": "model-00001-of-00002.safetensors",
720
+ "vision_tower.vision_model.encoder.layers.23.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
721
+ "vision_tower.vision_model.encoder.layers.23.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
722
+ "vision_tower.vision_model.encoder.layers.23.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
723
+ "vision_tower.vision_model.encoder.layers.23.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
724
+ "vision_tower.vision_model.encoder.layers.23.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
725
+ "vision_tower.vision_model.encoder.layers.23.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
726
+ "vision_tower.vision_model.encoder.layers.23.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
727
+ "vision_tower.vision_model.encoder.layers.23.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
728
+ "vision_tower.vision_model.encoder.layers.24.layer_norm1.bias": "model-00001-of-00002.safetensors",
729
+ "vision_tower.vision_model.encoder.layers.24.layer_norm1.weight": "model-00001-of-00002.safetensors",
730
+ "vision_tower.vision_model.encoder.layers.24.layer_norm2.bias": "model-00001-of-00002.safetensors",
731
+ "vision_tower.vision_model.encoder.layers.24.layer_norm2.weight": "model-00001-of-00002.safetensors",
732
+ "vision_tower.vision_model.encoder.layers.24.mlp.fc1.bias": "model-00001-of-00002.safetensors",
733
+ "vision_tower.vision_model.encoder.layers.24.mlp.fc1.weight": "model-00001-of-00002.safetensors",
734
+ "vision_tower.vision_model.encoder.layers.24.mlp.fc2.bias": "model-00001-of-00002.safetensors",
735
+ "vision_tower.vision_model.encoder.layers.24.mlp.fc2.weight": "model-00001-of-00002.safetensors",
736
+ "vision_tower.vision_model.encoder.layers.24.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
737
+ "vision_tower.vision_model.encoder.layers.24.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
738
+ "vision_tower.vision_model.encoder.layers.24.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
739
+ "vision_tower.vision_model.encoder.layers.24.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
740
+ "vision_tower.vision_model.encoder.layers.24.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
741
+ "vision_tower.vision_model.encoder.layers.24.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
742
+ "vision_tower.vision_model.encoder.layers.24.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
743
+ "vision_tower.vision_model.encoder.layers.24.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
744
+ "vision_tower.vision_model.encoder.layers.25.layer_norm1.bias": "model-00001-of-00002.safetensors",
745
+ "vision_tower.vision_model.encoder.layers.25.layer_norm1.weight": "model-00001-of-00002.safetensors",
746
+ "vision_tower.vision_model.encoder.layers.25.layer_norm2.bias": "model-00001-of-00002.safetensors",
747
+ "vision_tower.vision_model.encoder.layers.25.layer_norm2.weight": "model-00001-of-00002.safetensors",
748
+ "vision_tower.vision_model.encoder.layers.25.mlp.fc1.bias": "model-00001-of-00002.safetensors",
749
+ "vision_tower.vision_model.encoder.layers.25.mlp.fc1.weight": "model-00001-of-00002.safetensors",
750
+ "vision_tower.vision_model.encoder.layers.25.mlp.fc2.bias": "model-00001-of-00002.safetensors",
751
+ "vision_tower.vision_model.encoder.layers.25.mlp.fc2.weight": "model-00001-of-00002.safetensors",
752
+ "vision_tower.vision_model.encoder.layers.25.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
753
+ "vision_tower.vision_model.encoder.layers.25.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
754
+ "vision_tower.vision_model.encoder.layers.25.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
755
+ "vision_tower.vision_model.encoder.layers.25.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
756
+ "vision_tower.vision_model.encoder.layers.25.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
757
+ "vision_tower.vision_model.encoder.layers.25.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
758
+ "vision_tower.vision_model.encoder.layers.25.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
759
+ "vision_tower.vision_model.encoder.layers.25.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
760
+ "vision_tower.vision_model.encoder.layers.26.layer_norm1.bias": "model-00001-of-00002.safetensors",
761
+ "vision_tower.vision_model.encoder.layers.26.layer_norm1.weight": "model-00001-of-00002.safetensors",
762
+ "vision_tower.vision_model.encoder.layers.26.layer_norm2.bias": "model-00001-of-00002.safetensors",
763
+ "vision_tower.vision_model.encoder.layers.26.layer_norm2.weight": "model-00001-of-00002.safetensors",
764
+ "vision_tower.vision_model.encoder.layers.26.mlp.fc1.bias": "model-00001-of-00002.safetensors",
765
+ "vision_tower.vision_model.encoder.layers.26.mlp.fc1.weight": "model-00001-of-00002.safetensors",
766
+ "vision_tower.vision_model.encoder.layers.26.mlp.fc2.bias": "model-00001-of-00002.safetensors",
767
+ "vision_tower.vision_model.encoder.layers.26.mlp.fc2.weight": "model-00001-of-00002.safetensors",
768
+ "vision_tower.vision_model.encoder.layers.26.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
769
+ "vision_tower.vision_model.encoder.layers.26.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
770
+ "vision_tower.vision_model.encoder.layers.26.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
771
+ "vision_tower.vision_model.encoder.layers.26.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
772
+ "vision_tower.vision_model.encoder.layers.26.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
773
+ "vision_tower.vision_model.encoder.layers.26.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
774
+ "vision_tower.vision_model.encoder.layers.26.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
775
+ "vision_tower.vision_model.encoder.layers.26.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
776
+ "vision_tower.vision_model.encoder.layers.3.layer_norm1.bias": "model-00001-of-00002.safetensors",
777
+ "vision_tower.vision_model.encoder.layers.3.layer_norm1.weight": "model-00001-of-00002.safetensors",
778
+ "vision_tower.vision_model.encoder.layers.3.layer_norm2.bias": "model-00001-of-00002.safetensors",
779
+ "vision_tower.vision_model.encoder.layers.3.layer_norm2.weight": "model-00001-of-00002.safetensors",
780
+ "vision_tower.vision_model.encoder.layers.3.mlp.fc1.bias": "model-00001-of-00002.safetensors",
781
+ "vision_tower.vision_model.encoder.layers.3.mlp.fc1.weight": "model-00001-of-00002.safetensors",
782
+ "vision_tower.vision_model.encoder.layers.3.mlp.fc2.bias": "model-00001-of-00002.safetensors",
783
+ "vision_tower.vision_model.encoder.layers.3.mlp.fc2.weight": "model-00001-of-00002.safetensors",
784
+ "vision_tower.vision_model.encoder.layers.3.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
785
+ "vision_tower.vision_model.encoder.layers.3.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
786
+ "vision_tower.vision_model.encoder.layers.3.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
787
+ "vision_tower.vision_model.encoder.layers.3.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
788
+ "vision_tower.vision_model.encoder.layers.3.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
789
+ "vision_tower.vision_model.encoder.layers.3.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
790
+ "vision_tower.vision_model.encoder.layers.3.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
791
+ "vision_tower.vision_model.encoder.layers.3.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
792
+ "vision_tower.vision_model.encoder.layers.4.layer_norm1.bias": "model-00001-of-00002.safetensors",
793
+ "vision_tower.vision_model.encoder.layers.4.layer_norm1.weight": "model-00001-of-00002.safetensors",
794
+ "vision_tower.vision_model.encoder.layers.4.layer_norm2.bias": "model-00001-of-00002.safetensors",
795
+ "vision_tower.vision_model.encoder.layers.4.layer_norm2.weight": "model-00001-of-00002.safetensors",
796
+ "vision_tower.vision_model.encoder.layers.4.mlp.fc1.bias": "model-00001-of-00002.safetensors",
797
+ "vision_tower.vision_model.encoder.layers.4.mlp.fc1.weight": "model-00001-of-00002.safetensors",
798
+ "vision_tower.vision_model.encoder.layers.4.mlp.fc2.bias": "model-00001-of-00002.safetensors",
799
+ "vision_tower.vision_model.encoder.layers.4.mlp.fc2.weight": "model-00001-of-00002.safetensors",
800
+ "vision_tower.vision_model.encoder.layers.4.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
801
+ "vision_tower.vision_model.encoder.layers.4.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
802
+ "vision_tower.vision_model.encoder.layers.4.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
803
+ "vision_tower.vision_model.encoder.layers.4.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
804
+ "vision_tower.vision_model.encoder.layers.4.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
805
+ "vision_tower.vision_model.encoder.layers.4.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
806
+ "vision_tower.vision_model.encoder.layers.4.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
807
+ "vision_tower.vision_model.encoder.layers.4.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
808
+ "vision_tower.vision_model.encoder.layers.5.layer_norm1.bias": "model-00001-of-00002.safetensors",
809
+ "vision_tower.vision_model.encoder.layers.5.layer_norm1.weight": "model-00001-of-00002.safetensors",
810
+ "vision_tower.vision_model.encoder.layers.5.layer_norm2.bias": "model-00001-of-00002.safetensors",
811
+ "vision_tower.vision_model.encoder.layers.5.layer_norm2.weight": "model-00001-of-00002.safetensors",
812
+ "vision_tower.vision_model.encoder.layers.5.mlp.fc1.bias": "model-00001-of-00002.safetensors",
813
+ "vision_tower.vision_model.encoder.layers.5.mlp.fc1.weight": "model-00001-of-00002.safetensors",
814
+ "vision_tower.vision_model.encoder.layers.5.mlp.fc2.bias": "model-00001-of-00002.safetensors",
815
+ "vision_tower.vision_model.encoder.layers.5.mlp.fc2.weight": "model-00001-of-00002.safetensors",
816
+ "vision_tower.vision_model.encoder.layers.5.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
817
+ "vision_tower.vision_model.encoder.layers.5.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
818
+ "vision_tower.vision_model.encoder.layers.5.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
819
+ "vision_tower.vision_model.encoder.layers.5.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
820
+ "vision_tower.vision_model.encoder.layers.5.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
821
+ "vision_tower.vision_model.encoder.layers.5.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
822
+ "vision_tower.vision_model.encoder.layers.5.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
823
+ "vision_tower.vision_model.encoder.layers.5.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
824
+ "vision_tower.vision_model.encoder.layers.6.layer_norm1.bias": "model-00001-of-00002.safetensors",
825
+ "vision_tower.vision_model.encoder.layers.6.layer_norm1.weight": "model-00001-of-00002.safetensors",
826
+ "vision_tower.vision_model.encoder.layers.6.layer_norm2.bias": "model-00001-of-00002.safetensors",
827
+ "vision_tower.vision_model.encoder.layers.6.layer_norm2.weight": "model-00001-of-00002.safetensors",
828
+ "vision_tower.vision_model.encoder.layers.6.mlp.fc1.bias": "model-00001-of-00002.safetensors",
829
+ "vision_tower.vision_model.encoder.layers.6.mlp.fc1.weight": "model-00001-of-00002.safetensors",
830
+ "vision_tower.vision_model.encoder.layers.6.mlp.fc2.bias": "model-00001-of-00002.safetensors",
831
+ "vision_tower.vision_model.encoder.layers.6.mlp.fc2.weight": "model-00001-of-00002.safetensors",
832
+ "vision_tower.vision_model.encoder.layers.6.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
833
+ "vision_tower.vision_model.encoder.layers.6.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
834
+ "vision_tower.vision_model.encoder.layers.6.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
835
+ "vision_tower.vision_model.encoder.layers.6.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
836
+ "vision_tower.vision_model.encoder.layers.6.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
837
+ "vision_tower.vision_model.encoder.layers.6.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
838
+ "vision_tower.vision_model.encoder.layers.6.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
839
+ "vision_tower.vision_model.encoder.layers.6.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
840
+ "vision_tower.vision_model.encoder.layers.7.layer_norm1.bias": "model-00001-of-00002.safetensors",
841
+ "vision_tower.vision_model.encoder.layers.7.layer_norm1.weight": "model-00001-of-00002.safetensors",
842
+ "vision_tower.vision_model.encoder.layers.7.layer_norm2.bias": "model-00001-of-00002.safetensors",
843
+ "vision_tower.vision_model.encoder.layers.7.layer_norm2.weight": "model-00001-of-00002.safetensors",
844
+ "vision_tower.vision_model.encoder.layers.7.mlp.fc1.bias": "model-00001-of-00002.safetensors",
845
+ "vision_tower.vision_model.encoder.layers.7.mlp.fc1.weight": "model-00001-of-00002.safetensors",
846
+ "vision_tower.vision_model.encoder.layers.7.mlp.fc2.bias": "model-00001-of-00002.safetensors",
847
+ "vision_tower.vision_model.encoder.layers.7.mlp.fc2.weight": "model-00001-of-00002.safetensors",
848
+ "vision_tower.vision_model.encoder.layers.7.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
849
+ "vision_tower.vision_model.encoder.layers.7.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
850
+ "vision_tower.vision_model.encoder.layers.7.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
851
+ "vision_tower.vision_model.encoder.layers.7.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
852
+ "vision_tower.vision_model.encoder.layers.7.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
853
+ "vision_tower.vision_model.encoder.layers.7.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
854
+ "vision_tower.vision_model.encoder.layers.7.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
855
+ "vision_tower.vision_model.encoder.layers.7.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
856
+ "vision_tower.vision_model.encoder.layers.8.layer_norm1.bias": "model-00001-of-00002.safetensors",
857
+ "vision_tower.vision_model.encoder.layers.8.layer_norm1.weight": "model-00001-of-00002.safetensors",
858
+ "vision_tower.vision_model.encoder.layers.8.layer_norm2.bias": "model-00001-of-00002.safetensors",
859
+ "vision_tower.vision_model.encoder.layers.8.layer_norm2.weight": "model-00001-of-00002.safetensors",
860
+ "vision_tower.vision_model.encoder.layers.8.mlp.fc1.bias": "model-00001-of-00002.safetensors",
861
+ "vision_tower.vision_model.encoder.layers.8.mlp.fc1.weight": "model-00001-of-00002.safetensors",
862
+ "vision_tower.vision_model.encoder.layers.8.mlp.fc2.bias": "model-00001-of-00002.safetensors",
863
+ "vision_tower.vision_model.encoder.layers.8.mlp.fc2.weight": "model-00001-of-00002.safetensors",
864
+ "vision_tower.vision_model.encoder.layers.8.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
865
+ "vision_tower.vision_model.encoder.layers.8.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
866
+ "vision_tower.vision_model.encoder.layers.8.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
867
+ "vision_tower.vision_model.encoder.layers.8.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
868
+ "vision_tower.vision_model.encoder.layers.8.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
869
+ "vision_tower.vision_model.encoder.layers.8.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
870
+ "vision_tower.vision_model.encoder.layers.8.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
871
+ "vision_tower.vision_model.encoder.layers.8.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
872
+ "vision_tower.vision_model.encoder.layers.9.layer_norm1.bias": "model-00001-of-00002.safetensors",
873
+ "vision_tower.vision_model.encoder.layers.9.layer_norm1.weight": "model-00001-of-00002.safetensors",
874
+ "vision_tower.vision_model.encoder.layers.9.layer_norm2.bias": "model-00001-of-00002.safetensors",
875
+ "vision_tower.vision_model.encoder.layers.9.layer_norm2.weight": "model-00001-of-00002.safetensors",
876
+ "vision_tower.vision_model.encoder.layers.9.mlp.fc1.bias": "model-00001-of-00002.safetensors",
877
+ "vision_tower.vision_model.encoder.layers.9.mlp.fc1.weight": "model-00001-of-00002.safetensors",
878
+ "vision_tower.vision_model.encoder.layers.9.mlp.fc2.bias": "model-00001-of-00002.safetensors",
879
+ "vision_tower.vision_model.encoder.layers.9.mlp.fc2.weight": "model-00001-of-00002.safetensors",
880
+ "vision_tower.vision_model.encoder.layers.9.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
881
+ "vision_tower.vision_model.encoder.layers.9.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
882
+ "vision_tower.vision_model.encoder.layers.9.self_attn.out_proj.bias": "model-00001-of-00002.safetensors",
883
+ "vision_tower.vision_model.encoder.layers.9.self_attn.out_proj.weight": "model-00001-of-00002.safetensors",
884
+ "vision_tower.vision_model.encoder.layers.9.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
885
+ "vision_tower.vision_model.encoder.layers.9.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
886
+ "vision_tower.vision_model.encoder.layers.9.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
887
+ "vision_tower.vision_model.encoder.layers.9.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
888
+ "vision_tower.vision_model.post_layernorm.bias": "model-00001-of-00002.safetensors",
889
+ "vision_tower.vision_model.post_layernorm.weight": "model-00001-of-00002.safetensors"
890
+ }
891
+ }
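The map above is the tail of the sharded checkpoint index (`model.safetensors.index.json`): every parameter name is assigned to one of the two `model-0000X-of-00002.safetensors` shards, so a loader only needs to open the shard that actually holds a requested tensor. Below is a minimal sketch of resolving and loading a single weight through such an index; the local `ckpt_dir` path is a hypothetical placeholder for wherever the index and its shards are stored.

```python
import json
from safetensors import safe_open

# Hypothetical local directory holding the index file and its two shards.
ckpt_dir = "text_encoder"

# weight_map: parameter name -> shard filename, exactly as listed in the diff above.
with open(f"{ckpt_dir}/model.safetensors.index.json") as f:
    index = json.load(f)

name = "language_model.model.norm.weight"
shard = index["weight_map"][name]  # "model-00002-of-00002.safetensors" per the map above

# Open only that shard and read the single tensor from it.
with safe_open(f"{ckpt_dir}/{shard}", framework="pt") as shard_file:
    tensor = shard_file.get_tensor(name)

print(name, tuple(tensor.shape))
```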
text_encoder_2/config.json ADDED
@@ -0,0 +1,70 @@
1
+ {
2
+ "add_projections": false,
3
+ "architectures": [
4
+ "JinaCLIPModel"
5
+ ],
6
+ "auto_map": {
7
+ "AutoConfig": "jinaai/jina-clip-implementation--configuration_clip.JinaCLIPConfig",
8
+ "AutoModel": "jinaai/jina-clip-implementation--modeling_clip.JinaCLIPModel"
9
+ },
10
+ "initializer_factor": 1.0,
11
+ "logit_scale_init_value": 2.6592,
12
+ "matryoshka_dimensions": [32, 64, 128, 256, 512, 768, 1024],
13
+ "model_type": "jina_clip",
14
+ "projection_dim": 1024,
15
+ "text_config": {
16
+ "default_instruction_task": null,
17
+ "default_lora_task": "retrieval.query",
18
+ "embed_dim": 1024,
19
+ "hf_model_config_kwargs": {
20
+ "load_trained_adapters": false,
21
+ "lora_adaptations": [
22
+ "retrieval.query"
23
+ ],
24
+ "lora_alpha": 4,
25
+ "lora_dropout_p": 0.0,
26
+ "lora_main_params_trainable": false,
27
+ "lora_rank": 4,
28
+ "task_instructions": {
29
+ "retrieval.query": "Represent the query for retrieving evidence documents: "
30
+ },
31
+ "use_flash_attn": true
32
+ },
33
+ "hf_model_name_or_path": "jinaai/jina-embeddings-v3",
34
+ "model_type": "jina_clip_text",
35
+ "pooler_type": "mean_pooler",
36
+ "proj_bias": false,
37
+ "proj_type": null
38
+ },
39
+ "torch_dtype": "bfloat16",
40
+ "transformers.js_config": {
41
+ "use_external_data_format": {
42
+ "model.onnx": true
43
+ }
44
+ },
45
+ "truncate_dim": null,
46
+ "use_text_flash_attn": null,
47
+ "use_vision_xformers": null,
48
+ "vision_config": {
49
+ "embed_dim": 1024,
50
+ "fused_layer_norm": false,
51
+ "head_width": 64,
52
+ "image_size": 512,
53
+ "intp_freq": true,
54
+ "layers": 24,
55
+ "ls_init_value": null,
56
+ "mlp_ratio": 2.6667,
57
+ "model_type": "jina_clip_vision",
58
+ "naive_swiglu": true,
59
+ "patch_dropout": 0.1,
60
+ "patch_size": 14,
61
+ "post_norm": false,
62
+ "proj_type": null,
63
+ "pt_hw_seq_len": 16,
64
+ "qkv_bias": true,
65
+ "rope_embeddings": true,
66
+ "subln": true,
67
+ "width": 1024,
68
+ "x_attention": true
69
+ }
70
+ }
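This `text_encoder_2/config.json` declares `"model_type": "jina_clip"` and an `auto_map` pointing at the `jinaai/jina-clip-implementation` remote code, so transformers needs `trust_remote_code=True` to build the config through the custom `configuration_clip.py` shown next. A minimal loading sketch, assuming a local copy of the `text_encoder_2/` folder:

```python
from transformers import AutoConfig

# trust_remote_code lets AutoConfig follow the auto_map entry and instantiate
# the custom JinaCLIPConfig class instead of a built-in configuration.
config = AutoConfig.from_pretrained("text_encoder_2", trust_remote_code=True)

print(config.model_type)             # "jina_clip"
print(config.matryoshka_dimensions)  # [32, 64, 128, 256, 512, 768, 1024]
print(config.truncate_dim)           # None (embeddings are not truncated by default)
```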
text_encoder_2/configuration_clip.py ADDED
@@ -0,0 +1,324 @@
1
+ # coding=utf-8
2
+ #
3
+ # Code mainly copied from:
4
+ # https://github.com/huggingface/transformers/blob/main/src/transformers/models/clip/configuration_clip.py
5
+ # and adjusted for Jina CLIP
6
+
7
+ import os
8
+ from copy import deepcopy
9
+ from typing import Any, Dict, List, Optional, Union
10
+
11
+ import torch
12
+ from transformers import PretrainedConfig, logging
13
+
14
+ logger = logging.get_logger(__name__)
15
+
16
+
17
+ """ Jina CLIP model configuration """
18
+
19
+
20
+ class JinaCLIPTextConfig(PretrainedConfig):
21
+ model_type = 'jina_clip_text'
22
+
23
+ def __init__(
24
+ self,
25
+ embed_dim: int = 768,
26
+ hf_model_name_or_path: str = 'jinaai/jina-bert-flash-implementation',
27
+ hf_model_config_kwargs: Optional[Dict[str, Any]] = None,
28
+ default_instruction_task: Optional[str] = None,
29
+ default_lora_task: Optional[str] = None,
30
+ pooler_type: Optional[str] = None,
31
+ proj_type: Optional[str] = None,
32
+ proj_bias: bool = False,
33
+ **kwargs,
34
+ ):
35
+ super().__init__(**kwargs)
36
+
37
+ self.embed_dim = embed_dim
38
+ self.hf_model_name_or_path = hf_model_name_or_path
39
+ self.hf_model_config_kwargs = hf_model_config_kwargs or {}
40
+ self.default_instruction_task = default_instruction_task
41
+ self.default_lora_task = default_lora_task
42
+ self.pooler_type = pooler_type
43
+ self.proj_type = proj_type
44
+ self.proj_bias = proj_bias
45
+
46
+ @classmethod
47
+ def from_pretrained(
48
+ cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs
49
+ ) -> 'PretrainedConfig':
50
+ cls._set_token_in_kwargs(kwargs)
51
+
52
+ configdict, kwargs = cls.get_config_dict(
53
+ pretrained_model_name_or_path, **kwargs
54
+ )
55
+ # get the text config dict if we are loading from JinaCLIPConfig
56
+ if configdict.get('model_type') == 'jina_clip':
57
+ configdict = configdict['text_config']
58
+ if (
59
+ 'model_type' in configdict
60
+ and hasattr(cls, 'model_type')
61
+ and configdict['model_type'] != cls.model_type
62
+ ):
63
+ logger.warning(
64
+ f'You are using a model of type {configdict["model_type"]} to '
65
+ f'instantiate a model of type {cls.model_type}. This is not supported '
66
+ 'for all configurations of models and can yield errors.'
67
+ )
68
+ return cls.from_dict(configdict, **kwargs)
69
+
70
+
71
+ class JinaCLIPVisionConfig(PretrainedConfig):
72
+ model_type = 'jina_clip_vision'
73
+
74
+ def __init__(
75
+ self,
76
+ embed_dim: int = 768,
77
+ width: int = 768,
78
+ image_size: int = 224,
79
+ patch_size: int = 16,
80
+ layers: int = 12,
81
+ head_width: int = 64,
82
+ mlp_ratio: float = 4.0,
83
+ ls_init_value: Optional[float] = None,
84
+ patch_dropout: float = 0.0,
85
+ qkv_bias: bool = True,
86
+ fused_layer_norm: bool = False,
87
+ x_attention: bool = False,
88
+ post_norm: bool = False,
89
+ rope_embeddings: bool = False,
90
+ pt_hw_seq_len: int = 16,
91
+ intp_freq: bool = False,
92
+ naive_swiglu: bool = False,
93
+ subln: bool = False,
94
+ drop_path_rate: float = 0.0,
95
+ proj_type: Optional[str] = None,
96
+ **kwargs,
97
+ ):
98
+ super().__init__(**kwargs)
99
+
100
+ self.layers = layers
101
+ self.embed_dim = embed_dim
102
+ self.width = width
103
+ self.head_width = head_width
104
+ self.mlp_ratio = mlp_ratio
105
+ self.image_size = image_size
106
+ self.patch_size = patch_size
107
+ self.ls_init_value = ls_init_value
108
+ self.patch_dropout = patch_dropout
109
+ self.qkv_bias = qkv_bias
110
+ self.fused_layer_norm = fused_layer_norm
111
+ self.x_attention = x_attention
112
+ self.post_norm = post_norm
113
+ self.rope_embeddings = rope_embeddings
114
+ self.pt_hw_seq_len = pt_hw_seq_len
115
+ self.intp_freq = intp_freq
116
+ self.naive_swiglu = naive_swiglu
117
+ self.subln = subln
118
+ self.drop_path_rate = drop_path_rate
119
+ self.proj_type = proj_type
120
+
121
+ @classmethod
122
+ def from_pretrained(
123
+ cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs
124
+ ) -> 'PretrainedConfig':
125
+ cls._set_token_in_kwargs(kwargs)
126
+
127
+ configdict, kwargs = cls.get_config_dict(
128
+ pretrained_model_name_or_path, **kwargs
129
+ )
130
+ # get the vision config dict if we are loading from JinaCLIPConfig
131
+ if configdict.get('model_type') == 'jina_clip':
132
+ configdict = configdict['vision_config']
133
+ if (
134
+ 'model_type' in configdict
135
+ and hasattr(cls, 'model_type')
136
+ and configdict['model_type'] != cls.model_type
137
+ ):
138
+ logger.warning(
139
+ f'You are using a model of type {configdict["model_type"]} to '
140
+ f'instantiate a model of type {cls.model_type}. This is not supported '
141
+ 'for all configurations of models and can yield errors.'
142
+ )
143
+ return cls.from_dict(configdict, **kwargs)
144
+
145
+
146
+ class JinaCLIPConfig(PretrainedConfig):
147
+ model_type = 'jina_clip'
148
+ is_composition = True
149
+
150
+ def __init__(
151
+ self,
152
+ text_config: Optional[Dict] = None,
153
+ vision_config: Optional[Dict] = None,
154
+ add_projections: bool = False,
155
+ projection_dim: int = 768,
156
+ logit_scale_init_value: float = 2.6592,
157
+ use_text_flash_attn: Optional[bool] = None,
158
+ use_vision_xformers: Optional[bool] = None,
159
+ matryoshka_dimensions: Optional[List[int]] = None,
160
+ truncate_dim: Optional[int] = None,
161
+ torch_dtype: Optional[Union[str, torch.dtype]] = None,
162
+ **kwargs,
163
+ ):
164
+ # If `_config_dict` exist, we use them for the backward compatibility.
165
+ # We pop out these 2 attributes before calling `super().__init__` to avoid
166
+ # them being saved (which causes a lot of confusion!).
167
+
168
+ text_config_dict: Optional[Dict] = kwargs.pop('text_config_dict', None)
169
+ vision_config_dict: Optional[Dict] = kwargs.pop('vision_config_dict', None)
170
+ self.use_text_flash_attn = use_text_flash_attn
171
+ self.use_vision_xformers = use_vision_xformers
172
+ self.matryoshka_dimensions = matryoshka_dimensions
173
+ self.truncate_dim = truncate_dim
174
+
175
+ super().__init__(**kwargs)
176
+
177
+ if text_config_dict is not None:
178
+ if text_config is None:
179
+ text_config = {}
180
+
181
+ # This is the complete result when using `text_config_dict`.
182
+ _text_config_dict = JinaCLIPTextConfig(**text_config_dict).to_dict()
183
+
184
+ # Give a warning if the values exist in both `_text_config_dict` and
185
+ # `text_config` but with different values.
186
+ for key, value in _text_config_dict.items():
187
+ if (
188
+ key in text_config
189
+ and value != text_config[key]
190
+ and key not in ['transformers_version']
191
+ ):
192
+ # If specified in `text_config_dict`
193
+ if key in text_config_dict:
194
+ message = (
195
+ f'`{key}` is found in both `text_config_dict` and '
196
+ f'`text_config` but with different values. '
197
+ f'The value `text_config_dict["{key}"]` will be used '
198
+ f'instead.'
199
+ )
200
+ # If inferred from default argument values (
201
+ # just to be super careful)
202
+ else:
203
+ message = (
204
+ f'`text_config_dict` is provided which will be used to '
205
+ f'initialize `JinaCLIPTextConfig`. The '
206
+ f'value `text_config["{key}"]` will be overridden.'
207
+ )
208
+ logger.info(message)
209
+
210
+ # Update all values in `text_config` with the ones in `_text_config_dict`.
211
+ text_config.update(_text_config_dict)
212
+
213
+ if vision_config_dict is not None:
214
+ if vision_config is None:
215
+ vision_config = {}
216
+
217
+ # This is the complete result when using `vision_config_dict`.
218
+ _vision_config_dict = JinaCLIPVisionConfig(**vision_config_dict).to_dict()
219
+ # convert keys to string instead of integer
220
+ if 'id2label' in _vision_config_dict:
221
+ _vision_config_dict['id2label'] = {
222
+ str(key): value
223
+ for key, value in _vision_config_dict['id2label'].items()
224
+ }
225
+
226
+ # Give a warning if the values exist in both `_vision_config_dict`
227
+ # and `vision_config` but with different values.
228
+ for key, value in _vision_config_dict.items():
229
+ if (
230
+ key in vision_config
231
+ and value != vision_config[key]
232
+ and key not in ['transformers_version']
233
+ ):
234
+ # If specified in `vision_config_dict`
235
+ if key in vision_config_dict:
236
+ message = (
237
+ f'`{key}` is found in both `vision_config_dict` and '
238
+ f'`vision_config` but with different '
239
+ f'values. The value `vision_config_dict["{key}"]` will '
240
+ f'be used instead.'
241
+ )
242
+ # If inferred from default argument values
243
+ # (just to be super careful)
244
+ else:
245
+ message = (
246
+ f'`vision_config_dict` is provided which will be used to '
247
+ f'initialize `JinaCLIPVisionConfig`. '
248
+ f'The value `vision_config["{key}"]` will be overridden.'
249
+ )
250
+ logger.info(message)
251
+
252
+ # Update all values in `vision_config` with the ones in
253
+ # `_vision_config_dict`.
254
+ vision_config.update(_vision_config_dict)
255
+
256
+ if text_config is None:
257
+ text_config = {}
258
+ logger.info(
259
+ '`text_config` is `None`. Initializing the `JinaCLIPTextConfig` with '
260
+ 'default values.'
261
+ )
262
+
263
+ if vision_config is None:
264
+ vision_config = {}
265
+ logger.info(
266
+ '`vision_config` is `None`. Initializing the `JinaCLIPVisionConfig` '
267
+ 'with default values.'
268
+ )
269
+
270
+ self.text_config = JinaCLIPTextConfig(**text_config)
271
+ self.vision_config = JinaCLIPVisionConfig(**vision_config)
272
+
273
+ self.add_projections = add_projections
274
+ self.projection_dim = projection_dim
275
+ self.logit_scale_init_value = logit_scale_init_value
276
+ self.initializer_factor = 1.0
277
+
278
+ if not self.add_projections:
279
+ if self.text_config.embed_dim != self.vision_config.embed_dim:
280
+ raise ValueError(
281
+ 'When projections are disabled (`add_projections=False`), text '
282
+ 'and vision towers need to have the same embedding dimensionality. '
283
+ f'Currently text embedding dim is {self.text_config.embed_dim} != '
284
+ f'{self.vision_config.embed_dim} of the vision tower. '
285
+ 'Either set the same output dim for both towers, or enable '
286
+ 'projections with `add_projections=True`.'
287
+ )
288
+
289
+ if (
290
+ isinstance(torch_dtype, str)
291
+ and hasattr(torch, torch_dtype)
292
+ and type(getattr(torch, torch_dtype)) is torch.dtype
293
+ ):
294
+ self.torch_dtype = getattr(torch, torch_dtype)
295
+ else:
296
+ self.torch_dtype = torch_dtype
297
+
298
+ use_text_flash_attn = (
299
+ self.use_text_flash_attn if self.use_text_flash_attn is not None
300
+ else self.text_config.hf_model_config_kwargs.get('use_flash_attn', False)
301
+ )
302
+ if not use_text_flash_attn or not torch.cuda.is_available():
303
+ self.torch_dtype = torch.float32
304
+
305
+ @classmethod
306
+ def from_text_vision_configs(
307
+ cls,
308
+ text_config: JinaCLIPTextConfig,
309
+ vision_config: JinaCLIPVisionConfig,
310
+ **kwargs,
311
+ ):
312
+ return cls(
313
+ text_config=text_config.to_dict(),
314
+ vision_config=vision_config.to_dict(),
315
+ projection_dim=text_config.projection_dim,
316
+ **kwargs,
317
+ )
318
+
319
+ def to_dict(self):
320
+ output = deepcopy(self.__dict__)
321
+ output['text_config'] = self.text_config.to_dict()
322
+ output['vision_config'] = self.vision_config.to_dict()
323
+ output['model_type'] = self.__class__.model_type
324
+ return output
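For orientation, here is a minimal usage sketch of the vision tower config defined above. It assumes `transformers` and `torch` are installed and that this file is importable as a `configuration_clip` module (the later `modeling_clip.py` imports it relatively); both the import path and the printed values are assumptions based on the defaults shown above, not part of the upload.

    from configuration_clip import JinaCLIPVisionConfig  # hypothetical import path

    # Defaults describe a ViT-B/16-style tower: 12 layers, width 768, 64-dim heads.
    vision_cfg = JinaCLIPVisionConfig()
    assert vision_cfg.model_type == 'jina_clip_vision'
    print(vision_cfg.width // vision_cfg.head_width)  # 12 attention heads, as computed downstream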
text_encoder_2/eva_model.py ADDED
@@ -0,0 +1,771 @@
1
+ # --------------------------------------------------------
2
+ # Adapted from EVA CLIP
3
+ # https://github.com/baaivision/EVA/tree/master/EVA-CLIP/rei/eva_clip
4
+ # --------------------------------------------------------
5
+
6
+ import math
7
+ import os
8
+ import warnings
9
+ from functools import partial
10
+
11
+ import torch
12
+ import torch.nn as nn
13
+ import torch.nn.functional as f
14
+
15
+ try:
16
+ warnings.filterwarnings('ignore', category=FutureWarning, module='timm')
17
+ from timm.models.layers import drop_path as timm_drop_path
18
+ from timm.models.layers import to_2tuple, trunc_normal_
19
+ except (ImportError, ModuleNotFoundError):
20
+ from timm.layers import drop_path as timm_drop_path, to_2tuple, trunc_normal_
21
+
22
+ from .rope_embeddings import VisionRotaryEmbeddingFast
23
+
24
+ if os.getenv('ENV_TYPE') == 'deepspeed':
25
+ try:
26
+ from deepspeed.runtime.activation_checkpointing.checkpointing import checkpoint
27
+ except (ImportError, ModuleNotFoundError):
28
+ from torch.utils.checkpoint import checkpoint
29
+ else:
30
+ from torch.utils.checkpoint import checkpoint
31
+
32
+ try:
33
+ import xformers.ops as xops
34
+ except ImportError:
35
+ xops = None
36
+
37
+
38
+ class PatchDropout(nn.Module):
39
+ """
40
+ https://arxiv.org/abs/2212.00794
41
+ """
42
+
43
+ def __init__(self, prob, exclude_first_token=True):
44
+ super().__init__()
45
+ assert 0 <= prob < 1.0
46
+ self.prob = prob
47
+ self.exclude_first_token = exclude_first_token # exclude CLS token
48
+
49
+ def forward(self, x):
50
+ if not self.training or self.prob == 0.0:
51
+ return x
52
+
53
+ if self.exclude_first_token:
54
+ cls_tokens, x = x[:, :1], x[:, 1:]
55
+ else:
56
+ cls_tokens = torch.jit.annotate(torch.Tensor, x[:, :1])
57
+
58
+ batch = x.size()[0]
59
+ num_tokens = x.size()[1]
60
+
61
+ batch_indices = torch.arange(batch)
62
+ batch_indices = batch_indices[..., None]
63
+
64
+ keep_prob = 1 - self.prob
65
+ num_patches_keep = max(1, int(num_tokens * keep_prob))
66
+
67
+ rand = torch.randn(batch, num_tokens)
68
+ patch_indices_keep = rand.topk(num_patches_keep, dim=-1).indices
69
+
70
+ x = x[batch_indices, patch_indices_keep]
71
+
72
+ if self.exclude_first_token:
73
+ x = torch.cat((cls_tokens, x), dim=1)
74
+
75
+ return x, patch_indices_keep
76
+
77
+
78
+ class DropPath(nn.Module):
79
+ """Drop paths (Stochastic Depth) per sample (when applied in main path of
80
+ residual blocks)."""
81
+
82
+ def __init__(self, drop_prob=None):
83
+ super(DropPath, self).__init__()
84
+ self.drop_prob = drop_prob
85
+
86
+ def forward(self, x):
87
+ return timm_drop_path(x, self.drop_prob, self.training)
88
+
89
+ def extra_repr(self) -> str:
90
+ return 'p={}'.format(self.drop_prob)
91
+
92
+
93
+ class Mlp(nn.Module):
94
+ def __init__(
95
+ self,
96
+ in_features,
97
+ hidden_features=None,
98
+ out_features=None,
99
+ act_layer=nn.GELU,
100
+ norm_layer=nn.LayerNorm,
101
+ drop=0.0,
102
+ subln=False,
103
+ ):
104
+ super().__init__()
105
+ out_features = out_features or in_features
106
+ hidden_features = hidden_features or in_features
107
+ self.fc1 = nn.Linear(in_features, hidden_features)
108
+ self.act = act_layer()
109
+
110
+ self.ffn_ln = norm_layer(hidden_features) if subln else nn.Identity()
111
+
112
+ self.fc2 = nn.Linear(hidden_features, out_features)
113
+ self.drop = nn.Dropout(drop)
114
+
115
+ def forward(self, x):
116
+ x = self.fc1(x)
117
+ x = self.act(x)
118
+ # x = self.drop(x)
119
+ # kept commented out to match the original BERT implementation
120
+ x = self.ffn_ln(x)
121
+
122
+ x = self.fc2(x)
123
+ x = self.drop(x)
124
+ return x
125
+
126
+
127
+ class SwiGLU(nn.Module):
128
+ def __init__(
129
+ self,
130
+ in_features,
131
+ hidden_features=None,
132
+ out_features=None,
133
+ act_layer=nn.SiLU,
134
+ drop=0.0,
135
+ norm_layer=nn.LayerNorm,
136
+ subln=False,
137
+ ):
138
+ super().__init__()
139
+ out_features = out_features or in_features
140
+ hidden_features = hidden_features or in_features
141
+
142
+ self.w1 = nn.Linear(in_features, hidden_features)
143
+ self.w2 = nn.Linear(in_features, hidden_features)
144
+
145
+ self.act = act_layer()
146
+ self.ffn_ln = norm_layer(hidden_features) if subln else nn.Identity()
147
+ self.w3 = nn.Linear(hidden_features, out_features)
148
+
149
+ self.drop = nn.Dropout(drop)
150
+
151
+ def forward(self, x):
152
+ x1 = self.w1(x)
153
+ x2 = self.w2(x)
154
+ hidden = self.act(x1) * x2
155
+ x = self.ffn_ln(hidden)
156
+ x = self.w3(x)
157
+ x = self.drop(x)
158
+ return x
159
+
160
+
161
+ class Attention(nn.Module):
162
+ def __init__(
163
+ self,
164
+ dim,
165
+ num_heads=8,
166
+ qkv_bias=False,
167
+ qk_scale=None,
168
+ attn_drop=0.0,
169
+ proj_drop=0.0,
170
+ window_size=None,
171
+ attn_head_dim=None,
172
+ xattn=False,
173
+ rope=None,
174
+ subln=False,
175
+ norm_layer=nn.LayerNorm,
176
+ ):
177
+ super().__init__()
178
+ self.num_heads = num_heads
179
+ head_dim = dim // num_heads
180
+ if attn_head_dim is not None:
181
+ head_dim = attn_head_dim
182
+ all_head_dim = head_dim * self.num_heads
183
+ self.scale = qk_scale or head_dim**-0.5
184
+
185
+ self.subln = subln
186
+ if self.subln:
187
+ self.q_proj = nn.Linear(dim, all_head_dim, bias=False)
188
+ self.k_proj = nn.Linear(dim, all_head_dim, bias=False)
189
+ self.v_proj = nn.Linear(dim, all_head_dim, bias=False)
190
+ else:
191
+ self.qkv = nn.Linear(dim, all_head_dim * 3, bias=False)
192
+
193
+ if qkv_bias:
194
+ self.q_bias = nn.Parameter(torch.zeros(all_head_dim))
195
+ self.v_bias = nn.Parameter(torch.zeros(all_head_dim))
196
+ else:
197
+ self.q_bias = None
198
+ self.v_bias = None
199
+
200
+ if window_size:
201
+ self.window_size = window_size
202
+ self.num_relative_distance = (2 * window_size[0] - 1) * (
203
+ 2 * window_size[1] - 1
204
+ ) + 3
205
+ self.relative_position_bias_table = nn.Parameter(
206
+ torch.zeros(self.num_relative_distance, num_heads)
207
+ ) # 2*Wh-1 * 2*Ww-1, nH
208
+ # cls to token & token to cls & cls to cls
209
+
210
+ # get pair-wise relative position index for each token inside the window
211
+ coords_h = torch.arange(window_size[0])
212
+ coords_w = torch.arange(window_size[1])
213
+ coords = torch.stack(torch.meshgrid([coords_h, coords_w])) # 2, Wh, Ww
214
+ coords_flatten = torch.flatten(coords, 1) # 2, Wh*Ww
215
+ relative_coords = (
216
+ coords_flatten[:, :, None] - coords_flatten[:, None, :]
217
+ ) # 2, Wh*Ww, Wh*Ww
218
+ relative_coords = relative_coords.permute(
219
+ 1, 2, 0
220
+ ).contiguous() # Wh*Ww, Wh*Ww, 2
221
+ relative_coords[:, :, 0] += window_size[0] - 1 # shift to start from 0
222
+ relative_coords[:, :, 1] += window_size[1] - 1
223
+ relative_coords[:, :, 0] *= 2 * window_size[1] - 1
224
+ relative_position_index = torch.zeros(
225
+ size=(window_size[0] * window_size[1] + 1,) * 2,
226
+ dtype=relative_coords.dtype,
227
+ )
228
+ relative_position_index[1:, 1:] = relative_coords.sum(-1) # Wh*Ww, Wh*Ww
229
+ relative_position_index[0, 0:] = self.num_relative_distance - 3
230
+ relative_position_index[0:, 0] = self.num_relative_distance - 2
231
+ relative_position_index[0, 0] = self.num_relative_distance - 1
232
+
233
+ self.register_buffer('relative_position_index', relative_position_index)
234
+ else:
235
+ self.window_size = None
236
+ self.relative_position_bias_table = None
237
+ self.relative_position_index = None
238
+
239
+ self.attn_drop = nn.Dropout(attn_drop)
240
+ self.inner_attn_ln = norm_layer(all_head_dim) if subln else nn.Identity()
241
+ # self.proj = nn.Linear(all_head_dim, all_head_dim)
242
+ self.proj = nn.Linear(all_head_dim, dim)
243
+ self.proj_drop = nn.Dropout(proj_drop)
244
+ self.xattn = xattn
245
+ self.xattn_drop = attn_drop
246
+
247
+ self.rope = rope
248
+
249
+ def forward(self, x, rel_pos_bias=None, attn_mask=None):
250
+ b, n, _ = x.shape
251
+ if self.subln:
252
+ q = f.linear(input=x, weight=self.q_proj.weight, bias=self.q_bias)
253
+ k = f.linear(input=x, weight=self.k_proj.weight, bias=None)
254
+ v = f.linear(input=x, weight=self.v_proj.weight, bias=self.v_bias)
255
+
256
+ q = q.reshape(b, n, self.num_heads, -1).permute(
257
+ 0, 2, 1, 3
258
+ ) # B, num_heads, N, C
259
+ k = k.reshape(b, n, self.num_heads, -1).permute(0, 2, 1, 3)
260
+ v = v.reshape(b, n, self.num_heads, -1).permute(0, 2, 1, 3)
261
+ else:
262
+ qkv_bias = None
263
+ if self.q_bias is not None:
264
+ qkv_bias = torch.cat(
265
+ (
266
+ self.q_bias,
267
+ torch.zeros_like(self.v_bias, requires_grad=False),
268
+ self.v_bias,
269
+ )
270
+ )
271
+
272
+ qkv = f.linear(input=x, weight=self.qkv.weight, bias=qkv_bias)
273
+ qkv = qkv.reshape(b, n, 3, self.num_heads, -1).permute(
274
+ 2, 0, 3, 1, 4
275
+ ) # 3, B, num_heads, N, C
276
+ q, k, v = qkv[0], qkv[1], qkv[2]
277
+
278
+ if self.rope:
279
+ # slightly fast impl
280
+ q_t = q[:, :, 1:, :]
281
+ ro_q_t = self.rope(q_t)
282
+ q = torch.cat((q[:, :, :1, :], ro_q_t), -2).type_as(v)
283
+
284
+ k_t = k[:, :, 1:, :]
285
+ ro_k_t = self.rope(k_t)
286
+ k = torch.cat((k[:, :, :1, :], ro_k_t), -2).type_as(v)
287
+
288
+ if self.xattn:
289
+ if xops is None:
290
+ raise ValueError(
291
+ "Can't use xattn without xformers. Please 'pip install xformers'"
292
+ )
293
+ q = q.permute(0, 2, 1, 3) # B, num_heads, N, C -> B, N, num_heads, C
294
+ k = k.permute(0, 2, 1, 3)
295
+ v = v.permute(0, 2, 1, 3)
296
+
297
+ x = xops.memory_efficient_attention(
298
+ q,
299
+ k,
300
+ v,
301
+ p=self.xattn_drop,
302
+ scale=self.scale,
303
+ )
304
+ x = x.reshape(b, n, -1)
305
+ x = self.inner_attn_ln(x)
306
+ x = self.proj(x)
307
+ x = self.proj_drop(x)
308
+ else:
309
+ q = q * self.scale
310
+ attn = q @ k.transpose(-2, -1)
311
+
312
+ if self.relative_position_bias_table is not None:
313
+ relative_position_bias = self.relative_position_bias_table[
314
+ self.relative_position_index.view(-1)
315
+ ].view(
316
+ self.window_size[0] * self.window_size[1] + 1,
317
+ self.window_size[0] * self.window_size[1] + 1,
318
+ -1,
319
+ ) # Wh*Ww,Wh*Ww,nH
320
+ relative_position_bias = relative_position_bias.permute(
321
+ 2, 0, 1
322
+ ).contiguous() # nH, Wh*Ww, Wh*Ww
323
+ attn = attn + relative_position_bias.unsqueeze(0).type_as(attn)
324
+
325
+ if rel_pos_bias is not None:
326
+ attn = attn + rel_pos_bias.type_as(attn)
327
+
328
+ if attn_mask is not None:
329
+ attn_mask = attn_mask.bool()
330
+ attn = attn.masked_fill(~attn_mask[:, None, None, :], float('-inf'))
331
+
332
+ attn = attn.softmax(dim=-1)
333
+ attn = self.attn_drop(attn)
334
+
335
+ x = (attn @ v).transpose(1, 2).reshape(b, n, -1)
336
+ x = self.inner_attn_ln(x)
337
+ x = self.proj(x)
338
+ x = self.proj_drop(x)
339
+ return x
340
+
341
+
342
+ class Block(nn.Module):
343
+ def __init__(
344
+ self,
345
+ dim,
346
+ num_heads,
347
+ mlp_ratio=4.0,
348
+ qkv_bias=False,
349
+ qk_scale=None,
350
+ drop=0.0,
351
+ attn_drop=0.0,
352
+ drop_path=0.0,
353
+ init_values=None,
354
+ act_layer=nn.GELU,
355
+ norm_layer=nn.LayerNorm,
356
+ window_size=None,
357
+ attn_head_dim=None,
358
+ xattn=False,
359
+ rope=None,
360
+ postnorm=False,
361
+ subln=False,
362
+ naiveswiglu=False,
363
+ ):
364
+ super().__init__()
365
+ self.norm1 = norm_layer(dim)
366
+ self.attn = Attention(
367
+ dim,
368
+ num_heads=num_heads,
369
+ qkv_bias=qkv_bias,
370
+ qk_scale=qk_scale,
371
+ attn_drop=attn_drop,
372
+ proj_drop=drop,
373
+ window_size=window_size,
374
+ attn_head_dim=attn_head_dim,
375
+ xattn=xattn,
376
+ rope=rope,
377
+ subln=subln,
378
+ norm_layer=norm_layer,
379
+ )
380
+ # NOTE: drop path for stochastic depth, we shall see if this is better
381
+ # than dropout here
382
+ self.drop_path = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()
383
+ self.norm2 = norm_layer(dim)
384
+ mlp_hidden_dim = int(dim * mlp_ratio)
385
+
386
+ if naiveswiglu:
387
+ self.mlp = SwiGLU(
388
+ in_features=dim,
389
+ hidden_features=mlp_hidden_dim,
390
+ subln=subln,
391
+ norm_layer=norm_layer,
392
+ )
393
+ else:
394
+ self.mlp = Mlp(
395
+ in_features=dim,
396
+ hidden_features=mlp_hidden_dim,
397
+ act_layer=act_layer,
398
+ subln=subln,
399
+ drop=drop,
400
+ )
401
+
402
+ if init_values is not None and init_values > 0:
403
+ self.gamma_1 = nn.Parameter(
404
+ init_values * torch.ones((dim,)), requires_grad=True
405
+ )
406
+ self.gamma_2 = nn.Parameter(
407
+ init_values * torch.ones((dim,)), requires_grad=True
408
+ )
409
+ else:
410
+ self.gamma_1, self.gamma_2 = None, None
411
+
412
+ self.postnorm = postnorm
413
+
414
+ def forward(self, x, rel_pos_bias=None, attn_mask=None):
415
+ if self.gamma_1 is None:
416
+ if self.postnorm:
417
+ x = x + self.drop_path(
418
+ self.norm1(
419
+ self.attn(x, rel_pos_bias=rel_pos_bias, attn_mask=attn_mask)
420
+ )
421
+ )
422
+ x = x + self.drop_path(self.norm2(self.mlp(x)))
423
+ else:
424
+ x = x + self.drop_path(
425
+ self.attn(
426
+ self.norm1(x), rel_pos_bias=rel_pos_bias, attn_mask=attn_mask
427
+ )
428
+ )
429
+ x = x + self.drop_path(self.mlp(self.norm2(x)))
430
+ else:
431
+ if self.postnorm:
432
+ x = x + self.drop_path(
433
+ self.gamma_1
434
+ * self.norm1(
435
+ self.attn(x, rel_pos_bias=rel_pos_bias, attn_mask=attn_mask)
436
+ )
437
+ )
438
+ x = x + self.drop_path(self.gamma_2 * self.norm2(self.mlp(x)))
439
+ else:
440
+ x = x + self.drop_path(
441
+ self.gamma_1
442
+ * self.attn(
443
+ self.norm1(x), rel_pos_bias=rel_pos_bias, attn_mask=attn_mask
444
+ )
445
+ )
446
+ x = x + self.drop_path(self.gamma_2 * self.mlp(self.norm2(x)))
447
+ return x
448
+
449
+
450
+ class PatchEmbed(nn.Module):
451
+ """Image to Patch Embedding"""
452
+
453
+ def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
454
+ super().__init__()
455
+ img_size = to_2tuple(img_size)
456
+ patch_size = to_2tuple(patch_size)
457
+ num_patches = (img_size[1] // patch_size[1]) * (img_size[0] // patch_size[0])
458
+ self.patch_shape = (img_size[0] // patch_size[0], img_size[1] // patch_size[1])
459
+ self.img_size = img_size
460
+ self.patch_size = patch_size
461
+ self.num_patches = num_patches
462
+
463
+ self.proj = nn.Conv2d(
464
+ in_chans, embed_dim, kernel_size=patch_size, stride=patch_size
465
+ )
466
+
467
+ def forward(self, x, **_):
468
+ target_dtype = self.proj.weight.dtype
469
+ _, __, h, w = x.shape
470
+ # FIXME look at relaxing size constraints
471
+ assert h == self.img_size[0] and w == self.img_size[1], (
472
+ f"Input image size ({h}*{w}) doesn't match model "
473
+ f'({self.img_size[0]}*{self.img_size[1]}).'
474
+ )
475
+ x = self.proj(x.to(dtype=target_dtype)).flatten(2).transpose(1, 2)
476
+ return x
477
+
478
+
479
+ class RelativePositionBias(nn.Module):
480
+ def __init__(self, window_size, num_heads):
481
+ super().__init__()
482
+ self.window_size = window_size
483
+ self.num_relative_distance = (2 * window_size[0] - 1) * (
484
+ 2 * window_size[1] - 1
485
+ ) + 3
486
+ self.relative_position_bias_table = nn.Parameter(
487
+ torch.zeros(self.num_relative_distance, num_heads)
488
+ ) # 2*Wh-1 * 2*Ww-1, nH
489
+ # cls to token & token to cls & cls to cls
490
+
491
+ # get pair-wise relative position index for each token inside the window
492
+ coords_h = torch.arange(window_size[0])
493
+ coords_w = torch.arange(window_size[1])
494
+ coords = torch.stack(torch.meshgrid([coords_h, coords_w])) # 2, Wh, Ww
495
+ coords_flatten = torch.flatten(coords, 1) # 2, Wh*Ww
496
+ relative_coords = (
497
+ coords_flatten[:, :, None] - coords_flatten[:, None, :]
498
+ ) # 2, Wh*Ww, Wh*Ww
499
+ relative_coords = relative_coords.permute(
500
+ 1, 2, 0
501
+ ).contiguous() # Wh*Ww, Wh*Ww, 2
502
+ relative_coords[:, :, 0] += window_size[0] - 1 # shift to start from 0
503
+ relative_coords[:, :, 1] += window_size[1] - 1
504
+ relative_coords[:, :, 0] *= 2 * window_size[1] - 1
505
+ relative_position_index = torch.zeros(
506
+ size=(window_size[0] * window_size[1] + 1,) * 2, dtype=relative_coords.dtype
507
+ )
508
+ relative_position_index[1:, 1:] = relative_coords.sum(-1) # Wh*Ww, Wh*Ww
509
+ relative_position_index[0, 0:] = self.num_relative_distance - 3
510
+ relative_position_index[0:, 0] = self.num_relative_distance - 2
511
+ relative_position_index[0, 0] = self.num_relative_distance - 1
512
+
513
+ self.register_buffer('relative_position_index', relative_position_index)
514
+
515
+ def forward(self):
516
+ relative_position_bias = self.relative_position_bias_table[
517
+ self.relative_position_index.view(-1)
518
+ ].view(
519
+ self.window_size[0] * self.window_size[1] + 1,
520
+ self.window_size[0] * self.window_size[1] + 1,
521
+ -1,
522
+ ) # Wh*Ww,Wh*Ww,nH
523
+ return relative_position_bias.permute(2, 0, 1).contiguous() # nH, Wh*Ww, Wh*Ww
524
+
525
+
526
+ class EVAVisionTransformer(nn.Module):
527
+ """Vision Transformer with support for patch or hybrid CNN input stage"""
528
+
529
+ def __init__(
530
+ self,
531
+ img_size=224,
532
+ patch_size=16,
533
+ in_chans=3,
534
+ num_classes=0,
535
+ embed_dim=768,
536
+ depth=12,
537
+ num_heads=12,
538
+ mlp_ratio=4.0,
539
+ qkv_bias=False,
540
+ qk_scale=None,
541
+ drop_rate=0.0,
542
+ attn_drop_rate=0.0,
543
+ drop_path_rate=0.0,
544
+ norm_layer=nn.LayerNorm,
545
+ init_values=None,
546
+ patch_dropout=0.0,
547
+ use_abs_pos_emb=True,
548
+ use_rel_pos_bias=False,
549
+ use_shared_rel_pos_bias=False,
550
+ rope=False,
551
+ use_mean_pooling=True,
552
+ init_scale=0.001,
553
+ grad_checkpointing=False,
554
+ xattn=False,
555
+ postnorm=False,
556
+ pt_hw_seq_len=16,
557
+ intp_freq=False,
558
+ naiveswiglu=False,
559
+ subln=False,
560
+ proj_type=None,
561
+ ):
562
+ super().__init__()
563
+ self.image_size = img_size
564
+ self.num_classes = num_classes
565
+ # num_features for consistency with other models
566
+ self.num_features = self.embed_dim = embed_dim
567
+
568
+ self.patch_embed = PatchEmbed(
569
+ img_size=img_size,
570
+ patch_size=patch_size,
571
+ in_chans=in_chans,
572
+ embed_dim=embed_dim,
573
+ )
574
+ num_patches = self.patch_embed.num_patches
575
+
576
+ self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
577
+ # self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
578
+ if use_abs_pos_emb:
579
+ self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
580
+ else:
581
+ self.pos_embed = None
582
+ self.pos_drop = nn.Dropout(p=drop_rate)
583
+
584
+ if use_shared_rel_pos_bias:
585
+ self.rel_pos_bias = RelativePositionBias(
586
+ window_size=self.patch_embed.patch_shape, num_heads=num_heads
587
+ )
588
+ else:
589
+ self.rel_pos_bias = None
590
+
591
+ if rope:
592
+ half_head_dim = embed_dim // num_heads // 2
593
+ hw_seq_len = img_size // patch_size
594
+ self.rope = VisionRotaryEmbeddingFast(
595
+ dim=half_head_dim,
596
+ pt_seq_len=pt_hw_seq_len,
597
+ ft_seq_len=hw_seq_len if intp_freq else None,
598
+ patch_dropout=patch_dropout,
599
+ )
600
+ else:
601
+ self.rope = None
602
+
603
+ self.naiveswiglu = naiveswiglu
604
+
605
+ dpr = [
606
+ x.item() for x in torch.linspace(0, drop_path_rate, depth)
607
+ ] # stochastic depth decay rule
608
+ self.use_rel_pos_bias = use_rel_pos_bias
609
+ self.blocks = nn.ModuleList(
610
+ [
611
+ Block(
612
+ dim=embed_dim,
613
+ num_heads=num_heads,
614
+ mlp_ratio=mlp_ratio,
615
+ qkv_bias=qkv_bias,
616
+ qk_scale=qk_scale,
617
+ drop=drop_rate,
618
+ attn_drop=attn_drop_rate,
619
+ drop_path=dpr[i],
620
+ norm_layer=norm_layer,
621
+ init_values=init_values,
622
+ window_size=self.patch_embed.patch_shape
623
+ if use_rel_pos_bias
624
+ else None,
625
+ xattn=xattn,
626
+ rope=self.rope,
627
+ postnorm=postnorm,
628
+ subln=subln,
629
+ naiveswiglu=naiveswiglu,
630
+ )
631
+ for i in range(depth)
632
+ ]
633
+ )
634
+ self.norm = nn.Identity() if use_mean_pooling else norm_layer(embed_dim)
635
+ self.fc_norm = norm_layer(embed_dim) if use_mean_pooling else None
636
+ if (num_classes == embed_dim) and (proj_type is None):
637
+ self.head = nn.Identity()
638
+ elif proj_type == 'linear':
639
+ self.head = nn.Linear(embed_dim, num_classes, bias=qkv_bias)
640
+ elif proj_type == 'mlp':
641
+ hidden_size = (embed_dim + num_classes) // 2
642
+ self.proj = nn.Sequential(
643
+ nn.Linear(embed_dim, hidden_size, bias=qkv_bias),
644
+ nn.GELU(),
645
+ nn.Linear(hidden_size, num_classes, bias=qkv_bias),
646
+ )
647
+
648
+ if self.pos_embed is not None:
649
+ trunc_normal_(self.pos_embed, std=0.02)
650
+
651
+ trunc_normal_(self.cls_token, std=0.02)
652
+
653
+ self.apply(self._init_weights)
654
+ self.fix_init_weight()
655
+
656
+ if isinstance(self.head, nn.Linear):
657
+ trunc_normal_(self.head.weight, std=0.02)
658
+ self.head.weight.data.mul_(init_scale)
659
+ if qkv_bias:
660
+ self.head.bias.data.mul_(init_scale)
661
+
662
+ # setting a patch_dropout of 0. would mean it is disabled and this function
663
+ # would be the identity fn
664
+ self.patch_dropout = (
665
+ PatchDropout(patch_dropout) if patch_dropout > 0.0 else nn.Identity()
666
+ )
667
+
668
+ self.grad_checkpointing = grad_checkpointing
669
+
670
+ def fix_init_weight(self):
671
+ def rescale(param, _layer_id):
672
+ param.div_(math.sqrt(2.0 * _layer_id))
673
+
674
+ for layer_id, layer in enumerate(self.blocks):
675
+ rescale(layer.attn.proj.weight.data, layer_id + 1)
676
+ if self.naiveswiglu:
677
+ rescale(layer.mlp.w3.weight.data, layer_id + 1)
678
+ else:
679
+ rescale(layer.mlp.fc2.weight.data, layer_id + 1)
680
+
681
+ def get_cast_dtype(self) -> torch.dtype:
682
+ return self.blocks[0].mlp.fc2.weight.dtype
683
+
684
+ @staticmethod
685
+ def _init_weights(m):
686
+ if isinstance(m, nn.Linear):
687
+ trunc_normal_(m.weight, std=0.02)
688
+ if m.bias is not None:
689
+ nn.init.constant_(m.bias, 0)
690
+ elif isinstance(m, nn.LayerNorm):
691
+ nn.init.constant_(m.bias, 0)
692
+ nn.init.constant_(m.weight, 1.0)
693
+
694
+ @staticmethod
695
+ def _initialize_weights(m):
696
+ EVAVisionTransformer._init_weights(m)
697
+
698
+ def get_num_layers(self):
699
+ return len(self.blocks)
700
+
701
+ def lock(self, unlocked_groups=0, *_, **__):
702
+ assert (
703
+ unlocked_groups == 0
704
+ ), 'partial locking not currently supported for this model'
705
+ for param in self.parameters():
706
+ param.requires_grad = False
707
+
708
+ @torch.jit.ignore
709
+ def set_grad_checkpointing(self, enable=True):
710
+ self.grad_checkpointing = enable
711
+
712
+ @torch.jit.ignore
713
+ def no_weight_decay(self):
714
+ return {'pos_embed', 'cls_token'}
715
+
716
+ def get_classifier(self):
717
+ return self.head
718
+
719
+ def reset_classifier(self, num_classes, *_, **__):
720
+ self.num_classes = num_classes
721
+ self.head = (
722
+ nn.Linear(self.embed_dim, num_classes) if num_classes > 0 else nn.Identity()
723
+ )
724
+
725
+ def forward_features(self, x, return_all_features=False):
726
+ x = self.patch_embed(x)
727
+ batch_size, seq_len, _ = x.size()
728
+
729
+ cls_tokens = self.cls_token.expand(
730
+ batch_size, -1, -1
731
+ ) # stole cls_tokens impl from Phil Wang, thanks
732
+ x = torch.cat((cls_tokens, x), dim=1)
733
+ if self.pos_embed is not None:
734
+ x = x + self.pos_embed
735
+ x = self.pos_drop(x)
736
+
737
+ # a patch_dropout of 0. would mean it is disabled and this function would do
738
+ # nothing but return what was passed in
739
+ if self.rope is not None:
740
+ if self.training and not isinstance(self.patch_dropout, nn.Identity):
741
+ x, patch_indices_keep = self.patch_dropout(x)
742
+ self.rope.forward = partial(
743
+ self.rope.forward, patch_indices_keep=patch_indices_keep
744
+ )
745
+ else:
746
+ self.rope.forward = partial(self.rope.forward, patch_indices_keep=None)
747
+ x = self.patch_dropout(x)
748
+ else:
749
+ x = self.patch_dropout(x)
750
+
751
+ rel_pos_bias = self.rel_pos_bias() if self.rel_pos_bias is not None else None
752
+ for blk in self.blocks:
753
+ if self.grad_checkpointing:
754
+ x = checkpoint(blk, x, (rel_pos_bias,))
755
+ else:
756
+ x = blk(x, rel_pos_bias=rel_pos_bias)
757
+
758
+ if not return_all_features:
759
+ x = self.norm(x)
760
+ if self.fc_norm is not None:
761
+ return self.fc_norm(x.mean(1))
762
+ else:
763
+ return x[:, 0]
764
+ return x
765
+
766
+ def forward(self, x, return_all_features=False):
767
+ if return_all_features:
768
+ return self.forward_features(x, return_all_features)
769
+ x = self.forward_features(x)
770
+ x = self.head(x)
771
+ return x
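Below is a minimal sketch of instantiating the `EVAVisionTransformer` defined above on CPU. The import path is hypothetical (in practice the class is loaded through `trust_remote_code`), and it assumes `timm` and the sibling `rope_embeddings` module are available so the module-level imports resolve.

    import torch
    from functools import partial
    from torch import nn
    from text_encoder_2.eva_model import EVAVisionTransformer  # hypothetical import path

    model = EVAVisionTransformer(
        img_size=224,
        patch_size=16,
        embed_dim=768,
        depth=12,
        num_heads=12,
        qkv_bias=True,
        num_classes=768,  # equal to embed_dim, so the head becomes nn.Identity()
        norm_layer=partial(nn.LayerNorm, eps=1e-6),
        naiveswiglu=True,  # SwiGLU MLP blocks
        subln=True,        # sub-LayerNorm inside attention and MLP
        xattn=False,       # plain PyTorch attention path, no xformers needed
    ).eval()

    with torch.no_grad():
        feats = model(torch.randn(1, 3, 224, 224))
    print(feats.shape)  # torch.Size([1, 768]), mean-pooled features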
text_encoder_2/hf_model.py ADDED
@@ -0,0 +1,364 @@
1
+ import re
2
+ import warnings
3
+ from typing import Dict, Optional, Union
4
+
5
+ import torch
6
+ import torch.nn as nn
7
+ from transformers import AutoConfig, AutoModel, PretrainedConfig
8
+ from transformers.modeling_outputs import (
9
+ BaseModelOutput,
10
+ BaseModelOutputWithPooling,
11
+ BaseModelOutputWithPoolingAndCrossAttentions,
12
+ )
13
+
14
+ _HF_ARCH_DICT = {
15
+ # https://huggingface.co/docs/transformers/model_doc/roberta#roberta
16
+ 'roberta': {
17
+ 'config_names': {
18
+ 'context_length': 'max_position_embeddings',
19
+ 'vocab_size': 'vocab_size',
20
+ 'width': 'hidden_size',
21
+ 'heads': 'num_attention_heads',
22
+ 'layers': 'num_hidden_layers',
23
+ 'layer_attr': 'layer',
24
+ 'token_embeddings_attr': 'embeddings',
25
+ },
26
+ 'pooler': 'mean_pooler',
27
+ },
28
+ # https://huggingface.co/docs/transformers/model_doc/xlm-roberta#transformers.XLMRobertaConfig
29
+ 'xlm-roberta': {
30
+ 'config_names': {
31
+ 'context_length': 'max_position_embeddings',
32
+ 'vocab_size': 'vocab_size',
33
+ 'width': 'hidden_size',
34
+ 'heads': 'num_attention_heads',
35
+ 'layers': 'num_hidden_layers',
36
+ 'layer_attr': 'layer',
37
+ 'token_embeddings_attr': 'embeddings',
38
+ },
39
+ 'pooler': 'mean_pooler',
40
+ },
41
+ # https://huggingface.co/docs/transformers/model_doc/bert
42
+ 'bert': {
43
+ 'config_names': {
44
+ 'context_length': 'max_position_embeddings',
45
+ 'vocab_size': 'vocab_size',
46
+ 'width': 'hidden_size',
47
+ 'heads': 'num_attention_heads',
48
+ 'layers': 'num_hidden_layers',
49
+ },
50
+ 'pooler': 'cls_pooler',
51
+ },
52
+ }
53
+
54
+ _POOLERS = {}
55
+
56
+
57
+ def _camel2snake(s):
58
+ return re.sub(r'(?<!^)(?=[A-Z])', '_', s).lower()
59
+
60
+
61
+ def register_pooler(cls):
62
+ """Decorator registering pooler class"""
63
+ _POOLERS[_camel2snake(cls.__name__)] = cls
64
+ return cls
65
+
66
+
67
+ @register_pooler
68
+ class MeanPooler(nn.Module):
69
+ @staticmethod
70
+ def forward(x: BaseModelOutput, attention_mask: torch.Tensor):
71
+ masked_output = x.last_hidden_state * attention_mask.unsqueeze(-1)
72
+ return masked_output.sum(dim=1) / attention_mask.sum(-1, keepdim=True)
73
+
74
+
75
+ @register_pooler
76
+ class MaxPooler(nn.Module):
77
+ @staticmethod
78
+ def forward(x: BaseModelOutput, attention_mask: torch.Tensor):
79
+ masked_output = x.last_hidden_state.masked_fill(
80
+ attention_mask.unsqueeze(-1), -torch.inf
81
+ )
82
+ return masked_output.max(1).values
83
+
84
+
85
+ @register_pooler
86
+ class ClsPooler(nn.Module):
87
+ def __init__(self, use_pooler_output: bool = True):
88
+ super().__init__()
89
+ self.cls_token_position = 0
90
+ self.use_pooler_output = use_pooler_output
91
+
92
+ def forward(self, x: BaseModelOutput, _: torch.Tensor):
93
+ if (
94
+ self.use_pooler_output
95
+ and isinstance(
96
+ x,
97
+ (
98
+ BaseModelOutputWithPooling,
99
+ BaseModelOutputWithPoolingAndCrossAttentions,
100
+ ),
101
+ )
102
+ and (x.pooler_output is not None)
103
+ ):
104
+ return x.pooler_output
105
+ return x.last_hidden_state[:, self.cls_token_position, :]
106
+
107
+
108
+ class HFTextEncoder(nn.Module):
109
+ output_tokens: torch.jit.Final[bool]
110
+
111
+ def __init__(
112
+ self,
113
+ model_name_or_path: str,
114
+ output_dim: int,
115
+ config: PretrainedConfig = None,
116
+ pooler_type: str = None,
117
+ proj_type: str = None,
118
+ proj_bias: bool = False,
119
+ pretrained: bool = True,
120
+ output_tokens: bool = False,
121
+ trust_remote_code: bool = False,
122
+ revision: Optional[str] = None,
123
+ code_revision: Optional[str] = None,
124
+ default_instruction_task: Optional[str] = None,
125
+ default_lora_task: Optional[str] = None,
126
+ model_config_kwargs: Optional[Dict] = None,
127
+ ):
128
+ super().__init__()
129
+ self.output_tokens = output_tokens
130
+ self.output_dim = output_dim
131
+
132
+ model_config_kwargs = model_config_kwargs or {}
133
+
134
+ if config is None:
135
+ if pretrained:
136
+ self.transformer = AutoModel.from_pretrained(
137
+ model_name_or_path,
138
+ trust_remote_code=trust_remote_code,
139
+ revision=revision,
140
+ add_pooling_layer=False,
141
+ code_revision=code_revision,
142
+ **model_config_kwargs,
143
+ )
144
+ self.config = self.transformer.config
145
+ else:
146
+ self.config = AutoConfig.from_pretrained(
147
+ model_name_or_path,
148
+ trust_remote_code=trust_remote_code,
149
+ code_revision=code_revision,
150
+ )
151
+ self.config.update(model_config_kwargs)
152
+ self.transformer = AutoModel.from_config(
153
+ self.config,
154
+ trust_remote_code=trust_remote_code,
155
+ add_pooling_layer=False,
156
+ code_revision=code_revision,
157
+ )
158
+ if (
159
+ hasattr(self.config, 'is_encoder_decoder')
160
+ and self.config.is_encoder_decoder
161
+ ):
162
+ self.transformer = self.transformer.encoder
163
+
164
+ else:
165
+ self.config = config
166
+ self.config.update(model_config_kwargs)
167
+ self.transformer = AutoModel.from_config(
168
+ self.config,
169
+ trust_remote_code=trust_remote_code,
170
+ revision=revision,
171
+ code_revision=code_revision,
172
+ )
173
+ self.vocab_size = getattr(self.config, 'vocab_size', 0)
174
+ self.context_length = getattr(self.config, 'max_position_embeddings', 0)
175
+
176
+ pooler_type = pooler_type or _HF_ARCH_DICT[self.config.model_type]['pooler']
177
+ self.pooler = _POOLERS[pooler_type]()
178
+
179
+ d_model = getattr(
180
+ self.config, _HF_ARCH_DICT[self.config.model_type]['config_names']['width']
181
+ )
182
+ if (d_model == output_dim) and (proj_type is None): # do we always need a proj?
183
+ self.proj = nn.Identity()
184
+ elif (d_model != output_dim) or proj_type == 'linear':
185
+ self.proj = nn.Linear(d_model, output_dim, bias=proj_bias)
186
+ elif proj_type == 'mlp':
187
+ hidden_size = (d_model + output_dim) // 2
188
+ self.proj = nn.Sequential(
189
+ nn.Linear(d_model, hidden_size, bias=proj_bias),
190
+ nn.GELU(),
191
+ nn.Linear(hidden_size, output_dim, bias=proj_bias),
192
+ )
193
+
194
+ self._task_instructions = {}
195
+ self._lora_adaptation_map = {}
196
+ self._supports_task_instructions = False
197
+ self._supports_lora = False
198
+ if (
199
+ hasattr(self.transformer, '_adaptation_map')
200
+ and len(self.transformer._adaptation_map) > 0
201
+ ):
202
+ self._lora_adaptation_map = self.transformer._adaptation_map
203
+ self._supports_lora = True
204
+ if (
205
+ hasattr(self.transformer, '_task_instructions')
206
+ and len(self.transformer._task_instructions) > 0
207
+ ):
208
+ self._task_instructions = self.transformer._task_instructions
209
+ self._supports_task_instructions = True
210
+
211
+ self._default_instruction_task = None
212
+ self._default_lora_task = None
213
+ self._default_instruction = None
214
+ self._default_loraid = None
215
+
216
+ if default_instruction_task is not None:
217
+ self._default_instruction_task = default_instruction_task
218
+ self._default_instruction = self.get_instruction_from_task(
219
+ default_instruction_task
220
+ )
221
+ if default_lora_task is not None:
222
+ self._default_lora_task = default_lora_task
223
+ self._default_loraid = self.get_loraid_from_task(default_lora_task)
224
+
225
+ @property
226
+ def supports_task_instructions(self) -> bool:
227
+ return self._supports_task_instructions
228
+
229
+ @property
230
+ def supports_lora(self) -> bool:
231
+ return self._supports_lora
232
+
233
+ @property
234
+ def task_instructions(self) -> Dict[str, str]:
235
+ return self._task_instructions
236
+
237
+ @property
238
+ def lora_adaptation_map(self) -> Dict[str, int]:
239
+ return self._lora_adaptation_map
240
+
241
+ @property
242
+ def default_instruction(self) -> Optional[str]:
243
+ return self._default_instruction
244
+
245
+ @property
246
+ def default_loraid(self) -> Optional[int]:
247
+ return self._default_loraid
248
+
249
+ def get_instruction_from_task(self, task: Optional[str]) -> Optional[str]:
250
+ if self._supports_task_instructions:
251
+ if task is None:
252
+ return self._default_instruction
253
+ if task not in self._task_instructions:
254
+ raise ValueError(
255
+ f'Unsupported task \'{task}\'. Choose one of the following: '
256
+ f'{", ".join(self._task_instructions)} or set to None to disable '
257
+ f'task instructions completely'
258
+ )
259
+ return self._task_instructions[task]
260
+ else:
261
+ if task is not None:
262
+ warnings.warn(
263
+ 'Model does not support task instructions, ignoring instruction '
264
+ f"task '{task}'"
265
+ )
266
+ return None
267
+
268
+ def get_loraid_from_task(self, task: Optional[str]) -> Optional[int]:
269
+ if self._supports_lora:
270
+ if task is None:
271
+ return self._default_loraid
272
+ if task not in self._lora_adaptation_map:
273
+ raise ValueError(
274
+ f'Unsupported task \'{task}\'. Choose one of the following: '
275
+ f'{", ".join(self._task_instructions)} or set to None to disable '
276
+ f'the LoRA adapters completely'
277
+ )
278
+ return self._lora_adaptation_map[task]
279
+ else:
280
+ if task is not None:
281
+ warnings.warn(
282
+ f"Model does not support LoRA adapters, ignoring LoRA task '{task}'"
283
+ )
284
+ return None
285
+
286
+ @staticmethod
287
+ def get_adapter_mask_from_loraid(
288
+ batch_size: int, loraid: int, device: Union[str, torch.device]
289
+ ):
290
+ return torch.full((batch_size,), loraid, dtype=torch.int32, device=device)
291
+
292
+ @torch.jit.ignore
293
+ def set_grad_checkpointing(self, _=True):
294
+ self.transformer.gradient_checkpointing_enable()
295
+
296
+ def init_parameters(self):
297
+ pass
298
+
299
+ def forward(self, x: torch.Tensor, adapter_mask: Optional[torch.Tensor] = None):
300
+ if adapter_mask is None:
301
+ default_loraid = self.default_loraid
302
+ if default_loraid is not None:
303
+ adapter_mask = self.get_adapter_mask_from_loraid(
304
+ x.shape[0], default_loraid, x.device
305
+ )
306
+ else:
307
+ if not self.supports_lora:
308
+ warnings.warn(
309
+ 'Model does not support LoRA adapters, setting adapter_mask to None'
310
+ )
311
+ adapter_mask = None
312
+
313
+ attention_mask = (x != self.config.pad_token_id).long()
314
+ lora_kwargs = {}
315
+ if adapter_mask is not None:
316
+ lora_kwargs['adapter_mask'] = adapter_mask
317
+
318
+ out = self.transformer(
319
+ input_ids=x, attention_mask=attention_mask, **lora_kwargs
320
+ )
321
+ pooled_out = self.pooler(out, attention_mask)
322
+ projected = self.proj(pooled_out)
323
+ seqlen = out.last_hidden_state.shape[1]
324
+ tokens = (
325
+ out.last_hidden_state[
326
+ :, torch.arange(seqlen) != self.pooler.cls_token_position, :
327
+ ]
328
+ if isinstance(self.pooler, ClsPooler)
329
+ else out.last_hidden_state
330
+ )
331
+ if self.output_tokens:
332
+ return projected, tokens
333
+ return projected
334
+
335
+ def lock(self, unlocked_layers: int = 0, freeze_layer_norm: bool = True):
336
+ if not unlocked_layers:
337
+ for n, p in self.transformer.named_parameters():
338
+ p.requires_grad = (
339
+ (not freeze_layer_norm) if 'LayerNorm' in n.split('.') else False
340
+ )
341
+ return
342
+
343
+ encoder = (
344
+ self.transformer.encoder
345
+ if hasattr(self.transformer, 'encoder')
346
+ else self.transformer
347
+ )
348
+ layer_list = getattr(
349
+ encoder, _HF_ARCH_DICT[self.config.model_type]['config_names']['layer_attr']
350
+ )
351
+ print(f'Unlocking {unlocked_layers}/{len(layer_list) + 1} layers of hf model')
352
+ embeddings = getattr(
353
+ self.transformer,
354
+ _HF_ARCH_DICT[self.config.model_type]['config_names'][
355
+ 'token_embeddings_attr'
356
+ ],
357
+ )
358
+ modules = [embeddings, *layer_list][:-unlocked_layers]
359
+ # freeze layers
360
+ for module in modules:
361
+ for n, p in module.named_parameters():
362
+ p.requires_grad = (
363
+ (not freeze_layer_norm) if 'LayerNorm' in n.split('.') else False
364
+ )
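A minimal sketch of the pooler registry defined at the top of this file (assuming the module is imported so that `_POOLERS` is in scope): `register_pooler` maps CamelCase class names to snake_case keys, which `HFTextEncoder` resolves via `_HF_ARCH_DICT[...]['pooler']` when no explicit `pooler_type` is passed. The tensor shapes below are illustrative, not taken from the model.

    import torch
    from transformers.modeling_outputs import BaseModelOutput

    print(sorted(_POOLERS))  # ['cls_pooler', 'max_pooler', 'mean_pooler']

    hidden = torch.randn(2, 5, 8)  # (batch, seq_len, width)
    mask = torch.tensor([[1, 1, 1, 0, 0],
                         [1, 1, 1, 1, 1]])  # 1 = real token, 0 = padding
    pooled = _POOLERS['mean_pooler']()(BaseModelOutput(last_hidden_state=hidden), mask)
    print(pooled.shape)  # torch.Size([2, 8]), mean over unmasked positions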
text_encoder_2/model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:eff4c0a13ab4de71a9927a56968fef44e626920ff935e503f1bd3e6ec797062d
3
+ size 1730688642
text_encoder_2/modeling_clip.py ADDED
@@ -0,0 +1,684 @@
1
+ # coding=utf-8
2
+ #
3
+ # Code mainly copied from:
4
+ # https://github.com/huggingface/transformers/blob/main/src/transformers/models/clip/modeling_clip.py
5
+ # and adjusted for Jina CLIP
6
+
7
+ import base64
8
+ import importlib.util
9
+ import warnings
10
+ from functools import partial
11
+ from io import BytesIO
12
+ from typing import List, Optional, Tuple, Union
13
+
14
+ import numpy as np
15
+ import requests
16
+ import torch
17
+ import torch.nn.functional as f
18
+ import torch.utils.checkpoint
19
+ from PIL import Image
20
+ from torch import nn
21
+ from transformers import (
22
+ AutoImageProcessor,
23
+ AutoTokenizer,
24
+ BatchEncoding,
25
+ BatchFeature,
26
+ PreTrainedModel,
27
+ logging,
28
+ )
29
+ from transformers.models.clip.modeling_clip import (
30
+ CLIPOutput,
31
+ CLIPTextModelOutput,
32
+ CLIPVisionModelOutput,
33
+ clip_loss,
34
+ )
35
+
36
+ try:
37
+ from tqdm.autonotebook import trange
38
+
39
+ has_tqdm = True
40
+ except ImportError:
41
+ trange = None
42
+ has_tqdm = False
43
+
44
+ from .configuration_clip import JinaCLIPConfig, JinaCLIPTextConfig, JinaCLIPVisionConfig
45
+ from .eva_model import EVAVisionTransformer
46
+ from .hf_model import HFTextEncoder
47
+ from .rope_embeddings import VisionRotaryEmbeddingFast # noqa: F401
48
+ from .transform import ( # noqa: F401
49
+ OPENAI_DATASET_MEAN,
50
+ OPENAI_DATASET_STD,
51
+ image_transform,
52
+ )
53
+
54
+ logger = logging.get_logger(__name__)
55
+
56
+
57
+ """ Jina CLIP model implementation """
58
+
59
+
60
+ class LayerNorm(nn.LayerNorm):
61
+ """Subclass torch's LayerNorm (with cast back to input dtype)."""
62
+
63
+ def forward(self, x: torch.Tensor):
64
+ origtype = x.dtype
65
+ x = f.layer_norm(x, self.normalized_shape, self.weight, self.bias, self.eps)
66
+ return x.to(origtype)
67
+
68
+
69
+ def _build_text_tower(config: JinaCLIPTextConfig) -> HFTextEncoder:
70
+ return HFTextEncoder(
71
+ model_name_or_path=config.hf_model_name_or_path,
72
+ output_dim=config.embed_dim,
73
+ default_instruction_task=config.default_instruction_task,
74
+ default_lora_task=config.default_lora_task,
75
+ pooler_type=config.pooler_type,
76
+ proj_type=config.proj_type,
77
+ proj_bias=config.proj_bias,
78
+ pretrained=False,
79
+ output_tokens=False,
80
+ trust_remote_code=True,
81
+ revision=None,
82
+ model_config_kwargs=config.hf_model_config_kwargs,
83
+ )
84
+
85
+
86
+ def _build_vision_tower(config: JinaCLIPVisionConfig) -> EVAVisionTransformer:
87
+ norm_layer = partial(LayerNorm, eps=1e-6)
88
+
89
+ if config.fused_layer_norm:
90
+ try:
91
+ from apex.normalization import FusedLayerNorm
92
+
93
+ norm_layer = partial(FusedLayerNorm, eps=1e-6)
94
+ except (ModuleNotFoundError, ImportError):
95
+ logger.warning('Please install apex to use fused layer norm, ignoring')
96
+
97
+ return EVAVisionTransformer(
98
+ img_size=config.image_size,
99
+ patch_size=config.patch_size,
100
+ num_classes=config.embed_dim,
101
+ use_mean_pooling=False,
102
+ init_values=config.ls_init_value,
103
+ patch_dropout=config.patch_dropout,
104
+ embed_dim=config.width,
105
+ depth=config.layers,
106
+ num_heads=config.width // config.head_width,
107
+ mlp_ratio=config.mlp_ratio,
108
+ qkv_bias=config.qkv_bias,
109
+ drop_path_rate=config.drop_path_rate,
110
+ norm_layer=norm_layer,
111
+ xattn=config.x_attention,
112
+ rope=config.rope_embeddings,
113
+ postnorm=config.post_norm,
114
+ pt_hw_seq_len=config.pt_hw_seq_len,
115
+ intp_freq=config.intp_freq,
116
+ naiveswiglu=config.naive_swiglu,
117
+ subln=config.subln,
118
+ proj_type=config.proj_type,
119
+ )
120
+
121
+
122
+ def _resolve_attention_libs(config: JinaCLIPConfig):
123
+ use_text_flash_attn = (
124
+ config.use_text_flash_attn
125
+ if config.use_text_flash_attn is not None
126
+ else config.text_config.hf_model_config_kwargs.get('use_flash_attn', True)
127
+ )
128
+ use_vision_xformers = (
129
+ config.use_vision_xformers
130
+ if config.use_vision_xformers is not None
131
+ else config.vision_config.x_attention
132
+ )
133
+
134
+ def _resolve_use_text_flash_attn() -> bool:
135
+ if use_text_flash_attn:
136
+ if not torch.cuda.is_available():
137
+ warnings.warn('Flash attention requires CUDA, disabling')
138
+ return False
139
+ if importlib.util.find_spec('flash_attn') is None:
140
+ warnings.warn(
141
+ 'Flash attention is not installed. Check '
142
+ 'https://github.com/Dao-AILab/flash-attention?'
143
+ 'tab=readme-ov-file#installation-and-features '
144
+ 'for installation instructions, disabling'
145
+ )
146
+ return False
147
+ major, minor, *_ = torch.version.cuda.split('.')
148
+ major, minor = int(major), int(minor)
149
+ if major < 11 or (major == 11 and minor < 7):
150
+ warnings.warn(
151
+ 'Flash attention requires CUDA>=11.7. Found version '
152
+ f'{major}.{minor}, disabling'
153
+ )
154
+ return False
155
+ capability = torch.cuda.get_device_capability()
156
+ major, *_ = capability
157
+ major = int(major)
158
+ if major < 8:
159
+ device_name = torch.cuda.get_device_properties(0).name
160
+ warnings.warn(
161
+ 'Flash attention requires device capability>=8.0 (NVIDIA Ampere, '
162
+ f'Hopper or ADA). Found device {device_name} with capability '
163
+ f'{capability}, disabling'
164
+ )
165
+ return False
166
+ return True
167
+ return False
168
+
169
+ def _resolve_use_vision_xformers() -> bool:
170
+ if use_vision_xformers:
171
+ if not torch.cuda.is_available():
172
+ warnings.warn('xFormers requires CUDA, disabling')
173
+ return False
174
+ if importlib.util.find_spec('xformers') is None:
175
+ warnings.warn(
176
+ 'xFormers is not installed. Check '
177
+ 'https://github.com/facebookresearch/xformers?'
178
+ 'tab=readme-ov-file#installing-xformers for installation '
179
+ 'instructions, disabling'
180
+ )
181
+ return False
182
+ return True
183
+ return False
184
+
185
+ _use_text_flash_attn = _resolve_use_text_flash_attn()
186
+ _use_vision_xformers = _resolve_use_vision_xformers()
187
+
188
+ config.use_text_flash_attn = _use_text_flash_attn
189
+ config.use_vision_xformers = _use_vision_xformers
190
+ config.text_config.hf_model_config_kwargs['use_flash_attn'] = _use_text_flash_attn
191
+ config.vision_config.x_attention = _use_vision_xformers
192
+
193
+ return config
194
+
195
+
196
+ class JinaCLIPPreTrainedModel(PreTrainedModel):
197
+ """
198
+ An abstract class to handle weights initialization and a simple interface for
199
+ downloading and loading pretrained models.
200
+ """
201
+
202
+ config_class = JinaCLIPConfig
203
+ base_model_prefix = 'clip'
204
+ supports_gradient_checkpointing = True
205
+
206
+ def _init_weights(self, module):
207
+ """Initialize the weights"""
208
+ if isinstance(module, JinaCLIPModel):
209
+ if isinstance(module.text_projection, nn.Linear):
210
+ nn.init.normal_(
211
+ module.text_projection.weight,
212
+ std=module.text_embed_dim**-0.5 * self.config.initializer_factor,
213
+ )
214
+ if isinstance(module.visual_projection, nn.Linear):
215
+ nn.init.normal_(
216
+ module.visual_projection.weight,
217
+ std=module.vision_embed_dim**-0.5 * self.config.initializer_factor,
218
+ )
219
+ if isinstance(module, nn.LayerNorm):
220
+ module.bias.data.zero_()
221
+ module.weight.data.fill_(1.0)
222
+ if isinstance(module, nn.Linear) and module.bias is not None:
223
+ module.bias.data.zero_()
224
+
225
+ @classmethod
226
+ def from_pretrained(cls, *args, **kwargs):
227
+ if 'torch_dtype' not in kwargs:
228
+ kwargs['torch_dtype'] = 'auto'
229
+ return super().from_pretrained(*args, **kwargs)
230
+
231
+
232
+ class JinaCLIPTextModel(JinaCLIPPreTrainedModel):
233
+ config_class = JinaCLIPTextConfig
234
+
235
+ def __init__(self, config: JinaCLIPTextConfig):
236
+ super().__init__(config)
237
+ self.text_model = _build_text_tower(config)
238
+ self.post_init()
239
+
240
+ def forward(
241
+ self,
242
+ input_ids: Union[None, torch.Tensor, BatchEncoding] = None,
243
+ return_dict: Optional[bool] = None,
244
+ *_,
245
+ **__,
246
+ ) -> Union[Tuple[Optional[torch.FloatTensor], ...], CLIPTextModelOutput]:
247
+ return_dict = (
248
+ return_dict if return_dict is not None else self.config.use_return_dict
249
+ )
250
+ x = input_ids.input_ids if isinstance(input_ids, BatchEncoding) else input_ids
251
+ feats = self.text_model(x=x)
252
+ out = CLIPTextModelOutput(text_embeds=feats)
253
+ return out if return_dict else out.to_tuple()
254
+
255
+
256
+ class JinaCLIPVisionModel(JinaCLIPPreTrainedModel):
257
+ config_class = JinaCLIPVisionConfig
258
+ main_input_name = 'pixel_values'
259
+
260
+ def __init__(self, config: JinaCLIPVisionConfig):
261
+ super().__init__(config)
262
+ self.vision_model = _build_vision_tower(config)
263
+ self.post_init()
264
+
265
+ def forward(
266
+ self,
267
+ pixel_values: Union[None, torch.FloatTensor, BatchFeature] = None,
268
+ return_dict: Optional[bool] = None,
269
+ *_,
270
+ **__,
271
+ ) -> Union[Tuple[Optional[torch.FloatTensor], ...], CLIPVisionModelOutput]:
272
+ return_dict = (
273
+ return_dict if return_dict is not None else self.config.use_return_dict
274
+ )
275
+ x = (
276
+ pixel_values.pixel_values
277
+ if isinstance(pixel_values, BatchFeature)
278
+ else pixel_values
279
+ )
280
+ feats = self.vision_model(x=x)
281
+ out = CLIPVisionModelOutput(image_embeds=feats)
282
+ return out if return_dict else out.to_tuple()
283
+
284
+
285
+ class JinaCLIPModel(JinaCLIPPreTrainedModel):
286
+ config_class = JinaCLIPConfig
287
+
288
+ def __init__(self, config: JinaCLIPConfig):
289
+ super().__init__(config)
290
+
291
+ if not isinstance(config.text_config, JinaCLIPTextConfig):
292
+ raise ValueError(
293
+ 'Attribute config.text_config is expected to be of type '
294
+ f'JinaCLIPTextConfig but is of type {type(config.text_config)}.'
295
+ )
296
+
297
+ if not isinstance(config.vision_config, JinaCLIPVisionConfig):
298
+ raise ValueError(
299
+ 'Attribute config.vision_config is expected to be of type '
300
+ f'JinaCLIPVisionConfig but is of type {type(config.vision_config)}.'
301
+ )
302
+
303
+ config = _resolve_attention_libs(config)
304
+ text_config = config.text_config
305
+ vision_config = config.vision_config
306
+
307
+ self.add_projections = config.add_projections
308
+ self.projection_dim = config.projection_dim
309
+ self.text_embed_dim = text_config.embed_dim
310
+ self.vision_embed_dim = vision_config.embed_dim
311
+ self.text_model = _build_text_tower(text_config)
312
+ self.vision_model = _build_vision_tower(vision_config)
313
+ self.logit_scale = nn.Parameter(
314
+ torch.tensor(self.config.logit_scale_init_value)
315
+ )
316
+ if self.add_projections:
317
+ self.visual_projection = nn.Linear(
318
+ self.vision_embed_dim, self.projection_dim, bias=False
319
+ )
320
+ self.text_projection = nn.Linear(
321
+ self.text_embed_dim, self.projection_dim, bias=False
322
+ )
323
+ else:
324
+ self.visual_projection = nn.Identity()
325
+ self.text_projection = nn.Identity()
326
+
327
+ self.tokenizer = None
328
+ self.preprocess = None
329
+ self.post_init()
330
+
331
+ def get_tokenizer(self):
332
+ if self.tokenizer is None:
333
+ self.tokenizer = AutoTokenizer.from_pretrained(
334
+ self.config._name_or_path, trust_remote_code=True
335
+ )
336
+ return self.tokenizer
337
+
338
+ def get_preprocess(self):
339
+ if not self.preprocess:
340
+ self.preprocess = AutoImageProcessor.from_pretrained(
341
+ self.config._name_or_path, trust_remote_code=True
342
+ )
343
+ return self.preprocess
344
+
345
+ def get_text_features(
346
+ self,
347
+ input_ids: Union[None, torch.Tensor, BatchEncoding] = None,
348
+ *_,
349
+ **__,
350
+ ) -> torch.FloatTensor:
351
+ x = input_ids.input_ids if isinstance(input_ids, BatchEncoding) else input_ids
352
+ return self.text_projection(self.text_model(x=x))
353
+
354
+ def get_image_features(
355
+ self,
356
+ pixel_values: Union[None, torch.FloatTensor, BatchFeature] = None,
357
+ *_,
358
+ **__,
359
+ ) -> torch.FloatTensor:
360
+ x = (
361
+ pixel_values.pixel_values
362
+ if isinstance(pixel_values, BatchFeature)
363
+ else pixel_values
364
+ )
365
+ return self.visual_projection(self.vision_model(x=x))
366
+
367
+ def _truncate_embeddings(self, embeddings: torch.Tensor, truncate_dim: int):
368
+ if not self.config.matryoshka_dimensions:
369
+ logger.warning(
370
+ 'Model is not trained using Matryoshka Representation Learning, '
371
+ 'truncating embeddings will not work optimally.'
372
+ )
373
+ return embeddings[:, :truncate_dim]
374
+
375
+ @staticmethod
376
+ def _decode_image_data(image_data_str: str) -> Image:
377
+ header, data = image_data_str.split(',', 1)
378
+ image_data = base64.b64decode(data)
379
+ return Image.open(BytesIO(image_data))
380
+
381
+ @torch.inference_mode()
382
+ def encode_image(
383
+ self,
384
+ images: Union[str, List[Union[str, 'Image.Image']]],
385
+ batch_size: int = 32,
386
+ show_progress_bar: Optional[bool] = None,
387
+ convert_to_numpy: bool = True,
388
+ convert_to_tensor: bool = False,
389
+ device: Optional[torch.device] = None,
390
+ normalize_embeddings: bool = True,
391
+ truncate_dim: Optional[int] = None,
392
+ ) -> Union[List[torch.Tensor], np.ndarray, torch.Tensor]:
393
+ """
394
+ Computes image embeddings
395
+
396
+ Args:
397
+ images(`str` or `List[Union[str, Image.Image]]`):
398
+ Image paths, URLs, PIL images, or data:image/ strings to be encoded
399
+ batch_size(`int`, *optional*, defaults to 32):
400
+ Batch size for the computation
401
+ show_progress_bar(`bool`, *optional*, defaults to None):
402
+ Show a progress bar when encoding images. If set to None, progress bar
403
+ is only shown when `logger.level == logging.INFO` or
404
+ `logger.level == logging.DEBUG`
405
+ convert_to_numpy(`bool`, *optional*, defaults to True):
406
+ If true, the output is a list of numpy vectors. Else, it is a list of
407
+ pytorch tensors
408
+ convert_to_tensor(`bool`, *optional*, defaults to False):
409
+ If true, a single stacked tensor is returned, which overrides any setting
410
+ from convert_to_numpy
411
+ device(`torch.device`, *optional*, defaults to None):
412
+ Which torch.device to use for the computation
413
+ normalize_embeddings(`bool`, *optional*, defaults to True):
414
+ If set to true, returned vectors will have length 1. In that case,
415
+ the faster dot-product (util.dot_score) instead of cosine similarity
416
+ can be used
417
+ truncate_dim(`int`, *optional*, defaults to None):
418
+ The dimension to truncate image embeddings to. If set to `None`
419
+ no truncation is performed
420
+
421
+ Returns:
422
+ By default, a list of tensors is returned. If convert_to_tensor, a stacked
423
+ tensor is returned. If convert_to_numpy, a numpy matrix is returned
424
+ """
425
+
426
+ _is_training = self.training
427
+ self.eval()
428
+
429
+ self.preprocess = self.get_preprocess()
430
+ all_embeddings = []
431
+
432
+ if show_progress_bar is None:
433
+ show_progress_bar = (
434
+ logger.getEffectiveLevel() == logging.INFO
435
+ or logger.getEffectiveLevel() == logging.DEBUG
436
+ )
437
+ if convert_to_tensor:
438
+ convert_to_numpy = False
439
+
440
+ _input_was_single_img = False
441
+ if isinstance(images, str) or not hasattr(images, '__len__'):
442
+ images = [images]
443
+ _input_was_single_img = True
444
+
445
+ if device is not None:
446
+ self.to(device)
447
+
448
+ _permutation = np.argsort([-len(str(i)) for i in images])
449
+ _inverse_permutation = np.argsort(_permutation)
450
+ images = [images[idx] for idx in _permutation]
451
+
452
+ if has_tqdm:
453
+ range_iter = trange(
454
+ 0,
455
+ len(images),
456
+ batch_size,
457
+ desc='Encoding',
458
+ disable=not show_progress_bar,
459
+ )
460
+ else:
461
+ range_iter = range(0, len(images), batch_size)
462
+
463
+ truncate_dim = truncate_dim or self.config.truncate_dim
464
+
465
+ for i in range_iter:
466
+ _processed_images = []
467
+ for img in images[i: i + batch_size]:
468
+ if isinstance(img, str):
469
+ if img.startswith('http'):
470
+ response = requests.get(img)
471
+ image = Image.open(BytesIO(response.content)).convert('RGB')
472
+ elif img.startswith('data:image/'):
473
+ image = self._decode_image_data(img).convert('RGB')
474
+ else:
475
+ image = Image.open(img).convert('RGB')
476
+ elif isinstance(img, Image.Image):
477
+ image = img.convert('RGB')
478
+ else:
479
+ raise ValueError('Unsupported image format')
480
+ _processed_images.append(image)
481
+
482
+ pixelvals = self.preprocess(_processed_images)
483
+ pixelvals = pixelvals.to(self.device)
484
+ embeddings = self.get_image_features(pixelvals)
485
+
486
+ if truncate_dim:
487
+ embeddings = self._truncate_embeddings(embeddings, truncate_dim)
488
+ if normalize_embeddings:
489
+ embeddings = f.normalize(embeddings, p=2, dim=1)
490
+ if convert_to_numpy:
491
+ embeddings = embeddings.cpu()
492
+
493
+ all_embeddings.extend(embeddings)
494
+
495
+ all_embeddings = [all_embeddings[idx] for idx in _inverse_permutation]
496
+
497
+ if convert_to_tensor:
498
+ all_embeddings = torch.stack(all_embeddings)
499
+ elif convert_to_numpy:
500
+ all_embeddings = np.asarray(
501
+ [emb.to(torch.float32).numpy() for emb in all_embeddings]
502
+ )
503
+
504
+ if _input_was_single_img:
505
+ all_embeddings = all_embeddings[0]
506
+
507
+ self.train(_is_training)
508
+ return all_embeddings
509
+
510
+ @torch.inference_mode()
511
+ def encode_text(
512
+ self,
513
+ sentences: Union[str, List[str]],
514
+ task: Optional[str] = None,
515
+ batch_size: int = 32,
516
+ show_progress_bar: Optional[bool] = None,
517
+ convert_to_numpy: bool = True,
518
+ convert_to_tensor: bool = False,
519
+ device: Optional[torch.device] = None,
520
+ normalize_embeddings: bool = True,
521
+ truncate_dim: Optional[int] = None,
522
+ **tokenizer_kwargs,
523
+ ) -> Union[List[torch.Tensor], np.ndarray, torch.Tensor]:
524
+ """
525
+ Computes text embeddings
526
+
527
+ Args:
528
+ sentences(`str` or `List[str]`):
529
+ Sentence or sentences to be encoded
530
+ task(`str`, *optional*, defaults to `None`):
531
+ Specifies the task for which the encoding is intended. If a `task` is
532
+ provided, a task-specific instruction is added to the beginning of each
533
+ sentence. If `task` is not provided, no instructions are added.
534
+ batch_size(`int`, *optional*, defaults to 32):
535
+ Batch size for the computation
536
+ show_progress_bar(`bool`, *optional*, defaults to None):
537
+ Show a progress bar when encoding sentences. If set to None, progress
538
+ bar is only shown when `logger.level == logging.INFO` or
539
+ `logger.level == logging.DEBUG`
540
+ convert_to_numpy(`bool`, *optional*, defaults to True):
541
+ If true, the output is a list of numpy vectors. Else, it is a list of
542
+ pytorch tensors
543
+ convert_to_tensor(`bool`, *optional*, defaults to False):
544
+ If true, a single stacked tensor is returned, which overrides any setting
545
+ from convert_to_numpy
546
+ device(`torch.device`, *optional*, defaults to None):
547
+ Which torch.device to use for the computation
548
+ normalize_embeddings(`bool`, *optional*, defaults to True):
549
+ If set to true, returned vectors will have length 1. In that case,
550
+ the faster dot-product (util.dot_score) instead of cosine similarity
551
+ can be used
552
+ truncate_dim(`int`, *optional*, defaults to None):
553
+ The dimension to truncate sentence embeddings to. If set to `None`
554
+ no truncation is performed
555
+ tokenizer_kwargs(`Dict[str, Any]`, *optional*, defaults to {}):
556
+ Keyword arguments for the tokenizer
557
+ Returns:
558
+ By default, a list of tensors is returned. If convert_to_tensor, a stacked
559
+ tensor is returned. If convert_to_numpy, a numpy matrix is returned.
560
+ """
561
+ _is_training = self.training
562
+ self.eval()
563
+
564
+ all_embeddings = []
565
+ self.tokenizer = self.get_tokenizer()
566
+
567
+ if show_progress_bar is None:
568
+ show_progress_bar = (
569
+ logger.getEffectiveLevel() == logging.INFO
570
+ or logger.getEffectiveLevel() == logging.DEBUG
571
+ )
572
+ if convert_to_tensor:
573
+ convert_to_numpy = False
574
+
575
+ _input_was_string = False
576
+ if isinstance(sentences, str) or not hasattr(sentences, '__len__'):
577
+ sentences = [sentences]
578
+ _input_was_string = True
579
+
580
+ if device is not None:
581
+ self.to(device)
582
+
583
+ _permutation = np.argsort([-len(i) for i in sentences])
584
+ _inverse_permutation = np.argsort(_permutation)
585
+ sentences = [sentences[idx] for idx in _permutation]
586
+
587
+ tokenizer_kwargs['padding'] = tokenizer_kwargs.get('padding', True)
588
+ tokenizer_kwargs['max_length'] = tokenizer_kwargs.get('max_length', 512)
589
+ tokenizer_kwargs['truncation'] = tokenizer_kwargs.get('truncation', True)
590
+
591
+ if has_tqdm:
592
+ range_iter = trange(
593
+ 0,
594
+ len(sentences),
595
+ batch_size,
596
+ desc='Encoding',
597
+ disable=not show_progress_bar,
598
+ )
599
+ else:
600
+ range_iter = range(0, len(sentences), batch_size)
601
+
602
+ truncate_dim = truncate_dim or self.config.truncate_dim
603
+
604
+ instruction = self.text_model.get_instruction_from_task(task)
605
+ if instruction:
606
+ sentences = [instruction + sentence for sentence in sentences]
607
+
608
+ for i in range_iter:
609
+ tokens = self.tokenizer(
610
+ sentences[i: i + batch_size],
611
+ return_tensors='pt',
612
+ **tokenizer_kwargs,
613
+ ).to(self.device)
614
+ embeddings = self.get_text_features(input_ids=tokens)
615
+ if truncate_dim:
616
+ embeddings = self._truncate_embeddings(embeddings, truncate_dim)
617
+ if normalize_embeddings:
618
+ embeddings = f.normalize(embeddings, p=2, dim=1)
619
+ if convert_to_numpy:
620
+ embeddings = embeddings.cpu()
621
+ all_embeddings.extend(embeddings)
622
+
623
+ all_embeddings = [all_embeddings[idx] for idx in _inverse_permutation]
624
+
625
+ if convert_to_tensor:
626
+ all_embeddings = torch.stack(all_embeddings)
627
+ elif convert_to_numpy:
628
+ all_embeddings = np.asarray(
629
+ [emb.to(torch.float32).numpy() for emb in all_embeddings]
630
+ )
631
+ if _input_was_string:
632
+ all_embeddings = all_embeddings[0]
633
+
634
+ self.train(_is_training)
635
+ return all_embeddings
636
+
637
+ def forward(
638
+ self,
639
+ input_ids: Union[None, torch.Tensor, BatchEncoding] = None,
640
+ pixel_values: Union[None, torch.FloatTensor, BatchFeature] = None,
641
+ return_dict: Optional[bool] = None,
642
+ return_loss: Optional[bool] = None,
643
+ *_,
644
+ **__,
645
+ ) -> Union[Tuple[Optional[torch.FloatTensor], ...], CLIPOutput]:
646
+ return_dict = (
647
+ return_dict if return_dict is not None else self.config.use_return_dict
648
+ )
649
+ image_embeds = self.get_image_features(pixel_values=pixel_values)
650
+ text_embeds = self.get_text_features(input_ids=input_ids)
651
+
652
+ # normalized features
653
+ image_embeds = image_embeds / image_embeds.norm(p=2, dim=-1, keepdim=True)
654
+ text_embeds = text_embeds / text_embeds.norm(p=2, dim=-1, keepdim=True)
655
+
656
+ # cosine similarity as logits
657
+ logit_scale = self.logit_scale.exp()
658
+ logits_per_text = torch.matmul(text_embeds, image_embeds.t()) * logit_scale
659
+ logits_per_image = logits_per_text.t()
660
+
661
+ loss = None
662
+ if return_loss:
663
+ loss = clip_loss(logits_per_text)
664
+
665
+ if not return_dict:
666
+ output = (
667
+ logits_per_image,
668
+ logits_per_text,
669
+ text_embeds,
670
+ image_embeds,
671
+ None,
672
+ None,
673
+ )
674
+ return ((loss,) + output) if loss is not None else output
675
+
676
+ return CLIPOutput(
677
+ loss=loss,
678
+ logits_per_image=logits_per_image,
679
+ logits_per_text=logits_per_text,
680
+ text_embeds=text_embeds,
681
+ image_embeds=image_embeds,
682
+ text_model_output=None,
683
+ vision_model_output=None,
684
+ )
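The `encode_text` / `encode_image` helpers above wrap tokenization, preprocessing, batching, optional Matryoshka truncation, and L2 normalization. A minimal usage sketch, assuming this Jina-CLIP-style modeling code can be loaded standalone through `AutoModel` with `trust_remote_code=True`; the checkpoint path and image file below are placeholders, not values from this diff:

```python
import torch
from transformers import AutoModel

# Placeholder path: point this at the folder that ships the modeling code above.
model = AutoModel.from_pretrained("path/to/text_encoder_2", trust_remote_code=True).eval()

with torch.inference_mode():
    # Both helpers L2-normalize by default, so cosine similarity reduces to a dot product.
    text_emb = model.encode_text(["a photo of a cat"], convert_to_tensor=True)
    image_emb = model.encode_image(["cat.jpg"], convert_to_tensor=True)

print((text_emb @ image_emb.T).item())
```

The `forward` pass computes the same similarities scaled by the learned `logit_scale`, which is what `clip_loss` consumes during training.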
text_encoder_2/rope_embeddings.py ADDED
@@ -0,0 +1,160 @@
1
+ # --------------------------------------------------------
2
+ # Adapted from EVA CLIP
3
+ # https://github.com/baaivision/EVA/tree/master/EVA-CLIP/rei/eva_clip
4
+ # --------------------------------------------------------
5
+
6
+ from math import pi
7
+
8
+ import torch
9
+ from einops import rearrange, repeat
10
+ from torch import nn
11
+
12
+
13
+ def broadcast(tensors, dim=-1):
14
+ num_tensors = len(tensors)
15
+ shape_lens = set(list(map(lambda t: len(t.shape), tensors)))
16
+ assert len(shape_lens) == 1, 'tensors must all have the same number of dimensions'
17
+ shape_len = list(shape_lens)[0]
18
+ dim = (dim + shape_len) if dim < 0 else dim
19
+ dims = list(zip(*map(lambda t: list(t.shape), tensors)))
20
+ expandable_dims = [(i, val) for i, val in enumerate(dims) if i != dim]
21
+ assert all(
22
+ [*map(lambda t: len(set(t[1])) <= 2, expandable_dims)]
23
+ ), 'invalid dimensions for broadcastable concatenation'
24
+ max_dims = list(map(lambda t: (t[0], max(t[1])), expandable_dims))
25
+ expanded_dims = list(map(lambda t: (t[0], (t[1],) * num_tensors), max_dims))
26
+ expanded_dims.insert(dim, (dim, dims[dim]))
27
+ expandable_shapes = list(zip(*map(lambda t: t[1], expanded_dims)))
28
+ tensors = list(map(lambda t: t[0].expand(*t[1]), zip(tensors, expandable_shapes)))
29
+ return torch.cat(tensors, dim=dim)
30
+
31
+
32
+ def rotate_half(x):
33
+ x = rearrange(x, '... (d r) -> ... d r', r=2)
34
+ x1, x2 = x.unbind(dim=-1)
35
+ x = torch.stack((-x2, x1), dim=-1)
36
+ return rearrange(x, '... d r -> ... (d r)')
37
+
38
+
39
+ class VisionRotaryEmbedding(nn.Module):
40
+ def __init__(
41
+ self,
42
+ dim,
43
+ pt_seq_len,
44
+ ft_seq_len=None,
45
+ custom_freqs=None,
46
+ freqs_for='lang',
47
+ theta=10000,
48
+ max_freq=10,
49
+ num_freqs=1,
50
+ ):
51
+ super().__init__()
52
+ if custom_freqs:
53
+ freqs = custom_freqs
54
+ elif freqs_for == 'lang':
55
+ freqs = 1.0 / (
56
+ theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim)
57
+ )
58
+ elif freqs_for == 'pixel':
59
+ freqs = torch.linspace(1.0, max_freq / 2, dim // 2) * pi
60
+ elif freqs_for == 'constant':
61
+ freqs = torch.ones(num_freqs).float()
62
+ else:
63
+ raise ValueError(f'unknown modality {freqs_for}')
64
+
65
+ if ft_seq_len is None:
66
+ ft_seq_len = pt_seq_len
67
+ t = torch.arange(ft_seq_len) / ft_seq_len * pt_seq_len
68
+
69
+ freqs_h = torch.einsum('..., f -> ... f', t, freqs)
70
+ freqs_h = repeat(freqs_h, '... n -> ... (n r)', r=2)
71
+
72
+ freqs_w = torch.einsum('..., f -> ... f', t, freqs)
73
+ freqs_w = repeat(freqs_w, '... n -> ... (n r)', r=2)
74
+
75
+ freqs = broadcast((freqs_h[:, None, :], freqs_w[None, :, :]), dim=-1)
76
+
77
+ self.register_buffer('freqs_cos', freqs.cos(), persistent=False)
78
+ self.register_buffer('freqs_sin', freqs.sin(), persistent=False)
79
+
80
+ def forward(self, t, start_index=0):
81
+ rot_dim = self.freqs_cos.shape[-1]
82
+ end_index = start_index + rot_dim
83
+ assert rot_dim <= t.shape[-1], (
84
+ f'feature dimension {t.shape[-1]} is not of sufficient size to rotate in '
85
+ f'all the positions {rot_dim}'
86
+ )
87
+ t_left, t, t_right = (
88
+ t[..., :start_index],
89
+ t[..., start_index:end_index],
90
+ t[..., end_index:],
91
+ )
92
+ t = (t * self.freqs_cos) + (rotate_half(t) * self.freqs_sin)
93
+
94
+ return torch.cat((t_left, t, t_right), dim=-1)
95
+
96
+
97
+ class VisionRotaryEmbeddingFast(nn.Module):
98
+ def __init__(
99
+ self,
100
+ dim,
101
+ pt_seq_len,
102
+ ft_seq_len=None,
103
+ custom_freqs=None,
104
+ freqs_for='lang',
105
+ theta=10000,
106
+ max_freq=10,
107
+ num_freqs=1,
108
+ patch_dropout=0.0,
109
+ ):
110
+ super().__init__()
111
+ if custom_freqs:
112
+ freqs = custom_freqs
113
+ elif freqs_for == 'lang':
114
+ freqs = 1.0 / (
115
+ theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim)
116
+ )
117
+ elif freqs_for == 'pixel':
118
+ freqs = torch.linspace(1.0, max_freq / 2, dim // 2) * pi
119
+ elif freqs_for == 'constant':
120
+ freqs = torch.ones(num_freqs).float()
121
+ else:
122
+ raise ValueError(f'unknown modality {freqs_for}')
123
+
124
+ if ft_seq_len is None:
125
+ ft_seq_len = pt_seq_len
126
+ t = torch.arange(ft_seq_len) / ft_seq_len * pt_seq_len
127
+
128
+ freqs = torch.einsum('..., f -> ... f', t, freqs)
129
+ freqs = repeat(freqs, '... n -> ... (n r)', r=2)
130
+ freqs = broadcast((freqs[:, None, :], freqs[None, :, :]), dim=-1)
131
+
132
+ freqs_cos = freqs.cos().view(-1, freqs.shape[-1])
133
+ freqs_sin = freqs.sin().view(-1, freqs.shape[-1])
134
+
135
+ self.patch_dropout = patch_dropout
136
+
137
+ self.register_buffer('freqs_cos', freqs_cos, persistent=False)
138
+ self.register_buffer('freqs_sin', freqs_sin, persistent=False)
139
+
140
+ def forward(self, t, patch_indices_keep=None):
141
+ if patch_indices_keep is not None:
142
+ batch = t.size()[0]
143
+ batch_indices = torch.arange(batch)
144
+ batch_indices = batch_indices[..., None]
145
+
146
+ freqs_cos = repeat(
147
+ self.freqs_cos, 'i j -> n i m j', n=t.shape[0], m=t.shape[1]
148
+ )
149
+ freqs_sin = repeat(
150
+ self.freqs_sin, 'i j -> n i m j', n=t.shape[0], m=t.shape[1]
151
+ )
152
+
153
+ freqs_cos = freqs_cos[batch_indices, patch_indices_keep]
154
+ freqs_cos = rearrange(freqs_cos, 'n i m j -> n m i j')
155
+ freqs_sin = freqs_sin[batch_indices, patch_indices_keep]
156
+ freqs_sin = rearrange(freqs_sin, 'n i m j -> n m i j')
157
+
158
+ return t * freqs_cos + rotate_half(t) * freqs_sin
159
+
160
+ return t * self.freqs_cos + rotate_half(t) * self.freqs_sin
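Both classes implement 2D rotary position embeddings for vision patches; the `Fast` variant precomputes flattened `cos`/`sin` tables of shape `(pt_seq_len², 2·dim)` and applies the usual `t·cos + rotate_half(t)·sin` rotation. A small shape-check sketch, assuming the classes above are in scope; the dimensions are arbitrary illustration values, not taken from this checkpoint:

```python
import torch

# dim=16 gives a rotary width of 32 per token; pt_seq_len=8 gives 8*8 = 64 patch positions.
rope = VisionRotaryEmbeddingFast(dim=16, pt_seq_len=8)  # freqs_cos / freqs_sin: (64, 32)
tokens = torch.randn(2, 4, 64, 32)                      # (batch, heads, patches, rot_dim)
rotated = rope(tokens)                                  # tables broadcast over batch and heads
print(rotated.shape)                                    # torch.Size([2, 4, 64, 32])
```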
text_encoder_2/transform.py ADDED
@@ -0,0 +1,374 @@
1
+ import random
2
+ import warnings
3
+ from dataclasses import asdict, dataclass
4
+ from typing import Any, Dict, List, Optional, Sequence, Tuple, Union
5
+
6
+ import torch
7
+ import torchvision.transforms.functional as f
8
+ from torchvision.transforms import (
9
+ CenterCrop,
10
+ ColorJitter,
11
+ Compose,
12
+ Grayscale,
13
+ InterpolationMode,
14
+ Normalize,
15
+ RandomResizedCrop,
16
+ Resize,
17
+ ToTensor,
18
+ )
19
+ from transformers.image_utils import OPENAI_CLIP_MEAN, OPENAI_CLIP_STD
20
+
21
+ OPENAI_DATASET_MEAN = tuple(OPENAI_CLIP_MEAN)
22
+ OPENAI_DATASET_STD = tuple(OPENAI_CLIP_STD)
23
+
24
+
25
+ def _setup_size(size, error_msg):
26
+ if isinstance(size, int):
27
+ return size, size
28
+ if isinstance(size, Sequence) and len(size) == 1:
29
+ return size[0], size[0]
30
+ if len(size) != 2:
31
+ raise ValueError(error_msg)
32
+ return size
33
+
34
+
35
+ def _center_crop_or_pad(
36
+ img: torch.Tensor,
37
+ output_size: Union[int, Tuple[int, ...], List[int]],
38
+ fill: Union[int, Tuple[int]] = 0,
39
+ ) -> torch.Tensor:
40
+ """
41
+ Center crops and/or pads the given image. If the image is torch Tensor, it is
42
+ expected to have [..., H, W] shape, where ... means an arbitrary number of leading
43
+ dimensions. If image size is smaller than output size along any edge, image is
44
+ padded with 0 and then center cropped.
45
+ """
46
+ if isinstance(output_size, int):
47
+ output_size = (output_size, output_size)
48
+ elif isinstance(output_size, (tuple, list)) and len(output_size) == 1:
49
+ output_size = (output_size[0], output_size[0])
50
+
51
+ _, image_height, image_width = f.get_dimensions(img)
52
+ crop_height, crop_width = output_size
53
+
54
+ if crop_width > image_width or crop_height > image_height:
55
+ padding_ltrb = [
56
+ (crop_width - image_width) // 2 if crop_width > image_width else 0,
57
+ (crop_height - image_height) // 2 if crop_height > image_height else 0,
58
+ (crop_width - image_width + 1) // 2 if crop_width > image_width else 0,
59
+ (crop_height - image_height + 1) // 2 if crop_height > image_height else 0,
60
+ ]
61
+ img = f.pad(img, padding_ltrb, fill=fill)
62
+ _, image_height, image_width = f.get_dimensions(img)
63
+ if crop_width == image_width and crop_height == image_height:
64
+ return img
65
+
66
+ crop_top = int(round((image_height - crop_height) / 2.0))
67
+ crop_left = int(round((image_width - crop_width) / 2.0))
68
+ return f.crop(img, crop_top, crop_left, crop_height, crop_width)
69
+
70
+
71
+ class _CenterCropOrPad(torch.nn.Module):
72
+ """Crops the given image at the center.
73
+ If the image is torch Tensor, it is expected
74
+ to have [..., H, W] shape, where ... means an arbitrary number of leading
75
+ dimensions. If image size is smaller than output size along any edge, image is
76
+ padded with 0 and then center cropped.
77
+
78
+ Args:
79
+ size (sequence or int): Desired output size of the crop. If size is an
80
+ int instead of sequence like (h, w), a square crop (size, size) is
81
+ made. If provided a sequence of length 1, it will be interpreted as
82
+ (size[0], size[0]).
83
+ """
84
+
85
+ def __init__(self, size, fill=0):
86
+ super().__init__()
87
+ self.size = _setup_size(
88
+ size, error_msg='Please provide only two dimensions (h, w) for size.'
89
+ )
90
+ self.fill = fill
91
+
92
+ def forward(self, img):
93
+ """
94
+ Args:
95
+ img (PIL Image or Tensor): Image to be cropped.
96
+
97
+ Returns:
98
+ PIL Image or Tensor: Cropped image.
99
+ """
100
+ return _center_crop_or_pad(img, self.size, fill=self.fill)
101
+
102
+ def __repr__(self) -> str:
103
+ return f'{self.__class__.__name__}(size={self.size})'
104
+
105
+
106
+ def _convert_to_rgb(image):
107
+ return image.convert('RGB')
108
+
109
+
110
+ class _ResizeKeepRatio:
111
+ """Resize while keeping ratio. Copied from timm"""
112
+
113
+ def __init__(
114
+ self,
115
+ size,
116
+ longest=0.0,
117
+ interpolation=InterpolationMode.BICUBIC,
118
+ random_scale_prob=0.0,
119
+ random_scale_range=(0.85, 1.05),
120
+ random_aspect_prob=0.0,
121
+ random_aspect_range=(0.9, 1.11),
122
+ ):
123
+ if isinstance(size, (list, tuple)):
124
+ self.size = tuple(size)
125
+ else:
126
+ self.size = (size, size)
127
+ self.interpolation = interpolation
128
+ self.longest = float(longest) # [0, 1] where 0 == shortest edge, 1 == longest
129
+ self.random_scale_prob = random_scale_prob
130
+ self.random_scale_range = random_scale_range
131
+ self.random_aspect_prob = random_aspect_prob
132
+ self.random_aspect_range = random_aspect_range
133
+
134
+ @staticmethod
135
+ def get_params(
136
+ img,
137
+ target_size,
138
+ longest,
139
+ random_scale_prob=0.0,
140
+ random_scale_range=(0.85, 1.05),
141
+ random_aspect_prob=0.0,
142
+ random_aspect_range=(0.9, 1.11),
143
+ ):
144
+ """Get parameters"""
145
+ source_size = img.size[::-1] # h, w
146
+ h, w = source_size
147
+ target_h, target_w = target_size
148
+ ratio_h = h / target_h
149
+ ratio_w = w / target_w
150
+ ratio = max(ratio_h, ratio_w) * longest + min(ratio_h, ratio_w) * (
151
+ 1.0 - longest
152
+ )
153
+ if random_scale_prob > 0 and random.random() < random_scale_prob:
154
+ ratio_factor = random.uniform(random_scale_range[0], random_scale_range[1])
155
+ ratio_factor = (ratio_factor, ratio_factor)
156
+ else:
157
+ ratio_factor = (1.0, 1.0)
158
+ if random_aspect_prob > 0 and random.random() < random_aspect_prob:
159
+ aspect_factor = random.uniform(
160
+ random_aspect_range[0], random_aspect_range[1]
161
+ )
162
+ ratio_factor = (
163
+ ratio_factor[0] / aspect_factor,
164
+ ratio_factor[1] * aspect_factor,
165
+ )
166
+ return [
167
+ round(x * factor / ratio) for x, factor in zip(source_size, ratio_factor)
168
+ ]
169
+
170
+ def __call__(self, img):
171
+ """
172
+ Args:
173
+ img (PIL Image): Image to be cropped and resized.
174
+
175
+ Returns:
176
+ PIL Image: Resized, padded to at least target size, possibly
177
+ cropped to exactly target size
178
+ """
179
+ size = self.get_params(
180
+ img,
181
+ self.size,
182
+ self.longest,
183
+ self.random_scale_prob,
184
+ self.random_scale_range,
185
+ self.random_aspect_prob,
186
+ self.random_aspect_range,
187
+ )
188
+ img = f.resize(img, size, self.interpolation)
189
+ return img
190
+
191
+ def __repr__(self):
192
+ format_string = self.__class__.__name__ + '(size={0}'.format(self.size)
193
+ format_string += f', interpolation={self.interpolation}'
194
+ format_string += f', longest={self.longest:.3f})'
195
+ return format_string
196
+
197
+
198
+ class _ColorJitter(object):
199
+ """Apply color jitter to the PIL image with a specified probability"""
200
+
201
+ def __init__(self, brightness=0.0, contrast=0.0, saturation=0.0, hue=0.0, p=0.8):
202
+ assert 0.0 <= p <= 1.0
203
+ self.p = p
204
+ self.transf = ColorJitter(
205
+ brightness=brightness, contrast=contrast, saturation=saturation, hue=hue
206
+ )
207
+
208
+ def __call__(self, img):
209
+ if random.random() < self.p:
210
+ return self.transf(img)
211
+ else:
212
+ return img
213
+
214
+
215
+ class _GrayScale(object):
216
+ """Apply gray scale to the PIL image with a specified probability"""
217
+
218
+ def __init__(self, p=0.2):
219
+ assert 0.0 <= p <= 1.0
220
+ self.p = p
221
+ self.transf = Grayscale(num_output_channels=3)
222
+
223
+ def __call__(self, img):
224
+ if random.random() < self.p:
225
+ return self.transf(img)
226
+ else:
227
+ return img
228
+
229
+
230
+ @dataclass
231
+ class AugmentationCfg:
232
+ scale: Tuple[float, float] = (0.9, 1.0)
233
+ ratio: Optional[Tuple[float, float]] = None
234
+ color_jitter: Optional[
235
+ Union[float, Tuple[float, float, float], Tuple[float, float, float, float]]
236
+ ] = None
237
+ re_prob: Optional[float] = None
238
+ re_count: Optional[int] = None
239
+ use_timm: bool = False
240
+ color_jitter_prob: Optional[float] = None
241
+ gray_scale_prob: Optional[float] = None
242
+
243
+
244
+ def image_transform(
245
+ image_size: Union[int, Tuple[int, int]],
246
+ is_train: bool,
247
+ mean: Optional[Tuple[float, ...]] = None,
248
+ std: Optional[Tuple[float, ...]] = None,
249
+ resize_mode: Optional[str] = None,
250
+ interpolation: Optional[str] = None,
251
+ fill_color: int = 0,
252
+ aug_cfg: Optional[Union[Dict[str, Any], AugmentationCfg]] = None,
253
+ ):
254
+ mean = mean or OPENAI_DATASET_MEAN
255
+ if not isinstance(mean, (list, tuple)):
256
+ mean = (mean,) * 3
257
+
258
+ std = std or OPENAI_DATASET_STD
259
+ if not isinstance(std, (list, tuple)):
260
+ std = (std,) * 3
261
+
262
+ interpolation = interpolation or 'bicubic'
263
+ assert interpolation in ['bicubic', 'bilinear', 'random']
264
+ # NOTE random is ignored for interpolation_mode, so defaults to BICUBIC for
265
+ # inference if set
266
+ interpolation_mode = (
267
+ InterpolationMode.BILINEAR
268
+ if interpolation == 'bilinear'
269
+ else InterpolationMode.BICUBIC
270
+ )
271
+
272
+ resize_mode = resize_mode or 'shortest'
273
+ assert resize_mode in ('shortest', 'longest', 'squash')
274
+
275
+ if isinstance(aug_cfg, dict):
276
+ aug_cfg = AugmentationCfg(**aug_cfg)
277
+ else:
278
+ aug_cfg = aug_cfg or AugmentationCfg()
279
+
280
+ normalize = Normalize(mean=mean, std=std)
281
+
282
+ if is_train:
283
+ aug_cfg_dict = {k: v for k, v in asdict(aug_cfg).items() if v is not None}
284
+ use_timm = aug_cfg_dict.pop('use_timm', False)
285
+ if use_timm:
286
+ from timm.data import create_transform # timm can still be optional
287
+
288
+ if isinstance(image_size, (tuple, list)):
289
+ assert len(image_size) >= 2
290
+ input_size = (3,) + image_size[-2:]
291
+ else:
292
+ input_size = (3, image_size, image_size)
293
+
294
+ aug_cfg_dict.setdefault('color_jitter', None) # disable by default
295
+ # drop extra non-timm items
296
+ aug_cfg_dict.pop('color_jitter_prob', None)
297
+ aug_cfg_dict.pop('gray_scale_prob', None)
298
+
299
+ train_transform = create_transform(
300
+ input_size=input_size,
301
+ is_training=True,
302
+ hflip=0.0,
303
+ mean=mean,
304
+ std=std,
305
+ re_mode='pixel',
306
+ interpolation=interpolation,
307
+ **aug_cfg_dict,
308
+ )
309
+ else:
310
+ train_transform = [
311
+ RandomResizedCrop(
312
+ image_size,
313
+ scale=aug_cfg_dict.pop('scale'),
314
+ interpolation=InterpolationMode.BICUBIC,
315
+ ),
316
+ _convert_to_rgb,
317
+ ]
318
+ if aug_cfg.color_jitter_prob:
319
+ assert (
320
+ aug_cfg.color_jitter is not None and len(aug_cfg.color_jitter) == 4
321
+ )
322
+ train_transform.extend(
323
+ [_ColorJitter(*aug_cfg.color_jitter, p=aug_cfg.color_jitter_prob)]
324
+ )
325
+ if aug_cfg.gray_scale_prob:
326
+ train_transform.extend([_GrayScale(aug_cfg.gray_scale_prob)])
327
+ train_transform.extend(
328
+ [
329
+ ToTensor(),
330
+ normalize,
331
+ ]
332
+ )
333
+ train_transform = Compose(train_transform)
334
+ if aug_cfg_dict:
335
+ warnings.warn(
336
+ f'Unused augmentation cfg items, specify `use_timm` to use '
337
+ f'({list(aug_cfg_dict.keys())}).'
338
+ )
339
+ return train_transform
340
+ else:
341
+ if resize_mode == 'longest':
342
+ transforms = [
343
+ _ResizeKeepRatio(
344
+ image_size, interpolation=interpolation_mode, longest=1
345
+ ),
346
+ _CenterCropOrPad(image_size, fill=fill_color),
347
+ ]
348
+ elif resize_mode == 'squash':
349
+ if isinstance(image_size, int):
350
+ image_size = (image_size, image_size)
351
+ transforms = [
352
+ Resize(image_size, interpolation=interpolation_mode),
353
+ ]
354
+ else:
355
+ assert resize_mode == 'shortest'
356
+ if not isinstance(image_size, (tuple, list)):
357
+ image_size = (image_size, image_size)
358
+ if image_size[0] == image_size[1]:
359
+ # simple case, use torchvision built-in Resize w/ shortest edge mode
360
+ # (scalar size arg)
361
+ transforms = [Resize(image_size[0], interpolation=interpolation_mode)]
362
+ else:
363
+ # resize shortest edge to matching target dim for non-square target
364
+ transforms = [_ResizeKeepRatio(image_size)]
365
+ transforms += [CenterCrop(image_size)]
366
+
367
+ transforms.extend(
368
+ [
369
+ _convert_to_rgb,
370
+ ToTensor(),
371
+ normalize,
372
+ ]
373
+ )
374
+ return Compose(transforms)
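For inference (`is_train=False`), `image_transform` builds a deterministic resize / center-crop / normalize pipeline using the OpenAI CLIP mean and std. A short sketch, assuming the function above is importable; `image_size=512` is only an example value, the real size comes from the vision config:

```python
import torch
from PIL import Image

preprocess = image_transform(image_size=512, is_train=False, resize_mode='shortest')

img = Image.new('RGB', (640, 480))           # stand-in for a real input image
pixel_values = preprocess(img).unsqueeze(0)  # (1, 3, 512, 512), normalized to CLIP statistics
print(pixel_values.shape)
```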
tokenizer/added_tokens.json ADDED
@@ -0,0 +1,3 @@
1
+ {
2
+ "<image_soft_token>": 262144
3
+ }
tokenizer/chat_template.jinja ADDED
@@ -0,0 +1,47 @@
1
+ {{ bos_token }}
2
+ {%- if messages[0]['role'] == 'system' -%}
3
+ {%- if messages[0]['content'] is string -%}
4
+ {%- set first_user_prefix = messages[0]['content'] + '
5
+
6
+ ' -%}
7
+ {%- else -%}
8
+ {%- set first_user_prefix = messages[0]['content'][0]['text'] + '
9
+
10
+ ' -%}
11
+ {%- endif -%}
12
+ {%- set loop_messages = messages[1:] -%}
13
+ {%- else -%}
14
+ {%- set first_user_prefix = "" -%}
15
+ {%- set loop_messages = messages -%}
16
+ {%- endif -%}
17
+ {%- for message in loop_messages -%}
18
+ {%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) -%}
19
+ {{ raise_exception("Conversation roles must alternate user/assistant/user/assistant/...") }}
20
+ {%- endif -%}
21
+ {%- if (message['role'] == 'assistant') -%}
22
+ {%- set role = "model" -%}
23
+ {%- else -%}
24
+ {%- set role = message['role'] -%}
25
+ {%- endif -%}
26
+ {{ '<start_of_turn>' + role + '
27
+ ' + (first_user_prefix if loop.first else "") }}
28
+ {%- if message['content'] is string -%}
29
+ {{ message['content'] | trim }}
30
+ {%- elif message['content'] is iterable -%}
31
+ {%- for item in message['content'] -%}
32
+ {%- if item['type'] == 'image' -%}
33
+ {{ '<start_of_image>' }}
34
+ {%- elif item['type'] == 'text' -%}
35
+ {{ item['text'] | trim }}
36
+ {%- endif -%}
37
+ {%- endfor -%}
38
+ {%- else -%}
39
+ {{ raise_exception("Invalid content type") }}
40
+ {%- endif -%}
41
+ {{ '<end_of_turn>
42
+ ' }}
43
+ {%- endfor -%}
44
+ {%- if add_generation_prompt -%}
45
+ {{'<start_of_turn>model
46
+ '}}
47
+ {%- endif -%}
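This is a Gemma-style turn template: `<start_of_turn>`/`<end_of_turn>` markers, the assistant role renamed to `model`, and any system message folded into the first user turn. A rendering sketch, assuming the tokenizer folder is loaded directly with `AutoTokenizer` and that the installed `transformers` version picks up a standalone `chat_template.jinja`; the path is a placeholder:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/tokenizer")
messages = [
    {"role": "system", "content": "Describe the image prompt."},
    {"role": "user", "content": "1girl, silver hair, starry night"},
]
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
# <bos><start_of_turn>user
# Describe the image prompt.
#
# 1girl, silver hair, starry night<end_of_turn>
# <start_of_turn>model
```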
tokenizer/special_tokens_map.json ADDED
@@ -0,0 +1,33 @@
1
+ {
2
+ "boi_token": "<start_of_image>",
3
+ "bos_token": {
4
+ "content": "<bos>",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false
9
+ },
10
+ "eoi_token": "<end_of_image>",
11
+ "eos_token": {
12
+ "content": "<eos>",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false
17
+ },
18
+ "image_token": "<image_soft_token>",
19
+ "pad_token": {
20
+ "content": "<pad>",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false
25
+ },
26
+ "unk_token": {
27
+ "content": "<unk>",
28
+ "lstrip": false,
29
+ "normalized": false,
30
+ "rstrip": false,
31
+ "single_word": false
32
+ }
33
+ }
tokenizer/tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4667f2089529e8e7657cfb6d1c19910ae71ff5f28aa7ab2ff2763330affad795
3
+ size 33384568
tokenizer/tokenizer.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1299c11d7cf632ef3b4e11937501358ada021bbdf7c47638d13c0ee982f2e79c
3
+ size 4689074
tokenizer/tokenizer_config.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_2/special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
1
+ {
2
+ "bos_token": {
3
+ "content": "<s>",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "cls_token": {
10
+ "content": "<s>",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "eos_token": {
17
+ "content": "</s>",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "mask_token": {
24
+ "content": "<mask>",
25
+ "lstrip": true,
26
+ "normalized": false,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ },
30
+ "pad_token": {
31
+ "content": "<pad>",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false
36
+ },
37
+ "sep_token": {
38
+ "content": "</s>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false
43
+ },
44
+ "unk_token": {
45
+ "content": "<unk>",
46
+ "lstrip": false,
47
+ "normalized": false,
48
+ "rstrip": false,
49
+ "single_word": false
50
+ }
51
+ }
tokenizer_2/tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6601c4120779a1a3863897ba332fe3481d548e363bec2c91eba10ef8640a5e93
3
+ size 17082997
tokenizer_2/tokenizer_config.json ADDED
@@ -0,0 +1,62 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "<s>",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "1": {
12
+ "content": "<pad>",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "2": {
20
+ "content": "</s>",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "3": {
28
+ "content": "<unk>",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "250001": {
36
+ "content": "<mask>",
37
+ "lstrip": true,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ }
43
+ },
44
+ "bos_token": "<s>",
45
+ "clean_up_tokenization_spaces": true,
46
+ "cls_token": "<s>",
47
+ "eos_token": "</s>",
48
+ "extra_special_tokens": {},
49
+ "mask_token": "<mask>",
50
+ "max_length": 77,
51
+ "model_max_length": 8194,
52
+ "pad_to_multiple_of": null,
53
+ "pad_token": "<pad>",
54
+ "pad_token_type_id": 0,
55
+ "padding_side": "right",
56
+ "sep_token": "</s>",
57
+ "stride": 0,
58
+ "tokenizer_class": "XLMRobertaTokenizerFast",
59
+ "truncation_side": "right",
60
+ "truncation_strategy": "longest_first",
61
+ "unk_token": "<unk>"
62
+ }
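`tokenizer_2` is an XLM-RoBERTa fast tokenizer, presumably the text-side tokenizer for the Jina-CLIP encoder above; note the long `model_max_length` of 8194 next to the default `max_length` of 77. A loading sketch with a placeholder repo path:

```python
from transformers import AutoTokenizer

tok2 = AutoTokenizer.from_pretrained("path/to/repo", subfolder="tokenizer_2")
batch = tok2(
    ["a watercolor painting of a fox"],
    padding=True, truncation=True, max_length=512, return_tensors="pt",
)
print(batch.input_ids.shape)  # (1, sequence_length)
```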
transformer/config.json ADDED
@@ -0,0 +1,29 @@
1
+ {
2
+ "_class_name": "Lumina2Transformer2DModel",
3
+ "_diffusers_version": "0.36.0.dev0",
4
+ "axes_dim_rope": [
5
+ 32,
6
+ 32,
7
+ 32
8
+ ],
9
+ "axes_lens": [
10
+ 1024,
11
+ 512,
12
+ 512
13
+ ],
14
+ "cap_feat_dim": 2560,
15
+ "ffn_dim_multiplier": null,
16
+ "hidden_size": 2304,
17
+ "in_channels": 16,
18
+ "multiple_of": 256,
19
+ "norm_eps": 1e-05,
20
+ "num_attention_heads": 24,
21
+ "num_kv_heads": 8,
22
+ "num_layers": 36,
23
+ "num_refiner_layers": 2,
24
+ "out_channels": null,
25
+ "patch_size": 2,
26
+ "pooled_projection_dim": 1024,
27
+ "sample_size": 128,
28
+ "scaling_factor": 1.0
29
+ }
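The denoiser is a `Lumina2Transformer2DModel`: 36 layers, hidden size 2304, 24 attention heads with 8 KV heads, 16 latent input channels, and `cap_feat_dim=2560` for the caption features. A loading sketch, assuming a `diffusers` release that provides this class (the config was written by 0.36.0.dev0); the repo path is a placeholder:

```python
import torch
from diffusers import Lumina2Transformer2DModel

transformer = Lumina2Transformer2DModel.from_pretrained(
    "path/to/repo", subfolder="transformer", torch_dtype=torch.bfloat16
)
print(f"{sum(p.numel() for p in transformer.parameters()) / 1e9:.2f}B parameters")
```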
transformer/diffusion_pytorch_model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9ea0fb6818f825cb105c24e941e10ce4badbe9edd75aa8c9cfd4d5d49a8796fb
3
+ size 6973332440
vae/config.json ADDED
@@ -0,0 +1,38 @@
1
+ {
2
+ "_class_name": "AutoencoderKL",
3
+ "_diffusers_version": "0.35.2",
4
+ "_name_or_path": "black-forest-labs/FLUX.1-dev",
5
+ "act_fn": "silu",
6
+ "block_out_channels": [
7
+ 128,
8
+ 256,
9
+ 512,
10
+ 512
11
+ ],
12
+ "down_block_types": [
13
+ "DownEncoderBlock2D",
14
+ "DownEncoderBlock2D",
15
+ "DownEncoderBlock2D",
16
+ "DownEncoderBlock2D"
17
+ ],
18
+ "force_upcast": true,
19
+ "in_channels": 3,
20
+ "latent_channels": 16,
21
+ "latents_mean": null,
22
+ "latents_std": null,
23
+ "layers_per_block": 2,
24
+ "mid_block_add_attention": true,
25
+ "norm_num_groups": 32,
26
+ "out_channels": 3,
27
+ "sample_size": 1024,
28
+ "scaling_factor": 0.3611,
29
+ "shift_factor": 0.1159,
30
+ "up_block_types": [
31
+ "UpDecoderBlock2D",
32
+ "UpDecoderBlock2D",
33
+ "UpDecoderBlock2D",
34
+ "UpDecoderBlock2D"
35
+ ],
36
+ "use_post_quant_conv": false,
37
+ "use_quant_conv": false
38
+ }
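The autoencoder is the 16-latent-channel FLUX.1-dev VAE (note `_name_or_path`), so both `shift_factor` and `scaling_factor` are non-trivial. A sketch of the convention usually used with this VAE when encoding images to latents; the repo path is a placeholder and the random tensor stands in for a preprocessed image in `[-1, 1]`:

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("path/to/repo", subfolder="vae")

image = torch.randn(1, 3, 1024, 1024)  # stand-in for a batch of preprocessed images
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()
    latents = (latents - vae.config.shift_factor) * vae.config.scaling_factor
print(latents.shape)  # torch.Size([1, 16, 128, 128]) -- 8x spatial downsampling
```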
vae/diffusion_pytorch_model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f5b59a26851551b67ae1fe58d32e76486e1e812def4696a4bea97f16604d40a3
3
+ size 167666902