ikarius committed · Commit bc723e9 · verified · 1 Parent(s): 6265e79

Update README.md

Files changed (1): README.md (+119, −3)
README.md CHANGED (previous content: only the YAML front matter declaring `license: apache-2.0`)

---
tags:
- transformer
- qwen
- qwen2
- qwen2.5
- coder
- code-generation
- quantization
- bitsandbytes
- nf4
- 4bit
- large-language-model
- llm
- abliterated
license: apache-2.0
---

# 🤖 Qwen2.5-32B-Coder-NF4-Quantized

This is a **4-bit NF4 quantized** version of [huihui-ai/Qwen2.5-Coder-32B-Instruct-abliterated](https://huggingface.co/huihui-ai/Qwen2.5-Coder-32B-Instruct-abliterated).

The model was quantized to enable **efficient inference** on hardware with limited VRAM while preserving most of the original model's performance.

---

## ⚙️ Model Specifications and Quantization

This model was quantized with the **`bitsandbytes`** library using the NF4 (Normal Float 4-bit) format; `bitsandbytes` is also required to load it.

### Model Configuration (from `config.json`):

| Parameter | Value | Description |
| :--- | :--- | :--- |
| **Architecture** | `Qwen2ForCausalLM` | The model's base architecture. |
| **Parameter Count** | 32 billion (original) | The number of parameters in the original model. |
| **Number of Layers** | `64` | The number of transformer blocks. |
| **Hidden Size** | `5120` | The dimension of the hidden states. |
| **Context Length** | `32768` | The maximum context length the model can process. |
| **Dtype (Activations)** | **`bfloat16`** | The data type used for activations during inference (recommended for stability). |

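The values above can be inspected directly from the checkpoint's `config.json` without downloading any weights. A minimal sketch, assuming the repository id used in the Usage section below (attribute names are the standard `transformers` config fields; the expected values in the comments are those from the table above):

```python
from transformers import AutoConfig

# Fetches only config.json, not the model weights
config = AutoConfig.from_pretrained("ikarius/Qwen2.5-Coder-32B-Instruct-Abliterated-NF4")

print(config.architectures)            # ['Qwen2ForCausalLM']
print(config.num_hidden_layers)        # 64
print(config.hidden_size)              # 5120
print(config.max_position_embeddings)  # 32768
print(config.torch_dtype)              # torch.bfloat16 (activation dtype)
print(config.quantization_config)      # bitsandbytes NF4 settings, see the next table
```
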
### Quantization Details (`quantization_config`):

| Parameter | Value | Description |
| :--- | :--- | :--- |
| **Method** | `bitsandbytes` | The quantization library used. |
| **Load In 4-bit** | `true` | The model weights are loaded in 4-bit precision. |
| **Quantization Type** | `nf4` | **Normal Float 4-bit**, optimized for transformer weights. |
| **Compute Dtype** | `bfloat16` | The dtype the weights are dequantized to for *computation* (matrix multiplication). |
| **Double Quantization** | `true` | Applies an extra 8-bit quantization to the scaling constants, further reducing memory usage. |

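These settings correspond to a standard `BitsAndBytesConfig` from `transformers`. As an illustrative sketch only (you do not need this to load this repository, since the configuration is already stored in the checkpoint), an equivalent on-the-fly quantization of the original BF16 model would look roughly like this:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# bitsandbytes settings equivalent to those stored in this repo's config.json
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # Load In 4-bit
    bnb_4bit_quant_type="nf4",              # Quantization Type
    bnb_4bit_compute_dtype=torch.bfloat16,  # Compute Dtype
    bnb_4bit_use_double_quant=True,         # Double Quantization
)

# On-the-fly quantization of the original BF16 checkpoint (downloads ~65 GB of weights)
model = AutoModelForCausalLM.from_pretrained(
    "huihui-ai/Qwen2.5-Coder-32B-Instruct-abliterated",
    quantization_config=bnb_config,
    device_map="auto",
)
```
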
---

## 💻 Usage (Inference)

To use this quantized model, make sure **`accelerate`** and **`bitsandbytes`** are installed. You can then load it directly with the **`AutoModelForCausalLM`** class from the Hugging Face `transformers` library.

### Required Libraries

```bash
pip install transformers accelerate bitsandbytes torch
```

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ikarius/Qwen2.5-Coder-32B-Instruct-Abliterated-NF4"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the model in 4-bit using the quantization config saved with the checkpoint
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16
)

# 📝 Input prompt
prompt = "def quicksort(arr):"
messages = [
    {"role": "user", "content": f"Write a Python function for quicksort.\n\n{prompt}"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generation
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    pad_token_id=tokenizer.eos_token_id  # ensures correct padding/EOS handling
)

# Decode and print the result
generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(generated_text)
```
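
Continuing from the snippet above, you can check how much memory the quantized weights actually occupy once the model is loaded; `get_memory_footprint()` is a standard helper on `transformers` models:

```python
# Memory taken by the model's parameters and buffers, reported in bytes
footprint_gib = model.get_memory_footprint() / 1024**3
print(f"Quantized model footprint: ~{footprint_gib:.1f} GiB")
```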

## ⚠️ Disclaimer and Limitations

* **Abliterated Model Status:** This model is based on the "abliterated" variant (`*-abliterated`), meaning certain behaviors or capabilities were deliberately modified or removed from the base model. The quantized version inherits these characteristics, so performance in some domains may differ from the non-abliterated base model.
* **Memory Requirements:** Although the model is 4-bit quantized, it is still a 32B-parameter model and requires a GPU with significant VRAM (typically ~18 GB or more, depending on context length); a rough estimate is sketched below.
* **Accuracy:** Quantization to 4-bit (NF4) introduces a small loss of precision, which may slightly affect performance compared to the original FP16/BF16 model.

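The ~18 GB figure can be sanity-checked with a rough back-of-the-envelope estimate (a sketch, not a measurement; the real requirement also depends on context length and batch size):

```python
# Rough VRAM estimate for the NF4 weights alone (not a measurement)
params = 32e9          # ~32 billion parameters
bytes_per_param = 0.5  # 4-bit NF4 ≈ half a byte per weight
weights_gib = params * bytes_per_param / 1024**3
print(f"~{weights_gib:.1f} GiB for the quantized weights")  # ≈ 14.9 GiB

# Quantization constants, the KV cache and activations add several more GiB,
# which is why ~18 GB of VRAM or more is needed in practice.
```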

...

## 🔗 Sources and Acknowledgements

* Original Model: [huihui-ai/Qwen2.5-Coder-32B-Instruct-abliterated](https://huggingface.co/huihui-ai/Qwen2.5-Coder-32B-Instruct-abliterated)
* Quantization Technology: [Bitsandbytes Library](https://github.com/bitsandbytes-foundation/bitsandbytes)

...