ikarius committed · Commit bc723e9 · verified · 1 Parent(s): 6265e79

Update README.md

Files changed (1): README.md (+119, −3)
README.md CHANGED (previous content: only the YAML front matter declaring `license: apache-2.0`)

---
tags:
- transformer
- qwen
- qwen2
- qwen2.5
- coder
- code-generation
- quantization
- bitsandbytes
- nf4
- 4bit
- large-language-model
- llm
- abliterated
license: apache-2.0
---

# 🤖 Qwen2.5-32B-Coder-NF4-Quantized

This is a **4-bit NF4 quantized** version of [huihui-ai/Qwen2.5-Coder-32B-Instruct-abliterated](https://huggingface.co/huihui-ai/Qwen2.5-Coder-32B-Instruct-abliterated).

The model was quantized to enable **efficient inference** on hardware with limited VRAM while preserving most of the original model's performance.

---

## ⚙️ Model Specifications and Quantization

This model was quantized with the **`bitsandbytes`** library using the NF4 (Normal Float 4-bit) format; `bitsandbytes` is also required to load it.

### Model Configuration (from `config.json`):

| Parameter | Value | Description |
| :--- | :--- | :--- |
| **Architecture** | `Qwen2ForCausalLM` | The model's base architecture. |
| **Parameter Count** | 32 billion (original) | The number of parameters in the original model. |
| **Number of Layers** | `64` | The number of transformer blocks. |
| **Hidden Size** | `5120` | The dimension of the hidden states. |
| **Context Length** | `32768` | The maximum context length the model can process. |
| **Dtype (Activations)** | **`bfloat16`** | The data type used for activations during inference (recommended for stability). |

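The values above can be inspected directly from the checkpoint's `config.json` without downloading any weights. A minimal sketch, assuming the repository id used in the Usage section below (attribute names are the standard `transformers` config fields; the expected values in the comments are those from the table above):

```python
from transformers import AutoConfig

# Fetches only config.json, not the model weights
config = AutoConfig.from_pretrained("ikarius/Qwen2.5-Coder-32B-Instruct-Abliterated-NF4")

print(config.architectures)            # ['Qwen2ForCausalLM']
print(config.num_hidden_layers)        # 64
print(config.hidden_size)              # 5120
print(config.max_position_embeddings)  # 32768
print(config.torch_dtype)              # torch.bfloat16 (activation dtype)
print(config.quantization_config)      # bitsandbytes NF4 settings, see the next table
```
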
### Quantization Details (`quantization_config`):

| Parameter | Value | Description |
| :--- | :--- | :--- |
| **Method** | `bitsandbytes` | The quantization library used. |
| **Load In 4-bit** | `true` | The model weights are loaded in 4-bit precision. |
| **Quantization Type** | `nf4` | **Normal Float 4-bit**, optimized for transformer weights. |
| **Compute Dtype** | `bfloat16` | The dtype the weights are dequantized to for *computation* (matrix multiplication). |
| **Double Quantization** | `true` | Applies an extra 8-bit quantization to the scaling constants, further reducing memory usage. |

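These settings correspond to a standard `BitsAndBytesConfig` from `transformers`. As an illustrative sketch only (you do not need this to load this repository, since the configuration is already stored in the checkpoint), an equivalent on-the-fly quantization of the original BF16 model would look roughly like this:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# bitsandbytes settings equivalent to those stored in this repo's config.json
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # Load In 4-bit
    bnb_4bit_quant_type="nf4",              # Quantization Type
    bnb_4bit_compute_dtype=torch.bfloat16,  # Compute Dtype
    bnb_4bit_use_double_quant=True,         # Double Quantization
)

# On-the-fly quantization of the original BF16 checkpoint (downloads ~65 GB of weights)
model = AutoModelForCausalLM.from_pretrained(
    "huihui-ai/Qwen2.5-Coder-32B-Instruct-abliterated",
    quantization_config=bnb_config,
    device_map="auto",
)
```
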
---

## 💻 Usage (Inference)

To use this quantized model, make sure **`accelerate`** and **`bitsandbytes`** are installed. You can then load it directly with the **`AutoModelForCausalLM`** class from the Hugging Face `transformers` library.

### Required Libraries

```bash
pip install transformers accelerate bitsandbytes torch
```

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ikarius/Qwen2.5-Coder-32B-Instruct-Abliterated-NF4"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the model in 4-bit using the quantization config saved with the checkpoint
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16
)

# 📝 Input prompt
prompt = "def quicksort(arr):"
messages = [
    {"role": "user", "content": f"Write a Python function for quicksort.\n\n{prompt}"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generation
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    pad_token_id=tokenizer.eos_token_id  # ensures correct padding/EOS handling
)

# Decode and print the result
generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(generated_text)
```
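
Continuing from the snippet above, you can check how much memory the quantized weights actually occupy once the model is loaded; `get_memory_footprint()` is a standard helper on `transformers` models:

```python
# Memory taken by the model's parameters and buffers, reported in bytes
footprint_gib = model.get_memory_footprint() / 1024**3
print(f"Quantized model footprint: ~{footprint_gib:.1f} GiB")
```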

## ⚠️ Disclaimer and Limitations

* **Abliterated Model Status:** This model is based on the "abliterated" variant (`*-abliterated`), meaning certain behaviors or capabilities were deliberately modified or removed from the base model. The quantized version inherits these characteristics, so performance in some domains may differ from the non-abliterated base model.
* **Memory Requirements:** Although the model is 4-bit quantized, it is still a 32B-parameter model and requires a GPU with significant VRAM (typically ~18 GB or more, depending on context length); a rough estimate is sketched below.
* **Accuracy:** Quantization to 4-bit (NF4) introduces a small loss of precision, which may slightly affect performance compared to the original FP16/BF16 model.

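The ~18 GB figure can be sanity-checked with a rough back-of-the-envelope estimate (a sketch, not a measurement; the real requirement also depends on context length and batch size):

```python
# Rough VRAM estimate for the NF4 weights alone (not a measurement)
params = 32e9          # ~32 billion parameters
bytes_per_param = 0.5  # 4-bit NF4 ≈ half a byte per weight
weights_gib = params * bytes_per_param / 1024**3
print(f"~{weights_gib:.1f} GiB for the quantized weights")  # ≈ 14.9 GiB

# Quantization constants, the KV cache and activations add several more GiB,
# which is why ~18 GB of VRAM or more is needed in practice.
```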

...

## 🔗 Sources and Acknowledgements

* Original Model: [huihui-ai/Qwen2.5-Coder-32B-Instruct-abliterated](https://huggingface.co/huihui-ai/Qwen2.5-Coder-32B-Instruct-abliterated)
* Quantization Technology: [Bitsandbytes Library](https://github.com/bitsandbytes-foundation/bitsandbytes)

...