# DeepHermes-3-Mistral-24B-Preview-BNB-NF4

## Model Overview
This is a 4-bit quantized version of NousResearch/DeepHermes-3-Mistral-24B-Preview using BitsAndBytes NF4 quantization. The model has been optimized for efficient inference while maintaining high performance across various tasks.
## Key Features
- Base Model: NousResearch/DeepHermes-3-Mistral-24B-Preview
- Quantization: 4-bit NF4 (BitsAndBytes)
- Size Reduction: ~48GB → ~12-15GB (~70% reduction)
- GPU Optimized: NVIDIA L4, A10G, RTX 4090, and similar
- Memory Efficient: Runs on 16GB+ VRAM GPUs
## Quantization Details

### Technical Specifications
- Method: BitsAndBytes NF4 (Normal Float 4-bit)
- Compute Type: bfloat16
- Double Quantization: Enabled
- CPU Offloading: Supported
- Quantization Library: `bitsandbytes>=0.41.0`
### Performance Characteristics
- Memory Usage: 12-15GB VRAM
- Inference Speed: ~2-5 tokens/second (varies by hardware)
- Quality: Minimal degradation from original model
- Compatibility: Transformers library compatible
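The memory figure above can be verified on your own hardware once the model is loaded (see the Quick Start below); a minimal sketch using PyTorch's CUDA memory counters:

```python
import torch

# Assumes the model has already been loaded with device_map="auto" (see Quick Start)
for device_id in range(torch.cuda.device_count()):
    allocated_gb = torch.cuda.memory_allocated(device_id) / 1024**3
    reserved_gb = torch.cuda.memory_reserved(device_id) / 1024**3
    print(f"GPU {device_id}: {allocated_gb:.1f} GB allocated, {reserved_gb:.1f} GB reserved")
```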
## Quick Start

### Installation

```bash
pip install torch transformers accelerate bitsandbytes
```
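If installation succeeds but loading fails, it is worth confirming the versions actually in use; a quick sanity check of the installed packages:

```python
import torch
import transformers
import accelerate
import bitsandbytes

# bitsandbytes >= 0.41.0 is required for NF4 quantization
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)
print("bitsandbytes:", bitsandbytes.__version__)
print("CUDA available:", torch.cuda.is_available())
```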
### Basic Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Model configuration
model_name = "AkameLLC/DeepHermes-3-Mistral-24B-Preview-BNB-NF4"

# Quantization config (already applied, but needed for loading)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True
)

# Generate response
prompt = "Explain the concept of machine learning in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id
    )

response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```
### Chat Format

DeepHermes-3 uses the ChatML chat format, and its tokenizer ships with a chat template, so the most reliable way to build prompts is `tokenizer.apply_chat_template` rather than hand-writing special tokens:

```python
# Example usage
messages = [
    {"role": "user", "content": "What are the benefits of renewable energy?"}
]

# Builds the prompt with the model's own chat template
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)
# ... (generation code as above, passing `inputs` as the input IDs)
```
## Hardware Requirements

### Minimum Requirements
- VRAM: 16GB+ (NVIDIA RTX 4060 Ti 16GB or better)
- RAM: 16GB+ system memory
- Storage: 20GB available space
- CUDA: Compatible GPU with CUDA support
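Before downloading the weights, a quick check that the local GPU clears the VRAM bar above (a sketch using PyTorch only):

```python
import torch

if not torch.cuda.is_available():
    raise SystemExit("No CUDA-capable GPU detected; this model requires one for practical inference.")

props = torch.cuda.get_device_properties(0)
total_vram_gb = props.total_memory / 1024**3
print(f"GPU: {props.name}, VRAM: {total_vram_gb:.1f} GB")

if total_vram_gb < 16:
    print("Warning: less than 16 GB VRAM; expect heavy CPU offloading and slow generation.")
```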
### Recommended Hardware
- NVIDIA L4: 24GB VRAM (optimal)
- NVIDIA A10G: 24GB VRAM
- NVIDIA RTX 4090: 24GB VRAM
- NVIDIA RTX A6000: 48GB VRAM
### Cloud Deployment
- AWS EC2: g5.2xlarge or larger
- Google Cloud: n1-standard-4 with T4/V100
- Azure: Standard_NC6s_v3 or similar
## Performance Benchmarks

### Memory Usage Comparison
| Model Version | VRAM Usage | Model Size | Reduction |
|---|---|---|---|
| Original FP16 | ~48GB | ~48GB | - |
| BNB NF4 | ~12-15GB | ~12-15GB | 70% |
### Inference Speed (Approximate)
| Hardware | Tokens/Second | Batch Size |
|---|---|---|
| NVIDIA L4 | 3-5 | 1 |
| RTX 4090 | 4-6 | 1 |
| A10G | 3-5 | 1 |
Note: Performance varies based on prompt length and generation parameters
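To reproduce rough tokens-per-second numbers on your own hardware, a simple timing run is enough (a sketch; assumes `model` and `tokenizer` are loaded as in the Quick Start, batch size 1, greedy decoding):

```python
import time
import torch

prompt = "Write a short paragraph about solar energy."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Warm-up pass so kernel setup is not counted in the timing
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=16, do_sample=False)

start = time.perf_counter()
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[1] - inputs.input_ids.shape[1]
print(f"{new_tokens} new tokens in {elapsed:.1f}s -> {new_tokens / elapsed:.2f} tokens/s")
```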
## Use Cases

### Recommended Applications
- Conversational AI: Chat applications and assistants
- Content Generation: Writing, summarization, analysis
- Code Assistance: Programming help and code generation
- Educational Tools: Tutoring and explanation systems
- Research: NLP research and experimentation
### Not Recommended For
- Production Critical Systems: Use original model for mission-critical applications
- High-Precision Tasks: Where minimal quality loss is unacceptable
- Batch Processing: Large-scale batch inference (use original model)
## Limitations and Considerations

### Quantization Impact
- Slight Quality Degradation: ~1-3% performance reduction expected
- Numerical Precision: Reduced precision in calculations
- Edge Cases: May handle some prompts differently than original
### Technical Limitations
- Memory Requirements: Still requires significant VRAM
- Loading Time: Initial model loading takes 2-5 minutes
- Hardware Dependency: Requires CUDA-compatible GPU
### Usage Guidelines
- Temperature: Recommended range 0.1-1.0
- Max Tokens: Optimal performance with 512 tokens or less
- Batch Size: Single inference recommended for optimal memory usage
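Applied to the `generate` call from the Quick Start, these guidelines translate to something like the following (values are illustrative starting points, not tuned settings):

```python
generation_kwargs = {
    "max_new_tokens": 512,                    # stay at or below 512 for best latency
    "do_sample": True,
    "temperature": 0.7,                       # keep within the 0.1-1.0 range above
    "top_p": 0.9,
    "pad_token_id": tokenizer.eos_token_id,
}

with torch.no_grad():
    outputs = model.generate(**inputs, **generation_kwargs)  # single-sequence batch
```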
## License and Legal

### Model License
This quantized model inherits the license from the original DeepHermes-3-Mistral-24B-Preview model. Please refer to the original model page for detailed license information.
### Attribution
- Original Model: NousResearch/DeepHermes-3-Mistral-24B-Preview
- Quantization: Performed by AkameLLC
- Quantization Method: BitsAndBytes library
- Base Architecture: Mistral-Small-24B (mistralai/Mistral-Small-24B-Base-2501)
### Commercial Use
Please review the original model's license terms for commercial usage guidelines.
## Technical Details

### Quantization Configuration
```python
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    llm_int8_enable_fp32_cpu_offload=True
)
```
### Model Architecture
- Architecture: Transformer (Mistral-based)
- Parameters: ~24B (quantized from original)
- Layers: Attention and Feed-Forward layers quantized
- Precision: 4-bit weights, bfloat16 compute
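You can confirm which layers were quantized by looking for `bitsandbytes` 4-bit linear modules in the loaded model; a small sketch:

```python
import bitsandbytes as bnb

# List the modules that were replaced by 4-bit linear layers
quantized = [name for name, module in model.named_modules()
             if isinstance(module, bnb.nn.Linear4bit)]
print(f"{len(quantized)} Linear4bit modules, e.g. {quantized[:3]}")
```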
### Files Included

- `model-*.safetensors`: Quantized model weights
- `config.json`: Model configuration
- `tokenizer.json`: Tokenizer configuration
- `generation_config.json`: Generation parameters
- `aws_deployment_config.json`: AWS deployment settings
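To see exactly what will be downloaded before pulling the weights, the repository contents can be listed with `huggingface_hub` (installed as a dependency of `transformers`):

```python
from huggingface_hub import list_repo_files

for filename in list_repo_files("AkameLLC/DeepHermes-3-Mistral-24B-Preview-BNB-NF4"):
    print(filename)
```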
## Troubleshooting

### Common Issues

#### CUDA Out of Memory
```python
# Reduce memory usage by capping the GPU allocation
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",
    max_memory={0: "22GB"}  # adjust based on your GPU; add "cpu": "32GB" to allow offloading
)
```
#### Slow Loading

```python
# Enable low CPU memory usage during loading
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",
    low_cpu_mem_usage=True
)
```
### Support
For issues specific to this quantized version, please open an issue in the repository. For general model questions, refer to the original DeepHermes-3 documentation.
## Advanced Usage

### API Integration

For production deployment, consider serving the model with one of the following (a minimal FastAPI sketch follows the list):
- FastAPI: REST API development
- vLLM: High-performance inference server
- TensorRT-LLM: NVIDIA optimized inference
- Ollama: Local deployment solution
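As an illustration of the first option, a minimal FastAPI sketch wrapping the already-loaded `model` and `tokenizer` (the endpoint name and request schema here are hypothetical, not part of this repository):

```python
from fastapi import FastAPI
from pydantic import BaseModel
import torch

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 256

@app.post("/generate")
def generate(req: GenerateRequest):
    # Assumes `model` and `tokenizer` were loaded at startup as in the Quick Start
    inputs = tokenizer(req.prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=req.max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,
        )
    text = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
    return {"response": text}
```

Run it with `uvicorn your_module:app`; for high-throughput serving, the vLLM or TensorRT-LLM options above are better suited than raw `generate` calls.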
### Fine-tuning

While this is a quantized model, you can still apply LoRA (Low-Rank Adaptation) adapters on top of the 4-bit weights for task-specific, QLoRA-style fine-tuning:

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare the quantized model for parameter-efficient training
model = prepare_model_for_kbit_training(model)

# LoRA configuration
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.1,
    task_type="CAUSAL_LM"
)

# Apply LoRA adapters; only the adapter weights are trained
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```
## Citation
If you use this quantized model in your research or applications, please cite both the original model and the quantization method:
```bibtex
@misc{deephermes3-bnb-nf4,
  title={DeepHermes-3-Mistral-24B-Preview-BNB-NF4},
  author={AkameLLC},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/AkameLLC/DeepHermes-3-Mistral-24B-Preview-BNB-NF4}
}

@misc{dettmers2022llmint8,
  title={LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale},
  author={Tim Dettmers and Mike Lewis and Younes Belkada and Luke Zettlemoyer},
  year={2022},
  eprint={2208.07339},
  archivePrefix={arXiv}
}
```
## Acknowledgments
- NousResearch: Original DeepHermes-3 model development
- Mistral AI: Base Mistral architecture
- BitsAndBytes Team: Quantization library and methods
- Hugging Face: Model hosting and transformers library
Last Updated: August 2025
Model Version: 1.0
Quantization Method: BitsAndBytes NF4