# DeepHermes-3-Mistral-24B-Preview-BNB-NF4

## Model Overview
This is a 4-bit quantized version of NousResearch/DeepHermes-3-Mistral-24B-Preview using BitsAndBytes NF4 quantization. The model has been optimized for efficient inference while maintaining high performance across various tasks.
## Key Features
- Base Model: NousResearch/DeepHermes-3-Mistral-24B-Preview
- Quantization: 4-bit NF4 (BitsAndBytes)
- Size Reduction: ~48GB → ~12-15GB (~70% reduction)
- GPU Optimized: NVIDIA L4, A10G, RTX 4090, and similar
- Memory Efficient: Runs on 16GB+ VRAM GPUs
## Quantization Details

### Technical Specifications
- Method: BitsAndBytes NF4 (Normal Float 4-bit)
- Compute Type: bfloat16
- Double Quantization: Enabled
- CPU Offloading: Supported
- Quantization Library: `bitsandbytes>=0.41.0`
### Performance Characteristics
- Memory Usage: 12-15GB VRAM
- Inference Speed: ~2-5 tokens/second (varies by hardware)
- Quality: Minimal degradation from original model
- Compatibility: Transformers library compatible
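The memory figure above can be verified on your own hardware once the model is loaded (see the Quick Start below); a minimal sketch using PyTorch's CUDA memory counters:

```python
import torch

# Assumes the model has already been loaded with device_map="auto" (see Quick Start)
for device_id in range(torch.cuda.device_count()):
    allocated_gb = torch.cuda.memory_allocated(device_id) / 1024**3
    reserved_gb = torch.cuda.memory_reserved(device_id) / 1024**3
    print(f"GPU {device_id}: {allocated_gb:.1f} GB allocated, {reserved_gb:.1f} GB reserved")
```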
## Quick Start

### Installation

```bash
pip install torch transformers accelerate bitsandbytes
```
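If installation succeeds but loading fails, it is worth confirming the versions actually in use; a quick sanity check of the installed packages:

```python
import torch
import transformers
import accelerate
import bitsandbytes

# bitsandbytes >= 0.41.0 is required for NF4 quantization
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)
print("bitsandbytes:", bitsandbytes.__version__)
print("CUDA available:", torch.cuda.is_available())
```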
### Basic Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Model configuration
model_name = "AkameLLC/DeepHermes-3-Mistral-24B-Preview-BNB-NF4"

# Quantization config (already applied, but needed for loading)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True
)

# Generate response
prompt = "Explain the concept of machine learning in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id
    )

response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```
### Chat Format

DeepHermes-3 uses the ChatML chat format, and its tokenizer ships with a chat template, so the most reliable way to build prompts is `tokenizer.apply_chat_template` rather than hand-writing special tokens:

```python
# Example usage
messages = [
    {"role": "user", "content": "What are the benefits of renewable energy?"}
]

# Builds the prompt with the model's own chat template
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)
# ... (generation code as above, passing `inputs` as the input IDs)
```
## Hardware Requirements

### Minimum Requirements
- VRAM: 16GB+ (NVIDIA RTX 4060 Ti 16GB or better)
- RAM: 16GB+ system memory
- Storage: 20GB available space
- CUDA: Compatible GPU with CUDA support
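Before downloading the weights, a quick check that the local GPU clears the VRAM bar above (a sketch using PyTorch only):

```python
import torch

if not torch.cuda.is_available():
    raise SystemExit("No CUDA-capable GPU detected; this model requires one for practical inference.")

props = torch.cuda.get_device_properties(0)
total_vram_gb = props.total_memory / 1024**3
print(f"GPU: {props.name}, VRAM: {total_vram_gb:.1f} GB")

if total_vram_gb < 16:
    print("Warning: less than 16 GB VRAM; expect heavy CPU offloading and slow generation.")
```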
### Recommended Hardware
- NVIDIA L4: 24GB VRAM (optimal)
- NVIDIA A10G: 24GB VRAM
- NVIDIA RTX 4090: 24GB VRAM
- NVIDIA RTX A6000: 48GB VRAM
### Cloud Deployment
- AWS EC2: g5.2xlarge or larger
- Google Cloud: n1-standard-4 with T4/V100
- Azure: Standard_NC6s_v3 or similar
## Performance Benchmarks

### Memory Usage Comparison
| Model Version | VRAM Usage | Model Size | Reduction |
|---|---|---|---|
| Original FP16 | ~48GB | ~48GB | - |
| BNB NF4 | ~12-15GB | ~12-15GB | 70% |
### Inference Speed (Approximate)
| Hardware | Tokens/Second | Batch Size |
|---|---|---|
| NVIDIA L4 | 3-5 | 1 |
| RTX 4090 | 4-6 | 1 |
| A10G | 3-5 | 1 |
Note: Performance varies based on prompt length and generation parameters
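To reproduce rough tokens-per-second numbers on your own hardware, a simple timing run is enough (a sketch; assumes `model` and `tokenizer` are loaded as in the Quick Start, batch size 1, greedy decoding):

```python
import time
import torch

prompt = "Write a short paragraph about solar energy."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Warm-up pass so kernel setup is not counted in the timing
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=16, do_sample=False)

start = time.perf_counter()
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[1] - inputs.input_ids.shape[1]
print(f"{new_tokens} new tokens in {elapsed:.1f}s -> {new_tokens / elapsed:.2f} tokens/s")
```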
## Use Cases

### Recommended Applications
- Conversational AI: Chat applications and assistants
- Content Generation: Writing, summarization, analysis
- Code Assistance: Programming help and code generation
- Educational Tools: Tutoring and explanation systems
- Research: NLP research and experimentation
### Not Recommended For
- Production Critical Systems: Use original model for mission-critical applications
- High-Precision Tasks: Where minimal quality loss is unacceptable
- Batch Processing: Large-scale batch inference (use original model)
## Limitations and Considerations

### Quantization Impact
- Slight Quality Degradation: ~1-3% performance reduction expected
- Numerical Precision: Reduced precision in calculations
- Edge Cases: May handle some prompts differently than original
### Technical Limitations
- Memory Requirements: Still requires significant VRAM
- Loading Time: Initial model loading takes 2-5 minutes
- Hardware Dependency: Requires CUDA-compatible GPU
### Usage Guidelines
- Temperature: Recommended range 0.1-1.0
- Max Tokens: Optimal performance with 512 tokens or less
- Batch Size: Single inference recommended for optimal memory usage
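Applied to the `generate` call from the Quick Start, these guidelines translate to something like the following (values are illustrative starting points, not tuned settings):

```python
generation_kwargs = {
    "max_new_tokens": 512,                    # stay at or below 512 for best latency
    "do_sample": True,
    "temperature": 0.7,                       # keep within the 0.1-1.0 range above
    "top_p": 0.9,
    "pad_token_id": tokenizer.eos_token_id,
}

with torch.no_grad():
    outputs = model.generate(**inputs, **generation_kwargs)  # single-sequence batch
```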
## License and Legal

### Model License
This quantized model inherits the license from the original DeepHermes-3-Mistral-24B-Preview model. Please refer to the original model page for detailed license information.
### Attribution
- Original Model: NousResearch/DeepHermes-3-Mistral-24B-Preview
- Quantization: Performed by AkameLLC
- Quantization Method: BitsAndBytes library
- Base Architecture: Mistral-Small-24B (mistralai/Mistral-Small-24B-Base-2501)
### Commercial Use
Please review the original model's license terms for commercial usage guidelines.
## Technical Details

### Quantization Configuration
```python
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    llm_int8_enable_fp32_cpu_offload=True
)
```
### Model Architecture
- Architecture: Transformer (Mistral-based)
- Parameters: ~24B (quantized from original)
- Layers: Attention and Feed-Forward layers quantized
- Precision: 4-bit weights, bfloat16 compute
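You can confirm which layers were quantized by looking for `bitsandbytes` 4-bit linear modules in the loaded model; a small sketch:

```python
import bitsandbytes as bnb

# List the modules that were replaced by 4-bit linear layers
quantized = [name for name, module in model.named_modules()
             if isinstance(module, bnb.nn.Linear4bit)]
print(f"{len(quantized)} Linear4bit modules, e.g. {quantized[:3]}")
```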
### Files Included

- `model-*.safetensors`: Quantized model weights
- `config.json`: Model configuration
- `tokenizer.json`: Tokenizer configuration
- `generation_config.json`: Generation parameters
- `aws_deployment_config.json`: AWS deployment settings
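To see exactly what will be downloaded before pulling the weights, the repository contents can be listed with `huggingface_hub` (installed as a dependency of `transformers`):

```python
from huggingface_hub import list_repo_files

for filename in list_repo_files("AkameLLC/DeepHermes-3-Mistral-24B-Preview-BNB-NF4"):
    print(filename)
```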
## Troubleshooting

### Common Issues

#### CUDA Out of Memory
```python
# Reduce memory usage by capping the GPU allocation
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",
    max_memory={0: "22GB"}  # adjust based on your GPU; add "cpu": "32GB" to allow offloading
)
```
#### Slow Loading

```python
# Enable low CPU memory usage during loading
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",
    low_cpu_mem_usage=True
)
```
### Support
For issues specific to this quantized version, please open an issue in the repository. For general model questions, refer to the original DeepHermes-3 documentation.
## Advanced Usage

### API Integration

For production deployment, consider serving the model with one of the following (a minimal FastAPI sketch follows the list):
- FastAPI: REST API development
- vLLM: High-performance inference server
- TensorRT-LLM: NVIDIA optimized inference
- Ollama: Local deployment solution
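As an illustration of the first option, a minimal FastAPI sketch wrapping the already-loaded `model` and `tokenizer` (the endpoint name and request schema here are hypothetical, not part of this repository):

```python
from fastapi import FastAPI
from pydantic import BaseModel
import torch

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 256

@app.post("/generate")
def generate(req: GenerateRequest):
    # Assumes `model` and `tokenizer` were loaded at startup as in the Quick Start
    inputs = tokenizer(req.prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=req.max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,
        )
    text = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
    return {"response": text}
```

Run it with `uvicorn your_module:app`; for high-throughput serving, the vLLM or TensorRT-LLM options above are better suited than raw `generate` calls.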
### Fine-tuning

While this is a quantized model, you can still apply LoRA (Low-Rank Adaptation) adapters on top of the 4-bit weights for task-specific, QLoRA-style fine-tuning:

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare the quantized model for parameter-efficient training
model = prepare_model_for_kbit_training(model)

# LoRA configuration
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.1,
    task_type="CAUSAL_LM"
)

# Apply LoRA adapters; only the adapter weights are trained
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```
## Citation
If you use this quantized model in your research or applications, please cite both the original model and the quantization method:
```bibtex
@misc{deephermes3-bnb-nf4,
  title={DeepHermes-3-Mistral-24B-Preview-BNB-NF4},
  author={AkameLLC},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/AkameLLC/DeepHermes-3-Mistral-24B-Preview-BNB-NF4}
}

@misc{dettmers2022llmint8,
  title={LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale},
  author={Tim Dettmers and Mike Lewis and Younes Belkada and Luke Zettlemoyer},
  year={2022},
  eprint={2208.07339},
  archivePrefix={arXiv}
}
```
## Acknowledgments
- NousResearch: Original DeepHermes-3 model development
- Mistral AI: Base Mistral architecture
- BitsAndBytes Team: Quantization library and methods
- Hugging Face: Model hosting and transformers library
Last Updated: August 2025
Model Version: 1.0
Quantization Method: BitsAndBytes NF4