DeepHermes-3-Mistral-24B-Preview-BNB-NF4

πŸ“‹ Model Overview

This is a 4-bit quantized version of NousResearch/DeepHermes-3-Mistral-24B-Preview using BitsAndBytes NF4 quantization. The model has been optimized for efficient inference while maintaining high performance across various tasks.

🎯 Key Features

  • Base Model: NousResearch/DeepHermes-3-Mistral-24B-Preview
  • Quantization: 4-bit NF4 (BitsAndBytes)
  • Size Reduction: ~48GB β†’ ~12-15GB (70% reduction)
  • GPU Optimized: NVIDIA L4, A10G, RTX 4090, and similar
  • Memory Efficient: Runs on 16GB+ VRAM GPUs

πŸ”§ Quantization Details

Technical Specifications

  • Method: BitsAndBytes NF4 (Normal Float 4-bit)
  • Compute Type: bfloat16
  • Double Quantization: Enabled
  • CPU Offloading: Supported
  • Quantization Library: bitsandbytes>=0.41.0
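
Before loading the model, it is worth confirming that the environment matches these specifications. A minimal check (a sketch, assuming a CUDA-capable machine):

import torch
import bitsandbytes as bnb

print("bitsandbytes:", bnb.__version__)        # should be >= 0.41.0
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    # bfloat16 is the compute dtype used by this quantization config
    print("bf16 supported:", torch.cuda.is_bf16_supported())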

Performance Characteristics

  • Memory Usage: 12-15GB VRAM
  • Inference Speed: ~2-5 tokens/second (varies by hardware)
  • Quality: Minimal degradation from original model
  • Compatibility: Transformers library compatible
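
Once the model is loaded (see Quick Start below), the quantized footprint can be verified with the standard Transformers helper:

# Reports the in-memory size of the quantized weights in GB
print(f"Model footprint: {model.get_memory_footprint() / 1e9:.1f} GB")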

πŸš€ Quick Start

Installation

pip install torch transformers accelerate bitsandbytes

Basic Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Model configuration
model_name = "AkameLLC/DeepHermes-3-Mistral-24B-Preview-BNB-NF4"

# Quantization config (already applied, but needed for loading)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True
)

# Generate response
prompt = "Explain the concept of machine learning in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id
    )

response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
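
For interactive use, tokens can be streamed to stdout as they are generated instead of waiting for the full completion. A sketch using transformers' TextStreamer with the model, tokenizer, and inputs loaded above:

from transformers import TextStreamer

# Prints tokens as they are produced; skips echoing the prompt
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

with torch.no_grad():
    model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,
        streamer=streamer
    )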

Chat Format

DeepHermes-3 uses the ChatML conversation format, which is already embedded in the tokenizer's chat template, so the most reliable approach is to let the tokenizer build the prompt:

messages = [
    {"role": "user", "content": "What are the benefits of renewable energy?"}
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

# ... (generation code as above)

The model also supports an optional deep-reasoning mode that is activated with a dedicated system prompt; see the original DeepHermes-3 model card for the exact prompt text.

πŸ’» Hardware Requirements

Minimum Requirements

  • VRAM: 16GB+ (NVIDIA RTX 4060 Ti 16GB or better)
  • RAM: 16GB+ system memory
  • Storage: 20GB available space
  • CUDA: Compatible GPU with CUDA support
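
To confirm that your GPU meets the minimum, a quick check:

import torch

props = torch.cuda.get_device_properties(0)
total_gb = props.total_memory / 1e9
print(f"{props.name}: {total_gb:.0f} GB VRAM",
      "(meets minimum)" if total_gb >= 16 else "(below the 16GB minimum)")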

Recommended Hardware

  • NVIDIA L4: 24GB VRAM (optimal)
  • NVIDIA A10G: 24GB VRAM
  • NVIDIA RTX 4090: 24GB VRAM
  • NVIDIA RTX A6000: 48GB VRAM

Cloud Deployment

  • AWS EC2: g5.2xlarge (1x A10G, 24GB) or larger
  • Google Cloud: g2-standard-8 (1x L4, 24GB) or larger
  • Azure: any GPU VM size with a 24GB-class GPU (A10 or A100)

πŸ“Š Performance Benchmarks

Memory Usage Comparison

Model Version | VRAM Usage | Model Size | Reduction
Original FP16 | ~48GB      | ~48GB      | -
BNB NF4       | ~12-15GB   | ~12-15GB   | 70%

Inference Speed (Approximate)

Hardware   | Tokens/Second | Batch Size
NVIDIA L4  | 3-5           | 1
RTX 4090   | 4-6           | 1
A10G       | 3-5           | 1

Note: Performance varies based on prompt length and generation parameters
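
These figures depend heavily on hardware, prompt length, and generation settings. To get a rough measurement on your own setup, a small sketch (assuming the model, tokenizer, and inputs from the Quick Start section):

import time
import torch

torch.cuda.reset_peak_memory_stats()

start = time.perf_counter()
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
elapsed = time.perf_counter() - start

# Count only newly generated tokens, not the prompt
new_tokens = outputs.shape[1] - inputs.input_ids.shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/s, peak VRAM {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")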

🎯 Use Cases

Recommended Applications

  • Conversational AI: Chat applications and assistants
  • Content Generation: Writing, summarization, analysis
  • Code Assistance: Programming help and code generation
  • Educational Tools: Tutoring and explanation systems
  • Research: NLP research and experimentation

Not Recommended For

  • Production Critical Systems: Use original model for mission-critical applications
  • High-Precision Tasks: Where minimal quality loss is unacceptable
  • Batch Processing: Large-scale batch inference (use original model)

⚠️ Limitations and Considerations

Quantization Impact

  • Slight Quality Degradation: ~1-3% performance reduction expected
  • Numerical Precision: Reduced precision in calculations
  • Edge Cases: May handle some prompts differently than original

Technical Limitations

  • Memory Requirements: Still requires significant VRAM
  • Loading Time: Initial model loading takes 2-5 minutes
  • Hardware Dependency: Requires CUDA-compatible GPU

Usage Guidelines

  • Temperature: Recommended range 0.1-1.0
  • Max Tokens: Optimal performance with 512 tokens or less
  • Batch Size: Single inference recommended for optimal memory usage
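
These guidelines can be captured in a reusable GenerationConfig so every call stays within the recommended ranges (a sketch, reusing the model, tokenizer, and inputs from Quick Start):

from transformers import GenerationConfig

# Defaults chosen to stay within the recommended ranges above
generation_config = GenerationConfig(
    do_sample=True,
    temperature=0.7,          # recommended range: 0.1-1.0
    top_p=0.9,
    max_new_tokens=512,       # optimal performance at 512 tokens or less
    pad_token_id=tokenizer.eos_token_id
)

outputs = model.generate(**inputs, generation_config=generation_config)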

πŸ“œ License and Legal

Model License

This quantized model inherits the license from the original DeepHermes-3-Mistral-24B-Preview model. Please refer to the original model page for detailed license information.

Attribution

  • Original Model: NousResearch/DeepHermes-3-Mistral-24B-Preview
  • Quantization: Performed by AkameLLC
  • Quantization Method: BitsAndBytes library
  • Base Architecture: Mistral Small 24B

Commercial Use

Please review the original model's license terms for commercial usage guidelines.

πŸ› οΈ Technical Details

Quantization Configuration

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    llm_int8_enable_fp32_cpu_offload=True
)

Model Architecture

  • Architecture: Transformer (Mistral-based)
  • Parameters: ~24B (unchanged from the original; weights stored in 4-bit)
  • Layers: Attention and Feed-Forward layers quantized
  • Precision: 4-bit weights, bfloat16 compute
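
To confirm which layers were actually quantized, the loaded module types can be inspected (a sketch, assuming the model from Quick Start):

import torch
import bitsandbytes as bnb

# Linear4bit modules hold the 4-bit weights; remaining nn.Linear layers stay in higher precision
quantized = sum(isinstance(m, bnb.nn.Linear4bit) for m in model.modules())
full_precision = sum(
    isinstance(m, torch.nn.Linear) and not isinstance(m, bnb.nn.Linear4bit)
    for m in model.modules()
)
print(f"{quantized} Linear4bit layers, {full_precision} higher-precision linear layers (e.g. lm_head)")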

Files Included

  • model-*.safetensors: Quantized model weights
  • config.json: Model configuration
  • tokenizer.json: Serialized tokenizer
  • generation_config.json: Generation parameters
  • aws_deployment_config.json: AWS deployment settings

πŸ”§ Troubleshooting

Common Issues

CUDA Out of Memory

# Reduce memory usage
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",
    max_memory={0: "22GiB"}  # Per-GPU cap; adjust based on your GPU
)

Slow Loading

# Enable low CPU memory usage
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",
    low_cpu_mem_usage=True
)

Support

For issues specific to this quantized version, please open an issue in the repository. For general model questions, refer to the original DeepHermes-3 documentation.

πŸš€ Advanced Usage

API Integration

For production deployment, consider using with:

  • FastAPI: REST API development (see the sketch after this list)
  • vLLM: High-performance inference server
  • TensorRT-LLM: NVIDIA optimized inference
  • Ollama: Local deployment solution
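
As an illustration of the FastAPI option, a minimal endpoint wrapping the model and tokenizer from Quick Start could look like the sketch below (the /generate route and GenerateRequest schema are placeholder names, and fastapi plus uvicorn must be installed separately); this is not a production-ready server:

import torch
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 256

@app.post("/generate")
def generate(req: GenerateRequest):
    # Reuses the globally loaded model and tokenizer from the Quick Start section
    inputs = tokenizer(req.prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=req.max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id
        )
    text = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
    return {"response": text}

# Run with: uvicorn app:app --host 0.0.0.0 --port 8000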

Fine-tuning

While this is a quantized model, you can still apply LoRA (Low-Rank Adaptation) for task-specific fine-tuning:

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare the quantized model for training (casts norms, enables input gradients)
model = prepare_model_for_kbit_training(model)

# LoRA configuration
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.1,
    task_type="CAUSAL_LM"
)

# Apply LoRA adapters on top of the frozen 4-bit weights
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

πŸ“š Citation

If you use this quantized model in your research or applications, please cite both the original model and the quantization method:

@misc{deephermes3-bnb-nf4,
  title={DeepHermes-3-Mistral-24B-Preview-BNB-NF4},
  author={AkameLLC},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/AkameLLC/DeepHermes-3-Mistral-24B-Preview-BNB-NF4}
}

@misc{dettmers2022gpt3int8,
  title={GPT3.int8(): 8-bit Matrix Multiplication for Transformers at Scale},
  author={Tim Dettmers and Mike Lewis and Sam Shleifer and Luke Zettlemoyer},
  year={2022},
  eprint={2208.07339},
  archivePrefix={arXiv}
}

🀝 Acknowledgments

  • NousResearch: Original DeepHermes-3 model development
  • Mistral AI: Base Mistral architecture
  • BitsAndBytes Team: Quantization library and methods
  • Hugging Face: Model hosting and transformers library

Last Updated: August 2025
Model Version: 1.0
Quantization Method: BitsAndBytes NF4
