Kimi-K2-Instruct-0905 MLX 5-bit

This is a 5-bit quantized version of moonshotai/Kimi-K2-Instruct-0905 converted to MLX format for efficient inference on Apple Silicon.

Model Details

  • Original Model: Kimi-K2-Instruct-0905 by Moonshot AI
  • Quantization: 5-bit quantization (5.502 bits per weight)
  • Framework: MLX (Apple Machine Learning Framework)
  • Model Size: ~658 GB (5-bit quantized)
  • Optimized for: Apple Silicon (M1/M2/M3/M4 chips)

Quantization Options

This model is available in multiple quantization levels:

  • 8-bit - Highest quality, larger size
  • 6-bit - Excellent balance of quality and size
  • 5-bit (this model) - Very good quality with reduced size
  • 4-bit - Lower memory usage
  • 3-bit - Compact with acceptable quality
  • 2-bit - Smallest size, fastest inference

Hardware Requirements

Kimi-K2 is a massive Mixture-of-Experts model with roughly 1 trillion total parameters. Choose your quantization based on the unified memory you have available; the actual download size of any variant can also be checked programmatically, as sketched after the table below.

Quantization   Model Size   Min RAM   Quality
2-bit          ~84 GB       96 GB     Acceptable - some quality loss
3-bit          ~126 GB      128 GB    Good - recommended minimum
4-bit          ~168 GB      192 GB    Very Good - best quality/size balance
5-bit          ~210 GB      256 GB    Excellent
6-bit          ~252 GB      288 GB    Near original
8-bit          ~336 GB      384 GB    Original quality
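
If you want to confirm the actual on-disk size of a quantization before downloading, one option is to sum the file sizes that the Hugging Face Hub reports for the repository. A minimal sketch, assuming the huggingface_hub package is installed (substitute the repo id of whichever quantization you are considering):

from huggingface_hub import HfApi

# Query file metadata for the repository without downloading anything
api = HfApi()
info = api.model_info(
    "richardyoung/Kimi-K2-Instruct-0905-MLX-5bit",
    files_metadata=True,
)

# Sum the sizes of all files (weight shards, tokenizer, config)
total_bytes = sum(f.size or 0 for f in info.siblings)
print(f"Total repository size: {total_bytes / 1e9:.1f} GB")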

Recommended Configurations

Mac Model             Max RAM   Recommended Quantization
Mac Studio M2 Ultra   192 GB    4-bit
Mac Studio M3 Ultra   512 GB    8-bit
Mac Pro M2 Ultra      192 GB    4-bit
MacBook Pro M3 Max    128 GB    3-bit
MacBook Pro M4 Max    128 GB    3-bit

Performance Notes

  • Inference Speed: Expect roughly 5-15 tokens/sec depending on quantization and hardware (see the timing sketch after this list)
  • Model Loading: Allow 10-30 seconds for the model to load before the first token is produced
  • Context Window: Full 128K context supported
  • Active Parameters: Only ~32B parameters active per token (MoE architecture)
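
As a rough way to check throughput on your own machine, the timing sketch below wraps the same load and generate calls shown in the Usage section; re-tokenizing the output only approximates the true decode rate, and the elapsed time includes prompt processing:

import time
from mlx_lm import load, generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-5bit")

prompt = "Explain the difference between a process and a thread."
start = time.perf_counter()
response = generate(model, tokenizer, prompt=prompt, max_tokens=256)
elapsed = time.perf_counter() - start

# Approximate tokens/sec by re-tokenizing the generated text
n_tokens = len(tokenizer.encode(response))
print(f"{n_tokens} tokens in {elapsed:.1f}s (~{n_tokens / elapsed:.1f} tok/s)")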

Usage

Installation

pip install mlx-lm
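
Optionally, you can fetch the weights ahead of time with the Hugging Face CLI so the first load call does not block on the multi-hundred-GB download; mlx-lm will otherwise download the model on first use:

pip install huggingface_hub
huggingface-cli download richardyoung/Kimi-K2-Instruct-0905-MLX-5bit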

Basic Usage

from mlx_lm import load, generate

# Load the model
model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-5bit")

# Generate text
prompt = "你好,请介绍一下自己。"  # "Hello, please introduce yourself."
response = generate(model, tokenizer, prompt=prompt, max_tokens=512)
print(response)

Chat Format

messages = [
    {"role": "user", "content": "What is the capital of France?"}
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

response = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(response)
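
For multi-turn conversations, the same chat-template call is simply reapplied to the growing message history. A minimal continuation of the snippet above (the follow-up question is just an illustrative example):

# Append the assistant's reply and the next user turn to the history
messages.append({"role": "assistant", "content": response})
messages.append({"role": "user", "content": "And what is its population?"})

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

follow_up = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(follow_up)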

Performance Considerations

5-bit Quantization Trade-offs:

  • ✅ Very good quality retention relative to the full-precision model
  • ✅ Great balance between size and performance
  • ✅ Smaller than 6-bit with minimal quality loss
  • ✅ Suitable for production use
  • ⚡ Good inference speed on Apple Silicon

Recommended Use Cases:

  • Production deployments balancing quality and efficiency
  • General-purpose applications
  • When you need better quality than 4-bit but a smaller footprint than 6-bit
  • Resource-constrained environments

Why Choose 5-bit: The 5-bit quantization offers an excellent middle ground between the premium 6-bit and the more efficient 4-bit versions. It provides near-original model quality while being more memory-efficient than 6-bit, making it ideal for systems with moderate RAM constraints.

System Requirements

  • Apple Silicon Mac (M1/M2/M3/M4)
  • macOS 13.0 or later
  • Sufficient unified memory for your chosen quantization (see Hardware Requirements above; you can confirm your machine's memory with the sketch after this list)
  • Python 3.8+
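
To confirm how much unified memory a machine actually has before committing to a download, a small sketch that shells out to macOS's sysctl (nothing here is part of mlx-lm):

import subprocess

# hw.memsize reports total physical (unified) memory in bytes on macOS
mem_bytes = int(subprocess.check_output(["sysctl", "-n", "hw.memsize"]).strip())
print(f"Unified memory: {mem_bytes / (1024 ** 3):.0f} GB")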

Conversion Details

This model was quantized using MLX's conversion tools:

mlx_lm.convert \
  --hf-path moonshotai/Kimi-K2-Instruct-0905 \
  --mlx-path ./Kimi-K2-Instruct-0905-MLX-5bit \
  -q --q-bits 5 \
  --trust-remote-code

Actual quantization: 5.502 bits per weight (the overhead above 5 bits comes from the per-group scales and biases stored alongside the quantized weights)
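
After conversion (or after downloading this repo), the quantization settings can be read back from the model's config.json, where MLX records them under a "quantization" key; the path below assumes the local output directory from the command above:

import json
from pathlib import Path

config = json.loads(Path("./Kimi-K2-Instruct-0905-MLX-5bit/config.json").read_text())
print(config.get("quantization"))  # e.g. {"group_size": 64, "bits": 5}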

License

This model follows the same license as the original Kimi-K2-Instruct-0905 model. Please refer to the original model card for license details.

Citation

If you use this model, please cite the original Kimi model:

@misc{kimi-k2-instruct,
  title={Kimi K2 Instruct},
  author={Moonshot AI},
  year={2025},
  url={https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905}
}

Acknowledgments

  • Original model by Moonshot AI
  • Quantization performed using MLX by Apple
  • Conversion and hosting by richardyoung

Contact

For issues or questions about this quantized version, please open an issue on the model repository.
