Kimi-K2-Instruct-0905 MLX 5-bit

This is a 5-bit quantized version of moonshotai/Kimi-K2-Instruct-0905 converted to MLX format for efficient inference on Apple Silicon.

Model Details

  • Original Model: Kimi-K2-Instruct-0905 by Moonshot AI
  • Quantization: 5-bit quantization (5.502 bits per weight)
  • Framework: MLX (Apple Machine Learning Framework)
  • Model Size: ~658 GB (5-bit quantized)
  • Optimized for: Apple Silicon (M1/M2/M3/M4 chips)

Quantization Options

This model is available in multiple quantization levels:

  • 8-bit - Highest quality, larger size
  • 6-bit - Excellent balance of quality and size
  • 5-bit (this model) - Very good quality with reduced size
  • 4-bit - Lower memory usage
  • 3-bit - Compact with acceptable quality
  • 2-bit - Smallest size, fastest inference

Hardware Requirements

Kimi-K2 is a massive Mixture-of-Experts model with roughly 1 trillion total parameters. Choose your quantization based on the unified memory you have available; the actual download size of any variant can also be checked programmatically, as sketched after the table below.

Quantization   Model Size   Min RAM   Quality
2-bit          ~84 GB       96 GB     Acceptable - some quality loss
3-bit          ~126 GB      128 GB    Good - recommended minimum
4-bit          ~168 GB      192 GB    Very Good - best quality/size balance
5-bit          ~210 GB      256 GB    Excellent
6-bit          ~252 GB      288 GB    Near original
8-bit          ~336 GB      384 GB    Original quality
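
If you want to confirm the actual on-disk size of a quantization before downloading, one option is to sum the file sizes that the Hugging Face Hub reports for the repository. A minimal sketch, assuming the huggingface_hub package is installed (substitute the repo id of whichever quantization you are considering):

from huggingface_hub import HfApi

# Query file metadata for the repository without downloading anything
api = HfApi()
info = api.model_info(
    "richardyoung/Kimi-K2-Instruct-0905-MLX-5bit",
    files_metadata=True,
)

# Sum the sizes of all files (weight shards, tokenizer, config)
total_bytes = sum(f.size or 0 for f in info.siblings)
print(f"Total repository size: {total_bytes / 1e9:.1f} GB")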

Recommended Configurations

Mac Model             Max RAM   Recommended Quantization
Mac Studio M2 Ultra   192 GB    4-bit
Mac Studio M3 Ultra   512 GB    8-bit
Mac Pro M2 Ultra      192 GB    4-bit
MacBook Pro M3 Max    128 GB    3-bit
MacBook Pro M4 Max    128 GB    3-bit

Performance Notes

  • Inference Speed: Expect roughly 5-15 tokens/sec depending on quantization and hardware (see the timing sketch after this list)
  • Model Loading: Allow 10-30 seconds for the model to load before the first token is produced
  • Context Window: Full 128K context supported
  • Active Parameters: Only ~32B parameters active per token (MoE architecture)
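
As a rough way to check throughput on your own machine, the timing sketch below wraps the same load and generate calls shown in the Usage section; re-tokenizing the output only approximates the true decode rate, and the elapsed time includes prompt processing:

import time
from mlx_lm import load, generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-5bit")

prompt = "Explain the difference between a process and a thread."
start = time.perf_counter()
response = generate(model, tokenizer, prompt=prompt, max_tokens=256)
elapsed = time.perf_counter() - start

# Approximate tokens/sec by re-tokenizing the generated text
n_tokens = len(tokenizer.encode(response))
print(f"{n_tokens} tokens in {elapsed:.1f}s (~{n_tokens / elapsed:.1f} tok/s)")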

Usage

Installation

pip install mlx-lm
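
Optionally, you can fetch the weights ahead of time with the Hugging Face CLI so the first load call does not block on the multi-hundred-GB download; mlx-lm will otherwise download the model on first use:

pip install huggingface_hub
huggingface-cli download richardyoung/Kimi-K2-Instruct-0905-MLX-5bit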

Basic Usage

from mlx_lm import load, generate

# Load the model
model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-5bit")

# Generate text
prompt = "你好,请介绍一下自己。"  # "Hello, please introduce yourself."
response = generate(model, tokenizer, prompt=prompt, max_tokens=512)
print(response)

Chat Format

messages = [
    {"role": "user", "content": "What is the capital of France?"}
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

response = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(response)
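
For multi-turn conversations, the same chat-template call is simply reapplied to the growing message history. A minimal continuation of the snippet above (the follow-up question is just an illustrative example):

# Append the assistant's reply and the next user turn to the history
messages.append({"role": "assistant", "content": response})
messages.append({"role": "user", "content": "And what is its population?"})

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

follow_up = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(follow_up)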

Performance Considerations

5-bit Quantization Trade-offs:

  • ✅ Very good quality retention relative to the full-precision model
  • ✅ Great balance between size and performance
  • ✅ Smaller than 6-bit with minimal quality loss
  • ✅ Suitable for production use
  • ⚡ Good inference speed on Apple Silicon

Recommended Use Cases:

  • Production deployments balancing quality and efficiency
  • General-purpose applications
  • When you need better quality than 4-bit but a smaller footprint than 6-bit
  • Resource-constrained environments

Why Choose 5-bit: The 5-bit quantization offers an excellent middle ground between the premium 6-bit and the more efficient 4-bit versions. It provides near-original model quality while being more memory-efficient than 6-bit, making it ideal for systems with moderate RAM constraints.

System Requirements

  • Apple Silicon Mac (M1/M2/M3/M4)
  • macOS 13.0 or later
  • Sufficient unified memory for your chosen quantization (see Hardware Requirements above; you can confirm your machine's memory with the sketch after this list)
  • Python 3.8+
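
To confirm how much unified memory a machine actually has before committing to a download, a small sketch that shells out to macOS's sysctl (nothing here is part of mlx-lm):

import subprocess

# hw.memsize reports total physical (unified) memory in bytes on macOS
mem_bytes = int(subprocess.check_output(["sysctl", "-n", "hw.memsize"]).strip())
print(f"Unified memory: {mem_bytes / (1024 ** 3):.0f} GB")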

Conversion Details

This model was quantized using MLX's conversion tools:

mlx_lm.convert \
  --hf-path moonshotai/Kimi-K2-Instruct-0905 \
  --mlx-path ./Kimi-K2-Instruct-0905-MLX-5bit \
  -q --q-bits 5 \
  --trust-remote-code

Actual quantization: 5.502 bits per weight (the overhead above 5 bits comes from the per-group scales and biases stored alongside the quantized weights)
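
After conversion (or after downloading this repo), the quantization settings can be read back from the model's config.json, where MLX records them under a "quantization" key; the path below assumes the local output directory from the command above:

import json
from pathlib import Path

config = json.loads(Path("./Kimi-K2-Instruct-0905-MLX-5bit/config.json").read_text())
print(config.get("quantization"))  # e.g. {"group_size": 64, "bits": 5}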

License

This model follows the same license as the original Kimi-K2-Instruct-0905 model. Please refer to the original model card for license details.

Citation

If you use this model, please cite the original Kimi model:

@misc{kimi-k2-instruct,
  title={Kimi K2 Instruct},
  author={Moonshot AI},
  year={2025},
  url={https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905}
}

Acknowledgments

  • Original model by Moonshot AI
  • Quantization performed using MLX by Apple
  • Conversion and hosting by richardyoung

Contact

For issues or questions about this quantized version, please open an issue on the model repository.
