Run DeepSeek V4 on more AI GPUs with FlagOS
DeepSeek V4 just dropped with huge specs: 1.6T params, 1M context, MIT license.
But there's a catch: the official weights use FP4+FP8 mixed precision, which mainly targets NVIDIA Blackwell / B200-class GPUs.
So we built DeepSeek-V4-FlagOS.
On Day 0, the FlagOS community completed multi-chip adaptation across 8 AI hardware platforms:
✅ NVIDIA H100/H20 → FP8/BF16
✅ Huawei Ascend → BF16
✅ Hygon DCU → BF16
✅ MetaX GPU → BF16
✅ Moore Threads MTT S5000 → FP8
✅ Kunlunxin XPU → BF16
✅ T-Head/Alibaba Zhenwu → BF16
✅ Iluvatar GPU → BF16
What makes it work?
1️⃣ FlagGems operator replacement
DeepSeek V4 operators (MoE routing, Attention, RMSNorm and more) are reimplemented with Triton, reducing the dependency on CUDA-specific libraries (see the sketch after the operator list).
New V4 operators include:
Act Quant, hc_split_sinkhorn, FP8 MatMul, Sparse Attention, Hadamard Transform.
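To give a feel for what a Triton-based operator replacement looks like, here is a minimal RMSNorm kernel. This is an illustrative sketch, not the FlagGems implementation; the function names, single-block-per-row layout, and contiguity assumptions are ours:

```python
# Minimal, illustrative Triton RMSNorm (not the FlagGems code).
# Assumes contiguous 2D CUDA tensors of shape (rows, hidden).
import torch
import triton
import triton.language as tl

@triton.jit
def rmsnorm_kernel(x_ptr, w_ptr, out_ptr, n_cols, eps, BLOCK: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=0.0).to(tl.float32)
    # RMSNorm: x / sqrt(mean(x^2) + eps) * weight
    rms = tl.sqrt(tl.sum(x * x, axis=0) / n_cols + eps)
    w = tl.load(w_ptr + cols, mask=mask, other=0.0).to(tl.float32)
    tl.store(out_ptr + row * n_cols + cols, x / rms * w, mask=mask)

def rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    out = torch.empty_like(x)
    n_rows, n_cols = x.shape
    BLOCK = triton.next_power_of_2(n_cols)  # one program instance per row
    rmsnorm_kernel[(n_rows,)](x, weight, out, n_cols, eps, BLOCK=BLOCK)
    return out
```

Because the kernel is written in Triton rather than against a vendor kernel library, the same source can be compiled for any backend with a Triton compiler target.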
2️⃣ Flexible tensor parallelism
DeepSeek V4 uses o_groups=8, which can limit TP.
We added an independent communication group for o-groups, while allowing the rest of the model to scale to higher TP, enabling deployment on 32GB/64GB cards.
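A rough sketch of that idea with torch.distributed (group sizes and the helper name are illustrative, not the actual FlagOS code):

```python
# Illustrative only: one wide TP group for most layers, plus independent
# size-8 groups for the o-group communication, so o_groups=8 does not
# cap the whole model at TP=8. Assumes dist.init_process_group() was called.
import torch.distributed as dist

def build_groups(world_size: int = 16, o_group_size: int = 8):
    rank = dist.get_rank()

    # Wide tensor-parallel group used by the rest of the model.
    tp_group = dist.new_group(ranks=list(range(world_size)))

    # Smaller, independent groups just for the o-group projection.
    o_group = None
    for start in range(0, world_size, o_group_size):
        ranks = list(range(start, start + o_group_size))
        g = dist.new_group(ranks=ranks)  # every rank must create every group
        if rank in ranks:
            o_group = g
    return tp_group, o_group
```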
3️⃣ FP4 → BF16 conversion
For hardware without native FP4, we provide ready-to-use BF16 conversion and pre-converted model releases.
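For illustration, a minimal FP4 (E2M1) to BF16 dequantization could look like the sketch below; the nibble packing order, per-block scale layout, and block size of 16 are assumptions, not necessarily what the released checkpoints use:

```python
# Illustrative FP4 (E2M1) -> BF16 dequantization with per-block scales.
import torch

# The 16 E2M1 code points: sign x {0, 0.5, 1, 1.5, 2, 3, 4, 6}.
E2M1_LUT = torch.tensor(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0])

def dequant_fp4_to_bf16(packed: torch.Tensor, scales: torch.Tensor,
                        block_size: int = 16) -> torch.Tensor:
    """packed: uint8, two FP4 codes per byte (low nibble first, assumed);
    scales: one float scale per block of `block_size` values."""
    lo = packed & 0x0F
    hi = (packed >> 4) & 0x0F
    codes = torch.stack([lo, hi], dim=-1).flatten()   # unpack nibbles
    vals = E2M1_LUT.to(packed.device)[codes.long()]   # decode FP4 codes
    vals = vals.view(-1, block_size) * scales.float().view(-1, 1)
    return vals.to(torch.bfloat16).flatten()
```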
Pre-converted models are available on Hugging Face (example download after the list):
V4-Pro:
FlagRelease/DeepSeek-V4-Pro-nvidia-FlagOS
FlagRelease/DeepSeek-V4-Pro-metax-FlagOS
FlagRelease/DeepSeek-V4-Pro-mthreads-FlagOS
FlagRelease/DeepSeek-V4-Pro-hygon-FlagOS
FlagRelease/DeepSeek-V4-Pro-ascend-FlagOS
V4-Flash:
FlagRelease/DeepSeek-V4-Flash-nvidia-FlagOS
FlagRelease/DeepSeek-V4-Flash-zhenwu-FlagOS
FlagRelease/DeepSeek-V4-Flash-kunlunxin-FlagOS
FlagRelease/DeepSeek-V4-Flash-iluvatar-FlagOS
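For example, grabbing one of the pre-converted checkpoints (repo id taken from the list above; the actual launch and serving commands are documented in the GitHub repo):

```python
# Download a pre-converted FlagOS checkpoint from Hugging Face.
from huggingface_hub import snapshot_download

local_dir = snapshot_download("FlagRelease/DeepSeek-V4-Flash-nvidia-FlagOS")
print("Checkpoint downloaded to:", local_dir)
```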
Performance on NVIDIA H20, V4-Flash FP8:
FlagGems C++ Wrapper + Triton: 70.7 tok/s
DeepSeek TileLang: 62.99 tok/s
That's 12.24% faster.
Try it here:
https://github.com/flagos-ai/DeepSeek-V4-FlagOS
Open models should run on open infrastructure