Run DeepSeek V4 on more AI GPUs with FlagOS
DeepSeek V4 just dropped with huge specs: 1.6T params, 1M context, MIT license.
But there's a catch: the official weights use FP4+FP8 mixed precision, which mainly targets NVIDIA Blackwell / B200-class GPUs.
So we built DeepSeek-V4-FlagOS.
On Day 0, the FlagOS community completed multi-chip adaptation across 8 AI hardware platforms:
✅ NVIDIA H100/H20 → FP8/BF16
✅ Huawei Ascend → BF16
✅ Hygon DCU → BF16
✅ MetaX GPU → BF16
✅ Moore Threads MTT S5000 → FP8
✅ Kunlunxin XPU → BF16
✅ T-Head/Alibaba Zhenwu → BF16
✅ Iluvatar GPU → BF16
What makes it work?
1️⃣ FlagGems operator replacement
DeepSeek V4 operators (MoE routing, Attention, RMSNorm and more) are reimplemented with Triton, reducing the dependency on CUDA-specific libraries (see the sketch after the operator list).
New V4 operators include:
Act Quant, hc_split_sinkhorn, FP8 MatMul, Sparse Attention, Hadamard Transform.
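To give a feel for what a Triton-based operator replacement looks like, here is a minimal RMSNorm kernel. This is an illustrative sketch, not the FlagGems implementation; the function names, single-block-per-row layout, and contiguity assumptions are ours:

```python
# Minimal, illustrative Triton RMSNorm (not the FlagGems code).
# Assumes contiguous 2D CUDA tensors of shape (rows, hidden).
import torch
import triton
import triton.language as tl

@triton.jit
def rmsnorm_kernel(x_ptr, w_ptr, out_ptr, n_cols, eps, BLOCK: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=0.0).to(tl.float32)
    # RMSNorm: x / sqrt(mean(x^2) + eps) * weight
    rms = tl.sqrt(tl.sum(x * x, axis=0) / n_cols + eps)
    w = tl.load(w_ptr + cols, mask=mask, other=0.0).to(tl.float32)
    tl.store(out_ptr + row * n_cols + cols, x / rms * w, mask=mask)

def rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    out = torch.empty_like(x)
    n_rows, n_cols = x.shape
    BLOCK = triton.next_power_of_2(n_cols)  # one program instance per row
    rmsnorm_kernel[(n_rows,)](x, weight, out, n_cols, eps, BLOCK=BLOCK)
    return out
```

Because the kernel is written in Triton rather than against a vendor kernel library, the same source can be compiled for any backend with a Triton compiler target.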
2️⃣ Flexible tensor parallelism
DeepSeek V4 uses o_groups=8, which can limit TP.
We added an independent communication group for o-groups, while allowing the rest of the model to scale to higher TP, enabling deployment on 32GB/64GB cards.
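A rough sketch of that idea with torch.distributed (group sizes and the helper name are illustrative, not the actual FlagOS code):

```python
# Illustrative only: one wide TP group for most layers, plus independent
# size-8 groups for the o-group communication, so o_groups=8 does not
# cap the whole model at TP=8. Assumes dist.init_process_group() was called.
import torch.distributed as dist

def build_groups(world_size: int = 16, o_group_size: int = 8):
    rank = dist.get_rank()

    # Wide tensor-parallel group used by the rest of the model.
    tp_group = dist.new_group(ranks=list(range(world_size)))

    # Smaller, independent groups just for the o-group projection.
    o_group = None
    for start in range(0, world_size, o_group_size):
        ranks = list(range(start, start + o_group_size))
        g = dist.new_group(ranks=ranks)  # every rank must create every group
        if rank in ranks:
            o_group = g
    return tp_group, o_group
```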
3️⃣ FP4 → BF16 conversion
For hardware without native FP4, we provide ready-to-use BF16 conversion and pre-converted model releases.
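For illustration, a minimal FP4 (E2M1) to BF16 dequantization could look like the sketch below; the nibble packing order, per-block scale layout, and block size of 16 are assumptions, not necessarily what the released checkpoints use:

```python
# Illustrative FP4 (E2M1) -> BF16 dequantization with per-block scales.
import torch

# The 16 E2M1 code points: sign x {0, 0.5, 1, 1.5, 2, 3, 4, 6}.
E2M1_LUT = torch.tensor(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0])

def dequant_fp4_to_bf16(packed: torch.Tensor, scales: torch.Tensor,
                        block_size: int = 16) -> torch.Tensor:
    """packed: uint8, two FP4 codes per byte (low nibble first, assumed);
    scales: one float scale per block of `block_size` values."""
    lo = packed & 0x0F
    hi = (packed >> 4) & 0x0F
    codes = torch.stack([lo, hi], dim=-1).flatten()   # unpack nibbles
    vals = E2M1_LUT.to(packed.device)[codes.long()]   # decode FP4 codes
    vals = vals.view(-1, block_size) * scales.float().view(-1, 1)
    return vals.to(torch.bfloat16).flatten()
```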
Pre-converted models are available on Hugging Face (example download after the list):
V4-Pro:
FlagRelease/DeepSeek-V4-Pro-nvidia-FlagOS
FlagRelease/DeepSeek-V4-Pro-metax-FlagOS
FlagRelease/DeepSeek-V4-Pro-mthreads-FlagOS
FlagRelease/DeepSeek-V4-Pro-hygon-FlagOS
FlagRelease/DeepSeek-V4-Pro-ascend-FlagOS
V4-Flash:
FlagRelease/DeepSeek-V4-Flash-nvidia-FlagOS
FlagRelease/DeepSeek-V4-Flash-zhenwu-FlagOS
FlagRelease/DeepSeek-V4-Flash-kunlunxin-FlagOS
FlagRelease/DeepSeek-V4-Flash-iluvatar-FlagOS
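For example, grabbing one of the pre-converted checkpoints (repo id taken from the list above; the actual launch and serving commands are documented in the GitHub repo):

```python
# Download a pre-converted FlagOS checkpoint from Hugging Face.
from huggingface_hub import snapshot_download

local_dir = snapshot_download("FlagRelease/DeepSeek-V4-Flash-nvidia-FlagOS")
print("Checkpoint downloaded to:", local_dir)
```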
Performance on NVIDIA H20, V4-Flash FP8:
FlagGems C++ Wrapper + Triton: 70.7 tok/s
DeepSeek TileLang: 62.99 tok/s
That's 12.24% faster.
Try it here:
https://github.com/flagos-ai/DeepSeek-V4-FlagOS
Open models should run on open infrastructure