Yonghua Lin (Yonghua)
AI & ML interests
None yet
Recent Activity
New activity on deepseek-ai/DeepSeek-V4-Flash (about 5 hours ago):
Run DeepSeek-V4-Flash on more hardware: FP8/BF16 adapted versions for 8 AI chips (ready to download)
Posted an update (about 6 hours ago):
Run DeepSeek V4 on more AI GPUs with FlagOS

DeepSeek V4 just dropped with huge specs: 1.6T params, 1M context, MIT license. But there's a catch: the official weights use FP4+FP8 mixed precision, which mainly targets NVIDIA Blackwell / B200-class GPUs. So we built DeepSeek-V4-FlagOS.

On Day 0, the FlagOS community completed multi-chip adaptation across 8 AI hardware platforms:
- NVIDIA H100/H20 → FP8/BF16
- Huawei Ascend → BF16
- Hygon DCU → BF16
- MetaX GPU → BF16
- Moore Threads MTT S5000 → FP8
- Kunlunxin XPU → BF16
- T-Head/Alibaba Zhenwu → BF16
- Iluvatar GPU → BF16

What makes it work?

1. FlagGems operator replacement
DeepSeek V4 operators (MoE routing, Attention, RMSNorm and more) are reimplemented with Triton, reducing dependency on CUDA-specific libraries. New V4 operators include: Act Quant, hc_split_sinkhorn, FP8 MatMul, Sparse Attention, Hadamard Transform.

2. Flexible tensor parallelism
DeepSeek V4 uses o_groups=8, which can limit TP. We added an independent communication group for o-groups, while allowing the rest of the model to scale to higher TP, enabling deployment on 32GB/64GB cards.

3. FP4 → BF16 conversion
For hardware without native FP4, we provide ready-to-use BF16 conversion and pre-converted model releases.

Pre-converted models are available on Hugging Face:

V4-Pro:
- FlagRelease/DeepSeek-V4-Pro-nvidia-FlagOS
- FlagRelease/DeepSeek-V4-Pro-metax-FlagOS
- FlagRelease/DeepSeek-V4-Pro-mthreads-FlagOS
- FlagRelease/DeepSeek-V4-Pro-hygon-FlagOS
- FlagRelease/DeepSeek-V4-Pro-ascend-FlagOS

V4-Flash:
- FlagRelease/DeepSeek-V4-Flash-nvidia-FlagOS
- FlagRelease/DeepSeek-V4-Flash-zhenwu-FlagOS
- FlagRelease/DeepSeek-V4-Flash-kunlunxin-FlagOS
- FlagRelease/DeepSeek-V4-Flash-iluvatar-FlagOS

Performance on NVIDIA H20, V4-Flash FP8:
- FlagGems C++ Wrapper + Triton: 70.7 tok/s
- DeepSeek TileLang: 62.99 tok/s
That's 12.24% faster.

Try it here: https://github.com/flagos-ai/DeepSeek-V4-FlagOS

Open models should run on open infrastructure.
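To make the FP4 → BF16 conversion step above concrete, here is a minimal sketch of decoding 4-bit E2M1 values (the common FP4 layout: 1 sign bit, 2 exponent bits, 1 mantissa bit) into full-precision floats. This is an illustrative decoder in plain Python, not the FlagOS implementation; the function names and the nibble-packing order are assumptions, and a real converter would also apply the per-block scale factors stored alongside the FP4 weights before emitting BF16 tensors.

```python
# Hypothetical sketch of FP4 (E2M1) decoding, not the FlagOS conversion code.

def fp4_e2m1_to_float(nibble: int) -> float:
    """Decode one 4-bit E2M1 value: 1 sign bit, 2 exponent bits, 1 mantissa bit."""
    sign = -1.0 if (nibble >> 3) & 0x1 else 1.0
    exp = (nibble >> 1) & 0x3
    mantissa = nibble & 0x1
    if exp == 0:
        # Subnormal range: 0.0 or 0.5
        return sign * 0.5 * mantissa
    # Normal range: (1 + mantissa/2) * 2^(exp - 1)
    return sign * (1.0 + 0.5 * mantissa) * (2.0 ** (exp - 1))

def unpack_fp4_pair(byte: int) -> tuple:
    """Unpack two FP4 values from one byte (low nibble first; packing order is an assumption)."""
    return fp4_e2m1_to_float(byte & 0xF), fp4_e2m1_to_float(byte >> 4)

# The positive E2M1 code space covers exactly {0, 0.5, 1, 1.5, 2, 3, 4, 6}.
print([fp4_e2m1_to_float(n) for n in range(8)])
# → [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

Since only 16 code points exist, a practical converter typically builds a 16-entry lookup table once and maps packed weight bytes through it, rather than decoding bit fields per element.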
Authored a paper (7 months ago):
FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions
Organizations
Yonghua's models (1)
Yonghua/ViT_warehouse (updated Jun 4, 2022)