1️⃣ FlagGems operator replacement: DeepSeek V4 operators (MoE routing, attention, RMSNorm, and more) are reimplemented in Triton, reducing dependency on CUDA-specific libraries.
2️⃣ Flexible tensor parallelism: DeepSeek V4 uses o_groups=8, which can limit the tensor-parallel (TP) degree. We added an independent communication group for the o-groups, so the rest of the model can scale to a higher TP degree, enabling deployment on 32 GB and 64 GB cards.
3️⃣ FP4 → BF16 conversion: For hardware without native FP4 support, we provide a ready-to-use BF16 conversion and pre-converted model releases.
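To make the FP4 → BF16 step concrete, here is a minimal pure-Python sketch of what such a conversion does numerically. It is not the FlagOS converter. It assumes the common E2M1 encoding for FP4 (1 sign bit, 2 exponent bits, 1 mantissa bit), two codes packed per byte with the low nibble first, and one scale per block. The function names and the packing layout are hypothetical and chosen only for illustration.

```python
import struct
from typing import List

# E2M1 magnitudes indexed by the low three bits of a 4-bit code
# (2 exponent bits, 1 mantissa bit, exponent bias 1).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code; bit 3 is the sign bit."""
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def float_to_bf16_bits(value: float) -> int:
    """Round a float to BF16 (round-to-nearest-even) and return its 16-bit pattern.

    NaN inputs are not handled specially in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_float(bits16: int) -> float:
    """Widen a BF16 bit pattern back to a Python float for inspection."""
    return struct.unpack("<f", struct.pack("<I", bits16 << 16))[0]


def unpack_fp4_block(packed: bytes, scale: float) -> List[float]:
    """Dequantize one block of packed FP4 codes to BF16-representable floats.

    Assumption: two codes per byte, low nibble first, one scale per block.
    """
    values = []
    for byte in packed:
        for code in (byte & 0x0F, byte >> 4):
            bf16 = float_to_bf16_bits(decode_e2m1(code) * scale)
            values.append(bf16_bits_to_float(bf16))
    return values


if __name__ == "__main__":
    print(unpack_fp4_block(bytes([0x72, 0xF9]), scale=0.5))
    # prints [0.5, 3.0, -0.25, -3.0]
```

Because every E2M1 value has at most two significant bits, multiplying by a power-of-two scale gives a result that BF16 represents exactly. Under these assumptions the conversion loses no precision, although the weights take more memory.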
📦 Pre-converted models are available on Hugging Face:
V4-Pro:
FlagRelease/DeepSeek-V4-Pro-nvidia-FlagOS
FlagRelease/DeepSeek-V4-Pro-metax-FlagOS
FlagRelease/DeepSeek-V4-Pro-mthreads-FlagOS
FlagRelease/DeepSeek-V4-Pro-hygon-FlagOS
FlagRelease/DeepSeek-V4-Pro-ascend-FlagOS
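As an illustration of the independent o-group communication described in 2️⃣, the sketch below shows one plausible way to partition TP ranks. The actual grouping scheme is not specified here, so the layout is an assumption: ranks are split into contiguous groups of size o_groups, the o-group-sharded part communicates within each group, and the rest of the model uses the full TP group. The helper name build_o_group_comm_groups is hypothetical.

```python
from typing import List


def build_o_group_comm_groups(tp_size: int, o_groups: int = 8) -> List[List[int]]:
    """Return rank lists for the o-group communication groups (hypothetical layout).

    - If tp_size <= o_groups, a single group spans all TP ranks.
    - Otherwise, ranks are split into contiguous groups of o_groups ranks each,
      so the o-group part never communicates across more than o_groups ranks.
    """
    if tp_size <= 0 or o_groups <= 0:
        raise ValueError("tp_size and o_groups must be positive")
    if tp_size <= o_groups:
        if o_groups % tp_size != 0:
            raise ValueError("tp_size must divide o_groups")
        return [list(range(tp_size))]
    if tp_size % o_groups != 0:
        raise ValueError("tp_size must be a multiple of o_groups")
    return [
        list(range(start, start + o_groups))
        for start in range(0, tp_size, o_groups)
    ]


if __name__ == "__main__":
    # With TP=16 and o_groups=8 this yields two groups of eight ranks.
    for group in build_o_group_comm_groups(16):
        print(group)
```

In a real deployment, each rank list would typically be passed to something like torch.distributed.new_group(ranks=...) to create the process group. The lists here only show which ranks would communicate together under the stated assumption.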