NVIDIA Megatron Boosts LLM Training With Muon Optimizer

2026/04/23 04:41

Zach Anderson Apr 22, 2026 20:41

NVIDIA integrates Muon and advanced optimizers into Megatron to enhance large-scale LLM training with near-parity throughput to AdamW.

NVIDIA has integrated advanced optimizers, including Muon, into its Megatron Core framework for large language model (LLM) training. According to NVIDIA’s April 22, 2026 blog post, the Muon optimizer, which is based on higher-order mathematical methods, has achieved near-parity training throughput with the widely used AdamW optimizer while enhancing model performance on large-scale systems like the NVIDIA GB300 NVL72.

Muon, short for MomentUm Orthogonalized by Newton-Schulz, is a higher-order optimization algorithm: it orthogonalizes the momentum update with a Newton-Schulz iteration before applying it to the weights. It has been instrumental in training leading open-source models such as Kimi K2 and GLM-5. By leveraging advanced preconditioning techniques, the optimizer aims to sustain high FLOPs utilization, that is, the share of a GPU’s peak floating-point throughput (FLOP/s) that training actually achieves, a critical metric for maximizing computational efficiency in LLMs.
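To make the "orthogonalized by Newton-Schulz" part of the name concrete, here is a minimal, dependency-free sketch. It is not Megatron Core code. It uses the classic cubic Newton-Schulz iteration, swapped in because it converges cleanly and is easy to check by hand; published Muon implementations generally use a tuned higher-degree polynomial and only a handful of steps. The function names, the step count, and the example matrix are assumptions made for illustration.

```python
# Illustrative sketch only: classic cubic Newton-Schulz, not NVIDIA's kernel.
import math

Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain triple-loop matrix product, adequate for tiny examples."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(g: Matrix, steps: int = 20) -> Matrix:
    """Approximate the semi-orthogonal factor of g (assumes rows <= cols,
    full row rank, and a nonzero matrix).

    Iterates X <- 1.5*X - 0.5*(X X^T) X, which converges when every
    singular value of the starting X lies in (0, sqrt(3)).
    """
    # Dividing by the Frobenius norm puts all singular values in (0, 1].
    fro = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        gram = matmul(x, transpose(x))  # X X^T, a symmetric matrix
        gx = matmul(gram, x)            # (X X^T) X
        x = [[1.5 * xv - 0.5 * gv for xv, gv in zip(x_row, gx_row)]
             for x_row, gx_row in zip(x, gx)]
    return x


# Hypothetical 2x3 "momentum" matrix standing in for a layer's momentum buffer.
momentum = [[3.0, 0.0, 1.0],
            [0.0, 2.0, 0.0]]
update = newton_schulz_orthogonalize(momentum)
gram = matmul(update, transpose(update))  # close to the 2x2 identity
```

The rows of `update` end up orthonormal, so every direction in the update has the same scale regardless of how lopsided the original momentum was. The extra matrix products inside the loop are the preconditioning cost that the rest of the article is concerned with.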

Performance Metrics: Muon vs. AdamW

Table 1 from NVIDIA’s report shows that Muon delivers comparable throughput to AdamW on the GB300 NVL72 system. For instance, the Kimi K2 model achieved 1,080 TFLOP/s per GPU with Muon, slightly above AdamW’s 1,051 TFLOP/s per GPU. Similarly, the Qwen3 30B model reached 721 TFLOP/s per GPU with Muon compared to 713 TFLOP/s per GPU with AdamW.
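To put "near-parity" in numbers, the short calculation below turns the four figures quoted above into relative differences. Only the throughput values come from the article; the dictionary layout and function name are illustrative.

```python
# Per-GPU throughput figures quoted in the article (GB300 NVL72).
throughput = {
    "Kimi K2": {"muon": 1080.0, "adamw": 1051.0},
    "Qwen3 30B": {"muon": 721.0, "adamw": 713.0},
}


def relative_gain_percent(muon: float, adamw: float) -> float:
    """Percentage by which Muon throughput exceeds AdamW throughput."""
    return (muon - adamw) / adamw * 100.0


gains = {
    model: relative_gain_percent(v["muon"], v["adamw"])
    for model, v in throughput.items()
}

for model, gain in gains.items():
    print(f"{model}: {gain:+.2f}%")
# Kimi K2: +2.76%
# Qwen3 30B: +1.12%
```

In both cases the gap is under 3 percent, which is consistent with the article's description of the two optimizers as roughly equal in throughput.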

These results were obtained using NVIDIA NeMo Megatron Bridge 26.02, a PyTorch-native library designed for pretraining and fine-tuning LLMs. The benchmarks indicate that Muon can handle the computational demands of modern AI workloads without sacrificing efficiency.

Technological Innovations

Scaling Muon to thousands of GPUs presents challenges, including increased computational and memory costs during preconditioning, as well as communication bottlenecks in distributed systems. NVIDIA addresses these hurdles through several innovations:

  • Layer-Wise Distributed Optimizer: Full layers of model parameters are distributed across GPUs, enabling efficient preconditioning without excessive communication overhead.
  • Distributed Newton-Schulz: Two modes, duplicated and distributed, allow flexible handling of momentum updates. While the duplicated mode minimizes latency, the distributed mode optimizes computational efficiency.
  • Communication Hiding and SYRK Fusion: Techniques like overlapping parameter updates with computation and fusing SYRK (symmetric rank-k update) operations with communication significantly reduce latency, boosting overall throughput.
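The layer-wise idea in the first bullet can be sketched as an ownership map in which each whole weight matrix belongs to exactly one rank, so that rank can run preconditioning on the full matrix locally. The article does not describe how NVIDIA assigns layers, so the greedy largest-first policy, the cost proxy, and the layer names and shapes below are all assumptions chosen for illustration.

```python
# Hypothetical sketch of layer-wise ownership; not Megatron Core's policy.
import heapq


def preconditioning_cost(shape: tuple[int, int]) -> int:
    """Rough proxy (an assumption): cost of one X X^T style product."""
    rows, cols = shape
    return min(rows, cols) ** 2 * max(rows, cols)


def assign_layers(shapes: dict[str, tuple[int, int]],
                  num_ranks: int) -> dict[int, list[str]]:
    """Give each whole layer to the currently least-loaded rank,
    visiting layers from most to least expensive."""
    heap = [(0, rank) for rank in range(num_ranks)]  # (load so far, rank)
    heapq.heapify(heap)
    owners: dict[int, list[str]] = {rank: [] for rank in range(num_ranks)}
    ordered = sorted(shapes,
                     key=lambda name: preconditioning_cost(shapes[name]),
                     reverse=True)
    for name in ordered:
        load, rank = heapq.heappop(heap)
        owners[rank].append(name)
        heapq.heappush(heap, (load + preconditioning_cost(shapes[name]), rank))
    return owners


# Made-up layer names and shapes for a single transformer block.
layer_shapes = {
    "attn.qkv": (3072, 1024),
    "attn.proj": (1024, 1024),
    "mlp.up": (4096, 1024),
    "mlp.down": (1024, 4096),
}
owners = assign_layers(layer_shapes, num_ranks=2)
```

Because no matrix is split across ranks, the orthogonalization step needs no communication while it runs; the updated parameters still have to be shared afterwards, which is where the communication-hiding techniques in the last bullet come in.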

Implications and Future Developments

By integrating Muon into Megatron Core, NVIDIA is equipping researchers and developers with tools to improve LLM training at scale. The near-parity throughput with AdamW makes Muon an attractive choice, especially as upcoming updates promise further efficiency gains. These include enhanced load balancing, better communication strategies, and advanced kernel optimizations for SYRK operations.
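For readers unfamiliar with SYRK: it is the BLAS symmetric rank-k update, which computes a product of the form A times A-transpose and exploits the symmetry of the result by producing only one triangle. The pure-Python sketch below illustrates why that saves work. It is a teaching aid with hypothetical function names, not a GPU kernel, and it mirrors the triangle for readability, whereas a real SYRK routine leaves the other triangle untouched.

```python
def gram_full(a: list[list[float]]) -> list[list[float]]:
    """A @ A^T computed naively: one dot product per output entry."""
    n = len(a)
    return [[sum(x * y for x, y in zip(a[i], a[j])) for j in range(n)]
            for i in range(n)]


def gram_syrk_style(a: list[list[float]]) -> tuple[list[list[float]], int]:
    """A @ A^T computing only the upper triangle, then mirroring it.

    Returns the result and the number of dot products actually evaluated.
    """
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    dot_products = 0
    for i in range(n):
        for j in range(i, n):  # upper triangle, including the diagonal
            c[i][j] = sum(x * y for x, y in zip(a[i], a[j]))
            c[j][i] = c[i][j]  # the symmetric entry is copied, not recomputed
            dot_products += 1
    return c, dot_products


# Small made-up 4x3 matrix.
a = [[1.0, 2.0, 0.0],
     [0.0, 1.0, 3.0],
     [4.0, 0.0, 1.0],
     [2.0, 2.0, 2.0]]
c, dot_products = gram_syrk_style(a)
```

For an n-row input the triangle-only version evaluates n(n+1)/2 dot products instead of n squared, which approaches a 2x saving as n grows. Products of this shape appear inside each Newton-Schulz step, which is presumably why SYRK kernels are a target for further optimization.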

For those eager to explore these technologies, NVIDIA has made tools and performance recipes available through its Megatron Bridge GitHub repository. With these resources, researchers can implement and benchmark emerging optimizers like Muon in their own LLM projects.
