AI ANALYSIS REPORT
MUON IS SCALABLE FOR LLM TRAINING
The Muon optimizer, based on matrix orthogonalization, has shown promise in small-scale language models but lacked proven scalability for larger models. This report identifies two crucial techniques for scaling Muon: incorporating weight decay and carefully adjusting per-parameter update scales. These enhancements enable Muon to operate effectively in large-scale training without extensive hyper-parameter tuning. Scaling law experiments demonstrate that Muon achieves approximately 2x the computational efficiency of AdamW for compute-optimal training. Leveraging these improvements, we introduce Moonlight, a Mixture-of-Experts (MoE) model with 3B activated and 16B total parameters, trained on 5.7T tokens using Muon. Moonlight advances the current Pareto frontier, offering better performance with fewer training FLOPs than previous models. We open-source our memory-optimal and communication-efficient distributed Muon implementation, along with pretrained, instruction-tuned, and intermediate checkpoints to foster further research.
Executive Impact: Key Findings
Uncover the core metrics driving efficiency and performance in next-generation LLM training with Muon.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Muon Optimization Principles
This section details the foundational principles and key enhancements applied to the Muon optimizer to achieve scalability in large-scale LLM training.
Muon achieves approximately twice the computational efficiency of AdamW for compute-optimal training, significantly reducing the resources required to reach a given level of performance.
Muon, proposed by K. Jordan et al. (2024), updates matrix parameters with orthogonalized gradient momentum computed via Newton-Schulz iteration. The orthogonalization drives all singular values of the update matrix toward one, so learning is not dominated by a few directions. Initial experiments showed strong results on small models, but scaling Muon to larger models required further enhancements.
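As a concrete illustration, below is a minimal PyTorch sketch of the Newton-Schulz orthogonalization step described above. The quintic coefficients follow the publicly available Muon reference implementation by Jordan et al.; the normalization and transpose handling are simplified, so treat this as a sketch rather than the production kernel.

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2-D momentum/gradient matrix G."""
    # Quintic iteration coefficients from the public Muon reference implementation.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    X = X / (X.norm() + 1e-7)           # bound the spectral norm so the iteration converges
    transposed = G.size(0) > G.size(1)
    if transposed:                      # iterate on the "wide" orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X               # pushes all singular values of X toward 1
    return X.T if transposed else X
```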
Crucially, adding weight decay, W_t = W_{t-1} - η_t (O_t + λ W_{t-1}), proved vital for Muon's scalability. Without it, weight and layer-output RMS values grew too large and hurt performance. Weight decay resolved this, allowing Muon with weight decay to outperform both vanilla Muon and AdamW in the over-training regime.
Maintaining a consistent update RMS across matrices of different shapes was the other critical improvement. Muon's theoretical update RMS scales as 1/√max(A, B) for an A×B matrix. We therefore rescale each update by a factor proportional to √max(A, B), keeping the update RMS consistent and performance stable across diverse parameter shapes (e.g., dense MLP matrices and KV heads). This adjustment also lets Muon reuse AdamW's tuned learning rates and weight decay.
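Combining the two adjustments, the hedged sketch below shows one Muon step for a single matrix parameter, reusing the `newton_schulz` sketch above: momentum accumulation, orthogonalization, rescaling by a constant times √max(A, B) to match an AdamW-like update RMS (taken here as roughly 0.2), and decoupled weight decay as in the formula above. The learning rate and weight decay values are illustrative placeholders, not tuned settings.

```python
import torch

def muon_update(W, G, M, lr=2e-2, wd=0.1, momentum=0.95, ns_steps=5):
    """One Muon step for a 2-D weight W with gradient G and momentum buffer M."""
    M.mul_(momentum).add_(G)                       # gradient momentum (the only optimizer state)
    O = newton_schulz(M, steps=ns_steps).to(W.dtype)
    rows, cols = W.shape
    O = O * 0.2 * max(rows, cols) ** 0.5           # match an AdamW-like update RMS across shapes
    W.mul_(1 - lr * wd)                            # weight decay: W <- W - lr * wd * W
    W.add_(O, alpha=-lr)                           # W <- W - lr * O
    return W, M
```

Because the rescaled update has roughly the same RMS as an AdamW update, the same learning-rate and weight-decay settings can be carried over, which is the practical payoff noted above.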
Distributed Implementation
This section describes the optimized distributed implementation of Muon, highlighting its memory and communication efficiencies.
Enterprise Process Flow
Our distributed Muon implementation builds upon ZeRO-1 (Rajbhandari et al. 2020) to partition optimizer states across Data Parallel (DP) groups. Compared to a vanilla ZeRO-1 AdamW, Distributed Muon introduces two additional operations: DP Gather (to form a full gradient matrix for Newton-Schulz) and Calculate Full Update.
- Memory usage: Muon stores only one momentum buffer per parameter, half of AdamW's optimizer state, giving it the lower memory footprint.
- Communication overhead: the additional DP gather is cheap, and the Newton-Schulz iterations run in bf16, further reducing traffic; overall communication volume is roughly 1x to 1.25x that of AdamW.
- Latency: the extra communication and Newton-Schulz steps add only about 1-3% of the forward-backward time end to end, and much of this is hidden by overlapping with other operations.
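The sketch below illustrates the extra "DP Gather" and "Calculate Full Update" steps for a single matrix parameter, again reusing the `newton_schulz` sketch from earlier. It assumes `torch.distributed` is initialized, that each DP rank already holds its reduce-scattered gradient shard, and that shapes divide evenly across ranks; the real implementation's parameter-to-rank assignment and communication overlap are omitted.

```python
import torch
import torch.distributed as dist

def distributed_muon_step(param_shard, grad_shard, momentum_shard, full_shape, lr=2e-2, wd=0.1):
    """Sketch of one Distributed Muon step for a single matrix parameter (ZeRO-1 style shards)."""
    world, rank = dist.get_world_size(), dist.get_rank()
    # Only the momentum shard is stored locally -- half of AdamW's optimizer state.
    momentum_shard.mul_(0.95).add_(grad_shard)
    # DP Gather: reassemble the full momentum matrix needed by Newton-Schulz.
    gathered = [torch.empty_like(momentum_shard) for _ in range(world)]
    dist.all_gather(gathered, momentum_shard)
    full_momentum = torch.cat(gathered).reshape(full_shape)
    # Newton-Schulz runs in bf16 to keep the extra compute and communication cheap.
    O = newton_schulz(full_momentum).to(param_shard.dtype)
    O = O * 0.2 * max(full_shape) ** 0.5
    # Calculate Full Update, then keep and apply only this rank's shard of it;
    # the usual ZeRO-1 parameter all-gather happens elsewhere in the training step.
    update_shard = O.reshape(-1).chunk(world)[rank]
    param_shard.mul_(1 - lr * wd).add_(update_shard, alpha=-lr)
```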
Scaling Law Validation
Exploration of Muon's performance through scaling law experiments, demonstrating its superior efficiency compared to traditional optimizers.
Muon requires only approximately 52% of the training FLOPs to achieve performance comparable to AdamW under compute-optimal settings, highlighting significant resource savings.
We performed comprehensive scaling law experiments on Llama-architecture dense models, rigorously comparing Muon with a strong AdamW baseline. AdamW's hyper-parameters were optimized via a grid search following compute-optimal training setups. For Muon, we reused these optimal AdamW hyper-parameters after matching its update RMS.
The fitted scaling law curves (Figure 3 in the paper) confirm that Muon provides comparable performance to AdamW with substantially reduced computational requirements, making it a highly efficient optimizer for large-scale LLM training.
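For readers who want to reproduce the style of comparison, the snippet below fits power-law scaling curves in log-log space and reads off the compute ratio at a fixed target loss. The data points are synthetic and constructed to encode the reported ~52% figure, so this only demonstrates the fitting procedure, not the paper's measurements.

```python
import numpy as np

# Synthetic (compute, loss) points standing in for compute-optimal runs.
flops = np.array([1e19, 3e19, 1e20, 3e20, 1e21])
loss_adamw = 26.0 * flops ** -0.05
loss_muon = 26.0 * (flops / 0.52) ** -0.05     # built to need ~52% of AdamW's FLOPs

def fit_power_law(C, L):
    """Fit L = a * C**(-b) by linear regression in log-log space."""
    slope, intercept = np.polyfit(np.log(C), np.log(L), 1)
    return np.exp(intercept), -slope            # a, b

a_adam, b_adam = fit_power_law(flops, loss_adamw)
a_muon, b_muon = fit_power_law(flops, loss_muon)

target_loss = 2.6
C_adam = (target_loss / a_adam) ** (-1.0 / b_adam)
C_muon = (target_loss / a_muon) ** (-1.0 / b_muon)
print(f"Muon needs ~{100 * C_muon / C_adam:.0f}% of AdamW's training FLOPs at this loss")
```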
Moonlight Model Performance
Detailed performance analysis of Moonlight, a Muon-optimized MoE model, against leading public models.
Moonlight: A Muon-Optimized MoE LLM
Moonlight is a Mixture-of-Experts (MoE) model with 3B activated and 16B total parameters, trained on 5.7 trillion tokens using the Muon optimizer. Its architecture is based on DeepSeek-V3-Small, with minor modifications. Moonlight demonstrates superior performance, advancing the Pareto frontier of model performance versus training FLOPs; a sketch of the activated-versus-total parameter arithmetic follows the list below.
- 3B Activated / 16B Total Parameters (MoE)
- Trained with 5.7 Trillion Tokens
- Muon Optimizer for entire pretraining process
- Improved Pareto Frontier performance
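To make the "activated versus total" distinction concrete, here is a rough parameter-counting sketch for the MoE feed-forward blocks only. Every size below is an illustrative placeholder, not Moonlight's actual configuration, and attention and embedding parameters are ignored.

```python
def moe_ffn_param_counts(n_layers, d_model, d_ff_expert, n_experts, top_k, n_shared=0):
    """Rough activated vs. total parameter counts for gated-MLP MoE FFN blocks."""
    per_expert = 3 * d_model * d_ff_expert                    # gate, up, and down projections
    total = n_layers * (n_experts + n_shared) * per_expert
    activated = n_layers * (top_k + n_shared) * per_expert    # only the top-k routed experts fire per token
    return activated, total

# Illustrative call with made-up sizes (NOT Moonlight's real config):
activated, total = moe_ffn_param_counts(
    n_layers=27, d_model=2048, d_ff_expert=1408, n_experts=64, top_k=6, n_shared=2)
print(f"activated ~ {activated/1e9:.1f}B, total ~ {total/1e9:.1f}B (FFN blocks only)")
```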
| Benchmark (Metric) | DSV3-Small (AdamW) | Moonlight-A (AdamW) | Moonlight (Muon) |
|---|---|---|---|
| MMLU | 53.3 | 60.2 | 60.4 |
| HumanEval (Pass@1) | 26.8 | 29.3 | 37.2 |
| MBPP (Pass@1) | 36.8 | 49.2 | 52.9 |
| GSM8K | 31.4 | 43.8 | 45.0 |
| MATH | 10.7 | 16.1 | 19.8 |
At the 1.2T-token checkpoint, Moonlight (Muon-optimized) significantly outperforms both its AdamW-trained counterpart (Moonlight-A) and DeepSeek-V3-Small, particularly on math and code tasks.
| Benchmark (Metric) | Llama3.2-3B (AdamW) | Qwen2.5-3B (optimizer unknown) | DSV2-Lite (AdamW) | Moonlight (Muon) |
|---|---|---|---|---|
| MMLU | 54.7 | 65.6 | 58.3 | 70.0 |
| BBH | 46.8 | 56.3 | 44.1 | 65.2 |
| HumanEval | 28.0 | 42.1 | 29.9 | 48.1 |
| GSM8K | 34.0 | 79.1 | 41.1 | 77.4 |
| MATH | 8.5 | 42.6 | 17.1 | 45.3 |
Even when compared to larger, dense models or those trained on substantially larger datasets, Moonlight maintains competitive and often superior performance, cementing its position on the Pareto frontier.
Advanced ROI Calculator
Estimate the potential cost savings and efficiency gains for your enterprise by integrating Muon-optimized LLMs.
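As a static stand-in for the calculator, the back-of-the-envelope sketch below converts the ~52%-of-FLOPs finding into a GPU-cost comparison. The throughput, utilization (MFU), and hourly-price inputs are placeholders to be replaced with your own figures.

```python
def training_cost_comparison(adamw_flops, gpu_flops_per_s, gpu_hourly_cost,
                             mfu=0.4, muon_flops_fraction=0.52):
    """Estimate the GPU cost to reach the same target loss with AdamW vs. Muon."""
    def cost(total_flops):
        gpu_seconds = total_flops / (gpu_flops_per_s * mfu)
        return gpu_seconds / 3600.0 * gpu_hourly_cost

    adamw_cost = cost(adamw_flops)
    muon_cost = cost(adamw_flops * muon_flops_fraction)
    return adamw_cost, muon_cost, adamw_cost - muon_cost

# Example: a 1e23-FLOP run on GPUs sustaining ~1e15 bf16 FLOP/s, rented at $2/hour.
adamw, muon, saved = training_cost_comparison(1e23, 1e15, 2.0)
print(f"AdamW: ${adamw:,.0f}   Muon: ${muon:,.0f}   Estimated savings: ${saved:,.0f}")
```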
Your Enterprise AI Roadmap
A phased approach to integrating Muon-optimized LLMs into your operational framework for maximum impact.
Phase 1: Strategic Assessment & Pilot (1-2 Months)
Identify high-impact use cases, conduct a feasibility study, and implement a small-scale pilot project using Muon-optimized models on a specific task within your organization.
Phase 2: Customization & Integration (3-6 Months)
Fine-tune Muon-optimized LLMs with your proprietary data, integrate with existing enterprise systems, and develop custom applications. Implement distributed Muon for large-scale training of specialized models.
Phase 3: Scaled Deployment & Optimization (6-12 Months)
Roll out Muon-powered solutions across departments, establish monitoring and feedback loops, and continuously optimize model performance and efficiency based on real-world usage and scaling law insights.
Ready to Transform Your AI Strategy?
Leverage the power of scalable, efficient LLM training with Muon. Book a consultation to discuss how these innovations can drive your enterprise forward.