Enterprise AI Analysis: MUON IS SCALABLE FOR LLM TRAINING

AI ANALYSIS REPORT

MUON IS SCALABLE FOR LLM TRAINING

The Muon optimizer, based on matrix orthogonalization, has shown promise in small-scale language models but lacked proven scalability for larger models. This report identifies two crucial techniques for scaling Muon: incorporating weight decay and carefully adjusting the per-parameter update scale. These enhancements enable Muon to operate effectively in large-scale training without extensive hyper-parameter tuning. Scaling law experiments demonstrate that Muon achieves approximately 2x the computational efficiency of AdamW for compute-optimal training. Leveraging these improvements, we introduce Moonlight, a Mixture-of-Experts (MoE) model with 3B activated / 16B total parameters, trained on 5.7T tokens using Muon. Moonlight significantly improves the current Pareto frontier, offering superior performance at fewer training FLOPs than prior models. We open-source our memory-optimal and communication-efficient distributed Muon implementation, along with pretrained, instruction-tuned, and intermediate checkpoints to foster further research.

Executive Impact: Key Findings

Uncover the core metrics driving efficiency and performance in next-generation LLM training with Muon.

~2x Compute Efficiency vs. AdamW
~52% Training FLOPs Required vs. AdamW
16B Total Parameters (MoE, 3B activated)
5.7T Training Tokens Processed

Deep Analysis & Enterprise Applications

Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.

Muon Optimization Principles

This section details the foundational principles and key enhancements applied to the Muon optimizer to achieve scalability in large-scale LLM training.

2x Computational Efficiency vs. AdamW

Muon achieves approximately 2x the computational efficiency of AdamW for compute-optimal training, significantly reducing the resources required.

Muon, proposed by K. Jordan et al. (2024), updates matrix parameters with orthogonalized gradient momentum computed via a Newton-Schulz iteration. Orthogonalization pushes each update matrix toward a (semi-)orthogonal form with near-uniform singular values, preventing learning from being dominated by a few directions. Initial experiments showed strong results for small models, but scaling to larger models required further enhancements.
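For intuition, the sketch below shows the kind of Newton-Schulz iteration Muon applies to the momentum matrix. The quintic-iteration coefficients and step count follow the widely circulated open-source Muon implementation and are illustrative rather than quoted from this report.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a momentum matrix G with a quintic
    Newton-Schulz iteration, driving its singular values toward 1.

    Coefficients follow the public Muon implementation (Jordan et al., 2024)
    and are shown here for illustration only.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()                       # the iteration is cheap enough to run in bf16
    X = X / (X.norm() + 1e-7)              # normalize so the iteration converges
    transposed = X.size(0) > X.size(1)
    if transposed:                         # work in the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)
```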

Crucially, adding weight decay, giving the update W_t = W_{t-1} − η_t (O_t + λ W_{t-1}) (where O_t is the orthogonalized momentum update, η_t the learning rate, and λ the weight-decay coefficient), was identified as vital for Muon's scalability. Without it, weight and layer-output RMS values grew too large, harming performance. With weight decay, Muon outperforms both vanilla Muon and AdamW in the over-trained regime.

Maintaining a consistent update RMS across different matrix shapes was another critical improvement. For an A×B parameter matrix, Muon's theoretical update RMS scales as √(1/max(A, B)), so matrices with one very large dimension receive disproportionately small updates. We therefore rescale each update by a factor proportional to √max(A, B), which normalizes the update RMS and ensures stability and consistent behavior across diverse parameter shapes (e.g., dense MLP matrices versus KV heads). This adjustment also lets Muon reuse AdamW's tuned learning rate and weight decay.
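A minimal sketch of the resulting per-matrix update, combining momentum, Newton-Schulz orthogonalization, the shape-dependent rescaling, and decoupled weight decay. It reuses the `newton_schulz_orthogonalize` helper sketched above; the 0.2 prefactor (to roughly match AdamW's typical update RMS) and the momentum coefficient are illustrative assumptions, not values quoted from this report.

```python
import math
import torch

@torch.no_grad()
def muon_update_(W: torch.Tensor, grad: torch.Tensor, momentum: torch.Tensor,
                 lr: float, wd: float, mu: float = 0.95,
                 rms_match: float = 0.2) -> None:
    """One Muon step for a single A x B weight matrix W, applied in place.

    Implements W_t = W_{t-1} - lr * (O_t + wd * W_{t-1}), where O_t is the
    orthogonalized momentum rescaled so its RMS is consistent across shapes.
    The rms_match factor is an illustrative assumption.
    """
    momentum.mul_(mu).add_(grad)                    # m_t = mu * m_{t-1} + g_t
    O = newton_schulz_orthogonalize(momentum)       # orthogonalized momentum
    A, B = W.shape
    O = O * (rms_match * math.sqrt(max(A, B)))      # consistent update RMS across shapes
    W.add_(O + wd * W, alpha=-lr)                   # update with decoupled weight decay
```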

Distributed Implementation

This section describes the optimized distributed implementation of Muon, highlighting its memory and communication efficiencies.

Enterprise Process Flow

1. Reduce-scatter the gradients (G) across the DP group
2. Apply momentum to the partitioned gradients (m)
3. Gather the partitions over DP into a full matrix (G)
4. Calculate the Muon update via Newton-Schulz(G)
5. Keep the local partition (u) and apply the update rule
6. All-gather the updated parameters (P)
7. Return the update RMS for logging

Our distributed Muon implementation builds upon ZeRO-1 (Rajbhandari et al. 2020) to partition optimizer states across Data Parallel (DP) groups. Compared to a vanilla ZeRO-1 AdamW, Distributed Muon introduces two additional operations: DP Gather (to form a full gradient matrix for Newton-Schulz) and Calculate Full Update.
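A simplified sketch of this flow using torch.distributed collectives over a ZeRO-1-style data-parallel group is shown below. The even sharding along the first dimension, the `dp_group` handle, and the reuse of the `newton_schulz_orthogonalize` helper from earlier are illustrative assumptions; the open-sourced implementation additionally handles padding, overlap, and mixed-precision details omitted here.

```python
import torch
import torch.distributed as dist

@torch.no_grad()
def distributed_muon_step(param, grad, momentum_shard, lr, wd, mu, dp_group):
    """One Distributed Muon step for a single matrix parameter.

    Mirrors the flow above: reduce-scatter -> momentum on the local shard ->
    gather the full momentum matrix -> Newton-Schulz -> keep the local slice
    of the update -> apply it -> all-gather the updated parameter.
    Assumes the matrix splits evenly along dim 0 across the DP group.
    """
    world = dist.get_world_size(dp_group)
    rank = dist.get_rank(dp_group)

    # 1) Reduce-scatter gradients across DP ranks; each rank keeps one shard.
    grad_shards = list(grad.chunk(world, dim=0))
    local_grad = torch.empty_like(grad_shards[rank])
    dist.reduce_scatter(local_grad, grad_shards, group=dp_group)

    # 2) Momentum on the partitioned gradient (optimizer state stays sharded, as in ZeRO-1).
    momentum_shard.mul_(mu).add_(local_grad)

    # 3) Gather shards into the full momentum matrix needed by Newton-Schulz.
    gathered = [torch.empty_like(momentum_shard) for _ in range(world)]
    dist.all_gather(gathered, momentum_shard, group=dp_group)
    O = newton_schulz_orthogonalize(torch.cat(gathered, dim=0))

    # 4) Keep only the local partition of the update and apply it with weight decay.
    local_update = O.chunk(world, dim=0)[rank].to(param.dtype)
    local_param = param.chunk(world, dim=0)[rank]
    local_param.add_(local_update + wd * local_param, alpha=-lr)

    # 5) All-gather the updated shards so every rank holds the full parameter.
    param_shards = [torch.empty_like(local_param) for _ in range(world)]
    dist.all_gather(param_shards, local_param.contiguous(), group=dp_group)
    param.copy_(torch.cat(param_shards, dim=0))
```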

  • Memory usage: Muon keeps only a single momentum buffer, half of AdamW's optimizer-state requirement (which stores both first and second moments), giving optimal memory efficiency.
  • Communication overhead: the additional DP gather is efficient, and the Newton-Schulz iterations run in bf16, further reducing overhead; overall communication workload is comparable to AdamW (1x to 1.25x).
  • Latency: although extra communication and Newton-Schulz steps are introduced, the added end-to-end latency is negligible (1-3% of forward-backward pass time) and is further reduced by overlapping operations.
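To make the memory comparison concrete, here is a back-of-the-envelope calculation assuming fp32 optimizer states and Moonlight's 16B total parameters (illustrative only; in practice these states are sharded across DP ranks as described above):

```python
# Rough optimizer-state memory for a 16B-parameter model with fp32 states.
# Illustrative arithmetic only; actual layouts depend on the framework and
# on how states are sharded across data-parallel ranks (ZeRO-1).
params = 16e9
bytes_per_fp32 = 4

adamw_state_bytes = 2 * params * bytes_per_fp32   # first and second moments (m, v)
muon_state_bytes = 1 * params * bytes_per_fp32    # single momentum buffer

print(f"AdamW optimizer state: {adamw_state_bytes / 1e9:.0f} GB")  # ~128 GB
print(f"Muon optimizer state:  {muon_state_bytes / 1e9:.0f} GB")   # ~64 GB
```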

Scaling Law Validation

Exploration of Muon's performance through scaling law experiments, demonstrating its superior efficiency compared to traditional optimizers.

52% Training FLOPs to Match AdamW Performance

Muon requires only approximately 52% of the training FLOPs to achieve performance comparable to AdamW under compute-optimal settings, highlighting significant resource savings.

We performed comprehensive scaling law experiments on Llama-architecture dense models, rigorously comparing Muon with a strong AdamW baseline. AdamW's hyper-parameters were optimized via a grid search under compute-optimal training setups. For Muon, we reused these optimal AdamW hyper-parameters after matching Muon's update RMS to AdamW's.

The fitted scaling law curves (Figure 3 in the paper) confirm that Muon provides comparable performance to AdamW with substantially reduced computational requirements, making it a highly efficient optimizer for large-scale LLM training.
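The report does not reproduce the fitted curves themselves, but curves of this kind are typically obtained by fitting a power law of loss against training compute, L(C) = a·C^(−b), separately for each optimizer. The sketch below shows that generic procedure (using scipy); it is not the paper's actual fitting code, and the initial-guess values are arbitrary assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_scaling_law(compute_flops: np.ndarray, losses: np.ndarray):
    """Fit L(C) = a * C**(-b) to (compute, loss) pairs from a series of
    compute-optimal training runs. Generic illustration; the paper's fit
    (e.g., any irreducible-loss term) may differ.
    """
    def power_law(c, a, b):
        return a * np.power(c, -b)

    (a, b), _ = curve_fit(power_law, compute_flops, losses,
                          p0=(10.0, 0.05), maxfev=10_000)
    return a, b

# With similar exponents b for both optimizers, the compute needed to reach the
# same loss scales as C_muon / C_adamw = (a_muon / a_adamw) ** (1 / b), which is
# how a "~52% of AdamW's FLOPs" style figure can be read off the fitted curves.
```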

Moonlight Model Performance

Detailed performance analysis of Moonlight, a Muon-optimized MoE model, against leading public models.

Moonlight: A Muon-Optimized MoE LLM

Moonlight is a Mixture-of-Experts (MoE) model with 3B activated / 16B total parameters, trained on 5.7 trillion tokens with the Muon optimizer. Its architecture follows DeepSeek-V3-Small with minor modifications. Moonlight demonstrates superior performance, advancing the Pareto frontier of performance versus training FLOPs.

  • 3B Activated / 16B Total Parameters (MoE)
  • Trained with 5.7 Trillion Tokens
  • Muon Optimizer for entire pretraining process
  • Improved Pareto Frontier performance
Moonlight Performance (1.2T Tokens) vs. Baselines
| Benchmark (Metric) | DSV3-Small (AdamW) | Moonlight-A (AdamW) | Moonlight (Muon) |
| --- | --- | --- | --- |
| MMLU | 53.3 | 60.2 | 60.4 |
| HumanEval (Pass@1) | 26.8 | 29.3 | 37.2 |
| MBPP (Pass@1) | 36.8 | 49.2 | 52.9 |
| GSM8K | 31.4 | 43.8 | 45.0 |
| MATH | 10.7 | 16.1 | 19.8 |

Moonlight (Muon-optimized) significantly outperforms its AdamW-trained counterpart (Moonlight-A) and DeepSeek-V3-Small at 1.2T tokens, particularly on math- and code-related tasks.

Moonlight Performance (5.7T Tokens) vs. Larger Models
| Benchmark (Metric) | Llama3.2-3B (AdamW) | Qwen2.5-3B (optimizer unknown) | DSV2-Lite (AdamW) | Moonlight (Muon) |
| --- | --- | --- | --- | --- |
| MMLU | 54.7 | 65.6 | 58.3 | 70.0 |
| BBH | 46.8 | 56.3 | 44.1 | 65.2 |
| HumanEval | 28.0 | 42.1 | 29.9 | 48.1 |
| GSM8K | 34.0 | 79.1 | 41.1 | 77.4 |
| MATH | 8.5 | 42.6 | 17.1 | 45.3 |

Even when compared to larger, dense models or those trained on substantially larger datasets, Moonlight maintains competitive and often superior performance, cementing its position on the Pareto frontier.

Advanced ROI Calculator

Estimate the potential cost savings and efficiency gains for your enterprise by integrating Muon-optimized LLMs.


Your Enterprise AI Roadmap

A phased approach to integrating Muon-optimized LLMs into your operational framework for maximum impact.

Phase 1: Strategic Assessment & Pilot (1-2 Months)

Identify high-impact use cases, conduct a feasibility study, and implement a small-scale pilot project using Muon-optimized models on a specific task within your organization.

Phase 2: Customization & Integration (3-6 Months)

Fine-tune Muon-optimized LLMs with your proprietary data, integrate with existing enterprise systems, and develop custom applications. Implement distributed Muon for large-scale training of specialized models.

Phase 3: Scaled Deployment & Optimization (6-12 Months)

Roll out Muon-powered solutions across departments, establish monitoring and feedback loops, and continuously optimize model performance and efficiency based on real-world usage and scaling law insights.

Ready to Transform Your AI Strategy?

Leverage the power of scalable, efficient LLM training with Muon. Book a consultation to discuss how these innovations can drive your enterprise forward.
