
Enterprise AI Analysis

Fantastic Pretraining Optimizers and Where to Find Them

Kaiyue Wen, David Hall, Tengyu Ma, Percy Liang – September 8, 2025

AdamW has dominated LLM pretraining, but our rigorous study of ten optimizers reveals that reported speedups are often overstated due to poorly tuned baselines and limited evaluation setups. We find that actual gains are modest (1.1×–1.4×), diminish with model scale, and can only be measured reliably with careful, end-of-training comparisons. Matrix-based optimizers excel at smaller scales, but their advantage shrinks substantially as models grow to 1.2B parameters.

Executive Impact

Key metrics demonstrating the real-world implications of optimizer choice in large-scale AI projects.

Peak speedup over AdamW (small models): up to 1.4×
Speedup over AdamW (1.2B parameters): roughly 1.1×
Potential speedup from proper tuning: much of the claimed advantage of alternative optimizers disappears against a well-tuned AdamW baseline
Matrix-based vs. scalar-based optimizers (small models): roughly 1.3×

Deep Analysis & Enterprise Applications

The following modules present the specific findings from the research, framed for enterprise application.

Systematic Benchmarking Approach

Our study addresses two critical issues in optimizer evaluation: unequal hyperparameter tuning and limited, misleading evaluation setups. We conduct a systematic, three-phase investigation of ten deep learning optimizers across four model scales (0.1B–1.2B parameters) and varied data-to-model ratios (1–8× Chinchilla optimum). This involves rigorous coordinate-descent hyperparameter tuning and end-of-training evaluations to ensure fair comparisons and reliable insights into performance across different scaling regimes.
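To make the experimental grid concrete, the sketch below lays out model sizes against data-to-model ratios using the commonly cited Chinchilla heuristic of roughly 20 training tokens per parameter. The intermediate model sizes and the 20-tokens-per-parameter constant are illustrative assumptions, not values taken from the paper.

```python
# Illustrative experimental grid: model scales x data-to-model ratios.
# Assumes the common ~20 tokens-per-parameter Chinchilla heuristic; the
# intermediate model sizes and exact token budgets are not from the paper.

CHINCHILLA_TOKENS_PER_PARAM = 20  # heuristic, assumed for illustration

model_sizes = [0.1e9, 0.3e9, 0.6e9, 1.2e9]  # endpoints from the study; middle sizes assumed
data_ratios = [1, 2, 4, 8]                  # multiples of the Chinchilla optimum

for n_params in model_sizes:
    for ratio in data_ratios:
        tokens = ratio * CHINCHILLA_TOKENS_PER_PARAM * n_params
        print(f"{n_params / 1e9:.1f}B params at {ratio}x Chinchilla "
              f"-> {tokens / 1e9:.0f}B training tokens")
```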

Enterprise Process Flow

Phase I: Fine-grained Coordinate Descent
Phase II: Refine Scaling-Sensitive HPs
Phase III: Hyperparameter Scaling Law Extrapolation
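A minimal sketch of the Phase I coordinate-descent loop, assuming a hypothetical train_and_eval(config) function that trains with the given hyperparameters and returns final validation loss; the grids shown are illustrative placeholders, not the paper's actual search ranges.

```python
# Minimal sketch of coordinate-descent hyperparameter tuning (Phase I).
# `train_and_eval` is a hypothetical callable returning final validation
# loss for a config; the grids below are illustrative placeholders.

def coordinate_descent(train_and_eval, init_config, grids, sweeps=2):
    """Tune one hyperparameter at a time while holding the others fixed."""
    config = dict(init_config)
    for _ in range(sweeps):                  # repeat sweeps until roughly stable
        for name, values in grids.items():   # one coordinate at a time
            scores = {v: train_and_eval({**config, name: v}) for v in values}
            config[name] = min(scores, key=scores.get)  # lower loss is better
    return config

example_grids = {
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3],
    "weight_decay": [0.0, 0.05, 0.1],
    "warmup_steps": [500, 1000, 2000],
}
```

The same loop is rerun for each optimizer and each (model size, data ratio) regime, which is what makes the comparisons in the next section fair.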

True Optimizer Speedups & Scaling Behavior

Against a well-tuned AdamW baseline, the speedup of alternative optimizers is considerably lower than widely claimed. For small models (0.1B parameters), optimizers like Muon and Soap achieve up to 1.4× speedup. However, this advantage diminishes significantly with increasing model size.

1.1× Speedup for 1.2B Models (at 8× Chinchilla Ratio)

Matrix-based optimizers consistently outperform scalar-based ones for smaller models, delivering roughly 1.3× speedup, yet this advantage shrinks steadily as model scale increases. Furthermore, the optimal choice of optimizer shifts with the data-to-model ratio: Muon excels at lower Chinchilla ratios, while Kron and Soap become superior at 8× or higher.
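One way to operationalize "speedup over AdamW" is the ratio between the training tokens AdamW needs to reach a candidate optimizer's final loss and the tokens the candidate actually used. The sketch below assumes per-checkpoint (tokens, loss) logs for both runs; it illustrates the metric rather than reproducing the paper's measurement code.

```python
# Sketch: token-based speedup of a candidate optimizer over AdamW.
# Each curve is a list of (tokens_seen, validation_loss) checkpoints;
# the log format and interpolation-free lookup are assumptions.

import numpy as np

def tokens_to_reach(curve, target_loss):
    """First token count at which the logged loss falls to the target."""
    tokens, losses = map(np.asarray, zip(*curve))
    below = np.nonzero(losses <= target_loss)[0]
    return tokens[below[0]] if below.size else np.inf  # inf: never reached

def speedup_over_adamw(adamw_curve, candidate_curve):
    final_loss = candidate_curve[-1][1]        # candidate's end-of-training loss
    candidate_tokens = candidate_curve[-1][0]  # candidate's full token budget
    adamw_tokens = tokens_to_reach(adamw_curve, final_loss)
    return adamw_tokens / candidate_tokens     # >1 means the candidate trains faster
```

Measuring a speedup above 1× this way requires an AdamW reference run trained past the candidate's final loss (or an extrapolated baseline), otherwise the ratio is undefined.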

Benchmarking Pitfalls & Best Practices

Our research shows that suboptimal hyperparameter tuning is a primary cause of inflated speedup claims. Even closely related optimizers require distinct optimal settings, so transferring hyperparameters directly between them is unfair. In addition, judging optimizers by early-stage loss curves can be highly misleading: rankings can reverse once the learning rate decays, which underscores the necessity of end-of-training evaluations.

Hyperparameter Tuning
  Common pitfalls:
  • Blind transfer of hyperparameters
  • Weakly tuned baselines
  Rigorous practice:
  • Coordinate descent to find optimal settings
  • Tuning for each optimizer and regime

Evaluation Timing
  Common pitfalls:
  • Judging by early-stage loss curves
  • Intermediate checkpoint comparisons
  Rigorous practice:
  • Evaluating at the target training budget
  • End-of-training performance

Scaling Regimes
  Common pitfalls:
  • Confined to small-scale settings
  • Limited data-to-model ratio testing
  Rigorous practice:
  • Broad testing across model scales (0.1B–1.2B)
  • Varied data-to-model ratios (1–8× Chinchilla)
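The evaluation-timing row above can be made concrete: rank runs only at their final, fully decayed checkpoints rather than at an arbitrary intermediate step. A minimal sketch, assuming each run is logged as a list of (step, validation_loss) checkpoints:

```python
# Sketch: rank optimizers by end-of-training loss, not early checkpoints.
# `runs` maps optimizer name -> list of (step, validation_loss) checkpoints;
# the log format is an assumption used only for illustration.

def rank_at_end_of_training(runs):
    """Rank by the final checkpoint, i.e. after the full learning-rate decay."""
    finals = {name: ckpts[-1][1] for name, ckpts in runs.items()}
    return sorted(finals, key=finals.get)

def rank_at_step(runs, step):
    """Ranking at an intermediate step -- shown only to expose how it can mislead."""
    partial = {
        name: min(loss for s, loss in ckpts if s <= step)  # best loss seen so far
        for name, ckpts in runs.items()
    }
    return sorted(partial, key=partial.get)
```

If the two rankings disagree, trust the end-of-training one; the mismatch is exactly the learning-rate-decay reversal described above.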


Implementation Timeline

A phased approach to integrating optimized pretraining strategies into your enterprise, maximizing efficiency and impact.

Phase 1: Hyperparameter Optimization Strategy

Develop and validate a rigorous hyperparameter tuning pipeline for your specific model architecture and data regime, ensuring optimal performance from your chosen optimizer.

Phase 2: Large-Scale Optimizer Evaluation

Systematically benchmark candidate optimizers across various model scales and data-to-model ratios, focusing on end-of-training performance rather than misleading early-stage gains.

Phase 3: Integration & Continuous Monitoring

Integrate the best-performing, scale-robust optimizer into your pretraining pipeline, with continuous monitoring of efficiency and downstream impact for long-term strategic advantage.

Ready to Find Your Fantastic Optimizer?

Schedule a free consultation to discuss how our AI experts can rigorously optimize your LLM pretraining, ensuring maximum efficiency and cost savings.
