
Enterprise AI Analysis

Fantastic Pretraining Optimizers and Where to Find Them

Kaiyue Wen, David Hall, Tengyu Ma, Percy Liang – September 8, 2025

AdamW has dominated LLM pretraining, but our rigorous study of ten optimizers reveals that reported speedups are often overstated due to poorly tuned baselines and limited evaluation setups. We find that actual gains are modest (1.1×–1.4×), diminish with model scale, and can only be measured reliably with careful, end-of-training comparisons. Matrix-based optimizers excel at smaller scales, but their advantage shrinks substantially as models grow to 1.2B parameters.

Executive Impact

Key metrics demonstrating the real-world implications of optimizer choice in large-scale AI projects.

Peak speedup over AdamW (small models): up to 1.4×
Speedup over AdamW (1.2B parameters): roughly 1.1×
Potential speedup from proper tuning: much of the claimed advantage of alternative optimizers disappears against a well-tuned AdamW baseline
Matrix-based vs. scalar-based optimizers (small models): roughly 1.3×

Deep Analysis & Enterprise Applications

The following modules present the specific findings from the research, framed for enterprise application.

Systematic Benchmarking Approach

Our study addresses two critical issues in optimizer evaluation: unequal hyperparameter tuning and limited, misleading evaluation setups. We conduct a systematic, three-phase investigation of ten deep learning optimizers across four model scales (0.1B–1.2B parameters) and varied data-to-model ratios (1–8× Chinchilla optimum). This involves rigorous coordinate-descent hyperparameter tuning and end-of-training evaluations to ensure fair comparisons and reliable insights into performance across different scaling regimes.
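To make the experimental grid concrete, the sketch below lays out model sizes against data-to-model ratios using the commonly cited Chinchilla heuristic of roughly 20 training tokens per parameter. The intermediate model sizes and the 20-tokens-per-parameter constant are illustrative assumptions, not values taken from the paper.

```python
# Illustrative experimental grid: model scales x data-to-model ratios.
# Assumes the common ~20 tokens-per-parameter Chinchilla heuristic; the
# intermediate model sizes and exact token budgets are not from the paper.

CHINCHILLA_TOKENS_PER_PARAM = 20  # heuristic, assumed for illustration

model_sizes = [0.1e9, 0.3e9, 0.6e9, 1.2e9]  # endpoints from the study; middle sizes assumed
data_ratios = [1, 2, 4, 8]                  # multiples of the Chinchilla optimum

for n_params in model_sizes:
    for ratio in data_ratios:
        tokens = ratio * CHINCHILLA_TOKENS_PER_PARAM * n_params
        print(f"{n_params / 1e9:.1f}B params at {ratio}x Chinchilla "
              f"-> {tokens / 1e9:.0f}B training tokens")
```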

Enterprise Process Flow

Phase I: Fine-grained Coordinate Descent
Phase II: Refine Scaling-Sensitive HPs
Phase III: Hyperparameter Scaling Law Extrapolation
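A minimal sketch of the Phase I coordinate-descent loop, assuming a hypothetical train_and_eval(config) function that trains with the given hyperparameters and returns final validation loss; the grids shown are illustrative placeholders, not the paper's actual search ranges.

```python
# Minimal sketch of coordinate-descent hyperparameter tuning (Phase I).
# `train_and_eval` is a hypothetical callable returning final validation
# loss for a config; the grids below are illustrative placeholders.

def coordinate_descent(train_and_eval, init_config, grids, sweeps=2):
    """Tune one hyperparameter at a time while holding the others fixed."""
    config = dict(init_config)
    for _ in range(sweeps):                  # repeat sweeps until roughly stable
        for name, values in grids.items():   # one coordinate at a time
            scores = {v: train_and_eval({**config, name: v}) for v in values}
            config[name] = min(scores, key=scores.get)  # lower loss is better
    return config

example_grids = {
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3],
    "weight_decay": [0.0, 0.05, 0.1],
    "warmup_steps": [500, 1000, 2000],
}
```

The same loop is rerun for each optimizer and each (model size, data ratio) regime, which is what makes the comparisons in the next section fair.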

True Optimizer Speedups & Scaling Behavior

Against a well-tuned AdamW baseline, the speedup of alternative optimizers is considerably lower than widely claimed. For small models (0.1B parameters), optimizers like Muon and Soap achieve up to 1.4× speedup. However, this advantage diminishes significantly with increasing model size.

1.1× Speedup for 1.2B Models (at 8× Chinchilla Ratio)

Matrix-based optimizers consistently outperform scalar-based ones for smaller models, delivering roughly 1.3× speedup, yet this advantage shrinks steadily as model scale increases. Furthermore, the optimal choice of optimizer shifts with the data-to-model ratio: Muon excels at lower Chinchilla ratios, while Kron and Soap become superior at 8× or higher.
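One way to operationalize "speedup over AdamW" is the ratio between the training tokens AdamW needs to reach a candidate optimizer's final loss and the tokens the candidate actually used. The sketch below assumes per-checkpoint (tokens, loss) logs for both runs; it illustrates the metric rather than reproducing the paper's measurement code.

```python
# Sketch: token-based speedup of a candidate optimizer over AdamW.
# Each curve is a list of (tokens_seen, validation_loss) checkpoints;
# the log format and interpolation-free lookup are assumptions.

import numpy as np

def tokens_to_reach(curve, target_loss):
    """First token count at which the logged loss falls to the target."""
    tokens, losses = map(np.asarray, zip(*curve))
    below = np.nonzero(losses <= target_loss)[0]
    return tokens[below[0]] if below.size else np.inf  # inf: never reached

def speedup_over_adamw(adamw_curve, candidate_curve):
    final_loss = candidate_curve[-1][1]        # candidate's end-of-training loss
    candidate_tokens = candidate_curve[-1][0]  # candidate's full token budget
    adamw_tokens = tokens_to_reach(adamw_curve, final_loss)
    return adamw_tokens / candidate_tokens     # >1 means the candidate trains faster
```

Measuring a speedup above 1× this way requires an AdamW reference run trained past the candidate's final loss (or an extrapolated baseline), otherwise the ratio is undefined.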

Benchmarking Pitfalls & Best Practices

Our research shows that suboptimal hyperparameter tuning is a primary cause of inflated speedup claims. Even closely related optimizers require distinct optimal settings, so transferring hyperparameters directly between them is unfair. In addition, judging optimizers by early-stage loss curves can be highly misleading: rankings can reverse once the learning rate decays, which underscores the necessity of end-of-training evaluations.

Hyperparameter Tuning
  Common pitfalls:
  • Blind transfer of hyperparameters
  • Weakly tuned baselines
  Rigorous practice:
  • Coordinate descent to find optimal settings
  • Tuning for each optimizer and regime

Evaluation Timing
  Common pitfalls:
  • Judging by early-stage loss curves
  • Intermediate checkpoint comparisons
  Rigorous practice:
  • Evaluating at the target training budget
  • End-of-training performance

Scaling Regimes
  Common pitfalls:
  • Confined to small-scale settings
  • Limited data-to-model ratio testing
  Rigorous practice:
  • Broad testing across model scales (0.1B–1.2B)
  • Varied data-to-model ratios (1–8× Chinchilla)
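The evaluation-timing row above can be made concrete: rank runs only at their final, fully decayed checkpoints rather than at an arbitrary intermediate step. A minimal sketch, assuming each run is logged as a list of (step, validation_loss) checkpoints:

```python
# Sketch: rank optimizers by end-of-training loss, not early checkpoints.
# `runs` maps optimizer name -> list of (step, validation_loss) checkpoints;
# the log format is an assumption used only for illustration.

def rank_at_end_of_training(runs):
    """Rank by the final checkpoint, i.e. after the full learning-rate decay."""
    finals = {name: ckpts[-1][1] for name, ckpts in runs.items()}
    return sorted(finals, key=finals.get)

def rank_at_step(runs, step):
    """Ranking at an intermediate step -- shown only to expose how it can mislead."""
    partial = {
        name: min(loss for s, loss in ckpts if s <= step)  # best loss seen so far
        for name, ckpts in runs.items()
    }
    return sorted(partial, key=partial.get)
```

If the two rankings disagree, trust the end-of-training one; the mismatch is exactly the learning-rate-decay reversal described above.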


Implementation Timeline

A phased approach to integrating optimized pretraining strategies into your enterprise, maximizing efficiency and impact.

Phase 1: Hyperparameter Optimization Strategy

Develop and validate a rigorous hyperparameter tuning pipeline for your specific model architecture and data regime, ensuring optimal performance from your chosen optimizer.

Phase 2: Large-Scale Optimizer Evaluation

Systematically benchmark candidate optimizers across various model scales and data-to-model ratios, focusing on end-of-training performance rather than misleading early-stage gains.

Phase 3: Integration & Continuous Monitoring

Integrate the best-performing, scale-robust optimizer into your pretraining pipeline, with continuous monitoring of efficiency and downstream impact for long-term strategic advantage.

Ready to Find Your Fantastic Optimizer?

Schedule a free consultation to discuss how our AI experts can rigorously optimize your LLM pretraining, ensuring maximum efficiency and cost savings.
