Reinforcement Learning
Each Prompt Matters: Scaling RL Efficiently for MoE Models
Our new RL framework, COMPASSMAX-V3-Thinking, optimizes hundred-billion-parameter Mixture-of-Experts (MoE) models by ensuring every prompt yields a useful learning signal. We tackle zero-variance prompts, unstable importance sampling, advantage inversion, and system-level bottlenecks with unified algorithmic and infrastructure innovations for stable, efficient training.
Executive Impact: Unleashing MoE Performance
COMPASSMAX-V3-Thinking achieves significant gains in efficiency and performance across complex reasoning tasks.
Deep Analysis & Enterprise Applications
Scaling Reinforcement Learning to hundred-billion-parameter Mixture-of-Experts (MoE) models presents unique challenges. The COMPASSMAX-V3-Thinking framework addresses these by focusing on maximizing the utility of each training prompt, preventing wasted computation and ensuring stable learning dynamics. This integrated approach combines algorithmic improvements with system-level optimizations to deliver a robust and efficient RL pipeline.
COMPASSMAX-V3-Thinking Training Pipeline Stages
Our Multi-Stage Zero-Variance Elimination method significantly reduces wasted rollouts, focusing policy learning on diverse and informative queries. This leads to faster convergence and improved training stability.
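The core idea can be illustrated with a minimal sketch (function names and the `eps` threshold are our own, not the paper's): when every rollout for a prompt receives the same reward, the group-normalized advantage is zero and the prompt contributes no policy gradient, so it can be filtered out before the update.

```python
import numpy as np

def filter_zero_variance(prompts, rollout_rewards, eps=1e-8):
    """Drop prompts whose rollout rewards are all (nearly) identical.

    If all rollouts for a prompt score the same (all correct or all
    wrong), the advantage is zero everywhere and the prompt only
    wastes compute; keep only prompts with mixed outcomes.
    """
    kept = []
    for prompt, rewards in zip(prompts, rollout_rewards):
        if np.var(rewards) > eps:  # informative: rewards disagree
            kept.append(prompt)
    return kept

# Prompt "A" is saturated (all rollouts succeed); "B" is informative.
prompts = ["A", "B"]
rewards = [[1.0, 1.0, 1.0, 1.0], [1.0, 0.0, 1.0, 0.0]]
print(filter_zero_variance(prompts, rewards))  # ['B']
```

The "multi-stage" aspect described above would apply such a filter repeatedly as the policy improves and previously informative prompts saturate.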
ESPO (Entropy Importance Sampling Policy Optimization) adaptively balances token-level and sequence-level importance sampling, reweighting updates based on entropy and reward. This improves stability in long-horizon settings and mitigates importance sampling brittleness. Coupled with Router Replay, which aligns MoE expert routing between training and inference, the framework ensures consistent log-probability distributions and prevents training collapse.
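As an illustrative sketch only (the paper's exact ESPO formula is not reproduced here; the sigmoid gate, clipping bounds, and `alpha` are our assumptions), entropy-aware importance sampling might blend per-token and per-sequence ratios so that high-entropy tokens rely on the finer-grained token ratio while low-entropy tokens use the smoother sequence-level ratio:

```python
import numpy as np

def espo_weights(logp_new, logp_old, entropy, reward, alpha=1.0):
    """Hypothetical entropy-gated blend of token- and sequence-level
    importance sampling ratios, reweighted by reward.

    logp_new, logp_old, entropy: arrays of shape (batch, seq_len)
    reward: array of shape (batch,)
    """
    token_ratio = np.exp(logp_new - logp_old)  # per-token IS ratio
    # Sequence-level ratio, clipped to keep long-horizon updates stable.
    seq_ratio = np.clip(
        np.exp((logp_new - logp_old).sum(-1, keepdims=True)), 0.5, 2.0)
    # Entropy gate in (0, 1): high-entropy tokens lean on the token ratio.
    gate = 1.0 / (1.0 + np.exp(-alpha * (entropy - entropy.mean())))
    ratio = gate * token_ratio + (1.0 - gate) * seq_ratio
    return ratio * reward[:, None]  # reward-weighted update weights
```

Router Replay would operate alongside this by caching the expert-routing decisions made at rollout time and replaying them during the training forward pass, so the log-probabilities entering these ratios match between the two phases.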
| Feature | Traditional RL | COMPASSMAX-V3-Thinking |
|---|---|---|
| Prompt Efficiency | Zero-variance prompts waste rollouts | Multi-Stage Zero-Variance Elimination focuses compute on informative queries |
| Importance Sampling | Fixed token- or sequence-level ratios, brittle over long horizons | ESPO adaptively blends token- and sequence-level ratios using entropy and reward |
| Training-Inference Consistency | MoE expert routing can diverge between training and inference | Router Replay aligns routing, keeping log-probability distributions consistent |
| Reward Model | Single generic reward signal | Multi-domain verifiers combined with a Generative Reward Model for unstructured answers |
Our High-Throughput Rollout System leverages FP8-precision rollouts, length-aware scheduling, and overlapped reward computation to eliminate performance bottlenecks, making RL on hundred-billion-scale MoE models stable and efficient.
FP8 quantization reduces rollout time by 30% at a 32k-token generation length, cutting overall training time by nearly 20%. Multi-detokenization parallelism and overlapped reward computation further improve throughput by keeping CPU and GPU resources busy with minimal idle time.
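Length-aware scheduling, one of the throughput techniques above, can be sketched as follows (a minimal illustration with invented names; real systems would predict lengths from a model rather than receive them): grouping rollout requests by expected output length lets each batch finish together instead of short generations idling behind one long straggler.

```python
def length_aware_batches(requests, batch_size):
    """Group rollout requests by predicted output length.

    requests: list of (prompt_id, predicted_len) pairs.
    Returns batches whose members have similar lengths, so a batch of
    short generations is not held hostage by one 32k-token straggler.
    """
    ordered = sorted(requests, key=lambda r: r[1])  # shortest first
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]

reqs = [("a", 32000), ("b", 512), ("c", 30000), ("d", 600)]
for batch in length_aware_batches(reqs, 2):
    print([pid for pid, _ in batch])
# short requests ('b', 'd') batch together; long ones ('c', 'a') together
```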
Shopee's E-commerce Reward System
COMPASSMAX-V3-Thinking incorporates a specialized multi-domain reward system tailored for Shopee's e-commerce operations. This system combines keyword-based verifiers, instruction-following verifiers with JSON schema checks, and a Generative Reward Model for unstructured text answers. It provides flexible and robust reward signals across product guidance, after-sales issues, search intent understanding, and product title optimization, leading to a 94.58% peak performance on Product Recommendation tasks.
The system ensures high-quality, domain-aligned rewards, bridging verifiable signals with GenRM-based judgments for complex e-commerce workflows. This translates directly into higher-quality e-commerce actions and improved decision quality in noisy, real-world inputs.
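A multi-domain reward router of this kind might look like the following sketch (task names, the `spec` format, and the `genrm` callable are illustrative assumptions, not Shopee's actual system): verifiable tasks use rule-based checks, and unstructured answers fall back to a generative reward model.

```python
import json

def ecommerce_reward(task_type, answer, spec):
    """Route a model answer to the appropriate verifier.

    Keyword tasks score the fraction of required keywords present;
    instruction-following tasks validate JSON structure; everything
    else is judged by a generative reward model (GenRM).
    """
    if task_type == "keyword":
        hits = sum(kw.lower() in answer.lower() for kw in spec["keywords"])
        return hits / len(spec["keywords"])
    if task_type == "json_schema":
        try:
            obj = json.loads(answer)
        except json.JSONDecodeError:
            return 0.0  # unparseable output gets no reward
        return 1.0 if all(k in obj for k in spec["required"]) else 0.0
    return spec["genrm"](answer)  # GenRM judgment for free-form text

# Usage: an instruction-following check with a required-keys schema.
print(ecommerce_reward("json_schema", '{"title": "x", "price": 9}',
                       {"required": ["title", "price"]}))  # 1.0
```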
Your Implementation Roadmap
A structured approach ensures a smooth transition to enhanced AI capabilities.
Discovery & Strategy
Understand current pain points, define objectives, and tailor the COMPASSMAX-V3-Thinking framework to your specific enterprise needs. Initial data assessment and model alignment.
Duration: 2-4 Weeks
Model Adaptation & Fine-Tuning
Implement Multi-Stage Zero-Variance Elimination and ESPO. Fine-tune the MoE model with your proprietary datasets, integrating Router Replay and GenRM. Establish a robust reward system.
Duration: 6-10 Weeks
System Integration & Optimization
Deploy the High-Throughput Rollout System. Integrate with existing infrastructure, ensuring FP8-precision rollouts and length-aware scheduling for maximum efficiency. Rigorous testing and validation.
Duration: 4-8 Weeks
Monitoring & Continuous Improvement
Set up continuous monitoring for model performance and system stability. Implement feedback loops for ongoing optimization, leveraging the adaptive capabilities of the framework.
Duration: Ongoing
Ready to Elevate Your Enterprise AI?
Discover how COMPASSMAX-V3-Thinking can deliver unparalleled performance and efficiency.