Self-Supervised Reinforcement Learning for LLMs
M-GRPO: Stabilizing Self-Supervised Reinforcement Learning for Large Language Models with Momentum-Anchored Policy Optimization
A groundbreaking approach to eliminate 'policy collapse' and achieve state-of-the-art performance in Large Language Model reasoning tasks without costly human annotation. Discover how M-GRPO ensures robust and reliable AI training.
Executive Impact: Stable & Superior LLM Performance
Our M-GRPO framework fundamentally transforms self-supervised reinforcement learning for LLMs. By addressing critical instabilities such as 'policy collapse' and 'entropy collapse', it delivers reliable long-horizon training stability and state-of-the-art reasoning capabilities, driving tangible improvements in critical enterprise AI applications.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
The Challenge: Unstable Self-Supervised RL for LLMs
Traditional Reinforcement Learning with Verifiable Rewards (RLVR) significantly improves LLM reasoning but relies on costly human-annotated data. Self-supervised RL aims to overcome this by generating intrinsic rewards from the model itself. However, existing methods suffer from a critical "policy collapse" under long-horizon training, leading to precipitous performance degradation (Figures 1 and 5).
This instability is often accompanied by a rapid collapse in policy entropy, resulting in prematurely confident and suboptimal policies. Simply scaling up the number of rollouts only delays, but does not prevent, this fundamental issue.
M-GRPO: Momentum-Anchored Policy Optimization
To address the "policy collapse," we introduce M-GRPO (Momentum-Anchored Group Relative Policy Optimization). Inspired by momentum contrast in self-supervised visual learning, M-GRPO leverages two models:
- Current Policy Model (π_θq): The model being actively trained.
- Momentum Model (π_θk): A slowly evolving exponential moving average (EMA) of the current policy model's parameters (Equation 1). This model provides a stable, consistent target for pseudo-label generation via majority voting.
By combining rollouts from both models (Figure 3), M-GRPO mitigates the noise and instability inherent in self-rewarding systems, providing a more reliable training signal and preventing catastrophic performance degradation.
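To make the mechanism concrete, here is a minimal sketch of the two pieces described above, assuming PyTorch-style models. The function names, the momentum coefficient m = 0.99, and the 0/1 reward for matching the majority-voted answer are illustrative assumptions rather than details taken from the paper; only the EMA form of the update (Equation 1) and the combined-rollout majority vote follow the description above.

```python
from collections import Counter

import torch


@torch.no_grad()
def update_momentum_model(policy_model, momentum_model, m=0.99):
    """EMA update of the momentum model, in the spirit of Equation 1:
    theta_k <- m * theta_k + (1 - m) * theta_q.
    The coefficient m = 0.99 is illustrative, not the paper's value.
    (Initialize once with: momentum_model = copy.deepcopy(policy_model).)
    """
    for p_q, p_k in zip(policy_model.parameters(), momentum_model.parameters()):
        p_k.mul_(m).add_(p_q.detach(), alpha=1.0 - m)


def majority_vote_rewards(policy_answers, momentum_answers):
    """Pool final answers from both models' rollouts, take the majority-voted
    answer as the pseudo-label, and score the current policy's rollouts
    against it (the 1/0 reward here is a simplifying assumption)."""
    votes = Counter(policy_answers + momentum_answers)
    pseudo_label, _ = votes.most_common(1)[0]
    rewards = [1.0 if ans == pseudo_label else 0.0 for ans in policy_answers]
    return pseudo_label, rewards
```

In a GRPO-style loop, these rewards would then be normalized within each group of rollouts to form advantages; that step is unchanged from standard GRPO and is omitted here.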
IQR-Based Adaptive Entropy Filtering
Beyond "policy collapse," a rapid decline in policy entropy leads to overly confident and suboptimal policies (Figure 2, Left). To counteract this, we propose an adaptive filtering method based on the Interquartile Range (IQR).
For each batch, we calculate the trajectory-level entropy of all generated rollouts. Trajectories with excessively low entropy, identified as outliers falling below Q1 - k * IQR with k = 0.75 (Equation 2), are dynamically pruned. This ensures that only high-quality, diverse trajectories contribute to the learning process.
This dynamic approach prevents premature convergence and maintains essential policy diversity, allowing the model to explore effectively and learn more robust policies.
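As a rough illustration, the sketch below implements the pruning rule with NumPy, assuming each trajectory's entropy has already been reduced to a single scalar (for instance the mean per-token entropy, which is an assumption here); the function name iqr_entropy_filter and the mean_token_entropy helper in the usage comment are hypothetical.

```python
import numpy as np


def iqr_entropy_filter(trajectory_entropies, k=0.75):
    """Return a boolean keep-mask over a batch of rollouts, pruning
    trajectories whose entropy falls below Q1 - k * IQR (Equation 2)."""
    entropies = np.asarray(trajectory_entropies, dtype=np.float64)
    q1, q3 = np.percentile(entropies, [25, 75])
    lower_bound = q1 - k * (q3 - q1)
    return entropies >= lower_bound


# Illustrative usage (mean_token_entropy is a hypothetical helper):
# entropies = [mean_token_entropy(traj) for traj in rollouts]
# kept = [t for t, keep in zip(rollouts, iqr_entropy_filter(entropies)) if keep]
```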
Superior Stability and State-of-the-Art Performance
Our extensive experiments demonstrate that M-GRPO effectively stabilizes training, sustaining reward improvement and high validation accuracy throughout (Figure 4). This obviates the manual checkpoint selection required by prior methods (SRT-Best).
Quantitatively, M-GRPO-Final significantly outperforms baseline methods, often surpassing even their manually selected best checkpoints (SRT-Best). For instance, M-GRPO achieved a +7.43% absolute accuracy gain on LiveCode and +5.05% on GPQA Diamond compared to SRT (Table 2).
These results confirm that M-GRPO not only delivers superior training stability but also enables LLMs to converge to more robust, capable, and diverse reasoning states, setting a new benchmark for self-supervised RLVR.
M-GRPO Policy Optimization Flow
Benchmark accuracy across reasoning tasks (Table 2):
| Benchmark | Original (Qwen3-4B-base) | SRT-Best | SRT-Final | M-GRPO+IQR Final |
|---|---|---|---|---|
| MATH500 | 61.50% | 79.20% | 47.50% | 79.75% |
| AIME24 | 0.83% | 12.50% | 7.50% | 14.58% |
| AIME25 | 5.00% | 11.67% | 8.75% | 14.17% |
| GPQA Diamond | 34.41% | 38.26% | 28.54% | 39.65% |
| GPQA | 29.91% | 35.04% | 25.89% | 35.49% |
| LiveCode | 9.61% | 19.69% | 16.12% | 27.12% |
Calculate Your Potential AI ROI
Estimate the significant operational efficiencies and cost savings your enterprise could achieve by implementing advanced LLM solutions.
Your M-GRPO Implementation Roadmap
Our phased approach ensures a seamless transition and optimal integration of M-GRPO into your existing AI infrastructure, maximizing stability and performance from day one.
Phase 1: Policy Instability Diagnosis
Comprehensive analysis and identification of 'policy collapse' and 'entropy collapse' challenges within your current self-supervised RL setups.
Phase 2: M-GRPO Framework Development
Design and tailored implementation of the momentum-anchored mechanism to provide stable training targets for your specific LLM applications.
Phase 3: IQR Filtering Integration
Development and integration of the adaptive interquartile range-based filtering to preserve policy diversity and prevent premature convergence.
Phase 4: Validation & Benchmarking
Rigorous testing and validation across your enterprise reasoning benchmarks to demonstrate superior stability and state-of-the-art performance.
Phase 5: Enterprise Integration Strategy
Consultative phase to tailor M-GRPO for specific enterprise LLM deployment scenarios, ensuring robust, high-performing AI solutions.
Ready to Stabilize Your LLMs?
Stop grappling with unstable self-supervised learning. Let's discuss how M-GRPO can bring unparalleled stability and state-of-the-art performance to your enterprise AI, without the need for expensive human annotations.