Self-Supervised Reinforcement Learning for LLMs
M-GRPO: Stabilizing Self-Supervised Reinforcement Learning for Large Language Models with Momentum-Anchored Policy Optimization
A groundbreaking approach to eliminate 'policy collapse' and achieve state-of-the-art performance in Large Language Model reasoning tasks without costly human annotation. Discover how M-GRPO ensures robust and reliable AI training.
Executive Impact: Stable & Superior LLM Performance
Our M-GRPO framework fundamentally transforms self-supervised reinforcement learning for LLMs. By addressing critical instabilities such as 'policy collapse' and 'entropy collapse', it delivers reliable long-horizon training stability and state-of-the-art reasoning capabilities, driving tangible improvements in critical enterprise AI applications.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
The Challenge: Unstable Self-Supervised RL for LLMs
Traditional Reinforcement Learning with Verifiable Rewards (RLVR) significantly improves LLM reasoning but relies on costly human-annotated data. Self-supervised RL aims to overcome this by generating intrinsic rewards from the model itself. However, existing methods suffer from a critical "policy collapse" under long-horizon training, leading to precipitous performance degradation (Figures 1 and 5).
This instability is often accompanied by a rapid collapse in policy entropy, resulting in prematurely confident and suboptimal policies. Simply scaling up the number of rollouts only delays, but does not prevent, this fundamental issue.
M-GRPO: Momentum-Anchored Policy Optimization
To address the "policy collapse," we introduce M-GRPO (Momentum-Anchored Group Relative Policy Optimization). Inspired by momentum contrast in self-supervised visual learning, M-GRPO leverages two models:
- Current Policy Model (π_θq): The model being actively trained.
- Momentum Model (π_θk): A slowly evolving exponential moving average (EMA) of the current policy model's parameters (Equation 1). This model provides a stable, consistent target for pseudo-label generation via majority voting.
By combining rollouts from both models (Figure 3), M-GRPO mitigates the noise and instability inherent in self-rewarding systems, providing a more reliable training signal and preventing catastrophic performance degradation.
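To make the mechanism concrete, here is a minimal sketch of the two pieces described above, assuming PyTorch-style models. The function names, the momentum coefficient m = 0.99, and the 0/1 reward for matching the majority-voted answer are illustrative assumptions rather than details taken from the paper; only the EMA form of the update (Equation 1) and the combined-rollout majority vote follow the description above.

```python
from collections import Counter

import torch


@torch.no_grad()
def update_momentum_model(policy_model, momentum_model, m=0.99):
    """EMA update of the momentum model, in the spirit of Equation 1:
    theta_k <- m * theta_k + (1 - m) * theta_q.
    The coefficient m = 0.99 is illustrative, not the paper's value.
    (Initialize once with: momentum_model = copy.deepcopy(policy_model).)
    """
    for p_q, p_k in zip(policy_model.parameters(), momentum_model.parameters()):
        p_k.mul_(m).add_(p_q.detach(), alpha=1.0 - m)


def majority_vote_rewards(policy_answers, momentum_answers):
    """Pool final answers from both models' rollouts, take the majority-voted
    answer as the pseudo-label, and score the current policy's rollouts
    against it (the 1/0 reward here is a simplifying assumption)."""
    votes = Counter(policy_answers + momentum_answers)
    pseudo_label, _ = votes.most_common(1)[0]
    rewards = [1.0 if ans == pseudo_label else 0.0 for ans in policy_answers]
    return pseudo_label, rewards
```

In a GRPO-style loop, these rewards would then be normalized within each group of rollouts to form advantages; that step is unchanged from standard GRPO and is omitted here.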
IQR-Based Adaptive Entropy Filtering
Beyond "policy collapse," a rapid decline in policy entropy leads to overly confident and suboptimal policies (Figure 2, Left). To counteract this, we propose an adaptive filtering method based on the Interquartile Range (IQR).
For each batch, we calculate the trajectory-level entropy of all generated rollouts. Trajectories with excessively low entropy, identified as outliers falling below Q1 - k * IQR with k = 0.75 (Equation 2), are dynamically pruned. This ensures that only high-quality, diverse trajectories contribute to the learning process.
This dynamic approach prevents premature convergence and maintains essential policy diversity, allowing the model to explore effectively and learn more robust policies.
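As a rough illustration, the sketch below implements the pruning rule with NumPy, assuming each trajectory's entropy has already been reduced to a single scalar (for instance the mean per-token entropy, which is an assumption here); the function name iqr_entropy_filter and the mean_token_entropy helper in the usage comment are hypothetical.

```python
import numpy as np


def iqr_entropy_filter(trajectory_entropies, k=0.75):
    """Return a boolean keep-mask over a batch of rollouts, pruning
    trajectories whose entropy falls below Q1 - k * IQR (Equation 2)."""
    entropies = np.asarray(trajectory_entropies, dtype=np.float64)
    q1, q3 = np.percentile(entropies, [25, 75])
    lower_bound = q1 - k * (q3 - q1)
    return entropies >= lower_bound


# Illustrative usage (mean_token_entropy is a hypothetical helper):
# entropies = [mean_token_entropy(traj) for traj in rollouts]
# kept = [t for t, keep in zip(rollouts, iqr_entropy_filter(entropies)) if keep]
```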
Superior Stability and State-of-the-Art Performance
Our extensive experiments demonstrate that M-GRPO effectively stabilizes training, sustaining reward improvement and high validation accuracy throughout (Figure 4). This obviates the manual checkpoint selection required by prior methods (SRT-Best).
Quantitatively, M-GRPO-Final significantly outperforms baseline methods, often surpassing even their manually selected best checkpoints (SRT-Best). For instance, M-GRPO achieved a +7.43% absolute accuracy gain on LiveCode and +5.05% on GPQA Diamond compared to SRT (Table 2).
These results confirm that M-GRPO not only delivers superior training stability but also enables LLMs to converge to more robust, capable, and diverse reasoning states, setting a new benchmark for self-supervised RLVR.
M-GRPO Policy Optimization Flow
Benchmark accuracy across reasoning tasks (Table 2):
| Benchmark | Original (Qwen3-4B-base) | SRT-Best | SRT-Final | M-GRPO+IQR Final |
|---|---|---|---|---|
| MATH500 | 61.50% | 79.20% | 47.50% | 79.75% |
| AIME24 | 0.83% | 12.50% | 7.50% | 14.58% |
| AIME25 | 5.00% | 11.67% | 8.75% | 14.17% |
| GPQA Diamond | 34.41% | 38.26% | 28.54% | 39.65% |
| GPQA | 29.91% | 35.04% | 25.89% | 35.49% |
| LiveCode | 9.61% | 19.69% | 16.12% | 27.12% |
Calculate Your Potential AI ROI
Estimate the significant operational efficiencies and cost savings your enterprise could achieve by implementing advanced LLM solutions.
Your M-GRPO Implementation Roadmap
Our phased approach ensures a seamless transition and optimal integration of M-GRPO into your existing AI infrastructure, maximizing stability and performance from day one.
Phase 1: Policy Instability Diagnosis
Comprehensive analysis and identification of 'policy collapse' and 'entropy collapse' challenges within your current self-supervised RL setups.
Phase 2: M-GRPO Framework Development
Design and tailored implementation of the momentum-anchored mechanism to provide stable training targets for your specific LLM applications.
Phase 3: IQR Filtering Integration
Development and integration of the adaptive interquartile range-based filtering to preserve policy diversity and prevent premature convergence.
Phase 4: Validation & Benchmarking
Rigorous testing and validation across your enterprise reasoning benchmarks to demonstrate superior stability and state-of-the-art performance.
Phase 5: Enterprise Integration Strategy
Consultative phase to tailor M-GRPO for specific enterprise LLM deployment scenarios, ensuring robust, high-performing AI solutions.
Ready to Stabilize Your LLMs?
Stop grappling with unstable self-supervised learning. Let's discuss how M-GRPO can bring unparalleled stability and state-of-the-art performance to your enterprise AI, without the need for expensive human annotations.