AI RESEARCH ANALYSIS
RAPO: Risk-Aware Preference Optimization for Generalizable Safe Reasoning
Large Reasoning Models (LRMs) remain vulnerable to sophisticated jailbreak attacks despite existing safety measures. This work identifies the core issue as the failure of safe reasoning to generalize to increasingly complex prompts. We propose RAPO, a Risk-Aware Preference Optimization framework that adaptively scales an LRM's safe reasoning depth and length based on perceived risk. Extensive experiments demonstrate RAPO's superior robustness against diverse jailbreak attacks while preserving general utility, setting a new benchmark for LRM safety.
Executive Impact & Key Metrics
RAPO fundamentally enhances the security posture of Large Reasoning Models, drastically reducing their vulnerability to adversarial attacks without compromising performance on core reasoning tasks. This provides a robust foundation for deploying safe and reliable AI systems in critical enterprise environments.
Deep Analysis & Enterprise Applications
The Generalization Challenge in Safe Reasoning
Current safe reasoning mechanisms in Large Reasoning Models (LRMs) often fall short when confronted with complex, adaptive jailbreak attacks. Our theoretical analysis (Theorem 3.1) posits that a successful defense requires the safe reasoning length to scale proportionally with the attack's complexity k.
Empirical observations further validate this, showing that while basic harmful requests are refused, complex attacks lead to a significant decrease in the proportion of safe reasoning tokens. This highlights a critical limitation: existing safe reasoning does not adaptively scale its depth and length to match the evolving complexity of adversarial prompts.
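Read schematically, the requirement is a lower bound on safe-reasoning length. The notation below is ours, assumed for illustration rather than quoted from Theorem 3.1:

```latex
% Schematic restatement (assumed notation, not the paper's exact theorem):
% to defend against an attack of complexity k, the safe-reasoning length
% L_safe must grow at least proportionally with k.
L_{\mathrm{safe}}(k) \;\ge\; c \cdot k \qquad \text{for some constant } c > 0.
```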
Safe Reasoning Proportions & ASR Across Attack Complexity
| Dataset | Thinking Tokens | Safe Reasoning % (Jailbroken) | Safe Reasoning % (Refusal) | ASR |
|---|---|---|---|---|
| SorryBench-Base | 596 | 17.6% | 36.8% | 34% |
| SorryBench-Attack | 743 | 15.7% | 27.9% | 50% |
| StrataSword-L1 | 371 | 27.2% | 55.1% | 9% |
| StrataSword-L2 | 712 | 18.7% | 41.9% | 36% |
| StrataSword-L3 | 1096 | 6.4% | 24.8% | 58% |
This table shows how the proportion of safe reasoning tokens in successful refusals (the Safe Reasoning % (Refusal) column) decreases as attack complexity increases from L1 to L3, driving Attack Success Rates (ASR) up. This underscores the need for adaptive safe reasoning.
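For concreteness, here is a minimal sketch (our own helper, not the paper's measurement code) of how such a proportion can be estimated, assuming an external judge callable that flags safety-relevant sentences:

```python
from typing import Callable

def safe_reasoning_proportion(thinking_trace: str,
                              is_safety_relevant: Callable[[str], bool]) -> float:
    """Fraction of thinking tokens that fall inside safety-relevant sentences.

    `is_safety_relevant` is an assumed external judge (e.g. an LLM-as-a-Judge
    call) that labels each sentence of the trace; tokenization here is a crude
    whitespace split, purely for illustration.
    """
    sentences = [s.strip() for s in thinking_trace.split(".") if s.strip()]
    total_tokens = 0
    safe_tokens = 0
    for sentence in sentences:
        n_tokens = len(sentence.split())
        total_tokens += n_tokens
        if is_safety_relevant(sentence):
            safe_tokens += n_tokens
    return safe_tokens / total_tokens if total_tokens else 0.0
```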
RAPO Framework: Adaptive Safety Alignment
The Risk-Aware Preference Optimization (RAPO) framework tackles the generalization problem by explicitly training LRMs to adapt their safe reasoning process. It combines an SFT warm-up stage for format alignment with a robust RL stage that leverages a novel risk-aware reward system.
This dual-stage approach ensures that the model not only initiates safe reasoning consistently but also tailors the granularity of its risk analysis to the specific complexity of the input prompt, thereby generalizing defenses across a wider range of adversarial scenarios.
Enterprise Process Flow: RAPO's Adaptive Safety Training
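Stripped to its essentials, the flow is an SFT warm-up followed by reward-driven RL. The sketch below captures that structure; every name and signature is an assumption made for illustration, not the paper's code.

```python
from typing import Callable, Iterable, List, Tuple

def train_rapo(
    sft_warmup: Callable[[], object],                    # stage 1: format alignment via SFT
    rollout: Callable[[object, str], str],               # sample a response for a prompt
    risk_aware_reward: Callable[[str, str], float],      # judge-graded reasoning adequacy
    general_reward: Callable[[str, str], float],         # +/-1 refusal/utility regularizer
    policy_update: Callable[[object, List[Tuple[str, str, float]]], object],
    rl_prompt_batches: Iterable[List[str]],
) -> object:
    """Two-stage recipe: SFT warm-up, then RL with the composite reward."""
    model = sft_warmup()
    for batch in rl_prompt_batches:
        scored: List[Tuple[str, str, float]] = []
        for prompt in batch:
            response = rollout(model, prompt)
            reward = risk_aware_reward(prompt, response) + general_reward(prompt, response)
            scored.append((prompt, response, reward))
        model = policy_update(model, scored)             # e.g. a GRPO-style policy step
    return model
```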
Refined Reward System for Safety & Utility
RAPO employs a sophisticated composite reward system during its RL stage, designed to balance adaptive safe reasoning with general model utility. The system includes two key components:
- Risk-Aware Reward: An LLM-as-a-Judge assesses the prompt's risk complexity (L1: explicit, L2: indirect, L3+: complex) and scores the generated safe reasoning trace against it. This ensures the reasoning is safety-relevant, detailed enough for the assessed risk, and free of excessive verbosity (Table 3 of the paper lists the grading criteria).
- General Reward: A binary (±1) reward that regularizes the model, ensuring it firmly refuses harmful requests while accurately addressing benign prompts, thereby preserving overall reasoning capabilities and preventing over-refusal.
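A minimal sketch of how these two signals could be combined is below. The thresholds and signatures are our assumptions for illustration; the paper's exact grading criteria live in its Table 3, and the risk level and reasoning depth would come from the judge.

```python
def risk_aware_reward(prompt_risk: int, reasoning_depth: int) -> float:
    """prompt_risk: judge-assessed risk level (1 = explicit, 2 = indirect, 3 = complex).
    reasoning_depth: judge-assessed granularity of the safe-reasoning trace, same scale.
    Rewards reasoning matched to the risk, penalizes traces that are too shallow,
    and discounts traces that are excessively verbose for the risk at hand."""
    if reasoning_depth < prompt_risk:
        return -1.0
    if reasoning_depth > prompt_risk + 1:
        return 0.0
    return 1.0

def general_reward(is_harmful_prompt: bool, model_refused: bool) -> float:
    """Binary (+/-1) regularizer: refuse harmful prompts, comply with benign ones."""
    correct = model_refused if is_harmful_prompt else not model_refused
    return 1.0 if correct else -1.0

def composite_reward(prompt_risk: int, reasoning_depth: int,
                     is_harmful_prompt: bool, model_refused: bool) -> float:
    return (risk_aware_reward(prompt_risk, reasoning_depth)
            + general_reward(is_harmful_prompt, model_refused))
```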
Benchmark Performance & Robustness
RAPO consistently achieves state-of-the-art safety performance across various benchmarks while maintaining high general utility. It excels in basic harmlessness and, critically, demonstrates remarkable robustness against complex and adaptive jailbreak attacks.
The framework significantly reduces Attack Success Rates (ASR) on challenging datasets like WildJailbreak, proving its effectiveness in generalizing safe reasoning beyond simple harmful prompts. Furthermore, RAPO mitigates the common issue of over-refusal seen in other safety alignment methods, ensuring balanced behavior.
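For reference, the headline metrics in the tables below reduce to simple rates: ASR is the fraction of attack prompts judged to have elicited harmful content, and the XSTest column can be read as the compliance rate on benign prompts. The helper below is our own illustration, not the paper's evaluation harness.

```python
from typing import List

def attack_success_rate(attack_judgments: List[bool]) -> float:
    """attack_judgments[i] is True if the model produced harmful content
    for attack prompt i (as decided by a safety judge). Lower is better."""
    return sum(attack_judgments) / len(attack_judgments) if attack_judgments else 0.0

def benign_compliance_rate(refused_benign: List[bool]) -> float:
    """refused_benign[i] is True if the model over-refused benign prompt i.
    Higher compliance is better."""
    if not refused_benign:
        return 1.0
    return 1.0 - sum(refused_benign) / len(refused_benign)
```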
Overall Performance Comparison
| Model | Method | JBB ASR↓ | HB ASR↓ | SB ASR↓ | WJ ASR↓ | XSTest↑ | MMLU-Pro↑ |
|---|---|---|---|---|---|---|---|
| Qwen-8B | Base | 5.0% | 14.5% | 16.6% | 62.3% | 99.2% | 63.0% |
| | Intent-Aware | 3.0% | 5.0% | 8.5% | 56.1% | 92.8% | 62.8% |
| | STAR | 0.0% | 1.0% | 6.9% | 21.3% | 90.0% | 60.1% |
| | TARS | 0.0% | 5.5% | 10.6% | 35.3% | 92.0% | 60.9% |
| | IPO | 0.0% | 0.0% | 4.6% | 16.9% | 89.6% | 61.0% |
| | GRPO | 2.0% | 9.5% | 11.6% | 38.5% | 95.2% | 61.6% |
| | RAPO | 0.0% | 0.0% | 2.7% | 7.4% | 90.4% | 60.3% |
| Qwen-1.7B | Base | 17.0% | 30.0% | 49.5% | 70.9% | 97.6% | 41.9% |
| | Intent-Aware | 0.0% | 0.0% | 5.2% | 22.3% | 96.4% | 41.5% |
| | STAR | 0.0% | 0.0% | 7.4% | 21.8% | 74.0% | 40.7% |
| | TARS | 6.0% | 15.5% | 16.0% | 26.7% | 93.8% | 41.6% |
| | IPO | 0.0% | 0.0% | 5.8% | 20.7% | 93.6% | 40.9% |
| | GRPO | 13.0% | 20.0% | 24.7% | 31.4% | 95.0% | 41.6% |
| | RAPO | 0.0% | 0.0% | 3.7% | 15.8% | 91.2% | 41.1% |
| DeepSeek | Base | 66.0% | 69.0% | 31.2% | 68.7% | 96.0% | 31.2% |
| | Intent-Aware | 14.0% | 33.5% | 14.5% | 40.2% | 92.0% | 29.8% |
| | STAR | 19.0% | 32.0% | 19.4% | 49.4% | 77.8% | 30.7% |
| | TARS | 26.0% | 27.0% | 19.2% | 44.6% | 92.4% | 30.1% |
| | IPO | 16.0% | 18.0% | 15.6% | 18.2% | 82.4% | 30.6% |
| | GRPO | 31.4% | 29.5% | 20.3% | 42.5% | 95.2% | 30.6% |
| | RAPO | 9.0% | 5.0% | 5.7% | 5.6% | 93.6% | 30.3% |
Adaptive Attack Defense (ASR)
| Method | Qwen PAIR | Qwen TAP | DeepSeek PAIR | DeepSeek TAP |
|---|---|---|---|---|
| STAR | 23% | 22% | 55% | 26% |
| Intent | 14% | 21% | 39% | 20% |
| IPO | 40% | 25% | 42% | 29% |
| RAPO | 17% | 19% | 15% | 14% |
RAPO significantly outperforms baselines against PAIR and TAP, two advanced adaptive jailbreak attacks, showcasing its superior robustness.
Validating RAPO's Core Mechanisms
Ablation studies confirm the critical role of each component in RAPO, demonstrating that both the RL stage and the carefully designed reward system are indispensable for robust generalization. The RL stage specifically enables the model to effectively learn adaptive safe reasoning beyond initial SFT alignment.
Training dynamics show a steady increase in risk-aware reward, indicating successful learning of adequate safe reasoning traces. This systematic approach ensures that RAPO's defense mechanisms are well-integrated and effective across diverse tasks and threat levels.
Ablation Study of RAPO (Qwen-1.7B)
| Recipe | JBB (↓) | WJ (↓) | MMLU-Pro (↑) |
|---|---|---|---|
| Base | 17% | 70.9% | 41.9% |
| RAPO-SFT | 1% | 36.1% | 40.9% |
| RAPO | 0% | 15.8% | 41.1% |
| Harmful Data Only | 0% | 14.7% | 40.2% |
| Benign Data Only | 1% | 37.2% | 40.9% |
| Risk-Aware Reward Only | 1% | 34.4% | 40.9% |
| General Reward Only | 0% | 22.3% | 40.4% |
The full RAPO framework significantly outperforms partial implementations, highlighting the importance of both the RL stage and the composite reward system for optimal safety and utility.
Calculate Your Potential AI ROI
Estimate the tangible benefits of implementing advanced AI safety and reasoning solutions within your organization. Understand how robust alignment translates into reduced risk and improved operational efficiency.
Your Implementation Roadmap
A phased approach ensures seamless integration and optimal impact. Our expert team guides you through each step, from initial assessment to full-scale deployment and ongoing optimization.
Phase 1: Discovery & Strategy
In-depth analysis of your current AI usage, identification of critical safety gaps, and development of a tailored RAPO implementation strategy.
Phase 2: Customization & Training
Fine-tuning RAPO with your specific datasets, adapting reward models to your unique safety policies, and comprehensive training for your teams.
Phase 3: Integration & Deployment
Seamless integration of RAPO into your existing LRM infrastructure, ensuring minimal disruption and maximum security enhancements.
Phase 4: Monitoring & Optimization
Continuous monitoring of safety performance, adaptive adjustments to RAPO configurations, and ongoing support to maintain peak robustness against evolving threats.
Ready to Elevate Your AI Safety?
Don't let complex jailbreak attacks compromise your enterprise AI. Partner with us to implement RAPO and ensure your Large Reasoning Models are not only powerful but also robustly safe and reliable.