Enterprise AI Analysis: Fluent Alignment with Disfluent Judges

AI Alignment & LLM Development

Fluent Alignment with Disfluent Judges: Post-Training for Lower-Resource Languages

This paper introduces a post-training strategy for developing fluent language models in lower-resource languages, even when alignment relies on "disfluent" (less-than-perfect) reward models. By combining on-policy reinforcement learning with a strict avoidance of translated training data, the method preserves the native linguistic quality learned during pretraining. Experiments with Norwegian Bokmål, including native-speaker evaluations, show that this approach substantially outperforms supervised finetuning on machine-translated data, demonstrating that fluent policies can emerge from disfluent judges.

Executive Impact

Unlock superior linguistic fluency and drive impactful AI development in underserved language markets with our proven methodology.

79.7% On-Policy RL Avg Fluency Win-Rate
~93% Policy Fluency w/ Disfluent Judges
19.7pp Win-Rate Gain over Translated SFT
3.2pp Fluency Lost from Translated Data

Deep Analysis & Enterprise Applications

The following modules distill the paper's key findings into enterprise-focused takeaways.

Enterprise Process Flow: Fluent LLM Alignment

Base Model Pretraining (Target Language)
Short Supervised Finetuning (English LIMA)
On-Policy Reinforcement Learning (Target Language, LLM-as-a-Judge)

This three-stage methodology ensures that models for lower-resource languages maintain native fluency by leveraging pretraining on native texts and on-policy learning, critically avoiding any exposure to translated content during the alignment phase.
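One way to make these constraints explicit in an engineering workflow is to encode the stages as configuration. The sketch below is a hypothetical Python rendering of the three-stage recipe; the stage names, dataset labels, and `validate` helper are our own illustrative stand-ins, not the paper's training code.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    """One post-training stage and the data it is allowed to see."""
    name: str
    method: str             # "pretraining", "sft", or "on_policy_rl"
    data_sources: list      # dataset labels (illustrative placeholders)
    allow_translated: bool  # hard constraint: translationese is never permitted

# Hypothetical encoding of the three-stage flow described above.
PIPELINE = [
    Stage("base_pretraining", "pretraining",
          ["native_norwegian_corpus"], allow_translated=False),
    Stage("short_english_sft", "sft",
          ["lima_english_instructions"], allow_translated=False),
    Stage("on_policy_rl", "on_policy_rl",
          ["norwegian_prompts_only"], allow_translated=False),  # rewards come from an LLM judge
]

def validate(pipeline: list) -> None:
    """Fail fast if any stage would expose the model to translated data."""
    for stage in pipeline:
        if stage.allow_translated:
            raise ValueError(f"{stage.name}: translated data is disallowed in this recipe")

validate(PIPELINE)
```

Treating "no translated data" as a validated invariant rather than a convention makes it harder for translationese to slip in through later dataset changes.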

Manual Fluency Evaluation: On-Policy RL vs. Baselines

Model            vs. On-policy RL   vs. Translated SFT   vs. Mistral Nemo   Average Win-Rate
On-policy RL     -                  67.5%                91.8%              79.7%
Translated SFT   32.5%              -                    87.5%              60.0%
Mistral Nemo     8.2%               12.5%                -                  10.3%

Results from native Norwegian speaker evaluations show a clear preference for outputs generated by the On-policy RL model, confirming its superior fluency.

79.7% Average Fluency Win-Rate for On-Policy RL

Our on-policy reinforcement learning approach demonstrated superior fluency, achieving a 79.7% average win-rate in native-speaker evaluations, significantly outperforming supervised finetuning on translated data (60.0%) and the multilingual Mistral Nemo baseline (10.3%).
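For transparency, the averages in the table can be recomputed directly from the pairwise results. The short sketch below does so under the assumption (ours, not stated in the source) that each model's average is the unweighted mean of its win-rates against the other two systems.

```python
# Pairwise fluency win-rates from the native-speaker evaluation above
# (each value is the row model's win-rate against the named opponent, in percent).
pairwise = {
    "On-policy RL":   {"Translated SFT": 67.5, "Mistral Nemo": 91.8},
    "Translated SFT": {"On-policy RL": 32.5, "Mistral Nemo": 87.5},
    "Mistral Nemo":   {"On-policy RL": 8.2, "Translated SFT": 12.5},
}

for model, wins in pairwise.items():
    average = sum(wins.values()) / len(wins)
    print(f"{model}: average win-rate = {average:.2f}%")

# Output: 79.65%, 60.00%, and 10.35%, which match the published averages
# (79.7%, 60.0%, 10.3%) up to rounding of the reported pairwise figures.
```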

Impact of Judge Fluency on Trained Policy Fluency

Judge Model          Judge NLU   Judge NLG   Judge Fluency (%)   Policy Fluency (%)
Mistral Nemo 12B     87.5        29.7        67.0                92.2
Mistral Large 123B   90.0        70.4        83.4                94.2
Llama 3.1 8B         86.4        50.0        62.8                92.9
Qwen 2.5 14B         89.6        43.5        39.0                93.1
Qwen 2.5 72B         92.0        75.2        50.7                92.9

Despite wide variation in judge fluency (as low as 39.0% for Qwen 2.5 14B), the resulting policies consistently reach 92-94% fluency, demonstrating that fluent policies can be trained with "disfluent" judges (Pearson correlation between judge fluency and policy fluency: 0.067).
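This kind of correlation check is straightforward to rerun on your own judge/policy pairs. The sketch below uses `scipy.stats.pearsonr`; the two arrays are hypothetical placeholder values rather than the study's underlying data, whose full evaluation produced the reported coefficient of 0.067.

```python
from scipy.stats import pearsonr

# Hypothetical placeholder measurements (percent), one entry per judge model
# used for RL training; substitute your own evaluation results here.
judge_fluency  = [40.0, 52.5, 61.0, 68.5, 84.0]   # fluency of the judge itself
policy_fluency = [92.6, 93.1, 92.8, 93.4, 92.9]   # fluency of the policy it trained

r, p_value = pearsonr(judge_fluency, policy_fluency)
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
```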

The Criticality of Avoiding Translationese

Our research unequivocally shows that any exposure to machine-translated text during the alignment phase measurably degrades the linguistic fluency of the language model. Even a single epoch of training on a translated dataset can reduce final policy fluency from 94.2% to 91.0%, underscoring the importance of purely native text for maintaining high-quality language generation. This finding challenges conventional methods that rely on translating high-resource instruction datasets for lower-resource languages.

~93% Sustained Fluency Rate with On-Policy RL

On-policy reinforcement learning consistently maintains a stable high fluency rate (around 93%) after initial convergence, demonstrating its robustness in preserving native linguistic quality throughout training; supervised finetuning on translated data, by contrast, showed a clear decrease in fluency over time.

Calculate Your Potential ROI

Estimate the efficiency gains and cost savings your enterprise could realize by implementing fluent, AI-aligned LLMs in lower-resource markets.


Your Implementation Roadmap

A strategic, phase-by-phase approach to integrating fluent LLMs into your enterprise workflows.

Phase 1: Foundational Pretraining & English Alignment

Initiate with extensive pretraining on native target language data to establish core linguistic knowledge. Follow with a short, English-only SFT phase using high-quality instruction datasets like LIMA to teach chat format and instruction-following without introducing translationese.
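As a minimal illustration of what this short, English-only SFT data preparation can look like, the sketch below formats LIMA-style prompt/response pairs into chat-formatted training text. The example records and the `<|user|>`/`<|assistant|>`/`<|end|>` markers are placeholders introduced here for illustration; a real run would use the base tokenizer's own chat template and a standard SFT trainer.

```python
# Minimal sketch: turn a few English, LIMA-style instruction examples into
# chat-formatted SFT text. The point of this phase is to teach chat structure
# and instruction-following, never to introduce machine-translated text.

EXAMPLES = [  # stand-ins for LIMA-style (prompt, response) pairs
    {"prompt": "Explain what a hash map is.",
     "response": "A hash map stores key-value pairs and looks them up by hashing the key..."},
    {"prompt": "Write a short thank-you note to a colleague.",
     "response": "Dear Alex, thank you for stepping in on short notice last week..."},
]

def to_chat_text(example: dict) -> str:
    """Render one example with placeholder chat markers."""
    return (
        "<|user|>\n" + example["prompt"].strip() + "\n"
        "<|assistant|>\n" + example["response"].strip() + "\n<|end|>"
    )

sft_corpus = [to_chat_text(ex) for ex in EXAMPLES]
for text in sft_corpus:
    print(text, end="\n\n")
```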

Phase 2: On-Policy Reinforcement Learning

Implement on-policy reinforcement learning, allowing the model to learn from its own generated responses. Utilize an LLM-as-a-judge system to provide reward signals, even if the judge model is not perfectly fluent itself. Crucially, avoid all translated data during this phase to preserve native fluency.
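A minimal sketch of how the LLM-as-a-judge reward signal can be wired up is shown below. The `judge` callable is a hypothetical stand-in for whatever serves the judge model (a local inference server, an API endpoint, etc.), and the prompt wording, 1-10 scale, and parsing are illustrative choices rather than the paper's exact setup.

```python
import re
from typing import Callable, List

# Hypothetical interface: the judge takes a grading prompt and returns raw text.
JudgeClient = Callable[[str], str]

JUDGE_PROMPT = (
    "You are grading a Norwegian assistant reply.\n"
    "User request:\n{prompt}\n\n"
    "Assistant reply:\n{response}\n\n"
    "Rate the reply's helpfulness and Norwegian fluency from 1 to 10. "
    "Answer with the number only."
)

def judge_reward(prompt: str, response: str, judge: JudgeClient) -> float:
    """Score one on-policy sample with the LLM judge; 0.0 if the verdict is unparseable."""
    verdict = judge(JUDGE_PROMPT.format(prompt=prompt, response=response))
    match = re.search(r"\d+(?:\.\d+)?", verdict)
    score = float(match.group()) if match else 0.0
    return min(max(score, 0.0), 10.0) / 10.0   # normalize to [0, 1]

def score_batch(prompts: List[str], responses: List[str], judge: JudgeClient) -> List[float]:
    """Rewards for a batch of the policy's own generations (on-policy samples)."""
    return [judge_reward(p, r, judge) for p, r in zip(prompts, responses)]
```

These per-sample rewards then feed a standard on-policy RL trainer operating on the policy's own Norwegian generations. Note that the judge only has to rate target-language text, not produce it, which is consistent with the finding above that even judges with modest fluency provide a usable training signal.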

Phase 3: Continuous Evaluation & Iterative Enhancement

Establish a robust evaluation framework, including both automatic and native-speaker assessments of fluency, NLU, and NLG. Use these insights to continually refine the policy and judge models, ensuring sustained high performance and linguistic naturalness.

Ready to Elevate Your LLM Strategy?

Connect with our experts to design a tailored approach for achieving native-level fluency and alignment in your target languages, or book a free consultation to discuss your AI strategy.
