AI Alignment & LLM Development
Fluent Alignment with Disfluent Judges: Post-Training for Lower-Resource Languages
This paper introduces a post-training strategy to develop fluent language models for lower-resource languages, even when using "disfluent" (less-than-perfect) reward models for alignment. By employing on-policy reinforcement learning and meticulously avoiding translated training data, our method successfully preserves the native linguistic quality learned during pretraining. Experiments with Norwegian Bokmål, including native-speaker evaluations, demonstrate that this approach significantly outperforms traditional supervised finetuning on machine-translated data, proving that fluent policies can emerge from disfluent judges.
Executive Impact
Unlock superior linguistic fluency and drive impactful AI development in underserved language markets with our proven methodology.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Enterprise Process Flow: Fluent LLM Alignment
This three-stage methodology ensures that models for lower-resource languages maintain native fluency by leveraging pretraining on native texts and on-policy learning, while critically avoiding any exposure to translated content during the alignment phase.
| Model | vs. On-policy RL | vs. Translated SFT | vs. Mistral Nemo | Average Win-Rate |
|---|---|---|---|---|
| On-policy RL | - | 67.5% | 91.8% | 79.7% |
| Translated SFT | 32.5% | - | 87.5% | 60.0% |
| Mistral Nemo | 8.2% | 12.5% | - | 10.3% |
Results from native Norwegian speaker evaluations show a clear preference for outputs generated by the On-policy RL model, confirming its superior fluency.
Our on-policy reinforcement learning approach demonstrated superior fluency, achieving a 79.7% average win-rate in native-speaker evaluations, significantly outperforming supervised finetuning on translated data (60.0%) and the multilingual Mistral Nemo baseline (10.3%).
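For readers who want to verify the arithmetic, the sketch below re-derives each model's average win-rate directly from the pairwise matrix above. The dictionary simply re-encodes the table; the helper function is illustrative and not part of the paper's tooling.

```python
# Re-derive each model's average win-rate from the pairwise head-to-head
# matrix reported in the table above. The values re-encode the table;
# the helper function is illustrative, not taken from the paper.
PAIRWISE_WIN_RATES = {
    "On-policy RL":   {"Translated SFT": 67.5, "Mistral Nemo": 91.8},
    "Translated SFT": {"On-policy RL": 32.5,  "Mistral Nemo": 87.5},
    "Mistral Nemo":   {"On-policy RL": 8.2,   "Translated SFT": 12.5},
}

def average_win_rate(model: str, matrix: dict[str, dict[str, float]]) -> float:
    """Average of a model's win-rates against every other model."""
    opponents = matrix[model]
    return sum(opponents.values()) / len(opponents)

for model in PAIRWISE_WIN_RATES:
    print(f"{model}: {average_win_rate(model, PAIRWISE_WIN_RATES):.2f}%")
# Matches the table's Average Win-Rate column up to rounding:
# 79.65% -> 79.7%, 60.00% -> 60.0%, 10.35% -> 10.3%
```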
| Judge Model | Judge NLU (%) | Judge NLG (%) | Judge Fluency (%) | Policy Fluency (%) |
|---|---|---|---|---|
| Mistral Nemo 12B | 87.5 | 29.7 | 67.0 | 92.2 |
| Mistral Large 123B | 90.0 | 70.4 | 83.4 | 94.2 |
| Llama 3.1 8B | 86.4 | 50.0 | 62.8 | 92.9 |
| Qwen 2.5 14B | 89.6 | 43.5 | 39.0 | 93.1 |
| Qwen 2.5 72B | 92.0 | 75.2 | 50.7 | 92.9 |
Despite widely varying judge fluency (as low as 39.0% for Qwen 2.5 14B), the resulting policies consistently reach 92-94% fluency, demonstrating that fluent policies can be trained with "disfluent" judges (Pearson correlation between judge fluency and policy fluency: 0.067).
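The weak relationship between judge fluency and policy fluency can be checked with a few lines of standard tooling. The sketch below uses hypothetical placeholder arrays rather than the paper's underlying evaluation pairs, so it illustrates the computation only; the reported coefficient of 0.067 comes from the paper's own data.

```python
# Illustrative check of how weakly judge fluency predicts policy fluency.
# The arrays are hypothetical placeholders, not the paper's evaluation data;
# the paper reports a Pearson coefficient of 0.067 on its own judge/policy pairs.
from scipy.stats import pearsonr

judge_fluency  = [65.0, 85.0, 60.0, 40.0, 50.0]   # hypothetical judge fluency scores (%)
policy_fluency = [92.0, 94.0, 93.0, 93.0, 93.0]   # hypothetical resulting policy fluency (%)

r, p = pearsonr(judge_fluency, policy_fluency)
print(f"Pearson r = {r:.3f} (p = {p:.3f})")
```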
The Criticality of Avoiding Translationese
Our research unequivocally shows that any exposure to machine-translated text during the alignment phase measurably degrades the linguistic fluency of the language model. Even a single epoch of training on a translated dataset can reduce final policy fluency from 94.2% to 91.0%, underscoring the importance of purely native text for maintaining high-quality language generation. This finding challenges conventional methods that rely on translating high-resource instruction datasets for lower-resource languages.
On-policy reinforcement learning consistently maintains a stable, high fluency rate (around 93%) after initial convergence, demonstrating its robustness in preserving native linguistic quality throughout training; supervised finetuning on translated data, by contrast, showed a clear decrease in fluency over time.
Calculate Your Potential ROI
Estimate the efficiency gains and cost savings your enterprise could realize by implementing fluent, AI-aligned LLMs in lower-resource markets.
Your Implementation Roadmap
A strategic, phase-by-phase approach to integrating fluent LLMs into your enterprise workflows.
Phase 1: Foundational Pretraining & English Alignment
Initiate with extensive pretraining on native target language data to establish core linguistic knowledge. Follow with a short, English-only SFT phase using high-quality instruction datasets like LIMA to teach chat format and instruction-following without introducing translationese.
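A minimal sketch of what this English-only SFT step could look like in code, assuming a Hugging Face-style causal LM. The checkpoint name and the example pair are placeholders; the paper's actual training configuration may differ.

```python
# Minimal sketch of the short English-only SFT step: fine-tune the natively
# pretrained causal LM on (instruction, response) pairs, masking prompt tokens
# out of the loss so only the response is learned. The checkpoint name and the
# example pair are placeholders, not the paper's actual setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_CHECKPOINT = "your-natively-pretrained-model"   # placeholder
tokenizer = AutoTokenizer.from_pretrained(BASE_CHECKPOINT)
model = AutoModelForCausalLM.from_pretrained(BASE_CHECKPOINT)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# English-only instruction data (e.g. drawn from LIMA); no translations.
pairs = [
    {"prompt": "Explain what a hash table is.",
     "response": "A hash table maps keys to values using a hash function ..."},
]

def encode(pair):
    prompt_ids = tokenizer(pair["prompt"] + "\n", return_tensors="pt").input_ids
    full_ids = tokenizer(
        pair["prompt"] + "\n" + pair["response"] + tokenizer.eos_token,
        return_tensors="pt",
    ).input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100   # ignore prompt tokens in the loss
    return full_ids, labels

model.train()
for _ in range(1):                            # keep this phase short
    for pair in pairs:
        input_ids, labels = encode(pair)
        loss = model(input_ids=input_ids, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```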
Phase 2: On-Policy Reinforcement Learning
Implement on-policy reinforcement learning, allowing the model to learn from its own generated responses. Utilize an LLM-as-a-judge system to provide reward signals, even if the judge model is not perfectly fluent itself. Crucially, avoid all translated data during this phase to preserve native fluency.
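Schematically, the loop can be sketched with a simple REINFORCE-style update, as below. This is an illustration under our own simplifying assumptions (placeholder model names, a dummy `judge_score`), not the paper's exact algorithm.

```python
# Schematic on-policy loop (REINFORCE-style illustration, not the paper's exact
# algorithm): the policy samples its own responses to native-language prompts,
# an LLM-as-a-judge assigns a reward, and the sampled response's log-probability
# is scaled by that reward. Model names and judge_score are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

POLICY_CHECKPOINT = "your-sft-checkpoint"     # placeholder: output of Phase 1
tokenizer = AutoTokenizer.from_pretrained(POLICY_CHECKPOINT)
policy = AutoModelForCausalLM.from_pretrained(POLICY_CHECKPOINT)
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-6)

def judge_score(prompt: str, response: str) -> float:
    """Placeholder reward. In the paper an LLM judge (possibly disfluent in the
    target language) scores the response; here we return a dummy value."""
    return 1.0 if response.strip() else 0.0

prompts = ["Forklar hva en hashtabell er."]   # native-language prompts only

for prompt in prompts:
    enc = tokenizer(prompt, return_tensors="pt")

    # On-policy: sample a response from the current policy itself.
    with torch.no_grad():
        generated = policy.generate(**enc, do_sample=True, max_new_tokens=128)

    prompt_len = enc.input_ids.shape[1]
    response = tokenizer.decode(generated[0, prompt_len:], skip_special_tokens=True)
    reward = judge_score(prompt, response)

    # REINFORCE update: push up the log-probability of the sampled response
    # in proportion to the judge's reward.
    logits = policy(generated).logits[:, :-1, :]
    logprobs = torch.log_softmax(logits, dim=-1)
    token_logprobs = logprobs.gather(-1, generated[:, 1:].unsqueeze(-1)).squeeze(-1)
    response_logprob = token_logprobs[:, prompt_len - 1:].sum()

    loss = -reward * response_logprob
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```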
Phase 3: Continuous Evaluation & Iterative Enhancement
Establish a robust evaluation framework, including both automatic and native-speaker assessments of fluency, NLU, and NLG. Use these insights to continually refine the policy and judge models, ensuring sustained high performance and linguistic naturalness.
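As one concrete building block, fluency can be tracked as the share of sampled responses an evaluator (automatic judge or native speaker) marks as fluent. The sketch below is a minimal, hypothetical illustration of that metric, not the paper's evaluation harness.

```python
# Minimal illustration of one evaluation metric: fluency rate, the percentage of
# sampled policy responses an evaluator (LLM judge or native speaker) marks as
# fluent. The Verdict type and toy data are hypothetical.
from dataclasses import dataclass

@dataclass
class Verdict:
    response_id: str
    fluent: bool              # evaluator's binary fluency judgement

def fluency_rate(verdicts: list[Verdict]) -> float:
    """Percentage of responses judged fluent."""
    if not verdicts:
        return 0.0
    return 100.0 * sum(v.fluent for v in verdicts) / len(verdicts)

sample = [Verdict("r1", True), Verdict("r2", True), Verdict("r3", False)]
print(f"Fluency: {fluency_rate(sample):.1f}%")    # 66.7% on this toy sample
```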
Ready to Elevate Your LLM Strategy?
Connect with our experts to design a tailored approach for achieving native-level fluency and alignment in your target languages.