AI Alignment & LLM Development
Fluent Alignment with Disfluent Judges: Post-Training for Lower-Resource Languages
This paper introduces a post-training strategy to develop fluent language models for lower-resource languages, even when using "disfluent" (less-than-perfect) reward models for alignment. By employing on-policy reinforcement learning and meticulously avoiding translated training data, our method successfully preserves the native linguistic quality learned during pretraining. Experiments with Norwegian Bokmål, including native-speaker evaluations, demonstrate that this approach significantly outperforms traditional supervised finetuning on machine-translated data, proving that fluent policies can emerge from disfluent judges.
Executive Impact
Unlock superior linguistic fluency and drive impactful AI development in underserved language markets with our proven methodology.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Enterprise Process Flow: Fluent LLM Alignment
This three-stage methodology ensures that models for lower-resource languages maintain native fluency by leveraging pretraining on native texts and on-policy learning, while critically avoiding any exposure to translated content during the alignment phase.
| Model | vs. On-policy RL | vs. Translated SFT | vs. Mistral Nemo | Average Win-Rate |
|---|---|---|---|---|
| On-policy RL | - | 67.5% | 91.8% | 79.7% |
| Translated SFT | 32.5% | - | 87.5% | 60.0% |
| Mistral Nemo | 8.2% | 12.5% | - | 10.3% |
Results from native Norwegian speaker evaluations show a clear preference for outputs generated by the On-policy RL model, confirming its superior fluency.
Our on-policy reinforcement learning approach demonstrated superior fluency, achieving a 79.7% average win-rate in native-speaker evaluations, significantly outperforming supervised finetuning on translated data (60.0%) and the multilingual Mistral Nemo baseline (10.3%).
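For readers who want to verify the arithmetic, the sketch below re-derives each model's average win-rate directly from the pairwise matrix above. The dictionary simply re-encodes the table; the helper function is illustrative and not part of the paper's tooling.

```python
# Re-derive each model's average win-rate from the pairwise head-to-head
# matrix reported in the table above. The values re-encode the table;
# the helper function is illustrative, not taken from the paper.
PAIRWISE_WIN_RATES = {
    "On-policy RL":   {"Translated SFT": 67.5, "Mistral Nemo": 91.8},
    "Translated SFT": {"On-policy RL": 32.5,  "Mistral Nemo": 87.5},
    "Mistral Nemo":   {"On-policy RL": 8.2,   "Translated SFT": 12.5},
}

def average_win_rate(model: str, matrix: dict[str, dict[str, float]]) -> float:
    """Average of a model's win-rates against every other model."""
    opponents = matrix[model]
    return sum(opponents.values()) / len(opponents)

for model in PAIRWISE_WIN_RATES:
    print(f"{model}: {average_win_rate(model, PAIRWISE_WIN_RATES):.2f}%")
# Matches the table's Average Win-Rate column up to rounding:
# 79.65% -> 79.7%, 60.00% -> 60.0%, 10.35% -> 10.3%
```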
| Judge Model | Judge NLU (%) | Judge NLG (%) | Judge Fluency (%) | Policy Fluency (%) |
|---|---|---|---|---|
| Mistral Nemo 12B | 87.5 | 29.7 | 67.0 | 92.2 |
| Mistral Large 123B | 90.0 | 70.4 | 83.4 | 94.2 |
| Llama 3.1 8B | 86.4 | 50.0 | 62.8 | 92.9 |
| Qwen 2.5 14B | 89.6 | 43.5 | 39.0 | 93.1 |
| Qwen 2.5 72B | 92.0 | 75.2 | 50.7 | 92.9 |
Despite widely varying judge fluency (as low as 39.0% for Qwen 2.5 14B), the resulting policies consistently reach 92-94% fluency, demonstrating that fluent policies can be trained with "disfluent" judges (Pearson correlation between judge fluency and policy fluency: 0.067).
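The weak relationship between judge fluency and policy fluency can be checked with a few lines of standard tooling. The sketch below uses hypothetical placeholder arrays rather than the paper's underlying evaluation pairs, so it illustrates the computation only; the reported coefficient of 0.067 comes from the paper's own data.

```python
# Illustrative check of how weakly judge fluency predicts policy fluency.
# The arrays are hypothetical placeholders, not the paper's evaluation data;
# the paper reports a Pearson coefficient of 0.067 on its own judge/policy pairs.
from scipy.stats import pearsonr

judge_fluency  = [65.0, 85.0, 60.0, 40.0, 50.0]   # hypothetical judge fluency scores (%)
policy_fluency = [92.0, 94.0, 93.0, 93.0, 93.0]   # hypothetical resulting policy fluency (%)

r, p = pearsonr(judge_fluency, policy_fluency)
print(f"Pearson r = {r:.3f} (p = {p:.3f})")
```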
The Criticality of Avoiding Translationese
Our research unequivocally shows that any exposure to machine-translated text during the alignment phase measurably degrades the linguistic fluency of the language model. Even a single epoch of training on a translated dataset can reduce final policy fluency from 94.2% to 91.0%, underscoring the importance of purely native text for maintaining high-quality language generation. This finding challenges conventional methods that rely on translating high-resource instruction datasets for lower-resource languages.
On-policy reinforcement learning consistently maintains a stable, high fluency rate (around 93%) after initial convergence, demonstrating its robustness in preserving native linguistic quality throughout training; supervised finetuning on translated data, by contrast, showed a clear decrease in fluency over time.
Calculate Your Potential ROI
Estimate the efficiency gains and cost savings your enterprise could realize by implementing fluent, AI-aligned LLMs in lower-resource markets.
Your Implementation Roadmap
A strategic, phase-by-phase approach to integrating fluent LLMs into your enterprise workflows.
Phase 1: Foundational Pretraining & English Alignment
Initiate with extensive pretraining on native target language data to establish core linguistic knowledge. Follow with a short, English-only SFT phase using high-quality instruction datasets like LIMA to teach chat format and instruction-following without introducing translationese.
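A minimal sketch of what this English-only SFT step could look like in code, assuming a Hugging Face-style causal LM. The checkpoint name and the example pair are placeholders; the paper's actual training configuration may differ.

```python
# Minimal sketch of the short English-only SFT step: fine-tune the natively
# pretrained causal LM on (instruction, response) pairs, masking prompt tokens
# out of the loss so only the response is learned. The checkpoint name and the
# example pair are placeholders, not the paper's actual setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_CHECKPOINT = "your-natively-pretrained-model"   # placeholder
tokenizer = AutoTokenizer.from_pretrained(BASE_CHECKPOINT)
model = AutoModelForCausalLM.from_pretrained(BASE_CHECKPOINT)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# English-only instruction data (e.g. drawn from LIMA); no translations.
pairs = [
    {"prompt": "Explain what a hash table is.",
     "response": "A hash table maps keys to values using a hash function ..."},
]

def encode(pair):
    prompt_ids = tokenizer(pair["prompt"] + "\n", return_tensors="pt").input_ids
    full_ids = tokenizer(
        pair["prompt"] + "\n" + pair["response"] + tokenizer.eos_token,
        return_tensors="pt",
    ).input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100   # ignore prompt tokens in the loss
    return full_ids, labels

model.train()
for _ in range(1):                            # keep this phase short
    for pair in pairs:
        input_ids, labels = encode(pair)
        loss = model(input_ids=input_ids, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```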
Phase 2: On-Policy Reinforcement Learning
Implement on-policy reinforcement learning, allowing the model to learn from its own generated responses. Utilize an LLM-as-a-judge system to provide reward signals, even if the judge model is not perfectly fluent itself. Crucially, avoid all translated data during this phase to preserve native fluency.
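Schematically, the loop can be sketched with a simple REINFORCE-style update, as below. This is an illustration under our own simplifying assumptions (placeholder model names, a dummy `judge_score`), not the paper's exact algorithm.

```python
# Schematic on-policy loop (REINFORCE-style illustration, not the paper's exact
# algorithm): the policy samples its own responses to native-language prompts,
# an LLM-as-a-judge assigns a reward, and the sampled response's log-probability
# is scaled by that reward. Model names and judge_score are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

POLICY_CHECKPOINT = "your-sft-checkpoint"     # placeholder: output of Phase 1
tokenizer = AutoTokenizer.from_pretrained(POLICY_CHECKPOINT)
policy = AutoModelForCausalLM.from_pretrained(POLICY_CHECKPOINT)
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-6)

def judge_score(prompt: str, response: str) -> float:
    """Placeholder reward. In the paper an LLM judge (possibly disfluent in the
    target language) scores the response; here we return a dummy value."""
    return 1.0 if response.strip() else 0.0

prompts = ["Forklar hva en hashtabell er."]   # native-language prompts only

for prompt in prompts:
    enc = tokenizer(prompt, return_tensors="pt")

    # On-policy: sample a response from the current policy itself.
    with torch.no_grad():
        generated = policy.generate(**enc, do_sample=True, max_new_tokens=128)

    prompt_len = enc.input_ids.shape[1]
    response = tokenizer.decode(generated[0, prompt_len:], skip_special_tokens=True)
    reward = judge_score(prompt, response)

    # REINFORCE update: push up the log-probability of the sampled response
    # in proportion to the judge's reward.
    logits = policy(generated).logits[:, :-1, :]
    logprobs = torch.log_softmax(logits, dim=-1)
    token_logprobs = logprobs.gather(-1, generated[:, 1:].unsqueeze(-1)).squeeze(-1)
    response_logprob = token_logprobs[:, prompt_len - 1:].sum()

    loss = -reward * response_logprob
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```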
Phase 3: Continuous Evaluation & Iterative Enhancement
Establish a robust evaluation framework, including both automatic and native-speaker assessments of fluency, NLU, and NLG. Use these insights to continually refine the policy and judge models, ensuring sustained high performance and linguistic naturalness.
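As one concrete building block, fluency can be tracked as the share of sampled responses an evaluator (automatic judge or native speaker) marks as fluent. The sketch below is a minimal, hypothetical illustration of that metric, not the paper's evaluation harness.

```python
# Minimal illustration of one evaluation metric: fluency rate, the percentage of
# sampled policy responses an evaluator (LLM judge or native speaker) marks as
# fluent. The Verdict type and toy data are hypothetical.
from dataclasses import dataclass

@dataclass
class Verdict:
    response_id: str
    fluent: bool              # evaluator's binary fluency judgement

def fluency_rate(verdicts: list[Verdict]) -> float:
    """Percentage of responses judged fluent."""
    if not verdicts:
        return 0.0
    return 100.0 * sum(v.fluent for v in verdicts) / len(verdicts)

sample = [Verdict("r1", True), Verdict("r2", True), Verdict("r3", False)]
print(f"Fluency: {fluency_rate(sample):.1f}%")    # 66.7% on this toy sample
```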
Ready to Elevate Your LLM Strategy?
Connect with our experts to design a tailored approach for achieving native-level fluency and alignment in your target languages.