Enterprise AI Analysis
PERSIAN-PHI: EFFICIENT CROSS-LINGUAL ADAPTATION OF COMPACT LLMS VIA CURRICULUM LEARNING
The democratization of AI is currently hindered by the immense computational costs required to train Large Language Models (LLMs) for low-resource languages. This paper presents Persian-Phi, a 3.8B parameter model that challenges the assumption that robust multilingual capabilities require massive model sizes or multilingual baselines. We demonstrate how Microsoft's Phi-3 Mini (originally a monolingual English model) can be effectively adapted to Persian through a novel, resource-efficient curriculum learning pipeline. This approach employs a unique "warm-up" stage using bilingual narratives (Tiny Stories) to align embeddings prior to the heavier training stages, followed by continual pretraining and instruction tuning via Parameter-Efficient Fine-Tuning (PEFT). Despite its compact size, Persian-Phi achieves competitive results on the Open Persian LLM Leaderboard. Our findings provide a validated, scalable framework for extending the reach of state-of-the-art LLMs to underrepresented languages with minimal hardware resources. The model is publicly available at Persian-Phi.
Executive Impact: Key Takeaways
Persian-Phi's novel approach demonstrates how compact, monolingual models can be efficiently adapted to low-resource languages, providing a scalable and cost-effective pathway for global AI adoption.
Deep Analysis & Enterprise Applications
Enterprise Process Flow
The adaptation began with an extended tokenizer, followed by a warm-up phase to align new Persian tokens. Deep language understanding was built via continual pre-training on filtered Persian corpora, and finally, instruction tuning refined conversational abilities.
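To make the ordering of these stages concrete, the sketch below expresses the curriculum as a simple Python stage configuration. It is a hypothetical outline for illustration only: the stage names follow the paper, but the dataset descriptions and the lists of trainable components are paraphrased assumptions, not reported hyperparameters.

```python
# Hypothetical outline of the three-stage curriculum described above.
# Stage names follow the paper; data descriptions and trainable-component
# lists are illustrative assumptions, not the authors' exact settings.
CURRICULUM = [
    {
        "stage": "warm_up",
        "goal": "align newly added Persian token embeddings",
        "data": "bilingual Tiny Stories (English story -> Persian translation)",
        "trainable": ["new token embeddings", "low-rank adapters"],
    },
    {
        "stage": "continual_pretraining",
        "goal": "build deep Persian language understanding and fluency",
        "data": "filtered Persian corpora (e.g., TLPC, Persian Wikipedia)",
        "trainable": ["higher-rank LoRA", "embedding layer", "LM head"],
    },
    {
        "stage": "instruction_tuning",
        "goal": "refine instruction-following and conversational ability",
        "data": "mixed Persian and English instruction-response pairs",
        "trainable": ["LoRA adapters"],
    },
]

for step in CURRICULUM:
    print(f"{step['stage']}: {step['goal']}")
```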
| Model | #Params (B) | Part MC | ARC Easy | ARC Challenge | MMLU Pro | AUT MC |
|---|---|---|---|---|---|---|
| Ours (Persian-Phi) | 3.85 | 30.56 | 64.65 | 51.00 | 17.18 | 43.98 |
| PartAI Dorna2-8B | 8.03 | 35.52 | 75.28 | 53.52 | 24.1 | 53.45 |
| Meta-LLaMA3.1-8B | 8.03 | 36.68 | 78.4 | 60.4 | 21 | 54.24 |
| Gemma-2-2b-it | 2.61 | 31.12 | 71.26 | 57.72 | 16.23 | 49.9 |
| PersianMind-v1.0 | 6.82 | 29.27 | 58.91 | 48.32 | 15.51 | 45.36 |
| Maral-7B-alpha-1 | 7.24 | 26.67 | 44.54 | 32.88 | 15.99 | 36.09 |
| Phi-3-mini-4k-instruct (Baseline) | 3.82 | 27.37 | 36.78 | 36.78 | 17.89 | 35.1 |
Persian-Phi achieves competitive results despite its compact size, outperforming several larger models in the comparison, including the 7B-class PersianMind and Maral. Dorna2 (8B parameters) remains state-of-the-art, but at roughly half its parameter count Persian-Phi reaches about 80% of its aggregate performance.
Persian-Phi introduces a unique curriculum learning pipeline: warm-up with bilingual Tiny Stories for embedding alignment, followed by continual pre-training on filtered Persian corpora, and instruction tuning using PEFT. This phased approach enables efficient adaptation of a monolingual model to a new language while preserving its original capabilities.
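The warm-up stage hinges on framing bilingual Tiny Stories as a translation-style task. The snippet below is a minimal sketch of how such warm-up examples could be rendered as plain training text; the prompt template, function name, and example sentences are assumptions for illustration, since the paper only specifies that English-Persian story pairs are used for embedding alignment.

```python
# Minimal sketch of building a bilingual warm-up example from a Tiny Stories
# pair. The template and field names are illustrative assumptions.
def build_warmup_example(english_story: str, persian_story: str) -> str:
    """Render one English -> Persian translation example as plain text."""
    return (
        "Translate the following story into Persian.\n"
        f"English: {english_story}\n"
        f"Persian: {persian_story}"
    )

pair = {
    "en": "Once upon a time, a little cat found a red ball.",
    "fa": "روزی روزگاری، یک گربه کوچک یک توپ قرمز پیدا کرد.",
}
print(build_warmup_example(pair["en"], pair["fa"]))
```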
Strategic Advantages for Enterprise
The Persian-Phi project demonstrates a powerful paradigm for extending advanced AI capabilities to languages like Persian, traditionally underserved due to data scarcity and computational costs. By starting with a compact, high-capability English model (Microsoft's Phi-3 Mini) and strategically adapting it, we offer a resource-efficient and scalable solution. This challenges the notion that robust multilingual support requires massive models or training from scratch, making cutting-edge LLMs accessible with minimal hardware. For enterprises, this means faster time-to-market for AI solutions in new language markets and significant cost savings on development and infrastructure.
Your Implementation Roadmap
Our structured approach ensures a seamless integration of advanced AI capabilities into your enterprise, maximizing impact while minimizing disruption.
Phase 1: Tokenizer Extension & Warm-up (Est. 2-3 weeks)
Expanded the tokenizer with Persian-specific tokens and initialized the new embeddings through bilingual translation tasks (Tiny Stories) to ensure smooth cross-lingual alignment and prevent catastrophic forgetting.
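A minimal sketch of the tokenizer-extension step using the Hugging Face transformers API is shown below. The base checkpoint is the public Phi-3 Mini model; the specific Persian tokens added are placeholders, not the paper's actual vocabulary extension.

```python
# Sketch of Phase 1 tokenizer extension with Hugging Face transformers.
# The Persian tokens listed are placeholders for illustration only.
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# Candidate Persian tokens (placeholders, not the real extension list).
new_persian_tokens = ["سلام", "کتاب", "دانشگاه"]
num_added = tokenizer.add_tokens(new_persian_tokens)

# Grow the embedding matrix and LM head so the new token IDs have trainable rows.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; new vocabulary size: {len(tokenizer)}")
```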
Phase 2: Continual Pre-training (Est. 4-6 weeks)
Applied intensive pre-training on a large, high-quality filtered Persian corpus (TLPC, Wikipedia) to build deep language understanding and fluency. Utilized higher-rank LoRA and full embedding/head tuning.
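The following is a sketch of what a Phase 2 PEFT setup could look like: higher-rank LoRA on the projection layers plus fully trainable embeddings and LM head, as described above. The rank, alpha, module names, and checkpoint path are assumptions for illustration rather than the paper's reported configuration.

```python
# Sketch of a Phase 2 PEFT configuration: higher-rank LoRA plus full
# embedding/head tuning. Hyperparameters and paths are placeholders.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Reload the vocabulary-extended checkpoint saved after Phase 1 (placeholder path).
model = AutoModelForCausalLM.from_pretrained("./persian-phi-vocab-extended")

pretrain_lora = LoraConfig(
    r=64,                      # higher rank for the heavy pre-training stage (assumed value)
    lora_alpha=128,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["qkv_proj", "o_proj", "gate_up_proj", "down_proj"],
    modules_to_save=["embed_tokens", "lm_head"],  # train embeddings and head in full
)

peft_model = get_peft_model(model, pretrain_lora)
peft_model.print_trainable_parameters()
```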
Phase 3: Supervised Fine-Tuning (SFT) (Est. 2-3 weeks)
Refined the model's instruction-following and conversational abilities using a mixed dataset of Persian and English instruction-response pairs, employing LoRA to balance proficiency in both languages.
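One practical detail in this phase is assembling the mixed-language instruction data so English ability is retained while Persian proficiency is built. The sketch below shows one way to interleave two instruction datasets and format them with the model's chat template; the file names, field names, and mixing ratio are placeholders, not the paper's actual sources.

```python
# Sketch of assembling mixed Persian/English SFT data for Phase 3.
# Dataset files, field names, and the 80/20 mix are illustrative assumptions.
from datasets import load_dataset, interleave_datasets
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

persian_sft = load_dataset("json", data_files="persian_instructions.jsonl")["train"]
english_sft = load_dataset("json", data_files="english_instructions.jsonl")["train"]

# Weighted interleaving keeps some English exposure during Persian fine-tuning.
mixed = interleave_datasets([persian_sft, english_sft], probabilities=[0.8, 0.2], seed=42)

def to_chat_text(example):
    messages = [
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["response"]},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

sft_dataset = mixed.map(to_chat_text)
```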
Phase 4: Integration & Deployment (Est. 1-2 weeks)
Merged LoRA weights and prepared the final Persian-Phi model for public release and enterprise integration, ensuring robustness and performance.
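Merging folds the trained adapters back into the base weights so the released model is a single standard checkpoint with no PEFT dependency at inference time. Below is a minimal sketch using the peft merge API; all paths are placeholders.

```python
# Sketch of Phase 4: merge trained LoRA adapters into the base weights and
# save a single deployable checkpoint. Paths are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("./persian-phi-vocab-extended")
tokenizer = AutoTokenizer.from_pretrained("./persian-phi-vocab-extended")

adapted = PeftModel.from_pretrained(base, "./persian-phi-sft-adapter")
merged = adapted.merge_and_unload()  # fold adapter weights into the base model

merged.save_pretrained("./persian-phi-final")
tokenizer.save_pretrained("./persian-phi-final")
```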
Ready to Transform Your Enterprise with AI?
Unlock the full potential of language models for your specific needs, even in low-resource environments. Contact us today to explore tailored solutions.