
Enterprise AI Analysis

TildeOpen LLM: Leveraging Curriculum Learning to Achieve Equitable Language Representation

TildeOpen LLM is a 30-billion-parameter open-weights foundation model trained on 34 European languages to promote linguistic equity and improve performance for low-resource languages. It combines dataset upsampling with a curriculum-based training schedule to address data imbalance. The model outperforms comparable multilingual LLMs in text generation and comprehension, especially for Baltic, Finno-Ugric, and Slavic languages, with up to a tenfold reduction in linguistic errors relative to leading baselines. This demonstrates that careful data curation and balanced training strategies can significantly improve multilingual model quality without increasing model size or training volume.

Executive Impact

Key metrics highlighting the potential impact of TildeOpen LLM for European enterprises.

30B Parameters
34 Languages
2T Tokens Trained

Deep Analysis & Enterprise Applications

The analysis covers four topics from the research: tokenizer design, data curation and sampling, curriculum learning, and performance evaluation.

Tokenizer Design

The TildeOpen LLM tokenizer supports 34 European languages, grouped into 'focus' languages (e.g., Baltic, Slavic, Finno-Ugric) and 'other' supported languages. The primary goal is equitable language representation: similar token counts for equivalent content across languages, which directly affects inference cost and effective context length. Tokenization efficiency is measured using parallel translations from FLORES 200, with per-language data proportions adjusted iteratively. The tokenizer uses SentencePiece with Byte Pair Encoding, a vocabulary size of 131,072, and settings chosen to enhance efficiency and coverage.
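As a rough illustration of how such parity can be measured, the sketch below tokenizes FLORES-style parallel files with a trained SentencePiece model and reports each language's average token count relative to English. The model filename, file paths, and language codes are assumptions, not details of the actual TildeOpen pipeline.

```python
# Sketch: measuring tokenization parity across parallel translations.
# Assumes a trained SentencePiece model and FLORES-style parallel files,
# one sentence per line, aligned across languages (paths are hypothetical).
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("tildeopen_tokenizer.model")  # hypothetical model file

LANGS = ["eng", "lvs", "lit", "ekk"]  # English, Latvian, Lithuanian, Estonian

def avg_tokens(path: str) -> float:
    """Average number of tokens per parallel sentence in one language file."""
    with open(path, encoding="utf-8") as f:
        lines = [line.strip() for line in f if line.strip()]
    return sum(len(sp.encode_as_ids(line)) for line in lines) / len(lines)

baseline = avg_tokens("flores/eng.devtest")
for lang in LANGS:
    ratio = avg_tokens(f"flores/{lang}.devtest") / baseline
    # A ratio near 1.0 means the language is tokenized about as densely
    # as English; large ratios signal inequitable (costlier) tokenization.
    print(f"{lang}: {ratio:.2f}x tokens relative to English")
```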

Data Curation & Sampling

Training data comprises large web datasets (MADLAD-400, HPLT, CulturaX, FineWeb 2, Common Pile) and specialist resources (The Stack, MathPile, Tezaurs). A multi-step filtering process includes URL filtering (removing low-quality domains, spam, and pornography), deduplication (exact and near-duplicate line removal using the Onion tool, with special handling for large language subsets such as English, French, and German), and heuristic/PII filters (removing low-quality text and anonymizing personal data). Russian-language data undergoes additional topic-level filtering to mitigate propaganda.
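A minimal sketch of this kind of multi-step filter is shown below, covering a URL blocklist, exact-line deduplication, a length heuristic, and naive PII masking. The blocklist, regex, and threshold are illustrative assumptions; the actual pipeline relies on dedicated tooling such as Onion for deduplication.

```python
# Sketch of a multi-step document filter: URL blocklist, exact-line
# deduplication, a length heuristic, and naive PII anonymization.
# Thresholds, patterns, and the blocklist are illustrative assumptions.
import re
from typing import Iterable, Iterator

URL_BLOCKLIST = {"spam.example", "adult.example"}  # hypothetical domains
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
MIN_CHARS = 200  # heuristic: drop very short documents

def clean_corpus(docs: Iterable[dict]) -> Iterator[dict]:
    """Each doc is {'url': ..., 'text': ...}; yields filtered, anonymized docs."""
    seen_lines: set[int] = set()
    for doc in docs:
        # 1. URL filtering: skip documents from low-quality domains.
        if any(domain in doc["url"] for domain in URL_BLOCKLIST):
            continue
        # 2. Exact-line deduplication across the corpus (hashed to save memory).
        lines = [ln for ln in doc["text"].splitlines()
                 if hash(ln) not in seen_lines]
        seen_lines.update(hash(ln) for ln in lines)
        text = "\n".join(lines)
        # 3. Heuristic quality filter: drop documents that are now too short.
        if len(text) < MIN_CHARS:
            continue
        # 4. PII anonymization: mask email addresses.
        text = EMAIL_RE.sub("[EMAIL]", text)
        yield {"url": doc["url"], "text": text}
```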

Curriculum Learning

To address data imbalance, where some languages have orders of magnitude more data than others, TildeOpen LLM upsamples under-represented languages (by up to 2.5 times). This is combined with a curriculum learning approach across three phases: an initial uniform language-exposure phase (7.5% of training), an intermediate phase with a more natural data distribution (67.5%), and a final uniform phase (25%). This strategy ensures balanced language exposure while maximizing resource diversity.
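A minimal sketch of how such a schedule might be expressed follows: uniform sampling in the first and last phases, and a natural (size-proportional) distribution with upsampling capped at 2.5× the natural share in between. The token counts, and the exact way upsampling interacts with each phase, are illustrative assumptions.

```python
# Sketch: per-language sampling weights for the three curriculum phases.
# Token counts are made up; how upsampling (capped at 2.5x the natural
# share) interacts with each phase is an illustrative assumption.
tokens = {"en": 900e9, "de": 300e9, "lv": 8e9, "lt": 10e9}  # hypothetical

def normalize(w: dict[str, float]) -> dict[str, float]:
    total = sum(w.values())
    return {k: v / total for k, v in w.items()}

natural = normalize(tokens)                     # size-proportional shares
uniform = {k: 1 / len(tokens) for k in tokens}  # equal shares

# Upsample under-represented languages, but never beyond 2.5x their
# natural share, then renormalize.
upsampled = normalize({
    k: min(uniform[k], 2.5 * natural[k]) if natural[k] < uniform[k]
    else natural[k]
    for k in tokens
})

# Phase schedule: (fraction of training, sampling distribution).
schedule = [
    (0.075, uniform),    # initial uniform language exposure
    (0.675, upsampled),  # more natural distribution with capped upsampling
    (0.25,  uniform),    # final uniform phase
]
for frac, dist in schedule:
    print(f"{frac:.1%} of steps -> " +
          ", ".join(f"{k}: {v:.3f}" for k, v in dist.items()))
```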

Performance Evaluation

The model is evaluated on the MultiBLIMP 1.0, Belebele, ARC, MMLU, and Exams benchmarks. It achieves strong performance, particularly in text generation and comprehension for Baltic, Finno-Ugric, and Slavic languages. Human evaluations show a significant reduction in linguistic errors (up to tenfold) compared to baselines such as Gemma 2 for lower-resource languages. On parametric-knowledge tasks the model performs comparably to baselines, suggesting that data quantity alone is not the sole driver of improvement.
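For readers who want to reproduce part of this evaluation, a hedged sketch using EleutherAI's lm-evaluation-harness follows; the Hugging Face repository id and task names are assumptions and should be checked against your installed harness version.

```python
# Sketch: running a subset of the reported benchmarks with EleutherAI's
# lm-evaluation-harness. The model identifier and task names below are
# assumptions -- verify them against your installed harness version.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=TildeAI/TildeOpen-30b,dtype=bfloat16",  # assumed repo id
    tasks=["belebele_lvs_Latn", "arc_challenge", "mmlu"],          # assumed task names
    num_fewshot=0,
    batch_size=4,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```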

Up to 10× reduction in linguistic errors for low-resource languages compared to baselines such as Gemma 2.

Curriculum Training Flow

Initial Uniform Phase (7.5% of training) → Intermediate Natural-Distribution Phase (67.5%) → Final Uniform Phase (25%)

Feature | TildeOpen LLM | Competitor Average
Languages Supported | ✓ 34 European languages | Fewer, English-dominant
Linguistic Equity | ✓ High (balanced tokenization) | Low (skewed token counts)
Data Balancing | ✓ Curriculum learning & upsampling | Limited upsampling or none
Performance (Low-Resource) | ✓ Superior text generation & comprehension | Underperforms relative to English
Error Rate | ✓ Significantly lower (human eval) | Higher linguistic errors

Impact on Baltic Languages

For Baltic languages such as Latvian and Lithuanian, TildeOpen LLM shows a 13.8% improvement in per-character perplexity over other foundational LLMs. This gain is attributed to equitable tokenization, upsampling, and uniform language exposure during the initial and final training phases. Human evaluations of Latvian texts revealed an average of less than one mistake per 100 words, against roughly three for EuroLLM and ten for Gemma 2, underscoring the success of equitable representation for historically under-resourced languages.
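Per-character perplexity normalizes the model's token-level loss by the number of characters rather than tokens, which makes scores comparable across models with different tokenizers. A minimal sketch follows; the repository id is an assumption.

```python
# Sketch: per-character perplexity, which normalizes by character count so
# that models with different tokenizers can be compared fairly.
# The repository id is an assumption; substitute the checkpoint you use.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "TildeAI/TildeOpen-30b"  # assumed repo id
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
model.eval()

def per_char_perplexity(text: str) -> float:
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # out.loss is the mean NLL per predicted token; recover the total NLL
    # and renormalize by the number of characters instead of tokens.
    n_predicted = enc["input_ids"].shape[1] - 1
    total_nll = out.loss.item() * n_predicted
    return math.exp(total_nll / len(text))

print(per_char_perplexity("Rīga ir Latvijas galvaspilsēta."))
```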

Quantify Your Enterprise AI Impact

Estimate the potential annual cost savings and hours reclaimed by deploying TildeOpen LLM in your operations.
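As a stand-in for the page's interactive calculator, the back-of-the-envelope arithmetic looks like this; every input value below is an assumption to be replaced with your own figures.

```python
# Back-of-the-envelope ROI arithmetic behind the page's calculator.
# Every input below is an assumption -- substitute your own figures.
employees        = 200   # staff touching multilingual content
hours_per_week   = 3.0   # hours each spends on tasks the model assists
automation_share = 0.30  # fraction of that time the model reclaims
hourly_cost_eur  = 45.0  # fully loaded cost per hour
working_weeks    = 46

hours_reclaimed = employees * hours_per_week * automation_share * working_weeks
savings_eur = hours_reclaimed * hourly_cost_eur
print(f"Hours reclaimed annually: {hours_reclaimed:,.0f}")
print(f"Potential annual savings: €{savings_eur:,.0f}")
```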


Your TildeOpen LLM Implementation Roadmap

A strategic overview of deploying TildeOpen LLM within your enterprise, ensuring seamless integration and maximizing value.

Phase 1: Needs Assessment & Customization

Identify specific enterprise use cases, data requirements, and customization needs for TildeOpen LLM. This includes initial data integration planning and defining success metrics.

Phase 2: Pilot Deployment & Fine-tuning

Deploy TildeOpen LLM in a pilot environment with a select group of users. Gather feedback, fine-tune the model with proprietary data, and refine integration workflows.

Phase 3: Full-Scale Rollout & Optimization

Scale TildeOpen LLM across the organization, providing training and support. Continuously monitor performance, gather user insights, and optimize for sustained impact and ROI.

Ready to Transform Your Enterprise with Equitable AI?

Connect with our AI specialists to explore how TildeOpen LLM can address your specific language needs and drive innovation.
