Enterprise AI Analysis
Revolutionizing Portuguese NLP with Tucano 2
Tucano 2 offers a fully open suite of large language models (LLMs) from 0.5B to 3.7B parameters, specifically designed to advance open-source development for Portuguese. This initiative addresses critical gaps in available resources, providing high-quality datasets, optimized training recipes, and a comprehensive evaluation framework to empower the broader NLP community.
Executive Impact & Key Advantages
Tucano 2 delivers state-of-the-art performance and efficiency among open-source Portuguese LLMs in its size range. Our comprehensive approach ensures high-quality, reproducible, and accessible AI solutions.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Pretraining Data Innovation
Our work introduces GigaVerbo-v2, a 320B-token Portuguese corpus refined with LLM-judged quality annotations. This is complemented by GigaVerbo-v2 Synth, a 9.3B-token synthetic dataset, designed to fill domain gaps. Ablation studies show that models trained on educational and synthetic data significantly outperform those trained on non-educational web data alone.
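The quality-filtering step above can be sketched as a simple threshold filter over LLM-judged annotations. The field names (`quality`, `toxicity`) and thresholds here are illustrative assumptions, not the actual GigaVerbo-v2 schema:

```python
def filter_by_quality(docs, min_quality=3, max_toxicity=0.2):
    """Keep documents whose LLM-judged scores pass both thresholds.

    Field names and thresholds are hypothetical; the real GigaVerbo-v2
    annotation schema may differ.
    """
    return [
        d for d in docs
        if d["quality"] >= min_quality and d["toxicity"] <= max_toxicity
    ]

corpus = [
    {"text": "Aula sobre fotossintese ...", "quality": 5, "toxicity": 0.0},
    {"text": "Spam de apostas ...", "quality": 1, "toxicity": 0.6},
]
kept = filter_by_quality(corpus)  # only the educational document survives
```

In practice such filters are applied per document before sampling the pretraining mixture, so the educational and synthetic subsets can be upweighted as the ablations suggest.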
Efficient Tokenization Strategy
We developed a custom SentencePiece tokenizer (49,152 vocabulary size) trained on a 40-40-20 mixture of Portuguese, English, and code. This tokenizer achieves the lowest subword fertility (1.51) and highest compression efficiency (2.88 characters per token), resulting in estimated computational savings of approximately 30% compared to Qwen3's tokenizer.
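The two tokenizer metrics quoted above can be computed directly from a tokenizer's output. A minimal sketch, using a hypothetical hand-made segmentation in place of the real SentencePiece model:

```python
def fertility(words, tokens):
    # Subword fertility: average number of tokens produced per
    # whitespace-separated word (lower is better).
    return len(tokens) / len(words)

def chars_per_token(text, tokens):
    # Compression efficiency: characters of raw text per token
    # (higher is better).
    return len(text) / len(tokens)

text = "os tucanos voam"
words = text.split()                    # 3 words
tokens = ["os", "tucan", "os", "voam"]  # hypothetical segmentation
```

On this toy input the fertility is 4/3 ≈ 1.33 and the compression is 15/4 = 3.75 characters per token; the paper's figures (1.51 and 2.88) are corpus-level averages over Portuguese text.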
Robust Evaluation Framework
We constructed a new Portuguese evaluation suite, replacing high-noise generative tasks with log-likelihood evaluations. This two-tier suite (Easy Set for early stages, Hard Set for advanced capabilities) provides reliable signals across training phases, including adapted benchmarks for instruction-following, mathematical reasoning, long-context, and coding capabilities.
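Log-likelihood evaluation scores each answer option under the model and picks the highest-scoring one, avoiding the parsing noise of free-form generation. A minimal sketch with made-up per-token log-probabilities:

```python
def pick_answer(option_logprobs, normalize_by_length=True):
    """Select the multiple-choice option with the highest (optionally
    length-normalized) log-likelihood under the model.

    option_logprobs maps each choice to its per-token log-probs;
    the values here are illustrative, not real model outputs.
    """
    def score(lps):
        total = sum(lps)
        return total / len(lps) if normalize_by_length else total
    return max(option_logprobs, key=lambda c: score(option_logprobs[c]))

scores = {
    "A": [-0.2, -0.4, -0.3],  # mean log-prob -0.30
    "B": [-0.1, -0.9],        # mean log-prob -0.50
}
```

Length normalization matters because shorter continuations otherwise get an unfair advantage; evaluation harnesses typically report both variants.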
Cost-Effective Continual Pretraining
Addressing the challenge of achieving Hard Set capabilities under constrained budgets, we adopted a continual pretraining strategy. This involves adapting larger multilingual base models (Qwen3 series) to Portuguese using our curated datasets and tokenizer transplantation. This approach yields substantial performance gains with negligible additional compute.
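Tokenizer transplantation needs an embedding matrix for the new vocabulary. One common heuristic (in the spirit of WECHSEL/FOCUS-style initialization; the exact procedure used for Tucano 2 may differ) re-tokenizes each new token with the old tokenizer and averages the old embeddings:

```python
def transplant_embeddings(old_emb, new_vocab, old_tokenize):
    # For each token in the new vocabulary, segment it with the OLD
    # tokenizer and average the old embeddings of the known pieces.
    # Tokens with no known pieces are skipped (left for random init).
    new_emb = {}
    for tok in new_vocab:
        pieces = [p for p in old_tokenize(tok) if p in old_emb]
        if not pieces:
            continue
        dim = len(old_emb[pieces[0]])
        new_emb[tok] = [
            sum(old_emb[p][i] for p in pieces) / len(pieces)
            for i in range(dim)
        ]
    return new_emb

# Toy 2-dimensional embeddings and a fake "old" tokenizer.
old_emb = {"tu": [1.0, 0.0], "cano": [0.0, 1.0]}
old_tokenize = lambda t: {"tucano": ["tu", "cano"]}.get(t, [t])

new_emb = transplant_embeddings(old_emb, ["tucano", "tu"], old_tokenize)
```

This gives the continual-pretraining run a warm start, so most of the 50-100B adaptation tokens go into learning Portuguese rather than relearning embeddings from scratch.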
Advanced Post-Training Alignment
Our post-training pipeline includes Supervised Fine-Tuning (SFT) and Anchored Preference Optimization (APO) using two new Portuguese alignment datasets: GigaVerbo-v2 SFT (4.1M examples) and GigaVerbo-v2 Preferences (28K contrastive pairs). This enhances instruction-following, reasoning, and safety-focused responses.
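The two alignment datasets feed different pipeline stages: SFT consumes prompt/response pairs, while APO consumes contrastive (chosen vs. rejected) pairs. The record shapes below are illustrative assumptions, not the published GigaVerbo-v2 schemas:

```python
# Hypothetical record shapes for the two alignment stages.
sft_example = {
    "prompt": "Explique a fotossintese em uma frase.",
    "response": "A fotossintese converte luz, agua e CO2 em glicose e oxigenio.",
}

preference_pair = {
    "prompt": "Como posso invadir a conta de alguem?",
    "chosen": "Nao posso ajudar com isso; acessar contas alheias e ilegal.",
    "rejected": "Primeiro, tente adivinhar a senha...",
}

def to_chat(example):
    # Convert an SFT record into the chat-message format most
    # fine-tuning frameworks expect.
    return [
        {"role": "user", "content": example["prompt"]},
        {"role": "assistant", "content": example["response"]},
    ]

messages = to_chat(sft_example)
```

SFT teaches the model to follow the chat format; the preference stage then nudges it toward the `chosen` response style, which is where the safety-focused behavior comes from.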
Sustainable AI Development
All energy and carbon estimates were tracked using CodeCarbon. Synthetic data generation accounted for 73% of the total tracked energy. Continual pretraining proved a compute-efficient path: adapting a much larger base model consumed only ~2.7x the energy of from-scratch pretraining. Our total tracked carbon footprint (~7,900 kg CO2e) is significantly lower than that of frontier models, demonstrating sustainable LLM development for low-resource languages.
GigaVerbo-v2 Synth Generation Pipeline
| Model | Total Avg. NPM | Key Strengths |
|---|---|---|
| Tucano2-qwen-3.7B-Base | 59.21 | |
| Qwen3-4B-Base | 57.86 | |
| Qwen2.5-7B | 57.97 | |
| SmolLM3-3B-Base | 50.25 | |
| Tucano2-qwen-1.5B-Base | 47.90 | |
| Tucano2-qwen-0.5B-Base | 35.36 | |
Tucano 2: Empowering Portuguese NLP
The Tucano 2 project provides a fully open suite of Portuguese large language models, spanning 0.5B to 3.7B parameters. By focusing on language-specific data curation, efficient tokenization, and a robust evaluation framework, Tucano 2 models demonstrate superior performance on Portuguese benchmarks compared to similarly sized multilingual and prior Portuguese baselines. This open-source release, including models, datasets, training recipes, and evaluation code, significantly lowers barriers to entry and catalyzes community-driven progress for Portuguese and other low-resource languages.
Key Achievements:
- State-of-the-art Portuguese performance in the 0.5B-3.7B parameter range.
- Over 320 billion tokens of high-quality Portuguese data released.
- Efficient tokenization reducing compute costs by ~30%.
- Comprehensive evaluation suite for all training stages.
Calculate Your Potential AI Savings
Discover the significant efficiency gains and cost reductions your enterprise could achieve by implementing language-optimized AI solutions like Tucano 2. Our models are designed to integrate seamlessly, providing immediate value.
Our Phased Implementation Roadmap
Our proven methodology guides you through every step of integrating advanced AI, from foundational data preparation to deploying production-ready models.
Data Curation & Annotation
Constructed GigaVerbo-v2 (320B tokens) and GigaVerbo-v2 Synth (9.3B tokens) with LLM-judged quality and toxicity annotations.
Tokenization Optimization
Developed a custom SentencePiece tokenizer optimized for Portuguese, English, and code, achieving ~30% compute savings.
Model Pretraining (Tucano 2 Base)
Trained Tucano2-0.6B-Base on 408B tokens, achieving strong Easy Set performance with 92% less energy than prior models.
Continual Pretraining (Tucano 2 qwen-Base)
Adapted Qwen3 base models to Portuguese using tokenizer transplantation and focused pretraining (50-100B tokens), outperforming larger baselines.
Post-Training (Instruct & Think)
Supervised Fine-Tuning (SFT) and Anchored Preference Optimization (APO) on GigaVerbo-v2 SFT and Preferences datasets to create Instruct and Think variants.
Empower Your Enterprise with Tucano 2
Request a tailored consultation to integrate these open-source LLMs into your workflows.