Enterprise AI Analysis: Tucano 2 Cool: Better Open Source LLMs for Portuguese


Revolutionizing Portuguese NLP with Tucano 2

Tucano 2 offers a fully open suite of large language models (LLMs) from 0.5B to 3.7B parameters, specifically designed to advance open-source development for Portuguese. This initiative addresses critical gaps in available resources, providing high-quality datasets, optimized training recipes, and a comprehensive evaluation framework to empower the broader NLP community.

Executive Impact & Key Advantages

Tucano 2 delivers state-of-the-art performance and efficiency for open-source Portuguese LLMs in the 0.5B-3.7B parameter range. Our comprehensive approach ensures high-quality, reproducible, and accessible AI solutions.

• 320B+ tokens of high-quality Portuguese data released (GigaVerbo-v2)
• 92% energy reduction (Tucano2-0.6B-Base vs. Tucano-2b4)
• Base, Instruct, and Think models released across 0.5B-3.7B parameters
• 4 high-quality datasets released (GigaVerbo-v2, GigaVerbo-v2 Synth, GigaVerbo-v2 SFT, GigaVerbo-v2 Preferences)

Deep Analysis & Enterprise Applications

Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.

Pretraining Data Innovation

Our work introduces GigaVerbo-v2, a 320B-token Portuguese corpus refined with LLM-judged quality annotations. This is complemented by GigaVerbo-v2 Synth, a 9.3B-token synthetic dataset, designed to fill domain gaps. Ablation studies show that models trained on educational and synthetic data significantly outperform those trained on non-educational web data alone.
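
To make the annotation concrete, here is a minimal sketch of how LLM-judged quality scores can be used to filter a corpus. The field name "quality_score", the threshold, and the file name are illustrative assumptions, not the exact GigaVerbo-v2 schema.

```python
# Hypothetical sketch: filtering a web corpus by LLM-judged quality scores.
# Field names ("text", "quality_score") and the threshold are illustrative,
# not the exact schema used for GigaVerbo-v2.
from datasets import load_dataset

def keep_high_quality(example, min_score=3):
    # Keep documents the LLM judge rated as educational / high quality.
    return example["quality_score"] >= min_score

corpus = load_dataset("json", data_files="gigaverbo_v2_annotated.jsonl", split="train")
filtered = corpus.filter(keep_high_quality)
print(f"kept {len(filtered)} of {len(corpus)} documents")
```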

Efficient Tokenization Strategy

We developed a custom SentencePiece tokenizer (49,152 vocabulary size) trained on a 40-40-20 mixture of Portuguese, English, and code. This tokenizer achieves the lowest subword fertility (1.51) and highest compression efficiency (2.88 characters per token), resulting in estimated computational savings of approximately 30% compared to Qwen3's tokenizer.
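
As an illustration, the sketch below shows how a tokenizer of this vocabulary size could be trained with the SentencePiece library and how fertility and compression can be measured. The input file, the pre-sampling of the 40-40-20 mixture, and the BPE model type are assumptions; only the 49,152 vocabulary size comes from the work described here.

```python
# Minimal sketch of training and measuring a SentencePiece tokenizer.
# File names and the sampled 40-40-20 PT/EN/code mixture are assumptions.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="mixture_pt_en_code.txt",   # pre-sampled 40-40-20 mixture (assumed file)
    model_prefix="tucano2_tokenizer",
    vocab_size=49_152,
    model_type="bpe",                 # assumed; the actual model type may differ
)

sp = spm.SentencePieceProcessor(model_file="tucano2_tokenizer.model")

def fertility_and_compression(texts):
    """Subword fertility = tokens per whitespace word; compression = chars per token."""
    n_tokens = sum(len(sp.encode(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    n_chars = sum(len(t) for t in texts)
    return n_tokens / n_words, n_chars / n_tokens
```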

Robust Evaluation Framework

We constructed a new Portuguese evaluation suite, replacing high-noise generative tasks with log-likelihood evaluations. This two-tier suite (Easy Set for early stages, Hard Set for advanced capabilities) provides reliable signals across training phases, including adapted benchmarks for instruction-following, mathematical reasoning, long-context, and coding capabilities.
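
For readers unfamiliar with log-likelihood scoring, here is a minimal sketch of how a multiple-choice item can be scored this way with Hugging Face Transformers. The model path and the prompt handling are placeholders, not the evaluation harness used in the study.

```python
# Sketch of log-likelihood scoring for one answer option, as used in place of
# noisy generative metrics. The model path is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/tucano2-base")      # placeholder
model = AutoModelForCausalLM.from_pretrained("path/to/tucano2-base")

def option_logprob(context: str, option: str) -> float:
    # Sum of log-probabilities the model assigns to the option tokens given the context.
    # Production harnesses tokenize the context/option boundary more carefully.
    ctx_ids = tok(context, return_tensors="pt").input_ids
    full_ids = tok(context + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = model(full_ids).logits.log_softmax(-1)
    option_positions = range(ctx_ids.shape[1], full_ids.shape[1])
    return sum(logprobs[0, i - 1, full_ids[0, i]].item() for i in option_positions)

# The predicted answer is the option with the highest (optionally length-normalized) score.
```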

Cost-Effective Continual Pretraining

Addressing the challenge of achieving Hard Set capabilities under constrained budgets, we adopted a continual pretraining strategy. This involves adapting larger multilingual base models (Qwen3 series) to Portuguese using our curated datasets and tokenizer transplantation. This approach yields substantial performance gains with negligible additional compute.
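
The sketch below illustrates one common tokenizer-transplantation recipe: resize the base model's embedding matrix to the new vocabulary and copy rows for tokens shared between the two vocabularies. It is a simplified illustration under stated assumptions, not necessarily the exact procedure used for the Tucano2-qwen models; the new-tokenizer path is a placeholder.

```python
# Hedged sketch of a common tokenizer-transplantation recipe: resize the base model's
# embeddings to the new vocabulary, copy rows for tokens present in both vocabularies,
# and initialize the remainder from the mean embedding.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B-Base")
old_tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Base")
new_tok = AutoTokenizer.from_pretrained("path/to/tucano2_tokenizer")  # placeholder

old_emb = base.get_input_embeddings().weight.data.clone()
base.resize_token_embeddings(len(new_tok))
new_emb = base.get_input_embeddings().weight.data
new_emb[:] = old_emb.mean(dim=0)  # default init for tokens with no counterpart

old_vocab = old_tok.get_vocab()
for token, new_id in new_tok.get_vocab().items():
    # Real pipelines usually normalize tokenizer-specific whitespace markers
    # (e.g. "Ġ" vs. "▁") before matching; skipped here for brevity.
    old_id = old_vocab.get(token)
    if old_id is not None:
        new_emb[new_id] = old_emb[old_id]

# When input and output embeddings are not tied, the lm_head rows need the same treatment.
```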

Advanced Post-Training Alignment

Our post-training pipeline includes Supervised Fine-Tuning (SFT) and Anchored Preference Optimization (APO) using two new Portuguese alignment datasets: GigaVerbo-v2 SFT (4.1M examples) and GigaVerbo-v2 Preferences (28K contrastive pairs). This enhances instruction-following, reasoning, and safety-focused responses.
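
As a rough illustration of the preference-optimization step, the sketch below uses TRL's DPOTrainer with an APO-style loss. The dataset columns, file paths, hyperparameters, and the availability of loss_type="apo_zero" in your TRL version are assumptions; the actual training configuration may differ.

```python
# Hedged sketch of anchored preference optimization via TRL, under the assumption
# that the installed TRL version exposes APO losses through DPOConfig.loss_type.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "path/to/tucano2-sft-checkpoint"  # placeholder: the SFT model to align
model = AutoModelForCausalLM.from_pretrained(model_id)
tok = AutoTokenizer.from_pretrained(model_id)

# Preference data with "prompt", "chosen", "rejected" columns (assumed format).
prefs = load_dataset("json", data_files="gigaverbo_v2_preferences.jsonl", split="train")

config = DPOConfig(
    output_dir="tucano2-apo",
    loss_type="apo_zero",   # anchored preference loss; "sigmoid" would give vanilla DPO
    beta=0.1,               # illustrative value
    per_device_train_batch_size=4,
)
DPOTrainer(model=model, args=config, train_dataset=prefs, processing_class=tok).train()
```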

Sustainable AI Development

All energy and carbon estimates were tracked with CodeCarbon. Synthetic data generation accounted for 73% of the total tracked energy. Continual pretraining proves to be a compute-efficient path to Hard Set capabilities, requiring only ~2.7x the energy of our from-scratch pretraining. The total tracked carbon footprint (~7,900 kg CO2e) is far below that of frontier models, demonstrating sustainable LLM development for low-resource languages.
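
For reference, here is a minimal sketch of per-run energy tracking with CodeCarbon; the training function is a placeholder, and measurement granularity is a project-specific choice.

```python
# Minimal sketch of tracking energy and emissions with CodeCarbon.
from codecarbon import EmissionsTracker

def run_training():
    """Placeholder for the actual pretraining / fine-tuning loop."""
    pass

tracker = EmissionsTracker(project_name="tucano2-pretraining")
tracker.start()
try:
    run_training()
finally:
    emissions_kg = tracker.stop()  # estimated kg CO2e for the tracked run
    print(f"Estimated emissions: {emissions_kg:.2f} kg CO2e")
```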

92% reduction in energy consumption for Tucano2-0.6B-Base vs. Tucano-2b4

GigaVerbo-v2 Synth Generation Pipeline

• Carefully crafted prompts
• Diverse seed datasets
• State-of-the-art LLMs for generation
• Quality filtering and decontamination (a generic decontamination check is sketched below)
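
One generic form of the decontamination step is a long n-gram overlap test against the evaluation suite, sketched below. This is an illustrative recipe with placeholder data, not necessarily the exact filter applied to GigaVerbo-v2 Synth.

```python
# Illustrative sketch of n-gram decontamination: drop synthetic documents that share
# a long word n-gram with any evaluation-suite text.
def ngrams(text: str, n: int = 13) -> set[str]:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(document: str, benchmark_grams: set[str], n: int = 13) -> bool:
    return bool(ngrams(document, n) & benchmark_grams)

benchmark_texts = ["..."]   # evaluation-suite passages to guard against (placeholder)
synthetic_docs = ["..."]    # candidate synthetic documents (placeholder)

benchmark_grams = set().union(*(ngrams(t) for t in benchmark_texts))
clean_docs = [d for d in synthetic_docs if not is_contaminated(d, benchmark_grams)]
```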

Aggregate Benchmark (NPM) Results: Tucano2-qwen-3.7B-Base vs. Baselines

Model | Total Avg. NPM | Key Strengths
Tucano2-qwen-3.7B-Base | 59.21 | Highest K&R score in the 3-4B range; outperforms Qwen3-4B, SmolLM3-3B, and Gemma-3-Gaia-PT-BR-4b-it
Qwen2.5-7B | 57.97 | Competitive performance for its size
Qwen3-4B-Base | 57.86 | Strong general multilingual performance; high IFEval/Code scores
SmolLM3-3B-Base | 50.25 | Supports multiple languages; dual-mode instruct/reasoning
Tucano2-qwen-1.5B-Base | 47.90 | Strong gains over Qwen3-1.7B-Base and comparable domain-adapted models
Tucano2-qwen-0.5B-Base | 35.36 | Significant improvement over Qwen3-0.6B-Base and the from-scratch Tucano2-0.6B-Base

Tucano 2: Empowering Portuguese NLP

The Tucano 2 project provides a fully open suite of Portuguese large language models, spanning 0.5B to 3.7B parameters. By focusing on language-specific data curation, efficient tokenization, and a robust evaluation framework, Tucano 2 models demonstrate superior performance on Portuguese benchmarks compared to similarly sized multilingual and prior Portuguese baselines. This open-source release, including models, datasets, training recipes, and evaluation code, significantly lowers barriers to entry and catalyzes community-driven progress for Portuguese and other low-resource languages.

Key Achievements:

  • State-of-the-art Portuguese performance in 0.5B-3.7B parameter range.
  • Over 320 billion tokens of high-quality Portuguese data released.
  • Efficient tokenization reducing compute costs by ~30%.
  • Comprehensive evaluation suite for all training stages.

Calculate Your Potential AI Savings

Discover the significant efficiency gains and cost reductions your enterprise could achieve by implementing language-optimized AI solutions like Tucano 2. Our models are designed to integrate seamlessly, providing immediate value.


Our Phased Implementation Roadmap

Our proven methodology guides you through every step of integrating advanced AI, from foundational data preparation to deploying production-ready models.

Data Curation & Annotation

Constructed GigaVerbo-v2 (320B tokens) and GigaVerbo-v2 Synth (9.3B tokens) with LLM-judged quality and toxicity annotations.

Tokenization Optimization

Developed a custom SentencePiece tokenizer optimized for Portuguese, English, and code, achieving ~30% compute savings.

Model Pretraining (Tucano 2 Base)

Trained Tucano2-0.6B-Base on 408B tokens, achieving strong Easy Set performance with 92% less energy than prior models.

Continual Pretraining (Tucano 2 qwen-Base)

Adapted Qwen3 base models to Portuguese using tokenizer transplantation and focused pretraining (50-100B tokens), outperforming larger baselines.

Post-Training (Instruct & Think)

Supervised Fine-Tuning (SFT) and Anchored Preference Optimization (APO) on GigaVerbo-v2 SFT and Preferences datasets to create Instruct and Think variants.

Empower Your Enterprise with Tucano 2

Request a tailored consultation to integrate these open-source LLMs into your workflows.

Ready to Get Started?

Book Your Free Consultation.

AI Consultation Booking