PCMIND-2.1-KAIYUAN-2B Technical Report
Revolutionizing Open-Source LLM Pretraining for Enterprise AI
The rapid advancement of Large Language Models (LLMs) has resulted in a significant knowledge gap between the open-source community and industry, primarily because the latter relies on closed-source, high-quality data and training recipes. To address this, we introduce PCMIND-2.1-KAIYUAN-2B, a fully open-source 2-billion-parameter model focused on improving training efficiency and effectiveness under resource constraints. Our methodology includes three key innovations: a Quantile Data Benchmarking method for systematically comparing heterogeneous open-source datasets and providing insights on data mixing strategies; a Strategic Selective Repetition scheme within a multi-phase paradigm to effectively leverage sparse, high-quality data; and a Multi-Domain Curriculum Training policy that orders samples by quality. Supported by a highly optimized data preprocessing pipeline and architectural modifications for FP16 stability, KAIYUAN-2B achieves performance competitive with state-of-the-art fully open-source models, demonstrating practical and scalable solutions for resource-limited pretraining. We release all assets (including model weights, data, and code) under the Apache 2.0 license at https://huggingface.co/thu-pacman/PCMind-2.1-Kaiyuan-2B.
Executive Summary: KAIYUAN-2B
KAIYUAN-2B represents a significant step towards democratizing LLM development by offering a fully open-source, high-performance model. It demonstrates how resource-efficient pretraining strategies, including novel data benchmarking and curriculum training, can achieve competitive results against state-of-the-art models, even with limited computational resources. This initiative provides a transparent blueprint for academic and open-source communities to foster further innovation.
KAIYUAN-2B: Bridging the Open-Source-Industry Gap
- Competitive performance with state-of-the-art fully open-source models.
- Superior efficiency in resource-limited pretraining.
- Demonstrates advanced capabilities in Chinese, Math, and Code.
- All assets (weights, data, code) released under the Apache 2.0 license.
This model facilitates further exploration and innovation in the open-source LLM ecosystem, pushing the frontier of what is achievable under limited resources.
Deep Analysis & Enterprise Applications
KAIYUAN-2B uses a modified Llama architecture with SwiGLU, RMSNorm, and RoPE. Key modifications, including Logits Soft-Capping and Sandwich Normalization, ensure FP16 training stability on Ascend 910A accelerators by preventing overflow and underflow during large-scale pretraining; QK-Norm is also incorporated. Together, these changes allow robust training despite FP16's limited dynamic range.
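To make the stability modifications concrete, below is a minimal PyTorch sketch of logits soft-capping and a sandwich-normalized residual block. The cap value and module layout are illustrative assumptions, not the exact KAIYUAN-2B configuration (`nn.RMSNorm` requires PyTorch 2.4 or newer).

```python
import torch
import torch.nn as nn

def soft_cap(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    """Smoothly bound logits to (-cap, cap) with tanh.

    Unlike a hard clamp, this keeps gradients non-zero while preventing
    FP16 overflow in the output head; the default cap is illustrative.
    """
    return cap * torch.tanh(logits / cap)

class SandwichBlock(nn.Module):
    """Residual sub-layer normalized before *and* after its inner module
    ("sandwich" normalization), which bounds residual-stream growth in FP16."""

    def __init__(self, dim: int, inner: nn.Module):
        super().__init__()
        self.pre_norm = nn.RMSNorm(dim)   # PyTorch >= 2.4
        self.post_norm = nn.RMSNorm(dim)
        self.inner = inner                # e.g. attention or a SwiGLU MLP

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.post_norm(self.inner(self.pre_norm(x)))

# Hypothetical use in the output head: logits = soft_cap(lm_head(hidden_states))
```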
The Quantile Data Benchmarking method systematically compares heterogeneous open-source datasets by training small reference models on data subsets stratified by quality score quantiles. This provides deep insights into data characteristics and guides effective data selection and mixing. Deduplication is performed prior to benchmarking. The Kaiyuan-Spark framework, optimized with Chukonu, handles large-scale data preprocessing and curriculum implementation efficiently.
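As a rough illustration of the quantile stratification step, the sketch below splits a deduplicated corpus into quality-score quantile buckets, each of which could then train a small reference model. The column names, bucket count, and `train_reference_model` call are hypothetical placeholders, not the Kaiyuan-Spark implementation.

```python
import pandas as pd

def quantile_buckets(corpus: pd.DataFrame, score_col: str = "quality_score",
                     n_buckets: int = 4) -> dict[int, pd.DataFrame]:
    """Split a deduplicated corpus into quantile-stratified subsets by quality score."""
    labels = pd.Series(
        pd.qcut(corpus[score_col], q=n_buckets, labels=False, duplicates="drop"),
        index=corpus.index,
    )
    return {int(b): corpus[labels == b] for b in sorted(labels.dropna().unique())}

# Hypothetical usage: compare quality strata of one open dataset via small proxy runs.
# corpus = pd.read_parquet("open_dataset.parquet")      # columns: text, quality_score
# for bucket_id, subset in quantile_buckets(corpus).items():
#     train_reference_model(subset)                     # placeholder proxy-model run
```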
A multi-phase training paradigm with Strategic Selective Repetition ensures that higher-quality data samples (identified via benchmarking) are repeated more frequently across phases. Multi-Domain Curriculum Training orders samples by quality, gradually introducing Chinese, code, and math datasets in later phases. A moderate learning-rate decay and model averaging over the final checkpoints are used to maximize the benefits of the curriculum.
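A minimal sketch of the two mechanisms described above follows: a per-sample repetition rule keyed to quality tier, and uniform averaging of the final checkpoints. The tier thresholds and repetition counts are placeholders for illustration, not the schedule actually used for KAIYUAN-2B.

```python
import torch

def repetition_count(quality: float, phase: int) -> int:
    """Illustrative selective-repetition rule: higher-quality samples recur
    in more training phases (thresholds and counts are placeholders)."""
    if quality >= 0.9:                    # top tier: repeated every phase
        return phase + 1
    if quality >= 0.6:                    # mid tier: repeated roughly half as often
        return max(1, (phase + 1) // 2)
    return 1                              # low tier: seen once

def average_checkpoints(paths: list[str]) -> dict[str, torch.Tensor]:
    """Uniformly average the parameter tensors of the final few checkpoints."""
    avg: dict[str, torch.Tensor] | None = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {name: tensor.float().clone() for name, tensor in state.items()}
        else:
            for name, tensor in state.items():
                avg[name] += tensor.float()
    return {name: tensor / len(paths) for name, tensor in avg.items()}
```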
Key Performance Indicator
46.05% Average Core Capability Score (Chinese, Math, Code)

| Feature | KAIYUAN-2B | Gemma2-2B | Qwen2.5-1.5B |
|---|---|---|---|
| Model Parameters | 2B | 2B | 1.5B |

Additional comparison dimensions: Training Efficiency, Chinese Language, Mathematics, Code Generation.
Strategic Data Utilization for Performance Gains
KAIYUAN-2B's training strategy directly addresses data heterogeneity and resource constraints. By systematically benchmarking data quality and employing selective repetition for high-quality samples, the model achieves better performance with fewer tokens. This approach is particularly effective in improving capabilities in domains like Chinese, Math, and Code, where high-quality data is often sparse.
- Achieved competitive performance against larger models.
- Demonstrated efficacy of selective repetition for high-quality data.
- Improved data utilization efficiency through curriculum training.
- Reduced training instability with architectural optimizations for FP16.
This methodology provides a scalable framework for future LLM pretraining, allowing academic and open-source initiatives to push performance frontiers without requiring massive compute budgets.
Your AI Implementation Roadmap
A structured approach to integrating KAIYUAN-2B into your enterprise, ensuring a smooth transition and maximum impact.
Phase 1: Discovery & Strategy Alignment
Engage with our AI specialists to assess your current infrastructure, identify key business challenges, and define clear AI objectives aligned with KAIYUAN-2B's capabilities.
Phase 2: Data Preparation & Model Customization
Leverage our data processing frameworks to prepare your proprietary datasets. Fine-tune KAIYUAN-2B with your data using our optimized curriculum training strategies.
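As a starting point for such customization, the sketch below loads the released checkpoint from the Hugging Face Hub with the transformers library. Whether the repository works with the stock Auto classes (or needs `trust_remote_code`) is an assumption here, and the generation prompt is purely illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "thu-pacman/PCMind-2.1-Kaiyuan-2B"

# trust_remote_code is an assumption: only required if the repo ships custom modeling code.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Quick sanity check before launching a fine-tuning run on proprietary data.
inputs = tokenizer("KAIYUAN-2B is", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```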
Phase 3: Integration & Pilot Deployment
Seamlessly integrate the customized KAIYUAN-2B model into your existing systems. Conduct pilot programs to validate performance and gather initial user feedback.
Phase 4: Scaling & Continuous Optimization
Expand KAIYUAN-2B deployment across relevant departments. Implement continuous monitoring and optimization loops to maintain peak performance and adapt to evolving needs.
Ready to Transform Your Enterprise with AI?
Connect with our experts to explore how KAIYUAN-2B can be tailored to your specific needs and drive innovation within your organization.