PCMIND-2.1-KAIYUAN-2B Technical Report
Revolutionizing Open-Source LLM Pretraining for Enterprise AI
The rapid advancement of Large Language Models (LLMs) has resulted in a significant knowledge gap between the open-source community and industry, primarily because the latter relies on closed-source, high-quality data and training recipes. To address this, we introduce PCMIND-2.1-KAIYUAN-2B, a fully open-source 2-billion-parameter model focused on improving training efficiency and effectiveness under resource constraints. Our methodology includes three key innovations: a Quantile Data Benchmarking method for systematically comparing heterogeneous open-source datasets and providing insights on data mixing strategies; a Strategic Selective Repetition scheme within a multi-phase paradigm to effectively leverage sparse, high-quality data; and a Multi-Domain Curriculum Training policy that orders samples by quality. Supported by a highly optimized data preprocessing pipeline and architectural modifications for FP16 stability, KAIYUAN-2B achieves performance competitive with state-of-the-art fully open-source models, demonstrating practical and scalable solutions for resource-limited pretraining. We release all assets (including model weights, data, and code) under the Apache 2.0 license at https://huggingface.co/thu-pacman/PCMind-2.1-Kaiyuan-2B.
Executive Summary: KAIYUAN-2B
KAIYUAN-2B represents a significant step towards democratizing LLM development by offering a fully open-source, high-performance model. It demonstrates how resource-efficient pretraining strategies, including novel data benchmarking and curriculum training, can achieve competitive results against state-of-the-art models, even with limited computational resources. This initiative provides a transparent blueprint for academic and open-source communities to foster further innovation.
KAIYUAN-2B: Bridging the Open-Source-Industry Gap
- Competitive performance with state-of-the-art fully open-source models.
- Superior efficiency in resource-limited pretraining.
- Demonstrates advanced capabilities in Chinese, Math, and Code.
- All assets (weights, data, code) released under the Apache 2.0 license.
This model facilitates further exploration and innovation in the open-source LLM ecosystem, pushing the frontier of what is achievable under limited resources.
Deep Analysis & Enterprise Applications
KAIYUAN-2B uses a modified Llama architecture with SwiGLU, RMSNorm, and RoPE. Key modifications, including Logits Soft-Capping and Sandwich Normalization, ensure FP16 training stability on Ascend 910A accelerators by preventing overflow and underflow during large-scale pretraining; QK-Norm is also incorporated. Together, these changes allow robust training despite FP16's limited dynamic range.
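To make the stability modifications concrete, below is a minimal PyTorch sketch of logits soft-capping and a sandwich-normalized residual block. The cap value and module layout are illustrative assumptions, not the exact KAIYUAN-2B configuration (`nn.RMSNorm` requires PyTorch 2.4 or newer).

```python
import torch
import torch.nn as nn

def soft_cap(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    """Smoothly bound logits to (-cap, cap) with tanh.

    Unlike a hard clamp, this keeps gradients non-zero while preventing
    FP16 overflow in the output head; the default cap is illustrative.
    """
    return cap * torch.tanh(logits / cap)

class SandwichBlock(nn.Module):
    """Residual sub-layer normalized before *and* after its inner module
    ("sandwich" normalization), which bounds residual-stream growth in FP16."""

    def __init__(self, dim: int, inner: nn.Module):
        super().__init__()
        self.pre_norm = nn.RMSNorm(dim)   # PyTorch >= 2.4
        self.post_norm = nn.RMSNorm(dim)
        self.inner = inner                # e.g. attention or a SwiGLU MLP

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.post_norm(self.inner(self.pre_norm(x)))

# Hypothetical use in the output head: logits = soft_cap(lm_head(hidden_states))
```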
The Quantile Data Benchmarking method systematically compares heterogeneous open-source datasets by training small reference models on data subsets stratified by quality score quantiles. This provides deep insights into data characteristics and guides effective data selection and mixing. Deduplication is performed prior to benchmarking. The Kaiyuan-Spark framework, optimized with Chukonu, handles large-scale data preprocessing and curriculum implementation efficiently.
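As a rough illustration of the quantile stratification step, the sketch below splits a deduplicated corpus into quality-score quantile buckets, each of which could then train a small reference model. The column names, bucket count, and `train_reference_model` call are hypothetical placeholders, not the Kaiyuan-Spark implementation.

```python
import pandas as pd

def quantile_buckets(corpus: pd.DataFrame, score_col: str = "quality_score",
                     n_buckets: int = 4) -> dict[int, pd.DataFrame]:
    """Split a deduplicated corpus into quantile-stratified subsets by quality score."""
    labels = pd.Series(
        pd.qcut(corpus[score_col], q=n_buckets, labels=False, duplicates="drop"),
        index=corpus.index,
    )
    return {int(b): corpus[labels == b] for b in sorted(labels.dropna().unique())}

# Hypothetical usage: compare quality strata of one open dataset via small proxy runs.
# corpus = pd.read_parquet("open_dataset.parquet")      # columns: text, quality_score
# for bucket_id, subset in quantile_buckets(corpus).items():
#     train_reference_model(subset)                     # placeholder proxy-model run
```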
A multi-phase training paradigm with Strategic Selective Repetition ensures that higher-quality data samples (identified via benchmarking) are repeated more frequently across phases. Multi-Domain Curriculum Training orders samples by quality, gradually introducing Chinese, code, and math datasets in later phases. A moderate learning-rate decay and model averaging over the final checkpoints are used to maximize the benefits of the curriculum.
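A minimal sketch of the two mechanisms described above follows: a per-sample repetition rule keyed to quality tier, and uniform averaging of the final checkpoints. The tier thresholds and repetition counts are placeholders for illustration, not the schedule actually used for KAIYUAN-2B.

```python
import torch

def repetition_count(quality: float, phase: int) -> int:
    """Illustrative selective-repetition rule: higher-quality samples recur
    in more training phases (thresholds and counts are placeholders)."""
    if quality >= 0.9:                    # top tier: repeated every phase
        return phase + 1
    if quality >= 0.6:                    # mid tier: repeated roughly half as often
        return max(1, (phase + 1) // 2)
    return 1                              # low tier: seen once

def average_checkpoints(paths: list[str]) -> dict[str, torch.Tensor]:
    """Uniformly average the parameter tensors of the final few checkpoints."""
    avg: dict[str, torch.Tensor] | None = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {name: tensor.float().clone() for name, tensor in state.items()}
        else:
            for name, tensor in state.items():
                avg[name] += tensor.float()
    return {name: tensor / len(paths) for name, tensor in avg.items()}
```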
Key Performance Indicator
46.05% Average Core Capability Score (Chinese, Math, Code)

| Feature | KAIYUAN-2B | Gemma2-2B | Qwen2.5-1.5B |
|---|---|---|---|
| Model Parameters | 2B | 2B | 1.5B |

Additional comparison dimensions: Training Efficiency, Chinese Language, Mathematics, Code Generation.
Strategic Data Utilization for Performance Gains
KAIYUAN-2B's training strategy directly addresses data heterogeneity and resource constraints. By systematically benchmarking data quality and employing selective repetition for high-quality samples, the model achieves better performance with fewer tokens. This approach is particularly effective in improving capabilities in domains like Chinese, Math, and Code, where high-quality data is often sparse.
- Achieved competitive performance against larger models.
- Demonstrated efficacy of selective repetition for high-quality data.
- Improved data utilization efficiency through curriculum training.
- Reduced training instability with architectural optimizations for FP16.
This methodology provides a scalable framework for future LLM pretraining, allowing academic and open-source initiatives to push performance frontiers without requiring massive compute budgets.
Your AI Implementation Roadmap
A structured approach to integrating KAIYUAN-2B into your enterprise, ensuring a smooth transition and maximum impact.
Phase 1: Discovery & Strategy Alignment
Engage with our AI specialists to assess your current infrastructure, identify key business challenges, and define clear AI objectives aligned with KAIYUAN-2B's capabilities.
Phase 2: Data Preparation & Model Customization
Leverage our data processing frameworks to prepare your proprietary datasets. Fine-tune KAIYUAN-2B with your data using our optimized curriculum training strategies.
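As a starting point for such customization, the sketch below loads the released checkpoint from the Hugging Face Hub with the transformers library. Whether the repository works with the stock Auto classes (or needs `trust_remote_code`) is an assumption here, and the generation prompt is purely illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "thu-pacman/PCMind-2.1-Kaiyuan-2B"

# trust_remote_code is an assumption: only required if the repo ships custom modeling code.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Quick sanity check before launching a fine-tuning run on proprietary data.
inputs = tokenizer("KAIYUAN-2B is", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```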
Phase 3: Integration & Pilot Deployment
Seamlessly integrate the customized KAIYUAN-2B model into your existing systems. Conduct pilot programs to validate performance and gather initial user feedback.
Phase 4: Scaling & Continuous Optimization
Expand KAIYUAN-2B deployment across relevant departments. Implement continuous monitoring and optimization loops to maintain peak performance and adapt to evolving needs.
Ready to Transform Your Enterprise with AI?
Connect with our experts to explore how KAIYUAN-2B can be tailored to your specific needs and drive innovation within your organization.