Technical Report Analysis
Unlocking Next-Gen AI: Introducing LongCat-Flash
LongCat-Flash is a 560-billion-parameter Mixture-of-Experts (MoE) language model with two novel designs: Zero-computation Experts for dynamic budget allocation (18.6B–31.3B activated parameters) and Shortcut-connected MoE for enhanced inference efficiency. It was trained on >20 trillion tokens in 30 days with a multi-stage strategy for agentic intelligence. It achieves >100 TPS inference at $0.70/million output tokens, outperforming leading models in agentic tasks.
Executive Impact & Performance Metrics
LongCat-Flash delivers exceptional performance across key operational and efficiency benchmarks, showcasing its potential for enterprise AI applications.
Deep Analysis & Enterprise Applications
Architecture Innovations
LongCat-Flash introduces a novel Mixture-of-Experts (MoE) architecture with key innovations aimed at computational efficiency and dynamic resource allocation. These include Zero-computation Experts, Shortcut-connected MoE, and Variance Alignment designs to ensure scalability and stable performance.
Zero-Computation Experts for Dynamic Allocation: LongCat-Flash uses Zero-computation Experts to dynamically allocate computational resources. This allows the model to activate between 18.6B and 31.3B parameters per token based on contextual demands, optimizing resource usage. Figure 3a demonstrates consistent loss reduction and improved performance under matched computation budgets.
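To make the dynamic-allocation idea concrete, here is a minimal PyTorch-style sketch of a router that mixes ordinary FFN experts with zero-computation (identity) experts: tokens routed to an identity expert simply pass through, so the number of activated parameters varies per token. All class and parameter names here are illustrative assumptions, not LongCat-Flash's actual implementation.

```python
import torch
import torch.nn as nn

class ZeroComputationMoE(nn.Module):
    """Sketch of an MoE layer with zero-computation (identity) experts.

    The router scores both real FFN experts and identity experts; tokens
    routed to identity experts skip FFN compute entirely, so activated
    parameters vary per token. Shapes, activation, and routing details
    are simplifying assumptions.
    """
    def __init__(self, d_model, n_ffn_experts, n_zero_experts, d_ffn, top_k):
        super().__init__()
        self.top_k = top_k
        self.n_ffn_experts = n_ffn_experts
        self.router = nn.Linear(d_model, n_ffn_experts + n_zero_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ffn), nn.SiLU(), nn.Linear(d_ffn, d_model))
            for _ in range(n_ffn_experts)
        ])

    def forward(self, x):                               # x: [tokens, d_model]
        scores = torch.softmax(self.router(x), dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            e = idx[:, k]                               # chosen expert id per token
            w = weights[:, k].unsqueeze(-1)
            zero_mask = (e >= self.n_ffn_experts).unsqueeze(-1)
            # Identity experts: contribute the input itself, no FFN compute.
            out = out + torch.where(zero_mask, w * x, torch.zeros_like(x))
            # FFN experts: compute only for the tokens actually routed to them.
            for j in range(self.n_ffn_experts):
                sel = (e == j)
                if sel.any():
                    out[sel] = out[sel] + w[sel] * self.experts[j](x[sel])
        return out
```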
Shortcut-Connected MoE (ScMoE): The Shortcut-connected MoE (ScMoE) architecture is employed to significantly expand the computation-communication overlap window, boosting both training and inference efficiency. This design ensures that the training loss curves are virtually indistinguishable from baselines without ScMoE, confirming its quality-neutral benefits across various model scales and attention mechanisms. Figure 4 illustrates these consistent loss curves.
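The sketch below illustrates the overlap idea under simplifying assumptions (a single block, placeholder callables for the real attention/FFN/collective kernels, a CUDA device): the dense FFN on the shortcut path runs while the MoE's all-to-all dispatch travels on a separate communication stream. It is a conceptual sketch of the wider overlap window, not the report's kernel-level schedule.

```python
import torch

def scmoe_block(x, attn, dense_ffn, moe_dispatch, moe_experts, moe_combine,
                comm_stream=None):
    """Sketch of a Shortcut-connected MoE (ScMoE) block (assumes a CUDA device).

    The shortcut path lets the dense FFN execute while the MoE's all-to-all
    dispatch runs on a separate communication stream, widening the
    computation/communication overlap window. All callables are placeholders;
    a real implementation needs per-tensor stream bookkeeping as well.
    """
    comm_stream = comm_stream or torch.cuda.Stream()
    compute_stream = torch.cuda.current_stream()

    h = x + attn(x)

    # Launch expert dispatch (communication) on its own stream ...
    comm_stream.wait_stream(compute_stream)
    with torch.cuda.stream(comm_stream):
        routed = moe_dispatch(h)            # all-to-all send of routed tokens

    # ... while the dense FFN on the shortcut path keeps the GPU busy.
    shortcut = dense_ffn(h)

    compute_stream.wait_stream(comm_stream)
    expert_out = moe_experts(routed)        # expert FFN compute

    comm_stream.wait_stream(compute_stream)
    with torch.cuda.stream(comm_stream):
        combined = moe_combine(expert_out)  # all-to-all return of expert outputs
    compute_stream.wait_stream(comm_stream)

    return h + shortcut + combined
```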
Variance Alignment for Scalability: LongCat-Flash incorporates Variance Alignment techniques for both Multi-head Latent Attention (MLA) and fine-grained FFN experts. This addresses variance misalignment during scaling, preventing instability and performance degradation. Scale-correction factors (aq and aku) in MLA and a scaling factor (γ) for expert initialization ensure well-conditioned attention computations and preserve MoE layer output variance. Figure 5a demonstrates improved convergence with scale-correction.
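As a rough illustration of where these factors enter, the snippet below applies scale-correction multipliers to the low-rank MLA query/key paths and rescales the initialization of fine-grained experts. The specific expressions (square-root ratios) are illustrative assumptions; the report derives the exact values of aq, aku, and γ.

```python
import math
import torch
import torch.nn as nn

d_model, q_lora_rank, kv_lora_rank, n_experts_split = 4096, 1536, 512, 8

# Scale-correction for MLA: rescale the up-projected queries/keys coming out of
# the compressed (low-rank) paths so their variance stays well conditioned as
# the model is widened. The sqrt(d_model / rank) form is an assumption.
alpha_q = math.sqrt(d_model / q_lora_rank)
alpha_kv = math.sqrt(d_model / kv_lora_rank)

q_up = nn.Linear(q_lora_rank, d_model, bias=False)
kv_up = nn.Linear(kv_lora_rank, d_model, bias=False)

c_q = torch.randn(2, q_lora_rank)     # compressed query latent
c_kv = torch.randn(2, kv_lora_rank)   # compressed key/value latent
q = alpha_q * q_up(c_q)
k = alpha_kv * kv_up(c_kv)

# Fine-grained expert initialization: when one large FFN is split into
# n_experts_split smaller experts, the init std of each expert's output
# projection is rescaled by gamma so the MoE layer's output variance matches
# the dense FFN it replaces (gamma = sqrt(n_experts_split) is an assumption).
gamma = math.sqrt(n_experts_split)
expert_out_proj = nn.Linear(d_model // n_experts_split, d_model, bias=False)
nn.init.normal_(expert_out_proj.weight, std=0.02 * gamma)
```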
Figure 2: The architecture adopted in LongCat-Flash. It integrates Shortcut-connected Mixture of Experts (ScMoE) with zero-computation experts for dynamic computation and enhanced efficiency.
Enterprise Process Flow: General Pre-Training Phases
The General Pre-Training process involves a multi-phase data pipeline to ensure quality and diversity. This sequence is crucial for building a robust base model for LongCat-Flash.
Pre-Training Methodologies
The pre-training of LongCat-Flash follows a robust multi-stage curriculum, focusing on scalability, stability, and agentic capability. It incorporates hyperparameter transfer, model growth initialization, and a multi-pronged stability suite to ensure efficient and reliable large-scale training.
Hyperparameter Transfer & Model Growth: LongCat-Flash leverages hyperparameter transfer based on width scaling and model growth initialization (layer stacking) to efficiently train large-scale models. This strategy significantly reduces computational costs and provides improved performance compared to random initialization, as evidenced in Figure 5b. The model starts as a half-scale version, pre-trained on billions of tokens, then expands.
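A minimal sketch of the layer-stacking step: the trained half-depth model's transformer blocks are duplicated to initialize the full-depth model, which then continues pre-training. The simple repeat-each-block pattern below is an assumption; the report only states that layer stacking is used.

```python
import copy
import torch.nn as nn

def grow_by_layer_stacking(small_layers: nn.ModuleList) -> nn.ModuleList:
    """Sketch of model-growth initialization via layer stacking.

    A half-depth model is pre-trained first; its blocks are then duplicated
    to initialize the full-depth model. The exact duplication order is an
    illustrative assumption.
    """
    grown = nn.ModuleList()
    for layer in small_layers:
        grown.append(layer)
        grown.append(copy.deepcopy(layer))   # duplicate each trained block
    return grown
```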
Training Stability Measures: Training stability is enhanced through router stability control (balancing LM and LB losses), activation stability via hidden z-loss, and optimized Adam's epsilon. Hidden z-loss (Eq. 10) prevents massive activations and loss spikes (Figure 6), while Adam's epsilon is set to 1e-16 to maintain adaptive properties for large-scale models (Figure 7).
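For intuition, here is a hedged sketch of a hidden z-loss regularizer. The squared-log-of-exp-sum form below mirrors the classic logit z-loss and is only an assumption; Eq. 10 in the report gives the exact definition LongCat-Flash uses.

```python
import torch

def hidden_z_loss(hidden: torch.Tensor, coeff: float = 1e-4) -> torch.Tensor:
    """Hedged sketch of a hidden z-loss regularizer.

    Penalizes very large hidden-state magnitudes so activations (and the loss)
    cannot spike during BF16 training. The particular form here is an
    assumption, not the report's Eq. 10 verbatim.
    """
    z = torch.logsumexp(hidden.float(), dim=-1)   # per-token magnitude proxy
    return coeff * (z ** 2).mean()
```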
Long Context Extension & Decontamination: A two-stage context length extension strategy expands the context window from 8k to 128k tokens, using naturally occurring long-text data and curated source code. Rigorous decontamination procedures, including 13-gram overlap and semantic similarity checks, prevent data leakage from common benchmarks, ensuring robust evaluation confidence.
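The 13-gram overlap check can be sketched in a few lines. The function below assumes a prebuilt set of 13-token tuples extracted from benchmark data (`benchmark_ngrams` is a hypothetical name) and omits the semantic-similarity stage mentioned above.

```python
def contaminated(doc_tokens, benchmark_ngrams, n=13):
    """Flag a training document if any of its 13-grams appears in benchmark data.

    `benchmark_ngrams` is assumed to be a set of n-token tuples built from the
    evaluation suites; flagged documents are removed from the training mix.
    """
    for i in range(len(doc_tokens) - n + 1):
        if tuple(doc_tokens[i:i + n]) in benchmark_ngrams:
            return True
    return False
```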
General Pre-Training Data Strategy: A multi-phase pipeline ensures data quality and diversity, with Content Extraction, Quality Filtering, and Deduplication steps. The data mixture progressively increases the proportion of high-quality reasoning data (e.g., STEM and code), aiming for comprehensive foundational capabilities.
Training Infrastructures & Efficiency
The training infrastructure for LongCat-Flash is designed for scalability with precision, ensuring deterministic computation and efficient distributed training. Key innovations include numerical precision control, kernel optimizations, and advanced distributed strategies like ScMoE for computation-communication overlap.
GEMM Precision Comparison (ULP)
Table 4: GEMM Precision Comparison (ULP) between Solution 1 and Solution 2, demonstrating efforts to reduce ULP errors.
| Case | Output Shape | Value Range | Solution 1 Max | Solution 1 Min | Solution 2 Max | Solution 2 Min |
|---|---|---|---|---|---|---|
| 1 | [1024, 1536] | [-5, 5] | 2292 | -568 | 112 | -100 |
| 2 | [1024, 576] | [-5, 5] | 65362 | -82046 | 6.5 | -9 |
| 3 | [1024, 16384] | [-19, 15] | 544 | -104 | 224 | -112 |
| 4 | [1024, 12288] | [-4, 4] | 202 | -88 | 72 | -41 |
| 5 | [1024, 6144] | [-1, 1] | 5376 | -1376 | 304 | -224 |
| 6 | [1024, 24576] | [-5, 5] | 7200 | -510 | 104 | -294 |
| 7 | [1024, 131072] | [5, 5] | 8128 | -6976 | 2528 | -368 |
| 8 | [1024, 6144] | [-1, 1] | 5344 | -8064 | 80 | -258 |
Numerical Precision Control & SDC Detection: LongCat-Flash employs ULP (Unit in the Last Place) evaluation to quantify floating-point errors and integrates an on-chip, in-place operator recomputation mechanism for Silent Data Corruption (SDC) detection. This ensures bitwise-aligned loss values and minimizes numerical errors during BF16 training, crucial for stable large-scale training. Table 4 demonstrates GEMM precision comparison, highlighting efforts to reduce ULP errors.
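The ULP metric can be illustrated with a short NumPy sketch: the low-precision result is compared against a higher-precision reference and the error is expressed in units of the last place at the reference value. Measuring against an FP32 reference GEMM is an assumption about the report's methodology.

```python
import numpy as np

def ulp_error(computed_bf16: np.ndarray, reference_fp32: np.ndarray) -> np.ndarray:
    """Sketch of a ULP (Unit in the Last Place) error measurement for GEMM outputs.

    Returns (computed - reference) / ulp(reference): how many representable FP32
    steps the low-precision result is away from the high-precision reference.
    Using an FP32 reference (rather than FP64) is an assumption.
    """
    ref = reference_fp32.astype(np.float32)
    ulp = np.spacing(np.abs(ref))             # size of one ULP at the reference value
    return (computed_bf16.astype(np.float32) - ref) / ulp
```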
Kernel Optimizations for Determinism & Performance: Custom kernel redesigns address the overhead of deterministic execution, including a deterministic FlashAttention Gradients (FAG) kernel that runs 1.6× faster than the original deterministic implementation and at 0.95× the speed of the non-deterministic one, and a hierarchical reduction algorithm for deterministic ScatterAdd that reaches performance parity with its non-deterministic counterpart. Optimized Grouped GEMM and fused GemmAdd kernels further enhance efficiency.
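To show why determinism matters here, the sketch below replaces atomic scatter-adds (whose floating-point summation order varies between runs) with a sort-and-segment reduction that always sums in the same order. This is a host-side illustration of the determinism idea, not the hierarchical CUDA kernel described in the report.

```python
import torch

def deterministic_scatter_add(out: torch.Tensor, index: torch.Tensor, src: torch.Tensor):
    """Deterministic scatter-add via stable sort + fixed-order segment reduction.

    Atomic adds are non-deterministic because floating-point addition is not
    associative and the thread ordering changes between runs; sorting by
    destination index and reducing each segment in a fixed order restores
    bitwise reproducibility.
    """
    order = torch.argsort(index, stable=True)
    sorted_idx, sorted_src = index[order], src[order]
    uniq, counts = torch.unique_consecutive(sorted_idx, return_counts=True)
    segment_sums = torch.stack([
        seg.sum(dim=0) for seg in torch.split(sorted_src, counts.tolist())
    ])
    out[uniq] += segment_sums    # each destination row is written exactly once
    return out
```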
Distributed Strategy for Large-scale Training: The training architecture is centered on Expert Parallelism Groups (EP), with Context Parallelism (CP) for attention layers and EP partitioning for FFN layers. ScMoE enables dispatch/combine communication to overlap with dense FFN computation by dividing MoE layers into chunks, significantly reducing non-overlapping communication. Figures 8 and 9 illustrate the ScMoE layer chunking and overall overlapping strategy, which reduced non-overlapping communication from 25.3% to 8.4%.
Figure 8: ScMoE Layer with Chunk achieves highest efficiency through computation-communication overlap.
Figure 9: An overview of LongCat-Flash's overlapping strategy, leveraging ScMoE to maximize efficiency.
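The chunking idea behind Figure 8 can be sketched as a small software pipeline: while one chunk's tokens are in flight through the all-to-all dispatch, the experts compute on the previous chunk. The callables are placeholders for the real collective and expert kernels, and stream handling is collapsed into sequential calls for readability.

```python
import torch

def chunked_moe_forward(tokens, n_chunks, dispatch, expert_fn, combine):
    """Sketch of chunk-level computation/communication pipelining in an MoE layer.

    Splitting the MoE layer's tokens into chunks lets the dispatch of chunk i
    overlap with expert compute on chunk i-1, hiding most of the all-to-all
    communication behind computation.
    """
    chunks = torch.chunk(tokens, n_chunks, dim=0)
    outputs, in_flight = [], None
    for chunk in chunks:
        dispatched = dispatch(chunk)           # comm: would run on a separate stream
        if in_flight is not None:
            outputs.append(combine(expert_fn(in_flight)))  # compute on previous chunk
        in_flight = dispatched
    outputs.append(combine(expert_fn(in_flight)))
    return torch.cat(outputs, dim=0)
```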
Inference and Deployment Optimizations
LongCat-Flash's inference system is optimized through model-system co-design, achieving high throughput and low latency. Key techniques include Single Batch Overlap (SBO), speculative decoding with Multi-Token Prediction (MTP), KV cache reduction, multi-step overlapped scheduling, custom kernels, and fine-grained quantization.
Model-Specific Inference Optimization: A Single Batch Overlap (SBO) scheduling strategy optimizes both latency and throughput by orchestrating computation-communication overlap within the ScMoE architecture. Speculative decoding employs Multi-Token Prediction (MTP) as a lightweight draft model (90% acceptance rate), and Multi-head Latent Attention (MLA) significantly reduces KV cache size and bandwidth pressure.
MTP Head Structures Comparison
Table 5: Draft token acceptance rate on MT-Bench of different MTP head structures with a 6B activated model.
| MTP layer | Activated parameters ratio | Acceptance rate α |
|---|---|---|
| Dense layer | 1.41% | 92.1% |
| ScMoE layer | 4.17% | 92.9% |
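Building on the MTP acceptance rates above, here is a minimal sketch of the draft-and-verify loop behind speculative decoding: the lightweight MTP head proposes token(s), and the full model checks them in a single forward pass. Greedy acceptance and the placeholder callables `target_model` and `mtp_draft` are assumptions; the report only states the acceptance rate, not this exact loop.

```python
import torch

def speculative_step(target_model, mtp_draft, prefix, n_draft=1):
    """Sketch of speculative decoding with an MTP head as the draft model.

    `prefix` is a 1-D LongTensor of token ids; `mtp_draft(ctx)` and
    `target_model(ctx)` are placeholders returning next-token logits
    ([vocab]) and per-position logits ([len(ctx), vocab]) respectively.
    """
    draft, ctx = [], prefix
    for _ in range(n_draft):
        tok = mtp_draft(ctx).argmax(dim=-1)          # cheap draft proposal
        draft.append(tok)
        ctx = torch.cat([ctx, tok.view(1)])
    verify_logits = target_model(ctx)                # one full-model pass over prefix + draft
    accepted = []
    for i, tok in enumerate(draft):
        target_tok = verify_logits[len(prefix) + i - 1].argmax(dim=-1)
        if target_tok == tok:
            accepted.append(tok)                     # draft token matches: accept
        else:
            accepted.append(target_tok)              # mismatch: take the target's token and stop
            break
    return torch.stack(accepted)
```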
Minimize Schedule Overhead: A multi-step overlapped scheduler launches kernels for multiple forward steps in a single iteration, effectively hiding CPU scheduling and synchronization. This ensures continuous GPU occupancy and dynamically pre-allocates KV cache slots, guaranteeing convergence in allocated KV cache size even without prior knowledge of accept length, as shown in Figure 10.
Figure 10: Multi-step overlapped scheduler for efficient inference.
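A minimal sketch of the scheduling pattern, with both function arguments as placeholders rather than a real serving API: kernels for several decode steps are enqueued per iteration so CPU-side scheduling overlaps with GPU execution, and KV-cache slots for speculative tokens are pre-allocated pessimistically.

```python
def multi_step_decode(schedule_batch, enqueue_forward_step, n_overlap_steps=4, max_iters=1000):
    """Sketch of a multi-step overlapped decoding scheduler.

    Rather than preparing and launching one forward step at a time (leaving
    the GPU idle during CPU-side scheduling), several steps' kernels are
    queued at once; the CPU prepares the next batch while the GPU drains the
    queue. `enqueue_forward_step` is assumed to return a future-like handle.
    """
    for _ in range(max_iters):
        batch = schedule_batch()                     # CPU: batching + KV-slot pre-allocation
        futures = [enqueue_forward_step(batch)       # GPU: several decode steps queued at once
                   for _ in range(n_overlap_steps)]
        for f in futures:                            # collect results after scheduling work
            f.result()
```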
Measured Inference Performance
Table 6: Performance of LongCat-Flash under different settings, showing superior TGS (generation throughput) and TPS/u (tokens per second per user) compared to DeepSeek-V3.

| Model | Precision | Avg. Context Length | #Hopper GPUs | TGS | TPS/u |
|---|---|---|---|---|---|
| DeepSeek-V3-profile | bf16 | 4096 | 128 | 2324 | 20 |
| DeepSeek-V3-blog | bf16 | 4989 | 144 | 1850 | 20–22 |
| LongCat-Flash | bf16 | 5000 | 128 | 3785 | 35 |
| LongCat-Flash | bf16 | 5000 | 128 | 2205 | 68.9 |
| LongCat-Flash | bf16 | 5000 | 128 | 804 | 100.5 |
| LongCat-Flash | fp8 | 5000 | 128 | 4230 | 26.4 |
| LongCat-Flash | fp8 | 8192 | 128 | 3240 | 33.8 |
Theoretical Decoding Performance: LongCat-Flash achieves significant theoretical improvements in both throughput and latency due to its reduced layer count and the SBO overlapping strategy. The theoretical Time-Per-Output-Token (TPOT) is significantly lower than competing models, demonstrating superior efficiency.
Theoretical Decoding Performance (Model Configurations)
Table 7 Part 1: Theoretical decoding model configurations impacting performance.
| Metric | DeepSeek-V3 | Qwen3-235B-A22B | LongCat-Flash |
|---|---|---|---|
| MTP | w/ | w/o | w/ |
| n_layer | 61 | 94 | 28 |
| batch per device | 96 | 96 | 96 |
Theoretical Decoding Performance (Module Costs & TPOT)
Table 7 Part 2: Theoretical decoding module costs and TPOT/cost comparison, highlighting LongCat-Flash's efficiency.
| Module/Metric | DeepSeek-V3 | Qwen3-235B-A22B | LongCat-Flash |
|---|---|---|---|
| attention | 471 µs | 314 µs | 264 µs |
| all-to-all dispatch | 275 µs | 157 µs | 236 µs |
| MoE | 77 µs | 29 µs | 60 µs |
| all-to-all combine | 551 µs | 315 µs | 472 µs |
| overlap strategy | TBO | TBO | SBO |
| TPOT (ms) | 30 | 26.2 | 16 |
| $/1M output tokens | 0.17 | 0.15 | 0.09 |
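The cost row can be sanity-checked with a back-of-the-envelope calculation: each GPU serves its per-device batch and emits one token per sequence every TPOT, so per-GPU throughput is batch / TPOT. The ~$2-per-GPU-hour price below is an assumption for illustration, not a figure from the report.

```python
def cost_per_million_tokens(tpot_ms: float, batch_per_device: int,
                            gpu_hourly_usd: float = 2.0) -> float:
    """Rough $/1M-output-token estimate from TPOT and per-device batch size.

    Per-GPU throughput = batch / TPOT tokens per second; dividing the GPU's
    per-second cost by that throughput gives cost per token. The GPU hourly
    price is an assumed value.
    """
    tokens_per_sec_per_gpu = batch_per_device / (tpot_ms / 1000.0)
    usd_per_sec_per_gpu = gpu_hourly_usd / 3600.0
    return usd_per_sec_per_gpu / tokens_per_sec_per_gpu * 1e6

# With the Table 7 settings (batch 96 per device) and a ~$2/hr GPU assumption:
#   LongCat-Flash : cost_per_million_tokens(16.0, 96)  ~= $0.09
#   Qwen3-235B    : cost_per_million_tokens(26.2, 96)  ~= $0.15
#   DeepSeek-V3   : cost_per_million_tokens(30.0, 96)  ~= $0.17
```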
Your AI Implementation Roadmap
A clear, phased approach to integrating advanced AI into your enterprise, ensuring a smooth transition and measurable impact.
Phase 1: Discovery & Strategy
Comprehensive assessment of current workflows, identification of AI opportunities, and development of a tailored implementation strategy.
Phase 2: Pilot & Proof-of-Concept
Deployment of a small-scale AI solution to validate performance, gather feedback, and refine the model for your specific needs.
Phase 3: Integration & Scaling
Seamless integration of the AI model into your existing infrastructure and scaling to optimize performance across the enterprise.
Phase 4: Optimization & Future-Proofing
Continuous monitoring, performance optimization, and strategic planning for future AI advancements and expanded applications.
Ready to Transform Your Enterprise with AI?
Connect with our experts to explore how LongCat-Flash and other cutting-edge AI solutions can drive efficiency, innovation, and growth for your business.