Enterprise AI Analysis: LongCat-Flash Technical Report

Technical Report Analysis

Unlocking Next-Gen AI: Introducing LongCat-Flash

LongCat-Flash is a 560-billion-parameter Mixture-of-Experts (MoE) language model with two novel designs: Zero-computation Experts for dynamic budget allocation (18.6B–31.3B activated parameters) and Shortcut-connected MoE for enhanced inference efficiency. It was trained on >20 trillion tokens in 30 days with a multi-stage strategy for agentic intelligence. It achieves >100 TPS inference at $0.70/million output tokens, outperforming leading models in agentic tasks.

Executive Impact & Performance Metrics

LongCat-Flash delivers exceptional performance across key operational and efficiency benchmarks, showcasing its potential for enterprise AI applications.

560B Total Parameters
~27B Avg. Activated Parameters (18.6B–31.3B per token)
30 Days Training Completed In
>20 Trillion Tokens Trained On
98.48% Training Availability
100+ TPS Inference Speed (H800)
$0.70/M Output Tokens Inference Cost

Deep Analysis & Enterprise Applications

Select a topic below to dive deeper into the specific findings from the research.

Architecture
Pre-Training
Training Infrastructures
Inference and Deployment

Architecture Innovations

LongCat-Flash introduces a novel Mixture-of-Experts (MoE) architecture with key innovations aimed at computational efficiency and dynamic resource allocation. These include Zero-computation Experts, Shortcut-connected MoE, and Variance Alignment designs to ensure scalability and stable performance.

28 Total Layers (excluding MTP layer)

Zero-Computation Experts for Dynamic Allocation: LongCat-Flash uses Zero-computation Experts to dynamically allocate computational resources. This allows the model to activate between 18.6B and 31.3B parameters per token based on contextual demands, optimizing resource usage. Figure 3a demonstrates consistent loss reduction and improved performance under matched computation budgets.
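
The mechanism can be illustrated with a toy router. The following is a minimal sketch with hypothetical module and parameter names (not the report's implementation): a fraction of the expert slots are identity "zero-computation" experts, so tokens routed to them pass through with no FFN work, and the number of activated FFN parameters varies per token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ZeroComputeMoE(nn.Module):
    """Toy MoE layer: the last `n_zero` experts are identity (zero-computation).

    Tokens whose top-k choices land on identity experts simply pass their
    hidden state through, so the number of activated FFN parameters varies
    per token with contextual importance.
    """
    def __init__(self, d_model=64, d_ff=256, n_ffn=8, n_zero=4, top_k=2):
        super().__init__()
        self.n_ffn, self.n_zero, self.top_k = n_ffn, n_zero, top_k
        self.router = nn.Linear(d_model, n_ffn + n_zero, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_ffn)
        )

    def forward(self, x):                               # x: [tokens, d_model]
        scores = F.softmax(self.router(x), dim=-1)
        weight, idx = scores.topk(self.top_k, dim=-1)   # per-token expert choices
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(self.n_ffn + self.n_zero):
                mask = idx[:, k] == e
                if not mask.any():
                    continue
                # Identity experts: no computation, the token just passes through.
                y = x[mask] if e >= self.n_ffn else self.experts[e](x[mask])
                out[mask] += weight[mask, k].unsqueeze(-1) * y
        return out

tokens = torch.randn(16, 64)
print(ZeroComputeMoE()(tokens).shape)                   # torch.Size([16, 64])
```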

Shortcut-Connected MoE (ScMoE): The Shortcut-connected MoE (ScMoE) architecture is employed to significantly expand the computation-communication overlap window, boosting both training and inference efficiency. This design ensures that the training loss curves are virtually indistinguishable from baselines without ScMoE, confirming its quality-neutral benefits across various model scales and attention mechanisms. Figure 4 illustrates these consistent loss curves.
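
A structural sketch of the idea follows (single-device toy with hypothetical names; the real layer uses MLA attention, expert parallelism, and asynchronous all-to-all). The MoE branch reads its input from a point earlier in the block, so in a distributed run its dispatch/combine communication can be issued while the dense FFN on the other branch is still computing.

```python
import torch
import torch.nn as nn

class ScMoEBlock(nn.Module):
    """Conceptual shortcut-connected block (illustrative layout only).

    The MoE branch takes its input from an earlier point in the block, so in a
    multi-device setting its all-to-all dispatch/combine can be launched while
    the dense FFN between those two points is still computing. Here everything
    runs sequentially; comments mark where the overlap would happen.
    """
    def __init__(self, d_model=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.dense_ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                       nn.Linear(4 * d_model, d_model))
        self.moe = nn.Linear(d_model, d_model)   # stand-in for the expert layer

    def forward(self, x):
        h, _ = self.attn(x, x, x)
        h = x + h
        # Shortcut: the MoE input is taken here ...
        moe_in = h
        # ... so dispatch(moe_in) could be issued now and overlap with the
        # dense FFN computation on the other branch.
        dense_out = h + self.dense_ffn(h)
        # combine() of the expert outputs would complete here.
        return dense_out + self.moe(moe_in)

x = torch.randn(2, 8, 64)
print(ScMoEBlock()(x).shape)    # torch.Size([2, 8, 64])
```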

Variance Alignment for Scalability: LongCat-Flash incorporates Variance Alignment techniques for both Multi-head Latent Attention (MLA) and fine-grained FFN experts. This addresses variance misalignment during scaling, preventing instability and performance degradation. Scale-correction factors (αq and αkv) in MLA and a scaling factor (γ) for expert initialization ensure well-conditioned attention computations and preserve the MoE layer's output variance. Figure 5a demonstrates improved convergence with scale-correction.

Figure 2: The architecture adopted in LongCat-Flash. Each layer employs Shortcut-connected Mixture of Experts (ScMoE) with zero-computation experts. ScMoE significantly expands the computation-communication window to boost training and inference efficiency. The zero-computation experts enable dynamic computation based on contextual importance, improving the efficiency of computational resource utilization.


Enterprise Process Flow: General Pre-Training Phases

The General Pre-Training process involves a multi-phase data pipeline to ensure quality and diversity. This sequence is crucial for building a robust base model for LongCat-Flash.

Content Extraction
Quality Filtering
Deduplication
Data Mixture (Stage 1)
Data Mixture (Stage 2)

Pre-Training Methodologies

The pre-training of LongCat-Flash follows a robust multi-stage curriculum, focusing on scalability, stability, and agentic capability. It incorporates hyperparameter transfer, model growth initialization, and a multi-pronged stability suite to ensure efficient and reliable large-scale training.

Hyperparameter Transfer & Model Growth: LongCat-Flash leverages hyperparameter transfer based on width scaling and model growth initialization (layer stacking) to efficiently train large-scale models. This strategy significantly reduces computational costs and provides improved performance compared to random initialization, as evidenced in Figure 5b. The model starts as a half-scale version, pre-trained on billions of tokens, then expands.
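
A minimal sketch of the growth step is shown below (hypothetical helper; the report's exact duplication order is not reproduced here): the layers of the trained half-depth model are duplicated to initialize the full-depth stack.

```python
import copy
import torch.nn as nn

def grow_by_layer_stacking(half_model_layers: nn.ModuleList) -> nn.ModuleList:
    """Initialize a 2x-deep model from a trained half-scale model by stacking
    (duplicating) its layers -- a simplified version of model-growth init."""
    grown = []
    for layer in half_model_layers:
        grown.append(layer)                 # keep the trained layer
        grown.append(copy.deepcopy(layer))  # duplicate it to double the depth
    return nn.ModuleList(grown)

half = nn.ModuleList(nn.Linear(64, 64) for _ in range(14))   # e.g. a 14-layer stack
full = grow_by_layer_stacking(half)
print(len(full))   # 28 layers, matching the full-depth target
```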

Training Stability Measures: Training stability is enhanced through router stability control (balancing the language-modeling and load-balancing losses), activation stability via a hidden z-loss, and an optimized Adam epsilon. The hidden z-loss (Eq. 10) prevents massive activations and loss spikes (Figure 6), while Adam's epsilon is set to 1e-16 to maintain adaptivity for large-scale models (Figure 7).
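
The ε choice matters because Adam's per-parameter step is roughly proportional to m/(√v + ε); once ε becomes comparable to the gradient RMS, the update stops being adaptive. A small numeric illustration (not from the report):

```python
def adam_step_scale(grad_rms: float, eps: float) -> float:
    """Relative magnitude of an Adam update (bias correction ignored):
    |update| ~ lr * m / (sqrt(v) + eps) ~ lr * grad_rms / (grad_rms + eps)."""
    return grad_rms / (grad_rms + eps)

# With very large models the per-parameter gradient RMS can become tiny; if eps
# is comparable to it, the effective step collapses and Adam loses adaptivity.
for grad_rms in (1e-3, 1e-6, 1e-8):
    print(f"grad_rms={grad_rms:.0e}  eps=1e-8 -> {adam_step_scale(grad_rms, 1e-8):.3f}  "
          f"eps=1e-16 -> {adam_step_scale(grad_rms, 1e-16):.3f}")
```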

Long Context Extension & Decontamination: A two-stage context length extension strategy expands the context window from 8k to 128k tokens, using naturally occurring long-text data and curated source code. Rigorous decontamination procedures, including 13-gram overlap and semantic similarity checks, prevent data leakage from common benchmarks, ensuring robust evaluation confidence.
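
A simplified sketch of the n-gram overlap part of such a check is shown below (hypothetical function names; the report's pipeline additionally applies semantic-similarity filtering):

```python
def ngrams(tokens, n=13):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(doc_tokens, benchmark_tokens, n=13):
    """Flag a training document if it shares any 13-gram with a benchmark item.
    A simplified stand-in for an overlap check; a real pipeline would also use
    semantic-similarity filtering and normalization."""
    return bool(ngrams(doc_tokens, n) & ngrams(benchmark_tokens, n))

bench = "the quick brown fox jumps over the lazy dog near the old river bank".split()
doc   = ("background text " * 3).split() + bench[:13] + ["extra"]
print(is_contaminated(doc, bench))   # True: shares a 13-token span with the benchmark
```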

General Pre-Training Data Strategy: A multi-phase pipeline ensures data quality and diversity, with Content Extraction, Quality Filtering, and Deduplication steps. The data mixture progressively increases the proportion of high-quality reasoning data (e.g., STEM and code), aiming for comprehensive foundational capabilities.

Training Infrastructures & Efficiency

The training infrastructure for LongCat-Flash is designed for scalability with precision, ensuring deterministic computation and efficient distributed training. Key innovations include numerical precision control, kernel optimizations, and advanced distributed strategies like ScMoE for computation-communication overlap.

GEMM Precision Comparison (ULP)

Table 4: GEMM Precision Comparison (ULP) between Solution 1 and Solution 2, demonstrating efforts to reduce ULP errors.

Case | Output Shape | Value Range | Solution 1 Max | Solution 1 Min | Solution 2 Max | Solution 2 Min
1 | [1024, 1536] | [-5, 5] | 2292 | -568 | 112 | -100
2 | [1024, 576] | [-5, 5] | 65362 | -8204 | 66.5 | -9
3 | [1024, 16384] | [-19, 15] | 544 | -1042 | 24 | -112
4 | [1024, 12288] | [-4, 4] | 202 | -88 | 72 | -41
5 | [1024, 6144] | [-1, 1] | 5376 | -1376 | 304 | -224
6 | [1024, 24576] | [-5, 5] | 7200 | -510 | 104 | -294
7 | [1024, 131072] | [-5, 5] | 8128 | -6976 | 2528 | -368
8 | [1024, 6144] | [-1, 1] | 5344 | -806 | 480 | -258

Numerical Precision Control & SDC Detection: LongCat-Flash employs ULP (Unit in the Last Place) evaluation to quantify floating-point errors and integrates an on-chip, in-place operator recomputation mechanism for Silent Data Corruption (SDC) detection. This ensures bitwise-aligned loss values and minimizes numerical errors during BF16 training, crucial for stable large-scale training. Table 4 demonstrates GEMM precision comparison, highlighting efforts to reduce ULP errors.
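
For intuition, ULP error can be measured by comparing a lower-precision GEMM against a high-precision reference and expressing the difference in units of the reference value's floating-point spacing. A rough sketch of such a harness (illustrative only, not the report's tooling):

```python
import numpy as np

def gemm_ulp_error(a, b):
    """Measure GEMM error in ULPs: compare a float32 matmul against a float64
    reference, expressing the difference in units of the reference value's
    float32 spacing (one ULP)."""
    ref = a.astype(np.float64) @ b.astype(np.float64)          # high-precision reference
    lowp = (a @ b).astype(np.float64)                          # float32 pipeline under test
    ulp = np.spacing(np.abs(ref).astype(np.float32)).astype(np.float64)
    err = (lowp - ref) / ulp
    return err.max(), err.min()

rng = np.random.default_rng(0)
a = rng.uniform(-5, 5, (1024, 4096)).astype(np.float32)
b = rng.uniform(-5, 5, (4096, 1536)).astype(np.float32)
print(gemm_ulp_error(a, b))   # (max ULP error, min ULP error), as reported in Table 4
```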

Kernel Optimizations for Determinism & Performance: Custom kernel redesigns remove most of the determinism overhead, including a deterministic FlashAttention-gradient (FAG) kernel (1.6x faster than the original deterministic kernel and 0.95x the speed of the non-deterministic one) and a hierarchical-reduction algorithm for a deterministic ScatterAdd that reaches performance parity with the non-deterministic version. Optimized Grouped GEMM and fused GemmAdd kernels further improve efficiency.
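
The determinism problem with scatter-add is that atomic floating-point accumulation happens in a run-dependent order; any scheme that fixes the summation order removes the run-to-run variation. A CPU-side sketch of that idea (hypothetical names; the report's kernel achieves the property with a GPU hierarchical reduction instead):

```python
import torch

def deterministic_scatter_add(dim_size, index, src):
    """Deterministic alternative to atomic scatter-add: for each target row,
    gather its updates and sum them in a fixed (data-independent) order, so the
    floating-point result is identical across runs."""
    out = torch.zeros(dim_size, src.shape[-1], dtype=src.dtype)
    for row in index.unique(sorted=True):
        out[row] = src[index == row].sum(dim=0)   # fixed accumulation order
    return out

index = torch.tensor([2, 0, 2, 1, 0])
src = torch.randn(5, 4)
print(deterministic_scatter_add(3, index, src).shape)   # torch.Size([3, 4])
```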

Distributed Strategy for Large-scale Training: The training architecture is centered on Expert Parallelism Groups (EP), with Context Parallelism (CP) for attention layers and EP partitioning for FFN layers. ScMoE enables dispatch/combine communication to overlap with dense FFN computation by dividing MoE layers into chunks, significantly reducing non-overlapping communication. Figures 8 and 9 illustrate the ScMoE layer chunking and overall overlapping strategy, which reduced non-overlapping communication from 25.3% to 8.4%.

Figure 8: The compared architectures have the same total and activated number of experts; ScMoE with chunking achieves the highest efficiency because more of the communication is overlapped with computation.



Figure 9: An overview of LongCat-Flash's overlapping strategy, leveraging ScMoE to maximize efficiency.

Inference and Deployment Optimizations

LongCat-Flash's inference system is optimized through model-system co-design, achieving high throughput and low latency. Key techniques include Single Batch Overlap (SBO), speculative decoding with Multi-Token Prediction (MTP), KV cache reduction, multi-step overlapped scheduling, custom kernels, and fine-grained quantization.

Model-Specific Inference Optimization: A Single Batch Overlap (SBO) scheduling strategy optimizes both latency and throughput by orchestrating computation-communication overlap within the ScMoE architecture. Speculative decoding employs Multi-Token Prediction (MTP) as a lightweight draft model (90% acceptance rate), and Multi-head Latent Attention (MLA) significantly reduces KV cache size and bandwidth pressure.
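
A toy greedy-verification sketch of draft-and-verify decoding follows (hypothetical names; real systems verify all draft tokens in a single batched target forward pass and may use probabilistic acceptance):

```python
from typing import Callable, List

def speculative_decode_step(draft_next: Callable[[List[int]], int],
                            target_next: Callable[[List[int]], int],
                            context: List[int], n_draft: int = 1) -> List[int]:
    """One greedy speculative-decoding step: a lightweight draft (e.g. an MTP
    head) proposes n_draft tokens; the target model checks them and keeps the
    longest accepted prefix plus one corrected or bonus token. Toy sketch only."""
    proposal, ctx = [], list(context)
    for _ in range(n_draft):                     # cheap draft proposals
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)
    accepted, ctx = [], list(context)
    for tok in proposal:                         # verification by the target model
        target_tok = target_next(ctx)            # (one batched pass in practice)
        if target_tok != tok:
            accepted.append(target_tok)          # replace the first mismatch and stop
            break
        accepted.append(tok)
        ctx.append(tok)
    else:
        accepted.append(target_next(ctx))        # bonus token when all drafts accepted
    return accepted

# Toy "models" over integer tokens: the draft usually agrees with the target.
target = lambda ctx: (sum(ctx) * 31 + 7) % 100
draft  = lambda ctx: target(ctx) if (len(ctx) % 10) else (target(ctx) + 1) % 100
print(speculative_decode_step(draft, target, [1, 2, 3], n_draft=3))
```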

MTP Head Structures Comparison

Table 5: Draft token acceptance rate on MT-Bench of different MTP head structures with a 6B activated model.

MTP layer | Activated parameters ratio | Acceptance rate α
Dense layer | 1.41% | 92.1%
ScMoE layer | 4.17% | 92.9%

Minimize Schedule Overhead: A multi-step overlapped scheduler launches kernels for multiple forward steps in a single iteration, effectively hiding CPU scheduling and synchronization. This ensures continuous GPU occupancy and dynamically pre-allocates KV cache slots, guaranteeing convergence in allocated KV cache size even without prior knowledge of accept length, as shown in Figure 10.
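
Conceptually, each scheduler iteration enqueues several forward steps at once and reserves KV-cache slots for the worst case, so the CPU never blocks waiting to learn the accepted draft length. A toy sketch of that accounting (hypothetical names, no real GPU work):

```python
def schedule_iteration(enqueue_forward_step, kv_pool_free_slots, n_steps=4, draft_len=1):
    """One iteration of a multi-step overlapped scheduler (toy sketch): kernels
    for n_steps forward passes are enqueued back-to-back, and KV-cache slots are
    reserved for the worst case (1 target token + draft_len speculative tokens
    per step), so the CPU never waits on the actual accepted length."""
    needed = n_steps * (1 + draft_len)            # worst-case KV growth this iteration
    assert kv_pool_free_slots >= needed, "pre-allocate before launching"
    for step in range(n_steps):
        enqueue_forward_step(step)                # enqueue GPU work; no sync in between
    return kv_pool_free_slots - needed            # slots still free after this iteration

print(schedule_iteration(lambda s: None, kv_pool_free_slots=64))   # 56
```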

Figure 10: Multi-step overlapped scheduler (4 steps shown as an example).


Measured Inference Performance

Table 6: Performance of LongCat-Flash under different settings (TGS: token generation throughput per GPU; TPS/u: tokens per second per user), showing superior TGS and TPS/u compared to DeepSeek-V3.

Model | Precision | Avg. Context | #Hopper GPUs | TGS | TPS/u
DeepSeek-V3-profile | bf16 | 4096 | 128 | 2324 | 20
DeepSeek-V3-blog | bf16 | 4989 | 144 | 1850 | 20~22
LongCat-Flash | bf16 | 5000 | 128 | 3785 | 35
LongCat-Flash | bf16 | 5000 | 128 | 2205 | 68.9
LongCat-Flash | bf16 | 5000 | 128 | 804 | 100.5
LongCat-Flash | fp8 | 5000 | 128 | 4230 | 26.4
LongCat-Flash | fp8 | 8192 | 128 | 3240 | 33.8

Theoretical Decoding Performance: LongCat-Flash achieves significant theoretical improvements in both throughput and latency due to its reduced layer count and the SBO overlapping strategy. The theoretical Time-Per-Output-Token (TPOT) is significantly lower than competing models, demonstrating superior efficiency.

Theoretical Decoding Performance (Model Configurations)

Table 7 Part 1: Theoretical decoding model configurations impacting performance.

Metric | DeepSeek-V3 | Qwen3-235B-A22B | LongCat-Flash
MTP | w/ | w/o | w/
n_layer | 61 | 94 | 28
batch per device | 96 | 96 | 96

Theoretical Decoding Performance (Module Costs & TPOT)

Table 7 Part 2: Theoretical decoding module costs and TPOT/cost comparison, highlighting LongCat-Flash's efficiency.

Module/Metric | DeepSeek-V3 | Qwen3-235B-A22B | LongCat-Flash
attention | 471 µs | 314 µs | 264 µs
all-to-all dispatch | 275 µs | 157 µs | 236 µs
MoE | 77 µs | 29 µs | 60 µs
all-to-all combine | 551 µs | 315 µs | 472 µs
overlap strategy | TBO | TBO | SBO
TPOT (ms) | 30 | 26.2 | 16
$/1M output tokens | 0.17 | 0.15 | 0.09
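
As a sanity check on the last two rows, the per-token cost follows directly from TPOT and the per-device batch once a GPU-hour price is assumed; an illustrative rate of roughly $2 per Hopper GPU-hour (an assumption, not stated in the excerpt above) reproduces the table's figures:

```python
def cost_per_million_tokens(tpot_ms: float, batch_per_device: int,
                            gpu_hourly_usd: float = 2.0) -> float:
    """Cost of generating 1M output tokens: each device serves batch_per_device
    concurrent requests, each producing one token every tpot_ms milliseconds.
    The $2/GPU-hour rate is an illustrative assumption."""
    tokens_per_gpu_second = batch_per_device / (tpot_ms / 1000.0)
    return gpu_hourly_usd / 3600.0 / tokens_per_gpu_second * 1_000_000

for name, tpot in [("DeepSeek-V3", 30.0), ("Qwen3-235B-A22B", 26.2), ("LongCat-Flash", 16.0)]:
    print(f"{name}: ${cost_per_million_tokens(tpot, 96):.2f} per 1M output tokens")
# -> ~$0.17, ~$0.15, ~$0.09, matching the table under this assumed GPU price.
```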

Calculate Your Potential AI ROI

Estimate the significant efficiency gains and cost savings your enterprise could achieve by integrating advanced AI models like LongCat-Flash.


Your AI Implementation Roadmap

A clear, phased approach to integrating advanced AI into your enterprise, ensuring a smooth transition and measurable impact.

Phase 1: Discovery & Strategy

Comprehensive assessment of current workflows, identification of AI opportunities, and development of a tailored implementation strategy.

Phase 2: Pilot & Proof-of-Concept

Deployment of a small-scale AI solution to validate performance, gather feedback, and refine the model for your specific needs.

Phase 3: Integration & Scaling

Seamless integration of the AI model into your existing infrastructure and scaling to optimize performance across the enterprise.

Phase 4: Optimization & Future-Proofing

Continuous monitoring, performance optimization, and strategic planning for future AI advancements and expanded applications.

Ready to Transform Your Enterprise with AI?

Connect with our experts to explore how LongCat-Flash and other cutting-edge AI solutions can drive efficiency, innovation, and growth for your business.

Ready to Get Started?

Book Your Free Consultation.
