Technical Report Analysis
Unlocking Next-Gen AI: Introducing LongCat-Flash
LongCat-Flash is a 560-billion-parameter Mixture-of-Experts (MoE) language model with two novel designs: Zero-computation Experts for dynamic budget allocation (18.6B–31.3B activated parameters) and Shortcut-connected MoE for enhanced inference efficiency. It was trained on >20 trillion tokens in 30 days with a multi-stage strategy for agentic intelligence. It achieves >100 TPS inference at $0.70/million output tokens, outperforming leading models in agentic tasks.
Executive Impact & Performance Metrics
LongCat-Flash delivers exceptional performance across key operational and efficiency benchmarks, showcasing its potential for enterprise AI applications.
Deep Analysis & Enterprise Applications
Architecture Innovations
LongCat-Flash introduces a novel Mixture-of-Experts (MoE) architecture with key innovations aimed at computational efficiency and dynamic resource allocation. These include Zero-computation Experts, Shortcut-connected MoE, and Variance Alignment designs to ensure scalability and stable performance.
Zero-Computation Experts for Dynamic Allocation: LongCat-Flash uses Zero-computation Experts to dynamically allocate computational resources. This allows the model to activate between 18.6B and 31.3B parameters per token based on contextual demands, optimizing resource usage. Figure 3a demonstrates consistent loss reduction and improved performance under matched computation budgets.
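To make the dynamic-allocation idea concrete, here is a minimal PyTorch-style sketch of a router that mixes ordinary FFN experts with zero-computation (identity) experts: tokens routed to an identity expert simply pass through, so the number of activated parameters varies per token. All class and parameter names here are illustrative assumptions, not LongCat-Flash's actual implementation.

```python
import torch
import torch.nn as nn

class ZeroComputationMoE(nn.Module):
    """Sketch of an MoE layer with zero-computation (identity) experts.

    The router scores both real FFN experts and identity experts; tokens
    routed to identity experts skip FFN compute entirely, so activated
    parameters vary per token. Shapes, activation, and routing details
    are simplifying assumptions.
    """
    def __init__(self, d_model, n_ffn_experts, n_zero_experts, d_ffn, top_k):
        super().__init__()
        self.top_k = top_k
        self.n_ffn_experts = n_ffn_experts
        self.router = nn.Linear(d_model, n_ffn_experts + n_zero_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ffn), nn.SiLU(), nn.Linear(d_ffn, d_model))
            for _ in range(n_ffn_experts)
        ])

    def forward(self, x):                               # x: [tokens, d_model]
        scores = torch.softmax(self.router(x), dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            e = idx[:, k]                               # chosen expert id per token
            w = weights[:, k].unsqueeze(-1)
            zero_mask = (e >= self.n_ffn_experts).unsqueeze(-1)
            # Identity experts: contribute the input itself, no FFN compute.
            out = out + torch.where(zero_mask, w * x, torch.zeros_like(x))
            # FFN experts: compute only for the tokens actually routed to them.
            for j in range(self.n_ffn_experts):
                sel = (e == j)
                if sel.any():
                    out[sel] = out[sel] + w[sel] * self.experts[j](x[sel])
        return out
```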
Shortcut-Connected MoE (ScMoE): The Shortcut-connected MoE (ScMoE) architecture is employed to significantly expand the computation-communication overlap window, boosting both training and inference efficiency. This design ensures that the training loss curves are virtually indistinguishable from baselines without ScMoE, confirming its quality-neutral benefits across various model scales and attention mechanisms. Figure 4 illustrates these consistent loss curves.
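The sketch below illustrates the overlap idea under simplifying assumptions (a single block, placeholder callables for the real attention/FFN/collective kernels, a CUDA device): the dense FFN on the shortcut path runs while the MoE's all-to-all dispatch travels on a separate communication stream. It is a conceptual sketch of the wider overlap window, not the report's kernel-level schedule.

```python
import torch

def scmoe_block(x, attn, dense_ffn, moe_dispatch, moe_experts, moe_combine,
                comm_stream=None):
    """Sketch of a Shortcut-connected MoE (ScMoE) block (assumes a CUDA device).

    The shortcut path lets the dense FFN execute while the MoE's all-to-all
    dispatch runs on a separate communication stream, widening the
    computation/communication overlap window. All callables are placeholders;
    a real implementation needs per-tensor stream bookkeeping as well.
    """
    comm_stream = comm_stream or torch.cuda.Stream()
    compute_stream = torch.cuda.current_stream()

    h = x + attn(x)

    # Launch expert dispatch (communication) on its own stream ...
    comm_stream.wait_stream(compute_stream)
    with torch.cuda.stream(comm_stream):
        routed = moe_dispatch(h)            # all-to-all send of routed tokens

    # ... while the dense FFN on the shortcut path keeps the GPU busy.
    shortcut = dense_ffn(h)

    compute_stream.wait_stream(comm_stream)
    expert_out = moe_experts(routed)        # expert FFN compute

    comm_stream.wait_stream(compute_stream)
    with torch.cuda.stream(comm_stream):
        combined = moe_combine(expert_out)  # all-to-all return of expert outputs
    compute_stream.wait_stream(comm_stream)

    return h + shortcut + combined
```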
Variance Alignment for Scalability: LongCat-Flash incorporates Variance Alignment techniques for both Multi-head Latent Attention (MLA) and fine-grained FFN experts. This addresses variance misalignment during scaling, preventing instability and performance degradation. Scale-correction factors (aq and aku) in MLA and a scaling factor (γ) for expert initialization ensure well-conditioned attention computations and preserve MoE layer output variance. Figure 5a demonstrates improved convergence with scale-correction.
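As a rough illustration of where these factors enter, the snippet below applies scale-correction multipliers to the low-rank MLA query/key paths and rescales the initialization of fine-grained experts. The specific expressions (square-root ratios) are illustrative assumptions; the report derives the exact values of aq, aku, and γ.

```python
import math
import torch
import torch.nn as nn

d_model, q_lora_rank, kv_lora_rank, n_experts_split = 4096, 1536, 512, 8

# Scale-correction for MLA: rescale the up-projected queries/keys coming out of
# the compressed (low-rank) paths so their variance stays well conditioned as
# the model is widened. The sqrt(d_model / rank) form is an assumption.
alpha_q = math.sqrt(d_model / q_lora_rank)
alpha_kv = math.sqrt(d_model / kv_lora_rank)

q_up = nn.Linear(q_lora_rank, d_model, bias=False)
kv_up = nn.Linear(kv_lora_rank, d_model, bias=False)

c_q = torch.randn(2, q_lora_rank)     # compressed query latent
c_kv = torch.randn(2, kv_lora_rank)   # compressed key/value latent
q = alpha_q * q_up(c_q)
k = alpha_kv * kv_up(c_kv)

# Fine-grained expert initialization: when one large FFN is split into
# n_experts_split smaller experts, the init std of each expert's output
# projection is rescaled by gamma so the MoE layer's output variance matches
# the dense FFN it replaces (gamma = sqrt(n_experts_split) is an assumption).
gamma = math.sqrt(n_experts_split)
expert_out_proj = nn.Linear(d_model // n_experts_split, d_model, bias=False)
nn.init.normal_(expert_out_proj.weight, std=0.02 * gamma)
```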
Figure 2: The architecture adopted in LongCat-Flash. It integrates Shortcut-connected Mixture of Experts (ScMoE) with zero-computation experts for dynamic computation and enhanced efficiency.
Enterprise Process Flow: General Pre-Training Phases
The General Pre-Training process involves a multi-phase data pipeline to ensure quality and diversity. This sequence is crucial for building a robust base model for LongCat-Flash.
Pre-Training Methodologies
The pre-training of LongCat-Flash follows a robust multi-stage curriculum, focusing on scalability, stability, and agentic capability. It incorporates hyperparameter transfer, model growth initialization, and a multi-pronged stability suite to ensure efficient and reliable large-scale training.
Hyperparameter Transfer & Model Growth: LongCat-Flash leverages hyperparameter transfer based on width scaling and model growth initialization (layer stacking) to efficiently train large-scale models. This strategy significantly reduces computational costs and provides improved performance compared to random initialization, as evidenced in Figure 5b. The model starts as a half-scale version, pre-trained on billions of tokens, then expands.
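A minimal sketch of the layer-stacking step: the trained half-depth model's transformer blocks are duplicated to initialize the full-depth model, which then continues pre-training. The simple repeat-each-block pattern below is an assumption; the report only states that layer stacking is used.

```python
import copy
import torch.nn as nn

def grow_by_layer_stacking(small_layers: nn.ModuleList) -> nn.ModuleList:
    """Sketch of model-growth initialization via layer stacking.

    A half-depth model is pre-trained first; its blocks are then duplicated
    to initialize the full-depth model. The exact duplication order is an
    illustrative assumption.
    """
    grown = nn.ModuleList()
    for layer in small_layers:
        grown.append(layer)
        grown.append(copy.deepcopy(layer))   # duplicate each trained block
    return grown
```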
Training Stability Measures: Training stability is enhanced through router stability control (balancing LM and LB losses), activation stability via hidden z-loss, and optimized Adam's epsilon. Hidden z-loss (Eq. 10) prevents massive activations and loss spikes (Figure 6), while Adam's epsilon is set to 1e-16 to maintain adaptive properties for large-scale models (Figure 7).
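For intuition, here is a hedged sketch of a hidden z-loss regularizer. The squared-log-of-exp-sum form below mirrors the classic logit z-loss and is only an assumption; Eq. 10 in the report gives the exact definition LongCat-Flash uses.

```python
import torch

def hidden_z_loss(hidden: torch.Tensor, coeff: float = 1e-4) -> torch.Tensor:
    """Hedged sketch of a hidden z-loss regularizer.

    Penalizes very large hidden-state magnitudes so activations (and the loss)
    cannot spike during BF16 training. The particular form here is an
    assumption, not the report's Eq. 10 verbatim.
    """
    z = torch.logsumexp(hidden.float(), dim=-1)   # per-token magnitude proxy
    return coeff * (z ** 2).mean()
```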
Long Context Extension & Decontamination: A two-stage context length extension strategy expands the context window from 8k to 128k tokens, using naturally occurring long-text data and curated source code. Rigorous decontamination procedures, including 13-gram overlap and semantic similarity checks, prevent data leakage from common benchmarks, ensuring robust evaluation confidence.
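The 13-gram overlap check can be sketched in a few lines. The function below assumes a prebuilt set of 13-token tuples extracted from benchmark data (`benchmark_ngrams` is a hypothetical name) and omits the semantic-similarity stage mentioned above.

```python
def contaminated(doc_tokens, benchmark_ngrams, n=13):
    """Flag a training document if any of its 13-grams appears in benchmark data.

    `benchmark_ngrams` is assumed to be a set of n-token tuples built from the
    evaluation suites; flagged documents are removed from the training mix.
    """
    for i in range(len(doc_tokens) - n + 1):
        if tuple(doc_tokens[i:i + n]) in benchmark_ngrams:
            return True
    return False
```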
General Pre-Training Data Strategy: A multi-phase pipeline ensures data quality and diversity, with Content Extraction, Quality Filtering, and Deduplication steps. The data mixture progressively increases the proportion of high-quality reasoning data (e.g., STEM and code), aiming for comprehensive foundational capabilities.
Training Infrastructures & Efficiency
The training infrastructure for LongCat-Flash is designed for scalability with precision, ensuring deterministic computation and efficient distributed training. Key innovations include numerical precision control, kernel optimizations, and advanced distributed strategies like ScMoE for computation-communication overlap.
GEMM Precision Comparison (ULP)
Table 4: GEMM Precision Comparison (ULP) between Solution 1 and Solution 2, demonstrating efforts to reduce ULP errors.
| Case | Output Shape | Value Range | Solution 1 Max | Solution 1 Min | Solution 2 Max | Solution 2 Min |
|---|---|---|---|---|---|---|
| 1 | [1024, 1536] | [-5, 5] | 2292 | -568 | 112 | -100 |
| 2 | [1024, 576] | [-5, 5] | 65362 | -82046 | 6.5 | -9 |
| 3 | [1024, 16384] | [-19, 15] | 544 | -104 | 224 | -112 |
| 4 | [1024, 12288] | [-4, 4] | 202 | -88 | 72 | -41 |
| 5 | [1024, 6144] | [-1, 1] | 5376 | -1376 | 304 | -224 |
| 6 | [1024, 24576] | [-5, 5] | 7200 | -510 | 104 | -294 |
| 7 | [1024, 131072] | [5, 5] | 8128 | -6976 | 2528 | -368 |
| 8 | [1024, 6144] | [-1, 1] | 5344 | -8064 | 80 | -258 |
Numerical Precision Control & SDC Detection: LongCat-Flash employs ULP (Unit in the Last Place) evaluation to quantify floating-point errors and integrates an on-chip, in-place operator recomputation mechanism for Silent Data Corruption (SDC) detection. This ensures bitwise-aligned loss values and minimizes numerical errors during BF16 training, crucial for stable large-scale training. Table 4 demonstrates GEMM precision comparison, highlighting efforts to reduce ULP errors.
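The ULP metric can be illustrated with a short NumPy sketch: the low-precision result is compared against a higher-precision reference and the error is expressed in units of the last place at the reference value. Measuring against an FP32 reference GEMM is an assumption about the report's methodology.

```python
import numpy as np

def ulp_error(computed_bf16: np.ndarray, reference_fp32: np.ndarray) -> np.ndarray:
    """Sketch of a ULP (Unit in the Last Place) error measurement for GEMM outputs.

    Returns (computed - reference) / ulp(reference): how many representable FP32
    steps the low-precision result is away from the high-precision reference.
    Using an FP32 reference (rather than FP64) is an assumption.
    """
    ref = reference_fp32.astype(np.float32)
    ulp = np.spacing(np.abs(ref))             # size of one ULP at the reference value
    return (computed_bf16.astype(np.float32) - ref) / ulp
```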
Kernel Optimizations for Determinism & Performance: Custom kernel redesigns address the overhead of deterministic execution, including a deterministic FlashAttention Gradients (FAG) kernel that runs 1.6× faster than the original deterministic implementation and at 0.95× the speed of the non-deterministic one, and a hierarchical reduction algorithm for deterministic ScatterAdd that reaches performance parity with its non-deterministic counterpart. Optimized Grouped GEMM and fused GemmAdd kernels further enhance efficiency.
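To show why determinism matters here, the sketch below replaces atomic scatter-adds (whose floating-point summation order varies between runs) with a sort-and-segment reduction that always sums in the same order. This is a host-side illustration of the determinism idea, not the hierarchical CUDA kernel described in the report.

```python
import torch

def deterministic_scatter_add(out: torch.Tensor, index: torch.Tensor, src: torch.Tensor):
    """Deterministic scatter-add via stable sort + fixed-order segment reduction.

    Atomic adds are non-deterministic because floating-point addition is not
    associative and the thread ordering changes between runs; sorting by
    destination index and reducing each segment in a fixed order restores
    bitwise reproducibility.
    """
    order = torch.argsort(index, stable=True)
    sorted_idx, sorted_src = index[order], src[order]
    uniq, counts = torch.unique_consecutive(sorted_idx, return_counts=True)
    segment_sums = torch.stack([
        seg.sum(dim=0) for seg in torch.split(sorted_src, counts.tolist())
    ])
    out[uniq] += segment_sums    # each destination row is written exactly once
    return out
```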
Distributed Strategy for Large-scale Training: The training architecture is centered on Expert Parallelism Groups (EP), with Context Parallelism (CP) for attention layers and EP partitioning for FFN layers. ScMoE enables dispatch/combine communication to overlap with dense FFN computation by dividing MoE layers into chunks, significantly reducing non-overlapping communication. Figures 8 and 9 illustrate the ScMoE layer chunking and overall overlapping strategy, which reduced non-overlapping communication from 25.3% to 8.4%.
Figure 8: ScMoE Layer with Chunk achieves highest efficiency through computation-communication overlap.
Figure 9: An overview of LongCat-Flash's overlapping strategy, leveraging ScMoE to maximize efficiency.
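The chunking idea behind Figure 8 can be sketched as a small software pipeline: while one chunk's tokens are in flight through the all-to-all dispatch, the experts compute on the previous chunk. The callables are placeholders for the real collective and expert kernels, and stream handling is collapsed into sequential calls for readability.

```python
import torch

def chunked_moe_forward(tokens, n_chunks, dispatch, expert_fn, combine):
    """Sketch of chunk-level computation/communication pipelining in an MoE layer.

    Splitting the MoE layer's tokens into chunks lets the dispatch of chunk i
    overlap with expert compute on chunk i-1, hiding most of the all-to-all
    communication behind computation.
    """
    chunks = torch.chunk(tokens, n_chunks, dim=0)
    outputs, in_flight = [], None
    for chunk in chunks:
        dispatched = dispatch(chunk)           # comm: would run on a separate stream
        if in_flight is not None:
            outputs.append(combine(expert_fn(in_flight)))  # compute on previous chunk
        in_flight = dispatched
    outputs.append(combine(expert_fn(in_flight)))
    return torch.cat(outputs, dim=0)
```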
Inference and Deployment Optimizations
LongCat-Flash's inference system is optimized through model-system co-design, achieving high throughput and low latency. Key techniques include Single Batch Overlap (SBO), speculative decoding with Multi-Token Prediction (MTP), KV cache reduction, multi-step overlapped scheduling, custom kernels, and fine-grained quantization.
Model-Specific Inference Optimization: A Single Batch Overlap (SBO) scheduling strategy optimizes both latency and throughput by orchestrating computation-communication overlap within the ScMoE architecture. Speculative decoding employs Multi-Token Prediction (MTP) as a lightweight draft model (90% acceptance rate), and Multi-head Latent Attention (MLA) significantly reduces KV cache size and bandwidth pressure.
MTP Head Structures Comparison
Table 5: Draft token acceptance rate on MT-Bench of different MTP head structures with a 6B activated model.
| MTP layer | Activated parameters ratio | Acceptance rate α |
|---|---|---|
| Dense layer | 1.41% | 92.1% |
| ScMoE layer | 4.17% | 92.9% |
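Building on the MTP acceptance rates above, here is a minimal sketch of the draft-and-verify loop behind speculative decoding: the lightweight MTP head proposes token(s), and the full model checks them in a single forward pass. Greedy acceptance and the placeholder callables `target_model` and `mtp_draft` are assumptions; the report only states the acceptance rate, not this exact loop.

```python
import torch

def speculative_step(target_model, mtp_draft, prefix, n_draft=1):
    """Sketch of speculative decoding with an MTP head as the draft model.

    `prefix` is a 1-D LongTensor of token ids; `mtp_draft(ctx)` and
    `target_model(ctx)` are placeholders returning next-token logits
    ([vocab]) and per-position logits ([len(ctx), vocab]) respectively.
    """
    draft, ctx = [], prefix
    for _ in range(n_draft):
        tok = mtp_draft(ctx).argmax(dim=-1)          # cheap draft proposal
        draft.append(tok)
        ctx = torch.cat([ctx, tok.view(1)])
    verify_logits = target_model(ctx)                # one full-model pass over prefix + draft
    accepted = []
    for i, tok in enumerate(draft):
        target_tok = verify_logits[len(prefix) + i - 1].argmax(dim=-1)
        if target_tok == tok:
            accepted.append(tok)                     # draft token matches: accept
        else:
            accepted.append(target_tok)              # mismatch: take the target's token and stop
            break
    return torch.stack(accepted)
```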
Minimize Schedule Overhead: A multi-step overlapped scheduler launches kernels for multiple forward steps in a single iteration, effectively hiding CPU scheduling and synchronization. This ensures continuous GPU occupancy and dynamically pre-allocates KV cache slots, guaranteeing convergence in allocated KV cache size even without prior knowledge of accept length, as shown in Figure 10.
Figure 10: Multi-step overlapped scheduler for efficient inference.
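A minimal sketch of the scheduling pattern, with both function arguments as placeholders rather than a real serving API: kernels for several decode steps are enqueued per iteration so CPU-side scheduling overlaps with GPU execution, and KV-cache slots for speculative tokens are pre-allocated pessimistically.

```python
def multi_step_decode(schedule_batch, enqueue_forward_step, n_overlap_steps=4, max_iters=1000):
    """Sketch of a multi-step overlapped decoding scheduler.

    Rather than preparing and launching one forward step at a time (leaving
    the GPU idle during CPU-side scheduling), several steps' kernels are
    queued at once; the CPU prepares the next batch while the GPU drains the
    queue. `enqueue_forward_step` is assumed to return a future-like handle.
    """
    for _ in range(max_iters):
        batch = schedule_batch()                     # CPU: batching + KV-slot pre-allocation
        futures = [enqueue_forward_step(batch)       # GPU: several decode steps queued at once
                   for _ in range(n_overlap_steps)]
        for f in futures:                            # collect results after scheduling work
            f.result()
```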
Measured Inference Performance
Table 6: Performance of LongCat-Flash under different settings, showing superior TGS (generation throughput) and TPS/u (tokens per second per user) compared to DeepSeek-V3.

| Model | Precision | Avg. Context Length | #Hopper GPUs | TGS | TPS/u |
|---|---|---|---|---|---|
| DeepSeek-V3-profile | bf16 | 4096 | 128 | 2324 | 20 |
| DeepSeek-V3-blog | bf16 | 4989 | 144 | 1850 | 20–22 |
| LongCat-Flash | bf16 | 5000 | 128 | 3785 | 35 |
| LongCat-Flash | bf16 | 5000 | 128 | 2205 | 68.9 |
| LongCat-Flash | bf16 | 5000 | 128 | 804 | 100.5 |
| LongCat-Flash | fp8 | 5000 | 128 | 4230 | 26.4 |
| LongCat-Flash | fp8 | 8192 | 128 | 3240 | 33.8 |
Theoretical Decoding Performance: LongCat-Flash achieves significant theoretical improvements in both throughput and latency due to its reduced layer count and the SBO overlapping strategy. The theoretical Time-Per-Output-Token (TPOT) is significantly lower than competing models, demonstrating superior efficiency.
Theoretical Decoding Performance (Model Configurations)
Table 7 Part 1: Theoretical decoding model configurations impacting performance.
| Metric | DeepSeek-V3 | Qwen3-235B-A22B | LongCat-Flash |
|---|---|---|---|
| MTP | w/ | w/o | w/ |
| n_layer | 61 | 94 | 28 |
| batch per device | 96 | 96 | 96 |
Theoretical Decoding Performance (Module Costs & TPOT)
Table 7 Part 2: Theoretical decoding module costs and TPOT/cost comparison, highlighting LongCat-Flash's efficiency.
| Module/Metric | DeepSeek-V3 | Qwen3-235B-A22B | LongCat-Flash |
|---|---|---|---|
| attention | 471 µs | 314 µs | 264 µs |
| all-to-all dispatch | 275 µs | 157 µs | 236 µs |
| MoE | 77 µs | 29 µs | 60 µs |
| all-to-all combine | 551 µs | 315 µs | 472 µs |
| overlap strategy | TBO | TBO | SBO |
| TPOT (ms) | 30 | 26.2 | 16 |
| $/1M output tokens | 0.17 | 0.15 | 0.09 |
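The cost row can be sanity-checked with a back-of-the-envelope calculation: each GPU serves its per-device batch and emits one token per sequence every TPOT, so per-GPU throughput is batch / TPOT. The ~$2-per-GPU-hour price below is an assumption for illustration, not a figure from the report.

```python
def cost_per_million_tokens(tpot_ms: float, batch_per_device: int,
                            gpu_hourly_usd: float = 2.0) -> float:
    """Rough $/1M-output-token estimate from TPOT and per-device batch size.

    Per-GPU throughput = batch / TPOT tokens per second; dividing the GPU's
    per-second cost by that throughput gives cost per token. The GPU hourly
    price is an assumed value.
    """
    tokens_per_sec_per_gpu = batch_per_device / (tpot_ms / 1000.0)
    usd_per_sec_per_gpu = gpu_hourly_usd / 3600.0
    return usd_per_sec_per_gpu / tokens_per_sec_per_gpu * 1e6

# With the Table 7 settings (batch 96 per device) and a ~$2/hr GPU assumption:
#   LongCat-Flash : cost_per_million_tokens(16.0, 96)  ~= $0.09
#   Qwen3-235B    : cost_per_million_tokens(26.2, 96)  ~= $0.15
#   DeepSeek-V3   : cost_per_million_tokens(30.0, 96)  ~= $0.17
```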
Your AI Implementation Roadmap
A clear, phased approach to integrating advanced AI into your enterprise, ensuring a smooth transition and measurable impact.
Phase 1: Discovery & Strategy
Comprehensive assessment of current workflows, identification of AI opportunities, and development of a tailored implementation strategy.
Phase 2: Pilot & Proof-of-Concept
Deployment of a small-scale AI solution to validate performance, gather feedback, and refine the model for your specific needs.
Phase 3: Integration & Scaling
Seamless integration of the AI model into your existing infrastructure and scaling to optimize performance across the enterprise.
Phase 4: Optimization & Future-Proofing
Continuous monitoring, performance optimization, and strategic planning for future AI advancements and expanded applications.
Ready to Transform Your Enterprise with AI?
Connect with our experts to explore how LongCat-Flash and other cutting-edge AI solutions can drive efficiency, innovation, and growth for your business.