Enterprise AI Analysis: Sandwich: Joint Configuration Search and Hot-Switching for Efficient CPU LLM Serving

LLM Serving Optimization

Sandwich: Joint Configuration Search and Hot-Switching for Efficient CPU LLM Serving

Sandwich introduces a full-stack CPU LLM serving system with three core innovations: seamless phase-wise plan switching to eliminate cross-phase interference, TopoTree for automated substructure-aware partial core allocation, and fast-start-then-finetune dynamic-shape tensor program generation. This approach significantly enhances CPU-based LLM serving efficiency.

Executive Impact

Sandwich significantly boosts CPU LLM serving performance by tackling key challenges like prefill/decode resource conflicts, suboptimal core allocation, and dynamic-shape kernel inefficiencies. Its innovations lead to substantial gains in speed and efficiency, making CPU-based LLM deployment more viable and cost-effective for enterprise applications.

Key results at a glance: a 2.01x average end-to-end speedup, reduced serving latency, and lower tensor-program tuning cost.

Deep Analysis & Enterprise Applications

Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.

Enhanced Efficiency Through Dynamic Switching

Sandwich introduces seamless phase-wise plan switching, which dynamically adapts computation graphs between the compute-intensive prefill phase and the memory-intensive decode phase. This eliminates cross-phase interference, ensuring optimal resource utilization for each stage of LLM serving. Unlike static solutions, Sandwich avoids performance degradation by ensuring that decode's memory bottlenecks do not hinder prefill's computational throughput.
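As an illustration, the switching logic can be sketched in a few lines of Python. The plan contents below (core counts, kernel names) are hypothetical stand-ins, not values from the paper:

```python
# Hypothetical sketch of phase-wise plan switching: keep one execution plan
# per phase and hot-switch at the phase boundary, so decode's memory-bound
# settings never constrain prefill. Names and numbers are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class Plan:
    cores: int      # cores this phase runs on
    kernel: str     # kernel variant tuned for this phase

PLANS = {
    "prefill": Plan(cores=64, kernel="compute_bound_gemm"),
    "decode":  Plan(cores=48, kernel="memory_bound_gemv"),
}

def serve(prompt_tokens: int, new_tokens: int):
    """Trace which plan handles each step of one request."""
    trace = [("prefill", PLANS["prefill"].kernel)]               # whole prompt at once
    trace += [("decode", PLANS["decode"].kernel)] * new_tokens   # one token at a time
    return trace

trace = serve(prompt_tokens=128, new_tokens=2)
print(trace[0], "then", len(trace) - 1, "decode steps")
```

The key design point is that each phase gets its own fully specified plan, so switching is a pointer swap rather than a renegotiation of shared resources mid-request.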

TopoTree: Smart Core Allocation

TopoTree is a tree-based hardware abstraction that systematically enumerates core allocation plans, accounting for complex CPU topologies including NUMA nodes and shared LLC slices. This enables substructure-aware partial core allocation, which mitigates memory contention during the decode phase and improves overall performance. By aligning allocations with hardware substructures, it explores the core-allocation space efficiently while minimizing resource contention, a significant improvement over approaches that ignore sub-NUMA structure.
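A minimal Python sketch of the idea, assuming a machine -> NUMA node -> LLC slice -> core hierarchy; the node kinds and sizes here are hypothetical, not Sandwich's actual data structure:

```python
# Hypothetical TopoTree sketch: model the CPU as a tree and enumerate
# subtree-aligned core allocations, so partial allocations always respect
# shared-cache (LLC) and NUMA boundaries.
from dataclasses import dataclass, field

@dataclass
class TopoNode:
    kind: str                        # "machine" | "numa" | "llc" | "core"
    children: list = field(default_factory=list)

    def cores(self):
        if self.kind == "core":
            return 1
        return sum(c.cores() for c in self.children)

    def allocations(self):
        """Yield candidate core counts aligned to hardware substructures."""
        yield self.cores()                      # the whole subtree
        for child in self.children:
            yield from child.allocations()      # any nested substructure

def make_llc():
    return TopoNode("llc", [TopoNode("core") for _ in range(4)])

# 1 NUMA node with 2 LLC slices of 4 cores each
machine = TopoNode("machine", [TopoNode("numa", [make_llc(), make_llc()])])
print(sorted(set(machine.allocations()), reverse=True))  # [8, 4, 1]
```

Allocations that straddle an LLC slice (say, 6 of the 8 cores) never appear, which is exactly how substructure awareness prunes contention-prone plans.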

Fast-Start-then-Finetune for Tensor Programs

Sandwich employs a fast-start-then-finetune strategy for dynamic-shape tensor program generation. This method jointly optimizes micro-kernels (MKs) and polymerization schemes, significantly reducing tuning overhead and improving prefill kernel performance. It produces dedicated kernels for both prefill and decode phases, matching kernel performance of static compilers with drastically lower tuning costs, a critical advantage for dynamic LLM workloads.
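The two-stage search can be sketched with a toy cost model; the formulas, parameter names, and candidate values below are illustrative stand-ins for Sandwich's actual models and benchmark measurements:

```python
# Hypothetical fast-start-then-finetune sketch: pick a micro-kernel (MK)
# quickly from a cheap analytical model, then spend the remaining budget
# tuning only the polymerization (tile-combination) factor.
def analytical_cost(mk, shape):
    m, n = shape
    # toy model: penalize tiles that divide the problem shape unevenly
    waste = (-m % mk["tile_m"]) + (-n % mk["tile_n"])
    return m * n + 100 * waste

def fast_start(micro_kernels, shape):
    return min(micro_kernels, key=lambda mk: analytical_cost(mk, shape))

def finetune(mk, shape, poly_factors):
    m, n = shape
    def measured(p):                      # stand-in for a real benchmark run
        tiles = (m // mk["tile_m"]) * (n // mk["tile_n"])
        return tiles / p + (tiles % p)    # fuse p tiles per task; penalize leftovers
    return min(poly_factors, key=measured)

mks = [{"tile_m": 4, "tile_n": 8}, {"tile_m": 6, "tile_n": 16}]
best = fast_start(mks, shape=(128, 512))
poly = finetune(best, (128, 512), poly_factors=[1, 2, 4, 8])
print(best, "poly =", poly)
```

The split matters because the MK choice is cheap to rank analytically, while polymerization interacts with runtime behavior and benefits from measurement; restricting measurement to the second stage is what keeps tuning cost low.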

2.01x Average End-to-End Speedup Across Five CPU Platforms

Enterprise Process Flow

Hardware Information Parsing
TopoTree Construction
Explore Latent Shared Structures
Apply Remove Transformations
Service Configuration Generation
Tensor Program Generation
Runtime Hot-Switching
Optimization ablation, throughput in tokens/s:

Optimization        | BS=1  | BS=8
--------------------|-------|------
vLLM (baseline)     |  4.09 |  2.35
+ communication op  | 13.46 |  3.66
+ service config    | 14.90 |  5.40
+ kernel tuning     | 17.54 |  8.16
+ split k           | 17.09 |  8.78
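The cumulative gains above can be sanity-checked with quick arithmetic; note that split-k helps BS=8 (8.78 vs 8.16 tokens/s) while slightly regressing BS=1 (17.09 vs 17.54), a typical batch-size trade-off:

```python
# Speedup of each cumulative optimization over the vLLM baseline,
# using the throughput numbers (tokens/s) from the ablation table.
bs1 = {"vLLM": 4.09, "+ communication op": 13.46, "+ service config": 14.90,
       "+ kernel tuning": 17.54, "+ split k": 17.09}
bs8 = {"vLLM": 2.35, "+ communication op": 3.66, "+ service config": 5.40,
       "+ kernel tuning": 8.16, "+ split k": 8.78}

for name, table in (("BS=1", bs1), ("BS=8", bs8)):
    base = table["vLLM"]
    for opt, tps in table.items():
        print(f"{name} {opt}: {tps / base:.2f}x")
# e.g. BS=1 "+ kernel tuning" is 17.54 / 4.09 ≈ 4.29x,
#      BS=8 "+ split k"      is  8.78 / 2.35 ≈ 3.74x
```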

Case Study: Llama-1.3B Serving on EPYC 7H12

On an AMD EPYC 7H12 platform, Sandwich demonstrates robust performance for Llama-1.3B, achieving an average 2.26x speedup over OpenVINO. Its ability to manage complex hardware architectures effectively results in significant latency reductions and higher throughput, making it ideal for cost-efficient enterprise LLM deployments on diverse CPU systems.

Advanced ROI Calculator

Estimate your potential annual savings and reclaimed hours by optimizing LLM serving with Sandwich's CPU efficiency.

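As a rough illustration of what such a calculator computes: every input below is a hypothetical assumption, except the 2.01x average speedup, which is the paper's reported figure:

```python
# Hypothetical ROI sketch. Inputs (annual cost, hours saved, hourly rate)
# are illustrative assumptions, not figures from the paper or calculator.
def estimate_roi(annual_serving_cost, speedup, eng_hours_saved_per_month,
                 hourly_rate):
    infra_savings = annual_serving_cost * (1 - 1 / speedup)  # fewer CPU-hours per token
    hours = eng_hours_saved_per_month * 12
    return infra_savings + hours * hourly_rate, hours

savings, hours = estimate_roi(annual_serving_cost=120_000, speedup=2.01,
                              eng_hours_saved_per_month=20, hourly_rate=90)
print(f"~${savings:,.0f} saved per year, {hours} engineering hours reclaimed")
```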

Your Implementation Roadmap

Our structured approach ensures a smooth integration of Sandwich's optimizations into your existing LLM serving infrastructure.

Phase 1: Hardware Analysis & TopoTree Generation

We begin by thoroughly analyzing your CPU architecture using system tools to construct a fundamental TopoTree. This phase identifies NUMA structures, shared caches, and processing units, laying the groundwork for optimized core allocation and resource management specific to your environment.
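The parsing step can be sketched as follows; the sample mimics `lscpu --parse=CPU,Core,Socket,Node`-style output, and the record fields are illustrative:

```python
# Hypothetical sketch: parse lscpu-style topology output (a canned sample
# here) into the per-CPU records that seed TopoTree construction.
SAMPLE = """\
# CPU,Core,Socket,Node
0,0,0,0
1,1,0,0
2,2,0,1
3,3,0,1
"""

def parse_topology(text):
    rows = [line.split(",") for line in text.splitlines()
            if line and not line.startswith("#")]
    return [{"cpu": int(c), "core": int(co), "socket": int(s), "node": int(n)}
            for c, co, s, n in rows]

cpus = parse_topology(SAMPLE)
nodes = sorted({r["node"] for r in cpus})
print(len(cpus), "logical CPUs across NUMA nodes", nodes)
```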

Phase 2: Service Configuration & Kernel Optimization

In this phase, Sandwich explores potential core allocation plans and dynamic-shape tensor program generation using its fast-start-then-finetune strategy. This involves identifying optimal micro-kernels and polymerization schemes, tailored to your LLM workloads and CPU capabilities, significantly reducing tuning time.

Phase 3: Hot-Switching Integration & Performance Validation

The final phase involves integrating Sandwich's hot-switching mechanism into your LLM serving system. We deploy the optimized prefill and decode kernels and validate the performance gains in real-world scenarios, ensuring seamless phase transitions and sustained efficiency across diverse LLM tasks.
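Performance validation can be as simple as timing the same request stream before and after integration; the serve functions below are stand-ins that only simulate work with sleeps:

```python
# Hypothetical validation sketch: best-of-N wall-clock timing of a fixed
# request stream, before and after enabling the optimized serving path.
import time

def measure(serve, requests, repeats=3):
    """Return the best wall-clock time over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        for req in requests:
            serve(req)
        best = min(best, time.perf_counter() - t0)
    return best

requests = ["req"] * 5
baseline  = measure(lambda r: time.sleep(0.002), requests)  # pre-integration stand-in
optimized = measure(lambda r: time.sleep(0.001), requests)  # post-integration stand-in
print(f"measured speedup: {baseline / optimized:.2f}x")
```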

Ready to Transform Your CPU LLM Serving?

Unlock unparalleled efficiency and cost savings for your enterprise LLM deployments. Schedule a free consultation to see how Sandwich can benefit your specific infrastructure.
