LLM Serving Optimization
Sandwich: Joint Configuration Search and Hot-Switching for Efficient CPU LLM Serving
Sandwich introduces a full-stack CPU LLM serving system with three core innovations: seamless phase-wise plan switching to eliminate cross-phase interference, TopoTree for automated substructure-aware partial core allocation, and fast-start-then-finetune dynamic-shape tensor program generation. This approach significantly enhances CPU-based LLM serving efficiency.
Executive Impact
Sandwich significantly boosts CPU LLM serving performance by tackling key challenges like prefill/decode resource conflicts, suboptimal core allocation, and dynamic-shape kernel inefficiencies. Its innovations lead to substantial gains in speed and efficiency, making CPU-based LLM deployment more viable and cost-effective for enterprise applications.
Deep Analysis & Enterprise Applications
Enhanced Efficiency Through Dynamic Switching
Sandwich introduces seamless phase-wise plan switching, which dynamically adapts computation graphs between the compute-intensive prefill phase and the memory-intensive decode phase. This eliminates cross-phase interference, ensuring optimal resource utilization for each stage of LLM serving. Unlike static solutions, Sandwich avoids performance degradation by ensuring that decode's memory bottlenecks do not hinder prefill's computational throughput.
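The idea can be sketched in a few lines: each phase keeps its own pre-built execution plan, and switching is just repointing to the already-materialized plan rather than rebuilding anything. All class and attribute names below (`ExecutionPlan`, `PhaseSwitcher`, `kernel_set`) are illustrative assumptions, not Sandwich's actual API.

```python
# Hypothetical sketch of phase-wise hot-switching; names are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class ExecutionPlan:
    phase: str          # "prefill" or "decode"
    num_cores: int      # cores allocated under this plan
    kernel_set: str     # identifier of a pre-compiled kernel bundle

class PhaseSwitcher:
    """Holds one pre-built plan per phase and swaps between them without
    recompilation, so prefill and decode never share one compromise plan."""
    def __init__(self, prefill_plan: ExecutionPlan, decode_plan: ExecutionPlan):
        self._plans = {"prefill": prefill_plan, "decode": decode_plan}
        self.active = prefill_plan

    def switch(self, phase: str) -> ExecutionPlan:
        # Hot switch: repoint to the already-materialized plan.
        self.active = self._plans[phase]
        return self.active

prefill = ExecutionPlan("prefill", num_cores=64, kernel_set="prefill_mks")
decode = ExecutionPlan("decode", num_cores=48, kernel_set="decode_mks")
switcher = PhaseSwitcher(prefill, decode)
print(switcher.switch("decode").num_cores)  # 48
```

Because both plans exist up front, the switch itself costs no tuning or graph rebuild at serving time, which is what makes per-phase specialization practical.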
TopoTree: Smart Core Allocation
TopoTree is a tree-based hardware abstraction that systematically enumerates core allocation plans, accounting for complex CPU topologies including NUMA nodes and shared LLC slices. This enables substructure-aware partial core allocation, mitigating memory contention during the decode phase and improving overall performance. By restricting candidate allocations to hardware substructures, it explores the plan space efficiently while balancing the benefits of cache sharing against memory-bandwidth contention, a significant improvement over approaches that ignore sub-NUMA structure.
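A minimal sketch of the enumeration idea follows: model the machine as a tree (socket → NUMA node → LLC slice → core) and treat every subtree's core set as a candidate allocation, so plans never straddle a shared cache unnecessarily. The structure and names here are assumptions for illustration, not the paper's implementation.

```python
# Illustrative TopoTree sketch; structure and names are assumptions.
class TopoNode:
    def __init__(self, name, children=None, core=None):
        self.name = name
        self.children = children or []
        self.core = core  # leaf nodes carry a core id

    def cores(self):
        if self.core is not None:
            return [self.core]
        return [c for child in self.children for c in child.cores()]

def subtree_allocations(root):
    """Enumerate core sets aligned to hardware substructures: every
    subtree (NUMA node, LLC slice, single core) is a candidate plan."""
    plans = [root.cores()]
    for child in root.children:
        plans.extend(subtree_allocations(child))
    return plans

# Toy topology: one NUMA node with two LLC slices of two cores each.
llc0 = TopoNode("llc0", [TopoNode("c0", core=0), TopoNode("c1", core=1)])
llc1 = TopoNode("llc1", [TopoNode("c2", core=2), TopoNode("c3", core=3)])
numa0 = TopoNode("numa0", [llc0, llc1])
print(subtree_allocations(numa0))
# [[0, 1, 2, 3], [0, 1], [0], [1], [2, 3], [2], [3]]
```

Enumerating subtrees instead of arbitrary core subsets shrinks the search from exponential in core count to linear in tree size, which is what makes exhaustive evaluation of partial allocations tractable.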
Fast-Start-then-Finetune for Tensor Programs
Sandwich employs a fast-start-then-finetune strategy for dynamic-shape tensor program generation. This method jointly optimizes micro-kernels (MKs) and polymerization schemes, significantly reducing tuning overhead and improving prefill kernel performance. It produces dedicated kernels for both prefill and decode phases, matching kernel performance of static compilers with drastically lower tuning costs, a critical advantage for dynamic LLM workloads.
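The two-stage idea can be sketched as follows: a heuristic "fast start" yields a working kernel configuration immediately, and a short hill-climbing "finetune" pass then measures neighboring configurations on the real dynamic shape. The cost model, parameter names (`tile_m`, `tile_n`), and search procedure are illustrative assumptions, not Sandwich's actual tuner.

```python
# Hedged sketch of a fast-start-then-finetune loop; all names are illustrative.
def fast_start(shape):
    """Pick a reasonable micro-kernel tiling from simple heuristics so a
    working kernel exists immediately, before any measurement."""
    m, n = shape
    return {"tile_m": min(m, 32), "tile_n": min(n, 32)}

def finetune(shape, config, measure, rounds=3):
    """Hill-climb around the fast-start config, keeping whichever
    neighbor measures fastest on the actual dynamic shape."""
    best, best_cost = config, measure(shape, config)
    for _ in range(rounds):
        for key in ("tile_m", "tile_n"):
            for delta in (-8, 8):
                cand = dict(best)
                cand[key] = max(8, cand[key] + delta)
                cost = measure(shape, cand)
                if cost < best_cost:
                    best, best_cost = cand, cost
    return best

# Toy "measurement": pretend tiles of 16 are optimal for this shape.
def toy_measure(shape, cfg):
    return abs(cfg["tile_m"] - 16) + abs(cfg["tile_n"] - 16)

cfg = finetune((128, 128), fast_start((128, 128)), toy_measure)
print(cfg)  # {'tile_m': 16, 'tile_n': 16}
```

The point of the structure is that most of the cost of exhaustive autotuning is avoided: the fast start bounds worst-case latency to first kernel, while the local finetune recovers near-optimal performance for each shape actually served.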
Cumulative Optimization Gains
| Optimization | BS=1 (Tokens/s) | BS=8 (Tokens/s) |
|---|---|---|
| vLLM | 4.09 | 2.35 |
| + communication op | 13.46 | 3.66 |
| + service config | 14.90 | 5.40 |
| + kernel tuning | 17.54 | 8.16 |
| + split k | 17.09 | 8.78 |
Case Study: Llama-1.3B Serving on EPYC 7H12
On an AMD EPYC 7H12 platform, Sandwich demonstrates robust performance for Llama-1.3B, achieving an average 2.26x speedup over OpenVINO. Its ability to manage complex hardware architectures effectively results in significant latency reductions and higher throughput, making it ideal for cost-efficient enterprise LLM deployments on diverse CPU systems.
Your Implementation Roadmap
Our structured approach ensures a smooth integration of Sandwich's optimizations into your existing LLM serving infrastructure.
Phase 1: Hardware Analysis & TopoTree Generation
We begin by thoroughly analyzing your CPU architecture using system tools to construct a fundamental TopoTree. This phase identifies NUMA structures, shared caches, and processing units, laying the groundwork for optimized core allocation and resource management specific to your environment.
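As a minimal sketch of this discovery step, the snippet below reads the NUMA layout that Linux exposes under `/sys/devices/system/node` and falls back to a flat single-node view elsewhere. It only illustrates where the raw topology data comes from; Sandwich's actual analysis tooling is not shown here.

```python
# Minimal, illustrative hardware-discovery sketch (Phase 1). On Linux the
# NUMA layout lives under /sys/devices/system/node; on other systems we
# fall back to a flat topology with a single node.
import os

def discover_topology():
    # Cores visible to this process (portable fallback: os.cpu_count).
    if hasattr(os, "sched_getaffinity"):
        cpus = sorted(os.sched_getaffinity(0))
    else:
        cpus = list(range(os.cpu_count() or 1))
    nodes = {}
    base = "/sys/devices/system/node"
    if os.path.isdir(base):
        for entry in os.listdir(base):
            if entry.startswith("node") and entry[4:].isdigit():
                with open(f"{base}/{entry}/cpulist") as f:
                    nodes[entry] = f.read().strip()  # e.g. "0-63"
    if not nodes:  # non-Linux fallback: one flat node
        nodes["node0"] = f"0-{len(cpus) - 1}"
    return {"cpus": cpus, "numa_nodes": nodes}

topo = discover_topology()
print(topo["numa_nodes"])
```

Per-core cache-sharing information (for the LLC-slice level of the tree) is available analogously under `/sys/devices/system/cpu/cpu*/cache` on Linux.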
Phase 2: Service Configuration & Kernel Optimization
In this phase, Sandwich explores potential core allocation plans and dynamic-shape tensor program generation using its fast-start-then-finetune strategy. This involves identifying optimal micro-kernels and polymerization schemes, tailored to your LLM workloads and CPU capabilities, significantly reducing tuning time.
Phase 3: Hot-Switching Integration & Performance Validation
The final phase involves integrating Sandwich's hot-switching mechanism into your LLM serving system. We deploy the optimized prefill and decode kernels and validate the performance gains in real-world scenarios, ensuring seamless phase transitions and sustained efficiency across diverse LLM tasks.
Ready to Transform Your CPU LLM Serving?
Unlock unparalleled efficiency and cost savings for your enterprise LLM deployments. Schedule a free consultation to see how Sandwich can benefit your specific infrastructure.