LLM Serving Optimization
Sandwich: Joint Configuration Search and Hot-Switching for Efficient CPU LLM Serving
Sandwich introduces a full-stack CPU LLM serving system with three core innovations: seamless phase-wise plan switching to eliminate cross-phase interference, TopoTree for automated substructure-aware partial core allocation, and fast-start-then-finetune dynamic-shape tensor program generation. This approach significantly enhances CPU-based LLM serving efficiency.
Executive Impact
Sandwich significantly boosts CPU LLM serving performance by tackling key challenges like prefill/decode resource conflicts, suboptimal core allocation, and dynamic-shape kernel inefficiencies. Its innovations lead to substantial gains in speed and efficiency, making CPU-based LLM deployment more viable and cost-effective for enterprise applications.
Deep Analysis & Enterprise Applications
Enhanced Efficiency Through Dynamic Switching
Sandwich introduces seamless phase-wise plan switching, which dynamically adapts computation graphs between the compute-intensive prefill phase and the memory-intensive decode phase. This eliminates cross-phase interference, ensuring optimal resource utilization for each stage of LLM serving. Unlike static solutions, Sandwich avoids performance degradation by ensuring that decode's memory bottlenecks do not hinder prefill's computational throughput.
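The idea can be sketched in a few lines: each phase keeps its own pre-built execution plan, and switching is just repointing to the already-materialized plan rather than rebuilding anything. All class and attribute names below (`ExecutionPlan`, `PhaseSwitcher`, `kernel_set`) are illustrative assumptions, not Sandwich's actual API.

```python
# Hypothetical sketch of phase-wise hot-switching; names are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class ExecutionPlan:
    phase: str          # "prefill" or "decode"
    num_cores: int      # cores allocated under this plan
    kernel_set: str     # identifier of a pre-compiled kernel bundle

class PhaseSwitcher:
    """Holds one pre-built plan per phase and swaps between them without
    recompilation, so prefill and decode never share one compromise plan."""
    def __init__(self, prefill_plan: ExecutionPlan, decode_plan: ExecutionPlan):
        self._plans = {"prefill": prefill_plan, "decode": decode_plan}
        self.active = prefill_plan

    def switch(self, phase: str) -> ExecutionPlan:
        # Hot switch: repoint to the already-materialized plan.
        self.active = self._plans[phase]
        return self.active

prefill = ExecutionPlan("prefill", num_cores=64, kernel_set="prefill_mks")
decode = ExecutionPlan("decode", num_cores=48, kernel_set="decode_mks")
switcher = PhaseSwitcher(prefill, decode)
print(switcher.switch("decode").num_cores)  # 48
```

Because both plans exist up front, the switch itself costs no tuning or graph rebuild at serving time, which is what makes per-phase specialization practical.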
TopoTree: Smart Core Allocation
TopoTree is a tree-based hardware abstraction that systematically enumerates core allocation plans, accounting for complex CPU topologies including NUMA nodes and shared LLC slices. This enables substructure-aware partial core allocation, mitigating memory contention during the decode phase and improving overall performance. By restricting candidate allocations to hardware substructures, it explores the plan space efficiently while balancing the benefits of cache sharing against memory-bandwidth contention, a significant improvement over approaches that ignore sub-NUMA structure.
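A minimal sketch of the enumeration idea follows: model the machine as a tree (socket → NUMA node → LLC slice → core) and treat every subtree's core set as a candidate allocation, so plans never straddle a shared cache unnecessarily. The structure and names here are assumptions for illustration, not the paper's implementation.

```python
# Illustrative TopoTree sketch; structure and names are assumptions.
class TopoNode:
    def __init__(self, name, children=None, core=None):
        self.name = name
        self.children = children or []
        self.core = core  # leaf nodes carry a core id

    def cores(self):
        if self.core is not None:
            return [self.core]
        return [c for child in self.children for c in child.cores()]

def subtree_allocations(root):
    """Enumerate core sets aligned to hardware substructures: every
    subtree (NUMA node, LLC slice, single core) is a candidate plan."""
    plans = [root.cores()]
    for child in root.children:
        plans.extend(subtree_allocations(child))
    return plans

# Toy topology: one NUMA node with two LLC slices of two cores each.
llc0 = TopoNode("llc0", [TopoNode("c0", core=0), TopoNode("c1", core=1)])
llc1 = TopoNode("llc1", [TopoNode("c2", core=2), TopoNode("c3", core=3)])
numa0 = TopoNode("numa0", [llc0, llc1])
print(subtree_allocations(numa0))
# [[0, 1, 2, 3], [0, 1], [0], [1], [2, 3], [2], [3]]
```

Enumerating subtrees instead of arbitrary core subsets shrinks the search from exponential in core count to linear in tree size, which is what makes exhaustive evaluation of partial allocations tractable.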
Fast-Start-then-Finetune for Tensor Programs
Sandwich employs a fast-start-then-finetune strategy for dynamic-shape tensor program generation. This method jointly optimizes micro-kernels (MKs) and polymerization schemes, significantly reducing tuning overhead and improving prefill kernel performance. It produces dedicated kernels for both prefill and decode phases, matching kernel performance of static compilers with drastically lower tuning costs, a critical advantage for dynamic LLM workloads.
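The two-stage idea can be sketched as follows: a heuristic "fast start" yields a working kernel configuration immediately, and a short hill-climbing "finetune" pass then measures neighboring configurations on the real dynamic shape. The cost model, parameter names (`tile_m`, `tile_n`), and search procedure are illustrative assumptions, not Sandwich's actual tuner.

```python
# Hedged sketch of a fast-start-then-finetune loop; all names are illustrative.
def fast_start(shape):
    """Pick a reasonable micro-kernel tiling from simple heuristics so a
    working kernel exists immediately, before any measurement."""
    m, n = shape
    return {"tile_m": min(m, 32), "tile_n": min(n, 32)}

def finetune(shape, config, measure, rounds=3):
    """Hill-climb around the fast-start config, keeping whichever
    neighbor measures fastest on the actual dynamic shape."""
    best, best_cost = config, measure(shape, config)
    for _ in range(rounds):
        for key in ("tile_m", "tile_n"):
            for delta in (-8, 8):
                cand = dict(best)
                cand[key] = max(8, cand[key] + delta)
                cost = measure(shape, cand)
                if cost < best_cost:
                    best, best_cost = cand, cost
    return best

# Toy "measurement": pretend tiles of 16 are optimal for this shape.
def toy_measure(shape, cfg):
    return abs(cfg["tile_m"] - 16) + abs(cfg["tile_n"] - 16)

cfg = finetune((128, 128), fast_start((128, 128)), toy_measure)
print(cfg)  # {'tile_m': 16, 'tile_n': 16}
```

The point of the structure is that most of the cost of exhaustive autotuning is avoided: the fast start bounds worst-case latency to first kernel, while the local finetune recovers near-optimal performance for each shape actually served.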
Cumulative Optimization Gains
| Optimization | BS=1 (Tokens/s) | BS=8 (Tokens/s) |
|---|---|---|
| vLLM | 4.09 | 2.35 |
| + communication op | 13.46 | 3.66 |
| + service config | 14.90 | 5.40 |
| + kernel tuning | 17.54 | 8.16 |
| + split k | 17.09 | 8.78 |
Case Study: Llama-1.3B Serving on EPYC 7H12
On an AMD EPYC 7H12 platform, Sandwich demonstrates robust performance for Llama-1.3B, achieving an average 2.26x speedup over OpenVINO. Its ability to manage complex hardware architectures effectively results in significant latency reductions and higher throughput, making it ideal for cost-efficient enterprise LLM deployments on diverse CPU systems.
Your Implementation Roadmap
Our structured approach ensures a smooth integration of Sandwich's optimizations into your existing LLM serving infrastructure.
Phase 1: Hardware Analysis & TopoTree Generation
We begin by thoroughly analyzing your CPU architecture using system tools to construct a fundamental TopoTree. This phase identifies NUMA structures, shared caches, and processing units, laying the groundwork for optimized core allocation and resource management specific to your environment.
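As a minimal sketch of this discovery step, the snippet below reads the NUMA layout that Linux exposes under `/sys/devices/system/node` and falls back to a flat single-node view elsewhere. It only illustrates where the raw topology data comes from; Sandwich's actual analysis tooling is not shown here.

```python
# Minimal, illustrative hardware-discovery sketch (Phase 1). On Linux the
# NUMA layout lives under /sys/devices/system/node; on other systems we
# fall back to a flat topology with a single node.
import os

def discover_topology():
    # Cores visible to this process (portable fallback: os.cpu_count).
    if hasattr(os, "sched_getaffinity"):
        cpus = sorted(os.sched_getaffinity(0))
    else:
        cpus = list(range(os.cpu_count() or 1))
    nodes = {}
    base = "/sys/devices/system/node"
    if os.path.isdir(base):
        for entry in os.listdir(base):
            if entry.startswith("node") and entry[4:].isdigit():
                with open(f"{base}/{entry}/cpulist") as f:
                    nodes[entry] = f.read().strip()  # e.g. "0-63"
    if not nodes:  # non-Linux fallback: one flat node
        nodes["node0"] = f"0-{len(cpus) - 1}"
    return {"cpus": cpus, "numa_nodes": nodes}

topo = discover_topology()
print(topo["numa_nodes"])
```

Per-core cache-sharing information (for the LLC-slice level of the tree) is available analogously under `/sys/devices/system/cpu/cpu*/cache` on Linux.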
Phase 2: Service Configuration & Kernel Optimization
In this phase, Sandwich explores potential core allocation plans and dynamic-shape tensor program generation using its fast-start-then-finetune strategy. This involves identifying optimal micro-kernels and polymerization schemes, tailored to your LLM workloads and CPU capabilities, significantly reducing tuning time.
Phase 3: Hot-Switching Integration & Performance Validation
The final phase involves integrating Sandwich's hot-switching mechanism into your LLM serving system. We deploy the optimized prefill and decode kernels and validate the performance gains in real-world scenarios, ensuring seamless phase transitions and sustained efficiency across diverse LLM tasks.
Ready to Transform Your CPU LLM Serving?
Unlock unparalleled efficiency and cost savings for your enterprise LLM deployments. Schedule a free consultation to see how Sandwich can benefit your specific infrastructure.