
Enterprise AI Analysis

TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms

Our in-depth analysis of "TAPAS" reveals critical strategies for optimizing Large Language Model (LLM) inference in cloud datacenters, addressing thermal and power management challenges. By leveraging historical data and SaaS workload adaptability, TAPAS optimizes VM placement, request routing, and instance configuration to minimize thermal and power throttling, thereby enhancing cooling and power oversubscription capabilities. This system directly contributes to a substantial reduction in the total cost of ownership (TCO) for cloud providers, while ensuring performance and accuracy for SaaS workloads, and gracefully handling emergency situations like cooling or power failures. Its core innovation lies in adapting to the heterogeneous and dynamic thermal and power profiles inherent in LLM inference, moving beyond traditional datacenter management techniques that prove suboptimal for these advanced AI workloads.

Executive Impact & Key Metrics

Leveraging TAPAS, enterprises can achieve significant operational efficiencies, reduce infrastructure costs, and enhance the reliability of their LLM deployments. The research highlights tangible benefits applicable to cloud-hosted AI workloads.

97% Reduction in Thermal Throttling
99% Reduction in Power Throttling
40% Increase in Datacenter Capacity
23% Reduction in Peak Row Power

Deep Analysis & Enterprise Applications

The sections below explore the specific findings from the research, framed as enterprise-focused modules.

Overview of TAPAS

TAPAS is a thermal- and power-aware scheduling framework designed for LLM inference clusters in cloud platforms. It targets the unique characteristics of LLM workloads, whose fine-grained execution phases (such as prompt processing and token generation) and diverse configuration sensitivities produce heterogeneous, dynamic thermal and power profiles that traditional datacenter management techniques handle poorly.

Using historical data and the adaptability of SaaS workloads, TAPAS jointly manages VM placement, inference request routing, and instance configuration to minimize thermal and power throttling, enabling greater cooling and power oversubscription. The result is a substantially lower total cost of ownership (TCO) for cloud providers, preserved performance and accuracy for SaaS workloads, and graceful handling of emergencies such as cooling or power failures.

TAPAS System Architecture

The TAPAS framework extends conventional cloud LLM inference cluster components by introducing a per-SaaS VM Instance Configurator and maintaining multiple profiles. Its architecture focuses on three core aspects:

  • VM Placement: Efficiently places new GPU workload VMs within cooling and power constraints, using historical data. Because the provider has limited control over IaaS VMs, they are prioritized for cooler servers, while IaaS and SaaS workloads are balanced across aisles and rows.
  • LLM Inference Request Routing: Routes requests across SaaS LLM instances based on individual VM load and the thermal/power slack of the underlying infrastructure, smoothing thermal load and power draw across the cluster.
  • Instance Configuration: Dynamically adjusts LLM inference configurations (e.g., GPU frequency, batch size, model parallelism, quantization) for SaaS instances to manage load spikes and emergency situations, ensuring operations stay within safe thermal and power limits while minimizing impact on goodput and quality.

This integrated approach allows TAPAS to intelligently adapt to changing conditions and workload demands, preventing hotspots and power overloads without compromising service quality.
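
As an illustration of how such slack-aware decisions can be expressed in code, the sketch below scores candidate servers for a new GPU VM and turns per-instance headroom into request-routing weights. The names, fields, and weights (ServerState, thermal_slack_c, power_slack_w, placement_score) are illustrative assumptions, not TAPAS's actual interfaces or policy.

```python
from dataclasses import dataclass

@dataclass
class ServerState:
    # Illustrative telemetry for one candidate server (field names are assumptions).
    thermal_slack_c: float   # degrees C below the throttling threshold
    power_slack_w: float     # watts below the server/row power cap
    current_load: float      # normalized VM/request load, 0.0 - 1.0

def placement_score(server: ServerState,
                    w_thermal: float = 0.5,
                    w_power: float = 0.3,
                    w_load: float = 0.2) -> float:
    """Score a candidate server for a new GPU VM: more thermal/power slack
    and less existing load yield a higher score."""
    return (w_thermal * server.thermal_slack_c
            + w_power * server.power_slack_w / 100.0   # rough unit normalization
            - w_load * server.current_load * 10.0)

def route_weights(instances: list[ServerState]) -> list[float]:
    """Turn per-instance slack into routing weights so requests are steered
    toward instances with the most thermal and power headroom."""
    scores = [max(placement_score(s), 0.0) for s in instances]
    total = sum(scores) or 1.0
    return [s / total for s in scores]

# Example: pick the best server for a new VM, then derive routing weights.
servers = [ServerState(12.0, 400.0, 0.6), ServerState(4.0, 150.0, 0.3)]
best = max(servers, key=placement_score)
weights = route_weights(servers)
```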

TAPAS Performance Gains

Evaluation of TAPAS on a large GPU cluster using production traces demonstrates significant improvements:

  • Reduced Throttling: Achieves a 97% reduction in thermal throttling events and a 99% reduction in power throttling events compared to baseline policies.
  • Increased Capacity: Enables up to 40% additional datacenter capacity through safe oversubscription, lowering TCO.
  • Peak Power Reduction: Reduces peak row power by 23% and maximum temperature by 17%.
  • Maintained SLOs: Preserves P99 tail latency of inference requests and result quality for SaaS workloads.
  • Failure Management: Effectively manages cooling and power failures, maintaining performance with minimal quality impact even under reduced capacity by strategically reconfiguring SaaS instances and rerouting requests.

These gains highlight TAPAS's capability to deliver robust, efficient, and cost-effective LLM inference services in dynamic cloud environments.

97% Reduction in thermal throttling events observed with TAPAS.

Enterprise Process Flow: TAPAS Operational Logic

New GPU VM Arrival → Thermal/Power-Aware VM Placement → LLM Request Routing (SaaS) → SaaS Instance Reconfiguration (Load/Emergency) → Reduced TCO & Throttling
Feature | Traditional Datacenter Mgmt | TAPAS Framework
Workload Type Focus | General-purpose CPU/GPU workloads. | LLM inference workloads with fine-grained phases.
Thermal/Power Awareness | Often suboptimal for dynamic LLM profiles. | Leverages historical data and real-time adaptability; considers spatial/temporal heterogeneity.
Oversubscription | Limited due to peak-load provisioning. | Maximizes safe oversubscription (up to 40% additional capacity); dynamically adjusts based on load and slack.
Failure Handling | Basic redundancy; potential for widespread throttling. | Intelligently reroutes requests; reconfigures SaaS VMs to mitigate impact.
Configuration Control | Minimal or static. | Dynamically adjusts GPU frequency, batch size, and parallelism; balances performance, temperature, power, and quality.

Case Study: Datacenter Capacity Optimization

A major cloud provider faced increasing demand for LLM inference, leading to frequent thermal and power throttling events. Implementing TAPAS cut thermal throttling events by 97% and power throttling events by 99%, which allowed the datacenter to safely increase its operational capacity by up to 40% and significantly lower total cost of ownership without impacting the P99 tail latency or quality of inference results. Dynamic reconfiguration of SaaS VMs proved crucial for maintaining stability during peak loads and simulated cooling failures.

40% Additional datacenter capacity achieved through safe oversubscription.
Configuration Parameter | Impact on Performance | Impact on Temperature | Impact on Power | Impact on Quality
Model Size (e.g., 70B→7B) | ↑ Increased (smaller model is faster) | ↓ Decreased | ↓ Decreased | ↓↓ Significantly decreased
Quantization (e.g., FP16→FP8) | ↑ Increased | ↓ Decreased | ↓ Decreased | ↓ Decreased
Parallelism (e.g., TP8→TP2) | ↓ Decreased (more computation per GPU) | ↑ Increased (load concentrated on fewer, hotter GPUs) | ↓ Decreased (fewer GPUs used) | ↓ Decreased (slower processing can affect user experience)
Frequency (e.g., 2GHz→1GHz) | ↓ Decreased | ↓ Decreased | ↓ Decreased | No significant impact
Batch Size (e.g., 64→16) | ↓ Decreased (lower throughput) | ↓ Decreased (less computational load) | ↓ Decreased | No significant impact
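
To make these trade-offs concrete, here is a minimal sketch of a greedy reconfiguration step that applies the lowest-quality-cost knobs first until an instance is back within its thermal and power limits. The knob ordering and numeric deltas are illustrative assumptions; TAPAS's actual policy and measured profiles are not reproduced here.

```python
# Knobs ordered from least to most impact on result quality, following the
# qualitative table above (frequency and batch size first, model swap last).
# The numeric deltas are illustrative assumptions, not measured values.
KNOBS = [
    {"name": "reduce_frequency",  "temp_delta_c": -5.0,  "power_delta_w": -80.0,  "quality_cost": 0},
    {"name": "shrink_batch",      "temp_delta_c": -4.0,  "power_delta_w": -60.0,  "quality_cost": 0},
    {"name": "lower_parallelism", "temp_delta_c": +3.0,  "power_delta_w": -120.0, "quality_cost": 1},
    {"name": "quantize_fp8",      "temp_delta_c": -6.0,  "power_delta_w": -90.0,  "quality_cost": 2},
    {"name": "swap_smaller_model","temp_delta_c": -10.0, "power_delta_w": -150.0, "quality_cost": 3},
]

def pick_reconfiguration(temp_excess_c: float, power_excess_w: float) -> list[str]:
    """Return the lowest-quality-cost sequence of knobs that removes the
    thermal and power excess (greedy sketch; the real policy is richer)."""
    chosen = []
    for knob in KNOBS:  # already sorted by quality cost
        if temp_excess_c <= 0 and power_excess_w <= 0:
            break
        chosen.append(knob["name"])
        temp_excess_c += knob["temp_delta_c"]
        power_excess_w += knob["power_delta_w"]
    return chosen

# Example: an instance is 6 C over its thermal limit and 100 W over its power cap.
print(pick_reconfiguration(temp_excess_c=6.0, power_excess_w=100.0))
```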

Calculate Your Potential ROI

Estimate the potential annual savings and reclaimed human hours by implementing advanced thermal- and power-aware LLM scheduling in your enterprise cloud environment.


Your Implementation Roadmap

Deploying TAPAS for LLM inference involves a strategic, phased approach to integrate thermal and power awareness into your existing cloud infrastructure. Our experts guide you through each step.

Phase 1: Discovery & Profiling

Initial assessment of current GPU cluster setup, including datacenter layout, inlet temperatures, GPU temperatures, fan airflow, and server power loads. Offline profiling of LLM inference configurations to establish baseline thermal and power profiles.
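
A minimal sketch of what the Phase 1 offline profiling pass could look like is shown below; the knob grid, metric names, and run_benchmark stub are assumptions for illustration, to be replaced by your actual inference and telemetry stack.

```python
import csv
import itertools

# Candidate knob values to sweep during offline profiling (illustrative grid).
FREQUENCIES_MHZ = [1000, 1500, 2000]
BATCH_SIZES = [16, 32, 64]
PARALLELISM = ["TP2", "TP4", "TP8"]

def run_benchmark(freq_mhz: int, batch: int, tp: str) -> dict:
    """Placeholder for the real benchmark: launch the inference server with
    the given knobs and collect telemetry. Returns dummy values here."""
    return {"peak_gpu_temp_c": 0.0, "avg_power_w": 0.0, "p99_latency_ms": 0.0}

def profile_configurations(out_path: str = "thermal_power_profiles.csv") -> None:
    """Sweep the knob grid and persist per-configuration thermal/power/latency
    baselines for later use by the scheduler."""
    fields = ["freq_mhz", "batch", "tp", "peak_gpu_temp_c", "avg_power_w", "p99_latency_ms"]
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for freq, batch, tp in itertools.product(FREQUENCIES_MHZ, BATCH_SIZES, PARALLELISM):
            metrics = run_benchmark(freq, batch, tp)
            writer.writerow({"freq_mhz": freq, "batch": batch, "tp": tp, **metrics})
```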

Phase 2: Integration & Model Refinement

Integration of TAPAS's VM Allocator with existing cloud orchestration. Deployment of the Load Balancer for SaaS endpoints. Weekly refinement of thermal/power models using live datacenter data to ensure accuracy and adaptability.
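
As a sketch of the weekly model-refinement step, the snippet below refits a simple linear thermal model from a week of telemetry using least squares. The model form and variable names are assumptions for illustration; the analysis does not specify the exact models TAPAS maintains.

```python
import numpy as np

def refit_thermal_model(power_w: np.ndarray, inlet_temp_c: np.ndarray,
                        gpu_temp_c: np.ndarray) -> np.ndarray:
    """Fit gpu_temp ≈ a*power + b*inlet_temp + c from recent telemetry,
    a stand-in for the framework's periodic thermal/power model refinement."""
    X = np.column_stack([power_w, inlet_temp_c, np.ones_like(power_w)])
    coeffs, *_ = np.linalg.lstsq(X, gpu_temp_c, rcond=None)
    return coeffs  # [a, b, c]

# Example with one week of synthetic hourly samples:
rng = np.random.default_rng(0)
power = rng.uniform(200, 700, 168)
inlet = rng.uniform(18, 27, 168)
gpu = 0.05 * power + 1.2 * inlet + 5 + rng.normal(0, 0.5, 168)
a, b, c = refit_thermal_model(power, inlet, gpu)
```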

Phase 3: Dynamic Optimization Deployment

Activation of dynamic LLM instance reconfiguration for SaaS workloads. Implementation of thermal- and power-aware request routing. Continuous monitoring and recalibration to handle load spikes, oversubscription, and potential failures.

Phase 4: Scaling & Continuous Improvement

Expand TAPAS application across more rows and datacenters. Leverage insights for more precise cooling and power provisioning. Implement live migration capabilities for GPU VMs (when supported) to further enhance flexibility and performance.

Ready to Transform Your LLM Infrastructure?

Unlock unparalleled efficiency and cost savings with thermal- and power-aware LLM scheduling. Our experts are ready to design a tailored solution for your enterprise.

Ready to Get Started?

Book Your Free Consultation.

Let's Discuss Your AI Strategy!
