Enterprise AI Analysis
OASIS: Optimal Allocation Strategy for Inference Services in Cloud Environments
Authors: Viyom Mittal, Mohammed Baydoun, Alok Mishra, Pavana Prakash, Gourav Rattihalli, Aditya Dhakal, Eitan Frachtenberg, Izzat El Hajj, Michalis Faloutsos, Dejan Milojicic
Cloud providers face significant challenges in efficiently provisioning large language models (LLMs) as inference services, because they must balance Service Level Objectives (SLOs) against capital and operational costs. This paper introduces OASIS, a methodology that combines static (pre-deployment) and dynamic (run-time) optimizations to provision inference services efficiently, improving both resource utilization and energy efficiency.
Executive Impact
OASIS delivers significant operational efficiencies and cost savings for cloud providers deploying LLM inference services.
Deep Analysis & Enterprise Applications
OASIS Two-Phase Methodology
OASIS implements a two-phase approach to optimize LLM inference service provisioning:
Enterprise Process Flow
Phase 1: Pre-Deployment Profiling systematically profiles the model's performance across hardware configurations, parameter settings (batch size, GPU frequency, inference variants), and query types. This phase produces a hardware cost table that maps each configuration to its optimal parameters, maximum serviceable request rate (MSRR), and energy consumption.
Phase 2: Run-time Query Routing and Provisioning dynamically adjusts resource allocation based on observed workload patterns. It classifies incoming queries, routes them to optimally configured instances, and provisions/deallocates resources adaptively. This phase integrates both system-level (MIG partitioning) and software-level (continuous batching) optimizations to maximize efficiency and minimize idle power consumption.
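To make the two phases concrete, here is a minimal Python sketch of the flow: an offline grid search populates a hardware cost table, and a run-time selector picks the most energy-efficient configuration whose MSRR covers observed demand. The synthetic measurements, configuration names, and selection rule are illustrative assumptions, not the paper's implementation.

```python
"""Minimal sketch of OASIS's two-phase flow (illustrative, not the paper's code)."""
from itertools import product

# Phase 1: pre-deployment profiling. A real system would launch a benchmark
# per configuration and measure MSRR and energy; we fake the measurement so
# the sketch runs end to end.
def profile_config(gpu, freq_mhz, batch_size, query_type):
    msrr = batch_size * freq_mhz / 400.0   # req/s (synthetic placeholder)
    energy = 50.0 + 0.12 * freq_mhz        # mJ/token (synthetic placeholder)
    return msrr, energy

def build_cost_table(gpus, freqs, batch_sizes, query_types):
    table = {}
    for cfg in product(gpus, freqs, batch_sizes, query_types):
        msrr, energy = profile_config(*cfg)
        table[cfg] = {"msrr": msrr, "energy_mj_per_tok": energy}
    return table

# Phase 2: at run time, choose the most energy-efficient configuration whose
# MSRR still covers the observed request rate for this query type.
def select_config(table, query_type, observed_rate):
    feasible = {cfg: m for cfg, m in table.items()
                if cfg[3] == query_type and m["msrr"] >= observed_rate}
    return min(feasible.items(),
               key=lambda item: item[1]["energy_mj_per_tok"],
               default=None)

table = build_cost_table(["A100"], [855, 1035, 1320], [8, 16, 32],
                         ["SISO", "SILO", "LISO", "LILO"])
print(select_config(table, "LISO", observed_rate=40.0))
```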
Query Type Classification
OASIS classifies user queries into four distinct categories based on input and output token counts, using a threshold (τ = 500 tokens) to differentiate "small" from "large" inputs/outputs. Each type has unique resource requirements and optimal serving configurations:
Query Categories & Resource Implications
SISO (Small Input, Small Output): |Tin| ≤ τ, |Tout| ≤ τ. Examples: short Q&A, simple paraphrasing. Characterized by minimal compute and memory requirements.
SILO (Small Input, Large Output): |Tin| ≤ τ, |Tout| > τ. Examples: story generation, essay writing. Compute-intensive during autoregressive generation phase.
LISO (Large Input, Small Output): |Tin| > τ, |Tout| ≤ τ. Examples: document summarization, classification. Memory-intensive during prefill, requiring substantial KV cache.
LILO (Large Input, Large Output): |Tin| > τ, |Tout| > τ. Examples: translation, long-form rewriting. Resource-intensive throughout both prefill and generation phases.
Understanding these distinctions is crucial for allocating resources efficiently and preventing over-provisioning or SLO violations.
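The classifier itself is only a few lines; the sketch below assumes both token counts are known, whereas a real router must estimate the output length before generation completes.

```python
TAU = 500  # token threshold from the paper

def classify_query(n_input_tokens: int, n_output_tokens: int) -> str:
    """Map a query to SISO/SILO/LISO/LILO using the tau = 500 cutoff."""
    size_in = "S" if n_input_tokens <= TAU else "L"
    size_out = "S" if n_output_tokens <= TAU else "L"
    return f"{size_in}I{size_out}O"

assert classify_query(120, 80) == "SISO"     # short Q&A
assert classify_query(90, 1400) == "SILO"    # story generation
assert classify_query(3000, 150) == "LISO"   # summarization
assert classify_query(2500, 2500) == "LILO"  # translation
```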
Key Optimization Results
Decreasing GPU frequency from 1320 MHz to 855 MHz reduces average power consumption from 239.4W to 145.2W, with only a 7% throughput loss. This improves energy efficiency by 34% (261.7 to 171.5 mJ/tok) while maintaining acceptable latency for a 20s SLO.
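One standard way to apply such a frequency cap is NVML's locked-clocks API (equivalently, nvidia-smi -lgc); whether OASIS uses exactly this mechanism is an assumption. The sketch below caps the clock at 855 MHz and defines the mJ/token metric used above; it requires administrator privileges and a recent driver.

```python
# Hedged sketch: capping GPU clocks with NVML to realize the 855 MHz setting.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
pynvml.nvmlDeviceSetGpuLockedClocks(handle, 855, 855)  # min, max in MHz

power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # NVML reports mW
print(f"current draw: {power_w:.1f} W")

# Energy efficiency as used above: mJ/token = W / (tok/s) * 1000.
def mj_per_token(power_w: float, throughput_tok_s: float) -> float:
    return power_w / throughput_tok_s * 1000.0

pynvml.nvmlShutdown()
```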
Query type classification is critical: LISO workloads (e.g., document summarization) achieve 1.8-2.5x higher concurrency and 2.1x higher throughput than SILO workloads (e.g., story generation) across both Llama3-8B and Qwen-7B, validating the need for query-type-specific resource allocation.
MIG-Based Multi-Tenancy Analysis
OASIS demonstrates that MIG-based multi-tenancy can significantly reduce power consumption when running multiple models simultaneously, especially for workloads with moderate demand.
Running Llama3-8B and Qwen-7B simultaneously with MIG saves 197W compared to using separate full GPUs, while still meeting SLO requirements. This translates to substantial daily energy savings.
| Metric | Separate Full GPUs (LISO) | MIG Configuration (LISO) |
|---|---|---|
| Total Power (W) | 467W (235W Llama3-8B + 232W Qwen-7B) | 270W |
| P95 Latency (s) | 6.85s (Llama3-8B), 7.44s (Qwen-7B) | 13.87s (Llama3-8B), 19.96s (Qwen-7B) |
| Throughput (req/s) | 7.59 (Llama3-8B), 6.80 (Qwen-7B) | 3.69 (Llama3-8B), 3.17 (Qwen-7B) |
| SLO Compliance | ✓ Met for all models | ✓ Met for all models |
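For reference, partitioning a GPU into MIG slices is done through nvidia-smi; the sketch below carves two instances so the two models can share one A100. The 3g.20gb profile (ID 9) is an illustrative assumption; profile IDs and sizes vary by GPU model, and the paper's exact MIG geometry may differ.

```python
# Hedged sketch: enable MIG mode and create two slices on GPU 0.
import subprocess

def run(cmd):
    print("+", cmd)
    subprocess.run(cmd.split(), check=True)

run("nvidia-smi -i 0 -mig 1")           # enable MIG mode (may require a GPU reset)
run("nvidia-smi mig -i 0 -cgi 9,9 -C")  # two 3g.20gb GPU + compute instances
run("nvidia-smi -L")                    # list the resulting MIG devices
```

Each model server is then pinned to one slice by setting CUDA_VISIBLE_DEVICES to the corresponding MIG device UUID reported by nvidia-smi -L.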
Hardware Platform Comparison (Yi-6B Model)
Choosing the right hardware is crucial for balancing performance and energy efficiency. OASIS shows distinct trade-offs between NVIDIA A100 and H100 GPUs:
| Metric | A100 (1035 MHz) | H100 (1530 MHz) |
|---|---|---|
| Avg. Power (W) | 140.9W | 265.9W |
| Throughput (tok/s) | 1511.3 | 1698.5 |
| P99 Latency (ms) | 1208.9 | 86.3 |
| Key Finding | Most energy-efficient configuration across both platforms, suitable for energy-conscious deployments. | 12% higher throughput than A100, but consumes 89% more power. Justified for throughput-critical workloads. |
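Dividing power by throughput makes this trade-off explicit as energy per token, the same metric used elsewhere in the analysis; the values below are derived directly from the table:

```python
# Energy per token implied by the table above: J/tok = W / (tok/s).
for name, power_w, tput in [("A100 @ 1035 MHz", 140.9, 1511.3),
                            ("H100 @ 1530 MHz", 265.9, 1698.5)]:
    mj_per_tok = power_w / tput * 1000.0
    print(f"{name}: {mj_per_tok:.1f} mJ/token")
# A100 @ 1035 MHz: 93.2 mJ/token
# H100 @ 1530 MHz: 156.5 mJ/token (~68% more energy per token)
```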
Runtime Adaptation
OASIS's runtime adaptation, evaluated on real-world BurstGPT campus traces, demonstrates significant energy savings while maintaining performance.
During a 4-hour period with bursty campus workloads, MIG configuration achieved a 12.4% power reduction (164.9W vs 188.2W) compared to full GPU, while maintaining identical throughput and meeting all SLO requirements (P99 latency 11.5s vs 20s threshold). This highlights MIG's ability to prevent resource waste for moderate user counts.
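A minimal version of such a reactive loop follows; the reconfigure() hook and the headroom threshold are illustrative assumptions, not the paper's controller.

```python
# Hedged sketch of run-time adaptation: fall back to a MIG slice when
# observed demand fits within its MSRR, reclaim the full GPU otherwise.
import time

MIG_MSRR = 3.69   # req/s sustainable on the MIG slice (from the table above)
HEADROOM = 0.8    # keep 20% slack to absorb bursts before reconfiguring

def reconfigure(mode: str):
    print(f"switching to {mode}")  # in practice: drain, repartition, restart

def adaptation_loop(observe_request_rate, interval_s=60):
    mode = "full-gpu"
    while True:
        rate = observe_request_rate()
        want = "mig" if rate <= MIG_MSRR * HEADROOM else "full-gpu"
        if want != mode:
            reconfigure(want)
            mode = want
        time.sleep(interval_s)
```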
Economic and Environmental Impact
The energy efficiencies achieved by OASIS translate directly into tangible economic and environmental benefits, making LLM serving more sustainable for cloud providers:
Based on a campus serving 3,000 students, the 12.4% energy reduction from runtime adaptation saves $10.67/day in electricity costs.
The same energy reduction also translates to avoiding 34.8 kg of CO2 emissions per day. These cumulative benefits over academic terms make OASIS a practical solution for sustainable LLM serving.
This demonstrates OASIS's capability to deliver substantial economic and environmental value by intelligently managing GPU resources.
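Adapting these figures to another deployment is a straightforward conversion; the electricity price and grid carbon intensity below are assumed placeholders, not values stated in the analysis.

```python
# Back-of-the-envelope economics for a given power saving.
PRICE_USD_PER_KWH = 0.12  # assumed utility rate
KG_CO2_PER_KWH = 0.39     # assumed grid carbon intensity

def daily_impact(watts_saved: float, hours: float = 24.0):
    kwh = watts_saved * hours / 1000.0
    return kwh * PRICE_USD_PER_KWH, kwh * KG_CO2_PER_KWH

# One GPU saving 23.3 W (188.2 W - 164.9 W from the runtime experiment):
usd, co2 = daily_impact(23.3)
print(f"per GPU: ${usd:.2f}/day, {co2:.2f} kg CO2/day")
# The campus-scale figures above ($10.67/day, 34.8 kg CO2/day) therefore
# imply a fleet of GPUs serving the 3,000 students, not a single device.
```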
Your AI Implementation Roadmap
Our phased approach ensures a smooth transition and optimized performance for your enterprise AI initiatives.
Phase 1: Multi-Parameter Joint Optimization
Beyond single-parameter tuning, our next step is to explore joint configurations (e.g., frequency + batch size + scheduling policy) to identify global optima. This will enable even more nuanced resource allocation and energy savings.
Phase 2: Workload Prediction & Proactive Adaptation
Moving beyond reactive adaptation, we will integrate time-series forecasting (LSTM, ARIMA) to predict traffic patterns. This allows for proactive resource provisioning, reducing transition overheads and further improving energy efficiency.
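As a flavor of what this could look like, here is a minimal ARIMA forecasting sketch using statsmodels; the model order, bucket size, and synthetic trace are placeholders.

```python
# Hedged sketch: forecast the next hour of request rates, then provision
# ahead of the predicted peak instead of reacting to it.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
t = np.arange(288)  # 24h of 5-minute buckets (synthetic diurnal trace)
rate = 5 + 3 * np.sin(2 * np.pi * t / 288) + rng.normal(0, 0.3, t.size)

model = ARIMA(rate, order=(2, 1, 2)).fit()
forecast = model.forecast(steps=12)  # next hour of predicted req/s

peak = float(forecast.max())
print(f"predicted peak over next hour: {peak:.2f} req/s")
```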
Phase 3: Cross-GPU Architecture Support
As new GPU architectures emerge, we will develop automated profiling pipelines that generalize across different GPU vendors and generations, enhancing OASIS's applicability and ensuring future-proof optimizations.
Ready to Optimize Your AI Infrastructure?
Discover how OASIS can revolutionize your LLM inference services, reduce costs, and achieve sustainability goals.