Enterprise AI Analysis
OASIS: Optimal Allocation Strategy for Inference Services in Cloud Environments
Authors: Viyom Mittal, Mohammed Baydoun, Alok Mishra, Pavana Prakash, Gourav Rattihalli, Aditya Dhakal, Eitan Frachtenberg, Izzat El Hajj, Michalis Faloutsos, Dejan Milojicic
Cloud providers face significant challenges in efficiently provisioning large language models (LLMs) as inference services, because they must balance Service Level Objectives (SLOs) against capital and operational costs. This paper introduces OASIS, a methodology that combines static (pre-deployment) and dynamic (run-time) optimizations to provision inference services efficiently, improving both resource utilization and energy efficiency.
Executive Impact
OASIS delivers significant operational efficiencies and cost savings for cloud providers deploying LLM inference services.
Deep Analysis & Enterprise Applications
OASIS Two-Phase Methodology
OASIS implements a two-phase approach to optimize LLM inference service provisioning:
Enterprise Process Flow
Phase 1: Pre-Deployment Profiling systematically profiles the model's performance across hardware configurations, parameter settings (batch size, GPU frequency, inference variants), and query types. This phase produces a hardware cost table that maps each configuration to its optimal parameters, maximum serviceable request rate (MSRR), and energy consumption.
Phase 2: Run-time Query Routing and Provisioning dynamically adjusts resource allocation based on observed workload patterns. It classifies incoming queries, routes them to optimally configured instances, and provisions/deallocates resources adaptively. This phase integrates both system-level (MIG partitioning) and software-level (continuous batching) optimizations to maximize efficiency and minimize idle power consumption.
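To make the two phases concrete, here is a minimal Python sketch of the flow: an offline grid search populates a hardware cost table, and a run-time selector picks the most energy-efficient configuration whose MSRR covers observed demand. The synthetic measurements, configuration names, and selection rule are illustrative assumptions, not the paper's implementation.

```python
"""Minimal sketch of OASIS's two-phase flow (illustrative, not the paper's code)."""
from itertools import product

# Phase 1: pre-deployment profiling. A real system would launch a benchmark
# per configuration and measure MSRR and energy; we fake the measurement so
# the sketch runs end to end.
def profile_config(gpu, freq_mhz, batch_size, query_type):
    msrr = batch_size * freq_mhz / 400.0   # req/s (synthetic placeholder)
    energy = 50.0 + 0.12 * freq_mhz        # mJ/token (synthetic placeholder)
    return msrr, energy

def build_cost_table(gpus, freqs, batch_sizes, query_types):
    table = {}
    for cfg in product(gpus, freqs, batch_sizes, query_types):
        msrr, energy = profile_config(*cfg)
        table[cfg] = {"msrr": msrr, "energy_mj_per_tok": energy}
    return table

# Phase 2: at run time, choose the most energy-efficient configuration whose
# MSRR still covers the observed request rate for this query type.
def select_config(table, query_type, observed_rate):
    feasible = {cfg: m for cfg, m in table.items()
                if cfg[3] == query_type and m["msrr"] >= observed_rate}
    return min(feasible.items(),
               key=lambda item: item[1]["energy_mj_per_tok"],
               default=None)

table = build_cost_table(["A100"], [855, 1035, 1320], [8, 16, 32],
                         ["SISO", "SILO", "LISO", "LILO"])
print(select_config(table, "LISO", observed_rate=40.0))
```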
Query Type Classification
OASIS classifies user queries into four distinct categories based on input and output token counts, using a threshold (τ = 500 tokens) to differentiate "small" from "large" inputs/outputs. Each type has unique resource requirements and optimal serving configurations:
Query Categories & Resource Implications
SISO (Small Input, Small Output): |Tin| ≤ τ, |Tout| ≤ τ. Examples: short Q&A, simple paraphrasing. Characterized by minimal compute and memory requirements.
SILO (Small Input, Large Output): |Tin| ≤ τ, |Tout| > τ. Examples: story generation, essay writing. Compute-intensive during autoregressive generation phase.
LISO (Large Input, Small Output): |Tin| > τ, |Tout| ≤ τ. Examples: document summarization, classification. Memory-intensive during prefill, requiring substantial KV cache.
LILO (Large Input, Large Output): |Tin| > τ, |Tout| > τ. Examples: translation, long-form rewriting. Resource-intensive throughout both prefill and generation phases.
Understanding these distinctions is crucial for allocating resources efficiently and preventing over-provisioning or SLO violations.
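The classifier itself is only a few lines; the sketch below assumes both token counts are known, whereas a real router must estimate the output length before generation completes.

```python
TAU = 500  # token threshold from the paper

def classify_query(n_input_tokens: int, n_output_tokens: int) -> str:
    """Map a query to SISO/SILO/LISO/LILO using the tau = 500 cutoff."""
    size_in = "S" if n_input_tokens <= TAU else "L"
    size_out = "S" if n_output_tokens <= TAU else "L"
    return f"{size_in}I{size_out}O"

assert classify_query(120, 80) == "SISO"     # short Q&A
assert classify_query(90, 1400) == "SILO"    # story generation
assert classify_query(3000, 150) == "LISO"   # summarization
assert classify_query(2500, 2500) == "LILO"  # translation
```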
Key Optimization Results
Decreasing GPU frequency from 1320 MHz to 855 MHz reduces average power consumption from 239.4W to 145.2W, with only a 7% throughput loss. This improves energy efficiency by 34% (261.7 to 171.5 mJ/tok) while maintaining acceptable latency for a 20s SLO.
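One standard way to apply such a frequency cap is NVML's locked-clocks API (equivalently, nvidia-smi -lgc); whether OASIS uses exactly this mechanism is an assumption. The sketch below caps the clock at 855 MHz and defines the mJ/token metric used above; it requires administrator privileges and a recent driver.

```python
# Hedged sketch: capping GPU clocks with NVML to realize the 855 MHz setting.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
pynvml.nvmlDeviceSetGpuLockedClocks(handle, 855, 855)  # min, max in MHz

power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # NVML reports mW
print(f"current draw: {power_w:.1f} W")

# Energy efficiency as used above: mJ/token = W / (tok/s) * 1000.
def mj_per_token(power_w: float, throughput_tok_s: float) -> float:
    return power_w / throughput_tok_s * 1000.0

pynvml.nvmlShutdown()
```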
Query type classification is critical: LISO workloads (e.g., document summarization) achieve 1.8-2.5x higher concurrency and 2.1x higher throughput than SILO workloads (e.g., story generation) across both Llama3-8B and Qwen-7B, validating the need for query-type-specific resource allocation.
MIG-Based Multi-Tenancy Analysis
OASIS demonstrates that MIG-based multi-tenancy can significantly reduce power consumption when running multiple models simultaneously, especially for workloads with moderate demand.
Running Llama3-8B and Qwen-7B simultaneously with MIG saves 197W compared to using separate full GPUs, while still meeting SLO requirements. This translates to substantial daily energy savings.
| Metric | Separate Full GPUs (LISO) | MIG Configuration (LISO) |
|---|---|---|
| Total Power (W) | 467W (235W Llama3-8B + 232W Qwen-7B) | 270W |
| P95 Latency (s) | 6.85s (Llama3-8B), 7.44s (Qwen-7B) | 13.87s (Llama3-8B), 19.96s (Qwen-7B) |
| Throughput (req/s) | 7.59 (Llama3-8B), 6.80 (Qwen-7B) | 3.69 (Llama3-8B), 3.17 (Qwen-7B) |
| SLO Compliance | ✓ Met for all models | ✓ Met for all models |
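For reference, partitioning a GPU into MIG slices is done through nvidia-smi; the sketch below carves two instances so the two models can share one A100. The 3g.20gb profile (ID 9) is an illustrative assumption; profile IDs and sizes vary by GPU model, and the paper's exact MIG geometry may differ.

```python
# Hedged sketch: enable MIG mode and create two slices on GPU 0.
import subprocess

def run(cmd):
    print("+", cmd)
    subprocess.run(cmd.split(), check=True)

run("nvidia-smi -i 0 -mig 1")           # enable MIG mode (may require a GPU reset)
run("nvidia-smi mig -i 0 -cgi 9,9 -C")  # two 3g.20gb GPU + compute instances
run("nvidia-smi -L")                    # list the resulting MIG devices
```

Each model server is then pinned to one slice by setting CUDA_VISIBLE_DEVICES to the corresponding MIG device UUID reported by nvidia-smi -L.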
Hardware Platform Comparison (Yi-6B Model)
Choosing the right hardware is crucial for balancing performance and energy efficiency. OASIS shows distinct trade-offs between NVIDIA A100 and H100 GPUs:
| Metric | A100 (1035 MHz) | H100 (1530 MHz) |
|---|---|---|
| Avg. Power (W) | 140.9W | 265.9W |
| Throughput (tok/s) | 1511.3 | 1698.5 |
| P99 Latency (ms) | 1208.9 | 86.3 |
| Key Finding | Most energy-efficient configuration across both platforms, suitable for energy-conscious deployments. | 12% higher throughput than A100, but consumes 89% more power. Justified for throughput-critical workloads. |
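Dividing power by throughput makes this trade-off explicit as energy per token, the same metric used elsewhere in the analysis; the values below are derived directly from the table:

```python
# Energy per token implied by the table above: J/tok = W / (tok/s).
for name, power_w, tput in [("A100 @ 1035 MHz", 140.9, 1511.3),
                            ("H100 @ 1530 MHz", 265.9, 1698.5)]:
    mj_per_tok = power_w / tput * 1000.0
    print(f"{name}: {mj_per_tok:.1f} mJ/token")
# A100 @ 1035 MHz: 93.2 mJ/token
# H100 @ 1530 MHz: 156.5 mJ/token (~68% more energy per token)
```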
Runtime Adaptation
OASIS's runtime adaptation, evaluated on real-world BurstGPT campus traces, demonstrates significant energy savings while maintaining performance.
During a 4-hour period with bursty campus workloads, MIG configuration achieved a 12.4% power reduction (164.9W vs 188.2W) compared to full GPU, while maintaining identical throughput and meeting all SLO requirements (P99 latency 11.5s vs 20s threshold). This highlights MIG's ability to prevent resource waste for moderate user counts.
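A minimal version of such a reactive loop follows; the reconfigure() hook and the headroom threshold are illustrative assumptions, not the paper's controller.

```python
# Hedged sketch of run-time adaptation: fall back to a MIG slice when
# observed demand fits within its MSRR, reclaim the full GPU otherwise.
import time

MIG_MSRR = 3.69   # req/s sustainable on the MIG slice (from the table above)
HEADROOM = 0.8    # keep 20% slack to absorb bursts before reconfiguring

def reconfigure(mode: str):
    print(f"switching to {mode}")  # in practice: drain, repartition, restart

def adaptation_loop(observe_request_rate, interval_s=60):
    mode = "full-gpu"
    while True:
        rate = observe_request_rate()
        want = "mig" if rate <= MIG_MSRR * HEADROOM else "full-gpu"
        if want != mode:
            reconfigure(want)
            mode = want
        time.sleep(interval_s)
```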
Economic and Environmental Impact
The energy efficiencies achieved by OASIS translate directly into tangible economic and environmental benefits, making LLM serving more sustainable for cloud providers:
Based on a campus serving 3,000 students, the 12.4% energy reduction from runtime adaptation saves $10.67/day in electricity costs.
The same energy reduction also translates to avoiding 34.8 kg of CO2 emissions per day. These cumulative benefits over academic terms make OASIS a practical solution for sustainable LLM serving.
This demonstrates OASIS's capability to deliver substantial economic and environmental value by intelligently managing GPU resources.
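Adapting these figures to another deployment is a straightforward conversion; the electricity price and grid carbon intensity below are assumed placeholders, not values stated in the analysis.

```python
# Back-of-the-envelope economics for a given power saving.
PRICE_USD_PER_KWH = 0.12  # assumed utility rate
KG_CO2_PER_KWH = 0.39     # assumed grid carbon intensity

def daily_impact(watts_saved: float, hours: float = 24.0):
    kwh = watts_saved * hours / 1000.0
    return kwh * PRICE_USD_PER_KWH, kwh * KG_CO2_PER_KWH

# One GPU saving 23.3 W (188.2 W - 164.9 W from the runtime experiment):
usd, co2 = daily_impact(23.3)
print(f"per GPU: ${usd:.2f}/day, {co2:.2f} kg CO2/day")
# The campus-scale figures above ($10.67/day, 34.8 kg CO2/day) therefore
# imply a fleet of GPUs serving the 3,000 students, not a single device.
```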
Your AI Implementation Roadmap
Our phased approach ensures a smooth transition and optimized performance for your enterprise AI initiatives.
Phase 1: Multi-Parameter Joint Optimization
Beyond single-parameter tuning, our next step is to explore joint configurations (e.g., frequency + batch size + scheduling policy) to identify global optima. This will enable even more nuanced resource allocation and energy savings.
Phase 2: Workload Prediction & Proactive Adaptation
Moving beyond reactive adaptation, we will integrate time-series forecasting (LSTM, ARIMA) to predict traffic patterns. This allows for proactive resource provisioning, reducing transition overheads and further improving energy efficiency.
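As a flavor of what this could look like, here is a minimal ARIMA forecasting sketch using statsmodels; the model order, bucket size, and synthetic trace are placeholders.

```python
# Hedged sketch: forecast the next hour of request rates, then provision
# ahead of the predicted peak instead of reacting to it.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
t = np.arange(288)  # 24h of 5-minute buckets (synthetic diurnal trace)
rate = 5 + 3 * np.sin(2 * np.pi * t / 288) + rng.normal(0, 0.3, t.size)

model = ARIMA(rate, order=(2, 1, 2)).fit()
forecast = model.forecast(steps=12)  # next hour of predicted req/s

peak = float(forecast.max())
print(f"predicted peak over next hour: {peak:.2f} req/s")
```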
Phase 3: Cross-GPU Architecture Support
As new GPU architectures emerge, we will develop automated profiling pipelines that generalize across different GPU vendors and generations, enhancing OASIS's applicability and ensuring future-proof optimizations.
Ready to Optimize Your AI Infrastructure?
Discover how OASIS can revolutionize your LLM inference services, reduce costs, and achieve sustainability goals.