
Enterprise AI Analysis

AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving

This paper introduces AMMA, a multi-chiplet, memory-centric architecture designed to significantly reduce latency and energy consumption for long-context LLM attention serving. By replacing traditional GPU compute dies with HBM-PNM cubes, AMMA doubles the aggregated HBM bandwidth available to attention. It features a novel logic-die microarchitecture, a two-level hybrid parallelism scheme, and a reordered collective communication flow that together maximize bandwidth utilization and minimize inter-chiplet overhead. Evaluation shows 15.5x lower attention latency and 6.9x lower energy compared to an NVIDIA H100 baseline.

Executive Impact & Value Proposition

AMMA addresses the core mismatch between compute-rich GPUs and memory-bound LLM decode attention, offering a path to unprecedented efficiency and scalability for reasoning and agentic AI workloads with millions of tokens. This architecture unlocks significant cost savings in datacenter operations by drastically reducing power consumption and increasing serving capacity for long contexts.

15.5x Lower Attention Latency (vs. H100)
6.9x Lower Energy Consumption (vs. H100)
11.9x Memory Bandwidth Increase (vs. H100)

Deep Analysis & Enterprise Applications


AMMA replaces GPU compute dies with HBM-PNM cubes, creating a standalone, memory-centric accelerator. This fundamentally shifts the paradigm from GPU-centric to memory-centric serving, leveraging advanced logic dies (≤5nm) for sophisticated PNM integration. The architecture connects 16 HBM-PNM cubes in a 4x4 2D mesh communicating via high-speed die-to-die (D2D) links, forming a single chip dedicated to long-context attention.

40+ TB/s Aggregate HBM Bandwidth in AMMA, critical for memory-bound attention workloads.
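To make the topology concrete, below is a minimal Python sketch of the 4x4 mesh described above. The cube numbering, `neighbors` helper, and Manhattan hop metric are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a 4x4 2D mesh of HBM-PNM cubes (numbering is assumed).
MESH_DIM = 4  # 4x4 mesh -> 16 cubes, matching AMMA's configuration

def neighbors(cube_id: int) -> list[int]:
    """Cubes reachable over a single D2D hop in the 2D mesh."""
    row, col = divmod(cube_id, MESH_DIM)
    candidates = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    return [r * MESH_DIM + c for r, c in candidates
            if 0 <= r < MESH_DIM and 0 <= c < MESH_DIM]

def hop_distance(a: int, b: int) -> int:
    """Manhattan hop count between two cubes; D2D cost grows with hops,
    which is why AMMA confines traffic to local neighborhoods."""
    ra, ca = divmod(a, MESH_DIM)
    rb, cb = divmod(b, MESH_DIM)
    return abs(ra - rb) + abs(ca - cb)

print(neighbors(5))         # interior cube: four neighbors -> [1, 9, 4, 6]
print(hop_distance(0, 15))  # opposite corners of the mesh -> 6 hops
```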

Enterprise Process Flow

Current GPU-centric Serving → GPU Compute Die Inefficiency → AMMA: HBM-PNM Cube Integration → Memory-Centric Acceleration

Conventional GPU vs. AMMA (Memory-Centric)

Core Design
  • Conventional GPU: compute-rich, with the GPU as the central hub
  • AMMA: memory-centric, with the HBM-PNM cube as the central hub
Target Workload
  • Conventional GPU: compute-bound operations with emphasis on data reuse
  • AMMA: memory-bound decode attention with minimal data reuse
Memory Bandwidth
  • Conventional GPU: limited HBM BW reaches the compute die, with an LLC to bridge the gap
  • AMMA: doubled aggregated HBM BW, no LLC (20% power/area saving)
Compute Units
  • Conventional GPU: few large SAs (128x128), leaving compute largely idle during attention
  • AMMA: many small SAs (16x16), sized for narrow M to keep PEs fully occupied (see the sketch below)
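The compute-unit comparison can be made concrete with a back-of-the-envelope occupancy model. The sketch below assumes a square systolic array is filled along its row dimension only up to the matmul's M dimension, and uses M = 16 (on the order of the query heads sharing one KV head under GQA) as an illustrative value; real systolic-array pipelines are more nuanced.

```python
# Rough PE-row occupancy for an M x K x N matmul tile on a square systolic
# array (simplified: rows beyond M are assumed idle; M = 16 is illustrative).

def sa_occupancy(m: int, sa_dim: int) -> float:
    """Fraction of PE rows doing useful work for a matmul with M rows."""
    return min(m, sa_dim) / sa_dim

M = 16  # decode attention issues matmuls with a very narrow M dimension
print(f"128x128 SA: {sa_occupancy(M, 128):.1%} of PE rows busy")  # 12.5%
print(f"16x16   SA: {sa_occupancy(M, 16):.1%} of PE rows busy")   # 100.0%
```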

AMMA employs a two-level hybrid parallelism scheme and a reordered collective communication flow to minimize inter-cube data movement and overhead. Unlike naive tensor parallelism (TP16), which spreads communication across the entire mesh, AMMA confines data movement to local neighborhoods, making long-context inference practical: KV heads are mapped to cube groups, and within each group the KV cache is split along the sequence dimension.
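A minimal sketch of this two-level mapping follows, assuming 16 cubes in four groups of four and eight KV heads; the paper's exact group sizes and head counts may differ.

```python
# Level 1: map each KV head to a cube group.
# Level 2: shard that head's KV cache by sequence position within the group.
NUM_CUBES = 16
GROUP_SIZE = 4                        # cubes per group (assumed)
NUM_GROUPS = NUM_CUBES // GROUP_SIZE  # = 4

def assign(kv_head: int, num_kv_heads: int, seq_len: int) -> dict[int, tuple[int, int]]:
    """Return {cube_id: (start_token, end_token)} for one KV head's cache."""
    group = kv_head * NUM_GROUPS // num_kv_heads
    cubes = range(group * GROUP_SIZE, (group + 1) * GROUP_SIZE)
    shard = seq_len // GROUP_SIZE     # tokens per cube (remainder ignored here)
    return {cube: (i * shard, (i + 1) * shard) for i, cube in enumerate(cubes)}

# A 1M-token context for KV head 0 lands entirely in group 0 (cubes 0-3),
# so its attention reduction only crosses local D2D links.
print(assign(kv_head=0, num_kv_heads=8, seq_len=1_000_000))
```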

65.4x Communication Speedup at 1M Tokens (HP_RO vs. TP16), demonstrating the efficacy of reordered collective communication.

Impact of Reordered Collective Operations

The reordered collective communication flow in AMMA significantly reduces overhead by replacing the intra-group AllReduce with a ReduceScatter and the cross-group AllReduce with a point-to-point Reduce. This eliminates two AllGather operations and downgrades one AllReduce, cutting traffic volume nearly in half, a saving that holds regardless of sequence length. The optimization is crucial for low-latency long-context attention serving, where communication easily becomes the bottleneck.

Key Finding: The reordered flow achieves a 65.4x communication speedup at 1M tokens compared to a naive TP16 approach, contributing significantly to AMMA's overall latency reduction.
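An illustrative per-cube traffic model makes the "nearly half" claim concrete. The sketch below uses standard ring-collective costs, assumes four cubes per group, and models the cross-group point-to-point Reduce as a single shard transfer; it is a cost sketch under these assumptions, not the paper's analytical model.

```python
# Per-cube traffic (in units of the payload N) for a flat TP16 AllReduce
# versus the reordered two-level flow (intra-group ReduceScatter followed
# by a cross-group point-to-point Reduce on the scattered shard).

def allreduce(n: float, p: int) -> float:
    return 2 * (p - 1) / p * n   # ring AllReduce = ReduceScatter + AllGather

def reduce_scatter(n: float, p: int) -> float:
    return (p - 1) / p * n       # ring ReduceScatter

N, P_GROUP = 1.0, 4              # normalized payload; cubes per group (assumed)

naive = allreduce(N, 16)                               # flat TP16: 1.875 * N
reordered = reduce_scatter(N, P_GROUP) + N / P_GROUP   # 0.75*N + 0.25*N = 1.0*N
print(f"flat TP16 AllReduce: {naive:.3f} x N")
print(f"reordered flow     : {reordered:.3f} x N  (~{reordered / naive:.0%} of naive)")
```

Under this model the reordered flow moves about 53% of the naive traffic per cube, consistent with the "nearly half" figure above.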

AMMA's design prioritizes energy efficiency and resource utilization, departing from GPU-like microarchitectures that waste power and area on idle compute units and large memory hierarchies such as LLCs. By removing the LLC, AMMA reclaims 20% of the power budget and substantial die area, redirecting both to useful compute and direct exploitation of HBM bandwidth.

6.9x Lower Energy Consumption (vs. H100), highlighting significant operational cost savings.

NVIDIA H100 vs. AMMA (16 Cubes)

Compute (FP8 TFLOPS)
  • H100: 1978
  • AMMA: 1536 (tuned for attention)
Total HBM BW (TB/s)
  • H100: 3.35
  • AMMA: 44 (11.9x of H100)
Compute-to-BW Ratio
  • H100: 795 FLOPs/byte (25x surplus for GQA)
  • AMMA: optimized for 32 FLOPs/byte (GQA), as checked in the sketch below
Power Consumption (W)
  • H100: 700 (693 W avg. for attention)
  • AMMA: 1440 (lower total energy for equivalent work)
LLC Utilization
  • H100: near-100% miss rate, 130 W dissipation
  • AMMA: removed (20% power/area reclaimed)
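A quick roofline-style check of the ratios above, using only the AMMA numbers from the table (the ~32 FLOPs/byte GQA intensity is the paper's figure, not derived here):

```python
# Compute-to-bandwidth ratio: TFLOP/s divided by TB/s = FLOPs per byte.
def flops_per_byte(tflops: float, tb_s: float) -> float:
    return tflops / tb_s

GQA_INTENSITY = 32  # FLOPs/byte for grouped-query decode attention (from table)

amma = flops_per_byte(1536, 44)  # ~34.9 FLOPs/byte
print(f"AMMA: {amma:.1f} FLOPs/byte vs GQA's ~{GQA_INTENSITY}")
# A ratio near the workload's arithmetic intensity means neither compute nor
# bandwidth sits idle; the H100's far higher ratio strands most of its FLOPs
# during memory-bound decode attention.
```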


Implementation Roadmap for AMMA Integration

A phased approach to integrate AMMA's cutting-edge architecture into your enterprise AI ecosystem.

Phase 1: Architectural Assessment

Evaluate current LLM serving infrastructure, identify bottlenecks (memory vs. compute), and quantify long-context workload demands. Determine optimal AMMA configuration (compute power, D2D bandwidth) based on specific enterprise requirements.

Phase 2: Pilot Deployment & Benchmarking

Deploy a pilot AMMA system for key long-context attention workloads. Benchmark latency, throughput, and energy efficiency against existing GPU-centric solutions, validating the reported 15.5x latency and 6.9x energy improvements on representative workloads.

Phase 3: Integration & Scaling

Integrate AMMA into broader data center infrastructure. Scale deployment to production level, leveraging its disaggregation capabilities for FFN offloading to LPUs or GPUs. Monitor performance and cost savings at scale.

Phase 4: Continuous Optimization

Continuously monitor and optimize AMMA's performance. Explore further design-space adjustments based on evolving LLM models and workload characteristics, ensuring sustained efficiency and cost benefits.

Ready to Transform Your AI Infrastructure?

Leverage AMMA's memory-centric architecture to achieve unparalleled low-latency, energy-efficient LLM serving for long-context workloads. Book a session with our experts to discuss a tailored strategy for your enterprise.
