Enterprise AI Analysis
AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving
This paper introduces AMMA, a multi-chiplet, memory-centric architecture designed to significantly reduce latency and energy consumption for long-context LLM attention serving. By replacing traditional GPU compute dies with HBM-PNM cubes, AMMA doubles memory bandwidth. It features a novel logic-die microarchitecture, a two-level hybrid parallelism scheme, and a reordered collective communication flow to maximize bandwidth utilization and minimize inter-chiplet overhead. Evaluation shows 15.5x lower attention latency and 6.9x lower energy compared to NVIDIA H100.
Executive Impact & Value Proposition
AMMA addresses the core mismatch between compute-rich GPUs and memory-bound LLM decode attention, offering a path to unprecedented efficiency and scalability for reasoning and agentic AI workloads with millions of tokens. This architecture unlocks significant cost savings in datacenter operations by drastically reducing power consumption and increasing serving capacity for long contexts.
Deep Analysis & Enterprise Applications
The modules below dive deeper into specific findings from the research, rebuilt with an enterprise focus.
AMMA replaces GPU compute dies with HBM-PNM cubes, creating a standalone, memory-centric accelerator. This fundamentally shifts the paradigm from GPU-centric to memory-centric, leveraging advanced logic dies (≤5nm) for sophisticated PNM integration. The architecture connects 16 HBM-PNM cubes in a 4x4 2D mesh, communicating via high-speed D2D links, to form a single accelerator package dedicated to long-context attention.
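To make the topology concrete, the minimal sketch below models the 4x4 mesh and compares average hop counts when traffic stays inside a small cube neighborhood versus crossing the whole mesh. The 2x2 grouping and the Manhattan (X-Y routing) hop model are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a 4x4 2D mesh of HBM-PNM cubes (illustrative only).
# Cube numbering, the 2x2 "group", and the hop-count model are assumptions
# for exposition; real D2D link parameters come from the paper's design space.

from itertools import product

MESH_DIM = 4  # 4x4 mesh of 16 HBM-PNM cubes

def cube_coord(cube_id: int) -> tuple[int, int]:
    """Map a cube ID (0..15) to its (row, col) position in the mesh."""
    return divmod(cube_id, MESH_DIM)

def hop_distance(src: int, dst: int) -> int:
    """Manhattan hop count between two cubes under X-Y mesh routing."""
    (r1, c1), (r2, c2) = cube_coord(src), cube_coord(dst)
    return abs(r1 - r2) + abs(c1 - c2)

def avg_hops(cubes: list[int]) -> float:
    """Average hop count over all ordered pairs of distinct cubes."""
    pairs = [(a, b) for a, b in product(cubes, cubes) if a != b]
    return sum(hop_distance(a, b) for a, b in pairs) / len(pairs)

group = [0, 1, 4, 5]                        # one assumed 2x2 neighborhood
all_cubes = list(range(MESH_DIM * MESH_DIM))

print(f"avg hops within a 2x2 group : {avg_hops(group):.2f}")
print(f"avg hops across full mesh   : {avg_hops(all_cubes):.2f}")
```

Keeping collective traffic inside such a neighborhood is what lets AMMA avoid paying full-mesh hop latency on every attention step.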
Architecture Comparison: GPU-Centric vs. Memory-Centric
| Feature | Conventional GPU | AMMA (Memory-Centric) |
|---|---|---|
| Core Design | Compute dies with HBM attached | 16 HBM-PNM cubes in a 4x4 2D mesh over D2D links |
| Target Workload | General compute-rich workloads | Memory-bound long-context decode attention |
| Memory Bandwidth | Baseline HBM bandwidth | Roughly doubled via near-memory processing |
| Compute Units | Large compute arrays, largely idle during decode | Lean PNM logic on advanced (≤5nm) logic dies |
AMMA employs a two-level hybrid parallelism scheme and a reordered collective communication flow to minimize inter-cube data movement and overhead. Unlike naive Tensor Parallelism (TP16), which causes extensive communication across the mesh, AMMA's approach confines data movement to local neighborhoods, making long-context inference practical. Concretely, KV heads are mapped to cube groups, and the KV cache is split by sequence within each group.
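As a rough illustration of this two-level mapping, the sketch below assigns KV heads to cube groups and slices the KV cache by sequence within each group. The group size, head count, and cube numbering are assumed for exposition and are not the paper's exact configuration.

```python
# Illustrative layout for two-level hybrid parallelism:
#   level 1: KV heads partitioned across cube groups
#   level 2: KV cache split by sequence across cubes within a group
# All sizes below are placeholder assumptions.

from dataclasses import dataclass

NUM_CUBES = 16
GROUP_SIZE = 4                      # assumed cubes per group (4 groups of 4)
NUM_KV_HEADS = 8                    # assumed KV-head count of the model
SEQ_LEN = 1_000_000                 # 1M-token context

@dataclass
class Shard:
    cube_id: int
    kv_heads: list[int]             # level 1: heads owned by this cube's group
    token_range: tuple[int, int]    # level 2: sequence slice of the KV cache

def build_layout() -> list[Shard]:
    num_groups = NUM_CUBES // GROUP_SIZE
    heads_per_group = NUM_KV_HEADS // num_groups
    tokens_per_cube = SEQ_LEN // GROUP_SIZE
    shards = []
    for cube in range(NUM_CUBES):
        group = cube // GROUP_SIZE
        rank = cube % GROUP_SIZE
        heads = list(range(group * heads_per_group, (group + 1) * heads_per_group))
        tokens = (rank * tokens_per_cube, (rank + 1) * tokens_per_cube)
        shards.append(Shard(cube, heads, tokens))
    return shards

for shard in build_layout()[:5]:
    print(shard)
```

Because each group owns complete KV heads, partial attention outputs only need to be combined within a group and then exchanged once across groups, rather than all-reduced over the full mesh.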
Impact of Reordered Collective Operations
The reordered collective communication flow in AMMA reduces overhead by replacing the intra-group AllReduce with a ReduceScatter and the cross-group AllReduce with a point-to-point Reduce. This design eliminates two AllGather operations and downgrades one AllReduce, cutting traffic volume nearly in half and yielding a latency reduction that holds regardless of sequence length. The optimization is crucial for low-latency long-context attention serving, where communication can easily become the bottleneck.
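The back-of-envelope model below shows why these substitutions help, using standard ring-collective per-rank byte counts. The group size, payload size, and the ring-algorithm cost model itself are assumptions for illustration, not figures from the paper.

```python
# Per-cube traffic for the collectives named above, using standard
# ring-algorithm byte counts (an assumption; AMMA's D2D mesh may schedule
# these differently). n = payload bytes per cube, p = cubes per group.

def allreduce_bytes(n: float, p: int) -> float:
    return 2 * (p - 1) / p * n      # ring AllReduce moves ~2n per rank

def reduce_scatter_bytes(n: float, p: int) -> float:
    return (p - 1) / p * n          # ring ReduceScatter moves ~n per rank

def allgather_bytes(n: float, p: int) -> float:
    return (p - 1) / p * n          # ring AllGather moves ~n per rank

P = 4                               # assumed cubes per group
N = 64 * 1024                       # assumed partial-output bytes per cube

print(f"intra-group AllReduce     : {allreduce_bytes(N, P) / 1024:.0f} KiB/cube")
print(f"intra-group ReduceScatter : {reduce_scatter_bytes(N, P) / 1024:.0f} KiB/cube")
print(f"each removed AllGather    : {allgather_bytes(N, P) / 1024:.0f} KiB/cube saved")
```

Under this model the AllReduce-to-ReduceScatter swap alone halves that term's traffic, and each eliminated AllGather removes a further full payload's worth of wire bytes, which is consistent with the near-halving of total traffic described above.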
Key Finding: The reordered flow achieves a 65.4x communication speedup at 1M tokens compared to a naive TP16 approach, contributing significantly to AMMA's overall latency reduction.
AMMA's design prioritizes energy efficiency and resource utilization, departing from GPU-like microarchitectures that waste power and area on idle compute units and large memory hierarchies such as LLCs. By removing the LLC, AMMA reclaims 20% of the power budget and substantial die area, redirecting both to useful compute and to direct exploitation of HBM bandwidth.
| Metric | NVIDIA H100 | AMMA (16 Cubes) |
|---|---|---|
| Compute (FP8 TFLOPS) | ≈1,979 (dense) | Sized to attention's low arithmetic intensity |
| Total HBM BW (TB/s) | 3.35 | Roughly doubled via per-cube PNM access |
| Compute-to-BW Ratio | High (compute-rich) | Low (bandwidth-matched) |
| Power / Energy | 700 W TDP | 6.9x lower attention energy (reported) |
| LLC Utilization | Large LLC, little reuse for streaming KV-cache reads | No LLC; ~20% of the power budget reclaimed |
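The compute-to-bandwidth mismatch in the table above can be sanity-checked with a rough roofline estimate. The head dimension and FP8 KV-cache precision below are assumed values, and the H100 figures come from public spec sheets rather than the paper.

```python
# Rough roofline check: why decode attention over a long KV cache is
# bandwidth-bound. Head dim and KV-cache precision are assumed values.

HEAD_DIM = 128          # assumed head dimension
BYTES_PER_ELEM = 1      # assumed FP8 KV cache

# Per context token, per head, per generated token:
#   q . k          -> ~2*HEAD_DIM FLOPs, reads HEAD_DIM elements of K
#   softmax(.) * v -> ~2*HEAD_DIM FLOPs, reads HEAD_DIM elements of V
flops = 4 * HEAD_DIM
kv_bytes = 2 * HEAD_DIM * BYTES_PER_ELEM
intensity = flops / kv_bytes                     # FLOPs per byte of KV traffic

# H100-class balance point from public specs: ~1979 dense FP8 TFLOPS / 3.35 TB/s.
gpu_balance = 1979e12 / 3.35e12

print(f"decode-attention arithmetic intensity : ~{intensity:.0f} FLOPs/byte")
print(f"H100 compute-to-bandwidth balance     : ~{gpu_balance:.0f} FLOPs/byte")
```

With only a few FLOPs per byte of KV-cache traffic, decode attention sits orders of magnitude below a GPU's balance point, which is the case AMMA makes for trading idle compute and LLC area for direct bandwidth.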
Advanced ROI Calculator: Optimize Your AI Infrastructure Costs
Estimate potential annual savings by adopting AMMA's memory-centric architecture for your LLM serving needs.
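As a hedged sketch of the kind of estimate such a calculator performs, the snippet below combines the paper's reported 15.5x latency and 6.9x energy improvements with placeholder fleet and pricing assumptions; it also treats accelerator power as attention-dominated, which is an oversimplification of a full serving stack.

```python
# Simple annual-savings estimate in the spirit of the ROI calculator above.
# Only the 15.5x and 6.9x factors come from the paper; every other input is a
# placeholder assumption to be replaced with your own numbers.

NUM_ACCELERATORS = 64          # assumed attention-serving GPUs today
AVG_POWER_W = 700.0            # H100-class board power
UTILIZATION = 0.6              # assumed average utilization
ENERGY_PRICE_USD_KWH = 0.12    # assumed datacenter electricity price
ENERGY_IMPROVEMENT = 6.9       # reported attention energy reduction
LATENCY_IMPROVEMENT = 15.5     # reported attention latency reduction

HOURS_PER_YEAR = 24 * 365

baseline_kwh = NUM_ACCELERATORS * AVG_POWER_W * UTILIZATION * HOURS_PER_YEAR / 1000
amma_kwh = baseline_kwh / ENERGY_IMPROVEMENT
energy_savings_usd = (baseline_kwh - amma_kwh) * ENERGY_PRICE_USD_KWH

# Latency gains translate into serving capacity: fewer devices, or more
# concurrent long-context requests, for the same attention workload.
capacity_headroom = LATENCY_IMPROVEMENT

print(f"baseline attention energy : {baseline_kwh:,.0f} kWh/yr")
print(f"estimated energy savings  : ${energy_savings_usd:,.0f}/yr")
print(f"serving-capacity headroom : ~{capacity_headroom:.1f}x")
```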
Implementation Roadmap for AMMA Integration
A phased approach to integrate AMMA's cutting-edge architecture into your enterprise AI ecosystem.
Phase 1: Architectural Assessment
Evaluate current LLM serving infrastructure, identify bottlenecks (memory vs. compute), and quantify long-context workload demands. Determine optimal AMMA configuration (compute power, D2D bandwidth) based on specific enterprise requirements.
Phase 2: Pilot Deployment & Benchmarking
Deploy a pilot AMMA system for key long-context attention workloads. Benchmark latency, throughput, and energy efficiency against existing GPU-centric solutions, validating the reported 15.5x latency and 6.9x energy improvements on your own workloads.
Phase 3: Integration & Scaling
Integrate AMMA into broader data center infrastructure. Scale deployment to production level, leveraging its disaggregation capabilities for FFN offloading to LPUs or GPUs. Monitor performance and cost savings at scale.
Phase 4: Continuous Optimization
Continuously monitor and optimize AMMA's performance. Explore further design-space adjustments based on evolving LLM models and workload characteristics, ensuring sustained efficiency and cost benefits.
Ready to Transform Your AI Infrastructure?
Leverage AMMA's memory-centric architecture to achieve unparalleled low-latency, energy-efficient LLM serving for long-context workloads. Book a session with our experts to discuss a tailored strategy for your enterprise.