Enterprise AI Analysis
AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving
This paper introduces AMMA, a multi-chiplet, memory-centric architecture designed to significantly reduce latency and energy consumption for long-context LLM attention serving. By replacing traditional GPU compute dies with HBM-PNM cubes, AMMA doubles memory bandwidth. It features a novel logic-die microarchitecture, a two-level hybrid parallelism scheme, and a reordered collective communication flow to maximize bandwidth utilization and minimize inter-chiplet overhead. Evaluation shows 15.5x lower attention latency and 6.9x lower energy compared to NVIDIA H100.
Executive Impact & Value Proposition
AMMA addresses the core mismatch between compute-rich GPUs and memory-bound LLM decode attention, offering a path to unprecedented efficiency and scalability for reasoning and agentic AI workloads with millions of tokens. This architecture unlocks significant cost savings in datacenter operations by drastically reducing power consumption and increasing serving capacity for long contexts.
Deep Analysis & Enterprise Applications
The modules below dive deeper into specific findings from the research, rebuilt with an enterprise focus.
AMMA replaces GPU compute dies with HBM-PNM cubes, creating a standalone, memory-centric accelerator. This fundamentally shifts the paradigm from GPU-centric to memory-centric, leveraging advanced logic dies (≤5nm) for sophisticated PNM integration. The architecture connects 16 HBM-PNM cubes in a 4x4 2D mesh, communicating via high-speed D2D links, to form a single accelerator package dedicated to long-context attention.
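To make the topology concrete, the minimal sketch below models the 4x4 mesh and compares average hop counts when traffic stays inside a small cube neighborhood versus crossing the whole mesh. The 2x2 grouping and the Manhattan (X-Y routing) hop model are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a 4x4 2D mesh of HBM-PNM cubes (illustrative only).
# Cube numbering, the 2x2 "group", and the hop-count model are assumptions
# for exposition; real D2D link parameters come from the paper's design space.

from itertools import product

MESH_DIM = 4  # 4x4 mesh of 16 HBM-PNM cubes

def cube_coord(cube_id: int) -> tuple[int, int]:
    """Map a cube ID (0..15) to its (row, col) position in the mesh."""
    return divmod(cube_id, MESH_DIM)

def hop_distance(src: int, dst: int) -> int:
    """Manhattan hop count between two cubes under X-Y mesh routing."""
    (r1, c1), (r2, c2) = cube_coord(src), cube_coord(dst)
    return abs(r1 - r2) + abs(c1 - c2)

def avg_hops(cubes: list[int]) -> float:
    """Average hop count over all ordered pairs of distinct cubes."""
    pairs = [(a, b) for a, b in product(cubes, cubes) if a != b]
    return sum(hop_distance(a, b) for a, b in pairs) / len(pairs)

group = [0, 1, 4, 5]                        # one assumed 2x2 neighborhood
all_cubes = list(range(MESH_DIM * MESH_DIM))

print(f"avg hops within a 2x2 group : {avg_hops(group):.2f}")
print(f"avg hops across full mesh   : {avg_hops(all_cubes):.2f}")
```

Keeping collective traffic inside such a neighborhood is what lets AMMA avoid paying full-mesh hop latency on every attention step.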
Architecture Comparison: GPU-Centric vs. Memory-Centric
| Feature | Conventional GPU | AMMA (Memory-Centric) |
|---|---|---|
| Core Design | Compute dies with HBM attached | 16 HBM-PNM cubes in a 4x4 2D mesh over D2D links |
| Target Workload | General compute-rich workloads | Memory-bound long-context decode attention |
| Memory Bandwidth | Baseline HBM bandwidth | Roughly doubled via near-memory processing |
| Compute Units | Large compute arrays, largely idle during decode | Lean PNM logic on advanced (≤5nm) logic dies |
AMMA employs a two-level hybrid parallelism scheme and a reordered collective communication flow to minimize inter-cube data movement and overhead. Unlike naive Tensor Parallelism (TP16), which causes extensive communication across the mesh, AMMA's approach confines data movement to local neighborhoods, making long-context inference practical. Concretely, KV heads are mapped to cube groups, and the KV cache is split by sequence within each group.
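As a rough illustration of this two-level mapping, the sketch below assigns KV heads to cube groups and slices the KV cache by sequence within each group. The group size, head count, and cube numbering are assumed for exposition and are not the paper's exact configuration.

```python
# Illustrative layout for two-level hybrid parallelism:
#   level 1: KV heads partitioned across cube groups
#   level 2: KV cache split by sequence across cubes within a group
# All sizes below are placeholder assumptions.

from dataclasses import dataclass

NUM_CUBES = 16
GROUP_SIZE = 4                      # assumed cubes per group (4 groups of 4)
NUM_KV_HEADS = 8                    # assumed KV-head count of the model
SEQ_LEN = 1_000_000                 # 1M-token context

@dataclass
class Shard:
    cube_id: int
    kv_heads: list[int]             # level 1: heads owned by this cube's group
    token_range: tuple[int, int]    # level 2: sequence slice of the KV cache

def build_layout() -> list[Shard]:
    num_groups = NUM_CUBES // GROUP_SIZE
    heads_per_group = NUM_KV_HEADS // num_groups
    tokens_per_cube = SEQ_LEN // GROUP_SIZE
    shards = []
    for cube in range(NUM_CUBES):
        group = cube // GROUP_SIZE
        rank = cube % GROUP_SIZE
        heads = list(range(group * heads_per_group, (group + 1) * heads_per_group))
        tokens = (rank * tokens_per_cube, (rank + 1) * tokens_per_cube)
        shards.append(Shard(cube, heads, tokens))
    return shards

for shard in build_layout()[:5]:
    print(shard)
```

Because each group owns complete KV heads, partial attention outputs only need to be combined within a group and then exchanged once across groups, rather than all-reduced over the full mesh.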
Impact of Reordered Collective Operations
The reordered collective communication flow in AMMA reduces overhead by replacing the intra-group AllReduce with a ReduceScatter and the cross-group AllReduce with a point-to-point Reduce. This design eliminates two AllGather operations and downgrades one AllReduce, cutting traffic volume nearly in half and yielding a latency reduction that holds regardless of sequence length. The optimization is crucial for low-latency long-context attention serving, where communication can easily become the bottleneck.
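The back-of-envelope model below shows why these substitutions help, using standard ring-collective per-rank byte counts. The group size, payload size, and the ring-algorithm cost model itself are assumptions for illustration, not figures from the paper.

```python
# Per-cube traffic for the collectives named above, using standard
# ring-algorithm byte counts (an assumption; AMMA's D2D mesh may schedule
# these differently). n = payload bytes per cube, p = cubes per group.

def allreduce_bytes(n: float, p: int) -> float:
    return 2 * (p - 1) / p * n      # ring AllReduce moves ~2n per rank

def reduce_scatter_bytes(n: float, p: int) -> float:
    return (p - 1) / p * n          # ring ReduceScatter moves ~n per rank

def allgather_bytes(n: float, p: int) -> float:
    return (p - 1) / p * n          # ring AllGather moves ~n per rank

P = 4                               # assumed cubes per group
N = 64 * 1024                       # assumed partial-output bytes per cube

print(f"intra-group AllReduce     : {allreduce_bytes(N, P) / 1024:.0f} KiB/cube")
print(f"intra-group ReduceScatter : {reduce_scatter_bytes(N, P) / 1024:.0f} KiB/cube")
print(f"each removed AllGather    : {allgather_bytes(N, P) / 1024:.0f} KiB/cube saved")
```

Under this model the AllReduce-to-ReduceScatter swap alone halves that term's traffic, and each eliminated AllGather removes a further full payload's worth of wire bytes, which is consistent with the near-halving of total traffic described above.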
Key Finding: The reordered flow achieves a 65.4x communication speedup at 1M tokens compared to a naive TP16 approach, contributing significantly to AMMA's overall latency reduction.
AMMA's design prioritizes energy efficiency and resource utilization, departing from GPU-like microarchitectures that waste power and area on idle compute units and large memory hierarchies such as LLCs. By removing the LLC, AMMA reclaims 20% of the power budget and substantial die area, redirecting both to useful compute and to direct exploitation of HBM bandwidth.
| Metric | NVIDIA H100 | AMMA (16 Cubes) |
|---|---|---|
| Compute (FP8 TFLOPS) | ≈1,979 (dense) | Sized to attention's low arithmetic intensity |
| Total HBM BW (TB/s) | 3.35 | Roughly doubled via per-cube PNM access |
| Compute-to-BW Ratio | High (compute-rich) | Low (bandwidth-matched) |
| Power / Energy | 700 W TDP | 6.9x lower attention energy (reported) |
| LLC Utilization | Large LLC, little reuse for streaming KV-cache reads | No LLC; ~20% of the power budget reclaimed |
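The compute-to-bandwidth mismatch in the table above can be sanity-checked with a rough roofline estimate. The head dimension and FP8 KV-cache precision below are assumed values, and the H100 figures come from public spec sheets rather than the paper.

```python
# Rough roofline check: why decode attention over a long KV cache is
# bandwidth-bound. Head dim and KV-cache precision are assumed values.

HEAD_DIM = 128          # assumed head dimension
BYTES_PER_ELEM = 1      # assumed FP8 KV cache

# Per context token, per head, per generated token:
#   q . k          -> ~2*HEAD_DIM FLOPs, reads HEAD_DIM elements of K
#   softmax(.) * v -> ~2*HEAD_DIM FLOPs, reads HEAD_DIM elements of V
flops = 4 * HEAD_DIM
kv_bytes = 2 * HEAD_DIM * BYTES_PER_ELEM
intensity = flops / kv_bytes                     # FLOPs per byte of KV traffic

# H100-class balance point from public specs: ~1979 dense FP8 TFLOPS / 3.35 TB/s.
gpu_balance = 1979e12 / 3.35e12

print(f"decode-attention arithmetic intensity : ~{intensity:.0f} FLOPs/byte")
print(f"H100 compute-to-bandwidth balance     : ~{gpu_balance:.0f} FLOPs/byte")
```

With only a few FLOPs per byte of KV-cache traffic, decode attention sits orders of magnitude below a GPU's balance point, which is the case AMMA makes for trading idle compute and LLC area for direct bandwidth.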
Advanced ROI Calculator: Optimize Your AI Infrastructure Costs
Estimate potential annual savings by adopting AMMA's memory-centric architecture for your LLM serving needs.
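As a hedged sketch of the kind of estimate such a calculator performs, the snippet below combines the paper's reported 15.5x latency and 6.9x energy improvements with placeholder fleet and pricing assumptions; it also treats accelerator power as attention-dominated, which is an oversimplification of a full serving stack.

```python
# Simple annual-savings estimate in the spirit of the ROI calculator above.
# Only the 15.5x and 6.9x factors come from the paper; every other input is a
# placeholder assumption to be replaced with your own numbers.

NUM_ACCELERATORS = 64          # assumed attention-serving GPUs today
AVG_POWER_W = 700.0            # H100-class board power
UTILIZATION = 0.6              # assumed average utilization
ENERGY_PRICE_USD_KWH = 0.12    # assumed datacenter electricity price
ENERGY_IMPROVEMENT = 6.9       # reported attention energy reduction
LATENCY_IMPROVEMENT = 15.5     # reported attention latency reduction

HOURS_PER_YEAR = 24 * 365

baseline_kwh = NUM_ACCELERATORS * AVG_POWER_W * UTILIZATION * HOURS_PER_YEAR / 1000
amma_kwh = baseline_kwh / ENERGY_IMPROVEMENT
energy_savings_usd = (baseline_kwh - amma_kwh) * ENERGY_PRICE_USD_KWH

# Latency gains translate into serving capacity: fewer devices, or more
# concurrent long-context requests, for the same attention workload.
capacity_headroom = LATENCY_IMPROVEMENT

print(f"baseline attention energy : {baseline_kwh:,.0f} kWh/yr")
print(f"estimated energy savings  : ${energy_savings_usd:,.0f}/yr")
print(f"serving-capacity headroom : ~{capacity_headroom:.1f}x")
```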
Implementation Roadmap for AMMA Integration
A phased approach to integrate AMMA's cutting-edge architecture into your enterprise AI ecosystem.
Phase 1: Architectural Assessment
Evaluate current LLM serving infrastructure, identify bottlenecks (memory vs. compute), and quantify long-context workload demands. Determine optimal AMMA configuration (compute power, D2D bandwidth) based on specific enterprise requirements.
Phase 2: Pilot Deployment & Benchmarking
Deploy a pilot AMMA system for key long-context attention workloads. Benchmark latency, throughput, and energy efficiency against existing GPU-centric solutions, validating the reported 15.5x latency and 6.9x energy improvements on your own workloads.
Phase 3: Integration & Scaling
Integrate AMMA into broader data center infrastructure. Scale deployment to production level, leveraging its disaggregation capabilities for FFN offloading to LPUs or GPUs. Monitor performance and cost savings at scale.
Phase 4: Continuous Optimization
Continuously monitor and optimize AMMA's performance. Explore further design-space adjustments based on evolving LLM models and workload characteristics, ensuring sustained efficiency and cost benefits.
Ready to Transform Your AI Infrastructure?
Leverage AMMA's memory-centric architecture to achieve unparalleled low-latency, energy-efficient LLM serving for long-context workloads. Book a session with our experts to discuss a tailored strategy for your enterprise.