BUILD, JUDGE, OPTIMIZE: A BLUEPRINT FOR CONTINUOUS IMPROVEMENT OF MULTI-AGENT CONSUMER ASSISTANTS
Executive Summary
This paper presents a practical blueprint for evaluating and optimizing Conversational Shopping Assistants (CSAs), using a production-scale AI grocery assistant as a case study. It introduces a multi-faceted evaluation rubric, a calibrated LLM-as-judge pipeline, and two prompt-optimization strategies: Sub-agent GEPA and system-level MAMUT GEPA. The framework is designed to address challenges in multi-turn interactions and tightly coupled multi-agent systems, particularly in preference-sensitive and underspecified domains like grocery shopping.
Key Impact & Performance Benchmarks
The calibrated LLM-as-judge pipeline reaches 93.45% agreement with human reviewers, and system-level MAMUT GEPA scores 6.0 to 12.0 points higher than Sub-agent GEPA across the four rubric domains.
Deep Analysis & Enterprise Applications
Challenges in Production CSAs
Conversational shopping assistants (CSAs) face two underexplored challenges: evaluating multi-turn interactions and optimizing tightly coupled multi-agent systems. Grocery shopping amplifies these difficulties due to underspecified, preference-sensitive user requests and inventory constraints. Traditional retrieval metrics are insufficient for multi-dimensional quality assessment across interaction trajectories, and optimizing individual sub-agents doesn't reliably translate to better end-to-end outcomes due to delayed effects and cross-agent coupling.
Key Finding
93.45% Optimized Judge-Human Agreement
After GEPA calibration, the LLM-as-judge pipeline achieved 93.45% overall agreement with human reviewers, a +5.0% improvement over baseline, making it a reliable reward signal for optimization.
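To make the agreement figure concrete, here is a minimal sketch of how an overall judge-human agreement rate could be computed. The function name and the toy labels are hypothetical and are not taken from the paper; the paper's exact agreement definition may differ (for example, it may be computed per rubric criterion).

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of items where the judge's label equals the human label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not judge_labels:
        raise ValueError("need at least one labeled item")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Hypothetical toy labels for illustration only (not data from the paper).
judge = ["pass", "fail", "pass", "pass"]
human = ["pass", "fail", "fail", "pass"]
rate = agreement_rate(judge, human)
print(f"agreement: {rate:.2%}")
```

Tracking this rate before and after calibrating the judge prompt is one simple way to check whether the judge has become a more trustworthy reward signal.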
MAGIC System Architecture
| Rubric Domain | Sub-agent GEPA | MAMUT GEPA | Improvement |
|---|---|---|---|
| Shopping Execution | 79.0% | 85.0% | +6.0% |
| Personalization & Context | 80.2% | 87.0% | +6.8% |
| Conversational Quality | 64.0% | 72.0% | +8.0% |
| Safety & Compliance | 76.0% | 88.0% | +12.0% |

MAMUT GEPA consistently outperforms Sub-agent GEPA across all rubric domains, especially in Safety & Compliance, confirming the importance of system-level optimization for tightly coupled multi-agent systems.
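The Improvement column is the absolute difference between the two scores in each row. The short sketch below recomputes it from the table values; the variable names are illustrative only.

```python
# Scores copied from the table above: (Sub-agent GEPA, MAMUT GEPA), in percent.
scores = {
    "Shopping Execution": (79.0, 85.0),
    "Personalization & Context": (80.2, 87.0),
    "Conversational Quality": (64.0, 72.0),
    "Safety & Compliance": (76.0, 88.0),
}

# Absolute gain per domain, rounded to one decimal to avoid float noise.
gains = {
    domain: round(mamut - sub_agent, 1)
    for domain, (sub_agent, mamut) in scores.items()
}
largest = max(gains, key=gains.get)

for domain, gain in gains.items():
    print(f"{domain}: +{gain:.1f}")
print(f"largest gain: {largest}")
```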
The MAGIC Grocery Assistant
MAGIC (Multi-Agent Grocery Intelligent Concierge) is a production-scale grocery assistant used as a case study. Early monolithic designs were brittle, leading to a pivot to a modular multi-agent architecture where an Orchestrator coordinates sub-agents. This design, while improving control, introduced tighter coupling, making system-level optimization crucial. The blueprint helps MAGIC achieve robust, preference-sensitive, and context-aware interactions.
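As a rough illustration of the orchestrator pattern described above, the sketch below routes a request to one of two sub-agents and records each step in a shared trace. Everything here is an assumption made for illustration: the sub-agent names, the keyword-based routing rule, and the `Trace` structure are hypothetical and do not describe MAGIC's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Trace:
    """Shared record of what each sub-agent did (hypothetical structure)."""
    steps: List[str] = field(default_factory=list)


SubAgent = Callable[[str, Trace], str]


def search_agent(request: str, trace: Trace) -> str:
    # Hypothetical sub-agent: would look up candidate products.
    trace.steps.append(f"search:{request}")
    return f"[search] {request}"


def cart_agent(request: str, trace: Trace) -> str:
    # Hypothetical sub-agent: would modify the shopping cart.
    trace.steps.append(f"cart:{request}")
    return f"[cart] {request}"


class Orchestrator:
    """Coordinates sub-agents; routing here is a toy keyword rule."""

    def __init__(self, agents: Dict[str, SubAgent]):
        self.agents = agents

    def route(self, request: str) -> str:
        # Assumption: a real orchestrator would likely use an LLM to route.
        return "cart" if request.lower().startswith("add ") else "search"

    def handle(self, request: str, trace: Trace) -> str:
        return self.agents[self.route(request)](request, trace)


orchestrator = Orchestrator({"search": search_agent, "cart": cart_agent})
trace = Trace()
reply = orchestrator.handle("add oat milk", trace)
print(reply, trace.steps)
```

Because every sub-agent writes to the same trace and depends on the orchestrator's routing, a change to one prompt can alter what the others see, which is the coupling that motivates system-level optimization.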
Evaluation Framework
A multi-faceted evaluation rubric assesses end-to-end shopping quality across four domains: Shopping Execution, Personalization & Context, Conversational Quality, and Safety & Compliance. This rubric is grounded in observable trace artifacts and uses a calibrated LLM-as-judge pipeline to provide deterministic scoring, enabling a stable reward signal for optimization. The judge's decision boundaries were refined using GEPA prompt optimization, achieving high alignment with human annotations.
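A minimal sketch of how per-domain judge verdicts might be turned into a single reward follows. It rests on several assumptions: the judge is replaced by a deterministic stub rather than a real LLM call, the `violations` field of the trace is hypothetical, and averaging pass/fail verdicts is only one possible aggregation, not necessarily the one used in the paper.

```python
from typing import Callable, Dict

# Rubric domains as named in the results table.
DOMAINS = (
    "Shopping Execution",
    "Personalization & Context",
    "Conversational Quality",
    "Safety & Compliance",
)

Judge = Callable[[str, dict], bool]


def score_trace(trace: dict, judge: Judge) -> Dict[str, bool]:
    """Ask the judge for one pass/fail verdict per rubric domain."""
    return {domain: judge(domain, trace) for domain in DOMAINS}


def reward(verdicts: Dict[str, bool]) -> float:
    """Assumed aggregation: fraction of domains that passed."""
    return sum(verdicts.values()) / len(verdicts)


def stub_judge(domain: str, trace: dict) -> bool:
    # Stand-in for an LLM call: a domain fails if the (hypothetical)
    # trace field "violations" lists it. Same input, same verdict.
    return domain not in trace.get("violations", [])


example_trace = {"violations": ["Safety & Compliance"]}
verdicts = score_trace(example_trace, stub_judge)
print(verdicts, reward(verdicts))
```

Keeping the verdict function free of randomness means the same trace always yields the same reward, which is what makes the signal usable by an optimizer.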
Your AI Implementation Roadmap
A typical journey to integrate and optimize enterprise AI, from initial strategy to continuous improvement.
Phase 1: Discovery & Strategy
In-depth analysis of current workflows, identification of AI opportunities, and development of a tailored strategy.
Phase 2: Pilot & Development
Rapid prototyping, development of core AI components, and deployment of a pilot program for initial testing.
Phase 3: Integration & Scaling
Seamless integration with existing systems, comprehensive testing, and phased rollout across the organization.
Phase 4: Optimization & Monitoring
Continuous performance monitoring, iterative model refinement, and ongoing support to maximize ROI.