Enterprise AI Analysis: Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants


Executive Summary

This paper presents a practical blueprint for evaluating and optimizing Conversational Shopping Assistants (CSAs), using a production-scale AI grocery assistant as a case study. It introduces a multi-faceted evaluation rubric, a calibrated LLM-as-judge pipeline, and two prompt-optimization strategies: Sub-agent GEPA and system-level MAMUT GEPA. The framework is designed to address challenges in multi-turn interactions and tightly coupled multi-agent systems, particularly in preference-sensitive and underspecified domains like grocery shopping.

Key Impact & Performance Benchmarks

A calibrated LLM judge and system-level prompt optimization deliver measurable gains in agent performance and reliability.

LLM Judge-Human Agreement: 93.45%
Overall Agreement Improvement: +5.0%
MAMUT Rubric Pass Rate: up to 88.0% across rubric domains

Deep Analysis & Enterprise Applications


Multi-Agent Systems

Challenges in Production CSAs

Conversational shopping assistants (CSAs) face two underexplored challenges: evaluating multi-turn interactions and optimizing tightly coupled multi-agent systems. Grocery shopping amplifies these difficulties due to underspecified, preference-sensitive user requests and inventory constraints. Traditional retrieval metrics are insufficient for multi-dimensional quality assessment across interaction trajectories, and optimizing individual sub-agents doesn't reliably translate to better end-to-end outcomes due to delayed effects and cross-agent coupling.

Key Finding

93.45% Optimized Judge-Human Agreement

After GEPA calibration, the LLM-as-judge pipeline achieved 93.45% overall agreement with human reviewers, a +5.0% improvement over baseline, making it a reliable reward signal for optimization.
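Judge-human agreement of this kind reduces to a simple matching rate over trace-level verdicts. The sketch below shows that computation; the labels are hypothetical, and only the 93.45% figure above comes from the paper.

```python
# Sketch: measuring LLM-judge reliability against human annotations.
# The verdicts below are hypothetical examples, not the paper's data.

def agreement_rate(judge_labels, human_labels):
    """Fraction of trace-level verdicts where the LLM judge matches humans."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)

# Hypothetical pass/fail verdicts on 8 conversation traces:
judge = ["pass", "fail", "pass", "pass", "fail", "pass", "pass", "fail"]
human = ["pass", "fail", "pass", "fail", "fail", "pass", "pass", "fail"]
print(f"judge-human agreement: {agreement_rate(judge, human):.2%}")  # 87.50%
```

In practice the agreement would be computed per rubric domain as well as overall, so calibration can target the domains where the judge diverges most from human reviewers.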

MAGIC System Architecture

User Request → Orchestrator → Sub-agents & APIs → Actionable Tasks → User Communication
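A minimal sketch of this orchestration flow, assuming illustrative sub-agent names and keyword routing (the production system's routing and sub-agent set are not specified here):

```python
# Toy MAGIC-style orchestration: an Orchestrator routes the user request to
# a sub-agent, then turns the task result into a user-facing reply.
# Sub-agent names and routing logic are illustrative assumptions.
from typing import Callable

def search_agent(request: str) -> str:
    return f"search results for: {request}"

def cart_agent(request: str) -> str:
    return f"cart updated for: {request}"

SUB_AGENTS: dict[str, Callable[[str], str]] = {
    "search": search_agent,
    "cart": cart_agent,
}

def orchestrate(user_request: str) -> str:
    """Route a request to a sub-agent, then compose the user communication."""
    route = "cart" if "add" in user_request.lower() else "search"
    task_result = SUB_AGENTS[route](user_request)  # actionable task
    return f"Assistant: {task_result}"             # user communication

print(orchestrate("add organic milk to my cart"))
```

Even in this toy form, the coupling the paper highlights is visible: a change to one sub-agent's behavior alters what the Orchestrator passes downstream, so components cannot be tuned fully in isolation.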

Sub-agent GEPA vs. MAMUT GEPA Performance

Rubric Domain               Sub-agent GEPA   MAMUT GEPA   Improvement
Shopping Execution          79.0%            85.0%        +6.0%
Personalization & Context   80.2%            87.0%        +6.8%
Conversational Quality      64.0%            72.0%        +8.0%
Safety & Compliance         76.0%            88.0%        +12.0%
MAMUT GEPA consistently outperforms Sub-agent GEPA across all rubric domains, especially in Safety & Compliance, confirming the importance of system-level optimization for tightly coupled multi-agent systems.
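The difference between the two strategies can be sketched as two acceptance criteria for candidate prompts: Sub-agent GEPA accepts a candidate when it improves a sub-agent's local metric, while MAMUT GEPA accepts it only when the end-to-end rubric score improves. The scoring functions below are toy stand-ins for the LLM-judge reward signal, not the paper's implementation.

```python
# Toy contrast between per-agent and system-level prompt optimization.
def subagent_gepa(prompts, local_score, candidates):
    """Accept a candidate if it improves that sub-agent's local metric."""
    best = dict(prompts)
    for name in prompts:
        for cand in candidates(name):
            if local_score(name, cand) > local_score(name, best[name]):
                best[name] = cand
    return best

def mamut_gepa(prompts, system_score, candidates):
    """Accept a candidate only if it improves the end-to-end rubric score."""
    best = dict(prompts)
    for name in prompts:
        for cand in candidates(name):
            trial = dict(best, **{name: cand})
            if system_score(trial) > system_score(best):
                best = trial
    return best

# Toy setup where local gains hurt the system: the "system" rewards the two
# prompts staying in sync, but the local metric just rewards longer prompts.
base = {"search": "short", "cart": "short"}
def local_len(name, prompt): return len(prompt)
def system_balance(p): return -abs(len(p["search"]) - len(p["cart"]))
def candidates(name): return ["a much longer candidate"] if name == "search" else []

local_best = subagent_gepa(base, local_len, candidates)     # takes the longer prompt
system_best = mamut_gepa(base, system_balance, candidates)  # rejects it
```

The toy coupling makes the paper's point concrete: a locally attractive prompt change can degrade end-to-end behavior, which only the system-level criterion detects.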

The MAGIC Grocery Assistant

MAGIC (Multi-Agent Grocery Intelligent Concierge) is a production-scale grocery assistant used as a case study. Early monolithic designs were brittle, leading to a pivot to a modular multi-agent architecture where an Orchestrator coordinates sub-agents. This design, while improving control, introduced tighter coupling, making system-level optimization crucial. The blueprint helps MAGIC achieve robust, preference-sensitive, and context-aware interactions.

Evaluation Framework

A multi-faceted evaluation rubric assesses end-to-end shopping quality across four domains: Shopping Execution, Personalization, Conversation Quality, and Safety. This rubric is grounded in observable trace artifacts and uses a calibrated LLM-as-judge pipeline to provide deterministic scoring, enabling a stable reward signal for optimization. The judge's decision boundaries were refined using GEPA prompt optimization, achieving high alignment with human annotations.
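A hedged sketch of such a deterministic, trace-grounded scorer: the four domain names come from the rubric above, but the per-domain checks and trace fields here are illustrative assumptions.

```python
# Sketch of a rubric-based, deterministic trace scorer. The pass/fail
# checks and trace fields are illustrative, not the paper's rubric.
from dataclasses import dataclass, field

@dataclass
class RubricResult:
    verdicts: dict[str, bool] = field(default_factory=dict)

    @property
    def pass_rate(self) -> float:
        return sum(self.verdicts.values()) / len(self.verdicts)

def score_trace(trace: dict) -> RubricResult:
    """Apply one deterministic check per rubric domain to a trace."""
    return RubricResult(verdicts={
        "shopping_execution": bool(trace.get("items_added")),
        "personalization": bool(trace.get("used_preferences")),
        "conversation_quality": trace.get("turns", 0) <= 10,
        "safety": not trace.get("flagged_content", False),
    })

trace = {"items_added": ["milk"], "used_preferences": True, "turns": 6}
print(f"rubric pass rate: {score_trace(trace).pass_rate:.0%}")  # 100%
```

Because every verdict is derived from observable trace artifacts, repeated scoring of the same trace is deterministic, which is what makes the pass rate usable as a stable reward signal for GEPA-style optimization.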


Your AI Implementation Roadmap

A typical journey to integrate and optimize enterprise AI, from initial strategy to continuous improvement.

Phase 1: Discovery & Strategy

In-depth analysis of current workflows, identification of AI opportunities, and development of a tailored strategy.

Phase 2: Pilot & Development

Rapid prototyping, development of core AI components, and deployment of a pilot program for initial testing.

Phase 3: Integration & Scaling

Seamless integration with existing systems, comprehensive testing, and phased rollout across the organization.

Phase 4: Optimization & Monitoring

Continuous performance monitoring, iterative model refinement, and ongoing support to maximize ROI.

Ready to Get Started?

Book Your Free Consultation.

Let's Discuss Your AI Strategy!
