
Enterprise AI Analysis

AgentSelect: Benchmark for Narrative Query-to-Agent Recommendation

This paper introduces AGENTSELECT, a new benchmark for narrative query-to-agent recommendation. It standardizes heterogeneous evaluation artifacts into query-conditioned supervision for learning to rank compositional agent configurations. Comprising 111,179 queries and 107,721 deployable agents from 40+ sources, it covers LLM-only, toolkit-only, and compositional agents. The analysis reveals a shift from dense head reuse to sparse, long-tail supervision, where content-aware capability matching is crucial. Models trained on AGENTSELECT show strong performance, transfer well to real-world marketplaces, and induce capability-sensitive behavior, providing a unified infrastructure to accelerate the agent ecosystem.

Executive Impact

The rise of AI agents for task automation presents a selection dilemma: the space of possible configurations has exploded. AGENTSELECT addresses this by providing the first unified benchmark for ranking agents based on natural-language queries and their capability profiles (a model M paired with a toolset T). This enables learning to recommend end-to-end compositional agent configurations, a critical step toward democratized task automation. The benchmark's structure supports robust, content-aware matching, and its insights transfer to real-world agent marketplaces.

111,179 Narrative Queries
107,721 Deployable Agents
251,103 Interaction Records

Deep Analysis & Enterprise Applications


251,103 Positive Query-Agent Interactions Aggregated

AGENTSELECT unifies model-only, tool-only, and compositional settings into a single positive-only query-agent interaction benchmark. It converts heterogeneous evaluation artifacts into structured interaction data, providing a consistent training and evaluation interface for learning rankers at scale; a sketch of what one such record might look like follows below.
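
To make the interface concrete, here is a minimal sketch of one positive-only interaction record; the field names and values are illustrative assumptions, not the benchmark's published schema.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class AgentConfig:
        # A deployable agent is a capability profile: a backbone model plus a toolset.
        agent_id: str
        model: str                                       # e.g. an LLM identifier
        tools: List[str] = field(default_factory=list)   # empty for LLM-only agents

    @dataclass
    class Interaction:
        # Positive-only implicit feedback: the query was successfully served by the agent.
        query_id: str
        query_text: str   # narrative natural-language query
        agent_id: str
        label: int = 1    # only positives are observed; negatives are sampled at train time

    # Example record (hypothetical values):
    agent = AgentConfig("agent-00042", "llm-base-7b", ["web_search", "code_interpreter"])
    rec = Interaction("q-000117", "Summarize this earnings call and chart revenue by segment.",
                      agent.agent_id)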

Enterprise Process Flow

Data Collection (LLM Leaderboards, Tool Benchmarks)
Query Selection & Processing
Compositional Agent Synthesis
Agent Recommendation Model Training

The benchmark consists of 111,179 narrative queries, an agent catalog of 107,721 deployable agents, and 251,103 positive query-agent interactions. This massive scale, combined with its positive-only implicit feedback design, positions AGENTSELECT as a robust foundation for agent recommendation research.
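
A hedged sketch of the "Compositional Agent Synthesis" stage in the flow above, under the assumption that compositional agents are formed by pairing models from LLM leaderboards with toolsets from tool benchmarks; the identifiers and the cross-product strategy are our illustration, not the paper's exact procedure.

    from itertools import product

    models = ["llm-base-7b", "llm-pro-70b"]   # drawn from LLM leaderboards (hypothetical names)
    toolkits = [(), ("web_search",), ("web_search", "code_interpreter")]  # from tool benchmarks

    # Pairing each model with each toolset yields LLM-only agents (empty toolset)
    # and compositional agents; toolkit-only agents would be catalogued separately.
    catalog = [
        {"agent_id": f"agent-{i:05d}", "model": m, "tools": list(t)}
        for i, (m, t) in enumerate(product(models, toolkits))
    ]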

Key Findings by Method Family
CF/GNN Methods
  • Fragile under sparse, one-off supervision
  • Competitive only when dense co-occurrence exists
Content-aware DNN/Two-Tower Models
  • Robust in long-tail scenarios
  • Strength depends on text representations (tuned BERT, BGE-M3)
  • Content matching dominates ID-based retrieval
Generative Recommenders (OneRec)
  • Benefits from reuse, biased under sparse positives
  • Struggles with one-off supervision in large catalogs

The evaluation reveals a significant regime shift from dense head reuse to long-tail, near one-off supervision. This means traditional popularity-based methods (CF/GNN) become fragile, and content-aware capability matching becomes essential for effective agent selection.
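
The following sketch illustrates the content-aware two-tower matching these findings favor: the query and a serialized capability profile are embedded with a shared text encoder and ranked by similarity. BGE-M3 is one of the encoders named above; the serialization format and ranking loop are assumptions for illustration.

    from sentence_transformers import SentenceTransformer
    import numpy as np

    # Two-tower content matching: encode the narrative query and each agent's
    # capability text into a shared space, then rank agents by similarity.
    encoder = SentenceTransformer("BAAI/bge-m3")  # one of the encoders named in the findings

    def agent_text(agent):
        # Serialize the capability profile (model + tools) into text for the item tower.
        return f"model: {agent['model']}; tools: {', '.join(agent['tools']) or 'none'}"

    def rank(query, catalog, top_k=5):
        q = encoder.encode([query], normalize_embeddings=True)
        a = encoder.encode([agent_text(x) for x in catalog], normalize_embeddings=True)
        scores = (q @ a.T).ravel()            # cosine similarity via normalized dot product
        order = np.argsort(-scores)[:top_k]
        return [(catalog[i]["agent_id"], float(scores[i])) for i in order]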

MuleRun Marketplace Transfer

Models trained on AGENTSELECT consistently outperform untuned baselines on the external MuleRun agent marketplace. This demonstrates transferable supervision and practical value for real-world agent retrieval on unseen catalogs.

Counterfactual capability edits show that learned rankers shift preferences in the expected directions, indicating capability-sensitive behavior; a minimal version of this probe is sketched below. Furthermore, the synthesized interactions from Part III of the benchmark are learnable and improve coverage beyond realistic compositions alone.
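
A minimal version of that counterfactual probe, reusing the encoder and agent_text helper from the previous sketch: remove one tool from an agent's profile and check that its score for a tool-relevant query drops. The specific edit and the sign-of-delta check are illustrative, not the paper's exact protocol.

    def score(query, agent):
        # Similarity between the query and the agent's serialized capability profile.
        q = encoder.encode([query], normalize_embeddings=True)
        a = encoder.encode([agent_text(agent)], normalize_embeddings=True)
        return (q @ a.T).item()

    def counterfactual_edit(query, agent, tool_to_remove):
        # Re-score the same agent with one capability removed.
        edited = {**agent, "tools": [t for t in agent["tools"] if t != tool_to_remove]}
        delta = score(query, agent) - score(query, edited)
        return delta  # positive when removing a relevant tool lowers the score, as expected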

Advanced ROI Calculator

This calculator estimates potential annual savings and hours reclaimed by optimizing AI agent selection in your enterprise. By efficiently matching narrative queries to the most suitable agents, organizations can reduce trial-and-error costs, accelerate task completion, and improve resource allocation. The efficiency gain and cost multiplier are adjusted based on industry specifics.
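
The page does not publish the calculator's arithmetic; the sketch below is one plausible reading, with every parameter (efficiency gain, industry multiplier, task volumes) an explicit assumption.

    def roi_estimate(tasks_per_month, hours_per_task, hourly_cost,
                     efficiency_gain=0.25, industry_multiplier=1.0):
        # Hours reclaimed: tasks routed to a better-matched agent finish faster.
        hours_reclaimed = tasks_per_month * 12 * hours_per_task * efficiency_gain
        annual_savings = hours_reclaimed * hourly_cost * industry_multiplier
        return annual_savings, hours_reclaimed

    # Example with hypothetical inputs:
    savings, hours = roi_estimate(tasks_per_month=400, hours_per_task=1.5, hourly_cost=60)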


Implementation Roadmap

A typical phased approach to integrating intelligent agent selection within your enterprise:

Phase 1: Initial Assessment

Analyze existing agent usage, identify key pain points, and define core capabilities needed. Baseline current agent selection efficiency.

Phase 2: Benchmark Integration

Integrate AGENTSELECT benchmark data and train initial agent recommender models using enterprise-specific interaction logs (if available).

Phase 3: Pilot Deployment & Refinement

Deploy the agent recommender in a pilot environment. Collect feedback, monitor performance, and iteratively refine models for optimal matching.

Phase 4: Full-Scale Rollout

Expand the recommender to all relevant agent ecosystems. Implement continuous learning mechanisms and ensure seamless integration with existing workflows.

Ready to Optimize Your AI Agent Ecosystem?

Book a free consultation to explore how AGENTSELECT can transform your enterprise AI strategy.
