Research Paper Analysis
Automating High Energy Physics Data Analysis with LLM-Powered Agents
Presented by Eli Gendreau-Distler, Joshua Ho, Dongwon Kim, Luc Tomas Le Pottier, Haichen Wang, and Chengxi Yang.
This study demonstrates the use of Large Language Model (LLM) agents to automate a representative High Energy Physics (HEP) analysis, utilizing a hybrid system combining an LLM-based supervisor-coder agent with the Snakemake workflow manager. The framework enables systematic benchmarking of model capabilities, stability, and limitations in real-world scientific computing environments, revealing significant progress in automated data analysis.
Executive Impact: Key Achievements
Our innovative LLM-agent framework achieved significant milestones in automating complex HEP data analysis workflows.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
LLM Agents in Scientific Computing
Large Language Models (LLMs) and agent-based systems are increasingly being explored in scientific computing, with applications in genomics and software engineering. While High Energy Physics (HEP) data analyses are inherently structured, the use of LLMs in this domain remains at an early stage. This study introduces a framework for integrating LLM-based agents into a Snakemake-managed workflow, enabling controlled evaluation of agent-driven steps in a full collider-analysis setting.
Hybrid Architecture for HEP Analysis
The study employs a hybrid approach, combining a Snakemake workflow manager with a supervisor-coder LLM agent. Snakemake orchestrates the sequential execution of five analysis steps: ROOT file inspection, Ntuple conversion, Preprocessing, signal-background (S-B) separation, and Categorization. This design balances the deterministic structure of the analysis with the flexibility HEP analyses require, allowing the agent to autonomously generate, execute, and iteratively correct analysis code based on user instructions.
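For concreteness, below is a minimal Python sketch of the generate-execute-correct loop described above. The helper names (`run_llm`, `execute_script`), the retry limit, and the plain sequencing loop are illustrative assumptions, not the authors' implementation; in the actual framework, Snakemake handles the step sequencing.

```python
# Minimal sketch of a supervisor-coder loop over the five analysis steps.
# All names here are illustrative placeholders, not the paper's code.
import subprocess
import tempfile

STEPS = [
    "ROOT file inspection",
    "Ntuple conversion",
    "Preprocessing",
    "Signal-background separation",
    "Categorization",
]

MAX_RETRIES = 3  # assumed retry budget per step


def run_llm(prompt: str) -> str:
    """Placeholder for a call to the coder LLM; should return Python source."""
    raise NotImplementedError("wire up your LLM client here")


def execute_script(code: str) -> tuple[bool, str]:
    """Run generated code in a subprocess and report (success, stderr)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    proc = subprocess.run(["python", path], capture_output=True, text=True)
    return proc.returncode == 0, proc.stderr


def run_step(step: str, instructions: str) -> bool:
    """Generate, execute, and iteratively correct code for one analysis step."""
    prompt = f"Write Python code for the '{step}' step.\n{instructions}"
    for _ in range(MAX_RETRIES):
        code = run_llm(prompt)
        ok, stderr = execute_script(code)
        if ok:
            return True
        # Feed the error back so the coder agent can self-correct.
        prompt += f"\nThe previous attempt failed with:\n{stderr}\nPlease fix it."
    return False


def run_analysis(user_instructions: dict[str, str]) -> None:
    # In the paper, Snakemake handles this sequencing; a plain loop stands in here.
    for step in STEPS:
        if not run_step(step, user_instructions.get(step, "")):
            raise RuntimeError(f"step '{step}' failed after {MAX_RETRIES} attempts")
```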
Performance Statistics
A baseline study with the gemini-2.5-pro model, run across 219 experiments per step, showed success rates of 58% for data preparation, 88% for S-B separation, and 74% for categorization. Data preparation was the most error-prone stage (93 failures out of 219 trials), largely because the agent lacked sufficient context to identify objects within the ROOT files. S-B separation and categorization failed less often; categorization errors were most commonly function-calling and semantic issues, reflecting that step's greater complexity.
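As a quick sanity check on these figures, the sketch below tallies a success rate from per-trial outcomes; the record construction is illustrative and simply mirrors the reported 93-of-219 failure count for data preparation.

```python
# Sanity check: recompute a per-step success rate from trial outcomes.
# The outcome list below mirrors the reported baseline (93 failures in
# 219 data-preparation trials); it is not the study's raw data.
def success_rate(outcomes: list[bool]) -> float:
    return sum(outcomes) / len(outcomes)


data_preparation = [True] * (219 - 93) + [False] * 93
print(f"data preparation: {success_rate(data_preparation):.0%}")  # -> 58%
```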
Cross-Model LLM Benchmarking
The study benchmarked a selection of state-of-the-art LLMs, including the Gemini and GPT-5 series, the Claude family, and open-weight models. Figures 2a and 2b of the paper show significant variation in reliability and error patterns across architectures. Models in the GPT-5 and Gemini families generally achieved higher completion fractions, while smaller or open-weight models such as gpt-oss-120b showed lower but non-negligible success rates. This demonstrates the feasibility and reproducibility of the agentic workflow across diverse LLM architectures.
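A cross-model benchmark of this kind reduces to repeating the full pipeline a fixed number of times per model and recording the completion fraction. The harness below is a hedged sketch: the model list and the `run_pipeline` callable are placeholders, not the study's actual configuration.

```python
# Sketch of a cross-model benchmark harness: rerun the full agentic pipeline
# n_trials times per model and record the completion fraction.
from typing import Callable

MODELS = ["gemini-2.5-pro", "gpt-5-codex", "claude-sonnet-4.5", "gpt-oss-120b"]


def benchmark(
    run_pipeline: Callable[[str], bool],  # runs all five steps, True on success
    n_trials: int = 10,
) -> dict[str, float]:
    completion_fraction = {}
    for model in MODELS:
        successes = sum(run_pipeline(model) for _ in range(n_trials))
        completion_fraction[model] = successes / n_trials
    return completion_fraction


# Usage (with a stub that pretends every trial succeeds):
if __name__ == "__main__":
    print(benchmark(lambda model: True, n_trials=2))
```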
Systematic Error Classification
Failure modes were systematically categorized into seven types, including "all data weights = 0", "intermediate file not found", "function-calling error", and "semantic error". An LLM (gpt-oss-120b) was used to perform the classification. A notable semantic error occurred in the categorization step: the agent misinterpreted user intent, initialized the boundary scan incorrectly, and generated an extra boundary, causing the trial to fail. This highlights the importance of precise prompting and validation.
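A possible shape for such LLM-assisted classification is sketched below, assuming gpt-oss-120b is served behind an OpenAI-compatible endpoint; the prompt wording, endpoint URL, and the truncated category list are assumptions, not the paper's exact setup.

```python
# Sketch of LLM-assisted failure classification via an OpenAI-compatible
# endpoint assumed to serve gpt-oss-120b locally.
from openai import OpenAI

CATEGORIES = [
    "all data weights = 0",
    "intermediate file not found",
    "function-calling error",
    "semantic error",
    # ... remaining categories from the paper's seven-type taxonomy
]

# Endpoint and key are placeholders for a self-hosted deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")


def classify_failure(log_excerpt: str) -> str:
    """Ask the classifier model to map a failure log onto one category."""
    prompt = (
        "Classify the following failed analysis-step log into exactly one of "
        f"these categories: {', '.join(CATEGORIES)}.\n\nLog:\n{log_excerpt}\n\n"
        "Answer with the category name only."
    )
    response = client.chat.completions.create(
        model="gpt-oss-120b",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```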
Economic Footprint of LLM Agents
The study measured the estimated monetary cost per workflow step, based on token usage and public pricing. The preprocessing stage was generally the most expensive due to intensive internal prompting and multi-step reasoning. S-B separation was consistently inexpensive, while categorization showed high variability. Higher-capacity proprietary models often required less agent work and retried more stably, but incurred higher costs under per-token pricing; open-weight models, despite their lower absolute cost, were less robust.
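The underlying cost estimate is simply token usage multiplied by the published per-million-token prices. The sketch below illustrates the arithmetic; the prices mirror the comparison table further down, while the token counts are invented for the example.

```python
# Sketch: estimating per-step monetary cost from token usage and public
# per-million-token prices. Token counts below are illustrative only.
PRICES = {  # (input $/1M tokens, output $/1M tokens)
    "gemini-2.5-pro": (1.25, 10.00),
    "claude-sonnet-4.5": (3.00, 15.00),
}


def step_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p_in, p_out = PRICES[model]
    return (input_tokens * p_in + output_tokens * p_out) / 1_000_000


# Example: a preprocessing step consuming 400k input and 50k output tokens.
print(f"${step_cost('gemini-2.5-pro', 400_000, 50_000):.2f}")  # -> $1.00
```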
Current Limitations & Future Directions
While LLMs successfully supported HEP data analysis via natural language interpretation, code generation, and basic self-correction, multi-step task planning is currently beyond the scope of this study. Future work aims to strengthen the framework through improvements in prompting, agent design, domain adaptation, and retrieval-augmented generation. The Snakemake integration provides a natural path toward rule-based agent planning, enhancing reproducibility and control.
Enterprise Process Flow
LLM Agent Performance Comparison
A comparative overview of selected LLM agents across key performance indicators for the first step of the analysis pipeline. Costs are per 1M tokens.
| Model | Step 1 Success Rate | Input Cost | Output Cost |
|---|---|---|---|
| Claude Sonnet 4.5 | 100% (10/10) | $3.00 | $15.00 |
| Gemini 2.5 Pro | 100% (10/10) | $1.25 | $10.00 |
| GPT-5 Codex | 100% (10/10) | $1.25 | $10.00 |
| GPT-OSS-120B | 72% (54/75) | $0.00 | $0.00 |
| Qwen-3 (235B) | 100% (10/10) | $0.50 | $2.00 |
Case Study: Semantic Error in Categorization
A critical analysis revealed a semantic error in the categorization stage where the Supervisor-Coder agent misinterpreted user intent. The agent failed to initialize the significance array with the null selection, leading to an extra boundary being generated. This resulted in five categorization boundaries instead of the expected four, causing the trial to be marked as unsuccessful. This incident underscores the importance of precise communication and robust validation mechanisms in complex, iterative AI-driven workflows.
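To make the failure mode concrete, the sketch below shows a greedy boundary scan in which the running best significance is seeded with the null selection (no boundaries); omitting that seed is the kind of initialization slip described here, since the first candidate is then accepted unconditionally and the scan ends with an extra boundary. The helper names (`combined_significance`, `candidate_cuts`) and the scan logic are assumptions, not the paper's categorization code.

```python
# Illustrative greedy boundary scan: keep adding category boundaries while
# the combined significance improves. Helper names are assumptions.
from typing import Callable, Sequence


def scan_boundaries(
    candidate_cuts: Sequence[float],
    combined_significance: Callable[[list[float]], float],
) -> list[float]:
    boundaries: list[float] = []
    # Seed the running best with the null selection (no boundaries at all).
    # Skipping this seed makes the first candidate "improve" on nothing,
    # which is how one extra boundary slips into the final set.
    best = combined_significance(boundaries)
    while True:
        trial_best, trial_cut = best, None
        for cut in candidate_cuts:
            if cut in boundaries:
                continue
            sig = combined_significance(sorted(boundaries + [cut]))
            if sig > trial_best:
                trial_best, trial_cut = sig, cut
        if trial_cut is None:  # no remaining cut improves the significance
            return sorted(boundaries)
        boundaries.append(trial_cut)
        best = trial_best
```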
Calculate Your Potential AI ROI
Estimate the efficiency gains and cost savings your enterprise could achieve by automating workflows with LLM-powered agents.
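As a starting point, the sketch below shows one simple way to frame the calculation; every figure is a placeholder for your own inputs, and none of the numbers come from the study.

```python
# Minimal ROI sketch for the calculator above. All inputs are placeholders.
def annual_roi(
    hours_saved_per_month: float,
    hourly_rate: float,
    monthly_llm_spend: float,
    monthly_platform_cost: float,
) -> float:
    """Return ROI as a fraction: (net annual savings) / (total annual cost)."""
    savings = hours_saved_per_month * hourly_rate * 12
    cost = (monthly_llm_spend + monthly_platform_cost) * 12
    return (savings - cost) / cost


# Example: 80 hours/month saved at $90/hour vs. $1,500/month total spend.
print(f"{annual_roi(80, 90, 1_000, 500):.0%}")  # -> 380%
```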
Your AI Implementation Roadmap
A typical journey to integrate LLM-powered agents into your enterprise workflows.
Phase 01: Discovery & Strategy
In-depth analysis of existing workflows, identification of automation opportunities, and strategic planning for LLM agent integration. Define success metrics and a phased rollout plan.
Phase 02: Pilot Development & Testing
Develop and deploy a pilot LLM-agent system on a selected workflow. Rigorous testing, validation, and iterative refinement based on performance data and user feedback.
Phase 03: Scaled Deployment & Integration
Expand LLM agent deployment across identified workflows. Seamless integration with existing enterprise systems and infrastructure, ensuring data security and compliance.
Phase 04: Performance Monitoring & Optimization
Continuous monitoring of agent performance, efficiency, and ROI. Ongoing optimization, model retraining, and adaptation to evolving business needs and data patterns.
Ready to Transform Your Enterprise with AI?
Book a personalized consultation to explore how LLM-powered agents can streamline your operations, enhance efficiency, and drive innovation.