Enterprise AI Analysis
TML-bench: Benchmark for Data Science Agents on Tabular ML Tasks
This paper introduces TML-bench, a rigorous benchmark for evaluating autonomous data science agents on Kaggle-style tabular ML tasks. It assesses 10 open-source LLMs across four competitions and three time budgets, focusing on reliability, correctness, and performance under strict protocols. MiniMax-M2.1-TEE emerges as the top performer, demonstrating the current state-of-the-art in autonomous tabular ML.
Executive Impact: Key Metrics & Opportunities
Understand the quantitative performance and implications for deploying autonomous data science agents in your enterprise.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
TML-bench establishes a strict protocol for evaluating data science agents on real-world Kaggle-style tabular ML tasks. The benchmark emphasizes end-to-end correctness, reliability, and performance under time constraints.
Key design principles include: deterministic preparation, strict submission validation, and private-holdout scoring (not accessible to the agent). Internet access is disabled during runs, and models are selected with knowledge cutoffs predating competition start dates to prevent contamination.
Each model is run five times per task and budget, and results are reported as the median of the earliest five successful runs. Scores are min-max normalized for cross-competition comparability.
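The scoring protocol above can be sketched as follows. This is a minimal illustration, not the benchmark's actual code; the function names and the RMSE values are ours, and `worst`/`best` stand in for whatever per-competition extremes the benchmark uses for min-max normalization.

```python
import statistics

def normalize(score, worst, best):
    """Min-max normalize a raw competition score to [0, 1].

    `worst` and `best` are the per-competition extremes, direction-aware:
    for error metrics like RMSE, `best` is the smaller value.
    """
    if best == worst:
        return 0.0
    return (score - worst) / (best - worst)

def setting_score(run_scores, worst, best):
    """Median of the earliest five successful runs, normalized."""
    earliest_five = run_scores[:5]  # runs assumed ordered by completion time
    return statistics.median(normalize(s, worst, best) for s in earliest_five)

# Illustrative RMSE values (lower is better):
runs = [0.11, 0.12, 0.10, 0.13, 0.11]
print(round(setting_score(runs, worst=0.25, best=0.05), 3))  # prints 0.7
```

Taking the median rather than the mean keeps a single outlier run from dominating the reported score for a setting.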
Enterprise Process Flow: TML-bench Evaluation Protocol
The evaluation spanned 10 OSS LLMs across four diverse Kaggle tabular competitions, under three distinct time budgets. The primary aggregation method identifies the best normalized score per competition for each model, then averages these scores.
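The primary aggregation can be expressed as a short sketch, assuming illustrative model and competition names and made-up normalized scores (not the paper's numbers):

```python
def primary_aggregate(scores):
    """scores: {model: {competition: [normalized scores, one per budget]}}.

    For each model, take the best normalized score per competition,
    then average those bests across competitions.
    """
    agg = {}
    for model, comps in scores.items():
        bests = [max(vals) for vals in comps.values()]
        agg[model] = sum(bests) / len(bests)
    return agg

# Hypothetical inputs for illustration only:
scores = {
    "model-a": {"comp1": [0.80, 0.90], "comp2": [0.70, 0.95]},
    "model-b": {"comp1": [0.60, 0.85], "comp2": [0.90, 0.88]},
}
print(primary_aggregate(scores))
```

Taking the best score per competition credits a model's strongest budget setting, so the aggregate rewards peak capability rather than average behavior across budgets.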
MiniMax-M2.1-TEE consistently achieved the best aggregate performance score across all four competitions. Other strong performers include Qwen3-Coder-480B-A35B-Instruct-FP8 and GLM-4.7-FP8, demonstrating robust capabilities across various tasks and budgets.
Performance generally improved with larger time budgets, indicating that more time allows agents to develop better solutions, although scaling was noisy for some models at lower run counts.
| Model | Primary Aggregation Score |
|---|---|
| MiniMax-M2.1-TEE | 1.0 |
| Qwen3-Coder-480B-A35B-Instruct-FP8 | ~0.95 |
| GLM-4.7-FP8 | ~0.92 |
Reliability in autonomous agents is crucial. TML-bench measures this through run success rate (producing a valid submission) and within-setting stability (variability across five runs).
Results show a meaningful variation in reliability even among top performers. While some models achieve near 100% success rates and low variability, others struggle with consistency, exhibiting broad interquartile ranges (IQRs) in their scores.
For example, GLM 4.7 Flash at 1200s showed a median RMSE of 0.107502 with a substantially wider IQR (0.070186 to 0.221725) than neighboring models, indicating lower stability.
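A stability summary of this kind can be computed from per-run scores as below. This is a sketch with illustrative run values, not the benchmark's raw data, and the exact quantile method the paper uses is an assumption on our part:

```python
import statistics

def run_stability(run_scores):
    """Within-setting stability: median and interquartile range (IQR,
    the spread between the 25th and 75th percentiles) across runs."""
    q1, _, q3 = statistics.quantiles(run_scores, n=4)  # exclusive method
    return {"median": statistics.median(run_scores), "iqr": q3 - q1}

# Illustrative RMSE values for five runs of one model at one budget:
stats = run_stability([0.10, 0.11, 0.10, 0.22, 0.07])
print(stats)
```

A large IQR relative to the median flags a model that sometimes lands a good solution and sometimes does not, which matters as much as the median itself for production use.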
| Model/Setting | Success Rate | Stability (Relative IQR) |
|---|---|---|
| MiniMax-M2.1-TEE | High (0.95+) | Low (0.00-0.01) |
| Qwen3-Coder-480B-A35B-Instruct-FP8 | High (0.95+) | Low (0.00-0.01) |
| gpt-oss-120b-TEE | Medium (0.75-0.85) | Variable (0.01-0.05) |
| GLM 4.7 Flash (1200s) | High | High (0.07-0.22) |
The benchmark evaluates performance across three time budgets (240s, 600s, 1200s). This allows insights into an agent's ability to leverage more time for iterative improvement and deeper problem-solving.
Overall, average performance improves with larger time budgets, reflecting the expected monotonic pattern. However, at the individual model level, scaling can be noisy for some models due to the current run count (five successful runs per setting).
The paper reports that 57.5% of model×competition curves are monotone, meaning their performance did not worsen as the budget increased. This highlights varying capabilities among agents to effectively utilize additional computation time.
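The monotonicity check behind that statistic can be sketched as follows, treating one model×competition curve as a mapping from time budget to normalized score (the example curves are hypothetical, not the paper's data):

```python
def is_monotone(curve):
    """True if the normalized score never decreases as the budget grows.

    `curve` maps a time budget in seconds to the model's normalized
    score on one competition at that budget.
    """
    scores = [curve[budget] for budget in sorted(curve)]
    return all(a <= b for a, b in zip(scores, scores[1:]))

# Illustrative curves across the three budgets:
print(is_monotone({240: 0.60, 600: 0.72, 1200: 0.80}))  # True
print(is_monotone({240: 0.70, 600: 0.65, 1200: 0.75}))  # False
```

The reported 57.5% would then be the fraction of model×competition curves for which this check returns True.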
Impact of Increased Budget
While performance generally improves with more time, the scaling is not always linear or perfectly smooth. For instance, some models like DeepSeek-TNG-R1T2-Chimera show significant marginal gains in score when moving from 240s to 600s, suggesting they benefit greatly from initial time increases. Others, like MiniMax-M2.1-TEE, maintain strong performance even at shorter budgets and show steady but less dramatic gains. This underscores that different agents have distinct strategies for leveraging time.
Calculate Your Potential AI ROI
Estimate the efficiency gains and cost savings your enterprise could achieve by automating tabular data science tasks with advanced AI agents.
Your AI Transformation Roadmap
A typical phased approach to integrate autonomous data science agents into your existing workflows.
Phase 1: Discovery & Strategy
Assess current data science workflows, identify automation opportunities, and define clear business objectives. Select appropriate AI agent technologies based on TML-bench insights.
Phase 2: Pilot & Integration
Implement a pilot project on a critical tabular task. Integrate agents with existing data pipelines and validate performance against established benchmarks like TML-bench.
Phase 3: Scaling & Optimization
Expand AI agent deployment across more tasks and departments. Continuously monitor performance, refine agent instructions, and optimize resource allocation.
Phase 4: Advanced Capabilities
Explore multi-agent orchestration, advanced feature engineering, and real-time inference capabilities for competitive advantage.
Ready to Benchmark Your AI Strategy?
Connect with our experts to understand how autonomous data science agents can elevate your enterprise's capabilities and drive efficiency.