
Enterprise AI Analysis

TML-bench: Benchmark for Data Science Agents on Tabular ML Tasks

This paper introduces TML-bench, a rigorous benchmark for evaluating autonomous data science agents on Kaggle-style tabular ML tasks. It assesses 10 open-source LLMs across four competitions and three time budgets, focusing on reliability, correctness, and performance under strict protocols. MiniMax-M2.1-TEE emerges as the top performer, demonstrating the current state-of-the-art in autonomous tabular ML.

Executive Impact: Key Metrics & Opportunities

Understand the quantitative performance and implications for deploying autonomous data science agents in your enterprise.

10 LLM Agents Evaluated
4 Kaggle Competitions
3 Time Budgets Tested
Top Performing Model: MiniMax-M2.1-TEE

Deep Analysis & Enterprise Applications

The modules below explore the specific findings from the research, reframed as enterprise-focused analysis.

TML-bench establishes a strict protocol for evaluating data science agents on real-world Kaggle-style tabular ML tasks. The benchmark emphasizes end-to-end correctness, reliability, and performance under time constraints.

Key design principles include: deterministic preparation, strict submission validation, and private-holdout scoring (not accessible to the agent). Internet access is disabled during runs, and models are selected with knowledge cutoffs predating competition start dates to prevent contamination.
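The strict submission validation step can be pictured with a minimal sketch like the one below, assuming a Kaggle-style sample_submission.csv that fixes the required columns and row ids; the file names and the exact checks are illustrative, not taken from the paper.

```python
# Minimal sketch of strict submission validation, assuming a Kaggle-style
# sample_submission.csv defines the required columns and row ids.
import pandas as pd

def validate_submission(submission_path: str, sample_path: str) -> bool:
    sub = pd.read_csv(submission_path)
    sample = pd.read_csv(sample_path)

    # Columns must match the sample submission exactly (names and order).
    if list(sub.columns) != list(sample.columns):
        return False
    # Row count and id set must match the sample; no missing predictions.
    id_col = sample.columns[0]
    if len(sub) != len(sample) or set(sub[id_col]) != set(sample[id_col]):
        return False
    if sub.isna().any().any():
        return False
    return True
```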

Each model is run multiple times per task and budget, and results are reported as the median of the earliest five successful runs. Scores are min-max normalized for cross-competition comparability.
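A minimal sketch of this aggregation follows: median of the earliest five successful runs per setting, then per-competition min-max normalization. How the benchmark handles metric direction (for example, lower-is-better RMSE) is an assumption encoded in the higher_is_better flag.

```python
# Sketch of the score aggregation described above. Variable names are
# illustrative; metric-direction handling is an assumption.
import numpy as np

def median_of_first_successes(run_scores, k=5):
    """run_scores: raw scores in launch order; None marks a failed run."""
    successes = [s for s in run_scores if s is not None][:k]
    return float(np.median(successes))

def min_max_normalize(per_model_scores, higher_is_better=True):
    """Normalize one competition's per-model scores to [0, 1]."""
    lo, hi = min(per_model_scores.values()), max(per_model_scores.values())
    if hi == lo:
        return {m: 1.0 for m in per_model_scores}
    norm = {m: (v - lo) / (hi - lo) for m, v in per_model_scores.items()}
    return norm if higher_is_better else {m: 1.0 - v for m, v in norm.items()}
```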

Enterprise Process Flow: TML-bench Evaluation Protocol

1. Agent receives the task (Kilo Code)
2. Fixed instruction set
3. Time budget enforcement (240s / 600s / 1200s)
4. Agent generates a submission file
5. Submission validation (format checks)
6. Private-holdout scoring
7. Median of the earliest five successful runs
8. Min-max normalization

Approximate cost to run the full suite: about $10, underscoring the benchmark's efficiency.
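The time-budget enforcement step can be sketched as a wall-clock timeout around the agent-generated solution script. The subprocess-based approach and the "solution.py" script name below are assumptions, since this summary does not describe the Kilo Code harness internals.

```python
# Illustrative sketch of wall-clock budget enforcement around an agent run.
import subprocess

def run_with_budget(script_path, budget_s):
    """Run the agent-generated solution script; treat overruns as failures."""
    try:
        proc = subprocess.run(
            ["python", script_path],
            timeout=budget_s,           # 240, 600, or 1200 seconds
            capture_output=True,
        )
        return proc.returncode == 0     # success only if it exits cleanly in time
    except subprocess.TimeoutExpired:
        return False                    # budget exceeded: the run counts as failed

# Example: run_with_budget("solution.py", 600)
```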

The evaluation spanned 10 OSS LLMs across four diverse Kaggle tabular competitions, under three distinct time budgets. The primary aggregation method identifies the best normalized score per competition for each model, then averages these scores.
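A short sketch of this primary aggregation, with an illustrative data layout (a mapping from (model, competition, budget) to normalized score):

```python
# Sketch of the primary aggregation: best normalized score per competition for
# each model (across the three budgets), averaged over the four competitions.
from collections import defaultdict

def primary_aggregate(norm_scores):
    """norm_scores: {(model, competition, budget_s): normalized score in [0, 1]}."""
    best = defaultdict(dict)  # model -> {competition: best score over budgets}
    for (model, comp, _budget), score in norm_scores.items():
        best[model][comp] = max(best[model].get(comp, 0.0), score)
    # Average each model's per-competition bests.
    return {model: sum(per_comp.values()) / len(per_comp)
            for model, per_comp in best.items()}
```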

MiniMax-M2.1-TEE consistently achieved the best aggregate performance score across all four competitions. Other strong performers include Qwen3-Coder-480B-A35B-Instruct-FP8 and GLM-4.7-FP8, demonstrating robust capabilities across various tasks and budgets.

Performance generally improved with larger time budgets, indicating that more time allows agents to develop better solutions, although scaling was noisy for some models at lower run counts.

Top 3 Performers (Aggregate Score)

MiniMax-M2.1-TEE (Primary Aggregation Score: 1.0)
  • Best overall aggregate performance.
  • Consistent across budgets and competitions.

Qwen3-Coder-480B-A35B-Instruct-FP8 (Primary Aggregation Score: ~0.95)
  • Strong baseline performance.
  • Competitive at shorter budgets.

GLM-4.7-FP8 (Primary Aggregation Score: ~0.92)
  • Reliable performance.
  • Good stability across runs.

MiniMax-M2.1-TEE: the leading autonomous agent for tabular ML tasks.

Reliability in autonomous agents is crucial. TML-bench measures this through run success rate (producing a valid submission) and within-setting stability (variability across five runs).

Results show a meaningful variation in reliability even among top performers. While some models achieve near 100% success rates and low variability, others struggle with consistency, exhibiting broad interquartile ranges (IQRs) in their scores.

For example, GLM 4.7 Flash at the 1200s budget showed a median RMSE of 0.107502 with a substantially wider IQR (0.070186 to 0.221725) than neighboring models, indicating lower stability.
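As a rough sketch, the two reliability measures can be computed as below; defining the relative IQR as (Q3 - Q1) / |median| is an assumption about how the paper normalizes spread.

```python
# Rough sketch of run success rate and within-setting stability (relative IQR).
import numpy as np

def success_rate(run_produced_valid_submission):
    """Fraction of runs that produced a valid submission."""
    return sum(run_produced_valid_submission) / len(run_produced_valid_submission)

def relative_iqr(scores):
    """Spread of successful-run scores relative to their median."""
    q1, med, q3 = np.percentile(scores, [25, 50, 75])
    return (q3 - q1) / abs(med)

# With the GLM 4.7 Flash figures quoted above, the same arithmetic gives
# (0.221725 - 0.070186) / 0.107502 ≈ 1.41, i.e. an IQR wider than the median.
```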

Reliability & Stability Highlights

MiniMax-M2.1-TEE: high success rate (0.95+); low relative IQR (0.00-0.01)
  • Excellent consistency and reliability.
Qwen3-Coder-480B-A35B-Instruct-FP8: high success rate (0.95+); low relative IQR (0.00-0.01)
  • Strong reliability across tasks.
gpt-oss-120b-TEE: medium success rate (0.75-0.85); variable relative IQR (0.01-0.05)
  • Notable inconsistencies in some settings.
GLM 4.7 Flash (1200s): high success rate; high IQR (0.07-0.22)
  • Good success rate but high score variability.

The benchmark evaluates performance across three time budgets (240s, 600s, 1200s). This allows insights into an agent's ability to leverage more time for iterative improvement and deeper problem-solving.

Overall, average performance improves with larger time budgets, reflecting the expected monotonic pattern. However, at the individual model level, scaling can be noisy for some models due to the current run count (five successful runs per setting).

The paper reports that 57.5% of model×competition curves are monotone, meaning their performance did not worsen as the budget increased. This highlights varying capabilities among agents to effectively utilize additional computation time.

Monotonicity rate across model/competition curves: 57.5% (performance did not worsen as the budget increased).
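A small sketch of the monotonicity check behind this figure, assuming higher normalized scores are better and each curve is ordered by increasing budget:

```python
# A model x competition curve is monotone if its normalized score never
# worsens as the budget grows from 240s to 600s to 1200s.
def is_monotone(curve):
    """curve: scores ordered by increasing budget, e.g. [s_240, s_600, s_1200]."""
    return all(later >= earlier for earlier, later in zip(curve, curve[1:]))

def monotonicity_rate(curves):
    """Fraction of curves that never worsen with a larger budget."""
    return sum(is_monotone(c) for c in curves) / len(curves)
```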

Impact of Increased Budget

While performance generally improves with more time, the scaling is not always linear or perfectly smooth. For instance, some models like DeepSeek-TNG-R1T2-Chimera show significant marginal gains in score when moving from 240s to 600s, suggesting they benefit greatly from initial time increases. Others, like MiniMax-M2.1-TEE, maintain strong performance even at shorter budgets and show steady but less dramatic gains. This underscores that different agents have distinct strategies for leveraging time.

Calculate Your Potential AI ROI

Estimate the efficiency gains and cost savings your enterprise could achieve by automating tabular data science tasks with advanced AI agents.
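A back-of-the-envelope sketch of the calculation such an estimate might rest on; all parameters and the example figures are hypothetical placeholders, not results from the paper.

```python
# Hypothetical ROI estimate: hours reclaimed and annual savings from
# automating a share of tabular data science work.
def estimate_roi(tasks_per_year, hours_per_task, loaded_hourly_cost, automation_fraction):
    """Return (hours reclaimed annually, estimated annual savings)."""
    hours_reclaimed = tasks_per_year * hours_per_task * automation_fraction
    savings = hours_reclaimed * loaded_hourly_cost
    return hours_reclaimed, savings

# Example: 200 tabular modelling tasks a year, 6 hours each, $120/hour fully
# loaded cost, 40% of the work automated by an agent:
# estimate_roi(200, 6, 120, 0.4) -> (480.0, 57600.0)
```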


Your AI Transformation Roadmap

A typical phased approach to integrate autonomous data science agents into your existing workflows.

Phase 1: Discovery & Strategy

Assess current data science workflows, identify automation opportunities, and define clear business objectives. Select appropriate AI agent technologies based on TML-bench insights.

Phase 2: Pilot & Integration

Implement a pilot project on a critical tabular task. Integrate agents with existing data pipelines and validate performance against established benchmarks like TML-bench.

Phase 3: Scaling & Optimization

Expand AI agent deployment across more tasks and departments. Continuously monitor performance, refine agent instructions, and optimize resource allocation.

Phase 4: Advanced Capabilities

Explore multi-agent orchestration, advanced feature engineering, and real-time inference capabilities for competitive advantage.

Ready to Benchmark Your AI Strategy?

Connect with our experts to understand how autonomous data science agents can elevate your enterprise's capabilities and drive efficiency.

Ready to Get Started?

Book Your Free Consultation.
