
Enterprise AI Analysis

TML-bench: Benchmark for Data Science Agents on Tabular ML Tasks

This paper introduces TML-bench, a rigorous benchmark for evaluating autonomous data science agents on Kaggle-style tabular ML tasks. It assesses 10 open-source LLMs across four competitions and three time budgets, focusing on reliability, correctness, and performance under strict protocols. MiniMax-M2.1-TEE emerges as the top performer, demonstrating the current state-of-the-art in autonomous tabular ML.

Executive Impact: Key Metrics & Opportunities

Understand the quantitative performance and implications for deploying autonomous data science agents in your enterprise.

10 LLM Agents Evaluated
4 Kaggle Competitions
3 Time Budgets Tested
Top Performing Model: MiniMax-M2.1-TEE

Deep Analysis & Enterprise Applications

The modules below explore the specific findings from the research, reframed as enterprise-focused analysis.

TML-bench establishes a strict protocol for evaluating data science agents on real-world Kaggle-style tabular ML tasks. The benchmark emphasizes end-to-end correctness, reliability, and performance under time constraints.

Key design principles include: deterministic preparation, strict submission validation, and private-holdout scoring (not accessible to the agent). Internet access is disabled during runs, and models are selected with knowledge cutoffs predating competition start dates to prevent contamination.
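The strict submission validation step can be pictured with a minimal sketch like the one below, assuming a Kaggle-style sample_submission.csv that fixes the required columns and row ids; the file names and the exact checks are illustrative, not taken from the paper.

```python
# Minimal sketch of strict submission validation, assuming a Kaggle-style
# sample_submission.csv defines the required columns and row ids.
import pandas as pd

def validate_submission(submission_path: str, sample_path: str) -> bool:
    sub = pd.read_csv(submission_path)
    sample = pd.read_csv(sample_path)

    # Columns must match the sample submission exactly (names and order).
    if list(sub.columns) != list(sample.columns):
        return False
    # Row count and id set must match the sample; no missing predictions.
    id_col = sample.columns[0]
    if len(sub) != len(sample) or set(sub[id_col]) != set(sample[id_col]):
        return False
    if sub.isna().any().any():
        return False
    return True
```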

Each model is run multiple times per task and budget, and results are reported as the median of the earliest five successful runs. Scores are min-max normalized for cross-competition comparability.
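A minimal sketch of this aggregation follows: median of the earliest five successful runs per setting, then per-competition min-max normalization. How the benchmark handles metric direction (for example, lower-is-better RMSE) is an assumption encoded in the higher_is_better flag.

```python
# Sketch of the score aggregation described above. Variable names are
# illustrative; metric-direction handling is an assumption.
import numpy as np

def median_of_first_successes(run_scores, k=5):
    """run_scores: raw scores in launch order; None marks a failed run."""
    successes = [s for s in run_scores if s is not None][:k]
    return float(np.median(successes))

def min_max_normalize(per_model_scores, higher_is_better=True):
    """Normalize one competition's per-model scores to [0, 1]."""
    lo, hi = min(per_model_scores.values()), max(per_model_scores.values())
    if hi == lo:
        return {m: 1.0 for m in per_model_scores}
    norm = {m: (v - lo) / (hi - lo) for m, v in per_model_scores.items()}
    return norm if higher_is_better else {m: 1.0 - v for m, v in norm.items()}
```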

Enterprise Process Flow: TML-bench Evaluation Protocol

1. Agent receives the task (Kilo Code)
2. Fixed instruction set
3. Time budget enforcement (240s / 600s / 1200s)
4. Agent generates a submission file
5. Submission validation (format checks)
6. Private-holdout scoring
7. Median of the earliest five successful runs
8. Min-max normalization

Approximate cost to run the full suite: about $10, underscoring the benchmark's efficiency.
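The time-budget enforcement step can be sketched as a wall-clock timeout around the agent-generated solution script. The subprocess-based approach and the "solution.py" script name below are assumptions, since this summary does not describe the Kilo Code harness internals.

```python
# Illustrative sketch of wall-clock budget enforcement around an agent run.
import subprocess

def run_with_budget(script_path, budget_s):
    """Run the agent-generated solution script; treat overruns as failures."""
    try:
        proc = subprocess.run(
            ["python", script_path],
            timeout=budget_s,           # 240, 600, or 1200 seconds
            capture_output=True,
        )
        return proc.returncode == 0     # success only if it exits cleanly in time
    except subprocess.TimeoutExpired:
        return False                    # budget exceeded: the run counts as failed

# Example: run_with_budget("solution.py", 600)
```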

The evaluation spanned 10 OSS LLMs across four diverse Kaggle tabular competitions, under three distinct time budgets. The primary aggregation method identifies the best normalized score per competition for each model, then averages these scores.
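A short sketch of this primary aggregation, with an illustrative data layout (a mapping from (model, competition, budget) to normalized score):

```python
# Sketch of the primary aggregation: best normalized score per competition for
# each model (across the three budgets), averaged over the four competitions.
from collections import defaultdict

def primary_aggregate(norm_scores):
    """norm_scores: {(model, competition, budget_s): normalized score in [0, 1]}."""
    best = defaultdict(dict)  # model -> {competition: best score over budgets}
    for (model, comp, _budget), score in norm_scores.items():
        best[model][comp] = max(best[model].get(comp, 0.0), score)
    # Average each model's per-competition bests.
    return {model: sum(per_comp.values()) / len(per_comp)
            for model, per_comp in best.items()}
```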

MiniMax-M2.1-TEE consistently achieved the best aggregate performance score across all four competitions. Other strong performers include Qwen3-Coder-480B-A35B-Instruct-FP8 and GLM-4.7-FP8, demonstrating robust capabilities across various tasks and budgets.

Performance generally improved with larger time budgets, indicating that more time allows agents to develop better solutions, although scaling was noisy for some models at lower run counts.

Top 3 Performers (Aggregate Score)

MiniMax-M2.1-TEE (Primary Aggregation Score: 1.0)
  • Best overall aggregate performance.
  • Consistent across budgets and competitions.

Qwen3-Coder-480B-A35B-Instruct-FP8 (Primary Aggregation Score: ~0.95)
  • Strong baseline performance.
  • Competitive at shorter budgets.

GLM-4.7-FP8 (Primary Aggregation Score: ~0.92)
  • Reliable performance.
  • Good stability across runs.

MiniMax-M2.1-TEE: the leading autonomous agent for tabular ML tasks.

Reliability in autonomous agents is crucial. TML-bench measures this through run success rate (producing a valid submission) and within-setting stability (variability across five runs).

Results show a meaningful variation in reliability even among top performers. While some models achieve near 100% success rates and low variability, others struggle with consistency, exhibiting broad interquartile ranges (IQRs) in their scores.

For example, GLM 4.7 Flash at the 1200s budget showed a median RMSE of 0.107502 with a substantially wider IQR (0.070186 to 0.221725) than neighboring models, indicating lower stability.
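As a rough sketch, the two reliability measures can be computed as below; defining the relative IQR as (Q3 - Q1) / |median| is an assumption about how the paper normalizes spread.

```python
# Rough sketch of run success rate and within-setting stability (relative IQR).
import numpy as np

def success_rate(run_produced_valid_submission):
    """Fraction of runs that produced a valid submission."""
    return sum(run_produced_valid_submission) / len(run_produced_valid_submission)

def relative_iqr(scores):
    """Spread of successful-run scores relative to their median."""
    q1, med, q3 = np.percentile(scores, [25, 50, 75])
    return (q3 - q1) / abs(med)

# With the GLM 4.7 Flash figures quoted above, the same arithmetic gives
# (0.221725 - 0.070186) / 0.107502 ≈ 1.41, i.e. an IQR wider than the median.
```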

Reliability & Stability Highlights

MiniMax-M2.1-TEE: high success rate (0.95+); low relative IQR (0.00-0.01)
  • Excellent consistency and reliability.
Qwen3-Coder-480B-A35B-Instruct-FP8: high success rate (0.95+); low relative IQR (0.00-0.01)
  • Strong reliability across tasks.
gpt-oss-120b-TEE: medium success rate (0.75-0.85); variable relative IQR (0.01-0.05)
  • Notable inconsistencies in some settings.
GLM 4.7 Flash (1200s): high success rate; high IQR (0.07-0.22)
  • Good success rate but high score variability.

The benchmark evaluates performance across three time budgets (240s, 600s, 1200s). This allows insights into an agent's ability to leverage more time for iterative improvement and deeper problem-solving.

Overall, average performance improves with larger time budgets, reflecting the expected monotonic pattern. However, at the individual model level, scaling can be noisy for some models due to the current run count (five successful runs per setting).

The paper reports that 57.5% of model×competition curves are monotone, meaning their performance did not worsen as the budget increased. This highlights varying capabilities among agents to effectively utilize additional computation time.

Monotonicity rate across model/competition curves: 57.5% (performance did not worsen as the budget increased).
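A small sketch of the monotonicity check behind this figure, assuming higher normalized scores are better and each curve is ordered by increasing budget:

```python
# A model x competition curve is monotone if its normalized score never
# worsens as the budget grows from 240s to 600s to 1200s.
def is_monotone(curve):
    """curve: scores ordered by increasing budget, e.g. [s_240, s_600, s_1200]."""
    return all(later >= earlier for earlier, later in zip(curve, curve[1:]))

def monotonicity_rate(curves):
    """Fraction of curves that never worsen with a larger budget."""
    return sum(is_monotone(c) for c in curves) / len(curves)
```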

Impact of Increased Budget

While performance generally improves with more time, the scaling is not always linear or perfectly smooth. For instance, some models like DeepSeek-TNG-R1T2-Chimera show significant marginal gains in score when moving from 240s to 600s, suggesting they benefit greatly from initial time increases. Others, like MiniMax-M2.1-TEE, maintain strong performance even at shorter budgets and show steady but less dramatic gains. This underscores that different agents have distinct strategies for leveraging time.

Calculate Your Potential AI ROI

Estimate the efficiency gains and cost savings your enterprise could achieve by automating tabular data science tasks with advanced AI agents.
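A back-of-the-envelope sketch of the calculation such an estimate might rest on; all parameters and the example figures are hypothetical placeholders, not results from the paper.

```python
# Hypothetical ROI estimate: hours reclaimed and annual savings from
# automating a share of tabular data science work.
def estimate_roi(tasks_per_year, hours_per_task, loaded_hourly_cost, automation_fraction):
    """Return (hours reclaimed annually, estimated annual savings)."""
    hours_reclaimed = tasks_per_year * hours_per_task * automation_fraction
    savings = hours_reclaimed * loaded_hourly_cost
    return hours_reclaimed, savings

# Example: 200 tabular modelling tasks a year, 6 hours each, $120/hour fully
# loaded cost, 40% of the work automated by an agent:
# estimate_roi(200, 6, 120, 0.4) -> (480.0, 57600.0)
```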


Your AI Transformation Roadmap

A typical phased approach to integrate autonomous data science agents into your existing workflows.

Phase 1: Discovery & Strategy

Assess current data science workflows, identify automation opportunities, and define clear business objectives. Select appropriate AI agent technologies based on TML-bench insights.

Phase 2: Pilot & Integration

Implement a pilot project on a critical tabular task. Integrate agents with existing data pipelines and validate performance against established benchmarks like TML-bench.

Phase 3: Scaling & Optimization

Expand AI agent deployment across more tasks and departments. Continuously monitor performance, refine agent instructions, and optimize resource allocation.

Phase 4: Advanced Capabilities

Explore multi-agent orchestration, advanced feature engineering, and real-time inference capabilities for competitive advantage.

Ready to Benchmark Your AI Strategy?

Connect with our experts to understand how autonomous data science agents can elevate your enterprise's capabilities and drive efficiency.

Ready to Get Started?

Book Your Free Consultation.
