Enterprise AI Analysis
DataSciBench: An LLM Agent Benchmark for Data Science
This paper presents DataSciBench, a comprehensive benchmark for evaluating Large Language Model (LLM) capabilities in data science. Recent related benchmarks have focused primarily on single tasks, easily obtainable ground truth, and straightforward evaluation metrics, which limits the scope of tasks that can be evaluated. In contrast, DataSciBench is built on a broader, curated collection of natural and challenging prompts whose ground truth and evaluation metrics are not readily available. We develop a semi-automated pipeline for generating ground truth (GT) and validating evaluation metrics; it combines an LLM-based self-consistency strategy with human verification to produce accurate GT from the collected prompts, predefined task types, and aggregate functions (metrics). Furthermore, we propose an innovative Task-Function-Code (TFC) framework that assesses each code execution outcome against precisely defined metrics and programmatic rules. Our experiments test 6 API-based models, 8 open-source general models, and 9 open-source code generation models on the diverse set of prompts we have gathered, providing a more comprehensive and rigorous evaluation of LLMs in data science and revealing their strengths and weaknesses. Results demonstrate that API-based models outperform open-source models on all metrics, and Deepseek-Coder-33B-Instruct achieves the highest score among open-source models. We release all code and data at https://github.com/THUDM/DataSciBench/.
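The ground-truth pipeline above pairs LLM self-consistency with human verification. Below is a minimal sketch of the self-consistency step only, assuming a caller-supplied `llm_fn` and a simple majority-vote threshold; the function names and threshold are illustrative, not the paper's implementation.

```python
from collections import Counter
from typing import Callable, Optional

def self_consistent_ground_truth(prompt: str,
                                 llm_fn: Callable[[str], str],
                                 n_samples: int = 5) -> Optional[str]:
    """Sample several LLM answers for one prompt and accept an answer only
    when a strict majority agree; otherwise return None so a human reviewer
    can verify the case manually."""
    candidates = [llm_fn(prompt) for _ in range(n_samples)]
    answer, count = Counter(candidates).most_common(1)[0]
    return answer if count > n_samples // 2 else None

# Toy usage with a stand-in "LLM" that always answers the same way.
if __name__ == "__main__":
    gt = self_consistent_ground_truth("What is 2 + 2?", lambda p: "4")
    print(gt)  # "4" -> accepted as provisional ground truth
```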
Key Impact Metrics
DataSciBench provides a rigorous evaluation of LLMs in data science, highlighting key performance indicators and the breadth of its assessment.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Prompt Definition and Collection
Explores how prompts are designed and collected for data science tasks. DataSciBench ensures natural, challenging, and high-quality prompts to drive LLM improvement across six defined data science task types: data preprocessing, statistics, visualization, data mining, interpretability, and report generation.
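For illustration, the six task types could be represented as a simple enumeration used to tag each collected prompt. The `TaskType` enum and the sample record below are hypothetical, not the benchmark's actual schema.

```python
from enum import Enum

class TaskType(Enum):
    DATA_PREPROCESSING = "data preprocessing"
    STATISTICS = "statistics"
    VISUALIZATION = "visualization"
    DATA_MINING = "data mining"
    INTERPRETABILITY = "interpretability"
    REPORT_GENERATION = "report generation"

# A hypothetical prompt record tagged with the task types it exercises.
prompt_record = {
    "prompt": "Clean the sales CSV, report summary statistics, and plot monthly revenue.",
    "task_types": [TaskType.DATA_PREPROCESSING, TaskType.STATISTICS, TaskType.VISUALIZATION],
}
```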
Evaluation Framework
Details the Task-Function-Code (TFC) framework for evaluation. This novel semi-automated framework efficiently generates ground truth and evaluation metrics, addressing critical challenges in the automated assessment of data science tasks. It aggregates task types, evaluation (aggregate) functions, and the corresponding code, and defines programmatic rules for consistent assessment.
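A rough sketch of how a TFC-style evaluation could be wired together is shown below; the `TFC` dataclass and `evaluate` helper are illustrative assumptions rather than the benchmark's actual API.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class TFC:
    """One Task-Function-Code triple: a task type, an aggregate metric
    function, and the reference (ground truth) it scores against."""
    task_type: str                          # e.g. "data visualization"
    metric: Callable[[Any, Any], float]     # aggregate evaluation function
    reference: Any                          # ground truth / reference output

def evaluate(tfc_list: list[TFC], outputs: dict[str, Any]) -> dict[str, float]:
    """Score each task's execution output against its TFC entry,
    giving zero when the model produced no output for that task."""
    scores: dict[str, float] = {}
    for tfc in tfc_list:
        produced = outputs.get(tfc.task_type)
        scores[tfc.task_type] = 0.0 if produced is None else tfc.metric(produced, tfc.reference)
    return scores

# Toy usage: exact-match metric for a single "statistics" task.
if __name__ == "__main__":
    triples = [TFC("statistics", lambda out, ref: float(out == ref), reference=42)]
    print(evaluate(triples, {"statistics": 42}))  # {'statistics': 1.0}
```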
Experimental Results
Presents the outcomes of benchmarking various LLMs. Experiments on 222 prompts and 519 ground truths show that API-based models generally outperform open-source models, with GPT-4o achieving the highest score. Deepseek-Coder-33B-Instruct leads among open-source models, yet all models show significant room for improvement in following fine-grained instructions and executing accurate plans.
Highest Performance Achieved
GPT-4o achieved the highest total score, significantly outperforming other models and demonstrating robust, comprehensive capabilities across the full range of data science tasks in the DataSciBench framework.
Enterprise Process Flow
| Benchmark | Key Focus | DataSciBench Advantages |
|---|---|---|
| MLAgentBench | Agent-driven machine learning experimentation tasks | Covers six broader data science task types, from preprocessing to report generation |
| Text2Analysis | Table-based analysis and advanced analytics intents | Natural, challenging prompts spanning the full data science workflow, with semi-automated ground truth |
| LiveCodeBench | Contamination-free evaluation of general code generation | Data-science-specific tasks assessed with fine-grained, programmatic TFC metrics |
LLM Performance Anomalies
The o1-mini model, often regarded as strong in reasoning, showed an unexpected failure rate of 29.77% on DataSciBench tasks. The failures stemmed primarily from non-compliance with instructions, incorrect tool calls, and forgetting earlier context rather than from core reasoning errors. This highlights how real-world data science coding tasks comprehensively challenge a model's ability to follow fine-grained instructions, use existing tools (libraries, APIs, and so on), and plan effectively. Another notable finding was that larger models such as CodeLlama-34B-Instruct sometimes performed worse than their smaller counterparts, potentially because they struggle to generate formatted output that differs from their training data.
Calculate Your Potential AI ROI
Estimate the significant time and cost savings your enterprise could achieve with advanced AI automation.
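As a rough sketch of the kind of estimate such a calculator performs, the formula and figures below are illustrative assumptions, not benchmark results.

```python
def estimate_roi(hours_saved_per_week: float, hourly_cost: float,
                 annual_ai_cost: float, weeks_per_year: int = 48) -> float:
    """Very rough ROI estimate: (annual savings - annual AI cost) / annual AI cost."""
    annual_savings = hours_saved_per_week * hourly_cost * weeks_per_year
    return (annual_savings - annual_ai_cost) / annual_ai_cost

# Example: 20 hours/week saved at $60/hour with a $40,000/year AI budget.
print(f"Estimated ROI: {estimate_roi(20, 60, 40_000):.0%}")  # Estimated ROI: 44%
```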
Your AI Implementation Roadmap
A typical phased approach to integrate advanced AI agents into your enterprise operations.
Phase 1: Discovery & Strategy
Comprehensive analysis of current workflows, identification of AI opportunities, and development of a tailored implementation strategy.
Phase 2: Pilot Program Development
Design and deploy a proof-of-concept AI agent for a critical task, gathering initial performance metrics and feedback.
Phase 3: Scaled Integration & Optimization
Expand AI agent deployment across relevant departments, fine-tuning performance and integrating with existing enterprise systems.
Phase 4: Continuous Improvement & Governance
Establish monitoring, maintenance, and governance frameworks for long-term AI success and adaptive evolution.
Ready to Transform Your Enterprise with AI?
Connect with our experts to explore how custom AI solutions can drive efficiency and innovation in your organization.