Enterprise AI Analysis
DataSciBench: An LLM Agent Benchmark for Data Science
This paper presents DataSciBench, a comprehensive benchmark for evaluating Large Language Model (LLM) capabilities in data science. Recent related benchmarks have focused primarily on single tasks, easily obtainable ground truth, and straightforward evaluation metrics, which limits the scope of tasks that can be evaluated. In contrast, DataSciBench is built on a broader, curated collection of natural and challenging prompts whose ground truth and evaluation metrics are not readily available. We develop a semi-automated pipeline for generating ground truth (GT) and validating evaluation metrics; it combines an LLM-based self-consistency strategy with human verification to produce accurate GT from the collected prompts, predefined task types, and aggregate functions (metrics). Furthermore, we propose an innovative Task-Function-Code (TFC) framework that assesses each code execution outcome against precisely defined metrics and programmatic rules. Our experiments test 6 API-based models, 8 open-source general models, and 9 open-source code generation models on the diverse set of prompts we have gathered, providing a more comprehensive and rigorous evaluation of LLMs in data science and revealing their strengths and weaknesses. Results demonstrate that API-based models outperform open-source models on all metrics, and Deepseek-Coder-33B-Instruct achieves the highest score among open-source models. We release all code and data at https://github.com/THUDM/DataSciBench/.
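The ground-truth pipeline above pairs LLM self-consistency with human verification. Below is a minimal sketch of the self-consistency step only, assuming a caller-supplied `llm_fn` and a simple majority-vote threshold; the function names and threshold are illustrative, not the paper's implementation.

```python
from collections import Counter
from typing import Callable, Optional

def self_consistent_ground_truth(prompt: str,
                                 llm_fn: Callable[[str], str],
                                 n_samples: int = 5) -> Optional[str]:
    """Sample several LLM answers for one prompt and accept an answer only
    when a strict majority agree; otherwise return None so a human reviewer
    can verify the case manually."""
    candidates = [llm_fn(prompt) for _ in range(n_samples)]
    answer, count = Counter(candidates).most_common(1)[0]
    return answer if count > n_samples // 2 else None

# Toy usage with a stand-in "LLM" that always answers the same way.
if __name__ == "__main__":
    gt = self_consistent_ground_truth("What is 2 + 2?", lambda p: "4")
    print(gt)  # "4" -> accepted as provisional ground truth
```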
Key Impact Metrics
DataSciBench provides a rigorous evaluation of LLMs in data science, highlighting key performance indicators and the breadth of its assessment.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Prompt Definition and Collection
Explores how prompts are designed and collected for data science tasks. DataSciBench ensures natural, challenging, and high-quality prompts to drive LLM improvement across six defined data science task types: data preprocessing, statistics, visualization, data mining, interpretability, and report generation.
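For illustration, the six task types could be represented as a simple enumeration used to tag each collected prompt. The `TaskType` enum and the sample record below are hypothetical, not the benchmark's actual schema.

```python
from enum import Enum

class TaskType(Enum):
    DATA_PREPROCESSING = "data preprocessing"
    STATISTICS = "statistics"
    VISUALIZATION = "visualization"
    DATA_MINING = "data mining"
    INTERPRETABILITY = "interpretability"
    REPORT_GENERATION = "report generation"

# A hypothetical prompt record tagged with the task types it exercises.
prompt_record = {
    "prompt": "Clean the sales CSV, report summary statistics, and plot monthly revenue.",
    "task_types": [TaskType.DATA_PREPROCESSING, TaskType.STATISTICS, TaskType.VISUALIZATION],
}
```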
Evaluation Framework
Details the Task-Function-Code (TFC) framework for evaluation. This novel semi-automated framework efficiently generates ground truth and evaluation metrics, addressing critical challenges in the automated assessment of data science tasks. It aggregates task types, evaluation (aggregate) functions, and the corresponding code, and defines programmatic rules for consistent assessment.
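A rough sketch of how a TFC-style evaluation could be wired together is shown below; the `TFC` dataclass and `evaluate` helper are illustrative assumptions rather than the benchmark's actual API.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class TFC:
    """One Task-Function-Code triple: a task type, an aggregate metric
    function, and the reference (ground truth) it scores against."""
    task_type: str                          # e.g. "data visualization"
    metric: Callable[[Any, Any], float]     # aggregate evaluation function
    reference: Any                          # ground truth / reference output

def evaluate(tfc_list: list[TFC], outputs: dict[str, Any]) -> dict[str, float]:
    """Score each task's execution output against its TFC entry,
    giving zero when the model produced no output for that task."""
    scores: dict[str, float] = {}
    for tfc in tfc_list:
        produced = outputs.get(tfc.task_type)
        scores[tfc.task_type] = 0.0 if produced is None else tfc.metric(produced, tfc.reference)
    return scores

# Toy usage: exact-match metric for a single "statistics" task.
if __name__ == "__main__":
    triples = [TFC("statistics", lambda out, ref: float(out == ref), reference=42)]
    print(evaluate(triples, {"statistics": 42}))  # {'statistics': 1.0}
```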
Experimental Results
Presents the outcomes of benchmarking various LLMs. Experiments on 222 prompts and 519 ground truths show that API-based models generally outperform open-source models, with GPT-4o achieving the highest score. Deepseek-Coder-33B-Instruct leads among open-source models, yet all models show significant room for improvement in following fine-grained instructions and executing accurate plans.
Highest Performance Achieved
GPT-4o achieved the highest total score, significantly outperforming other models and demonstrating robust, comprehensive capabilities across the full range of data science tasks in the DataSciBench framework.
Enterprise Process Flow
| Benchmark | Key Focus | DataSciBench Advantages |
|---|---|---|
| MLAgentBench | Agent-driven machine learning experimentation tasks | Covers six broader data science task types, from preprocessing to report generation |
| Text2Analysis | Table-based analysis and advanced analytics intents | Natural, challenging prompts spanning the full data science workflow, with semi-automated ground truth |
| LiveCodeBench | Contamination-free evaluation of general code generation | Data-science-specific tasks assessed with fine-grained, programmatic TFC metrics |
LLM Performance Anomalies
The o1-mini model, often regarded as strong in reasoning, showed an unexpected failure rate of 29.77% on DataSciBench tasks. The failures stemmed primarily from non-compliance with instructions, incorrect tool calls, and forgetting earlier context rather than from core reasoning errors. This highlights how real-world data science coding tasks comprehensively challenge a model's ability to follow fine-grained instructions, use existing tools (libraries, APIs, and so on), and plan effectively. Another notable finding was that larger models such as CodeLlama-34B-Instruct sometimes performed worse than their smaller counterparts, potentially because they struggle to generate formatted output that differs from their training data.
Calculate Your Potential AI ROI
Estimate the significant time and cost savings your enterprise could achieve with advanced AI automation.
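As a rough sketch of the kind of estimate such a calculator performs, the formula and figures below are illustrative assumptions, not benchmark results.

```python
def estimate_roi(hours_saved_per_week: float, hourly_cost: float,
                 annual_ai_cost: float, weeks_per_year: int = 48) -> float:
    """Very rough ROI estimate: (annual savings - annual AI cost) / annual AI cost."""
    annual_savings = hours_saved_per_week * hourly_cost * weeks_per_year
    return (annual_savings - annual_ai_cost) / annual_ai_cost

# Example: 20 hours/week saved at $60/hour with a $40,000/year AI budget.
print(f"Estimated ROI: {estimate_roi(20, 60, 40_000):.0%}")  # Estimated ROI: 44%
```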
Your AI Implementation Roadmap
A typical phased approach to integrate advanced AI agents into your enterprise operations.
Phase 1: Discovery & Strategy
Comprehensive analysis of current workflows, identification of AI opportunities, and development of a tailored implementation strategy.
Phase 2: Pilot Program Development
Design and deploy a proof-of-concept AI agent for a critical task, gathering initial performance metrics and feedback.
Phase 3: Scaled Integration & Optimization
Expand AI agent deployment across relevant departments, fine-tuning performance and integrating with existing enterprise systems.
Phase 4: Continuous Improvement & Governance
Establish monitoring, maintenance, and governance frameworks for long-term AI success and adaptive evolution.
Ready to Transform Your Enterprise with AI?
Connect with our experts to explore how custom AI solutions can drive efficiency and innovation in your organization.