AI CAPABILITY ANALYSIS
Measuring AI Ability to Complete Long Tasks
Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney Von Arx, Ryan Bloom, Thomas Broadley, Haoxing Du, Brian Goodrich, Nikola Jurkovic, Luke Harold Miles, Seraphina Nix, Tao Lin, Chris Painter, Neev Parikh, David Rein, Lucas Jun Koba Sato, Hjalmar Wijk, Daniel M. Ziegler, Elizabeth Barnes, Lawrence Chan (Model Evaluation & Threat Research, METR)
Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear. To quantify the capabilities of AI systems in terms of human capabilities, we propose a new metric: 50%-task-completion time horizon. This is the time humans typically take to complete tasks that AI models can complete with 50% success rate. Current frontier AI models such as Claude 3.7 Sonnet have a 50% time horizon of around 50 minutes. Furthermore, frontier AI time horizon has been doubling approximately every seven months since 2019, though the trend may have accelerated in 2024. If these results generalize to real-world software tasks, extrapolation of this trend predicts that within 5 years, AI systems will be capable of automating many software tasks that currently take humans a month.
Executive Impact: Key Findings
This research introduces the 50%-task-completion time horizon as a critical metric for understanding AI's practical capabilities. Our analysis shows a rapid, exponential growth in AI's ability to autonomously complete complex tasks, with significant implications for future enterprise automation and strategic planning.
Deep Analysis & Enterprise Applications
Our study finds that the AI time horizon, meaning the length of tasks (measured by how long humans typically take) that AI models can complete with a 50% success rate, has been growing exponentially, doubling approximately every seven months since 2019. This pace suggests a rapid expansion of AI's practical utility across domains. The trend may have accelerated in 2024, which, if sustained, would imply even faster progress.
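A doubling time can be restated as a growth law. The worked equation below is an illustration only: the symbols $H_0$ (time horizon at a reference date), $T_d$ (doubling time) and $t$ (months elapsed) are notation introduced here, not taken from the study, and the calculation assumes the seven-month doubling time holds exactly.

```latex
\begin{align*}
H(t) &= H_0 \cdot 2^{\,t / T_d}, \qquad T_d \approx 7 \text{ months} \\
\frac{H(12)}{H_0} &= 2^{12/7} \approx 3.3
\end{align*}
```

Under that assumption, the time horizon grows by a factor of roughly 3.3 per year.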
Qualitative analysis of AI agent performance indicates significant improvements across several key areas. Models demonstrate enhanced logical reasoning capabilities, more effective tool use, and greater reliability in task execution. Notably, there's a marked improvement in the ability to adapt to mistakes, preventing repeated failures and enabling course correction during complex tasks.
Enhanced AI Adaptability and Reasoning
Earlier models frequently struggled with syntax errors and repetitive failed actions. Recent models, like Claude 3.5 Sonnet (New), demonstrate a greater capacity to debug Python code, recover from misplaced elements, and even rewrite entire files when initial approaches fail. This shift from looping behaviors to adaptive problem-solving is crucial for tackling real-world software engineering challenges.
Despite rapid progress, current AI systems still face limitations, particularly with 'messier' tasks lacking clear feedback loops or requiring proactive information seeking. We investigated external validity by replicating our methods on SWE-bench Verified and analyzing internal pull requests, finding similar exponential trends but also highlighting that human baselines (especially for easier tasks or those with high-context knowledge) can significantly impact time horizon estimates. The 'messiness' of tasks, defined by factors like novel situations, resource constraints, and real-time coordination, negatively correlates with AI success rates.
| Category | AI Limitation |
|---|---|
| Feedback Loops | Struggles without clear, immediate feedback mechanisms. |
| Information Seeking | Fails to proactively seek out relevant, available information. |
| Messy Environments | Lower performance on tasks with ambiguity, dynamic environments, or punishing mistakes. |
| Context Acquisition | Less effective on tasks requiring deep, domain-specific context not explicitly provided. |
We introduced the 50%-task-completion time horizon as a novel metric to quantify AI capabilities relative to human performance. Our methodology involves three key steps: assembling a diverse task suite (HCAST, RE-Bench, SWAA), baselining human and AI performance on these tasks, and fitting a logistic model to calculate the time horizon. This approach, inspired by Item Response Theory, converts AI success rates and human completion times into an intuitive measure of real-world capability.
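The final step, fitting a logistic model and reading off the 50% point, can be sketched as follows. This is a minimal illustration and not the authors' code: the task times and outcomes are synthetic, the function name `fit_time_horizon` is hypothetical, and the choice of Newton's method on a two-parameter model (success probability as a logistic function of log2 human time) is an assumption made to keep the example dependency-free.

```python
import math

# Synthetic data, invented for illustration only:
# human completion time per task (minutes) and whether the agent succeeded.
TASK_MINUTES = [1, 2, 4, 8, 15, 30, 60, 120, 240, 480]
AGENT_SUCCESS = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0]


def fit_time_horizon(human_minutes, successes, iterations=50):
    """Fit p(success) = sigmoid(a + b * log2(minutes)) by Newton's method.

    Returns (a, b, horizon_minutes), where horizon_minutes is the task
    length at which the fitted success probability equals 50%.
    """
    xs = [math.log2(t) for t in human_minutes]
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, y in zip(xs, successes):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = p * (1.0 - p)
            # Gradient of the log-likelihood
            g_a += y - p
            g_b += (y - p) * x
            # Entries of the 2x2 information matrix X^T W X
            h_aa += w
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: solve the 2x2 system H * delta = g in closed form
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    # p = 0.5 where a + b * log2(t) = 0, i.e. t = 2 ** (-a / b)
    return a, b, 2.0 ** (-a / b)


if __name__ == "__main__":
    a, b, horizon = fit_time_horizon(TASK_MINUTES, AGENT_SUCCESS)
    print(f"intercept={a:.2f}, slope={b:.2f}, 50% horizon={horizon:.0f} min")
```

The slope comes out negative because success becomes less likely as tasks get longer. The horizon is the task length at which the fitted curve crosses 50%.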
Extrapolating the observed exponential growth, our models predict that AI systems will reach a 1-month 50% time horizon for software tasks between late 2028 and early 2031, an 80% confidence interval roughly two years wide. This would be a significant milestone, indicating AI's potential to automate complex intellectual labor currently performed by humans. The forecast carries caveats, including limited external validity and possible changes in the growth rate, but the current trajectory points toward transformative AI capabilities within the next five to ten years.
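The arithmetic behind this kind of extrapolation can be sketched as follows. It is a point estimate under a constant doubling time, not the study's forecasting model. The 50-minute starting horizon and seven-month doubling time come from the text above; treating one work-month as 167 hours and the helper name `months_until_horizon` are assumptions made for this example.

```python
import math

WORK_HOURS_PER_MONTH = 167  # assumption: one work-month of human labor


def months_until_horizon(current_minutes, target_minutes, doubling_months):
    """Months needed for the time horizon to grow from current to target,
    assuming it doubles every `doubling_months` months."""
    doublings = math.log2(target_minutes / current_minutes)
    return doublings * doubling_months


if __name__ == "__main__":
    target = WORK_HOURS_PER_MONTH * 60  # one work-month, in minutes
    months = months_until_horizon(50, target, 7)
    print(f"about {months:.1f} months ({months / 12:.1f} years)")
```

With these inputs the result is between four and five years, in line with the "within 5 years" statement in the abstract. A shorter doubling time, as the possible 2024 acceleration would imply, shortens the estimate proportionally.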
Quantify Your AI Transformation ROI
Estimate the potential savings and reclaimed hours by integrating advanced AI capabilities into your enterprise workflows. Adjust the parameters to reflect your organization's specifics.
Your AI Implementation Roadmap
A structured approach ensures successful AI integration and maximized ROI. Here’s a typical journey we guide our clients through.
Discovery & Strategy Alignment
In-depth assessment of current workflows, identification of high-impact AI opportunities, and development of a tailored strategic plan.
Pilot Program & MVP Development
Rapid prototyping and deployment of minimum viable AI solutions in a controlled environment to demonstrate value and gather feedback.
Enterprise Integration & Scaling
Seamless integration of proven AI solutions across departments, comprehensive training, and continuous optimization for sustained performance.
Ready to Transform Your Enterprise with AI?
Unlock the full potential of AI within your organization. Schedule a personalized consultation to explore how our insights can drive your strategic AI initiatives and deliver tangible ROI.