
AI CAPABILITIES REPORT

Evaluating AI Cyber Capabilities with Crowdsourced Elicitation

As AI systems become increasingly capable, understanding their offensive cyber potential is critical for informed governance and responsible deployment. However, accurately bounding their capabilities is hard, and some prior evaluations dramatically underestimated them. The art of extracting maximum task-specific performance from AIs is called "AI elicitation", and today's safety organizations typically conduct it in-house. In this paper, we explore crowdsourced elicitation as an alternative to in-house elicitation work. We host open-access AI tracks at two Capture The Flag (CTF) competitions: AI vs. Humans (400 teams) and Cyber Apocalypse (8,000 teams). The AI teams achieve outstanding performance at both events, ranking in the top 5% and top 10% respectively and earning a total of $7,500 in bounties. This performance suggests that open-market elicitation may offer an effective complement to in-house elicitation. We propose elicitation bounties as a practical mechanism for maintaining timely, cost-effective situational awareness of emerging AI capabilities. Another advantage of open elicitation is the option to collect human performance data at scale. Applying METR's methodology [3], we found that AI agents can reliably solve cyber challenges requiring one hour or less of effort from a median human CTF participant.
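To make the METR-style comparison concrete, here is a minimal sketch of the general idea: bucket challenges by how long a median human needs, then find the largest bucket in which an AI agent still solves at least half the tasks. All data below is hypothetical and the procedure is simplified; the paper's exact methodology may differ.

```python
# Simplified sketch of a METR-style "human time horizon" estimate.
# Inputs are hypothetical: per-challenge median human solve time (minutes)
# and whether the AI agent solved that challenge.
challenges = [
    {"name": "web_easy",    "median_human_min": 12,  "ai_solved": True},
    {"name": "rev_medium",  "median_human_min": 45,  "ai_solved": True},
    {"name": "pwn_medium",  "median_human_min": 60,  "ai_solved": True},
    {"name": "crypto_hard", "median_human_min": 180, "ai_solved": False},
    {"name": "forensics",   "median_human_min": 240, "ai_solved": False},
]

# Bucket challenges by human effort and compute the AI solve rate per bucket.
buckets = [(0, 30), (30, 60), (60, 120), (120, 480)]
solve_rate = {}
for lo, hi in buckets:
    in_bucket = [c for c in challenges if lo < c["median_human_min"] <= hi]
    if in_bucket:
        solve_rate[(lo, hi)] = sum(c["ai_solved"] for c in in_bucket) / len(in_bucket)

# The "horizon" is the upper edge of the largest bucket the AI still clears at >= 50%.
horizon = max((hi for (lo, hi), rate in solve_rate.items() if rate >= 0.5), default=0)
print(solve_rate)
print(f"AI reliably solves challenges up to ~{horizon} human-minutes of effort")
```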

Executive Impact Summary

Top 5%: AI rank at the AI vs. Humans CTF
Top 10%: AI rank at the Cyber Apocalypse CTF
~1.3 hours: human-effort level of tasks AI agents solve reliably
$7,500: total prize pool for cost-effective elicitation

Deep Analysis & Enterprise Applications

The sections below revisit the specific findings from the research, reframed for enterprise decision-makers.

The paper demonstrates strong AI performance in offensive cybersecurity CTF challenges, ranking highly against human teams. This highlights a significant advancement in AI's ability to identify and exploit system vulnerabilities in real-time environments. In the AI vs. Humans CTF, top AI agents solved 19/20 challenges, showcasing near-saturation of available tasks. In the Cyber Apocalypse CTF, AI achieved a top 10% rank against thousands of human teams, indicating robust performance across a broader range of complex challenges.

Crowdsourcing AI elicitation, as showcased by the AI vs. Humans and Cyber Apocalypse CTF tracks, proves to be an effective and scalable method for evaluating AI capabilities. It complements traditional in-house evaluations by leveraging a wider range of approaches and fostering competition. This approach helps mitigate the 'evals gap' by allowing diverse teams to push AI performance to its maximum, providing a more accurate assessment of frontier AI models' offensive potential at a fraction of the cost.

While AI demonstrated impressive speed in solving CTF challenges, top human teams, often professional CTF players with years of experience, were able to match AI speeds. This suggests a future where AI tools can augment human experts, handling routine or well-defined cyber tasks and accelerating problem-solving in complex scenarios, allowing human specialists to focus on more strategic and creative security work.

The findings have significant implications for policymakers, R&D agencies, and frontier AI labs. Targeted support for AI-focused tracks in existing CTF events can establish a sustainable and cost-effective evaluation ecosystem, offering timely situational awareness of emerging AI capabilities. For AI labs, open-market evaluations provide a fast and low-cost way to uncover overlooked capabilities and validate internal assessments, reducing reliance on single-team assumptions.

AI vs. Humans CTF: Performance Overview

| Characteristic | Top AI Teams | Top Human Teams |
| --- | --- | --- |
| Challenges solved (max 20) | 19-20 (near saturation) | 19-20 (matched AI) |
| Speed of submission | High (on par with top humans) | High (professional players) |
| Cost of elicitation | Low ($7,500 in bounties) | N/A (intrinsic motivation) |
| Approach | Automated agents, diverse designs | Manual, experience-driven techniques |
| Overall rank | Top 5% | Top 5% |
90%: Share of human participants outperformed by the best AI agent

In the Cyber Apocalypse CTF, the best AI agent significantly outperformed 90% of all human participants, demonstrating its broad applicability and effectiveness in diverse cyber challenges.
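For illustration only, a percentile figure like this can be computed by ranking the AI team's final score against the distribution of human team scores; the scores below are made up.

```python
# Hypothetical final scores: one entry per human team, plus the AI team's score.
human_scores = [120, 450, 90, 800, 300, 1500, 60, 700, 240, 980]
ai_score = 1400

outperformed = sum(1 for s in human_scores if s < ai_score)
percentile = 100 * outperformed / len(human_scores)
print(f"The AI team outperforms {percentile:.0f}% of human teams")  # -> 90%
```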

The Crowdsourced AI Elicitation Cycle

Define AI Cyber Task
Organize Open CTF Track
Incentivize Diverse AI Agents
Collect AI Performance Data
Benchmark vs. Human Skill
Inform Governance & R&D
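As a rough sketch of the "Collect AI Performance Data" and "Benchmark vs. Human Skill" steps in this cycle, the snippet below aggregates a hypothetical scoreboard export into per-track summary statistics. The file name, column names, and track labels are assumptions, not an actual competition data format.

```python
# Aggregate a (hypothetical) CTF scoreboard export into per-track statistics,
# so AI-track results can be benchmarked against the human field.
import csv
from statistics import median

def summarize(scoreboard_csv: str) -> dict:
    """Assumed row format: team,track,challenges_solved (track is 'ai' or 'human')."""
    by_track: dict[str, list[int]] = {"ai": [], "human": []}
    with open(scoreboard_csv, newline="") as f:
        for row in csv.DictReader(f):
            by_track[row["track"]].append(int(row["challenges_solved"]))
    return {
        track: {"teams": len(solves), "median_solves": median(solves) if solves else 0}
        for track, solves in by_track.items()
    }

# Example: print(summarize("cyber_apocalypse_scoreboard.csv"))
```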

Bridging the 'Evals Gap': Lessons Learned

Previous evaluations like Meta's CyberSecEval 2 and InterCode-CTF initially reported modest AI performance. However, subsequent work like Project Naptime and [6] demonstrated that with dedicated AI elicitation efforts or simple agent modifications, AI success rates could dramatically increase (e.g., from 5% to 100% or 40% to 92%). This highlights the critical need for crowdsourced, open-market evaluations to reveal the true capabilities of frontier AI models.
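For a sense of what "dedicated AI elicitation efforts" or "simple agent modifications" can mean in practice, here is a heavily simplified, generic tool-use loop for a CTF agent. It is purely illustrative: `query_model` is a placeholder for any LLM API call, and this is not the design used by Project Naptime, [6], or the competition teams.

```python
# Illustrative skeleton of a simple CTF agent scaffold: the model proposes a
# shell command, the harness runs it, and the output is fed back until the
# model emits a flag.
import re
import subprocess

FLAG_RE = re.compile(r"HTB\{[^}]+\}|flag\{[^}]+\}")

def query_model(transcript: str) -> str:
    """Placeholder: send the transcript to an LLM and return its next command."""
    raise NotImplementedError

def run_agent(task_description: str, max_steps: int = 30) -> str | None:
    transcript = f"Task: {task_description}\n"
    for _ in range(max_steps):
        command = query_model(transcript)
        flag = FLAG_RE.search(command)
        if flag:                      # the model directly reports a flag
            return flag.group(0)
        result = subprocess.run(      # run inside a sandboxed environment in practice
            command, shell=True, capture_output=True, text=True, timeout=60
        )
        output = (result.stdout + result.stderr)[-4000:]   # truncate long output
        transcript += f"\n$ {command}\n{output}\n"
    return None
```

Scaffolding of roughly this shape, iterated on with better prompts and tools, is the kind of elicitation work behind the score jumps described above.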

Calculate Your Potential AI Impact

Our analysis demonstrates that deploying AI agents for tasks typically performed by cybersecurity professionals can lead to substantial efficiencies. By automating repetitive or clearly defined cyber challenges, organizations can free up expert human talent for more strategic, complex, and creative security work. The initial investment in AI elicitation and deployment can yield significant returns through reduced operational costs and enhanced security posture.

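As a worked example of the arithmetic such a calculator performs (every input below is hypothetical and should be replaced with your own figures):

```python
# Hypothetical inputs for the savings estimate; adjust to your own figures.
analysts = 5                  # security analysts whose routine triage is partly automated
hours_per_week_automated = 6  # hours per analyst per week handled by AI agents
hourly_cost = 85              # fully loaded cost per analyst hour (USD)
weeks_per_year = 48

annual_hours_reclaimed = analysts * hours_per_week_automated * weeks_per_year
annual_savings = annual_hours_reclaimed * hourly_cost
print(f"Annual hours reclaimed: {annual_hours_reclaimed}")   # 1440
print(f"Potential annual savings: ${annual_savings:,}")      # $122,400
```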

Implementation Roadmap

Transforming your cybersecurity operations with AI is a strategic journey. Here’s a typical roadmap to integrate AI capabilities effectively within your enterprise.

Pilot Program Setup

Identify initial high-impact cyber tasks suitable for AI, set up secure testing environments, and define success metrics. Leverage crowdsourced platforms for initial agent development or fine-tuning.
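One way to make the "define success metrics" step concrete is a small metrics record like the sketch below; the field names and example values are illustrative assumptions, not prescriptions from the research.

```python
# Example (hypothetical) pilot success metrics: solve rate, median time to
# solve, and cost per solved task, to be compared against a human baseline.
from dataclasses import dataclass

@dataclass
class PilotMetrics:
    tasks_attempted: int
    tasks_solved: int
    total_agent_cost_usd: float
    median_minutes_to_solve: float

    @property
    def solve_rate(self) -> float:
        return self.tasks_solved / self.tasks_attempted

    @property
    def cost_per_solve(self) -> float:
        return self.total_agent_cost_usd / max(self.tasks_solved, 1)

pilot = PilotMetrics(tasks_attempted=40, tasks_solved=26,
                     total_agent_cost_usd=310.0, median_minutes_to_solve=9.5)
print(f"solve rate {pilot.solve_rate:.0%}, cost per solve ${pilot.cost_per_solve:.2f}")
```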

Agent Integration & Testing

Integrate developed AI agents into your existing security operations. Conduct rigorous testing against real-world and simulated threats, ensuring seamless operation and compliance.

Performance Monitoring & Scaling

Continuously monitor AI agent performance, gather feedback, and iterate on models. Gradually scale AI deployment to cover a broader range of cyber defense or offense tasks, maximizing ROI.

Human-AI Teaming Optimization

Train human security teams to effectively collaborate with AI agents, leveraging AI for speed and consistency while humans focus on advanced threat analysis and strategic decision-making.

Ready to Transform Your Enterprise with AI?

Ready to explore AI's potential for your cyber operations? Book a free consultation with our experts to discuss how these findings apply to your organization.
