CVE-BENCH: A BENCHMARK FOR AI AGENTS' ABILITY TO EXPLOIT REAL-WORLD WEB APPLICATION VULNERABILITIES
Evaluating AI Agents' Exploitation Capabilities in Real-World Web Applications
Introducing CVE-Bench: A comprehensive benchmark for LLM agents targeting critical-severity web vulnerabilities.
The Growing Threat: AI Agents in Cybersecurity
Large language model (LLM) agents are becoming increasingly capable of conducting cyberattacks autonomously. Understanding their potential for misuse, and securing critical systems against it, requires robust evaluation. CVE-Bench addresses this need with a cybersecurity benchmark built on real-world vulnerabilities.
Deep Analysis & Enterprise Applications
LLM Agent Exploitation Workflow on CVE-Bench
| Feature | Cybench | Fang et al. | CVE-Bench |
|---|---|---|---|
| Number of vulnerabilities | 40 | 25 | 40 |
| Real-world vulnerabilities | ✗ | ✓ | ✓ |
| Critical severity | ✗ | O | ✓ |
| Diverse attacks | ✗ | O | ✓ |

Notes: ✓ indicates full support, O indicates limited support, ✗ indicates no support.
Showcasing the potential of hierarchical multi-agent frameworks.
Case Study: T-Agent's SQL Injection Success
Scenario: CVE-2024-37849 (Billing Management System). T-Agent exploited a SQL injection vulnerability in the zero-day setting, demonstrating the importance of tool use (sqlmap) and of strategic planning by supervisor agents.
Lesson: Using specialized tools such as sqlmap correctly, together with effective decision-making by supervisor agents, is crucial for exploiting complex vulnerabilities. When planning is not optimized, however, the agent can waste effort on unnecessary exploration.
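The supervisor-and-worker pattern described in this case study can be sketched in a few lines. This is a minimal illustration under stated assumptions, not T-Agent's actual implementation: the `Supervisor` and `Subtask` names, the stub workers, and the `max_steps` budget are all hypothetical, and the workers only return strings rather than calling any real tool.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Subtask:
    worker: str       # name of the worker agent to dispatch to (hypothetical)
    instruction: str  # natural-language instruction for that worker


@dataclass
class Supervisor:
    # Maps a worker name to a callable that takes an instruction and
    # returns a textual result. Real workers would wrap an LLM and tools.
    workers: Dict[str, Callable[[str], str]]
    max_steps: int = 5  # hypothetical budget that caps exploration
    log: List[str] = field(default_factory=list)

    def run(self, plan: List[Subtask]) -> List[str]:
        for step, task in enumerate(plan):
            if step >= self.max_steps:
                self.log.append("budget exhausted")
                break
            worker = self.workers.get(task.worker)
            if worker is None:
                self.log.append(f"skipped: no worker named {task.worker!r}")
                continue
            self.log.append(f"{task.worker}: {worker(task.instruction)}")
        return self.log


# Stub workers standing in for specialized agents.
stub_workers = {
    "recon": lambda instruction: f"done ({instruction})",
    "sql": lambda instruction: f"done ({instruction})",
}
```

The step budget is one simple way to bound the unnecessary exploration mentioned in the lesson; a real system would more likely let the supervisor replan based on worker results.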
Case Study: AutoGPT's Self-Correction
Scenario: CVE-2024-32980 (Spin Developer Tool). AutoGPT exploited an outbound service vulnerability in the one-day setting. Its self-criticism and self-correction mechanism allowed it to fix technical errors (for example, accessing the wrong port) and then carry out the exploit based on the vulnerability description.
Lesson: AutoGPT's ability to identify and correct its own errors, combined with its understanding of vulnerability descriptions, highlights its potential to exploit one-day vulnerabilities effectively even after initial missteps.
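The self-correction behavior described here amounts to a retry loop in which each failure is fed back into a critique step before the next attempt. The sketch below is a generic illustration, not AutoGPT's actual mechanism: `run_with_self_correction`, the toy functions, and the port numbers are all hypothetical, and nothing in it touches the network.

```python
from typing import Callable, List, Optional, Tuple


def run_with_self_correction(
    attempt: Callable[[dict], Tuple[bool, str]],
    critique: Callable[[dict, str], dict],
    params: dict,
    max_attempts: int = 3,
) -> Tuple[Optional[dict], List[str]]:
    """Try an action, and on failure let a critique step revise the parameters."""
    errors: List[str] = []
    for _ in range(max_attempts):
        ok, message = attempt(params)
        if ok:
            return params, errors
        errors.append(message)
        params = critique(params, message)  # revise based on the error
    return None, errors


# Toy stand-ins: the "service" only answers on port 8080 (an assumption).
def toy_attempt(params: dict) -> Tuple[bool, str]:
    if params["port"] == 8080:
        return True, "connected"
    return False, f"connection refused on port {params['port']}"


def toy_critique(params: dict, error: str) -> dict:
    # Drop the port that just failed and move on to the next candidate.
    remaining = [p for p in params["candidates"] if p != params["port"]]
    if not remaining:
        return params
    return {"port": remaining[0], "candidates": remaining}
```

Calling `run_with_self_correction(toy_attempt, toy_critique, {"port": 80, "candidates": [80, 8080]})` fails once on port 80, switches to port 8080, and then succeeds, mirroring the wrong-port correction in the scenario.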
| Failure Mode | Cy-Agent (Zero-day) | T-Agent (Zero-day) | AutoGPT (Zero-day) |
|---|---|---|---|
| Limited Task Understanding | 30.0% | 0% | 15.0% |
| Incorrect Focus | 0% | 35.0% | 0% |
| Insufficient Exploration | 67.5% | 80.0% | 72.5% |
| Tool Misuse | 47.5% | 17.5% | 5.0% |
| Inadequate Reasoning | 10.0% | 7.5% | 7.5% |

Notes: Insufficient exploration is the dominant bottleneck across all agents and settings. Only the zero-day data is shown here.
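To check the zero-day figures in the failure-mode table programmatically, the percentages can be loaded into a small lookup and ranked per agent. This is a minimal sketch: the numbers are copied from the table, while the `FAILURE_RATES` name and the helper functions are illustrative assumptions.

```python
# Zero-day failure-mode rates in percent, copied from the table.
FAILURE_RATES = {
    "Limited Task Understanding": {"Cy-Agent": 30.0, "T-Agent": 0.0, "AutoGPT": 15.0},
    "Incorrect Focus": {"Cy-Agent": 0.0, "T-Agent": 35.0, "AutoGPT": 0.0},
    "Insufficient Exploration": {"Cy-Agent": 67.5, "T-Agent": 80.0, "AutoGPT": 72.5},
    "Tool Misuse": {"Cy-Agent": 47.5, "T-Agent": 17.5, "AutoGPT": 5.0},
    "Inadequate Reasoning": {"Cy-Agent": 10.0, "T-Agent": 7.5, "AutoGPT": 7.5},
}

AGENTS = ["Cy-Agent", "T-Agent", "AutoGPT"]


def ranked_modes(agent: str) -> list:
    """Return failure modes for one agent, most frequent first."""
    return sorted(FAILURE_RATES, key=lambda mode: FAILURE_RATES[mode][agent], reverse=True)


def dominant_failure_mode(agent: str) -> str:
    """Return the single most frequent failure mode for one agent."""
    return ranked_modes(agent)[0]


if __name__ == "__main__":
    for agent in AGENTS:
        print(agent, "->", dominant_failure_mode(agent))
```

For every agent in the table, the top-ranked mode is Insufficient Exploration, which matches the note. The second-ranked mode differs by agent: Tool Misuse for Cy-Agent, Incorrect Focus for T-Agent, and Limited Task Understanding for AutoGPT.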
Your AI Implementation Roadmap
A phased approach to integrating advanced AI, designed for minimal disruption and maximum impact.
Phase 01: Discovery & Strategy
Comprehensive analysis of current workflows, identification of AI opportunities, and tailored strategy development.
Phase 02: Pilot & Proof-of-Concept
Deployment of AI solutions in a controlled environment to validate effectiveness and gather key performance indicators.
Phase 03: Scaled Integration
Full-scale deployment across relevant departments, ensuring seamless integration with existing systems.
Phase 04: Optimization & Future-Proofing
Continuous monitoring, performance tuning, and planning for future AI advancements and applications.
Ready to Transform Your Enterprise?
Connect with our AI specialists to explore how CVE-Bench insights can inform your cybersecurity strategy and leverage AI securely.