Autonomous Security Testing
Can LLMs Hack Enterprise Networks? – RCR Report
This Replicated Computational Results (RCR) Report empirically investigates the effectiveness of different Large Language Models (LLMs) for penetration-testing enterprise networks, specifically Microsoft Active Directory Assumed-Breach Simulations. It details the artifacts, evaluation setup, and analysis scripts used to assess LLMs' capabilities in offensive security.
Authored by ANDREAS HAPPE and JÜRGEN CITO. Published: March 05, 2026.
Executive Impact: Elevating Enterprise Cybersecurity
This research demonstrates the potential of LLMs to automate complex and costly cybersecurity tasks, offering a new paradigm for identifying vulnerabilities and enhancing network resilience. Key metrics underscore the depth and practical implications of our findings.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Approach to Autonomous Penetration Testing
The research employed a rigorous methodology to evaluate LLMs in a realistic cybersecurity context. It leverages a publicly available, third-party testbed known as "A Game of Active Directory" (GOAD), utilizing Microsoft Windows virtual machines.
The replication process was structured into three key phases:
- Setup of the Third-Party Testbed (Section 3): Installation of GOAD on VMware, including a Kali Linux VM for attacks. The system required significant resources: approximately 64 GB RAM and 190 GB hard-drive space.
- Prototype Setup (Section 4): Configuration of 'cochise', the autonomous AD penetration-testing prototype written in Python.
- Data Generation and Analysis (Section 5): Execution of test runs to generate data traces, followed by quantitative and qualitative analysis. Log files are provided to facilitate replication or independent analysis.
LLM Configurations Evaluated: Five distinct LLM configurations were used to compare performance and reasoning capabilities:
- Baseline Non-Reasoning LLMs: OpenAI's GPT-4o (gpt-4o-2024-08-06) and DeepSeek's DeepSeek-V3.
- Integrated Reasoning LLM: Google's Gemini 2.5 Flash (Preview).
- Hybrid Reasoning System: OpenAI's o1 (o1-preview-2024-12-17) for high-level planning combined with GPT-4o for low-level execution.
- Open-Weight Small Language Model (SLM): Alibaba's Qwen3, suitable for local deployment.
All cloud-hosted models were accessed through their respective vendors' offerings; Qwen3 was run on a dedicated virtual machine provisioned via LambdaLabs.
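The five configurations above can be captured as plain data, which is convenient when scripting batch runs. The sketch below is illustrative only; the field names and helper function are assumptions, not taken from the cochise codebase:

```python
# Illustrative encoding of the five evaluated LLM configurations.
# Keys and field names are hypothetical, not the cochise schema.
LLM_CONFIGS = {
    "gpt-4o":      {"planner": None, "executor": "gpt-4o-2024-08-06"},
    "deepseek-v3": {"planner": None, "executor": "deepseek-v3"},
    "gemini-2.5":  {"planner": None, "executor": "gemini-2.5-flash-preview"},
    "o1+gpt-4o":   {"planner": "o1-preview-2024-12-17",
                    "executor": "gpt-4o-2024-08-06"},
    "qwen3-local": {"planner": None, "executor": "qwen3", "local": True},
}

def describe(name: str) -> str:
    """One-line summary: hybrid planner/executor vs. single-model setup."""
    cfg = LLM_CONFIGS[name]
    if cfg.get("planner"):
        return f"{name}: hybrid ({cfg['planner']} plans, {cfg['executor']} executes)"
    return f"{name}: single-model ({cfg['executor']})"
```

Keeping configurations as data rather than branching code makes it easy to iterate over all five setups in an evaluation loop.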
Core Findings & Enterprise Applications
This section presents the critical insights derived from the research, highlighting the practical implications for enterprise cybersecurity strategies.
Enterprise Process Flow: LLM Penetration Testing Phases
Average Test Run Duration
6934.42s — average duration of LLM-driven penetration test runs across the evaluated configurations.

| Feature | LLM-Powered Automation | Traditional Pen-Testing |
|---|---|---|
| Scope | | |
| Cost-Efficiency | | |
| Consistency & Reproducibility | | |
| Adaptability | | |
| Limitations | | |
Case Study: A Game of Active Directory (GOADv3) Testbed
The research significantly benefited from using GOADv3, a robust, virtualized third-party testbed for Active Directory environments. This choice addressed known concerns about the limitations of synthetic testbeds for real-life security impact evaluations.
Key aspects of GOADv3:
- Realistic Environment: Consists of 5 virtual Windows machines, providing a complex and authentic Active Directory network for penetration testing.
- Resource Intensive: Requires approximately 64GB of RAM and 190GB of hard drive space, reflecting a genuine enterprise network setup.
- Flexibility: Supports multiple virtualization backends (VMware Workstation Pro was used in this study) and leverages tools like Vagrant and Ansible for automation.
Impact on Research: GOADv3 allowed for comprehensive and realistic capability evaluations of LLMs in offensive security settings, validating findings against established cybersecurity frameworks like MITRE ATT&CK.
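Validating runs against a framework like MITRE ATT&CK amounts to tagging observed attacker actions with technique identifiers so a run can be summarized per tactic. A minimal sketch of that idea follows; the action names and the mapping function are hypothetical, not the paper's actual coding scheme (the technique IDs themselves are real ATT&CK identifiers):

```python
# Hypothetical mapping from observed attacker actions to MITRE
# ATT&CK technique IDs. The action strings are illustrative; the
# IDs are genuine ATT&CK identifiers.
TECHNIQUE_MAP = {
    "kerberoasting": "T1558.003",
    "as-rep roasting": "T1558.004",
    "password spraying": "T1110.003",
    "smb share enumeration": "T1135",
}

def tag_actions(actions):
    """Return (action, technique-id) pairs; unknown actions map to None."""
    return [(a, TECHNIQUE_MAP.get(a.lower())) for a in actions]
```

Unknown actions deliberately map to `None` rather than raising, so a partially tagged run can still be summarized.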
Replication Artifacts & Analysis Tools
To ensure full reproducibility and enable further research, all source code and generated evidence (log data) are openly available.
Key Artifacts & Locations:
- GitHub Repository: https://github.com/andreashappe/cochise
- Zenodo Artifact Package: https://zenodo.org/records/17456062
GitHub Contents:
- Prototype Code: `cochise/tree/main/src` (for evidence generation)
- Generated Evidence: `cochise/tree/main/examples/initial-paper` (JSON log files)
- Analysis Scripts: `cochise/tree/main/src/analysis`
Zenodo Package Contents:
- Docker Images: `cochise_docker.tar` (prototype), `cochise_replay_analysis.tar` (analysis scripts & evidence)
- Kali Linux Attacker VM: `kali-linux-2025.3-vmware-amd64*` (pre-configured)
- Captured Log Data: `testrunslogs.zip` (logs captured during test runs)
- Comprehensive Documentation: `REQUIREMENTS`, `LICENSE`, `INSTALL`, `README` (setup/installation instructions)
Key Analysis Scripts:
- `src/analyze-json-logs.py`: generates overview tables of LLM runs and detailed token usage.
- `src/cochise-replay.py`: replays existing JSON log files to visualize tool output.
- `src/analyze-json-graphs.py`: generates the graphs used in the paper.
The container-based installation (`docker run -it cochise_analyze`) simplifies setup and allows interactive analysis without network connectivity.
Calculate Your Potential AI-Driven ROI in Cybersecurity
Estimate the transformative financial and operational benefits of integrating AI into your penetration testing and security operations, based on research-backed efficiency gains.
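As a rough illustration of the kind of estimate such a calculator performs, the sketch below computes annual savings from automating a share of penetration tests. Every input, the formula, and the function name are placeholders chosen for illustration; none of the figures are research results:

```python
def pentest_roi(manual_cost_per_test: float,
                tests_per_year: int,
                automation_share: float,
                llm_cost_per_test: float) -> dict:
    """Back-of-the-envelope annual savings from automating a share of
    penetration tests. All inputs are user-supplied assumptions."""
    automated = tests_per_year * automation_share
    savings = automated * (manual_cost_per_test - llm_cost_per_test)
    spend = automated * llm_cost_per_test
    return {
        "automated_tests": automated,
        "annual_savings": savings,
        "roi": savings / spend if spend else float("inf"),
    }
```

For example, automating half of ten annual tests at a hypothetical $20,000 manual cost and $500 LLM cost per test would yield $97,500 in estimated annual savings under these assumptions.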
Your AI Cybersecurity Implementation Roadmap
A strategic phased approach to integrate advanced AI capabilities into your enterprise security framework, ensuring seamless adoption and maximum impact.
Phase 1: Discovery & Strategy Alignment
Conduct a comprehensive assessment of current cybersecurity practices and identify key areas for AI augmentation. Define clear objectives and success metrics for LLM-driven penetration testing.
Phase 2: Pilot Deployment & Customization
Implement a pilot LLM-based autonomous testing prototype in a controlled environment (like a GOAD-like testbed). Customize LLM prompts and tools to align with specific enterprise network configurations and security policies.
Phase 3: Integration & Scaled Implementation
Integrate the validated AI solution into your existing security operations center (SOC) workflows. Scale deployment across relevant network segments, ensuring robust monitoring and incident response protocols are in place.
Phase 4: Continuous Optimization & Monitoring
Establish continuous feedback loops to refine LLM performance and adapt to evolving threat landscapes. Monitor AI-generated insights and automate remediation processes, ensuring ongoing security posture improvement.
Ready to Enhance Your Enterprise Security with AI?
Leverage the power of autonomous AI for advanced threat detection and vulnerability management. Book a personalized consultation to discuss how these innovations can fortify your network.