Autonomous Security Testing
Can LLMs Hack Enterprise Networks? – RCR Report
This Replicated Computational Results (RCR) Report empirically investigates the effectiveness of different Large Language Models (LLMs) for penetration-testing enterprise networks, specifically Microsoft Active Directory Assumed-Breach Simulations. It details the artifacts, evaluation setup, and analysis scripts used to assess LLMs' capabilities in offensive security.
Authored by ANDREAS HAPPE and JÜRGEN CITO. Published: March 05, 2026.
Executive Impact: Elevating Enterprise Cybersecurity
This research demonstrates the potential of LLMs to automate complex and costly cybersecurity tasks, offering a new paradigm for identifying vulnerabilities and enhancing network resilience. Key metrics underscore the depth and practical implications of our findings.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
Approach to Autonomous Penetration Testing
The research employed a rigorous methodology to evaluate LLMs in a realistic cybersecurity context. It leverages a publicly available, third-party testbed known as "A Game of Active Directory" (GOAD), utilizing Microsoft Windows virtual machines.
The replication process was structured into three key phases:
- Setup of the Third-Party Testbed (Section 3): Installation of GOAD on VMware, including a Kali Linux VM for attacks. The system required significant resources: approximately 64 GB RAM and 190 GB hard-drive space.
- Prototype Setup (Section 4): Configuration of 'cochise', the autonomous AD penetration-testing prototype written in Python.
- Data Generation and Analysis (Section 5): Execution of test runs to generate data traces, followed by quantitative and qualitative analysis. Log files are provided to facilitate replication or independent analysis.
LLM Configurations Evaluated: Five distinct LLM configurations were used to compare performance and reasoning capabilities:
- Baseline Non-Reasoning LLMs: OpenAI's GPT-4o (gpt-4o-2024-08-06) and DeepSeek's DeepSeek-V3.
- Integrated Reasoning LLM: Google's Gemini 2.5 Flash (Preview).
- Hybrid Reasoning System: OpenAI's o1 (o1-preview-2024-12-17) for high-level planning combined with GPT-4o for low-level execution.
- Open-Weight Small Language Model (SLM): Alibaba's Qwen3, suitable for local deployment.
All cloud-hosted models were accessed through their respective vendors' offerings; Qwen3 was run on a dedicated virtual machine provisioned via LambdaLabs.
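The five configurations above can be captured as plain data, which is convenient when scripting batch runs. The sketch below is illustrative only; the field names and helper function are assumptions, not taken from the cochise codebase:

```python
# Illustrative encoding of the five evaluated LLM configurations.
# Keys and field names are hypothetical, not the cochise schema.
LLM_CONFIGS = {
    "gpt-4o":      {"planner": None, "executor": "gpt-4o-2024-08-06"},
    "deepseek-v3": {"planner": None, "executor": "deepseek-v3"},
    "gemini-2.5":  {"planner": None, "executor": "gemini-2.5-flash-preview"},
    "o1+gpt-4o":   {"planner": "o1-preview-2024-12-17",
                    "executor": "gpt-4o-2024-08-06"},
    "qwen3-local": {"planner": None, "executor": "qwen3", "local": True},
}

def describe(name: str) -> str:
    """One-line summary: hybrid planner/executor vs. single-model setup."""
    cfg = LLM_CONFIGS[name]
    if cfg.get("planner"):
        return f"{name}: hybrid ({cfg['planner']} plans, {cfg['executor']} executes)"
    return f"{name}: single-model ({cfg['executor']})"
```

Keeping configurations as data rather than branching code makes it easy to iterate over all five setups in an evaluation loop.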
Core Findings & Enterprise Applications
This section presents the critical insights derived from the research, highlighting the practical implications for enterprise cybersecurity strategies.
Enterprise Process Flow: LLM Penetration Testing Phases
Average Test Run Duration
6934.42s — average duration of LLM-driven penetration test runs across the evaluated configurations.

| Feature | LLM-Powered Automation | Traditional Pen-Testing |
|---|---|---|
| Scope | | |
| Cost-Efficiency | | |
| Consistency & Reproducibility | | |
| Adaptability | | |
| Limitations | | |
Case Study: A Game of Active Directory (GOADv3) Testbed
The research significantly benefited from using GOADv3, a robust, virtualized third-party testbed for Active Directory environments. This choice addressed known concerns about the limitations of synthetic testbeds for real-life security impact evaluations.
Key aspects of GOADv3:
- Realistic Environment: Consists of 5 virtual Windows machines, providing a complex and authentic Active Directory network for penetration testing.
- Resource Intensive: Requires approximately 64GB of RAM and 190GB of hard drive space, reflecting a genuine enterprise network setup.
- Flexibility: Supports multiple virtualization backends (VMware Workstation Pro was used in this study) and leverages tools like Vagrant and Ansible for automation.
Impact on Research: GOADv3 allowed for comprehensive and realistic capability evaluations of LLMs in offensive security settings, validating findings against established cybersecurity frameworks like MITRE ATT&CK.
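Validating runs against a framework like MITRE ATT&CK amounts to tagging observed attacker actions with technique identifiers so a run can be summarized per tactic. A minimal sketch of that idea follows; the action names and the mapping function are hypothetical, not the paper's actual coding scheme (the technique IDs themselves are real ATT&CK identifiers):

```python
# Hypothetical mapping from observed attacker actions to MITRE
# ATT&CK technique IDs. The action strings are illustrative; the
# IDs are genuine ATT&CK identifiers.
TECHNIQUE_MAP = {
    "kerberoasting": "T1558.003",
    "as-rep roasting": "T1558.004",
    "password spraying": "T1110.003",
    "smb share enumeration": "T1135",
}

def tag_actions(actions):
    """Return (action, technique-id) pairs; unknown actions map to None."""
    return [(a, TECHNIQUE_MAP.get(a.lower())) for a in actions]
```

Unknown actions deliberately map to `None` rather than raising, so a partially tagged run can still be summarized.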
Replication Artifacts & Analysis Tools
To ensure full reproducibility and enable further research, all source code and generated evidence (log data) are openly available.
Key Artifacts & Locations:
- GitHub Repository: https://github.com/andreashappe/cochise
- Zenodo Artifact Package: https://zenodo.org/records/17456062
GitHub Contents:
- Prototype Code: `cochise/tree/main/src` (for evidence generation)
- Generated Evidence: `cochise/tree/main/examples/initial-paper` (JSON log files)
- Analysis Scripts: `cochise/tree/main/src/analysis`
Zenodo Package Contents:
- Docker Images: `cochise_docker.tar` (prototype), `cochise_replay_analysis.tar` (analysis scripts & evidence)
- Kali Linux Attacker VM: `kali-linux-2025.3-vmware-amd64*` (pre-configured)
- Captured Log Data: `testrunslogs.zip` (logs captured during test runs)
- Comprehensive Documentation: `REQUIREMENTS`, `LICENSE`, `INSTALL`, `README` (setup/installation instructions)
Key Analysis Scripts:
- `src/analyze-json-logs.py`: generates overview tables of LLM runs and detailed token usage.
- `src/cochise-replay.py`: replays existing JSON log files to visualize tool output.
- `src/analyze-json-graphs.py`: generates the graphs used in the paper.
The container-based installation (`docker run -it cochise_analyze`) simplifies setup and allows interactive analysis without network connectivity.
Calculate Your Potential AI-Driven ROI in Cybersecurity
Estimate the transformative financial and operational benefits of integrating AI into your penetration testing and security operations, based on research-backed efficiency gains.
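As a rough illustration of the kind of estimate such a calculator performs, the sketch below computes annual savings from automating a share of penetration tests. Every input, the formula, and the function name are placeholders chosen for illustration; none of the figures are research results:

```python
def pentest_roi(manual_cost_per_test: float,
                tests_per_year: int,
                automation_share: float,
                llm_cost_per_test: float) -> dict:
    """Back-of-the-envelope annual savings from automating a share of
    penetration tests. All inputs are user-supplied assumptions."""
    automated = tests_per_year * automation_share
    savings = automated * (manual_cost_per_test - llm_cost_per_test)
    spend = automated * llm_cost_per_test
    return {
        "automated_tests": automated,
        "annual_savings": savings,
        "roi": savings / spend if spend else float("inf"),
    }
```

For example, automating half of ten annual tests at a hypothetical $20,000 manual cost and $500 LLM cost per test would yield $97,500 in estimated annual savings under these assumptions.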
Your AI Cybersecurity Implementation Roadmap
A strategic phased approach to integrate advanced AI capabilities into your enterprise security framework, ensuring seamless adoption and maximum impact.
Phase 1: Discovery & Strategy Alignment
Conduct a comprehensive assessment of current cybersecurity practices and identify key areas for AI augmentation. Define clear objectives and success metrics for LLM-driven penetration testing.
Phase 2: Pilot Deployment & Customization
Implement a pilot LLM-based autonomous testing prototype in a controlled environment (like a GOAD-like testbed). Customize LLM prompts and tools to align with specific enterprise network configurations and security policies.
Phase 3: Integration & Scaled Implementation
Integrate the validated AI solution into your existing security operations center (SOC) workflows. Scale deployment across relevant network segments, ensuring robust monitoring and incident response protocols are in place.
Phase 4: Continuous Optimization & Monitoring
Establish continuous feedback loops to refine LLM performance and adapt to evolving threat landscapes. Monitor AI-generated insights and automate remediation processes, ensuring ongoing security posture improvement.
Ready to Enhance Your Enterprise Security with AI?
Leverage the power of autonomous AI for advanced threat detection and vulnerability management. Book a personalized consultation to discuss how these innovations can fortify your network.