
Autonomous Security Testing

Can LLMs Hack Enterprise Networks? – RCR Report

This Replicated Computational Results (RCR) Report empirically investigates the effectiveness of different Large Language Models (LLMs) for penetration-testing enterprise networks, specifically in Microsoft Active Directory Assumed-Breach simulations. It details the artifacts, evaluation setup, and analysis scripts used to assess LLMs' capabilities in offensive security.

Authored by ANDREAS HAPPE and JÜRGEN CITO. Published: March 05, 2026.

Executive Impact: Elevating Enterprise Cybersecurity

This research demonstrates the potential of LLMs to automate complex and costly cybersecurity tasks, offering a new paradigm for identifying vulnerabilities and enhancing network resilience. Key metrics underscore the depth and practical implications of our findings.

  • Testbed resources: ~64 GB RAM, ~190 GB hard-drive space
  • Average experiment run time: ~6,934 s
  • LLM configurations evaluated: 5
  • Windows VMs in testbed: 5

Deep Analysis & Enterprise Applications

The analysis below covers three areas: methodology, key insights, and replication artifacts.

Approach to Autonomous Penetration Testing

The research employed a rigorous methodology to evaluate LLMs in a realistic cybersecurity context. It leverages a publicly available, third-party testbed known as "A Game of Active Directory" (GOAD), utilizing Microsoft Windows virtual machines.

The replication process was structured into three key phases:

  • Setup of the Third-Party Testbed (Section 3): Installation of GOAD on VMware, including a Kali Linux VM for attacks. The system required significant resources: approximately 64 GB RAM and 190 GB of hard-drive space.
  • Prototype Setup (Section 4): Configuration of 'cochise', the autonomous AD penetration-testing prototype written in Python.
  • Data Generation and Analysis (Section 5): Execution of test runs to generate data traces, followed by quantitative and qualitative analysis. Log files are provided to facilitate replication or independent analysis.

LLM Configurations Evaluated: Five distinct LLM configurations were used to compare performance and reasoning capabilities:

  • Baseline Non-Reasoning LLMs: OpenAI's GPT-4o (gpt-4o-2024-08-06) and DeepSeek's DEEPSEEK-V3.
  • Integrated Reasoning LLM: Google's GEMINI-2.5-FLASH (Preview).
  • Hybrid Reasoning System: OpenAI's o1 (o1-preview-2024-12-17) for high-level planning combined with OpenAI's GPT-4o for low-level execution.
  • Open-weight Small Language Model (SLM): Alibaba's Qwen3, suitable for local deployment.

All cloud-hosted models were utilized through their respective makers' offerings, with Qwen3 being run on a dedicated virtual machine via LambdaLabs.
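For orientation, these five setups can be sketched as a small configuration registry. The model identifiers follow the report, but the registry structure, field names, and hosting labels are hypothetical illustrations for this summary, not part of the cochise codebase.

```python
# Hypothetical registry of the five evaluated LLM configurations.
# Structure and field names are illustrative, not from the cochise source.
from dataclasses import dataclass

@dataclass(frozen=True)
class LLMConfig:
    name: str
    planner: str   # model used for high-level planning
    executor: str  # model used for low-level command generation
    hosting: str   # "cloud" (vendor API) or "local" (dedicated VM)

CONFIGS = [
    LLMConfig("gpt-4o",      "gpt-4o-2024-08-06",     "gpt-4o-2024-08-06", "cloud"),
    LLMConfig("deepseek-v3", "deepseek-v3",           "deepseek-v3",       "cloud"),
    LLMConfig("gemini-2.5",  "gemini-2.5-flash",      "gemini-2.5-flash",  "cloud"),
    LLMConfig("o1+gpt-4o",   "o1-preview-2024-12-17", "gpt-4o-2024-08-06", "cloud"),
    LLMConfig("qwen3-local", "qwen3",                 "qwen3",             "local"),
]

# Only the hybrid setup splits planning and execution across two models.
hybrid = [c for c in CONFIGS if c.planner != c.executor]
print(len(CONFIGS), len(hybrid))  # 5 1
```

The hybrid entry captures the report's planner/executor split: o1 plans at a high level while GPT-4o generates the concrete commands.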

Core Findings & Enterprise Applications

This section presents the critical insights derived from the research, highlighting the practical implications for enterprise cybersecurity strategies.


Average Test Run Duration

6,934.42 s (≈1 h 56 min): average duration of LLM-driven penetration-test runs across the evaluated configurations.
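Converting the raw second count to wall-clock time (a trivial helper written for this summary, not taken from the analysis scripts):

```python
def fmt_duration(seconds: float) -> str:
    # Break a second count into hours, minutes, and seconds.
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    return f"{h}h {m}m {s}s"

print(fmt_duration(6934.42))  # 1h 55m 34s
```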

LLM-Powered vs. Traditional Penetration Testing

Scope
  • LLM-Powered Automation: Autonomous Assumed-Breach simulations against Active Directory; scalable for repetitive tasks.
  • Traditional Pen-Testing: Broader scope, adaptable to unique scenarios; human-driven exploration.

Cost-Efficiency
  • LLM-Powered Automation: Investigated as a viable option for SMEs due to lower operational costs; fewer human-intensive hours.
  • Traditional Pen-Testing: High cost due to expert human labor; limited by consultant availability.

Consistency & Reproducibility
  • LLM-Powered Automation: Reproducible tests with defined LLM prompts; less susceptible to human error and bias.
  • Traditional Pen-Testing: Varies with the individual tester's skill and experience; less consistent across engagements.

Adaptability
  • LLM-Powered Automation: Evolves with LLM capabilities and training data; can be fine-tuned for specific environments.
  • Traditional Pen-Testing: Highly adaptive, based on expert intuition and real-time decision-making; leverages deep human understanding of complex systems.

Limitations
  • LLM-Powered Automation: Reliance on precise prompting; "Almost-There" attacks that fail on trivial mistakes; lack of ethical guard rails observed in experiments.
  • Traditional Pen-Testing: Human biases and potential for fatigue; higher potential for missed vulnerabilities without comprehensive tooling.

Case Study: A Game of Active Directory (GOADv3) Testbed

The research significantly benefited from using GOADv3, a robust, virtualized third-party testbed for Active Directory environments. This choice addressed known concerns about the limitations of synthetic testbeds for real-life security impact evaluations.

Key aspects of GOADv3:

  • Realistic Environment: Consists of 5 virtual Windows machines, providing a complex and authentic Active Directory network for penetration testing.
  • Resource Intensive: Requires approximately 64GB of RAM and 190GB of hard drive space, reflecting a genuine enterprise network setup.
  • Flexibility: Supports multiple virtualization backends (VMware Workstation Pro was used in this study) and leverages tools like Vagrant and Ansible for automation.

Impact on Research: GOADv3 allowed for comprehensive and realistic capability evaluations of LLMs in offensive security settings, validating findings against established cybersecurity frameworks like MITRE ATT&CK.

Replication Artifacts & Analysis Tools

To ensure full reproducibility and enable further research, all source code and generated evidence (log data) are openly available.

Key Artifacts & Locations:

GitHub Contents:

  • Prototype Code: cochise/tree/main/src (for evidence generation)
  • Generated Evidence: cochise/tree/main/examples/initial-paper (JSON log files)
  • Analysis Scripts: cochise/tree/main/src/analysis

Zenodo Package Contents:

  • Docker Images: cochise_docker.tar (prototype), cochise_replay_analysis.tar (analysis scripts & evidence)
  • Kali Linux Attacker VM: kali-linux-2025.3-vmware-amd64* (pre-configured)
  • Captured Log Data: testrunslogs.zip (logs captured during test runs)
  • Comprehensive Documentation: REQUIREMENTS, LICENSE, INSTALL, README (setup/installation instructions)

Key Analysis Scripts:

  • src/analyze-json-logs.py: Generates overview tables of LLM runs and detailed token usage.
  • src/cochise-replay.py: Allows replaying existing JSON log files to visualize tool output.
  • src/analyze-json-graphs.py: Generates graphs used in the paper.

The container-based installation (docker run -it cochise_analyze) simplifies setup and allows interactive analysis without network connectivity.
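To illustrate the kind of aggregation a script like src/analyze-json-logs.py performs, the sketch below totals token usage per run from JSON-lines records. The record schema (run, prompt_tokens, completion_tokens fields) is a hypothetical stand-in and will not match the actual cochise log format.

```python
import io
import json
from collections import defaultdict

def token_usage(log_stream) -> dict:
    """Sum prompt/completion tokens per run from JSON-lines records.
    The schema here is hypothetical; the real cochise logs differ."""
    totals = defaultdict(lambda: {"prompt": 0, "completion": 0})
    for line in log_stream:
        rec = json.loads(line)
        t = totals[rec["run"]]
        t["prompt"] += rec["prompt_tokens"]
        t["completion"] += rec["completion_tokens"]
    return dict(totals)

sample = io.StringIO(
    '{"run": "gpt-4o-1", "prompt_tokens": 1200, "completion_tokens": 300}\n'
    '{"run": "gpt-4o-1", "prompt_tokens": 900, "completion_tokens": 250}\n'
    '{"run": "qwen3-1", "prompt_tokens": 700, "completion_tokens": 180}\n'
)
print(token_usage(sample))
# {'gpt-4o-1': {'prompt': 2100, 'completion': 550}, 'qwen3-1': {'prompt': 700, 'completion': 180}}
```

Token-usage overviews like this are what make cross-configuration cost comparisons between the five evaluated models possible.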

Calculate Your Potential AI-Driven ROI in Cybersecurity

Estimate the transformative financial and operational benefits of integrating AI into your penetration testing and security operations, based on research-backed efficiency gains.


Your AI Cybersecurity Implementation Roadmap

A strategic phased approach to integrate advanced AI capabilities into your enterprise security framework, ensuring seamless adoption and maximum impact.

Phase 1: Discovery & Strategy Alignment

Conduct a comprehensive assessment of current cybersecurity practices and identify key areas for AI augmentation. Define clear objectives and success metrics for LLM-driven penetration testing.

Phase 2: Pilot Deployment & Customization

Implement a pilot LLM-based autonomous testing prototype in a controlled environment (like a GOAD-like testbed). Customize LLM prompts and tools to align with specific enterprise network configurations and security policies.

Phase 3: Integration & Scaled Implementation

Integrate the validated AI solution into your existing security operations center (SOC) workflows. Scale deployment across relevant network segments, ensuring robust monitoring and incident response protocols are in place.

Phase 4: Continuous Optimization & Monitoring

Establish continuous feedback loops to refine LLM performance and adapt to evolving threat landscapes. Monitor AI-generated insights and automate remediation processes, ensuring ongoing security posture improvement.

Ready to Enhance Your Enterprise Security with AI?

Leverage the power of autonomous AI for advanced threat detection and vulnerability management. Book a personalized consultation to discuss how these innovations can fortify your network.
