CYBERSECURITY ANALYSIS
CritBench: A Framework for Evaluating Cybersecurity Capabilities of Large Language Models in IEC 61850 Digital Substation Environments
The advancement of Large Language Models (LLMs) raises concerns regarding their dual-use potential in cybersecurity. Existing evaluation frameworks overwhelmingly focus on Information Technology (IT) environments, failing to capture the constraints and specialized protocols of Operational Technology (OT). This paper introduces CritBench, a novel framework designed to evaluate the cybersecurity capabilities of LLM agents within IEC 61850 Digital Substation environments, assessing five state-of-the-art models across a corpus of 81 domain-specific tasks.
Key Findings & Performance Snapshot
CritBench reveals that LLMs possess domain-specific knowledge but face significant challenges in dynamic OT environments, highlighting a critical gap between conceptual understanding and autonomous action.
Deep Analysis & Enterprise Applications
CritBench Framework Overview
CritBench addresses a critical gap in LLM cybersecurity evaluations by focusing on Operational Technology (OT) environments, specifically IEC 61850 Digital Substations. Unlike generic IT benchmarks, OT requires understanding specialized communication protocols, high-stakes environments, and prioritization of availability over data confidentiality. LLMs need to bridge a significant semantic gap to operate effectively in this domain.
Main Contributions of CritBench:
- An open-source, automated benchmarking framework for LLM agents in IEC 61850 Digital Substation environments.
- A corpus of 81 newly developed IEC 61850 domain-specific tasks covering static XML configuration analysis, network traffic analysis (PCAP), and live machine interaction.
- An execution environment equipped with CritLayer, a domain-specific agent scaffolding tailored for Digital Substation protocols.
- A comprehensive empirical evaluation across five recent LLMs (GPT-5.1, Qwen3.5 397BA17, GPT-5 mini, MiniMax M2.5, GPT-5 nano).
The CritBench task corpus consists of 81 cybersecurity challenges evaluating agent performance in IEC 61850 Digital Substation environments. These tasks require proficiency in key protocols such as MMS, GOOSE, SV, and IEC 60870-5-104. Tasks range from analyzing static configurations and passive network traffic to active interaction with virtual machines.
Mapping CritBench Tasks to MITRE ATT&CK for ICS
| ICS Tactic | ICS Technique | Benchmark Implementation Examples |
|---|---|---|
| Discovery | Point & Tag Identification (T0861) | Enumerating logical nodes & IEC 104 Information Object Addresses |
| Discovery | Network Service Scanning (T0842) | Identifying listening ports & IEC 61850 protocols |
| Collection | Automated Collection (T0802) | Parsing high-rate SV traffic to extract measurement channels |
| Evasion | Spoofing Standard Data (T0885) | Injecting GOOSE frames with spoofed state numbers |
| Impact | Manipulation of Control (T0831) | Toggling single-point control outputs to forcefully open circuit breakers |
| Impact | Loss of Protection (T0837) | Overwriting time overcurrent threshold attributes in protection relays |
Task Modalities:
Static Configuration Analysis (30 tasks): Agents parse and interpret SCD files (XML-based) to understand substation topology, enumerate logical nodes, identify protection functions, extract VLAN assignments, and perform security auditing by detecting configuration changes.
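To make the static modality concrete, here is a minimal sketch (not taken from the paper) of how such a task might be approached programmatically: it enumerates logical nodes from an SCD file using Python's standard ElementTree. The SCL namespace is the one defined in IEC 61850-6; the file name is illustrative, and vendor exports may nest elements differently.

```python
# Minimal sketch: enumerate logical nodes from an SCD file (IEC 61850-6 SCL/XML).
# Assumption: the file name is illustrative; real SCD exports may vary in structure.
import xml.etree.ElementTree as ET

SCL_NS = {"scl": "http://www.iec.ch/61850/2003/SCL"}

def enumerate_logical_nodes(scd_path: str):
    """Yield (IED name, logical device inst, logical node name) tuples."""
    root = ET.parse(scd_path).getroot()
    for ied in root.findall(".//scl:IED", SCL_NS):
        ied_name = ied.get("name", "?")
        for ldevice in ied.findall(".//scl:LDevice", SCL_NS):
            ld_inst = ldevice.get("inst", "?")
            # LN0 and LN elements both represent logical nodes in SCL.
            for ln in ldevice.findall("scl:LN0", SCL_NS) + ldevice.findall("scl:LN", SCL_NS):
                ln_name = f"{ln.get('prefix', '')}{ln.get('lnClass', '?')}{ln.get('inst', '')}"
                yield ied_name, ld_inst, ln_name

if __name__ == "__main__":
    for ied, ld, ln in enumerate_logical_nodes("substation.scd"):  # hypothetical file
        print(f"{ied}/{ld}/{ln}")
```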
Network Traffic Analysis (30 tasks): These tasks evaluate agents' ability to extract intelligence from raw network packet captures. Challenges include auditing clock synchronization in SV streams, detecting GOOSE configuration anomalies, reconstructing MMS data models, and correlating hardware identities across network layers.
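As a rough illustration of the passive traffic modality, the sketch below counts GOOSE and SV frames in a capture by their Layer-2 EtherTypes (0x88B8 and 0x88BA) using scapy. The capture file name is an assumption, and real benchmark tasks would require deeper decoding of the GOOSE/SV payloads than shown here.

```python
# Minimal sketch: count GOOSE and Sampled Values frames in a capture by EtherType.
# Frames carried inside 802.1Q VLAN tags are unwrapped before checking the type.
from collections import Counter
from scapy.all import rdpcap
from scapy.layers.l2 import Ether, Dot1Q

GOOSE_ETHERTYPE = 0x88B8
SV_ETHERTYPE = 0x88BA

def count_iec61850_frames(pcap_path: str) -> Counter:
    counts = Counter()
    for pkt in rdpcap(pcap_path):
        if Ether not in pkt:
            continue
        ethertype = pkt[Dot1Q].type if Dot1Q in pkt else pkt[Ether].type
        if ethertype == GOOSE_ETHERTYPE:
            counts["GOOSE"] += 1
        elif ethertype == SV_ETHERTYPE:
            counts["SV"] += 1
    return counts

if __name__ == "__main__":
    print(count_iec61850_frames("substation_traffic.pcap"))  # hypothetical capture
```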
Dynamic System Interaction (21 tasks): These tasks measure agents' ability to autonomously execute actions against a containerized IED environment with real industrial protocol stacks. They involve discovering MMS data models, passively reading SV streams, and performing targeted state alterations such as flipping a logical breaker position or modifying protection thresholds.
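For the dynamic modality, the following sketch shows only the simplest discovery-style step: probing whether an IED exposes MMS (ISO transport over TCP, port 102) or IEC 60870-5-104 (TCP port 2404). The host address is a placeholder; actual data-model browsing or state alteration would additionally require a full MMS client stack, which is not shown.

```python
# Minimal sketch of a discovery-style probe against a (hypothetical) IED address.
import socket

def probe_ports(host: str, ports=(102, 2404)) -> dict:
    """Return {port: reachable} for TCP ports (102 = MMS, 2404 = IEC 60870-5-104)."""
    results = {}
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(2.0)
            results[port] = sock.connect_ex((host, port)) == 0
    return results

if __name__ == "__main__":
    print(probe_ports("192.0.2.10"))  # placeholder address from the TEST-NET range
```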
The flagship model, GPT-5.1, demonstrated the highest proficiency with an overall accuracy of 86.4%. Open-weight and mid-tier models, Qwen3.5 397BA17 (76.6%), GPT-5 mini (75.4%), and MiniMax M2.5 (74.9%), form a distinct capability cluster, reliably handling basic protocol parsing and static file analysis. The smallest model, GPT-5 nano, showed a clear degradation at 43.3%, indicating a minimum capability threshold for navigating IEC 61850 environments.
The specialized tooling provided by CritLayer significantly improves overall model performance: agents leveraging CritLayer achieved a 77.8% success rate compared to 58.1% without it, demonstrating that domain-specific tools, which provide structured outputs and reduce command-guessing, enable more reliable task execution by LLMs.
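CritLayer's internals are not detailed here, so the following is only a hypothetical illustration of the pattern this result points to: wrapping a raw protocol read in an agent tool that returns structured, typed output instead of raw bytes or shell dumps. Every name, field, and the assumed client interface in the sketch are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of domain-specific agent tooling in the spirit of CritLayer:
# the tool returns a typed result the agent can reason over, rather than raw bytes.
from dataclasses import dataclass
from typing import Optional

@dataclass
class MmsReadResult:
    reference: str          # e.g. "IED1LD0/XCBR1.Pos.stVal" (illustrative)
    value: object           # decoded value, not a binary dump
    quality: str            # simplified quality flag
    ok: bool
    error: Optional[str] = None

def read_data_attribute(client, reference: str) -> MmsReadResult:
    """Tool exposed to the agent: read one data attribute and return typed output."""
    try:
        raw = client.read(reference)      # assumed client interface, not a real library call
        return MmsReadResult(reference=reference, value=raw.value,
                             quality=raw.quality, ok=True)
    except Exception as exc:              # surface failures as data, not stack traces
        return MmsReadResult(reference=reference, value=None,
                             quality="unknown", ok=False, error=str(exc))
```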
Performance by Task Complexity:
All models show high proficiency in static analysis tasks, such as SCD parsing and single-step PCAP analysis, achieving near 100% on inventory challenges. However, performance significantly degrades in dynamic tasks requiring multi-step cross-protocol correlation and live IED interaction. LLMs struggle with persistent sequential reasoning, state tracking, and accurate manipulation of live systems without the specialized tool scaffold. Difficulties arise in mapping discovered logical nodes to precise writable functional constraints and interpreting complex industrial protocol structures like IEC 60870-5-104.
Resource Efficiency:
Open-weight and mid-tier models consume substantial token budgets and frequently enter repetitive loops when industrial tools return unfamiliar binary structures. GPT-5.1 is more efficient, using fewer tokens and completing tasks faster (25-second median run duration). Financial costs vary: GPT-5 mini offers competitive mid-tier accuracy at a highly economical $0.005 per run, while GPT-5 nano lacks the required capability despite its low cost.
Current Limitations:
CritBench currently uses emulated network infrastructure and software-defined protocol stacks, which, while preserving packet structures, do not fully replicate proprietary firmware behaviors. The task corpus primarily covers IEC 61850 terminology and features rather than the full spectrum of cybersecurity attacker and defender tasks. Finally, LLM performance is highly sensitive to system prompting and runtime non-determinism, factors that multiple independent runs mitigate but do not entirely eliminate.
Future Work Directions:
- Transition parts of the tasks into Hardware-in-the-Loop (HIL) setups to capture vendor-specific firmware behaviors and physical process interactions.
- Broaden protocol coverage and diversify the vendor/device corpus to represent a wider range of OT environments.
- Investigate advanced agent scaffolds to address more complex, cross-protocol chains that single agents currently find challenging.
- Incorporate prompt-robustness stress tests and standardized evaluation of safety-critical failure modes to better quantify risks as LLM capabilities evolve.
Our Proven Implementation Roadmap
Our structured approach ensures a smooth, secure, and effective integration of AI into your enterprise cybersecurity operations, minimizing disruption and maximizing impact.
Phase 1: Discovery & Strategy
In-depth analysis of your current cybersecurity posture, existing IT/OT infrastructure, and specific operational challenges. We define clear, measurable objectives for AI integration.
Phase 2: Custom AI Solution Design
Development of a tailored AI architecture, including model selection, data pipelines for OT/IT environments, and custom tooling like CritLayer. Focus on protocol compatibility and security.
Phase 3: Integration & Testing
Seamless integration of the AI system within your existing security ecosystem. Rigorous testing in emulated and, where appropriate, HIL environments to validate performance and safety.
Phase 4: Deployment & Optimization
Staged deployment, ongoing monitoring, and continuous optimization based on real-world performance data. We ensure the AI system adapts to evolving threats and operational needs.
Ready to Transform Your Cybersecurity?
Partner with our experts to harness the power of AI for robust and intelligent OT/IT cybersecurity. Let's build a more resilient future together.