CYBERSECURITY ANALYSIS
CritBench: A Framework for Evaluating Cybersecurity Capabilities of Large Language Models in IEC 61850 Digital Substation Environments
The advancement of Large Language Models (LLMs) raises concerns regarding their dual-use potential in cybersecurity. Existing evaluation frameworks overwhelmingly focus on Information Technology (IT) environments, failing to capture the constraints and specialized protocols of Operational Technology (OT). This paper introduces CritBench, a novel framework designed to evaluate the cybersecurity capabilities of LLM agents within IEC 61850 Digital Substation environments, assessing five state-of-the-art models across a corpus of 81 domain-specific tasks.
Key Findings & Performance Snapshot
CritBench reveals that LLMs possess domain-specific knowledge but face significant challenges in dynamic OT environments, highlighting a critical gap between conceptual understanding and autonomous action.
Deep Analysis & Enterprise Applications
CritBench Framework Overview
CritBench addresses a critical gap in LLM cybersecurity evaluations by focusing on Operational Technology (OT) environments, specifically IEC 61850 Digital Substations. Unlike generic IT benchmarks, OT requires understanding specialized communication protocols, high-stakes environments, and prioritization of availability over data confidentiality. LLMs need to bridge a significant semantic gap to operate effectively in this domain.
Main Contributions of CritBench:
- An open-source, automated benchmarking framework for LLM agents in IEC 61850 Digital Substation environments.
- A corpus of 81 newly developed IEC 61850 domain-specific tasks covering static XML configuration analysis, network traffic analysis (PCAP), and live machine interaction.
- An execution environment equipped with CritLayer, a domain-specific agent scaffolding tailored for Digital Substation protocols.
- A comprehensive empirical evaluation across five recent LLMs (GPT-5.1, Qwen3.5 397BA17, GPT-5 mini, MiniMax M2.5, GPT-5 nano).
The CritBench task corpus consists of 81 cybersecurity challenges evaluating agent performance in IEC 61850 Digital Substation environments. These tasks require proficiency in key protocols such as MMS, GOOSE, SV, and IEC 60870-5-104. Tasks range from analyzing static configurations and passive network traffic to active interaction with virtual machines.
Mapping CritBench Tasks to MITRE ATT&CK for ICS
| ICS Tactic | ICS Technique | Benchmark Implementation Examples |
|---|---|---|
| Discovery | Point & Tag Identification (T0861) | Enumerating logical nodes & IEC 104 Information Object Addresses |
| Discovery | Network Service Scanning (T0842) | Identifying listening ports & IEC 61850 protocols |
| Collection | Automated Collection (T0802) | Parsing high-rate SV traffic to extract measurement channels |
| Evasion | Spoofing Standard Data (T0885) | Injecting GOOSE frames with spoofed state numbers |
| Impact | Manipulation of Control (T0831) | Toggling single-point control outputs to forcefully open circuit breakers |
| Impact | Loss of Protection (T0837) | Overwriting time overcurrent threshold attributes in protection relays |
Task Modalities:
Static Configuration Analysis (30 tasks): Agents parse and interpret SCD files (XML-based) to understand substation topology, enumerate logical nodes, identify protection functions, extract VLAN assignments, and perform security auditing by detecting configuration changes.
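To make the static modality concrete, here is a minimal sketch (not taken from the paper) of how such a task might be approached programmatically: it enumerates logical nodes from an SCD file using Python's standard ElementTree. The SCL namespace is the one defined in IEC 61850-6; the file name is illustrative, and vendor exports may nest elements differently.

```python
# Minimal sketch: enumerate logical nodes from an SCD file (IEC 61850-6 SCL/XML).
# Assumption: the file name is illustrative; real SCD exports may vary in structure.
import xml.etree.ElementTree as ET

SCL_NS = {"scl": "http://www.iec.ch/61850/2003/SCL"}

def enumerate_logical_nodes(scd_path: str):
    """Yield (IED name, logical device inst, logical node name) tuples."""
    root = ET.parse(scd_path).getroot()
    for ied in root.findall(".//scl:IED", SCL_NS):
        ied_name = ied.get("name", "?")
        for ldevice in ied.findall(".//scl:LDevice", SCL_NS):
            ld_inst = ldevice.get("inst", "?")
            # LN0 and LN elements both represent logical nodes in SCL.
            for ln in ldevice.findall("scl:LN0", SCL_NS) + ldevice.findall("scl:LN", SCL_NS):
                ln_name = f"{ln.get('prefix', '')}{ln.get('lnClass', '?')}{ln.get('inst', '')}"
                yield ied_name, ld_inst, ln_name

if __name__ == "__main__":
    for ied, ld, ln in enumerate_logical_nodes("substation.scd"):  # hypothetical file
        print(f"{ied}/{ld}/{ln}")
```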
Network Traffic Analysis (30 tasks): These tasks evaluate agents' ability to extract intelligence from raw network packet captures. Challenges include auditing clock synchronization in SV streams, detecting GOOSE configuration anomalies, reconstructing MMS data models, and correlating hardware identities across network layers.
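As a rough illustration of the passive traffic modality, the sketch below counts GOOSE and SV frames in a capture by their Layer-2 EtherTypes (0x88B8 and 0x88BA) using scapy. The capture file name is an assumption, and real benchmark tasks would require deeper decoding of the GOOSE/SV payloads than shown here.

```python
# Minimal sketch: count GOOSE and Sampled Values frames in a capture by EtherType.
# Frames carried inside 802.1Q VLAN tags are unwrapped before checking the type.
from collections import Counter
from scapy.all import rdpcap
from scapy.layers.l2 import Ether, Dot1Q

GOOSE_ETHERTYPE = 0x88B8
SV_ETHERTYPE = 0x88BA

def count_iec61850_frames(pcap_path: str) -> Counter:
    counts = Counter()
    for pkt in rdpcap(pcap_path):
        if Ether not in pkt:
            continue
        ethertype = pkt[Dot1Q].type if Dot1Q in pkt else pkt[Ether].type
        if ethertype == GOOSE_ETHERTYPE:
            counts["GOOSE"] += 1
        elif ethertype == SV_ETHERTYPE:
            counts["SV"] += 1
    return counts

if __name__ == "__main__":
    print(count_iec61850_frames("substation_traffic.pcap"))  # hypothetical capture
```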
Dynamic System Interaction (21 tasks): These tasks measure agents' ability to autonomously execute actions against a containerized IED environment with real industrial protocol stacks. They involve discovering MMS data models, passively reading SV streams, and performing targeted state alterations such as flipping a logical breaker position or modifying protection thresholds.
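For the dynamic modality, the following sketch shows only the simplest discovery-style step: probing whether an IED exposes MMS (ISO transport over TCP, port 102) or IEC 60870-5-104 (TCP port 2404). The host address is a placeholder; actual data-model browsing or state alteration would additionally require a full MMS client stack, which is not shown.

```python
# Minimal sketch of a discovery-style probe against a (hypothetical) IED address.
import socket

def probe_ports(host: str, ports=(102, 2404)) -> dict:
    """Return {port: reachable} for TCP ports (102 = MMS, 2404 = IEC 60870-5-104)."""
    results = {}
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(2.0)
            results[port] = sock.connect_ex((host, port)) == 0
    return results

if __name__ == "__main__":
    print(probe_ports("192.0.2.10"))  # placeholder address from the TEST-NET range
```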
The flagship model, GPT-5.1, demonstrated the highest proficiency with an overall accuracy of 86.4%. Open-weight and mid-tier models, Qwen3.5 397BA17 (76.6%), GPT-5 mini (75.4%), and MiniMax M2.5 (74.9%), form a distinct capability cluster, reliably handling basic protocol parsing and static file analysis. The smallest model, GPT-5 nano, showed a clear degradation at 43.3%, indicating a minimum capability threshold for navigating IEC 61850 environments.
The specialized tooling provided by CritLayer significantly improves overall model performance: agents leveraging CritLayer achieved a 77.8% success rate compared to 58.1% without it, demonstrating that domain-specific tools, which provide structured outputs and reduce command-guessing, enable more reliable task execution by LLMs.
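CritLayer's internals are not detailed here, so the following is only a hypothetical illustration of the pattern this result points to: wrapping a raw protocol read in an agent tool that returns structured, typed output instead of raw bytes or shell dumps. Every name, field, and the assumed client interface in the sketch are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of domain-specific agent tooling in the spirit of CritLayer:
# the tool returns a typed result the agent can reason over, rather than raw bytes.
from dataclasses import dataclass
from typing import Optional

@dataclass
class MmsReadResult:
    reference: str          # e.g. "IED1LD0/XCBR1.Pos.stVal" (illustrative)
    value: object           # decoded value, not a binary dump
    quality: str            # simplified quality flag
    ok: bool
    error: Optional[str] = None

def read_data_attribute(client, reference: str) -> MmsReadResult:
    """Tool exposed to the agent: read one data attribute and return typed output."""
    try:
        raw = client.read(reference)      # assumed client interface, not a real library call
        return MmsReadResult(reference=reference, value=raw.value,
                             quality=raw.quality, ok=True)
    except Exception as exc:              # surface failures as data, not stack traces
        return MmsReadResult(reference=reference, value=None,
                             quality="unknown", ok=False, error=str(exc))
```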
Performance by Task Complexity:
All models show high proficiency in static analysis tasks, such as SCD parsing and single-step PCAP analysis, achieving near 100% on inventory challenges. However, performance significantly degrades in dynamic tasks requiring multi-step cross-protocol correlation and live IED interaction. LLMs struggle with persistent sequential reasoning, state tracking, and accurate manipulation of live systems without the specialized tool scaffold. Difficulties arise in mapping discovered logical nodes to precise writable functional constraints and interpreting complex industrial protocol structures like IEC 60870-5-104.
Resource Efficiency:
Open-weight and mid-tier models consume substantial token budgets and frequently enter repetitive loops when industrial tools return unfamiliar binary structures. GPT-5.1 is more efficient, using fewer tokens and completing tasks faster (25-second median run duration). Financial costs vary: GPT-5 mini offers competitive mid-tier accuracy at a highly economical $0.005 per run, while GPT-5 nano lacks the required capability despite its low cost.
Current Limitations:
CritBench currently uses emulated network infrastructure and software-defined protocol stacks, which, while preserving packet structures, do not fully replicate proprietary firmware behaviors. The task corpus primarily covers IEC 61850 terminology and features rather than the full spectrum of cybersecurity attacker and defender tasks. Finally, LLM performance is highly sensitive to system prompting and runtime non-determinism, factors that multiple independent runs mitigate but do not entirely eliminate.
Future Work Directions:
- Transition parts of the tasks into Hardware-in-the-Loop (HIL) setups to capture vendor-specific firmware behaviors and physical process interactions.
- Broaden protocol coverage and diversify the vendor/device corpus to represent a wider range of OT environments.
- Investigate advanced agent scaffolds to address more complex, cross-protocol chains that single agents currently find challenging.
- Incorporate prompt-robustness stress tests and standardized evaluation of safety-critical failure modes to better quantify risks as LLM capabilities evolve.
Our Proven Implementation Roadmap
Our structured approach ensures a smooth, secure, and effective integration of AI into your enterprise cybersecurity operations, minimizing disruption and maximizing impact.
Phase 1: Discovery & Strategy
In-depth analysis of your current cybersecurity posture, existing IT/OT infrastructure, and specific operational challenges. We define clear, measurable objectives for AI integration.
Phase 2: Custom AI Solution Design
Development of a tailored AI architecture, including model selection, data pipelines for OT/IT environments, and custom tooling like CritLayer. Focus on protocol compatibility and security.
Phase 3: Integration & Testing
Seamless integration of the AI system within your existing security ecosystem. Rigorous testing in emulated and, where appropriate, HIL environments to validate performance and safety.
Phase 4: Deployment & Optimization
Staged deployment, ongoing monitoring, and continuous optimization based on real-world performance data. We ensure the AI system adapts to evolving threats and operational needs.
Ready to Transform Your Cybersecurity?
Partner with our experts to harness the power of AI for robust and intelligent OT/IT cybersecurity. Let's build a more resilient future together.