Beyond Rewards in Reinforcement Learning for Cyber Defence
Elizabeth Bates*, Chris Hicks*, and Vasilios Mavroudis
February 4, 2026
Recent years have seen an explosion of interest in autonomous cyber defence agents trained to defend computer networks using deep reinforcement learning. These agents are typically trained in cyber gym environments using dense, highly engineered reward functions which combine many penalties and incentives for a range of (un)desirable states and costly actions. Dense rewards help alleviate the challenge of exploring complex environments but risk biasing agents towards suboptimal and potentially riskier solutions, a critical issue in complex cyber environments. We thoroughly evaluate the impact of reward function structure on learning and policy behavioural characteristics using a variety of sparse and dense reward functions, two well-established cyber gyms, a range of network sizes, and both policy gradient and value-based RL algorithms. Our evaluation is enabled by a novel ground truth evaluation approach which allows direct comparison between different reward functions, illuminating the nuanced inter-relationships between rewards, action space and the risks of suboptimal policies in cyber environments. Our results show that sparse rewards, provided they are goal aligned and can be encountered frequently, uniquely offer both enhanced training reliability and more effective cyber defence agents with lower-risk policies. Surprisingly, sparse rewards can also yield policies that are better aligned with cyber defender goals and make sparing use of costly defensive actions without explicit reward-based numerical penalties.
Executive Impact: Transforming Cyber Defence with Smarter AI
Our research demonstrates that a shift to sparse reward functions can drastically improve the effectiveness and reliability of autonomous cyber defence agents, leading to more secure networks with lower operational risks.
Deep Analysis & Enterprise Applications
The Critical Need for Smarter Cyber Defence AI
Autonomous Cyber Defence (ACD) agents, particularly those using Deep Reinforcement Learning (DRL), hold immense promise for mitigating increasingly sophisticated cyber attacks. However, current DRL agents are often trained in cyber gym environments with highly engineered, dense reward functions. While seemingly beneficial for expedited learning, these dense rewards risk biasing agents towards suboptimal and potentially riskier solutions, a critical concern in high-stakes cyber security. Our work addresses this fundamental challenge by proposing a more accurate evaluation framework and demonstrating the superior performance of sparse reward structures.
| Characteristic | Dense Rewards | Sparse Rewards |
|---|---|---|
| Optimization Objective | Combines many penalties and incentives; risks arbitrary numerical equivalences. | Places fewer constraints; aligned with core defensive goals. |
| Learning Speed | Appears faster due to frequent feedback, but can lead to suboptimal solutions. | May require more training iterations, but enables discovery of more effective policies. |
| Policy Risk | Risks biasing agents towards suboptimal/riskier solutions and avoidable weaknesses. | Encourages learning optimal strategies with lower-risk policies. |
| Computational Costs | Higher: complex reward engineering requires extensive tuning, and less reliable training inflates the number of runs needed. | Lower: simpler reward design and more reliable training reduce the number of runs required. |
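To make the contrast concrete, the sketch below shows in simplified form what a dense and a sparse reward function might look like for a defensive agent. The state keys, weights, and goal condition are hypothetical and are not the reward functions used in the study.

```python
# Hypothetical illustration of the contrast above; the state keys, weights,
# and goal condition are invented for this sketch and do not reproduce the
# reward functions used in the study.

def dense_reward(state: dict, action_cost: float) -> float:
    """Dense reward: many hand-tuned penalties combined at every step.
    The relative weights impose arbitrary numerical equivalences
    (e.g. one compromised server 'equals' five compromised user hosts),
    which is where bias towards suboptimal behaviour can creep in."""
    reward = 0.0
    reward -= 1.0 * state["compromised_user_hosts"]
    reward -= 5.0 * state["compromised_servers"]
    reward -= 0.5 * action_cost          # penalise costly defensive actions
    reward += 0.1 if state["all_services_up"] else -2.0
    return reward


def sparse_reward(state: dict, episode_done: bool) -> float:
    """Sparse reward: feedback only when the episode ends, tied directly
    to the core defensive goal rather than to intermediate bookkeeping."""
    if not episode_done:
        return 0.0
    return 1.0 if state["critical_nodes_safe"] else -1.0
```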
The Hidden Risk: Intra-Step Node Compromise
Traditional cyber gym evaluations often miss critical 'intra-step' events where nodes are compromised and then immediately restored within the same time step. This means agents receive an inaccurate observation and reward, failing to distinguish between truly secure states and those where a brief compromise occurred. Our ground truth scoring mechanism, ScoreGT, addresses this by capturing the maximum number of compromised nodes over both intra- and end-step, providing a true measure of security.
Precision in Evaluation: Introducing ScoreGT
The Ground Truth Score (ScoreGT) is a novel metric we introduce to overcome the limitations of discrete step-wise evaluation in cyber gyms. It measures the maximum number of compromised nodes across both intra-step and end-of-step states, providing a more accurate and robust measure of agent performance that is independent of the training reward function. For instance, if 3 nodes are compromised during a single step and 1 of them is restored by the end of that step, ScoreGT registers 3, not 2.
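Below is a minimal sketch of how a ground-truth score along these lines could be computed, assuming a simulator wrapper that exposes the true set of compromised nodes after each agent acts within a step. The names `env`, `step_red`, `step_blue`, and `compromised_nodes` are hypothetical, not the real cyber gym APIs, and how the per-step values are aggregated is left open.

```python
# Hypothetical sketch of intra-step ground-truth scoring; `env` is an
# assumed wrapper exposing the true network state, not a real gym API.

def ground_truth_scores(env, blue_policy, episode_length: int) -> list[int]:
    """For each time step, record the maximum number of compromised nodes
    observed both mid-step (after the attacker acts) and at the end of the
    step (after the defender acts), so that a compromise that is
    immediately restored is still counted."""
    scores = []
    obs = env.reset()
    for _ in range(episode_length):
        env.step_red()                              # attacker moves first
        intra = len(env.compromised_nodes())        # intra-step state
        obs = env.step_blue(blue_policy(obs))       # defender responds
        end = len(env.compromised_nodes())          # end-of-step state
        scores.append(max(intra, end))              # per-step ScoreGT
    return scores
```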
| Metric | Description | Why it Matters for Cyber Defence |
|---|---|---|
| Dispersion Variability Across Time (DT) | Measures stability of RL training. Smooth, monotonic policy improvements yield low DT scores. | Indicates high reliability and lowered computational costs during training. |
| Dispersion Variability Across Runs (DR') | Measures reproducibility of RL training across multiple runs, normalized by mean episodic reward. | Low DR' means high consistency between training runs, requiring fewer total training runs to find best agents. |
| Conditional Value at Risk (CVaR) | Quantifies risk in worst-case scenarios, specifically expected performance in the worst α fraction of cases. | Crucial for cyber defence to understand and mitigate worst-case performance, which can be catastrophic. |
| Risk Across Fixed-Policy Rollouts (RF) | Applies CVaR to the distribution of multiple fixed-policy evaluation rollouts. | Provides a robust measure of worst-case expected loss for a trained policy, directly assessing operational risk. |
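To make these measures concrete, here is a hedged sketch of how each might be computed from training curves and evaluation rollouts. The IQR-based formulations of DT and DR', the normalisation, and the α level are illustrative and may differ in detail from the paper's exact definitions.

```python
import numpy as np

def iqr(x) -> float:
    """Interquartile range: a robust measure of dispersion."""
    q75, q25 = np.percentile(x, [75, 25])
    return float(q75 - q25)

def dispersion_across_time(training_curve) -> float:
    """DT (sketch): dispersion of the detrended training curve of a single
    run. Smooth, monotonic improvement yields small, consistent step-to-step
    differences and therefore a low IQR."""
    return iqr(np.diff(np.asarray(training_curve, dtype=float)))

def dispersion_across_runs(curves_per_run) -> float:
    """DR' (sketch): dispersion across runs at matched checkpoints,
    normalised by the mean episodic reward, then averaged over time."""
    curves = np.asarray(curves_per_run, dtype=float)   # shape: (runs, checkpoints)
    per_checkpoint = [iqr(curves[:, t]) for t in range(curves.shape[1])]
    return float(np.mean(per_checkpoint) / abs(curves.mean()))

def cvar(values, alpha=0.05, lower_is_worse=True) -> float:
    """CVaR: mean of the worst alpha fraction of outcomes. For episodic
    returns (higher is better) the worst tail is the lowest values; for
    ScoreGT (lower is better) pass lower_is_worse=False."""
    values = np.sort(np.asarray(values, dtype=float))
    k = max(1, int(np.ceil(alpha * len(values))))
    tail = values[:k] if lower_is_worse else values[-k:]
    return float(tail.mean())

def risk_across_fixed_policy_rollouts(rollout_returns, alpha=0.05) -> float:
    """RF: CVaR applied to the returns of many evaluation rollouts of a
    single trained (fixed) policy."""
    return cvar(rollout_returns, alpha=alpha, lower_is_worse=True)
```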
Experimental Framework: Cyber Gyms & Reward Functions
Our study leverages two prominent cyber gyms: Yawning Titan (YT), an abstract graph-based network simulator, and CAGE 2 (miniCAGE), a more complex enterprise network simulation. We used both basic and extended action spaces, and scaled network sizes from 2 to 50 nodes. We evaluated agents trained with the PPO and DQN algorithms across five distinct reward functions: Sparse Positive (SP), Sparse Negative (SN), Sparse Positive-Negative (SPN), Dense Negative (DN), and Complex Dense Negative (CDN), the last of which represents typical cyber gym rewards. This comprehensive setup allowed us to robustly assess the impact of reward structure on agent performance, risk, and reliability.
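For orientation, the sketch below shows how such agents are typically trained with an off-the-shelf library such as Stable-Baselines3. The environment is a stand-in (CartPole) so the snippet runs as written; a real experiment would substitute a Gymnasium-compatible wrapper around Yawning Titan or CAGE 2 configured with the chosen reward function, and the hyperparameters shown are illustrative rather than those used in the study.

```python
# Sketch of the training setup with Stable-Baselines3. CartPole-v1 is a
# stand-in environment so the snippet runs as written; a real experiment
# would substitute a Gymnasium-compatible wrapper around Yawning Titan or
# CAGE 2 configured with the chosen reward function.
import gymnasium as gym
from stable_baselines3 import DQN, PPO


def make_env(reward_fn: str) -> gym.Env:
    # Placeholder: a real experiment would construct the cyber gym here,
    # parameterised by reward_fn in {"SP", "SN", "SPN", "DN", "CDN"}.
    return gym.make("CartPole-v1")


def train_agent(algo: str, reward_fn: str, total_timesteps: int = 50_000):
    env = make_env(reward_fn)
    model_cls = {"ppo": PPO, "dqn": DQN}[algo]
    model = model_cls("MlpPolicy", env, verbose=0)   # illustrative hyperparameters
    model.learn(total_timesteps=total_timesteps)
    return model


if __name__ == "__main__":
    # One PPO agent per reward structure, for later side-by-side evaluation.
    agents = {fn: train_agent("ppo", fn) for fn in ["SP", "SN", "SPN", "DN", "CDN"]}
```

The table below summarises average performance under each reward function.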
| Reward Function | Avg. ScoreGT (Lower is Better) | Avg. Lower RF (Risk) | Avg. DT (Reliability, ×10⁻³) | Avg. DR' (Reproducibility) |
|---|---|---|---|---|
| Sparse Positive-Negative (SPN) | 2.00 | 1.82 | 0.08 | 0.19 |
| Sparse Positive (SP) | 2.69 | 2.46 | 0.11 | 0.12 |
| Dense Negative (DN) | 6.29 | 5.84 | 2.33 | 0.12 |
| Complex Dense Negative (CDN) | 6.21 | 5.71 | 2.45 | 0.31 |
| Sparse Negative (SN) | 10.29 | 9.00 | 0.09 | 0.17 |
Enhanced Performance at Scale
As network sizes increase, the benefits of sparse rewards become even more pronounced, leading to significantly better policies and reduced risks in larger, more realistic environments.
Comparing results from Table 4 for 10- and 50-node networks in Yawning Titan with the extended action space shows that while all scores generally increase with network size (there are more nodes to compromise), the relative performance of sparse rewards (SPN) remains significantly better than that of dense rewards (DN). For example, SPN's ScoreGT for 50 nodes (6.58) is considerably lower than DN's (17.78), indicating far fewer compromised nodes.
Strategic Action Usage with Sparse Rewards
A detailed analysis of agent behaviour (Tables 18 and 20) reveals that sparsely rewarded agents (SP, SPN) make more strategic and sparing use of costly defensive actions such as 'restore'. Specifically, SP agents use the restore action more sparingly overall and focus it primarily on user hosts, which have fewer privileges and less impact on operations. In CAGE, attackers facing SP agents achieve significantly fewer successful privilege escalations on operational and enterprise hosts (0.25 and 1.29 per episode on average, vs. 7.89 and 22.22 against CDN agents), and most of the attacker's privileged host access is confined to user subnet hosts (92.31%). This demonstrates that sparse rewards lead to policies that are better aligned with actual cyber defender goals, avoiding unnecessary actions and focusing on critical assets without explicit numerical penalties for those actions in the reward function.
The table below compares ScoreGT under different orderings of the attacker's (red) and defender's (blue) actions within a step; sparsely rewarded agents achieve markedly lower scores in every case.
| Reward Function | ScoreGT (Red acts, then Blue) | ScoreGT (Blue acts, then Red) | ScoreGT (Random action order) |
|---|---|---|---|
| Sparse Positive-Negative (SPN) | 0.90 | 0.61 | 4.50 |
| Sparse Positive (SP) | 0.90 | 0.27 | 6.91 |
| Complex Dense Negative (CDN) | 5.65 | 3.99 | 13.24 |
| Dense Negative (DN) | 4.13 | 2.98 | 11.75 |
AI Cyber Defence Implementation Roadmap
A phased approach to integrating advanced AI into your cyber security operations for optimal results.
Phase 1: Assessment & Strategy Definition
Evaluate current cyber defence posture, identify key vulnerabilities, and define AI integration strategy. This includes selecting appropriate reward structures based on core security objectives.
Duration: 2-4 Weeks
Phase 2: Custom Model Training & Simulation
Train DRL agents in secure cyber gym environments using optimized sparse reward functions. Conduct extensive simulations to validate policy effectiveness and risk profiles.
Duration: 4-8 Weeks
Phase 3: Integration & Phased Deployment
Integrate trained AI agents into existing security tools and infrastructure. Deploy in a controlled, phased manner, starting with low-risk environments and gradually expanding scope.
Duration: 6-12 Weeks
Phase 4: Continuous Monitoring & Optimization
Establish real-time monitoring of AI agent performance. Continuously refine models and adapt reward functions to evolving threat landscapes and network changes.
Duration: Ongoing
Ready to Enhance Your Cyber Defence with AI?
Discover how intelligent, reliable AI agents can transform your enterprise security. Our experts are ready to guide you through a tailored implementation plan.