Beyond Rewards in Reinforcement Learning for Cyber Defence
Elizabeth Bates*, Chris Hicks*, and Vasilios Mavroudis
February 4, 2026
Recent years have seen an explosion of interest in autonomous cyber defence agents trained to defend computer networks using deep reinforcement learning. These agents are typically trained in cyber gym environments using dense, highly engineered reward functions which combine many penalties and incentives for a range of (un)desirable states and costly actions. Dense rewards help alleviate the challenge of exploring complex environments but risk biasing agents towards suboptimal and potentially riskier solutions, a critical issue in complex cyber environments. We thoroughly evaluate the impact of reward function structure on learning and policy behavioural characteristics using a variety of sparse and dense reward functions, two well-established cyber gyms, a range of network sizes, and both policy gradient and value-based RL algorithms. Our evaluation is enabled by a novel ground truth evaluation approach which allows direct comparison between different reward functions, illuminating the nuanced inter-relationships between rewards, action space and the risks of suboptimal policies in cyber environments. Our results show that sparse rewards, provided they are goal aligned and can be encountered frequently, uniquely offer both enhanced training reliability and more effective cyber defence agents with lower-risk policies. Surprisingly, sparse rewards can also yield policies that are better aligned with cyber defender goals and make sparing use of costly defensive actions without explicit reward-based numerical penalties.
Executive Impact: Transforming Cyber Defence with Smarter AI
Our research demonstrates that a shift to sparse reward functions can drastically improve the effectiveness and reliability of autonomous cyber defence agents, leading to more secure networks with lower operational risks.
Deep Analysis & Enterprise Applications
The Critical Need for Smarter Cyber Defence AI
Autonomous Cyber Defence (ACD) agents, particularly those using Deep Reinforcement Learning (DRL), hold immense promise for mitigating increasingly sophisticated cyber attacks. However, current DRL agents are often trained in cyber gym environments with highly engineered, dense reward functions. While seemingly beneficial for expedited learning, these dense rewards risk biasing agents towards suboptimal and potentially riskier solutions, a critical concern in high-stakes cyber security. Our work addresses this fundamental challenge by proposing a more accurate evaluation framework and demonstrating the superior performance of sparse reward structures.
| Characteristic | Dense Rewards | Sparse Rewards |
|---|---|---|
| Optimization Objective | Combines many penalties and incentives; risks arbitrary numerical equivalences. | Places fewer constraints; aligned with core defensive goals. |
| Learning Speed | Appears faster due to frequent feedback, but can lead to suboptimal solutions. | May require more training iterations, but enables discovery of more effective policies. |
| Policy Risk | Risks biasing agents towards suboptimal/riskier solutions and avoidable weaknesses. | Encourages learning optimal strategies with lower-risk policies. |
| Computational Costs | Higher: complex reward engineering requires extensive tuning, and less reliable training inflates the number of runs needed. | Lower: simpler reward design and more reliable training reduce the number of runs required. |
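To make the contrast concrete, the sketch below shows in simplified form what a dense and a sparse reward function might look like for a defensive agent. The state keys, weights, and goal condition are hypothetical and are not the reward functions used in the study.

```python
# Hypothetical illustration of the contrast above; the state keys, weights,
# and goal condition are invented for this sketch and do not reproduce the
# reward functions used in the study.

def dense_reward(state: dict, action_cost: float) -> float:
    """Dense reward: many hand-tuned penalties combined at every step.
    The relative weights impose arbitrary numerical equivalences
    (e.g. one compromised server 'equals' five compromised user hosts),
    which is where bias towards suboptimal behaviour can creep in."""
    reward = 0.0
    reward -= 1.0 * state["compromised_user_hosts"]
    reward -= 5.0 * state["compromised_servers"]
    reward -= 0.5 * action_cost          # penalise costly defensive actions
    reward += 0.1 if state["all_services_up"] else -2.0
    return reward


def sparse_reward(state: dict, episode_done: bool) -> float:
    """Sparse reward: feedback only when the episode ends, tied directly
    to the core defensive goal rather than to intermediate bookkeeping."""
    if not episode_done:
        return 0.0
    return 1.0 if state["critical_nodes_safe"] else -1.0
```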
The Hidden Risk: Intra-Step Node Compromise
Traditional cyber gym evaluations often miss critical 'intra-step' events where nodes are compromised and then immediately restored within the same time step. This means agents receive an inaccurate observation and reward, failing to distinguish between truly secure states and those where a brief compromise occurred. Our ground truth scoring mechanism, ScoreGT, addresses this by capturing the maximum number of compromised nodes over both intra- and end-step, providing a true measure of security.
Precision in Evaluation: Introducing ScoreGT
The Ground Truth Score (ScoreGT) is a novel metric we introduce to overcome the limitations of discrete step-wise evaluation in cyber gyms. It measures the maximum number of compromised nodes across both intra-step and end-of-step states, providing a more accurate and robust measure of agent performance that is independent of the training reward function. For instance, if 3 nodes are compromised during a single step and 1 of them is restored by the end of that step, ScoreGT registers 3, not 2.
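Below is a minimal sketch of how a ground-truth score along these lines could be computed, assuming a simulator wrapper that exposes the true set of compromised nodes after each agent acts within a step. The names `env`, `step_red`, `step_blue`, and `compromised_nodes` are hypothetical, not the real cyber gym APIs, and how the per-step values are aggregated is left open.

```python
# Hypothetical sketch of intra-step ground-truth scoring; `env` is an
# assumed wrapper exposing the true network state, not a real gym API.

def ground_truth_scores(env, blue_policy, episode_length: int) -> list[int]:
    """For each time step, record the maximum number of compromised nodes
    observed both mid-step (after the attacker acts) and at the end of the
    step (after the defender acts), so that a compromise that is
    immediately restored is still counted."""
    scores = []
    obs = env.reset()
    for _ in range(episode_length):
        env.step_red()                              # attacker moves first
        intra = len(env.compromised_nodes())        # intra-step state
        obs = env.step_blue(blue_policy(obs))       # defender responds
        end = len(env.compromised_nodes())          # end-of-step state
        scores.append(max(intra, end))              # per-step ScoreGT
    return scores
```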
| Metric | Description | Why it Matters for Cyber Defence |
|---|---|---|
| Dispersion Variability Across Time (DT) | Measures stability of RL training. Smooth, monotonic policy improvements yield low DT scores. | Indicates high reliability and lowered computational costs during training. |
| Dispersion Variability Across Runs (DR') | Measures reproducibility of RL training across multiple runs, normalized by mean episodic reward. | Low DR' means high consistency between training runs, requiring fewer total training runs to find best agents. |
| Conditional Value at Risk (CVaR) | Quantifies risk in worst-case scenarios, specifically expected performance in the worst α fraction of cases. | Crucial for cyber defence to understand and mitigate worst-case performance, which can be catastrophic. |
| Risk Across Fixed-Policy Rollouts (RF) | Applies CVaR to the distribution of multiple fixed-policy evaluation rollouts. | Provides a robust measure of worst-case expected loss for a trained policy, directly assessing operational risk. |
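To make these measures concrete, here is a hedged sketch of how each might be computed from training curves and evaluation rollouts. The IQR-based formulations of DT and DR', the normalisation, and the α level are illustrative and may differ in detail from the paper's exact definitions.

```python
import numpy as np

def iqr(x) -> float:
    """Interquartile range: a robust measure of dispersion."""
    q75, q25 = np.percentile(x, [75, 25])
    return float(q75 - q25)

def dispersion_across_time(training_curve) -> float:
    """DT (sketch): dispersion of the detrended training curve of a single
    run. Smooth, monotonic improvement yields small, consistent step-to-step
    differences and therefore a low IQR."""
    return iqr(np.diff(np.asarray(training_curve, dtype=float)))

def dispersion_across_runs(curves_per_run) -> float:
    """DR' (sketch): dispersion across runs at matched checkpoints,
    normalised by the mean episodic reward, then averaged over time."""
    curves = np.asarray(curves_per_run, dtype=float)   # shape: (runs, checkpoints)
    per_checkpoint = [iqr(curves[:, t]) for t in range(curves.shape[1])]
    return float(np.mean(per_checkpoint) / abs(curves.mean()))

def cvar(values, alpha=0.05, lower_is_worse=True) -> float:
    """CVaR: mean of the worst alpha fraction of outcomes. For episodic
    returns (higher is better) the worst tail is the lowest values; for
    ScoreGT (lower is better) pass lower_is_worse=False."""
    values = np.sort(np.asarray(values, dtype=float))
    k = max(1, int(np.ceil(alpha * len(values))))
    tail = values[:k] if lower_is_worse else values[-k:]
    return float(tail.mean())

def risk_across_fixed_policy_rollouts(rollout_returns, alpha=0.05) -> float:
    """RF: CVaR applied to the returns of many evaluation rollouts of a
    single trained (fixed) policy."""
    return cvar(rollout_returns, alpha=alpha, lower_is_worse=True)
```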
Experimental Framework: Cyber Gyms & Reward Functions
Our study leverages two prominent cyber gyms: Yawning Titan (YT), an abstract graph-based network simulator, and CAGE 2 (miniCAGE), a more complex enterprise network simulation. We used both basic and extended action spaces, and scaled network sizes from 2 to 50 nodes. We evaluated agents trained with the PPO and DQN algorithms across five distinct reward functions: Sparse Positive (SP), Sparse Negative (SN), Sparse Positive-Negative (SPN), Dense Negative (DN), and Complex Dense Negative (CDN), the last of which represents typical cyber gym rewards. This comprehensive setup allowed us to robustly assess the impact of reward structure on agent performance, risk, and reliability.
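For orientation, the sketch below shows how such agents are typically trained with an off-the-shelf library such as Stable-Baselines3. The environment is a stand-in (CartPole) so the snippet runs as written; a real experiment would substitute a Gymnasium-compatible wrapper around Yawning Titan or CAGE 2 configured with the chosen reward function, and the hyperparameters shown are illustrative rather than those used in the study.

```python
# Sketch of the training setup with Stable-Baselines3. CartPole-v1 is a
# stand-in environment so the snippet runs as written; a real experiment
# would substitute a Gymnasium-compatible wrapper around Yawning Titan or
# CAGE 2 configured with the chosen reward function.
import gymnasium as gym
from stable_baselines3 import DQN, PPO


def make_env(reward_fn: str) -> gym.Env:
    # Placeholder: a real experiment would construct the cyber gym here,
    # parameterised by reward_fn in {"SP", "SN", "SPN", "DN", "CDN"}.
    return gym.make("CartPole-v1")


def train_agent(algo: str, reward_fn: str, total_timesteps: int = 50_000):
    env = make_env(reward_fn)
    model_cls = {"ppo": PPO, "dqn": DQN}[algo]
    model = model_cls("MlpPolicy", env, verbose=0)   # illustrative hyperparameters
    model.learn(total_timesteps=total_timesteps)
    return model


if __name__ == "__main__":
    # One PPO agent per reward structure, for later side-by-side evaluation.
    agents = {fn: train_agent("ppo", fn) for fn in ["SP", "SN", "SPN", "DN", "CDN"]}
```

The table below summarises average performance under each reward function.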
| Reward Function | Avg. ScoreGT (Lower is Better) | Avg. Lower RF (Risk) | Avg. DT (Reliability, ×10⁻³) | Avg. DR' (Reproducibility) |
|---|---|---|---|---|
| Sparse Positive-Negative (SPN) | 2.00 | 1.82 | 0.08 | 0.19 |
| Sparse Positive (SP) | 2.69 | 2.46 | 0.11 | 0.12 |
| Dense Negative (DN) | 6.29 | 5.84 | 2.33 | 0.12 |
| Complex Dense Negative (CDN) | 6.21 | 5.71 | 2.45 | 0.31 |
| Sparse Negative (SN) | 10.29 | 9.00 | 0.09 | 0.17 |
Enhanced Performance at Scale
As network sizes increase, the benefits of sparse rewards become even more pronounced, leading to significantly better policies and reduced risks in larger, more realistic environments.
Comparing results from Table 4 for 10- and 50-node networks in Yawning Titan with the extended action space shows that while all scores generally increase with network size (there are more nodes to compromise), the relative performance of sparse rewards (SPN) remains significantly better than that of dense rewards (DN). For example, SPN's ScoreGT for 50 nodes (6.58) is considerably lower than DN's (17.78), indicating far fewer compromised nodes.
Strategic Action Usage with Sparse Rewards
A detailed analysis of agent behaviour (Tables 18 and 20) reveals that sparsely rewarded agents (SP, SPN) make more strategic and sparing use of costly defensive actions such as 'restore'. Specifically, SP agents use the restore action more sparingly overall and focus it primarily on user hosts, which have fewer privileges and less impact on operations. In CAGE, attackers facing SP agents achieve significantly fewer successful privilege escalations on operational and enterprise hosts (0.25 and 1.29 per episode on average, vs. 7.89 and 22.22 against CDN agents), and most of the attacker's privileged host access is confined to user subnet hosts (92.31%). This demonstrates that sparse rewards lead to policies that are better aligned with actual cyber defender goals, avoiding unnecessary actions and focusing on critical assets without explicit numerical penalties for those actions in the reward function.
The table below compares ScoreGT under different orderings of the attacker's (red) and defender's (blue) actions within a step; sparsely rewarded agents achieve markedly lower scores in every case.
| Reward Function | ScoreGT (Red acts, then Blue) | ScoreGT (Blue acts, then Red) | ScoreGT (Random action order) |
|---|---|---|---|
| Sparse Positive-Negative (SPN) | 0.90 | 0.61 | 4.50 |
| Sparse Positive (SP) | 0.90 | 0.27 | 6.91 |
| Complex Dense Negative (CDN) | 5.65 | 3.99 | 13.24 |
| Dense Negative (DN) | 4.13 | 2.98 | 11.75 |
AI Cyber Defence Implementation Roadmap
A phased approach to integrating advanced AI into your cyber security operations for optimal results.
Phase 1: Assessment & Strategy Definition
Evaluate current cyber defence posture, identify key vulnerabilities, and define AI integration strategy. This includes selecting appropriate reward structures based on core security objectives.
Duration: 2-4 Weeks
Phase 2: Custom Model Training & Simulation
Train DRL agents in secure cyber gym environments using optimized sparse reward functions. Conduct extensive simulations to validate policy effectiveness and risk profiles.
Duration: 4-8 Weeks
Phase 3: Integration & Phased Deployment
Integrate trained AI agents into existing security tools and infrastructure. Deploy in a controlled, phased manner, starting with low-risk environments and gradually expanding scope.
Duration: 6-12 Weeks
Phase 4: Continuous Monitoring & Optimization
Establish real-time monitoring of AI agent performance. Continuously refine models and adapt reward functions to evolving threat landscapes and network changes.
Duration: Ongoing
Ready to Enhance Your Cyber Defence with AI?
Discover how intelligent, reliable AI agents can transform your enterprise security. Our experts are ready to guide you through a tailored implementation plan.