Enterprise AI Analysis: COOL-MC: VERIFYING AND EXPLAINING RL POLICIES FOR MULTI-BRIDGE NETWORK MAINTENANCE


Reinforcement Learning Verification

This paper introduces COOL-MC, a tool for formally verifying and explaining Reinforcement Learning policies for multi-bridge network maintenance, combining probabilistic model checking with explainability methods.

Executive Impact: Key Metrics

COOL-MC provides a novel approach to verify and explain Reinforcement Learning (RL) policies for multi-bridge network maintenance. By constructing a discrete-time Markov chain (DTMC) from the RL policy and the underlying MDP, it enables probabilistic model checking and explainability analyses. This allows for formal safety guarantees, identification of policy biases, and understanding of decision-making, addressing critical challenges in adopting RL for infrastructure management.


Deep Analysis & Enterprise Applications

Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.

Feature Lumping on Bridge 1 Condition

Feature lumping coarsened bridge 1's condition ratings (0-3 → 2, 4-6 → 5, 7-9 → 7). The resulting failure probability changed negligibly, from 0.03547 to 0.03542. This confirms that a coarse categorical assessment is sufficient to maintain safety performance, reducing the need for fine-grained inspection precision.
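The lumping map above can be sketched as a small abstraction function; `lump_condition` is a hypothetical helper for illustration, not part of the COOL-MC API.

```python
def lump_condition(rating: int) -> int:
    """Coarsen a 0-9 bridge condition rating into three representative
    values, as in the lumping experiment: 0-3 -> 2, 4-6 -> 5, 7-9 -> 7."""
    if 0 <= rating <= 3:
        return 2
    if 4 <= rating <= 6:
        return 5
    if 7 <= rating <= 9:
        return 7
    raise ValueError(f"rating out of range: {rating}")

# Every raw rating collapses to one of three buckets the policy sees instead.
print([lump_condition(r) for r in range(10)])
# -> [2, 2, 2, 2, 5, 5, 5, 7, 7, 7]
```

Because the lumped policy's failure probability is nearly unchanged, the verification result is robust to this loss of input resolution.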

Budget Sensitivity to Bmax

Varying the budget cap Bmax shows that the budget-exhaustion probability remains extremely low (< 2.1 × 10⁻⁶) and decreases monotonically as Bmax increases. This indicates a conservative spending strategy: the policy avoids budget exhaustion even under tighter constraints.

Bmax   P_?(◊ 'budget_empty')   States   Transitions
9      2.02 × 10⁻⁶             26,377   212,593
10     1.17 × 10⁻⁶             27,033   216,187
11     4.69 × 10⁻⁷             27,987   219,568
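The monotone trend can be checked mechanically from the figures above (values transcribed from the table; this is a worked check, not a model-checking run):

```python
# Budget-exhaustion probabilities reported above, keyed by Bmax.
p_empty = {9: 2.02e-6, 10: 1.17e-6, 11: 4.69e-7}

caps = sorted(p_empty)
# Probability decreases monotonically as the budget cap grows,
# and every value stays below the 2.1e-6 bound quoted above.
assert all(p_empty[a] > p_empty[b] for a, b in zip(caps, caps[1:]))
assert max(p_empty.values()) < 2.1e-6
print("monotone decrease confirmed")
```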

Action Replacement: Minor to Major Maintenance

Globally replacing all Minor Maintenance actions (cost 1) with Major Maintenance (cost 2) raises the budget-exhaustion probability from 1.17 × 10⁻⁶ to 2.20 × 10⁻⁵. This quantifies how strongly the policy's conservative spending depends on cheap interventions.

Configuration                          P_?(◊ 'budget_empty')
Baseline (no replacement, Bmax = 10)   1.17 × 10⁻⁶
MN → MJ (37 actions)                   2.20 × 10⁻⁵
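Action replacement can be pictured as a wrapper around the policy, after which the induced DTMC is rebuilt and re-checked. The function and action names below are illustrative assumptions, not the tool's interface.

```python
MINOR, MAJOR = "MN", "MJ"  # minor (cost 1) vs. major (cost 2) maintenance

def replace_actions(policy, src=MINOR, dst=MAJOR):
    """Wrap a state -> action policy so every `src` action becomes `dst`.
    This mirrors the MN -> MJ counterfactual: model checking is then
    repeated on the DTMC induced by the modified policy."""
    def modified(state):
        action = policy(state)
        return dst if action == src else action
    return modified

base = lambda state: MINOR if state["cond"] < 5 else "DN"  # toy policy
mod = replace_actions(base)
print(mod({"cond": 3}), mod({"cond": 8}))  # MJ DN
```

The ~19× jump in exhaustion probability then falls directly out of re-running the same PCTL query on the modified model.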
3.5% Safety Violation Probability (P<0.05(◊ 'failed'))

Model checking revealed that the trained policy has a safety-violation probability of 3.5% over the 20-year planning horizon, slightly above the theoretical minimum of 0%.

Model Size Comparison (States/Transitions)

Model Type           States    Transitions
Induced DTMC (min)   5,174     31,494
Induced DTMC (max)   27,856    227,468
Full MDP             156,579   50,915,203

The induced Discrete-Time Markov Chain (DTMC) significantly reduces the state and transition space compared to the full MDP, making formal verification tractable and scalable for bridge networks.
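The scale of the reduction follows directly from the counts above (a worked calculation on the table's numbers):

```python
# State/transition counts from the model size comparison above.
full_mdp = {"states": 156_579, "transitions": 50_915_203}
dtmc_max = {"states": 27_856, "transitions": 227_468}  # worst case

state_reduction = 1 - dtmc_max["states"] / full_mdp["states"]
trans_reduction = 1 - dtmc_max["transitions"] / full_mdp["transitions"]
# Even the largest induced DTMC eliminates roughly 82% of the states
# and over 99% of the transitions of the full MDP.
print(f"states:      {state_reduction:.1%} smaller")
print(f"transitions: {trans_reduction:.1%} smaller")
```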

Enterprise Process Flow

1. Encode MDP in PRISM
2. Train Deep RL Policy (PPO)
3. Verify Policy with PCTL Queries
4. Explain Policy Decisions (COOL-MC)

Our methodology outlines a four-stage process for developing and analyzing RL maintenance policies, integrating formal verification and explainability techniques like feature lumping and saliency ranking.
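The verification stage poses queries like those reported throughout this page. A minimal sketch in PRISM's property syntax (the labels come from this case study; the exact property set is an assumption):

```python
# PCTL queries in PRISM property notation, as used at the verification stage.
queries = [
    'P=? [ F "failed" ]',        # probability any bridge ever fails
    'P=? [ F "budget_empty" ]',  # probability the budget is exhausted
    'P<0.05 [ F "failed" ]',     # bounded safety requirement
]
for q in queries:
    print(q)
```

Each query is evaluated on the DTMC induced by the trained policy, yielding exact probabilities rather than empirical estimates.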

Global Feature Importance Ranking

Rank   Feature      Mean |∇|
1      year         2.530
2      cycle_year   2.410
3      cond_b1      2.089
4      budget       1.457
5      cond_b2      1.278
6      cond_b3      0.882
7      init_done    0.636

Temporal features (year, cycle_year) and bridge 1 condition (cond_b1) are the most influential, indicating policy decisions are strongly shaped by planning horizon and bridge 1's state, rather than symmetric treatment of all bridges.
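A gradient-saliency score of this kind can be sketched with finite differences; a real implementation would use autodiff on the policy network, and the toy policy below is an assumption chosen to echo the reported ranking.

```python
import numpy as np

FEATURES = ["year", "cycle_year", "cond_b1", "budget",
            "cond_b2", "cond_b3", "init_done"]

def saliency(policy_logit, states, eps=1e-3):
    """Mean absolute finite-difference gradient of a scalar policy output
    with respect to each input feature, averaged over sampled states."""
    grads = np.zeros(len(FEATURES))
    for s in states:
        for i in range(len(FEATURES)):
            up, down = s.copy(), s.copy()
            up[i] += eps
            down[i] -= eps
            grads[i] += abs(policy_logit(up) - policy_logit(down)) / (2 * eps)
    return grads / len(states)

# Toy scalar "policy output" that leans on year and cond_b1, loosely
# mimicking the dominance those features show in the real ranking.
toy = lambda x: 2.5 * x[0] + 2.0 * x[2] + 0.5 * x[3]
rng = np.random.default_rng(0)
scores = saliency(toy, list(rng.normal(size=(32, len(FEATURES)))))
print(dict(zip(FEATURES, scores.round(2))))
```

For a linear toy policy the scores recover the coefficients exactly; for the actual network they vary by state, which is why the table reports means.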

0.07535 Failure Probability with Horizon Remap (P_?(◊ 'failed'))

When the policy is made to believe the episode is always ending soon (horizon remap), the failure probability rises from a baseline of 0.0355 to 0.07535, revealing reward hacking: maintenance spending is cut back near the planning horizon at the expense of structural safety.
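The horizon-remap probe can be pictured as a wrapper that pins the temporal feature before the policy sees the state. The feature index, horizon length, and action names below are assumptions for illustration.

```python
def horizon_remap(policy, year_index=0, late_year=19):
    """Counterfactual probe: feed the policy a state whose `year` feature
    is pinned to the end of the 20-year horizon, leaving all else intact."""
    def remapped(state):
        probe = list(state)
        probe[year_index] = late_year
        return policy(tuple(probe))
    return remapped

# Toy policy that skips maintenance once it believes the horizon is near.
base = lambda s: "do_nothing" if s[0] >= 18 else "minor_maintenance"
probe = horizon_remap(base)
print(base((5, 3)), probe((5, 3)))  # minor_maintenance do_nothing
```

Re-checking P_?(◊ 'failed') on the DTMC induced by the remapped policy is what exposes the end-of-horizon behavior as a safety cost.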

Conditional Saliency Ranking (Worst Bridge Focus)

Worst Bridge   Rank 1               Rank 2           Rank 3
Bridge 1       cond_b1 (2.741)      year (1.713)     budget (1.492)
Bridge 2       cond_b1 (1.433)      budget (1.161)   cond_b2 (1.112)
Bridge 3       cycle_year (4.026)   year (3.059)     cond_b1 (1.624)

The policy exhibits an attention bias toward bridge 1. Even when bridge 2 or bridge 3 is in the worst condition, bridge 1's condition (cond_b1) often remains a top influencing feature, indicating a potential coverage gap in the learned policy for the other bridges.

Advanced ROI Calculator

Estimate your potential cost savings and efficiency gains by implementing verified AI policies in your enterprise.


Implementation Roadmap

A structured approach to integrating verified RL policies into your operational framework.

Phase 1: MDP Modeling & Baseline RL

Formally encode your multi-bridge network as an MDP in PRISM, capturing states, actions, transitions, and rewards. Train an initial deep RL policy (PPO) to establish a performance baseline.

Phase 2: Induced DTMC Construction & Verification

COOL-MC automatically constructs the induced Discrete-Time Markov Chain (DTMC) from your trained RL policy. Apply PCTL queries to formally verify safety and performance properties, such as bridge failure probability and budget utilization.

Phase 3: Explainability Analysis & Policy Profiling

Utilize COOL-MC's explainability methods (feature lumping, saliency ranking, action labeling, counterfactual analysis) to understand policy decision-making, identify biases, and characterize behaviors like horizon-gaming.

Phase 4: Policy Refinement & Re-verification

Based on the insights from verification and explainability, iteratively refine your MDP model or RL policy architecture. Re-verify the updated policy with COOL-MC to confirm that anomalies are resolved and safety properties are maintained.

Ready to Transform Your Operations with Verified AI?

Book a personalized consultation to discuss how COOL-MC can integrate into your existing infrastructure management systems and enhance safety with explainable, verifiable RL policies.
