
Hierarchical Deep Reinforcement Learning Framework for Multi-Year Asset Management Under Budget Constraints

Amir Fard and Arnold X.-X. Yuan

Department of Civil Engineering, Toronto Metropolitan University
350 Victoria Street, Toronto, ON, Canada, M5B 2K3

ABSTRACT

Budget planning and maintenance optimization are crucial for infrastructure asset management, ensuring cost-effectiveness and sustainability. However, the complexity arising from combinatorial action spaces, diverse asset deterioration, stringent budget constraints, and environmental uncertainty significantly limits existing methods' scalability. This paper proposes a Hierarchical Deep Reinforcement Learning (HDRL) methodology specifically tailored to multi-year infrastructure planning. Our approach decomposes the problem into two hierarchical levels: a high-level Budget Planner allocating annual budgets within explicit feasibility bounds, and a low-level Maintenance Planner prioritizing assets within the allocated budget. By structurally separating macro-budget decisions from asset-level prioritization and integrating linear programming projection within a hierarchical Soft Actor-Critic (SAC) framework, the method efficiently addresses exponential growth in the action space and ensures rigorous budget compliance. A case study evaluating sewer networks of varying sizes (10, 15, and 20 sewersheds) illustrates the effectiveness of the proposed approach. Compared to conventional Deep Q-Learning (DQL) and enhanced genetic algorithms, our methodology converges more rapidly, scales effectively, and consistently delivers near-optimal solutions even as network size grows.

KEYWORDS

infrastructure planning; hierarchical reinforcement learning; constrained optimization; asset management; budget allocation; maintenance optimization; deep learning; linear programming.

1. Introduction

Effectively and efficiently managing infrastructure networks is crucial for sustainable development, economic progress, and public safety. Assets such as transportation systems, water and wastewater networks, and energy grids are integral to modern societies, supporting daily activities and enabling large-scale services. However, a variety of forces drive the urgency of proper infrastructure asset management, including natural deterioration processes, the increasing impact of climate change, technological advancements, and growing demands that stem from urbanization and economic expansion. These pressures underscore the need for comprehensive and adaptive strategies to maintain and improve infrastructure performance over multiple years, while consistently meeting budgetary and operational constraints.

Infrastructure Asset Management (IAM) encompasses several tasks: inventorying and evaluating the condition of assets, projecting their future states, and budgeting and prioritization. Among these tasks, budgeting and prioritization (often referred to as maintenance planning) represent a pivotal challenge, as they determine which actions, such as maintenance, rehabilitation, or replacement, should be taken for each asset and when those actions should be executed. Such decisions affect the performance of individual assets, the total costs incurred over their lifecycles, and the reliability of the entire network. Developing an optimal multi-year maintenance plan that complies with annual budget constraints is particularly difficult, because infrastructure networks often contain numerous interlinked assets, each with its own deterioration behavior and exposure to varying operating and environmental conditions.

The sequential nature of these decisions further complicates the planning task. Maintenance policies must account for how interventions at one point in time influence asset conditions in future periods, thereby affecting subsequent budgets and the cost-effectiveness of later actions. Additionally, non-linear relationships may arise when certain improvements or rehabilitations yield synergistic effects on the network, or when deterioration accelerates non-linearly once an asset's condition drops below a threshold. Handling these sequential changes and non-linear behaviors often goes beyond the capacity of many classical optimization techniques.

2. Review of Existing Methodologies

Budget planning and maintenance prioritization is a hard research problem that has been studied by several generations of researchers. A good analysis of the problem and of traditional methods related to pavement management can be found in Gao et al. (2012). The primary purpose of this section is to present a brief review of the major work on applications of deep reinforcement learning to multi-year budget planning and network maintenance prioritization in the context of infrastructure asset management. Before that review, a canonical model of the problem is presented below.

2.1. Problem Setting and Canonical Model

Infrastructure Asset Management (IAM) involves planning multi-year interventions, such as maintenance, rehabilitation, and replacement, to keep a network of assets in desired condition while respecting budgetary constraints. Consider a planning horizon of $h$ years and $n$ infrastructure assets, where each asset $i$ ($i = 1, \dots, n$) has a performance measure $s_{i,t}$ at the beginning of year $t$.

Each asset's state evolves over time according to whether a maintenance action is performed. The binary decision variable $x_{i,t} \in \{0, 1\}$ indicates whether a maintenance action is applied to asset $i$ in year $t$. Asset $i$ then transitions from $s_{i,t}$ to $s_{i,t+1}$ via the following deterioration model:

$$ s_{i,t+1} = x_{i,t}\, f_{\mathrm{act}}(s_{i,t}) + (1 - x_{i,t})\, f_{\mathrm{det}}(s_{i,t}) \qquad (1) $$

where $f_{\mathrm{act}}$ describes the effect of applying an intervention, and $f_{\mathrm{det}}$ characterizes how the asset naturally deteriorates if left untreated.
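As a minimal illustration of Eq. (1), the sketch below applies the transition in Python; the specific forms chosen for the intervention and deterioration models are purely illustrative assumptions, not the models used in the paper.

```python
def next_condition(s, x, f_act, f_det):
    """State transition of Eq. (1): apply the intervention effect f_act
    when x = 1, otherwise the natural deterioration model f_det."""
    return x * f_act(s) + (1 - x) * f_det(s)

# Illustrative assumption: the condition index worsens by 0.2 per year if
# untreated, and an intervention restores the asset to a like-new index of 1.0.
f_det = lambda s: s + 0.2
f_act = lambda s: 1.0
print(next_condition(1.8, 0, f_act, f_det))  # 2.0 (deteriorates)
print(next_condition(1.8, 1, f_act, f_det))  # 1.0 (restored)
```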

To facilitate coordination, a network-level performance measure, often known as Level of Service (LoS), is defined. A general expression for LoS at time t is given as:

$$ \mathrm{LoS}_t = \sum_{i=1}^{n} w_i\, \mathbb{E}\!\left[\phi(s_{i,t})\right] \qquad (2) $$

where $w_i$ is the relative weight of asset $i$ and $\phi(\cdot)$ is a function that converts the asset-level condition to a system performance contribution.
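A minimal sketch of Eq. (2) for a deterministic condition estimate (the expectation is dropped); the conversion function used here, which assumes a 1-5 condition scale with lower being better, is an illustrative assumption rather than the paper's form.

```python
import numpy as np

def level_of_service(conditions, weights, phi=lambda s: (5.0 - s) / 4.0):
    """Network-level LoS of Eq. (2) as a weighted sum of per-asset
    performance contributions phi(s_i,t). The expectation is dropped and
    the mapping phi is purely illustrative."""
    return float(np.dot(weights, [phi(s) for s in conditions]))

# Example: three assets with equal weights.
print(round(level_of_service([1.0, 1.8, 2.4], [1/3, 1/3, 1/3]), 3))  # ~0.817
```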

The principal goal is to maximize the expected total or average performance over the planning horizon, subject to budgetary constraints. The multi-year IAM planning can be formulated as the following optimization problem:

$$
\begin{aligned}
\max \quad & \frac{1}{h} \sum_{t=1}^{h} \mathrm{LoS}_t \\
\text{subject to:} \quad & \sum_{i=1}^{n} c_{i,t}\, w_i\, x_{i,t} \ge b^{l}_{t}, \qquad t = 1, \dots, h, \\
& \sum_{i=1}^{n} c_{i,t}\, w_i\, x_{i,t} \le b^{u}_{t}, \qquad t = 1, \dots, h, \\
& \sum_{t=1}^{h} \sum_{i=1}^{n} c_{i,t}\, w_i\, x_{i,t} \le b_{\mathrm{total}} \qquad (3)
\end{aligned}
$$

where $c_{i,t}$ is the cost of treating asset $i$ in year $t$, and $b^{l}_{t}$, $b^{u}_{t}$, and $b_{\mathrm{total}}$ denote the annual lower budget bound, the annual upper budget bound, and the total multi-year budget, respectively.

This formulation represents a notoriously hard-to-solve integer nonlinear optimization problem, especially for large networks. This difficulty has led to research into various approaches, including Markov Decision Processes (MDPs) and, more recently, deep reinforcement learning.
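As a concrete reading of formulation (3), the following sketch checks whether a candidate multi-year plan satisfies the budget constraints. It is an illustrative verification utility written against the notation above, not the solver used in the paper.

```python
import numpy as np

def budget_feasible(x, c, w, b_lower, b_upper, b_total):
    """Check a candidate plan against the constraints of formulation (3).
    x, c: (n, h) arrays of binary decisions and treatment costs;
    w: (n,) asset weights; b_lower, b_upper: (h,) annual budget bounds."""
    x, c, w = np.asarray(x), np.asarray(c), np.asarray(w)
    annual_spend = (c * w[:, None] * x).sum(axis=0)      # one value per year
    return (np.all(annual_spend >= b_lower)
            and np.all(annual_spend <= b_upper)
            and annual_spend.sum() <= b_total)
```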

2.2. Decomposition for scalability

Various decomposition techniques have been used in DRL for network maintenance, rehabilitation, and replacement (MRR) planning to tackle the scalability challenge. These include temporal decomposition, hierarchical task decomposition, and component-wise MDP decomposition.

2.3. Constraint-handling techniques in constrained MDPs

Four broad families of constraint-handling methods appear in the RL and IAM literatures: reward shaping with fixed penalties, Lagrangian relaxation, state augmentation, and hierarchical or decomposition-based schemes, the approach we build upon.
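For context, the first two families can be illustrated with a one-line reward modification. The sketch below is a generic illustration under assumed variable names, not the constraint-handling mechanism adopted in this paper, which relies on the hierarchical and LP-based scheme described later.

```python
def shaped_reward(reward, annual_spend, budget, penalty=10.0, lam=0.0):
    """Generic constraint handling for a budget-constrained MDP:
    - fixed penalty: subtract a constant whenever the budget is exceeded;
    - Lagrangian relaxation: subtract lam * violation, where the multiplier
      lam is updated separately (e.g., by dual ascent on observed violations)."""
    violation = max(0.0, annual_spend - budget)
    return reward - penalty * (violation > 0.0) - lam * violation
```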

2.4. Concluding Remarks

Despite significant progress, scalability and exact adherence to budget constraints remain central challenges. Existing methods often struggle with large-scale networks, guaranteed feasibility, and interpretability. Our proposed framework directly addresses these gaps by using a hierarchical structure with an embedded linear programming layer to ensure strict budget compliance and linear scalability of the action space.

3. Key Concepts of Reinforcement Learning

Reinforcement Learning (RL) provides a flexible framework for sequential decision-making under uncertainty, casting problems as Markov Decision Processes (MDPs). An agent interacts with an environment over discrete time steps, observing a state $s_t$, choosing an action $a_t$, and receiving a reward $r_t$. The goal is to learn a policy $\pi(a|s)$ that maximizes the discounted return:

$$ G_t = \sum_{k=0}^{T-t-1} \gamma^{k}\, r(s_{t+k}, a_{t+k}) \qquad (4) $$

where $\gamma$ is the discount factor. Value-based methods such as Q-learning estimate the optimal action-value function $Q^*(s, a)$. For large state-action spaces, Deep Reinforcement Learning (DRL) uses deep neural networks to approximate these functions.
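A minimal sketch of Eq. (4), computing the discounted return at every time step of a finite episode by backward recursion:

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t for each t of a finite episode, per Eq. (4),
    using the recursion G_t = r_t + gamma * G_{t+1}."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return out[::-1]

print(discounted_returns([1.0, 1.0, 1.0], gamma=0.9))  # [2.71, 1.9, 1.0]
```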

Deep Q-Learning (DQL) replaces the Q-table with a neural network $Q(s, a; \theta)$. However, it struggles with combinatorial action spaces, as seen in asset management, where the number of actions can be $2^n$ for $n$ assets.

Policy gradient methods directly optimize a parameterized policy $\pi_\theta(a|s)$. Soft Actor-Critic (SAC) is an advanced actor-critic algorithm designed for continuous action spaces that enhances exploration and stability by adding an entropy bonus to the reward. This makes it well suited to our hierarchical approach, where one actor handles continuous budget allocation.

4. Proposed Method: Hierarchical Deep Reinforcement Learning (HDRL)

This section presents a Hierarchical Deep Reinforcement Learning (HDRL) framework designed to handle multi-year maintenance decisions under strict budget constraints. The core concept is to decompose each year's decision into two levels: (1) how much of the remaining budget to allocate for the current year, and (2) which maintenance actions to execute given that allocated budget.

4.1. Decomposing the Annual Decisions into Two Levels

At each time step t, the decision process is split:

  1. Actor 1 (Budget Planner): This high-level actor determines a scalar action $a^{(1)}_t \in [0, 1]$, representing the fraction of the remaining multi-year budget to allocate to the current year. This scalar value is then mapped to a feasible annual budget $b_t$ (one possible mapping is sketched below).
  2. Actor 2 (Maintenance Planner): This low-level actor receives the annual budget $b_t$ and the current state $s_t$. It outputs a vector of priority coefficients $a^{(2)}_t \in \mathbb{R}^n$ for the $n$ assets.
  3. Linear Programming (LP) Projection: The priority scores are used in a knapsack-like LP to select the final binary maintenance actions $x_{i,t}$ for each asset, maximizing immediate improvement while strictly adhering to the budget $b_t$ (see the sketch after the next paragraph).

This hierarchical decomposition avoids the exponential action space of monolithic RL approaches. The action space dimension grows linearly (1 for Actor 1, $n$ for Actor 2) instead of exponentially ($2^n$).
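To make the two-level pipeline concrete, the sketch below shows one possible mapping from Actor 1's scalar action to an annual budget, and a knapsack-style projection of Actor 2's priorities onto binary actions using scipy's MILP interface. The clipping rule, the projection objective, and the solver choice are assumptions for illustration; the paper's exact formulation may differ.

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def annual_budget(a1, remaining_budget, b_lower, b_upper):
    """Map Actor 1's scalar action a1 in [0, 1] to an annual budget b_t:
    read a1 as a fraction of the remaining multi-year budget, then clip it
    into the annual bounds (assumed feasibility mapping)."""
    b_t = a1 * remaining_budget
    return min(max(b_t, b_lower), min(b_upper, remaining_budget))

def project_to_budget(priorities, costs, budget):
    """Knapsack-like projection: choose binary actions x that maximize the
    total priority score subject to the annual budget (0/1 knapsack)."""
    priorities = np.asarray(priorities, dtype=float)
    costs = np.asarray(costs, dtype=float)
    res = milp(c=-priorities,                                   # milp minimizes
               constraints=[LinearConstraint(costs[None, :], 0.0, budget)],
               integrality=np.ones_like(priorities),            # x_i integer
               bounds=Bounds(0, 1))                             # x_i in {0, 1}
    return np.round(res.x).astype(int)                          # binary plan x_t

# Example: allocate 40% of a $1.0M remaining budget, then select assets.
b_t = annual_budget(0.4, 1_000_000, b_lower=100_000, b_upper=500_000)
print(project_to_budget([0.9, 0.2, 0.7], [250_000, 150_000, 200_000], b_t))  # [1 1 0]
```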

Figure 1. HDRL architecture with two-level decision decomposition. Actor 1 (Budget Planner) outputs the budget fraction $a^{(1)}$ and Actor 2 (Maintenance Planner) outputs the priority vector $a^{(2)}$, decoupling the "how much" (budget) from the "where" (assets). A linear programming block translates the soft priorities into hard, budget-compliant binary actions $x_{i,t}$ by solving a 0/1 knapsack problem, guaranteeing feasibility at every step. The environment block simulates asset deterioration, computes the next state $s_{t+1}$, and returns the reward $r_t$ to the agent.

4.2. State Vector, Reward and Objective

The global state vector $s_t$ is augmented to include asset conditions, the current LoS, normalized time, and the remaining budget ratio. This provides the agent with all necessary information for making informed decisions.

$$ s_t = [\, s_{1,t},\, s_{2,t},\, \dots,\, s_{n,t},\, \mathrm{LoS}_t,\, t/h,\, b'_t \,] \qquad (15) $$

The one-step reward is defined as the next state's Level of Service, $r_t = \mathrm{LoS}_{t+1}$, which directly aligns the agent's goal with the original optimization problem's objective of maximizing the average LoS over the horizon.
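A minimal sketch of how the augmented state of Eq. (15) might be assembled; normalizing the remaining budget by the total budget to obtain $b'_t$ is an assumption.

```python
import numpy as np

def build_state(conditions, los, t, horizon, remaining_budget, total_budget):
    """Assemble the augmented global state of Eq. (15): per-asset conditions,
    the current LoS, normalized time t/h, and the remaining budget ratio b'_t
    (assumed here to be remaining / total)."""
    return np.concatenate([np.asarray(conditions, dtype=float),
                           [los, t / horizon, remaining_budget / total_budget]])
```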

4.3. Neural Network Architecture and Learning Process

The framework is built on a Soft Actor-Critic (SAC) paradigm. It includes two actor networks (the Budget and Maintenance Planners) and two critic networks (for stability). The actors are trained to maximize the soft Q-value, and the critics are trained to minimize the Bellman error. A learnable temperature parameter balances the trade-off between reward maximization and policy entropy (exploration).
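A minimal PyTorch sketch of the two actor heads, assuming a squashed-Gaussian output for the Budget Planner and a deterministic priority head for the Maintenance Planner; hidden sizes, the squashing choice, and the treatment of stochasticity in Actor 2 are assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class BudgetPlanner(nn.Module):
    """Actor 1: maps the global state to a scalar budget fraction in [0, 1]."""
    def __init__(self, state_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, 1)
        self.log_std = nn.Linear(hidden, 1)

    def forward(self, state):
        h = self.body(state)
        mu, log_std = self.mu(h), self.log_std(h).clamp(-20, 2)
        z = torch.distributions.Normal(mu, log_std.exp()).rsample()
        return torch.sigmoid(z)            # squash the sample into [0, 1]

class MaintenancePlanner(nn.Module):
    """Actor 2: maps (state, annual budget) to n priority coefficients."""
    def __init__(self, state_dim, n_assets, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim + 1, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, n_assets))

    def forward(self, state, budget):
        # budget: tensor of shape (..., 1) holding the allocated annual budget
        return self.body(torch.cat([state, budget], dim=-1))
```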

5. Case Study and Analysis

This section compares the proposed HDRL method with a standard Deep Q-Learning (DQL) baseline on sewer network problems of increasing size (10, 15, and 20 sewersheds) over a five-year planning horizon.

Table 1. Details of the Ten Sewersheds (Base Case)

Sewershed | # of Segments | Total Length | Initial Avg. Condition
--------- | ------------- | ------------ | ----------------------
1         | 520           | 34,643.29    | 2.367
2         | 463           | 30,783.54    | 1.346
3         | 195           | 12,934.21    | 1.434
4         | 193           | 11,667.71    | 1.312
5         | 156           | 11,423.13    | 1.307
6         | 137           | 11,393.38    | 1.493
7         | 97            | 8,479.24     | 1.344
8         | 96            | 7,359.18     | 1.000
9         | 78            | 6,411.50     | 1.327
10        | 89            | 5,939.84     | 1.388

5.1. Baseline Validation: 10-Sewershed Case Study

For the 10-sewershed case, a constraint programming solver found the optimal solution with a total cost of $498,925 and an average condition of 1.4687. This serves as the ground truth for comparison.

The DQL approach uses a discrete action space of 1,024 ($2^{10}$) combinations, filtered down to 20 feasible actions by the budget constraints. The HDRL approach uses its two-actor structure. Results show that while DQL converges slightly faster initially, HDRL's learning dynamics are more stable and uniform.
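A sketch of the budget filtering applied to the DQL action space, enumerating all $2^n$ binary combinations and keeping those whose annual cost falls within the budget bounds; the exact filtering rule used in the case study is not reproduced here, and the bound-based check is an assumption.

```python
from itertools import product

def feasible_actions(costs, b_lower, b_upper):
    """Enumerate all 2^n binary maintenance combinations for n assets and
    keep those whose total annual cost lies within [b_lower, b_upper]."""
    keep = []
    for combo in product([0, 1], repeat=len(costs)):
        spend = sum(c * x for c, x in zip(costs, combo))
        if b_lower <= spend <= b_upper:
            keep.append(combo)
    return keep
```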

Figure 5. Final cost-condition outcomes from 100 runs of DQL and HDRL for the 10-sewershed case, compared with the global optimum. HDRL solutions cluster more tightly and closer to the optimal solution than DQL solutions, particularly on the lower-cost side.

5.2. Scaling to 15 and 20 Sewersheds

As the number of sewersheds increases, the DQL action space explodes, while HDRL's action space grows linearly. For 15 sewersheds, DQL's raw action space has 32,768 combinations (364 after budget filtering). For 20 sewersheds, it has 1,048,576 combinations (7,448 after filtering). The training curves show DQL becoming increasingly unstable and achieving lower returns, while HDRL maintains stable convergence and superior performance.

5.3. Discussion of Scalability

The results are summarized in the table below. The key takeaway is the difference in how the output layer size (a proxy for single-step computational complexity) scales. DQL scales exponentially, while HDRL scales linearly. This allows HDRL to maintain performance and reasonable training times as the problem size grows, whereas DQL's performance degrades significantly.

Table 3. Comparison of DQL and HDRL Performance Across 10-, 15-, and 20-Sewershed Networks

Case          | Method | Output Layer Size | Runtime (s) | Avg. Obj. Value
------------- | ------ | ----------------- | ----------- | ---------------
10 sewersheds | DQL    | 20                | 118 ± 4     | 1.4744
10 sewersheds | HDRL   | 1 + 10            | 264 ± 6     | 1.4723
15 sewersheds | DQL    | 364               | 176 ± 5     | 1.6629
15 sewersheds | HDRL   | 1 + 15            | 366 ± 6     | 1.5368
20 sewersheds | DQL    | 7,448             | 570 ± 5     | 1.8128
20 sewersheds | HDRL   | 1 + 20            | 585 ± 7     | 1.5375

6. Conclusion

This paper presents an HDRL framework designed to tackle multi-year infrastructure maintenance planning under stringent budget constraints. The method decomposes the decision space into two levels: a Budget Planner that selects an annual budget within feasible bounds and a Maintenance Planner that allocates that budget across individual assets. This hierarchical design avoids the exponential action-space blow-up commonly encountered by monolithic RL methods with combinatorial actions, making HDRL more tractable for larger networks. A local linear programming step further ensures that annual cost remains within the chosen budget.

Comparisons with a Deep Q-Learning (DQL) baseline across 10-, 15-, and 20-sewershed case studies show that HDRL yields smoother training performance, converges more reliably, and consistently discovers near-optimal or high-performance maintenance policies. In the 10-sewershed instance, the HDRL solutions cluster tightly near the global optimum obtained by constraint programming, and in the 15- and 20-sewershed extensions, HDRL continues to learn stable policies while DQL's performance deteriorates due to the exponentially larger action space. Empirical results confirm that separating budget allocation from asset-level maintenance selection significantly reduces the complexity faced by each actor network and maintains stable learning even as network size grows. This capacity to preserve scalable action spaces with explicit budget enforcement addresses a longstanding gap in reinforcement learning applications to real-world asset management.

Future work may extend HDRL by incorporating partial observability of asset condition (e.g., uncertain inspection data), modeling dynamic climate or demand scenarios, and exploring different forms of hierarchical decomposition (multi-level budgets or region-based planners). Additional refinements, such as multi-agent collaboration, dynamic reward shaping, or specialized optimization routines, might further enhance performance in extremely large-scale systems with tens of thousands of assets. Nonetheless, the presented approach provides a sound foundation for bridging the gap between theoretical DRL advances and the practical demands of infrastructure maintenance planning under tight financial limits.