Solving Markov decision processes for network-level post-hazard recovery via simulation optimization and rollout

by   Yugandhar Sarkale, et al.
Colorado State University

Computation of optimal recovery decisions for community resilience assurance post-hazard is a combinatorial decision-making problem under uncertainty. It involves solving a large-scale optimization problem, which is significantly aggravated by the introduction of uncertainty. In this paper, we draw upon established tools from multiple research communities to provide an effective solution to this challenging problem. We provide a stochastic model of damage to the water network (WN) within a testbed community following a severe earthquake and compute near-optimal recovery actions for restoration of the water network. We formulate this stochastic decision-making problem as a Markov Decision Process (MDP), and solve it using a popular class of heuristic algorithms known as rollout. A simulation-based representation of MDPs is utilized in conjunction with rollout and the Optimal Computing Budget Allocation (OCBA) algorithm to address the resulting stochastic simulation optimization problem. Our method employs non-myopic planning with efficient use of simulation budget. We show, through simulation results, that rollout fused with OCBA performs competitively with respect to rollout with total equal allocation (TEA) at a meagre simulation budget of 5-10 which is a crucial step towards addressing large-scale community recovery problems following natural disasters.



There are no comments yet.


page 2


Near-optimal planning using approximate dynamic programming to enhance post-hazard community resilience management

The lack of a comprehensive decision-making approach at the community-le...

Decision Automation for Electric Power Network Recovery

Critical infrastructure systems such as electric power networks, water n...

Optimal Immunization Policy Using Dynamic Programming

Decisions in public health are almost always made in the context of unce...

Multiple Plans are Better than One: Diverse Stochastic Planning

In planning problems, it is often challenging to fully model the desired...

Budget Allocation in Binary Opinion Dynamics

In this article we study the allocation of a budget to promote an opinio...

An approximate dynamic programming approach to food security of communities following hazards

Food security can be threatened by extreme natural hazard events for hou...

A Modified Approximate Dynamic Programming Algorithm for Community-level Food Security Following Disasters

In the aftermath of an extreme natural hazard, community residents must ...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Natural disasters have a significant impact on the economic, social, and cultural fabric of affected communities. Moreover, because of the interconnected nature of communities in the modern world, the adverse impact is no longer restricted to the locally affected region, but it has ramifications on national or international scale. Among other factors, the occurrence of such natural disasters is on the rise owing to population growth and economic development in hazard-prone areas [saeed]. Keeping in view the increased frequency of natural disasters, there is an urgent need to address the problem of community recovery post-hazard. Typically, the resources available to post-disaster planners are limited and relatively small compared to the impact of the damage. Under these scenarios, it becomes imperative to assign limited resources to various damaged components in the network optimally to support community recovery. Such an assignment must also consider multiple objectives and cascading effects due to the interconnectedness of various networks within the community and must also successfully adopt previous proven methods and practices developed by expert disaster-management planners. Holistic approaches addressing various uncertainties for network-level management of limited resources must be developed for maximum effect. Civil infrastructure systems, including power, transportation, and water networks, play a critical part in post-disaster recovery management. In this study, we focus on one such critical infrastructure system, namely the water networks (WN), and compute near-optimal recovery actions, in the aftermath of an earthquake, for the WN of a test-bed community.

Markov decision processes (MDPs) offer a convenient framework for representation and solution of stochastic decision-making problems. Exact solutions are intractable for problems of even modest size; therefore, approximate solution methods have to be employed. We can leverage the rich theory of MDPs to model recovery action optimization for large state-space decision-making problems such as our. In this study, we employ a simulation-based representation and solution of MDP. The near-optimal solutions are computed using an approximate solution technique known as rollout. Even though state-of-the-art hardware and software practices are used to implement the solution to our problem, we are faced with the additional dilemma of computing recovery actions on a fixed simulation budget without affecting the solution performance. Therefore, any prospective methodology must incorporate such a limitation in its solution process. We incorporate the Optimal Computing Budget Allocation (OCBA) algorithm into our MDP solution process [ocbar1, ocbarr] to address the limited simulation budget problem.

Ii Testbed Case Study

Ii-a Network Characterization

This study considers the potable water network (WN) of Gilroy, CA, USA as an example to illustrate the proposed methodology. Gilroy, located 50 kilometers (km) south of the city of San Jose, CA is approximately 41.91 km2 in area, with a population of 48,821[Gilroy1]. We divide our study area into 36 grid regions to define the properties of infrastructure systems, household units, and the population. Our model of the community maintains adequate detail to study the performance of the WN at a community level under severe earthquakes. The potable water of Gilroy is provided only by the Llgas sub-basin [Gilroy2]. The potable water wells, located in wood-frame buildings, pump water into the distribution system. The Gilroy municipal water pipelines range from 102 mm to 610 mm in diameter [Gilroy2]. In this study, a simplified aggregated model of WN of Gilroy adopted from [Gilroy2] is modeled. This model shown in Fig. 1, includes six water wells, two booster pump stations (BPS), three water tanks (WT), and the main pipelines.

Fig. 1: The modeled Water Network of Gilroy

Ii-B Seismic Hazard Simulation

The San Andreas Fault (SAF), which is near Gilroy, is a source of severe earthquakes. In this study, we assume that a seismic event of moment magnitude

occurs at one of the closest points on the SAF projection to downtown Gilroy with an epicentral distance of approximately 12 km. Ground motion prediction equations (GMPE) determine the conditional probability of exceeding ground motion intensity at specific geographic locations within Gilroy for this earthquake.

The Abrahamson et al. [abrahamson]

GMPE is used to estimate the Intensity Measures (IM) and associated uncertainties. Peak Ground Acceleration (PGA) is considered for the above-ground WN facilities and wells, whereas Peak Ground Velocity (PGV) is considered as IM of pipelines.

Ii-C Fragility and Restoration Assessment of Water Network

The physical damage to WN components can be assessed by seismic fragility curves. We use the fragility curves presented in HAZUS-MH [hazus] for wells, water tanks, and pump stations based on the IM of PGA. This study adopts the assumptions in [adachi] for water pipelines. The failure probability of a pipe is bounded as follows:


where is the failure probability of a pipe, is the length of pipe, is the average PGV for the entire length of the water main, and

is the moment-generating function of

(the residual of the PGV). The term for water pipe segment i is , where is a coefficient determined by the pipe material, diameter, joint type, and soil condition based on the guidelines prepared by the American Lifeline Alliance [ALA]. Adachi and Ellingwood [adachi] demonstrated that the Upper Bound (UB) and exact solutions (1) are close enough so that in practical applications the UB assessment (conservative evaluation) can be used.

Repair crews, replacement components, and tools are considered as available units of resources to restore the damaged components of WN following the hazard. One unit of resources is required to repair each damaged component [ouyang, masoomi]

. However, the available units of resources are limited and depend on the capacities and policy of the entities within the community. To restore the WN, the restoration times based on exponential distributions synthesized from HAZUS-MH

[hazus] are used, as summarized in Table I. The pipe-restoration time in the WN is based on repair rate or number of repairs per kilometer.

Damage States
Component Minor Moderate Extensive Complete
Water tanks 1.2 3.1 93 155
Wells 0.8 1.5 10.5 26
Pumping plants 0.9 3.1 13.5 35
TABLE I: The expected repair times (Unit: days)


Iii-a MDP Framework

We provide a brief description of MDP [Puterman] for the sake of completeness. An MDP is a controlled dynamical process useful in modelling of wide range of decision-making problems. It can be represented by the 4-tuple . Here, represents the set of states, and represents the set of actions. Let and ; then is the state transition function, where is the probability of going into state after taking action in state . is the reward function, where is the reward received after transitioning from to as a result of action . In this study, we assume that and are finite; is bounded and real-valued and a deterministic function of , and . Implicit in our presentation are also the following assumptions: First order Markovian dynamics (history independence), stationary dynamics (reward function is not a function of absolute time), and full observability of the state space (outcome of an action in a state might be random, but we know the state reached after action is completed). In our study, we assume that we are allowed to take recovery actions (decisions) indefinitely until all the damaged components of our modeled problem are repaired (infinite-horizon planning). In this setting, we have a stationary policy , which is defined as . Suppose that decisions are made at discrete-time ; then is the action to be taken in state (regardless of time ). Our objective is to find an optimal policy . For the infinite-horizon case, is defined as




is called the value function for a fixed policy , and is the discount factor. Note that the optimal policy is independent of the initial state . Also, note that we maximize over policies , where at each time the action taken is . Stationary optimal policies are guaranteed to exist for discounted infinite-horizon optimization criteria [howard]. To summarize, our presentation is for infinite-horizon discrete-time MDPs with the discounted value as our optimization criterion.

Iii-B MDP Solution

A solution to an MDP is the optimal policy . We can obtain

with linear programming or dynamic programming. In the dynamic programming regime, there are several solution strategies, namely value iteration, policy iteration, modified policy iteration, etc. Unfortunately, such exact solution algorithms are intractable for large state and actions spaces. We briefly mention here the method of value iteration because it illustrates the Bellman’s equation

[Bellman]. Studying Bellman’s equation is useful for defining value function. value function will play a critical role in describing the rollout algorithm. Let denote the optimal value function for some ; Bellman showed that satisfies:


Equation (4) is known as the Bellman’s optimality equation, where is the set of possible actions in any state . The value iteration algorithm solves (4) by using Bellman backup repeatedly, where Bellman backup is given by:


Bellman showed that , where is initialised arbitrarily.111On a historical note, Lloyd Shapely’s paper [Shapley] included the value iteration algorithm for MDPs as a special case, but this was recognised only later on [kallenberg2003finite]. Next, we define the value function of policy :


The value function of any policy gives the expected discount reward in the future after starting in some state , taking action and following policy thereafter. Note that this is the inner term in (4).

Iii-C Simulation-Based Representation of MDP

We now briefly explain the simulation-based representation of an MDP [Fern]. Such a representation serves well for large state and action spaces, which is a characteristic feature of many real-world problems. When or is large, it is not feasible to represent and in a matrix form. A simulation-based representation of an MDP is a 5-tuple , where and are as before, except and are large. Here, is a stochastic real-valued bounded function that stochastically returns a reward when input and are provided, where is the action applied in state . is a simulator that stochastically returns a state when state and action are provided as inputs. is the stochastic initial state function that stochastically returns a state according to some initial state distribution. , , and can be thought of as any callable library functions that can be implemented in any programming language.

Iii-D Problem Formulation

After an earthquake event occurs, the components of the water network remain undamaged or exhibit one of the damage states as shown in Table I. Let be the total number of damaged component at . Let represent the decision time when all components are repaired. There is a fixed number of resource units () available to the decision maker. At each discrete-time , the decision maker has to decide the assignment of unit of resource to the damaged locations; each component cannot be assigned more than one resource unit. When the number of damaged locations is less than the number of units of resources (because of sequential application of repair actions, or otherwise), we retire the extra unit of resources so that is equal to the number of damaged locations.

  • States S: Let be the state of the damaged components of the system at time ; then

    is a vector of length

    , , and is one of the damaged state in Table I where .

  • Actions A: Let denote the repair action to be carried out at time . Then, is a vector of length , , and . When , no repair work is to be carried out at . Similarly, when , repair work is carried out at .

  • Simulator T: The repair time associated with each damaged location depends on the state of the damage to the component at that location (see Table I). This repair time is random and is exponentially distributed with expected repair times shown in Table I. Given and , gives us the new state . We say that a repair action is complete as soon as at least one of the locations where repair work is carried out is fully repaired. Let’s denote this completion time at every by . Note that it is possible for the repair work at two or more damaged locations to be completed simultaneously. Once the repair action is complete, the units of resources at remaining locations, where repair work was not complete, are also available for reassignment along with unit of resources where repair was complete. The new repair time at such unrepaired locations is calculated by subtracting from the time required to repair these locations. It is also possible to reassign the unit of resource at the same unrepaired location if it is deemed important for the repair work to be continued at that location by the planner. Because of this reason, preemption of repair work during reassignment is not a restrictive assumption, on the contrary, it allows greater flexibility to the decision maker for planning. Because the repair times are random, the outcomes of repair actions are random as not the same damaged component will be repaired first every time (random repair time), even if the same repair action is applied in . Hence, our simulator is stochastic. Alternative formulation where outcome of repair action is deterministic is also an active area of research [our, emi, iEMSs].

  • Rewards R: We wish to optimally plan decisions so that maximum people will get water in minimum amount of time. We combine these two competing objectives to define our reward as:


    where is the number of people who have water after action is completed, and is the total repair time (days) required to reach from any initial state . Note that our reward function is stochastic because the outcome of our action is random.

  • Initial State I: We have already described the stochastic damage model of the components for the modeled network in Section II-B and Section II-C. The initial damage states associated with the components will be provided by these models.

  • Discount factor : In our simulation studies, is fixed at 0.99.

Iii-E Rollout

The rollout algorithm was first proposed for stochastic scheduling problems by Bertsekas and Castanon [Bertsekas1999]. Instead of the dynamic programming formalism [Bertsekas1999], we motivate the rollout algorithm in relation to the simulation-based representation of our MDP. Suppose that we have access to a non-optimal policy , and our aim is to compute an improved policy . Then, we have:


where the function is as defined in (6). If the policy defined in (8) is non-optimal, it is a strict improvement over [howard]. This result is termed as policy improvement theorem. Note that the improved policy is generated as a greedy policy w.r.t. . Unlike the exact solution methods described in Section III-B, we are interested here in computing only for the current state. Methods that use (8) as the basis for updating the policy suffer from the curse of dimensionality. Before performing the policy improvement step in (8), we have to first calculate the value of . Calculating the value of in (8) is known as policy evaluation. Policy evaluation is intractable for large or continuous state and action spaces. Approximation techniques alleviate this problem by calculating an approximate value function. Rollout is one such approximation technique that utilises monte-carlo simulations. Particularly, rollout can be formulated as an approximate policy iteration algorithm [Fern, lagoudakis2003reinforcement]. An implementable (programming sense) stochastic function (simulator) is defined in such a way that its expected value is , where is a finite number representing horizon length. In the rollout algorithm, is implemented by simulating action in state and following thereafter for steps. This is done for all the actions . A finite horizon approximation of (termed as ), is required; our simulation would never finish in the infinite horizon case because we would have to follow policy indefinitely. However, , and consequently , is defined over the infinite horizon. It is easy to show the following:


The approximation error in (9) reduces exponentially fast as grows. Therefore, the -horizon results apply to the infinite horizon setting, for we can always choose such that the error in (9) is negligible. To summarize, the rollout algorithm can be presented in the following fashion for our problem:

for  to  do
     for  to  do
          See algorithm 2
     end for
end for
Algorithm 1 Uniform Rollout (,h,,)
for  to  do
end for
Algorithm 2 Simulator

In Algorithm 1, denotes . Note that Algorithm 2 returns the discounted sum of rewards. When , we term the rollout as complete rollout, and when , the rollout is called truncated rollout [Bertsekas1999]. It is possible to analyse the performance of uniform rollout in terms of uniform allocation and horizon depth [Fern, dimitri2008a].

Iii-F Optimal Computing Budget Allocation

In the previous section, we presented the rollout method for solving our MDP problem. In the case of uniform rollout, we allocate a fixed rollout sampling budget to each action, i.e., we obtain number of rollout samples per candidate action to estimate the value associated with the action. In the simulation optimization community, this is analogous to total equal allocation (TEA)[fu] with a fixed budget for each simulation experiment (a single simulation experiment is equivalent to one rollout sample). In practice, we are only interested in the best possible action, and we would like to direct our search towards the most promising candidates. Also, for large real-world problems, the simulation budget available is insufficient to allocate number of rollout samples per action. We would like to get a rough estimate of the performance of each action and spend the remaining simulation budget in refining the accuracy of the best estimates. This is the classic exploration vs. exploitation problem faced in optimal learning and simulation optimization problems.

Instead of a uniform allocation for each action, non-uniform allocation methods have been explored in the literature pertaining to Algorithm 1 called as adaptive rollout [Dimitrakakis2008b]. An analysis of performance guarantees for adaptive rollout remains an active area of research [Dimitrakakis2008b, dimitri2018, lazaric]. These non-uniform allocation methods guarantee performance without a constraint on the budget of rollouts. Hence, we explore an alternative non-uniform allocation method that would not only fuse well into our solutions (adaptively guiding the stochastic search) but would also incorporate the constraint of simulation budget in its allocation procedure. Numerous techniques have been proposed in the simulation optimization community to solve this problem. We draw upon one of the best performers [branke] that naturally fits into our solution framework—OCBA. Moreover, the probability of correct selection of an alternative in OCBA mimics finding the best candidate action at each stage in Algorithm 1. Formally, the OCBA problem [Chen2000] for Section III-D can be stated as :


where represents the simulation budget for determining optimal for at any , and is the simulation budget for the action at a particular . At each OCBA allocation step (for the definition of the allocation step, see variable in [Chen2000]

), barring the best alternative, the OCBA solution assigns an allocation that is directly proportional to the variance of each alternative and inversely proportional to the squared difference between the mean of that alternative and the best alternative.

Here, we only provide information required to initialize the OCBA algorithm. For a detailed description of OCBA, including the solution to the problem in (10), see [Chen2000]. The key initialization variables, for the OCBA algorithm [Chen2000], are , (not to be confused with in this paper), , and . The variable is equal to variable in our problem. The value of changes at each and depends on the number of damaged components and units of resources. The variable is equal to per-stage budget in our problem. More information about the exact value assigned to is described in Section IV. We follow the guidelines specified in [Chen1999] to select and ; in the OCBA algorithm is selected equal to 5, and is kept at of (within rounding).

Iv Simulation Results

We simulate 100 different initial damage scenarios for each of the plots presented in this section. There will be a distinct recovery path for each of the initial damage scenarios. All the plots presented here represent the average of 100 such recovery paths. Two different simulation plots of rollout fused with OCBA are provided in Fig. 2 and Fig. 3. They are termed as rollout with OCBA1 and rollout with OCBA2. The method applied is the same for both cases; only the per-stage simulation budget is different. A per-stage budget (budget at each decision time ) of is assigned for rollout with OCBA1 and for rollout with OCBA2.

Fig. 2: Performance comparison of rollout vs base policy for 3 units of resources.

Fig. 2 compares the performance of rollout fused with OCBA and base policy. The rollout algorithm is known to have the “lookahead property” [Bertsekas1999]. This behavior of the rollout algorithm is evident in the results in Fig. 2, where the base policy initially outperforms the rollout policy, but after about six days the former steadily outperforms the later. Recall, that our objective is to perform repair actions so that maximum people will have water in minimum amount of time. Evaluating the performance of our method in meeting this objective is equivalent to checking the area under the curve of our plots. This area represents the product of the number of people who have water and the number of days for which they have water. A larger area represents that greater number of people were benefitted as a result of the recovery actions. The area under the curve for recovery with rollout (blue and red plots) is more than its base counterpart (black). A per-stage budget increase of 5000 simulations in rollout with OCBA2 with respect to rollout with OCBA1 shows improvements in the recovery process.

Fig. 3: Performance comparison of rollout vs base policy for 5 unit of resources.

In the plots shown in Fig. 3, we use . In the initial phase of planning, it might appear that the base policy outperforms the rollout for a substantial amount of time. However, this is not the case. Note that the number of days for which the base policy outperforms rollout, in both Fig. 2 and Fig. 3, is about six days, but because the number of resource units has increased from three to five, the recovery is faster, giving an illusion that the base policy outperforms rollout for a longer duration. It was verified that the area under the curve for recovery with rollout (blue and red curves) is more than its base counterpart (black curve). Because OCBA is fused with rollout here, we would like to ascertain the exact contribution of the OCBA approach in enhancing the rollout performance.

Fig. 4: Performance comparison of uniform rollout (TEA), rollout with OCBA and base policy for 3 units of resources.

For the rollout with OCBA in Fig. 4, , whereas for the uniform rollout simulations. The recovery as a result of these algorithms outperforms the base policy recovery in all cases. Also, rollout with OCBA performs competitively with respect to uniform rollout despite a meagre simulation budget of of uniform rollout. The area under the recovery process in Fig. 4, as a result of uniform rollout, is only marginally greater than that due to rollout with OCBA. Note that after six days, OCBA slightly outperforms uniform rollout because it prioritizes the simulation budget on the most promising actions per-stage. Rollout exploits this behavior in each stage and gives a set of sequential recovery decisions that further enhances the outcome of the recovery decisions. We would like to once again stress that such an improvement is being achieved at a significantly low simulation budget with respect to uniform rollout. Therefore, these two algorithms form a powerful combination together, where each algorithm consistently and sequentially reinforces the performance of the other. Such synergistic behavior of the combined approach is appealing. Lastly, our simulation studies show that increments in the simulation budget of rollout results in marginal performance improvement for each increment. Beyond a certain increment in the simulation budget, the gain in performance might not scale with the simulation budget expended. A possible explanation is that small simulation budget increase might not dramatically change the approximation of value function associated with a state-action pair. Thus, in (8) might not show a drastic improvement compared to the one computed by a lower simulation budget (policy improvement based on approximation that utilises lower simulation budget).

V Future Work

For future work, we would like to leverage the availability of multiple base polices in the aftermath of hazards in our framework and incorporate parallel rollout in the solution method [Chang2004]. We anticipate further improvements to the performance demonstrated here when OCBA is fused with parallel rollout. In the future, we will also present the inter-relationship in other critical infrastructure systems like electrical power, roads, bridges, and water networks and the impact such dynamic interactive system has on the recovery process post-hazard. We are also interested in exploring the social impact of the optimized recovery process. We will examine how to incorporate meta-heuristics to guide the stochastic search that determines most promising actions [kaveh].