I Introduction
Natural disasters have a significant impact on the economic, social, and cultural fabric of affected communities. Moreover, because of the interconnected nature of communities in the modern world, the adverse impact is no longer restricted to the locally affected region, but it has ramifications on national or international scale. Among other factors, the occurrence of such natural disasters is on the rise owing to population growth and economic development in hazardprone areas [saeed]. Keeping in view the increased frequency of natural disasters, there is an urgent need to address the problem of community recovery posthazard. Typically, the resources available to postdisaster planners are limited and relatively small compared to the impact of the damage. Under these scenarios, it becomes imperative to assign limited resources to various damaged components in the network optimally to support community recovery. Such an assignment must also consider multiple objectives and cascading effects due to the interconnectedness of various networks within the community and must also successfully adopt previous proven methods and practices developed by expert disastermanagement planners. Holistic approaches addressing various uncertainties for networklevel management of limited resources must be developed for maximum effect. Civil infrastructure systems, including power, transportation, and water networks, play a critical part in postdisaster recovery management. In this study, we focus on one such critical infrastructure system, namely the water networks (WN), and compute nearoptimal recovery actions, in the aftermath of an earthquake, for the WN of a testbed community.
Markov decision processes (MDPs) offer a convenient framework for representation and solution of stochastic decisionmaking problems. Exact solutions are intractable for problems of even modest size; therefore, approximate solution methods have to be employed. We can leverage the rich theory of MDPs to model recovery action optimization for large statespace decisionmaking problems such as our. In this study, we employ a simulationbased representation and solution of MDP. The nearoptimal solutions are computed using an approximate solution technique known as rollout. Even though stateoftheart hardware and software practices are used to implement the solution to our problem, we are faced with the additional dilemma of computing recovery actions on a fixed simulation budget without affecting the solution performance. Therefore, any prospective methodology must incorporate such a limitation in its solution process. We incorporate the Optimal Computing Budget Allocation (OCBA) algorithm into our MDP solution process [ocbar1, ocbarr] to address the limited simulation budget problem.
Ii Testbed Case Study
Iia Network Characterization
This study considers the potable water network (WN) of Gilroy, CA, USA as an example to illustrate the proposed methodology. Gilroy, located 50 kilometers (km) south of the city of San Jose, CA is approximately 41.91 km^{2} in area, with a population of 48,821[Gilroy1]. We divide our study area into 36 grid regions to define the properties of infrastructure systems, household units, and the population. Our model of the community maintains adequate detail to study the performance of the WN at a community level under severe earthquakes. The potable water of Gilroy is provided only by the Llgas subbasin [Gilroy2]. The potable water wells, located in woodframe buildings, pump water into the distribution system. The Gilroy municipal water pipelines range from 102 mm to 610 mm in diameter [Gilroy2]. In this study, a simplified aggregated model of WN of Gilroy adopted from [Gilroy2] is modeled. This model shown in Fig. 1, includes six water wells, two booster pump stations (BPS), three water tanks (WT), and the main pipelines.
IiB Seismic Hazard Simulation
The San Andreas Fault (SAF), which is near Gilroy, is a source of severe earthquakes. In this study, we assume that a seismic event of moment magnitude
occurs at one of the closest points on the SAF projection to downtown Gilroy with an epicentral distance of approximately 12 km. Ground motion prediction equations (GMPE) determine the conditional probability of exceeding ground motion intensity at specific geographic locations within Gilroy for this earthquake.
The Abrahamson et al. [abrahamson]
GMPE is used to estimate the Intensity Measures (IM) and associated uncertainties. Peak Ground Acceleration (PGA) is considered for the aboveground WN facilities and wells, whereas Peak Ground Velocity (PGV) is considered as IM of pipelines.
IiC Fragility and Restoration Assessment of Water Network
The physical damage to WN components can be assessed by seismic fragility curves. We use the fragility curves presented in HAZUSMH [hazus] for wells, water tanks, and pump stations based on the IM of PGA. This study adopts the assumptions in [adachi] for water pipelines. The failure probability of a pipe is bounded as follows:
(1) 
where is the failure probability of a pipe, is the length of pipe, is the average PGV for the entire length of the water main, and
is the momentgenerating function of
(the residual of the PGV). The term for water pipe segment i is , where is a coefficient determined by the pipe material, diameter, joint type, and soil condition based on the guidelines prepared by the American Lifeline Alliance [ALA]. Adachi and Ellingwood [adachi] demonstrated that the Upper Bound (UB) and exact solutions (1) are close enough so that in practical applications the UB assessment (conservative evaluation) can be used.Repair crews, replacement components, and tools are considered as available units of resources to restore the damaged components of WN following the hazard. One unit of resources is required to repair each damaged component [ouyang, masoomi]
. However, the available units of resources are limited and depend on the capacities and policy of the entities within the community. To restore the WN, the restoration times based on exponential distributions synthesized from HAZUSMH
[hazus] are used, as summarized in Table I. The piperestoration time in the WN is based on repair rate or number of repairs per kilometer.Damage States  

Component  Minor  Moderate  Extensive  Complete 
Water tanks  1.2  3.1  93  155 
Wells  0.8  1.5  10.5  26 
Pumping plants  0.9  3.1  13.5  35 
Iii PROBLEM DESCRIPTION and SOLUTION
Iiia MDP Framework
We provide a brief description of MDP [Puterman] for the sake of completeness. An MDP is a controlled dynamical process useful in modelling of wide range of decisionmaking problems. It can be represented by the 4tuple . Here, represents the set of states, and represents the set of actions. Let and ; then is the state transition function, where is the probability of going into state after taking action in state . is the reward function, where is the reward received after transitioning from to as a result of action . In this study, we assume that and are finite; is bounded and realvalued and a deterministic function of , and . Implicit in our presentation are also the following assumptions: First order Markovian dynamics (history independence), stationary dynamics (reward function is not a function of absolute time), and full observability of the state space (outcome of an action in a state might be random, but we know the state reached after action is completed). In our study, we assume that we are allowed to take recovery actions (decisions) indefinitely until all the damaged components of our modeled problem are repaired (infinitehorizon planning). In this setting, we have a stationary policy , which is defined as . Suppose that decisions are made at discretetime ; then is the action to be taken in state (regardless of time ). Our objective is to find an optimal policy . For the infinitehorizon case, is defined as
(2) 
where
(3) 
is called the value function for a fixed policy , and is the discount factor. Note that the optimal policy is independent of the initial state . Also, note that we maximize over policies , where at each time the action taken is . Stationary optimal policies are guaranteed to exist for discounted infinitehorizon optimization criteria [howard]. To summarize, our presentation is for infinitehorizon discretetime MDPs with the discounted value as our optimization criterion.
IiiB MDP Solution
A solution to an MDP is the optimal policy . We can obtain
with linear programming or dynamic programming. In the dynamic programming regime, there are several solution strategies, namely value iteration, policy iteration, modified policy iteration, etc. Unfortunately, such exact solution algorithms are intractable for large state and actions spaces. We briefly mention here the method of value iteration because it illustrates the Bellman’s equation
[Bellman]. Studying Bellman’s equation is useful for defining value function. value function will play a critical role in describing the rollout algorithm. Let denote the optimal value function for some ; Bellman showed that satisfies:(4) 
Equation (4) is known as the Bellman’s optimality equation, where is the set of possible actions in any state . The value iteration algorithm solves (4) by using Bellman backup repeatedly, where Bellman backup is given by:
(5) 
Bellman showed that , where is initialised arbitrarily.^{1}^{1}1On a historical note, Lloyd Shapely’s paper [Shapley] included the value iteration algorithm for MDPs as a special case, but this was recognised only later on [kallenberg2003finite]. Next, we define the value function of policy :
(6) 
The value function of any policy gives the expected discount reward in the future after starting in some state , taking action and following policy thereafter. Note that this is the inner term in (4).
IiiC SimulationBased Representation of MDP
We now briefly explain the simulationbased representation of an MDP [Fern]. Such a representation serves well for large state and action spaces, which is a characteristic feature of many realworld problems. When or is large, it is not feasible to represent and in a matrix form. A simulationbased representation of an MDP is a 5tuple , where and are as before, except and are large. Here, is a stochastic realvalued bounded function that stochastically returns a reward when input and are provided, where is the action applied in state . is a simulator that stochastically returns a state when state and action are provided as inputs. is the stochastic initial state function that stochastically returns a state according to some initial state distribution. , , and can be thought of as any callable library functions that can be implemented in any programming language.
IiiD Problem Formulation
After an earthquake event occurs, the components of the water network remain undamaged or exhibit one of the damage states as shown in Table I. Let be the total number of damaged component at . Let represent the decision time when all components are repaired. There is a fixed number of resource units () available to the decision maker. At each discretetime , the decision maker has to decide the assignment of unit of resource to the damaged locations; each component cannot be assigned more than one resource unit. When the number of damaged locations is less than the number of units of resources (because of sequential application of repair actions, or otherwise), we retire the extra unit of resources so that is equal to the number of damaged locations.

Actions A: Let denote the repair action to be carried out at time . Then, is a vector of length , , and . When , no repair work is to be carried out at . Similarly, when , repair work is carried out at .

Simulator T: The repair time associated with each damaged location depends on the state of the damage to the component at that location (see Table I). This repair time is random and is exponentially distributed with expected repair times shown in Table I. Given and , gives us the new state . We say that a repair action is complete as soon as at least one of the locations where repair work is carried out is fully repaired. Let’s denote this completion time at every by . Note that it is possible for the repair work at two or more damaged locations to be completed simultaneously. Once the repair action is complete, the units of resources at remaining locations, where repair work was not complete, are also available for reassignment along with unit of resources where repair was complete. The new repair time at such unrepaired locations is calculated by subtracting from the time required to repair these locations. It is also possible to reassign the unit of resource at the same unrepaired location if it is deemed important for the repair work to be continued at that location by the planner. Because of this reason, preemption of repair work during reassignment is not a restrictive assumption, on the contrary, it allows greater flexibility to the decision maker for planning. Because the repair times are random, the outcomes of repair actions are random as not the same damaged component will be repaired first every time (random repair time), even if the same repair action is applied in . Hence, our simulator is stochastic. Alternative formulation where outcome of repair action is deterministic is also an active area of research [our, emi, iEMSs].

Rewards R: We wish to optimally plan decisions so that maximum people will get water in minimum amount of time. We combine these two competing objectives to define our reward as:
(7) where is the number of people who have water after action is completed, and is the total repair time (days) required to reach from any initial state . Note that our reward function is stochastic because the outcome of our action is random.

Discount factor : In our simulation studies, is fixed at 0.99.
IiiE Rollout
The rollout algorithm was first proposed for stochastic scheduling problems by Bertsekas and Castanon [Bertsekas1999]. Instead of the dynamic programming formalism [Bertsekas1999], we motivate the rollout algorithm in relation to the simulationbased representation of our MDP. Suppose that we have access to a nonoptimal policy , and our aim is to compute an improved policy . Then, we have:
(8) 
where the function is as defined in (6). If the policy defined in (8) is nonoptimal, it is a strict improvement over [howard]. This result is termed as policy improvement theorem. Note that the improved policy is generated as a greedy policy w.r.t. . Unlike the exact solution methods described in Section IIIB, we are interested here in computing only for the current state. Methods that use (8) as the basis for updating the policy suffer from the curse of dimensionality. Before performing the policy improvement step in (8), we have to first calculate the value of . Calculating the value of in (8) is known as policy evaluation. Policy evaluation is intractable for large or continuous state and action spaces. Approximation techniques alleviate this problem by calculating an approximate value function. Rollout is one such approximation technique that utilises montecarlo simulations. Particularly, rollout can be formulated as an approximate policy iteration algorithm [Fern, lagoudakis2003reinforcement]. An implementable (programming sense) stochastic function (simulator) is defined in such a way that its expected value is , where is a finite number representing horizon length. In the rollout algorithm, is implemented by simulating action in state and following thereafter for steps. This is done for all the actions . A finite horizon approximation of (termed as ), is required; our simulation would never finish in the infinite horizon case because we would have to follow policy indefinitely. However, , and consequently , is defined over the infinite horizon. It is easy to show the following:
(9) 
The approximation error in (9) reduces exponentially fast as grows. Therefore, the horizon results apply to the infinite horizon setting, for we can always choose such that the error in (9) is negligible. To summarize, the rollout algorithm can be presented in the following fashion for our problem:
In Algorithm 1, denotes . Note that Algorithm 2 returns the discounted sum of rewards. When , we term the rollout as complete rollout, and when , the rollout is called truncated rollout [Bertsekas1999]. It is possible to analyse the performance of uniform rollout in terms of uniform allocation and horizon depth [Fern, dimitri2008a].
IiiF Optimal Computing Budget Allocation
In the previous section, we presented the rollout method for solving our MDP problem. In the case of uniform rollout, we allocate a fixed rollout sampling budget to each action, i.e., we obtain number of rollout samples per candidate action to estimate the value associated with the action. In the simulation optimization community, this is analogous to total equal allocation (TEA)[fu] with a fixed budget for each simulation experiment (a single simulation experiment is equivalent to one rollout sample). In practice, we are only interested in the best possible action, and we would like to direct our search towards the most promising candidates. Also, for large realworld problems, the simulation budget available is insufficient to allocate number of rollout samples per action. We would like to get a rough estimate of the performance of each action and spend the remaining simulation budget in refining the accuracy of the best estimates. This is the classic exploration vs. exploitation problem faced in optimal learning and simulation optimization problems.
Instead of a uniform allocation for each action, nonuniform allocation methods have been explored in the literature pertaining to Algorithm 1 called as adaptive rollout [Dimitrakakis2008b]. An analysis of performance guarantees for adaptive rollout remains an active area of research [Dimitrakakis2008b, dimitri2018, lazaric]. These nonuniform allocation methods guarantee performance without a constraint on the budget of rollouts. Hence, we explore an alternative nonuniform allocation method that would not only fuse well into our solutions (adaptively guiding the stochastic search) but would also incorporate the constraint of simulation budget in its allocation procedure. Numerous techniques have been proposed in the simulation optimization community to solve this problem. We draw upon one of the best performers [branke] that naturally fits into our solution framework—OCBA. Moreover, the probability of correct selection of an alternative in OCBA mimics finding the best candidate action at each stage in Algorithm 1. Formally, the OCBA problem [Chen2000] for Section IIID can be stated as :
(10) 
where represents the simulation budget for determining optimal for at any , and is the simulation budget for the action at a particular . At each OCBA allocation step (for the definition of the allocation step, see variable in [Chen2000]
), barring the best alternative, the OCBA solution assigns an allocation that is directly proportional to the variance of each alternative and inversely proportional to the squared difference between the mean of that alternative and the best alternative.
Here, we only provide information required to initialize the OCBA algorithm. For a detailed description of OCBA, including the solution to the problem in (10), see [Chen2000]. The key initialization variables, for the OCBA algorithm [Chen2000], are , (not to be confused with in this paper), , and . The variable is equal to variable in our problem. The value of changes at each and depends on the number of damaged components and units of resources. The variable is equal to perstage budget in our problem. More information about the exact value assigned to is described in Section IV. We follow the guidelines specified in [Chen1999] to select and ; in the OCBA algorithm is selected equal to 5, and is kept at of (within rounding).
Iv Simulation Results
We simulate 100 different initial damage scenarios for each of the plots presented in this section. There will be a distinct recovery path for each of the initial damage scenarios. All the plots presented here represent the average of 100 such recovery paths. Two different simulation plots of rollout fused with OCBA are provided in Fig. 2 and Fig. 3. They are termed as rollout with OCBA1 and rollout with OCBA2. The method applied is the same for both cases; only the perstage simulation budget is different. A perstage budget (budget at each decision time ) of is assigned for rollout with OCBA1 and for rollout with OCBA2.
Fig. 2 compares the performance of rollout fused with OCBA and base policy. The rollout algorithm is known to have the “lookahead property” [Bertsekas1999]. This behavior of the rollout algorithm is evident in the results in Fig. 2, where the base policy initially outperforms the rollout policy, but after about six days the former steadily outperforms the later. Recall, that our objective is to perform repair actions so that maximum people will have water in minimum amount of time. Evaluating the performance of our method in meeting this objective is equivalent to checking the area under the curve of our plots. This area represents the product of the number of people who have water and the number of days for which they have water. A larger area represents that greater number of people were benefitted as a result of the recovery actions. The area under the curve for recovery with rollout (blue and red plots) is more than its base counterpart (black). A perstage budget increase of 5000 simulations in rollout with OCBA2 with respect to rollout with OCBA1 shows improvements in the recovery process.
In the plots shown in Fig. 3, we use . In the initial phase of planning, it might appear that the base policy outperforms the rollout for a substantial amount of time. However, this is not the case. Note that the number of days for which the base policy outperforms rollout, in both Fig. 2 and Fig. 3, is about six days, but because the number of resource units has increased from three to five, the recovery is faster, giving an illusion that the base policy outperforms rollout for a longer duration. It was verified that the area under the curve for recovery with rollout (blue and red curves) is more than its base counterpart (black curve). Because OCBA is fused with rollout here, we would like to ascertain the exact contribution of the OCBA approach in enhancing the rollout performance.
For the rollout with OCBA in Fig. 4, , whereas for the uniform rollout simulations. The recovery as a result of these algorithms outperforms the base policy recovery in all cases. Also, rollout with OCBA performs competitively with respect to uniform rollout despite a meagre simulation budget of of uniform rollout. The area under the recovery process in Fig. 4, as a result of uniform rollout, is only marginally greater than that due to rollout with OCBA. Note that after six days, OCBA slightly outperforms uniform rollout because it prioritizes the simulation budget on the most promising actions perstage. Rollout exploits this behavior in each stage and gives a set of sequential recovery decisions that further enhances the outcome of the recovery decisions. We would like to once again stress that such an improvement is being achieved at a significantly low simulation budget with respect to uniform rollout. Therefore, these two algorithms form a powerful combination together, where each algorithm consistently and sequentially reinforces the performance of the other. Such synergistic behavior of the combined approach is appealing. Lastly, our simulation studies show that increments in the simulation budget of rollout results in marginal performance improvement for each increment. Beyond a certain increment in the simulation budget, the gain in performance might not scale with the simulation budget expended. A possible explanation is that small simulation budget increase might not dramatically change the approximation of value function associated with a stateaction pair. Thus, in (8) might not show a drastic improvement compared to the one computed by a lower simulation budget (policy improvement based on approximation that utilises lower simulation budget).
V Future Work
For future work, we would like to leverage the availability of multiple base polices in the aftermath of hazards in our framework and incorporate parallel rollout in the solution method [Chang2004]. We anticipate further improvements to the performance demonstrated here when OCBA is fused with parallel rollout. In the future, we will also present the interrelationship in other critical infrastructure systems like electrical power, roads, bridges, and water networks and the impact such dynamic interactive system has on the recovery process posthazard. We are also interested in exploring the social impact of the optimized recovery process. We will examine how to incorporate metaheuristics to guide the stochastic search that determines most promising actions [kaveh].
Comments
There are no comments yet.