Recent major blackouts caused by extreme events have had catastrophic consequences for the economy and society. Load restoration is therefore of paramount importance in resilient smart grids. Great progress has been made in coordinating multiple energy resources to restore electricity supply to critical loads after major blackouts. Microgrids are widely used to consolidate stationary energy resources. Moreover, with the increasing installation of charging/discharging facilities, microgrids can provide plug-and-play integration of mobile energy storage systems (MESSs) for effective service restoration. The importance of integrating mobile energy resources into critical load restoration in smart grids has been increasingly recognized in recent studies [6, 7]. Reference [8] proposes microgrid-based critical load restoration by adaptively forming microgrids and positioning mobile emergency resources after power disruptions. Reference [9] implements resilient routing and scheduling of mobile power sources via a two-stage framework. Reference [10] proposes a rolling horizon-based integrated restoration strategy that uses MESSs to restore electricity supply to critical loads under uncertainty. However, the optimal scheduling problem is generally formulated as a mixed-integer convex program, which is NP-hard and becomes computationally expensive in large-scale systems because of the large number of integer or binary variables. In addition, the optimization model requires accurate forecast information.
Recent advances in deep reinforcement learning (DRL) have brought tremendous success in solving challenging decision-making problems [13, 14]. In general, a decision-making problem under uncertainty is formulated as a Markov decision process (MDP) [15] and solved iteratively by data-driven DRL algorithms. The application of DRL in energy management systems has been increasingly recognized. Reference [16] presents a reinforcement learning approach for optimal distributed energy management in a microgrid. A DRL-based economic dispatch in microgrids is proposed in [17]. Reference [18] develops an MDP formulation for the joint bidding and pricing problem and applies a DRL algorithm to solve it. Reference [11] proposes a DRL-based demand response scheme for home energy management. In [19], an MDP formulation for electric vehicle charging is proposed to jointly coordinate a set of charging stations. However, research in this area is still at an early stage: the benefit of applying DRL to the coordinated scheduling of stationary and mobile energy resources has not yet been fully investigated, and further studies are needed.
To address the aforementioned issue, a novel MDP formulation for critical load restoration in microgrids is proposed that considers both stationary and mobile energy resources. Uncertainties in load consumption are taken into account. The agent aims to maximize service restoration in microgrids by jointly coordinating the resource dispatch of microgrids and the scheduling of MESSs. The MESS fleets are dynamically dispatched among microgrids for load restoration in coordination with microgrid operation. The proposed model is solved by the twin delayed deep deterministic policy gradient (TD3) algorithm [20], an actor-critic algorithm that can deal with discrete or continuous variables in the state and action spaces.
The remainder of this paper is organized as follows. Section II mathematically describes the scheduling of MESSs and integrated service restoration strategy. Section III develops the MDP formulation and deep reinforcement learning algorithm. Section IV provides case studies and the paper is concluded in Section V.
II Mathematical Modeling
II-A Uncertainties Modeling
II-B Scheduling of Mobile Energy Storage Fleets
A transportation network is modeled as a weighted graph defined by a set of nodes and a set of edges representing roads, where each edge is weighted by its distance. A set of microgrids and a set of depots are located in the transportation network, and location mappings give the nodes at which the microgrids and the depots are located, respectively. Each MESS in the fleet is initially located at a depot, from which it departs and travels among microgrids to supply power; finally, it returns to a depot.
The scheduling of an MESS fleet is defined as a sequence of trips. The current location of an MESS at a given time is generally a node in the transportation network. In addition, an MESS may change its destination while moving, before arriving at the previously designated destination; that is, the MESS may be on an edge at that time. The location of an MESS is therefore defined as a point on an edge, described by its distances to the two end nodes of that edge, which together sum to the edge length.
The movement decision for an MESS at each time step is to designate a destination, which is one of the microgrids or depots. The MESS moves from its current location toward the designated destination. It is assumed that the MESS always takes the shortest path, which is determined by Dijkstra's algorithm [22]. Therefore, a location function is defined that applies Dijkstra's algorithm to the current location and the designated destination to obtain the next location in the graph.
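As an illustration of the location function, the following Python sketch computes the next node on the shortest path with Dijkstra's algorithm. It assumes, for simplicity, that the MESS currently sits on a node (positions partway along an edge are not handled). The adjacency-dictionary format, the function names `shortest_path` and `next_location`, and the toy graph are hypothetical choices made for this example only.

```python
import heapq


def shortest_path(graph, source, target):
    """Dijkstra's algorithm on a graph given as {node: {neighbor: distance}}.

    Returns (list of nodes from source to target, total distance).
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    visited = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in visited:
            continue
        visited.add(u)
        if u == target:
            break
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        raise ValueError("target is unreachable from source")
    # Walk the predecessor links back from the target.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return path, dist[target]


def next_location(graph, current, destination):
    """Hypothetical location function: next node toward the destination."""
    if current == destination:
        return current
    path, _ = shortest_path(graph, current, destination)
    return path[1]


# Toy road network (made-up distances), not the Sioux Falls network.
toy_graph = {
    "A": {"B": 2.0, "C": 5.0},
    "B": {"A": 2.0, "C": 1.0, "D": 4.0},
    "C": {"A": 5.0, "B": 1.0, "D": 1.0},
    "D": {"B": 4.0, "C": 1.0},
}
```

Calling `next_location(toy_graph, "A", "D")` at every time step moves the MESS one node at a time along the route A, B, C, D. Because the path is recomputed at each step, a change of destination simply yields a new route from the current node.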
A binary variable denotes whether an MESS stays at a given microgrid during an interval, which is described as follows.
MESS fleets can exchange power with microgrids by charging from or discharging to microgrids. The operation constraints are described as follows.
In these constraints, the charging/discharging power exchanged between an MESS and a microgrid in an interval is signed: negative power means that the MESS charges from the microgrid, while positive power means that the MESS discharges to the microgrid. The power is bounded by the maximum charging and discharging power of the MESS. The state of charge (SOC) of each MESS at each time point is kept between prescribed minimum and maximum levels, and charging and discharging are subject to their respective efficiencies. Constraint (3) indicates that an MESS can stay at no more than one microgrid in an interval, which is also implied by Equation (2). Constraint (4) links charging/discharging to the temporal-spatial behavior: an MESS can exchange power with a microgrid only while staying at it. Equation (5) calculates the SOC of the MESS, and Constraint (6) sets the upper and lower bounds on the SOC.
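The MESS operation constraints can be illustrated with a short Python sketch that checks one interval and advances the SOC. This is only one plausible reading of Constraints (3)-(6): the SOC update with separate charging and discharging efficiencies, the function name `mess_step`, and all default parameter values are assumptions made for the example, not values from the case study.

```python
def mess_step(soc, power_mw, at_microgrid, dt_h=1.0, capacity_mwh=2.0,
              p_ch_max=0.5, p_dch_max=0.5, eta_ch=0.95, eta_dch=0.95,
              soc_min=0.1, soc_max=0.9):
    """Check the MESS constraints for one interval and return the next SOC.

    Sign convention from the text: power_mw < 0 charges the MESS from the
    microgrid, power_mw > 0 discharges the MESS to the microgrid.
    All default values are hypothetical placeholders.
    """
    # Power can be exchanged only while the MESS stays at a microgrid.
    if not at_microgrid and power_mw != 0.0:
        raise ValueError("MESS must stay at a microgrid to exchange power")
    # Signed power limits: -p_ch_max <= power <= p_dch_max.
    if not -p_ch_max <= power_mw <= p_dch_max:
        raise ValueError("power outside charging/discharging limits")
    if power_mw >= 0.0:
        # Discharging: the battery gives up more energy than it delivers.
        energy_out = power_mw * dt_h / eta_dch
        next_soc = soc - energy_out / capacity_mwh
    else:
        # Charging: the battery stores less energy than it draws.
        energy_in = -power_mw * dt_h * eta_ch
        next_soc = soc + energy_in / capacity_mwh
    # SOC bounds.
    if not soc_min <= next_soc <= soc_max:
        raise ValueError("SOC bound violated")
    return next_soc
```

Representing the position by a single flag (or a single microgrid index) makes it impossible for an MESS to be at two microgrids at once, which mirrors the at-most-one-microgrid condition.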
II-C Joint Service Restoration
The operation constraints of microgrids are as follows.
In these constraints, each microgrid is represented by an equivalent dispatchable distributed generator (DG) whose active and reactive power generation in each interval are each limited by a maximum value. The active and reactive load restoration in each microgrid are related through the power factor. The energy of the equivalent DG is bounded by the energy capacity and the minimum energy reserve of the microgrid. Constraints (7)-(8) describe the active/reactive power balance at each microgrid in each interval, taking into account the power generation of the dispatchable DG and of the mobile energy storage according to the location of the MESSs. Equations (9)-(10) constrain the load restoration and the power factor. Constraints (11)-(12) describe the power generation capacity. Equation (13) calculates the energy in each microgrid. Constraint (14) sets the upper and lower bounds on the energy.
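A minimal Python sketch of how the microgrid constraints could be checked for one interval is given below. It rests on several assumptions that are not stated in the text: MESSs exchange active power only, the reactive load is served by the equivalent DG, and the DG energy decreases by the active energy it generates. The function name `microgrid_step` and all default values are hypothetical.

```python
import math


def microgrid_step(energy_mwh, p_dg_mw, mess_power_mw, demand_mw,
                   power_factor=0.9, dt_h=1.0, p_dg_max=2.0,
                   q_dg_max=1.5, e_min=1.0, e_max=20.0):
    """Check one interval of microgrid operation under assumed constraints.

    mess_power_mw lists the signed power of the MESSs staying at this
    microgrid (positive = discharging to it, negative = charging from it).
    Returns (restored active load, restored reactive load, next DG energy).
    """
    # Active generation capacity of the equivalent DG.
    if not 0.0 <= p_dg_mw <= p_dg_max:
        raise ValueError("active generation outside capacity")
    # Active power balance: DG output plus net MESS injection serves load.
    p_restored = p_dg_mw + sum(mess_power_mw)
    if not 0.0 <= p_restored <= demand_mw:
        raise ValueError("restored load must lie between 0 and the demand")
    # Fixed power factor links reactive and active load restoration.
    q_restored = p_restored * math.tan(math.acos(power_factor))
    if q_restored > q_dg_max:
        raise ValueError("reactive generation outside capacity")
    # Energy of the equivalent DG after the interval, with its bounds.
    next_energy = energy_mwh - p_dg_mw * dt_h
    if not e_min <= next_energy <= e_max:
        raise ValueError("energy bound violated")
    return p_restored, q_restored, next_energy
```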
The objective is to minimize the overall system cost, which is composed of four parts. The first term represents the customer interruption cost. The second term is the microgrid generation cost. The third term is the MESS battery maintenance cost. The last term is the transportation cost of the MESSs.
III Deep Reinforcement Learning Algorithm
III-A Markov Decision Process
A sequential decision-making problem in a stochastic environment is formulated as a Markov decision process (MDP). In an MDP, an agent observes the state $s_t$ at each time step $t$ and continually interacts with an environment by following a policy $\pi$ to select actions $a_t$. In response to the actions, the environment presents new states and gives rise to rewards for the agent. An MDP is defined by a 4-tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R})$, whose elements are the state space, the action space, the transition probability function that satisfies the Markov property (i.e., the next state depends only on the present state and action), and the reward function, respectively. The detailed formulation is described as follows.
The state $s_t$ is a vector that presents information on the time step, the load, the location and SOC of the MESSs, and the energy in the microgrids.
Furthermore, the action $a_t$ is a vector consisting of decision variables on the designated destination of the MESSs, the charging/discharging behavior of the MESSs, and the generation output in the microgrids. It is noted that the designated destination is a categorical action and needs to be one-hot encoded.
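The one-hot treatment of the categorical destination can be sketched in Python as follows. Decoding a continuous actor output by taking the index of its largest entry is an assumption about how the encoding would be used with a continuous-action algorithm. The helper names `one_hot` and `decode_destination` are hypothetical.

```python
def one_hot(index, size):
    """Encode a categorical choice (e.g. destination index) as a vector."""
    if not 0 <= index < size:
        raise ValueError("index out of range")
    return [1.0 if i == index else 0.0 for i in range(size)]


def decode_destination(scores):
    """Map a continuous score vector back to a destination index.

    Assumption: the actor emits one score per candidate destination and
    the destination with the largest score is selected.
    """
    return max(range(len(scores)), key=lambda i: scores[i])
```

With three microgrids and one depot, for example, the destination block of the action vector would have four entries, and `decode_destination([0.1, 0.7, 0.2, -0.3])` selects the destination with index 1.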
The state transition represents the dynamics of the environment: the transition function $\mathcal{P}(s_{t+1} \mid s_t, a_t)$ gives the probability of moving to state $s_{t+1}$ after taking action $a_t$ in state $s_t$. To model the uncertainties in load consumption, the exogenous information in the state vector is treated as random variables. Based on the state and action, the next state $s_{t+1}$ can be obtained. In reinforcement learning, the transition function is unknown and needs to be learned through interactions between the agent and the environment.
The reward function $\mathcal{R}$ gives the immediate reward $r_t$ that the agent receives by taking action $a_t$ in state $s_t$. The immediate reward has two components, which account for the objective and for penalties on violated constraints, each weighted by a coefficient. The first component relates to the objective function in Section II-C and is obtained by ignoring the constant term and negating the cost, so that cost minimization is transformed into reward maximization. The second component is a Lagrangian penalty term incurred by the violation of constraints.
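The two-component reward can be illustrated with a small Python sketch. The weights `w_obj` and `w_pen`, the dictionary of cost terms, and the convention that a positive entry in `violations` measures how far a constraint is exceeded are assumptions for this example. The actual coefficients and penalty form are not specified here.

```python
def immediate_reward(costs, violations, w_obj=1.0, w_pen=10.0):
    """Negative weighted cost minus a weighted constraint-violation penalty.

    costs: mapping from cost-term name to its value for this step.
    violations: amounts by which constraints are exceeded; non-positive
    entries mean the constraint is satisfied and add no penalty.
    """
    total_cost = sum(costs.values())
    penalty = sum(max(0.0, v) for v in violations)
    return -w_obj * total_cost - w_pen * penalty


# Made-up numbers, only to show the call.
example_costs = {
    "interruption": 3.0,
    "generation": 1.0,
    "battery": 0.5,
    "transport": 0.5,
}
example_reward = immediate_reward(example_costs, [0.0, 0.2, -0.1])
```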
III-B Twin Delayed Deep Deterministic Policy Gradient
In reinforcement learning, the return is defined as the sum of discounted rewards, $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$, where $\gamma \in [0, 1]$ is the discount factor. A policy is a mapping from states to actions, i.e., a stochastic policy $a_t \sim \pi(\cdot \mid s_t)$ or a deterministic policy $a_t = \mu(s_t)$. Solving an MDP amounts to finding a policy that maximizes the expected return $\mathbb{E}[G_t]$.
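For a finite episode, the discounted return can be computed with a simple backward recursion, as in the following Python sketch. The function name `discounted_return` and the default discount factor are illustrative choices.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of discounted rewards, with rewards[0] being the first reward."""
    g = 0.0
    # Accumulate from the last reward backward: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For three unit rewards and a discount factor of 0.5, the return is 1 + 0.5 + 0.25 = 1.75.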
To deal with continuous and discrete variables in the state and action spaces, an actor-critic algorithm is adopted, e.g., deep deterministic policy gradient (DDPG) [24], which concurrently learns a Q-function and a policy. It uses off-policy data and the Bellman equation to learn the Q-function, and uses the Q-function to learn the policy $\mu_\theta$ parameterized by $\theta$.
In Q-learning, the action-value function $Q^\pi(s, a)$ denotes the expected return obtained by starting from state $s$, taking action $a$, and thereafter following policy $\pi$. The optimal Q-function $Q^*$ obeys the Bellman equation, which describes the optimal value in recursive form: $Q^*(s, a) = \mathbb{E}_{s'}\left[ r(s, a) + \gamma \max_{a'} Q^*(s', a') \right]$, where $s'$ is the next state and $\gamma$ is the discount factor.
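The Q-function is estimated by a differentiable function approximator $Q_\phi(s, a)$, a neural network with weights $\phi$ referred to as the Q-network. The Q-network is trained to reduce the mean-squared Bellman error. To make training stable and convergent, a separate target Q-network $Q_{\phi_{\text{targ}}}$ and a target policy network $\mu_{\theta_{\text{targ}}}$ are used to generate the target value. A loss function is therefore set up at each update step from the mean-squared Bellman error over a mini-batch $\mathcal{B}$ of transitions $(s, a, r, s')$ as $L(\phi) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{B}}\left[ \left( Q_\phi(s, a) - y \right)^2 \right]$, where the target value is defined as $y = r + \gamma Q_{\phi_{\text{targ}}}(s', \mu_{\theta_{\text{targ}}}(s'))$ with discount factor $\gamma$. The Q-network is updated by one step of gradient descent on $L(\phi)$. A soft target update is used for the actor-critic algorithm: the target networks are updated by Polyak averaging, $\phi_{\text{targ}} \leftarrow \rho\, \phi_{\text{targ}} + (1 - \rho)\, \phi$ and $\theta_{\text{targ}} \leftarrow \rho\, \theta_{\text{targ}} + (1 - \rho)\, \theta$, where $\theta$ denotes the policy network parameters and $\rho \in (0, 1)$ is the Polyak hyperparameter (usually close to 1).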
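The soft target update and the mean-squared Bellman error can be sketched in plain Python, treating network parameters as flat lists of floats. This is a simplification: a real implementation would operate on the tensors of a deep learning framework. The function names and the default value of `rho` are assumptions.

```python
def polyak_update(target_params, params, rho=0.995):
    """Soft update: target <- rho * target + (1 - rho) * online."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    return [rho * t + (1.0 - rho) * p
            for t, p in zip(target_params, params)]


def mean_squared_bellman_error(q_values, targets):
    """Average squared gap between Q estimates and their target values."""
    if len(q_values) != len(targets) or not q_values:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((q - y) ** 2 for q, y in zip(q_values, targets)) / len(q_values)
```

With `rho` close to 1, the target parameters drift only slowly toward the online parameters, which keeps the regression targets nearly stationary between updates.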
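Policy learning aims to find a policy that maximizes the expected discounted return $J(\theta) = \mathbb{E}_{s \sim \mathcal{B}}\left[ Q_\phi(s, \mu_\theta(s)) \right]$, where $Q_\phi$ is the Q-network, $\mu_\theta$ is the policy network with parameters $\theta$, and $\mathcal{B}$ is a mini-batch of sampled states. The policy network is updated by applying the chain rule to $J(\theta)$ with respect to the actor parameters $\theta$, and gradient ascent is performed. Thus, the policy gradient is
$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \mathcal{B}}\left[ \nabla_a Q_\phi(s, a)\big|_{a = \mu_\theta(s)} \, \nabla_\theta \mu_\theta(s) \right].$$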
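Furthermore, twin delayed deep deterministic policy gradient (TD3) concurrently learns two Q-networks, $Q_{\phi_1}$ and $Q_{\phi_2}$, by minimizing the mean-squared Bellman error. By upper-bounding the less biased value estimate with the more biased one, a single target for clipped double Q-learning is obtained by taking the minimum of the two target Q-networks:
$$y = r + \gamma \min_{i=1,2} Q_{\phi_{i,\text{targ}}}(s', \mu_{\theta_{\text{targ}}}(s')),$$
where $r$ is the reward, $\gamma$ is the discount factor, $s'$ is the next state, and $\mu_{\theta_{\text{targ}}}$ is the target policy network. Then $Q_{\phi_1}$ and $Q_{\phi_2}$ are updated by minimizing the corresponding mean-squared Bellman errors over a mini-batch $\mathcal{B}$ as follows.
$$L(\phi_i) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{B}}\left[ \left( Q_{\phi_i}(s, a) - y \right)^2 \right], \quad i = 1, 2.$$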
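Target policy smoothing regularization adds a small amount of random noise to the action of the target policy network in the target update and averages over mini-batches. The modified target action and target value are as follows.
$$\tilde{a} = \mu_{\theta_{\text{targ}}}(s') + \epsilon, \quad \epsilon \sim \operatorname{clip}\left(\mathcal{N}(0, \sigma), -c, c\right),$$
$$y = r + \gamma \min_{i=1,2} Q_{\phi_{i,\text{targ}}}(s', \tilde{a}),$$
where the added noise $\epsilon$ is drawn from a normal distribution with zero mean and standard deviation $\sigma$, and is clipped to $[-c, c]$ by a hyperparameter $c$.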
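The target computation with clipped double Q-learning and target policy smoothing can be sketched in Python for a single transition. The target networks are passed in as plain callables, which is a simplification for illustration. The extra clipping of the smoothed action to `[act_low, act_high]` follows common TD3 implementations and is an assumption here, as are the function names and default hyperparameters.

```python
import random


def clip(x, low, high):
    """Limit x to the interval [low, high]."""
    return max(low, min(high, x))


def td3_target(reward, next_state, target_policy, target_q1, target_q2,
               gamma=0.99, sigma=0.2, noise_clip=0.5,
               act_low=-1.0, act_high=1.0, rng=random):
    """Target value y for one transition (terminal handling omitted).

    target_policy(state) returns a list of floats; target_q1 and
    target_q2 take (state, action) and return a float.
    """
    next_action = []
    for a in target_policy(next_state):
        # Target policy smoothing: clipped Gaussian noise on the action.
        noise = clip(rng.gauss(0.0, sigma), -noise_clip, noise_clip)
        # Assumed extra step: keep the smoothed action in the valid range.
        next_action.append(clip(a + noise, act_low, act_high))
    # Clipped double Q-learning: use the smaller of the two estimates.
    q_min = min(target_q1(next_state, next_action),
                target_q2(next_state, next_action))
    return reward + gamma * q_min
```

Taking the minimum biases the target downward when the two critics disagree, which counteracts the overestimation that a single critic tends to accumulate.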
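The policy $\mu_\theta$ is optimized with respect to the first Q-network $Q_{\phi_1}$ to maximize the expected return, so the policy learning is rewritten as:
$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \mathcal{B}}\left[ \nabla_a Q_{\phi_1}(s, a)\big|_{a = \mu_\theta(s)} \, \nabla_\theta \mu_\theta(s) \right].$$
In addition, the policy network is updated at a lower frequency than the value networks, in order to reduce the value estimation error before introducing a policy update.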
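The delayed policy update amounts to a simple schedule, sketched below in Python. The update routines are passed in as callables so that the sketch stays independent of any deep learning framework. Updating the target networks together with the actor follows the original TD3 algorithm, while the function name `run_td3_updates` and the default delay of 2 are illustrative assumptions.

```python
def run_td3_updates(num_steps, update_critics, update_actor_and_targets,
                    policy_delay=2):
    """Call the critic update every step and the actor update less often.

    Both callables receive the 1-based step index.
    """
    if policy_delay < 1:
        raise ValueError("policy_delay must be at least 1")
    for step in range(1, num_steps + 1):
        # The two Q-networks are trained at every gradient step.
        update_critics(step)
        # The policy (and the target networks) follow every few steps.
        if step % policy_delay == 0:
            update_actor_and_targets(step)
```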
IV Case Studies
The case studies are implemented on an integrated test system, based on the Sioux Falls transportation network and three microgrids, to verify the effectiveness of the proposed service restoration strategy.
IV-A Test Systems
Fig. 1 shows an integrated test system with microgrids connected by the Sioux Falls transportation network. The length of the entire time horizon is set to 24-h and the length of interval is 1-h. A depot is located at node #10 in the transportation network. There are three microgrids located at nodes #2, #12, #21 in the transportation network, respectively. The operational parameters of microgrids are shown in Table I. The predicted value of industrial, commercial and residential loads, as well as prediction intervals could be obtained in . The parameters for MESS refers to . The customer interruption cost for industrial, commercial and residential loads are , and , respectively. The unit generation cost in microgrid is . The unit battery maintenance cost is . The unit transportation cost is .
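Table I: Operational parameters of microgrids
| Category | Parameter | Microgrid #1 | Microgrid #2 | Microgrid #3 |
| --- | --- | --- | --- | --- |
| Load | Peak load (MW) | 3.0 | 3.0 | 3.0 |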
IV-B Simulation Results
The total cost is , with the customer interruption cost , microgrid generation cost , MESS generations cost and transportation cost . The load restorations in three microgrids are and , respectively.
Fig. 2 illustrates the evolution of the learning and validation rewards over 60,000 episodes. A purely exploratory policy is carried out for the first 3,000 episodes. Then, an off-policy exploration strategy with Gaussian noise is adopted. In the learning curve, the average and standard deviation are computed every 10 episodes. In the validation curve, the policy is evaluated every 500 episodes over 20 episodes with no exploration noise. It can be seen that the learning process converges to a suboptimal policy within 40,000 episodes. The results indicate that the proposed approach can learn a policy that maximizes the cumulative reward. After learning, the model can be deployed in an online manner.
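The exploration and validation schedule described above can be sketched in Python as follows. Interpreting the purely exploratory policy as uniform random sampling over the action range is an assumption. The noise scale `sigma`, the action bounds, and the function names are likewise hypothetical, while the 3,000-episode warm-up and the 500-episode validation period are taken from the text.

```python
import random


def select_action(policy_action, episode, warmup_episodes=3000, sigma=0.1,
                  act_low=-1.0, act_high=1.0, rng=random):
    """Exploration rule: random warm-up, then Gaussian noise on the policy."""
    if episode < warmup_episodes:
        # Assumed form of the purely exploratory policy: uniform sampling.
        return [rng.uniform(act_low, act_high) for _ in policy_action]
    # Afterwards, perturb the deterministic policy output and clip it.
    return [max(act_low, min(act_high, a + rng.gauss(0.0, sigma)))
            for a in policy_action]


def is_validation_episode(episode, every=500):
    """True when the policy should be evaluated without exploration noise."""
    return episode > 0 and episode % every == 0
```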
Fig. 3 presents the charging/discharging schedule together with the position of each MESS. The bars show the charging/discharging active power, while the dashed lines with asterisks, plotted against the right Y-axis, indicate the movements of the MESS. The dynamic scheduling optimizes the trip chains of the MESSs and the corresponding charging/discharging behaviors.
The simulation results show that MESSs transport energy among microgrids to restore critical loads by charging from some microgrids and discharging to others. For example, MESS #1 is dispatched between microgrid #1 and microgrid #2. MESS #1 initially moves from the depot to microgrid #2 and charges there. Next, it moves back and forth between microgrid #2 and microgrid #1 during 07:00-22:00 to transfer energy. The integration of MESSs and their coordination with microgrids thus leverage the mobility of the MESSs. The MESSs can also carry out load shifting within the same microgrid. For instance, MESS #3 charges at microgrid #2 during 01:00-02:00 and discharges during 02:00-07:00. The results highlight the importance of effectively utilizing the mobility and flexibility of MESSs.
V Conclusion
This paper presents a novel MDP formulation for the service restoration strategy in microgrids that coordinates the scheduling of MESSs and the resource dispatch of microgrids. DRL algorithms are leveraged to solve the formulated sequential decision-making problem with consideration of uncertainties in load consumption. The well-trained policy can be deployed in an online manner and is computationally efficient. The simulation results verify the effectiveness of the mobility of MESSs, which transport energy among microgrids to facilitate load restoration. Mobile and stationary resources can be jointly coordinated to enhance system resilience.
This work was supported by the Future Resilient System (FRS) at the Singapore-ETH Centre (SEC), which is funded by the National Research Foundation of Singapore (NRF) under its Campus for Research Excellence and Technological Enterprise (CREATE) program.
[1] Z. Bie, Y. Lin, G. Li, and F. Li, “Battling the Extreme: A Study on the Power System Resilience,” Proc. IEEE, vol. 105, no. 7, pp. 1253–1266, 2017.
[2] Y. Wang, C. Chen, J. Wang, and R. Baldick, “Research on Resilience of Power Systems Under Natural Disasters—A Review,” IEEE Trans. Power Syst., vol. 31, no. 2, pp. 1604–1613, 2016.
[3] Y. Xu, C. C. Liu, K. P. Schneider, F. K. Tuffner, and D. T. Ton, “Microgrids for service restoration to critical load in a resilient distribution system,” IEEE Trans. Smart Grid, vol. 9, no. 1, pp. 426–437, 2018.
[4] C. Chen, J. Wang, and D. Ton, “Modernizing Distribution System Restoration to Achieve Grid Resiliency Against Extreme Weather Events: An Integrated Solution,” Proc. IEEE, vol. 105, no. 7, pp. 1267–1288, 2017.
[5] S. Yao, P. Wang, and T. Zhao, “Transportable Energy Storage for More Resilient Distribution Systems with Multiple Microgrids,” IEEE Trans. Smart Grid, vol. 10, no. 3, pp. 3331–3341, 2019.
[6] S. Yao, T. Zhao, H. Zhang, P. Wang, and L. Goel, “Two-stage stochastic scheduling of transportable energy storage systems for resilient distribution systems,” in 2018 IEEE Int. Conf. Probabilistic Methods Appl. to Power Syst., 2018, pp. 1–6.
[7] J. Kim and Y. Dvorkin, “Enhancing Distribution System Resilience With Mobile Energy Storage and Microgrids,” IEEE Trans. Smart Grid, vol. 10, no. 5, pp. 4996–5006, Sep. 2019.
[8] L. Che and M. Shahidehpour, “Adaptive Formation of Microgrids With Mobile Emergency Resources for Critical Service Restoration in Extreme Conditions,” IEEE Trans. Power Syst., vol. 34, no. 1, pp. 742–753, 2019.
[9] S. Lei, C. Chen, H. Zhou, and Y. Hou, “Routing and Scheduling of Mobile Power Sources for Distribution System Resilience Enhancement,” IEEE Trans. Smart Grid, vol. PP, no. 99, pp. 1–1, 2018.
[10] S. Yao, P. Wang, X. Liu, H. Zhang, and T. Zhao, “Rolling Optimization of Mobile Energy Storage Fleets for Resilient Service Restoration,” IEEE Trans. Smart Grid, vol. PP, no. 99, pp. 1–1, 2019.
[11] R. Lu, S. H. Hong, and M. Yu, “Demand Response for Home Energy Management using Reinforcement Learning and Artificial Neural Network,” IEEE Trans. Smart Grid, vol. PP, no. c, pp. 1–1, 2019.
[12] E. Mocanu, D. C. Mocanu, P. H. Nguyen, A. Liotta, M. E. Webber, M. Gibescu, and J. G. Slootweg, “On-line Building Energy Optimization using Deep Reinforcement Learning,” IEEE Trans. Smart Grid, vol. PP, no. 99, pp. 1–1, 2018.
[13] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. Van Den Driessche, T. Graepel, and D. Hassabis, “Mastering the game of Go without human knowledge,” Nature, vol. 550, no. 7676, pp. 354–359, 2017.
[14] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
[15] T. J. Sheskin, Markov Chains and Decision Processes for Engineers and Managers. CRC Press, 2011.
[16] E. Foruzan, L. K. Soh, and S. Asgarpoor, “Reinforcement Learning Approach for Optimal Distributed Energy Management in a Microgrid,” IEEE Trans. Power Syst., vol. 33, no. 5, pp. 5749–5758, 2018.
[17] W. Liu, P. Zhuang, H. Liang, J. Peng, and Z. Huang, “Distributed Economic Dispatch in Microgrids Based on Cooperative Reinforcement Learning,” IEEE Trans. Neural Networks Learn. Syst., vol. 29, no. 6, pp. 2192–2203, 2018.
[18] H. Xu, H. Sun, D. Nikovski, S. Kitamura, K. Mori, and H. Hashimoto, “Deep Reinforcement Learning for Joint Bidding and Pricing of Load Serving Entity,” IEEE Trans. Smart Grid, vol. PP, no. 99, pp. 1–1, 2019.
[19] N. Sadeghianpourhamami, J. Deleu, and C. Develder, “Definition and evaluation of model-free coordination of electrical vehicle charging with reinforcement learning,” IEEE Trans. Smart Grid, vol. PP, no. 99, pp. 1–1, 2018.
[20] S. Fujimoto, H. van Hoof, and D. Meger, “Addressing Function Approximation Error in Actor-Critic Methods,” 35th Int. Conf. Mach. Learn., 2018.
[21] X. Yu, S. Gao, X. Hu, and H. Park, “A Markov decision process approach to vacant taxi routing with e-hailing,” Transp. Res. Part B Methodol., vol. 121, pp. 114–134, Mar. 2019.
[22] T. H. Cormen, Introduction to Algorithms. MIT Press, 2009.
[23] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. The MIT Press, 2018.
[24] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” in Int. Conf. Learn. Represent. (2016 ICLR), 2016.