I Introduction
Reliable and resilient electricity is vital to the economy and national security of all countries. Preventive control measures have been widely employed to ensure adequate security margins against some conceived (e.g., N-1) contingencies. However, several large blackouts still occurred in the US, Europe, India and Brazil in the last two decades [1, 2, 3]. It is well recognized that emergency control is imperative in real-time operation to minimize the occurrence and impact of power outages or widespread blackouts. Conventional emergency control actions include generation redispatch or tripping, load shedding, controlled system separation (or islanding), and dynamic braking [4].
Some of these actions are automatically triggered by control or protection systems, while others are armed by system operators. Ideally, these emergency control actions should be adaptive to real-time system operation conditions. However, existing control and protection systems for emergency controls are usually based on fixed settings that are mostly determined offline from some typical scenarios, and they are operated in a "set-and-forget" mode. Emergency controls used by system operators in control rooms today are predefined through offline studies based on a few forecasted system conditions and conceived contingency scenarios. In addition, they rely heavily on system operators to choose suitable control actions by matching the current system situation with the nearest system conditions defined in emergency control look-up tables, and to determine when and how to apply them. These processes are time consuming and often overwhelming for system operators. For example, during the 11-minute time span of the 2011 Southwest blackout event in the US, system operators lacked sufficient time to understand the causes and take effective corrective actions [3].
Current research into solutions to the emergency control problem can be categorized into three directions: 1) security-constrained alternating current optimal power flow (SC-ACOPF) [5]; 2) optimal control [6]; and 3) conventional machine learning, such as decision trees [7] and conventional reinforcement learning (RL) [8, 9].

Mathematically, power system emergency control is a problem of dynamic, sequential decision-making under uncertainty. When applied to this problem, SC-ACOPF is inherently limited by its static formulation of the problem and poor scalability. Optimal control-based methods are generally difficult to scale to large-scale power systems and a large number of control actions, and are not adaptive to system uncertainties. RL methods can solve sequential decision-making problems in real time [10]. The last two decades have seen increasing efforts to apply conventional RL methods, such as Q-learning and fitted Q-iteration [10], to various decision-making and control problems in power systems; these range from demand response [11], energy management, and automatic generation control to transient stability and emergency control [9], [12], [13]. Due to scalability issues, applications of conventional RL methods mainly focus on problems with low-dimensional state and action spaces. In addition, their performance is heavily dependent on the quality of hand-crafted features [14]. Thus, they are not suitable for large, complex problems, such as emergency control for large-scale power systems.
In the past few years, significant progress has been made in solving challenging problems in games [14, 15], robotics [16], etc. using deep reinforcement learning (DRL), which is a combination of deep learning technologies and RL. Unlike conventional RL, by replacing the hand-crafted feature mapping and extraction (such as a Q-table) with deep learning technologies, DRL enables automatic high-dimensional feature extraction and end-to-end learning through stochastic gradient descent. In addition, the high-dimensional feature representation capability of deep learning technologies and the development of scalable learning algorithms such as Deep Q-Network (DQN) [14] and Proximal Policy Optimization (PPO) [17] significantly improve the scalability of DRL, making it suitable for solving large-scale control problems, such as Dota II [18]. These advantages of DRL were recognized by some researchers and leveraged in several different applications in power systems in the past few years. In [19], the authors utilized a DRL method to optimize the operation of storage devices in a microgrid considering both future electricity consumption and photovoltaic (PV) output uncertainties. In [20], a DRL approach was applied to jointly determine the energy bid submitted to the wholesale market and the energy price charged in the retail market for a load serving entity. DRL was applied to develop a dynamic load shedding scheme for short-term voltage control in [21]. In [22], the authors applied DRL to determine generation unit tripping under emergency circumstances. In light of these, we propose to develop adaptive and robust power system emergency control schemes using DRL.

One main challenge faced by both the power system and RL research communities is reproducing and benchmarking existing work and accurately judging the improvements offered by novel RL methods [23]. Open platforms and tools, such as OpenAI Gym [24] and ELF [25], have proven significantly beneficial for developing and comparing RL algorithms in games and robotics. On the other hand, to the best knowledge of the authors, all previous research efforts in applying RL to power system control [9] were based on simulation environments and RL algorithms that were not publicly available. The lack of an open platform for developing, training and benchmarking DRL algorithms for power system control not only is a roadblock for power system researchers and engineers working on applying DRL in power systems, but also prevents many researchers with machine learning and control backgrounds from easily applying their DRL algorithms to power system control.
To fill this gap and address the reproducibility issue, an open platform named RLGC [26] for developing, training and benchmarking DRL algorithms for power system control has been developed in this paper. To the best knowledge of the authors, this is the first platform of its kind in the power system area. It is extensible, lightweight and flexible.
The main contributions of this paper include: 1) novel application of DRL algorithms for power system emergency controls, including generator dynamic braking and undervoltage load shedding (UVLS); 2) development of the first open-source platform for developing and benchmarking DRL algorithms for power system control; 3) detailed investigation into several important aspects of DRL algorithms for grid control, including adapting generic problem formulations and the DQN algorithm to detailed, specific emergency control designs, robustness to different simulation scenarios, model parameter uncertainty, and noise in the input (observation), with direct comparisons against a conventional Q-learning method and an optimal control method.
The rest of the paper is organized as follows: an overview of DRL and grid emergency control is presented in Section II; Section III details the open platform for developing and benchmarking DRL algorithms for grid control; Section IV discusses development details of two DRL-based grid emergency control schemes; test cases and results are shown in Section V; discussions on several key aspects of DRL applications for grid emergency control are presented in Section VI; and conclusions and future work are provided in Section VII.
II Overview of Deep Reinforcement Learning and Grid Emergency Control
II-A Reinforcement Learning
In RL, the agent learns to make optimal decisions by interacting with the environment through exploration and exploitation [10]. The environment is modeled as a (partially observable) Markov decision process (MDP), defined by:

a state space $\mathcal{S}$ that could be continuous or discrete;

an action space $\mathcal{A}$ that could be continuous or discrete;

an environment transition function $\mathcal{T}: \mathcal{S} \times \mathcal{A} \to \mathcal{S}$;

a reward function $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$;

a discount factor $\gamma \in [0, 1]$.
In this setting, at each time step $t$, the agent can observe the state $s_t$ and receive a reward signal $r_t$ from the environment. At the same time, the agent can select an action $a_t$ to change the environment. The goal is to apply the optimal action given the current state so that the agent accumulates the most reward over time, which is generally defined as the discounted future return $R_t$:

$$R_t = \sum_{k=0}^{T-t} \gamma^{k} r_{t+k} \qquad (1)$$

where $T$ means the time step when the interaction with the system ends. To evaluate the result of an action based on the current state, the action-value function, also known as the Q function, is defined as $Q(s, a) = \mathbb{E}[R_t \mid s_t = s, a_t = a]$. We define the optimal Q-value of the state-action pair as $Q^*(s, a)$, which represents the maximum discounted future return after taking action $a$ at state $s$. The Q function is updated by the iteration algorithm in the Bellman equation, defined by [10]

$$Q_{i+1}(s, a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q_i(s', a') \,\middle|\, s, a \right] \qquad (2)$$

The iteration converges to the optimal solution, $Q_i \to Q^*$ as $i \to \infty$, if the state signals have the Markov property [10].

Q-learning [10] is a value-based RL algorithm which finds the optimal action-selection policy using

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right] \qquad (3)$$

where $\alpha$ represents the learning rate.
Conventional Q-learning is based on tabular methods, where the observation space needs to be discretized first. There are two main practical issues: 1) The observation space discretization strategy only works well if the range and dimension of the observation space are relatively small. For large-scale problems, it easily leads to memory explosion and also requires more training time to converge to a good solution; 2) The observed states in real-world environments are usually noisy or incomplete, which makes it very difficult for tabular methods to capture the true pattern from noisy data. We tested the performance of conventional Q-learning with noisy input data in Section V.
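As an illustration of the tabular approach and the update in (3), the following is a minimal Python sketch, not the implementation used in this paper. The helper names (`q_learning_update`, `discretize`), the bin count, and the values of the learning rate and discount factor are assumptions chosen for the example.

```python
from collections import defaultdict


def discretize(value, low, high, n_bins):
    """Map a continuous observation onto one of n_bins table indices."""
    clipped = min(max(value, low), high)
    idx = int((clipped - low) / (high - low) * n_bins)
    return min(idx, n_bins - 1)  # keep the upper edge inside the last bin


def q_learning_update(q_table, state, action, reward, next_state, n_actions,
                      alpha=0.1, gamma=0.99, terminal=False):
    """Apply one tabular Q-learning update and return the new Q(state, action)."""
    # Value of the best next action; zero when the episode has ended.
    best_next = 0.0 if terminal else max(
        q_table[(next_state, a)] for a in range(n_actions))
    td_target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (td_target - q_table[(state, action)])
    return q_table[(state, action)]


# Hypothetical one-dimensional observation in [0, 1] split into 10 bins.
q = defaultdict(float)                     # unseen entries default to 0.0
s = discretize(0.37, 0.0, 1.0, 10)
s_next = discretize(0.52, 0.0, 1.0, 10)
new_q = q_learning_update(q, s, 1, reward=-1.0, next_state=s_next, n_actions=2)
```

The table grows with the product of the bin counts over all observation dimensions, which is the memory explosion noted above.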
II-B Deep Reinforcement Learning
Deep reinforcement learning is a combination of RL and deep learning technologies. The use of deep learning makes it possible for DRL to directly use the raw state representations, and to train policies for complex systems and tasks with effective and efficient approaches for high-dimensional feature extraction and nonlinear generalization. DRL algorithms learn directly from agents' interactions with an environment (either simulated or real). Although catastrophic events rarely happen in the real world, a wide variety of extreme event scenarios can be created in simulation to provide the DRL algorithms with extensive experience to learn from. This is unlike other deep learning techniques that require a large amount of labelled training data, which is usually sparse or not available in the power industry.
One of the most successful DRL algorithms suitable for discrete action spaces is DQN, which uses a neural network (NN) with weights $\theta$ to estimate Q-values. Compared to conventional Q-learning with the function approximation approach [10], which usually requires a significant amount of manual tuning to stabilize the learning process, two key traits make DQN more efficient and stable: 1) the use of a target network besides the Q-network; and 2) the use of experience replay [14]. The target network has the same structure as the Q-network. At regular intervals (every $C$ steps), the weights of the Q-network are copied to the target network [14]. To perform experience replay, the agent's experience is stored in a data set $D$ at each time step. A Q-network can be trained using samples (mini-batches) randomly drawn from $D$ by minimizing a sequence of loss functions (4) [14]

$$L_i(\theta_i) = \mathbb{E}_{(s,a) \sim \rho(\cdot)}\left[ \left( y_i - Q(s, a; \theta_i) \right)^2 \right] \qquad (4)$$

where $y_i$ is the target Q-value for iteration $i$ computed by $y_i = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q(s', a'; \theta^-) \mid s, a \right]$, with $\theta^-$ denoting the target network weights, and $\rho(s, a)$ is the probability distribution of the state and action pair $(s, a)$. Updating the NN weights θ can be done by stochastic gradient descent with the gradient calculated by (5).

$$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{(s,a) \sim \rho(\cdot);\, s'}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta_i) \right) \nabla_{\theta_i} Q(s, a; \theta_i) \right] \qquad (5)$$
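To make the target and loss computation concrete, the sketch below evaluates the target Q-values and the mean squared error over a small mini-batch in plain Python. It is a hedged illustration only: the function names, the batch values, and the discount factor are assumptions, and the target-network outputs are supplied as plain lists rather than produced by a real network.

```python
def td_targets(rewards, next_q_values, terminals, gamma=0.99):
    """Return one target per transition in the mini-batch.

    A terminal transition uses the reward alone; otherwise the discounted
    maximum of the target-network Q-values at the next state is added.
    """
    targets = []
    for reward, q_next, done in zip(rewards, next_q_values, terminals):
        targets.append(reward if done else reward + gamma * max(q_next))
    return targets


def mse_loss(targets, q_taken):
    """Mean squared difference between targets and Q(s, a) of the taken actions."""
    return sum((y - q) ** 2 for y, q in zip(targets, q_taken)) / len(targets)


# Hypothetical mini-batch of two transitions with two actions each.
rewards = [-1.0, -1000.0]
next_q = [[-2.0, -3.0], [0.0, 0.0]]   # target-network outputs for the next states
terminals = [False, True]

y = td_targets(rewards, next_q, terminals, gamma=0.5)
loss = mse_loss(y, q_taken=[-1.5, -900.0])
```

In a full implementation the gradient of this loss with respect to the Q-network weights would be obtained by automatic differentiation while the target-network weights are held fixed.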
A popular algorithm for training DQN is presented as Algorithm 1 below [14].
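Algorithm 1 Deep Q-learning

1  Initialize Q-network $Q$ and target network $\hat{Q}$ with random weights $\theta$
2  Initialize experience replay memory $D$, exploration rate $\epsilon$
3  For episode $= 1$, M do
4  Initialize the environment
5  For $t = 1$, T do
6  With probability $\epsilon$ select a random action $a_t$
7  Otherwise select $a_t = \arg\max_{a} Q(s_t, a; \theta)$
8  Execute $a_t$ in the environment and
9  observe reward $r_t$ and next state $s_{t+1}$
10  Store transition $(s_t, a_t, r_t, s_{t+1})$ in $D$
11  Sample random batches $(s_j, a_j, r_j, s_{j+1})$ from $D$
12  If $s_{j+1}$ is a terminal state do
13  $y_j = r_j$
14  else
15  $y_j = r_j + \gamma \max_{a'} \hat{Q}(s_{j+1}, a')$
16  
17  
18  Every $C$ steps, reset $\hat{Q} = Q$
19  End For
20  If , 
21  End For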
With the implementation shown in Algorithm 1, DQN uses every possible data tuple and breaks the correlation in the observation sequence by sampling from experience replay, which benefits data efficiency and reduces training variance. The exploration rate $\epsilon$ in the state-of-the-art implementation is usually not constant, but decays (linearly in our experiments) from 1.0 to a small constant value (the minimum exploration rate) within a certain number of steps. This means that the agent explores more in the beginning and exploits more at the end. DQN approximates Q-values with neural networks, so it avoids the memory explosion problem caused by observation space discretization in traditional Q-learning. Finally, DQN can capture the underlying pattern(s) even from noisy observations, which will be shown in Section V.

Note that we present Algorithm 1 from a general perspective, with good generalization capabilities, so that it can be adapted for and interact with many different environments. The key steps in Algorithm 1 for interaction with a specific grid control environment are as follows: (1) step 4: initialize the environment; (2) step 8: execute an action $a_t$ in the environment; (3) step 9: observe reward $r_t$ and next state $s_{t+1}$; and (4) step 12: check whether $s_{t+1}$ is a terminal state. More details of this Deep Q-learning algorithm interacting with the grid control environment will be discussed in the following sections.
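The experience replay memory and the linearly decaying exploration rate described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the OpenAI Baselines code; the class and function names, the buffer capacity, and the minimum exploration rate of 0.02 are assumptions.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size memory of transitions, sampled uniformly at random."""

    def __init__(self, capacity):
        # A bounded deque discards the oldest transition once it is full.
        self.memory = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling breaks the temporal correlation between transitions.
        return rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


def linear_epsilon(step, decay_steps, eps_start=1.0, eps_min=0.02):
    """Exploration rate decaying linearly from eps_start to eps_min."""
    fraction = min(step / decay_steps, 1.0)
    return eps_start + fraction * (eps_min - eps_start)


buffer = ReplayBuffer(capacity=3)
for k in range(5):
    buffer.store(state=k, action=0, reward=-1.0, next_state=k + 1, done=False)
batch = buffer.sample(2, rng=random.Random(0))
```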
II-C Grid Emergency Control
For large-scale power systems, the emergency control problem is a highly nonlinear, nonconvex optimal decision-making problem and can be formulated as follows:

$$\min_{u_t} \int_{T_0}^{T_c} C(x_t, y_t, u_t)\, dt \qquad (6)$$

s.t.

$$\dot{x}_t = f(x_t, y_t, d_t, u_t) \qquad (7a)$$
$$0 = g(x_t, y_t, d_t, u_t) \qquad (8a)$$
$$x_t^{\min} \le x_t \le x_t^{\max} \qquad (9a)$$
$$y_t^{\min} \le y_t \le y_t^{\max} \qquad (10a)$$
$$u_t^{\min} \le u_t \le u_t^{\max} \qquad (11a)$$

where $x_t$ represents the dynamic state variables of the power grid, such as the generator rotor angles and speeds; $y_t$ represents the algebraic state variables of the power grid, which are typically the voltages at nodes (or buses) of the grid; $u_t$ represents the emergency control variables of the power grid, such as generator tripping or load shedding; and $d_t$ represents the disturbance (or contingency) that could occur in the grid. $T_0$ and $T_c$ represent the time horizon. $C$ represents the cost function of the power grid emergency control. The dynamic behavior of various components in the power grid, such as generators and their controllers, is represented by (7a). Eqn. (8a) represents the algebraic constraints that describe the network coupling between generators, loads, and transmission branches in the power grid. Eqns. (9a), (10a) and (11a) represent the operation and security constraints on the dynamic state variables, algebraic state variables, and control variables over the time horizon. Notice that the upper and lower bounds in (9a), (10a) and (11a) could be time-variant. The emergency control problem formulated above can be solved by a model predictive control (MPC) method [27].
The same problem can also be formulated as an MDP and solved by RL methods. Note that not all the state variables are observed by the agent(s); thus, the state space in the MDP is a subset of the grid state variables, i.e., $s_t \subseteq \{x_t, y_t\}$. It should be noted that properly defining $s_t$ for specific emergency control problems is critical. Two specific examples, along with general design principles, will be discussed in Section IV. Based on the properties of the control actions $u_t$, an action space $\mathcal{A}$, either continuous or discrete, will be defined. The limits on the controls defined in (11a) are generally considered in the definition of the action space by setting the bounds. The environment transition from $s_t$ to $s_{t+1}$ (i.e., steps 8 and 9 in Algorithm 1) is governed by the differential and algebraic equation set (7a) and (8a). The detailed formulations of (7a) and (8a) and the solution methods can be found in [28]. The reward $r_t$ is a function of $x_t$, $y_t$ and $u_t$ as follows:

$$r_t = R(x_t, y_t, u_t) \qquad (12)$$

where $R$, in principle, should incorporate both the action cost function $C$ in (6) and a penalty for any violation of the constraints defined in (9a), (10a) and (11a). Detailed formulations of $R$ for two specific emergency control schemes will be presented in Section IV.
III An Open Platform for Developing and Benchmarking RL Algorithms for Grid Control
III-A Overview
An open-source platform, Reinforcement Learning for Grid Control (RLGC), has been developed and published for the purpose of developing, training and benchmarking RL algorithms for power system control [26]. Open-source benchmarks (such as ImageNet and OpenAI Gym) are key driving forces that propel the advancement of machine learning (including RL). The goal of RLGC is to create a similar open-source benchmark for reinforcement learning for power grid control.
The architecture of this open platform is shown in Fig. 1. It has two main modules: 1) the RL module; and 2) the power system simulation and control module. The RL module is developed based on OpenAI Gym, which is a widely used generic toolkit for RL research and is programmed in Python [18]. A general power system simulation and control environment for training and testing RL algorithms is created, from which the power system simulation and control module is called. The power system simulation and control module is developed based on InterPSS [29] and programmed in Java. The two modules are decoupled and communicate through Py4J [30], which acts as a communication "bridge" between Python and Java programs. The data exchange through Py4J between the two modules is in-memory, with high efficiency and integration flexibility. Two configuration files are used to specify the power system dynamic simulation settings and the RL training parameters, respectively.
One main advantage of choosing OpenAI Gym for the platform is that users can directly use state-of-the-art open-source learning algorithms such as OpenAI Baselines [31], which is a set of high-quality implementations of DRL algorithms, such as DQN for MDPs with discrete control actions and PPO [17] for MDPs with continuous control actions. We use the DQN implementation in OpenAI Baselines for solving two emergency control problems with discrete control spaces in this paper.
With a modular, decoupled architecture design, as well as open-source tools adopted for its development, the RLGC platform is:
1) Extensive: the framework can capture many diverse aspects of RL and power systems, such as abundant choices of different RL training algorithms, rich power system dynamics and measurements, and typical emergency control actions. It can also simulate various power systems, including integrated transmission and distribution systems [32].
2) Flexible: With this platform, users only need to specify a minimum of two configuration files to build a customized environment for training and testing RL algorithms for power system control. Users can define various observations, actions and rewards either through a configuration file or by programming new functions for them.
III-B Implementation Details and Usage
In the RL module, a Python class named PowerDynSimEnv is developed by extending OpenAI Gym's standard base environment class Env. In the power system simulation and control module, a wrapper of InterPSS simulation functions and capabilities is developed for interfacing with the PowerDynSimEnv environment in the RL module. It comprises several key functions representing the interactions between the learning agent and the environment in Algorithm 1 (AL1). The key functions include initStudyCase(*) for initializing the environment of AL1 in step 4, applyAction(*) and nextStepDynSim(*) for executing action $a_t$ in step 8, getReward(*) and getEnvObversations(*) for observing reward $r_t$ and the next state $s_{t+1}$, and isSimulationDone(*) for checking if $s_{t+1}$ is a terminal state of AL1. The usage of these functions for RL training will be detailed in the following paragraphs.
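The mapping between the six wrapper functions and a Gym-style reset/step interface can be sketched as follows. This is a simplified illustration and not the RLGC source: `StubSimulator` is a hypothetical stand-in for the Java-side wrapper, the argument lists of the wrapper functions are assumptions (the text elides them as `(*)`), and `GridEnvSketch` does not inherit from the actual Gym `Env` class so that the example stays self-contained.

```python
class StubSimulator:
    """Hypothetical stand-in for the Java simulation wrapper (not InterPSS)."""

    def __init__(self, horizon_steps=3):
        self.horizon_steps = horizon_steps
        self.step_count = 0
        self.last_action = 0

    def initStudyCase(self):
        self.step_count = 0
        self.last_action = 0

    def applyAction(self, action):
        self.last_action = action

    def nextStepDynSim(self):
        self.step_count += 1

    def getEnvObversations(self):
        return [float(self.step_count), float(self.last_action)]

    def getReward(self):
        return -1.0 * self.last_action   # toy cost for applying a control action

    def isSimulationDone(self):
        return self.step_count >= self.horizon_steps


class GridEnvSketch:
    """Gym-style reset/step interface built on the six wrapper calls."""

    def __init__(self, simulator):
        self.sim = simulator

    def reset(self):
        self.sim.initStudyCase()                 # Algorithm 1, step 4
        return self.sim.getEnvObversations()

    def step(self, action):
        self.sim.applyAction(action)             # Algorithm 1, step 8
        self.sim.nextStepDynSim()
        obs = self.sim.getEnvObversations()      # Algorithm 1, step 9
        reward = self.sim.getReward()
        done = self.sim.isSimulationDone()       # Algorithm 1, step 12
        return obs, reward, done, {}


env = GridEnvSketch(StubSimulator(horizon_steps=2))
first_obs = env.reset()
obs1, reward1, done1, _ = env.step(1)
obs2, reward2, done2, _ = env.step(0)
```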
A typical procedure for using the developed platform to test DRL algorithms and train NN models for grid control includes two stages: (1) the training stage for learning, and (2) the testing stage for validating the trained NN. During the training stage, the DRL agent performs neural network learning through a large number of training steps. It learns an optimal policy with exploration and exploitation, and automatically saves the best-performing NN parameters. Once the training stage is completed, the RL agent at the testing stage uses the learned optimal policy (represented by the best-performing NN parameters) to provide optimal control actions to the environment, based on the observed environment states.
Fig. 2 gives the details of the procedure for using the platform for training and testing the DRL model for grid control. Once the study cases and configuration files described in Section III.A are prepared, the training procedure initializes the power system simulation module (initStudyCase(*)), the NN model, and the RL module, and then launches the training. At each training step, the agent in the RL module receives the states (getEnvObversations(*)) and rewards (getReward(*)) from the environment, which calls the power system simulation module to obtain this information, trains the NN model (see Algorithm 1 for details of the training algorithm), and sends the selected control action back to the simulation environment. Upon receiving the control action from the RL module, the power system simulation module applies this control action in the dynamic simulation (applyAction(*)), runs to the next agent-environment interaction step (nextStepDynSim(*)), and sends the updated states and rewards to the RL module. It should also be noted that power system dynamic simulation has its own time step (ranging from 1 ms to half a cycle) to ensure numerical stability, which is usually smaller than the time step at which the DRL module (agent) interacts with the power system simulation module (environment); thus, there is an internal power system simulation loop within the nextStepDynSim(*) function. These interactions between the two modules continue until the simulation reaches the end of one dynamic simulation, at which point one training episode finishes. At the end of each training episode, the training procedure reinitializes the dynamic simulation (reset(*)) and starts the next training episode. The training procedure ends after a predefined number of training steps. Once the training is finished, the trained NN model can be tested on cases different from the training cases to validate the effectiveness of the training.
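The internal simulation loop can be illustrated with a short sketch. It is not the platform's code: the function names and the `advance_one_step` callback are hypothetical, and the default step sizes (0.1 s for the agent, 0.002 s for the simulation) are the values reported for the test cases in Section V.

```python
def internal_steps(agent_step_s, sim_step_s):
    """Number of simulation steps inside one agent-environment interaction."""
    ratio = agent_step_s / sim_step_s
    n = round(ratio)
    if n < 1 or abs(ratio - n) > 1e-9:
        raise ValueError("agent step must be a positive multiple of the simulation step")
    return n


def next_step_dyn_sim(advance_one_step, agent_step_s=0.1, sim_step_s=0.002):
    """Advance the simulation by one agent step through repeated small steps."""
    n = internal_steps(agent_step_s, sim_step_s)
    for _ in range(n):
        advance_one_step()   # one numerical integration step of the simulator
    return n


calls = []
steps_taken = next_step_dyn_sim(lambda: calls.append(1))
```

With these values the agent acts once every 50 simulation steps, and a 30 s episode contains at most 300 agent decisions.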
Based on the testing results, the users may adjust the training parameters and case settings in the configuration files and launch more training tasks. To facilitate the training process, the platform supports task-level parallelism, so that multiple RL training tasks with different hyperparameters can be run in parallel.
IV DRL Algorithms for Grid Emergency Control
With the developed platform discussed in the previous section, we investigated and developed DRL-based control schemes for two typical types of grid emergency control: 1) generator dynamic brake [8]; and 2) undervoltage load shedding. In the following subsections, the DRL algorithm design and implementation details for both emergency control schemes, including neural networks, observations, actions, and rewards, will be discussed.
IV-A Neural Network Architecture
The proposed architecture of the NN for both emergency control schemes is shown in Fig. 3. The number of units in the input and output layers are and . There are two hidden layers in between, with and hidden units, respectively, each of which is followed by a rectified linear unit (ReLU). It should be noted that there seems to be a misconception that NNs in DRL methods have to be "deep" to work well. In fact, the groundbreaking DRL application in [14] and most of the DRL algorithms in the OpenAI Gym continuous control benchmarks [24] use NNs with 2-3 hidden layers. In the reinforcement learning domain, the term "deep" often refers to a set of recent approaches that make it possible to train an NN model using reinforcement learning, such as the target network, replay buffer, dueling network, etc. The fact that the same or a very similar NN architecture can be used for significantly different control problems is one main advantage of DRL over traditional RL methods like Q-learning.
IV-B Generator Dynamic Brake
Generator dynamic brakes are utilized to achieve two main objectives: 1) to avoid the loss of synchronism between the generators when a severe incident occurs; and 2) to damp large electromechanical oscillations [8]. Due to energy losses and operation limits, the time the dynamic brake is switched on is limited; thus, it should be used only under emergency conditions. To achieve these objectives under the operational constraints, the following reward function [8] is used:
(13) 
where and are the average generator speed and angle defined in [33], denotes the control action ( when the brake is switched off and when it is switched on) and is a penalty factor for penalizing the brake action. When the system has lost synchronism (when rad in this paper), a very negative reward (-1000) is given to direct the agent to perform appropriate actions to avoid such conditions.
Unlike the previous RL research effort [8], which used pseudo-states (i.e., generator equivalent angle and speed) as observations and thus essentially relied on a hand-crafted feature extraction process, the proposed scheme directly uses the rotor angles and speeds of monitored generators as the observation for the agent (input to the NN). Note that it is impossible for the agent to learn the system's dynamic behaviors and trends solely based on current observed states . Similar to stacking the most recent frames as input in [14], a sequence of observations (the number is ) is treated as a distinct state in this paper, i.e., . In the developed platform, the number of measurements (i.e., ) and are configurable and defined by the users, thus .
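The stacking of recent observations into a single NN input can be sketched as below. This is an illustrative sketch under stated assumptions: the class name `ObservationStacker` is hypothetical, and padding the history with copies of the first observation at the start of an episode is an assumption, since the initialization of the history is not specified here. The sizes in the usage lines (8 measurements, 4 stacked observations, 32 NN inputs) are those of the dynamic brake test case in Section V.

```python
from collections import deque


class ObservationStacker:
    """Keep the most recent observations and expose them as one flat state."""

    def __init__(self, n_measurements, n_history):
        self.n_measurements = n_measurements
        self.frames = deque(maxlen=n_history)   # oldest observation drops out first

    def _flatten(self):
        return [value for frame in self.frames for value in frame]

    def reset(self, first_obs):
        # Assumption: fill the history with copies of the first observation.
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(list(first_obs))
        return self._flatten()

    def push(self, obs):
        if len(obs) != self.n_measurements:
            raise ValueError("unexpected number of measurements")
        self.frames.append(list(obs))
        return self._flatten()


stacker = ObservationStacker(n_measurements=8, n_history=4)
state0 = stacker.reset([0.0] * 8)
state1 = stacker.push([1.0] * 8)
```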
IV-C Undervoltage Load Shedding
Fault-induced delayed voltage recovery (FIDVR) is defined as the phenomenon whereby system voltage remains at significantly reduced levels for several seconds after a fault has been cleared [34]. The root cause is stalling of residential air-conditioner (A/C) motors and prolonged tripping. FIDVR events have occurred in many utilities in the US. Concerns over FIDVR issues have increased since residential A/C penetration is at an all-time high and growing rapidly. A transient voltage recovery criterion (TVRC) is defined to evaluate the system voltage recovery. Without loss of generality, we refer to the standard proposed in [35] and shown in Fig. 4. After fault clearance, the standard requires that voltages return to at least 0.8, 0.9 and 0.95 p.u. within 0.33 s, 0.5 s and 1.5 s, respectively. Per current industry practice, UVLS relays are usually employed to shed load demands at substations in a step-wise manner if the monitored bus voltages fall below the predefined voltage thresholds, to protect power systems against FIDVR. The UVLS relay has a fast response; however, this distributed control scheme has no communication or coordination with other substations, and thus could lead to unnecessary load shedding [36] at affected substations. MPC methods [27], [37] have been proposed for UVLS protection. The MPC methods utilize a system model (usually in the form of differential algebraic equations) to predict the states of the power grid. They formulate and solve an optimization problem to decide load shedding control actions. MPC is a centralized method and considers the coordination of load shedding between different substations. However, the optimization process in MPC methods is usually computationally intensive, and the performance of MPC methods heavily depends on the accuracy of the system model [27].
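The transient voltage recovery thresholds quoted above (0.8, 0.9 and 0.95 p.u. within 0.33 s, 0.5 s and 1.5 s after fault clearance) can be encoded as a simple step function, sketched below. The function names are hypothetical, and treating the interval before 0.33 s as having no requirement is an assumption made for this example; the exact envelope is the one in Fig. 4.

```python
# (time after fault clearance in seconds, minimum voltage in p.u.), latest first
TVRC_STEPS = [(1.5, 0.95), (0.5, 0.9), (0.33, 0.8)]


def required_voltage(t_since_clearance):
    """Minimum voltage (p.u.) required at a given time after fault clearance."""
    for deadline, v_min in TVRC_STEPS:
        if t_since_clearance >= deadline:
            return v_min
    return 0.0   # assumption: no requirement before the first deadline


def voltage_shortfall(voltages, t_since_clearance):
    """Total amount by which the bus voltages fall below the required level."""
    v_req = required_voltage(t_since_clearance)
    return sum(max(v_req - v, 0.0) for v in voltages)


shortfall = voltage_shortfall([0.85, 0.97], t_since_clearance=2.0)
```

A shortfall of this kind is the sort of quantity that a voltage-deviation term in a reward function can penalize.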
In this paper, we investigated applying DRL to multiple load-serving substations to implement an adaptive, coordinated emergency load shedding scheme against FIDVR.
The observed states at time include voltage magnitudes at monitored buses (denoted as ), as well as the percentage of load still remaining at controlled buses (denoted as ). To capture the dynamics of the voltage change, the most recent observed states are stacked with some history state records and treated as the input into DQN at time , i.e., . The control action at each controlled load bus is defined as either 0 (no load shedding) or 1 (shed 20% of the initial total load) at each action time step. Thus the control action space is discrete with a dimension of , where is the number of controlled buses. The reward at time is defined as follows:
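The per-bus shedding decisions can be illustrated with a short sketch. It assumes, purely for illustration, that a joint discrete action enumerates all on/off combinations across the controlled buses and is decoded bit by bit; this encoding, the function names, and the tolerance used to detect an already-empty bus are assumptions, not details taken from the platform. The 20% shedding step follows the definition above.

```python
def decode_action(action_index, n_buses):
    """Expand a joint action index into one 0/1 shedding decision per bus."""
    if not 0 <= action_index < 2 ** n_buses:
        raise ValueError("action index out of range")
    return [(action_index >> i) & 1 for i in range(n_buses)]


def apply_shedding(remaining, decisions, shed_fraction=0.2):
    """Update the remaining load fractions and count invalid shedding requests.

    remaining holds the fraction of the initial load still connected per bus.
    A request to shed at a bus with no load left counts as an invalid action.
    """
    updated, invalid = [], 0
    for fraction, shed in zip(remaining, decisions):
        if shed and fraction <= 1e-9:
            invalid += 1
            updated.append(0.0)
        elif shed:
            updated.append(max(fraction - shed_fraction, 0.0))
        else:
            updated.append(fraction)
    return updated, invalid


decisions = decode_action(5, n_buses=3)          # shed at buses 0 and 2
new_remaining, n_invalid = apply_shedding([1.0, 0.4, 0.0], decisions)
```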
(14) 
where is the time instant of fault clearance. The above reward function has three parts: (1) total bus voltage deviation below the standard voltage thresholds shown in Fig. 4, where is the bus voltage magnitude for bus in the power grid; (2) total load shedding amount, where is the load shedding amount in p.u. at time step for load bus ; (3) invalid action penalty if the DRL agent still provides load shedding action when the load at a specific bus has already been shed to zero at the previous time step when the system is within normal operation. and are weight factors for the above three parts. Note that the reward function will be set to a large negative number (-1000) if any bus voltage is below 0.95 p.u. 4 s after the fault is cleared. Please note that tuning the reward function is a challenge for DRL. It requires a combination of heuristics based on prior knowledge and some automated parameter search (trial-and-error selection). Here we provide some basic principles for reward function design: (a) use prior knowledge about the problem to identify a rough range for the parameters ( and ) with regard to the proper reward values. A well-designed reward function should give higher reward values for better system performance. In this paper, we roughly estimate the range of parameters by performing the power grid dynamic simulation with uniformly distributed actions directly applied from the defined action space; (b) once the rough ranges for the parameters are identified, randomly select several points from those ranges, then train the DRL model using the selected combinations of parameters and choose the combination that performs best.
V Test Results
In this section, test cases and results are presented for the two typical grid emergency control schemes discussed in Section IV: 1) generator dynamic brake; and 2) undervoltage load shedding. All the case studies, including training and testing, were performed in a simulation environment (offline mode) based on the RLGC platform.
V-A Generator Dynamic Brake
To illustrate the capabilities of the proposed DRL framework and algorithm, a generator dynamic brake controlled by an RL agent is tested on the two-area, four-machine system, as shown in Fig. 5, where the resistive brake (RB) is located at bus 6 with the size of p.u. mhos on a 100 MVA base (400 MW). The test case is very similar to the first test case in [8].
The observation states are the speeds and rotor angles of four generators; thus, = 8. The last 4 observation states are used as input for DQN; thus, = 4, and the number of nodes in the NN input layer is 32. The number of nodes in the output layer is 2 (representing 0 and 1). Other important hyperparameters are as follows: the coefficient in (6) is 2; total interaction steps in training is 900,000; nodes in hidden layers: ; learning rate ; minimum exploration rate .

The training period is partitioned into different episodes (scenarios). Each episode begins with a flat start of dynamic simulation, and a three-phase, short-circuit fault is applied at bus 3 at 1.0 s with a random fault duration ranging from 0.581 s to 0.585 s; the fault is self-cleared. This random selection of the fault duration ensures that the training agent interacts with both stable and unstable post-fault conditions, as the critical clearing time for three-phase faults at bus 3 is 0.583 s. For each episode, the simulation proceeds until either instability is detected or the simulation time reaches 30 s. The power system dynamic simulation time step is 0.002 s. During each episode, the agent interacts with the simulated power system environment at a time step of 0.1 s. The same time steps are used in the test cases in the rest of the paper. It took 9 hours on a Linux workstation with 32 AMD Opteron 1.44 GHz Processors and 64 Gigabit memory with no parallelism to complete the training process. With well-tuned parameters, our approach robustly learns successful policies. The moving average of the reward during the training is shown in Fig. 6. The dip around the 3600th episode in Fig. 6 corresponds to a large negative reward due to one "bad" exploration during training. However, this does not imply instability of the DQN algorithm. As the training continues, the DQN model learns to avoid the bad control actions experienced in training and converges to a locally optimal solution. Extensive tests show that all the local optima that we achieved are good solutions.
After the DRL model training, we assess robustness of the resulting control policy (law) on a different and much larger set of scenarios, with different combinations of power flow condition, fault location, and fault duration:
1) different power flow conditions are tested, including (a) the original power flow case used for training; (b) each load in the system increased or decreased by 50 MW, 100 MW, and 180 MW; (c) the tie-line (two lines between buses 7 and 10) power flow increased or decreased by 20 MW, 40 MW, 70 MW, and 100 MW. Because the two tie-lines are the only connection between areas 1 and 2, the tie-line power flow can be adjusted by increasing the real power output of the generators in one area while decreasing the real power output of the generators in the other area accordingly;
2) the fault is applied at each of the 10 buses;
3) the fault duration is randomly selected between 0.3 s and 0.7 s.
Without dynamic braking, the maximum fault duration that the two-area power system can withstand without losing stability is 0.583 s. In contrast, when the RB is used with the control law trained by DRL, the system remains stable in the scenarios described above (we test 220 different scenarios). To make the inputs of the DRL-based control more realistic, we also add zero-mean, 1% Gaussian-distributed noise to the observations fed into the trained NN. We also compared the trained DRL-based control with the conventional two-dimensional Q-table-based Q-learning method in [8]. The results show that the DRL-based control outperforms the conventional Q-learning-based control for all testing scenarios with noise added to the observations.
Fig. 7 (a) and (b) show two examples of the RB actions for different faults and power flow conditions, for both DRL-based and conventional Q-learning-based control. Fig. 7 (a) shows the generator 3 speed and the relative rotor angle (with and without RB actions), as well as the RB actions for a fault at bus 4 with a duration of 0.7 s, under the power flow condition in which each load is increased by 100 MW relative to the training power flow case. Fig. 7 (b) shows the generator 3 speed and the relative rotor angle, as well as the RB actions for a fault at bus 9 with a duration of 0.6 s, under the original power flow condition used for training. Fig. 7 (a) and (b) show that the system loses stability if there are no RB actions (red line), while the RB actions provided by both the DRL-based (blue line) and conventional Q-learning-based control (green line) maintain system stability. However, the DRL-based control provides better control actions than the conventional Q-learning-based control, as it operates the RB for fewer time steps and thus obtains higher rewards. Fig. 7 (a) and (b) also show that the DRL-based control provides different RB actions at different times for the two scenarios. The results shown in Fig. 7 demonstrate the effectiveness, robustness, and adaptiveness of the DRL algorithm. It should be noted that we also tested various pre-fault periods; the DRL-based control does not apply any braking action on the system under normal conditions.
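The noise injection used in these tests can be sketched as follows. The text does not specify how the “1%” is defined, so this sketch assumes multiplicative noise whose standard deviation is 1% of each measurement's magnitude; the function name `add_measurement_noise` is hypothetical.

```python
import random


def add_measurement_noise(observation, relative_std=0.01, rng=random):
    """Return a noisy copy of the observation vector.

    Assumption: each entry x becomes x * (1 + eps), where eps is drawn from
    a zero-mean Gaussian with standard deviation `relative_std` (1% here).
    """
    return [x * (1.0 + rng.gauss(0.0, relative_std)) for x in observation]


rng = random.Random(0)                     # fixed seed for repeatable tests
clean = [1.0, 0.98, 1.02, 0.0, 376.99]     # illustrative values only
noisy = add_measurement_noise(clean, rng=rng)
for c, n in zip(clean, noisy):
    print(f"{c:10.4f} -> {n:10.4f}")
```

The noisy vector is what would be fed to the trained NN in place of the clean observation; the policy itself is unchanged.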
V-B Under-Voltage Load Shedding
The developed platform and DRL algorithm were applied to develop a coordinated UVLS scheme against FIDVR, which was tested on a modified IEEE 39-bus system [38], as shown in Fig. 8, where step-down transformers are added to load buses 4, 7, and 18. The original loads are moved to the low-voltage side of the transformers and modelled as a combination of 50% single-phase air-conditioner motors [39] and 50% constant impedance loads.
The OpenAI Baselines implementation of the DQN algorithm is used to learn a closed-loop control policy for applying load shedding at buses 4, 7, and 18 to avoid FIDVR and meet the voltage recovery requirements shown in Fig. 4. The coefficients of the reward function (14) for this study are: , , and . The observations include voltage magnitudes at buses 4, 7, 8, and 18 and at the low-voltage sides of the step-down transformers connected to them, as well as the fractions of loads served by buses 4, 7, and 18; thus, the observation dimension is 11. The 10 most recent observation states are stacked and used as input for the DQN; thus, the number of stacked observations is 10, and the number of nodes in the NN input layer is 110. The control action for each of buses 4, 7, and 18 at each action time step is either 0 (no load shedding) or 1 (shedding 20% of the initial total load at the bus). Thus, the total number of combinations of discrete control actions at each action step is 8, i.e., the number of nodes in the output layer is 8. Other important hyperparameters are as follows: total interaction steps in training is 1,200,000; nodes in hidden layers ; learning rate ; minimum exploration rate .
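The 8-way discrete action space follows from three independent binary decisions (2³ = 8). A minimal sketch of one possible encoding is given below; the bit ordering, the function names, and the clamping of the served fraction at zero are assumptions for illustration and may differ from the actual implementation.

```python
LOAD_SHED_BUSES = (4, 7, 18)   # buses with controllable load shedding
SHED_FRACTION = 0.20           # fraction of initial load shed per action


def decode_action(index, buses=LOAD_SHED_BUSES):
    """Map a DQN output index (0..7) to a per-bus 0/1 shedding decision.

    Assumption: bit i of the index controls the i-th bus in `buses`.
    """
    if not 0 <= index < 2 ** len(buses):
        raise ValueError("action index out of range")
    return {bus: (index >> i) & 1 for i, bus in enumerate(buses)}


def apply_action(served_fraction, index):
    """Return the updated fraction of initial load served at each bus."""
    decisions = decode_action(index)
    return {
        # Assumption: the served fraction cannot drop below zero.
        bus: max(0.0, served_fraction[bus] - SHED_FRACTION * decisions[bus])
        for bus in served_fraction
    }


served = {4: 1.0, 7: 1.0, 18: 1.0}
print(decode_action(5))          # {4: 1, 7: 0, 18: 1}
print(apply_action(served, 5))   # bus 4 and bus 18 each shed 20 %
```

Enumerating the joint actions keeps the output layer small for three buses, but the number of outputs doubles with every additional controllable bus.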
During training, each episode begins with a flat start of dynamic simulation, and at 1.0 s of simulation time, a short-circuit fault is applied at a randomly selected bus (4, 15, or 21) with a randomly selected fault duration of 0.0 s (no fault), 0.05 s, or 0.08 s; the fault is self-cleared. This random selection of the fault location and duration exposes the training agent to system conditions both with and without FIDVR. The training process took 21 hours on the same Linux workstation used in the previous case, without any parallelization. The moving average of the rewards during training is shown in Fig. 9.
After the training, we tested the robustness and adaptiveness of the trained DRL agent on a set of 960 test scenarios whose combinations of power flow conditions, dynamic model parameters, fault locations, and fault durations differ from the training scenarios, as follows: (1) four different load levels (i.e., 80%, 90%, 110%, and 120%); (2) two sets of critical dynamic parameters of the air-conditioner motor model, with one set corresponding to the (assumed) true values and the other set considering a 10% increase in the A/C motor stalling performance parameters and [39]. Note that the air-conditioner motor dynamic model is an aggregated model that represents a large set of physical air conditioners in the real environment, so its parameters could contain many uncertainties; (3) 30 different fault locations (i.e., buses 1 to 30); and (4) four different fault durations (i.e., 0.02, 0.05, 0.08, and 0.1 s).
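The count of 960 test scenarios is the full Cartesian product of the four factors listed above (4 × 2 × 30 × 4). A short sketch that enumerates this grid is given below; the dictionary keys and constant names are hypothetical, and the motor parameter sets are represented only by a scale factor.

```python
from itertools import product

LOAD_LEVELS = (0.80, 0.90, 1.10, 1.20)         # fraction of base load
MOTOR_PARAM_SCALES = (1.00, 1.10)              # assumed-true and +10 % sets
FAULT_BUSES = tuple(range(1, 31))              # buses 1 to 30
FAULT_DURATIONS_S = (0.02, 0.05, 0.08, 0.10)   # seconds

scenarios = [
    {
        "load_level": load,
        "motor_param_scale": scale,
        "fault_bus": bus,
        "fault_duration_s": duration,
    }
    for load, scale, bus, duration in product(
        LOAD_LEVELS, MOTOR_PARAM_SCALES, FAULT_BUSES, FAULT_DURATIONS_S
    )
]

print(len(scenarios))  # 960
```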
We compared the trained DRL-based load shedding control with the UVLS relay load shedding scheme, as well as with an MPC method that uses mixed-integer programming to solve the problem described by (6). All three control methods are compared in terms of execution time and the reward defined in (14). To present the comparison, we calculate the reward differences (i.e., the reward of the DRL method minus that of the comparison method) for all the test scenarios; a positive value means that the DRL method performs better in the corresponding test scenario, and vice versa.
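The comparison metric can be sketched as below. The reward values are made up purely for illustration, the function names are hypothetical, and counting only strictly positive differences as a DRL win is an assumption about how ties are treated.

```python
def reward_differences(drl_rewards, baseline_rewards):
    """Per-scenario reward of DRL minus reward of the comparison method."""
    if len(drl_rewards) != len(baseline_rewards):
        raise ValueError("both methods must cover the same scenarios")
    return [d - b for d, b in zip(drl_rewards, baseline_rewards)]


def win_rate_percent(differences):
    """Percentage of scenarios in which the difference is strictly positive."""
    wins = sum(1 for diff in differences if diff > 0)
    return 100.0 * wins / len(differences)


# Illustrative rewards only (not results from the paper).
drl = [-100.0, -250.0, -80.0, -300.0]
baseline = [-150.0, -240.0, -80.0, -500.0]

diffs = reward_differences(drl, baseline)
print(diffs)                    # [50.0, -10.0, 0.0, 200.0]
print(win_rate_percent(diffs))  # 50.0
```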
Among the 960 test scenarios, 462 could lead to FIDVR problems if no action is applied, and thus require load shedding. Fig. 10 (a) shows the histogram of the reward difference between the DRL-based control and the UVLS relay. The DRL-based control outperforms the UVLS relay in 92.22% of these 462 test scenarios. Among the 462 test scenarios, 229 have the same dynamic parameters as the training scenarios (Test Set A), while 233 have a 10% increase in the dynamic load parameters and (Test Set B). The main objective of Test Set B is to mimic the modeling gaps (or uncertainties) in real-world applications. Note that the DQN-based DRL method is model-free, while MPC-based methods depend heavily on the accuracy of the model; thus, it is important to consider modeling errors in MPC-based applications.
For Test Set A, Fig. 10 (b) depicts the histogram of the reward difference between the DRL and MPC methods, which indicates that the DRL-based control performs slightly better than the MPC (the DRL outperforms the MPC in 57.22% of the test scenarios). For Test Set B, Fig. 10 (c) shows the histogram of the reward differences between the DRL and MPC methods; here the DRL method outperforms the MPC method in 90.56% of the test scenarios. Together, Fig. 10 (b) and (c) show a significant advantage of the developed DRL method over the MPC method: the performance of the MPC method depends heavily on the accuracy of the system model, while DRL is model-free and more robust to modeling errors.
Table I shows the average computation time of the DRL and MPC methods. The computation time for UVLS relays is not included, as their action is either instantaneous or subject to a predefined delay. Table I shows that the DRL method requires a much shorter execution time than the MPC method, because the NN that maps observed states to actions in the DRL approach is much more efficient than the time-consuming, complex optimization solution process in the MPC method. With an average computation time of 0.13 s in an 8-second simulation event, the DRL method can meet real-time operation requirements and allows grid operators to verify the control actions when necessary.
TABLE I: Average computation time of the DRL and MPC methods
Method | Average computation time
DRL | 0.13 seconds
MPC | 23.73 seconds
To further illustrate the advantages of the DRL method, Figs. 11 and 12 compare the performance of the DRL, MPC, and UVLS relay control schemes for a new test scenario with a 120% load level. The fault occurs at bus 3 with a duration of 0.1 s, and there is a 10% increase in the dynamic parameters and . To make the testing of the DRL-based load shedding control more realistic, we also add zero-mean, 1% Gaussian-distributed noise to the observations. The total rewards of the DRL, MPC, and UVLS relay control in this test case are −1271.61, −1548.14, and −3778.80, respectively. Fig. 11 shows the voltage profiles at buses 4, 7, and 18 for the different load shedding controls; Fig. 12 shows the load shedding amount at buses 4, 7, and 18 for the DRL, MPC, and UVLS relay control schemes. Note that the added 1% noise does not affect the decision making or the performance of the DRL-based control.
The large reward difference (2507.19) between the DRL and the UVLS relay comes from two sources: 1) the DRL sheds significantly less load than the UVLS relay. Fig. 12 shows that, compared with the UVLS relay, the DRL sheds 60% (120 MW) less load at bus 4 (the DRL method does not shed any load at bus 4) and 20% (14.64 MW) less load at bus 18; 2) the DRL method leads to a much better voltage recovery profile than the UVLS relay method, as shown in Fig. 11. With the DRL-based control, the voltages at all three load buses with A/C motors recover quickly above the voltage recovery envelope required by the operation standard. In contrast, the UVLS relay method cannot recover the voltages at the three buses even 3 s after the fault is cleared, which causes the UVLS relays to shed more load at these three buses.
The reward difference (276.53) between the DRL and MPC methods is mainly due to the fact that the DRL method sheds less load than the MPC while meeting the operation standard requirements. Fig. 12 shows that the DRL method sheds 20% (26 MW) less load at bus 7 and 20% (14.64 MW) less load at bus 18. The MPC method sheds more load because it suffers from inaccurate critical model parameters (10% difference from the true values). Note that although Fig. 11 shows that the voltage recovery profiles of the MPC method are slightly higher than those of the DRL method (at the cost of more load being shed), this does not increase the reward, because a voltage profile above the voltage recovery standard is not rewarded according to (14). We believe this is reasonable, as the ultimate goal of UVLS controls is to recover the voltage above the envelope required by the standard with minimum load shedding.
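The two reward differences quoted for this test case can be checked with a few lines of arithmetic. The sketch assumes that the three total rewards are negative (accumulated penalties), which is the only sign convention under which the stated differences of 2507.19 and 276.53 favor the DRL method.

```python
# Total rewards for the 120 % load-level test case (assumed negative).
rewards = {"DRL": -1271.61, "MPC": -1548.14, "UVLS": -3778.80}

# Reward difference = reward of DRL minus reward of the comparison method.
diff_vs_uvls = rewards["DRL"] - rewards["UVLS"]
diff_vs_mpc = rewards["DRL"] - rewards["MPC"]

print(f"DRL - UVLS relay: {diff_vs_uvls:.2f}")  # DRL - UVLS relay: 2507.19
print(f"DRL - MPC:        {diff_vs_mpc:.2f}")   # DRL - MPC:        276.53
```

Both differences are positive, consistent with the DRL method obtaining the highest reward in this scenario.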
In summary, compared with the UVLS relay and MPC control methods, the DRL method shows significant improvements in terms of robustness and adaptiveness. In addition, the well-trained DRL model can provide control actions very quickly (0.13 s on average) under emergency conditions, so it can be applied to real-time emergency control.
VI Discussions
There are several important considerations for DRL applications in general, and particularly with regard to its use in power system emergency control.
1) Applicability to general emergency control problems: We discussed how general power grid emergency control problems could be formulated as MDP problems and solved by DRL in Section II.C. Still, we believe that successful application of DRL to general emergency control problems depends heavily on properly formulating the problems as MDPs, including well-defined states, actions, and rewards. Given that automating the formulation process is still at an early research stage, synergy between power domain knowledge and DRL, together with close collaboration between experts from both domains, is highly recommended.
2) Parameter selection: In this paper, we manually tuned the parameters in the proposed algorithms, such as penalty factors and weighting factors in the reward functions. Determining these parameters is a known challenge in applying DRL and is also an active research topic in the RL community. Inspired by recent work [40], we plan to automate this part in future efforts.
3) Reality gaps: For controlling mission-critical infrastructures like power grids, training of the DRL agent(s) is, in general, performed in a simulation environment. There are always some reality gaps between models and real-world systems. One of the authors has made good progress in addressing this reality gap issue in the robotics domain [41]. We plan to adapt the developed technologies to solve power grid control problems in the future.
4) Safety guarantee (or safe exploration): In this paper, operation and/or safety constraints are considered by adding appropriate violation penalties in the reward functions. Recently, constrained policy optimization [42] and safe exploration [43] methods were proposed to realize constrained reinforcement learning.
VII Conclusions and Future Work
Emergency control is imperative to guarantee the secure and reliable operation of power systems, particularly under large disturbances or severe contingency conditions. This paper investigates the development of adaptive emergency control schemes using DRL. To support the development and benchmarking of DRL algorithms for grid control, an open-source platform named RLGC is developed for the first time. By open-sourcing it, we hope to provide a good starting point and an open benchmark that accelerates future research in this field. The platform is employed to develop two typical emergency control schemes: dynamic generator braking and UVLS. The test results demonstrate the adaptiveness and robustness (to new scenarios, model parameter uncertainty, and noise in observations) of the two developed DRL-based emergency control schemes, as well as their advantages over schemes based on conventional Q-learning, MPC, and existing protection mechanisms.
Future research work includes: 1) functionality extension of the RLGC platform, for example, support of other power system simulators; 2) applying DRL to other emergency controls on larger-scale power systems and with continuous action spaces; 3) applying recent advancements such as safe exploration and deep meta-reinforcement learning to better address control challenges associated with increased uncertainties in power systems.
VIII Acknowledgement
The authors gratefully thank Dr. Guanji Hou for his valuable suggestions and assistance in developing the MPC-based emergency control method in this paper.
References
 [1] Z. Bo, O. Shaojie, Z. Jianhua, S. Hui, W. Geng, and Z. Ming, “An analysis of previous blackouts in the world: Lessons for china’s power industry,” Renewable and Sustainable Energy Reviews, vol. 42, pp. 1151–1163, feb 2015.
 [2] Y. Makarov, V. Reshetov, A. Stroev, and I. Voropai, “Blackout prevention in the united states, europe, and russia,” Proceedings of the IEEE, vol. 93, no. 11, pp. 1942–1955, nov 2005.
 [3] N. Ferc, “Arizona-southern california outages on 8 september 2011: causes and recommendations,” FERC and NERC, 2012.
 [4] P. Kundur, G. Morison, and L. Wang, “Techniques for online transient stability assessment and control,” in 2000 IEEE Power Engineering Society Winter Meeting. Conference Proceedings. IEEE.
 [5] S. Misra, L. Roald, M. Vuffray, and M. Chertkov, “Fast and robust determination of power system emergency control actions,” arXiv preprint arXiv:1707.07105, 2017.
 [6] Z. Li, G. Yao, G. Geng, and Q. Jiang, “An efficient optimal control method for open-loop transient stability emergency control,” IEEE Transactions on Power Systems, vol. 32, no. 4, pp. 2704–2713, jul 2017.
 [7] I. Genc, R. Diao, V. Vittal, S. Kolluri, and S. Mandal, “Decision tree-based preventive and corrective control applications for dynamic security enhancement in power systems,” IEEE Transactions on Power Systems, vol. 25, no. 3, pp. 1611–1619, 2010.
 [8] D. Ernst, M. Glavic, and L. Wehenkel, “Power systems stability control: Reinforcement learning framework,” IEEE Transactions on Power Systems, vol. 19, no. 1, pp. 427–435, feb 2004.
 [9] M. Glavic, R. Fonteneau, and D. Ernst, “Reinforcement learning for electric power system decision and control: Past considerations and perspectives,” IFAC-PapersOnLine, vol. 50, pp. 6918–6927, 2017.
 [10] R. S. Sutton and A. G. Barto, Introduction to reinforcement learning. MIT press Cambridge, 1998, vol. 135.
 [11] J. R. Vázquez-Canteli and Z. Nagy, “Reinforcement learning for demand response: A review of algorithms and modeling techniques,” Applied energy, vol. 235, pp. 1072–1089, 2019.
 [12] C. Druet, D. Ernst, and L. Wehenkel, “Application of reinforcement learning to electrical power system closed-loop emergency control,” in Principles of Data Mining and Knowledge Discovery, D. A. Zighed, J. Komorowski, and J. Żytkow, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2000, pp. 86–95.
 [13] J. Jung, C. Liu, S. L. Tanimoto, and V. Vittal, “Adaptation in load shedding under vulnerable operating conditions,” IEEE Transactions on Power Systems, vol. 17, no. 4, pp. 1199–1205, Nov 2002.
 [14] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
 [15] D. Silver, J. Schrittwieser et al., “Mastering the game of go without human knowledge,” Nature, vol. 550, no. 7676, p. 354, 2017.
 [16] S. Amarjyoti, “Deep reinforcement learning for robotic manipulation-the state of the art,” arXiv preprint arXiv:1701.08878, 2017.
 [17] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
 [18] OpenAI, “OpenAI Five,” https://blog.openai.com/openaifive/, accessed: 2018-10-30.
 [19] V. François-Lavet, D. Taralla, D. Ernst, and R. Fonteneau, “Deep reinforcement learning solutions for energy microgrids management,” in European Workshop on Reinforcement Learning (EWRL 2016), 2016.
 [20] H. Xu, H. Sun, D. Nikovski, S. Kitamura, K. Mori, and H. Hashimoto, “Deep reinforcement learning for joint bidding and pricing of load serving entity,” IEEE Transactions on Smart Grid, pp. 1–1, 2019.
 [21] J. Zhang, C. Lu, J. Si, J. Song, and Y. Su, “Deep reinforcement learning for short-term voltage control by dynamic load shedding in china southern power grid,” in 2018 International Joint Conference on Neural Networks, IJCNN 2018 - Proceedings, vol. 2018-July. Institute of Electrical and Electronics Engineers Inc., 10 2018.
 [22] W. Liu, D. Zhang, X. Wang, J. Hou, and L. Liu, “A decision making strategy for generating unit tripping under emergency circumstances based on deep reinforcement learning,” Proc CSEE, vol. 38, no. 1, pp. 109–119, 2018.

 [23] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger, “Deep reinforcement learning that matters,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 [24] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” arXiv preprint arXiv:1606.01540, 2016.
 [25] Y. Tian, Q. Gong, W. Shang, Y. Wu, and C. L. Zitnick, “Elf: An extensive, lightweight and flexible research platform for real-time strategy games,” in Advances in Neural Information Processing Systems, 2017, pp. 2659–2669.
 [26] W. Hao, Q. Huang, and R. Huang, “An open-source platform for applying reinforcement learning for grid control,” https://github.com/RLGCProject/RLGC, accessed: 2018-12-10.
 [27] L. Jin, R. Kumar, and N. Elia, “Model predictive control-based real-time power system protection schemes,” IEEE Transactions on Power Systems, vol. 25, no. 2, pp. 988–998, May 2010.
 [28] P. Kundur, N. J. Balu, and M. G. Lauby, Power system stability and control. McGraw-Hill New York, 1994, vol. 7.
 [29] M. Zhou and Q. Huang, “Interpss: A new generation power system simulation engine,” arXiv preprint arXiv:1711.10875, 2017.
 [30] “Py4J,” https://www.py4j.org/index.html, accessed: 2018-10-30.
 [31] “OpenAI Baselines,” https://github.com/openai/baselines, accessed: 2018-10-30.
 [32] Q. Huang and V. Vittal, “Integrated transmission and distribution system power flow and dynamic simulation using mixed three-sequence/three-phase modeling,” IEEE Transactions on Power Systems, vol. 32, no. 5, pp. 3704–3714, 2017.
 [33] M. Pavella, D. Ernst, and D. Ruiz-Vega, Transient stability of power systems: a unified approach to assessment and control. Springer Science & Business Media, 2012.
 [34] NERC, “A technical reference paper fault-induced delayed voltage recovery,” 2009.
 [35] PJM Transmission Planning Department, “Exelon transmission planning criteria,” 2009.
 [36] H. Bai and V. Ajjarapu, “A novel online load shedding strategy for mitigating fault-induced delayed voltage recovery,” IEEE Transactions on Power Systems, vol. 26, no. 1, pp. 294–304, 2011.
 [37] T. Amraee, A. Ranjbar, and R. Feuillet, “Adaptive undervoltage load shedding scheme using model predictive control,” Electric Power Systems Research, vol. 81, no. 7, pp. 1507–1513, 2011.
 [38] G. Pyo, J. Park, and S. Moon, “A new method for dynamic reduction of power system using pam algorithm,” in IEEE PES General Meeting, 2010.
 [39] D. Kosterev, A. Meklin, J. Undrill, B. Lesieutre, W. Price, D. Chassin, R. Bravo, and S. Yang, “Load modeling in power system studies: Wecc progress update,” in IEEE PES General Meeting, 2008.
 [40] H. L. Chiang, A. Faust, M. Fiser, and A. Francis, “Learning navigation behaviors end to end with autorl,” CoRR, vol. abs/1809.10124, 2018. [Online]. Available: http://arxiv.org/abs/1809.10124
 [41] J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke, “Sim-to-real: Learning agile locomotion for quadruped robots,” arXiv preprint arXiv:1804.10332, 2018.
 [42] J. Achiam, D. Held, A. Tamar, and P. Abbeel, “Constrained policy optimization,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017, pp. 22–31.
 [43] G. Dalal, K. Dvijotham, M. Vecerik, T. Hester, C. Paduraru, and Y. Tassa, “Safe exploration in continuous action spaces,” arXiv preprint arXiv:1801.08757, 2018.