I Introduction
The resiliency and security of the bulk power system are of vital importance given the significant increase in the penetration of inverter-interfaced resources and dynamic loads. Fault-induced delayed voltage recovery (FIDVR) [18] causes bus voltage magnitudes to stay at significantly reduced levels for several seconds after fault clearing, due to the stalling of induction motor loads and prolonged tripping [19], which could lead to wide-area outages [1]. One of the industry-standard practices to counteract voltage instabilities is load shedding [22]. Standard approaches, such as empirical rule-based schemes that use prespecified thresholds [9] without coordination, usually result in unnecessary load shedding to safeguard the system. On the other hand, security-constrained power flow [15] or model predictive control (MPC) [3] based approaches require accurate knowledge of the system model and computationally expensive real-time optimization.
Recently, to reduce the reliance on an accurate model, data-driven approaches with adaptive characteristics have been investigated, for example, a decision-tree-based algorithm [2] and a hierarchical extreme-learning-machine-based algorithm [11]. Reinforcement learning (RL) based data-driven approaches can directly optimize load shedding actions using output and reward feedback from the grid [6, 24, 7]. RL marries the utilities of adaptive [8] and optimal control [10]: the RL agent interacts with the system and optimizes its actions to accumulate the maximum possible reward over episodes. There is a variety of Markov decision process (MDP) based RL algorithms that use value-based approaches, policy-gradient-based approaches, or a combination of the two, in works such as [16, 12, 21, 20, 4], to name a few. In our previous works, we designed a deep Q-learning based approach [6] and, subsequently, an accelerated deep RL (DRL) approach [7] for voltage control. Along these lines of research, this paper proposes a scalable DRL algorithm that exploits the structure of the power grid and can be efficiently applied to large-scale power system models.
It is worth noting that most of the existing RL-based emergency grid control approaches are trained and deployed in a centralized manner, where the controller observes the measurements at multiple buses and sends control signals to several load buses across the grid to perform emergency load shedding. This centralized paradigm requires long communication links from the control center to several remote locations, and a failure in, or a cyberattack on, the control center or a link can make the control ineffective. In addition, training a centralized RL agent on large-scale power grids would require very long, if not unacceptable, training times because of the excessive number of neural network parameters. Conversely, if the policy network is trained with an insufficient number of parameters, the control performance may degrade. An alternative solution is to train and implement the RL control in a completely local manner, in which each local RL controller in a prespecified geographical area is designed to recover the local bus voltages if a fault happens in that area. However, similar to rule-based under-voltage load shedding, the lack of coordination among the local RL controllers leads to unnecessary load shedding, and can also lead to insufficient recovery in other areas.
To alleviate these concerns with the centralized and local RL frameworks, in this paper we propose a hierarchical RL framework for the training and deployment of intelligent emergency load shedding. Our design exploits the structural properties of the power grid to incorporate modularity in both training and implementation, and can provide control performance on par with or better than centralized policies. Architecturally, we employ a two-level RL policy in which the lower-level policies are trained independently based on the clustering or area-division structure of the grid, avoiding the non-stationarity issue [25] in multi-agent RL training, and the higher-level agent coordinates them to achieve sufficient control performance. To save training time, we train the lower-level RL agents in parallel using multiple cores, and their policy updates are used to train the higher-level RL agent concurrently. In addition, because we use multiple fault rollouts for the lower-level policies across the grid, the training of the higher-level RL agent takes only a limited number of the faults used in each lower-level RL training. This training framework significantly reduces the simulation time in higher-level RL training, while still ensuring coordination among the lower-level RL agents and offering robust performance under multiple contingencies. Our approach differs from classical hierarchical RL techniques [23, 17] in that we utilize the structural clustering of the grid, rather than temporal abstractions, to enforce hierarchy. On the algorithmic side, we propose distributed hierarchical learning that employs an enhanced version of the augmented random search methodology [14], namely parallel augmented random search (PARS) [7], which explores the parameter space of policies and has been shown to be more computationally efficient than other model-free methods used for multi-agent training, such as [25, 13].
Contributions: The main contributions of the paper are:

We present a novel structure-driven, hierarchical, multi-agent DRL algorithm for emergency voltage control design that can be scaled to larger power system models with faster learning and increased modularity. We exploit the inherent area divisions of the grid, and propose a structure-exploiting DRL design that incorporates a few traits of hierarchical and multi-agent learning in a two-level architecture.

We employ two concurrent training mechanisms. On one computing branch, we train the area-wise decentralized policies based on the fault scenarios in the corresponding areas. On the other branch, we simultaneously train a higher-level RL agent to intelligently supervise and coordinate the lower-level agents, thereby saving considerable training time compared to centralized training.

We demonstrate the performance of our approach on the IEEE 39-bus power system model with three areas. Our approach can be readily extended to larger industrial models, paving the way for real-world implementation of artificial intelligence (AI) driven decision making for wide-area voltage control. We show that our proposed hierarchical RL can save 60% of the training time in comparison to the centralized full-scale training, while achieving comparable, if not better, control policies.
Organization: The rest of the paper is organized as follows. The formulation of solving dynamic voltage stability control problem via RL methods is described in Section II. Section III describes our proposed structuredriven hierarchical RL design. Test results are shown in Section IV, and concluding remarks are given in Section V.
II Dynamic Voltage Stability Control Problem
We start by describing the power system dynamics in a differential-algebraic form as:
(1)  
(2) 
where the grid dynamic states are denoted as , the algebraic variables are denoted as , and denotes the controls. The nonlinear function captures the dynamics of the grid states, and (2) with characterizes the nonlinear power flow. In the optimal control setting, the objective would be to minimize cost functionals such as . The RL controllers can optimize from experience by repeatedly interacting with the environment such as complex grid simulators without knowing the dynamic models.
The RL problem is conventionally formulated in the Markov decision process (MDP) framework, which requires a fully or partially observable state space , action space , the transition dynamics to find based on the current state , and action , and a scalar reward associated with the action and the state transition. For the dynamic voltage stability control problem we define the complete MDP as follows.

Observation space: Accessing all the dynamic states of the power system is a difficult task, and the operators can only measure a limited number of algebraic states and outputs. For the voltage control problem, the bus voltage magnitudes and the remaining percentage of the loads at the controlled buses are easily measurable, and are therefore included in the observation space denoted as , where denotes the space pertaining to the algebraic variables. The observation variables are continuous in nature.

Action space: We consider controllable loads as actuators, where shedding locations are generally set by the utilities by solving a rule-based optimization problem for secure grid operation. We consider that the operator can shed up to of the total load at a particular bus at any given time instant. The action space is continuous with range where denotes the load shedding, with the actions denoted as .

Policy class: Policies are the mappings from to the actions denoted as . The learning design will compute this policy optimally such that it can achieve the desired objective. In our learning design, we consider the long short-term memory (LSTM) network [5] due to its capability of automatically learning to capture the temporal dependence over multiple time steps.

Transition dynamics: The dynamics of the bus voltage magnitudes and the remaining percentage of the loads are governed by the differential-algebraic equations (DAEs) in (1)-(2). Following a disturbance input , the transition is captured by the flow map such as,
(3)
Please note that our RL algorithm does not require knowledge of these transition dynamics and only uses the measurements of the trajectories of , and to design the control. The framework is flexible, and also accounts for stochastic transitions, which are important for power systems with increased uncertainties.

Rewards: The scalar rewards are designed to achieve the voltage control objective, i.e., to keep the voltages of all the buses within the safe recovery profile given in Fig. 1. We describe the reward definitions for the different layers of RL agents in detail when presenting the hierarchical RL control design.
III Hierarchical Reinforcement Learning
Performing full-scale centralized RL control for large-scale practical grid models runs into the curse of dimensionality. This leads to the idea of hierarchy, which exploits the interconnected nature of the grid and the relatively localized effect of the voltage stability problem. We exploit the area divisions, where different control areas are often equipped with localized controls, sometimes managed by different utilities, and interconnected via tie-lines. Our approach is composed of two concurrent stages. We train lower-level DRL policies for individual areas, and in parallel train a higher-level coordinator policy using the policy updates from the lower-level training. The higher-level agent activates the lower-level policies in a coordinated manner to achieve the system-level voltage control objective. The coupled lower and higher levels of training bring scalability to the design by distributing the computing burden across the individual decentralized agents and their coordinator. Fig. 2 shows a conceptual overview of our approach.
III-A Learning Area-wise Decentralized Policies:
We consider the grid to be divided into non-overlapping zones with their corresponding bus indices enumerated in the set . The bus voltage magnitudes in these areas are denoted by and the remaining percentage of loads at their controlled buses by . We denote the set of buses with controllable loads for the areas by . The lower-level policy for area is trained with multiple faults in that area. The neighbor buses for each individual area, denoted as , are those buses where the impact of faults from area is considerable. Therefore, for the area , the observations consist of , for , and the load shedding control actions are denoted by where , for . Please note that although we take neighboring feedback, we use the term decentralized to signify that the lower-level policies are independently trained only for their corresponding areas.
III-A1 Finding local neighborhood of the area
We start by exploring area by running contingencies at the control actuation locations to emulate an impulse-like input that can excite all the inherent oscillation modes. We use the safe voltage recovery profile shown in Fig. 1 to set the criterion for neighbor buses following contingencies. We first check whether any of the bus voltage magnitudes of the area violates the voltage safety profile for a contingency in area . An area is not considered a neighbor if none of its buses violate the profile (or if only a very small number of violations occur). Otherwise, we select representative buses that violate the safety profile as , where is the fault clearing time. When multiple buses in area violate the safety profile, representative worst-affected buses are selected, where we use the criterion that the bus voltage magnitude nadirs are lower than p.u. A minimal set of feedback signals also decreases the number of weights to be trained in the policy network.
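The neighbor-selection procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the piecewise recovery profile values, the function names `recovery_profile` and `select_neighbor_buses`, and the `max_buses` cap are all hypothetical, and the actual profile is the one given in Fig. 1.

```python
import numpy as np


def recovery_profile(t_after_clear):
    """Hypothetical lower voltage bound (p.u.) versus time since fault clearing.

    The break points and levels are illustrative placeholders, not the
    values of the profile in Fig. 1.
    """
    t = np.asarray(t_after_clear, dtype=float)
    return np.where(t < 0.33, 0.0,
                    np.where(t < 0.5, 0.8,
                             np.where(t < 1.5, 0.9, 0.95)))


def select_neighbor_buses(times, voltages, t_clear, nadir_threshold, max_buses):
    """Return the worst-affected buses of another area for one contingency.

    times: 1-D array of simulation times (s)
    voltages: dict mapping bus id -> 1-D array of voltage magnitudes (p.u.)
    A bus qualifies if it violates the profile after fault clearing and its
    nadir is below nadir_threshold; qualifying buses are ranked by nadir.
    """
    times = np.asarray(times, dtype=float)
    post = times >= t_clear
    limit = recovery_profile(times[post] - t_clear)
    candidates = []
    for bus, v in voltages.items():
        v_post = np.asarray(v, dtype=float)[post]
        nadir = float(v_post.min())
        if np.any(v_post < limit) and nadir < nadir_threshold:
            candidates.append((nadir, bus))
    candidates.sort()  # lowest nadir (worst affected) first
    return [bus for _, bus in candidates[:max_buses]]
```

An empty return list corresponds to the case where the other area is not treated as a neighbor for that contingency.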
III-A2 Decentralized Augmented Random Search
The decentralized policies are denoted as , where denotes the policy parameters, which in our design are the LSTM weights and biases for the lower-level policy of the area. To train a sufficiently good policy, we specify a reward function for the respective areas. Let the bus voltage magnitudes in the observations of area be denoted by . The RL agent's objective is to maximize the expected reward, where the reward at time for area is set as follows:
(4) 
In the reward function (4), is the time instant of fault clearance; is the voltage magnitude for bus corresponding to area ; is the load shedding amount in p.u. at time step for load bus corresponding to area ; is the invalid action penalty, incurred if the DRL agent still provides a load shedding action when the load at a specific bus has already been fully shed at the previous time step when the system is within normal operation. , and are the weight factors for the above three parts.
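To make the three-part structure of the reward concrete, the sketch below combines a voltage term, a load-shedding term, and an invalid-action penalty with three weights. The exact functional form of (4) is not reproduced here, so the shortfall-based voltage term, the function name `area_reward`, and all argument names are assumptions made purely for illustration.

```python
def area_reward(voltages, shed_amounts, invalid_action, lower_bound, c1, c2, c3):
    """Hypothetical three-term reward for one area at one post-fault time step.

    voltages: bus voltage magnitudes (p.u.) in the area's observations
    shed_amounts: load shed (p.u.) at each controlled bus at this step
    invalid_action: True if shedding was requested where it is not valid
    lower_bound: recovery-profile value (p.u.) at this time (assumed input)
    c1, c2, c3: weight factors of the three parts
    """
    # Part 1: total shortfall of voltages below the recovery profile.
    voltage_shortfall = sum(max(lower_bound - v, 0.0) for v in voltages)
    # Part 2: total amount of load shed at this step.
    shed_total = sum(shed_amounts)
    # Part 3: flat penalty for an invalid action.
    penalty = 1.0 if invalid_action else 0.0
    return -c1 * voltage_shortfall - c2 * shed_total - c3 * penalty
```

With this form, a step with all voltages above the profile, no shedding, and no invalid action yields a reward of zero, and every other outcome is negative.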
Algorithm 1 presents the steps to compute the decentralized policies for the individual areas. Please note that each area contains a set of fault buses , (each with cardinality ) within the area, with different fault durations. Step describes the rollouts or trajectory simulations with these faults applied in the emulator. We will describe the steps shortly after presenting the coordinator design.
, standard deviation of the exploration noise , the number of top-performing perturbed directions selected for updating weights , the number of rollouts per perturbation direction and the decay rate , for all .
(5)
(6)  
(7) 
(8) 
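For orientation, a single augmented random search update of the kind referenced above [14] can be sketched as follows: sample perturbation directions, evaluate each in both signs, keep the top-performing directions, and step along the reward-weighted sum normalized by the reward standard deviation. This is a generic sketch rather than the paper's Algorithm 1; the toy quadratic return, the flat parameter vector in place of LSTM weights, and the choice to decay both the step size and the noise are assumptions.

```python
import numpy as np


def ars_step(theta, rollout_return, n_directions, n_top, step_size, noise_std, rng):
    """One augmented-random-search update of a flat parameter vector."""
    deltas = rng.standard_normal((n_directions, theta.size))
    # Evaluate each perturbation direction in both signs.
    r_plus = np.array([rollout_return(theta + noise_std * d) for d in deltas])
    r_minus = np.array([rollout_return(theta - noise_std * d) for d in deltas])
    # Keep the directions whose better-signed rollout scored highest.
    top = np.argsort(np.maximum(r_plus, r_minus))[::-1][:n_top]
    # Normalize by the standard deviation of the selected returns.
    sigma_r = max(np.concatenate([r_plus[top], r_minus[top]]).std(), 1e-8)
    direction = ((r_plus[top] - r_minus[top])[:, None] * deltas[top]).sum(axis=0)
    return theta + step_size / (n_top * sigma_r) * direction


def train_toy(n_iters=200, seed=0):
    """Run ARS on a toy quadratic return; returns the final distance to the optimum."""
    rng = np.random.default_rng(seed)
    target = np.full(4, 5.0)  # hypothetical optimum, stands in for a good policy

    def toy_return(th):
        return -float(np.sum((th - target) ** 2))

    theta = np.zeros(4)
    step_size, noise_std, decay = 0.5, 0.1, 0.97  # illustrative values only
    for _ in range(n_iters):
        theta = ars_step(theta, toy_return, 8, 4, step_size, noise_std, rng)
        step_size *= decay   # assumed: decay applied to the step size
        noise_std *= decay   # assumed: decay applied to the exploration noise
    return float(np.linalg.norm(theta - target))
```

In the setting of this paper, `rollout_return` would correspond to the accumulated reward of fault rollouts in the grid simulator, averaged over the faults assigned to the area.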
III-B Concurrent Learning of Coordinator Policy:
At this point, we have a mechanism for generating locally learned policies for the areas . In a fully decentralized design, one can simply use these lower-level policies based on the fault location, i.e., if a fault occurs in area , then only the controller in area is actuated. However, that may not be sufficient to improve the dynamic performance in the neighboring areas. Moreover, without intelligent coordination, all the lower-layer agents would be activated, and although that may produce sufficient voltage performance, it would lead to expensive and inefficient load shedding. Motivated by these concerns, we now design a higher-level coordinator that can synchronize the lower-level actions efficiently. Mathematically, this means the higher-level coordinator policy needs to intelligently select lower-level policies ’s depending on the location and severity of the faults. The coordinating RL agent has a discrete action space that selects different areas. We now describe the mathematical framework and the algorithm.
III-B1 Mathematical formalization of the transition dynamics
When one of the lower-level RL agents is actuated, for example the one in area , the power system transition dynamics is given as:
(9) 
If we consider noisy dynamic behaviour then we have,
(10) 
where the argument signifies that the distribution is subjected to a fault in area at the buses given in the set . Once we design the coordinator, which we describe shortly, the coordinator will select the higher level policies , which in turn will activate lower-level policies in the grid. If multiple RL agents are actuated, then the resultant closed-loop dynamics will be due to their joint action. We refer to this idea of selecting one or multiple lower-level agents by the higher-level coordinator as the field of vision of the coordinator at each time step, with the notation . Therefore, the power system dynamics due to the combined action of the coordinator and decentralized agents is given by:
(11) 
or the stochastic equivalent
(12) 
Here the disturbance can occur in any part of the grid. Next, we look into the action and the observation spaces of the coordinator.
(13) 
III-B2 Action space of coordinator
The actions that the higher-level coordinator can take are discrete. Based on the fault location, the action space can be restricted. This is because if a fault occurs in area , we expect to activate the controls in an intelligent way only for the areas that are the physical neighbors of area . Let us denote the physical neighbor areas by . Please note that is different from the neighbors that we computed in Step 1 of DARS. For example, in the IEEE 39-bus system, the physical neighbors of area are , and the discrete action space when the fault occurs in area is given as . The action spaces for other areas can be designed similarly. For a particular fault, the actions can be restricted to . However, when starting with an unrestricted action space for the coordinator, the reward design helps the RL agent select the optimal set of lower-level actuators to gain high rewards. That makes the action selection fully automated, and the operator does not need to pass any information about the location of the fault.
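One way to picture such a discrete action space is as the list of non-empty subsets of candidate areas whose lower-level policies may be activated. The sketch below enumerates those subsets; treating every non-empty subset as one action is an assumption made for illustration, and the function name `coordinator_actions` is hypothetical.

```python
from itertools import combinations


def coordinator_actions(candidate_areas):
    """Enumerate discrete coordinator actions as non-empty subsets of areas.

    candidate_areas: iterable of area indices the coordinator may activate,
    e.g. all areas (unrestricted) or a faulted area plus its physical
    neighbors (restricted).
    """
    areas = sorted(candidate_areas)
    return [subset
            for size in range(1, len(areas) + 1)
            for subset in combinations(areas, size)]
```

For a three-area grid the unrestricted list has seven entries, and a restriction to two areas leaves three, which shows how restricting by fault location shrinks the search space.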
III-B3 Observation space of coordinator
The observation space of the coordinator is composed of the minimum of the instantaneous bus voltage magnitudes for the individual areas. The instantaneous minimum captures the worst-case behaviour following a fault. For area , the observation used for the learning design is , where are the set of voltage magnitudes of all the buses in the area . We denote the observation space for the coordinator as
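The per-area minimum described above is straightforward to compute from a snapshot of bus voltages. The following sketch assumes the voltages arrive as a dictionary keyed by bus index and the area partition as a dictionary of bus lists; both data layouts and the function name are assumptions.

```python
import numpy as np


def coordinator_observation(bus_voltages, area_buses):
    """Return the minimum instantaneous voltage magnitude of each area.

    bus_voltages: dict mapping bus index -> voltage magnitude (p.u.)
    area_buses: dict mapping area index -> list of bus indices in that area
    The output is ordered by ascending area index.
    """
    return np.array([min(bus_voltages[b] for b in area_buses[a])
                     for a in sorted(area_buses)])
```

The observation dimension therefore equals the number of areas, independent of how many buses each area contains.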
III-B4 Reward function of the coordinator
To design the reward function, we use the transient voltage recovery criterion based on the ideal recovery profile . Then the reward function for the coordinator can be written as:
(14)  
(15)  
(16) 
III-B5 Training
Algorithm 2 gives the steps for the coordinator training. We make the following design considerations to make the overall algorithm scalable and efficient.

Consideration 1: The higher-level coordinator reads the policies that are being learned in the lower layers via DARS and then implements them in the power system during its rollouts. However, to achieve steady convergence, the higher-level coordinator reads the lower-level policies once after a specified interval of iterations (say ) in the lower layers, and then keeps those policies fixed for another prespecified number of its own iterations (say ) as in Step 5 of Alg. 2. The CARS iterations start after running DARS for iterations to receive the first set of lower-level policies. Once the lower-level policies converge, the converged policies of the areas are used for subsequent iterations. As the lower-layer policies are trained in parallel in a decentralized framework, the grid model does not encounter the issue of non-stationarity during DARS training. However, as these policies may need to be implemented jointly depending on the severity and location of the faults, any approximation during DARS can be compensated for by the coordinator training to ensure global performance improvement.

Consideration 2: The computational burden of the coordinator training due to rollouts is kept similar to that of the lower-level policies. We do that by restricting the upper-level coordinator to train with a limited number of representative faults randomly selected from each of the individual areas. We denote the set of the selected fault buses from the area by with sufficiently smaller than . Therefore, the coordinator performs rollouts at the buses from with different fault durations for each of them. As the CARS algorithm is implemented concurrently with DARS using the updates from the lower layer, the total training time is dictated by that of the higher-level agent.
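The read-and-hold schedule of Consideration 1 can be illustrated with a small bookkeeping function that reports which lower-level snapshot the coordinator uses at each of its own iterations. The sketch assumes that the lower-level training advances by exactly one sync interval during each hold period and that the converged policy is reused afterwards; the function and parameter names are hypothetical.

```python
def coordinator_policy_versions(n_coord_iters, hold_iters, sync_interval, lower_max_iters):
    """Map each coordinator iteration to the lower-level iteration it reads from.

    n_coord_iters: number of coordinator iterations to schedule
    hold_iters: coordinator iterations during which a snapshot stays fixed
    sync_interval: lower-level iterations between two consecutive snapshots
    lower_max_iters: lower-level iteration at which training stops; the
        final (converged) snapshot is reused from then on
    """
    versions = []
    for k in range(n_coord_iters):
        refresh_count = k // hold_iters + 1  # the first read happens before k = 0
        versions.append(min(refresh_count * sync_interval, lower_max_iters))
    return versions
```

For example, with a hold of two coordinator iterations and a sync interval of ten lower-level iterations, the coordinator uses the snapshot from iteration 10 twice, then the one from iteration 20, and so on until the lower-level training ends.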
III-C Summary of Algorithmic Steps:
For both Algs. 1 and 2, the ARS learner delegates tasks to lower-level workers, gathers rewards, and updates the policy weights and accordingly. The learner communicates with multiple workers, which conduct perturbations (random search) of the policy weights as in Step 7 of Alg. 1 and Step 3 of Alg. 2, and subsequently run fault rollouts for each of the perturbed policies. In another layer of parallelism, the workers utilize a number of subordinate actors running in parallel, where each actor is responsible for only one fault rollout corresponding to the perturbed policy sent by its upper-level worker, as in Step 8 of Alg. 1 and Step 5 of Alg. 2. In Step 10 of Alg. 1 and in Step 6 of Alg. 2, the ARS learner aggregates the rewards of multiple rollouts conducted by each of the sub-level workers, sorts the directions according to the obtained rewards, and selects the best-performing directions. Then, in Step 11 of Alg. 1 and in Step 6 of Alg. 2, the ARS learner updates the policy weights and based on the perturbation results from the top-performing workers. Steps 2-4 in Alg. 1 are distinct from the coordinator training; there, the neighbor buses of the individual areas are computed following III-A1. On the other hand, Steps 4 and 5 in Alg. 2 are designed based on Considerations 1 and 2, respectively, as discussed in III-B5.
IV Test Results
We consider the IEEE benchmark 39-bus, generator model as our simulation test case. The proof-of-concept experiments were performed on a Linux server with an AMD Opteron processor with cores per socket and a maximum of CPUs running at GHz, and GB of memory. The power system simulator is implemented in GridPack (https://www.gridpack.org), and the hierarchical deep RL algorithm is implemented in Python. A software setup has been built such that the grid simulations in GridPack and the RL iterations in Python can communicate. Fig. 3 shows that the grid model can be divided into three distinct areas. In each of the areas, we consider a few buses where controllable loads are connected. For area , the controllable loads are at buses and , for area , dynamic loads are connected at buses and , and for area , we have control actions available at buses and .
Before showing the efficacy of the hierarchical design, we first study two scenarios: a full-scale centralized training, and training only area-wise decentralized policies. For the centralized full-scale design, we consider the policy network with hidden units in the LSTM layer, and units in the fully connected layer. Thereafter, we run a centralized augmented random search with faults at buses with three different fault durations of (no fault), and seconds. Once the centralized ARS training following the steps in [7] has been performed, we test the learned policy with various faults in the grid. Fig. 12 shows a scenario where the non-hierarchical centralized policy was not able to recover the voltages of all the buses after a fault at bus . This experiment shows that we would need to train a larger policy network with more parameters to achieve sufficient performance for the non-hierarchical design. This creates a bottleneck for large-scale grid models. Moreover, the time taken for full-scale centralized training is 5.705 hours. We will later show the scalability of our hierarchical algorithm in terms of training time, along with the use of modular policy networks. In the second scenario, where we do not implement the higher-level coordinator and only implement the non-hierarchical decentralized policies, Fig. 12 shows an example where the control actions in area 2 cannot improve the dynamic behaviour of the grid even though the fault occurred in area 2 itself. These studies show the difficulty of designing a full-scale centralized policy for large-scale grids, and the insufficiency of area-wise decentralized policies without a coordinator.
Parameters | Area 1 | Area 2 | Area 3 | Coordinator
 | [16, 16] | [16, 16] | [16, 16] | [16, 16]
Number of directions | 14 | 16 | 14 | 20
Top directions | 7 | 8 | 7 | 11
Maximum iterations | 200 | 250 | 300 | 500
Step size | 1 | 1 | 1 | 1
 | 2 | 2 | 2 | 2
Decay rate | 0.995 | 0.995 | 0.995 | 0.995
We now move to the hierarchical deep RL design. The lower-layer policy networks and the coordinator networks are designed with hidden units in the LSTM layer, and units in the fully connected layer, and are trained to learn sufficiently optimized policies. We train decentralized policies for the individual areas following DARS and concurrently run CARS to compute an optimized coordinator policy. To find the neighbor buses for the areas, we follow the procedure described in III-A1. We excite the grid model with faults at the controlled buses, and observe the bus voltage magnitudes in the neighboring areas. For area 1, we found that the neighbor buses in areas 2 and 3 do not violate the transient voltage recovery profile (we neglect a small number of violations); however, during exploration in areas 2 and 3, we found violations in the neighbors. Figs. 1212 show two examples: in Fig. 12 the impact of a fault at bus on the buses in area 2 does not create violations, whereas when a fault occurs at bus in area 2, multiple bus voltage magnitudes in the neighbors violate the safe recovery profile, as shown in Fig. 12. For area 1, we do not consider any neighbor buses in the observations; for areas 2 and 3, however, we have the neighbor buses as , . For each area, we consider the load buses as the self-observation buses in our design. Thereafter, we implement the area-wise decentralized ARS training for the individual areas as given in Algorithm 1. The training parameters used for the algorithm are given in Table I.
Centralized full-scale training | 5.705 hours
Hierarchical RL training | 
Fig. 12 shows the training profile, where the average rewards at the end of the iterations are plotted over all the fault rollouts. We run simulations with multiple random seeds to ensure reproducibility of our results. The area-wise decentralized training consisted of three separate training runs for the policies of the individual areas. The training times for areas 1, 2, and 3 were minutes, minutes, and minutes, respectively. Each of these policies is trained with nine fault scenarios within the area. Simultaneously, the lower-level policy updates are used as per Algorithm 2, where we train a higher-level coordinating RL agent that selects a lower-level policy intelligently based on the impact of the fault on the grid. The coordinator randomly uses one fault bus from a set of training buses for each of the areas to generate rollouts. We consider fault durations of (no fault), and seconds, totalling fault scenarios for the CARS training. Fig. 12 shows the training of the coordinator agent following CARS. We use . The coordinator training using CARS took hours. As the coordinator iterations start after the initial iterations of DARS, and then run concurrently with the decentralized training, the total training time in this design turns out to be hours. A comparison with the non-hierarchical design is shown in Table II, which shows the scalability in computation time of this design with respect to the full-scale centralized approach. We also compare, in Fig. 12, the training convergence of the coordinator for the proposed approach with another scenario where the coordinator is trained using the converged lower-level policies. This shows that when we pass the run-time lower-level updates to the higher level, thereby sending information on how the lower-level policies are being optimized, the coordinator can perform better in terms of convergence and achieve higher rewards.
We then implement the control architecture shown in Fig. 2, and test with multiple faults of s duration at different bus locations in the grid. Figs. 1212 show how the hierarchical design can coordinate between the RL agents of different areas to produce much better and more desirable performance compared to the full-scale centralized design in Fig. 12, or the decentralized one-area policy in Fig. 12. Similar performance is achieved for other fault bus locations as well. Please note that we tested with fault buses that are not used in training to show the robustness and the generalizability of the trained policy. The sufficiently optimized performance of the learned hierarchical policies is also validated by the accumulated rewards over the fault rollouts, as shown in Fig. 13 with a detailed discussion.
V Conclusions and Future Research
This paper presents a novel, scalable, reinforcement learning based approach to automated emergency load shedding for voltage control. We exploit the interconnected structure of the grid to develop a hierarchical control architecture in which the lower-level RL agents are trained in an area-wise decentralized way, and the higher-level agent is trained to intelligently coordinate the load shedding actions taken by the lower-level agents. The coordinator and the lower-level agents are trained concurrently. By incorporating modularity in training, the design can compute sufficiently optimized policies in considerably less time than the centralized full-scale design. We have shown that although the area-wise decentralized policies may not be independently sufficient to recover all the bus voltages after faults, the higher-level coordinating agent can effectively control them to ensure voltage stability under emergency conditions. In our future research, we will apply our approach to a larger industrial model, where it can alleviate the insufficiency of centralized learning approaches. We will continue to investigate the development of efficient design variants related to multi-agent and hierarchical approaches.
References
[1] (2017-03) Black system South Australia 28 September 2016 - Final Report.
[2] (2010) Decision tree-based preventive and corrective control applications for dynamic security enhancement in power systems. IEEE Transactions on Power Systems 25 (3), pp. 1611–1619.
[3] (2006) Some reflections on model predictive control of transmission voltages. In 2006 38th North American Power Symposium, pp. 625–632.
[4] (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. CoRR abs/1801.01290.
[5] (1997-11) Long short-term memory. Neural Comput. 9 (8), pp. 1735–1780.
[6] (2020) Adaptive power system emergency control using deep reinforcement learning. IEEE Transactions on Smart Grid 11 (2), pp. 1171–1182.
[7] (2020) Accelerated deep reinforcement learning based load shedding for emergency voltage control. arXiv preprint arXiv:2006.12667.
[8] (2006) Adaptive control tutorial.
[9] (2003-07) Design of an undervoltage load shedding scheme for the Hydro-Quebec system. In 2003 IEEE Power Engineering Society General Meeting (IEEE Cat. No.03CH37491), Vol. 4, pp. 2030–2036.
[10] (1995) Optimal control.
[11] (2020) A hierarchical data-driven method for event-based load shedding against fault-induced delayed voltage recovery in power systems. IEEE Transactions on Industrial Informatics, pp. 1–1.
[12] (2016) Continuous control with deep reinforcement learning. In ICLR.
[13] (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. In NeurIPS, pp. 6379–6390.
[14] (2018) Simple random search of static linear policies is competitive for reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1800–1809.
[15] (2017) Fast and robust determination of power system emergency control actions. arXiv preprint arXiv:1707.07105.
[16] (2013) Playing Atari with deep reinforcement learning. NIPS Deep Learning Workshop.
[17] (2018) Data-efficient hierarchical reinforcement learning. CoRR abs/1805.08296.
[18] (2009-06) Fault-induced delayed voltage recovery, version 1.2. NERC Transm. Issues Subcommittee Syst. Protect. Control Subcommittee.
[19] (2006) Short-term voltage instability: effects on synchronous and induction machines. IEEE Transactions on Power Systems 21 (2), pp. 791–798.
[20] (2015) Trust region policy optimization. In ICML, JMLR Workshop and Conference Proceedings, Vol. 37.
[21] (2017) Proximal policy optimization algorithms. CoRR.
[22] (2007) Emergency voltage stability controls: an overview. In 2007 IEEE Power Engineering Society General Meeting, pp. 1–10.
[23] (2017) FeUdal networks for hierarchical reinforcement learning. CoRR abs/1703.01161.
[24] (2018) Load shedding scheme with deep reinforcement learning to improve short-term voltage stability. In 2018 IEEE Innovative Smart Grid Technologies - Asia (ISGT Asia), pp. 13–18.
[25] (2019) Multi-agent reinforcement learning: a selective overview of theories and algorithms. arXiv 1911.10635.