Physics-informed Evolutionary Strategy based Control for Mitigating Delayed Voltage Recovery

by   Yan Du, et al.

In this work we propose a novel data-driven, real-time power system voltage control method based on the physics-informed guided meta evolutionary strategy (ES). The main objective is to quickly provide an adaptive control strategy to mitigate the fault-induced delayed voltage recovery (FIDVR) problem. Reinforcement learning (RL) methods have been developed for the same or similar challenging control problems, but they suffer from training inefficiency and a lack of robustness in "corner" or unseen scenarios. On the other hand, extensive physical knowledge has been developed in power systems, but little of it has been leveraged in learning-based approaches. To address these challenges, we introduce the trainable action mask technique for flexibly embedding physical knowledge into RL models to rule out unnecessary or unfavorable actions, and achieve notable improvements in sample efficiency, control performance and robustness. Furthermore, our method leverages past learning experience to derive a surrogate gradient that guides and accelerates the exploration process in training. Case studies on the IEEE 300-bus system and comparisons with other state-of-the-art benchmark methods demonstrate the effectiveness and advantages of our method.




I Introduction

I-A Motivation

Initially brought into the spotlight by the unprecedented success of AlphaGo in 2016, the deep reinforcement learning (deep RL) technique [sutton2018_RLBook] has motivated breakthroughs in a broad range of areas including games, robotics, and autonomous driving. In the field of power systems, deep RL has been leveraged for solving complex grid control and optimization problems, such as autonomous voltage regulation [wang2020adata], residential HVAC control [yu2021multi], electricity market bidding [liang2020agent], and power system stability and emergency control [yan2018data, Huang2020_DRL, zhang2020deep]. As discussed thoroughly in these works, the major advantages of deep RL over conventional model-based methods are that it is model-free and thus more robust to modeling errors; it can determine control solutions within a very short time and meet real-time control requirements; and it can generalize to unseen instances.

Nevertheless, a number of critical factors still hinder the full adoption of deep RL algorithms in physical systems such as power systems: 1) they usually have a costly training and fine-tuning process due to numerous embedded parameters and low exploration efficiency; 2) they fail to properly incorporate physical knowledge to achieve efficient training and robust performance; 3) they cannot adapt well to new or unseen situations.

The objective of this paper is to address these three key issues by embedding the physics knowledge for accelerating the learning process and improving the robustness of the control policy, by incorporating guided search to achieve better exploration efficiency, and by leveraging meta-learning for fast adaptation.

Physics-informed machine learning has received growing attention lately due to the challenges encountered by purely data-driven machine learning methods, including the high cost of data acquisition, data incompleteness, and extremely large search spaces. In the fault-induced delayed voltage recovery (FIDVR) problem, the bulk power system with a large number of load buses induces a vast control action space and an unduly burdensome search process. Voltage performance criteria have been developed by the industry through extensive off-line studies to guide planning and operation against voltage problems including FIDVR. In light of these, we augment the conventional RL model (represented by a neural network) with a physics-informed module called the trainable action mask (TAM), which utilizes power system physical knowledge (i.e., voltage performance criteria) to filter out improper control actions and to avoid unnecessary explorations, thereby achieving better sample efficiency and control robustness. A recent study on dialog systems [wu2019tam] adopted a similar idea to leverage prior, non-physical knowledge. In this paper, we embed physical knowledge into RL models through the TAM technique for power system control applications for the first time, particularly for the FIDVR problem, and obtain promising results.

Conventional deep RL methods rely on the chain rule and back-propagation to update the parameters of the neural networks, which makes it rather difficult to incorporate non-differentiable modules like TAM. Recently, evolutionary strategy (ES) has been shown to be an effective, scalable alternative to conventional RL methods [salimans2017evolution, mania2018simple]. ES methods are derivative-free and can be easily parallelized. The derivative-free feature facilitates incorporating the TAM technique into the RL model.

Most recently, a novel guided ES method was proposed to enhance the exploration efficiency of the algorithm in high-dimensional parameter spaces [maheswaranathan2019guided]. Instead of conducting a completely random search, the guided ES method leverages the guidance from a surrogate gradient, which is derived from prior exploration experiences during training. By coordinating the random search with the gradient-guided search, the guided ES method can achieve a faster learning speed and better solutions.

Inspired by the above work, and building upon our previous work in leveraging an advanced ES algorithm for power system emergency control [huang2021accelerated_DRL], in this paper we develop a model-free guided ES-based control strategy to mitigate FIDVR with high exploration efficiency. The guided ES method avoids the computationally intensive back-propagation process and is easy to parallelize, which makes it possible to handle the extremely high-dimensional state and action spaces introduced by large-scale power systems.

One key issue with the FIDVR problem is how well a developed control policy can adapt to the ever-changing grid operating conditions [park2020model]. To quickly adapt the learnt control strategy to unseen fault scenarios and meet the real-time control requirement, we further combine the above guided ES method with a meta-learning strategy introduced in our previous work [huang2021learning], namely the meta strategy optimization (MSO), which leads to the guided meta ES algorithm. The core idea behind MSO is to learn a latent variable as a representation of the variations of the training environments. The latent variable can then be fine-tuned when a new operation scenario is presented, and the control policy is adjusted accordingly. The adaptability of the proposed guided meta ES method makes it practical for real-world applications.

I-B Literature review

Our work focuses on the FIDVR problem. FIDVR is defined as the phenomenon whereby system voltage remains at significantly reduced levels for several seconds after a fault has been cleared [NERC2009]. The root cause is the stalling of residential air-conditioner (A/C) motors, which absorb excessive reactive power from the power grid, and their prolonged tripping. FIDVR events have occurred in many utilities in the US. Concerns over FIDVR issues have increased since residential A/C penetration is at an all-time high and growing rapidly. A transient voltage recovery criterion (TVRC) is defined to evaluate the system voltage recovery. Without loss of generality, we refer to the standard proposed in [PJM2009], as shown in Fig. 1. After fault clearance, the standard requires that voltages return to at least 0.8, 0.9 and 0.95 p.u. within 0.33 s, 0.5 s and 1.5 s, respectively.

The control actions to mitigate the FIDVR problem can be classified as supply-side solutions and demand-side solutions. The most popular supply-side solution applied by utilities is the installation of FACTS devices such as SVC [al2009preventing] and static condenser [du2009utilizing] to increase reactive power support. However, these devices are expensive, costing approximately $20–50 million per installation, and a large number of them would have to be installed. Demand-side load shedding is widely recognized as the most effective and economic control strategy to mitigate the FIDVR problem: shedding part of the stalling A/C motors in the impacted areas reduces the large amount of reactive power they absorb and thus recovers the voltage [Bai2011]. The challenge in determining the load shedding actions is to decide when, where, and how much A/C motor load to shed in the impacted areas of the grid. The goal is to shed the least amount of A/C motor load needed to recover the system voltage and meet the transient voltage recovery criterion.

Existing load shedding methods for mitigating the FIDVR problem can be roughly classified into the following four categories, and the advantages of our proposed method over each category are analyzed accordingly:

  • rule-based methods: an example of a rule-based method can be found in [lefebvre2003design]. While the implementation is fairly easy, the settings of the rules tend to be conservative, and they cannot adapt flexibly to different conditions.

  • model-based methods: one popular method is the model-predictive control (MPC) method [Jin2010MPC, MPC2]. The MPC model can be formulated based on the detailed power grid network and dynamic models, and can be solved with mathematically rigorous optimization techniques. However, inclusion of the transient stability constraints in the FIDVR problem makes the MPC model much more complex and higher dimensional than the conventional steady-state OPF problem. Thus, it suffers from scalability issues and cannot meet the solution time requirement (i.e., <0.5 s) for real-time voltage control.

  • measurement-based methods: new real-time voltage control methods have been developed by leveraging the phasor measurement unit (PMU) technologies [glavic2012see, matavalam2019pmu, sun2019review]. They do not consider coordinating load shedding actions to achieve system-wide optimality.

  • learning-based methods: the deep Q network (DQN) and deep deterministic policy gradient (DDPG) methods have been applied for developing emergency load shedding schemes to recover voltage [Huang2020_DRL, zhang2018load]. While deep RL methods have been credited for their strong exploration and adaptability, they still suffer from training inefficiency and a lack of adaptability and robustness in unseen scenarios, which hinders their deployment for large-scale power system voltage recovery.

Compared with the above existing methods, our proposed evolutionary strategy method has the following merits in the case of real-time FIDVR mitigation control:

  • Compared with the rule-based method, our proposed method adopts a neural network to learn a more generalized control strategy, and further combines it with a meta-learning strategy to adapt quickly to unseen operation scenarios and contingencies.

  • Compared with the model-based method, our proposed method is model-free and, once well trained, can be directly executed in real time, which greatly reduces both modeling and computational efforts.

  • Compared with the measurement-based method, our proposed method can efficiently tackle the high-dimensional state and action spaces of large-scale power systems with the embedded neural network to achieve a near-optimal solution.

  • Compared with other learning-based deep RL methods, our proposed method has a much higher exploration efficiency and can be easily scaled up for parallel computing during training. Furthermore, our proposed method is guided by physical knowledge during both training and execution, which greatly reduces the training efforts and improves the control adaptability and robustness.

Table I presents an overview of the proposed technical roadmap in our work. We also analyze the improvement of the current work over our previous works in this research area [huang2021accelerated_DRL, huang2021learning] as follows:

  • Compared with [huang2021accelerated_DRL], which applies a parallel evolutionary strategy method called augmented random search (ARS), we introduce guided search to the evolutionary strategy in this work instead of a completely random search, which substantially accelerates the learning process and helps find better control strategies.

  • Compared with [huang2021learning], which adopts the meta strategy optimization (MSO) to develop a robust and adaptive load shedding strategy, in this work we combine the MSO with the guided search to take advantage of both methods and improve the computational efficiency as well as the adaptability of the evolutionary strategy for real-time voltage recovery.

  • Last but not least, we introduce physical knowledge into the model-free ES method through the TAM technique for the first time to help boost the exploration efficiency. This approach yields substantial improvements and was not explored in the two previous works.

| Key challenges of deep RL | Proposed technique |
|---|---|
| Costly back-propagation process; laborious parameter fine-tuning; lack of scalability; increasing algorithm complexity | Derivative-free ES for easy parallelization and low computational burden (in our previous work [huang2021accelerated_DRL]) |
| Lack of adaptability and generalization to unseen test cases | Meta strategy optimization for fast policy adaptation (in our previous work [huang2021learning]) |
| Time-consuming random action-space exploration | Guiding exploration in the parameter space with surrogate gradient to focus on promising directions (developed in this paper) |
| High-dimensional action domains leading to exhaustive searching; lack of robustness | Introducing physical knowledge through TAM for effective action filtering (developed in this paper) |

TABLE I: Technical Roadmap of the Physics-informed ES Method

I-C Contributions

In summary, the key contributions of our work can be outlined as follows:

1) The major contribution of our work is the novel embedding of physical knowledge into the model-free ES method to achieve high exploration efficiency for the real-time FIDVR problem with large action domains. The physical knowledge is introduced through the TAM technique, which eliminates improper load shedding actions and considerably spares the exploration efforts. To the best of the authors’ knowledge, this effort of embedding physics awareness in model-free voltage control methods is unprecedented in the literature. We have witnessed substantial algorithm performance improvement in our comparative studies.

2) A novel model-free guided meta ES method is developed to achieve higher sample efficiency than the standard ES algorithm, thanks to the combination of random search with surrogate gradient information.

3) The generalization and adaptability of the learnt control policy to unseen fault scenarios are further enhanced through a meta-learning strategy.

4) The adaptability and efficiency of the proposed voltage recovery method based on the physics-informed guided meta ES algorithm are verified by testing on a large-scale power system under multiple unseen fault scenarios and by comparing with state-of-the-art benchmark methods, which indicates its promise for real-world applications.

I-D Organization

The rest of the paper is organized as follows: Section II describes the problem formulation of FIDVR. Section III introduces the proposed adaptive model-free control method based on the guided meta ES algorithm. In Section IV, physical knowledge is further combined with the proposed control method through the trainable action mask. Case studies are shown in Section V. Finally, Section VI concludes the paper and discusses future directions.

II Problem Formulation

As discussed in Section I-B, one widely recognized, effective and economic control strategy to mitigate the FIDVR problem is load shedding. An ideal load shedding strategy should be able to bring the system voltage magnitude to a certain level with a minimum amount of load shedding. A standard transient voltage recovery criterion is shown in Fig. 1 [PJM2009]. As shown in the figure, the voltage should return to at least 0.8, 0.9, and 0.95 p.u. within 0.33 s, 0.5 s and 1.5 s after the fault is cleared.

Fig. 1: Transient voltage recovery criterion
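The criterion can be read as a step-wise lower envelope on the post-fault voltage. The following is a minimal sketch of how a simulated voltage trajectory might be checked against it. The threshold and deadline pairs are the ones quoted in Section I-B; the function names and the sampled-trajectory format are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple

# (deadline after fault clearance in s, minimum voltage in p.u.), as quoted in Section I-B.
TVRC: List[Tuple[float, float]] = [(0.33, 0.80), (0.50, 0.90), (1.50, 0.95)]


def required_voltage(t_since_clear: float,
                     criterion: Sequence[Tuple[float, float]] = TVRC) -> float:
    """Minimum voltage required at a given time after fault clearance."""
    required = 0.0  # no requirement before the first deadline
    for deadline, v_min in criterion:
        if t_since_clear >= deadline:
            required = v_min
    return required


def violates_tvrc(times: Sequence[float], volts: Sequence[float], t_clear: float,
                  criterion: Sequence[Tuple[float, float]] = TVRC) -> bool:
    """True if any post-fault sample lies below the step-wise envelope."""
    for t, v in zip(times, volts):
        if t >= t_clear and v < required_voltage(t - t_clear, criterion):
            return True
    return False
```

A trajectory that is still at 0.75 p.u. at 0.4 s after clearance would be flagged, since the envelope already requires 0.8 p.u. at that point.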

Deciding the optimal load shedding strategy is not a trivial task since three crucial questions must be answered: when to conduct load shedding, at which bus the load should be shed, and how much of the load should be shed. This leads to a high-dimensional non-convex decision-making problem [Huang2020_DRL] that defeats many model-based solutions in real-time implementation. In addition, model-based solutions are sensitive to model errors, which are common in complex power systems.

Based on the above discussions, in this work we propose to formulate the FIDVR mitigation control problem as a Markov Decision Process (MDP) and further apply a model-free guided meta ES algorithm with trainable action mask (TAM) to obtain the optimal load shedding strategies under different fault scenarios. The resulting method is adaptive, highly scalable, and computationally efficient. The MDP-based problem formulation is presented first.

The MDP is represented as a 5-tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma)$, whose elements are defined in the context of FIDVR as follows:

1) state: the state $s_t$ is defined as a vector that contains the latest observations from the power system, including the voltage magnitude and the percentage of remaining load that can be shed at the monitored buses: $s_t = (V_t, P_t)$ with $V_t \in \mathbb{R}^{N}$ and $P_t \in \mathbb{R}^{M}$, where $N$ and $M$ are the number of voltage monitoring buses and controllable load buses.

2) action: the action $a_t$ is defined as a vector that contains the normalized load shedding actions for all the controllable buses. The normalized load shedding action is a scalar between -1 and 1, where -1 indicates that 20% of the remaining load will be shed, and 1 indicates no load shedding actions.

3) state transition: the state transition describes the power system dynamics and is deterministically governed by a set of differential and algebraic equations:

$$\dot{x}_t = f(x_t, y_t, d_t, a_t) \tag{1}$$

$$0 = g(x_t, y_t, d_t, a_t) \tag{2}$$

In (1)-(2), $x_t$ represents the system dynamic state variables, such as the generator rotor angle and speed; $y_t$ represents the system algebraic state variables, which are usually bus voltage magnitudes and bus voltage angles; $d_t$ refers to the system perturbation or contingency; and $a_t$ is the control action.

4) reward: in the power system emergency control problem, the main objective is to recover the voltage magnitude to the normal level after fault clearance with the least amount of load shedding. To reach this objective, we adopt the same reward design as in our previous work [Huang2020_DRL]:

$$r_t = c_1 \sum_i \Delta V_i(t) - c_2 \sum_j \Delta P_j(t) - c_3 u_{ivld} \tag{3}$$

$$\Delta V_i(t) = \begin{cases} \min\{V_i(t) - V_1, 0\}, & T_{pf} < t \le T_{pf} + T_1 \\ \min\{V_i(t) - V_2, 0\}, & T_{pf} + T_1 < t \le T_{pf} + T_2 \\ \min\{V_i(t) - V_3, 0\}, & t > T_{pf} + T_2 \end{cases} \tag{4}$$

In (3)-(4), $t$ is the current time step; $V_i(t)$ is the voltage magnitude at bus $i$; $T_{pf}$ is the time instant for fault clearance; $\Delta P_j(t)$ is the amount of shed load at bus $j$ in p.u.; $u_{ivld}$ is the penalty for an invalid action, i.e., a load shedding action at a bus with zero remaining load; $c_1$, $c_2$, and $c_3$ are the weight factors; $V_1$, $V_2$, $V_3$, $T_1$, and $T_2$ constitute the voltage recovery criterion. One example of their values is shown in Fig. 1.

5) discount factor: the objective of the MDP is to maximize the total reward $\sum_{t} \gamma^{t} r_t$, where $\gamma$ is a discount factor between 0 and 1. The discount factor keeps the sum of future rewards finite.
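To make the action, reward and discounted-return definitions concrete, here is a minimal sketch. It is not the authors' implementation: the linear mapping between the two stated end points of the normalized action, the unit weights, and the threshold values (borrowed from the criterion quoted in Section I-B) are all assumptions for illustration.

```python
from typing import Sequence


def shed_fraction(a: float, max_shed: float = 0.2) -> float:
    """Map a normalized action in [-1, 1] to a shed fraction: -1 -> max_shed, 1 -> 0.

    Only the two end points are given in the text; the linear
    interpolation in between is an assumption.
    """
    a = max(-1.0, min(1.0, a))
    return max_shed * (1.0 - a) / 2.0


def voltage_threshold(t: float, t_clear: float) -> float:
    """Time-dependent voltage threshold (placeholder values, see lead-in)."""
    dt = t - t_clear
    if dt < 0.33:
        return 0.0
    if dt < 0.5:
        return 0.80
    if dt < 1.5:
        return 0.90
    return 0.95


def step_reward(t: float, t_clear: float, volts: Sequence[float],
                shed_pu: Sequence[float], invalid_penalty: float,
                c1: float = 1.0, c2: float = 1.0, c3: float = 1.0) -> float:
    """Penalize voltage below the threshold, shed load, and invalid actions."""
    thr = voltage_threshold(t, t_clear)
    dv = sum(min(v - thr, 0.0) for v in volts)  # always <= 0
    return c1 * dv - c2 * sum(shed_pu) - c3 * invalid_penalty


def discounted_return(rewards: Sequence[float], gamma: float = 0.99) -> float:
    """Sum of gamma**t * r_t, accumulated backwards for numerical simplicity."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

With these placeholder weights, a step at 1.0 s after clearance with bus voltages of 0.85 and 0.95 p.u. and 0.1 p.u. of shed load is penalized both for the 0.05 p.u. shortfall at the first bus and for the shed load.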

III Guided Meta Evolutionary Strategy

In this section, we will first give a brief review of the evolutionary strategy (ES). Then we will introduce the guided ES algorithm that combines the surrogate gradient with random search to achieve higher sampling efficiency. Lastly, we will present a guided meta ES algorithm by utilizing meta strategy optimization (MSO) to obtain adaptive control strategies.

III-A An introduction to ES

The ES is a type of heuristic search algorithm inspired by the evolution theory: at each iteration, a population of parameters that need to be optimized are randomly perturbed and their objective function values are calculated. The parameters with the highest values are then recombined to formulate the population for the next iteration. The process repeats until the objective meets the convergence criterion.

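In the context of RL, given the reward function $r$ and the policy $\pi_\theta$, the goal is to find the optimal $\theta$ that maximizes the expected total discounted reward $F(\theta) = \mathbb{E}\left[\sum_t \gamma^t r_t\right]$. Algorithm 1 shows the implementation of ES [salimans2017evolution]:

1:  Initialize the learning rate $\eta$, noise standard deviation $\sigma$, the number of perturbation directions $N$, and policy parameter $\theta$
2:  for iteration = 1 to $K$ do
3:     Sample perturbation directions $\epsilon_1, \dots, \epsilon_N$ from $\mathcal{N}(0, I)$
4:     for $i$ = 1 to $N$ do
5:        Generate action $a_i = \pi_{\theta + \sigma \epsilon_i}(s)$
6:        Execute $a_i$ and receive reward $F_i$
7:     end for
8:     Update the policy parameter:
9:     $\theta \leftarrow \theta + \eta \frac{1}{N \sigma} \sum_{i=1}^{N} F_i \epsilon_i$
10:  end for
Algorithm 1 Evolutionary Strategy (ES)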

As shown in the above pseudo code, the algorithm consists of two repeated phases: first, the policy parameter is randomly perturbed by noise drawn from a standard normal distribution, and the associated actions are executed and evaluated based on their reward values for an entire episode (line 3-line 7); second, the policy parameter is updated by an estimated stochastic gradient (line 9), which comes from the following derivation. Assume our objective is to optimize $\theta$ over a distribution $p_\psi(\theta)$ to maximize the expected reward $\mathbb{E}_{\theta \sim p_\psi} F(\theta)$. When the parameter distribution is Gaussian, the expected reward can be directly written as $\mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)} F(\theta + \sigma \epsilon)$. With the objective defined in terms of $\theta$, the gradient can be calculated as follows:

$$\nabla_\theta \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)} F(\theta + \sigma \epsilon) = \frac{1}{\sigma} \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)} \left\{ F(\theta + \sigma \epsilon)\, \epsilon \right\} \tag{5}$$

The expectation term in (5) can be approximated through sampling, as shown by line 3 in the algorithm. Note that line 4-line 7 can be naturally deployed in a parallel fashion to speed up the training, since the perturbation directions are independent of one another. Using simple sampling instead of back-propagation for the parameter update makes the ES algorithm more scalable than gradient-based RL methods. We will provide more explanations on scalability with parallel computing in the context of power system control in the later subsection IV-B.
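The sampled update can be sketched in a few lines of plain Python. The quadratic `toy_return` below is a hypothetical stand-in for an episode return from a power system simulation, and the hyper-parameter values are arbitrary choices for this toy problem, not the settings used in the paper.

```python
import random
from typing import Callable, List

Vector = List[float]


def es_step(theta: Vector, objective: Callable[[Vector], float],
            rng: random.Random, lr: float = 0.02, sigma: float = 0.1,
            n_dirs: int = 50) -> Vector:
    """One ES update: theta <- theta + lr / (n_dirs * sigma) * sum_i F_i * eps_i."""
    grad = [0.0] * len(theta)
    for _ in range(n_dirs):
        # Sample one perturbation direction from N(0, I).
        eps = [rng.gauss(0.0, 1.0) for _ in theta]
        # Evaluate the perturbed parameters (one "episode").
        ret = objective([t + sigma * e for t, e in zip(theta, eps)])
        for j, e in enumerate(eps):
            grad[j] += ret * e
    scale = lr / (n_dirs * sigma)
    return [t + scale * g for t, g in zip(theta, grad)]


def toy_return(theta: Vector) -> float:
    """Hypothetical return with its maximum at theta = (1, 1, 1)."""
    return -sum((x - 1.0) ** 2 for x in theta)


def run_es(n_iters: int = 500, seed: int = 0) -> Vector:
    rng = random.Random(seed)
    theta = [0.0, 0.0, 0.0]
    for _ in range(n_iters):
        theta = es_step(theta, toy_return, rng)
    return theta
```

The inner loop over directions has no data dependencies between iterations, which is why it maps naturally onto parallel workers.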

III-B Guiding ES search with surrogate gradient

In the above ES algorithm, the policy parameter $\theta$ is randomly perturbed following a Gaussian distribution. While this random search is easy to implement, it can introduce high variance and result in unnecessary explorations. The guided ES algorithm is proposed to handle this challenge. The core idea behind the guided ES algorithm is to use the surrogate gradient to guide the search toward the most promising directions instead of conducting a completely random search.

A surrogate gradient is correlated with the true gradient, but may be biased or corrupted due to model unobservability. An illustration of the surrogate gradient is shown in Fig. 2.

Fig. 2: Schematic of surrogate gradient

The guided ES algorithm takes advantage of the surrogate gradient in the following way [maheswaranathan2019guided]: suppose we can get a vector of surrogate gradient for the policy parameters at each iteration, then by collecting the surrogate gradients from the previous $k$ iterations, we can generate a subspace $U \in \mathbb{R}^{n \times k}$, where $U$ is an orthonormal basis for this subspace, and $n$ is the dimension of the policy parameters. The gradient information can be further embedded in the ES algorithm by changing the distribution of the perturbation from $\mathcal{N}(0, I)$ to $\mathcal{N}(0, \Sigma)$, where $\Sigma$ is calculated as follows:

$$\Sigma = \frac{\alpha}{n} I_n + \frac{1 - \alpha}{k} U U^{T} \tag{6}$$

In (6), $\alpha$ is a weight factor that makes a trade-off between the random search (exploration) and the guided search with surrogate gradient (exploitation). Setting $\alpha = 1$ recovers the standard ES algorithm. In our case, we set $\alpha$ to 0.5 to balance exploration with exploitation. With the modified distribution, the perturbation direction can be calculated as follows:

$$\epsilon = \sqrt{\frac{\alpha}{n}}\, \epsilon_1 + \sqrt{\frac{1 - \alpha}{k}}\, U \epsilon_2 \tag{7}$$

where $\epsilon_1 \sim \mathcal{N}(0, I_n)$ and $\epsilon_2 \sim \mathcal{N}(0, I_k)$. The complete guided ES algorithm is shown in Algorithm 2. The algorithm basically follows the same framework as the ES method. One difference is that at the initialization stage, a surrogate gradient buffer is defined to store the surrogate gradients from the previous steps for generating perturbations. In addition, antithetic sampling is applied, where for each perturbation direction a pair of evaluations for $\theta + \sigma \epsilon_i$ and $\theta - \sigma \epsilon_i$ is conducted to reduce variance, as shown by line 6. The evaluations are later used to calculate the surrogate gradient for the policy parameter update, as shown by line 10-11. Finally, the surrogate gradient is stored in the buffer for generating the new perturbation distribution, as shown by line 12 in the algorithm.

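1:  Initialize the learning rate $\eta$, the weight factor $\alpha$, the scale factor $\beta$, the noise standard deviation $\sigma$, the number of samples $P$, the number of surrogate gradients to use $k$, and the policy parameter $\theta$
2:  Initialize the surrogate gradient buffer
3:  for iteration = 1 to $K$ do
4:     Sample perturbation directions $\epsilon_1, \dots, \epsilon_P$ from $\mathcal{N}(0, \Sigma)$
5:     for $i$ = 1 to $P$ do
6:        Generate actions $a_i^{+} = \pi_{\theta + \sigma \epsilon_i}(s)$, $a_i^{-} = \pi_{\theta - \sigma \epsilon_i}(s)$
7:        Execute $a_i^{+}$, $a_i^{-}$ and receive rewards $F_i^{+}$, $F_i^{-}$
8:     end for
9:     Update $\theta$ with surrogate gradient $g$:
10:     $g = \frac{\beta}{2 \sigma P} \sum_{i=1}^{P} \epsilon_i \left( F_i^{+} - F_i^{-} \right)$
11:     $\theta \leftarrow \theta + \eta g$
12:     Store $g$ to the buffer and update the surrogate gradient subspace $U$ and perturbation distribution $\mathcal{N}(0, \Sigma)$
13:  end for
Algorithm 2 Guided Evolutionary Strategy (guided ES)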
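The guided search can be sketched as follows, under stated assumptions. The subspace basis is obtained here by Gram-Schmidt over the buffered gradients (the original guided ES work uses a QR decomposition, which yields an equivalent basis). The empty-buffer fallback to isotropic search, the quadratic `toy_return`, and all hyper-parameter values are choices made for this illustration only.

```python
import math
import random
from typing import Callable, List

Vector = List[float]


def orthonormal_basis(grads: List[Vector]) -> List[Vector]:
    """Gram-Schmidt over stored surrogate gradients (the columns of U)."""
    basis: List[Vector] = []
    for g in grads:
        g_norm = math.sqrt(sum(a * a for a in g))
        v = list(g)
        for u in basis:
            proj = sum(a * b for a, b in zip(v, u))
            v = [a - proj * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if g_norm > 0.0 and norm > 1e-8 * g_norm:  # skip (near-)dependent vectors
            basis.append([a / norm for a in v])
    return basis


def sample_direction(n: int, basis: List[Vector], alpha: float,
                     rng: random.Random) -> Vector:
    """Draw eps ~ N(0, alpha/n * I + (1 - alpha)/k * U U^T)."""
    a = alpha if basis else 1.0  # empty buffer: fall back to isotropic search
    eps = [math.sqrt(a / n) * rng.gauss(0.0, 1.0) for _ in range(n)]
    for u in basis:
        coeff = math.sqrt((1.0 - a) / len(basis)) * rng.gauss(0.0, 1.0)
        eps = [e + coeff * b for e, b in zip(eps, u)]
    return eps


def guided_es_step(theta: Vector, objective: Callable[[Vector], float],
                   buffer: List[Vector], rng: random.Random, lr: float = 0.3,
                   alpha: float = 0.5, beta: float = 1.0, sigma: float = 0.1,
                   n_samples: int = 10, k: int = 3) -> Vector:
    n = len(theta)
    basis = orthonormal_basis(buffer[-k:])
    g = [0.0] * n
    for _ in range(n_samples):
        eps = sample_direction(n, basis, alpha, rng)
        # Antithetic pair of evaluations at theta +/- sigma * eps.
        f_plus = objective([t + sigma * e for t, e in zip(theta, eps)])
        f_minus = objective([t - sigma * e for t, e in zip(theta, eps)])
        for j in range(n):
            g[j] += eps[j] * (f_plus - f_minus)
    scale = beta / (2.0 * sigma * n_samples)
    g = [scale * x for x in g]
    buffer.append(g)   # store the new surrogate gradient
    del buffer[:-k]    # keep only the k most recent ones
    return [t + lr * x for t, x in zip(theta, g)]


def toy_return(theta: Vector) -> float:
    """Hypothetical return with its maximum at theta = (1, ..., 1)."""
    return -sum((x - 1.0) ** 2 for x in theta)


def run_guided_es(n_iters: int = 200, seed: int = 0) -> Vector:
    rng = random.Random(seed)
    theta, buffer = [0.0] * 4, []
    for _ in range(n_iters):
        theta = guided_es_step(theta, toy_return, buffer, rng)
    return theta
```

Because the perturbation covariance has unit trace, the step size and the scale factor interact; in this toy setting their product is kept well below 1 so that the update does not overshoot.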

III-C Enhancing algorithm adaptability with MSO

The power system has a fast-changing and uncertain nature, which requires that an emergency control strategy should have sufficient robustness and be adaptive to unseen fault scenarios. To enhance the adaptability of the above data-driven guided ES-based control policy to new environment dynamics, in this subsection we propose to integrate the idea of meta learning, namely learning to learn, into the guided ES method, which leads to guided meta ES.

We apply a specific meta-learning technique, the meta strategy optimization (MSO) [yu2020learning], to realize the above objective. MSO adapts a learnt control policy to unseen scenarios through latent space representation. For each operation scenario encountered during the training, a latent variable is defined for this scenario to encode its hidden features. The latent variable is later combined with the direct observations of the scenario and sent to the policy function for decision-making. The latent variable optimization and the policy parameter update can be expressed by the following two equations:

$$c_i^{k} = \arg\max_{c} J(\pi_{\theta^{k}}, c, i) \tag{8}$$

$$\theta^{k+1} = \arg\max_{\theta} \mathbb{E}_{i} \left[ J(\pi_{\theta}, c_i^{k}, i) \right] \tag{9}$$

In (8), $c_i^{k}$ is the latent variable associated with scenario $i$ at the $k$-th iteration; $J$ is a performance measurement, e.g., the reward function. The policy parameter $\theta$ is then optimized by maximizing the expected performance measurement with the learnt latent variable $c_i^{k}$, as shown by (9).

When unseen operation scenarios occur during the testing, new latent variables can be calculated through the above process for fine-tuning the policy, making it adapted to the new environment dynamics. This adaptation can be realized through only a few iterations with the environment, which is highly time-efficient. More technical details of MSO application in power system emergency control can be found in our previous work [huang2021learning].
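The test-time adaptation step can be illustrated with a small sketch in which the policy parameters stay frozen and only the latent variable is searched. The simple hill-climbing search below is a stand-in chosen for brevity, not the optimizer used in MSO, and the `performance` callable and the `hidden` scenario parameters are hypothetical.

```python
import random
from typing import Callable, List, Tuple

Vector = List[float]


def adapt_latent(performance: Callable[[Vector], float], c_init: Vector,
                 rng: random.Random, n_iters: int = 10,
                 n_candidates: int = 8, step: float = 0.3) -> Tuple[Vector, float]:
    """Fine-tune the latent variable for a new scenario with the policy frozen.

    `performance(c)` is assumed to run the frozen policy conditioned on the
    latent variable c in the new scenario and return its total reward.
    """
    best_c, best_j = list(c_init), performance(c_init)
    for _ in range(n_iters):
        for _ in range(n_candidates):
            cand = [c + step * rng.gauss(0.0, 1.0) for c in best_c]
            j = performance(cand)
            if j > best_j:  # keep a candidate only if it improves performance
                best_c, best_j = cand, j
    return best_c, best_j


# Hypothetical unseen scenario whose hidden features are best encoded by `hidden`.
hidden = [0.7, -0.2]


def toy_performance(c: Vector) -> float:
    return -sum((a - b) ** 2 for a, b in zip(c, hidden))
```

Only the low-dimensional latent variable is searched here, which is what keeps the number of required environment evaluations small compared with retraining the policy.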

III-D Advantages of Guided Meta ES algorithm over conventional RL

In the context of power system FIDVR mitigation control, the above guided meta ES algorithm has the following advantages over gradient-based RL methods:

  • The evolutionary strategy does not rely on the back-propagation process for gradient calculation and parameter update. This provides more flexibility, such as incorporating a trainable action mask module, which can be regarded as a non-differentiable regulation layer, in the end-to-end training process.

  • In line with the above analysis, large-scale power system FIDVR mitigation control problems feature non-smoothness in the environment dynamics and the reward function definition, which can cause gradient explosion during the back-propagation process in standard value-based and policy gradient deep reinforcement learning methods. The proposed evolutionary strategy relies on the surrogate gradient as an efficient workaround to avoid the gradient explosion issue.

  • The evolutionary strategy is well suited to scale up with parallel computing technologies: the algorithm operates on complete power system dynamic simulations, which indicates infrequent communications among the parallel workers. Considering the variety of power system operation scenarios, a parallel simulation greatly facilitates the training process.

  • In the case of unseen scenarios, the learnt control policy can be quickly adjusted through MSO for new environment dynamics, which is highly desirable for real-time FIDVR mitigation under uncertainties.

  • The learning performance of the standard value-based and policy gradient deep RL methods is highly sensitive to the hyper-parameter settings, and they require additional human effort for parameter fine-tuning. In contrast, the proposed evolutionary strategy method has only a small set of hyper-parameters, which avoids an extensive fine-tuning process while delivering better control performance.

IV Physics-informed Guided Meta ES with Trainable Action Mask

In this section, we aim to further improve the exploration efficiency of the guided meta ES method by introducing a novel trainable action mask (TAM) technique, which brings in the physical knowledge of power systems to pinpoint the optimal control actions.

IV-A Embedding physics knowledge through TAM

While the guided ES algorithm is much better than basic ES algorithms, it still suffers from exploration inefficiency issues when applied to high-dimensional control problems. One way to overcome this obstacle is to incorporate a physics-informed action mask component into the algorithm, which will filter out impossible or unfavorable actions and prevent the algorithm from conducting unnecessary explorations [williams2017hybrid].

The action mask makes use of existing physical knowledge. In the case of FIDVR mitigation, the voltage performance criterion (such as the one in Fig. 1) can be regarded as prior knowledge and used to accelerate the training and improve control performance through a simple hand-crafted action mask, as illustrated in Fig. 3. In the figure, the time-sequential observations first go through a long short-term memory (LSTM) layer, then two fully-connected (FC) layers. The output from the FC layers is then masked by an action mask vector to get the final control actions. The function of the LSTM layer is to learn the temporal correlations of the system observations over a long time window. In the LSTM layer, a cell state is designed to capture and maintain the historical input, which avoids stacking all the historical data as one single input and thus reduces the dimension of the input data.

The action mask is constructed as follows. The action mask has the same dimension as the control action. At each time step, for each controllable bus, if the observed voltage magnitude is above the stability criterion, no action is required and a zero is placed at the corresponding position in the mask; otherwise, a one is placed there. Next, the action generated by the policy network is multiplied element-wise by this mask: positions with zeros eliminate the corresponding load shedding actions, since they are unnecessary, and positions with ones keep the load shedding actions.
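The construction above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the sample voltages, and the single scalar criterion of 0.95 p.u. are assumptions chosen for the example.

```python
from typing import List


def fixed_action_mask(voltages: List[float], criterion: float) -> List[int]:
    """Return 0 where the bus voltage is above the criterion (no action
    needed) and 1 where load shedding may still be required."""
    return [0 if v > criterion else 1 for v in voltages]


def apply_mask(actions: List[float], mask: List[int]) -> List[float]:
    """Element-wise product: masked positions drop their shedding action."""
    return [a * m for a, m in zip(actions, mask)]


# Illustrative values only (per-unit voltages at four controllable buses).
voltages = [0.97, 0.88, 0.96, 0.62]
raw_actions = [0.2, 0.2, 0.1, 0.2]   # load shedding proposed by the policy
mask = fixed_action_mask(voltages, criterion=0.95)
masked_actions = apply_mask(raw_actions, mask)
```

Here the first and third buses are already above the assumed criterion, so their proposed shedding is zeroed out while the other two actions pass through unchanged.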

Note that in the above hand-crafted action mask, the mask settings are set according to a predefined, fixed performance or stability criterion and generally remain fixed for all scenarios. However, considering that the power system operation scenarios can vary significantly from one to another, for instance, with different loading conditions, a fixed mask is unlikely to be the optimal solution for a wide range of operation scenarios.

Fig. 3: Regulating the neural network output with a fixed action mask

Based on the above discussions, we propose a TAM technique to develop a learnable, adaptive criterion and to obtain a more flexible and generalized control strategy. An illustrative explanation of the TAM technique is shown in Fig. 4, and it can be described by the following mathematical expressions:

$$(a_t, v^{c}_t) = \pi_{\theta}(s_t, c) \tag{10}$$

$$\mathrm{TAM}_t = \mathbb{1}\left(V_t < v^{c}_t\right) \tag{11}$$

$$\tilde{a}_t = a_t \odot \mathrm{TAM}_t \tag{12}$$

where $a_t$ and $v^{c}_t$ are the action and the learned criterion based on the current state $s_t$ and the current latent variable $c$. The TAM is generated by comparing the voltage magnitude $V_t$ at each bus with the voltage criterion $v^{c}_t$, as shown by (11). The action is filtered by conducting an element-wise multiplication with TAM, as shown by (12).

As can be seen from the above process, at each time step, a specified voltage criterion is generated based on the current state and the operation scenario information provided by the latent variable. Compared with the fixed action mask, the TAM is flexibly adjusted as the states vary, and the control actions are filtered accordingly. In the TAM method, the physical knowledge is introduced by defining an upper bound and a lower bound for the learnable criterion, which reasonably reduces the search space and facilitates the training process.
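A minimal sketch of this process is given below. It makes two simplifying assumptions that are not from the paper: the learnt criterion is reduced to a single scalar per time step, and the physical bounds are enforced by clipping. All names and numbers are illustrative.

```python
from typing import List, Tuple


def trainable_action_mask(voltages: List[float], learnt_criterion: float,
                          lower: float, upper: float) -> Tuple[List[int], float]:
    """Bound the learnt criterion with physical limits, then compare each
    bus voltage against it to build the binary mask."""
    criterion = min(max(learnt_criterion, lower), upper)  # assumed clipping
    mask = [1 if v < criterion else 0 for v in voltages]
    return mask, criterion


def filter_actions(actions: List[float], mask: List[int]) -> List[float]:
    """Element-wise multiplication of the actions with the mask."""
    return [a * m for a, m in zip(actions, mask)]


voltages = [0.97, 0.93, 0.80]
actions = [0.2, 0.2, 0.2]

# A stricter learnt criterion (clipped to the upper bound) masks fewer actions.
mask_strict, crit_strict = trainable_action_mask(voltages, 0.99, 0.92, 0.96)
# A looser learnt criterion (clipped to the lower bound) masks more actions.
mask_loose, crit_loose = trainable_action_mask(voltages, 0.90, 0.92, 0.96)

filtered_strict = filter_actions(actions, mask_strict)
filtered_loose = filter_actions(actions, mask_loose)
```

The same voltages yield different masks depending on the criterion the policy emits, which is the flexibility a fixed mask lacks, while the bounds keep the criterion inside a physically reasonable range.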

Fig. 4: Illustration of the trainable action mask technique

IV-B Physics-informed Guided Meta ES with TAM

The complete physics-informed guided meta ES method with TAM for grid voltage control to mitigate FIDVR is shown in Algorithm 3. The algorithm is composed of three major parts: 1) generate the latent variable (lines 6-8); 2) perturb the policy parameters and get the associated rewards (lines 9-20); 3) update the policy parameters and the surrogate gradient subspace (lines 21-24). In the second part, the FIDVR simulations under different perturbation directions are run in parallel. More specifically, each iteration evaluates a set of perturbation directions, and under each perturbation direction a batch of power flow cases is simulated to obtain the associated rewards. These power flow cases are independent and can be run in parallel; the perturbation directions are also independent of one another and are likewise simulated in parallel. This fully parallelized framework substantially accelerates the convergence of the algorithm, especially for extremely large-scale power system simulations [huang2021accelerated_DRL].
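The two levels of independence described above can be illustrated with a small sketch. The rollout function here is a hypothetical stand-in for one dynamic simulation, and a thread pool is used only to keep the example self-contained; a real deployment on a cluster would more likely distribute rollouts across processes or nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
from typing import List


def simulate_rollout(direction_id: int, case_id: int) -> float:
    """Hypothetical stand-in for one FIDVR dynamic simulation; returns a reward."""
    return -float(direction_id + case_id)


def evaluate_in_parallel(n_directions: int, n_cases: int,
                         max_workers: int = 4) -> List[float]:
    """Run every (direction, case) rollout concurrently, then average the
    rewards over the cases of each perturbation direction."""
    jobs = list(product(range(n_directions), range(n_cases)))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the order of the submitted jobs.
        rewards = list(pool.map(lambda job: simulate_rollout(*job), jobs))
    averages = []
    for d in range(n_directions):
        per_case = rewards[d * n_cases:(d + 1) * n_cases]
        averages.append(sum(per_case) / n_cases)
    return averages


average_rewards = evaluate_in_parallel(n_directions=3, n_cases=2)
```

Because no rollout depends on another, the only synchronization point is the averaging step at the end of the iteration.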

The major difference between the above physics-informed guided meta ES method with TAM and the guided meta ES method is that, in the former, the policy network outputs not only the action but also the voltage criterion, which is later used to regulate the NN output (i.e., to mask unnecessary actions), as shown by lines 14-16. This augmentation of the algorithm leads to a remarkable improvement in learning performance with negligible extra computational effort, since the voltage criterion only adds a few additional output dimensions to the policy network. In the next section, we further test and demonstrate the advantages of the proposed physics-informed ES method through comparative studies.

1:  Initialize the learning rate, the decay rate, the weight factor, the noise standard deviation, the total number of perturbation directions, the number of surrogate gradients to use, the number of top-performing directions, the number of power flow cases to simulate in each iteration, and the policy parameters
2:  Initialize the latent variable c for each training power flow case
3:  Initialize the surrogate gradient buffer
4:  for each training iteration do
5:     Sample a batch of power flow cases
6:     if the iteration index is a multiple of the latent-variable update interval then
7:        Update the latent variable c by maximizing the reward
8:     end if
9:     Generate the perturbation directions:
10:     sample two independent zero-mean Gaussian noise vectors
11:     for each perturbation direction do
12:        Generate a pair of perturbed policy parameters
13:        for each sampled power flow case do
14:           Generate the control action and the learnt criterion for each of the two perturbed policies
15:           Generate the binary action mask vector for each perturbed policy based on the observed voltage levels and the learnt criterion
16:           Apply the action mask to the control actions and collect the rewards of the two perturbed policies
17:        end for
18:        Calculate the average reward of each perturbed policy over the sampled cases
20:     end for
21:     Select the top-performing directions with the largest rewards and update the policy parameters with the surrogate gradient
24:     Store the surrogate gradient in the buffer and update the surrogate gradient subspace
25:     Update the learning rate and the noise standard deviation with the decay rate
26:  end for
Algorithm 3 Physics-informed Guided Meta ES for Power System Control
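To make the perturb, evaluate, and update loop more concrete, the sketch below runs a stripped-down guided ES on a toy objective. It is not the authors' code and rests on several assumptions: perturbations follow the commonly used guided ES rule that mixes isotropic noise with noise confined to the subspace spanned by recent surrogate gradients; the update is a plain antithetic estimate over the top-performing directions without reward normalization; and the latent variable, the TAM, and the power system simulation are all omitted. The weight factor 0.5 and decay rate 0.998 follow Table IV, while the remaining values are arbitrary choices for the toy problem.

```python
import math
import random


def orthonormalize(vectors):
    """Gram-Schmidt: orthonormal basis for the span of the stored gradients."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            proj = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - proj * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > 1e-12:
            basis.append([wi / norm for wi in w])
    return basis


def guided_perturbation(n, basis, alpha, sigma, rng):
    """Mix isotropic noise with noise restricted to the surrogate subspace."""
    delta = [sigma * math.sqrt(alpha / n) * rng.gauss(0.0, 1.0) for _ in range(n)]
    for b in basis:
        scale = sigma * math.sqrt((1.0 - alpha) / len(basis)) * rng.gauss(0.0, 1.0)
        delta = [d + scale * bi for d, bi in zip(delta, b)]
    return delta


def es_iteration(theta, reward_fn, basis, alpha, sigma, lr, n_dirs, top_b, rng):
    """One antithetic ES update that keeps only the top_b directions."""
    n = len(theta)
    trials = []
    for _ in range(n_dirs):
        delta = guided_perturbation(n, basis, alpha, sigma, rng)
        r_plus = reward_fn([t + d for t, d in zip(theta, delta)])
        r_minus = reward_fn([t - d for t, d in zip(theta, delta)])
        trials.append((r_plus, r_minus, delta))
    trials.sort(key=lambda trial: max(trial[0], trial[1]), reverse=True)
    grad = [0.0] * n
    for r_plus, r_minus, delta in trials[:top_b]:
        for i in range(n):
            grad[i] += (r_plus - r_minus) * delta[i] / top_b
    return [t + lr * g for t, g in zip(theta, grad)], grad


def train(reward_fn, theta, iterations=60, n_dirs=8, top_b=4, k=2,
          alpha=0.5, sigma=0.1, lr=2.0, decay=0.998, seed=0):
    rng = random.Random(seed)
    buffer = []  # the k most recent surrogate gradients
    for _ in range(iterations):
        basis = orthonormalize(buffer)
        theta, grad = es_iteration(theta, reward_fn, basis, alpha, sigma,
                                   lr, n_dirs, top_b, rng)
        buffer = (buffer + [grad])[-k:]
        lr *= decay      # decay the learning rate
        sigma *= decay   # decay the exploration noise
    return theta


def toy_reward(theta):
    """Toy objective with its maximum at theta = (1, -2, 0.5, 3)."""
    target = [1.0, -2.0, 0.5, 3.0]
    return -sum((t - g) ** 2 for t, g in zip(theta, target))


start = [0.0, 0.0, 0.0, 0.0]
trained = train(toy_reward, start)
```

In the first iteration the buffer is empty, so the search is purely isotropic; once gradients accumulate, part of the sampling budget is steered toward directions that were useful in the recent past.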

V Case Studies

In this section, we will first introduce the test environment for implementing and testing the proposed methods, then we will present the simulation results and comparisons with other state-of-the-art benchmark methods to demonstrate the performance of the proposed methods in terms of training efficiency, RL agent generalization capability, control performance, and optimality.

V-A Test environment and deployment details

The proposed physics-informed guided meta ES-based learning framework is deployed on a local high-performance computing cluster of 520 nodes running a Linux operating system. Each node has a dual-socket Intel Haswell E5-2670V3 CPU with 64 GB DDR4 memory and 12 cores per socket running at 2.3 GHz. The training and testing of the algorithm are performed with the IEEE 300-bus system [huang2019_cmpldw]. The power system dynamic simulation is performed with the open-source platform RLGC [Huang2020_DRL, RLGC]. A summary of the hyper-parameters of the algorithm is shown in Table IV. Note that the policy network is constructed as a neural network with two hidden layers, one LSTM layer and one fully-connected layer, each having 32 neurons. The state is defined as a vector with 154 elements, where the first 108 elements are the bus voltage magnitudes, and the last 46 elements are the remaining load levels at the buses with controllable loads. The state vector is further concatenated with a latent context vector of 16 latent variables to form the input to the policy network. The output from the policy network is an action vector with 51 elements, where the first 46 elements are the amounts of load to shed, and the last 5 elements define the learnt voltage criterion, namely the five fixation points of the criterion in (4). Based on physical knowledge, an upper bound and a lower bound are defined for these 5 fixation points as follows: [0.7, 0.85] p.u., [0.85, 0.92] p.u., [0.92, 0.96] p.u., [0.25, 0.4] s, and [0.4, 0.6] s.
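The input and output layout described above can be checked with a short sketch. The dimensions and bounds are taken from the text; the function names are hypothetical, and enforcing the bounds by clipping is an assumption, since the text does not state how the network output is mapped into the allowed ranges.

```python
N_VOLTAGES = 108   # bus voltage magnitudes
N_LOADS = 46       # buses with controllable loads
N_LATENT = 16      # latent context variables

# Lower and upper bounds of the five fixation points (three in p.u., two in s).
CRITERION_BOUNDS = [(0.70, 0.85), (0.85, 0.92), (0.92, 0.96),
                    (0.25, 0.40), (0.40, 0.60)]


def build_policy_input(voltages, remaining_loads, latent):
    """Concatenate the 154-element state with the 16-element latent vector."""
    sizes = (len(voltages), len(remaining_loads), len(latent))
    if sizes != (N_VOLTAGES, N_LOADS, N_LATENT):
        raise ValueError("unexpected observation or latent size")
    return list(voltages) + list(remaining_loads) + list(latent)


def split_policy_output(output):
    """Split the 51-element output into load shedding amounts and the
    bounded voltage criterion (bounds enforced by clipping, an assumption)."""
    if len(output) != N_LOADS + len(CRITERION_BOUNDS):
        raise ValueError("expected 51 output elements")
    shed = list(output[:N_LOADS])
    criterion = [min(max(x, lo), hi)
                 for x, (lo, hi) in zip(output[N_LOADS:], CRITERION_BOUNDS)]
    return shed, criterion


policy_input = build_policy_input([1.0] * 108, [1.0] * 46, [0.0] * 16)
shed, criterion = split_policy_output([0.1] * 46 + [0.5, 0.9, 1.2, 0.3, 0.9])
```

With these sizes the policy network sees 170 inputs per control step, and any raw criterion value outside its range is pulled back to the nearest bound.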

In the simulation, the FIDVR events are triggered by three-phase faults at different buses, which first lead to the stalling of A/C motors and eventually cause FIDVR events. To replicate the FIDVR problem, we first model the loads as single-phase induction motors (33%) plus static loads (67%). For dynamic modeling and parameters of the single-phase induction motors, we use the performance model and parameters recommended by NERC [NERC2016]. Specifically, the motor stalling time is 0.05 s and the stalling voltage threshold is 0.5 p.u., which means the motors will operate in a stalled mode if the motor terminal voltage is depressed to 0.5 p.u. for a duration of more than 0.05 s.

For training the algorithm, 36 operation scenarios are generated by combining 4 power flow scenarios with 9 fault scenarios. The power flow scenarios vary in their generation levels and loading levels, and the faults take place at 9 different buses. Each fault is assumed to start at 1.0 s and end after 0.1 s. For testing the algorithm, 136 operation scenarios are generated by combining 4 power flow scenarios with 34 fault scenarios. Compared with the training cases, 25 more fault buses are considered during testing. In addition, each fault is assumed to start at 0.5 s and end after 0.08 s, which also differs from the training cases. The reason for applying new test scenarios is to validate the adaptability of the proposed data-driven control policy. The detailed power flow conditions for training and testing are shown in Table II. The fault locations for training and testing are shown in Table III.

Each training or test scenario is simulated for 10 seconds. During the 10-second dynamic simulation, the policy network obtains the observations (voltages and percentage of remaining loads) and the reward from the grid environment and provides control actions back to the grid environment every 0.1 s.
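The scenario counts quoted above follow from simple combinations, as the sketch below shows. The training fault buses are the ones listed in Table III; the variable names are illustrative.

```python
from itertools import product

POWER_FLOW_SCENARIOS = [1, 2, 3, 4]
TRAIN_FAULT_BUSES = [3, 5, 12, 2, 8, 15, 17, 23, 26]   # from Table III
N_TEST_FAULT_BUSES = 34
SIM_LENGTH_S = 10.0
CONTROL_STEP_S = 0.1

# Every training scenario pairs one power flow case with one fault bus.
training_scenarios = list(product(POWER_FLOW_SCENARIOS, TRAIN_FAULT_BUSES))

n_test_scenarios = len(POWER_FLOW_SCENARIOS) * N_TEST_FAULT_BUSES
n_new_fault_buses = N_TEST_FAULT_BUSES - len(TRAIN_FAULT_BUSES)
control_steps = round(SIM_LENGTH_S / CONTROL_STEP_S)
```

This gives 36 training scenarios, 136 test scenarios, 25 fault buses that appear only in testing, and 100 control steps per simulated scenario.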

| Power flow scenario | Generation | Load |
|---|---|---|
| Scenario 1 | 100% for all generators (22929.5 MW) | 100% for all loads (22570.2 MW) |
| Scenario 2 | 120% for all generators | 120% for all loads |
| Scenario 3 | 135% for all generators | 135% for all loads |
| Scenario 4 | 115% for all generators | 150% for loads in Zone 1 |
TABLE II: Power flow scenarios for training and testing
| Training | Testing |
|---|---|
| 3,5,12,2,8,15,17,23,26 | 3,5,12,2,8,15,17,23, |
TABLE III: Bus indices of fault locations for training and testing
| Parameter | 300-Bus |
|---|---|
| Policy model | LSTM+FC |
| Policy network size (hidden layers) | [32, 32] |
| Weight factor | 0.5 |
| Number of disturbances | 128 |
| Number of surrogate gradients to use | 16 |
| Top directions | 64 |
| Step size | 1 |
| Std. dev. of exploration noise | 2 |
| Decay rate | 0.998 |

TABLE IV: Hyperparameters for Guided Meta ES with TAM

V-B Comparison studies and performance metrics

In this subsection, we analyze the control performance of the proposed method for mitigating FIDVR problems and compare it with several benchmark algorithms. We define two performance metrics, namely efficiency and adaptability. Efficiency is measured by the number of training iterations needed to converge to a reward threshold: one algorithm is more efficient than another if it needs fewer training iterations to reach the threshold. The same reward threshold is used in all of the following case studies. Adaptability is measured by the average reward and the number of failed cases obtained by the algorithm in unseen test scenarios: one algorithm is more adaptive than another if it has a higher average reward and fewer failed cases during testing. A failed case is one where the control policy fails to recover the system voltage to the desired criterion, which is indicated by a reward below the large penalty value set in (3).
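The two metrics can be written down compactly as below. This is a sketch under stated assumptions: convergence is simplified to the first iteration whose average reward reaches the threshold, and the threshold, the penalty value, and the reward lists are made-up numbers used only to exercise the functions.

```python
from typing import List, Optional, Tuple


def iterations_to_threshold(training_rewards: List[float],
                            threshold: float) -> Optional[int]:
    """Efficiency: first iteration (1-based) whose reward reaches the
    threshold, or None if the threshold is never reached."""
    for iteration, reward in enumerate(training_rewards, start=1):
        if reward >= threshold:
            return iteration
    return None


def adaptability(test_rewards: List[float],
                 failure_penalty: float) -> Tuple[float, int]:
    """Adaptability: average test reward and number of failed cases, where
    a case counts as failed if its reward is below the penalty value."""
    average = sum(test_rewards) / len(test_rewards)
    failed = sum(1 for r in test_rewards if r < failure_penalty)
    return average, failed


# Illustrative numbers, not results from the paper.
efficiency = iterations_to_threshold([-900.0, -700.0, -500.0, -400.0], -500.0)
avg_reward, n_failed = adaptability([-100.0, -300.0, -1200.0, -50.0], -1000.0)
```

A method that never reaches the threshold within the training budget returns `None`, which keeps it distinguishable from a method that converges late.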

V-B1 Improvement on exploration efficiency and adaptability through guided search

We first compare the performance of the guided ES method with the ES method to evaluate the function of the surrogate gradient in terms of exploration efficiency and adaptability. Fig. 5 presents the training results and the testing results for both methods. For the ES method, we applied the ARS from our previous work [huang2021accelerated_DRL], which is an improved version of ES.

Fig. 5(a) shows the average reward for 500 training iterations, where the shaded area stands for the standard deviation over 3 random seeds. As can be seen from the figure, during training the reward curve of the guided ES method converges faster than that of the ARS method, with a larger slope from 0 to 100 iterations, and it also reaches a higher converged value with a smaller deviation range at the end of training. On average, the ARS method did not reach the reward threshold after 500 iterations, while the guided ES method reached the threshold after only 200 iterations. This is because the ARS method applies a purely random search, which brings in additional exploration effort. In contrast, the guided ES method leverages the guidance from the surrogate gradient so that it explores more in promising directions, as illustrated in Fig. 2, which greatly improves exploration and sample efficiency.

Fig. 5(b) shows the reward gained in each of the 136 test cases by the two methods. Table V lists the average test reward for the two methods. As shown in the table, the guided ES method improved the average reward by more than 50% compared with the ARS method. We further compared the number of failed cases of the two methods. As shown in the third column of Table V, guided search reduces the number of failed cases from 72 to 17, a reduction of about 76% compared with the ARS method. Therefore, we can safely conclude that the guided search based on the surrogate gradient also contributes to more adaptive control policies.

Fig. 5: Comparison of ARS with guided ES: (a) average training reward over 3 random seeds; (b) test reward for 136 new cases.
| Method | Average test reward | No. of failed cases |
|---|---|---|
| ARS | | 72 |
| Guided ES | | 17 |
| Guided meta ES | | 12 |
| Guided meta ES + mask | | 8 |
| Guided meta ES + TAM | | 3 |
TABLE V: Comparison of test results
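The failed-case counts in Table V can be turned into relative reductions and failure rates with a few lines of arithmetic. The counts and the total of 136 test cases come from the text; the helper name is illustrative.

```python
N_TEST_CASES = 136
FAILED_CASES = {
    "ARS": 72,
    "Guided ES": 17,
    "Guided meta ES": 12,
    "Guided meta ES + mask": 8,
    "Guided meta ES + TAM": 3,
}


def reduction_percent(baseline: int, value: int) -> float:
    """Relative reduction of failed cases with respect to a baseline."""
    return 100.0 * (baseline - value) / baseline


baseline = FAILED_CASES["ARS"]
reduction_vs_ars = {
    method: round(reduction_percent(baseline, failed), 1)
    for method, failed in FAILED_CASES.items() if method != "ARS"
}
failure_rate = {
    method: round(100.0 * failed / N_TEST_CASES, 1)
    for method, failed in FAILED_CASES.items()
}
```

Relative to ARS, the reductions are 76.4%, 83.3%, 88.9% and 95.8% for the four guided methods in table order, and the failure rate drops from 52.9% of the test cases for ARS to 2.2% for guided meta ES with TAM.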

V-B2 Meta learning for adaptability enhancement

The proposed physics-informed guided meta ES with TAM method is compared with three other benchmark guided search methods, namely the guided ES method, the guided meta ES method, and the guided meta ES method with mask derived from the fixed voltage criterion. Fig. 6 shows the training curves of the four methods, and Fig. 7 shows the test results of the four methods. The test results are also listed in Table V.

We first look into the function of meta learning for enhancing the adaptability of the control policy. In Fig. 6, the training curves of the guided ES method and the guided meta ES method show a similar trend. However, comparing the test rewards in Fig. 7 and Table V shows that the guided meta ES method achieves a higher average test reward with fewer failed cases. This is because the MSO strategy implemented within the guided meta ES method can quickly fine-tune the learnt control policy based on the newly extracted features from the unseen test cases, leading to a more adaptive control performance.

V-B3 Boosting efficiency and adaptability with physics-informed action mask

We further analyze the function of action mask by comparing the training and test rewards among the four methods. As shown in Fig. 6, the last two methods with an action mask outperform the first two methods without an action mask with a much higher starting point and also a higher final reward. The last two methods reach the reward threshold at the very beginning of the training, while it takes around 150 iterations for the first two methods to reach the same threshold. There is at least 80% improvement in terms of sample efficiency. Therefore, it can be safely concluded that the embedding of physics knowledge through the action mask can considerably boost the exploration efficiency of the algorithm.

With respect to adaptability, Table V shows that the last two ES methods with an action mask have higher average test rewards and fewer failed cases than the other ES methods without an action mask. This verifies the contribution of the action mask in enhancing the adaptability of the learnt control policy. In addition, it should be noted that the proposed guided meta ES with TAM outperforms all the other methods, with the highest average test reward and the fewest failed cases. Thus, a learnt voltage criterion from TAM that flexibly adjusts based on the environment works more efficiently in generating masks for action selection than a fixed voltage criterion.

Fig. 6: Comparison of training curves of guided ES methods

Fig. 7: Test results of the proposed methods and the benchmark methods

To look deeper into how the TAM helps improve the learning performance, we show one test case in which the first three methods failed and only the guided meta ES with TAM succeeded in restoring the voltage level. The voltage magnitudes of all observable buses under the four methods are shown in Fig. 8. In total there are 108 voltage curves in each figure, representing the voltage magnitudes at the 108 observable buses in the system. The dashed black voltage envelope stands for the lower security bound of the voltage. For the first three methods, the simulation ends at around 6 seconds. This is because the control policies failed to restore the system voltage to 0.95 p.u. within 5 seconds after the fault is cleared, so the simulation terminates early. For the proposed guided meta ES with TAM method, the simulation lasts for 10 seconds, which is the predefined length of an entire simulation, and all the bus voltage magnitudes are above the voltage recovery criterion envelope.

Fig. 8: Comparison of bus voltage under the four control methods

Fig. 9: Comparison of load shedding strategy and mask voltage criterion: (a) total remaining load of the system; (b) voltage criterion.

To understand why the voltage recovery strategy from our proposed method is more effective than the other three benchmark guided search methods, we further look under the hood by comparing the load shedding actions generated by the four methods, as shown in Fig. 9(a). The figure shows the total remaining load for all tested methods. A higher remaining load level indicates fewer load interruptions and is thus more desirable. As shown in the figure, the first two methods have relatively lower remaining load, which implies that unnecessary load shedding is conducted without guidance from physical knowledge, since they did not apply the mask technique for filtering actions. The third method, which utilizes a mask from a fixed voltage criterion for action filtering, has the highest remaining load level after shedding. However, the system voltage cannot be fully recovered in this case due to insufficient load shedding, which defeats the initial goal of FIDVR mitigation and voltage recovery. Our proposed guided meta ES with TAM method achieves a remaining load level in between the above methods, which indicates that it has achieved a balanced trade-off: meeting the voltage recovery goal with minimal load interruptions.

Fig. 9(b) compares the learnt voltage criterion from TAM and the fixed voltage criterion. In TAM, a voltage criterion is generated at each time step. For a complete simulation with a length of 10 s and a control time step of 0.1 s, there are 100 learnt voltage criteria. We compare the average value of the learnt voltage criteria with the fixed voltage criterion. As can be seen from the figure, immediately following the fault (between 0.6 s and 1 s), the learnt voltage criterion is higher than the fixed voltage criterion, which explains why a larger amount of load is shed in the guided meta ES method with TAM. The algorithm raises the voltage bar immediately after the fault occurs to avoid potential future voltage failures. In Fig. 8, comparing the two figures in the second row, it can be observed that with TAM the voltage rises faster and higher than with a fixed mask after the fault takes place (from 0.6 s to 1 s), due to the larger amount of load shedding. Therefore, we can safely conclude that a learnt voltage criterion from TAM can lead to more reliable and adaptive load shedding strategies.

V-B4 Computation time for implementation

We also compare the computational time of the four methods for determining control actions, and the results are shown in Table VI. The FIDVR event duration for each testing scenario is 10 s. As discussed in Section V-A, all four methods obtain the observations from the grid simulation environment and provide control actions back to it every 0.1 s. Thus, there are a total of 100 control steps with a 0.1 s time step for each testing scenario. The “average solution time” listed in Table VI is the average “total computation time” over all 100 control steps of a testing scenario. For example, the proposed guided meta ES with TAM method takes 0.63 s in total, i.e., 0.0063 s on average per control step, which is well within the 0.1 s control interval. Therefore, our proposed method meets the real-time operation requirement and is suitable for fast, short-term voltage recovery.

| Method | Average solution time (s) |
|---|---|
| Guided ES | 0.51 |
| Guided meta ES | 0.49 |
| Guided meta ES + mask | 0.62 |
| Guided meta ES + TAM | 0.63 |
| MPC | 64.34 |
TABLE VI: Comparison of computational time
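The real-time argument can be verified directly from Table VI by dividing each total solution time by the 100 control steps and comparing the result with the 0.1 s control interval. The variable names below are illustrative.

```python
CONTROL_STEPS = 100
CONTROL_INTERVAL_S = 0.1

TOTAL_SOLUTION_TIME_S = {
    "Guided ES": 0.51,
    "Guided meta ES": 0.49,
    "Guided meta ES + mask": 0.62,
    "Guided meta ES + TAM": 0.63,
    "MPC": 64.34,
}

# Average computation time per control step for each method.
per_step_time = {
    method: total / CONTROL_STEPS
    for method, total in TOTAL_SOLUTION_TIME_S.items()
}

# A method is real-time capable if one step fits inside the control interval.
meets_real_time = {
    method: t < CONTROL_INTERVAL_S for method, t in per_step_time.items()
}

mpc_slowdown = (TOTAL_SOLUTION_TIME_S["MPC"]
                / TOTAL_SOLUTION_TIME_S["Guided meta ES + TAM"])
```

All four ES variants need roughly 0.005 to 0.006 s per step, whereas MPC needs about 0.64 s per step, which exceeds the 0.1 s interval and is about 102 times the time of guided meta ES with TAM.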

V-B5 Comparison with MPC-based voltage control method

Finally, to further demonstrate the effectiveness of our method, we compare it with the MPC method that has been used for FIDVR mitigation control in the literature. The test results of the MPC method are shown in the last row of Table V and also in Fig. 7. As shown in the table and the figure, the guided meta ES method with TAM and the MPC method have the same number of failed cases, and the former achieves an average test reward very close to that of the latter. The voltage magnitudes under the two methods can be further compared in Fig. 8. As can be seen from the figure, the voltage profiles obtained with both methods are very similar. The same is true for the remaining load levels (or total load shedding amounts). The MPC method relies on complete knowledge and models of the power grid environment; hence, theoretically, it can reach a (near) globally optimal solution. However, the assumption of accurate modeling of the power grid environment is untenable in reality, especially for large-scale power systems with extreme complexity. Furthermore, the computational burden of the MPC method makes it impractical for real-time fast voltage control. As shown in Table VI, the solution time of the MPC method is about 100 times that of the proposed guided meta ES method with TAM (64.34 s versus 0.63 s, or roughly 0.64 s per control step), which does not meet the “real-time” requirement for the load shedding actions. This comparison study shows that our method can quickly respond to emergencies that are unseen during training with a near-optimal voltage recovery policy, indicating its good potential for real system implementations.

VI Conclusions

In this paper, we propose a physics-informed guided meta ES algorithm to mitigate the FIDVR problem. The proposed algorithm makes use of guidance from physics knowledge via the TAM technique and the surrogate gradient to conduct more efficient exploration. The physics embedding also helps achieve more robust solutions, particularly for unseen scenarios. In addition, the algorithm is combined with meta-learning to gain adaptability. Simulation results on the IEEE 300-bus system show that the proposed algorithm outperforms other state-of-the-art model-free control algorithms with faster convergence during training (i.e., 80% sample efficiency improvement over the basic ES method) and a more adaptive control strategy for unseen fault scenarios (i.e., reducing the number of failed cases from 72 to 3 out of 136 test cases), and that it also meets the real-time requirement with a solution time of 0.006 s per control step.

For future research, given the good results for the FIDVR problem, applying the developed method to other power system control applications is promising and worthwhile. Also, the trade-off between the guided direction and the random search in the guided ES method deserves more investigation for better learning performance. Last but not least, it is of great interest to study the transferability of the learnt control policy among different test systems, which could lead to more general and practical algorithm implementations.