# PowerGym: A Reinforcement Learning Environment for Volt-Var Control in Power Distribution Systems

We introduce PowerGym, an open-source reinforcement learning environment for Volt-Var control in power distribution systems. Following OpenAI Gym APIs, PowerGym targets minimizing power loss and voltage violations under physical networked constraints. PowerGym provides four distribution systems (13Bus, 34Bus, 123Bus, and 8500Node) based on IEEE benchmark systems and design variants for various control difficulties. To foster generalization, PowerGym offers a detailed customization guide for users working with their distribution systems. As a demonstration, we examine state-of-the-art reinforcement learning algorithms in PowerGym and validate the environment by studying controller behaviors.

## 1 Introduction

Volt-Var control refers to the control of voltage (Volt) and reactive power (Var) in power distribution systems to achieve healthy operation of the systems. By optimally dispatching voltage regulators, switchable capacitors, and controllable batteries, Volt-Var control helps to flatten voltage profiles and reduce power losses across the power distribution systems. It is hence rated as the most desired function for power distribution systems  (Borozan et al., 2001).

At the core of Volt-Var control is an optimization of voltage profiles and power losses governed by networked constraints. Represent a power distribution system as a tree graph $(\mathcal{N}, \xi)$, where $\mathcal{N}$ is the set of nodes (buses) and $\xi$ is the set of edges (lines and transformers). Denote node $i$ as node $j$'s parent. The physical networked constraints are given by (Farivar et al., 2013):

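$$
\begin{aligned}
p_j &= p_{ij} - R_{ij}\,\ell_{ij} - \sum_{(j,k)\in\xi} p_{jk}, \\
q_j &= q_{ij} - X_{ij}\,\ell_{ij} - \sum_{(j,k)\in\xi} q_{jk}, \\
v_j^2 &= v_i^2 - 2\,(R_{ij}\,p_{ij} + X_{ij}\,q_{ij}) + (R_{ij}^2 + X_{ij}^2)\,\ell_{ij}, \\
\ell_{ij} &= (p_{ij}^2 + q_{ij}^2)/v_i^2,
\end{aligned}
\tag{1}
$$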

where $p, q$ are the active and reactive power consumed at nodes or edges, $v, \ell$ denote the bus voltage magnitude and the squared current magnitude, and $R, X$ are the resistance and reactance. Capital letters stand for given parameters; all other symbols are variables. The constraints in Eq. (1) contain quadratic equalities, making any optimization over them nonconvex. Researchers have either tightly relaxed the constraints under strict nodal injection assumptions (Gan et al., 2014) or used a linearization that assumes the distribution system operates at a fixed operating point (Yang et al., 2016). Both methods require tremendous effort to trim and convert a model that is readily available in commercial circuit simulation software into a specific optimization formulation. Together with the many integer decision variables of controllable devices not shown above, the Volt-Var control problem becomes extremely hard to scale to a system with thousands of buses, a typical size for distribution systems.
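To make Eq. (1) concrete, the following minimal sketch evaluates the four relations for a single line $(i, j)$ in per-unit quantities. The function name, argument names, and numerical values are illustrative assumptions and are not taken from PowerGym or OpenDSS.

```python
def branch_flow_step(v_i, p_ij, q_ij, r_ij, x_ij, downstream_p=0.0, downstream_q=0.0):
    """Evaluate Eq. (1) for one line (i, j) of a radial feeder (per-unit values)."""
    # Squared current magnitude on the line: l_ij = (p_ij^2 + q_ij^2) / v_i^2
    l_ij = (p_ij**2 + q_ij**2) / v_i**2
    # Squared voltage magnitude at the child bus j
    v_j_sq = v_i**2 - 2.0 * (r_ij * p_ij + x_ij * q_ij) + (r_ij**2 + x_ij**2) * l_ij
    # Power consumed at node j: sending-end flow minus line loss minus flows to children
    p_j = p_ij - r_ij * l_ij - downstream_p
    q_j = q_ij - x_ij * l_ij - downstream_q
    return {"l_ij": l_ij, "v_j": v_j_sq**0.5, "p_j": p_j, "q_j": q_j}


# Hypothetical line: 0.5 p.u. active and 0.2 p.u. reactive power leaving bus i
result = branch_flow_step(v_i=1.0, p_ij=0.5, q_ij=0.2, r_ij=0.01, x_ij=0.02)
print(result)
```

The quadratic terms in `l_ij` and `v_j_sq` are the equalities that make the optimization nonconvex once voltages and flows become decision variables.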

With recent breakthroughs in deep reinforcement learning (RL), power system researchers have tried using RL for power system operation. One such example is learning to operate a transmission system in the L2RPN competition (Marot et al., 2021). Though transmission systems are fundamentally different from distribution systems in both network topology (looped vs. radial) and typical problem types (dynamic vs. quasi-static), RL has shown promising results (Yoon et al., 2020) in operating transmission systems. While there exist many papers on RL in distribution systems, researchers have used their own environments. One of the many reasons is the regulated and conservative nature of the power engineering industry: because distribution systems are safety-critical, real-life topologies and control settings are proprietary. To encourage power systems researchers to make fair comparisons of the developed RL algorithms without concern about leaking proprietary information, we have developed PowerGym, a Gym-like environment (Brockman et al., 2016) for optimizing Volt-Var control using IEEE benchmark test systems (PES, 2010; Dugan and Arritt, 2010). It further serves as a base for power systems engineers to implement RL algorithms on their proprietary systems with minor customization.

PowerGym supports Gym-like usages such as reset, step, random action sampling, and visualization; hence it can be readily used with most existing RL algorithms. On top of the Gym design, PowerGym provides a wide range of environment variations of the IEEE benchmark systems. These variations affect the environment’s physical constraints and ultimately the control difficulties, allowing users to choose an environment that is either easier to control but more abstract, or harder to control yet more realistic. PowerGym is safe for parallel execution up to some file constraints (discussed in the environment design section), so users can run parallel algorithms such as A3C (Mnih et al., 2016).

Our contributions are as follows. First, we design PowerGym to help power system researchers benchmark their controls and RL algorithms. To the best of the authors’ knowledge, this is the first publicly accessible environment with a focus on Volt-Var control in power distribution systems. Second, we consider environment usability and extensibility in PowerGym. We provide variations of the environments for different control difficulties and a detailed customization guide. Finally, we showcase the applicability of PowerGym with two popular RL algorithms, PPO (Schulman et al., 2017) and SAC (Haarnoja et al., 2018), for validation purposes. We also explain how the controllers work through a case study.

## 2 Related Work

The application of RL to control and manage various aspects of power systems is a well-studied topic in the literature (Zhang et al., 2020). In the past few years, there has been renewed interest in this topic due to algorithmic advancements, allowing RL to go beyond tabular settings and scale to large state and action spaces using neural networks as expressive function approximators. Examples of such work include using RL to reduce operational cost and optimize the handling of daily loads (Sun et al., 2020), power systems stability control (Ernst et al., 2004) and learning a control policy that is adaptable to stochastic renewable energy generation (Yan and Xu, 2019). In the context of Volt-Var control, there have also been various studies that leverage RL to optimize various aspects of the problem, such as emphasizing the constraint satisfaction (Wang et al., 2019, 2020) or the sample efficiency and scalability (Zhang et al., 2021). Nevertheless, most of these results are based on non-standardized implementations of various systems with the environment tuned to the specifics of the problem. This makes it difficult to evaluate and compare results and remains a crucial challenge in these areas, as highlighted by Gao and Yu (2021).

In other domains of deep RL, researchers have begun to recognize the importance of having high-quality benchmark environments to facilitate the research into RL application. This has led to the development of environments such as OpenAI Gym for training RL agents to play a variety of games and for robotic control (Brockman et al., 2016), Safety Gym for safe RL exploration (Ray et al., 2019) and ns3-gym for training RL in networks research problems (Gawłowicz and Zubow, 2018). A more closely related benchmark includes the Grid2Op environment, which allows researchers to train an RL agent to address the challenges of unpredictable power generation from renewable energy sources and robustness to power systems’ topology change (Marot et al., 2020). Nevertheless, a standardized benchmark for Volt-Var control environments with the flexibility of instantiating systems of various sizes is still lacking. To this end, we hope that this work serves to fill the gap in the community as a unified benchmark environment for RL research in Volt-Var control applications.

## 3 Reinforcement Learning Preliminaries

A reinforcement learning (RL) environment is often modeled as a Markov Decision Process (MDP), with two common variants: the infinite-horizon discounted MDP $M_d = (\mathcal{S}, \mathcal{A}, P, r, \gamma)$ and the finite-horizon episodic MDP $M_e = (\mathcal{S}, \mathcal{A}, \{P_i\}, \{r_i\}, H)$. $\mathcal{S}$, $\mathcal{A}$ are the state and action spaces. $P$/$\{P_i\}$, $r$/$\{r_i\}$ are the stationary/non-stationary state transition function and reward function. $\gamma$, $H$ are the discount factor and the horizon. In this paper, we assume a stationary state transition in $M_e$: $P_i = P$ for $i = 0, \dots, H-1$. The goal of RL is to find a policy that maximizes the cumulative rewards:

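$$
\begin{aligned}
R_{M_d}(\pi) &= \mathbb{E}\Big[\sum_{i=0}^{\infty} \gamma^{i}\, r(s_i, a_i, s_{i+1}) \,\Big|\, a_i \sim \pi(\cdot \mid s_i)\Big], \\
R_{M_e}(\{\pi_i\}) &= \mathbb{E}\Big[\sum_{i=0}^{H-1} r_i(s_i, a_i, s_{i+1}) \,\Big|\, a_i \sim \pi_i(\cdot \mid s_i)\Big].
\end{aligned}
\tag{2}
$$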

It is well-known that the optimal policy of $M_d$ is stationary while that of $M_e$ is non-stationary (Agarwal et al., 2019)[Chapter 1]; hence in Eq. (2), the policy is denoted as $\pi$ in $M_d$ and as $\{\pi_i\}$ in $M_e$. To reduce the model complexity, most RL experiments are formulated as $M_d$ if stationarity holds. However, $M_e$ is unavoidable when the reward is non-stationary. Depending on the application scenario, we implement both stationary and non-stationary rewards, which will be discussed in the next section.
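As a small numerical illustration of Eq. (2), the sketch below computes the two kinds of return for one sampled reward sequence. The infinite sum is necessarily truncated to the rewards that were actually observed, and the reward values and discount factor are made up for the example.

```python
def discounted_return(rewards, gamma):
    """Truncated discounted return: sum_i gamma^i * r_i over the observed rewards."""
    total = 0.0
    # Accumulate backwards so each step applies one more factor of gamma
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def episodic_return(rewards):
    """Finite-horizon return: plain sum of the rewards over the H steps."""
    return float(sum(rewards))


rewards = [-1.0, -0.5, -0.25]  # hypothetical per-step rewards (losses are negative)
print(discounted_return(rewards, gamma=0.9))
print(episodic_return(rewards))
```

The expectation in Eq. (2) would be estimated by averaging such returns over many sampled trajectories.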

## 4 A Volt-Var Control Environment

### 4.1 Power Distribution Systems and Objectives

Power distribution systems are networks for delivering electric power from the power transmission system to end consumers. Due to distribution losses, voltage drops along the power delivery lines, possibly causing voltage violations and power losses. Thus, Volt-Var optimization is required. In power distribution systems, the Volt-Var optimization problem is to control devices (e.g., regulators, capacitors, and batteries, represented as $x$) under constraints. $x$ affects voltage, resistance, reactance, and power in the physical networked constraints, so Eq. (1) is a constraint of Eq. (3).

$$
\min_{x} \; f_{\mathrm{volt}}(x) + f_{\mathrm{ctrl}}(x) + f_{\mathrm{power}}(x) \quad \text{s.t. Eq. (1) and device constraints.} \tag{3}
$$

The Volt-Var optimization’s objective is a combination of three losses: $f_{\mathrm{volt}}$ for voltage violation, $f_{\mathrm{ctrl}}$ for control error, and $f_{\mathrm{power}}$ for power loss. The device constraints ensure that the devices operate within their physical limits. While Eq. (3) is time-independent, in practice we have to solve it at every time step. Solving a sequence of Volt-Var optimizations, Eq. (3), becomes a Volt-Var control problem. In short, we call a problem Volt-Var optimization if it solves a single instance of Eq. (3) and Volt-Var control if it solves a sequence of Eq. (3) connected by device operation constraints over time.

We use a Python version of OpenDSS to solve for the physical networked constraints of Eq. (3). OpenDSS is an open-source power flow solver developed by EPRI. It takes the nodal powers $p_j, q_j$ in Eq. (1) as known and solves the nonlinear equations of voltages and currents using fixed-point iterations.

Shifting the focus to elements in power distribution systems, we define each element to be multi-phase following the fact that power is usually delivered in multi-phases. As shown in Figure 1, a (multi-phase) node, or a bus, can be a pure connection point or include node objects like loads, capacitors, or batteries. A (multi-phase) edge is formed by a line, a transformer, or a regulator. Loads model the power consumption from the consumers. Capacitors provide reactive power and batteries are energy (active power) storage. Lines imitate the connection from one (multi-phase) node to another subject to Ohm’s law. Transformers and regulators are for voltage adjustment from one node to another.

### 4.2 Volt-Var Control as An RL Problem

In this subsection, we describe the Volt-Var control problem in the language of RL. We consider a finite-horizon MDP with $H = 24$ steps as the horizon because we focus on daily control with a control frequency of one action per hour. Still, the horizon is a variable and hence changeable. We will discuss this in the sections on environment registration and experiments. The following paragraphs give the details of the observation space, the action space, the state transition, and the reward function in PowerGym.

#### 4.2.1 Observation and Action Spaces

The observation and action spaces, as summarized in Tables 1 and 2, are products of discrete and continuous variables. The discrete variables come from the physical constraints of the controllers; for example, a capacitor is either on or off, a regulator operates in a finite number of modes (tap numbers), and a discrete battery only has a finite number of discharge powers. The continuous variables are normalized into bounded ranges; for example, the (per-unit) voltage is expressed in units of the base voltage of a bus, and is hence usually bounded in [0.8, 1.2]. The battery’s state-of-charge (charge / max charge), or the soc, is in [0.0, 1.0]. The continuous battery’s normalized discharge power (discharge power / max discharge power) is in [-1.0, 1.0], where negative values mean charging and positive values mean discharging.

Depending on the device constraints, a battery control can be either discrete or continuous. This affects the action space (Table 2) since the action representations are different. Still, because we can post-process the observation after receiving it, in the observation space (Table 1), we unify the representation of discrete and continuous batteries by mapping the discrete battery’s discharge power to the normalized form.

Whether to discretize the battery makes the action space either multi-discrete or a product of multi-discrete and continuous spaces. Either way, the problem is hard for a tabular policy or a policy that encodes the actions as one-hot vectors (e.g., DDQN (Van Hasselt et al., 2016)) because the size of the discrete part of the action space scales exponentially in the number of controllers. Also, the possibility of mixing discrete and continuous actions makes it harder to design the policy.
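The exponential growth can be checked with a few lines of arithmetic. The sketch below counts joint actions when every controller is discrete, assuming two actions per capacitor and 33 actions per regulator and battery (the defaults quoted later in the registration example); the device counts are hypothetical and only illustrate the growth rate.

```python
def discrete_action_count(n_cap, n_reg, n_bat, reg_act_num=33, bat_act_num=33):
    """Number of joint actions when all controllers are discrete."""
    # Each capacitor is on/off; each regulator and battery picks from a finite menu
    return (2 ** n_cap) * (reg_act_num ** n_reg) * (bat_act_num ** n_bat)


# Hypothetical device counts, chosen only to show how fast the count grows
small = discrete_action_count(n_cap=2, n_reg=1, n_bat=1)  # 4 * 33 * 33 = 4356
large = discrete_action_count(n_cap=4, n_reg=7, n_bat=4)
print(small, large)
```

A one-hot encoding needs one output per joint action, so it becomes impractical well before the larger count, whereas a multi-discrete policy only needs one output head per device.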

#### 4.2.2 State Transition

We now describe the state transition function in PowerGym. With the descriptions in Tables 1 and 2, the next state $s'$ is represented as

$$
s' = [\,\mathrm{Vols}(s,a),\ \mathrm{cap}(a),\ \mathrm{reg}(a),\ \mathrm{soc}(s,a),\ \mathrm{dis}(s,a)\,]. \tag{4}
$$

$\mathrm{Vols}(s,a)$ is the next set of voltages and depends on the action $a$ and the stochasticity of loads, which we model using the load profiles (discussed in the environment design section). $\mathrm{cap}(a)$ and $\mathrm{reg}(a)$ are the next statuses of capacitors and regulators. $\mathrm{soc}(s,a)$ and $\mathrm{dis}(s,a)$ are the next soc’s and discharge powers of batteries. Both of them depend on the current state because a battery’s soc cannot go beyond full charge ($\mathrm{soc} = 1$) or depletion ($\mathrm{soc} = 0$). To enforce this, we project the attempted discharge power in an action to the allowed range based on the current soc, making $\mathrm{dis}$ a function of $(s, a)$.
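The projection step can be sketched as a simple clamp. The battery model below (rated power in kW, capacity in kWh, one-hour steps, lossless charging) and all names are assumptions made for illustration; PowerGym's actual battery model may differ.

```python
def project_discharge(soc, attempted, max_power_kw, capacity_kwh, dt_h=1.0):
    """Clamp a normalized discharge in [-1, 1] so the next soc stays in [0, 1].

    Positive values discharge the battery, negative values charge it.
    """
    energy_per_unit = max_power_kw * dt_h  # energy moved in one step at |discharge| = 1
    # Cannot discharge more energy than is currently stored
    upper = min(1.0, soc * capacity_kwh / energy_per_unit)
    # Cannot charge beyond the remaining headroom
    lower = max(-1.0, -(1.0 - soc) * capacity_kwh / energy_per_unit)
    allowed = min(max(attempted, lower), upper)
    next_soc = soc - allowed * energy_per_unit / capacity_kwh
    return allowed, next_soc


# A full battery asked to charge: the attempt is projected to zero
print(project_discharge(soc=1.0, attempted=-0.5, max_power_kw=200.0, capacity_kwh=1000.0))
# A nearly empty battery asked to discharge at full power: only part is allowed
print(project_discharge(soc=0.1, attempted=1.0, max_power_kw=200.0, capacity_kwh=1000.0))
```

Because the allowed range depends on `soc`, the realized discharge is a function of both the state and the action, which is why Eq. (4) writes it as dis(s, a).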

#### 4.2.3 Reward Function

We implement the objective of a Volt-Var problem, Eq. (3), into a reward function as follows:

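$$
r(s, s', i) = -f_{\mathrm{volt}}(s') - f_{\mathrm{ctrl}}(s, s', i) - f_{\mathrm{power}}(s') \tag{5}
$$

$$
f_{\mathrm{power}}(s') = w_{\mathrm{power}}\,\frac{\mathrm{PowerLoss}(s')}{\mathrm{TotalPower}(s')} \tag{6}
$$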

$s$ is a concatenation of all observations in the current step, $s'$ is that in the next step, and $i$ is the episode step. The dependency on the step $i$ implies the reward could be non-stationary. The power loss, Eq. (6), is the ratio of the overall power loss to the total power. The voltage violation and control error are expressed in Eq. (7) and (8).
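A short worked example of Eq. (5) and (6) follows. The weight of 10.0 matches the `power_w` value shown later in the registration example; the power figures and the values of the other two loss terms are made up for illustration.

```python
def power_loss_term(power_loss_kw, total_power_kw, w_power=10.0):
    """Eq. (6): weighted ratio of the overall power loss to the total power."""
    return w_power * power_loss_kw / total_power_kw


def reward(f_volt, f_ctrl, f_power):
    """Eq. (5): the reward is the negated sum of the three losses."""
    return -f_volt - f_ctrl - f_power


# Hypothetical operating point: 50 kW lost out of 2500 kW delivered
f_power = power_loss_term(power_loss_kw=50.0, total_power_kw=2500.0)
print(f_power)  # 10.0 * 50 / 2500 = 0.2
print(reward(f_volt=0.04, f_ctrl=0.18, f_power=f_power))
```

Since all three losses are non-negative, the reward is at most zero, and a better controller drives it towards zero.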

Eq. (5) is expressed as $r(s, s', i)$, not $r(s, a, i)$, because the action $a$ is a part of the next state $s'$. Mathematically, $r(s, s', i)$ and $r(s, a, i)$ are equivalent because $s'$ is a function of $(s, a)$ under the state transition function.

The voltage violation, Eq. (7), is a sum of worst-case voltage violations among all phases across all the nodes in the system. The upper/lower violation thresholds ($\overline{V}$/$\underline{V}$) are set as $\pm 5\%$ of the per-unit voltage, following the US voltage regulation standard (ANSI, 2011).

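$$
f_{\mathrm{volt}}(s') = \sum_{n \in \mathcal{N}} \Big[ \big( \max_{p \in \mathrm{Phases}(n)} V_{n,p}(s') - \overline{V} \big)_{+} + \big( \underline{V} - \min_{p \in \mathrm{Phases}(n)} V_{n,p}(s') \big)_{+} \Big], \tag{7}
$$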

where $(\cdot)_{+}$ is a shorthand for $\max(\cdot, 0)$. Thereby, the upper violation is positive when $\max_{p} V_{n,p}(s') > \overline{V}$ and zero otherwise.
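Eq. (7) translates almost directly into code. The sketch below assumes thresholds of 1.05 and 0.95 per unit (a ±5% band around nominal) and represents the network as a dictionary from bus name to per-phase voltages; both the data layout and the bus names are illustrative.

```python
def voltage_violation(bus_voltages, v_upper=1.05, v_lower=0.95):
    """Eq. (7): sum over buses of the worst-case upper and lower violations."""
    total = 0.0
    for phase_voltages in bus_voltages.values():
        # Upper violation uses the highest phase voltage of the bus
        total += max(max(phase_voltages) - v_upper, 0.0)
        # Lower violation uses the lowest phase voltage of the bus
        total += max(v_lower - min(phase_voltages), 0.0)
    return total


# Hypothetical per-unit voltages for three multi-phase buses
voltages = {
    "bus_a": [1.00, 1.02, 0.99],  # within limits
    "bus_b": [1.07, 1.04],        # 0.02 above the upper threshold
    "bus_c": [0.93, 0.96, 0.94],  # 0.02 below the lower threshold
}
print(voltage_violation(voltages))
```

Only the worst phase of each bus contributes, so a bus with several mildly violating phases is not penalized more than a bus with one.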

The control error, Eq. (8), is a sum of capacitors’ and regulators’ switching penalties (first and second rows) and batteries’ discharge penalty and soc penalty (third row). These penalties discourage the policy from making frequent changes and thereby slow device wear-out. Note that the discharge error $P_b(s')_{+}/\overline{P}_b$, with $\overline{P}_b$ being the max power, uses the $(\cdot)_{+}$ function because battery degradation is primarily caused by the battery discharging power $P_b(s') > 0$. Besides, the soc penalty has an indicator of the last time step to encourage a battery to return to its initial state-of-charge $\mathrm{soc}_b^0$. Hence, the reward is stationary if $w_{\mathrm{soc}} = 0$ and non-stationary otherwise.

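$$
\begin{aligned}
f_{\mathrm{ctrl}}(s, s', i) ={}& \sum_{c \in \mathrm{caps}} w_{\mathrm{cap}}\,\big|\mathrm{Status}_c(s) - \mathrm{Status}_c(s')\big| \\
&+ \sum_{r \in \mathrm{regs}} w_{\mathrm{reg}}\,\big|\mathrm{TapNum}_r(s) - \mathrm{TapNum}_r(s')\big| \\
&+ \sum_{b \in \mathrm{bats}} \Big( w_{\mathrm{dis}}\,\frac{P_b(s')_{+}}{\overline{P}_b} + w_{\mathrm{soc}}\,\mathbb{1}_{i=H}\,\big|\mathrm{soc}_b(s') - \mathrm{soc}_b^0\big| \Big),
\end{aligned}
\tag{8}
$$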

where $c, r, b$ represent a capacitor, a regulator, and a battery. $\mathrm{Status}_c$, $\mathrm{TapNum}_r$, $P_b$, $\mathrm{soc}_b$ are the status of $c$, the tap number of $r$, the discharge power of $b$, and the soc of $b$.
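The three rows of Eq. (8) can be sketched as follows. The state layout (dictionaries of per-device lists) and the function name are assumptions for illustration; the weights in the example reuse the `cap_w`, `reg_w`, and `dis_w` values from the 13Bus registration entry shown later.

```python
def control_error(state, next_state, step, horizon, w, p_max, soc_init):
    """Eq. (8): switching penalties plus battery discharge and soc penalties."""
    # Rows 1 and 2: absolute change in capacitor status and regulator tap number
    cap = sum(abs(a - b) for a, b in zip(state["cap"], next_state["cap"]))
    reg = sum(abs(a - b) for a, b in zip(state["reg"], next_state["reg"]))
    # Row 3a: only discharging (positive power) is penalized, normalized by max power
    dis = sum(max(p, 0.0) / pm for p, pm in zip(next_state["dis"], p_max))
    # Row 3b: soc deviation counts only at the last step of the episode
    is_last = 1.0 if step == horizon else 0.0
    soc = sum(abs(s - s0) for s, s0 in zip(next_state["soc"], soc_init))
    return w["cap"] * cap + w["reg"] * reg + w["dis"] * dis + w["soc"] * is_last * soc


# Hypothetical transition: one capacitor switches, one regulator moves two taps,
# and one battery discharges at half of its max power
s = {"cap": [1, 0], "reg": [16], "dis": [0.0], "soc": [1.0]}
s_next = {"cap": [1, 1], "reg": [14], "dis": [100.0], "soc": [0.9]}
weights = {"cap": 1.0 / 33, "reg": 1.0 / 33, "dis": 6.0 / 33, "soc": 0.0}
print(control_error(s, s_next, step=24, horizon=24, w=weights, p_max=[200.0], soc_init=[1.0]))
```

With `w["soc"]` set to zero the result does not depend on `step`, which is the stationary case; any positive soc weight makes the last step different from the others.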

## 5 Design of PowerGym

### 5.1 Environment Instantiation

Similar to the OpenAI Gym, PowerGym provides make_env() to instantiate an environment:

```python
make_env(env_name, worker_idx=None)
```


env_name is the name of the registered environment. worker_idx is used (if not None) for parallel execution, which we detail in the subsection of load profiles.

make_env() reads the following information. First, PowerGym reads circuit files into the environment class and then uses OpenDSS to compile them, as shown in Figure 1. Second, to define the hyper-parameters that affect the RL training under the same system, PowerGym needs information such as the horizon, the number of actions of a regulator/battery, and the weights of the power loss, capacitor’s switch loss, regulator’s switch loss, battery’s discharge loss, and battery’s state-of-charge (soc) loss. The next subsection introduces the customization of such information.

### 5.2 Environment Registration and Customization

Users can customize their environment by registering a new environment name associated with the required information. This is done by appending the information to the dictionaries in the PowerGym register. Below is an example of the dictionary. dss_file is the main circuit file that OpenDSS compiles. Users can edit dss_file to change the circuit objects and structure. Users can also change the hyper-parameters. max_episode_steps is the horizon. It is 24 by default as we focus on the daily control. act_num is the shorthand of the number of actions, so the battery becomes continuous if bat_act_num is infinity and discrete if finite. The other parameters are the weights in the reward function shown in Eq. (5).

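```python
# One entry of the register dictionary: environment name -> settings
'13Bus': {
    'system_name': '13Bus',                 # folder name under systems/
    'dss_file': 'IEEE13Nodeckt_daily.dss',  # main circuit file compiled by OpenDSS
    'max_episode_steps': 24,                # horizon (one action per hour)
    'reg_act_num': 33,                      # number of regulator actions
    'bat_act_num': 33,                      # number of battery actions (infinity = continuous)
    'power_w': 10.0,                        # weight of the power loss
    'cap_w': 1.0/33,                        # weight of the capacitor switching loss
    'reg_w': 1.0/33,                        # weight of the regulator switching loss
    'soc_w': 0.0/33,                        # weight of the battery soc loss
    'dis_w': 6.0/33 }                       # weight of the battery discharge loss
```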


Besides the information shown above, PowerGym also depends on the load profiles (see Figure 1 and the subsequent subsection) and the other circuit files. These files are customizable and can be found in the folder of systems/system_name of the repository. For example, the above shows users can find customizable files in the folder systems/13Bus.
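As a sketch of the customization step, the snippet below derives a new entry from the 13Bus settings by overriding selected fields. The helper `make_variant`, the variable names, and the non-zero soc weight are hypothetical; how the resulting entry is appended to PowerGym's own register dictionaries is not shown here.

```python
import copy
import math

# Settings copied from the 13Bus example entry above
base_13bus = {
    "system_name": "13Bus",
    "dss_file": "IEEE13Nodeckt_daily.dss",
    "max_episode_steps": 24,
    "reg_act_num": 33,
    "bat_act_num": 33,
    "power_w": 10.0,
    "cap_w": 1.0 / 33,
    "reg_w": 1.0 / 33,
    "soc_w": 0.0 / 33,
    "dis_w": 6.0 / 33,
}


def make_variant(base, **overrides):
    """Copy an entry and override some fields, rejecting misspelled keys."""
    unknown = set(overrides) - set(base)
    if unknown:
        raise KeyError(f"unknown settings: {sorted(unknown)}")
    variant = copy.deepcopy(base)
    variant.update(overrides)
    return variant


# Continuous battery (infinite action count) with an illustrative soc weight
custom_entry = make_variant(base_13bus, bat_act_num=math.inf, soc_w=1.0 / 33)
print(custom_entry["bat_act_num"], custom_entry["soc_w"])
```

Copying before updating keeps the base entry unchanged, so several variants can be derived from the same starting point.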

### 5.3 Default Registered Environments

In Table 3, each system (summarized in Table 4) in PowerGym has four default environments: vanilla, continuous battery, soc, continuous battery & soc. The difference lies only in the battery’s settings; hence capacitors and regulators are the same across these four environments.

The presence of cbat affects the number of discharge power levels of the battery: without cbat, the number is finite (33 by default), and the battery’s model is discrete; with cbat, the number is infinite, and the battery’s model is continuous. On the other hand, soc indicates whether a state-of-charge penalty is applied to the battery at the end of the horizon: without soc, the soc penalty is zero, and the reward is stationary; with soc, the soc penalty is positive, and the reward is non-stationary. Besides the four default environments, one can call a scaled environment by appending a scale to an environment name; e.g., 13Bus_s1.5 scales the loads by 1.5. We will revisit the load scaling in the subsection on load profiles.
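The naming convention can be illustrated with a small parser. This is a hypothetical helper, not PowerGym's own name resolution; it assumes names of the form system, optional `_cbat`, optional `_soc`, optional `_s<scale>`, and treats a missing scale as 1.0.

```python
import re

_NAME_PATTERN = re.compile(
    r"(?P<system>[^_]+)(?P<cbat>_cbat)?(?P<soc>_soc)?(?:_s(?P<scale>\d+(?:\.\d+)?))?"
)


def parse_env_name(env_name):
    """Split an environment name into system, battery model, soc flag, and load scale."""
    match = _NAME_PATTERN.fullmatch(env_name)
    if match is None:
        raise ValueError(f"unrecognized environment name: {env_name!r}")
    return {
        "system": match["system"],
        "continuous_battery": match["cbat"] is not None,
        "soc_penalty": match["soc"] is not None,
        "load_scale": float(match["scale"]) if match["scale"] else 1.0,
    }


print(parse_env_name("13Bus_s1.5"))
print(parse_env_name("123Bus_cbat_soc_s2.5"))
```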

### 5.4 Gym-like Usage

PowerGym supports Gym-like usages such as reset, step, random action sampling, and visualization. The design is compatible with most RL algorithms. Below is a brief overview of these functions.

```python
obs = Env.reset(load_profile_idx=0)
```


The reset function initializes the system and returns an initial observation. The dynamics of the load are controlled by the load profile index and will be discussed in the next subsection. The initial statuses of capacitors, regulators, and batteries are set as "on", full tap number, and (full charge, zero discharge power), respectively.

```python
obs, reward, done, info = Env.step(action)
```


The step function takes an action as the input and returns the next state, reward, done signal, and the information dictionary. Since the current design does not define a terminal state that should be strictly avoided, the done signal is true only when the episode step reaches the horizon. The information dictionary includes several details about the reward such as capacitor error, regulator error, discharge error, soc error, and soc (all averaged), which also facilitates the application of multi-objective RL (Liu et al., 2014).

```python
action = Env.random_action()
```


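Random actions can also be generated by Env.action_space.sample() and the random seed is set by Env.seed(). The action is an $(N_{\mathrm{cap}} + N_{\mathrm{reg}} + N_{\mathrm{bat}})$-dimensional array of control signals for the controllers, with $N_{\mathrm{cap}}$, $N_{\mathrm{reg}}$, $N_{\mathrm{bat}}$ being the numbers of capacitors, regulators, and batteries.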

```python
fig, pos = Env.plot_graph()
```


The plot graph function returns a Matplotlib figure and a dictionary of node positions for users to visualize the network status. It supports options such as show_voltages, show_controllers, and show_actions.
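Put together, the calls above form the usual Gym-style episode loop. To keep the sketch runnable without OpenDSS, it uses a hypothetical stand-in class that only mimics the call signatures of reset, step, and random_action; the observations, rewards, and action ranges it produces are placeholders and say nothing about PowerGym's real dynamics.

```python
import random


class StubVoltVarEnv:
    """Hypothetical stand-in that mimics the call signatures described above."""

    def __init__(self, horizon=24, seed=0):
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self, load_profile_idx=0):
        self.t = 0
        return [1.0]  # placeholder observation

    def random_action(self):
        # Placeholder action: one capacitor, one regulator tap, one battery level
        return [self.rng.randint(0, 1), self.rng.randint(0, 32), self.rng.randint(0, 32)]

    def step(self, action):
        self.t += 1
        reward = -self.rng.random()    # placeholder non-positive reward
        done = self.t >= self.horizon  # done only when the horizon is reached
        return [1.0], reward, done, {}


def run_episode(env, load_profile_idx=0):
    """Roll out one episode with random actions and return (total reward, steps)."""
    obs = env.reset(load_profile_idx=load_profile_idx)
    total, steps, done = 0.0, 0, False
    while not done:
        action = env.random_action()
        obs, reward, done, info = env.step(action)
        total += reward
        steps += 1
    return total, steps


total_reward, num_steps = run_episode(StubVoltVarEnv(horizon=24))
print(num_steps, total_reward)
```

With a real environment, the stub would be replaced by the object returned from make_env(), and `run_episode` would stay unchanged.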

### 5.5 Other Usages and Constraints of Load Profiles

To simulate load dynamics, OpenDSS supports time-series simulations following some predefined load curves. A group of predefined curves of all loads is called a load profile. As mentioned in the previous subsection, Env.reset() has an option of load profile selection. Hence, PowerGym models the stochasticity of state transition using the load profiles.

By enlarging the values in the load profile with a fixed scale, PowerGym creates environments with various load scales. Since power consumption scales linearly with the load scale, the environment tends to be hard under a large load scale. Referring to the subsection of default environments, a scaled environment is instantiated by appending a scale to an environment name. During the call, PowerGym generates the load profiles under the corresponding scale factor and another text file to store the current load scale. Note the load profile is regenerated only if the previous load scale (stored in the text file) is different from the current one.

Due to the file dependency on the load profile, parallel execution is possible under certain conditions. As mentioned earlier, the worker index of make_env() is used for parallel execution. When it is None, PowerGym cannot execute two environments on the same system (e.g., cannot execute 13Bus, 13Bus_cbat together) due to the conflict of load profile selection. This is solved when the worker index is an integer because each worker has a distinct profile selection file. However, even with the worker index, PowerGym cannot execute environments in parallel with names that differ only in the load scales (e.g., 13Bus_s1.0, 13Bus_s2.0) because it only allows one load scale at any given time.

## 6 Experiments

### 6.1 Cumulative Rewards in Default Environments

To show the applicability of PowerGym, we have trained two popular RL algorithms as benchmarks on our environments: Proximal Policy Optimization (PPO) (Schulman et al., 2017) and Soft Actor-Critic (SAC) (Haarnoja et al., 2018), with implementations based on (Fujita et al., 2021). Since PPO is on-policy while SAC is off-policy, these two algorithms give us a proxy of the expected performance of on-policy versus off-policy algorithms in the environments. For comparisons, both PPO and SAC have been trained on multi-discrete actions by default. In addition, SAC has been trained on environments with continuous batteries (cbat) to compare the environments with different battery settings. The experiments are run on a server with one AMD Ryzen Threadripper 3970X CPU and one Nvidia RTX 3090 GPU.

The experiments have been designed as follows: The load profiles are randomly partitioned into two halves, one for training and the other for testing. During training, the policy is tested on test load profiles every 5 episodes; or equivalently every 120 steps as the horizon is 24. Lastly, all experiments are performed across ten random seeds.
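The evaluation protocol can be sketched in a few lines. The helper names, the number of load profiles, and the seed below are illustrative assumptions, not values from the experiments.

```python
import random


def split_profiles(num_profiles, seed):
    """Randomly partition load profile indices into training and testing halves."""
    indices = list(range(num_profiles))
    random.Random(seed).shuffle(indices)
    half = num_profiles // 2
    return sorted(indices[:half]), sorted(indices[half:])


def evaluation_steps(total_episodes, eval_every=5, horizon=24):
    """Environment steps at which the policy is tested (every `eval_every` episodes)."""
    return [ep * horizon for ep in range(eval_every, total_episodes + 1, eval_every)]


train_idx, test_idx = split_profiles(num_profiles=10, seed=0)
print(train_idx, test_idx)
print(evaluation_steps(total_episodes=15))  # [120, 240, 360]
```

Seeding the partition separately from the agent makes it possible to reuse the same train/test split across algorithms while still varying the training seed.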

In Figure 2, the label "random" denotes an untrained policy that samples actions uniformly from the action space. As expected, SAC converges faster and outperforms PPO across all environments, which aligns with the results of the SAC paper (Haarnoja et al., 2018) on MuJoCo (Todorov et al., 2012). Due to the experiment design (evaluation at every 120 steps), all curves start at step 120 instead of step 0. The first evaluation (step 120) reveals the algorithms’ performance based on the first few updates: PPO is similar to the random policy while SAC is not. The fact that PPO is near the random policy is consistent with the clipping nature of its policy gradient. Clipping makes PPO update slowly yet steadily and hence stay similar to the random policy in the early steps. As for SAC, because its DDPG-style policy gradient (Silver et al., 2014) is not clipped, SAC suffers more from the initial inaccuracy of the Q-function and hence deviates from the random policy in the early steps. Finally, the performances of SAC and cbat_SAC are very similar, implying that discrete and continuous batteries share similar behaviors and that SAC successfully adapts to both.

To sum up, we have demonstrated the applicability of PPO and SAC in PowerGym and the sample efficiency of SAC. We have also shown that environments with discrete or continuous batteries have similar performances.

### 6.2 Case Study: 123Bus

We take the 123Bus system (Figure 3) as an example to further analyze the behavior of the control policy in PowerGym. Specifically, we focus on the continuous battery scenario because in practice the battery may discharge/charge arbitrarily within an allowable range. For this case study, we consider four variations: vanilla (cbat), scaled loads (cbat_s2.5), with soc penalty (cbat_soc), and scaled loads with soc penalty (cbat_soc_s2.5). As mentioned in the default environment section, both the soc penalty and a large load scale make the environment more challenging, as the former introduces a non-stationary reward while the latter incurs large power consumption. We therefore examine how the control policy adapts to the different scenarios.

The first row of Figure 4 visualizes the average switching errors of capacitors and regulators respectively. Both errors are small in most time steps across all scenarios. Hence, the policies for both capacitors and regulators only make large changes when needed while making small adjustments the rest of the time. Note the behavior of the first 1000 steps and the later steps are different because the RL exploration starts after the first 1000 steps of random exploration.

The second row of Figure 4 shows the power loss ratio and the voltage violation. Because of the load scaling, the 2.5-scaled environments have higher voltage violations than the un-scaled environments (cbat_s2.5 > cbat and cbat_soc_s2.5 > cbat_soc). Furthermore, the voltage violation of soc_s2.5 is greater than that of s2.5 as the soc penalty makes the policy on batteries more restrictive and non-stationary. As for the power loss, since it is a difficult objective, it barely improves over time. Still, we see that the power losses in the 2.5-scaled environments are higher than in their un-scaled counterparts. This is because large voltage violations cause large voltage differences on the lines, which increases the power loss on the lines.

Finally, the third row of Figure 4 shows the battery activity in discharge errors and soc errors. Since the battery is an energy storage device, it is useful when the environment lacks power and has high voltage violations. Hence, the battery barely discharges in the un-scaled environments and maintains mostly zero soc error. As for the scaled environments (s2.5 & soc_s2.5), because s2.5 discharges frequently, it has smaller voltage violations but higher soc error. In comparison, soc_s2.5 discharges less and has a higher voltage violation but smaller soc error. Therefore, there is a trade-off between battery activity and voltage violation in heavily-loaded environments: the more battery activity, the less voltage violation, and RL algorithms need to find a delicate balance between the two.

All in all, the soc penalty and the load scale affect the difficulty of a PowerGym environment. The difficulty can be evaluated by power losses, voltage violations, and battery activities. The harder an environment is, the more power losses, voltage violations, and battery activities.

### 6.3 Effects of Horizons

Figure 5 shows the testing cumulative reward w.r.t. the horizon for the 123Bus and 8500Node systems. We only analyze the continuous battery setting with the SAC algorithm as this is the best-performing setting according to Figure 2. As the cumulative reward scales linearly w.r.t. the horizon, h48’s cumulative reward is roughly twice that of h24 and h96’s is roughly four times that of h24. Besides, the convergence speeds w.r.t. horizons are similar in 123Bus because the 123Bus system is more stable and less likely to have voltage violations. On the other hand, 8500Node is less stable, so it needs more steps to converge under a longer horizon.

### 6.4 Difficulty Comparisons

As a concluding remark, we discuss the trend of PowerGym’s difficulty in four aspects: problem size, base voltage violation, load scale, and soc penalty. It helps users to choose the best environment for their applications.

Problem size refers to the dimensions of an environment, e.g., the horizon and the sizes of the state and action spaces. A larger horizon makes an environment harder as the error of the value/Q-function is usually quadratic in the horizon (Duan et al., 2020). Similarly, larger state and action spaces complicate an environment. Under a fixed horizon, we expect the problem complexity in PowerGym to follow 8500Node > 123Bus > 34Bus > 13Bus.

Base voltage violation is the tendency to violate the voltage limits in an un-scaled environment. It depends on the structure of the distribution system and the default load profiles. One may find a small system (small number of nodes) with high base voltage violation or a big system with low base voltage violation. For instance, the tendency of base voltage violation is 8500Node > 34Bus > 13Bus > 123Bus, as the voltage violation is the major term in the reward function and the training speed in Figure 2 follows the reverse order.

Load scale affects a PowerGym environment by changing the scale of the load profiles. A high load scale brings up the load power consumption, which increases the chance of voltage violations and makes the problem harder.

The soc penalty determines the stationarity of the battery behavior. With the soc penalty, the battery behaves non-stationarily, as it should discharge at peak hours and charge at off-peak hours. Because non-stationary behavior is harder to learn than stationary behavior, the soc error of cbat_soc_s2.5 in Figure 4 is mostly zero but less stable.

## 7 Conclusion

We develop a gym-like open-source environment, PowerGym, to facilitate RL research and adaptation for Volt-Var control in power distribution systems. PowerGym encourages power system researchers to make fair comparisons of RL algorithms using the same environment. It includes sufficient variations (problem size, base voltage violation, load scale, and soc penalty) to study different aspects of Volt-Var control. PowerGym also acts as a base for researchers and engineers to apply RL algorithms to real-life power distribution systems: it provides a detailed customization guide for those who use PowerGym with their own proprietary power distribution systems. Our RL experiments suggest the correctness of the PowerGym design. The cumulative rewards achieved by our RL agents serve as a baseline for PowerGym users. Future work on other problems in power distribution systems is underway.

## 8 Acknowledgement

We thank Siddharth Bhela for the instructions on OpenDSS, Suat Gumussoy for the feedback on environment design, and Ulrich Muenz for the general support of the project development.

## References

• A. Agarwal, N. Jiang, and S. Kakade (2019) Reinforcement learning: theory and algorithms. Cited by: §3.
• ANSI (2011) Electric power systems and equipment – voltage ratings (60 hz) c84.1. Note: American National Standards Institute (ANSI) Standard: C84.1-2011 Cited by: §4.2.3.
• V. Borozan, M. E. Baran, and D. Novosel (2001) Integrated volt/var control in distribution systems. In 2001 IEEE Power Engineering Society Winter Meeting. Conference Proceedings (Cat. No. 01CH37194), Vol. 3, pp. 1485–1490. Cited by: §1.
• G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) Openai gym. arXiv preprint arXiv:1606.01540. Cited by: §1, §2.
• Y. Duan, Z. Jia, and M. Wang (2020) Minimax-optimal off-policy evaluation with linear function approximation. In International Conference on Machine Learning, pp. 2701–2709. Cited by: §6.4.
• R. Dugan and R. Arritt (2010) The ieee 8500-node test feeder. Electric Power Research Institute, Palo Alto, CA, USA. Cited by: §1.
• D. Ernst, M. Glavic, and L. Wehenkel (2004) Power systems stability control: reinforcement learning framework. IEEE Transactions on Power Systems 19 (1), pp. 427–435. External Links: Document Cited by: §2.
• M. Farivar, L. Chen, and S. Low (2013) Equilibrium and dynamics of local voltage control in distribution systems. In 52nd IEEE Conference on Decision and Control, pp. 4329–4334. Cited by: §1.
• Y. Fujita, P. Nagarajan, T. Kataoka, and T. Ishikawa (2021) ChainerRL: a deep reinforcement learning library. Journal of Machine Learning Research 22 (77), pp. 1–14. Cited by: §6.1.
• L. Gan, N. Li, U. Topcu, and S. H. Low (2014) Exact convex relaxation of optimal power flow in radial networks. IEEE Transactions on Automatic Control 60 (1), pp. 72–87. Cited by: §1.
• Y. Gao and N. Yu (2021) Deep reinforcement learning in power distribution systems: overview, challenges, and opportunities. In 2021 IEEE Power Energy Society Innovative Smart Grid Technologies Conference (ISGT), pp. 1–5. External Links: Document Cited by: §2.
• P. Gawłowicz and A. Zubow (2018) Ns3-gym: extending openai gym for networking research. arXiv preprint arXiv:1810.03943. Cited by: §2.
• T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, et al. (2018) Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905. Cited by: §1, §6.1, §6.1.
• C. Liu, X. Xu, and D. Hu (2014) Multiobjective reinforcement learning: a comprehensive overview. IEEE Transactions on Systems, Man, and Cybernetics: Systems 45 (3), pp. 385–398. Cited by: §5.4.
• A. Marot, B. Donnot, G. Dulac-Arnold, A. Kelly, A. O’Sullivan, J. Viebahn, M. Awad, I. Guyon, P. Panciatici, and C. Romero (2021) Learning to run a power network challenge: a retrospective analysis. arXiv preprint arXiv:2103.03104. Cited by: §1.
• A. Marot, B. Donnot, C. Romero, B. Donon, M. Lerousseau, L. Veyrin-Forrer, and I. Guyon (2020) Learning to run a power network challenge for training topology controllers. Electric Power Systems Research 189, pp. 106635. External Links: ISSN 0378-7796, Document, Link Cited by: §2.
• V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937. Cited by: §1.
• IEEE PES (2010) IEEE PES test feeders. Note: https://site.ieee.org/pes-testfeeders/resources/ (accessed 2021-07-28). Cited by: §1.
• A. Ray, J. Achiam, and D. Amodei (2019) Benchmarking safe exploration in deep reinforcement learning. arXiv preprint arXiv:1910.01708. Cited by: §2.
• J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §1, §6.1.
• D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller (2014) Deterministic policy gradient algorithms. In International conference on machine learning, pp. 387–395. Cited by: §6.1.
• J. Sun, Y. Zheng, J. Hao, Z. Meng, and Y. Liu (2020) Continuous multiagent control using collective behavior entropy for large-scale home energy management. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 922–929. Cited by: §2.
• E. Todorov, T. Erez, and Y. Tassa (2012) Mujoco: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §6.1.
• H. Van Hasselt, A. Guez, and D. Silver (2016) Deep reinforcement learning with double q-learning. In Proceedings of the AAAI conference on artificial intelligence, Vol. 30. Cited by: §4.2.1.
• W. Wang, N. Yu, Y. Gao, and J. Shi (2020) Safe off-policy deep reinforcement learning algorithm for volt-var control in power distribution systems. IEEE Transactions on Smart Grid 11 (4), pp. 3008–3018. External Links: Document Cited by: §2.
• W. Wang, N. Yu, J. Shi, and Y. Gao (2019) Volt-var control in power distribution systems with deep reinforcement learning. In 2019 IEEE International Conference on Communications, Control, and Computing Technologies for Smart Grids (SmartGridComm), pp. 1–7. External Links: Document Cited by: §2.
• Z. Yan and Y. Xu (2019) Data-driven load frequency control for stochastic power systems: a deep reinforcement learning method with continuous action search. IEEE Transactions on Power Systems 34 (2), pp. 1653–1656. External Links: Document Cited by: §2.
• Z. Yang, H. Zhong, Q. Xia, A. Bose, and C. Kang (2016) Optimal power flow based on successive linear approximation of power flow equations. IET Generation, Transmission & Distribution 10 (14), pp. 3654–3662. Cited by: §1.
• D. Yoon, S. Hong, B. Lee, and K. Kim (2020) Winning the l2rpn challenge: power grid management via semi-markov afterstate actor-critic. In International Conference on Learning Representations, Cited by: §1.
• Y. Zhang, X. Wang, J. Wang, and Y. Zhang (2021) Deep reinforcement learning based volt-var optimization in smart distribution systems. IEEE Transactions on Smart Grid 12 (1), pp. 361–371. External Links: Document Cited by: §2.
• Z. Zhang, D. Zhang, and R. C. Qiu (2020) Deep reinforcement learning for power system applications: an overview. CSEE Journal of Power and Energy Systems 6 (1), pp. 213–225. External Links: Document Cited by: §2.

## Appendix A

### A.1 System Layouts

Table 4 (in the main text) and Figure 6 show the controller summary and the layouts of the systems. The 13Bus system has 2 capacitors at buses 611 and 675; 3 single-phase regulators at the edge (650, rg60); and 1 battery at bus 680. The 34Bus system has 2 capacitors at buses 844 and 848; 6 single-phase regulators at the edges (814, 814r) and (852, 852r) (3 for each); and 2 batteries at buses 832 and 890. The 123Bus system has 4 capacitors at buses 83, 88, 90, and 92; 1 three-phase regulator at bus 150 and 9 single-phase regulators at buses 9, 25, and 160 (3 for each); and 4 batteries at buses 33, 67, 114, and 300. The 8500Node system has 10 capacitors (one three-phase near the source and nine single-phase at three other locations); 12 single-phase regulators at four buses (3 for each bus; one bus is the source, and the other buses are shown in the figure); and 10 batteries.

### A.2 Observation and Action Wrapper

Although the observation and action spaces are composed of both discrete and continuous values, for conciseness we wrap the observation and the action into NumPy arrays as follows.

    wrapped_obs = Concatenate([all phase voltages at each bus,
                               all capacitor statuses,
                               all regulator tap numbers,
                               all battery soc’s and normalized discharge powers])

    action = Concatenate([all capacitor statuses,
                          all regulator tap numbers,
                          all battery discharge powers])


Capacitor statuses, regulator tap numbers, and discrete batteries’ discharge powers are represented as integers. Continuous batteries’ discharge powers are represented as floating-point numbers.
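As a minimal sketch of the action layout above, the following NumPy snippet concatenates per-device actions in the stated order (capacitors, regulators, batteries). The helper name `wrap_action`, the device counts, and all numeric values are hypothetical and chosen only for illustration; they are not taken from the PowerGym API.

```python
import numpy as np

# Hypothetical device counts (2 capacitors, 3 regulators, 1 battery);
# every value below is illustrative only.
cap_statuses = np.array([1, 0])        # integer on/off status per capacitor
reg_taps = np.array([16, 16, 15])      # integer tap number per regulator
cont_bat_power = np.array([0.25])      # continuous battery: float discharge power
disc_bat_power = np.array([3])         # discrete battery: integer discharge level


def wrap_action(caps, regs, bats):
    """Concatenate per-device actions in the order caps, regs, batteries."""
    return np.concatenate([caps, regs, bats])


# Continuous battery: NumPy promotes the whole array to a float dtype.
action = wrap_action(cap_statuses, reg_taps, cont_bat_power)

# Discrete battery: every entry is an integer, so the array stays integral.
discrete_action = wrap_action(cap_statuses, reg_taps, disc_bat_power)
```

Because a single array holds all devices, an action that contains a continuous battery is stored entirely in floating point, whereas an all-discrete action remains an integer array.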

The wrapped observation is the default output of Env.reset() and Env.step(). Still, users can access all phase voltages with the observation dictionary at Env.obs. These two representations of observations have the following relation:

    wrapped_obs = Env.wrap_obs(Env.obs)
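To illustrate the relation between the two representations, here is a self-contained sketch of a dictionary-to-array wrapper. The dictionary keys, the sorted bus ordering, and the per-battery (soc, power) pairing are assumptions made for this example; the actual layout of Env.obs and the implementation of Env.wrap_obs may differ.

```python
import numpy as np


def wrap_obs(obs):
    """Flatten a (hypothetical) observation dictionary into one 1-D float array."""
    parts = [
        # all phase voltages of every bus, buses visited in sorted-name order
        [v for bus in sorted(obs["bus_voltages"]) for v in obs["bus_voltages"][bus]],
        obs["cap_statuses"],   # all capacitor statuses
        obs["reg_taps"],       # all regulator tap numbers
        # (soc, normalized discharge power) for each battery, flattened
        [x for soc, power in obs["batteries"] for x in (soc, power)],
    ]
    return np.concatenate([np.asarray(p, dtype=np.float64) for p in parts])


# Illustrative observation with made-up values.
obs = {
    "bus_voltages": {"650": [1.0, 1.0, 1.0], "611": [0.98]},
    "cap_statuses": [1, 0],
    "reg_taps": [16, 16, 15],
    "batteries": [(0.5, -0.2)],
}
wrapped_obs = wrap_obs(obs)
```

The dictionary keeps per-device structure that is convenient for inspection, while the flat array is the form an RL agent typically consumes.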


### A.3 Hyper-parameters

In this section, we provide a summary of the hyper-parameters of our environments and RL agents. The coefficients of the reward function are shown in Table 5.

To train PPO and SAC agents, we use separate deep neural networks to parameterize the policy and value/Q functions. Both networks consist of dense layers with the same widths. Table 6 presents the suggested hyper-parameters for PPO and SAC.
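The network layout described above can be sketched in plain NumPy as two independent multilayer perceptrons that share the same hidden widths. The hidden sizes, the ReLU activation, the initialization, and the input/output dimensions below are placeholders assumed for illustration and are not the values from Table 6; for SAC, the second network would take the state and action together as input rather than the state alone.

```python
import numpy as np


def init_mlp(sizes, rng):
    """Create (weight, bias) pairs for dense layers with the given sizes."""
    return [
        (rng.standard_normal((n_in, n_out)) / np.sqrt(n_in), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def forward(params, x):
    """Apply ReLU on hidden layers and a linear map on the output layer."""
    for weight, bias in params[:-1]:
        x = np.maximum(x @ weight + bias, 0.0)
    weight, bias = params[-1]
    return x @ weight + bias


rng = np.random.default_rng(0)
obs_dim, act_dim = 11, 6   # placeholder dimensions
hidden = [64, 64]          # placeholder widths, shared by both networks

# Separate parameters for the policy and the value function.
policy_net = init_mlp([obs_dim, *hidden, act_dim], rng)
value_net = init_mlp([obs_dim, *hidden, 1], rng)

batch = np.zeros((1, obs_dim))
policy_out = forward(policy_net, batch)   # shape (1, act_dim)
value_out = forward(value_net, batch)     # shape (1, 1)
```

Keeping the two parameter sets separate means that updates to the value estimate do not directly change the policy's weights.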