1 Introduction
VoltVar control refers to the control of voltage (Volt) and reactive power (Var) in power distribution systems to achieve healthy operation of the systems. By optimally dispatching voltage regulators, switchable capacitors, and controllable batteries, VoltVar control helps to flatten voltage profiles and reduce power losses across the power distribution systems. It is hence rated as the most desired function for power distribution systems (Borozan et al., 2001).
The core of VoltVar control is an optimization of voltage profiles and power losses governed by networked constraints. Represent a power distribution system as a tree graph $(\mathcal{N}, \mathcal{E})$, where $\mathcal{N}$ is the set of nodes or buses and $\mathcal{E}$ is the set of edges or lines and transformers. Denote node $i$ as $j$'s parent. The physical networked constraints are given by (Farivar et al., 2013):
$$
\begin{aligned}
p_{ij} - R_{ij}\,\ell_{ij} &= p_j + \sum_{k:(j,k)\in\mathcal{E}} p_{jk},\\
q_{ij} - X_{ij}\,\ell_{ij} &= q_j + \sum_{k:(j,k)\in\mathcal{E}} q_{jk},\\
v_j^2 &= v_i^2 - 2\,(R_{ij}\,p_{ij} + X_{ij}\,q_{ij}) + (R_{ij}^2 + X_{ij}^2)\,\ell_{ij},\\
\ell_{ij}\,v_i^2 &= p_{ij}^2 + q_{ij}^2,
\end{aligned}
\tag{1}
$$
where $p, q$ are active and reactive power consumed at nodes ($p_j, q_j$) or edges ($p_{ij}, q_{ij}$), $v, \ell$ denote bus voltage magnitude and squared current magnitude, and $R, X$ are resistance and reactance. Capital letters stand for given parameters; lowercase letters are variables. The constraints in Eq. (1) contain quadratic equalities, making any optimization over them nonconvex. Researchers have either tightly relaxed the constraints under strict nodal injection assumptions (Gan et al., 2014) or used a linearization that assumes the distribution system operates at a fixed operating point (Yang et al., 2016). Both methods require tremendous effort to trim and convert a model that is readily available in commercial circuit simulation software into a specific optimization formulation. Together with the many integer decision variables of controllable devices not shown above, the VoltVar control problem becomes extremely hard to scale to a system with thousands of buses, a typical size for distribution systems.
With recent breakthroughs in deep reinforcement learning (RL), power system researchers have tried using RL for power system operation. One such example is learning to operate a transmission system in the L2RPN competition (Marot et al., 2021). Though transmission systems are fundamentally different from distribution systems in both network topology (looped vs. radial) and typical problem types (dynamic vs. quasi-static), RL has shown promising results (Yoon et al., 2020) in operating transmission systems. While there exist many papers on RL in distribution systems, researchers have used their own environments. One reason is the regulated and conservative nature of the power engineering industry: because distribution systems are safety-critical, real-life system topologies and control settings are proprietary. To encourage power systems researchers to make fair comparisons of RL algorithms without concern about leaking proprietary information, we have developed PowerGym, a Gym-like environment (Brockman et al., 2016) for optimizing VoltVar control using IEEE benchmark test systems (PES, 2010; Dugan and Arritt, 2010). It further serves as a base for power systems engineers to implement RL algorithms on their proprietary systems with minor customization.
PowerGym supports Gym-like usages such as reset, step, random action sampling, and visualization; hence it can readily be used with most existing RL algorithms. On top of the Gym design, PowerGym provides a wide range of environment variations of the IEEE benchmark systems. These variations affect the environment's physical constraints and ultimately the control difficulty, allowing users to choose an environment that is either easier to control but more abstract, or harder to control yet more realistic. PowerGym is safe for parallel execution subject to some file constraints (discussed in the environment design section), so the user can run parallel algorithms such as A3C (Mnih et al., 2016).
Our contributions are as follows. First, we design PowerGym to help power system researchers benchmark their controls and RL algorithms. To the best of the authors' knowledge, this is the first publicly accessible environment with a focus on VoltVar control in power distribution systems. Second, we consider environment usability and extensibility in PowerGym. We provide variations of the environments for different control difficulties and a detailed customization guide. Finally, we showcase the applicability of PowerGym with two popular RL algorithms, PPO (Schulman et al., 2017) and SAC (Haarnoja et al., 2018), for validation purposes. We also explain how the controllers work through a case study.
2 Related Work
The application of RL to control and manage various aspects of power systems is a well-studied topic in the literature (Zhang et al., 2020). In the past few years, there has been renewed interest in this topic due to algorithmic advancements, allowing RL to go beyond tabular settings and scale to large state and action spaces using neural networks as expressive function approximators. Examples of such work include using RL to reduce operational cost and optimize the handling of daily loads (Sun et al., 2020), power systems stability control (Ernst et al., 2004), and learning a control policy that is adaptable to stochastic renewable energy generation (Yan and Xu, 2019). In the context of VoltVar control, there have also been various studies that leverage RL to optimize various aspects of the problem, such as emphasizing constraint satisfaction (Wang et al., 2019, 2020) or sample efficiency and scalability (Zhang et al., 2021). Nevertheless, most of these results are based on non-standardized implementations of various systems with the environment tuned to the specifics of the problem. This makes it difficult to evaluate and compare results and remains a crucial challenge in these areas, as highlighted by Gao and Yu (2021).

In other domains of deep RL, researchers have begun to recognize the importance of having high-quality benchmark environments to facilitate research into RL applications. This has led to the development of environments such as OpenAI Gym for training RL agents to play a variety of games and for robotic control (Brockman et al., 2016), Safety Gym for safe RL exploration (Ray et al., 2019), and ns3-gym for training RL in networking research problems (Gawłowicz and Zubow, 2018). A more closely related benchmark is the Grid2Op environment, which allows researchers to train an RL agent to address the challenges of unpredictable power generation from renewable energy sources and robustness to power systems' topology change (Marot et al., 2020). Nevertheless, a standardized benchmark for VoltVar control environments with the flexibility of instantiating systems of various sizes is still lacking. To this end, we hope that this work fills the gap in the community as a unified benchmark environment for RL research in VoltVar control applications.
3 Reinforcement Learning Preliminaries
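A reinforcement learning (RL) environment is often modeled as a Markov Decision Process (MDP). Two MDPs are common: the infinite-horizon discounted MDP $\mathcal{M}_\gamma = (\mathcal{S}, \mathcal{A}, P, r, \gamma)$ and the finite-horizon episodic MDP $\mathcal{M}_H = (\mathcal{S}, \mathcal{A}, \{P_t\}, \{r_t\}, H)$. $\mathcal{S}$, $\mathcal{A}$ are the state and action spaces. $P$/$P_t$, $r$/$r_t$ are the stationary/nonstationary state transition function and reward function. $\gamma$, $H$ are the discount factor and the horizon. In this paper, we assume a stationary state transition in $\mathcal{M}_H$: $P_t = P$ for $t = 0, \dots, H-1$. The goal of RL is to find a policy that maximizes the cumulative rewards:
$$
\max_{\pi}\ \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t)\Big] \ \text{in } \mathcal{M}_\gamma,
\qquad
\max_{\{\pi_t\}}\ \mathbb{E}\Big[\sum_{t=0}^{H-1} r_t(s_t, a_t)\Big] \ \text{in } \mathcal{M}_H.
\tag{2}
$$
It is well known that the optimal policy of $\mathcal{M}_\gamma$ is stationary while that of $\mathcal{M}_H$ is nonstationary (Agarwal et al., 2019, Chapter 1); hence in Eq. (2), the policy is denoted as $\pi$ in $\mathcal{M}_\gamma$ and as $\pi_t$ in $\mathcal{M}_H$. To reduce model complexity, most RL experiments are formulated as $\mathcal{M}_\gamma$ if stationarity holds. However, $\mathcal{M}_H$ is unavoidable when the reward is nonstationary. Depending on the application scenario, we implement both stationary and nonstationary rewards, which will be discussed in the next section.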
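As a small illustration of the two cumulative-reward definitions, the sketch below computes a discounted return and an undiscounted episodic return from a list of per-step rewards. The function names and the reward values are made up for illustration and are not part of PowerGym.

```python
def discounted_return(rewards, gamma):
    """Discounted cumulative reward, truncated to the rewards provided."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


def episodic_return(rewards):
    """Undiscounted cumulative reward over a finite horizon."""
    return float(sum(rewards))


# Illustrative rewards only (negative, as losses are penalized).
rewards = [-1.0, -0.5, -0.25]
print(discounted_return(rewards, gamma=0.5))
print(episodic_return(rewards))
```

In the finite-horizon case no discount factor is needed because the sum has a fixed number of terms.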
4 A VoltVar Control Environment
4.1 Power Distribution Systems and Objectives
Power distribution systems are networks for delivering electric power from the power transmission system to end consumers. Due to distribution losses, voltage drops along the power delivery line, possibly causing voltage violations and power losses. Thus, VoltVar optimization is required. In power distribution systems, the VoltVar optimization problem is to control devices (e.g., regulators, capacitors, and batteries, represented as $u$) under constraints. The control $u$ affects voltage, resistance, reactance, and power in the physical networked constraints, so Eq. (1) is a constraint of Eq. (3).
$$
\min_{u}\ f_{volt}(u) + f_{ctrl}(u) + f_{power}(u)
\quad \text{s.t. the physical networked constraints in Eq. (1) and the device constraints on } u.
\tag{3}
$$
The VoltVar optimization's objective is a combination of three losses: $f_{volt}$ for voltage violation, $f_{ctrl}$ for control error, and $f_{power}$ for power loss. The device constraints ensure that each device operates within its physical limits. While Eq. (3) is time-independent, in practice we have to solve it at every time step. Solving a sequence of VoltVar optimizations, Eq. (3), becomes a VoltVar control problem. In short, we call a problem VoltVar optimization if it solves a single instance of Eq. (3), and VoltVar control if it solves a sequence of Eq. (3) connected by device operation constraints over time.
We use a Python version of OpenDSS to solve for the physical networked constraints of Eq. (3). OpenDSS is an open-source power flow solver developed by EPRI. It takes the power consumption in Eq. (1) as known and solves the nonlinear equations of voltages and currents using fixed-point iterations.
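To give a feel for what a fixed-point power flow solve looks like, here is a minimal sketch on a two-bus feeder (one source bus, one line, one load) using the branch flow form of Farivar et al. (2013). It is an illustration only and is not the algorithm OpenDSS implements. The function name, the per-unit values, and the lossless-source assumption are all hypothetical.

```python
def solve_two_bus(v0, r, x, p_load, q_load, tol=1e-12, max_iter=100):
    """Fixed-point power flow for a single line feeding a single load.

    v0: source voltage magnitude (per unit); r, x: line resistance/reactance;
    p_load, q_load: active/reactive power consumed at the load bus.
    Returns sending-end (p, q), squared current ell, and load voltage v1.
    """
    p, q = p_load, q_load  # initial guess: ignore line losses
    for _ in range(max_iter):
        ell = (p * p + q * q) / (v0 * v0)   # squared current magnitude
        p_new = p_load + r * ell            # sending power = load + loss
        q_new = q_load + x * ell
        converged = abs(p_new - p) + abs(q_new - q) < tol
        p, q = p_new, q_new
        if converged:
            break
    ell = (p * p + q * q) / (v0 * v0)
    v1_sq = v0 * v0 - 2.0 * (r * p + x * q) + (r * r + x * x) * ell
    return p, q, ell, v1_sq ** 0.5


p, q, ell, v1 = solve_two_bus(v0=1.0, r=0.01, x=0.02, p_load=0.5, q_load=0.2)
print(f"sending p={p:.5f}, q={q:.5f}, load-bus voltage={v1:.4f} pu")
```

The loop converges in a handful of iterations here because line losses are small relative to the load; heavily loaded feeders would need more care.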
Turning to the elements of power distribution systems, we define each element to be multiphase because power is usually delivered in multiple phases. As shown in Figure 1, a (multiphase) node, or a bus, can be a pure connection point or include node objects like loads, capacitors, or batteries. A (multiphase) edge is formed by a line, a transformer, or a regulator. Loads model the power consumption of the consumers. Capacitors provide reactive power and batteries are energy (active power) storage. Lines model the connection from one (multiphase) node to another subject to Ohm's law. Transformers and regulators adjust the voltage from one node to another.
4.2 VoltVar Control as An RL Problem
In this subsection, we describe the VoltVar control problem in the language of RL. We consider a finite-horizon MDP with $H = 24$ steps as the horizon because we focus on daily control with one action per hour. Still, the horizon is a variable and hence changeable. We will discuss this in the sections on environment registration and experiments. The following paragraphs give the details of the observation space, the action space, the state transition, and the reward function in PowerGym.
4.2.1 Observation and Action Spaces
The observation and action spaces, as summarized in Tables 1 and 2, are products of discrete and continuous variables. The discrete variables come from the physical constraints of the controllers; for example, a capacitor is either on or off, a regulator operates in a finite number of modes (tap number), and a discrete battery only has a finite number of discharge powers. The continuous variables are normalized into bounded ranges; for example, the (per-unit) voltage is expressed in units of the base voltage on a bus, and hence is usually bounded in [0.8, 1.2]. The battery's state-of-charge (charge / max charge), or the soc, is in [0.0, 1.0]. The continuous battery's normalized discharge power (discharge power / max discharge power) is in [-1.0, 1.0], where a negative value means charging and a positive value means discharging.
Depending on the device constraints, a battery control can be either discrete or continuous. This affects the action space (Table 2) since the action representations are different. Still, because we can postprocess the observation after receiving it, in the observation space (Table 1), we unify the representation of discrete and continuous batteries by mapping the discrete battery’s discharge power to the normalized form.
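The mapping from a discrete battery action to the normalized discharge power can be sketched as follows. This assumes the discrete levels are evenly spaced over [-1, 1] and indexed from zero, with index 0 meaning full charging; the source does not state the spacing or the ordering, so treat both as assumptions. The function name is hypothetical.

```python
def normalized_discharge(action_idx, bat_act_num):
    """Map a discrete battery action index to a normalized power in [-1, 1].

    Assumption: levels are evenly spaced and index 0 is full charging (-1).
    """
    if bat_act_num < 2:
        raise ValueError("need at least two discrete levels")
    if not 0 <= action_idx < bat_act_num:
        raise ValueError("action index out of range")
    return -1.0 + 2.0 * action_idx / (bat_act_num - 1)


# With the default of 33 levels, the middle index corresponds to idling.
for idx in (0, 16, 32):
    print(idx, normalized_discharge(idx, 33))
```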
Whether to discretize the battery makes the action space either multi-discrete or a product of multi-discrete and continuous spaces. Either way, the problem is hard for a tabular policy or a policy that encodes the actions as one-hot vectors (e.g., DDQN (Van Hasselt et al., 2016)) because the size of the discrete part of the action space scales exponentially in the number of controllers. Also, the possibility of mixing discrete and continuous actions makes it harder to design the policy.

Table 1: Observation space.
Variable  Type  Range

Bus voltage  cont.  [0.8, 1.2]
Capacitor status  disc.  {0, 1}
Regulator tap number  disc.  {0, ..., reg_act_num - 1}
State-of-charge (soc)  cont.  [0, 1]
Discharge power  cont.  [-1, 1]

Table 2: Action space.
Variable  Type  Range

Capacitor status  disc.  {0, 1}
Regulator tap number  disc.  {0, ..., reg_act_num - 1}
Discharge power (disc.)  disc.  {0, ..., bat_act_num - 1}
Discharge power (cont.)  cont.  [-1, 1]
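To make the exponential growth of the discrete action space concrete, the sketch below counts joint actions for a system with discrete batteries. It uses the device counts the paper reports for the 13Bus and 8500Node systems and the default of 33 actions per regulator and per battery. The function name is hypothetical.

```python
def discrete_action_count(n_cap, n_reg, n_bat, reg_act_num=33, bat_act_num=33):
    """Number of joint actions when every controller is discrete."""
    return (2 ** n_cap) * (reg_act_num ** n_reg) * (bat_act_num ** n_bat)


# 13Bus: 2 capacitors, 3 regulators, 1 battery.
print(discrete_action_count(2, 3, 1))
# 8500Node: 10 capacitors, 12 regulators, 10 batteries.
print(discrete_action_count(10, 12, 10))
```

Even the smallest system has millions of joint actions, which is why a one-hot encoding over the joint action space is impractical.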
4.2.2 State Transition
We now describe the state transition function $P$ in PowerGym. With the descriptions in Tables 1 and 2, $P$ is represented as
$$
s' \sim P(\cdot \mid s, a), \qquad s' = (V', \mathrm{cap}', \mathrm{reg}', \mathrm{soc}', \mathrm{dis}').
\tag{4}
$$
$V'$ is the next set of voltages and depends on the action $a$ and the stochasticity of loads, which we model using the load profiles (discussed in the environment design section). $\mathrm{cap}'$ and $\mathrm{reg}'$ are the next statuses of capacitors and regulators. $\mathrm{soc}'$ and $\mathrm{dis}'$ are the next soc's and discharge powers of batteries. Both of them depend on the current state because a battery's soc cannot go beyond full charge ($\mathrm{soc} = 1$) or depleted ($\mathrm{soc} = 0$). To enforce this, we project the attempted discharge power in an action to the allowed range based on $\mathrm{soc}$, making $\mathrm{dis}'$ a function of $(\mathrm{soc}, a)$.
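The projection of an attempted discharge power onto the range allowed by the current soc might look like the sketch below. It assumes a lossless battery, a one-hour step, and hypothetical parameter names (`max_power_kw`, `capacity_kwh`). PowerGym's actual battery model may differ.

```python
def project_discharge(soc, attempted, max_power_kw, capacity_kwh, dt_h=1.0):
    """Clip a normalized discharge power so the soc stays within [0, 1].

    attempted is in [-1, 1]; positive discharges, negative charges.
    Returns the applied normalized power and the next soc.
    """
    attempted = max(-1.0, min(1.0, attempted))
    energy_per_unit = max_power_kw * dt_h / capacity_kwh  # soc change at |power| = 1
    upper = min(1.0, soc / energy_per_unit)               # cannot drain below empty
    lower = max(-1.0, -(1.0 - soc) / energy_per_unit)     # cannot charge above full
    applied = max(lower, min(upper, attempted))
    next_soc = min(1.0, max(0.0, soc - applied * energy_per_unit))
    return applied, next_soc


# A nearly empty battery cannot deliver the full requested power.
print(project_discharge(soc=0.2, attempted=1.0, max_power_kw=50.0, capacity_kwh=100.0))
# A full battery ignores a charging request.
print(project_discharge(soc=1.0, attempted=-0.5, max_power_kw=50.0, capacity_kwh=100.0))
```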
4.2.3 Reward Function
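We implement the objective of a VoltVar problem, Eq. (3), as a reward function:
$$
r_t(s_t, s_{t+1}) = -f_{volt}(s_{t+1}) - f_{ctrl}(s_t, s_{t+1}, t) - w_{power}\, f_{power}(s_{t+1}),
\tag{5}
$$
$$
f_{power}(s_{t+1}) = \frac{\text{overall power loss}}{\text{total power}}.
\tag{6}
$$
$s_t$ is a concatenation of all observations in the current step, $s_{t+1}$ is that in the next step, and $t$ is the episode step. The dependency on step $t$ implies the reward could be nonstationary. The power loss, Eq. (6), is the ratio of the overall power loss to the total power. The voltage violation and control error are expressed in Eq. (7) and (8).
Eq. (5) is expressed as $r_t(s_t, s_{t+1})$, not $r_t(s_t, a_t)$, because the action is a part of the next state $s_{t+1}$. Mathematically, $r_t(s_t, s_{t+1})$ and $r_t(s_t, a_t)$ are equivalent because $s_{t+1}$ is a function of $(s_t, a_t)$ under the state transition function $P$.
The voltage violation, Eq. (7), is a sum of worst-case voltage violations among all phases across all the nodes in the system. The upper/lower violation thresholds ($\overline{V}$/$\underline{V}$) are set as $\pm 5\%$ of the per-unit voltage, following the US voltage regulation standard (ANSI, 2011).
$$
f_{volt}(s_{t+1}) = \sum_{n} \Big[ \big(\max_{\phi} V_{n,\phi} - \overline{V}\big)_+ + \big(\underline{V} - \min_{\phi} V_{n,\phi}\big)_+ \Big],
\tag{7}
$$
where $n$ ranges over nodes, $\phi$ over phases, and $(\cdot)_+$ is a shorthand for $\max(\cdot, 0)$. Thereby, the upper violation is positive when $\max_{\phi} V_{n,\phi} > \overline{V}$ and zero otherwise.
The control error, Eq. (8), is a sum of capacitors' and regulators' switching penalties (1st and 2nd rows) and batteries' discharge penalty and soc penalty (3rd row). These penalties discourage the policy from making frequent changes and slow device wear. Note the discharge error $(d'_b)_+ / \overline{d}_b$, with $\overline{d}_b$ being the max power, uses the $(\cdot)_+$ function because battery degradation is primarily caused by the battery discharging power ($d'_b > 0$). Besides, the soc penalty has an indicator $\mathbb{1}_{last}(t)$ of the last time step to encourage a battery to return to its initial state-of-charge $\mathrm{soc}_{0,b}$. Hence, the reward is stationary if $w_{soc} = 0$ and nonstationary otherwise.
$$
\begin{aligned}
f_{ctrl}(s_t, s_{t+1}, t) ={}& \sum_{c} w_{cap}\, \big|\mathrm{status}_c - \mathrm{status}'_c\big| \\
&+ \sum_{g} w_{reg}\, \big|\mathrm{tap}_g - \mathrm{tap}'_g\big| \\
&+ \sum_{b} \Big( w_{dis}\, \frac{(d'_b)_+}{\overline{d}_b} + w_{soc}\, \big|\mathrm{soc}'_b - \mathrm{soc}_{0,b}\big|\, \mathbb{1}_{last}(t) \Big),
\end{aligned}
\tag{8}
$$
where $c, g, b$ represent a capacitor, a regulator, and a battery. $\mathrm{status}_c$, $\mathrm{tap}_g$, $d_b$, $\mathrm{soc}_b$ are the status of $c$, tap number of $g$, discharge power of $b$, and soc of $b$; primes denote the values in the next step.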
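The reward terms described above can be sketched in plain Python as follows. This follows the verbal description only (worst-case violation per node, switching penalties, a positive-part discharge penalty, and a soc penalty on the last step). The exact form and the placement of the weights are assumptions, and all function and argument names are hypothetical.

```python
def pos(x):
    """Positive part, max(x, 0)."""
    return max(x, 0.0)


def voltage_violation(node_phase_voltages, v_upper=1.05, v_lower=0.95):
    """Sum over nodes of the worst-case upper and lower violations among phases."""
    total = 0.0
    for phases in node_phase_voltages:
        total += pos(max(phases) - v_upper) + pos(v_lower - min(phases))
    return total


def control_error(cap_prev, cap_next, tap_prev, tap_next, discharge_next,
                  soc_next, soc_init, is_last_step, cap_w, reg_w, dis_w, soc_w):
    """Switching penalties plus battery discharge and end-of-horizon soc penalties.

    discharge_next holds normalized powers (power / max power).
    """
    err = cap_w * sum(abs(a - b) for a, b in zip(cap_prev, cap_next))
    err += reg_w * sum(abs(a - b) for a, b in zip(tap_prev, tap_next))
    err += dis_w * sum(pos(d) for d in discharge_next)
    if is_last_step:
        err += soc_w * sum(abs(s - s0) for s, s0 in zip(soc_next, soc_init))
    return err


def reward(f_volt, f_ctrl, power_loss_ratio, power_w):
    """Negative weighted sum of the three losses."""
    return -(f_volt + f_ctrl + power_w * power_loss_ratio)


f_volt = voltage_violation([[1.00, 1.02, 0.99], [1.07, 1.00, 0.93]])
f_ctrl = control_error([1, 1], [1, 0], [16, 16, 16], [16, 18, 16], [0.5],
                       [0.8], [1.0], False, 1.0 / 33, 1.0 / 33, 6.0 / 33, 0.0)
print(reward(f_volt, f_ctrl, power_loss_ratio=0.03, power_w=10.0))
```

The weights in the usage example are the values listed for the 13Bus environment; the voltages, taps, and loss ratio are made up.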
5 Design of PowerGym
5.1 Environment Instantiation
Similar to the OpenAI Gym, PowerGym provides make_env() to instantiate an environment:
make_env(env_name, worker_idx=None)
env_name is the name of the registered environment. worker_idx is used (if not None) for parallel execution, which we detail in the subsection of load profiles.
make_env() reads the following information. First, PowerGym reads the circuit files into the environment class and uses OpenDSS to compile them, as shown in Figure 1. Second, to define the hyperparameters that affect RL training on the same system, PowerGym needs information such as the horizon, the number of actions of a regulator/battery, and the weights of the power loss, capacitor switching loss, regulator switching loss, battery discharge loss, and battery state-of-charge (soc) loss. The next subsection introduces the customization of such information.
5.2 Environment Registration and Customization
Users can customize their environment by registering a new environment name associated with the required information. This is done by appending the information to the dictionaries in the PowerGym register. Below is an example of the dictionary. dss_file is the main circuit file that OpenDSS compiles. Users can edit dss_file to change the circuit objects and structure. Users can also change the hyperparameters. max_episode_steps is the horizon. It is 24 by default as we focus on the daily control. act_num is the shorthand of the number of actions, so the battery becomes continuous if bat_act_num is infinity and discrete if finite. The other parameters are the weights in the reward function shown in Eq. (5).
'13Bus': {
    'system_name': '13Bus',
    'dss_file': 'IEEE13Nodeckt_daily.dss',
    'max_episode_steps': 24,
    'reg_act_num': 33,
    'bat_act_num': 33,
    'power_w': 10.0,
    'cap_w': 1.0/33,
    'reg_w': 1.0/33,
    'soc_w': 0.0/33,
    'dis_w': 6.0/33
}
Besides the information shown above, PowerGym also depends on the load profiles (see Figure 1 and the subsequent subsection) and the other circuit files. These files are customizable and can be found in the folder of systems/system_name of the repository. For example, the above shows users can find customizable files in the folder systems/13Bus.
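A custom entry can be derived from the 13Bus example by copying the dictionary and overriding a few fields, as in the sketch below. The helper names, the new horizon, and the nonzero `soc_w` value are illustrative choices. How the resulting dictionary is appended to the PowerGym register is not shown here, since that depends on the repository layout.

```python
import copy
import math

# The 13Bus entry as listed in the text.
BASE_13BUS = {
    'system_name': '13Bus',
    'dss_file': 'IEEE13Nodeckt_daily.dss',
    'max_episode_steps': 24,
    'reg_act_num': 33,
    'bat_act_num': 33,
    'power_w': 10.0,
    'cap_w': 1.0 / 33,
    'reg_w': 1.0 / 33,
    'soc_w': 0.0 / 33,
    'dis_w': 6.0 / 33,
}


def derive_env_info(base, **overrides):
    """Copy a registered entry and override selected fields (hypothetical helper)."""
    unknown = set(overrides) - set(base)
    if unknown:
        raise KeyError(f"unknown fields: {sorted(unknown)}")
    info = copy.deepcopy(base)
    info.update(overrides)
    return info


def is_continuous_battery(info):
    """The battery is continuous when bat_act_num is infinity."""
    return math.isinf(info['bat_act_num'])


# Illustrative variant: two-day horizon, continuous battery, nonzero soc weight.
custom = derive_env_info(BASE_13BUS, max_episode_steps=48,
                         bat_act_num=float('inf'), soc_w=4.0 / 33)
print(is_continuous_battery(custom), custom['max_episode_steps'])
```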
5.3 Default Registered Environments
Table 3: Default registered environments.
System  Environment Names

13Bus  13Bus, 13Bus_cbat, 13Bus_soc, 13Bus_cbat_soc
34Bus  34Bus, 34Bus_cbat, 34Bus_soc, 34Bus_cbat_soc
123Bus  123Bus, 123Bus_cbat, 123Bus_soc, 123Bus_cbat_soc
8500Node  8500Node, 8500Node_cbat, 8500Node_soc, 8500Node_cbat_soc
In Table 3, each system (summarized in Table 4) in PowerGym has four default environments: vanilla, continuous battery, soc, continuous battery & soc. The difference lies only in the battery’s settings; hence capacitors and regulators are the same across these four environments.
The presence of cbat affects the battery's number of discharge power levels: without cbat, the number is finite (33 by default), and the battery's model is discrete; with cbat, the number is infinite, and the battery's model is continuous. On the other hand, soc determines the state-of-charge penalty on the battery at the end of the horizon: without soc, the soc penalty is zero, and the reward is stationary; with soc, the soc penalty is positive, and the reward is nonstationary. Besides the four default environments, one can call a scaled environment by appending a scale to an environment name; e.g., 13Bus_s1.5 scales the loads by 1.5. We will revisit load scaling in the subsection on load profiles.
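The naming convention for scaled environments can be illustrated with a small parser that splits a name such as 13Bus_s1.5 into a base environment name and a load scale. This is a hypothetical helper written for illustration. PowerGym's own name handling may be implemented differently.

```python
import re

# A trailing "_s<number>" denotes the load scale; "_soc" must not match.
_SCALE_SUFFIX = re.compile(r'^(?P<base>.+)_s(?P<scale>\d+(?:\.\d+)?)$')


def split_env_name(env_name):
    """Return (base_name, load_scale); the scale defaults to 1.0."""
    match = _SCALE_SUFFIX.match(env_name)
    if match is None:
        return env_name, 1.0
    return match.group('base'), float(match.group('scale'))


for name in ('13Bus', '13Bus_soc', '13Bus_s1.5', '123Bus_cbat_soc_s2.5'):
    print(name, '->', split_env_name(name))
```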

Table 4: Number of capacitors, regulators, and batteries in each system.
System  # Caps  # Regs  # Bats

13Bus  2  3  1
34Bus  2  6  2
123Bus  4  7  4
8500Node  10  12  10
5.4 Gym-like Usage
PowerGym supports Gym-like usages such as reset, step, random action sampling, and visualization. The design is compatible with most RL algorithms. Below is a brief overview of these functions.
obs = Env.reset(load_profile_idx=0)
The reset function initializes the system and returns an initial observation. The dynamics of the load are controlled by the load profile index and will be discussed in the next subsection. The initial statuses of capacitors, regulators, and batteries are set to "on", full tap number, and (full charge, zero discharge power), respectively.
obs, reward, done, info = Env.step(action)
The step function takes an action as the input and returns the next state, reward, done signal, and the information dictionary. Since the current design does not define a terminal state that should be strictly avoided, the done signal is true only when the episode step reaches the horizon. The information dictionary includes several details about the reward such as capacitor error, regulator error, discharge error, soc error, and soc (all averaged), which also facilitates the application of multi-objective RL (Liu et al., 2014).
action = Env.random_action()
Random actions can also be generated by Env.action_space.sample(), and the random seed is set by Env.seed(). The action is an $(N_{cap} + N_{reg} + N_{bat})$-dimensional array of control signals for the controllers, with $N_{cap}$, $N_{reg}$, $N_{bat}$ being the numbers of capacitors, regulators, and batteries.
fig, pos = Env.plot_graph()
The plot graph function returns a Matplotlib figure and a dictionary of node positions for users to visualize the network status. It supports options such as show_voltages, show_controllers, and show_actions.
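Putting the calls above together, a random-policy rollout can be sketched as below. So that the snippet runs without PowerGym installed, it uses a small stand-in class (`StubEnv`, hypothetical) that exposes the same method names and return shapes described in this subsection. With PowerGym, the object returned by make_env() would take its place.

```python
import random


class StubEnv:
    """Hypothetical stand-in with the reset/step/random_action interface."""

    def __init__(self, horizon=24, seed=0):
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self, load_profile_idx=0):
        self.t = 0
        return [1.0]  # placeholder observation

    def random_action(self):
        return [self.rng.randrange(2)]  # placeholder single on/off device

    def step(self, action):
        self.t += 1
        reward = -float(action[0])          # placeholder penalty
        done = self.t >= self.horizon       # done only at the horizon
        return [1.0], reward, done, {}


def run_episode(env, load_profile_idx=0):
    """Roll out one episode with random actions; return (return, steps)."""
    obs = env.reset(load_profile_idx=load_profile_idx)
    total, steps, done = 0.0, 0, False
    while not done:
        action = env.random_action()
        obs, reward, done, info = env.step(action)
        total += reward
        steps += 1
    return total, steps


total, steps = run_episode(StubEnv(horizon=24))
print(f"episode length = {steps}, cumulative reward = {total}")
```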
5.5 Other Usages and Constraints of Load Profiles
To simulate load dynamics, OpenDSS supports time-series simulations following some predefined load curves. A group of predefined curves of all loads is called a load profile. As mentioned in the previous subsection, Env.reset() has an option of load profile selection. Hence, PowerGym models the stochasticity of state transition using the load profiles.
By enlarging the values in the load profile with a fixed scale, PowerGym creates environments with various load scales. Since power consumption scales linearly with the load scale, the environment tends to be hard under a large load scale. Referring to the subsection of default environments, a scaled environment is instantiated by appending a scale to an environment name. During the call, PowerGym generates the load profiles under the corresponding scale factor and another text file to store the current load scale. Note the load profile is regenerated only if the previous load scale (stored in the text file) is different from the current one.
Due to the file dependency on the load profile, parallel execution is possible under certain conditions. As mentioned earlier, the worker index of make_env() is used for parallel execution. When it is None, PowerGym cannot execute two environments on the same system (e.g., cannot execute 13Bus, 13Bus_cbat together) due to the conflict of load profile selection. This is solved when the worker index is an integer because each worker has a distinct profile selection file. However, even with the worker index, PowerGym cannot execute environments in parallel with names that differ only in the load scales (e.g., 13Bus_s1.0, 13Bus_s2.0) because it only allows one load scale at any given time.
6 Experiments
6.1 Cumulative Rewards in Default Environments
Figure 2: Cumulative rewards on test load profiles for different agents in 40k steps. The labels denote the average and standard deviation of the final rewards.
To show the applicability of PowerGym, we have trained two popular RL algorithms as benchmarks on our environments: Proximal Policy Optimization (PPO) (Schulman et al., 2017) and Soft Actor-Critic (SAC) (Haarnoja et al., 2018), with implementations based on Fujita et al. (2021). Since PPO is on-policy while SAC is off-policy, these two algorithms give us a proxy for the expected performance of on-policy versus off-policy algorithms in the environments. For comparison, both PPO and SAC have been trained on multi-discrete actions by default. In addition, SAC has been trained on environments with continuous batteries (cbat) to compare the environments with different battery settings. The experiments are run on a server with one AMD Ryzen Threadripper 3970X CPU and one Nvidia RTX 3090 GPU.
The experiments have been designed as follows: The load profiles are randomly partitioned into two halves, one for training and the other for testing. During training, the policy is tested on test load profiles every 5 episodes; or equivalently every 120 steps as the horizon is 24. Lastly, all experiments are performed across ten random seeds.
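The experiment protocol above can be sketched as two small helpers: a seeded random split of load-profile indices into halves, and the list of training steps at which evaluation happens. Function names and the number of profiles in the usage example are illustrative. The paper does not specify how the split is implemented.

```python
import random


def split_profiles(num_profiles, seed):
    """Randomly partition profile indices into train and test halves."""
    indices = list(range(num_profiles))
    random.Random(seed).shuffle(indices)
    half = num_profiles // 2
    return sorted(indices[:half]), sorted(indices[half:])


def evaluation_steps(total_steps, horizon=24, eval_every_episodes=5):
    """Training steps at which the policy is tested (every 5 episodes = 120 steps)."""
    interval = horizon * eval_every_episodes
    return list(range(interval, total_steps + 1, interval))


train_idx, test_idx = split_profiles(num_profiles=10, seed=0)
print(train_idx, test_idx)
print(evaluation_steps(480))
```

Because the first evaluation happens only after five full episodes, learning curves built this way start at step 120 rather than step 0.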
In Figure 2, the label "random" denotes an untrained policy that samples actions uniformly from the action space. As expected, SAC converges faster and outperforms PPO across all environments, which aligns with the results reported in the SAC paper (Haarnoja et al., 2018) on MuJoCo (Todorov et al., 2012). Due to the experiment design (evaluation every 120 steps), all curves start at step 120 instead of step 0. The first evaluation (step 120) reveals the algorithms' performance after the first few updates: PPO is similar to the random policy while SAC is not. That PPO stays near the random policy is consistent with the clipping in its policy gradient. Clipping makes PPO update slowly yet steadily, and hence it stays similar to the random policy in the early steps. As for SAC, because its DDPG-style policy gradient (Silver et al., 2014) is not clipped, SAC suffers more from the initial inaccuracy of the Q-function and hence deviates from the random policy in the early steps. Finally, the performances of SAC and cbat_SAC are very similar, implying discrete and continuous batteries share similar behaviors, and SAC successfully adapts to both.

To sum up, we have demonstrated the applicability of PPO and SAC in PowerGym and the sample efficiency of SAC. We have also shown that environments with discrete or continuous batteries have similar performances.
6.2 Case Study: 123Bus
We take the 123Bus system (Figure 3) as an example to further analyze the behavior of the control policy in PowerGym. Specifically, we focus on the continuous battery scenario because, in practice, the battery may discharge/charge arbitrarily within an allowable range. For this case study, we consider four variations: vanilla (cbat), scaled loads (cbat_s2.5), with soc penalty (cbat_soc), and scaled loads with soc penalty (cbat_soc_s2.5). As mentioned in the default environment section, both the soc penalty and a large load scale make the environment more challenging, as the former introduces a nonstationary reward while the latter incurs large power consumption. We would therefore like to see how the control policy adapts to different scenarios.
The first row of Figure 4 visualizes the average switching errors of capacitors and regulators respectively. Both errors are small in most time steps across all scenarios. Hence, the policies for both capacitors and regulators only make large changes when needed while making small adjustments the rest of the time. Note the behavior of the first 1000 steps and the later steps are different because the RL exploration starts after the first 1000 steps of random exploration.
The second row of Figure 4 shows the power loss ratio and the voltage violation. Because of the load scaling, the 2.5-scaled environments have higher voltage violations than the unscaled environments (cbat_s2.5 > cbat and cbat_soc_s2.5 > cbat_soc). Furthermore, the voltage violation of soc_s2.5 is greater than that of s2.5 as the soc penalty makes the policy on batteries more restrictive and nonstationary. As for the power loss, since it is a difficult objective, it barely improves over time. Still, we see that the power losses in the 2.5-scaled environments are higher than in the unscaled counterparts. This is because large voltage violations cause large voltage differences on the lines, which increases the power loss on the lines.
Finally, the third row of Figure 4 shows the battery activity in terms of discharge errors and soc errors. Since the battery is an energy storage device, it is useful when the environment lacks power and has high voltage violations. Hence, the battery barely discharges in the unscaled environments and maintains mostly zero soc error. As for the scaled environments (s2.5 & soc_s2.5), because s2.5 discharges frequently, it has smaller voltage violations but higher soc error. In comparison, soc_s2.5 discharges less and has a higher voltage violation but smaller soc error. Therefore, there is a trade-off between battery activity and voltage violation in heavily loaded environments: the more battery activity, the less voltage violation, and RL algorithms need to find a delicate balance between the two.
All in all, the soc penalty and the load scale affect the difficulty of a PowerGym environment. The difficulty can be evaluated by power losses, voltage violations, and battery activities. The harder an environment is, the more power losses, voltage violations, and battery activities.
6.3 Effects of Horizons
Figure 5 shows the testing cumulative reward w.r.t. the horizon for the 123Bus and 8500Node systems. We only analyze the continuous battery with the SAC algorithm as this is the best-performing setting according to Figure 2. As the cumulative reward scales linearly w.r.t. the horizon, h48's cumulative reward is roughly twice h24's and h96's is roughly four times h24's. Besides, the convergence speeds w.r.t. horizons are similar in 123Bus because the 123Bus system is more stable and less likely to have voltage violations. On the other hand, 8500Node is less stable, so it takes more steps to converge under a longer horizon.
6.4 Difficulty Comparisons
As a concluding remark, we discuss the trend of PowerGym’s difficulty in four aspects: problem size, base voltage violation, load scale, and soc penalty. It helps users to choose the best environment for their applications.
Problem size refers to the dimensions of an environment; e.g., the horizon and the sizes of the state and action spaces. A larger horizon makes an environment harder as the error of the value/Q-function is usually quadratic in the horizon (Duan et al., 2020). Similarly, larger state and action spaces complicate an environment. Under a fixed horizon, we expect the problem complexity in PowerGym to follow 8500Node > 123Bus > 34Bus > 13Bus.
Base voltage violation is the tendency to violate the voltage limits in an unscaled environment. It depends on the structure of the distribution system and the default load profiles. One may find a small system (small number of nodes) with high base voltage violation or a big system with low base voltage violation. For instance, the tendency of base voltage violation is 8500Node > 34Bus > 13Bus > 123Bus, as the voltage violation is the major term in the reward function and the training speed in Figure 2 follows the reverse order.
Load scale affects a PowerGym environment by changing the scale of the load profiles. A high load scale brings up the load power consumption, which increases the chance of voltage violations and makes the problem harder.
The soc penalty determines the stationarity of the battery behavior. With the soc penalty, the battery behaves nonstationarily as the battery should discharge at peak hours and charge at off-peak hours. Because a nonstationary behavior is harder to train than a stationary one, the soc error of cbat_soc_s2.5 in Figure 4 is mostly zero but less stable.
7 Conclusion
We develop a Gym-like open-source environment, PowerGym, to facilitate RL research/adaptation for VoltVar control in power distribution systems. PowerGym encourages power system researchers to make fair comparisons of RL algorithms using the same environment. It includes sufficient variations (problem size, base voltage violation, load scale, and soc penalty) to study different aspects of VoltVar control. PowerGym also acts as a base for researchers/engineers to adapt RL algorithms to real-life power distribution systems: it provides a detailed customization guide for researchers/engineers who use PowerGym with their own proprietary power distribution systems. Our RL experiments suggest the correctness of the PowerGym design. The cumulative rewards achieved by our RL agents serve as a baseline for PowerGym users. Future work on other problems in power distribution systems is underway.
8 Acknowledgement
We thank Siddharth Bhela for the instructions on OpenDSS, Suat Gumussoy for the feedback on environment design, and Ulrich Muenz for the general support on the project development.
References
 Reinforcement learning: theory and algorithms. Note: https://rltheorybook.github.io/ Cited by: §3.
 Electric power systems and equipment – voltage ratings (60 hz) c84.1. Note: American National Standards Institute (ANSI) Standard: C84.12011 Cited by: §4.2.3.
 Integrated volt/var control in distribution systems. In 2001 IEEE Power Engineering Society Winter Meeting. Conference Proceedings (Cat. No. 01CH37194), Vol. 3, pp. 1485–1490. Cited by: §1.
 Openai gym. arXiv preprint arXiv:1606.01540. Cited by: §1, §2.

 Minimax-optimal off-policy evaluation with linear function approximation. In International Conference on Machine Learning, pp. 2701–2709. Cited by: §6.4.
 The ieee 8500-node test feeder. Electric Power Research Institute, Palo Alto, CA, USA. Cited by: §1.
 Power systems stability control: reinforcement learning framework. IEEE Transactions on Power Systems 19 (1), pp. 427–435. External Links: Document Cited by: §2.
 Equilibrium and dynamics of local voltage control in distribution systems. In 52nd IEEE Conference on Decision and Control, pp. 4329–4334. Cited by: §1.
 ChainerRL: a deep reinforcement learning library. Journal of Machine Learning Research 22 (77), pp. 1–14. Cited by: §6.1.
 Exact convex relaxation of optimal power flow in radial networks. IEEE Transactions on Automatic Control 60 (1), pp. 72–87. Cited by: §1.
 Deep reinforcement learning in power distribution systems: overview, challenges, and opportunities. In 2021 IEEE Power & Energy Society Innovative Smart Grid Technologies Conference (ISGT), pp. 1–5. External Links: Document Cited by: §2.
 Ns3gym: extending openai gym for networking research. arXiv preprint arXiv:1810.03943. Cited by: §2.
 Soft actorcritic algorithms and applications. arXiv preprint arXiv:1812.05905. Cited by: §1, §6.1, §6.1.
 Multiobjective reinforcement learning: a comprehensive overview. IEEE Transactions on Systems, Man, and Cybernetics: Systems 45 (3), pp. 385–398. Cited by: §5.4.
 Learning to run a power network challenge: a retrospective analysis. arXiv preprint arXiv:2103.03104. Cited by: §1.
 Learning to run a power network challenge for training topology controllers. Electric Power Systems Research 189, pp. 106635. External Links: ISSN 03787796, Document, Link Cited by: §2.
 Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937. Cited by: §1.
 IEEE pes test feeders. Note: https://site.ieee.org/pestestfeeders/resources/ Accessed: 2021-07-28. Cited by: §1.
 Benchmarking safe exploration in deep reinforcement learning. arXiv preprint arXiv:1910.01708 7. Cited by: §2.
 Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §1, §6.1.
 Deterministic policy gradient algorithms. In International conference on machine learning, pp. 387–395. Cited by: §6.1.

 Continuous multi-agent control using collective behavior entropy for large-scale home energy management. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 922–929. Cited by: §2.
 Mujoco: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §6.1.
 Deep reinforcement learning with double qlearning. In Proceedings of the AAAI conference on artificial intelligence, Vol. 30. Cited by: §4.2.1.
 Safe offpolicy deep reinforcement learning algorithm for voltvar control in power distribution systems. IEEE Transactions on Smart Grid 11 (4), pp. 3008–3018. External Links: Document Cited by: §2.
 Voltvar control in power distribution systems with deep reinforcement learning. In 2019 IEEE International Conference on Communications, Control, and Computing Technologies for Smart Grids (SmartGridComm), pp. 1–7. External Links: Document Cited by: §2.
 Datadriven load frequency control for stochastic power systems: a deep reinforcement learning method with continuous action search. IEEE Transactions on Power Systems 34 (2), pp. 1653–1656. External Links: Document Cited by: §2.
 Optimal power flow based on successive linear approximation of power flow equations. IET Generation, Transmission & Distribution 10 (14), pp. 3654–3662. Cited by: §1.
 Winning the l2rpn challenge: power grid management via semimarkov afterstate actorcritic. In International Conference on Learning Representations, Cited by: §1.
 Deep reinforcement learning based voltvar optimization in smart distribution systems. IEEE Transactions on Smart Grid 12 (1), pp. 361–371. External Links: Document Cited by: §2.
 Deep reinforcement learning for power system applications: an overview. CSEE Journal of Power and Energy Systems 6 (1), pp. 213–225. External Links: Document Cited by: §2.
Appendix A Appendix
A.1 System Layouts
Table 4 (in the main text) and Figure 6 show the controller summary and layouts of the systems. The 13Bus system has 2 capacitors at buses 611 and 675; 3 single-phase regulators at the edge (650, rg60); and 1 battery at bus 680. The 34Bus system has 2 capacitors at buses 844 and 848; 6 single-phase regulators at the edges (814, 814r) and (852, 852r) (3 for each); and 2 batteries at buses 832 and 890. The 123Bus system has 4 capacitors at buses 83, 88, 90, and 92; 1 three-phase regulator at bus 150 and 9 single-phase regulators at buses 9, 25, and 160 (3 for each); and 4 batteries at buses 33, 67, 114, and 300. The 8500Node system has 10 capacitors (one three-phase near the source and nine single-phase at three other locations); 12 single-phase regulators at four buses (3 for each bus; one bus is the source, and the others are shown in the figure); and 10 batteries.
A.2 Observation and Action Wrapper
Although the observation and action spaces are composed of both discrete and continuous values, for conciseness we wrap the observation and the action into NumPy arrays as follows.
wrapped_obs = Concatenate([all phase voltages at each bus, all capacitor statuses, all regulator tap numbers, all battery soc’s and normalized discharge powers])
action = Concatenate([all capacitor statuses, all regulator tap numbers, all battery discharge powers])
Capacitor statuses, regulator tap numbers, and the discharge powers of discrete batteries are represented as integers. The discharge powers of continuous batteries are represented as floating-point numbers.
The wrapped observation is the default output of Env.reset() and Env.step(). Still, users can access all phase voltages with the observation dictionary at Env.obs. These two representations of observations have the following relation:
wrapped_obs = Env.wrap_obs(Env.obs)
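The concatenation described above can be sketched with NumPy as follows. This is a minimal illustration, not the PowerGym implementation: the function names `wrap_obs` and `build_action`, the dictionary keys (`bus_voltages`, `cap_statuses`, `reg_tap_nums`, `battery_statuses`), and the sorting of bus names are assumptions made for the example.

```python
import numpy as np

def wrap_obs(obs):
    """Flatten a (hypothetical) observation dictionary into one 1-D float array."""
    # Sort bus names so the layout of the flat vector is deterministic (an assumption).
    voltages = [v for bus in sorted(obs["bus_voltages"]) for v in obs["bus_voltages"][bus]]
    # Each battery contributes its soc followed by its normalized discharge power.
    battery = [x for soc, power in obs["battery_statuses"] for x in (soc, power)]
    return np.concatenate([
        np.asarray(voltages, dtype=np.float64),
        np.asarray(obs["cap_statuses"], dtype=np.float64),
        np.asarray(obs["reg_tap_nums"], dtype=np.float64),
        np.asarray(battery, dtype=np.float64),
    ])

def build_action(cap_statuses, reg_taps, bat_powers):
    """Concatenate the control values in the order capacitors, regulators, batteries."""
    return np.concatenate([
        np.asarray(cap_statuses),
        np.asarray(reg_taps),
        np.asarray(bat_powers),
    ])

# Toy observation: two buses (one three-phase, one single-phase), two capacitors,
# three regulators, and one battery. The values are made up for illustration.
example_obs = {
    "bus_voltages": {"650": [1.0, 1.0, 1.0], "611": [0.98]},
    "cap_statuses": [1, 0],
    "reg_tap_nums": [16, 16, 16],
    "battery_statuses": [(0.5, 0.0)],
}
wrapped = wrap_obs(example_obs)

# With a discrete battery every entry is an integer; with a continuous battery
# NumPy promotes the whole action vector to floating point.
discrete_action = build_action([1, 0], [16, 16, 16], [3])
continuous_action = build_action([1, 0], [16, 16, 16], [0.25])
```

In this sketch the wrapped observation has one entry per phase voltage, capacitor, and regulator, plus two entries per battery, so the toy example yields a vector of length 11.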
A.3 Hyperparameters
In this section, we provide a summary of the hyperparameters of our environments and RL agents. The coefficients of the reward function are shown in Table 5.
Variable  13Bus  34Bus  123Bus  8500Node 

33  
33 (disc. bat), (cont. bat)  
1/33  
1/33  
10.0  1.0  10.0  1.0  
6/33  10/33  7/33  200/33  
0.0 
Variable  13Bus  34Bus  123Bus  8500Node 

1/33  4/33  5/33  200/33  
100/33  500/33  500/33  10000/33 
To train PPO and SAC agents, we use separate deep neural networks to parameterize the policy and value/Q functions. Both networks consist of dense layers with the same widths. Table 6 presents the suggested hyperparameters for PPO and SAC.
PPO:
Variable  Value 

Optimizer  Adam 
Learning rate  3E-4 
Discount factor  0.95 
Clip epsilon  0.2 
Batch size  64 
Model update interval  512 
Entropy coefficient  0.01 
SAC:
Variable  Value 

Optimizer  Adam 
Learning rate  3E-4 
Discount factor  0.95 
Batch size  256 
Model width  512 
Model depth  3 
Entropy coefficient  0.4 
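For convenience, the settings in Table 6 can be collected into plain Python dictionaries, as in the sketch below. The key names and the helper `dense_layer_shapes` are illustrative and not PowerGym identifiers. The sketch assumes that the table listing a clip epsilon belongs to PPO and the other to SAC, that the learning rate is 3e-4, and that "model depth" counts hidden layers of equal width.

```python
# Values copied from Table 6; key names are hypothetical.
PPO_HPARAMS = {
    "optimizer": "Adam",
    "learning_rate": 3e-4,  # assumed reading of the listed learning rate
    "discount_factor": 0.95,
    "clip_epsilon": 0.2,
    "batch_size": 64,
    "model_update_interval": 512,
    "entropy_coefficient": 0.01,
}

SAC_HPARAMS = {
    "optimizer": "Adam",
    "learning_rate": 3e-4,  # assumed reading of the listed learning rate
    "discount_factor": 0.95,
    "batch_size": 256,
    "model_width": 512,
    "model_depth": 3,
    "entropy_coefficient": 0.4,
}

def dense_layer_shapes(in_dim, out_dim, width, depth):
    """Return (fan_in, fan_out) pairs for `depth` equal-width hidden layers plus an output layer."""
    dims = [in_dim] + [width] * depth + [out_dim]
    return list(zip(dims[:-1], dims[1:]))

# Layer shapes for a network with a hypothetical 11-dimensional input and 6 outputs.
sac_shapes = dense_layer_shapes(
    in_dim=11,
    out_dim=6,
    width=SAC_HPARAMS["model_width"],
    depth=SAC_HPARAMS["model_depth"],
)
```

The input and output sizes depend on the chosen system and on whether the network is a policy or a value/Q function, so the numbers 11 and 6 here are placeholders.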