PowerGridworld: A Framework for Multi-Agent Reinforcement Learning in Power Systems

We present the PowerGridworld software package to provide users with a lightweight, modular, and customizable framework for creating power-systems-focused, multi-agent Gym environments that readily integrate with existing training frameworks for reinforcement learning (RL). Although many frameworks exist for training multi-agent RL (MARL) policies, none is designed for rapidly prototyping and developing the environments themselves, especially in the context of heterogeneous (composite, multi-device) power systems where power flow solutions are required to define grid-level variables and costs. PowerGridworld is an open-source software package that helps to fill this gap. To highlight PowerGridworld's key features, we present two case studies and demonstrate learning MARL policies using both OpenAI's multi-agent deep deterministic policy gradient (MADDPG) and RLLib's proximal policy optimization (PPO) algorithms. In both cases, at least some subset of agents incorporates elements of the power flow solution at each time step as part of their reward (negative cost) structures.




I Introduction

I-A Multi-Agent Reinforcement Learning in Energy Systems

With the increased controllability of power consumption and generation at the edge of modern power systems, devising centralized control approaches to manage flexible devices within these systems is becoming nearly impossible. In particular, the dynamical models of these interconnected systems present considerable nonlinearities, which challenges the applicability of classical control methods. In addition, the conflicting operational costs/objectives of heterogeneous devices often obstruct the formulation of a system-wide operational objective. Thus, decentralized, data-driven control approaches, where edge controllers utilize data to derive effective local control policies, provide a viable pathway to realizing the resilient and efficient operation of future energy systems.

Fig. 1: PowerGridworld architecture for an N-agent environment comprised of both single-component agents (Agent 1) and multi-component agents, which can be added in any number and combination. Given a base system load and the agents’ individual power profiles, a power flow solution is computed at each control step and may be used to update the agents’ states and rewards.

Reinforcement learning (RL) approaches have shown great potential in several power systems control and load management tasks [3, 24, 25, 5]. In addition, multi-agent RL (MARL) approaches have advanced and have been applied in many complex systems, including games [23] and autonomous driving [18]. Recently, MARL approaches have also found applications in the power systems domain, with an emphasis on voltage regulation problems [7, 2, 14]. These applications utilize the capabilities of MARL to devise local control policies without any knowledge of the models of the underlying complex systems. However, despite the significant interest in applying MARL to decentralized power system control tasks, there has been no standardized RL test environment that supports the modeling of heterogeneous power system components and the deployment of off-the-shelf MARL approaches.

I-B Related Software

Table I summarizes the features of MARL frameworks for power systems.


Package        | Power Systems Application           | Agent Customization      | Composable Agents | Control Step | MARL Training Interfaces
---------------|-------------------------------------|--------------------------|-------------------|--------------|-------------------------
PettingZoo     | None                                | Unlimited                | No                | User-defined | RLLib, OpenAI
CityLearn      | Demand response                     | Buildings and subsystems | No                | 1-hour       | Package-specific
GridLearn      | Demand response, voltage regulation | Buildings and subsystems | No                | Sub-hourly   | Package-specific
PowerGridworld | Any energy management/optimization  | Unlimited                | Yes               | User-defined | RLLib, OpenAI

TABLE I: Comparison of software features for MARL environments for power systems.

I-B1 General MARL Framework

The development of MARL is relatively new compared to its single-agent counterpart, and there is currently no commonly and widely used MARL framework. PettingZoo [19], a Python library, has the goal of developing a universal application programming interface (API) for formulating MARL problems, as OpenAI Gym [1] did for single-agent RL problems. However, because the key advantages of PettingZoo, namely an efficient formulation suitable for turn-based games and the ability to handle agent creation and death within episodes, are less relevant to power system control problems, PowerGridworld does not adopt the PettingZoo APIs at this stage for simplicity.

I-B2 Multi-Agent Energy Systems

CityLearn [22, 21] is an open-source library aimed at implementing and evaluating MARL for building demand response and energy coordination. By design, the heating and cooling demands of buildings in CityLearn are guaranteed to be satisfied, allowing researchers to focus on energy balance and load shifting for the control problem. To achieve this, building thermal models and associated energy demands are precomputed using EnergyPlus [4], and control actions are limited to active energy storage decisions rather than those affecting passive (thermal) mass. CityLearn is intended to provide a benchmark MARL environment from the standpoint of building demand response and, as such, it is highly constrained in terms of the types of agents and models that are available. CityLearn energy models include buildings, heat pumps, energy storage, and batteries, while the state and action spaces of the agents themselves must be constructed from a predefined variable list. Control steps in CityLearn are restricted to 1-hour resolution, and grid physics is not modeled. To address this, GridLearn [14] utilizes the building models provided in CityLearn and extends its functionality to include power system simulation. The added power flow model, implemented using pandapower [20], allows researchers studying decentralized control to consider both building-side and grid-level objectives. The GridLearn case study presented in [14] demonstrates that this platform can be used to train MARL controllers to achieve voltage regulation objectives in a distribution system by controlling behind-the-meter resources.

I-B3 MARL Training

While many open-source choices exist for MARL training, we highlight two of the most popular: RLLib (multiple algorithms available) and OpenAI’s multi-agent deep deterministic policy gradient (MADDPG). RLLib [8, 16] is a framework for scalable RL training built on the Ray Python library [11], and it supports a variety of training paradigms for single-agent, multi-agent, hierarchical, and offline learning. RLLib can be deployed on both cloud and high-performance computing (HPC) systems, and it provides a number of training “abstractions,” enabling users to develop custom, distributed RL algorithms. The multi-agent API in PowerGridworld is derived from RLLib’s own MultiAgentEnv API and thus is readily integrated into this framework. OpenAI [12] has played a central role in the evolution of both theory and software for RL and MARL. In addition to creating the Gym API, OpenAI released a series of tutorials and implementations in the mid-2010s that have continued to hold traction in the RL community, including the SpinningUp blog (https://spinningup.openai.com/en/latest/) and the baselines GitHub repository (https://github.com/openai/baselines). The OpenAI implementation [13] of MADDPG [9] is a popular choice for MARL with continuous control.

As described in greater detail in Section II-A, PowerGridworld makes it easy for users to leverage the implementations of both RLLib’s and OpenAI’s RL algorithms.

To the best of our knowledge, no previous software packages exist that enable users to implement arbitrary multi-agent scenarios with a power systems focus—in particular, with the ability to incorporate power flow solutions into the agents’ observation spaces and rewards. We believe that PowerGridworld begins to bridge this gap by enabling highly modular, customizable environments that readily integrate with open-source, scalable MARL training frameworks such as RLLib.

II PowerGridworld

II-A Description of Software

PowerGridworld is designed to provide users with a lightweight, modular, and customizable framework for creating power-systems-focused, multi-agent Gym environments that readily integrate with existing RL training frameworks. The purpose of this software, which is available as an open-source Python package (https://github.com/NREL/PowerGridworld), is to enable researchers to rapidly prototype simulators and RL algorithms for power systems applications at the level of detail of their choice, while also enabling the use of cloud and HPC via integration with scalable training libraries such as RLLib [8] and Stable Baselines [15].

II-A1 Architecture

The PowerGridworld design pattern is based on the OpenAI Gym API, which has become the de facto standard interface for training RL algorithms. The Gym API essentially consists of the following two methods:

  • reset: Initialize the simulation instance and return an observation of the initial state, s_0.

  • step: For each control step, apply an input control action, a_t, and return a new state observation, s_{t+1}; a step reward, r_t; a termination flag; and any desired metadata.

A simulator that is wrapped in the Gym API is often referred to as an environment, and one instance of the simulation is often called an episode.
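To make the reset/step contract concrete, the following is a minimal, self-contained sketch of a Gym-style component environment. The battery model, class name, and all parameter values here are hypothetical illustrations, not PowerGridworld's actual ComponentEnv implementation:

```python
import numpy as np

class BatteryEnv:
    """Toy Gym-style component: a battery that charges or discharges.

    Hypothetical sketch of the reset/step API described above.
    """

    def __init__(self, capacity_kwh=10.0, max_power_kw=5.0, dt_hours=1.0):
        self.capacity = capacity_kwh
        self.max_power = max_power_kw
        self.dt = dt_hours
        self.soc = 0.5  # state of charge in [0, 1]

    def reset(self):
        # Initialize the episode and return the initial observation s_0.
        self.soc = 0.5
        return np.array([self.soc], dtype=np.float32)

    def step(self, action):
        # action in [-1, 1]: fraction of max charge (+) / discharge (-) power.
        power_kw = float(np.clip(action, -1.0, 1.0)) * self.max_power
        energy_kwh = power_kw * self.dt
        self.soc = float(np.clip(self.soc + energy_kwh / self.capacity, 0.0, 1.0))
        obs = np.array([self.soc], dtype=np.float32)
        reward = -abs(power_kw) * 0.01  # small penalty on power throughput
        done = False
        return obs, reward, done, {"power_kw": power_kw}

env = BatteryEnv()
obs = env.reset()
obs, reward, done, info = env.step(0.5)  # charge at 50% of max power
```

One episode then consists of a single reset followed by repeated step calls until the termination flag is set.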

The core functionality of the PowerGridworld package is to extend the Gym API to include multi-agent simulations and to allow a user to combine environments that simulate individual devices or subsystems into a single, multi-component agent. This “plug-and-play” functionality is highly useful in power systems applications because it enables the user to rapidly create heterogeneous agents using basic building blocks of distributed energy resources (DERs).

We illustrate the PowerGridworld architecture in Fig. 1. Here, the MultiAgentEnv environment (blue) encapsulates agents that subclass one of two types:

  a) ComponentEnv environments (green), which implement a single, independent agent. This class is a slight extension of the OpenAI Gym API.

  b) MultiComponentEnv environments (yellow), which are a composition of component environments. For example, Agent N could represent a smart building agent composed of building thermodynamics, photovoltaics (PV), and battery physics, each implemented as a separate ComponentEnv.

The multi-agent Gym API can be readily plugged into an RL training framework such as RLLib (grey), where agent-level policies (red) are learned. Once the individual device physics has been implemented according to the ComponentEnv API, the software automates the creation of MultiComponentEnv environments. Any number of ComponentEnv and MultiComponentEnv agents can then be added to the MultiAgentEnv environment.
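The composition pattern described above can be sketched as follows. Both class bodies are hypothetical illustrations of the plug-and-play idea (the real PowerGridworld classes automate this wiring); only the reset/step contract is taken from the text:

```python
import numpy as np

class MultiComponentEnv:
    """Sketch: compose component envs into one multi-component agent.

    Observations are concatenated and rewards summed across components;
    this is an assumed aggregation rule for illustration only.
    """

    def __init__(self, components):
        self.components = components

    def reset(self):
        # Concatenate each component's initial observation into one vector.
        return np.concatenate([c.reset() for c in self.components])

    def step(self, actions):
        # One action per component; metadata dicts are merged.
        obs, total_reward, meta = [], 0.0, {}
        for comp, act in zip(self.components, actions):
            o, r, _, info = comp.step(act)
            obs.append(o)
            total_reward += r
            meta.update(info)
        return np.concatenate(obs), total_reward, False, meta

class DummyComponent:
    """Stand-in component obeying the reset/step contract."""

    def reset(self):
        return np.zeros(1)

    def step(self, action):
        return np.array([float(action)]), -abs(float(action)), False, {}

# A two-component agent built from interchangeable building blocks.
env = MultiComponentEnv([DummyComponent(), DummyComponent()])
obs0 = env.reset()
obs1, reward, done, _ = env.step([1.0, -2.0])
```

Swapping a DummyComponent for, say, a battery or PV component env requires no change to the composing class, which is the point of the plug-and-play design.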

II-A2 Power Flow Solver Integration

Another key feature of PowerGridworld is the integration of a power flow solver for simulating the grid physics that underlies the multi-agent environment. Although our examples utilize the open distribution system simulator (OpenDSS) [6] to solve the power flow on a test feeder, any power flow solver wrapped in the PowerFlowSolver API can be utilized.

II-B Advantages

The advantages of using PowerGridworld over existing MARL software packages are as follows. First, the plug-and-play modularity with a three-tier hierarchy (cf. Fig. 1) allows environments to be created from simpler components. Second, the multi-agent environment design allows both homogeneous and heterogeneous agent types. Third, the power flow solution can be used in agent-level states and rewards. Finally, PowerGridworld adheres to RLLib’s multi-agent API, with converters for both CityLearn/GridLearn and OpenAI’s MADDPG interfaces.

II-C Limitations

Next, we list some of the limitations of PowerGridworld. First, time stepping is synchronous and of fixed frequency. However, we have a road map for implementing both hierarchical and multi-frequency time stepping. Second, the communication model is limited. Centralized communication, whereby the process driving the environment collects and communicates variables between agents, is relatively straightforward to implement using only the Gym API. More advanced paradigms require custom implementations. Finally, the initial version of the MultiAgentEnv serializes calls to the agents (i.e., there is no parallelism).

III Case Studies

In this section, we present two examples of how PowerGridworld can be used to formulate multi-agent control tasks in energy systems.

III-A Multi-Agent Building Coordination Environment

In the first example, we consider three RL agents in a homogeneous setting. Each agent controls three components within one building: one HVAC system, one on-site PV system, and one energy storage (ES) system. Using this setup, this example demonstrates how to use the PowerGridworld package to model a learning environment that allows agent coordination while achieving each agent’s own objective.

To this end, the MARL system is implemented as follows:

  1. For each agent/building, the HVAC system needs to be controlled so that thermal comfort can be realized with minimal energy consumption. As a result, the HVAC component reward, r^{hvac}, includes penalties for both thermal discomfort and energy consumption.

  2. The PV and ES systems are two additional components that are controlled by an agent to modify the building’s net power consumption, but for simplicity, the rewards related to these two components are set to zero, i.e., r^{pv} = r^{es} = 0.

  3. We designed a simple scenario with a sudden PV generation drop when the system loading level is high. If all three buildings, which connect to the same bus in the distribution system, only care about their own objective, the voltage v_t at the common bus might fall below the lower limit v_min. As a result, voltage support (maintaining v_t ≥ v_min) requires the three buildings to coordinate with one another.

Fig. 2: Learning curves when using MADDPG to train control policies for the multi-agent coordinated building control. All x-axes represent training iterations. Losses, i.e., the critic loss L(θ^Q) and the actor loss L(θ^π), are averaged over the three agents.

Based on the setup above, at control step t and for agent i, the total agent reward is

r_{i,t} = r_{i,t}^{agent} + r_t^{sys} / 3,    with    r_t^{sys} = −σ · max(0, v_min − v_t),

in which r_{i,t}^{agent} = r_{i,t}^{hvac} + r_{i,t}^{pv} + r_{i,t}^{es} is the agent-level reward and r_t^{sys} represents the system-level reward, shared evenly with all three agents. Here, σ is a large number. Through MARL training, each agent should be able to optimize its own objective (i.e., keep −r_{i,t}^{agent} low) and also be able to work with other agents to avoid any voltage violation (i.e., keep max(0, v_min − v_t) low).
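The even split of the system penalty among the agents can be sketched as below. The penalty form, voltage limit, and value of σ are illustrative assumptions consistent with the description above, not values from the paper:

```python
def system_reward(v_min_bus, v_lower=0.95, sigma=1000.0):
    # Large penalty sigma on any bus voltage below the lower limit v_lower.
    return -sigma * max(0.0, v_lower - v_min_bus)

def total_rewards(agent_rewards, v_min_bus, n_agents=3):
    # Each agent receives its own reward plus an even share of the
    # system-level (voltage violation) reward.
    r_sys = system_reward(v_min_bus)
    return [r + r_sys / n_agents for r in agent_rewards]

# Three buildings with individual rewards; bus voltage sags to 0.94 p.u.
rewards = total_rewards([-1.0, -2.0, -0.5], v_min_bus=0.94)
```

When the voltage stays above the limit the system term vanishes and each agent optimizes only its own objective.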

To train control policies for this problem, we use OpenAI’s MADDPG implementation. Specifically, agent i trains a critic network Q_i (with parameters θ_i^Q) in an off-policy manner to minimize the mean squared Bellman error (MSBE):

L(θ_i^Q) = E[ ( Q_i(x, a_1, …, a_N; θ_i^Q) − y )^2 ],    y = r_i + γ Q_i(x′, a_1′, …, a_N′; θ_i^{Q′}),

and the actor (i.e., the control policy π_i, with parameters θ_i^π) is trained by minimizing the following actor loss:

L(θ_i^π) = −E[ Q_i(x, a_1, …, a_N; θ_i^Q) |_{a_i = π_i(s_i; θ_i^π)} ].

In the above equations, θ_i^Q and θ_i^π are the RL parameters to be optimized, and θ_i^{Q′} represents the target value network parameters (a common off-policy learning trick; see [10] for details). In our notation, a_i and s_i are the action and state of agent i, respectively, and the collection of all agents’ actions and states is written as a = (a_1, …, a_N) and x = (s_1, …, s_N). The states at the next step are denoted x′, and a_k′ = π_k(s_k′) are the policy-chosen actions at x′.
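The Bellman target and MSBE above can be illustrated numerically as follows. This is a generic NumPy sketch of the quantities involved (function names hypothetical), not OpenAI's MADDPG implementation:

```python
import numpy as np

def bellman_targets(rewards, next_q_values, gamma=0.99, dones=None):
    # y = r_i + gamma * Q_target(x', a'_1, ..., a'_N), with the target
    # critic evaluated at the target policies' next actions; terminal
    # transitions drop the bootstrap term.
    rewards = np.asarray(rewards, dtype=float)
    next_q = np.asarray(next_q_values, dtype=float)
    if dones is None:
        dones = np.zeros_like(rewards)
    return rewards + gamma * (1.0 - np.asarray(dones, dtype=float)) * next_q

def msbe(q_values, targets):
    # Mean squared Bellman error between critic outputs and targets.
    diff = np.asarray(q_values, dtype=float) - np.asarray(targets, dtype=float)
    return float(np.mean(diff ** 2))

# Two sampled transitions: current critic outputs vs. bootstrapped targets.
y = bellman_targets(rewards=[1.0, 0.0], next_q_values=[2.0, 3.0])
loss = msbe(q_values=[2.5, 2.0], targets=y)
```

In practice the target critic values come from a separate, slowly updated network, which is the off-policy trick referenced above.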

Fig. 2 shows the learning curves over 350 training iterations. By the end of the training, both the agent costs and the total cost converge to a low level. The critic loss L(θ^Q) starts at a large value and gradually decreases to a value close to zero, indicating that the state-action values can be estimated accurately. As the value estimation becomes more reliable, the actor loss L(θ^π) also decreases, implying that the control policies are improving to achieve a higher reward level for each agent. Finally, the episodic voltage violation sum, Σ_t max(0, v_min − v_t), is high at the beginning, and as the agents learn to coordinate with one another, the voltage violation is eliminated, reaching zero by the end of training.

In summary, this example demonstrates using PowerGridworld to formulate a MARL problem with both competition (building comfort) and collaboration (system voltage) among agents. Admittedly, splitting the system penalty evenly among agents is a naïve way to encourage coordination; a more advanced approach could be flexibly implemented in this framework by modifying the corresponding interfacing functions.

III-B Multi-Agent Environment With Heterogeneous Agents

A key feature of the PowerGridworld package is that it enables users to model heterogeneous agents that interact both with one another and with the grid. To demonstrate this feature, we developed a simple example with three different agents consisting of one smart building—simulated as a MultiComponentEnv composed of a five-zone building, a solar panel, and a battery component environment—and two independent component environments representing a PV array and an electric vehicle (EV) charging station. The agents here are loosely coupled according to their reward structures and observation spaces, as described next.

Smart building. The five-zone building has a simple reward function characterized by a soft constraint that zone temperatures be maintained within a comfort range. The building thermal model used is the same as in Section III-A; the reward function is similar, except that it does not take power consumption into account.

PV array. Next, we include a controllable PV array as a source of real power injection, with the purpose of mitigating voltage violations stemming from high real power demand on the distribution feeder. We model a simple control whereby the real power injection can be curtailed between 0% and 100% of available power from the panels; the observation space consists of both the real power available from the panels and the minimum bus voltage on the feeder, v_min. (The scenario we consider is stable with respect to maximum feeder voltage.) The reward function is given by a soft penalty on the minimum bus voltage, which is computed using OpenDSS.

EV charging station. Finally, we consider an EV charging station with an aggregate, continuous control, a_t ∈ [0, 1], representing the rate of charging for all charging vehicles. For example, with action a_t = 0.25, all charging vehicles will charge at 25% of the maximum possible rate. The distribution of vehicles is control dependent because, as vehicles become fully charged, they leave the station and thus reduce the aggregate load profile. Furthermore, each vehicle has prespecified (exogenous) arrival and departure times, before and after which it cannot charge. The observation space consists of a handful of continuous variables characterizing the station’s current occupancy and aggregate power consumption, as well as aggregate information about the state of the charging vehicles. The reward function balances the local task of meeting demand with a grid-supportive task of keeping the total real power consumption under a peak threshold. Note that, while the charging station does not directly respond to grid signals, the soft constraint on peak power incentivizes load shifting.
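A toy sketch of the aggregate charging dynamics described above follows; all names, rates, and battery sizes are assumed for illustration, and the actual PowerGridworld EV model (arrivals, departures, occupancy features) is more detailed:

```python
import numpy as np

def ev_station_step(soc, present, action, max_rate_kw=7.0, dt_hours=0.25,
                    battery_kwh=60.0):
    """One control step of an aggregate EV-charging station.

    soc: per-vehicle state of charge in [0, 1]; present: mask of vehicles
    currently at the station; action in [0, 1] scales every charging
    vehicle's rate, as in the aggregate control a_t described above.
    """
    soc = np.asarray(soc, dtype=float).copy()
    present = np.asarray(present, dtype=bool)
    charging = present & (soc < 1.0)          # full vehicles stop charging
    power_kw = action * max_rate_kw * charging  # per-vehicle power draw
    soc[charging] = np.minimum(
        1.0, soc[charging] + power_kw[charging] * dt_hours / battery_kwh)
    total_kw = float(power_kw.sum())            # aggregate station load
    return soc, total_kw

# Three vehicles: one charging, one already full, one not yet arrived.
soc, total_kw = ev_station_step(
    soc=[0.5, 1.0, 0.2], present=[True, True, False], action=0.25)
```

A peak-power reward term would then penalize total_kw above a threshold, producing the load-shifting incentive noted above.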

Using RLLib’s multi-agent training framework, we train separate proximal policy optimization (PPO) [17] policies for each agent, with each agent attempting to optimize its own reward. Although training multi-agent policies in this way is generally challenging due to nonstationarity, here the agents are only loosely coupled through bus voltages in the PV agent’s reward function, and training converges without issue; see Fig. 3. The lower panel in the figure shows the PPO loss function for each agent’s policy,

L^{CLIP}(θ) = −E_t[ min( ρ_t(θ) Â_t, clip(ρ_t(θ), 1 − ε, 1 + ε) Â_t ) ],    ρ_t(θ) = π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t),

where Â_t is the advantage estimator, clip(·, 1 − ε, 1 + ε) is a clipping function with threshold ε, and θ_old refers to the policy weights from the previous training iteration. We refer the reader to [17] for additional details about the PPO algorithm and loss function.
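The clipped surrogate loss can be computed as follows; this is a generic NumPy sketch of the standard PPO objective from [17], not RLLib's implementation (which adds value-function and entropy terms):

```python
import numpy as np

def ppo_clip_loss(ratios, advantages, eps=0.2):
    """Clipped PPO surrogate loss (to be minimized).

    ratios: probability ratios pi_theta(a|s) / pi_theta_old(a|s);
    advantages: advantage estimates A_hat for the sampled actions.
    """
    ratios = np.asarray(ratios, dtype=float)
    adv = np.asarray(advantages, dtype=float)
    unclipped = ratios * adv
    clipped = np.clip(ratios, 1.0 - eps, 1.0 + eps) * adv
    # Negative sign: the clipped surrogate objective is maximized,
    # so the corresponding loss is its negation.
    return float(-np.mean(np.minimum(unclipped, clipped)))

# Two samples: one ratio above the clip range, one below.
loss = ppo_clip_loss(ratios=[1.5, 0.5], advantages=[1.0, -1.0])
```

The pessimistic min over the clipped and unclipped terms is what bounds the policy update per iteration.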

Fig. 3: Learning curves for independent PPO policies for the heterogeneous control problem trained using RLLib. The PPO loss function, L^{CLIP}, is given in (4). All x-axes represent training iterations.

IV Conclusion

PowerGridworld fills a gap in MARL for power systems by providing users with a lightweight framework for rapidly prototyping customized, grid-interactive, multi-agent simulations with a bring-your-own-model philosophy. The multi-agent Gym API and other API converters enable users to rapidly integrate with existing MARL training frameworks, including RLLib (multiple algorithms) and OpenAI’s MADDPG implementation. Unlike the CityLearn and GridLearn software packages, PowerGridworld does not provide carefully designed benchmarks for a given application, such as demand response with voltage regulation. Rather, it provides the user with abstractions that streamline experimentation with novel multi-agent scenarios, component Gym environments, and MARL algorithms where the power flow solutions are essential to the problem. Integration with RLLib, in particular, paves the way for the use of supercomputing and HPC resources for RL training, which will become ever more important as the complexity of MARL simulations continues to increase.


  • [1] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI Gym. arXiv preprint arXiv:1606.01540. Cited by: §I-B1.
  • [2] D. Chen, K. Chen, Z. Li, T. Chu, R. Yao, F. Qiu, and K. Lin (2021) PowerNet: multi-agent deep reinforcement learning for scalable powergrid control. IEEE Transactions on Power Systems. Cited by: §I-A.
  • [3] B. J. Claessens, P. Vrancx, and F. Ruelens (2016) Convolutional neural networks for automatic state-time feature extraction in reinforcement learning applied to residential load control. IEEE Transactions on Smart Grid 9 (4), pp. 3259–3269. Cited by: §I-A.
  • [4] D. B. Crawley, L. K. Lawrie, F. C. Winkelmann, W. F. Buhl, Y. J. Huang, C. O. Pedersen, R. K. Strand, R. J. Liesen, D. E. Fisher, M. J. Witte, et al. (2001) EnergyPlus: creating a new-generation building energy simulation program. Energy and Buildings 33 (4), pp. 319–331. Cited by: §I-B2.
  • [5] J. Duan, D. Shi, R. Diao, H. Li, Z. Wang, B. Zhang, D. Bian, and Z. Yi (2019) Deep-reinforcement-learning-based autonomous voltage control for power grid operations. IEEE Transactions on Power Systems 35 (1), pp. 814–817. Cited by: §I-A.
  • [6] Electric Power Research Institute (EPRI) OpenDSS: EPRI distribution system simulator. Note: https://sourceforge.net/projects/electricdss/. Accessed: 2021-11-04. Cited by: §II-A2.
  • [7] Y. Gao, W. Wang, and N. Yu (2021) Consensus multi-agent reinforcement learning for volt-var control in power distribution networks. IEEE Transactions on Smart Grid. Cited by: §I-A.
  • [8] E. Liang, R. Liaw, R. Nishihara, P. Moritz, R. Fox, K. Goldberg, J. Gonzalez, M. Jordan, and I. Stoica (2018) RLlib: abstractions for distributed reinforcement learning. In International Conference on Machine Learning, pp. 3053–3062. Cited by: §I-B3, §II-A.
  • [9] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. arXiv preprint arXiv:1706.02275. Cited by: §I-B3.
  • [10] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: §III-A.
  • [11] P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, M. Elibol, Z. Yang, W. Paul, M. I. Jordan, et al. (2018) Ray: a distributed framework for emerging AI applications. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 561–577. Cited by: §I-B3.
  • [12] OpenAI. Note: https://openai.com/about/. Accessed: 2021-11-04. Cited by: §I-B3.
  • [13] OpenAI’s MADDPG Implementation. Note: https://github.com/openai/maddpg. Accessed: 2021-11-03. Cited by: §I-B3.
  • [14] A. Pigott, C. Crozier, K. Baker, and Z. Nagy (2021) GridLearn: multiagent reinforcement learning for grid-aware building energy management. arXiv preprint arXiv:2110.06396. Cited by: §I-A, §I-B2.
  • [15] A. Raffin, A. Hill, M. Ernestus, A. Gleave, A. Kanervisto, and N. Dormann (2019) Stable baselines3. Note: https://github.com/hill-a/stable-baselines. Accessed: 2021-11-02. Cited by: §II-A.
  • [16] RLLib: Scalable Reinforcement Learning. Note: https://docs.ray.io/en/latest/rllib.html. Accessed: 2021-11-03. Cited by: §I-B3.
  • [17] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §III-B.
  • [18] S. Shalev-Shwartz, S. Shammah, and A. Shashua (2016) Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295. Cited by: §I-A.
  • [19] J. K. Terry, B. Black, N. Grammel, M. Jayakumar, A. Hari, R. Sullivan, L. Santos, R. Perez, C. Horsch, C. Dieffendahl, et al. (2020) Pettingzoo: gym for multi-agent reinforcement learning. arXiv preprint arXiv:2009.14471. Cited by: §I-B1.
  • [20] L. Thurner, A. Scheidler, F. Schäfer, J. Menke, J. Dollichon, F. Meier, S. Meinecke, and M. Braun (2018) Pandapower—an open-source python tool for convenient modeling, analysis, and optimization of electric power systems. IEEE Transactions on Power Systems 33 (6), pp. 6510–6521. Cited by: §I-B2.
  • [21] J. R. Vazquez-Canteli, S. Dey, G. Henze, and Z. Nagy (2020) CityLearn: standardizing research in multi-agent reinforcement learning for demand response and urban energy management. arXiv preprint arXiv:2012.10504. Cited by: §I-B2.
  • [22] J. R. Vázquez-Canteli, J. Kämpf, G. Henze, and Z. Nagy (2019) CityLearn v1.0: an OpenAI Gym environment for demand response with deep reinforcement learning. In Proceedings of the 6th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation, pp. 356–357. Cited by: §I-B2.
  • [23] O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, et al. (2019) Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575 (7782), pp. 350–354. Cited by: §I-A.
  • [24] H. Xu, A. D. Domínguez-García, and P. W. Sauer (2019) Optimal tap setting of voltage regulation transformers using batch reinforcement learning. IEEE Transactions on Power Systems 35 (3), pp. 1990–2001. Cited by: §I-A.
  • [25] Q. Yang, G. Wang, A. Sadeghi, G. B. Giannakis, and J. Sun (2019) Two-timescale voltage control in distribution grids using deep reinforcement learning. IEEE Transactions on Smart Grid 11 (3), pp. 2313–2323. Cited by: §I-A.