I Introduction
Voltage violations and high network losses are becoming increasingly severe in active distribution networks (ADNs) with a high penetration of distributed generation (DG) [6336354, 8960272]. As an important solution, Volt/VAR control (VVC) has been successfully integrated into distribution management systems to optimize the voltage profile and reduce network losses. Since most DGs are inverter-based energy resources (IBERs), they are able and required to provide fast Volt/VAR support using their free capacity.
Conventionally, VVC is described as a nonlinear programming problem that generates a set of optimal strategies for voltage regulation devices and reactive power resources. Many studies solve VVC problems using centralized optimization methods such as interior point methods [4808225, 6336354]. Despite its wide application, centralized VVC suffers from single-point failures and heavy computation and communication burdens. Moreover, with the rapidly growing number of IBERs, centralized VVC is also limited by communication-dependent time delays.
Therefore, distributed VVC methods have been proposed to exploit the distributed nature of the ADN. Distributed methods utilize local measurements and peer-to-peer (P2P) communication with neighbors to realize fast control. Previous papers mainly adapt distributed optimization algorithms, such as quasi real-time reactive optimization [7470528], the alternating direction method of multipliers (ADMM) [7926415, 7042735], and accelerated ADMM [8960272]. However, such P2P communication systems are hard to maintain in practice. There are also decentralized methods [7361761, 7500071] that realize quasi-optimal control; they are based on improved droop control strategies, and each controller uses only local measurements.
To date, most VVC algorithms depend on accurate ADN models to achieve desirable performance. It is impractical and expensive for regional power utilities to maintain such reliable models, especially in a distribution system with increasing complexity and numerous buses [6579907, 8944292]. Recently, (deep) reinforcement learning (RL) based approaches have proven effective at coping with incomplete models in energy trading [8661902], emergency control [8787888], load frequency control [8534442], and voltage regulation [8944292, 8873679].
In order to apply RL algorithms in a distributed or decentralized manner, multiagent RL has been studied in several inspiring attempts [5589633, 6392462, doi:10.1177/0037549710367904, 8981922, 9076841]. [8981922] develops a decentralized cooperative control strategy for multiple energy storage systems based on Q-learning and the value decomposition network (VDN) from [sunehag2017valuedecomposition]; [9076841] develops a multiagent autonomous voltage control method based on the state-of-the-art multiagent deep deterministic policy gradient (MADDPG) algorithm proposed in [NIPS2017_7217].
However, the existing control methods either a) train the agents offline on a simulation model and execute them online without further training, which sacrifices the model-free feature of multiagent RL, or b) train the agents synchronously online with heavy communication burdens. For VVC with numerous high-speed IBERs, a novel online multiagent RL framework that performs online learning without heavy communication and local computation burdens is urgently needed.
Moreover, realizing such a framework poses several critical technical challenges:

The deterministic policies of Q-learning and DDPG algorithms lead to extreme brittleness and notorious hyperparameter sensitivity [haarnoja2018sacapplications], which limits their online application.

The power system operational constraints are not modelled explicitly in the existing multiagent RL based methods, which is a critical issue in VVC.

Online exploration by data-driven algorithms can degrade the performance of VVC. This exploration-exploitation issue is especially serious in ADNs with high-speed IBERs.
In this paper, we propose an Online multiagent reinforcement Learning and Decentralized Control framework (OLDC) for VVC, as shown in fig. 1. Moreover, to improve the stability and efficiency of VVC, we propose a novel multiagent RL algorithm called Multi-Agent Constrained Soft Actor-Critic (MACSAC), inspired by previous works [haarnoja2018sacapplications, 10.5555/3305381.3305384, 8944292, NIPS2017_7217].
As shown in fig. 1, coordinated multiagent learning based on MACSAC is conducted in the control center, and the latest trained policies are sent to the controllers to carry out local control. With asynchronous learning, sampling and control processes, this solution can realize safe and fast model-free optimization for VVC in ADNs. The unique contributions of this article are summarized as follows.

Compared to existing algorithms like MADDPG [NIPS2017_7217], our proposed MACSAC significantly improves the stability and efficiency of the training and application processes. Instead of using deterministic policies, MACSAC utilizes stochastic policies with maximum entropy regularization following [haarnoja2018sacapplications], which prevents optimization failure and improves training robustness. MACSAC also explicitly models voltage constraints instead of treating them as a penalty, which can significantly improve the voltage security level.

To the best of our knowledge, we are the first to synergistically combine online multiagent RL and decentralized control by proposing the OLDC framework with a detailed timing design. The proposed VVC with OLDC can both learn from control experience continuously to meet the incomplete model challenge and make decisions locally to realize fast control. Also, OLDC can be extended to other multiagent power system control problems and is compatible with future off-policy multiagent RL algorithms.

With the off-policy nature of MACSAC, our OLDC provides a promising method for balancing exploration and exploitation in RL-based algorithms. Safety and operational efficiency are substantially enhanced by avoiding the cost of redundant online exploration.
The remainder of this article is organized as follows. Section II formulates the VVC problem in ADNs as a constrained multiagent Markov game and briefly introduces RL and the multiagent actor-critic framework. Then, the proposed MACSAC and OLDC are presented in detail in Section III. In Section IV, the results of our numerical study are shown and analyzed. Finally, Section V concludes this article.
II Preliminaries
In this section, we first introduce the VVC problem considered in this paper. Then, the settings of Markov games and RL are explained. In the last subsection, we introduce the preliminaries of actor-critic and multiagent actor-critic methods.
II-A VVC Problem Formulation
An ADN is divided into natural control areas, each with local measurements and a control agent. It can be depicted by an undirected graph with the collection of all nodes , the collection of each area ’s nodes , and the collection of all branches . Since real-world ADNs are commonly equipped with only single-phase steady-state measurements, the VVC problem is formulated on balanced networks for real-time steady-state dispatch in this paper. Since the inner details of the model are not required and only the input and output data are necessary, such a model can be easily extended to unbalanced multiphase networks.
Since we consider steady-state voltage control, the power flow equations are employed as shown in eq. 1, where is the active and reactive power flow from node to , is the voltage at node , is the admittance of branch , and is the shunt admittance of node .
(1) 
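As a rough illustration of the quantities named above, the following sketch evaluates the complex power on a single branch. It assumes the textbook series-admittance branch model without line charging, which may differ in detail from eq. 1; the function names `branch_flow` and `branch_loss` are hypothetical.

```python
def branch_flow(v_i: complex, v_j: complex, y_ij: complex) -> complex:
    """Complex power P_ij + jQ_ij leaving node i toward node j.

    Assumes a pure series admittance y_ij between the two nodes
    (no line charging), with voltages given as complex phasors.
    """
    current_ij = y_ij * (v_i - v_j)          # branch current from i to j
    return v_i * current_ij.conjugate()      # S = V * conj(I)


def branch_loss(v_i: complex, v_j: complex, y_ij: complex) -> float:
    """Active power dissipated on the branch (sum of both end flows)."""
    total = branch_flow(v_i, v_j, y_ij) + branch_flow(v_j, v_i, y_ij)
    return total.real


# Example in per unit: a 0.1 p.u. voltage drop over z = 0.2 + 0.4j (y = 1 - 2j).
flow = branch_flow(1.0 + 0j, 0.9 + 0j, 1 - 2j)
loss = branch_loss(1.0 + 0j, 0.9 + 0j, 1 - 2j)
```

Summing `branch_loss` over all branches gives the network active power loss that VVC seeks to reduce.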
The th area is equipped with IBERs and compensation devices such as static Var compensators (SVC). Without loss of generality, we assume that the IBERs and compensation devices are installed on different nodes in . Accordingly, the collections of nodes equipped with IBERs and compensation devices are denoted as and .
Since , the power injections at each node can be determined via eq. 2.
(2)  
The IBERs are typically designed with redundant rated capacity for safety reasons and operate under maximum power point tracking (MPPT) mode. Hence, the controllable range of the reactive power of IBERs can be determined by the rated capacity and current active power output . The reactive power range of controllable devices is and .
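A minimal sketch of how the controllable reactive power range of an IBER could be computed, assuming the usual apparent-power capability circle, i.e. the reactive limit is the square root of the rated capacity squared minus the active power squared. The function name and the clamping at zero are illustrative assumptions.

```python
import math


def reactive_power_limit(rated_capacity: float, active_power: float) -> float:
    """Largest |Q| an inverter can provide at its current active power.

    Assumes a circular capability curve: P^2 + Q^2 <= S^2.
    If P already exceeds S (e.g. measurement noise), the limit is 0.
    """
    headroom = rated_capacity ** 2 - active_power ** 2
    return math.sqrt(max(headroom, 0.0))


# An inverter rated at 5 (MVA) producing 3 (MW) can swing Q within [-4, 4] (MVAr).
q_max = reactive_power_limit(5.0, 3.0)
q_range = (-q_max, q_max)
```

Because the active power follows MPPT, this range changes over time and would be re-evaluated at every control step.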
II-B Markov Games and Reinforcement Learning
In order to formalize sequential multiagent decision processes, we consider an extension of the Markov decision process (MDP) called the constrained Markov game (CMG), which can be seen as a constrained version of the Markov game (MG). In an MG, multiple agents interact locally with a common environment. An MG for agents is defined by a tuple . The set of states describes all possible states of the common environment. The sets of local observations and actions are the local observations and actions for each agent.
In each time step , each agent first observes the environment as ; then, it chooses its action using a stochastic policy defined as a probability density function , i.e., . The actions taken at lead to the next state according to an unknown state transition probability . After the transition, each agent obtains a reward by the corresponding reward function and receives the next observation . The goal of each agent is to maximize its own total expected discounted return , where is a discount factor and is the time horizon. For convenience, denote as the initial state, as all policies, as all local observations at , and as all actions at .
In the power system control domain, it is important for RL agents to explore safely. A natural way to incorporate safety is to formulate constraints into the RL problem. Following the constrained MDP (CMDP) given by [cmdp], the CMG is formulated as a constrained extension of the MG, where each agent must satisfy its own constraints on expectations of auxiliary costs. An extra group of auxiliary cost functions defined as is inserted into the tuple of the MG. At time step , the constraint reward is defined as where . The constraints are expressed as .
Under the settings of the CMG, the task of RL algorithms, or more specifically multiagent RL algorithms, is to learn an optimal policy for each agent to maximize , i.e.,
(3) 
with sequential decision data and without knowledge of the probability density functions . This feature gives RL algorithms great potential to optimize the agents in a model-free manner.
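To make the objective and the constraint concrete, the sketch below computes a discounted return from one sampled trajectory and checks a discounted cost budget on the same trajectory. It is a single-trajectory illustration only; the CMG constraints are on expectations, and the names `discounted_return`, `satisfies_budget` and `cost_budget` are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t], accumulated backwards for stability."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def satisfies_budget(costs, gamma, cost_budget):
    """Check a discounted auxiliary-cost constraint on one trajectory."""
    return discounted_return(costs, gamma) <= cost_budget


# One agent, three steps, discount factor 0.5.
ret = discounted_return([1.0, 1.0, 1.0], gamma=0.5)       # 1 + 0.5 + 0.25
ok = satisfies_budget([0.0, 0.2, 0.0], gamma=0.5, cost_budget=0.1)
```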
II-C Actor-Critic and Multiagent Actor-Critic
In order to accomplish the reinforcement learning task, a group of RL algorithms called actor-critic algorithms has become popular in recent years for their high sample efficiency and stability, such as PPO [schulman2017proximal], A3C [mnih2016asynchronous], DDPG [silver2014deterministic], and SAC [haarnoja2018sacapplications]. These algorithms utilize deep neural networks to approximate an “actor”, which generates actions from observations using the policy , and a “critic”, which evaluates the policy using or . By training the actor and critic alternately, these algorithms can explore the environment efficiently and obtain high-quality policies.
For such multiagent environments, separately adopting traditional RL algorithms for each agent is poorly suited because the environment is non-stationary from the perspective of each individual agent. In this paper, we follow the multiagent actor-critic framework in [foerster2016learning, NIPS2017_7217] to cope with the inherent non-stationarity of multiagent environments. Both a critic and a local actor are constructed for each agent. At training time, the critics are allowed to use global information, including all observations and actions, to build their own evaluations of the global environment characteristics. The local actors are trained with the corresponding critics and with knowledge of the other actors, since we consider a cooperative setting in this paper. After training is complete, the local actors are deployed and make decisions in a decentralized manner using only local information.
However, previous work is not intended for online control and operates in an offline-training, online-application mode. In our DRL-based VVC algorithm, the most important task is to utilize online learning and control to operate ADNs adaptively. Therefore, in section III-C, we propose an online multiagent learning and decentralized control framework (OLDC) with fully asynchronous sampling, training and application, which fully preserves the advantage of OLDC in the online stage.
III Methods
In this section, we develop an online multiagent reinforcement learning method to solve the VVC problem formulated as an MG. Since the method is carried out online, safety, efficiency and optimality are the critical concerns to address in the real-world problem. First, the VVC problem is formulated as a CMG. Then, we develop a novel off-policy multiagent algorithm called MACSAC in section III-B, which improves on the safety and efficiency of the existing algorithms. Finally, based on the off-policy nature of MACSAC, we propose OLDC in section III-C as an online multiagent actor-critic framework with fully asynchronous sampling, learning and application, which is also compatible with other off-policy algorithms.
III-A VVC Formulation in Constrained Markov Game
The VVC problem of ADNs is formulated as a CMG according to its natural features. The detailed VVC problem settings are given in the supplemental file [zzsupple] due to the page limit. The specific definitions of the state space, action spaces and reward functions are as follows.
III-A1 State Space
The state of the CMG is defined as a vector . Here is the vector of nodal active/reactive power injections , is the vector of voltage magnitudes , and is the time step in each episode.
III-A2 Observation Spaces
The local observations of each agent are selected according to the local measurements. In this paper, is defined as , where is the vector of th area’s nodal active/reactive power injections ; is the vector of th area’s voltage magnitudes ; is the vector of outlet powers of th area.
III-A3 Action Spaces
For each agent , the action space is constructed with all the controllable reactive power resources in th area, including PV inverters and SVCs.
III-A4 Reward Functions
In classic RL algorithms, the reward is designed to be a function of previous observations. In this paper, the rewards of the agents are calculated in the coordinator, so all observations are available to the reward functions. Since the objectives are to minimize active power loss and mitigate voltage violations, the reward functions and constraint reward functions are defined as eq. 4 and eq. 5. is the cooperative index of agent , which describes the willingness of the agent to optimize the welfare of the global system rather than its own.
(4)  
(5) 
The index functions and can be evaluated in the coordinator for any collection of nodes at time step .
(6) 
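The voltage index discussed in this subsection can be sketched as follows. This is only one plausible reading: it assumes the index is the 2-norm of the ReLU-rectified excursions beyond the voltage limits, and the limits of 0.95 and 1.05 p.u. are illustrative defaults, not values taken from the paper.

```python
import numpy as np


def voltage_violation_rate(v, v_min=0.95, v_max=1.05):
    """2-norm of voltage magnitude violations for a collection of nodes.

    Uses ReLU (max(x, 0)) on both sides of the band, so the result is 0
    if and only if every magnitude lies inside [v_min, v_max].
    """
    v = np.asarray(v, dtype=float)
    over = np.maximum(v - v_max, 0.0)    # excursions above the upper limit
    under = np.maximum(v_min - v, 0.0)   # excursions below the lower limit
    return float(np.linalg.norm(over + under))


vvr_ok = voltage_violation_rate([1.00, 1.02, 0.97])
vvr_bad = voltage_violation_rate([1.08, 0.91, 1.00])   # violations 0.03 and 0.04
```

Unlike a count of violating nodes, this value shrinks continuously as violations are reduced, which is the smoothness property the text relies on.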
Here, is the rectified linear unit function defined as . We have where the equality holds if and only if all voltage magnitudes satisfy the voltage constraints. We refer to as the voltage violation rate (VVR) since it is computed from the 2-norm of the voltage magnitude violations. We use the VVR instead of the number of violated nodes because voltage violations are usually severe in ADNs and the regulation capacity may not be enough to eliminate all violations in some scenarios. In such scenarios, the VVR serves as a much smoother index and can effectively mitigate the voltage violations.
III-B Multiagent Constrained Soft Actor-Critic
To improve the safety and efficiency of the existing multiagent RL algorithms, we propose MACSAC in this subsection. As space is limited, the detailed derivation of MACSAC and practical implementation techniques are provided in the supplemental file [zzsupple].
First of all, with the formulation of the CMG for VVC in section II, the multiagent RL problem is reformulated as eqs. 7 to 10 for each agent locally. Here, eq. 7 is the original RL objective; eq. 8 is the action constraint, where and are the lower and upper bounds of ; eq. 9 is the entropy constraint from [haarnoja2018sacapplications], where is the lower bound of ’s entropy; eq. 10 is the state constraint of our CMG, i.e., a bound on the expected discounted sum of VVR.
(7)  
(8)  
(9)  
(10) 
The action constraint eq. 8 has already been included in the definition of the action spaces. As usual, we adopt Lagrangian relaxation here to handle the constraints eqs. 9 and 10. Multipliers and are introduced for eq. 9 and eq. 10, respectively. Note that and are two pairs of variables. In each pair, if one variable is considered a hyperparameter, the other can be determined via iterations. Since the physical meaning of is clear, we select and as hyperparameters. Hence, the problem is refined as , where .
III-B1 Preparation
The actors optimize the policies with parameters according to the optimization problem above. In MADDPG, is defined as a deterministic map from to , but it suffers from overfitting and shows undesirable instability. Inspired by [haarnoja2018sacapplications], is defined here as a probability distribution in a stochastic manner. Since direct optimization of a distribution is hard to implement, the policies are reparameterized as
(11) 
where are the means and standard deviations approximated by neural networks.
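The reparameterization idea can be sketched as below: the action is a deterministic function of the network outputs (mean and standard deviation) and an independent Gaussian noise sample. The tanh squashing and the affine rescaling to the action bounds follow common SAC practice and are assumptions here, not necessarily the exact form of eq. 11; `sample_action` is a hypothetical name.

```python
import numpy as np


def sample_action(mean, std, low, high, rng, deterministic=False):
    """Reparameterized action: squash(mean + std * noise), scaled to [low, high].

    With deterministic=True the noise is zero, giving the mean action.
    """
    mean = np.asarray(mean, dtype=float)
    std = np.asarray(std, dtype=float)
    noise = np.zeros_like(mean) if deterministic else rng.standard_normal(mean.shape)
    squashed = np.tanh(mean + std * noise)               # values in [-1, 1]
    return low + 0.5 * (squashed + 1.0) * (high - low)   # map to [low, high]


rng = np.random.default_rng(0)
# Two reactive power set-points, each bounded to [-2, 2] (MVAr).
greedy = sample_action([0.0, 0.0], [1.0, 1.0], -2.0, 2.0, rng, deterministic=True)
explore = sample_action([0.0, 0.0], [1.0, 1.0], -2.0, 2.0, rng)
```

Because the noise is sampled independently of the network parameters, gradients can flow through the mean and standard deviation, and the squashing keeps the action inside its bounds by construction.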
In order to quantify the policies, the state-action value functions are defined in eq. 12 for . represents the expected discounted reward after taking action under observation with the policy . Here, is the trajectory when applying ; is noted for all ; is all observations ; is all actions . At every time step , we store in the experience replay buffer , and then learn the critics and actors alternately as follows.
(12) 
From the definition, the only difference between each is and . In the rest of MACSAC, we use neural networks to approximate the actual .
As for the state constraint term , similar state-action value functions are defined by substituting with in eq. 12.
III-B2 Learning the critics
As defined in eq. 12, we learn centralized critics with all observations and actions instead of learning local ones separately. This approach copes with the non-stationarity problem from the perspective of any individual agent. Since the agents are cooperative in this paper, the policies of the others are available when training a given critic.
Using the Bellman equation, we can approximate the current state-action value with the expectation over all possible next states and corresponding actions with . That is,
(13) 
where ; are the delayed parameters for and are updated using .
Hence, is trained by minimizing the loss .
Similarly, we calculate the approximated value for as , and update by minimizing the loss .
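A minimal numerical sketch of the two ingredients of critic training described above: a bootstrapped target built from a delayed (target) critic, and the slow update of the delayed parameters. The entropy-regularized form of the target and the Polyak averaging rule are borrowed from standard SAC and are assumptions here; `alpha`, `tau` and both function names are hypothetical.

```python
def soft_td_target(reward, gamma, next_q, next_log_prob, alpha, done=False):
    """Bellman target: r + gamma * (Q_target(o', a') - alpha * log pi(a'|o')).

    next_q comes from the delayed critic; the log-probability term is the
    entropy bonus used by maximum-entropy methods.
    """
    soft_value = next_q - alpha * next_log_prob
    return reward + (0.0 if done else gamma * soft_value)


def polyak_update(target_params, online_params, tau=0.005):
    """Move each delayed parameter a fraction tau toward the online one."""
    return [(1.0 - tau) * t + tau * p for t, p in zip(target_params, online_params)]


target = soft_td_target(reward=1.0, gamma=0.9, next_q=2.0, next_log_prob=-1.0, alpha=0.5)
new_target_params = polyak_update([0.0, 1.0], [1.0, 1.0], tau=0.1)
```

The critic loss is then the mean squared difference between the critic output and this target over a sampled batch; the same construction applies to the constraint critic with the constraint reward in place of the reward.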
III-B3 Learning the actors
With the definition of the critics, the optimization problem of the actors is transformed from maximizing , which is hard to evaluate, to eq. 14 with the approximated and .
(14) 
where .
The Lagrange function is derived for eq. 14 as,
(15) 
Hence, the dual problem for and is . In MACSAC, we update as , and update as .
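The multiplier updates can be illustrated with generic projected gradient steps on the dual variables. This is a sketch under assumptions: the update rules below are the standard SAC temperature update and a standard Lagrangian constraint update, not necessarily the exact expressions used in MACSAC, and all names and step sizes are hypothetical.

```python
def update_entropy_multiplier(alpha, mean_log_prob, target_entropy, lr=1e-3):
    """Raise alpha when the policy entropy is below its lower bound, else lower it.

    The batch entropy estimate is -mean_log_prob; alpha is kept non-negative.
    """
    entropy_gap = -mean_log_prob - target_entropy   # > 0: more random than required
    return max(alpha - lr * entropy_gap, 0.0)


def update_constraint_multiplier(lam, expected_cost, cost_limit, lr=1e-3):
    """Raise lam when the expected constraint cost exceeds its limit."""
    return max(lam + lr * (expected_cost - cost_limit), 0.0)


alpha_next = update_entropy_multiplier(0.2, mean_log_prob=0.0, target_entropy=1.0, lr=0.1)
lam_next = update_constraint_multiplier(1.0, expected_cost=0.5, cost_limit=0.1, lr=0.1)
```

In both cases a violated constraint increases its multiplier, which in turn increases the weight of that term in the actor objective at the next update.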
The algorithm of MACSAC is shown in algorithm 1. Compared to the state-of-the-art multiagent RL algorithm MADDPG [NIPS2017_7217], our MACSAC a) utilizes stochastic policies instead of deterministic policies for each agent and follows the maximum-entropy training in [haarnoja2018sacapplications], which explores the environment better online and gains significantly higher sample efficiency and stability, and b) introduces constraints for each agent and solves a CMG instead of an MG, which guarantees voltage safety explicitly. Also, both MADDPG and MACSAC are off-policy actor-critic algorithms, since they make no assumption about the order of the samples or the policy that generated them. This means that the sampling policies, which are executed locally, are not required to be the latest policies. This feature inspires OLDC, as follows.
III-C Online Centralized Training and Decentralized Execution Framework
With the physical structure shown in fig. 1, we propose OLDC to carry out MACSAC online with high efficiency. The detailed diagram of OLDC is illustrated in fig. 2. Note that in OLDC, sampling (green), learning (blue) and application (orange) are totally asynchronous.
III-C1 Timing
At the bottom of fig. 2, a timeline is built for all agents and the centralized server.
As shown by the orange part, in every time gap , each agent a) gets the local measurement , b) generates the action with the local policy as , and c) sends to the locally controlled devices. Note that the lower bound of depends on the measurements, the computation of , and the devices. Since we consider high-speed measurements and devices, and is reparameterized as with neural networks and can be evaluated quickly, can be relatively small.
Asynchronously, the samples collected in every are uploaded to the experience replay buffer on the server, as shown by the green part. Because communication is relatively slow, is much greater than . However, the sampling process does not delay the actual control speed, since all application is carried out locally as above.
Also asynchronously, as shown by the blue part, the training of the agents is carried out every : a batch of samples is randomly selected to train the critics and actors using eqs. 13 and 15, and the updated policies are sent to the agents. Since communication is relatively slow and computation is relatively heavy, is also much greater than . Note that the training process does not delay the application or sampling; also, the samples are selected from the experience replay buffer, so the training is not directly affected by .
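The three time scales described above can be visualized with a small schedule generator. It is a simplification: the real processes are asynchronous, whereas this sketch places them on a common tick equal to the application interval, and `sample_every` and `train_every` (the sampling and training intervals expressed in ticks) are hypothetical parameter names.

```python
def oldc_schedule(num_steps, sample_every, train_every):
    """List which OLDC activities happen at each local control step.

    "apply" runs every step on the local controller; uploads and training
    happen only on their slower intervals and never block "apply".
    """
    events = []
    for step in range(num_steps):
        actions = ["apply"]
        if step % sample_every == 0:
            actions.append("upload_sample")
        if step % train_every == 0:
            actions.append("train_and_push_policy")
        events.append((step, actions))
    return events


# Control every tick, upload every 4 ticks, train every 8 ticks.
timeline = oldc_schedule(num_steps=8, sample_every=4, train_every=8)
```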
III-C2 Communication and Computation
OLDC is robust to communication and computation conditions. In the application process, each local controller only evaluates a small neural network from local measurements to for local devices, with little computational burden, and no communication is needed with other controllers or the upper control center. Most computations of MACSAC are carried out on the centralized server with abundant resources.
OLDC can upload any suitable number of samples in every depending on the communication conditions. Without loss of generality, one sample is drawn in fig. 2 with a dashed green box. Also, even if the communication to the server is unstable and some samples are lost, they can be safely ignored.
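The tolerance to lost samples follows from how an off-policy replay buffer is used: training draws random batches from whatever has arrived. A minimal sketch of such a server-side buffer is given below; the class, its capacity handling, and the convention that a lost upload shows up as `None` are illustrative assumptions.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of transitions with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        self._storage = deque(maxlen=capacity)   # oldest samples are evicted first
        self._rng = random.Random(seed)

    def add(self, sample):
        """Store one transition; a lost upload (None) is simply skipped."""
        if sample is not None:
            self._storage.append(sample)

    def sample(self, batch_size):
        """Draw up to batch_size distinct transitions uniformly at random."""
        count = min(batch_size, len(self._storage))
        return self._rng.sample(list(self._storage), count)

    def __len__(self):
        return len(self._storage)


buffer = ReplayBuffer(capacity=3, seed=0)
for transition in ["t0", "t1", None, "t2", "t3"]:   # one upload was lost
    buffer.add(transition)
batch = buffer.sample(10)
```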
III-C3 Exploration and Exploitation
For data-driven algorithms like MACSAC, the balance of exploration and exploitation is extremely important. In MACSAC, a bigger multiplier results in a higher entropy level, which means is more stochastic and explores the environment better. However, exploration sacrifices exploitation, i.e., optimality and performance.
Hence, OLDC provides another way to balance exploration and exploitation. Suppose we upload samples in every , which means . Since other samples are not uploaded or used in training, we can carry out the policy in a deterministic manner, that is, instead of . In brief, only the actions of samples that are to be uploaded need to explore stochastically in OLDC. With smaller , the exploration is weaker and the exploitation is stronger. Moreover, and can be changed online to manually control the learning process or even stop learning with . With properly tuned and , the efficiency of MACSAC can be dramatically improved in the online application.
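The rule that only to-be-uploaded samples explore can be sketched as a gate on the policy noise. For brevity the sketch works on the pre-squashing Gaussian action and uploads one sample every `upload_every` control steps; this parameterization, including `upload_every = 0` meaning that learning is stopped, is a hypothetical simplification of the scheme in the text.

```python
import numpy as np


def choose_action(mean, std, step, upload_every, rng):
    """Return (action, explored): stochastic only on steps whose sample is uploaded.

    On all other steps the mean action is applied, so the controller
    exploits the current policy without exploration noise.
    """
    mean = np.asarray(mean, dtype=float)
    std = np.asarray(std, dtype=float)
    will_upload = upload_every > 0 and step % upload_every == 0
    noise = rng.standard_normal(mean.shape) if will_upload else np.zeros_like(mean)
    return mean + std * noise, will_upload


rng = np.random.default_rng(0)
quiet_action, quiet_flag = choose_action([0.1, -0.2], [0.5, 0.5], step=1, upload_every=4, rng=rng)
noisy_action, noisy_flag = choose_action([0.1, -0.2], [0.5, 0.5], step=4, upload_every=4, rng=rng)
```

Lowering the upload rate therefore reduces the fraction of control steps that carry exploration noise, independently of the entropy level of the policy itself.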
III-C4 Special Case
As a special case, OLDC is also compatible with single-agent actor-critic RL, i.e., . The sampling, training and execution are still asynchronous if needed.
With its efficiency and robustness to various computing and communication conditions, OLDC is a practical and suitable framework for online (MA)RL application in the power system, especially for multiagent RL-based VVC in ADNs.
IV Numerical Study
In this section, numerical experiments are conducted to validate the advantages of the proposed OLDC and MACSAC over some popular benchmark algorithms, including DRL algorithms and optimization-based algorithms. Multiagent RL environments of steady-state power systems are built under the scheme of the Gym toolkit [brockman2016openai]. Both the IEEE 33-bus and IEEE 141-bus test cases are adopted as ADNs. In the 33-bus case, there are three PV inverters and one SVC, which are assumed as four stations. In the 141-bus case, we have 13 PV inverters, 5 SVCs and 5 stations. Detailed simulation configuration and load/generation profiles are given in the supplemental file [zzsupple].
IV-A Proposed and Baseline Algorithms Setup
In the following experiments, the proposed MACSAC is implemented with our OLDC. For the benchmark algorithms, we adopt the state-of-the-art MADDPG [NIPS2017_7217] as a multiagent RL baseline, and SAC from [haarnoja2018sacapplications] as a centralized RL baseline. An optimization-based algorithm with SOCP relaxation is implemented with oracle models (VVO), which serves as a benchmark of the theoretically best performance. VVO with approximated models and practical considerations is treated as the model-based benchmark, called approximated VVO (AVVO). The hyperparameters of the RL algorithms are listed in the supplemental file [zzsupple].
Due to the stochastic nature of DRL-based algorithms, we use 3 independent random seeds for each group of experiments, whose mean values and error bounds are presented in the figures as solid lines and filled areas.
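A small sketch of how such curves could be aggregated across seeds is given below. It assumes the filled area spans the minimum and maximum over seeds at each step; the paper does not state which error bound is used, so a standard deviation band would be an equally plausible choice. The function name is hypothetical.

```python
import numpy as np


def summarize_runs(curves):
    """Per-step mean, lower and upper envelope over independent seeds.

    curves has shape (num_seeds, num_steps); each row is one training run.
    """
    data = np.asarray(curves, dtype=float)
    return data.mean(axis=0), data.min(axis=0), data.max(axis=0)


# Three seeds, two logged steps of some metric (e.g. active power loss).
mean_curve, lower, upper = summarize_runs([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

The mean would be drawn as the solid line and the region between `lower` and `upper` as the filled area.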
IV-B Algorithm Convergence and Efficiency with Ideal Simulation
To verify the convergence and efficiency of the proposed MACSAC, we first conduct an ideal centralized experiment in which none of the RL algorithms is limited by communication speed; that is, SAC in a centralized manner and MACSAC / MADDPG in OLDC () can execute the policies in every time step. During the execution, all samples are uploaded to the experience replay buffer. In this first experiment, all stochastic explorations are carried out in an identical copy of our simulated system, so the policies under test are free of noisy exploration for now. Note that although such a noise-free scenario is not realistic in practice, its results make the comparison of convergence and efficiency between RL and multiagent RL approaches more explicit.
The step values of active power loss and VVR during the training process are shown in figs. 3 and 4. The model-based benchmark VVO is also tested, with results averaged across the load/generation profiles since it is deterministic.
The first important observation from figs. 3 and 4 is that both SAC and MACSAC converge to a lower active power loss than the optimization-based method AVVO without oracle parameters, which reveals the advantage of DRL-based algorithms over such a parameter-sensitive optimization method for the VVC problem. On the other hand, although the oracle VVO theoretically attains the minimum active power loss given all true parameters, DRL-based algorithms closely approach it after a certain number of iterations, as depicted in the figures.
Using only local measurements during application for each agent, MACSAC achieves performance similar to that of the centralized algorithm SAC, which in comparison utilizes global measurements during application, even in an ideal centralized scenario that favors the latter. These results support the validity of the CMG formulation and the OLDC-like learning framework for VVC in ADNs.
Also, MACSAC clearly outperforms MADDPG regarding active power loss and VVR within a limited number of steps, as figs. 3 and 4 show. This significant improvement of MACSAC over MADDPG is credited to the use of maximum-entropy regularized stochastic policies rather than deterministic policies, since the latter can easily overfit the value functions and lead to extreme brittleness [haarnoja2018sacapplications]. Such features make MACSAC preferable in practice for multiagent VVC, not only in this study but also in potentially more complex tasks.
IV-C Online Application Performance with Real-world Simulation
To simulate the online stage, the practical considerations include: a) communication speed is limited compared to the control speed, so the centralized algorithm SAC can generate actions only every 8 steps; b) exploration has to be performed on the real system; and c) training and sampling can be performed only every 8 steps. Since the original OLDC framework is not suitable for online learning, we implement both MACSAC and MADDPG under OLDC with and . Note that VVO is still implemented in the ideal scenario to provide a lower bound reference.
Figures 5 and 6 show the results in online application. With OLDC, MACSAC achieves smaller active power loss and VVR than SAC in this scenario. This clearly better performance supports multiagent RL, especially MACSAC with OLDC, as a strong solution for VVC in ADNs.
Comparing MACSAC and MADDPG, although both algorithms are run under OLDC, MACSAC converges to much lower power loss and VVR with more stable performance. This significant advantage over MADDPG in terms of active power loss and VVR supports MACSAC as a preferred multiagent RL algorithm for VVC in ADNs.
V Conclusion
An online multiagent RL framework, OLDC, and the corresponding algorithm, MACSAC, are proposed for VVC to optimize the reactive power distribution in ADNs without knowledge of accurate model parameters. Considering distributed stations with high-speed IBERs in ADNs, the online multiagent learning and decentralized control framework can both learn from control experience continuously to meet the incomplete model challenge and make decisions locally to keep a high control speed. Instead of the existing MADDPG, we propose the safe and efficient MACSAC with maximum-entropy regularized stochastic policies and explicitly modelled constraints, which prevents optimization failure and improves training robustness. Numerical studies on ADNs represented by the modified IEEE 33-bus and 141-bus test cases indicate that the proposed MACSAC outperforms the benchmark methods in the online application. They also demonstrate that OLDC is well suited to online multiagent RL-based VVC, being efficient and robust to various computing and communication conditions.
In future work, the application of the proposed OLDC to other distributed or decentralized control problems is a promising research direction. With improved performance, MACSAC has the potential to handle more complex control problems.