I Introduction
With the increasing penetration of distributed generation (DG) [kurbatovaGlobalTrendsRenewable2020], modern distribution networks face severe operating problems such as voltage violations and high network losses. As a common practice, active distribution networks (ADN) integrate Volt/VAR control (VVC) to optimize the voltage profile and reduce network losses by employing not only discrete regulation equipment such as on-load tap changers (OLTC) and capacitor banks (CB), but also continuous control facilities such as DGs and static var compensators (SVC).
Typically, the original VVC task is described as a mixed-integer nonlinear programming (MINLP) problem whose variables stand for the strategies of voltage regulation devices and reactive power resources. While closed-form solutions of such a problem are hardly available, a variety of methods have been studied, leading to various VVC schemes that can be categorized by control architecture into centralized VVC [liuReactivePowerVoltage2009, borghettiUsingMixedInteger2013], distributed VVC [liuDistributedVoltageControl2018, xuAcceleratedADMMBasedFully2020], and decentralized VVC [zhuFastLocalVoltage2016, liuOnlineMultiagentReinforcement2021].
Even though the existing VVC methods have achieved considerable performance in traditional distribution networks, most of them rely heavily on an accurate network model. These model-based methods are seriously challenged when an accurate model is expensive or even impractical to maintain in a fast-developing ADN with increasing complexity and numerous components [arnoldModelFreeOptimalControl2016, liuOnlineMultiagentReinforcement2021, wangSafeOffpolicyDeep2019, gaoBatchConstrainedReinforcementLearning2020]. In recent years, research on deep reinforcement learning (DRL) has shown desirable potential for coping with incomplete-model challenges in video games [nachumDataEfficientHierarchicalReinforcement2018, ecoffetFirstReturnThen2021, lazaridisDeepReinforcementLearning2020] and in multiple areas of power grid operation, including energy trading [serbanArtificialIntelligenceSmart2020], network reconfiguration [gaoBatchConstrainedReinforcementLearning2020], frequency control [stanojevReinforcementLearningApproach2020, zhangResearchAGCPerformance2020], and so on. Hence, many inspiring DRL-based VVC methods have been proposed recently, such as [liuTwostageDeepReinforcement2020, wangSafeOffpolicyDeep2019, liCoordinationPVSmart2019, caoMultiAgentDeepReinforcement2020, liuOnlineMultiagentReinforcement2021, yangTwoTimescaleVoltageControl2020]. Such DRL-based VVC methods empower the agent of the ADN operating utility to learn a near-optimal strategy by interacting with the actual ADN and mining the optimization process data, without an accurate ADN model [chenReinforcementLearningDecisionMaking].
Moreover, the characteristics of the controlled devices determine the nature of the VVC problem. The two types of devices considered by modern VVC in ADNs are described in table I: the Slow Timescale Discrete Devices (STDD) and the Fast Timescale Continuous Devices (FTCD). As shown in fig. 1, we assume FTCDs take multiple steps within one STDD step.
Item          | STDD                    | FTCD
--------------|-------------------------|--------------------
Variable      | Discrete                | Continuous
Timescale     | Slow (in hours)         | Fast (in minutes)^o
Number        | Relatively small        | Large
Control Price | Limited switching times | Flexible
Devices       | OLTCs, CBs, …           | DGs, SVCs^*, …

^o The fast timescale depends heavily on communications.
^* The location of SVCs is similar to STDD.
In active distribution networks with both STDDs and FTCDs, all control devices are supposed to work concurrently and cooperatively, while most existing VVC works consider only one of them. Because of the huge differences in their natures listed in table I, especially the timescales shown in fig. 1, a proper optimization and control method that fully utilizes the fast response of FTCDs and the limited STDD actions is nontrivial.
With the assumption of oracle models of ADNs, some researchers have developed optimization-based multi-timescale VVC methods such as [zhengRobustReactivePower2017, xuMultiTimescaleCoordinatedVoltage2017, jinTwoTimescaleMultiObjectiveCoordinated2019, jhaBiLevelVoltVAROptimization2019, zafarMultiTimescaleVoltageStabilityConstrained2020, yangTwoTimescaleVoltageControl2020]. For example, [jinTwoTimescaleMultiObjectiveCoordinated2019] presents a multi-objective coordinated VVC method, with the slow timescale formulated as an MINLP to optimize the power loss and control actions, and the fast timescale as an NLP to minimize the voltage deviation. [jinTwoTimescaleMultiObjectiveCoordinated2019] handles the two stages separately, applying a search method to solve the MINLP and a scenario-based method to solve the NLP. Instead, [zhengRobustReactivePower2017] formulates the slow-timescale VVC as a robust optimization problem and guarantees the worst case in the fast timescale, which may lead to conservativeness or poor convergence. To solve the slow-timescale MINLP more efficiently online, [yangTwoTimescaleVoltageControl2020] incorporates a DRL algorithm called deep Q-network (DQN) to boost the solution, while optimizing the FTCDs with a model-based second-order cone program.
However, to coordinate the FTCDs and STDDs across two timescales and conduct efficient VVC in a model-free manner, it is indispensable to develop DRL-based multi-timescale VVC methods. Most existing DRL-based VVC methods target a single timescale. For example, the authors of [wangSafeOffpolicyDeep2019] proposed a safe off-policy RL algorithm to optimize STDDs hourly by formulating the voltage constraints explicitly and considering the device switching cost. In contrast, references [caoMultiAgentDeepReinforcement2020, liuOnlineMultiagentReinforcement2021] optimize the FTCDs in minutes by incorporating and improving continuous-action RL algorithms. As for multi-timescale (two-timescale) VVC, reference [yangTwoTimescaleVoltageControl2020] applies DQN to the slow-timescale optimization problem but depends on an oracle model for the fast-timescale optimization. Research on DRL-based VVC for both fast and slow timescales, which could achieve fully model-free optimization, is still urgently needed.
Unfortunately, DRL algorithms for a two-timescale agent with natures as different as those in table I are nontrivial and rarely studied. A reasonable solution is to set up RL agents for both timescales individually, as in fig. 2, so as to satisfy the natures of FTCDs and STDDs. However, traditional RL approaches such as Q-learning are poorly suited to training these agents. The most fundamental issue is that the policy of the fast timescale agent (FTA) in the lower layer changes as training proceeds, so the environment becomes nonstationary from the perspective of the slow timescale agent (STA) in the upper layer [loweMultiAgentActorCriticMixed2020]. Put another way, at every decision time of the STA, the next state depends not only on the STDD actions of the STA itself, but also on the subsequent FTCD actions of the FTA. This issue severely challenges learning stability and prevents the use of off-policy RL algorithms with experience replay, which are generally more efficient than on-policy ones [guQPropSampleEfficientPolicy2017]. Besides, the STA controls multiple STDDs and suffers from the curse of dimensionality in its action space.
In this paper, we propose a novel bilevel off-policy RL algorithm and accordingly develop a two-timescale VVC to jointly optimize FTCDs and STDDs in ADNs in a model-free manner. As shown in fig. 2, we first formulate the two-timescale VVC problem in the bilevel RL framework, with separate STA and FTA established. Then, the two agents are implemented with carefully designed actor-critic algorithms. Finally, the STA and FTA are trained jointly by introducing a multi-timescale off-policy correction technique to eliminate the nonstationarity problem. The proposed model-free two-timescale VVC method not only ensures learning stability through the coordination of STA and FTA, but also performs off-policy learning with desirable sample efficiency.
Compared with previous studies on VVC in ADNs, the unique contributions of this paper are summarized as follows.

To realize model-free joint optimization of the FTCDs and STDDs described in table I, we design a mathematical formulation called the bilevel Markov decision process to describe the two-timescale environment. A bilevel off-policy RL framework is proposed accordingly, where two agents, FTA and STA, are set up for the FTCDs and STDDs respectively, and both are trained with off-policy RL algorithms to exploit the samples efficiently.

To cope with the nonstationarity challenge of training the two agents on two different timescales, our bilevel off-policy RL framework coordinates the STA and FTA instead of training them separately. In this context, we propose a technique called multi-timescale off-policy correction (MTOPC). With MTOPC, the bias of STA learning under the disturbance of the FTA can be effectively eliminated, which makes off-policy RL algorithms applicable to the STA.

For the FTA, the soft actor-critic (SAC) algorithm with continuous actions is adapted to learn a stochastic VVC policy; for the STA, we develop a multi-discrete soft actor-critic (MDSAC) algorithm to reduce training complexity and improve efficiency. Compared with state-of-the-art RL algorithms, MDSAC produces discrete action values for all STDDs simultaneously while alleviating the curse-of-dimensionality challenge.
The rest of this paper is organized as follows. Section II formulates the two-timescale VVC problem and introduces key concepts and basic methods of RL as preliminaries. Section III derives and presents the details of the proposed bilevel off-policy RL algorithm, from which a two-timescale VVC is developed. Section IV shows and analyzes the results of the numerical study on the proposed two-timescale VVC. Finally, section V concludes this paper.
II Preliminaries
In this section, we first formulate the two-timescale VVC problem considered in this paper. Then, the settings of the Markov decision process and its variants used in this paper are introduced. In the last subsection, we cover the preliminaries of reinforcement learning and the actor-critic framework to support section III.
II-A Two-timescale VVC Problem Formulation
In this paper, we consider an ADN that can be depicted by an undirected graph consisting of the collection of all nodes and the collection of all branches. The point of common coupling (PCC) is located at node 0, where a substation connects the ADN to the main grid, which is simulated by a generator.
Both STDDs and FTCDs are installed in the ADN. The STDDs include OLTCs and CBs, whose tap positions are discrete variables; typically, the number of taps is an odd integer. The FTCDs include DGs and SVCs, whose reactive power outputs are continuous variables. Without loss of generality, we assume that all STDDs and FTCDs are installed on different nodes.

In slow-timescale VVC, the taps of the OLTCs and CBs are optimized following the objective in eq. 1 [wangSafeOffpolicyDeep2019]. Let $T$ be the number of slow-timescale VVC steps in one day and $t$ one of those steps. $P^{\mathrm{loss}}_{t}$ is the active power loss of the ADN, and $\Delta^{\mathrm{OLTC}}_{t}$ and $\Delta^{\mathrm{CB}}_{t}$ are the acting losses of the OLTCs and CBs, which take nonzero values only when the corresponding tap positions change between consecutive steps. $C^{\mathrm{loss}}$, $C^{\mathrm{OLTC}}$ and $C^{\mathrm{CB}}$ are the corresponding price coefficients. $\underline{V}$ and $\overline{V}$ are the voltage lower and upper limits, and $V_{i,t}$ is the voltage at node $i$.
$$\min \ \sum_{t=1}^{T} \Big( C^{\mathrm{loss}} P^{\mathrm{loss}}_{t} + C^{\mathrm{OLTC}} \Delta^{\mathrm{OLTC}}_{t} + C^{\mathrm{CB}} \Delta^{\mathrm{CB}}_{t} \Big) \quad \text{s.t.} \ \underline{V} \le V_{i,t} \le \overline{V}, \ \forall i \qquad (1)$$
As for the fast-timescale VVC, the reactive power outputs of the DGs and SVCs are optimized given the tap settings from the slow timescale. Let $T^{F}$ be the number of fast-timescale VVC steps in one day, and $t$ one of those steps. Then the problem is formulated as eq. 2 [xuAcceleratedADMMBasedFully2020].
$$\min_{q_{t}} \ P^{\mathrm{loss}}_{t} \quad \text{s.t.} \ \underline{V} \le V_{i,t} \le \overline{V}, \ \forall i; \quad \underline{q} \le q_{t} \le \overline{q} \qquad (2)$$
The DGs are typically designed with redundant rated capacity for safety reasons and operate under the maximum power point tracking (MPPT) mode. Hence, the controllable range of the reactive power of a DG is determined by its rated capacity and its maximum active power output. The reactive power of each SVC is likewise limited by its lower and upper bounds. Together, these ranges give the bounds $\underline{q}$ and $\overline{q}$ on the vector of controllable reactive power $q_{t}$.
II-B Markov Decision Process and Reinforcement Learning
A fundamental assumption of reinforcement learning is that the environment can be described as a Markov decision process (MDP). The classic definition of an MDP is given in definition II.1.
Definition II.1 (Markov Decision Process).
A Markov decision process is a tuple $(\mathcal{S}, \mathcal{A}, P, r)$, where

- $\mathcal{S}$ is the state space,

- $\mathcal{A}$ is the action space,

- $P(s_{t+1} \mid s_{t}, a_{t})$ is the transition probability distribution of the next state $s_{t+1}$ given the current state $s_{t}$ and the action $a_{t}$ at time $t$,

- $r(s_{t}, a_{t}, s_{t+1})$ is the immediate reward received after transiting from state $s_{t}$ to $s_{t+1}$ due to action $a_{t}$.
In the standard continuous-control RL setting, an agent interacts with an environment (MDP) over periods of time according to a policy $\pi$. The policy can be either deterministic, $a_{t} = \pi(s_{t})$, or stochastic, $a_{t} \sim \pi(\cdot \mid s_{t})$. In this paper, stochastic policies are adopted. From the definition of the MDP, if the environment is stationary, the cumulative reward can be improved by optimizing $\pi$. Classically, the objective of RL is to maximize the expected sum of discounted rewards $\mathbb{E}_{\tau \sim \pi}\big[\sum_{t=0}^{T-1} \gamma^{t} r_{t}\big]$, where $\gamma \in [0, 1)$ is the discount factor and $T$ is the length of an episode. A well-performing RL algorithm learns a good policy from ideally minimal interactions with the environment. Here $\tau$ denotes a trajectory of states and actions $(s_{0}, a_{0}, s_{1}, a_{1}, \dots)$, and $\tau \sim \pi$ is a trajectory generated with $\pi$ applied.
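To make the objective concrete, the discounted return of one finite trajectory can be computed as below (a minimal sketch with arbitrary reward values; `discounted_return` is an illustrative helper, not notation from the paper):

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of discounted rewards along one trajectory: sum_t gamma^t * r_t."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# Three-step episode with rewards (1, 0, 1) and gamma = 0.5:
# 1 + 0.5 * 0 + 0.25 * 1 = 1.25
print(discounted_return([1.0, 0.0, 1.0], gamma=0.5))  # -> 1.25
```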
To obtain an optimal policy, one has to evaluate the policy under unknown environment transition dynamics and conduct improvement. In reinforcement learning, such evaluation is carried out by defining two value functions, $V^{\pi}$ and $Q^{\pi}$, as shown in eq. 4. $V^{\pi}(s)$ is the state-value function representing the expected discounted reward from state $s$ onward under the policy $\pi$; $Q^{\pi}(s, a)$ is the state-action value function representing the expected discounted reward after taking action $a$ at state $s$ and following $\pi$ thereafter.
$$V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{T-1} \gamma^{t} r_{t} \,\Big|\, s_{0} = s\Big], \qquad Q^{\pi}(s, a) = \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{T-1} \gamma^{t} r_{t} \,\Big|\, s_{0} = s,\ a_{0} = a\Big] \qquad (4)$$
III Methods
In this section, we propose a novel bilevel off-policy reinforcement learning algorithm to solve the two-timescale VVC problem in ADNs. A summarized version of the RL-based two-timescale VVC is presented in section III-E.
We first propose a variant of the MDP called the bilevel Markov decision process (BMDP) in section III-A to describe the environment on two timescales. The two-timescale VVC problem in section II-A is then formulated as a BMDP accordingly.
Then, as shown in figs. 2 and 3, two agents, FTA and STA, are set up for the two timescales. We adopt the well-known off-policy RL algorithm SAC for the FTA, as described in section III-B. To alleviate the curse-of-dimensionality challenge and improve the efficiency of the STA, section III-C proposes a novel algorithm, MDSAC, which allows the STA to decide actions for all STDDs simultaneously with outstanding sample efficiency.
Finally, instead of simply training the FTA and STA separately, we calculate the exchanging factors between them with our proposed MTOPC technique in section III-D. It allows the FTA and STA to be trained together to optimize the BMDP in a stationary manner, based on the inherent Markov property of the BMDP.
III-A Two-timescale VVC in Bilevel Markov Decision Process
The standard RL setting in section II-B considers only a single timescale, which does not match the two-timescale VVC problem in section II-A. Hence, the BMDP is defined in definition III.1 as a variant of the MDP to describe the environment (the two-timescale VVC problem). The transition probability is still unknown to the agents. Note that although similar settings exist in previous works on hierarchical RL such as [suttonMDPsSemiMDPsFramework1999], the BMDP is specially reformed for the two-timescale setting.
Definition III.1 (Bilevel Markov Decision Process).
A bilevel Markov decision process is a joint of two MDPs defined in definition II.1 on two timescales, and is defined as a tuple $(\mathcal{S}, \mathcal{A}^{S}, \mathcal{A}^{F}, k, P, r^{S}, r^{F})$. Most of the symbols follow definition II.1. The incremental parts are described as follows.

- $\mathcal{A}^{S}$ is the slow action space,

- $\mathcal{A}^{F}$ is the fast action space,

- $k$ is the timescale ratio: the slow action is only available every $k$ steps of the fast timescale,

- when $t \bmod k \neq 0$, only the fast action $a^{F}_{t}$ is applied, so that $s_{t+1} \sim P(\cdot \mid s_{t}, a^{F}_{t})$,

- when $t \bmod k = 0$, two transitions happen consecutively: 1) the slow action $a^{S}_{t}$ is applied and the state transits from $s_{t}$ to an intermediate state $s'_{t}$; 2) the fast action $a^{F}_{t}$ is applied and $s_{t+1} \sim P(\cdot \mid s'_{t}, a^{F}_{t})$,

- $r^{S}$ is the immediate reward received on the slow timescale after the slow-timescale transition due to action $a^{S}_{t}$,

- $r^{F}$ is the immediate reward received on the fast timescale after the fast-timescale transition due to action $a^{F}_{t}$.
The orange part of fig. 3 illustrates the BMDP step by step. The setting of the BMDP ensures that the environment is Markovian from the bilevel perspective. In each episode of the fast layer, i.e., the fast steps between two slow actions, the initial state depends on the corresponding slow action. We therefore have to include the corresponding slow action in the state explicitly, which is emphasized by red dashed lines in the figure. The slow actions can be seen as the exchanging factors from STA to FTA, as shown in fig. 2.
Obviously, the transition between two consecutive slow-timescale states depends not only on the slow action, but also on the intervening fast actions. Hence, from the view of the slow level, the environment does not satisfy the Markov property. Put another way, the BMDP is Markovian at the fast level but non-Markovian at the slow level. In the remainder of section III, one of the major ideas is to take full advantage of the inherent Markov property of the BMDP and carry out a stable learning and control process on both timescales.
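The stepping scheme of the BMDP can be sketched as a control loop in which the STA acts every k fast steps and the FTA acts at every step. The environment API below (`reset`, `step_slow`, `step_fast`) is hypothetical and only illustrates the ordering of transitions:

```python
def run_bmdp_episode(env, sta_policy, fta_policy, k, horizon):
    """Sketch of one BMDP episode: when t mod k == 0 the slow action is
    applied first, then the fast action; otherwise only the fast action acts.
    Rewards of both timescales are simply accumulated here for illustration."""
    s = env.reset()
    a_slow = None
    total = 0.0
    for t in range(horizon):
        if t % k == 0:
            a_slow = sta_policy(s)            # STDD taps, held for k fast steps
            s, r_slow = env.step_slow(a_slow)
            total += r_slow
        a_fast = fta_policy(s, a_slow)        # FTCD setpoints, conditioned on a_slow
        s, r_fast = env.step_fast(a_fast)
        total += r_fast
    return total
```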
To formulate the two-timescale VVC problem (eq. 3) into the BMDP, the specific definitions of episodes, state spaces, action spaces and reward functions are designed as follows.
III-A1 Episode
An episode of the STA is defined as one day, with a step size of one hour, giving 24 steps per episode. An episode of the FTA comprises the fast-timescale steps within one STA step, their number being the timescale ratio. Note that these step sizes are alterable as long as the timescale relationship is satisfied.
III-A2 State Space
The common state of the BMDP is defined as a vector comprising the vectors of nodal active and reactive power injections, the vector of voltage magnitudes, the vector of OLTC taps, the vector of CB taps, and the time in one day.

III-A3 Action Spaces
For the FTA, the action space includes all the controllable reactive power outputs of the FTCDs. Since the reactive power generations are continuous (section II-A), the fast action space is defined as a box space with elementwise lower and upper bounds.
For the STA, the action space includes all the tap settings of the STDDs. Each tap setting is selected from a finite number of taps, so the slow action space is a discrete space with multiple dimensions, also known as a multi-discrete space, whose per-device dimensions are collected in a vector.
III-A4 Reward Functions
A reward function maps a certain transition to a single scalar value to be accumulated and maximized.
For the fast timescale, denote a fast-timescale transition by $(s_{t}, a^{F}_{t}, s_{t+1})$. Then the fast reward $r^{F}_{t}$ is defined as eq. 6 according to the inner minimization of eq. 3. Here $\mathrm{relu}(\cdot)$ is the rectified linear unit function defined in eq. 7, $C_{V}$ is a penalty multiplier for the voltage constraints, and $\mathrm{VVR}_{t}$, defined in eq. 8 over the $N$ nodes, is a smooth index of the voltage violations called the voltage violation rate (VVR).

$$r^{F}_{t} = -\big(P^{\mathrm{loss}}_{t} + C_{V} \cdot \mathrm{VVR}_{t}\big) \qquad (6)$$

$$\mathrm{relu}(x) = \max(x, 0) \qquad (7)$$

$$\mathrm{VVR}_{t} = \frac{1}{N} \sum_{i=1}^{N} \big[\mathrm{relu}(V_{i,t} - \overline{V}) + \mathrm{relu}(\underline{V} - V_{i,t})\big] \qquad (8)$$
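A possible implementation of the fast-timescale reward with the ReLU-based voltage-violation penalty reads as follows (a sketch assuming eq. 6 takes the form −(P_loss + C_V · VVR); the penalty multiplier and voltage band below are placeholder values):

```python
import numpy as np

def relu(x):
    """Rectified linear unit: elementwise max(x, 0)."""
    return np.maximum(x, 0.0)

def voltage_violation_rate(v, v_min=0.95, v_max=1.05):
    """Smooth violation index: mean ReLU distance outside the voltage band
    (one plausible form of the paper's VVR)."""
    return float(np.mean(relu(v - v_max) + relu(v_min - v)))

def fast_reward(p_loss, v, c_v=10.0, v_min=0.95, v_max=1.05):
    """Fast-timescale reward: negative active power loss minus a VVR penalty."""
    return -(p_loss + c_v * voltage_violation_rate(v, v_min, v_max))
```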
For the slow timescale, a transition spans one slow step. As shown in fig. 3, the fast-timescale samples between two consecutive slow-timescale states belong to that slow step. Since the objective of the STA considers the switching cost of the STDDs, the reward function of the slow timescale includes the switching cost part and the cumulative reward of the fast timescale, as given in eqs. 9 to 12.
(9)  
(10)  
(11)  
(12) 
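The slow-timescale reward can thus be assembled from the cumulative fast rewards and the tap-switching costs (a sketch of the description above; the price coefficients and the per-change cost model are assumptions, not the paper's exact eqs. 9-12):

```python
def switching_cost(prev_taps, next_taps, price):
    """Price-weighted count of devices whose tap position changed."""
    return price * sum(int(p != n) for p, n in zip(prev_taps, next_taps))

def slow_reward(fast_rewards, oltc_prev, oltc_next, cb_prev, cb_next,
                c_oltc=0.1, c_cb=0.05):
    """Slow-timescale reward: cumulative fast reward over the slow step,
    minus OLTC and CB switching costs (placeholder coefficients)."""
    return (sum(fast_rewards)
            - switching_cost(oltc_prev, oltc_next, c_oltc)
            - switching_cost(cb_prev, cb_next, c_cb))
```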
To solve the BMDP, two agents, STA and FTA, are set up for the slow and fast timescales respectively, as shown in fig. 3. The policy of the STA is denoted as $\pi^{S}$, while that of the FTA is $\pi^{F}$. An intuitive method would be to train the STA and FTA separately. However, because the environment is non-Markovian from the view of the STA, doing so violates the basic assumption of RL algorithms and leads to a nonstationary learning process. In section III-D, MTOPC is proposed to address this challenge.
III-B Soft Actor-Critic for FTA
SAC [haarnojaSoftActorcriticAlgorithms2018] is a state-of-the-art off-policy RL method for MDPs with continuous action spaces. It is implemented in the actor-critic framework, where the actor is the stochastic policy $\pi_{\phi}$ and the critic is the state-action value function $Q_{\theta}$; both are approximated in practice by deep neural networks (DNN) with parameters $\phi$ and $\theta$. In SAC, the definition of the state-action value function is entropy-regularized as
$$Q^{\pi}(s, a) = \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_{t} + \alpha \sum_{t=1}^{\infty} \gamma^{t} \mathcal{H}\big(\pi(\cdot \mid s_{t})\big) \,\Big|\, s_{0} = s,\ a_{0} = a\Big] \qquad (13)$$
where $\mathcal{H}(\pi(\cdot \mid s_{t}))$ is the entropy of the stochastic policy at $s_{t}$, and $\alpha$ is the temperature parameter.
During the learning process, all the samples of the MDP are stored in the replay buffer $\mathcal{D}$ as tuples $(s_{t}, a_{t}, r_{t}, s_{t+1})$. To approximate $Q_{\theta}$ iteratively, the Bellman equation is applied to the entropy-regularized value function as eq. 14, since $V^{\pi}(s) = \mathbb{E}_{a \sim \pi}\big[Q^{\pi}(s, a) - \alpha \log \pi(a \mid s)\big]$.
$$Q^{\pi}(s_{t}, a_{t}) = \mathbb{E}_{s_{t+1}}\Big[r_{t} + \gamma\, \mathbb{E}_{a_{t+1} \sim \pi}\big[Q^{\pi}(s_{t+1}, a_{t+1}) - \alpha \log \pi(a_{t+1} \mid s_{t+1})\big]\Big] \qquad (14)$$
Then, the mean-squared Bellman error (MSBE) in eq. 15 is minimized to update the critic network. Note the target value $y_{t} = r_{t} + \gamma\big(Q_{\bar{\theta}}(s_{t+1}, \tilde{a}_{t+1}) - \alpha \log \pi_{\phi}(\tilde{a}_{t+1} \mid s_{t+1})\big)$ with $\tilde{a}_{t+1} \sim \pi_{\phi}(\cdot \mid s_{t+1})$, where $\bar{\theta}$ denotes the target network parameters. All expectations are approximated with the Monte Carlo method.
$$L(\theta) = \mathbb{E}_{(s_{t}, a_{t}, r_{t}, s_{t+1}) \sim \mathcal{D}}\Big[\big(Q_{\theta}(s_{t}, a_{t}) - y_{t}\big)^{2}\Big] \qquad (15)$$
The policy is optimized to maximize the state value function $V^{\pi}$. In SAC, the reparameterization trick using a squashed Gaussian policy is introduced: $\tilde{a} = \tanh\big(\mu_{\phi}(s) + \sigma_{\phi}(s) \odot \epsilon\big)$ with $\epsilon \sim \mathcal{N}(0, I)$, where $\mu_{\phi}$ and $\sigma_{\phi}$ are two DNNs. Hence, the policy can be optimized by minimizing eq. 16.
$$L(\phi) = \mathbb{E}_{s \sim \mathcal{D},\ \epsilon \sim \mathcal{N}}\big[\alpha \log \pi_{\phi}(\tilde{a} \mid s) - Q_{\theta}(s, \tilde{a})\big] \qquad (16)$$
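The squashed-Gaussian reparameterization can be sketched as follows: a standard-normal sample is shifted and scaled by the policy network outputs, squashed by tanh, and the log-probability is corrected by the tanh change of variables (a minimal single-dimension sketch; the 1e-6 term is a common numerical safeguard, not part of the paper):

```python
import math
import random

def squashed_gaussian_sample(mu, sigma):
    """Reparameterized SAC action: a = tanh(mu + sigma * eps), eps ~ N(0, 1),
    returned together with its tanh-corrected log-probability."""
    eps = random.gauss(0.0, 1.0)
    u = mu + sigma * eps
    a = math.tanh(u)
    log_prob = (-0.5 * ((u - mu) / sigma) ** 2          # log N(u; mu, sigma^2)
                - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
                - math.log(1.0 - a * a + 1e-6))         # change-of-variables term
    return a, log_prob
```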
Other practical techniques in [haarnojaSoftActorcriticAlgorithms2018] are omitted here due to limited space.
III-C Multi-Discrete Soft Actor-Critic for STA
Compared with the FTA, whose actions are continuous, the STA has discrete actions in multiple dimensions. This makes a fundamental difference in the RL algorithm. Therefore, we propose a variant of SAC called multi-discrete soft actor-critic (MDSAC). MDSAC fully suits the multi-discrete action space of the STA and can carry out highly efficient training in an off-policy manner.
III-C1 Policy DNN Architecture
Because the policy should select an action for each of the STDDs, we design a multi-head DNN for the policy $\pi^{S}$. Let $M$ denote the number of STDDs and $m_{i}$ the number of taps of the $i$-th STDD. As shown in subfigure (a), the state is mapped to a shared representation, which is passed to $M$ heads with hidden layers. The $i$-th head produces $m_{i}$ digits and passes them through a softmax layer, yielding a vector of $m_{i}$ probabilities, one for each possible action of the $i$-th STDD. The probability of a joint action is then the product of the per-device probabilities.

III-C2 Value DNN Architecture
One of the most challenging problems in off-policy RL with a multi-discrete action space is that the state-action value function is nontrivial to implement. Classically, if the action space is discrete, the value network maps the state to one output per action, where the number of outputs equals the number of actions. However, with multiple devices, the number of outputs grows to the product of the per-device tap counts. As the number of devices increases, the cost of memory and CPU time grows exponentially. Worse still, the bloated outputs require many more samples to train, which is unaffordable in ADNs. Such exponential complexity is an instance of the curse-of-dimensionality problem and hinders the implementation of the STA in practice.
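The gap between the two formulations is easy to quantify: a monolithic value head needs one output per combination of taps, while device decomposition needs one output per tap (a small arithmetic sketch with hypothetical tap counts):

```python
def joint_output_count(taps_per_device):
    """Outputs of a monolithic Q head: product of the per-device tap counts."""
    n = 1
    for m in taps_per_device:
        n *= m
    return n

def decomposed_output_count(taps_per_device):
    """Outputs under device decomposition: one head per device, one per tap."""
    return sum(taps_per_device)

# e.g. two 11-tap OLTCs and two 2-tap CBs:
print(joint_output_count([11, 11, 2, 2]))       # -> 484
print(decomposed_output_count([11, 11, 2, 2]))  # -> 26
```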
To alleviate the problem above, we introduce a device-decomposition technique to the value DNN architecture, inspired by [sunehagValueDecompositionNetworksCooperative2017], where the value function is relaxed as the sum of independent value functions with local states and actions of each agent in multi-agent settings. A limitation of [sunehagValueDecompositionNetworksCooperative2017] is that such a sum combination may reduce the approximation capacity of the DNN. In MDSAC, we introduce adaptive affine parameters to address this limitation.
As shown in subfigure (b), the shared representation is fed to two parts: 1) each head produces a vector of state-action value scalars, one per tap of the corresponding device; 2) an additional part produces the mixing ratios and a base value. For a certain joint action, the device-wise state-action values are selected accordingly and combined in an affine mixing network as eq. 17. Because the ratios are also generated from DNNs, the flexibility of approximating the actual state-action value function is generally boosted compared with [sunehagValueDecompositionNetworksCooperative2017]. Also, the base value is learned per state, which is inherently similar to the well-known dueling network architecture [wangDuelingNetworkArchitectures2016].
$$Q_{\theta}(s, a) = \sum_{i=1}^{M} \lambda_{i}(s)\, Q_{i}(s, a_{i}) + b(s) \qquad (17)$$
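The affine mixing of eq. 17 amounts to a weighted sum of device-wise values plus a state-dependent base (a sketch; in the actual architecture the ratios and base come from DNN heads, while here they are plain arguments):

```python
def mixed_q_value(device_q, ratios, base):
    """Affine combination of device-wise state-action values:
    Q(s, a) = sum_i ratios[i] * device_q[i] + base."""
    if len(device_q) != len(ratios):
        raise ValueError("one ratio per device-wise value is required")
    return sum(l * q for l, q in zip(ratios, device_q)) + base

print(mixed_q_value([1.0, 2.0], [0.5, 1.0], 0.25))  # -> 2.75
```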
III-C3 Updates of Actor and Critic
Since the action space is discrete, the policy maps each state to probability values of the actions instead of a probability density function. As a result, expectations over the policy can be calculated explicitly instead of by Monte Carlo sampling. That means the state value function can be expressed as

(18)
(19) 
where the linearity of the expectation operator and the mixing network are leveraged.
(20) 
where the expectation is over the current environment transition probability distribution. If the transition dynamics were stationary, the expectation could be calculated by the Monte Carlo method on the replay buffer $\mathcal{D}$. However, the transition dynamics seen by the STA change as the FTA policy varies during training (section III-B). This is ignored temporarily here and will be corrected in section III-D.
(21) 
(22) 
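Because the per-device probabilities are available explicitly, the entropy-regularized state value can be summed out exactly instead of sampled (a sketch for a single discrete head; `alpha` is the temperature, and the probability/value lists are hypothetical inputs):

```python
import math

def soft_state_value(probs, q_values, alpha=0.2):
    """Exact soft state value for a discrete policy:
    V(s) = sum_a pi(a|s) * (Q(s, a) - alpha * log pi(a|s)).
    Zero-probability actions contribute nothing and are skipped."""
    return sum(p * (q - alpha * math.log(p))
               for p, q in zip(probs, q_values) if p > 0.0)
```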
III-D Multi-timescale Off-policy Correction
As an off-policy RL algorithm, MDSAC stores all transitions in the experience replay buffer $\mathcal{D}$. In eq. 20, if the transition dynamics were stationary as assumed by the MDP, the expectation could be calculated with the Monte Carlo method by sampling from $\mathcal{D}$, as SAC does in section III-B. However, in the BMDP only the fast-level transition probability is assumed to be stationary. Mathematically, as shown in fig. 3, the current probability of the transition from one slow-timescale state to the next is
(23) 
While the FTA is being trained with eq. 16, its policy $\pi^{F}$ varies from time to time; hence the fast policy is marked separately for each FTA step. Although the STA needs to calculate the expectation over eq. 23, the data in $\mathcal{D}$ are sampled from another probability distribution,
(24) 
where $\beta^{F}$ is the behavior policy used in the past, which is different from the current policy $\pi^{F}$. Obviously, the samples from $\mathcal{D}$ are no longer valid for direct Monte Carlo estimation. Put another way, during the learning process, the past experience of the STA is no longer correct for the current learning and can lead to significant bias.
To reuse the samples in $\mathcal{D}$ and leverage the high sample efficiency of the off-policy MDSAC, we propose multi-timescale off-policy correction (MTOPC) based on the importance sampling (IS) method. IS is widely used in policy-gradient RL algorithms such as A2C [mnihAsynchronousMethodsDeep2016] and PPO [schulmanProximalPolicyOptimization2017].
To estimate the expectation in eq. 20, MTOPC is derived as eq. 25,

(25)
where $\pi^{F}$ is the current FTA policy, $\beta^{F}$ denotes the behavior (original) policies, and $\omega$ is the correction factor calculated by the FTA,
$$\omega = \prod_{j=0}^{k-1} \frac{\pi^{F}\big(a^{F}_{j} \mid s_{j}\big)}{\beta^{F}\big(a^{F}_{j} \mid s_{j}\big)} \qquad (26)$$
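In practice, the correction factor is the product of per-step probability ratios between the current fast policy and the behavior policy, conveniently computed in log space (a sketch; the clipping bound is an added numerical safeguard, not part of the paper's derivation):

```python
import math

def mtopc_weight(logp_current, logp_behavior, clip=10.0):
    """Importance weight over the k fast steps inside one slow step:
    prod_j pi(a_j | s_j) / beta(a_j | s_j), from stored log-probabilities,
    clipped to bound the variance of the estimate."""
    log_w = sum(lc - lb for lc, lb in zip(logp_current, logp_behavior))
    return min(math.exp(log_w), clip)

print(mtopc_weight([-1.0, -2.0], [-1.0, -2.0]))  # -> 1.0 (policies agree)
```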
A similar technique was also studied in [nachumDataEfficientHierarchicalReinforcement2018] in hierarchical RL, rather than in the multi-timescale setting.
III-E Two-timescale VVC with Bilevel RL
The overall algorithm, combining sections III-A to III-D, is summarized in algorithm 1. Note that the gradient steps of the STA and FTA are carried out in parallel with the controlling process. Typically, one gradient step is executed every one or several control steps.
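The interleaved control-and-learning loop can be summarized as follows (a structural sketch only: the agent, buffer, and environment interfaces are hypothetical, and the slow-reward bookkeeping of eqs. 9-12 is omitted for brevity):

```python
def train_bilevel(env, sta, fta, k, episodes, horizon):
    """Interleaved control and learning: the STA acts every k steps, the FTA
    at every step; fast-step log-probs are stored so MTOPC can reweight the
    STA's replayed transitions later."""
    for _ in range(episodes):
        s = env.reset()
        for t in range(horizon):
            if t % k == 0:
                a_slow = sta.act(s)
                s_slow, fast_logps = s, []
            a_fast, logp = fta.act(s, a_slow)
            s2, r_fast = env.step(a_slow, a_fast)
            fta.buffer.add((s, a_slow, a_fast, r_fast, s2))
            fast_logps.append(logp)
            if (t + 1) % k == 0:
                sta.buffer.add((s_slow, a_slow, fast_logps, s2))
            s = s2
            fta.update()  # SAC gradient step(s)
            sta.update()  # MDSAC gradient step(s) with MTOPC weights
```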