Bi-level Off-policy Reinforcement Learning for Volt/VAR Control Involving Continuous and Discrete Devices

04/13/2021 ∙ by Haotian Liu, et al. ∙ Tsinghua University

In Volt/VAR control (VVC) of active distribution networks (ADNs), both slow timescale discrete devices (STDDs) and fast timescale continuous devices (FTCDs) are involved. STDDs such as on-load tap changers (OLTCs) and FTCDs such as distributed generators should be coordinated in time sequence. Such VVC is formulated as a two-timescale optimization problem to jointly optimize FTCDs and STDDs in ADNs. Traditional optimization methods rely heavily on accurate models of the system, which are sometimes impractical because of the unaffordable modelling effort. In this paper, a novel bi-level off-policy reinforcement learning (RL) algorithm is proposed to solve this problem in a model-free manner. A bi-level Markov decision process (BMDP) is defined to describe the two-timescale VVC problem, and separate agents are set up for the slow and fast timescale sub-problems. For the fast timescale sub-problem, we adopt soft actor-critic, an off-policy RL method with high sample efficiency. For the slow one, we develop an off-policy multi-discrete soft actor-critic (MDSAC) algorithm to address the curse of dimensionality caused by the various STDDs. To mitigate the non-stationarity in the two agents' learning processes, we propose a multi-timescale off-policy correction (MTOPC) method based on the importance sampling technique. Comprehensive numerical studies not only demonstrate that the proposed method achieves stable and satisfactory optimization of both STDDs and FTCDs without any model information, but also show that it outperforms existing two-timescale VVC methods.






I Introduction

With the increasing penetration of distributed generation (DG) [kurbatovaGlobalTrendsRenewable2020], modern distribution networks face severe operating problems such as voltage violations and high network losses. As a common practice, active distribution networks (ADNs) have integrated Volt/VAR control (VVC) to optimize the voltage profile and reduce network losses by employing not only discrete regulation equipment such as on-load tap changers (OLTCs) and capacitor banks (CBs), but also continuous control facilities such as DGs and static var compensators (SVCs).

Typically, the original VVC task is described as a mixed-integer nonlinear programming problem whose variables stand for the strategies of voltage regulation devices and reactive power resources. While general analytical solutions of such a problem are hardly available, a variety of methods have been studied, leading to various schemes of VVC, which can be categorized by control architecture into centralized VVC [liuReactivePowerVoltage2009, borghettiUsingMixedInteger2013], distributed VVC [liuDistributedVoltageControl2018, xuAcceleratedADMMBasedFully2020], and decentralized VVC [zhuFastLocalVoltage2016, liuOnlineMultiagentReinforcement2021].

Even though the existing VVC methods have achieved considerable performance in traditional distribution networks, most of them rely heavily on an accurate network model. These model-based methods are seriously challenged when an accurate model is expensive or sometimes impractical to maintain in a fast-developing ADN with increasing complexity and numerous components [arnoldModelFreeOptimalControl2016, liuOnlineMultiagentReinforcement2021, wangSafeOffpolicyDeep2019, gaoBatchConstrainedReinforcementLearning2020]. In recent years, research on deep reinforcement learning (DRL) has shown desirable potential for coping with incomplete-model challenges in video games [nachumDataEfficientHierarchicalReinforcement2018, ecoffetFirstReturnThen2021, lazaridisDeepReinforcementLearning2020] and in multiple areas of power grid operation, including energy trading [serbanArtificialIntelligenceSmart2020], network reconfiguration [gaoBatchConstrainedReinforcementLearning2020], frequency control [stanojevReinforcementLearningApproach2020, zhangResearchAGCPerformance2020], and so on. Hence, many inspiring DRL-based VVC methods have been proposed recently, such as [liuTwostageDeepReinforcement2020, wangSafeOffpolicyDeep2019, liCoordinationPVSmart2019, caoMultiAgentDeepReinforcement2020, liuOnlineMultiagentReinforcement2021, yangTwoTimescaleVoltageControl2020]. Such DRL-based VVC methods have empowered the agent in the ADN operating utility to learn a near-optimal strategy by interacting with the actual ADN and mining the optimization process data without an accurate ADN model [chenReinforcementLearningDecisionMaking].

Moreover, the characteristics of the controlled devices determine the nature of the VVC problem. The two types of devices considered by modern VVC in ADNs are described in table I: the slow timescale discrete devices (STDDs) and the fast timescale continuous devices (FTCDs). As shown in fig. 1, we assume FTCDs take several steps within one STDD step.

Item            STDD                      FTCD
Variable        Discrete                  Continuous
Timescale       Slow (in hours)           Fast (in minutes)*
Number          Relatively small          Large
Control price   Limited switching times   Flexible
Devices         OLTCs, CBs, …             DGs, SVCs†, …

  • * The fast timescale depends heavily on communications.

  • † The location of SVCs is similar to that of STDDs.

TABLE I: Two Types of Devices Considered by Modern VVC in ADNs
Fig. 1: The timeline of two types of devices in the ADNs.

In active distribution networks with both STDDs and FTCDs, all control devices are supposed to work concurrently and cooperatively, while most of the existing VVC works consider only one of them. Because of the huge differences in their natures listed in table I, especially the timescales shown in fig. 1, a proper optimization and control method that fully utilizes the fast response of FTCDs and the limited actions of STDDs is non-trivial.

With the assumption of oracle models of ADNs, some researchers have developed optimization-based multi-timescale VVC methods such as [zhengRobustReactivePower2017, xuMultiTimescaleCoordinatedVoltage2017, jinTwoTimescaleMultiObjectiveCoordinated2019, jhaBiLevelVoltVAROptimization2019, zafarMultiTimescaleVoltageStabilityConstrained2020, yangTwoTimescaleVoltageControl2020]. For example, [jinTwoTimescaleMultiObjectiveCoordinated2019] presents a multi-objective coordinated VVC method, with the slow timescale formulated as an MINLP to optimize the power loss and control actions, and the fast timescale as an NLP to minimize the voltage deviation. It handles the two stages separately by applying a search method to solve the MINLP and a scenario-based method to solve the NLP. Instead, [zhengRobustReactivePower2017] formulates the slow timescale VVC as a robust optimization problem and guarantees the worst case in the fast timescale, which may lead to conservativeness or poor convergence. To solve the slow timescale MINLP more efficiently online, [yangTwoTimescaleVoltageControl2020] incorporates a DRL algorithm called deep Q-network (DQN) to boost the solution, while optimizing the FTCDs with a model-based second-order cone program.

However, to coordinate the FTCDs and STDDs across the two timescales and conduct efficient VVC in a model-free manner, it is indispensable to develop DRL-based multi-timescale VVC methods. Most existing DRL-based VVC methods target a single timescale. For example, the authors of [wangSafeOffpolicyDeep2019] proposed a safe off-policy RL algorithm to optimize STDDs hourly by formulating the voltage constraints explicitly and considering the device switching cost. In contrast, references [caoMultiAgentDeepReinforcement2020, liuOnlineMultiagentReinforcement2021] are designed to optimize FTCDs in minutes by incorporating and improving continuous RL algorithms. As for multi-timescale (two-timescale) VVC, reference [yangTwoTimescaleVoltageControl2020] applied DQN to the slow timescale optimization problem, but depended on an oracle model for the fast timescale optimization. Research on DRL-based VVC for both the fast and slow timescales, which could achieve fully model-free optimization, is still urgently needed.

Unfortunately, DRL algorithms for two-timescale agents with such different natures as shown in table I are non-trivial and rarely studied. A reasonable solution is to set up RL agents for the two timescales individually, as in fig. 2, so as to match the natures of FTCDs and STDDs. However, traditional RL approaches such as Q-learning are poorly suited to training these agents. The most fundamental issue is that the policy of the fast timescale agent (FTA) in the lower layer changes as training proceeds, so the environment becomes non-stationary from the perspective of the slow timescale agent (STA) in the upper layer [loweMultiAgentActorCriticMixed2020]. In other words, at every decision time of the STA, the next step depends not only on the STDD actions of the STA itself, but also on the subsequent FTCD actions of the FTA. This issue severely challenges the learning stability and prevents the use of experience-replay-based off-policy RL algorithms, which are generally more sample-efficient than on-policy ones [guQPropSampleEfficientPolicy2017]. Besides, the STA involves multiple STDDs and suffers from the curse of dimensionality in its action space.

In this paper, we propose a novel bi-level off-policy RL algorithm and develop a two-timescale VVC accordingly to jointly optimize FTCDs and STDDs in ADNs in a model-free manner. As shown in fig. 2, we first formulate the two-timescale VVC problem in the bi-level RL framework with separate STA and FTA established. Then, the two agents are implemented with carefully designed actor-critic algorithms. Finally, the STA and FTA are trained jointly by introducing the multi-timescale off-policy correction technique to eliminate the non-stationarity problem. The proposed model-free two-timescale VVC method not only ensures learning stability through the coordination of the STA and FTA, but also performs off-policy learning with desirable sample efficiency.

Fig. 2: Overall structure of the proposed bi-level off-policy RL for two-timescale VVC in ADNs. The contributions of this paper are highlighted.

Compared with previous studies on VVC in ADNs, the unique contributions of this paper are summarized as follows.

  1. To realize model-free joint optimization of the FTCDs and STDDs described in table I, we design a mathematical formulation called the bi-level Markov decision process to describe the two-timescale environment. A bi-level off-policy RL framework is proposed accordingly, where two agents, the FTA and STA, are set up for the FTCDs and STDDs respectively and are both trained with off-policy RL algorithms to exploit the samples efficiently.

  2. To cope with the non-stationarity challenge of training the two agents at two different timescales, our bi-level off-policy RL framework coordinates the STA and FTA instead of training them separately. In this context, we propose a technique called multi-timescale off-policy correction (MTOPC). With MTOPC, the bias in the STA's learning under the disturbance of the FTA can be effectively eliminated, making off-policy RL algorithms applicable to the STA.

  3. For the FTA, soft actor-critic (SAC) with continuous actions is adapted to learn a stochastic VVC policy; for the STA, we develop a multi-discrete soft actor-critic (MDSAC) algorithm to reduce the training complexity and improve efficiency. Compared with state-of-the-art RL algorithms, MDSAC can produce discrete action values for all STDDs simultaneously while alleviating the curse of dimensionality.

The rest of this paper is organized as follows. Section II formulates the two-timescale VVC problem in this paper, and also introduces key concepts and basic methods of RL as preliminaries. Then in section III, the details of the proposed bi-level off-policy RL algorithm are derived and presented, and a two-timescale VVC is developed accordingly. Moreover, in section IV, the results of the numerical study on the proposed two-timescale VVC are shown and analyzed. Finally, section V concludes this paper.

II Preliminaries

In this section, we first formulate the two-timescale VVC problem considered in this paper. Then, the settings of the Markov decision process and its variants used in this paper are introduced. In the last subsection, we cover the preliminaries of reinforcement learning and the actor-critic framework to support section III.

II-A Two-timescale VVC Problem Formulation

In this paper, we consider an ADN with multiple nodes. It can be depicted by an undirected graph consisting of the collection of all nodes and the collection of all branches. The point of common coupling (PCC) is located at node 0, where a substation connects the ADN to the main grid, modelled as a generator.

Both STDDs and FTCDs are installed in the ADN. The STDDs include OLTCs and CBs; the tap of each OLTC and each CB is a discrete variable, and typically the numbers of taps are odd integers. The FTCDs include DGs and SVCs, whose reactive power outputs are the continuous control variables. Without loss of generality, we assume that all STDDs and FTCDs are installed on different nodes.

In the slow-timescale VVC, the taps of the OLTCs and CBs are optimized following the objective in eq. 1 [wangSafeOffpolicyDeep2019], over the slow-timescale VVC steps in one day. The objective comprises the active power loss of the ADN and the acting losses of the OLTCs and CBs, which are incurred whenever the corresponding tap positions change, weighted by their respective price coefficients. The voltage at each node is constrained between the lower and upper voltage limits.


As for the fast-timescale VVC, the reactive power outputs of the FTCDs are optimized given the tap settings from the slow timescale, over the fast-timescale VVC steps in one day. The problem is formulated as eq. 2 [xuAcceleratedADMMBasedFully2020].


The DGs are typically designed with redundant rated capacity for safety reasons and operate under the maximum power point tracking (MPPT) mode. Hence, the controllable range of the reactive power of a DG can be determined by its rated capacity and maximum active power output. Likewise, the reactive power of each SVC is bounded between its lower and upper limits.
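The capability range described above can be sketched as follows; this is an illustrative computation (function name and numbers are hypothetical, not from the paper), assuming the DG's apparent-power constraint is circular.

```python
import math

# Illustrative sketch: the controllable reactive-power range of a DG under
# MPPT, determined by its rated capacity s_rated and maximum active output
# p_max. All values are in per-unit and hypothetical.
def dg_q_range(s_rated, p_max):
    """Return the (lower, upper) reactive power bounds of a DG."""
    q_max = math.sqrt(max(s_rated ** 2 - p_max ** 2, 0.0))
    return -q_max, q_max

# A DG with 20% capacity redundancy can absorb or inject up to ~0.66 p.u.
q_lo, q_hi = dg_q_range(s_rated=1.2, p_max=1.0)
```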

Because STDDs and FTCDs both exist in the ADN and need to be coordinated properly, the two problems eqs. 1 and 2 are combined in this paper as eq. 3, assuming that the number of fast-timescale steps is an integer multiple of the number of slow-timescale steps.


Note that in model-based optimization methods, the VVC problems eqs. 1, 2 and 3 are solved with power flow constraints. In this paper, we focus on the situation where an accurate power flow model is not available.

II-B Markov Decision Process and Reinforcement Learning

A fundamental assumption of reinforcement learning is that the environment can be described as an MDP. The classic definition of an MDP is shown in definition II.1.

Definition II.1 (Markov Decision Process).

A Markov decision process is a tuple , where

  • is the state space,

  • is the action space,

  • is the transition probability distribution of the next state given the current state and the action,

  • is the immediate reward received after transiting from state to due to action ,

  • is the probability distribution of the initial state.


In the standard continuous-control RL setting, an agent interacts with an environment (MDP) over periods of time according to a policy. The policy can be either deterministic or stochastic; in this paper, stochastic policies are adopted. From the definition of the MDP, we can tell that if the environment is stationary, the cumulative reward can be improved by optimizing the policy. Classically, the objective of RL is to maximize the expectation of the sum of discounted rewards over an episode, where the discount factor weights future rewards. A well-performing RL algorithm will learn a good policy from ideally minimal interactions with the environment. Here a trajectory denotes the sequence of states and actions generated by applying the policy.
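The discounted objective above can be made concrete with a minimal sketch (generic symbols, not tied to the paper's notation):

```python
def discounted_return(rewards, gamma):
    """Backward accumulation of G = r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25 = 1.75
```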

To obtain an optimal policy, one has to evaluate the policy under the unknown environment transition dynamics and then improve it. In reinforcement learning, such evaluation is carried out by defining two value functions as shown in eq. 4: the state-value function, representing the expected discounted reward starting from a state under the policy, and the state-action value function, representing the expected discounted reward after taking a given action at a state under the policy.


According to the Bellman theorem and the Markov property of the MDP, the value functions can be recursively derived as eq. 5,


where the next state and the next action appear. With eq. 5, the value functions can be updated from batches of transitions rather than from whole trajectories.
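The transition-wise update enabled by the Bellman recursion can be sketched as a tabular temporal-difference step (names and learning rate are illustrative, not from the paper):

```python
def td_update(q, transition, gamma=0.99, lr=0.1):
    """One Bellman-recursion (TD) update of a tabular Q estimate from a
    single stored transition, rather than a whole trajectory."""
    s, a, r, s_next, a_next = transition
    target = r + gamma * q.get((s_next, a_next), 0.0)
    q[(s, a)] = q.get((s, a), 0.0) + lr * (target - q.get((s, a), 0.0))
    return q

q = td_update({}, ("s0", "a0", 1.0, "s1", "a1"), gamma=0.9, lr=0.5)
# q[("s0", "a0")] moves halfway toward the target 1.0, i.e. to 0.5
```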

III Methods

In this section, we propose a novel bi-level off-policy reinforcement learning algorithm to solve the two-timescale VVC in ADNs. A summarized version of the RL-based two-timescale VVC is presented in section III-E.

We first propose a variant of the MDP called the bi-level Markov decision process (BMDP) in section III-A to describe the environment at two timescales. The two-timescale VVC problem in section II-A is then formulated as a BMDP accordingly.

Then, as shown in figs. 2 and 3, two agents, the FTA and STA, are set up for the two timescales. We adopt the well-known off-policy RL algorithm SAC for the FTA, as described in section III-B. To alleviate the curse of dimensionality and improve the efficiency of the STA, section III-C proposes a novel algorithm, MDSAC, which allows the STA to decide actions for all STDDs simultaneously with outstanding sample efficiency.

Finally, instead of simply training the FTA and STA separately, we calculate the exchanging factors between them with the proposed MTOPC technique in section III-D. This allows the FTA and STA to be trained together to optimize the BMDP in a stationary manner, based on the inherent Markov property of the BMDP.

III-A Two-timescale VVC in Bi-level Markov Decision Process

The standard RL setting in section II-B considers only a single timescale, which does not match the two-timescale VVC problem in section II-A. Hence, the BMDP is defined in definition III.1 as a variant of the MDP to describe the environment (the two-timescale VVC problem). The transition probability is still unknown to the agents. Note that although similar settings exist in previous works on hierarchical RL such as [suttonMDPsSemiMDPsFramework1999], the BMDP is specifically tailored to the two-timescale setting.

Definition III.1 (Bi-level Markov Decision Process).

A bi-level Markov decision process is a joint of two MDPs defined in definition II.1 in two timescales, and is defined as a tuple . Most of the symbols follow definition II.1, and . The incremental parts are described as follows.

  • is the slow action space,

  • is the fast action space,

  • is the timescale ratio, i.e., a slow action is only available once every that many fast steps,

  • when no slow action is available, only the fast action is applied and drives the state transition,

  • when a slow action is available, two transitions happen consecutively: 1) the slow action is applied first and the state transits accordingly; 2) the fast action is then applied and the state transits again,

  • is the immediate reward received in slow timescale after transiting from state to due to action ,

  • is the immediate reward received in fast timescale after transiting from state to due to action .

Fig. 3: The setting of BMDP and the agents STA, FTA in this paper.

The orange part of fig. 3 illustrates the BMDP step by step. The setting of the BMDP ensures that the environment is Markovian from the bi-level perspective. In each episode of the fast layer, the initial state depends on the particular slow action. We therefore have to include the corresponding slow action in the state explicitly, as emphasized by the red dashed lines. The slow actions can be seen as the exchanging factors from the STA to the FTA, as shown in fig. 2.

Obviously, the transition between consecutive slow states depends not only on the slow action but also on the intermediate fast actions. Hence, from the slow-level view alone, the environment does not satisfy the Markov property. In other words, the BMDP is Markovian at the fast level but non-Markovian at the slow level. In the remainder of section III, one of the major ideas is to take full advantage of the inherent Markov property of the BMDP and carry out stable learning and control processes at both timescales.

To formulate the two-timescale VVC problem eq. 3 into BMDP, the specific definitions of episodes, state spaces, action spaces and reward functions are designed as follows.

III-A1 Episode

An episode of the STA is defined as one day, with a step size of one hour. An episode of the FTA consists of the fast steps within one STA step. Note that these settings are all alterable as long as the timescale ratio is satisfied.

III-A2 State Space

The common state of the BMDP is defined as a vector comprising the nodal active/reactive power injections, the nodal voltage magnitudes, the OLTC taps, the CB taps, and the time of day.

III-A3 Action Spaces

For the FTA, the action space includes all the controllable reactive powers of the FTCDs. Since the reactive power generations are continuous as described in section II-A, the action space is defined as a box space with elementwise lower and upper bounds.

For the STA, the action space includes all the tap settings of the STDDs. Each tap setting is selected from a finite set of taps, so the action space is a discrete space with multiple dimensions, also known as a multi-discrete space, whose per-device dimensions are listed in a vector.

III-A4 Reward Functions

A reward function maps each transition to a scalar value to be accumulated and maximized.

For the fast timescale, the reward is defined as eq. 6 according to the inner minimization of eq. 3. It comprises the power loss and a voltage penalty term, in which the rectified linear unit (ReLU) function measures the violations, a penalty multiplier weights the voltage constraints, and a smooth index of the voltage violations called the voltage violation rate (VVR) is used.


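A minimal sketch of such a ReLU-penalized reward is given below; the limits and the penalty multiplier are illustrative placeholders, not the paper's values.

```python
import numpy as np

# Hedged sketch of a fast-timescale reward in the spirit of eq. 6: negative
# power loss minus a ReLU-based voltage violation rate (VVR) penalty.
def fast_reward(p_loss, v, v_min=0.95, v_max=1.05, penalty=50.0):
    relu = lambda x: np.maximum(x, 0.0)
    vvr = np.mean(relu(v - v_max) + relu(v_min - v))  # smooth violation index
    return float(-(p_loss + penalty * vvr))

fast_reward(0.1, np.array([1.0, 1.01]))  # no violation: reward = -p_loss
```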
For the slow timescale, a transition spans the series of fast-timescale samples between two consecutive slow steps, as shown in fig. 3. Since the objective of the STA considers the switching cost of the STDDs, the reward function of the slow timescale includes both the switching cost part and the cumulative reward of the fast timescale.


To solve the BMDP, two agents, the STA and FTA, each with its own policy, are set up for the slow and fast timescales respectively, as shown in fig. 3. An intuitive method would be to train the STA and FTA separately. However, because the environment is non-Markovian from the view of the STA, this violates the basic assumption of RL algorithms and leads to a non-stationary learning process. In section III-D, MTOPC is proposed to address this challenge.

III-B Soft Actor-Critic for FTA

SAC [haarnojaSoftActorcriticAlgorithms2018] is a state-of-the-art off-policy RL method for MDPs with continuous action spaces. It is implemented in the actor-critic framework, where the actor is the stochastic policy and the critic is the state-action value function; in practice, both are approximated by deep neural networks (DNNs) with their respective parameter vectors.

In SAC, the state-action value function is entropy-regularized as

where the entropy of the stochastic policy at each state is weighted by the temperature parameter.

During the learning process, all the samples of the MDP are stored in the replay buffer. To approximate the value function iteratively, the Bellman equation is applied to the entropy-regularized value function as eq. 14.


Then, the mean-squared Bellman error (MSBE) in eq. 15 is minimized to update the critic network. All expectations are approximated with the Monte Carlo method.


The policy is optimized to maximize the state value function. In SAC, the reparameterization trick with a squashed Gaussian policy is introduced: the action is obtained by squashing a reparameterized Gaussian sample whose mean and standard deviation are produced by two DNNs. Hence, the policy can be optimized by minimizing eq. 16.


Other practical techniques in [haarnojaSoftActorcriticAlgorithms2018] are omitted here due to limited space.
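The squashed-Gaussian reparameterization above can be sketched as follows; this is a generic illustration (the mean/log-std would come from the policy DNNs, here they are plain arrays), not the paper's implementation.

```python
import numpy as np

# Minimal sketch of SAC-style reparameterized squashed-Gaussian sampling.
def squashed_gaussian_sample(mu, log_std, rng):
    std = np.exp(log_std)
    u = mu + std * rng.standard_normal(mu.shape)   # reparameterized sample
    a = np.tanh(u)                                 # squash into (-1, 1)
    # Gaussian log-prob plus the tanh change-of-variables correction
    logp = np.sum(-0.5 * (((u - mu) / std) ** 2 + 2 * log_std + np.log(2 * np.pi)))
    logp -= np.sum(np.log(1.0 - a ** 2 + 1e-6))
    return a, logp

a, logp = squashed_gaussian_sample(np.zeros(3), np.zeros(3), np.random.default_rng(0))
```

Squashing keeps the action inside the box bounds of the FTCDs, while the correction term keeps the log-probability consistent for the entropy-regularized objective.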

III-C Multi-Discrete Soft Actor-Critic for STA

Compared with the FTA with continuous actions, the STA has discrete actions in multiple dimensions. This makes fundamental differences in the RL algorithm. Therefore, we propose a variant of SAC called multi-discrete soft actor-critic (MDSAC) as follows. MDSAC fully suits the multi-discrete action space of the STA and enables highly efficient training in an off-policy manner.

III-C1 Policy DNN Architecture

Because the policy should select an action for each of the STDDs, we design a multi-head DNN for the policy. As shown in fig. (a)a, the state is mapped to a shared representation, which is passed to one head per STDD, each with its own hidden layers. Each head produces as many digits as its STDD has taps and passes them through a softmax layer, yielding a probability vector over all the possible actions of that STDD. Finally, the probability of a certain joint action is obtained from the per-device probabilities.
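The multi-head design can be sketched with plain linear heads (a hypothetical stand-in for the DNN heads; weights and sizes are illustrative):

```python
import numpy as np

# Hypothetical sketch of the multi-head policy: a shared representation is
# mapped by one linear head per STDD to a softmax distribution over its taps.
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def multi_head_policy(shared_repr, head_weights):
    """head_weights[i] has shape (n_taps_i, d); returns one distribution per STDD."""
    return [softmax(w @ shared_repr) for w in head_weights]

rng = np.random.default_rng(0)
dists = multi_head_policy(rng.standard_normal(8),
                          [rng.standard_normal((5, 8)),   # OLTC with 5 taps
                           rng.standard_normal((3, 8))])  # CB with 3 taps
```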

III-C2 Value DNN Architecture

One of the most challenging problems in off-policy RL with a multi-discrete action space is that the state-action value function is non-trivial to implement. Classically, if the action space is discrete, the state-action value function is designed as a network with one output per action. However, with multiple devices, the number of joint actions grows to the product of the per-device tap counts, so the number of outputs, and with it the cost of memory and CPU time, grows exponentially with the number of devices. Worse still, the bloated outputs require many more samples to train, which is unaffordable in ADNs. Such exponential complexity can be seen as the curse of dimensionality and hinders the implementation of the STA in practice.
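A quick count illustrates the gap between a monolithic Q head and per-device heads (the tap counts below are hypothetical, not from the paper's test systems):

```python
import math

taps = [5, 5, 3, 3, 2]           # hypothetical tap counts for five STDDs
joint_outputs = math.prod(taps)   # monolithic Q network: one output per joint action
decomposed_outputs = sum(taps)    # one head per device: outputs grow linearly
# joint_outputs == 450 while decomposed_outputs == 18
```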

To alleviate the problem above, we introduce a device decomposition technique to the value DNN architecture, inspired by [sunehagValueDecompositionNetworksCooperative2017], where the value function is relaxed as the sum of independent value functions with local states and actions of each agent in a multi-agent setting. A limitation of [sunehagValueDecompositionNetworksCooperative2017] is that such a sum combination may reduce the approximation capability of the DNN. In MDSAC, we introduce state-adaptive affine parameters to address this limitation.

As shown in fig. (b)b, the shared representation is fed to two parts: 1) one head per device, producing a vector of device-wise state-action values; and 2) a vector of mixing scalars. For a certain joint action, the device-wise state-action values are selected and combined in an affine mixing network as eq. 17. Because the mixing ratios are also generated by DNNs, the flexibility of approximating the actual value function is generally boosted compared with [sunehagValueDecompositionNetworksCooperative2017]. The mixing network also learns a base value for each state, which is inherently similar to the well-known dueling network architecture [wangDuelingNetworkArchitectures2016].

(a) The architecture of policy neural network .
(b) The architecture of state-action value neural network .
Fig. 4: DNN architecture design of and .

III-C3 Updates of Actor and Critic

Since the action space is discrete, the policy maps each state to probability values over actions instead of a probability density function. As a result, the expectation over the policy can be calculated explicitly instead of via Monte Carlo sampling. That means the state value function can be expressed as


where the linearity of the expectation operator and the structure of the mixing network are leveraged.
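The explicit expectation can be sketched for a single device's distribution as follows (an illustrative, entropy-regularized state value in the soft-RL style; names are hypothetical):

```python
import numpy as np

# Sketch: for a discrete policy, the soft state value
#   V(s) = sum_a pi(a|s) * (Q(s,a) - alpha * log pi(a|s))
# is computed exactly, with no Monte Carlo sampling over actions.
def soft_state_value(pi, q, alpha):
    pi, q = np.asarray(pi), np.asarray(q)
    return float(np.sum(pi * (q - alpha * np.log(pi + 1e-12))))

v = soft_state_value([0.25, 0.25, 0.25, 0.25], [1.0, 1.0, 1.0, 1.0], alpha=0.0)
```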

From eqs. 5 and 4, we have


where the current environment transition probability distribution appears. If the transition distribution were stationary, the expectation could be calculated by the Monte Carlo method on the replay buffer. However, the fast-timescale policy keeps changing as the FTA trains in section III-B. This is ignored temporarily here and will be corrected in section III-D.


Accordingly, the parameters of the policy and value networks can be optimized by minimizing eqs. 21 and 22.

III-D Multi-timescale Off-policy Correction

As an off-policy RL algorithm, MDSAC stores all transitions in the experience replay buffer. In eq. 20, if the transition distribution were stationary as assumed by the MDP, the expectation could be calculated with the Monte Carlo method by sampling from the buffer, as SAC in section III-B does. However, in the BMDP only the fast-level transition is assumed to be stationary. Mathematically, as shown in fig. 3, the current probability of the transition between two consecutive slow states is


where the product runs over the fast steps in between. While the FTA is training with eq. 16, its policy varies from time to time; hence, we mark the policy for each FTA step. Though the STA needs to calculate the expectation under eq. 23, the data in the buffer are sampled from another probability distribution,


where the behavior policy used in the past is different from the current policy. Obviously, the samples from the buffer are no longer valid for direct Monte Carlo estimation. In other words, during the learning process, the past experience of the STA is no longer correct for the current learning and can lead to significant bias.

To reuse the samples in the buffer and leverage the high sample efficiency of off-policy MDSAC, we propose multi-timescale off-policy correction (MTOPC) based on the importance sampling (IS) method. IS is widely used in policy gradient RL algorithms such as A2C [mnihAsynchronousMethodsDeep2016] and PPO [schulmanProximalPolicyOptimization2017].

To estimate the expectation in eq. 20, MTOPC is derived as eq. 25,


where is the current FTA policy, is the behavior (original) policies, and is the correction factor calculated by FTA,


A similar technique was also studied in [nachumDataEfficientHierarchicalReinforcement2018] for hierarchical RL rather than the multi-timescale setting.

In practice, the behavior probabilities are stored with the transitions in the buffer, and the correction factor is calculated using the latest FTA policy. Because the cumulative product in eq. 26 may lead to high variance and numerical problems, we clip the correction factor between lower and upper bounds.
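A hedged sketch of such a clipped correction factor (function name and clip bounds are illustrative, not the paper's values):

```python
import numpy as np

# Sketch of an MTOPC-style correction factor: a cumulative product of
# per-step importance ratios over the fast steps between two slow steps,
# computed in log space and clipped to [c_min, c_max] for stability.
def mtopc_weight(logp_current, logp_behavior, c_min=0.1, c_max=10.0):
    log_ratio = np.sum(np.asarray(logp_current) - np.asarray(logp_behavior))
    return float(np.clip(np.exp(log_ratio), c_min, c_max))

mtopc_weight([-1.0, -2.0], [-1.0, -2.0])  # identical policies: factor is 1.0
```

Working in log space turns the cumulative product into a sum, and clipping bounds the variance that long products of ratios would otherwise introduce.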

Accordingly, eq. 21 is corrected as


with batches of transitions sampled from the buffer and the latest correction factors.

III-E Two-timescale VVC with Bi-level RL

The overall algorithm, combining sections III-A to III-D, is summarized in algorithm 1. Note that the gradient steps of the STA and FTA are carried out in parallel with the control process. Typically, one gradient step is executed every one or several control steps.

Given learning rates , temperature parameters ;
Initialize STA and FTA’s policy and value functions’ parameter vectors ;
foreach  STA episode  do
       Get the initial state , ;
       foreach  STA step do in parallel
             Feed to the environment, get next state ;
             , ;
             foreach  FTA step in a -step episode  do
                   , ;
                   Feed to the environment, get reward , next state ;
             end foreach
            Get reward , state ;
      foreach FTA gradient step do in parallel
             Sample a batch of