I Introduction
Voltage regulation transformers, also referred to as load tap changers (LTCs), are widely utilized in power distribution systems to regulate the voltage magnitudes along a feeder. Conventionally, the tap position of each LTC is controlled through an automatic voltage regulator based on local voltage measurements [1]. This approach, albeit simple and effective, is not optimal in any sense: it does not necessarily minimize the voltage deviation, and it may result in frequent actions of the LTCs, thereby accelerating wear and tear [2]. In the context of transmission systems, transformer tap positions are optimized jointly with active and reactive power generation by solving an optimal power flow (OPF) problem, which is typically cast as a mixed-integer programming problem (see, e.g., [3, 4] and references therein). Similar OPF-based approaches are also adopted in power distribution systems. For example, in [2], the authors cast the optimal tap setting problem as a rank-constrained semidefinite program that is further relaxed by dropping the rank-one constraint, which avoids the nonconvexity and integer variables so that the problem can be solved efficiently. OPF-based approaches have also been utilized to determine the optimal reactive power injection from distributed energy resources so as to regulate voltage in a distribution network [5, 6].
While these OPF-based approaches are effective in regulating voltages, they require complete system knowledge, including active and reactive power injections, and transmission/distribution line parameters. While it may be reasonable to assume that such information is available for transmission systems, the situation in distribution systems is quite different. Accurate line parameters may not be known and power injections at each bus may not be available in real time, which prevents the application of OPF-based approaches [7]. In addition, OPF-based approaches typically deal with one snapshot of system conditions, and assume that loads remain constant between two consecutive snapshots. Therefore, the optimal tap setting problem needs to be solved for each snapshot in real time.
In this paper, we develop an algorithm that can find a policy for determining the optimal tap positions of the LTCs in a power distribution system under uncertain load dynamics without any information on power injections or line parameters; the algorithm requires only voltage magnitude measurements and system topology information. Specifically, the optimal tap setting problem is cast as a Markov decision process (MDP), which can be solved using reinforcement learning (RL) algorithms. Doing so requires state and action samples that sufficiently explore the MDP state and action spaces. Such samples are hard to obtain in real power systems, since this requires changing tap settings and other controls to excite the system and record voltage responses, which may jeopardize system operational reliability and incur economic costs. To circumvent this issue, we take advantage of a linearized power flow model and develop an effective algorithm to estimate voltage magnitudes under different tap settings so that the state and action spaces can be explored freely offline without impacting the real system.
The dimension of the state and action spaces increases exponentially as the number of LTCs grows, which causes the issue known as the “curse of dimensionality” and makes the computation of the optimal policy intractable [8]. To circumvent the “curse of dimensionality,” we propose an efficient batch RL algorithm, the least squares policy iteration (LSPI) based sequential learning algorithm, to learn an action-value function sequentially for each LTC. Once the learning of the action-value function is completed, we can determine the policy for optimally setting the LTC taps. We emphasize that the optimal policy can be computed offline, where most of the computational burden lies; when the policy is executed online, the computation required to find the optimal tap positions is minimal. The effectiveness of the proposed algorithm is validated through simulations on two IEEE distribution test feeders.
The remainder of the paper is organized as follows. Section II introduces a linearized power flow model that includes the effect of LTCs and describes the optimal tap setting problem. Section III provides a primer on MDPs and the LSPI algorithm. Section IV develops an MDP-based formulation for the optimal tap setting problem and Section V proposes an algorithm to solve this problem. Numerical simulation results on two IEEE test feeders are presented in Section VI. Concluding remarks are provided in Section VII.
II Preliminaries
In this section, we review a linearized power flow model for power distribution systems, and modify it to include the effect of LTCs. We also describe the LTC tap setting problem.
II-A Power Distribution System Model
Consider a power distribution system that consists of a set of buses indexed by the elements in , and a set of transmission lines indexed by the elements in . Each line is associated with an ordered pair . Assume bus is an ideal voltage source that corresponds to a substation bus, which is the only connection of the distribution system to the bulk power grid. Let denote the magnitude of the voltage at bus , , and define ; note that is a constant since bus is assumed to be an ideal voltage source. Let and denote the active power injection and reactive power injection at bus , , respectively. For each line that is associated with , let and respectively denote the active and reactive power flows on line , which are positive if the flow of power is from bus to bus and negative otherwise. Let and denote the resistance and reactance of line , . For a radial power distribution system, the relation between squared voltage magnitudes, power injections, and line power flows can be captured by the so-called LinDistFlow model [9] as follows:
(1a)  
(1b)  
(1c) 
where is associated with .
Define and . Let , with and if line is associated with , and all other entries equal to zero. Let denote the first row of and the matrix that results by removing from . For a radial distribution system, , and is invertible. Define , , and . Then, the LinDistFlow model in (1) can be written as follows:
(2) 
where returns a diagonal matrix with the entries of the argument as its diagonal elements.
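As a minimal numerical sketch of the matrix-form LinDistFlow computation, the snippet below builds a bus-to-line incidence matrix for a hypothetical 4-bus radial feeder and solves for the squared voltage magnitudes. The feeder, the line parameters, the injections, and the function name `lindistflow_voltages` are all illustrative assumptions; the sign conventions assumed here are that injections are positive into the network (so loads are negative) and that the drop in squared voltage along a line from bus i to bus j is 2(rP + xQ).

```python
import numpy as np

def lindistflow_voltages(lines, r, x, p, q, v0_sq):
    """Squared voltage magnitudes at buses 1..N of a radial feeder (bus 0 = substation)."""
    n_lines = len(lines)
    n_bus = n_lines + 1                      # radial network: one more bus than lines
    M_full = np.zeros((n_bus, n_lines))      # bus-to-line incidence matrix
    for l, (i, j) in enumerate(lines):
        M_full[i, l] = 1.0                   # line l leaves bus i
        M_full[j, l] = -1.0                  # line l enters bus j
    m0, M = M_full[0], M_full[1:]            # substation row and reduced square matrix
    P = np.linalg.solve(M, p)                # lossless power balance: M @ P = p
    Q = np.linalg.solve(M, q)
    drop = 2.0 * (r * P + x * Q)             # assumed relation: v_i - v_j on each line
    return np.linalg.solve(M.T, drop - m0 * v0_sq)

# Hypothetical feeder: bus 0 feeds bus 1, which feeds buses 2 and 3.
lines = [(0, 1), (1, 2), (1, 3)]
r = np.array([0.01, 0.02, 0.015])            # illustrative resistances (p.u.)
x = np.array([0.02, 0.03, 0.025])            # illustrative reactances (p.u.)
p = np.array([-0.5, -0.3, -0.2])             # loads enter as negative injections
q = np.array([-0.2, -0.1, -0.1])
v_sq = lindistflow_voltages(lines, r, x, p, q, v0_sq=1.0)
print(v_sq)
```

Because the reduced incidence matrix of a radial network is square and invertible, both the line flows and the voltages follow from linear solves, with no iteration required.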
The standard model for an LTC in the literature is shown in Fig. 1 (see, e.g., [1]), where , line is associated with , and is the tap ratio of the LTC on line . Typically, the tap ratio can possibly take on discrete values ranging from to , by an increment of % p.u., i.e., [1]. Let denote the set of all feasible LTC tap ratio changes. We index the tap positions by for convenience.
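The discrete tap set can be enumerated directly. The sketch below assumes a regulation range of ±10% in 32 equal steps, a configuration commonly used for step-voltage regulators; these numbers and all names are assumptions for illustration only.

```python
import numpy as np

# ASSUMED values for illustration: +/-10% range split into 32 equal steps.
T_MIN, T_MAX, N_STEPS = 0.9, 1.1, 32

half = N_STEPS // 2
tap_positions = np.arange(-half, half + 1)             # integer tap indices
tap_ratios = np.linspace(T_MIN, T_MAX, N_STEPS + 1)    # corresponding tap ratios
ratio_of = dict(zip(tap_positions.tolist(), tap_ratios.tolist()))

step = tap_ratios[1] - tap_ratios[0]
print(len(tap_ratios), step, ratio_of[0])
```

Indexing the taps symmetrically around zero makes the neutral position (ratio 1) correspond to index 0.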
Let denote the set of lines with LTCs and let , where denotes the cardinality of a set. For line that is associated with , if , the voltage relation in the LinDistFlow model, i.e., (1c), needs to be modified as follows:
(3) 
Define and , . Let , with and if line , and if line , and all other entries equal to zero. Let denote the first row of and the matrix that results by removing from . The matrix is nonsingular when the power distribution system is connected. Then, the modified matrixform LinDistFlow model that takes into account the LTCs is given by:
(4) 
II-B Optimal Tap Setting Problem
To effectively regulate the voltages in a power distribution system, the tap positions of LTCs need to be set appropriately. The objective of the optimal tap setting problem is to find a policy that determines the LTC tap ratio so as to minimize the voltage deviation from some reference value, denoted by , based on current tap ratios and measurements of the voltage magnitudes, i.e., . Throughout this paper, we make the following two assumptions:

A1. The distribution system topology is known but the line parameters are unknown.

A2. The active and reactive power injections are not measured and their probability distributions are unknown.
III Markov Decision Process and Batch Reinforcement Learning
In this section, we provide some background on MDPs and on batch RL algorithms, a class of data-efficient and stable algorithms for solving MDPs with unknown models.
III-A Markov Decision Process
An MDP is defined as a 5-tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, where $\mathcal{S}$ is a finite set of states, $\mathcal{A}$ is a finite set of actions, $\mathcal{P}$ is a Markovian transition model that denotes the probability of transitioning from one state into another after taking an action, $\mathcal{R}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is a reward function such that, for $s, s' \in \mathcal{S}$ and $a \in \mathcal{A}$, $r = \mathcal{R}(s, a, s')$ is the reward obtained when the system transitions from state $s$ into state $s'$ after taking action $a$, and $\gamma \in [0, 1)$ is a discount factor (see, e.g., [10]). (These definitions can be directly extended to the case where the set of states is infinite. Due to space limitations, this case is not discussed in detail here.) We refer to the 4-tuple $(s, a, r, s')$, where $s'$ is the state following $s$ after taking action $a$ and $r = \mathcal{R}(s, a, s')$, as a transition.
Let $s_t$ and $a_t$ denote the state and action at time instant $t$, respectively, and $r_t$ the reward received after taking action $a_t$ in state $s_t$. Let $\mathbb{P}\{\cdot\}$ denote the probability operator; then, $\mathbb{P}\{s_{t+1} = s' \mid s_t = s, a_t = a\}$ is the probability of transitioning from state $s$ into state $s'$ after taking action $a$ at instant $t$. Throughout this paper, we assume time-homogeneous transition probabilities, hence we drop the subindex $t$ and just write $\mathcal{P}(s' \mid s, a)$.
Let $\bar{r}(s, a)$ denote the expected reward for a state-action pair $(s, a)$; then, we have
$$\bar{r}(s, a) = \mathbb{E}\left[ r_t \mid s_t = s, a_t = a \right] = \sum_{s' \in \mathcal{S}} \mathcal{P}(s' \mid s, a)\, \mathcal{R}(s, a, s'), \tag{5}$$
where $\mathbb{E}[\cdot]$ denotes the expectation operation. The total discounted reward from time instant $t$ and onwards, denoted by $G_t$, also referred to as the return, is given by
$$G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k}. \tag{6}$$
A deterministic policy $\pi$ is a mapping from $\mathcal{S}$ to $\mathcal{A}$, i.e., $a = \pi(s)$. The action-value function under policy $\pi$ is defined as follows:
$$Q_{\pi}(s, a) = \mathbb{E}\left[ G_t \mid s_t = s, a_t = a; \pi \right], \tag{7}$$
which is the expected return when taking action $a$ in state $s$, and following policy $\pi$ afterwards. Intuitively, the action-value function quantifies, for a given policy $\pi$, how “good” the state-action pair $(s, a)$ is in the long run.
Let $Q^{*}$ denote the optimal action-value function, i.e., the maximum action-value function over all policies: $Q^{*}(s, a) = \max_{\pi} Q_{\pi}(s, a)$. All optimal policies share the same optimal action-value function. Also, the greedy policy with respect to $Q^{*}$, i.e., $\pi^{*}(s) = \arg\max_{a \in \mathcal{A}} Q^{*}(s, a)$, is an optimal policy. Then, it follows from (6) and (7) that $Q^{*}$ satisfies the following Bellman optimality equation (see, e.g., [8]):
$$Q^{*}(s, a) = \bar{r}(s, a) + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}(s' \mid s, a) \max_{a' \in \mathcal{A}} Q^{*}(s', a'). \tag{8}$$
The MDP is solved if we find $Q^{*}$, and correspondingly, the optimal policy $\pi^{*}$. It is important to emphasize that (8) is key in solving the MDP. For ease of notation, in the rest of this paper, we simply write $Q^{*}$ as $Q$.
When both the state and the action sets are finite, the action-value function can be exactly represented in a tabular form that covers all possible pairs $(s, a)$. In this case, if $\mathcal{P}$ is also known, then the MDP can be solved using, e.g., the so-called policy iteration and value iteration algorithms (see, e.g., [8]). If $\mathcal{P}$ is unknown but samples of transitions are available, the MDP can be solved by using RL algorithms such as the Q-learning algorithm (see, e.g., [11]).
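To make the tabular case concrete, the following sketch applies value iteration to the action-value function of a hypothetical two-state, two-action MDP with a known transition model. The toy MDP, the array layout `P[s, a, s']`, and the function name are assumptions made for illustration; the update is the standard fixed-point iteration on the Bellman optimality equation.

```python
import numpy as np

def q_value_iteration(P, R, gamma, n_iter=1000):
    """Iterate the Bellman optimality update on a tabular action-value function.

    P[s, a, s2] is the transition probability and R[s, a, s2] the reward.
    """
    n_s, n_a, _ = P.shape
    Q = np.zeros((n_s, n_a))
    for _ in range(n_iter):
        # Reward plus discounted best value of the next state, per (s, a, s2).
        target = R + gamma * Q.max(axis=1)[None, None, :]
        Q = np.einsum("sap,sap->sa", P, target)   # expectation over next states
    return Q

# Hypothetical MDP: action 0 stays in the current state, action 1 switches state.
P = np.zeros((2, 2, 2))
for s in range(2):
    P[s, 0, s] = 1.0
    P[s, 1, 1 - s] = 1.0
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0                                   # reward 1 for landing in state 1

Q_star = q_value_iteration(P, R, gamma=0.9)
policy = Q_star.argmax(axis=1)                     # greedy policy
print(Q_star, policy)
```

In this toy problem the greedy policy switches out of state 0 and stays in state 1, which matches the intuition that state 1 is the rewarding one.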
III-B Batch Reinforcement Learning
When $\mathcal{S}$ is not finite, conventional Q-learning based approaches require discretization of $\mathcal{S}$ (see, e.g., [12] and [13]). The discretized state space will better approximate the original state space if a small step size is used in the discretization process, yet the resulting MDP will face the “curse of dimensionality.” A large step size can alleviate the computational burden caused by the high dimensionality of the state space, but at the cost of potentially degrading performance significantly.
More practically, when the number of elements in $\mathcal{S}$ is large or $\mathcal{S}$ is not finite, the action-value function can be approximated by some parametric functions such as linear functions [10] and neural networks [14]. Let $\hat{Q}$ denote the approximate optimal action-value function. Using a linear function approximation, $\hat{Q}$ can be represented as follows:
$$\hat{Q}(s, a) = \phi(s, a)^{\top} w, \tag{9}$$
where $\phi: \mathcal{S} \times \mathcal{A} \to \mathbb{R}^{d}$ is a feature mapping for $(s, a)$, which is also referred to as the basis function, and $w \in \mathbb{R}^{d}$ is the parameter vector.
A class of stable and data-efficient RL algorithms that can solve an MDP with function approximations are the batch RL algorithms (“batch” in the sense that a set of transition samples is utilized each time), such as the LSPI algorithm [10], which is considered to be the most efficient one in this class. We next explain the fundamental idea behind the LSPI algorithm. Let $\mathcal{D}$ denote a set (batch) of transition samples $(s, a, r, s')$ obtained via observation or simulation. The LSPI algorithm finds the best $w$ that fits the transition samples in $\mathcal{D}$ in an iterative manner. One way to explain the intuition behind the LSPI algorithm is as follows (the readers are referred to [10] for a more rigorous development). Define
$$f(w) = \frac{1}{|\mathcal{D}|} \sum_{(s, a, r, s') \in \mathcal{D}} \left( Q(s, a) - \phi(s, a)^{\top} w \right)^{2}. \tag{10}$$
Let $w_i$ denote the value of $w$ that is available at the beginning of iteration $i$. At iteration $i$, the algorithm finds $w_{i+1}$ by solving the following problem:
$$w_{i+1} = \arg\min_{w} f(w), \tag{11}$$
which is an unconstrained optimization problem. The solution of (11) can be computed by setting the gradient of $f$ to zero as follows:
$$\sum_{(s, a, r, s') \in \mathcal{D}} \phi(s, a) \left( Q(s, a) - \phi(s, a)^{\top} w \right) = 0. \tag{12}$$
Note that the true value of $Q(s, a)$ is not known and is substituted by the so-called temporal-difference (TD) target, $r + \gamma \phi(s', a')^{\top} w$, where $a' = \arg\max_{a \in \mathcal{A}} \phi(s', a)^{\top} w_i$ is the optimal action in state $s'$ determined based on $w_i$. Note that the TD target is a sample of the right-hand side (RHS) of (8), which serves as an estimate for that RHS. We emphasize that despite being substituted by the TD target, the true optimal action-value function is not a function of $w$; therefore, the gradient of $f$ with respect to $w$ is taken before $Q(s, a)$ is approximated by the TD target, which does depend on $w$. Then, after replacing $Q(s, a)$ with the TD target, (12) has the following closed-form solution:
$$w_{i+1} = \left( \sum_{(s, a, r, s') \in \mathcal{D}} \phi(s, a) \left( \phi(s, a) - \gamma \phi(s', a') \right)^{\top} \right)^{-1} \sum_{(s, a, r, s') \in \mathcal{D}} \phi(s, a)\, r. \tag{13}$$
Intuitively, at each iteration, the LSPI algorithm finds the $w$ that minimizes the mean squared error between the TD target and $\hat{Q}$ over all transition samples in $\mathcal{D}$. This process is repeated until the change of $w$, defined as $\| w_{i+1} - w_i \|$, where $\| \cdot \|$ denotes the norm, becomes smaller than a threshold $\varepsilon$, upon which the algorithm is considered to have converged.
The LSPI algorithm has the following three nice properties. First, linear functions are used to approximate the optimal action-value function, which allows the algorithm to handle MDPs with high-dimensional or continuous state spaces. Second, at each iteration, a batch of transition samples is used to update the vector parameterizing the approximate action-value function, and these samples are reused at each iteration, thus increasing data efficiency. Third, the optimal parameter vector is found by solving a least-squares problem, resulting in a stable algorithm. We refer interested readers to [10] for more details on the convergence and performance guarantees of the LSPI algorithm.
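The iteration described above can be sketched in a few lines. The code below follows the standard least-squares update from [10] (accumulate a matrix and a vector over the batch, solve, re-evaluate the greedy next actions, repeat) on a hypothetical two-state MDP with one-hot features. The toy problem, the function signature, and the default values of `delta` (a small preconditioning constant) and `eps` are assumptions chosen for illustration, not the settings used in this paper.

```python
import numpy as np

def lspi(samples, phi, n_actions, dim, gamma=0.9, delta=1e-6, eps=1e-8, max_iter=50):
    """Least-squares policy iteration on a fixed batch of (s, a, r, s_next) samples."""
    w = np.zeros(dim)
    for _ in range(max_iter):
        A = delta * np.eye(dim)                    # small preconditioner keeps A invertible
        b = np.zeros(dim)
        for s, a, r, s_next in samples:
            # Greedy next action under the current weights (ties pick the first action).
            a_next = max(range(n_actions), key=lambda u: phi(s_next, u) @ w)
            f = phi(s, a)
            A += np.outer(f, f - gamma * phi(s_next, a_next))
            b += f * r
        w_new = np.linalg.solve(A, b)
        if np.linalg.norm(w_new - w) < eps:        # stop when the weights settle
            return w_new
        w = w_new
    return w

def one_hot(s, a, n_actions=2, dim=4):
    """Tabular features: one entry per (state, action) pair."""
    f = np.zeros(dim)
    f[s * n_actions + a] = 1.0
    return f

# Hypothetical batch: action 0 stays, action 1 switches; reward 1 for landing in state 1.
batch = [(0, 0, 0.0, 0), (0, 1, 1.0, 1), (1, 0, 1.0, 1), (1, 1, 0.0, 0)]
w_hat = lspi(batch, one_hot, n_actions=2, dim=4)
print(w_hat.reshape(2, 2))
```

The same batch is reused at every iteration; only the greedy next actions change as the weights are updated, which is what makes the method data-efficient.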
IV Optimal Tap Setting Problem as an MDP
In this section, we formulate the optimal tap setting problem as an MDP as follows:
IV-1 State space
Define the squared voltage magnitudes at all buses but bus and the tap ratios as the state, i.e., , which has both continuous and discrete variables. Then, the state space is .
IV-2 Action space
The actions are the LTC tap ratio changes, i.e., , and the action space is the set of all feasible values of LTC tap ratios, i.e., . In the optimal tap setting problem, the action is discrete. The size of the action space increases exponentially with the number of LTCs.
IV-3 Reward function
The objective of voltage regulation is to minimize the voltage deviation as measured by the norm. As such, when the system transitions from state into state after taking action , the reward is computed by the following function:
(14) 
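A reward of this kind can be sketched as the negative norm of the deviation of the voltage magnitudes from the reference. The function name, the choice of the Euclidean norm as the default, and the absence of any normalization are assumptions made here for illustration; the state is taken to hold squared voltage magnitudes, as in the state definition above.

```python
import numpy as np

def voltage_reward(v_sq_next, v_ref, norm_order=2):
    """Negative deviation of voltage magnitudes from a reference (assumed form)."""
    v_mag = np.sqrt(np.asarray(v_sq_next, dtype=float))   # squared magnitudes -> magnitudes
    return -float(np.linalg.norm(v_mag - v_ref, ord=norm_order))

perfect = voltage_reward([1.0, 1.0], v_ref=1.0)       # no deviation
sagging = voltage_reward([1.0, 0.9025], v_ref=1.0)    # second bus sits at 0.95 p.u.
print(perfect, sagging)
```

The reward is never positive, and it approaches zero as the voltage profile approaches the reference.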
IV-4 Transition model
To derive the transition model , note that it follows from (4) that
(15) 
where , and and are the active and reactive power injections that result in , respectively. Then, the transition model can be derived from the probability density function (pdf) of , which can be further computed from the pdf of . However, under Assumptions A1 and A2, the line parameters as well as the probability distributions of active and reactive power injections are unknown; thus, the transition model is not known a priori. Therefore, we need to resort to RL algorithms that do not require an explicit transition model to solve the MDP.
V Optimal Tap Setting Algorithm
In this section, we propose an optimal tap setting algorithm, which consists of a transition generating algorithm that can generate samples of transitions in , and an LSPI-based sequential learning algorithm to solve the MDP. Implementation details such as the feature selection are also discussed.
V-A Overview
The overall structure of the optimal tap setting framework is illustrated in Fig. 2. The framework consists of an environment that is the power distribution system, a learning agent that learns the action-value function from a set of transition samples, and an acting agent that determines the optimal action from the action-value function. Define the history to be the sequence of states, actions, and rewards, and denote it by , i.e., . Specifically, the learning agent will use the elements in the set together with a virtual transition generator to generate a set of transition samples according to some exploratory behavior defined in the exploratory actor. The set of transition samples in is then used by the action-value function estimator (also referred to as the critic) to fit an approximate action-value function using the LSPI algorithm described earlier. The acting agent, which has a copy of the up-to-date approximate action-value function from the learning agent, finds a greedy action for the current state and instructs the LTCs to follow it.
Note that the learning of the action-value function can be done offline by the learning agent, which is capable of exploring various system conditions through the virtual transition generator based on the history , yet without directly interacting with the power distribution system. This avoids jeopardizing system operational reliability, which is a major concern when applying RL algorithms to power system applications [15].
V-B Virtual Transition Generator
The LSPI algorithm (as well as all other RL algorithms) requires adequate transition samples that spread over the state and action spaces. However, this is challenging in power systems since the system operational reliability might be jeopardized when exploring randomly. One way to work around this issue is to use simulation models, rather than the physical system, to generate virtual transitions. To this end, we develop a data-driven virtual transition generator that simulates transitions without any knowledge of the active and reactive power injections (neither measurements nor probability distributions) or the line parameters.
The fundamental idea is the following. For a transition sample that is obtained from , the virtual transition generator generates a new transition sample , where is determined from according to some exploration policy (to be defined later) that aims to explore the state and action spaces. Replacing in the first transition sample with , the voltage magnitudes will change accordingly. Assume the same transition of the power injections in these two samples, then the RHS of (4) does not change. Thus, can be readily computed from by solving the following set of linear equations:
(16) 
Since the only unknown in (16) is and is invertible, we can solve for as follows:
(17) 
For ease of notation, we simply write (17) as
(18) 
This nice property allows us to estimate the new values of voltage magnitudes when the tap positions change without knowing the exact values of power injections and line parameters. The virtual transition generating procedure is summarized in Algorithm 1.
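The idea behind the virtual transition generator can be illustrated numerically. In the sketch below, the injection-dependent side of the linear model is recovered from voltages observed under the old taps and then reused to solve for the voltages under new taps. The feeder, the tap model (the squared voltage on the receiving side of an LTC line is divided by the squared tap ratio), and all names are assumptions for illustration; this mirrors the reasoning that leads to (16) and (17) rather than reproducing their exact form.

```python
import numpy as np

LINES = [(0, 1), (1, 2), (1, 3)]        # hypothetical radial feeder, bus 0 = substation

def incidence(taps):
    """Tap-dependent incidence matrix; `taps` maps a line index to its tap ratio."""
    n_bus = len(LINES) + 1
    M_full = np.zeros((n_bus, len(LINES)))
    for l, (i, j) in enumerate(LINES):
        t = taps.get(l, 1.0)            # lines without an LTC behave like t = 1
        M_full[i, l] = 1.0
        M_full[j, l] = -1.0 / t**2      # ASSUMED tap model: v_i - v_j / t^2 = line drop
    return M_full[0], M_full[1:]

def virtual_voltage(v_sq, v0_sq, taps_old, taps_new):
    """Estimate squared voltages under new taps from voltages seen under old taps."""
    m0_old, M_old = incidence(taps_old)
    m0_new, M_new = incidence(taps_new)
    # Injection-dependent term, recovered from measurements; it is assumed not to
    # change when only the taps change.
    drop = m0_old * v0_sq + M_old.T @ v_sq
    return np.linalg.solve(M_new.T, drop - m0_new * v0_sq)

v_obs = np.array([0.964, 0.946, 0.953])               # observed with the LTC at ratio 1.0
v_est = virtual_voltage(v_obs, 1.0, {0: 1.0}, {0: 1.05})
v_back = virtual_voltage(v_est, 1.0, {0: 1.05}, {0: 1.0})
print(v_est, v_back)
```

No line parameters or power injections appear anywhere in the computation: only the topology, the taps, and the measured voltages are needed, which is what makes offline exploration possible.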
V-C LSPI-based Sequential Action-Value Function Learning
Given the transition sample set , we can now develop a learning algorithm for based on the LSPI algorithm. While the LSPI algorithm is very efficient when the action space is relatively small, it becomes computationally intractable when the action space is large, since the number of unknown parameters in the approximate action-value function is typically proportional to , which increases exponentially with the number of LTCs. To overcome the “curse of dimensionality” that results from the size of the action space, we propose an LSPI-based sequential learning algorithm to learn the action-value function.
The key idea is the following. Instead of learning an approximate optimal action-value function for the action vector , we learn a separate approximate action-value function for each component of . To be more specific, for each LTC , , we learn an approximate optimal action-value function , where is the component of , is a feature mapping from to . During the learning process of , the rest of the LTCs are assumed to behave greedily according to their own approximate optimal action-value function. To achieve this, we design the following exploration policy to generate the virtual transition samples used when learning for LTC . In the exploration step in Algorithm 1, the tap ratio change of LTC is selected uniformly in (uniform exploration), while those of the others are selected greedily with respect to the up-to-date (greedy exploration). Then, the LSPI algorithm detailed in Algorithm 2, where is a small positive precondition number and is the initial value for the parameter vector, is applied to learn . This procedure is repeated in a round-robin fashion for all LTCs for iterations, in each of which is set to the up-to-date learned in the previous iteration or chosen if it is in the first iteration. The value of is set to if there is only one LTC and is increased slightly when there are more LTCs. Note that a new set of transitions is generated when learning for different LTCs at each iteration. Using this sequential learning algorithm, the total number of unknowns is then proportional to , which is far fewer than , as in the case where the approximate optimal action-value function for the entire action vector, , is learned.
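The saving in the number of unknowns can be checked with simple arithmetic. The sketch below assumes that every LTC has the same number of candidate tap changes and that each candidate contributes a fixed number of parameters; the specific numbers and function names are hypothetical.

```python
def joint_parameter_count(n_tap_changes, n_ltcs, params_per_action):
    """Unknowns when one action-value function covers the whole action vector."""
    return params_per_action * n_tap_changes ** n_ltcs

def sequential_parameter_count(n_tap_changes, n_ltcs, params_per_action):
    """Unknowns when a separate action-value function is learned per LTC."""
    return params_per_action * n_tap_changes * n_ltcs

# Hypothetical sizes: 5 candidate tap changes, 4 LTCs, 10 parameters per candidate.
joint = joint_parameter_count(5, 4, 10)
sequential = sequential_parameter_count(5, 4, 10)
print(joint, sequential)
```

The joint count grows exponentially with the number of LTCs while the sequential count grows linearly, which is the motivation for learning one function per LTC.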
A critical step in implementing the LSPI algorithm is constructing features from the state-action pair for LTC ; we use radial basis functions (RBFs) to this end. The feature vector for a state-action pair , i.e., , is a vector in , where and is a positive integer. has segments, each one of length corresponding to a tap change in , i.e., , where . Specifically, for and being the tap change in , for , and , where , with being obtained by replacing the entry in with , and , are prespecified constant vectors in referred to as the RBF centers. The action only determines which segment will be nonzero. Thus, is indeed the squared voltage magnitudes under the same power injections if the tap of LTC is at position . Each RBF computes the distance between and some prespecified squared voltage magnitudes.
V-D Tap Setting Algorithm
The tap setting algorithm, the timeline of which is illustrated in Fig. 3, works as follows. At time instant , a new state as well as the reward following the action , , is observed. Let denote the time elapsed between two time instants. Every time instants, i.e., every units of time, , is updated by the learning agent by executing the LSPI-based sequential learning algorithm described in Section V-C. The acting agent then finds a greedy action for the current state and sends it to the LTCs. In order to reduce the wear and tear on the LTCs, the greedy action for the current state is chosen only if the difference between the action-value resulting from the greedy action, i.e., , and that resulting from the previous action, i.e., , is larger than a threshold . Otherwise, the tap positions do not change. The above procedure is summarized in Algorithm 3.
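The thresholded action selection used by the acting agent can be sketched as follows. The function name, the representation of the approximate action-values as an array indexed by candidate action, and the numbers in the example are assumptions for illustration.

```python
import numpy as np

def select_action(q_values, prev_action, threshold):
    """Switch to the greedy action only if it beats the previous one by a margin."""
    greedy = int(np.argmax(q_values))
    if q_values[greedy] - q_values[prev_action] > threshold:
        return greedy                    # improvement is large enough to justify a tap move
    return prev_action                   # otherwise keep the taps where they are

q_now = np.array([1.00, 1.05, 0.20])     # hypothetical action-values for three candidates
kept = select_action(q_now, prev_action=0, threshold=0.10)
moved = select_action(q_now, prev_action=0, threshold=0.01)
print(kept, moved)
```

A larger threshold trades a small loss in expected reward for fewer tap operations, and hence less wear on the LTCs.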
VI Numerical Simulation
In this section, we apply the proposed methodology to the IEEE 13-bus and 123-bus test feeders from [16].
VI-A Simulation Setup
The power injections for both test feeders are constructed based on historical hourly active power load data from a residential building in San Diego over one year [17]. Specifically, the historical hourly active power load data are first scaled up so that the maximum system total active power load over that year for the IEEE 13-bus and 123-bus distribution test feeders is MW and MW, respectively. These numbers are chosen so that the resulting voltage magnitudes fall outside of the desired range at some time instants. Then, the time granularity of the scaled system total active power load is increased to minutes through a linear interpolation. Each value in the resulting five-minute system total active power load time series is further multiplied by a normally distributed variable, the mean and standard deviation of which are and , respectively. The active power load profile at each bus is constructed by pseudo-randomly redistributing the system total active power load among all load buses. Each load bus is assumed to have a constant power factor of . While only load variation is considered in the simulation, the proposed methodology can be directly applied to the case with renewable-based resources, which can be modeled as negative loads.
We first verify the accuracy of the virtual transition generating algorithm. Specifically, assume the voltage magnitudes are known for some unknown power injections under a known tap ratio of . Then, when the tap ratio changes, we compute the true voltage magnitudes under the new tap ratio, denoted by , by solving the full ac power flow problem, and the estimated voltage magnitudes under the new tap ratio, denoted by , via (18). Simulation results indicate that the maximum absolute difference between the true and the estimated voltage magnitude, i.e., , is smaller than p.u., which is accurate enough for the application of voltage regulation addressed in this paper.
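The construction of the system total load profile described at the start of this subsection (scaling, linear interpolation to five-minute resolution, and multiplicative noise) can be sketched as follows. The function name, the peak value, and the noise mean and standard deviation are hypothetical placeholders, not the values used in the simulations.

```python
import numpy as np

def build_load_profile(hourly_load, peak_mw, step_min=5,
                       noise_mean=1.0, noise_std=0.02, seed=0):
    """Scale an hourly load series, refine it to `step_min` minutes, and add noise."""
    rng = np.random.default_rng(seed)
    hourly = np.asarray(hourly_load, dtype=float)
    scaled = hourly * peak_mw / hourly.max()            # peak of the series becomes peak_mw
    t_hourly = np.arange(len(scaled)) * 60              # sample times in minutes
    t_fine = np.arange(0, t_hourly[-1] + 1, step_min)   # finer time grid
    fine = np.interp(t_fine, t_hourly, scaled)          # linear interpolation
    noise = rng.normal(noise_mean, noise_std, size=fine.shape)
    return fine * noise                                  # multiplicative perturbation

# With zero noise the result is just the scaled, interpolated series.
clean = build_load_profile([1.0, 2.0, 4.0], peak_mw=8.0, noise_std=0.0)
noisy = build_load_profile([1.0, 2.0, 4.0], peak_mw=8.0)
print(len(clean), clean[6], clean.max())
```

Redistributing the total among load buses and applying a power factor would follow as separate steps and are omitted here.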
VI-B Case Study on the IEEE 13-bus Test Feeder
Assume , where is an all-ones vector in . In the simulation, RBF centers are used, i.e., . Specifically, , . The duration between two time instants is min. The policy is updated every hours, i.e., . In each update, actual transition samples are chosen from the history over the same time interval in the previous days, which are part of , and new actions are chosen according to the exploration policy described in Section V-C. A total of virtual transitions are generated using Algorithm 1. Since this test feeder only has one LTC, there is no need to sequentially update the approximate action-value function, so we set . Other parameters are chosen as follows: , , , , and .
Assuming complete and perfect knowledge of the system parameters as well as active and reactive power injections for all time instants, we can find the optimal tap position that results in the highest reward by exhaustively searching the action space, i.e., all feasible tap ratios, at each time instant. It is important to point out that the exhaustive search approach is infeasible in practice, both because the necessary information is not available and because of its high computational burden. Results obtained by the exhaustive search approach and the conventional tap setting scheme (see, e.g., [1]), in which the taps are adjusted only when the voltage magnitudes exceed a desired range, e.g., p.u., are used to benchmark the proposed methodology.
Figure 4 shows the tap positions (top panel) and the rewards (bottom panel) under different approaches. The rewards resulting from the batch RL approach and the exhaustive search approach are very close. The daily mean reward, i.e., , where is the reward at time instant as defined in (14), obtained by the batch RL approach and the exhaustive search approach is and , respectively, while that under the conventional scheme is . The tap positions under the batch RL approach and the exhaustive search approach are aligned for most of the day. Note that the tap position under the conventional scheme remains at since the voltage magnitudes are within p.u., and is not plotted. Figure 5 shows the voltage magnitude profiles under the different tap setting algorithms. The voltage magnitude profiles under the proposed batch RL approach (see Fig. 5, center panel) are quite similar to those obtained via the exhaustive search approach (see Fig. 5, bottom panel); both result in a higher daily mean reward than the conventional scheme (see Fig. 5, top panel). We also point out that Algorithm 2 typically converges within iterations in less than seconds, and the batch RL approach is faster than the exhaustive search approach by several orders of magnitude.
VI-C Case Study on the IEEE 123-bus Test Feeder
We next test the proposed methodology on the IEEE 123-bus test feeder. In the results for the IEEE 13-bus test feeder reported earlier, while the LTC has tap positions, only a small portion of them is actually used. This motivates us to reduce the action space by restricting the taps to a smaller range. Specifically, we can estimate the voltage magnitudes under various power injections and LTC tap positions using (18). After ruling out tap positions under which the voltage magnitudes will exceed the desired range, we eventually allow positions, from to , for two LTCs, and positions, from to , for the other two LTCs. Here, RBF centers are used. Specifically, for all LTCs except for the one near the substation, for which , . A total of virtual transitions are generated in a similar manner as in the IEEE 13-bus test feeder case. The number of iterations in the LSPI-based sequential learning algorithm is set to . Other parameters are the same as in the IEEE 13-bus test feeder case.
Figure 6 shows the rewards under the batch RL approach and the exhaustive search. The daily mean reward obtained by the batch RL approach and the exhaustive search approach is and , respectively, while that under the conventional scheme is . Due to space limitations, other simulation results such as voltage profiles are not presented.
VII Concluding Remarks
In this paper, we formulate the optimal tap setting problem of LTCs in power distribution systems as an MDP and propose a batch RL algorithm to solve it. To obtain adequate state-action samples, we develop a virtual transition generator that estimates the voltage magnitudes under different tap settings. To circumvent the “curse of dimensionality,” we propose an LSPI-based sequential learning algorithm to learn an action-value function for each LTC, based on which the optimal tap positions can be determined directly. The proposed algorithm can find the policy that determines the optimal tap positions that minimize the voltage deviation across the system, based only on voltage magnitude measurements and network topology information, which makes it more desirable for implementation in practice. Numerical simulations on the IEEE 13-bus and 123-bus test feeders validate the effectiveness of the proposed methodology.
References
[1] P. Kundur, N. J. Balu, and M. G. Lauby, Power system stability and control. McGraw-Hill New York, 1994, vol. 7.
[2] B. A. Robbins, H. Zhu, and A. D. Domínguez-García, “Optimal tap setting of voltage regulation transformers in unbalanced distribution systems,” IEEE Trans. Power Syst., vol. 31, no. 1, pp. 256–267, Jan. 2016.
[3] W. H. E. Liu, A. D. Papalexopoulos, and W. F. Tinney, “Discrete shunt controls in a Newton optimal power flow,” IEEE Trans. Power Syst., vol. 7, no. 4, pp. 1509–1518, Nov. 1992.
[4] M. R. Salem, L. A. Talat, and H. M. Soliman, “Voltage control by tap-changing transformers for a radial distribution network,” IEE Proceedings - Generation, Transmission and Distribution, vol. 144, no. 6, pp. 517–520, Nov. 1997.
[5] H. Zhu and H. J. Liu, “Fast local voltage control under limited reactive power: Optimality and stability analysis,” IEEE Trans. Power Syst., vol. 31, no. 5, pp. 3794–3803, Sept. 2016.
[6] B. A. Robbins and A. D. Domínguez-García, “Optimal reactive power dispatch for voltage regulation in unbalanced distribution systems,” IEEE Trans. Power Syst., vol. 31, no. 4, pp. 2903–2913, July 2016.
[7] H. Xu, A. D. Domínguez-García, and P. W. Sauer, “A data-driven voltage control framework for power distribution systems,” in Proc. of IEEE PES General Meeting, Portland, OR, Aug. 2018, pp. 1–5.
[8] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT Press, 2018.
[9] M. E. Baran and F. F. Wu, “Network reconfiguration in distribution systems for loss reduction and load balancing,” IEEE Trans. Power Del., vol. 4, no. 2, pp. 1401–1407, Apr. 1989.
[10] M. G. Lagoudakis and R. Parr, “Least-squares policy iteration,” Journal of Machine Learning Research, vol. 4, no. Dec, pp. 1107–1149, 2003.
[11] C. J. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[12] J. G. Vlachogiannis and N. D. Hatziargyriou, “Reinforcement learning for reactive power control,” IEEE Trans. Power Syst., vol. 19, no. 3, pp. 1317–1325, 2004.
[13] Y. Xu, W. Zhang, W. Liu, and F. Ferrese, “Multiagent-based reinforcement learning for optimal reactive power dispatch,” IEEE Trans. Syst., Man, Cybern., Syst., Part C (Applications and Reviews), vol. 42, no. 6, pp. 1742–1751, 2012.
[14] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
[15] M. Glavic, R. Fonteneau, and D. Ernst, “Reinforcement learning for electric power system decision and control: Past considerations and perspectives,” IFAC-PapersOnLine, vol. 50, no. 1, pp. 6918–6927, 2017.
[16] IEEE distribution test feeders. [Online]. Available: https://ewh.ieee.org/soc/pes/dsacom/testfeeders/
[17] Commercial and residential hourly load profiles for all TMY3 locations in the United States. [Online]. Available: https://openei.org/doeopendata/dataset