## I Introduction

Distribution grids are undergoing a rapid transition due to the massive deployment of distributed energy resources (DERs), e.g., PV arrays, electric vehicles, and energy storage units. The main factors fueling this expansion include significant decreases in the capital costs of DER technologies and incentives for DER installations offered by local electric power utilities, as well as by local and state authorities. For example, the state of California aims to reduce greenhouse gas (GHG) emissions to 40% below its 1990 levels by 2030 by increasing the share of electricity produced by renewable generation to 50%, doubling energy efficiency targets, and encouraging widespread transportation electrification [california_target]. Similarly, the state of NY set a target of a zero-carbon power sector by 2040, along with the goal of reducing GHG emissions to 85% below 1990 levels by 2050 [NY_target]. On the other hand, the presence of DERs in distribution grids also imposes additional operational challenges, e.g., bidirectional power flows, voltage fluctuations, and, as a result, additional wear-and-tear on electric power equipment. Dealing with such challenges is crucial to ensure economic and reliable distribution grid operations and necessitates more flexibility. Demand Response (DR) is one way to provide this additional flexibility: it enrolls controllable loads in residential and commercial buildings to provide a broad range of distribution-level ancillary services (e.g., energy arbitrage, peak shaving, balancing regulation, congestion relief, capacity deferral, and voltage support [aggregated_tcl_callaway]). Our effort to explore this source of flexibility is motivated by the recent statistics that the U.S. building sector claims about 40% of the total electricity consumption [building_DOE] and still remains, to a large extent, unleveraged for distribution grid operations.
The primary obstacle is the current inability to accurately aggregate and synchronously operate a large ensemble of such small-scale loads, while taking into account their inherent techno- and socio-economic characteristics (e.g., dispatch limits, complex thermodynamics of building environments, and/or comfort preferences of building occupants). Therefore, to address these challenges, this paper focuses on mathematical modeling of an ensemble of thermostatically controlled loads (TCLs), such as heat pumps, air conditioners, and heating and ventilation systems, for its accurate representation in the energy management (dispatch) tools used by DR aggregators or local electric power utilities [aggregated_tcl_callaway, aggregated_tcl_lu].

The primary challenge in modeling TCL ensembles is to simultaneously achieve a high level of accuracy and maintain computational tractability. Currently, there are two large groups of methods to model and forecast the electricity consumption of TCL ensembles: (i) physics-based co-simulation of TCLs and building dynamics (e.g., using heat transport models, electromechanical considerations, Kirchhoff's laws, evaporation, etc.) and (ii) data-driven methods (e.g., statistical analyses and inference). The advantage of physics-based models is their ability to describe buildings without prior observations. However, the performance of these models is highly sensitive to the number and accuracy of the underlying modeling choices and assumptions, as well as to input parameters. Physics-based models often require more inputs than existing data acquisition systems can provide [Coaklet_data_review], and therefore incur significant uncertainties in both model parameters and dynamic processes. Using such models for controlling an ensemble of TCLs may lead to computational issues that would prevent their scalability and implementation for real-life decision-making. On the other hand, in lieu of physics-based models, one can use machine learning and statistical modeling to perform data-driven studies of TCL and building dynamics using the vast amount of historical data available for buildings equipped with smart meters. These reduced-order models are trained using historical energy consumption data and other parameters (e.g., weather conditions, daily operational schedules, and control functionality) [Koch_DR, CHASSIN_DR]. This paper develops a data-driven model to accurately represent a TCL ensemble using historical data and to continuously improve the accuracy of model performance via learning.

Among data-driven methods, TCL ensembles have been modelled as virtual storage units with linear dynamics [callawaystorage, mathieustorage, bowenstorage], or as a Markov Decision Process (MDP) with probabilistic transitions [MDP_Mathieu, Meyn_MDP, Chertkov_MDP_chap, MDP_buildings, MDP_Network, MDP_Network_CC]. The MDP framework is particularly suitable for modeling large TCL ensembles without sacrificing modeling accuracy or computational tractability: it produces high-quality solutions by means of dynamic programming, which is both analytically and computationally tractable. The models in [MDP_Mathieu, Meyn_MDP, Chertkov_MDP_chap, MDP_buildings, MDP_Network, MDP_Network_CC] describe a TCL ensemble as a discrete-time, discrete-space Markov process characterized by a given transition probability matrix with deterministic coefficients. However, in practice, it is hardly possible to estimate these coefficients accurately due to the imperfection or incompleteness of historical measurements and the behavioral uncertainty of consumers. Therefore, the common caveat of the current MDP models in [MDP_Mathieu, Meyn_MDP, Chertkov_MDP_chap, MDP_buildings, MDP_Network, MDP_Network_CC] is that they ignore uncertainty on model parameters (e.g., transition probabilities). Since the inaccuracies stemming from the inability to compute model parameters in the MDP framework can be significant and can eliminate the benefits of using these resources for DR flexibility, this paper enhances the MDP framework with model-free reinforcement learning (RL), where the DR aggregator (alternatively, TCL ensembles can be aggregated and operated by utilities) interacts with the TCL ensemble and learns model parameters from both historical and streaming data (see Fig. 1).

The main advantage of model-free RL in the context of dispatching TCLs is its ability to eliminate the need for knowing precise model parameters (e.g., parameters of the transition probability distribution underlying the MDP), because the optimal control policy can be learned from "experience". In the context of real-life DR applications, this "experience" can be obtained via indirect (passive) observations of the TCL ensemble or, in some cases, even of individual TCLs, by means of advanced metering infrastructure or data crowdsourcing [8233189]. Although there are a number of model-free RL techniques that can be used under the MDP framework, we exploit the property of TCL ensembles that allows for reducing a conventional MDP to a linearly-solvable MDP (LS-MDP). This reduction assumes that devices in the TCL ensemble are relatively heterogeneous and, therefore, explicit control actions on each TCL device (e.g., on/off decisions or power consumption) can be replaced by a distribution of potential future states of the TCL ensemble [Todorov, optimal_actions_Todorov, LS_optimal_control]. Thus, the optimal policy derived from the LS-MDP is not a mapping of states to action variables, as in a conventional MDP, but a mapping of a current state into a next-state distribution, which minimizes the expected next-state costs and the divergence cost between the default (i.e., without external control applied) and controlled (i.e., with external control applied) probability distributions [Hierarchical_LSMDP, LS_optimal_control]. The reduced LS-MDP problem is suitable for the Z-learning method, a modification of the common Q-learning method. In turn, Z-learning is capable of producing an accurate approximation of the original MDP at a faster convergence rate than Q-learning [Todorov, optimal_actions_Todorov, Hierarchical_LSMDP, LS_optimal_control], mainly because Z-learning does not require the state-action values needed in Q-learning.

This paper uses an LS-MDP to model a TCL ensemble and leverages the Z-learning method to find the optimal TCL dispatch policy. The Z-learning method samples transitions passively from the default (i.e., without external control) behavior of the system, but is able to learn the optimal policy by leveraging the specific structure of the LS-MDP. Note that the available state transitions may not accurately reflect the underlying true distribution due to the limited availability of data. Hence, we show that the Z-learning algorithm is robust to noise in the observed transitions and analyze its convergence in cases with and without noise. The case study is carried out on aggregated heating, ventilation, and air conditioning (HVAC) systems in a residential neighborhood with 100 homes, where data is sampled using the Net-Zero Energy Test Facility [Nist_data] operated by the National Institute of Standards and Technology (NIST).

The remainder of this paper is organized as follows. Section II presents an LS-MDP model for optimally dispatching a given TCL ensemble. Section III solves the LS-MDP model using dynamic programming and leverages the Z-learning approach to improve the solution accuracy. Section IV presents a case study using real-life data from the NIST Test Facility to demonstrate the usefulness of the proposed approach. Section V concludes the paper.

## II Formulation

Similarly to [Chertkov_MDP_chap, MDP_buildings, MDP_Network, MDP_Network_CC], the MDP framework is leveraged to build the model for the control of the TCL ensemble. We define an LS-MDP for modeling a given TCL ensemble as a 5-tuple $(\mathcal{T}, \mathcal{S}, U_s^t, \bar{p}_{ss'}^t, p_{ss'}^t)$, where $\mathcal{T} = \{1, \dots, T\}$ is the set of time intervals, which constitute a planning horizon, $\mathcal{S}$ is the set of possible states, $U_s^t$ is the utility of the aggregator in state $s$ at time $t$, and $\bar{p}_{ss'}^t$ and $p_{ss'}^t$ are the default (i.e., without control actions of the DR aggregator) and controlled (i.e., with control actions of the DR aggregator) transition probabilities from state $s$ to state $s'$. The states in set $\mathcal{S}$ are obtained by discretizing the range of power consumption of the TCL ensemble given the operating range of the TCL devices in the ensemble. For any given state $s$ at time $t$, the probability of the transition of the TCL ensemble to the next state $s'$ at time $t+1$ is characterized by $p_{ss'}^t$. Fig. 2 displays all possible transitions from the current state at time $t$ to all possible next states at time $t+1$. Note that the ensemble can remain in the same state at time $t+1$, such that $s' = s$. The default transition probabilities, represented by parameters $\bar{p}_{ss'}^t$, correspond to the internal dynamics of the TCL ensemble without actions of the aggregator and are typically estimated from historical data (see [MDP_buildings]). The TCL ensemble is then optimized as:

$$\min_{\pi,\, p}\; \sum_{t \in \mathcal{T}} \sum_{s \in \mathcal{S}} \pi_s^t \Big[ -U_s^t + \gamma \sum_{s' \in \mathcal{S}} p_{ss'}^t \ln \frac{p_{ss'}^t}{\bar{p}_{ss'}^t} \Big] \qquad (1a)$$

$$\pi_{s'}^{t+1} = \sum_{s \in \mathcal{S}} \pi_s^t\, p_{ss'}^t, \quad \forall s' \in \mathcal{S},\; \forall t \in \mathcal{T} \qquad (1b)$$

$$\sum_{s' \in \mathcal{S}} p_{ss'}^t = 1, \quad \forall s \in \mathcal{S},\; \forall t \in \mathcal{T} \qquad (1c)$$

where $\pi_s^t$ and $\pi_{s'}^{t+1}$ are decision variables, which characterize the probability that the TCL ensemble is operated in states $s$ and $s'$ at times $t$ and $t+1$, and are related via the transition probabilities $p_{ss'}^t$. Eq. (1a) represents the objective function of the DR aggregator that controls the TCL ensemble and aims to maximize its expected utility, or minimize its expected cost of energy ($-U_s^t$), and to minimize the discomfort cost for the TCL ensemble. The discomfort cost is computed using the Kullback-Leibler (KL) divergence, weighted by parameter $\gamma$. This divergence penalizes the difference between the transition decisions made by the DR aggregator ($p_{ss'}^t$) and the default transitions of the TCL ensemble ($\bar{p}_{ss'}^t$), under the assumption that the latter represent the first-choice preferences of TCL users. Parameter $\gamma$ scales the KL divergence and thus encourages or discourages deviations from the default behavior of the TCL ensemble. The choice of the KL divergence for the penalty cost is motivated by its extensive use for modeling randomness of discrete and continuous time series [KL_divg]. Eq. (1b) describes the temporal evolution of the TCL ensemble from time $t$ to $t+1$ over the time horizon $\mathcal{T}$. Eq. (1c) imposes the normalization constraint on the transition decisions optimized by the DR aggregator such that their total probability is equal to one.

After solving (1) as described later in Section III-A, the active power ($P^t$) consumed by the TCL ensemble can be computed using the optimized decisions $\pi_s^t$ and the rated active power $P_s$ at each state, i.e., $P^t = \sum_{s \in \mathcal{S}} \pi_s^t P_s$.
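As a minimal illustrative sketch (the function name, array shapes, and NumPy representation are our assumptions, not the paper's), the objective (1a), the state evolution (1b), and the ensemble power computation above can be evaluated for a given policy:

```python
import numpy as np

def evaluate_policy(pi0, p, p_bar, U, gamma, P_rated):
    """Evaluate the aggregator objective (1a), propagate the state
    distribution via (1b), and compute the expected ensemble power.

    pi0     : (S,)      initial state distribution
    p       : (T, S, S) controlled transition probabilities
    p_bar   : (T, S, S) default transition probabilities
    U       : (T+1, S)  aggregator utility per state and time
    gamma   : weight of the KL discomfort penalty
    P_rated : (S,)      rated power of each discrete state
    """
    T, S, _ = p.shape
    pi = pi0.copy()
    cost, power = 0.0, []
    for t in range(T):
        kl = np.sum(p[t] * np.log(p[t] / p_bar[t]), axis=1)  # row-wise KL term
        cost += pi @ (-U[t] + gamma * kl)                    # stage cost in (1a)
        power.append(float(pi @ P_rated))                    # P^t = sum_s pi_s P_s
        pi = pi @ p[t]                                       # evolution (1b)
    cost += pi @ (-U[T])                                     # terminal utility
    power.append(float(pi @ P_rated))
    return cost, power
```

For example, if the aggregator applies no control ($p = \bar{p}$), the KL term vanishes and the cost reduces to the expected (negative) utility.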

### II-A Relation to Other Methods

The LS-MDP in (1) can be related to the linear dynamical TCL models used in other data-driven methods [callawaystorage, mathieustorage]. Consider the following linear dynamics for the TCL ensemble [bowenstorage]:

$$x^{t+1} = \alpha x^t + P^t + u^t + \varepsilon^t \qquad (2)$$

where $x^t$ is the energy state of the TCL ensemble, $P^t$ is the normal power consumed, $u^t$ is the power change sought by control actions, and $\varepsilon^t$ is a disturbance. Let $\varepsilon^t \sim \mathcal{N}(0, \sigma^2)$ and consider the KL divergence between $\mathcal{N}(\alpha x^t + P^t, \sigma^2)$ (without control) and $\mathcal{N}(\alpha x^t + P^t + u^t, \sigma^2)$ (with control). Using [bishop2006pattern] leads to:

$$D_{KL} = \frac{(u^t)^2}{2\sigma^2} \qquad (3)$$

which is the quadratic cost for control used in linear systems, i.e., the quadratic discomfort cost for control in linear dynamics with Gaussian uncertainties is equivalent to the discrete-time KL cost in the LS-MDP. However, the discrete nature of LS-MDP transitions simplifies modeling, even for complex state transitions and non-Gaussian uncertainties.
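This equivalence is easy to verify numerically. The sketch below (our own illustration, not from the paper) compares the closed-form KL divergence between two equal-variance Gaussians with a Monte-Carlo estimate:

```python
import numpy as np

def kl_gaussian_same_var(u, sigma):
    """Closed-form KL( N(m, s^2) || N(m + u, s^2) ) = u^2 / (2 s^2)."""
    return u**2 / (2 * sigma**2)

def kl_numeric(u, sigma, n=400_000, seed=0):
    """Monte-Carlo estimate of the same divergence, E_p[log p - log q].
    Normalization constants cancel because both variances are equal."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, sigma, n)            # samples from p = N(0, s^2)
    log_p = -(x**2) / (2 * sigma**2)
    log_q = -((x - u)**2) / (2 * sigma**2)   # q = N(u, s^2)
    return float(np.mean(log_p - log_q))

u, sigma = 0.5, 1.3
assert abs(kl_numeric(u, sigma) - kl_gaussian_same_var(u, sigma)) < 1e-2
```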

## III Z-learning in LS-MDP

### III-A Solving LS-MDP

The optimization in Eq. (1) is a linearly solvable MDP (LS-MDP), as introduced by [Todorov]. The optimal policy for Eq. (1) is computed using techniques from dynamic programming [RL]. The Bellman equation (4) for the LS-MDP can be derived from the Bellman equation for the traditional MDP explained in Appendix A and reads:

$$v_s^t = q_s^t + \min_{p^t} \sum_{s' \in \mathcal{S}} p_{ss'}^t \Big[ \gamma \ln \frac{p_{ss'}^t}{\bar{p}_{ss'}^t} + v_{s'}^{t+1} \Big] \qquad (4)$$

where $v_s^t$ is the value function of the TCL ensemble at present state $s$ at time $t$, $v_{s'}^{t+1}$ is the value function at next state $s'$ at time $t+1$, and $q_s^t = -U_s^t$ is the state cost. By introducing the desirability function $z_s^t = \exp(-v_s^t / \gamma)$ in (4) we obtain:

$$v_s^t = q_s^t + \min_{p^t} \gamma \sum_{s' \in \mathcal{S}} p_{ss'}^t \ln \frac{p_{ss'}^t}{\bar{p}_{ss'}^t\, z_{s'}^{t+1}} \qquad (5)$$

After introducing a normalization term defined as $G_s^t = \sum_{s' \in \mathcal{S}} \bar{p}_{ss'}^t z_{s'}^{t+1}$, (5) can be recast as:

$$v_s^t = q_s^t - \gamma \ln G_s^t + \min_{p^t} \gamma\, D_{KL}\Big( p_{ss'}^t \,\Big\|\, \frac{\bar{p}_{ss'}^t z_{s'}^{t+1}}{G_s^t} \Big) \qquad (6)$$

The KL divergence provides the expectation of the log-difference between two distributions, such that $D_{KL}(p \,\|\, \bar{p}) = \mathbb{E}_p[\ln(p/\bar{p})]$. It is zero if and only if the two distributions are the same. Therefore, it follows from Eq. (6) that the optimal policy is achieved when the KL divergence term in Eq. (6) is minimal, i.e., equal to zero. Hence, by equating the two distributions in the KL divergence, the optimal policy follows as:

$$p_{ss'}^{t,*} = \frac{\bar{p}_{ss'}^t z_{s'}^{t+1}}{G_s^t} \qquad (7)$$

The optimal policy in Eq. (7) depends on the uncontrolled transition probability ($\bar{p}_{ss'}^t$) and the desirability function of the TCL ensemble at the next state ($z_{s'}^{t+1}$). The optimal policy reduces the Bellman equation in (6) to the following form:

$$v_s^t = q_s^t - \gamma \ln G_s^t \qquad (8)$$

$$\phantom{v_s^t}\, = q_s^t - \gamma \ln \sum_{s' \in \mathcal{S}} \bar{p}_{ss'}^t z_{s'}^{t+1} \qquad (9)$$

Exponentiating Eq. (9) converts the Bellman equation to the following reduced form:

$$z_s^t = \exp(-q_s^t / \gamma) \sum_{s' \in \mathcal{S}} \bar{p}_{ss'}^t z_{s'}^{t+1} \qquad (10)$$
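The backward recursion implied by (10), together with the optimal policy (7), can be sketched in a few lines of code. This is an illustrative implementation under assumed notation and array shapes, not the paper's Algorithm 1 verbatim:

```python
import numpy as np

def solve_ls_mdp(q, p_bar, gamma):
    """Backward recursion for the linear Bellman equation (10) and the
    optimal policy (7).

    q     : (T+1, S)  state costs q_s^t (= -U_s^t)
    p_bar : (T, S, S) default transition matrices
    gamma : KL penalty weight
    Returns desirability z (T+1, S) and optimal policy p (T, S, S)."""
    Tp1, S = q.shape
    T = Tp1 - 1
    z = np.zeros((Tp1, S))
    p = np.zeros((T, S, S))
    z[T] = np.exp(-q[T] / gamma)                  # terminal desirability
    for t in range(T - 1, -1, -1):
        G = p_bar[t] @ z[t + 1]                   # normalization term G_s^t
        z[t] = np.exp(-q[t] / gamma) * G          # Eq. (10)
        p[t] = p_bar[t] * z[t + 1] / G[:, None]   # Eq. (7), rows sum to one
    return z, p
```

As a sanity check, with zero state costs the optimal policy coincides with the passive dynamics, since there is then no incentive to deviate from the default behavior.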

Given the Bellman equation in (10) and the optimal policy in (7), the LS-MDP is solved as described in Algorithm 1. Eq. (10) is linear and thus can be represented in matrix form as $z^t = Q^t \bar{P}^t z^{t+1}$, where $z^t$ is a vector with elements $z_s^t$, $\bar{P}^t$ is a matrix with entries $\bar{p}_{ss'}^t$, and $Q^t$ is a diagonal matrix with elements $\exp(-q_s^t / \gamma)$ along its main diagonal [optimal_actions_Todorov, Todorov, LS_optimal_control, Hierarchical_LSMDP, Busic_TAC, Busic_CDC].

### III-B Z-learning

Although the LS-MDP solves the optimization problem for the TCL ensemble efficiently, it requires knowledge of the model of the environment. Since the model is estimated from historical data (e.g., the values of the default transitions $\bar{p}_{ss'}^t$), which is limited and imperfect, it may introduce inaccuracies. This motivates the use of model-free learning techniques to robustly solve the optimization problem in (1). Z-learning, a model-free learning method, returns stochastic approximations $\hat{z}_s^t$ of the optimal desirability function in Eq. (10). Thus, $\hat{z}_s^t$ is updated as:

$$\hat{z}_s^t \leftarrow (1 - \alpha_k)\, \hat{z}_s^t + \alpha_k \exp(-q_s^t / \gamma)\, \hat{z}_{s'}^{t+1} \qquad (11)$$

where $\alpha_k$ is a decaying learning rate and $s'$ is the state observed at sample $k$ by transitioning from the previous state $s$. Z-learning updates the value function at the present state based on the sample providing next-state information, instead of averaging over all possible future states as in the LS-MDP. Unlike in Q-learning, there is no optimization over actions during the iterations of Z-learning. Instead, the samples for Z-learning are passively collected from the underlying distribution $\bar{p}_{ss'}^t$. Then, the controlled transition probabilities are obtained from the specific KL-divergence form of the optimal policy in (7), which enables faster computations.

The proposed application of the Z-learning algorithm to dispatching TCLs is detailed in Algorithm 2. First, the algorithm is initialized with $\hat{z}_s^t = 1$ for all states $s \in \mathcal{S}$ and time periods $t \in \mathcal{T}$. Next, it computes the desirability function $\hat{z}_s^T = \exp(-q_s^T / \gamma)$ for the final time $T$. Then, it iteratively computes the desirability function for the remaining time intervals (from $t = T-1$ to $t = 1$) using samples generated from the passive dynamics, and updates the desirability function until a chosen convergence criterion is achieved. In this paper, the convergence criterion is defined as the difference between two successive values of the desirability function.
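The passive-sampling loop can be sketched as follows. This is an illustrative rendering of the update (11) under assumed notation (trajectory-based sampling and the $1/\text{visits}$ learning rate are our choices for the sketch, not necessarily the paper's):

```python
import numpy as np

def z_learning(q, p_bar, gamma, n_samples=20000, seed=1):
    """Model-free Z-learning sketch for update (11): transitions are
    sampled passively from the default dynamics p_bar (unknown to the
    learner); only the observed next state and state cost are used.

    q     : (T+1, S)  state costs
    p_bar : (T, S, S) passive dynamics used only to generate samples
    Returns the estimated desirability z_hat (T+1, S)."""
    rng = np.random.default_rng(seed)
    Tp1, S = q.shape
    T = Tp1 - 1
    z_hat = np.ones((Tp1, S))
    z_hat[T] = np.exp(-q[T] / gamma)       # terminal desirability is known
    visits = np.zeros((T, S))
    for _ in range(n_samples):
        s = int(rng.integers(S))           # start a passive trajectory
        for t in range(T):
            s_next = rng.choice(S, p=p_bar[t, s])   # passive sample
            visits[t, s] += 1
            alpha = 1.0 / visits[t, s]              # decaying learning rate
            z_hat[t, s] = ((1 - alpha) * z_hat[t, s]
                           + alpha * np.exp(-q[t, s] / gamma)
                           * z_hat[t + 1, s_next])  # update (11)
            s = s_next
    return z_hat
```

With deterministic passive dynamics, the estimate converges to the exact desirability values produced by the backward recursion of Eq. (10).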

Note that the state transitions in the samples used in Z-learning may themselves be corrupted by noise. The noise in the passive dynamics is modelled as an error term $\xi_{ss'}^t$ added to the default transition probabilities:

$$\tilde{p}_{ss'}^t = \bar{p}_{ss'}^t + \xi_{ss'}^t \qquad (12)$$

where $\xi_{ss'}^t$ can be modelled by a zero-mean normal distribution with variance $\sigma_{\xi}^2$, i.e., $\xi_{ss'}^t \sim \mathcal{N}(0, \sigma_{\xi}^2)$ (other parametric distributions are also suitable). To ensure that the sum of every row in the transition probability matrix remains equal to one, i.e., $\sum_{s'} \tilde{p}_{ss'}^t = 1$, every row of the noise matrix must sum to zero, i.e., $\sum_{s'} \xi_{ss'}^t = 0$. (Other methods can be used to capture noise, such as Interval Markov Chains, where the actual transition probabilities lie in intervals [Interval_MC].) The passive dynamics can be extended to capture noise scenarios by defining a set of $N$ probability distributions $\{\tilde{p}^{(1)}, \dots, \tilde{p}^{(N)}\}$, such that the set is characterized as:

$$\sum_{n=1}^{N} w_n\, \tilde{p}_{ss'}^{(n),t} = \bar{p}_{ss'}^t \qquad (13)$$

where Eq. (13) ensures that the expected value over all distributions is close to the passive dynamics of the TCL ensemble given by $\bar{p}_{ss'}^t$. At each Z-learning iteration, one out of the $N$ distributions is selected with probability $w_n$ to update the value function. Note that, despite the noise in the transition probability matrix, the same Algorithm 2 for Z-learning is used and, as shown in Section IV, performs efficiently and robustly.
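One simple way to generate such noise scenarios, consistent with the zero-row-sum requirement of (12) and the unbiasedness requirement of (13), is sketched below (the projection-plus-renormalization construction is our illustrative choice):

```python
import numpy as np

def noisy_transition_matrices(p_bar, sigma, n_scenarios, seed=2):
    """Generate N noisy versions of the default transition matrix:
    zero-mean Gaussian noise, projected so that every noise row sums
    to zero (Eq. (12)), with a clip/renormalize guard so each row of
    p_bar + noise stays a valid probability distribution."""
    rng = np.random.default_rng(seed)
    scenarios = []
    for _ in range(n_scenarios):
        xi = rng.normal(0.0, sigma, p_bar.shape)
        xi -= xi.mean(axis=-1, keepdims=True)            # zero row sums
        p_tilde = np.clip(p_bar + xi, 1e-9, None)        # keep probabilities >= 0
        p_tilde /= p_tilde.sum(axis=-1, keepdims=True)   # rows sum to one
        scenarios.append(p_tilde)
    return scenarios
```

For small noise levels the clipping is rarely active, so the scenario average stays close to the passive dynamics, as required by (13).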

### III-C Convergence of Z-learning

The convergence of Z-learning can be assessed using the optimal LS-MDP policy in (7) by proving that the Z-update in (11) asymptotically converges to (10). Let $\Delta_k(s,t) = \hat{z}_k(s,t) - z(s,t)$ be the optimality gap at the $k$-th iteration of Z-learning. Using (10)-(11) leads to:

$$\Delta_{k+1}(s,t) = \big(1 - \alpha_k \mathbb{1}_k(s)\big)\, \Delta_k(s,t) + \alpha_k \mathbb{1}_k(s) \Big( \exp(-q_s^t / \gamma)\, \hat{z}_k(s', t+1) - z(s,t) \Big)$$

where the indicator function is $\mathbb{1}_k(s) = 1$ if state $s$ is observed in the $k$-th iteration, and $\mathbb{1}_k(s) = 0$ otherwise. Consider $t = T$, the final time interval for updating $z$-values. $\hat{z}(s, T)$ is, of course, directly determined using $q_s^T$. It is clear that $\exp(-q_s^{T-1} / \gamma)\, \hat{z}_k(s', T)$ is a random variable with mean $z(s, T-1)$ and a finite variance. Then, if the learning rates are selected such that $\sum_k \alpha_k = \infty$ and $\sum_k \alpha_k^2 < \infty$, it follows that $\Delta_k(s, T-1) \to 0$; see [jaakkola1994convergence]. Proceeding similarly for $t = T-2, \dots, 1$, it follows that $\Delta_k(s,t) \to 0$ for all $t$. Thus, the Z-update (11) converges to the solution of (10). Note that the convergence also holds if a finite-variance noise is allowed in the transition probability matrix (see Eq. (12)).

## IV Case Study

### IV-A Data

We use data from the Net-Zero Energy Test Facility [Nist_data], which is a single-family, three-floor, net-zero-energy house with a total area of 386 m² (4,156 ft²), located in Gaithersburg, MD. To create an ensemble, this case study considers a neighbourhood of 100 houses, with parameters and historical data obtained by adding random noise to the data from the Net-Zero Energy Test Facility. The random noise is limited in magnitude to 20% of the original values because the 100 houses are assumed to be located in close proximity and to function similarly. For these 100 houses, we extract HVAC data and assume that all HVACs are operated by the same DR aggregator. Fig. 3 shows the aggregated HVAC power consumption for both the summer and winter seasons in the period from July 1, 2013 to June 30, 2014. These profiles were discretized into 12 Markovian states, and Fig. 4 displays the resulting transition probability matrices ($\bar{P}$). These transition probability matrices are used to dispatch the TCL ensemble over a time horizon of 10 hourly intervals.
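The discretization and estimation step described above can be sketched as follows (the function name and binning scheme are our assumptions for illustration; the paper does not specify the exact estimator):

```python
import numpy as np

def estimate_default_transitions(power, n_states, p_min=None, p_max=None):
    """Estimate a default transition probability matrix from a power
    consumption time series: discretize the profile into n_states bins
    (the Markovian states) and count empirical state-to-state transitions."""
    p_min = power.min() if p_min is None else p_min
    p_max = power.max() if p_max is None else p_max
    edges = np.linspace(p_min, p_max, n_states + 1)
    states = np.clip(np.digitize(power, edges) - 1, 0, n_states - 1)
    counts = np.zeros((n_states, n_states))
    for s, s_next in zip(states[:-1], states[1:]):
        counts[s, s_next] += 1
    row = counts.sum(axis=1, keepdims=True)
    row[row == 0] = 1.0                  # avoid division by zero for unvisited states
    return counts / row
```

In the case study, `n_states` would be 12 and `power` an hourly aggregated HVAC profile; separate matrices would be estimated for the summer and winter seasons.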

The case study solves the TCL optimization problem in Eq. (1) using Algorithms 1 and 2 and compares their performance in terms of the value function using the error metric:

$$\text{error}^t = \frac{\| \hat{z}^t - z^t \|}{\| z^t \|} \times 100\% \qquad (14)$$

which computes the relative difference between the Z-learning and LS-MDP values. Moreover, the Z-learning algorithm is run for two cases: (a) without noise added to the passive dynamics and (b) with noise. The learning rate for the Z-learning algorithm is set to decay as $\alpha_k \propto 1/k$, where $k$ is the sample number.

### IV-B Results

Fig. 5 describes the error convergence of the Z-learning algorithm with and without noise for each hourly time period. As the number of learning iterations increases, the resulting error reduces. The rate of convergence differs between the winter and summer seasons: the 10% error for all time periods is achieved within 225 learning iterations for the summer season and 245 for the winter season. Similarly, the effect of noise on the learning rate is more visible during the winter season, where the number of learning iterations required to achieve the 10% error increases from 245 to 290. In contrast, in the summer case, adding noise does not affect the convergence rate, and Z-learning achieves the 10% error in 225 learning iterations. The slower convergence in the winter case is explained by the fact that a greater power consumption is approximated using the same number of discrete states in the transition probability matrix, which requires more exploration of the model environment, especially when noise samples noticeably deviate from the default behavior defined by the passive dynamics.

Given the outcomes of Z-learning, the estimated transition probabilities are obtained as shown in Fig. 6. The estimated matrices for the cases with and without noise do not differ significantly: the root-mean-square difference of their elements is 0.0101% and 0.0068% for the summer and winter seasons, respectively. Notably, this difference changes only slightly when the estimates are compared to the default transition matrices in Fig. 4. In the case of the winter season, shown in Fig. 6 (c) and (d), the difference is 0.0017% and 0.0055% for the cases without and with noise, respectively. The difference for the summer season, shown in Fig. 6 (a) and (b), increases relative to the winter season and is 0.0023% and 0.01% for the cases without and with noise, respectively. Thus, as the number of iterations and samples increases, the outcomes of Z-learning converge to the LS-MDP values.

Based on the transition probability matrices obtained with the LS-MDP and Z-learning, Fig. 7 compares the power dispatch of the TCL ensemble. The Z-learning results, both with and without noise, accurately approximate the benchmark LS-MDP solution. The maximum difference observed for the case with noise is 4.17 kW for the summer season and 13.7 kW for the winter season; for the case without noise, it is 13.7 kW for the summer season and 13.9 kW for the winter season. These differences are relatively small given the summer and winter peaks of 287.4 kW and 1412.3 kW, respectively. The power dispatch of the TCL ensemble at every iteration during Z-learning is shown in Fig. 8, where the values stabilize as the number of iterations increases. Similarly, Fig. 9 compares the value function of each method, which represents the operating cost of the TCL ensemble. The operating costs for Z-learning, both with and without noise, are slightly greater than the optimal value provided by the LS-MDP, because Z-learning approximates the optimal solution of the optimization problem that minimizes the objective function (i.e., the operating cost). Notably, the operating cost is comparatively high when $p = \bar{p}$ (i.e., no control is taken), which shows the importance of controlling the TCL ensemble to lower the cost.

## V Conclusion

This paper presents a data-driven learning method for the control of a TCL ensemble using the MDP framework and the Z-learning approach. The results show the importance of moving from model-based to model-free methods to bridge the gap between the real environment and its model. The importance of modelling uncertainty to provide more robust solutions is demonstrated by comparing the TCL ensemble injections and the cost of the solutions. In the future, we will also consider the related problem of TCL optimization under uncertain energy prices and analyze the regret associated with online learning-based schemes [neu2017fast].

## Appendix A Bellman Equation Derivation for LS-MDP

The Bellman equation for a finite-horizon MDP is [bellma_mdp]:

$$v_s^t = \min_{a} \Big\{ q(s, a, t) + \mathbb{E}\big[ v_{s'}^{t+1} \big] \Big\} \qquad (15)$$

where $q(s, a, t)$ represents the immediate cost that the agent pays at time $t$ for taking action $a$ at state $s$, and $\mathbb{E}[v_{s'}^{t+1}]$ is the expectation of $v_{s'}^{t+1}$ taken with respect to the transition distribution $p(s' \mid s, a)$:

$$\mathbb{E}\big[ v_{s'}^{t+1} \big] = \sum_{s' \in \mathcal{S}} p(s' \mid s, a)\, v_{s'}^{t+1} \qquad (16)$$

Eq. (15) implies a search over all actions $a$ for each new state $s'$. However, this can be time consuming due to the exponential growth of future states. The LS-MDP offers a solution to this problem, which uses the transition probabilities instead of symbolic actions: the agent directly specifies the probability of transitioning from the current state to any possible future state. The Bellman equation for choosing $p_{ss'}^t$ by the agent is:

$$v_s^t = \min_{p^t} \Big\{ q_s^t + \gamma \sum_{s' \in \mathcal{S}} p_{ss'}^t \ln \frac{p_{ss'}^t}{\bar{p}_{ss'}^t} + \sum_{s' \in \mathcal{S}} p_{ss'}^t\, v_{s'}^{t+1} \Big\} \qquad (17)$$

where $q_s^t$ represents the state cost and the last term is the statistical expectation of $v_{s'}^{t+1}$ taken with respect to the controlled transition distribution $p_{ss'}^t$. Eq. (17) is the Bellman equation for the LS-MDP.
