Data-driven control of micro-climate in buildings; an event-triggered reinforcement learning approach

01/28/2020 ∙ by Ashkan Haji Hosseinloo, et al. ∙ MIT

Smart buildings have great potential for shaping an energy-efficient, sustainable, and more economic future for our planet, as buildings account for approximately 40% of global energy consumption. One of the main challenges for large-scale plug and play deployment of smart building technology is the ability to learn a good control policy in a short period of time, i.e. having a low sample complexity for the learning control agent. Motivated by this problem, and to remedy the issue of high sample complexity in the general context of cyber-physical systems, we propose an event-triggered paradigm for learning and control with variable-time intervals, as opposed to the traditional constant-time sampling. The events occur when the system state crosses the a priori-parameterized switching manifolds; this crossing triggers the learning as well as the control processes. Policy gradient and temporal difference methods are employed to learn the optimal switching manifolds, which define the optimal control policy. We propose two event-triggered learning algorithms for stochastic and deterministic control policies. We show the efficacy of our proposed approach by designing a smart learning thermostat for autonomous micro-climate control in buildings. The event-triggered algorithms are implemented on a single-zone building to decrease the building's energy consumption as well as to increase the occupants' comfort. Simulation results confirm the efficacy and improved sample efficiency of the proposed event-triggered approach for online learning and control.




1 Introduction

Buildings account for approximately 40% of global energy consumption, about half of which is used by heating, ventilation, and air conditioning (HVAC) systems Nejat et al. (2015); Wei et al. (2017), the primary means to control micro-climate in buildings. Furthermore, buildings are responsible for one-third of global energy-related greenhouse gas emissions Nejat et al. (2015). Hence, even an incremental improvement in the energy efficiency of buildings and HVAC systems goes a long way towards building a sustainable, more economic, and energy-efficient future. In addition to their economic and environmental impacts, HVAC systems can also affect the productivity and decision-making performance of occupants in buildings by controlling indoor thermal and air quality Satish et al. (2012); Wargocki and Wyon (2017). For all these reasons, micro-climate control in buildings is an important issue for its large-scale economic, environmental, health-related, and societal effects.

The main goal of micro-climate control in buildings is to minimize the building's (mainly the HVAC system's) energy consumption while improving or respecting some notion of occupants' comfort. Despite its immense importance, micro-climate control in buildings is often very energy-inefficient. HVAC systems are traditionally controlled by rule-based strategies and heuristics, where an expert uses best practices to create a set of rules that control different HVAC components, such as rule-based ON/OFF and conventional PID controllers Levermore (2013); Dounis and Caraiscos (2009). These control methods are often far from optimal, as they take into account neither the system dynamics model of the building, i.e. the building thermodynamics, nor stochastic disturbances, e.g. weather conditions or occupancy status. To overcome some of these shortcomings, more advanced model-based approaches have been proposed. In this category, Model Predictive Control (MPC) is perhaps the most promising and extensively studied method in the context of building climate control Oldewurtel et al. (2012); Ryzhov et al. (2019); Afram and Janabi-Sharifi (2014); Smarra et al. (2018).

Despite its potential benefits, the performance and reliability of MPC and other model-based control methods depend highly on the accuracy of the building thermodynamics model and the prediction of stochastic disturbances. However, developing an accurate model for a building is extremely time-consuming and resource-intensive, and hence not practical in most cases. Moreover, a once-accurate model of a building can become fairly inaccurate over time due to, for instance, renovation or wear and tear of the building. Furthermore, at large scales, MPC, like many other advanced model-based techniques, may require formidable computational power if a real-time (or near real-time) solution is required Marantos et al. (2019). Last but not least, traditional and model-based techniques are inherently building-specific and not easily transferable to other buildings.

To remedy the above-mentioned issues of model-based climate control in buildings, and towards building autonomous smart homes, data-driven approaches for HVAC control have attracted the interest of many researchers in recent years. The concept of smart homes, where household devices (e.g. appliances, thermostats, and lights) operate efficiently in an autonomous, coordinated, and adaptive fashion, has been around for a couple of decades Mozer (1998). However, with recent advances in Internet of Things (IoT) technology (cheap sensors, efficient data storage, etc.) on the one hand Minoli et al. (2017), and immense progress in data science and machine learning tools on the other hand, the idea of smart homes with data-driven HVAC control systems looks ever more realistic.

Among different data-driven control approaches, reinforcement learning (RL) has attracted the most attention in recent years due to enormous recent algorithmic advances in this field, as well as its ability to learn efficient control policies solely from experiential data via trial and error. This study focuses on an RL approach; hence, we next discuss some of the related studies using reinforcement learning for energy-efficient control in buildings, followed by our contribution.

The remainder of this article is organized as follows. Section 2 reviews the related work and highlights our contributions in this study. The problem is stated and mathematically formulated in section 3, after which the idea of switching manifolds for event-triggered control is introduced in section 4. Combining the average-reward set-up and the event-triggered control paradigm of sections 3 and 4, we present our event-triggered reinforcement learning algorithms in section 5. Finally, the implementation and simulation results are discussed in section 6 before the article is concluded in section 7.

2 Related work and contribution

2.1 Tabular RL

The Neural Network House project Mozer (1998) is perhaps the first application of reinforcement learning in building energy management systems. In this seminal work, the author explains how tabular Q-learning, one of the early versions of the popular Q-learning approach in RL, was employed to control lighting in a residential house so as to minimize energy consumption subject to occupants' comfort constraints Mozer and Miller (1997). Tabular Q-learning was later used in a few other studies for controlling passive and active thermal storage inventory in commercial buildings Liu and Henze (2006a, b), heating systems Barrett and Linder (2015), air-conditioning and natural ventilation through windows Chen et al. (2018), photovoltaic arrays and geothermal heat pumps Yang et al. (2015), and lighting and blinds Cheng et al. (2016).

Given a fully observable state and infinite exploration, tabular Q-learning is guaranteed to converge to an optimal policy. However, the tabular version of Q-learning is limited to systems with discrete states and actions, and becomes very data-intensive, and hence very slow at learning, when the system has a large number of state-action combinations. For instance, the simulated RL training in Liu and Henze (2006b) for a fairly simple building required up to 6000 days (roughly 17 years) of data collection. To remedy some of these issues, other versions of Q-learning, such as Neural Fitted Q-iteration (NFQ) and deep RL (DRL), were employed, in which function approximation techniques are used to learn an approximation of the state-action value (Q) function.

2.2 RL with action-value function approximation

Dalamagkidis et al. Dalamagkidis et al. (2007) used a linear function approximation technique to approximate the Q-function in their Q-learning RL to control a heat pump and an air ventilation subsystem using sensory data on indoor and outdoor air temperature, relative humidity, and CO2 concentration. Fitted Q Iteration (FQI), developed by Ernst et al. Ernst et al. (2005), is a batch RL method that iteratively estimates the Q-function given a fixed batch of past interactions. An online version that uses a neural network, neural fitted Q-iteration, was proposed by Riedmiller (2005). In a series of studies Ruelens et al. (2015, 2016b, 2016a), Ruelens et al. studied the application of FQI batch RL to schedule thermostatically controlled HVAC systems, such as heat pumps and electric water heaters, in different demand-response set-ups. Marantos et al. Marantos et al. (2018) applied NFQ batch RL to control the thermostat set-point of a single-zone building, where the input state was four-dimensional (outdoor and indoor temperatures, solar radiance, and indoor humidity) and the action was one-dimensional with three discrete values.

Tremendous algorithmic and computational advancements in deep neural networks in recent years have given rise to the field of deep reinforcement learning (DRL), where deep neural networks are combined with different RL approaches. This has resulted in numerous DRL algorithms (DQN, DDQN, RWB, A3C, DDPG, etc.) in the past few years, some of which have been employed for data-driven micro-climate control in buildings. Wei et al. Wei et al. (2017) claim to be the first to apply DRL to the HVAC control problem. They used the Deep Q-Network (DQN) algorithm Mnih et al. (2015) to approximate the Q-function with a discrete set of actions. To remedy some of the issues of the DQN algorithm, such as overestimation of action values, improvements to this algorithm have been made, resulting in a number of other algorithms like Double DQN (DDQN) Van Hasselt et al. (2016) and Rainbow (RWB) Hessel et al. (2018). Avendano et al. Avendano et al. (2018) applied the DDQN and RWB algorithms to optimize energy efficiency and comfort in a 2-zone apartment; they considered temperature and CO2 concentration for comfort, and used heating and ventilation costs for energy efficiency.

2.3 RL with policy function approximation

All the above-mentioned RL-based studies rely on learning the optimal state-value or action-value (Q) functions, based on which the optimal policy is derived. Parallel to this value-based approach, there is a policy-based approach, where the RL agent tries to directly learn the optimal policy (control law). Policy gradient algorithms are perhaps the most popular class of RL algorithms in this approach. The basic idea behind these algorithms is to adjust the parameters of the policy in the direction of a performance gradient Sutton et al. (2000); Silver et al. (2014). A distinctive advantage of policy gradient algorithms is their ability to handle continuous actions as well as stochastic policies. Wang et al. Wang et al. (2017) employed Monte Carlo actor-critic policy gradient RL with LSTM actor and critic networks to control the HVAC system of a single-zone office. The Deep Deterministic Policy Gradient (DDPG) algorithm Lillicrap et al. (2015) is another powerful algorithm in this class that handles deterministic policies. DDPG was used in Gao et al. (2019) and Li et al. (2019) to control energy consumption in a single-zone laboratory building and a 2-zone data center building, respectively.

2.4 Sample efficiency

Despite the sea-change advances in RL, sample efficiency is still the bottleneck for many real-world applications with slow dynamics. Building micro-climate control is one such application, since thermodynamics in buildings is relatively slow; it can take from a few minutes to an hour to collect an informative data point. The time-intensive process of data collection makes the online training of RL algorithms so long that it becomes practically impossible to have a plug & play RL-based controller for HVAC systems. For instance, training the DQN RL algorithm in Wei et al. (2017) for a single-zone building required about 100 months of sensory data. The required data collection periods for training the DDQN and RWB algorithms in Avendano et al. (2018) were reported as 120 and 90 months, respectively. A few different techniques, discussed next, have been proposed to alleviate RL's training sample complexity when it comes to real-world applications, in particular buildings.

The presence of multiple time scales in some real-world applications is one reason for the sample inefficiency of many RL algorithms. For instance, for precise control of a set-point temperature, it is more efficient to design a controller that works on a coarse time scale in the beginning, when the temperature is far from the set-point, and on a finer time scale otherwise. To address this issue, double- and multiple-time-scale reinforcement learning methods are proposed in Riedmiller (1998); Li and Xia (2015). Reducing the system's dimension, if possible, is another way to shorten the online training period. Different dimensionality reduction techniques, such as auto-encoders Ruelens et al. (2015) and convolutional neural networks (CNNs) Claessens et al. (2016), have been used in RL-based building energy management control where the system states are high-dimensional.

Another approach to reducing the training period is based on first developing a data-driven model, and then using it for offline RL training or direct planning. This approach is similar to the Dyna architecture Sutton (1991); Sutton and Barto (2018). Costanzo et al. Costanzo et al. (2016) used neural networks to learn the temperature dynamics of a building heating system to feed the training of their FQI RL algorithm, while Naug et al. Naug et al. (2019) used support vector regression to develop an energy consumption model of a commercial building for training their DDPG algorithm. In Nagy et al. (2018) and Kazmi et al. (2018), data-driven models of thermal systems are developed in the form of neural networks and a partially observable MDP transition matrix, respectively, which are then used for finite-horizon planning. As another example, Kazmi et al. Kazmi et al. (2019) used multi-agent RL to learn an MDP model of identical thermostatically controlled loads, which was then used for deriving the optimal policy by Monte Carlo techniques.

2.5 Contributions

Despite all the recent efforts, none of the proposed methods can be used for plug & play deployment of smart HVAC systems without pre-training, due to their large sample complexity. In addition, all the reinforcement learning studies in building energy management systems have formulated the problem based on episodic tasks, as opposed to continuing tasks. Micro-climate control in buildings is indeed a continuing-task problem and should be formulated as such. Furthermore, the algorithms in these studies are all based on periodic sampling with fixed time intervals. This is not very sample-efficient in many cases and is certainly not desirable in resource-constrained wireless embedded control systems Heemels et al. (2012). To remedy these issues, we make the following major contributions:

  • We develop a general framework called switching manifolds for data-efficient control of HVAC systems;

  • Based on the idea of switching manifolds, we propose an event-triggered paradigm for learning and control with an application to the HVAC systems;

  • We develop and formulate the event-triggered control problem with variable-duration sampling as an undiscounted continuing task reinforcement learning problem with average reward set-up;

  • We demonstrate the effectiveness of our proposed approach on a small-scale building via simulation in EnergyPlus software.

3 Problem statement and MDP framework

The aim of this study is to provide a plug & play control algorithm that can efficiently learn to optimize HVAC energy consumption and occupants’ comfort in buildings. To this end we first formulate the sequential decision-making control problem as a Markov decision process (MDP) in this section.

The MDP is defined by a state space $\mathcal{S}$, an action space $\mathcal{A}$, a stationary transition dynamics distribution with conditional density $p(s_{k+1}|s_k, a_k)$, where $s_k \in \mathcal{S}$ and $a_k \in \mathcal{A}$ are the state and action at time $t_k$, indexed by $k$ when the $k$-th event occurs, and a reward function $r(s_k, a_k)$. States and actions are in general continuous (e.g. a temperature state or a temperature-threshold action). Events are occasions when control actions are taken and the learning takes place; hence, they define the transition times. These events are characterized when certain conditions are met, and are explained in detail in section 4. Actions are taken at these events based on a stochastic ($\pi_\theta : \mathcal{S} \to \mathcal{P}(\mathcal{A})$) or deterministic ($\mu_\theta : \mathcal{S} \to \mathcal{A}$) policy, where $\mathcal{P}(\mathcal{A})$ is the set of probability measures on $\mathcal{A}$ and $\theta$ is a vector of parameters.

Taking action $a_k$ at state $s_k$ moves the system to a new state $s_{k+1}$ and results in a reward of $r_{k+1}$. Let us assume this transition takes $\Delta t_{k+1}$ units of time ($\Delta t_{k+1} = t_{k+1} - t_k$). Following the policy, the dynamics of the MDP evolve and result in a trajectory of states, actions, and rewards; $s_0, a_0, r_1, s_1, a_1, r_2, \dots$. We define the performance measure that we want to maximize as the average rate of reward per unit time, or simply the average reward rate:

$$\bar{r}(\pi) \doteq \lim_{N\to\infty} \frac{\sum_{k=1}^{N} \mathbb{E}[r_k]}{\sum_{k=1}^{N} \mathbb{E}[\Delta t_k]}$$
This is different from, and not proportional to, the average rate of reward per time step if the transition time periods are not equal, which will be the case in this study. We also define the differential return as:

$$G_k \doteq \sum_{j=k+1}^{\infty} \big( r_j - \bar{r}\,\Delta t_j \big)$$
In this definition of the return, the average reward (scaled by each transition's duration) is subtracted from the actual sample reward in each step, to measure the accumulated reward relative to the average reward. Similarly, we can define the state-value and action-value functions as:

$$v_\pi(s) \doteq \mathbb{E}_\pi\!\left[ G_k \mid s_k = s \right], \qquad q_\pi(s,a) \doteq \mathbb{E}_\pi\!\left[ G_k \mid s_k = s,\, a_k = a \right]$$

where the expectation $\mathbb{E}_\pi[\cdot]$ is taken with respect to the conditional probability density over trajectories associated with the policy. Although the average-reward set-up is formulated here for stochastic policies, it is applicable to deterministic policies as well, with minor modification to the equations above. In the next section, we introduce the idea of switching manifolds and learning and controlling when needed.
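The average-reward objective and the differential return above can be illustrated numerically; here is a minimal Python sketch (the reward and duration sequences are hypothetical, not from the paper):

```python
def average_reward_rate(rewards, durations):
    """Empirical average reward per unit *time*: total reward / total elapsed time.

    With variable-duration transitions this differs from the average reward
    per *step*, which would divide by len(rewards) instead.
    """
    return sum(rewards) / sum(durations)


def differential_returns(rewards, durations, r_bar):
    """Differential return from each step: the sum over the remaining steps of
    (r_j - r_bar * dt_j), i.e. each sample reward minus the average reward
    scaled by that transition's duration (finite-trajectory version)."""
    g, out = 0.0, []
    for r, dt in zip(reversed(rewards), reversed(durations)):
        g += r - r_bar * dt
        out.append(g)
    return list(reversed(out))


# Hypothetical event-triggered trajectory: four transitions of unequal duration.
rewards = [-2.0, -1.0, -3.0, -1.5]
durations = [0.5, 2.0, 1.0, 1.5]
r_bar = average_reward_rate(rewards, durations)  # -7.5 / 5.0 = -1.5
```

Note that when `r_bar` equals the empirical rate of the whole trajectory, the differential return computed from the first step is exactly zero.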

4 Switching manifolds and event-triggered control

Many HVAC control devices work based on a discrete set of control actions, e.g. ON/OFF switches or discrete-scale knobs. In many practical applications, the optimal control policy over the system's state space is not highly discontinuous or non-smooth, or at least there often exists one such control policy that is not far from the optimal. In this case, optimal (or near-optimal) actions are separated by boundaries in the state space. We call these boundaries switching manifolds, since it is only across these boundaries that the controller needs to switch actions. Figure 1 illustrates the concept of switching manifolds for two simple systems with two-dimensional state vectors and 2 or 4 actions.

Figure 1: Switching manifolds for a 2-dimensional state vector with (a) 2 and (b) 4 actions

Switching manifolds fully define a corresponding policy; hence, it is more sample-efficient to learn these manifolds, or a parameterized version of them, rather than a full tabular policy. Let us consider one such manifold parameterized by a parameter vector $\theta$ as $m(s; \theta) = 0$. A different action is taken when the system dynamics cross this manifold, or in other words when $m(s; \theta) = 0$ holds true. To make it more intuitive, we rewrite this manifold equation in terms of one particular state $s_1$ (e.g. temperature in the HVAC example) as $s_1 = g(s_2, \dots, s_n; \theta)$. Given the other states of the system, we can now think of $g$ as a threshold on $s_1$, i.e. if state $s_1$ reaches this threshold value, we need to switch to the new action based on the switching-manifold mapping (Fig.1(a) and Fig.1(b) schematically illustrate two such mappings). Also, instead of the parameters $\theta$ or the actual physical actions, we can think of these thresholds as the actions that the learning agent needs to take.
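As a concrete (if simplified) sketch of this threshold view, the snippet below assumes a linear manifold solved for one state; the parameter values are purely illustrative, not from the paper:

```python
def threshold(theta, other_states):
    """Linear switching manifold solved for one state s1:
    s1 = theta[0] + theta[1:] . (other states), read as a threshold on s1."""
    return theta[0] + sum(t * x for t, x in zip(theta[1:], other_states))


def crossed(s1_prev, s1, th):
    """An event fires when the trajectory of s1 crosses the threshold value
    between two consecutive samples (sign change of the offset from th)."""
    return (s1_prev - th) * (s1 - th) <= 0.0 and s1 != s1_prev


# Illustrative: an indoor-temperature threshold that shifts with outdoor temperature.
theta = [21.0, 0.1]            # hypothetical manifold parameters
th = threshold(theta, [-5.0])  # 21.0 + 0.1 * (-5.0) = 20.5
```

The learning agent then adjusts `theta` rather than a tabular action for every state, which is what makes the representation compact.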

So far we have introduced the switching manifolds, or the threshold policies, as a family of policies among which we would like to search for an optimal policy via e.g. reinforcement learning. The manifold/threshold learning does not need to happen at constant time intervals. In fact, here we propose controlling and learning with variable-time intervals, where actions and updates take place only when specific events occur. By definition, these events occur when the system dynamics reach the switching manifolds, or equivalently, when the thresholds are reached.

Here we further illustrate these concepts with a simple example. Let us consider a 1-zone building equipped with a heating system, described by its state vector $s = [T_{in}, T_{out}, h]^T$, where $T_{in}$ and $T_{out}$ are the indoor and outdoor temperatures and $h$ is the heater status ($h = 1$ means the heater is on and $h = 0$ means it is off). The possible physical actions we can take are: turning the heater ON, turning the heater OFF, or doing nothing. Corresponding to this set of actions, we employ linear manifolds as an example and describe the parameterized temperature thresholds as linear functions of the outdoor temperature, $T^{\text{off}} = \theta_1 T_{out} + \theta_2$ and $T^{\text{on}} = \theta_3 T_{out} + \theta_4$. This is illustrated schematically in Fig. 2. For a given parameter vector $\theta$ and outdoor temperature, when the indoor temperature reaches the switch-off threshold ($T_{in} = T^{\text{off}}$) the heater is turned off, and when it reaches the switch-on threshold ($T_{in} = T^{\text{on}}$) the heater is turned on; otherwise, no action is taken. The deterministic action policy for the underlying MDP of this system can be written as the mapping from the state to the threshold pair $(T^{\text{on}}, T^{\text{off}})$. Since at every event we need to decide on only one threshold (which will affect the next event), we can reduce the action dimension to one: the next switch-on threshold if the heater has just been turned off, and the next switch-off threshold otherwise. This idea is applied to the stochastic policy in a similar way, to decide on only one threshold temperature when an event occurs. In the next section, we propose actor-critic event-triggered RL algorithms with both stochastic and deterministic policies, based on the average-reward MDP set-up presented in section 3 and the concept of switching manifolds introduced in this section.

Figure 2: Switching manifolds (temperature thresholds) for a 1-zone building equipped with a heating system with three physical actions: turn on, turn off, and do nothing.
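The heater example lends itself to a compact simulation sketch. The following toy loop (first-order cooling/heating dynamics with illustrative constants and fixed constant thresholds, not the paper's values) shows how control actions fire only at threshold crossings:

```python
def simulate_event_triggered(T0, T_out, T_on, T_off,
                             K_over_C=0.5, Q_over_C=10.0, dt=0.01, t_end=10.0):
    """Event-triggered ON/OFF heater control on a toy first-order thermal model
    dT/dt = -(K/C) * (T - T_out) + h * (Q/C).  The heater status h changes only
    when the indoor temperature T hits the switch-on or switch-off threshold."""
    T, h, t, events = T0, 0, 0.0, []
    while t < t_end:
        T += (-K_over_C * (T - T_out) + h * Q_over_C) * dt  # forward Euler step
        t += dt
        if h == 1 and T >= T_off:    # hit the switch-off manifold
            h = 0
            events.append((round(t, 2), "off"))
        elif h == 0 and T <= T_on:   # hit the switch-on manifold
            h = 1
            events.append((round(t, 2), "on"))
    return events
```

Between events, no control computation is needed at all, which is the source of the paradigm's sample efficiency.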

5 Reinforcement learning algorithm and implementation

Most, if not all, of the popular RL algorithms (both stochastic and deterministic) are based on episodic-task MDPs. Furthermore, transition time periods do not play any role in these algorithms; this is not an issue for applications where either the transition time intervals are irrelevant to the optimization problem, e.g. in game play, or these intervals are assumed to have fixed duration. Neither of these holds for the problem of micro-climate control in buildings, where we want to optimize energy and occupants' comfort in a continuing fashion with event-triggered sampling and control, which result in variable-time intervals.

Here we consider both stochastic and deterministic policy gradient reinforcement learning for event-triggered control. Our algorithms are based on the stochastic and deterministic policy gradient theorems Sutton et al. (2000); Silver et al. (2014), with modifications to cater for the average-reward set-up and variable-time transition intervals. These theorems are as follows:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s \sim \rho^{\pi},\, a \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a|s)\, q_\pi(s,a) \right]$$

$$\nabla_\theta J(\mu_\theta) = \mathbb{E}_{s \sim \rho^{\mu}}\!\left[ \nabla_\theta \mu_\theta(s)\, \nabla_a q_\mu(s,a)\big|_{a=\mu_\theta(s)} \right]$$

where $\rho^{\pi}$ and $\rho^{\mu}$ are the stationary state distributions under the stochastic and deterministic policies, respectively. The actor components of our proposed algorithms employ these theorems to adjust and improve the parameterized policies. To this end, we use approximated action-value and state-value functions, parameterizing the true functions with parameter vectors $w$ and $v$, respectively. We employ temporal difference (TD) Q-learning for the critic to estimate the state-value or action-value functions. In this set-up, we also replace the true average reward rate $\bar{r}(\pi)$ (or $\bar{r}(\mu)$) by an approximation $\hat{r}$, which we learn via the same temporal difference error. We use the following TD errors ($\delta$) for the stochastic and deterministic policies, respectively:

$$\delta = r - \hat{r}\,\Delta t + \hat{v}(s'; v) - \hat{v}(s; v)$$

$$\delta = r - \hat{r}\,\Delta t + \hat{q}\big(s', \mu_\theta(s'); w\big) - \hat{q}(s, a; w)$$

where $\hat{r}$, $v$, and $w$ are the average-reward and parameter estimates at the time of the event, $s$ and $s'$ are the states at the current and next events, $r$ is the reward accrued over the transition, and $\Delta t$ is the transition duration. With this definition of the TD errors, we update the average reward as follows:

$$\hat{r} \leftarrow \hat{r} + \alpha_{\bar{r}}\,\delta$$

where $\alpha_{\bar{r}}$ is the learning rate for the average-reward update. Having explained the average-reward set-up and the event-triggered control and learning, we can now present the pseudocode for actor-critic algorithms for continuing tasks with both deterministic and stochastic policies.
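One TD step of this average-reward critic can be sketched as follows, for a linear state-value function; all numbers and feature choices here are illustrative, not the paper's:

```python
def td_step(r, dt, phi_s, phi_s_next, v, r_bar, alpha_v, alpha_rbar):
    """One temporal-difference update with a variable-duration transition:
    the average-reward estimate r_bar is scaled by the transition time dt
    inside the TD error, and both the linear critic v and r_bar move along
    the same TD error."""
    v_s = sum(vi * p for vi, p in zip(v, phi_s))
    v_s_next = sum(vi * p for vi, p in zip(v, phi_s_next))
    delta = r - r_bar * dt + v_s_next - v_s                    # TD error
    r_bar = r_bar + alpha_rbar * delta                         # average-reward update
    v = [vi + alpha_v * delta * p for vi, p in zip(v, phi_s)]  # critic update
    return v, r_bar, delta
```

With a constant-interval formulation, `dt` would simply be 1 for every transition; the variable `dt` is what adapts the update to event-triggered sampling.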

Algorithm 1 shows the pseudocode for stochastic policies with eligibility traces, while Algorithm 2 shows its deterministic counterpart. Algorithm 2 is an event-triggered compatible off-policy deterministic actor-critic algorithm with a simple Q-learning critic (ET-COPDAC-Q). For this algorithm, we use a compatible function approximator for $\hat{q}$ in the form of $\hat{q}(s,a;w) = \big(a - \mu_\theta(s)\big)\,\nabla_\theta \mu_\theta(s)^{T} w + V(s)$. Here, $V(s)$ is any differentiable baseline function independent of the action $a$, such as a state-value function. We parameterize the baseline function linearly in its feature vector as $V(s) = v^{T}\phi(s)$, where $\phi(s)$ is a feature vector. In the next section, we implement these algorithms on a simple building model and assess their efficacy.
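For intuition, here is a minimal sketch of such a compatible approximator for a scalar action and a linear deterministic policy mu(s) = theta . phi(s), for which the policy gradient is simply phi(s); the inputs are illustrative:

```python
def compatible_q(a, phi_s, theta, w, v):
    """Compatible action-value approximator
    q(s, a) = (a - mu(s)) * (grad_theta mu(s)) . w + V(s)
    for a linear policy mu(s) = theta . phi(s), so grad_theta mu(s) = phi(s),
    with a linear baseline V(s) = v . phi(s)."""
    mu = sum(t * p for t, p in zip(theta, phi_s))
    advantage = (a - mu) * sum(wi * p for wi, p in zip(w, phi_s))
    baseline = sum(vi * p for vi, p in zip(v, phi_s))
    return advantage + baseline
```

Note that at a = mu(s) the advantage term vanishes, so the baseline alone estimates the state value.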

Input: a differentiable stochastic policy parameterization $\pi_\theta(a|s)$
Input: a differentiable state-value function parameterization $\hat{v}(s;v)$
Parameters: step sizes $\alpha_{\theta}, \alpha_{v}, \alpha_{\bar{r}} > 0$; trace-decay rates $\lambda_{\theta}, \lambda_{v} \in [0,1]$
Initialize $\hat{r}$ (e.g. to 0)
Initialize state-value and policy parameters $v$ and $\theta$ (e.g. to 0)
Initialize the state vector $s$
$z^{v} \leftarrow 0$ ($\dim(v)$-component eligibility trace vector)
$z^{\theta} \leftarrow 0$ ($\dim(\theta)$-component eligibility trace vector)
repeat forever, when an event occurs:
      $a \sim \pi_\theta(\cdot \mid s)$
      Execute action $a$ and wait till the next event; then observe $s'$, $r$, $\Delta t$
      $\delta \leftarrow r - \hat{r}\,\Delta t + \hat{v}(s';v) - \hat{v}(s;v)$
      $\hat{r} \leftarrow \hat{r} + \alpha_{\bar{r}}\,\delta$
      $z^{v} \leftarrow \lambda_{v}\, z^{v} + \nabla_{v}\hat{v}(s;v)$
      $z^{\theta} \leftarrow \lambda_{\theta}\, z^{\theta} + \nabla_{\theta}\log\pi_\theta(a|s)$
      $v \leftarrow v + \alpha_{v}\,\delta\, z^{v}$
      $\theta \leftarrow \theta + \alpha_{\theta}\,\delta\, z^{\theta}$
      $s \leftarrow s'$
Algorithm 1 Event-triggered actor-critic stochastic policy gradient for continuing tasks with variable-time intervals (with eligibility traces)
Input: a differentiable deterministic policy parameterization $\mu_\theta(s)$
Input: a differentiable state-value function parameterization $\hat{v}(s;v)$
Input: a differentiable action-value function parameterization $\hat{q}(s,a;w)$
Parameters: step sizes $\alpha_{\theta}, \alpha_{w}, \alpha_{v}, \alpha_{\bar{r}} > 0$
Initialize $\hat{r}$ (e.g. to 0)
Initialize state-value, action-value, and policy parameters $v$, $w$, and $\theta$ (e.g. to 0)
Initialize the state vector $s$
Initialize a random process $\mathcal{N}$ for action exploration
repeat forever, when an event occurs:
      $a \leftarrow \mu_\theta(s) + \mathcal{N}$
      Execute action $a$ and wait till the next event; then observe $s'$, $r$, $\Delta t$
      $\delta \leftarrow r - \hat{r}\,\Delta t + \hat{q}\big(s', \mu_\theta(s'); w\big) - \hat{q}(s, a; w)$
      $\hat{r} \leftarrow \hat{r} + \alpha_{\bar{r}}\,\delta$
      $\theta \leftarrow \theta + \alpha_{\theta}\, \nabla_{\theta}\mu_\theta(s)\, \nabla_{a}\hat{q}(s,a;w)\big|_{a=\mu_\theta(s)}$
      $w \leftarrow w + \alpha_{w}\,\delta\, \nabla_{w}\hat{q}(s,a;w)$
      $v \leftarrow v + \alpha_{v}\,\delta\, \nabla_{v}\hat{v}(s;v)$
      $s \leftarrow s'$
Algorithm 2 Event-triggered COPDAC-Q for continuing tasks with variable-time intervals

6 Simulations and results

In this section we implement our proposed algorithms to control the heating system of a one-zone building, in order to minimize energy consumption without jeopardizing the occupants' comfort. To this end, we first describe the building models that we use for simulation, followed by the design of the rewards used by our learning control algorithms. We then explain the policy parameterization used in the simulations before presenting the simulation results.

6.1 Building models

We use two one-zone building models: a simplified linear model characterized by a first-order ordinary differential equation, and a more realistic building modeled in the EnergyPlus software. The linear model for the one-zone building with the heating system is as follows:

$$C\,\frac{dT_{in}}{dt} = -K\,(T_{in} - T_{out}) + h\,\dot{Q}$$

where $C$ is the building's heat capacity, $K$ is the building's thermal conductance, and $\dot{Q}$ is the heater's power. As defined earlier, $h$ is the heater status, and $T_{out}$ is the outdoor temperature.

In addition to the simplified linear building model, a more realistic building modeled in EnergyPlus is also used for implementation of our proposed learning control algorithms. The building modeled in EnergyPlus is a single-floor rectangular building with dimensions of (). The walls and the roof are modeled massless with thermal resistance of and , respectively. All the walls as well as the roof are exposed to the Sun and wind, and have thermal and solar absorptance of 0.90 and 0.75, respectively. The floor is made up of a 4-inch h.w. concrete block with conductivity of , density of , specific heat capacity of , and thermal and solar absorptance of 0.90 and 0.65, respectively. The building is oriented 30 degrees east of north. EnergyPlus Chicago Weather data (Chicago-OHare Intl AP 725300) is used for the simulation. An electric heater with nominal heating rate of is used for space heating.

6.2 Rewards

Comfort and energy consumption are controlled through rewards or penalties. Rewards in RL play the role of the cost function in control theory, and therefore proper design of the rewards is of paramount importance in the problem formulation. Here we formulate the reward with three components; one discrete and two continuous:

$$r_{k+1} = -c_{sw}\,\big|h(t_{k+1}) - h(t_k)\big| \;-\; \int_{t_k}^{t_{k+1}} \Big( c_{on}\, h(t) + c_{d}\,\big(T_{in}(t) - T_{d}\big)^{2} \Big)\, dt$$

where $c_{sw}$ is the discrete penalty for switching the heater on/off, included to avoid frequent switching. Frequent on/off switching can decrease the system's life-cycle or result in unpleasantly noisy operation of the heater. Here, "unit" is an arbitrary scale for quantifying the different rewards. Having the heater on is penalized continuously in time at the rate $c_{on}$. This penalty is responsible for limiting the power consumption; hence, for a more intuitive meaning, $c_{on}$ could be chosen such that the reward unit equals the monetary cost unit of the power consumption, e.g. a dollar. We define the occupants' discomfort rate as proportional to the square of the deviation from their desired temperature $T_{d}$, with coefficient of proportionality $c_{d}$.
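A sketch of this three-component reward over one transition (the coefficient values are illustrative placeholders, not the paper's tuned values; the integral is approximated by a rectangle rule over temperature samples):

```python
def transition_reward(h_prev, h_new, T_samples, dt_sample, T_desired,
                      c_switch=1.0, c_on=0.1, c_comfort=0.05):
    """Reward accrued over one event-to-event transition:
    a discrete penalty for toggling the heater, plus time-integrated
    penalties for having the heater on and for squared deviation from
    the desired temperature."""
    r = -c_switch * abs(h_new - h_prev)               # discrete switching penalty
    for T in T_samples:                               # rectangle-rule integration
        r -= (c_on * h_new + c_comfort * (T - T_desired) ** 2) * dt_sample
    return r
```

The relative sizes of the three coefficients are what trade off equipment wear, energy cost, and comfort against one another.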

6.3 Stochastic and deterministic policy parameterization

As discussed in section 4, although we can define the action as both of the thresholds at each event, we only need one of the thresholds at each event. For instance, when the system has just hit the switch-off manifold, we only need to decide on the next switch-on threshold. This helps to reduce the action dimension to one. Next, we present the parameterization for the stochastic policy approach, followed by the deterministic policy approach. In the stochastic policy method, we constrain the policy distributions to Gaussian distributions of the form:

$$\pi_\theta(a|s) = \frac{1}{\sigma_\theta(s)\sqrt{2\pi}}\, \exp\!\left( -\frac{\big(a - m_\theta(s)\big)^{2}}{2\,\sigma_\theta(s)^{2}} \right)$$

where $m_\theta(s)$ and $\sigma_\theta(s)$ are the mean and standard deviation of the action, parameterized by parameter vectors $\theta_m$ and $\theta_\sigma$, respectively ($\theta = [\theta_m^T, \theta_\sigma^T]^T$). Here, we consider constant switch-on and switch-off thresholds and parameterize the mean and standard deviation linearly in the state feature vector $\phi(s)$:

$$m_\theta(s) = \theta_m^{T}\phi(s), \qquad \sigma_\theta(s) = \exp\!\big(\theta_\sigma^{T}\phi(s)\big)$$

For simplicity, we later assume a common standard deviation for both thresholds. We also approximate the state-value function linearly as $\hat{v}(s; v) = v^{T}\phi(s)$. It should be noted that with this simple parameterization, the switching temperature thresholds do not depend on the outdoor temperature. This is a reasonable assumption because we know that if the outdoor temperature is fixed, the optimal thresholds should indeed be constant.
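The actor update for such a Gaussian threshold policy needs the score-function gradients of the log-density; a small sketch (generic Gaussian mean and standard deviation, not tied to any particular feature choice):

```python
import random


def sample_action(mean, std):
    """Draw a threshold action from the Gaussian policy N(mean, std^2)."""
    return random.gauss(mean, std)


def log_prob_grads(a, mean, std):
    """Gradients of log N(a; mean, std) with respect to the mean and the
    standard deviation -- the score terms that the actor scales by the TD error."""
    d_mean = (a - mean) / std ** 2
    d_std = ((a - mean) ** 2 - std ** 2) / std ** 3
    return d_mean, d_std
```

As the policy improves, the learned standard deviation typically shrinks, which is the behavior reported in the results section.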

In a similar fashion, we simplify the parameterization of the deterministic policy to the form:

$$\mu_\theta(s) = \theta^{T}\phi(s)$$

where $\theta$ is the policy parameter vector. We approximate the action-value function by a compatible function approximator as $\hat{q}(s,a;w) = \big(a - \mu_\theta(s)\big)\,\nabla_\theta \mu_\theta(s)^{T} w + \hat{v}(s;v)$, with $\hat{v}(s;v) = v^{T}\phi(s)$. The state feature vector and the state-value function are defined the same as in the stochastic policy approach.

6.4 Results

Having set up the simulation environment and parameterized the control policies and the related function approximators, we can now implement learning algorithms 1 and 2. In order to assess the efficacy of our learning control methods, we would ideally have the ground-truth optimal switching thresholds to which the results of our learning algorithms should converge. It should be noted that even with a simple and known model of the building with no disturbances, the optimal control problem of minimizing energy cost while improving the occupants' comfort does not fall into any of the classical optimal control frameworks such as LQG or LQR. This is mainly because of the complex form of the reward (cost) function defined in section 6.2. That said, since we know that the optimal thresholds are constant (for a fixed outdoor temperature), it is not computationally very heavy to find the ground-truth thresholds by brute-force simulations and policy search in this set-up.

To this end, we run numerous simulations where the system dynamics are described by either Eq. (8) or the EnergyPlus model, and the control policy by Eq. (12) with a constant parameter vector (we know the optimal policy should be a deterministic policy with constant switching temperature thresholds). Each such simulation is run for a long time with a fixed pair of switching temperature thresholds, at the end of which the average reward rate is calculated by dividing the total reward by the total time. For the case where the system dynamics are described by Eq. (8), the results are illustrated in Fig. 3, from which we obtain the optimal average reward rate and the corresponding optimal switch-on and switch-off thresholds. Knowing the optimal policy for the simplified linear model of the building, we next implement our proposed stochastic and deterministic learning algorithms on this building model.
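The brute-force policy search described above can be sketched as follows; the first-order model coefficients, reward weights, and threshold grid here are illustrative placeholders, not the paper's values:

```python
import numpy as np

def average_reward_rate(T_on, T_off, T_out=5.0, T_set=21.0, sim_hours=240.0,
                        dt=1.0 / 60.0, a=0.5, b=10.0,
                        w_energy=1.0, w_comfort=0.1):
    """Simulate a first-order single-zone model under a fixed threshold
    (bang-bang) policy and return total reward divided by total time."""
    T, heater_on, total_reward, t = T_out, False, 0.0, 0.0
    while t < sim_hours:
        # bang-bang policy: switch on below T_on, off above T_off
        if T < T_on:
            heater_on = True
        elif T > T_off:
            heater_on = False
        # first-order thermodynamics: dT/dt = a (T_out - T) + b u
        T += (a * (T_out - T) + b * float(heater_on)) * dt
        # reward: negative energy cost minus squared comfort deviation
        total_reward -= (w_energy * float(heater_on)
                         + w_comfort * (T - T_set) ** 2) * dt
        t += dt
    return total_reward / sim_hours

# brute-force policy search over threshold pairs (switch-on < switch-off)
grid = np.arange(18.0, 24.0, 0.5)
best = max(((lo, hi) for lo in grid for hi in grid if lo < hi),
           key=lambda th: average_reward_rate(*th))
```

Each grid point is one long simulation with fixed thresholds; the pair maximizing the average reward rate serves as the ground truth against which the learnt thresholds are compared.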

Figure 3: Average reward rate for different fixed values of the switch-on and switch-off thresholds.

Figure 4 depicts the on-policy learning of the stochastic policy parameters during a training period of 10 days, starting from initial values for the means and standard deviation of the threshold temperatures. Figure 5 illustrates the probability distributions of the stochastic policies for the switching temperature thresholds before and after the 10-day training by Algorithm 1. As seen in these two figures, the mean temperature thresholds have converged very close to the true optimal values, and the standard deviation has decreased by the end of the training. According to Fig. 6, the average reward rate is learnt and converges close to the optimum. This learnt policy is then implemented from the beginning in a separate 10-day simulation and the average reward rate is calculated; both of these values are very close to the optimal value, confirming the efficacy of the proposed event-triggered stochastic learning algorithm.
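The kind of update performed at each event during this training can be sketched as a single average-reward actor-critic step (a hedged sketch only: the step sizes, the linear critic, and the exact update form of Algorithm 1 are our assumptions):

```python
import numpy as np

def actor_critic_update(theta_mu, log_sigma, w, r_bar, s_feat, s_next_feat,
                        action, reward, dt, alphas=(1e-3, 1e-3, 1e-2, 1e-2)):
    """One average-reward actor-critic update executed at an event.

    s_feat, s_next_feat: feature vectors at two consecutive events
    dt: the (variable) time elapsed between the two events
    r_bar: running estimate of the average reward rate
    """
    a_th, a_w, a_rbar, a_sig = alphas
    mu = theta_mu @ s_feat
    sigma = np.exp(log_sigma)
    # differential TD error over a variable-length interval
    delta = reward - r_bar * dt + w @ s_next_feat - w @ s_feat
    # Gaussian score functions for the mean and (log) standard deviation
    score_mu = (action - mu) / sigma**2 * s_feat
    score_sig = (action - mu) ** 2 / sigma**2 - 1.0
    theta_mu = theta_mu + a_th * delta * score_mu      # actor (mean)
    log_sigma = log_sigma + a_sig * delta * score_sig  # actor (spread)
    w = w + a_w * delta * s_feat                       # critic
    r_bar = r_bar + a_rbar * delta                     # average-reward estimate
    return theta_mu, log_sigma, w, r_bar
```

Because the TD error scales the average-reward term by the elapsed time dt, the same update rule applies whether the intervals between events are long or short.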

Figure 4: Time history of stochastic policy parameters, i.e. means and standard deviation of the switching temperature thresholds, during a 10-day training by Algorithm 1.
Figure 5: Initial and learnt stochastic policies for switching temperature thresholds in a 10-day training by Algorithm 1.
Figure 6: Time history of on-policy average reward rate in a 10-day training by Algorithm 1.

Next, we implement our deterministic event-triggered learning algorithm (Algorithm 2) on the same building model. The learnt on/off switching temperatures at the end of a 10-day training are again very close to the true optimal values. The implemented ET-COPDAC-Q is an off-policy algorithm; hence, to assess its efficacy we need to run a new simulation in which the learnt policy is applied from the beginning and the average reward is calculated under it. The average reward rate corresponding to the learnt thresholds is then found to be very close to the optimal value.

It was explained in detail in sections 4 and 5 that the proposed event-triggered learning and control with variable time intervals should improve learning and control performance in terms of sample efficiency and variance. To support this via simulations, we run two 10-day simulations on the same building model: one with variable intervals, i.e. event-triggered learning (Algorithm 2), and one with constant 5-minute intervals. This time the event-triggered deterministic algorithm learns the exact optimal thresholds, whereas the same algorithm with constant time intervals learns slightly different thresholds. If the latter threshold policy is also implemented with constant time intervals for control (i.e. both learning and control have constant time intervals), it results in a lower average reward rate; this value improves if the learnt policy is instead implemented via event-triggered control (i.e. constant time intervals for learning but variable time intervals for control). These numbers corroborate the advantage of event-triggered learning and control over classic learning and control with fixed time intervals. To highlight this advantage further, Fig. 7 shows the learnt average reward rate during a 10-day training by Algorithm 2 with both variable and constant time intervals. It is clear that learning with constant time intervals results in a considerably larger variance.

Figure 7: Time history of average reward rate in a 10-day training by Algorithm 2 with variable (event-triggered mode) and constant time intervals.

Last but not least, we implement our learning algorithms on the more realistic building modeled in the EnergyPlus software, as detailed in section 6.1. Here the outdoor temperature is no longer kept constant and varies as shown in Fig. 8. Although the optimal thresholds should in general be functions of the outdoor temperature, here we constrain the learning problem to the family of threshold policies that do not depend on the outdoor temperature. This is because (i) finding the ground-truth optimal policy via brute-force simulations within this constrained family of policies is much easier than within the unconstrained family of threshold policies, and (ii) based on our simulation results, the optimal policy depends only weakly on the outdoor temperature in this set-up.

Similar to the case of the simplified building model, we first find the optimal threshold policy and the corresponding optimal average reward rate by brute-force simulations. We then employ our deterministic event-triggered COPDAC-Q algorithm to learn the optimal threshold policy. Starting from non-optimal initial thresholds, the algorithm learns threshold temperatures close to the optimal ones at the end of 10 days of training, and the learnt policy results in a near-optimal average reward rate. The time history of the building's indoor temperature, controlled via an exploratory deterministic behaviour policy during the 10-day training period, is illustrated in Fig. 8. The learning time history of the deterministic policy parameters, i.e. the switching temperature thresholds, during the 10-day training is shown in Fig. 9.

Figure 8: Time history of indoor and outdoor temperatures of the EnergyPlus building model during a 10-day training by Algorithm 2.
Figure 9: Time history of deterministic policy parameters, i.e. the switching temperature thresholds, during a 10-day training of the EnergyPlus building model by Algorithm 2.

7 Conclusion

This study focuses on event-triggered learning and control in the context of cyber-physical systems, with an application to buildings' micro-climate control. Learning and control systems are often designed based on sampling with fixed time intervals. A shorter time interval usually leads to more accurate learning and more precise control; however, it inherently increases the sample complexity and variance of the learning algorithms and requires more computational resources. To remedy these issues, we proposed an event-triggered paradigm for learning and control with variable time intervals and showed its efficacy by designing a smart learning thermostat for autonomous micro-climate control in buildings.

We formulated the buildings' climate control problem as a continuing-task MDP with event-triggered control policies. The events occur when the system state crosses the a priori-parameterized switching manifolds; this crossing triggers the learning as well as the control processes. Policy gradient and temporal difference methods were employed to learn the optimal switching manifolds, which define the optimal control policy. Two event-triggered learning algorithms were proposed for stochastic and deterministic control policies. These algorithms were implemented on a single-zone building to concurrently decrease energy consumption and increase occupants' comfort. Two different building models were used: (i) a simplified model in which the building's thermodynamics are characterized by a first-order ordinary differential equation, and (ii) a more realistic building modeled in the EnergyPlus software. Simulation results show that the proposed algorithms learn the optimal policy in a reasonable time. The results also confirm that, in terms of sample efficiency and variance, our proposed event-triggered algorithms outperform their classic reinforcement learning counterparts in which learning and control happen with constant time intervals.


This work is supported by the Skoltech NGP Program (joint Skoltech-MIT project).


  • A. Afram and F. Janabi-Sharifi (2014) Theory and applications of hvac control systems–a review of model predictive control (mpc). Building and Environment 72, pp. 343–355. Cited by: §1.
  • D. N. Avendano, J. Ruyssinck, S. Vandekerckhove, S. Van Hoecke, and D. Deschrijver (2018) Data-driven optimization of energy efficiency and comfort in an apartment. In 2018 International Conference on Intelligent Systems (IS), pp. 174–182. Cited by: §2.2, §2.4.
  • E. Barrett and S. Linder (2015) Autonomous hvac control, a reinforcement learning approach. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 3–19. Cited by: §2.1.
  • Y. Chen, L. K. Norford, H. W. Samuelson, and A. Malkawi (2018) Optimal control of hvac and window systems for natural ventilation through reinforcement learning. Energy and Buildings 169, pp. 195–205. Cited by: §2.1.
  • Z. Cheng, Q. Zhao, F. Wang, Y. Jiang, L. Xia, and J. Ding (2016) Satisfaction based q-learning for integrated lighting and blind control. Energy and Buildings 127, pp. 43–55. Cited by: §2.1.
  • B. J. Claessens, P. Vrancx, and F. Ruelens (2016) Convolutional neural networks for automatic state-time feature extraction in reinforcement learning applied to residential load control. IEEE Transactions on Smart Grid 9 (4), pp. 3259–3269. Cited by: §2.4.
  • G. T. Costanzo, S. Iacovella, F. Ruelens, T. Leurs, and B. J. Claessens (2016) Experimental analysis of data-driven control for a building heating system. Sustainable Energy, Grids and Networks 6, pp. 81–90. Cited by: §2.4.
  • K. Dalamagkidis, D. Kolokotsa, K. Kalaitzakis, and G. S. Stavrakakis (2007) Reinforcement learning for energy conservation and comfort in buildings. Building and environment 42 (7), pp. 2686–2698. Cited by: §2.2.
  • A. I. Dounis and C. Caraiscos (2009) Advanced control systems engineering for energy and comfort management in a building environment—a review. Renewable and Sustainable Energy Reviews 13 (6-7), pp. 1246–1261. Cited by: §1.
  • D. Ernst, P. Geurts, and L. Wehenkel (2005) Tree-based batch mode reinforcement learning. Journal of Machine Learning Research 6 (Apr), pp. 503–556. Cited by: §2.2.
  • G. Gao, J. Li, and Y. Wen (2019) Energy-efficient thermal comfort control in smart buildings via deep reinforcement learning. arXiv preprint arXiv:1901.04693. Cited by: §2.3.
  • W. Heemels, K. H. Johansson, and P. Tabuada (2012) An introduction to event-triggered and self-triggered control. In 2012 IEEE 51st IEEE Conference on Decision and Control (CDC), pp. 3270–3285. Cited by: §2.5.
  • M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver (2018) Rainbow: combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §2.2.
  • H. Kazmi, F. Mehmood, S. Lodeweyckx, and J. Driesen (2018) Gigawatt-hour scale savings on a budget of zero: deep reinforcement learning based optimal control of hot water systems. Energy 144, pp. 159–168. Cited by: §2.4.
  • H. Kazmi, J. Suykens, A. Balint, and J. Driesen (2019) Multi-agent reinforcement learning for modeling and control of thermostatically controlled loads. Applied energy 238, pp. 1022–1035. Cited by: §2.4.
  • G. Levermore (2013) Building energy management systems: an application to heating, natural ventilation, lighting and occupant satisfaction. Routledge. Cited by: §1.
  • B. Li and L. Xia (2015) A multi-grid reinforcement learning method for energy conservation and comfort of hvac in buildings. In 2015 IEEE International Conference on Automation Science and Engineering (CASE), pp. 444–449. Cited by: §2.4.
  • Y. Li, Y. Wen, D. Tao, and K. Guan (2019) Transforming cooling optimization for green data center via deep reinforcement learning. IEEE transactions on cybernetics. Cited by: §2.3.
  • T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §2.3.
  • S. Liu and G. P. Henze (2006a) Experimental analysis of simulated reinforcement learning control for active and passive building thermal storage inventory: part 1. theoretical foundation. Energy and Buildings 38 (2), pp. 142 – 147. External Links: ISSN 0378-7788, Document, Link Cited by: §2.1.
  • S. Liu and G. P. Henze (2006b) Experimental analysis of simulated reinforcement learning control for active and passive building thermal storage inventory: part 2: results and analysis. Energy and buildings 38 (2), pp. 148–161. Cited by: §2.1, §2.1.
  • C. Marantos, C. P. Lamprakos, V. Tsoutsouras, K. Siozios, and D. Soudris (2018) Towards plug&play smart thermostats inspired by reinforcement learning. In Proceedings of the Workshop on INTelligent Embedded Systems Architectures and Applications, pp. 39–44. Cited by: §2.2.
  • C. Marantos, K. Siozios, and D. Soudris (2019) Rapid prototyping of low-complexity orchestrator targeting cyberphysical systems: the smart-thermostat usecase. IEEE Transactions on Control Systems Technology. Cited by: §1.
  • D. Minoli, K. Sohraby, and B. Occhiogrosso (2017) IoT considerations, requirements, and architectures for smart buildings—energy optimization and next-generation building management systems. IEEE Internet of Things Journal 4 (1), pp. 269–283. Cited by: §1.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529. Cited by: §2.2.
  • M. C. Mozer and D. Miller (1997) Parsing the stream of time: the value of event-based segmentation in a complex real-world control problem. In International School on Neural Networks, Initiated by IIASS and EMFCSC, pp. 370–388. Cited by: §2.1.
  • M. C. Mozer (1998) The neural network house: an environment that adapts to its inhabitants. In Proc. AAAI Spring Symp. Intelligent Environments, Vol. 58. Cited by: §1, §2.1.
  • A. Nagy, H. Kazmi, F. Cheaib, and J. Driesen (2018) Deep reinforcement learning for optimal control of space heating. arXiv preprint arXiv:1805.03777. Cited by: §2.4.
  • A. Naug, I. Ahmed, and G. Biswas (2019) Online energy management in commercial buildings using deep reinforcement learning. In 2019 IEEE International Conference on Smart Computing (SMARTCOMP), pp. 249–257. Cited by: §2.4.
  • P. Nejat, F. Jomehzadeh, M. M. Taheri, M. Gohari, and M. Z. A. Majid (2015) A global review of energy consumption, co2 emissions and policy in the residential sector (with an overview of the top ten co2 emitting countries). Renewable and sustainable energy reviews 43, pp. 843–862. Cited by: §1.
  • F. Oldewurtel, A. Parisio, C. N. Jones, D. Gyalistras, M. Gwerder, V. Stauch, B. Lehmann, and M. Morari (2012) Use of model predictive control and weather forecasts for energy efficient building climate control. Energy and Buildings 45, pp. 15–27. Cited by: §1.
  • M. Riedmiller (1998) High quality thermostat control by reinforcement learning-a case study. In Proceedings of the Conald Workshop, pp. 1–2. Cited by: §2.4.
  • M. Riedmiller (2005) Neural fitted q iteration–first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, pp. 317–328. Cited by: §2.2.
  • F. Ruelens, B. J. Claessens, S. Quaiyum, B. De Schutter, R. Babuška, and R. Belmans (2016a) Reinforcement learning applied to an electric water heater: from theory to practice. IEEE Transactions on Smart Grid 9 (4), pp. 3792–3800. Cited by: §2.2.
  • F. Ruelens, B. J. Claessens, S. Vandael, B. De Schutter, R. Babuška, and R. Belmans (2016b) Residential demand response of thermostatically controlled loads using batch reinforcement learning. IEEE Transactions on Smart Grid 8 (5), pp. 2149–2159. Cited by: §2.2.
  • F. Ruelens, S. Iacovella, B. Claessens, and R. Belmans (2015) Learning agent for a heat-pump thermostat with a set-back strategy using model-free reinforcement learning. Energies 8 (8), pp. 8300–8318. Cited by: §2.2, §2.4.
  • A. Ryzhov, H. Ouerdane, E. Gryazina, A. Bischi, and K. Turitsyn (2019) Model predictive control of indoor microclimate: existing building stock comfort improvement. Energy conversion and management 179, pp. 219–228. Cited by: §1.
  • U. Satish, M. J. Mendell, K. Shekhar, T. Hotchi, D. Sullivan, S. Streufert, and W. J. Fisk (2012) Is co2 an indoor pollutant? direct effects of low-to-moderate co2 concentrations on human decision-making performance. Environmental health perspectives 120 (12), pp. 1671–1677. Cited by: §1.
  • D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller (2014) Deterministic policy gradient algorithms. In International Conference on Machine Learning. Cited by: §2.3, §5.
  • F. Smarra, A. Jain, T. De Rubeis, D. Ambrosini, A. D’Innocenzo, and R. Mangharam (2018) Data-driven model predictive control using random forests for building energy optimization and climate control. Applied Energy 226, pp. 1252–1272. Cited by: §1.
  • R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT press. Cited by: §2.4.
  • R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour (2000) Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063. Cited by: §2.3, §5.
  • R. S. Sutton (1991) Dyna, an integrated architecture for learning, planning, and reacting. ACM Sigart Bulletin 2 (4), pp. 160–163. Cited by: §2.4.
  • H. Van Hasselt, A. Guez, and D. Silver (2016) Deep reinforcement learning with double q-learning. In Thirtieth AAAI conference on artificial intelligence, Cited by: §2.2.
  • Y. Wang, K. Velswamy, and B. Huang (2017) A long-short term memory recurrent neural network based reinforcement learning controller for office heating ventilation and air conditioning systems. Processes 5 (3), pp. 46. Cited by: §2.3.
  • P. Wargocki and D. P. Wyon (2017) Ten questions concerning thermal and indoor air quality effects on the performance of office work and schoolwork. Building and Environment 112, pp. 359–366. Cited by: §1.
  • T. Wei, Y. Wang, and Q. Zhu (2017) Deep reinforcement learning for building hvac control. In Proceedings of the 54th Annual Design Automation Conference 2017, pp. 22. Cited by: §1, §2.2, §2.4.
  • L. Yang, Z. Nagy, P. Goffin, and A. Schlueter (2015) Reinforcement learning for optimal control of low exergy buildings. Applied Energy 156, pp. 577–586. Cited by: §2.1.