1 Introduction
Buildings account for approximately 40% of global energy consumption, about half of which is used by heating, ventilation, and air conditioning (HVAC) systems Nejat et al. (2015); Wei et al. (2017), the primary means of microclimate control in buildings. Furthermore, buildings are responsible for one-third of global energy-related greenhouse gas emissions Nejat et al. (2015). Hence, even an incremental improvement in the energy efficiency of buildings and HVAC systems goes a long way towards a sustainable, more economic, and energy-efficient future. In addition to their economic and environmental impacts, HVAC systems also affect the productivity and decision-making performance of occupants in buildings by controlling indoor thermal and air quality Satish et al. (2012); Wargocki and Wyon (2017). For all these reasons, microclimate control in buildings is an important issue with large-scale economic, environmental, health-related, and societal effects.
The main goal of microclimate control in buildings is to minimize the building’s (mainly the HVAC’s) energy consumption while improving or respecting some notion of occupants’ comfort. Despite its immense importance, microclimate control in buildings is often very energy-inefficient. HVAC systems are traditionally controlled by rule-based strategies and heuristics, where an expert uses best practices to create a set of rules that control different HVAC components, such as rule-based ON/OFF and conventional PID controllers Levermore (2013); Dounis and Caraiscos (2009). These control methods are often far from optimal, as they take into account neither the system dynamics model of the building, i.e. the building thermodynamics, nor stochastic disturbances, e.g. weather conditions or occupancy status. To overcome some of these shortcomings, more advanced model-based approaches have been proposed. In this category, Model Predictive Control (MPC) is perhaps the most promising and extensively studied method in the context of building climate control Oldewurtel et al. (2012); Ryzhov et al. (2019); Afram and Janabi-Sharifi (2014); Smarra et al. (2018). Despite its potential benefits, the performance and reliability of MPC and other model-based control methods depend highly on the accuracy of the building thermodynamics model and the prediction of the stochastic disturbances. However, developing an accurate model for a building is extremely time-consuming and resource-intensive, and hence not practical in most cases. Moreover, a once-accurate model of a building can become fairly inaccurate over time due to, for instance, renovation or wear and tear of the building. Furthermore, at large scales, MPC, like many other advanced model-based techniques, may require formidable computational power if a real-time (or near real-time) solution is required Marantos et al. (2019). Last but not least, traditional and model-based techniques are inherently building-specific and not easily transferable to other buildings.
To remedy the above-mentioned issues of model-based climate control in buildings, and towards building autonomous smart homes, data-driven approaches for HVAC control have attracted the interest of many researchers in recent years. The concept of smart homes, where household devices (e.g. appliances, thermostats, and lights) operate efficiently in an autonomous, coordinated, and adaptive fashion, has been around for a couple of decades Mozer (1998). However, with recent advances in Internet of Things (IoT) technology (cheap sensors, efficient data storage, etc.) on the one hand Minoli et al. (2017), and immense progress in data science and machine learning tools on the other, the idea of smart homes with data-driven HVAC control systems looks ever more realistic.
Among different data-driven control approaches, reinforcement learning (RL) has attracted particular attention in recent years due to the enormous recent algorithmic advances in this field, as well as its ability to learn efficient control policies solely from experiential data via trial and error. This study focuses on an RL approach; hence, we next discuss some of the related studies using reinforcement learning for energy-efficient control in buildings, followed by our contributions.
The remainder of this article is organized as follows. Section 2 reviews the related work and highlights our contributions in this study. The problem is stated and mathematically formulated in section 3, after which the idea of switching manifolds for event-triggered control is introduced in section 4. Combining the average-reward setup and the event-triggered control paradigm of sections 3 and 4, we present our event-triggered reinforcement learning algorithms in section 5. Finally, the implementation and simulation results are discussed in section 6 before the article is concluded in section 7.
2 Related work and contribution
2.1 Tabular RL
The Neural Network House project Mozer (1998) is perhaps the first application of reinforcement learning in building energy management systems. In this seminal work, the author explains how tabular Q-learning, one of the early versions of the popular Q-learning approach in RL, was employed to control lighting in a residential house so as to minimize energy consumption subject to occupants’ comfort constraints Mozer and Miller (1997). Tabular Q-learning was later used in a few other studies for controlling passive and active thermal storage inventory in commercial buildings Liu and Henze (2006a, b), heating systems Barrett and Linder (2015), air conditioning and natural ventilation through windows Chen et al. (2018), photovoltaic arrays and geothermal heat pumps Yang et al. (2015), and lighting and blinds Cheng et al. (2016). Given a fully observable state and infinite exploration, tabular Q-learning is guaranteed to converge to an optimal policy. However, the tabular version of Q-learning is limited to systems with discrete states and actions, and becomes very data-intensive, hence very slow at learning, when the system has a large number of state-action combinations. For instance, the simulated RL training in Liu and Henze (2006b) for a fairly simple building required up to 6000 days (roughly 17 years) of data collection. To remedy some of these issues, other versions of Q-learning such as Neural Fitted Q-iteration (NFQ) and deep RL (DRL) were employed, in which function approximation techniques are used to learn an approximation of the action-value (Q) function.
2.2 RL with actionvalue function approximation
Dalamagkidis et al. Dalamagkidis et al. (2007) used a linear function approximation technique to approximate the Q-function in their Q-learning RL to control a heat pump and an air ventilation subsystem, using sensory data on indoor and outdoor air temperature, relative humidity, and CO2 concentration. Fitted Q Iteration (FQI), developed by Ernst et al. Ernst et al. (2005), is a batch RL method that iteratively estimates the Q-function given a fixed batch of past interactions. An online version that uses a neural network, neural fitted Q-iteration, was proposed by Riedmiller (2005). In a series of studies Ruelens et al. (2015, 2016b, 2016a), Ruelens et al. studied the application of FQI batch RL to schedule thermostatically controlled HVAC systems such as heat pumps and electric water heaters in different demand-response setups. Marantos et al. Marantos et al. (2018) applied NFQ batch RL to control the thermostat setpoint of a single-zone building, where the input state was four-dimensional (outdoor and indoor temperatures, solar radiance, and indoor humidity) and the action was one-dimensional with three discrete values. Tremendous algorithmic and computational advancements in deep neural networks in recent years have given rise to the field of deep reinforcement learning (DRL), where deep neural networks are combined with different RL approaches. This has resulted in numerous DRL algorithms (DQN, DDQN, RWB, A3C, DDPG, etc.) in the past few years, some of which have been employed for data-driven microclimate control in buildings. Wei et al. Wei et al. (2017) claim to be the first to apply DRL to the HVAC control problem. They used the Deep Q-Network (DQN) algorithm Mnih et al. (2015) to approximate the Q-function with a discrete number of actions. To remedy some of the issues of the DQN algorithm, such as overestimation of action values, improvements have been made to it, resulting in several other algorithms such as Double DQN (DDQN) Van Hasselt et al. (2016) and Rainbow (RWB) Hessel et al. (2018). Avendano et al. Avendano et al. (2018) applied the DDQN and RWB algorithms to optimize energy efficiency and comfort in a two-zone apartment; they considered temperature and CO2 concentration for comfort, and used heating and ventilation costs for energy efficiency.
2.3 RL with policy function approximation
All the above-mentioned RL-based studies rely on learning the optimal state-value or action-value (Q) functions, based on which the optimal policy is derived. Parallel to this value-based approach there is a policy-based approach, where the RL agent tries to directly learn the optimal policy (control law). Policy gradient algorithms are perhaps the most popular class of RL algorithms in this approach. The basic idea behind these algorithms is to adjust the parameters of the policy in the direction of a performance gradient Sutton et al. (2000); Silver et al. (2014). A distinctive advantage of policy gradient algorithms is their ability to handle continuous actions as well as stochastic policies. Wang et al. Wang et al. (2017) employed Monte Carlo actor-critic policy gradient RL with LSTM actor and critic networks to control the HVAC system of a single-zone office. The Deep Deterministic Policy Gradient (DDPG) algorithm Lillicrap et al. (2015) is another powerful algorithm in this class that handles deterministic policies. DDPG was used in Gao et al. (2019) and Li et al. (2019) to control energy consumption in a single-zone laboratory and a two-zone data center building, respectively.
2.4 Sample efficiency
Despite the sea-change advances in RL, sample efficiency is still the bottleneck for many real-world applications with slow dynamics. Building microclimate control is one such application, since thermodynamics in buildings is relatively slow; it can take a few minutes to an hour to collect an informative data point. The time-intensive process of data collection makes the online training of RL algorithms so long that a plug & play RL-based controller for HVAC systems becomes practically impossible. For instance, training the DQN RL algorithm in Wei et al. (2017) for a single-zone building required about 100 months of sensory data. The required data collection periods for training the DDQN and RWB algorithms in Avendano et al. (2018) were reported as 120 and 90 months, respectively. A few different techniques have been proposed to alleviate RL’s training sample complexity in real-world applications, in particular buildings, which are discussed next.
The presence of multiple time scales in some real-world applications is one reason for the sample inefficiency of many RL algorithms. For instance, for precise control of a setpoint temperature, it is more efficient to design a controller that works on a coarse time scale in the beginning, when the temperature is far from the setpoint, and on a finer time scale otherwise. To address this issue, double and multiple time-scale reinforcement learning methods are proposed in Riedmiller (1998); Li and Xia (2015). Reducing the system’s dimension, if possible, is another way to shorten the online training period. Different dimensionality reduction techniques such as autoencoders Ruelens et al. (2015) and convolutional neural networks (CNN) Claessens et al. (2016) were used in RL-based building energy management control where the system states are high-dimensional. Another approach to reducing the training period is based on developing a data-driven model first, and then using it for offline RL training or direct planning. This approach is similar to the Dyna architecture Sutton (1991); Sutton and Barto (2018). Costanzo et al. Costanzo et al. (2016) used neural networks to learn the temperature dynamics of a building heating system to feed the training of their FQI RL algorithm, while Naug et al. Naug et al. (2019) used support vector regression to develop an energy consumption model of a commercial building for training their DDPG algorithm. In Nagy et al. (2018) and Kazmi et al. (2018), data-driven models of thermal systems are developed in the form of neural networks and a partially observable MDP transition matrix, respectively, which are then used for finite-horizon planning. As another example, Kazmi et al. Kazmi et al. (2019) used multi-agent RL to learn an MDP model of identical thermostatically controlled loads, which was then used for deriving the optimal policy by Monte Carlo techniques.

2.5 Contributions
Despite all the recent efforts, none of the proposed methods can be used for a plug & play deployment of smart HVAC systems without pre-training, due to their large sample complexity. In addition, all the reinforcement learning studies in building energy management systems have formulated the problem based on episodic tasks, as opposed to continuing tasks. Microclimate control in buildings is indeed a continuing-task problem and should be formulated as such. Furthermore, the algorithms in these studies are all based on periodic sampling with fixed time intervals. This is not very sample-efficient in many cases, and is certainly not desirable in resource-constrained wireless embedded control systems Heemels et al. (2012). To remedy these issues we make the following major contributions:

We develop a general framework called switching manifolds for data-efficient control of HVAC systems;

Based on the idea of switching manifolds, we propose an event-triggered paradigm for learning and control with an application to HVAC systems;

We develop and formulate the event-triggered control problem with variable-duration sampling as an undiscounted continuing-task reinforcement learning problem with an average-reward setup;

We demonstrate the effectiveness of our proposed approach on a small-scale building via simulation in the EnergyPlus software.
3 Problem statement and MDP framework
The aim of this study is to provide a plug & play control algorithm that can efficiently learn to optimize HVAC energy consumption and occupants’ comfort in buildings. To this end, we first formulate the sequential decision-making control problem as a Markov decision process (MDP) in this section.
The MDP is defined by a state space $\mathcal{S}$, an action space $\mathcal{A}$, a stationary transition dynamics distribution with conditional density $p(s_{k+1} \mid s_k, a_k)$, where $s_k \in \mathcal{S}$ and $a_k \in \mathcal{A}$ are the state and action at time $t_k$, indexed by $k$ when the $k$-th event occurs, and a reward function $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. States and actions are in general continuous (e.g. a temperature state or a temperature threshold action). Events are occasions when control actions are taken and the learning takes place; hence, they define the transition times. These events are characterized by certain conditions being met, and are explained in detail in section 4. Actions are taken at these events based on a stochastic policy $\pi_\theta : \mathcal{S} \to \mathcal{P}(\mathcal{A})$ or a deterministic policy $a_k = \mu_\theta(s_k)$, where $\mathcal{P}(\mathcal{A})$ is the set of probability measures on $\mathcal{A}$ and $\theta$ is a vector of parameters. Taking action $a_k$ at state $s_k$ moves the system to a new state $s_{k+1}$ and results in a reward $r_{k+1}$. Let us assume this transition takes $\Delta t_{k+1}$ units of time. Following the policy, the dynamics of the MDP evolve and result in a trajectory of states, actions, and rewards, $s_0, a_0, r_1, s_1, a_1, r_2, \dots$. We define the performance measure that we want to maximize as the average rate of reward per unit time, or simply the average reward rate:
(1) $r(\pi) \doteq \lim_{N \to \infty} \dfrac{\mathbb{E}\left[\sum_{k=1}^{N} r_k\right]}{\mathbb{E}\left[\sum_{k=1}^{N} \Delta t_k\right]}$
This is different from and not proportional to the average rate of reward per time step if transition time periods are not equal, which will be the case in this study. We also define the differential return as:
(2) $G_k \doteq \sum_{i=k+1}^{\infty} \left( r_i - r(\pi)\, \Delta t_i \right)$
In this definition of the return, the average reward is subtracted from the actual sample reward in each step, so as to measure the accumulated reward relative to the average reward. Similarly, we can define the state-value and action-value functions as:
(3) $v_\pi(s) \doteq \mathbb{E}_\pi\left[ G_k \mid s_k = s \right], \qquad q_\pi(s, a) \doteq \mathbb{E}_\pi\left[ G_k \mid s_k = s,\, a_k = a \right]$
where the expectations are taken with respect to the conditional probability density over trajectories associated with the policy $\pi$. Although the average-reward setup is formulated here for stochastic policies, it is applicable to deterministic policies as well, with minor modifications to the equations above. In the next section, we introduce the idea of switching manifolds and of learning and controlling only when needed.
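As a concrete illustration of these definitions, the average reward rate and the differential return can be estimated from a logged trajectory. The sketch below is not from the paper; the function and variable names are our own, and it simply assumes that the reward and transition duration are recorded at each event:

```python
def average_reward_rate(rewards, durations):
    """Average reward per unit time over a trajectory with variable durations."""
    return sum(rewards) / sum(durations)

def differential_returns(rewards, durations, rho):
    """Differential return from each event onward: accumulate r_k - rho * dt_k."""
    g, out = 0.0, []
    for r, dt in zip(reversed(rewards), reversed(durations)):
        g = (r - rho * dt) + g
        out.append(g)
    return list(reversed(out))

rewards = [2.0, -1.0, 3.0]
durations = [1.0, 2.0, 1.0]   # unequal transition times between events
rho = average_reward_rate(rewards, durations)   # 4.0 / 4.0 = 1.0
```

Note that the rate divides by elapsed time, not by the number of steps, which is exactly why it differs from the per-step average when transition durations are unequal.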
4 Switching manifolds and eventtriggered control
Many HVAC control devices work based on a discrete set of control actions, e.g. ON/OFF switches or discrete-scale knobs. In many practical applications, the optimal control over the system’s state space is not very discontinuous or nonsmooth, or at least there often exists one such control policy that is not far from the optimal. In this case, optimal (or near-optimal) actions are separated by some boundaries in the state space. We call these boundaries switching manifolds, since it is only across these boundaries that the controller needs to switch actions. Figure 1 illustrates the concept of switching manifolds for two simple systems with two-dimensional state vectors and two or four actions.
Switching manifolds fully define a corresponding policy; hence, it is more sample-efficient to learn these manifolds, or a parameterized version of them, than a full tabular policy. Let us consider one such manifold parameterized by a parameter vector $\theta$ as $m(s; \theta) = 0$. A different action is taken when the system dynamics cross this manifold, in other words when $m(s; \theta) = 0$ holds true. To make this more intuitive, we can rewrite the manifold equation in terms of one particular state, e.g. the temperature $T$ in the HVAC example, as $T = h(\bar{s}; \theta)$, where $\bar{s}$ denotes the remaining states. Given the other states of the system, we can now think of $h(\bar{s}; \theta)$ as a threshold $T^{th}$: if the state of the system reaches this threshold value, we need to switch to the new action based on the switching-manifold mapping (Fig. 1(a) and Fig. 1(b) schematically illustrate two such mappings). Also, instead of the parameters or the actual physical actions, we can think of these thresholds as the actions that the learning agent needs to take.
So far we have introduced the switching manifolds, or the threshold policies, as a family of policies within which we would like to search for an optimal policy via, e.g., reinforcement learning. The manifold/threshold learning does not need to happen at constant time intervals. In fact, here we propose controlling and learning with variable-time intervals, where actions and updates take place only when specific events occur. By definition, these events occur when the system dynamics reach the switching manifolds, or equivalently, when the thresholds are reached.
Here we further illustrate these concepts with a simple example. Let us consider a one-zone building equipped with a heating system, described by its state vector $s = (T_{in}, T_{out}, u)$, where $T_{in}$ and $T_{out}$ are the indoor and outdoor temperatures and $u$ is the heater status ($u = 1$ means the heater is on and $u = 0$ means it is off). The possible physical actions we can take are: turning the heater ON, turning the heater OFF, or doing nothing. Corresponding to this set of actions, we employ linear manifolds as an example and describe the parameterized temperature thresholds as $T^{on} = \theta_1 T_{out} + \theta_2$ and $T^{off} = \theta_3 T_{out} + \theta_4$. This is illustrated schematically in Fig. 2. For a given parameter vector $\theta$ and outdoor temperature $T_{out}$, when the indoor temperature reaches the switch-off threshold ($T_{in} = T^{off}$) the heater is turned off, and when it reaches the switch-on threshold ($T_{in} = T^{on}$) the heater is turned on; otherwise, no action is taken. The deterministic action policy for the underlying MDP of this system can be written as a mapping from the state to the pair of thresholds $(T^{on}, T^{off})$. Since at every event we need to decide on only one threshold (which will determine the next event), we can reduce the action dimension to one: after switching off we choose the next switch-on threshold, and vice versa. This idea is applied to the stochastic policy in a similar way, deciding on only one threshold temperature when an event occurs. In the next section, we propose actor-critic event-triggered RL algorithms with both stochastic and deterministic policies, based on the average-reward MDP setup presented in section 3 and the concept of switching manifolds introduced in this section.
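The event-triggered switching logic of this example can be sketched as a simple hysteresis rule. This is a minimal illustration with fixed thresholds (in the paper, the thresholds themselves are the quantities the RL agent learns); the function name and signature are our own:

```python
def heater_action(T_in, heater_on, T_on, T_off):
    """Return (new_heater_state, event_occurred) for one observation."""
    if heater_on and T_in >= T_off:        # crossed the switch-off manifold
        return False, True
    if (not heater_on) and T_in <= T_on:   # crossed the switch-on manifold
        return True, True
    return heater_on, False                # between thresholds: do nothing

# Heater is on and the room has warmed past the switch-off threshold:
state, event = heater_action(21.5, True, T_on=19.0, T_off=21.0)
# state -> False (heater switched off), event -> True
```

Between the two thresholds no event fires, so no action is taken and no learning update occurs, which is the source of the sample savings discussed above.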
5 Reinforcement learning algorithm and implementation
Most, if not all, of the popular RL algorithms (both stochastic and deterministic) are based on episodic-task MDPs. Furthermore, transition time periods do not play any role in these algorithms; this is not an issue for applications where either the transition time intervals are irrelevant to the optimization problem, e.g. in game play, or these intervals are assumed to have fixed duration. Neither of these holds for the problem of microclimate control in buildings, where we want to optimize energy and occupants’ comfort in a continuing fashion with event-triggered sampling and control, which results in variable-time intervals.
Here we consider both stochastic and deterministic policy gradient reinforcement learning for event-triggered control. Our algorithms are based on the stochastic and deterministic policy gradient theorems Sutton et al. (2000); Silver et al. (2014), with modifications to cater for the average-reward setup and variable-time transition intervals. These theorems are as follows:
(4) $\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s \sim \rho^\pi,\, a \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, q_\pi(s, a) \right], \qquad \nabla_\theta J(\mu_\theta) = \mathbb{E}_{s \sim \rho^\mu}\!\left[ \nabla_\theta \mu_\theta(s)\, \nabla_a q_\mu(s, a) \big|_{a = \mu_\theta(s)} \right]$
where $\rho^\pi$ and $\rho^\mu$ are the stationary state distributions under the stochastic and deterministic policies, respectively. The actor components of our proposed algorithms employ Eq. (4) to adjust and improve the parameterized policies. To this end, we use approximated action-value and state-value functions, parameterizing the true functions with parameter vectors $\mathbf{w}$ and $\mathbf{v}$, respectively. We employ temporal difference (TD) Q-learning for the critic to estimate the state-value or action-value functions. In this setup we also replace the true average reward rate $r(\pi)$ (or $r(\mu)$) by an approximation $\bar{R}$, which we learn via the same temporal difference error. We use the following TD errors ($\delta$) for the stochastic and deterministic policies, respectively:
(5) $\delta_k = r_{k+1} - \bar{R}_k\, \Delta t_{k+1} + \hat{v}(s_{k+1}; \mathbf{v}_k) - \hat{v}(s_k; \mathbf{v}_k)$

(6) $\delta_k = r_{k+1} - \bar{R}_k\, \Delta t_{k+1} + \hat{q}\big(s_{k+1}, \mu_\theta(s_{k+1}); \mathbf{w}_k\big) - \hat{q}(s_k, a_k; \mathbf{w}_k)$
where $\bar{R}_k$, $\mathbf{v}_k$, and $\mathbf{w}_k$ are the average-reward estimate and the parameter vectors at event $k$. With this definition of the TD errors, we update the average reward as follows:
(7) $\bar{R}_{k+1} = \bar{R}_k + \alpha_{\bar{R}}\, \delta_k$
where $\alpha_{\bar{R}}$ is the learning rate for the average-reward update. Having explained the average-reward setup and the event-triggered control and learning, we can now present the pseudocode for actor-critic algorithms for continuing tasks with both deterministic and stochastic policies.
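A single event-triggered critic update in this average-reward setup can be sketched as follows. This is a minimal illustration, assuming a linear state-value approximator and the duration-scaled TD error of Eqs. (5) and (7); the function and variable names are our own, not the paper's:

```python
def td_update(w, rho_bar, x, x_next, r, dt, alpha_w, alpha_rho):
    """One event-triggered TD(0) update with a duration-scaled average reward."""
    v = sum(wi * xi for wi, xi in zip(w, x))            # v(s)  = w . x(s)
    v_next = sum(wi * xi for wi, xi in zip(w, x_next))  # v(s') = w . x(s')
    delta = r - rho_bar * dt + v_next - v               # TD error with variable dt
    rho_bar = rho_bar + alpha_rho * delta               # average-reward update
    w = [wi + alpha_w * delta * xi for wi, xi in zip(w, x)]  # critic update
    return w, rho_bar, delta
```

Note that the average-reward estimate is multiplied by the elapsed time `dt`, which is what distinguishes this variable-interval update from the standard fixed-step average-reward TD update.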
Algorithm 1 shows the pseudocode for stochastic policies with eligibility traces, while Algorithm 2 shows its deterministic counterpart. Algorithm 2 is an event-triggered compatible off-policy deterministic actor-critic algorithm with a simple Q-learning critic (ETCOPDAC-Q). For this algorithm we use a compatible function approximator for the action-value function in the form of $\hat{q}(s, a) = (a - \mu_\theta(s))^\top \nabla_\theta \mu_\theta(s)^\top \mathbf{w} + \hat{v}(s; \mathbf{v})$. Here $\hat{v}(s; \mathbf{v})$ is any differentiable baseline function independent of the action, such as a state-value function. We parameterize the baseline function linearly in its feature vector as $\hat{v}(s; \mathbf{v}) = \mathbf{v}^\top x(s)$, where $x(s)$ is a feature vector. In the next section, we implement these algorithms on a simple building model and assess their efficacy.
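The compatible approximator can be sketched for a scalar action as follows; this is an illustrative reading of the construction in Silver et al. (2014), with hypothetical names, assuming a linear baseline:

```python
def q_compatible(a, mu_s, grad_mu, w, v, x):
    """Q(s,a) ~ (a - mu(s)) * (grad_mu(s) . w) + v . x(s), for a scalar action."""
    advantage = (a - mu_s) * sum(g * wi for g, wi in zip(grad_mu, w))
    baseline = sum(vi * xi for vi, xi in zip(v, x))  # linear state-value baseline
    return advantage + baseline

# At a = mu(s) the advantage term vanishes and only the baseline remains:
q_at_mu = q_compatible(a=1.0, mu_s=1.0, grad_mu=[1.0, 2.0], w=[0.5, 0.25], v=[1.0], x=[3.0])
```

The advantage term is linear in `a - mu(s)`, so its gradient with respect to the action is independent of `a`, which is the property that makes the approximator compatible with the deterministic policy gradient.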
6 Simulations and results
In this section we implement our proposed algorithms to control the heating system of a one-zone building, in order to minimize energy consumption without jeopardizing the occupants’ comfort. To this end, we first describe the building models that we use for simulation, followed by the design of the rewards used by our learning control algorithms. We then explain the policy parameterization used in the simulations before presenting the simulation results.
6.1 Building models
We use two one-zone building models: a simplified linear model characterized by a first-order ordinary differential equation, and a more realistic building modeled in the EnergyPlus software. The linear model of the one-zone building with the heating system is as follows:
(8) $C\, \dfrac{dT_{in}}{dt} = -K\, (T_{in} - T_{out}) + u\, P$
where $C$ is the building’s heat capacity, $K$ is the building’s thermal conductance, and $P$ is the heater’s power. As defined earlier, $u$ is the heater status and $T_{out}$ is the outdoor temperature.
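As a sanity check of Eq. (8), a forward-Euler simulation of this first-order model can be sketched as follows. The parameter values here are illustrative only, not those of the paper's building:

```python
def simulate(T_in, T_out, u, C, K, P, dt, steps):
    """Forward-Euler integration of C dT/dt = -K (T_in - T_out) + u * P."""
    for _ in range(steps):
        T_in += dt / C * (-K * (T_in - T_out) + u * P)
    return T_in

# With the heater off, the room relaxes toward the outdoor temperature:
T_cool = simulate(T_in=21.0, T_out=5.0, u=0, C=1e4, K=50.0, P=2000.0, dt=60.0, steps=1000)
# With the heater on, it settles near T_out + P/K:
T_heat = simulate(T_in=21.0, T_out=5.0, u=1, C=1e4, K=50.0, P=2000.0, dt=60.0, steps=1000)
```

The two steady states (outdoor temperature with the heater off, and `T_out + P/K` with it on) bracket the indoor temperatures reachable by any switching policy for this model.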
In addition to the simplified linear building model, a more realistic building modeled in EnergyPlus is also used for the implementation of our proposed learning control algorithms. The building modeled in EnergyPlus is a single-floor rectangular building. The walls and the roof are modeled as massless with specified thermal resistances, and all of them are exposed to the Sun and wind, with thermal and solar absorptance of 0.90 and 0.75, respectively. The floor is made up of a 4-inch heavyweight concrete block with specified conductivity, density, and specific heat capacity, and with thermal and solar absorptance of 0.90 and 0.65, respectively. The building is oriented 30 degrees east of north. EnergyPlus Chicago weather data (Chicago-OHare Intl AP 725300) is used for the simulation. An electric heater with a fixed nominal heating rate is used for space heating.
6.2 Rewards
Comfort and energy consumption are controlled via rewards or penalties. Rewards in RL play the role of the cost function in control theory, and therefore proper design of the rewards is of paramount importance in the problem formulation. Here we formulate the reward with three components, one discrete and two continuous:
(9) $r = -p_{sw}\, \mathbb{1}_{sw} - \int_{t_k}^{t_{k+1}} \Big( p_{on}\, u(t) + c\, \big(T_{in}(t) - T_{set}\big)^2 \Big)\, dt$
where $p_{sw}$ is the discrete penalty (with indicator $\mathbb{1}_{sw}$) for switching the heater on or off, included to avoid frequent switching. Frequent on/off switching can decrease the system’s life cycle or result in unpleasant, noisy operation of the heater. Here, unit is an arbitrary scale for quantifying the different rewards. Having the heater on is penalized continuously in time at the rate $p_{on}$. This penalty is responsible for limiting the power consumption; hence, for a more intuitive meaning, $p_{on}$ could be chosen such that the reward unit equals the monetary cost unit of the power consumption, e.g. a dollar currency. We define the occupants’ discomfort rate as proportional to the square of the deviation from their desired temperature $T_{set}$, with coefficient of proportionality $c$.
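A minimal sketch of such a three-component reward follows. The coefficient values are purely illustrative assumptions (the paper's actual values are not reproduced here), and the continuous penalties are approximated by the elapsed interval `dt` rather than an exact integral:

```python
def reward(switched, heater_on, T_in, T_set, dt,
           p_switch=1.0, p_on=0.01, p_comfort=0.005):
    """Discrete switching penalty plus time-integrated energy and discomfort costs."""
    r = -p_switch if switched else 0.0             # discrete on/off switching penalty
    r -= p_on * dt * (1.0 if heater_on else 0.0)   # cost of running the heater
    r -= p_comfort * dt * (T_in - T_set) ** 2      # discomfort rate x elapsed time
    return r
```

Because the two continuous penalties scale with `dt`, longer intervals between events accumulate proportionally larger costs, consistent with the average reward being defined per unit time.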
6.3 Stochastic and deterministic policy parameterization
As discussed in section 4, although we could define the action as both of the thresholds at each event, we only need one of the thresholds at each event. For instance, when the system has just hit the switch-off manifold ($T_{in} = T^{off}$), we only need to decide on the next switch-on threshold ($T^{on}$). This helps reduce the action dimension to one. Next, we present the parameterization for the stochastic policy approach, followed by the deterministic policy approach. In the stochastic policy method, we constrain the policy distributions to Gaussian distributions of the form:
(10) $\pi_\theta(a \mid s) = \dfrac{1}{\sigma_\theta(s)\sqrt{2\pi}}\, \exp\!\left( -\dfrac{\big(a - m_\theta(s)\big)^2}{2\, \sigma_\theta(s)^2} \right)$
where $m_\theta(s)$ and $\sigma_\theta(s)$ are the mean and standard deviation of the action, parameterized by parameter vectors $\theta_m$ and $\theta_\sigma$, respectively ($\theta = [\theta_m^\top, \theta_\sigma^\top]^\top$). Here, we consider constant switch-on and switch-off thresholds and parameterize the mean and standard deviation as follows:

(11) $m_\theta(s) = \theta_m^\top x(s), \qquad \sigma_\theta(s) = \exp\!\big(\theta_\sigma^\top x(s)\big)$

where $x(s)$ is the state feature vector. For simplicity, we later assume a constant standard deviation. We also approximate the state-value function as $\hat{v}(s; \mathbf{v}) = \mathbf{v}^\top x(s)$. It should be noted that with this simple parameterization, the switching temperature thresholds do not depend on the outdoor temperature. This is a reasonable assumption because we know that if the outdoor temperature is fixed, the optimal thresholds should indeed be constant.
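Sampling a threshold action from such a Gaussian policy can be sketched as follows, assuming a mean linear in the feature vector and a fixed standard deviation; all names here are illustrative:

```python
import random

def sample_threshold(theta_mu, x, sigma):
    """Draw the next switching threshold from N(mu, sigma^2), mu = theta_mu . x."""
    mu = sum(t * xi for t, xi in zip(theta_mu, x))
    return random.gauss(mu, sigma), mu

random.seed(0)
action, mu = sample_threshold(theta_mu=[21.0], x=[1.0], sigma=0.5)
# mu is exactly 21.0; the sampled action scatters around it with std 0.5
```

As the policy improves, the learnt standard deviation typically shrinks, concentrating the sampled thresholds around the mean and reducing exploration.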
In a similar fashion, we simplify the parameterization of the deterministic policy in the form of:
(12) $\mu_\theta(s) = \theta^\top x(s)$
where $\theta$ is the policy parameter vector. We approximate the action-value function by a compatible function approximator, $\hat{q}(s, a) = (a - \mu_\theta(s))^\top \nabla_\theta \mu_\theta(s)^\top \mathbf{w} + \hat{v}(s; \mathbf{v})$, with $\hat{v}(s; \mathbf{v}) = \mathbf{v}^\top x(s)$. The state feature vector and the state-value function are defined the same as in the stochastic policy approach.
6.4 Results
Having set up the simulation environment and parameterized the control policies and the related function approximators, we can now implement learning Algorithms 1 and 2. In order to assess the efficacy of our learning control methods, we need the ground-truth optimal switching thresholds to which the results of our learning algorithms should converge. It should be noted that even with a simple and known model of the building with no disturbances, the optimal control problem of minimizing energy cost while improving the occupants’ comfort does not fall into any of the classical optimal control settings such as LQG or LQR. This is mainly because of the complex form of the reward, or cost function, defined in Eq. (9). That said, since we know that the optimal thresholds are constant (for a fixed outdoor temperature), it is not computationally very heavy to find the ground-truth thresholds by brute-force simulations and policy search in this setup.
To this end, we run numerous simulations where the system dynamics are described by either Eq. (8) or the EnergyPlus model, and the control policy by Eq. (12) with a constant parameter vector (we know the optimal policy should be a deterministic policy with constant switching temperature thresholds). Each such simulation is run for a long time with a fixed pair of switching temperature thresholds, at the end of which the average reward rate is calculated by dividing the total reward by the total time. For the case where the system dynamics are described by Eq. (8), the results are illustrated in Fig. 3, from which the optimal average reward rate and the corresponding optimal switch-on and switch-off thresholds are obtained. Knowing the optimal policy for the simplified linear model of the building, we next implement our proposed stochastic and deterministic learning algorithms on this building model.
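The brute-force policy search described above can be sketched as a simple grid search over threshold pairs. Here `evaluate` stands in for a long closed-loop simulation of the building model, and the toy objective below is purely illustrative:

```python
def grid_search(evaluate, on_grid, off_grid):
    """Return the (T_on, T_off) pair with the highest estimated average reward rate."""
    best = (None, float("-inf"))
    for t_on in on_grid:
        for t_off in off_grid:
            if t_on >= t_off:            # switch-on must lie below switch-off
                continue
            rho = evaluate(t_on, t_off)
            if rho > best[1]:
                best = ((t_on, t_off), rho)
    return best

# Toy stand-in objective peaking at thresholds (19, 22):
rho_hat = lambda a, b: -(a - 19.0) ** 2 - (b - 22.0) ** 2
thresholds, rho = grid_search(rho_hat, range(15, 25), range(18, 28))
```

Since each grid point requires a full long-horizon simulation, this search is feasible only because the policy family has been reduced to two constant thresholds; it would not scale to richer policy classes.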
Figure 4 depicts the on-policy learning of the stochastic policy parameters during a training period of 10 days, starting from given initial values for the means and standard deviations of the threshold temperatures. Figure 5 illustrates the probability distributions of the stochastic policies for the switching temperature thresholds before and after the 10-day training by Algorithm 1. As seen in these two figures, the mean temperature thresholds converge very close to the true optimal values, and the standard deviation decreases considerably by the end of the training. According to Fig. 6, the average reward rate is learnt and converges. This learnt policy is then implemented from the beginning in a separate 10-day simulation and the average reward rate is calculated. Both of these values are very close to the optimal value, confirming the efficacy of the proposed event-triggered stochastic learning algorithm. Next, we implement our deterministic event-triggered learning algorithm (Algorithm 2) on the same building model. The learnt on/off switching temperatures at the end of a 10-day training are again very close to the true optimal values. The implemented ETCOPDAC-Q is an off-policy algorithm; hence, to assess its efficacy we need to implement the resulting learnt policy in a new simulation, where the average reward is calculated with the learnt policy applied from the beginning. The average reward rate corresponding to the learnt thresholds is then found to be very close to the optimal value.
It was explained in detail in sections 4 and 5 that the proposed event-triggered learning and control with variable time intervals should improve learning and control performance in terms of sample efficiency and variance. To back this up via simulations, we run two 10-day simulations on the same building model: one with variable intervals, i.e. event-triggered learning (Algorithm 2), and one with constant intervals of 5-minute duration. This time the event-triggered deterministic algorithm learns the exact optimal thresholds, whereas the same algorithm with constant time intervals learns slightly different thresholds. Moreover, if the latter threshold policy is implemented with constant time intervals for control (i.e. both learning and control have constant time intervals), it results in a lower average reward rate than when the learnt policy is implemented via event-triggered control (i.e. constant time intervals for learning but variable time intervals for control). These results corroborate the advantage of event-triggered learning and control over classic learning and control with fixed time intervals. To highlight this advantage even more, Fig. 7 shows the learnt average reward rate during a 10-day training by Algorithm 2 with both variable and constant time intervals. It is clear that learning with constant time intervals results in a considerably larger variance.

Last but not least, we implement our learning algorithms on the more realistic building modeled in the EnergyPlus software, as detailed in section 6.1. Here the outdoor temperature is no longer kept constant and varies as shown in Fig. 8. Although the optimal thresholds should in general be functions of the outdoor temperature, here we constrain the learning problem to the family of threshold policies that are not functions of the outdoor temperature. This is because (i) finding the ground-truth optimal policy via brute-force simulations within this constrained family of policies is much easier than within the unconstrained family of threshold policies, and (ii) based on our simulation results, the optimal policy has a weak dependence on the outdoor temperature in this setup.
Similar to the case of the simplified building model, we first find the optimal threshold policy and the corresponding optimal average reward rate by brute-force simulations. The optimal thresholds are found to be and , resulting in an optimal average reward rate of . Here we employ our deterministic event-triggered COPDAC-Q algorithm to learn the optimal threshold policy. Starting from initial thresholds of and , the algorithm learns the threshold temperatures to be and at the end of 10 days of training. This learnt policy results in an average reward rate of . The time history of the building's indoor temperature, controlled via an exploratory deterministic behaviour policy during the 10-day training period, is illustrated in Fig. 8. The learning time history of the deterministic policy parameters, i.e. the switching temperature thresholds, during the 10-day training is shown in Fig. 9.
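To make the idea of learning a deterministic threshold policy concrete, the sketch below tunes the two switching thresholds by finite-difference ascent on an estimated average reward rate. This is a simplified stand-in for the deterministic event-triggered COPDAC-Q update, not the paper's algorithm; the first-order building model, the comfort-plus-energy reward, and all constants are illustrative assumptions.

```python
# Simplified sketch of learning a deterministic threshold policy. Finite-
# difference ascent on the average reward rate is used as a stand-in for
# the paper's COPDAC-Q update; the building model, reward weights, and all
# constants below are illustrative assumptions.
T_OUT, TAU, GAIN, DT = 30.0, 2.0, -10.0, 5.0 / 60.0   # degC, h, degC/h, h

def avg_reward_rate(t_off, t_on, hours=50.0):
    """Average comfort-plus-energy reward of a fixed hysteresis policy."""
    t_on = max(t_on, t_off + 0.5)          # enforce a minimal hysteresis band
    T, ac_on, total = 23.0, False, 0.0
    n = int(round(hours / DT))
    for _ in range(n):
        if T >= t_on:
            ac_on = True
        elif T <= t_off:
            ac_on = False
        u = 1.0 if ac_on else 0.0
        total += -abs(T - 23.0) - 0.5 * u  # comfort penalty + energy cost
        T += DT * ((T_OUT - T) / TAU + u * GAIN)
    return total / n

theta = [20.0, 27.0]                       # initial (T_off, T_on) thresholds
alpha, eps = 1.0, 0.25
for _ in range(200):
    grad = []
    for i in range(2):                     # central finite differences
        hi, lo = theta.copy(), theta.copy()
        hi[i] += eps
        lo[i] -= eps
        grad.append((avg_reward_rate(*hi) - avg_reward_rate(*lo)) / (2 * eps))
    theta = [t + alpha * g for t, g in zip(theta, grad)]
print(theta)  # both thresholds drift toward the comfort temperature of 23 degC
```

The same structure carries over to the paper's setting: only the gradient estimator changes, with a learned critic replacing the brute-force reward evaluations.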
7 Conclusion
This study focuses on event-triggered learning and control in the context of cyber-physical systems, with an application to buildings' microclimate control. Learning and control systems are often designed based on sampling with fixed time intervals. A shorter time interval usually leads to more accurate learning and a more precise control system; however, it inherently increases the sample complexity and variance of the learning algorithms and requires more computational resources. To remedy these issues, we proposed an event-triggered paradigm for learning and control with variable time intervals and showed its efficacy in designing a smart learning thermostat for autonomous microclimate control in buildings.
We formulated the buildings' climate control problem as a continuing-task MDP with event-triggered control policies. Events occur when the system state crosses the a priori parameterized switching manifolds; this crossing triggers both the learning and the control processes. Policy gradient and temporal-difference methods are employed to learn the optimal switching manifolds, which define the optimal control policy. Two event-triggered learning algorithms are proposed, for stochastic and deterministic control policies. These algorithms are implemented on a single-zone building to concurrently decrease the building's energy consumption and increase occupants' comfort. Two different building models were used: (i) a simplified model where the building's thermodynamics are characterized by a first-order ordinary differential equation, and (ii) a more realistic building modeled in the EnergyPlus software. Simulation results show that the proposed algorithms learn the optimal policy in a reasonable time. The results also confirm that, in terms of sample efficiency and variance, our proposed event-triggered algorithms outperform their classic reinforcement learning counterparts where learning and control happen with constant time intervals.
Acknowledgements
This work is supported by the Skoltech NGP Program (joint Skoltech-MIT project).
References
Theory and applications of HVAC control systems – a review of model predictive control (MPC). Building and Environment 72, pp. 343–355.
Data-driven optimization of energy efficiency and comfort in an apartment. In 2018 International Conference on Intelligent Systems (IS), pp. 174–182.
Autonomous HVAC control, a reinforcement learning approach. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 3–19.
Optimal control of HVAC and window systems for natural ventilation through reinforcement learning. Energy and Buildings 169, pp. 195–205.
Satisfaction-based Q-learning for integrated lighting and blind control. Energy and Buildings 127, pp. 43–55.
Convolutional neural networks for automatic state-time feature extraction in reinforcement learning applied to residential load control. IEEE Transactions on Smart Grid 9 (4), pp. 3259–3269.
Experimental analysis of data-driven control for a building heating system. Sustainable Energy, Grids and Networks 6, pp. 81–90.
Reinforcement learning for energy conservation and comfort in buildings. Building and Environment 42 (7), pp. 2686–2698.
Advanced control systems engineering for energy and comfort management in a building environment – a review. Renewable and Sustainable Energy Reviews 13 (6–7), pp. 1246–1261.
Tree-based batch mode reinforcement learning. Journal of Machine Learning Research 6 (Apr), pp. 503–556.
Energy-efficient thermal comfort control in smart buildings via deep reinforcement learning. arXiv preprint arXiv:1901.04693.
An introduction to event-triggered and self-triggered control. In 2012 IEEE 51st Conference on Decision and Control (CDC), pp. 3270–3285.
Rainbow: combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence.
Gigawatt-hour scale savings on a budget of zero: deep reinforcement learning based optimal control of hot water systems. Energy 144, pp. 159–168.
Multi-agent reinforcement learning for modeling and control of thermostatically controlled loads. Applied Energy 238, pp. 1022–1035.
Building energy management systems: an application to heating, natural ventilation, lighting and occupant satisfaction. Routledge.
A multi-grid reinforcement learning method for energy conservation and comfort of HVAC in buildings. In 2015 IEEE International Conference on Automation Science and Engineering (CASE), pp. 444–449.
Transforming cooling optimization for green data center via deep reinforcement learning. IEEE Transactions on Cybernetics.
Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
Experimental analysis of simulated reinforcement learning control for active and passive building thermal storage inventory: part 1. theoretical foundation. Energy and Buildings 38 (2), pp. 142–147.
Experimental analysis of simulated reinforcement learning control for active and passive building thermal storage inventory: part 2: results and analysis. Energy and Buildings 38 (2), pp. 148–161.
Towards plug&play smart thermostats inspired by reinforcement learning. In Proceedings of the Workshop on INTelligent Embedded Systems Architectures and Applications, pp. 39–44.
Rapid prototyping of low-complexity orchestrator targeting cyber-physical systems: the smart-thermostat use-case. IEEE Transactions on Control Systems Technology.
IoT considerations, requirements, and architectures for smart buildings – energy optimization and next-generation building management systems. IEEE Internet of Things Journal 4 (1), pp. 269–283.
Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529.
Parsing the stream of time: the value of event-based segmentation in a complex real-world control problem. In International School on Neural Networks, Initiated by IIASS and EMFCSC, pp. 370–388.
The neural network house: an environment that adapts to its inhabitants. In Proc. AAAI Spring Symp. Intelligent Environments, Vol. 58.
Deep reinforcement learning for optimal control of space heating. arXiv preprint arXiv:1805.03777.
Online energy management in commercial buildings using deep reinforcement learning. In 2019 IEEE International Conference on Smart Computing (SMARTCOMP), pp. 249–257.
A global review of energy consumption, CO2 emissions and policy in the residential sector (with an overview of the top ten CO2 emitting countries). Renewable and Sustainable Energy Reviews 43, pp. 843–862.
Use of model predictive control and weather forecasts for energy efficient building climate control. Energy and Buildings 45, pp. 15–27.
High quality thermostat control by reinforcement learning – a case study. In Proceedings of the Conald Workshop, pp. 1–2.
Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, pp. 317–328.
Reinforcement learning applied to an electric water heater: from theory to practice. IEEE Transactions on Smart Grid 9 (4), pp. 3792–3800.
Residential demand response of thermostatically controlled loads using batch reinforcement learning. IEEE Transactions on Smart Grid 8 (5), pp. 2149–2159.
Learning agent for a heat-pump thermostat with a set-back strategy using model-free reinforcement learning. Energies 8 (8), pp. 8300–8318.
Model predictive control of indoor microclimate: existing building stock comfort improvement. Energy Conversion and Management 179, pp. 219–228.
Is CO2 an indoor pollutant? Direct effects of low-to-moderate CO2 concentrations on human decision-making performance. Environmental Health Perspectives 120 (12), pp. 1671–1677.
Deterministic policy gradient algorithms.
Data-driven model predictive control using random forests for building energy optimization and climate control. Applied Energy 226, pp. 1252–1272.
Reinforcement learning: an introduction. MIT Press.
Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063.
Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin 2 (4), pp. 160–163.
Deep reinforcement learning with double Q-learning. In Thirtieth AAAI Conference on Artificial Intelligence.
A long short-term memory recurrent neural network based reinforcement learning controller for office heating ventilation and air conditioning systems. Processes 5 (3), pp. 46.
Ten questions concerning thermal and indoor air quality effects on the performance of office work and schoolwork. Building and Environment 112, pp. 359–366.
Deep reinforcement learning for building HVAC control. In Proceedings of the 54th Annual Design Automation Conference 2017, pp. 22.
Reinforcement learning for optimal control of low exergy buildings. Applied Energy 156, pp. 577–586.