A Relearning Approach to Reinforcement Learning for Control of Smart Buildings

08/04/2020 ∙ by Avisek Naug, et al. ∙ Vanderbilt University

This paper demonstrates that continual relearning of control policies using incremental deep reinforcement learning (RL) can improve policy learning for non-stationary processes. We demonstrate this approach for a data-driven 'smart building environment' that we use as a test-bed for developing HVAC controllers for reducing energy consumption of large buildings on our university campus. The non-stationarity in building operations and weather patterns makes it imperative to develop control strategies that are adaptive to changing conditions. On-policy RL algorithms, such as Proximal Policy Optimization (PPO), represent an approach for addressing this non-stationarity, but exploration on the actual system is not an option for safety-critical systems. As an alternative, we develop an incremental RL technique that reduces building energy consumption without sacrificing overall comfort. We compare the performance of our incremental RL controller to that of a static RL controller that does not implement the relearning function. The performance of the static controller diminishes significantly over time, but the relearning controller adjusts to changing conditions while ensuring comfort and optimal energy performance.




1 Introduction

Energy efficient control of Heating, Ventilation and Air Conditioning (HVAC) systems is an important aspect of building operations because these systems account for the major share of the energy consumed by buildings. Most large office buildings, which are significant energy consumers, are structures with complex internal energy flow dynamics and complex interactions with their environment. Therefore, building energy management is a difficult problem. Traditional building energy control systems are based on heuristic rules to control the parameters of the building’s HVAC systems. However, analysis of historical data shows that such rule-based heuristic control is inefficient because the rules are based on simplified assumptions about weather and building operating conditions.

Recently, there has been a lot of research on smart buildings with smart controllers that sense the building state and environmental conditions to adjust the HVAC parameters and optimize building energy consumption [36]. Model Predictive Control (MPC) methods have been successfully deployed for smart control [22], but traditional MPC methods require accurate models to achieve good performance, and developing such models for large buildings may be an intractable problem [39]. Recently, data-driven MPC based on random forest methods has been used to solve demand-response problems for moderate-size buildings [39], but it is not clear how it may scale up for continuous control of large buildings.

Reinforcement Learning (RL) methods have recently gained traction for controlling energy consumption and comfort in smart buildings because they provide several advantages. Unlike MPC methods for robust receding-horizon control [43], they can learn a locally optimal control policy without simulating the system dynamics over long time horizons. Instead, RL methods use concepts from Dynamic Programming to select the optimal actions. A number of reinforcement learning controllers for buildings have been proposed, where the building behavior under different environmental conditions is learned from historical data [28]. These approaches are classified as data-driven or Deep Reinforcement Learning approaches.


However, current data driven approaches for RL do not take into account the non-stationary behaviors of the building and its environment. Building operations and the environments in which they operate are continually changing, often in unpredictable ways. In such situations, the Deep RL controller performance degrades because the data that was used to train the controller becomes ‘stale’. The solution to this problem is to detect changes in the building operations and its environment, and relearn the controller using data that is more relevant to the current situation. This paper proposes such an approach, where we relearn the controller at periodic intervals to maintain its relevance, and thus its performance.

The rest of the paper is organized as follows. Section 2 presents a brief review of some of the current approaches in model- and data-driven reinforcement learning, and the concept of non-stationarity in MDPs. Section 4 formally introduces the RL problem for non-stationary systems that we tackle in this paper. Section 6 then develops our data-driven modeling as well as the reinforcement learning schemes for ‘optimal’ building energy management. Section 7 discusses our experimental results, and the concluding section presents our conclusions and directions for future work.

2 Literature Review

Traditional methods for developing RL controllers have relied on accurate dynamic models of the system (model-based approaches) or on data-driven approaches. We briefly review model-based and data-driven approaches to RL control, and then introduce the notion of non-stationary systems, where traditional methods for RL policy learning are not effective.

2.1 Reinforcement Learning with Model Based Simulators

Typical physics-based models of building energy consumption use conservation of energy and mass to construct thermodynamic equations that describe system behavior. [43] applied Deep Q-Learning methods [24] to optimize the energy consumption and ensure temperature comfort in a building simulated using EnergyPlus [3], a whole-building energy simulation program. [26] obtained cooling energy savings on an EnergyPlus-simulated model of a data center using a natural policy gradient-based algorithm called TRPO [34]. Similarly, [19] used an off-policy algorithm called DDPG [21] to obtain cooling energy savings in an EnergyPlus simulation of a data center. To deal with sample inefficiency in on-policy learning, [9] developed an event-triggered RL approach, where the control action changes when the system crosses a boundary function in the state space. They used a one-room EnergyPlus thermal model to demonstrate their approach.

2.2 Reinforcement Learning with Data Driven Approaches

The examples above describe RL approaches applied to simple building architectures. As discussed, creating a model-based simulator for large, complex buildings can be quite difficult [31, 14]. Alternatively, more realistic approaches for RL applied to large buildings rely on historical data from the building to learn data-driven models, or directly use the data as experiences from which a policy is learned. [27] developed simulators from data-driven models and then used them for finite-horizon control. [29] used Support Vector Regression to develop a building energy consumption model, and then used stochastic gradient methods to optimize energy consumption. Another line of work used value-based neural networks to learn the thermodynamics model of a building; the energy models were then optimized using Q-learning [40]. Subsequently, [28] used a DDPG [21] approach with a sampling buffer to develop a policy function that minimized energy consumption without sacrificing comfort. Another recent approach that has successfully applied deep RL to data-driven building energy optimization is [25].

2.3 Non Stationary MDPs

The data-driven approaches presented in Section 2.2 do not address the non-stationarity of large buildings. Non-stationary behaviors can be attributed to multiple sources. For example, weather patterns, though seasonal, can change abruptly in unexpected ways. Similarly, conditions in a building can change quickly, e.g., when a large number of people enter the building for an event, or when components of the HVAC system degrade or fail, e.g., stuck valves or failed pumps. When such situations occur, an RL controller trained on past experiences cannot adapt to the unexpected changes in the system and environment, and, therefore, performs sub-optimally. Some work [23, 42, 37] addresses non-stationarity in the environment by improving the value function under the worst-case conditions [10] of the non-stationarity.

Other approaches try to minimize a regret function instead of finding the optimal policy for non-stationary MDPs. The regret function measures the sum of missed rewards when we compare, from a start state, the state value under the current best policy against the target policy in hindsight, i.e., it tells us what actions would have been appropriate after the episode ends. This regret is then optimized to obtain better actions. [7] applied this approach to context-driven MDPs (each context may represent a different non-stationary behavior) to find piecewise-stationary optimal policies for each context, and proposed a clustering algorithm to find the set of contexts. [11, 6] also minimize the regret, based on an average-reward formulation instead of a state-value function. [30] proposed a non-stationary MDP control method in a model-free setting by using the context detection method proposed in [38]. These approaches assume knowledge of a set of possible environment models beforehand, which may not be available for real systems. Moreover, they are model-based, i.e., they assume the MDP models are available; therefore, they cannot be applied in a model-free setting.

To address non-stationarity issues in complex buildings we extend previous research in this domain to make the following contributions to data-driven modeling and RL based control of buildings:

  • We retrain the dynamic behavior models of the building and its environment at regular intervals to ensure that the models respond to the distributional shifts in the system behavior, and, therefore, provide an accurate representation of the behavior.

  • By not relearning the building and environment model from scratch, we ensure that the repeated training is not time consuming. This also has the benefit that the model is not susceptible to the catastrophic forgetting [15] of past behavior that is common in neural networks used for online training and relearning.

  • We relearn the policy function, i.e., the HVAC controller, every time the dynamic model of the system is relearned, so that it adapts to the current conditions in the building.

In the rest of this paper, we develop the relearning algorithms, and demonstrate the benefits of this incremental relearning approach on the controller efficiency.

3 Optimal Control with Reinforcement Learning

Reinforcement learning (RL) represents a class of machine learning methods for solving optimal control problems, where an agent learns by continually interacting with an environment [40]. In brief, the agent observes the state of the environment, and based on this state/observation takes an action and notes the reward it receives for the pair. The agent’s ultimate goal is to compute a policy, i.e., a mapping from the environment states to the actions, that maximizes the expected sum of rewards. RL has been cast as a stochastic optimization method for solving Markov Decision Processes (MDPs) when the MDP is not known. We define the RL problem more formally below.

Definition 3.1 (Markov Decision Process).

A Markov decision process is defined by a four-tuple (S, A, P, R): S represents the set of possible states in the environment and A the set of possible actions. The transition function P(s' | s, a) defines the probability of reaching state s' at decision epoch t+1 given that action a was chosen in state s at decision epoch t. The reward function R(s, a) estimates the immediate reward obtained from choosing action a in state s.

The objective of the agent is to find a policy π that maximizes the accumulated discounted rewards it receives over the future. The optimization criterion is the following:

π* = argmax_π V^π(s),     (1)

where V^π is called the value function and is defined as

V^π(s) = E_π [ Σ_{t=0}^∞ γ^t R(s_t, a_t) | s_0 = s ],

where γ ∈ [0, 1) is called the discount factor, and it determines the weight assigned to future rewards. In other words, the weight associated with future rewards decays with time.
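As a small concrete illustration, the discounted return inside the value function can be computed with a single backward pass over a reward trajectory (a minimal sketch; the function name and the example γ are ours):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for one trajectory of rewards."""
    g = 0.0
    # Iterate backwards so each step folds in the already-discounted future.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# With gamma = 0.5, rewards [1, 1, 1] give 1 + 0.5 + 0.25 = 1.75.
```

The backward recursion g = r + γ·g avoids recomputing powers of γ for every time step.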

An optimal deterministic Markovian policy satisfying Equation 1 exists if the following condition is satisfied:

  1. The transition function P and the reward function R do not change over time.

If an MDP satisfies this condition, it is called a stationary MDP. However, most real-world systems undergo changes that cause their dynamic model, represented by the transition function P, to change over time [4]. In other words, these systems exhibit non-stationary behaviors. Non-stationary behaviors may happen because the components of a system degrade, and/or the environment in which a system operates changes, causing the models that govern the system behavior to change over time. In the case of large buildings, the weather conditions can change abruptly, or changes in occupancy or faults in building components can cause unexpected and unanticipated changes in the system’s behavior model. In other words, P is no longer invariant but may change over time. Therefore, a more realistic model of the interactions between an agent and its environment is defined by a non-stationary MDP (NMDP) [32].

Definition 3.2 (Non-Stationary Markov Decision Process).

A non-stationary Markov decision process is defined by a 5-tuple (S_t, T, A, P_t, R_t). S_t represents the set of possible states that the environment can reach at decision epoch t. T is the set of decision epochs, with t ∈ T. A is the action space. P_t and R_t represent the transition function and the reward function at decision epoch t, respectively.

In the most general case, the optimal policy for an NMDP is also non-stationary. The value of state s at decision epoch t within an infinite-horizon NMDP is defined for a stochastic policy π as follows:

V_t^π(s) = E_π [ Σ_{i=t}^∞ γ^{i−t} R_i(s_i, a_i) | s_t = s ].
Learning optimal policies for non-stationary MDPs is particularly difficult for non-episodic tasks, when the agent is unable to explore the time axis at will. However, real systems do not change arbitrarily fast over time. Hence, we can assume that changes occur slowly over time. This assumption is known as the regularity hypothesis, and it can be formalized by using the notion of Lipschitz Continuity (LC) applied to the transition and reward functions of a non-stationary MDP [18]. This results in the definition of the Lipschitz Continuous NMDP (LC-NMDP).

Definition 3.3 ((L_P, L_R)-LC-NMDP).

An (L_P, L_R)-LC-NMDP is an NMDP whose transition and reward functions are respectively L_P-LC and L_R-LC w.r.t. time, i.e., for all s, a, t, t':

W(P_t(· | s, a), P_t'(· | s, a)) ≤ L_P |t − t'|,
|R_t(s, a) − R_t'(s, a)| ≤ L_R |t − t'|,

where W represents the Wasserstein distance, used to quantify the distance between two distributions.

Although learning from the true NMDP is generally not possible, because the agent does not have access to the true NMDP model, it is possible to learn a quasi-optimal policy by interacting with temporal slices of the NMDP, assuming the LC property. This means that the agent can learn using a stationary MDP of the environment frozen at time t. Therefore, the trajectory generated by an LC-NMDP is assumed to be generated by a sequence of stationary MDPs {MDP_t}. In the next section, we present a continuous learning approach for optimal control of non-stationary processes based on this idea.
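The slice-wise view can be made concrete with a toy drifting process (entirely hypothetical numbers; `drift` plays the role of the Lipschitz constant w.r.t. time):

```python
def reward_mean(t, action, drift=0.01):
    """Reward means of a toy two-action process that drift slowly with the
    decision epoch t; `drift` acts as the Lipschitz constant w.r.t. time."""
    base = [0.5, 0.7]
    return base[action] + drift * t * (-1) ** action

def temporal_gap(t1, t2, action, drift=0.01):
    """How much the reward function changed between epochs t1 and t2."""
    return abs(reward_mean(t1, action, drift) - reward_mean(t2, action, drift))

# Regularity: the gap is bounded by drift * |t1 - t2|, so a policy learned
# on the stationary slice at epoch t stays near-optimal for nearby epochs.
```

When `drift` is small relative to the relearning interval, each frozen slice is a faithful stand-in for the true process over that interval.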

4 Continual Learning Approach for Optimal Control of Non-Stationary Systems

The proposed approach has two main steps: an initial offline learning process, followed by a continual learning process. Figure 1 presents the proposed approach, organized in the following steps, which are annotated as 1 and 2 in the figure:

  • Step 1. Data collection. Typically this represents historical data that may be available about system operations. In our work, we start with a data set containing information on past weather conditions and the building’s energy-related variables. This data set may be representative of one or more operating conditions of the non-stationary system, in our case, the building.

  • Step 2. Deriving a dynamic model of the environment. In our case, this is the building energy consumption model, given relevant building and weather parameters.

    • A state transition model is defined in terms of state variables (inputs and outputs) and the dynamics of the system are learned from the data set.

    • The reward function used to train the agent is defined.

  • Step 3. Learning an initial policy. A policy is learned offline by interacting with the environment model derived in the previous step.

  • Step 4. Deployment. The learned policy is deployed online, i.e., in the real environment, and experiences from these interactions are collected.

  • Step 5. Relearning. In general, the relearning module would be invoked based on some predefined performance parameters, for example, when average accumulated reward value over small intervals of time is monotonically decreasing. When this happens:

    • The transition model of the environment is updated based on the recent experiences collected from interaction with the up-to-date policy.

    • The current policy is re-trained offline, much like Step 3, by interacting with the environment now using the updated transition model of the system.

Figure 1: Schematic of our Proposed Approach

We will demonstrate that this method works if the regularity hypothesis is satisfied, i.e., the environment changes occur after sufficiently long intervals, allowing the offline relearning step (Step 5) to be applied effectively. In this work, we also assume that the reward function, R, is stationary and does not have to be re-derived (or re-learned) when episodic non-stationary changes occur in the system.

Another point to note is that our algorithm uses a two-step offline process to learn a new policy: (1) learn the dynamic (transition) model of the system from recent experiences; and (2) relearn the policy function using the new transition model of the system. This approach addresses two important problems: (1) policy learning happens offline; therefore, additional safety-check and verification methods can be applied to the learned policy before deployment, an important consideration for safety-critical systems; and (2) the relearning process can use an appropriate mix of past and recent experiences to relearn the environment model and the corresponding policy, thus addressing the catastrophic forgetting problem discussed earlier. This approach also provides a compromise between off-policy and on-policy learning in RL by addressing, to some extent, the sample inefficiency problem.
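The two-step offline relearning cycle can be sketched as follows (a schematic under our own naming; `fit_model` and `improve_policy` stand in for the actual LSTM and PPO training routines):

```python
import random

def mixed_batch(past, recent, mix_ratio=0.5, size=8, seed=0):
    """Sample a batch mixing recent experiences (to track distributional
    shift) with past ones (to guard against catastrophic forgetting)."""
    rng = random.Random(seed)
    n_recent = int(size * mix_ratio)
    batch = rng.sample(recent, min(n_recent, len(recent)))
    batch += rng.sample(past, min(size - len(batch), len(past)))
    return batch

def relearn(model, policy, past, recent, fit_model, improve_policy):
    """One offline relearning cycle: (1) warm-start update of the dynamics
    model, (2) retrain the policy against the refreshed model."""
    fit_model(model, mixed_batch(past, recent))  # step (1): not from scratch
    improve_policy(policy, model)                # step (2): e.g., PPO rollouts
    return model, policy
```

The `mix_ratio` knob controls the past/recent trade-off discussed above; both hook functions are placeholders for whatever training procedures the deployment actually uses.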

We use a Long Short-Term Memory (LSTM) neural network to model the dynamics of the system, and the Proximal Policy Optimization (PPO) algorithm to train the control policy. PPO is one of the best-known reinforcement learning algorithms for learning an optimal control law in short periods of time. Next, we describe our approach to modeling the dynamic environment using LSTMs, and the reinforcement learning algorithm for learning and relearning the building controllers (i.e., the policy functions).

4.1 Long Short-Term Memory Networks for Modeling Dynamic Systems

Despite their known success in machine learning tasks such as image classification, deep learning approaches for energy consumption prediction have not been sufficiently explored. In recent work, recurrent neural networks (RNNs) have demonstrated their effectiveness for load forecasting when compared against standard Multi-Layer Perceptron (MLP) architectures [17, 33].

Figure 2: Time unrolled architecture of the basic LSTM neural network block

Among the variety of RNN architectures, Long Short-Term Memory (LSTM) networks have the flexibility for modeling complex dynamic relationships and the capability to overcome the so-called vanishing/exploding gradient problem associated with training recurrent networks [8]. Moreover, LSTMs can capture arbitrarily long-term dependencies, which are likely in the context of energy forecasting tasks for large, complex buildings. The architecture of an LSTM model is represented in Figure 2. It captures non-linear long-term dependencies among the variables based on the following equations:

i_t = σ(W_i x_t + U_i h_{t−1} + b_i)
f_t = σ(W_f x_t + U_f h_{t−1} + b_f)
o_t = σ(W_o x_t + U_o h_{t−1} + b_o)
c̃_t = tanh(W_c x_t + U_c h_{t−1} + b_c)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t          (8)
h_t = o_t ⊙ tanh(c_t)

where x_t, h_t, and c_t represent the input, hidden state, and memory cell state vectors, respectively; ⊙ stands for element-wise multiplication; and σ and tanh are the sigmoid and tanh activation functions.

The adaptive update of values in the input and forget gates (i_t, f_t) provides LSTMs the ability to remember and forget patterns (Equation 8) over time. The information accumulated in the memory cell is transferred to the hidden state, scaled by the output gate (o_t). Therefore, training this network consists of learning the input-output relationships for energy forecasting by adjusting the eight weight matrices and bias vectors.
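A direct NumPy transcription of one LSTM step helps make the gate interplay concrete (a sketch with our own parameter layout; a real implementation would use a deep learning framework):

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step: input (i), forget (f), output (o) gates plus the
    candidate cell state (g). W, U, b hold the input weights, recurrent
    weights, and biases for each gate (eight matrices in total)."""
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))
    i = sigmoid(W['i'] @ x + U['i'] @ h + b['i'])   # input gate
    f = sigmoid(W['f'] @ x + U['f'] @ h + b['f'])   # forget gate
    o = sigmoid(W['o'] @ x + U['o'] @ h + b['o'])   # output gate
    g = np.tanh(W['g'] @ x + U['g'] @ h + b['g'])   # candidate cell state
    c_new = f * c + i * g                           # memory cell update
    h_new = o * np.tanh(c_new)                      # hidden state update
    return h_new, c_new
```

With all parameters at zero, every gate sigmoid evaluates to 0.5, so the cell state is simply halved at each step, which is a quick sanity check on the wiring.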

4.2 Proximal Policy Optimization

The Proximal Policy Optimization (PPO) algorithm [35] has its roots in the Natural Policy Gradient method [13], whose goal was to alleviate the common issues encountered in the application of policy gradients. Policy gradient methods [41] represent better approaches to creating optimal policies, especially when compared to value-based reinforcement learning techniques, which suffer from convergence issues when used with function approximators (neural networks). Policy gradient methods also have issues with high variability, which have been addressed by Actor-Critic methods [16]. However, choosing the best step size for policy updates was the single biggest issue, and it was addressed in [12]. PPO replaces the log of the action probability in the policy gradient equation with the probability ratio r_t(θ) = π_θ(a | s) / π_θold(a | s), inspired by [12]. Here, the current parameterized control policy is denoted by π_θ. A(s, a) denotes the advantage of taking a particular action compared to the average of all other actions in state s. According to the authors of PPO, this only partially addresses the step-size issue, as the values of the probability ratio still need to be limited. They therefore modify the objective function further to obtain a Clipped Surrogate Objective function,

L^CLIP(θ) = E_t [ min( r_t(θ) A_t, clip(r_t(θ), 1 − ε, 1 + ε) A_t ) ].     (10)
The best policy is found by maximizing the above objective. This objective has several interesting properties that make PPO easy to implement and fast to converge during each optimization step. The clipping ensures that the policy does not update too much in a given direction when the advantages are positive. Also, when the advantages are negative, the clipping makes sure that the probability of choosing those actions is not decreased too much. In other words, it strikes a balance between exploration and exploitation with monotonic policy improvement by using the probability ratio.
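The clipping logic can be stated compactly (a per-batch sketch; the variable names and the example ε are ours):

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO clipped surrogate objective, averaged over a batch:
    mean( min(r * A, clip(r, 1 - eps, 1 + eps) * A) ),
    where r is pi_theta(a|s) / pi_theta_old(a|s) and A is the advantage."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    # The element-wise min makes the objective pessimistic: large ratio
    # moves never earn extra credit, in either advantage direction.
    return np.mean(np.minimum(unclipped, clipped))
```

For a positive advantage, a ratio of 2.0 is capped at 1 + ε; for a negative advantage, a shrinking ratio is floored at 1 − ε, which is exactly the two-sided behavior described above.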

Experiments run on the MuJoCo platform show that the PPO algorithm outperforms many other state-of-the-art reinforcement learning algorithms [5]. This motivates the use of this algorithm in our relearning approach.

The PPO algorithm implements a parameterized policy π_θ using a neural network whose input is the state vector and whose outputs are the mean and standard deviation of the best possible action in that state. The policy network is trained using the clipped objective function (see Equation 10) to obtain the best controller policy. A second neural network, called the value network, keeps track of the values associated with the states under this policy; it is subsequently used to estimate the advantage of action a in state s. Its input is also the state vector, and its output is a scalar value indicating the average return from that state when policy π_θ is followed. This network is trained using the TD error [40].

5 Problem Formulation for The Building Environment

We start with a description of our building environment and formulate the solution of the energy optimization problem by using our continuous RL approach. This section presents the dynamic data-driven model of building energy consumption and the reward function we employ to derive our control policy.

5.1 System Description

The system under consideration is a large three-storey building on our university campus. It has a collection of individual office spaces, classrooms, halls, a gymnasium, a student lounge, and a small cafeteria. The building climate is controlled by a combination of Air Handling Units (AHUs) and Variable Refrigerant Flow (VRF) systems [28]. The configuration of the HVAC system is shown in Figure 3.

The AHU brings in fresh air from the outside and adjusts the air’s temperature and humidity before releasing it into the building. Typically, the desired humidity level in the building is set to %, and the desired temperature values are set by the occupants. Typically, the air is released into the building at a neutral temperature (usually or ). The VRF units in the different zones further heat or cool the air according to the respective temperature set-point (defined by the occupants’ preferences).

Figure 3: Simplified schematic of the HVAC system under Study

The AHU has two operating modes, depending on the outside wet-bulb temperature. When the wet-bulb temperature is above , only the cooling and reheat coils operate. The AHU dehumidifies the air using the cooling coil to reduce the air temperature to , thus causing condensation of the excess moisture, and then heats it back up to a specific value that was originally determined by a rule-based controller (either or ). When the wet-bulb temperature is below (implying the humidity of the outside air is below %), only the preheat coil operates, heating the incoming cold air to a predefined set-point. The discharge temperature (the reheating or preheating set-point, depending on the operating mode) will be defined by our RL controller. The appropriate setting of this set-point reduces the work that must be done by the VRF units and prevents the building from becoming too cold during cooler weather.

5.2 Problem Formulation

The goal of our RL controller is to determine the discharge air temperature set-point of the AHU that minimizes the total heating and cooling energy consumed by the building without sacrificing comfort. We formulate the RL problem by specifying the state space, the action space, the reward function, and the transition function for our building environment.

5.2.1 State Space

The overall energy consumption of our building depends on how the AHU operates, but also on exogenous factors such as weather variability and building occupancy. The evolution of the weather does not depend on the state of the building. Therefore, the control problem we are trying to solve must be framed as a non-stationary, Exogenous State MDP. The latter can be formalized as follows.

Definition 5.1 (Exogenous State Markov Decision Process).

An Exogenous State Markov decision process is defined by a Markov Decision Process whose transition function satisfies the following property:

P(s'_ex, s'_en | s_ex, s_en, a) = P(s'_ex | s_ex) · P(s'_en | s_ex, s_en, a),

where the state space S of the MDP is divided into two sub-spaces S_ex and S_en such that S = S_ex × S_en and (s_ex, s_en) ∈ S_ex × S_en.

The above definition can easily be extended to the non-stationary case by considering the time dependency of the transition functions. The condition described above can be interpreted as saying that there is a subset of state variables whose changes are independent of the actions taken by the agent. For our building, the exogenous variables are: (1) Outside Air Temperature (oat), (2) Outside Air Relative Humidity (orh), (3) Wet-Bulb Temperature (wbt), (4) Solar Irradiance (sol), and (5) Average Building Temperature Preference Set-Point (avg-stpt). The remaining, endogenous variables are: (6) AHU Supply Air Temperature (sat), (7) heating energy for the entire building, and (8) cooling energy for the entire building. Since building occupancy is not measured at this moment, we cannot incorporate that variable into our state space.
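For illustration, the state vector can be assembled with the exogenous sub-space kept first (the dictionary keys are our shorthand for the variables listed above):

```python
# Shorthand keys for the state variables listed above (our naming).
EXOGENOUS = ['oat', 'orh', 'wbt', 'sol', 'avg_stpt']
ENDOGENOUS = ['sat', 'heat_energy', 'cool_energy']

def build_state(obs):
    """Order a raw observation dict into the state vector, exogenous
    sub-space first, so the two sub-spaces stay cleanly separated."""
    return [obs[k] for k in EXOGENOUS + ENDOGENOUS]
```

Keeping a fixed ordering matters because both the LSTM dynamics model and the PPO policy network consume this vector positionally.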

5.2.2 Action Space

The action space of the MDP in each epoch is the change in the neutral discharge temperature set-point. As discussed before, the wet-bulb temperature determines the AHU operating mode. The valves and actuators that operate the HVAC system have a certain latency in their operation, which means that our controller must not change the discharge temperature set-point arbitrarily. We therefore adopted a safer approach, where the action is a continuous variable that represents the change with respect to the previous set-point. This means that at every output instant (in the present problem, the output is set to every minutes), the controller can change the discharge temperature set-point by at most this amount.
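The bounded relative action can be sketched as follows (the numeric limits are illustrative placeholders, not the building's actual values):

```python
def apply_action(prev_setpoint, delta, max_delta=2.0, low=60.0, high=75.0):
    """Apply a relative set-point action: clamp the proposed change to
    +/- max_delta, then keep the result inside safe equipment bounds.
    All numeric limits here are assumptions for illustration."""
    delta = max(-max_delta, min(max_delta, delta))
    return max(low, min(high, prev_setpoint + delta))
```

Clamping the delta (rather than the absolute set-point alone) respects actuator latency: no single decision epoch can swing the discharge temperature further than the equipment can comfortably follow.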

5.3 Transition Model

Taking into consideration that the state and action spaces of the building are continuous, the transition function comprises three components.

First, the transition function of the exogenous state variables (oat, orh, wbt, sol, and avg-stpt) is not explicitly modeled. Their next state is determined by looking up the weather forecast for the next time step. These variables are available at minute intervals through the Metasys portal of our building; solar irradiance, sol, is available from external data sources. There are no humidity or occupancy sensors inside the building; therefore, we did not consider them as part of the exogenous state variables.

The supply air temperature and the heating and cooling energies are the non-exogenous variables. The change in the supply air temperature sat is a function of the current temperature and the set-point selected by the agent: the controller action determines the new set-point, and subsequently the supply air temperature approaches that value. We do not create a transition function for this variable, since we obtain its value from a sensor installed in the AHU.

Lastly, the heating and cooling energy variables are determined by transition functions that we approximate with data-driven models. As discussed in the previous section, we train stacked LSTMs to derive nonlinear approximators for these functions. LSTMs help keep track of the state of the system, since they allow modeling continuous systems with slow dynamics. The heating and cooling energy estimated by the LSTMs are used as part of the reward function, as discussed next.

5.4 Reward Function

The reward function includes two components: (1) the total energy savings for the building, expressed as heating and cooling energy savings, and (2) the comfort level achieved. The reward signal at time instant t is given by a weighted combination of these two terms, where a weighting factor defines the importance we give to each term; we fixed this weight to a constant value in this work.

The energy term is defined in terms of the energy savings achieved with respect to the rule-based controller previously implemented in the building, i.e., we reward the RL controller when its actions result in energy savings, calculated as the difference between the total heating and cooling energy under the RBC controller actions and under the RL controller actions. The components of this calculation are:

  • The total energy used to heat the air at the heating or preheating coil, as well as by the VRF system, at time instant t, based on the heating set-point at the AHU assigned by the RL controller.

  • The total energy used to heat the air at the heating or preheating coil, as well as by the VRF system, at time instant t, based on the heating set-point at the AHU assigned by the Rule-Based Controller (RBC).

  • The on-off state of the heating valve at time instant t, based on the heating set-point at the AHU assigned by the RL controller.

  • The on-off state of the heating valve at time instant t, based on the heating set-point at the AHU assigned by the RBC.

  • The total energy used to cool the air at the cooling coil, as well as by the VRF system, at time instant t, based on the set-point at the AHU assigned by the RL controller.

  • The total energy used to cool the air at the cooling coil, as well as by the VRF system, at time instant t, based on the set-point at the AHU assigned by the RBC.

Here, by the Rule Based Controller set point, we refer to the historical set-point data from the building's past operation, against which we perform our comparison.

The heating and the cooling energy are calculated as functions of the exogenous state variables, as discussed in the previous sub-section. Additionally, we model the behavior of the valve that manipulates the steam flow in the coil of the heating system. This valve shuts off under certain conditions, causing the heating energy consumption to drop sharply to zero. This hybrid on-off behavior cannot be modeled with an LSTM, so we model the valve behavior independently as an on-off switch that decides when to consider the predictions made by the LSTM (only when the valve is on). Note that the valve states under both the RL and the RBC actions are predicted using a binary classifier.
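The gating of the LSTM energy prediction by the valve classifier can be sketched as follows (the function name is illustrative):

```python
def predicted_heating_energy(lstm_energy_pred, valve_on):
    """Trust the smooth LSTM regressor only while the heating valve is
    open; when the binary valve classifier predicts "off", the heating
    energy consumption is exactly zero."""
    return lstm_energy_pred if valve_on else 0.0
```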

The reward for comfort is measured by how close the supply air temperature is to the Average Building Temperature Preference set point (avg-stpt).

The comfort term allows the RL controller to explore in the vicinity of the average building temperature preference while optimizing energy. The 1 added to the denominator in case 1 keeps the reward bounded.

The individual reward components are formulated so that preferred actions yield positive feedback and non-preferred actions yield negative feedback. The overall reward is non-sparse, so the RL agent has sufficient heuristic information to move toward an optimal policy.

6 Implementation Details

In this section, we describe the implementation of the proposed approach for the optimal control of the system described in the previous section.

6.1 Data Collection and Processing

This process is part of Step 1 in Figure 1. The data was collected over a period of 20 months (July '18 to Feb '20) from the building we were simulating, using the BACnet system, which logs all the variables relevant to our study. These include the weather variables, the building set points, and energy values, collected at 5-minute aggregations. We first cleaned the data by removing statistical outliers using a 2-standard-deviation threshold. Next, we aggregated the variables at half-hour intervals, where variables like temperature and humidity were averaged and variables like energy were summed over each interval. We then scaled the data to a fixed interval so that we could learn the different data-driven models and the controller policy. In order to perform the off-line learning as well as the subsequent relearning, we sampled the above data in windows of 3 months (for training) and 1 week (for evaluation).
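A minimal pandas sketch of this cleaning pipeline, with illustrative column names and the scaling interval assumed to be [0, 1]:

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the cleaning pipeline described above; the column
    names ('oat', 'energy') are illustrative, not the real tags."""
    # 1. Drop statistical outliers: keep samples within 2 standard
    #    deviations of each column's mean.
    mask = ((df - df.mean()).abs() <= 2 * df.std()).all(axis=1)
    df = df[mask]
    # 2. Aggregate 5-minute samples to half-hour intervals: average
    #    weather-like variables, sum energy-like variables.
    df = df.resample("30min").agg({"oat": "mean", "energy": "sum"})
    # 3. Min-max scale each column (the paper scales to a fixed
    #    interval; [0, 1] is assumed here).
    return (df - df.min()) / (df.max() - df.min())
```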

6.2 Definition of the environment

The environment has to implement the functions described in Section 5.4, as they are used to calculate the energy and valve state.

6.2.1 Heating Energy model

This process is part of Step 2 in Figure 1. The heating energy model is used to calculate the heating energy consumed in the next state, which results from the action taken in the current state. The heating energy model is trained on the sequence of state variables over the last 3 hours, i.e., 6 samples at 30-minute intervals. The output of the heating energy model is the total historical heating energy over the next 30-minute interval.

The heating coils of the building operate in a hybrid mode, where the heating valve shuts off at times, so the heating energy drops to zero at those instants. This abrupt change cannot be modeled by a smooth LSTM model. We therefore train our model on contiguous sections where the heating coils were operating. During the evaluation phase, the valve model predicts the on/off state of the heating coils, and we predict the energy consumption only for those instances when the valve model determines the heating coils to be switched on.

The heating energy model is constructed by stacking 6 fully connected feed-forward neural network (FFN) layers of 16 units each, followed by 2 LSTM layers of 4 units each. The activation for each layer is ReLU. The FFN layers generate rich features from the input data, and the LSTM layers learn the time-based correlations. The learning rate is initially 0.001 and follows a linear schedule to ensure fast improvement at the beginning, followed by gradual improvement near the optimum, so that training does not oscillate around the optima. Mean squared error on validation data is used to terminate training. The model hyper-parameters were found by tuning via Bayesian optimization on a Ray Tune [20] cluster.
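Assuming a Keras implementation (the paper does not name the framework), the described stack might look like the following sketch; the optimizer and learning-rate schedule details are simplified:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_heating_energy_model(n_features: int, window: int = 6) -> keras.Model:
    """Sketch of the described architecture: 6 dense (FFN) layers of
    16 units feeding 2 stacked LSTM layers of 4 units; the output is
    the total heating energy over the next 30-minute interval."""
    inputs = keras.Input(shape=(window, n_features))  # 3 h of 30-min samples
    x = inputs
    for _ in range(6):
        # Dense layers apply per time step, producing rich features.
        x = layers.Dense(16, activation="relu")(x)
    x = layers.LSTM(4, return_sequences=True, activation="relu")(x)
    x = layers.LSTM(4, activation="relu")(x)
    outputs = layers.Dense(1)(x)  # predicted heating energy
    model = keras.Model(inputs, outputs)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
                  loss="mse")
    return model
```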

6.2.2 Valve State model

This process is also part of Step 2 in Figure 1. The valve model is used to classify whether the heating system is switched on or off, or, equivalently, whether the heating energy is positive or zero. The input to this model is the same as for the heating energy model. The output is the valve (heating coil) on-off state at the next time instant.

The valve model is constructed by stacking 4 fully connected feed-forward layers of 16 units each, followed by 2 LSTM layers of 8 units each. The activation for each layer is ReLU. The learning rate, validation data, and model hyper-parameters are chosen as before. The loss in this case is the binary cross-entropy loss, since this is a two-class prediction problem.

6.2.3 Cooling Energy model

This process is also part of Step 2 in Figure 1. The cooling energy model is used to calculate the cooling energy consumed in the next state when an action is taken in the current state. The input to this model is the same as for the heating energy model. The output of the model is the total historical cooling energy over the next 30-minute interval.

The cooling energy model is constructed by stacking 6 fully connected feed-forward layers of 16 units each, followed by 2 LSTM layers of 8 units each. The activation for each layer is ReLU. The learning rate, validation data, and model hyper-parameters are chosen in the same way as for the heating energy model.

Once the processes in Step 2 are completed, we construct the data-driven simulated environment. It receives the control action from the PPO controller and steps from its current state to the next state. The weather values for the next state are obtained by a simple time-based lookup from the "Weather Data" database. The supply air temperature for the next state is obtained from the "State Transition Model" using Equation 11. The reward is calculated using Equation 14. Each time the environment is called with an action, it performs this entire process and returns the next state and the reward to the RL controller, along with additional information on the current episode.

6.3 PPO Controller

This process is part of Step 3 in Figure 1. As discussed previously in Section 4.2, the controller learns two neural networks using the feedback it receives from the environment in response to its actions. Each action is generated by sampling from the distribution output by the policy network, as shown in the figure. After sampling responses from the environment a number of times, the experiences collected under the current controller parameters are used to update the policy network by optimizing the objective in Equation 10, and the value network by TD learning. We repeat this training process until the optimization converges to a local optimum.
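The clipped surrogate objective that the policy update optimizes (from PPO [35]) can be written as a short NumPy sketch; the clip range eps=0.2 is the common default, not necessarily the value used in this work:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate objective of PPO. `ratio` is
    pi_new(a|s) / pi_old(a|s) for the sampled actions, `advantage`
    the estimated advantages under the old policy."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # The controller maximizes the mean of the pessimistic bound.
    return np.mean(np.minimum(unclipped, clipped))
```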

The Policy Network architecture consists of two fully connected feed-forward layers with 64 units each. The Value Network structure is identical to the Policy Network. The networks are trained on-policy with a fixed learning rate. Each time, the networks were trained over 1e6 steps through the environment, which corresponded to approximately 10 episodes per iteration.

6.4 Evaluating the energy models, valve models, and the PPO controller

This corresponds to Step 4 in Figure 1. Once the energy models, the valve state model, and the controller training have converged, we evaluate them on held-out test data for 1 week. The energy models are evaluated using the Coefficient of Variation of the Root Mean Square Error (CVRMSE),

$$\mathrm{CVRMSE} = \frac{100}{\bar{y}}\,\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2},$$

where $y_i$ and $\hat{y}_i$ represent the true and the predicted value of the energy, respectively, and $\bar{y}$ is the mean of the true values.
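A NumPy sketch of the CVRMSE computation:

```python
import numpy as np

def cvrmse(y_true, y_pred):
    """Coefficient of variation of the RMSE, in percent: the RMSE of
    the predictions normalized by the mean of the true values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return 100.0 * rmse / np.mean(y_true)
```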

The valve model is evaluated based on its ROC-AUC, since the on-off dataset was found to be imbalanced. The controller policy is evaluated by comparing the cooling and heating energy savings, as well as how close the controller's set point for the AHU supply air temperature is to the building average set point avg-stpt.

6.5 Relearning Schedule

Steps 4 and 5 in Figure 1 are repeated by moving the data collection window forward by 1 week. We observed that a large overlap in training data between successive iterations helps the model retain previous information and gradually adapt to the changing data.

From the second iteration onward, we do not train the data-driven LSTM models (i.e., the energy and valve models) from scratch. Instead, we use the pre-trained models from the previous iteration to start learning on the new data. For the energy and valve models, we no longer train the FFN layers and only retrain the head layers comprising the LSTMs. The FFN layers learn the representation of the input data, and this representation is likely to remain stable across data windows. The LSTM layers, on the other hand, model the trend in the data, which must be relearned due to the distributional shift. Our results show that this training approach saves time with virtually no loss in model performance. We also adapt the pre-trained controller policy to the changes in the system. This continual learning approach saves time during repeated retraining and allows the data-driven models and the controller to adapt to the non-stationarity of the environment.
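The weekly relearning schedule can be sketched as a sliding-window generator; the 3-month/1-week window lengths follow Section 6.1, while the exact day counts are illustrative:

```python
from datetime import date, timedelta

def relearning_windows(start: date, end: date,
                       train_days: int = 90, eval_days: int = 7):
    """Sliding-window schedule for weekly relearning: each iteration
    trains on the trailing ~3 months and evaluates on the following
    week, then both windows advance by one week, so successive
    training sets overlap heavily."""
    windows = []
    t0 = start
    while t0 + timedelta(days=train_days + eval_days) <= end:
        train = (t0, t0 + timedelta(days=train_days))
        evalw = (train[1], train[1] + timedelta(days=eval_days))
        windows.append((train, evalw))
        t0 += timedelta(days=eval_days)  # advance by 1 week
    return windows
```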

7 Results

In this section we present the performance of our energy models, valve model, and the RL controller over multiple weeks.

7.1 Relearning Results for Heating Energy Model

Figure 4 shows the heating energy prediction on a subset of the data from October 7th to 23rd. We selected this time period because the effects of the non-stationarity in the data can be clearly seen. We compare the predictions of a fixed model, which is not updated after October 7th, with those of a model that is retrained by including the new week's data from the 7th to the 13th. The figure demonstrates the necessity of relearning the heating energy model at regular intervals. After October 12th, the AHU switches from the reheating to the preheating coil due to colder weather, as indicated by the wet-bulb temperature. This causes the heating energy consumption to change abruptly. The model that is not updated after October 7th cannot learn this behavior and keeps predicting the earlier pattern. The weekly relearning model also starts degrading, but once it is retrained using the data from October 7th to the 13th, it captures the changing behavior quickly using the small section of similar data in its training set. The overall CVRMSE for the relearning energy model is shown in Figure 5. For the majority of the weeks, the CVRMSE is below the threshold accepted by ASHRAE guidelines for energy prediction at half-hour intervals.

7.2 Relearning Results for Cooling Energy Model

Figure 6 shows the plots for predicting the cooling energy over a span of two weeks, together with the predictions of a fixed model. Starting from April 25th, both the fixed and relearning cooling energy predictions start degrading: they follow an increasing trend while the actual trend is downward, which is expected when learning on non-stationary data. However, the relearning cooling energy model is retrained using the data from April 19th to April 26th at the end of the week ending April 26th. Its predictions for the next week are therefore better than those of the fixed model, which degrade as the week progresses. The overall CVRMSE for the relearning energy model is shown in Figure 7. For all the weeks, the CVRMSE is below the threshold accepted by ASHRAE guidelines for energy prediction at half-hour intervals.

Figure 4: Comparison of true versus predicted Heating Energy for a weekly relearning model and a static/non-relearning model
Figure 5: The weekly CVRMSE of the Hot Water Energy Relearning Model for predicting Hot Water Energy consumption at half hour intervals
Figure 6: Comparison of true versus predicted Cooling Energy for a weekly relearning model and a static/non-relearning model
Figure 7: The weekly CVRMSE of the Cooling Energy Relearning Model for predicting Cooling Energy consumption at half hour intervals

7.3 Prediction of the Heating Valve status

Figure 9 shows the Area Under the Receiver Operating Characteristic curve (ROC AUC) for the model predicting the valve status (on/off). We also show the actual and predicted valve state for about one month in Figure 8. Overall, the relearning valve model accurately predicts the valve behavior.

Figure 8: Comparing True versus Predicted Hot Water Valve State behavior
Figure 9: Hot Water Valve State Prediction model ROC AUC evaluated over multiple weeks

7.4 Training Episode Reward

Figure 10: Average Cumulative Reward Obtained across each episode trained across 10 environments in parallel

We trained the PPO controller on the environment every week to adjust to the shift in the data. The cumulative reward metric is used to assess the improvement in controller performance across the weeks. We observed that even though the controller achieves good results after training on a couple of weeks of data, it keeps improving as the weeks progress. The cumulative reward metric is plotted in Figure 10. The occasional drops in the average reward are due to changing environment conditions as training progresses.

7.5 Cooling Energy Performance

We compared the cooling energy performance of both the adaptive reinforcement learning controller and a static reinforcement learning controller against a rule-based controller. A plot comparing the cooling energy consumed over part of the evaluation period is shown in Figure 11. We display this part of the timeline because it is significant for understanding why relearning is important. When we calculate the energy savings for each RL controller, the static RL controller shows slightly higher cooling energy savings, because its last version was trained during warmer weather and it tends to keep the building cooler. But when the outside temperature drops, the static controller's actions do not heat the system enough, so the VRF systems start heating the building, which consumes more energy. The cooling energy savings over the period shown in Figure 11 was for the Adaptive Controller and for the Static controller. The average weekly cooling energy savings over the entire evaluation period of 31 weeks was or kBTUs for the Adaptive Controller versus or kBTUs for the Non-Adaptive/Static Controller.

7.6 Heating Energy Performance

Similarly, we compared the heating energy performance of the adaptive and static controllers over the same timeline, as shown in Figure 12. This plot shows the severe over-cooling that can occur in the building when the controller is not updated regularly. Due to the lower set point chosen by the static controller, the total heating energy consumption of the building goes up over the entire period of cool weather. The heating energy savings over the period shown in Figure 12 was for the Adaptive Controller, while the Static controller increased the energy consumption by . The average weekly heating energy savings over the entire evaluation period of 31 weeks was or kBTUs for the Adaptive Controller, whereas the Non-Adaptive/Static Controller increased the energy consumption by or kBTUs.

The sum of the heating and cooling energy consumption under the historical rule-based controller, the adaptive controller, and the non-adaptive controller is shown in Figure 13. The adaptive controller consistently saves more energy than the non-adaptive controller. Overall, the adaptive controller saved 300.72 kBTUs per week on average, whereas the static controller saved only 30.03 kBTUs.

7.7 Control Actions

Here we show why the overall energy consumption of the building went up under the static controller. We plot the Discharge/Supply Air Temperature set point resulting from the actions of both the adaptive and static controllers, along with the outside air temperature and relative humidity, in Figure 14. On October 12th, the outside temperature drops, and both the adaptive and static controllers fail to maintain building comfort conditions. After October 13th, the adaptive controller is retrained on the last week's data, where it encounters environment states with lower outside air temperatures, and it subsequently adapts to those conditions. For the remainder of the analyzed time period, the adaptive controller keeps the Supply Air Temperature set point closer to the comfort conditions required by the occupants.

Figure 11: Plot of Cooling Energy Consumed for actions based on RBC, Adaptive RL controller and Static RL Controller
Figure 12: Plot of Heating Energy Consumed for actions based on RBC, Adaptive RL controller and Static RL Controller
Figure 13: Plot of Total Energy Consumed for actions based on RBC, Adaptive RL controller and Static RL Controller
Figure 14: Plot of Supply Air set point (SAT) based on actions chosen by the Adaptive RL controller versus the Static RL Controller


8 Conclusion

We demonstrated the effectiveness of including retraining in a data-driven reinforcement learning framework.

It may be argued that our reward only measures improvement against a baseline Rule Based Controller. In truth, we can only compare against controllers that select reasonable actions within the distribution of the data on which the data-driven models were trained. If we were to train our reinforcement learning controller without any such comparison, the exploratory behavior of reinforcement learning methods might have found even better control actions. But because we use data-driven models, the actions chosen by the controller could force those models to extrapolate and introduce out-of-distribution error. By comparing against a rule-based controller and constraining actions from veering too far from the current ones, we may forgo some savings, but we ensure that the data-driven models used in the environment do not lead us to spurious results through extrapolation.


  • [1] K. Amasyali and N. M. El-gohary (2018) A review of data-driven building energy consumption prediction studies. Renewable and Sustainable Energy Reviews 81, pp. 1192–1205. Cited by: §4.1.
  • [2] G. T. Costanzo, S. Iacovella, F. Ruelens, T. Leurs, and B. J. Claessens (2016-06) Experimental analysis of data-driven control for a building heating system. Sustainable Energy, Grids and Networks 6, pp. 81–90. External Links: Document, 1507.03638, ISSN 23524677 Cited by: §2.2.
  • [3] D. B. Crawley, C. O. Pedersen, L. K. Lawrie, and F. C. Winkelmann (2000) EnergyPlus: energy simulation program. ASHRAE Journal 42, pp. 49–56. Cited by: §2.1.
  • [4] G. Dulac-Arnold, D. J. Mankowitz, and T. Hester (2019) Challenges of real-world reinforcement learning. CoRR abs/1904.12901. External Links: Link, 1904.12901 Cited by: §3.
  • [5] L. Engstrom, A. Ilyas, S. Santurkar, D. Tsipras, F. Janoos, L. Rudolph, and A. Madry (2019) Implementation matters in deep rl: a case study on ppo and trpo. In International Conference on Learning Representations, Cited by: §4.2.
  • [6] P. Gajane, R. Ortner, and P. Auer (2019-05) Variational Regret Bounds for Reinforcement Learning. 35th Conference on Uncertainty in Artificial Intelligence, UAI 2019. External Links: 1905.05857, Link Cited by: §2.3.
  • [7] A. Hallak, D. Di Castro, and S. Mannor (2015-02) Contextual Markov Decision Processes. arXiv preprint arXiv:1502.02259. External Links: 1502.02259, Link Cited by: §2.3.
  • [8] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §4.1.
  • [9] A. H. Hosseinloo, A. Ryzhov, A. Bischi, H. Ouerdane, K. Turitsyn, and M. A. Dahleh (2020-01) Data-driven control of micro-climate in buildings; an event-triggered reinforcement learning approach. arXiv preprint arXiv:2001.10505. External Links: 2001.10505, Link Cited by: §2.1.
  • [10] G. N. Iyengar (2005-05) Robust dynamic programming. Vol. 30, INFORMS. External Links: Document, ISSN 0364765X Cited by: §2.3.
  • [11] T. Jaksch, R. Ortner, and P. Auer (2010) Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research 11 (Apr), pp. 1563–1600. Cited by: §2.3.
  • [12] S. Kakade and J. Langford (2002) Approximately optimal approximate reinforcement learning. In ICML, Vol. 2, pp. 267–274. Cited by: §4.2.
  • [13] S. M. Kakade (2002) A natural policy gradient. In Advances in neural information processing systems, pp. 1531–1538. Cited by: §4.2.
  • [14] D. W. Kim and C. S. Park (2011-12) Difficulties and limitations in performance simulation of a double skin façade with EnergyPlus. Energy and Buildings 43 (12), pp. 3635–3645. External Links: Document, ISSN 03787788 Cited by: §2.2.
  • [15] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and R. Hadsell (2017-03) Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences of the United States of America 114 (13), pp. 3521–3526. External Links: Document, 1612.00796, ISSN 10916490 Cited by: 2nd item.
  • [16] V. R. Konda and J. N. Tsitsiklis (2000) Actor-critic algorithms. In Advances in neural information processing systems, pp. 1008–1014. Cited by: §4.2.
  • [17] W. Kong, Z. Y. Dong, Y. Jia, D. J. Hill, Y. Xu, and Y. Zhang (2019) Short-Term Residential Load Forecasting Based on LSTM Recurrent Neural Network. IEEE Transactions on Smart Grid 10 (1), pp. 841–851. Cited by: §4.1.
  • [18] E. Lecarpentier and E. Rachelson (2019) Non-Stationary Markov Decision Processes a Worst-Case Approach using Model-Based Reinforcement Learning. In Advances in Neural Information Processing Systems, pp. 7214–7223. Cited by: §3.
  • [19] Y. Li, Y. Wen, D. Tao, and K. Guan (2019-07) Transforming Cooling Optimization for Green Data Center via Deep Reinforcement Learning. IEEE Transactions on Cybernetics 50 (5), pp. 2002–2013. External Links: Document, 1709.05077, ISSN 2168-2267 Cited by: §2.1.
  • [20] R. Liaw, E. Liang, R. Nishihara, P. Moritz, J. E. Gonzalez, and I. Stoica (2018) Tune: a research platform for distributed model selection and training. arXiv preprint arXiv:1807.05118. Cited by: §6.2.1.
  • [21] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2016-09) Continuous control with deep reinforcement learning. In 4th International Conference on Learning Representations, ICLR 2016 - Conference Track Proceedings, External Links: 1509.02971 Cited by: §1, §2.1, §2.2.
  • [22] M. Maasoumy, M. Razmara, M. Shahbakhti, and A. S. Vincentelli (2014) Handling model uncertainty in model predictive control for energy efficient buildings. Energy and Buildings 77, pp. 377–392. Cited by: §1.
  • [23] D. J. Mankowitz, T. A. Mann, P. Bacon, D. Precup, and S. Mannor (2018-04) Learning Robust Options. Technical report External Links: Link Cited by: §2.3.
  • [24] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis (2015-02) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. External Links: Document, ISSN 14764687 Cited by: §2.1.
  • [25] E. Mocanu, D. C. Mocanu, P. H. Nguyen, A. Liotta, M. E. Webber, M. Gibescu, and J. G. Slootweg (2018) On-line building energy optimization using deep reinforcement learning. IEEE transactions on smart grid 10 (4), pp. 3698–3708. Cited by: §2.2.
  • [26] T. Moriyama, G. De Magistris, M. Tatsubori, T. H. Pham, A. Munawar, and R. Tachibana (2018-10) Reinforcement Learning Testbed for Power-Consumption Optimization. In Communications in Computer and Information Science, Vol. 946, pp. 45–59. External Links: 1808.10427, ISBN 9789811328527, ISSN 18650929 Cited by: §2.1.
  • [27] A. Nagy, H. Kazmi, F. Cheaib, and J. Driesen (2018-05) Deep Reinforcement Learning for Optimal Control of Space Heating. arXiv preprint arXiv:1805.03777. External Links: 1805.03777, Link Cited by: §2.2.
  • [28] A. Naug, I. Ahmed, and G. Biswas (2019-06) Online energy management in commercial buildings using deep reinforcement learning. In Proceedings - 2019 IEEE International Conference on Smart Computing, SMARTCOMP 2019, pp. 249–257. External Links: Document, ISBN 9781728116891 Cited by: §1, §2.2, §5.1.
  • [29] A. Naug and G. Biswas (2018) Data driven methods for energy reduction in large buildings. In 2018 IEEE International Conference on Smart Computing (SMARTCOMP), pp. 131–138. Cited by: §2.2.
  • [30] S. Padakandla, P. K. J., and S. Bhatnagar (2019) Reinforcement learning in non-stationary environments. CoRR abs/1905.03970. External Links: Link, 1905.03970 Cited by: §2.3.
  • [31] C. Park (2013-08) Difficulties and issues in simulation of a high-rise office building. Cited by: §2.2.
  • [32] M. L. Puterman (2014) Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, Inc.. Cited by: §3.
  • [33] A. Rahman, V. Srikumar, and A. D. Smith (2018-02) Predicting electricity consumption for commercial and residential buildings using deep recurrent neural networks. Applied Energy 212, pp. 372–385. External Links: ISSN 0306-2619 Cited by: §4.1.
  • [34] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel (2015-02) Trust Region Policy Optimization. 32nd International Conference on Machine Learning, ICML 2015 3, pp. 1889–1897. External Links: 1502.05477, Link Cited by: §2.1.
  • [35] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §4.2.
  • [36] P. H. Shaikh, N. B. M. Nor, P. Nallagownden, I. Elamvazuthi, and T. Ibrahim (2014) A review on optimized control systems for building energy and comfort management of smart sustainable buildings. Renewable and Sustainable Energy Reviews 34, pp. 409–429. Cited by: §1.
  • [37] S. D. Shashua and S. Mannor (2017-03) Deep Robust Kalman Filter. arXiv preprint arXiv:1703.02310. External Links: 1703.02310, Link Cited by: §2.3.
  • [38] N. Singh, P. Dayama, and V. Pandit (2019) Change point detection for compositional multivariate data. arXiv preprint arXiv:1901.04935. Cited by: §2.3.
  • [39] F. Smarra, A. Jain, T. De Rubeis, D. Ambrosini, A. D’Innocenzo, and R. Mangharam (2018) Data-driven model predictive control using random forests for building energy optimization and climate control. Applied energy 226, pp. 1252–1272. Cited by: §1.
  • [40] R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT press. Cited by: §2.2, §3, §4.2.
  • [41] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour (2000) Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063. Cited by: §4.2.
  • [42] A. Tamar, S. Mannor, and H. Xu (2014) Scaling up robust mdps using function approximation. In International Conference on Machine Learning, pp. 181–189. Cited by: §2.3.
  • [43] T. Wei, Y. Wang, and Q. Zhu (2017) Deep reinforcement learning for building hvac control. In Proceedings of the 54th Annual Design Automation Conference 2017, pp. 1–6. Cited by: §1, §2.1.