Enforcing Policy Feasibility Constraints through Differentiable Projection for Energy Optimization

05/19/2021 ∙ by Bingqing Chen, et al. ∙ University of Colorado Boulder, Carnegie Mellon University

While reinforcement learning (RL) is gaining popularity in energy systems control, its real-world applications are limited because the actions from learned policies may not satisfy functional requirements or be feasible for the underlying physical system. In this work, we propose PROjected Feasibility (PROF), a method to enforce convex operational constraints within neural policies. Specifically, we incorporate a differentiable projection layer within a neural network-based policy to enforce that all learned actions are feasible. We then update the policy end-to-end by propagating gradients through this differentiable projection layer, making the policy cognizant of the operational constraints. We demonstrate our method on two applications: energy-efficient building operation and inverter control. In the building operation setting, we show that PROF maintains thermal comfort requirements while improving energy efficiency by 4% relative to the best-performing prior RL agent. In the inverter control setting, PROF perfectly satisfies voltage constraints on the IEEE 37-bus feeder system, as it learns to curtail as little renewable energy as possible within its safety set.



1. Introduction

There has been increasing interest in using learning-based methods such as reinforcement learning (RL) for applications in energy systems control. However, a fundamental challenge with many of these methods is that they do not respect the physical constraints or functional requirements associated with the systems in which they operate. Therefore, there have been many calls for embedding safety guarantees into learning-based methods in the context of energy systems applications (zhang2019deep; glavic2019deep; dobbe2020learning).

One common proposal to address this challenge is to provide machine learning methods with “soft penalties” to encourage them to learn feasible solutions. For instance, the authors of (zhang_buildsys2018; chen2019gnu) incentivize their RL-based building HVAC controller to satisfy thermal comfort constraints by adding a constraint violation penalty to the reward function. While such approaches often involve tuning some weight on the penalty term, recent work has proposed more theoretically-grounded approaches to choosing these weights; for instance, in the setting of approximating AC optimal power flow, the authors of (fioretto2020predicting; chatzos2020high) interpret the weight on their constraint violation penalty as a dual variable, and learn it via primal-dual updates. gupta2020deep adopt a similar approach in an inverter control problem. However, a challenge with these types of “soft penalty” methods in general is that while they incentivize feasibility, they do not strictly enforce it, which is potentially untenable in safety-critical applications.

Figure 1. The PROF framework. Our policy consists of a neural network followed by a differentiable projection onto a convexified set of operational constraints, $\hat{\mathcal{C}}$ (which is constructed via an approximate model, $\hat{f}$, of the environment). The differentiable projection layer enforces the constraints in the forward pass, and induces policy gradients that make the neural network cognizant of the constraints in its learning.

Given this limitation, a second class of approaches has aimed to strictly enforce operational constraints. For instance, in some cases, the outputs of a machine learning algorithm can be clipped post-hoc in order to make them feasible. However, a challenge is that such post-hoc corrections are not taken into account during the learning process, potentially negatively impacting overall performance. More recent approaches based in deep learning have therefore aimed to enforce simple classes of constraints in a way that can be taken into account during learning; for instance, zamzam2020learning train a neural network to approximate AC optimal power flow (OPF), and enforce box constraints on certain variables via sigmoid activations in the last layer of the neural network. In general, however, existing approaches have only been able to accommodate simple sets of constraints, prompting a need for methods that can incorporate broader classes of constraints.

In this work, we propose a method to enforce general convex constraints into RL-based controllers in a way that can be taken into account during the learning process. In particular, we construct a neural network-based policy that culminates in a projection onto a set of constraints characterized by the underlying system. While the “true” constraints associated with the system may be somewhat complex, we observe that simple, approximate physical models are often available for many systems of interest, allowing us to specify convex approximations to the relevant constraints. The projections onto these (approximate) sets can thus be characterized as convex optimization problems, allowing us to leverage recent developments in differentiable convex optimization (amos2017optnet; agrawal2019differentiable) to train our neural network and projection end-to-end using standard RL methods. The result is a powerful neural policy that can flexibly optimize performance on the true underlying dynamics, while still satisfying the specified constraints.

We demonstrate our PROjected Feasibility approach, PROF, on two settings of interest. Specifically, we explore a building operation setting in which the goal is to reduce energy consumption during the heating season, while ensuring the satisfaction of thermal comfort constraints. We additionally explore an inverter control setting where the goal is to mitigate curtailment, while satisfying inverter operational constraints and nodal voltage bounds. In both settings, we find that our controller achieves good performance with respect to the control objective, while ensuring that relevant operational constraints are satisfied.

To summarize, our key contributions are as follows:

  • A framework for incorporating convex constraints. We propose a projection-based method to flexibly enforce convex constraints within neural policies (as summarized in Figure 1). By examining the gradient fields of the differentiable projection layer, we recommend the incorporation of an auxiliary loss for more robust results. We also show in an ablation study (Section 5.3) that propagating gradients through the differentiable projection layer is indeed conducive to policy learning.

  • Demonstration on building control. In the building control setting, we show that PROF further improves energy efficiency by 10% and 4%, respectively, compared to the best-performing RL agents in (zhang_buildsys2018) and (chen2019gnu). By using a locally-linear assumption to approximate the building thermodynamics and thereby formulating the constraints as a polytope (zhao2017geometric; chen2020cohort), we largely maintain the temperature within the deadband, except when the control is saturated.

  • Demonstration on inverter control. In the inverter control setting, PROF satisfies the voltage constraints 100% of the time over more than half a million time steps (1 week at one second per time step), with a randomly initialized neural network, compared to 22% over-voltage violations incurred by a Volt/Var control strategy. With respect to the objective of minimizing renewable generation curtailment, PROF performs as well as possible within its conservative safety set after learning safely for a day.

2. Related Work

Our approach relies on recent developments in implicit neural network layers, and is thematically similar to several recent works in safe RL. We briefly discuss these topics, and refer interested readers to (dobbe2020learning; zhang2019deep; glavic2019deep; rolnick2019tackling; drgona2020all) for comprehensive reviews of relevant work in power and energy systems application domains.

Implicit layers

A neural network can be viewed as a composition of functions, or layers, with parameters that can be adjusted to improve performance on some task. While many of the layers commonly used within neural networks (e.g., convolutions or sigmoid functions) represent explicit functions that provide a direct mapping between inputs and outputs, there has recently been a great deal of interest in expanding the set of commonly-used layers to include those representing implicit functions (kolter2020deepimplicit). This has included the creation of layers capturing optimization problems (amos2017optnet; djolonga2017differentiable; tschiatschek2018differentiable; wang2019satnet; agrawal2019differentiable; gould2019deep), physical equations (de2018end; chen2018neural; greydanus2019hamiltonian), sequence modeling processes (bai2019deep), and games (ling2018game). In this work, we leverage advances in differentiable optimization in particular, namely by incorporating a differentiable convex optimization layer into our neural policy in order to project proposed control actions onto the feasible set of constraints.

Safe reinforcement learning

While (deep) RL methods in general lack safety or stability guarantees, there has been recent interest in learning RL-based controllers that attempt to maintain some notion of safety during training and/or inference – e.g., to satisfy physical constraints or avoid particularly negative outcomes (garcia2015comprehensive). These include methods that aim to determine “safe” regions of the state space by making smoothness assumptions on the underlying dynamics (wachi18safeexpl; berkenkamp17stability; turchetta16safeexpl; akametalu14reachability), methods that combine concepts from RL and control theory (morimoto2005robust; luo2014off; pinto2017robust; chang2019neural; han2019h; zhang2020policy; donti2021enforcing), approaches based on formal verification logics (hunt2020verifiably; hasanbeig2020deep; fulton2019verifiably), and methods that aim to bound some (discounted) cost function characterizing violations of safety constraints (yang2020projection; taleghan2018efficient; achiam17cpo; altman1999constrained). While the particular notion of “safety” considered varies between settings, relevantly to the present work, several of these prior works employ some form of differentiable projection within the loop of deep RL. For instance, within the context of constrained Markov decision processes (C-MDPs), yang2020projection project neural network-based policies onto a linearly-constrained set of policies with bounded cumulative discounted cost. In the context of asymptotic stability, donti2021enforcing project the actions output by their controller onto a convex set of actions satisfying stability specifications obtained via robust control. In the setting of robotic motion planning, pham2018optlayer project actions onto a linear set of robotic operational constraints, and apply separate updates to the neural network based on both pre-projection and post-projection actions. Similarly to this prior work, our approach employs differentiable projections within a neural network policy to enforce operational constraints over some planning horizon.

3. Preliminaries

We now present background on technical concepts used by PROF, namely reinforcement learning and differentiable projection layers.

3.1. Reinforcement Learning

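The goal of RL is to learn an optimal control policy through direct interaction with the environment. The problem is usually formulated as a Markov decision process (MDP). At each time step $t$, the agent selects an action $u_t$ given the current state $x_t$, using its policy $\pi_\theta$ (Equation 1). In many modern RL techniques, the policy is represented by a neural network parameterized by $\theta$. When the agent takes the action $u_t$, the state transitions to $x_{t+1}$ based on the system dynamics $f$ (Equation 2), and the agent receives a reward $r(x_t, u_t)$ (or equivalently, incurs a cost $c(x_t, u_t) = -r(x_t, u_t)$).

$$u_t \sim \pi_\theta(u_t \mid x_t) \tag{1}$$

$$x_{t+1} = f(x_t, u_t) \tag{2}$$

RL algorithms optimize for a policy that maximizes the expected cumulative reward, or equivalently, minimizes the expected cumulative cost, where $\gamma$ is a temporal discount factor:

$$\pi_\theta^\star = \arg\max_{\pi_\theta} \mathbb{E}\Big[\sum_{t} \gamma^t\, r(x_t, u_t)\Big] = \arg\min_{\pi_\theta} \mathbb{E}\Big[\sum_{t} \gamma^t\, c(x_t, u_t)\Big]. \tag{3}$$

To simplify notation, we will denote the expected cumulative cost as $J(\theta)$, i.e.,

$$J(\theta) = \mathbb{E}_{u_t \sim \pi_\theta}\Big[\sum_{t} \gamma^t\, c(x_t, u_t)\Big]. \tag{4}$$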

There are three general approaches to RL, namely value-based methods, policy gradient methods, and actor-critic methods. Value-based methods, e.g., Q-learning and its variants, update the value function of state-action pairs (the Q function) using the Bellman equation, and select the action that maximizes this value while exploring. Policy gradient methods, e.g., Proximal Policy Optimization (PPO) (schulman2017proximal), directly search for an optimal policy using estimates of policy gradients. Denoting the policy gradient as $g = \nabla_\theta J(\theta)$, the core idea of policy gradient algorithms is that they update $\theta$ based on an estimate, $\hat{g}$, of the gradient, i.e.,

$$\theta \leftarrow \theta - \alpha\, \hat{g} \tag{5}$$

for some learning rate $\alpha$. Different algorithms vary in how they obtain $\hat{g}$. For instance, the learning objective for PPO, which we use in our building control experiment (Section 5), is given by the following equation, where $\hat{A}_t$ is the generalized advantage estimate that can be estimated via any of the estimators in (schulman2015high):

$$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\Big[\min\Big(\rho_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\big(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\Big)\Big], \qquad \rho_t(\theta) = \frac{\pi_\theta(u_t \mid x_t)}{\pi_{\theta_{\text{old}}}(u_t \mid x_t)}, \tag{6}$$

and the estimate $\hat{g}$ is constructed based on this learning objective. Actor-critic methods, e.g., Advantage Actor-Critic (A2C), are hybrids of the value-based and policy gradient approaches, using a policy network to select actions (the actor) and a value network to evaluate the action (the critic).
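As a rough illustration of the clipped learning objective, the sketch below evaluates a PPO-style surrogate on a small batch in plain Python. It assumes that Equation 6 takes the standard clipped form from (schulman2017proximal); the function name `ppo_clip_objective` and the sample numbers are hypothetical and are not taken from the authors' code.

```python
def ppo_clip_objective(ratios, advantages, eps=0.2):
    """Batch average of the clipped surrogate (a quantity PPO maximizes).

    ratios[i]     : new-policy / old-policy probability of the sampled action
    advantages[i] : advantage estimate for that sample
    eps           : clipping range
    """
    total = 0.0
    for ratio, adv in zip(ratios, advantages):
        clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
        # The minimum makes the surrogate a pessimistic bound, so moving the
        # policy far from the old one is never rewarded.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(ratios)


if __name__ == "__main__":
    # Three hypothetical samples: ratio above, below, and inside the clip range.
    print(ppo_clip_objective([1.5, 0.5, 1.0], [1.0, -1.0, 2.0]))
```

In an actual implementation the ratios would come from the policy network and the objective would be differentiated by an automatic differentiation library; the sketch only shows how the clipping enters the objective value.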

3.2. Differentiable Projection Layers

As previously described, a neural network is a composition of parameterized functions (layers) whose parameters are adjusted during training via backpropagation (a class of gradient-based methods). Any function can be incorporated into a neural network as a layer provided that it satisfies two main conditions. The first condition is that it must have a forward procedure to map from inputs to outputs (i.e., do inference). The second is that it must have a backwards procedure to compute gradients of the outputs with respect to the inputs and function parameters, in order to enable backpropagation.

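With that in mind, consider the $\ell_2$-norm projection $\mathcal{P}_{\mathcal{C}}$ that maps from some point $z \in \mathbb{R}^n$ to its closest point in some constraint set $\mathcal{C} \subseteq \mathbb{R}^n$ as follows:

$$\mathcal{P}_{\mathcal{C}}(z) = \arg\min_{y \in \mathcal{C}}\ \tfrac{1}{2}\,\|y - z\|_2^2. \tag{7}$$

In cases where $\mathcal{C}$ is convex, Equation 7 is a convex optimization problem. The forward procedure of this operation can then be implemented by simply solving the optimization problem, e.g., using standard convex optimization solvers. Perhaps less evidently, it is also possible to construct a backwards procedure for this problem by using the implicit function theorem (krantz2012implicit), as described in previous work (e.g., (amos2017optnet; agrawal2019differentiable)).

As an example, consider the case where $\mathcal{C}$ characterizes linear constraints, i.e., $\mathcal{C} = \{y : Ay = b,\ Gy \le h\}$ for some $A$, $b$, $G$, and $h$. It is then possible to efficiently compute gradients through Equation 7 by implicitly differentiating through its KKT conditions, i.e., conditions that are necessary and sufficient to describe its optimal solutions. In particular, as described in (amos2017optnet), the KKT conditions for stationarity, primal feasibility, and complementary slackness for this case are given by

$$\begin{aligned} y^\star - z + A^\top \nu^\star + G^\top \lambda^\star &= 0, \\ A y^\star - b &= 0, \\ \mathrm{diag}(\lambda^\star)\,(G y^\star - h) &= 0, \end{aligned} \tag{8}$$

where $y^\star$, $\nu^\star$, and $\lambda^\star$ are the optimal primal and dual solutions. By the implicit function theorem, we can then take derivatives through these conditions at the optimum in order to obtain relevant gradients. Specifically, the total differentials of these KKT conditions are given by

$$\begin{aligned} \mathrm{d}y - \mathrm{d}z + \mathrm{d}A^\top \nu^\star + A^\top \mathrm{d}\nu + \mathrm{d}G^\top \lambda^\star + G^\top \mathrm{d}\lambda &= 0, \\ \mathrm{d}A\, y^\star + A\, \mathrm{d}y - \mathrm{d}b &= 0, \\ \mathrm{diag}(G y^\star - h)\, \mathrm{d}\lambda + \mathrm{diag}(\lambda^\star)\,(\mathrm{d}G\, y^\star + G\, \mathrm{d}y - \mathrm{d}h) &= 0. \end{aligned} \tag{9}$$

As described in (amos2017optnet), these equations can then be rearranged to solve for the Jacobians of any of the solution variables with respect to any of the problem parameters (or, in practice, to solve directly for these Jacobians’ left matrix-vector product with some backward pass vector, in order to reduce space complexity).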

While the above example is for the case of a linearly-constrained projection operation, these kinds of gradients can be computed for convex projection problems in general. For instance, donti2021enforcing compute gradients through a projection onto a second order cone by differentiating through the fixed point equations of their solver, and agrawal2019differentiable provide a method and library for differentiable disciplined convex programs. A key benefit of using these kinds of projection layers for constraint enforcement is that they allow gradients through the enforcement procedure to flow back to the neural network, thereby informing the parameter updates of this network during training.
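To make the forward and backward procedures concrete, the following is a minimal sketch for the special case of a single halfspace constraint, where the projection has a closed form and no solver is needed. This is an illustrative simplification, not the general KKT-based procedure or the cvxpylayers implementation; the function names are hypothetical. The backward pass returns the product of a backward-pass vector with the Jacobian without forming the Jacobian, and a finite-difference routine is included as a numerical check.

```python
def dot(p, q):
    return sum(pi * qi for pi, qi in zip(p, q))


def project_halfspace(z, a, b):
    """Forward pass: closest point to z in the set {y : a.y <= b}."""
    violation = dot(a, z) - b
    if violation <= 0.0:
        return list(z)                      # already feasible
    step = violation / dot(a, a)
    return [zi - step * ai for zi, ai in zip(z, a)]


def project_halfspace_vjp(z, a, b, grad_out):
    """Backward pass: grad_out^T (dP/dz), without forming the Jacobian."""
    if dot(a, z) - b <= 0.0:
        return list(grad_out)               # Jacobian is the identity
    # Outside the set the Jacobian is I - a a^T / ||a||^2, which removes the
    # component of the gradient normal to the constraint boundary.
    coef = dot(a, grad_out) / dot(a, a)
    return [gi - coef * ai for gi, ai in zip(grad_out, a)]


def finite_difference_vjp(z, a, b, grad_out, h=1e-6):
    """Central-difference estimate of the same product, for checking."""
    result = []
    for j in range(len(z)):
        z_hi = list(z)
        z_lo = list(z)
        z_hi[j] += h
        z_lo[j] -= h
        p_hi = project_halfspace(z_hi, a, b)
        p_lo = project_halfspace(z_lo, a, b)
        column = [(u - v) / (2.0 * h) for u, v in zip(p_hi, p_lo)]
        result.append(dot(grad_out, column))
    return result


if __name__ == "__main__":
    z, a, b = [2.0, 2.0], [1.0, 1.0], 2.0
    print(project_halfspace(z, a, b))
    print(project_halfspace_vjp(z, a, b, [1.0, 0.0]))
    print(finite_difference_vjp(z, a, b, [1.0, 0.0]))
```

For an infeasible input, the backward pass keeps only the gradient component tangent to the constraint boundary, which is the behavior that lets upstream parameters learn about the constraint.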

4. Approach

We now describe PROF, which incorporates differentiable projections onto convex(ified) sets of operational constraints within a neural policy.

4.1. Problem Formulation

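Consider a discrete-time dynamical system

$$x_{t+1} = f(x_t, u_t, w_t), \tag{10}$$

where $x_t$ is the state at time $t$, $u_t$ is the control input, $w_t$ is an uncontrollable disturbance (which we assume to be observable), and $f$ denotes the system dynamics. Letting $\mathcal{X}$ and $\mathcal{U}$ denote the allowable state and action space, respectively, we can define the set of all feasible actions over the planning horizon $T$ as $\mathcal{C}(x_0, w_{0:T-1})$, where

$$\mathcal{C}(x_0, w_{0:T-1}) = \big\{\, u_{0:T-1} \ :\ x_{t+1} = f(x_t, u_t, w_t),\ \ x_{t+1} \in \mathcal{X},\ \ u_t \in \mathcal{U},\ \ \forall t \in \{0, \dots, T-1\} \,\big\}. \tag{11}$$

Our goal is then to learn a policy $\pi_\theta$ that optimizes the control objective, $J(\theta)$, while enforcing the operational constraints. To simplify notation, we denote $\mathcal{C}_t \equiv \mathcal{C}(x_t, w_{t:t+T-1})$. In the case of a deterministic policy, i.e., $u_{t:t+T-1} = \pi_\theta(x_t)$, the learning problem is simply

$$\min_\theta\ J(\theta) \quad \text{s.t.} \quad \pi_\theta(x_t) \in \mathcal{C}_t \ \ \forall t. \tag{12}$$

In the case of a stochastic policy, e.g., $u_{t:t+T-1} \sim \pi_\theta(\cdot \mid x_t) = \mathcal{N}(\mu_\theta(x_t), \sigma^2 I)$, we can write the problem as

$$\min_\theta\ J(\theta) \quad \text{s.t.} \quad \mu_\theta(x_t) \in \mathcal{C}_t,\ \ u_{t:t+T-1} \in \mathcal{C}_t \ \ \forall t. \tag{13}$$

In this case, it is necessary to sample actions around $\mu_\theta(x_t)$ in order to estimate policy gradients. At the same time, the actions sampled from $\pi_\theta$ might fall outside of $\mathcal{C}_t$. Thus, we enforce that both $\mu_\theta(x_t)$ and the sampled action satisfy the constraints.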
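The requirement that both the policy mean and the sampled action be feasible can be sketched as two successive projections. The example below is only an illustration under simplifying assumptions: a box stands in for the constraint set (so the projection is a coordinate-wise clip rather than a convex program), and the function names and numbers are hypothetical.

```python
import random


def project_box(u, lo, hi):
    """Euclidean projection onto the box [lo, hi]^n (coordinate-wise clip)."""
    return [min(max(ui, lo), hi) for ui in u]


def sample_feasible_action(mean_raw, sigma, lo, hi, rng):
    """Project the mean, sample around it, then project the sample."""
    mean = project_box(mean_raw, lo, hi)           # feasible mean
    sample = [rng.gauss(m, sigma) for m in mean]   # exploration noise
    return mean, project_box(sample, lo, hi)       # feasible sample


if __name__ == "__main__":
    rng = random.Random(0)
    mean, action = sample_feasible_action([1.5, -0.2], 0.1, 0.0, 1.0, rng)
    print(mean, action)
```

Without the second projection, noise added to a mean that sits on the boundary would leave the feasible set roughly half of the time, which is why the sample is projected as well.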

4.2. Approximate Convex Constraints

In practice, there are two key challenges inherent in solving Equations 12 and 13 as written. The first is that the disturbances are not known ahead of time, meaning that the optimization problem must be solved under uncertainty. One approach to addressing this, from the field of robust control (zhou1998essentials), involves constructing an uncertainty set over the disturbance, and then optimizing for worst-case or expected cost under this uncertainty set. Here, we simply assume a predictive model of the disturbances is available. (By re-planning frequently, we observe that the prediction errors have limited empirical impact on performance in the two applications we study.) We will use the notation $\hat{w}_t$ to denote our forecast of the disturbance if $t$ is a future time step, and the true value of the disturbance if $t$ is the present or a prior time step.

The second challenge pertains to the form of the set $\mathcal{C}$, which may be poorly structured or otherwise difficult to optimize over. In particular, our framework relies on obtaining convex approximations to the constraints in order to enable differentiable projections (see Section 3.2). Fortunately, for many energy systems applications, some approximate model is often available based on domain knowledge that allows $\mathcal{C}$ to be approximated as a convex set, despite the complex nature of the true dynamical system.

Thus, letting $\hat{f}$ denote our approximation of the dynamics and $\hat{w}_t$ denote the (forecast or known) disturbance at each $t$, we define our approximate convex constraint set as

$$\hat{\mathcal{C}}(x_0, \hat{w}_{0:T-1}) = \big\{\, u_{0:T-1} \ :\ x_{t+1} = \hat{f}(x_t, u_t, \hat{w}_t),\ \ x_{t+1} \in \mathcal{X},\ \ u_t \in \mathcal{U},\ \ \forall t \in \{0, \dots, T-1\} \,\big\}. \tag{14}$$

We note that $f$ and $w$ are approximated solely for the purposes of constructing approximate constraint sets, and the approximations are not used otherwise during training and inference (i.e., our neural policy interacts with the true dynamics and disturbances during training and inference).

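procedure main(env, $J$)   // input: environment, control objective
     init neural network $\tilde{\pi}_\theta$, replay memory $\mathcal{M}$
     specify RL algorithm $\mathcal{A}$, batch size $M$, update interval $K$
     specify planning horizon $T$
     // online execution
     for $t = 0, 1, 2, \dots$ do
         observe state $x_t$
         predict future disturbances $\hat{w}_{t:t+T-1}$
         construct constraint set $\hat{\mathcal{C}}(x_t, \hat{w}_{t:t+T-1})$, policy $\pi_\theta$
         compute $u_t$ = inference($\pi_\theta$, $x_t$, $T$)
         execute action env.step($u_t$)
         save memory.append($x_t$, $u_t$, $\hat{w}_{t:t+T-1}$)
         // update policy every $K$ time steps
         if $t \bmod K = 0$ then
               $\theta$ = train($\pi_\theta$, $J$, $\mathcal{M}$, $\mathcal{A}$)
         end if
     end for
end procedure

procedure inference($\pi_\theta$, $x_t$, $T$)
     // input: neural policy, current state, planning horizon
     select action $u_{t:t+T-1} \sim \pi_\theta(x_t)$   // only return the current action; replan at each time step
     return $u_t$
end procedure

procedure train($\pi_\theta$, $J$, $\mathcal{M}$, $\mathcal{A}$)
     // input: neural policy, objective, replay memory, RL algorithm
     init $\ell$ = 0
     for $i = 1, \dots, M$ do
         sample $(x_i, u_i, \hat{w}_i) \sim \mathcal{M}$
         construct constraint set $\hat{\mathcal{C}}(x_i, \hat{w}_i)$, policy $\pi_\theta$
         compute training loss $\ell \leftarrow \ell + J(\theta) + \lambda \|\tilde{\pi}_\theta(x_i) - \pi_\theta(x_i)\|_2^2$
     end for
     train $\pi_\theta$ via $\mathcal{A}$ to minimize $\ell$
     return $\theta$
end procedure
Algorithm 1 PROF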

4.3. Policy Optimization


Let $\tilde{\pi}_\theta$ be any (e.g., fully-connected or recurrent) neural network parameterized by $\theta$. Our policy entails passing the output from the neural network to the differentiable projection layer characterized by the approximate constraints, which enforces that the resultant action is feasible with respect to these constraints. The overall (differentiable) neural policy is then given by

$$\pi_\theta(x_t) = \mathcal{P}_{\hat{\mathcal{C}}(x_t,\, \hat{w}_{t:t+T-1})}\big(\tilde{\pi}_\theta(x_t)\big). \tag{15}$$

The key benefit of embedding a differentiable projection into our policy is that it enforces constraints in a way that is visible to the neural network during learning. In this work, we implement the differentiable projection using the cvxpylayers library (agrawal2019differentiable).

We construct the following loss function, which is a weighted sum of the control objective $J(\theta)$ and an auxiliary loss term explained in Section 4.3.1; the weight $\lambda$ is a hyperparameter:

$$\ell(\theta) = J(\theta) + \lambda\, \big\|\tilde{\pi}_\theta(x_t) - \pi_\theta(x_t)\big\|_2^2. \tag{16}$$

We then train our policy (Equation 15) to minimize this cost using standard approaches in deep reinforcement learning. The full algorithm is presented in Algorithm 1.

Figure 2. Illustrative example of gradients from the differentiable projection layer. and denote unique optimal actions minimizing some convex control objective in the unconstrained and constrained settings, respectively; is thus a proxy for . (a) . The gradients point towards as desired, such that will reach this optimal point. (b) on the interior of . The gradients do not cause (or its projection) to update towards the interior. Adding a weighted auxiliary loss term, e.g., , can help direct updates towards the interior.

4.3.1. Visualization of gradient fields.

To provide more intuition on the differentiable projection layer and our cost function, we visualize the gradient fields in a hypothetical example with a deterministic policy and a fixed planning horizon. Specifically, for the purposes of illustration, let $u^\star$ and $u^\star_{\mathcal{C}}$ denote unique optimal actions minimizing some convex control cost $c$ in the unconstrained and constrained settings, respectively:

$$u^\star = \arg\min_{u}\ c(u), \qquad u^\star_{\mathcal{C}} = \arg\min_{u \in \mathcal{C}}\ c(u).$$

In Figure 2, we then plot the gradient fields in two cases: (a) $u^\star \notin \mathcal{C}$, and (b) $u^\star \in \mathcal{C}$. Note that $u^\star$ and $u^\star_{\mathcal{C}}$ are assumed to be known here for illustrative purposes only, and are not known during training.

In particular, we plot the gradients (black arrows) of $J$ with respect to the output of the neural network $\tilde{\pi}_\theta$. These indicate the direction in which the neural network would be incentivized to update in order to minimize the system cost. If no differentiable projection were embedded within the policy, all the gradients would point towards $u^\star$ without regard for the constraints. Instead, in the case of $u^\star \notin \mathcal{C}$ (Figure 2a), the gradients through the differentiable projection layer point towards $u^\star_{\mathcal{C}}$ instead of $u^\star$. More specifically, if $\tilde{\pi}_\theta \in \mathcal{C}$, then the projection layer is simply the identity, and the gradients point directly towards $u^\star$; otherwise, the gradients point along the boundary of $\mathcal{C}$ in the direction of $u^\star_{\mathcal{C}}$.

This case is of particular interest, as in many practical applications some operational constraint will be binding. As a concrete example, the ultimate energy-saving strategy for building operations is to keep all mechanical systems off (i.e., $u^\star = 0$), which obviously violates occupants’ comfort requirements and is outside the set of allowable actions (i.e., $u^\star \notin \mathcal{C}$). Thus, the problem is to find a policy that uses the mechanical system as little as possible without violating comfort requirements. Given the common case where the control objective is convex, the constrained optimum then lies on the boundary of the constraint set (i.e., $u^\star_{\mathcal{C}} \in \partial\mathcal{C}$).

We also depict the case where the solution of the unconstrained problem already satisfies the constraints, i.e., $u^\star \in \mathcal{C}$ (Figure 2b). If this is generally the case for a particular application, we note that a constraint enforcement approach (ours or otherwise) is likely not needed, and indeed utilizing gradients through the projection layer may actually degrade performance. Specifically, if $\tilde{\pi}_\theta \notin \mathcal{C}$, the gradients do not point towards the interior of the constraint set, meaning that $\pi_\theta$ will lie on the boundary of the constraints despite the optimal solution being in the interior. This can be amended by augmenting the loss function with a (weighted) auxiliary term such as $\|\tilde{\pi}_\theta - \pi_\theta\|_2^2$ whose gradients (blue arrows) point towards the interior.

It may not be known a priori whether or not $u^\star$ is in the constraint set in general or at any given time, except when domain experts are fully clear on the structure of the solutions for specific applications. In particular, $\mathcal{C}$ is time-varying, making it difficult to know for sure whether or not the constraints will indeed be binding at any given time. For robustness, we therefore recommend incorporating the auxiliary loss within the RL training cost, unless it is known from domain knowledge that the constraints will certainly be active. As such, we formulate the training cost function as previously given in Equation 16.
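The gradient behavior described in this section can be reproduced numerically on a toy problem. The sketch below is an illustration only, under assumptions that are not from the source: the constraint set is the unit box (so the projection is a coordinate-wise clip with a diagonal 0/1 Jacobian), the control cost is the squared distance to a known optimum inside the box, and the auxiliary term is the weighted squared distance between the network output and its projection. All names are hypothetical.

```python
def project_box(z, lo=0.0, hi=1.0):
    return [min(max(zi, lo), hi) for zi in z]


def loss_gradients(z, u_star, lam, lo=0.0, hi=1.0):
    """Gradients, with respect to the network output z, of
    cost = ||P(z) - u_star||^2 and aux = lam * ||z - P(z)||^2."""
    u = project_box(z, lo, hi)
    # Jacobian of the clip: 1 where z_i is strictly inside the bounds,
    # 0 where the coordinate is clipped.
    mask = [1.0 if lo < zi < hi else 0.0 for zi in z]
    grad_cost = [2.0 * (ui - si) * mi for ui, si, mi in zip(u, u_star, mask)]
    grad_aux = [2.0 * lam * (zi - ui) * (1.0 - mi)
                for zi, ui, mi in zip(z, u, mask)]
    return grad_cost, grad_aux


if __name__ == "__main__":
    z = [2.0, 0.2]          # network output outside the box in coordinate 0
    u_star = [0.5, 0.5]     # unconstrained optimum in the interior
    grad_cost, grad_aux = loss_gradients(z, u_star, lam=1.0)
    print("cost gradient:", grad_cost)
    print("auxiliary gradient:", grad_aux)
```

In this toy setting the cost gradient has no component in the clipped coordinate, so descent moves the output only along the boundary, while the auxiliary gradient is nonzero exactly in that coordinate and pulls the output back towards the set.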

5. Experiment 1: Energy-efficient Building Operation

There is significant potential to save energy through more efficient building operation. Buildings account for about 40% of the total energy consumption in the United States, and it is estimated that up to 30% of that energy usage may be reduced through advanced sensing and control strategies (fernandez2017impacts). However, this potential is largely untapped, as the heterogeneous nature of building environments limits the ability of control strategies developed for one building to scale to others (chen2019gnu). RL can address this challenge by adapting to individual buildings by directly interacting with the environment.

The most important constraint in building operation is to maintain a satisfactory level of comfort for occupants, while minimizing energy consumption. It is common in the RL-based building control literature to penalize thermal comfort violations (zhang_buildsys2018; chen2019gnu), which incentivizes but does not guarantee the satisfaction of these comfort requirements. In comparison, our proposed neural policy can largely maintain temperature within the specified comfortable range, except when the control is saturated.

We evaluate our policy in the same simulation testbed as (zhang_buildsys2018; chen2019gnu), following the same experimental setup as (chen2019gnu). Specifically, we first pre-train the neural policy by imitating a proportional-controller (P-controller). We then evaluate and further train our agent in the simulation environment, using a different sequence of weather data.

5.1. Problem Description

Simulation testbed

We utilize an EnergyPlus (E+) model of a 600 m² multi-functional space (Figure 3(a)), based on the Intelligent Workplace (IW) on the Carnegie Mellon University (CMU) campus, located in Pittsburgh, PA, USA. The system of interest is the water-based radiant heating system, of which a schematic is provided in Figure 3(b). In this experiment, we control the supply water temperature so as to maintain the state variable, i.e., the zone temperature, within a comfortable range during the heating season. In the existing control, the supply water (SW) is maintained at a constant flow rate, and its temperature is managed by a P-controller. For more information on the simulation testbed, refer to (zhang_buildsys2018).

Approximate system model

We approximate the environment as a linear system as follows:

$$x_{t+1} \approx \hat{f}(x_t, u_t, w_t) = A\, x_t + B_u\, u_t + B_w\, w_t, \tag{17}$$

where $x_t$ represents the zone temperature and $u_t$ represents the supply water temperature. The disturbance $w_t$ includes disturbances from weather and occupancy. While building thermodynamics are fundamentally nonlinear, the locally-linear assumption works well for many control inputs (privara2013building). We identify the approximate model parameters $A$, $B_u$, and $B_w$ with prediction error minimization (privara2013building) on the same data used to pre-train the RL agent (see Section 5.2). The root mean squared error (RMSE) of this model on an unseen test set is 0.14 °C.


Since our goal is to minimize energy consumption, we define the control cost at each time step as the agent’s control action, i.e., the supply water temperature, which is linearly proportional to the heating demand, i.e.,

$$c(x_t, u_t) = u_t.$$

In contrast to the objectives in (zhang_buildsys2018; chen2019gnu), which are defined as a weighted sum of energy cost and some penalty on thermal comfort violations, we treat the thermal comfort requirement as a hard constraint, in the form of Equation 13.


To maintain a satisfactory comfort level, we require the zone temperature to be within a deadband when the building is occupied, based on the building code requirement of 10% Predicted Percentage of Dissatisfied (PPD) (fanger1986thermal). We allow for a wider temperature range during unoccupied hours. For the action, the allowable range of supply water temperature for the physical system is .

While it may appear from this description that we have only simple box constraints on both the state and action, we highlight the fact that actions are coupled over time through the building thermodynamics (zhao2017geometric). More concretely, a future state depends on all past actions. Thus, a box constraint on $x_{t+1}$ is in fact a constraint on $u_{0:t}$. In this case, assuming $\hat{f}$ to be a linear system, $\hat{\mathcal{C}}$ is then a set of linear inequalities, which can be geometrically interpreted as a polytope (i.e., a set of the form $\{u : Gu \le h\}$). We refer interested readers to (chen2020cohort; zhao2017geometric) for more details on this formulation. In fact, it was experimentally demonstrated in (chen2020cohort) that projecting actions onto the polytope constructed with an approximate linear model was sufficient to maintain temperature within the deadband in a real-world residential household (though (chen2020cohort) did not then differentiate through this projection).
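The way state bounds turn into coupled constraints on the action sequence can be sketched for a scalar linear model. The example below is a simplified illustration, not the authors' implementation: it assumes a one-dimensional state and action, and the parameter names (`a`, `b_u`, `b_w`) and all numbers are hypothetical. It builds the inequality description of the feasible action sequences by unrolling the model over the horizon.

```python
def build_polytope(x0, w, a, b_u, b_w, x_lo, x_hi, u_lo, u_hi):
    """Return (G, h) such that {u : G u <= h} is the set of action sequences
    keeping x_1..x_T in [x_lo, x_hi] and every u_t in [u_lo, u_hi], for the
    scalar model x_{t+1} = a*x_t + b_u*u_t + b_w*w_t."""
    T = len(w)
    G, h = [], []
    free = x0  # state response with all actions set to zero
    for k in range(1, T + 1):
        free = a * free + b_w * w[k - 1]
        # x_k = free + sum_{j<k} a^(k-1-j) * b_u * u_j
        row = [b_u * a ** (k - 1 - j) if j < k else 0.0 for j in range(T)]
        G.append(row)
        h.append(x_hi - free)               # x_k <= x_hi
        G.append([-c for c in row])
        h.append(free - x_lo)               # x_k >= x_lo
    for j in range(T):
        unit = [1.0 if i == j else 0.0 for i in range(T)]
        G.append(unit)
        h.append(u_hi)                      # u_j <= u_hi
        G.append([-c for c in unit])
        h.append(-u_lo)                     # u_j >= u_lo
    return G, h


def is_feasible(G, h, u, tol=1e-9):
    return all(sum(g * ui for g, ui in zip(row, u)) <= hi + tol
               for row, hi in zip(G, h))


if __name__ == "__main__":
    G, h = build_polytope(x0=0.0, w=[0.0, 0.0], a=0.5, b_u=1.0, b_w=0.0,
                          x_lo=0.0, x_hi=10.0, u_lo=0.0, u_hi=20.0)
    print(is_feasible(G, h, [4.0, 4.0]), is_feasible(G, h, [10.0, 10.0]))
```

Note that the second action sequence respects the action bounds at every step yet is infeasible, because the earlier action carries over into the later state; this is the temporal coupling that a per-step clip cannot capture.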

Control time step

The EnergyPlus model has a 5-minute simulation time step. Following (zhang_buildsys2018; chen2019gnu), we use a 15-minute control time step (i.e., each action is repeated 3 times) and a planning horizon of $T = 12$ control steps (i.e., a 3-hour look-ahead).

(a) Geometric view
(b) System schematic
Figure 3. Building simulation testbed (reproduced from (chen2019gnu)).

5.2. Implementation Details

Offline pre-training

We pre-train a long short-term memory (LSTM) recurrent policy (without a subsequent projection) by imitating a P-controller operating under the Typical Meteorological Year 3 (TMY3) (wilcox2008users) weather sequence, from Jan. 1 to Mar. 31. We min-max normalize all state, action, and disturbance variables, and use a learning rate of

. Specifically, we use the pre-trained weights after training on the expert demonstrations for 20 epochs, following the same procedures as (chen2020towards). We refer readers to (chen2020towards) for more details on the neural network architecture, training procedures, loss, and performance evaluation.

Online policy learning

We optimize the policy with PPO (schulman2017proximal) over the weather sequence in 2017 from Jan. 1 to Mar. 31. We use (see Equation 16), a learning rate of

, and RMSprop

(tieleman2012lecture) as the optimizer (the code is available at https://github.com/INFERLab/PROF). We update the policy every four days, by iterating over the samples collected in that period for 8 epochs with a batch size of 32. For hyperparameters, we use a temporal discount rate of $\gamma$ = 0.9, a clipping range of $\epsilon$ = 0.2 (see Equation 6), and a Gaussian policy (see Equation 13) with $\sigma$ linearly decreased from 0.1 to 0.01.
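The update schedule implies a modest number of gradient steps per policy update, which the following sketch works out. It rests on assumptions not stated in the source: that every 15-minute control step within the four days contributes exactly one sample, and that the standard deviation is decayed linearly over a number of steps chosen by the user. The function names are hypothetical.

```python
def updates_per_policy_update(days=4, control_step_min=15,
                              batch_size=32, epochs=8):
    """Samples collected between updates and minibatch steps per update."""
    samples = days * 24 * 60 // control_step_min
    minibatches_per_epoch = samples // batch_size
    return samples, minibatches_per_epoch * epochs


def sigma_schedule(step, total_steps, start=0.1, end=0.01):
    """Linear decay of the Gaussian policy's standard deviation."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)


if __name__ == "__main__":
    print(updates_per_policy_update())
    print(sigma_schedule(0, 100), sigma_schedule(50, 100), sigma_schedule(100, 100))
```

Under these assumptions, each update sees 384 samples and performs 96 minibatch steps, which gives a sense of how little data the agent uses per update.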

5.3. Results

(a) The differentiable projection layer enforces preheating behavior to ensure deadband constraints are never violated, even though this behavior is not present in the expert demonstrations.
(b) The agent has found a more energy-efficient control strategy by maintaining temperature at the lower end of the deadband.
Figure 4. Behavior of our proposed agent (a) at the onset of deployment, with pre-trained weights based on expert demonstrations and (b) after a month of interacting with and training on the environment.

After pre-training on expert demonstrations from the baseline P-controller, our agent directly operated the simulation testbed based on actual weather sequences in Pittsburgh from Jan. 1 to Mar. 31 in 2017. Figure 4(a) shows the behavior of our agent at the onset of deployment over a 3-day period. The baseline P-controller reactively turns on heating when the environment switches from unoccupied to occupied, which results in thermal comfort violations in the mornings. In comparison, PROF preheats the environment such that the environment is already at a comfortable temperature when occupants arrive in the morning. Notably, the differentiable projection layer manages to enforce this preheating behavior despite this behavior not being present in the expert demonstrations.

Figure 4(b) shows the behavior of our agent in comparison with Gnu-RL (chen2019gnu), having interacted with and trained on the environment for a month. Gnu-RL is updated via PPO, similarly to the current work, and incorporates domain knowledge on system dynamics. In comparison to Gnu-RL (chen2019gnu), which ends up trying to maintain temperature at the setpoint, PROF learns an energy-saving behavior by maintaining the temperature at the lower end of the deadband. This explains the further energy savings compared with Gnu-RL (chen2019gnu). However, we also notice that the temperature requirement may be violated on cold mornings. This happens when the control action is saturated, i.e., full heating over the 3-hour planning horizon is not sufficient to bring temperature back to the comfortable range. (In principle, even these constraint violations could be mitigated by increasing the length of the planning horizon.)

Table 1 summarizes the performance of our agent with comparison to the RL agents in (zhang_buildsys2018; chen2019gnu). Our proposed agent (averaged over 5 random seeds) saves 10% and 4% energy compared to the best-performing agents in (zhang_buildsys2018) and (chen2019gnu), respectively.

| Method | Heating Demand (kW) | PPD Mean (%) | PPD SD (%) |
|---|---|---|---|
| Existing P-Controller (zhang_buildsys2018) | 43709 | 9.45 | 5.59 |
| Agent #6 (zhang_buildsys2018) | 37131 | 11.71 | 3.76 |
| Baseline P-Controller (chen2019gnu) | 35792 | 9.71 | 6.87 |
| Gnu-RL (chen2019gnu) | 34687 | 9.56 | 6.39 |
| LSTM & Clip + No Update | 37938 | 8.55 | 3.39 |
| LSTM & Clip | 36068 ± 2187 | 9.18 ± 0.67 | 3.49 |
| PROF (ours) | 33271 ± 1862 | 9.68 ± 0.48 | 3.66 |

Table 1. Performance comparison. Our method saves energy while incurring minimal comfort violations.
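The reported savings follow directly from the heating demand column of Table 1. The short check below recomputes them; the dictionary keys and the helper name are illustrative.

```python
heating_demand = {
    "Agent #6 (zhang_buildsys2018)": 37131,
    "Gnu-RL (chen2019gnu)": 34687,
    "PROF": 33271,
}


def percent_savings(baseline, ours):
    """Relative reduction in heating demand, in percent."""
    return 100.0 * (baseline - ours) / baseline


savings = {
    name: percent_savings(value, heating_demand["PROF"])
    for name, value in heating_demand.items()
    if name != "PROF"
}
```

The two entries of `savings` come out at roughly 10.4% and 4.1%, which round to the 10% and 4% quoted in the text.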

We also compare our method to two ablations: (1) LSTM & Clip + No Update, which uses the same pre-trained weights and the projection layer to enforce feasible actions, but does not update the policy, and (2) LSTM & Clip, which uses the same pre-trained weights and the projection layer to enforce feasible actions during inference, but does not propagate gradients through the differentiable projection layer in the policy updates. We find that LSTM & Clip slightly improves upon LSTM & Clip + No Update, but is less performant compared to PROF. This affirms our hypothesis that the gradients through the differentiable projection layer are cognizant of the constraints and are thus conducive to policy learning.
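The difference between propagating and not propagating gradients through the projection can be pictured with a single linear constraint. The sketch below assumes a hypothetical halfspace constraint `g.x <= h` instead of the building polytope, and the function names are illustrative. When the constraint is active, the Jacobian of the projection removes the gradient component that pushes against the constraint; one reading of the LSTM & Clip ablation is that its update does not see this correction.

```python
def project_halfspace(a, g, h):
    """Euclidean projection of a onto {x : g.x <= h}."""
    viol = sum(gi * ai for gi, ai in zip(g, a)) - h
    if viol <= 0.0:
        return list(a)                       # already feasible
    gg = sum(gi * gi for gi in g)
    return [ai - viol / gg * gi for ai, gi in zip(a, g)]


def grad_through_projection(a, g, h, grad_out):
    """Vector-Jacobian product: gradient with respect to the raw action a."""
    viol = sum(gi * ai for gi, ai in zip(g, a)) - h
    if viol <= 0.0:
        return list(grad_out)                # Jacobian is the identity inside the set
    gg = sum(gi * gi for gi in g)
    dot = sum(gi * go for gi, go in zip(g, grad_out))
    # Jacobian is I - g g^T / ||g||^2 when the constraint is active
    return [go - dot / gg * gi for go, gi in zip(grad_out, g)]
```

With `g = [1, 0]`, `h = 1`, a raw action `[2, 0.5]`, and an upstream gradient `[1, 1]`, the projected action is `[1, 0.5]` and the gradient reaching the network is `[0, 1]`: the component along the violated constraint is zeroed out.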

6. Experiment 2: Inverter Control

Distributed energy resources (DERs), e.g., solar photovoltaic (PV) panels and energy storage, are becoming increasingly prevalent in an effort to curb carbon dioxide emissions and combat climate change. However, DERs interfacing with the power grid via power electronics, such as inverters, also introduce unintended challenges for grid operators. For instance, over-voltages have become a common occurrence in areas with high renewable penetration (9228929), and power electronics-interfaced generation has low-inertia and requires active control at much faster timescales compared to traditional synchronous machines (milano2018foundations).

To alleviate these issues, IEEE standard 1547.8-2018 (IEEE1547) recommends a Volt/Var control strategy in which the reactive power contribution of an inverter is based on local voltage measurements. As will be clear in our empirical evaluation, this network-agnostic heuristic based on local information alleviates, but does not avoid, over-voltage issues. Given that the optimal solution needs to be obtained at the system level and that the problem needs to be solved at very short timescales, a common paradigm is to address the problem in a quasi-static fashion (jalali2019designing), as adopted in works such as (baker2017network; jalali2019designing; gupta2020deep), where one chooses a policy over the next time period, e.g., 15 minutes to 1 hour, and uses the policy without update for fast inference. In this work, we adopt the same paradigm and consider real-time control on a 1-second timescale of both active (P) and reactive (Q) power setpoints at each inverter.

We envision that a neural policy can learn from its prior experiences, in contrast to the traditional fit-and-forget approach (dobbe2020learning), and is capable of making decisions faster compared to solving optimization problems. Our primary contribution compared to existing work is the ability to enforce physical constraints within the neural network. In fact, we successfully enforce voltage constraints 100% of the time with a randomly initialized neural network, over more than half a million time-steps (i.e., 1 week with a one-second time step). The assumed control and communication scheme is consistent with the new definitions for smart inverter capabilities under IEEE standard 1547.1-2020 (IEEE15472020).

6.1. Problem Description

The problem we are considering here is to control active and reactive power setpoints at each inverter in order to maximize utilization (i.e., minimize curtailment) of renewable generation, while satisfying the maximum and minimum grid voltage requirements. Here, we first define the considered test case and input data, and describe the model of the network. We refer readers to (baker2017network) for more details on the problem set-up.

IEEE 37-bus test case

We evaluate our method on the IEEE 37-bus distribution test system (37node), with 21 solar PV systems indicated by green rectangles in Figure 5. We utilize a balanced, single-phase equivalent of the system, and simulate the nonlinear AC power flows using PYPOWER (matpower). For the simulation, the solar generation and loads are based on 1-second solar irradiance and load data collected from a neighborhood in Rancho Cordova, CA (Bank13) over a period of one week (604800 samples).

Figure 5. IEEE 37-bus feeder system, where the solar PV systems are indicated by green rectangles.

Approximate system model

Denote the number of buses, excluding the slack bus (e.g., the distribution substation), as , the net active and the reactive power as and , and the voltage at all buses as . We linearize the AC power flow equations around the flat voltage solution, i.e. , using the method in (bolognani2015fast). The reference active and reactive power corresponding to is denoted as and . The linearized grid model, , is given by Equation 18, where represent system-dependent network parameters that can be either estimated from linearization (e.g., (bolognani2015fast)) or data-driven methods:


A notable advantage of the method in (bolognani2015fast) is that the resulting model has bounded error with respect to the true dynamics. By incorporating the error bound when constructing the safety set, the safety set is guaranteed to be a conservative under-approximation of the true safety set, which allows us to satisfy voltage constraints 100% of the time.
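One way to picture how a bounded model error yields a conservative safety set is to shrink the voltage limits by the bound before checking the model prediction. The sketch below rests on two assumptions: the linearized model is affine in the active and reactive power, with hypothetical parameter names `R`, `B`, and `a`, and the error bound is applied uniformly to both limits. The numerical values in the usage note are illustrative, not taken from the test case.

```python
def predicted_voltage(R, B, a, p, q):
    """Affine voltage estimate v_hat = R p + B q + a (assumed model form)."""
    n = len(p)
    return [
        sum(Ri[j] * p[j] + Bi[j] * q[j] for j in range(n)) + ai
        for Ri, Bi, ai in zip(R, B, a)
    ]


def in_conservative_set(v_hat, v_min, v_max, err_bound):
    """Check the prediction against limits shrunk by the model error bound.

    If |v_true - v_hat| <= err_bound at every bus, passing this check
    implies v_min <= v_true <= v_max at every bus.
    """
    return all(v_min + err_bound <= v <= v_max - err_bound for v in v_hat)
```

For instance, with hypothetical limits of 0.95 and 1.05 per unit, a predicted voltage of 1.04 passes the check for an error bound of 0.005 but fails it for an error bound of 0.02, which is the sense in which the approximate safety set under-approximates the true one.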


Our policy takes as input the voltage from the previous time-step, load, and generation at all the buses, and outputs active and reactive power setpoints at each inverter. (This is a deterministic policy; see Equation 12.) Note that while the grid model (Equation 18) contains all buses, only those with inverters are controllable.

Our neural architecture is similar to the one used in (gupta2020deep), which consists of a utility-level network, and inverter-level networks for individual inverters. The utility-level network collects information from all nodes, and broadcasts an intermediate representation to all inverter-level networks. Using this information along with its local observations, each inverter makes its local control decisions, which are then projected onto the constraints (discussed below).

Figure 6. PROF satisfies voltage constraints throughout the experiment, and learns to minimize curtailment as well as possible within its conservative safety set after learning safely for a day.

The objective is to minimize the curtailment of solar generation, or equivalently to maximize the utilization of the available solar power, . Specifically, letting denote the set of buses with inverters, the objective is


For an individual inverter, , with rated power and an available power (from available solar generation) , the feasible action space is

At the same time, the voltage at each bus should remain between the minimum and maximum allowable limits. The primary challenge of satisfying voltage constraints is that the voltage at each bus depends on the actions of neighboring nodes, i.e.

where the sparsity pattern of is characterized by the admittance matrix. We jointly project actions from all inverters at each time step onto the constraints .
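To give a feel for what projecting onto an inverter's feasible set involves, the sketch below handles a single inverter under a commonly used capability model, which is an assumption here: active power between zero and the available power, and apparent power within the rating. It uses Dykstra's alternating projection, a swapped-in technique chosen because it needs no solver; it is neither differentiable nor the joint projection over all inverters and voltage constraints described above. All function and parameter names are hypothetical.

```python
import math


def proj_slab(p, q, p_av):
    """Project onto {0 <= p <= p_av}; q is unconstrained."""
    return min(max(p, 0.0), p_av), q


def proj_disk(p, q, s_rated):
    """Project onto {p^2 + q^2 <= s_rated^2}."""
    norm = math.hypot(p, q)
    if norm <= s_rated:
        return p, q
    return p * s_rated / norm, q * s_rated / norm


def project_inverter(p, q, p_av, s_rated, iters=200):
    """Dykstra's alternating projection onto the intersection of both sets."""
    x = (p, q)
    inc_a = (0.0, 0.0)   # correction term associated with the slab
    inc_b = (0.0, 0.0)   # correction term associated with the disk
    for _ in range(iters):
        za = (x[0] + inc_a[0], x[1] + inc_a[1])
        y = proj_slab(za[0], za[1], p_av)
        inc_a = (za[0] - y[0], za[1] - y[1])
        zb = (y[0] + inc_b[0], y[1] + inc_b[1])
        x = proj_disk(zb[0], zb[1], s_rated)
        inc_b = (zb[0] - x[0], zb[1] - x[1])
    return x
```

For a raw setpoint of (0.9, 0.9) with an available power of 0.6 and a rating of 1.0, the iterates approach the corner (0.6, 0.8), where both the available-power limit and the rating are active.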

6.2. Implementation Details

We evaluate PROF by executing it once over the 1-week dataset (at 1-second resolution). Similarly to other quasi-static approaches, we update the policy every 15 minutes. Similarly to (gupta2020deep), we optimize the neural policy with stochastic samples by directly differentiating through the objective (Equation 19) and the linearized grid model (Equation 18). However, our method differs in how the constraints are handled: (gupta2020deep) characterized the constraints as a regularization term and learned the policy via primal-dual updates, whereas we incorporate the constraints directly via the differentiable projection layer and thus guarantee constraint satisfaction.

We use =10 (see Equation 16), a learning rate of , and RMSprop (tieleman2012lecture) as the optimizer. Every 15 minutes, we sample 16 batches of data of size 64 from the replay memory. We keep a replay memory of size 86400, i.e., samples from the previous day. For both the utility-level network and the inverter-level networks, we use fully-connected layers with ReLU activations. The utility-level network has hidden layer sizes (256, 128, 64), and each inverter-level network has hidden layer sizes (16, 4) and outputs active and reactive power. On top of the neural network, we implement the differentiable projection layer, following the constraints described above.
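The update schedule can be sketched with a bounded replay memory. This is an illustration under assumptions, not the released training loop: the stored items are placeholders for the actual transition data, sampling without replacement within a batch is our choice, and the names are hypothetical.

```python
import random
from collections import deque

UPDATE_EVERY = 15 * 60        # seconds between policy updates
MEMORY_SIZE = 24 * 60 * 60    # one day of 1-second samples (86400)
N_BATCHES, BATCH_SIZE = 16, 64


def sample_batches(memory, rng):
    """Draw the mini-batches used for one policy update."""
    data = list(memory)
    return [rng.sample(data, BATCH_SIZE) for _ in range(N_BATCHES)]


def run(n_steps, seed=0):
    """Step through n_steps seconds and count the policy updates."""
    rng = random.Random(seed)
    memory = deque(maxlen=MEMORY_SIZE)    # oldest samples are evicted automatically
    n_updates = 0
    for t in range(n_steps):
        memory.append(t)                  # placeholder for the stored sample
        if (t + 1) % UPDATE_EVERY == 0:
            batches = sample_batches(memory, rng)
            assert len(batches) == N_BATCHES
            n_updates += 1                # a gradient step per batch would go here
    return n_updates, len(memory)
```

Over the full week of 604800 one-second steps, this schedule corresponds to 672 policy updates, each drawing on at most the most recent day of data.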


We compare our method to three baselines: (1) a Volt/Var strategy following IEEE 1547.8 (IEEE1547), (2) the optimal solution with respect to the linearized grid model, and (3) the optimal solution with respect to the true AC power flow equations.

6.3. Results

The performance of PROF in comparison to the three baselines is summarized in Figure 6. For clarity, we only show the maximum voltage over all buses; under-voltage is not a concern for this particular test case.

We see that the Volt/Var strategy violates voltage constraints 22.3% of the time, mostly around noon when the solar generation is high and there is a surplus of energy. Since the Volt/Var baseline does not adjust active power, there is no curtailment.

In comparison, PROF satisfies the voltage constraints throughout the experiment, even with a randomly initialized neural policy. While PROF performs poorly on the first morning, it quickly improves its policy. In fact, the behavior of PROF is barely distinguishable from the optimal solution with respect to the linearized grid model, after learning safely for a day. This implies that PROF learned to control inverters as well as possible given its approximate model, which constructs a conservative under-approximation of the true safety set.

The optimal baseline with respect to the true AC power flow equations unsurprisingly achieves the best performance with respect to minimizing curtailment, as it can push the maximum voltage to the allowable limit in order to maximally reduce the amount of curtailed energy. However, inverter control is a task that requires near real-time inputs, and we find that running this baseline can be prohibitively slow. Specifically, we evaluate the computation time of different operations by averaging over 1000 randomly sampled problems from our dataset on a personal laptop. For PROF, on average, a forward pass in the neural network (excluding the projection layer) took 4.5 ms and the differentiable projection operation took 8.6 ms. The computation cost of the differentiable projection could be further reduced by using customized projection solvers such as the ones in (amos2017optnet; donti2021enforcing) that avoid the “canonicalization” costs introduced by general-purpose solvers such as the one we use (agrawal2019differentiable). In comparison, solving the optimization baseline with respect to the true AC power flow equations took 1.02s on the same machine, which is even longer than the 1s control time-step.
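The timing figures above can be put side by side against the 1-second control step. The snippet below only restates the reported averages; the variable names are illustrative.

```python
CONTROL_STEP_MS = 1000.0      # 1-second control time step

forward_ms = 4.5              # neural network forward pass (excluding projection)
projection_ms = 8.6           # differentiable projection
ac_opf_ms = 1020.0            # optimization with the true AC power flow equations

prof_ms = forward_ms + projection_ms        # total per-step cost for PROF
prof_fraction = prof_ms / CONTROL_STEP_MS   # share of the control step used
speedup = ac_opf_ms / prof_ms               # AC baseline cost relative to PROF
```

PROF thus uses about 13.1 ms per decision, a little over 1% of the control step, whereas the AC baseline alone exceeds the step and is slower by a factor of roughly 78.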

7. Discussion and Conclusions

In this work, we have presented a method, PROF, for integrating convex operational constraints into neural network policies for energy systems applications. In particular, we propose a policy that entails passing the output of a neural network to a differentiable projection layer, which enforces a convex approximation of the operational constraints. These convex constraint sets are obtained using approximate models of the system dynamics, which can be fit using system data and/or constructed using domain knowledge. We can then train the resultant neural policy via standard RL algorithms, using an augmented cost function designed to effect desirable policy gradients. The result is that our neural policy is cognizant of relevant operational constraints during learning, enhancing overall performance.

We find in both the building energy optimization and inverter control settings that PROF successfully enforces relevant constraints while improving performance on the control objective. In particular, in the building thermal control setting, we find that our approach achieves a 4% energy savings over the state of the art while largely maintaining the temperature within the deadband. In the inverter control setting, our method perfectly satisfies the voltage constraints over more than half a million time steps, while learning to minimize curtailment as much as possible within the safety set.

While these results demonstrate the promise of our method, a key limitation is in its computational cost. In particular, computing a projection during every forward pass of training and inference is decidedly more expensive than running a “standard” neural network. A fruitful area for future work – both in the context of our method, and in the context of research in differentiable optimization layers as a whole – may be to improve the speed of such differentiable projection layers. For instance, this might entail developing special-purpose differentiable solvers (amos2017optnet; donti2021enforcing) for optimization problems commonly encountered in energy systems applications, developing approximate solvers that do not rely on obtaining optimal solutions in order to compute reasonable gradients, or employing cheaper projection schemes such as -projection (shah2020solving) where possible.

Additionally, the success of our method (and many other constraint enforcement methods) depends fundamentally on the quality of the approximate model used to characterize the constraint sets. In particular, this determines the extent to which the resultant approximate constraint sets are a good representation of the true operational constraints. While we were able to employ reasonably high-quality approximation schemes in the context of this work, future work on safely updating the models or the constraint sets directly (fisac2019bridging) may greatly improve the quality of the solutions.

More generally, while our work highlights one approach to enforcing physical constraints within learning-based methods, we believe this is only the start of a broader conversation on closely integrating domain knowledge and control constraints into learning-based methods. In particular, strictly enforcing physical constraints will be paramount to the real-world success of these methods in energy systems contexts, and we hope that our paper will serve to spark further inquiry into this important line of work.

8. Acknowledgments

This material is based, in part, on work supported by Carnegie Mellon University’s College of Engineering Moonshot Award for Autonomous Technologies for Livability and Sustainability (ATLAS). This work was also supported by the U.S. Department of Energy Computational Science Graduate Fellowship (DE-FG02-97ER25308), the Center for Climate and Energy Decision Making through a cooperative agreement between the National Science Foundation and Carnegie Mellon University (SES-00949710), the Computational Sustainability Network, and the Bosch Center for AI. The work of K. Baker is supported by the National Science Foundation CAREER award 2041835.