An Optimal Treatment Assignment Strategy to Evaluate Demand Response Effect

10/02/2016 ∙ by Pan Li, et al. ∙ University of Washington 0

Demand response is designed to motivate electricity customers to modify their loads at critical time periods. The accurate estimation of impact of demand response signals to customers' consumption is central to any successful program. In practice, learning these response is nontrivial because operators can only send a limited number of signals. In addition, customer behavior also depends on a large number of exogenous covariates. These two features lead to a high dimensional inference problem with limited number of observations. In this paper, we formulate this problem by using a multivariate linear model and adopt an experimental design approach to estimate the impact of demand response signals. We show that randomized assignment, which is widely used to estimate the average treatment effect, is not efficient in reducing the variance of the estimator when a large number of covariates is present. In contrast, we present a tractable algorithm that strategically assigns demand response signals to customers. This algorithm achieves the optimal reduction in estimation variance, independent of the number of covariates. The results are validated from simulations on synthetic data.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

As more uncertain and intermittent renewable resources are integrated into the power system, operators are increasingly exploring flexibility in customers’ consumptions to balance supply and demand. This operation is commonly known as demand response (DR). In a typical implementation of DR programs, customers receive a DR signal to elicit a change in their consumptions. This signal can be a modification of electricity prices or simply a message requesting a change in consumption [1]. An effective DR program improves the efficiency and sustainability of power systems and is a central pillar of the envisioned smartgrid [2, 3, 4, 5].

A natural question about demand response is quantifying the impact of a DR signal. That is, if a DR signal is sent to a subset of the users, what is the change in these users’ consumptions because of that signal? An accurate estimate of this change is central to the operation of demand response programs: if not enough change in demand is elicited, other measures need to be taken; if too much change is elicited, the program is inefficient.

Most of existing work in this area of demand response have approached the problem from a market optimization point of view. For example, authors in [5, 6] considered how to optimize the social welfare; and authors in [7] have considered how to create an efficient market for demand response. In all these settings, customers’ responses are captured by well-defined utility functions. These functions are assumed to be known to the operators, or at least to the users themselves.

In practice, these market based approaches can be difficult to implement because customers often do not have a clear model of their own utilities. For example, consider a household with a smart energy management system (e.g., a NEST thermostat). This household will respond to a DR signal, but the response can be a complicated function of the current conditions in the household–e.g., temperature, appliances that are on, number of people at home and so on– and the user may not be consciously aware of the households’ utility function. Therefore the operator needs to learn customers’ responses from past history. Furthermore, because of the advancement of household sensors, this response need to be learned under a possibly high dimensional setting.

By performing enough experiments with enough customers, that is, sending enough DR signals, the operator will eventually learn the users’ response with accuracy. Repeated experimentation with large group of customers, however, is impractical for two reasons. The first is that operators only sends out DR signals if the demand need to be modified appreciably, and this event does not occur all that often in the power system. The second is that because of use fatigue [8], most utilities have agreements with their customers that a single household will only receive a limited number of DR requests [9]. Under these limited data regimes, accurately learning the response of users is nontrivial.

In this paper, we adopt an experimental design approach to the problem estimating the response of users to DR signals. We design user selection algorithms where by carefully choosing the users that receive DR signals, the maximum information about the system response can be learned.

In this paper we consider the linear setting, where a user’s consumption is a linear function of a set of variables and the DR signal. We refer to the former abstractly as a user’s covariates, where they could represent measurements such as temperature, appliance status, building type, behavior patterns and etc. Because of the explosive growth in sensing devices, we are particularly interested in the high dimensional case, that is, where users have a large number of covariates. We assume the impact of DR signals is additive to the original consumption behavior. Using the language of experimental design, we regard DR signal as a treatment and a user receiving a DR signal as being assigned this treatment. Then learning the DR response of the users is equivalent to learning the average treatment effect, which is the average response of the customers to the treatment [10]. The metric with respect to the estimation of the treatment effect is the variance of the estimator as more experiments are performed. Randomized trial is usually thought as the “gold standard” in these types of models mainly due to the fact that randomly assigning treatments to users removes the effect of confounding factors and provides a consistent estimate of the treatment effect. In the presence of many covariates, however, random assignment can be extremely inefficient. In fact, as we show in this paper, in the high dimensional setting, random assignment does not reduce the variance of the estimate of the average treatment effect, even as the number of treatment grows without bound. Instead, following the outline in [11], we design a selection scheme users are picked based on their covariates to be treated.

Suppose there are total users in the system. Under the linear models considered in this paper, the best possible lower bound on the rate of variance reduction is ,111no estimator can reduce variance faster than given by considering the Fisher information [12]. As discussed above, under high dimensional settings, a randomized algorithm can only achieve , even when are large number of fraction of users are assigned DR signals. The main contributions of this paper are:

  1. We show if the number of users selected is a constant fraction of the total number of users, there exists user assignments that achieve a variance reduction rate of . This rate is independent of the dimension of the covariates, as long as it is less than .

  2. We develop a tractable user assignment algorithm. This algorithm is obtained by converting the variance reduction problem in a densest-cut problem on a graph [13, 14, 15, 16].

Our approach differs from previous effort in learning demand response in one important regard. In previous studies, the focus was on training the best predictive model and subtracting out the predicted consumption from the measured consumption [17, 18]. In our approach, we do not ever learn a predictive model, in the sense that we do not learn the relationship between the covariates and the consumption. Rather, focus on learning a single parameter: the response to the DR signal.

The results in [11] act as an impetus to this paper. The main difference is that in [11], the users are assigned a treatment of , therefore some information is always conveyed by this assignment. In our model, the users are assigned either (receives DR signal) or (no signal). Therefore, for the users assigned , we do not obtain any information about the impact of DR signals. This makes the problem much more technically challenging, and consequently we only consider the offline assignment problem whereas [11] also considers the online assignment problem. There are extensive literature on average treatment effect estimation, and the interested reader can refer to [19, 20] and references within.

The rest of the paper is organized as follows. Section II introduces the preliminaries and the problem throughout this paper. Section III presents the variance of the estimator obtained by random assignment. Section IV further presents the variance of the estimator by optimal assignment, followed by a tractable algorithm presented in Section V. Section VI details the simulation results obtained by either random assignment and optimal assignment. Section VII concludes the paper.

Ii Preliminaries and Problem Formulation

In this paper we assume that a user’s consumption is given by a linear model. Let denotes a binary DR signal, where represents that a signal is sent to user and

presents that no signal is sent. A covariate vector

is also associated with a user, representing available side information. For example, side information may include local temperature, user’s household size, and number of electrical vehicles and so on. We denote the dimension of the covariate vector by , and assume the last component is , which is the intercept. Let denote the consumption of user , which is given as



is white noise with variance

(for convenience). The coefficient is the impact of the DR signal and estimating it efficiently is the goal of the paper. The coefficient represents the effect of the covariate vectors. The main technical challenge is to accurately estimate the coefficient of interest , even when is high dimensional. For analytical simplicity, we assume that the entries of

are drawn as i.i.d. Gaussian random variables (possibly after centering and rescaling). In simulations (Section

VI), we show that the results holds for other types of distributions as well.

We assume there are total users. In this model, a single user that receives two demand response signals at two different times is equivalent to two users each receiving a demand response signal. Therefore, we suppress the time dimension and label all users by . Note that in (1), all users share a common response to DR signals.

We denote the estimate of by . The value of is a function of the DR assignments, that is, the value of the ’s. Under the linear setting in (1

), the ordinary least square (ols) estimator

of is unbiased for all possible allocations of DR assignments, is centered at the true value . The natural measure of performance is then the variance: . With some simple linear algebra, the variance of is given by [21]:


where . The ’th row of the data matrix is given by . We adopt the notation that denotes a matrix that has rows and columns, while denotes the to column of a matrix , where .

We are primarily interested in the setting where an operator can assign a limited number of ’s to be . This setting reflects the limit in budget of an operator in sending DR signals. Specifically, let be the total number of DR signal that can be sent. The goal of the operator is to strategically assign ’s to be such that the variance of is minimized. In particular, we are interested in the rate of reduction of as increases and in settings where is a constant.

From (2), minimizing the variance of is equivalent to maximizing the quantity , and we focus on the latter quantity in the rest of the paper due to notational convenience. Two types of algorithms are of interest: i) the standard random assignment where each is chosen to be or

with probability

, and ii) an optimal assignment procedure where ’s are chosen to maximize . Both algorithms face the constrain that only out of ’s can be assigned to be . We characterize growth rate of quantity in terms of and , or equivalently, the decay rate of .

We show that when is relatively small compared to , the two strategies yield similar rates of . In a high dimensional setting where is comparable to , e.g., , however, the random assignment is essentially useless in estimating , in the sense that remains a constant in expectation as grows. Our proposed strategy, on the other hand, improves the rate to in this case, as long as is a constant. In Section III, we discuss the randomized strategy. The optimal assignment algorithm is then considered in Section IV.

Iii Random Assignment

Random assignment has been extensively studied in literature, mainly because it balances the covariates in two groups and eliminate the influence of confounders [19]. For our model in (1), random assignment means that a subset of ’s are chosen at random and assigned a value . Theorem 1 quantifies the rate of the increase of .

Theorem 1.

Random assignment achieves a rate of (). If is a constant, then this rate is ().

Before proving Theorem 1, we discuss the scaling rate under the setting when is a constant. In practice, this is the regime of interest since it is reasonable to suppose that a fraction (e.g. 10%) of users receives DR signals. In this case, the rate achieved by random assignment is . This rate is () when is relatively small compared to . However, when is large, e.g., , then this rate becomes (). This rate is not desirable as it indicates that the variance of the estimator is not decaying with even when is large. Thus we would like to design an assignment strategy which yields an estimator that still possesses a relatively good performance even when is very close to . In the next section, we show that with optimal assignment, we achieve the optimal rate () when . The proof of Theorem 1 follows.


We consider a random assignment where . Then the rate becomes:


where () follows from linearity and cyclic permutation of the trace operator; () follows from defining ; () follows from multiplying out each terms inside and using the fact that each element in has a zero mean and a variance as ; () follows from being a projection matrix onto

. Using the fact that the eigenvalues of a projection matrix are either 0 or 1 and

has rank with probability one, then the trace of is with probability one. In addition, from Lemma 1, it is shown that if contains one column as intercept, so that the trace of , which completes the equality in . ∎

Lemma 1.

If is a by matrix (where ) with one column which contains all ones, then .


Note that is the projection matrix which is orthogonal to , we then have the following:


where is a zero vector that has length .

Note that has one column as the intercept, which suggests that has one row where each element takes value one. Since the equality in (4) holds for every row, we then have:


which indicates that .

Iv Optimal Assignment

Instead of being randomly assigned into DR programs, users can be optimally allocated to either the treatment group or the control group depending on their covariate information, in order to obtain the best estimator of . Mathematically speaking, we optimally assign each to be 0 or 1, in order to minimize the variance of the estimator . This optimization problem is:

subject to

We first discuss the upper bound on the quantity (which signifies the lower bound for ). We show that it is O(). Then we establish that under the regime of , there exist algorithms that achieve a rate that meets the upper bound of .

Iv-a Optimal Rate

Before proceeding on analyzing the rate obtained by the proposed strategy, we first discuss the upper bound on the rate of .

Proposition 1.

No assignment can achieve a better rate that O [11].


The basic idea is to derive the Fisher information with the linear regression model in (

1). The inverse of the Fisher information provides a lower bound for the variance of the estimator obtained by least squares and thus an upper bound for the quantity . For more details, please refer to Proposition 1 in [11]. ∎

In the next subsection we will show that when which is a constant, we achieve this upper bound.

Iv-B Achievability of Optimal Rate

We first present the main result of this section. We assume that each element of

(excluding the intercept column) is drawn independently from a standard Gaussian distribution. This assumption will facilitate the calculation of the main result shown in Theorem

2. The algorithm associated with Theorem 2 is presented in Algorithm 1.

Theorem 2.

Recall that the rate is the growing rate of the inverse of the variance introduced in (2). This rate from optimal assignment is of (), which is independent of the dimension of covariates. More specifically, when is a constant, then this rate is linear rate, i.e., ().

Input: Covariates .
Output: Rate of optimal assignment and the corresponding optimal assignment strategy when .
Reduce the optimization problem in (6) to (8) using Lemma 2. Compute the null space of , denote it by . Each element of should independently follow a standard Gaussian distribution, according to Lemma 3. Find the lower bound for the largest element in (suppose that this element is non negative). This lower bound is shown in Lemma 4. The optimal value of the objective function in (8) is at least times this lower bound. The rate of this optimal value is stated in Theorem 2. The optimal assignment is to assign those ’s corresponding to the largest ’s in to be 1’s and the rest to be 0’s.
Algorithm 1 Procedures to obtain the rate shown in Theorem 2.

Before proving Theorem 2, we first show in Lemma 2 that the worst case scenario for the rate is when . This scenario provides a minimum on the quantity for every where , which provides a maximum for for every . Thus if we can show in Theorem 2 that in the worst case scenario where , the growing rate of quantity is when is a constant, then this rate holds for all where .

Lemma 2.

is increasing in . Consequently, if = , the estimator yields the worse case performance [11]:



This is a general result about linear estimation and the interested reader can refer to Lemma 5 in [11]. ∎

When , the rank of is one with probability one, thus we write , where is in the null space of [11], i.e., . Based on this observation, is written into a simpler form as . The problem is then to maximize under the constraint that we only get to assign ’s to be 1’s and the rest to be 0’s. The optimization problem is:

subject to

To solve the optimization problem in (8), we need to find ’s in such that their sum is maximized, where is the element of the vector . We observe that it actually suffices to provide a lower bound on this maximum sum to prove the rate.

To provide this lower bound we need to know the structure of . We then show in Lemma 3 that if is in the null space of , then each can be constructed to be drawn from an i.i.d. standard Gaussian distribution. Based on this observation, the problem is further reduced to find the lower bound on the largest , assuming that is smaller than to ensure that with overwhelming probability the largest is non negative. Let us refer to this statistic as the order statistic of and denote it by such that . We present a lower bound for in Lemma 4 when is smaller than . This lower bound facilitates the final proof for Theorem 2. The proof of Lemma 3 and Lemma 4 is left in the appendix.

Lemma 3.

The basis of the null space of can be constructed as an i.i.d. standard Gaussian vector with length , when each element of is independently drawn from a standard Gaussian distribution and the last column of is an all one column.

Lemma 4.

Let satisfies . If each is independent and follows standard Gaussian distribution, then , where C is a positive constant and .

Now we can use the introduced lemmas to prove Theorem 2. A summary is presented in Algorithm 1, illustrating the procedures to obtaining the rate stated in Theorem 2 using the proposed lemmas. This algorithm also provides the optimal assignment strategy when .

Proof of Theorem 2.

We will focus on the case when since it provides the worst case rate for every , as stated in Lemma 2.

From lemma 3, we know that , we then obtain the following results:


where () is based on the multivariate delta method [22], () comes from Jensen’s inequality and () is based on Lemma 3 and Lemma 4.

Specifically, if , then (9) can be written as:


Another interesting case is when , (9) can be written as:


As can be seen from Theorem 2, we will obtain the optimal rate by replacing as a constant. This indicates that with optimal assignment, the estimation variance will indeed decay with instead of being a constant as shown in Theorem 1, when the dimension of the covariates is comparable to . This is very interesting as it indicates that even in a high dimensional setting, the variance of the estimator will decay optimally by solving the variance minimization problem.

V A Tractable Alternative

Algorithm 1 provides a simple way to find the optimal assignment when . When is less than , this algorithm cannot be applied. In fact, the optimization problem in (6

) is a nonconvex quadratic optimization problem which can be NP-hard. In this section, we present a tractable approximate algorithm by relaxing the original combinatorial optimization problem into and semidefinite program. We then demonstrate that this SDP problem approximates the original problem with a performance ratio that is better than

when is in the range of . This procedure follows the results established in [15].

We first revisit the original variance minimization problem in (6) and transform into . Denote each element in as , then each takes value in . Therefore the variance minimization problem is written as:

subject to

This is a Dense--Subgraph (DSP) problem with existence of self edges. To illustrate this, let element at row and column in matrix be denoted as edge weight (except that is half of the value on the diagonals) associated with vertex and vertex , then (12) is trying to find a set of vertices such that the sum of edge weights induced by these vertices are maximized. Since the problem presented in (12

) contains binary variables, we relax this problem into a SDP formulation:

subject to

The original problem in (12) is hard, we will only obtain a surrogate solution in polynomial time. We thus adopt Algorithm 2 to obtain an approximate solution from SDP formulation. Let us denote this solution by based on Algorithm 2. The performance of the approximation from SDP is evaluated by the performance ratio which satisfies:


Here the randomness in is introduced by the random rounding procedure shown in Algorithm 2 and is the optimal value of the objective function shown in (12). Performance ratio can be used to quantify how close the solution from Algorithm 2 is to the optimal solution by solving the original hard problem.

There exists a fruitful line of work on the approximation algorithms using either greedy algorithm or LP/SDP relaxation for DSP problems [23, 13, 24, 14, 25, 26, 16, 15, 27]. Most recent research has improved the performance ratio to with LP relaxation in [27]. However, if is not decaying with , i.e., a constant, then this ratio is not desirable since it is decreasing in . The authors in [15] propose an improved performance ratio that is better than for a wide range of . We will adopt the approximation procedure in [15] and argue that the performance ratio is valid in our case here as well. In the following, we will first present the general algorithm and then show that the performance ratio in [15] is still applicable in our case.

The approximation procedure in [15] is presented in Algorithm 2, including three main procedures:

  • Solve SDP problem in (13) (step 1). After this procedure we can obtain the optimal continuous solution for (13) and the optimal value of the objective function is denote by .

  • Construct initial , where represents the initial subgraph and is a set of indices (step 2 through step 4). The ’s take value 1 such that and the rest -1. Let us denote them by . The value of the objective function in (13) is written as . Here is the total weights of edges in the subgraph induced by . At this point the cardinality of is not necessarily .

  • Resize to such that contains exactly vertices (step 5 through step 16). The final assignment of ’s is that . Let us denote them by . The value of the objective function is and is the total weights of the edges induced by .

1 Solve SDP in (13), obtain .
2 Construct .
3 Construct covariance matrix , where: , , .
4 Generate , , .
5 Let .
6 if  then
7      Output
9       if  then
10            arbitrarily add nodes into . Output .
11      else
12             while  do
13                   for each  do
14                         .
15                  Rearrange , where . Remove and reset .
16            Output .
Algorithm 2 Approximation algorithm with SDP relaxation in (13), adopted from [15].

We now demonstrate that Algorithm 2 achieves the performance ratio shown in Proposition 2.

Proposition 2.

The performance ratio from Algorithm 2 defined as:


satisfies the conditions presented in Proposition 2 in [15] and is plotted in Fig. 1, where is the optimal value of the objective function in problem (12). When is large, this ratio is better than either O() or O() obtained by LP relaxation.

Fig. 1: Performance ratio from Algorithm 2 for .

As can be seen from Proposition 2, the performance ratio quantifies the gap between the approximated solution obtained by Algorithm 2 and the optimal solution from the original problem shown in (12). This ratio is the direct result from the random rounding procedure and the resizing procedure shown in Algorithm 2. The ratios associated with these two procedures are presented in (16) and (17):




In [15], the authors well define the parameter (which depends on and ) and (depends on and obtained from the random rounding procedure) so that the performance ratio satisfies Proposition 2 where there are no self edges, i.e., . However, in our case, the are the diagonal elements of and they are not necessarily zero, so the graph in our case contains non trivial self edges. If we show that the presence of self edges does not change the values of and , then Proposition 2 naturally holds in our case as well.

Let us first discuss (16). From [15], parameter does not depend on whether there are self edges in the graph, so (16) directly applies.

Next we need to check if the same applies at the presence of self edges. When there are no self edges, i.e., , the authors in [15] show that when and otherwise. We show that the same condition for holds even with the presence of non negative self edges, as stated in Lemma 5.

Lemma 5.

Let and be obtained from random rounding procedure and resizing procedure in Algorithm 2 from a graph with non negative self edges, i.e., . Then we have:

Proof of Lemma 5 is given in the appendix. Lemma 5 validates the and with the presence of non negative self edges, thus the performance ratio stated in Proposition 2 is valid. A list of performance ratio ’s for different values of is shown in [15] and is plotted in Fig. 1. This rate is satisfactory since it has a rate of but strictly larger than [15]. It is better than O() when is not decaying faster than . In fact, as long as lies within some constant range, then can be seen as a varying constant thus is not decaying as a function of .

Vi Simulation

In this section we show the comparison results between random assignment and optimal assignment. We simulate the covariates from two different distributions, i.e., Gaussian distribution and uniform distribution. We also validate the claim by simulated building data from


Vi-a Gaussian Ensemble

We first generate the covariates as they are drawn from i.i.d. Gaussian ensemble, i.e., . We compare two cases where and in Fig. 2. Note that Fig. 2 is shown in semilogarithmic plot where the -axis has a logarithmic scale and the -axis has a linear scale. In addition, we adopt the value of in [15], i.e., 0.9 for and 0.94 for . We let to obtain the worst case performance.

Besides SDP relaxation, we also simulate a greedy based assignment to maximize the weighted edges induced by vertices. The greedy assignment sequentially eliminates vertices and works as follows: we start with the original graph and a set containing all vertex. At each elimination step, the vertex with the least weighted edges are eliminated from the set until this set contains exactly vertices. This greedy algorithm is introduced in [24].

In addition, we use the result from branch and bound (upper bound) to serve as the reference in order to compare the random assignment and the proposed optimal assignment. The duality gap for branch and bound is set to be 0.05 for all when . Due to computational complexity and time constraints, we set this gap to be around 0.25 when is big in the case of .

In Fig. 2, we see that the semilog plot on () is growing with and similar to . This suggests that is linear in , as we stated in Section IV. In addition, It is very close to the solution obtained by branch and bound, meaning that the result from SDP relaxation is close to the optimal solution in (12). The empirical performance ratio from SDP relaxation is shown in Table I.

From Table I, we see that the performance ratio for between and is actually greater than , which is even better than the theoretical bound in Proposition 2. In addition, if we decrease with respect to , i.e., change from to , then the performance ratio is reduced. This is due to the fact we need to do more eliminations during the resizing procedure in Algorithm 2 and the deterioration increases. However, if is well defined within a range, then this deterioration is controlled and does not change the statement in Proposition 2.

On the other hand, the semilog plot on from random assignment is a constant on average across different values of , whether is small or large. This validates Theorem 1 as it states that the rate is () when . In this case, we cannot obtain an efficient estimator since the variance is not decaying even when is big.

What is more, the greedy assignment does not provide a relatively good performance as well. It yields a constant as random assignment. This suggests although both the greedy algorithm and SDP relaxation are aimed to solve an optimization problem, the solution from SDP relaxation is much better and reliable.

Fig. 2: Semilogarithmic plot on the inverse of , i.e., , assuming Gaussian distribution of . The -axis is shown in a logarithmic scale and the -axis is shown in a linear scale. Upper plot shows the rate when , lower plot shows the corresponding rate when .
= 10 = 50 = 100 = 200
0.8131 0.6486 0.7416 0.6588
0.7026 0.5362 0. 6577 0.5273
TABLE I: Empirical performance ratio for Gaussian ensemble with different values for and varying .

Vi-B Uniform Ensemble

Although we discuss the rate of quantity with respect to Gaussian ensemble, we also simulate the covariates where the elements are drawn from a uniform distribution in an interval [-1,1]. The results are shown in Fig. 3. Fig. 3 is again a semilogarithmic graph. We again take to be 0.9 when and 0.94 when . The duality gap is 0.05 when . When , the duality gap is around 0.1 to 0.2 for greater than 100 and is 0.25 for . The comparison result is shown in Fig. 3 and the performance ratio by SDP relaxation is shown in Table II.

Fig. 3: Semilogarithmic plot on the inverse of , i.e., , assuming uniform distribution of . Upper plot shows the rate when , lower plot shows the corresponding rate when .
= 10 = 50 = 100 = 200
0.6961 0.6755 0.8068 0.8094
0.6101 0.6113 0.4799 0.5145
TABLE II: Empirical performance ratio for uniform ensemble with different values for and varying .

The observation from Fig. 3 and Table II is similar to the analysis in the case of Gaussian ensemble, that the solution obtained from SDP relaxation is still within a constant of the branch and bound solution for . Again, greedy algorithm fails to find a solution close to that by branch and bound and performs as poorly as the random assignment. The performance ratio is again decreased when we decrease , which indicates the similar deterioration occurred during the resizing procedure.

Vi-C Building data

We also validate the claim in the paper by simulated building data obtained from [28]. Using this software, we can generate building covariates such as environment temperature, number of occupants, appliances scheduling, etc. The software outputs the energy consumption based on the covariates that we generate. The buildings vary from a small postal office, to a large commercial hotel, with different number of covariates involved in the modeling.

For the purpose of this paper, we include significant covariates into the linear regression model and the number of those covariates are comparable to the number of users in the simulation. The values of the covariates are whitened by the covariance matrix so that the covariates have zero mean and unit variance [29]. The details of this procedure is given in the appendix. In addition, we fix in the simulation.

The simulation results are shown in Fig. 4 and Fig. 5.

In Fig. 4, we observe that is decaying very fast with the optimal assignment strategy, whereas the variance stays unchanged if adopting a random assignment strategy.

Fig. 4: Variance of as a function of when , simulated from building data, with two different assignment strategies.

The variance is a varying quantity in terms of the dimension of the covariates, i.e., . Assume that the number of users are fixed, i.e., is fixed, and that is a constant. According to Theorem 1 and in Theorem 2, with random assignment the variance decays with whereas with optimal assignment the worse case scenario with yields a decay rate of .

In Fig. 5, we show how varies with an increasing and a fixed . As can be seen from Fig. 5, the increasing deteriorates the variance much severely from random assignment than that from the proposed optimal assignment. This again validates the efficiency of the proposed strategy in improving estimation accuracy, especially in a high dimensional setting where is comparable to .

Fig. 5: Variance of as a function of when is fixed, simulated from building data, with two different assignment strategies.

Vii Conclusion

In this paper, we estimate the average treatment effect of demand response (DR) signals. We adopt an additive linear regression model and discuss two different strategies to assign DR signals to users under limited assignment budgets. The first strategy randomly picks users and sends DR signals to them. The second strategy optimally assigns DR signals to users by minimizing the variance to estimate treatment effect. We show that in a high dimensional setting, the second strategy achieves order optimal rates in variance reduction, whereas random assignment does not reduce variance even as the number of users grows. We formulate the general assignment as a combinatorial optimization problem and present a tractable SDP relaxation. We show that this relaxation obtains a solution that is within bounded gap of the original optimal solution. The simulation results validate this proposition with the synthetic data on both i.i.d. Gaussian covariates and uniform covariates. This work provides a framework for further research in applying causal inference in analyzing consumption data and DR interventions.


  • [1] P. Siano, “Demand response and smart grids:a survey,” Renewable and Sustainable Energy Reviews, vol. 30, pp. 461–478, 2014.
  • [2] P. Palensky and D. Dietrich, “Demand side management: Demand response, intelligent energy systems, and smart loads,” Industrial Informatics, IEEE Transactions on, vol. 7, no. 3, pp. 381–388, 2011.
  • [3] C. L. Su and D. Kirschen, “Quantifying the effect of demand response on electricity markets,” Power Systems, IEEE Transactions on, vol. 24, no. 3, pp. 1199–1207, 2009.
  • [4] D. Wang, X. Guan, J. Wu, P. Li, P. Zan, and H. Xu, “Integrated energy exchange scheduling for multimicrogrid system with electric vehicles,” Smart Grid, IEEE Transactions on, preprint, 2015.
  • [5] N. Li, L. Chen, and S. H. Low, “Optimal demand response based on utility maximization in power networks,” in IEEE Power and Energy Society General Meeting.   IEEE, 2011.
  • [6] L. Qian, Y. A. Zhang, J. Huang, and Y. Wu, “Demand response management via real-time electricity price control in smart grids,” Selected Areas in Communications, IEEE Journal on, vol. 31, no. 7, pp. 1268–1280, 2013.
  • [7] W. Saad, Z. Han, H. V. Poor, and T. Başar, “Game-theoretic methods for the smart grid: An overview of microgrid systems, demand-side management, and smart grid communications,” Signal Processing Magazine, IEEE, vol. 29, no. 5, pp. 86–105, 2012.
  • [8] J. C. Holyhead, S. D. Ramchurn, and A. Rogers, “Consumer targeting in residential demand response programmes,” in Proceedings of the 2015 ACM Sixth International Conference on Future Energy Systems.   ACM, 2015, pp. 7–16.
  • [9] V. M. Balijepalli, V. Pradhan, S. Khaparde, and R. Shereef, “Review of demand response under smart grid paradigm,” in Innovative Smart Grid Technologies-India (ISGT India), 2011 IEEE PES.   IEEE, 2011, pp. 236–243.
  • [10] P. W. Holland, “Statistics and causal inference,” Journal of the American statistical Association, vol. 81, no. 396, pp. 945–960, 1986.
  • [11] N. Bhat, V. Farias, and C. Moallemi, “Optimal ab testing,” 2015.
  • [12] S. Fisher, R. Fisher, S. Genetiker, R. Fisher, S. Genetician, G. Britain, R. Fisher, and S. Généticien, The design of experiments.   Oliver and Boyd, 1935.
  • [13] U. Feige, D. Peleg, and G. Kortsarz, “The dense k-subgraph problem,” Algorithmica, no. 3, pp. 410–421, 2001.
  • [14] U. Feige and M. Langberg, “Approximation algorithms for maximization problems arising in graph partitioning,” Journal of Algorithms, no. 2, pp. 174–211, 2001.
  • [15] Q. Han, Y. Ye, and J. Zhang, “An improved rounding method and semidefinite programming relaxation for graph partition,” Mathematical Programming, no. 3, pp. 509–535, 2002.
  • [16] Y. Ye and J. Zhang, “Approximation of dense-n/2-subgraph and the complement of min-bisection,” Journal of Global Optimization, no. 1, pp. 55–73, 2003.
  • [17] D. Zhou, M. Balandat, and C. Tomlin, “A bayesian perspective on residential demand response using smart meter data,” ArXiv, vol. 1608.03862, 2016. [Online]. Available:
  • [18] K. H. Brodersen, F. Gallusser, J. Koehler, N. Remy, and S. L. Scott, “Inferring causal impact using bayesian structural time-series models,” Annals of Applied Statistics, vol. 9, pp. 247–274, 2015.
  • [19] E. A. Stuart, “Matching methods for causal inference: A review and a look forward,” Statistical science: a review journal of the Institute of Mathematical Statistics, vol. 25, no. 1, pp. 1–29, 2010.
  • [20] G. W. Imbens, “Nonparametric estimation of average treatment effects under exogeneity: A review,” Review of Economics and statistics, vol. 86, no. 1, pp. 4–29, 2004.
  • [21] G. A. F. Seber and A. J. Lee,

    Linear regression analysis

    .   John Wiley and Sons, 2012.
  • [22] G. Oehlert, “A note on the delta method,” The American Statistician, no. 1, pp. 27–29, 1992.
  • [23] G. Kortsarz and D. Peleg, “On choosing a dense subgraph,” in Foundations of Computer Science, 34th Annual Symposium on.   IEEE, 1993, pp. 692–701.
  • [24] Y. Asahiro, K. Iwama, H. Tamaki, and T. Tokuyama, “Greedily finding a dense subgraph,” Journal of Algorithms, no. 2, pp. 203–221, 2000.
  • [25] U. Feige and M. Seltser, “On the densest k-subgraph problem,” Weizmann Institute of Science. Department of Applied Mathematics and Computer Science, Tech. Rep.
  • [26] A. Srivastav and K. Wolf, “Finding dense subgraphs with semidefinite programming,” in International Workshop on Approximation Algorithms for Combinatorial Optimization.   Springer Berlin Heidelberg, 1998.
  • [27] A. Bhaskara, M. Charikar, E. Chlamtac, U. Feige, and A. Vijayaraghavan, “Detecting high log-densities: an o (n ¼) approximation for densest k-subgraph,” in

    In Proceedings of the forty-second ACM symposium on Theory of computing

    .   ACM, 2010.
  • [28] D. Crawley, C. Pedersen, L. Lawrie, and F. Winkelmann, “Energyplus: energy simulation program,” ASHRAE journal, vol. 42, no. 4, p. 49, 2000.
  • [29] C. Riquelme, R. Johari, and B. Zhang, “Online active linear regression via thresholding,” ArXiv, vol. 1602.02845, 2016. [Online]. Available:
  • [30] M. Stojnic, W. Xu, and B. Hassibi, “Compressed sensing-probabilistic analysis of a null-space characterization,” in IEEE International Conference on Acoustics, Speech and Signal Processing.   IEEE, 2008, pp. 3377–3380.
  • [31] G. Kamath, “Bounds on the expectation of the maximum of samples from a gaussian,”
  • [32] G. T. F. De Abreu, “Supertight algebraic bounds on the gaussian q-function,” in Conference Record of the Forty-Third Asilomar Conference on Signals, Systems and Computers.   IEEE, 2009, pp. 948–951.
  • [33] F. D. Cote, I. N. Psaromiligkos, and W. J. Gross, “A chernoff-type lower bound for the gaussian q-function,” arXiv preprint arXiv:1202.6483, 2012.
  • [34] S. Winitzki, “A handy approximation for the error function and its inverse,” A lecture note obtained through private communication, Tech. Rep., 2008.
  • [35] S. H. Chang, P. C. Cosman, and L. B. Milstein, “Chernoff-type bounds for the gaussian error function,” IEEE Transactions on Communications, no. 11, pp. 2939–2944, 2011.


Vii-a Proof of Lemma 3


This proof follows the intuition in [30]. In [30]

, to prove the null space of a Gaussian random matrix, the authors use the fact that a standard multivariate Gaussian distribution is invariant to any orthogonal transform. The difference here is that the matrix

contains an extra column of 1’s as intercept.

Denote the null space of by . It is one dimensional and satisfies:


Now we multiply

by an orthogonal matrix

on the left hand side and multiply by on the right hand side. For simplicity let us write as . Since then the following holds:


Because has i.i.d. standard Gaussian entries and is an orthogonal matrix, then each row of must follow a Gaussian distribution , which is still a standard Gaussian distribution. Thus has i.i.d. standard Gaussian entries. Denote , then and are drawn from the same distribution and each of their entries follows i.i.d. standard Gaussian distribution.

From (19), we also require that the orthogonal matrix satisfies . If such orthogonal matrix exists (which is easy to find), then we can rewrite (19) as:


Comparing (20) and (18), we see that and

must be identically distributed. One distribution that satisfies this property is the standard normal distribution

. It is easy to show that identity matrix is the only covariance matrix that satisfies such condition whereas zero mean is based on the fact that

from (18). This observation concludes the final proof. ∎

Vii-B Proof of Lemma 4


The proof is based on the fact that if in expectation, at least ’s is greater than a constant, then the th largest must be greater than this constant [31]. Mathematically speaking, we want , where is a constant. We write in the following form:


function of Gaussian distribution does not have a close form expression, so we are interested in a tight lower bound for it in order to obtain a lower bound for (21). Many lower bounds are obtained in literature [32, 33, 34, 35]. In particular, [34] provides a lower bound that is valid when the argument is small, and [32] provides a lower bound that is the tightest when the argument becomes relatively large. The lower bounds (bound 1 through bound 4, from [32, 33, 34, 35] respectively) are presented in (22a) through (22d).


Note that bound 3 (in (22c)) is only valid when is small. A comparison of these four bounds are shown in Fig.6. It is shown in semilogarithmic plot where -axis is in a logarithmic scale and -axis is in a linear scale.

Assume that is small, i.e., when which is slightly smaller than but greater than . In this range the lower bound provided in (22c) is the tightest. Use this lower bound we obtain the constant that is universally applicable for that (21) holds when is small. After some simple calculation, we obtain the constant .

Although (22a) provides the tightest bound when gets bigger, this lower bound includes two exponential terms which complicate the calculation. Actually when , we can obtain a fairly good but conservative constant for the lower bound provided in (22d) with only one exponential term. The constant in this case when is relatively large is calculated as .

Fig. 6: Semilogarithmic plot of the lower bounds discussed in (22a) to (22d). The lower bounds are annotated in the graph as bound 1 to bound 4, respectively.

Vii-C Proof of Lemma 5

Before proving Lemma 5, let us first introduce Lemma 6 that bounds the values of the diagonals of .

Lemma 6.

The diagonals of is non negative.

Proof of Lemma 6.

Recall that , and is a projection matrix, then we have the following:


Denote the diagonals of the by , then we have , which implies that .

This suggests that the diagonals of , denoted by , is non-negative. ∎

Now we proceed to prove Lemma 5.

Proof of Lemma 5.

To prove that Lemma 5 holds, we add up a constant to each to make sure that there are no negative weights in the graph. This does not change the optimization problem since it only adds a constant to the objective function and the solution remains the same.

If , then we arbitrarily add vertices to until it contains exactly vertices. Since the weight on each edge is non negative and we keep adding more edges into the subgraph induced by , is at least 1 in this case.

Now suppose that