Target Tracking for Contextual Bandits: Application to Demand Side Management

01/28/2019 ∙ by Margaux Brégère, et al. ∙ EDF

We propose a contextual-bandit approach for demand side management by offering price incentives. More precisely, a target mean consumption is set at each round and the mean consumption is modeled as a complex function of the distribution of prices sent and of some contextual variables such as the temperature, weather, and so on. The performance of our strategies is measured in quadratic losses through a regret criterion. We offer √(T) upper bounds on this regret (up to poly-logarithmic terms), for strategies inspired by standard strategies for contextual bandits (like LinUCB, Li et al., 2010). Simulations on a real data set gathered by UK Power Networks, in which price incentives were offered, show that our strategies are effective and may indeed manage demand response by suitably picking the price levels.




1 Introduction

Electricity management is classically performed by anticipating demand and adjusting production accordingly. The development of smart grids, and in particular the installation of smart meters (see Yan et al., 2013; Mallet et al., 2014), comes with new opportunities: getting new sources of information, offering new services. For example, demand-side management (also called demand-side response; see Albadi & El-Saadany, 2007; Siano, 2014 for an overview) consists of reducing or increasing the consumption of electricity users when needed, typically reducing it at peak times and encouraging it at off-peak times. This helps adjust to the intermittency of renewable energies and is made possible by the development of energy storage devices such as batteries or even electric vehicles (see Fischer et al., 2015; Kikusato et al., 2018); the storages at hand can take place at a convenient moment for the electricity provider. We will consider such a demand-side management system, based on price incentives sent to users via their smart meters. To that end, we propose here to adapt contextual bandit algorithms, which are already used in online advertising. Other such systems were based on different heuristics (Shareef et al., 2018; Wang et al., 2015).

The structure of our contribution is to first provide a modeling of this management system, in Section 2. It relies on making the mean consumption as close as possible to a moving target by sequentially picking price allocations. The literature discussion of the main ingredient of our algorithms, contextual bandit theory, is postponed until Section 2.4. Then, our main results are stated and discussed in Section 3: we control our cumulative loss through a regret bound with respect to the best constant price allocation. A refinement as far as convergence rates are concerned is offered in Section 4. Section 5, with simulations based on a real data set, concludes the paper. For the sake of length, most of the proofs are provided in the appendix.


Without further indications, $\|x\|$ denotes the Euclidean norm of a vector $x$. For the other norms, there will be a subscript: e.g., the supremum norm of $x$ is denoted by $\|x\|_\infty$.

2 Setting and model

Our setting consists of a modeling of electricity consumption and of an aim—tracking a target consumption. Both rely on price levels sent out to the customers.

2.1 Modeling of the electricity consumption

We consider a large population of customers of some electricity provider and assume it homogeneous, which is rather reasonable (Mei et al., 2017). The consumption of each customer at each instance $t$ depends, among others, on some exogenous factors (temperature, wind, season, day of the week, etc.), which will form a context vector $x_t \in \mathcal{X}$, where $\mathcal{X}$ is some parametric space. The electricity provider aims to manage demand response: it sets a target mean consumption $c_t$ for each time instance. To achieve it, it changes electricity prices accordingly (by making electricity more expensive to reduce consumption, or less expensive to encourage customers to consume more now rather than a few hours later). We assume that $K$ price levels (tariffs) are available. The individual consumption of a given customer getting tariff $j \in \{1,\ldots,K\}$ is assumed to be of the form $\varphi(x_t,j) + \text{white noise}$, where the white noise models the variability due to the customers, and where $\varphi$ is some function associating with a context $x_t$ and a tariff $j$ an expected consumption $\varphi(x_t,j)$. Details on and examples of $\varphi$ are provided below. At instance $t$, the electricity provider sends tariff $j$ to a share $p_{t,j}$ of the customers; we denote by $p_t$ the convex vector $(p_{t,1},\ldots,p_{t,K})$. As the population is rather homogeneous, it is unimportant to know to which specific customer a given signal was sent; only the global proportions $p_{t,j}$ matter. The mean consumption observed equals

$$Y_{t,p_t} = \sum_{j=1}^{K} p_{t,j}\,\varphi(x_t,j) + \text{noise}.$$

The noise term is to be further discussed below; we first focus on the function $\varphi$ by means of examples.

Example 1.   The simplest approach consists in considering a linear model per price level, i.e., parameters $\theta_1,\ldots,\theta_K$ with $\varphi(x_t,j) = \theta_j^{\mathrm T} x_t$. We denote by $\theta$ the vector formed by aggregating all vectors $\theta_j$.

This approach can be generalized by replacing $x_t$ by a vector-valued function $b(x_t)$. This corresponds to the case where it is assumed that the $\varphi(\,\cdot\,,j)$ belong to some set $\mathcal{H}$ of functions, with a basis composed of $b_1,\ldots,b_q$. Then, $b = (b_1,\ldots,b_q)$. For instance, $\mathcal{H}$ can be given by histograms on a given grid of $\mathcal{X}$. ∎

Example 2.   Generalized additive models (Wood, 2006) form a powerful and efficient semi-parametric approach to model electricity consumption (see, among others, Goude et al., 2014; Gaillard et al., 2016). It models the load as a sum of independent exogenous variable effects. In our simulations, see (12), we will consider a mean expected consumption of the form $\varphi(x_t,j) = f(x_t) + \xi_j$, that is, the tariff will have a linear impact on the mean consumption, independently of the contexts.

The baseline mean consumption $f$ will be modeled as a sum of simple functions, each taking as input a single component of the context vector:

$$f(x_t) = \sum_{i=1}^{Q} f^{(i)}\bigl(x_{t,h(i)}\bigr),$$

where $Q \geq 1$ and where each $h(i) \in \{1,\ldots,\dim(\mathcal{X})\}$. Some components $h(i)$ may be used several times.

When the considered component $x_{t,h(i)}$ takes continuous values, these functions $f^{(i)}$ are so-called cubic splines: $C^2$-smooth functions made up of sections of cubic polynomials joined together at points of a grid (the knots). Choosing the number of knots $q_i$ (points at which the sections join) and their locations is sufficient to determine (in closed form) a linear basis $(b^{(i)}_1,\ldots,b^{(i)}_{q_i})$ of size $q_i$, see Wood (2006) for details. The function $f^{(i)}$ can then be represented on this basis by a vector of length $q_i$, denoted by $\theta^{(i)}$:

$$f^{(i)} = \sum_{j=1}^{q_i} \theta^{(i)}_j\, b^{(i)}_j.$$

When the considered component $x_{t,h(i)}$ takes finitely many values, we write $f^{(i)}$ as a sum of indicator functions:

$$f^{(i)} = \sum_{j=1}^{q_i} \theta^{(i)}_j\, \mathbb{1}_{\{m^{(i)}_j\}},$$

where the $m^{(i)}_j$ are the modalities for the component $h(i)$.

All in all, $\varphi$ can be represented by a vector $\theta$ of dimension $d$ obtained by aggregating the $\xi_j$ and the vectors $\theta^{(i)}$ into a single vector. ∎

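Both examples above show that it is reasonable to assume that there exists some unknown $\theta \in \mathbb{R}^d$ and some known transfer function $\phi$ such that $\varphi(x_t,j) = \phi(x_t,j)^{\mathrm T}\theta$.

By linearly extending $\phi$ in its second component, we get

$$\sum_{j=1}^{K} p_{t,j}\,\varphi(x_t,j) = \phi(x_t,p_t)^{\mathrm T}\theta, \qquad \text{where} \quad \phi(x_t,p) = \sum_{j=1}^{K} p_j\,\phi(x_t,j).$$

We will actually not use in the sequel that $\phi$ is linear in $p$: the dependency of $\phi$ in $p$ could be arbitrary.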
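To fix ideas, the transfer function can be sketched for the per-tariff linear model of Example 1. This is only an illustration under our own conventions: the name `transfer`, the stacking order of the parameter vector, and all numerical values are assumptions, not part of the model above. With the per-tariff parameters stacked into one vector, the linear extension in the allocation is a Kronecker product.

```python
import numpy as np

def transfer(x, p):
    """Transfer vector for the per-tariff linear model (assumed stacking order).

    With theta = (theta_1, ..., theta_K) stacked tariff by tariff,
    kron(p, x) @ theta equals sum_j p_j * theta_j^T x.
    """
    return np.kron(np.asarray(p, dtype=float), np.asarray(x, dtype=float))

# Hypothetical numbers: K = 3 tariffs, contexts of dimension 2.
theta_per_tariff = np.array([[1.0, 0.5],
                             [0.8, 0.4],
                             [0.6, 0.3]])
theta = theta_per_tariff.reshape(-1)      # stacked parameter vector
x = np.array([1.0, 10.0])                 # context (intercept, temperature)
p = np.array([0.5, 0.5, 0.0])             # half the customers on tariff 1, half on tariff 2

expected = transfer(x, p) @ theta         # expected mean consumption
```

A Dirac allocation such as `(1, 0, 0)` recovers the single-tariff value, and a mixed allocation returns the corresponding convex combination.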

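We now move on to the noise term. We first recall that we assumed that our population is rather homogeneous, which is a natural feature as soon as it is large enough. Therefore, we may assume that the variabilities within the group of customers getting the same tariff $j$ can be combined into a single random variable $\varepsilon_{t,j}$. We denote by $\varepsilon_t$ the vector $(\varepsilon_{t,1},\ldots,\varepsilon_{t,K})$. All in all, we will mainly consider the following model.

Model 1: tariff-dependent noise.   When the electricity provider picks the convex vector $p_t$, the mean consumption obtained at time instance $t$ equals

$$Y_{t,p_t} = \phi(x_t,p_t)^{\mathrm T}\theta + p_t^{\mathrm T}\varepsilon_t.$$

The noise vectors $\varepsilon_1, \varepsilon_2, \ldots$ are $\rho$-sub-Gaussian i.i.d. random variables with $\mathbb{E}[\varepsilon_1] = (0,\ldots,0)^{\mathrm T}$. (A $d$-dimensional random vector $\varepsilon$ is $\rho$-sub-Gaussian, where $\rho > 0$, if for all $\nu \in \mathbb{R}^d$, one has $\mathbb{E}\bigl[e^{\nu^{\mathrm T}\varepsilon}\bigr] \leq e^{\rho^2\|\nu\|^2/2}$.) We denote by $\Gamma = \mathrm{Var}(\varepsilon_1)$ their covariance matrix.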

No assumption is made on $\Gamma$ in the model above. However, when it is proportional to the all-ones matrix $[1]$, the noises associated with each group can be combined into a global noise, leading to the following model. It is less realistic in practice, but we discuss it because regret bounds may be improved in the presence of a global noise.

Model 2: global noise.   When the electricity provider picks the convex vector $p_t$, the mean consumption obtained at time instance $t$ equals

$$Y_{t,p_t} = \phi(x_t,p_t)^{\mathrm T}\theta + e_t.$$

The scalar noises $e_1, e_2, \ldots$ are $\rho$-sub-Gaussian i.i.d. random variables, with $\mathbb{E}[e_1] = 0$. We denote by $\sigma^2 = \mathrm{Var}(e_1)$ the variance of the random noises $e_t$.
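The difference between the two noise models can be checked numerically. The sketch below is not taken from our simulator: it assumes Gaussian noise (which is sub-Gaussian), a transfer function equal to the allocation itself, and made-up values for the parameter and the covariance matrix. It illustrates that under tariff-dependent noise the variance of the observed mean consumption is the quadratic form of the allocation in the covariance matrix, and thus depends on the allocation picked.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical values: one mean consumption level per tariff, phi(x, p) = p.
theta = np.array([0.6, 0.5, 0.4])
Gamma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.03, 0.01],
                  [0.00, 0.01, 0.05]])    # covariance of the per-tariff noises
p = np.array([0.5, 0.5, 0.0])             # allocation of price levels

# Tariff-dependent noise: Y = phi^T theta + p^T eps, with eps ~ N(0, Gamma).
eps = rng.multivariate_normal(np.zeros(3), Gamma, size=200_000)
y = p @ theta + eps @ p

noise_var = p @ Gamma @ p                 # theoretical variance p^T Gamma p
empirical_var = y.var()
```

With a global scalar noise instead, the variance would be the same constant for every allocation, which is why it cancels out when losses of two allocations are compared.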


2.2 Tracking a target consumption

We now move on to the aim of the electricity provider. At each time instance $t$, it picks an allocation of price levels $p_t$ and wants the observed mean consumption $Y_{t,p_t}$ to be as close as possible to some target mean consumption $c_t$. This target is set in advance by another branch of the provider and $p_t$ is to be picked based on this target: our algorithms will explain how to pick $p_t$ given $c_t$ but will not discuss the choice of the latter. In this article we will measure the discrepancy between the observed $Y_{t,p_t}$ and the target $c_t$ via a quadratic loss: $(Y_{t,p_t} - c_t)^2$.

We may set some restrictions on the convex combinations $p$ that can be picked: we denote by $\mathcal{P}$ the set of admissible allocations of price levels. This models some operational or marketing constraints that the electricity provider may encounter. We will see that whether $\mathcal{P}$ is a strict subset of all convex vectors or whether it is given by the set of all convex vectors plays no role in our theoretical analysis.

As explained in Section 3.1 and as is standard in online learning theory, to minimize the cumulative loss suffered we will minimize some regret.

2.3 Summary: online protocol

After picking an allocation of price levels $p_t$, the electricity provider only observes $Y_{t,p_t}$: it thus faces a bandit monitoring. Because of the contexts $x_t$, the problem considered falls under the umbrella of contextual bandits. No stochastic assumptions are made on the sequences $x_t$ and $c_t$: the contexts $x_t$ and the targets $c_t$ will be considered as picked by the environment. Finally, mean consumptions are assumed to be bounded between $0$ and $C$, where $C$ is some known maximal value.

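The online protocol described in Sections 2.1 and 2.2 is stated in Protocol 1. We see that the choices $x_t$, $c_t$ and $p_t$ need to be $\mathcal{F}_{t-1}$-measurable, where $\mathcal{F}_{t-1} = \sigma(\varepsilon_1,\ldots,\varepsilon_{t-1})$.

  Input
   Parametric context set $\mathcal{X}$
   Set of admissible convex weights $\mathcal{P}$
   Bound on mean consumptions $C$
   Transfer function $\phi$
  Unknown parameters
   Transfer parameter $\theta \in \mathbb{R}^d$
   Covariance matrix $\Gamma$ of size $K \times K$ (Model 1)
   Variance $\sigma^2$ (Model 2)
  for $t = 1, 2, \ldots$ do
     Observe a context $x_t \in \mathcal{X}$ and a target $c_t$
     Choose an allocation of price levels $p_t \in \mathcal{P}$
     Observe a resulting mean consumption
        $Y_{t,p_t} = \phi(x_t,p_t)^{\mathrm T}\theta + p_t^{\mathrm T}\varepsilon_t$ (Model 1)
        $Y_{t,p_t} = \phi(x_t,p_t)^{\mathrm T}\theta + e_t$ (Model 2)
     Suffer a loss $(Y_{t,p_t} - c_t)^2$
  end for
  Aim
   Minimize the cumulative loss $\sum_{t=1}^{T} (Y_{t,p_t} - c_t)^2$
Protocol 1 Target Tracking for Contextual Bandits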

2.4 Literature discussion: contextual bandits

In many bandit problems the learner has access to additional information at the beginning of each round. Several settings for this side information may be considered. The adversarial case was introduced in Auer et al. (2002, Section 7, algorithm Exp4), and subsequent improvements were suggested in Beygelzimer et al. (2011) and McMahan & Streeter (2009). The case of i.i.d. contexts with rewards depending on contexts through an unknown parametric model was introduced by Wang et al. (2005b) and generalized to the non-i.i.d. setting in Wang et al. (2005a), then to the multivariate and nonparametric case in Perchet & Rigollet (2013). Hybrid versions (adversarial contexts but stochastic dependencies of the rewards on the contexts, usually in a linear fashion) are the most popular ones. They were introduced by Abe & Long (1999) and further studied in Auer (2002). A key technical ingredient to deal with them is confidence ellipsoids on the linear parameter; see Dani et al. (2008), Rusmevichientong & Tsitsiklis (2010) and Abbasi-Yadkori et al. (2011). The celebrated UCB algorithm of Lai & Robbins (1985) was generalized in this hybrid setting as the LinUCB algorithm, by Li et al. (2010) and Chu et al. (2011). Later, Filippi et al. (2010) extended it to a setting with generalized additive models and Valko et al. (2013) proposed a kernelized version of UCB. Other approaches, not relying on confidence ellipsoids, consider sampling strategies (see Gopalan et al., 2014) and are currently extended to bandit problems with complicated dependency in contextual variables (Mannor, 2018). Our model falls under the umbrella of hybrid versions considering stochastic linear bandit problems given a context. The main difference of our setting lies in how we measure performance: not directly with the rewards or their analogous quantities in our setting, but through how far away they are from the targets $c_t$.

3 Main result, with Model 1

This section considers Model 1.

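We take inspiration from LinUCB (Li et al., 2010; Chu et al., 2011): given the form of the observed mean consumption, the key is to estimate the parameter $\theta$. Denoting by $I_d$ the $d \times d$ identity matrix and picking $\lambda > 0$, we classically do so according to

$$\hat\theta_t = V_t^{-1} \sum_{s=1}^{t} Y_{s,p_s}\,\phi(x_s,p_s), \qquad (1)$$

where $V_t = \lambda I_d + \sum_{s=1}^{t} \phi(x_s,p_s)\,\phi(x_s,p_s)^{\mathrm T}$.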
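The regularized least-squares estimate used by LinUCB-type strategies can be maintained online with rank-one updates. The class below is a minimal sketch of such a ridge estimator; the class and attribute names are ours, and the noiseless toy data only serve as a sanity check.

```python
import numpy as np

class RidgeEstimator:
    """Online ridge regression: V = lam * I + sum phi phi^T, b = sum y * phi."""

    def __init__(self, d, lam=1.0):
        self.V = lam * np.eye(d)   # regularized Gram matrix
        self.b = np.zeros(d)       # sum of observations times features

    def update(self, phi_vec, y):
        """Incorporate one observation y made with transfer vector phi_vec."""
        self.V += np.outer(phi_vec, phi_vec)
        self.b += y * phi_vec

    def theta_hat(self):
        """Current estimate, obtained by solving V theta = b."""
        return np.linalg.solve(self.V, self.b)

# Sanity check on hypothetical noiseless data.
rng = np.random.default_rng(0)
theta = np.array([1.0, 2.0])
est = RidgeEstimator(d=2, lam=1e-6)
for _ in range(200):
    phi_vec = rng.normal(size=2)
    est.update(phi_vec, phi_vec @ theta)
theta_hat = est.theta_hat()
```

Solving the linear system at each call avoids forming an explicit inverse; for large dimensions one could instead update the inverse incrementally.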

A straightforward adaptation of earlier results (see Theorem 2 of Abbasi-Yadkori et al., 2011 or Theorem 20.2 in the monograph by Lattimore & Szepesvári, 2018) yields the following deviation inequality; details are provided in Appendix A.

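Lemma 1.

No matter how the provider picks the $p_t$, we have, for all $t \geq 1$ and all $\delta \in (0,1)$,

$$\bigl\| V_t^{1/2} (\hat\theta_t - \theta) \bigr\| \leq \sqrt{\lambda}\,\|\theta\| + \rho \sqrt{2 \ln\frac{1}{\delta} + d \ln\frac{1}{\lambda} + \ln\det(V_t)}$$

with probability at least $1 - \delta$.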


3.1 Regret as a proxy for minimizing losses

We are interested in the cumulative sum of the losses, but under suitable assumptions (e.g., bounded noise) the latter is close to the sum of the conditionally expected losses (e.g., through Hoeffding’s inequality). Typical statements are of the form: for all strategies of the provider and of the environment,

All regret bounds in the sequel will involve the sum of conditionally expected losses

but up to adding a deviation term to all these regret bounds, we get from them a bound on the true cumulative loss .

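Now, the choices $x_t$, $c_t$ and $p_t$ are $\mathcal{F}_{t-1}$-measurable, where $\mathcal{F}_{t-1} = \sigma(\varepsilon_1,\ldots,\varepsilon_{t-1})$. Therefore, under Model 1,

$$\mathbb{E}\bigl[(Y_{t,p_t} - c_t)^2 \,\big|\, \mathcal{F}_{t-1}\bigr] = \bigl(\phi(x_t,p_t)^{\mathrm T}\theta - c_t\bigr)^2 + p_t^{\mathrm T}\Gamma\,p_t.$$

We got the rewriting

$$\sum_{t=1}^{T} \mathbb{E}\bigl[(Y_{t,p_t} - c_t)^2 \,\big|\, \mathcal{F}_{t-1}\bigr] = \sum_{t=1}^{T} \ell_{t,p_t}, \qquad \text{where} \quad \ell_{t,p} = \bigl(\phi(x_t,p)^{\mathrm T}\theta - c_t\bigr)^2 + p^{\mathrm T}\Gamma\,p, \qquad (2)$$

and we therefore introduce the regret

$$R_T = \sum_{t=1}^{T} \ell_{t,p_t} - \sum_{t=1}^{T} \min_{p \in \mathcal{P}} \ell_{t,p}.$$

This will be the quantity of interest in the sequel.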
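For completeness, the conditional expectation under Model 1 follows from a bias-variance decomposition. The short derivation below is a sketch in our own notation ($\phi_t$ is shorthand for the transfer vector at round $t$); it uses that the context, target and allocation are measurable with respect to the past, and that the noise vector of round $t$ is centered, has covariance matrix $\Gamma$, and is independent of the past.

```latex
\begin{align*}
\mathbb{E}\bigl[(Y_{t,p_t} - c_t)^2 \,\big|\, \mathcal{F}_{t-1}\bigr]
 &= \mathbb{E}\bigl[(\phi_t^{\mathrm T}\theta - c_t + p_t^{\mathrm T}\varepsilon_t)^2 \,\big|\, \mathcal{F}_{t-1}\bigr] \\
 &= (\phi_t^{\mathrm T}\theta - c_t)^2
    + 2\,(\phi_t^{\mathrm T}\theta - c_t)\, p_t^{\mathrm T}\,\mathbb{E}[\varepsilon_t]
    + p_t^{\mathrm T}\,\mathbb{E}\bigl[\varepsilon_t \varepsilon_t^{\mathrm T}\bigr]\, p_t \\
 &= (\phi_t^{\mathrm T}\theta - c_t)^2 + p_t^{\mathrm T}\,\Gamma\, p_t .
\end{align*}
```

The cross term vanishes because the noise is centered, which leaves a squared bias with respect to the target plus a variance term that depends on the allocation.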

3.2 Optimistic algorithm: all but the estimation of $\Gamma$

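We assume that in the first $n$ rounds an estimator $\hat\Gamma_n$ of the covariance matrix $\Gamma$ was obtained; details are provided in the next subsection. We explain here how the algorithm plays for rounds $t \geq n+1$.

We assumed that the transfer function $\phi$ and the bound $C$ on the target mean consumptions were known. We use the notation $[x]_C$ for the clipped part of a real number $x$ (clipping between $0$ and $C$).

We then estimate the instantaneous losses (2)

$$\ell_{t,p} = \bigl(\phi(x_t,p)^{\mathrm T}\theta - c_t\bigr)^2 + p^{\mathrm T}\Gamma\,p$$

associated with each choice $p \in \mathcal{P}$ by:

$$\hat\ell_{t,p} = \Bigl(\bigl[\phi(x_t,p)^{\mathrm T}\hat\theta_{t-1}\bigr]_C - c_t\Bigr)^2 + p^{\mathrm T}\hat\Gamma_n\,p.$$

We also denote by $\alpha_{t,p}$ deviation bounds, to be set by the analysis.

The optimistic algorithm picks, for $t \geq n+1$:

$$p_t \in \mathop{\mathrm{arg\,min}}_{p \in \mathcal{P}} \bigl\{ \hat\ell_{t,p} - \alpha_{t,p} \bigr\}. \qquad (3)$$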
Comment:   In linear contextual bandits, rewards are linear in $\theta$ and, to maximize global gain, LinUCB (Li et al., 2010) picks a vector $p$ which maximizes a sum of the form $\phi(x_t,p)^{\mathrm T}\hat\theta_{t-1} + \alpha_{t,p}$. Here, as we want to track the target, we slightly change this expression by substituting the target and taking a quadratic loss. But the spirit is similar.
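When the set of allocations is finite, the optimistic choice reduces to a scan over candidates. The function below is a minimal sketch of that step, not our actual implementation: the deviation bound is passed as a hypothetical callable `alpha` (its exact tuning comes from the analysis and is not reproduced here), and the toy values are made up.

```python
import numpy as np

def optimistic_choice(candidates, phi, x, c, theta_hat, Gamma_hat, alpha, C):
    """Pick the allocation minimizing (estimated loss - deviation bound)."""
    best_p, best_value = None, np.inf
    for p in candidates:
        p = np.asarray(p, dtype=float)
        # Predicted mean consumption, clipped to the known range [0, C].
        pred = float(np.clip(phi(x, p) @ theta_hat, 0.0, C))
        est_loss = (pred - c) ** 2 + p @ Gamma_hat @ p
        value = est_loss - alpha(x, p)   # optimism: subtract the deviation bound
        if value < best_value:
            best_p, best_value = p, value
    return best_p

# Toy usage: phi(x, p) = p, no noise, no exploration bonus (all hypothetical).
phi = lambda x, p: p
theta_hat = np.array([0.6, 0.5, 0.4])
Gamma_hat = np.zeros((3, 3))
candidates = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (0.5, 0.5, 0)]
choice = optimistic_choice(candidates, phi, None, 0.55, theta_hat,
                           Gamma_hat, lambda x, p: 0.0, C=1.0)
```

In this toy case the target 0.55 lies halfway between the first two tariff levels, so the mixed allocation is selected; a positive `alpha` would additionally favor allocations about which little is known.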

3.3 Optimistic algorithm: estimation of $\Gamma$

The estimation of the covariance matrix $\Gamma$ is hard to perform (on the fly and simultaneously) as the algorithm is running. We leave this problem for future research and devote here the first $n$ rounds to this estimation. We created from scratch the estimation of $\Gamma$ proposed below and studied in Lemma 2, as we could find no suitable result in the literature.

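For each pair $(i,j) \in \mathcal{E} = \{(i,j) : 1 \leq i \leq j \leq K\}$ we define the weight vector $p^{(i,j)}$ as: for $k \in \{1,\ldots,K\}$,

$$p^{(i,j)}_k = \begin{cases} 1 & \text{if } k = i = j, \\ 1/2 & \text{if } k \in \{i,j\} \text{ and } i \neq j, \\ 0 & \text{if } k \notin \{i,j\}. \end{cases}$$

These correspond to all weight vectors that either assign all the mass to a single component, like the $p^{(i,i)}$, or share the mass equally between two components, like the $p^{(i,j)}$ for $i \neq j$. There are $K(K+1)/2$ different weight vectors considered. We order these weight vectors, e.g., in lexicographic order, and pull them one after the other, in order. This implies that in the initial exploration phase of length $n$, all vectors indexed by $\mathcal{E}$ are selected at least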

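$n_0 = \lfloor 2n / (K(K+1)) \rfloor$ times. At the end of the exploration period, we define $\hat\theta_n$ as in (1) and the estimator

$$\hat\Gamma_n \in \mathop{\mathrm{arg\,min}}_{\hat\Gamma \in \mathcal{M}_K(\mathbb{R})} \sum_{t=1}^{n} \bigl( \hat Z_t^2 - p_t^{\mathrm T}\hat\Gamma\,p_t \bigr)^2, \qquad (4)$$

where $\hat Z_t = Y_{t,p_t} - \bigl[\phi(x_t,p_t)^{\mathrm T}\hat\theta_n\bigr]_C$. Note that $\hat\Gamma_n$ can be computed efficiently by solving a linear system as soon as $K$ is small enough.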
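The exploration design and the least-squares step can be sketched as follows. This is an illustration under assumptions of ours: the function names are hypothetical, the search is restricted to symmetric matrices (the quadratic form only depends on the symmetric part), and the check feeds exact second moments instead of squared residuals from real data.

```python
import numpy as np
from itertools import combinations_with_replacement

def exploration_weights(K):
    """All vectors putting mass 1 on one tariff or 1/2 on each of two tariffs."""
    weights = []
    for i, j in combinations_with_replacement(range(K), 2):  # pairs with i <= j
        p = np.zeros(K)
        p[i] += 0.5
        p[j] += 0.5            # when i == j the mass adds up to 1
        weights.append(p)
    return weights

def estimate_covariance(ps, sq_residuals):
    """Least-squares fit of a symmetric G such that p^T G p matches sq_residuals."""
    K = len(ps[0])
    pairs = list(combinations_with_replacement(range(K), 2))
    # p^T G p = sum_i G_ii p_i^2 + 2 * sum_{i<j} G_ij p_i p_j: linear in the entries of G.
    A = np.array([[(1.0 if i == j else 2.0) * p[i] * p[j] for i, j in pairs]
                  for p in ps])
    coef, *_ = np.linalg.lstsq(A, np.asarray(sq_residuals, dtype=float), rcond=None)
    G = np.zeros((K, K))
    for (i, j), g in zip(pairs, coef):
        G[i, j] = G[j, i] = g
    return G

# Sanity check with a hypothetical covariance matrix and exact second moments.
Gamma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.03, 0.01],
                  [0.00, 0.01, 0.05]])
ps = exploration_weights(3)
Gamma_hat = estimate_covariance(ps, [p @ Gamma @ p for p in ps])
```

The single-tariff vectors identify the diagonal entries and the two-tariff vectors then identify the off-diagonal ones, which is why this family of weight vectors suffices.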

3.4 Statement of our main result

Theorem 1.

Fix a risk level and a time horizon . Assume that the boundedness assumptions (5) hold. The optimistic algorithm (3) with an initial exploration of length rounds satisfies

with probability at least .

3.5 Analysis: structure

We first indicate the boundedness assumptions that will be useful in the proof of Theorem 1 and will then provide the structure of the analysis.

Boundedness assumptions. They are all linked to the knowledge that the mean consumption lies in and indicate some normalization of the modeling:


As a consequence,

and all eigenvalues of

lie in , thus .

The deviation bound of Lemma 1 plays a key role in the algorithm. We introduce the following upper bound on it:


Finally, we also assume that a bound is known, such that

A last consequence of all these boundedness assumptions is that upper bounds the (conditionally) expected losses .

Structure of the analysis. The analysis exploits how well the $\hat\theta_t$ estimate $\theta$ and how well $\hat\Gamma_n$ estimates $\Gamma$. The regret bound, as is clear from Proposition 1 below, also consists of these two parts. The proof is to be found in Appendix B.

Proposition 1.

Fix a risk level and an exploration budget . Assume that the boundedness assumptions (5) hold. Consider an estimator of such that with probability at least , for some .
Then choosing and


the optimistic algorithm (3) ensures that w.p. ,

Comment:   Li et al. (2010) pick proportional to only, but we need an additional term to account for the covariance matrix.

We are thus left with studying how well estimates and with controlling the sum of the . The next two lemmas take care of these issues. Their proofs are to be found in Appendices C and D.

Lemma 2.

For all , the estimator (4) satisfies: with probability at least ,

and .

Comment:   We derived the estimator of $\Gamma$ as well as Lemma 2 from scratch: we could find no suitable result in the literature for estimating $\Gamma$ in our context.

Lemma 3.

No matter how the environment and provider pick the and ,

where .

Comment:   This lemma follows from a straightforward adaptation/generalization of Lemma 19.1 of the monograph by Lattimore & Szepesvári (2018); see also a similar result in Lemma 3 by Chu et al. (2011).

We are now ready to conclude the proof of Theorem 1. Indeed, using for the first rounds that upper bounds the (conditionally) expected losses , Proposition 1 and Lemmas 2 and 3 show that, w.p.

Picking of order concludes the proof.

Comment:   The algorithm of Theorem 1 considered above depends on via the tuning (7) of . But we can also have a result in expectation, i.e., a regret defined with and , in which case the algorithm can be made independent of . Only Step 3 of the proof of Proposition 1 is to be modified. The same rates in  are obtained.

4 Fast rates, with Model 2

In this section, we consider Model 2 and show that under an attainability condition stated below, the order of magnitude of the regret bound in Theorem 1 can be reduced to a poly-logarithmic rate. This result is in strong contrast with the typical results for contextual bandits. We underline in the proof the key step where we gain orders of magnitude in the regret bound. Before doing so, we note that similarly to Section 3.1,


which leads us to introduce a regret defined by

Thus, as far as the minimization of the regret is concerned, Model 2 is a special case of Model 1, corresponding to a matrix that can be taken as the null matrix . Of course, as explained in Section 2.1, the covariance matrix of Model 2 is in terms of real modeling, but in terms of regret-minimization it can be taken as . Therefore, all results established above for Model 1 extend to Model 2, but under an additional assumption stated below, the rates (up to poly-logarithmic terms) obtained above can be reduced to poly-logarithmic rates only.

Assumption 2: Attainability.   For each time instance , the expected mean consumption is attainable, i.e.,
We denote by such an element of .

In Model 2 and under this assumption, the expected losses defined in (8) are such that, for all and all ,


As in Model 2 the variance terms cancel out when considering the regret, the variance does not need to be estimated. Our optimistic algorithm thus takes a simpler form. For each and we consider the same estimators (1) of as before and then define

(no clipping needs to be considered in this case). We set


and then pick:


for and arbitrarily. The tuning parameter is hidden in . We get the following theorem, whose proof is deferred to Appendix E and re-uses many parts of the proofs of Proposition 1 and Lemma 3.

Theorem 2.

In Model 2, assume that the boundedness assumptions (5) hold. Then, the optimistic algorithm (11), tuned with , ensures that for all ,

w.p. at least , where is defined as in Lemma 3.

5 Simulations

Our simulations rely on a real data set of residential electricity consumption, in which different tariffs were sent to the customers according to some policy. But of course, we cannot test an alternative policy on historical data (we only observed the outcome of the tariffs sent) and therefore need to build a data simulator. This is what we explain first.

5.1 The underlying real data set / The simulator

We consider the data set “SmartMeter Energy Consumption Data in London Households”. These open data are published by UK Power Networks and contain energy consumption (in kWh per half hour) at half hourly intervals of a thousand customers subjected to dynamic energy prices. A single tariff (among High–1, Normal–2 or Low–3) is offered to all the population for each half hour interval. The tariffs were announced in advance. The report by Schofield et al. (2014) provides a full description of this experimentation and an exhaustive analysis of results. We only kept customers with more than of data available ( clients) and considered their mean consumption. (Such a level of aggregation enables a proper estimation of the load whereas individual consumptions are erratic, see, e.g., Sevlian & Rajagopal, 2018.) As far as contexts are concerned, we considered half-hourly temperatures in London, obtained from the NOAA (we managed missing data by interpolating them linearly). We also created calendar variables: the day of the week (equal to for Monday, for Tuesday, etc.), the half-hour of the day , and the position in the year: , linear values between on January 1st at 00:00 and on December the 31st at 23:59.

Realistic simulator.  It is based on the following additive model, which breaks down time by half hours:


where the and are functions catching the effect of the temperature and of the yearly seasonality. As explained in Example 2, the transfer parameter $\theta$ gathers coordinates of the and the in bases of splines, as well as the coefficients , and . Here, we work under the assumption that exogenous factors do not impact customers’ reaction to tariff changes (which is admittedly a first step, and more complex models could be considered). Our algorithms will have to sequentially estimate the parameter $\theta$, but we also need to set it to get our simulator in the first place. We do so by exploiting historical data together with the allocations of prices picked, of the form , and only on these data (all customers were getting the same tariff), and apply the formula (1) through the R package mgcv (which replaces the identity matrix with a slightly more complex positive definite matrix , see Wood, 2006). The deterministic part of the obtained model is realistic enough: its adjusted R-square on historical observations equals while its mean absolute percentage error equals . Now, as far as noise is concerned, we take multivariate Gaussian noise vectors , where the covariance matrix $\Gamma$ was built again based on realistic values. The diagonal coefficients are given by the empirical variance of the residuals associated with tariff , while non-diagonal coefficients are given by the empirical covariance between residuals of tariffs and at times and , and times and .

5.2 Design of our experiment

Target creation.  We focus on attainable targets which stay in the convex envelope of the mean consumption associated with the High–1 and Low–3 tariffs, namely, . To smooth consumption, we pick near during the night and near in the evening. These hypotheses can be seen as an ideal configuration where the targets and the customer portfolio are in a way compatible.

Restriction on $\mathcal{P}$. We assume that the electricity provider cannot send Low and High tariffs at the same round and that the population can be split in equal parts. Thus, $\mathcal{P}$ is restricted to the grid
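A restriction of this kind can be enumerated explicitly. The sketch below only illustrates the two constraints stated in words above (never mixing the two extreme tariffs, and shares that are multiples of a fixed fraction); the number of parts `n_parts`, the function name, and the ordering of the components are assumptions of ours, not the values used in the experiments.

```python
def allocation_grid(n_parts):
    """Allocations (first extreme tariff, Normal, second extreme tariff).

    Shares are multiples of 1 / n_parts and the two extreme tariffs
    are never sent during the same round.
    """
    grid = set()
    for k in range(n_parts + 1):
        share = k / n_parts
        grid.add((share, 1.0 - share, 0.0))   # first extreme tariff with Normal
        grid.add((0.0, 1.0 - share, share))   # Normal with second extreme tariff
    return sorted(grid)

grid = allocation_grid(4)   # hypothetical split of the population in 4 equal parts
```

Such a finite grid can be passed directly as the candidate set of the optimistic rule, which then amounts to a scan over a number of allocations that grows linearly with `n_parts`.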

Training period, testing period.  We create one year of data using historical contexts and assuming that only Normal tariffs are picked: ; this is a training period. Then the provider starts exploring the effects of tariffs for an additional month (a January month, based on the historical contexts) and freely picks the $p_t$ according to our algorithm; this is the testing period. The rationale is that this is how electricity providers proceed, and $\theta$ is then better estimated. Its estimation is still performed via the formula (1) and as indicated above (with the mgcv package), including the year when only allocations were picked. To simplify the analysis we assume that the algorithm knows the covariance matrix $\Gamma$ used by the simulator. To make sure that learning focuses on the parameters , as other parameters were decently estimated in the training period, we modify the exploration term of (3) into

with . We pick a convenient value for .

5.3 Results

Algorithms were run times each. The simplest set of results is provided in Figure 3: the regrets suffered on each run are compared to the theoretical orders of magnitude of the regret bounds. As expected, we observe lower regrets for Model 2.

The bottom parts of Figures 1 and 2 indicate, for a single run, which allocation vectors were picked over time. During the first day of the testing period, the algorithms explore the effect of tariffs by sending the same tariff to all customers (the vectors are Dirac masses) while at the end of the testing period, they cleverly exploit the possibility to split the population in two groups of tariffs. Note that, over the first iterations, the exploration term for Model 2 is much larger than the exploitation term (but quickly vanishes), which leads to an initial quasi-deterministic exploration and an erratic consumption (unlike in Model 1).

We obtain an approximation of the expected mean consumption by averaging the observed consumptions, and this is the main (black, solid) line to look at in the top parts of Figures 1 and 2. Four plots are depicted depending on the day of the testing period (first, last) and on the model considered. These (approximated) expected mean consumptions may be compared to the targets set (dashed red line). The algorithms seem to perform better on the last day of the testing period for Model 2 than for Model 1, as the expected mean consumption seems closer to the target. However, in Model 1, the algorithm has to pick tariffs leading to the best bias-variance trade-off (the expected loss features a variance term). This is why the average consumption does not overlap the target as in Model 2. This results in a slightly biased estimator of the mean consumption in Model 1.

Figure 1: Left: January 1st (first day of the testing set).  Right: January 31st (last day of the testing set).
Top: runs are considered. Plot: average of mean consumptions over runs for the algorithm associated with Model 1 (full black line); target consumption (dashed red line); mean consumption associated with each tariff (Low–1 in green, Normal–2 in blue and High–3 in navy). The envelope of attainable targets is in pastel blue.
Bottom: A single run is considered. Plot: proportions used over time.
Figure 2: Same legend, but with Model 2 (full black line).
Figure 3: Regret curves for each of the runs for Model 1 (left) and Model 2 (right). We also provide plots of and for some well-chosen constants .


  • Abbasi-Yadkori et al. (2011) Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pp. 2312–2320, 2011.
  • Abe & Long (1999) Abe, N. and Long, P. M. Associative reinforcement learning using linear probabilistic concepts. In ICML, pp. 3–11, 1999.
  • Albadi & El-Saadany (2007) Albadi, M. H. and El-Saadany, E. F. Demand response in electricity markets: An overview. In 2007 IEEE Power Engineering Society General Meeting, pp. 1–5, June 2007. doi: 10.1109/PES.2007.385728.
  • Auer (2002) Auer, P. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
  • Auer et al. (2002) Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77, 2002.
  • Beygelzimer et al. (2011) Beygelzimer, A., Langford, J., Li, L., Reyzin, L., and Schapire, R. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 19–26, 2011.
  • Cesa-Bianchi & Lugosi (2006) Cesa-Bianchi, N. and Lugosi, G. Prediction, learning, and games. Cambridge university press, 2006.
  • Chu et al. (2011) Chu, W., Li, L., Reyzin, L., and Schapire, R. Contextual bandits with linear payoff functions. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS’11), pp. 208–214, 2011.
  • Dani et al. (2008) Dani, V., Hayes, T. P., and Kakade, S. M. Stochastic linear optimization under bandit feedback. 21st Annual Conference on Learning Theory, 2008.
  • Filippi et al. (2010) Filippi, S., Cappe, O., Garivier, A., and Szepesvári, C. Parametric bandits: The generalized linear case. In Lafferty, J. D., Williams, C. K. I., Shawe-Taylor, J., Zemel, R. S., and Culotta, A. (eds.), Advances in Neural Information Processing Systems 23, pp. 586–594. Curran Associates, Inc., 2010.
  • Fischer et al. (2015) Fischer, D., Scherer, J., Flunk, A., Kreifels, N., Byskov-Lindberg, K., and Wille-Haussmann, B. Impact of hp, chp, pv and evs on households’ electric load profiles. In 2015 IEEE Eindhoven PowerTech, pp. 1–6, June 2015. doi: 10.1109/PTC.2015.7232784.
  • Gaillard et al. (2016) Gaillard, P., Goude, Y., and Nedellec, R. Additive models and robust aggregation for gefcom2014 probabilistic electric load and electricity price forecasting. International Journal of forecasting, 32(3):1038–1050, 2016.
  • Gopalan et al. (2014) Gopalan, A., Mannor, S., and Mansour, Y. Thompson sampling for complex online problems. In International Conference on Machine Learning, pp. 100–108, 2014.
  • Goude et al. (2014) Goude, Y., Nedellec, R., and Kong, N. Local short and middle term electricity load forecasting with semi-parametric additive models. IEEE transactions on smart grid, 5(1):440–446, 2014.
  • Kikusato et al. (2018) Kikusato, H., Mori, K., Yoshizawa, S., Fujimoto, Y., Asano, H., Hayashi, Y., Kawashima, A., Inagaki, S., and Suzuki, T. Electric vehicle charge-discharge management for utilization of photovoltaic by coordination between home and grid energy management systems. IEEE Transactions on Smart Grid, pp. 1–1, 2018. ISSN 1949-3053. doi: 10.1109/TSG.2018.2820026.
  • Lai & Robbins (1985) Lai, T. L. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22, 1985.
  • Lattimore & Szepesvári (2018) Lattimore, T. and Szepesvári, C. Bandit algorithms. preprint, 2018.
  • Li et al. (2010) Li, L., Chu, W., Langford, J., and Schapire, R. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web (WWW’10), pp. 661–670, 2010.
  • Mallet et al. (2014) Mallet, P., Granstrom, P. O., Hallberg, P., Lorenz, G., and Mandatova, P. Power to the people!: European perspectives on the future of electric distribution. IEEE Power and Energy Magazine, 12(2):51–64, March 2014. ISSN 1540-7977. doi: 10.1109/MPE.2013.2294512.
  • Mannor (2018) Mannor, S. Misspecified and complex bandits problems, 2018. Talk at 50èmes Journées de Statistique, EDF Lab Paris Saclay, May 31, 2018.
  • McMahan & Streeter (2009) McMahan, H. B. and Streeter, M. J. Tighter bounds for multi-armed bandits with expert advice. In COLT, 2009.
  • Mei et al. (2017) Mei, J., De Castro, Y., Goude, Y., and Hébrail, G. Nonnegative matrix factorization for time series recovery from a few temporal aggregates. In International Conference on Machine Learning, pp. 2382–2390, 2017.
  • Perchet & Rigollet (2013) Perchet, V. and Rigollet, P. The multi-armed bandit problem with covariates. The Annals of Statistics, pp. 693–721, 2013.
  • Rusmevichientong & Tsitsiklis (2010) Rusmevichientong, P. and Tsitsiklis, J. N. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.
  • Schofield et al. (2014) Schofield, J., Carmichael, R., Tindemans, S., Woolf, M., Bilton, M., and Strbac, G. Residential consumer responsiveness to time-varying pricing. 2014.
  • Sevlian & Rajagopal (2018) Sevlian, R. and Rajagopal, R. A scaling law for short term load forecasting on varying levels of aggregation. 98:350–361, 06 2018.
  • Shareef et al. (2018) Shareef, H., Ahmed, M. S., Mohamed, A., and Al Hassan, E. Review on home energy management system considering demand responses, smart technologies, and intelligent controllers. IEEE Access, 6:24498–24509, 2018.
  • Siano (2014) Siano, P. Demand response and smart grids: A survey. Renewable and Sustainable Energy Reviews, 30:461–478, 2014. ISSN 1364-0321.
  • Valko et al. (2013) Valko, M., Korda, N., Munos, R., Flaounas, I., and Cristianini, N. Finite-time analysis of kernelised contextual bandits. arXiv preprint arXiv:1309.6869, 2013.
  • Wang et al. (2005a) Wang, C.-C., Kulkarni, S. R., and Poor, H. V. Arbitrary side observations in bandit problems. Advances in Applied Mathematics, 34(4):903–938, 2005a.
  • Wang et al. (2005b) Wang, C.-C., Kulkarni, S. R., and Poor, H. V. Bandit problems with side observations. IEEE Transactions on Automatic Control, 50(3):338–355, 2005b.
  • Wang et al. (2015) Wang, Y., Chen, Q., Kang, C., Zhang, M., Wang, K., and Zhao, Y. Load profiling and its application to demand response: A review. Tsinghua Science and Technology, 20(2):117–129, 2015.
  • Wood (2006) Wood, S. Generalized Additive Models: An Introduction with R. CRC Press, 2006.
  • Yan et al. (2013) Yan, Y., Qian, Y., Sharif, H., and Tipper, D. A survey on smart grid communication infrastructures: Motivations, requirements and challenges. IEEE Communications Surveys & Tutorials, 15(1):5–20, 2013. ISSN 1553-877X. doi: 10.1109/SURV.2012.021312.00034.

Appendix A Proof of Lemma 1

The proof below relies on Laplace’s method on super-martingales, which is a standard argument to provide confidence bounds on a self-normalized sum of conditionally centered random vectors. See Theorem 2 of Abbasi-Yadkori et al. (2011) or Theorem 20.2 in the monograph by Lattimore & Szepesvári (2018).

Under Model 1 and given the definition of , we have the rewriting

where we introduced

which is a martingale with respect to . Therefore, by a triangle inequality,

On the one hand, given that all eigenvalues of the symmetric matrix are larger than (given the term in its definition), all eigenvalues of are smaller than and thus,

We now prove, on the other hand, that with probability at least ,

which will conclude the proof of the lemma.
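The decomposition used in this first part of the proof can be checked numerically. The sketch below is only an illustration under assumed notation that is not taken from the paper: a regularized Gram matrix `V = lam * I + Phi^T Phi`, a ridge-type estimator `theta_hat = V^{-1} Phi^T Y`, scalar Gaussian noise, and a martingale term `M = Phi^T noise`. Under these assumptions, `V^{1/2} (theta_hat - theta)` equals `V^{-1/2} M - lam * V^{-1/2} theta`, and since all eigenvalues of `V` are at least `lam`, the bias term `lam * ||V^{-1/2} theta||` is at most `sqrt(lam) * ||theta||`.

```python
import numpy as np

rng = np.random.default_rng(0)
d, t, lam = 4, 50, 0.5               # hypothetical dimension, number of rounds, regularization

theta = rng.normal(size=d)           # unknown parameter (illustrative)
Phi = rng.normal(size=(t, d))        # rows stand in for the feature vectors (illustrative)
noise = rng.normal(size=t)           # centered noise; Gaussian is an assumption made here
Y = Phi @ theta + noise              # linear observations

V = lam * np.eye(d) + Phi.T @ Phi    # regularized Gram matrix
theta_hat = np.linalg.solve(V, Phi.T @ Y)
M = Phi.T @ noise                    # martingale term

# Symmetric square roots of V and of its inverse, via an eigendecomposition
w, U = np.linalg.eigh(V)
V_half = (U * np.sqrt(w)) @ U.T
V_inv_half = (U / np.sqrt(w)) @ U.T

# Rewriting: V^{1/2}(theta_hat - theta) = V^{-1/2} M - lam * V^{-1/2} theta
lhs = V_half @ (theta_hat - theta)
rhs = V_inv_half @ M - lam * (V_inv_half @ theta)

# Eigenvalue argument: eigenvalues of V are >= lam, so the bias term is small
bias = lam * np.linalg.norm(V_inv_half @ theta)
bias_bound = np.sqrt(lam) * np.linalg.norm(theta)

print(np.allclose(lhs, rhs), w.min() >= lam, bias <= bias_bound)
```

The remaining term, the norm of `V_inv_half @ M`, is the self-normalized quantity that the super-martingale argument of the next steps controls with high probability.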

Step 1: Introducing super-martingales. For all , we consider

and now show that it is an –super-martingale. First, note that since the common distribution of the is –sub-Gaussian, for all –measurable random vectors ,



where, by using the sub-Gaussian assumption (13) and the fact that for all convex weight vectors ,