Linear Bandits in High Dimension and Recommendation Systems

01/08/2013 ∙ by Yash Deshpande, et al.

A large number of online services provide automated recommendations to help users to navigate through a large collection of items. New items (products, videos, songs, advertisements) are suggested on the basis of the user's past history and --when available-- her demographic profile. Recommendations have to satisfy the dual goal of helping the user to explore the space of available items, while allowing the system to probe the user's preferences. We model this trade-off using linearly parametrized multi-armed bandits, propose a policy and prove upper and lower bounds on the cumulative "reward" that coincide up to constants in the data poor (high-dimensional) regime. Prior work on linear bandits has focused on the data rich (low-dimensional) regime and used cumulative "risk" as the figure of merit. For this data rich regime, we provide a simple modification for our policy that achieves near-optimal risk performance under more restrictive assumptions on the geometry of the problem. We test (a variation of) the scheme used for establishing achievability on the Netflix and MovieLens datasets and obtain good agreement with the qualitative predictions of the theory we develop.




1 Introduction

Recommendation systems are a key technology for navigating through the ever-growing amount of data that is available on the Internet (products, videos, songs, scientific papers, and so on). Recommended items are chosen on the basis of the user’s past history and have to strike the right balance between two competing objectives:

  1. Allowing accidental pleasant discoveries. This has a positive, albeit hard to quantify, impact on user experience, in that it naturally limits the monotony of the recommendations. It also has a quantifiable positive impact on the system, by providing fresh independent information about the user’s preferences.

  2. Determining recommendations which are most valued by the user, given her past choices.

While this trade-off is well understood by practitioners, as well as in the data mining literature [SPUP02, ZH08, SW06], rigorous and mathematical work has largely focused on the second objective [SJ03, SRJ05, CR09, Gro09, CT10, KMO10a, KMO10b, KLT11]. In this paper we address the first objective, building on recent work on linearly parametrized bandits [DHK08, RT10, AYPS11].

In a simple model, the system recommends items sequentially at times . The item index at time is selected from a large set . Upon viewing (or reading, buying, etc.) item , the user provides feedback to the system. The feedback can be explicit, e.g. a one-to-five-stars rating, or implicit, e.g. the fraction of a video’s duration effectively watched by the user. We will assume that , although more general types of feedback also play an important role in practice, and mapping them to real values is sometimes non-trivial.

A large body of literature has developed statistical methods to predict the feedback that a user will provide on a specific item, given past data concerning the same and other users (see the references above). A particularly successful approach uses ‘low rank’ or ‘latent space’ models. These models postulate that the rating provided by user on item is approximately given by the scalar product of two feature vectors and characterizing, respectively, the user and the item. In formulae

where denotes the standard scalar product, and captures unexplained factors. The resulting matrix of ratings is well-approximated by a rank- matrix.

The items feature vectors can be either constructed explicitly, or derived from users’ feedback using matrix factorization methods. Throughout this paper we will assume that they have been computed in advance using either one of these methods and are hence given. We will use the shorthand for the feature vector of the item recommended at time .

Since the items’ feature vectors are known in advance, distinct users can be treated independently, and we will hereafter focus on a single user, with feature vector . The vector can encode demographic information known in advance or be computed from the user’s feedback. While the model can easily incorporate the former, we will focus on the most interesting case in which no information is known in advance.

We are therefore led to consider the linear bandit model


where, for simplicity, we will assume independent of , and . At each time , the recommender chooses an item feature vector , with the set of feature vectors of the available items. A recommendation policy is a sequence of random variables , wherein is a function of the past history (technically, has to be measurable on ). The system is rewarded at time by an amount equal to the user appreciation , and we let denote the expected reward, i.e. .

As mentioned above, the same linear bandit problem was already studied in several papers, most notably by Rusmevichientong and Tsitsiklis [RT10]. The theory developed in that work, however, has two limitations that are important in the context of recommendation systems. First, the main objective of [RT10] is to construct policies with nearly optimal ‘regret’, and the focus is on the asymptotic behavior for large with constant. In this limit the regret per unit time goes to . In a recommendation system, typical dimensions of the latent feature vector are about 20 to 50 [BK07, Kor08, KBV09]. If the vector includes explicitly constructed features, can easily become much larger. As a consequence, existing theory requires at least ratings, which is unrealistic for many recommendation systems and a large number of users.

Second, the policies that have been analyzed in [RT10] are based on an alternation of pure exploration and pure exploitation. In exploration phases, recommendations are completely independent of the user profile. This is somewhat unrealistic (and potentially harmful) in practice because it would translate into a poor user experience. Consequently, we postulate the following desirable properties for a “good” policy:

  1. Constant-optimal cumulative reward: For all time , is within a constant factor of the maximum achievable reward.

  2. Constant-optimal regret: Let the maximum achievable reward be , then the ‘regret’ is within a constant of the optimal.

  3. Approximate monotonicity: For any , we have for as close as possible to .

We aim, in this paper, to address the first objection in a fairly general setting. In particular, when is small, say a constant times , we provide matching upper and lower bounds for the cumulative reward under certain mild assumptions on the set of arms . Under more restrictive assumptions on the set of arms , our policy can be extended to achieve near optimal regret as well. Although we will not prove a formal result of the type of Point 3, our policy is an excellent candidate in that respect.

The paper is organized as follows. In Section 2 we formally state our main results. In Section 3 we discuss further related work. The assumptions we make on the set of arms are explained in Section 4. In Section 5 we present numerical simulations of our policy on synthetic as well as realistic data from the Netflix and MovieLens datasets. We also compare our results with prior work, and in particular with the policy of [RT10]. Finally, proofs are given in Sections 6 and 7.

2 Main results

We denote by the Euclidean ball in with radius and center . If is the origin, we omit this argument and write . Also, we denote the identity matrix as


Our achievability results are based on the following assumption on the set of arms .

Assumption 1.

Assume, without loss of generality, . We further assume that there exists a subset of arms such that:

  1. For each there exists a distribution supported on with and , for a constant . Here denotes expectation with respect to .

  2. For all , for some .

Examples of sets satisfying Assumption 1 and further discussion of its geometrical meaning are deferred to Section 4. Intuitively, it requires that is ‘well spread-out’ in the unit ball .

Following [RT10] we will also assume to be drawn from a Gaussian prior . This roughly corresponds to the assumption that nothing is known a priori about the user except the length of her feature vector . Under this assumption, the scalar product , where is necessarily independent of , is also Gaussian with mean and variance and hence is the noise-to-signal ratio for the problem. Our results are explicitly computable and apply to any value of . However, they are constant-optimal for bounded away from zero.


Let be the posterior mean estimate of at time , namely


A greedy policy would select the arm that maximizes the expected one-step reward . As for the classical multiarmed bandit problem, we would like to combine this approach with random exploration of alternative arms. We will refer to our strategy as SmoothExplore since it combines exploration and exploitation in a continuous manner. This policy is summarized in Algorithm 1.

1:initialize , , , .
3:     Compute: .
4:     Play: , observe .
5:     Update: , .
Algorithm 1 SmoothExplore
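The Compute / Play / Update loop above can be sketched in code. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the arm set is a finite list, the prior is taken to be isotropic Gaussian with a user-supplied variance, the names `GaussianPosterior`, `smooth_explore` and `mixture_kernel` are hypothetical, and `mixture_kernel` is a stand-in exploration kernel that is not claimed to satisfy Assumption 1 in general.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


class GaussianPosterior:
    """Posterior of theta under an assumed N(0, prior_var * I) prior and Gaussian noise."""

    def __init__(self, dim, prior_var, noise_var):
        self.mean = [0.0] * dim
        self.cov = [[prior_var if i == j else 0.0 for j in range(dim)]
                    for i in range(dim)]
        self.noise_var = noise_var  # must be positive

    def update(self, x, y):
        # Rank-one update after observing y = <x, theta> + noise.
        cov_x = [dot(row, x) for row in self.cov]
        denom = self.noise_var + dot(x, cov_x)
        residual = y - dot(x, self.mean)
        self.mean = [m + c * residual / denom for m, c in zip(self.mean, cov_x)]
        self.cov = [[s - ci * cj / denom for s, cj in zip(row, cov_x)]
                    for row, ci in zip(self.cov, cov_x)]


def mixture_kernel(greedy, arms, rng, explore_prob=0.5):
    # Hypothetical kernel: keep the greedy arm or switch to a uniformly random arm.
    return rng.choice(arms) if rng.random() < explore_prob else greedy


def smooth_explore(arms, kernel, theta, noise_std, horizon, prior_var, rng):
    posterior = GaussianPosterior(len(theta), prior_var, noise_std ** 2)
    played = []
    for _ in range(horizon):
        # Compute: greedy arm under the current posterior mean.
        greedy = max(arms, key=lambda x: dot(posterior.mean, x))
        # Play: draw the recommendation from the kernel around the greedy arm.
        x = kernel(greedy, arms, rng)
        y = dot(x, theta) + rng.gauss(0.0, noise_std)
        # Update: fold the new observation into the posterior.
        posterior.update(x, y)
        played.append(x)
    return played, posterior


if __name__ == "__main__":
    arms = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
    played, posterior = smooth_explore(
        arms, mixture_kernel, theta=[0.6, -0.2], noise_std=0.5,
        horizon=50, prior_var=0.5, rng=random.Random(0))
    print(posterior.mean)
```

The kernel is passed in as a function so that the same loop can be reused with any exploration distribution around the greedy arm.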

The policy SmoothExplore uses a fixed mixture of exploration and exploitation as prescribed by the probability kernel . As formalized below, this is constant optimal in the data poor high-dimensional regime, hence on small time horizons.

While the focus of this paper is on the data poor regime, it is useful to discuss how the latter blends with the data rich regime that arises on long time horizons. This also clarifies where the boundary between short and long time horizons sits. Of course, one possibility would be to switch to a long-time-horizon policy such as the one of [RT10]. Alternatively, in the spirit of approximate monotonicity, we can try to progressively reduce the random exploration component as increases. We will illustrate this point for the special case . In that case, we introduce a special case of SmoothExplore, called BallExplore, cf. Algorithm 2. The amount of random exploration at time is gauged by a parameter that decreases from to as .

Note that, for , is kept constant with . In this regime BallExplore corresponds to SmoothExplore with the choice (here and below denotes the boundary of a set ). It is not hard to check that this choice of satisfies Assumption 1 with and . For further discussion on this point, we refer the reader to Section 4.

1:initialize , , , , .
3:     Compute: , .
4:     Play: , where is a uniformly sampled unit vector, independent of the past.
5:     Observe: .
6:     Update: , , .
Algorithm 2 BallExplore
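The Play step of BallExplore mixes the greedy direction with a uniformly sampled unit vector. A minimal sketch of one such mixing rule follows. The specific form used here (weight sqrt(1 - beta^2) on the normalized estimate, weight beta on the component of the random vector orthogonal to it) is an assumption made for illustration, and the schedule of the exploration parameter is not reproduced; `ball_explore_arm` and `random_unit_vector` are hypothetical names.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def random_unit_vector(dim, rng):
    # A normalized standard Gaussian vector is uniform on the unit sphere.
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(dot(g, g))
    return [a / norm for a in g]


def ball_explore_arm(theta_hat, beta, rng):
    """Return an arm in the unit ball mixing exploitation and exploration.

    beta in [0, 1] is the amount of exploration (assumed to be given).
    """
    u = random_unit_vector(len(theta_hat), rng)
    norm = math.sqrt(dot(theta_hat, theta_hat))
    if norm == 0.0:
        return u  # no estimate yet: play a uniformly random direction
    greedy = [a / norm for a in theta_hat]
    # Remove the component of u along the greedy direction.
    overlap = dot(u, greedy)
    u_perp = [a - overlap * g for a, g in zip(u, greedy)]
    weight = math.sqrt(1.0 - beta ** 2)
    return [weight * g + beta * a for g, a in zip(greedy, u_perp)]


if __name__ == "__main__":
    print(ball_explore_arm([3.0, 4.0, 0.0], 0.6, random.Random(1)))
```

Because the exploration component is orthogonal to the greedy direction, the played arm never leaves the unit ball, and lowering `beta` smoothly shifts weight from exploration to exploitation.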

Our main result characterizes the cumulative reward

Theorem 1.

Consider the linear bandits problem with , satisfying Assumption 1, and . Further assume that and .

Then there exists a constant bounded for and bounded away from zero, such that SmoothExplore achieves, for , cumulative reward

Further, the cumulative reward of any strategy is bounded for as:

We may take the constants and to be:

In the special case where , we have the following result demonstrating that BallExplore has near-optimal performance in the long time horizon as well.

Theorem 2.

Consider the linear bandits problem with , where the set of arms is the unit ball, i.e. . Assume and . Then BallExplore achieves for all :


For , we can obtain a matching upper bound by a simple modification of the arguments in [RT10].

Theorem 3 (Rusmevichientong and Tsitsiklis).

Under the described model, the cumulative reward of any policy is bounded as follows

The above results characterize a sharp dichotomy between a low-dimensional, data rich regime for and a high-dimensional, data poor regime for . In the first case classical theory applies: the reward approaches the oracle performance with a gap of order . This behavior is in turn closely related to central limit theorem scaling in asymptotic statistics. Notice that the scaling with of our upper bound on the risk of BallExplore for large is suboptimal, namely . Since however the difference can be seen only on exponential time scales and is likely to be irrelevant for moderate to large values (see Section 5 for a demonstration). It is an open problem to establish the exact asymptotic scaling of BallExplore (simulations suggest that the upper bound might be tight).

In the high-dimensional, data poor regime , the number of observations is smaller than the number of model parameters and the vector can only be partially estimated. Nevertheless, such a partial estimate can be exploited to produce a cumulative reward scaling as . In this regime performance is not limited by central limit theorem fluctuations in the estimate of . The limiting factor is instead the dimension of the parameter space that can be effectively explored in steps.

In order to understand this behavior, it is convenient to consider the noiseless case . This is a somewhat degenerate case that, although not covered by the above theorem, yields useful intuition. In the noiseless case, acquiring observations , … is equivalent to learning the projection of on the -dimensional subspace spanned by . Equivalently, we learn coordinates of in a suitable basis. Since the mean square value of each component of is , this yields an estimate of (the restriction to these coordinates) with . By selecting in the direction of we achieve instantaneous reward and hence cumulative reward as stated in the theorem.
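The noiseless intuition can be checked numerically. The sketch below makes two assumptions that are not taken from the text: the explored arms are the first t coordinate axes (so each noiseless observation reveals one coordinate exactly), and the components of the user vector are independent Gaussians with variance 1/p. Under that assumed scaling, the squared norm of the partial estimate should be close to t/p. The function names are hypothetical.

```python
import math
import random


def noiseless_estimate(theta, t):
    # Playing the first t coordinate axes without noise reveals theta[0..t-1];
    # the remaining coordinates stay at their prior mean, zero.
    return [theta[i] if i < t else 0.0 for i in range(len(theta))]


def greedy_reward(theta, theta_hat):
    # Reward of the unit-norm arm pointing in the direction of the estimate.
    # Requires a nonzero estimate.
    norm = math.sqrt(sum(a * a for a in theta_hat))
    return sum(a * b for a, b in zip(theta_hat, theta)) / norm


if __name__ == "__main__":
    rng = random.Random(0)
    p = 400
    theta = [rng.gauss(0.0, 1.0 / math.sqrt(p)) for _ in range(p)]
    for t in (25, 100, 400):
        theta_hat = noiseless_estimate(theta, t)
        sq_norm = sum(a * a for a in theta_hat)
        # Columns: t, t/p, squared norm of the estimate, instantaneous reward.
        print(t, t / p, sq_norm, greedy_reward(theta, theta_hat))
```

Since the estimate is the restriction of the true vector to the learned coordinates, the instantaneous reward of the greedy arm equals the norm of the estimate, which is what the last two columns illustrate.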

3 Related work

Auer in [Aue02] first considered a model similar to ours, wherein the parameter and noise are bounded almost surely. This work assumes finite and introduces an algorithm based on upper confidence bounds. Dani et al. [DHK08] extended the policy of [Aue02] to arbitrary compact decision sets . For finite sets, [DHK08] prove an upper bound on the regret that is logarithmic in its cardinality , while for continuous sets they prove an upper bound of . This result was further improved by logarithmic factors in [AYPS11]. The common theme throughout this line of work is the use of upper confidence bounds and least-squares estimation. The algorithms typically construct ellipsoidal confidence sets around the least-squares estimate which, with high probability, contain the parameter . The algorithm then chooses optimistically the arm that appears the best with respect to this ellipsoid. As the confidence ellipsoids are initialized to be large, the bounds are only useful for . In particular, in the high-dimensional data-poor regime , the bounds typically become trivial. In light of Theorem 3 this is not surprising. Even after normalizing the noise-to-signal ratio while scaling the dimension, the dependence of the risk is relevant only for large time scales of . This is the regime in which the parameter has been estimated fairly well.
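The optimistic choice over an ellipsoidal confidence set has a simple closed form, recorded here as a generic illustration. The symbols are assumptions introduced for this sketch and are not taken from the cited papers: $\hat\theta$ is the least-squares estimate, $V$ a positive definite matrix describing the shape of the ellipsoid, and $\alpha$ its radius.

```latex
\mathcal{E} = \bigl\{\theta : (\theta-\hat\theta)^{\mathsf T} V\, (\theta-\hat\theta) \le \alpha^2 \bigr\},
\qquad
\max_{\theta\in\mathcal{E}} \langle x, \theta\rangle
  = \langle x, \hat\theta\rangle + \alpha \sqrt{x^{\mathsf T} V^{-1} x},
\qquad
x_t \in \arg\max_{x} \Bigl\{ \langle x, \hat\theta\rangle + \alpha \sqrt{x^{\mathsf T} V^{-1} x} \Bigr\}.
```

The second term is large in directions that have been explored little, which is why a large initial ellipsoid makes the resulting bounds uninformative on short time horizons.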

Rusmevichientong and Tsitsiklis [RT10] propose a phased policy which operates in distinct phases of learning the parameter and earning based on the current estimate of . Although this approach yields order optimal bounds for the regret, it suffers from the same shortcomings as confidence-ellipsoid based algorithms. In fact, [RT10] also consider a more general policy based on confidence bounds and prove a bound on the regret.

Our approach to the problem is significantly different and does not rely on confidence bounds. It would be interesting to understand whether the techniques developed here can be used to improve the confidence bounds method.

4 On Assumption 1

The geometry of the set of arms is an important factor in the performance of any policy. For instance, [RT10], [DHK08] and [AYPS11] provide “problem-dependent” bounds on the regret incurred in terms of the difference between the reward of the optimal arm and the next-optimal arm. This characterization is reasonable in the long time horizon: if the posterior estimate of the feature vector coincided with itself, only the optimal arm would matter. Since the posterior estimate converges to in the limit of large , the local geometry of around the optimal arm dictates the asymptotic behavior of the regret.

In the high-dimensional, short-time regime, the global geometry of plays instead a crucial role. This is quantified in our results through the parameters and appearing in Assumption 1. Roughly speaking, this amounts to requiring that is ‘spread out’ in the unit ball. It is useful to discuss this intuition in a more precise manner. For the proofs of statements in this section we refer to Appendix A.

A simple case is the one in which the arm set contains a ball.

Lemma 4.1.

If , then satisfies Assumption 1 with , .

The last lemma does not cover the interesting case in which is finite. The next result shows however that, for Assumption 1.2 to hold it is sufficient that the closure of the convex hull of , denoted by , contains a ball.

Proposition 4.2.

Assumption 1.2 holds if and only if .

In other words, Assumption 1.2 is satisfied if is ‘spread out’ in all directions around the origin.

Finally, we consider a concrete example with finite. Let be i.i.d. uniformly random in . We then refer to the set of arms as a uniform cloud.

Proposition 4.3.

A uniform cloud in dimension satisfies Assumption 1 with , and with probability larger than .

5 Numerical results

Figure 1: Left frame: Cumulative reward in the data poor regime (here , ) as obtained through numerical simulations over synthetic data, together with analytical upper bound. Right frame: Cumulative risk in the data rich regime (again, , ).

We will mainly compare our results with those of [RT10] since the results of that paper directly apply to the present problem. The authors proposed a phased exploration/exploitation policy, wherein they separate the phases of learning the parameter (exploration) and earning reward based on the current estimate of (exploitation).

In Figure 1 we plot the cumulative reward and the cumulative risk incurred by our policy and the phased policy, as well as analytical bounds thereof. We generated randomly for , and produced observations , according to the general model (1) with and arm set . The curves presented here are averages over realizations and statistical fluctuations are negligible.

The left frame illustrates the performance of SmoothExplore in the data poor (high-dimensional) regime . We compare the cumulative reward as achieved in simulations, with that of the phased policy of [RT10] and with the theoretical upper bound of Theorem 1 (and Theorem 3 for ). In the right frame we consider instead the data rich (low-dimensional) regime . In this case it is more convenient to plot the cumulative risk . We plot the curves corresponding to the ones in the left frame, as well as the upper bound (lower bound on the reward) from Theorems 1 and 2.

Note that the behavior of the risk of the phased policy can be observed only for . On the other hand, our policy displays the correct behavior for both time scales. The extra factor in the exponent yields a multiplicative factor larger than only for .

Figure 2: Results using the Netflix (left frame) and MovieLens 1M (right frame) datasets. SmoothExplore is effective in learning the user’s preferences and is well described by the predicted behavior of Theorem 1.

The above set of numerical experiments used . For applications to recommendation systems, is in correspondence with a certain catalogue of achievable products or contents. In particular, is expected to be finite. It is therefore important to check how SmoothExplore performs for a realistic set of arms. We plot results obtained with the Netflix Prize dataset and the MovieLens 1M dataset in Figure 2. Here the feature vectors ’s for movies are obtained using the matrix completion algorithm of [KMO10b]. The user parameter vectors were obtained by regressing the rating against the movie feature vectors (the average user rating was subtracted). Similar to synthetic data, we took . Regression also yields an estimate for the noise variance which is assumed known in the algorithm. We then simulated an interactive scenario by postulating that the rating of user for movie is given by

where quantizes to the integers 1 to 5 (corresponding to a one-to-five star rating). The feedback used for our simulation is the centered rating .
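A sketch of this simulated feedback is given below. Several details are assumptions made for illustration rather than facts from the text: the quantizer rounds to the nearest integer and clips to the range 1 to 5, the user's average rating is added back before quantizing, and centering subtracts that same average. The names `quantize_rating` and `simulate_feedback` are hypothetical.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def quantize_rating(value, low=1, high=5):
    # Round half up to the nearest integer, then clip to the star range.
    nearest = math.floor(value + 0.5)
    return max(low, min(high, nearest))


def simulate_feedback(item_vec, user_vec, user_mean, noise_std, rng):
    # Assumed model: star rating = quantized (user mean + scalar product + noise).
    raw = user_mean + dot(item_vec, user_vec) + rng.gauss(0.0, noise_std)
    stars = quantize_rating(raw)
    # Centered rating returned to the bandit policy.
    return stars - user_mean


if __name__ == "__main__":
    rng = random.Random(0)
    print(simulate_feedback([0.5, -0.2], [1.0, 0.3], 3.6, 0.8, rng))
```

Quantization and clipping make the effective noise non-Gaussian, which is the modeling mismatch the experiments are meant to probe.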

We implement a slightly modified version of SmoothExplore for these simulations. At each time we compute the ridge regression estimate of the user feature vector as before and choose the “best” movie assuming our estimate is error free. We then construct the ball in with center and radius . We list all the movies whose feature vectors fall in this ball, and recommend a uniformly randomly chosen one in this list.
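The recommendation step just described can be sketched as follows. This is an illustrative sketch, not the authors' code: the radius is left as a free parameter, the estimate is assumed to be supplied by a separate ridge regression routine, and `recommend_near_best` is a hypothetical name.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def recommend_near_best(theta_hat, item_vecs, radius, rng):
    """Return the index of a movie drawn uniformly from a ball around the best one."""
    # "Best" movie if the current estimate were error free.
    best = max(range(len(item_vecs)), key=lambda i: dot(theta_hat, item_vecs[i]))
    center = item_vecs[best]
    # All movies whose feature vectors fall within the ball around the best movie.
    candidates = [i for i, x in enumerate(item_vecs)
                  if math.dist(x, center) <= radius]
    return rng.choice(candidates)


if __name__ == "__main__":
    movies = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
    print(recommend_near_best([1.0, 0.0], movies, 0.2, random.Random(0)))
```

The candidate list always contains the best movie itself, so the draw is well defined for any non-negative radius; a radius of zero reduces the rule to the greedy policy.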

Classical bandit theory implies the reward behavior is of the type where and are (dimension-dependent) constants. Figure 2 presents the best fit of this type for . The description appears to be qualitatively incorrect in this regime. Indeed, in this regime, the reward behavior is better explained by a curve. These results suggest that our policy is fairly robust to the significant modeling uncertainty inherent in the problem. In particular, the fact that the “noise” encountered in practice is manifestly non-Gaussian does not affect the qualitative predictions of Theorem 1.

A full validation of our approach would require an actual interactive realization of a recommendation system [DM13]. Unfortunately, such validation cannot be provided by existing datasets, such as the ones used here. A naive approach would be to use the actual ratings as the feedback , but this suffers from many shortcomings. First of all, each user rates a sparse subset (of the order of movies) of the whole database of movies, and hence any policy to be tested would be heavily constrained and distorted. Second, the set of rated movies is a biased subset (since it is selected by the user herself).

6 Proof of Theorem 1

We begin with some useful notation. Define the σ-algebra . Also let . We let and denote the posterior mean and covariance of given observations. Since is Gaussian and the observations are linear, it is a standard result that these can be computed as:

Note that since is Gaussian and the measurements are linear, the posterior mean coincides with the maximum a posteriori estimate for . This ensures our notation is consistent.
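For reference, the standard conjugate-Gaussian computation takes the following generic form. The symbols are assumptions chosen for this illustration: a prior $\theta \sim \mathsf{N}(0, \Sigma_0)$, arms $x_s$, feedback $y_s = \langle x_s, \theta\rangle + z_s$ with independent Gaussian noise of variance $\sigma^2$, and a posterior indexed by the number $t$ of observations used. The paper's own normalization of the prior and its time indexing may differ.

```latex
\Sigma_t = \Bigl(\Sigma_0^{-1} + \frac{1}{\sigma^2}\sum_{s=1}^{t} x_s x_s^{\mathsf T}\Bigr)^{-1},
\qquad
\hat\theta_t = \frac{1}{\sigma^2}\,\Sigma_t \sum_{s=1}^{t} x_s\, y_s .
```

With an isotropic prior $\Sigma_0 = \tau^2 \mathrm{I}$, the mean $\hat\theta_t$ is exactly the ridge regression estimate with regularization parameter $\sigma^2/\tau^2$.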

6.1 Upper bound on reward

At time , the expected reward , where the first inequality follows from Cauchy-Schwarz, that is unbiased and that . Since :


We have, applying Jensen’s inequality and further simplification:

Using this to bound the right hand side of Eq. (3)

The cumulative reward can then be bounded as follows:

Here we define .

6.2 Lower bound on reward

We compute the expected reward earned by SmoothExplore at time as:


The following lemma guarantees that is .

Lemma 6.1.

Under the conditions of Theorem 1 we have, for all :


Using this lemma we proceed by bounding the right side of Eq. (6.2):

Computing cumulative reward we have:

Thus, letting , we have the required result.

6.3 Proof of Lemma 6.1

In order to prove that , we will first show that is . Then we prove that is sub-gaussian, and use this to arrive at the required result.

Lemma 6.2 (Growth of Second Moment).

Under the conditions of Theorem 1:



We rewrite using the following inductive form:


Here is a random zero mean vector. Conditional on , is distributed as and is independent of and . Recall that the σ-algebra . Then we have:


The cross terms cancel since and conditionally on are independent and zero mean. The expectation in the second term can be reduced as follows:

The third term can be seen to be:

Thus we have, continuing Eq. (6):


Since , some calculation yields that:

Thus Eq. (7) reduces to


We now bound the additive term in Eq. (8). We know that (the prior covariance), thus since . Hence the denominator in Eq. (8) is upper bounded by . To bound the numerator:

since by Assumption 1. Using this in Eq. (8), we take expectations to get:


Considering the second term in Eq. (9):

where is the eigenvalue of . Continuing the chain of inequalities:

where the last inequality follows from the fact that for each . Combining this with Eq. (9) gives:


Summing over this implies:

The last inequality follows from the fact that is concave in . Using , we obtain:

Lemma 6.3 (Sub-Gaussianity of ).

Under the conditions of Theorem 1