Contextual Bandits with Latent Confounders: An NMF Approach

06/01/2016 ∙ by Rajat Sen, et al.

Motivated by online recommendation and advertising systems, we consider a causal model for stochastic contextual bandits with a latent low-dimensional confounder. In our model, there are L observed contexts and K arms of the bandit. The observed context influences the reward obtained through a latent confounder variable with cardinality m (m ≪ L, K). The arm choice and the latent confounder causally determine the reward, while the observed context is correlated with the confounder. Under this model, the L × K mean reward matrix U (for each context in [L] and each arm in [K]) factorizes into non-negative factors A (L × m) and W (m × K). This insight enables us to propose an ϵ-greedy NMF-Bandit algorithm that designs a sequence of interventions (selecting specific arms) to achieve a balance between learning this low-dimensional structure and selecting the best arm to minimize regret. Our algorithm achieves a regret of O(L poly(m, K) log T) at time T, as compared to O(LK log T) for conventional contextual bandits, assuming a constant gap between the best arm and the rest for each context. These guarantees are obtained under mild sufficiency conditions on the factors that are weaker versions of the well-known Statistical RIP condition. We further propose a class of generative models that satisfy our sufficient conditions, and derive a lower bound of Ω(Km log T). These are the first regret guarantees for online matrix completion with bandit feedback when the rank is greater than one. We further compare the performance of our algorithm with the state of the art on synthetic and real-world data-sets.







1 Introduction

The study of bandit problems captures the inherent tradeoff between exploration and exploitation in online decision making. In various real world settings, policy designers have the freedom of observing specific samples and learning a model of the collected data on the fly; this online learning is instrumental in making future decisions. For instance in movie recommendations, algorithms suggest movies to users in order to meet their interests and simultaneously learn their preferences in an online manner. Similarly, for product recommendations (e.g. in Amazon) or web advertisement, there is an inherent tradeoff between collection of training data for user preferences, and recommending the best items that maximize profit according to the currently learned model. Multi-armed bandit problems provide a principled approach to attain this delicate balance between exploration and exploitation [9].

The classic K-armed bandit problem has been studied extensively for decades. In the stochastic setting, one is faced with the choice of pulling one arm during each time-slot among K arms, where arm k has mean reward μ_k. The task is to accumulate a total reward as close as possible to that of a genie strategy that has prior knowledge of the arm statistics and always selects the optimal arm in each time-slot. The expected difference between the rewards collected by the genie strategy and the online strategy is defined as the regret. The expected regret of state-of-the-art algorithms [9] scales as O(K log T) when there is a constant gap between the best arm and the rest.
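As a concrete illustration of this explore/exploit tradeoff, here is a minimal Python sketch of an ϵ-greedy strategy on a stochastic Bernoulli bandit that tracks pseudo-regret. All names and parameter values are our own illustrative choices, not the paper's:

```python
import numpy as np

def eps_greedy_bandit(means, T, eps=0.1, seed=0):
    """Run epsilon-greedy on a stochastic Bernoulli bandit; return pseudo-regret."""
    rng = np.random.default_rng(seed)
    K = len(means)
    counts = np.zeros(K)
    sums = np.zeros(K)
    regret = 0.0
    best = max(means)
    for _ in range(T):
        if rng.random() < eps or counts.min() == 0:
            arm = rng.integers(K)                    # explore: uniform random arm
        else:
            arm = int(np.argmax(sums / counts))      # exploit: empirically best arm
        sums[arm] += rng.random() < means[arm]       # Bernoulli reward
        counts[arm] += 1
        regret += best - means[arm]                  # pseudo-regret increment
    return regret

# With a constant gap, most regret comes from the forced exploration rounds.
r = eps_greedy_bandit([0.9, 0.5, 0.4], T=5000)
```

The genie baseline here is simply `best = max(means)`; the online strategy pays the gap each time it deviates from that arm.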

When side-information is available, a popular model is the contextual bandit, where the side information is encoded through observed contexts. In the stochastic setting, at each time an observed context is revealed, and this context influences the reward statistics of the arms. Thus, there are LK reward parameters (encoded through the L × K reward matrix U) that need to be learned, one per arm and observed context. Since there are LK reward parameters, it has been shown [9, 39] that the best expected regret obtainable scales as O(LK log T).

Netflix Example: Consider the task of recommending movies to user profiles on Netflix. A user profile, along with past browsing history and social and demographic information, is the observed context. The movies that can be recommended to any user are the arms of the bandit. In this setting with millions of users and items, standard contextual bandit algorithms are rendered impractical due to the O(LK log T) scaling.

Therefore, it is important to exploit the fact that in most practical situations, the underlying factors affecting the rewards may have a low-dimensional structure. Although this low-dimensional structure is often not observable (latent), we will show that it can be leveraged to obtain better regret bounds. In the context of Netflix, there are millions of user profiles, but the preference of users towards an item may be represented by a combination of only a handful of moods, where these moods lie in a much lower dimension. This is further corroborated by the fact that the Netflix data-set, which has more than 100 million movie ratings, can be approximated surprisingly well by matrices of rank as low as 20 [7]. Crucially, however, these moods cannot be directly observed by a learning algorithm.

This problem of a contextual bandit with a latent structure has a direct analogy with the problem of designing structural interventions (forcing variables to take particular values) in causal graphs, a class of problems of increasing importance in the social sciences, economics, epidemiology, and computational advertising [31, 8].

A Causal Perspective: A causal model [31] is a directed graph that encodes causal relationships between a set of random variables, where each variable is represented by a node of the graph (see Figure 1(a)). This example is a directed graph with 3 variables, where the reward variable has two parents: the observed context and the arm choice.

To illustrate the connection between contextual bandits and causal models, consider again the Netflix example, which can be mapped to the causal graph in Figure 1(a). Here, the reward Y (satisfaction of the user) is causally dependent on two quantities: the observed context (user profile in Netflix), described by X, and the arm selection (the recommended movie), described by the variable a. Setting a to a particular value is equivalent to playing a particular arm (the act of recommending an item). In this example, a is the only variable that can be directly controlled by the algorithm; in the language of causality this is known as an intervention [31], denoted by do(a).

More specifically, this contextual bandit setting maps to the causal graph problem of affecting a target variable (satisfaction of users) through limited interventional capacity (only being able to recommend a movie), when other observable causes (user profiles and contextual information) affecting the target variable are present but cannot be controlled. This is precisely the model in Figure 1(a). An identical structural equation model has been defined in Figure 8 of [8].

(a) Conventional contextual bandit. (b) Contextual bandit with a latent confounder.
Figure 1: Comparison between regular contextual bandits and contextual bandits with latent confounders through causal graphs.

Latent Confounders: In this causal framework, it is possible to formally capture the implications of latent effects, such as the moods in the context of Netflix. Consider the modified causal model in Figure 1(b). The new variable Z denotes a latent confounder (mood) that is causally connected to the observed context X and also causally affects the reward Y. The latent confounder takes values in [m], where m ≪ L, K.

The goal here is to develop an efficient algorithm that chooses a sequence of limited interventions (i.e., a sequence of actions) to achieve a balance between learning this latent variable (indirectly learning Z) from observed rewards, and maximizing the observed reward under the given (but not intervenable) observed context X.

In the setting of contextual bandits with L observed contexts and K arms, we note that the presence of the m-dimensional latent confounder leads to a factorization of the L × K reward matrix U into non-negative factors A (an L × m matrix) and W (an m × K matrix). We leverage this latent low-dimensional structure to develop an ϵ-greedy NMF-Bandit algorithm that achieves a balance between learning the hidden low-dimensional structure (indirectly learning Z), and selecting the best arm to minimize regret. In the setting of causality, this result thus demonstrates an approach to designing a sequence of interventions with limited capacity to control a reward variable, in the presence of other (possibly latent) variables affecting the reward that cannot be intervened upon.

1.1 Main Contributions

The main contributions of this paper are as follows:

1. (Model for Latent Confounding Contexts) We investigate a causal model for contextual bandits (Figure 1(b)), which, compared to the conventional model, allows more degrees of freedom through the unobservable context variable. This allows us to better capture real-world scenarios. In particular, our model has

(a) Latent Confounders, representing unobserved low-dimensional variables affecting the mean rewards of the bandit arms under an observed context; and (b) Limited Interventional Capacity, signifying that the observed contexts (e.g., user profiles) cannot be intervened upon.

In the contextual bandit setting with L observed contexts and K arms, this translates into a decomposition of the reward matrix U = AW, where A (a non-negative L × m matrix) represents the relation between X (observed context) and Z (hidden confounder), while W (a non-negative m × K matrix) encodes the relation between the reward Y and Z.

2. (NMF-Bandit Algorithm) We propose a latent contextual bandit algorithm that, in an online fashion, multiplexes two tasks. The first task refines the current estimate of the matrix W by performing a non-negative matrix factorization (NMF) on the sampled version of a carefully chosen sub-matrix of the mean-reward matrix U. The second task uses the current estimate of W and refines the estimate of A from sampled versions of several sub-matrices of U.

A direct application of results from the existing noisy matrix completion literature is infeasible in the bandit setting. In that literature, one of the key conditions for deriving spectral norm bounds between the recovered matrix and the ground truth is that the noise in each entry be uniformly small across the matrix [20]. In the bandit setting, where errors occur due to sampling, meeting this condition would require excessive exploration and lead to substantially larger regret. We provide further insights in Section A.2 in the appendix.

In contrast, our algorithm has much stronger regret guarantees, scaling as O(L poly(m, K) log T).

We show that our algorithm succeeds when the non-negative matrices A and W satisfy conditions weaker than the well-known statistical RIP property [35]. Further, we prove a lower bound for this setting that is within polynomial factors of our upper bound. This is the first work with provable guarantees for matrix completion with bandit feedback when the rank is greater than one.

3. (Generative Models for A and W) We propose a family of generative models for the factors A and W that satisfy the above sufficient conditions for recovery. These models are extremely flexible, and employ a random + deterministic composition, where there can be a large number of arbitrary bounded deterministic entries (see Section 2.4 for details). The remaining random entries in the matrices are generated from mean-shifted sub-gaussian distributions (commonly used in the compressive sensing literature).
Finally, we numerically compare our algorithm with contextual versions of the UCB-1 and Thompson Sampling algorithms [9], and with online matrix factorization algorithms [25], on synthetic and real-world data-sets.

1.2 Related Work

The current work falls at the intersection of learning of low-dimensional causal structures and multi-armed bandit problems. We briefly review the areas of literature that are most relevant to our work.

Contextual Bandit Problems: There has been significant progress on contextual bandits in both the adversarial and the stochastic setting. In the adversarial setting, the best known regret bounds scale as O(√(LKT)) [9, 38], where L is the number of contexts and K is the number of arms. In the stochastic regime, where there is a constant gap from the best arm, it can be shown that the regret scales as O(LK log T) [39]. Contextual bandits with linear payoff functions have been analyzed in [2, 12] in the adversarial setting, and in [1] in the stochastic setting. In [15], the authors extended this model to the generalized linear model regime.

However, these models require one of the low-dimensional features to be known a priori, while our algorithm learns both factors from sampled data. Another related line of work is on the online clustering of bandits [17, 30, 29]. In that framework, the features of the arms can be directly observed, which is the fundamental difference from our paper.

Causality and Bandits: Recently, contextual bandit algorithms have found use within the framework of causality. In [5], the authors investigate a similar latent confounder model. However, [5] does not consider our scaling regime, does not provide theoretical guarantees, and uses a very different algorithm.

In [28], a causal model for observing feedback was introduced in the best-arm identification regime. However, in their model all the variables can be intervened upon. Moreover, the states of all the non-intervened variables, including the reward, are revealed after the intervention is made. In this work, we focus on a more realistic case where only some of the variables can be intervened upon, and in fact some of the variables cannot be observed directly. Further, side information about the observed variables is revealed before an intervention has to be made; the reward is the only extra information revealed after each intervention.

Online Matrix Factorization: The non-negative matrix factorization (NMF) problem has generated a lot of interest in the area of semi-supervised topic modeling. Arora et al. have shown that if the matrix is separable and has some robustness properties [4], then NMF is efficiently solvable. Since then, there has been a lot of work proposing efficient, scalable algorithms for NMF, of which [18, 13, 32] are of particular interest. There has been some progress in online NMF [14, 19], which aims to update the features efficiently in a streaming sense. To the best of our knowledge, there has been no work on NMF with bandit feedback with theoretical guarantees. [27, 25] propose algorithms for online matrix factorization; however, they provide theoretical analysis only for the rank-1 case.

2 Problem Statement and Main Results

2.1 System Model

Observed Contexts and Latent Confounders: We consider a stochastic bandit model represented by the causal graph in Figure 1(b). The variable X denoting the observed context takes values in [L], while the variable a determining the arm that is pulled takes values in [K]. The variable Z denotes the latent confounding context and takes values in [m], where m ≪ L, K. The causal model results in a Bayesian factorization of the joint distribution of X and Z. A natural interpretation is that, at any time, nature chooses a latent context z ∈ [m], and based on that, a context x ∈ [L] is actually observed. We denote the posterior probability of a latent context given an observed context by A(x, z) = P(Z = z | X = x).

Let A be the L × m matrix with elements A(x, z), where x ∈ [L] and z ∈ [m]. Please note that A contains a sub-matrix, corresponding to m of its row indices, that forms an m × m identity matrix. This is essentially the well-known separability condition [32]. We also define the marginal probability of observing a context as p(x) = P(X = x). This specifies the joint distribution of the latent context Z and the observed context X.

Bandit Setting: In this setting, the contextual bandit problem can be described as follows: (i) at each time t, the algorithm observes a context x(t) ∈ [L]; (ii) after observing the context, the algorithm selects an arm a(t) ∈ [K], which is the intervention do(a(t)); and (iii) the algorithm then obtains a Bernoulli reward with mean U(x(t), a(t)). The mean rewards have a latent structure described in the next subsection.

Rewards: When an observed context is provided, the reward for an arm depends only on the latent variables. Consider an m × K reward matrix W, where W(z, k) specifies the mean reward for arm k when the latent context is z. For all observed contexts, the mean rewards are given by the L × K matrix U. This is given by: U = AW.

Therefore, we have U = AW. Since the latent contexts are also a subset of the observed contexts, the matrix A contains an m × m identity sub-matrix. This is equivalent to the separability condition that is widely used in the NMF literature (see [18]). A represents the X–Z relation, while the matrix W encodes the Z–Y relation in the causal model of Figure 1(b).
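As a sanity check on this factorization, the following sketch builds a separable A and a bounded W and verifies that U = AW is a valid low-rank matrix of Bernoulli means. The sizes and distributions are our own illustrative choices, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
L, K, m = 20, 15, 3                    # illustrative dimensions

# A: each row is a posterior distribution P(Z | X = x); separability embeds I_m,
# reflecting that the latent contexts are a subset of the observed contexts.
A = rng.dirichlet(np.ones(m), size=L)
A[:m] = np.eye(m)

# W: mean reward of each arm under each latent context, bounded in [0, 1).
W = rng.random((m, K))

U = A @ W                              # L x K mean-reward matrix of rank <= m
```

Since each row of A is a probability vector, every entry of U is a convex combination of entries of W, so U is a valid matrix of Bernoulli means.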

Regret: The goal is to minimize regret (also known as pseudo-regret [9]) relative to a genie strategy that knows the matrix U. Let us denote the best arm under a context x by a*(x) = arg max_k U(x, k) and the corresponding reward by U*(x). We are now in a position to define the regret of an algorithm at time T:

Reg(T) = Σ_{t=1}^{T} E[U*(x(t)) − U(x(t), a(t))].

Note that the genie policy always selects the arm a*(x) when X = x. The class of policies we optimize over is agnostic to the true reward matrices A and W; however, we assume that m (the latent dimension) is a known scalar parameter. We work in the problem-dependent setting, where there is a gap (bounded away from zero) between the mean reward of the best arm and that of the second best for every observed context. The gap Δ is defined as

Δ = min_{x ∈ [L]} ( U*(x) − max_{k ≠ a*(x)} U(x, k) ).
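The gap just described can be computed directly from any mean-reward matrix; a small sketch (the function name is ours):

```python
import numpy as np

def context_gap(U):
    """Smallest margin between the best and second-best arm over all contexts."""
    U = np.asarray(U, dtype=float)
    sorted_rows = np.sort(U, axis=1)              # ascending within each context
    margins = sorted_rows[:, -1] - sorted_rows[:, -2]
    return margins.min()

U_demo = np.array([[0.9, 0.5, 0.4],
                   [0.2, 0.7, 0.6]])
gap = context_gap(U_demo)                         # min(0.4, 0.1) = 0.1
```

A problem-dependent regret bound of the O(log T) form requires this quantity to be bounded away from zero.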

2.2 Notation

We denote matrices by bold capital letters (e.g., A) and vectors by bold small letters (e.g., v). For an n × d matrix M, M(S, :) denotes the sub-matrix restricted to the rows in the set S, while M(:, S) denotes the sub-matrix restricted to the columns in S. σ_k(M) denotes the k-th smallest singular value of M. ‖v‖_p denotes the ℓ_p-norm of a vector v. For a matrix M, ‖M‖_{1,∞} refers to the maximum ℓ_1-norm among all its rows, while ‖M‖_2 and ‖M‖_F denote its spectral and Frobenius norms, respectively. ‖M‖_max denotes the maximum absolute value of an element in the matrix. Ber(p) denotes a Bernoulli random variable with mean p.
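For concreteness, these notational objects map onto NumPy operations as follows (a minimal demo of our own):

```python
import numpy as np

M = np.arange(12, dtype=float).reshape(3, 4)

rows = [0, 2]
sub = M[rows, :]                        # sub-matrix restricted to the rows in `rows`

sv = np.linalg.svd(M, compute_uv=False) # singular values, descending; sv[-1] is smallest
row_l1 = np.abs(M).sum(axis=1).max()    # maximum l1-norm among the rows
spec = np.linalg.norm(M, 2)             # spectral norm (largest singular value)
fro = np.linalg.norm(M, 'fro')          # Frobenius norm
maxabs = np.abs(M).max()                # maximum absolute entry
```

Note that the spectral norm never exceeds the Frobenius norm, a fact used repeatedly when converting between norm bounds.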

2.3 Main results

We first provide a few definitions before presenting our main results.

Definition 1.

Consider an matrix with . Define .

Definition 2.

Consider an matrix with . Define .

In our work, we require the matrices A and W to satisfy weaker versions of the 'statistical RIP property' (RIP: restricted isometry property). This property has been well studied in the sparse recovery literature [6, 11, 36, 35, 10]. The statistical RIP property is a randomized variant of the well-known RIP condition [16]. RIP requires the extreme singular values to be bounded for sub-dictionaries formed by any k columns (or rows) of a dictionary, for a suitable k. The statistical RIP property is a weaker probabilistic version, where the extreme singular values need only be bounded with high probability for random sub-dictionaries, formed by choosing k random columns of the dictionary. We note that this same property goes by different names, such as the weak RIP property [11] and the quasi-isometry property [10], in the literature. The terminology we adopt in this work is from [6].

Definition 3.

(Statistical RIP Property - StRIP) An n × d matrix (n > d) whose rows have unit norm satisfies the k-Statistical RIP Property (k-StRIP) with given constants if the extreme singular values of a random k-row sub-matrix are suitably bounded with high probability, where the probability is taken over sampling a set of size k uniformly from [n].

In our work, we need only a weaker version of the StRIP condition to hold: we require only that the smallest singular value be bounded below for random sub-matrices, and we work with un-normalized matrices. Hence, we have the following version, which we will use:

Definition 4.

(Weak Statistical RIP Property - WStRIP) An n × d matrix (n > d) satisfies the Weak Statistical RIP Property (WStRIP) with given constants if the smallest singular value of a random k-row sub-matrix is bounded below with high probability, where the probability is taken over sampling a set of size k uniformly from [n].

For one of the matrices among A and W, we need its random sub-matrices to satisfy weaker RIP-like conditions in the ℓ∞ sense.

Definition 5.

(ℓ∞ Weak Statistical RIP Property - ℓ∞-WStRIP) An n × d matrix (n > d) satisfies the ℓ∞ weak statistical RIP property (ℓ∞-WStRIP) with given constants if the analogous lower bound holds for random sub-matrices in the ℓ∞ sense, where the probability is taken over sampling a set of size k uniformly from [n].

In what follows, we assume that A and W satisfy the corresponding WStRIP conditions. Note that in Section 2.4 we provide reasonable generative models for A and W that satisfy these conditions with high probability.
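A WStRIP-type condition can be probed empirically by sampling many random row subsets and inspecting the smallest singular value of each. A rough numerical sketch, with matrix sizes and trial counts of our own choosing:

```python
import numpy as np

def min_singular_samples(M, k, trials=200, seed=0):
    """Smallest singular values of `trials` random k-row sub-matrices of M."""
    rng = np.random.default_rng(seed)
    n = M.shape[0]
    sigmas = []
    for _ in range(trials):
        rows = rng.choice(n, size=k, replace=False)     # random subset of rows
        sigmas.append(np.linalg.svd(M[rows], compute_uv=False).min())
    return np.array(sigmas)

rng = np.random.default_rng(1)
A = rng.random((50, 3))                # generic non-negative 50 x 3 matrix
s = min_singular_samples(A, k=5)
```

A histogram of `s` bounded away from zero is (informal) evidence that the WStRIP-style condition holds for the chosen subset size.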

We are now in a position to state Theorem 1, which shows the existence of an algorithm for the latent contextual bandit setting with regret that scales at a much slower rate than the usual guarantees.

Theorem 1.

Consider the bandit model with reward matrix U = AW. Suppose A is separable [32], and that A and W satisfy the WStRIP conditions above with suitable constants. Suppose further that the gap Δ is bounded away from zero for all contexts. Then there exists a randomized algorithm whose regret at time T is bounded, with high probability, as

Reg(T) = O(L poly(m, K) log T).

We present an algorithm that achieves this performance in Section 3. This theorem is re-stated as Theorem 8 in the appendix, which contains greater detail specific to our algorithm. It should be noted that in practice our algorithm incurs much lower regret than the bound suggests. This can be observed in Section 4, where our algorithm performs well even when we set the explore rate much lower than prescribed.

Remark: In prior works [6, 11, 36, 35, 10], the statistical RIP property was established by relating it to the incoherence parameter of the matrix, i.e., the maximum absolute inner product between distinct rows. In some works, the average of these incoherence parameters has been used instead. We note that the matrices A and W are non-negative; hence, directly using analysis based on controlling dot-products among rows and columns is not useful in this scenario. We therefore propose generative models for A and W that satisfy the properties listed above with high probability, even when the matrices are not incoherent. We also explain why these generative models are extremely reasonable for our setting.

2.4 Generative Models for A and W

We briefly describe our semi-random generative models for A and W that satisfy the weak statistical RIP conditions. We refer the reader to Section A.3 for a more detailed discussion of the generative models.

  1. Random+Deterministic Composition: A significant fraction of the entries of A and W are arbitrary and deterministic: a constant fraction of the columns of W and of the rows of A may be deterministic. In addition, we assume that an m × m sub-matrix in the deterministic part of A is an identity matrix, to account for the separability condition [32]. The rest of the entries are mean-shifted, bounded sub-gaussian random variables satisfying some additional mild conditions. The uniform prior on rewards that has been used in the bandit setting [26] reduces to a special case of this model.

  2. Bounded randomness in the random part: The random entries of both A and W are in "general position", i.e., they arise from mean-shifted, bounded sub-gaussian distributions (see Section A.3, and also [16] for similar conditions in the compressed sensing literature). The mean shifts in the random parts of A and W, and the support of the sub-gaussian randomness, satisfy some technical conditions to ensure that each row of A sums to 1 and that the weak statistical RIP conditions are satisfied.
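A toy instance of this random + deterministic composition for W might look as follows; the split point, mean shift, and variance are our own illustrative choices, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(2)
m, K = 3, 12
n_det = 4                                  # number of deterministic columns (assumption)

W = np.empty((m, K))
# Deterministic part: arbitrary bounded entries (here just fixed random draws
# standing in for adversarially chosen values in [0, 1]).
W[:, :n_det] = rng.random((m, n_det))
# Random part: mean-shifted, bounded (hence sub-gaussian) randomness.
mean_shift = 0.5
W[:, n_det:] = np.clip(
    mean_shift + 0.15 * rng.standard_normal((m, K - n_det)), 0.0, 1.0)
```

Clipping keeps the random entries bounded, which preserves sub-gaussianity while keeping every entry a valid Bernoulli mean.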

One of our main results is stated as Theorem 2, which implies that if W comes from our generative model, then with high probability projecting it onto a small random subset of its columns preserves the robust simplicial property [32], which is a key step in our algorithm.

Theorem 2.

Let W follow the random model in Section A.3. Then W satisfies the WStRIP property with high probability, with constants that depend on the sub-gaussian parameter, which in turn depends on the variance in the model for W.
In Theorem 3, we follow very similar techniques to prove that small random subsets of rows of A have singular values bounded away from zero with high probability if A is drawn from our generative model.

Theorem 3.

Let A follow the random model in Section A.3. Then A satisfies the WStRIP property with high probability, with constants that depend on the sub-gaussian parameter, which in turn depends on the variance in the model for A.

The proofs of these theorems are available in Section A.4 of the appendix.

2.5 Lower Bound

We prove a problem-specific regret lower bound for a specific class of parameters, which is within polynomial factors of the upper bound achieved by our algorithm. The lower bound holds for all policies in the class of α-consistent policies [33], defined below.

Definition 6.

A scheduling policy is said to be α-consistent if, given any problem instance, the expected number of times any suboptimal arm is selected up to time T grows at most as O(T^α), for every observed context.

Theorem 4.

There exists a problem instance with a constant gap Δ for all contexts such that the regret of any α-consistent policy is lower-bounded as

Reg(T) = Ω(Km log T)

for any α ∈ (0, 1), where the hidden constants are universal constants independent of the problem parameters, together with a constant that depends on the entries of A and W and is independent of L, K, and m.

The proof of this theorem is deferred to Section A.11 of the appendix, where we specify the class of problem parameters for which we construct this bound.

3 NMF-Bandit Algorithm

In this section we present an ϵ-greedy algorithm that we call the NMF-Bandit algorithm. Our algorithm takes advantage of the low-dimensional structure of the reward matrix. The algorithm explores with probability ϵ_t; in this case it samples from specific sets of arms (to be specified later). Otherwise, with probability 1 − ϵ_t, it exploits, i.e., chooses the best arm based on the current estimates of the rewards to minimize regret. A detailed pseudo-code of our algorithm is presented as Algorithm 1 in the appendix. The key steps in the algorithm are as follows.


(a) At each time, with probability ϵ_t the algorithm explores, i.e., it randomly performs one of these two steps:

Step 1 – (Sampling for NMF in low dimensions to estimate W): Given that it explores, with some fixed probability the algorithm samples a random arm from a subset of arms. This subset is randomly chosen at the onset and kept fixed thereafter. This is Step 6 of Algorithm 1.

Step 2 – (Sampling for estimating A): Otherwise, the algorithm samples in a context-dependent manner. If the context at the time is x, the algorithm samples one arm at random from a context-specific set of arms (the selection of these sets is outlined below). The context-specific sets of arms are designed at the start of the algorithm and held fixed thereafter. This is Step 7 of Algorithm 1.

(b) Otherwise, with probability 1 − ϵ_t, it exploits by performing Step 3 below.

Step 3 – (Choose best arm for current observed context): Compute the estimate Ŵ as detailed in Step 10 of Algorithm 1. Estimate Â as detailed in Step 11 of Algorithm 1. Let Û = ÂŴ. The algorithm plays the arm arg max_k Û(x, k), i.e., the best arm for the observed context according to the current estimates.
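The multiplexing of explore and exploit can be sketched as follows. This simplified version substitutes a plain multiplicative-update NMF for the robust Hottopix sub-routine, uses a fixed exploration rate, and merges Steps 1 and 2 into uniform exploration, so it illustrates the structure rather than reproducing the paper's algorithm:

```python
import numpy as np

def nmf_mu(V, m, iters=40, seed=0):
    """Lee-Seung multiplicative-update NMF: V ~ A @ W with non-negative factors."""
    rng = np.random.default_rng(seed)
    A = rng.random((V.shape[0], m)) + 0.1
    W = rng.random((m, V.shape[1])) + 0.1
    for _ in range(iters):
        W *= (A.T @ V) / (A.T @ A @ W + 1e-9)
        A *= (V @ W.T) / (A @ W @ W.T + 1e-9)
    return A, W

rng = np.random.default_rng(3)
L, K, m, T, eps = 6, 8, 2, 1500, 0.2
A_true = rng.dirichlet(np.ones(m), size=L)
W_true = rng.random((m, K))
U = A_true @ W_true                               # ground-truth mean rewards

sums = np.zeros((L, K))
counts = np.zeros((L, K))
for t in range(T):
    x = rng.integers(L)                           # context arrives (not controlled)
    if rng.random() < eps:
        a = rng.integers(K)                       # explore (Steps 1-2, simplified)
    else:
        # Step 3: low-rank plug-in estimate from the empirical means so far.
        U_emp = np.where(counts > 0, sums / np.maximum(counts, 1), 0.5)
        Ah, Wh = nmf_mu(U_emp, m)
        a = int(np.argmax((Ah @ Wh)[x]))
    counts[x, a] += 1
    sums[x, a] += rng.random() < U[x, a]          # Bernoulli reward feedback
```

The low-rank plug-in `Ah @ Wh` pools information across contexts, which is what lets the estimate concentrate with far fewer samples than the LK independent empirical means would need.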


To solve the NMF and obtain Ŵ, we use a robust version of Hottopix [32, 18] as a sub-routine. We now briefly touch upon the construction of the context-specific sets of arms in Step 2 of the explore phase; these sets are defined in detail in Section A.1 in the appendix. A set of contexts is sampled at random at the onset of the algorithm, and a collection of arms is partitioned into contiguous subsets of equal size, one per sampled context. In Step 2 of explore, if the current context is one of the sampled contexts, the algorithm samples from the corresponding subset of arms; otherwise, the algorithm is allowed to pull any arm at random, and these samples are ignored.

A more detailed version of our main result (Theorem 1) is provided as Theorem 8 in the appendix, along with a detailed proof. Theorem 8 exactly specifies the algorithm parameters under which we obtain the regret guarantees. We provide some key theoretical insights and a brief proof sketch in Section A.2 in the appendix. In particular, we discuss in detail why the usual matrix completion techniques fail to provide regret guarantees of this order. We explain the challenges of dealing with sampling noise and how we overcome them through careful design of the arms to explore.

4 Simulations

We validate the performance of our algorithm against various benchmarks on real and synthetic datasets. We compare our algorithm against contextual versions of UCB-1 [9] and Thompson sampling [3]. To be more precise, these algorithms proceed by treating each context separately and applying the usual K-armed version of the algorithm to each context. We also compare the performance of our algorithm to the recent algorithm of [25] for stochastic rank-1 bandits. The problem setting in [25] is different; therefore, whenever we compare against this algorithm, the experiments are performed in the setting of [25], which we call S2. The more realistic setting of our paper is denoted by S1. The two settings are:

Figure 2: Comparison of contextual versions of UCB-1, Thompson sampling, and rank-1 stochastic bandits with Algorithm 1 (NMF-Bandit) in S1 and S2 on real and synthetic data-sets.

S1: The arrival of the contexts cannot be controlled by the algorithm, and the regret is w.r.t. the best arm, which is context dependent. This is strongly motivated by the causal setting discussed with real-world scenarios in Section 1.

S2: This is in accordance with the model in [25]. Both the contexts and the arms can be chosen by the algorithm, and the aim is to compare regret w.r.t. the best entry among all entries.

Synthetic Data-Sets: To generate the synthetic reward matrix U = AW, the dimensions L, K, and m are first chosen. The matrix A is then generated by picking each row uniformly at random from the m-dimensional simplex. The matrix W is generated with each entry drawn uniformly from the unit interval. We further corrupt a fraction of the entries in each row of W with completely arbitrary noise, while ensuring that they still lie in the valid range.

In Figures 2(a) and 2(b), we compare our algorithm to UCB-1 and Thompson sampling in S1 under different problem dimensions. In Figure 2(a), the rewards are uniformly distributed with means given by U, while they are Bernoulli in Figure 2(b). We observe that UCB-1 and Thompson sampling have linear regret, as they do not get sufficient concentration for the mean parameters, whereas our algorithm enters the sub-linear regime much faster. We mention the choice of the parameters below the corresponding figures. It should be noted that our algorithm performs well even for values of the explore parameter ϵ much lower than prescribed. In Figure 2(e), the experiments are performed under S2. We can see that our algorithm's regret is better than that of the others by a large margin, even though it has not been designed for this setting.

Real-World Data-Sets: We use the Movielens 1M [21] and the Book Crossing [40] data-sets for our real-world experiments. A subset of the Movielens 1M dataset is chosen such that we have at least 20 ratings in each row and each column. Similarly, a subset is chosen from the Book Crossing data-set with the same property. Both of these partially incomplete rating matrices are then completed using the Python package fancyimpute [22] with the default settings. These completed matrices are used in place of the reward matrix U without any further modification, and all the algorithms are completely agnostic to the process through which these matrices were completed. The experiments are performed in a setting where the observed rewards are uniformly distributed around the given means; the support of the uniform distributions is specified below each figure.

In Figures 2(c) and 2(d), we compare our algorithm to UCB-1 and Thompson sampling in S1 on the MovieLens and Book Crossing data-sets, respectively. As before, our algorithm has superior performance. In Figure 2(f), we compare the algorithms on the Book Crossing data-set under S2. NMF-Bandit outperforms all the other algorithms, even on the real datasets.

5 Conclusion

In this paper we investigate a causal model of contextual bandits (as shown in our causal-graph figure) with L observed contexts and K arms, where the observed context influences the reward through a latent confounder. The latent confounder is correlated with the observed context and lies in a lower-dimensional space with only m degrees of freedom (m ≪ L, K). We identify that under this causal model, the L × K reward matrix U naturally factorizes into non-negative factors A (L × m) and W (m × K).

We propose a novel ε-greedy algorithm (NMF-Bandit), which attains a regret guarantee of O(L poly(m, K)√T). Our guarantees hold under statistical-RIP-like conditions on the non-negative factors. We also establish a lower bound of Ω(Km√T) for our problem. To the best of our knowledge, this is the first achievable regret guarantee for online matrix completion with bandit feedback when the rank is greater than one.

We validate our algorithm on real and synthetic datasets and show superior performance with respect to the baselines considered. This work opens up the prospect of investigating general causal models from a bandit perspective, where the goal is to control the regret of a target variable, when the algorithm can intervene only on some of the variables (limited interventional capacity), while other variables (possibly latent) can causally influence the reward. This is a natural setting and we expect that it will lead to an interesting research direction.


Appendix A Appendix

A.1 Algorithmic Details

We present a precise version of the algorithm described in Section 3 as Algorithm 1. For ease of exposition, we introduce the concept of matrix sampling, which is a notational tool to represent the sampled entries from different subsets of arms in a structured manner.

A.1.1 Matrix Sampling

Consider the reward matrix U. Consider a 'sampling matrix' B with dimensions K × K. Let S ⊆ [K] be a subset of arms. In this work, we consider only B of the following form: B_{jj} = 1 if j ∈ S, and zero otherwise. Consider the product between a row of U and B, i.e. u_i B. This selects the coordinates corresponding to S in the vector u_i. Given a row (a context i) of U, i.e. u_i, we describe how to obtain a random vector estimate X such that E[X] = u_i B by sampling an arm as follows:

  • Given that the context is i, sample a uniform random variable J with support S; this represents the arm to be pulled after observing the context.

  • Conditioned on J = j, pull arm j and observe the reward r.

  • The random vector sample is then given by X = |S| · r · e_J, where e_J denotes the J-th standard basis vector.

Then we have E[X] = u_i B. In other words, whenever the context is i, we pull an arm uniformly at random from S, and the samples are collected in X.
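This estimator can be checked numerically. The sketch below (plain NumPy, with made-up dimensions, and assuming the standard importance-weighting scale factor |S|) pulls a uniform arm from a subset S, rescales the observed reward into a sparse vector, and verifies that the empirical mean approaches the row of mean rewards restricted to S.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 8
S = np.array([1, 4, 6])                    # subset of arms encoded by B
u_i = rng.uniform(0, 1, K)                 # mean rewards for one context i

def sample_vector_estimate():
    j = rng.choice(S)                      # arm pulled uniformly from S
    r = u_i[j] + rng.uniform(-0.1, 0.1)    # noisy reward with mean u_i[j]
    x = np.zeros(K)
    x[j] = len(S) * r                      # rescale so that E[x] = u_i B
    return x

est = np.mean([sample_vector_estimate() for _ in range(100_000)], axis=0)
target = np.zeros(K)
target[S] = u_i[S]                         # u_i B for the diagonal selector B
```

The factor |S| compensates for each coordinate in S being observed only a 1/|S| fraction of the time, which is exactly why the estimate is unbiased.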

A.1.2 Arms to be sampled during explore

Before we present the pseudocode, we define the sampling matrices. Recall that any subset of arms can be encoded as a sampling matrix. The first sampling matrix corresponds to the subset of arms used in Step 1 of explore, as stated in Section 3. For ease of reference, we restate the sets relevant to the context-specific sampling procedure in Step 2 of explore. A set of contexts is sampled at random at the onset of the algorithm and partitioned into contiguous subsets of equal size. In Step 2 of explore, the arm subset pulled depends on which of these subsets the observed context belongs to; if the observed context does not belong to any of these subsets, the algorithm is allowed to pull any arm at random, and those samples are ignored.

  1. A K × K random diagonal matrix formed as follows: a subset of arms is chosen uniformly at random among all subsets of the appropriate size; the corresponding diagonal entries are set to 1, and all other entries are 0.

  2. A collection of context-specific matrices, one for each subset of contexts, whose nonzero entries encode the arms to be pulled in Step 2 of explore when the observed context lies in the corresponding subset.

  3. A matrix defined as follows: it has an identity matrix embedded in a contiguous block of rows, and is zero everywhere else.
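As an illustration (with hypothetical dimensions, since the exact sizes are set by the algorithm's parameters), the diagonal-selector and embedded-identity sampling matrices above can be built as follows.

```python
import numpy as np

rng = np.random.default_rng(2)
K, k, m = 10, 4, 3   # arms, random-subset size, latent dimension (illustrative)

# 1) random diagonal selector: a uniformly random k-subset of the K arms
S = rng.choice(K, size=k, replace=False)
B1 = np.zeros((K, K))
B1[S, S] = 1.0                     # ones on the diagonal positions in S

# 3) identity block embedded in a contiguous band of rows, zero elsewhere
def block_identity(start, K, m):
    B = np.zeros((K, K))
    idx = np.arange(start, start + m)
    B[idx, idx] = 1.0
    return B

B3 = block_identity(2, K, m)
```

Multiplying a reward row on the right by either matrix zeroes out all coordinates except the selected ones, which is the "selection" role these matrices play in the matrix-sampling procedure.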

A.1.3 Representation of the Collected Samples

In what follows, let the means of the samples collected through Step 1 of explore up to time t be collected in a matrix, as detailed in Section A.1.1. Similarly, let the samples collected through the context-specific procedure of Step 2 be stored in a second matrix, together with its appropriately scaled version.

A.1.4 Pseudocode

1: At time t, observe context i.
2: if explore (which occurs with probability ε_t) then
3:     Explore:
4:     If the first explore sub-step is chosen, sample an arm according to the matrix sampling technique applied to the Step-1 sampling matrix, and update the corresponding sample means.
5:     If the second explore sub-step is chosen, sample an arm according to the matrix sampling technique applied to the context-specific sampling matrix for the subset of contexts containing i, and update the corresponding sample means. If i is not in any of these subsets, choose an arm at random.
6: else
7:     Exploit: compute the estimates of A and W from the collected samples, form the estimated reward matrix, and play the arm with the highest estimated reward for context i.
8: end if
Algorithm 1 NMF-Bandit - An ε-greedy algorithm for Latent Contextual Bandits

We present a detailed pseudo-code of our algorithm as Algorithm 1. For the sake of completeness, we include the robust version of the Hottopix algorithm [18], which is used as a sub-program in Algorithm 1. The following LP is fundamental to the algorithm:

min_{C ≥ 0} pᵀ diag(C)  subject to  ‖X − CX‖ ≤ 2ε,  tr(C) = m,  C_{ii} ≤ 1 and C_{ij} ≤ C_{ii} for all i, j,   (3)

where p is a vector with distinct positive values.

1: Input: a noisy matrix X = AW + N, where A is separable and the entry-wise noise is bounded by ε, together with the model order m.
2: Output: a set of m anchor (pure) row indices and the corresponding estimate of W.
3: Compute an optimal solution C* to (3).
4: Let I denote the set of indices corresponding to the m largest diagonal entries of C*.
5: Set the estimate of W to the rows of X indexed by I.
Algorithm 2
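Solving the LP requires an LP solver; as a lightweight stand-in for detecting the anchor (extreme) rows of separable data, the successive projection algorithm (SPA) is widely used in the separable-NMF literature. The sketch below is SPA, not the robust Hottopix routine itself, run on synthetic separable data with illustrative dimensions.

```python
import numpy as np

def spa_anchors(X, m):
    """Successive Projection Algorithm: greedily pick m rows of X that act
    as extreme points of the convex hull of the rows."""
    R = X.astype(float).copy()
    anchors = []
    for _ in range(m):
        i = int(np.argmax(np.linalg.norm(R, axis=1)))   # farthest residual row
        anchors.append(i)
        v = R[i] / np.linalg.norm(R[i])
        R = R - np.outer(R @ v, v)                      # project out direction v
    return anchors

# separable synthetic data: X = A W, with the pure rows of A forming an identity
rng = np.random.default_rng(3)
m, L, K = 3, 12, 8
W = rng.uniform(0, 1, (m, K))
mix = rng.dirichlet(np.ones(m), size=L - m)   # convex mixtures of the pure rows
A = np.vstack([np.eye(m), mix])               # rows 0..m-1 are the pure rows
X = A @ W
found = sorted(spa_anchors(X, m))
```

Because projection is linear, mixture rows remain convex combinations of the residual vertices at every step, so the greedy max-norm choice keeps landing on pure rows.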

A.2 Theoretical Insights

Below, we discuss some of the key challenges in the theoretical analysis.

Noise guarantees for samples used in NMF: Matrix completion algorithms that work under incoherence assumptions require the noise in each element of the matrix to be small in order to provide ℓ∞-norm guarantees on the recovered matrix [20]. Ensuring such noise guarantees would require a very large number of samples for the estimates to concentrate, which in turn increases bandit exploration and inflates the regret. To avoid this, we follow a different route. In Step 1 of the explore phase, the NMF-Bandit algorithm samples only from a small subset of arms. By leveraging the WStRIP property of W, we can ensure that NMF on these samples (which are a noisy version of a sub-matrix of U) gives us a good estimate of A at time t; we prove this statement formally in Lemma 6. Given that we sample only from a small subset of arms in the first step of explore, in Lemma 11 we show that the samples concentrate sharply enough.

Ensuring enough linear equations to recover W: Recall that the reward matrix has the structure U = AW. An initial approach would therefore be to use the current estimate of A along with samples of the rewards and directly recover W. This, however, does not work, owing to a lack of concentration. First, the estimate of A in the early stages is too noisy to provide sharp estimates of the locations of the extreme points, a.k.a. the latent contexts. Even if we knew the identities of the observed contexts that correspond to "pure" latent contexts (extreme points of the affine space corresponding to the observed contexts), most observed contexts would not correspond to these extreme points; thus, a large number of samples would be wasted, again leading to poor concentration. Second, if one decides to sample the entries of U at random, the concentration of the entries is too weak. As before, these weak concentrations imply poor regret.

Instead, we design the context-dependent sets of arms to pull in Step 2 of the explore phase so that we obtain enough independent linear equations to recover W. The key is to sample a small number of arms per observed context, while letting this small set of arms differ across observed contexts. In this case, we show that by leveraging the WStRIP property of A, we can get a good estimate of W even in the presence of sampling noise. Since we sample from a small subset of arms for each observed context, in Lemma 12 we can ensure that we have sharp concentrations.
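The "enough independent linear equations" idea can be illustrated concretely: each observed context i only reveals the rewards A_i · w_j for arms j in its small subset, but because the subsets differ across contexts, every column w_j of W is still overdetermined. The sketch below (hypothetical dimensions, noiseless for clarity, and assuming each arm is covered by at least m contexts) recovers W by per-column least squares given A.

```python
import numpy as np

rng = np.random.default_rng(4)
m, L, K = 3, 60, 6
A = rng.dirichlet(np.ones(m), size=L)          # row-stochastic mixing weights
W = rng.uniform(0, 1, (m, K))
U = A @ W                                      # full reward matrix (hidden)

# each context only ever observes a small, context-dependent subset of arms
subsets = [rng.choice(K, size=2, replace=False) for _ in range(L)]

W_hat = np.zeros_like(W)
for j in range(K):
    rows = [i for i in range(L) if j in subsets[i]]   # contexts that see arm j
    # stack the linear equations A[rows] @ w_j = U[rows, j] and solve
    W_hat[:, j], *_ = np.linalg.lstsq(A[rows], U[rows, j], rcond=None)
```

With noisy rewards the same least-squares step still works, but then the quality of the recovered column depends on how well the sampled equations concentrate, which is exactly what the arm-subset design controls.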

Scheduling the optimal arm during exploit: The ℓ∞-norm bounds on the errors in the estimates of A and W imply that, with high probability, every estimated mean reward is within half the minimum gap of its true value, provided t is sufficiently large (see the proof of Theorem 8). This essentially implies that the correct arm is pulled at time t w.h.p. if the algorithm decides to exploit.
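The exploit step's correctness rests on a standard fact: if every estimated mean is within half the gap of the truth, the empirical argmax equals the true best arm. A tiny numeric check, with an illustrative reward vector and gap:

```python
import numpy as np

u = np.array([0.9, 0.5, 0.3, 0.2])        # true means for one context
gap = u.max() - np.sort(u)[-2]             # gap between best and second best
rng = np.random.default_rng(5)

for _ in range(1000):
    # any estimate with ||u_hat - u||_inf strictly below gap/2 ...
    err = rng.uniform(-gap / 2 + 1e-9, gap / 2 - 1e-9, size=u.shape)
    u_hat = u + err
    # ... must rank the true best arm first
    assert np.argmax(u_hat) == np.argmax(u)
```

The reason is immediate: the best arm's estimate stays above u_max − gap/2, while every other estimate stays below u_second + gap/2 ≤ u_max − gap/2.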

A.3 Description of Generative Models for the Matrices A and W

The models for A and W are very similar, each with a deterministic part and a random part. The technical description of the model given below is complex for the following two reasons:

  1. Fact 1: The rows of A must sum to 1.

  2. Fact 2: Shifting every row of W by a common arbitrary vector does not affect the NMF algorithms employed; the setting is invariant to such a shift.

  1. Random+Deterministic Composition:

    1. We assume that the columns of W indexed by a fixed column set are arbitrary and deterministic, while the remaining columns are random. The maximum entry in every row of W is assumed to lie in the deterministic part.

    2. Similarly, A decomposes into an arbitrary deterministic part and a random part. The sum of every row of A is 1. In order to ensure separability [32], we assume that there is a subset of rows of A which, up to the perturbations below, contains the m × m identity; that is, every latent context has at least one observed context that is (nearly) pure.

  2. Bounded randomness in the random part:

    1. Each entry of the random part of W is an independent, mean-zero sub-gaussian random variable with bounded support; an arbitrary deterministic vector is added to every row (this shift is introduced to respect Fact 2 in Section A.3).

    2. A deterministic perturbation matrix with bounded entries is also added; the support parameters are chosen so that all entries of W remain non-negative.

    A is a row-normalized version of another random matrix, whose random model we describe first. As in the model for W:

    1. It is a matrix with independent, mean-zero sub-gaussian entries, each with bounded support and a common sub-gaussian parameter.

    2. We denote by the matrix of means the matrix consisting of the corresponding mean parameters; the norm of every row of this matrix is bounded. The support, the sub-gaussian parameter, and the matrix of means are chosen such that all entries are non-negative. The stricter condition (used in the lower bound) ensures that the entries remain well behaved after normalization by the row sums.
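A concrete instance consistent with this generative model (with illustrative sizes and parameters, and with the perturbations omitted for clarity) can be drawn as follows; the identity block enforces separability and row-normalization enforces Fact 1.

```python
import numpy as np

rng = np.random.default_rng(6)
m, L, K = 3, 15, 10

# A: identity block (pure latent contexts) stacked on row-normalized
# bounded positive entries, so every row sums to 1 (Fact 1)
raw = rng.uniform(0.1, 1.0, (L - m, m))
mixed = raw / raw.sum(axis=1, keepdims=True)
A = np.vstack([np.eye(m), mixed])

# W: deterministic base plus a bounded mean-zero perturbation,
# clipped so all entries stay non-negative
W_det = rng.uniform(0.3, 0.7, (m, K))
W = np.clip(W_det + rng.uniform(-0.2, 0.2, (m, K)), 0.0, 1.0)

U = A @ W        # the reward matrix factorizes as in the model
```

Note that the pure rows of A make the first m rows of U coincide with W, which is exactly what anchor-based NMF recovery exploits.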

A.4 Projection onto a Low-Dimensional Space

In this section, we will prove some properties of the matrix WB, where B is a sampling matrix as defined in Section A.1.1. From the definition in Section 2.1, A contains a sub-matrix equal to the identity, corresponding to the rows of the pure contexts; further, the sum of every row of A is 1. This means that the rows of U = AW consist of points in the convex hull of m extreme points, namely the rows of W, together with the extreme points themselves.

The extreme points among the rows of W are mapped to extreme points among the rows of WB. We also show that the new set of extreme points satisfies what is called the simplicial property when the assumptions in Section A.3 hold.

When the entries of W are random, independent, and bounded as in Section 2.4, we show that the norm of any fixed non-zero vector is preserved under the map defined by W with high probability over the randomness in W. We need some results on the sub-gaussianity of matrices with bounded i.i.d. entries, which we develop in the next subsection.

A.5 Sub-gaussianity of a Matrix with Bounded i.i.d. Random Entries

Definition 7.

[16] A random variable X is sub-gaussian with parameter σ if E[exp(λX)] ≤ exp(σ²λ²/2) for all λ ∈ ℝ.

Definition 8.

[16] A random vector x ∈ ℝⁿ is isotropic if E[xxᵀ] = I. It is sub-gaussian with parameter σ if the scalar random variable ⟨x, u⟩ is sub-gaussian with parameter σ for every unit vector u, i.e. E[exp(λ⟨x, u⟩)] ≤ exp(σ²λ²/2) for all λ ∈ ℝ.

Lemma 1.

[16],[34] Consider a random variable X such that E[X] = 0 and |X| ≤ C almost surely for some constant C. Then X is sub-gaussian with parameter C. Consider a random vector whose entries are drawn i.i.d. from a mean-zero, unit-variance sub-gaussian distribution with parameter σ. Then the vector is a sub-gaussian isotropic vector with the same sub-gaussian parameter.

Remark: The first part follows from [34], while the second part follows from [16].

Lemma 2.

[16] Let A and E be two matrices of the same dimensions, and let s_max(·) and s_min(·) denote the largest and smallest singular values of a matrix, respectively. Then

s_max(A + E) ≤ s_max(A) + s_max(E)  and  s_min(A + E) ≥ s_min(A) − s_max(E).

In particular, if s_max(E) is small, the extreme singular values of A + E are close to those of A.

Lemma 3.

[16] Consider an N × n matrix whose rows are independent sub-gaussian isotropic random vectors with sub-gaussian parameter σ. Then, for every t ≥ 0, with probability at least 1 − 2 exp(−c t²),

√N − C√n − t ≤ s_min ≤ s_max ≤ √N + C√n + t.

Here, C and c are constants that depend only on the sub-gaussian parameter σ.

Remark: The first result follows from an equation in [16], and also from combining two lemmas in [16]. The second follows from applying Lemma 2.
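The √N ± C√n behavior in Lemma 3 is easy to observe numerically. The sketch below uses Rademacher (±1) entries, which are bounded, mean-zero, and unit-variance, so the rows are sub-gaussian isotropic vectors; the dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
N, n = 4000, 20                              # tall matrix: N rows, n columns
X = rng.choice([-1.0, 1.0], size=(N, n))     # i.i.d. Rademacher entries
s = np.linalg.svd(X, compute_uv=False)       # singular values, descending
s_max, s_min = s[0], s[-1]
# both extreme singular values sit in a sqrt(N) +/- O(sqrt(n)) window
```

At this aspect ratio the window √N ± O(√n) is narrow relative to √N, which is the "near-isometry" property the projection arguments rely on.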

Definition 9 ([32]).

Let us consider an m × n matrix W with m ≤ n, and let w_i denote the i-th row of W. The matrix W is γ-simplicial (for a parameter γ > 0) if min_i dist(w_i, conv({w_j : j ≠ i})) ≥ γ. In other words, every row is at least γ far away in ℓ2 distance from the convex hull of the other rows.

A.6 Results regarding sub-matrices of W

The following results hold for the sampled sub-matrix WB, since WB simply restricts W to the set of column indices associated with the sampling matrix B, as in Section A.1.

Theorem 5.

Let W follow the random generative model in Section 2.4. Then the stated bound holds with high probability over the randomness in W. Here, the constant depends on the sub-gaussian parameter of the distributions in the generative model in Section