
Stochastic Linear Bandits with Hidden Low Rank Structure

by Sahin Lale, et al.

High-dimensional representations often have a lower-dimensional underlying structure. This is particularly the case in many decision-making settings. For example, when the representation of actions is generated by a deep neural network, it is reasonable to expect a low-rank structure, whereas conventional structures like sparsity are no longer valid. Subspace recovery methods such as Principal Component Analysis (PCA) can find the underlying low-rank structure in the feature space and reduce the complexity of the learning task. In this work, we propose Projected Stochastic Linear Bandit (PSLB), an algorithm for high-dimensional stochastic linear bandits (SLB) when the representation of actions has an underlying low-dimensional subspace structure. PSLB deploys PCA-based projection to iteratively find the low-rank structure in SLBs. We show that deploying projection methods assures dimensionality reduction and results in a tighter regret upper bound stated in terms of the dimensionality of the subspace and its properties, rather than the dimensionality of the ambient space. We cast the image classification task as an SLB and empirically show that, when a pre-trained DNN provides the high-dimensional feature representations, deploying PSLB results in a significant reduction of regret and faster convergence to an accurate model compared to the state-of-the-art algorithm.




1 Introduction

Stochastic linear bandit (SLB) is a class of sequential decision-making problems under uncertainty in which an agent sequentially chooses actions from very large action sets. At each round, the agent applies its action and, in response, the environment emits a stochastic reward whose expected value is an unknown linear function of the action. The agent's goal is to collect as much reward as possible over the course of the interactions.

In SLB, the actions are represented as d-dimensional vectors, and the agent maintains limited information about the unknown linear reward function. Over the course of the interaction, the agent implicitly or explicitly constructs a model of the environment. Its decisions must not only maximize the current reward but also explore other actions to build a better estimate of the unknown linear function and secure higher future rewards. This is known as the exploration vs. exploitation trade-off.

The lack of oracle knowledge of the true environment model causes the agent to make mistakes by picking sub-optimal actions during exploration. While the agent examines actions in the decision set, its mistakes accumulate. The aim of the agent is to design a strategy that minimizes the cumulative cost of these mistakes, known as regret. One promising approach to minimizing regret is the optimism in the face of uncertainty (OFU) principle, first proposed by Lai & Robbins (1985). OFU-based algorithms estimate the environment model up to its confidence interval and construct a plausible set of models within that interval. Among the models in the plausible set, they choose the most optimistic one and follow the optimal behavior suggested by the selected model for the next round of decision making.

For general SLB problems, Abbasi-Yadkori et al. (2011) deploy the OFU principle, propose the OFUL algorithm, and, for d-dimensional SLB, derive a regret upper bound of Õ(d√T), which matches the lower bound up to a log factor. These regret bounds are not practically tolerable in high-dimensional problems, especially when d is of the same order as √T. Fortunately, real-world problems are usually not arbitrary and may contain hidden low-dimensional structures. For example, in classical recommendation systems, each item is represented by a large and highly detailed hand-engineered feature vector, so d is intractably large. In these problems, not all features are helpful for the recommendation task. For instance, the height of goods such as a pen is not a relevant feature for its recommendation, while this feature is valuable for furniture. Therefore the true underlying linear function in such SLBs is highly sparse. Abbasi-Yadkori et al. (2012) show how to exploit this additional structure and design a practical algorithm with regret of Õ(√(pdT)), where p is the sparsity level of the true underlying linear function. Under slightly stronger assumptions, Carpentier & Munos (2012) show that the theory of compressed sensing can provide a tighter bound of Õ(p√T).

The contemporary success of Deep Neural Networks (DNN) in representation learning enables classical machine learning methods to make significant advances in many problems, e.g., classification and regression tasks (LeCun et al., 1998). DNNs convolve the raw features of the input and construct new feature representations which replace hand-engineered feature vectors in many real-world sequential decision-making applications, e.g., recommendation systems. However, when a DNN provides the feature representations, a sparse structure can no longer be expected.

Dimension reduction and subspace recovery form the core of unsupervised learning methods, and principal component analysis (PCA) is the main technique for linear dimension reduction (Pearson, 1901; Eckart & Young, 1936). At each round of SLB, the agent chooses an action and receives the reward corresponding to that action. Therefore, the chosen action is assigned a supervised reward signal while the other actions in the decision set remain unsupervised. Even though the primary motivation in the SLB framework is decision-making within a large and stochastic decision set, the majority of prior works do not exploit possible hidden structures in these sets. For example, Abbasi-Yadkori et al. (2011) only utilize supervised actions, the actions selected by the algorithm, to construct the environment model, and ignore all other unsupervised actions in the decision set. On the contrary, the large number of actions in the decision sets can be useful in reducing the dimension of the problem and simplifying the learning task.

Contributions:  In this paper, we deploy unsupervised subspace recovery using PCA to exploit the massive number of unsupervised actions observed in the decision sets of SLB and reduce the dimensionality and the complexity of SLBs. We propose PSLB for SLBs and show that if there exists an m-dimensional subspace structure such that the actions live in a perturbed region around this subspace, deploying PSLB improves the regret upper bound to one that scales with the subspace dimension m rather than the ambient dimension d, together with a problem-dependent quantity that represents the difficulty of subspace recovery as a function of the structure of the problem. If learning the subspace is hard, e.g., the eigengap is too small to resolve within a reasonable number of samples, or the actions are widely distributed in the dimensions orthogonal to the subspace due to perturbation, then using projection approaches is not remedial. On the other hand, if the underlying subspace is identifiable, i.e., a large number of actions is available from the decision sets in each round and the eigengap is significant, then using subspace recovery provides faster learning of the underlying linear function and thus smaller regret.

We adapt the image classification tasks on the MNIST (LeCun et al., 1998), CIFAR-10 (Krizhevsky & Hinton, 2009) and ImageNet (Krizhevsky et al., 2012) datasets to the SLB framework and apply both PSLB and OFUL to these datasets. We observe that there exists a low-dimensional subspace in the feature space when a pre-trained DNN produces the d-dimensional feature vectors. We empirically show that, using subspace recovery, PSLB learns the underlying model significantly faster than OFUL and provides orders of magnitude smaller regret in SLBs obtained from the MNIST, CIFAR-10, and ImageNet datasets.

2 Preliminaries

For any positive integer n, [n] denotes the set {1, ..., n}. The Euclidean norm of a vector x is denoted by ||x||. The spectral norm of a matrix A is denoted by ||A||, i.e., ||A|| = max_{||x||=1} ||Ax||. A† denotes the Moore-Penrose inverse of a matrix A. For any symmetric positive semi-definite matrix V, ||x||_V denotes the V-norm of a vector x, defined as ||x||_V = √(x⊤Vx). The i-th eigenvalue of a symmetric matrix A is denoted by λ_i(A), where λ_1(A) ≥ λ_2(A) ≥ ... . The largest and smallest eigenvalues of A are denoted as λ_max(A) and λ_min(A), respectively. I denotes the identity matrix. We use the convention that stacking copies of a column vector yields a matrix whose columns are that vector, whereas stacking copies of a scalar yields a column vector whose elements are that scalar. Finally, ⊎ denotes the multiset summation operation over sets.


Let T be the total number of rounds. At each round t, the agent is given a decision set D_t with K actions, D_t = {a_{t,1}, ..., a_{t,K}}. Let U be a d × m orthonormal matrix with U⊤U = I, whose column span defines an m-dimensional subspace of R^d. Consider zero-mean true action vectors b_{t,i} that lie in this subspace for all t and i. Let c_{t,i} be zero-mean random perturbation vectors which are uncorrelated with the true action vectors, i.e., E[b_{t,i} c_{t,i}⊤] = 0 for all t and i. Each action is generated as follows:

a_{t,i} = b_{t,i} + c_{t,i}.

This model states that each a_{t,i} in D_t is a perturbed version of the true underlying b_{t,i}. Denote the covariance matrix of the true action vectors by Σ_b. Notice that Σ_b is rank-m. The perturbation vectors c_{t,i} are assumed to be isotropic, so their covariance matrix is a scaled identity. We will make a boundedness assumption on the true action vectors and the perturbations.
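The generative model above is easy to simulate. The following sketch, with illustrative dimensions and an assumed perturbation scale, draws one round's decision set as true actions in an m-dimensional subspace plus isotropic noise:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, K = 50, 3, 10          # ambient dim, subspace dim, actions per round

# Orthonormal basis U (d x m) of the hidden subspace (illustrative choice).
U, _ = np.linalg.qr(rng.standard_normal((d, m)))

def decision_set(sigma_perturb=0.05):
    """One round's decision set: K perturbed points around the subspace."""
    Z = rng.standard_normal((K, m))                  # zero-mean latent coordinates
    B = Z @ U.T                                      # true actions, live in span(U)
    C = sigma_perturb * rng.standard_normal((K, d))  # isotropic perturbation
    return B + C

A = decision_set()
P = U @ U.T                                          # projection onto the subspace
# Each action is close to its projection; the residual is just the
# off-subspace component of the perturbation.
residual = np.linalg.norm(A - A @ P, axis=1).max()
```

With a small perturbation scale the actions concentrate near the subspace, which is exactly the regime in which subspace recovery pays off.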

Assumption 1 (Bounded Action and Perturbation Vectors).

There exist finite constants bounding, for all t and i, the norms of the true action vectors and of the perturbation vectors.

Both constants can depend on d or m, and they can be interpreted as the effective dimensions of the corresponding vectors.

At each round t, the agent chooses an action a_t ∈ D_t and observes a reward r_t such that

r_t = a_t⊤ P θ* + η_t,

where P = UU⊤ is the projection matrix onto the m-dimensional subspace, θ* is the unknown parameter vector and η_t is the random noise at round t. Notice that since θ* lies in the subspace, Pθ* = θ*; therefore r_t = a_t⊤θ* + η_t. (This reward generative model is equivalent to r_t = b_t⊤θ* + η̃_t, where η̃_t contains the randomness in η_t as well as the perturbation due to c_t.) Consider a filtration of σ-algebras {F_t} such that, for any t, a_t is F_{t-1}-measurable and η_t is F_t-measurable.

Assumption 2 (Subgaussian Noise).

For all t, η_t is conditionally R-sub-Gaussian, where R ≥ 0 is a fixed constant, i.e., E[exp(λη_t) | F_{t-1}] ≤ exp(λ²R²/2) for all λ ∈ R.

This implies that E[η_t | F_{t-1}] = 0 and Var[η_t | F_{t-1}] ≤ R². The goal of the agent is to maximize the total expected reward accumulated in T rounds, Σ_{t=1}^T E[a_t⊤θ*]. The oracle's strategy, with the knowledge of θ* at each round, is a_t* = argmax_{a ∈ D_t} a⊤θ*. We evaluate the agent's performance against the oracle. Define regret as the difference between the expected rewards of the oracle and the agent, R_T = Σ_{t=1}^T (a_t* − a_t)⊤θ*.


The agent aims to minimize this quantity over time. In the setting described above, the agent is assumed to know that there exists an m-dimensional subspace of R^d in which the true action vectors and the unknown parameter vector lie. Finally, we define some quantities about the structure of the problem for all t:


3 Overview of PSLB

We propose PSLB, an SLB algorithm which employs subspace recovery to extract information from the unsupervised data accumulated in the SLB. During the course of interaction, the agent constructs confidence sets for the underlying model with and without subspace recovery, then takes the intersection of these two sets. Among the plausible models in this set, the agent deploys the OFU principle and follows the optimal action of the most optimistic model. The pseudocode of PSLB is given in Algorithm 1. PSLB consists of four key elements: warm-up, subspace estimation, confidence set construction and optimistic action. In the following, we discuss each of them briefly.

1:  Input: m, warm-up duration, regularization parameters, confidence level
2:  for t = 1 to T do
3:     Compute PCA over all action vectors observed so far
4:     Create the estimated basis with the first m eigenvectors
5:     Construct a high-probability confidence set on the projection matrix P
6:     Construct a high-probability confidence set for θ* using subspace recovery
7:     Construct a high-probability confidence set for θ* without using subspace recovery
8:     Construct the intersection of the two confidence sets on θ*
9:     Choose the optimistic triplet of action, parameter vector and projection matrix
10:    Play the chosen action and observe the reward
11:  end for
Algorithm 1 PSLB

3.1 Warm-Up

The decision set at each round t, D_t, has a finite number of actions. The algorithm needs to acquire enough samples of action vectors to reliably estimate the hidden subspace. The process of acquiring sufficient samples constitutes the warm-up period. The duration of the warm-up period can be chosen in many ways; we set it based on the theoretical analysis outlined in Section 4.1. The crux of this choice is to provide a theoretical guarantee of convergence to the underlying subspace. In other words, PSLB collects samples until it has some confidence in the recovered subspace. This idea is considered in more detail in Section 3.2. Note that warm-up periods are implicitly assumed in most SLB algorithms, since the given bounds are not meaningful for short horizons.

3.2 Subspace Estimation

At each round, the algorithm predicts the m-dimensional subspace that the true action vectors belong to, using the action vectors collected up to that round. In particular, at round t, the algorithm runs PCA over all action vectors observed so far. It calculates the matrix of the top m eigenvectors of the sample covariance matrix, whose span is the predicted m-dimensional subspace. This matrix is then used to create the estimated projection matrix associated with the predicted subspace.

As the agent observes more action vectors, the estimated projection matrix becomes more accurate. The accuracy of the estimate is measured by the projection error, the spectral norm of the difference between the true and the estimated projection matrices. As more action vectors are collected, this error shrinks. Since the true projection matrix is not known, the projection error cannot be calculated directly. Thus, PSLB calculates a high-probability upper bound on it. Using the derived bound, PSLB quantifies its confidence in the subspace estimate and constructs a set of plausible projection matrices in which both the true and the estimated projection matrices lie with high probability. The construction of this set relies on the structural properties of the problem and the number of samples observed. We analyze these properties in Section 4.1.
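As a concrete illustration of this estimation step, the sketch below (synthetic data, illustrative dimensions and noise scale) recovers the projection matrix from the top-m eigenvectors of the sample covariance and evaluates the projection error against the true subspace:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, n = 40, 2, 5000

# Hidden subspace and samples: a true component in span(U) plus small noise.
U, _ = np.linalg.qr(rng.standard_normal((d, m)))
X = rng.standard_normal((n, m)) @ U.T + 0.01 * rng.standard_normal((n, d))

# PCA: the top-m eigenvectors of the sample covariance estimate the subspace.
cov = X.T @ X / n
eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
U_hat = eigvecs[:, -m:]                      # top-m eigenvectors
P, P_hat = U @ U.T, U_hat @ U_hat.T

# Projection error ||P - P_hat||; it shrinks as more actions are observed.
err = np.linalg.norm(P - P_hat, 2)
```

With a large eigengap between the signal and the noise directions, as here, the estimated projection matrix is already close to the truth after a modest number of samples.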

3.3 Confidence Set Construction

At each round, PSLB creates two confidence sets for the model parameter θ*. First, it tries to exploit a possible m-dimensional hidden subspace structure. Thus, it searches for a high-probability confidence set for θ* that lies around the subspace estimated at round t. Using the history of action-reward pairs, the algorithm solves a regularized least squares problem in the estimated subspace and obtains the estimated parameter vector in that subspace. It then creates a confidence set around this estimate that contains θ* with high probability.

Second, PSLB searches for a high-probability confidence set in the ambient space, without subspace recovery. It deploys the confidence set generation subroutine of OFUL by Abbasi-Yadkori et al. (2011). Using the history of action-reward pairs, the algorithm solves another regularized least squares problem, this time in the ambient space. PSLB then creates a confidence set centered around this ambient-space estimate that contains θ* with high probability. Finally, PSLB takes the intersection of the constructed confidence sets to create the main confidence set, which still contains θ* with high probability. With this operation, PSLB provides a new perspective: if there exists an easily recoverable m-dimensional subspace, it exploits that structure to get lower regret than OFUL can achieve on its own. If it fails to detect such structure, or the projected confidence set is looser than what OFUL provides, then it still attains the same regret as OFUL.

3.4 Optimistic Action

For the final step in round t, the algorithm chooses the optimistic triplet of action, parameter vector and projection matrix from the constructed confidence sets and the current decision set that jointly maximizes the predicted reward.
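A minimal sketch of this optimistic selection, using the model-sampling shortcut described in the experiments (Section 5) rather than exact inner maximization; the ellipsoid parameterization and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

def optimistic_action(actions, theta_hat, V, beta, n_samples=200):
    """Approximate the OFU inner maximization by sampling parameter vectors
    from the boundary of the ellipsoid {theta : ||theta - theta_hat||_V <= beta}
    and returning the (action, parameter) pair with the largest predicted reward."""
    d = theta_hat.shape[0]
    # Boundary samples: theta_hat + beta * L u with L L^T = V^{-1} and ||u|| = 1,
    # so every sample has V-norm distance exactly beta from theta_hat.
    L = np.linalg.cholesky(np.linalg.inv(V))
    u = rng.standard_normal((n_samples, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    thetas = theta_hat + beta * u @ L.T
    rewards = actions @ thetas.T                 # (K, n_samples) predicted rewards
    k, j = np.unravel_index(rewards.argmax(), rewards.shape)
    return int(k), thetas[j]

actions = np.eye(3)                              # toy decision set
theta_hat = np.array([0.1, 0.5, 0.2])
k, theta_opt = optimistic_action(actions, theta_hat, np.eye(3), beta=0.05)
```

With a small confidence radius, the optimistic choice coincides with the greedy one; as the radius grows, under-explored actions can become optimistic winners.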


4 Theoretical Analysis of PSLB

In this section we first state the regret upper bound of PSLB, which is the main result of the paper. Then we analyze the components that build up to this result. In order to obtain a meaningful bound, we assume that the expected rewards are bounded. Recalling the quantities defined in (4), define the following quantity:


It represents the difficulty of subspace recovery in terms of the structural properties of the SLB setting, and it is analyzed in Section 4.3. Using this quantity, the theorem below states the regret upper bound of PSLB.

Theorem 1 (Regret Upper Bound of PSLB).

Fix any δ ∈ (0, 1). Assume that Assumptions 1 and 2 hold, and that the expected rewards are bounded for all t. Then, with probability at least 1 − δ, the regret of PSLB satisfies


The proof of the theorem involves two main pieces: the projection error analysis and the construction of projected confidence sets. They are analyzed in Sections 4.1 and 4.2 respectively. Finally, in Section 4.3 their role in the proof of Theorem 1 is explained and the meaning of the result is discussed.

4.1 Projection Error Analysis

Consider the matrix formed by the inner products of the true and the estimated bases, whose i-th singular value is denoted by σ_i, with σ_1 ≥ σ_2 ≥ ... ≥ σ_m. Extending the definition of inner products of two vectors to subspaces and using the Courant-Fischer-Weyl minimax principle, one can define the i-th principal angle θ_i between the two subspaces via cos(θ_i) = σ_i. Using the analysis in Akhiezer & Glazman (2013) it can be seen that

||P − P̂|| = sin(θ_max),

where θ_max is the largest principal angle between the column spans of the true and the estimated bases. Thus, bounding the projection error between two projection matrices is equivalent to bounding the sine of the largest principal angle between the subspaces onto which they project. In light of this relation, one can use the Davis-Kahan theorem (Davis & Kahan, 1970) to bound the projection error. The exact theorem statement can be found in Section A of the Supplementary Material. Informally, the theorem considers a symmetric matrix and its perturbed version, and bounds the sine of the largest principal angle caused by this perturbation. Using the Davis-Kahan theorem, the following lemma bounds the finite sample projection error.

Lemma 2 (Finite Sample Projection Error).

Fix any δ ∈ (0, 1). Suppose Assumption 1 holds. Then, with probability at least 1 − δ, for all t after the warm-up period,


The lemma and its proof are along the same lines as Corollary 2.9 of Vaswani & Narayanamurthy (2017). However, we improve the bound on the projection error by using the Matrix Chernoff Inequality (Tropp, 2015) and provide the precise problem-dependent quantities in the bound, which are required for defining the minimum number of samples for the warm-up period and for constructing the confidence sets on the projection matrix. Note that, as discussed in Section 3.2, (9) defines the confidence set for all t. The general version of the lemma and the details of the proof are given in Section A of the Supplementary Material; here we provide a proof sketch.

Up to round t, the agent observes all the action vectors within the decision sets so far. Using PCA, PSLB estimates an m-dimensional subspace spanned by the top m eigenvectors of the sample covariance matrix of these action vectors and obtains the projection matrix for that subspace. In order to derive Lemma 2, we first carefully pick two symmetric matrices such that the spans of their first m eigenvectors are exactly the subspaces onto which the true and the estimated projection matrices project. Using the Davis-Kahan theorem with the matrix concentration inequalities provided by Tropp (2015), we derive the finite sample projection error bound.

Lemma 2 is key to defining the warm-up period duration. Due to the equivalence in (8), the projection error is at most 1 for all t; therefore, any projection error bound greater than 1 is vacuous. We pick the warm-up duration such that, with high probability, we obtain a theoretically non-trivial bound on the projection error. With this choice, the bound on the projection error in (9) becomes less than 1 after the warm-up period, and PSLB starts to produce non-trivial confidence sets around the estimated projection matrix. However, note that the warm-up period can be significantly long for problems whose structure is hard to recover, e.g., when the subspace recovery difficulty grows linearly in d.
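The equivalence between the projection error and the sine of the largest principal angle, used above, is easy to check numerically. A small sketch with two random subspaces (all names and dimensions illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
d, m = 6, 2

# Two random m-dimensional subspaces of R^d, given by orthonormal bases.
U1, _ = np.linalg.qr(rng.standard_normal((d, m)))
U2, _ = np.linalg.qr(rng.standard_normal((d, m)))

# Principal angles: cos(theta_i) are the singular values of U1^T U2.
cos_thetas = np.linalg.svd(U1.T @ U2, compute_uv=False)
sin_theta_max = np.sqrt(max(0.0, 1.0 - cos_thetas.min() ** 2))

# The spectral norm of the projection difference equals sin of the
# largest principal angle (for subspaces of equal dimension).
P1, P2 = U1 @ U1.T, U2 @ U2.T
gap = np.linalg.norm(P1 - P2, 2)
```

This is also why any projection error bound above 1 is vacuous: the sine of an angle never exceeds 1.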

Lemma 2 also brings several important intuitions about the subspace estimation problem in terms of the problem structure. Recalling the quantities defined in (4), as the eigenvalue ratio there decreases, the projection error shrinks, since the underlying subspace becomes more distinguishable; conversely, as it diverges from 1, it becomes harder to recover the underlying m-dimensional subspace. Additionally, since the bound depends on the maximum of the effective dimensions of the true action vector and the perturbation vector, a large effective dimension makes the subspace recovery harder and the projection error bound looser, whereas observing more action vectors in each round produces a tighter bound. The effects of these structural properties on the subspace estimation translate to the confidence set construction and ultimately to the regret upper bound.

4.2 Projected Confidence Sets

In this section, we analyze the construction of the projected and the ambient confidence sets. For any round t, collect the chosen actions and the rewards obtained up to round t. At round t, after estimating the projection matrix associated with the underlying subspace, PSLB tries to find an estimate of θ* while believing that θ* lives within the estimated subspace. Therefore, the estimate is the solution to the following Tikhonov-regularized least squares problem:

Notice that the regularization is applied along the estimated subspace. Solving this problem gives the projected least squares estimate in closed form. The following theorem gives the construction of the projected confidence set, an ellipsoid centered around this estimate which contains θ* with high probability.
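A minimal numerical sketch of this projected estimate, using a known subspace in place of the PCA estimate for simplicity (all names, dimensions and the regularization value are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
d, m, n = 30, 2, 500

# Ground truth: theta_star lies in the hidden subspace span(U).
U, _ = np.linalg.qr(rng.standard_normal((d, m)))
theta_star = U @ rng.standard_normal(m)

X = rng.standard_normal((n, m)) @ U.T        # past chosen actions (in the subspace)
y = X @ theta_star + 0.1 * rng.standard_normal(n)

# Projected Tikhonov-regularized least squares: project the past actions,
# regularize along the (here: known) subspace, and invert on that subspace.
lam = 1.0
P_hat = U @ U.T
Xp = X @ P_hat
V = Xp.T @ Xp + lam * P_hat
theta_hat = np.linalg.pinv(V) @ Xp.T @ y

err = np.linalg.norm(theta_hat - theta_star)
```

Because the pseudo-inverse acts only on the m-dimensional subspace, the estimate stays inside it, which is what lets the effective dimension in the confidence bound drop from d to m.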

Theorem 3 (Projected Confidence Set Construction).

Fix any δ ∈ (0, 1). Suppose Assumptions 1 & 2 hold. Then, with probability at least 1 − δ, for all t after the warm-up period, θ* lies in the set


The detailed proof and a general version of the theorem are given in Section B of the Supplementary Material. We highlight the key aspects here. The overall proof follows machinery similar to that of Abbasi-Yadkori et al. (2011). Specifically, the first term in (10) is derived similarly, using the self-normalized tail inequality. However, since at each round PSLB projects the past actions onto an estimated m-dimensional subspace to estimate θ*, the ambient dimension is replaced by the subspace dimension in the bound. While enjoying the benefit of projection, this construction of the confidence set suffers from the finite sample projection error, i.e., the uncertainty in the subspace estimation. This effect is observed via the second term in (10), which involves the confidence bound for the estimated projection matrix. This term is critical in determining the tightness of the confidence set on θ*. As discussed in Section 4.1, it reflects the difficulty of subspace recovery of the given problem and depends on the underlying structure of the problem and the SLB. This shows that, as estimating the underlying subspace gets more difficult, a projection-based construction of the confidence sets on θ* provides looser bounds.

In order to tolerate the possible difficulty of subspace recovery, PSLB also constructs a confidence set for θ* without subspace recovery. Its construction follows OFUL by Abbasi-Yadkori et al. (2011). The algorithm finds the regularized least squares estimate of θ* in the ambient space. The construction is done under the same assumptions as Theorem 3, such that with probability at least 1 − δ, θ* lies in the set

The search for an optimistic parameter vector happens in the intersection of the projected and the ambient confidence sets. Notice that θ* lies in this intersection with high probability. Optimistically choosing the triplet of action, parameter vector and projection matrix within the described confidence sets gives PSLB a way to tolerate the possibility of failure in recovering an underlying structure. If the projected confidence set is loose, or PSLB is not able to recover an underlying structure, then the ambient confidence set provides the useful confidence set and the desirable learning behavior.
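The ambient-space estimate used for this fallback confidence set is ordinary ridge regression, as in OFUL. A minimal sketch (dimensions, noise level and regularization value illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
d, n, lam = 20, 300, 1.0
theta_star = rng.standard_normal(d)

X = rng.standard_normal((n, d))              # past chosen actions
y = X @ theta_star + 0.1 * rng.standard_normal(n)

# OFUL-style ridge estimate in the full ambient space:
# theta_hat = (X^T X + lam * I)^{-1} X^T y
V = X.T @ X + lam * np.eye(d)
theta_hat = np.linalg.solve(V, X.T @ y)
err = np.linalg.norm(theta_hat - theta_star)
```

Unlike the projected estimate, this one must resolve all d dimensions from the supervised action-reward pairs alone, which is why its confidence set scales with the ambient dimension.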

4.3 Regret Analysis

Now that the confidence set constructions and the decision-making procedure of PSLB have been explained, it only remains to analyze the regret of PSLB. Using the intersection of the projected and the ambient confidence sets at round t gives PSLB the ability to obtain the lowest possible instantaneous regret among both confidence sets. Therefore, the regret of PSLB is upper bounded by the minimum of the regret upper bounds of the individual strategies. Using only the ambient confidence set is equivalent to following OFUL, and the regret analysis can be found in Abbasi-Yadkori et al. (2011). The regret analysis of using only the projected confidence set is the main contribution of this work. It follows the standard regret decomposition into instantaneous regret components. However, due to having different estimated projection matrices in each round, the derivation of the bound uses a different strategy involving the Matrix Chernoff Inequality (Tropp, 2015). The detailed analysis of the regret upper bound and the proof can be found in Section C of the Supplementary Material. Here we elaborate on the nature of the regret obtained by using projected confidence sets only, i.e., the first term in Theorem 1, and discuss the effect of the subspace recovery difficulty in particular.

This difficulty term is the reflection of the finite sample projection error at the beginning of the algorithm. It captures the difficulty of subspace recovery based on the structural properties of the problem and determines the regret of deploying projection-based methods in SLBs. Recall that the bound depends on the maximum of the effective dimensions of the true action vector and the perturbation vector. Depending on the structure of the problem, this effective dimension can be as large as the ambient dimension, e.g., the perturbation can be effective in many dimensions, which prevents the projection error from shrinking and removes the benefit of projection. The eigengap within the true action vectors and the eigengap between the true action vectors and the perturbation vectors are critical factors that determine the identifiability of the hidden subspace. As the perturbation level increases, the subspace recovery becomes harder, since the effect of perturbation increases; conversely, as the eigengap increases, the underlying subspace becomes easier to identify. These effects are significant for the regret of PSLB and are captured by the difficulty term. Moreover, having only finite samples to estimate the subspace affects the regret bound; due to the nature of SLB this is unavoidable, and it scales the final regret accordingly. Overall, this quantity represents the hardness of using PCA-based methods for dimensionality reduction in SLBs.

Theorem 1 states that if the underlying structure is easily recoverable, then using PCA-based dimension reduction and confidence set construction provides a substantially better regret upper bound for large d. If that is not the case, then, due to the best-of-both-worlds approach of PSLB, the agent still obtains the better of the two regret upper bounds. Note that the bound for using only the projected confidence set is a worst-case bound and, as we present in Section 5, in practice PSLB can give significantly better results.

5 Experiments

Figure 7: Regret and Optimistic Model Accuracy Comparisons of PSLB and OFUL on MNIST, CIFAR-10 and ImageNet

In the experiments, we study the MNIST, CIFAR-10 and ImageNet datasets and use them to create the decision sets for the SLB setting. A simple 5-layer CNN, a pre-trained ResNet-18 and a pre-trained ResNet-50 are deployed for MNIST, CIFAR-10 and ImageNet, respectively. Before training, we modify the architecture of the representation layer (the layer before the final layer) to make it suitable for the SLB study and to obtain decision sets for each image.

Consider a standard network whose representation layer has some fixed dimension; the final layer for k-class classification is then fully connected and outputs k numbers to be used for classification. In this study, instead of a fully connected final layer, we construct the final layer as a single d-dimensional vector and make the feature representation layer a (k·d)-dimensional vector. We treat this vector as the concatenation of k d-dimensional contexts, one per class. The final d-dimensional layer plays the role of the unknown parameter vector of the SLB, and the logit for each class is computed as the inner product of that class's context and the parameter vector. We train these architectures for different values of d using the cross-entropy loss. Here we provide results for MNIST, CIFAR-10 and ImageNet.
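The resulting forward pass can be sketched as follows, with hypothetical dimensions: the representation layer is reshaped into per-class contexts and scored against a single shared parameter vector, and the bandit feedback is 1 only for the correct class.

```python
import numpy as np

rng = np.random.default_rng(5)
k, d = 10, 64                       # classes; per-class context dimension

# Hypothetical stand-ins: the representation layer outputs a (k*d)-vector,
# interpreted as k contexts x_1, ..., x_k; the final layer is a single
# d-dimensional vector theta shared by all classes.
rep = rng.standard_normal(k * d)
theta = rng.standard_normal(d)

contexts = rep.reshape(k, d)        # one candidate action per class
logits = contexts @ theta           # logit_i = <x_i, theta>
prediction = int(logits.argmax())

def reward(chosen, label):
    """SLB feedback: 1 for the correct class, 0 otherwise."""
    return 1.0 if chosen == label else 0.0
```

In this setup the per-class contexts are exactly the action vectors of the decision set, so classification accuracy and bandit reward coincide.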

Removing the final layer, the resulting trained networks are used to generate the feature representations of each image for each class, which produce the decision sets at each time step of the SLB. Since MNIST and CIFAR-10 have 10 classes, each decision set contains 10 action vectors, each a segment of the representation layer. From the ImageNet dataset we get 1000 actions per decision set, due to its 1000 classes. In the SLB setting, the agent receives a reward of 1 if it chooses the right action, i.e., the segment of the representation layer corresponding to the correct label according to the trained network, and 0 otherwise. We apply both PSLB and OFUL to these SLBs. We measure the regret by counting the number of mistakes each algorithm makes. Coming up with the optimistic choice of action at each time step requires both algorithms to solve an inner optimization problem. To mitigate this computational burden, we sample many models from the confidence sets and choose the most optimistic model among the sampled ones.

Through computing PCA of the empirical covariance matrix of the action vectors, we surprisingly found that projecting the action vectors onto the one-dimensional subspace defined by the dominant eigenvector is sufficient for these datasets in the SLB setting; thus, during the experiments PSLB tried to recover a one-dimensional subspace using the collected action vectors. We present the regrets obtained by PSLB and OFUL for MNIST, CIFAR-10 and ImageNet in the first three panels of Figure 7. With the help of subspace recovery and projection, PSLB provides a massive reduction in the dimensionality of the SLB problem and immediately estimates a fairly accurate model for the parameter vector, whereas OFUL naively tries to sample from all dimensions in order to learn it. This difference yields orders of magnitude improvement in regret. During the SLB experiment, we also sample the optimistic models chosen by PSLB and OFUL. We use these models to test the model accuracy of the algorithms, i.e., to perform classification over all images in the dataset. The optimistic model accuracy comparisons are depicted in the remaining panels of Figure 7. They portray the learning behavior of PSLB and OFUL. Using projection, PSLB learns the underlying linear model in the first few rounds, whereas OFUL suffers from the high dimensionality of the SLB framework and its lack of knowledge beyond the chosen action-reward pairs. We extend these experiments to additional settings, which can be found in Section D.

6 Related Work

The primary class of partial information problems is the multi-armed bandit (MAB). Robbins (1985) introduces the standard stochastic MAB and Lai & Robbins (1985) study the asymptotic properties of learning algorithms on this class. Stochastic MABs are a special case of SLB in which the arms' representations are orthogonal to each other. For the finite sample regime, Auer et al. (2002) deploy the principle of OFU and provide finite sample guarantees for MABs. Auer (2002) deploys the same principle to provide regret guarantees for MABs with linear pay-offs. This principle has been realized as the primary approach even for more general problems such as Linear Quadratic systems (Abbasi-Yadkori & Szepesvári, 2011) and Markov Decision Processes (Jaksch et al., 2010).

The study of linear bandit problems extends to various algorithms and environment settings (Dani et al., 2008; Rusmevichientong & Tsitsiklis, 2010; Li et al., 2010). Kleinberg et al. (2010) study the class of problems where the decision set changes from time to time, while Dani et al. (2008) study this problem when the decision set is a fixed set of actions. Further work in the area extends these approaches to classes with more structure in the problem setup. In traditional decision-making problems, where hand-engineered feature representations are provided, sparsity in the linear function is a valid structure. Sparsity, as the key structure in high-dimensional linear bandits, has conveyed a series of successes in classical settings (Abbasi-Yadkori et al., 2012; Carpentier & Munos, 2012). In recommendation systems, where a set of users and items is given, Gopalan et al. (2016) consider the low-rank structure of the user-item preference matrix and provide an algorithm which exploits this additional structure.

To the best of our knowledge, there are no hidden low-dimensional subspace assumptions on the actions and/or the unknown weight vector in the SLB literature. On the other hand, subspace recovery and dimension reduction problems are well studied. Several linear and nonlinear dimension reduction methods have been proposed, such as PCA (Pearson, 1901), independent component analysis (Hyvärinen & Oja, 2000), random projections (Candes & Tao, 2006) and non-convex robust PCA (Netrapalli et al., 2014). Among the linear dimension reduction techniques, PCA is the simplest, yet most widely used method. Analyses of PCA-based methods mostly focus on asymptotic results (Anderson et al., 1963; Jain et al., 2016). However, in settings like SLB with a finite number of arms, it is necessary to have finite sample guarantees for the application of PCA. Among the few finite sample PCA works in the literature, Nadler (2008) provides finite sample guarantees for one-dimensional PCA, whereas Vaswani & Narayanamurthy (2017) extend it to larger dimensions with various noise models.

7 Conclusion

In this paper, we studied a linear subspace structure in the action set of an SLB problem. We deployed PCA-based projection to exploit the immense number of unsupervised actions in the decision sets and learn the underlying subspace. We proposed PSLB, an SLB algorithm which utilizes the subspace estimated through PCA to improve the regret upper bound of SLB problems. If such a structure does not exist or is hard to recover, then PSLB reduces to the standard SLB algorithm, OFUL. We empirically studied the MNIST, CIFAR-10 and ImageNet datasets to create an SLB framework from image classification tasks, and tested the performance of PSLB versus OFUL in this setting. We showed that when DNNs produce the features of the actions, a significantly lower-dimensional structure is observed. Due to this structure, PSLB substantially outperforms OFUL and converges to an accurate model while OFUL still struggles to sample in high dimensions to learn the underlying parameter vector.

In this work, we studied the class of linear subspace structures. In future work, we plan to extend this line of study to the general class of low-dimensional manifold-structured problems; Bora et al. (2017) pursue a similar approach for compression problems. While optimism is the primary approach in the theoretical analyses of SLBs, it generally poses a computationally intractable internal optimization problem. An alternative method is Thompson sampling, a practical algorithm for SLBs. In future work, we plan to deploy Thompson sampling and mitigate the computational complexity of PSLB.
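As a rough sketch of that direction, the loop below implements generic linear Thompson sampling: drawing a parameter from a Gaussian posterior replaces the optimistic inner optimization. This is not PSLB itself, and all dimensions and parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d, T, lam, noise = 20, 2000, 1.0, 0.1
theta_star = rng.standard_normal(d)
theta_star /= np.linalg.norm(theta_star)        # unknown parameter (for simulation only)

V, b = lam * np.eye(d), np.zeros(d)             # regularized Gram matrix and response sum
regret = 0.0
for t in range(T):
    arms = rng.standard_normal((50, d))
    arms /= np.linalg.norm(arms, axis=1, keepdims=True)
    theta_hat = np.linalg.solve(V, b)           # ridge estimate of the parameter
    # Posterior sample: a cheap draw instead of solving the optimistic optimization.
    theta_tilde = theta_hat + np.linalg.cholesky(np.linalg.inv(V)) @ rng.standard_normal(d)
    a = arms[np.argmax(arms @ theta_tilde)]     # act greedily w.r.t. the sampled model
    reward = a @ theta_star + noise * rng.standard_normal()
    V += np.outer(a, a)
    b += reward * a
    regret += (arms @ theta_star).max() - a @ theta_star

print(f"average per-round regret after {T} rounds: {regret / T:.4f}")
```

The per-round cost is a linear solve and a Cholesky factorization, avoiding the intractable joint maximization over the confidence ellipsoid and the decision set that optimism requires.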


  • Abbasi-Yadkori & Szepesvári (2011) Abbasi-Yadkori, Y. and Szepesvári, C. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pp. 1–26, 2011.
  • Abbasi-Yadkori et al. (2011) Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pp. 2312–2320, 2011.
  • Abbasi-Yadkori et al. (2012) Abbasi-Yadkori, Y., Pal, D., and Szepesvari, C. Online-to-confidence-set conversions and application to sparse stochastic bandits. In Artificial Intelligence and Statistics, pp. 1–9, 2012.
  • Akhiezer & Glazman (2013) Akhiezer, N. I. and Glazman, I. M. Theory of linear operators in Hilbert space. Courier Corporation, 2013.
  • Anderson et al. (1963) Anderson, T. W. et al. Asymptotic theory for principal component analysis. Annals of Mathematical Statistics, 34(1):122–148, 1963.
  • Auer (2002) Auer, P. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
  • Auer et al. (2002) Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.
  • Bora et al. (2017) Bora, A., Jalal, A., Price, E., and Dimakis, A. G. Compressed sensing using generative models. arXiv preprint arXiv:1703.03208, 2017.
  • Candes & Tao (2006) Candes, E. J. and Tao, T. Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE transactions on information theory, 52(12):5406–5425, 2006.
  • Carpentier & Munos (2012) Carpentier, A. and Munos, R. Bandit theory meets compressed sensing for high dimensional stochastic linear bandit. In Artificial Intelligence and Statistics, pp. 190–198, 2012.
  • Dani et al. (2008) Dani, V., Hayes, T. P., and Kakade, S. M. Stochastic linear optimization under bandit feedback. 2008.
  • Davis & Kahan (1970) Davis, C. and Kahan, W. M. The rotation of eigenvectors by a perturbation. iii. SIAM Journal on Numerical Analysis, 7(1):1–46, 1970.
  • Eckart & Young (1936) Eckart, C. and Young, G. The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218, 1936.
  • Freedman (1975) Freedman, D. A. On tail probabilities for martingales. the Annals of Probability, pp. 100–118, 1975.
  • Gopalan et al. (2016) Gopalan, A., Maillard, O.-A., and Zaki, M. Low-rank bandits with latent mixtures. arXiv preprint arXiv:1609.01508, 2016.
  • Hyvärinen & Oja (2000) Hyvärinen, A. and Oja, E. Independent component analysis: algorithms and applications. Neural networks, 13(4-5):411–430, 2000.
  • Jain et al. (2016) Jain, P., Jin, C., Kakade, S. M., Netrapalli, P., and Sidford, A. Streaming PCA: Matching matrix Bernstein and near-optimal finite sample guarantees for Oja's algorithm. In Conference on Learning Theory, pp. 1147–1164, 2016.
  • Jaksch et al. (2010) Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
  • Kleinberg et al. (2010) Kleinberg, R., Niculescu-Mizil, A., and Sharma, Y. Regret bounds for sleeping experts and bandits. Machine learning, 80(2-3):245–272, 2010.
  • Krizhevsky & Hinton (2009) Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
  • Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
  • Lai & Robbins (1985) Lai, T. L. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22, 1985.
  • LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Li et al. (2010) Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pp. 661–670. ACM, 2010.
  • Nadler (2008) Nadler, B. Finite sample approximation results for principal component analysis: A matrix perturbation approach. The Annals of Statistics, 36(6):2791–2817, 2008.
  • Netrapalli et al. (2014) Netrapalli, P., Niranjan, U., Sanghavi, S., Anandkumar, A., and Jain, P. Non-convex robust PCA. In Advances in Neural Information Processing Systems, pp. 1107–1115, 2014.
  • Pearson (1901) Pearson, K. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, 1901.
  • Robbins (1985) Robbins, H. Some aspects of the sequential design of experiments. In Herbert Robbins Selected Papers, pp. 169–177. Springer, 1985.
  • Rusmevichientong & Tsitsiklis (2010) Rusmevichientong, P. and Tsitsiklis, J. N. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.
  • Tropp (2015) Tropp, J. A. An introduction to matrix concentration inequalities. Foundations and Trends® in Machine Learning, 8(1-2):1–230, 2015.
  • Vaswani & Narayanamurthy (2017) Vaswani, N. and Narayanamurthy, P. Finite sample guarantees for pca in non-isotropic and data-dependent noise. In Communication, Control, and Computing (Allerton), 2017 55th Annual Allerton Conference on, pp. 783–789. IEEE, 2017.

Appendix A Projection Error Analysis, Proof of Lemma 2

In this section, we provide the general version of Lemma 2 with the proof details. As stated in the main text, in order to bound the projection error, we use the Davis-Kahan theorem, which states the following:

Theorem 4 ((Davis & Kahan, 1970)).

Let $M, \hat{M} \in \mathbb{R}^{d \times d}$ be symmetric matrices such that $\hat{M} = M + H$. The eigenvalues of $M$ and $\hat{M}$ are $\lambda_1 \geq \dots \geq \lambda_d$ and $\hat{\lambda}_1 \geq \dots \geq \hat{\lambda}_d$ respectively. Define the eigendecompositions of $M$ and $\hat{M}$:

$$M = \begin{bmatrix} U & U_\perp \end{bmatrix} \begin{bmatrix} \Lambda & 0 \\ 0 & \Lambda_\perp \end{bmatrix} \begin{bmatrix} U & U_\perp \end{bmatrix}^\top, \qquad \hat{M} = \begin{bmatrix} \hat{U} & \hat{U}_\perp \end{bmatrix} \begin{bmatrix} \hat{\Lambda} & 0 \\ 0 & \hat{\Lambda}_\perp \end{bmatrix} \begin{bmatrix} \hat{U} & \hat{U}_\perp \end{bmatrix}^\top,$$

where $\Lambda$ and $\hat{\Lambda}$ are diagonal matrices with the first $k$ eigenvalues of $M$ and $\hat{M}$ respectively, and $U$ and $\hat{U}$ denote the corresponding eigenvectors. Define $\delta = \lambda_k - \hat{\lambda}_{k+1}$.

If $\delta > 0$, then $\|\sin\Theta(U, \hat{U})\|$, the sine of the largest principal angle between the column spans of $U$ and $\hat{U}$, can be upper bounded as

$$\left\|\sin\Theta(U, \hat{U})\right\| \leq \frac{\|H\|}{\delta}. \qquad (11)$$
Notice that in order to use the Davis-Kahan theorem in our setting, we need to pick two symmetric matrices such that the spans of their first eigenvectors coincide with the subspaces that the true and estimated projections map to. Given these choices, in order to get a non-trivial bound we require a significant eigengap, due to the denominator in (11). We use the following matrix concentration inequalities to maintain an eigengap with high probability.
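The Davis-Kahan bound can be sanity-checked numerically. The snippet below builds a symmetric matrix with a clear eigengap after its top-$k$ eigenvalues, perturbs it, and compares the sine of the largest principal angle against $\|H\|/\delta$; the dimensions and eigenvalues are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 30, 3

# Symmetric M with a clear eigengap after the top-k eigenvalues.
Q = np.linalg.qr(rng.standard_normal((d, d)))[0]
eigs = np.concatenate([np.array([10.0, 9.0, 8.0]), rng.uniform(0, 1, d - k)])
M = Q @ np.diag(eigs) @ Q.T

G = rng.standard_normal((d, d))
H = 0.1 * (G + G.T) / 2                     # small symmetric perturbation
M_hat = M + H

def top_k(A, k):
    """Return eigenvalues in descending order and the top-k eigenvectors."""
    vals, vecs = np.linalg.eigh(A)
    return vals[::-1], vecs[:, ::-1][:, :k]

vals, U = top_k(M, k)
vals_hat, U_hat = top_k(M_hat, k)

# Sine of the largest principal angle between the spans of U and U_hat.
sin_theta = np.linalg.norm((np.eye(d) - U @ U.T) @ U_hat, ord=2)
delta = vals[k - 1] - vals_hat[k]           # eigengap lambda_k - hat{lambda}_{k+1}
bound = np.linalg.norm(H, ord=2) / delta
print(f"sin(theta) = {sin_theta:.4f}, Davis-Kahan bound = {bound:.4f}")
```

Because the eigengap (about 7 here) dwarfs the perturbation norm, the recovered top-$k$ subspace is nearly unchanged and the bound is small.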

Theorem 5 (Matrix Chernoff Inequality; (Tropp, 2015)).

Consider a finite sequence $\{X_k\}$ of independent, random, symmetric matrices in $\mathbb{R}^{d \times d}$. Assume that $\lambda_{\min}(X_k) \geq 0$ and $\lambda_{\max}(X_k) \leq L$ for each index $k$. Introduce the random matrix $Y = \sum_k X_k$. Let $\mu_{\min}$ denote the minimum eigenvalue of the expectation $\mathbb{E}[Y]$, i.e. $\mu_{\min} = \lambda_{\min}(\mathbb{E}[Y])$. Then, for $\epsilon \in [0, 1)$,

$$\mathbb{P}\left\{\lambda_{\min}(Y) \leq (1-\epsilon)\,\mu_{\min}\right\} \leq d \left[\frac{e^{-\epsilon}}{(1-\epsilon)^{1-\epsilon}}\right]^{\mu_{\min}/L}.$$
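The quantity controlled by Theorem 5 is easy to simulate: the minimum eigenvalue of a sum of independent bounded PSD matrices concentrates around the minimum eigenvalue of its expectation. The setup below is our own illustrative choice, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 10, 2000

# Independent bounded PSD summands X_k = z_k z_k^T with z_k ~ Uniform(-1, 1)^d.
Z = rng.uniform(-1, 1, (n, d))
Y = Z.T @ Z                        # Y = sum_k z_k z_k^T
mu_min = n / 3.0                   # lambda_min(E[Y]) = n/3, since E[z z^T] = (1/3) I
lam_min = np.linalg.eigvalsh(Y)[0]
print(f"lambda_min(Y) / mu_min = {lam_min / mu_min:.3f}")
```

For $n \gg d$ the ratio is close to 1, which is exactly the high-probability eigengap guarantee the proof needs.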
Theorem 6 (Corollary of Matrix Bernstein; (Tropp, 2015)).

Consider a set of $n$ i.i.d. realizations of a random matrix $X \in \mathbb{R}^{d \times d}$, as $X_1, \dots, X_n$. If $X$ is bounded, $\|X\| \leq L$ almost surely, with second moment $v = \left\|\mathbb{E}[X^2]\right\|$. Then, for all $t \geq 0$,

$$\mathbb{P}\left\{\left\|\frac{1}{n}\sum_{i=1}^{n} X_i - \mathbb{E}[X]\right\| \geq t\right\} \leq 2d \exp\left(\frac{-n t^2 / 2}{v + 2Lt/3}\right).$$
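Similarly, the spectral-norm deviation that Theorem 6 bounds can be checked with a small simulation of our own: the empirical mean of bounded i.i.d. random matrices stays close to its expectation in operator norm.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 20, 20000

# Bounded i.i.d. matrices X_i = x_i x_i^T with x_i ~ Uniform(-1, 1)^d, so ||X_i|| <= d.
X = rng.uniform(-1, 1, (n, d))
emp = X.T @ X / n                  # empirical mean of the x_i x_i^T
EX = np.eye(d) / 3.0               # E[x x^T] = (1/3) I for Uniform(-1, 1) coordinates

err = np.linalg.norm(emp - EX, ord=2)
print(f"spectral deviation ||mean - E|| with n={n}: {err:.4f}")
```

This is the same quantity that governs how well the empirical covariance of the action vectors approximates the true one, and hence how accurately PCA recovers the subspace.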
Define . Now that we have the required machinery, we present the general version of Lemma 2.

Lemma 7.

Fix any . Suppose that Assumption 1 holds. Then with probability at least ,



We set and where . Let U be the top eigenvectors of S. Notice that and is the matrix of top eigenvectors of . Therefore, one can apply Theorem 4 with given choices of and , to bound . Since ,

where (1) follows from Weyl’s inequality and the fact that is rank , , and (2) is due to the triangle inequality. With the given choices of and and Assumption 1, we have the following:

Inserting these expressions we get,


We first bound . From Assumption 1, for all and from the model properties, . Using Theorem 5, one can get that


Now we consider . From the triangle inequality, we have

We will consider each term on the right-hand side separately. If Assumption 1 holds, then we have: