1 Introduction
Low-rank models are widely used in applications such as matrix completion and computer vision (Candès and Recht, 2009; Basri and Jacobs, 2003). We study low-rank (generalized) linear models in the bandit setting (Lai and Robbins, 1985). During the learning process, the agent adaptively pulls an arm (denoted as ) from a set of arms based on past experience. At each pull, the agent observes a noisy reward corresponding to the arm pulled. Let be an unknown low-rank matrix with rank . The learner’s goal is to maximize the total reward: where is the time horizon, is an action pulled at time that belongs to a prespecified action set and denotes a link function. Note that in the standard linear case the link function is the identity.
Many practical applications can be framed in this low-rank bandit model. For travel websites, the recommendation system needs to choose a flight-hotel bundle for the customer that can achieve high revenue. Often one has features of size for flights and features of size for hotels. It is natural to form a matrix feature (e.g. via an outer product) for each pair or simply combine the two features row/column-wise if . One can model the appeal of a bundle by a (generalized) linear function of the matrix feature. In online advertising with image recommendation, the advertiser selects an image to display and the goal is to achieve the maximum click rate. The image is often stored as a matrix, and one can use a generalized linear model (GLM) with the link function being the logistic function to model the click rate (Richardson et al., 2007; McMahan et al., 2013). In all of these applications, one puts some capacity control on the underlying matrix linear coefficient, and a natural condition is that it is low-rank. We note that examples such as online dating and online shopping discussed in Jun et al. (2019) can also be formulated in our model.
In this paper, we measure the quality of an algorithm in terms of its cumulative regret (see Section 3 for the definition). A naive approach is to ignore the low-rank structure and directly apply the standard (generalized) linear bandit algorithms (Abbasi-Yadkori et al., 2011; Filippi et al., 2010). These approaches suffer regret, where omits polylogarithmic factors of . However, in practice, can be huge. Then a natural question is:
Can we utilize the low-rank structure of to achieve regret?
Jun et al. (2019) studied a subclass of our problem, where the actions are rank one matrices. They proposed an algorithm that achieves regret under additional incoherence and singular value assumptions of an augmented matrix defined via the arm set and and a singular value assumption of . They also provided strong evidence that their bound is unimprovable.
We summarize our contributions below.

We propose the Low-Rank Linear Bandit with Online Computation algorithm (LowLOC) for the low-rank linear bandit problem, which achieves regret. Notably, compared with the result in Jun et al. (2019), our result 1) applies to more general action sets, which can contain high-rank matrices, and 2) does not require the incoherence and bounded eigenvalue assumptions on the augmented matrix mentioned in the previous paragraph. Our regret bound also matches their conjectured lower bound. For LowLOC, we first design a novel online predictor that uses an exponentially weighted average forecaster on a covering of low-rank matrices to solve the online low-rank linear prediction problem with regret. We then plug our online predictor into the online-to-confidence-set conversion framework proposed by Abbasi-Yadkori et al. (2012) to construct a confidence set of in our bandit setting, and at every round we choose the action optimistically.
We further propose the Low-Rank Generalized Linear Bandit with Online Computation algorithm (LowGLOC) for the generalized linear setting, which also achieves regret. LowGLOC is similar to LowLOC, but here we need to design a new online-to-confidence-set conversion method, which can be of independent interest.

LowLOC and LowGLOC enjoy good regret but are unfortunately not efficiently implementable. To overcome this issue, we provide an efficient algorithm, Low-Rank-Explore-Subspace-Then-Refine (LowESTR), for the linear setting, inspired by the ESTR algorithm proposed by Jun et al. (2019). We show that under a mild assumption on the action set , LowESTR achieves regret, where is a lower bound for the th singular value of . Compared with ESTR, LowESTR does not need the incoherence and eigenvalue assumptions on the augmented matrix, while the assumptions on the action set of the two algorithms are different. We also provide empirical evaluations to demonstrate the effectiveness of LowESTR.
2 Related Work
Our work is inspired by Jun et al. (2019), who model the reward as . is a left arm and is a right arm ( and are left and right arm sets, respectively). Note that this model is a special case of our low-rank linear bandit model because one can write and define the arm set as . Their ESTR algorithm enjoys regret bound under the assumptions: 1) an augmented matrix is incoherent (Keshavan et al., 2010) and has a finite condition number, where is constructed by arms from that maximizes and is constructed by arms from that maximizes , and 2) and are upper bounded by a constant. Their algorithm requires explicitly finding and , which is in general NP-hard, even though they also proposed heuristics to speed up this step. Compared with ESTR, our LowLOC and LowGLOC algorithms are also not computationally efficient, but they both apply to richer action sets (matrices of any rank) without assumptions on , and , and their regret bounds do not depend on . Our LowESTR algorithm is computationally efficient if the action set admits a nice exploration distribution (see details in Section 6). LowESTR achieves regret bound but also does not require assumptions on , and .
Katariya et al. (2017b) and Kveton et al. (2017) also studied rank-1 and low-rank bandit problems. They assume there is an underlying expected reward matrix ; at each time the learner picks an element on position and receives a noisy reward. This can be viewed as a special case of the bilinear bandit with one-hot vectors as left and right arms. Katariya et al. (2017b) is further extended by Katariya et al. (2017a), who use KL-based confidence intervals to achieve a tighter regret bound. Our problem is more general than these works.
Johnson et al. (2016) considered the same setting as ours, but their method relies on the knowledge of many parameters that depend on the unknown and in particular only works for continuous arm sets.
There are other works that utilize the low-rank structure in different model settings. For example, Gopalan et al. (2016) studied low-rank bandits with latent structures using the robust tensor power method. Lale et al. (2019) imposed low-rank assumptions on the feature vectors to reduce the effective dimension. These works all utilize the low-rank structure to achieve better regret bounds than standard approaches that do not take the low-rank structure into account.
3 Preliminaries
We formally define the problem and review relevant background in this section.
3.1 Low-rank Linear Bandit
Let be the arm space. In each round , the learner chooses an arm , and observes a noisy reward of a linear form: , where is an unknown parameter and is a sub-Gaussian random variable. Denoting the rank of by , we assume . We assume that the th singular value of is lower bounded by . We use to denote the inner product between matrices and . We follow the standard assumptions in linear bandits: and , for all .
In our bandit problem, the goal of the learner is to maximize the total reward , where is the time horizon. Clearly, with the knowledge of the unknown parameter , one should always select an action . It is natural to evaluate the learner relative to the optimal strategy. The difference between the learner’s total reward and the total reward of the optimal strategy is called the pseudo-regret (Audibert et al., 2009): For simplicity, we use the word regret instead of pseudo-regret for .
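As a rough illustration of the setup above, the following sketch simulates a low-rank linear bandit with a finite arm set and tracks the pseudo-regret of a uniformly random policy. The dimensions, the Gaussian arm set, the noise level 0.1, and all variable names are assumptions made for illustration only; they are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
d1, d2, r, n_arms, T = 6, 5, 2, 50, 200

# Rank-r parameter: product of two thin Gaussian factors,
# rescaled to unit Frobenius norm.
theta = rng.normal(size=(d1, r)) @ rng.normal(size=(r, d2))
theta /= np.linalg.norm(theta)

# Finite arm set of matrices, each with unit Frobenius norm.
arms = rng.normal(size=(n_arms, d1, d2))
arms /= np.linalg.norm(arms, axis=(1, 2), keepdims=True)

# Expected reward of each arm: the trace inner product <X, Theta>.
means = np.einsum("kij,ij->k", arms, theta)
best = means.max()

# A uniformly random policy, used only to show the bookkeeping.
pulls = rng.integers(n_arms, size=T)
rewards = means[pulls] + 0.1 * rng.normal(size=T)  # noisy observations

# Pseudo-regret compares expected rewards, so the noise does not enter.
pseudo_regret = np.cumsum(best - means[pulls])
```

Any bandit algorithm would replace the random `pulls` with choices that depend on the past arms and rewards; the regret bookkeeping stays the same.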
3.2 Generalized Low-rank Linear Bandit
We also study the generalized linear bandit model of the following form: where is a link function. This framework builds on the well-known Generalized Linear Models (GLMs) and has been widely studied in many applications. For example, when rewards are binary-valued, a natural link function is the logistic function . For the generalized setting, we assume the reward given the action follows an exponential family distribution:
(1) 
where is a known scale parameter and and are some known functions. From basic calculation we get . We assume the above exponential family is a minimal representation, then is ensured to be strictly convex (Wainwright and Jordan, 2008), and thus the negative log likelihood (NLL) loss is also strictly convex.
We make the following standard assumption on the link function (Jun et al., 2017).
Assumption 1.
There exist constants , such that the link function is Lipschitz on , continuously differentiable on , and .
One can write down the above reward model (1) in an equivalent way: where is conditionally sub-Gaussian given and . Using the form of , Taylor expansion and the strict convexity of , one can show that by the definition of the sub-Gaussian constant. An optimal arm is . The performance of an algorithm is again evaluated by cumulative regret:
We use and for the standard Big O and Big Omega notations. and ignore the polylogarithmic factors of . means and are of the same order ignoring the polylogarithmic factors of .
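To make the generalized model concrete, here is a minimal sketch with binary rewards and the logistic link mentioned above. The sizes, the rank-one parameter, and the random policy are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def logistic(z):
    """Logistic link function, mapping a linear score to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
d1, d2, n_arms, T = 4, 3, 20, 100  # hypothetical sizes

# A rank-one parameter built as an outer product (illustrative choice).
theta = np.outer(rng.normal(size=d1), rng.normal(size=d2))
arms = rng.normal(size=(n_arms, d1, d2))

# Linear scores <X, Theta> and the induced success probabilities.
scores = np.einsum("kij,ij->k", arms, theta)
click_prob = logistic(scores)

# The link is increasing, so the arm with the largest linear score
# also has the largest expected reward.
best_arm = int(np.argmax(scores))

# Binary rewards under a uniformly random policy, plus cumulative regret.
pulls = rng.integers(n_arms, size=T)
y = rng.binomial(1, click_prob[pulls])
regret = np.cumsum(click_prob[best_arm] - click_prob[pulls])
```

The regret here is measured on the mean-reward scale, i.e. after applying the link, which is why `click_prob` rather than `scores` appears in the last line.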
4 Low-rank Linear Bandit with Online Computation
We first present our algorithm, LowLOC (Algorithm 1), for low-rank linear bandit problems.
Theorem 1 (Regret of LowLOC (Algorithm 1)).
Note that LowLOC achieves the desired goal of outperforming the standard linear bandit approach with regret. Furthermore, this bound does not depend on any other problem-dependent parameters such as the least singular value of and does not require any of the other assumptions that appear in Jun et al. (2019). In the following section, we explain the details of our algorithm design choices.
4.1 OFU and Online-to-Confidence-Set Conversion
This algorithm follows the standard Optimism in the Face of Uncertainty (OFU) principle. We maintain a confidence set at every round that contains the true parameter with high probability and we choose the action according to
Typically, the faster shrinks, the lower regret we have. The main difficulty is to construct that leverages the low-rank structure so that we only have regret. Our starting point is to use the online-to-confidence-set conversion framework proposed by Abbasi-Yadkori et al. (2012), who build the confidence set based on an online predictor. At each round, an online predictor receives , predicts , based on historical data , observes the true value and suffers a loss . The performance of this online predictor is measured by comparing its cumulative loss to the cumulative loss of a fixed linear predictor using coefficient : .
The key idea of online-to-confidence-set conversion (adapted to our low-rank setting) is that if one can guarantee for some nondecreasing sequence , we can construct the confidence set for as:
(2) 
where is the failure probability. Lemma 7 in the appendix guarantees that is contained in with high probability and Lemma 8 further guarantees the overall regret .
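The optimistic selection rule can be sketched on a toy problem as follows. This is only an illustration under stated assumptions: the confidence set is taken over a finite list of candidate matrices, it is assumed to have the generic form used in online-to-confidence-set conversions (all candidates whose predictions stay close, in squared distance, to the online predictor's past predictions), and the radius `beta` is a user-supplied placeholder rather than the expression in (2). The function names are hypothetical.

```python
import numpy as np

def confidence_set(candidates, X_hist, yhat_hist, beta):
    """Keep candidates Theta with sum_s (yhat_s - <X_s, Theta>)^2 <= beta."""
    # preds[c, s] = <X_s, Theta_c>
    preds = np.einsum("sij,cij->cs", X_hist, candidates)
    dist = ((preds - yhat_hist) ** 2).sum(axis=1)
    return candidates[dist <= beta]

def optimistic_arm(arms, conf_set):
    """Pick the arm whose best-case reward over the confidence set is largest."""
    # scores[k, c] = <X_k, Theta_c>
    scores = np.einsum("kij,cij->kc", arms, conf_set)
    return int(np.argmax(scores.max(axis=1)))

rng = np.random.default_rng(2)
candidates = rng.normal(size=(30, 3, 3))  # hypothetical finite parameter class
truth = candidates[0]
X_hist = rng.normal(size=(10, 3, 3))      # arms pulled so far
# An idealised online predictor that predicts the noiseless reward.
yhat_hist = np.einsum("sij,ij->s", X_hist, truth)

conf = confidence_set(candidates, X_hist, yhat_hist, beta=1.0)
arm = optimistic_arm(rng.normal(size=(8, 3, 3)), conf)
```

In this toy run the true parameter is at distance zero from the predictions, so it always survives the filter; with a real predictor, the radius must be large enough to guarantee this with high probability.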
Therefore, the problem of achieving the regret bound reduces to designing an online predictor which guarantees and . To achieve this rate, the key is to leverage the low-rank structure.
4.2 Online Low Rank Linear Prediction
We adopt the classical exponentially weighted average forecaster (EW) framework (Cesa-Bianchi and Lugosi, 2006), which uses experts to predict with the following formula
(3) 
Above, denotes the th expert that makes a prediction at time , is the cumulative loss incurred by expert , and is a tuning parameter. By choosing carefully, one can guarantee that this predictor achieves regret compared with the best expert in the expert set.
In our setting, an expert can be viewed as a matrix that satisfies and , and makes predictions according to . There are infinitely many such experts, so we cannot directly use EW, which requires a finite number of experts. Our main idea is to construct experts such that is small and these experts represent the original expert set well, and then apply EW using these experts. We construct an net , i.e., for any , there exists a , such that . We further show that in Lemma 6, so the number of experts in Equation 3 is at most if we set .
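The forecaster described above can be sketched in a few lines. In this illustration each expert is a matrix from a finite list (standing in for the covering), predictions are weighted by the exponential of the negative cumulative squared loss, and the step size `eta` is left as a free parameter. The function names and the toy usage are assumptions for illustration.

```python
import numpy as np

def ew_predict(expert_preds, cum_losses, eta):
    """Exponentially weighted average of the experts' predictions."""
    # Subtracting the minimum loss rescales all weights by a common factor,
    # which leaves the weighted average unchanged but avoids underflow.
    w = np.exp(-eta * (cum_losses - cum_losses.min()))
    return float(w @ expert_preds / w.sum())

def run_ew(experts, X_seq, y_seq, eta):
    """Run the forecaster over a sequence of (arm, reward) pairs."""
    cum_losses = np.zeros(len(experts))
    preds = []
    for X, y in zip(X_seq, y_seq):
        f = np.einsum("cij,ij->c", experts, X)  # each expert predicts <X, Theta_c>
        preds.append(ew_predict(f, cum_losses, eta))
        cum_losses += (f - y) ** 2              # squared loss, revealed after predicting
    return np.array(preds), cum_losses

rng = np.random.default_rng(0)
experts = rng.normal(size=(50, 3, 3))           # hypothetical finite expert set
X_seq = rng.normal(size=(40, 3, 3))
y_seq = np.einsum("sij,ij->s", X_seq, experts[7])  # noiseless data from expert 7
preds, cum_losses = run_ew(experts, X_seq, y_seq, eta=0.5)
```

The per-round cost is linear in the number of experts, which is why running this over a covering of low-rank matrices is expensive even though the update itself is simple.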
The following lemma summarizes the performance of this online predictor.
Lemma 2 (Regret of EW under Squared Loss).
Let in EW forecaster (3). Then, for any , with probability at least , we have
5 Low-rank Generalized Linear Bandit
We also study the low-rank generalized linear bandit setting. Our algorithm LowGLOC is similar to LowLOC, so we only present the differences and leave the detailed presentation of the algorithm (Algorithm 3) to the appendix (Section H).
We still use EW to perform online predictions, but instead of squared loss, we use negative log likelihood (NLL) loss to construct the forecaster in Equation (3), where is as defined in Section 3. Therefore, the performance of EW using NLL loss relative to a fixed linear predictor is measured by: . If there exists a nondecreasing sequence such that , we construct in the following way:
(4) 
where . Lemma 11 guarantees that the true parameter is contained in with high probability. Lemma 12 further guarantees that the overall regret of LowGLOC satisfies . Following the online-to-confidence-set conversion idea as used in LowLOC, we prove that in Lemma 13.
The next theorem states the regret of LowGLOC, which follows by plugging Lemma 13 into Lemma 12.
Theorem 3 (Regret of LowGLOC).
For , with probability at least , Algorithm 3 achieves regret:
To the best of our knowledge, this is the first regret bound for low-rank GLM bandits.
6 An Efficient Algorithm for the Linear Case
At every round, LowLOC and LowGLOC need to calculate exponentially weighted predictions, which involves calculating weights over the covering of low-rank matrices. These approaches have high computational complexity even though their regret is ideal. In this section, we propose a computationally efficient method, LowESTR (Algorithm 2), that also achieves regret under the following mild assumption on the action set.
Assumption 2.
This assumption is easily satisfied by many arm sets. To guarantee the existence of the above sampling distribution , we only need that the convex hull of a subset of arms contains a ball with radius , which does not scale with or . Simple examples for are the Euclidean unit ball and sphere.
We extend the two-stage procedure "Explore Subspace Then Refine (ESTR)" proposed by Jun et al. (2019). In stage 1, ESTR estimates the row and column subspaces of . In stage 2, ESTR transforms the original problem into a dimensional linear bandit problem and invokes the LowOFUL algorithm (Jun et al., 2019), which leverages the estimated row/column subspaces of .
(5) 
6.1 LowESTR
LowESTR also follows the two-stage framework of ESTR, but we use a different estimation method in stage 1.
Stage 1.
We are inspired by a line of work on low-rank matrix recovery using a nuclear-norm penalty with squared loss (Wainwright, 2019). The learner pulls arm according to distribution and observes the reward up to a horizon , then uses to solve the nuclear-norm penalized least squares problem in (5) and obtains an estimate for . Notably, instead of solving an NP-hard problem in stage 1 as ESTR does, the optimization problem (5) in LowESTR is convex and thus can be solved easily using standard gradient-based methods. Assumption 2 guarantees that in Theorem 15 (Section E). We get the estimated row/column subspaces of simply by running an SVD step.
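A minimal sketch of stage 1 is given below. It assumes the objective has the common form (1/(2n)) Σ (y_i − ⟨X_i, Θ⟩)² + λ‖Θ‖_nuc, which may differ from (5) in its normalization, and it solves it by proximal gradient descent with singular value soft-thresholding, one standard choice among gradient-based methods. The Gaussian exploration arms, the value of `lam`, and the function names are illustrative assumptions.

```python
import numpy as np

def svt(M, tau):
    """Singular value soft-thresholding: the proximal map of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def nuclear_norm_ls(X, y, lam, n_iter=500):
    """Proximal gradient for (1/(2n)) * sum_i (y_i - <X_i, Theta>)^2 + lam * ||Theta||_nuc."""
    n, d1, d2 = X.shape
    A = X.reshape(n, -1)                      # each arm flattened to a row
    step = n / np.linalg.norm(A, 2) ** 2      # 1 / Lipschitz constant of the gradient
    theta = np.zeros((d1, d2))
    for _ in range(n_iter):
        resid = A @ theta.ravel() - y
        grad = (A.T @ resid).reshape(d1, d2) / n
        theta = svt(theta - step * grad, step * lam)
    return theta

def top_subspaces(theta_hat, r):
    """Top-r left and right singular vectors of the estimate."""
    U, _, Vt = np.linalg.svd(theta_hat)
    return U[:, :r], Vt[:r].T

rng = np.random.default_rng(0)
d1, d2, r, n = 8, 8, 2, 400                   # hypothetical sizes
theta_true = rng.normal(size=(d1, r)) @ rng.normal(size=(r, d2))
theta_true /= np.linalg.norm(theta_true)
X = rng.normal(size=(n, d1, d2))              # assumed exploration distribution
y = np.einsum("nij,ij->n", X, theta_true) + 0.01 * rng.normal(size=n)

theta_hat = nuclear_norm_ls(X, y, lam=0.01)
U_hat, V_hat = top_subspaces(theta_hat, r)
rel_err = np.linalg.norm(theta_hat - theta_true) / np.linalg.norm(theta_true)
```

Each iteration costs one SVD of a matrix of the same size as the parameter, in contrast to the per-round cost of the covering-based forecaster.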
Stage 2.
In stage 2, we apply the LowOFUL algorithm (Algorithm 4 in Section H) proposed by Jun et al. (2019) in our setting. The key idea is to reduce the problem to a linear bandit and utilize the estimated subspaces in the standard linear bandit method OFUL (Abbasi-Yadkori et al., 2011).
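The reduction to a linear bandit can be sketched as a rotation followed by a reordered vectorization. The sketch below follows the general idea of the rotation in Jun et al. (2019) as we understand it, and is an assumption rather than a transcription of Algorithm 2: arms and parameter are rotated by the estimated subspaces and their orthogonal complements, and the block of coordinates outside the estimated subspaces is placed last, where the rotated parameter is expected to be small. Function names are hypothetical.

```python
import numpy as np

def orth_complement(U):
    """Orthonormal basis of the orthogonal complement of span(U)."""
    r = U.shape[1]
    Q, _ = np.linalg.qr(U, mode="complete")
    return Q[:, r:]

def rotate_and_vectorize(M, U, V):
    """Rotate M by [U, U_perp] and [V, V_perp]; put the (perp, perp) block last."""
    r = U.shape[1]
    R = np.hstack([U, orth_complement(U)]).T @ M @ np.hstack([V, orth_complement(V)])
    head = np.concatenate([R[:r, :].ravel(), R[r:, :r].ravel()])
    tail = R[r:, r:].ravel()
    return np.concatenate([head, tail])

rng = np.random.default_rng(3)
d1, d2, r = 6, 5, 2                            # hypothetical sizes
theta = rng.normal(size=(d1, r)) @ rng.normal(size=(r, d2))
U, _, Vt = np.linalg.svd(theta)
U, V = U[:, :r], Vt[:r].T                      # exact subspaces, for illustration
X = rng.normal(size=(d1, d2))

k = (d1 + d2) * r - r * r                      # number of leading coordinates
theta_vec = rotate_and_vectorize(theta, U, V)
x_vec = rotate_and_vectorize(X, U, V)
```

With exact subspaces the last coordinates of `theta_vec` vanish and the inner product is preserved; with estimated subspaces the tail is only small, which is the structure a method such as LowOFUL is designed to exploit.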
We now present the overall regret of Algorithm 2.
Theorem 4 (Regret of LowESTR for Low Rank Bandit).
Suppose we run LowESTR in stage 1 with and , and invoke LowOFUL in stage 2 with , , , , and the rotated arm sets defined in Algorithm 2. Then the overall regret of LowESTR is, with probability at least , .
We believe that this "Explore-Subspace-Then-Refine" framework can also be extended to the generalized linear setting. In stage 1, an M-estimator that minimizes the negative log-likelihood plus nuclear norm penalty (Fan et al., 2019) can be used instead, while in stage 2, one can revise a standard generalized linear bandit algorithm such as GLM-UCB (Filippi et al., 2010) by leveraging the low-rank knowledge in the same way as LowOFUL. We leave this extension for future work.
7 Lower Bound for Low-rank Linear Bandit
In this section, we discuss the regret lower bound of the low-rank linear bandit model. Suppose ; we first present a lower bound, which is a straightforward extension of the linear bandit lower bound (Lattimore and Szepesvári, 2018).
Theorem 5 (Lower Bound).
Assume and let . Then , where , , s.t. .
The above bound is tight when as it matches the standard dimensional linear bandit lower bound, but for small , our upper bound is larger than the lower bound by a factor of .
Nevertheless, we conjecture that is the correct lower bound for small . It is well-known that the regret lower bound for the sparse linear bandit problem (dimension , sparsity ) is (Lattimore and Szepesvári, 2018). Our problem can be viewed as a dimensional linear bandit problem with degrees of freedom in . Then, using the analogy of the degrees of freedom between sparse vectors and low-rank matrices, one can plug in for and for in the sparse linear bandit regret lower bound and obtain as our lower bound.
8 Experiments
In this section, we compare the performance of OFUL and LowESTR to validate that it is crucial to utilize the low-rank structure. We run our simulation with and . In both settings, the true is a diagonal matrix. For , we set while for , . For arms in both settings, we draw 256 vectors from and standardize them by dividing by their 2-norms, then we reshape all standardized dimensional vectors to matrices. We use these matrices as the arm set . For each arm , the reward is generated by , where . We run both algorithms for rounds and repeat each simulation setup 100 times to calculate the averaged regrets and their 1-sd confidence intervals at every step. We leave the hyperparameters of OFUL and LowESTR in the appendix (Section I). Regret comparison plots are displayed in Figure 1.
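The arm-set construction described above can be sketched as follows. The dimensions, the standard normal sampling distribution, the diagonal entries of the parameter, and the noise scale are all placeholders chosen for illustration, since the exact values are not recoverable from the text here; only the count of 256 arms and the normalize-then-reshape recipe come from the description.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, n_arms = 10, 10, 256       # d1 and d2 are hypothetical
r = 2                              # hypothetical rank

# Draw vectors, divide each by its 2-norm, then reshape into matrices.
vecs = rng.normal(size=(n_arms, d1 * d2))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
arms = vecs.reshape(n_arms, d1, d2)

# Diagonal parameter with r nonzero entries (values are placeholders).
theta = np.zeros((d1, d2))
theta[np.arange(r), np.arange(r)] = [0.8, 0.5]

def pull(arm_index, noise_sd=0.01):
    """Noisy reward <X, Theta> + noise; noise_sd is an assumed value."""
    return float(np.sum(arms[arm_index] * theta) + noise_sd * rng.normal())
```

Because each vector has unit 2-norm before reshaping, every arm matrix has unit Frobenius norm.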
Figure 1: Regret comparison between OFUL and LowESTR. We plot the averaged cumulative regret with red and blue curves, and the 1-standard-deviation band for each method within the yellow area.
We observe that in both plots, LowESTR incurs less regret than OFUL within several hundred time steps. Further, as we increase the rank from to , the regret gap between the two approaches becomes smaller. This phenomenon is consistent with our theory.
We also conduct simulations to see the sensitivity of LowESTR to . We observe that LowESTR indeed performs better for large , which again matches with our theory. The detailed description and the plot for this experiment are left to the appendix (Section I).
9 Conclusion & Future Work
In this paper, we studied the low-rank (generalized) linear bandit problem. We proposed the LowLOC and LowGLOC algorithms for the linear and generalized linear settings, respectively. Both of them enjoy regret. Further, our efficient algorithm LowESTR achieves regret under mild conditions on the action set. There are several interesting directions that we leave as future work:
1) We provided some preliminary ideas in Section 6 about how to extend LowESTR to the generalized linear setting. We expect that a similar regret bound can be achieved under certain regularity conditions over the link function. 2) We plan to investigate if one can design an efficient algorithm whose regret does not depend on . 3) As we have shown in Section 7, is our conjectured tight lower bound. It will be very interesting to formally prove this.
Acknowledgement
AT acknowledges the support of NSF CAREER grant IIS1452099 and an Adobe Data Science Research Award.
References
Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pp. 2312–2320.
Online-to-confidence-set conversions and application to sparse stochastic bandits. In Artificial Intelligence and Statistics, pp. 1–9.
Exploration–exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science 410 (19), pp. 1876–1902.
Lambertian reflectance and linear subspaces. IEEE transactions on pattern analysis and machine intelligence 25 (2), pp. 218–233.
Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Transactions on Information Theory 57 (4), pp. 2342–2359.
Exact matrix completion via convex optimization. Foundations of Computational mathematics 9 (6), pp. 717.
Prediction, learning, and games. Cambridge university press.
Generalized high-dimensional trace regression via nuclear norm regularization. Journal of econometrics 212 (1), pp. 177–202.
Parametric bandits: the generalized linear case. In Advances in Neural Information Processing Systems, pp. 586–594.
Low-rank bandits with latent mixtures. arXiv preprint arXiv:1609.01508.
Structured stochastic linear bandits. arXiv preprint arXiv:1606.05693.
Scalable generalized linear bandits: online computation and hashing. In Advances in Neural Information Processing Systems, pp. 99–109.
Bilinear bandits with low-rank structure. In International Conference on Machine Learning, pp. 3163–3172.
Bernoulli rank-1 bandits for click feedback. arXiv preprint arXiv:1703.06513.
Stochastic rank-1 bandits. In Artificial Intelligence and Statistics, pp. 392–401.
Matrix completion from noisy entries. Journal of Machine Learning Research 11 (Jul), pp. 2057–2078.
Stochastic low-rank bandits. arXiv preprint arXiv:1712.04644.
Asymptotically efficient adaptive allocation rules. Advances in applied mathematics 6 (1), pp. 4–22.
Stochastic linear bandits with hidden low rank structure. arXiv preprint arXiv:1901.09490.
Bandit algorithms. preprint.
High-dimensional regression with noisy and missing data: provable guarantees with nonconvexity. In Advances in Neural Information Processing Systems, pp. 2726–2734.
Ad click prediction: a view from the trenches. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1222–1230.
Predicting clicks: estimating the click-through rate for new ads. In Proceedings of the 16th international conference on World Wide Web, pp. 521–530.
Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning 1 (1–2), pp. 1–305.
High-dimensional statistics: a non-asymptotic viewpoint. Vol. 48, Cambridge University Press.
Appendix A Proof for Theorem 1
Lemma 6 (Covering number for low-rank matrices, modified from (Candes and Plan, 2011)).
Let . Then there exists an net for the Frobenius norm obeying
(6) 
Proof.
Use SVD decomposition: of any obeying . We will construct an net for by covering the set of permissible and . Let be the set of diagonal matrices with nonnegative diagonal entries and Frobenius norm less than or equal to one. We take to be an net for with . Next, let . To cover , we use the norm defined as
(7) 
where denotes the th column of . Let . It is easy to see that since the columns of an orthogonal matrix are unit normed. We see that there is an net for obeying . Similarly, let . Defining , we have . By the same argument, there is an net for obeying . We now let , and remark that . It remains to show that for all , there exists with .
Fix and decompose it as . Then there exists with , , satisfying , and . This gives
(8)  
(9)  
(10) 
For the first term, since is an orthogonal matrix,
(11)  
(12) 
Thus we have shown ; by the same argument, we also have . For the second term, . This completes the proof. ∎
Lemma 7 (Online-to-Confidence-Set Conversion (adapted from Theorem 1 in Abbasi-Yadkori et al. (2012))).
Suppose we feed into an online prediction algorithm which, for all , admits a regret . Let be the prediction at time step by the online learner. Then, for any , with probability at least , we have
(13) 