Low-rank models are widely used in various applications, such as matrix completion and computer vision (Candès and Recht, 2009; Basri and Jacobs, 2003). We study low-rank (generalized) linear models in the bandit setting (Lai and Robbins, 1985). During the learning process, the agent adaptively pulls an arm (denoted $X_t$) from a set of arms based on past experience. At each pull, the agent observes a noisy reward corresponding to the arm pulled. Let $\Theta \in \mathbb{R}^{d_1 \times d_2}$ be an unknown low-rank matrix with rank $r \ll \min\{d_1, d_2\}$. The learner's goal is to maximize the total reward $\sum_{t=1}^{T} \mu(\langle X_t, \Theta \rangle)$, where $T$ is the time horizon, $X_t$ is the action pulled at time $t$ that belongs to a pre-specified action set $\mathcal{X}$, and $\mu$ denotes a link function. Note that in the standard linear case the link function is the identity.
Many practical applications can be framed in this low-rank bandit model. For travel websites, the recommendation system needs to choose a flight-hotel bundle for the customer that can achieve high revenue. Often one has features of dimension $d_1$ for flights and features of dimension $d_2$ for hotels. It is natural to form a matrix feature (e.g., via an outer product) for each pair, or simply to combine the two features row/column-wise when the dimensions allow. One can model the appeal of a bundle by a (generalized) linear function of the matrix feature. In online advertising with image recommendation, the advertiser selects an image to display and the goal is to achieve the maximum click-through rate. The image is often stored as a matrix, and one can use a generalized linear model (GLM) with the link function being the logistic function to model the click rate (Richardson et al., 2007; McMahan et al., 2013). In all of these applications, one puts some capacity control on the underlying matrix of linear coefficients, and a natural condition is being low-rank. We note that examples such as online dating and online shopping discussed in Jun et al. (2019) can also be formulated in our model.
In this paper, we measure the quality of an algorithm in terms of its cumulative regret (see Section 3 for the definition). A naive approach is to ignore the low-rank structure and directly apply standard (generalized) linear bandit algorithms (Abbasi-Yadkori et al., 2011; Filippi et al., 2010). These approaches suffer $\tilde{O}(d_1 d_2 \sqrt{T})$ regret, where $\tilde{O}$ omits poly-logarithmic factors. However, in practice, $d_1 d_2$ can be huge. A natural question is then:
Can we utilize the low-rank structure of $\Theta$ to achieve regret that scales better than $\tilde{O}(d_1 d_2 \sqrt{T})$?
Jun et al. (2019) studied a subclass of our problem where the actions are rank-one matrices. They proposed an algorithm that achieves $\tilde{O}\big((d_1+d_2)^{3/2}\sqrt{rT}\big)$ regret under additional incoherence and singular value assumptions on an augmented matrix, defined via the arm set and $\Theta$, as well as a singular value assumption on $\Theta$. They also provided strong evidence that their bound is unimprovable.
We summarize our contributions below.
We propose the Low-Rank Linear Bandit with Online Computation algorithm (LowLOC) for the low-rank linear bandit problem, which achieves $\tilde{O}\big(\sqrt{d_1 d_2 (d_1+d_2) r T}\big)$ regret. Notably, compared with the result in Jun et al. (2019), our result 1) applies to more general action sets, which can contain high-rank matrices, and 2) does not require the incoherence and bounded-eigenvalue assumptions on the augmented matrix mentioned in the previous paragraph. Our regret bound also matches their conjectured lower bound. For LowLOC, we first design a novel online predictor, which runs an exponentially weighted average forecaster on a covering of low-rank matrices, to solve the online low-rank linear prediction problem with $\tilde{O}(r(d_1+d_2))$ regret. We then plug our online predictor into the online-to-confidence-set conversion framework proposed by Abbasi-Yadkori et al. (2012) to construct a confidence set for $\Theta$ in our bandit setting, and at every round we choose the action optimistically.
We further propose the Low-Rank Generalized Linear Bandit with Online Computation algorithm (LowGLOC) for the generalized linear setting, which achieves the same $\tilde{O}\big(\sqrt{d_1 d_2 (d_1+d_2) r T}\big)$ regret. LowGLOC is similar to LowLOC, but here we need to design a new online-to-confidence-set conversion method, which may be of independent interest.
LowLOC and LowGLOC enjoy good regret but are unfortunately not efficiently implementable. To overcome this issue, we provide an efficient algorithm, Low-Rank-Explore-Subspace-Then-Refine (LowESTR), for the linear setting, inspired by the ESTR algorithm proposed by Jun et al. (2019). We show that under a mild assumption on the action set $\mathcal{X}$, LowESTR achieves $\tilde{O}\big(\sqrt{d_1 d_2 (d_1+d_2) r T}/\omega_r\big)$ regret, where $\omega_r$ is a lower bound for the $r$-th singular value of $\Theta$. Compared with ESTR, LowESTR does not need the incoherence and eigenvalue assumptions on the augmented matrix, although the assumptions on the action set of the two algorithms are different. We also provide empirical evaluations to demonstrate the effectiveness of LowESTR.
2 Related Work
Our work is inspired by Jun et al. (2019), where the reward is modeled as $x_t^\top \Theta z_t$; here $x_t \in \mathcal{X}_1$ is a left arm and $z_t \in \mathcal{X}_2$ is a right arm ($\mathcal{X}_1 \subseteq \mathbb{R}^{d_1}$ and $\mathcal{X}_2 \subseteq \mathbb{R}^{d_2}$ are the left and right arm sets, respectively). Note this model is a special case of our low-rank linear bandit model, because one can write $x^\top \Theta z = \langle x z^\top, \Theta \rangle$ and define the arm set as $\{x z^\top : x \in \mathcal{X}_1, z \in \mathcal{X}_2\}$, which contains only rank-one matrices. Their ESTR algorithm enjoys a $\tilde{O}\big((d_1+d_2)^{3/2}\sqrt{rT}\big)$ regret bound (omitting problem-dependent factors) under the assumptions that: 1) an augmented matrix is incoherent (Keshavan et al., 2010) and has a finite condition number, where the augmented matrix is built from a subset of left arms chosen from $\mathcal{X}_1$ and a subset of right arms chosen from $\mathcal{X}_2$ that maximize certain spectral criteria, and 2) the corresponding spectral quantities are upper bounded by a constant. Their algorithm requires explicitly finding these arm subsets, which is in general NP-hard, even though they also proposed heuristics to speed up this step. Compared with ESTR, our LowLOC and LowGLOC algorithms are also not computationally efficient, but they both apply to richer action sets (matrices of any rank) without assumptions on such augmented matrices, and their regret bounds do not depend on the $r$-th singular value of $\Theta$. Our LowESTR algorithm is computationally efficient if the action set admits a nice exploration distribution (see details in Section 6). LowESTR achieves a comparable regret bound, and it does not require assumptions on augmented matrices either.
Katariya et al. (2017b) and Kveton et al. (2017) also studied rank-1 and low-rank bandit problems. They assume there is an underlying expected reward matrix; at each time step, the learner picks an entry position of the matrix and receives a noisy reward for that entry. This can be viewed as a special case of the bilinear bandit with one-hot vectors as left and right arms. Katariya et al. (2017b) is further extended by Katariya et al. (2017a), which uses KL-based confidence intervals to achieve a tighter regret bound. Our problem is more general compared to these works. Johnson et al. (2016) considered the same setting as ours, but their method relies on the knowledge of many parameters that depend on the unknown $\Theta$, and in particular only works for continuous arm sets.
There are other works that utilize low-rank structure in different model settings. For example, Gopalan et al. (2016) studied low-rank bandits with latent structures using the robust tensor power method. Lale et al. (2019) imposed low-rank assumptions on the feature vectors to reduce the effective dimension. These works all utilize the low-rank structure to achieve better regret bounds than standard approaches that do not take the low-rank structure into account.
3 Preliminaries
We formally define the problem and review relevant background in this section.
3.1 Low-rank Linear Bandit
Let $\mathcal{X} \subseteq \mathbb{R}^{d_1 \times d_2}$ be the arm space. In each round $t$, the learner chooses an arm $X_t \in \mathcal{X}$ and observes a noisy reward of a linear form: $y_t = \langle X_t, \Theta \rangle + \eta_t$, where $\Theta \in \mathbb{R}^{d_1 \times d_2}$ is an unknown parameter and $\eta_t$ is an $R$-sub-Gaussian random variable. Denote the rank of $\Theta$ by $r$; we assume $r \ll \min\{d_1, d_2\}$. We assume the $r$-th singular value of $\Theta$ is lower bounded by a known constant. We use $\langle A, B \rangle = \operatorname{tr}(A^\top B)$ to denote the inner product between matrices $A$ and $B$. We follow the standard assumptions in linear bandits: $\|\Theta\|_F \le 1$ and $\|X\|_F \le 1$ for all $X \in \mathcal{X}$.
In our bandit problem, the goal of the learner is to maximize the total reward $\sum_{t=1}^{T} \langle X_t, \Theta \rangle$, where $T$ is the time horizon. Clearly, with knowledge of the unknown parameter $\Theta$, one should always select an action $X^* \in \arg\max_{X \in \mathcal{X}} \langle X, \Theta \rangle$. It is natural to evaluate the learner relative to this optimal strategy. The difference between the learner's total reward and the total reward of the optimal strategy is called the pseudo-regret (Audibert et al., 2009): $R_T = \sum_{t=1}^{T} \langle X^* - X_t, \Theta \rangle$. For simplicity, we use the word regret instead of pseudo-regret for $R_T$.
3.2 Generalized Low-rank Linear Bandit
We also study the generalized linear bandit model of the following form: $y_t = \mu(\langle X_t, \Theta \rangle) + \eta_t$, where $\mu$ is a link function. This framework builds on the well-known Generalized Linear Models (GLMs) and has been widely studied in many applications. For example, when rewards are binary-valued, a natural link function is the logistic function $\mu(x) = 1/(1+e^{-x})$. For the generalized setting, we assume the reward $y_t$ given the action $X_t$ follows an exponential family distribution:
$p(y_t \mid X_t) = \exp\Big(\tfrac{y_t \langle X_t, \Theta \rangle - m(\langle X_t, \Theta \rangle)}{g(\tau)} + h(y_t, \tau)\Big), \qquad (1)$
where $\tau$ is a known scale parameter and $m$ and $h$ are some known functions. From a basic calculation we get $\mu = m'$. We assume the above exponential family is a minimal representation; then $m$ is ensured to be strictly convex (Wainwright and Jordan, 2008), and thus the negative log-likelihood (NLL) loss is also strictly convex.
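To see why the conditional mean satisfies $\mu = m'$ (writing $m$ for the log-partition-type function, $g(\tau)$ for the scale, and $\theta = \langle X_t, \Theta \rangle$, as above), one can differentiate the normalization condition of the density with respect to $\theta$:

```latex
\int \exp\!\left(\frac{y\theta - m(\theta)}{g(\tau)} + h(y,\tau)\right)\! dy = 1
\;\Longrightarrow\;
\int \frac{y - m'(\theta)}{g(\tau)}\, p(y \mid \theta)\, dy = 0
\;\Longrightarrow\;
\mathbb{E}[y_t \mid X_t] = m'(\theta).
```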
We make the following standard assumption on the link function (Jun et al., 2017).
There exist constants $k_\mu, c_\mu > 0$ such that the link function $\mu$ is $k_\mu$-Lipschitz on $[-1, 1]$, continuously differentiable on $[-1, 1]$, and $\inf_{x \in [-1,1]} \mu'(x) = c_\mu > 0$.
One can write the above reward model (1) in an equivalent way: $y_t = \mu(\langle X_t, \Theta \rangle) + \eta_t$, where $\eta_t$ is conditionally $R$-sub-Gaussian given the history up to round $t$. Using the form of the density, a Taylor expansion, and the strict convexity of $m$, one can bound $R$ in terms of $g(\tau)$ and the derivative of $\mu$, by the definition of the sub-Gaussian constant. An optimal arm is $X^* \in \arg\max_{X \in \mathcal{X}} \mu(\langle X, \Theta \rangle)$. The performance of an algorithm is again evaluated by the cumulative regret: $R_T = \sum_{t=1}^{T} \big(\mu(\langle X^*, \Theta \rangle) - \mu(\langle X_t, \Theta \rangle)\big)$.
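As a concrete illustration of this reward model, the sketch below draws binary rewards through the logistic link and evaluates the NLL loss; the sizes, the random construction of $\Theta$, and all function names are hypothetical, chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, r = 8, 6, 2                    # hypothetical problem sizes

# A rank-r parameter matrix with ||Theta||_F = 1, as assumed in Section 3.
Theta = rng.normal(size=(d1, r)) @ rng.normal(size=(r, d2))
Theta /= np.linalg.norm(Theta)

def mu(z):
    """Logistic link function."""
    return 1.0 / (1.0 + np.exp(-z))

def reward(X, rng):
    """Binary reward with success probability mu(<X, Theta>)."""
    return rng.binomial(1, mu(np.sum(X * Theta)))

def nll(Theta_hat, Xs, ys):
    """Negative log-likelihood of binary rewards under a candidate Theta_hat."""
    ps = mu(np.array([np.sum(X * Theta_hat) for X in Xs]))
    return float(-np.sum(ys * np.log(ps) + (1 - ys) * np.log(1 - ps)))
```

Since $\|X\|_F \le 1$ and $\|\Theta\|_F \le 1$, the argument of the link stays in $[-1, 1]$, so the probabilities above are bounded away from 0 and 1.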
We use $O$ and $\Omega$ for the standard Big-O and Big-Omega notations, and $\tilde{O}$ and $\tilde{\Omega}$ for their analogues that ignore poly-logarithmic factors. We write $a \asymp b$ to mean that $a$ and $b$ are of the same order up to poly-logarithmic factors.
4 Low-rank Linear Bandit with Online Computation
We first present our algorithm, LowLOC (Algorithm 1) for low-rank linear bandit problems.
Theorem 1 (Regret of LowLOC (Algorithm 1)).
For any $\delta \in (0,1)$, with probability at least $1-\delta$, Algorithm 1 achieves regret $R_T = \tilde{O}\big(\sqrt{d_1 d_2 (d_1+d_2) r T}\big)$.
Note that LowLOC achieves the desired goal of outperforming the standard linear bandit approach and its $\tilde{O}(d_1 d_2 \sqrt{T})$ regret. Furthermore, this bound does not depend on any other problem-dependent parameters, such as the least singular value of $\Theta$, and does not require any of the additional assumptions that appear in Jun et al. (2019). In the following sections, we explain the details of our algorithm design choices.
4.1 OFU and Online-to-confidence-set Conversion
This algorithm follows the standard Optimism in the Face of Uncertainty (OFU) principle. We maintain a confidence set $\mathcal{C}_t$ at every round that contains the true parameter $\Theta$ with high probability, and we choose the action optimistically: $X_t = \arg\max_{X \in \mathcal{X}} \max_{\Theta' \in \mathcal{C}_t} \langle X, \Theta' \rangle$.
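The optimistic choice can be sketched for the simple case where both the arm set and the confidence set are finite lists of matrices (e.g., the confidence set intersected with the covering of Section 4.2); this is an illustrative sketch, not the paper's implementation:

```python
import numpy as np

def ofu_action(arms, candidates):
    """Return the arm maximizing the optimistic value
    max over Theta' in `candidates` of <X, Theta'>, over X in `arms`."""
    best_val, best_arm = -np.inf, None
    for X in arms:
        val = max(np.sum(X * Th) for Th in candidates)  # optimistic value of X
        if val > best_val:
            best_val, best_arm = val, X
    return best_arm
```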
Typically, the faster $\mathcal{C}_t$ shrinks, the lower regret we incur. The main difficulty is to construct a $\mathcal{C}_t$ that leverages the low-rank structure so that we obtain the desired regret. Our starting point is the online-to-confidence-set conversion framework proposed by Abbasi-Yadkori et al. (2012), which builds the confidence set from an online predictor. At each round, an online predictor receives $X_t$, predicts $\hat{y}_t$ based on $X_t$ and the historical data $(X_1, y_1), \ldots, (X_{t-1}, y_{t-1})$, observes the true value $y_t$, and suffers the loss $(y_t - \hat{y}_t)^2$. The performance of this online predictor is measured by comparing its cumulative loss to the cumulative loss of a fixed linear predictor using coefficient matrix $\Theta'$: $\rho_t(\Theta') = \sum_{s=1}^{t} (y_s - \hat{y}_s)^2 - \sum_{s=1}^{t} (y_s - \langle X_s, \Theta' \rangle)^2$.
The key idea of online-to-confidence-set conversion (adapted to our low-rank setting) is that if one can guarantee $\rho_t(\Theta) \le B_t$ for some non-decreasing sequence $\{B_t\}$, we can construct the confidence set for $\Theta$ as:
$\mathcal{C}_t = \Big\{\Theta' : \sum_{s=1}^{t-1} \big(\hat{y}_s - \langle X_s, \Theta' \rangle\big)^2 \le 1 + 2B_{t-1} + 32 R^2 \log\big(\tfrac{R\sqrt{8} + \sqrt{1+B_{t-1}}}{\delta}\big)\Big\}.$
Therefore, the problem of achieving the desired regret bound reduces to designing an online predictor which guarantees $\rho_t(\Theta) \le B_t$ and $B_t = \tilde{O}(r(d_1+d_2))$. To achieve this rate, the key is to leverage the low-rank structure.
4.2 Online Low Rank Linear Prediction
We adopt the classical exponentially weighted average forecaster (EW) framework (Cesa-Bianchi and Lugosi, 2006), which uses $N$ experts to predict with the following formula:
$\hat{y}_t = \frac{\sum_{i=1}^{N} \exp(-\eta L_{i,t-1}) f_{i,t}}{\sum_{i=1}^{N} \exp(-\eta L_{i,t-1})}. \qquad (3)$
In the above, $f_{i,t}$ denotes the prediction that the $i$-th expert makes at time $t$, $L_{i,t-1}$ is the cumulative loss incurred by expert $i$ up to time $t-1$, and $\eta > 0$ is a tuning parameter. By choosing $\eta$ carefully, one can guarantee that this predictor achieves $O(\log N)$ regret compared with the best expert in the expert set.
In our setting, an expert can be viewed as a matrix $\Theta'$ satisfying $\mathrm{rank}(\Theta') \le r$ and $\|\Theta'\|_F \le 1$ that makes the prediction $\langle X_t, \Theta' \rangle$. There are infinitely many such experts, so we cannot directly use EW, which requires a finite number of experts. Our main idea is to construct a finite set of experts that is small enough to keep $\log N$ small, yet represents the original expert set well, and then to apply EW to these experts. We construct an $\epsilon$-net $\bar{S}_r$ of the set of such matrices, i.e., for any feasible $\Theta'$, there exists a $\bar{\Theta} \in \bar{S}_r$ such that $\|\Theta' - \bar{\Theta}\|_F \le \epsilon$. We further show in Lemma 6 that $|\bar{S}_r| \le (9/\epsilon)^{(d_1+d_2+1)r}$, so with an appropriately small polynomial-in-$1/T$ choice of $\epsilon$, the number of experts in Equation 3 satisfies $\log N = \tilde{O}(r(d_1+d_2))$.
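A minimal sketch of the resulting forecaster, with the expert set standing in for the $\epsilon$-net (here just a short hypothetical list of matrices):

```python
import numpy as np

class EWForecaster:
    """Exponentially weighted average forecaster over a finite set of
    matrix experts; expert Theta' predicts <X_t, Theta'> at each round."""

    def __init__(self, experts, eta):
        self.experts = experts                 # e.g. an epsilon-net of low-rank matrices
        self.eta = eta                         # learning-rate / temperature parameter
        self.losses = np.zeros(len(experts))   # cumulative squared losses L_{i,t-1}

    def _expert_preds(self, X):
        return np.array([np.sum(X * Th) for Th in self.experts])

    def predict(self, X):
        # Weighted average of expert predictions with weights exp(-eta * L_i);
        # subtracting the minimum loss keeps the exponentials numerically stable.
        w = np.exp(-self.eta * (self.losses - self.losses.min()))
        return float(w @ self._expert_preds(X) / w.sum())

    def update(self, X, y):
        # Each expert suffers its own squared loss on the revealed reward.
        self.losses += (y - self._expert_preds(X)) ** 2
```

After a few rounds, experts far from the truth accumulate loss and their weights decay exponentially, so the forecast concentrates on the best net point.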
The following lemma summarizes the performance of this online predictor.
Lemma 2 (Regret of EW under Squared Loss).
Let $\eta$ be chosen appropriately in the EW forecaster (3). Then, for any $\delta \in (0,1)$, with probability at least $1-\delta$, we have $\rho_t(\Theta) = \tilde{O}\big(r(d_1+d_2)\big)$ for all $t \le T$.
5 Low-rank Generalized Linear Bandit
We also study the low-rank generalized linear bandit setting. Our algorithm LowGLOC is similar to LowLOC, so we only present the differences and leave the detailed presentation of the algorithm (Algorithm 3) to the appendix (Section H).
We still use EW to perform online prediction, but instead of the squared loss, we use the negative log-likelihood (NLL) loss to construct the forecaster in Equation (3), where the NLL loss is as defined in Section 3. Accordingly, the performance of EW using the NLL loss relative to a fixed linear predictor $\Theta'$ is measured by the cumulative NLL regret $\rho^{\mathrm{NLL}}_t(\Theta')$. If there exists a non-decreasing sequence $\{B_t\}$ such that $\rho^{\mathrm{NLL}}_t(\Theta) \le B_t$, we construct $\mathcal{C}_t$ as the set of parameters whose squared prediction errors on the past data are bounded by a radius $\beta_t(\delta)$ determined by $B_t$. Lemma 11 guarantees that the true parameter $\Theta$ is contained in $\mathcal{C}_t$ with high probability. Lemma 12 further bounds the overall regret of LowGLOC in terms of $\{B_t\}$. Following the online-to-confidence-set conversion idea used in LowLOC, we prove in Lemma 13 that $B_T = \tilde{O}(r(d_1+d_2))$.
Theorem 3 (Regret of LowGLOC).
For any $\delta \in (0,1)$, with probability at least $1-\delta$, Algorithm 3 achieves regret $R_T = \tilde{O}\big(\sqrt{d_1 d_2 (d_1+d_2) r T}\big)$.
To the best of our knowledge, this is the first regret bound for low-rank GLM bandits.
6 An Efficient Algorithm for the Linear Case
At every round, LowLOC and LowGLOC need to calculate exponentially weighted predictions, which involves computing weights over the covering of low-rank matrices. These approaches have high computational complexity even though their regret is ideal. In this section, we propose a computationally efficient method, LowESTR (Algorithm 2), that also achieves low regret under a mild assumption on the action set, stated below.
This assumption is easily satisfied by many arm sets. To guarantee the existence of the above sampling distribution, we only need the convex hull of a subset of arms to contain a Frobenius-norm ball whose radius does not scale with $d_1$ or $d_2$. Simple examples of such $\mathcal{X}$ are the Euclidean unit ball/sphere.
We extend the two-stage procedure "Explore Subspace Then Refine" (ESTR) proposed by Jun et al. (2019). In stage 1, ESTR estimates the row and column subspaces of $\Theta$. In stage 2, ESTR transforms the original problem into a $d_1 d_2$-dimensional linear bandit problem and invokes the LowOFUL algorithm (Jun et al., 2019), which leverages the estimated row/column subspaces of $\Theta$.
LowESTR also proceeds with the same two-stage framework as ESTR, but we use a different estimation method in stage 1.
We are inspired by a line of work on low-rank matrix recovery using the nuclear-norm penalty with squared loss (Wainwright, 2019). The learner pulls arms according to the sampling distribution and observes the rewards up to a horizon $T_1$, then uses the collected data to solve the nuclear-norm penalized least-squares problem in (5), which yields an estimate $\hat{\Theta}$ of $\Theta$. Notably, instead of invoking an NP-hard problem in stage 1 as ESTR does, the optimization problem (5) in LowESTR is convex and thus can be solved efficiently using standard gradient-based methods. Assumption 2 guarantees that $\hat{\Theta}$ is close to $\Theta$ in Frobenius norm, as shown in Theorem 15 (Section E). We obtain the estimated row/column subspaces of $\Theta$ simply by running an SVD step on $\hat{\Theta}$.
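A sketch of stage 1 under these assumptions: proximal gradient descent on the penalized least-squares objective, where each proximal step is singular value thresholding (the prox operator of the nuclear norm). The step size, penalty level, and iteration count below are illustrative choices, not the paper's settings.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: the prox operator of tau * ||.||_*."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def nuclear_ls(Xs, ys, lam, step=0.05, iters=500):
    """Proximal gradient descent for
       min_Theta (1/n) * sum_i (y_i - <X_i, Theta>)^2 + lam * ||Theta||_*."""
    n = len(ys)
    Theta = np.zeros_like(Xs[0])
    for _ in range(iters):
        resid = np.array([np.sum(X * Theta) for X in Xs]) - ys
        grad = (2.0 / n) * sum(r_i * X for r_i, X in zip(resid, Xs))
        Theta = svt(Theta - step * grad, step * lam)  # gradient step, then prox
    return Theta
```

The estimated row/column subspaces are then read off from the top-$r$ singular vectors of the returned estimate.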
In stage 2, we apply LowOFUL algorithm (Algorithm 4 in Section H) proposed by Jun et al. (2019) in our setting. The key idea is reducing the problem to linear bandit and utilizing the estimated subspaces in the standard linear bandit method OFUL (Abbasi-Yadkori et al., 2011).
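The subspace extraction and arm rotation used by this reduction can be sketched as follows (the function names are ours, not from the paper); stage 2 then runs OFUL-style updates on the rotated arms, penalizing the coordinates outside the estimated subspaces:

```python
import numpy as np

def subspaces(Theta_hat, r):
    """Top-r left/right singular subspaces of the stage-1 estimate,
    together with their orthogonal complements."""
    U, _, Vt = np.linalg.svd(Theta_hat)
    return U[:, :r], U[:, r:], Vt[:r, :].T, Vt[r:, :].T

def rotate_arm(X, U_hat, U_perp, V_hat, V_perp):
    """Express an arm in the coordinates of the estimated subspaces."""
    U_full = np.hstack([U_hat, U_perp])
    V_full = np.hstack([V_hat, V_perp])
    return U_full.T @ X @ V_full
```

When the estimate is accurate, most of the signal in a rotated arm lies in its top-left $r \times r$ block.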
We now present the overall regret of Algorithm 2.
Theorem 4 (Regret of LowESTR for Low Rank Bandit).
Suppose we run stage 1 of LowESTR with a suitably chosen exploration length $T_1$ and penalty parameter, and invoke LowOFUL in stage 2 with the parameters specified in Algorithm 2 and the rotated arm sets defined there. Then, with probability at least $1-\delta$, the overall regret of LowESTR is $\tilde{O}\big(\sqrt{d_1 d_2 (d_1+d_2) r T}/\omega_r\big)$, where $\omega_r$ lower bounds the $r$-th singular value of $\Theta$.
We believe that this "Explore-Subspace-Then-Refine" framework can also be extended to the generalized linear setting. In stage 1, an M-estimator that minimizes the negative log-likelihood plus a nuclear norm penalty (Fan et al., 2019) can be used instead, while in stage 2, one can revise a standard generalized linear bandit algorithm such as GLM-UCB (Filippi et al., 2010) by leveraging the low-rank knowledge in the same way as LowOFUL. We leave this extension for future work.
7 Lower Bound for Low-rank Linear Bandit
In this section, we discuss the regret lower bound for the low-rank linear bandit model. We first present a lower bound that is a straightforward extension of the linear bandit lower bound (Lattimore and Szepesvári, 2018).
Theorem 5 (Lower Bound).
Assume $r(d_1+d_2) \le 2T$ and let $\mathcal{X} = \{X \in \mathbb{R}^{d_1 \times d_2} : \|X\|_F \le 1\}$. Then $\inf_{\pi} \sup_{\Theta} \mathbb{E}[R_T] = \Omega\big(r(d_1+d_2)\sqrt{T}\big)$, where the infimum is over all policies $\pi$ and the supremum is over all $\Theta$ with $\mathrm{rank}(\Theta) \le r$ s.t. $\|\Theta\|_F \le 1$.
The above bound is tight when $r = \Theta(\min\{d_1, d_2\})$, as it then matches the standard $d_1 d_2$-dimensional linear bandit lower bound, but for small $r$, our upper bound exceeds the lower bound by a factor of $\sqrt{d_1 d_2 / (r(d_1+d_2))}$.
Nevertheless, we conjecture that $\Omega\big(\sqrt{r(d_1+d_2) d_1 d_2 T}\big)$ is the correct lower bound for small $r$. It is well known that the regret lower bound for the sparse linear bandit problem (dimension $p$, sparsity $s$) is $\Omega(\sqrt{s p T})$ (Lattimore and Szepesvári, 2018). Our problem can be viewed as a $d_1 d_2$-dimensional linear bandit problem with $r(d_1+d_2)$ degrees of freedom in $\Theta$. Then, using the analogy of degrees of freedom between sparse vectors and low-rank matrices, one can plug in $d_1 d_2$ for $p$ and $r(d_1+d_2)$ for $s$ in the sparse linear bandit regret lower bound and obtain $\Omega\big(\sqrt{r(d_1+d_2) d_1 d_2 T}\big)$ as our conjectured lower bound.
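Making this substitution explicit (with $p$ the ambient dimension and $s$ the sparsity in the sparse linear bandit lower bound):

```latex
\Omega\big(\sqrt{s\,p\,T}\big)
\quad\text{with}\quad
p = d_1 d_2, \qquad s = r(d_1+d_2)
\quad\Longrightarrow\quad
\Omega\big(\sqrt{r\,(d_1+d_2)\,d_1 d_2\,T}\big).
```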
8 Experiments
In this section, we compare the performance of OFUL and LowESTR to validate that it is crucial to utilize the low-rank structure. We run our simulations with rank $r = 1$ and $r = 2$. In both settings, the true $\Theta$ is a diagonal matrix with $r$ nonzero entries. For the arms in both settings, we draw 256 vectors from a standard Gaussian distribution and standardize them by dividing by their 2-norms; we then reshape all standardized $d_1 d_2$-dimensional vectors into $d_1 \times d_2$ matrices and use these matrices as the arm set $\mathcal{X}$. For each arm $X$, the reward is generated by $y = \langle X, \Theta \rangle + \eta$, where $\eta$ is zero-mean Gaussian noise. We run both algorithms for $T$ rounds and repeat each simulation setup 100 times to compute the averaged regrets and their 1-sd confidence intervals at every step. The hyper-parameters of OFUL and LowESTR are given in the appendix (Section I). Regret comparison plots are displayed in Figure 1.
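The arm-set construction described above can be sketched as follows; the dimension $d = 8$, the noise level, and the exact diagonal entries of $\Theta$ are illustrative assumptions, not the paper's settings (the number of arms, 256, is as described):

```python
import numpy as np

rng = np.random.default_rng(1)
d, K, sigma = 8, 256, 0.1              # d and sigma are hypothetical choices

# Diagonal rank-2 parameter with ||Theta||_F = 1 (the r = 2 setting).
Theta = np.zeros((d, d))
Theta[0, 0] = Theta[1, 1] = 1.0 / np.sqrt(2.0)

# Draw Gaussian vectors, normalize to unit 2-norm, reshape into matrices.
arms = rng.normal(size=(K, d * d))
arms /= np.linalg.norm(arms, axis=1, keepdims=True)
arms = arms.reshape(K, d, d)

def pull(i, rng):
    """Noisy linear reward of arm i."""
    return float(np.sum(arms[i] * Theta) + sigma * rng.normal())
```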
Figure 1: Regret comparison between OFUL and LowESTR. We plot the averaged cumulative regrets as red and blue curves, with the 1-standard-deviation band for each method shaded in yellow.
We observe that in both plots, LowESTR incurs less regret than OFUL within several hundred time steps. Further, as we increase the rank from $r = 1$ to $r = 2$, the regret gap between the two approaches becomes smaller. This phenomenon is consistent with our theory.
We also conduct simulations to examine the sensitivity of LowESTR to the $r$-th singular value of $\Theta$. We observe that LowESTR indeed performs better when this singular value is large, which again matches our theory. The detailed description and the plot for this experiment are left to the appendix (Section I).
9 Conclusion & Future Work
In this paper, we studied the low-rank (generalized) linear bandit problem. We proposed the LowLOC and LowGLOC algorithms for the linear and generalized linear settings, respectively; both enjoy $\tilde{O}\big(\sqrt{d_1 d_2 (d_1+d_2) r T}\big)$ regret. Further, our efficient algorithm LowESTR achieves comparable regret, with an additional dependence on the $r$-th singular value of $\Theta$, under mild conditions on the action set. There are several interesting directions that we leave as future work:
1) We provided some preliminary ideas in Section 6 about how to extend LowESTR to the generalized linear setting. We expect that a similar regret bound can be achieved under certain regularity conditions on the link function. 2) We plan to investigate whether one can design an efficient algorithm whose regret does not depend on the $r$-th singular value of $\Theta$. 3) As we argued in Section 7, $\Omega\big(\sqrt{r(d_1+d_2) d_1 d_2 T}\big)$ is our conjectured tight lower bound. It would be very interesting to formally prove this.
AT acknowledges the support of NSF CAREER grant IIS-1452099 and an Adobe Data Science Research Award.
- Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pp. 2312–2320.
- Online-to-confidence-set conversions and application to sparse stochastic bandits. In Artificial Intelligence and Statistics, pp. 1–9.
- Exploration–exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science 410 (19), pp. 1876–1902.
- Lambertian reflectance and linear subspaces. IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (2), pp. 218–233.
- Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Transactions on Information Theory 57 (4), pp. 2342–2359.
- Exact matrix completion via convex optimization. Foundations of Computational Mathematics 9 (6), pp. 717.
- Prediction, Learning, and Games. Cambridge University Press.
- Generalized high-dimensional trace regression via nuclear norm regularization. Journal of Econometrics 212 (1), pp. 177–202.
- Parametric bandits: the generalized linear case. In Advances in Neural Information Processing Systems, pp. 586–594.
- Low-rank bandits with latent mixtures. arXiv preprint arXiv:1609.01508.
- Structured stochastic linear bandits. arXiv preprint arXiv:1606.05693.
- Scalable generalized linear bandits: online computation and hashing. In Advances in Neural Information Processing Systems, pp. 99–109.
- Bilinear bandits with low-rank structure. In International Conference on Machine Learning, pp. 3163–3172.
- Bernoulli rank-1 bandits for click feedback. arXiv preprint arXiv:1703.06513.
- Stochastic rank-1 bandits. In Artificial Intelligence and Statistics, pp. 392–401.
- Matrix completion from noisy entries. Journal of Machine Learning Research 11, pp. 2057–2078.
- Stochastic low-rank bandits. arXiv preprint arXiv:1712.04644.
- Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6 (1), pp. 4–22.
- Stochastic linear bandits with hidden low rank structure. arXiv preprint arXiv:1901.09490.
- Bandit Algorithms. Preprint.
- High-dimensional regression with noisy and missing data: provable guarantees with non-convexity. In Advances in Neural Information Processing Systems, pp. 2726–2734.
- Ad click prediction: a view from the trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1222–1230.
- Predicting clicks: estimating the click-through rate for new ads. In Proceedings of the 16th International Conference on World Wide Web, pp. 521–530.
- Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning 1 (1–2), pp. 1–305.
- High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Vol. 48, Cambridge University Press.
Appendix A Proof for Theorem 1
Lemma 6 (Covering number for low-rank matrices; modified from Candes and Plan (2011)).
Let $S_r = \{X \in \mathbb{R}^{d_1 \times d_2} : \mathrm{rank}(X) \le r, \ \|X\|_F = 1\}$. Then there exists an $\epsilon$-net $\bar{S}_r \subseteq S_r$ for the Frobenius norm obeying $|\bar{S}_r| \le (9/\epsilon)^{(d_1+d_2+1)r}$.
Use the SVD $X = U \Sigma V^\top$ of any $X \in S_r$, where $U \in \mathbb{R}^{d_1 \times r}$ and $V \in \mathbb{R}^{d_2 \times r}$ have orthonormal columns and $\|\Sigma\|_F = 1$. We will construct an $\epsilon$-net for $S_r$ by covering the sets of permissible $U$, $\Sigma$ and $V$. Let $\mathcal{D}$ be the set of diagonal $r \times r$ matrices with nonnegative diagonal entries and Frobenius norm less than or equal to one. We take $\bar{\mathcal{D}}$ to be an $\epsilon/3$-net for $\mathcal{D}$ with $|\bar{\mathcal{D}}| \le (9/\epsilon)^r$. Next, let $O_{d_1, r} = \{U \in \mathbb{R}^{d_1 \times r} : U^\top U = I_r\}$. To cover $O_{d_1, r}$, we use the $\|\cdot\|_{1,2}$ norm defined as
$\|U\|_{1,2} = \max_i \|U_i\|_2$,
where $U_i$ denotes the $i$-th column of $U$. Let $Q_{d_1, r} = \{U \in \mathbb{R}^{d_1 \times r} : \|U\|_{1,2} \le 1\}$. It is easy to see that $O_{d_1, r} \subseteq Q_{d_1, r}$,
since the columns of an orthonormal matrix are unit-normed. We see that there is an $\epsilon/3$-net $\bar{O}_{d_1, r}$ for $O_{d_1, r}$ obeying $|\bar{O}_{d_1, r}| \le (9/\epsilon)^{d_1 r}$. Similarly, define $O_{d_2, r}$ for the right factors; by the same argument, there is an $\epsilon/3$-net $\bar{O}_{d_2, r}$ for $O_{d_2, r}$ obeying $|\bar{O}_{d_2, r}| \le (9/\epsilon)^{d_2 r}$. We now let $\bar{S}_r = \{\bar{U} \bar{\Sigma} \bar{V}^\top : \bar{U} \in \bar{O}_{d_1, r}, \ \bar{\Sigma} \in \bar{\mathcal{D}}, \ \bar{V} \in \bar{O}_{d_2, r}\}$, and remark that $|\bar{S}_r| \le (9/\epsilon)^{(d_1+d_2+1)r}$. It remains to show that for all $X \in S_r$, there exists $\bar{X} \in \bar{S}_r$ with $\|X - \bar{X}\|_F \le \epsilon$.
Fix $X \in S_r$ and decompose it as $X = U \Sigma V^\top$. Then there exists $\bar{X} = \bar{U} \bar{\Sigma} \bar{V}^\top \in \bar{S}_r$ with $\bar{U} \in \bar{O}_{d_1, r}$, $\bar{V} \in \bar{O}_{d_2, r}$, $\bar{\Sigma} \in \bar{\mathcal{D}}$ satisfying $\|U - \bar{U}\|_{1,2} \le \epsilon/3$, $\|V - \bar{V}\|_{1,2} \le \epsilon/3$, and $\|\Sigma - \bar{\Sigma}\|_F \le \epsilon/3$. This gives
$\|X - \bar{X}\|_F = \|U \Sigma V^\top - \bar{U} \bar{\Sigma} \bar{V}^\top\|_F \le \|(U - \bar{U}) \Sigma V^\top\|_F + \|\bar{U} (\Sigma - \bar{\Sigma}) V^\top\|_F + \|\bar{U} \bar{\Sigma} (V - \bar{V})^\top\|_F.$
For the first term, since $V$ has orthonormal columns,
$\|(U - \bar{U}) \Sigma V^\top\|_F^2 = \|(U - \bar{U}) \Sigma\|_F^2 = \sum_i \sigma_i^2 \|U_i - \bar{U}_i\|_2^2 \le \|\Sigma\|_F^2 \|U - \bar{U}\|_{1,2}^2 \le (\epsilon/3)^2.$
Thus we have shown $\|(U - \bar{U}) \Sigma V^\top\|_F \le \epsilon/3$; by the same argument, we also have $\|\bar{U} \bar{\Sigma} (V - \bar{V})^\top\|_F \le \epsilon/3$. For the second term, $\|\bar{U} (\Sigma - \bar{\Sigma}) V^\top\|_F \le \|\Sigma - \bar{\Sigma}\|_F \le \epsilon/3$. This completes the proof. ∎
Lemma 7 (Online-to-Confidence-Set Conversion; adapted from Theorem 1 in Abbasi-Yadkori et al. (2012)).
Suppose we feed $(X_t, y_t)$, $t = 1, 2, \ldots$, into an online prediction algorithm which, for all $t$, admits a regret bound $\rho_t(\Theta) \le B_t$. Let $\hat{y}_t$ be the prediction made at time step $t$ by the online learner. Then, for any $\delta \in (0,1)$, with probability at least $1-\delta$,