1 Introduction
A recommender system or collaborative filter uses a database of user preferences to make sequential product recommendations. Let $X \in \mathbb{R}^{n \times m}$ be a matrix containing the true user preferences, where each row corresponds to a user and each column corresponds to a product. Our database consists of a sequence of entries of $X$ observed with noise,
(1) $y_t = X_{a_t} + \epsilon_t$
where $a_t = (i_t, j_t) \in [n] \times [m]$, and $\epsilon_t$ is the noise in the observation. We denote the set of entries observed up to time $t$ by $\Omega_t$.
Much research on recommender systems has focused on the matrix completion problem. A body of work initiated by Candès & Recht (2009) has shown that, if the matrix is of low rank (user preferences are explained by a few latent features) and the factors of the matrix satisfy certain incoherence conditions, it is possible to complete the matrix from a small set of possibly noisy observations. Our aim here is to study the matrix completion problem sequentially. Briefly, we would like to find a sequential rule or policy for product recommendation which will yield the highest possible ratings over a finite time horizon.
More formally, suppose at each step $t$ we can choose an action $a_t$ that corresponds to observing some, possibly corrupt, entry of the matrix. Here, the source of corruption would be that the user does not rate the product consistently. We define the reward for taking action $a_t$ at step $t$ as
(2) $r_t(a_t) = \gamma^{N_t(a_t)} X_{a_t} + \epsilon_t$
where $\epsilon_1, \epsilon_2, \dots$ is a sequence of independent noise variables with mean 0 and variance $\sigma^2$. The variable $N_t(a)$ counts the number of times that action $a$ has been chosen before time $t$, and the parameter $\gamma$ geometrically discounts the reward of an entry after it has been observed. We will focus on the case $\gamma = 0$, which for nonnegative $X$ corresponds to the case where we can make a certain recommendation at most once.
At each step $t$, the experimenter observes the reward for the action chosen, $r_t(a_t)$. We define the pseudoregret for a finite horizon $T$ as
(3) $\bar{R}_T = \max_{a'_1, \dots, a'_T} \mathbb{E}\left[\sum_{t=1}^{T} r_t(a'_t)\right] - \mathbb{E}\left[\sum_{t=1}^{T} r_t(a_t)\right]$
The goal of our recommender system will be to minimize the pseudoregret, especially when $T$ corresponds to a number of entries smaller than the matrix completion threshold, that is, the horizon at which the matrix can be completed confidently with a semidefinite program.
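As an illustration of the pseudoregret in the case where each recommendation can be made at most once, the following minimal sketch assumes a nonnegative mean matrix whose entries pay their value on the first pull and zero afterwards. The function name `pseudo_regret` and the toy matrix are hypothetical and are not taken from the paper.

```python
import numpy as np

def pseudo_regret(X, actions):
    """Pseudoregret of a list of (row, col) actions on a mean matrix X.

    Assumption: an entry pays its mean X[i, j] the first time it is chosen
    and 0 on any later pull (the "at most once" case), and X is nonnegative.
    """
    X = np.asarray(X, dtype=float)
    seen = np.zeros(X.shape, dtype=bool)
    earned = 0.0
    for i, j in actions:
        if not seen[i, j]:
            earned += X[i, j]
        seen[i, j] = True
    # Under these assumptions the best sequence of length T picks the
    # T largest entries of X, each exactly once.
    best = np.sort(X, axis=None)[::-1][:len(actions)].sum()
    return best - earned

X = np.array([[1.0, 2.0], [3.0, 4.0]])
print(pseudo_regret(X, [(1, 1), (1, 1)]))  # repeated pull wastes a step
print(pseudo_regret(X, [(1, 1), (1, 0)]))  # the two largest entries
```

The sketch works with mean rewards only, so the observation noise does not appear.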
The problem described is related to contextual bandits, a family of multi-armed bandits including linear and Gaussian process bandits in which the mean reward of an arm is a function of a set of given predictors. The principal difference is that in sequential matrix completion, we must learn latent factors which explain the rewards. The formulation of the problem above is meant to emphasize this distinction. However, it would be straightforward to adapt the algorithms introduced in this paper to the requirements of realistic recommender systems, such as (i) taking advantage of predictors associated to users or products, (ii) restrictions of the set of user-product pairs available at any given step, (iii) models for non-response, and (iv) prior information about the matrix of preferences.
As a note with regard to this formulation, in certain applications, it might not be possible to choose the user at each step. At an abstract level, our algorithm performs Bayesian optimisation over sets indexed by two categorical variables, which has applications beyond recommender systems. Furthermore, this paper will later show that the ability to choose user-item pairs can be exploited for information gain that achieves remarkable performance. In real recommender system settings, our formulation makes it convenient to include in the model information on the likelihood that a user gives any feedback, which helps ease the restrictive assumption that a user will definitely respond with their rating of the item. It can thus highlight whether there exist certain users that are good to interact with to gain more information. However, in this formulation one must be careful to avoid the pitfall of recommending items only to users who tend to give higher ratings. To avoid this, one could adaptively scale the user-columns so as to equalize their average rating. Finally, note that if there were constraints on the set of users or items available for recommendation at each time step, these could be imposed naturally with our formulation.
The remainder of the introduction provides some background on stochastic bandit policies. Section 2 defines two policies for sequential matrix completion based on a gamma process factor model, and describes a fast variational procedure for inference. Section 3 evaluates the policies in simulations using synthetic data, as well as 3 real datasets. The final section provides further connections to the literature and discusses theoretical challenges.
1.1 Exploration-Exploitation Tradeoffs and Stochastic Bandit Policies
Our estimator policy relies on an accurate reconstruction of the complete matrix, $X$, knowing only the observations corresponding to the observed entries $\Omega_t$. Note that this is only feasible when $X$ is low rank. Matrix completion algorithms typically need the observed entries to be distributed somewhat uniformly around the matrix. For instance, an estimator policy that only interacts with one user would have no hope of completing the matrix, as it knows nothing about the other users' preferences. In this sense the estimator policy must explore the user-item pairs. On the other hand, the policy must begin to exploit user-item pairs that are expected to be optimal in order to minimise the regret.
This type of problem relates to well-studied strategies for the multi-armed bandit problem. The narrative for this problem is as follows. A player is at a casino with multiple slot machines. At each step he chooses an arm to pull and then observes the reward of his choice, which is assumed to be drawn from some underlying distribution associated to the chosen arm. To perform well the player must strike a balance between exploration and exploitation. That is, he must pull each arm enough to get a rough estimate of its expected reward, but he must also exploit the arms that seem to have high rewards.
Suppose there are $K$ arms with underlying parameters $\theta = (\theta_1, \dots, \theta_K)$ that respectively describe the distribution of rewards for each arm. At time $t$ we choose an arm $a_t$ which depends on the history of rewards before $t$ according to some policy. The reward $r_t$ is drawn from the distribution with expectation $\mu(\theta_{a_t})$. The measure of performance for a policy is the cumulated regret, which at time $T$ is
(4) $R_T(\theta) = \sum_{t=1}^{T} \left( \mu^* - \mu(\theta_{a_t}) \right)$
where $\mu^* = \max_k \mu(\theta_k)$. Frequentist analyses of multi-armed bandits have focused on worst-case bounds, that is, upper bounds on $\sup_\theta \mathbb{E}[R_T(\theta)]$ for specific policies, as well as minimax results, which lower bound this quantity over a space of policies. Bayesian analysis of regret, on the other hand, assumes the parameter $\theta$ is random and bounds the expected regret under a prior distribution $p_0$,
(5) $\mathbb{E}_{\theta \sim p_0}\left[ R_T(\theta) \right]$
also known as the Bayes regret. The policy which minimises Bayes regret is known as Bayes-optimal and is typically the solution to an intractable dynamic program. An important exception is the case of a geometrically discounted multi-armed bandit, in which the Gittins index strategy is provably Bayes-optimal (Weber et al., 1992).
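To make the two notions of regret concrete, the following sketch computes the cumulated regret of a given sequence of arm choices, and a Monte Carlo estimate of the Bayes regret of a uniformly random policy. The function names, the uniform prior on the arm means and the random policy are illustrative assumptions, not part of the paper.

```python
import numpy as np

def cumulative_regret(arm_means, chosen_arms):
    """Running sum of (best mean - mean of the chosen arm) over time."""
    arm_means = np.asarray(arm_means, dtype=float)
    gaps = arm_means.max() - arm_means[np.asarray(chosen_arms)]
    return np.cumsum(gaps)

def bayes_regret_random_policy(n_arms, horizon, n_draws=20000, seed=0):
    """Monte Carlo Bayes regret of a uniformly random policy.

    Assumption: arm means are drawn i.i.d. from Uniform(0, 1).
    """
    rng = np.random.default_rng(seed)
    theta = rng.uniform(size=(n_draws, n_arms))          # one prior draw per row
    arms = rng.integers(n_arms, size=(n_draws, horizon)) # random arm choices
    chosen = np.take_along_axis(theta, arms, axis=1)
    regret = (theta.max(axis=1, keepdims=True) - chosen).sum(axis=1)
    return regret.mean()

print(cumulative_regret([0.2, 0.5, 0.9], [0, 2, 1]))
print(bayes_regret_random_policy(n_arms=3, horizon=10))
```

For three arms with uniform means, the expected best mean is 3/4 and a random arm has expected mean 1/2, so the estimate should be close to 10 × 0.25 = 2.5.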
An effective and simple strategy from both Bayesian and frequentist perspectives is Thompson Sampling (TS). Let $p_t$ denote the posterior distribution of $\theta$ after $t$ observations. We draw $\hat{\theta} \sim p_t$, and then choose an action $a_{t+1} = \arg\max_k \mu(\hat{\theta}_k)$. In other words, an arm is chosen with a probability equal to the posterior probability that it is the best arm. The worst-case and Bayes regret of TS have been characterised in a range of models, and it is known that in the case of Bernoulli rewards, the regret grows at the optimal rate $O(\log T)$ (Agrawal & Goyal, 2013; Kaufmann et al., 2012; Russo & Van Roy, 2014a).
Policies like TS or the Upper Confidence Bound method (Lai & Robbins, 1985) encourage exploration through the heuristic of optimism under uncertainty, but they do not explicitly quantify the information to be gained by each possible action. By contrast, Information Directed Sampling (IDS) selects actions using both a measure of expected regret and a measure of expected information gain derived from a Bayesian model
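(Russo & Van Roy, 2014b).
Using the notation from Russo & Van Roy (2014b), the action $A_t$ is a random variable depending on the history of observations $\mathcal{F}_{t-1}$. Define a discrete distribution on the space of actions $\mathcal{A}$ by $\alpha_t(a) = \mathbb{P}(A^* = a \mid \mathcal{F}_{t-1})$, where $A^*$ denotes the random variable taking on the value of the optimal action. The expected regret obtained from taking action $a$ at step $t$ is denoted by $\Delta_t(a)$. The information gain obtained from taking action $a$ at step $t$ is denoted by $g_t(a)$ and is defined as the expected decrease in entropy of $\alpha_t$,
$g_t(a) = \mathbb{E}\left[ H(\alpha_t) - H(\alpha_{t+1}) \mid \mathcal{F}_{t-1}, A_t = a \right]$
Let $\mathcal{D}(\mathcal{A})$ denote the set of distributions over the actions. For a fixed $\pi \in \mathcal{D}(\mathcal{A})$ let $\Delta_t(\pi) = \sum_a \pi(a) \Delta_t(a)$ denote the expected regret when actions are selected by drawing from $\pi$. Similarly let $g_t(\pi) = \sum_a \pi(a) g_t(a)$ denote the expected information gain. The IDS policy samples actions from the distribution that minimises the ratio of the squared expected regret to the expected information gain,
(6) $\pi_t^{\mathrm{IDS}} \in \arg\min_{\pi \in \mathcal{D}(\mathcal{A})} \frac{\Delta_t(\pi)^2}{g_t(\pi)}$
It is known (Russo & Van Roy, 2014b) that this optimum is achieved by a distribution supported on at most two actions; furthermore, $g_t(a)$ is the mutual information between the two random variables $A^*$ and $Y_{t,a}$, the observation obtained from action $a$, in the posterior distribution,
(7) $g_t(a) = I_t(A^*; Y_{t,a}) = \sum_{a^* \in \mathcal{A}} \alpha_t(a^*) \, D_{\mathrm{KL}}\left( P_t(Y_{t,a} \mid A^* = a^*) \,\|\, P_t(Y_{t,a}) \right)$
This makes it possible to estimate the information gain by simulation. Russo & Van Roy (2014b) provide regret bounds for several general cases and give examples of distributions for which the IDS policy is clearly superior to TS.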
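The minimisation of the information ratio can be sketched numerically once per-action estimates of expected regret and information gain are available. The sketch below searches over distributions supported on at most two actions using a grid of mixing weights; the function name `ids_distribution`, the grid search and the toy inputs are illustrative assumptions rather than the procedure used in the paper.

```python
import numpy as np

def ids_distribution(delta, gain, n_grid=1001):
    """Minimise (pi . delta)^2 / (pi . gain) over two-action distributions.

    delta: expected regret per action; gain: expected information gain per action.
    Returns the action distribution and the achieved information ratio.
    """
    delta = np.asarray(delta, dtype=float)
    gain = np.asarray(gain, dtype=float)
    q = np.linspace(0.0, 1.0, n_grid)        # weight placed on the first action
    best = (np.inf, 0, 0, 1.0)
    for i in range(len(delta)):
        for j in range(i, len(delta)):
            d = q * delta[i] + (1 - q) * delta[j]
            g = q * gain[i] + (1 - q) * gain[j]
            ratio = np.full_like(q, np.inf)  # zero gain gives an infinite ratio
            ok = g > 0
            ratio[ok] = d[ok] ** 2 / g[ok]
            k = int(np.argmin(ratio))
            if ratio[k] < best[0]:
                best = (ratio[k], i, j, q[k])
    ratio, i, j, qi = best
    pi = np.zeros(len(delta))
    pi[i] += qi
    pi[j] += 1 - qi
    return pi, ratio

# Action 0 is costly but informative, action 1 is cheap but uninformative.
pi, ratio = ids_distribution([1.0, 0.2], [1.0, 0.0])
print(pi, ratio)
```

In this toy case neither pure action is optimal: mixing the two achieves a lower ratio than always playing the informative action.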
2 Policies for Sequential Matrix Completion
In this section we describe an implementation of TS and IDS for sequential matrix completion. The prior on the matrix is a gamma process factor model, described by Knowles (2015). Efficient inference methods are critical, as these Bayesian policies require updating the posterior after every step. We employ a Stochastic Variational Inference algorithm.
The prior assumes that the columns of the user preference matrix , denoted
, are drawn from a normal distribution
, where . Integrating out , this can be written . The prior on is specified by
(8)
where G denotes the gamma distribution with density .
This model in practice assumes a large fixed value of . However, the gamma process prior on tends to shrink the effective rank of the matrix , and if the true number of factors is smaller than , the posterior of the effective rank of concentrates around .
Recall that the observation at time $t$ is
(9) $y_t = X_{a_t} + \epsilon_t$
where, again, $a_t$ corresponds to some entry of $X$. Since $\gamma = 0$, each entry is only observed once in a small-enough horizon. Because of this, we can take each $X_{a_t}$ to be the noisy observation itself, absorbing the error variable $\epsilon_t$ into the prior for $X$.
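To choose an entry sequentially in TS or in IDS, we must sample the posterior distribution of the unobserved entries. Let $x_j^{o}$ be the entries observed in column $j$, and $x_j^{u}$ the rest of the entries in this column. As the columns of $X$ are conditionally independent given the model parameters, which determine the column covariance $\Sigma$,
(10) $p(X^{u}, \Sigma \mid X^{o}) = p(\Sigma \mid X^{o}) \prod_{j} p(x_j^{u} \mid x_j^{o}, \Sigma)$
Each factor in the product on the right-hand side is a normal distribution. Let
(11) $\begin{pmatrix} \Sigma_{uu} & \Sigma_{uo} \\ \Sigma_{ou} & \Sigma_{oo} \end{pmatrix}$
be the covariance matrix with rows and columns permuted such that the unobserved entries in column $j$ appear first. It can easily be shown that
(12) $x_j^{u} \mid x_j^{o}, \Sigma \sim \mathcal{N}(\mu_j, \bar{\Sigma}_j)$
where
(13) $\mu_j = \Sigma_{uo} \Sigma_{oo}^{-1} x_j^{o}, \qquad \bar{\Sigma}_j = \Sigma_{uu} - \Sigma_{uo} \Sigma_{oo}^{-1} \Sigma_{ou}$
This we sample efficiently using the trick described in Doucet (2010). The more difficult task is sampling the posterior of the model parameters in the first factor of Eq. 10, which we approximate variationally.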
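The conditional sampling step can be sketched as follows for a single column with a known covariance matrix, which corresponds to an oracle setting rather than the full model. The sketch assumes zero-mean columns and uses direct Gaussian conditioning instead of the trick in Doucet (2010); the function names `sample_unobserved` and `thompson_pick` are hypothetical.

```python
import numpy as np

def sample_unobserved(cov, observed_idx, observed_vals, rng):
    """Draw the unobserved entries of x ~ N(0, cov) given the observed ones.

    Assumes at least one entry has been observed.
    """
    obs = np.asarray(observed_idx)
    unobs = np.setdiff1d(np.arange(cov.shape[0]), obs)
    S_uo = cov[np.ix_(unobs, obs)]
    S_oo = cov[np.ix_(obs, obs)]
    S_uu = cov[np.ix_(unobs, unobs)]
    K = np.linalg.solve(S_oo, S_uo.T).T           # S_uo @ inv(S_oo)
    mean = K @ np.asarray(observed_vals, dtype=float)
    cond_cov = S_uu - K @ S_uo.T
    cond_cov = (cond_cov + cond_cov.T) / 2        # symmetrise against round-off
    return unobs, rng.multivariate_normal(mean, cond_cov)

def thompson_pick(cov, observed_idx, observed_vals, rng):
    """One Thompson step: pick the unobserved entry with the largest sampled value."""
    unobs, draw = sample_unobserved(cov, observed_idx, observed_vals, rng)
    return int(unobs[np.argmax(draw)])

rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.9, 0.0], [0.9, 1.0, 0.0], [0.0, 0.0, 1.0]])
print(thompson_pick(cov, observed_idx=[0], observed_vals=[2.0], rng=rng))
```

In this toy covariance, entry 1 is strongly correlated with the observed entry 0, so its sampled value is centred near 1.8, while entry 2 is centred at 0.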
More specifically, we compute a fixed-form, mean-field variational posterior
(14)  
where is a parametric family of distributions in which the parameters , , , , and for and are independent, and
(15)  
The maximisation objective in Eq. 14 is known as the evidence lower bound (ELBO), as it bounds the log marginal probability of the data from below. We rely on Stochastic Variational Inference (Hoffman et al., 2013; Blei et al., 2016) to solve this problem. This method applies stochastic approximation algorithms to optimise the ELBO, deriving unbiased estimates of its gradient via Monte Carlo integration. More specifically, we apply the reparametrization trick introduced by Salimans et al. (2013) and Kingma & Welling (2013) to estimate the ELBO gradient, using the natural transformation mapping a standard normal to a lognormal. The choice of step size for the variational parameter updates is critical to the runtime of the algorithm. We use the AdaDelta method (Zeiler, 2012) to ensure fast convergence that is mostly unaffected by the initial choice of parameters. The whole procedure is summarized in Algorithm 1.
We find improved performance with a good initialisation. This is achieved by maximising the log posterior of the parameters and centering the initial variational posterior around the maximum a posteriori estimate.
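The combination of a reparametrized ELBO gradient with AdaDelta updates can be illustrated on a toy problem. The sketch below fits a lognormal variational posterior to a single positive parameter whose target density is an unnormalised Gamma(a, b); this target, the function name `fit_lognormal_svi` and all default settings are assumptions chosen for illustration and do not reproduce Algorithm 1 or the factor model.

```python
import numpy as np

def fit_lognormal_svi(a=4.0, b=2.0, n_iter=5000, n_mc=64,
                      rho=0.95, eps=1e-6, seed=0):
    """Fit q(z) = LogNormal(m, s^2) to log p(z) = (a - 1) log z - b z + const."""
    rng = np.random.default_rng(seed)
    params = np.zeros(2)                 # (m, log s); log s keeps s positive
    avg_sq_grad = np.zeros(2)
    avg_sq_step = np.zeros(2)
    for _ in range(n_iter):
        m, s = params[0], np.exp(params[1])
        noise = rng.standard_normal(n_mc)
        z = np.exp(m + s * noise)        # standard normal mapped to a lognormal
        dlogp = (a - 1.0) - b * z        # d log p(z) / d log z
        # Monte Carlo gradient of E_q[log p] plus the exact entropy gradient (1, 1).
        grad = np.array([dlogp.mean() + 1.0,
                         (dlogp * noise).mean() * s + 1.0])
        # AdaDelta update, written as ascent on the ELBO.
        avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grad ** 2
        step = np.sqrt(avg_sq_step + eps) / np.sqrt(avg_sq_grad + eps) * grad
        avg_sq_step = rho * avg_sq_step + (1 - rho) * step ** 2
        params = params + step
    return params[0], np.exp(params[1])

m, s = fit_lognormal_svi()
print(s, np.exp(m + s ** 2 / 2))         # fitted scale and fitted mean of q
```

For this toy target the ELBO is maximised when the mean of q equals a/b and s^2 = 1/a, so with the defaults the fitted values should settle near a mean of 2 and a scale of 0.5, up to Monte Carlo noise.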
TS only requires samples from the posterior. On the other hand, IDS requires estimating the information ratio in Eq. 6, which Algorithm 3, drawn from Russo & Van Roy (2014b), approximates. This algorithm uses the equivalence presented in Eq. 7 in that it consists of approximating the divergence of the distributions and . Let denote the covariance matrix drawn from the posterior distribution . Let be the probability of observing value when taking action conditioned on . Let be a discrete approximation to , to the probability of observing from action , and to the probability of observing from action when action is optimal. Then one can check that
3 Results
3.1 Synthetic Data
To test the performance of each algorithm we construct a factor matrix , for ranks , , and . For half of the runs, the true factor matrix is sampled elementwise uniformly on , otherwise it is sampled elementwise from . We draw columns from a multivariate normal with covariance matrix . We average the results over 10 runs and measure the regret by constructing the optimal sequence of actions knowing the full true matrix, then subtracting the cumulative reward achieved by the policy of each algorithm from the cumulative reward of this optimal sequence. For these Bayesian methods we set the user-defined threshold rank to be . Every test run considers the horizon , since beyond this point many methods can complete the full matrix (Candès & Plan, 2010). For computational efficiency, rather than observing a single entry at a time and updating the posterior at each iteration, we observe entries between updates of the posterior.
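A synthetic low-rank matrix of this kind can be generated along the following lines. The matrix dimensions, the rank, the sampling ranges and the covariance taken as the factor matrix times its transpose (with no additional diagonal term) are hypothetical placeholders, since the values used in the experiments are not reproduced here; the function name `make_synthetic` is likewise illustrative.

```python
import numpy as np

def make_synthetic(n_rows=30, n_cols=40, rank=3, uniform=True, seed=0):
    """Draw a matrix whose columns are N(0, F F^T) for a random factor matrix F.

    All default values are placeholders, not the settings used in the paper.
    """
    rng = np.random.default_rng(seed)
    if uniform:
        factors = rng.uniform(0.0, 1.0, size=(n_rows, rank))  # assumed range
    else:
        factors = rng.standard_normal((n_rows, rank))         # assumed N(0, 1)
    cov = factors @ factors.T
    # x_j = F f_j with f_j ~ N(0, I) has covariance F F^T.
    X = factors @ rng.standard_normal((rank, n_cols))
    return X, cov

X, cov = make_synthetic()
print(X.shape, np.linalg.matrix_rank(X))
```

Sampling through the factors avoids forming a Cholesky factor of the rank-deficient covariance.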
The regret curves are shown in Fig. 1. The plots include the result of an oracle policy which uses the true covariance matrix to sample the multivariate normal posterior (12) of the missing entries, as well as a policy using an empirical estimate of the covariance matrix in a similar way, and a greedy policy relying on matrix completion via OptSpace (Keshavan et al., 2009b). The IDS policy performs almost as well as the oracle policy, and TS is superior to the policies using the OptSpace estimate and the empirical covariance.
3.2 MovieLens and Jester Data
We test the algorithms on both a matrix from the MovieLens dataset (1) and the Jester dataset (Goldberg et al., 2001). For the MovieLens dataset, in order to be able to compute the regret, we complete the missing entries with OptSpace (Keshavan et al., 2009b), an algorithm known to perform well on this dataset (Keshavan et al., 2009a), and treat the completed version as the true ratings matrix. The Jester dataset has no missing entries so we compare OptSpace's performance to our algorithms. We consider the horizon , with , , and for both datasets. For comparison we use a method that, at each step, updates the empirical covariance matrix and then draws estimates of the unobserved entries from the corresponding conditional multivariate normal pseudo-posterior. As a near-oracle comparison algorithm we first compute the 'population' covariance matrix using the full dataset with no missing entries, then sequentially draw missing entries from the corresponding multivariate normal pseudo-posterior. Finally we compare the performance of these algorithms to the current competitive methods from Kawale et al. (2015) and Zhao et al. (2013).
The regret curves are shown in Fig. 2.
4 Discussion
4.1 Related Work
Our work falls into the area of collaborative filtering. Recently, the idea of applying bandit algorithms to online collaborative filtering has become more prevalent (Li et al., 2010; Bresler et al., 2014; Li et al., 2016; Deshpande & Montanari, 2012). However, much of the literature focuses on contextual bandits, in which a user's preference is a function of a given set of predictors. The setting in which observations are corrupted entries of a low-rank matrix is less common and less amenable to theoretical analysis due to the fact that adaptive confidence intervals are more difficult to obtain. Applying TS to a collaborative filtering application was first proposed by Zhao et al. (2013). The authors use the PMF model and employ Markov chain Monte Carlo and Gibbs Sampling to sample from the posterior distribution. In a similar vein, Kawale et al. (2015) employ a PMF model and implement a "Rao-Blackwellized particle filter" to give a discrete approximation for the posterior distribution. Here we use a gamma process factor model which has the advantage of adapting to the true rank of the data. The main contribution of this paper is a demonstration that crude uncertainty estimates from SVI, coupled to smart Bayesian policies like IDS, can lead to near-optimal designs for horizons smaller than the typical matrix completion threshold. There are several avenues to extend the work presented here. Among them, we highlight investigating the performance of the policies under different models of corruption or constraints on the set of available actions, and the integration of other predictors into the low-rank model.
4.2 Future Work
4.2.1 Computational Efficiency
While SVI makes it possible to implement Bayesian policies in an online collaborative filter, the computational cost is still significant. In the Bayesian framework for sequential matrix completion there is a tradeoff between computation time and performance: IDS outperforms TS, but it is much slower to compute and much less scalable. Applying these policies in real-world settings would require further research into fast inference algorithms or simplified uncertainty estimates. In particular, developing models in which the dimension of the parameter grows sublinearly in the dimensions of the matrix is critical to make our policies feasible in a big-data application. Given the impressive performance of IDS on both the synthetic and the real data, future research into reducing the computational complexity of the method seems worthwhile. Perhaps there is a suitable, more computable alternative that is also able to quickly identify which actions lead to the best information about the underlying Bayesian model.
4.2.2 Regret Bounds
Deriving regret bounds for our policies would require significant technical advances. The methods in Russo & Van Roy (2014a, b) can be used to obtain weak bounds on the Bayes regret, but sharp bounds which scale optimally in the number of degrees of freedom of the true matrix would require efficient confidence intervals for the missing entries in the matrix. The main hindrance is that most of the theory of low-rank matrix completion relies substantially on random designs. The bounds in Klopp et al. (2014) for non-uniform sampling could potentially be useful even though they still require independence in the design. A line of recent work on matrix completion with deterministic sampling could also provide tools for regret bounds (Király et al., 2015; Pimentel-Alarcón et al., 2016); however, there is work to be done to sharpen these results and translate computable certificates of completability into simple conditions. In a previous analysis of TS for sequential matrix completion, Kawale et al. (2015) asserted that new tools to analyze generic posterior distributions are needed for robust regret bounds. However, they show how to bound the regret in the special case of rank-1 matrices.
Our model induces a duality between estimating the missing entries and estimating the covariance of the columns. Unfortunately, much of the theory of covariance estimation in high-dimensional statistics is limited to the case in which i.i.d. vectors without any missing entries are observed. An exception to this can be seen in Lounici et al. (2014); however, this analysis still relies on sampling entries uniformly at random. Guarantees on covariance estimation given structured sequences of partial observations would be essential to deriving regret bounds.
References
 (1) MovieLens. http://grouplens.org/datasets/movielens/. Accessed: 2016-08-01.
 Agrawal & Goyal (2013) Agrawal, Shipra and Goyal, Navin. Further optimal regret bounds for Thompson sampling. In AISTATS, pp. 99–107, 2013.
 Blei et al. (2016) Blei, David M, Kucukelbir, Alp, and McAuliffe, Jon D. Variational inference: A review for statisticians. arXiv preprint arXiv:1601.00670, 2016.
 Bresler et al. (2014) Bresler, Guy, Chen, George H, and Shah, Devavrat. A latent source model for online collaborative filtering. In Advances in Neural Information Processing Systems, pp. 3347–3355, 2014.
 Candès & Plan (2010) Candès, Emmanuel J and Plan, Yaniv. Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936, 2010.
 Candès & Recht (2009) Candès, Emmanuel J and Recht, Benjamin. Exact matrix completion via convex optimization. Foundations of Computational mathematics, 9(6):717–772, 2009.
 Deshpande & Montanari (2012) Deshpande, Yash and Montanari, Andrea. Linear bandits in high dimension and recommendation systems. In Communication, Control, and Computing (Allerton), 2012 50th Annual Allerton Conference on, pp. 1750–1754. IEEE, 2012.

 Doucet (2010) Doucet, A. A note on efficient conditional simulation of Gaussian distributions. Departments of Computer Science and Statistics, University of British Columbia, 2010.
 Goldberg et al. (2001) Goldberg, Ken, Roeder, Theresa, Gupta, Dhruv, and Perkins, Chris. Eigentaste: A constant time collaborative filtering algorithm. Information Retrieval, 4(2):133–151, 2001.
 Hoffman et al. (2013) Hoffman, Matthew D, Blei, David M, Wang, Chong, and Paisley, John. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
 Kaufmann et al. (2012) Kaufmann, Emilie, Korda, Nathaniel, and Munos, Rémi. Thompson sampling: An asymptotically optimal finite-time analysis. In International Conference on Algorithmic Learning Theory, pp. 199–213. Springer, 2012.
 Kawale et al. (2015) Kawale, Jaya, Bui, Hung H, Kveton, Branislav, Tran-Thanh, Long, and Chawla, Sanjay. Efficient Thompson sampling for online matrix-factorization recommendation. In Advances in Neural Information Processing Systems, pp. 1297–1305, 2015.
 Keshavan et al. (2009a) Keshavan, Raghunandan H, Montanari, Andrea, and Oh, Sewoong. Low-rank matrix completion with noisy observations: a quantitative comparison. In Communication, Control, and Computing, 2009. Allerton 2009. 47th Annual Allerton Conference on, pp. 1216–1222. IEEE, 2009a.
 Keshavan et al. (2009b) Keshavan, Raghunandan H, Oh, Sewoong, and Montanari, Andrea. Matrix completion from a few entries. In 2009 IEEE International Symposium on Information Theory, pp. 324–328. IEEE, 2009b.
 Kingma & Welling (2013) Kingma, Diederik P and Welling, Max. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 Király et al. (2015) Király, Franz J, Theran, Louis, and Tomioka, Ryota. The algebraic combinatorial approach for low-rank matrix completion. Journal of Machine Learning Research, 16:1391–1436, 2015.
 Klopp et al. (2014) Klopp, Olga et al. Noisy low-rank matrix completion with general sampling distribution. Bernoulli, 20(1):282–303, 2014.
 Knowles (2015) Knowles, David A. Stochastic gradient variational Bayes for gamma approximating distributions. arXiv preprint arXiv:1509.01631, 2015.
 Lai & Robbins (1985) Lai, Tze Leung and Robbins, Herbert. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22, 1985.
 Li et al. (2010) Li, Lihong, Chu, Wei, Langford, John, and Schapire, Robert E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pp. 661–670. ACM, 2010.
 Li et al. (2016) Li, Shuai, Karatzoglou, Alexandros, and Gentile, Claudio. Collaborative filtering bandits. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 539–548. ACM, 2016.
 Lounici et al. (2014) Lounici, Karim et al. High-dimensional covariance matrix estimation with missing observations. Bernoulli, 20(3):1029–1058, 2014.
 Pimentel-Alarcón et al. (2016) Pimentel-Alarcón, Daniel L, Boston, Nigel, and Nowak, Robert D. A characterization of deterministic sampling patterns for low-rank matrix completion. IEEE Journal of Selected Topics in Signal Processing, 10(4):623–636, 2016.
 Russo & Van Roy (2014a) Russo, Daniel and Van Roy, Benjamin. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014a.
 Russo & Van Roy (2014b) Russo, Daniel and Van Roy, Benjamin. Learning to optimize via information-directed sampling. In Advances in Neural Information Processing Systems, pp. 1583–1591, 2014b.
 Salimans et al. (2013) Salimans, Tim, Knowles, David A, et al. Fixed-form variational posterior approximation through stochastic linear regression. Bayesian Analysis, 8(4):837–882, 2013.
 Weber et al. (1992) Weber, Richard et al. On the Gittins index for multi-armed bandits. The Annals of Applied Probability, 2(4):1024–1033, 1992.
 Zeiler (2012) Zeiler, Matthew D. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
 Zhao et al. (2013) Zhao, Xiaoxue, Zhang, Weinan, and Wang, Jun. Interactive collaborative filtering. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management, pp. 1411–1420. ACM, 2013.