1-bit Matrix Completion: PAC-Bayesian Analysis of a Variational Approximation

04/14/2016, by Vincent Cottet, et al.

Due to challenging applications such as collaborative filtering, the matrix completion problem has been widely studied in the past few years. Different approaches rely on different structural assumptions on the matrix at hand. Here, we focus on the completion of a (possibly) low-rank matrix with binary entries, the so-called 1-bit matrix completion problem. Our approach relies on tools from machine learning theory: empirical risk minimization and its convex relaxations. We propose an algorithm to compute a variational approximation of the pseudo-posterior. Thanks to the convex relaxation, the corresponding minimization problem is bi-convex, and thus the method behaves well in practice. We also study the performance of this variational approximation through PAC-Bayesian learning bounds. Contrary to previous works, which focused on upper bounds on the estimation error of M in various matrix norms, we derive from this analysis a PAC bound on the prediction error of our algorithm. We focus essentially on the convex relaxation through the hinge loss, for which we present the complete analysis, a full simulation study and a test on the MovieLens data set. However, we also discuss a variational approximation to deal with the logistic loss.


1 Introduction

Motivated by modern applications like recommendation systems and collaborative filtering, video analysis, or quantum statistics, the matrix completion problem has been widely studied in recent years. Recovering a matrix from a subsample of its entries is, without any additional information, an impossible task in general. However, under some assumption on the structure of the matrix to be recovered, it might become feasible, as shown by [8] and [7], where the assumption is that the matrix has small rank. This assumption is indeed very natural in many applications. For example, in recommendation systems, it is equivalent to the existence of a small number of hidden features that explain the preferences of users. While [8] and [7] focused on matrix completion without noise, many authors extended these techniques to the case of noisy observations, see [6] and [11] among others. The main idea in [6] is to minimize the least squares criterion, penalized by the rank. This penalization is then relaxed by the nuclear norm, which is the sum of the singular values of the matrix at hand. An efficient algorithm is described in [24].

All the aforementioned papers focused on real-valued matrices. However, in many applications, the matrix entries are binary, that is, in the set $\{-1,+1\}$. For example, in collaborative filtering, the entry being $+1$ means that user $i$ likes object $j$, while this entry being $-1$ means that he/she dislikes it. The problem of recovering a binary matrix from partial observations is usually referred to as 1-bit matrix completion. Dealing with binary observations requires specific estimation methods. Most works on this problem assume a generalized linear model (GLM): the observed entries are Bernoulli distributed with a parameter obtained by applying a link function, mapping the real line to $[0,1]$ (for example the logistic function), to the corresponding entry of a real parameter matrix $M$, see [5, 13, 16]. In these works, the goal is to recover the matrix $M$, and a convergence rate is then derived. For example, [16] provides an estimator for which, under suitable assumptions and when the data are generated according to the true model with parameter $M$, the Frobenius distance between the estimator and $M$ is controlled at a rate driven by the rank of $M$, up to a constant that depends on the assumptions (we refer the reader to Corollary 2, page 1955, in [16] for the exact statement). While this result ensures consistency when $M$ is low-rank, it does not provide any guarantee on the probability of a prediction error. Moreover, such a prediction bound would necessarily assume that the model is correct, that is, that the link function is well specified, which is an unrealistic assumption.
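For concreteness, the GLM formulation discussed above can be written as follows (a sketch in generic notation, with $Y_{i,j}$ the observed entry, $M$ the parameter matrix and $f$ the link function; the exact notation of [5, 13, 16] may differ):

$$ \mathbb{P}(Y_{i,j} = +1) = f(M_{i,j}), \qquad \mathbb{P}(Y_{i,j} = -1) = 1 - f(M_{i,j}), \qquad f(x) = \frac{e^{x}}{1 + e^{x}} \ \text{(logistic link)}. $$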

Here, we adopt a machine learning point of view: in machine learning, dealing with a binary output is called a classification problem, for which methods are known that do not assume any model on the observations. That is, instead of focusing on a parametric model for the observations, we only define a set of prediction matrices and seek the one that leads to the best prediction error. Using the 0-1 loss function, we could actually directly use Vapnik-Chervonenkis theory [29] to propose a classifier whose risk would be controlled by a PAC inequality. However, it is known that this approach is usually computationally intractable. A popular approach is to replace the 0-1 loss by a convex surrogate [30], namely, the hinge loss. Our approach is as follows: we propose a pseudo-Bayesian approach, where we define a pseudo-posterior distribution on a set of matrices. This pseudo-posterior distribution does not have a very simple form; however, thanks to a variational approximation, we manage to approximate it by a tractable distribution. Thanks to the PAC-Bayesian theory [22, 14, 27, 9, 10, 26, 12], we are able to provide a PAC bound on the prediction risk of this variational approximation. We then show that, thanks to the convex relaxation of the 0-1 loss, the computation of this variational approximation is actually a bi-convex minimization problem. As a consequence, algorithms that are efficient in practice are available.

The rest of the paper is organized as follows. In Section 2 we introduce the notations used in the paper, the definition of the pseudo-posterior and of its variational approximation. In Section 3 we give the PAC analysis of the variational approximation. This provides both an empirical and a theoretical upper bound on the prediction risk of our method. Section 4 provides details on the implementation of our method. Note that in the aforementioned sections, the convex surrogate of the 0-1 loss is the hinge loss. Section 5 briefly discusses an extension to the logistic loss, together with the corresponding algorithm to compute the variational approximation. Finally, Section 6 is devoted to an empirical study, and Section 6.2 to an application to the MovieLens data set. The proofs of the theorems of Section 3 are provided in Section 8.

2 Notations

For any integer $m$, we define $[m] = \{1,\dots,m\}$. For any integers $m_1$ and $m_2$ and any $m_1 \times m_2$ matrix $M$, we use the usual entrywise (Frobenius) norm, and for a pair of matrices of the same dimensions we write the associated scalar product. Finally, when an $m_1 \times m_2$ matrix $M$ has rank $k$, it can be written as $M = U V^T$ where $U$ is $m_1 \times k$ and $V$ is $m_2 \times k$. This decomposition is obviously not unique; we therefore define a norm of $M$ through an infimum taken over all such possible pairs $(U, V)$.

2.1 1-bit matrix completion as a classification problem

We formally describe the 1-bit matrix completion problem as a classification problem: we observe $n$ i.i.d. pairs $(X_i, Y_i)$ drawn from a distribution $\mathbf{P}$. The $X_i$'s take values in $[m_1] \times [m_2]$ and the $Y_i$'s take values in $\{-1, +1\}$. Hence, the $i$-th observed entry of the matrix is $Y_i$ and its position in the matrix is given by $X_i$. In this setting, a predictor can be represented by an $m_1 \times m_2$ real matrix $M$. It would be natural to use $M$ in the following way: when $X = (i, j)$, predict $Y$ by the sign of $M_{i,j}$. The ability of this predictor to predict a new entry of the matrix is then assessed by the risk

and its empirical counterpart is:

It is then possible to use the standard approach in classification theory [29]. For example, the best possible classifier is the Bayes classifier

and equivalently we have a corresponding optimal matrix

We define the minimal achievable risk as the risk of this optimal matrix. Note that, clearly, if two matrices have entries of the same sign at every position, then they have the same risk; in particular, the risk only depends on the signs of the entries.
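A plausible formalization of the quantities just introduced, under the convention that the labels take values in $\{-1,+1\}$ and that $M$ predicts through the sign of its entries (the exact displays of the paper may differ):

$$ R(M) = \mathbb{P}_{(X,Y) \sim \mathbf{P}}\left[ Y \neq \mathrm{sign}(M_X) \right], \qquad r_n(M) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{ Y_i \neq \mathrm{sign}(M_{X_i}) \}, $$
$$ M^*_{i,j} = \mathrm{sign}\left( \mathbb{P}(Y = +1 \mid X = (i,j)) - \tfrac{1}{2} \right), \qquad \overline{R} = R(M^*) = \inf_M R(M). $$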

While the risk has a clear interpretation, it is well known that working with its empirical counterpart usually leads to intractable problems, as it is non-smooth and non-convex. Hence, it is quite standard to replace the empirical risk by a convex surrogate [30]. In this paper, we mainly deal with the hinge loss, which leads to the following so-called hinge risk and empirical hinge risk:
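Under the same conventions as above, these two quantities would read (a standard form of the hinge risk, written with the notation introduced above):

$$ R^h(M) = \mathbb{E}_{(X,Y) \sim \mathbf{P}}\left[ \left( 1 - Y M_X \right)_+ \right], \qquad r_n^h(M) = \frac{1}{n} \sum_{i=1}^{n} \left( 1 - Y_i M_{X_i} \right)_+, \qquad (u)_+ = \max(u, 0). $$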

Note that the hinge loss was also used by [28] in the 1-bit matrix completion problem, with a different approach leading to different algorithms. Moreover, here, we provide an analysis of the rate of convergence of our method, which is not provided in [28].

In contrast to other matrix completion works, the marginal distribution of the positions is not an issue and we do not restrict ourselves to a uniform sampling scheme. Another slight difference is that the observations are i.i.d., so the same index may be observed several times. Following standard notations in matrix completion, we define the set of indices of observed entries. We will use in the following the sub-sample of observed indices corresponding to a specified row, and its counterpart for a specified column.

2.2 Pseudo-Bayesian estimation

The Bayesian framework has been used several times for matrix completion [20, 25, 19]. A common idea in all these papers is to factorize the matrix into two parts in order to define a prior on low-rank matrices. Every matrix of rank $k$ can be factorized as the product of an $m_1 \times k$ matrix and the transpose of an $m_2 \times k$ matrix:

As mentioned in the introduction, this is motivated by the fact that we expect the Bayes matrix to be low-rank, or at least well approximated by a low-rank matrix. However, in practice, we do not know the rank of this matrix. So, we actually write $M = U V^T$ with $U$ of size $m_1 \times K$ and $V$ of size $m_2 \times K$, for some large enough $K$, and then seek adaptation with respect to a possibly smaller rank by shrinking some columns of $U$ and $V$ to zero. Specifically, we define the prior as the following hierarchical model:

where the prior distribution on the variances of the columns is yet to be specified. Usually, an inverse-Gamma distribution is used because it is conjugate in this model. This kind of hierarchical prior distribution is also very similar to the Bayesian Lasso developed in [23], and especially to the Bayesian Group Lasso developed in [17], in which the variance term is Gamma distributed. We will show that the Gamma distribution is a possible alternative in matrix completion, both for theoretical results and practical considerations. Thus, all the results in this paper are stated under the assumption that the prior on the variances is either a Gamma or an inverse-Gamma distribution.
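A sketch of a hierarchical prior in this spirit, where the symbols $\gamma_k$ for the column variances and $(a, b)$ for the hyperparameters are notation adopted here for illustration:

$$ M = U V^T, \qquad U_{i,k} \mid \gamma_k \sim \mathcal{N}(0, \gamma_k), \qquad V_{j,k} \mid \gamma_k \sim \mathcal{N}(0, \gamma_k) \quad \text{independently}, $$
$$ \gamma_k \sim \pi_\gamma, \qquad \pi_\gamma = \Gamma(a, b) \ \text{or} \ \Gamma^{-1}(a, b), \qquad k \in [K]. $$

Shrinking $\gamma_k$ towards zero switches off the $k$-th columns of $U$ and $V$, which is how adaptation to an unknown rank is obtained.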

Let $\theta$ denote the whole set of parameters, that is, the factors $U$ and $V$ together with the variances. As in PAC-Bayesian theory [10], we define the pseudo-posterior as follows:

where $\lambda > 0$ is a parameter to be fixed by the statistician. The calibration of $\lambda$ is discussed below. This distribution is close to a classical posterior distribution, but the log-likelihood has been replaced by the logarithm of a pseudo-likelihood based on the empirical hinge risk.
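In the usual PAC-Bayesian (Gibbs) form, writing $\pi$ for the prior just defined, $\theta = (U, V, \gamma)$ and $r_n^h$ for the empirical hinge risk, the pseudo-posterior can be sketched as:

$$ \hat{\rho}_{\lambda}(d\theta) \;\propto\; \exp\left\{ - \lambda \, r_n^h\!\left( U V^T \right) \right\} \pi(d\theta). $$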

2.3 Variational Bayes approximations

Unfortunately, the pseudo-posterior is intractable, and MCMC methods are too expensive because of the dimension of the parameter. A possible way to get an estimate is to seek an approximation of this distribution. A specific technique, known as Variational Bayes (VB) approximation, allows one to replace MCMC methods by efficient optimization algorithms [2]. First, we fix a subset of the set of all distributions on the parameter space. This class should be large enough to contain a good approximation of the pseudo-posterior, but not too large, so that it leads to tractable optimization problems. We usually define the VB approximation as the element of this class that minimizes the Kullback-Leibler divergence to the pseudo-posterior. However, when this divergence is not available in closed form, it is usual to replace it by an upper bound. We define here the class as follows:

where the first factors are Gaussian densities whose means and variances are free parameters, and where the last factor ranges over all possible probability distributions for the variances. Note that VB approximations are referred to as parametric when the class is finite dimensional and as mean-field otherwise; here we actually use a mixed approach. Informally, all the coordinates are independent, and the variational distribution of the entries of $U$ and $V$ is Gaussian. The free variational parameters to be optimized are the means and the variances, which can both be arranged in matrix form. We will show below that the optimization with respect to the distribution of the variances is available in closed form. Also, note that any distribution in this class is uniquely determined by these means and variances together with the distribution of the variances of the columns. We could make this dependence explicit in the notation, but it would be too cumbersome, so we avoid it as much as possible. Conversely, once a distribution in the class is given, these quantities are well defined.
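Concretely, the mixed parametric/mean-field class described above can be sketched as follows, where the means $m^U_{i,k}$, $m^V_{j,k}$ and variances $s^U_{i,k}$, $s^V_{j,k}$ are the free parameters and $q_\gamma$ is an arbitrary distribution on the variances (the symbols are notation adopted here):

$$ q(U, V, \gamma) = \prod_{i=1}^{m_1} \prod_{k=1}^{K} \mathcal{N}\!\left( U_{i,k};\, m^U_{i,k},\, s^U_{i,k} \right) \prod_{j=1}^{m_2} \prod_{k=1}^{K} \mathcal{N}\!\left( V_{j,k};\, m^V_{j,k},\, s^V_{j,k} \right) q_\gamma(\gamma). $$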

The Kullback-Leibler divergence between a distribution in this class and the pseudo-posterior decomposes as

(1)

for which we do not have a closed form, so we rather minimize an upper bound. We will see in the next sections that the resulting estimator actually has very good properties.
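For reference, for a Gibbs-type pseudo-posterior this decomposition typically takes the following standard form (a sketch; the last term does not depend on $q$):

$$ \mathcal{K}\!\left( q, \hat{\rho}_\lambda \right) = \lambda \, \mathbb{E}_{\theta \sim q}\!\left[ r_n^h\!\left( U V^T \right) \right] + \mathcal{K}(q, \pi) + \log \int \exp\left\{ - \lambda r_n^h \right\} d\pi, $$

so that minimizing the divergence over the class amounts to minimizing $\lambda \, \mathbb{E}_{\theta \sim q}[ r_n^h(UV^T) ] + \mathcal{K}(q, \pi)$. The difficulty is that the expectation of the hinge risk under the Gaussian factors has no simple closed form, hence the upper bound introduced in Definition 1.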

Definition 1.

Let and

Proposition 1.

For any in ,

(2)

We remind the reader that all the proofs are given in Section 8. The quantity defined above will be referred to as the Approximate Variational Bound (AVB) in the following. We are now able to define our estimator.

Definition 2.

We put

(3)

Also, for simplicity, given we will use the notation

3 PAC analysis of the variational approximation

Paper [1] proposes a general framework for analyzing the prediction properties of VB approximations of pseudo-posteriors based on PAC-Bayesian bounds. In this section, we apply this method to derive a control of the out-of-sample prediction risk of our approximation.

3.1 Empirical Bound

The first result is a so-called empirical bound: it provides an upper bound on the prediction risk of the pseudo-posterior that depends only on the data and on quantities chosen by the statistician.
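To fix ideas, empirical PAC-Bayesian bounds of this type generically have the following shape (a schematic statement, not the exact constants of Theorem 1): with probability at least $1 - \epsilon$, simultaneously for all $q$ in the variational class,

$$ \mathbb{E}_{\theta \sim q}\!\left[ R^h\!\left( U V^T \right) \right] \;\lesssim\; \mathbb{E}_{\theta \sim q}\!\left[ r_n^h\!\left( U V^T \right) \right] + \frac{ \mathcal{K}(q, \pi) + \log \frac{1}{\epsilon} }{ \lambda } + c \, \frac{\lambda}{n}, $$

for a constant $c$ that depends on boundedness assumptions. Up to constants, the right-hand side is the quantity minimized by the variational approximation, which is what makes such bounds directly exploitable.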

Theorem 1.

For any $\epsilon \in (0,1)$, with probability at least $1-\epsilon$ on the drawing of the sample,

We can actually deduce from this result a more explicit bound.

Corollary 1.

Assume that . For any $\epsilon \in (0,1)$, with probability at least $1-\epsilon$:

where the constant is explicitly known and depends only on the prior (Gamma or inverse-Gamma) and on its hyperparameters.

An exact value for this constant can be deduced from the proof. It is thus clear that the algorithm performs a trade-off between the fit to the data, through the empirical hinge risk term, and the rank of the candidate matrix.

3.2 Theoretical Bound

For this bound, it is common in classification to make an additional assumption on the distribution of the observations, which corresponds to an easier problem and therefore leads to better rates of convergence. We propose a definition adapted from [21].

Definition 3.

Mammen and Tsybakov’s margin assumption is satisfied when there is a constant such that:

It is known that if there is a constant such that the conditional probability of each label stays bounded away from 1/2 by that constant, then the margin assumption is satisfied with a constant that depends on it. For example, in the noiseless case, where the observed label coincides almost surely with the sign of the corresponding entry of the Bayes matrix, the margin assumption is satisfied with the most favorable constant.
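A standard way of writing such a margin (low-noise) condition in the present setting is the following (a sketch; the exact formulation in Definition 3 may differ): there is a constant $C$ such that, for every matrix $M$,

$$ \mathbb{E}\!\left[ \left( \mathbf{1}\{ Y M_X \le 0 \} - \mathbf{1}\{ Y M^*_X \le 0 \} \right)^2 \right] \;\le\; C \left[ R(M) - R(M^*) \right]. $$

In the noiseless case, $R(M^*) = 0$ and the left-hand side reduces to $R(M)$, so this form of the condition holds with $C = 1$.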

Theorem 2.

Assume that Mammen and Tsybakov's margin assumption is satisfied for a given constant. Then, for any $\epsilon \in (0,1)$ and for a suitable choice of $\lambda$, with probability at least $1-\epsilon$,

where the constant in the bound is known and depends only on the margin constant and on the choice of the prior on the variances.

Note the adaptive nature of this result, in the sense that the estimator does not depend on the unknown quantities appearing in the bound.

Corollary 2.

In the noiseless case, where the label coincides almost surely with the sign of the corresponding entry of the Bayes matrix, for any $\epsilon \in (0,1)$ and for a suitable choice of $\lambda$, with probability at least $1-\epsilon$,

(4)

where .

Remark 1.

Note that an empirical inequality roughly similar to Corollary 1 appears in [28]. However, no oracle inequality similar to Corollary 2 is derived in that paper, and, due to a slight modification in their definition of the empirical hinge risk, we believe that it would not be easy to derive such a bound from their techniques.

4 Algorithm

4.1 General Algorithm

Note that the minimization problem (3) that defines our VB approximation is not an easy one:

When the variances and the distribution of the column variances are held fixed, this is actually the canonical example of a so-called biconvex problem: the objective is convex with respect to the variational means of $U$, and with respect to those of $V$, but not with respect to the pair. Such problems are notoriously difficult. In this case, alternating blockwise optimization seems to be an acceptable strategy. While there is no guarantee that the algorithm will not get stuck in a local minimum (or even in a singular point that is not actually a minimum), it gives good results in practice, and no efficient alternative is available. See the discussion in Subsection 9.2, page 76, of [4] for more details on this problem. We update the blocks iteratively: for the parameters of $U$ and $V$ we use gradient steps, while an explicit minimization is available for the distribution of the variances. The details of the mean-field optimization (that is, with respect to the distribution of the variances) are given in Subsection 4.2. See Algorithm 1 for the general version of the algorithm.

Algorithm 1: Variational Approximation with Hinge Loss. Starting from an initial value of the variational parameters, repeat until the stopping criterion is met: perform gradient steps on the variational means and variances of the entries of $U$ and $V$, then update the variational distribution of the variances in closed form (Subsection 4.2).
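As an illustration only, the alternating scheme could be organized as in the following Python sketch. All names here (avb_surrogate, the learning rate, the quadratic penalty) are hypothetical placeholders: the real objective is the AVB of Definition 1 and the real updates are those described in Subsections 4.1 and 4.2.

import numpy as np

def hinge(u):
    # entrywise positive part (u)_+ = max(u, 0)
    return np.maximum(0.0, u)

def avb_surrogate(mU, mV, X, Y, lam):
    # Hypothetical stand-in for the AVB objective: lambda times the empirical
    # hinge risk evaluated at the variational means, plus a quadratic penalty
    # playing the role of the Kullback-Leibler term.
    preds = np.array([mU[i] @ mV[j] for (i, j) in X])
    return lam * np.mean(hinge(1.0 - Y * preds)) + 0.5 * (np.sum(mU ** 2) + np.sum(mV ** 2))

def fit(X, Y, m1, m2, K, lam=1.0, lr=0.01, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    mU = 0.1 * rng.standard_normal((m1, K))   # variational means of the rows of U
    mV = 0.1 * rng.standard_normal((m2, K))   # variational means of the rows of V
    n = len(Y)
    for _ in range(n_iter):
        # alternating/blockwise step: subgradient of the hinge term w.r.t. the means
        gU, gV = np.zeros_like(mU), np.zeros_like(mV)
        for (i, j), y in zip(X, Y):
            if y * (mU[i] @ mV[j]) < 1.0:     # hinge loss is active for this entry
                gU[i] -= lam * y * mV[j] / n
                gV[j] -= lam * y * mU[i] / n
        mU -= lr * (gU + mU)                  # "+ mU" is the gradient of the penalty
        mV -= lr * (gV + mV)
        # in the full algorithm, the closed-form update of the distribution of the
        # variances (Subsection 4.2) would be performed here; avb_surrogate can be
        # monitored to check that the objective decreases
    return mU, mV

# Prediction for entry (i, j): the sign of (mU @ mV.T)[i, j].

For instance, with X = [(0, 1), (2, 0), (1, 3)], Y = np.array([1.0, -1.0, 1.0]) and fit(X, Y, m1=3, m2=4, K=2), the predicted label for entry (i, j) is the sign of the (i, j) entry of the product of the two returned matrices.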

4.2 Mean Field Optimization

Note that the pseudo-likelihood does not involve the variance parameters, so their variational distribution can be optimized in the same way as in the model of [19], where the noise is Gaussian. The general update formula is as follows (the expectation being taken with respect to the whole variational distribution except the part involving the component being updated):

The solution then depends on the prior. In what follows we derive explicit formulas according to the choice of the prior on the variances: remember that it can be either a Gamma or an inverse-Gamma distribution.

4.2.1 Inverse-Gamma Prior

The conjugate prior for this part of the model is the inverse-Gamma distribution. The prior on each variance component is then an inverse-Gamma distribution, and its density is:

The moments we need to develop the algorithm and to compute the empirical bound are:

where $\psi$ is the digamma function. Therefore, we get:

so we can conclude that:

(5)
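For reference, with an inverse-Gamma prior $\Gamma^{-1}(a, b)$ on each column variance $\gamma_k$ (using the hierarchical prior sketched in Section 2.2), the standard conjugate mean-field update has the following form; whether it coincides exactly with (5) is an assumption here:

$$ q^*(\gamma_k) = \Gamma^{-1}\!\left( a + \frac{m_1 + m_2}{2}, \;\; b + \frac{1}{2} \left( \sum_{i=1}^{m_1} \mathbb{E}_q\!\left[ U_{i,k}^2 \right] + \sum_{j=1}^{m_2} \mathbb{E}_q\!\left[ V_{j,k}^2 \right] \right) \right), $$

with $\mathbb{E}_q[U_{i,k}^2] = (m^U_{i,k})^2 + s^U_{i,k}$ and similarly for $V$; the moments $\mathbb{E}_q[1/\gamma_k]$ and $\mathbb{E}_q[\log \gamma_k]$ needed above then follow from standard inverse-Gamma identities.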

4.2.2 Gamma Prior

Even though this fact does not seem to have been used in prior works on Bayesian matrix estimation, it is also possible to derive explicit formulas when the prior on the variances is a Gamma distribution. In this case, the variational update is given by

We remind the reader that the Generalized Inverse Gaussian (GIG) distribution is a three-parameter family of distributions over the positive real line. Its density is given by:

where $K_p$ is the modified Bessel function of the second kind.

The variational distribution is consequently a GIG distribution, with parameters:

The moment we need in order to compute the variational distribution of the entries of $U$ and $V$ is:
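For completeness, in one common parameterization $\mathrm{GIG}(p, a, b)$ the density and the moment typically needed here are (standard identities; the paper's parameterization may differ):

$$ f(x) = \frac{(a/b)^{p/2}}{2 K_p(\sqrt{ab})} \, x^{p-1} \exp\!\left\{ - \frac{1}{2} \left( a x + \frac{b}{x} \right) \right\}, \quad x > 0, \qquad \mathbb{E}\!\left[ \frac{1}{X} \right] = \frac{ \sqrt{a} \, K_{p+1}(\sqrt{ab}) }{ \sqrt{b} \, K_p(\sqrt{ab}) } - \frac{2p}{b}. $$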

5 Logistic Model

As mentioned in [30], the hinge loss is not the only possible convex relaxation of the 0-1 loss. The logistic loss can also be used (even though it might lead to a loss in the rate of convergence). This leads to the risks:

Note that the pseudo-likelihood then becomes exactly equal to the likelihood if $\lambda = n$ and we assume a logistic model, that is, a Bernoulli model whose success probability is the logistic transform of the corresponding entry of the parameter matrix. For the sake of coherence with previous sections, we still use the machine learning notations and write the likelihood accordingly. The prior distribution is exactly the same and the object of interest is the posterior distribution:

In order to deal with large matrices, it is still interesting to develop a variational Bayes algorithm. However, it is not as simple as in the quadratic loss model (see [19], in which the authors develop a mean-field approximation), because the logistic likelihood leads to intractable update formulas. A common way to deal with this model is to maximize another quantity that is very close to the one we are interested in. The principle, coming from [15], is well explained in [2], and an extended example can be found in [18].

We consider the mean-field approximation, so the approximation is sought among distributions that factorize over the parameters. We have the following decomposition, for every such distribution:

Since the log-evidence is fixed, minimizing the Kullback-Leibler divergence with respect to the variational distribution is the same as maximizing the evidence lower bound. Unfortunately, this quantity is intractable, but a lower bound, which corresponds to a Gaussian approximation, is much easier to optimize. We introduce an additional variational parameter for this purpose.

Proposition 2.

For all and for all ,

The algorithm is then straightforward: we optimize the lower bound with respect to the variational distribution and the additional parameters, and expect that, eventually, the approximation is not too far from the true posterior.
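The classical quadratic lower bound on the logistic log-likelihood, due to Jaakkola and Jordan, is the standard tool for this construction (it is stated here for reference; that it is exactly the content of Proposition 2 is an assumption): for all $x \in \mathbb{R}$ and all $\xi \neq 0$,

$$ \log \sigma(x) \;\ge\; \log \sigma(\xi) + \frac{x - \xi}{2} - \frac{\tanh(\xi/2)}{4 \xi} \left( x^2 - \xi^2 \right), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}, $$

with equality when $\xi = \pm x$. Since the right-hand side is quadratic in $x$, plugging it into the logistic likelihood yields Gaussian-like variational updates.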

5.1 Variational Bayes Algorithm

The lower bound is maximized with respect to the variational distribution by the mean-field algorithm. A direct calculation gives that the optimal distribution of each site (written with a star subscript) is given by:

As the lower bound is a quadratic form in the entries of $U$ and $V$, the variational distribution of each parameter is Gaussian, and a direct calculation gives:

The variational optimization for the variances is exactly the same as in the hinge loss model (with both possible prior distributions, Gamma and inverse-Gamma). The optimization of the additional variational parameters is given by:
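Under the quadratic bound above, the optimization over these additional parameters has the well-known closed form (a sketch; $\xi_{i,j}$ denotes the parameter attached to an observation at position $(i, j)$, a notation adopted here):

$$ \xi_{i,j}^{\,2} = \mathbb{E}_q\!\left[ \left( (U V^T)_{i,j} \right)^2 \right], $$

which is available in closed form from the variational means and variances of the rows $U_{i,\cdot}$ and $V_{j,\cdot}$.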

6 Empirical Results

The aim of this section is to compare our methods to other 1-bit matrix completion techniques. Despite the lack of risk bounds for GLM-based methods, we can expect that their performance will not be worse than that of our estimator in a general setting, and that they will be the reference when the data are generated according to the logistic model. It is worth noting that the low-rank decomposition does not involve the same matrix: in our model, it concerns the Bayes classifier matrix; in the logistic model, it concerns the parameter matrix. The estimator from our algorithm is the matrix of variational means, and we focus on the zero-one loss in prediction. We first test the performance on simulated matrices and then on a real data set. We compare the following three methods: (a) hinge loss with variational approximation (referred to as HL), (b) Bayesian logistic model with variational approximation (referred to as Logit), and (c) the frequentist logistic model from [13] (referred to as freq. Logit). The first two are tested with both Gamma and inverse-Gamma prior distributions. The hyperparameters are all tuned by cross-validation.

6.1 Simulated Matrices

The goal is to assess the models with different kinds of data generation. The general scheme of the simulations is as follows: the observations come from a matrix and we pick a fraction of its entries at random. In our algorithms, we set the number of columns of the factors to a small value for computational reasons, but the method works well with a larger value. The observations are generated as:

The noise term and the low-rank structure of the underlying matrix are specified for each setting. The predictions are directly compared to the noiseless matrix. Two types of matrices