Bayesian methods for low-rank matrix estimation: short survey and theoretical study

06/17/2013 ∙ by Pierre Alquier, et al.

The problem of low-rank matrix estimation has recently received a lot of attention due to challenging applications. A lot of work has been done on rank-penalized methods and convex relaxation, both on the theoretical and applied sides. However, only a few papers have considered Bayesian estimation. In this paper, we review the different types of priors considered on matrices to favour low rank. We also prove that the obtained Bayesian estimators, under suitable assumptions, enjoy the same optimality properties as the ones based on penalization.


1 Introduction

The problem of low-rank matrix estimation has recently received a lot of attention, due to challenging high-dimensional applications provided by recommender systems, see e.g. the NetFlix challenge [BL07]. Depending on the application, several different models are studied: matrix completion [CT09], reduced-rank regression [RV98], trace regression, e.g. [KLT11], quantum tomography, e.g. [ABH12], etc.

In all the above-mentioned papers, the authors consider estimators obtained by minimizing a criterion that is the sum of two terms: a measure of the quality of data fitting, and a penalization term added to avoid overfitting. The penalty is usually the rank of the matrix, as in [BSW11], or, for computational reasons, its nuclear norm, as in [CT09] (the nuclear norm is the sum of the singular values; it can be seen as a matrix equivalent of the vector $\ell_1$ norm). However, it is to be noted that only a few papers have considered Bayesian methods: we mention [Gew96] for a first study of reduced-rank regression in Bayesian econometrics, and more recently [LT07, LU09, SM08a] for matrix completion and reduced-rank regression (a more exhaustive bibliography is given below).
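To fix ideas, here is a minimal sketch (Python with NumPy; the variable names and sizes are ours, chosen for illustration) contrasting the two penalties: the rank counts the non-zero singular values, while the nuclear norm sums them.

    import numpy as np

    rng = np.random.default_rng(0)
    # Build an exactly rank-2 matrix as a product of thin factors.
    M = rng.normal(size=(20, 2)) @ rng.normal(size=(2, 30))

    s = np.linalg.svd(M, compute_uv=False)    # singular values (non-negative, sorted)
    rank_penalty = int(np.sum(s > 1e-8))      # rank = number of non-zero singular values
    nuclear_norm = float(np.sum(s))           # nuclear norm = sum of singular values (convex surrogate)
    print(rank_penalty, nuclear_norm)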

The objective of this paper is twofold: first, in Section 2 we provide a short survey of the priors that have been effectively used in various problems of low-rank estimation. We focus on two models, matrix completion and reduced-rank regression, but all the priors can be used in any model involving low-rank matrix estimation.

Then, in Section 3, we prove a theoretical result on the Bayesian estimator in the context of reduced-rank regression. It should be noted that, for an appropriate choice of the hyperparameters, the rate of convergence is the same as for penalized methods, up to logarithmic terms. The theoretical study in the context of matrix completion will be the object of future work.

2 Model and priors

In this section, we briefly introduce two models: reduced rank regression (Subsection 2.1) and matrix completion (Subsection 2.2). We then review the priors used in these models (Subsection 2.3).

2.1 Reduced rank regression

In the matrix regression model, we observe two matrices $X$ and $Y$ with
\[ Y = XB + \mathcal{E}, \]
where $X$ is an $n \times p$ deterministic matrix, $B$ is a $p \times m$ deterministic matrix and $\mathcal{E}$ is an $n \times m$ random matrix with $\mathbb{E}(\mathcal{E}) = 0$. The objective is to estimate the parameter matrix $B$. This model is sometimes referred to as multivariate linear regression, matrix regression or multitask learning. In many applications, it makes sense to assume that the matrix $B$ has low rank, i.e. $\operatorname{rank}(B) \ll \min(p, m)$. In this case, the model is known as reduced rank regression, and was studied as early as [And51, Ize75]. We refer the reader to the monograph [RV98] for a complete introduction.
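As an illustration, here is a minimal simulation of this model (a sketch; the dimensions $n$, $p$, $m$, the rank $r$ and the noise level are arbitrary choices of ours):

    import numpy as np

    rng = np.random.default_rng(1)
    n, p, m, r, sigma = 100, 15, 10, 2, 0.5                  # arbitrary sizes, with r << min(p, m)

    X = rng.normal(size=(n, p))                              # design matrix (simulated once, then treated as fixed)
    B = rng.normal(size=(p, r)) @ rng.normal(size=(r, m))    # low-rank parameter matrix, rank <= r
    E = sigma * rng.normal(size=(n, m))                      # i.i.d. centered Gaussian noise
    Y = X @ B + E                                            # observed response matrix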

Depending on the application the authors have in mind, additional assumptions are made on the distribution of the noise matrix $\mathcal{E}$:

  • the entries $\mathcal{E}_{i,j}$ of $\mathcal{E}$ are i.i.d., and their probability distribution is bounded, sub-Gaussian or Gaussian $\mathcal{N}(0, \sigma^2)$. In the Gaussian case, note that the likelihood of any matrix $B$ is given by
    \[ \mathcal{L}(B; Y) \propto \exp\left( - \frac{\| Y - XB \|_F^2}{2\sigma^2} \right), \]
    where we let $\|A\|_F$ denote the Frobenius norm, $\|A\|_F^2 = \sum_{i,j} A_{i,j}^2$ (see the sketch after this list);

  • as a generalization of the latter case, it is often assumed in econometrics papers that the rows of $\mathcal{E}$ are i.i.d. $\mathcal{N}(0, \Sigma)$ for some $m \times m$ variance-covariance matrix $\Sigma$.
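In the Gaussian case, the log-likelihood only involves the Frobenius norm of the residual matrix; here is a short sketch continuing the simulation above (B_candidate and sigma are hypothetical inputs):

    import numpy as np

    def gaussian_log_likelihood(Y, X, B_candidate, sigma):
        # Log-likelihood of B_candidate up to an additive constant,
        # under i.i.d. N(0, sigma^2) noise entries.
        residual = Y - X @ B_candidate
        return -np.sum(residual ** 2) / (2.0 * sigma ** 2)   # = -||Y - X B||_F^2 / (2 sigma^2)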

In order to estimate $B$, we have to specify a prior on $B$ and, depending on the assumptions on $\mathcal{E}$, a prior on $\sigma^2$ or on $\Sigma$. Note however that in most theoretical papers, it is assumed that the noise variance is known, or can be upper bounded, as in [BSW11]. This assumption is clearly a limitation, but it makes sense in some applications: see e.g. [ABH12] for quantum tomography (which can be seen as a special case of reduced rank regression).

In non-Bayesian studies, the estimator considered is usually obtained by minimizing the least-squares criterion penalized by the rank of the matrix [BSW11] or by the nuclear norm [YELM07]. In [BSW11], the estimator $\hat{B}$ obtained by this method is shown to satisfy, for some constant $C$, the oracle inequality of Corollary 6 p. 1290 in that paper.
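The precise statement is not reproduced here. Schematically, rank-penalized estimators in this model satisfy prediction bounds of the following flavour (a paraphrase under our notation, not the exact result of [BSW11], whose constants and exact quantities differ):
\[
\mathbb{E}\, \| X\hat{B} - XB \|_F^2 \;\le\; C\, \sigma^2\, \operatorname{rank}(B)\, \bigl( \operatorname{rank}(X) + m \bigr).
\]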

2.2 Matrix completion

In the problem of matrix completion, one observes noisy entries $Y_{i,j} = B_{i,j} + \mathcal{E}_{i,j}$ of an $m_1 \times m_2$ matrix $B$, for $(i,j)$ in a given set of indices $\mathcal{I}$. Here again, the noise matrix $\mathcal{E}$ satisfies $\mathbb{E}(\mathcal{E}) = 0$ and the objective is to recover $B$ under the assumption that $\operatorname{rank}(B) \ll \min(m_1, m_2)$. Note that under the assumption that the $\mathcal{E}_{i,j}$ are i.i.d. $\mathcal{N}(0, \sigma^2)$, the likelihood is given by
\[ \mathcal{L}(B; Y) \propto \exp\left( - \frac{1}{2\sigma^2} \sum_{(i,j) \in \mathcal{I}} (Y_{i,j} - B_{i,j})^2 \right). \]

In [CT09], this problem is studied without noise (i.e. $\mathcal{E} = 0$); the general case is studied, among others, in [CP09, CR09, Gro11].
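A minimal sketch of this observation scheme (Python/NumPy; the sampling fraction and sizes are arbitrary choices of ours):

    import numpy as np

    rng = np.random.default_rng(2)
    m1, m2, r, sigma = 50, 40, 3, 0.1

    B = rng.normal(size=(m1, r)) @ rng.normal(size=(r, m2))                 # low-rank matrix to recover
    observed = rng.random((m1, m2)) < 0.2                                   # index set I: about 20% of the entries
    Y = np.where(observed, B + sigma * rng.normal(size=(m1, m2)), np.nan)   # noisy entries on I, missing elsewhere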

Note that, recently, some authors studied the trace regression model, which includes linear regression, reduced-rank regression and matrix completion as special cases: see [RT11, Klo11, KLT11, Kol11]. To our knowledge, this model has not been considered from a Bayesian perspective until now, so we will mainly focus on reduced-rank regression and matrix completion in this paper. However, all the priors defined for reduced-rank regression can also be used in the more general trace regression setting.

2.3 Priors on (approximately) low-rank matrices

It appears that some econometrics models can actually be seen as special cases of reduced rank regression. Some of them were studied from a Bayesian perspective as early as the seventies; to our knowledge, these were the first Bayesian studies of reduced rank regression.

The first systematic treatment of the reduced rank model from a Bayesian perspective was carried out in [Gew96]. The idea of this paper is to write the matrix parameter as $B = UV^T$ for two matrices $U$ and $V$ of respective sizes $p \times K$ and $m \times K$, and to put a prior on $(U, V)$ rather than on $B$. Note that the rank of $UV^T$ is in any case at most $K$. So, choosing $K < \min(p, m)$ imposes a low-rank structure on the matrix $B$.

The prior in [Gew96] is given by
\[ \pi(U, V, \Sigma) = \pi(U, V)\, \pi(\Sigma), \]
where $\pi(U, V)$ is a Gaussian shrinkage on all the entries of the matrices: the entries of $U$ and $V$ are i.i.d. centered Gaussian with a common, fixed variance parameter. Then, $\pi(\Sigma)$ is an $m$-dimensional inverse-Wishart distribution with given degrees of freedom and matrix parameter.

Remark that this prior is particularly convenient, as it is then possible to give explicit forms for the marginal posteriors. This allows an implementation of the Gibbs algorithm to sample from the posterior. As the formulas are a bit cumbersome, we do not provide them here; the interested reader can find them in [Gew96].
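As an illustration, here is how one might draw from a prior of this type (a sketch with generic parameter names tau2, nu and Psi chosen by us; this is not Geweke's exact specification):

    import numpy as np
    from scipy.stats import invwishart

    rng = np.random.default_rng(3)
    p, m, K = 15, 10, 3            # choosing K < min(p, m) imposes rank(B) <= K
    tau2 = 1.0                     # common variance of the Gaussian shrinkage on U and V
    nu, Psi = m + 2, np.eye(m)     # inverse-Wishart degrees of freedom and matrix parameter

    U = np.sqrt(tau2) * rng.normal(size=(p, K))
    V = np.sqrt(tau2) * rng.normal(size=(m, K))
    B = U @ V.T                                                   # prior draw of the parameter matrix
    Sigma = invwishart(df=nu, scale=Psi).rvs(random_state=rng)    # prior draw of the noise covariance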

The weak point in this approach is that the question of the choice of the reduced rank $K$ is not addressed. It is possible to estimate $U$ and $V$ for any possible $K$ and to use Bayes factors for model selection, as in [KP02]. Numerical approximation and assessment of convergence for this method are provided by [CV04].

A more recent approach consists in fixing a large $K$, such as $K = \min(p, m)$, and then in calibrating the prior so that it naturally favours matrices with rank smaller than $K$ (or matrices really close to such matrices). To our knowledge, the first attempt in this direction is [LT07]. Note that this paper was actually about matrix completion rather than reduced rank regression but, once again, all the priors in this subsection can be used in both settings. Here again, we write $B = UV^T$, and the columns of $U$ and $V$ receive independent Gaussian priors with column-specific variances. In other words, if we write $U = (U_{\cdot,1}, \dots, U_{\cdot,K})$ and $V = (V_{\cdot,1}, \dots, V_{\cdot,K})$, then the $U_{\cdot,k}$ and $V_{\cdot,k}$ are independent and respectively $\mathcal{N}(0, \gamma_k^U I_p)$ and $\mathcal{N}(0, \gamma_k^V I_m)$, where $I_d$ is the identity matrix of size $d$. In order to understand the idea behind this prior, assume for one moment that the $\gamma_k^U$ and $\gamma_k^V$ are large for $k \le r$ and very small for $k > r$. Then, for $k > r$, $U_{\cdot,k}$ and $V_{\cdot,k}$ have entries close to $0$, and so $B = UV^T \simeq \sum_{k \le r} U_{\cdot,k} V_{\cdot,k}^T$. So, the matrix $B$ is close to a matrix that has rank at most $r$ (see the illustration below). In practice, the choice of the $\gamma_k^U$'s and $\gamma_k^V$'s is the main difficulty of this approach. Based on a heuristic, the authors proposed an estimation of these quantities that seems to perform well in practice. Remark that the authors assume that the noise entries are independent and that the noise variance is not modelled in the prior (but is still estimated from the data). They finally propose a variational Bayes approach to approximate the posterior.
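A small numerical illustration of this mechanism (a sketch; the variance profile below is our own arbitrary choice): columns drawn with rapidly decaying variances yield a matrix whose trailing singular values are negligible, i.e. an approximately low-rank matrix.

    import numpy as np

    rng = np.random.default_rng(4)
    p, m, K, r = 30, 20, 10, 3
    gamma = np.where(np.arange(K) < r, 1.0, 1e-6)     # large variances for k < r, tiny ones afterwards

    U = rng.normal(size=(p, K)) * np.sqrt(gamma)      # column k of U has variance gamma[k]
    V = rng.normal(size=(m, K)) * np.sqrt(gamma)      # column k of V has variance gamma[k]
    B = U @ V.T

    s = np.linalg.svd(B, compute_uv=False)
    print(s[:r + 2])                                  # the first r singular values dominate; the rest are ~0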

Very similar priors were used by [LU09] and in the PMF method (Probabilistic Matrix Factorisation) of [SM08a]. However, improved versions were proposed in [SM08b, ZWC10, BLMK11, PC10]: the authors propose a full Bayesian treatment of the problem by putting priors on the hyperparameters. We describe more precisely the prior in [SM08b]: the rows of $U$ and the rows of $V$ are independent and respectively Gaussian, with mean vectors and precision matrices that themselves receive Gaussian and Wishart priors. Here again, the hyperparameters of these Gaussian and Wishart distributions are to be specified. The priors in [ZWC10, BLMK11] are quite similar, and we give more details about the one in [BLMK11] in Section 3. In [SM08a, SM08b, ZWC10, BLMK11], the authors simulate from the posterior thanks to the Gibbs sampler (the posterior conditional distributions are explicitly provided, e.g. in [SM08b]). Alternatively, [LU09] uses stochastic gradient descent to approximate the MAP (maximum a posteriori).

Some papers proposed a kernelized version of the reduced rank regression and matrix completion models. Let $x_i$ denote the $i$-th row of $X$ and $y_i$ the $i$-th row of $Y$. Then, $Y = XB + \mathcal{E}$ leads to $y_i = x_i B + \varepsilon_i$. We can replace this relation by $y_i = f(x_i) + \varepsilon_i$ for some function $f$ in an RKHS with kernel $k$. In [YTS05], the authors propose a Bayesian formulation of this model: $f$ is seen as a Gaussian process with expectation zero and covariance function related to the kernel $k$. The same idea is refined in [YLZG09] and applied successfully to very large datasets, including the NetFlix challenge dataset, thanks to two algorithms: the Gibbs sampler, and the EM algorithm to approximate the MAP.

Finally, we want to mention the nice theoretical work [AW04, AW05]: in these papers, the authors study the asymptotic performance of Bayesian estimators in the reduced rank regression model under a general prior that has a compactly supported and infinitely differentiable density. Clearly, the priors mentioned above do not fit the compact support assumption. The question whether algorithmically tractable priors fit this assumption is, to our knowledge, still open. In Section 3, we propose a non-asymptotic analysis of the prior of [BLMK11].

3 Theoretical analysis

In this section, we provide a theoretical analysis of the Bayesian estimators obtained by using the idea of hierarchical priors of  [SM08b, ZWC10, BLMK11, PC10]. More precisely, we use exactly the prior of [BLMK11] and provide a theoretical result on the performance of the estimator in the reduced-rank regression model.

Several approaches are available to study the performance of Bayesian estimators: the asymptotic approach based on Bernstein–von Mises type theorems, see Chapter 10 in [vdV98], and a non-asymptotic approach based on PAC-Bayesian inequalities. PAC-Bayesian inequalities were introduced for classification by [STW97, McA98], but tighter bounds and extensions to regression estimation can be found in [Cat04, Cat07, DT08, DT12, DS12]. In all these approaches, the variance of the noise is assumed to be known or at least upper-bounded by a given constant, so we use this framework here. To our knowledge, this is the first application of PAC-Bayesian bounds to a matrix estimation problem.

3.1 Theorem

Following [BLMK11], we write $B = UV^T$ where $U$ is $p \times K$ and $V$ is $m \times K$, and then, conditionally on a diagonal matrix $\Gamma = \operatorname{diag}(\gamma_1, \dots, \gamma_K)$, the columns of $U$ and $V$ are independent Gaussian vectors, $U_{\cdot,k} \sim \mathcal{N}(0, \gamma_k I_p)$ and $V_{\cdot,k} \sim \mathcal{N}(0, \gamma_k I_m)$; the $\gamma_k$ are i.i.d. from a prior distribution whose hyperparameters are specified in Theorem 3.1 below.

We will make one of the following assumptions on the noise:

  • Assumption (A1): the entries of $\mathcal{E}$ are i.i.d. $\mathcal{N}(0, \sigma^2)$, and we know an upper bound for $\sigma^2$.

  • Assumption (A2): the entries of $\mathcal{E}$ are i.i.d. according to any distribution supported by a compact interval, with a density w.r.t. the Lebesgue measure and mean $0$, and we know an upper bound for the corresponding variance.

Note that (A1) and (A2) are special cases of the assumption in [DT08]; the interested reader can replace them by the more technical condition given there. We define the estimator
\[ \hat{B}_\lambda = \int B \, \hat{\rho}_\lambda(\mathrm{d}B), \]
where $\hat{\rho}_\lambda$ is the probability distribution given by
\[ \hat{\rho}_\lambda(\mathrm{d}B) \propto \exp\bigl( - \lambda \, \| Y - XB \|_F^2 \bigr) \, \pi(\mathrm{d}B). \]
Note that in the case where the entries of $\mathcal{E}$ are i.i.d. $\mathcal{N}(0, \sigma^2)$, this is exactly the Bayesian posterior when $\lambda = 1/(2\sigma^2)$, and so $\hat{B}_\lambda$ is the expectation under the posterior. However, for theoretical reasons, we have to consider a slightly smaller $\lambda$ to prove theoretical results.
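As a toy illustration of how one could approximate such an estimator in practice, here is a crude random-walk Metropolis sketch targeting a pseudo-posterior of this form over the factors $(U, V)$ (with a plain i.i.d. Gaussian prior on the entries chosen by us for simplicity, not the hierarchical prior of [BLMK11]; the function name and all tuning constants are ours):

    import numpy as np

    def pseudo_posterior_mean(Y, X, lam, K, tau2=1.0, n_iter=5000, step=0.02, seed=0):
        # Approximates the mean of rho_lambda(dB) proportional to
        # exp(-lam * ||Y - X U V^T||_F^2) pi(dU, dV),
        # with pi putting i.i.d. N(0, tau2) entries on U and V (illustration only).
        rng = np.random.default_rng(seed)
        p, m = X.shape[1], Y.shape[1]
        U, V = rng.normal(size=(p, K)), rng.normal(size=(m, K))

        def log_target(U, V):
            fit = -lam * np.sum((Y - X @ U @ V.T) ** 2)
            prior = -(np.sum(U ** 2) + np.sum(V ** 2)) / (2.0 * tau2)
            return fit + prior

        current = log_target(U, V)
        B_sum = np.zeros((p, m))
        for _ in range(n_iter):
            U_prop = U + step * rng.normal(size=U.shape)     # random-walk proposal on U
            V_prop = V + step * rng.normal(size=V.shape)     # and on V
            proposed = log_target(U_prop, V_prop)
            if np.log(rng.random()) < proposed - current:    # Metropolis accept/reject
                U, V, current = U_prop, V_prop, proposed
            B_sum += U @ V.T                                 # accumulate samples of B = U V^T
        return B_sum / n_iter                                # Monte Carlo estimate of the pseudo-posterior mean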

Theorem 3.1

Assume that either (A1) or (A2) is satisfied. Let us put and in the prior . For ,

Remark 3.1

Note that when all the entries of $B$ satisfy for some , . Moreover, let us assume that and that we can write $B = UV^T$ with and and . Assume that the noise is Gaussian. We get

where we remind that . When , we can see that we recover the same upper bound as in [BSW11], up to a logarithmic term. This rate (without the logarithmic term) is known to be optimal, see [BSW11] remark (ii) p. 1293 and [RT11]. However, the presence of the terms and can lead to suboptimal rates in less classical asymptotics where the dimensions would grow with the sample size. In the case of linear regression, a way to avoid these terms is to use heavy-tailed priors as in [DT08, DT12], or compactly supported priors as in [AL11]. However, it is not clear whether this approach would lead to feasible algorithms in matrix estimation problems. This question will be the object of future work.

Remark 3.2

We do not claim that this choice of the hyperparameters is optimal in practice. However, from the proof it is clear that our technique requires that they decrease with the dimension of $B$ as well as with the sample size to produce a meaningful bound. Note that in [BLMK11], there is no theoretical approach for this choice, but their simulation study tends to show that the corresponding hyperparameter must be very small for $B$ to be approximately low-rank.

Remark 3.3

In all the above-mentioned papers on PAC-Bayesian bounds, it is assumed that the variance of the noise is known, or upper-bounded by a known constant. More recently, [AC11] managed to prove PAC-Bayesian inequalities for regression with unknown variance. However, the approach is rather involved and it is not clear whether it can be used in our context. This question will also be addressed in future work.

3.2 Proof

First, we state the following result:

Theorem 3.2

Under (A1) or (A2), for any admissible $\lambda$, we have

where $\mathcal{K}(\rho, \pi)$ stands for the Kullback–Leibler divergence between $\rho$ and $\pi$: $\mathcal{K}(\rho, \pi) = \int \log\bigl( \frac{\mathrm{d}\rho}{\mathrm{d}\pi} \bigr) \mathrm{d}\rho$ if $\rho$ is absolutely continuous with respect to $\pi$, and $\mathcal{K}(\rho, \pi) = +\infty$ otherwise.
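The precise inequality is not reproduced here. For orientation, the PAC-Bayesian bounds of [DT08] used in this proof are, schematically, of the following form (a generic template written in our notation, not the exact statement, whose constants and conditions on $\lambda$ differ):
\[
\mathbb{E}\, \| X\hat{B}_\lambda - XB \|_F^2
\;\le\;
\inf_{\rho} \left\{ \int \| XB' - XB \|_F^2 \, \rho(\mathrm{d}B') \;+\; \frac{\mathcal{K}(\rho, \pi)}{\lambda} \right\},
\]
where the infimum is taken over all probability distributions $\rho$ absolutely continuous with respect to $\pi$.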

Proof of Theorem 3.2. Follow the proof of Theorem 1 in [DT08] and check that every step remains valid when the parameter is a matrix instead of a vector.

We are now ready to prove our main result.

Proof of Theorem 3.1. Let us introduce, for any , the probability distribution . According to Theorem 3.2 we have

(1)

Let us fix , and . The remaining steps of the proof are to upper-bound the two terms on the r.h.s. of (1). Both upper bounds will depend on ; we will optimize it after these steps to end the proof. We have

and, as and , it is easy to see that the integrals of the three scalar products vanish. So

(2)

Now, we deal with the second term:

We remind that and and let us denote the subset of such that for . We let denote the cardinality of , . Note that we have . For any let be the event

Then

(3)

By symmetry, we will only bound the first of these two terms. We have

(4)

We lower-bound the three factors in the integral in (4) separately. First, note that, on ,

(5)

On , and for :

where is the c.d.f. of . We use the classical inequality

to get:

and finally

(6)

Then, on , and for :

and so

(7)

We plug (5), (6) and (7) into (4) and we obtain:

Now, let us impose the following restrictions: so the last factor is . So we have:

So,

(8)

By symmetry,

(9)

and finally, plugging (8) and (9) into (3)

(10)

Finally, we can plug (2) and (10) into (1):

Let us put to get:

Finally, remember that the conditions of the theorem impose that , and . However, we have used so far that , and , and . Remember that , so all these conditions are compatible. We obtain: