# Provable Inductive Matrix Completion

Consider a movie recommendation system where, apart from the ratings information, side information such as a user's age or a movie's genre is also available. Unlike standard matrix completion, in this setting one should be able to predict inductively on new users/movies. In this paper, we study the problem of inductive matrix completion in the exact recovery setting. That is, we assume that the ratings matrix is generated by applying feature vectors to a low-rank matrix, and the goal is to recover the underlying matrix. Furthermore, we generalize the problem to that of low-rank matrix estimation using rank-1 measurements. We study this generic problem and provide conditions that the set of measurements should satisfy so that the alternating minimization method (which otherwise is a non-convex method with no convergence guarantees) is able to recover the exact underlying low-rank matrix. In addition to inductive matrix completion, we show that two other low-rank estimation problems can be studied in our framework: a) general low-rank matrix sensing using rank-1 measurements, and b) multi-label regression with missing labels. For both problems, we provide novel and interesting bounds on the number of measurements required by alternating minimization to provably converge to the exact low-rank matrix. In particular, our analysis for the general low-rank matrix sensing problem significantly improves upon the storage and computational cost required by the RIP-based matrix sensing methods [RechtFP2007]. Finally, we provide empirical validation of our approach and demonstrate that alternating minimization is able to recover the true matrix for the above mentioned problems using a small number of measurements.


## 1 Introduction

Motivated by the Netflix Challenge, recent research has addressed the problem of matrix completion, where the goal is to recover the underlying low-rank "ratings" matrix using a small number of observed entries of the matrix. However, the standard low-rank matrix completion formulation is applicable only to the transductive setting, i.e., predictions are restricted to the existing users/movies. Several real-world recommendation systems have useful side information available in the form of feature vectors for users as well as movies, and hence one should be able to make accurate predictions for new users and movies as well.

In this paper, we formulate and study the above mentioned problem, which we call inductive matrix completion: other than a small number of observations from the ratings matrix, the feature vectors for users/movies are also available. We formulate the problem as that of recovering a low-rank matrix $W_*$ using observed entries of the form $x_i^T W_* y_j$, where $x_i$, $y_j$ are the user/movie feature vectors. By factoring $W_* = U_* V_*^T$, we see that this scheme constitutes a bi-linear prediction $x^T U_* V_*^T y$ for a new user/movie pair $(x, y)$.

In fact, the above rank-1 measurement scheme also arises in several other important low-rank estimation problems, such as: a) general low-rank matrix sensing in the signal acquisition domain, and b) multi-label regression with missing information.

In this paper, we generalize the above three mentioned problems to the following low-rank matrix estimation problem that we call Low-Rank matrix estimation using Rank One Measurements (LRROM): recover the rank-$k$ matrix $W_*$ by using rank-1 measurements of the form:

$$b = [x_1^T W_* y_1 \;\; x_2^T W_* y_2 \;\; \dots \;\; x_m^T W_* y_m]^T,$$

where the $x_i$, $y_i$ are "feature" vectors and are provided along with the measurements $b$.
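To make the measurement model concrete, here is a minimal numpy sketch (sizes and variable names are ours, chosen for illustration): it generates a rank-$k$ matrix, takes $m$ rank-1 measurements $b_i = x_i^T W_* y_i$, and checks the equivalent trace form $b_i = \mathrm{Tr}(A_i^T W_*)$ with $A_i = x_i y_i^T$ used later in Section 2.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, k, m = 30, 25, 3, 500          # illustrative sizes, not from the paper

# Ground-truth rank-k matrix W* = U* V*^T.
W_star = rng.standard_normal((d1, k)) @ rng.standard_normal((k, d2))

# Rank-one measurements b_i = x_i^T W* y_i with Gaussian feature vectors.
X = rng.standard_normal((m, d1))
Y = rng.standard_normal((m, d2))
b = np.einsum("id,de,ie->i", X, W_star, Y)

# Equivalent trace form: b_i = Tr(A_i^T W*) with the rank-1 matrix A_i = x_i y_i^T.
A0 = np.outer(X[0], Y[0])
assert np.isclose(b[0], np.trace(A0.T @ W_star))
```

Note that storing the operator only requires keeping the vectors $x_i, y_i$, never the full matrices $A_i$.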

Now, given the measurements $b$ and the feature vectors $x_i$, $y_i$, a canonical way to recover $W_*$ is to find a rank-$k$ matrix $W$ such that $\|\mathcal{A}(W) - b\|_2$ is small. While the objective function of this problem is simple least squares, the non-convex rank constraint makes it NP-hard, in general, to solve. In the existing literature, there are two common approaches to handle such low-rank problems: a) use the trace norm as a convex proxy for the rank constraint and solve the resulting non-smooth convex optimization problem, or b) parameterize $W$ as $W = UV^T$ and then alternately optimize over $U$ and $V$.

The first approach has been shown to be successful for a variety of problems such as matrix completion [2, 3, 4, 5], general low-rank matrix sensing [1], robust PCA [6, 7], etc. However, the resulting convex optimization methods require computation of the full SVD of matrices with potentially large rank and hence do not scale to large problems. On the other hand, alternating minimization and its variants need to solve only least squares problems and hence are scalable in practice, but they might get stuck in a local minimum. However, [8] recently showed that under a standard set of assumptions, alternating minimization actually converges at a linear rate to the global optimum of two low-rank estimation problems: a) general low-rank matrix sensing with RIP measurements, and b) low-rank matrix completion.

Motivated by its empirical as well as theoretical success, we study a variant of alternating minimization (with appropriate initialization) for the above mentioned LRROM problem. To analyze our general LRROM problem, we present three key properties that a rank-1 measurement operator should satisfy. Assuming these properties, we show that the alternating minimization method converges to the global optimum of LRROM at a linear rate. We then study the three problems individually and show that, for each of them, the measurement operator indeed satisfies the conditions required by our general analysis; hence, for each problem, alternating minimization converges to the global optimum at a linear rate. Below, we briefly describe the three application problems that we study and our high-level result for each:
(a) Efficient matrix sensing using Gaussian measurements: In this problem, the vectors $x_i$ and $y_i$ are sampled from a sub-Gaussian distribution and the goal is efficient acquisition and recovery of the rank-$k$ matrix $W_*$. Here, we show that if the number of measurements is nearly linear in $d_1 + d_2$ (with polynomial dependence on $k$ and the condition number $\beta$ of $W_*$), then with high probability (w.h.p.), our alternating minimization based method recovers $W_*$ in linear time.

Note that the problem of low-rank matrix sensing has been considered by several existing methods [1, 9, 10]; however, most of these methods require the measurement operator to satisfy the Restricted Isometry Property (RIP) (see Definition 2). Typically, RIP operators are constructed by sampling from distributions with bounded fourth moments, and the number of measurements they require to satisfy RIP for a constant rank is similar to the number of samples required by our method.

Moreover, RIP based operators are typically dense, have a large memory footprint, and make the algorithm computationally intensive. For example, assuming the rank to be constant, RIP based operators require an order of magnitude more storage and computational time than the rank-1 measurement operators. However, a drawback of such rank-1 measurements is that, unlike RIP based operators, they are not universal, i.e., a new set of vectors $(x_i, y_i)$ needs to be sampled for any given signal $W_*$.
(b) Inductive Matrix Completion: As motivated earlier, consider a movie recommendation system with $n_1$ users and $n_2$ movies. Let $X$ and $Y$ be the feature matrices of the users and the movies, respectively. Then, the rating of movie $j$ by user $i$ can be modeled as $x_i^T W_* y_j$, and the goal is to learn $W_*$ using a small number of random ratings indexed by the set of observations $\Omega$. Note that standard matrix completion is the special case of this problem in which $X$ and $Y$ are identity matrices. Also, unlike standard matrix completion, accurate ratings can be predicted for users who have not rated any prior movies, and vice versa.

If the feature matrices $X, Y$ are incoherent and the number of observed entries is large enough, then inductive matrix completion satisfies the conditions required by our generic method, and hence the global optimality result follows directly. Note that our analysis requires a number of samples quadratic in the number of features (assuming the rank to be a constant) for recovery. On the other hand, applying standard matrix completion would require a number of samples that grows with the number of users and movies. Hence, our analysis provides a significant improvement when the number of features is significantly smaller than the total number of users and movies.
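The observation model of problem (b) can be sketched in a few lines (all sizes and names here are ours, for illustration only): every observed rating is exactly a rank-1 measurement of $W_*$, with the user feature vector playing the role of $x$ and the movie feature vector the role of $y$.

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2, d1, d2, k = 50, 40, 8, 6, 2     # users, movies, feature dims, rank

X = rng.standard_normal((n1, d1))       # user feature matrix
Y = rng.standard_normal((n2, d2))       # movie feature matrix
W_star = rng.standard_normal((d1, k)) @ rng.standard_normal((k, d2))
R = X @ W_star @ Y.T                    # full ratings matrix (only a few entries observed)

# An observed rating (i, j) is the rank-one measurement x_i^T W* y_j:
i, j = 7, 3
assert np.isclose(R[i, j], X[i] @ W_star @ Y[j])
```

Setting $X$ and $Y$ to identity matrices recovers standard matrix completion, where each measurement reveals a single entry of $W_*$.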
(c) Multi-label Regression with Missing Data: Consider a multi-variate regression problem where the goal is to predict a set of (correlated) target variables $y$ for a given input $x$. We model this as a regression problem with low-rank parameters, i.e., $y = W_*^T x$ where $W_*$ is a low-rank matrix. Given training data points $X$ and the associated target matrix $Y$, $W_*$ can be learned using simple least squares regression. However, in most real-world applications several of the entries of $Y$ are missing, and the goal is to learn $W_*$ "exactly" despite the missing labels.

Now, let the set of known entries of $Y$ be sampled uniformly at random. Then we show that, by observing only a small number of entries, alternating minimization recovers $W_*$ exactly. Note that a direct approach to this problem is to first recover the label matrix $Y$ using standard matrix completion and then learn $W_*$ from the completed label matrix. Such a two-stage method requires a number of samples that grows with the number of training points. In contrast, our more unified approach requires fewer samples. Hence, if the number of training points is much larger than the number of labels, then our method provides a significant improvement over first completing the label matrix and then learning the true low-rank matrix.
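The reduction of problem (c) to rank-one measurements can also be sketched directly (illustrative sizes and names are ours): an observed label $Y_{ij}$ is a rank-1 measurement $x_i^T W_* e_j$ of the parameter matrix, where $e_j$ is the $j$-th canonical basis vector.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, L, k = 100, 10, 12, 2        # points, features, labels, rank (illustrative)

X = rng.standard_normal((n, d))
W_star = rng.standard_normal((d, k)) @ rng.standard_normal((k, L))
Y_lab = X @ W_star                 # full label matrix; in practice many entries are missing

# A known label (i, j) is a rank-one measurement with y = e_j:
i, j = 4, 9
e_j = np.zeros(L); e_j[j] = 1.0
assert np.isclose(Y_lab[i, j], X[i] @ W_star @ e_j)
```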

We would like to stress that the above mentioned problems of inductive matrix completion and multi-label regression with missing labels have recently received a lot of attention from the machine learning community [11, 12]. However, to the best of our knowledge, our results are the first theoretically rigorous results that improve upon the sample complexity of first completing the target/ratings matrix and then learning the parameter matrix $W_*$.

Related Work: Low-rank matrix estimation problems are pervasive and have innumerable real-life applications. Popular examples include PCA, robust PCA, non-negative matrix approximation, low-rank matrix completion, and low-rank matrix sensing. While low-rank matrix estimation subject to given (affine) observations is NP-hard in general, several recent results present conditions under which the optimal solution can be recovered exactly or approximately [2, 1, 3, 13, 7, 6, 8, 9].

Of the above mentioned low-rank matrix estimation problems, the most relevant to ours are matrix completion [2, 5, 8] and general matrix sensing [1, 9, 10]. The matrix completion problem is restricted to a given set of users and movies and hence does not generalize to new users/movies. On the other hand, matrix sensing methods require the measurement operator to satisfy the RIP condition which, at least for the current constructions, necessitates measurement matrices that have full rank and a large number of random bits, and hence high storage as well as computational cost [1]. Our work on general low-rank matrix estimation (problem (a) above) alleviates this issue, as our measurements are only rank-1 and hence the low-rank signal can be encoded as well as decoded much more efficiently. Moreover, our result for inductive matrix completion generalizes the matrix completion work and provides, to the best of our knowledge, the first theoretical results for the problem of inductive matrix completion.

Paper Organization: We formally introduce the problem of low-rank matrix estimation with rank-one measurements in Section 2. We provide our version of the alternating minimization method and then present a generic analysis of alternating minimization when applied to such rank-one-measurement based problems. Our results distill out certain key problem specific properties that imply global optimality of alternating minimization. In the subsequent Sections 3, 4, and 5, we show that for each of our three problems (mentioned above) the required problem specific properties are satisfied, and hence our alternating minimization method provides a globally optimal solution. Finally, we provide empirical validation of our methods in Section 6.

## 2 Low-rank Matrix Estimation using Rank-one Measurements

Let $\mathcal{A}: \mathbb{R}^{d_1 \times d_2} \rightarrow \mathbb{R}^m$ be a linear measurement operator parameterized by $m$ matrices, i.e., $\mathcal{A} = \{A_1, A_2, \dots, A_m\}$ where $A_i \in \mathbb{R}^{d_1 \times d_2}$. Then, the linear measurements of a given matrix $W$ are given by:

$$\mathcal{A}(W) = [\mathrm{Tr}(A_1^T W) \;\; \mathrm{Tr}(A_2^T W) \;\; \dots \;\; \mathrm{Tr}(A_m^T W)]^T, \quad (1)$$

where $\mathrm{Tr}(\cdot)$ denotes the trace operator.

In this paper, we mainly focus on rank-1 measurement operators, i.e., $A_i = x_i y_i^T$ where $x_i \in \mathbb{R}^{d_1}$, $y_i \in \mathbb{R}^{d_2}$. Also, let $W_*$ be a rank-$k$ matrix with singular value decomposition (SVD) $W_* = U_* \Sigma_* V_*^T$.

Then, given $b = \mathcal{A}(W_*)$, the goal of the LRROM problem is to recover $W_*$ efficiently. This problem can be reformulated as the following non-convex optimization problem:

$$\min_{W:\ \mathrm{rank}(W) \le k}\ \|\mathcal{A}(W) - b\|_2^2. \quad (2)$$

Note that the $W$ to be recovered is restricted to have rank at most $k$ and hence can be re-written as $W = UV^T$, with $U \in \mathbb{R}^{d_1 \times k}$ and $V \in \mathbb{R}^{d_2 \times k}$.

We use the standard alternating minimization algorithm with appropriate initialization to solve the above problem (2) (see Algorithm 1). Note that the above problem is non-convex in $(U, V)$, and hence standard analysis would only ensure convergence to a local minimum. However, [8] recently showed that the alternating minimization method in fact converges to the global minimum of two low-rank estimation problems: matrix sensing with RIP matrices and matrix completion.

The rank-one operator $\mathcal{A}$ given above does not satisfy RIP (see Definition 2), even when the vectors $x_i, y_i$ are sampled from the normal distribution (see Claim 3). Furthermore, each measurement need not reveal exactly one entry of $W_*$, as is the case in matrix completion. Hence, the proof of [8] does not apply directly. However, inspired by the proof of [8], we distill out three key properties that the operator $\mathcal{A}$ should satisfy so that alternating minimization converges to the global optimum.
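Since Algorithm 1 itself is not reproduced in this text, the following is a minimal numpy sketch of the alternating minimization scheme under our own naming (it omits the fresh-sample partitioning used in the analysis). The initialization takes the top-$k$ left singular vectors of $S = \frac{1}{m}\sum_i b_i x_i y_i^T$, and each alternating step is an ordinary least squares problem because $x_i^T U V^T y_i$ is linear in one factor when the other is fixed.

```python
import numpy as np

def altmin_lrrom(X, Y, b, k, iters=50):
    """Sketch of AltMin for LRROM: b_i = x_i^T W y_i, W of rank k.
    X: (m, d1) rows x_i; Y: (m, d2) rows y_i; b: (m,)."""
    m, d1 = X.shape
    d2 = Y.shape[1]
    # Initialization: top-k left singular vectors of S = (1/m) sum_i b_i x_i y_i^T.
    S = (X * b[:, None]).T @ Y / m
    U = np.linalg.svd(S)[0][:, :k]
    for _ in range(iters):
        # Fix U, solve for V: b_i = (U^T x_i)^T (V^T y_i) is linear in vec(V).
        P = X @ U                                        # rows are U^T x_i
        M = np.einsum("il,ij->ilj", Y, P).reshape(m, d2 * k)
        V = np.linalg.lstsq(M, b, rcond=None)[0].reshape(d2, k)
        # Fix V, solve for U symmetrically.
        Q = Y @ V                                        # rows are V^T y_i
        M = np.einsum("il,ij->ilj", X, Q).reshape(m, d1 * k)
        U = np.linalg.lstsq(M, b, rcond=None)[0].reshape(d1, k)
    return U @ V.T

# Usage sketch: recovery from noiseless Gaussian rank-one measurements.
rng = np.random.default_rng(1)
d1, d2, k, m = 20, 15, 2, 800
W_star = np.linalg.qr(rng.standard_normal((d1, k)))[0] @ np.diag([2.0, 1.0]) @ \
         np.linalg.qr(rng.standard_normal((d2, k)))[0].T
X = rng.standard_normal((m, d1))
Y = rng.standard_normal((m, d2))
b = np.einsum("id,de,ie->i", X, W_star, Y)
W_hat = altmin_lrrom(X, Y, b, k)
```

Each least squares step here costs only a thin `lstsq` solve; no full SVD of a $d_1 \times d_2$ matrix is needed after initialization.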

###### Theorem 1.

Let $W_* = U_* \Sigma_* V_*^T$ be a rank-$k$ matrix with $k$ non-zero singular values $\sigma_1^* \ge \sigma_2^* \ge \dots \ge \sigma_k^*$. Also, let $\mathcal{A}$ be a linear measurement operator parameterized by $m$ matrices, i.e., $\mathcal{A} = \{A_1, A_2, \dots, A_m\}$ where $A_i = x_i y_i^T$. Let $\mathcal{A}(W)$ be as given by (1).

Now, let $\mathcal{A}$ satisfy the following properties with parameter $\delta$ ($\delta \le 1/100$):

1. Initialization: $\|\frac{1}{m}\sum_i b_i A_i - W_*\|_2 \le \|W_*\|_2 \cdot \delta$.

2. Concentration of operators $B_x, B_y$: Let $B_x = \frac{1}{m}\sum_i (x_i^T u)^2\, y_i y_i^T$ and $B_y = \frac{1}{m}\sum_i (y_i^T v)^2\, x_i x_i^T$, where $u, v$ are two unit vectors that are independent of randomness in $\mathcal{A}$. Then the following holds: $\|B_x - I\|_2 \le \delta$ and $\|B_y - I\|_2 \le \delta$.

3. Concentration of operators $G_x, G_y$: Let $G_x = \frac{1}{m}\sum_i (x_i^T u)(x_i^T \tilde{u})\, y_i y_i^T$ and $G_y = \frac{1}{m}\sum_i (y_i^T v)(y_i^T \tilde{v})\, x_i x_i^T$, where $u, \tilde{u}, v, \tilde{v}$ are unit vectors, s.t., $u^T \tilde{u} = 0$ and $v^T \tilde{v} = 0$. Furthermore, let $u, \tilde{u}, v, \tilde{v}$ be independent of randomness in $\mathcal{A}$. Then, $\|G_x\|_2 \le \delta$ and $\|G_y\|_2 \le \delta$.

Then, after $H$ iterations of the alternating minimization method (Algorithm 1), we obtain $W^H = U^H (V^H)^T$ s.t. $\|W^H - W_*\|_2 \le \epsilon$, where $H = O(\log(\|W_*\|_F/\epsilon))$.

###### Proof.

We explain the key ideas of the proof by first presenting it for the special case of a rank-1 matrix $W_* = \sigma_* u_* v_*^T$. Later, in Appendix B, we extend the proof to the general rank-$k$ case.

Similar to [8], we first characterize the $(h+1)$-th step iterate $\hat{v}^{h+1}$ of Algorithm 1 and its normalized form $v^{h+1} = \hat{v}^{h+1}/\|\hat{v}^{h+1}\|_2$.

Now, $\hat{v}^{h+1}$ is obtained by setting the gradient of (2) w.r.t. $v$ to zero while keeping $u^h$ fixed. That is,

$$\sum_{i=1}^m \big(b_i - x_i^T u^h (\hat{v}^{h+1})^T y_i\big)(x_i^T u^h)\, y_i = 0,$$
$$\text{i.e., } \sum_{i=1}^m (u^{hT} x_i)\, y_i \big(\sigma_*\, y_i^T v_*\, u_*^T x_i - y_i^T \hat{v}^{h+1}\, u^{hT} x_i\big) = 0,$$
$$\text{i.e., } \Big(\sum_{i=1}^m (x_i^T u^h u^{hT} x_i)\, y_i y_i^T\Big)\hat{v}^{h+1} = \sigma_* \Big(\sum_{i=1}^m (x_i^T u^h u_*^T x_i)\, y_i y_i^T\Big) v_*,$$
$$\text{i.e., } \hat{v}^{h+1} = \sigma_*(u_*^T u^h)\, v_* - \sigma_* B^{-1}\big((u_*^T u^h) B - \tilde{B}\big) v_*, \quad (3)$$

where

$$B = \frac{1}{m}\sum_{i=1}^m (x_i^T u^h u^{hT} x_i)\, y_i y_i^T, \qquad \tilde{B} = \frac{1}{m}\sum_{i=1}^m (x_i^T u^h u_*^T x_i)\, y_i y_i^T.$$

Note that (3) shows that $\hat{v}^{h+1}$ is a perturbation of $v_*$; the goal now is to bound the spectral norm of the perturbation term $G$:

$$\|G\|_2 = \big\|B^{-1}\big((u_*^T u^h) B - \tilde{B}\big) v_*\big\|_2 \le \|B^{-1}\|_2\, \big\|(u_*^T u^h) B - \tilde{B}\big\|_2\, \|v_*\|_2. \quad (4)$$

Now, using Property 2 mentioned in the theorem, we get:

$$\|B - I\|_2 \le 1/100, \ \text{ i.e., } \ \sigma_{\min}(B) \ge 1 - 1/100, \ \text{ i.e., } \ \|B^{-1}\|_2 \le 1/(1 - 1/100). \quad (5)$$

Now,

$$(u_*^T u^h) B - \tilde{B} = \frac{1}{m}\sum_{i=1}^m y_i y_i^T\, x_i^T \big((u_*^T u^h)\, u^h u^{hT} - u_* u^{hT}\big) x_i = \frac{1}{m}\sum_{i=1}^m y_i y_i^T\, x_i^T (u^h u^{hT} - I)\, u_* u^{hT} x_i,$$
$$\big\|(u_*^T u^h) B - \tilde{B}\big\|_2 \overset{\zeta_1}{\le} \frac{1}{100}\big\|(u^h u^{hT} - I) u_*\big\|_2 \|u^h\|_2 = \frac{1}{100}\sqrt{1 - (u^{hT} u_*)^2}, \quad (6)$$

where $\zeta_1$ follows by observing that $(u^h u^{hT} - I) u_*$ and $u^h$ are orthogonal vectors and then using Property 3 given in Theorem 1. Hence, using (5) and (6) along with (4), we get:

$$\|G\|_2 \le \frac{1}{99}\sqrt{1 - (u^{hT} u_*)^2}. \quad (7)$$

We are now ready to lower bound the component of $\hat{v}^{h+1}$ along the correct direction $v_*$ and upper bound the component of $\hat{v}^{h+1}$ that is perpendicular to the optimal direction $v_*$.

Now, by left-multiplying (3) by $v_*$ and using (7), we obtain:

$$v_*^T \hat{v}^{h+1} = \sigma_*(u^{hT} u_*) - \sigma_*\, v_*^T G \ge \sigma_*(u^{hT} u_*) - \frac{\sigma_*}{99}\sqrt{1 - (u^{hT} u_*)^2}. \quad (8)$$

Similarly, by multiplying (3) by $v_*^\perp$, where $v_*^\perp$ is a unit norm vector orthogonal to $v_*$, we get:

$$\langle v_*^\perp, \hat{v}^{h+1}\rangle \le \frac{\sigma_*}{99}\sqrt{1 - (u^{hT} u_*)^2}. \quad (9)$$

Using (8), (9), and $v^{h+1} = \hat{v}^{h+1}/\|\hat{v}^{h+1}\|_2$, we get:

$$1 - (v^{(h+1)T} v_*)^2 = \frac{\langle v_*^\perp, \hat{v}^{h+1}\rangle^2}{\langle v_*, \hat{v}^{h+1}\rangle^2 + \langle v_*^\perp, \hat{v}^{h+1}\rangle^2} \le \frac{1 - (u^{hT} u_*)^2}{99 \cdot 99 \cdot \big(u^{hT} u_* - \frac{1}{99}\sqrt{1 - (u^{hT} u_*)^2}\big)^2 + \big(1 - (u^{hT} u_*)^2\big)}. \quad (10)$$

Also, using Property 1 of Theorem 1 with $S = \frac{1}{m}\sum_i b_i x_i y_i^T$, we get $\|S - W_*\|_2 \le \sigma_*/100$. Moreover, by multiplying $S$ by $u^0$ on the left and $v^0$ on the right and using the fact that $u^0, v^0$ are the largest singular vectors of $S$, we get $u^{0T} S v^0 \ge (1 - 1/100)\sigma_*$. Hence, $1 - (u^{0T} u_*)^2 \le 1/2$.

Using (10) along with the above observation and the "inductive" assumption $1 - (u^{hT} u_*)^2 \le 1/2$ (the proof of the inductive step follows directly from the equation below), we get:

$$1 - (v^{(h+1)T} v_*)^2 \le \frac{1}{2}\big(1 - (u^{hT} u_*)^2\big). \quad (11)$$

Similarly, we can show that $1 - (u^{(h+1)T} u_*)^2 \le \frac{1}{2}\big(1 - (v^{(h+1)T} v_*)^2\big)$. Hence, after $H = O(\log(\sigma_*/\epsilon))$ iterations, we obtain $W^H$ s.t. $\|W^H - W_*\|_2 \le \epsilon$. ∎

Note that we require the intermediate iterates $u^h, v^h$ to be independent of the randomness in the $(x_i, y_i)$'s. Hence, we partition the measurements into $2H+1$ disjoint partitions, and at each step a fresh partition is supplied to the algorithm. This implies that the total measurement complexity of the algorithm is a factor $2H+1 = O(\log(\|W_*\|_F/\epsilon))$ larger than the per-iteration sample size $m$. That is, given that many samples, we can estimate a matrix $W^H$ s.t. $\|W_* - W^H\|_2 \le \epsilon$, for any desired accuracy $\epsilon > 0$.
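The partitioning step above can be sketched as follows (a toy illustration with our own names): the measurements are split into $2H+1$ disjoint blocks, one for initialization and one fresh block for each of the $2H$ factor updates, so that every update sees measurements independent of the current iterate.

```python
import numpy as np

m_total, H = 2100, 10                 # illustrative totals
blocks = np.array_split(np.arange(m_total), 2 * H + 1)

init_block = blocks[0]                # used only to form S and its top singular vectors
update_blocks = blocks[1:]            # one fresh block per U-update / V-update

assert len(update_blocks) == 2 * H
assert sum(len(blk) for blk in blocks) == m_total
# Blocks are pairwise disjoint:
assert len(np.concatenate(blocks)) == len(set(np.concatenate(blocks)))
```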

## 3 Rank-one Matrix Sensing using Gaussian Measurements

In this section, we study the problem of sensing general low-rank matrices, an important problem in the domain of signal acquisition [1] with applications in a variety of areas like control theory, computer vision, etc. For this problem, the goal is to design the measurement matrices as well as the recovery algorithm, so that the true low-rank signal can be recovered from the given linear measurements.

Consider a measurement operator $\mathcal{A}_{Gauss} = \{A_1, \dots, A_m\}$ where each measurement matrix is sampled using the normal distribution, i.e., $A_i = x_i y_i^T$ with $x_i \sim N(0, I_{d_1})$ and $y_i \sim N(0, I_{d_2})$. Now, for this operator $\mathcal{A}_{Gauss}$, we show that if $m$ is sufficiently large (nearly linear in $d_1 + d_2$), then w.h.p. any fixed rank-$k$ matrix $W_*$ can be recovered by AltMin-LRROM (Algorithm 1). Here $\beta = \sigma_1^*/\sigma_k^*$ is the condition number of $W_*$. That is, using a nearly linear number of measurements in $d_1 + d_2$, one can exactly recover the rank-$k$ matrix $W_*$.

Note that several similar recovery results for the matrix sensing problem already exist in the literature that guarantee exact recovery using roughly the same number of measurements [1, 10, 9]. However, we would like to stress that all of the above mentioned existing results assume that the measurement operator satisfies the Restricted Isometry Property (RIP), defined below:

###### Definition 2.

A linear operator $\mathcal{A}$ satisfies RIP iff, for all $W$ s.t. $\mathrm{rank}(W) \le k$, the following holds:

$$(1 - \delta_k)\|W\|_F^2 \le \|\mathcal{A}(W)\|_2^2 \le (1 + \delta_k)\|W\|_F^2,$$

where $\delta_k$ is a constant dependent only on $k$.

Most current constructions of RIP matrices require each $A_i$ to be sampled from a zero-mean distribution with bounded fourth moment, which implies that the $A_i$ have almost full rank. That is, such operators require $O(m\, d_1 d_2)$ memory just to store the operator, i.e., the storage requirement is cubic in the dimension when $m$ is linear in it. Consequently, signal acquisition as well as recovery time for these algorithms is also at least cubic in the dimension. In contrast, our proposed rank-1 measurements require only $O(m(d_1 + d_2))$ storage and computational time. Hence, the proposed method makes signal acquisition as well as signal recovery at least an order of magnitude faster.

Naturally, this begs the question whether our rank-1 measurement operator satisfies RIP, so that the existing analysis for RIP based low-rank matrix sensing could be used [8]. We answer this question in the negative: for $m$ that is at most linear in $d_1, d_2$, $\mathcal{A}_{Gauss}$ does not satisfy RIP even for rank-1 matrices (with high probability):

###### Claim 3.

Let $\mathcal{A}_{Gauss} = \{A_1, \dots, A_m\}$ be a measurement operator with each $A_i = x_i y_i^T$, where $x_i \sim N(0, I_{d_1})$ and $y_i \sim N(0, I_{d_2})$. Let $m$ be at most linear in $d_1, d_2$. Then, with high probability, $\mathcal{A}_{Gauss}$ does not satisfy RIP for rank-1 matrices with a constant $\delta$.

###### Proof of Claim 3.

The main idea behind our proof is to show that there exist two rank-1 matrices $Z_U, Z_L$ s.t. $\|\mathcal{A}_{Gauss}(Z_U)\|_2^2 / \|Z_U\|_F^2$ is much larger than $\|\mathcal{A}_{Gauss}(Z_L)\|_2^2 / \|Z_L\|_F^2$.

In particular, let $Z_U = x_1 y_1^T$ and let $Z_L = u v^T$, where $u, v$ are sampled from the normal distribution independently of $\{x_i, y_i\}$. Now,

$$\|\mathcal{A}_{Gauss}(Z_U)\|_2^2 = \|x_1\|_2^4\, \|y_1\|_2^4 + \sum_{i=2}^m (x_1^T x_i)^2 (y_1^T y_i)^2.$$

Now, as $x_1, y_1$ are multi-variate normal random variables, w.h.p. $\|x_1\|_2^4\, \|y_1\|_2^4 \ge .5\, d_1^2 d_2^2$, so that:

$$\|\mathcal{A}_{Gauss}(Z_U)\|_2^2 \ge .5\, d_1^2 d_2^2. \quad (12)$$

Moreover, w.h.p. $\|Z_U\|_F^2 = \|x_1\|_2^2\, \|y_1\|_2^2 \le 2\, d_1 d_2$.

Now, consider

$$\|\mathcal{A}_{Gauss}(Z_L)\|_2^2 = \sum_{i=1}^m (u^T x_i)^2 (v^T y_i)^2,$$

where $u$ and $v$ are sampled from the standard normal distribution, independently of $\{x_i, y_i\}$. Since $u, v$ are independent of the $x_i, y_i$, each $u^T x_i \sim N(0, \|u\|_2^2)$ and $v^T y_i \sim N(0, \|v\|_2^2)$. Hence, w.h.p., each term $(u^T x_i)^2 (v^T y_i)^2$ is at most $O(\|u\|_2^2 \|v\|_2^2 \log(4m))$. Moreover, w.h.p., $\|u\|_2^2 \le 2 d_1$ and $\|v\|_2^2 \le 2 d_2$. That is, w.h.p.:

$$\|\mathcal{A}_{Gauss}(Z_L)\|_2^2 \le 4\, m\, d_1 d_2 \log(4m). \quad (13)$$

Furthermore, w.h.p. $\|Z_L\|_F^2 = \|u\|_2^2\, \|v\|_2^2 \ge .5\, d_1 d_2$.

Using (12) and (13), we get that w.h.p.:

$$\|\mathcal{A}_{Gauss}(Z_L/\|Z_L\|_F)\|_2^2 \le 40\, m \log(4m), \qquad \|\mathcal{A}_{Gauss}(Z_U/\|Z_U\|_F)\|_2^2 \ge .05\, d_1 d_2.$$

Now, for RIP to be satisfied with a constant $\delta$, the lower and upper bounds on $\|\mathcal{A}_{Gauss}(Z)\|_2^2$ over all unit-Frobenius-norm rank-1 matrices $Z$ should be at most a constant factor apart. However, the above equation clearly shows that the upper and lower bounds can match only when $m \log(4m) = \Omega(d_1 d_2)$. Hence, for $m$ that is at most linear in both $d_1$ and $d_2$, RIP cannot be satisfied with high probability. ∎
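The claim is easy to see numerically. In the sketch below (sizes are ours, with $m$ linear in the dimension), the normalized measurement energy of the "aligned" matrix $Z_U = x_1 y_1^T$ is far larger than that of an independently drawn rank-1 matrix $Z_L = uv^T$, so no constant $\delta$ can satisfy the RIP sandwich.

```python
import numpy as np

rng = np.random.default_rng(4)
d1 = d2 = 60
m = d1                                  # m linear in the dimension

X = rng.standard_normal((m, d1))
Y = rng.standard_normal((m, d2))

def meas_energy(Z):
    """||A_Gauss(Z)||_2^2 for the rank-one operator with A_i = x_i y_i^T."""
    return np.sum(np.einsum("id,de,ie->i", X, Z, Y) ** 2)

ZU = np.outer(X[0], Y[0]); ZU /= np.linalg.norm(ZU)    # aligned with one measurement
u, v = rng.standard_normal(d1), rng.standard_normal(d2)
ZL = np.outer(u, v); ZL /= np.linalg.norm(ZL)          # independent of the operator

ratio = meas_energy(ZU) / meas_energy(ZL)
assert ratio > 5    # far from the constant-factor gap RIP would allow
```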

Now, even though $\mathcal{A}_{Gauss}$ does not satisfy RIP, we can still show that it satisfies the three properties mentioned in Theorem 1, and hence we can use Theorem 1 to obtain the exact recovery result.

###### Lemma 4 (Rank-One Gaussian Measurements).

Let $\mathcal{A}_{Gauss} = \{A_1, \dots, A_m\}$ be a measurement operator with each $A_i = x_i y_i^T$, where $x_i \sim N(0, I_{d_1})$ and $y_i \sim N(0, I_{d_2})$. Let $m$ be sufficiently large (nearly linear in $d_1 + d_2$, up to factors polynomial in $k$ and $\beta$ and logarithmic in the dimensions). Then, Properties 1, 2, 3 required by Theorem 1 are satisfied with high probability.

###### Proof of Lemma 4.

We divide the proof into three parts where each part proves a property mentioned in Theorem 1.

###### Proof of Property 1.

Now,

$$S = \frac{1}{m}\sum_{i=1}^m b_i\, x_i y_i^T = \frac{1}{m}\sum_{i=1}^m x_i x_i^T\, U_* \Sigma_* V_*^T\, y_i y_i^T = \frac{1}{m}\sum_{i=1}^m Z_i,$$

where $Z_i = x_i x_i^T\, U_* \Sigma_* V_*^T\, y_i y_i^T$. Note that $E[Z_i] = W_*$. Also, both $x_i$ and $y_i$ are spherical Gaussian variables and hence rotationally invariant. Therefore, wlog, we can assume that $U_* = [e_1 \dots e_k]$ and $V_* = [e_1 \dots e_k]$, where $e_j$ is the $j$-th canonical basis vector.

As $S$ is a sum of $m$ random matrices $Z_i$, the goal is to apply matrix concentration bounds to show that $S$ is close to $E[S] = W_*$ for large enough $m$. To this end, we use Theorem 8 by [14], given below. However, Theorem 8 requires bounded random variables, while $Z_i$ is unbounded. We handle this issue by clipping $x_i$ and $y_i$ to ensure that the spectral norm of $Z_i$ is always bounded. In particular, consider the following random variable:

$$\tilde{x}_{ij} = \begin{cases} x_{ij}, & |x_{ij}| \le C\sqrt{\log(m(d_1+d_2))},\\ 0, & \text{otherwise}, \end{cases} \quad (14)$$

where $x_{ij}$ is the $j$-th coordinate of $x_i$. Similarly, define:

$$\tilde{y}_{ij} = \begin{cases} y_{ij}, & |y_{ij}| \le C\sqrt{\log(m(d_1+d_2))},\\ 0, & \text{otherwise}. \end{cases} \quad (15)$$

Note that, w.h.p., $\tilde{x}_i = x_i$ and $\tilde{y}_i = y_i$ for all $i$. Also, the $\tilde{x}_{ij}$ are still symmetric and independent random variables, i.e., $E[\tilde{x}_{ij}] = 0$. Hence, $E[\tilde{x}_i \tilde{x}_i^T] = E[\tilde{x}_{ij}^2]\, I$. Furthermore,

 E[˜x2ij] =E[x2ij]−2√2π∫∞C√log(m(d1+d2))x2exp(−x2