1 Introduction
Motivated by the Netflix Challenge, recent research has addressed the problem of matrix completion, where the goal is to recover an underlying low-rank "ratings" matrix from a small number of its observed entries. However, the standard low-rank matrix completion formulation applies only to the transductive setting, i.e., predictions are restricted to existing users/movies. Several real-world recommendation systems have useful side-information available in the form of feature vectors for users as well as movies, and hence one should be able to make accurate predictions for new users and movies as well.
In this paper, we formulate and study the above mentioned problem, which we call inductive matrix completion: other than a small number of observations from the ratings matrix, the feature vectors for users/movies are also available. We formulate the problem as that of recovering a low-rank matrix $W$ using the observed entries and the user/movie feature vectors $x_i$, $y_j$, so that the rating of user $i$ for movie $j$ is modeled as $x_i^\top W y_j$. By factoring $W = UV^\top$, we see that this scheme constitutes a bilinear prediction for a new user/movie pair $(x, y)$.
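Concretely, the bilinear prediction rule can be sketched as follows (a minimal illustration with made-up dimensions and random factors, not the paper's learned model):

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, k = 5, 4, 2          # user-feature dim, movie-feature dim, rank (illustrative)

# Low-rank parameter matrix W = U V^T (random stand-in factors)
U = rng.standard_normal((d1, k))
V = rng.standard_normal((d2, k))

def predict_rating(x, y):
    """Bilinear prediction x^T W y, computed via the factors as (U^T x) . (V^T y)."""
    return (U.T @ x) @ (V.T @ y)

# Works for a brand-new user/movie pair: only their feature vectors are needed.
x_new = rng.standard_normal(d1)
y_new = rng.standard_normal(d2)
print(predict_rating(x_new, y_new))
```

Computing through the factors avoids ever forming the $d_1 \times d_2$ matrix $W$ explicitly.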
In fact, the above rank-one measurement scheme also arises in several other important low-rank estimation problems, such as: a) general low-rank matrix sensing in the signal acquisition domain, and b) multi-label regression with missing information.
In this paper, we generalize the above three mentioned problems to the following low-rank matrix estimation problem that we call Low-Rank matrix estimation using Rank One Measurements (LRROM): recover a rank-$k$ matrix $W^*$ using rank-one measurements of the form:
$$b_i = x_i^\top W^* y_i,$$
where the "feature" vectors $x_i$, $y_i$ are provided along with the measurements $b_i$.
Now, given the measurements $\{b_i\}$ and the feature vectors $\{x_i\}$, $\{y_i\}$, a canonical way to recover $W^*$ is to find a rank-$k$ matrix $W$ such that $\sum_i (b_i - x_i^\top W y_i)^2$ is small. While the objective function of this problem is simple least squares, the non-convex rank constraint makes it NP-hard, in general, to solve. In the existing literature, there are two common approaches to handle such low-rank problems: a) use the trace-norm constraint as a proxy for the rank constraint and then solve the resulting non-smooth convex optimization problem, or b) parameterize $W$ as $W = UV^\top$ and then alternately optimize over $U$ and $V$.
The first approach has been shown to be successful for a variety of problems such as matrix completion [2, 3, 4, 5], general low-rank matrix sensing [1], robust PCA [6, 7], etc. However, the resulting convex optimization methods require computing the full SVD of matrices with potentially large rank and hence do not scale to large problems. On the other hand, alternating minimization and its variants need to solve only least squares problems and hence are scalable in practice, but they might get stuck in a local minimum. However, [8] recently showed that under a standard set of assumptions, alternating minimization actually converges at a linear rate to the global optimum of two low-rank estimation problems: a) general low-rank matrix sensing with RIP measurements, and b) low-rank matrix completion.
Motivated by its empirical as well as theoretical success, we study a variant of alternating minimization (with appropriate initialization) for the above mentioned LRROM problem. To analyze our general LRROM problem, we present three key properties that a rank-one measurement operator should satisfy. Assuming these properties, we show that the alternating minimization method converges to the global optimum of LRROM at a linear rate. We then study the three problems individually and show that, for each of them, the measurement operator indeed satisfies the conditions required by our general analysis; hence, for each problem, alternating minimization converges to the global optimum at a linear rate. Below, we briefly describe the three application problems that we study and our high-level result for each:
(a) Efficient matrix sensing using Gaussian measurements: In this problem, the vectors $x_i$ and $y_i$ are sampled from a sub-Gaussian distribution, and the goal is efficient acquisition and recovery of the rank-$k$ matrix $W^*$. Here, we show that if the number of measurements is nearly linear in $d_1 + d_2$ (with polynomial dependence on $k$ and the condition number $\beta$ of $W^*$), then with high probability (w.h.p.) our alternating-minimization-based method will recover $W^*$ back in linear time. Note that the problem of low-rank matrix sensing has been considered by several existing methods [1, 9, 10]; however, most of these methods require the measurement operator to satisfy the Restricted Isometry Property (RIP) (see Definition 2). Typically, RIP operators are constructed by sampling from distributions with bounded fourth moments, and they require a comparable number of measurements to satisfy RIP for a constant $\delta$. That is, the number of samples required to satisfy RIP is similar to the number of samples required by our method. Moreover, RIP-based operators are typically dense, have a large memory footprint, and make the algorithm computationally intensive. For example, assuming the rank $k$ and $\delta$ to be constant, RIP-based operators would require storage and computational time cubic in the dimension, as opposed to the quadratic storage and computational time required by the rank-one measurement operators. However, a drawback of such rank-one measurements is that, unlike RIP-based operators, they are not universal, i.e., a new set of measurement vectors $\{x_i, y_i\}$ needs to be sampled for any given signal $W^*$.
(b) Inductive Matrix Completion: As motivated earlier, consider a movie recommendation system with $n$ users and $m$ movies. Let $X$ and $Y$ be the feature matrices of the users and the movies, respectively, with rows $x_i$ and $y_j$. Then, the rating of user $i$ for movie $j$ can be modeled as $x_i^\top W^* y_j$, and the goal is to learn $W^*$ using a small number of random ratings indexed by the set of observations $\Omega$. Note that matrix completion is a special case of this problem when $X$ and $Y$ are identity matrices. Also, unlike standard matrix completion, accurate ratings can be predicted for users who have not rated any prior movies, and vice versa.
If the feature matrices $X$, $Y$ are incoherent and the number of observed entries is sufficiently large, then inductive matrix completion satisfies the conditions required by our generic method, and hence the global optimality result follows directly. Note that our analysis requires a number of samples quadratic in the number of features, i.e., $O(d_1 d_2)$ samples (assuming $k$ to be a constant) for recovery. On the other hand, applying standard matrix completion would require a number of samples scaling with $n + m$. Hence, our analysis provides a significant improvement if $d_1 d_2 \ll n + m$, i.e., when the number of features is significantly smaller than the total number of users and movies.
(c) Multi-label Regression with Missing Data: Consider a multivariate regression problem where the goal is to predict a set of (correlated) target variables for a given input $x$. We model this as a regression problem with low-rank parameters, i.e., the targets are given by $W^\top x$, where $W$ is a low-rank matrix. Given $n$ training points $X = [x_1, \dots, x_n]$ and the associated target matrix $Y$, $W$ can be learned using simple least squares regression. However, in most real-world applications several of the entries of $Y$ are missing, and the goal is to be able to learn $W$ "exactly".
Now, let the set $\Omega$ of known entries be sampled uniformly at random from $Y$. Then we show that, by sampling sufficiently many entries, alternating minimization recovers $W$ back exactly. Note that a direct approach to this problem is to first recover the label matrix $Y$ using standard matrix completion and then learn $W$ from the completed label matrix. Such a method would require a number of samples scaling with the number of training points $n$. In contrast, our more unified approach requires a number of samples that does not grow with $n$ in the same way. Hence, if the number of training points is much larger than the number of labels, then our method provides a significant improvement over first completing the matrix and then learning the true low-rank matrix $W$.
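This setup maps directly onto the rank-one measurement view: an observed label entry $Y_{ij} = x_i^\top W e_j$ is a rank-one measurement with measurement matrix $x_i e_j^\top$, where $e_j$ is a canonical basis vector. A minimal sketch with illustrative sizes and names:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, L, k = 50, 6, 8, 2      # training points, features, labels, rank (illustrative)

W = rng.standard_normal((d, k)) @ rng.standard_normal((k, L))  # low-rank parameters
X = rng.standard_normal((n, d))
Y = X @ W                                                      # full label matrix

# Observe only a random subset Omega of label entries.
mask = rng.random((n, L)) < 0.5

# Each observed entry Y[i, j] equals x_i^T W e_j: a rank-one measurement
# with measurement matrix A = x_i e_j^T.
i, j = np.argwhere(mask)[0]
e_j = np.zeros(L); e_j[j] = 1.0
b = X[i] @ W @ e_j
print(np.isclose(b, Y[i, j]))  # prints True
```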
We would like to stress that the above mentioned problems of inductive matrix completion and multi-label regression with missing labels have recently received a lot of attention from the machine learning community [11, 12]. However, to the best of our knowledge, our results are the first theoretically rigorous results that improve upon the sample complexity of first completing the target/ratings matrix and then learning the parameter matrix $W$.

Related Work: Low-rank matrix estimation problems are pervasive and have innumerable real-life applications. Popular examples include PCA, robust PCA, non-negative matrix approximation, low-rank matrix completion, low-rank matrix sensing, etc. While low-rank matrix estimation subject to given (affine) observations is NP-hard in general, several recent results present conditions under which the optimal solution can be recovered exactly or approximately [2, 1, 3, 13, 7, 6, 8, 9].
Of the above mentioned low-rank matrix estimation problems, the most relevant to ours are matrix completion [2, 5, 8] and general matrix sensing [1, 9, 10]. The matrix completion problem is restricted to a given set of users and movies and hence does not generalize to new users/movies. On the other hand, matrix sensing methods require the measurement operator to satisfy the RIP condition, which, at least for the current constructions, necessitates measurement matrices that have full rank and a large number of random bits, and hence high storage as well as computational cost [1]. Our work on general low-rank matrix estimation (problem (a) above) alleviates this issue, as our measurements are only rank-one and hence the low-rank signal can be encoded as well as decoded much more efficiently. Moreover, our result for inductive matrix completion generalizes the matrix completion work and provides, to the best of our knowledge, the first theoretical results for the problem of inductive matrix completion.
Paper Organization: We formally introduce the problem of low-rank matrix estimation with rank-one measurements in Section 2. We provide our version of the alternating minimization method and then present a generic analysis of alternating minimization when applied to such rank-one measurement based problems. Our results distill out certain key problem-specific properties that imply global optimality of alternating minimization. In the subsequent Sections 3, 4, and 5, we show that for each of our three problems (mentioned above) the required problem-specific properties are satisfied, and hence our alternating minimization method provides a globally optimal solution. Finally, we provide empirical validation of our methods in Section 6.
2 Low-rank Matrix Estimation using Rank-one Measurements
Let $\mathcal{A} : \mathbb{R}^{d_1 \times d_2} \rightarrow \mathbb{R}^m$ be a linear measurement operator parameterized by $m$ matrices $A_1, \dots, A_m$, where $A_i \in \mathbb{R}^{d_1 \times d_2}$. Then, the linear measurements of a given matrix $W \in \mathbb{R}^{d_1 \times d_2}$ are given by:
$$b = \mathcal{A}(W), \quad b_i = \operatorname{tr}(A_i^\top W), \qquad (1)$$
where $\operatorname{tr}(\cdot)$ denotes the trace operator.
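The operator and its rank-one specialization can be sketched as follows (illustrative code; `apply_operator` is a name introduced here). Note that for $A_i = x_i y_i^\top$, $\operatorname{tr}(A_i^\top W) = x_i^\top W y_i$:

```python
import numpy as np

rng = np.random.default_rng(2)
d1, d2, m = 4, 3, 5

W = rng.standard_normal((d1, d2))

def apply_operator(A_list, W):
    """General linear operator: b_i = <A_i, W> = tr(A_i^T W)."""
    return np.array([np.trace(Ai.T @ W) for Ai in A_list])

# Rank-one specialization: A_i = x_i y_i^T, so b_i = x_i^T W y_i.
xs = rng.standard_normal((m, d1))
ys = rng.standard_normal((m, d2))
A_rank1 = [np.outer(x, y) for x, y in zip(xs, ys)]

b_trace  = apply_operator(A_rank1, W)
b_direct = np.array([x @ W @ y for x, y in zip(xs, ys)])
print(np.allclose(b_trace, b_direct))  # prints True
```

The rank-one form never needs the $d_1 \times d_2$ matrices $A_i$ explicitly; storing the pairs $(x_i, y_i)$ suffices.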
In this paper, we mainly focus on rank-one measurement operators, i.e., $A_i = x_i y_i^\top$ where $x_i \in \mathbb{R}^{d_1}$, $y_i \in \mathbb{R}^{d_2}$. Also, let $W^* \in \mathbb{R}^{d_1 \times d_2}$ be a rank-$k$ matrix with singular value decomposition (SVD) $W^* = U^* \Sigma^* (V^*)^\top$. Then, given $b = \mathcal{A}(W^*)$, the goal of the LRROM problem is to recover $W^*$ back efficiently. This problem can be reformulated as the following non-convex optimization problem:
$$\min_{W : \operatorname{rank}(W) \le k} \; \sum_{i=1}^m \big(b_i - x_i^\top W y_i\big)^2. \qquad (2)$$
Note that the $W$ to be recovered is restricted to have rank at most $k$ and hence can be rewritten as $W = UV^\top$, where $U \in \mathbb{R}^{d_1 \times k}$ and $V \in \mathbb{R}^{d_2 \times k}$.
We use the standard alternating minimization algorithm with appropriate initialization to solve the above problem (2) (see Algorithm 1). Note that the problem is non-convex in $(U, V)$, and hence standard analysis would only ensure convergence to a local minimum. However, [8] recently showed that the alternating minimization method in fact converges to the global minimum of two low-rank estimation problems: matrix sensing with RIP matrices and matrix completion.
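The alternating minimization scheme can be sketched for the rank-$1$ case as follows. This is a simplified stand-in for Algorithm 1 (which is not reproduced here): it uses the spectral initialization $\frac{1}{m}\sum_i b_i x_i y_i^\top$ analyzed later and, unlike the analysis, reuses the same measurements across iterations; all sizes are illustrative.

```python
import numpy as np

def altmin_lrrom_rank1(xs, ys, b, iters=25):
    """Sketch of alternating minimization for rank-1 LRROM:
    recover W* = sigma* u* v*^T from measurements b_i = x_i^T W* y_i."""
    m = xs.shape[0]
    # Spectral initialization: (1/m) sum_i b_i x_i y_i^T concentrates around W*.
    S = (xs * b[:, None]).T @ ys / m
    U, _, Vt = np.linalg.svd(S)
    u, v = U[:, 0], Vt[0]
    sigma = 1.0
    for _ in range(iters):
        # Fix v, solve least squares for w = sigma * u; design rows are (y_i^T v) x_i.
        w = np.linalg.lstsq(xs * (ys @ v)[:, None], b, rcond=None)[0]
        u = w / np.linalg.norm(w)
        # Fix u, solve least squares for w = sigma * v; design rows are (x_i^T u) y_i.
        w = np.linalg.lstsq(ys * (xs @ u)[:, None], b, rcond=None)[0]
        sigma = np.linalg.norm(w)
        v = w / sigma
    return sigma * np.outer(u, v)

# Usage in the Gaussian-measurement setting of Section 3 (illustrative sizes):
rng = np.random.default_rng(3)
d1, d2, m = 10, 10, 2000
u_star = rng.standard_normal(d1); u_star /= np.linalg.norm(u_star)
v_star = rng.standard_normal(d2); v_star /= np.linalg.norm(v_star)
W_star = np.outer(u_star, v_star)                 # rank-1 signal, sigma* = 1
xs = rng.standard_normal((m, d1))
ys = rng.standard_normal((m, d2))
b = np.einsum('ij,jk,ik->i', xs, W_star, ys)      # b_i = x_i^T W* y_i
W_hat = altmin_lrrom_rank1(xs, ys, b)
print(np.linalg.norm(W_hat - W_star))             # should be tiny
```

Each alternating step is an ordinary least squares problem in $d_1$ (or $d_2$) unknowns, which is what makes the method scalable.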
The rank-one operator given above does not satisfy RIP (see Definition 2), even when the vectors $x_i$, $y_i$ are sampled from the normal distribution (see Claim 3). Furthermore, each measurement need not reveal exactly one entry of $W^*$, as in the case of matrix completion. Hence, the proof of [8] does not apply directly. However, inspired by the proof of [8], we distill out three key properties that the operator $\mathcal{A}$ should satisfy so that alternating minimization converges to the global optimum.

Theorem 1.
Let $W^* = U^* \Sigma^* (V^*)^\top$ be a rank-$k$ matrix with singular values $\sigma_1^* \ge \cdots \ge \sigma_k^*$. Also, let $\mathcal{A}$ be a linear measurement operator parameterized by $m$ matrices, i.e., $\mathcal{A} = \{A_1, \dots, A_m\}$ where $A_i = x_i y_i^\top$. Let $b = \mathcal{A}(W^*)$ be as given by (1).
Now, let $\mathcal{A}$ satisfy the following properties with parameter $\delta$:

Initialization: $\|U_0 U_0^\top - U^*(U^*)^\top\|_2 \le \delta$ and $\|V_0 V_0^\top - V^*(V^*)^\top\|_2 \le \delta$.

Concentration of operators $B_x$, $B_y$: Let $B_x = \frac{1}{m}\sum_i (y_i^\top v)^2\, x_i x_i^\top$
and $B_y = \frac{1}{m}\sum_i (x_i^\top u)^2\, y_i y_i^\top$, where $u$, $v$ are two unit vectors that are independent of the randomness in $\{(x_i, y_i)\}$. Then the following holds: $\|B_x - I\|_2 \le \delta$ and $\|B_y - I\|_2 \le \delta$.
Concentration of operators $G_x$, $G_y$: Let $G_x = \frac{1}{m}\sum_i (y_i^\top v)(y_i^\top v_\perp)\, x_i x_i^\top$,
$G_y = \frac{1}{m}\sum_i (x_i^\top u)(x_i^\top u_\perp)\, y_i y_i^\top$, where $u, u_\perp, v, v_\perp$ are unit vectors, s.t., $u^\top u_\perp = 0$ and $v^\top v_\perp = 0$. Furthermore, let $u, u_\perp, v, v_\perp$ be independent of the randomness in $\{(x_i, y_i)\}$. Then, $\|G_x\|_2 \le \delta$ and $\|G_y\|_2 \le \delta$.
Then, after $T = O(\log(\|W^*\|_F/\epsilon))$ iterations of the alternating minimization method (Algorithm 1), we obtain $W_T = U_T V_T^\top$ s.t. $\|W^* - W_T\|_2 \le \epsilon$.
Proof.
We explain the key ideas of the proof by first presenting it for the special case of rank $k = 1$. Later, in Appendix B, we extend the proof to the general rank-$k$ case.
Similar to [8], we first characterize the update for the $(t+1)$-th step iterate $\hat{u}^{t+1}$ of Algorithm 1 and its normalized form $u^{t+1} = \hat{u}^{t+1}/\|\hat{u}^{t+1}\|_2$.
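For the rank-one case, this update can be written out explicitly. The following sketch (a reconstruction consistent with the surrounding argument; $B$ and $C$ are shorthand introduced here) shows why the new iterate is a perturbation of $u^*$: with $\hat{v}^t$ fixed, the least squares step solves $\min_u \sum_i (b_i - (x_i^\top u)(y_i^\top \hat{v}^t))^2$, whose normal equations give

```latex
\hat{u}^{t+1} = \sigma^* B^{-1} C\, u^*,
\qquad
B := \sum_i (y_i^\top \hat{v}^t)^2\, x_i x_i^\top,
\quad
C := \sum_i (y_i^\top v^*)(y_i^\top \hat{v}^t)\, x_i x_i^\top,
```

since $b_i = \sigma^* (x_i^\top u^*)(y_i^\top v^*)$. Adding and subtracting $\langle \hat{v}^t, v^* \rangle B$ yields

```latex
\hat{u}^{t+1}
= \sigma^* \langle \hat{v}^t, v^* \rangle\, u^*
+ \sigma^* B^{-1}\big(C - \langle \hat{v}^t, v^* \rangle B\big)\, u^*,
```

i.e., the desired direction $u^*$ plus a perturbation whose norm is controlled by Properties 2 and 3.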
Note that (3) shows that $\hat{u}^{t+1}$ is a perturbation of $u^*$, and the goal now is to bound the spectral norm of the perturbation term:
(4) 
Now, using Property 2 mentioned in the theorem, we get:
(5) 
Now,
(6) 
where the last step follows by observing that $v^*$ and $v_\perp$ are orthogonal vectors and then using Property 3 of Theorem 1. Hence, using (5) and (6) along with (4), we get:
(7) 
We are now ready to lower bound the component of $\hat{u}^{t+1}$ along the correct direction $u^*$ and to upper bound the component of $\hat{u}^{t+1}$ that is perpendicular to the optimal direction $u^*$.
Now, by left-multiplying (3) by $(u^*)^\top$ and using (5), we obtain:
(8) 
Similarly, by multiplying (3) by $u_\perp^\top$, where $u_\perp$ is a unit-norm vector orthogonal to $u^*$, we get:
(9) 
(10) 
Also, using Property 1 of Theorem 1 for the base case $t = 0$, we get a bound on the distance of the initial iterates from $(u^*, v^*)$. Moreover, by multiplying by $u^*$ on the left and $v^*$ on the right and using the fact that $(u^0, v^0)$ are the largest singular vectors of the initialization matrix, we get a lower bound on $\langle u^0, u^* \rangle$. Hence, the induction hypothesis holds at $t = 0$.
Using (10) along with the above observation and the "inductive" assumption (the proof of the inductive step follows directly from the equation below), we get:
(11) 
A similar bound holds for the update of $v$. Hence, after $T$ iterations, we obtain $(u^T, v^T)$ s.t. the estimate is $\epsilon$-close to $W^*$ in spectral norm. ∎
Note that we require the intermediate vectors $u^t$, $v^t$ to be independent of the randomness in the $(x_i, y_i)$'s. Hence, we partition the measurements into $2T$ partitions, and at each step a fresh partition is supplied to the algorithm. This implies that the measurement complexity of the algorithm is $m \cdot T$, where $m$ is the number of measurements per partition. That is, given $O(m \log(1/\epsilon))$ samples, we can estimate a matrix $W_T$, s.t., $\|W^* - W_T\|_2 \le \epsilon$, where $\epsilon > 0$ is any constant.
3 Rank-one Matrix Sensing using Gaussian Measurements
In this section, we study the problem of sensing general low-rank matrices, which is an important problem in the domain of signal acquisition [1] and has several applications in a variety of areas like control theory, computer vision, etc. For this problem, the goal is to design both the measurement matrices and the recovery algorithm, so that the true low-rank signal can be recovered from the given linear measurements.

Consider a measurement operator in which each measurement matrix $A_i = x_i y_i^\top$ is sampled using the normal distribution, i.e., $x_i \sim \mathcal{N}(0, I_{d_1})$, $y_i \sim \mathcal{N}(0, I_{d_2})$. Now, for this operator $\mathcal{A}$, we show that if the number of measurements is nearly linear in $d_1 + d_2$ (with polynomial dependence on $k$ and $\beta$), then w.h.p. any fixed rank-$k$ matrix $W^*$ can be recovered by AltMinLRROM (Algorithm 1). Here $\beta = \sigma_1^*/\sigma_k^*$ is the condition number of $W^*$. That is, using a nearly linear number of measurements in $d_1 + d_2$, one can exactly recover the rank-$k$ matrix $W^*$.
Note that several similar recovery results for the matrix sensing problem already exist in the literature that guarantee exact recovery using a comparable number of measurements [1, 10, 9]. However, we would like to stress that all the above mentioned existing results assume that the measurement operator satisfies the Restricted Isometry Property (RIP), defined below:
Definition 2.
A linear operator $\mathcal{A}$ satisfies RIP iff, for all $W$ s.t. $\operatorname{rank}(W) \le k$, the following holds:
$$(1 - \delta_k)\,\|W\|_F^2 \le \|\mathcal{A}(W)\|_2^2 \le (1 + \delta_k)\,\|W\|_F^2,$$
where $\delta_k$ is a constant dependent only on $k$.
Most current constructions of RIP matrices require each $A_i$ to be sampled from a zero-mean distribution with bounded fourth moment, which implies that the $A_i$ have almost full rank. That is, such operators require $O(d^3)$ memory (for $d_1 = d_2 = d$ and $m = O(d)$) just to store the operator, i.e., the storage requirement is cubic in $d$. Consequently, signal acquisition as well as recovery time for these algorithms is also at least cubic in $d$. In contrast, our proposed rank-one measurements require only $O(d^2)$ storage and computational time. Hence, the proposed method makes signal acquisition as well as signal recovery at least an order of magnitude faster.
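A back-of-the-envelope comparison with illustrative numbers (the dimension and measurement count below are assumptions for the sake of the example):

```python
d = 1_000          # signal dimension, taking d1 = d2 = d (illustrative)
m = 5 * d          # number of measurements, m = O(d)

dense_storage = m * d * d      # m full d x d measurement matrices: O(d^3) floats
rank1_storage = m * 2 * d      # m pairs (x_i, y_i): O(d^2) floats

print(dense_storage // rank1_storage)  # prints 500, i.e., a factor of d/2
```

The gap grows linearly with $d$, which is the "order of magnitude" saving referred to above.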
Naturally, this begs the question of whether our rank-one measurement operator satisfies RIP, so that the existing analysis for RIP-based low-rank matrix sensing could be used [8]. We answer this question in the negative: when $m$ is at most linear in $d_1$, $d_2$, the operator $\mathcal{A}$ does not satisfy RIP even for rank-$1$ matrices (with high probability):
Claim 3.
Let $\mathcal{A}$ be a measurement operator with each $A_i = x_i y_i^\top$, where $x_i \sim \mathcal{N}(0, I_{d_1})$, $y_i \sim \mathcal{N}(0, I_{d_2})$. Let $m \le c\,(d_1 + d_2)$ for any constant $c > 0$. Then, with high probability, $\mathcal{A}$ does not satisfy RIP for rank-$1$ matrices with a constant $\delta$.
Proof of Claim 3.
The main idea behind our proof is to show that there exist two rank-$1$ matrices $W_1$, $W_2$ of equal Frobenius norm s.t. $\|\mathcal{A}(W_1)\|_2^2$ is large while $\|\mathcal{A}(W_2)\|_2^2$ is much smaller than $\|\mathcal{A}(W_1)\|_2^2$.
In particular, let $W_1 = x_1 y_1^\top / (\|x_1\|_2 \|y_1\|_2)$, so that $W_1$ is aligned with the first measurement. Now, as $x_1^\top x_i$ and $y_1^\top y_i$ ($i \ne 1$) are multivariate normal random variables, their contributions to $\|\mathcal{A}(W_1)\|_2^2$ concentrate, and w.h.p.:
(12)
Moreover, $\|x_1\|_2 \|y_1\|_2 = \Theta(\sqrt{d_1 d_2})$ w.h.p., so the first measurement alone contributes $\Omega(d_1 d_2)$ to $\|\mathcal{A}(W_1)\|_2^2$.
Now, consider $W_2 = u v^\top / (\|u\|_2 \|v\|_2)$, where $u$ and $v$ are sampled from the standard normal distribution, independent of $\mathcal{A}$. Since $u, v$ are independent of the $x_i$ and $y_i$, each of $x_i^\top u$ and $y_i^\top v$ is (conditioned on $u$, $v$) a normal random variable of constant scale. Hence, w.h.p., each squared measurement $(x_i^\top W_2 y_i)^2$ is $O(1)$ and their sum is $O(m)$. That is, w.h.p.:
(13) 
Using (12) and (13), we get that w.h.p. the ratio $\|\mathcal{A}(W_1)\|_2^2 / \|\mathcal{A}(W_2)\|_2^2$ is $\Omega\big((d_1 d_2 + m)/m\big)$.
Now, for RIP to be satisfied with a constant $\delta$, the lower and upper bounds on $\|\mathcal{A}(W)\|_2^2$ over all rank-$1$ matrices $W$ of equal Frobenius norm should be at most a constant factor apart. However, the above bound clearly shows that the upper and lower bounds can match (up to a constant) only when $m = \Omega(d_1 d_2)$. Hence, for $m$ that is at most linear in both $d_1$, $d_2$, RIP cannot be satisfied with high probability. ∎
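The counting argument above can be sanity-checked numerically. The sketch below (an assumed setup with $d_1 = d_2 = d$ and $m$ linear in $d$) compares $\|\mathcal{A}(W)\|_2^2$ for a rank-one $W_1$ aligned with one measurement pair against an independent rank-one $W_2$ of the same Frobenius norm:

```python
import numpy as np

rng = np.random.default_rng(4)
d, m = 100, 200                      # m linear in d, far below d^2

xs = rng.standard_normal((m, d))
ys = rng.standard_normal((m, d))

def sq_norm_of_measurements(W):
    """||A(W)||_2^2 for the rank-one operator b_i = x_i^T W y_i."""
    b = np.array([x @ W @ y for x, y in zip(xs, ys)])
    return b @ b

# W1: rank-1, aligned with the first measurement pair (unit Frobenius norm).
W1 = np.outer(xs[0], ys[0]) / (np.linalg.norm(xs[0]) * np.linalg.norm(ys[0]))
# W2: random rank-1 direction, independent of the operator (unit Frobenius norm).
u = rng.standard_normal(d); v = rng.standard_normal(d)
W2 = np.outer(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

ratio = sq_norm_of_measurements(W1) / sq_norm_of_measurements(W2)
print(ratio)  # roughly (d^2 + m) / m >> 1, so no constant-delta RIP
```

Because $\|W_1\|_F = \|W_2\|_F = 1$ while the measurement energies differ by a large factor, no constant $\delta$ can make both bounds in Definition 2 hold.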
Now, even though $\mathcal{A}$ does not satisfy RIP, we can still show that $\mathcal{A}$ satisfies the three properties mentioned in Theorem 1, and hence we can use Theorem 1 to obtain the exact recovery result.
Lemma 4 (Rank-One Gaussian Measurements).
Let $\mathcal{A}$ be a measurement operator with each $A_i = x_i y_i^\top$, where $x_i \sim \mathcal{N}(0, I_{d_1})$, $y_i \sim \mathcal{N}(0, I_{d_2})$. Let $m$ be sufficiently large (nearly linear in $d_1 + d_2$). Then, Properties 1, 2, and 3 required by Theorem 1 are satisfied with high probability.
Proof of Lemma 4.
We divide the proof into three parts where each part proves a property mentioned in Theorem 1.
Proof of Property 1.
Now, consider the initialization matrix $S = \frac{1}{m}\sum_i b_i x_i y_i^\top$, where $b_i = x_i^\top W^* y_i$. Note that $\mathbb{E}[S] = W^*$. Also, both $x_i$ and $y_i$ are spherical Gaussian variables and hence rotationally invariant. Therefore, w.l.o.g., we can assume that the singular vectors of $W^*$ are canonical basis vectors, where $e_j$ is the $j$-th canonical basis vector.
As $S$ is a sum of $m$ random matrices, the goal is to apply matrix concentration bounds to show that $S$ is close to $\mathbb{E}[S] = W^*$ for large enough $m$. To this end, we use Theorem 8 of [14], given below. However, Theorem 8 requires bounded random variables, while $b_i x_i y_i^\top$ is unbounded. We handle this issue by clipping, to ensure that the spectral norm is always bounded. In particular, consider the following random variable:
(14) 
where $x_i(j)$ denotes the $j$-th coordinate of $x_i$. Similarly, define:
(15) 
Note that the clipped variables are bounded by construction. Also, they are still symmetric and independent random variables. Furthermore,