Orthogonal Rank-One Matrix Pursuit for Low Rank Matrix Completion

04/04/2014 · Zheng Wang et al. · University of Georgia, Arizona State University, Simon Fraser University

In this paper, we propose an efficient and scalable low rank matrix completion algorithm. The key idea is to extend the orthogonal matching pursuit method from the vector case to the matrix case. We further propose an economic version of our algorithm by introducing a novel weight updating rule to reduce the time and storage complexity. Both versions are computationally inexpensive for each matrix pursuit iteration and find satisfactory results in a few iterations. Another advantage of the proposed algorithm is that it has only one tunable parameter, the rank, which makes it easy to understand and to use. This becomes especially important in large-scale learning problems. In addition, we rigorously show that both versions achieve a linear convergence rate, which is significantly better than the previously known results. We also empirically compare the proposed algorithms with several state-of-the-art matrix completion algorithms on many real-world datasets, including the large-scale recommendation dataset Netflix as well as the MovieLens datasets. Numerical results show that our proposed algorithm is more efficient than competing algorithms while achieving similar or better prediction performance.


1 Introduction

Recently, low rank matrix learning has attracted significant attention in machine learning and data mining due to its wide range of applications, such as collaborative filtering, dimensionality reduction, compressed sensing, multi-class learning and multi-task learning; see [1, 2, 3, 7, 9, 23, 34, 40, 37] and the references therein. In this paper, we consider the general form of low rank matrix completion: given a partially observed real-valued matrix $Y$, the low rank matrix completion problem is to find a matrix $X$ with minimum rank that best approximates the matrix $Y$ on the observed elements. The mathematical formulation is given by

$$\min_{X} \ \operatorname{rank}(X) \quad \text{s.t.} \quad P_\Omega(X) = P_\Omega(Y), \tag{1}$$

where $\Omega$ is the set of all index pairs of observed entries, and $P_\Omega$ is the orthogonal projector onto the span of matrices vanishing outside of $\Omega$.
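To make the role of $P_\Omega$ concrete, the following NumPy sketch applies the projection given an index set of observed entries. It is an illustrative helper (the dense-mask representation and function name are our own choices, not part of the paper):

```python
import numpy as np

def proj_omega(X, omega):
    """Project X onto the space of matrices supported on the observed index set omega.

    X     : (n, m) array.
    omega : iterable of (i, j) index pairs of observed entries.
    Returns a matrix that agrees with X on omega and is zero elsewhere.
    """
    P = np.zeros_like(X, dtype=float)
    rows, cols = zip(*omega)
    P[rows, cols] = X[rows, cols]
    return P

# Example: a 3x3 matrix observed at three positions.
Y = np.arange(9.0).reshape(3, 3)
omega = [(0, 0), (1, 2), (2, 1)]
print(proj_omega(Y, omega))
```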

1.1 Related Works

As it is intractable to minimize the matrix rank exactly in the general case, many approximate solutions have been proposed to attack problem (1) (cf., e.g., [7, 24, 28]). A widely used convex relaxation of the matrix rank is the trace norm or nuclear norm [7]. The matrix trace norm is the Schatten $p$-norm with $p = 1$. For a matrix $X$ with rank $r$, its Schatten $p$-norm is defined by $\left(\sum_{i=1}^{r} \sigma_i^p\right)^{1/p}$, where $\{\sigma_i\}$ are the singular values of $X$, assumed without loss of generality to be sorted in descending order. Thus, the trace norm of $X$ is the $\ell_1$ norm of the matrix spectrum, $\|X\|_* = \sum_{i=1}^{r} \sigma_i$. The convex relaxation of problem (1) is then given by

$$\min_{X} \ \|X\|_* \quad \text{s.t.} \quad P_\Omega(X) = P_\Omega(Y). \tag{2}$$

Cai et al. [6] propose an algorithm based on soft singular value thresholding (SVT). Keshavan et al. [21] and Meka et al. [18] develop more efficient algorithms by using only the top-$r$ singular pairs.
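The core operation in SVT shrinks the singular values of its argument by a threshold $\tau$. A minimal NumPy sketch of this shrinkage operator is shown below; it illustrates the operator only, not the full iterative solver of [6]:

```python
import numpy as np

def svt_shrink(X, tau):
    """Soft-threshold the singular values of X by tau:
    D_tau(X) = U * max(S - tau, 0) * V^T."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return (U * s_shrunk) @ Vt
```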

Many other algorithms have been developed to solve the trace norm penalized problem:

$$\min_{X} \ \frac{1}{2}\left\|P_\Omega(X) - P_\Omega(Y)\right\|_F^2 + \lambda \|X\|_*. \tag{3}$$

Ji et al. [20], Liu et al. [27] and Toh et al. [44] independently propose to employ the proximal gradient algorithm to improve the algorithm of [6] by significantly reducing the number of iterations; they obtain an $\epsilon$-accurate solution in $O(1/\sqrt{\epsilon})$ steps. More efficient soft singular value thresholding algorithms are proposed in [29, 30] by investigating the factorization property of the estimated matrix. Each step of these algorithms requires the computation of a partial SVD of a dense matrix. In addition, several methods approximate the trace norm using its variational characterizations [32, 40, 46, 37] and proceed by alternating optimization. However, these methods lack global convergence guarantees.

Solving these low rank or trace norm problems is computationally expensive for large matrices, as it involves computing the singular value decomposition (SVD). Most of the methods above compute the SVD or a truncated SVD iteratively, which is not scalable to large-scale problems. How to solve these problems efficiently and accurately for large-scale data has therefore attracted much attention in recent years.

Recently, the coordinate gradient descent method has been demonstrated to be efficient in solving sparse learning problems in the vector case [11, 39, 47, 48]. The key idea is to solve a very simple one-dimensional problem (for one coordinate) in each iteration. A natural question is whether and how such a method can be applied to the matrix completion problem. Some progress has been made recently along this direction. Dudík et al. [9] propose a coordinate gradient descent solution for the trace norm penalized problem. They recast the non-smooth objective in problem (3) as a smooth one in an infinite dimensional rank-one matrix space, and then apply the coordinate gradient algorithm to the collection of rank-one matrices. Zhang et al. [49] further improve the efficiency using a boosting method; the improved algorithm guarantees $\epsilon$-accuracy within $O(1/\epsilon)$ iterations. Although these algorithms need slightly more iterations than the proximal methods, they are more scalable as they only need to compute the top singular vector pair in each iteration. Note that the top singular vector pair can be computed efficiently by the power method or Lanczos iterations [13]. Jaggi et al. [17] propose an algorithm which achieves the same iteration complexity as the algorithm in [49] by directly applying Hazan's algorithm [15]. Tewari et al. [42] solve a more general problem based on a greedy algorithm. Shalev-Shwartz et al. [38] further reduce the number of iterations based on a heuristic, but without theoretical guarantees.

Most methods based on the top singular vector pair include two main steps in each iteration. The first step computes the top singular vector pair, and the second step refines the weights of the rank-one matrices formed by all top singular vector pairs obtained up to the current iteration. The main differences among these algorithms lie in how they refine the weights. Jaggi's algorithm (JS) [17] directly applies Hazan's algorithm [15], which relies on the Frank-Wolfe algorithm [10]. It updates the weights with a small step size and does not consider further refinement. Since it does not use all available information in each step, it suffers from a slow convergence rate. Similar to JS, Tewari et al. [42] use a small update step size for a general structure constrained problem. The greedy efficient component optimization (GECO) [38] optimizes the weights by solving another time-consuming optimization problem. It requires a smaller number of iterations than the JS algorithm; however, the sophisticated weight refinement leads to a higher total computational cost. The lifted coordinate gradient descent algorithm (Lifted) [9] updates the rank-one matrix basis with a constant weight in each iteration, and conducts a LASSO-type algorithm [43] to fully correct the weights. The weights for the basis update are difficult to tune: a large value leads to divergence, while a small value makes the algorithm slow [49]. The matrix norm boosting approach (Boost) [49] learns the update weights and designs a local refinement step via a non-convex optimization problem solved by alternating optimization. It has a sub-linear convergence rate.

Let us summarize their common drawbacks as follows:

  • The weight refinement steps are inefficient, resulting in a slow convergence rate. The current best convergence rate is $O(1/\epsilon)$. Some refinement steps themselves contain computationally expensive iterations [9, 49], which do not scale to large-scale data.

  • They have heuristic-based tunable parameters which are not easy to set, yet these parameters severely affect the convergence speed and the quality of the approximation. In some algorithms, an improper parameter choice can even make the algorithm diverge [6, 9].

In this paper, we present a simple and efficient algorithm to solve the low rank matrix completion problem. The key idea is to extend the orthogonal matching pursuit (OMP) procedure [35] from the vector case to the matrix case. In each iteration, a rank-one basis matrix is generated from the top left and right singular vectors of the current approximation residual. In the standard version of the proposed algorithm, we fully update the weights for all rank-one matrices in the current basis set at the end of each iteration by performing an orthogonal projection of the observation matrix onto the subspace they span. The most time-consuming step of the proposed algorithm is the calculation of the top singular vector pair of a sparse matrix, which costs $O(|\Omega|)$ operations in each iteration. An appealing feature of the proposed algorithm is that it has a linear convergence rate. This is quite different from traditional orthogonal matching pursuit or weak orthogonal greedy algorithms, whose convergence rate for sparse vector recovery is sub-linear, as shown in [26]. See also [8], [41], [45] for an extensive study of various greedy algorithms. With this rate of convergence, we only need $O(\log(1/\epsilon))$ iterations to achieve an $\epsilon$-accuracy solution.

One drawback of the standard algorithm is that it needs to store all rank-one matrices in the current basis set for full weight updating, which amounts to $k|\Omega|$ stored elements in the $k$-th iteration. This makes the storage complexity of the algorithm depend on the number of iterations, which restricts the approximation rank, especially for large-scale matrices. To tackle this problem, we propose an economic weight updating rule for this algorithm. In this economic version of the proposed algorithm, we only track two matrices in each iteration: the current estimated matrix and the pursued rank-one matrix. When restricted to the observations in $\Omega$, each has $|\Omega|$ nonzero elements. Thus the storage requirement, i.e., $O(|\Omega|)$, stays the same across iterations, which matches that of the greedy algorithms [17, 42]. Interestingly, we show that with this economic updating rule we still retain the linear convergence rate. To the best of our knowledge, our proposed algorithms are the fastest among all related methods in the literature. We verify the efficiency of our algorithms empirically on large-scale matrix completion problems, such as MovieLens [31] and Netflix [4, 5]; see §7.

The main contributions of our paper are:

  • We propose a computationally efficient and scalable algorithm for matrix completion, which extends the orthogonal matching pursuit from the vector case to the matrix case.

  • We theoretically prove the linear convergence rate of our algorithm. As a result, we only need $O(\log(1/\epsilon))$ steps to obtain an $\epsilon$-accuracy solution, and in each step we only need to compute the top singular vector pair, which can be computed efficiently.

  • We further reduce the storage complexity of our algorithm based on an economic weight updating rule while retaining the linear convergence rate. This version of our algorithm has a constant storage complexity which is independent of the approximation rank and is more practical for large-scale matrices.

  • Both versions of our algorithm have only one free parameter, i.e., the rank of the estimated matrix. The proposed algorithm is guaranteed to converge, i.e., there is no risk of divergence.

1.2 Notations and Organization

Let $Y \in \mathbb{R}^{n \times m}$ be an $n \times m$ real matrix, and let $\Omega \subset \{1,\ldots,n\} \times \{1,\ldots,m\}$ denote the indices of the observed entries of $Y$. $P_\Omega$ is the projection operator onto the space spanned by the matrices vanishing outside of $\Omega$, so that the $(i,j)$-th component of $P_\Omega(Y)$ equals $Y_{i,j}$ for $(i,j) \in \Omega$ and zero otherwise. The Frobenius norm of $Y$ is defined as $\|Y\|_F = \sqrt{\sum_{i,j} Y_{i,j}^2}$. Let $\operatorname{vec}(Y)$ denote the vector reshaped from the matrix $Y$ by concatenating all its column vectors, and let $\mathbf{y}$ be the vector obtained by concatenating all observed entries of $Y$, i.e., by keeping the observed elements of $\operatorname{vec}(Y)$. The Frobenius inner product of two matrices $A$ and $B$ is defined as $\langle A, B\rangle = \operatorname{trace}(A^\top B)$, which also equals the component-wise inner product of the corresponding vectors, $\langle \operatorname{vec}(A), \operatorname{vec}(B)\rangle$. Given a matrix $A$, we denote $P_\Omega(A)$ by $A_\Omega$. For any two matrices $A, B$, we define $\langle A, B\rangle_\Omega = \langle A_\Omega, B_\Omega\rangle$. Without further declaration, the matrix norm refers to the Frobenius norm, which can also be written as $\|Y\|_F = \sqrt{\langle Y, Y\rangle}$.
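The equivalence between the matrix inner product and the inner product of the vectorized matrices can be checked numerically; the small sanity-check sketch below is our own illustration, not part of the paper:

```python
import numpy as np

A = np.random.randn(4, 3)
B = np.random.randn(4, 3)

# <A, B> = trace(A^T B) equals the inner product of the column-stacked vectors.
frob_inner = np.trace(A.T @ B)
vec_inner = A.flatten(order="F") @ B.flatten(order="F")
assert np.isclose(frob_inner, vec_inner)

# The Frobenius norm is the norm induced by this inner product.
assert np.isclose(np.linalg.norm(A, "fro"), np.sqrt(np.trace(A.T @ A)))
```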

The rest of the paper is organized as follows: we present our algorithm in Section 2; Section 3 analyzes the convergence rate of the standard version of our algorithm; we further propose an economic version of our algorithm and prove its linear convergence rate in Section 4; Section 5 extends the proposed algorithm to the more general matrix sensing setting, and presents its guarantee of finding the optimal solution under a rank-restricted isometry property condition; in Section 6 we analyze the stability of both versions of our algorithm; empirical evaluations are presented in Section 7 to verify the efficiency and effectiveness of our algorithms. We finally conclude the paper in Section 8.

2 Orthogonal Rank-One Matrix Pursuit

It is well known that any matrix $Y \in \mathbb{R}^{n \times m}$ can be written as a linear combination of rank-one matrices, that is,

$$Y = M(\boldsymbol{\theta}) = \sum_{i \in \mathcal{I}} \theta_i M_i, \tag{4}$$

where $\mathcal{M} = \{M_i : i \in \mathcal{I}\}$ is the set of all rank-one matrices with unit Frobenius norm and $\boldsymbol{\theta}$ is the corresponding weight vector. Clearly, there are infinitely many choices of $\boldsymbol{\theta}$. Such a representation can be obtained via the standard SVD of $Y$.
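For example, the SVD-based representation mentioned above can be written down directly: the weights are the singular values and the unit Frobenius norm rank-one matrices are outer products of singular vector pairs. The NumPy sketch below is an illustrative check of this fact:

```python
import numpy as np

Y = np.random.randn(5, 4)
U, s, Vt = np.linalg.svd(Y, full_matrices=False)

# Y = sum_i theta_i * M_i with M_i = u_i v_i^T (unit Frobenius norm) and theta_i = s_i.
Y_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
assert np.allclose(Y, Y_rebuilt)
```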

The original low rank matrix approximation problem aims to minimize the zero-norm of $\boldsymbol{\theta}$ subject to the constraint:

$$\min_{\boldsymbol{\theta}} \ \|\boldsymbol{\theta}\|_0 \quad \text{s.t.} \quad P_\Omega(M(\boldsymbol{\theta})) = P_\Omega(Y), \tag{5}$$

where $\|\boldsymbol{\theta}\|_0$ denotes the number of nonzero elements of the vector $\boldsymbol{\theta}$.

If we reformulate the problem as

$$\min_{\boldsymbol{\theta}} \ \left\|P_\Omega(M(\boldsymbol{\theta})) - P_\Omega(Y)\right\|_F^2 \quad \text{s.t.} \quad \|\boldsymbol{\theta}\|_0 \le r, \tag{6}$$

we can solve it by an orthogonal matching pursuit type algorithm using rank-one matrices as the basis. In particular, we are to find a suitable subset of the over-complete set of rank-one matrix coordinates, and learn the weight for each selected coordinate. This is achieved by executing two steps alternately: one is to pursue a new basis matrix, and the other is to learn the weights of the current basis.

Suppose that after the $(k-1)$-th iteration, the rank-one basis matrices $M_1, \ldots, M_{k-1}$ and their current weight vector $\boldsymbol{\theta}^{k-1}$ have already been computed. In the $k$-th iteration, we are to pursue a new rank-one basis matrix $M_k$ with unit Frobenius norm, which is most correlated with the current observed regression residual $R_k$, where

$$R_k = Y_\Omega - X_{k-1} \qquad \text{and} \qquad X_{k-1} = \sum_{i=1}^{k-1} \theta_i^{k-1} (M_i)_\Omega.$$

Therefore, $M_k$ can be chosen to be an optimal solution of the following problem:

$$M_k = \underset{M \in \mathcal{M}}{\arg\max} \ \langle M, R_k \rangle. \tag{7}$$

Notice that each rank-one matrix $M$ with unit Frobenius norm can be written as the product of two unit vectors, namely, $M = \mathbf{u}\mathbf{v}^\top$ for some $\mathbf{u} \in \mathbb{R}^n$ and $\mathbf{v} \in \mathbb{R}^m$ with $\|\mathbf{u}\| = \|\mathbf{v}\| = 1$. We then see that problem (7) can be equivalently reformulated as

$$\max_{\mathbf{u}, \mathbf{v}} \ \left\{ \mathbf{u}^\top R_k \mathbf{v} \ : \ \|\mathbf{u}\| = \|\mathbf{v}\| = 1 \right\}. \tag{8}$$

Clearly, the optimal solution $(\mathbf{u}_k, \mathbf{v}_k)$ of problem (8) is a pair of top left and right singular vectors of $R_k$. It can be efficiently computed by the power method [17, 9]. The new rank-one basis matrix is then readily available by setting $M_k = \mathbf{u}_k \mathbf{v}_k^\top$.
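For completeness, a minimal power-method sketch for the top singular vector pair of the residual is given below; the iteration count and random initialization are illustrative choices of ours, not the exact routine used in the paper:

```python
import numpy as np

def top_singular_pair(R, n_iter=50, seed=0):
    """Approximate the top left/right singular vectors of R by alternating
    multiplications with R and R^T (the power method applied to R^T R)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(R.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        u = R @ v
        u /= np.linalg.norm(u)
        v = R.T @ u
        v /= np.linalg.norm(v)
    sigma = float(u @ (R @ v))   # top singular value u^T R v
    return u, v, sigma
```

Since $R_k$ is supported on $\Omega$, each multiplication by $R_k$ or $R_k^\top$ touches only the observed entries, which is the source of the $O(|\Omega|)$ per-iteration cost discussed in the introduction.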

After finding the new rank-one basis matrix $M_k$, we update the weights $\boldsymbol{\theta}^k$ for all currently available basis matrices $\{M_1, \ldots, M_k\}$ by solving the following least squares regression problem:

$$\boldsymbol{\theta}^k = \underset{\boldsymbol{\theta} \in \mathbb{R}^k}{\arg\min} \ \Big\| \sum_{i=1}^{k} \theta_i (M_i)_\Omega - Y_\Omega \Big\|_F^2. \tag{9}$$

By reshaping the matrices $(M_i)_\Omega$ and $Y_\Omega$ into vectors $\bar{\mathbf{m}}_i$ and $\mathbf{y}$, we can easily see that the optimal solution of (9) is given by

$$\boldsymbol{\theta}^k = (\bar{M}_k^\top \bar{M}_k)^{-1} \bar{M}_k^\top \mathbf{y}, \tag{10}$$

where $\bar{M}_k = [\bar{\mathbf{m}}_1, \ldots, \bar{\mathbf{m}}_k]$ is the matrix formed by all reshaped basis vectors. The row size of the matrix $\bar{M}_k$ is the total number of observed entries. It is computationally expensive to calculate this matrix product directly. We simplify this step by an incremental process, and give the implementation details in the Appendix.
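A direct (non-incremental) version of this weight update can be sketched as follows; the paper's incremental implementation avoids recomputing the Gram matrix $\bar{M}_k^\top \bar{M}_k$ from scratch, but the naive least-squares solve below produces the same $\boldsymbol{\theta}^k$:

```python
import numpy as np

def update_weights(M_bar, y_obs):
    """Solve min_theta || M_bar @ theta - y_obs ||_2^2.

    M_bar : (|Omega|, k) matrix whose i-th column holds the observed entries of M_i.
    y_obs : (|Omega|,) vector of observed entries of Y.
    """
    theta, *_ = np.linalg.lstsq(M_bar, y_obs, rcond=None)
    return theta
```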

We run the above two steps iteratively until some desired stopping condition is satisfied. We can terminate the method based on the rank of the estimated matrix or the approximation residual. In particular, one can choose a preferred rank of the approximate solution matrix. Alternatively, one can stop the method once the residual is less than a tolerance parameter $\epsilon$. The main steps of Orthogonal Rank-One Matrix Pursuit (OR1MP) are given in Algorithm 1.

  Input: $Y_\Omega$ and a stopping criterion.
  Initialize: Set $X_0 = 0$, $\boldsymbol{\theta}^0 = 0$ and $k = 1$.
  repeat
     Step 1: Find a pair of top left and right singular vectors $(\mathbf{u}_k, \mathbf{v}_k)$ of the observed residual matrix $R_k = Y_\Omega - X_{k-1}$ and set $M_k = \mathbf{u}_k \mathbf{v}_k^\top$.
     Step 2: Compute the weight vector $\boldsymbol{\theta}^k$ using the closed form least squares solution $\boldsymbol{\theta}^k = (\bar{M}_k^\top \bar{M}_k)^{-1} \bar{M}_k^\top \mathbf{y}$.
     Step 3: Set $X_k = \sum_{i=1}^{k} \theta_i^k (M_i)_\Omega$ and $k \leftarrow k + 1$.
  until  stopping criterion is satisfied
  Output: Constructed matrix $\hat{Y} = \sum_{i} \theta_i^k M_i$.
Algorithm 1 Orthogonal Rank-One Matrix Pursuit (OR1MP)
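Putting the two steps together, a compact NumPy sketch of the OR1MP iteration is shown below. It is our own illustrative implementation under simplifying assumptions (a dense boolean mask, a fixed target rank as the stopping criterion, and a full SVD in place of the power method), not the paper's optimized code:

```python
import numpy as np

def or1mp(Y, mask, rank):
    """Orthogonal Rank-One Matrix Pursuit (illustrative dense-mask sketch).

    Y    : (n, m) array holding observed values (entries outside mask are ignored).
    mask : (n, m) boolean array, True where Y is observed (the set Omega).
    rank : number of rank-one basis matrices to pursue (stopping criterion).
    """
    Y_obs = np.where(mask, Y, 0.0)          # Y_Omega
    X = np.zeros_like(Y_obs)                # current estimate X_{k-1} on Omega
    bases, cols = [], []                    # M_i and the observed entries of each M_i
    for _ in range(rank):
        # Step 1: top singular pair of the observed residual R_k = Y_Omega - X_{k-1}.
        R = Y_obs - X
        U, s, Vt = np.linalg.svd(R, full_matrices=False)
        M = np.outer(U[:, 0], Vt[0, :])     # M_k = u_k v_k^T, unit Frobenius norm
        bases.append(M)
        cols.append(M[mask])
        # Step 2: least squares weights over all bases pursued so far (Eqs. (9)-(10)).
        M_bar = np.column_stack(cols)
        theta, *_ = np.linalg.lstsq(M_bar, Y_obs[mask], rcond=None)
        # Step 3: update the estimate on the observed entries.
        X = np.zeros_like(Y_obs)
        X[mask] = M_bar @ theta
    # Completed matrix: linear combination of the full rank-one bases.
    return sum(t * M for t, M in zip(theta, bases))
```

Replacing the full SVD by the power-method routine sketched earlier recovers the $O(|\Omega|)$ per-iteration cost discussed in the introduction.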
Remark

In our algorithm, we adapt orthogonal matching pursuit on the observed part of the matrix. This is similar to the GECO algorithm. However, GECO constructs the estimated matrix by projecting the observation matrix onto a much larger subspace, namely the product of two subspaces spanned by all left singular vectors and all right singular vectors obtained up to the current iteration. It therefore has a much higher computational complexity. Lee et al. [25] recently proposed the ADMiRA algorithm, which is also a greedy approach: in each step it first chooses $2r$ components by a top-$2r$ truncated SVD and then uses another top-$r$ truncated SVD to obtain a rank-$r$ matrix. Thus, the ADMiRA algorithm is computationally more expensive than the proposed algorithm. The difference between the proposed algorithm and ADMiRA is somewhat similar to the difference between OMP [35] for learning sparse vectors and CoSaMP [33]. In addition, the performance guarantees (including the recovery guarantee and convergence property) of ADMiRA rely on strong assumptions, i.e., the matrix involved in the loss function satisfies a rank-restricted isometry property [25].

3 Convergence Analysis of Algorithm 1

In this section, we show that Algorithm 1 is convergent and achieves a linear convergence rate. This result is given in the following theorem.

Theorem 3.1

The orthogonal rank-one matrix pursuit algorithm satisfies

$$\|R_k\| \ \le\ \left(\sqrt{1 - \frac{1}{\min(m,n)}}\,\right)^{k-1} \|Y_\Omega\|, \qquad \text{for all } k \ge 1.$$

Before proving Theorem 3.1, we need to establish some useful and preparatory properties of Algorithm 1. The first property says that $R_{k+1}$ is perpendicular to all previously generated $M_i$ for $i = 1, \ldots, k$.

Property 3.1

$\langle R_{k+1}, M_i \rangle = 0$ for $i = 1, \ldots, k$.

Recall that $\boldsymbol{\theta}^k$ is the optimal solution of problem (9). By the first-order optimality condition, one has

$$\Big\langle Y_\Omega - \sum_{i=1}^{k} \theta_i^k (M_i)_\Omega, \ (M_i)_\Omega \Big\rangle = 0, \qquad i = 1, \ldots, k,$$

which together with $R_{k+1} = Y_\Omega - \sum_{i=1}^{k} \theta_i^k (M_i)_\Omega$ and $\langle R_{k+1}, (M_i)_\Omega \rangle = \langle R_{k+1}, M_i \rangle$ implies that $\langle R_{k+1}, M_i \rangle = 0$ for $i = 1, \ldots, k$.

The following property shows that as the number of rank-one basis matrices increases during our learning process, the residual does not increase.

Property 3.2

$\|R_{k+1}\| \le \|R_k\|$ for all $k \ge 1$.

We observe that for all $k \ge 1$,

$$\|R_{k+1}\|^2 \ =\ \min_{\boldsymbol{\theta} \in \mathbb{R}^k} \Big\| Y_\Omega - \sum_{i=1}^{k} \theta_i (M_i)_\Omega \Big\|^2 \ \le\ \min_{\boldsymbol{\theta} \in \mathbb{R}^{k-1}} \Big\| Y_\Omega - \sum_{i=1}^{k-1} \theta_i (M_i)_\Omega \Big\|^2 \ =\ \|R_k\|^2,$$

and hence the conclusion holds.

We next establish that $\{(M_1)_\Omega, \ldots, (M_k)_\Omega\}$ is linearly independent unless $R_k = 0$. It follows that formula (10) is well-defined and hence $\boldsymbol{\theta}^k$ is uniquely defined before the algorithm stops.

Property 3.3

Suppose that $R_K \ne 0$ for some $K$. Then, $\bar{M}_k$ has full column rank for all $k \le K$.

Using Property 3.2 and the assumption that $R_K \ne 0$ for some $K$, we see that $R_k \ne 0$ for all $k \le K$. We now prove the statement of this property by induction on $k$. Indeed, since $R_1 \ne 0$, we clearly have $\bar{M}_1 \ne 0$. Hence the conclusion holds for $k = 1$. We now assume that it holds for $k - 1 < K$ and need to show that it also holds for $k \le K$. By the induction hypothesis, $\bar{M}_{k-1}$ has full column rank. Suppose for contradiction that $\bar{M}_k$ does not have full column rank. Then, there exists $\boldsymbol{\alpha} \in \mathbb{R}^{k-1}$ such that

$$(M_k)_\Omega = \sum_{i=1}^{k-1} \alpha_i (M_i)_\Omega,$$

which together with Property 3.1 implies that $\langle R_k, M_k \rangle = \langle R_k, (M_k)_\Omega \rangle = \sum_{i=1}^{k-1} \alpha_i \langle R_k, (M_i)_\Omega \rangle = 0$. It follows that

$$\sigma_1(R_k) = \mathbf{u}_k^\top R_k \mathbf{v}_k = \langle M_k, R_k \rangle = 0,$$

and hence $R_k = 0$, which contradicts the fact that $R_k \ne 0$ for all $k \le K$. Therefore, $\bar{M}_k$ has full column rank and the conclusion holds for general $k$.

We next build a relationship between two consecutive residuals and . For convenience, define and let

.

In view of (9), one can observe that

(11)

Let

(12)

By the definition of , one can also observe that

Property 3.4

and , where is defined in (12).

Since , it follows from Property 3 that . We then have

We next bound from below. If , clearly holds. We now suppose throughout the remaining proof that . It then follows from Property 3 that has a full column rank. Using this fact and (11), we have

where is the reshaped residual vector of . Invoking that , we then obtain

(13)

Let be the QR factorization of , where and is a nonsingular upper triangular matrix. One can observe that , where denotes the -th column of the matrix and is the reshaped vector of . Recall that . Hence, . Due to , and the definition of , we have

In addition, by Property 3, we have

(14)

Substituting into (13), and using and (14), we obtain that

where the last equality follows since is upper triangular and the last inequality is due to .

We are now ready to prove Theorem 3.1.

[Proof of Theorem 3.1] Using the definition of $M_k$, we have

$$\langle M_k, R_k \rangle = \mathbf{u}_k^\top R_k \mathbf{v}_k = \sigma_1(R_k) \ \ge\ \frac{\|R_k\|}{\sqrt{\operatorname{rank}(R_k)}} \ \ge\ \frac{\|R_k\|}{\sqrt{\min(m,n)}},$$

since $\|R_k\|^2 = \sum_i \sigma_i^2(R_k) \le \operatorname{rank}(R_k)\,\sigma_1^2(R_k)$. Using this inequality and Property 3.4, we obtain that

$$\|R_{k+1}\|^2 \ \le\ \Big(1 - \frac{1}{\min(m,n)}\Big)\,\|R_k\|^2.$$

In view of this relation and the fact that $\|R_1\| = \|Y_\Omega\|$, we easily conclude that

$$\|R_k\| \ \le\ \left(\sqrt{1 - \frac{1}{\min(m,n)}}\,\right)^{k-1} \|Y_\Omega\|.$$

This completes the proof.

Remark

If $\Omega$ is the entire set of all indices of $Y$, our orthogonal rank-one matrix pursuit algorithm is equivalent to the standard singular value decomposition computed via the power method. In particular, when $\Omega$ contains all indices but the given entries are noisy values of an exact matrix, our OR1MP algorithm can help remove the noise.

Remark

In a standard study of the convergence rate of the Orthogonal Matching Pursuit (OMP) or Orthogonal Greedy Algorithm (OGA), one can only obtain a bound of the order $\|R_k\| = O(1/\sqrt{k})$, which leads to sub-linear convergence. Our $M_k$ is a data dependent construction, based on the top left and right singular vectors of the residual matrix $R_k$. It thus gives a better estimate of $\langle M_k, R_k \rangle$, which yields the linear convergence.
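To make the contrast concrete, the two regimes can be summarized as follows (an informal restatement: $\gamma = 1/\min(m,n)$ denotes the constant from Theorem 3.1, and the $O(1/k)$ estimate is the classical OMP/OGA-type bound referred to above):

$$\text{sub-linear:}\quad \|R_k\|^2 \ \lesssim\ \frac{C}{k} \ \Longrightarrow\ k = O(1/\epsilon) \ \text{iterations for } \|R_k\|^2 \le \epsilon;$$

$$\text{linear:}\quad \|R_{k+1}\|^2 \ \le\ (1-\gamma)\,\|R_k\|^2 \ \Longrightarrow\ \|R_k\|^2 \ \le\ (1-\gamma)^{k-1}\,\|Y_\Omega\|^2, \quad k = O\big(\log(1/\epsilon)\big).$$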

4 An Economic Orthogonal Rank-One Matrix Pursuit Algorithm

The proposed OR1MP algorithm has to track all pursued bases and save them in memory. It demands $O(r|\Omega|)$ storage to obtain a rank-$r$ estimated matrix. For large-scale problems, such a storage requirement is not negligible and restricts the rank of the matrix to be estimated. To adapt our algorithm to large-scale problems with a large approximation rank, we simplify the orthogonal projection step by only tracking the estimated matrix $X_{k-1}$ and the rank-one update matrix $M_k$. In this case, we only need to estimate the weights for these two matrices by solving the following least squares problem:

$$(\alpha_1^k, \alpha_2^k) = \underset{\alpha_1, \alpha_2}{\arg\min} \ \big\| \alpha_1 X_{k-1} + \alpha_2 (M_k)_\Omega - Y_\Omega \big\|_F^2. \tag{15}$$

This still corrects all weights of the existing bases, although the correction is sub-optimal. If we write the estimated matrix as a linear combination of the bases, we have $X_k = \sum_{i=1}^{k} \theta_i^k (M_i)_\Omega$ with $\theta_k^k = \alpha_2^k$ and $\theta_i^k = \alpha_1^k \theta_i^{k-1}$ for $i < k$. The detailed procedure of this simplified method is given in Algorithm 2.

  Input: $Y_\Omega$ and a stopping criterion.
  Initialize: Set $X_0 = 0$, $\boldsymbol{\theta}^0 = 0$ and $k = 1$.
  repeat
     Step 1: Find a pair of top left and right singular vectors $(\mathbf{u}_k, \mathbf{v}_k)$ of the observed residual matrix $R_k = Y_\Omega - X_{k-1}$ and set $M_k = \mathbf{u}_k \mathbf{v}_k^\top$.
     Step 2: Compute the optimal weights $(\alpha_1^k, \alpha_2^k)$ for $X_{k-1}$ and $(M_k)_\Omega$ by solving: $\min_{\alpha_1, \alpha_2} \|\alpha_1 X_{k-1} + \alpha_2 (M_k)_\Omega - Y_\Omega\|^2$.
     Step 3: Set $X_k = \alpha_1^k X_{k-1} + \alpha_2^k (M_k)_\Omega$; $\theta_k^k = \alpha_2^k$ and $\theta_i^k = \alpha_1^k \theta_i^{k-1}$ for $i < k$; $k \leftarrow k + 1$.
  until  stopping criterion is satisfied
  Output: Constructed matrix $\hat{Y} = \sum_{i} \theta_i^k M_i$.
Algorithm 2 Economic Orthogonal Rank-One Matrix Pursuit (EOR1MP)
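Because problem (15) has only two unknowns, each EOR1MP weight update reduces to a $2 \times 2$ linear system. The NumPy sketch below solves the corresponding normal equations; it is an illustrative implementation consistent with (15), not necessarily the paper's exact routine:

```python
import numpy as np

def economic_weights(X_prev, M_obs, Y_obs):
    """Solve min_{a1, a2} || a1*X_prev + a2*M_obs - Y_obs ||_F^2 in closed form.
    All arguments are arrays supported on the observed set Omega."""
    A = np.array([[np.vdot(X_prev, X_prev), np.vdot(X_prev, M_obs)],
                  [np.vdot(X_prev, M_obs),  np.vdot(M_obs, M_obs)]])
    b = np.array([np.vdot(X_prev, Y_obs), np.vdot(M_obs, Y_obs)])
    a1, a2 = np.linalg.solve(A, b)   # normal equations of the two-variable least squares
    return a1, a2
```

The new estimate is then $X_k = a_1 X_{k-1} + a_2 (M_k)_\Omega$; every previous basis weight is rescaled by $a_1$ while the new basis receives weight $a_2$, exactly as in Step 3 of Algorithm 2.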

The proposed economic orthogonal rank-one matrix pursuit algorithm (EOR1MP) uses the same amount of storage as the greedy algorithms [17, 42], which is significantly smaller than that required by our OR1MP algorithm, Algorithm 1. Interestingly, we can show that the EOR1MP algorithm is still convergent and retains the linear convergence rate. The main result is given in the following theorem.

Theorem 4.1

Algorithm 2, the economic orthogonal rank-one matrix pursuit algorithm, satisfies

$$\|R_k\| \ \le\ \left(\sqrt{1 - \frac{1}{\min(m,n)}}\,\right)^{k-1} \|Y_\Omega\|, \qquad \text{for all } k \ge 1.$$

Before proving Theorem 4.1, we present several useful properties of Algorithm 2. The first property says that $R_{k+1}$ is perpendicular to the matrix $X_{k-1}$ and the matrix $(M_k)_\Omega$.

Property 4.1

$\langle R_{k+1}, X_{k-1} \rangle = 0$ and $\langle R_{k+1}, M_k \rangle = 0$.

Recall that $(\alpha_1^k, \alpha_2^k)$ is the optimal solution of problem (15). By the first-order optimality conditions with respect to $\alpha_1$ and $\alpha_2$, one has

$$\big\langle Y_\Omega - \alpha_1^k X_{k-1} - \alpha_2^k (M_k)_\Omega, \ X_{k-1} \big\rangle = 0$$

and

$$\big\langle Y_\Omega - \alpha_1^k X_{k-1} - \alpha_2^k (M_k)_\Omega, \ (M_k)_\Omega \big\rangle = 0,$$

which together with $R_{k+1} = Y_\Omega - \alpha_1^k X_{k-1} - \alpha_2^k (M_k)_\Omega$ implies that $\langle R_{k+1}, X_{k-1} \rangle = 0$ and $\langle R_{k+1}, M_k \rangle = 0$.

Property 4.2

$\langle R_{k+1}, X_k \rangle = 0$ for all $k \ge 1$.

We observe that for all $k \ge 1$,

$$\langle R_{k+1}, X_k \rangle = \alpha_1^k \langle R_{k+1}, X_{k-1} \rangle + \alpha_2^k \langle R_{k+1}, (M_k)_\Omega \rangle = 0,$$

as $X_k = \alpha_1^k X_{k-1} + \alpha_2^k (M_k)_\Omega$, and hence the conclusion holds.

The following property shows that as the number of rank-one basis matrices increases during our iterative process, the residual does not increase.

Property 4.3

$\|R_{k+1}\| \le \|R_k\|$ for all $k \ge 1$.

We observe that for all $k \ge 1$,

$$\|R_{k+1}\|^2 = \min_{\alpha_1, \alpha_2} \big\| Y_\Omega - \alpha_1 X_{k-1} - \alpha_2 (M_k)_\Omega \big\|^2 \ \le\ \big\| Y_\Omega - X_{k-1} \big\|^2 = \|R_k\|^2,$$

and hence the conclusion holds.

Let

$$A_k = \begin{pmatrix} \langle X_{k-1}, X_{k-1} \rangle & \langle X_{k-1}, (M_k)_\Omega \rangle \\ \langle X_{k-1}, (M_k)_\Omega \rangle & \langle (M_k)_\Omega, (M_k)_\Omega \rangle \end{pmatrix}$$

and $\mathbf{b}_k = \big(\langle X_{k-1}, Y_\Omega \rangle, \ \langle (M_k)_\Omega, Y_\Omega \rangle\big)^\top$. The solution of problem (15) is $\boldsymbol{\alpha}^k = A_k^{-1}\mathbf{b}_k$. We next establish that $X_{k-1}$ and $(M_k)_\Omega$ are linearly independent unless $R_k = 0$. It follows that $A_k$ is invertible and hence $\boldsymbol{\alpha}^k$ is uniquely defined before the algorithm stops.

Property 4.4

If for some , then .

If with nonzero , we get

and hence the conclusion holds with given in Property 4.

Property 4.5

Let be the maximum singular value of . for all .

The optimum in our algorithm satisfies

Using the fact that and , we get the conclusion.

Property 4.6

Suppose that for some . Then, for all .

If with , we have

As , we have and . Then from the above equality, we conclude that is the unique optimal solution of the minimization in terms of , thus we obtain its first-order optimality condition: . However, this contradicts with

This completes the proof.

We next build a relationship between two consecutive residuals $\|R_k\|$ and $\|R_{k+1}\|$.

Property 4.7

.

This has a closed form solution as . Plugging this optimum back into the formulation, we get