A Note on Alternating Minimization Algorithm for the Matrix Completion Problem

02/05/2016 · David Gamarnik, et al. · MIT

We consider the problem of reconstructing a low rank matrix from a subset of its entries and analyze two variants of the so-called Alternating Minimization algorithm, which has been proposed in the past. We establish that when the underlying matrix has rank r=1, has positive bounded entries, and the graph G underlying the revealed entries has bounded degree and a diameter that is at most logarithmic in the size of the matrix, both algorithms succeed in reconstructing the matrix approximately in polynomial time, starting from an arbitrary initialization. We further provide simulation results which suggest that the second algorithm, which is based on message passing type updates, performs significantly better.


1 Introduction

Matrix completion refers to the problem of recovering a low rank matrix from an incomplete subset of its entries. This problem arises in a vast number of applications that involve collaborative filtering, where one attempts to predict the unknown preferences of a certain user based on the collective known preferences of a large number of users. It has attracted a great deal of attention in recent years due to its applications in recommendation systems, most notably the well-known Netflix Prize.

1.1 Formulation

Let $M$ be an $n \times m$ matrix of rank $r$, where $r \ll \min(n, m)$. Let $E$ be a subset of the index set $[n] \times [m]$ and let $M_E$ denote the entries of $M$ corresponding to the subset $E$. The matrix completion problem is the problem of reconstructing $M$ using efficient (polynomial time) algorithms given $M_E$.

Without any further assumptions, the matrix completion problem is NP-hard [1]. However, under certain conditions the problem has been shown to be tractable. The most common assumption considered in the literature is that the matrix is "incoherent" and the subset $E$ is chosen using some random mechanism, for example uniformly at random. The incoherence condition was introduced by Candès and Recht [2] and Candès and Tao [3], where it was shown that a convex relaxation resulting in nuclear norm minimization succeeds in reconstructing the matrix, assuming a certain lower bound on the size of $E$. Keshavan et al. [4, 5] use an algorithm consisting of a truncated singular value projection followed by a local minimization subroutine on the Grassmann manifold and show that it succeeds under a comparable lower bound on the number of revealed entries. Jain et al. [6] show that the local minimization in [4] can be successfully replaced by the Alternating Minimization algorithm. The use of Belief Propagation for matrix factorization has also been studied heuristically by physicists in [7]. This is just a small subset of the vast literature on the matrix completion problem, restricted to the work most relevant to the present paper.

1.2 Algorithms and the results

For the rest of the paper we will assume for simplicity that $m = n$, though our results easily extend to the more general case $m \neq n$. Let $U$ and $V$ denote the sets of rows and columns of $M$ respectively, both indexed for simplicity by the set $\{1, \ldots, n\}$, and let $G = (U \cup V, E)$ be a bipartite undirected graph on the vertex set $U \cup V$ with edge set $E$, where we recall that $E$ is the set of revealed entries of $M$. Specifically, the edge $(u, v)$ belongs to $E$ if and only if the entry $M_{uv}$ is revealed. Denote by $d$ the maximum degree of $G$ (the maximum number of neighbors among all nodes of $G$). The graph $G$ represents the structure of the revealed entries of $M$. For $X, Y \in \mathbb{R}^{n \times r}$ we denote the $u$-th row of $X$ by $x_u$ and the $v$-th row of $Y$ by $y_v$, both thought of as column vectors. Then $M = XY^T$ if and only if $M_{uv} = x_u^T y_v$ for all $u, v$. The matrix completion problem is then the problem of finding $X, Y \in \mathbb{R}^{n \times r}$ such that for every $(u, v) \in E$, $x_u^T y_v = M_{uv}$, where $x_u$ and $y_v$ are rows of $X$ and $Y$ respectively. Alternatively, one can consider the optimization problem

$$\min_{X, Y \in \mathbb{R}^{n \times r}} \; \sum_{(u,v) \in E} \left( M_{uv} - x_u^T y_v \right)^2 \qquad (1)$$

and seek solutions with zero objective value. Observe that, computational tractability aside, one can only recover $X$ and $Y$ up to an orthogonal transformation, since for any orthogonal matrix $Q$, the pair $XQ$, $YQ$ solves the same matrix completion problem.
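As a concrete illustration of the objective in (1) and of the invariance just noted, the following minimal NumPy sketch evaluates the objective on the revealed entries and checks that rotating both factors by the same orthogonal matrix leaves it unchanged. The function name, the test matrix, and the revealed index set are hypothetical choices made for illustration only.

```python
import numpy as np

def completion_objective(M, edges, X, Y):
    """Objective (1): squared error over the revealed entries only."""
    return sum((M[u, v] - X[u] @ Y[v]) ** 2 for (u, v) in edges)

# A small hypothetical instance: a rank-2 matrix with a few revealed entries.
rng = np.random.default_rng(0)
n, r = 6, 2
X_true = rng.uniform(1.0, 2.0, size=(n, r))
Y_true = rng.uniform(1.0, 2.0, size=(n, r))
M = X_true @ Y_true.T
edges = [(0, 1), (2, 3), (4, 0), (1, 5), (3, 2)]

# The true factors attain zero objective value on the revealed entries.
assert completion_objective(M, edges, X_true, Y_true) < 1e-12

# Rotating both factors by the same orthogonal matrix Q does not change the objective.
Q, _ = np.linalg.qr(rng.normal(size=(r, r)))
assert completion_objective(M, edges, X_true @ Q, Y_true @ Q) < 1e-12
```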

In this paper we consider two versions of the Alternating Minimization algorithm, which we call the Vertex Least Squares (VLS) and Edge Least Squares (ELS) algorithms, where VLS is identical to the Alternating Minimization algorithm analyzed in [6], and ELS is a message passing type version of VLS, which has some similarity with the widely studied Belief Propagation algorithm. Unlike [6], where VLS was used as a local optimization subroutine following a singular value projection, i.e., with a warm start, we consider the issue of global convergence of VLS and ELS. For the special case of rank $r = 1$, bounded positive entries of $M$, bounded degree of $G$, and when the diameter of $G$ is $O(\log n)$, we establish that both algorithms converge to the correct factorization of $M$ geometrically fast. In particular, the algorithms produce a rank-one approximation of $M$ with error $\epsilon$ in time which is polynomial in $n$ and in $\log(1/\epsilon)$, with constants that only depend on the parameters of the model. Our proof approach is based on establishing a certain contraction for the steps of VLS and ELS, similar to the contractions used for bounding mixing rates of Markov chains.
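To illustrate the type of contraction we have in mind, consider the following generic estimate, stated here as standard background rather than as a result from the paper. Suppose a vector $z^{(t)}$ evolves as $z^{(t+1)} = P^{(t)} z^{(t)}$, where each $P^{(t)}$ is row-stochastic, every non-zero entry of $P^{(t)}$ is at least $\delta > 0$, and every entry of each product $P^{(t+D-1)} \cdots P^{(t)}$ is non-zero (as happens, for example, when the common support graph is connected with diameter at most $D$ and contains self-loops). Then every entry of that product is at least $\delta^D$, and the spread $\operatorname{osc}(z) = \max_i z_i - \min_i z_i$ satisfies

$$\operatorname{osc}\!\left( z^{(t+D)} \right) \le \left( 1 - \delta^{D} \right) \operatorname{osc}\!\left( z^{(t)} \right),$$

so all coordinates of $z^{(t)}$ approach a common value geometrically fast.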

Even though our theoretical results show similar performance for VLS and ELS algorithms, experimentally we show that the ELS performs better and often significantly better. Specifically, in Section 3 we show that for certain classes of randomly constructed matrices ELS converges much faster than VLS, and for other classes of random graphs, ELS is converging while VLS is not. At this stage the theoretical understanding of such an apparent difference in the performance of two algorithms is lacking and constitutes an interesting open challenge, especially in light of the fact that VLS was a major component in the award winning algorithms for the Netflix challenge [8], [9].

2 VLS and ELS Algorithms

We now introduce and analyze two iterative decentralized algorithms based on the Alternating Minimization principle that attempt to solve the non-convex least squares problem in (1). The first is what we call the Vertex Least Squares (VLS) algorithm. For the bipartite graph $G$ we write $u \sim v$ if $u$ is connected to $v$.

 

VERTEX LEAST SQUARES (VLS)

 

  1. Initialization: initial vectors $y_v^{(0)}$, $v \in V$. Input: revealed entries $M_E$ and graph $G$.

  2. For $t = 1$ to $T$: For each $u \in U$, set

    $$x_u^{(t)} = \arg\min_{x \in \mathbb{R}^r} \sum_{v \sim u} \left( M_{uv} - x^T y_v^{(t-1)} \right)^2 \qquad (2)$$

    For each $v \in V$, set

    $$y_v^{(t)} = \arg\min_{y \in \mathbb{R}^r} \sum_{u \sim v} \left( M_{uv} - \left( x_u^{(t)} \right)^T y \right)^2 \qquad (3)$$
  3. Set $\hat{X} = \left( x_u^{(T)} \right)_{u \in U}$, $\hat{Y} = \left( y_v^{(T)} \right)_{v \in V}$. Output $\hat{X} \hat{Y}^T$.

 

Each iteration of the VLS consists of solving $2n$ least squares problems in $r$ variables, one per vertex, so the total computation time per iteration is $O(|E| r^2 + n r^3)$ for the natural implementation via the normal equations.

The VLS algorithm in the above form is identical to Alternating Minimization [6] and exploits the biconvex structure of the objective in (1). We prefer to write the iterations of this algorithm in the above form to highlight the local decentralized nature of the updates at each vertex. In [6], this algorithm was used as a local minimization subroutine with a warm start provided by an SVD projection step prior to it. As we are about to establish, VLS solves the matrix completion problem for the case $r = 1$ under the positivity assumption stated below, without any warm start. Furthermore, we will present simulation results showing that in many cases VLS solves the completion problem for general $r$.
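To make the updates concrete, here is a minimal NumPy sketch of VLS specialized to the rank $r = 1$ case treated in Theorem 2.1, where the least squares problems (2)-(3) reduce to the closed-form ratios used below. The function and variable names are our own, and the snippet is an illustration rather than the authors' implementation.

```python
import numpy as np

def vls_rank_one(M, edges, n, T, y0=None):
    """Vertex Least Squares for r = 1: alternately solve the scalar
    least-squares problems (2) and (3) at every row/column vertex.
    Assumes every vertex has at least one revealed entry."""
    # Adjacency lists of the bipartite graph G of revealed entries.
    rows_of = [[] for _ in range(n)]   # rows_of[v] = rows u with (u, v) revealed
    cols_of = [[] for _ in range(n)]   # cols_of[u] = columns v with (u, v) revealed
    for u, v in edges:
        cols_of[u].append(v)
        rows_of[v].append(u)

    y = np.ones(n) if y0 is None else np.asarray(y0, dtype=float).copy()
    x = np.ones(n)
    for _ in range(T):
        # Update (2): x_u minimizes sum_{v ~ u} (M[u, v] - x * y[v])^2.
        for u in range(n):
            vs = cols_of[u]
            x[u] = sum(M[u, v] * y[v] for v in vs) / sum(y[v] ** 2 for v in vs)
        # Update (3): y_v minimizes sum_{u ~ v} (M[u, v] - x[u] * y)^2.
        for v in range(n):
            us = rows_of[v]
            y[v] = sum(M[u, v] * x[u] for u in us) / sum(x[u] ** 2 for u in us)
    return x, y  # candidate rank-one factors, so that M_hat = np.outer(x, y)
```

For a rank-one matrix $M$ with positive entries, a connected graph of revealed entries, and a positive initialization, np.outer(x, y) then approaches $M$; the individual factors are only determined up to the rescaling $x \mapsto c x$, $y \mapsto y / c$.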

Our main result concerning VLS is as follows. For every matrix $A$, we denote by $\|A\|_F$ its Frobenius norm: $\|A\|_F = \left( \sum_{i,j} A_{ij}^2 \right)^{1/2}$.

Theorem 2.1.

Let $M = \bar{X}\bar{Y}^T$ with $\bar{X}, \bar{Y} \in \mathbb{R}^{n \times 1}$ (that is, $r = 1$) and suppose there exist constants $0 < a \le b$ such that for all $u, v$ we have $a \le M_{uv} \le b$. Suppose that the graph $G$ is connected, has diameter $D \le C \log n$ for some fixed constant $C$, and has maximum degree $d$. There exists a constant $c$ which depends on $a$, $b$, $C$ and $d$ only, such that for any positive initialization and any $\epsilon > 0$, there exists an iteration number $T$, bounded by a polynomial in $n$ and $\log(1/\epsilon)$ whose degree and coefficients depend on $c$ only, such that after $T$ iterates of VLS we have $\|M - \hat{X}\hat{Y}^T\|_F \le \epsilon$.

Before proceeding to the proof of Theorem 2.1, we remark here that in [6] the success of VLS was established by showing that the VLS updates resemble a power method with bounded error. In our proof we also show that the VLS updates are like time varying power method updates, but without any error term. In [6], the warm start of VLS required that the principal angle distance between the left and right singular vector subspaces of the actual matrix and of the initial iterates is at most a small constant. Under the conditions of Theorem 2.1, this may not always be the case. From [6], the subspace distance between two vectors $u$ and $w$ (the rank $r = 1$ case) is given by

$$\mathrm{dist}(u, w) = \sqrt{1 - \left( \frac{u^T w}{\|u\|_2 \|w\|_2} \right)^2}. \qquad (4)$$

Suppose, for instance, that the true factor is the all-ones vector while the initialization is

$$Y^{(0)} = (n, 1, 1, \ldots, 1)^T. \qquad (5)$$

Then the subspace distance between the two is $\sqrt{1 - O(1/n)}$, which exceeds any fixed constant smaller than one once $n$ is large. In fact the subspace distance can be very close to one. Nevertheless, according to Theorem 2.1, VLS converges to the correct solution.

Proof of Theorem 2.1.

Fix and find small enough so that

(6)

From the update rules for VLS in Eq. (2)-(3), we can write

$$x_u^{(t)} = \frac{\sum_{v \sim u} M_{uv}\, y_v^{(t-1)}}{\sum_{v \sim u} \left( y_v^{(t-1)} \right)^2}, \qquad y_v^{(t)} = \frac{\sum_{u \sim v} M_{uv}\, x_u^{(t)}}{\sum_{u \sim v} \left( x_u^{(t)} \right)^2}. \qquad (7)$$

Let $A$ be the adjacency matrix of $G$, i.e., $A_{uv} = 1$ if $(u, v) \in E$ and $A_{uv} = 0$ otherwise.

Write $M_{uv} = \bar{x}_u \bar{y}_v$, where $\bar{x}_u$ and $\bar{y}_v$ are the entries of the factors $\bar{X}$ and $\bar{Y}$, and define the ratios $R_u^{(t)} = x_u^{(t)} / \bar{x}_u$ and $Q_v^{(t)} = \bar{y}_v / y_v^{(t)}$. With the chosen initial conditions in the theorem, we have that the quantities $Q_v^{(0)}$ are positive and bounded away from zero and infinity. Using (7), the updates for $R_u^{(t)}$ and $Q_v^{(t)}$ can be written as,

$$R_u^{(t)} = \sum_{v \sim u} \frac{\left( y_v^{(t-1)} \right)^2}{\sum_{v' \sim u} \left( y_{v'}^{(t-1)} \right)^2}\, Q_v^{(t-1)}, \qquad Q_v^{(t)} = \sum_{u \sim v} \frac{\bar{x}_u x_u^{(t)}}{\sum_{u' \sim v} \bar{x}_{u'} x_{u'}^{(t)}}\, R_u^{(t)}. \qquad (8)$$

The convex combination update rules in (8) imply that all future iterates satisfy $\min_v Q_v^{(0)} \le R_u^{(t)} \le \max_v Q_v^{(0)}$ and $\min_v Q_v^{(0)} \le Q_v^{(t)} \le \max_v Q_v^{(0)}$. Combining the two updates in (8), we see that $R_u^{(t+1)}$ can be expressed as a convex combination of $\{ R_{u'}^{(t)},\, u' \in U \}$, i.e., there exists a stochastic matrix $P^{(t)}$ such that $R^{(t+1)} = P^{(t)} R^{(t)}$, where $R^{(t)} = \left( R_u^{(t)},\, u \in U \right)$ is expressed as a column vector. It is apparent that the support of $P^{(t)}$ is the same as the support of $A A^T$, i.e., $P^{(t)}$ is the transition probability matrix of a random walk on $U$, where $P^{(t)}_{u u'} > 0$ if and only if $u'$ is a distance two neighbor of $u$ in $G$. Although $P^{(t)}$ depends on $t$, we can prove some useful properties satisfied by $P^{(t)}$ that hold for all times $t$. In particular, observe from (8) that since the iterates $x_u^{(t)}$ and $y_v^{(t)}$ are bounded above and below, each non-zero entry of $P^{(t)}$ is bounded below by some constant $\delta > 0$.

Recalling that $D$ stands for the diameter of $G$, define the sequence of matrices $\Pi^{(k)}$, $k \ge 1$, as

$$\Pi^{(k)} = P^{(kD)} P^{(kD - 1)} \cdots P^{((k-1)D + 1)}. \qquad (9)$$

Then for any and , satisfies , where . Let . Then,

Combining the above gives

(10)

For and for large enough , we get by applying (10) recursively,

(11)

Substituting the definition of , we get , where . This means there exists a constant such that for all . From (8), we get that . Taking sufficiently small, we obtain

(12)

where (6) was used in the inclusion step.

Hence,

which completes the proof.

We now proceed to the Edge Least Squares (ELS) algorithm, which is a message passing version of the VLS algorithm. In this algorithm the variables, rather than being supported on nodes as $x_u$ and $y_v$ are, are supported on the directed edges of $G$ and will be correspondingly denoted by $x_{u \to v}$ and $y_{v \to u}$.

 

EDGE LEAST SQUARES (ELS)

 

  1. Initialization: $y_{v \to u}^{(0)}$ for all $(u, v) \in E$. Input: revealed entries $M_E$ and graph $G$.

  2. For $t = 1$ to $T$: For each $u \in U$ and $v \sim u$ set

    $$x_{u \to v}^{(t)} = \arg\min_{x \in \mathbb{R}^r} \sum_{v' \sim u,\, v' \neq v} \left( M_{u v'} - x^T y_{v' \to u}^{(t-1)} \right)^2 \qquad (13)$$

    For each $v \in V$ and $u \sim v$ set

    $$y_{v \to u}^{(t)} = \arg\min_{y \in \mathbb{R}^r} \sum_{u' \sim v,\, u' \neq u} \left( M_{u' v} - \left( x_{u' \to v}^{(t)} \right)^T y \right)^2 \qquad (14)$$
  3. Compute the vertex estimates $x_u = \arg\min_{x \in \mathbb{R}^r} \sum_{v \sim u} \left( M_{uv} - x^T y_{v \to u}^{(T)} \right)^2$ for each $u \in U$, and $y_v = \arg\min_{y \in \mathbb{R}^r} \sum_{u \sim v} \left( M_{uv} - x_u^T y \right)^2$ for each $v \in V$.

  4. Set $\hat{X} = \left( x_u \right)_{u \in U}$, $\hat{Y} = \left( y_v \right)_{v \in V}$. Output $\hat{X} \hat{Y}^T$.

 

Each iteration of the ELS consists of solving $2|E|$ least squares problems, one per directed edge, so the total computation time per iteration is roughly a factor $d$ larger than that of VLS.
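A matching NumPy sketch of the ELS messages (13)-(14) for the rank $r = 1$ case is given below. As before, the names and the final estimation step are our own illustrative choices rather than the paper's reference implementation, and the sketch assumes every vertex of $G$ has degree at least two so that each message sum is non-empty.

```python
import numpy as np

def els_rank_one(M, edges, n, T):
    """Edge Least Squares for r = 1: messages live on directed edges and
    each message excludes the contribution of its target vertex.
    Assumes every vertex has degree at least 2."""
    cols_of = [[] for _ in range(n)]   # revealed column neighbors of row u
    rows_of = [[] for _ in range(n)]   # revealed row neighbors of column v
    for u, v in edges:
        cols_of[u].append(v)
        rows_of[v].append(u)

    # Messages x[(u, v)] (row u -> column v) and y[(v, u)] (column v -> row u).
    x = {(u, v): 1.0 for u, v in edges}
    y = {(v, u): 1.0 for u, v in edges}
    for _ in range(T):
        # Update (13): x_{u->v} uses all revealed neighbors of u except v.
        for u, v in edges:
            num = sum(M[u, w] * y[(w, u)] for w in cols_of[u] if w != v)
            den = sum(y[(w, u)] ** 2 for w in cols_of[u] if w != v)
            x[(u, v)] = num / den
        # Update (14): y_{v->u} uses all revealed neighbors of v except u.
        for u, v in edges:
            num = sum(M[w, v] * x[(w, v)] for w in rows_of[v] if w != u)
            den = sum(x[(w, v)] ** 2 for w in rows_of[v] if w != u)
            y[(v, u)] = num / den

    # Final vertex estimates: a full least squares step over all neighbors.
    x_hat = np.array([sum(M[u, w] * y[(w, u)] for w in cols_of[u])
                      / sum(y[(w, u)] ** 2 for w in cols_of[u]) for u in range(n)])
    y_hat = np.array([sum(M[w, v] * x_hat[w] for w in rows_of[v])
                      / sum(x_hat[w] ** 2 for w in rows_of[v]) for v in range(n)])
    return x_hat, y_hat  # candidate rank-one factors
```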

For the special case of rank one matrices, it is possible to conduct an analysis of the ELS iterations along the lines of the proof of Theorem 2.1 for VLS. Let $\vec{G} = (\vec{U} \cup \vec{V}, \vec{E})$ be the dual graph on the directed edges of $G$. Here $\vec{U}$ consists of all directed edges $u \to v$ of $G$ where $u \in U$ and $v \in V$, and $\vec{V}$ consists of all directed edges $v \to u$ of $G$, defined similarly. Additionally, $u \to v \in \vec{U}$ and $v' \to u' \in \vec{V}$ are neighbors in $\vec{G}$ if and only if $u' = u$ and $v' \neq v$, or $v' = v$ and $u' \neq u$. Similarly to (7), we can write the corresponding update rules for ELS as follows

$$x_{u \to v}^{(t)} = \frac{\sum_{v' \sim u,\, v' \neq v} M_{u v'}\, y_{v' \to u}^{(t-1)}}{\sum_{v' \sim u,\, v' \neq v} \left( y_{v' \to u}^{(t-1)} \right)^2}, \qquad y_{v \to u}^{(t)} = \frac{\sum_{u' \sim v,\, u' \neq u} M_{u' v}\, x_{u' \to v}^{(t)}}{\sum_{u' \sim v,\, u' \neq u} \left( x_{u' \to v}^{(t)} \right)^2}. \qquad (15)$$

Define $R_{u \to v}^{(t)} = x_{u \to v}^{(t)} / \bar{x}_u$ and $Q_{v \to u}^{(t)} = \bar{y}_v / y_{v \to u}^{(t)}$. Then, similarly to (8), we can write the corresponding update rules for ELS as follows.

$$R_{u \to v}^{(t)} = \sum_{v' \sim u,\, v' \neq v} \frac{\left( y_{v' \to u}^{(t-1)} \right)^2}{\sum_{v'' \sim u,\, v'' \neq v} \left( y_{v'' \to u}^{(t-1)} \right)^2}\, Q_{v' \to u}^{(t-1)}, \qquad Q_{v \to u}^{(t)} = \sum_{u' \sim v,\, u' \neq u} \frac{\bar{x}_{u'} x_{u' \to v}^{(t)}}{\sum_{u'' \sim v,\, u'' \neq u} \bar{x}_{u''} x_{u'' \to v}^{(t)}}\, R_{u' \to v}^{(t)}. \qquad (16)$$

Again, as before, letting $R^{(t)} = \left( R_{u \to v}^{(t)} \right)$ we can write $R^{(t+1)} = \tilde{P}^{(t)} R^{(t)}$ for some stochastic matrix $\tilde{P}^{(t)}$. The support of $\tilde{P}^{(t)}$ is the graph on $\vec{U}$ where two vertices are neighbors if and only if they are distance two neighbors in $\vec{G}$. From the above equations, it is apparent that it is possible to prove a result similar to Theorem 2.1 for ELS. We state the result below. We omit the proof as it is identical to the proof of Theorem 2.1.

Theorem 2.2.

Let $M = \bar{X}\bar{Y}^T$ with $\bar{X}, \bar{Y} \in \mathbb{R}^{n \times 1}$ and suppose there exist constants $0 < a \le b$ such that for all $u, v$ we have $a \le M_{uv} \le b$. Suppose that the graph $G$ is connected, has diameter $D \le C \log n$ for some fixed constant $C$, and has maximum degree $d$. There exists a constant $c$ which depends on $a$, $b$, $C$ and $d$ only, such that for any positive initialization and any $\epsilon > 0$, there exists an iteration number $T$, bounded by a polynomial in $n$ and $\log(1/\epsilon)$ whose degree and coefficients depend on $c$ only, such that after $T$ iterates of ELS we have $\|M - \hat{X}\hat{Y}^T\|_F \le \epsilon$.

3 Experiments

In this section, we provide simulation results for the VLS and ELS algorithms with particular focus on

  • The convergence rate of VLS vs ELS

  • Success of VLS and ELS for rank $r > 1$.

In view of Theorem 2.1, we generate the entries of the rank-one factors independently and uniformly at random from a fixed positive interval. We then compare the decay of the root mean square error (RMS), defined below in (17), with the number of iterations. To do so, we first generate a uniformly random $d$-regular bipartite graph $G$ on $2n$ vertices with $n$ vertices on each side and keep it fixed for the experiment. We then run VLS and ELS on $M_E$. Random regular graphs are known to be connected with high probability, and we did not find significant variation in the results when changing the graph. Since ELS requires about a factor $d$ more computation per iteration, we plot the decay of the RMS versus a normalized iteration index, which is defined as $t$ for VLS and $d t$ for ELS.

The root mean square (RMS) error after $t$ iterations of either VLS or ELS is defined as

$$\mathrm{RMS}(t) = \left( \frac{1}{n^2} \sum_{u, v} \left( M_{uv} - \hat{M}^{(t)}_{uv} \right)^2 \right)^{1/2}, \qquad (17)$$

where $\hat{M}^{(t)}$ denotes the estimate of $M$ produced after $t$ iterations.
Figure 1: RMS vs number of iterations (normalized and un-normalized) for VLS and ELS

The comparison in Figure 1 (computed for ) demonstrates that ELS converges faster than VLS. We find that this effect is even more pronounced when .
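The following sketch reproduces the spirit of this experiment for the rank-one case, assuming the vls_rank_one and els_rank_one sketches from Section 2 are in scope. It builds a random $d$-regular bipartite graph as a union of $d$ random perfect matchings (one simple construction; the paper does not specify its generator), runs both algorithms, and reports the RMS of (17). All parameter values are illustrative.

```python
import numpy as np

def random_regular_bipartite(n, d, rng):
    """Union of d random perfect matchings between n rows and n columns.
    (One simple way to get a roughly d-regular bipartite graph; duplicate
    edges are discarded, which may leave a few vertices with degree below d.)"""
    edges = set()
    for _ in range(d):
        perm = rng.permutation(n)
        edges.update((u, int(perm[u])) for u in range(n))
    return sorted(edges)

def rms(M, x, y):
    """RMS error (17) between M and the rank-one estimate outer(x, y)."""
    return np.sqrt(np.mean((M - np.outer(x, y)) ** 2))

rng = np.random.default_rng(1)
n, d, T = 200, 4, 30
x_true = rng.uniform(1.0, 2.0, size=n)
y_true = rng.uniform(1.0, 2.0, size=n)
M = np.outer(x_true, y_true)
edges = random_regular_bipartite(n, d, rng)

x_vls, y_vls = vls_rank_one(M, edges, n, T)   # sketch from Section 2
x_els, y_els = els_rank_one(M, edges, n, T)   # sketch from Section 2
# Normalized iteration index: T for VLS corresponds to roughly d*T for ELS.
print("VLS RMS:", rms(M, x_vls, y_vls), " ELS RMS:", rms(M, x_els, y_els))
```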

To compare VLS and ELS for matrices of rank $r > 1$, we generate each entry of the factors uniformly from a fixed interval. We first plant a random regular bipartite graph to ensure that the minimum degree requirement is met. Then we generate another edge set $E'$, where each edge exists independently with probability $p$, and set the revealed set to be the union of the planted edges and $E'$. We plot the empirical fraction of failures over independent trials, where a failure is declared when the algorithm (VLS or ELS) fails to achieve an RMS below a small threshold within a prescribed number of iterations. In fact, whenever the algorithm fails, a divergence characterized by an explosion in the RMS value is usually observed at a much earlier iteration. Figure 2 shows the results for ELS on the left for two combinations of rank and planted degree. This provides evidence for the success of ELS even with a cold start. On the right of Figure 2 we plot the same for VLS with a cold start, showing that it does not always succeed.

Figure 2: Left: ELS failure fraction vs $p$ for two combinations of rank and planted regular degree. Right: VLS failure fraction vs $p$ with a planted regular graph.

The figures suggest the emergence of a phase transition. For each algorithm and rank value $r$ there seems to be a critical degree such that the algorithm succeeds with high probability when the average degree exceeds this critical value and fails with high probability otherwise. Furthermore, it appears, again based on the simulation results, that the critical degree for ELS is smaller than the one for VLS for the same value of $r$. In other words, ELS appears to have a lower sample complexity required for it to succeed. Whether these observations can be theoretically established is left as an intriguing open problem.

References

  • [1] R. Meka, P. Jain, C. Caramanis, and I. S. Dhillon. Rank minimization via online learning. ICML, pages 656–663, 2008.
  • [2] E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.
  • [3] E. J. Candès and T. Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2010.
  • [4] R. H. Keshavan, S. Oh, and A. Montanari. Matrix completion from a few entries. In IEEE International Symposium on Information Theory (ISIT), pages 324–328. IEEE, 2009.
  • [5] R. Keshavan, A. Montanari, and S. Oh. Matrix completion from noisy entries. In Advances in Neural Information Processing Systems, pages 952–960, 2009.
  • [6] P. Jain, P. Netrapalli, and S. Sanghavi. Low-rank matrix completion using alternating minimization. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, pages 665–674. ACM, 2013.
  • [7] Y. Kabashima, F. Krzakala, M. Mézard, A. Sakata, and L. Zdeborová. Phase transitions and sample complexity in Bayes-optimal matrix factorization. arXiv:1402.1298, 2014.
  • [8] Y. Koren, R. M. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. IEEE Computer, 42(8):30–37, 2009.
  • [9] Y. Koren. The BellKor solution to the Netflix Grand Prize. 2009.