Learning Latent Features with Pairwise Penalties in Matrix Completion

02/16/2018 ∙ by Kaiyi Ji, et al. ∙ The University of Hong Kong Carnegie Mellon University The Ohio State University 0

Low-rank matrix completion (MC) has achieved great success in many real-world data applications. A latent feature model formulation is usually employed and, to improve prediction performance, the similarities between latent variables can be exploited by pairwise learning, e.g., the graph regularized matrix factorization (GRMF) method. However, existing GRMF approaches often use a squared L2 norm to measure the pairwise difference, which may be overly influenced by dissimilar pairs and lead to inferior prediction. To fully empower pairwise learning for matrix completion, we propose a general optimization framework that allows a rich class of (non-)convex pairwise penalty functions. A new and efficient algorithm is further developed to uniformly solve the optimization problem, with a theoretical convergence guarantee. In an important situation where the latent variables form a small number of subgroups, its statistical guarantee is also fully characterized. In particular, we theoretically characterize the complexity-regularized maximum likelihood estimator, as a special case of our framework. It has a better error bound when compared to the standard trace-norm regularized matrix completion. We conduct extensive experiments on both synthetic and real datasets to demonstrate the superior performance of this general framework.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 14

page 27

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Low-rank matrix factorization (MF) has been widely used in many real-world data applications, e.g., in signal processing, image restoration and collaborative filtering. A typical optimization  problem (Koren et al., 2009; Gopalan et al., 2014) follows the form

(1)

where , , and the projection operator retains the entries of the matrix in that denotes the observed indices.

The row vectors

of and of usually represent the features of two classes of interdependent objects, known as latent variables, e.g., user features and movie features in recommender systems (Bennett et al., 2007; Koren et al., 2009), respectively. Based on this basic model, many variants have been considered for different application scenarios. For example, a latent feature model is often employed in matrix completion and, for a better estimation error, the similarities between latent variables can be exploited by pairwise learning, e.g. the graph regularized matrix factorization (GRMF) method using a squared norm. It has been shown that GRMF can reduce the recovery error (Rao et al., 2015; Zhao et al., 2015; Monti et al., 2017). For the ground truth , assume that the feature vectors correspond to the vertices of a graph with and the edges are weighted by a non-negative matrix . Similarly, we can also define for . By adding a smoothing graph regularizer to (1), GRMF aims to solve the problem

(2)

which encourages the solutions (respectively ) if the weight (respectively ) is large.

Figure 1: Illustration of three pairwise penality functions w.r.t. . When is large, the squared norm is much larger than MCP, SCAD and the M-type function.

Existing works (Ma et al., 2011a; Kalofolias et al., 2014; Rao et al., 2015; Zhao et al., 2015) often construct the weight matrices and by incorporating additional information of the feature vectors, e.g., the social network of users and the attributes of movies in recommender systems. However, as shown in (Chiang et al., 2015), the side information could have noise, and hence the weights may be inappropriately selected. As illustrated in Fig. 1, if the difference between and is large but is not small enough, the squared norm used in the graph regularizer can severely penalize the objective function and push the solution to be close to . It can result in biased estimates, as shown in (Ma & Huang, 2017), which can dramatically affect the recovery performance. The same argument also applies to and . To address this problem, we introduce a large family of (non-)convex pairwise penalty functions that imposes less penalization on large differences, e.g., by the minimax concave penalty (MCP) (Zhang, 2010), the smoothly clipped absolute deviation (SCAD) (Fan & Li, 2001). and the M-type function, as shown in Fig. 1. To this end, we propose a general optimization framework

(3)

where and are (non-)convex pairwise penalty functions with tuning parameters .

To efficiently solve the whole class of optimization problems, we design a novel and scalable algorithm based on a modified alternating direction method of multipliers (ADMM) (Boyd et al., 2011). The allowed non-convex and non-smooth pairwise penalty functions complicate the optimization, which, if not handled carefully, can even result in divergent iterations. Theoretically, we characterize the convergence of our algorithm for the whole class of pairwise penalty functions. Compared with the standard ADMM, our algorithm has the following two features. First, although recent works (Wang et al., 2015a; Hong et al., 2016; Wang et al., 2015b) prove the convergence of ADMM for a large family of non-convex functions under different conditions, none of these results directly apply to our problem. Thus, we tailor ADMM to our setting by introducing Bregman divergences to certain subproblems. Second, the optimization algorithm needs to solve an expensive equation that typically requires running time to invert two large matrices in each iteration. We provide a conjugate gradient (CG) based approach to obtain an inexact solution of this equation, using only running time. As a result, our new method can significantly reduce the computation time (two orders of magnitude faster), as demonstrated in Appendix H.2. Notably, according to Theorem 2, we still guarantee the convergence using the inexact solutions during iterations.

Fully characterizing the statistical guarantee of the proposed class of penalty functions is challenging. Instead, we restrict to an important situation where the latent variables form a small number of subgroups. Specifically, we investigate a subgroup-based model, where two feature vectors in or are considered in the same group if they are identical. For a partially observed matrix corrupted with Gaussian noise, we prove that the complexity-regularized maximum likelihood estimator, as a special case of the framework (1), can achieve a lower error bound than the one for the trace-norm regularized matrix completion especially when the numbers of the subgroups for and are small. However, this estimator is computationally inefficient. To this end, we introduce a class of sparsity-promoting pairwise penalty functions (possibly with a finite support) to approximate the indicator function. Interestingly, not only can we identify the subgroups automatically but also we significantly reduce the recovery error, as verified in Section 5.3.

In addition, the theoretical analysis motivates us to adaptively construct the weights. As an application, our framework also applies to the datasets that do not provide side information since we heuristically pick adaptive weights based on the partially observed matrix. Our extensive experiments on both synthetic and real data demonstrate the superior performance of the proposed framework.

Notations: Let be a matrix and be a -dimensional vector. Denote by the cardinality of a set . Recall the Frobenius norm and the infinity norm . Define and norm as and , respectively.

2 Optimization Framework

We develop a general optimization framework for (1) that allows a wide class of (non-)convex pairwise penalty functions , which are characterized by the following condition.

Condition 1.

For any given vector , the penalty function satisfies

  1. for ,

  2. there exists a constant independent of such that for , is strongly convex with respect to .

This condition holds to a variety of (non-)convex pairwise penalty functions, including MCP (Zhang, 2010), M-type function and SCAD (Fan & Li, 2001). For example, Condition 1 holds for the MCP

(4)

by setting , where is the indicator function. Moreover, we introduce the following M-type function for our framework

(5)

which satisfies Condition 1 by letting . In addition, Condition 1 also holds for all convex functions, e.g., Lasso and the squared norm. Thus, our framework unifies the exisiting GRMF approaches.

3 Algorithm and Convergence

For the whole class of optimization problems introduced in Section 2, we develop an efficient and general algorithm, with a strong theoretical convergence guarantee. We begin with the following equivalent form of the main problem (1)

subject to (6)

where the index set and is defined in a similar way for .

To express the constraints of (3) in a standard form, we introduce some notations. Let be an ordered sequence of index pairs such that and for any . Similarly, we can define . Moreover, define with its column vector being and with its column vector being , where the standard basis vector has a single non-zero value 1 at its coordinate. Symmetrically, we can define matrices and . It can be verified that the standard form of the constraints of (3) is

(7)

Let be the augmented Lagrangian function of the problem (3), which is given by

(8)

where the dual multiplier matrix with and with .

As commented in the introduction, the existing convergence results for the standard ADMM do not directly apply to our problem (3). Specifically, the row ranks of and in the constraints (7) are at most and , respectively, contradicting the assumptions in Wang et al. (2015b) and (Hong et al., 2016) that require them to be both full row rank. The results in (Wang et al., 2015a) are not applicable either, because we also consider non-convex pairwise penalty functions. Next, we describe our 2-step algorithm based on a modified ADMM.

Step 1.

Define an undirected graph with and . We first use the standard depth-first-search algorithm to find cycles. Then, for each cycle of , we randomly cut one edge off by letting . As a result, the graph becomes acyclic. Similar operations can be applied for . This step verifies Property 2 in Section 3.1, which guarantees the convergence of our algorithm.

Step 2.

In each iteration , we do the following updates:

(9)

Remark on the modified ADMM: Compared to the standard ADMM, we introduce Bregman divergences to in (2). These additional terms guarantee sufficient descents of during the -subproblems in each iteration, as proven by Lemma 4 in Appendix C . Otherwise, the algorithm may diverge, as shown in Fig. 2111The image Lena was downloaded from http://www.utdallas.edu/~cxc123730/mh_bcs_spl.html.

Figure 2: Illustration of the divergence by the standard ADMM applied to the optimization problem (3). We use MCP as the pairwise penalty function , and randomly pick of the pixels of a image Lena to generate the partially observed matrix . The relative error is defined by , where is the estimate.

Compared to the Bregman ADMM (Wang & Banerjee, 2014), we do not apply Bregman divergences to and in order to save space for tracking and .

Based on (2), updating in (8) is to solve the following proximal map for each of its column ,

(10)

In practice, we choose , which, in conjunction with Condition 1, implies that the problem in (10) has a unique solution. Notably, this solution has simple and explicit analytical form for the pairwise penalty functions such as Lasso, SCAD, MCP and the M-type function.

Based on (2), updating is to minimize

Proposition 1.

Minimizing is equivalent to minimizing . The function is defined by

(11)

where is a block diagonal matrix with the block . The row vector satisfies , with being the row vector of the matrix .

By Proposition 1, letting yields

(12)

However, solving (12) requires inverting a matrix, which is computationally demanding when dealing with large datasets. Motivated by (Rao et al., 2015), we use the standard CG to directly minimize .

1:  Input:
2:  Initialize: . Set as random matrices.
3:  for  to  do
4:     for  to  do
5:         Compute by solving the proximal map (10)
6:     end for
7:     for  to  do
8:         
9:     end for
10:     Update by minimizing (1) through CG
11:     
12:     
13:     if  or  then
14:         break and output and
15:     end if
16:  end for
17:  Output:
Algorithm 1 Learning latent features with pairwise penalties in matrix completion by modified ADMM

The most expensive part in each iteration is the Hessian-vector multiplication . Using the identity , we have

(13)

where the matrix and . Since is a diagonal block matrix, computing suffices to obtain , which can be computed in time by using .

Thus, the time complexity for computing a single CG iteration is , where is the number of non zeros. For our algorithm, we use a -nearest neighbor method to select the weights , as introduced in section 4.3. Thus, using the structure of the matrix , we have an upper bound for , as in the following proposition. Its proof is given in Appendix A.

Proposition 2.

.

In most real-world applications, is small and satisfies . Thus, each CG iteration can be computed in time. We stop CG if

(14)

where is the output of the CG iteration and the parameters on the right side are chosen to satisfy Theorem 2. Different from (Rao et al., 2015), we do not require CG to fully converge. Instead, we only need a very small number ( in our implementation) of CG iterations to obtain an inexact solution that satisfies (14), which still guarantees the convergence for our algorithm, as shown in Theorem 2. This step further speeds up our algorithm.

The main steps are provided in Algorithm 1, where in the stop criterion. In the experiments, we choose and .

Time complexity: The total time complexity for one iteration in Algorithm 1 is . Due to Step 1 and the weight selection, we usually have and . Thus, our algorithm is very efficient.

3.1 Convergence Guarantee

This section proves the convergence of our algorithm. Let be the exact solutions of the subproblems in (2). We first prove the convergence of the sequence . Then, we extend this convergence result to Algorithm 1 that solves the equation (12) inexactly. We first provide two properties of our algorithm. Their proofs are presented in Appendix B.

Property 1 (Boundedness).

The sequence generated by (2) is bounded. It means that there exist constants such that .

Property 2 (Full column rank).

After step 1 of the algorithm, the matrices and defined in (7) are both full column rank. Thus, there exist positive constants such that and

Using these two properties, we establish Lemma 23 and 4 in Appendix C to upper bound the descent of during each subproblem. Based on these results, there exists a constant that is determined by such that

Theorem 1.

If , then the sequence satisfies and converges to a stationary point of .

We refer to Appendix C for the proof. By Theorem 1, we extend the convergence result for inexact iterations. Let be the sequence generated by Algorithm 1. Define for , where matrices are defined in (1). In a symmetric way, we can define . Following (He et al., 2015; Deng et al., 2017), we use the quantity as a measure of the convergence rate for the sequence . Based on (14), we have and . Then, we obtain

Theorem 2.

If , then the sequence converges to a stationary point of . Furthermore, if the sequence is non-increasing, the convergence rate of Algorithm 1 is .

The proof can be found in Appendix D. Different from the exact solution in Theorem 1, Theorem 2 only requires to satisfy . This extension allows more efficient methods to find inexact solutions of (12), e.g., the CG approach used in our algorithm. Interestingly, we observe that is non-increasing in most of our experiments. Thus, the convergence rate is usually achievable in practice.

4 Statistical Properties

Fully characterizing the statistical guarantee of the proposed class of penalty functions in Section 2 is challenging. Instead, we restrict to an important subclass, which is used for the subgroup-based model. We can rigorously derive the estimation error bound for this class of penalty functions. More importantly, our theoretical analysis motivates two interesting directions for our framework, i.e., subgroup identification and adaptive weights. We first introduce the subgroup-based model in Section 4.1. Then, the estimation error is characterized in Section 4.2. This characterization directly motivates us to use adaptive weights in Section 4.3.

4.1 Subgroup-based Model

Assume that the ground truth can be factorized as , where , and the observations are corrupted with the addictive noise . The feature vectors and form some latent subgroups, where two features and (respectively and ) are considered in the same subgroup if (respectively ). One key difference with the existing group based models, e.g., the group sparsity model (Kim et al., 2012), is that the subgroup structure in our model is not assumed to be known a priori. Studies (Jiacheng, 2017) have shown that the users with similar types in recommender systems often have the same feature, implying a natural subgroup structure. Nevertheless, our subgroup-based model is not restricted to these typical cases. For example, the number of subgroups can be even as large as the matrix dimension.

To facilitate the analysis, we introduce some useful notations. Let and be two sets of mutually exclusive partitions of the indices and , which satisfy for and for . We use and to denote the number of subgroups of and , respectively.

4.2 Estimation Error Bound

This section derives the estimation error bound for the subgroup-based model. We first introduce the complexity-regularized maximum likelihood estimator, and show it is a special case of our framework. Then, we prove that this estimator achieves a lower bound compared to the standard trace-norm regularized matrix completion.

For an integer with , assume that each index pair is included in the observed index set

independently with probability

. The elements of the matrix are independent conditional on the set . The noise matrix

contains i.i.d. random variables following a distribution

. Assume , and are bounded by , and , respectively. First, we introduce the following complexity-regularized maximum likelihood estimator

(15)

where the finite set is given by

(16)

The set in (16) is defined by

(17)

with . In a symmetrical way, we can also define .

Using the following proposition, we show that the estimator (15) is a special case of our framework (1).

Proposition 3.

Let be the total number of the feature vectors that satisfy . We have, for and ,

(18)

where the indicator function if and otherwise. A symmetrical result holds for .

See Appendix E for the proof. Moreover, the constraints and in (17) play a similar role as the regularizer in our framework. In the following theorem, we provide an error bound for the estimator (15).

Theorem 3.

If , then the estimator (15) satisfies the error bound

(19)

where the expectation is with respect to the joint distribution of

and are positive constants related to and .

The key step of the proof is the construction of the complexity penalty. See Appendix F for more details. We compare the error bound (3) with the one for the trace-norm regularized matrix completion obtained by (Koltchinskii et al., 2011), which assumes the observed entries are corrupted by addictive noise. Casting Corollary 2 in (Koltchinskii et al., 2011) to our setting, we obtain an error bound

(20)

with high probability, where and are two constants related to , and . Note that our error bound

can be much lower than (20) provided that the total number of subgroups .

However, to directly optimize (15) with regularizer (18) is undesirable in practice since the indicator function complicates the optimization and the sets are discretized. Based on the analysis, we introduce the following heuristics:

  • First, we undertake a slight relaxation of (15) by replacing with their convex hulls.

  • Then, we use a class of sparsity-inducing penalty functions with finite support that can shrink some differences and

    to zeros, e.g., by MCP and SCAD, to approximate the indicator function. By checking whether a pair of feature vectors are equal or not, we can classify

    and into subgroups.

  • Note that the formulation (18) encourages if . Based on this observation, we introduce adaptive weights in Section 4.3.

Remarkably, not only can we identify the subgroups automatically but also we can significantly reduce the recovery error, as demonstrated in Section 5.3.

4.3 Adaptive Weights

Based on the discussion at the end of Section 4.2, we introduce the adaptive weights and that are computed based on the partially observed matrix itself without any additional side information. The weights are chosen by

(21)

where the distance function needs to be approximated and the indicator function is if belongs to one of the -nearest neighbors of and otherwise. The selection (21) makes large during the computation if , similar to the adaptive weight chosen in (Zou, 2006). It effectively pushes the optimization solver to return . We also define for and by replacing and in (21) with and , respectively. Thus, in the following we only explain . In practice, and