Low-rank matrix factorization (MF) has been widely used in many real-world data applications, e.g., in signal processing, image restoration and collaborative filtering. A typical optimization problem (Koren et al., 2009; Gopalan et al., 2014) follows the form
$$\min_{U, V}\ \tfrac{1}{2}\big\|P_\Omega(X - UV^\top)\big\|_F^2, \qquad (1)$$
where $U \in \mathbb{R}^{m \times r}$, $V \in \mathbb{R}^{n \times r}$, and the projection operator $P_\Omega$ retains the entries of the matrix in the set $\Omega$ that denotes the observed indices.
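The masked objective above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the function name `mf_objective` and the optional Frobenius-norm regularizer (weight `lam`) are illustrative additions.

```python
import numpy as np

def mf_objective(X, U, V, mask, lam=0.0):
    """Masked low-rank MF loss: 0.5 * ||P_Omega(X - U V^T)||_F^2,
    optionally plus a Frobenius-norm regularizer (illustrative)."""
    R = mask * (X - U @ V.T)  # P_Omega keeps only observed entries
    return 0.5 * np.sum(R**2) + 0.5 * lam * (np.sum(U**2) + np.sum(V**2))
```

With a fully observed rank-1 matrix and exact factors, the loss is zero when `lam=0`.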
The row vectors $u_i$ of $U$ and $v_j$ of $V$ usually represent the features of two classes of interdependent objects, known as latent variables, e.g., user features and movie features in recommender systems (Bennett et al., 2007; Koren et al., 2009), respectively. Based on this basic model, many variants have been considered for different application scenarios. For example, a latent feature model is often employed in matrix completion and, for a better estimation error, the similarities between latent variables can be exploited by pairwise learning, e.g., the graph regularized matrix factorization (GRMF) method using a squared norm. It has been shown that GRMF can reduce the recovery error (Rao et al., 2015; Zhao et al., 2015; Monti et al., 2017). For the ground truth matrix, assume that the feature vectors $u_i$ correspond to the vertices of a graph whose edges are weighted by a non-negative matrix $W_U$. Similarly, we can also define $W_V$ for $V$. By adding a smoothing graph regularizer to (1), GRMF aims to solve the problem
$$\min_{U, V}\ \tfrac{1}{2}\big\|P_\Omega(X - UV^\top)\big\|_F^2 + \sum_{i < j} (W_U)_{ij}\, \|u_i - u_j\|_2^2 + \sum_{i < j} (W_V)_{ij}\, \|v_i - v_j\|_2^2, \qquad (2)$$
which encourages the solutions $u_i \approx u_j$ (respectively $v_i \approx v_j$) if the weight $(W_U)_{ij}$ (respectively $(W_V)_{ij}$) is large.
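The squared-norm graph regularizer discussed above can be computed directly from the factor matrix and the weight matrix. A minimal sketch, assuming a dense symmetric weight matrix `W` (the `0.5` factor compensates for counting each pair twice):

```python
import numpy as np

def graph_reg(U, W):
    """Squared-norm graph regularizer: 0.5 * sum_{i,j} W[i,j] * ||u_i - u_j||^2."""
    m = U.shape[0]
    total = 0.0
    for i in range(m):
        for j in range(m):
            d = U[i] - U[j]
            total += W[i, j] * (d @ d)  # squared Euclidean difference, weighted
    return 0.5 * total
```

A large weight between two very different rows makes this term large, which is exactly the bias issue the paper raises next.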
Existing works (Ma et al., 2011a; Kalofolias et al., 2014; Rao et al., 2015; Zhao et al., 2015) often construct the weight matrices $W_U$ and $W_V$ by incorporating additional information about the feature vectors, e.g., the social network of users and the attributes of movies in recommender systems. However, as shown in (Chiang et al., 2015), the side information can be noisy, and hence the weights may be inappropriately selected. As illustrated in Fig. 1, if the difference between $u_i$ and $u_j$ is large but the weight $(W_U)_{ij}$ is not small enough, the squared norm used in the graph regularizer can severely penalize the objective function and push the solution $u_i$ to be close to $u_j$. This can result in biased estimates, as shown in (Ma & Huang, 2017), which can dramatically affect the recovery performance. The same argument also applies to $v_i$ and $v_j$. To address this problem, we introduce a large family of (non-)convex pairwise penalty functions that impose less penalization on large differences, e.g., the minimax concave penalty (MCP) (Zhang, 2010), the smoothly clipped absolute deviation (SCAD) (Fan & Li, 2001), and the M-type function, as shown in Fig. 1. To this end, we propose a general optimization framework
$$\min_{U, V}\ \tfrac{1}{2}\big\|P_\Omega(X - UV^\top)\big\|_F^2 + \sum_{i < j} (W_U)_{ij}\, g_U(\|u_i - u_j\|_2) + \sum_{i < j} (W_V)_{ij}\, g_V(\|v_i - v_j\|_2),$$
where $g_U$ and $g_V$ are (non-)convex pairwise penalty functions with tuning parameters.
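The MCP and SCAD penalties mentioned above are standard in the literature; both grow like the absolute value near zero and then flatten out, which is what limits the penalization of large differences. A sketch of their scalar forms, following Zhang (2010) and Fan & Li (2001), with `lam` the tuning parameter and `gamma` the concavity parameter:

```python
def mcp(t, lam, gamma):
    """Minimax concave penalty (Zhang, 2010): flattens to a constant gamma*lam^2/2."""
    t = abs(t)
    if t <= gamma * lam:
        return lam * t - t**2 / (2 * gamma)
    return 0.5 * gamma * lam**2

def scad(t, lam, gamma):
    """SCAD penalty (Fan & Li, 2001): linear, then quadratic, then constant."""
    t = abs(t)
    if t <= lam:
        return lam * t
    if t <= gamma * lam:
        return (2 * gamma * lam * t - t**2 - lam**2) / (2 * (gamma - 1))
    return 0.5 * lam**2 * (gamma + 1)
```

Both functions are continuous at their breakpoints and constant beyond `gamma * lam`, so a large pairwise difference incurs only a bounded penalty.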
To efficiently solve the whole class of optimization problems, we design a novel and scalable algorithm based on a modified alternating direction method of multipliers (ADMM) (Boyd et al., 2011). The non-convex and non-smooth pairwise penalty functions we allow complicate the optimization and, if not handled carefully, can even cause the iterations to diverge. Theoretically, we characterize the convergence of our algorithm for the whole class of pairwise penalty functions. Compared with the standard ADMM, our algorithm has two distinguishing features. First, although recent works (Wang et al., 2015a; Hong et al., 2016; Wang et al., 2015b) prove the convergence of ADMM for a large family of non-convex functions under different conditions, none of these results directly applies to our problem. Thus, we tailor ADMM to our setting by introducing Bregman divergences to certain subproblems. Second, the optimization algorithm needs to solve an expensive equation that requires inverting two large matrices in each iteration. We provide a conjugate gradient (CG) based approach that obtains an inexact solution of this equation at a much lower cost. As a result, our new method can significantly reduce the computation time (two orders of magnitude faster), as demonstrated in Appendix H.2. Notably, according to Theorem 2, we still guarantee convergence when using the inexact solutions during iterations.
Fully characterizing the statistical guarantee of the proposed class of penalty functions is challenging. Instead, we restrict attention to an important situation where the latent variables form a small number of subgroups. Specifically, we investigate a subgroup-based model, in which two feature vectors in $U$ or $V$ are considered to be in the same group if they are identical. For a partially observed matrix corrupted with Gaussian noise, we prove that the complexity-regularized maximum likelihood estimator, as a special case of the framework (1), can achieve a lower error bound than that of trace-norm regularized matrix completion, especially when the numbers of subgroups for $U$ and $V$ are small. However, this estimator is computationally inefficient. To remedy this, we introduce a class of sparsity-promoting pairwise penalty functions (possibly with a finite support) to approximate the indicator function. Interestingly, not only can we identify the subgroups automatically, but we also significantly reduce the recovery error, as verified in Section 5.3.
In addition, the theoretical analysis motivates us to adaptively construct the weights. As an application, our framework also applies to the datasets that do not provide side information since we heuristically pick adaptive weights based on the partially observed matrix. Our extensive experiments on both synthetic and real data demonstrate the superior performance of the proposed framework.
Notations: Let $A$ be a matrix and $x$ be a $d$-dimensional vector. Denote by $|S|$ the cardinality of a set $S$. Recall the Frobenius norm $\|A\|_F$ and the infinity norm $\|A\|_\infty$. Define the $\ell_1$ and $\ell_2$ norms as $\|x\|_1 = \sum_i |x_i|$ and $\|x\|_2 = (\sum_i x_i^2)^{1/2}$, respectively.
2 Optimization Framework
We develop a general optimization framework for (1) that allows a wide class of (non-)convex pairwise penalty functions, which are characterized by the following condition.
For any given vector $d$, the penalty function $g$ satisfies the following: there exists a constant $c$, independent of $d$, such that for $\beta > c$, the function $g(\|d\|_2) + \tfrac{\beta}{2}\|d\|_2^2$ is strongly convex with respect to $d$.
by setting , where is the indicator function. Moreover, we introduce the following M-type function for our framework
3 Algorithm and Convergence
For the whole class of optimization problems introduced in Section 2, we develop an efficient and general algorithm, with a strong theoretical convergence guarantee. We begin with the following equivalent form of the main problem (1)
$$\min_{U, V, P, Q}\ \tfrac{1}{2}\big\|P_\Omega(X - UV^\top)\big\|_F^2 + \sum_{(i,j) \in E_U} (W_U)_{ij}\, g_U(\|p_{ij}\|_2) + \sum_{(i,j) \in E_V} (W_V)_{ij}\, g_V(\|q_{ij}\|_2) \quad \text{s.t.}\ p_{ij} = u_i - u_j,\ q_{ij} = v_i - v_j, \qquad (3)$$
where the index set $E_U = \{(i,j) : (W_U)_{ij} > 0\}$ and $E_V$ is defined in a similar way for $V$.
To express the constraints of (3) in a standard form, we introduce some notations. Let $\{(i_k, j_k)\}_k$ be an ordered sequence of index pairs such that $(i_k, j_k) \in E_U$ and $i_k < j_k$ for any $k$. Similarly, we can order the pairs in $E_V$. Moreover, define $A_U$ with its $k$-th column vector being $e_{i_k}$ and $B_U$ with its $k$-th column vector being $e_{j_k}$, where the standard basis vector $e_i$ has a single non-zero value 1 at its $i$-th coordinate. Symmetrically, we can define matrices $A_V$ and $B_V$. It can be verified that the standard form of the constraints of (3) is
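The construction above amounts to building an incidence-style matrix whose columns are differences of standard basis vectors, so that multiplying its transpose against the factor matrix stacks all pairwise differences at once. A minimal sketch (the function name `incidence_matrix` is illustrative):

```python
import numpy as np

def incidence_matrix(num_nodes, edges):
    """Column k is e_{i_k} - e_{j_k}, so (A.T @ U) stacks the
    pairwise differences u_{i_k} - u_{j_k}, one per edge."""
    A = np.zeros((num_nodes, len(edges)))
    for k, (i, j) in enumerate(edges):
        A[i, k] = 1.0   # standard basis vector e_i ...
        A[j, k] = -1.0  # ... minus e_j
    return A
```

For example, with edges (0,1) and (1,2), `A.T @ U` returns the rows `u_0 - u_1` and `u_1 - u_2`.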
Let be the augmented Lagrangian function of the problem (3), which is given by
where the dual multiplier matrix with and with .
As commented in the introduction, the existing convergence results for the standard ADMM do not directly apply to our problem (3). Specifically, the constraint matrices in (7) need not have full row rank, contradicting the assumptions in (Wang et al., 2015b) and (Hong et al., 2016) that require both to have full row rank. The results in (Wang et al., 2015a) are not applicable either, because we also consider non-convex pairwise penalty functions. Next, we describe our 2-step algorithm based on a modified ADMM.
Define an undirected graph $G_U$ with vertex set $\{1, \dots, m\}$ and edge set $E_U$. We first use the standard depth-first-search algorithm to find cycles. Then, for each cycle of $G_U$, we randomly cut one edge off by setting its weight $(W_U)_{ij} = 0$. As a result, the graph becomes acyclic. Similar operations can be applied for $V$. This step establishes Property 2 in Section 3.1, which guarantees the convergence of our algorithm.
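The cycle-cutting step keeps a spanning forest of the weight graph. The paper uses depth-first search to locate cycles; an equivalent way to obtain the same kind of acyclic subgraph is a Kruskal-style union-find pass, sketched below under that substitution (the function name `cut_cycles` is illustrative):

```python
def cut_cycles(num_nodes, edges):
    """Keep only edges forming a spanning forest; any edge that would
    close a cycle is dropped (its weight would be set to zero)."""
    parent = list(range(num_nodes))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    kept = []
    for (i, j) in edges:
        ri, rj = find(i), find(j)
        if ri != rj:            # edge joins two components: keep it
            parent[ri] = rj
            kept.append((i, j))
        # else: edge closes a cycle -> cut it
    return kept
```

On a triangle (0,1), (1,2), (0,2), the last edge closes a cycle and is cut, leaving an acyclic graph.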
In each iteration , we do the following updates:
Remark on the modified ADMM: Compared to the standard ADMM, we introduce Bregman divergences to certain subproblems in (2). These additional terms guarantee sufficient descent of the augmented Lagrangian during the corresponding subproblems in each iteration, as proven by Lemma 4 in Appendix C. Otherwise, the algorithm may diverge, as shown in Fig. 2. (The image Lena was downloaded from http://www.utdallas.edu/~cxc123730/mh_bcs_spl.html.)
Compared to the Bregman ADMM (Wang & Banerjee, 2014), we do not apply Bregman divergences to and in order to save space for tracking and .
In practice, we choose the Bregman divergence parameters so that, in conjunction with Condition 1, the problem in (10) has a unique solution. Notably, this solution has a simple, explicit analytical form for pairwise penalty functions such as Lasso, SCAD, MCP and the M-type function.
Based on (2), updating is to minimize
Minimizing is equivalent to minimizing . The function is defined by
where is a block diagonal matrix with the block . The row vector satisfies , with being the row vector of the matrix .
By Proposition 1, letting yields
The most expensive part in each iteration is the Hessian-vector multiplication . Using the identity , we have
where the matrix and . Since the matrix is block diagonal, computing suffices to obtain , which can be computed in time by using .
Thus, the time complexity for computing a single CG iteration is , where is the number of non-zeros. For our algorithm, we use a $k$-nearest neighbor method to select the weights, as introduced in Section 4.3. Thus, using the structure of the matrix, we obtain an upper bound on the number of non-zeros, as in the following proposition. Its proof is given in Appendix A.
In most real-world applications, is small and satisfies . Thus, each CG iteration can be computed in time. We stop CG if
where is the output of the CG iteration and the parameters on the right-hand side are chosen to satisfy Theorem 2. Different from (Rao et al., 2015), we do not require CG to converge fully. Instead, we only need a very small number ( in our implementation) of CG iterations to obtain an inexact solution satisfying (14), which still guarantees the convergence of our algorithm, as shown in Theorem 2. This step further speeds up our algorithm.
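The early-stopped CG step described above only needs Hessian-vector products and a residual-based stopping rule. A self-contained sketch, assuming a symmetric positive definite system `H x = b` accessed through a black-box product `hess_vec` (all names are illustrative):

```python
import numpy as np

def cg_inexact(hess_vec, b, x0, tol, max_iter=20):
    """Conjugate gradient on H x = b using only Hessian-vector products,
    stopped once the residual norm falls below tol (inexact solve)."""
    x = x0.copy()
    r = b - hess_vec(x)      # residual
    p = r.copy()             # search direction
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) <= tol:
            break
        Hp = hess_vec(p)
        alpha = rs / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```

Because only matrix-vector products are needed, the per-iteration cost is governed by the cost of `hess_vec`, and the loop can be truncated after a handful of iterations to get an inexact solution.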
The main steps are provided in Algorithm 1, where in the stop criterion. In the experiments, we choose and .
3.1 Convergence Guarantee
This section proves the convergence of our algorithm. Let be the exact solutions of the subproblems in (2). We first prove the convergence of the sequence . Then, we extend this convergence result to Algorithm 1 that solves the equation (12) inexactly. We first provide two properties of our algorithm. Their proofs are presented in Appendix B.
Property 1 (Boundedness).
The sequence generated by (2) is bounded. It means that there exist constants such that .
Property 2 (Full column rank).
After step 1 of the algorithm, the matrices and defined in (7) are both full column rank. Thus, there exist positive constants such that and
Using these two properties, we establish Lemmas 2, 3 and 4 in Appendix C to upper bound the descent of the augmented Lagrangian during each subproblem. Based on these results, there exists a constant determined by the problem parameters such that
If , then the sequence satisfies and converges to a stationary point of .
We refer to Appendix C for the proof. By Theorem 1, we extend the convergence result for inexact iterations. Let be the sequence generated by Algorithm 1. Define for , where matrices are defined in (1). In a symmetric way, we can define . Following (He et al., 2015; Deng et al., 2017), we use the quantity as a measure of the convergence rate for the sequence . Based on (14), we have and . Then, we obtain
If , then the sequence converges to a stationary point of . Furthermore, if the sequence is non-increasing, the convergence rate of Algorithm 1 is .
The proof can be found in Appendix D. Different from the exact solution in Theorem 1, Theorem 2 only requires to satisfy . This extension allows more efficient methods to find inexact solutions of (12), e.g., the CG approach used in our algorithm. Interestingly, we observe that is non-increasing in most of our experiments. Thus, the convergence rate is usually achievable in practice.
4 Statistical Properties
Fully characterizing the statistical guarantee of the proposed class of penalty functions in Section 2 is challenging. Instead, we restrict to an important subclass, which is used for the subgroup-based model. We can rigorously derive the estimation error bound for this class of penalty functions. More importantly, our theoretical analysis motivates two interesting directions for our framework, i.e., subgroup identification and adaptive weights. We first introduce the subgroup-based model in Section 4.1. Then, the estimation error is characterized in Section 4.2. This characterization directly motivates us to use adaptive weights in Section 4.3.
4.1 Subgroup-based Model
Assume that the ground truth can be factorized as , where , and the observations are corrupted with additive noise . The feature vectors and form some latent subgroups, where two features and (respectively and ) are considered to be in the same subgroup if (respectively ). One key difference from existing group-based models, e.g., the group sparsity model (Kim et al., 2012), is that the subgroup structure in our model is not assumed to be known a priori. Studies (Jiacheng, 2017) have shown that users of similar types in recommender systems often share the same feature vector, implying a natural subgroup structure. Nevertheless, our subgroup-based model is not restricted to these typical cases. For example, the number of subgroups can even be as large as the matrix dimension.
To facilitate the analysis, we introduce some useful notations. Let and be two sets of mutually exclusive partitions of the indices and , which satisfy for and for . We use and to denote the number of subgroups of and , respectively.
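Since subgroups are defined by identical feature vectors, the number of subgroups can be recovered from a factor matrix by grouping duplicate rows. A small sketch (the function name and the numerical tolerance for "identical" are illustrative):

```python
import numpy as np

def count_subgroups(U, tol=1e-8):
    """Partition the row vectors of U into subgroups of (numerically)
    identical rows and return the number of subgroups."""
    reps = []  # one representative row per subgroup found so far
    for u in U:
        if not any(np.linalg.norm(u - r) <= tol for r in reps):
            reps.append(u)
    return len(reps)
```

For instance, three rows of which two coincide form two subgroups.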
4.2 Estimation Error Bound
This section derives the estimation error bound for the subgroup-based model. We first introduce the complexity-regularized maximum likelihood estimator, and show it is a special case of our framework. Then, we prove that this estimator achieves a lower bound compared to the standard trace-norm regularized matrix completion.
For an integer with , assume that each index pair is included in the observed index set
independently with some probability. The elements of the matrix are independent conditional on the set . The noise matrix
contains i.i.d. random variables following a distribution. Assume , and are bounded by , and , respectively. First, we introduce the following complexity-regularized maximum likelihood estimator
where the finite set is given by
The set in (16) is defined by
with . In a symmetrical way, we can also define .
Let be the total number of the feature vectors that satisfy . We have, for and ,
where the indicator function if and otherwise. A symmetrical result holds for .
See Appendix E for the proof. Moreover, the constraints and in (17) play a similar role as the regularizer in our framework. In the following theorem, we provide an error bound for the estimator (15).
The key step of the proof is the construction of the complexity penalty. See Appendix F for more details. We compare the error bound (3) with the one for trace-norm regularized matrix completion obtained by (Koltchinskii et al., 2011), which assumes the observed entries are corrupted by additive noise. Casting Corollary 2 in (Koltchinskii et al., 2011) to our setting, we obtain an error bound
with high probability, where and are two constants related to , and . Note that our error bound
can be much lower than (20) provided that the total number of subgroups .
However, directly optimizing (15) with the regularizer (18) is undesirable in practice, since the indicator function complicates the optimization and the sets are discrete. Based on this analysis, we introduce the following heuristics:
First, we undertake a slight relaxation of (15) by replacing with their convex hulls.
Then, we use a class of sparsity-inducing penalty functions with finite support that can shrink some differences and
to zeros, e.g., by MCP and SCAD, to approximate the indicator function. By checking whether a pair of feature vectors are equal or not, we can classify the rows of $U$ and $V$ into subgroups.
Remarkably, not only can we identify the subgroups automatically, but we can also significantly reduce the recovery error, as demonstrated in Section 5.3.
4.3 Adaptive Weights
Based on the discussion at the end of Section 4.2, we introduce adaptive weights $W_U$ and $W_V$ that are computed from the partially observed matrix itself, without any additional side information. The weights are chosen by
where the distance function needs to be approximated and the indicator function equals 1 if the index belongs to one of the $k$-nearest neighbors and 0 otherwise. The selection (21) makes the weight large during the computation when the underlying feature vectors are identical, similar to the adaptive weight chosen in (Zou, 2006). It effectively pushes the optimization solver to return identical feature vectors for such pairs. We also define the weights for $V$ symmetrically, by replacing the row quantities in (21) with their column counterparts. Thus, in the following we only explain the row case. In practice, and
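One way to realize this construction is to approximate row distances from jointly observed entries and then keep only the k nearest neighbors, with an inverse-distance weight. This is a sketch under assumptions, not the paper's exact rule: the distance proxy (Euclidean distance on commonly observed columns) and the inverse-distance weighting are illustrative choices.

```python
import numpy as np

def adaptive_knn_weights(X_obs, mask, k=5):
    """Adaptive weights from the partially observed matrix alone:
    W[i, j] is an inverse distance between rows i and j (an assumed
    proxy for the approximated distance), kept only for the
    k nearest neighbors of row i."""
    m = X_obs.shape[0]
    D = np.full((m, m), np.inf)
    for i in range(m):
        for j in range(m):
            if i == j:
                continue
            common = (mask[i] > 0) & (mask[j] > 0)  # jointly observed columns
            if common.any():
                D[i, j] = np.linalg.norm(X_obs[i, common] - X_obs[j, common])
    W = np.zeros((m, m))
    for i in range(m):
        for j in np.argsort(D[i])[:k]:              # k nearest rows
            if np.isfinite(D[i, j]):
                W[i, j] = 1.0 / (D[i, j] + 1e-8)    # large weight for close rows
    return W
```

The column-side weights would be obtained the same way on the transposed matrix. Rows that look identical on their shared observations receive very large weights, pushing the solver toward merging them into one subgroup.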