Non-negative matrix factorization (NMF) is a popular matrix factorization approach that approximates a non-negative matrix $X$ by the product of two non-negative low-rank factor matrices $W$ and $H$. Unlike other matrix factorization approaches, NMF takes into account the fact that many types of real-world data, particularly images and videos, are non-negative, and it maintains this non-negativity constraint during factorization. The non-negativity constraint helps to learn a parts-based representation, which is supported by psychological and physical evidence (Logothetis and Sheinberg, 1996; Wachsmuth and Oram, 1994). Therefore, NMF has achieved great success in many fields such as image analysis (Monga and Mihcak, 2007), video processing (Bucak and Gunsel, 2007), and environmental science (Paatero and Tapper, 1994).
NMF was first proposed by Paatero and Tapper (Paatero and Tapper, 1994) and was greatly popularized by Lee and Seung (Lee and Seung, 1999). Since then, many NMF variants have been proposed and have achieved great success in a variety of tasks. For example, Hoyer (Hoyer, 2004) proposed sparseness-constrained NMF (NMFsc) to enhance the sparseness of the learned factor matrices for computer vision tasks. Zafeiriou et al. (Zafeiriou et al., 2006) proposed discriminant NMF (DNMF) to incorporate Fisher's criterion for classification. Cai et al. (Cai et al., 2011) proposed graph-regularized NMF (GNMF) to incorporate the geometric structure of a dataset for clustering. Recently, Sandler and Lindenbaum (Sandler and Lindenbaum, 2011) proposed an earth mover's distance metric-based NMF (EMD-NMF) to model the distortion of images for several vision tasks. Liu et al. (Liu et al., 2012) proposed a constrained NMF (CNMF) that incorporates label information as additional constraints for image representation.
From the mathematical viewpoint, traditional NMF (Lee and Seung, 1999; Lee and Seung, 2001) and its variants minimize the Kullback-Leibler divergence or the Euclidean distance between $X$ and $WH$ to model Poisson noise and Gaussian noise, respectively; we call them KLNMF and EucNMF for short. Both KLNMF and EucNMF are popular because they can be efficiently optimized by using the multiplicative update rule (Lee and Seung, 2001). However, the noise in many practical applications is heavy tailed and cannot be well modeled by either a Poisson or a Gaussian distribution. For example, gradient-based image features such as SIFT (Lowe, 2004) contain non-Gaussian heavy tailed noise (Jia and Darrell, 2011). In these cases, traditional NMF does not perform well because it is not robust to outliers such as occlusions, Laplace noise, and salt & pepper noise, whose distributions are heavy tailed.
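To make the EucNMF baseline concrete, the following is a minimal numpy sketch of the Lee-Seung multiplicative update rule; the matrix sizes, random initialization, and the small constant added to the denominators are illustrative choices, not part of the original formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((20, 30))                  # non-negative data matrix
m, n, r = X.shape[0], X.shape[1], 5
W = rng.random((m, r)) + 0.1              # non-negative factor matrices
H = rng.random((r, n)) + 0.1

def euc_nmf_step(X, W, H, eps=1e-9):
    # One round of Lee-Seung multiplicative updates for min ||X - WH||_F^2;
    # eps guards against division by zero. Updates preserve non-negativity.
    H *= (W.T @ X) / (W.T @ W @ H + eps)
    W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

err0 = np.linalg.norm(X - W @ H)
for _ in range(100):
    W, H = euc_nmf_step(X, W, H)
err1 = np.linalg.norm(X - W @ H)
```

Because the updates multiply the factors by non-negative ratios, non-negativity is preserved automatically, which is why this rule is so popular for KLNMF and EucNMF.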
On the other hand, real-world data often lie in a lower-dimensional subspace; for example, Basri and Jacobs (Basri and Jacobs, 2003) showed that images taken of convex Lambertian objects under distant illumination lie near an approximately nine-dimensional linear subspace. Recently, robust principal component analysis (RPCA; Candes et al., 2011) and GoDec (Zhou and Tao, 2011) have been proposed to robustly recover the lower-dimensional subspace in the presence of outliers. Both RPCA and GoDec exploit the prior knowledge that the noise, e.g., illumination/shadow in images and moving objects in videos, is sparse, and thus perform robustly in practice. Traditional NMF cannot robustly estimate the low-rank part of data contaminated by outliers because it does not consider such prior knowledge of the sparse structure of the noise.
In this paper, we present Manhattan NMF (MahNMF) to robustly estimate the low-rank part and the sparse part of a non-negative matrix. MahNMF models the heavy tailed Laplacian noise by minimizing the Manhattan distance between an $m \times n$-dimensional non-negative matrix $X$ and the product $WH$, i.e.,

$$\min_{W \ge 0,\, H \ge 0} \|X - WH\|_M, \qquad (1)$$

where $\|\cdot\|_M$ is the Manhattan distance and the reduced dimensionality $r$ of $W \in \mathbb{R}_+^{m \times r}$ and $H \in \mathbb{R}_+^{r \times n}$ satisfies $r \ll \min(m, n)$. Since both $W$ and $H$ are low-rank, MahNMF actually estimates the non-negative low-rank part, i.e., $WH$, and the sparse part, i.e., $X - WH$, of a non-negative matrix $X$. Benefiting from both the capability of the Laplace distribution to model the heavy tailed behavior of noise and the robust recovery capability of sparse and low-rank decompositions such as RPCA and GoDec, MahNMF performs effectively and robustly when data are contaminated by outliers. We further extend MahNMF to various practical applications by developing box-constrained MahNMF, manifold regularized MahNMF, and group sparse MahNMF. These extensions follow regularization theory by integrating MahNMF with various regularizations. By taking into account the grouping effect of the sparse part, we develop an elastic net inducing MahNMF to learn the low-rank and group sparse decomposition of a non-negative matrix. Inspired by spectral clustering, we develop a symmetric MahNMF for image segmentation. Although Lam (Lam, 2008) tried to model Laplacian noise in NMF, that method cannot be used in practice because its semi-definite programming-based optimization suffers from both slow convergence and poor scalability.
The main contribution of this paper lies in two fast optimization methods for MahNMF and its extensions: the rank-one residual iteration (RRI) method and Nesterov's smoothing method. In particular, RRI approximates the residual matrix with the outer product of one column of $W$ and one row of $H$ in (1) and iteratively updates each variable of $W$ and $H$ in closed form. RRI is efficient for optimizing small-scale MahNMF and some of its extensions, but it is neither scalable to large-scale matrices nor flexible enough to optimize all MahNMF extensions. Since the objective functions of MahNMF and its extensions are neither convex nor smooth, we apply Nesterov's smoothing method to recursively optimize one factor matrix with the other fixed. By setting the smoothing parameter inversely proportional to the iteration number, we improve the approximation accuracy iteratively for both MahNMF and its extensions.
We conduct experiments on both synthetic and real-world datasets, such as face images, natural scene images, surveillance videos, and multi-modal datasets, to show the efficiency of the proposed Nesterov's smoothing method-based algorithms for optimizing MahNMF and its variants, and their effectiveness in face recognition, image clustering, background/illumination modeling, and multi-view learning, by comparing them with traditional NMF, RPCA, and GoDec.
The remainder of this paper is organized as follows: Section 2 presents the rank-one residual iteration (RRI) method for optimizing MahNMF, and Section 3 presents Nesterov's smoothing method. Section 4 presents several MahNMF extensions that can be solved by the proposed Nesterov's smoothing method-based algorithm. In Section 5, we conduct experiments to show both the efficiency of the Nesterov's smoothing method-based algorithm for MahNMF and the effectiveness of MahNMF and its variants. Section 6 concludes this paper.
Notations: We denote by a lower-case letter $x$, an arrow-headed letter $\vec{x}$, and a capital letter $X$ a scalar, a vector, and a matrix, respectively. In particular, $\vec{1}$ signifies the all-ones vector, $I$ signifies an identity matrix, and $0$ signifies the scalar zero, a zero vector, or a null matrix, depending on context. We denote by bracketed subscripts and superscripts the elements of a vector or a matrix, e.g., $x_{(i)}$ signifies the $i$-th element of $\vec{x}$, and $X_{(i\cdot)}$, $X_{(\cdot j)}$, $X_{(ij)}$ signify the $i$-th row, the $j$-th column, and the $(i,j)$-th element of $X$, respectively. We denote by a plain subscript, e.g., $x_k$ and $X_k$, the points in a sequence. We denote by $\mathbb{R}$ and $\mathbb{R}_+$ the set of real numbers and the set of non-negative real numbers, respectively. Consequently, $\mathbb{R}_+^m$ and $\mathbb{R}_+^{m \times n}$ signify the set of $m$-dimensional non-negative vectors and the set of $m \times n$-dimensional non-negative matrices, respectively. We denote by $\|\vec{x}\|_1$ and $\|\vec{x}\|_2$ the $L_1$ and $L_2$ norm of a vector $\vec{x}$, respectively. For any matrices $A$ and $B$, we denote their Euclidean distance (Frobenius norm) and Manhattan distance as $\|A - B\|_F$ and $\|A - B\|_M$, respectively. In addition, we denote by $A \odot B$ and $A \oslash B$ their element-wise product and division, respectively.
2 Rank-one Residual Iteration Method for MahNMF
Since the objective function (1) is non-convex, we recursively optimize one factor matrix, $W$ or $H$, with the other fixed, i.e., at iteration $k$, we update

$$H_{k+1} = \arg\min_{H \ge 0} \|X - W_k H\|_M, \qquad (2)$$

$$W_{k+1} = \arg\min_{W \ge 0} \|X - W H_{k+1}\|_M, \qquad (3)$$
until convergence. The convergence is usually checked by the following objective-based stopping condition:
Although (2) is convex, the Manhattan distance-based objective function, i.e., $f(H) = \|X - W H\|_M$, is non-differentiable when $X - WH$ contains zero elements. This means that gradient-based methods cannot be directly applied to optimizing (2). Fortunately, we will show that each variable in $H$ can be updated in closed form, and thus (2) can be optimized by alternating optimization over each variable of $H$. Given $W$ and all rows of $H$ except the $j$-th row $H_{(j\cdot)}$, eq. (2) can be written as

$$\min_{H_{(j\cdot)} \ge 0} \|R_j - W_{(\cdot j)} H_{(j\cdot)}\|_M, \qquad (5)$$

where $R_j = X - \sum_{i \ne j} W_{(\cdot i)} H_{(i\cdot)}$ is the residual matrix. Eq. (5) is a rank-one approximation of the residual matrix; therefore, following (Ho et al., 2011), we term this method the rank-one residual iteration (RRI) method.
Since (5) is convex and separable with respect to each variable $H_{(jq)}$, wherein $q \in \{1, \dots, n\}$, each variable has an optimal solution and is updated as follows:

$$H_{(jq)} = \arg\min_{x \ge 0} g(x), \qquad g(x) = \sum_{i=1}^{m} \left| (R_j)_{(iq)} - W_{(ij)}\, x \right|. \qquad (6)$$

Looking carefully at $g(x)$, it is a continuous piecewise linear function whose breakpoints are $(R_j)_{(iq)} / W_{(ij)}$ for the indices $i$ with $W_{(ij)} \ne 0$. It is obvious that the minimum of $g(x)$ is attained at one of these breakpoints. Disregarding the constraint $x \ge 0$ for the moment, the breakpoint at which the slope of $g$ first changes sign is a minimizer. By sorting the breakpoints in ascending order, we have $p_1 \le p_2 \le \cdots \le p_m$. Furthermore, by sorting the coefficients accordingly, we can remove the absolute-value operator in (6) and rewrite $g$ into $m+1$ linear pieces as follows (with the conventions $p_0 = -\infty$ and $p_{m+1} = +\infty$):

$$g(x) = \sum_{i=1}^{t} |W_{(ij)}| (x - p_i) + \sum_{i=t+1}^{m} |W_{(ij)}| (p_i - x), \quad x \in [p_t, p_{t+1}]. \qquad (7)$$

Since the coefficients $|W_{(ij)}|$ are non-negative, the slope of the piecewise function in (7) is increasing in $t$, and it is easy to find the point at which the slope first changes sign. Suppose $p_t$ first changes the sign of the slope of $g$, i.e., the slope is negative on $(p_{t-1}, p_t)$ and non-negative on $(p_t, p_{t+1})$; it is clear that $p_t$ minimizes $g(x)$. Note that the minimizer is not unique when the slope on some piece is zero. See Figure 1 for three examples of piecewise linear functions: in the first two, a single breakpoint attains the minimum, while in the third, any point in an interval of zero slope is a minimizer. Taking the non-negativity constraint into account, we obtain the solution of (6) as $\max(p_t, 0)$. In the non-unique case of Figure 1, we simply take the leftmost minimizing point as the optimal solution.
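The closed-form breakpoint update for a single variable can be sketched as follows. Here `nlad_scalar` is an illustrative helper name, not code from the paper; it solves the generic one-variable problem underlying (6), rewriting each term as a weighted distance to a breakpoint.

```python
import numpy as np

def nlad_scalar(r, c):
    """Minimize g(x) = sum_i |r_i - c_i * x| over x >= 0.

    g is piecewise linear: rewriting each term as |c_i| * |x - r_i/c_i|
    gives breakpoints p_i = r_i / c_i (for c_i != 0); the minimum lies at
    the first breakpoint where the slope of g changes sign.
    """
    mask = c != 0
    p = r[mask] / c[mask]                 # breakpoints
    w = np.abs(c[mask])                   # per-breakpoint slope weights
    order = np.argsort(p)
    p, w = p[order], w[order]
    slope = -w.sum()                      # slope to the left of all breakpoints
    for pt, wt in zip(p, w):
        slope += 2.0 * wt                 # slope after passing breakpoint pt
        if slope >= 0:                    # first sign change: pt is a minimizer
            return max(pt, 0.0)           # project onto the constraint x >= 0
    return max(p[-1], 0.0)                # unreachable when weights are positive

r = np.array([3.0, -1.0, 2.0, 5.0])
c = np.array([1.0, 2.0, 1.0, 1.0])
x_star = nlad_scalar(r, c)
```

The dominant cost is the sort of the breakpoints, matching the complexity analysis of Algorithm 1 below.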
We summarize the RRI method in Algorithm 1. It successively updates each row of $H$ and stops when the following stopping condition is satisfied
where $\epsilon$ is a small precision tolerance. By recursively solving (2) and (3) with Algorithm 1, the MahNMF problem (1) can be successfully solved. The previous solution is used as a warm start to accelerate the convergence of the RRI method.
The main time cost of Algorithm 1 is spent on sentence 6, which sorts the breakpoints. Sorting each set of $m$ breakpoints costs $O(m \log m)$ time in the worst case, so one pass over the $r$ rows and $n$ columns of $H$ costs $O(rnm \log m)$ time. Sentence 7 finds the breakpoint that first changes the sign of the slope from the sorted set. This can be done by initializing the slope as $-\sum_{i=1}^{m} |W_{(ij)}|$ and increasing it by $2|W_{(i_t j)}|$ at the $t$-th step; once the sign of the slope changes, the procedure stops and outputs $p_t$. The worst-case time complexity of this procedure is $O(m)$. Therefore, the total complexity of Algorithm 1 is $O(K r n m \log m)$, where $K$ is the iteration number. Since Algorithm 1 finds the closed-form solution for each variable of $H$, it converges fast. However, its time complexity is high, especially when $m$ is large; therefore, RRI is not scalable to large-scale problems. In the following section, we propose an efficient and scalable algorithm for optimizing MahNMF.
3 Nesterov’s Smoothing Method for MahNMF
Since the Manhattan distance equals the summation of the column-wise $L_1$ norms, i.e., $\|X - WH\|_M = \sum_{q=1}^{n} \|X_{(\cdot q)} - W H_{(\cdot q)}\|_1$, the minimization problem (2) can be solved by optimizing each column of $H$ separately. For the $q$-th column, the sub-problem is

$$\min_{H_{(\cdot q)} \ge 0} \|X_{(\cdot q)} - W H_{(\cdot q)}\|_1. \qquad (9)$$
Without the non-negativity constraint, eq. (9) reduces to the well-known least absolute deviations (LAD; Karst, 1958) regression problem. Here, we term (9) a non-negative LAD (NLAD) problem for convenience of presentation. According to (Harter, 1974), LAD is much more robust than the least squares (LS) method, especially on datasets contaminated by outliers. NLAD inherits this robustness from LAD while preserving the non-negativity of the data.
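As a quick illustration of this robustness, consider the simplest possible design, a one-dimensional location model, where the LS estimate is the sample mean and the LAD estimate is the sample median (the data values below are made up):

```python
import numpy as np

# In the location model b_i = h + noise, the LS estimate of h is the
# mean of b, and the LAD estimate is the median of b.
b = np.array([1.0, 1.1, 0.9, 1.05, 0.95])
b_dirty = np.append(b, 100.0)            # inject a single gross outlier

ls_clean, lad_clean = b.mean(), np.median(b)
ls_dirty, lad_dirty = b_dirty.mean(), np.median(b_dirty)
```

A single outlier drags the LS estimate far from the truth, while the LAD estimate barely moves, which is exactly the behavior that motivates the Manhattan distance in MahNMF.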
For any given observation $\vec{b} \in \mathbb{R}^m$ and matrix $A \in \mathbb{R}_+^{m \times r}$, the NLAD problem (9) can be written as

$$\min_{\vec{h} \ge 0} f(\vec{h}) = \sum_{i=1}^{m} \left| \langle A_{(i\cdot)}, \vec{h} \rangle - b_{(i)} \right|, \qquad (10)$$
where $\langle \cdot, \cdot \rangle$ signifies the inner product in $\mathbb{R}^r$. Define the norm that endows the primal domain $\mathbb{R}^r$ as the $L_2$ norm $\|\cdot\|_2$, and construct the prox-function for the feasible set $Q_1 = \mathbb{R}_+^r$ as $d_1(\vec{h}) = \frac{1}{2}\|\vec{h}\|_2^2$. It is obvious that $d_1(\vec{h})$ is strongly convex and its convexity parameter is $\sigma_1 = 1$. Since $(\mathbb{R}^r, \|\cdot\|_2)$ is a self-dual space, we know that the dual norm satisfies $\|\vec{s}\|^* = \|\vec{s}\|_2$ for any $\vec{s} \in \mathbb{R}^r$.
Since $f(\vec{h})$ is convex and continuous, there must be an optimal solution. However, (10) cannot be solved directly by gradient-based methods because $f$ is non-smooth. Fortunately, Nesterov (Nesterov, 2004) shows that $f$ can be approximated by a smooth function. In particular, we first construct a dual function for the primal non-smooth function and smooth the dual function by adding a smooth and strongly convex prox-function over the feasible set of the dual variable. Then we solve the smoothed dual function in the dual space and project the solution back to the primal space. The obtained solution can be considered an approximate minimum of the primal function. By choosing the dual domain $\mathbb{R}^m$ and the feasible set $Q_2 = \{\vec{u} \in \mathbb{R}^m : \|\vec{u}\|_\infty \le 1\}$, wherein $\vec{u}$ is the dual variable, the primal problem (10) is equivalently rewritten as

$$\min_{\vec{h} \ge 0} \max_{\vec{u} \in Q_2} \langle A\vec{h} - \vec{b}, \vec{u} \rangle,$$
where $\langle \cdot, \cdot \rangle$ is the inner product in $\mathbb{R}^m$. The corresponding dual problem is

$$\max_{\vec{u} \in Q_2} \phi(\vec{u}), \qquad \phi(\vec{u}) = \min_{\vec{h} \ge 0} \langle A\vec{h} - \vec{b}, \vec{u} \rangle.$$
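This min-max rewriting relies on the variational form of the $L_1$ norm, $\|\vec{z}\|_1 = \max_{\|\vec{u}\|_\infty \le 1} \langle \vec{z}, \vec{u} \rangle$, which the following small numpy check illustrates (the random vector is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.standard_normal(6)

# The inner maximization over the box ||u||_inf <= 1 is attained at
# u = sign(z), so max_{||u||_inf <= 1} <z, u> equals the L1 norm of z.
u_star = np.sign(z)
max_inner = z @ u_star
l1_norm = np.abs(z).sum()
```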
Since the solution set of (10) is bounded, i.e., there exists a positive number $D$ such that $\|\vec{h}\|_2 \le D$ for any minimizer $\vec{h}$, the dual function $\phi(\vec{u})$ can be calculated explicitly via an element-wise minimum operator. Since it is difficult to estimate $D$, the dual problem is still difficult to solve. However, it can be easily handled by adding a simple prox-function. According to (Nesterov, 2004), we define the prox-function for $Q_2$ as $d_2(\vec{u}) = \frac{1}{2}\|\vec{u}\|_2^2$. By adding the prox-function, we obtain a smoothed approximate function for $f$ as follows:

$$f_\mu(\vec{h}) = \max_{\vec{u} \in Q_2} \left\{ \langle A\vec{h} - \vec{b}, \vec{u} \rangle - \mu\, d_2(\vec{u}) \right\}, \qquad (11)$$
where $\mu > 0$ is a parameter that controls the smoothness. The larger the parameter $\mu$, the smoother the approximate function and the worse its approximation accuracy. Using simple algebra, eq. (11) can be written as

$$f_\mu(\vec{h}) = \max_{\|\vec{u}\|_\infty \le 1} \sum_{i=1}^{m} \left( (A\vec{h} - \vec{b})_{(i)}\, u_{(i)} - \frac{\mu}{2} u_{(i)}^2 \right). \qquad (12)$$
Let $\vec{\alpha}$ and $\vec{\beta}$ be the Lagrange multiplier vectors corresponding to the constraints $\vec{u} \le \vec{1}$ and $-\vec{u} \le \vec{1}$, respectively; the K.K.T. conditions of (12) then yield the maximizer

$$\vec{u}^* = \mathrm{med}\left( -\vec{1},\ \frac{A\vec{h} - \vec{b}}{\mu},\ \vec{1} \right), \qquad (13)$$
where $\mathrm{med}(\cdot, \cdot, \cdot)$ is the element-wise median operator. By substituting $\vec{u}^*$ back into (12), we obtain the closed-form smoothed function as

$$f_\mu(\vec{h}) = \sum_{i=1}^{m} \psi_\mu\!\left( (A\vec{h} - \vec{b})_{(i)} \right), \qquad (14)$$
where the scalar function $\psi_\mu$ is the well-known Huber function

$$\psi_\mu(t) = \begin{cases} \dfrac{t^2}{2\mu}, & |t| \le \mu, \\[4pt] |t| - \dfrac{\mu}{2}, & |t| > \mu. \end{cases} \qquad (15)$$

According to Theorem 1 in (Nesterov, 2004), $f_\mu(\vec{h})$ is well defined and continuously differentiable at any $\vec{h}$. Moreover, $f_\mu$ is convex and its gradient $\nabla f_\mu(\vec{h}) = A^T \vec{u}^*$ is Lipschitz continuous with constant $L_\mu = \|A\|^2 / (\mu \sigma_2)$, wherein $\sigma_2 = 1$ is the convexity parameter of $d_2$ and $\|A\|$ is the norm of the matrix $A$, which is defined as follows:

$$\|A\| = \max_{\|\vec{h}\|_2 = 1} \|A\vec{h}\|_2.$$
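As a sanity check (with illustrative function names), the closed-form Huber expression can be verified numerically against the max-form definition in (11):

```python
import numpy as np

def psi(t, mu):
    # Closed-form smoothed absolute value (the Huber function):
    # psi_mu(t) = t^2 / (2 mu)  if |t| <= mu,  and |t| - mu / 2 otherwise.
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= mu, t * t / (2 * mu), np.abs(t) - mu / 2)

def psi_by_duality(t, mu):
    # psi_mu(t) = max_{|u| <= 1} (t u - mu u^2 / 2), attained at
    # u* = med(-1, t / mu, 1), i.e., clipping t / mu to [-1, 1].
    t = np.asarray(t, dtype=float)
    u = np.clip(t / mu, -1.0, 1.0)
    return t * u - mu * u * u / 2

t = np.linspace(-3.0, 3.0, 601)
mu = 0.5
gap = np.abs(psi(t, mu) - psi_by_duality(t, mu)).max()
```

The same check confirms the sandwich bounds used in Theorem 1 below: the smoothed function never exceeds the absolute value and underestimates it by at most $\mu/2$ per entry.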
By using the obtained smoothed function, (9) can be approximately solved by

$$\min_{\vec{h} \ge 0} f_\mu(\vec{h}). \qquad (16)$$
Since $f_\mu$ is smooth and convex and its gradient is Lipschitz continuous, it is natural to optimize (16) by using Nesterov's optimal gradient method (OGM; Nesterov, 2004). In particular, OGM constructs two auxiliary sequences during optimization: one sequence stores the historical gradients, and the other stores the search point that minimizes a quadratic approximation of $f_\mu$ at the current solution. The step size is determined by the Lipschitz constant. In each iteration round, the solution is updated by combining the historical gradients and the search point. This combination accelerates the gradient method and lets OGM achieve the optimal convergence rate of $O(1/k^2)$ for optimizing (16). In this paper, the search points $\vec{y}_k$ and the historical gradient aggregates $\vec{z}_k$ are defined as follows:
According to (Nesterov, 2004), we combine $\vec{y}_k$ and $\vec{z}_k$ as follows:

$$\vec{h}_{k+1} = \frac{2}{k+3}\, \vec{z}_k + \frac{k+1}{k+3}\, \vec{y}_k,$$

and stop the iterations once the objective value decreases by less than a precision $\epsilon$; since the optimal objective value is unknown in advance, the change between successive iterations is usually used instead. According to Theorem 3 of (Nesterov, 2004), the complexity of finding an $\epsilon$-solution does not exceed $O(1/\epsilon)$ iterations. By substituting the smoothing parameter $\mu = O(\epsilon)$, the Lipschitz constant $L_\mu = \|A\|^2/\mu$, and the convexity parameters $\sigma_1$ and $\sigma_2$ into this bound, we conclude that OGM finds an $\epsilon$-solution for (9) in $O(1/\epsilon)$ iterations.
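A schematic implementation of this scheme on the smoothed NLAD objective is sketched below; the function name, problem sizes, smoothing parameter, and iteration count are illustrative, and the combination weights follow Nesterov's smoothing scheme with the prox-center at the origin.

```python
import numpy as np

def ogm_nlad(A, b, mu=0.05, iters=500):
    """Sketch of Nesterov's optimal gradient method for the smoothed NLAD
    problem min_{h >= 0} f_mu(h), f_mu(h) = sum_i psi_mu((A h - b)_i)."""
    m, r = A.shape
    L = np.linalg.norm(A, 2) ** 2 / mu         # Lipschitz constant of grad f_mu
    h = np.zeros(r)                            # current point h_k
    g_acc = np.zeros(r)                        # weighted historical gradients
    y = h
    for k in range(iters):
        u = np.clip((A @ h - b) / mu, -1.0, 1.0)   # dual maximizer u*
        grad = A.T @ u                             # grad f_mu(h) = A^T u*
        y = np.maximum(h - grad / L, 0.0)          # projected gradient step
        g_acc += (k + 1) / 2.0 * grad
        z = np.maximum(-g_acc / L, 0.0)            # aggregated step from h_0 = 0
        h = 2.0 / (k + 3) * z + (k + 1.0) / (k + 3) * y
    return y

rng = np.random.default_rng(1)
A = np.abs(rng.standard_normal((40, 3)))       # non-negative design matrix
h_true = np.array([1.0, 0.5, 2.0])
b = A @ h_true                                 # noiseless observations
h_hat = ogm_nlad(A, b)
```

Note that the per-iteration cost is dominated by the two matrix-vector products with $A$, which is what makes the method scalable compared with RRI.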
As mentioned above, the smoothing parameter $\mu$ controls how well $f_\mu$ approximates $f$: a smaller $\mu$ implies a better approximation. A natural question is whether the minimizer of (16) minimizes (9) as $\mu$ goes to zero. To answer this question, we first show in the following Theorem 1 that $f_\mu$ is bounded by $f$ and gets infinitely close to $f$ as $\mu$ goes to zero, and we then prove in Theorem 2 that the smoothing method finds an approximate solution of MahNMF. Figure 2 gives two examples of the smoothing functions with different smoothing parameters; it shows that the smoothed functions are bounded by the original non-smooth function.
Theorem 1 Given any positive number $\mu$, we have the following inequality:

$$f_\mu(\vec{h}) \le f(\vec{h}) \le f_\mu(\vec{h}) + \frac{\mu m}{2}, \quad \forall \vec{h}.$$
Proof. Define the residual error $\vec{t} = A\vec{h} - \vec{b}$; its $i$-th entry is $t_{(i)} = \langle A_{(i\cdot)}, \vec{h} \rangle - b_{(i)}$. The approximation function can be written as the following function with respect to $\vec{t}$:

$$f_\mu(\vec{h}) = \sum_{i=1}^{m} \psi_\mu(t_{(i)}). \qquad (24)$$

Below we will prove that $\psi_\mu(t) \le |t| \le \psi_\mu(t) + \frac{\mu}{2}$ for each scalar $t$. For the convenience of derivation, we focus on the function $\psi_\mu(t)$, wherein $t \in \mathbb{R}$. According to the definition of $\psi_\mu$ in (15), we have $\psi_\mu(t) = \frac{t^2}{2\mu}$ when $|t| \le \mu$ and $\psi_\mu(t) = |t| - \frac{\mu}{2}$ otherwise. It is obvious that $\psi_\mu(t) \le |t| \le \psi_\mu(t) + \frac{\mu}{2}$ in both cases: when $|t| > \mu$, the right-hand side holds with equality; when $|t| \le \mu$, the left inequality follows from $t^2 \le 2\mu|t|$ and the right one from $(|t| - \mu)^2 \ge 0$. By substituting these inequalities into (24) and considering $f(\vec{h}) = \sum_{i=1}^{m} |t_{(i)}|$, we have $f_\mu(\vec{h}) \le f(\vec{h}) \le f_\mu(\vec{h}) + \frac{\mu m}{2}$. This completes the proof.
Since the columns of $H$ are separable in (2), OGM can be written in matrix form and is summarized in Algorithm 2. Algorithm 2 accepts the smoothing parameter $\mu$ as an input and outputs an approximate solution of the sub-problem (2).
According to (Nesterov, 2004), Algorithm 2 converges at the rate of $O(1/k^2)$ for optimizing (16) and needs $O(1/\epsilon)$ iterations to yield an $\epsilon$-solution of the original problem (9). This follows by bounding the gap between the primal and dual functions at the $k$-th iteration: by minimizing the right-hand side of the gap bound over $\mu$, the optimal smoothing parameter is proportional to $\epsilon$, and the resulting iteration number is $O(1/\epsilon)$. Since the dual feasible set $Q_2$ is bounded, the gap itself is bounded. However, this bound is difficult to calculate exactly. In the following section, we will show that this deficiency can be overcome by slightly modifying the feasible set.
According to (Nesterov, 2004), sentence 5 of Algorithm 2 can be slightly changed to guarantee a monotone decrease of the objective function. In particular, we select the point with the smallest objective value among the current candidates and set it as the next iterate. This strategy requires an additional evaluation of the objective function and thus increases the time cost of each iteration. The main time cost of Algorithm 2 is spent on sentences 3 and 4, which calculate the gradient; their complexity is $O(mrn)$. Therefore, the total time complexity of Algorithm 2 is $O(Kmrn)$, wherein $K$ is the iteration number.
According to Theorem 1, $f_\mu$ gets infinitely close to the objective $f$ of (10) as $\mu$ goes to zero. This motivates us to adaptively decrease the smoothing parameter over successive calls of Algorithm 2. The total procedure is summarized in Algorithm 3, which sets the smoothing parameter inversely proportional to the iteration number and thus improves the approximation iteratively. In Algorithm 3, the current solution, i.e., $W_k$ and $H_k$, is used as a warm start to accelerate the convergence of Algorithm 2 (see sentences 3 and 4).
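The effect of shrinking the smoothing parameter can be illustrated on a scalar toy problem. Algorithm 3 uses OGM as its inner solver; here, for brevity, plain smoothed-gradient steps are used instead, and all numbers are made up. The minimizer of $\sum_i |x - b_i|$ is the median of the $b_i$, and the iterate converges to it as $\mu_k = \mu_0 / k$ decreases.

```python
import numpy as np

# Minimize f(x) = sum_i |x - b_i| (whose minimizer is the median of b)
# by gradient steps on the smoothed objective f_mu, decreasing the
# smoothing parameter as mu_k = mu0 / k in each outer round.
b = np.array([0.0, 1.0, 2.0, 7.0, 9.0])
x, mu0 = 5.0, 1.0
for k in range(1, 201):
    mu = mu0 / k                         # tighter smoothing every round
    L = len(b) / mu                      # Lipschitz constant of grad f_mu
    for _ in range(50):                  # inner smoothed-gradient steps
        grad = np.clip((x - b) / mu, -1.0, 1.0).sum()
        x -= grad / L                    # step size 1/L, warm-started x
```

The warm start matters: because each round begins from the previous solution, the inner solver only needs to refine the estimate under the tighter smoothing, mirroring sentences 3 and 4 of Algorithm 3.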
Algorithm 3 recursively minimizes two smoothed objective functions, i.e., the smoothed objectives with respect to $H$ and $W$, as the iteration counter $k$ goes to infinity. Although the point generated at the $k$-th iteration is not the exact minimizer of the original sub-problems, the following Theorem 2 shows that the iterates do decrease the objective function