Non-negative matrix factorization (NMF, ) explores the non-negativity property of data and has received considerable attention in many fields, such as text mining , hyper-spectral imaging , and gene expression clustering . It decomposes a data matrix into the product of two lower dimensional non-negative factor matrices by minimizing the Eudlidean distance between their product and the original data matrix. Since NMF only allows additive, non-subtractive combinations, it obtains a natural parts-based representation of the data. NMF is optimal when the dataset contains additive Gaussian noise, and so it fails on grossly corrupted datasets, e.g., the AR database  where face images are partially occluded by sunglasses or scarves. This is because the corruptions or outliers seriously violate the noise assumption.
Many models have been proposed to improve the robustness of NMF. Hamza and Brady  proposed a hypersurface cost based NMF (HCNMF) which minimizes the hypersurface cost function111The hypersurface cost function is defined as which is quadratic when its argument is small and linear when its argument is large. between the data matrix and its approximation. HCNMF is a significant contribution for improving the robustness of NMF, but its optimization algorithm is time-consuming because the Armijo’s rule based line search that it employs is complex. Lam  proposed -NMF222 When the noise is modeled by Laplace distribution, the maximum likelihood estimation yields an
When the noise is modeled by Laplace distribution, the maximum likelihood estimation yields an-norm based objective function. We therefore term the method in  -NMF. to model the noise in a data matrix by a Laplace distribution. Although -NMF is less sensitive to outliers than NMF, its optimization is expensive because the -norm based loss function is non-smooth. This problem is largely reduced by Manhattan NMF (MahNMF, ), which solves -NMF by approximating the non-smooth loss function with a smooth one and minimizing the approximated loss function with Nesterov’s method . Zhang et al.  proposed an -norm regularized Robust NMF (RNMF-) to recover the uncorrupted data matrix by subtracting a sparse error matrix from the corrupted data matrix. Kong et al.  proposed -NMF to minimize the -norm of an error matrix to prevent noise of large magnitude from dominating the objective function. Gao et al.  further proposed robust capped norm NMF (RCNMF) to filter out the effect of outlier samples by limiting their proportions in the objective function. However, the iterative algorithms utilized in -NMF and RCNMF converge slowly because they involve a successive use of the power method . Recently, Bhattacharyya et al.  proposed an important robust variant of convex NMF which only requires the average -norm of noise over large subsets of columns to be small; Pan et al.  proposed an -norm based robust dictionary learning model; and Gillis and Luce  proposed a robust near-separable NMF which can determine the low-rank, avoid normalizing data, and filter out outliers. HCNMF, -NMF, RNMF-, -NMF, RCNMF, ,  and  share a common drawback, i.e., they all fail when the dataset is contaminated by serious corruptions because the breakdown point of the -norm based models is determined by the dimensionality of the data .
In this paper, we propose a Truncated Cauchy non-negative matrix factorization (Truncated CauchyNMF) model to learn a subspace on a dataset contaminated by large magnitude noise or corruption. In particular, we proposed a Truncated Cauchy loss that simultaneously and appropriately models moderate outliers (because the loss corresponds to a fat tailed distribution in-between the truncation points) and extreme outliers (because the truncation directly cut off large errors). Based on the proposed loss function, we develop a novel Truncated CauchyNMF model. We theoretically analyze the robustness of Truncated CauchyNMF and show that Truncated CauchyNMF is more robust than a family of NMF models, and derive a theoretical guarantee for its generalization ability and show that Truncated CauchyNMF converges at a rate of order , where is the sample size. Truncated CauchyNMF is difficult to optimize because the loss function includes a nonlinear logarithmic function. To address this, we optimize Truncated CauchyNMF by half-quadratic (HQ) programming based on the theory of convex conjugation. HQ introduces a weight for each entry of the data matrix and alternately and analytically updates the weight and updates both factor matrices by easily solving a weighted non-negative least squares problem with Nesterov’s method . Intuitively, the introduced weight reflects the magnitude of the error. The heavier the corruption, the smaller the weight, and the less an entry contributes to learning the subspace. By performing truncation on magnitudes of errors, we prove that HQ introduces zero weights for entries with extreme outliers, and thus HQ is able to learn the intrinsic subspace on the inlier entries.
In summary, the contributions of this paper are three-fold: (1) we propose a robust subspace learning framework called Truncated CauchyNMF, and develop a Nesterov-based HQ algorithm to solve it; (2) we theoretically analyze the robustness of Truncated CauchyNMF comparing with a family of NMF models, and provide insight as to why Truncated CauchyNMF is the most robust method; and (3) we theoretically analyze the generalization ability of Truncated CauchyNMF, and provide performance guarantees for the proposed model. We evaluate Truncated CauchyNMF by image clustering on both simulated and real datasets. The experimental results on the datasets containing gross corruptions validate the effectiveness and robustness of Truncated CauchyNMF for learning the subspace.
The rest of this paper is organized as follows: Section 2 describes the proposed Truncated CauchyNMF, Section 3 develops the Nesterov-based half-quadratic (HQ) programming algorithm for solving Truncated CauchyNMF. Section 4 surveys the related works and Section 5 verifies Truncated CauchyNMF on simulated and real datasets. Section 6 concludes this paper. All the proofs are given in the supplementary material.
2 Truncated Cauchy Non-negative Matrix Factorization
Classical NMF  is not robust because its loss function is sensitive to outliers considering the errors of large magnitude dominate the loss function. Although some robust loss functions, such as for -NMF , Hypersurface cost , and Cauchy loss , are less sensitive to outliers, they introduces infinite energy for infinitely large noise in the extreme case. To remedy this problem, we propose a Truncated Cauchy loss by truncating the magnitudes of large errors to limit the effects of extreme outliers, i.e.,
is the scale parameter of the Cauchy distribution andis a constant.
To study the behavior of the Truncated Cauchy loss, we compare the loss functions , , , , and the loss function of the -norm, i.e., in Figure 1, because the -norm induces robust models. Figure 1(a) shows that when the error is moderately large, e.g., , shifts from to and corresponds to a fat-tailed distribution, and implies that the Truncated Cauchy loss can model moderate outliers well, while cannot because it makes the outliers dominate the objective function. When the error gets larger and larger, gets away from and behaves like , and keeps constant once the error exceeds a threshold, e.g., , and implies that the Truncated Cauchy loss can model extreme outliers, whereas neither nor cannot because they encourage infinite energy to infinitely large error. Intuitively, the Truncated Cauchy loss can model both moderate and extreme outliers well. Figure 1(b) plots the curves of both and with varying from to and accordingly varying from to . It shows that behaves more and more close to when approaches zero. By comparing the behaviors of loss functions, we believe that the Truncated Cauchy loss can induce robust NMF model.
Given high-dimensional samples arranged in a non-negative matrix , Truncated Cauchy non-negative matrix factorization (Truncated CauchyNMF) approximately decomposes into the product of two lower dimensional non-negative matrices, i.e., , where signifies the basis, signifies the coefficients, and signifies the error matrix which is measured by using the proposed Truncated Cauchy loss. The objective function of Truncated CauchyNMF can be written as
where is utilized for the convenience of derivation and is a truncation parameter, and is the scale parameter. We will next show that the truncation parameter can be implicitly determined by robust statistics and the scale parameter can be estimated by the Nagy algorithm . It is not hard to see that Truncated CauchyNMF includes CauchyNMF as a special case when . Since (2) assigns fixed energy to any large error whose magnitude exceeds , Truncated CauchyNMF can filter out any extreme outliers.
To illustrate the ability of Truncated CauchyNMF to model outliers, Figure 2 gives an illustrative example that demonstrates its application to corrupted face images. In this example, we select frontal face images of an individual in two sessions from the Purdue AR database  (see all face images in Figure 2(a)). In each session, there are frontal face images with different facial expressions, captured under different illumination conditions, with sunglasses, and with a scarf. Each image is cropped into a -dimensional pixel array and reshaped into a
-dimensional vector. The total number of face images compose a-dimensional non-negative matrix because the pixel values are non-negative. In this experiment, we aim at learning the intrinsically clean face images from the contaminated images. This task is quite challenging because more than half the images are contaminated. Since these images were taken in two sessions, we set the dimensionality low () to learn two basis images. Figure 2(b) shows that Truncated CauchyNMF robustly recovers all face images even when they are contaminated by a variety of facial expressions, illumination, and occlusion. Figure 2(c) presents the reconstruction errors and Figure 2(d) shows the basis images, which confirms that Truncated CauchyNMF is able to learn clean basis images with the outliers filtered out.
In the following subsections, we will analyze the generalization ability and robustness of Truncated CauchyNMF. Before that, we introduce Lemma 1 which states that the new representations generated by Truncated CauchyNMF are bounded if the input observations are bounded. This lemma will be utilized in the following analysis with the only assumption that each base is a unit vector. Such an assumption is typical in NMF because the bases
are usually normalized to limit the variance of its local minimizers. We useto represent the -norm and to represent the Euclidean norm.
Assuming , , and that the input observations are bounded, i.e., for some . Then the new representations are also bounded, i.e., .
Although Truncated CauchyNMF (2) has a differentiable objective function, solving it is difficult because the natural logarithmical function is nonlinear. Section 3 will present a half-quadratic (HQ) programming algorithm for solving Truncated CauchyNMF.
2.1 Generalization Ability
To analyze the generalization ability of Truncated CauchyNMF, we further assume that samples are independent and identically distributed and drawn from a space with a Borel measure . We use and to denote the -th column and the -th entry of a matrix , respectively, and is the -th entry of a vector .
For any , we define the reconstruction error of a sample as follows:
Therefore, the objective function of Truancated CauchyNMF (2) can be written as
Let us define the empirical reconstruction error of Truncated CauchyNMF as , and the expected reconstruction error of Truncated CauchyNMF as . Intuitively, we want to learn
However, since the distribution of is unknown, we cannot minimize directly. Instead, we use the empirical risk minimization (ERM, ) algorithm to learn to approximate , as follows:
We are interested in the difference between and . If the distance is small, we can say that is a good approximation of . Here, we measure the distance of their reduced expected reconstruction error as follows:
where . The right hand side is known as the generalization error. Note that since NMF is convex with respect to either or but not both, the minimizer is hard to obtain. In practice, a local minimizer is used as an approximation. Measuring the distance between the local minimizer and the global minimizer is also an interesting and challenging problem.
By analyzing the covering number  of the function class and Lemma 1, we derive a generalization error bound for Truncated CauchyNMF as follows:
Remark 1. Theorem 1 shows that under the setting of our proposed Truncated CauchyNMF, the expected reconstruction error will converge to with a fast rate of order , which means that when the sample size is large, the distance between and will be small. Moreover, if is large and a local minimizer (obtained by optimizing the non-convex objective of Truncated CauchyNMF) is close to the global minimizer , the local minimizer will also be close to the optimal .
Remark 2. Theorem 1 also implies that for any learned from (2), the corresponding empirical reconstruction error will converge to its expectation with a specific rate guarantee, which means our proposed Truncated CauchyNMF can generalize well to unseen data.
Note that the noise sampled from the Cauchy distribution should not be bounded because Cauchy distribution is heavy-tailed. And bounded observations always imply bounded noise. However, Theorem 1 keeps the boundedness assumption on the observations for two reasons: (1) the truncated loss function indicates that the observations corresponding to unbounded noise are discarded, and (2) in real applications, the energy of observations should be bounded, which means their -norms are bounded.
2.2 Robustness Analysis
We next compare the robustness of Truncated CauchyNMF with those of other NMF models by using a sample-weighted procedure interpretation . The sample-weighted procedure compares the robustness of different algorithms from the optimization viewpoint.
Let denote the objective function of any NMF problem and where . We can verify that the NMF problem is equivalent to finding a pair of such that 333When minimizing , the low rank matrices and will be updated alternately. Fixing one of them and optimizing the other implies that . In other words, if , neither nor can be a minimizer., where denotes the derivative of . Let be the contribution of the -th entry of the -th training example to the optimization procedure and be an error function. Note that we choose as the basis of contribution because we choose NMF, which aims to find a pair of such that and is sensitive to noise, as the baseline for comparing the robustness. Also note that represents the noise added to the -th entry of . The interpretation of the sample-weighted procedure explains the optimization procedure as being contribution-weighted with respect to the noise.
|NMF methods||Objective function||Derivative|
We compare of a family of NMF models in Table I. Note that since multiplying by a constant will not change its zero points, we can normalize the weights of different NMF models to unity when the noise is equal to zero. During the optimization procedures, robust algorithms should assign a small weight to an entry of the training set with large noise. Therefore, by comparing the derivative , we can easily make the following statements: (1) -NMF444For the soundness of defining the subgradient of -norm, we state that can be any value in . is more robust to noise and outliers than NMF; Huber-NMF combines the ideas of NMF and -NMF; (2) HCNMF, -NMF, RCNMF, and RNMF- work similarly to -NMF because their weights are of order with respect to the noise. It also becomes clear that HCNMF, -NMF, and RCNMF exploit some data structure information because the weights include the neighborhood information of and that RNMF- is less sensitive to noise because it employs a sparse matrix to adjust the weights; (3) The interpretation of the sample-weighted procedure also illustrates why CIM-NMF works well for heavy noise. This is because its weights decrease exponentially when the noise is large; And (4) for the proposed Truncated CauchyNMF, when the noise is larger than a threshold, its weights will drop directly to zero, which decrease far faster than that of CIM-NMF and thus Truncated CauchyNMF is very robust to extreme outliers. Finally, we conclude that Truncated CauchyNMF is more robust than any other NMF models with respect to extreme outliers because it has the power to provide smaller weights to examples.
3 Half-Quadratic Programming Algorithm for Truncated CauchyNMF
Note that Truncated CauchyNMF (2) cannot be solved directly because the energy function is non-quadratic. We present a half-quadratic (HQ) programming algorithm based on conjugate function theory . To adopt the HQ algorithm, we transform (2) to the following maximization form:
where is the core function utilized in HQ. Since the negative logarithmic function is convex, is also convex.
3.1 HQ-based Alternating Optimization
Generally speaking, the half-quadratic (HQ) programming algorithm  reformulates the non-quadratic loss function as an augmented loss function in an enlarged parameter space by introducing an additional auxiliary variable based on the convex conjugation theory . HQ is equivalent to the quasi-Newton method  and has been widely applied in non-quadratic optimization.
Note that the function is continuous, and according to , its conjugate is defined as
The core function and its conjugate satisfy
and the maximizer is .
By substituting into (9), we have the augmented loss function
where the equality comes from the separability of the optimization problems with respect to .
Although the objective function in (8) is non-quadratic, its equivalent problem (3.1) is essentially a quadratic optimization. In this paper, HQ solves (3.1) based on the block coordinate descent framework. In particular, HQ recursively optimizes the following three problems. At -th iteration,
Since (13) and (14) are symmetric and intrinsically weighted non-negative least squares (WNLS) problems, they can be optimized in the same way using the Nesterov method . Taking (13) as an example, the procedure of its Nesterov based optimization is summarized in Algorithm 1, and its derivative is derived in the supplementary material. Considering that (13) is a constrained optimization problem, similar to , we use the following projected gradient-based criterion to check the stationarity of the search point , i.e., , where . Since the above stopping criterion will make OGM run unnecessarily long, similar to , we use a relaxed version
where is a tolerance that controls how far the search point is from a stationary point.
The complete procedure of the HQ algorithm is summarized in Algorithm 2. The weights of entries and factor matrices are updated recursively until the objective function does not change. We use the following stopping criterion to check the convergence in Algorithm 2:
where signifies the tolerance, signifies the objective function of (8) and signifies a local minimizer555Since any local minimal is unknown beforehand, we instead utilize in our experiments.. The stopping criterion (16) implies that HQ stops when the search point is sufficiently close to the minimizer and sufficiently far from the initial point. Line updates the scale parameter by the Nagy algorithm and will be further presented in Section 3.2. Line detects outliers by robust statistics and will be presented in Section 3.3.
The main time cost of Algorithm 2 is incurred on lines , , , , , , and . The time complexities of lines and are both . According to Algorithm 1, the time complexities of lines and are and , respectively. Since line introduces a median operator, its time complexity is . In summary, the total complexity of Algorithm 2 is .
3.2 Scale Estimation
The parameter estimation problem for Cauchy distribution has been studied for several decades . Nagy  proposed an I-divergence based method, termed the Nagy algorithm for short, to simultaneously estimate location and scale parameters. The Nagy algorithm minimizes the discrimination information666 The discrimination information of random variable
The discrimination information of random variablegiven random variable is defined as , where and are the PDFs of and , and is the distribution function of . between the empirical distribution of the data points and the prior Cauchy distribution with respect to the parameters. In our Truncated CauchyNMF model (2), the location parameter of the Cauchy distribution is assumed to be zero, and thus we only need to estimate the scale-parameter .
Here we employ the Nagy algorithm to estimate the scale-parameter based on all the residual errors of the data. According to , supposing there exist a large number of residual errors, the scale-parameter estimation problem can be formulated as
where denotes the discrimination information, and the first equality is due to the independence of andof Cauchy distribution777The probability density function (PDF) of Cauchy distribution is , where is the location parameter, specifying the location of the peak of the distribution, and is the scale parameter, specifying the half-width at half-maximum. into (17) and replacing with , we can rewrite (17) as follows: . To solve this problem, Nagy  proposed an efficient iterative algorithm, i.e.,
3.3 Outlier Rejection
Looking more carefully at (12), (13) and (14), HQ intrinsically assigns a weight for each entry of with both factor matrices and fixed, i.e., , where denotes the error matrix at the -th iteration. The larger the magnitude of error for a particular entry, the lighter the weight is assigned to it by HQ. Intuitively, the corrupted entry contributes less in learning the intrinsic subspace. If the magnitude of error exceeds a threshold , Truncated CauchyNMF assigns zero weights to the corrupted entries to inhibit their contribution to the learned subspace. That is how Truncated CauchyNMF filters out extreme outliers.
However, it is non-trivial to estimate the threshold . Here, we introduce a robust statistics-based method to explicitly detect the support of the outliers instead of estimating the threshold to detect outliers. Since the energy function of Truncated CauchyNMF gets close to that of NMF as the error tends towards zero, i.e.,
. Truncated CauchyNMF encourages the small magnitude errors to have a Gaussian distribution. Letdenote the set of magnitudes of error at the -th iteration of HQ, i.e., where . It is reasonable to believe that a subset of , i.e., , obeys a Gaussian distribution, where signifies the median of . Since , it suffices to estimate both the mean from . According to the three-sigma-rule, we detect the outliers as and output their indices .
To illustrate the effect of outlier rejection, Figure 3 presents a sequence of weighting matrices generated by HQ for the motivating example described in Figure 2. It shows that HQ correctly assigns zero weights for the corrupted entries in only a few iterations and finally detects almost all outliers including illumination, sunglasses, and scarves (see the last column in Figure 3) in the end.
4 Related Work
Before evaluating the effectiveness and robustness of Truncated CauchyNMF, we briefly review the state-of-the-art of non-negative matrix factorization (NMF) and its robustified variants. We have thoroughly compared the robustness between the proposed Truncated CauchyNMF and all the listed related works.
Traditional NMF  assumes that noise obeys a Gaussian distribution and derives the following squared -norm based objective function: , where signifies the matrix Frobenius norm. It is commonly known that NMF can be solved by using the multiplicative update rule (MUR, ). Because of the nice mathematical property of squared -norm and the efficiency of MUR, NMF has been extended for various applications . However, NMF and its extensions are non-robust because the -norm is sensitive to outliers.
4.2 Hypersurface Cost Based NMF
Hamza and Brady  proposed a hypersurface cost based NMF (HCNMF) by minimizing the summation of hypersurface costs of errors, i.e., , where is the hypersurface cost function. According to , the hypersurface cost function has differentiable and bounded influence function. Since the hypersurface cost function is differentiable, HCNMF can be directly solved by using the projected gradient method. However, the optimization of HCNMF is difficult because Armijo’s rule based line search is time consuming .
4.3 -Norm Based NMF
To improve the robustness of NMF, Lam  assumed that noise is independent and identically distributed from Laplace distribution and proposed -NMF as follows: , where and signifies the absolute value function. Since the -norm based loss function is non-smooth, the optimization algorithm in  is not scalable on large-scale datasets. Manhattan NMF (MahNMF, ) remidies this problem by approximating the loss function of -NMF with a smooth function and minimizing the approximated loss function using Nesterov’s method. Although -NMF is less sensitive to outliers than NMF, it is not sufficiently robust because its breakdown point is related to the dimensionality of data .
4.4 -Norm Regularized Robust NMF
Zhang et al.  assumed that the dataset contains both Laplace distributed noise and Gaussian distributed noise and proposed an -norm regularized Robust NMF (RNMF-) as follows: , where is a positive constant that trades off the sparsity of . Similar to -NMF, RNMF- is also less sensitive to outliers than NMF, but they are both non-robust to large numbers of outliers because the -minimization model has a low breakdown point. Moreover, it is non-trivial to determine the tradeoff parameter .
4.5 -Norm Based NMF
Since NMF is substantially a summation of the squared -norm of the errors, the large magnitude errors dominate the objective function and cause NMF to be non-robust. To solve this problem, Kong et al.  proposed the -norm based NMF (-NMF) which minimizes the -norm of the error matrix, i.e., , where the -norm is defined as . In contrast to NMF, -NMF is more robust because the influences of noisy examples are inhibited in learning the subspace.
4.6 Robust Capped Norm NMF
Gao et al.  proposed a robust capped norm NMF (RCNMF) to completely filter out the effect of outliers by instead minimizing the following objective function: , where is a threshold that chooses the outlier samples. RCNMF cannot be applied in practical applications because it is non-trivial to determine the pre-defined threshold, and the utilized iterative algorithms in both  and  converge slowly with the successive use of the power method .
4.7 Correntropy Induced Metric Based NMF
The most closely-related work is the half-quadratic algorithm for optimizing robust NMF, which includes the Correntropy-Induced Metric (CIM)-based NMF (CIM-NMF) and Huber-NMF by Du et al. . CIM-NMF measures the approximation errors by using CIM , i.e., , where . Since the energy function increases slowly as the error increases, CIM-NMF is insensitive to outliers. In a similar way, Huber-NMF  measures the approximation errors by using the Huber function, i.e., , where and the cutoff is automatically determined by .
Truncated CauchyNMF is different from both CIM-NMF and Huber-NMF in four aspects: (1) Truncated CauchyNMF is derived from the proposed Truncated Cauchy loss which can model both modearte and extreme outliers, whereas neither CIM-NMF or Huber-NMF can do that; (2) Truncated CauchyNMF demonstrates strong evidence of both robustness and generalization ability, whereas neither CIM-NMF nor Huber-NMF demonstrates evidence of neither; (3) Truncated CauchyNMF iteratively detects outliers by the robust statistics on the magnitude of errors, and thus performs more robustly than CIM-NMF and Huber-NMF in practice; And (4) Truncated CauchyNMF obtains the optima for each factor in each iteration round by solving the weighted non-negative least squares (WNLS) problems, whereas the multiplicative update rules for CIM-NMF and Huber-NMF do not.
5 Experimental Verification
We explore both the robustness and the effectiveness of Truncated CauchyNMF on two popular face image datasets, ORL  and AR , and one object image dataset, i.e., Caltech , by comparing with six typical NMF models: (1) -NMF  optimized by NeNMF ; (2) -NMF  optimized by MahNMF ; (3) RNMF- ; (4) -NMF ; (5) CIM-NMF ; and (6) Huber-NMF . We first present a toy example to intuitively show the robustness of Truncated CauchyNMF and several clustering experiments on the contaminated ORL dataset to confirm its robustness. We then analyze the effectiveness of Truncated CauchyNMF by clustering and recognizing face images in the AR dataset, and clustering object images in the Caltech dataset.