Matrix factorization is a powerful way for data representation and has been widely used for many problems in machine learning, data mining, computer vision, and statistical data analysis. Among various factorization algorithms, some have seen widespread successes, such as singular value decomposition (SVD)[duda2012pattern]
, and principal component analysis (PCA)[jolliffe2002principal].
Recently, a number of relatively new factorization algorithms have been developed to provide improved solutions to some special problems in machine learning [lee1999learning, peng2016fast]. In particular, nonnegative matrix factorization (NMF) [lee1999learning, lee2001algorithms] has drawn considerable attention. NMF represents nonnegative data with nonnegative basis and coefficients, which naturally leads to parts-based representations [lee1999learning]
. It has been used in many real world applications, such as pattern recognition[li2001learning], multimedia analysis [cooper2002summarizing], and text mining [xu2003document]
. Recent studies have revealed interesting relationships between NMF and several other methods. For example, spectral clustering (SC)[ng2002spectral]
is shown to be equivalent to a weighted version of kernel K-means[dhillon2007weighted] and both of them are particular cases of clustering with NMF under a doubly stochastic constraint [zass2005unifying]
; the Kullback-Leibler divergence-based NMF turns out to be equivalent to the probabilistic latent semantic analysis[ding2006nonnegative, hofmann1999probabilistic], which has been further developed into the fully probabilistic latent Dirichlet allocation model [blei2003latent].
Semi-NMF extends the repertoire of NMF by removing the non-negativity constraints on the data and basis, which expands the range of applications of NMF. It also strengthens the connections between NMF and K-means [ding2010convex]. It is noted that K-means can be written as a matrix factorization, where the two factor matrices represent the centroids and cluster indicators. Particularly, the centroids can be general whereas the cluster indicators are all nonnegative, which shows the connection between K-means and Semi-NMF. To exploit nonlinear structures of the data, graph-regularized NMF (GNMF) [cai2011graph] and robust manifold NMF (RMNMF) [huang2014robust] incorporate the graph Laplacian to measure nonlinear relationships of the data on manifold. In particular, GNMF including Frobenius-norm and divergence-based formulations, which require the basis and coefficient matrices to be nonnegative; RMNMF removes the constraints on the basis matrix and can be regarded as a variant of Semi-NMF by incorporating a structured sparsity-inducing norm to enhance its robustness.
These methods have been used on 2-dimensional (2D) data such as images, where 2D data are vectorized for further data processing in a preprocessing step. While the vectorization-based Semi-NMF methodology has been growingly useful, it fails to fully exploit the inherent 2D structures and correlations in the 2D data after vectorizing the 2D data. Furthermore, there is empirical evidence that building a model with vectorized high-dimensional features is not effective to filter the noisy or redundant information in the original feature spaces [fu2016tensor]
. Besides the way of vectorizing 2D data, tensor based approaches have been proposed. While they may potentially better exploit spatial structures of the 2D data[zhang2015low], such approaches still have some limitations: They use all features of the data, hence noisy or redundant features may degrade the learning performance. Also, tensor computation and methods usually involve flattening and folding operations, which, more or less, have issues similar to those of vectorization operation and thus might not fully exploit the true structures of the data. Moreover, tensor methods usually suffer from the following major issues: 1) for candecomp/parafac (CP) decomposition based methods, it is generally NP-hard to compute the CP rank [lu2016tensor, kolda2009tensor]; 2) Tucker decomposition is not unique [kolda2009tensor]; 3) the application of a core tensor and a high-order tensor product would incur information loss of spatial details [letexier2008noise].
To address these limitations, in this paper, we propose a new Semi-NMF-like method for 2D data, where we directly use the original 2D data to help preserve their 2D spatial structures instead of vectorizing them. It is noted that recently there are tensor approaches to retain spatial information for 2D data [cao2013robust, huang2008simultaneous]. However, tensors are usually reduced to matrices for processing. For example, [zhang2015low] organizes different views of the data by a tensor structure, however in each view each sample is still vectorized and the image spatial information is still damaged. In this paper, we directly use 2D inputs whose inherent structure information is emphasized by two projection matrices, which makes our method starkly different from tensor approach. Specifically, we seek optimal projection matrices and building new representations of the data jointly, aiming at enhancing clustering. These projections matrices are optimal in the sense that they project 2D data to the most expressive subspace. Moreover, manifold is taken into consideration to capture nonlinear structures of the data. In our formulation, the manifold is adaptively updated with projection matrices capturing representative information from 2D data, and thus it is less afflicted with noise and outliers. Therefore, this paper seeks optimal projection directions, factors data for new representations, and learns intrinsic manifold structures in a single, seamlessly integrated framework, such that these tasks mutually enhance and lead to improved clustering as well as powerful representations of 2D data. It is noted that, as a special case, our method is applicable to 1-dimensional data. The main contributions of this paper are summarized as follows:
The optimal 2D data projections and an image subspace are sought for learning new representations of the 2D data and clustering 2D matrices.
The proposed method is able to retain intrinsic spatial information of 2D data, and alleviate the adverse effect of irrelevant or less important information.
Manifold learning is integrated to enhance the capability of exploiting nonlinear structures of the data. The manifold is adaptively updated according to the 2D projections that capture the most expressive information from the data, and the graph is less afflicted with irrelevant or grossly corrupted features.
The proposed model enables 2D feature extraction, adaptive manifold learning, and matrix factorization jointly, thus offering a powerful data representation ability.
An efficient optimization algorithm is developed with provable mathematical analysis; extensive experimental results verify the effectiveness of the proposed model and algorithm.
Ii Related Work
Given data with being the dimension of the data and being the number of samples, the objective of Semi-NMF is
where contains basis in columns and are the new representations of the data in rows.
Ii-B Graph Laplacian
Graph Laplacian [chung1997spectral] is widely used to incorporate the intrinsic geometrical structure of the data on manifold. In particular, the manifold enforces the smoothness of the data in linear and nonlinear spaces by minimizing
where is the trace operator, is the weight matrix that measures the pair-wise similarities of original data points, is a diagonal matrix with , and . It is seen that by minimizing creftype 2, we can have a natural effect that if two data points are close in the intrinsic geometry of the data distribution, then their new representations with respect to the new basis, and , are also close [cai2011graph].
Iii Proposed Method
For 2D data X, creftype 1 naturally leads to a formulation as follows:
where is a set of 2D centroids. It is seen that all elements or features of 2D matrices are used to construct the new representations of the data and the expressiveness of 2D spatial information is not explicitly considered in creftype 4. To alleviate this drawback, we propose to better exploit 2D spatial information by building the new representation with respect to 2D centroids in a projected subspace with the most expressive spatial information:
It is noted that in creftype 5, projects the th centroid to a subspace of rank with the most expressive information, so that the sum of squared reconstruction errors of 2D matrices from the new basis and new representations can be minimized. As a result, the new representation and the new basis are sought jointly in the projected, most expressive, low-rank subspace to take the advantage of 2D spatial information. Let , then
where is the trace operator. The second equation is true because . The third equation is true because it can easily verify that . It is seen that, in the new formulation, the new representation is sought with the projected data ’s in the first term, while in the second term the projection ensures that the most expressive information of the data is retained in the subspace given by . With , it is straightforward that creftype 5 can be written as
It is noted that the first term in creftype 7 is essentially equivalent to the first term in last equation of creftype 5, but creftype 7 keeps the physical meanings of . With simple algebra, the second term in creftype 7 can be written as . We omit the constant term and introduce a balancing parameter to balance the two terms of creftype 7 to make it more versatile, which gives raise to
where we use the notation of . When , creftype 8 falls back to creftype 7. It is seen that, by minimizing creftype 8, is sought so that the data points are projected to the most expressive subspace, aiming at building new, expressive data representations for clustering. Because clustering is performed with projected data, the adverse affects of noise, occlusions or corruptions can be alleviated. Consequently, creftype 8 is inherently robust, even though we do not explicitly enforce robustness or use sparsity-inducing norms to measure reconstruction errors.
creftype 8 only considers the linear structures of the projected data while overlooking nonlinear ones which usually exist and are important in real world applications. To address this issue, we enforce the smoothness between linear and nonlinear structures on manifold with the following formulation:
where is a balancing parameter. Here, for the ease of notation, we define , , and define the operator to convert a set of 2D inputs, M, to a matrix containing each vectorized 2D input as a column for ease of notation. Different from creftype 2, we construct the one-to-one similarity matrix using instead of , such that the graph Laplacian is adaptively learned with the most expressive features. Correspondingly, and are constructed based on in a way similar to the construction of and based on as in creftype 2. Note that the above defined operators starkly differ from straight vectorization because spatial information has been retained and these operators only provide a simple way for notation without damaging information. It is seen that the tasks of seeking projections, recovering new data representations, and manifold learning mutually enhance each other and lead to a powerful data representation.
To further enhance the capability of capturing 2D spatial information, we develop the following Two-dimensional Semi-NMF (TS-NMF):
where contains projection directions to project X on left. Here, we define , , and is constructed in a similar way to where ’s are used instead of ’s, and is constructed in a similar way to where ’s are used instead of ’s. It is noted that creftype 10 is not convex. For any solution , is also a solution with the same objective value of creftype 10 with being a positive diagonal matrix. Furthermore, the objective value of creftype 10 can be reduced if increases. To eliminate this uncertainty, in practice one usually requires the Euclidean length of to be 1 [xu2003document, cai2011graph] in a post processing step. In this paper, we also adopt this strategy.
In this section, we will develop an efficient optimization algorithm to solve creftype 10. In the following, we will present the alternating optimization steps for each variable in detail.
The subproblem for -minimization is111Inspired by [wang2014feature], is not included in the -minimization problem due the difficulty of writing it as a function of explicitly. Instead, is fixed when solving and will be updated accordingly after is updated. Similar strategy is used for -minimization.:
With straightforward algebra, creftype 11 can be rewritten as
where . Let , it is easy to see that is positive definite, hence, according to [yang2004two], can be obtained by
returns the eigenvectors of the input matrix corresponding to its smallesteigenvalues.
For convenience of theoretical analysis, we define
and separate a matrix into two parts by
Then the -minimization can be written as
Then, is updated by:
The proof of Theorem IV.1 is provided in the Appendix. It is noted that creftype 20 provides an iterative way to solve creftype 18, which requires an inner loop for optimization. However, in a way similar to NMF [lee1999learning], GNMF [cai2011graph], and Semi-NMF [ding2010convex], we do not require an exact solution to the subproblem creftype 18. Instead, creftype 20 is performed once to solve creftype 18. Similar idea is also found in [lin2010augmented], where exact solutions are not required for intermediate updating.
Iv-D Optimizing U
The subproblem associated with U-minimization is
We investigate the two terms separately. The first term is minimized when it satisfies
which is equivalent to the following condition
It is seen that there are infinitely many choices for to meet the above condition, e.g., any such that is in the null space of . Here, we use the simplest way to meet this requirement by requiring
Similarly, we see that the second term in creftype 21 can be simultaneously minimized by creftype 24. Therefore, we adopt creftype 24 to update U. Here, it is noted that is usually invertible and computationally tractable due to its small size. Otherwise, pseudo-inverse is used as in [ding2010convex].
Finally, we adjust U and as follows, such that does not change:
Then standard K-means is applied to V to obtain cluster indicators. We summarize the overall procedure in Algorithm 1.
Iv-E Complexity Analysis
Because multiplications dominate the complexity, we only count multiplications. Given that , , , , let be the total number of iterations for Algorithm 1, then the total complexity of Algorithm 1 is . It is similar to GNMF and RMNMF. The complexity mainly comes from the updating of graph Laplacian matrices with complexity per iteration. Fortunately, it can be easily parallel for this step per iteration, and thus it is not a bottleneck for real world applications.
To demonstrate the effectiveness of TS-NMF, in this section, we present the comprehensive experimental results in comparison with several state-of-the-art algorithms. The performances are measured based on three evaluation metrics including clustering accuracy (ACC), normalized mutual information (NMI), and purity, whose details can be found in[huang2014robust, peng2017nonnegative]. In the following, we will briefly introduce the benchmark data sets, the baseline methods in comparison, and present the experimental results in detail. For purpose of reproducibility, we provide the data and codes at xxxx.
V-a Benchmark Data Sets
We use seven data sets in the experiment, which are briefly described as follows: 1) Yale [belhumeur1997eigenfaces]. It contains 165 gray scale images of 15 persons with 11 images of size 3232 per person. 2) Extended Yale B (EYaleB) [georghiades2001few]. This data set has 38 persons and around 64 face images under different illuminations per each person. The images were cropped to 192168 and were resized to 3232 in our experiments. 3) ORL [samaria1994parameterisation]. This data has 40 individuals and 10 images were taken at different times, with varying facial expressions, facial details, and lighting conditions per each individual. Each image has 3232 pixels. 4) JAFFE [lyons1998japanese]. 10 Japanese female models posed 7 facial expressions and 213 images were collected. Each image has been rated on 6 motion adjectives by 60 Japanese subjects. 5) PIX [hond1997distinctive]. 100 gray scale images of pixels from 10 objects were collected. 6) Semeion. 1,593 handwritten digits written by around 80 persons were collected. These images were scanned, stretched into size 1616.
|N||Normalized Mutual Information (%)|
|N||Normalized Mutual Information (%)|
|N||Normalized Mutual Information (%)|
|N||Normalized Mutual Information (%)|