I Introduction
High-dimensional data are ubiquitous in the learning community, and learning from such data has become increasingly challenging [14]. For example, information retrieval, one of the most important tasks in areas such as multimedia and data mining, has drawn considerable attention in recent years [47, 18, 46], and it often requires handling high-dimensional data. It is often desirable to seek a data representation that reveals the latent structures of high-dimensional data, which is usually helpful for further data processing. It is thus a critical problem to find a suitable representation of the data [4, 20, 22, 37] in many learning tasks, such as single image super-resolution [48], image reconstruction [32], image clustering [34], foreground-background separation in surveillance video [5], matrix completion [28], etc. To this end, a number of methods for finding proper representations have been developed, among which matrix factorization techniques have been widely used to handle high-dimensional data. Matrix factorization seeks two or more low-dimensional matrices to approximate the original data such that the high-dimensional data can be represented with reduced dimensions [23, 35].
For some types of data, such as images and documents that are widely used in real-world learning problems, the entries are naturally nonnegative. For such data, nonnegative matrix factorization (NMF) was proposed to seek two nonnegative factor matrices for approximation. In fact, seeking a nonnegative factorization of nonnegative data naturally leads to learning parts-based representations of the data [20]. Parts-based representation is believed to commonly exist in the human brain, with psychological and physiological evidence [33, 39, 25]. It overcomes the drawback of latent semantic indexing (LSI) [9]
, for which the interpretation of basis vectors is difficult due to mixed signs. When the number of basis vectors is large, NMF has been proven to be NP-hard [38]; moreover, [1] has recently given some conditions under which NMF is solvable. Recent studies have shown a close relationship between NMF and K-means [11], and further study has shown that both spectral clustering and kernel K-means [10] are particular cases of clustering with NMF under a doubly stochastic constraint [44]. This implies that NMF is especially suitable for clustering such data. In this paper, we will develop a novel NMF method, which focuses on the clustering capability.
Many variants of NMF have been developed in the past decades, which can be mainly categorized into four types: basic NMF [20], constrained NMF [12], structured NMF [43], and generalized NMF [2]. A fairly comprehensive review can be found in [41]. Among these methods, Semi-NMF [13] removes the nonnegativity constraint on the data and basis vectors, such that its applications can be expanded to more fields; convex NMF (CNMF) [13] restricts the basis vectors to lie in the feature space of the input data so that they can be represented as convex combinations of data vectors; orthogonal NMF (ONMF) [12] imposes orthogonality constraints on factor matrices, which leads to a clustering interpretation. The classic NMF only considers the linear structures of the data by finding new data points with respect to the new basis and ignores the nonlinear structures of the data, which are usually important for many applications such as clustering. To learn the latent nonlinear structures of the data, graph regularized nonnegative matrix factorization (GNMF) considers the intrinsic geometrical structures of the data on a manifold by incorporating a Laplacian regularization [3]. By modeling the data space as a manifold embedded in an ambient space and performing NMF on this manifold, GNMF considers both linear and nonlinear relationships of the data points in the original instance space, and thus it is also more discriminating than ordinary NMF, which only considers the Euclidean structure of the data [3]. This renders GNMF more suitable for clustering than the original NMF. Based on GNMF, robust manifold nonnegative matrix factorization (RMNMF) constructs a robust formulation based on a structured sparsity-inducing norm [17].
With an $\ell_{2,1}$ norm, RMNMF is insensitive to the between-sample data outliers and improves the robustness of NMF [17]. Moreover, the relaxed requirement on signs of the data makes it a nonlinear version of Semi-NMF.
In recent years, the importance of preserving local manifold structure has drawn considerable attention in the research communities of machine learning, data mining, and pattern recognition
[45, 29, 24, 7]. It has been shown that, besides pairwise sample similarity, the local geometric structure of the data is also crucial in revealing the underlying structure of the data [24]: 1) In the transformed low-dimensional space, it is important to maintain the intrinsic information of high-dimensional data [40]; 2) It may be insufficient to represent the underlying structures of the data with a single characterization, and both global and local ones are necessary [6]; 3) In some ways, we can regard the local geometric structure of the data as data-dependent regularization, which helps avoid overfitting issues [24]. Despite its importance, the local structure of data has yet to be exploited in NMF studies. In this paper, we propose a new type of NMF method, which simultaneously learns the similarity and the geometric/clustering structures of the data, such that the learned basis and coefficients well preserve discriminative information of the data. Recent studies reveal that high-dimensional data often reside in a union of low-dimensional subspaces and the data can be self-expressed by a low-dimensional representation [23, 15], which can be regarded as pairwise similarity of samples. Instead of simply using pairwise similarity of samples, in our method, we transform the pairwise similarity into the similarity between the score vector of a sample on the basis and the representation of another sample in the same cluster, which integrates basis and coefficient learning into simultaneous similarity learning and clustering. A nonlinear model is developed to measure both local and global nonlinear relationships of the data.
The main contributions of this paper are as follows:

For the first time, in an effective yet simple way, local similarity learning is embedded into learning matrix factorization, which allows our method to learn global and local structures of the data. The learned basis and representations well preserve the inherent structures of the data and are more representative;

To the best of our knowledge, we are the first to integrate the orthogonality-constrained coefficient matrix into local similarity adaptation, such that local similarity and clustering can mutually enhance each other and be learned simultaneously;

A nonlinear extension is developed from the kernel perspective, which can be further expanded to cope with the multiple-kernel scenario;

Efficient multiplicative update rules are constructed to solve the proposed model and comprehensive theoretical analysis is provided to guarantee the convergence;

Lastly, extensive experimental results have verified the effectiveness of our method.
The rest of this paper is organized as follows: In Section II, we briefly review some methods that are closely related to our research. Then we introduce our method in Section III. For the proposed method, we provide an efficient alternating optimization procedure in Section IV, and then provide theoretical results for the convergence analysis in Section V. Next, we conduct comprehensive experiments and show the results in Section VI. Finally, we conclude the paper in Section VII.
Notation: For a matrix , , , and denote the th element, th column, and th row of . is the trace operator, and are the Frobenius and norms.
denotes the identity matrix of size
, is an operator that returns a diagonal matrix with identical diagonal elements to the input matrix.
II Related Work
In this section, we briefly review some methods that are closely related to our research.
II-A NMF
Given nonnegative data with being the dimension and sample size, NMF is to factor into (basis) and (coefficients) with the following optimization problem:
(1) 
where enforces a low-rank approximation of the original data.
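As a minimal sketch of how a Frobenius-norm NMF problem of this kind is commonly solved, the following Python code applies the classical Lee-Seung multiplicative updates. The names `U`, `V`, and `k`, the convention that the data are approximated by `U @ V.T`, the small constant `eps`, and the synthetic data are assumptions made for illustration; they are not taken from this paper.

```python
import numpy as np

def nmf_multiplicative(X, k, n_iter=200, eps=1e-10, seed=0):
    """Basic NMF, X ~= U @ V.T, via Lee-Seung multiplicative updates.

    X: (p, n) nonnegative data; k: assumed number of basis vectors.
    Returns U (p, k), V (n, k), and the objective value after each iteration.
    """
    rng = np.random.default_rng(seed)
    p, n = X.shape
    U = rng.random((p, k)) + eps   # strictly positive initialization
    V = rng.random((n, k)) + eps
    history = []
    for _ in range(n_iter):
        # U <- U * (X V) / (U V^T V); eps guards against division by zero
        U *= (X @ V) / (U @ (V.T @ V) + eps)
        # V <- V * (X^T U) / (V U^T U)
        V *= (X.T @ U) / (V @ (U.T @ U) + eps)
        history.append(np.linalg.norm(X - U @ V.T, "fro") ** 2)
    return U, V, history

# Hypothetical nonnegative data of rank at most 5.
rng = np.random.default_rng(1)
X = rng.random((20, 5)) @ rng.random((5, 30))
U, V, history = nmf_multiplicative(X, k=5)
```

Because each update only multiplies by nonnegative ratios, nonnegativity of both factors is preserved without any explicit projection step.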
II-B Graph Laplacian
Graph Laplacian [8] is defined as
(2)  
where is the weight matrix that measures the pairwise similarities of original data points, is a diagonal matrix with , and . It is widely used to incorporate the geometrical structure of the data on a manifold. In particular, the manifold enforces the smoothness of the data in linear and nonlinear spaces by minimizing (2), which has the effect that if two data points are close in the intrinsic geometry of the data distribution, then their new representations with respect to the new basis, and , are also close [3]. This is closely related to spectral clustering (SC) [36, 27] and its further developments [31, 30].
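The smoothness effect described above can be checked numerically. The sketch below builds the unnormalized Laplacian from a symmetric weight matrix and compares the trace form with the weighted sum of pairwise distances between representations. The heat-kernel weights, the random points, and the names `W`, `V`, and `L` are illustrative assumptions rather than the construction used in this paper.

```python
import numpy as np

def graph_laplacian(W):
    """Unnormalized graph Laplacian L = D - W for a symmetric weight matrix W."""
    D = np.diag(W.sum(axis=1))   # degree matrix: row sums of W on the diagonal
    return D - W

def smoothness(V, W):
    """0.5 * sum_ij W_ij * ||v_i - v_j||^2, with v_i the i-th row of V."""
    diff = V[:, None, :] - V[None, :, :]
    return 0.5 * np.sum(W * np.sum(diff ** 2, axis=2))

rng = np.random.default_rng(0)
points = rng.random((8, 3))                      # hypothetical data points
sq_dist = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=2)
W = np.exp(-sq_dist)                             # heat-kernel weights (assumed)
np.fill_diagonal(W, 0.0)
V = rng.random((8, 2))                           # hypothetical representations
L = graph_laplacian(W)
trace_form = np.trace(V.T @ L @ V)
pairwise_form = smoothness(V, W)
```

The two quantities agree up to floating-point error, which is why minimizing the trace form pulls the representations of strongly connected points together.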
III Proposed Method
As mentioned above, existing NMF methods do not fully exploit local geometric structures, nor do they exploit the close interaction between local similarity and clustering. In this section, we propose an effective yet simple new method to overcome these two drawbacks.
CNMF restricts the basis of NMF to convex combinations of the columns of the data, i.e., , which gives rise to the following:
(3) 
By restricting , (3) has the advantage that it could interpret the columns of as weighted sums of certain data points and these columns correspond to centroids [13]. It is natural to see that reveals the importance of basis to by .
It is noted that (3) is closely related to subspace clustering [23, 15]. The observation is that high-dimensional data usually reside in low-dimensional subspaces, and recovering such subspaces usually needs a self-expressiveness assumption, which states that the data can be approximately self-expressed as with a representation matrix . Local structures of the data are shown to be important [29], and it is necessary to take local similarity into consideration in learning tasks. A natural assumption is that if two data points and are close to each other, then their similarity, , should be large; otherwise, it should be small. This assumption leads to the following minimization:
(4) 
where
or in matrix form,
with being a length vector of 1s. It is noted that the minimization of (4) directly enforces to reflect the pairwise similarity information of the examples. Noticing that and are nonnegative and inspired by the self-expressiveness assumption, we take as the similarity matrix , such that . Here, is the score vector of example on the basis vectors, and is the coefficient vector of the th sample with respect to the new basis. If and are close on the data manifold or grouped into the same cluster, then it is natural that and have higher similarity, and vice versa. This close relationship between the geometry of and on the data manifold and the similarity of and suggests that using as in (4) is indeed meaningful. To encourage the interaction between similarity learning and clustering, we incorporate (4) into (3) with , obtaining the Local Similarity NMF (LSNMF):
(5)  
where is a balancing parameter. Now, it is seen that the first term in the above model captures the global structure of the data by exploiting the linear representation of each example with respect to the overall data, while the second term exploits the local structure of the data through the connection between local geometric structure and pairwise similarity.
To allow for immediate interpretation of clustering from the coefficient matrix, we impose an orthogonality constraint on , i.e., , leading to
(6)  
Note that by enforcing , the problem of NMF is directly connected with clustering in that can be regarded as relaxed cluster indicators. More importantly, learning similarity and clustering are connected through such a matrix and can be mutually promoted through an iterative optimization process. At the end of the iteration, the optimized clustering results are directly given by .
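To illustrate why a nonnegative matrix with orthonormal columns can be read as a relaxed cluster indicator, the sketch below builds such a matrix from known labels and recovers the labels from it. The name `V`, the scaling by the square root of the cluster size, and the row-wise arg-max rule are assumptions for illustration; the exact read-out rule used in this paper is not reproduced here.

```python
import numpy as np

def indicator_from_labels(labels, k):
    """Scaled cluster indicator: nonnegative, with orthonormal columns.

    Assumes every one of the k clusters contains at least one sample.
    """
    n = len(labels)
    V = np.zeros((n, k))
    V[np.arange(n), labels] = 1.0
    # Column c is divided by sqrt(size of cluster c), giving unit column norms.
    return V / np.sqrt(V.sum(axis=0))

def labels_from_coefficients(V):
    """Assign each sample (row) to the column holding its largest coefficient."""
    return np.argmax(V, axis=1)

labels = np.array([0, 2, 1, 0, 1, 2, 2])   # hypothetical ground-truth labels
V = indicator_from_labels(labels, k=3)
gram = V.T @ V                              # identity for an exact indicator
recovered = labels_from_coefficients(V)
```

Nonnegativity together with orthogonality forces the columns to have disjoint supports, so each row has at most one nonzero entry and the arg-max is unambiguous in the exact case.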
Model (6) only learns linear relationships of the data and omits the nonlinear ones, which usually exist and are important. To take nonlinear relationships of the data into consideration, a common approach is to seek data relationships in a kernel space.
We define a kernel mapping as , which maps the data points from the input space to in a reproducing kernel Hilbert space , where is an arbitrary positive integer. After kernel mapping, we obtain the mapped data points . The similarity between each pair of data points is defined as the inner product of mapped data in the Hilbert space, i.e., , where is a reproducing kernel function. In the kernel space, (6) is reduced to
(7)  
where is extended from the instance space in (6) to the kernel space, defined as
(8)  
We expand (7) and replace with , the kernel matrix induced by kernel function associated with the mapping , giving rise to the Kernel LSNMF (KLSNMF):
(9)  
where .
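Squared distances between mapped points never require the mapping itself, since the squared feature-space distance between samples i and j equals K_ii + K_jj - 2 K_ij for a kernel matrix K. The sketch below computes this quantity from a kernel matrix. The Gaussian kernel, the parameter `gamma`, the features-by-samples layout of `X`, and all function names are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gaussian kernel matrix for the columns of X (features x samples)."""
    sq_norms = np.sum(X ** 2, axis=0)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X.T @ X)
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

def kernel_sq_distances(K):
    """Squared feature-space distances: K_ii + K_jj - 2 K_ij."""
    d = np.diag(K)
    return d[:, None] + d[None, :] - 2.0 * K

rng = np.random.default_rng(0)
X = rng.random((4, 6))                  # hypothetical data: 4 features, 6 samples
K_lin = X.T @ X                         # linear kernel recovers Euclidean distances
D_lin = kernel_sq_distances(K_lin)
K = rbf_kernel(X, gamma=0.5)
D_feat = kernel_sq_distances(K)         # distances in the Gaussian feature space
```

With the linear kernel the result coincides with ordinary squared Euclidean distances, which is a convenient sanity check before switching to a nonlinear kernel.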
Remark 1.
In this paper, we aim at providing a new NMF method to take both local and global nonlinear relationships of the data into consideration. It is also worth mentioning that our method can be extended to a multiple-kernel scenario. Since this extension is beyond the scope of this paper, we do not further explore it here.
IV Optimization
We solve (9) using an iterative update algorithm and update and elementwise as follows:
(10)  
(11) 
By counting dominating multiplications, it is seen that the complexity of (10) and (11) per iteration is . The correctness and convergence proofs of the updates are provided in the following section.
V Correctness and Convergence
In this section, we will present theoretical results regarding the updates of (10) and (11), respectively.
V-A Correctness and Convergence of (10)
We present two results regarding the update rule of (10): 1) When convergent, the limiting solution of (10) satisfies the KKT condition. 2) The iteration of (10) converges. The two results are established in Theorems V.1 and V.2, respectively.
Theorem V.1.
Fixing , the limiting solution of the update rule in (10) satisfies the KKT condition.
Proof.
Fixing , the subproblem for is
(12)  
Imposing the nonnegativity constraint , we introduce the Lagrangian multipliers and the Lagrangian function
(13)  
The gradient of gives
(14) 
For ease of notation, we denote , , , and . By the complementary slackness condition, we obtain
(15) 
Note that (15) provides the fixed point condition that the limiting solution should satisfy. It is easy to see that the limiting solution of (10) satisfies (15), which is described as follows. At convergence, (10) gives
(16) 
which is reduced to
(17) 
by simple algebra. It is easy to see that (15) and (17) are identical in that both of them enforce either or . ∎
Next, we prove the convergence of the iterative update as stated in Theorem V.2.
Theorem V.2.
In this proof, we use an auxiliary function approach [21] with relevant definition and propositions given below.
Definition V.1.
A function is called an auxiliary function of if for any and the following are satisfied
(18) 
Proposition V.1.
Given a function and its auxiliary function , if we define a variable sequence with
(19) 
then the value sequence, , is decreasing due to the following chain of inequalities:
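In generic notation, the standard chain for an auxiliary function reads as follows. The symbols $f$ for the objective, $g$ for its auxiliary function, and $x^{(t)}$ for the iterate are assumptions made here for illustration and may differ from the notation of the original equations.

```latex
f\big(x^{(t+1)}\big) \;\le\; g\big(x^{(t+1)}, x^{(t)}\big) \;\le\; g\big(x^{(t)}, x^{(t)}\big) \;=\; f\big(x^{(t)}\big)
```

The first inequality uses the upper-bound property of the auxiliary function, the second holds because $x^{(t+1)}$ minimizes $g(\cdot, x^{(t)})$, and the final equality is the tightness condition of Definition V.1.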
Proposition V.2 ([13]).
For any matrices , , , and , with and being symmetric, the following inequality holds:
(20) 
With the aid of Definition V.1 and Propositions V.1 and V.2, we prove Theorem V.2 in the following.
Proof of Theorem V.2.
For fixed , the objective function in (12) can be written as
First, we show that the function defined in (21) is an auxiliary function of :
(21)  
To show this, we find upper bounds and lower bounds for the positive and negative terms in , respectively. For the positive terms, we use Proposition V.2 and the inequality for to get the following upper bounds:
(22)  
For the negative term, we use the inequality for to get the following lower bound:
(23)  
Combining these bounds, we get the auxiliary function for . Next, we show that the update of (10) essentially follows (19); the proof then follows from Proposition V.1. To show this, it remains to find the global minimum of (21). For this, we first prove that (21) is convex.
The first-order derivative of is
(24) 
Then the Hessian of can be obtained elementwise as
(25) 
where is the delta function that returns 1 if or 0 otherwise. It is seen that the Hessian matrix of has zero off-diagonal elements and positive diagonal elements, and thus is positive definite. Therefore, is convex and attains its global minimum at the point satisfying the first-order optimality condition, i.e., (24) = 0, which gives rise to
(26) 
(26) can be further reduced to
(27) 
Defining , and , we can see that (12) is decreasing under the update of (27). Substituting , , , , we recover (10). ∎
V-B Correctness and Convergence of (11)
Fixing , we need to solve the following optimization problem for :
(28)  
where is nonnegative and diagonal. We introduce the Lagrangian multiplier matrix , which is symmetric and has size . The Lagrangian function to be minimized is then
(29)  
where we define , , , and for easier notation, and , to be two nonnegative matrices for a nonnegative matrix such that . The gradient of is
(30) 
Then the KKT complementarity condition gives
(31) 
which is a fixed point relation that a local minimum for must satisfy. Following the previous subsection, noting that
we give an update as follows:
(32) 
To show that the update of (32) will converge to a local minimum, we will show two results: the convergence of the update algorithm and the correctness of the converged solution.
From (32), it is easy to show that, at convergence, the solution satisfies the following condition:
(33) 
which is the fixed point condition in (31). Hence, the correctness of the converged solution can be verified.
The convergence is assured by the following theorem.
Theorem V.3.
For fixed , the Lagrangian function is monotonically decreasing under the update rule in (32).
Proof.
To prove Theorem V.3, we use the auxiliary function approach. For ease of notation, we define .
First, we find upper bounds for each positive term in . By the inequality for , we get
(34) 
Then, according to Proposition V.2, by setting or to be identity matrices, we get the following two upper bounds:
(35) 
Then, by the inequalities for , we get the following lower bounds for the negative terms:
(36)  
Hence, combining the above bounds, we construct an auxiliary function for :
(37)  
We take the first order derivative of (37), then we get
(38)  
Further, we can get the Hessian of (37) by taking the second order derivative:
(39)  
It is easy to verify that the Hessian matrix has zero elements off diagonal, and nonnegative values on diagonal. Therefore, is convex in and its global minimum is obtained by its first order optimality condition, (38) = 0, which gives rise to
(40) 
According to Proposition V.1, by setting and , we recover (32) and it is easy to see that is decreasing under (32). ∎
It is seen that in (32), the multiplier is yet to be determined. By the first-order optimality condition of , i.e., (30) = 0, we can see that
(41)  
hence
(42) 
Note that by defining , and , we have and , . Substituting and into (32), we get the update rule in (11).
Remark 2.
So far, we can conclude that by alternately updating and , the objective function in (9) decreases and the value sequence converges. We set , and regard the updates of (10) and (11) as a mapping ; then at convergence we have . Following [13, 42], with the nonnegativity constraint enforced, we expand , which indicates that under an appropriate matrix norm. In general, , hence the updates of (10) and (11) roughly have a first-order convergence rate.
VI Experiments
In this section, we conduct experiments to verify the effectiveness of the proposed KLSNMF. We will present the evaluation metrics, benchmark datasets, algorithms in comparison, and experimental results in detail.
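For reference, the three measures reported in the tables (accuracy, NMI, and purity) can be computed along the following lines. This is a minimal pure-Python sketch under common definitions: the brute-force label matching in `clustering_accuracy`, the geometric-mean normalization in `nmi`, and all function names are assumptions, and the exact variants used in the experiments may differ.

```python
from collections import Counter
from itertools import permutations
from math import log

def purity(true_labels, pred_labels):
    """Fraction of samples covered by the majority true class of each cluster."""
    clusters = {}
    for t, p in zip(true_labels, pred_labels):
        clusters.setdefault(p, Counter())[t] += 1
    return sum(max(c.values()) for c in clusters.values()) / len(true_labels)

def clustering_accuracy(true_labels, pred_labels):
    """Best accuracy over one-to-one mappings from clusters to classes.

    Brute force over permutations, so only practical for a few clusters;
    the Hungarian algorithm is the usual choice at larger scale.
    """
    classes = sorted(set(true_labels))
    clusters = sorted(set(pred_labels))
    # Pad with None so surplus clusters can stay unmatched.
    targets = classes + [None] * max(0, len(clusters) - len(classes))
    best = 0
    for perm in permutations(targets, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[p] == t for t, p in zip(true_labels, pred_labels))
        best = max(best, hits)
    return best / len(true_labels)

def nmi(true_labels, pred_labels):
    """Mutual information normalized by the geometric mean of the entropies."""
    n = len(true_labels)
    ct, cp = Counter(true_labels), Counter(pred_labels)
    joint = Counter(zip(true_labels, pred_labels))
    mi = sum(c / n * log(c * n / (ct[t] * cp[p])) for (t, p), c in joint.items())
    ht = -sum(c / n * log(c / n) for c in ct.values())
    hp = -sum(c / n * log(c / n) for c in cp.values())
    return mi / (ht * hp) ** 0.5 if ht > 0 and hp > 0 else 0.0

# Hypothetical labels: one sample of class 1 is placed in the wrong cluster.
true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
pred = [1, 1, 1, 0, 0, 2, 2, 2, 2]
scores = (clustering_accuracy(true, pred), nmi(true, pred), purity(true, pred))
```

All three measures are invariant to how the predicted clusters are numbered, which is why accuracy needs the explicit matching step.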
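Accuracy (%)
| N | WNMF | RMNMF | CNMF | KNMF | ONMF | KLSNMF |
|---|---|---|---|---|---|---|
| 2 | 87.57 ± 10.53 | 87.58 ± 10.64 | 88.18 ± 10.02 | 87.88 ± 10.73 | 87.10 ± 11.63 | 88.86 ± 10.54 |
| 3 | 80.31 ± 09.91 | 78.23 ± 09.17 | 80.23 ± 10.51 | 80.58 ± 10.52 | 79.43 ± 07.39 | 82.88 ± 08.53 |
| 4 | 71.95 ± 06.07 | 65.22 ± 07.80 | 70.32 ± 08.91 | 67.88 ± 10.86 | 70.80 ± 08.62 | 75.32 ± 11.16 |
| 5 | 70.24 ± 06.77 | 62.33 ± 07.31 | 67.61 ± 10.23 | 64.40 ± 07.41 | 64.36 ± 08.39 | 75.26 ± 07.33 |
| 6 | 58.25 ± 05.69 | 54.67 ± 06.88 | 57.50 ± 06.14 | 61.71 ± 09.32 | 61.57 ± 06.77 | 64.91 ± 08.69 |
| 7 | 59.32 ± 07.24 | 52.94 ± 06.03 | 54.42 ± 05.89 | 61.36 ± 05.91 | 57.68 ± 07.48 | 64.66 ± 05.42 |
| 8 | 59.63 ± 07.53 | 48.23 ± 04.31 | 53.52 ± 04.81 | 60.33 ± 05.64 | 58.02 ± 06.95 | 67.15 ± 06.74 |
| 9 | 56.35 ± 04.12 | 44.90 ± 02.77 | 50.16 ± 05.59 | 56.06 ± 05.52 | 56.63 ± 08.88 | 59.25 ± 02.74 |
| 10 | 55.56 | 43.57 | 45.20 | 52.54 | 49.15 | 60.58 |
| Average | 66.57 | 59.74 | 63.01 | 65.86 | 64.97 | 70.99 |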
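NMI (%)
| N | WNMF | RMNMF | CNMF | KNMF | ONMF | KLSNMF |
|---|---|---|---|---|---|---|
| 2 | 56.16 ± 28.34 | 55.48 ± 28.88 | 56.20 ± 28.69 | 57.41 ± 28.32 | 55.51 ± 30.48 | 60.70 ± 30.26 |
| 3 | 54.01 ± 13.67 | 50.39 ± 12.33 | 53.90 ± 14.75 | 55.95 ± 12.58 | 50.22 ± 11.18 | 58.68 ± 11.41 |
| 4 | 50.68 ± 04.88 | 44.83 ± 06.88 | 49.93 ± 06.20 | 50.52 ± 07.34 | 49.02 ± 05.37 | 58.22 ± 09.09 |
| 5 | 52.28 ± 06.09 | 43.45 ± 07.15 | 51.08 ± 08.22 | 54.32 ± 03.32 | 49.88 ± 07.96 | 61.15 ± 07.27 |
| 6 | 45.58 ± 04.75 | 39.81 ± 06.31 | 45.25 ± 06.11 | 51.11 ± 05.01 | 47.46 ± 05.93 | 55.26 ± 07.79 |
| 7 | 46.55 ± 06.27 | 41.71 ± 04.53 | 44.05 ± 04.81 | 51.57 ± 04.88 | 46.56 ± 06.12 | 54.07 ± 04.08 |
| 8 | 48.18 ± 04.90 | 39.51 ± 03.19 | 44.36 ± 03.54 | 52.49 ± 02.81 | 46.70 ± 04.29 | 58.96 ± 04.45 |
| 9 | 47.18 ± 03.78 | 36.52 ± 02.66 | 42.75 ± 04.51 | 49.29 ± 03.99 | 45.75 ± 04.75 | 54.43 ± 02.45 |
| 10 | 44.82 | 35.44 | 37.96 | 47.38 | 43.12 | 54.98 |
| Average | 49.49 | 43.02 | 47.28 | 52.23 | 48.25 | 57.38 |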
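Purity (%)
| N | WNMF | RMNMF | CNMF | KNMF | ONMF | KLSNMF |
|---|---|---|---|---|---|---|
| 2 | 87.57 ± 10.53 | 87.58 ± 10.64 | 88.18 ± 10.02 | 87.88 ± 10.73 | 87.10 ± 11.63 | 88.86 ± 10.54 |
| 3 | 80.31 ± 09.91 | 78.23 ± 09.17 | 80.39 ± 10.19 | 80.67 ± 10.35 | 79.43 ± 07.39 | 82.88 ± 08.53 |
| 4 | 72.33 ± 05.76 | 67.08 ± 06.60 | 71.91 ± 06.45 | 71.09 ± 07.60 | 72.06 ± 05.92 | 76.51 ± 08.74 |
| 5 | 70.51 ± 06.74 | 63.77 ± 05.81 | 69.13 ± 07.59 | 69.25 ± 04.18 | 67.59 ± 06.43 | 76.10 ± 06.40 |
| 6 | 60.91 ± 04.53 | 56.44 ± 05.79 | 61.03 ± 05.25 | 65.64 ± 06.08 | 63.45 ± 06.24 | 67.83 ± 07.45 |
| 7 | 60.88 ± 06.43 | 54.69 ± 05.57 | 57.35 ± 05.68 | 65.02 ± 04.32 | 61.12 ± 06.32 | 67.11 ± 03.74 |
| 8 | 60.58 ± 06.55 | 49.88 ± 03.76 | 55.72 ± 03.92 | 63.94 ± 03.71 | 60.13 ± 05.82 | 68.84 ± 04.47 |
| 9 | 59.04 ± 04.61 | 46.18 ± 02.82 | 52.57 ± 05.44 | 60.18 ± 05.00 | 59.20 ± 06.61 | 64.10 ± 02.78 |
| 10 | 56.56 | 45.95 | 45.20 | 52.54 | 54.74 | 61.83 |
| Average | 67.63 | 61.09 | 64.61 | 68.47 | 67.20 | 72.67 |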
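Accuracy (%)
| N | WNMF | RMNMF | CNMF | KNMF | ONMF | KLSNMF |
|---|---|---|---|---|---|---|
| 2 | 99.75 ± 00.79 | 100.00 ± 0.00 | 99.75 ± 00.00 | 99.75 ± 00.79 | 99.25 ± 02.37 | 100.00 ± 0.00 |
| 3 | 96.54 ± 05.05 | 97.62 ± 01.86 | 87.98 ± 13.94 | 96.36 ± 03.91 | 84.06 ± 16.95 | 98.72 ± 01.47 |
| 4 | 95.92 ± 05.96 | 98.83 ± 01.73 | 80.37 ± 17.35 | 89.54 ± 13.01 | 91.88 ± 14.41 | 99.07 ± 02.04 |
| 5 | 95.75 ± 03.92 | 97.46 ± 03.09 | 88.29 ± 08.25 | 87.26 ± 10.56 | 72.47 ± 06.66 | 98.39 ± 02.23 |
| 6 | 89.47 ± 04.41 | 95.14 ± 04.07 | 76.26 ± 13.45 | 83.50 ± 08.14 | 88.98 ± 12.69 | 97.80 ± 01.14 |
| 7 | 89.68 ± 10.77 | 90.24 ± 06.90 | 72.05 ± 11.21 | 83.14 ± 09.33 | 79.65 ± 08.69 | 96.79 ± 02.35 |
| 8 | 92.05 ± 05.57 | 91.63 ± 05.58 | 69.44 ± 10.06 | 79.24 ± 07.30 | 74.74 ± 07.43 | 96.52 ± 01.61 |
| 9 | 86.84 ± 04.69 | 90.73 ± 07.06 | 63.82 ± 05.77 | 79.76 ± 06.36 | 79.01 ± 06.05 | 95.51 ± 01.23 |
| 10 | 90.61 | 95.77 | 69.95 | 81.69 | 82.63 | 96.24 |
| Average | 92.96 | 95.27 | 78.66 | 86.69 | 83.63 | 97.67 |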
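NMI (%)
| N | WNMF | RMNMF | CNMF | KNMF | ONMF | KLSNMF |
|---|---|---|---|---|---|---|
| 2 | 98.55 ± 04.59 | 100.00 ± 0.00 | 98.55 ± 04.59 | 98.55 ± 04.59 | 96.79 ± 10.16 | 100.00 ± 0.00 |
| 3 | 91.29 ± 10.58 | 92.02 ± 05.91 | 78.83 ± 18.03 | 89.92 ± 10.13 | 78.25 ± 17.32 | 95.84 ± 04.63 |
| 4 | 91.48 ± 10.86 | 96.98 ± 03.88 | 75.52 ± 17.36 | 86.39 ± 15.37 | 92.30 ± 09.66 | 97.82 ± 04.70 |
| 5 | 92.94 ± 05.56 | 95.01 ± 05.29 | 84.42 ± 08.55 | 85.72 ± 08.72 | 73.86 ± 05.60 | 96.69 ± 04.49 |
| 6 | 85.58 ± 05.96 | 91.76 ± 05.45 | 73.17 ± 13.80 | 83.17 ± 06.95 | 88.91 ± 10.75 | 95.68 ± 02.05 |
| 7 | 88.18 ± 09.17 | 87.12 ± 05.60 | 69.79 ± 11.35 | 85.46 ± 05.05 | 81.43 ± 08.65 | 94.79 ± 03.58 |
| 8 | 91.22 ± 04.86 | 89.09 ± 05.20 | 66.10 ± 11.38 | 82.18 ± 04.27 | 81.33 ± 06.17 | 94.50 ± 02.53 |
| 9 | 87.20 ± 03.18 | 89.34 ± 05.09 | 62.37 ± 05.03 | 83.03 ± 04.25 | 82.49 ± 04.57 | 93.73 ± 01.57 |
| 10 | 89.44 | 93.54 | 70.65 | 82.38 | 84.46 | 94.40 |
| Average | 90.65 | 92.76 | 75.49 | 86.31 | 84.42 | 95.94 |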
Purity (%)
| N | WNMF | RMNMF | CNMF | KNMF | ONMF | KLSNMF |
|---|---|---|---|---|---|---|
| 2 | 99.75 ± 00.79 | 100.00 ± 0.00 | 99.75 ± 00.79 | 99.75