# Nonnegative Matrix Factorization with Local Similarity Learning

Existing nonnegative matrix factorization methods focus on learning global structure of the data to construct basis and coefficient matrices, which ignores the local structure that commonly exists among data. In this paper, we propose a new type of nonnegative matrix factorization method, which learns local similarity and clustering in a mutually enhancing way. The learned new representation is more representative in that it better reveals inherent geometric property of the data. Nonlinear expansion is given and efficient multiplicative updates are developed with theoretical convergence guarantees. Extensive experimental results have confirmed the effectiveness of the proposed model.

## Authors

• 23 publications
• 33 publications
• 13 publications
• 27 publications
• ### Image Analysis Based on Nonnegative/Binary Matrix Factorization

Using nonnegative/binary matrix factorization (NBMF), a matrix can be de...
07/02/2020 ∙ by Hinako Asaoka, et al. ∙ 0

• ### Self-supervised Symmetric Nonnegative Matrix Factorization

Symmetric nonnegative matrix factorization (SNMF) has demonstrated to be...
03/02/2021 ∙ by Yuheng Jia, et al. ∙ 16

• ### Adversarially-Trained Nonnegative Matrix Factorization

We consider an adversarially-trained version of the nonnegative matrix f...
04/10/2021 ∙ by Ting Cai, et al. ∙ 0

• ### Unsupervised Phase Mapping of X-ray Diffraction Data by Nonnegative Matrix Factorization Integrated with Custom Clustering

Analyzing large X-ray diffraction (XRD) datasets is a key step in high-t...
02/20/2018 ∙ by Valentin Stanev, et al. ∙ 0

• ### Robust Volume Minimization-Based Matrix Factorization for Remote Sensing and Document Clustering

This paper considers volume minimization (VolMin)-based structured matri...
08/15/2016 ∙ by Xiao Fu, et al. ∙ 0

• ### A sufficient condition for local nonnegativity

A real polynomial f is called local nonnegative at a point p, if it is n...
10/29/2019 ∙ by Jia Xu, et al. ∙ 0

• ### Effective Mean-Field Inference Method for Nonnegative Boltzmann Machines

Nonnegative Boltzmann machines (NNBMs) are recurrent probabilistic neura...
03/08/2016 ∙ by Muneki Yasuda, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## I Introduction

High-dimensional data are ubiquitous in the learning community and it has become increasingly challenging to learn from such data [14]. For example, as one of the most important tasks in, for example, multimedia and data mining, information retrieval has drawn considerable attentions in recent years [47, 18, 46], where there is often a need to handle high-dimensional data. Often times, it is desirable and demanding to seek a data representaiton to reveal latent data structures of high-dimensional data, which is usually helpful for further data processing. It is thus a critical problem to find a suitable representation of the data [4, 20, 22, 37]

in many learning tasks, such as single image super-resolution

[48], image reconstruction [32], image clustering [34], foreground-background seperation in surveillance video [5], matrix completion [28], etc. To this end, a number of methods for finding proper representations have been developed, among which matrix factorization technique has been widely used to handle high-dimensional data. Matrix factorization seeks two or more low-dimensional matrices to approximate the original data such that the high-dimensional data can be represented with reduced dimensions [23, 35].

For some types of data, such as images and documents that are widely used in real world learning problems, the entries are naturally nonnegative. For such data, nonnegative matrix factorization (NMF) was proposed to seek two nonnegative factor matrices for approximation. In fact, the way of seeking nonnegative factorization for nonnegative data naturally leads to learning parts-based representations of the data [20]. Parts-based representation is believed to commonly exist in human brain with psychological and physiological evidence [33, 39, 25]. It overcomes the drawback of latent semantic indexing (LSI) [9]

, for which the interpretation of basis vectors is difficult due to mixed signs. When the number of basis vectors is large, NMF has been proven to be NP-hard

[38]; moreover, [1]

has recently given some conditions, under which NMF is solvable. Recent studies have shown a close relationship between NMF and K-means

[11]

, and further study has shown that both spectral clustering and kernel K-means

[10] are particular cases of clustering with NMF under a doubly stochastic constraint [44]. This implies that NMF is especially suitable for clustering such data. In this paper, we will develop a novel NMF method, which focuses on the clustering capability.

Many variants of NMF have been developed in the past decades, which can be mainly categorized into four types, including basic NMF [20], constrained NMF [12], structured NMF [43], and generalized NMF [2]. A fairly comprehensive review can be found in [41]. Among these methods, Semi-NMF [13] removes the nonnegative constraint on the data and basis vectors, such that its applications can be expanded to more fields; convex NMF (CNMF) [13] restricts the basis vectors to lie in the feature space of the input data so that they can be represented as convex combinations of data vectors; orthogonal NMF (ONMF) [12] imposes orthogonality constraints on factor matrices, which leads to clustering interpretation. The classic NMF only considers the linear structures of the data by finding new data points with respect to the new basis and ignores the nonlinear structures of the data, which is usually important for many applications such as clustering. To learn the latent nonlinear structures of the data, graph regularized nonnegative matrix factorization (GNMF) considers the intrinsic geometrical structures of the data on a manifold by incorporating a Laplacian regularization [3]. By modeling the data space as a manifold embedded in an ambient space and performing NMF on this manifold, GNMF considers both linear and nonlinear relationships of the data points in the original instance space, and thus it is also more discriminating than ordinary NMF which only considers the Euclidean structure of the data [3]. This renders GNMF more suitable for clustering purpose than the original NMF. Based on GNMF, robust manifold nonnegative matrix factorization (RMNMF) constructs a structured sparsity-inducing norm-based robust formulation [17]. With a

-norm, RMNMF is insensitive to the between-sample data outliers and improves the robustness of NMF

[17]. Moreover, the relaxed requirement on signs of the data makes it a nonlinear version of Semi-NMF.

In recent years, the importance of preserving local manifold structure has drawn considerable attentions in research community of machine learning, data mining, and pattern recognition

[45, 29, 24, 7]. It has been shown that besides pairwise sample similarity, local geometric structure of the data is also crucial in revealing underlying structure of the data [24]: 1)In the transformed low-dimensional space, it is important to maintain the intrinsic information of high-dimensional data [40]; 2) It may be insufficient to represent the underlying structures of the data with a single characterization and both global and local ones are necessary [6]; 3) In some ways, we can regard the local geometric structure of the data as data dependent regularization, which helps avoid overfitting issues [24]. Despite its importance, local structure of data has yet to be exploited in NMF study. In this paper, we propose a new type of NMF method, which simultaneously learns both similarity and geometric/clustering structures of the data and clustering such that the learned basis and coefficients well preserve discriminative information of the data. Recent studies reveal that high-dimensional data often reside in a union of low-dimensional subspaces and the data can be self-expressed by a low-dimensional representation [23, 15], which can be regarded as pairwise similarity of samples. Instead of simply using pairwise similarity of samples, in our method, we transform the pairwise similarity into the similarity between a score vector of a sample on basis and the representation of another sample in the same cluster, which integrates basis and coefficient learning into simultaneous similarity learning and clustering. Nonlinear model is developed to measure both local and global nonlinear relationships of the data.

The main contributions of this paper are as follows:

• For the first time, in an effective yet simple way, local similarity learning is embedded into learning matrix factorization, which allows our method to learn global and local structures of the data. The learned basis and representations well preserve the inherent structures of the data and are more representative;

• To our best knowledge, we are the first to integrate the orthogonality-constrained coefficient matrix into local similarity adaption, such that local similarity and clustering can mutually enhance each other and be learned simultaneously;

• Nonlinear extension is developed from kernel perspectives, which can be further expanded to cope with multiple-kernel scenario;

• Efficient multiplicative update rules are constructed to solve the proposed model and comprehensive theoretical analysis is provided to guarantee the convergence;

• Lastly, extensive experimental results have verified the effectiveness of our method.

The rest of this paper is organized as follows: In Section II, we briefly review some methods that are closely related with our research. Then we introduce our method in Section III. Regarding the proposed method, we provide an efficient alternating optimization procedure in Section IV, and then provide complicated theoretical results for the convergence analysis in Section V. Next, we conduct comprehensive experiments and show the results in Section VI. Finally, we conclude the paper in Section VII.

Notation: For a matrix , , , and denote the -th element, -th column, and -th row of . is the trace operator, and are the Frobenius and norms.

denotes the identity matrix of size

, is an operator that returns a diagonal matrix with identical diagonal elements to the input matrix.

## Ii Related Work

In this section, we briefly review some methods that are closely related with our research.

### Ii-a Nmf

Given nonnegative data with being the dimension and sample size, NMF is to factor into (basis) and (coefficients) with the following optimization problem:

 minU≥0,G≥0∥X−UGT∥2F, (1)

where enforces a low-rank approximation of the original data.

### Ii-B Graph Laplacian

Graph Laplacian [8] is defined as

 12n∑i=1n∑j=1∥Gi−Gj∥22Wxij (2) = n∑j=1DxjjGTjGj−n∑i=1n∑j=1WxijGTiGj, = Tr(GTDxG)−Tr(GTWxG)=Tr(GTLxG),

where is the weight matrix that measures the pair-wise similarities of original data points, is a diagonal matrix with , and . It is widely used to incorporate the geometrical structure of the data on manifold. In particular, the manifold enforces the smoothness of the data in linear and nonlinear spaces by minimizing (2), which leads to an effect that if two data points are close in the intrinsic geometry of the data distribution, then their new representations with respect to the new basis, and , are also close [3]. This is closely related with spectral clustering (SC) [36, 27] and its further development [31, 30].

## Iii Proposed Method

As aforementioned, existing NMF methods do not fully exploit local geometric structures, nor do they exploit close interaction between local similarity and clustering. In this section, we will propose an effective, yet simple, new method to overcome these two drawbacks.

CNMF restricts the basis of NMF to convex combinations of the columns of the data, i.e., , which gives rise to the following:

 minW≥0,G≥0∥X−XWGT∥2F. (3)

By restricting , (3) has the advantage that it could interpret the columns of as weighted sums of certain data points and these columns correspond to centroids [13]. It is natural to see that reveals the importance of basis to by .

It is noted that (3) is closely related to subspace clustering [23, 15]. The observation is that high-dimensional data usually reside in low-dimensional subspaces and recovering such subspaces usually needs a self-expressiveness assumption, which refer to that the data can be approximately self-expressed as with a representation matrix . Local structures of the data are shown to be important [29] and it is necessary to take into consideration local similarity in learning tasks. A natural assumption is that if two data points and are close to each other, then their similarity, , should be large; otherwise, small. This assumption leads to the following minimization:

 minZ∑ij∥xi−xj∥22Zij⇔minZ% Tr(ZTD), (4)

where

 Dij=∥xi−xj∥22,

or in matrix form,

 D=1n1Tndiag(XTX)+diag(XTX)1n1Tn−2XTX,

with being a length- vector of 1s. It is noted that the minimization of creftype 4 directly enforces to reflect the pair-wise similarity information of the examples. Noticing that and are nonnegative and inspired by self-expressiveness assumption, we take as the similarity matrix , such that . Here, is the score vector of example on the basis vectors, and is the coefficient vector of the -th sample with respect to the new basis. If and are close on data manifold or grouped into the same cluster, then it is natural that and have higher similarity; vice versa. This close relationship between the geometry of and on data manifold and the similarity of and suggests that using as in (4) is indeed meaningful. To encourage the interaction between similarity learning and clustering, we incorporate (4) into (3) with , obtaining the Local Similarity NMF (LS-NMF):

 minW,G12∥X−XWGT∥2F+λTr(WTDG), (5) s.t.W≥0,G≥0.

where is a balancing parameter. Now, it is seen that the first term in above model captures global structure of the data by exploiting linear representation of each example with respect to the overall data, while the second term exploits local structure of the data by the connection between local geometric structure and pairwise similarity.

To allow for immediate interpretation of clustering from the coefficient matrix, we impose an orthogonality constraint of , i.e., , leading to

 minW,G12∥X−XWGT∥2F+λTr(WTDG), (6) s.t.W≥0,G≥0,GTG=Ik.

Note that by enforcing , the problem of NMF is directly connected with clustering in that can be regarded as relaxed cluster indicators. More importantly, learning similarity and clustering are connected through such a matrix and can be mutually promoted through an iterative optimization process. At the end of the iteration, the optimized clustering results are directly given by .

Model (6) only learns linear relationships of the data and omits the nonlinear ones, which usually exist and are important. To take nonlinear relationships of the data into consideration, it is widely considered to seek data relationships in kernel space.

We define a kernel mapping as , which maps the data points from the input space to in a reproducing kernel Hilbert space , where is an arbitrary positive integer. After kernel mapping, we obtain the mapped data points . The similarity between each pair of data points is defined as the inner product of mapped data in the Hilbert space, i.e., , where is a reproducing kernel function. In the kernel space, (6) is reduced to

 minW,G 12∥ϕ(X)−ϕ(X)WGT∥2F+λ% Tr(WTDϕG), (7) s.t.W≥0,G≥0,GTG=Ik,

where is extended in (6) from instance space to kernel space defined as

 Dϕ= 1n1Tndiag(ϕ(X)Tϕ(X)) (8) +diag(ϕ(X)Tϕ(X))1n1Tn−2ϕ(X)Tϕ(X).

We expand (7) and replace with , the kernel matrix induced by kernel function associated with the mapping , giving rise to the Kernel LS-NMF (KLS-NMF):

 minW,G 12Tr(K−2KWGT+GWTKWGT) (9) +λTr(WTDKG), s.t.W≥0,G≥0,GTG=Ik,

where .

###### Remark 1.

In this paper, we aim at providing a new NMF method to take both local and global nonlinear relationships of the data into consideration. It is also worth mentioning that our method can be extended to multiple-kernel scenario. Since the future extension is out of the scope of this paper, we do not further explore it here.

## Iv Optimization

We solve (9) using an iterative update algorithm and element-wisely update and as follows:

 Wik ←Wik√(KG)ik(KWGTG)ik+λ(DKG)ik (10) Gik ←Gik√(KW)ik+(λGGTDKW)ikλ(DKW)ik+(GGTKW)ik (11)

By counting dominating multiplications, it is seen that the complexity of (10) and (11) per iteration is . The correctness and convergence proofs of the updates are provided in the following section.

## V Correctness and Convergence

In this section, we will present theoretical results regarding the updates of (10) and (11), respectively.

### V-a Correctness and Convergence of (10)

We present two results regarding the update rule of (10): 1) When convergent, the limiting solution of (10) satisfies the KKT condition. 2) The iteration of (10) converges. The two results are established in Theorems V.2 and V.1, respectively.

###### Theorem V.1.

Fixing , the limiting solution of the update rule in (10) satisfies the KKT condition.

###### Proof.

Fixing , the subproblem for is

 minW≥0 12Tr(−2KWGT+GWTKWGT) (12) +λTr(WTDKG),

Imposing the non-negativity constraint , we introduce the Lagrangian multipliers and the Lagrangian function

 LW= 12Tr(−2KWGT+GWTKWGT) (13) +λTr(WTDKG)+Tr(ΨWT),

 ∂LW∂W=−KG+λDKG+KWGTG+Ψ. (14)

For ease of notation, we denote , , , and . By the complementary slackness condition, we obtain

 (−¯A+λ¯B+¯CW¯D)ikWik=ψikWik=0. (15)

Note that (15) provides the fixed point condition that the limiting solution should satisfy. It is easy to see that the limiting solution of (10) satisfies (15), which is described as follows. At convergence, (10) gives

 Wik=Wik ⎷(¯A)ik(¯CW¯D)ik+λ(¯B)ik, (16)

which is reduced to

 (−¯A+λ¯B+¯CW¯D)ikW2ik=0, (17)

by simple algebra. It is easy to see that (15) and (17) are identical in that both of them enforce either or . ∎

Next, we prove the convergence of the iterative update as stated in Theorem V.2.

###### Theorem V.2.

For fixed , (12), as well as (9), is monotonically decreasing under the update rule in (10).

In this proof, we use an auxiliary function approach [21] with relevant definition and propositions given below.

###### Definition V.1.

A function is called an auxiliary function of if for any and the following are satisfied

 J(H,H′)≥L(H),J(H,H)=L(H). (18)
###### Proposition V.1.

Given a function and its auxiliary function , if we define a variable sequence with

 H(t+1)=argminHJ(H,H(t)), (19)

then the value sequence, , is decreasing due to the following chain of inequalities:

 L(H(t))=J(H(t),H(t))≥J(H(t+1),H(t))≥L(H(t+1)).
###### Proposition V.2 ([13]).

For any matrices , , , and , with and being symmetric, the following inequality holds:

 n∑i=1k∑s=1(ΓS′Ω)isS2isS′is≥Tr(STΓSΩ). (20)

With the aid of Definition V.1 and Propositions V.2 and V.1, we prove Theorem V.2 in the following.

###### Proof of Theorem v.2.

For fixed , the objective function in (12) can be written as

 P(W)=Tr(−WT¯A+12WT¯CW¯D+λWT¯B)+12Tr(¯C).

First, we show that the function defined in (21) is an auxiliary function of :

 ¯P(W,W′) (21) = 12Tr(¯C)−∑ik¯AikW′ik(1+logWikW′ik) +12∑ik(¯CW′¯D)ikW2ikW′ik +λ∑ik¯BikW2ik+W′2ik2W′ik.

To show this equation, we find the upper-bounds and lower-bounds for the positive and negative terms in , respectively. For the positive terms, we use Proposition V.2 and the inequality for to get the following upper-bounds:

 Tr(WT¯B)=∑ik¯BikWik ≤∑ik¯BikW2ik+W′2ik2W′ik, (22) Tr(WT¯CW¯D) ≤∑ik(¯CW′¯D)ikW2ikW′ik.

For the negative term, we use the inequality for to get the following lower-bound:

 Tr(WT¯A) =∑ik¯AikWik (23) ≥∑ik¯AikW′ik(1+logWikW′ik).

Combining these bounds, we get the auxiliary function for . Next, we will show that the update of (10) essentially follows (19), then according to Proposition V.1 we can conclude the proof. To show this, the remaining problem is to find the global minimum of (21). For this, we first prove that (21) is convex.

The first-order derivative of is

 ∂¯P(W,W′)∂Wik=−¯AikW′ikWik+(¯CW′¯D)ikWikW′ik+λ¯BikWikW′ik. (24)

Then the Hessian of can be obtained element-wisely as

 ∂2¯P(W,W′)∂Wik∂Wjl=δijδjk(¯AikW′ikW2ik+(¯CW′¯D)ik+λ¯BikW′ik), (25)

where is delta function that returns 1 if or 0 otherwise. It is seen that the Hessian matrix of has zero elements off diagonal and nonzero elements on diagonal, and thus is positive definite. Therefore, is convex and achieves the global optimum by its first-order optimality condition, i.e., (24) = 0, which gives rise to

 ¯AikW′ikWik=(¯CW′¯D)ikWikW′ik+λ¯BikWikW′ik. (26)

(26) can be further reduced to

 Wik=W′ik√¯Aik(¯CW′¯D)ik+λ¯Bik. (27)

Define , and , we can see that (12) is decreasing under the update of (27). Substituting , , , , we recover (10). ∎

### V-B Correctness and Convergence of (11)

Fixing , we need to solve the following optimization problem for :

 argminG =12Tr(−2KWGT+GWTKWGT) (28) +λTr(WTDKG),s.t.G≥0,GTG=Λ,

where is nonnegative and diagonal. We introduce the Lagrangian multipliers , which is symmetric and has size . Then the Lagrangian function to be minimized gives rise to

 LG= 12Tr(−2KWGT+GWTKWGT) (29) +λTr(WTDKG)+12Tr(Θ(GTG−Λ)) = 12Tr(−2KWGT+GWTKWGT +2λWTDKG+GΘGT)−ξ = 12Tr(−2AGT+GCGT+2λBGT+GΘGT)−ξ = 12Tr(−2AGT+2λBGT +G(C+Θ)+GT−G(C+Θ)−GT)−ξ,

where we define , , , and for easier notation, and , to be two nonnegative matrices for a nonnegative matrix such that . The gradient of is

 ∂LG∂G=−2A+2GC+2λB+2GΘ. (30)

Then the KKT complementarity condition gives

 (−A+GC+λB+GΘ)ikGik=0, (31)

which is a fixed point relation that the local minimum for must hold. Following the previous subsection, noting that

 C+Θ=(C+Θ)+−(C+Θ)−

we give an update as follows:

 Gik←Gik√Aik+(G(C+Θ)−)ikλBik+(G(C+Θ)+)ik. (32)

To show that the update of (32) will converge to a local minimum, we will show two results: the convergence of the update algorithm and the correctness of the converged solution.

From (32), it is easy to show that, at convergence, the solution satisfies the following condition:

 (−A+GC+λB+GΘ)ikG2ik=0, (33)

which is the fixed point condition in (31). Hence, the correctness of the converged solution can be verified.

The convergence is assured by the following theorem.

###### Theorem V.3.

For fixed , the Lagrangian function is monotonically decreasing under the update rule in (32).

###### Proof.

To prove Theorem V.3, we use the auxiliary function approach. For ease of notation, we define .

First, we find upper-bounds for each positive term in . By inequality for , we get

 Tr(GTB)=∑ikBikGik≤∑ikBikG2ik+G′2ik2G′ik. (34)

Then, according to Proposition V.2, by setting or to be identity matrices, we get the following two upper-bounds

 Tr(GE+GT)≤ ∑ik(G′E+)ikG2ikG′ik (35)

Then, by the inequalities for , we get the following lower-bounds for negative terms:

 Tr(GTA)≥ ∑ikAikG′ik(1+logGikG′ik) (36) Tr(GE−GT)≥ ∑iklE−klG′ikG′il(1+logGikGilG′ikG′il).

Hence, combining the above bounds, we construct an auxiliary function for :

 J(G,G′)=−∑ikAikG′ik(1+logGikG′ik) (37) +λ∑ikBikG2ik+G′2ik2G′ik+12∑ik(G′E+)ikG2ikG′ik
 −12∑iklE−klG′ikG′il(1+logGikGilG′ikG′il) −γ2∑ikl(Wx)klG′kiG′li(1+logGkiGliG′kiG′li) +γ2(DxG′)ikG′ikG2ik+12Tr(XTX).

We take the first order derivative of (37), then we get

 ∂J(G,G′)∂Gik=−AikG′ikGik+λBikG′ikGik+(G′E+)ikG′ikGik (38) −(G′E−)ikG′ikGik+γ(DxG′)ikG′ikGik−γ(WxG′)ikG′ikGik.

Further, we can get the Hessian of (37) by taking the second order derivative:

 ∂2Z(G,G′)∂Gik∂Gjl=δijδkl(AikG′ikG2ik+λBikG′ik+(G′E+)ikG′ik (39) +(G′E−)ikG′ikG2ik+γ(DxG′)ikG′ik+γ(WxG′)ikG′ikG2ik).

It is easy to verify that the Hessian matrix has zero elements off diagonal, and nonnegative values on diagonal. Therefore, is convex in and its global minimum is obtained by its first order optimality condition, (38) = 0, which gives rise to

 Gik=G′ik√Aik+(G′E−)ikλBik+(G′E+)ik. (40)

According to Proposition V.1, by setting and , we recover (32) and it is easy to see that is decreasing under (32). ∎

It is seen that in (32), the multipliers is yet to be determined. By the first order optimality condition of , i.e., (30) = 0, we can see that

 GT(−A+GC+λB+GΘ) (41) = −GTA+GTGC+λGTB+GTGΘ = −GTA+C+λGTB+Θ = 0,

hence

 E=GTA−λGTB. (42)

Note that by defining , and , we have and , . Substituting and into (32), we get the update rule in (11).

###### Remark 2.

So far, a conclusion can be drawn that by alternatively updating and , the objective function in (9) will decrease and the value sequence converges. We set , and regard the updates of (10) and (11) as a mapping , then at convergence we have . Following [13, 42], with non-negativity constraint enforced, we expand , which indicates that under an appropriate matrix norm. In general, , hence the updates of (10) and (11) roughly have a first-order convergence rate.

## Vi Experiments

In this section, we conduct experiments to verify the effectiveness of the proposed KLS-NMF. We will present the evaluation metrics, benchmark datasets, algorithms in comparison, and experimental results in detail.