 # Context Aware Nonnegative Matrix Factorization Clustering

In this article we propose a method to refine the clustering results obtained with the nonnegative matrix factorization (NMF) technique, imposing consistency constraints on the final labeling of the data. The research community has focused its effort on the initialization and on the optimization part of this method, without paying attention to the final cluster assignments. We propose a game theoretic framework in which each object to be clustered is represented as a player, which has to choose its cluster membership. The information obtained with NMF is used to initialize the strategy space of the players and a weighted graph is used to model the interactions among the players. These interactions allow the players to choose a cluster which is coherent with the clusters chosen by similar players, a property which is not guaranteed by NMF, since it produces a soft clustering of the data. The results on common benchmarks show that our model is able to improve the performance of many NMF formulations.

## I Introduction

Nonnegative matrix factorization (NMF) is a particular kind of matrix decomposition in which an input matrix is factorized into two nonnegative matrices $W$ and $H$ of rank $k$, such that their product $WH$ approximates the input matrix. The significance of this technique is to find those vectors that are linearly independent in a determined vector space. In this way, they can be considered as the essential representation of the problem described by the vector space and as the latent structure of the data in a reduced space. The advantage of this technique, compared to other dimension reduction techniques such as Singular Value Decomposition (SVD), is that the values taken by each vector are nonnegative. In fact, this representation gives an immediate and intuitive view of the importance of the dimensions of each vector, a characteristic that makes NMF particularly suitable for soft and hard clustering.

The dimensions of $W$ and $H$ are $n \times k$ and $k \times m$, respectively, where $n$ is the number of objects, $m$ is the number of features, and $k$ is the number of dimensions of the new vector space. NMF uses different methods to initialize these matrices [2, 3, 4] and then optimization techniques are employed to minimize the difference between the input matrix and $WH$.

The initialization of the matrices $W$ and $H$ is crucial and can lead to different matrix decompositions, since it is performed randomly in many algorithms. In contrast, the step involving the final clustering assignment has received less attention from the research community. In fact, once $W$ and $H$ are computed, soft clustering approaches interpret each value in $W$ as the strength of association between objects and clusters, and hard clustering approaches assign each object $j$ to the cluster $k$, where:

$$k = \arg\max(W_{j1}, W_{j2}, \dots, W_{jk}) \tag{1}$$

This step is also crucial, since in hard clustering the assignments may have to be made by choosing among very similar (possibly equal) values, in which case Equation 1 can be inaccurate or even arbitrary. Furthermore, this approach does not guarantee that the final clustering is consistent, with the drawback that very similar objects can end up in different clusters. In fact, the clusters are assigned independently with this approach, and two different runs of the algorithm can result in different partitionings of the data, due to the random initializations.
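The sensitivity of Equation 1 to near-ties can be illustrated with a minimal sketch. The matrix values below are hypothetical and chosen only to show that two almost identical rows can receive different labels; `hard_assign` is an illustrative helper name, not part of any library.

```python
def hard_assign(W):
    """Return, for each row of W, the index of its largest entry (Equation 1)."""
    return [max(range(len(row)), key=lambda c: row[c]) for row in W]

# Hypothetical soft assignments for three objects and three clusters.
W = [
    [0.70, 0.20, 0.10],  # clear preference for cluster 0
    [0.34, 0.33, 0.33],  # nearly uniform row
    [0.33, 0.34, 0.33],  # almost the same row, different maximum
]
labels = hard_assign(W)
print(labels)
```

The second and third objects have practically the same association profile, yet the argmax places them in different clusters, because each row is labeled independently of the others.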

These limitations can be overcome by exploiting the relational information of the data and performing a consistent labeling. For this reason, in this paper we use a powerful tool derived from evolutionary game theory, which allows us to reorganize the clustering obtained with NMF, making it consistent with the structure of the data. With our approach we impose that the cluster membership has to be renegotiated for all the objects. To this end, we employ a dynamical system perspective, in which it is imposed that similar objects have to belong to similar clusters, so that the final clustering will be consistent with the structure of the data. This perspective has demonstrated its efficacy in different semantic categorization scenarios [7, 8], which involve a high number of interrelated categories and require the use of contextual and similarity information.

## II NMF Clustering

NMF is employed as a clustering algorithm in different applications. It has been successfully applied in “parts-of-whole” decomposition, object clustering, multimedia analysis, and DNA gene expression grouping. It is an appealing method because it can be used to perform object and feature clustering together. The generation of the factorized matrices starts from the assumption that the objects of a given dataset belong to $k$ clusters and that these clusters can be represented by the features of the matrix $W$, which denotes the relevance that each cluster has for each object. This description is very useful in soft clustering applications because an object can contain information about different clusters in different measure. For example, a text about the launch of a new car model into the market can contain information about economy, automotive or lifestyle, in different proportions. Hard clustering applications require choosing just one of these topics to partition the data, and this can be done by considering not only the information about the single text, but also the information of the other texts in the collection, in order to divide the data into coherent groups.

In many algorithms the initialization of the matrices $W$ and $H$ is done randomly, which has the drawback of always leading to different clustering results. In fact, NMF converges to local minima and for this reason has to be run several times in order to select the solution that best approximates the initial matrix. To overcome this limitation, different approaches have been proposed to find the best initializations, based on feature clustering and SVD techniques. These initializations allow NMF to always converge to the same solution. The approach based on feature clustering uses spherical $k$-means to partition the columns of the data matrix into $k$ clusters and selects the centroid of each cluster to initialize the corresponding column of the factor matrix. Nonnegative Double Singular Value Decomposition (NNDSVD) computes the singular triplets of the data matrix, forms the unit rank matrices using the singular vector pairs, extracts from them their positive section and singular triplets, and with this information initializes $W$ and $H$. This approach has been shown to be almost as good as that obtained with random initialization.

A different formulation of NMF as a clustering algorithm was proposed by Kuang et al. (SymNMF). The main difference with classical NMF approaches is that SymNMF takes a square nonnegative similarity matrix as input instead of a data matrix. It starts from the assumption that NMF was conceived as a dimension reduction technique and that this task is different from clustering. In fact, dimension reduction aims at finding a few basis vectors that approximate the data matrix, whereas clustering aims at partitioning the data points so that similarity is high among the elements of a cluster and low among the elements of different clusters. In this formulation a basis vector strictly represents a cluster.

Common approaches obtain an approximation of the input matrix by minimizing the Frobenius norm of the difference between the input matrix and $WH$, or the generalized Kullback-Leibler divergence between them, using multiplicative update rules or gradient methods.
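As an illustration of the multiplicative update rules mentioned above, the following is a minimal sketch of the Lee and Seung updates for the Frobenius objective. It uses plain Python lists so that it runs without dependencies; the names `X` (for the input matrix), `nmf`, `matmul`, `transpose` and `frobenius_error`, as well as the toy data, are assumptions made for this example and are not taken from the paper.

```python
import random

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frobenius_error(X, W, H):
    """Frobenius norm of X - WH."""
    WH = matmul(W, H)
    return sum((x - y) ** 2 for rx, ry in zip(X, WH) for x, y in zip(rx, ry)) ** 0.5

def nmf(X, k, iters=200, seed=0, eps=1e-9):
    """Factorize a nonnegative n x m matrix X into W (n x k) and H (k x m)."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    # Random positive initialization, as in many NMF algorithms.
    W = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T X) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, X), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        # W <- W * (X H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(X, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
    return W, H

# Hypothetical data: four objects forming two obvious groups.
X = [[1.0, 1.0, 0.0, 0.0],
     [2.0, 2.0, 0.0, 0.0],
     [0.0, 0.0, 3.0, 3.0],
     [0.0, 0.0, 1.0, 1.0]]
W, H = nmf(X, k=2)
print(frobenius_error(X, W, H))
```

Changing `seed` changes the starting point and may change the factorization that is reached, which is the dependence on initialization discussed in this section.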

## III Game Theory and Game Dynamics

Game theory was introduced by Von Neumann and Morgenstern in order to develop a mathematical framework able to model the essentials of decision making in interactive situations. In its normal-form representation, it consists of a finite set of players, a set of pure strategies $S_i$ for each player $i$, and a utility function $u_i$, which associates strategies to payoffs. Each player can adopt a strategy in order to play a game, and the utility function depends on the combination of strategies played at the same time by the players involved in the game, not just on the strategy chosen by a single player. An important assumption in game theory is that the players are rational and try to maximize the value of $u_i$. Furthermore, in non-cooperative games the players choose their strategies independently, considering what other players can play, and try to find the best strategy profile to employ in a game.

Nash equilibria represent the key concept of game theory and can be defined as those strategy profiles in which each strategy is a best response to the strategy of the co-player and no player has the incentive to unilaterally deviate from their decision, because there is no way to do better. The players can also play mixed strategies, which are probability distributions over pure strategies. A mixed strategy profile can be defined as a vector $x$ with one component per pure strategy, where each component $x_h$ denotes the probability that the player chooses its $h$-th pure strategy. Each mixed strategy corresponds to a point on the simplex and its corners correspond to pure strategies.

In a two-player game, a strategy profile can be defined as a pair $(p, q)$, where $p$ and $q$ are the mixed strategies of players $i$ and $j$, respectively. The expected payoff for this strategy profile is computed as:

$$u_i(p, q) = p \cdot A_i q, \qquad u_j(p, q) = q \cdot A_j p \tag{2}$$

where $A_i$ and $A_j$ are the payoff matrices of player $i$ and $j$, respectively.
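Equation 2 can be checked numerically with a small sketch. The payoff matrices below describe a hypothetical two-strategy coordination game, chosen only for illustration; `expected_payoff` is an illustrative helper name.

```python
def expected_payoff(p, A, q):
    """Compute p . (A q) for mixed strategies p and q and a payoff matrix A."""
    Aq = [sum(a * qj for a, qj in zip(row, q)) for row in A]
    return sum(pi * v for pi, v in zip(p, Aq))

# Hypothetical coordination game: both players are rewarded for matching.
A_i = [[1.0, 0.0], [0.0, 1.0]]
A_j = [[1.0, 0.0], [0.0, 1.0]]
p = [0.8, 0.2]  # player i mostly plays its first pure strategy
q = [1.0, 0.0]  # player j plays its first pure strategy

u_i = expected_payoff(p, A_i, q)
u_j = expected_payoff(q, A_j, p)
print(u_i, u_j)
```

Both payoffs depend on the pair of strategies, not on a single player's choice: moving more of the mass of `p` onto the first strategy raises the payoff of both players in this game.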

In evolutionary game theory we have a population of agents which play games repeatedly with their neighbors and update their beliefs on the state of the system, choosing their strategy according to what has and has not been effective in previous games, until the system converges. The strategy space of each player $i$ is defined as a mixed strategy profile $x_i$, as defined above. The payoff corresponding to a single strategy can be computed as:

$$u_i(e_i^h) = \sum_{j=1}^{n} (A_{ij} x_j)_h \tag{3}$$

and the average payoff is:

$$u_i(x) = \sum_{j=1}^{n} x_i^T A_{ij} x_j \tag{4}$$

where $n$ is the number of players with whom the games are played and $A_{ij}$ is the payoff matrix between players $i$ and $j$. The replicator dynamic equation is used in order to find those states which correspond to the Nash equilibria of the games,

$$x_h(t+1) = x_h(t) \frac{u(e^h, x)}{u(x, x)} \quad \forall h \in S \tag{5}$$

This equation allows better than average strategies to grow at each iteration, and we can consider each iteration of the dynamics as an inductive learning process, in which the players learn from the others how to play their best strategy in a determined context.

Fig. 1: The pipeline of the proposed game-theoretic refiner method: a dataset is clustered using NMF, obtaining a partition of the original data into k clusters. A pairwise similarity matrix A is constructed on the original set of data and the clustering assignments obtained with NMF. The output of NMF (W) and the matrix A are used to refine the assignments. The matrix W is also used to initialize the strategy space of the games. In red, the wrong assignment that is corrected after the refinement. Best viewed in color.
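A single step of the replicator update in Equation 5 can be sketched as follows. The payoff values are hypothetical and are held fixed for illustration, whereas in the actual dynamics they are recomputed from the other players' strategies at every iteration. The sketch assumes strictly positive payoffs, so that the average payoff is not zero.

```python
def replicator_step(x, payoffs):
    """One discrete-time replicator update of a mixed strategy x.

    payoffs[h] plays the role of u(e^h, x); their x-weighted mean is u(x, x).
    """
    average = sum(xh * uh for xh, uh in zip(x, payoffs))
    return [xh * uh / average for xh, uh in zip(x, payoffs)]

x = [0.5, 0.3, 0.2]
payoffs = [1.0, 2.0, 1.0]  # hypothetical: the second strategy pays better than average
x_next = replicator_step(x, payoffs)
print(x_next)
```

The update keeps the strategy on the simplex, and only the strategy whose payoff exceeds the average gains probability mass.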

## IV Our Approach

In this section we present the Game Theoretic Nonnegative Matrix Factorization (GTNMF), our approach to NMF clustering refinement. The pipeline of this method is depicted in Fig. 1. We extract the feature vectors of each object in a dataset; then, depending on the NMF algorithm used, we give as input to NMF either the feature vectors or a similarity matrix. GTNMF takes as input the matrix $W$ obtained with NMF and the similarity graph (see Section V-C) of the dataset to produce a consistent clustering of the data.

Each data point, in our formulation, is represented as a player that has to choose its cluster membership. The weighted graph measures the influence that each player has on the others. The matrix $W$ is used to initialize the strategy space of the players: each row of $W$ is normalized so that the strategy space of each player lies on the standard simplex, as required in a game theoretic framework (see Section III). The dynamics are not started at the center of the $k$-dimensional simplex, as is commonly done in unsupervised learning tasks, but at a different interior point, which corresponds to the solution point of NMF and does not compromise the convergence of the dynamics to Nash equilibria.

Now that we have the topology of the data and the strategy space of the game we can compute the Nash equilibria of the games according to equation (5). In each iteration of the system each player plays a game with its neighbors according to the similarity graph and the payoffs are calculated as follows:

$$u_i(e^h, s) = \sum_{j \in N_i} (a_{ij} s_j)_h \tag{6}$$

and

$$u_i(s) = \sum_{j \in N_i} s_i^T (a_{ij} s_j) \tag{7}$$

We assume that the payoff of player $i$ depends on the similarity that it has with player $j$, $a_{ij}$, and on its preferences, $s_j$. During each phase of the dynamics a process of selection allows strategies with higher payoff to emerge, and at the end of the process each player chooses its cluster according to these constraints. Since Equation 5 models a dynamical system, it requires some criteria to stop. In the experimental part of this work we used as stopping criteria a maximum number of iterations and a threshold on the Euclidean norm of the difference between the strategy space at two consecutive time steps.
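The refinement loop described in this section can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the rows of `W` are normalized by their sum, all players are updated synchronously, and the values of `max_iters` and `tol` are hypothetical, since the text does not report the thresholds. The toy matrices `W` and `A` are also invented for the example.

```python
def row_normalize(W):
    """Project each row of W onto the standard simplex by dividing by its sum."""
    return [[w / sum(row) for w in row] for row in W]

def refine(W, A, max_iters=1000, tol=1e-6):
    """Refine soft assignments W (n x k) with a similarity graph A (n x n)."""
    n, k = len(W), len(W[0])
    s = row_normalize(W)
    for _ in range(max_iters):
        new_s = []
        for i in range(n):
            # Equation 6: payoff of each pure strategy h, summed over the neighbors of i.
            u = [sum(A[i][j] * s[j][h] for j in range(n) if A[i][j] > 0)
                 for h in range(k)]
            # Equation 7: average payoff of player i.
            avg = sum(s[i][h] * u[h] for h in range(k))
            if avg > 0:
                # Equation 5: replicator update.
                new_s.append([s[i][h] * u[h] / avg for h in range(k)])
            else:
                new_s.append(list(s[i]))  # isolated player keeps its strategy
        # Euclidean norm of the change between two consecutive time steps.
        diff = sum((a - b) ** 2 for ra, rb in zip(s, new_s)
                   for a, b in zip(ra, rb)) ** 0.5
        s = new_s
        if diff < tol:
            break
    return s

# Hypothetical NMF output: player 2 is slightly closer to cluster 1.
W = [[0.9, 0.1], [0.8, 0.2], [0.45, 0.55], [0.2, 0.8], [0.1, 0.9], [0.1, 0.9]]
# Hypothetical similarity graph: players 0-2 and 3-5 form two tight groups.
A = [[0, 1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 0, 0.05, 0, 0],
     [0, 0, 0.05, 0, 1, 1],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 1, 1, 0]]
S = refine(W, A)
labels = [max(range(len(row)), key=lambda h: row[h]) for row in S]
print(labels)
```

In this toy case the argmax of `W` would put player 2 in cluster 1, while the refinement moves it to the cluster chosen by the players it is most similar to.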

## V Experimental Setup and Results

In this section, we show the performance of GTNMF on different text and image datasets, and compare it with standard NMF, NMF-S (same as NMF but with the similarity matrix as input instead of the features), SymNMF and NNDSVD, which use the standard maximization technique to obtain a hard clustering of the data. In Table II we refer to our approach as NMF-algorithm+GT, which means that GTNMF has been initialized with the particular NMF algorithm.

### V-A Datasets description

The evaluation of GTNMF has been conducted on datasets with different characteristics (see Table I). We used textual (Reuters, RCV1, NIPS) and image (COIL-20, ORL, Extended YaleB and PIE-Expr) datasets. In previous work, the objects belonging to small clusters were discarded in order to make the datasets more balanced, simplifying the task. We tested our method using this approach and also keeping the datasets as they are (without reduction), which leads to situations in which the same dataset can contain clusters with thousands of objects and clusters with just one object (e.g. RCV1).

### V-B Data preparation

The datasets have been processed as suggested in previous work. Given an $n \times m$ data matrix, the similarity matrix is constructed according to the type of dataset (textual or image). For textual datasets, each feature vector is normalized to have unit 2-norm and the cosine similarity between each pair of vectors is computed. For image datasets, each feature (column) is first normalized to lie in the range $[0, 1]$ and then the following kernel is applied: $\exp\left(-\frac{\|x_i - x_j\|_2^2}{\sigma_i \sigma_j}\right)$, where $\sigma_i$ is the Euclidean distance between $x_i$ and its 7-th nearest neighbor. In all cases the self-similarity of each point is set to zero.

The matrix is then sparsified, keeping only the $q$ nearest neighbors for each point. The parameter $q$ is set as follows and represents a theoretical bound that guarantees the connectedness of a graph:

$$q = \lfloor \log_2(n) \rfloor + 1 \tag{8}$$

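Let $N(i)$ denote the set of the $q$ nearest neighbors of point $i$; then the similarity between points $i$ and $j$ is kept if $i \in N(j)$ or $j \in N(i)$, and set to zero otherwise. The matrix is then normalized in a normalized-cut fashion, dividing each entry by the square root of the product of the degrees of the two points, where the degree of a point is the sum of its row. The normalized matrix is given as input to all the compared methods, except for NMF, to which the data matrix is given. Further details on this phase can be found in the referenced work.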
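The preparation steps for textual data can be sketched as follows, assuming the symmetric rule (an edge is kept if either point is among the other's nearest neighbors) and the usual degree-based normalized-cut scaling. The function names and the toy data are hypothetical, and plain Python lists are used so that the sketch runs without dependencies.

```python
import math

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity of the rows of X, with a zero diagonal."""
    unit = [[v / math.sqrt(sum(t * t for t in row)) for v in row] for row in X]
    n = len(unit)
    return [[0.0 if i == j else sum(a * b for a, b in zip(unit[i], unit[j]))
             for j in range(n)] for i in range(n)]

def knn_sparsify(E, q):
    """Keep E[i][j] only if j is among the q nearest neighbors of i, or vice versa."""
    n = len(E)
    neighbors = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: E[i][j], reverse=True)
        neighbors.append(set(order[:q]))
    return [[E[i][j] if i != j and (j in neighbors[i] or i in neighbors[j]) else 0.0
             for j in range(n)] for i in range(n)]

def normalized_cut_scaling(A):
    """Divide each entry by the square root of the product of the two degrees."""
    d = [sum(row) for row in A]
    n = len(A)
    return [[A[i][j] / math.sqrt(d[i] * d[j]) if A[i][j] > 0 else 0.0
             for j in range(n)] for i in range(n)]

# Hypothetical feature vectors for five documents.
X = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0],
     [0.0, 0.9, 0.1], [0.5, 0.5, 0.2]]
n = len(X)
q = math.floor(math.log2(n)) + 1  # Equation 8
A = normalized_cut_scaling(knn_sparsify(cosine_similarity_matrix(X), q))
print(q)
```

For image data, only `cosine_similarity_matrix` would need to be replaced by the kernel described above; the sparsification and scaling steps stay the same.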

### V-C Games graph

Sec. V-B explained how to create the similarity matrix for NMF. The same methodology has been used to create the payoff matrix for GTNMF, with the only difference that, in this case, we exploit the partitioning obtained with NMF in order to identify the expected size of the clusters. The assumption here is that the clustering obtained via NMF provides a good insight into the size of the final clusters, and according to this information a proper number of neighbors (see Equation 8) can be selected. A cluster can be considered as a fully connected subgraph, and thus the number of neighbors of each element in the cluster should be at least the bound of Equation 8, computed on the size of the cluster, to guarantee the connectedness of the cluster itself. The number of neighbors is thus chosen based on the same principle but, instead of taking into account the entire set of points (as in Sec. V-B), we focus only on the subsets induced by the NMF clustering. This results in a different number of neighbors $q_i$ for each point in the dataset, based on the following rule:

$$q_i = \lfloor \log_2(|C|) \rfloor + 1 \tag{9}$$

where $|C|$ is the cardinality of the cluster $C$ to which the $i$-th element belongs. For obvious reasons $q_i \le q$, thus concentrating only on the potential number of neighbors that belong to the cluster and not on the entire graph, because we are doing a refinement. From a game-theoretic perspective this means focusing the games only among a set of similar players which are likely to belong to the same cluster.
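Equation 9 can be sketched in a few lines. The cluster labels below are hypothetical and stand for the hard assignments obtained from NMF; `neighbors_per_point` is an illustrative helper name.

```python
import math
from collections import Counter

def neighbors_per_point(labels):
    """Return q_i = floor(log2(|C|)) + 1 for the cluster C of each point."""
    sizes = Counter(labels)
    return [math.floor(math.log2(sizes[c])) + 1 for c in labels]

# Hypothetical NMF partition: clusters of size 8, 3 and 1.
nmf_labels = [0] * 8 + [1] * 3 + [2]
q_i = neighbors_per_point(nmf_labels)
print(q_i)
```

Points in large clusters play with more neighbors than points in small ones, and a singleton cluster still keeps one neighbor, so no player is isolated from the games.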

### V-D Evaluation measures

Our approach has been validated using two different measures, accuracy (AC) and normalized mutual information (NMI). AC is calculated as $\frac{1}{n}\sum_{i=1}^{n} \delta(\alpha_i, map(l_i))$, where $n$ denotes the total number of documents in the dataset, $\delta(x, y)$ equals 1 if $x$ and $y$ are clustered in the same class, and $map(l_i)$ maps each cluster label $l_i$ to the equivalent label in the benchmark. The best mapping is computed using the Kuhn-Munkres algorithm. AC counts the number of correct cluster assignments. NMI indicates the level of agreement between the clustering provided by the ground truth and the clustering produced by a clustering algorithm. The mutual information (MI) between the two clusterings is computed as,

$$MI(C, C') = \sum_{c_i \in C,\, c'_j \in C'} p(c_i, c'_j) \cdot \log_2 \frac{p(c_i, c'_j)}{p(c_i) \cdot p(c'_j)} \tag{10}$$

where $p(c_i)$ and $p(c'_j)$ are the probabilities that a document of the corpus belongs to cluster $c_i$ and $c'_j$, respectively, and $p(c_i, c'_j)$ is the probability that the selected document belongs to $c_i$ as well as $c'_j$ at the same time. The MI is then normalized with the following equation,

$$NMI(C, C') = \frac{MI(C, C')}{\max(H(C), H(C'))} \tag{11}$$

where $H(C)$ and $H(C')$ are the entropies of $C$ and $C'$, respectively.
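Both measures can be sketched with the standard library. Note that the best label mapping is found here by brute force over permutations, which stands in for the Kuhn-Munkres algorithm and is only practical for a handful of clusters; the function names and the toy labelings are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations

def accuracy(truth, pred):
    """AC under the best one-to-one mapping of predicted labels to true labels."""
    p_labels = sorted(set(pred))
    t_labels = sorted(set(truth))
    # Pad with None so that every predicted label can be mapped to something.
    t_labels += [None] * max(0, len(p_labels) - len(t_labels))
    best = 0
    for perm in permutations(t_labels, len(p_labels)):
        mapping = dict(zip(p_labels, perm))
        hits = sum(1 for t, p in zip(truth, pred) if mapping[p] == t)
        best = max(best, hits)
    return best / len(truth)

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def nmi(truth, pred):
    """Equations 10 and 11: mutual information normalized by the larger entropy."""
    n = len(truth)
    joint = Counter(zip(truth, pred))
    pt, pp = Counter(truth), Counter(pred)
    mi = sum((c / n) * math.log2((c / n) / ((pt[t] / n) * (pp[p] / n)))
             for (t, p), c in joint.items())
    denom = max(entropy(truth), entropy(pred))
    return mi / denom if denom > 0 else 0.0

truth = [0, 0, 1, 1, 2, 2]
pred = [1, 1, 0, 2, 2, 2]  # hypothetical clustering with one misplaced object
print(accuracy(truth, pred), nmi(truth, pred))
```

Because AC depends on a one-to-one mapping while NMI does not, the two measures can move in opposite directions when clusters are merged or split.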

### V-E Evaluation

The results of our evaluation are shown in Table II, where we report the mean and standard deviation of 20 independent runs. For NNDSVD the experiments are run only once, since it always converges to the same solution. The performance of GTNMF is in most cases higher than that of the different NMF algorithms. In particular, we can notice that despite the different settings (textual/image datasets) our algorithm is able to improve the NMI performance in 33/36 cases with a maximum gain of  (which is quite impressive) and a maximum loss of . A constant gain in the NMI means, in practice, that the algorithm is able to partition the dataset better, making the final clustering closer to the ground truth. In terms of AC, on  cases the method improves on the compared methods, with a maximum gain of  and a maximum loss of . It is worth noting that the negative results are very low and in most cases the corresponding number of incorrect reallocations is low; in fact, in the NIPS dataset  means  elements, and in COIL-20  corresponds to  elements.

Fig. 2: On the left side a confusion matrix produced by NMF on the ORL dataset and on the right side the one produced by our method.

The mean gains for NMI and AC are  and , respectively, while the mean losses are  and . In some cases we obtain a loss in NMI and a gain in AC, for example on ExtYaleB with NMF. In this case the similarity matrix given as input to GTNMF tends to concentrate more objects in the same cluster, because the dataset is not balanced. In these situations a big cluster can attract many objects, increasing the probability of good reallocations, which results in an increase in AC and in a potentially wrong partitioning of the data. Conversely, in some experiments we have a loss in AC and a gain in NMI. For example, on PIE-Expr we noticed that we are able to put together many objects that the other approaches tend to keep separated, but in this particular case GTNMF collected in the same cluster all the objects belonging to four similar clusters, and for this reason there was a loss in accuracy (see Fig. 3).

We can see that the results of our method on well balanced datasets (ORL, COIL-20) are almost always good. Also on very unbalanced datasets, such as Reuters, we always have good performance, whatever method is used. These datasets better depict real life situations, and the improvements over them are due to the fact that in these cases it is necessary to exploit the geometry of the data in order to obtain a good partitioning.

Fig. 3: In a and b the confusion matrices produced by NMF and GTNMF on PIE-Expr. In c the standard deviation of the objects merged together by GTNMF and in d the standard deviation of two random clusters combined together.

A positive and a negative case study are shown in Fig. 2 and Fig. 3, respectively. In Fig. 2 the confusion matrix obtained with GTNMF is less sparse and more concentrated on the main diagonal. Given the same cluster Id, the NMF method agglomerates different clusters (red arrows) while, after the refinement, the elements are moved to the correct cluster. In Fig. 3b the algorithm tends to agglomerate the elements in a single cluster (second column of the matrix). This can be explained by how the similarity matrix is composed and by the nature of the data: in Fig. 3c the standard deviation of the images in the agglomerated cluster is reported, and one can notice that it is very low, meaning that all the faces in that cluster are very similar to each other. To give a counterexample, we report in Fig. 3d the standard deviation of two random clusters joined together. It is straightforward to notice that the standard deviation is higher than in the previous example, meaning that the elements within those two clusters are highly dissimilar in nature and thus easily separable.

## VI Conclusion

In this work we presented GTNMF, a game theoretic model to improve the clustering results obtained with NMF, going beyond the classical technique used to make the final clustering assignments. The matrix $W$ obtained with NMF can have a high entropy, which makes the choice of a cluster very difficult in many cases. With our approach we try to reduce the uncertainty in the matrix using evolutionary dynamics and taking into account contextual information to perform a consistent labeling of the data. In fact, with our method similar objects are assigned to similar clusters, taking into account the initial solution obtained with NMF.

We conducted an extensive analysis of the performance of our method and compared it with different NMF formulations on datasets with different features and of different kinds. The results of the evaluation demonstrated that our approach is almost always able to improve the results of NMF and that, when it has negative results, those results are practically not significant. The algorithm is quite general thanks to the adaptive auto-tuning of the payoff matrix and can deal with balanced and completely unbalanced datasets.

As future work we are planning to use different initialization of the strategy space, to use new similarity functions to construct the games graph, to apply this method to different problems and to different clustering algorithms.

## Acknowledgment

This work was supported by Samsung Global Research Outreach Program.

## References

• [1] W. Xu, X. Liu, and Y. Gong, “Document clustering based on non-negative matrix factorization,” in Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 2003, pp. 267–273.
• [2] D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, no. 6755, pp. 788–791, 1999.
• [3] S. Wild, J. Curry, and A. Dougherty, “Improving non-negative matrix factorizations through structured initialization,” Pattern Recognition, vol. 37, no. 11, pp. 2217–2232, 2004.
• [4] C. Boutsidis and E. Gallopoulos, “SVD based initialization: A head start for nonnegative matrix factorization,” Pattern Recognition, vol. 41, no. 4, pp. 1350–1362, 2008.
• [5] C.-J. Lin, “Projected gradient methods for nonnegative matrix factorization,” Neural Computation, vol. 19, no. 10, pp. 2756–2779, 2007.
• [6] Z.-Y. Zhang, T. Li, C. Ding, and J. Tang, “An NMF-framework for unifying posterior probabilistic clustering and probabilistic latent semantic indexing,” Communications in Statistics-Theory and Methods, vol. 43, no. 19, pp. 4011–4024, 2014.
• [7] R. Tripodi and M. Pelillo, “A game-theoretic approach to word sense disambiguation,” Computational Linguistics, in press.
• [8] R. Tripodi and M. Pelillo, “Document Clustering Games in Static and Dynamic Scenarios,” ArXiv e-prints, Jul. 2016.
• [9] C. Ding, T. Li, and W. Peng, “Nonnegative matrix factorization and probabilistic latent semantic indexing: Equivalence chi-square statistic, and a hybrid method,” in Proceedings of the national conference on artificial intelligence, vol. 21, no. 1. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999, 2006, p. 342.
• [10] Y. Wang, Y. Jia, C. Hu, and M. Turk, “Non-negative matrix factorization framework for face recognition,” International Journal of Pattern Recognition and Artificial Intelligence, vol. 19, no. 04, pp. 495–511, 2005.
• [11] J. C. Caicedo, J. BenAbdallah, F. A. González, and O. Nasraoui, “Multimodal representation, indexing, automated annotation and retrieval of image collections via non-negative matrix factorization,” Neurocomputing, vol. 76, no. 1, pp. 50–60, 2012.
• [12] Z.-Y. Zhang, T. Li, C. Ding, X.-W. Ren, and X.-S. Zhang, “Binary matrix factorization for analyzing gene expression data,” Data Mining and Knowledge Discovery, vol. 20, no. 1, pp. 28–52, 2010.
• [13] D. Kuang, S. Yun, and H. Park, “SymNMF: nonnegative low-rank approximation of a similarity matrix for graph clustering,” Journal of Global Optimization, vol. 62, no. 3, pp. 545–574, 2015.
• [14] A. Berman and R. J. Plemmons, “Nonnegative matrices,” The Mathematical Sciences, Classics in Applied Mathematics, vol. 9, 1979.
• [15] D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factorization,” in Advances in neural information processing systems, 2001, pp. 556–562.
• [16] J. Von Neumann and O. Morgenstern, Theory of Games and Economic Behavior (60th Anniversary Commemorative Edition). Princeton University Press, 1944.
• [17] P. D. Taylor and L. B. Jonker, “Evolutionary stable strategies and game dynamics,” Mathematical Biosciences, vol. 40, no. 1, pp. 145–156, 1978.
• [18] J. W. Weibull, Evolutionary game theory. MIT Press, 1997.
• [19] J. Kim, Y. He, and H. Park, “Algorithms for nonnegative matrix and tensor factorizations: a unified view based on block coordinate descent framework,” Journal of Global Optimization, vol. 58, no. 2, pp. 285–319, 2014.
• [20] L. Zelnik-Manor and P. Perona, “Self-tuning spectral clustering,” in Advances in neural information processing systems, 2004, pp. 1601–1608.
• [21] U. Von Luxburg, “A tutorial on spectral clustering,” Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.
• [22] L. Lovasz and M. Plummer, “Matching theory north-holland mathematics studies, 121,” Annals of Discrete Mathematics, vol. 29, 1986.