1 Introduction
With the rapid development of data acquisition technology, large amounts of high-dimensional data, such as online documents, medical images, video sequences, traffic data and health data, are accumulating. Dimensionality reduction [1] has therefore become an essential step in data mining. Vector quantization (VQ) [2], singular value decomposition (SVD) [3], principal component analysis (PCA) [4], independent component analysis (ICA) [5] and concept factorization (CF) [6], [7] are among the most commonly used dimensionality reduction methods. However, because their factors may contain negative components, these methods often lack a reasonable interpretation in practical problems, so developing nonnegative factorization methods is a valuable research direction. Researchers have therefore investigated the nonnegative matrix factorization (NMF) proposed by Lee and Seung [8], which represents a nonnegative matrix as the product of two low-rank nonnegative matrices and thus learns a parts-based representation. NMF has been applied in many fields, including clustering, dimensionality reduction and blind source separation [9, 10, 11, 12].
In recent years, many variants of NMF have been proposed to extend its applicable range. For example, Ding et al. [13] learned new low-dimensional features with a convenient clustering interpretation using SemiNMF, which allows the data matrix and the base matrix to have mixed signs [14, 15]; they also developed ConvexNMF by restricting the base vectors to convex combinations of the data points. Considering that orthogonality constraints induce sparsity, Ding et al. [16] proposed an orthogonal NMF (ONMF) method, which adds orthogonality constraints to the base and representation matrices. ONMF can be solved in two ways: the Lagrangian method [16] and the natural gradient on the Stiefel manifold [17].
Blondel et al. [18] conducted an extended study of NMF by assigning a predetermined weight to each attribute of each data point, demonstrating that weights can provide important flexibility by better emphasizing certain features in image approximation problems. Other researchers then conducted a series of studies on weighted NMF (WNMF). For example, Kim and Choi [19] proposed a new WNMF method to process an incomplete data matrix with missing entries, which incorporates binary weights into the NMF multiplicative updates. Li and Wu [20] proposed a weighted nonnegative matrix tri-factorization method for co-clustering that weights each row and column of the original matrix in a specific way and incorporates normalized-cut information into the optimization model. Then, to assign labels to images, Kalayeh et al. [21] proposed a weighted extension of multi-view NMF, which imposes a consistency constraint on the representation matrices between different features and uses a weight matrix to mitigate data set imbalance. Dai et al. [22] applied the WNMF method to image recovery, and experimental results showed that, particularly for data corrupted by salt-and-pepper noise, the method could remove noise effectively and also provide a more accurate subspace representation.
We refer to these methods as "hard WNMF"; their primary drawback is that the weights they rely on must be predetermined. Making the weights of NMF variable is challenging, and this challenge can be resolved by designing interpretable and computable weights.
This paper presents an entropy weighted NMF (EWNMF) method that assigns a weight to indicate the importance of each attribute of each data point in matrix factorization. The entropy of these weights is then used to regularize the cost function so that an easily computable solution is obtained; this makes the weights fall within [0, 1] with a summation of one, and thus the weights can be interpreted as the probability of the contribution of an attribute of one data point to NMF. Experimental results on several real data sets show that EWNMF performs better than other NMF variants.
2 Methodology
2.1 Related research
NMF is a matrix factorization method that focuses on data matrices with nonnegative elements, which can reveal hidden structures and patterns from generally redundant data. We will review the standard NMF as follows.
In this paper, matrices are denoted by capital letters. For a matrix $A$, $a_{j}$, $a^{i}$ and $a_{ij}$ denote the $j$th column, the $i$th row and the $(i,j)$th element of $A$, respectively; the Frobenius norm is represented as $\|\cdot\|_{F}$; $\odot$ and $\oslash$ mean the item-by-item multiplication and division of two matrices, respectively; $A^{T}$ denotes the transpose of $A$; $A \geq 0$ means that all elements of $A$ are equal to or larger than 0.
The expression of NMF is:

$$X \approx WH, \quad (1)$$

where $X \in \mathbb{R}_{+}^{m \times n}$ denotes the given nonnegative matrix in which each column is a data point. The goal of NMF is to find two low-dimensional nonnegative matrices: $W \in \mathbb{R}_{+}^{m \times k}$ is called the base matrix, and $H \in \mathbb{R}_{+}^{k \times n}$ is called the representation matrix, whose product can approximate the original matrix $X$ [23, 24], where $k \ll \min(m, n)$.
There are different standards to measure the quality of the decomposition. Lee and Seung proposed using the square of the Euclidean distance and the Kullback-Leibler divergence. In this paper, the Euclidean distance is used, and the objective is expressed as:

$$\min_{W \geq 0,\, H \geq 0} F(W, H) = \|X - WH\|_{F}^{2}. \quad (2)$$
To alternately minimize $W$ and $H$ in Eq. (2), the construction of an auxiliary function is important for determining the iterative update rules.
If the function $G(h, h')$ satisfies the following conditions:

$$G(h, h') \geq F(h), \qquad G(h, h) = F(h), \quad (3)$$

where $h'$ is a given value, then $G(h, h')$ is an auxiliary function of $F(h)$ at $h'$.
Then, we can draw the following conclusion.
If $G(h, h^{t})$ is an auxiliary function of $F(h)$, then under the update rule:

$$h^{t+1} = \arg\min_{h} G(h, h^{t}), \quad (4)$$

the function $F(h)$ does not increase.
The conditions satisfied by the auxiliary function make this proof immediate, because:

$$F(h^{t+1}) \leq G(h^{t+1}, h^{t}) \leq G(h^{t}, h^{t}) = F(h^{t}). \quad (5)$$

Thus, whenever the auxiliary function is minimized, the original function does not increase.
Then, we construct the update rules for the NMF problem. We consider $H$ first, where $W$ and $X$ are given, and treat each column $h$ of $H$ separately, so that $F(h) = \|x - Wh\|^{2}$ for the corresponding column $x$ of $X$. Let $K(h^{t})$ be the diagonal matrix with $K_{aa}(h^{t}) = (W^{T}Wh^{t})_{a}/h^{t}_{a}$; of course, $K(h^{t}) \geq 0$, and $K(h^{t}) - W^{T}W$ is positive semidefinite. Therefore, the auxiliary function of the standard NMF is:

$$G(h, h^{t}) = F(h^{t}) + (h - h^{t})^{T}\nabla F(h^{t}) + \frac{1}{2}(h - h^{t})^{T}K(h^{t})(h - h^{t}). \quad (6)$$
Because the function $G$ is separable in the elements of $h$, it can be easily minimized. We thus take the partial derivative of Eq. (6), set it to zero, and obtain the following update rule:

$$H \leftarrow H \odot (W^{T}X) \oslash (W^{T}WH). \quad (7)$$
The method of constructing the auxiliary function of $W$ is similar to that of $H$. Then, the following update of $W$ is obtained:

$$W \leftarrow W \odot (XH^{T}) \oslash (WHH^{T}). \quad (8)$$
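The alternating updates in Eqs. (7) and (8) can be illustrated with a minimal pure-Python sketch on a tiny matrix; this is not the authors' MATLAB code, and the helper names (`matmul`, `frob_err`, `nmf`) are ours. A small constant `eps` guards the divisions, as is common in practice.

```python
import random

def matmul(A, B):
    """Plain-Python matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def frob_err(X, W, H):
    """Squared Frobenius error ||X - WH||_F^2, the objective in Eq. (2)."""
    WH = matmul(W, H)
    return sum((X[i][j] - WH[i][j]) ** 2
               for i in range(len(X)) for j in range(len(X[0])))

def nmf(X, k, iters=200, eps=1e-9):
    random.seed(0)
    m, n = len(X), len(X[0])
    # positive random initialization keeps the multiplicative updates nonnegative
    W = [[random.random() + 0.1 for _ in range(k)] for _ in range(m)]
    H = [[random.random() + 0.1 for _ in range(n)] for _ in range(k)]
    for _ in range(iters):
        # H <- H .* (W^T X) ./ (W^T W H), Eq. (7)
        Wt = transpose(W)
        num, den = matmul(Wt, X), matmul(matmul(Wt, W), H)
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(n)]
             for a in range(k)]
        # W <- W .* (X H^T) ./ (W H H^T), Eq. (8)
        Ht = transpose(H)
        num, den = matmul(X, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(m)]
    return W, H

X = [[1.0, 2.0, 0.5], [2.0, 4.0, 1.0], [0.5, 1.0, 0.3]]  # nearly rank-1
W, H = nmf(X, k=1)
```

Because the example matrix is almost rank-1, the rank-1 factorization drives the error in Eq. (2) close to zero while both factors stay nonnegative.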
2.2 Proposed method
Different from previous methods, an optimizable weight matrix $T$ is used to measure the importance of the attributes in matrix factorization. Fig. 1 illustrates the proposed idea: if a hat appears in only one image, the corresponding attributes of the hat will degrade the result of NMF, and we should eliminate or weaken their importance. Thus, it may be necessary to assign different importances to the attributes of each data point under the following constraint, which leads to a new optimization model:

$$\min_{W \geq 0,\, H \geq 0,\, T} \sum_{i=1}^{m}\sum_{j=1}^{n} t_{ij}\big(x_{ij} - (WH)_{ij}\big)^{2}, \quad \text{s.t.}\ \sum_{i=1}^{m} t_{ij} = 1,\ t_{ij} \geq 0. \quad (9)$$
We first consider the new variable $T$, which can be solved in an alternating optimization manner. For fixed $W$ and $H$, $T$ is very easy to solve: $t_{ij} = 1$ if $e_{ij} = \min_{p} e_{pj}$, and $t_{ij} = 0$ otherwise, where $e_{ij} = (x_{ij} - (WH)_{ij})^{2}$.^1 This demonstrates the simple fact that only one element of each column of $T$ is 1 and the others are 0, which is incompatible with real problems. To address this issue, we apply an entropy regularizer to penalize the cost function of NMF so that the weights lie in the range [0, 1] instead of being exactly 0 or 1; the regularizer uses information entropy to measure the uncertainty of the weights. The new optimization problem is rewritten as follows:

^1 The update rule can be easily explained by an example: for the residuals {3, 1, 2}, the weight vector (0, 1, 0) corresponds to the minimum value 1. This process is similar to the computation of the weights in the k-means algorithm.
$$\min_{W \geq 0,\, H \geq 0,\, T} \sum_{i=1}^{m}\sum_{j=1}^{n} t_{ij} e_{ij} + \gamma \sum_{i=1}^{m}\sum_{j=1}^{n} t_{ij} \ln t_{ij}, \quad \text{s.t.}\ \sum_{i=1}^{m} t_{ij} = 1,\ t_{ij} \geq 0, \quad (10)$$

where $e_{ij} = (x_{ij} - (WH)_{ij})^{2}$ and $\gamma > 0$ is a given hyperparameter. The first term in Eq. (10) is the sum of weighted errors, and the second term is the negative entropy of the weights. The original cost function in Eq. (9) allows only one attribute of each data point to be involved in the feature representation, whereas the entropy regularizer stimulates more attributes to contribute. This problem can be solved by a simple algorithm, which is based on the following proposition:
Given the matrices $W$ and $H$, $T$ in Eq. (10) is minimized when:

$$t_{ij} = \frac{\exp(-e_{ij}/\gamma)}{\sum_{p=1}^{m} \exp(-e_{pj}/\gamma)}. \quad (11)$$
We construct the Lagrange function of Eq. (10) with respect to $T$ as:

$$L(T, \lambda) = \sum_{i=1}^{m}\sum_{j=1}^{n} t_{ij} e_{ij} + \gamma \sum_{i=1}^{m}\sum_{j=1}^{n} t_{ij} \ln t_{ij} + \sum_{j=1}^{n} \lambda_{j}\Big(\sum_{i=1}^{m} t_{ij} - 1\Big), \quad (12)$$

where $\lambda = (\lambda_{1}, \ldots, \lambda_{n})^{T}$ is a vector containing the Lagrange multipliers corresponding to the constraints.
By setting the gradients of Eq. (12) with respect to $\lambda_{j}$ and $t_{ij}$ to zero, we obtain the following equation system:

$$\frac{\partial L}{\partial \lambda_{j}} = \sum_{i=1}^{m} t_{ij} - 1 = 0, \quad (13)$$

$$\frac{\partial L}{\partial t_{ij}} = e_{ij} + \gamma(\ln t_{ij} + 1) + \lambda_{j} = 0. \quad (14)$$
From Eq. (14), we know that:

$$t_{ij} = \exp\Big(-\frac{e_{ij} + \gamma + \lambda_{j}}{\gamma}\Big) = \exp\Big(-\frac{\gamma + \lambda_{j}}{\gamma}\Big)\exp\Big(-\frac{e_{ij}}{\gamma}\Big). \quad (15)$$
Substituting Eq. (15) into Eq. (13), we have:

$$\exp\Big(-\frac{\gamma + \lambda_{j}}{\gamma}\Big)\sum_{i=1}^{m}\exp\Big(-\frac{e_{ij}}{\gamma}\Big) = 1. \quad (16)$$
It follows that:

$$\exp\Big(-\frac{\gamma + \lambda_{j}}{\gamma}\Big) = \frac{1}{\sum_{i=1}^{m}\exp(-e_{ij}/\gamma)}. \quad (17)$$
Substituting this expression into Eq. (15), we find that:

$$t_{ij} = \frac{\exp(-e_{ij}/\gamma)}{\sum_{p=1}^{m}\exp(-e_{pj}/\gamma)}. \quad (18)$$
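The closed form in Eq. (18) is a column-wise softmax over $-e_{ij}/\gamma$, and letting $\gamma \to 0$ recovers the hard argmin weights discussed above. A hedged sketch (function and variable names are ours, not the authors'):

```python
import math

def entropy_weights(E, gamma):
    """E[i][j] = (x_ij - (WH)_ij)^2; returns T whose columns sum to 1, Eq. (18)."""
    m, n = len(E), len(E[0])
    T = [[0.0] * n for _ in range(m)]
    for j in range(n):
        # column-wise softmax over -e_ij / gamma
        col = [math.exp(-E[i][j] / gamma) for i in range(m)]
        s = sum(col)
        for i in range(m):
            T[i][j] = col[i] / s
    return T

E = [[3.0], [1.0], [2.0]]                 # the residuals from the footnote example
T_soft = entropy_weights(E, gamma=1.0)    # graded weights in (0, 1)
T_hard = entropy_weights(E, gamma=0.1)    # small gamma approaches the hard argmin
```

With `gamma=1.0` the smallest residual receives the largest (but not exclusive) weight; with a small `gamma` almost all mass concentrates on the minimum, matching the k-means-like hard assignment.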
Then, we can solve $W$ and $H$ with $T$ fixed, which is similar to the standard NMF method. For example, for each column $h$ of $H$, with corresponding columns $x$ of $X$ and $t$ of $T$ and $D_{t} = \mathrm{diag}(t)$, the weighted error is $F(h) = (x - Wh)^{T}D_{t}(x - Wh)$, and we can construct the following auxiliary function about $H$, where $K(h^{t})$ is diagonal with $K_{aa}(h^{t}) = (W^{T}D_{t}Wh^{t})_{a}/h^{t}_{a}$:

$$G(h, h^{t}) = F(h^{t}) + (h - h^{t})^{T}\nabla F(h^{t}) + \frac{1}{2}(h - h^{t})^{T}K(h^{t})(h - h^{t}). \quad (19)$$
Setting the partial derivative of $G$ to zero yields the following update rule:

$$H \leftarrow H \odot \big(W^{T}(T \odot X)\big) \oslash \big(W^{T}(T \odot (WH))\big). \quad (20)$$
Similarly, we can also easily obtain the update rule for $W$ as follows:

$$W \leftarrow W \odot \big((T \odot X)H^{T}\big) \oslash \big((T \odot (WH))H^{T}\big). \quad (21)$$
The update rules for $W$ and $H$ are similar to those of the existing WNMF methods in [21], [22]. Optimizing EWNMF thus alternates the updates of $T$ in Eq. (11), $H$ in Eq. (20) and $W$ in Eq. (21) until convergence.
2.3 Extensions
The proposed entropy weighted method can be applied not only to the standard NMF based on the square of the Euclidean distance but also to variants based on the KL divergence, the $\alpha$-divergence and other existing NMFs, such as ONMF, SemiNMF, ConvexNMF, etc., which demonstrates its good compatibility.
For example, incorporating the entropy regularizer into KL-NMF, the cost function is expressed as:

$$\min_{W \geq 0,\, H \geq 0,\, T} \sum_{i=1}^{m}\sum_{j=1}^{n} t_{ij}\Big(x_{ij}\ln\frac{x_{ij}}{(WH)_{ij}} - x_{ij} + (WH)_{ij}\Big) + \gamma\sum_{i=1}^{m}\sum_{j=1}^{n} t_{ij}\ln t_{ij}, \quad \text{s.t.}\ \sum_{i=1}^{m}t_{ij}=1,\ t_{ij} \geq 0, \quad (22)$$

where $W$ and $H$ can be derived from the above method, and the update of $T$ is then:

$$t_{ij} = \frac{\exp(-e_{ij}/\gamma)}{\sum_{p=1}^{m}\exp(-e_{pj}/\gamma)}, \qquad e_{ij} = x_{ij}\ln\frac{x_{ij}}{(WH)_{ij}} - x_{ij} + (WH)_{ij}. \quad (23)$$
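Only the residual definition changes when the divergence changes; the softmax form of the weights carries over. A sketch of the KL-based weight computation in Eqs. (22)-(23), with helper names of our choosing:

```python
import math

def kl_weights(X, WH, gamma):
    """Weights of Eq. (23): e_ij is the element-wise KL term instead of the
    squared residual; X and WH must be strictly positive."""
    m, n = len(X), len(X[0])
    E = [[X[i][j] * math.log(X[i][j] / WH[i][j]) - X[i][j] + WH[i][j]
          for j in range(n)] for i in range(m)]
    T = [[math.exp(-E[i][j] / gamma) for j in range(n)] for i in range(m)]
    for j in range(n):  # normalize each column to sum to one
        s = sum(T[i][j] for i in range(m))
        for i in range(m):
            T[i][j] /= s
    return T

X  = [[1.0], [2.0]]
WH = [[1.0], [1.0]]   # second entry is badly approximated
T = kl_weights(X, WH, gamma=1.0)
```

The badly approximated entry receives the smaller weight, exactly as in the Euclidean case.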
Applying the entropy regularizer to the $\alpha$-divergence-based NMF method, the cost function is:

$$\min_{W \geq 0,\, H \geq 0,\, T} \sum_{i=1}^{m}\sum_{j=1}^{n} t_{ij}\, d_{\alpha}\big(x_{ij}, (WH)_{ij}\big) + \gamma\sum_{i=1}^{m}\sum_{j=1}^{n} t_{ij}\ln t_{ij}, \quad \text{s.t.}\ \sum_{i=1}^{m}t_{ij}=1,\ t_{ij}\geq 0, \quad (24)$$

where $d_{\alpha}(x, y) = \frac{1}{\alpha(\alpha-1)}\big(x^{\alpha}y^{1-\alpha} - \alpha x + (\alpha - 1)y\big)$ and $\alpha$ is a given value. The update of $T$ is as follows:

$$t_{ij} = \frac{\exp\big(-d_{\alpha}(x_{ij}, (WH)_{ij})/\gamma\big)}{\sum_{p=1}^{m}\exp\big(-d_{\alpha}(x_{pj}, (WH)_{pj})/\gamma\big)}. \quad (25)$$
The entropy weighted method can also be popularized to many other NMFs, which is similar to the above derivations, and thus, we omit them.
3 Experiments
3.1 Experimental description
Experiments were performed on an HP Compaq PC with a 3.40 GHz Core i7-6700 CPU and 16 GB of memory, and all the methods were implemented in MATLAB. We compare the performance of the proposed method with NMF [8], ONMF [16], SemiNMF [13] and ConvexNMF [13] on five public data sets: the Yale, UMISTface, Caltech101, GTFD and TDT2 data sets.
The Yale face data set [25] was created by Yale University and contains images of 15 people. Each person has 11 face images with different expressions, postures and lighting. The data set has a total of 165 images, and each has a size of 32 × 32 pixels.
The UMISTface data set [26] was established by the University of Manchester, UK. The data set has a total of 1012 images of 20 people, each captured from different angles and in different poses. In this experiment, each image is resized to 32 × 32 pixels.
The Caltech101 data set [27] contains 9144 images split between 101 different object categories, as well as an additional background/clutter category. This data set has approximately 40 to 800 images per category, and most categories have approximately 50 images. The size of each image is 32 × 32 pixels.
The GTFD [28] contains 750 images of 50 people taken in two different sessions. All people in the data set are represented against cluttered backgrounds. The images show faces with different facial expressions and lighting conditions, and each has a size of 32 × 32 pixels.
The TDT2 Audio Corpus [29] contains six sources, including two news special lines (APW and NYT), two radio programs (VOA and PRI) and two TV programs (CNN and ABC). In this experiment, only the largest 30 categories are used (a total of 9394 documents).
Important details of these data sets are shown in Table 1.
Data set  Yale  UMISTface  Caltech101  GTFD  TDT2 

Points  165  1012  9144  750  9394 
Dimensions  1024  1024  1024  1024  36771 
Class  15  20  101  50  30 
After obtaining a new feature representation, we use k-means to cluster the representations and then compare the result with the labels to evaluate the clustering quality. Clustering accuracy (ACC) [30], [31] and normalized mutual information (NMI) [32], [33] are used to evaluate the performance of these clustering results.
Given the ground-truth class labels $y_{i}$ and the obtained cluster labels $c_{i}$ for $n$ data points, the clustering accuracy is defined as:

$$\mathrm{ACC} = \frac{\sum_{i=1}^{n}\delta\big(y_{i}, \mathrm{map}(c_{i})\big)}{n}, \quad (26)$$

where $\delta(a, b) = 1$ if $a = b$ and $\delta(a, b) = 0$ otherwise, and $\mathrm{map}(\cdot)$ is a permutation mapping function that maps the obtained cluster labels to the real labels. The higher the ACC value is, the better the clustering performance.
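The permutation map in Eq. (26) can be found by brute force when the number of clusters is small (larger problems would use the Hungarian algorithm instead). A hedged sketch, with the function name being our own:

```python
from itertools import permutations

def clustering_accuracy(true_labels, cluster_labels):
    """ACC of Eq. (26): best fraction of matches over all label permutations."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_labels))
    best = 0
    for perm in permutations(classes):
        mapping = dict(zip(clusters, perm))  # candidate map(.) function
        hits = sum(mapping[c] == t for c, t in zip(cluster_labels, true_labels))
        best = max(best, hits)
    return best / len(true_labels)

acc_perfect = clustering_accuracy([0, 0, 1, 1, 2, 2], [2, 2, 0, 0, 1, 1])
acc_half = clustering_accuracy([0, 0, 1, 1], [0, 1, 0, 1])
```

A relabeled but otherwise perfect clustering scores 1.0, which is precisely why the permutation mapping is needed.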
NMI is used to calculate the agreement between two data distributions and is defined as follows:

$$\mathrm{NMI}(C, C') = \frac{\mathrm{MI}(C, C')}{\max\big(H(C), H(C')\big)}, \quad (27)$$

where $H(\cdot)$ is the entropy of a set of labels. $\mathrm{MI}(C, C')$ quantifies the amount of information shared between two random variables (i.e., the class set $C$ and the cluster set $C'$) and is defined as:

$$\mathrm{MI}(C, C') = \sum_{c_{i} \in C}\sum_{c'_{j} \in C'} p(c_{i}, c'_{j})\log\frac{p(c_{i}, c'_{j})}{p(c_{i})\,p(c'_{j})}, \quad (28)$$

where $p(c_{i})$ and $p(c'_{j})$ are the probabilities that a data point selected from the data set belongs to the clusters $c_{i}$ and $c'_{j}$, respectively, and $p(c_{i}, c'_{j})$ is the joint probability that an arbitrarily selected data point belongs to both clusters concurrently. The NMI score ranges from 0 to 1, and the larger the NMI is, the better the clustering performance.
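Eqs. (27)-(28) can be computed directly from label counts with plug-in probability estimates; the logarithm base cancels in the ratio. A minimal sketch (names are ours):

```python
import math
from collections import Counter

def nmi(true_labels, cluster_labels):
    """NMI of Eq. (27) with MI of Eq. (28), normalized by max entropy."""
    n = len(true_labels)
    pc = Counter(true_labels)                      # class counts
    pk = Counter(cluster_labels)                   # cluster counts
    joint = Counter(zip(true_labels, cluster_labels))
    mi = sum((v / n) * math.log((v / n) * n * n / (pc[c] * pk[k]))
             for (c, k), v in joint.items())
    hc = -sum((v / n) * math.log(v / n) for v in pc.values())
    hk = -sum((v / n) * math.log(v / n) for v in pk.values())
    return mi / max(hc, hk, 1e-12)

score = nmi([0, 0, 1, 1], [1, 1, 0, 0])  # a perfect (relabeled) clustering
```

Unlike ACC, NMI needs no permutation mapping: a relabeled perfect clustering already reaches 1, and an independent clustering scores 0.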
Table 2. Optimal clustering results on the Yale data set.

Method  NMF  ONMF  SemiNMF  ConvexNMF  EWNMF
ACC  39.03  41.21  42.48  31.88  46.06
NMI  44.74  47.36  48.48  37.91  53.27
Table 3. Optimal clustering results on the UMISTface data set.

Method  NMF  ONMF  SemiNMF  ConvexNMF  EWNMF
ACC  51.06  50.99  52.15  23.52  64.05
NMI  69.63  68.34  67.70  29.13  76.01
Before the experiments, we normalized all the data sets by scaling the minimum and maximum values of each data point to 0 and 1, respectively. All the methods use the same random initialization of $W$ and $H$, drawn uniformly from [0.1, 1.1], and perform 300 iterations to ensure sufficient convergence.
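The preprocessing described above can be sketched as follows; the constants follow the text, while the function names are ours:

```python
import random

def minmax_columns(X):
    """Scale each data point (column) so its min maps to 0 and its max to 1."""
    m, n = len(X), len(X[0])
    out = [[0.0] * n for _ in range(m)]
    for j in range(n):
        col = [X[i][j] for i in range(m)]
        lo, hi = min(col), max(col)
        for i in range(m):
            out[i][j] = (X[i][j] - lo) / (hi - lo) if hi > lo else 0.0
    return out

def init_factor(rows, cols, seed=0):
    """Random factor initialization, uniform on [0.1, 1.1] as in the text."""
    rng = random.Random(seed)
    return [[rng.uniform(0.1, 1.1) for _ in range(cols)] for _ in range(rows)]

Xn = minmax_columns([[2.0, 10.0], [4.0, 30.0], [6.0, 20.0]])
W0 = init_factor(3, 2)
```

Starting every method from the same strictly positive initialization keeps the multiplicative updates well-defined and makes the comparison fair.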
3.2 Experimental results
3.2.1 Signal unmixing on synthetic data
We use a synthetic data set to simulate a signal-mixture process, which mixes two source signals by a matrix uniformly distributed on [0, 1] and then destroys the beginning part of the first mixed signal. The destroyed part hinders signal recovery. We then use EWNMF to unmix the signals to demonstrate the usefulness of the proposed entropy weighted method. Fig. 2 shows the source signals, the mixed signals (including the destroyed one) and the obtained weights. The weights of the destroyed elements in the first signal are small and can even be considered zero, while the other weights are similar to one another. Thus, we conclude that EWNMF assigns correct importance weights to the attributes of the data set.
3.2.2 EWNMF compared with the standard NMF
We investigate the ability of the entropy weighted strategy to improve the performance of the standard NMF. We apply NMF and EWNMF to reduce the dimensionality of the Yale and UMISTface data sets and then use k-means to cluster the new representations. Note that the number of reduced dimensions is set equal to the number of clusters. Figs. 3 and 4 show the clustering results evaluated by ACC and NMI. EWNMF indeed provides better performance than the standard NMF, and it does so consistently over a wide range of the hyperparameter $\gamma$, demonstrating the robustness of EWNMF.
The optimal clustering results on the entire Yale and UMISTface data sets are shown in Tabs. 2 and 3. An interesting observation is highlighted and shows a consistent result with the above figures that EWNMF still provides the best evaluation standards.
3.2.3 Clustering results with different cluster numbers
We study the relationship between the evaluation standards and the cluster number. The hyperparameter $\gamma$ is selected from a large candidate range to obtain the optimal results. Because the NMF problem does not have a unique solution, we use 10 random initializations to obtain credible averaged ACC and NMI values. Cluster numbers ranging from 2 to 10 are considered. For a given data set with k clusters, the experimental details are as follows:
1) Randomly select k categories as a subset for the following experiment.
2) Randomly initialize $W$ and $H$, obtain the new representations, and cluster them by k-means. Note that EWNMF uses the hyperparameter selected according to the above instruction.
3) Repeat 1) and 2) 10 times to obtain an average result.
The clustering results, ACC and NMI versus the number of clusters, are reported in Figs. 5-9. The proposed method generally yields more accurate clustering results; only for two clusters of the GTFD data set does SemiNMF occasionally exceed the accuracy of the proposed method. Moreover, none of the other methods outperforms EWNMF in all aspects.
3.2.4 Convergence speed and parts-based learning
From the theoretical analysis described above, we can conclude that the proposed method monotonically decreases the cost function; however, due to the nonconvexity of the cost function, strict convergence to a stationary point cannot be guaranteed [34]. Thus, we investigate the convergence speed of the NMF methods.
Fig. 10 shows the cost functions of NMF, ONMF, SemiNMF, ConvexNMF and EWNMF on the Yale data set. Except for ConvexNMF, which converges slowly, the methods reach a stable point within 200 iterations, demonstrating that they have similar convergence speeds. However, ONMF cannot guarantee a monotonic decrease of the cost function because it uses a proximal Lagrange multiplier method. Fig. 11 shows the base images learned on the Yale data set and demonstrates that SemiNMF and ConvexNMF identify more global (holistic) faces. EWNMF has poorer locality than ONMF; however, the former has a better clustering effect than the latter. It is visually difficult to evaluate the performance of the NMF methods, even though their base images are all markedly different.
4 Conclusion
This paper proposes a new NMF method that adds a weight to each attribute of each data point to emphasize its importance. We introduce an entropy regularizer so that these weights can be treated as probabilities of importance within [0, 1], which mimics the process of human reasoning more accurately. The weights can be solved for by the Lagrange multiplier method, and a simple update rule is obtained. The experimental results show that the proposed method is competitive with existing methods.
The entropy regularizer requires an additional hyperparameter to control the certainty of the weights. In the future, we plan to develop an autoadjustment strategy for this hyperparameter.
Acknowledgment
The authors would like to thank the editor and anonymous reviewers for their constructive comments and suggestions.
References
[1] I. K. Fodor, “A survey of dimensionality reduction techniques,” Center Appl. Sci. Comput., Lawrence Livermore Nat. Lab., 2002, pp. 1-24.
 [2] A. Gersho and R. M. Gray, Vector Quantization and Signal Compression. Springer Science and Business Media, 1992.

[3] C. Hu, X. Lu, M. Ye, and W. Zeng, “Singular value decomposition and local near neighbors for face recognition under varying illumination,” Pattern Recognit., vol. 64, pp. 60-83, 2017.
[4] I. T. Jolliffe, Principal Component Analysis. Springer, 2002.
[5] A. Hyvarinen and E. Oja, “Independent component analysis: algorithms and applications,” Neural Netw., vol. 13, no. 4-5, pp. 411-430, 2000.
[6] S. Peng, W. Ser, B. Chen, L. Sun, and Z. Lin, “Correntropy based graph regularized concept factorization for clustering,” Neurocomput., vol. 316, pp. 34-48, 2018.
[7] W. Yan, B. Zhang, S. Ma, and Z. Yang, “A novel regularized concept factorization for document clustering,” Knowl. Based Syst., vol. 135, pp. 147-158, 2017.
[8] D. D. Lee and H. S. Seung, “Learning the parts of objects by nonnegative matrix factorization,” Nature, vol. 401, no. 6755, pp. 788-791, 1999.
[9] F. Shahnaz, M. W. Berry, V. P. Pauca, and R. J. Plemmons, “Document clustering using nonnegative matrix factorization,” Inf. Manag., vol. 42, no. 2, pp. 373-386, 2006.
[10] V. P. Pauca, J. Piper, and R. J. Plemmons, “Nonnegative matrix factorization for spectral data analysis,” Linear Algebra Appl., vol. 416, no. 1, pp. 29-47, 2006.
[11] S. Jia and Y. Qian, “Constrained nonnegative matrix factorization for hyperspectral unmixing,” IEEE Trans. Geosci. Remote Sens., vol. 47, no. 1, pp. 161-173, 2009.
[12] H. Kim and H. Park, “Sparse nonnegative matrix factorizations via alternating nonnegativity constrained least squares for microarray data analysis,” Bioinform., vol. 23, no. 12, pp. 1495-1502, 2007.
[13] C. H. Q. Ding, T. Li, and M. I. Jordan, “Convex and semi-nonnegative matrix factorizations,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 1, pp. 45-55, 2010.
[14] A. Suleman, “A convex semi-nonnegative matrix factorization approach to fuzzy k-means clustering,” Fuzzy Sets Syst., vol. 270, pp. 90-110, 2015.
[15] R. Hamon, V. Emiya, and C. Fevotte, “Convex nonnegative matrix factorization with missing data,” in Proc. IEEE 26th Int. Workshop Mach. Learn. Signal Process., 2016, pp. 1-6.
[16] C. Ding, T. Li, W. Peng, and H. Park, “Orthogonal nonnegative matrix t-factorizations for clustering,” in Proc. 12th ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2006, pp. 126-135.
[17] S. Choi, “Algorithms for orthogonal nonnegative matrix factorization,” in Proc. Int. Jt. Conf. Neural Netw., 2008, pp. 1828-1832.

[18] V. D. Blondel, N.-D. Ho, and P. V. Dooren, “Weighted nonnegative matrix factorization and face feature extraction,” Image Vis. Comput., 2007.
[19] Y. Kim and S. Choi, “Weighted nonnegative matrix factorization,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2009, pp. 1541-1544.
[20] Z. Li and X. Wu, “Weighted nonnegative matrix tri-factorization for co-clustering,” in Proc. IEEE Int. Conf. Tools Artif. Intell., 2011, pp. 811-816.

[21] M. M. Kalayeh, H. Idrees, and M. Shah, “NMF-KNN: image annotation using weighted multi-view nonnegative matrix factorization,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2014, pp. 184-191.
[22] X. Dai, K. Zhang, J. Xiong, X. Zhang, Z. Tu, and N. Zhang, “Weighted nonnegative matrix factorization for image recovery and representation,” in Proc. Int. Conf. Inf. Sci. Technol. (ICIST), 2020, pp. 288-293.
[23] D. D. Lee and H. S. Seung, “Algorithms for nonnegative matrix factorization,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2001, pp. 556-562.

[24] P. Paatero and U. Tapper, “Positive matrix factorization: a nonnegative factor model with optimal utilization of error estimates of data values,” Environmetrics, vol. 5, no. 2, pp. 111-126, 1994.
[25] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, “Eigenfaces vs. fisherfaces: recognition using class specific linear projection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 7, pp. 711-720, 1997.
 [26] Sheffield/UMIST face data set. [Online]. Available: http://www.sheffield.ac.uk/eee/research/iel/research/face/.
[27] F. Li, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories,” in Comput. Vis. Image Underst., 2004, p. 178.
[28] Georgia Tech face data set. [Online]. Available: ftp://ftp.ee.gatech.edu/pub/users/hayes/facedb/.
 [29] D. Graff, TDT2 English audio LDC99s84, DVD, Linguistic Data Consortium, Philadelphia, 1999.
[30] H. Liu, Z. Wu, X. Li, D. Cai, and T. S. Huang, “Constrained nonnegative matrix factorization for image representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 7, pp. 1299-1311, 2012.
 [31] L. Lovasz and M. Plummer, Matching Theory. Akademiai Kiado, 1986.
[32] D. Cai, X. He, and J. Han, “Document clustering using locality preserving indexing,” IEEE Trans. Knowl. Data Eng., vol. 17, no. 12, pp. 1624-1637, 2005.
[33] F. Shahnaz, M. W. Berry, V. Pauca, and R. J. Plemmons, “Document clustering using nonnegative matrix factorization,” Inf. Process. Manag., vol. 42, no. 2, pp. 373-386, 2004.
[34] C. J. Lin, “On the convergence of multiplicative update algorithms for nonnegative matrix factorization,” IEEE Trans. Neural Netw., vol. 18, no. 6, pp. 1589-1596, 2007.