1 Introduction
Spectral clustering methods, which are based on eigendecomposition, demonstrate excellent performance on many challenging real-world data sets. Over the past decades, a series of spectral methods have been proposed: Multidimensional Scaling (MDS) [1], Locally Linear Embedding (LLE) [2], Isomap [3], Laplacian Eigenmaps [4] and variants of spectral clustering [5]. The spectral clustering methods above share three shortcomings. First, these approaches only provide the embedding of the training data, so the out-of-sample extension is not straightforward. Second, their complexity depends on the number of data points. Third, the performance of spectral clustering depends heavily on the robustness of the affinity graph.
Much important progress [6, 7, 8, 9, 10, 11, 12, 13, 14] has been made to mitigate these issues of spectral clustering. Locality Preserving Projections (LPP) [7] introduces a linear projection derived from Laplacian Eigenmaps. It provides a linear approximation of the embedding map, which reduces the time complexity and makes the out-of-sample extension straightforward. The linear embedding also gives a metric learning perspective on spectral clustering. Nie, Wang, and Huang proposed Projected Clustering with Adaptive Neighbors (PCAN) [14], which treats the pairwise similarity as an extra variable in the optimization problem and penalizes the rank of the graph Laplacian so that the affinity matrix has a prescribed number of connected components. Within this framework, PCAN alternately updates the affinity matrix and the projection. Although several affinity learning algorithms have been proposed in recent years, how to choose an appropriate affinity matrix remains to be addressed.
Our goal is to extract more adaptive similarity information, at minimal extra cost, for the linear approximation of spectral clustering. Such information takes the objective of locality preserving into consideration, rather than only the distance between images. Inspired by recent progress on scalable spectral clustering [10] and data similarity learning [14], we propose a novel approach dubbed Adaptive Affinity Matrix (AdaAM). Our affinity matrix is relatively dense and captures both global and local information. Specifically, AdaAM decomposes the affinity matrix into the product of two identical low-rank matrices. In the ideal case described in [5], if the pairwise affinities within each class are exceedingly similar, the affinity matrix becomes low-rank. We optimize the decomposed matrix with the same scheme as spectral clustering. The affinity graph obtained by this optimization is first used as an intermediate affinity matrix. Combining the intermediate affinity matrix with the k-NN affinity graph derived from the heat kernel, we then obtain the final adaptive affinity matrix from a naive spectral clustering step. We construct the affinity graph from the projected data and apply LPP to this graph to learn a metric for clustering.
We illustrate the effectiveness and efficiency of the proposed approach for clustering on image data sets. In Section 3 we show the advantage of AdaAM on challenging data sets by comparing it with the k-nearest-neighbor heat kernel (k-NN) [4] and other state-of-the-art algorithms.
Our main contribution is that we integrate affinity matrix learning into the framework of spectral clustering under the same paradigm, and we employ a low-rank trick to make the approach more efficient.
2 Adaptive Affinity Matrix
2.1 Notation
In this paper, we write all matrices as uppercase (English or Greek) letters and all vectors as lowercase letters. The all-ones vector is denoted by $\mathbf{1}$. The centering matrix is denoted by $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$. The original data matrix is denoted by $X \in \mathbb{R}^{n \times d}$, where $n$ is the number of data points and $d$ is the dimension of the data. $X$ is assumed to be normalized with zero mean, i.e. $X^T\mathbf{1} = 0$. The notation $x_i$ means the $i$-th data point vector. We denote the linear projection by $W \in \mathbb{R}^{d \times c}$ and the metric matrix by $M = WW^T$. Hence the Mahalanobis distance based on $M$ is $d_M(x_i, x_j) = \sqrt{(x_i - x_j)^T M (x_i - x_j)}$. The $k$-NN heat kernel matrix is denoted by $K$ with

$$K_{ij} = \begin{cases} \exp\!\left(-\frac{\|x_i - x_j\|^2}{t}\right), & x_j \in N_k(x_i) \text{ or } x_i \in N_k(x_j), \\ 0, & \text{otherwise}, \end{cases} \qquad (1)$$

where $N_k(x_i)$ is the set of the $k$ nearest neighbors of $x_i$. The corresponding Laplacian matrix is denoted by $L_K = D_K - K$, where $D_K$ is the diagonal matrix with $(D_K)_{ii} = \sum_j K_{ij}$. We denote both the intermediate and the final adaptive affinity matrix by $A$, and the corresponding diagonal weight matrix and Laplacian matrix by $D_A$ and $L_A = D_A - A$.
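For concreteness, the heat-kernel affinity of Eq. (1) can be computed as below. This is a minimal NumPy sketch under the notation above; the function name, the symmetrization via an elementwise maximum, and the default bandwidth are our assumptions, not the authors' implementation.

```python
import numpy as np

def knn_heat_kernel(X, k=5, t=1.0):
    """Symmetric k-NN heat-kernel affinity: K_ij = exp(-||x_i - x_j||^2 / t)
    when x_j is among the k nearest neighbors of x_i (or vice versa), else 0."""
    n = X.shape[0]
    # Pairwise squared Euclidean distances.
    sq = np.sum(X**2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.clip(D2, 0.0, None, out=D2)          # guard against tiny negatives
    K = np.zeros((n, n))
    for i in range(n):
        # k nearest neighbors of x_i, excluding x_i itself.
        idx = np.argsort(D2[i])
        idx = idx[idx != i][:k]
        K[i, idx] = np.exp(-D2[i, idx] / t)
    # Symmetrize: keep an edge if either endpoint selects the other.
    return np.maximum(K, K.T)
```

The symmetrization makes the graph undirected, which is required for the Laplacian $L_K = D_K - K$ to be positive semidefinite.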
2.2 Intermediate Affinity Matrix
We separate our algorithm into two parts: the intermediate affinity matrix and the final adaptive affinity matrix. In this section, we introduce the first part. For the $i$-th data point $x_i$, we connect any data point $x_j$ to $x_i$ with similarity $A_{ij}$. Hoping that a small Euclidean distance between two data points leads to a large similarity, we aim to choose $A$ to minimize the objective function

$$\min_{A} \ \sum_{i,j=1}^{n} \|x_i - x_j\|_2^2 \, A_{ij} \qquad (2)$$

under appropriate constraints, where $A_{ij}$ is the $(i,j)$-th element of the intermediate affinity matrix $A$.

Different from PCAN [14], we reformulate the objective with the graph Laplacian,

$$\min_{L_A} \ \mathrm{tr}(X^T L_A X) \qquad (3)$$

under some constraints.
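The step from the pairwise objective to the Laplacian trace rests on the standard identity $\sum_{i,j} \|x_i - x_j\|^2 A_{ij} = 2\,\mathrm{tr}(X^T L_A X)$ for a symmetric affinity $A$. The small sketch below checks this numerically; shapes and symbols are our assumptions (rows of $X$ are data points).

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 4))                 # rows are data points
A = rng.random((8, 8)); A = (A + A.T) / 2   # symmetric affinity
D = np.diag(A.sum(axis=1))                  # diagonal weight (degree) matrix
L = D - A                                   # graph Laplacian

# Pairwise form of the objective.
lhs = sum(np.sum((X[i] - X[j])**2) * A[i, j]
          for i in range(8) for j in range(8))
# Trace form: equals the pairwise form up to a factor of 2.
rhs = 2.0 * np.trace(X.T @ L @ X)
assert np.isclose(lhs, rhs)
```

The constant factor of 2 does not affect the minimizer, which is why the trace form can replace the pairwise form in the optimization.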
Since the graph Laplacian is in general a positive semidefinite matrix, a straightforward thought is to decompose the Laplacian into two identical matrices. We show below that this thought is not appropriate in our framework.

Assume that

$$L_A = P P^T, \qquad (4)$$

where $P \in \mathbb{R}^{n \times c}$ is a column-orthogonal matrix with $P^T P = I$. After relaxing the constraints, we need to solve the problem

$$\min_{P^T P = I} \ \mathrm{tr}(P^T X X^T P). \qquad (5)$$

If we denote the product $X X^T$ by $G$, Eq. (5) takes the simple form of Laplacian Eigenmaps. This optimization problem can be solved by selecting the eigenvectors of $G$ corresponding to the several smallest eigenvalues. However, $G$ is in general a low-rank matrix with $\mathrm{rank}(G) \le d \ll n$, and the eigenvectors minimizing the objective in Eq. (5) lie in the null space of $G$. Hence, the solution of the above problem is not unique. Inspired by LSC [10], we instead assume the affinity matrix to be positive semidefinite and decompose it as $A = Q Q^T$, where $Q \in \mathbb{R}^{n \times m}$ has orthogonal columns and $m$ is the expected rank of $A$. Therefore we reformulate Eq. (3) as

$$\min_{Q^T Q = I} \ \mathrm{tr}\big(X^T (D_A - Q Q^T) X\big), \qquad (6)$$

where we abandon the properties that the connection weights are nonnegative and that the graph Laplacian is positive semidefinite. The negative connection weights in $A$ can be used to measure the dissimilarity between data points. We will show that the solution of this optimization problem makes $D_A$ equal to $0$.
For the first part of Eq. (6), since $(D_A)_{ii} = (Q Q^T \mathbf{1})_i$, we can write it as

$$\min_{Q^T Q = I} \ \mathrm{tr}(X^T D_A X) = \min_{Q^T Q = I} \ s^T Q Q^T \mathbf{1}, \qquad (7)$$

where $s$ is the vector with $s_i = \|x_i\|_2^2$. Let $B = \mathbf{1} s^T$. With a Lagrange multiplier $\lambda$, the one-dimensional case of problem (7) can be rewritten as

$$\min_{q} \ q^T B q - \lambda (q^T q - 1). \qquad (8)$$

Finally, the minimization problem (7) reduces to finding the eigenvectors corresponding to the minimum eigenvalue of the problem $B q = \lambda q$. Because the matrix $B$ has rank one, there is only one nonzero eigenvalue $\lambda = s^T \mathbf{1}$, with eigenvector $\mathbf{1}$. Hence, for the $Q$ satisfying problem (7) with arbitrary column number less than $n$, we have $s^T Q Q^T \mathbf{1} = 0$. It is equivalent to

$$Q^T \mathbf{1} = 0. \qquad (9)$$

Generally, in real-world data sets, $s_i > 0$ always holds; thus, the $Q$ minimizing the first part of the objective function (6) has the property $Q^T \mathbf{1} = 0$. Meanwhile, the set of all $Q$ with the property $Q^T \mathbf{1} = 0$ is the solution set of Eq. (7).

The matrix $Q$ minimizing the second part of the objective function (6), i.e. maximizing $\mathrm{tr}(Q^T X X^T Q)$, is given by the eigenvectors corresponding to the maximum eigenvalues of the eigenproblem:

$$X X^T q = \lambda q. \qquad (10)$$

As the data have zero mean, we have $X^T \mathbf{1} = 0$. Therefore, for any eigenvalue $\lambda$ larger than $0$, the corresponding eigenvector always satisfies $q^T \mathbf{1} = 0$. Let the minimizer of the second part of problem (6) be $Q^{*}$, whose columns are eigenvectors of $X X^T$ with positive eigenvalues collected in the diagonal matrix $\Lambda$. We have

$$Q^{*T} \mathbf{1} = \Lambda^{-1} Q^{*T} X X^T \mathbf{1} = 0, \qquad (11)$$

which means that the property $Q^T \mathbf{1} = 0$ holds for the optimal solution of the second part of (6), and this solution lies in the solution set of Eq. (7). Therefore the solution of the second part of Eq. (6) also optimizes the objective (7), and the solution of the optimization problem (6) makes $D_A$ equal to $0$. The objective function (6) can be reduced to

$$\max_{Q^T Q = I} \ \mathrm{tr}(Q^T X X^T Q), \qquad (12)$$

whose solution is given by the singular value decomposition of $X$, with complexity depending on $d$ rather than $n$. We obtain the intermediate affinity matrix $A = Q Q^T$ from the distribution of the original data, carrying both similarity and dissimilarity information. The graph Laplacian of $A$ is $L_A = -A$. To mitigate the impact of noise and the rank-decreasing problem, we apply sparsification to $A$. We discuss the sparsification further in Section 2.4.
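Under the reconstruction above, the intermediate affinity step amounts to an SVD of the zero-mean data: the top-$m$ left singular vectors of $X$ maximize $\mathrm{tr}(Q^T X X^T Q)$, and zero-mean data guarantees $Q^T \mathbf{1} = 0$, hence $D_A = 0$. The sketch below illustrates this; the function name, the rank parameter `m`, and the use of NumPy are our assumptions, not the authors' code.

```python
import numpy as np

def intermediate_affinity(X, m):
    """Intermediate affinity A = Q Q^T, where Q holds the top-m left
    singular vectors of the centered data (a sketch of Eq. (12))."""
    Xc = X - X.mean(axis=0)                  # enforce zero mean
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    Q = U[:, :m]                             # n x m, with Q^T Q = I
    return Q @ Q.T                           # dense low-rank affinity

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 5))
A = intermediate_affinity(X, m=3)
```

Because `Xc` has zero column means, the all-ones vector is orthogonal to its column space, so the row sums of `A` vanish, matching the claim that $D_A = 0$.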
2.3 Final Adaptive Affinity Matrix
In this section, we formulate a naive linear spectral clustering and derive the final adaptive affinity matrix.

With the intermediate affinity matrix $A$, we can solve the following problem for a linear projection $w$:

$$\min_{w} \ \frac{w^T X^T L X w}{w^T X^T D_K X w}, \qquad (13)$$

where $w$ is the one-dimensional case of $W$ and $L = L_K + L_A$ is the combination of the Laplacian of the $k$-NN heat kernel and that of the intermediate affinity matrix. The projection vector $w$ is given by the eigenvector corresponding to the minimum eigenvalue of the generalized eigenproblem:

$$X^T L X w = \lambda X^T D_K X w. \qquad (14)$$

Subsequently, to compute $A$ of Eq. (13) given $W$, we rewrite the affinity optimization problem with the linear projection matrix, as we did in Eq. (6):

$$\min_{Q^T Q = I} \ \mathrm{tr}\big(W^T X^T (D_A - Q Q^T) X W\big), \qquad (15)$$

where we assume the final adaptive affinity matrix to be $A = Q Q^T$ with $Q \in \mathbb{R}^{n \times m}$. The property $D_A = 0$ still holds because of the zero mean of $X W$. Therefore, Eq. (15) reduces to

$$\max_{Q^T Q = I} \ \mathrm{tr}(Q^T X W W^T X^T Q). \qquad (16)$$

This can be solved by the singular value decomposition of the matrix $X W$, taking the left singular vectors corresponding to the largest singular values. We apply sparsification to the adaptive affinity matrix obtained from Eq. (16) to attain a sparse affinity matrix.
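The final step mirrors the intermediate one, with the projected data $XW$ entering the SVD instead of $X$. The helper below is a hedged sketch (the function name and shapes are our assumptions):

```python
import numpy as np

def final_affinity(X, W, m):
    """Final adaptive affinity (sketch of Eq. (16)): A = Q Q^T, where Q
    holds the top-m left singular vectors of the projected data X W."""
    Y = X @ W                                # projected data, n x c
    Y = Y - Y.mean(axis=0)                   # zero mean is preserved/enforced
    U, _, _ = np.linalg.svd(Y, full_matrices=False)
    Q = U[:, :m]
    return Q @ Q.T

rng = np.random.default_rng(5)
X = rng.normal(size=(20, 6))
W = rng.normal(size=(6, 3))
A_final = final_affinity(X, W, m=2)
```

Since the SVD is taken of an $n \times c$ matrix, the cost of this step scales with the projected dimension $c$ rather than the ambient dimension $d$.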
Table 1. Average and maximal clustering accuracy (%) over 100 rounds of k-Means. PCAN produces clusters directly, so a single accuracy is reported for it.

          AdaAM         k-NN          Cons-kNN      DN            ClustRF-Bi    PCAN-kMeans   PCAN
          Avg    Max    Avg    Max    Avg    Max    Avg    Max    Avg    Max    Avg    Max
UMIST     66.06  75.65  58.16  65.39  60.27  69.22  59.15  66.96  64.63  74.44  53.79  56.52  55.30
COIL20    74.72  87.29  71.89  81.18  75.53  84.31  71.95  82.01  76.50  85.07  72.28  83.75  81.74
USPS      69.36  69.61  68.25  68.35  68.21  68.34  68.08  68.31  58.74  65.90  64.04  67.95  64.20
MNIST     60.84  61.34  48.13  48.27  47.88  48.00  49.72  49.76  51.93  52.03  58.93  58.98  59.83
ExYaleB   54.36  57.87  24.17  26.76  25.63  28.75  24.21  27.42  23.10  26.43  25.74  27.63  25.89
Intuitively, we could iterate Eq. (13) and Eq. (16) to further minimize the objective function. However, as Fig. 1 shows, the adaptive affinity matrix after a single iteration already performs well in practice, and continuing the iteration brings no remarkable improvement.
The weight of the nodes in the graph plays an important role in some algorithms, and methods based on Normalized Cuts [15], such as LPP, have a constraint relying on the diagonal weight matrix. In our approach we have $D_A = 0$; therefore we add the weight matrix computed from the $k$-NN heat kernel to our affinity matrix. Finally, we replace the affinity matrix in LPP with this combined matrix to get the linear projection $W$ and the metric matrix $M = W W^T$.
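The LPP step can be sketched as a generalized eigenproblem $X^T L X w = \lambda X^T D X w$. In this hedged example we simply sum the learned affinity and the heat kernel to form the combined graph `S` (one possible reading of the combination above), and reduce the generalized problem to a standard symmetric one via a Cholesky factorization; the regularization term and all names are our choices, not the authors'.

```python
import numpy as np

def lpp_projection(X, S, c, reg=1e-8):
    """LPP sketch: solve X^T L X w = lam * X^T D X w for the c eigenvectors
    with smallest eigenvalues, where D and L come from the affinity S."""
    D = np.diag(S.sum(axis=1))               # diagonal weight matrix
    L = D - S                                # graph Laplacian
    Aq = X.T @ L @ X
    Bq = X.T @ D @ X + reg * np.eye(X.shape[1])  # regularize for stability
    # Reduce the generalized problem to a standard one: Bq = R R^T.
    R = np.linalg.cholesky(Bq)
    Rinv = np.linalg.inv(R)
    M = Rinv @ Aq @ Rinv.T
    M = (M + M.T) / 2                        # symmetrize against round-off
    vals, vecs = np.linalg.eigh(M)           # ascending eigenvalues
    return Rinv.T @ vecs[:, :c]              # d x c projection W

# Example with the combined graph S = A + K (A: learned affinity, K: heat kernel).
rng = np.random.default_rng(3)
X = rng.normal(size=(15, 4))
S = rng.random((15, 15)); S = (S + S.T) / 2  # stand-in for A + K
W = lpp_projection(X, S, c=2)
```

The returned columns satisfy $W^T (X^T D X) W = I$, the usual LPP normalization.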
2.4 Sparsification Strategy
From the optimization problems (12) and (16), we can observe that the matrices $X X^T$ and $X W W^T X^T$ are both low-rank. Since the solutions of these problems are based on the singular value decomposition, the number of informative columns of the solution $Q$ is bounded by the ranks of $X$ and $X W$, which can be far less than $n$. This produces a low-rank affinity matrix, which leads to a progressively decreasing rank in our approach. To prevent this rank decrease, we implement sparsification. The sparsification strategy may mitigate the problem of noisy edges as well.
Fig. 1 justifies our sparsification procedure by showing the histogram of the magnitudes of the final adaptive affinity matrix obtained from Eq. (16) without sparsification. Most elements of the affinity matrix concentrate in the range of small magnitude, so the sparsification procedure preserves the portion of affinity elements that is most representative.
Inspired by the idea of the $k$-NN heat kernel, we sort all elements of the affinity matrix by decreasing magnitude and keep only the first $\hat{m}$ elements. We consider it better for $\hat{m}$ to be inversely proportional to the number of clusters, in which case the average number of elements kept per cluster is proportional to the number of data points in each cluster. The $\hat{m}$ is selected by the following equation:

$$\hat{m} = \left\lfloor \frac{\alpha N}{c} \right\rfloor, \qquad (17)$$

where $\lfloor\cdot\rfloor$ is the floor function, $N$ is the number of elements in $A$, $c$ is the number of clusters and $\alpha$ is a coefficient.
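One possible reading of this rule in code is below; the function name and the handling of ties at the threshold are our assumptions (ties may keep slightly more than $\lfloor \alpha N / c \rfloor$ entries).

```python
import numpy as np

def sparsify(A, c, alpha):
    """Keep only the floor(alpha * N / c) largest-magnitude entries of A
    (N = total number of entries, c = number of clusters); zero the rest."""
    N = A.size
    keep = int(np.floor(alpha * N / c))
    if keep >= N:
        return A.copy()
    # Magnitude threshold: the (N - keep)-th smallest absolute value.
    thresh = np.partition(np.abs(A).ravel(), N - keep)[N - keep]
    return np.where(np.abs(A) >= thresh, A, 0.0)
```

Note that thresholding by magnitude keeps large negative (dissimilarity) weights as well as large positive ones, consistent with the signed affinities used in Eq. (6).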
We use one value of $\alpha$ for the first sparsification, in the computation of the intermediate affinity matrix, and another for the second sparsification, in the computation of the final adaptive affinity matrix. Both values are decided by a rough parameter search and give stable performance on most data sets.
We summarize our algorithm in Algorithm 1. We set the reduced dimension to be the same as the number of classes.
3 Experiments
In this section, we conduct several experiments to demonstrate the effectiveness and efficiency of the proposed approach AdaAM.
3.1 Data Sets
We evaluate the proposed approach on five image data sets:
UMIST The UMIST Face Database consists of 575 images of 20 individuals with 220×220 pixels [16]. We use the images resized to 40×40 pixels in our experiments.
COIL20 A data set consists of 1,440 images of 20 objects with discarded background [17].
USPS The USPS handwritten digit database has 9,298 images of 10 digits with 16×16 pixels [18].
MNIST The MNIST database of handwritten digits has 70,000 images of 10 classes [19]. In our experiments, we select the first 10,000 images of this database.
ExYaleB The Extended Yale Face Database B consists of 2,414 cropped images of 38 individuals, with around 64 images per individual under different illuminations [20].
The statistics of data sets are summarized in Tab. 2.
Data set  # of instances  # of features  # of classes 

UMIST  575  1600  20 
COIL20  1440  1024  20 
USPS  9298  256  10 
MNIST  10000  784  10 
ExYaleB  2414  1024  38 
3.2 Compared Algorithms
We compare our approach with the following affinity learning algorithms. We apply LPP to the affinity matrices generated by these state-of-the-art approaches to obtain the distance metric.
Cons-kNN Consensus k-NNs [12], which aims to select robust neighborhoods.
DN Dominant Neighborhoods proposed in [11].
ClustRF-Bi A special case of ClustRF-Strct [13], which is also proposed in [21, 22]. Due to the huge memory requirement of ClustRF-Strct on data sets with thousands of instances, we implement this special case in our experiments.
PCAN Projected Clustering with Adaptive Neighbors, proposed in [14]. Because PCAN generates the linear projection and the clusters simultaneously, we denote the method combining the projection of PCAN with k-Means as PCAN-kMeans, and we also show the clustering result of PCAN itself in Tab. 1 for reference.
We also compare our approach with the k-NN heat kernel affinity matrix. We use k-NN to denote this typical approach.
3.3 Parameter Selection and Experiment Details
Because there is no validation set in unsupervised learning tasks, for generality we impose the same parameter selection criteria on all algorithms in our experiments. We set the size of the neighborhood as a fixed function of $n$ and $c$, where $n$ is the number of data instances and $c$ is the number of classes. We also set the projected dimension, which equals the rank of the metric matrix, to be the same as the number of classes [5]. All other parameters of our approach are fixed in every experiment.
We call 10 runs of k-Means a round and select the clustering result with the minimal within-cluster sum as the result of each round. We apply 100 rounds of k-Means to each algorithm for the evaluation of performance (Tab. 1), 10 rounds for the experiment on sensitivity to the neighborhood size (Fig. 2), and one round for the experiment on execution time (Fig. 3).
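The best-of-a-round selection can be sketched as follows: run k-Means several times and keep the labeling with the smallest within-cluster sum of squares. This is a plain Lloyd's-algorithm sketch of our own, not the authors' implementation.

```python
import numpy as np

def kmeans_best_of(X, k, runs=10, iters=50, seed=0):
    """One 'round': run k-Means `runs` times, return the labeling with the
    minimal within-cluster sum of squares and that cost."""
    rng = np.random.default_rng(seed)
    best_labels, best_cost = None, np.inf
    for _ in range(runs):
        # Initialize centers from randomly chosen data points.
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(iters):
            # Assign each point to its nearest center.
            d2 = ((X[:, None, :] - centers[None, :, :])**2).sum(-1)
            labels = d2.argmin(1)
            # Recompute centers (skip empty clusters).
            for j in range(k):
                if np.any(labels == j):
                    centers[j] = X[labels == j].mean(0)
        cost = ((X - centers[labels])**2).sum()   # within-cluster sum
        if cost < best_cost:
            best_cost, best_labels = cost, labels
    return best_labels, best_cost
```

Picking the run with the lowest within-cluster sum is the standard way to reduce the variance of k-Means due to random initialization.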
3.4 Experiment Results
In the experiment on clustering accuracy, we evaluate the projection ability of AdaAM against the five other algorithms on the five benchmark data sets mentioned above. Tab. 1 gives the average and the maximal accuracy over 100 rounds of k-Means for each model. From Tab. 1, we can observe the superiority of AdaAM on the task of unsupervised metric learning. In most cases, AdaAM performs much better than the other approaches. Our approach attains four of the best average-accuracy results and five of the best maximal-accuracy results on the five data sets. We can also observe that the proposed AdaAM decisively outperforms the other five methods on the ExYaleB data set. Different from the other data sets, the images in ExYaleB are properly aligned and taken under different illuminations. This makes some images more similar to images of a different class under the same illumination, which results in a high-rank affinity matrix. Our approach is based on a low-rank approximation of the optimal affinity matrix and is able to handle such noise in the affinity matrix.
Since the neighborhood size selection criterion is fixed in the accuracy experiment, which may sacrifice the best possible performance, we show the trend of accuracy as a function of the neighborhood size in Fig. 2. Fig. 2 shows that AdaAM attains the best result in most cases, and its sensitivity to the neighborhood size is better than or comparable to the other models. Since our approach is based on a low-rank approximation of the optimal affinity matrix, it requires more information from the pairwise similarities; hence, for small neighborhood sizes, the baseline methods are sometimes better than our approach.
Fig. 3 illustrates the efficiency of AdaAM with a semi-log plot of execution time for different numbers of data points selected from MNIST. It can be observed that our approach is inexpensive in practice, with much lower time consumption than PCAN-kMeans, ClustRF-Bi and DN. AdaAM takes approximately twice the time of Cons-kNN while achieving much better performance.
4 Conclusion
In this paper, we present a novel affinity learning approach for unsupervised metric learning, called Adaptive Affinity Matrix (AdaAM). In our affinity learning model, the affinity matrix is learned within the same framework as spectral clustering. More specifically, we show that the affinity learning can be reduced to a singular value decomposition problem. With the learned affinity matrix, the distance metric can be derived by off-the-shelf approaches based on the affinity graph, such as LPP. Extensive experiments on clustering image data sets demonstrate the superiority of the proposed AdaAM.
References
 [1] Trevor F Cox and Michael AA Cox, Multidimensional scaling, CRC Press, 2000.
 [2] Sam T Roweis and Lawrence K Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
 [3] Joshua B Tenenbaum, Vin De Silva, and John C Langford, “A global geometric framework for nonlinear dimensionality reduction,” Science, vol. 290, no. 5500, pp. 2319–2323, 2000.
 [4] Mikhail Belkin and Partha Niyogi, “Laplacian eigenmaps and spectral techniques for embedding and clustering,” in NIPS, 2001.

 [5] Andrew Y Ng, Michael I Jordan, and Yair Weiss, “On spectral clustering: Analysis and an algorithm,” in NIPS, 2002.
 [6] Yoshua Bengio, Jean-François Paiement, Pascal Vincent, Olivier Delalleau, Nicolas Le Roux, and Marie Ouimet, “Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering,” in NIPS, 2004.
 [7] Xiaofei He and Partha Niyogi, “Locality preserving projections,” in NIPS, 2004.
 [8] Charless Fowlkes, Serge Belongie, Fan Chung, and Jitendra Malik, “Spectral grouping using the Nyström method,” IEEE TPAMI, vol. 26, no. 2, pp. 214–225, 2004.
 [9] Donghui Yan, Ling Huang, and Michael I Jordan, “Fast approximate spectral clustering,” in ACM SIGKDD, 2009.
 [10] Xinlei Chen and Deng Cai, “Large scale spectral clustering with landmark-based representation,” in AAAI, 2011.
 [11] Massimiliano Pavan and Marcello Pelillo, “Dominant sets and pairwise clustering,” IEEE TPAMI, vol. 29, no. 1, pp. 167–172, 2007.
 [12] Vittal Premachandran and Ramakrishna Kakarala, “Consensus of k-NNs for robust neighborhood selection on graph-based manifolds,” in CVPR, 2013.
 [13] Xiatian Zhu, Chen Change Loy, and Shaogang Gong, “Constructing robust affinity graphs for spectral clustering,” in CVPR, 2014.
 [14] Feiping Nie, Xiaoqian Wang, and Heng Huang, “Clustering and projected clustering with adaptive neighbors,” in ACM SIGKDD, 2014.
 [15] Jianbo Shi and Jitendra Malik, “Normalized cuts and image segmentation,” IEEE TPAMI, vol. 22, no. 8, pp. 888–905, 2000.

 [16] Daniel B Graham and Nigel M Allinson, “Characterising virtual eigensignatures for general purpose face recognition,” in Face Recognition, pp. 446–456. Springer, 1998.
 [17] Sameer A Nene, Shree K Nayar, and Hiroshi Murase, “Columbia Object Image Library (COIL-20),” Tech. Rep. CUCS-005-96, 1996.
 [18] Jonathan J Hull, “A database for handwritten text recognition research,” IEEE TPAMI, vol. 16, no. 5, pp. 550–554, 1994.
 [19] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner, “Gradientbased learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
 [20] K.C. Lee, J. Ho, and D. Kriegman, “Acquiring linear subspaces for face recognition under variable lighting,” IEEE TPAMI, vol. 27, no. 5, pp. 684–698, 2005.

 [21] Antonio Criminisi, Jamie Shotton, and Ender Konukoglu, Decision Forests: A Unified Framework for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning, Now Publishers Inc., 2012.
 [22] Yuru Pei, Tae-Kyun Kim, and Hongbin Zha, “Unsupervised random forest manifold alignment for lipreading,” in ICCV, 2013.