I Introduction
As the volume of data captured by remote sensors grows unabated, human capacity for providing labeled training datasets is strained. Stateoftheart supervised learning algorithms (e.g. deep neural networks) require very large training sets to fit the massive number of parameters associated to their highcomplexity architectures. In order to take advantage of the deluge of unlabeled remote sensing data available, new methods that are
unsupervised—requiring no training data—are necessary.This article proposes an efficient unsupervised clustering algorithm for hyperspectral imagery (HSI) that exploits not only lowdimensional geometry in the highdimensional space of spectra, but also the spatial regularity in the 2dimensional image structure of the pixels. This is achieved by considering spectralspatial diffusion geometry as captured by spatially regularized diffusion distances [1, 2]. These distances are integrated into the recently proposed diffusion learning algorithm, which has achieved competitive performance versus benchmark and stateoftheart unsupervised HSI clustering algorithms [3, 4].
Ii Background
The unsupervised clustering of HSI —understood as a point cloud—consists in providing labels to each data point without access to labeled training data. The total number of pixels in the image is , and the number of spectral bands is . Typically is quite large, causing traditional statistical methods to perform suboptimally. However, the different clusters in the data—corresponding to regions with distinct material properties—often exhibit lowdimensional, though nonlinear, structure. In order to efficiently exploit this structure, methods for clustering HSI that learn the underlying nonlinear geometry have been developed, including methods based on nonnegative matrix factorization [5], regularized graph Laplacians [6], angle distances [7], and deep neural networks [8].
Recently, the diffusion geometry [1, 2] of has been proposed in order to infer the latent clusters. The diffusion distance at time between , denoted , is a notion of distance determined by the underlying geometry of the point cloud . The computation of involves constructing a weighted, undirected graph with vertices corresponding to the points in , and weighted edges stored in the weight matrix if otherwise for some scaling parameter and with the set of nearest neighbors of in with respect to Euclidean distance. Typically is chose adaptively, for example as some multiple of average distance to nearest neighbors [9], and is chosen so that is sparse; we set in all experiments. Let be a Markov diffusion matrix defined on , where is the degree of . The diffusion distance at time is
(1) 
The computation of involves summing over all paths of length connecting to , so is small if are similar according to .
The eigendecomposition of yields fast algorithms to compute : the matrix admits a spectral decomposition (under mild conditions, see [2]
) with eigenvectors
and eigenvalues
, where . The diffusion distance (1) can then be written as(2) 
The weighted eigenvectors are new datadependent coordinates of , which are nearly geometrically intrinsic [1]. The parameter measures how long the diffusion process on has run: small values of allow a small amount of diffusion, which may prevent the interesting geometry of from being discovered. On the other hand, large allow the diffusion process to run for so long that the fine geometry is homogenized. In all our experiments we set ; see [4] and [10] for empirical and theoretical analyses of , respectively.
If the underlying graph is connected, for , so that the sum (2) may approximated by its truncation at some suitable . In our experiments, was set to be the value at which the decay of the eigenvalues begins to taper [4]. The subset used in the computation of is a dimensionreduced set of diffusion coordinates. Indeed, the mapping
(3) 
may be understood as a form of dimension reduction from the ambient space to . The truncation also enables us to compute only the first eigenvectors, reducing computational complexity.
Diffusion maps consider the data as a point cloud in . If the data has organizing structure beyond its dimensional coordinates, this information can be incorporated into the diffusion maps construction by modifying the underlying transition matrix . In the case of HSI, each point is not only a high dimensional spectra, but also a pixel arranged in an image. In particular, the HSI enjoys spatial regularity, in the sense that points in a particular class are likely to have their nearest spatial neighbors being in the same class. The incorporation of spatial information into supervised learning algorithms is known to improve empirical performance for a variety of data sets and methods [11, 12, 13]. In this article, we propose to extend the recently proposed diffusion learning unsupervised clustering framework [4] by directly incorporating spatial information into the underlying diffusion matrix .
Iii Description of Algorithm
The proposed algorithm first constructs a Markov diffusion matrix, , under the constraint that pixels may only be connected to other pixels that are within some spatial radius . Figure 1 shows how nearest neighbors with and without this spatial constraint differ. The construction of and the corresponding eigenpairs used to compute the diffusion distances are described in Algorithm 1.
Once the eigenpairs are computed, pairwise diffusion distances are simply Euclidean distances in the new coordinate system (3). Cluster modes are computed as points maximizing , where
is a kernel density estimator and
is the diffusion distance of a point to its nearest neighbor of higher density. The mode detection algorithm is summarized in Algorithm 2; see [4] for details.From the modes, points are labeled iteratively—from highest density to lowest density—according to their nearest spectral neighbor of higher density that has already been labeled, unless it is the case that such a labeling would strongly violate spatial regularity. In that case, points are labeled according to their spatial nearest neighbors. The spectralspatial labeling scheme is summarized in Algorithm 3, and its crucial parameters and the role of the labeling spatial regularization are discussed at length in [4].
Iv Experimental Results
We evaluate the SRDL algorithm on real HSI datasets. The Indian Pines, Salinas A, and Kennedy Space Center datasets considered are standard, have ground truth, and are publicly available^{1}^{1}1http://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes. In order to quantitatively evaluate the unsupervised results in the presence of ground truth (GT), we consider three metrics. Overall accuracy (OA) is total number of correctly labeled pixels divided by the total number of pixels, which values large classes more than small classes. Average accuracy (AA) is the average, over classes, of the OA of each class, which values small classes and large classes equally. Cohen’s statistic () measures agreement across two labelings adjusted for random chance [14].
We note that the Indian Pines and Kennedy Space Center datasets are restricted to subsets in the spatial domain, due to welldocumented challenges of unsupervised methods for data containing a large number of classes [15]. These datasets are restricted to reduce the number of classes and achieve meaningful clusters. The Salinas A dataset is considered in its entirety. We remark that results on small subsets can be patched together [4]; such results are not shown here for reasons of space. While the proposed method automatically estimates the number of clusters based on the decay of , the number of class labels in the ground truth images were used as parameter for all clustering algorithms to make a fair comparison with methods that cannot reliably estimate the number of clusters.
Since the proposed and comparison methods are unsupervised, experiments are performed on the entire dataset, including points without ground truth labels. The labels for pixels without ground truth are not accounted for in the quantitative evaluation of the algorithms tested.
Iva Comparison Methods
We consider 13 benchmark and stateoftheart methods of HSI clustering for comparison. The benchmark methods are: means [16] run directly on ; principal component analysis (PCA) followed by means; independent component analysis (ICA) followed by means[17, 18, 19]^{2}^{2}2https://www.cs.helsinki.fi/u/ahyvarin/papers/fastica.shtml; Gaussian random projections followed by means [20]; DBSCAN [21]; spectral clustering (SC) [22]; and Gaussian mixture models (GMM) [23]
, with parameters determined by expectation maximization.
The recent, stateoftheart clustering methods considered are: sparse manifold clustering and embedding (SMCE) [24, 25]^{3}^{3}3http://vision.jhu.edu/code/, which fits the data to lowdimensional, sparse structures, and then applies spectral clustering; hierarchical clustering with nonnegative matrix factorization (HNMF) [5]^{4}^{4}4https://sites.google.com/site/nicolasgillis/code, which has shown excellent performance for HSI clustering when the clusters are generated from a single endmember; a graphbased method based on the MumfordShah segmentation [26][6], related to spectral clustering, and called fast MumfordShah (FMS) in this article^{5}^{5}5http://www.ipol.im/pub/art/2017/204/?utm_source=doi; fast search and find of density peaks clustering (FSFDPC) algorithm [27], which has been shown effective in clustering a variety of data sets; and two variants of the recently proposed diffusion learning algorithm, in which the labeling process considers only spectral information (DL) or spectral and spatial information (DLSS) [4].
Among the comparison methods, the proposed method bears closest resemblance to the DL and DLSS methods. SRDL and these two methods differ primarily in how the underlying geometry for clustering is learned. In DL and DLSS, the diffusion geometry is computed though by considering the HSI only as a spectral point cloud. SRDL regularizes the construction of by incorporating spatial information into the nearest neighbors construction. The proposed SRDL method also bears similarity to SC, SMCE, and FMS since all these methods use datadriven graphs. The FSFDPC algorithm uses a mode detection scheme similar to SRDL, but without diffusion geometry or spatial information.
IvB Indian Pines Data
The Indian Pines dataset used for experiments is a subset of the full Indian Pines datasets, consisting of three classes that are difficult to distinguish visually; see Figure 2. Clustering results for Indian Pines are in Figure 3 and Table I. SRDL was run with . The spatial regularization in the construction of is beneficial in terms of labeling: the proposed method improves over DLSS. As expected, SRDL leads to a spatially smoother labeling. However, a mistake is still made in the labeling of the proposed method, indicating that this is a challenging dataset to cluster without supervision.
IvC Salinas A Data
The Salinas A dataset (see Figure 1
) consists of 6 classes in diagonal rows. Certain pixels in the HSI have the same values; in order to distinguish these pixels, small Gaussian noise (variance
was added as a preprocessing step. SRDL was run with . Visual results for Salinas A appear in Figure 4 and quantitative results appear in Table I. The proposed method yields the best results, and moreover the labels recovered by the proposed method are quite spatially regular.IvD Kennedy Space Center Data
The Kennedy Space Center dataset used for experiments consists of a subset of the original dataset, and contains four classes. Figure 5 shows the data along with ground truth consisting of the examples of four vegetation types which dominate the scene. SRDL was run with . Results appear in Table I; for reasons of space, visual results are not shown.
IvE Parameter Analysis
The crucial parameter in the proposed method is the spatial radius which determines how near the nearest neighbors in the underlying diffusion process must be. The impact of this parameter in terms of overall accuracy, average accuracy, and appear in Figure 6. The plots exhibit the tradeoff typical of regularization in machine learning: insufficient or excessive regularization are both detrimental. It is critical to find a good range of regularization parameters, and the flat regions near the maxima in Figure 6 suggest the proposed method is relatively robust to the choice of . We remark that if knowledge of the spatial smoothness of the image is known or can be estimated a priori, then can be estimated.
V Conclusions and Future Directions
The incorporation of spatial regularity into the construction of the the underlying diffusion geometry improves empirical performance of diffusion learning for HSI clustering in all three datasets considered. For images whose labels are sufficiently smooth, our results suggest there will be a regime for choices of spatial window in which incorporating spatial proximity improves the underlying mode detection and consequent labeling of HSI.
In terms of computational complexity, it suffices to note that the bottleneck is in the construction of . Since nearest neighbors are sought, and neighbors are constrained to live within a spatial radius , as long as with respect to , the nearest neighbor searches for all points can be done in . This gives an overall complexity for the algorithm that is essentially linear in . When the spatial radius is is large enough so that the full scene is considered in the nearest neighbor search, indexing structures (e.g. cover trees) allow for fast nearest neighbor searches, giving a quasilinear algorithm.
The proposed method is likely to be of substantial benefit for scenes that are in some sense smooth with respect to the underlying class labels. For images with many classes that are rapidly varying in space—for example, urban HSI—alternative approaches for incorporating spatial features may be necessary. One approach would be to consider as underlying data points not individual pixels, but higher order semantic features, for example image patches [28]. This would allow for information about fine features such as edges and textures to be incorporated into the diffusion process, allowing for finescale clusters to be learned. On the other hand, using patches instead of distinct pixels as the underlying dataset will increase the dimensionality of the data from to , which may lead to slower computational time. However, it is expected that the patch space is lowdimensional [29], making the statistical learning problem tractable.
Acknowledgments
This research was partially supported by NSFATD1737984, AFOSR FA95501710280, and NSFIIS1546392.
References
 [1] R.R. Coifman, S. Lafon, A.B. Lee, M. Maggioni, B. Nadler, F. Warner, and S.W. Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. Proceedings of the National Academy of Sciences of the United States of America, 102(21):7426–7431, 2005.
 [2] R.R. Coifman and S. Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5–30, 2006.
 [3] J.M. Murphy and M. Maggioni. Diffusion geometric methods for fusion of remotely sensed data. In Algorithms and Technologies for Multispectral, Hyperspectral, and Ultraspectral Imagery XXIV, volume 10644, page 106440I. International Society for Optics and Photonics, 2018.

[4]
J. Murphy and M. Maggioni.
Nonlinear unsupervised clustering and active learning of hyperspectral images.
IEEE Transactions on Geoscience and Remote Sensing, 2018. To Appear.  [5] N. Gillis, D. Kuang, and H. Park. Hierarchical clustering of hyperspectral images using ranktwo nonnegative matrix factorization. IEEE Transactions on Geoscience and Remote Sensing, 53(4):2066–2078, 2015.
 [6] Z. Meng, E. Merkurjev, A. Koniges, and A. Bertozzi. Hyperspectral video analysis using graph clustering methods. Image Processing On Line, 7:218–245, 2017.
 [7] A. Erturk and S. Erturk. Unsupervised segmentation of hyperspectral images using modified phase correlation. IEEE Geoscience and Remote Sensing Letters, 3(4):527–531, 2006.

[8]
C. Tao, H. Pan, Y. Li, and Z. Zou.
Unsupervised spectral–spatial feature learning with stacked sparse autoencoder for hyperspectral imagery classification.
IEEE Geoscience and Remote Sensing Letters, 12(12):2438–2442, 2015.  [9] L. ZelnikManor and P. Perona. Selftuning spectral clustering. In Advances in neural information processing systems, pages 1601–1608, 2005.
 [10] M. Maggioni and J.M. Murphy. Clustering by unsupervised geometric learning of modes. ArXiv:1810.06702, 2018*.
 [11] M. Fauvel, J. Chanussot, and J.A. Benediktsson. A spatial–spectral kernelbased approach for the classification of remotesensing images. Pattern Recognition, 45(1):381–392, 2012.
 [12] N.D. Cahill, W. Czaja, and D.W. Messinger. Schroedinger eigenmaps with nondiagonal potentials for spatialspectral clustering of hyperspectral imagery. In Algorithms and Technologies for Multispectral, Hyperspectral, and Ultraspectral Imagery XX, volume 9088, page 908804. International Society for Optics and Photonics, 2014.
 [13] H. Zhang, H. Zhai, L. Zhang, and P. Li. Spectral–spatial sparse subspace clustering for hyperspectral remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 54(6):3672–3684, 2016.
 [14] M. Banerjee, M. Capozzoli, L. McSweeney, and D. Sinha. Beyond kappa: A review of interrater agreement measures. Canadian journal of statistics, 27(1):3–23, 1999.
 [15] W. Zhu, V. Chayes, A. Tiard, S. Sanchez, D. Dahlberg, A. Bertozzi, S. Osher, D. Zosso, and D. Kuang. Unsupervised classification in hyperspectral imagery with nonlocal total variation and primaldual hybrid gradient algorithm. IEEE Transactions on Geoscience and Remote Sensing, 55(5):2786–2798, 2017.
 [16] J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning, volume 1. Springer series in statistics Springer, Berlin, 2001.
 [17] P. Comon. Independent component analysis, a new concept? Signal Processing, 36(3):287–314, 1994.
 [18] A. Hyvärinen. Fast and robust fixedpoint algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3):626–634, 1999.
 [19] A. Hyvärinen and E. Oja. Independent component analysis: algorithms and applications. Neural Networks, 13(4):411–430, 2000.

[20]
S. Dasgupta.
Experiments with random projection.
In
Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence
, pages 143–151. Morgan Kaufmann Publishers Inc., 2000.  [21] M. Ester, H.P. Kriegel, J. Sander, and X. Xu. A densitybased algorithm for discovering clusters in large spatial databases with noise. In Kdd, volume 96, pages 226–231, 1996.

[22]
A.Y. Ng, M.I. Jordan, and Y. Weiss.
On spectral clustering: Analysis and an algorithm.
In NIPS, volume 14, pages 849–856, 2001.  [23] N. Acito, G. Corsini, and M. Diani. An unsupervised algorithm for hyperspectral image segmentation based on the gaussian mixture model. In IEEE International Geoscience and Remote Sensing Symposium (IGARSS), volume 6, pages 3745–3747, 2003.
 [24] E. Elhamifar and R. Vidal. Sparse manifold clustering and embedding. In Advances in Neural Information Processing Systems, pages 55–63, 2011.
 [25] E. Elhamifar and R. Vidal. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2765–2781, 2013.
 [26] D. Mumford and J. Shah. Optimal approximations by piecewise smooth functions and associated variational problems. Communications on pure and applied mathematics, 42(5):577–685, 1989.
 [27] A. Rodriguez and A. Laio. Clustering by fast search and find of density peaks. Science, 344(6191):1492–1496, 2014.
 [28] A. Buades, B. Coll, and J.M. Morel. A nonlocal algorithm for image denoising. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 2, pages 60–65. IEEE, 2005.
 [29] A.D. Szlam, M. Maggioni, and R.R. Coifman. Regularization on graphs with functionadapted diffusion processes. Journal of Machine Learning Research, 9(Aug):1711–1739, 2008.