1 Introduction
The manifold hypothesis
is a fundamental guiding principle of modern machine learning. This hypothesis asserts that if a high-dimensional data set arises from a real-world phenomenon, then it is only high-dimensional in a superficial sense; in reality, the data must lie on or near a low-dimensional manifold [11] in the ambient space. In order to exploit the manifold hypothesis, many modern machine learning pipelines make use of
manifold learning, a collection of techniques for nonlinear dimensionality reduction that preserve the geometric structure of the underlying data manifold [5]. Examples of classic manifold learning algorithms include Kernel PCA [12], Isomap [21], and Locally Linear Embedding [19].

In manifold clustering
, we are given a data set sampled from the disjoint union of multiple low-dimensional manifolds, which may differ in terms of their intrinsic geometric properties, and our goal is to cluster these data according to the manifolds on which they lie. Many existing clustering algorithms can be construed as instances of manifold clustering. For example, in recent years, a popular clustering paradigm involves training a deep autoencoder to embed the input data in a low-dimensional latent space
[26, 10, 4]. Viewing the decoder as a coordinate chart, it becomes clear that any partition of the latent space (e.g. by $k$-means) corresponds to a partition of the original data set into an approximate union of low-dimensional manifolds.

1.1 Spectral Clustering
Spectral clustering [20, 13, 24]
is perhaps the quintessential manifold clustering algorithm. Given a set of data points, this algorithm first constructs a graph with vertices corresponding to points and weighted edges connecting highly similar points. Next, the algorithm embeds the data in a low-dimensional space using the eigenvectors of the graph's normalized Laplacian matrix. This embedding, known as the spectral embedding or Laplacian eigenmap, is closely related to the notion of normalized graph cuts. Finally, standard techniques, such as
$k$-means, are used to identify clusters in the embedding space.

The empirical success of spectral clustering is owed in large part to the choices made when constructing the similarity graph. One classic approach selects a constant $\varepsilon > 0$ and then constructs the graph by inserting an edge of weight $1$ between each pair of data points separated by a distance of at most $\varepsilon$. Another approach is to add an edge of weight $1$ from each point to its $k$ nearest neighbors. Yet another popular strategy is to add edges between all pairs of points, with weights decaying exponentially as a function of their squared pairwise distance [1].
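As a concrete illustration of these three constructions (the function names and the symmetrization choice for the $k$-nearest-neighbor graph are ours, not part of any standard library), a NumPy sketch:

```python
import numpy as np

def pairwise_sq_dists(X):
    """Matrix of squared Euclidean distances between the rows of X."""
    return np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)

def epsilon_graph(X, eps):
    """Edge of weight 1 between each pair of points at distance at most eps."""
    W = (pairwise_sq_dists(X) <= eps ** 2).astype(float)
    np.fill_diagonal(W, 0.0)  # no self-loops
    return W

def knn_graph(X, k):
    """Edge of weight 1 from each point to its k nearest neighbors (symmetrized)."""
    D = pairwise_sq_dists(X)
    np.fill_diagonal(D, np.inf)            # a point is not its own neighbor
    W = np.zeros_like(D)
    idx = np.argsort(D, axis=1)[:, :k]     # indices of the k nearest neighbors
    np.put_along_axis(W, idx, 1.0, axis=1)
    return np.maximum(W, W.T)              # make the graph undirected

def gaussian_graph(X, sigma):
    """Fully connected graph with exponentially decaying weights [1]."""
    W = np.exp(-pairwise_sq_dists(X) / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    return W
```

Each function returns a symmetric adjacency matrix suitable as input to spectral clustering.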
1.2 Sparse Subspace Clustering
In the last decade, a plethora of far more sophisticated algorithms have been proposed to construct meaningful similarity graphs for spectral clustering. One powerful class of such algorithms arose from studying the special case of data manifolds that are actually affine subspaces.
In this setting, the problem of manifold clustering is known as subspace clustering, a well-studied problem in machine learning that has amassed a rich history and literature [7, 15, 23, 14, 28, 29, 27]. The study of subspace clustering was originally motivated by the fact that in several real-world scenarios, the underlying data manifolds are not only low-dimensional but also flat. Indeed, images of faces under varying lighting conditions and handwritten digits with varying transformations are both examples of data sets that have been empirically observed to lie on or near flat, low-dimensional affine subspaces [7].
One compelling approach to subspace clustering exploits a property called self-expressiveness, which posits that a data point sampled from a union of affine subspaces can be accurately reconstructed by a sparse linear combination of other data points in the same subspace. The process of computing these linear combinations is known as sparse self-representation. The key idea behind the classic sparse subspace clustering algorithm is to use the magnitudes of these sparse self-representation coefficients as edge weights for a similarity graph, which can then be used for spectral clustering [7].
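For intuition, the per-point optimization behind sparse self-representation can be sketched with a basic ISTA (iterative soft-thresholding) loop. This is our own simplified rendering, not the solver of [7]; the regularization weight and iteration count are illustrative:

```python
import numpy as np

def ssc_coefficients(X, lam=0.01, T=500):
    """Sparse self-representation by ISTA: for each j, approximately solve
        min_c  0.5 * ||x_j - X^T c||^2 + lam * ||c||_1   subject to c_j = 0.
    X: (n, d) data matrix. Returns C: (n, n), column j representing x_j."""
    D = X.T                                   # (d, n) dictionary of data points
    n = X.shape[0]
    L = np.linalg.norm(D, 2) ** 2             # Lipschitz constant of the gradient
    C = np.zeros((n, n))
    for j in range(n):
        x, c = X[j], np.zeros(n)
        for _ in range(T):
            g = D.T @ (D @ c - x)             # gradient of the quadratic term
            z = c - g / L                     # gradient step
            c = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
            c[j] = 0.0                        # never use a point to represent itself
        C[:, j] = c
    return C
```

Note that the loop solves $n$ coupled problems over $n$ variables each, which makes the quadratic scaling discussed below concrete.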
In its most basic form, there are two drawbacks to learning the similarity graph via sparse self-representation. First, representing each data point as a sparse linear combination of the remaining data points requires solving an optimization problem over a very large number of parameters, which grows quadratically with the number of data points. Said differently, sparse subspace clustering searches the entire space of weighted similarity graphs, and the number of potential graph edges scales quadratically with the number of vertices.
Another drawback is that the notion of “similarity” used by the algorithm relies strongly on the assumption that the underlying data manifolds have a global affine structure. Although sparse self-representation produces a highly meaningful similarity graph when this assumption holds, it is not immediately clear how to extend the method to the case of nonlinear data manifolds (without falling back on techniques like $\varepsilon$-neighborhoods and $k$ nearest neighbors).
Some works, such as Sparse Manifold Clustering and Embedding [6] and others [17, 9, 16], have attempted to transfer sparse self-representation from the realm of subspace clustering to the realm of manifold clustering. These works nevertheless suffer from the same scalability issues as the original sparse subspace clustering algorithm.
Ultimately, these two limitations make it difficult to reap the benefits of sparse self-representation when constructing a similarity graph on large data sets with nonlinear structure.
1.3 Our Contributions
The main contribution of our work is a novel manifold learning and clustering algorithm based on sparse self-representation that overcomes both of the obstacles described in the previous section. Our algorithm learns a similarity graph on the data in an incremental, online fashion. Unlike existing manifold learning algorithms based on self-representation, our approach does not require the full data set in order to start building the similarity graph, but rather processes the data in small batches. This makes our algorithm more scalable than previous approaches. Our algorithm is also applicable to data sets with nonlinear structure.
In order to achieve these goals, our algorithm employs a novel variant of sparse self-representation which does not attempt to represent points as sparse linear combinations of the full data set, but rather as sparse convex combinations of a small set of additional atoms, which are also learned by our algorithm. We handle nonlinear structure in the data by incorporating a simple Laplacian penalty term in the optimization procedure, which acts as a kind of manifold regularization by encouraging data points to be reconstructed from nearby atoms.
Another contribution of our work is our algorithm for learning the set of auxiliary atoms. Our approach, which is based on incrementally updating the weights of a structured deep autoencoder, is an unconventional application of iterative algorithm unrolling
, a technique for designing interpretable deep neural networks that has recently gained traction in the signal and image processing communities. Our particular deep autoencoder is derived from the unfolded iterations of a first-order weighted $\ell_1$ minimization algorithm, which arises from our need to represent data points as sparse convex combinations of nearby atoms.

1.4 Outline
In Section 2, we describe the optimization problem underpinning our novel manifold clustering algorithm and provide a simple geometric interpretation of the solution to this optimization problem. In Section 3, we discuss an optimization algorithm for our problem based on structured deep learning. In Section 4, we share experimental results on real and synthetic data sets that demonstrate our algorithm’s ability to learn meaningful data representations and achieve high unsupervised clustering accuracy.
2 Proposed Method
Assume that $X = \{x_1, \dots, x_n\}$ is a collection of data points in $\mathbb{R}^d$ sampled from the union of disjoint low-dimensional manifolds. Given $X$, our goal is to cluster $X$ according to these underlying manifolds by drawing a similarity graph on the data and applying spectral clustering to this graph.
In the case that the underlying manifolds are actually affine subspaces, the principle of sparse self-representation suggests that we compute a coefficient matrix $C \in \mathbb{R}^{n \times n}$ that solves the following optimization problem:
$$\min_{C \in \mathbb{R}^{n \times n}} \; \|C\|_1 \quad \text{subject to} \quad X = XC, \quad \operatorname{diag}(C) = 0,$$
where, abusing notation, $X \in \mathbb{R}^{d \times n}$ denotes the matrix whose columns are the data points.
In this problem, the constraints $X = XC$ and $\operatorname{diag}(C) = 0$ ensure that each data point is represented as a linear combination of the other data points, so that the absolute value of the coefficient $c_{ij}$ can be used as a heuristic measure of the similarity between the points $x_i$ and $x_j$. The objective function $\|C\|_1$ ensures that the coefficient matrix is sparse, and hence that the symmetrized coefficient matrix $|C| + |C|^\top$ can be used as the adjacency matrix for a sparse similarity graph on the data points.

Provided that the number of clusters is small and $C$ is sparse, computing Laplacian eigenmaps of this graph will be efficient. However, computing $C$ itself may still take a prohibitive amount of time, since the number of coefficients grows quadratically with $n$. Moreover, if the underlying manifolds of the data are not affine subspaces, then the coefficients in $C$ will fail to capture meaningful similarity relationships among the data. As promised in Section 1.3, we now derive a variant of the above optimization problem that simultaneously addresses these two issues.
First, in order to make the computation of $C$ more efficient, let us abandon the overly ambitious goal of drawing a similarity graph on all $n$ vertices. Instead, let us first choose $m \ll n$ new atoms $a_1, \dots, a_m \in \mathbb{R}^d$ and learn a similarity graph whose edges are only permitted to connect vertices of the type $x_j$ with vertices of the type $a_i$, and vice versa. This is a bipartite graph on $n + m$ vertices with at most $nm$ edges that can be represented compactly by a matrix $C \in \mathbb{R}^{m \times n}$. In order to preserve sparse self-representation, we will choose $C$ and the atoms such that each data point is exactly reconstructed, i.e. $x_j = \sum_{i=1}^m c_{ij} a_i$ for all $j$.
In order for $C$ to contain useful information for clustering, we will make two further changes to the model. First, we will insist that the coefficients in each column of $C$ be nonnegative and sum to $1$. In other words, we will reconstruct data points as sparse convex combinations of the atoms. Next, rather than minimizing the entrywise $\ell_1$ norm $\|C\|_1$, we will instead minimize a weighted penalty term of the form $\sum_{i,j} c_{ij} \|x_j - a_i\|^2$. Ultimately, we arrive at the following optimization problem:
$$\min_{a_1, \dots, a_m, \, C} \;\; \sum_{i=1}^m \sum_{j=1}^n c_{ij} \|x_j - a_i\|^2 \quad \text{subject to} \quad x_j = \sum_{i=1}^m c_{ij} a_i, \quad \sum_{i=1}^m c_{ij} = 1, \quad c_{ij} \ge 0 \;\; \text{for all } i, j.$$
Using the coefficients that solve this optimization problem, we construct the similarity graph on the data points and atoms by including an edge of weight $c_{ij}$ between each data point $x_j$ and each atom $a_i$.
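For fixed atoms, the problem above decouples across the columns of $C$: each column solves a small linear program, since the objective is linear in $c$ and the constraints are affine. The following sketch (the helper name is ours) computes one such column with SciPy's `linprog`:

```python
import numpy as np
from scipy.optimize import linprog

def sparse_convex_code(x, atoms):
    """Solve:  min_c  sum_i c_i * ||x - a_i||^2
       subject to  sum_i c_i * a_i = x,  sum_i c_i = 1,  c_i >= 0.
    atoms: (m, d) array of atoms; x: (d,) data point. Returns c: (m,)."""
    m, _ = atoms.shape
    w = np.sum((atoms - x) ** 2, axis=1)      # linear objective: w_i = ||x - a_i||^2
    A_eq = np.vstack([atoms.T, np.ones(m)])   # exact recovery + sum-to-one constraints
    b_eq = np.concatenate([x, [1.0]])
    res = linprog(w, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * m)
    return res.x
```

Since a basic solution of this linear program has at most $d + 1$ nonzero entries (one per equality constraint), sparsity of the code comes essentially for free.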
2.1 Motivating the Regularizer
The expression $\sum_{i,j} c_{ij} \|x_j - a_i\|^2$ used in our proposed optimization objective can be motivated in several ways.
First, it can be viewed as a weighted $\ell_1$ norm penalty on the entries of $C$. Combined with the constraint $c_{ij} \ge 0$ for all $i$ and $j$, it is clear that minimizing this expression has a sparsity-inducing effect on the entries of $C$ (verified empirically in Section 4). We say that the penalty is weighted because the scalar quantity $\|x_j - a_i\|^2$ more harshly penalizes a large coefficient $c_{ij}$ when the points $x_j$ and $a_i$ are far apart, while accommodating a large coefficient more easily when the two points are very close together.
The expression can also be interpreted elegantly in terms of the learned similarity graph on the data points and atoms. Since this is an undirected graph with edge weights determined by $C$, the summation $\sum_{i,j} c_{ij} \|x_j - a_i\|^2$ is precisely the Laplacian quadratic form of the graph, evaluated on the function mapping vertices of the graph to their corresponding positions in $\mathbb{R}^d$. In this sense, the penalty term has a manifold regularization effect. The authors of [29], who also use this regularizer, describe it as an “adaptive distance regularization”.
Yet another interpretation of the objective function is as a soft relaxation of the $k$-means objective function, in which each data point $x_j$ places a certain positive probability mass on each atom in some subset of $\{a_1, \dots, a_m\}$, rather than all of its mass on a single centroid.

2.2 Similarity Graph Structure
In this section, we make rigorous the following informally-stated claim: for a fixed point $x$ and a fixed set of atoms $a_1, \dots, a_m$, the convex combination of atoms representing $x$ that minimizes our weighted penalty only uses atoms that are “most similar” to $x$. Although it is clear that a property of this kind is desirable for sparse self-representation, it is not always clear what notion of “similarity” a particular algorithm achieves. In the case of our algorithm, we can characterize this “similarity” as follows:
Theorem 1.
Let $x$ lie in the convex hull of the points $a_1, \dots, a_m \in \mathbb{R}^d$, and let $c^\star$ be a solution to the following optimization problem.
$$\min_{c \in \mathbb{R}^m} \;\; \sum_{i=1}^m c_i \|x - a_i\|^2 \quad \text{subject to} \quad \sum_{i=1}^m c_i a_i = x, \quad \sum_{i=1}^m c_i = 1, \quad c_i \ge 0 \;\; \text{for all } i.$$
If the points $a_1, \dots, a_m$ have a unique Delaunay triangulation, then the set of points $a_i$ such that $c^\star_i > 0$ comprises the vertices of some face of this triangulation that contains $x$.
Proof.
It suffices to show that if a Delaunay cell of $\{a_1, \dots, a_m\}$ contains $x$, then each point $a_i$ such that $c^\star_i > 0$ is a vertex of this cell. To this end, consider an arbitrary Delaunay cell containing $x$ defined by the vertices $\{a_i : i \in S\}$. Here, $S \subseteq \{1, \dots, m\}$ denotes a set of indices. We will prove that $c^\star_i > 0$ only if $i \in S$.
Since $x$ lies in the Delaunay cell, we may write it as a convex combination of only the vertices $\{a_i : i \in S\}$ using a coefficient vector $\hat{c}$ supported on $S$, which may differ from the solution $c^\star$ to the optimization problem. Observe that for any vector $c \in \mathbb{R}^m$, the identities $\sum_{i=1}^m c_i = 1$ and $\sum_{i=1}^m c_i a_i = x$ imply
$$\sum_{i=1}^m c_i \|x - a_i\|^2 = \sum_{i=1}^m c_i \|a_i\|^2 - \|x\|^2.$$
More generally, for any fixed point $z \in \mathbb{R}^d$, the same identities imply
$$\sum_{i=1}^m c_i \|z - a_i\|^2 = \sum_{i=1}^m c_i \|a_i\|^2 - 2\langle z, x\rangle + \|z\|^2,$$
so the objective values of two feasible vectors can be compared using either expression.
By the definition of a Delaunay cell, there is a sphere with center $z$ and radius $r$ such that $\|a_i - z\| = r$ for each $i \in S$ and $\|a_i - z\| > r$ for each $i \notin S$ (this is where we use the assumption that the triangulation is unique). Suppose for the sake of contradiction that $c^\star_i > 0$ for some $i \notin S$. Since the support of $\hat{c}$ is contained in $S$ while the support of $c^\star$ is not, we have
$$\sum_{i=1}^m c^\star_i \|x - a_i\|^2 - \sum_{i=1}^m \hat{c}_i \|x - a_i\|^2 = \sum_{i=1}^m c^\star_i \|a_i - z\|^2 - \sum_{i=1}^m \hat{c}_i \|a_i - z\|^2 > r^2 - r^2 = 0,$$
which contradicts the optimality of $c^\star$. Thus, each point $a_i$ such that $c^\star_i > 0$ is a vertex of the cell. We conclude that the points $a_i$ with $c^\star_i > 0$ are the vertices of some face of the triangulation containing $x$. ∎
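Theorem 1 can be sanity-checked numerically by solving the linear program for a generic point set and comparing the support of the solution with the Delaunay triangle containing $x$. A SciPy sketch (the random seed and point count are arbitrary choices of ours):

```python
import numpy as np
from scipy.optimize import linprog
from scipy.spatial import Delaunay

rng = np.random.default_rng(0)
atoms = rng.standard_normal((12, 2))   # generic points: unique triangulation w.h.p.
x = atoms.mean(axis=0)                 # a query point inside the convex hull

# Solve the weighted-l1 problem as a linear program.
w = np.sum((atoms - x) ** 2, axis=1)
A_eq = np.vstack([atoms.T, np.ones(len(atoms))])
b_eq = np.concatenate([x, [1.0]])
c = linprog(w, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * len(atoms)).x

# Compare the support of the solution with the Delaunay cell containing x.
tri = Delaunay(atoms)
support = set(np.flatnonzero(c > 1e-8))
verts = set(tri.simplices[int(tri.find_simplex(x))])
print(support.issubset(verts))
```

In agreement with the theorem, the support of the optimal code lies among the vertices of the Delaunay cell containing the query point.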
3 Optimization Algorithm
There are two key steps to solving the optimization problem proposed in Section 2. First, given the data set, we must select a suitably representative set of atoms. Next, given the data set and the atoms, we must compute the coefficients that best reconstruct each data point as a sparse convex combination of nearby atoms.
Of these two steps, the latter is relatively straightforward. In fact, if we treat the data set and atoms as fixed, then the resulting optimization problem over the coefficients is simply a weighted $\ell_1$ minimization problem. Convex optimization problems of this kind have been studied extensively, and there exists a rich array of efficient first-order methods to solve them.
In contrast, the process of choosing the optimal atoms is far from straightforward. One heuristic is to sample the atoms uniformly at random from the data set, but doing so is unlikely to lead to optimal performance. Another approach, inspired by our problem's resemblance to sparse dictionary learning, is alternating minimization, which takes turns minimizing the objective function over the atoms and over the coefficients. Ultimately, we would like an approach for learning the atoms that goes beyond these traditional frameworks.
The goal of this section is to describe an algorithm that solves both steps of the proposed optimization problem in tandem. Our algorithm is an unconventional application of algorithm unrolling, a technique for structured deep learning that has recently gained traction in the signal and image processing communities. Since our algorithm involves training a neural network, it has the unique advantage of being able to learn the atoms and coefficients entirely from online passes over the data set.
3.1 Relaxing the Exact Recovery Constraint
Our optimization algorithm solves the following relaxation of the problem from Section 2. Let
$$\Delta^m = \Big\{ c \in \mathbb{R}^m : c_i \ge 0 \text{ for all } i, \;\; \sum_{i=1}^m c_i = 1 \Big\}$$
denote the standard simplex (a.k.a. the “probability simplex”) in $\mathbb{R}^m$. For a matrix of atoms $A = [a_1, \dots, a_m] \in \mathbb{R}^{d \times m}$, a data point $x \in \mathbb{R}^d$, and a coefficient vector $c \in \Delta^m$, define the loss function
$$\ell(A, x, c) = \|x - Ac\|^2 + \lambda \sum_{i=1}^m c_i \|x - a_i\|^2.$$
If $x$ is sampled uniformly from the data set $X$, then the relaxed optimization problem is
$$\min_{A} \;\; \mathbb{E}_{x} \Big[ \min_{c \in \Delta^m} \ell(A, x, c) \Big].$$
In this formulation of the problem, we have replaced the exact recovery constraint $x = Ac$ with a more flexible reconstruction error term in the objective function. The tradeoff between this term and the Laplacian penalty term, which determines the sparsity of $c$, is controlled by the parameter $\lambda$.
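Under our reading of the relaxed objective (the placement of $\lambda$ on the penalty term is a reconstruction from context, not taken verbatim from the original), the per-sample loss can be written as:

```python
import numpy as np

def relaxed_loss(A, x, c, lam):
    """l(A, x, c): squared reconstruction error plus a weighted-l1 (Laplacian) penalty.
    A: (d, m) matrix of atoms, x: (d,) data point, c: (m,) coefficients in the simplex."""
    recon = x - A @ c                              # reconstruction residual
    penalty = c @ np.sum((A.T - x) ** 2, axis=1)   # sum_i c_i * ||x - a_i||^2
    return recon @ recon + lam * penalty
```

By Jensen's inequality, the reconstruction term never exceeds the penalty term for $c$ in the simplex, which is one way to see that the two terms are commensurate in scale.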
3.2 Algorithm Unrolling
In order to solve the relaxed optimization problem stated in the previous section, we introduce an autoencoder architecture that implicitly solves the problem when trained by backpropagation. The network has weights $A = [a_1, \dots, a_m] \in \mathbb{R}^{d \times m}$, which are initialized to a random subset of the data $X$. The autoencoder takes as input a single data point $x$ (or a batch of such points), and outputs a reconstructed point $\hat{x} = Ac$. The output of the encoder (and input to the decoder) is a sparse coefficient vector $c \in \Delta^m$.

Specifically, given weights $A$ and input $x$, the encoder produces a coefficient vector $c$ that best represents $x$ as a sparse convex combination of the atoms in $A$. The layers of this encoder are derived from the unrolled iterations of a projected gradient descent algorithm for minimizing the loss function $\ell(A, x, c)$ with respect to the variable $c$.
The process of designing this kind of highly-structured recurrent neural network is known as algorithm unrolling; although our application of the technique to manifold learning is new, there exists a rich, burgeoning literature on the subject in the context of sparse dictionary learning [8, 18, 22, 3].

Given $c$, the decoder multiplies $c$ by the weight matrix $A$ to compute an estimate $\hat{x} = Ac$ of the input data $x$. We train the autoencoder by backpropagation on the loss function $\ell$, applied to the network weights $A$, input data samples $x$, and the estimated coefficient vectors $c$ computed by the encoder. The remainder of this section is devoted to describing the autoencoder architecture in greater detail.

Forward Pass. Given $A$ and $x$, there are several efficient first-order convex optimization algorithms at our disposal to compute an optimal coefficient vector $c \in \Delta^m$.
One such method is accelerated projected gradient descent [2]. This method initializes $c^{(0)} = c^{(1)} \in \Delta^m$ and iterates
$$y^{(t)} = c^{(t)} + \frac{\beta_{t-1} - 1}{\beta_t}\big(c^{(t)} - c^{(t-1)}\big), \qquad c^{(t+1)} = \Pi_{\Delta^m}\!\Big(y^{(t)} - \eta\, \nabla_c\, \ell\big(A, x, y^{(t)}\big)\Big)$$
for $t = 1, \dots, T$. The parameter $\eta$ is a step size, which is best initialized to be inversely proportional to the square of the spectral norm $\|A\|$. The constants $\beta_t$ are given by the recurrence
$$\beta_0 = \beta_1 = 1, \qquad \beta_{t+1} = \frac{1 + \sqrt{1 + 4\beta_t^2}}{2}.$$
The gradient of the loss function with respect to the code is given by
$$\nabla_c\, \ell(A, x, c) = 2A^\top(Ac - x) + \lambda\, w, \qquad \text{where } w_i = \|x - a_i\|^2 \text{ for } i = 1, \dots, m.$$
The operator $\Pi_{\Delta^m}$ projects its input onto $\Delta^m$, the standard simplex. It is known [25] that this operator has the form
$$\Pi_{\Delta^m}(v) = \mathrm{ReLU}\big(v - b(v)\,\mathbf{1}\big)$$
for a certain piecewise-linear bias function $b : \mathbb{R}^m \to \mathbb{R}$. Thus, the approximate code $c$ is the output of a recurrent encoder with input $x$, weights $A$, and activation function $\mathrm{ReLU}$.

Backward Pass. In order to approximate the solution of
$$\min_{A} \;\; \mathbb{E}_{x} \Big[ \min_{c \in \Delta^m} \ell(A, x, c) \Big],$$
it suffices to minimize the loss $\ell(A, x, c)$ by backpropagation through the regularized squared reconstruction error of the decoder and through the computation graph of the code $c$, which is computed by the encoder. We also use backpropagation to train the encoder step size $\eta$ described in the previous section.
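The simplex projection used as the encoder's activation admits a well-known $O(m \log m)$ sorting-based implementation [25]; a minimal NumPy sketch:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the standard simplex, in the form
    relu(v - b(v)) for a piecewise-linear threshold b [25]."""
    u = np.sort(v)[::-1]                      # sort entries in decreasing order
    css = np.cumsum(u)
    j = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / j > 0)[0][-1]
    b = (css[rho] - 1.0) / (rho + 1)          # the data-dependent bias b(v)
    return np.maximum(v - b, 0.0)             # relu(v - b(v))
```

The output always has nonnegative entries summing to one, and points already in the simplex are fixed by the projection.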
A complete diagram of the autoencoder architecture is shown in Figure 1.
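Putting the pieces together, the encoder's forward pass can be sketched in plain NumPy. This is a simplified, non-differentiable rendering of the unrolled network; the uniform initialization and the exact momentum indexing are our own choices:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the standard simplex [25]."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    b = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - b, 0.0)

def encode(A, x, lam=0.1, T=50):
    """T unrolled iterations of accelerated projected gradient descent on
    l(A, x, c) over the simplex. A: (d, m) atoms, x: (d,). Returns c: (m,)."""
    m = A.shape[1]
    w = np.sum((A.T - x) ** 2, axis=1)            # w_i = ||x - a_i||^2
    eta = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2) # step size ~ 1 / ||A||^2
    c_prev = c = np.full(m, 1.0 / m)              # start at the uniform code
    beta_prev, beta = 1.0, 1.0
    for _ in range(T):
        y = c + ((beta_prev - 1.0) / beta) * (c - c_prev)  # momentum step
        grad = 2.0 * A.T @ (A @ y - x) + lam * w           # gradient of the loss
        c_prev, c = c, project_simplex(y - eta * grad)     # projected update
        beta_prev, beta = beta, (1.0 + np.sqrt(1.0 + 4.0 * beta ** 2)) / 2.0
    return c
```

The decoder is then simply the map $c \mapsto Ac$; in the trained network, the atoms $A$ and the step size are updated by backpropagation rather than held fixed as here.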
4 Experiments
In this section, we conduct experiments to evaluate the behavior of our algorithm on synthetic and real data sets. Our goal is to gain insight into the atoms learned by our structured autoencoder, their associated representation coefficients, and (in the case of labeled data sets) the unsupervised clustering accuracy achieved by performing spectral clustering on the resulting similarity graph.
In each experiment, unless otherwise specified, we choose the parameter $\lambda$ that results in the most even balance between the two terms in the objective function, namely the squared reconstruction error term and the weighted $\ell_1$ penalty term. In general, we found that a fixed, moderate setting of $\lambda$ achieved this balance reasonably well, which can be partially explained by the fact that for any $c$ lying in the standard simplex, the terms $\|x - Ac\|^2$ and $\sum_i c_i \|x - a_i\|^2$ are both bounded in magnitude by $\max_i \|x - a_i\|^2$.¹

¹ Our code is publicly available at https://github.com/pbt17/manifoldlearningwithsimplexconstraints
4.1 Reconstructing Synthetic Data
For our first experiment, we focus on the atoms and coefficients learned by our autoencoder when the data is sampled from one-dimensional manifolds in $\mathbb{R}^2$. Figure 2 shows two such data sets. The first is the unit circle in $\mathbb{R}^2$. The second is the classic two moons data set, which consists of two disjoint semicircular arcs in $\mathbb{R}^2$.
For each of these two data sets, we trained the autoencoder on an infinite stream of data sampled uniformly from the underlying manifold(s). We added a small amount of Gaussian white noise to each data point to make the representation learning problem more challenging.
Figure 2 shows the result of training the autoencoder on these data sets. We see that in each case, the atoms learned by the model are meaningful. Moreover, in each case, we accurately reconstruct each data point as a sparse convex combination of these atoms, modulo the additive white noise.
4.2 Clustering Synthetic Data
Our next experiment assesses the clustering capabilities of our algorithm. We studied a simple family of data distributions consisting of two underlying clusters in $\mathbb{R}^2$. These clusters took the shape of two concentric circles whose radii differ by a separation parameter $\delta > 0$. For multiple values of $\delta$, we trained our structured autoencoder with $m$ atoms on an infinite stream of data sampled uniformly from these two manifolds (each with half the probability mass). For this experiment, we did not add any Gaussian noise to the data.
Figure 4 shows the result of training the autoencoder on a handful of these data sets for various combinations of $m$ and $\delta$. Figure 5 shows the accuracy achieved by performing spectral clustering on the corresponding similarity graphs. Based on these results, it appears that our clustering algorithm is capable of distinguishing between clusters of arbitrarily small separation $\delta$, provided that the number of atoms $m$ is sufficiently large.
We also remark that the clustering accuracy of our algorithm on the two moons data set, shown earlier in Figure 2, is over 98%, as well.
4.3 Clustering Handwritten Digits
For our third experiment, we ran our clustering algorithm on a subset of the MNIST handwritten digit data set consisting of the five digits 0, 1, 3, 6, and 7. Figure 6 shows a handful of the randomly initialized atoms, their “smoothed” counterparts after training, and their predicted cluster labels. Ultimately, performing spectral clustering on the learned similarity graph recovers the true digit labels with over 98% unsupervised clustering accuracy.

5 Conclusion
In this work, we proposed a new manifold learning and clustering algorithm for large, nonlinear data sets that utilizes the principle of sparse selfrepresentation. Our algorithm first trains a highly structured deep autoencoder to identify a small set of representative atoms for the data. We then reuse this autoencoder to express each data point as a sparse convex combination of these atoms—a process that amounts to solving a natural convex optimization problem with simplex constraints. Finally, we use the sparse representation coefficients to construct a similarity graph on the data and the atoms, opening the door to nonlinear dimensionality reduction and clustering via spectral methods. Our experiments demonstrated the ability of our algorithm to learn meaningful representations of nonlinear data manifolds in terms of a small set of atoms, as well as its ability to accurately recover class labels in an unsupervised fashion. Ultimately, we believe that our approach opens a promising new avenue for efficient yet interpretable manifold learning.
References
 [1] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.
 [2] Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3–4):231–357, 2015.
 [3] Thomas Chang, Bahareh Tolooshams, and Demba E. Ba. Randnet: Deep learning with compressed measurements of images. In International Workshop on Machine Learning for Signal Processing, 2019.
 [4] Shlomo E. Chazan, Sharon Gannot, and Jacob Goldberger. Deep clustering based on a mixture of autoencoders. In International Workshop on Machine Learning for Signal Processing, pages 1–6, 2019.
 [5] Ronald R. Coifman and Stéphane Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5–30, 2006.
 [6] Ehsan Elhamifar and René Vidal. Sparse manifold clustering and embedding. In Advances in Neural Information Processing Systems, pages 55–63, 2011.
 [7] Ehsan Elhamifar and René Vidal. Sparse subspace clustering: Algorithm, theory, and applications. Transactions on Pattern Analysis and Machine Intelligence, 35(11):2765–2781, 2013.
 [8] Karol Gregor and Yann LeCun. Learning fast approximations of sparse coding. In International Conference on Machine Learning, 2010.
 [9] Pan Ji, Tong Zhang, Hongdong Li, Mathieu Salzmann, and Ian D. Reid. Deep subspace clustering networks. In Advances in Neural Information Processing Systems, pages 24–33, 2017.
 [10] Zhuxi Jiang, Yin Zheng, Huachun Tan, Bangsheng Tang, and Hanning Zhou. Variational deep embedding: An unsupervised and generative approach to clustering. In International Joint Conference on Artificial Intelligence, pages 1965–1972, 2017.
 [11] John Lee. Introduction to Smooth Manifolds. Springer, 2003.
 [12] Sebastian Mika, Bernhard Schölkopf, Alex J. Smola, Klaus-Robert Müller, Matthias Scholz, and Gunnar Rätsch. Kernel PCA and denoising in feature spaces. In Advances in Neural Information Processing Systems, pages 536–542, 1999.
 [13] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, pages 849–856, 2001.
 [14] V. M. Patel and R. Vidal. Kernel sparse subspace clustering. In IEEE International Conference on Image Processing, pages 2849–2853, 2014.
 [15] Vishal M. Patel, Hien Van Nguyen, and René Vidal. Latent space sparse subspace clustering. In International Conference on Computer Vision, pages 225–232, 2013.
 [16] Xi Peng, Jiashi Feng, Shijie Xiao, Wei-Yun Yau, Joey Tianyi Zhou, and Songfan Yang. Structured autoencoders for subspace clustering. Transactions on Image Processing, 27(10):5076–5086, 2018.
 [17] Xi Peng, Shijie Xiao, Jiashi Feng, Wei-Yun Yau, and Zhang Yi. Deep subspace clustering with sparsity prior. In International Joint Conference on Artificial Intelligence, pages 1925–1931, 2016.
 [18] Jason Tyler Rolfe and Yann LeCun. Discriminative recurrent sparse autoencoders. In International Conference on Learning Representations, 2013.
 [19] Sam T Roweis and Lawrence K Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
 [20] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
 [21] Joshua B Tenenbaum, Vin De Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
 [22] Bahareh Tolooshams, Sourav Dey, and Demba E. Ba. Scalable convolutional dictionary learning with constrained recurrent sparse autoencoders. In International Workshop on Machine Learning for Signal Processing, pages 1–6, 2018.
 [23] René Vidal and Paolo Favaro. Low rank subspace clustering (LRSC). Pattern Recognition Letters, 43:47–61, 2014.
 [24] Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.
 [25] Weiran Wang and Miguel A. Carreira-Perpinán. Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application. arXiv preprint arXiv:1309.1541, 2013.
 [26] Junyuan Xie, Ross B. Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering analysis. In International Conference on Machine Learning, volume 48, pages 478–487, 2016.
 [27] Jufeng Yang, Jie Liang, Kai Wang, Paul L. Rosin, and Ming-Hsuan Yang. Subspace clustering via good neighbors. Transactions on Pattern Analysis and Machine Intelligence, 42(6):1537–1544, 2020.
 [28] Chong You, Daniel P. Robinson, and René Vidal. Scalable sparse subspace clustering by orthogonal matching pursuit. In Conference on Computer Vision and Pattern Recognition, pages 3918–3927, 2016.
 [29] Guo Zhong and Chi-Man Pun. Subspace clustering by simultaneously feature selection and similarity learning. Knowledge-Based Systems, 193:105512, 2020.