Scalable Deep k-Subspace Clustering

11/02/2018 · Tong Zhang, et al.

Subspace clustering algorithms are notorious for their scalability issues because building and processing large affinity matrices are demanding. In this paper, we introduce a method that simultaneously learns an embedding space along with subspaces within it to minimize a notion of reconstruction error, thus addressing the problem of subspace clustering in an end-to-end learning paradigm. To achieve our goal, we propose a scheme to update subspaces within a deep neural network. This in turn frees us from the need to construct an affinity matrix to perform clustering. Unlike previous attempts, our method can easily scale up to large datasets, making it unique in the context of unsupervised learning with deep architectures. Our experiments show that our method significantly improves the clustering accuracy while enjoying a much smaller memory footprint.


1 Introduction

Subspace Clustering (SC) is the de facto method in various clustering tasks such as motion segmentation [17, 7, 10, 11], face clustering [9, 8] and image segmentation [36, 24]. As the name implies, the underlying assumption in SC is that samples forming a cluster can be adequately described by a subspace. Such data modeling is natural in many applications. One prime example is face clustering in which it has been shown that the face images of one subject obtained with a fixed pose and varying illumination lie in a low-dimensional subspace [20].

Most recent subspace clustering methods [8, 21] assume that data points lie on a union of linear subspaces and construct an affinity matrix for spectral clustering. Although promising results are obtained on certain datasets, the performance degrades significantly when non-linearity arises in the data. Moreover, constructing the affinity matrix and performing clustering demand hefty memory footprints and processing power. To benefit from the concept of SC and its unique features, two issues should be addressed:

Non-Linearity: The majority of SC algorithms target clustering with linear subspaces. This is a very bold assumption and can hardly be met in practice. Some studies [26, 38, 34, 12] employ kernel methods to alleviate this limitation. Nevertheless, kernel methods still suffer from scalability issues [40]. To make things more complicated, there is no guideline on how to choose a kernel function and its parameters that are truly well-suited to subspace clustering.

Scalability: With the current trend in analyzing big data, SC algorithms should be able to deal with large volumes of data. However, most of the state-of-the-art methods for SC make use of an affinity matrix along with a norm regularizer (e.g., $\ell_1$ [7, 8], $\ell_2$ [13] or nuclear norm [21, 32]). Not only does building an affinity matrix demand solving large-scale optimization problems, but performing spectral clustering on an affinity matrix, whose size is dictated by the number of samples, is also overwhelming.

In this paper, instead of constructing an affinity matrix for spectral clustering, we revisit the k-subspace clustering (k-SC) method [5, 30, 2] to design a novel and scalable method. In order to handle non-linear subspaces, we propose to utilize deep neural networks to project data to a latent space where k-SC can be easily applied. Our contributions in this paper are threefold:

  1. We bypass the steps of constructing an affinity matrix and performing spectral clustering, which are used in mainstream subspace clustering algorithms, and accelerate the computation by using a variant of k-subspace clustering. As a result, our method can handle datasets that are orders of magnitude larger than those considered by traditional methods.

  2. In order to address non-linearity, we equip deep neural networks with subspace priors. This in turn enables us to learn an explicit non-linear mapping of the data that is well-suited to subspace clustering.

  3. We propose novel strategies to update the subspace bases. When the size of the dataset at hand is manageable, we update the subspaces in closed form using the Singular Value Decomposition (SVD), with a simple mechanism to rule out outliers. For large datasets, we update the subspaces using stochastic optimization on Grassmann manifolds.

Empirically, evaluations on relatively large datasets such as MNIST and Fashion-MNIST [33] show that our proposed method achieves state-of-the-art results in terms of clustering accuracy and speed.

2 Related Work

Linear subspace clustering methods can be classified into algebraic algorithms, iterative methods, statistical methods and spectral clustering-based methods [31]. Among them, spectral clustering-based methods [7, 21, 13, 16, 14, 39] have become dominant in the literature. In general, spectral clustering-based methods solve the problem in two steps: first, encode a notion of similarity between pairs of data points into an affinity matrix; then, apply normalized cuts [29] or spectral clustering [25] on this affinity matrix. To construct the affinity matrix, recent methods tend to rely on the concept of self-expressiveness, which seeks to express each point in a cluster as a linear combination of other points sharing some common notion (e.g., coming from the same subspace).
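For concreteness, the following is a minimal sketch of this two-step baseline pipeline (not our method): given self-expressive coefficients C produced by any sparse or low-rank solver (computed elsewhere), a symmetric affinity matrix is formed and handed to off-the-shelf spectral clustering. The function name is illustrative; `SpectralClustering` is scikit-learn's implementation.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def affinity_then_spectral(C, n_clusters):
    """Two-step baseline: affinity from self-expressive coefficients C
    (e.g., an SSC/LRR solution, obtained elsewhere), then spectral clustering.

    C : (N, N) coefficient matrix with X ~ X C.  Both memory and runtime grow
    with N^2, which is exactly the bottleneck this paper avoids.
    """
    affinity = np.abs(C) + np.abs(C).T                     # symmetrized |C|
    model = SpectralClustering(n_clusters=n_clusters, affinity='precomputed')
    return model.fit_predict(affinity)
```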

The literature on true end-to-end learning of subspace clustering is surprisingly limited. Furthermore, to the best of our knowledge, none of the existing deep algorithms can handle medium-sized datasets, let alone large ones (among all the datasets that have been tested, COIL100, with 7,200 images, seems to be the largest). In hybrid methods such as [28], hand-crafted features (e.g., SIFT [23] or HOG [6]) are fed into a deep auto-encoder with a sparse subspace clustering (SSC) prior. The final clustering is then obtained by applying k-means or SSC on the learned auto-encoder features. Instead of using hand-crafted features, Deep Subspace Clustering Networks (DSC-NET) [15] employ a deep convolutional auto-encoder to nonlinearly map the images to a latent space, and make use of a self-expressive layer between the encoder and the decoder to learn the affinities between all the data points. By learning the affinity matrix within the neural network, state-of-the-art results on several traditional small datasets are reported in [15]. Nevertheless, since it relies on the whole dataset to create the affinity matrix, DSC-NET cannot scale to large datasets.

SSC by Orthogonal Matching Pursuit (SSC-OMP) [40] is probably the only subspace clustering method that could be considered "scalable". The main idea is to replace the large-scale convex optimization procedure with the OMP algorithm when constructing the affinity matrix. Having said this, SSC-OMP still makes use of spectral clustering and hence fails to truly push subspace clustering to large-scale datasets.

k-Subspace Clustering [30, 2], an iterative method, can be considered a generalization of the k-means algorithm. k-SC shows fast convergence behavior and can handle both linear and affine subspaces explicitly. However, k-SC methods are sensitive to outliers and initialization. Attempts to make k-SC methods more robust include the work of Zhang et al. [41] and Balzano et al. [3]. In the former, the best subspaces are selected from a large number of candidate subspaces using a greedy combinatorial algorithm [41] to make the algorithm robust to data corruptions. Balzano et al. propose a variant of the k-subspaces method named k-GROUSE which can handle missing data in subspace clustering. However, the resulting methods do not seem to produce results competitive with methods relying on affinity matrices.

In this paper, we propose k-Subspace Clustering (k-SC) networks, which incorporate k-SC into a deep neural network embedding. This lets us not only bypass the affinity construction and spectral clustering procedure, but also handle data points lying in non-linear subspaces.

3 k-Subspace Clustering (k-SC) Networks

Our k-subspace clustering networks leverage the properties of a deep convolutional auto-encoder and of k-subspace clustering. In this section we discuss the k-subspace clustering formulation and the whole framework in detail.

Figure 1: Scalable deep k-subspace clustering structure. As an example, we show our scalable batch-based subspace clustering with three convolutional encoder layers and three deconvolutional decoder layers. During training, we first pre-train the deep convolutional auto-encoder by simply reconstructing the corresponding images, and we then fine-tune the network using this pre-trained model as initialization. During fine-tuning, the network minimizes the sum of the distances of each sample in the latent space to its closest subspace.

3.1 k-Subspace Clustering

Consider a collection of $N$ points $\mathcal{X} = \{x_i \in \mathbb{R}^n\}_{i=1}^{N}$ belonging to a union of $k$ subspaces $\{\mathcal{S}_j\}_{j=1}^{k}$ of dimensions $p_1, \dots, p_k$, respectively (we assume $p_1 = \dots = p_k = p$ in the remainder). With a slight abuse of notation, we will use $S_j \in \mathbb{R}^{n \times p}$ to represent the basis of the subspace indexed by $j$, that is $\mathrm{span}(S_j) = \mathcal{S}_j$ and $S_j^\top S_j = I_p$, with $I_p$ denoting the $p \times p$ identity matrix. The goal of subspace clustering is to learn the subspaces and assign points to their nearest subspaces. Once every data point is assigned to a subspace, the corresponding subspace basis can be re-calculated by SVD (as will be shown shortly). Different from self-expressiveness-based methods, which obtain the affinity matrix by solving large-scale optimization problems, k-SC seeks to minimize the sum of residuals of points to their nearest subspaces. The cost function of k-SC can be written as

\min_{\{S_j\},\{w_{ij}\}} \; \sum_{i=1}^{N} \sum_{j=1}^{k} w_{ij}\, \| x_i - S_j S_j^\top x_i \|_2^2 \quad \text{s.t. } w_{ij} \in \{0,1\},\ \sum_{j=1}^{k} w_{ij} = 1    (1)

Given the subspace bases $\{S_j\}_{j=1}^{k}$, the optimal value for $w_{ij}$ can be written as

w_{ij} = \begin{cases} 1, & \text{if } j = \arg\min_{l} \| x_i - S_l S_l^\top x_i \|_2^2 \\ 0, & \text{otherwise} \end{cases}    (2)

For the sake of discussion, let us arrange the $w_{ij}$ into a membership matrix $W \in \{0,1\}^{N \times k}$. Beginning with an initialization of the candidate subspace bases, k-SC updates the membership assignments and the subspaces in an alternating fashion: 1) cluster points by assigning each one to its nearest subspace as in Eqn. (2); 2) re-estimate the new subspace bases by performing SVD on the points of each cluster (i.e., the points whose corresponding entry in the $j$-th column of $W$ is 1). Similar to k-means, the whole algorithm works in an Expectation-Maximization (EM) style, and is guaranteed to converge to a local minimum in a finite number of iterations. We will shortly show how stochastic optimization techniques can be applied to minimize the problem depicted in (1), equipping our solution with the ability to handle large-scale data.
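To make the alternation concrete, here is a minimal NumPy sketch of classical k-SC as described by Eqns. (1)-(2); the function name, the random initialization and the stopping rule are illustrative choices, not a reference implementation.

```python
import numpy as np

def k_subspace_clustering(X, k, p, n_iters=50, seed=0):
    """Minimal sketch of classical k-subspace clustering (Eqns. (1)-(2)).

    X : (N, n) data matrix, k : number of subspaces, p : subspace dimension.
    """
    rng = np.random.default_rng(seed)
    N, n = X.shape
    # Random orthonormal bases as initialization.
    bases = [np.linalg.qr(rng.standard_normal((n, p)))[0] for _ in range(k)]
    labels = np.full(N, -1)
    for _ in range(n_iters):
        # E-step (Eqn. (2)): assign each point to the smallest-residual subspace.
        residuals = np.stack(
            [np.linalg.norm(X - (X @ S) @ S.T, axis=1) for S in bases], axis=1)
        new_labels = residuals.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # M-step: refit each basis with the top-p left singular vectors.
        for j in range(k):
            Xj = X[labels == j]
            if len(Xj) >= p:
                U, _, _ = np.linalg.svd(Xj.T, full_matrices=False)
                bases[j] = U[:, :p]
    return labels, bases
```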

3.2 k-SC with Convolutional Auto-Encoder Network

Denoising fully-connected Auto-Encoders (AEs) are widely used with generic clustering algorithms [37, 35]. We have found such structures difficult to train (due to the large number of parameters in the fully-connected layers) and propose instead to use convolutional AEs to learn the embeddings for k-SC.

Specifically, let $\Theta$ denote the AE parameters, which can be decomposed into encoder parameters $\Theta_e$ and decoder parameters $\Theta_d$. Let $f_{\Theta_e}(\cdot)$ be the encoder mapping function and $g_{\Theta_d}(\cdot)$ the decoder mapping function, both of which are composed of a sequence of convolution kernels and nonlinear activation functions; we write $z_i = f_{\Theta_e}(x_i)$ for the latent representation of $x_i$. Our overall loss can be written as

\mathcal{L} = \mathcal{L}_{\mathrm{rec}} + \lambda\, \mathcal{L}_{k\text{-}SC}    (3)

where $\lambda$ is a regularization parameter that balances the reconstruction loss and the k-subspace clustering loss. The auto-encoder reconstruction loss $\mathcal{L}_{\mathrm{rec}}$ is defined as

\mathcal{L}_{\mathrm{rec}} = \frac{1}{N} \sum_{i=1}^{N} \big\| x_i - g_{\Theta_d}\big(f_{\Theta_e}(x_i)\big) \big\|_2^2    (4)

The $\mathcal{L}_{k\text{-}SC}$ is the loss for subspace clustering and is written as

\mathcal{L}_{k\text{-}SC} = \sum_{i=1}^{N} \sum_{j=1}^{k} w_{ij}\, \big\| z_i - S_j S_j^\top z_i \big\|_2^2 \quad \text{s.t. } S_j \in \mathrm{Gr}(p, d),\ w_{ij} \in \{0,1\},\ \sum_{j=1}^{k} w_{ij} = 1    (5)

where $\mathrm{Gr}(p, d)$ denotes the Grassmann manifold consisting of $p$-dimensional subspaces with ambient dimension $d$, the dimensionality of the latent space.
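As a sanity check on the notation, the overall objective of Eqns. (3)-(5) can be evaluated as in the sketch below; it assumes the hard assignments of Eqn. (2) (each code is charged to its closest subspace) and uses illustrative argument names, not the authors' code.

```python
import numpy as np

def combined_loss(x, x_rec, z, bases, lam):
    """Illustrative evaluation of the overall loss in Eqns. (3)-(5).

    x, x_rec : (N, n) flattened inputs and their CAE reconstructions;
    z : (N, d) latent codes; bases : list of (d, p) orthonormal bases S_j;
    lam : the trade-off weight lambda of Eqn. (3).
    """
    # Eqn. (4): average squared reconstruction error.
    loss_rec = np.mean(np.sum((x - x_rec) ** 2, axis=1))
    # Eqn. (5) with the hard assignments of Eqn. (2): each latent code is
    # charged with its squared distance to the closest subspace.
    residuals = np.stack(
        [np.sum((z - (z @ S) @ S.T) ** 2, axis=1) for S in bases], axis=1)
    loss_ksc = residuals.min(axis=1).sum()
    # Eqn. (3): weighted combination of the two terms.
    return loss_rec + lam * loss_ksc
```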

As a pre-processing step, some traditional algorithms such as [40, 41] use PCA to project images onto a low-dimensional space. However, the PCA projection is linear and fixed. By contrast, our encoder function can update its parameters to adapt to a space which is subspace-clustering-friendly.

4 Optimization

Input: dimensionality of subspaces $p$, number of clusters $k$, number of epochs $T$, batch size $b$, and dataset $\mathcal{X}$
Pre-train the CAE using $\mathcal{X}$
Generate the initial subspaces $\{S_j\}_{j=1}^{k}$ based on the pre-trained model and initial cluster labels
for $t = 1, \dots, T$ do
  for each mini-batch do
    Update the CAE parameters by Eqn. (6)
  end
  Recalculate the latent representations $\{z_i\}_{i=1}^{N}$ for the whole dataset
  Assign the membership of every $z_i$ as in Eqn. (2) and rule out the farthest points as outliers, so that for each cluster $j$ we have a set $Z_j$
  Update each subspace $S_j$ through an SVD of $Z_j$ as in Eqn. (7)
end
Output: subspaces $\{S_j\}_{j=1}^{k}$ and membership assignments $\{w_{ij}\}$
Algorithm 1 Scalable k-Subspace Clustering (SVD update)

The cost function (3) is highly non-convex, and three sets of variables (i.e., $\Theta$, $\{w_{ij}\}$ and $\{S_j\}$) should be updated alternately. It is known that alternating optimization problems are not without difficulties. A strategy such as wake-and-sleep, which updates one set of variables while fixing the others, is common practice. As mentioned before, we first pre-train a CAE without having any information about $\{w_{ij}\}$ and $\{S_j\}$. It is therefore natural to obtain an initial state for $\{w_{ij}\}$ and $\{S_j\}$ directly from the output of the pre-trained CAE, and this is exactly how we initialize them.

As shown in Fig. (1), the gradient of the encoder comes from both the reconstruction loss and the k-subspace clustering loss, i.e.,

\nabla_{\Theta_e} \mathcal{L} = \frac{\partial \mathcal{L}_{\mathrm{rec}}}{\partial \Theta_e} + \lambda\, \frac{\partial \mathcal{L}_{k\text{-}SC}}{\partial \Theta_e}    (6)

By fixing $\{S_j\}$, the assignments $\{w_{ij}\}$ for a mini-batch can be obtained easily, and the required gradient for updating the CAE follows by back-propagating the error. The most difficult part of our problem is to find a way to update the subspaces efficiently and accurately. In the following, we explain two approaches to update the subspaces: the first is based on the SVD, and the second makes use of the Riemannian geometry of the Grassmannian.
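Assuming the encoder and decoder are standard tf.keras models, that the encoder outputs a flat d-dimensional code, and that the subspace bases are held fixed as float32 constants, one mini-batch update realizing Eqn. (6) might look like the sketch below; it illustrates the alternating scheme and is not the released implementation.

```python
import tensorflow as tf

def cae_update_step(encoder, decoder, optimizer, x_batch, bases, lam):
    """One CAE update on a mini-batch (Eqn. (6)), subspaces held fixed.

    bases : list of (d, p) orthonormal matrices as float32 tf.constant.
    """
    with tf.GradientTape() as tape:
        z = encoder(x_batch)                       # latent codes, shape (b, d)
        x_rec = decoder(z)
        # Reconstruction term of Eqn. (4).
        loss_rec = tf.reduce_mean(
            tf.reduce_sum(tf.square(x_batch - x_rec), axis=[1, 2, 3]))
        # k-SC term of Eqn. (5): residual to the closest (fixed) subspace.
        residuals = tf.stack(
            [tf.reduce_sum(
                tf.square(z - tf.matmul(tf.matmul(z, S), S, transpose_b=True)),
                axis=1)
             for S in bases], axis=1)
        loss_ksc = tf.reduce_sum(tf.reduce_min(residuals, axis=1))
        loss = loss_rec + lam * loss_ksc           # Eqn. (3)
    variables = encoder.trainable_variables + decoder.trainable_variables
    grads = tape.gradient(loss, variables)         # realizes Eqn. (6)
    optimizer.apply_gradients(zip(grads, variables))
    return loss
```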

4.1 SVD Update

Although the SVD is computationally more expensive, we empirically observe that it provides satisfactory results. In our optimization, we update the encoder through back-propagation, batch by batch, and update the subspaces by employing the SVD once per epoch. This is mainly because updating the subspaces more frequently hinders convergence. Intuitively, if a gradient step takes the network in a bad direction, updating the subspaces accordingly could amplify the damage and worsen the CAE. Empirically, we observe that updating the subspaces once per epoch neutralizes the good and the bad directions of the gradient, yielding a stable framework.

Outliers may badly affect subspace clustering, especially k-subspace clustering. Therefore, when updating each subspace, we rule out the farthest points as outliers. That is, after back-propagation on the CAE, we pass all the data through the encoder and assign their memberships. We then sort the distances between each sample and the subspace it belongs to, and remove the outliers. Finally, we apply the SVD on the remaining points assigned to a subspace to obtain its new basis. Note that we only need to compute the largest $p$ singular values and the corresponding singular vectors to update a subspace. Specifically, after fixing $\Theta$ and $\{w_{ij}\}$ in Eqn. (5), updating the subspace basis $S_j$ translates to solving the following problem

\min_{S_j :\, S_j^\top S_j = I_p} \big\| Z_j - S_j S_j^\top Z_j \big\|_F^2    (7)

where $Z_j$ consists of the latent representations $z_i$ (as columns) that belong to cluster $j$. The solution to (7) corresponds to the column space of $Z_j$, which can be obtained by applying the SVD on $Z_j$ and taking the top $p$ left singular vectors.
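A possible NumPy rendering of this SVD update, including the outlier-removal step, is sketched below; the fraction of points dropped per cluster (`outlier_frac`) is a hypothetical knob, as the text does not fix it here.

```python
import numpy as np

def svd_subspace_update(Z, bases, p, outlier_frac=0.1):
    """Sketch of the per-epoch SVD update of Sec. 4.1 (Eqn. (7)).

    Z : (N, d) latent codes of the whole dataset; bases : list of (d, p)
    orthonormal bases.
    """
    # Membership assignment as in Eqn. (2).
    residuals = np.stack(
        [np.linalg.norm(Z - (Z @ S) @ S.T, axis=1) for S in bases], axis=1)
    labels = residuals.argmin(axis=1)
    dists = residuals[np.arange(len(Z)), labels]
    new_bases = []
    for j, S in enumerate(bases):
        idx = np.where(labels == j)[0]
        if len(idx) <= p:                        # too few points: keep old basis
            new_bases.append(S)
            continue
        # Rule out the farthest points of this cluster as outliers.
        order = idx[np.argsort(dists[idx])]
        n_keep = max(p, int(np.ceil((1 - outlier_frac) * len(idx))))
        keep = order[:n_keep]
        # Eqn. (7): top-p left singular vectors of the (d x m) matrix Z_j.
        U, _, _ = np.linalg.svd(Z[keep].T, full_matrices=False)
        new_bases.append(U[:, :p])
    return new_bases, labels
```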

4.2 Gradient-Based Update

If more frequent updates are required, the SVD solution can be replaced by a Riemannian gradient descent method based on the geometry of the Grassmannian. In particular, let $G_j$ be the gradient of the loss with respect to $S_j$ after an iteration (or the accumulated gradient after a few iterations). In Riemannian optimization, $S_j$ is updated according to the following rule:

S_j \leftarrow \mathrm{r}_{S_j}\big( -\eta\, \pi_{S_j}(G_j) \big)    (8)

We explain Eqn. (8) with the aid of Fig. 2. First, we note that a global coordinate system cannot be defined on a Riemannian manifold. As such, Riemannian techniques make extensive use of the tangent bundle of the manifold to achieve their goal. Note that moving in the direction of $-G_j$ will take us off the manifold. For a Riemannian manifold embedded in a Euclidean space (our case here), an ambient vector such as $G_j$ can be projected orthogonally onto the tangent space at the current solution $S_j$. We denote this operator by $\pi_{S_j}(\cdot)$ in Eqn. (8). The resulting tangent vector, shown by the green arrow in Fig. 2, identifies a geodesic on the manifold. Moving along this geodesic (sufficiently little) is guaranteed to decrease the loss while preserving the orthogonality of the solution. In Riemannian optimization, this is achieved by a retraction, which is a local approximation to the exponential map of the manifold. We denote the retraction in Eqn. (8) by $\mathrm{r}_{S_j}(\cdot)$. The only remaining bit is $\eta$, which is the learning rate. For the Grassmannian, we have

\pi_{S_j}(G_j) = \big(I - S_j S_j^\top\big) G_j    (9)
\mathrm{r}_{S_j}(\xi) = \mathrm{qf}\big(S_j + \xi\big)    (10)

In Eqn. (10), $\mathrm{qf}(\cdot)$ denotes the Q factor of the QR decomposition, which is much faster to compute than the SVD. Although the SVD update performs well enough in our experiments, we provide this faster alternative in order to deal with very large datasets.
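Assuming a plain Euclidean gradient `G` for a basis `S`, one Riemannian step of Eqns. (8)-(10) can be sketched as follows; the sign correction on the Q factor is a standard implementation detail, not something prescribed by the text.

```python
import numpy as np

def grassmann_step(S, G, lr):
    """Sketch of one Riemannian update (Eqns. (8)-(10)) for a subspace basis.

    S : (d, p) orthonormal basis, G : (d, p) Euclidean gradient of the loss
    with respect to S, lr : learning rate eta.
    """
    # Eqn. (9): project the ambient gradient onto the tangent space at S.
    G_tan = (np.eye(S.shape[0]) - S @ S.T) @ G
    # Eqn. (8): move against the projected gradient ...
    Y = S - lr * G_tan
    # Eqn. (10): ... and retract back onto the manifold via the Q factor of QR.
    Q, R = np.linalg.qr(Y)
    Q = Q * np.sign(np.diag(R))   # fix sign ambiguity of the QR factorization
    return Q
```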

Figure 2: Illustration of how we update the gradient and keep the subspaces on the manifold.
Input: dimensionality of subspaces $p$, number of clusters $k$, initial subspaces $\{S_j\}_{j=1}^{k}$, number of epochs $T$, pre-trained CAE, batch size $b$, and dataset $\mathcal{X}$
for $t = 1, \dots, T$ do
  for each mini-batch do
    Assign the membership of every point based on its distance to each subspace as in Eqn. (2)
    Compute the gradients with respect to each subspace from Eqn. (5) and store them
    Update the CAE parameters by Eqn. (6)
  end
  Project the stored gradients onto the Grassmann manifolds according to Eqn. (9)
  Apply the gradients to the corresponding subspaces according to Eqn. (10)
end
Output: subspaces $\{S_j\}_{j=1}^{k}$ and membership assignments $\{w_{ij}\}$
Algorithm 2 Scalable k-Subspace Clustering (Gradient update)

5 Experiments

We use TensorFlow [1] to build our networks. We first experiment on the MNIST dataset [19]. MNIST is not considered a standard dataset for previous subspace clustering algorithms, since its size is far beyond what traditional algorithms can handle. In addition, the original images do not follow the structure of linear subspaces. Taking advantage of a CAE with our k-subspace clustering module, we aim to project all the MNIST data into a space that is more friendly for subspace clustering. To reinforce our conclusions, we also evaluate our method on the Fashion-MNIST dataset [33], a dataset similar to MNIST but with fashion images. Fashion-MNIST has 10 classes, with grayscale images of size $28 \times 28$. The images of Fashion-MNIST come from fashion products which are classified based on a certain assortment, manually labeled by in-house fashion experts, and reviewed by a separate team. It contains more variations within each class and is thus more challenging compared to MNIST.

5.0.1 Baseline Methods

For most of the baselines, as well as for our method, we evaluate on the whole MNIST and Fashion-MNIST datasets with all 70,000 images (both training and test sets). We compare our solution with the following generic clustering algorithms:

1) k-Means [22]: k-means finds clusters based on spatial closeness. As an EM-style method, it heavily relies on good initialization. Hence, for k-means (and other k-means-based methods), we run the algorithm 20 times with different centroid seeds and report the best result.

2) Deep Embedded Clustering (DEC) [35]: A rich structure for the MNIST dataset is proposed in [35], which we follow here. In particular, a stacked auto-encoder (SAE) [4] with layer-wise pre-training is used. The structure of the network reads as 784-500-500-2000-10. Image brightness is scaled from 0-1 to 0-5.2 to boost the performance. We observe that this method is highly sensitive to network parameters, in the sense that even a small change in the structure results in a significant performance drop. However, the features extracted by the pre-trained model are very discriminative, i.e., even simply using k-means on top of them achieves competitive results. We call the features extracted by this network the SAE features in the sequel.

3) Deep Clustering Network (DCN) [37]: Based on the vanilla SAE, Yang et al. propose to add a k-means clustering loss in addition to the data reconstruction loss of the SAE.

4) Stacked Auto-Encoder followed by k-Means (SAE-KM): Extract features with the SAE and then apply k-means.

5) PCA followed by k-subspace clustering (PCA-KS): This baseline first projects the original data onto a low-dimensional space and then uses k-subspace clustering to obtain the final results. Since PCA is a linear projection, this baseline helps the reader understand where the improvements of our nonlinear projection come from. The results are reported over 10 trials due to the randomness of the initialization when employing k-subspace clustering.

6) Convolutional Auto-Encoder followed by k-Means (CAE-KM): Extract features with the CAE and then apply k-means. This is also the initialization for our method, so it can be considered an evaluation of the quality of our initialization.

For the subspace clustering algorithms that rely on affinity matrix construction and spectral clustering, which are not scalable to the whole dataset, we report their results on the test sets (10,000 images) only. We list several state-of-the-art subspace clustering algorithms as baselines: Sparse Subspace Clustering (SSC) [8], Low Rank Representation (LRR) [21], Kernel Sparse Subspace Clustering (KSSC) [26], SSC by Orthogonal Matching Pursuit (SSC-OMP) [40] and the recent Deep Subspace Clustering Networks (DSC-Net) [15].

5.0.2 Evaluation Metric

For all quantitative evaluations, we make use of the unsupervised clustering accuracy rate, defined as

\mathrm{ACC} = \max_{\pi} \frac{\sum_{i=1}^{N} \mathbf{1}\{ y_i = \pi(\hat{y}_i) \}}{N}    (11)

where $y_i$ is the ground-truth label of point $i$, $\hat{y}_i$ is the subspace assignment produced by the algorithm, and $\pi$ ranges over all possible one-to-one mappings between subspaces and labels. The best mapping can be efficiently computed by the Hungarian algorithm. We also use normalized mutual information (NMI) as an additional quantitative measure. NMI scales from 0 to 1, where a smaller value means less correlation between the predicted labels and the ground-truth labels. Another quantitative metric is the adjusted Rand index (ARI), which is scaled between -1 and 1; the larger the ARI, the better the clustering performance.
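For reference, the three metrics can be computed as in the sketch below, using SciPy's Hungarian solver (`linear_sum_assignment`) for the mapping in Eqn. (11) and scikit-learn for NMI and ARI; labels are assumed to be 0-indexed integers, and the function name is illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def clustering_metrics(y_true, y_pred):
    """Unsupervised clustering accuracy of Eqn. (11), plus NMI and ARI."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1
    # count[i, j] = number of points assigned to cluster i with true label j.
    count = np.zeros((k, k), dtype=np.int64)
    for p, t in zip(y_pred, y_true):
        count[p, t] += 1
    row, col = linear_sum_assignment(-count)       # maximize matched points
    acc = count[row, col].sum() / len(y_true)
    nmi = normalized_mutual_info_score(y_true, y_pred)
    ari = adjusted_rand_score(y_true, y_pred)
    return acc, nmi, ari
```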

5.0.3 Implementation

We build our CAE in a bottleneck structure, meaning we decrease the number of channels and the size of the feature maps layer by layer. We design a six-layer convolutional auto-encoder, where the kernel size in the first layer differs from that used in the last two layers of the encoder; the number of channels decreases layer by layer in the encoder, and the decoder reverses this since the two are symmetric in structure. Between layers, we set the stride to 2 in both the horizontal and vertical directions, and use rectified linear units (ReLU) as the non-linear activations. We use the same structure for both the MNIST and Fashion-MNIST datasets.

Instead of greedy layer-wise pre-training [37, 35], we pre-train our network end-to-end from random initialization until the reconstructed images are similar to the input ones (200 epochs suffice for pre-training). For subspace initialization, we randomly sample 2000 images and use the DSC network to generate the clusters and the corresponding subspaces. We notice that initialization with the DSC subspaces leads to a model that initially under-performs compared to the k-means algorithm; nevertheless, our algorithm successfully recovers from such an initialization in all the experiments. During the optimization we use the Adam [18] optimizer, an adaptive momentum-based gradient descent method, to minimize the loss, with the same learning rate in both the pre-training and fine-tuning stages. For different datasets, the only two parameters that need tuning are $\lambda$ in (3) and the subspace dimension $p$, since the ambient dimension of the latent space is fixed by the feature maps of the CAE.
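To illustrate the kind of bottleneck CAE described above, here is a hypothetical tf.keras sketch; the kernel sizes, channel counts and the 4x4 bottleneck shape are placeholder values chosen for 28x28 inputs and are not the exact configuration used in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_cae(input_shape=(28, 28, 1), channels=(20, 10, 5)):
    """Hypothetical bottleneck CAE in the spirit of Sec. 5.0.3."""
    encoder = tf.keras.Sequential([
        tf.keras.Input(shape=input_shape),
        layers.Conv2D(channels[0], 5, strides=2, padding='same', activation='relu'),
        layers.Conv2D(channels[1], 3, strides=2, padding='same', activation='relu'),
        layers.Conv2D(channels[2], 3, strides=2, padding='same', activation='relu'),
        layers.Flatten(),                     # flat latent code z fed to k-SC
    ], name='encoder')
    d = encoder.output_shape[-1]              # latent (ambient) dimension
    decoder = tf.keras.Sequential([
        tf.keras.Input(shape=(d,)),
        layers.Reshape((4, 4, channels[2])),  # 28 -> 14 -> 7 -> 4 under stride 2
        layers.Conv2DTranspose(channels[1], 3, strides=2, padding='same', activation='relu'),
        layers.Conv2DTranspose(channels[0], 3, strides=2, padding='same', activation='relu'),
        layers.Conv2DTranspose(input_shape[-1], 5, strides=2, padding='same'),
        layers.Cropping2D(2),                 # 32x32 -> 28x28 to match the input
    ], name='decoder')
    return encoder, decoder
```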

5.1 MNIST Dataset

In this section, we report and discuss results on the MNIST dataset. To the best of our knowledge, existing subspace clustering methods with raw images as input have not achieved satisfactory results on this dataset; the best performance we are aware of is reported in [27], where DSIFT features are employed.

On MNIST, we fix the subspace dimension to 7, which means each subspace lies on a Grassmann manifold $\mathrm{Gr}(7, d)$. The $\lambda$ is set to 0.08, which balances subspace clustering against CAE data reconstruction. Table (1) reports the results of all the baselines, including both subspace clustering algorithms and generic clustering algorithms. SCN-S denotes updating the subspaces with the SVD, and SCN-G stands for updating the subspaces with Grassmannian gradients, which empirically is not as stable as the SVD updating scheme, probably due to the stochastic nature of each gradient step. The Grassmannian update, however, runs faster and takes less time to converge. We run our methods 15 times and report the average. The results of DEC are taken from the original paper. We tuned the parameters for DCN very carefully and report the best results.

Among all the algorithms, ours achieves the best performance in ACC and ARI. In particular for ACC, ours (87.14%) is about 2.8 percentage points higher than the second best, namely DEC. From the results, it is not difficult to conclude that DEC and DCN perform only marginally better than SAE-KM, which is the initialization for DEC and DCN: DEC improves over this initialization by around 3 percentage points in ACC, and DCN by only around 2 points. By contrast, our method starts from CAE-KM (51% ACC) and improves it by more than 36 points, to 87.14% ACC. The improvement can be visualized in Fig. (3), which shows two-dimensional projections of the CAE feature space and of the latent space of our network. Compared to the CAE features, which are all mixed up, our latent space is well separated, even though a two-dimensional projection is not ideal for visualizing subspace structure, as the subspaces reside in a high-dimensional ambient space.

(a) CAE feature
(b) Our latent space
Figure 3: Visualization using t-SNE of the latent spaces obtained by passing the test-set images through the pre-trained CAE and through our network. Points marked with the same color belong to the same class.

For traditional subspace clustering algorithms, around 37 gigabytes of memory would be required just to store the affinity matrix for the full dataset, which is computationally prohibitive. Therefore, we contrast our algorithm against SSC, LRR, KSSC, SSC-OMP and Deep Subspace Clustering Networks on a smaller experiment, namely using only the 10,000 test images of the MNIST dataset (see Table (2) for results). Note that SSC-OMP completely fails to deal with the features generated by the SAE and the CAE, achieving very low ACC and NMI. Generally speaking, with more samples, better accuracies are expected. We can see that all the subspace clustering algorithms perform better with the SAE features than with the CAE features. To some extent, this shows that there exists a nonlinear mapping which is more favorable for subspace clustering. At the same time, our algorithm still achieves the best results among all subspace clustering algorithms, even higher than DSC-Net.

SAE-KM CAE-KM K-means PCA-KS DEC DCN SCN-G SCN-S
ACC 81.29% 51% 53% 68.53% 84.3% 83.31% 82.22% 87.14%
NMI 73.78% 44.87% 50% 64.17% 80% 80.86% 73.93% 78.15%
ARI 67% 33.52% 37% 54.17% 75% 74.87% 71.10% 75.81%
Table 1: Results on MNIST (70,000 samples).
MNIST Fashion-MNIST
ACC NMI ACC NMI
SSC-SAE 75.49% 66.26% 52.33% 51.26%
SSC-CAE 43.03% 56.81% 35.31% 18.10%
LRR-SAE 74.09% 66.97% 58.09% 59.19%
LRR-CAE 51.37% 66.59% 34.43% 18.57%
KSSC-SAE 81.53% 84.53% 57.10% 60.40%
KSSC-CAE 56.42% 65.66% 35.41% 18.18%
DSC-Net 53.20% 47.90% 55.81% 54.80%
SCN-S 83.30% 77.38% 60.02% 62.30%
Table 2: Results of subspace clustering algorithms on the test sets of the MNIST and Fashion-MNIST datasets; the best two are in bold.

5.2 Fashion-MNIST

Unlike the MNIST dataset, which only contains simple digits, every class in Fashion-MNIST has different styles and comes from different gender groups: men, women, kids and neutral. Fashion-MNIST contains 60,000 training images and 10,000 test images. In our case, we pre-train and fine-tune the network using the whole dataset. On Fashion-MNIST, we fix the subspace dimension to 11 and set $\lambda$ to 0.11.

Consistent with the MNIST results, DCN slightly improves upon its initialization (SAE-KM) in terms of ACC and NMI. Moreover, we find that the DCN algorithm works better with smaller learning rates, which in turn requires more epochs to converge properly. From Table (3), we can see that our method still improves the accuracy by around 24 percentage points compared to our initialization, and outperforms all the other algorithms. The t-SNE maps in Fig. (4) show that a subspace structure emerges in our latent space even when projected to two dimensions.

Table (2) shows that the subspace clustering algorithms also achieve acceptable results on the 10,000 test images, with our algorithm being the best among all. Compared to the other subspace clustering algorithms, ours also runs much faster, requiring less than 8 minutes (including pre-training and fine-tuning with subspace clustering) to generate the final results, whereas the traditional algorithms need at least 40 minutes to process these 10,000 samples even after dimensionality reduction.

(a) CAE feature
(b) Our latent feature
Figure 4: Visualization using t-SNE of the latent spaces generated by the pre-trained CAE and by our network on Fashion-MNIST. Points marked with the same color belong to the same class.
SAE-KM CAE-KM K-means PCA-KS DCN SCN-G SCN-S
ACC 54.35% 39.84% 47.58% 53.41% 56.14% 58.67% 63.78%
NMI 58.54% 39.80% 51.24% 57.5% 59.4% 52.88% 62.04%
ARI 41.86% 25.93% 34.86% 41.17% 43.04% 42% 48.04%
Table 3: Results on Fashion-MNIST.

5.3 Further Discussion

Based on the above experiments, we observe that our algorithm consistently achieves higher accuracies than DCN (even though our initialization uses the CAE). One may argue that the performance gain over DCN is simply due to the fact that, unlike the SAE, the CAE can be trained easily (in our experiments, the number of parameters in the SAE is 2,600 times more than that of the CAE). To verify that this is not the case, we replace the SAE with the CAE in DCN to see whether DCN can generate competitive results. Table (4) shows that even with the CAE, DCN cannot boost the clustering results as much as ours. On MNIST, DCN-CAE can hardly improve the accuracy and NMI; on Fashion-MNIST, it increases the accuracy by more than 3 percentage points (and NMI by around 1 point). This can be attributed to the k-means loss used in DCN, compared to our k-subspace clustering loss, which we believe is more robust. In other words, a subspace structure may be more desirable than cluster centroids in a high-dimensional space.

MNIST Fashion-MNIST
ACC NMI ACC NMI
DCN-CAE 51.10% 45.18% 45.64% 47.8%
Initialization 50.98% 44.87% 42.38% 46.75%
Table 4: The performance of DCN-CAE and its CAE initialization.

6 Conclusions

In this paper, we proposed a scalable deep k-subspace clustering algorithm, which combines k-subspace clustering and a convolutional auto-encoder in a principled way. Our algorithm makes it possible to scale subspace clustering to large datasets. Furthermore, we proposed two efficient and robust schemes to update the subspaces. These allow our k-SC networks to iteratively fit every sample to its corresponding subspace and to update the subspaces accordingly, even from a bad initialization (as observed in our experiments).

Our extensive experiments on the MNIST and Fashion-MNIST datasets demonstrate that our deep k-subspace clustering method provides significant improvements over various state-of-the-art subspace clustering solutions in terms of clustering accuracy and efficiency.

References

  • [1] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., et al.: Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467 (2016)
  • [2] Agarwal, P.K., Mustafa, N.H.: K-means projective clustering. In: Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. pp. 155–165. ACM (2004)
  • [3] Balzano, L., Szlam, A., Recht, B., Nowak, R.: K-subspaces with missing data. In: Statistical Signal Processing Workshop (SSP), 2012 IEEE. pp. 612–615. IEEE (2012)
  • [4] Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H.: Greedy layer-wise training of deep networks. In: NIPS. pp. 153–160 (2007)
  • [5] Bradley, P.S., Mangasarian, O.L.: K-plane clustering. Journal of Global Optimization 16(1), 23–32 (2000)
  • [6] Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR 2005. pp. 886–893. IEEE (2005)
  • [7] Elhamifar, E., Vidal, R.: Sparse subspace clustering. In: CVPR. pp. 2790–2797 (2009)
  • [8] Elhamifar, E., Vidal, R.: Sparse subspace clustering: Algorithm, theory, and applications. IEEE Trans. on Pattern Analysis and Machine Intelligence 35(11), 2765–2781 (2013)
  • [9] Ho, J., Yang, M.H., Lim, J., Lee, K.C., Kriegman, D.: Clustering appearances of objects under varying illumination conditions. In: CVPR. vol. 1, pp. 11–18. IEEE (2003)
  • [10] Ji, P., Li, H., Salzmann, M., Dai, Y.: Robust motion segmentation with unknown correspondences. In: ECCV. pp. 204–219. Springer (2014)
  • [11] Ji, P., Li, H., Salzmann, M., Zhong, Y.: Robust multi-body feature tracker: a segmentation-free approach. In: CVPR. pp. 3843–3851 (2016)
  • [12] Ji, P., Reid, I., Garg, R., Li, H., Salzmann, M.: Low-rank kernel subspace clustering. arXiv preprint arXiv:1707.04974 (2017)
  • [13] Ji, P., Salzmann, M., Li, H.: Efficient dense subspace clustering. In: IEEE Winter Conf. on Applications of Computer Vision (WACV). pp. 461–468. IEEE (2014)
  • [14] Ji, P., Salzmann, M., Li, H.: Shape interaction matrix revisited and robustified: Efficient subspace clustering with corrupted and incomplete data. In: ICCV. pp. 4687–4695 (2015)
  • [15] Ji, P., Zhang, T., Li, H., Salzmann, M., Reid, I.: Deep subspace clustering networks. In: Advances in Neural Information Processing Systems. pp. 23–32 (2017)
  • [16] Ji, P., Zhong, Y., Li, H., Salzmann, M.: Null space clustering with applications to motion segmentation and face clustering. In: (ICIP). pp. 283–287. IEEE (2014)
  • [17] Kanatani, K.i.: Motion segmentation by subspace separation and model selection. In: ICCV. vol. 2, pp. 586–591. IEEE (2001)
  • [18] Kingma, D., Ba, J.: Adam: A method for stochastic optimization. arXiv:1412.6980 (2014)
  • [19] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
  • [20] Lee, K.C., Ho, J., Kriegman, D.J.: Acquiring linear subspaces for face recognition under variable lighting. TPAMI 27(5), 684–698 (2005)
  • [21] Liu, G., Lin, Z., Yan, S., Sun, J., Yu, Y., Ma, Y.: Robust recovery of subspace structures by low-rank representation. IEEE Trans. on Pattern Analysis and Machine Intelligence 35(1), 171–184 (2013)
  • [22] Lloyd, S.: Least squares quantization in pcm. IEEE transactions on information theory 28(2), 129–137 (1982)
  • [23] Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV 60(2), 91–110 (2004)
  • [24] Ma, Y., Derksen, H., Hong, W., Wright, J.: Segmentation of multivariate mixed data via lossy data coding and compression. TPAMI 29(9) (2007)
  • [25] Ng, A.Y., Jordan, M.I., Weiss, Y., et al.: On spectral clustering: Analysis and an algorithm. In: NIPS. vol. 14, pp. 849–856 (2001)
  • [26] Patel, V.M., Vidal, R.: Kernel sparse subspace clustering. In: ICIP. pp. 2849–2853. IEEE (2014)
  • [27] Peng, X., Feng, J., Xiao, S., Lu, J., Yi, Z., Yan, S.: Deep sparse subspace clustering. arXiv preprint arXiv:1709.08374 (2017)
  • [28] Peng, X., Xiao, S., Feng, J., Yau, W.Y., Yi, Z.: Deep subspace clustering with sparsity prior. In: IJCAI (2016)
  • [29] Shi, J., Malik, J.: Normalized cuts and image segmentation. TPAMI 22(8), 888–905 (2000)
  • [30] Tseng, P.: Nearest q-flat to m points. Journal of Optimization Theory and Applications 105(1), 249–252 (2000)
  • [31] Vidal, R.: Subspace clustering. IEEE Signal Processing Magazine 28(2), 52–68 (2011)
  • [32] Vidal, R., Favaro, P.: Low rank subspace clustering (LRSC) 43, 47–61 (2014)
  • [33] Xiao, H., Rasul, K., Vollgraf, R.: Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms (2017)
  • [34] Xiao, S., Tan, M., Xu, D., Dong, Z.Y.: Robust kernel low-rank representation. IEEE transactions on neural networks and learning systems 27(11), 2268–2281 (2016)
  • [35] Xie, J., Girshick, R., Farhadi, A.: Unsupervised deep embedding for clustering analysis. In: International conference on machine learning. pp. 478–487 (2016)
  • [36] Yang, A.Y., Wright, J., Ma, Y., Sastry, S.S.: Unsupervised segmentation of natural images via lossy data compression. CVIU 110(2), 212–225 (2008)
  • [37] Yang, B., Fu, X., Sidiropoulos, N.D., Hong, M.: Towards k-means-friendly spaces: Simultaneous deep learning and clustering. In: ICML. pp. 3861–3870 (2017)
  • [38] Yin, M., Guo, Y., Gao, J., He, Z., Xie, S.: Kernel sparse subspace clustering on symmetric positive definite manifolds. In: CVPR. pp. 5157–5164 (2016)
  • [39] You, C., Li, C.G., Robinson, D.P., Vidal, R.: Oracle based active set algorithm for scalable elastic net subspace clustering. In: CVPR. pp. 3928–3937 (2016)
  • [40] You, C., Robinson, D., Vidal, R.: Scalable sparse subspace clustering by orthogonal matching pursuit. In: CVPR. pp. 3918–3927 (2016)
  • [41] Zhang, T., Szlam, A., Wang, Y., Lerman, G.: Hybrid linear modeling via local best-fit flats. International journal of computer vision 100(3), 217–240 (2012)