In this paper, we tackle the problem of Subspace Clustering (SC), a sub-field of unsupervised learning that aims to cluster data points drawn from a union of low-dimensional subspaces. Suppose that $X = [x_1, \dots, x_N] \in \mathbb{R}^{D \times N}$ represents a data set with $N$ data points in ambient dimension $D$, and the data points lie in $n$ subspaces $\{S_i\}_{i=1}^{n}$ of dimensions $\{d_i\}_{i=1}^{n}$ ($d_i < D$). The task of SC is to partition the data points into $n$ clusters so that data points within the same cluster lie in the same intrinsic subspace $S_i$. SC has achieved great success in many applications, e.g., motion segmentation, face clustering, and image representation and compression.
Most traditional SC methods are based on the linear subspace assumption to construct the affinity matrix for spectral clustering. However, data do not necessarily conform to a linear subspace model, which motivates non-linear SC techniques. Kernel methods [10, 11, 12] can be employed to implicitly map data to higher-dimensional spaces so that they better conform to linear models in the resulting spaces. However, the selection of a kernel type is largely empirical and comes without theoretical guarantees. Recently, Convolutional Neural Networks (CNNs) have shown a superior ability in learning image representations, and Deep Subspace Clustering Networks (DSC-Net) have been proposed to exploit the self-expressiveness of data in a union of subspaces.
Despite the significant improvement in clustering accuracy, DSC-Net suffers from slow training compared with conventional "shallow" SC methods. To achieve both higher model training efficiency and higher clustering accuracy, we propose a Residual Encoder-Decoder network for deep Subspace Clustering (RED-SC). In particular, we make the following contributions:
We propose to establish skip connections between corresponding convolutional and deconvolutional layers. These skip connections help to back-propagate the gradients to bottom layers and pass data details to top layers, making training of the end-to-end mapping easier and more effective.
We propose to insert the self-expressive layer in each skip connection to generate the linear representation coefficients. We present a new global loss function and minimize it by RED-SC. This helps to learn the linearity information of features in different latent spaces.
To the best of our knowledge, our approach constitutes the first attempt to apply residual encoder-decoder network on the task of unsupervised learning.
Experimental results demonstrate that our network converges much faster in model training and fine-tuning and obtains better clustering results, remarkably reducing computational cost while improving accuracy.
2 Related Work
2.1 Subspace Clustering
Many methods have been developed for linear subspace clustering. Generally, these approaches follow a two-stage framework. In the first stage, an affinity matrix is generated from the data by computing the matrix $C$ of linear representation coefficients. In the second stage, spectral clustering is applied on the affinity matrix. These methods learn the affinity matrix based on the self-expressiveness model, which states that each data point in a union of subspaces can be expressed as a linear combination of the other data points, i.e., $X = XC$ with $\mathrm{diag}(C) = 0$, where $X$ is the data matrix and $C$ is the coefficients matrix. To find the coefficients matrix $C$, current methods solve the following optimization problem in the first stage:
$$\min_{C} \|C\|_p \quad \text{s.t.} \quad X = XC,\ \mathrm{diag}(C) = 0, \qquad (1)$$

where $\|C\|_p$ denotes a norm regularization applied on $C$. For instance, in Sparse Subspace Clustering (SSC), the $\ell_1$ norm regularization is adopted as a convex surrogate of the $\ell_0$ norm to encourage the sparsity of $C$. Least Squares Regression (LSR) uses the $\ell_2$ norm regularization on $C$. Low Rank Representation (LRR) uses the nuclear norm regularization on $C$. Elastic Net Subspace Clustering (ENSC) uses a mixture of $\ell_1$ and $\ell_2$ norm regularization on $C$. In SSC by Orthogonal Matching Pursuit (SSC-OMP) and our previous work Sparse-Dense Subspace Clustering (SDSC), the $\ell_0$ norm regularization is investigated. However, these methods can only cluster linear subspaces, which limits their application. To address this problem, kernel-based subspace clustering methods [10, 11, 12] have been developed. There is, however, no clear reason why such kernels should correspond to feature spaces that are well suited to subspace clustering. Recently, Deep Subspace Clustering Networks (DSC-Net) were introduced to tackle the nonlinearity arising in subspace clustering: data is nonlinearly mapped to a latent space with convolutional auto-encoders, and a self-expressive layer is introduced to facilitate end-to-end learning of the coefficients matrix. Although DSC-Net outperforms traditional SC methods, its computational cost, especially in model training, is overwhelming.
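As a minimal illustration of the first stage, the sketch below computes $\ell_2$-regularized (LSR-style) self-expressive coefficients in closed form, dropping the $\mathrm{diag}(C) = 0$ constraint for simplicity. The toy data, the value of the regularization weight, and the function name are our own illustrative choices, not part of any of the cited methods' reference implementations:

```python
import numpy as np

def self_expressive_l2(X, lam=0.1):
    """Closed-form solution of min_C ||X - XC||_F^2 + lam * ||C||_F^2,
    an LSR-style first stage (the diag(C) = 0 constraint is omitted)."""
    N = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(N), X.T @ X)

# Toy data: 5 points on each of two 1-D subspaces (lines) in R^3.
rng = np.random.default_rng(0)
u, v = rng.standard_normal(3), rng.standard_normal(3)
X = np.hstack([np.outer(u, rng.standard_normal(5)),
               np.outer(v, rng.standard_normal(5))])   # shape (3, 10)

C = self_expressive_l2(X)
A = np.abs(C) + np.abs(C).T   # symmetric affinity for spectral clustering
```

For independent subspaces, the coefficient mass concentrates within each subspace's block of $C$, so spectral clustering on $A$ can recover the two lines.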
2.2 Residual Encoder-Decoder
Encoder-decoder networks can non-linearly map data into a latent space. They can be viewed as a form of non-linear PCA when the latent space has a lower dimension than the original space. Residual encoder-decoder networks with skip-layer connections have proven effective in many applications, e.g., image restoration, semantic segmentation and iris segmentation. It has been shown that residual encoder-decoder networks converge much faster in model training, since the skip connections help to back-propagate the gradients to bottom layers and pass image details to top layers, making training of the end-to-end mapping easier. Besides, the feature maps passed by skip connections carry much image detail, which helps deconvolution recover a better and cleaner image. To the best of our knowledge, residual encoder-decoder networks have not been used for any unsupervised learning task; our RED-SC for subspace clustering constitutes the first such attempt.
3 Residual Encoder-Decoder Network for Deep Subspace Clustering (RED-SC)
The proposed network uses the residual encoder-decoder and the self-expressiveness property. In this section, we first discuss each component, then introduce the network architecture, and finally elaborate its training and clustering process.
3.1 Residual Encoder-Decoder in RED-SC
DSC-Net uses auto-encoders to map the data into a latent space, uses the features in that latent space to generate the linear representation coefficients for the affinity matrix, and finally recovers the data with a chain of decoders. But an intuitive question arises: is deconvolution able to recover the data detail from the abstraction alone? Another question is: can the features from only one latent space represent the data well enough to generate the linear representation coefficients? We find that much data detail is lost during convolution, which makes DSC-Net hard to train and makes the affinity matrix an inaccurate representation of the relationships among the original data.
To address the above two problems, inspired by residual networks and highway networks, we add skip connections between corresponding convolutional and deconvolutional layers, as shown in Fig. 1. A building block is shown in Fig. 2. Instead of directly learning the mapping $\mathcal{H}(x)$ from input $x$ to the output, we would like the network to fit the residual of the problem, denoted as:

$$\mathcal{F}(x) = \mathcal{H}(x) - x. \qquad (2)$$
Such a learning strategy is applied on inner blocks of the encoding-decoding network to make training more effective.
By using the residual encoder-decoder network, the feature maps passed by skip connections carry much data detail, which helps deconvolution to better recover the data. Besides, the skip connections also achieve benefits on back-propagating the gradient to bottom layers, which avoids the network suffering from gradient vanishing.
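The fusion step in a decoder block can be sketched in plain numpy. The placeholder feature-map shapes are illustrative assumptions; the deconvolution itself is abstracted away so that only the residual addition described above remains:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_decoder_block(deconv_out, encoder_skip):
    """Fuse a decoder feature map with its symmetric encoder feature map:
    the skip carries data detail lost by the encoder's convolutions, and
    during backprop the identity path carries gradients to bottom layers."""
    assert deconv_out.shape == encoder_skip.shape
    return relu(deconv_out + encoder_skip)   # H(x) = F(x) + x, then ReLU

# Toy feature maps (channels x height x width).
F_x = np.random.randn(10, 8, 8)   # residual branch (deconv output)
x = np.random.randn(10, 8, 8)     # identity branch (skip from encoder)
out = residual_decoder_block(F_x, x)
```

Because the identity branch bypasses the deconvolution, its gradient flows through the addition unchanged, which is what mitigates gradient vanishing.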
3.2 Self-Expressive Layer in RED-SC
Recall the optimization problem in (1). To account for data corruptions, this problem is relaxed as:

$$\min_{C} \|C\|_p + \frac{\lambda}{2}\|X - XC\|_F^2 \quad \text{s.t.} \quad \mathrm{diag}(C) = 0. \qquad (3)$$
In our RED-SC network, latent representations from multiple layers are adopted as input to the self-expressive layer to generate the self-expressive coefficients. Let $X$ denote the input data and $Z_i$ denote the output of the $i$-th convolutional layer, with $Z_0 = X$. We introduce a self-expression loss as:

$$L_{se} = \sum_{i=0}^{M} \frac{1}{2}\|Z_i - Z_i C\|_F^2 \quad \text{s.t.} \quad \mathrm{diag}(C) = 0, \qquad (4)$$
where $M$ is the number of convolutional layers and $C$ is the self-expressive coefficients matrix. Since our goal is to train a deep residual encoder-decoder network, we also compute the reconstruction loss of the data after passing through the network:

$$L_{rec} = \frac{1}{2}\|X - \hat{X}(\Theta_e, \Theta_d)\|_F^2, \qquad (5)$$
where $\hat{X}$ represents the data reconstructed by the residual encoder-decoder, and $\Theta_e$ and $\Theta_d$ respectively denote the encoder and decoder parameters. We can then compute the global loss of our RED-SC network:

$$L(\Theta) = \|C\|_p + \lambda_1 L_{se} + \lambda_2 L_{rec}, \qquad (6)$$
where the network parameters $\Theta$ consist of $\Theta_e$, $\Theta_d$ and $C$, and $\lambda_1$, $\lambda_2$ are trade-off parameters. In this work, we adopt the $\ell_2$ norm regularization on $C$ for computational efficiency.
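The global loss with the $\ell_2$ regularizer can be sketched as follows; the trade-off weights and all toy shapes are illustrative assumptions, and in the actual network the terms would be differentiated through by the optimizer rather than evaluated on fixed arrays:

```python
import numpy as np

def global_loss(X, X_hat, Zs, C, lam1=1.0, lam2=1.0):
    """Sketch of the global loss: l2 regularizer on C, self-expression
    residual over every latent space, and the reconstruction term."""
    l_reg = np.sum(C ** 2)                                   # ||C||_2^2
    l_se = sum(0.5 * np.sum((Z - Z @ C) ** 2) for Z in Zs)   # all latent spaces
    l_rec = 0.5 * np.sum((X - X_hat) ** 2)                   # reconstruction
    return l_reg + lam1 * l_se + lam2 * l_rec

# Toy check: perfect reconstruction and C = 0 leave only the
# self-expression residual of each latent representation.
X = np.ones((2, 3))
Zs = [np.ones((2, 3)), 2 * np.ones((2, 3))]
loss = global_loss(X, X_hat=X, Zs=Zs, C=np.zeros((3, 3)))   # 3.0 + 12.0 = 15.0
```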
3.3 Network Architecture
In this paper, we focus on image clustering problems. As shown in Fig. 1, we feed all the images as a single batch. The input images $\{x_j\}_{j=1}^{N}$ are mapped to a collection of latent vectors by each convolutional layer. In the self-expressive layer, the nodes are fully connected using linear weights, without bias or non-linear activations. The latent vectors of the skip connections are then fed to the symmetric layers in the decoder for addition and non-linear activation (ReLU), and the output of the last convolutional layer is mapped back into the original space by the deconvolutional layers of the decoder. Finally, we use the self-expressive coefficients to generate the affinity matrix, and apply spectral clustering on the affinity matrix to obtain the clustering labels.
In particular, for the $i$-th convolutional layer with $k_i$ channels and kernel size $s_i \times s_i$, there are $s_i^2 k_{i-1} k_i$ weight parameters. The total number of weight parameters in our network is $\sum_i s_i^2 k_{i-1} k_i$, and that of bias parameters is $\sum_i k_i$. Suppose the number of input samples is $N$; then the number of self-expressive parameters is $N^2$, which is much larger than the number of weight and bias parameters. Thus the self-expressive parameters dominate the network.
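The tally can be made concrete for the architecture used in our experiments (kernels 5-3-3-3-3-5, channels 10-20-30-30-20-10) on the 2,414 images of Extended Yale B; assuming grayscale input ($k_0 = 1$), the sketch below shows how the $N^2$ self-expressive coefficients dwarf the convolutional parameters:

```python
def parameter_counts(kernel_sizes, channels, in_channels=1, n_samples=2414):
    """Tally parameter counts, assuming grayscale input (in_channels = 1)
    and square s_i x s_i kernels."""
    weights, biases, prev = 0, 0, in_channels
    for s, k in zip(kernel_sizes, channels):
        weights += s * s * prev * k   # s_i^2 * k_{i-1} * k_i per layer
        biases += k                   # one bias per output channel
        prev = k
    self_expressive = n_samples ** 2  # one coefficient per sample pair
    return weights, biases, self_expressive

w, b, se = parameter_counts([5, 3, 3, 3, 3, 5], [10, 20, 30, 30, 20, 10])
# w = 25,950 and b = 120, while se = 2,414^2 = 5,827,396 dominates.
```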
3.4 Training Strategy
Due to the limited size of data sets for unsupervised subspace clustering, it is difficult to train a network with millions of parameters. Thus we design a pre-training network without the self-expressive layer, shown in Fig. 3. We then use the trained parameters to initialize the encoder and decoder layers of our fine-tuning network with the self-expressive layer. With the help of Adam, we then use a single big batch containing all the data to minimize the loss defined in (6). Note that we do not use any label information to train the model, so our training strategy remains unsupervised. Finally, we use the trained self-expressive coefficients to construct the affinity matrix for spectral clustering, and obtain the clustering labels.
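The final step, building an affinity matrix from the trained coefficients and clustering it spectrally, can be sketched in plain numpy. The symmetrization $A = |C| + |C|^T$ is one common construction, the 4-point coefficient matrix is a hypothetical stand-in for trained coefficients, and the minimal k-means is our own simplification of the spectral clustering stage:

```python
import numpy as np

def spectral_clustering(A, n_clusters, n_iter=50):
    """Embed with the smallest eigenvectors of the normalized Laplacian,
    then run a minimal k-means with farthest-point initialization."""
    d = A.sum(axis=1)
    d_is = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(len(A)) - d_is[:, None] * A * d_is[None, :]
    _, vecs = np.linalg.eigh(L)           # eigenvalues in ascending order
    E = vecs[:, :n_clusters]              # spectral embedding
    centers = [E[0]]                      # deterministic farthest-point init
    for _ in range(1, n_clusters):
        dists = np.min([((E - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(E[np.argmax(dists)])
    centers = np.array(centers)
    for _ in range(n_iter):               # Lloyd iterations
        labels = np.argmin(((E[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(n_clusters):
            if (labels == k).any():
                centers[k] = E[labels == k].mean(axis=0)
    return labels

# Affinity built from (hypothetical) trained self-expressive coefficients:
# two pairs of mutually expressive points form two clusters.
C = np.array([[0.0, 0.9, 0.0, 0.0],
              [0.9, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.8],
              [0.0, 0.0, 0.8, 0.0]])
A = np.abs(C) + np.abs(C).T
labels = spectral_clustering(A, n_clusters=2)
```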
4 Experiments

We implement our approach in TensorFlow on an NVIDIA TITAN Xp GPU, and evaluate the performance of RED-SC on a handwritten digit data set, MNIST, and a face data set, Extended Yale B. We compare our RED-SC with LRR, LRSC, SSC, SSC-OMP, SSSC, SDSC, EDSC and DSC-Net with both $\ell_1$ and $\ell_2$ norm regularization. We use the code provided by the respective authors, tuned to give the best performance. We evaluate the clustering performance using clustering error (ERR), normalized mutual information (NMI) and purity (PUR). For RED-SC, the kernel sizes are always 5-3-3-3-3-5 and the channels are 10-20-30-30-20-10. We use the pre-training network to obtain the parameters for the fine-tuning networks. The best results in tables are in bold.
4.1 Experiments on MNIST
We evaluate the effectiveness of RED-SC on MNIST, which consists of 70,000 hand-written digit images of size 28×28. We randomly select 1,000 images for each digit, resulting in a subset of 10,000 images. For the traditional SC algorithms LRR, SSC and ENSC, we use a subset of 1,000 images due to their limited scalability. The results are reported in Table 1.
We can observe that RED-SC greatly outperforms the traditional SC algorithms; this is partly because RED-SC uses a multi-layer encoder as the feature extractor. Besides, compared with the deep approach DSC-Net, our RED-SC obtains better performance on all three metrics. This is because RED-SC tunes the self-expressive coefficients in multiple latent spaces, while DSC-Net only uses the latent representation from the last convolutional layer. This experimental result demonstrates the effectiveness of RED-SC in producing better self-expressive coefficients for spectral clustering.
4.2 Experiments on Extended Yale B
We evaluate the efficiency of RED-SC in model training and fine-tuning on Extended Yale B, which contains 2,414 frontal face images of 38 individuals under 9 poses and 64 illumination conditions. Each cropped face image consists of 192×168 pixels; we downsample the images to 48×42 pixels. We randomly pick a number of subjects and take all the images of the selected subjects to be clustered.
As shown in Table 2, RED-SC remarkably reduces the clustering error and outperforms all the listed methods, which again demonstrates its effectiveness. Besides, we report the convergence compared with DSC-Net with the same number of parameters in Fig. 4. From Fig. 4(a) we observe that RED-SC converges much faster than DSC-Net in training, since the residual encoder-decoder architecture helps back-propagate gradients to better fit the end-to-end mapping. From Fig. 4(b) we observe that RED-SC generates a high-quality affinity matrix for spectral clustering in approximately 300 epochs, while DSC-Net needs about 1,000 epochs. This is partly because RED-SC uses the latent representations from multiple convolutional layers to fine-tune the self-expressive coefficients, which accelerates convergence. Thus RED-SC achieves higher efficiency.
5 Conclusion

We presented a Residual Encoder-Decoder network for deep Subspace Clustering (RED-SC), which symmetrically links convolutional and deconvolutional layers with skip-layer connections. We presented a new global loss and minimized it with RED-SC. To the best of our knowledge, ours is the first attempt to apply a residual encoder-decoder network to unsupervised learning tasks. A series of experiments validates that RED-SC remarkably reduces computational cost and improves clustering performance.
-  S. R. Rao, R. Tron, R. Vidal, and Y. Ma, “Motion segmentation in the presence of outlying, incomplete, or corrupted trajectories,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 10, pp. 1832–1845, 2010.
-  R. Basri and D. W. Jacobs, “Lambertian reflectance and linear subspaces,” in ICCV, 2001, pp. 383–390.
-  K.C. Lee, J. Ho, and D. J. Kriegman, "Acquiring linear subspaces for face recognition under variable lighting," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 5, pp. 684–698, 2005.
-  E. Elhamifar and R. Vidal, “Sparse subspace clustering: Algorithm, theory, and applications,” CoRR, vol. 1203.1005, 2012.
-  C.Y. Lu, H. Min, Z.Q. Zhao, L. Zhu, D.S. Huang, and S. Yan, “Robust and efficient subspace segmentation via least squares regression,” in ECCV, 2012, pp. 347–360.
-  G. Liu, Z. Lin, and Y. Yu, “Robust subspace segmentation by low-rank representation,” in ICML, 2010, pp. 663–670.
-  C. You, C.G. Li, D. P. Robinson, and R. Vidal, “Oracle based active set algorithm for scalable elastic net subspace clustering,” in CVPR, 2016, pp. 3928–3937.
-  C. You, D. P. Robinson, and R. Vidal, “Scalable sparse subspace clustering by orthogonal matching pursuit,” in CVPR, 2016, pp. 3918–3927.
-  R. Vidal and P. Favaro, “Low rank subspace clustering (LRSC),” Patt. Recog. Letters, vol. 43, pp. 47–61, 2014.
-  V. M. Patel, H. V. Nguyen, and R. Vidal, “Latent space sparse subspace clustering,” in CVPR, 2013, pp. 225–232.
-  V. M. Patel and R. Vidal, “Kernel sparse subspace clustering,” in ICIP, 2014, pp. 2849–2853.
-  S. Xiao, M. Tan, D. Xu, and Zhao Y. D., “Robust kernel low-rank representation,” IEEE Trans. Neural Netw. Learning Syst., vol. 27, no. 11, pp. 2268–2281, 2016.
-  P. Ji, T. Zhang, H. Li, M. Salzmann, and I. D. Reid, “Deep subspace clustering networks,” in NIPS, 2017, pp. 24–33.
-  S. Yang, W. Zhu, and Y. Zhu, "Sparse-dense subspace clustering."
-  G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
-  X.J. Mao, C. Shen, and Y.B. Yang, “Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections,” in NIPS, 2016, pp. 2802–2810.
-  J. Jiang, L. Zheng, F. Luo, and Z. Zhang, "RedNet: Residual encoder-decoder network for indoor RGB-D semantic segmentation," CoRR, vol. 1806.01054, 2018.
-  M. Arsalan, D. Kim, M. Lee, M. Owais, and R. Kang, "FRED-Net: Fully residual encoder-decoder network for accurate iris segmentation," Expert Systems with Applications, vol. 122, 2019.
-  R. K. Srivastava, K. Greff, and J. Schmidhuber, “Training very deep networks,” in NIPS, 2015, pp. 2377–2385.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in NIPS, 2012, pp. 1106–1114.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in 3rd ICLR, 2015.
-  M. Abadi, A. Agarwal, and P. Barham, “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” CoRR, vol. 1603.04467, 2016.
-  Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov 1998.
-  A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, “From few to many: Illumination cone models for face recognition under variable lighting and pose,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 6, pp. 643–660, 2001.
-  C.G. Li, C. You, and R. Vidal, “Structured sparse subspace clustering: A joint affinity learning and subspace clustering framework,” IEEE Trans. Image Processing, vol. 26, no. 6, pp. 2988–3001, 2017.
-  N. X. Vinh, J. Epps, and J. Bailey, “Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance,” J. Mach. Learn. Res., vol. 11, pp. 2837–2854, 2010.
-  C. Manning, P. Raghavan, and H. Schütze, "Introduction to information retrieval," 2010.