Deep Sparse Subspace Clustering

09/25/2017 ∙ by Xi Peng, et al. ∙ Sichuan University

In this paper, we present a deep extension of Sparse Subspace Clustering, termed Deep Sparse Subspace Clustering (DSSC). Regularized by the unit sphere distribution assumption for the learned deep features, DSSC can infer a new data affinity matrix by simultaneously satisfying the sparsity principle of SSC and the nonlinearity given by neural networks. One of the appealing advantages brought by DSSC is: when original real-world data do not meet the class-specific linear subspace distribution assumption, DSSC can employ neural networks to make the assumption valid with its hierarchical nonlinear transformations. To the best of our knowledge, this is among the first deep learning based subspace clustering methods. Extensive experiments are conducted on four real-world datasets to show the proposed DSSC is significantly superior to 12 existing methods for subspace clustering.


1 Introduction

Subspace clustering aims to implicitly find an underlying subspace that fits each group of data points and, at the same time, to cluster the points based on the learned subspaces. It has attracted a lot of interest from the computer vision and image processing community [1]. Most existing subspace clustering methods can be roughly divided into the following categories: algebraic methods [2, 1], iterative methods [3, 4], statistical methods [5, 6], and spectral clustering based methods [7, 8, 9, 10, 11].

Recently, a large number of spectral clustering based methods have been proposed [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31], which first form an affinity matrix using the linear reconstruction coefficients of the whole data set and then obtain clustering results by applying spectral clustering to the affinity matrix. These methods differ from each other mainly in the priors they adopt on the coefficients. For example, $\ell_1$-norm based sparse subspace clustering (SSC) [12, 14] and its $\ell_0$-norm based variant [31], low rank representation (LRR) [19, 20], and thresholding ridge regression (TRR) [32, 33] build the affinity matrix using the linear representation coefficients under the constraint of the $\ell_1$-, nuclear-, and $\ell_2$-norm, respectively. Formally, SSC, LRR, TRR, as well as many of their variants learn the representation coefficients to build the affinity matrix via:

$$\min_{\mathbf{C}} \; \mathcal{L}(\mathbf{X} - \mathbf{X}\mathbf{C}) + \mathcal{R}(\mathbf{C}) \tag{1}$$

where $\mathbf{C} \in \mathbb{R}^{n \times n}$ denotes the linear representation of the input $\mathbf{X} \in \mathbb{R}^{d \times n}$. Here, $d$ denotes the dimension of data and $n$ is the number of data points. $\mathcal{R}(\mathbf{C})$ denotes certain imposed structure prior over $\mathbf{C}$, and the choice of the representation error function $\mathcal{L}(\cdot)$ usually depends on the distribution assumption of the representation error $\mathbf{X} - \mathbf{X}\mathbf{C}$; e.g., a typical loss function is $\mathcal{L}(\mathbf{X} - \mathbf{X}\mathbf{C}) = \|\mathbf{X} - \mathbf{X}\mathbf{C}\|_F^2$.

Fig. 1: The flowchart of the proposed DSSC method. For a given data set $\mathbf{X}$, we project it into the feature space as $\mathbf{H}^{(M)}$ by using a set of hierarchical nonlinear transformations and learn the sparse self-representation of the input at the top layer of the neural network, where $M$ denotes the top layer of the neural network. Once the neural network converges, we apply spectral clustering to the affinity matrix built from the obtained representation, as in SSC. Note that the proposed neural network is based on a novel structure which simultaneously enjoys the sparsity of SSC and the nonlinearity of neural networks.

Although those methods have achieved impressive performance for subspace clustering, they generally suffer from the following limitations. First of all, they assume that each sample can be linearly reconstructed by the whole sample collection. However, in real-world cases, the data may not be linearly representable by each other in the input space. Therefore, the performance of those methods usually drops in practice. To address this problem, several recent works [34, 35, 36, 37] have developed kernel-based approaches which have shown their effectiveness in subspace clustering. However, kernel-based approaches are similar to template-based approaches, whose performance heavily depends on the choice of kernel functions. Moreover, these approaches cannot give explicit nonlinear transformations, which causes difficulties in handling large-scale data sets.

Inspired by the remarkable success of deep learning in various applications [38, 39], in this work, we propose a new subspace clustering framework based on neural networks, termed deep sparse subspace clustering (DSSC). The basic idea of DSSC (see Figure 1) is simple but effective. It uses a neural network to project data into another space in which SSC remains valid for the nonlinear subspace case. Unlike most existing subspace clustering methods, our method simultaneously learns a set of hierarchical transformations parametrized by a neural network and the reconstruction coefficients that represent each mapped sample as a combination of the others. Compared with kernel-based approaches, DSSC is a deep instead of shallow model which can explicitly map samples from the input space into a latent space, with the parameters of the transformations learned in a data-driven way. To the best of our knowledge, DSSC is the first deep extension of SSC, which satisfies the sparsity principle of SSC and meanwhile makes SSC valid for the nonlinear subspace case.

The contribution of this work is twofold. From the view of subspace clustering, we show how to make it benefit from the success of deep neural networks so that the nonlinear subspace clustering could be achieved. From the view of neural networks, we show that it is feasible to integrate the advantages of existing subspace clustering methods and deep learning to develop new unsupervised learning algorithms.

Notations: throughout the paper, lower-case bold letters represent column vectors and upper-case bold letters denote matrices. $\mathbf{A}^T$ denotes the transpose of the matrix $\mathbf{A}$ and $\mathbf{I}$ denotes an identity matrix.

2 Related Works

Subspace Clustering: The past decade saw an upsurge of subspace clustering methods with various applications in computer vision, e.g. motion segmentation [14, 16, 19, 21, 6, 40], face clustering [12, 17, 18, 20, 22], image processing [41, 15, 31], multi-view analysis [24], and video analysis [36]. Among these works, spectral clustering based methods have achieved state-of-the-art results. The key of these methods is to learn a satisfactory affinity matrix $\mathbf{A}$ in which $A_{ij}$ denotes the similarity between the $i$-th and the $j$-th sample. Ideally, $A_{ij} \neq 0$ only if the corresponding data points $\mathbf{x}_i$ and $\mathbf{x}_j$ are drawn from the same subspace. To this end, some recent works (e.g. SSC [12, 14]) assume that any given sample can be linearly reconstructed by other samples in the input space. Based on the self-representation, an affinity matrix (also called a similarity graph) can be constructed and fed to spectral clustering algorithms to obtain the final clustering results. In practice, however, high-dimensional data (such as face images) usually reside on a nonlinear manifold. The linear reconstruction assumption may then not be satisfied in the original space, and in this case the methods may fail to capture the intrinsic nonlinearity of the manifold. To address this limitation, the kernel approach is used to first project samples into a high-dimensional feature space in which the representation of the whole data set is computed [34, 35, 36, 37]. After that, the clustering result is achieved by performing traditional subspace clustering methods in the kernel space. However, the kernel-based methods behave like template-based approaches, which usually require prior knowledge of the data distribution to choose a desirable kernel function. Clearly, such a prior is hard to obtain in practice. Moreover, they cannot learn explicit nonlinear mapping functions from the data set, and thus suffer from the scalability issue and the out-of-sample problem [42, 43].

Unlike these classical subspace clustering approaches, our method learns a set of explicit nonlinear mapping functions from data set to map the input into another space, and calculates the affinity matrix using the representation of the samples in the new space.

Deep Learning: Aimed at learning high-level features from inputs, deep learning has shown promising results in numerous computer vision tasks in the scenario of supervised learning, such as image classification [39]. In contrast, less attention [44, 45, 46] has been paid to applications with an unsupervised learning scheme. Recently, some works [47, 48, 49, 50, 51, 52, 53] have been devoted to combining deep learning and unsupervised clustering and have shown impressive results over the traditional clustering approaches. These methods share the same basic idea, i.e., using deep learning to learn a good representation and then achieving clustering with existing clustering methods like k-means. The major differences among them reside in the neural network structure and the objective function.

Different from these works, our framework is based on a new neural network instead of an existing network. Moreover, our method focuses on subspace clustering rather than generic clustering: it learns the high-level features of the inputs and the self-representation jointly, whereas these existing methods do not enjoy the effectiveness of self-expressive subspace clustering. We believe that such a general framework is complementary to existing shallow subspace clustering methods, since it can adopt the loss functions and regularizations of these methods. To the best of our knowledge, this is one of the first deep subspace clustering methods. It should be pointed out that our model is also significantly different from [49] in the following respects: 1) [49] behaves like manifold learning, which requires that the data can be linearly reconstructed in the input space and embeds the sparse representation obtained in the input space into the latent space. In contrast, our model aims to solve the problem of nonlinear subspace clustering, i.e. the case in which the data cannot be linearly represented in the input space. 2) In [49], the sparse representation is used as a type of prior, which is kept unchanged. In contrast, this work dynamically seeks a good sparse representation while jointly optimizing our neural network. 3) The proposed method can be regarded as a deep nonlinear extension of the well-known SSC, which makes it possible for SSC to handle nonlinear subspace clustering.

3 Deep Sparse Subspace Clustering

In this section, we first briefly review SSC, and then present the details of our deep subspace clustering method.

3.1 Sparse Subspace Clustering

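For a given data set $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n] \in \mathbb{R}^{d \times n}$, SSC seeks to linearly reconstruct the $i$-th sample $\mathbf{x}_i$ using a few other samples. In other words, the representation coefficients are expected to be sparse. To this end, the problem is formulated as below,

$$\min_{\mathbf{c}_i} \; \|\mathbf{c}_i\|_1 \quad \text{s.t.} \quad \mathbf{x}_i = \mathbf{X}\mathbf{c}_i, \;\; c_{ii} = 0, \tag{2}$$

where $\|\cdot\|_1$ denotes the $\ell_1$-norm (i.e., the sum of absolute values of all elements in a vector) that acts as a relaxation of the $\ell_0$-norm, and $c_{ii}$ denotes the $i$-th element in $\mathbf{c}_i$. Specifically, penalizing $\|\mathbf{c}_i\|_1$ encourages $\mathbf{c}_i$ to be sparse, and enforcing the constraint $c_{ii} = 0$ avoids trivial solutions. To deal with the optimization problem in (2), the alternating direction method of multipliers (ADMM) [54, 55] is often used.

Once the sparse representation $\mathbf{C} = [\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_n]$ of the whole data set is obtained by solving (2), an affinity matrix in SSC is calculated as $\mathbf{A} = |\mathbf{C}| + |\mathbf{C}|^T$, based on which spectral clustering is applied to give clustering results.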
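The sparse coding and affinity steps of SSC can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: it solves a Lagrangian (lasso-style) relaxation of the SSC problem with proximal gradient descent (ISTA) rather than the ADMM or Homotopy solvers cited in the paper, and the function names, the weight `lam`, and the iteration count are hypothetical choices.

```python
import numpy as np

def soft_threshold(V, t):
    """Element-wise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(V) * np.maximum(np.abs(V) - t, 0.0)

def sparse_self_representation(X, lam=0.1, n_iter=500):
    """Approximately solve min_C 0.5*||X - XC||_F^2 + lam*||C||_1 with diag(C) = 0.

    X has shape (d, n) with one sample per column. ISTA is an illustrative
    solver choice (an assumption), not the one used in the paper.
    """
    n = X.shape[1]
    G = X.T @ X                                        # Gram matrix of the samples
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 + 1e-12)   # 1 / Lipschitz constant
    C = np.zeros((n, n))
    for _ in range(n_iter):
        grad = G @ C - G                               # gradient of the smooth term
        C = soft_threshold(C - step * grad, step * lam)
        np.fill_diagonal(C, 0.0)                       # a sample may not represent itself
    return C

def affinity_from_codes(C):
    """Symmetric affinity |C| + |C|^T that is fed to spectral clustering."""
    return np.abs(C) + np.abs(C).T
```

For data drawn from two orthogonal one-dimensional subspaces, the resulting affinity connects only samples from the same subspace, which is the property spectral clustering relies on.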

3.2 Deep Subspace Clustering

In most existing subspace clustering methods including SSC, each sample is encoded as a linear combination of the whole data set. However, when dealing with high-dimensional data which usually lie on nonlinear manifolds, such methods may fail to capture the nonlinear structure, thus leading to inferior results. To address this issue, we propose a deep learning based method which maps given samples using explicit hierarchical transformations in a neural network, and simultaneously learns the reconstruction coefficients to represent each mapped sample as a combination of others.

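As shown in Figure 1, the neural network in our proposed framework consists of $M+1$ stacked layers with $M$ nonlinear transformations, which takes a given sample $\mathbf{x}$ as the input to the first layer. For ease of presentation, we make several definitions below. For the first layer of our neural network, we define its input as $\mathbf{h}^{(0)} = \mathbf{x} \in \mathbb{R}^{d}$. Moreover, for the subsequent layers, let

$$\mathbf{h}^{(m)} = g\big(\mathbf{W}^{(m)} \mathbf{h}^{(m-1)} + \mathbf{b}^{(m)}\big) \in \mathbb{R}^{d_m} \tag{3}$$

be the output of the $m$-th layer (in which $m = 1, 2, \ldots, M$ indexes the layer), where $g(\cdot)$ is a nonlinear activation function, $d_m$ is the dimension of the output of the $m$-th layer, and $\mathbf{W}^{(m)} \in \mathbb{R}^{d_m \times d_{m-1}}$ and $\mathbf{b}^{(m)} \in \mathbb{R}^{d_m}$ denote the weights and bias associated with the $m$-th layer, respectively. In particular, given $\mathbf{x}$ as the input of the first layer, the output at the top layer of our neural network is

$$\mathbf{h}^{(M)} = g\big(\mathbf{W}^{(M)} \mathbf{h}^{(M-1)} + \mathbf{b}^{(M)}\big). \tag{4}$$

In fact, viewing the expression above as a function of $\mathbf{x}$, we can observe that $\mathbf{h}^{(M)}$ is a nonlinear function determined by the weights and biases of our neural network (i.e., $\{\mathbf{W}^{(m)}, \mathbf{b}^{(m)}\}_{m=1}^{M}$) as well as the choice of activation function $g(\cdot)$. Furthermore, for $n$ samples $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_n]$, we define $\mathbf{H}^{(M)}$ as the collection of the corresponding outputs given by our neural network, i.e.

$$\mathbf{H}^{(M)} = \big[\mathbf{h}_1^{(M)}, \mathbf{h}_2^{(M)}, \ldots, \mathbf{h}_n^{(M)}\big]. \tag{5}$$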
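The layer-wise mapping defined above amounts to a plain fully connected forward pass. The following NumPy sketch illustrates it under assumptions: the function name `forward` and its arguments are hypothetical, and `np.tanh` is used as the activation only as an example choice.

```python
import numpy as np

def forward(X, weights, biases, g=np.tanh):
    """Map inputs X of shape (d, n), one sample per column, to the top-layer outputs.

    weights[m] has shape (d_m, d_{m-1}) and biases[m] has shape (d_m,).
    The tanh default is an assumed activation for illustration.
    """
    H = X                                   # input of the first layer
    for W, b in zip(weights, biases):
        H = g(W @ H + b[:, None])           # one nonlinear transformation per layer
    return H                                # columns are the top-layer features
```

Each column of the returned matrix is the top-layer feature of the corresponding input sample, so the self-representation is computed on this matrix instead of on the raw inputs.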

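With the above definitions, we present the objective function of our method in the following form:

$$\min \; J_1 + \lambda J_2, \tag{6}$$

where $\lambda$ is a positive trade-off parameter, and $J_1$ and $J_2$ are defined below. Intuitively, the first term $J_1$ is designed to minimize the discrepancy between $\mathbf{H}^{(M)}$ and its self-expressed representation $\mathbf{H}^{(M)}\mathbf{C}$. Meanwhile, it regularizes $\mathbf{C}$ for some desired properties. To be specific, $J_1$ can be expressed in the form of

$$J_1 = \mathcal{L}\big(\mathbf{H}^{(M)} - \mathbf{H}^{(M)}\mathbf{C}\big) + \mathcal{R}(\mathbf{C}) + \mathcal{G}(\mathbf{C}), \tag{7}$$

where $\mathcal{G}(\mathbf{C})$ takes the value of $+\infty$ if $\mathbf{C}$ is not in some feasible domain, and $0$ otherwise. Note that the forms of $\mathcal{L}(\cdot)$, $\mathcal{R}(\cdot)$, and $\mathcal{G}(\cdot)$ may be adopted from many existing subspace clustering works. In this paper, we aim to develop a deep extension of SSC and thus take $\mathcal{L}(\cdot) = \|\cdot\|_F^2$, $\mathcal{R}(\mathbf{C}) = \|\mathbf{C}\|_1$, and $\mathcal{G}(\mathbf{C}) = +\infty$ if $\mathrm{diag}(\mathbf{C}) = \mathbf{0}$ is violated, and $0$ otherwise.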

The second part is designed to remove an arbitrary scaling factor in the latent space. In this work, we set

(8)

Note that, without the above term, our neural network may collapse to trivial solutions like $\mathbf{H}^{(M)} = \mathbf{0}$.

With the terms detailed above, the optimization problem of our proposed DSSC can be expressed as follows:

(9)

where denotes the parametric neural network, i.e., .
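To make the roles of the terms concrete, the sketch below evaluates one plausible instance of the objective: a squared Frobenius reconstruction error, an $\ell_1$ penalty on the codes, a zero-diagonal feasibility check, and a penalty that pulls each top-layer feature towards unit norm. The weights `gamma` and `lam`, the constants, and the exact form of the unit-norm penalty are assumptions made for illustration and may differ from the paper's definition.

```python
import numpy as np

def dssc_objective(H, C, gamma=0.1, lam=1.0):
    """Evaluate an assumed instance of the DSSC objective.

    H: (p, n) top-layer features, one column per sample.
    C: (n, n) self-representation coefficients.
    Returns +inf when the zero-diagonal constraint on C is violated.
    """
    if not np.allclose(np.diag(C), 0.0):
        return np.inf                                   # infeasible codes
    recon = np.linalg.norm(H - H @ C, "fro") ** 2       # self-expression error
    sparsity = gamma * np.abs(C).sum()                  # l1 prior on the codes
    sq_norms = np.sum(H * H, axis=0)                    # squared norm of each feature
    unit_norm = lam * np.sum((sq_norms - 1.0) ** 2)     # assumed unit-sphere penalty
    return recon + sparsity + unit_norm
```

With this penalty, the collapsed solution in which all features and codes are zero is no longer free: it has zero reconstruction error and zero sparsity cost, but pays one unit of penalty (times `lam`) per sample.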

3.3 Optimization

For ease of presentation, we first rewrite as follows:

(10)

where is a variant of , which is obtained by simply replacing in with .

Given $n$ data points, DSSC simultaneously learns nonlinear mapping functions and sparse codes by solving (10). As (10) is a multiple-variable optimization problem, we employ an alternating minimization algorithm that alternately updates one of the variables while fixing the others.

Step 1: Fix and , update , (10) can be rewritten as

(11)

where is a constant.

To solve (11), we adopt the stochastic sub-gradient descent (SGD) algorithm to obtain the parameters $\mathbf{W}^{(m)}$ and $\mathbf{b}^{(m)}$. Moreover, we also enforce the $\ell_2$-norm on the parameters to avoid overfitting [39, 56], where the regularization parameter is fixed as in all experiments. Note that (11) could also be solved with mini-batch SGD, especially when the data size is large. However, mini-batch SGD raises two issues. First, it introduces a new hyper-parameter (i.e., the batch size), which increases the human effort needed for model selection. Second, its efficiency may come at the cost of performance degradation [57].
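The network update in Step 1 needs the gradient of the smooth part of the objective with respect to the top-layer features, which back-propagation then pushes down to the weights and biases. The sketch below gives that gradient for an assumed objective, $\|\mathbf{H} - \mathbf{H}\mathbf{C}\|_F^2 + \lambda \sum_i (\|\mathbf{h}_i\|_2^2 - 1)^2$ with $\mathbf{C}$ held fixed; the constants, the form of the second term, and the function names are assumptions for illustration.

```python
import numpy as np

def smooth_objective(H, C, lam):
    """||H - HC||_F^2 + lam * sum_i (||h_i||^2 - 1)^2, with the codes C fixed."""
    R = H - H @ C
    sq_norms = np.sum(H * H, axis=0)
    return np.sum(R * R) + lam * np.sum((sq_norms - 1.0) ** 2)

def grad_wrt_H(H, C, lam):
    """Analytic gradient of smooth_objective with respect to H (shape (p, n))."""
    E = np.eye(C.shape[0]) - C                 # H - HC = H E
    sq_norms = np.sum(H * H, axis=0)
    recon_grad = 2.0 * H @ E @ E.T             # from the self-expression error
    norm_grad = 4.0 * lam * H * (sq_norms - 1.0)   # scales column i by (||h_i||^2 - 1)
    return recon_grad + norm_grad
```

A finite-difference comparison against `smooth_objective` is a cheap way to validate such a gradient before wiring it into back-propagation.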

Step 2: Fix $\mathbf{H}^{(M)}$ and update $\mathbf{C}$ by

(12)

where is a constant. Note that (12) is a standard $\ell_1$-minimization problem faced in SSC, which can be solved by using many existing $\ell_1$-solvers [58].

Step 1 and Step 2 are repeated until convergence. After obtaining $\mathbf{C}$, we construct a similarity graph via $\mathbf{A} = |\mathbf{C}| + |\mathbf{C}|^T$ and obtain the clustering results based on $\mathbf{A}$. The optimization procedure of DSSC is summarized in Algorithm 1.

Input: A given data set and the tradeoff parameters .
// Initialization:
Initialize , and .
for  do
        Do forward propagation to get and via solving (3) and (12), respectively.
end for
// Optimization
while not converged do
        for  do
                Randomly select a data point and let ,
                for  do
                        Compute via (3).
                end for
                Compute using via (3).
                for  do
                        Calculate the gradient using the SGD algorithm.
                end for
                for  do
                        Update and with the gradient.
                end for
        end for
end while
Output: and .
Algorithm 1 Deep Sparse Subspace Clustering

3.4 Discussions

Our approach DSSC can provide satisfactory subspace clustering performance, benefiting from the following factors. First, different from SSC, DSSC performs sparse coding in a deep latent space instead of the original one, and the latent space is automatically learned in a data-driven manner. After mapping input data into the latent space via the learned transformation matrices, the transformed samples are more favorable for linear reconstruction. Second, DSSC can also be deemed a deep kernel method which automatically learns transformations in a data-driven way. Considering the demonstrated effectiveness of kernel-based subspace clustering approaches such as [35, 36], DSSC is expected to show even better performance for subspace clustering thanks to the representative capacity of deep neural networks.

It should be pointed out that the proposed DSSC adopts a neural network structure similar to that of deep metric learning networks (DMLNs) [59, 60, 61, 62], i.e., a set of fully connected layers that perform nonlinear transformations, followed by a specific task on the output of the neural network. The major differences between them are: 1) the objective functions are different. Our method aims to segment different samples into different subspaces, whereas these metric learning networks aim to learn a similarity function that measures how similar or related two data points are; 2) our DSSC is unsupervised, whereas DMLNs are supervised approaches which require label information to train the neural networks.

3.5 Implementation Details

In this section, we introduce the implementation details of the used activation functions and the initialization of the network parameters.

The activation functions can be chosen from various forms. In our experiments, we use the function which is defined as follows:

(13)

and the corresponding derivative is calculated as

(14)

Regarding the initialization of $\{\mathbf{W}^{(m)}, \mathbf{b}^{(m)}\}_{m=1}^{M}$, we initialize $\mathbf{W}^{(m)}$ as a rectangular matrix with ones on the main diagonal and zeros elsewhere. Moreover, $\mathbf{b}^{(m)}$ is initialized as $\mathbf{0}$. Note that the used networks could also be initialized with an auto-encoder.
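The initialization described above maps directly onto `numpy.eye`, which accepts separate row and column counts. In the sketch below, the helper name `init_layer` is hypothetical, the zero bias is an assumption, and the layer sizes are taken from the 300-200-150 network reported in the experimental settings.

```python
import numpy as np

def init_layer(d_out, d_in):
    """Rectangular identity weights and an (assumed) zero bias for one layer."""
    W = np.eye(d_out, d_in)      # ones on the main diagonal, zeros elsewhere
    b = np.zeros(d_out)          # assumption: biases start at zero
    return W, b

layer_sizes = [300, 200, 150]    # layer widths reported in Section 4.1
params = [init_layer(d_out, d_in)
          for d_in, d_out in zip(layer_sizes[:-1], layer_sizes[1:])]
```

With this choice, each layer initially passes through the leading coordinates of its input (up to the activation), so training starts from a truncation of the PCA features rather than from a random projection.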

4 Experiments

In this section, we compare our method with 12 popular subspace clustering methods on four different real-world data sets in terms of four clustering performance metrics.

4.1 Datasets and Experimental Settings

Data sets: Four different data sets are used in our experiments, i.e. COIL20 object images [63], the MNIST handwritten digit database [64], AR facial images [65], and the BF0502 video face data set [66].

The COIL20 database contains 1,440 samples distributed over 20 objects, where each image is with the size of . The MNIST data set includes 60,000 training and 10,000 testing images of handwritten digits, of which the first 2,000 training images and the first 2,000 testing images are used in our experiments, where the size of each image is .

The AR database is one of the most popular facial image data sets for subspace clustering. In our experiments, we use a widely-used subset of the AR database [67] which consists of 1,400 undisguised faces evenly distributed over 50 males and 50 females, where the size of each image is .

The BF0502 data set contains facial images detected from the TV series “Buffy the Vampire Slayer”. Following [36], a subset of BF0502 is used, which includes 17,337 faces in 229 tracks from 6 main cast members. Each facial image is represented as a 1,937-dimensional vector extracted from 13 facial landmark points (e.g., the left and right corners of each eye). In our experiments, we use the first 200 samples from each category, thus resulting in 1,200 images in total.

For the purpose of nonlinear subspace clustering, we use the following four types of features instead of raw data from the COIL20, MNIST, and AR data sets in experiments, i.e. dense scale-invariant feature transform (DSIFT) [68], the histogram of oriented gradients (HOG) [69], local binary pattern (LBP) [70], and local phase quantization (LPQ) [71]. The details of extracting these features are introduced as follows:

  • DSIFT: We divide each image into multiple non-overlapping patches and then densely sample SIFT descriptors from each patch. The patch sizes of AR, COIL20, and MNIST are set as , , and , respectively. By concatenating these SIFT descriptors extracted from each image, we obtain a feature vector with the dimension of 11,264 (AR), 2,048 (COIL20), and 6,272 (MNIST).

  • HOG: We first divide each image into multiple blocks with two scales, i.e. and for AR, and and for MNIST and COIL20. Then, we extract a 9-dimensional HOG feature from each block. By concatenating these features for each image, the dimensions of the feature vectors are 13,770 (AR), 2,205 (MNIST), and 2,880 (COIL20), respectively.

  • LBP: Like DSIFT, we divide each image into multiple non-overlapping patches and then extract LBP features using 8 sampling points on a circle of radius 1. Thus, we obtain a 59-dimensional LBP feature vector from each patch. By concatenating the descriptors of each image, we obtain a feature vector with the dimension of 7,788 (COIL20) and 2,891 (MNIST).

  • LPQ: The patch size is set as for COIL20 and MNIST. For all the tested data sets, we set the size of LPQ window as 3, 5, and 7. By concatenating the features of all patches of each image, the dimension of each feature is 12,288 for COIL20 and 6,912 for MNIST.

For computational efficiency, we perform PCA to reduce the feature dimension of all data sets to 300, following the setting in [59, 14].
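The dimensionality reduction step can be sketched with a singular value decomposition. This is a minimal illustration, assuming samples are stored as columns; the function name `pca_reduce` is hypothetical, and the paper does not state which PCA implementation was used.

```python
import numpy as np

def pca_reduce(X, k):
    """Project the columns of X (shape (d, n)) onto the top-k principal directions.

    Requires k <= min(d, n). Returns an array of shape (k, n).
    """
    Xc = X - X.mean(axis=1, keepdims=True)              # center each feature
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)    # columns of U: principal directions
    return U[:, :k].T @ Xc

# Hypothetical usage: features_300 = pca_reduce(features, 300)
```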

Baseline Methods: We compare DSSC with 12 state-of-the-art subspace clustering methods, i.e. SSC [12, 14], Kernel SSC (KSSC) [35], LRR [20, 19], low rank subspace clustering (LRSC) [16], Kernel LRR (KLRR) [36], least square regression (LSR) [21], and smooth representation (SMR) [17]. LSR has two variants, which are denoted by LSR1 and LSR2, respectively. KSSC and KLRR also have two variants each, which are based on the RBF function (KSSC1 / KLRR1) and the polynomial function (KSSC2 / KLRR2), respectively. Moreover, we also use the deep autoencoder (DAE) with SSC as a baseline to show the efficacy of our method. More specifically, we adopt the pre-training and fine-tuning strategy [72] to train a DAE that consists of five layers with 300, 200, 150, 200, and 300 neurons. In the experiments, we investigate the performance of DAE with two popular nonlinear activation functions, i.e. the sigmoid function (DAEg) and the saturating linear transfer function (DAEs). After the DAE converges, we perform SSC on the output of the third layer to obtain the clustering results. For fair comparisons, we use the same $\ell_1$-solver (i.e. the Homotopy method [58, 73]) to solve the $\ell_1$-minimization problem in DSSC, SSC, and DAE. Note that our method is also compatible with other neural networks such as convolutional neural networks (CNN). In the experiments, we adopt the fully connected network (FCN) instead of a CNN because the former has offered desirable performance in our experiments. Moreover, an FCN has fewer hyper-parameters than a CNN, which remarkably reduces the effort of seeking optimal hyper-parameter values.

Features DSIFT HOG
Methods ACC NMI ARI Fscore Para. ACC NMI ARI Fscore Para.
DSSC 80.82±2.88 90.52±0.94 77.63±2.09 78.88±1.96 , 20 87.10±2.82 91.67±1.07 82.56±1.26 83.51±2.12 , 30
SSC 78.96±3.12 89.06±1.03 76.46±2.31 77.59±2.17 0.5, 0.2 85.01±0.85 89.99±0.38 81.13±1.08 82.08±1.02 0.5, 0.1
KSSC1 71.00±2.13 78.72±0.98 63.33±1.85 65.18±1.75 , 18 75.29±0.97 82.75±0.49 66.46±1.43 68.20±1.33 , 18
KSSC2 72.01±2.68 83.84±0.89 64.22±3.47 66.22±3.16 , 18 69.53±1.30 81.27±0.69 61.16±1.83 63.32±1.69 , 18
DAEg 55.83±2.80 70.42±1.43 47.06±2.74 50.00±2.52 0.5, 0.2 69.60±1.00 78.52±0.47 59.38±0.79 61.63±0.74 0.5, 0.1
DAEs 55.81±2.60 70.71±1.68 48.49±3.31 51.46±3.05 0.5, 0.2 64.75±1.31 77.48±0.60 56.81±1.12 59.13±1.06 0.5, 0.1
LRR 71.03±1.47 80.52±1.05 63.83±2.09 65.70±1.97 5e-2 76.89±1.46 86.52±0.78 70.79±1.73 72.39±1.62 5e-3
KLRR1 70.46±1.55 79.61±1.01 61.25±1.94 63.35±1.81 500 76.74±0.27 82.00±0.14 69.43±0.48 70.96±0.45 10
KLRR2 70.85±1.37 80.09±1.15 62.75±1.54 64.73±1.46 100 72.33±2.65 80.98±1.21 63.11±2.88 65.07±2.68 5
LRSC 71.82±0.28 77.65±0.23 62.72±0.52 64.62±0.49 0.08 57.11±1.24 69.91±0.73 46.27±1.57 49.20±1.48 0.01
LSR1 63.93±2.15 73.18±1.12 53.29±2.26 55.75±2.14 0.6 54.81±1.80 64.44±0.94 42.28±1.55 45.35±1.44 0.5
LSR2 68.11±1.14 75.33±0.62 56.29±1.56 58.61±1.41 0.9 53.81±1.51 63.00±1.22 42.07±1.5 45.19±1.42 0.3
SMR 76.97±0.96 85.30±0.58 71.56±1.02 73.02±0.96 , 80.15±0.87 85.93±0.6 73.51±1.06 74.87±1.01 ,
TABLE I: Clustering results (mean±standard deviation) on the COIL20 data set. Results in boldface are significantly better than the others, according to the t-test with a significance level at 0.05.

Features LBP LPQ
Methods ACC NMI ARI Fscore Para. ACC NMI ARI Fscore Para.
DSSC 72.89±1.41 84.32±0.79 67.31±1.96 69.01±1.85 , 40 78.12±2.09 85.38±0.77 71.35±1.34 72.87±1.25 , 60
SSC 70.17±0.65 82.66±0.19 64.19±0.60 66.07±0.58 , 74.60±0.81 84.21±0.49 67.69±0.83 69.35±0.79 , 0.1
KSSC1 69.33±1.97 80.65±0.86 61.15±1.91 63.18±1.79 1, 16 68.49±2.38 79.28±1.27 59.06±2.37 61.23±2.21 0.1, 12
KSSC2 70.42±1.13 83.67±0.69 65.28±1.23 68.03±1.16 1, 16 69.24±2.33 79.52±0.93 61.07±1.72 63.17±1.62 0.1, 12
DAEg 40.96±2.18 53.54±0.89 26.27±1.33 30.57±1.22 , 62.19±0.90 72.04±0.54 51.51±0.75 54.15±0.72 , 0.1
DAEs 40.68±1.13 52.12±0.92 23.67±1.30 28.26±1.10 , 59.64±2.46 67.44±1.06 44.90±1.81 47.98±1.65 , 0.1
LRR 71.60±4.02 84.45±1.78 65.47±5.68 66.29±5.21 0.5 69.00±1.09 80.31±0.88 60.12±1.51 62.29±1.41 0.1
KLRR1 65.83±0.31 77.34±0.30 56.41±0.50 58.60±0.47 30 69.43±1.46 77.34±0.53 57.01±1.02 59.24±0.96 500
KLRR2 70.10±1.27 79.58±0.13 62.91±0.51 64.82±0.47 1000 65.33±2.48 76.41±1.13 54.22±2.11 56.69±1.94 100
LRSC 62.96±0.61 73.38±0.79 53.31±1.06 55.67±1.01 0.04 66.38±0.50 78.73±0.58 58.81±0.97 60.99±0.91 0.08
LSR1 70.24±2.90 82.40±1.41 64.54±2.85 67.33±2.69 1 66.97±1.68 74.42±0.62 55.48±1.52 57.74±1.43 0.2
LSR2 70.54±3.26 81.63±1.16 63.71±2.58 66.59±2.41 0.6 65.25±1.55 73.81±1.29 54.34±1.65 56.66±1.56 0.3
SMR 71.93±1.35 81.17±0.39 63.54±1.41 66.39±1.32 , 70.56±0.57 80.68±0.41 61.68±0.49 63.68±0.45 ,
TABLE II: Clustering results on the COIL20 data set. Results in boldface are significantly better than the others, according to the t-test with a significance level at 0.05.
Features DSIFT HOG
Methods ACC NMI ARI Fscore Para. ACC NMI ARI Fscore Para.
DSSC 72.65±0.00 70.42±0.00 61.80±0.00 65.79±0.00 , 20 78.10±0.00 77.51±0.00 68.72±0.00 72.03±0.00 , 30
SSC 62.45±0.00 65.75±0.00 53.75±0.00 58.81±0.00 1, 77.35±0.00 75.70±0.00 66.90±0.00 70.23±0.00 10,
KSSC1 50.90±0.00 49.75±0.00 35.28±0.00 41.80±0.00 , 10 66.90±0.00 70.20±0.00 56.79±0.00 61.35±0.00 , 12
KSSC2 60.80±0.00 63.96±0.00 50.81±0.00 56.26±0.00 , 10 68.00±0.00 70.74±0.00 57.69±0.00 62.12±0.00 , 12
DAEg 52.55±0.00 58.36±0.00 40.98±0.00 47.48±0.00 1, 23.36±0.66 11.56±0.78 5.83±0.28 15.89±0.32 10,
DAEs 42.28±1.19 48.70±0.56 31.70±0.31 39.85±0.22 1, 23.07±1.14 10.48±0.52 4.91±0.31 15.02±0.29 10,
LRR 63.20±0.00 68.34±0.00 54.11±0.00 59.48±0.00 0.05 73.30±0.00 74.49±0.00 63.20±0.00 67.12±0.00 0.01
KLRR1 57.05±0.00 57.97±0.00 44.63±0.00 50.70±0.00 3000 72.15±0.00 70.95±0.00 61.38±0.00 65.42±0.00 30
KLRR2 22.63±0.89 12.55±1.37 9.02±0.46 22.98±0.14 1000 73.55±0.00 73.30±0.00 63.69±0.00 67.50±0.00 3
LRSC 59.30±0.00 58.84±0.00 46.90±0.00 52.37±0.00 0.1 61.20±0.00 59.65±0.00 47.05±0.00 52.59±0.00 0.01
LSR1 63.50±0.00 60.39±0.00 49.02±0.00 54.29±0.00 0.1 58.42±0.09 56.41±0.07 44.85±0.12 50.79±0.11 0.2
LSR2 63.55±0.00 60.53±0.00 49.14±0.00 54.39±0.00 0.4 60.40±0.02 57.78±0.00 46.45±0.00 51.98±0.00 0.1
SMR 69.15±0.00 68.90±0.00 59.17±0.00 63.40±0.00 , 77.22±0.05 77.25±0.00 66.85±0.00 71.11±0.00 ,
TABLE III: Clustering results on the MNIST data set. Results in boldface are significantly better than the others, according to the t-test with a significance level at 0.05.
Features LBP LPQ
Methods ACC NMI ARI Fscore Para. ACC NMI ARI Fscore Para.
DSSC 61.70±0.00 54.23±0.00 44.54±0.00 50.25±0.00 , 40 65.04±0.02 54.85±0.01 46.38±0.00 52.04±0.00 , 70
SSC 59.75±0.00 53.83±0.00 43.52±0.00 49.02±0.00 0.1, 0.01 62.35±0.00 53.86±0.00 44.67±0.00 50.42±0.00 1, 0.01
KSSC1 58.50±0.00 53.96±0.00 41.58±0.00 47.67±0.00 0.1, 10 44.00±0.00 34.97±0.00 23.49±0.00 31.26±0.00 , 16
KSSC2 57.70±0.00 54.27±0.00 41.95±0.00 47.96±0.00 0.1, 10 54.99±0.03 51.07±0.01 36.62±0.01 43.21±0.01 , 16
DAEg 36.20±0.00 27.81±0.00 16.68±0.00 25.42±0.00 0.1, 0.01 30.61±0.51 22.05±0.32 12.19±0.05 22.07±0.05 10, 0.01
DAEs 32.20±0.00 24.85±0.00 14.75±0.00 23.39±0.00 0.1, 0.01 34.10±0.00 22.71±0.02 12.41±0.01 22.43±0.01 10, 0.01
LRR 55.70±0.00 45.70±0.00 37.77±0.00 44.58±0.00 0.5 52.15±0.00 50.63±0.00 37.86±0.00 44.63±0.00 0.5
KLRR1 54.12±0.06 50.98±0.01 37.84±0.07 44.25±0.06 1000 55.60±0.00 51.66±0.00 38.42±0.00 44.77±0.00 3000
KLRR2 53.75±0.00 50.70±0.00 37.05±0.00 43.55±0.00 300 56.75±0.00 51.69±0.00 38.87±0.00 45.16±0.00 1000
LRSC 42.45±0.00 35.60±0.00 23.42±0.00 31.43±0.00 0.03 53.20±0.00 42.49±0.00 32.03±0.00 39.04±0.00 0.05
LSR1 53.12±0.05 45.81±0.05 35.27±0.01 41.87±0.01 0.1 52.60±0.00 46.93±0.00 34.93±0.00 41.71±0.00 1
LSR2 52.93±0.04 45.65±0.03 34.98±0.05 41.61±0.05 0.1 53.25±0.05 47.57±0.11 35.54±0.06 42.28±0.06 1
SMR 49.90±0.00 44.29±0.00 32.16±0.00 39.18±0.00 , 48.90±0.00 43.43±0.00 30.18±0.00 37.62±0.00 ,
TABLE IV: Clustering results on the MNIST data set. Results in boldface are significantly better than the others, according to the t-test with a significance level at 0.05.

Experimental Settings: In our experiments, we adopt cross-validation to select the optimal parameters for all the tested methods [56] (see footnote 1). More specifically, we split each data set equally into two partitions and tune the parameters on one partition. With the tuned parameters, we run each algorithm 10 times on the other partition and report the mean and standard deviation of each clustering performance metric. In all the experiments, we train a DSSC consisting of three layers, with 300, 200, and 150 neurons respectively. Moreover, we set and the convergence threshold as for DSSC and adopt the early stopping technique (w.r.t. the parameter ) to avoid overfitting by following [56], where $n$ is the data size. Once the network converges, we found experimentally that removing the nonlinear functions can help the subsequent clustering step in the inference phase. Note that we directly use the tuned parameters (sparsity) and (tolerance) of SSC for DSSC. If these two parameters were tuned specifically for DSSC, its performance could be further improved.

Footnote 1: The following parameters are tuned with the cross-validation technique: DSSC (, ), SSC (, ), KSSC (, ), DAE (, ), LRR (), KLRR (), LRSC (), LSR (), and SMR (, ).

Evaluation Criteria: Like [24], we adopt four popular metrics to evaluate the clustering performance of our algorithm, i.e. accuracy (ACC), also called purity, normalized mutual information (NMI), adjusted rand index (ARI), and Fscore. Higher values of these metrics indicate better performance.
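As a concrete example of the first metric, the sketch below computes purity in its standard form: every predicted cluster is credited with its most frequent ground-truth class. This is an illustrative implementation under that assumption, with a hypothetical function name; an accuracy based on one-to-one label matching can give lower values when several clusters share the same majority class.

```python
from collections import Counter

def purity(labels_true, labels_pred):
    """Fraction of samples that belong to the majority class of their cluster."""
    members = {}
    for t, p in zip(labels_true, labels_pred):
        members.setdefault(p, []).append(t)        # group true labels by cluster
    majority = sum(Counter(m).most_common(1)[0][1] for m in members.values())
    return majority / len(labels_true)
```

For example, with true labels `[0, 0, 1, 1]` and predicted clusters `[0, 0, 0, 1]`, three of the four samples fall in the majority class of their cluster, so the purity is 0.75.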

4.2 Comparison with state-of-the-art methods

In this section, we compare DSSC with 12 recently-proposed subspace clustering methods on the COIL20 and the MNIST data sets, where each data set is with four different features.

On COIL20: We first investigate the performance of DSSC using the COIL20 data set. Tables I and II report the results, from which we can see that:

  • DSSC outperforms the other tested methods in terms of nearly all of the used performance metrics; the only exception in Tables I and II is the NMI with the LBP feature, where LRR is slightly higher (84.45 vs. 84.32). Regarding the four types of features, DSSC improves the ACC of the best baseline by 1.86, 2.09, 0.96, and 3.52 percentage points, respectively.

  • SSC usually outperforms DAEs and DAEg, whereas our DSSC method consistently outperforms SSC in all the settings. This shows that it is hard to achieve a desirable performance by simply introducing deep learning into subspace clustering since unsupervised deep learning is an open challenging issue [44].

On MNIST: We also investigate the performance of DSSC by using the MNIST data set.

Tables VIIV show the results, from which we observe that the ACC of DSSC with the DSIFT feature is 72.65%, which improves on SSC by 10.20% and on the best baseline algorithm by 3.50%. With respect to the other three features, the improvement of DSSC over all the baseline approaches is also significant: 1.82%, 1.02%, and 1.71% in terms of ARI. It should be pointed out that all the tested methods perform very stably on this data set, with standard deviations on the four performance metrics close to 0.

| Methods | DSIFT ACC | DSIFT NMI | DSIFT ARI | DSIFT Fscore | DSIFT Para. | HOG ACC | HOG NMI | HOG ARI | HOG Fscore | HOG Para. |
|---|---|---|---|---|---|---|---|---|---|---|
| DSSC (M=2) | 85.38±1.08 | 95.17±0.17 | 82.15±0.63 | 82.35±0.62 | , 50 | 85.05±1.53 | 94.36±0.43 | 78.98±1.58 | 79.21±1.56 | , 30 |
| DSSC (M=1) | 83.81±1.72 | 94.57±0.45 | 81.23±1.94 | 81.42±1.92 | , 30 | 81.90±0.96 | 91.93±0.35 | 71.87±1.97 | 72.17±1.95 | , 20 |
| SSC | 74.83±1.27 | 89.91±0.38 | 66.43±1.44 | 66.81±1.42 | , | 81.65±1.18 | 92.48±0.41 | 74.23±1.76 | 74.52±1.74 | 0.5, |
| KSSC1 | 70.27±1.66 | 87.29±0.53 | 58.61±1.78 | 59.08±1.76 | 1, 18 | 83.12±0.90 | 93.07±0.34 | 75.68±1.37 | 75.94±1.36 | , 20 |
| KSSC2 | 78.28±1.78 | 91.55±0.39 | 71.13±1.44 | 71.44±1.43 | 1, 18 | 83.22±1.34 | 92.71±0.32 | 74.56±1.06 | 74.84±1.05 | , 20 |
| DAEg | 74.37±1.20 | 89.53±0.43 | 65.42±1.56 | 65.81±1.54 | , | 74.67±1.25 | 89.07±0.49 | 63.77±1.52 | 64.17±1.50 | 0.5, |
| DAEs | 72.65±0.91 | 88.54±0.52 | 62.23±1.81 | 62.67±1.78 | , | 73.32±1.31 | 88.17±0.43 | 61.12±1.42 | 61.56±1.40 | 0.5, |
| LRR | 82.67±1.00 | 93.48±0.33 | 77.33±1.37 | 77.60±1.35 | 0.1 | 83.00±1.36 | 93.27±0.46 | 77.34±3.21 | 77.61±3.16 | 0.01 |
| KLRR1 | 79.92±1.52 | 91.56±0.51 | 71.08±2.13 | 71.42±2.10 | 300 | 83.92±1.26 | 93.00±0.45 | 77.49±1.49 | 77.73±1.47 | 100 |
| KLRR2 | 23.08±0.36 | 52.01±0.62 | 5.31±0.24 | 6.73±0.24 | 100 | 76.07±1.69 | 88.78±0.74 | 63.93±2.67 | 64.34±2.63 | 5 |
| LRSC | 83.55±1.20 | 92.84±0.37 | 78.33±1.39 | 78.57±1.38 | 0.06 | 83.42±1.43 | 92.67±0.48 | 73.86±1.73 | 74.15±1.71 | 0.02 |
| LSR1 | 82.43±1.31 | 92.69±0.49 | 74.94±1.87 | 75.22±1.85 | 0.3 | 83.32±1.70 | 92.45±0.49 | 73.11±2.24 | 73.40±2.21 | 0.8 |
| LSR2 | 82.45±1.58 | 92.64±0.42 | 74.49±1.80 | 74.77±1.78 | 0.7 | 83.65±1.07 | 92.45±0.45 | 73.24±1.77 | 73.54±1.75 | 1 |
| SMR | 71.07±1.91 | 87.01±0.52 | 60.82±2.22 | 61.26±2.19 | , | 81.38±0.73 | 91.75±0.27 | 72.51±0.85 | 72.81±0.84 | , |
TABLE V: Deep vs. Shallow Models on the AR data set. Results in boldface are significantly better than the others, according to the t-test with a significance level at 0.05.

4.3 Deep Model vs. Shallow Models

In this section, we investigate the influence of the depth of DSSC using the AR data set with DSIFT and HOG features. More specifically, we report the performance of DSSC with two hidden layers (M=2) and one hidden layer (M=1), respectively. In the case of M=1, the number of hidden neurons is also set as 150. Note that KSSC1 and KSSC2 can be regarded as two shallow models of SSC with one nonlinear hidden layer.

Table V shows the clustering results of the methods, as well as the tuned parameters. We observe that our DSSC (M=2) consistently outperforms the shallow models in terms of all of these evaluation metrics. The results also verify our claim and motivation, i.e., our deep model DSSC significantly benefits from deep learning.

| Methods | Accuracy | NMI | ARI | Fscore | Para. |
|---|---|---|---|---|---|
| DSSC (t) | 79.50 | 71.02 | 65.11 | 71.09 | ,90 |
| DSSC (s) | 82.67 | 79.01 | 71.69 | 66.55 | ,60 |
| DSSC (n) | 75.08 | 67.72 | 59.17 | 72.11 | ,10 |
| DSSC (r) | 80.08 | 75.60 | 65.67 | 72.11 | ,10 |
| SSC | 79.50 | 74.83 | 62.37 | 69.15 | 1,0.2 |
| KSSC1 | 74.50 | 71.99 | 61.95 | 68.85 | 0.1,12 |
| KSSC2 | 77.83 | 69.89 | 70.65 | 70.55 | 0.1,12 |
| DAEg | 55.50 | 38.16 | 30.69 | 43.15 | - |
| DAEs | 21.67 | 6.07 | 0.85 | 28.65 | - |
| LRR | 78.17 | 74.89 | 70.57 | 70.58 | 0.01 |
| KLRR1 | 75.33 | 66.60 | 56.83 | 64.07 | 3 |
| KLRR2 | 75.00 | 69.32 | 68.35 | 74.16 | 3 |
| LRSC | 69.17 | 60.60 | 53.28 | 61.71 | 0.01 |
| LSR1 | 67.50 | 57.53 | 51.36 | 60.19 | 1.00 |
| LSR2 | 77.00 | 59.91 | 56.27 | 63.60 | 0.50 |
| SMR | 76.00 | 74.69 | 58.09 | 71.87 | ,1e-02 |

TABLE VI: The influence of different activation functions of DSSC on the BF0502 database. DSSC (t), DSSC (s), DSSC (n), and DSSC (r) denote DSSC with the tanh, sigmoid, nssigmoid, and relu function, respectively.

4.4 Influence of Different Activation Functions

In this section, we investigate the influence of different nonlinear activation functions in DSSC. The investigated functions are tanh, sigmoid, non-saturating sigmoid (nssigmoid), and the rectified linear unit (relu) [74]. We carry out experiments on the BF0502 data set, which contains facial images detected from the TV series “Buffy the Vampire Slayer”. Following [36], a subset of BF0502 is used, which includes 17,337 faces in 229 tracks from 6 main cast members. Each facial image is represented as a 1,937-dimensional vector extracted from 13 facial landmark points (e.g., the left and right corners of each eye). In our experiments, we use the first 200 samples from each category, resulting in 1,200 images in total.
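To make the compared nonlinearities concrete, the following minimal sketch defines tanh, sigmoid, and relu and applies them in a single fully connected layer. It is an illustration only: the names `ACTIVATIONS` and `dense_layer` are hypothetical, the weights are made up, and nssigmoid is left out because its definition is not given in this section.

```python
import math

def tanh(x):
    return math.tanh(x)

def sigmoid(x):
    # Numerically stable logistic function: avoids overflow for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def relu(x):
    return max(0.0, x)

# nssigmoid is intentionally omitted; its definition is not stated here.
ACTIVATIONS = {"tanh": tanh, "sigmoid": sigmoid, "relu": relu}

def dense_layer(inputs, weights, biases, activation):
    """Compute activation(W x + b) for one fully connected layer, element-wise."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

# Toy layer with 2 inputs and 1 output unit (illustrative values only).
x = [1.0, 2.0]
W = [[1.0, -1.0]]
b = [0.5]
for name, fn in ACTIVATIONS.items():
    print(name, dense_layer(x, W, b, fn))
```

Swapping the activation only changes the element-wise function applied after the affine map, so the same layer code can be reused for each variant compared in Table VI.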

From Table VI, we can observe that DSSC with different activation functions outperforms SSC by a considerable performance margin. With the sigmoid function, DSSC is about 3.17%, 4.18%, 9.32%, and 2.96% higher than SSC in terms of Accuracy, NMI, ARI, and Fscore, respectively. It is worth noting that although tanh is not the best activation function, it is more stable than the other three activation functions in our experiments. Thus, we use the tanh function as the activation function for comparisons as shown in the above sections.

4.5 Convergence Analysis and Time Cost

Fig. 2: Convergence curve and time cost of DSSC. The left y-axis indicates the loss at each epoch and the right one is the total time cost taken by our method.

In this section, we examine the convergence speed and time cost of our DSSC on the BF0502 data set. From Figure 2, we can see that the loss of DSSC remains nearly unchanged after 90–100 epochs. For each epoch, DSSC takes about 2.2 seconds to obtain results on a MacBook with a 2.6GHz Intel Core i5 CPU and 8GB memory. Like other deep learning based methods, the computational cost of DSSC can be remarkably reduced by using a GPU.
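A convergence check of the kind implied above (stop once the loss no longer changes between epochs) can be sketched as follows. This is an assumption-laden illustration, not the training code used in the experiments: `converged_epoch`, `estimated_training_seconds`, the tolerance, the patience, and the synthetic loss curve are all hypothetical, and only the 2.2 seconds per epoch figure comes from the text.

```python
def converged_epoch(losses, tol, patience=1):
    """Return the first epoch index at which the absolute loss change has stayed
    below tol for `patience` consecutive epochs, or None if that never happens."""
    calm = 0
    for epoch in range(1, len(losses)):
        if abs(losses[epoch] - losses[epoch - 1]) < tol:
            calm += 1
            if calm >= patience:
                return epoch
        else:
            calm = 0
    return None

def estimated_training_seconds(n_epochs, seconds_per_epoch=2.2):
    """Total time estimate, assuming a constant per-epoch cost."""
    return n_epochs * seconds_per_epoch

# Synthetic, geometrically decaying loss curve (illustrative only).
losses = [0.5 ** k for k in range(20)]
stop = converged_epoch(losses, tol=1e-3, patience=2)
print("converged at epoch", stop)
print("estimated time for 100 epochs:", estimated_training_seconds(100), "seconds")
```

Under the constant-cost assumption, 100 epochs at about 2.2 seconds each would take roughly 220 seconds on the CPU setup described above.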

5 Conclusion

In this paper, we proposed a new deep learning based framework for simultaneous data representation learning and subspace clustering. Based on this deep subspace clustering framework, we further devised a new method, i.e., DSSC. Experimental results on facial, object, and handwritten digit image data sets show the efficacy of DSSC in terms of four performance evaluation metrics. In the future, we plan to investigate the performance of the framework when adopting other loss/regularization functions, and to extend it to other applications such as weakly-supervised learning.

References

  • [1] R. Vidal, Y. Ma, and S. Sastry, “Generalized principal component analysis (GPCA),” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 12, pp. 1945–1959, 2005.
  • [2] J. P. Costeira and T. Kanade, “A multibody factorization method for independently moving objects,” Int. J. Comput. Vis., vol. 29, no. 3, pp. 159–179, 1998.
  • [3] P. S. Bradley and O. L. Mangasarian, “k-plane clustering,” J. Global Optim., vol. 16, no. 1, pp. 23–32, 2000.
  • [4] L. Lu and R. Vidal, “Combined central and subspace clustering for computer vision applications,” in Proc. of 23rd Int. Conf. Machine Learn., Pittsburgh, USA, Jun. 2006, pp. 593–600.
  • [5] Y. Ma, H. Derksen, W. Hong, and J. Wright, “Segmentation of multivariate mixed data via lossy data coding and compression,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 9, pp. 1546–1562, 2007.
  • [6] S. Rao, R. Tron, R. Vidal, and Y. Ma, “Motion segmentation via robust subspace separation in the presence of outlying, incomplete, or corrupted trajectories,” in Proc. of 21st IEEE Conf. Comput. Vis. and Pattern Recognit., Anchorage, AK, Jun. 2008, pp. 1–8.
  • [7] A. Y. Ng, M. I. Jordan, and Y. Weiss, “On spectral clustering: Analysis and an algorithm,” in Proc. of 14th Adv. in Neural Inf. Process. Syst., Vancouver, Canada, Dec. 2001, pp. 849–856.
  • [8] J. B. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 888–905, 2000.
  • [9] A. Y. Yang, J. Wright, Y. Ma, and S. S. Sastry, “Unsupervised segmentation of natural images via lossy data compression,” Comput. Vis. Image Underst., vol. 110, no. 2, pp. 212–225, May 2008.
  • [10] G. L. Chen and G. Lerman, “Spectral curvature clustering (scc),” Int. J. of Comput. Vision, vol. 81, no. 3, pp. 317–330, 2009.
  • [11] F. Nie, Z. Zeng, I. W. Tsang, D. Xu, and C. Zhang, “Spectral embedded clustering: A framework for in-sample and out-of-sample spectral clustering,” IEEE Trans. Neural. Netw., vol. 22, no. 11, pp. 1796–1808, 2011.
  • [12] E. Elhamifar and R. Vidal, “Sparse subspace clustering,” in Proc. of 22nd IEEE Conf. Comput. Vis. and Pattern Recognit., Miami, FL, Jun. 2009, pp. 2790–2797.
  • [13] ——, “Sparse manifold clustering and embedding,” in Proc. of 24th Adv. in Neural Inf. Process. Syst., Granada, Spain, Dec. 2011, pp. 55–63.
  • [14] ——, “Sparse subspace clustering: Algorithm, theory, and applications,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 11, pp. 2765–2781, 2013.
  • [15] J. Feng, Z. Lin, H. Xu, and S. Yan, “Robust subspace segmentation with block-diagonal prior,” in Proc. of 27th IEEE Conf. Comput. Vis. and Pattern Recognit., Columbus, OH, Jun. 2014, pp. 3818–3825.
  • [16] P. Favaro, R. Vidal, and A. Ravichandran, “A closed form solution to robust subspace estimation and clustering,” in Proc. of 24th IEEE Conf. Comput. Vis. and Pattern Recognit., Colorado Springs, CO, Jun. 2011, pp. 1801–1807.
  • [17] H. Hu, Z. Lin, J. Feng, and J. Zhou, “Smooth representation clustering,” in Proc. of 27th IEEE Conf. Comput. Vis. and Pattern Recognit., Columbus, OH, Jun. 2014, pp. 3834–3841.
  • [18] P. Ji, M. Salzmann, and H. Li, “Shape interaction matrix revisited and robustified: Efficient subspace clustering with corrupted and incomplete data,” in Proc. of 15th IEEE Conf. Comput. Vis., Santiago, Chile, Dec. 2015.
  • [19] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma, “Robust recovery of subspace structures by low-rank representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1, pp. 171–184, 2013.
  • [20] G. Liu, Z. Lin, and Y. Yu, “Robust subspace segmentation by low-rank representation,” in Proc. of 27th Int. Conf. Mach. Learn., Haifa, Israel, Jun. 2010, pp. 663–670.
  • [21] C. Lu, H. Min, Z. Zhao, L. Zhu, D. Huang, and S. Yan, “Robust and efficient subspace segmentation via least squares regression,” in Proc. of 12th Eur. Conf. Comput. Vis., Florence, Italy, Oct. 2012, pp. 347–360.
  • [22] R. Vidal and P. Favaro, “Low rank subspace clustering (LRSC),” Pattern Recognit. Lett., vol. 43, pp. 47 – 61, 2014.
  • [23] M. Soltanolkotabi, E. Elhamifar, and E. J. Candes, “Robust subspace clustering,” Ann. Stat., vol. 42, no. 2, pp. 669–699, 2014.
  • [24] C. Zhang, H. Fu, S. Liu, G. Liu, and X. Cao, “Low-rank tensor constrained multiview subspace clustering,” in Proc. of 15th IEEE Conf. Comput. Vis., Santiago, Chile, Dec. 2015, pp. 1582–1590.
  • [25] R. He, Y. Zhang, Z. Sun, and Q. Yin, “Robust subspace clustering with complex noise,” IEEE Trans. on Image Process., vol. 24, no. 11, pp. 4001–4013, Nov. 2015.
  • [26] L. Zhang, W. Zuo, and D. Zhang, “LSDT: Latent sparse domain transfer learning for visual adaptation,” IEEE Trans. Image Process., vol. 25, no. 3, pp. 1177–1191, 2016.
  • [27] C.-M. Lee and L.-F. Cheong, “Minimal basis subspace representation: A unified framework for rigid and non-rigid motion segmentation,” Int. J. of Comput. Vision, pp. 1–25, 2016.
  • [28] R. Liu, Z. Lin, F. D. la Torre, and Z. Su, “Fixed-rank representation for unsupervised visual learning,” in Proc. of 25th IEEE Conf. Comput. Vis. and Pattern Recognit., Providence, RI, Jun. 2012, pp. 598–605.
  • [29] B. Cheng, J. Yang, S. Yan, Y. Fu, and T. Huang, “Learning with l1-graph for image analysis,” IEEE Trans. on Image Process., vol. 19, no. 4, pp. 858–866, 2010.
  • [30] C. You, D. P. Robinson, and R. Vidal, “Scalable sparse subspace clustering by orthogonal matching pursuit,” in Proc. of 29th IEEE Conf. Comput. Vis. and Pattern Recognit., Las Vegas, NV, Jun. 2016, pp. 3918–3927.
  • [31] Y. Yang, J. Feng, N. Jojic, J. Yang, and T. S. Huang, “L0-sparse subspace clustering,” in Proc. of 14th Euro. Conf. Comput. Vis., Amsterdam, Netherlands, Oct. 2016, pp. 731–747.
  • [32] X. Peng, Z. Yi, and H. Tang, “Robust subspace clustering via thresholding ridge regression,” in Proc. of 29th AAAI Conference on Artificial Intelligence, Austin Texas, USA, Jan. 2015, pp. 3827–3833.
  • [33] X. Peng, Z. Yu, Z. Yi, and H. Tang, “Constructing the l2-graph for robust subspace learning and subspace clustering,” IEEE Trans. Cybern., vol. 47, no. 4, pp. 1053–1066, Apr. 2017.
  • [34] V. Patel, H. V. Nguyen, and R. Vidal, “Latent space sparse subspace clustering,” in Proc. of 14th IEEE Conf. Comput. Vis., Sydney, VIC, Dec. 2013, pp. 225–232.
  • [35] V. Patel and R. Vidal, “Kernel sparse subspace clustering,” in Proc. of IEEE Int. Conf. on Image Process., Paris, Oct. 2014, pp. 2849–2853.
  • [36] S. Xiao, M. Tan, D. Xu, and Z. Dong, “Robust kernel low-rank representation,” IEEE Trans. Neural. Netw. Learn. Syst., vol. PP, no. 99, pp. 1–1, 2015.
  • [37] M. Yin, Y. Guo, J. Gao, Z. He, and S. Xie, “Kernel sparse subspace clustering on symmetric positive definite manifolds,” in Proc. of 29th IEEE Conf. Comput. Vis. and Pattern Recognit., Jun. 2016, pp. 5157–5164.
  • [38] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” Signal Processing Magazine, IEEE, vol. 29, no. 6, pp. 82–97, 2012.
  • [39] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proc. of 25th Adv. in Neural Inf. Process. Syst., Lake Tahoe, NV, Dec. 2012, pp. 1097–1105.
  • [40] B. Poling and G. Lerman, “A new approach to two-view motion segmentation using global dimension minimization,” Int. J. of Comput. Vision, vol. 108, no. 3, pp. 165–185, 2014.
  • [41] C. Ding, X. He, H. Zha, M. Gu, and H. Simon, “A min-max cut algorithm for graph partitioning and data clustering,” in Proc. of 1st IEEE Int. Conf. on Data Mining, San Jose, CA, Nov. 2001, pp. 107–114.
  • [42] X. Peng, L. Zhang, and Z. Yi, “Scalable sparse subspace clustering,” in Proc. of 26th IEEE Conf. Comput. Vis. and Pattern Recognit., Portland, OR, Jun. 2013, pp. 430–437.
  • [43] X. Peng, H. Tang, L. Zhang, Z. Yi, and S. Xiao, “A unified framework for representation-based subspace clustering of out-of-sample and large-scale data,” IEEE Transactions on Neural Networks and Learning Systems, vol. PP, no. 99, pp. 1–14, 2015.
  • [44] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1798–1828, Aug. 2013.
  • [45] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, “Unsupervised learning of hierarchical representations with convolutional deep belief networks,” Commun. ACM, vol. 54, no. 10, pp. 95–103, Oct. 2011.
  • [46] Z. Y. Wang, Q. Ling, and T. S. Huang, “Learning deep l0 encoders,” in Proc. of 30th AAAI Conf. Artif. Intell., Feb. 2016, pp. 2194–2200.
  • [47] P. Huang, Y. Huang, W. Wang, and L. Wang, “Deep embedding network for clustering,” in Proc. of 22nd Int. Conf. Pattern Recognit., Stockholm, Sweden, Aug. 2014, pp. 1532–1537.
  • [48] J. Xie, R. Girshick, and A. Farhadi, “Unsupervised deep embedding for clustering analysis,” in Proc. of 33rd Int. Conf. Mach. Learn., New York, Jun. 2016.
  • [49] X. Peng, S. Xiao, J. Feng, W. Yau, and Z. Yi, “Deep subspace clustering with sparsity prior,” in Proc. of 25th Int. Joint Conf. Artif. Intell., New York, NY, USA, Jul. 2016, pp. 1925–1931.
  • [50] Z. Wang, S. Chang, J. Zhou, M. Wang, and T. S. Huang, “Learning a task-specific deep architecture for clustering,” in Proc. of SIAM Int. Conf. on Data Mining, Miami, Florida, May 2015, pp. 369–377.
  • [51] J. Yang, D. Parikh, and D. Batra, “Joint unsupervised learning of deep representations and image clusters,” in Proc. of 29th IEEE Conf. Comput. Vis. and Pattern Recognit., 2016.
  • [52] Y. Chen, L. Zhang, and Z. Yi, “Subspace clustering using a low-rank constrained autoencoder,” Information Sciences, 2017.
  • [53] P. Ji, T. Zhang, H. Li, M. Salzmann, and I. Reid, “Deep subspace clustering networks,” arXiv preprint arXiv:1709.02508, 2017.
  • [54] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Found. Trends Mach. Learn., vol. 3, no. 1, pp. 1–122, Jan. 2011.
  • [55] Z. Lin, R. Liu, and Z. Su, “Linearized alternating direction method with adaptive penalty for low-rank representation,” in Proc. of 24th Adv. in Neural Inf. Process. Syst., Granada, Spain, Dec. 2011, pp. 612–620.
  • [56] G. Montavon, G. B. Orr, and K.-R. Müller, Eds., Neural Networks: Tricks of the Trade, Reloaded, 2nd ed., ser. Lecture Notes in Computer Science (LNCS).   Springer, 2012, vol. 7700.
  • [57] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour,” ArXiv e-prints, Jun. 2017.
  • [58] A. Yang, A. Ganesh, S. Sastry, and Y. Ma, “Fast l1-minimization algorithms and an application in robust face recognition: A review,” EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2010-13, Feb. 2010.
  • [59] J. Hu, J. Lu, and Y.-P. Tan, “Discriminative deep metric learning for face verification in the wild,” in Proc. of 27th IEEE Conf. Comput. Vis. and Pattern Recognit., Columbus, OH, Jun. 2014, pp. 1875–1882.
  • [60] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in Proc. of 28th IEEE Conf. Comput. Vis. and Pattern Recognit., Boston, MA, Jun. 2015, pp. 815–823.
  • [61] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu, “Learning fine-grained image similarity with deep ranking,” in Proc. of 27th IEEE Conf. Comput. Vis. and Pattern Recognit., Washington, DC, USA, 2014, pp. 1386–1393.
  • [62] D. Yi, Z. Lei, S. Liao, and S. Z. Li, “Deep metric learning for person re-identification,” in Proc. of 22nd Int Conf. Pattern Recognit., Stockholm, Sweden, Aug. 2014, pp. 34–39.
  • [63] S. A. Nene, S. K. Nayar, H. Murase et al., “Columbia object image library (coil-20),” Technical Report CUCS-005-96, Tech. Rep., 1996.
  • [64] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. of IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
  • [65] A. Martinez, “The AR face database,” CVC Technical Report, vol. 24, 1998.
  • [66] J. Sivic, M. Everingham, and A. Zisserman, “Who are you? - learning person specific classifiers from video,” in Proc. of 22nd IEEE Conf. Comput. Vis. and Pattern Recognit., Miami, FL, Jun. 2009, pp. 1145–1152.
  • [67] L. Zhang, M. Yang, and X. Feng, “Sparse representation or collaborative representation: Which helps face recognition?” in Proc. of 13th IEEE Int. Conf. on Comput. Vis., Barcelona, Spain, Nov. 2011, pp. 471–478.
  • [68] D. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.
  • [69] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. of 18th IEEE Conf. Comput. Vis. and Pattern Recognit., vol. 1, San Diego, CA, Jun. 2005, pp. 886–893.
  • [70] T. Ahonen, A. Hadid, and M. Pietikainen, “Face description with local binary patterns: Application to face recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 12, pp. 2037–2041, Dec. 2006.
  • [71] V. Ojansivu and J. Heikkilä, “Blur insensitive texture classification using local phase quantization,” in Image and signal process.   Springer, 2008, pp. 236–243.
  • [72] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
  • [73] M. R. Osborne, B. Presnell, and B. A. Turlach, “A new approach to variable selection in least squares problems,” IMA J. Numer. Anal., vol. 20, no. 3, pp. 389–403, 2000.
  • [74] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proc. of 27th Int. Conf. Mach. Learn., Haifa, Israel, Jun. 2010, pp. 807–814.