Recent years have witnessed an increase in the use of deep learning in various research domains, such as audio recognition, image and video analysis, and natural language processing. Deep learning methods can be divided into two groups: supervised learning and unsupervised learning. Supervised learning aims to learn classification functions from known training samples and their labels for pattern recognition, while unsupervised learning aims to learn useful representations from unlabeled data. Both groups of methods have achieved great success, but we are particularly interested in unsupervised learning methods because their mechanisms are closer to the learning mechanism of the human brain and simpler than those of supervised methods.
A commonly used group of unsupervised deep-learning methods is Auto-encoders (AEs). An AE learns a single transformation matrix for embedding all the data, which means it does not discriminate between data samples and treats every sample in the same way. This is consistent with some linear learning methods, such as PCA and ICA, which usually assume that the training data obey a single Gaussian distribution. However, this assumption is not quite accurate even for data of the same kind. In fact, the data used in various real-world applications often follow multiple Gaussian distributions. When we use multiple Gaussian models to depict the data distribution, each Gaussian model reflects some local characteristic, or the locality, of the data set. Therefore it could be important for AEs to preserve the local characteristics during feature learning. This concern has been supported by some previous works in which an AE was equipped with a regularization term, forcing it to be sensitive only to the data variations along the data manifold where the locality was preserved. However, the regularization term was originally designed for improving robustness rather than preserving locality.
A successful group of approaches to locality preservation is manifold learning algorithms [5, 6, 7, 8]. They exploit a structure called a neighbourhood graph to learn the interrelations between data samples and transfer these interrelations to a low-dimensional space. Some algorithms assume that if data samples are close to one another in the high-dimensional space, they should also be close in the low-dimensional space [6, 8]. Other algorithms assume that an unknown low-dimensional data sample can be reconstructed by its neighbours in the same way as its high-dimensional counterpart is reconstructed in the original space. These objectives differ largely from that of AEs. In comparison, Local Tangent Space Alignment (LTSA) computes a linear transformation for each neighbourhood to align the local tangent-space coordinates of each neighbourhood with the low-dimensional representations in a global coordinate system. Since the local tangent-space coordinates can exactly reconstruct each neighbourhood, LTSA shares more similarity with AEs. The only difference between them is that each neighbourhood in LTSA has a distinct reconstruction function, whereas all the data in AEs share the same reconstruction function.
Inspired by LTSA, we propose an unsupervised deep-learning framework for dimension reduction, in which the low-dimensional feature representations are obtained by aligning local features of a series of data subsets that capture the locality of the original data set. Specifically, we construct a neighbourhood for each data sample using the current sample and its neighbouring samples. Next, we stack several regularized AEs (Contractive AE or CAE) together to form a deep neural network called Stacked CAE (SCAE) for mining local features from the neighbourhood. We derive the final low-dimensional feature representations by imposing a local affine transformation on the features of each neighbourhood to transfer the features from each local coordinate system to a global coordinate system. The local features learned by each SCAE reflect the deep-level characteristics of the neighbourhood, so the proposed method is named Local Deep-Feature Alignment (LDFA). We also derive an explicit mapping from the LDFA framework to map a new data sample to the learned low-dimensional subspace. It is worthwhile to highlight several aspects of the proposed method:
The locality characteristics contained in the neighbourhood can be effectively preserved by the local SCAEs.
The regularization term of each SCAE facilitates estimating the parameters from a neighbourhood that usually does not contain much data.
The number of “variations” of the local embedding function is small due to the data similarity within each neighbourhood, which reduces the difficulty of robust feature learning.
The local features are learned from a small amount of data in a neighbourhood, so the proposed method can work well when the data amount is not large.
The rest of this paper is organized as follows. In Section II, we review the related works to give the readers more insight into deep learning and locality-preserving learning. In Section III, we introduce the proposed method in detail. In Section IV, we show a series of experimental results on different applications. Section V presents the paper’s conclusions.
II Related Works
II-A Deep Learning
from data. The current deep-learning algorithms (DLAs) include supervised methods and unsupervised methods. The most representative supervised methods are Convolutional Neural Networks (CNNs), which are constructed by stacking three kinds of layers together, i.e., the Convolutional Layer (CL), Pooling Layer (PL), and Fully Connected Layer (FCL). The CL and PL differ from layers of regular neural networks in that the neurons in each of these two layers are only connected to a small region of the previous layer, instead of to all the neurons in a fully connected manner. This greatly reduces the number of parameters in the network and makes CNNs particularly suitable for dealing with images. However, CNNs require every data sample to have a label indicating its class, so they are not applicable to unsupervised feature extraction.
The most representative unsupervised deep-learning methods are AEs. An AE aims to learn the low-dimensional feature representations that can best reconstruct the original data. Some researchers have enhanced AEs to increase their robustness to noise. AEs are often used to construct deep neural network structures for feature extraction.
However, traditional AEs train a single transformation matrix for embedding all the data into low-dimensional space. Thus traditional AEs capture only the global characteristics of the data and do not consider the local characteristics. This might be inappropriate because locality has proved to be a very useful characteristic in pattern recognition [17, 18].
II-B Locality-Preserving Learning
Manifold learning methods are well known for their capability of preserving the local characteristics of the data set during dimension reduction [5, 6, 7, 8, 19]. The local characteristics of the original data set are contained in a structure called the neighbourhood graph, where each node representing a data sample is connected to its nearest neighbouring nodes by arcs. The neighbourhood graph is then fed to some local estimators that are capable of transferring the locality to the learned low-dimensional feature representations. Different manifold algorithms have different locality-preservation strategies.
Locally Linear Embedding (LLE) reconstructs each data sample by linearly combining its neighbouring data samples, and assumes the low-dimensional feature representation of the data sample can be reconstructed by its neighbours using the same combination weights. Laplacian Eigenmap (LE) assumes that if two data samples are close to each other in the original data set, their low-dimensional counterparts should also be close to each other. Locality Preserving Projections (LPP) shares the same objective with LE, but is realized in a linear way. Local Tangent Space Alignment (LTSA) transfers the local characteristics of the data set to a low-dimensional feature space using a series of local Principal Component Analysis (PCA) applications, and then obtains the low-dimensional feature representations by aligning the local features learned by these local PCAs. In LTSA, the local estimators realized by these local PCAs are explicit and can be easily used for projecting data into a low-dimensional space. This is important in dimension reduction algorithms. The locality characteristic of data can also be preserved using the joint/conditional probability distribution of data pairs based on their neighbouring structures. A typical relevant method is t-distributed stochastic neighbor embedding (t-SNE), which is particularly effective in data visualization.
Some manifold-based methods encode discriminative information in neighbourhood construction. The study in  simultaneously extracts a pair of manifolds based on the data similarity among neighbourhoods such that the two manifolds complement each other to enhance the discriminative power of the features. In , the neighbourhood of each data is defined by the identity and pose information of a subject, so that the learned manifolds can be applied to person-independent human pose estimation.
However, Bengio confirmed that each of these manifold algorithms could be reformed as a single-layer nonlinear neural network  that fails to discover deep-level features from the original data. In addition, it does not make much sense to stack manifold learning algorithms directly as a layered structure to learn deep features because the objective functions of manifold learning methods are deliberately designed for one-layer learning. To achieve deep-level feature learning based on manifold methods, we might want to combine the locality-preservation capability of manifold methods with deep-learning methods.
II-C Combination of Deep Learning and Locality-Preserving Learning
There is still no conclusion about how to encode the locality learning into the deep-learning process. Rifai [4, 23] showed that a certain kind of smoothness regularization might be useful in preserving the data locality. In [4, 23], the smoothness regularization forces AEs to be sensitive only to the data variations along the data manifold where locality is preserved. However, the smoothness regularization was originally designed for improving robustness more than preserving locality, so it only indirectly models the data locality.
Some works have been proposed to enhance deep learning with some straightforward locality-preserving constraints. A Deep Adaptive Exemplar Auto-Encoder was proposed to extract deep discriminant features by knowledge transfer. In this method, a low-rank coding regularizer transfers the knowledge of the source domain to a subspace shared with the target domain, while keeping the source and target domains well aligned through the use of locality-aware reconstruction. This method is particularly useful in domain adaptation, where the data in the source domain should be labeled. Another approach trained CNNs for each class of data and combined the learning results of these CNNs at the top layers using the maximal manifold-margin criterion. However, this criterion preserves discriminative information between classes by directly evaluating the Euclidean distances between the deep features learned by these CNNs. This is not quite accurate because the features learned from different CNNs actually lie in different coordinate systems. In addition, all the parameters of these CNNs must be solved simultaneously in the feature-learning process, which makes the optimization problem highly nonlinear and hard to solve. More importantly, the method still cannot learn features without data labels.
III Local Deep-Feature Alignment Framework
III-A Contractive Auto-Encoder (CAE)
The original AEs are designed for learning a feature representation from the input data sample that can be used to reconstruct the input data sample as accurately as possible . Given a set of training data , the encoding (embedding) and decoding (reconstruction) process can be described as:
where is the feature representation, describes the weight matrix of the encoding, and are the bias terms, is the reconstructed data sample and
Next, the parameters of an AE can be obtained by solving the following optimization problem:
The functionality of an AE is depicted in Fig. 1 (a).
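As a minimal sketch of the encoding and decoding steps described above, the following NumPy snippet implements a one-layer auto-encoder with a sigmoid activation and evaluates its squared reconstruction error. The names `W`, `b` and `c`, the tied decoder weights (`W.T`) and the random initialization are assumptions made for illustration; they are not taken from the original formulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(X, W, b):
    """Map samples (columns of X, shape D x N) to features (shape d x N)."""
    return sigmoid(W @ X + b[:, None])

def decode(H, W, c):
    """Reconstruct samples from features; the tied weights W.T are an assumption."""
    return sigmoid(W.T @ H + c[:, None])

def reconstruction_error(X, W, b, c):
    """Sum of squared differences between the inputs and their reconstructions."""
    X_hat = decode(encode(X, W, b), W, c)
    return float(np.sum((X - X_hat) ** 2))

rng = np.random.default_rng(0)
X = rng.random((8, 20))                  # 20 toy samples of dimension 8
W = 0.1 * rng.standard_normal((3, 8))    # 3-dimensional feature space
b, c = np.zeros(3), np.zeros(8)
H = encode(X, W, b)
err = reconstruction_error(X, W, b, c)
```

Training would minimize `err` over `W`, `b` and `c`, for example by gradient descent; that step is omitted here.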
However, the learning process of an AE will be not robust enough in some cases, for example, when the number of data samples is much smaller than the dimension. To improve the robustness, some researchers propose to use smoothness prior to regularize the AE and thus derive the Contractive Auto-encoder (CAE)  whose objective function in matrix form is
where , is a matrix in which every entry is equal to one, and denotes the dot product. The third term in (4) (the smoothness regularization term) keeps the feature learning process insensitive to data variations while competent for data reconstruction. This helps to extract robust low-dimensional features .
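To illustrate the smoothness regularization, the sketch below assumes the regularizer is the squared Frobenius norm of the encoder Jacobian, as is standard for contractive auto-encoders, and that the encoder uses a sigmoid activation. Under these assumptions the penalty has a closed form, which the snippet compares against a finite-difference estimate. The function and variable names are illustrative only, and the per-sample form shown here is not claimed to be identical to the matrix form of (4).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_penalty(x, W, b):
    """Squared Frobenius norm of the encoder Jacobian dh/dx for one sample x."""
    h = sigmoid(W @ x + b)
    # For a sigmoid encoder, J = diag(h * (1 - h)) @ W, so the norm factorizes.
    return float(np.sum((h * (1.0 - h)) ** 2 * np.sum(W ** 2, axis=1)))

def numerical_penalty(x, W, b, eps=1e-6):
    """Central finite-difference estimate of the same quantity (check only)."""
    J = np.empty((W.shape[0], x.size))
    for k in range(x.size):
        step = np.zeros_like(x)
        step[k] = eps
        J[:, k] = (sigmoid(W @ (x + step) + b) - sigmoid(W @ (x - step) + b)) / (2 * eps)
    return float(np.sum(J ** 2))

rng = np.random.default_rng(1)
W = rng.standard_normal((3, 5))
b = rng.standard_normal(3)
x = rng.random(5)
closed_form = contractive_penalty(x, W, b)
finite_diff = numerical_penalty(x, W, b)
```

A small penalty means the features change little when the input is perturbed, which is the insensitivity to data variations described above.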
CAEs can be used as components to form a deep neural network, where the lower layer’s output serves as the higher and adjacent layer’s input. An example of a two-layer Stacked CAE (SCAE) is shown in Fig. 1 (b). We believe the SCAE will extract more robust features than the one-layer CAE. The SCAE can be described as:
where and represents the input and output of the layer respectively, and the superscript indicates that the parameters correspond to the layer, and we reuse , and to represent the parameters of the SCAE.
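The layer-wise structure of an SCAE can be sketched as a loop in which each encoder's output becomes the next encoder's input. The snippet below only shows the forward (encoding) pass with randomly initialized, untrained parameters; the helper names are hypothetical, and the layer sizes are borrowed from the 784-300-200-150-2 structure reported for MNIST in Section IV-B.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_layers(sizes, seed=0):
    """Create untrained (W, b) pairs for consecutive layer sizes."""
    rng = np.random.default_rng(seed)
    return [(0.01 * rng.standard_normal((n_out, n_in)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def scae_encode(X, layers):
    """Feed the columns of X through every encoder in turn."""
    H = X
    for W, b in layers:
        H = sigmoid(W @ H + b[:, None])   # output of this layer feeds the next one
    return H

sizes = [int(s) for s in "784-300-200-150-2".split("-")]
layers = init_layers(sizes)
X = np.random.default_rng(1).random((784, 10))   # 10 toy inputs
top_features = scae_encode(X, layers)
```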
III-B Objective Function of Local Deep-Feature Alignment
Our basic concern is to extract deep-level features from each local data subset that reflect some local characteristic of the data subset. Then we align these local features to form the global deep features.
To this end, we propose to construct a neighbourhood for each data sample that includes the sample and a number of its closest neighbouring samples. Then we use an SCAE to extract deep-level features from each neighbourhood and impose a local affine transformation on the deep features of each neighbourhood to align the features from each local coordinate system with a global coordinate system. The framework of the method is illustrated in Fig. 2.
The objective function of the method should include two parts, local deep-feature learning and global alignment of local features. For each , we define its neighbourhood as where is its number of neighbours. Hence the error of the SCAEs used for local feature extraction from all the neighbourhoods can be represented as:
where the subscript is the index of the neighbourhood and other symbols have the same meanings as in formula (5).
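The neighbourhood construction described above can be sketched as a nearest-neighbour search. The snippet assumes Euclidean distance and one sample per row; the function name and the toy data are illustrative and not part of the original method description.

```python
import numpy as np

def build_neighbourhoods(X, k):
    """Return an (N, k + 1) index array: each row lists a sample and its k nearest neighbours.

    X has shape (N, D), one sample per row; Euclidean distance is an assumption.
    """
    sq_norms = np.sum(X ** 2, axis=1)
    dist2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(dist2, -np.inf)   # each sample comes first in its own neighbourhood
    return np.argsort(dist2, axis=1)[:, :k + 1]

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [5.0, 5.0], [5.1, 5.0]])
neighbourhoods = build_neighbourhoods(X, k=1)
```

Each row of `neighbourhoods` would then select the data subset on which one local SCAE is trained.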
The top-layer local deep features learned so far are neighbour-wise, and we need to derive the global deep features. Based on the success of LTSA, it is reasonable to assume that there exists an affine transformation between and their global counterparts. Let be the affine transformation matrix, the alignment error of each neighbourhood can be described as
where moves the feature representations in to their geometric centre, is the global deep features corresponding to the neighbourhood. We need to find and such that preserves as much of the locality characteristics contained in as possible. This problem can be solved by minimizing the overall alignment error:
We adopt a two-stage strategy to optimize the problem (9). In the first stage, we learn the local deep features using a series of SCAEs. In the second stage, we align the local features to form the global feature representations.
Stage 1: In training each SCAE, to achieve optimized encoding and decoding, we can separately pre-train each CAE and optimize the deep network with the parameters initialized by the pre-trained parameters of each layer. The problem can be solved using a gradient descent algorithm with back-propagation . The optimization process of an SCAE is shown in Fig. 3.
Stage 2: Suppose are the final feature representations where corresponds to , and let be the 0-1 selection matrix such that . We then need to find to minimize the overall alignment error:
where and with
Let . Then, (11) can be rewritten as , thus we reformulate the problem as:
which is a typical eigenvalue problem and can be easily solved by existing methods.
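The alignment stage can be sketched along the lines of LTSA. The snippet below is an assumption-laden illustration rather than the authors' implementation: it centres each set of local features, builds the alignment matrix (called `B` here) by accumulating one block per neighbourhood, and solves the resulting symmetric eigenvalue problem with `numpy.linalg.eigh`. The constant vector always has zero alignment error, so it is pushed to the top of the spectrum before the `d` smallest eigenvectors are taken. All names and the toy data are hypothetical.

```python
import numpy as np

def align_local_features(local_feats, neighbourhoods, n_samples, d):
    """LTSA-style alignment sketch.

    local_feats[i]: (d, k_i) local features of neighbourhood i.
    neighbourhoods[i]: indices of the k_i samples in neighbourhood i.
    Returns global features Y (d, n_samples) and the d smallest eigenvalues.
    """
    B = np.zeros((n_samples, n_samples))
    for Theta, idx in zip(local_feats, neighbourhoods):
        k = len(idx)
        C = np.eye(k) - np.ones((k, k)) / k        # centring matrix
        Theta_c = Theta @ C
        P = np.linalg.pinv(Theta_c) @ Theta_c      # projector onto the row space of Theta_c
        Wi = C @ (np.eye(k) - P)
        B[np.ix_(idx, idx)] += Wi @ Wi.T           # selection matrices applied via indexing
    # B annihilates the constant vector; give it a large eigenvalue so it is excluded.
    ones = np.ones((n_samples, 1)) / np.sqrt(n_samples)
    eigvals, eigvecs = np.linalg.eigh(B + (np.trace(B) + 1.0) * (ones @ ones.T))
    return eigvecs[:, :d].T, eigvals[:d]

# Toy check: local features that are exact affine images of a known 2-D layout.
rng = np.random.default_rng(0)
N, d, k = 30, 2, 8
G = rng.standard_normal((d, N))
dist2 = ((G[:, :, None] - G[:, None, :]) ** 2).sum(axis=0)
neighbourhoods = [np.argsort(row)[:k] for row in dist2]
local_feats = [rng.standard_normal((d, d)) @ G[:, idx] + rng.standard_normal((d, 1))
               for idx in neighbourhoods]
Y, smallest = align_local_features(local_feats, neighbourhoods, N, d)
```

In this toy setting the local features are related to the global layout by exact affine maps, so the smallest eigenvalues vanish up to rounding error; with real SCAE features they would be small but positive.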
III-D Embedding a New Data Sample
Our proposed method can be easily extended to embed a new data sample into the learned low-dimensional subspace. We seek to construct an explicit embedding function for each local neighbourhood. Then, given a new data sample, we can find the closest sample to it in the training set, and use the corresponding embedding function to obtain the low-dimensional representation of the new data sample.
To this end, we use a one-layer fully connected feed-forward neural network to model the mapping from the top-layer local features to the global feature representations. This one-layer network still uses the sigmoid activation function and its optimization can be described as:
Fig. 4 illustrates the one-layer fully connected network.
Next, we replace the local affine-transformation matrix between and with the aforementioned fully connected network so that it is stacked on the top of the local SCAE, whose top-layer output are . This is shown in Fig. 5.
Note that a CAE is also realized by a one-layer neural network, which means that the fully connected network defined in (14) shares the same mathematical form with a CAE. For this reason, each SCAE, together with the corresponding fully connected network, forms a new uniform deep neural network that is able to explicitly embed a data sample into the learned low-dimensional subspace to obtain the globally aligned feature representation. The advantage of using the fully connected network for local feature alignment is that we can initialize the uniform deep neural network using the parameters of the learned SCAE and fine-tune the uniform deep network via the gradient-descent algorithm with back-propagation. The construction of a uniform deep neural network is also shown in Fig. 5. It is worthwhile to point out that the back-propagation can be used only when the local feature alignment is achieved by fully connected neural network. Therefore Algorithm 1 does not include back-propagation in local feature alignment.
Suppose each SCAE has layers. We build an -layer deep neural network for each neighbourhood, and then initialize the parameters of the first layers with the trained SCAE and initialize the parameters of the layer with the fully connected network. Specifically, let be the explicit embedding function representing the uniform deep neural network, such that . We initialize its parameters and in the following way.
where the superscript indicates the parameters in the layer of the uniform deep network. Once is obtained, we can locate the nearest neighbour of a new data sample in the training set, and use the corresponding to embed the data sample into a low-dimensional subspace.
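The out-of-sample procedure can be sketched as a nearest-neighbour lookup followed by a call to the embedding function of the matching neighbourhood. In the snippet below, `embed_fns` is a hypothetical stand-in for the trained per-neighbourhood networks, and the toy functions merely scale the first coordinate so that the selection logic can be checked.

```python
import numpy as np

def embed_new_sample(x_new, X_train, embed_fns):
    """Embed x_new with the function attached to its nearest training sample.

    X_train: (N, D) training samples; embed_fns[i]: callable for neighbourhood i.
    Both are hypothetical stand-ins for the trained per-neighbourhood networks.
    """
    nearest = int(np.argmin(np.sum((X_train - x_new) ** 2, axis=1)))
    return nearest, embed_fns[nearest](x_new)

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
# Toy embedding functions: each scales the first coordinate by a different factor.
embed_fns = [lambda x, s=s: s * x[:1] for s in (1.0, 2.0, 3.0)]
nearest, y_new = embed_new_sample(np.array([0.9, 1.2]), X_train, embed_fns)
```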
To realize the embedding of a new data sample , we modify the original LDFA algorithm such that it splits into Algorithm 2 and Algorithm 3, which describe the training and embedding process respectively.
We apply the proposed LDFA method to several representative applications and evaluate its performance. Dimension reduction is commonly used as a preprocessing step for subsequent data-visualization, data-clustering and data-classification tasks. Therefore, we first examine LDFA’s image-visualization capability and the clustering accuracy obtained on images whose dimension has been reduced using LDFA. In addition, we determine the classification accuracy based on the LDFA feature representations. Both qualitative and quantitative experimental results are reported, and a comparison with other existing methods is provided.
The clustering accuracy is defined as the purity, which is computed as the ratio between the number of correctly clustered samples and the total number of samples:
where is the clustered data set with representing the data in the cluster and is the original data set with representing the data in the class.
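The purity measure can be sketched as follows, assuming the usual definition in which a sample counts as correctly clustered when it belongs to the majority class of its cluster. The function name and the toy labels are illustrative.

```python
from collections import Counter

def purity(cluster_ids, class_labels):
    """Fraction of samples that belong to the majority class of their cluster."""
    if len(cluster_ids) != len(class_labels):
        raise ValueError("cluster_ids and class_labels must have the same length")
    members = {}
    for cluster, label in zip(cluster_ids, class_labels):
        members.setdefault(cluster, []).append(label)
    # For each cluster, count the samples of its most frequent class.
    correct = sum(Counter(labels).most_common(1)[0][1] for labels in members.values())
    return correct / len(class_labels)

clusters = [0, 0, 0, 1, 1, 1, 2, 2]
classes = ["a", "a", "b", "b", "b", "c", "c", "c"]
score = purity(clusters, classes)   # 6 of 8 samples match their cluster's majority class
```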
IV-A The Data Sets
The experiments adopt seven benchmark data sets for image visualization, clustering and classification: the MNIST Digits, USPS Digits, Olivetti Faces and UMist Faces data sets (http://www.cs.nyu.edu/~roweis/data.html), the NABirds data set (http://dl.allaboutbirds.org/nabirds), the Stanford Dogs data set (http://vision.stanford.edu/aditya86/ImageNetDogs/), and the Caltech-256 data set (http://www.vision.caltech.edu/ImageDatasets/Caltech256/intro/).
Table I shows the attributes of these data sets and how we use them in the experiments. The attributes of each data set are the number of classes, the total number of samples and the data dimension. For computational efficiency, we do not use all the data for evaluation. Table I indicates how many images (in total and per class) are involved in the experiments, and how many images are chosen for training and testing respectively. In particular, the NABirds data set covers 400 species, but only 100 species are involved in our experiments. All the experiments are repeated 10 times, with randomly selected images each time, and we show the statistical results of these experiments using box plots.
IV-B Data Visualization and Clustering
In data visualization, the original data are embedded into a two- or three-dimensional subspace and the low-dimensional embeddings are rendered to show the spatial relationships between data samples. A good visualization result usually groups data of the same class together and separates data from different classes. In this sense, a good data-visualization result leads to high data-clustering accuracy, and vice versa. The result of data visualization and the accuracy of clustering reflect the discriminative information contained in the data, so they are commonly used for evaluating the feature representations learned by unsupervised dimension reduction algorithms.
Three factors may influence the clustering performance of the LDFA algorithm: the number of layers of the local SCAEs, the dimension of the output feature representations, and the size of the neighbourhood. Hence we want to find the proper number of layers, feature dimension and neighbourhood size for the LDFA algorithm. First, we fix the neighbourhood size to 10, which often generates good results in many locality-preserving dimension reduction algorithms, and run Algorithm 1 dozens of times with different combinations of numbers of layers and feature dimensions on the MNIST Digits, USPS Digits, Olivetti Faces and UMist Faces data sets. We perform K-means clustering using the output feature representations and compute the clustering accuracies. The results of ten repeated experiments are shown in Fig. 6, where we find that four- to five-layer local SCAEs with two-dimensional features are sufficient to obtain good results for the digit data sets, while one-layer local SCAEs with two- to ten-dimensional features are sufficient to obtain good results for the face data sets. Specifically, the appropriate network structures of the local SCAEs applied to the MNIST Digits, USPS Digits, UMist Faces and Olivetti Faces data sets for data visualization can be 784-300-200-150-2, 256-300-250-200-150-2, 10304-2, and 4096-2, respectively. The number of neurons in the network is smaller than that used in ; we believe the reason for this is that LDFA learns features from neighbourhoods, which usually contain only very similar data samples, and thus it does not need a very complex network structure.
Then, using the network depth described above, we change the neighbourhood size in Algorithm 1 from 10 to 100 by intervals of 10, and the clustering accuracies of ten repeated experiments on the MNIST and USPS data sets are shown in Fig. 7. We find that the performance of the LDFA algorithm drops dramatically when the neighbourhood size increases from 10 to 20. Therefore, we believe LDFA learns discriminative local features well with relatively small neighbourhood size.
To further test our hypothesis about the neighbourhood size, we perform another four tests on the Caltech-256 data set, which covers many more classes than the MNIST and USPS data sets. In the first test, we choose 1000 samples from 50 classes with 20 samples per class and extract 59-dimension Local Binary Pattern (LBP) features from these images. Then we compute the clustering accuracies based on the two-dimensional features extracted by the LDFA, with the neighbourhood size varying from 10 to 90. In the second test, 2000 samples are chosen from 100 classes, and the neighbourhood size varies from 10 to 200. In the third test, 3000 samples are chosen from 150 classes, and the neighbourhood size varies from 10 to 250. In the last test, 4000 samples are selected from 200 classes, and the neighbourhood size varies from 10 to 300. The network structure is 59-30-10-2, and the experiments are also repeated 10 times with randomly selected samples each time. The clustering accuracies of these four tests are shown in Fig. 8, where we find that the clustering accuracies do not increase with bigger neighbourhood sizes when more classes are given.
Consequently, we fix the neighbourhood size in the LDFA algorithm to 10 to reduce the dimension of the aforementioned four data sets. The 2-D visualization results of the four data sets are depicted in Fig. 9, Fig. 10, Fig. 11 and Fig. 12. For comparison, we also show the visualization results of three other methods: the PCA, the t-SNE and the Locally Linear Coordination (LLC) (http://lvdmaaten.github.io/drtoolbox/). The t-SNE solves a problem known as the crowding problem during dimension reduction, which could be severe when the embedding dimension is very low. Therefore it is anticipated that the 2-D features learned by the t-SNE can reflect the real data distribution well. Fig. 9 through Fig. 12 show that the LDFA is close to or comparable with the t-SNE in data visualization, but better than the PCA and the LLC. This is owing to the ability of the LDFA to capture not only the global but also the local deep-level characteristics of the data sets.
To quantitatively evaluate the dimension reduction results of Fig. 9 through Fig. 12, we show the clustering accuracies based on those 2-D features in Fig. 13, where we also show the clustering accuracies derived from the LTSA, the basic stacked Auto-encoders (SAE) and the SCAE. Unlike the other methods, the SAE and SCAE use 30-dimensional features for clustering to achieve good performance. In this experiment, we apply the same network structure to the SAE, SCAE and LDFA except for the dimension of the output data. The clustering on each data set is also repeated 10 times. On the MNIST and USPS data sets, the LDFA is better than the PCA, LTSA, LLC, SAE and SCAE owing to its ability to capture local deep features, but inferior to the t-SNE, whose good performance is predictable from the dimension reduction results shown in Fig. 9 through Fig. 12. On the Olivetti and UMist Faces data sets, the LDFA achieves the best performance. It is noteworthy that the SCAE performs worse than the SAE on the MNIST and USPS data sets. We think this can be explained by the tendency of the SCAE’s regularization term to ignore data variations, which are significant even among digits of the same kind. On the Olivetti and UMist Faces data sets, the SCAE performs much better than the SAE because human faces share a very similar structure and differ from one another mainly in texture, and are thus more suitable for processing by the SCAE. Also, we believe the training of the SAE needs a relatively large sample set because the SAE tends to fail to capture the data variations when the sample set is small. This explains why the clustering accuracies of the SAE on the Olivetti and UMist Faces data sets are low in Fig. 13.
Additionally, we perform image clustering on three bigger data sets (NABirds, Stanford Dogs and Caltech-256) using the low-dimensional features produced by the aforementioned dimension reduction algorithms. These three data sets are much more challenging for feature learning because the foreground objects in the images are usually shown in different poses and sizes, and a large number of images are cluttered with natural scenes in the background. Some sample images randomly selected from the NABirds data set are shown in Fig. 14, where each row represents a breed of birds. It is obvious that there exist great differences between the images of the same class. In order to improve the robustness, we extract a 59-dimension LBP feature descriptor from each image to represent it. For the SAE, SCAE and LDFA, we use the same network structure. For the LTSA, LLC and t-SNE, we use the default settings in the existing implementations. We repeat the experiment 10 times. The clustering accuracies derived from these dimension reduction algorithms are depicted in Fig. 15, where the LDFA outperforms the other five algorithms.
IV-C Data Classification
To further evaluate the discriminative information contained in the low-dimensional feature representations, we also perform image classification using the dimension-reduced data. The classification algorithms are carefully tuned so that they can generate their best results. Note that we aim to compare the feature representations learned by different methods, not to achieve the highest possible classification accuracy. Thus, we deliberately do not use some state-of-the-art classification methods.
Designating a very low data dimension in learning might cause the features extracted by the AEs to “collapse” onto the same dimension, and this will undoubtedly influence the classification accuracy. We believe a similar problem may happen to the SAE. Hence, we need to redesign the network structure for data classification. As pointed out by , four layers should be appropriate for an AE to learn good features. So we define 19 different four-layer network structures and implement the SAE using these structures to extract features, which are then fed into a 1-NN classifier. The classification is performed 10 times and the results are shown in Fig. 16, where the SAE with a 784-512-256-128-64 structure generates the best results. In addition, we plot the classification results derived from the LDFA using the 19 structures. It is clear that the LDFA outperforms the SAE with the same network structure. We do not raise the dimension of the feature representations any higher, to avoid introducing noise and instability into the feature learning.
Guided by similar exploration processes, we determine the network structures to be applied on the USPS, Olivetti Faces and UMist Faces data sets as 256-64-30, 4096-64 and 10304-64 respectively. Then we perform dimension reduction again using the LLC, t-SNE, SAE, SCAE and LDFA, and feed the low-dimensional feature representations to the 1-NN, random forest and naive Bayes classifiers. For different dimension reduction methods, we reduce the data to the same dimension. The experiment is repeated 10 times with randomly chosen samples, and the classification accuracies can be found in Fig. 17, where the low-dimensional feature representations learned by the LDFA algorithm achieve the highest classification accuracy in most cases, except for the naive Bayes classification on the USPS Digits and UMist Faces data sets. We believe the good performance of LDFA stems from its ability to learn not only the global characteristics but also the local deep-level characteristics of the data sets.
In addition, we extract the LBP feature descriptors from the NABirds, Stanford Dogs and Caltech-256 data sets and conduct classification using the low-dimensional representations learned from these LBP descriptors by the LTSA, SAE, SCAE, LLC, t-SNE and the LDFA. We use the same network structure that has been adopted in Section IV-B, and this procedure is also repeated 10 times. The classification results of the 1-NN, random forest, naive Bayes, AdaBoost ensemble and LDA classifiers are shown in Fig. 18 through Fig. 20. They indicate that the LDFA produces the best classification results: for each classification algorithm, the features learned by the LDFA yield both the highest classification accuracy and the best mean classification accuracy. The only exception is the random forest classification on the Stanford Dogs data set.
Furthermore, we use Histogram of Oriented Gradients (HOG) feature descriptors of the NABirds, Stanford Dogs and Caltech-256 data sets to evaluate the aforementioned dimension reduction methods. Specifically, we re-scale all the bird images to the same size of 512×512 and set the cell size to 8×8, so the extracted HOG descriptors have 142884 dimensions. Similarly, we resize all the dog and object images to 256×256 while keeping the cell size unchanged, which yields 34596-dimensional HOG descriptors. For computational convenience, all the HOG descriptors are first reduced to 500 dimensions through PCA. Then we use the LTSA, SAE, SCAE, LLC, t-SNE and LDFA to extract 30-dimensional representations from the 500-dimensional HOG descriptors, and conduct the classification based on the extracted representations. The experiment is also performed 10 times with randomly chosen data each time. The classification accuracies of four classification algorithms on the NABirds, Stanford Dogs and Caltech-256 data sets are shown in Fig. 21 through Fig. 23, where the LDFA still produces the best results except for the AdaBoost and LDA classification on the Stanford Dogs data set. Therefore, we believe the LDFA algorithm can learn more discriminative and more robust feature representations.
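The two HOG descriptor lengths quoted above are consistent with a common HOG configuration. The short check below assumes blocks of 2×2 cells, 9 orientation bins per cell and a block stride of one cell; none of these settings is stated in the text, so they should be read as assumptions that happen to reproduce the reported dimensions. The function name `hog_descriptor_length` is hypothetical.

```python
def hog_descriptor_length(image_size, cell_size=8, block_cells=2, bins=9):
    """Length of a HOG descriptor for a square image.

    Assumes square cells, blocks of block_cells x block_cells cells and a
    block stride of one cell, so that neighbouring blocks overlap.
    """
    cells_per_side = image_size // cell_size
    blocks_per_side = cells_per_side - block_cells + 1
    features_per_block = block_cells * block_cells * bins
    return blocks_per_side * blocks_per_side * features_per_block


print(hog_descriptor_length(512))  # 63 * 63 blocks * 36 values = 142884
print(hog_descriptor_length(256))  # 31 * 31 blocks * 36 values = 34596
```

Under these assumptions, a 512×512 image gives 64×64 cells and 63×63 overlapping blocks, and a 256×256 image gives 32×32 cells and 31×31 blocks, which matches the 142884 and 34596 dimensions reported above.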
We have proposed an unsupervised deep-learning method named Local Deep-Feature Alignment (LDFA). We define a neighbourhood for each data sample and learn the local deep features via SCAEs. Then we align the local features with global features by local affine transformations. Additionally, we provide an explicit approach to mapping new data into the learned low-dimensional subspace.
The proposed LDFA method has been used as a pre-processing step for image visualization, image clustering and image classification in our experiments. We found that SCAE could extract discriminative local deep features robustly from a small number of data samples (neighbourhood) with few network layers. These experimental results persuaded us that using SCAE to capture the local characteristics of data sets would improve the performance of the unsupervised deep-learning method.
-  Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
-  K. Pearson, “LIII. On lines and planes of closest fit to systems of points in space,” The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, vol. 2, no. 11, pp. 559–572, 1901.
-  A. Hyvärinen, J. Karhunen, and E. Oja, Independent component analysis. John Wiley & Sons, 2004, vol. 46.
-  G. Alain, Y. Bengio, and S. Rifai, “Regularized auto-encoders estimate local statistics,” Proceedings of CoRR, vol. abs/1211.4246, no. 11, pp. 1–17, 2012.
-  S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
-  M. Belkin and P. Niyogi, “Laplacian eigenmaps and spectral techniques for embedding and clustering,” in Advances in neural information processing systems, 2002, pp. 585–591.
-  Z. Zhang and H. Zha, “Principal manifolds and nonlinear dimensionality reduction via tangent space alignment,” SIAM journal on scientific computing, vol. 26, no. 1, pp. 313–338, 2004.
-  X. He and P. Niyogi, “Locality preserving projections,” in Advances in neural information processing systems, 2004, pp. 153–160.
-  Y. Bengio, “Learning deep architectures for AI,” Foundations and Trends® in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
-  C. Jia, M. Shao, and Y. Fu, “Sparse canonical temporal alignment with deep tensor decomposition for action recognition,” IEEE Transactions on Image Processing, vol. 26, no. 2, pp. 738–750, 2017.
-  G. Li and Y. Yu, “Visual saliency detection based on multiscale deep cnn features,” IEEE Transactions on Image Processing, vol. 25, no. 11, pp. 5012–5024, 2016.
-  D. Liu, Z. Wang, B. Wen, J. Yang, W. Han, and T. S. Huang, “Robust single image super-resolution via deep networks with sparse prior,” IEEE Transactions on Image Processing, vol. 25, no. 7, pp. 3194–3207, 2016.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904–1916, 2015.
-  S. C. AP, S. Lauly, H. Larochelle, M. Khapra, B. Ravindran, V. C. Raykar, and A. Saha, “An autoencoder approach to learning bilingual word representations,” in Advances in Neural Information Processing Systems, 2014, pp. 1853–1861.
-  G. Alain and Y. Bengio, “What regularized auto-encoders learn from the data-generating distribution.” Journal of Machine Learning Research, vol. 15, no. 1, pp. 3563–3593, 2014.
-  P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” Journal of Machine Learning Research, vol. 11, no. Dec, pp. 3371–3408, 2010.
-  L. Liu, M. Yu, and L. Shao, “Unsupervised local feature hashing for image similarity search,” IEEE transactions on cybernetics, vol. 46, no. 11, pp. 2548–2558, 2016.
-  J. Tang, L. Shao, X. Li, and K. Lu, “A local structural descriptor for image matching via normalized graph laplacian embedding,” IEEE transactions on cybernetics, vol. 46, no. 2, pp. 410–420, 2016.
-  S. T. Roweis, L. K. Saul, and G. E. Hinton, “Global coordination of local linear models,” in Advances in neural information processing systems, 2002, pp. 889–896.
-  L. v. d. Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.
-  Y. Su, S. Li, S. Wang, and Y. Fu, “Submanifold decomposition,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 11, pp. 1885–1897, 2014.
-  S. Yan, H. Wang, Y. Fu, J. Yan, X. Tang, and T. S. Huang, “Synchronized submanifold embedding for person-independent pose estimation and beyond,” IEEE Transactions on Image Processing, vol. 18, no. 1, pp. 202–210, 2009.
-  S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio, “Contractive auto-encoders: Explicit invariance during feature extraction,” in Proceedings of the 28th International Conference on Machine Learning, 2011, pp. 833–840.
-  M. Shao, Z. Ding, H. Zhao, and Y. Fu, “Spectral bisection tree guided deep adaptive exemplar autoencoder for unsupervised domain adaptation.” in AAAI, 2016, pp. 2023–2029.
-  J. Lu, G. Wang, W. Deng, P. Moulin, and J. Zhou, “Multi-manifold deep metric learning for image set classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1137–1145.
-  P. Baldi, “Autoencoders, unsupervised learning, and deep architectures,” ICML unsupervised and transfer learning, vol. 27, pp. 37–50, 2012.
-  Y. Chauvin and D. E. Rumelhart, Backpropagation: Theory, architectures, and applications. Hove, United Kingdom: Psychology Press, 1995.
-  A. Smith, H. Zha, and X.-m. Wu, “Convergence and rate of convergence of a manifold-based dimension reduction algorithm,” in Advances in Neural Information Processing Systems, 2008, pp. 1529–1536.
-  J. A. Hartigan and M. A. Wong, “Algorithm as 136: A k-means clustering algorithm,” Journal of the Royal Statistical Society. Series C (Applied Statistics), vol. 28, no. 1, pp. 100–108, 1979.
-  G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
-  T. Ahonen, A. Hadid, and M. Pietikainen, “Face description with local binary patterns: Application to face recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 28, no. 12, pp. 2037–2041, 2006.
-  L. Van Der Maaten, E. Postma, and J. Van den Herik, “Dimensionality reduction: a comparative review,” Journal of Machine Learning Research, vol. 10, pp. 66–71, 2009.
-  E. Levina and P. J. Bickel, “Maximum likelihood estimation of intrinsic dimension,” in Advances in neural information processing systems, 2005, pp. 777–784.
-  T. K. Ho, “Random decision forests,” in Proceedings of the Third International Conference on Document Analysis and Recognition, vol. 1. IEEE, 1995, pp. 278–282.
-  C. Elkan, “Boosting and naive bayesian learning,” Department of Computer Science and Engineering, University of California, San Diego La Jolla, California 92093-0114, Tech. Rep. CS97-557, 1997.
-  H.-B. Shen and K.-C. Chou, “Ensemble classifier for protein fold pattern recognition,” Bioinformatics, vol. 22, no. 14, pp. 1717–1722, 2006.
-  R. A. Fisher, “The use of multiple measurements in taxonomic problems,” Annals of human genetics, vol. 7, no. 2, pp. 179–188, 1936.
-  N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1. IEEE, 2005, pp. 886–893.