A Distributed Deep Representation Learning Model for Big Image Data Classification

by   Le Dong, et al.

This paper describes an effective and efficient image classification framework named the distributed deep representation learning model (DDRL). The aim is to strike a balance between computationally intensive deep learning approaches (tuned parameters), which are intended for distributed computing, and approaches that focus on designed parameters but are often limited by sequential computing and cannot scale up. Our evaluation shows that DDRL achieves state-of-the-art classification accuracy efficiently on both medium and large datasets. The results imply that our approach is more efficient than conventional deep learning approaches and can be applied to big data that is too complex for approaches focused on parameter design. More specifically, DDRL contains two main components: feature extraction and feature selection. A hierarchical distributed deep representation learning algorithm is designed to extract image statistics, and a nonlinear mapping algorithm is used to map the inherent statistics into abstract features. Both algorithms are carefully designed to avoid tuning millions of parameters, which leads to a more compact solution for image classification of big data. The proposed approach is designed to be friendly to parallel computing; it is generic and easy to deploy on different distributed computing resources. In the experiments, large-scale image datasets are classified with a DDRL implementation on Hadoop MapReduce, which shows high scalability and resilience.




1 Introduction

Recent years have witnessed the great success and rapid development of deep learning Hinton1 , applied to multiple levels of representation and abstraction that help make sense of image data to accomplish higher-level tasks such as image retrieval Liang37 ; Penatti4 , classification Samat2 ; Dong38 ; Dong39 , detection Li3 ; Pedersoli40 , etc. An elegant deep representation, obtained by greedily learning successive layers of features, makes subsequent tasks more achievable. Given that labeled data is scarce, current deep learning methods such as CNNs (Convolutional Neural Networks) Krizhevsky5 ; Jarrett6 , sparse coding Yang7 , sparse auto-encoders Goodfellow8 ; Poultney11 , and RBMs (Restricted Boltzmann Machines) typically employ an unsupervised learning algorithm to train a model on unlabeled data and then use the resulting deep representation to extract interesting features. These deep learning models generally have a huge number of hyper-parameters to tune, which imposes heavy requirements on storage and computation. More recently, researchers found that it is possible to achieve state-of-the-art performance by focusing effort on the design parameters (e.g., the receptive field size, the number of hidden nodes, the step-size between extracted features, etc.) with simple learning algorithms and a single layer of features Coates9 . However, the superiority demonstrated in Coates9 is based on relatively small benchmark datasets like NORB Deng12 and CIFAR-100 Krizhevsky10 . When applied to big image datasets such as ImageNet Dean13 , the classification accuracy of the single-layer feature approach may suffer from information loss during feature extraction. Indeed, big image data accompanies widespread real applications in areas such as engineering, industrial manufacturing, the military, and medicine, which directly motivates us to construct a robust and reliable model for big image data classification through the joint efforts of feature design, deep learning, and distributed computing resources.

Inspired by the previous state-of-the-art approaches Coates9 ; Wu18 ; Coates36 , we utilize a hierarchical distributed deep representation learning algorithm, based on K-means, as the unsupervised feature learning module. The proposed approach avoids the selection of multiple hyper-parameters, such as learning rates, momentum, sparsity penalties, and weight decay, which must be chosen through cross-validation and result in substantially increased runtime. K-means has enjoyed wide adoption in computer vision for building codebooks of visual words used to define higher-level image features, but it has been less widely used in deep learning. In our design, K-means is used to construct a dictionary of centroids in each layer, which yields the feature mapping function, so that an input data vector can be mapped to a new feature representation that minimizes the reconstruction error. The proposed approach is computationally simpler because it does not require any hyper-parameter to be tuned other than obtaining the dictionary. We note that the complexity of the dictionary grows linearly with the number of layers, which imposes non-trivial computation that cannot be handled by a single machine. To mitigate this problem, we utilize distributed computing resources to provide sufficient computing capability and storage. Here, the prevalent MapReduce Dean13 , aimed at big data parallel processing, is chosen as the implementation platform. Based on this platform, our proposed Distributed Deep Representation Learning Model (DDRL) is reliably trained. Note that the DDRL model is not restricted to MapReduce; it is generic and can be deployed to any other distributed platform.

In general, the contributions of this paper can be summarized in three aspects:

1) The proposed DDRL model is a hierarchical structure designed to abstract layer-wise image feature information. Each level of the hierarchy uses the hierarchical distributed deep representation learning algorithm to learn the inherent statistics of the image data. The DDRL model dispenses with the estimation of millions of parameters, which reduces the complexity of traditional model learning methods;

2) To further counter the computing challenges brought by big image data, the DDRL model is set up on distributed resources, which helps relieve the storage and computation efficiency issues. In addition, each layer adopts parallel processing, which further improves the scalability and fault tolerance of DDRL;

3) Owing to its parallel design and its simplification of burdensome models, our DDRL model learns a salient representation for big image data classification and obtains desirable performance.

The remainder of this paper is organized as follows: Section 2 reviews related work and Section 3 elaborates our proposed approach. Experimental evidence that validates our work and a comparison with other methods are presented in Section 4. Finally, Section 5 concludes the paper.

2 Related work

In recent years, much attention has been paid to the flourishing field of deep learning, which can be unsupervised Hinton1 , supervised Deng12 , or a hybrid form Liu17 . Hierarchical and recursive networks Hinton1 ; Huang18 have demonstrated great promise in automatically learning thousands or even millions of features. Image classification Hinton19 ; Ngiam20 ; LuoY32 ; Fu33 ; Babu34 ; Simonyan35 based on deep learning has also achieved significant performance, especially in the presence of large amounts of training data. Simonyan35 extracts hand-designed low-level features, which fail to capture the inherent variability of images.

Despite the superiority of deep learning in vision tasks, some potential problems with current deep learning frameworks still exist, as Lee21 concluded: reduced transparency and discriminativeness of the features learned at hidden layers Zeiler22 ; training difficulty due to exploding and vanishing gradients Pascanu23 ; Glorot24 ; lack of a thorough mathematical understanding of the algorithmic behavior, despite some attempts made on the theoretical side Eigen25 ; dependence on the availability of large amounts of training data Hinton19 ; and the complexity of manual tuning during training Krizhevsky5 . To enhance the performance of deep learning from various angles, several techniques such as dropout Hinton19 , drop connect Wan26 , pre-training Dahl27 , and data augmentation Ciresan28 have been proposed. In addition, a variety of engineering tricks are employed to fine-tune feature scale, step size, and convergence rate.

In Coates9 , K-means successfully plays the role of unsupervised feature learner and achieves good performance. It is particularly noteworthy for its simple implementation and fast training. Unfortunately, it requires a very large number of centroids to generate good features, which places a heavy burden on computing speed and storage capacity. To take advantage of the simplicity of K-means while overcoming these deficiencies, we employ distributed resources. MapReduce Dean13 ; Yu29 , a prevalent distributed processing framework capable of efficiently processing huge amounts of data in parallel across numerous distributed nodes, is a reliable platform for providing sufficient computing resources. The excellent fault tolerance and load balance of MapReduce stem from its inherent working mechanism, which detects failed map or reduce tasks and reschedules them to other nodes in the cluster. The detailed operation of MapReduce and its combination with our DDRL model are introduced in subsequent sections.

Figure 1: System framework.

3 Proposed Approach

This section presents the detailed design of our proposed distributed deep representation learning model (DDRL).

3.1 System Overview

The DDRL model consists of five layers. As shown in Figure 1, each layer has a similar structure, which includes extracting features and selecting feature maps. The input of the first layer is an unlabeled image dataset, and the output of the last (fifth) layer is the learned features, which are fed to the SVM to test our model. Between the first and the last layers, each layer extracts features from the output of the previous layer and then feeds the extracted features to the next layer. For training, the training image set is partitioned into six small datasets (each of which corresponds to a specified layer): the unlabeled image datasets are used to train our DDRL model, and a labeled image dataset is used to train the SVM for classification. Dividing the whole image dataset into several small subsets helps reduce the training time and ensures the richness of the extracted features.

Figure 2: The details of hierarchical feature extraction and selection.

The specific structure of each layer is depicted in Figure 2, which includes input, pre-processing, dictionary learning, feature extraction, and feature map selection. Here, we use the first layer as an example to illustrate our DDRL model. We extract random patches from the first layer's input dataset on multiple Map nodes in parallel, followed by a pre-processing stage that includes contrast normalization and whitening of these patches. Coates9 showed that normalization removes the unit limitation of the data, which helps compare and weight data of different units, and that whitening helps remove the redundancy between pixels. Then K-means, acting as the unsupervised feature learning algorithm, runs on each Map node to obtain as many small dictionaries as there are Map nodes in the cluster, and these are reduced on the Reduce node to produce the final dictionary of the first layer. Given the first layer's feature mapping, we can extract image features and apply spatial pooling Boureau14 ; Boureau15 to these features to obtain more compact representations. Note that the pooled features are still high-dimensional, and it is hard to select receptive fields from such a huge number of features. To this end, we use a similarity metric to produce feature maps, each of which contains an equal number of the most similar features; these feature maps are input to the second layer and assigned to the Map nodes for parallel processing. On each Map node, K-means runs on the feature maps to obtain the corresponding dictionaries, and subsequent operations are the same as in the first layer. Finally, in the last layer, we use the dictionary to extract image features of the labeled image dataset and then input the pooled features and labels to train the SVM for classification.
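The map and reduce steps of dictionary learning can be sketched on a single machine as follows. This is only an illustration under stated assumptions: the text does not say how the Reduce node combines the small dictionaries, so the sketch assumes simple row-wise concatenation, and the names `kmeans`, `map_phase`, and `reduce_phase` are hypothetical rather than taken from the DDRL implementation.

```python
import numpy as np

def kmeans(X, k, iters=10, seed=0):
    """Plain Lloyd's K-means; returns a (k, N) array of centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every patch to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids

def map_phase(shard, k_per_node, seed):
    """Work done independently on one Map node (hypothetical helper)."""
    return kmeans(shard, k_per_node, seed=seed)

def reduce_phase(small_dictionaries):
    """ASSUMPTION: the Reduce node stacks the per-node dictionaries."""
    return np.vstack(small_dictionaries)

rng = np.random.default_rng(42)
patches = rng.normal(size=(400, 12))   # 400 toy patches of dimension 12
shards = np.array_split(patches, 4)    # one shard per Map node
small = [map_phase(s, k_per_node=5, seed=i) for i, s in enumerate(shards)]
dictionary = reduce_phase(small)
print(dictionary.shape)  # (20, 12)
```

Because each shard is clustered independently, the per-node work needs no communication until the reduce step, which is what makes the layer easy to parallelize.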

3.2 DDRL Model Formulation

Suppose that each randomly extracted patch has dimension $w$-by-$w$ and has $d$ channels (e.g., $d = 3$ for an RGB image). We can then construct a dataset of $m$ sample patches $X = \{x^{(1)}, \ldots, x^{(m)}\}$, where $x^{(i)} \in \mathbb{R}^{N}$ and $N = w \cdot w \cdot d$. Given this dataset, pre-processing is applied, followed by the unsupervised feature learning process, to accomplish the distributed deep representation learning.
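A minimal sketch of building such a patch dataset is given below. It assumes the images are stored as a NumPy array of shape (count, height, width, channels); the function name `sample_patches` and its parameters are illustrative and not part of the original work.

```python
import numpy as np

def sample_patches(images, w, m, seed=0):
    """Draw m random w-by-w patches and flatten each to length w*w*d."""
    rng = np.random.default_rng(seed)
    n, height, width, d = images.shape
    X = np.empty((m, w * w * d), dtype=np.float64)
    for i in range(m):
        img = rng.integers(n)                 # pick a random image
        r = rng.integers(height - w + 1)      # top-left row of the patch
        c = rng.integers(width - w + 1)       # top-left column of the patch
        X[i] = images[img, r:r + w, c:c + w, :].reshape(-1)
    return X

images = np.random.default_rng(1).random((10, 32, 32, 3))  # toy RGB images
X = sample_patches(images, w=6, m=50)
print(X.shape)  # (50, 108)
```

With 6-by-6 patches and three color channels, each patch becomes a vector of length 6 x 6 x 3 = 108.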

3.2.1 Pre-processing

Previous state-of-the-art work Coates9 has validated the key role of pre-processing image patches in improving subsequent feature learning performance. In our work, pre-processing involves normalization and whitening, which contribute jointly. Since the pre-processing of each image is independent of the others, it can be distributed across the Map nodes of our DDRL model. Normalization removes the unit limitation of the data, enabling comparison and weighting of data of different units. Whitening helps remove the redundancy between pixels. Here, we normalize the patches according to Eq.(1):

$$\tilde{x}^{(i)} = \frac{x^{(i)} - \mathrm{mean}(x^{(i)})}{\sqrt{\mathrm{var}(x^{(i)}) + \epsilon_{\mathrm{norm}}}} \tag{1}$$

where $\mathrm{var}(x^{(i)})$ and $\mathrm{mean}(x^{(i)})$ are the variance and mean of the elements of $x^{(i)}$, and $\epsilon_{\mathrm{norm}}$ is a constant added to the variance before division to suppress noise and to avoid division by zero. For visual data, this operation corresponds to local brightness and contrast normalization.
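The normalization step can be sketched in a few lines of NumPy. The function name `normalize_patches` and the default value of the constant are assumptions chosen for illustration; the source does not state the value it uses.

```python
import numpy as np

def normalize_patches(X, eps_norm=10.0):
    """Subtract each patch's mean and divide by sqrt(variance + eps_norm).

    X holds one flattened patch per row. eps_norm is an ASSUMED value;
    it keeps the division well defined for constant (zero-variance) patches.
    """
    mu = X.mean(axis=1, keepdims=True)
    var = X.var(axis=1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps_norm)

X = np.array([[10.0, 20.0, 30.0],
              [5.0, 5.0, 5.0]])   # second patch is constant
Xn = normalize_patches(X)
print(Xn)
```

The constant second patch maps to all zeros instead of triggering a division by zero, which is the purpose of the added constant.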

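Since adjacent pixel values in the raw input image are highly correlated, we apply classical PCA whitening to each normalized patch $\tilde{x}^{(i)}$ to make the input less redundant. We have:

$$[U, \Lambda] = \mathrm{eig}\big(\mathrm{cov}(\tilde{x})\big) \tag{2}$$

$$x_{\mathrm{rot}}^{(i)} = U^{\top} \tilde{x}^{(i)} \tag{3}$$

$$x_{\mathrm{white},j}^{(i)} = \frac{x_{\mathrm{rot},j}^{(i)}}{\sqrt{\lambda_j}} \tag{4}$$

where Eq.(2) computes the eigenvalues $\lambda_j$ (the diagonal of $\Lambda$) and eigenvectors (the columns of $U$) of the covariance matrix $\mathrm{cov}(\tilde{x})$, Eq.(3) decorrelates the input features, and Eq.(4) obtains the whitened data. Note that some of the eigenvalues $\lambda_j$ may be numerically close to zero. The scaling step, where we divide by $\sqrt{\lambda_j}$, would then divide by a value close to zero, causing the data to blow up (take on large values) or otherwise become numerically unstable. Therefore, we add a small constant $\epsilon_{\mathrm{white}}$ to the eigenvalues before taking their square root, as shown in Eq.(5):

$$x_{\mathrm{white},j}^{(i)} = \frac{x_{\mathrm{rot},j}^{(i)}}{\sqrt{\lambda_j + \epsilon_{\mathrm{white}}}} \tag{5}$$

Furthermore, adding $\epsilon_{\mathrm{white}}$ slightly smooths the input image, removes aliasing artifacts caused by the way pixels are laid out in an image, and improves the learned features.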
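The whitening procedure of Eqs.(2)-(5) can be sketched as follows. This is a hedged illustration rather than the authors' code: the function name `pca_whiten`, the regularization value, and the toy data are assumptions.

```python
import numpy as np

def pca_whiten(X, eps_white=0.1):
    """PCA-whiten the rows of X; eps_white regularizes small eigenvalues."""
    Xc = X - X.mean(axis=0)                # center each dimension
    cov = Xc.T @ Xc / Xc.shape[0]          # covariance matrix
    eigvals, U = np.linalg.eigh(cov)       # Eq.(2): eigen-decomposition
    X_rot = Xc @ U                         # Eq.(3): decorrelate the features
    return X_rot / np.sqrt(eigvals + eps_white)  # Eq.(5): regularized scaling

rng = np.random.default_rng(0)
A = np.triu(np.ones((5, 5)))               # fixed mixing matrix (toy example)
X = rng.normal(size=(2000, 5)) @ A         # correlated toy data
Xw = pca_whiten(X, eps_white=1e-8)
cov_w = Xw.T @ Xw / Xw.shape[0]
print(np.round(cov_w, 3))
```

With a very small constant the whitened covariance is close to the identity matrix; a larger constant damps the directions with small eigenvalues instead of amplifying their noise.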

3.2.2 DDRL Hierarchical Feature Extraction

Given the pre-processed image data, we input them into the DDRL model to learn a hierarchical representation. Specifically, in each layer we use the K-means algorithm to learn the image statistics and obtain a dictionary $D$. The pre-processed images (or, in deeper layers, the feature maps) and copies of the dictionary are then distributed to multiple map nodes. On each map node, we define a feature-mapping function $f: \mathbb{R}^{N} \rightarrow \mathbb{R}^{K}$ that maps an input vector $x^{(i)}$ to a new feature representation of $K$ features. Here, we choose the soft-threshold nonlinearity $f(x) = \max\{0, D^{\top}x - \alpha\}$, whose feasibility has been validated in Coates9 , as our feature extractor, where $\alpha$ is a tunable constant. Thus, on each map node, we obtain the corresponding feature maps, and K-means is again used to learn dictionaries from these feature maps. The dictionaries on the map nodes are then aggregated on the reduce node into a complete one. Similarly, the reduced dictionary is copied and distributed to multiple map nodes to extract feature information from the feature maps, just as in the previous layer. Similar operations are repeated in subsequent layers, and in the last layer, we reduce the learned feature maps into a whole. Section 3.2.3 describes the subsequent operations on these feature maps.
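A small sketch of a soft-threshold feature mapping is shown below. It assumes the commonly used form max(0, D^T x - alpha), with the dictionary stored one centroid per row; the function name and the value of `alpha` are illustrative assumptions.

```python
import numpy as np

def soft_threshold_features(X, D, alpha=0.25):
    """Encode rows of X against dictionary D (one centroid per row).

    ASSUMED form of the soft-threshold: max(0, D^T x - alpha).
    """
    return np.maximum(0.0, X @ D.T - alpha)

D = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, 0.0]])        # 3 toy centroids in 2 dimensions
X = np.array([[0.5, 0.1],
              [-2.0, 1.0]])        # 2 toy input vectors
F = soft_threshold_features(X, D)
print(F)
```

Responses below the threshold are set to zero, so each input activates only the centroids it aligns with, which gives a sparse feature vector of length equal to the number of centroids.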

Figure 3: Feature extraction from the input image.

3.2.3 DDRL Hierarchical Feature Selection

Given the feature extractor $f$ and the dictionary $D$, we can extract features from the input images. Previous works Boureau14 ; Boureau15 have theoretically analyzed the importance of spatial pooling for achieving invariance to image transformations, more compact representations, and better robustness to noise and clutter. In our work, since the features learned by K-means are relatively sparse, we choose average pooling. Figure 3 illustrates the features extracted from the equally spaced sub-patches (i.e., receptive fields) covering the input image. We first extract $w$-by-$w$ receptive fields separated by $S$ pixels and then map them to $K$-dimensional feature vectors via the feature extractor to form a new image representation. These vectors are then pooled over the quadrants of the image to form a feature vector for subsequent processing. As Figure 3 shows, the pooled features are input to the SVM for classification.
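The extraction and pooling pipeline described above can be sketched as follows. The sketch omits pre-processing for brevity, assumes a soft-threshold encoder of the form max(0, D x - alpha), and uses hypothetical names (`extract_and_pool`, `alpha`) that do not come from the source.

```python
import numpy as np

def extract_and_pool(image, D, w, stride, alpha=0.25):
    """Encode every w-by-w receptive field, then average-pool over quadrants."""
    H, W, _ = image.shape
    rows = range(0, H - w + 1, stride)
    cols = range(0, W - w + 1, stride)
    K = D.shape[0]
    fmap = np.empty((len(rows), len(cols), K))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            x = image[r:r + w, c:c + w, :].reshape(-1)
            fmap[i, j] = np.maximum(0.0, D @ x - alpha)  # assumed encoder
    # Split the grid of feature vectors into four quadrants and average each.
    hr, hc = fmap.shape[0] // 2, fmap.shape[1] // 2
    quadrants = [fmap[:hr, :hc], fmap[:hr, hc:], fmap[hr:, :hc], fmap[hr:, hc:]]
    return np.concatenate([q.mean(axis=(0, 1)) for q in quadrants])

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))            # toy 32-by-32 RGB image
D = rng.normal(size=(16, 6 * 6 * 3))       # toy dictionary with K = 16
features = extract_and_pool(image, D, w=6, stride=1)
print(features.shape)  # (64,)
```

A 32-by-32 image with 6-by-6 receptive fields and a stride of 1 yields a 27-by-27 grid of K-dimensional vectors; pooling over four quadrants reduces it to a single vector of length 4K.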

Since the dictionary is designed to be very large in order to extract adequately representative features for accurate big image data classification, the learned features are typically huge in number and very high-dimensional. Thus, efficiently selecting the receptive fields is a challenging bottleneck, since a single machine may break down. On the other hand, even distributed computing resources cannot yield a desirable solution, because the map nodes cannot be used to full advantage when processing such unorganized features. We therefore ask whether these unorganized, huge, high-dimensional features can be organized into units that the map nodes in the cluster can process easily. The similarity metric between features proposed in Coates16 inspires us to use Eq.(6) to produce feature maps, each composed of an equal number of the most similar features; Eq.(6) measures the similarity between a given pair of features.

In our design, the core idea is to group the most correlated features into a feature map; K-means then takes a group of feature maps as input on each map node to obtain the corresponding dictionary in parallel, which improves time efficiency and avoids the breakdown of a single machine.
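The grouping idea can be illustrated with a short sketch. The exact similarity used by DDRL is not reproduced here; the sketch assumes the correlation between squared feature responses, in the spirit of Coates16 , and the names `similarity_matrix`, `group_features`, `seeds`, and `T` are hypothetical.

```python
import numpy as np

def similarity_matrix(Z):
    """ASSUMED similarity: correlation between squared responses (columns of Z)."""
    return np.corrcoef(Z ** 2, rowvar=False)

def group_features(Z, seeds, T):
    """For each seed feature, return the indices of its T most similar features."""
    S = similarity_matrix(Z)
    return [np.argsort(-S[s])[:T] for s in seeds]

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 2))
# Toy data: features 0-2 follow one source, features 3-5 follow another.
Z = base[:, [0, 0, 0, 1, 1, 1]] + 0.1 * rng.normal(size=(500, 6))
groups = group_features(Z, seeds=[0, 3], T=3)
print([sorted(g.tolist()) for g in groups])
```

Each group has the same size T, so the resulting feature maps can be handed to the map nodes as equally sized work units.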

4 Experiments

In this section, we conduct comprehensive experiments to evaluate our work on two large-scale image datasets, ImageNet Deng12 and CIFAR-100 Krizhevsky10 . We implement a multi-layered network to accomplish the deep representation learning for subsequent SVM classification. To provide convincing results, we compare our work with the method proposed in Coates9 , which similarly uses K-means to learn the feature representation but on a single node. To guarantee a fair comparison, we set up the experimental environment exactly as in Coates9 .

4.1 Experimental Environment and Datasets

We built a Hadoop-1.0.4 cluster with four PCs, each with a 2-core 2.6 GHz CPU and 4 GB of memory. The total number of map nodes is 4 and the number of reduce nodes is 1.

ImageNet is a dataset of over 15 million labeled high-resolution images belonging to roughly 22,000 categories, which aims to provide researchers with an easily accessible image database. It is organized according to the WordNet hierarchy, in which each node of the hierarchy is depicted by hundreds or thousands of images. Currently, there is an average of over five hundred images per node. To validate the performance of our DDRL model, we chose 100 categories, 120,000 images in total, from ImageNet, with 80,000 for training and the remaining 40,000 for testing. Since the images from ImageNet are not all the same size, we first resized the chosen images to 32-by-32 for convenience.

CIFAR-100 consists of 60,000 32-by-32 color images in 100 classes, with 600 images per class. There are 500 training images and 100 test images per class.

Figure 4: CIFAR-100 feature extraction. (a): a random selection of 320 filters chosen from the 6-by-6-by-1600 image representations learned by the first layer of DDRL model. (b): a random selection of 320 filters (out of 1600) of size 6-by-6 learned by Coates9 .

4.2 Comparison of Dictionary

Before looking at the classification results, we first inspect the dictionary learned by our DDRL model and the dictionary learned by Coates9 on CIFAR-100. The receptive field size is 6-by-6 and the stride between two receptive fields is 1 pixel. As presented in Figure 4, (a) shows 320 randomly selected filters (out of 1,600) learned by the first layer of the DDRL model, and (b) shows a random selection of 320 filters from the complete dictionary of 1,600 filters learned by Coates9 . Visually, little difference can be observed between (a) and (b); both present diversified features that contribute to the subsequent feature extraction and SVM classification. It is worth mentioning that the dictionary presented in (a) is obtained by reducing the four dictionaries learned on the four map nodes, whereas (b) is obtained on a single machine. Thus, since a similar dictionary is obtained while the computation is spread over several nodes, the feature learning model proposed in our work compares favorably with that presented in Coates9 . If a much bigger dictionary is needed for better classification performance, the approach proposed in Coates9 will impose a serious computational constraint on the single machine, while our distributed deep model (DDRL) can tackle this challenge with the joint efforts of the distributed computing resources.

Figure 5: Effect of whitening.

4.3 Effect of Whitening

Whitening is a part of pre-processing that helps remove the redundancy between pixels. In this section, we conduct experiments to validate the effect of whitening on our model.

Figure 5 shows the performance of our model with and without whitening. Our DDRL model has 5 layers, with 1,600/2,000/2,400/2,800/3,200 centroids from the first layer to the last. The same receptive field size is used in all these layers, and the stride is set to 1 pixel. From the experimental results on both ImageNet and CIFAR-100, we observe that the performance of our model improves as the number of layers increases, and that this improvement occurs whether or not the whitening operation is included. The reason for this improvement is discussed in the next subsection. In addition, Figure 5 shows that, at the same model depth, the whitening operation helps the DDRL model achieve higher classification accuracy on both ImageNet and CIFAR-100. Thus, we conclude that whitening is a crucial pre-processing step for the proposed model.

Figure 6: Effect of receptive field size.

4.4 Effect of Receptive Field Size and Stride

In this section, we conduct experiments to compare the effect of receptive field size and stride on the DDRL model and on Coates9 , using both ImageNet and CIFAR-100.

Figure 6 illustrates the effect of receptive field size on the DDRL model and on Coates9 for ImageNet and CIFAR-100. The results of Coates9 are obtained with a stride of 1 pixel and 1,600 centroids. The results of the DDRL model are obtained with 5 layers (1,600, 2,000, 2,400, 2,800, and 3,200 centroids per layer) and a stride of 1 pixel. In this experiment, we evaluate four receptive field sizes. As the curves show, both the DDRL model and Coates9 perform worse as the receptive field size increases, while the DDRL model still achieves higher accuracy than Coates9 in all cases. From this perspective, a smaller receptive field leads to better performance but also to higher computational expense. On both ImageNet and CIFAR-100, our model can relieve this constraint through distributed storage and computing, whereas the approach proposed in Coates9 has difficulty handling this overhead.

Figure 7: Effect of stride.

Figure 7 presents the effect of different strides on the model proposed in Coates9 and on the DDRL model for ImageNet and CIFAR-100. The model in Coates9 uses a fixed receptive field size and 1,600 centroids. Our DDRL model is the same as described before, consisting of 5 layers with 1,600/2,000/2,400/2,800/3,200 centroids, and its receptive field size is also fixed. As in Figure 6, both the model in Coates9 and the DDRL model perform worse as the stride increases, and the DDRL model remains superior to Coates9 at all stride values. A smaller stride therefore contributes substantially to classification performance while introducing extra computation cost. Again, on ImageNet and CIFAR-100, our DDRL model can overcome the computational constraint with distributed computing resources, while it is difficult for a single machine to do so.

| Number of layers | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| DDRL | 70.19% | 72.58% | 74.86% | 75.14% | 75.53% |
| Coates9 | 70.01% | N/A | N/A | N/A | N/A |

Table 1: Comparison of the classification performance on ImageNet dataset.

| Method | Coates9 | Goodfellow30 | Nitish31 | DDRL |
|---|---|---|---|---|
| Accuracy | 61.28% | 61.43% | 63.15% | 62.83% |

Table 2: Comparison of the classification performance on CIFAR-100 dataset.

4.5 Classification Performance

In this section, we validate the classification performance of our DDRL model on ImageNet and CIFAR-100. The same receptive field size and stride are used for both Coates9 and the DDRL model. Table 1 presents the results obtained with models of different depths. Although the inherent map and reduce phases of MapReduce may inevitably bring some compromises, the superiority of the DDRL model becomes obvious as the number of layers grows. A two-layer setup (with 1,600 centroids in the first dictionary and 2,000 centroids in the second) yields an improvement of 2.39 percentage points on ImageNet compared with the single-layer model. The three-, four-, and five-layer models continue to improve to different extents. The five-layer model gains only slightly (0.39 percentage points) over the four-layer one, which indicates that once the model is deep enough, the classification performance gradually stops improving. Considering the consumption of computation and storage resources, a depth of five layers is generally sufficient for the DDRL model.

Although the main body of work is conducted on the ImageNet dataset, we also investigate how the model performs on the CIFAR-100 dataset. As shown in Table 2, our DDRL model achieves 62.83% accuracy. The DDRL model outperforms Goodfellow30 by 1.40 percentage points and Coates9 by 1.55 percentage points. Compared to Nitish31 , the DDRL model is 0.32 percentage points lower, which mainly results from the number of images required to train the classification model. For the same reason, some relatively small datasets (e.g., NORB and CIFAR-10) were not used to validate the performance of our DDRL model. Considering both the classification results and the computing consumption, such a small discrepancy is acceptable and reasonable. When the number of training images is large enough, the superiority of our DDRL model may become more obvious.
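The layer-to-layer gains quoted above can be checked directly from the DDRL row of Table 1. The snippet below is only an arithmetic check; the variable names are illustrative.

```python
# DDRL accuracy (%) on ImageNet by number of layers, copied from Table 1.
accuracy = {1: 70.19, 2: 72.58, 3: 74.86, 4: 75.14, 5: 75.53}

# Gain in percentage points obtained by adding each extra layer.
gains = {k: round(accuracy[k] - accuracy[k - 1], 2) for k in range(2, 6)}
print(gains)  # {2: 2.39, 3: 2.28, 4: 0.28, 5: 0.39}
```

The first two added layers account for most of the improvement, while the fourth and fifth layers each add well under half a percentage point.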

5 Conclusion

In this work, we have implemented the distributed deep representation learning model (DDRL) for the task of big image data classification. Unlike previous methods that pursue algorithm optimization, our design focuses on enhancing classification accuracy while retaining the simplicity of the feature learning algorithm and the robustness of the implementation platform. Since desirable accuracy in big image data classification places high demands on the number and richness of features, we proposed a platform with excellent fault tolerance to avoid the breakdown of a single machine. Experimental results demonstrated the encouraging performance of our design, and we expect to tackle further challenges in big image data in future work.


This work was supported in part by the National Natural Science Foundation of China under Grant 61370149, in part by the Fundamental Research Funds for the Central Universities (ZYGX2013J083), and in part by the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry.



  • (1) G. Hinton, S. Osindero and Y. W. Teh, “A fast learning algorithm for deep belief nets,” in Neural Computation , vol. 18, no. 7, pp. 1527-1554, 2006.
  • (2) Y. Liang, L. Dong, S. Xie and N. Lv, “Compact feature based clustering for large-scale image retrieval,” in IEEE International Conference on Multimedia and Expo Workshops, pp. 1-6, 2014.
  • (3) O. A. B. Penatti, F. B. Silva, E. Valle, V. Gouet-Brunet and R. D. S. Torres, “Visual word spatial arrangement for image retrieval and classification,” in Pattern Recognition, vol. 47, no. 2, pp. 705-720, 2014.
  • (4) A. Samat, J. Li, S. Liu, P. Du, Z. Miao and J. Luo, “Improved Hyperspectral Image Classification by Active Learning Using Pre-Designed Mixed Pixels,” in Pattern Recognition, vol. 51, pp. 43-58, 2016.
  • (5) L. Dong, E. Izquierdo, “A Biologically Inspired System for Classification of Natural Images,” in IEEE Transactions on Circuits and Systems for Video Technology, vol. 17, no. 5, pp. 590-603, 2007.
  • (6) L. Dong, J. Su, E. Izquierdo, “Scene-oriented hierarchical classification of blurry and noisy images,” in IEEE Transactions on Image Processing, vol. 21, no. 5, pp. 2534-2545, 2012.
  • (7) Y. Li, S. Wang, Q. Tian and X. Ding, “Feature representation for statistical-learning-based object detection: A review,” in Pattern Recognition, vol. 48, no. 11, pp. 3542-3559, 2015.
  • (8) M. Pedersoli, A. Vedaldi, J. Gonzàlez and X. Roca, “A coarse-to-fine approach for fast deformable object detection,” in Pattern Recognition, vol. 48, no. 7, pp. 1844-1853, 2015.
  • (9) A. Krizhevsky, I. Sutskever and G. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.
  • (10) K. Jarrett, K. Kavukcuoglu, M. Ranzato and Y. LeCun, “What is the best multi-stage architecture for object recognition?,” in Proceedings of IEEE International Conference on Computer Vision, vol. 30, no. 2, pp. 2146-2153, 2009.
  • (11) J. Yang, K. Yu, Y. Gong and T. Huang, “Linear spatial pyramid matching using sparse coding for image classification,” in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1794-1801, 2009.
  • (12) I. J. Goodfellow, H. Lee, Q. V. Le, A. Saxe and A. Y. Ng, “Measuring invariances in deep networks,” in Advances in Neural Information Processing Systems, pp. 646-654, 2009.
  • (13) M. Ranzato, C. Poultney, S. Chopra and Y. LeCun, “Efficient learning of sparse representations with an energy-based model,” in Advances in Neural Information Processing Systems, pp. 1137-1144, 2006.
  • (14) A. Krizhevsky, G. Hinton, “Learning multiple layers of features from tiny images,” in Computer Science Department, University of Toronto, Tech. Rep, 2009.
  • (15) A. Coates, H. Lee and A. Y. Ng, “An analysis of single-layer networks in unsupervised feature learning,” in International Conference on Artificial Intelligence and Statistics, pp. 215-223, 2011.
  • (16) J. Deng, W. Dong and R. Socher, “Imagenet: A large-scale hierarchical image database,” in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 248-255, 2009.
  • (17) J. Dean, S. Ghemawat, “MapReduce: simplified data processing on large clusters,” in Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.
  • (18) J. Wu, Y. Yu, H. Chang and K. Yu, “Deep multiple instance learning for image classification and auto-annotation,” in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3460-3469, 2015.
  • (19) A. Coates, A. Y. Ng, “Learning feature representations with k-means,” in Neural Networks: Tricks of the Trade, pp. 561-580, 2012.
  • (20) A. Coates, A. Y. Ng, “Selecting receptive fields in deep networks,” in Advances in Neural Information Processing Systems, pp. 2528-2536, 2011.
  • (21) Y. Liu, S. Zhou, Q. Chen, “Discriminative deep belief networks for visual data classification,” in Pattern Recognition, vol. 44, no. 10, pp. 2287-2296, 2011.
  • (22) F. J. Huang, Y. LeCun, “Large-scale learning with SVM and convolutional nets for generic object categorization,” in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 284-291, 2006.
  • (23) G. E. Hinton, N. Srivastava and A. Krizhevsky, “Improving neural networks by preventing co-adaptation of feature detectors,” in arXiv:1207.0580, vol. 3, no. 4, pp. 212-223, 2012.
  • (24) J. Ngiam, Z. Chen and D. Chia, “Tiled convolutional neural networks,” in Advances in Neural Information Processing Systems, pp. 1279-1287, 2010.
  • (25) B. Wang, J. Tsotsos, “Dynamic label propagation for semi-supervised multi-class multi-label classification,” in Pattern Recognition, vol. 52, pp. 75-84, 2016.
  • (26) Z. Fu, G. Lu and K. M. Ting, “Learning sparse kernel classifiers for multi-instance classification,” in IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 9, pp. 1377-1389, 2013.
  • (27) G. S. Babu, S. Suresh, “Sequential projection-based metacognitive learning in a radial basis function network for classification problems,” in IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 2, pp. 194-206, 2013.
  • (28) K. Simonyan, A. Vedaldi and A. Zisserman, “Deep Fisher networks for large-scale image classification,” in Advances in Neural Information Processing Systems, pp. 163-171, 2013.
  • (29) C. Y. Lee, S. Xie and P. Gallagher, “Deeply-Supervised Nets,” in arXiv:1409.5185, 2014.
  • (30) M. D. Zeiler, R. Fergus, “Visualizing and understanding convolutional networks,” in arXiv:1311.2901, 2013.
  • (31) R. Pascanu, T. Mikolov and Y. Bengio, “On the difficulty of training recurrent neural networks,” in arXiv:1211.5063v2, pp. 1279-1287, 2014.
  • (32) X. Glorot, Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in International Conference on Artificial Intelligence and Statistics, pp. 249-256, 2010.
  • (33) D. Eigen, J. Rolfe and R. Fergus, “Understanding Deep Architectures using a Recursive Convolutional Network,” in arXiv: 1312.1847v2, 2014.
  • (34) L. Wan, M. Zeiler and S. Zhang, “Regularization of neural networks using dropconnect,” in International Conference on Machine Learning, pp. 1058-1066, 2013.
  • (35) G. E. Dahl, Y. Dong, L. Deng and A. Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” in IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30-42, 2012.
  • (36) D. Ciresan, U. Meier and J. Schmidhuber, “Multi-column deep neural networks for image classification,” in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3642-3649, 2012.
  • (37) H. Yu, D. Wang, “Mass log data processing and mining based on Hadoop and cloud computing,” in International Conference on Computer Science and Education, pp. 197-202, 2012.
  • (38) Y. L. Boureau, J. Ponce and Y. LeCun, “A theoretical analysis of feature pooling in visual recognition,” in International Conference on Machine Learning, pp. 111-118, 2010.
  • (39) Y. L. Boureau, N. L. Roux and F. Bach, “Ask the locals: multi-way local pooling for image recognition,” in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2651-2658, 2011.
  • (40) I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville and Y. Bengio, “Maxout networks,” in International Conference on Machine Learning, 2013.
  • (41) N. Srivastava, R. Salakhutdinov, “Discriminative transfer learning with tree-based priors,” in Advances in Neural Information Processing Systems, pp. 2094-2102, 2013.