I Introduction
Convolutional neural networks (CNNs) [11, 15] have achieved great success in the field of computer vision, including image classification [9, 12, 22, 29, 30, 31] and object detection [6, 7, 17, 33, 18]. The underlying reason is that a CNN is able to learn a hierarchy of features [1, 3, 4, 34] that represent objects at different levels. Low-level features denote visual primitives such as edges, dots, and textures, whereas high-level features represent objects in a semantic way. Low-level features are shared by all objects, while high-level features are highly discriminative. High-level features are learned progressively from low-level features. All these features are learned through a series of linear and nonlinear transformations, which are the primary elements of CNNs.
Typically, a CNN consists of several computational building blocks: convolution, activation, and pooling. They work together to fulfill the task of feature extraction and transformation. Convolution takes the inner product of a linear filter and a local region of the input channels. Activation imposes a nonlinear transformation on the convolutional results. Pooling gathers the responses of a given region. Among these three building blocks, the convolutional block plays the most important role in a CNN. It controls the number of feature maps (i.e., the width of the CNN) and the number of layers (i.e., the depth of the CNN). The width and depth determine the capacity of the CNN. The size of a neural network is a double-edged sword. On the one hand, large size means large capacity. Large capacity makes it possible for deep networks to learn rich features, which are essential for recognizing tens or even thousands of object categories. On the other hand, large size typically means a larger number of parameters, which makes the enlarged network more prone to overfitting, especially when the number of labelled samples in the training set is limited. What is more, a large network consumes dramatically more computational resources.
To construct a compact and powerful network, we propose a novel type of convolutional filter. Given a local patch, traditional CNNs typically use a convolutional filter of the same size as the patch to extract features. We argue that this level of abstraction is not strong enough to generate robust features. A Multilayer Perceptron (MLP) [14] can be used to impose a more complex transformation. However, it is still not complex enough to represent input data that lies on a highly nonlinear manifold. Therefore, in this paper, we propose to use cascaded subpatch filters to bring in much more complex structures to abstract the local patch within the receptive field. One subpatch filter contains two subsequent convolutional filters. The first one abstracts subpatches of the input patch. The second one fully connects all the output channels of the first one. Taking the convolutional output of the previous subpatch filter as input, we repeat constructing new subpatch filters until the final output contains only one neuron in the spatial domain. This results in the Cascaded Subpatch Network (CSNet). CSNet can be used to replace a conventional convolutional layer to extract more complex and more robust features. We call the resulting layer a csconv layer. A deep neural network can be obtained by stacking multiple csconv layers. For clarity, in the rest of this paper, the overall deep network containing multiple csconv layers is called a CSNet.

The goal of the proposed method is to construct a more effective structure to abstract the local patch. Instead of designing a CNN that is too wide (i.e., too many feature maps in one layer) or too deep (i.e., too many layers), we present a novel neural network which is compact, yet powerful. Specifically, the contributions and merits of this paper are summarized as follows.

We gain new insight into the convolutional block of CNNs. When abstracting one local patch of size $H \times W$, we propose to use cascaded subpatch filters to replace the conventional convolutional filter.

A subpatch filter consists of an $h \times w$ linear filter followed by a $1 \times 1$ filter. Its purpose is to impose a complex transformation on subpatches of the input patch while reducing the number of parameters.

Cascaded subpatch filters contain a sequence of subpatch filters, which together generate a more complex and more robust abstraction of the local patch. The cascaded subpatch filters can be regarded as one new convolutional kernel structure called a csconv filter. A csconv filter abstracts a local patch much better than a conventional filter.

The csconv filter is a flexible structure which can be constructed from a group of different subpatch filters according to the size of the local region and the required number of parameters.

We build several CSNets with different numbers of parameters to deal with different tasks. Our CSNets achieve state-of-the-art performance on four widely used benchmark image classification datasets.
This paper is organized as follows. Section II reviews the related work. Section III presents the proposed CSNet method. The experimental results are given in Section IV. Finally, Section V concludes this paper.
II Related Work
Since the great success of AlexNet [12] on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC2010), a number of attempts have been made to improve the architectures of CNNs in order to achieve better accuracy. We divide these methods into the following three categories.
(1) Parameter adjusting. Some researchers paid attention to the parameters of CNNs, such as the sizes of convolutional filters, the strides of filters, the number of feature channels in each layer, and the number of convolutional layers. They tried to adjust these parameters to improve the performance of CNNs through exhaustive experiments. Zeiler and Fergus [25] visualized the trained CNN model and found that a large filter size and a large stride in the first convolutional layer could cause aliasing artifacts. Therefore, they used a smaller receptive window size and a smaller stride. Sermanet et al. [21] utilized smaller strides in the first convolution, a larger number of feature maps, and a larger number of layers. They achieved better results than AlexNet. The VGG network [22] pushes the depth of CNNs to up to 19 convolutional layers by using very small convolutional filters and gains a significant improvement. These efforts can be viewed as preliminary explorations on how to construct networks with better performance.

(2) Structure designing. Another line of improvements goes further into the design of new CNN structures. Network in Network (NIN) [14] utilizes a shallow Multilayer Perceptron (MLP) to increase the representational power of neural networks. The MLP is a more complex structure which consists of multiple fully connected layers. Conventional linear convolution and the MLP together result in a new convolutional structure called mlpconv. Mlpconv can be easily implemented by stacking additional $1 \times 1$ convolutional layers on a conventional convolutional layer. These $1 \times 1$ convolutions enhance the connection between different feature channels. Therefore, mlpconv is able to abstract local regions much more effectively than conventional convolution. Szegedy et al. [24] constructed a 22-layer GoogLeNet by stacking dozens of Inception modules. Each Inception module contains a group of convolutional filters of different sizes which aim at capturing information at multiple scales. However, such an Inception module is too wide to be efficiently used in a very deep network. To avoid having too many parameters, GoogLeNet takes advantage of $1 \times 1$ convolutions as dimension reduction modules to remove computational bottlenecks. As a result, both the depth and the width of GoogLeNet can be increased without a significant increase in parameters.

(3) Deeper and wider networks. Since increasing the size of CNNs is the most straightforward way to improve their performance, researchers went even further into designing much deeper networks with hundreds or even thousands of layers. Highway Networks [23] make it possible to train very deep networks with hundreds of layers by using adaptive gating units to regulate the information flow. More importantly, Highway Networks are able to train deeper networks without sacrificing generalization ability. The 32-layer highway network presented in [23] achieved state-of-the-art performance on the CIFAR10 [37] dataset. Based on Highway Networks, He et al. [32] recently presented a residual learning framework to effectively train networks which are substantially deeper than those used previously. They constructed residual nets (ResNets) with a depth of up to 152 layers and evaluated them on the ImageNet dataset. They also presented ResNets with 110 and 1202 layers and evaluated them on the CIFAR10 dataset. They argued that the depth of representations is of great importance for many visual recognition tasks. In addition to the depth of a CNN, its width is also very important. Shao et al. [2] pointed out that the combination of multicolumn deep neural networks could enhance robustness. Instead of simply averaging the outputs of multicolumn predictions, they learned a compact representation from multicolumn deep neural networks by embedding the features of all the penultimate layers into a multispectral space. The resulting features are then used for classification. Their multispectral neural networks (MSNN) make use of the complementary information captured by different neural networks. Since the MSNN has to use multiple networks that do not share parameters, the computation increases with the number of networks.
We agree that both the width and the depth of networks are important for visual recognition tasks. However, larger capacity does not guarantee higher accuracy. Given a dataset with limited samples, when a network reaches its peak performance, it is difficult to further improve performance by simply adding more feature maps or stacking more convolutional layers. This means that the discriminability of networks does not increase indefinitely with their size. The performance comparison between ResNet110 [32] and ResNet1202 [32] on the CIFAR10 dataset supports our viewpoint. ResNet110 is a 110-layer CNN with 1.7M parameters, and ResNet1202 is a 1202-layer CNN with up to 19.4M parameters. However, ResNet110 achieves a test error of 6.43% whereas ResNet1202 achieves a test error of 7.93%. That is, the classification ability of a deep neural network may suffer from excessive parameters. Therefore, it is important to explore new methods to learn features in a more effective way.
III Proposed Method
This paper aims at using subpatch filters to construct compact and powerful CNNs. One of the characteristics of the proposed method is that the size of the subpatch filter is smaller than that of the patch to be represented. In our method, cascaded subpatch filters are used to represent a patch. We take the cascaded subpatch filters as a whole and call it a csconv filter. Applying the csconv filters layer by layer results in a deep CNN which we call CSNet. In this section, we first introduce the subpatch filter. Next, we describe cascaded subpatch filters. Then, the CSNet is presented. Finally, an analysis of the computational complexity of CSNet is given.
III-A Subpatch Filter
The task is to represent an input patch $P \in \mathbb{R}^{H \times W \times C}$, where $H \times W$ stands for the spatial size and $C$ is the number of channels. Throughout this paper, the spatial size is used to express the patch size. By vectorizing the third-order tensor, $P$ can be expressed as an $HWC$-dimensional column vector $X$. Conventional convolution uses a linear filter $w_k$ whose size is the same as that of the patch $X$. The conventional convolution can be computed by the inner product

$$y_k = w_k^{T} X, \quad k = 1, \ldots, K, \tag{1}$$

where $K$ is the number of output channels. The convolution converts the patch of spatial size $H \times W$ into a scalar $y_k$. For the sake of notational consistency, we use $1 \times 1$ to represent the size of the feature $y_k$. Fig. 2(a) shows the conventional convolution.
To make the feature representation more effective, we propose to utilize cascaded subpatch filters to transform the patch from size $H \times W$ to $1 \times 1$. Let $x$ be a subpatch of $X$ of spatial size $h \times w$ with $h < H$ and $w < W$. The number of overlapping subpatches of size $h \times w$ in the patch of size $H \times W$ is $(H - h + 1)(W - w + 1)$. A subpatch filter consists of two subsequent filters, with the size of the first filter being $h \times w$ and the size of the second filter being $1 \times 1$. To explicitly show that a subpatch filter is composed of two basic filters $f_k$ and $g_k$, we denote the subpatch filter by $(f_k, g_k)$ and denote the size of the subpatch filter by $(h \times w, 1 \times 1)$. We call the second filter the channel filter because its function is to fully connect different channels. We call the first filter the spatial filter because its size is larger than $1 \times 1$ and its role is to extract features from both the spatial and the channel domain.
The inner product between the spatial filter and the subpatch $x$ is

$$u_k = f_k^{T} x, \quad k = 1, \ldots, K_1, \tag{2}$$

where $K_1$ is the number of output channels. Express the output of the spatial filter as a $K_1$-dimensional feature vector $u = [u_1, \ldots, u_{K_1}]^{T}$. Taking the feature vector $u$ as the input of the channel filter $g_k$, the second inner product is obtained by

$$v_k = g_k^{T} u, \quad k = 1, \ldots, K_2, \tag{3}$$

where $K_2$ is the number of output channels.
Eq. 2 and Eq. 3 involve only one subpatch. There are $(H - h + 1)(W - w + 1)$ subpatches, so we apply Eq. 2 and Eq. 3 to all of them. That is, the subpatch filter of size $(h \times w, 1 \times 1)$ convolves with the input patch $P$. The convolution is conducted without zero-padding. Consequently, the output of the convolution with the subpatch filter is a patch of size $H_1 \times W_1$, where $H_1$ is $H - h + 1$ and $W_1$ is $W - w + 1$. Fig. 1 demonstrates one subpatch filter.
III-B Generating a Csconv Filter by Cascaded Subpatch Filters
In the previous section, we explicitly denote a subpatch filter by $(f_k, g_k)$, where $k$ indexes channels. When there are several subpatch filters of different sizes, for the sake of clarity, we denote the $i$th subpatch filter of size $(h_i \times w_i, 1 \times 1)$ by $(f^i, g^i)$. Fig. 1 shows that convolving a subpatch filter of size $(h_1 \times w_1, 1 \times 1)$ within the input patch of size $H \times W$ results in an output patch of size $H_1 \times W_1$. But our goal is to output a $1 \times 1$ patch to represent the input $P$. This goal can be reached by convolving the output patch with another subpatch filter of size $(h_2 \times w_2, 1 \times 1)$ with $h_2 \le H_1$ and $w_2 \le W_1$. The size of the output of $(f^2, g^2)$ is $H_2 \times W_2$. It can be noted that $H_2 < H_1$ and $W_2 < W_1$. That is, once a subpatch filter is used, the size of the output patch is decreased. The subpatch filters are used one after another until the output is of size $1 \times 1$. Specifically, if $n$ subpatch filters are used in total, the sizes of the spatial filters satisfy

$$\sum_{i=1}^{n} (h_i - 1) = H - 1, \qquad \sum_{i=1}^{n} (w_i - 1) = W - 1. \tag{4}$$

It is noted that the size of the penultimate output patch is the same as that of the spatial filter of the last subpatch filter.
Suppose that $n$ subpatch filters are finally used to obtain a $1 \times 1$ output patch. We denote the cascaded subpatch filters by $(f^1, g^1)$-$(f^2, g^2)$-$\cdots$-$(f^n, g^n)$. If all the subpatch filters have the same size (i.e., $h_i = h$ and $w_i = w$ for all $i$), the notation can be shortened accordingly. We call the cascaded subpatch filters an $n$-stage csconv filter. Fig. 2(b) demonstrates an $n$-stage csconv filter.
Given a local patch, different filters can be used to deal with it. An example of a conventional filter, a two-stage csconv filter, and a three-stage csconv filter is shown in Fig. 3. The input patch is of size with channels, the conventional filter in Fig. 3(a) is of size , the two-stage csconv filter in Fig. 3(b) is , and the three-stage csconv filter in Fig. 3(c) is . In Fig. 3(a), the convolution directly generates an output of size $1 \times 1$ with channels. In Fig. 3(b), feature channels of size are obtained by applying the first subpatch filter, and then an output of size $1 \times 1$ with channels is obtained by applying the second subpatch filter on them. In Fig. 3(c), feature channels of size are first obtained by applying the first subpatch filter. Then feature channels of size are obtained by applying the second subpatch filter on the feature channels of size . Finally, an output of size $1 \times 1$ with channels is obtained by applying the third subpatch filter on the feature channels of size . It is seen that the conventional convolution is the simplest one and that the csconv convolution with a three-stage csconv filter is the most complex one.
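To make the cascade concrete, the following NumPy sketch applies a two-stage csconv filter to a single patch. It is only an illustration under stated assumptions: the $5 \times 5$ patch, the two $3 \times 3$ spatial filters, the channel counts, the random weights, and all function names (`valid_conv`, `subpatch_filter`, `csconv_filter`) are hypothetical choices made here, not the configuration used in the paper.

```python
import numpy as np

def valid_conv(x, weights):
    """Stride-1 convolution without zero-padding.

    x:       (H, W, C_in) input patch
    weights: (h, w, C_in, C_out) filter bank
    returns: (H - h + 1, W - w + 1, C_out)
    """
    H, W, _ = x.shape
    h, w, _, c_out = weights.shape
    out = np.empty((H - h + 1, W - w + 1, c_out))
    for i in range(H - h + 1):
        for j in range(W - w + 1):
            sub = x[i:i + h, j:j + w, :]  # one h x w subpatch
            # Inner product of the subpatch with every filter in the bank.
            out[i, j, :] = np.tensordot(sub, weights, axes=([0, 1, 2], [0, 1, 2]))
    return out

def relu(x):
    return np.maximum(x, 0.0)

def subpatch_filter(x, spatial_w, channel_w):
    """Spatial filter (h x w) followed by a channel filter (1 x 1), each with ReLU."""
    u = relu(valid_conv(x, spatial_w))
    return relu(valid_conv(u, channel_w))

def csconv_filter(x, stages):
    """Apply subpatch filters in sequence; the result must be 1 x 1 spatially."""
    for spatial_w, channel_w in stages:
        x = subpatch_filter(x, spatial_w, channel_w)
    if x.shape[:2] != (1, 1):
        raise ValueError(f"cascade ended at spatial size {x.shape[:2]}, expected (1, 1)")
    return x

rng = np.random.default_rng(0)
c_in, k1, k2 = 3, 8, 8  # hypothetical channel counts
stages = [
    (rng.standard_normal((3, 3, c_in, k1)), rng.standard_normal((1, 1, k1, k2))),
    (rng.standard_normal((3, 3, k2, k1)), rng.standard_normal((1, 1, k1, k2))),
]
patch = rng.standard_normal((5, 5, c_in))
out = csconv_filter(patch, stages)
print(out.shape)  # (1, 1, 8)
```

The spatial size shrinks from $5 \times 5$ to $3 \times 3$ after the first stage and to $1 \times 1$ after the second, because each spatial filter is applied without zero-padding while the channel filter leaves the spatial size unchanged.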
III-C Form Cascaded Subpatch Network (CSNet) by Stacking Csconv Layers
Let be a csconv filter. Applying on an input patch yields a unit, and convolving over the whole input channels yields a convolutional layer (called a csconv layer) containing a number of units. As with a conventional CNN, we can create a new CNN (called CSNet) by stacking a number of csconv layers: with . It is noted that we express both the spatial filter and the channel filter as fourth-order tensors by explicitly writing the number of input channels ( for the spatial filter and for the channel filter) and the number of output channels ( for the spatial filter and for the channel filter).
It is noted that the number of subpatch filters of a csconv filter is determined by the spatial size of the input patch and the spatial size of each subpatch filter (see Eq. 4). It is also noted that different csconv layers can have either different or the same configuration of csconv filters. For example, the first three csconv layers of one CSNet can have csconv filters of , , and , respectively. Fig. 4 shows the overall structure of the proposed CSNet. The first two csconv layers both have a two-stage csconv filter. The number of csconv filters and the number of subpatch filters in each csconv filter can be tuned according to different tasks. In the proposed CSNet, Rectified Linear Units (ReLUs) [16] follow the output of each convolution of the subpatch filter. Subsampling layers can be added between the csconv layers as in a CNN if necessary.
III-D Computational Complexity Analysis
Though the csconv convolution is much more complex than conventional convolution, this does not mean that a deep CSNet must have a very large number of parameters. A comparison of parameters consumed by conventional convolution and the csconv convolution is shown in Table I. Suppose that a conventional convolution has a filter of size (see Fig. 3(a)), where is the size of the convolutional filter in the spatial domain, is the number of input feature channels, and is the number of output feature channels. The corresponding three-stage csconv convolution (, see Fig. 3(c)) can be implemented with different numbers of input channels and output channels. Table I presents configurations of two common csconv layers denoted by csconv 1 and csconv 2. As shown in Table I, the parameters consumed by conventional convolution, csconv 1 and csconv 2 are , and , respectively. Therefore, the difference between conventional convolution and csconv 1 is . If , then the number of parameters consumed by csconv 1 is no larger than that of conventional convolution. Similarly, the difference between conventional convolution and csconv 2 is . If , then the number of parameters consumed by csconv 2 is no larger than that of conventional convolution. In particular, if , then , , and . It is clearly seen that the number of parameters consumed by conventional convolution is larger than that of the proposed csconv convolution.
In case that and do not satisfy the constraints above, a $1 \times 1$ convolution can be used as a reduction layer to reduce the number of intermediate output feature channels. This guarantees that the total number of parameters consumed by the csconv convolution is no larger than that of conventional convolution. Since the parameters of each csconv layer are no more than those of the corresponding conventional convolutional layer, the total parameters of one deep CSNet are also no more than those of the corresponding conventional neural network.
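The kind of parameter comparison discussed above can be sketched with a few lines of Python. This is a hedged illustration, not a reproduction of Table I: the helper names, the choice of a $5 \times 5$ conventional filter replaced by two $3 \times 3$ stages, and the channel width of 192 are all assumptions made here, and biases are ignored.

```python
def conv_params(h, w, c_in, c_out):
    """Number of weights in an h x w convolution (biases ignored)."""
    return h * w * c_in * c_out

def csconv_params(stages, c_in):
    """Weights of a cascade of subpatch filters.

    stages: list of (h, w, k_spatial, k_channel), where k_spatial is the number
            of outputs of the spatial filter and k_channel the number of
            outputs of the 1 x 1 channel filter.
    """
    total = 0
    for h, w, k_spatial, k_channel in stages:
        total += conv_params(h, w, c_in, k_spatial)       # spatial filter
        total += conv_params(1, 1, k_spatial, k_channel)  # channel filter
        c_in = k_channel  # the next stage consumes this stage's output
    return total

c = 192  # hypothetical channel width, equal for input and all outputs
conventional = conv_params(5, 5, c, c)
two_stage = csconv_params([(3, 3, c, c), (3, 3, c, c)], c_in=c)
print(conventional, two_stage)  # 921600 737280
```

With equal channel widths, the conventional filter costs $25c^2$ weights while this two-stage cascade costs $2(9c^2 + c^2) = 20c^2$, which illustrates how a csconv filter can cover the same receptive field with fewer parameters.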
Method  Conventional  csconv 1  csconv 2 

Structure  
#params  
#params  
IV Experimental Results
We evaluate the proposed CSNet on four standard benchmark datasets: CIFAR10 [39], CIFAR100 [39], MNIST [5], and SVHN [39]. We compare our CSNets with a dozen well-known networks that have achieved state-of-the-art performance on the four datasets. These networks include Maxout (Maxout Networks) [8], NIN (Network in Network) [14], NIN+LA (Networks with Learned Activation Functions) [26], FitNet (Thin and Deep Networks) [19], DSN (Deeply Supervised Networks) [13], DropConnect (Networks using DropConnect) [27], dasNet (Deep Attention Selective Networks) [35], Highway (Networks Allowing Information Flow on Information Highways) [23], ALLCNN (All Convolutional Networks) [38], RCNN (Recurrent Convolutional Neural Networks) [28], and ResNet (Deep Residual Networks) [32].
IV-A Configuration
We adopt the global average pooling scheme introduced in [14] on the top layer of CSNet. We also incorporate dropout layers with a dropout rate of 0.5 [8]. In addition, we use Batch Normalization (BN) [10] to accelerate training. The CSNet is implemented using the MatConvNet [40] toolbox in the Matlab environment. We follow a common training protocol [8] in all experiments. We use stochastic gradient descent with minibatches of size 100 and a fixed momentum of 0.9. Initial values for the learning rate and the weight decay factor are determined based on the validation set. The proposed CSNet converges easily, and no particular engineering tricks are adopted in any of our experiments. All results are achieved without using model averaging [12] techniques, which can help improve performance.

To comprehensively evaluate the performance of the proposed CSNet, we design three CSNets of different architectures, each with a different number of parameters. Our small CSNet (CSNetS), middle CSNet (CSNetM) and large CSNet (CSNetL) have 0.96M, 1.6M and 3.5M parameters, respectively. The configurations of CSNetS, CSNetM, and CSNetL are given in Table II, and the corresponding overall structures are presented in Fig. 6. Though the three CSNets are specifically designed for the CIFAR10 dataset, they are also applied to the other three datasets with almost all parameters unchanged. The only modification is to change the number of output feature channels of the last csconv layer from 10 to 100 on the CIFAR100 dataset.
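The stochastic gradient descent update with momentum used in the training protocol can be sketched as follows. This is a generic textbook formulation written in NumPy, not the MatConvNet code used in the experiments; the function name and the learning rate and weight decay values are placeholders, since the paper selects them on the validation set and does not list them here.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr, momentum=0.9, weight_decay=0.0):
    """One SGD step with momentum and L2 weight decay.

    Returns the updated weights and the updated velocity.
    """
    velocity = momentum * velocity - lr * (grad + weight_decay * w)
    return w + velocity, velocity

# Toy usage: minimize 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(200):
    w, v = sgd_momentum_step(w, grad=w, velocity=v, lr=0.1)  # lr is a placeholder
print(np.linalg.norm(w))  # close to zero after 200 steps
```

In a real training run, `grad` would be the gradient of the loss averaged over one minibatch of 100 samples.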
As shown in Table II, CSNetS and CSNetM have three csconv layers, and CSNetL has four csconv layers. Since the input sample is small ($32 \times 32$ or $28 \times 28$), the receptive field of the filters adopted by traditional methods is typically of size $5 \times 5$. Therefore, our CSNets use two-stage csconv filters to replace linear filters of size $5 \times 5$. Fig. 5 shows the two-stage csconv filter used in our experiments. As shown in Fig. 6, max-pooling follows the first two csconv layers of each CSNet. Average pooling is applied after the last csconv layer to assign one single score to each class. A softmax classifier is then used to recognize the objects.
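The final classification step described above (one feature channel per class, averaged over space, then softmax) can be illustrated with a small NumPy sketch. The $8 \times 8$ spatial size of the last feature maps and the function names are assumptions made for this example only.

```python
import numpy as np

def global_average_pool(feature_maps):
    """Average each channel over its spatial extent: (H, W, C) -> (C,)."""
    return feature_maps.mean(axis=(0, 1))

def softmax(scores):
    """Numerically stable softmax over a 1-D score vector."""
    shifted = scores - scores.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

rng = np.random.default_rng(0)
# Hypothetical output of the last csconv layer: one 8 x 8 map per class.
maps = rng.standard_normal((8, 8, 10))
probs = softmax(global_average_pool(maps))
predicted_class = int(np.argmax(probs))
```

Because every output channel corresponds to exactly one class, no fully connected layer is needed between the last csconv layer and the classifier.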
Table II. Configurations of CSNetS, CSNetM, and CSNetL.
Network  Patch size  #params
CSNetS   5x5         0.96M
CSNetM   5x5         1.6M
CSNetL   5x5         3.5M
IV-B Results on the CIFAR10 Dataset
The CIFAR10 dataset [37] consists of 10 classes of images with 50K training images and 10K testing images. These are color images of airplanes, automobiles, ships, trucks, horses, dogs, cats, birds, deer and frogs. Before training, we preprocess these images using global contrast normalization and ZCA whitening. We carry out experiments both with and without data augmentation. For a fair comparison, we obtain the augmented dataset by padding 4 pixels on each side, and then performing random cropping and random flipping on the fly during training. The augmented data is denoted by CIFAR10+. During testing, we only evaluate the single view of the original color image.
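The augmentation described above can be sketched in NumPy as follows. The sketch assumes zero padding and horizontal flipping with probability 0.5; the text only states "padding 4 pixels" and "random flipping", so both choices, as well as the function name `augment`, are assumptions.

```python
import numpy as np

def augment(image, rng, pad=4):
    """Pad, randomly crop back to the original size, and randomly flip.

    image: (H, W, C) array; returns an array of the same shape.
    """
    h, w, _ = image.shape
    # Assumption: the 4-pixel border is filled with zeros.
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    top = rng.integers(0, 2 * pad + 1)   # crop offset in [0, 2 * pad]
    left = rng.integers(0, 2 * pad + 1)
    crop = padded[top:top + h, left:left + w, :]
    # Assumption: flipping is horizontal, applied with probability 0.5.
    if rng.random() < 0.5:
        crop = crop[:, ::-1, :]
    return crop

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))  # stand-in for one preprocessed CIFAR10 image
augmented = augment(image, rng)
print(augmented.shape)  # (32, 32, 3)
```

Calling `augment` anew for every minibatch yields a different view of each training image at every epoch, which is what "on the fly" refers to.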
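Table III. Test error (%) of NIN, ResNets, and CSNets on CIFAR10 (without data augmentation) and CIFAR10+ (with data augmentation).
Methods         #layers  #params  CIFAR10  CIFAR10+
NIN[14]         9        0.97M    10.41    8.81
CSNetS          12       0.96M    8.33     6.98
ResNet110[32]   110      1.7M     -        6.43
CSNetM          12       1.6M     8.15     6.38
ResNet1202[32]  1202     19.4M    -        7.93
CSNetL          16       3.5M     7.74     5.68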
To give a quick overview of the performance of the CSNets, we first compare them with two well-known neural networks on this dataset. The first one is the classic NIN network, which has 0.97M parameters. The second one is a new network called ResNet [32], which is the champion of the ILSVRC 2015 [20] classification task. ResNet110 [32] is a very deep neural network which has 110 layers and 1.7M parameters. ResNet1202 [32] is much deeper still and has 19.4M parameters. It can be seen that CSNetS (0.96M) has slightly fewer parameters than NIN, that CSNetM (1.6M) has 0.1M fewer parameters than ResNet110, and that CSNetL (3.5M) has far fewer parameters than ResNet1202.
The comparison results are presented in Table III. Compared with NIN, CSNetS reduces the test error from 10.41% to 8.33% (without data augmentation), an improvement of more than two percentage points. CSNetM obtains a test error of 6.38%, which is slightly lower than the 6.43% of ResNet110 (with data augmentation). However, CSNetM has only 12 layers, far fewer than the 110 layers of ResNet110. Therefore, it is much easier to train CSNetM than ResNet110. Unlike ResNet1202, whose performance degrades due to the huge number of parameters, our CSNetL further reduces the test error to 5.68%. The above comparison results demonstrate the superiority of the proposed CSNets. A comprehensive comparison of various methods is presented in Table IV. It can be seen that CSNetS is already among the state-of-the-art results. CSNetM surpasses ResNet110 by 0.05% and CSNetL surpasses ResNet110 by 0.75%.
Table IV. Test error (%) of various methods on CIFAR10 (without data augmentation) and CIFAR10+ (with data augmentation).
Methods          CIFAR10  CIFAR10+
Maxout[8]        11.68    9.38
NIN[14]          10.41    8.81
NIN+LA[26]       9.59     7.51
FitNet[19]       -        8.39
DSN[13]          9.75     8.22
DropConnect[27]  9.41     -
dasNet[35]       9.22     -
Highway[23]      -        7.54
ALLCNN[38]       9.08     7.25
RCNN160[28]      8.69     7.09
ResNet110[32]    -        6.43
ResNet1202[32]   -        7.93
CSNetS           8.33     6.98
CSNetM           8.15     6.38
CSNetL           7.74     5.68
IV-C Results on the CIFAR100 Dataset
The CIFAR100 dataset [37] is just like the CIFAR10 dataset. It has the same number of training images and testing images as CIFAR10. However, CIFAR100 contains 100 classes, ten times as many as CIFAR10. Therefore, the number of images in each class is only one tenth of that in CIFAR10. The 100 classes in CIFAR100 are grouped into 20 superclasses. Each image has two labels. One is the "fine" label indicating the specific class and the other is the "coarse" label indicating the superclass. Considering the number of training images for each class, it is much more difficult to recognize the 100 classes of CIFAR100 than the 10 classes of CIFAR10. There is no data augmentation for CIFAR100. We use the same data preprocessing methods as for CIFAR10.
Table V. Test error (%) of various methods on CIFAR100.
Methods      CIFAR100
Maxout[8]    38.57
NIN[14]      35.68
NIN+LA[26]   34.40
FitNet[19]   35.04
DSN[13]      34.57
dasNet[35]   33.78
ALLCNN[38]   33.71
Highway[23]  32.24
RCNN160[28]  31.75
CSNetM       30.24
Since there are 100 classes to be recognized, we adopt CSNetM in this experiment. The only difference is that the last convolutional layer of the third csconv layer outputs 100 feature channels, each of which is then averaged to generate one score for one specific class. Details of the performance comparison are shown in Table V. It can be seen that CSNetM obtains a test error of 30.24% on CIFAR100, which surpasses the second best performance (RCNN160 with 31.75%) by 1.51 percentage points. It should also be noted that RCNN160 has 1.87M parameters, about 0.27M more than CSNetM.
IV-D Results on the MNIST Dataset
MNIST [5] is one of the most well-known datasets in the field of machine learning. It consists of handwritten digits ranging from 0 to 9. There are 60000 training images and 10000 testing images, which are $28 \times 28$ grayscale images. Only mean subtraction is used to preprocess the dataset. Since MNIST is a relatively simple dataset compared with CIFAR10, CSNetS is used in this experiment. The results of the performance comparison are shown in Table VI. It can be seen that CSNetS achieves state-of-the-art performance with a test error of .
IV-E Results on the SVHN Dataset
SVHN (Street View House Numbers) [39] is a real-world image dataset containing 10 classes representing the digits 0 to 9. There are 630,420 color images in total, which are divided into three sets: 73,257 images in the training set, 26,032 images in the testing set, and 531,131 images in the extra set. More than one digit may exist in an image, and the task is to classify the digit located at the center. We follow the training and testing procedure described by Goodfellow et al. [8]. That is, 400 samples per class are randomly selected from the training set, and 200 samples per class are randomly selected from the extra set. These selected data together form the validation set. The remaining images in the training set and extra set compose the training set. The validation set is only used for hyperparameter selection and is not used during training. Since there are large variations within the same kind of digit in SVHN due to changes of color and brightness, it is much more difficult to recognize digits in SVHN than in MNIST. Therefore, local contrast normalization is used to preprocess the samples. No data augmentation is used in this experiment. To deal with the large variations of digits, we use CSNetM in this experiment. The performance comparison with other methods is shown in Table VII. It can be seen that CSNetM obtains a test error of , which already improves on NIN ( with 1.98M parameters) by 0.45 percentage points. CSNetM achieves the second best performance () which is very close to the best performance with a test error of .
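The validation split described above (a fixed number of samples per class drawn from the training set and from the extra set) can be sketched as follows. The function name `split_validation` and the small synthetic label arrays are illustrative assumptions; the real SVHN labels would be loaded from the dataset files.

```python
import numpy as np

def split_validation(labels, per_class, rng):
    """Randomly pick `per_class` samples of every class for validation.

    Returns (remaining_indices, validation_indices).
    """
    labels = np.asarray(labels)
    val_idx = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        val_idx.append(rng.choice(idx, size=per_class, replace=False))
    val_idx = np.concatenate(val_idx)
    keep = np.ones(labels.shape[0], dtype=bool)
    keep[val_idx] = False
    return np.flatnonzero(keep), val_idx

rng = np.random.default_rng(0)
# Synthetic stand-ins for the SVHN label arrays (10 classes each).
train_labels = np.repeat(np.arange(10), 500)
extra_labels = np.repeat(np.arange(10), 300)

train_keep, val_from_train = split_validation(train_labels, 400, rng)
extra_keep, val_from_extra = split_validation(extra_labels, 200, rng)
print(len(val_from_train) + len(val_from_extra))  # 6000 validation samples
```

The indices in `train_keep` and `extra_keep` together select the images that remain available for training.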
V Conclusion
In this paper, we have presented a novel CNN structure called CSNet. The core of CSNet is to represent a local patch with one neuron, which is obtained by using cascaded subpatch filters. The subpatch filter has two characteristics: (1) the spatial size of the subpatch filter is smaller than that of the input patch, and (2) the subpatch filter consists of an $h \times w$ (with $h < H$ and $w < W$) filter followed by a $1 \times 1$ filter. The role of cascaded subpatch filters can be considered as representing the input patch using a pyramid with the resolution decreasing from $H \times W$ to $1 \times 1$. Owing to its strong feature representation ability, the proposed method achieves state-of-the-art performance.
References
 [1] C. H. Chang, “Deep and shallow architecture of multilayer neural networks,” IEEE Transactions on Neural Networks and Learning Systems, 2015, vol. 26, no. 10, pp. 24772486, 2015.
 [2] L. Shao, D. Wu, and X. Li, “Learning deep and wide: a spectral method for learning deep networks,” IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 12, pp. 23032308, 2014.
 [3] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: a review and new perspectives,” IEEE Transactions Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, Aug. 2013.
 [4] M. Gong, J. Liu, H. Li, Q. Cai, and L. Su, “A Multi objective Sparse Feature Learning Model for Deep Neural Networks,” IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 12, pp. 32633277, 2015.
 [5] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradientbased learning applied to document recognition,” in Proceedings of the IEEE, vol. 86, no. 11, pp. 22782324, 1998.
 [6] R. Girshick, “Fast RCNN,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 14401448.

[7]
R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,”
in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition
, 2014, pp. 580587.  [8] I. J. Goodfellow, D. WardeFarley, M. Mirza, A. Courville, and Y. Bengio, “Maxout networks,” CoRR, abs/1302.4389, 2013.
 [9] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” in Proceedings of European Conference on Computer Vision, 2014, pp. 346361.
 [10] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” CoRR, abs/1502.03167, 2015.
 [11] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: convolutional architecture for fast feature embedding,” CoRR, abs/1408.5093, 2014.
 [12] A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proceedings of Advances in Neural Information Processing Systems, 2012, pp. 1106–1114.
 [13] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu, “Deeply supervised nets,” CoRR, abs/1409.5185, 2014.
 [14] M. Lin, Q. Chen, and S. Yan, “Network in network,” CoRR, abs/1312.4400, 2013.
 [15] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
 [16] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proceedings of International Conference on Machine Learning, 2010, pp. 807–814.
 [17] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: towards real-time object detection with region proposal networks,” in Proceedings of Advances in Neural Information Processing Systems, 2015, pp. 91–99.
 [18] Y. Pang, J. Cao, and X. Li, “Learning Sampling Functions for Efficient Object Detection,” CoRR, abs/1508.05581, 2015.
 [19] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, “FitNets: Hints for thin deep nets,” CoRR, abs/1412.6550, 2014.
 [20] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and F. Li, “ImageNet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, Dec. 2015.
 [21] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “OverFeat: integrated recognition, localization and detection using convolutional networks,” CoRR, abs/1312.6229, 2013.
 [22] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, abs/1409.1556, 2014.
 [23] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Training very deep networks,” in Proceedings of Advances in Neural Information Processing Systems, 2015, pp. 2368–2376.
 [24] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
 [25] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional neural networks,” in Proceedings of European Conference on Computer Vision, 2014, pp. 818–833.
 [26] F. Agostinelli, M. Hoffman, P. Sadowski, and P. Baldi, “Learning activation functions to improve deep neural networks,” CoRR, abs/1412.6830, 2014.
 [27] L. Wan, M. Zeiler, S. Zhang, Y. LeCun, and R. Fergus, “Regularization of neural networks using DropConnect,” in Proceedings of International Conference on Machine Learning, 2013, pp. 1058–1066.
 [28] M. Liang and X. Hu, “Recurrent convolutional neural network for object recognition,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3367–3375.
 [29] P. Agrawal, R. Girshick, and J. Malik, “Analyzing the Performance of Multilayer Neural Networks for Object Recognition,” in Proceedings of European Conference on Computer Vision, 2014, pp. 329–344.
 [30] Z. Ji, Y. Pang, and X. Li, “Relevance Preserving Projection and Ranking for Web Image Search Reranking,” IEEE Transactions on Image Processing, vol. 24, no. 11, pp. 4137–4147, 2015.
 [31] Y. Pang, S. Wang, and Y. Yuan, “Learning Regularized LDA by Clustering,” IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 12, pp. 2191–2201, 2014.
 [32] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” CoRR, abs/1512.03385, 2015.
 [33] X. Jiang, Y. Pang, X. Li, and J. Pan, “Speed up deep neural network based pedestrian detection by sharing features across multiscale models,” Neurocomputing, doi:10.1016/j.neucom.2015.12.042, 2015.
 [34] D. Wu and L. Shao, “Leveraging hierarchical parametric networks for skeletal joints based action segmentation and recognition,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 724–731.
 [35] M. Stollenga, J. Masci, F. J. Gomez, and J. Schmidhuber, “Deep networks with internal selective attention through feedback connections,” in Proceedings of Advances in Neural Information Processing Systems, 2014, pp. 3545–3553.
 [36] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” in Advances in Neural Information Processing Systems Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
 [37] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Master’s thesis, Department of Computer Science, University of Toronto, 2009.
 [38] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, “Striving for Simplicity,” CoRR, abs/1412.6806v3, 2014.
 [39] I. J. Goodfellow, Y. Bulatov, J. Ibarz, S. Arnoud, and V. Shet, “Multi-digit number recognition from street view imagery using deep convolutional neural networks,” CoRR, abs/1312.6082, 2013.
 [40] A. Vedaldi and K. Lenc, “MatConvNet: convolutional neural networks for MATLAB,” in Proceedings of ACM Conference on Multimedia, 2015, pp. 689–692.