Systematic evaluation of CNN advances on the ImageNet

by Dmytro Mishkin, et al.

The paper systematically studies the impact of a range of recent advances in CNN architectures and learning methods on the object categorization (ILSVRC) problem. The evaluation tests the influence of the following architectural choices: non-linearity (ReLU, ELU, maxout, compatibility with batch normalization), pooling variants (stochastic, max, average, mixed), network width, classifier design (convolutional, fully-connected, SPP) and image pre-processing, as well as of the learning parameters: learning rate, batch size, cleanliness of the data, etc. The performance gains of the proposed modifications are first tested individually and then in combination. The sum of individual gains is bigger than the observed improvement when all modifications are introduced, but the "deficit" is small, suggesting independence of their benefits. We show that the use of 128x128 pixel images is sufficient to make qualitative conclusions about optimal network structure that hold for the full size Caffe and VGG nets. The results are obtained an order of magnitude faster than with the standard 224 pixel images.


1 Introduction

Deep convolution networks have become the mainstream method for solving various computer vision tasks, such as image classification ILSVRC15 , object detection ILSVRC15 ; PASCAL2010 , semantic segmentation Dai2015 , image retrieval Tolias2016 , tracking Nam2015 , text detection Jaderberg2014 , stereo matching Zbontar2014 , and many others.

Besides two classic works on training neural networks, LeCun1998 and Bengio2012 , which are still highly relevant, there is very little guidance or theory on the plethora of design choices and hyper-parameter settings of CNNs, with the consequence that researchers proceed by trial-and-error experimentation and architecture copying, sticking to established net types. With good results in the ImageNet competition, AlexNet AlexNet2012 , VGGNet VGGNet2015 and GoogLeNet (Inception) Googlenet2015 have become the de-facto standard.

Improvements of many components of the CNN architecture like the non-linearity type, pooling, structure and learning have been recently proposed. First applied in the ILSVRC ILSVRC15 competition, they have been adopted in different research areas.

The contributions of the recent CNN improvements and their interaction have not been systematically evaluated. We survey the recent developments and perform a large scale experimental study that considers the choice of non-linearity, pooling, learning rate policy, classifier design, network width, batch normalization BatchNorm2015 . We did not include ResNets DeepResNet2015 – a recent development achieving excellent results – since they have been well covered in papers He2016 ; Szegedy2016 ; WideResNets2016 ; FractalNets2016 .

There are three main contributions of the paper. First, we survey and present baseline results for a wide variety of architectures and design choices, both alone and in combination. Based on the large-scale evaluation, we provide novel recommendations and insights about constructing deep convolutional networks. Second, we present ImageNet-128px as a fast (24 hours of training AlexNet on a GTX980) and reliable benchmark: the relative order of results for popular architectures does not change compared to the common image size of 224x224 or even 300x300 pixels. Last, but not least, the benchmark is fully reproducible and all scripts and data are available online.

The paper is structured as follows. In Section 2.1 we explain and validate the experiment design. In Section 3, the influence of a range of hyper-parameters is evaluated in isolation. The related literature is reviewed in the corresponding experiment sections. Section 4 is devoted to the combination of the best hyper-parameter settings and to "squeezing-the-last-percentage-points" for a given architecture recommendation. The paper is concluded in Section 5.

2 Evaluation

Standard CaffeNet parameters and architecture are shown in Table 2. The full list of tested attributes is given in Table 1.

| Hyper-parameter | Variants |
| --- | --- |
| Non-linearity | linear, tanh, sigmoid, ReLU, VLReLU, RReLU, PReLU, ELU, maxout, APL, combination |
| Batch Normalization (BN) | before non-linearity, after non-linearity |
| BN + non-linearity | linear, tanh, sigmoid, ReLU, VLReLU, RReLU, PReLU, ELU, maxout |
| Pooling | max, average, stochastic, max+average, strided convolution |
| Pooling window size | 3x3, 2x2, 3x3 with zero-padding |
| Learning rate decay policy | step, square, square root, linear |
| Colorspace & Pre-processing | RGB, HSV, YCrCb, grayscale, learned, CLAHE, histogram equalized |
| Classifier design | pooling-FC-FC-clf, SPP-FC-FC-clf, |
| Network width | 1/4, 1/(2√2), 1/2, 1/√2, 1, √2, 2, 2√2, 4, 4√2 |
| Input image size | 64, 96, 128, 180, 224 |
| Dataset size | 200K, 400K, 600K, 800K, 1200K (full) |
| Batch size | 1, 32, 64, 128, 256, 512, 1024 |
| Percentage of noisy data | 0, 5%, 10%, 15%, 32% |
| Using bias | yes/no |

Table 1: List of hyper-parameters tested.

2.1 Evaluation framework

All tested networks were trained on the 1000 object category classification problem on the ImageNet dataset ILSVRC15 . The set consists of a 1.2M image training set, a 50K image validation set and a 100K image test set. The test set is not used in the experiments. The commonly used pre-processing includes image rescaling to 256xN and then cropping a random 224x224 square AlexNet2012 ; Howard2013 . The setup achieves good results in classification, but training a network of this size takes several days even on modern GPUs. We thus propose to limit the image size to 144xN (denoted as ImageNet-128px). For example, the CaffeNet jia2014caffe is trained within 24 hours using an NVIDIA GTX980 on ImageNet-128px.

Figure 1: Impact of image and network size on top-1 accuracy.

2.1.1 Architectures

The input size reduction is validated by training CaffeNet, GoogLeNet and VGGNet on both the reduced and standard image sizes. The results are shown in Figure 1. The reduction of the input image size leads to a consistent drop in top-1 accuracy of around 6% for all three popular architectures and does not change their relative order (VGGNet > GoogLeNet > CaffeNet) or accuracy difference.

In order to decrease the probability of overfitting and to make experiments less demanding in memory, another change to CaffeNet is made: the number of filters in fully-connected layers 6 and 7 was reduced by a factor of two, from 4096 to 2048. The results validating the resolution reduction are presented in Figure 1.

The parameters and architecture of the standard CaffeNet are shown in Table 2. For experiments we used CaffeNet with 2x thinner fully-connected layers, named CaffeNet128-FC2048. The architecture can be denoted as 96C11/4 MP3/2 192G2C5/2 MP3/2 384G2C3 384C3 256G2C3 MP3/2 2048C3 2048C1 1000C1. Here we used fully-convolutional notation for fully-connected layers, which are equivalent when the image input size is fixed to 128x128 px. The default activation function is ReLU and it is put after every convolution layer, except the last 1000-way softmax classifier.

2.1.2 Learning

SGD with momentum 0.9 is used for learning. The initial learning rate is set to 0.01 and decreased by a factor of ten after each 100K iterations until learning stops after 320K iterations. The L2 weight decay for convolutional weights is set to 0.0005 and it is not applied to biases. Dropout Dropout2014 with probability 0.5 is used before the two last layers. All the networks were initialized with LSUV Mishkin2016LSUV . Biases are initialized to zero. Since the LSUV initialization works under the assumption of preserving unit variance of the input, pixel intensities were scaled by 0.04 after subtracting the mean of the BGR pixel values (104, 117, 124).
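The input scaling described above can be written out as a minimal Python sketch. It only restates out = 0.04 * (BGR - mean); the function names and the nested-list image layout are illustrative assumptions and are not taken from the authors' scripts.

```python
BGR_MEAN = (104.0, 117.0, 124.0)  # per-channel mean quoted in the text
SCALE = 0.04                      # scaling factor quoted in the text


def preprocess_pixel(bgr):
    """Center and scale one BGR pixel: out = 0.04 * (BGR - mean)."""
    return tuple(SCALE * (value - mean) for value, mean in zip(bgr, BGR_MEAN))


def preprocess_image(image):
    """Apply the same transform to an image stored as rows of BGR tuples.

    The nested-list layout is a hypothetical choice made for this sketch.
    """
    return [[preprocess_pixel(pixel) for pixel in row] for row in image]


if __name__ == "__main__":
    # The mean pixel maps to the origin; brighter pixels map to positive values.
    print(preprocess_pixel((104, 117, 124)))
    print(preprocess_pixel((255, 255, 255)))
```

With an 8-bit input the transformed values stay within a few units of zero, which fits the unit-variance assumption mentioned for LSUV.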

| Layer | Parameters |
| --- | --- |
| input | image 128x128 px, random crop from 144xN, random mirror |
| pre-process | out = 0.04 (BGR - (104; 117; 124)) |
| conv1 | conv 11x11x96, stride 4 |
| pool1 | max pool 3x3, stride 2 |
| conv2 | conv 5x5x256, stride 2, pad 1, group 2 |
| pool2 | max pool 3x3, stride 2 |
| conv3 | conv 3x3x384, pad 1 |
| conv4 | conv 3x3x384, pad 1, group 2 |
| conv5 | conv 3x3x256, pad 1, group 2 |
| pool5 | max pool 3x3, stride 2 |
| fc6 | fully-connected 4096 |
| drop6 | dropout ratio 0.5 |
| fc7 | fully-connected 4096 |
| drop7 | dropout ratio 0.5 |
| fc8-clf | softmax-1000 |

Table 2: The basic CaffeNet architecture used in most experiments. Pad 1: zero-padding on the image boundary with 1 pixel. Group 2 convolution: filters are split into 2 separate groups. The architecture is denoted in "shorthand" as 96C11/4 MP3/2 192G2C5/2 MP3/2 384G2C3 384C3 256G2C3 MP3/2 2048C3 2048C1 1000C1.

3 Single experiments

This section is devoted to the experiments with a single hyper-parameter or design choice per experiment.

3.1 Activation functions

3.1.1 Previous work

The activation functions for neural networks are a hot topic; many functions have been proposed since the ReLU discovery ReLU2011 . The first group is related to ReLU, i.e. LeakyReLU Maas2013 and Very Leaky ReLU GrahamCIFAR , RReLU RReLU2015 , PReLU PReLU2015 and its generalized version, APL APL2014 , and ELU ELU2016 . Others are based on different ideas, e.g. maxout Maxout2013 , MBA MBA2016 , etc. However, to the best of our knowledge, only a small fraction of these activation functions have been evaluated on an ImageNet-scale dataset. And when they have, e.g. ELU, the network architecture used in the evaluation was designed specifically for the experiment and is not commonly used.

3.1.2 Experiment

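| Name | Formula | Year |
| --- | --- | --- |
| none | $y = x$ | - |
| sigmoid | $y = \frac{1}{1 + e^{-x}}$ | 1986 |
| tanh | $y = \frac{e^{2x} - 1}{e^{2x} + 1}$ | 1986 |
| ReLU | $y = \max(x, 0)$ | 2010 |
| (centered) SoftPlus | $y = \ln(e^{x} + 1) - \ln 2$ | 2011 |
| LReLU | $y = \max(x, \alpha x)$, fixed $\alpha$ | 2011 |
| maxout | $y = \max(W_1 x + b_1, W_2 x + b_2)$ | 2013 |
| APL | $y = \max(x, 0) + \sum_{s=1}^{S} a^{s} \max(0, -x + b^{s})$ | 2014 |
| VLReLU | $y = \max(x, \alpha x)$, fixed $\alpha$ | 2014 |
| RReLU | $y = \max(x, \alpha x)$, $\alpha$ = random(0.1, 0.5) | 2015 |
| PReLU | $y = \max(x, \alpha x)$, $\alpha$ is learnable | 2015 |
| ELU | $y = x$ if $x \geq 0$, else $\alpha (e^{x} - 1)$ | 2015 |

Table 3: Non-linearities tested.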

We have tested the most popular activation functions and all those with available or trivial implementations: ReLU, tanh, sigmoid, VLReLU, RReLU, PReLU, ELU, linear, maxout, APL, SoftPlus. Formulas and references are given in Table 3. We have selected APL and maxout with two linear pieces. Maxout is tested in two modifications: MaxW, which has the same effective network width and therefore doubles the number of parameters and the computation cost because of the two linear pieces, and MaxS, which has the same computational complexity, with each piece thinner. Besides this, we have tested the "optimally scaled" tanh proposed by LeCun LeCun1998 . We have also tried to train a sigmoid Rumelhart1986 network, but the initial loss never decreased. Finally, as proposed by Swietojanski ConvReLUFCMaxout2014 , we have tested a combination of ReLU for the first layers and maxout for the last layers of the network.
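For orientation, a few of the non-linearities from Table 3 are sketched below as scalar Python functions. The definitions follow the standard forms from the literature; the slope value used in the example (0.1), the ELU default alpha = 1 and the scalar maxout weights are illustrative assumptions, not settings confirmed by the experiments.

```python
import math


def relu(x):
    """ReLU: y = max(x, 0)."""
    return max(x, 0.0)


def leaky_relu(x, alpha):
    """Common form of LReLU / VLReLU / PReLU: y = max(x, alpha * x), 0 < alpha < 1.

    The variants differ only in where alpha comes from (fixed, random, learned).
    """
    return max(x, alpha * x)


def elu(x, alpha=1.0):
    """ELU: identity for x >= 0, alpha * (exp(x) - 1) otherwise (alpha assumed 1)."""
    return x if x >= 0 else alpha * (math.exp(x) - 1.0)


def centered_softplus(x):
    """Softplus shifted by ln 2 so that the output is zero at x = 0."""
    return math.log(math.exp(x) + 1.0) - math.log(2.0)


def maxout2(x, w1, b1, w2, b2):
    """Two-piece maxout on a scalar input: the larger of two affine functions."""
    return max(w1 * x + b1, w2 * x + b2)


if __name__ == "__main__":
    for x in (-2.0, 0.0, 2.0):
        print(x, relu(x), leaky_relu(x, 0.1), elu(x), centered_softplus(x))
```

The maxout form makes the cost argument in the text concrete: each output needs two affine maps, so keeping the width (MaxW) doubles parameters and computation, while MaxS halves the width of each piece to stay at the ReLU budget.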

Results are shown in Figure 2. The best single performing activation function similar in complexity to ReLU is ELU. The parametric PReLU performed on par. The performance of the centered softplus is the same as for ELU. Surprisingly, Very Leaky ReLU, popular for DCGAN networks DCGAN2015 and for small datasets, does not outperform vanilla ReLU. Interestingly, the network with no non-linearity has respectable performance, 38.9% top-1 accuracy on ImageNet, which is not much worse than the tanh network.

The Swietojanski ConvReLUFCMaxout2014 hypothesis about maxout power in the final layers is confirmed and combined ELU (after convolutional layers) + maxout (after fully connected layers) shows the best performance among non-linearities with speed close to ReLU. Wide maxout outperforms the rest of the competitors at a higher computational cost.

Figure 2: Top-1 accuracy gain over ReLU in the CaffeNet-128 architecture. MaxS stands for "maxout, same complexity", MaxW for "maxout, same width", CSoftplus for centered softplus. The baseline (ReLU) accuracy is 47.1%.

3.2 Pooling

3.2.1 Previous work

Pooling, combined with striding, is a common way to achieve a degree of invariance together with a reduction of the spatial size of feature maps. The most popular options are max pooling and average pooling. Among the recent advances are Stochastic pooling StochasticPool2013 , LP-Norm pooling LPNormPool2013 and Tree-Gated pooling GenPool2015 . Only the authors of the last paper have tested their pooling on ImageNet.

The pooling receptive field is another design choice. Krizhevsky et al. AlexNet2012 claimed superiority of overlapping pooling with 3x3 window size and stride 2, while VGGNet VGGNet2015 uses a non-overlapping 2x2 window.

3.2.2 Experiment

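| Name | Formula | Year |
| --- | --- | --- |
| max | $y = \max_{i,j=1}^{h,w} x_{i,j}$ | 1989 |
| average | $y = \frac{1}{hw} \sum_{i,j=1}^{h,w} x_{i,j}$ | 1989 |
| stochastic | $y = x_{i,j}$ with prob. $\frac{x_{i,j}}{\sum_{i,j=1}^{h,w} x_{i,j}}$ | 2013 |
| strided convolution | - | 2014 |
| max + average | $y = \max_{i,j=1}^{h,w} x_{i,j} + \frac{1}{hw} \sum_{i,j=1}^{h,w} x_{i,j}$ | 2015 |

Table 4: Poolings tested.

Figure 3: Top-1 accuracy gain over max pooling for the CaffeNet-128 architecture. Left: different pooling methods; right: different receptive field sizes. Stoch stands for stochastic pooling, "stoch no dropout" for a network with stochastic pooling and the drop6 and drop7 layers turned off.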

We have tested (see Table 4) average, max and stochastic pooling, the sum of average and max pooling proposed by Lee et al. GenPool2015 , and skipping pooling altogether, replacing it with strided convolutions as proposed by Springenberg et al. ALLCNN2015 . We have also tried Tree and Gated poolings GenPool2015 , but we encountered convergence problems and the results depended strongly on the input image size. We do not know whether this is a problem of the implementation or of the method itself, and therefore omitted the results.

The results are shown in Figure 3, left. Stochastic pooling had very bad results. In order to check whether this was due to extreme randomization by the combination of stochastic pooling and dropout, we trained a network without dropout. This decreased accuracy even more. The best results were obtained by a combination of max and average pooling. Our guess is that max pooling brings selectivity and invariance, while average pooling allows using gradients of all filters, instead of throwing away 3/4 of the information as done by non-overlapping 2x2 max pooling.
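The max, average and max+average variants can be illustrated with a small pure-Python sketch. It operates on a single 2-D feature map stored as a list of rows, uses no padding and keeps only windows that fit entirely inside the map; these simplifications and the function name `pool2d` are assumptions made for the illustration.

```python
def pool2d(fmap, size, stride, mode):
    """Pool a 2-D feature map (list of rows) with a square window.

    mode is one of "max", "average" or "max+average" (the sum of both).
    """
    height, width = len(fmap), len(fmap[0])
    out = []
    for top in range(0, height - size + 1, stride):
        row = []
        for left in range(0, width - size + 1, stride):
            window = [fmap[top + di][left + dj]
                      for di in range(size) for dj in range(size)]
            w_max = max(window)
            w_avg = sum(window) / len(window)
            if mode == "max":
                row.append(w_max)
            elif mode == "average":
                row.append(w_avg)
            elif mode == "max+average":
                row.append(w_max + w_avg)
            else:
                raise ValueError("unknown pooling mode: " + mode)
        out.append(row)
    return out


if __name__ == "__main__":
    fmap = [[1, 2, 3, 4],
            [5, 6, 7, 8],
            [9, 10, 11, 12],
            [13, 14, 15, 16]]
    for mode in ("max", "average", "max+average"):
        print(mode, pool2d(fmap, size=2, stride=2, mode=mode))
```

In the 2x2 max case only one of the four inputs of each window contributes to the output (and hence receives gradient), which is the "3/4 of the information" mentioned above; the added average term passes a share of the gradient to all four.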

The second experiment is about the receptive field size. The results are shown in Figure 3, right. Overlapping pooling is inferior to a non-overlapping 2x2 window, but wins if zero-padding is done. This can be explained by the fact that better results are obtained for larger outputs: 3x3/2 pooling leads to a 3x3 spatial size of the pool5 feature map, 2x2/2 leads to a 4x4 pool5, while 3x3/2 with zero-padding of 1 leads to 5x5. This observation means there is a trade-off between speed and performance.
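The pool5 sizes quoted above can be checked with a short calculation. The sketch below is built on assumptions: it uses the output-size rules of Caffe as commonly documented (convolution rounds down, pooling rounds up and drops a window that would start inside the padding), and it takes conv2 with stride 1 and pad 2 as in the standard Caffe reference network, which differs from the conv2 entry listed in Table 2. Under these assumptions the three pooling variants give 3, 4 and 5.

```python
import math


def conv_out(size, kernel, stride=1, pad=0):
    """Convolution output size, rounding down (Caffe-style)."""
    return (size + 2 * pad - kernel) // stride + 1


def pool_out(size, kernel, stride, pad=0):
    """Pooling output size, rounding up (Caffe-style).

    With padding, a last window that would start outside the padded input
    is dropped.
    """
    out = math.ceil((size + 2 * pad - kernel) / stride) + 1
    if pad > 0 and (out - 1) * stride >= size + pad:
        out -= 1
    return out


def pool5_size(pool_kernel, pool_pad, image=128):
    """Spatial size of pool5 for a CaffeNet-like stack on a square input."""
    s = conv_out(image, 11, stride=4)              # conv1: 11x11, stride 4
    s = pool_out(s, pool_kernel, 2, pool_pad)      # pool1
    s = conv_out(s, 5, stride=1, pad=2)            # conv2: ASSUMED stride 1, pad 2
    s = pool_out(s, pool_kernel, 2, pool_pad)      # pool2
    for _ in range(3):                             # conv3..conv5: 3x3, pad 1
        s = conv_out(s, 3, pad=1)
    return pool_out(s, pool_kernel, 2, pool_pad)   # pool5


if __name__ == "__main__":
    print("3x3/2       ->", pool5_size(3, 0))
    print("2x2/2       ->", pool5_size(2, 0))
    print("3x3/2 pad=1 ->", pool5_size(3, 1))
```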

3.3 Learning rate policy

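Figure 4: Left: learning rate decay policy, right: validation accuracy. The formulas for each policy are given in Table 5.

| Name | Formula | Parameters | Accuracy |
| --- | --- | --- | --- |
| step | $lr = L_0 \gamma^{\lfloor i/S \rfloor}$ | $S$ = 100K, $\gamma$ = 0.1 | 0.471 |
| square | $lr = L_0 (1 - i/M)^2$ | | 0.483 |
| square root | $lr = L_0 \sqrt{1 - i/M}$ | | 0.483 |
| linear | $lr = L_0 (1 - i/M)$ | | 0.493 |

Table 5: Learning rate decay policies tested in the paper. $L_0$: initial learning rate, $M$: number of learning iterations, $i$: current iteration, $S$: step iteration, $\gamma$: decay coefficient.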

Learning rate is one of the most important hyper-parameters influencing the final CNN performance. Surprisingly, the most commonly used learning rate decay policy is "reduce the learning rate 10x when the validation error stops decreasing", adopted with no parameter search. While this works well in practice, such a lazy policy can be sub-optimal. We have tested four learning rate policies: step, quadratic and square root decay (used for training GoogLeNet by BVLC jia2014caffe ), and linear decay. The actual learning rate dynamics are shown in Figure 4, left. The validation accuracy is shown on the right. Linear decay gives the best results.
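The four schedules can be sketched in a few lines of Python. The sketch assumes that the step policy multiplies the rate by 0.1 every 100K iterations (as stated in Section 2.1.2) and that the other three follow a polynomial decay of the form (1 - i/M)^p with M = 320K and p = 2, 0.5 and 1; the function names are illustrative.

```python
def step_lr(i, base_lr=0.01, gamma=0.1, step=100_000):
    """Step policy: multiply the rate by gamma after every `step` iterations."""
    return base_lr * gamma ** (i // step)


def poly_lr(i, power, base_lr=0.01, max_iter=320_000):
    """Polynomial decay from base_lr at i = 0 down to zero at i = max_iter."""
    return base_lr * (1.0 - i / max_iter) ** power


def square_lr(i):
    return poly_lr(i, 2.0)


def sqrt_lr(i):
    return poly_lr(i, 0.5)


def linear_lr(i):
    return poly_lr(i, 1.0)


if __name__ == "__main__":
    for i in (0, 100_000, 160_000, 300_000):
        print(i, step_lr(i), square_lr(i), sqrt_lr(i), linear_lr(i))
```

Printing a few iterations shows the qualitative difference visible in Figure 4: the step policy stays at the initial rate for the first 100K iterations, while the polynomial policies start decaying immediately and reach zero exactly at the last iteration.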

3.4 Image pre-processing

3.4.1 Previous work

The commonly used input to a CNN is raw RGB pixels and the commonly adopted recommendation is not to use any pre-processing. There has not been much research on the optimal colorspace or pre-processing techniques for CNNs. Rachmadi and Purnama ColorCar2015 explored different colorspaces for vehicle color identification, Dong ColorSuperRes2015 compared YCrCb and RGB channels for image super-resolution, and Graham GrahamRetinopathy2015 extracted local average color from retina images in the winning solution to the Kaggle competition.

3.4.2 Experiment

Figure 5: Left: performance of using various colorspaces and pre-processing. Right: learned colorspace transformations. Parameters are given in Table 6.

| Name | Architecture | Non-linearity | Acc. |
| --- | --- | --- | --- |
| A | RGB → conv1x1x10 → conv1x1x3 | tanh | 0.463 |
| RGB | RGB | - | 0.471 |
| B | RGB → conv1x1x3 → conv1x1x3 | VLReLU | 0.480 |
| C | RGB → conv1x1x10 → conv1x1x3 + RGB | VLReLU | 0.482 |
| D | [RGB; log(RGB)] → conv1x1x10 → conv1x1x3 | VLReLU | 0.482 |
| E | RGB → conv1x1x16 → conv1x1x3 | VLReLU | 0.483 |
| F | RGB → conv1x1x10 → conv1x1x3 | VLReLU | 0.485 |

Table 6: Mini-networks for learned colorspace transformations, placed after the image and before the conv1 layer. In all cases RGB means the scaled and centered input 0.04 * (Img - (104, 117, 124)).

The pre-processing experiment is divided into two parts. First, we have tested popular handcrafted image pre-processing methods and colorspaces. Since all transformations were done on-the-fly, we first tested whether the calculation of the mean pixel and variance over the training set can be replaced by applying batch normalization to the input images. It decreases the final accuracy by 0.3% and can be seen as the baseline for all other methods. We have tested HSV, YCrCb, Lab, RGB and single-channel grayscale colorspaces. Results are shown in Figure 5. The experiment confirms that RGB is the most suitable colorspace for CNNs. The Lab-based network did not improve on the initial loss after 10K iterations. Removing color information from images costs from 5.8% to 5.2% of accuracy, for OpenCV RGB2Gray and learned decolorization respectively. Global HistEq1977 and local (CLAHE CLAHE1994 ) histogram equalizations hurt performance as well.

Second, we let the network learn a transformation via 1x1 convolutions, so that no pixel neighbors are involved. The mini-network architectures are described in Table 6. The learning process is joint with the main network and can be seen as extending the CaffeNet architecture with several 1x1 convolutions at the input. The best performing network gave a 1.4% absolute accuracy gain without a significant computational cost.

3.5 Batch normalization

Batch normalization BatchNorm2015 (BN) is a recent method that solves the gradient exploding/vanishing problem and guarantees a near-optimal learning regime for the layer following the batch normalized one. Following Mishkin2016LSUV , we first tested different options for where to put BN: before or after the non-linearity. The results presented in Table 7 are surprisingly contradictory: the CaffeNet architecture prefers the Conv-ReLU-BN-Conv placement, while GoogLeNet prefers Conv-BN-ReLU-Conv. Moreover, the results for GoogLeNet are inferior to the plain network. The difference to BatchNorm2015 is that we have not changed any other parameters except using BN, while in the original paper the authors decreased regularization (both weight decay and dropout), changed the learning rate decay policy and applied an additional training set re-shuffling. Also, GoogLeNet seems to behave differently from CaffeNet and VGGNet w.r.t. other modifications, see Section 4.

| Network | No BN | BN before non-linearity | BN after non-linearity |
| --- | --- | --- | --- |
| CaffeNet128-FC2048 | 0.471 | 0.478 | 0.499 |
| GoogLeNet128 | 0.619 | 0.603 | 0.596 |

Table 7: Top-1 accuracy on ImageNet-128px, batch normalization placement. ReLU activation is used.

For the next experiment with BN and activations, we selected placement after non-linearity. Results are shown in Figure 6. Batch normalization washes out differences between ReLU-family variants, so there is no need to use the more complex variants. Sigmoid with BN outperforms ReLU without it, but, surprisingly, tanh with BN shows worse accuracy than sigmoid with BN.

Figure 6: Top-1 accuracy gain over ReLU without batch normalization (BN) in CaffeNet-128 architecture. The baseline – ReLU – accuracy is 47.1%.

3.6 Classifier design

3.6.1 Previous work

The CNN architecture can be seen as the integration of a feature detector followed by a classifier. Ren et al. NoC2015 proposed to consider the convolutional layers of AlexNet as a feature extractor and the fully-connected layers as a 2-layer MLP classifier. They argued that 2 fully-connected layers are not the optimal design and explored various architectures instead. But they considered only a pre-trained CNN or HOGs as the feature extractor, so they explored mostly the transfer learning scenario, in which most of the network weights are frozen. Also, they explored architectures with additional convolution layers, which can be seen not as a better classifier, but as an enhancement of the feature extractor.

There are three most popular approaches to classifier design. First, the final layer of the feature extractor is a max pooling layer and the classifier is a one or two layer MLP, as is done in LeNet LeNet1998 , AlexNet AlexNet2012 and VGGNet VGGNet2015 . Second, a spatial pyramid pooling (SPP) layer SPPNet2014 is used instead of the pooling layer, followed by a two layer MLP. The third architecture consists of an average pooling layer, squashing the spatial dimensions, followed by a softmax classifier without any feature transform. This variant is used in GoogLeNet Googlenet2015 and ResNet DeepResNet2015 .

3.6.2 Experiment

We have explored the following variants: the default 2-layer MLP; SPPNet with 2 and 3 pyramid levels; removing the pool5 layer; treating the fully-connected layers as convolutional, which allows the use of zero-padding and therefore increases the effective number of training examples for this layer; averaging features before the softmax layer; and averaging the spatial predictions of the softmax layer NiN2013 . The results are shown in Figure 7. The best results are obtained when predictions are averaged over all spatial positions and the MLP layers are treated as convolutions with zero padding. The advantage of SPP over standard max pooling is less pronounced.
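The difference between the two averaging variants can be made concrete with a toy example. The sketch assumes that the final classifier is linear, so that averaging features before it is equivalent to averaging the per-position logits; the function names and the tiny two-class, two-position "logit map" are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_over_positions(vectors):
    """Average a list of equally long vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def clf_then_avepool(logit_map):
    """'clf-avepool': softmax at every spatial position, then average the predictions."""
    return mean_over_positions([softmax(logits) for logits in logit_map])


def avepool_then_clf(logit_map):
    """'avepool-clf': average over positions first, then apply a single softmax."""
    return softmax(mean_over_positions(logit_map))


if __name__ == "__main__":
    # Two spatial positions, two classes; the positions disagree.
    logit_map = [[4.0, 0.0], [0.0, 1.0]]
    print("clf-avepool:", clf_then_avepool(logit_map))
    print("avepool-clf:", avepool_then_clf(logit_map))
```

When positions disagree, averaging predictions lets each position cast a bounded vote, whereas averaging first lets a single position with large scores dominate; the two orders therefore give different class probabilities.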

Figure 7: Classifier design: Top-1 accuracy gain over standard CaffeNet-128 architecture.

3.7 Batch size and learning rate

Figure 8: Batch size and initial learning rate impact to the accuracy

The mini-batch size is always a trade-off between computation efficiency (the GPU architecture prefers it large enough) and accuracy; early work by Wilson and Martinez BatchSize2003 shows the superiority of online training over batch training. Here we explore the influence of mini-batch size on the final accuracy. Experiments show that keeping a constant learning rate for different mini-batch sizes has a negative impact on performance. We have also tested the heuristic proposed by Krizhevsky OneWeirdTrick2014 , which suggests scaling the learning rate proportionally to the mini-batch size, i.e. keeping their ratio constant. Results are shown in Figure 8. The heuristic works, but large (512 and more) mini-batch sizes lead to a quite significant decrease in performance. At the other extreme, online training (a mini-batch with a single example) does not bring accuracy gains over 64 or 256, but significantly slows down training in wall-clock time.
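The scaling heuristic amounts to one line of arithmetic. The sketch below takes the baseline setting (mini-batch 256, learning rate 0.01) as the reference point and reproduces the scaled learning rates listed in Table 9; the function name and the choice of reference point are assumptions made for the illustration.

```python
def scaled_lr(batch_size, ref_lr=0.01, ref_batch=256):
    """Scale the learning rate linearly with the mini-batch size.

    The ratio lr / batch_size stays equal to ref_lr / ref_batch.
    """
    return ref_lr * batch_size / ref_batch


if __name__ == "__main__":
    for bs in (1024, 512, 256, 128, 64, 32, 1):
        print("BS=%d, lr=%g" % (bs, scaled_lr(bs)))
```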

3.8 Network width

All the advances in the ImageNet competition so far were caused by architectural improvements. To the best of our knowledge, there is no study of the dependence of the final accuracy on network width. Canziani CNNEfficiency2016 did a comparative analysis of the ImageNet winners in terms of accuracy, number of parameters and computational complexity, but it is a comparison of different architectures. In this subsection we evaluate how far one can get by increasing the CaffeNet width, with no other changes. The results are shown in Figure 9. The original architecture is close to optimal in the accuracy per FLOPS sense: a decrease in the number of filters leads to a quick and significant accuracy drop, while making the network thicker brings gains, but they saturate quickly. Making the network more than 3 times thicker leads to a very limited accuracy gain.

Figure 9: Network width impact on the accuracy.

3.9 Input image size

The input image size, as it brings additional information and training samples for convolution filters, plays a very important role. Our initial experiment, shown in Figure 1, indicates that CaffeNet trained on 227x227 images can compete with the much more complex GoogLeNet architecture trained on smaller images. So the obvious question is how the final accuracy depends on the image size.

We have performed an experiment with different input image sizes: 96, 128, 180 and 224 pixels wide. The results are presented in Figure 10. The bad news is that while accuracy depends on image size linearly, the needed computations grow quadratically, so it is a very expensive way to gain performance. In the second part of the experiment, we kept the spatial output size of the pool1 layer fixed while changing the input image size. To achieve this, we changed the stride and filter size of the conv1 layer accordingly. The results show that the gain from a large image size mostly (after some minimum value) comes from the larger spatial size of deeper layers rather than from the unseen image details.

Figure 10: Input image size impact on the accuracy

3.10 Dataset size and noisy labels

Figure 11: Training dataset size and cleanliness impact on the accuracy

3.10.1 Previous work

The performance of current deep neural networks is highly dependent on the dataset size. Unfortunately, not much research has been published on this topic. In DeepFace taigman2014deepface , the authors show that a dataset reduction from 4.4M to 1.5M leads to a 1.74% accuracy drop. A similar dependence is shown by Schroff schroff2015facenet , but on an extra-large dataset: decreasing the dataset size from 260M to 2.6M leads to an accuracy drop of 10%. But these datasets are private and the experiments are not reproducible. Another important property of a dataset is the cleanliness of the data. For example, an estimate of the human error on ImageNet is 5.1% for top-5 ILSVRC15 . To create ImageNet, each image was voted on by ten different people ILSVRC15 .

3.10.2 Experiment

We explore the dependency between the accuracy and the dataset size/cleanliness on ImageNet. For the dataset size experiment, 200, 400, 600 and 800 thousand examples were randomly chosen from the full training set. For each reduced dataset, a CaffeNet is trained from scratch. For the cleanliness test, we replaced the labels with a random incorrect one for 5%, 10%, 15% and 32% of the examples. The labels are fixed, unlike in the recent work on disturbing labels as a regularization method DisturbLabel2016 .
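A label corruption of this kind could be implemented roughly as follows. This is a sketch under stated assumptions: the wrong label is drawn uniformly from the other classes, the set of corrupted examples is drawn once with a fixed seed so that the noisy labels stay the same for the whole training run, and the function name `corrupt_labels` is hypothetical.

```python
import random


def corrupt_labels(labels, noise_fraction, num_classes, seed=0):
    """Return a copy of `labels` with a fixed fraction replaced by wrong classes.

    Each selected example gets a label drawn uniformly from the classes
    other than its true one, so it is guaranteed to be incorrect.
    """
    rng = random.Random(seed)           # fixed seed: the noise is drawn only once
    noisy = list(labels)
    n_noisy = round(noise_fraction * len(labels))
    for idx in rng.sample(range(len(labels)), n_noisy):
        wrong = rng.randrange(num_classes - 1)
        if wrong >= noisy[idx]:
            wrong += 1                  # skip over the true class
        noisy[idx] = wrong
    return noisy


if __name__ == "__main__":
    clean = [i % 1000 for i in range(10_000)]
    noisy = corrupt_labels(clean, 0.32, num_classes=1000)
    changed = sum(a != b for a, b in zip(clean, noisy))
    print("changed labels:", changed, "of", len(clean))
```

Because the corrupted labels are generated before training and then kept, the network sees the same wrong label for a given image in every epoch, in contrast to methods that redraw the noise at each iteration.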

The results are shown in Figure 11, which clearly shows that a bigger (and more diverse) dataset brings an improvement. There is a minimum size below which performance quickly degrades. Cleaner data outperforms noisier data: a clean dataset with 400K images performs on par with the 1.2M dataset with 800K correct images.

3.11 Bias in convolution layers

We conducted a simple experiment on the importance of the bias in the convolution and fully-connected layers. In the first run, the network is trained as usual; in the second, the biases are initialized with zeros and the bias learning rate is set to zero. The network without biases shows 2.6% lower accuracy than the default, see Table 8.

| Network | Accuracy |
| --- | --- |
| With bias | 0.471 |
| Without bias | 0.445 |

Table 8: Influence of the bias in convolution and fully-connected layers. Top-1 accuracy on ImageNet-128px.

4 Best-of-all experiments

Finally, we test how all the improvements that do not increase the computational cost perform together. We combine: the learned colorspace transform F, ELU as the non-linearity for convolution layers and maxout for fully-connected layers, the linear learning rate decay policy, and average plus max pooling. The improvements are applied to CaffeNet128, CaffeNet224, VGGNet128 and GoogLeNet128.

The first three demonstrated consistent performance growth (see Figure 12), while GoogLeNet performance degraded, as it was found for batch normalization. Possibly, this is due to the complex and optimized structure of the GoogLeNet network. Unfortunately, the cost of training VGGNet224 is prohibitive, one month of GPU time, so we have not subjected it to the tests yet.

Figure 12: Applying all improvements that do not change the feature map size: linear learning rate decay policy, colorspace transformation "F", ELU non-linearity in convolution layers, maxout non-linearity in fully-connected layers and a sum of average and max pooling.

5 Conclusions

We have compared systematically a set of recent CNN advances on large scale dataset. We have shown that benchmarking can be done at an affordable time and computation cost. A summary of recommendations:

  • use ELU non-linearity without batchnorm or ReLU with it.

  • apply a learned colorspace transformation of RGB.

  • use the linear learning rate decay policy.

  • use a sum of the average and max pooling layers.

  • use mini-batch size around 128 or 256. If this is too big for your GPU, decrease the learning rate proportionally to the batch size.

  • use fully-connected layers as convolutional and average the predictions for the final decision.

  • when investing in increasing the training set size, check whether a plateau has been reached.

  • cleanliness of the data is more important than the size.

  • if you cannot increase the input image size, reduce the stride in the subsequent layers; it has roughly the same effect.

  • if your network has a complex and highly optimized architecture, like e.g. GoogLeNet, be careful with modifications.

| Group | Name | Acc. [%] |
| --- | --- | --- |
| Baseline | | 47.1 |
| Non-linearity | Linear | 38.9 |
| | tanh | 40.1 |
| | VLReLU | 46.9 |
| | APL2 | 47.1 |
| | ReLU | 47.1 |
| | RReLU | 47.8 |
| | maxout (MaxS) | 48.2 |
| | PReLU | 48.5 |
| | ELU | 48.8 |
| | maxout (MaxW) | 51.7 |
| Batch Normalization (BN) | before non-linearity | 47.4 |
| | after non-linearity | 49.9 |
| BN + Non-linearity | Linear | 38.4 |
| | tanh | 44.8 |
| | sigmoid | 47.5 |
| | maxout (MaxS) | 48.7 |
| | ELU | 49.8 |
| | ReLU | 49.9 |
| | RReLU | 50.0 |
| | PReLU | 50.3 |
| Pooling | stochastic, no dropout | 42.9 |
| | average | 43.5 |
| | stochastic | 43.8 |
| | max | 47.1 |
| | strided convolution | 47.2 |
| | max+average | 48.3 |
| Pooling window size | 3x3/2 | 47.1 |
| | 2x2/2 | 48.4 |
| | 3x3/2 pad=1 | 48.8 |
| Learning rate decay policy | step | 47.1 |
| | square | 48.3 |
| | square root | 48.3 |
| | linear | 49.3 |
| Colorspace & Pre-processing | OpenCV grayscale | 41.3 |
| | grayscale learned | 41.9 |
| | histogram equalized | 44.8 |
| | HSV | 45.1 |
| | YCrCb | 45.8 |
| | CLAHE | 46.7 |
| | RGB | 47.1 |
| Classifier design | pooling-FC-FC-clf | 47.1 |
| | SPP2-FC-FC-clf | 47.1 |
| | pooling-C3-C1-clf-maxpool | 47.3 |
| | SPP3-FC-FC-clf | 48.3 |
| | pooling-C3-C1-avepool-clf | 48.9 |
| | C3-C1-clf-avepool | 49.1 |
| | pooling-C3-C1-clf-avepool | 49.5 |
| Percentage of noisy data | 5% | 45.8 |
| | 10% | 44.7 |
| | 15% | 43.7 |
| | 32% | 40.1 |
| Dataset size | 1200K | 47.1 |
| | 800K | 43.8 |
| | 600K | 42.5 |
| | 400K | 39.3 |
| | 200K | 30.5 |
| Network width | 4√2 | 56.5 |
| | 4 | 56.3 |
| | 2√2 | 55.2 |
| | 2 | 53.3 |
| | 1 | 47.1 |
| | 1/√2 | 46.0 |
| | 1/2 | 41.6 |
| | 1/(2√2) | 31.8 |
| | 1/4 | 25.6 |
| Batch size | BS=1024, lr=0.04 | 46.5 |
| | BS=1024, lr=0.01 | 41.9 |
| | BS=512, lr=0.02 | 46.9 |
| | BS=512, lr=0.01 | 45.5 |
| | BS=256, lr=0.01 | 47.0 |
| | BS=128, lr=0.005 | 47.0 |
| | BS=128, lr=0.01 | 47.2 |
| | BS=64, lr=0.0025 | 47.5 |
| | BS=64, lr=0.01 | 47.1 |
| | BS=32, lr=0.00125 | 47.0 |
| | BS=32, lr=0.01 | 46.3 |
| | BS=1, lr=0.000039 | 47.4 |
| Bias | without | 44.5 |
| | with | 47.1 |
| Architectures | CaffeNet128 | 47.1 |
| | CaffeNet128All | 53.0 |
| | CaffeNet224 | 56.5 |
| | CaffeNet224All | 61.3 |
| | VGGNet16-128 | 65.1 |
| | VGGNet16-128All | 68.2 |
| | GoogLeNet128 | 61.9 |
| | GoogLeNet128All | 60.6 |
| | GoogLeNet224 | 67.9 |

Table 9: Results of all tests on ImageNet-128px


The authors were supported by The Czech Science Foundation Project GACR P103/12/G084 and CTU student grant SGS15/155/OHK3/2T/13.

6 References