Traffic Sign Classifier
The paper systematically studies the impact of a range of recent advances in CNN architectures and learning methods on the object categorization (ILSVRC) problem. The evaluation tests the influence of the following architectural choices: non-linearity (ReLU, ELU, maxout, compatibility with batch normalization), pooling variants (stochastic, max, average, mixed), network width, classifier design (convolutional, fully-connected, SPP), and image pre-processing; and of learning parameters: learning rate, batch size, cleanliness of the data, etc. The performance gains of the proposed modifications are first tested individually and then in combination. The sum of individual gains is bigger than the observed improvement when all modifications are introduced, but the "deficit" is small, suggesting independence of their benefits. We show that the use of 128x128 pixel images is sufficient to make qualitative conclusions about optimal network structure that hold for the full size Caffe and VGG nets. The results are obtained an order of magnitude faster than with the standard 224 pixel images.
Deep convolutional networks have become the mainstream method for solving various computer vision tasks, such as image classification ILSVRC15 , object detection ILSVRC15 ; PASCAL2010 , semantic segmentation Dai2015Tolias2016 , tracking Nam2015 , text detection Jaderberg2014 , stereo matching Zbontar2014 , and many others.
Besides two classic works on training neural networks, LeCun1998 and Bengio2012 , which are still highly relevant, there is very little guidance or theory on the plethora of design choices and hyper-parameter settings of CNNs, with the consequence that researchers proceed by trial-and-error experimentation and architecture copying, sticking to established net types. With good results in the ImageNet competition, AlexNet AlexNet2012 , VGGNet VGGNet2015 and GoogLeNet (Inception) Googlenet2015 have become the de-facto standard.
Improvements of many components of the CNN architecture like the non-linearity type, pooling, structure and learning have been recently proposed. First applied in the ILSVRC ILSVRC15 competition, they have been adopted in different research areas.
The contributions of the recent CNN improvements and their interaction have not been systematically evaluated. We survey the recent developments and perform a large scale experimental study that considers the choice of non-linearity, pooling, learning rate policy, classifier design, network width, batch normalization BatchNorm2015 . We did not include ResNets DeepResNet2015 – a recent development achieving excellent results – since they have been well covered in papers He2016 ; Szegedy2016 ; WideResNets2016 ; FractalNets2016 .
There are three main contributions of the paper. First, we survey and present baseline results for a wide variety of architectures and design choices, both alone and in combination. Based on a large-scale evaluation, we provide novel recommendations and insights about constructing deep convolutional networks. Second, we present ImageNet-128px as a fast (24 hours of training AlexNet on a GTX980) and reliable benchmark: the relative order of results for popular architectures does not change compared to the common image size of 224x224 or even 300x300 pixels. Last, but not least, the benchmark is fully reproducible and all scripts and data are available online at https://github.com/ducha-aiki/caffenet-benchmark.
The paper is structured as follows. In Section 2.1 we explain and validate the experiment design. In Section 3, the influence of a range of hyper-parameters is evaluated in isolation. The related literature is reviewed in the corresponding experiment sections. Section 4 is devoted to the combination of the best hyper-parameter settings and to "squeezing-the-last-percentage-points" for a given architecture recommendation. The paper is concluded in Section 5.
|Non-linearity||linear, tanh, sigmoid, ReLU, VLReLU, RReLU, PReLU, ELU, maxout, APL, combination|
|Batch Normalization (BN)||before non-linearity, after non-linearity|
|BN + non-linearity||linear, tanh, sigmoid, ReLU, VLReLU, RReLU, PReLU, ELU, maxout|
|Pooling||max, average, stochastic, max+average|
|Pooling window size||3x3, 2x2, 3x3 with zero-padding|
|Learning rate decay policy||step, square, square root, linear|
|Colorspace & Pre-processing||RGB, HSV, YCrCb, grayscale, learned, CLAHE, histogram equalized|
|Classifier design||pooling-FC-FC-clf, SPP-FC-FC-clf|
|Network width||$1/4$, $1/(2\sqrt{2})$, $1/2$, $1/\sqrt{2}$, $1$, $\sqrt{2}$, $2$, $2\sqrt{2}$, $4$, $4\sqrt{2}$|
|Input image size||64, 96, 128, 180, 224|
|Dataset size||200K, 400K, 600K, 800K, 1200K (full)|
|Batch size||1, 32, 64, 128, 256, 512, 1024|
|Percentage of noisy data||0, 5%, 10%, 15%, 32%|
All tested networks were trained on the 1000 object category classification problem on the ImageNet dataset ILSVRC15 . The set consists of a 1.2M image training set, a 50K image validation set and a 100K image test set. The test set is not used in the experiments. The commonly used pre-processing includes image rescaling to 256xN, where , and then cropping a random 224x224 square AlexNet2012 ; Howard2013 . The setup achieves good results in classification, but training a network of this size takes several days even on modern GPUs. We thus propose to limit the image size to 144xN where (denoted as ImageNet-128px). For example, the CaffeNet jia2014caffe is trained within 24 hours using NVIDIA GTX980 on ImageNet-128px.
The input size reduction is validated by training CaffeNet, GoogLeNet and VGGNet on both the reduced and standard image sizes. The results are shown in Figure 1. The reduction of the input image size leads to a consistent drop in top-1 accuracy of around 6% for all three popular architectures and does not change their relative order (VGGNet > GoogLeNet > CaffeNet) or accuracy difference.
To decrease the probability of overfitting and to make the experiments less demanding in memory, another change is made to CaffeNet: the number of filters in fully-connected layers 6 and 7 is reduced by a factor of two, from 4096 to 2048. The results validating the resolution reduction are presented in Figure 1.
The parameters and architecture of the standard CaffeNet are shown in Table 2. For experiments we used CaffeNet with 2x thinner fully-connected layers, named CaffeNet128-FC2048. The architecture can be denoted as 96C11/4 MP3/2 192G2C5/2 MP3/2 384G2C3 384C3 256G2C3 MP3/2 2048C3 2048C1 1000C1. Here we used fully-convolutional notation for the fully-connected layers; the two are equivalent when the input image size is fixed to 128x128 px. The default activation function is ReLU and it is put after every convolution layer, except the last 1000-way softmax classifier.
SGD with momentum 0.9 is used for learning. The initial learning rate is set to 0.01 and decreased by a factor of ten after each 100K iterations until learning stops after 320K iterations. The L2 weight decay for convolutional weights is set to 0.0005 and it is not applied to the biases. Dropout Dropout2014 with probability 0.5 is used before the two last layers. All the networks were initialized with LSUV Mishkin2016LSUV . Biases are initialized to zero. Since the LSUV initialization works under the assumption of preserving unit variance of the input, pixel intensities were scaled by 0.04, after subtracting the mean of BGR pixel values (104 117 124).
|input||image 128x128 px, random crop from 144xN, random mirror|
|pre-process||out = 0.04 (BGR - (104; 117; 124))|
|conv1||conv 11x11x96, stride 4|
|pool1||max pool 3x3, stride 2|
|conv2||conv 5x5x256, stride 2, pad 1, group 2|
|pool2||max pool 3x3, stride 2|
|conv3||conv 3x3x384, pad 1|
|conv4||conv 3x3x384, pad 1, group 2|
|conv5||conv 3x3x256, pad 1, group 2|
|pool5||max pool 3x3, stride 2|
|drop6||dropout ratio 0.5|
|drop7||dropout ratio 0.5|
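The input and pre-process rows of Table 2 can be illustrated with a minimal NumPy sketch. The function name `preprocess` and its arguments are hypothetical and only mirror the table (random 128x128 crop, random mirror, mean subtraction, scaling by 0.04); it is not the benchmark's actual data layer.

```python
import numpy as np

# Values taken from the pre-process row of Table 2.
BGR_MEAN = np.array([104.0, 117.0, 124.0])
SCALE = 0.04


def preprocess(image_bgr, crop=128, rng=None):
    """Random crop, random mirror, mean subtraction and scaling.

    image_bgr: uint8 array of shape (H, W, 3) in BGR order, with H, W >= crop.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w, _ = image_bgr.shape
    top = rng.integers(0, h - crop + 1)    # random crop position
    left = rng.integers(0, w - crop + 1)
    patch = image_bgr[top:top + crop, left:left + crop].astype(np.float64)
    if rng.random() < 0.5:                 # random horizontal mirror
        patch = patch[:, ::-1]
    return SCALE * (patch - BGR_MEAN)      # out = 0.04 * (BGR - mean)
```

An image equal to the mean color everywhere maps to an all-zero input, which is a quick sanity check of the centering.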
This section is devoted to the experiments with a single hyper-parameter or design choice per experiment.
Activation functions for neural networks are a hot topic; many functions have been proposed since the ReLU discovery ReLU2011 . The first group is related to ReLU, i.e. LeakyReLU Maas2013 and Very Leaky ReLU GrahamCIFAR , RReLU RReLU2015 , PReLU PReLU2015 and its generalized version, APL APL2014 , and ELU ELU2016 . Others are based on different ideas, e.g. maxout Maxout2013 , MBA MBA2016 , etc. However, to the best of our knowledge, only a small fraction of these activation functions have been evaluated on an ImageNet-scale dataset. And when they have, e.g. ELU, the network architecture used in the evaluation was designed specifically for the experiment and is not commonly used.
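|Name||Formula||Year|
|none||$y = x$||-|
|ReLU||$y = \max(x, 0)$||2010|
|(centered) SoftPlus||$y = \ln(e^{x} + 1) - \ln 2$||2011|
|LReLU||$y = \max(x, \alpha x)$||2011|
|maxout||$y = \max(W_1 x + b_1, W_2 x + b_2)$||2013|
|APL||$y = \max(x, 0) + \sum_{s=1}^{S} a_s \max(0, -x + b_s)$||2014|
|VLReLU||$y = \max(x, \alpha x)$||2014|
|RReLU||$y = \max(x, \alpha x)$, $\alpha = \mathrm{random}(0.1, 0.5)$||2015|
|PReLU||$y = \max(x, \alpha x)$, $\alpha$ is learnable||2015|
|ELU||$y = x$ if $x \geq 0$, else $\alpha (e^{x} - 1)$||2015|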
We have tested the most popular activation functions and all those with available or trivial implementations: ReLU, tanh, sigmoid, VLReLU, RReLU, PReLU, ELU, linear, maxout, APL, SoftPlus. Formulas and references are given in Table 3. We have selected APL and maxout with two linear pieces. Maxout is tested in two modifications: MaxW, which has the same effective network width and therefore doubles the number of parameters and the computation cost because of the two linear pieces, and MaxS, which has the same computational complexity, with each piece made thinner. Besides this, we have tested the "optimally scaled" tanh proposed by LeCun LeCun1998 . We have also tried to train a sigmoid Rumelhart1986 network, but the initial loss never decreased. Finally, as proposed by Swietojanski et al. ConvReLUFCMaxout2014 , we have tested a combination of ReLU for the first layers and maxout for the last layers of the network.
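For reference, several of the tested non-linearities can be written in a few lines of NumPy. This is a minimal sketch using the standard textbook definitions; the function names and the default `alpha` for ELU are illustrative assumptions, not the benchmark's Caffe layers.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def leaky_relu(x, alpha):
    # Covers LReLU, VLReLU and PReLU for 0 < alpha < 1; they differ only
    # in how alpha is chosen (small fixed, larger fixed, or learned).
    return np.maximum(x, alpha * x)


def elu(x, alpha=1.0):
    # The exponent is clipped at 0 so large positive x cannot overflow.
    return np.where(x >= 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))


def centered_softplus(x):
    # ln(1 + e^x) - ln 2, shifted so that f(0) = 0.
    return np.logaddexp(0.0, x) - np.log(2.0)


def maxout2(x, w1, b1, w2, b2):
    # Maxout with two linear pieces; x is (batch, in), w1 and w2 are (in, out).
    return np.maximum(x @ w1 + b1, x @ w2 + b2)
```

Maxout with two pieces needs two weight matrices per layer, which is why the MaxW variant doubles the parameter count at equal output width.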
Results are shown in Figure 2. The best single performing activation function similar in complexity to ReLU is ELU. The parametric PReLU performed on par. The performance of the centered softplus is the same as for ELU. Surprisingly, Very Leaky ReLU, popular for DCGAN networks DCGAN2015 and for small datasets, does not outperform vanilla ReLU. Interestingly, the network with no non-linearity has respectable performance, 38.9% top-1 accuracy on ImageNet, not much worse than the tanh network.
The hypothesis of Swietojanski et al. ConvReLUFCMaxout2014 about the power of maxout in the final layers is confirmed: the combination of ELU (after convolutional layers) and maxout (after fully-connected layers) shows the best performance among non-linearities with speed close to ReLU. Wide maxout outperforms the rest of the competitors at a higher computational cost.
Pooling, combined with striding, is a common way to achieve a degree of invariance together with a reduction of the spatial size of feature maps. The most popular options are max pooling and average pooling. Among the recent advances are stochastic pooling StochasticPool2013 , LP-Norm pooling LPNormPool2013 and Tree-Gated pooling GenPool2015 . Only the authors of the last paper have tested their pooling on ImageNet.
|stochastic||$y = x_i$ with prob. $x_i / \sum_j x_j$||2013|
|max + average||$y = \max_i x_i + \frac{1}{n} \sum_{i=1}^{n} x_i$||2015|
We have tested (see Table 4) average, max and stochastic pooling, the sum of average and max pooling proposed by Lee et al. GenPool2015 , and skipping pooling altogether, replacing it with strided convolutions as proposed by Springenberg et al. ALLCNN2015 . We have also tried Tree and Gated poolings GenPool2015 , but we encountered convergence problems and the results depended strongly on the input image size. We do not know whether this is a problem of the implementation or of the method itself, and have therefore omitted the results.
The results are shown in Figure 3, left. Stochastic pooling had very bad results. In order to check whether this was due to extreme randomization by the stochastic pooling and dropout, we trained a network without dropout. This decreased accuracy even more. The best results were obtained by a combination of max and average pooling. Our guess is that max pooling brings selectivity and invariance, while average pooling allows using gradients of all filters, instead of throwing away 3/4 of the information as done by non-overlapping 2x2 max pooling.
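The max, average and max + average variants can be sketched for a single feature map as follows. The helper `pool2d` is hypothetical and uses plain floor-mode windows without padding for brevity, which is a simplification and not necessarily how the benchmark's pooling layers handle borders.

```python
import numpy as np


def pool2d(x, size, stride, mode):
    """Pool a single feature map x of shape (H, W) without padding."""
    h, w = x.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            win = x[i * stride:i * stride + size, j * stride:j * stride + size]
            if mode == "max":
                out[i, j] = win.max()
            elif mode == "average":
                out[i, j] = win.mean()
            elif mode == "max+average":
                # Sum of the two poolings: every element of the window
                # contributes to the output (and receives gradient).
                out[i, j] = win.max() + win.mean()
            else:
                raise ValueError(f"unknown pooling mode: {mode}")
    return out
```

With 2x2 max pooling only one of four inputs influences each output, while the summed variant keeps a contribution from all four.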
The second experiment is about the receptive field size. The results are shown in Figure 3, right. Overlapping pooling is inferior to a non-overlapping 2x2 window, but wins if zero-padding is added. This can be explained by the fact that better results are obtained for larger outputs: 3x3/2 pooling leads to a 3x3 spatial size of the pool5 feature map, 2x2/2 leads to a 4x4 pool5, and 3x3/2 with zero-padding of 1 leads to 5x5. This observation means there is a speed-performance trade-off.
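The pool5 sizes quoted above can be reproduced with a short size calculation. The sketch below rests on two assumptions that are not stated in the text: pooling output sizes are rounded up (as Caffe's pooling layer does), and conv2 preserves the spatial size (5x5, pad 2, stride 1, as in the reference CaffeNet definition), which differs from the conv2 row as printed in Table 2. The function names are hypothetical.

```python
import math


def conv_out(size, kernel, stride=1, pad=0):
    # Convolution output size, rounded down.
    return (size + 2 * pad - kernel) // stride + 1


def pool_out(size, kernel, stride, pad=0):
    # Pooling output size, rounded up (assumed Caffe-style behaviour).
    out = math.ceil((size + 2 * pad - kernel) / stride) + 1
    # With padding, drop the last window if it would start in the padding.
    if pad > 0 and (out - 1) * stride >= size + pad:
        out -= 1
    return out


def pool5_size(kernel, stride, pad=0, image=128):
    s = conv_out(image, 11, stride=4)   # conv1: 11x11, stride 4
    s = pool_out(s, kernel, stride, pad)  # pool1
    s = conv_out(s, 5, pad=2)           # conv2 (assumed size-preserving)
    s = pool_out(s, kernel, stride, pad)  # pool2
    for _ in range(3):                  # conv3, conv4, conv5: 3x3, pad 1
        s = conv_out(s, 3, pad=1)
    return pool_out(s, kernel, stride, pad)  # pool5
```

Under these assumptions `pool5_size(3, 2)`, `pool5_size(2, 2)` and `pool5_size(3, 2, pad=1)` give 3, 4 and 5, matching the sizes stated in the text.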
|step||$lr = L_0 \gamma^{\lfloor i / S \rfloor}$||$S$ = 100K, $\gamma$ = 0.1, $L_0$ = 0.01||0.471|
|square root||$lr = L_0 (1 - i/M)^{0.5}$||0.483|
The learning rate is one of the most important hyper-parameters influencing the final CNN performance. Surprisingly, the most commonly used learning rate decay policy is "reduce learning rate 10x, when validation error stops decreasing", adopted with no parameter search. While this works well in practice, such a lazy policy can be sub-optimal. We have tested four learning rate policies: step, quadratic and square root decay (used for training GoogLeNet by BVLC jia2014caffe ), and linear decay. The actual learning rate dynamics are shown in Figure 4, left. The validation accuracy is shown on the right. Linear decay gives the best results.
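The four policies can be sketched as one function. The step policy follows the training setup described earlier (initial rate 0.01, divided by ten every 100K iterations, 320K iterations in total). The polynomial form used for the other three policies is an assumption modelled on Caffe's "poly" schedule; the exact formulas used in the experiments are not recoverable from the text, and the function name is hypothetical.

```python
def learning_rate(policy, it, base_lr=0.01, max_iter=320_000,
                  step=100_000, gamma=0.1):
    """Learning rate at iteration `it` for the given decay policy."""
    if policy == "step":
        # Multiply by gamma after every `step` iterations.
        return base_lr * gamma ** (it // step)
    # Assumed polynomial decay to zero at max_iter: lr = L0 * (1 - it/M)^p
    powers = {"square": 2.0, "linear": 1.0, "square root": 0.5}
    return base_lr * (1.0 - it / max_iter) ** powers[policy]
```

Halfway through training the linear policy is at half of the initial rate, the square policy at a quarter, and the square root policy at about 71%.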
The commonly used input to a CNN is raw RGB pixels and the commonly adopted recommendation is not to use any pre-processing. There has not been much research on the optimal colorspace or pre-processing techniques for CNNs. Rachmadi and Purnama ColorCar2015 explored different colorspaces for vehicle color identification, Dong et al. ColorSuperRes2015 compared YCrCb and RGB channels for image super-resolution, and Graham GrahamRetinopathy2015 extracted the local average color from retina images in the winning solution to the Kaggle competition.
|C||RGB → conv1x1x10 → conv1x1x3 + RGB||VLReLU||0.482|
|D||[RGB; log(RGB)] → conv1x1x10 → conv1x1x3||VLReLU||0.482|
The pre-processing experiment is divided into two parts. First, we have tested popular handcrafted image pre-processing methods and colorspaces. Since all transformations were done on-the-fly, we first tested whether the calculation of the mean pixel and variance over the training set can be replaced by applying batch normalization to the input images. It decreases final accuracy by 0.3% and can be seen as the baseline for all other methods. We have tested HSV, YCrCb, Lab, RGB and single-channel grayscale colorspaces. Results are shown in Figure 5. The experiment confirms that RGB is the most suitable colorspace for CNNs. The Lab-based network has not improved on the initial loss after 10K iterations. Removing color information from images costs 5.8% and 5.2% of accuracy for OpenCV RGB2Gray and learned decolorization, respectively. Global ( HistEq1977 ) and local (CLAHE CLAHE1994 ) histogram equalizations hurt performance as well.
Second, we let the network learn a transformation via 1x1 convolutions, so no pixel neighbors are involved. The mini-network architectures are described in Table 6. The learning process is joint with the main network and can be seen as extending the CaffeNet architecture with several 1x1 convolutions at the input. The best performing network gave a 1.4% absolute accuracy gain without a significant computational cost.
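Because a 1x1 convolution is a per-pixel linear map, the learned colorspace mini-network can be sketched with plain matrix products. The sketch below follows the RGB, conv1x1x10, conv1x1x3 layout listed in Table 6; where exactly the VLReLU is applied, the value `alpha=0.333`, and all names are assumptions made for illustration, and in the real setup the weights are learned jointly with the main network.

```python
import numpy as np


def conv1x1(x, weight, bias):
    # x: (H, W, C_in), weight: (C_in, C_out); acts on each pixel independently.
    return x @ weight + bias


def vlrelu(x, alpha=0.333):
    # Very leaky ReLU; the slope is an assumed value.
    return np.maximum(x, alpha * x)


def learned_colorspace(rgb, params, residual=False):
    """Map an (H, W, 3) image through conv1x1x10 and conv1x1x3."""
    w1, b1, w2, b2 = params
    hidden = vlrelu(conv1x1(rgb, w1, b1))     # (H, W, 10)
    out = vlrelu(conv1x1(hidden, w2, b2))     # (H, W, 3)
    # Optional skip connection, as in the "+ RGB" variant.
    return out + rgb if residual else out
```

The extra cost is two small matrix products per pixel, which is negligible next to the 11x11 conv1 layer.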
Batch normalization BatchNorm2015 (BN) is a recent method that addresses the gradient exploding/vanishing problem and guarantees a near-optimal learning regime for the layer following the batch normalized one. Following Mishkin2016LSUV , we first tested different options for where to put BN: before or after the non-linearity. The results presented in Table 7 are surprisingly contradictory: the CaffeNet architecture prefers the Conv-ReLU-BN-Conv placement, while GoogLeNet prefers Conv-BN-ReLU-Conv. Moreover, the results for GoogLeNet are inferior to the plain network. The difference to BatchNorm2015 is that we have not changed any other parameters except using BN, while in the original paper the authors decreased regularization (both weight decay and dropout), changed the learning rate decay policy and applied an additional training set re-shuffling. Also, GoogLeNet seems to behave differently from CaffeNet and VGGNet with respect to other modifications, see Section 4.
For the next experiment with BN and activations, we selected placement after non-linearity. Results are shown in Figure 6. Batch normalization washes out differences between ReLU-family variants, so there is no need to use the more complex variants. Sigmoid with BN outperforms ReLU without it, but, surprisingly, tanh with BN shows worse accuracy than sigmoid with BN.
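The two placements differ only in the order of two operations. A minimal sketch on a fully-connected block, with a simplified batch normalization that has no learned scale and shift (a simplification, not the layer used in the experiments; all names are illustrative):

```python
import numpy as np


def batch_norm(x, eps=1e-5):
    # x: (batch, features); normalize every feature over the batch.
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)


def relu(x):
    return np.maximum(x, 0.0)


def block(x, w, placement):
    z = x @ w
    if placement == "before":   # Conv-BN-ReLU: BN before the non-linearity
        return relu(batch_norm(z))
    if placement == "after":    # Conv-ReLU-BN: BN after the non-linearity
        return batch_norm(relu(z))
    raise ValueError(f"unknown placement: {placement}")
```

With BN after the non-linearity the next layer receives zero-mean inputs, while with BN before it the next layer sees non-negative, non-centered activations.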
The CNN architecture can be seen as a feature detector followed by a classifier. Ren et al. NoC2015 proposed to consider the convolutional layers of AlexNet as a feature extractor and the fully-connected layers as a 2-layer MLP classifier. They argued that 2 fully-connected layers are not the optimal design and explored various architectures instead. But they considered only a pre-trained CNN or HOGs as the feature extractor, so they explored mostly the transfer learning scenario, in which most of the network weights are frozen. Also, they explored architectures with additional convolution layers, which can be seen not as a better classifier, but as an enhancement of the feature extractor.
There are three most popular approaches to classifier design. In the first, the final layer of the feature extractor is a max pooling layer and the classifier is a one- or two-layer MLP, as is done in LeNet LeNet1998 , AlexNet AlexNet2012 and VGGNet VGGNet2015 . In the second, a spatial pyramid pooling layer SPPNet2014 replaces the pooling layer and is followed by a two-layer MLP. The third architecture consists of an average pooling layer, which squashes the spatial dimensions, followed by a softmax classifier without any feature transform. This variant is used in GoogLeNet Googlenet2015 and ResNet DeepResNet2015 .
We have explored the following variants: the default 2-layer MLP; SPPNet with 2 and 3 pyramid levels; removing the pool5 layer and treating the fully-connected layers as convolutional, which allows the use of zero-padding and therefore increases the effective number of training examples for this layer; averaging features before the softmax layer; and averaging spatial predictions of the softmax layer NiN2013 . The results are shown in Figure 7. The best results are obtained when predictions are averaged over all spatial positions and the MLP layers are treated as convolutions with zero-padding. The advantage of SPP over standard max pooling is less pronounced.
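The two averaging variants differ in whether the average is taken before or after the classifier. A minimal NumPy sketch, assuming the fully-convolutional classifier produces a map of class scores of shape (H, W, classes); the function names are hypothetical:

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def average_spatial_predictions(logits):
    # logits: (H, W, classes). Softmax at every position, then average
    # the per-position class probabilities.
    return softmax(logits, axis=-1).mean(axis=(0, 1))


def average_features_then_classify(features, w, b):
    # features: (H, W, C). Average the features over positions first,
    # then apply a single linear classifier and softmax.
    return softmax(features.mean(axis=(0, 1)) @ w + b)
```

Both functions return a valid probability vector; they coincide only in special cases, for example when every spatial position carries the same features.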
The mini-batch size is always a trade-off between computational efficiency, because the GPU architecture prefers it large enough, and accuracy; early work by Wilson and Martinez BatchSize2003 shows the superiority of online training over batch training. Here we explore the influence of the mini-batch size on the final accuracy. Experiments show that keeping a constant learning rate for different mini-batch sizes has a negative impact on performance. We have also tested the heuristic proposed by Krizhevsky KrizhevskiyOneWeirdTrick2014 , which suggests changing the learning rate in proportion to the mini-batch size, i.e. keeping their ratio constant. Results are shown in Figure 8. The heuristic works, but large (512 and more) mini-batch sizes lead to a quite significant decrease in performance. At the other extreme, online training (a mini-batch with a single example) does not bring accuracy gains over 64 or 256, but significantly increases the training wall-clock time.
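The scaling heuristic amounts to one line of arithmetic: the learning rate changes linearly with the mini-batch size. In the sketch below the reference point (batch size 256 at learning rate 0.01) is an assumption; it is chosen because it is consistent with the "BS=1024, lr=0.04" setting listed in the results table and with the initial learning rate of 0.01.

```python
def scaled_learning_rate(batch_size, base_batch_size=256, base_lr=0.01):
    """Learning rate scaled linearly with the mini-batch size."""
    # base_batch_size and base_lr are an assumed reference configuration.
    return base_lr * batch_size / base_batch_size
```

For example, going from 256 to 64 examples per mini-batch reduces the learning rate from 0.01 to 0.0025.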
All the advances in the ImageNet competition so far were caused by architectural improvements. To the best of our knowledge, there is no study of the dependence of the final accuracy on network width. Canziani et al. CNNEfficiency2016 did a comparative analysis of the ImageNet winners in terms of accuracy, number of parameters and computational complexity, but it is a comparison of different architectures. In this subsection we evaluate how far one can get by increasing the CaffeNet width, with no other changes. The results are shown in Figure 9. The original architecture is close to optimal in the accuracy-per-FLOPS sense: a decrease in the number of filters leads to a quick and significant accuracy drop, while making the network thicker brings gains that saturate quickly. Making the network more than 3 times thicker leads to a very limited accuracy gain.
The input image size plays a very important role, as it brings additional information and training samples for the convolution filters. Our initial experiment, shown in Figure 1, indicates that CaffeNet trained on 227x227 images can compete with the much more complex GoogLeNet architecture trained on smaller images. So the obvious question is how the final accuracy depends on the image size.
We have performed an experiment with different input image sizes: 96, 128, 180 and 224 pixels wide. The results are presented in Figure 10. The bad news is that while accuracy depends on image size linearly, the needed computations grow quadratically, so it is a very expensive way to gain performance. In the second part of the experiment, we kept the spatial output size of the pool1 layer fixed while changing the input image size. To achieve this, we change the stride and filter size of the conv1 layer accordingly. Results show that the gain from a large image size (after some minimum value) comes mostly from the larger spatial size of deeper layers rather than from the unseen image details.
The performance of current deep neural networks is highly dependent on the dataset size. Unfortunately, not much research has been published on this topic. In DeepFace taigman2014deepface , the authors show that a dataset reduction from 4.4M to 1.5M leads to a 1.74% accuracy drop. A similar dependence is shown by Schroff et al. schroff2015facenet , but on an extra-large dataset: decreasing the dataset size from 260M to 2.6M leads to an accuracy drop of 10%. But these datasets are private and the experiments are not reproducible. Another important property of a dataset is the cleanliness of the data. For example, an estimate of the human error on ImageNet is 5.1% for top-5 ILSVRC15 . To create the ImageNet, each image was voted on by ten different people ILSVRC15 .
We explore the dependency between the accuracy and the dataset size/cleanliness on ImageNet. For the dataset size experiment, 200, 400, 600 and 800 thousand examples were randomly chosen from the full training set. For each reduced dataset, a CaffeNet is trained from scratch. For the cleanliness test, we replaced the labels with a random incorrect one for 5%, 10%, 15% and 32% of the examples. The labels are fixed, unlike in the recent work on disturbing labels as a regularization method DisturbLabel2016 .
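The label corruption used in the cleanliness test can be sketched as follows. The function name, the seeding scheme and the way the wrong class is drawn are assumptions for illustration; the only properties taken from the text are that a fixed fraction of examples gets a random incorrect label and that the corrupted labels stay fixed during training.

```python
import numpy as np


def corrupt_labels(labels, noise_fraction, num_classes, seed=0):
    """Replace a fixed fraction of labels with a random incorrect class."""
    rng = np.random.default_rng(seed)   # fixed seed: labels are corrupted once
    labels = np.asarray(labels)
    noisy = labels.copy()
    n_noisy = int(round(noise_fraction * len(labels)))
    idx = rng.choice(len(labels), size=n_noisy, replace=False)
    # An offset in [1, num_classes - 1] taken modulo num_classes always
    # lands on a class different from the original one.
    offsets = rng.integers(1, num_classes, size=n_noisy)
    noisy[idx] = (labels[idx] + offsets) % num_classes
    return noisy
```

Calling the function once before training and reusing its output keeps the noise fixed across epochs, in contrast to re-sampling disturbed labels at every iteration.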
The results are shown in Figure 11, which clearly shows that a bigger (and more diverse) dataset brings an improvement. There is a minimum size below which performance quickly degrades. Cleaner data outperforms noisier data: a clean dataset with 400K images performs on par with a 1.2M dataset with 800K correct images.
We conducted a simple experiment on the importance of the bias in the convolution and fully-connected layers. In the first run, the network is trained as usual; in the second, biases are initialized with zeros and the bias learning rate is set to zero. The network without biases shows 2.6% lower accuracy than the default, see Table 8.
Finally, we test how all the improvements that do not increase the computational cost perform together. We combine: the learned colorspace transform F, ELU as the non-linearity for convolution layers and maxout for fully-connected layers, the linear learning rate decay policy, and average plus max pooling. The improvements are applied to CaffeNet128, CaffeNet224, VGGNet128 and GoogLeNet128.
The first three demonstrated consistent performance growth (see Figure 12), while GoogLeNet performance degraded, as it was found for batch normalization. Possibly, this is due to the complex and optimized structure of the GoogLeNet network. Unfortunately, the cost of training VGGNet224 is prohibitive, one month of GPU time, so we have not subjected it to the tests yet.
We have systematically compared a set of recent CNN advances on a large-scale dataset. We have shown that benchmarking can be done at an affordable time and computation cost. A summary of recommendations:
use ELU non-linearity without batchnorm or ReLU with it.
apply a learned colorspace transformation of RGB.
use the linear learning rate decay policy.
use a sum of the average and max pooling layers.
use mini-batch size around 128 or 256. If this is too big for your GPU, decrease the learning rate proportionally to the batch size.
use fully-connected layers as convolutional and average the predictions for the final decision.
when investing in increasing the training set size, check whether a plateau has been reached.
cleanliness of the data is more important than the size.
if you cannot increase the input image size, reduce the stride in the subsequent layers; it has roughly the same effect.
if your network has a complex and highly optimized architecture, like e.g. GoogLeNet, be careful with modifications.
|Group Name||Names||acc [%]|
|Batch Normalization (BN)||before non-linearity||47.4|
|BN + Non-linearity||Linear||38.4|
|Pooling||stochastic, no dropout||42.9|
|Pooling window size||3x3/2||47.1|
|Learning rate decay policy||step||47.1|
|Colorspace & Pre-processing||OpenCV grayscale||41.3|
|Percentage of noisy data||5%||45.8|
|Batch size||BS=1024, lr=0.04||46.5|
The authors were supported by The Czech Science Foundation Project GACR P103/12/G084 and CTU student grant SGS15/155/OHK3/2T/13.
A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: F. Pereira, C. Burges, L. Bottou, K. Weinberger (Eds.), Advances in Neural Information Processing Systems 25, Curran Associates, Inc., 2012, pp. 1097–1105.
S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: D. Blei, F. Bach (Eds.), Proceedings of the 32nd International Conference on Machine Learning (ICML-15), JMLR Workshop and Conference Proceedings, 2015, pp. 448–456.
X. Glorot, A. Bordes, Y. Bengio, Deep sparse rectifier neural networks, in: G. J. Gordon, D. B. Dunson (Eds.), Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS-11), Vol. 15, Journal of Machine Learning Research - Workshop and Conference Proceedings, 2011, pp. 315–323.
Y. Taigman, M. Yang, M. Ranzato, L. Wolf, Deepface: Closing the gap to human-level performance in face verification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1701–1708.
F. Schroff, D. Kalenichenko, J. Philbin, Facenet: A unified embedding for face recognition and clustering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815–823.