ImageNet pre-trained models with batch normalization for the Caffe framework
Convolutional neural networks (CNNs) pre-trained on ImageNet are the backbone of most state-of-the-art approaches. In this paper, we present a new set of pre-trained models with popular state-of-the-art architectures for the Caffe framework. The first release includes Residual Networks (ResNets) with a generation script as well as the batch normalization variants of AlexNet and VGG19. All models outperform previous models with the same architecture. The models and training code are available at http://www.inf-cv.uni-jena.de/Research/CNN+Models.html and https://github.com/cvjena/cnn-models.
The rediscovery of convolutional neural networks (CNNs) in recent years is a result of both the dramatically increased computational speed and the advent of large-scale datasets as part of the big data trend. The computational speed was mainly boosted by the efficient use of GPUs for common computer vision operations like convolution and matrix multiplication. Large-scale datasets [25, 19, 16, 5, 22, 6], on the other hand, provide the amount of data required for training large models with more than a hundred million parameters.
This combination allowed for huge advances in all fields of computer vision research, ranging from traditional tasks like classification [11, 27, 28, 3, 8, 18], object detection [23, 26, 10], and segmentation [20, 4, 36], to new ones like image captioning [15, 24, 21, 35, 34], visual question answering [2, 9, 33], and 3D information prediction [7, 32]. Most of these works are based on models that are pre-trained on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) dataset. The classification task of the most recent ILSVRC contains 1.2 million training images categorized into one thousand categories, which represent a wide variety of everyday objects. Pre-training on this dataset proved to be a crucial step for obtaining highly accurate models in most of the tasks mentioned above.
While computational speed was dramatically increased by the use of GPUs, training a large model like VGG19 still takes several months on a high-end GPU. We hence release a continuously growing set of pre-trained models with popular architectures for the Caffe framework. In contrast to most publicly available models for this framework, our release includes the batch normalization variants of popular networks like AlexNet and VGG19. In addition, we provide training code for reproducing the results of residual networks in Caffe, which was not provided by the authors of the paper. The release includes all files required for reproducing the model training as well as the log file of the training of the provided model.
Especially for larger models like VGG19, batch normalization is crucial for successful training and convergence. In addition, architectures with batch normalization allow for much higher learning rates and hence yield models with better generalization ability. In our experiments, we found that higher learning rates show a slower initial convergence but end up at a lower final error rate. This was the case for both AlexNet and VGG19.
The advantage of batch normalization is present even for fine-tuning in certain applications. For example, Amthor et al. report that their multi-loss architectures only converged reliably if batch normalization was added to the networks. However, adding batch normalization to models that were trained without it yields a severe increase in error rates due to mismatching output statistics. Fine-tuning with our batch normalization models, in contrast, is directly possible, which allows for easy adaptation to new tasks.
We modified AlexNet and VGG19 by adding a batch normalization layer between each convolutional layer and its activation unit as well as between each inner product layer and its activation unit. Following the suggestions of the original batch normalization work, we removed the local response normalization and dropout layers. In addition, we omitted the mean subtraction during training and replaced it with a batch normalization layer on the input data. This results in an adaptively calculated mean during training and relieves users from manually subtracting the mean during feature computation. This approach also has the advantage that the mean adapts automatically during fine-tuning, so no manual mean calculation and storage is required.
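The idea behind the batch normalization layer on the input data can be illustrated with a minimal sketch (plain Python, not Caffe code; the function name is ours): the layer estimates the mean and variance from each batch itself, so no precomputed dataset mean needs to be subtracted.

```python
# Minimal sketch of a batch-normalization forward pass (per-feature scalars,
# no learned scale/shift). Illustration only, not the Caffe implementation.

def batch_norm(batch, eps=1e-5):
    """Normalize a batch of scalar values to zero mean and unit variance."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    return [(x - mean) / (var + eps) ** 0.5 for x in batch]

# Applied directly to raw input values, BN removes the need to subtract a
# precomputed dataset mean: the mean is estimated from each batch instead.
pixels = [120.0, 130.0, 110.0, 140.0]
normalized = batch_norm(pixels)
```

During fine-tuning, the running estimates keep adapting to the new data, which is exactly why no stored mean file is needed.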
We train for 64 epochs on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 – 2016 classification dataset, which contains roughly 1.2 million images and one thousand object categories. A batch size of 256 and initial learning rates of 0.05 (AlexNet), 0.01 (VGG19), and 0.1 (ResNet) were used. The learning rate follows a linear decay over time. Due to batch normalization, it is important that the batch size is greater than sixteen to obtain robust statistics estimates in the batch normalization layers. In the Caffe framework, this means the batch size in the network definition needs to be sixteen or larger; the solver parameter iter_size does not compensate for a too small batch size in the network definition. If you want to fine-tune a model but do not have enough GPU memory, you can enable the use of global statistics during training in order to lift this batch size requirement. This disables the statistics estimation in each forward pass and uses the global statistics instead.
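The linear learning rate decay mentioned above can be sketched as follows (a plain-Python illustration with a hypothetical helper name; in practice the schedule is configured in the Caffe solver):

```python
# Sketch of a linear learning-rate decay: the rate falls linearly from its
# initial value to zero over the course of training.

def linear_decay_lr(base_lr, step, max_steps):
    """Linearly decay base_lr to 0 over max_steps iterations."""
    return base_lr * (1.0 - step / max_steps)

# Example with the AlexNet setting from the text (initial rate 0.05),
# assuming a hypothetical total of 100,000 iterations:
lr_start = linear_decay_lr(0.05, 0, 100000)       # 0.05 at the start
lr_mid   = linear_decay_lr(0.05, 50000, 100000)   # 0.025 halfway through
lr_end   = linear_decay_lr(0.05, 100000, 100000)  # 0.0 at the end
```

Because the rate reaches zero exactly at the end, the model receives ever smaller updates late in training, which matches the steep final error drop discussed below.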
All images are resized such that the smaller side has a length of 256 pixels and the aspect ratio is preserved. During training, we randomly crop a 224×224 (ResNet, VGG19) or 227×227 (AlexNet) pixel square patch and feed it into the network. During validation, a single centered crop is used. We did not use any kind of color, scale, or aspect ratio augmentation.
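The resizing step can be sketched as follows (plain Python; the function name is ours, and rounding to whole pixels is an assumption):

```python
# Sketch of aspect-ratio-preserving resizing: scale the image so that its
# smaller side becomes exactly `target` pixels.

def resize_shorter_side(width, height, target=256):
    """Return (new_width, new_height) with the shorter side == target."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)

# A 512x384 image is scaled so the shorter side (384) becomes 256:
print(resize_shorter_side(512, 384))  # (341, 256)
```

A random (training) or centered (validation) square patch is then cropped from the resized image before being fed to the network.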
During training of residual networks, we also observed a sudden divergence at random points in training, as explained by Szegedy et al. In this case, we restarted the training from the last snapshot. Due to a different random seed, the order of the images changes and hence training no longer diverges at that point.
Please note that the final models are not cherry-picked based on the validation error. We provide the final model after the full training is completed. We did not intervene in the training and, in particular, did not manually change the learning rate, as is usually done if the step policy is used for the learning rate.
The top-1 and top-5-error of the trained models are shown in Table 1.
| Model | Top-1 error | Top-5 error |
| --- | --- | --- |
As observed in previous works, the error rates benefit from the added batch normalization layers. All provided models slightly improve upon the error rates achieved by previously trained models. In the case of AlexNet, for example, we even observe an error decrease of over 2.6%.
In addition to the final results, we also visualize the single-crop top-1 error on the validation set during the training of AlexNet in Fig. 1.
As shown in the figure, the error decreases consistently and fairly quickly during training. Since we use linear learning rate decay, there is a steep error decrease towards the end of the training. While it might look like the error could decrease even further, this is not the case: the learning rate approaches zero towards the end of training, and even if the learning rate is kept constant at that point, no further improvement can be observed. This is supported by several experiments we performed.
This paper presents a new set of pre-trained models for the ImageNet dataset using the Caffe framework. We focus on the batch-normalization-variants of AlexNet and VGG19 as well as residual networks. All models outperform previous pre-trained models. In particular, we were able to reproduce the ImageNet results of residual networks. All models, log files and training code are available at http://www.inf-cv.uni-jena.de/Research/CNN+Models.html and https://github.com/cvjena/cnn-models.
The authors thank Nvidia for GPU hardware donations. Part of this research was supported by grant RO 5093/1-1 of the German Research Foundation (DFG).
Convolutional patch networks with spatial prior for road detection and urban scene understanding. In International Conference on Computer Vision Theory and Applications (VISAPP), pages 510–517, 2015.
German Conference on Pattern Recognition (GCPR), 2016.
Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 448–456, 2015.
Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2015, San Diego, California, USA, May 9-12, 2015, 2015.
Conditional random fields as recurrent neural networks. In The IEEE International Conference on Computer Vision (ICCV), December 2015.