DNNs have applications in image classification, object detection, machine translation, and many other areas (He et al., 2016; Redmon et al., 2016; Wu et al., 2016). In such applications, even a marginal improvement in model performance can have significant business value.
Ensemble methods are commonly used in computer vision competitions and achieve better performance than single models (Krizhevsky et al., 2012; Simonyan and Zisserman, 2015; He et al., 2016). However, in the case of DNNs, training even a single model is computationally intensive, making ensemble approaches less tractable.
The distribution of DNN parameters has been studied extensively as part of Bayesian Neural Networks. The state-of-the-art variational inference provides robustness to overfitting leading to better model performance (Gal and Ghahramani, 2016). However, the information from training updates is not fully utilized.
Recently, Garipov et al. (2018) proposed a procedure to ensemble a DNN model at different training stages. The method enables a fast ensemble by reducing the number of models that need to be trained from scratch. Furthermore, the same team improved the method by directly averaging the weights instead of using an ensemble, thereby reducing the computation cost (Izmailov et al., 2018).
The above-mentioned methods all require retraining the model. We propose a new method that uses the uncertainty residing in the Stochastic Gradient Descent (SGD) updates for model ensembling and parameter averaging to improve prediction performance.
The key contributions of the paper include:
In this paper, we first introduce our approach in Sec. 2. Then we carry out an extensive analysis using the LeNet model on the MNIST dataset and evaluate the results on a variety of DNN models on the ImageNet dataset in Sec. 3. Finally, we discuss the proposed methods, compare them with related work in Sec. 4, and conclude the paper.
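In SGD, the model parameters are updated at each step as

$$\theta_{t+1} = \theta_t - \eta \, \nabla_{\theta} L(x_i; \theta_t),$$

where $L(x_i; \theta_t)$ is the loss of a sample $x_i$ for given model parameters $\theta_t$ at step $t$, and the hyperparameter $\eta$ is the learning rate that controls the step size of the update.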
Given the variations across batches of data, the updates are stochastic and the parameters asymptotically reach local optima. To reduce convergence instability, the learning rate that throttles the step size of the updates is either fixed as a constant, follows a learning schedule, or is adapted according to the update statistics.
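As a minimal sketch of the plain SGD update with a constant learning rate, the snippet below runs mini-batch SGD on a toy least-squares problem. The problem, the function names (`sgd_step`, `run_sgd`), and all hyperparameter values are illustrative assumptions and are not taken from the experiments in this paper.

```python
import numpy as np

def sgd_step(theta, grad, lr):
    """One plain SGD update: theta <- theta - lr * grad."""
    return theta - lr * grad

def run_sgd(X, y, lr=0.1, batch_size=8, steps=200, seed=0):
    """Mini-batch SGD on a toy least-squares problem (illustrative only)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        # Each step sees a different random mini-batch, so the update is stochastic.
        idx = rng.choice(len(X), size=batch_size, replace=False)
        residual = X[idx] @ theta - y[idx]
        # Gradient of 0.5 * mean squared residual over the mini-batch.
        grad = X[idx].T @ residual / batch_size
        theta = sgd_step(theta, grad, lr)
    return theta

# Hypothetical toy data: a linear model with small observation noise.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(256, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.1 * data_rng.normal(size=256)

theta_hat = run_sgd(X, y)
print(theta_hat)  # close to true_theta, with residual jitter from the stochastic updates
```

Because the mini-batches differ from step to step, the parameters keep fluctuating around the optimum even after convergence; this residual fluctuation is the quantity the proposed method exploits.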
In this paper, we propose to use the uncertainty of the model parameters during the training updates to create a final model. We first estimate the mean and the variance of the parameters by continuing the training with a few mini-batches after the model is trained (this fine-tuning stage may or may not share the same SGD method used in the previous training). Because the network size is commonly very large, we use an online algorithm to update the mean and variance (Welford, 1962) instead of saving all intermediate values.
We then use two different approaches to resample the parameters for predictions.
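A minimal sketch of such an online update, following the standard form of Welford's algorithm, is given below. The class name `RunningMoments` and its interface are hypothetical; the sketch keeps only a running count, mean, and sum of squared deviations per parameter, so memory use does not grow with the number of fine-tuning steps.

```python
import numpy as np

class RunningMoments:
    """Online per-parameter mean and variance (Welford's algorithm)."""

    def __init__(self, shape):
        self.count = 0
        self.mean = np.zeros(shape)
        self.m2 = np.zeros(shape)  # running sum of squared deviations from the mean

    def update(self, theta):
        """Fold one snapshot of the parameters into the running statistics."""
        self.count += 1
        delta = theta - self.mean          # deviation from the old mean
        self.mean += delta / self.count
        self.m2 += delta * (theta - self.mean)  # uses old and new mean

    def variance(self):
        """Unbiased sample variance; zero until at least two snapshots are seen."""
        if self.count < 2:
            return np.zeros_like(self.mean)
        return self.m2 / (self.count - 1)

# Usage with hypothetical parameter snapshots, one per fine-tuning step.
rng = np.random.default_rng(0)
moments = RunningMoments(shape=(4,))
for _ in range(500):
    moments.update(rng.normal(loc=1.0, scale=0.1, size=4))
print(moments.mean, moments.variance())
```

In a real training loop, `update` would be called once per mini-batch update with the current value of each weight tensor.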
We reassign the value of parameters to the mean after the fine-tuning stage.
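The two resampling strategies can be sketched as follows, given per-parameter mean and variance estimates from the fine-tuning stage. Several choices here are assumptions made only for illustration: the ensemble members are drawn from an independent Gaussian per parameter, ensemble predictions are combined by averaging class probabilities, and `softmax_linear` is a hypothetical stand-in for a real network's forward pass.

```python
import numpy as np

def mean_resample(mean):
    """Mean resampling: use the estimated mean as the final weights."""
    return mean.copy()

def sample_ensemble(mean, var, n_models, rng):
    """Draw n_models weight sets, assuming an independent Gaussian per parameter."""
    std = np.sqrt(var)
    return [mean + std * rng.standard_normal(mean.shape) for _ in range(n_models)]

def softmax_linear(w, x):
    """Hypothetical stand-in model: a linear layer followed by softmax."""
    logits = x @ w
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_predict(predict_fn, weight_sets, x):
    """Average the predicted class probabilities over all sampled weight sets."""
    return np.mean([predict_fn(w, x) for w in weight_sets], axis=0)

# Usage with made-up statistics for a 4-feature, 3-class model.
rng = np.random.default_rng(0)
mean = rng.normal(size=(4, 3))
var = np.full((4, 3), 0.01)
x = rng.normal(size=(5, 4))

p_mean = softmax_linear(mean_resample(mean), x)
p_ensemble = ensemble_predict(softmax_linear, sample_ensemble(mean, var, 3, rng), x)
```

The mean-resampled model needs a single forward pass at inference, while an ensemble of K sampled models needs K forward passes.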
We tested our method on two sets of experiments. Using MNIST, we explored a large number of configurations to understand the limiting factors (Sec. 3.1). We then report results from pre-trained models on ImageNet to examine the robustness of the method (Sec. 3.2).
We used the MNIST dataset (LeCun et al., 1998) to quickly explore the configurations of the LeNet model, namely optimization and regularization, and to understand the important factors in the proposed method.
MNIST is a commonly used dataset for computer vision. It contains hand-written digits split into 60,000 training samples and 10,000 testing samples. We train our model on the full training set with a batch size of 128 and measure the accuracy on the testing set. For each configuration, we repeat the same procedure 10 times and report the mean and standard deviation of the accuracies.
The model improvement is sensitive to the final state of the pre-trained model. In the extreme case, a model at the global minimum cannot be further improved without overfitting the data. We choose different learning rates for training to examine the proposed method. A small learning rate leads to local minima that might be far away from the global one, while a large value prevents the model from settling into the minima. In this paper, we trained the model for 2 epochs. We found that learning rates larger or smaller than the tested range deteriorated performance, and we excluded them from the discussion. Similarly, the learning rate at the fine-tuning stage is also important, and we tested several values. The weight distribution is estimated from the updates of 500 mini-batches initialized from the pre-trained model (roughly one epoch).
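The "roughly one epoch" figure can be checked with a short calculation, using only the batch size of 128, the 60,000 training samples, and the 500 fine-tuning mini-batches stated in the text; the variable names are illustrative.

```python
batch_size = 128
train_samples = 60000
fine_tune_batches = 500

# Number of mini-batch updates needed to see every training sample once.
updates_per_epoch = train_samples / batch_size

# Fraction of the training set covered by the fine-tuning stage.
epochs_covered = fine_tune_batches * batch_size / train_samples

print(updates_per_epoch)         # 468.75
print(round(epochs_covered, 3))  # 1.067
```

So 500 mini-batches correspond to slightly more than one pass over the MNIST training set.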
Besides the learning rate, the optimization method for the model update is also important. We mainly focused on the plain SGD method to understand the behavior of our method. Many other update strategies have been proposed, e.g. AdaGrad and Adam (Duchi et al., 2011; Kingma and Ba, 2015), which adapt the learning rate for faster and better convergence. In this paper, we also tested our method using the Adam optimizer with default values in TensorFlow (see https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer, version r1.8).
Regularization also has an important impact on model generalization and prediction accuracy. A well-generalized model suffers less from overfitting the training data and achieves better prediction accuracy. In this work, we tested our method on models with and without Dropout (Srivastava et al., 2014). Hereafter, solid lines denote models with Dropout and dashed lines denote models without regularization.
First, we present the results of training the models using the fixed-learning-rate SGD method in Fig. 1. The fine-tuning learning rates are jittered in the figure for better visualization. As expected, in the plain SGD approach (green), the combination of a larger learning rate at the training stage and a smaller learning rate at the fine-tuning stage is always preferred for the best performance, because a larger learning rate at the training stage explores a larger space for the global minimum and a smaller fine-tuning learning rate helps convergence. The comparison shows that regularization helps the model to be more general and more accurate.
In comparison with plain fine-tuning, a larger learning rate is always preferred in our approaches, and we did not see any performance degradation over a wide range of learning rates. Regularization also has less impact on the model performance, as the difference between the dashed lines and the solid lines is marginal.
Finally, we did not see significant differences between the mean-resampled model and the ensemble approach with 3 resampled models. We do see a marginal improvement with an ensemble of 10 models, but this is typically not feasible in a real application because inference takes 10 times longer. The mean-resampled model can be treated as a special case of the ensemble approach. We also found that the method needs enough updates to measure the distribution reliably (one epoch is typically sufficient).
In Fig. 2, we compare the results on pre-trained models that were trained using the Adam optimizer. Again, the results from the fine-tuning stage are similar to the SGD results in Fig. 1. We also performed the fine-tuning stage using the Adam optimizer with default values. As the learning rate is not relevant for Adam, the results fine-tuned by the Adam optimizer are marked by the straight black line. Our resampled method gives the best result, while the fine-tuned Adam result is below 0.993 (not shown). Finally, given the large scatter across the 10 different runs, we see only a marginal improvement from using regularization in our approach.
3.2 DNN Results
We also performed experiments on ImageNet (Deng et al., 2009) using publicly available pre-trained models to validate the generality of our proposed method. As ImageNet is large, we used only 25% of the full dataset in the fine-tuning stage to estimate the uncertainties of the model parameters (10,000 updates). The accuracies after fine-tuning are also reported. Given the computational cost, we ran each model configuration only once.
In Fig. 3, we examined the pre-trained Inception-V3 model (Szegedy et al., 2015; retrieved from http://download.tensorflow.org/models/inception_v3_2016_08_28.tar.gz) with a range of learning rates. The pre-trained model is already highly tuned, hence the improvements are very small. Still, the resampled mean weights do improve upon the results from the baseline model and the best fine-tuned model. It is also worth mentioning that the proposed method showed consistently better performance over a wide range of learning rates in both pre-trained models.
Fig. 4 shows our results for the MobileNet architecture (Howard et al., 2017; retrieved from https://storage.googleapis.com/mobilenet_v2/checkpoints/mobilenet_v2_1.0_224.tgz). The pre-trained base model has a top-1 accuracy of 70.124%, which reflects its design as a lightweight model. We achieved some improvement over the baseline model even by just using the SGD method to fine-tune the model parameters. Upon resampling the weights, the results show further improvement on both models in all cases.
The previous experimental results support the usability of our proposed method. In this section, we first highlight its benefits and then compare it with another relevant study.
The major contributions of this work are as follows. First, the method is shown to improve the accuracy of a range of DNN models. Second, it is less sensitive to the learning rate used to update the model parameters in the training stage. Third, resampling the model parameters with their mean values requires no additional computing cost for inference and only a marginal burden in the training stage. Finally, the method is efficient in that it requires one epoch or less to fine-tune a pre-trained DNN model.
Izmailov et al. (2018) proposed the Stochastic Weight Averaging (SWA) method to improve model performance. Similar to our use of the mean to reassign model parameters, it uses the average of the parameters over the training steps. The two main differences between our approaches are:
We sample the parameter distribution at each step during the fine-tuning stage, whereas the SWA method samples at the end of each learning cycle.
We therefore focused on the improvement over the pre-trained model rather than on a comparison with their approach. It would be interesting to compare with the SWA method and other algorithms in future studies.
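The difference in sampling schedule can be illustrated with a small sketch that averages the same parameter trajectory in two ways: over every step, and only over snapshots taken at the end of each cycle. This covers the averaging schedule only; it does not reproduce the learning rate schedule of SWA, and the function names and the toy trajectory are hypothetical.

```python
import numpy as np

def average_every_step(trajectory):
    """Average the parameters over all steps (per-step sampling)."""
    return np.mean(trajectory, axis=0)

def average_at_cycle_ends(trajectory, cycle_length):
    """Average only the snapshots taken at the last step of each cycle."""
    snapshots = trajectory[cycle_length - 1::cycle_length]
    return np.mean(snapshots, axis=0)

# Toy trajectory: 6 steps of a 2-parameter model, one row per step.
trajectory = np.arange(12, dtype=float).reshape(6, 2)

print(average_every_step(trajectory))        # [5. 6.]
print(average_at_cycle_ends(trajectory, 3))  # [7. 8.]
```

Per-step sampling uses every update and therefore collects many more samples from the same number of training steps, whereas cycle-end sampling uses far fewer, more widely spaced snapshots.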
We conclude the paper with extensive experiments of our proposed method on MNIST with a simple LeNet model and initial results on ImageNet with state-of-the-art DNN models. Our future work includes developing a theoretical understanding of this approach, which will provide a solid foundation to further guide the use of our method.
- Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
- Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
- Gal and Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In Proceedings of the 33rd International Conference on Machine Learning, pages 1050–1059, New York, NY, USA, 2016. JMLR.
- Garipov et al. (2018) Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry Vetrov, and Andrew Gordon Wilson. Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs. Feb 2018. doi: arXiv:1802.10026.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- Howard et al. (2017) Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. Apr 2017. doi: arXiv:1704.04861.
- Izmailov et al. (2018) Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging Weights Leads to Wider Optima and Better Generalization. Mar 2018. doi: arXiv:1803.05407.
- Kingma and Ba (2015) Diederik P. Kingma and Jimmy Lei Ba. Adam: a Method for Stochastic Optimization. In International Conference on Learning Representations 2015, 2015.
- Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In Advances In Neural Information Processing Systems, pages 1–9, 2012. doi: http://dx.doi.org/10.1016/j.protcy.2014.09.007.
- LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Redmon et al. (2016) Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
- Simonyan and Zisserman (2015) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.
- Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
- Szegedy et al. (2015) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the Inception Architecture for Computer Vision. Dec 2015. doi: arXiv:1512.00567.
- Welford (1962) B. P. Welford. Note on a method for calculating corrected sums of squares and products. Technometrics, 4(3):419–420, 1962. ISSN 00401706.
- Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google’s neural machine translation system: Bridging the gap between human and machine translation. Sep 2016. doi: arXiv:1609.08144.