Make (Nearly) Every Neural Network Better: Generating Neural Network Ensembles by Weight Parameter Resampling

07/02/2018, by Jiayi Liu, et al.

Deep Neural Networks (DNNs) have become increasingly popular in computer vision, natural language processing, and other areas. However, training and fine-tuning a deep learning model is computationally intensive and time-consuming. We propose a new method to improve the performance of nearly every model, including pre-trained models. The proposed method uses an ensemble approach where the networks in the ensemble are constructed by reassigning model parameter values based on the probabilistic distribution of these parameters, calculated towards the end of the training process. For pre-trained models, this approach results in an additional training step (usually less than one epoch). We perform a variety of analyses using the MNIST dataset and validate the approach with a number of DNN models pre-trained on the ImageNet dataset.




1 Introduction

DNNs have applications in image classification, object detection, machine translation, and many others (He et al., 2016; Redmon et al., 2016; Wu et al., 2016). In such applications, even a marginal improvement in model performance can have significant business value.

Ensemble methods are commonly used in computer vision competitions and achieve better performance compared to single models (Krizhevsky et al., 2012; Simonyan and Zisserman, 2015; He et al., 2016). However, in the case of DNNs, training even a single model is computationally intensive, making ensemble approaches less tractable.

The distribution of DNN parameters has been studied extensively as part of Bayesian Neural Networks. State-of-the-art variational inference provides robustness to overfitting, leading to better model performance (Gal and Ghahramani, 2016). However, the information from the training updates is not fully utilized.

Recently, Garipov et al. (2018) proposed a procedure to ensemble a DNN model at different training stages. The method enables a fast ensemble by reducing the number of models that need to be trained from scratch. Furthermore, the same team improved the method by directly averaging the weights instead of using an ensemble thereby reducing the computation cost (Izmailov et al., 2018).

The above-mentioned methods all require retraining the model. We propose a new method that uses the uncertainty residing in the Stochastic Gradient Descent (SGD) updates, via model ensembling and parameter averaging, to improve the model's prediction performance.

The key contributions of the paper include:

  • We propose a fast and universal method to fine-tune a given DNN model for better prediction performance.

  • We explore and study the factors that are critical to the proposed method using the MNIST dataset (LeCun et al., 1998).

  • We test the approach against state-of-the-art models, namely Inception-V3 and MobileNet (Szegedy et al., 2015; Howard et al., 2017), using the ImageNet dataset (Deng et al., 2009).

In this paper, we first introduce our approach in Sec. 2. Then we carry out an extensive analysis using the LeNet model on the MNIST dataset and evaluate the results on a variety of DNN models on the ImageNet dataset in Sec. 3. Finally, we discuss the proposed method, compare it with other related works in Sec. 4, and conclude the paper.

2 Method

DNNs are commonly trained by the SGD method or its variants, where the parameters, θ, are updated based on the derivative of the loss for each mini-batch of data:

θ_{t+1} = θ_t − η ∂L/∂θ_t,

where L is the loss of a sample for the given model parameters θ_t at step t, and the hyperparameter η is the learning rate that controls the step size of the update.

Given the variations across batches of data, the updates are stochastic and the parameters asymptotically approach local optima. To reduce convergence instability, the learning rate that throttles the step size of the updates is either predetermined as a constant, follows a learning schedule, or is updated according to the update statistics.
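As an illustration, a single plain-SGD update step can be sketched as follows. This is a minimal NumPy sketch, not the authors' implementation; the tensor values are arbitrary toy numbers.

```python
import numpy as np

def sgd_step(theta, grad, lr):
    """One plain SGD update: theta <- theta - lr * dL/dtheta."""
    return theta - lr * grad

# Toy example: two parameters, one mini-batch gradient.
theta = np.array([0.5, -0.3])
grad = np.array([0.2, -0.1])
theta = sgd_step(theta, grad, lr=0.1)  # -> [0.48, -0.29]
```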

In this paper, we propose to use the uncertainty of the model parameters during the training updates to create a final model. We first estimate the mean and the variance of the parameters by continuing the training with a few mini-batches after the model is trained (this fine-tuning stage may or may not share the same SGD method used in the previous training). Because the network size is commonly very large, we use an online algorithm to update the mean and variance (Welford, 1962), instead of saving all intermediate values:

μ_n = μ_{n−1} + (θ_n − μ_{n−1}) / n,
M_n = M_{n−1} + (θ_n − μ_{n−1})(θ_n − μ_n),    σ²_n = M_n / (n − 1),

where θ_n is the parameter value after the n-th fine-tuning update.
We then use two different approaches to resample the parameters for prediction:

  • We reassign the parameter values to their means after the fine-tuning stage.

  • We resample the parameter values from a Gaussian distribution with the mean and standard deviation estimated during the fine-tuning stage. We create multiple models from the resampling and make predictions by averaging the predictions of the ensemble.
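The two resampling approaches above can be sketched as follows. This is a hedged toy sketch: `predict` stands in for a real DNN forward pass, and all names and values are illustrative, not from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def resample_models(mean, std, n_models):
    """Draw n_models parameter vectors from N(mean, std^2), elementwise.
    With std = 0 this reduces to the mean-reassignment approach."""
    return [rng.normal(mean, std) for _ in range(n_models)]

def ensemble_predict(predict_fn, weight_sets, x):
    """Average the predictions of all resampled models."""
    return np.mean([predict_fn(w, x) for w in weight_sets], axis=0)

# Toy linear "model" as a stand-in for a DNN: prediction = x . w.
predict = lambda w, x: x @ w
mean = np.array([1.0, 2.0])
std = np.array([0.0, 0.0])  # zero std: every sample equals the mean
models = resample_models(mean, std, n_models=3)
y = ensemble_predict(predict, models, np.array([1.0, 1.0]))  # -> 3.0
```

With a nonzero `std`, each ensemble member is a distinct perturbation of the mean weights, and averaging their predictions gives the ensemble estimate.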

3 Experiment

We tested our method on two sets of experiments. Using MNIST, we explored a large number of configurations to understand the limiting factors in Sec. 3.1, and we provide a number of results from pre-trained models on ImageNet to examine the robustness of the method in Sec. 3.2.

3.1 MNIST

Figure 1: MNIST Accuracies with Different Learning Rates, Optimized by SGD

3.1.1 Setup

We used MNIST dataset (LeCun et al., 1998) to quickly explore the configurations of the LeNet model, namely optimization and regularization, and to understand important factors in the proposed method.

MNIST is a commonly used dataset for computer vision, containing hand-written digits split into 60,000 training samples and 10,000 testing samples. We train our model on the full training set with a batch size of 128 and measure the accuracies on the testing set. For each configuration, we repeat the same procedure 10 times and report the mean and standard deviation of the accuracies.

The model improvement is sensitive to the final status of the pre-trained model. In the extreme case, a model at the global minimum cannot be further improved without overfitting the data. We choose different learning rates for training, η_train, to examine the proposed method. A small η_train leads to local minima that might be far away from the global one, while a large value prevents the model from settling into the minima. In this paper, we trained the model for 2 epochs. We found that larger or smaller values of η_train deteriorated performance, and we excluded them from the discussion. Similarly, at the fine-tuning stage, the learning rate η_finetune is also important, and we tested a range of values. The weight distribution is estimated from updates over 500 mini-batches initialized from the pre-trained model (roughly one epoch).
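The fine-tuning stage described above, continuing SGD from a pre-trained model while accumulating parameter statistics, can be sketched end to end. This is a toy sketch under stated assumptions: `noisy_grad` is a hypothetical stand-in for a mini-batch gradient, and the quadratic loss is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def finetune_statistics(theta, grad_fn, lr, n_batches=500):
    """Continue training for n_batches SGD updates while accumulating the
    running mean and standard deviation of the parameters (Welford, 1962)."""
    n = 0
    mean = np.zeros_like(theta)
    m2 = np.zeros_like(theta)
    for _ in range(n_batches):
        theta = theta - lr * grad_fn(theta)  # one noisy fine-tuning update
        n += 1
        delta = theta - mean
        mean += delta / n
        m2 += delta * (theta - mean)
    return mean, np.sqrt(m2 / (n - 1))

# Toy quadratic loss centered at 1.0 with mini-batch noise on the gradient.
noisy_grad = lambda t: t - 1.0 + rng.normal(0.0, 0.1, t.shape)
mean, std = finetune_statistics(np.array([1.0]), noisy_grad, lr=0.1)
# mean stays near the optimum at 1.0; std reflects the SGD fluctuation.
```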

Besides the learning rate, the optimization method for the model update is also important. We mainly focused on the plain SGD method to understand the method's behavior. Many other update strategies have been proposed for a better convergence rate, e.g., AdaGrad and Adam (Duchi et al., 2011; Kingma and Ba, 2015), which adaptively adjust the learning rate for faster and better convergence. In this paper, we also tested our method using the Adam optimizer with the default values in TensorFlow (version r1.8).

The regularization method also has an important impact on model generalization and prediction accuracy. A well-regularized model alleviates overfitting of the training data and improves prediction accuracy. In this work, we tested our method against models with and without Dropout (Srivastava et al., 2014). We use solid lines for models with Dropout and dashed lines for models without regularization hereafter.

3.1.2 Results

First, we present the results of training the models using the fixed-learning-rate SGD method in Fig. 1. The fine-tuning learning rates are jittered in the figure for better visualization. As expected, in the plain SGD approach (green), the combination of a larger learning rate at the training stage and a smaller learning rate at the fine-tuning stage is always preferred for the best performance, because a larger learning rate at the training stage explores a larger space for the global minimum, while a smaller fine-tuning learning rate helps convergence. The comparison shows that regularization helps the model to be more general and more accurate.

In comparison with plain fine-tuning, a larger fine-tuning learning rate is always preferred in our approaches, and we did not see any performance degradation over a wide range of learning rates. Also, regularization has less impact on model performance, as the difference between the dashed and solid lines is marginal.

Finally, we didn’t see significant differences in the mean-resampled model and ensemble approach with 3 resampled models. However, we do see a marginal improvement with 10 ensembles but it is typically not feasible in a real application as it takes 10 times longer. We could treat the mean-resampled model as a special case in this ensemble approach. Also, we found that the method need enough updates to measure the distribution reliably (one epoch is typically sufficient).

In Fig. 2, we compare the results on pre-trained models that were trained using the Adam optimizer. Again, the results from the fine-tuning stage are similar to the SGD results in Fig. 1. We also performed the fine-tuning stage using the Adam optimizer with default values. As the learning rate is not relevant for Adam, the results fine-tuned by the Adam optimizer are marked by the straight black line. Our resampling method gives the best result, while the fine-tuned Adam result is below 0.993 (not shown). Finally, given the large scatter across the 10 different runs, we only see a marginal improvement from using regularization in our approach.

Figure 2: MNIST Accuracies with Different Learning Rates, Optimized by Adam

3.2 DNN Results

We also performed many experiments on ImageNet (Deng et al., 2009) using publicly available pre-trained models to validate the generalization of our proposed method. As the ImageNet dataset is large, we only used 25% of the full dataset in the fine-tuning stage to estimate the uncertainties of the model parameters (10,000 updates). The accuracies after fine-tuning are also reported; given the computational cost, we ran only one iteration per model configuration.

In Fig. 3, we examine the pre-trained Inception-V3 model (Szegedy et al., 2015) with a range of learning rates. The pre-trained model is highly fine-tuned, hence the improvements are very small. Still, the resampled mean weights do improve upon the results from the baseline model and the best fine-tuned model. It is also worth mentioning that the proposed method showed consistently better performance over a wide range of learning rates in both pre-trained models.

Fig. 4 shows our results for the MobileNet architecture (Howard et al., 2017); the pre-trained base model has a top-1 accuracy of 70.124%, as the model is designed to be light-weight. We achieved some improvement over the baseline model even by just using the SGD method to fine-tune the model parameters. Upon resampling the weights, the results show further improvement for both models in all cases.

Figure 3: ImageNet Results Using Pre-trained Inception
Figure 4: ImageNet Results Using Pre-trained MobileNet

4 Discussion

The experimental results above demonstrate the usability of our proposed method. In this section, we first highlight its benefits and then compare it with a closely related study.

The major contributions of this work are as follows. First, the method is shown to improve the accuracies of a range of DNN models. Second, it is less sensitive to the learning rate used to update the model parameters in the training stage. Third, resampling the model parameters with their mean values requires no additional computing cost for inference and only a marginal burden in the training stage. Finally, the method is efficient, requiring one epoch or less to fine-tune a pre-trained DNN model.

Izmailov et al. (2018) proposed a Stochastic Weight Averaging (SWA) method to improve model performance. Similar to our approach of reassigning model parameters with their means, it uses the average of the parameters during the training steps. The two main differences between our approaches are:

  • Our method fine-tunes a model based on the pre-trained values, whereas the SWA method needs to train a model from scratch. The two methods thus have different focuses at the moment.

  • We sample the parameter distribution at each step during the fine-tuning stage, whereas the SWA method samples at the end of each learning cycle.

We therefore focused on improving the pre-trained model rather than comparing directly with their approach. A comparison with the SWA method and other algorithms would be interesting future work.

5 Conclusion

We presented extensive experiments with our proposed method on MNIST with a simple LeNet model and initial results on ImageNet with state-of-the-art DNN models. Our future work includes developing a theoretical understanding of this approach, which will provide a solid foundation to further guide its use.


  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255. IEEE, 2009.
  • Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
  • Gal and Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning, pages 1050–1059, New York, NY, USA, 2016. JMLR.
  • Garipov et al. (2018) Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry Vetrov, and Andrew Gordon Wilson. Loss surfaces, mode connectivity, and fast ensembling of DNNs. arXiv:1802.10026, 2018.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • Howard et al. (2017) Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861, 2017.
  • Izmailov et al. (2018) Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. arXiv:1803.05407, 2018.
  • Kingma and Ba (2015) Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1–9, 2012.
  • LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Redmon et al. (2016) Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
  • Simonyan and Zisserman (2015) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
  • Szegedy et al. (2015) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. arXiv:1512.00567, 2015.
  • Welford (1962) B. P. Welford. Note on a method for calculating corrected sums of squares and products. Technometrics, 4(3):419–420, 1962. ISSN 00401706.
  • Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv:1609.08144, 2016.