1 Introduction
Empiricism is at the very core of machine learning research: we demand that new methods and approaches compare favorably to previously introduced work, either on artificially generated data that highlights specific challenges or on real-world data for specific tasks. In this process, we implicitly rely on fellow scientists and reviewers to note discrepancies in the experimental setting (intentional or not; to err is, after all, human), such as any kind of overfitting (see http://hunch.net/?p=22). Recently, several studies have noted empirical shortcomings in the machine learning literature. For example, Henderson et al. (2017)
observed that due to nondeterminism, variance, and lack of significance metrics, it is difficult to judge whether claimed advances in reinforcement learning are empirically justified. Also,
Melis et al. (2018) established that several years of assumed progress in language modeling did not, in fact, improve upon a standard stacked LSTM model once the hyperparameters of all models were adequately tuned.
Bayesian Deep Learning applies the ideas of Bayesian inference to deep networks and is an active area of machine learning research. Popular techniques for approximate inference in deep networks include variational inference (VI)
(Graves, 2011), probabilistic backpropagation (PBP)
(Hernández-Lobato and Adams, 2015) for Bayesian neural networks, dropout as an interpretation of approximate Bayesian inference (Gal and Ghahramani, 2016), Deep Gaussian Processes (DGP) as a multi-layer generalization of Gaussian processes (Bui et al., 2016), Bayesian neural networks using variational matrix Gaussian (VMG) posteriors
(Louizos and Welling, 2016), and variants of Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) methods (Springenberg et al., 2016). A well-defined experimental setup is necessary to compare and benchmark these methods; one of the most popular is a regression experiment on a number of curated UCI datasets (Hernández-Lobato and Adams, 2015)
. This experiment is appealing for Bayesian neural networks because it provides predictive log-likelihood in addition to RMSE as an evaluation metric; predictive log-likelihood can be used to judge the quality of the uncertainty estimates produced by the model
(Gal and Ghahramani, 2016). In this study, we observe a discrepancy in the setup used for the UCI regression experiment in some recent works: they compare their models to baselines obtained under a different experimental setting. Concretely, this applies to VMG (Louizos and Welling, 2016), HS-BNN (Ghosh and Doshi-Velez, 2017), PBP-MV (Sun et al., 2017), and SGHMC (Springenberg et al., 2016). To gauge the impact of this erroneous comparison, we re-evaluate the regression experiments of the above-mentioned works against MC dropout (Gal and Ghahramani, 2016) in the same setting. The experimental results indicate that networks with dropout inference, when trained under the same conditions, outperform VMG, HS-BNN, and SGHMC, and are a close second to PBP-MV. These results suggest that several methods erroneously claimed state-of-the-art performance when they were introduced.
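To make the metric concrete, the per-point predictive log-likelihood for MC dropout can be estimated from T stochastic forward passes, treating each pass as a Gaussian with precision tau (as in Gal and Ghahramani, 2016). The sketch below is ours; the function and variable names are illustrative, not from any of the cited implementations.

```python
import numpy as np

def predictive_log_likelihood(y_true, y_samples, tau):
    """Average per-point test log-likelihood from T stochastic forward
    passes, treating each pass as a Gaussian with precision tau:
    log p(y) ~= logsumexp_t[-tau/2 (y - yhat_t)^2] - log T
                + 0.5 log tau - 0.5 log(2 pi).
    y_samples: array of shape (T, N); y_true: array of shape (N,)."""
    T = y_samples.shape[0]
    exponents = -0.5 * tau * (y_samples - y_true[None, :]) ** 2
    m = exponents.max(axis=0)                       # stable log-sum-exp
    lse = m + np.log(np.exp(exponents - m).sum(axis=0))
    ll = lse - np.log(T) + 0.5 * np.log(tau) - 0.5 * np.log(2 * np.pi)
    return ll.mean()
```

Higher values indicate predictive distributions that place more mass on the observed targets, which is why this metric complements RMSE as a measure of uncertainty quality.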
2 Regression Experiments
We perform the nonlinear regression experiments proposed in
Hernández-Lobato and Adams (2015), which have been adopted to evaluate approximate inference techniques in numerous subsequent works: Gal and Ghahramani (2016), Bui et al. (2016), Louizos and Welling (2016), Ghosh and Doshi-Velez (2017), Sun et al. (2017), Springenberg et al. (2016), among others. We use all the datasets from Hernández-Lobato and Adams (2015) except YearPredictionMSD, which is very large (515,345 instances, each with 90 dimensions), so that tuning network hyperparameters on it requires an inordinate amount of time. Our network architecture has a single hidden layer with 50 hidden units for all datasets, except for the Protein Structure dataset, for which there are 100 hidden units.

This experiment has been conducted in two ways in the past. In the first, the networks are trained for a fixed number of iterations (specifically, 40 epochs), and the average training time of the networks is noted and compared. This setting was used by
Hernández-Lobato and Adams (2015), Gal and Ghahramani (2016), and Bui et al. (2016); we refer to it as the timed setting of the experiment. In the second variant, the networks are trained to convergence with a larger number of training iterations; this setting was used by Louizos and Welling (2016), Ghosh and Doshi-Velez (2017), Sun et al. (2017), Springenberg et al. (2016), and others. In both variants, the test-set RMSE and log-likelihood values are used as the evaluation metrics.

Given these two settings, it is natural to ask whether models trained under the timed setting and models trained to convergence are comparable. One might argue that networks trained for a fixed number of iterations might not have converged and will thus perform poorly compared to those trained to convergence. To test this hypothesis, we use networks with MC dropout (Gal and Ghahramani, 2016), one of the de facto standard baselines for approximate inference.
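For readers unfamiliar with MC dropout, the key point is that dropout is kept active at test time and predictions are averaged over multiple stochastic forward passes. Below is a minimal numpy sketch for the single-hidden-layer ReLU architecture described above; the weights, function name, and default values are placeholders of ours, not the paper's implementation.

```python
import numpy as np

def mc_dropout_predict(x, W1, b1, W2, b2, p=0.05, T=100, seed=0):
    """T stochastic forward passes through a single-hidden-layer ReLU
    network with dropout kept ON at test time (MC dropout).
    Returns the predictive mean and the raw samples, shape (T, N, 1)."""
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(T):
        h = np.maximum(x @ W1 + b1, 0.0)       # hidden layer activations
        mask = rng.random(h.shape) >= p        # Bernoulli(1 - p) keep mask
        h = h * mask / (1.0 - p)               # inverted-dropout rescaling
        samples.append(h @ W2 + b2)
    samples = np.stack(samples)
    return samples.mean(axis=0), samples
```

The spread of the T samples (combined with the model precision) is what yields the predictive log-likelihood reported in the experiments.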
We compare with the following works: VMG (Louizos and Welling, 2016)
, where models trained to convergence have been benchmarked against MC dropout baselines from the timed setting; Bayesian neural networks with horseshoe priors (HS-BNN)
(Ghosh and Doshi-Velez, 2017) and probabilistic backpropagation with the matrix variate Gaussian (MVG) distribution (PBP-MV) (Sun et al., 2017), where the models were benchmarked only against VMG; and Stochastic Gradient Hamiltonian Monte Carlo methods (SGHMC) (Springenberg et al., 2016), where the results were compared with probabilistic backpropagation (PBP) (Hernández-Lobato and Adams, 2015) and variational inference (VI) (Graves, 2011) baselines obtained under the timed setting.

There are two hyperparameters: i) the model precision parameter, used to evaluate the log-likelihood, and ii) the dropout rate. We perform the following two variations of the regression experiment:

- Convergence: The networks are trained to convergence for 4,000 epochs, and the hyperparameter values are obtained by Bayesian optimization (BO), as described in Gal and Ghahramani (2016).

- Hyperparameter tuning: Just as in the previous variant, the networks are trained for 4,000 epochs, but the optimal hyperparameter values are obtained by a grid search over a range of (precision, dropout rate) pairs; the best pair is chosen based on performance on a validation set, created by randomly selecting 20% of the points in the training set.
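The hyperparameter-tuning variant can be sketched as follows. The grids, split fraction, and callback signatures below are illustrative assumptions of ours (the paper does not specify its exact grids); the structure simply mirrors the described procedure: hold out 20% of the training set, then pick the (precision, dropout rate) pair with the best validation score.

```python
import itertools
import numpy as np

def grid_search(train_fn, score_fn, X, y, taus, rates, val_frac=0.2, seed=0):
    """Pick the (precision, dropout rate) pair maximizing validation score.
    train_fn(X, y, p) -> model; score_fn(model, X, y, tau) -> scalar,
    e.g. validation log-likelihood (higher is better)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(val_frac * len(X))            # hold out 20% for validation
    val_idx, tr_idx = idx[:n_val], idx[n_val:]
    best_pair, best_score = None, -np.inf
    for tau, p in itertools.product(taus, rates):
        model = train_fn(X[tr_idx], y[tr_idx], p)
        score = score_fn(model, X[val_idx], y[val_idx], tau)
        if score > best_score:
            best_pair, best_score = (tau, p), score
    return best_pair
```

After the best pair is found, the final model is evaluated on the held-out test split as usual.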
The RMSE and log-likelihood values obtained from the above experiments are compared in Tables 1 and 2, respectively. Note that the RMSE and log-likelihood values of VMG, HS-BNN, PBP-MV, and SGHMC have been taken from their respective papers. The experimental results indicate that the Convergence and Hyperparameter tuning baselines substantially improve on the results obtained in the timed setting. The baseline also outperforms the other methods on the Concrete Strength, Naval Propulsion Plants, Wine Quality Red, and Yacht Hydrodynamics datasets in terms of RMSE. With respect to log-likelihood, the baseline performs best on the Boston Housing, Concrete Strength, and Wine Quality Red datasets. Finally, on the remaining datasets the baseline is competitive, second only to PBP-MV (Sun et al., 2017).
Table 1: Test RMSE values along with corresponding standard errors on the Boston Housing, Concrete Strength, Energy Efficiency, Kin8nm, Naval Propulsion, Power Plant, Protein Structure, Wine Quality Red, and Yacht Hydrodynamics datasets (numerical entries not preserved here).
Table 2: Test log-likelihood values along with corresponding standard errors on the same datasets (numerical entries not preserved here).
3 Conclusion
The RMSE and log-likelihood values obtained from the Convergence and Hyperparameter tuning experiments (Tables 1 and 2) are substantially better for MC dropout. We conclude that previous comparisons with baselines obtained from the timed setting are unrepresentative, as the models in one setting had not reached convergence. In summary, when benchmarking a method, its performance should always be evaluated using an experimental setup identical to the one used to evaluate its peers.
The source code for our experiments can be found at: https://github.com/yaringal/DropoutUncertaintyExps
References
Bui, T., Hernández-Lobato, D., Hernández-Lobato, J., Li, Y., and Turner, R. (2016). Deep Gaussian processes for regression using approximate expectation propagation. In International Conference on Machine Learning, pages 1472–1481.
Gal, Y. and Ghahramani, Z. (2016). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML-16).
Ghosh, S. and Doshi-Velez, F. (2017). Model selection in Bayesian neural networks via horseshoe priors. arXiv preprint arXiv:1705.10388.
Graves, A. (2011). Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348–2356.
Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. (2017). Deep reinforcement learning that matters. arXiv preprint arXiv:1709.06560.
Hernández-Lobato, J. M. and Adams, R. (2015). Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International Conference on Machine Learning, pages 1861–1869.
Louizos, C. and Welling, M. (2016). Structured and efficient variational deep learning with matrix Gaussian posteriors. In International Conference on Machine Learning, pages 1708–1716.
Melis, G., Dyer, C., and Blunsom, P. (2018). On the state of the art of evaluation in neural language models. In International Conference on Learning Representations.
Springenberg, J. T., Klein, A., Falkner, S., and Hutter, F. (2016). Bayesian optimization with robust Bayesian neural networks. In Advances in Neural Information Processing Systems, pages 4134–4142.
Sun, S., Chen, C., and Carin, L. (2017). Learning structured weight uncertainty in Bayesian neural networks. In Artificial Intelligence and Statistics, pages 1283–1292.