How to Use Dropout Correctly on Residual Networks with Batch Normalization

by Bum Jun Kim, et al.

Regularization methods such as dropout and batch normalization are widely used across tasks to stabilize the optimization of deep neural networks. Nevertheless, the correct position at which to apply dropout has rarely been discussed, and practitioners have placed it at different positions. In this study, we investigate where dropout should be applied. We demonstrate that, for a residual network with batch normalization, applying dropout at certain positions improves performance, whereas applying it at other positions degrades performance. Based on theoretical analysis, we provide the following guideline: apply a single dropout after the last batch normalization but before the last weight layer in the residual branch. We provide detailed theoretical explanations to support this claim and verify them through module tests. In addition, we investigate the correct position of dropout in the head that produces the final prediction. Although the current consensus is to apply dropout after global average pooling, we prove that applying dropout before global average pooling leads to a more stable output. The proposed guidelines are validated through experiments using different datasets and models.
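The claim about the head can be illustrated with a toy simulation. The sketch below is not the paper's proof or code; it assumes inverted dropout with independent element-wise masks, a single channel with non-negative (post-ReLU-like) activations, and hypothetical helper names (`dropout`, `global_average_pool`, `head_output_variance`) chosen only for this example. Under those assumptions, it compares the variance of the pooled output when dropout is applied before versus after global average pooling.

```python
import random
import statistics


def dropout(values, p, rng):
    """Inverted dropout: zero each value with probability p, scale survivors by 1/(1-p)."""
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def global_average_pool(values):
    """Average over all spatial positions of a single channel."""
    return sum(values) / len(values)


def head_output_variance(feature_map, p, dropout_first, trials, seed=0):
    """Empirical variance of the pooled output over random dropout masks."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(trials):
        if dropout_first:
            # Dropout before pooling: many independent masks are averaged.
            out = global_average_pool(dropout(feature_map, p, rng))
        else:
            # Dropout after pooling: one mask decides the whole channel.
            out = dropout([global_average_pool(feature_map)], p, rng)[0]
        outputs.append(out)
    return statistics.pvariance(outputs)


# Assumed toy input: one 7x7 channel of non-negative activations.
data_rng = random.Random(42)
feature_map = [data_rng.random() for _ in range(49)]

var_before = head_output_variance(feature_map, p=0.5, dropout_first=True, trials=5000)
var_after = head_output_variance(feature_map, p=0.5, dropout_first=False, trials=5000)
print("variance with dropout before pooling:", var_before)
print("variance with dropout after pooling: ", var_after)
```

In this toy setting the pooled output varies less when dropout comes first, because averaging over many independently masked positions smooths out the noise, whereas a single mask after pooling either keeps or removes the entire channel.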




