Batch normalization layers (Ioffe and Szegedy, 2015) are an essential building block for most modern network architectures with visual inputs. However, these layers have a slightly different structure that requires more careful consideration when performing the Mean Var init. Letting x be a batch of activations, batch norm computes

BN(x) = γ (x − μB) / √(σB² + ε) + β

Here, γ and β are learnable scale and shift parameters, and μB, σB are a running mean and variance accumulated over the training dataset. Thus, in transfer learning, μB, σB start off as the mean and variance of the ImageNet activations, which are unlikely to match the medical image statistics. Therefore, for the Mean Var Init, we initialized all of the batch norm parameters to the identity: γ = σB = 1 and β = μB = 0. We call this the BN Identity Init. Two alternatives are BN ImageNet Mean Var, which resamples the values of all batch norm parameters according to the ImageNet means and variances, and BN ImageNet Transfer, which copies over the batch norm parameters from ImageNet. We compare these three methods in Figure 5, with non-batch-norm layers initialized according to the Mean Var Init. Broadly, they perform similarly, with BN Identity Init (what we use) performing slightly better. BN ImageNet Transfer, where the ImageNet batch norm parameters are transferred directly to the medical images, performs the worst.
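As a concrete illustration, the inference-mode batch norm computation and the BN Identity Init can be sketched in NumPy. This is a minimal sketch under assumed conventions: the function name, the per-channel parameter shapes, and the ε value are illustrative, not taken from the original.

```python
import numpy as np

def batch_norm_inference(x, gamma, beta, mu_B, sigma_B_sq, eps=1e-5):
    """Inference-mode batch norm: normalize with the running statistics
    (mu_B, sigma_B_sq), then apply the learnable scale (gamma) and shift (beta)."""
    return gamma * (x - mu_B) / np.sqrt(sigma_B_sq + eps) + beta

# Hypothetical activations: a batch of 8 examples with 4 channels.
x = np.random.randn(8, 4)

# BN Identity Init: gamma = sigma_B = 1, beta = mu_B = 0,
# so the layer starts out as a (near) identity map.
gamma, beta = np.ones(4), np.zeros(4)
mu_B, sigma_B_sq = np.zeros(4), np.ones(4)

out = batch_norm_inference(x, gamma, beta, mu_B, sigma_B_sq)

# With the identity init, the output differs from x only by the
# 1/sqrt(1 + eps) factor, i.e. the layer is essentially a pass-through.
assert np.allclose(out, x, atol=1e-4)
```

BN ImageNet Transfer would instead keep the pretrained values of all four quantities, which is why the ImageNet running statistics then clash with the new data distribution.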