Methods that reduce internal covariate shift via learned rescaling and recentering of neural activations, like Batch Normalization (Ioffe & Szegedy, 2015), have been an essential ingredient for successfully training deep neural networks (DNNs). In Batch Normalization, neural activation values are rescaled with trainable parameters, where summary neural activity is typically computed as the mean and standard deviation over a batch of inputs. Such compact batch statistics, however, are sensitive to the input distribution, resulting in errors when novel images fall outside this distribution, for example under different and unseen lighting or noise conditions. There, and unlike the human visual system, modern DNNs perform and generalize poorly (Geirhos et al., 2018).
While the original Batch Normalization computed statistics across the activity in a single feature map (or channel) (Ioffe & Szegedy, 2015), trainable normalizations have been proposed along a number of dimensions of deep neural network layers, including Layer Normalization (Ba et al., 2016), Group Normalization (Wu & He, 2018), and Instance Normalization (Ulyanov et al., 2016); the recently proposed Switchable Normalization (Luo et al., 2018) meta-learns which normalization method to use during training. While these methods each have their merits, they do not resolve the sensitivity of DNNs to image degradation, because degraded images have statistical properties that were never observed by the network.
Here, we propose a local variant of Batch Normalization (BatchNorm), Local Normalization (LocalNorm), inspired by the continuous adaptation of spiking neurons to local temporal contrast (Mensi et al., 2016): we observe that the mean and variance of channel activity change when images are subjected to noise-related degradation. Figure 1 shows an example of how the addition of Gaussian noise flattens the color distribution for each channel in an image; other types of noise similarly affect the summary statistics (see Appendix). To increase the variance of the summary image statistics of the world from which the network learns, LocalNorm regularizes the normalization parameters during training by splitting the batch into Groups, each with its own normalization scaling parameters. At test time, the local channel statistics are then computed on the fly, either over a single image or over a set (batch) of images in the test set.
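This shift in the summary statistics is easy to reproduce. The following toy NumPy sketch (our illustration, not the paper's code) compares the per-channel mean and standard deviation of a random RGB image before and after adding clipped Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((64, 64, 3))                        # toy RGB image in [0, 1]
noisy = np.clip(img + rng.normal(0.0, 0.3, img.shape), 0.0, 1.0)

mu_clean, sd_clean = img.mean(axis=(0, 1)), img.std(axis=(0, 1))
mu_noisy, sd_noisy = noisy.mean(axis=(0, 1)), noisy.std(axis=(0, 1))

# Clipping piles mass at 0 and 1, so each channel's distribution flattens
# and its standard deviation grows, while the mean stays near 0.5:
print(sd_clean.round(3), sd_noisy.round(3))
```

A normalization layer whose statistics were frozen on the clean distribution will therefore rescale the noisy image incorrectly.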
We show that DNNs trained with LocalNorm are much more robust to image degradation: the trained networks exhibit strong performance on unseen images with noise conditions that are not in the training set. An example is shown in Figure 2, where poorly lit or camouflaged images of cars are misclassified by the network using BatchNorm and correctly classified by the same network architecture using LocalNorm. We also find that LocalNorm drastically improves classification of distorted images in general, as measured on the CIFAR10-C dataset (Hendrycks & Dietterich, 2018), and we suggest a simple data augmentation scheme to improve the summary statistics of small images. LocalNorm is straightforward to implement, also for networks already trained with standard BatchNorm: we show how a ResNet152 network trained further with LocalNorm improves accuracy on the Stanford Cars dataset. Training networks from scratch, we show that LocalNorm achieves the same or slightly better performance as BatchNorm (and modern variants) on image classification benchmarks at little additional computational expense.
2 Related work
Lighting and noise conditions can vary wildly across images, and various pre-processing steps are typically included in an image-processing pipeline to adjust color and reduce noise. In traditional computer vision, different filters and probabilistic models are applied for image denoising (Motwani et al., 2004). Modern approaches for noise removal include deep neural networks, like Noise2Noise (Lehtinen et al., 2018), DURR (Zhang et al., 2018b), and denoising AutoEncoders (Vincent et al., 2010), where the network is trained on a combination of noisy and original images to improve its performance on noisy datasets, thus increasing the network's robustness to image noise and yielding a better classifier. However, as noted in (Geirhos et al., 2018), training DNNs on images that include one type of noise does not generalize to other types of noise.
2.1 Neural Normalizing techniques
Normalization is typically used to rescale the dynamic range of an image. This idea has also been applied to deep learning in various guises, and notably Batch Normalization (BatchNorm) (Ioffe & Szegedy, 2015) was introduced to renormalize the mean and standard deviation of neural activations using an end-to-end trainable parametrization.
Normalization techniques. A Normal-based normalization is generally computed as

$$\hat{x}_i = \frac{1}{\sigma_i}(x_i - \mu_i), \qquad y_i = \gamma \hat{x}_i + \beta,$$

where $x_i$ is a part of the feature tensor computed by the previous layer and $\gamma$ and $\beta$ are the (trainable) scaling parameters. For a normal 3-dimensional image like RGB or GBR, $i = (i_N, i_H, i_W, i_C)$ is a 4D vector indexing the features in $(N, H, W, C)$ order, where $N$ is the batch size (number of images per batch), $H$ and $W$ are the spatial height and width axes, and $C$ is the channel axis.

The space spanned by $(N, H, W, C)$ can be subdivided and subsequently normalized in multiple ways. We call the subdivision, the elements on which this normalization is performed, a group $S_i$: different forms of input normalization can be described as dealing with different groups. The mean $\mu_i$ and standard deviation $\sigma_i$ of a certain computation group are computed as:

$$\mu_i = \frac{1}{m}\sum_{k \in S_i} x_k, \qquad \sigma_i = \sqrt{\frac{1}{m}\sum_{k \in S_i} (x_k - \mu_i)^2 + \epsilon},$$

where $\epsilon$ is a small constant like $10^{-5}$. The computation group $S_i$ is the set of pixels $k$ which share the mean $\mu_i$ and std $\sigma_i$, and $m$ is the size of the group $S_i$. BatchNorm and its variants can be mapped to a computational group along various axes (Figure 3).
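As a concrete illustration of the generic formula, the sketch below (a minimal NumPy version of our own, not the released implementation) normalizes one computational group of an $(N, H, W, C)$ tensor selected by a boolean mask:

```python
import numpy as np

def normalize_group(x, group_mask, eps=1e-5):
    """Normalize the entries of x selected by group_mask (a boolean array
    of x's shape) with the group's own statistics, following
    x_hat = (x - mu) / sigma with sigma = sqrt(var + eps).
    Trainable gamma/beta scaling is omitted for brevity."""
    vals = x[group_mask]
    mu = vals.mean()
    sigma = np.sqrt(vals.var() + eps)
    out = x.copy()
    out[group_mask] = (vals - mu) / sigma
    return out

# BatchNorm-style group: all pixels of channel 0, across the whole batch.
x = np.random.default_rng(1).standard_normal((4, 8, 8, 3))
mask = np.zeros(x.shape, dtype=bool)
mask[..., 0] = True
y = normalize_group(x, mask)          # channel 0 now has ~zero mean, unit std
```

The other channels are left untouched, since they belong to other groups with their own statistics.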
Batch Normalization (BatchNorm) was developed to ease training and improve the convergence speed and generalization ability of deep neural networks. In Figure 3(a), for each channel, BatchNorm computes $\mu$ and $\sigma$ along the $(N, H, W)$ axes. The computational group of BatchNorm comprises all the pixels (inputs) from all batch samples sharing the same channel index. We can write this as $S_i = \{k \mid k_C = i_C\}$, where $k$ denotes the pixel and $k_C$ the pixel's channel index.
Layer Normalization (LayerNorm) (Ba et al., 2016) was designed to resolve BatchNorm's dependence on the batch size, and as an elegant way to apply normalization to recurrent networks. LayerNorm estimates the statistical features of one sample, which can also correspond to the input at one time step of a sequence (Figure 3(b)). For each input sample, LayerNorm calculates ($\mu$ and $\sigma$) along the $(H, W, C)$ axes: as for BatchNorm, the computational group of LayerNorm can be defined as $S_i = \{k \mid k_N = i_N\}$.
Group Normalization (GroupNorm) (Wu & He, 2018) was designed to remove the dependence of normalization quality on the batch size. In general, the use of larger batch sizes improves the generalization ability of the network and accelerates the training process (Smith et al., 2017; Goyal et al., 2017); large batch sizes, however, are typically limited by the locally available computational resources. Group Normalization computes summarizing statistics only over a subset of channels (the group; Figure 3(c)), normalizing the computational group along the $(H, W)$ axes and a slice of the $C$ axis. With $G$ the number of channel groups, the computational group for GroupNorm is thus defined as $S_i = \{k \mid k_N = i_N, \lfloor k_C/(C/G) \rfloor = \lfloor i_C/(C/G) \rfloor\}$.
Instance Normalization (InstaNorm) (Ulyanov et al., 2016; 2017) was created for style transfer and quality improvement. InstaNorm normalizes the pixels of one sample in a single channel (Figure 3(d)), computing statistics along the $(H, W)$ axes. The InstaNorm computational group is defined as $S_i = \{k \mid k_N = i_N, k_C = i_C\}$.
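The methods above differ only in which axes of the $(N, H, W, C)$ tensor are reduced to form the group statistics. A compact NumPy sketch (our illustration; scaling parameters omitted) makes this explicit:

```python
import numpy as np

# Axes reduced to form each method's computational group on an
# (N, H, W, C) tensor.
REDUCE_AXES = {
    "batch":    (0, 1, 2),  # BatchNorm: per channel, over the whole batch
    "layer":    (1, 2, 3),  # LayerNorm: per sample, over all channels/pixels
    "instance": (1, 2),     # InstaNorm: per sample and per channel
}

def normalize(x, method, eps=1e-5):
    """Normalize an (N, H, W, C) tensor with the chosen method.
    keepdims=True lets (x - mu) / sigma broadcast back over the group."""
    axes = REDUCE_AXES[method]
    mu = x.mean(axis=axes, keepdims=True)
    sigma = np.sqrt(x.var(axis=axes, keepdims=True) + eps)
    return (x - mu) / sigma

x = np.random.default_rng(2).standard_normal((4, 8, 8, 6))
y = normalize(x, "instance")   # every (sample, channel) slice has ~zero mean
```

GroupNorm fits the same pattern after reshaping the channel axis into $(G, C/G)$.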
Switchable Normalization (SwitchNorm) (Luo et al., 2018) was proposed as a linear combination of BatchNorm, LayerNorm and InstaNorm: in the SwitchNorm layer, the relative weighting of each normalization method is adjusted during the training process. This allows the network to learn the right type of normalization at the right place in the network to improve performance; this does come, however, at the expense of a sizable increase in parameters and computation.
3 Local Normalization (LocalNorm)
We develop LocalNorm to improve the robustness of DNNs to various noise conditions. For BatchNorm, the mean and std are calculated along all training samples in a channel and then fixed for evaluation on test images; as noted however, when the (test) image distribution changes, these statistical parameters will drift. As a result, DNNs with BatchNorm layers are sensitive to input that deviates from the training distribution, including noisy images.
Simply computing the summary statistics on the fly, to account for a potential drift, only partly solves the problem: in Figure 4, we show what happens when the mean and std are computed as dynamical quantities also at test time for the standard benchmarks CIFAR10 and Stanford Cars, using modern deep neural networks (for details, see below). For each test image (or batch of test images) we compute $(\mu, \sigma)$, for increasing noise (here, added Gaussian noise). For CIFAR10 (Figure 4a), we find that using single test images when evaluating gives poor results, as the small (32x32) images do not provide channel activity sufficient for effective summarizing statistics (Dynamic BN). Computing these statistics over a batch, however, shows a marked improvement (Dynamic BN-Batch): test accuracy then exceeds standard BatchNorm for noisy images, at the expense of a slight decrease in accuracy for noiseless images. For the large images in Stanford Cars, we see that dynamically computing $(\mu, \sigma)$ at test time even for single images drastically improves accuracy (Figure 4b); the classification accuracy absent noise, however, drops. While computing summary statistics over a batch at test time is feasible for benchmarking purposes, in a real-world application this would correspond to, for example, using a video stream, which would substantially increase computational cost and latency.
In LocalNorm, we regularize the normalization layer for variations in $\mu$ and $\sigma$. The aim is to make the trained architecture less sensitive to changes in these statistics at test time, such that we can dynamically recompute $\mu$ and $\sigma$ on test images. We divide the batch into $G$ separate Groups, for each of which we compute summarizing statistics, and associate separate scaling parameters $\gamma_g$ and $\beta_g$ with each Group $g$ (illustrated in Figure 5). As shown in Figure 3(e), for LocalNorm the computational group is defined along the $(H, W)$ axes and a slice of the $N$ axis: $S_i = \{k \mid k_C = i_C, \lfloor k_N/(N/G) \rfloor = \lfloor i_N/(N/G) \rfloor\}$.
Effectively, each computational group can be regarded as a separate network sharing most parameters, where inputs are passed randomly through one such network during training.
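A minimal forward pass of this scheme can be sketched as follows (a NumPy illustration under our own naming, not the released Keras/TensorFlow implementation):

```python
import numpy as np

def local_norm(x, num_groups, gamma, beta, eps=1e-5):
    """LocalNorm sketch: split the batch axis of an (N, H, W, C) tensor
    into num_groups groups; each group computes its own per-channel
    statistics and applies its own scaling parameters.
    gamma and beta have shape (num_groups, C)."""
    n, h, w, c = x.shape
    g = num_groups
    xg = x.reshape(g, n // g, h, w, c)
    mu = xg.mean(axis=(1, 2, 3), keepdims=True)       # per group, per channel
    sd = np.sqrt(xg.var(axis=(1, 2, 3), keepdims=True) + eps)
    xn = (xg - mu) / sd
    out = gamma.reshape(g, 1, 1, 1, c) * xn + beta.reshape(g, 1, 1, 1, c)
    return out.reshape(n, h, w, c)

x = np.random.default_rng(3).standard_normal((16, 8, 8, 3))
y = local_norm(x, 4, np.ones((4, 3)), np.zeros((4, 3)))
```

With identity scaling, each group's slice of the output has zero mean and unit variance per channel, computed only from that group's images.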
As noted, for BatchNorm the channel summary statistics are taken as fixed from the training set after training. For LocalNorm, we recompute these statistics at test-time: this naturally incorporates changes in the image statistics, and the Group-induced regularized normalization ensures that the network also performs well for different such summary statistics.
Since LocalNorm both provides multiple independent Groups and computes summary statistics at test time, there are different variants for classifying a novel image at test time. Ideally, a single new image is passed through a randomly selected Group, such that summary statistics are computed on the fly on this single image only (LocalNorm-Single). A second method is to do the same, but pass a single image through all Groups and then use voting to determine the classification (LocalNorm-Single-Voting). A third method is to collect a number of images corresponding to the Group size (LocalNorm-Batch), or to use a set of images corresponding to the batch size (LocalNorm-Voting). For benchmark testing, LocalNorm-Batch is the fastest evaluation method, whereas LocalNorm-Single is the computationally most desirable method for real-world application.
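The two single-image variants can be sketched as below, assuming a hypothetical `model(images, group)` callable that runs the network with one group's scaling parameters and returns class scores (the name and signature are ours, for illustration only):

```python
import numpy as np

def classify_single(image, model, num_groups, rng):
    """Single-image evaluation sketch: one image, one randomly chosen
    group; summary statistics come from this image alone."""
    g = int(rng.integers(num_groups))
    return int(np.argmax(model(image[None], g)))

def classify_single_voting(image, model, num_groups):
    """Single-image voting sketch: run every group, take the majority."""
    preds = [int(np.argmax(model(image[None], g))) for g in range(num_groups)]
    vals, counts = np.unique(preds, return_counts=True)
    return int(vals[np.argmax(counts)])

# Usage with a toy stand-in model whose group g always prefers class g % 3:
def toy_model(x, g):
    return np.eye(3)[g % 3][None]

img = np.zeros((32, 32, 3))
vote = classify_single_voting(img, toy_model, 4)   # groups predict 0,1,2,0 -> 0
```

The batch variants are analogous, with the group's statistics computed over a set of test images instead of a single one.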
LocalNorm is easily implemented in auto-differentiation frameworks like Keras (Chollet et al., 2015) and TensorFlow (Abadi et al., 2016) by adapting a standard batch normalization implementation (code available at https://github.com/byin-cwi/LocalNorm1). For multiple GPUs, LocalNorm can map computational groups onto separate GPUs, which can accelerate training and allow the training of larger networks. In a variant of transfer learning (Pan et al., 2010), it is straightforward to adapt a model pre-trained with BatchNorm by replacing all BatchNorm layers with LocalNorm layers initialized with the BatchNorm parameters, and then continuing training.
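The parameter conversion for this transfer-learning variant amounts to replicating the pretrained BatchNorm scaling parameters across the groups; a sketch (our own naming):

```python
import numpy as np

def batchnorm_to_localnorm(gamma_bn, beta_bn, num_groups):
    """Initialize LocalNorm's per-group (gamma, beta), each of shape
    (num_groups, C), from a pretrained BatchNorm layer's (C,) vectors.
    All groups start identical; continued training lets them diverge."""
    gamma_ln = np.tile(gamma_bn[None, :], (num_groups, 1))
    beta_ln = np.tile(beta_bn[None, :], (num_groups, 1))
    return gamma_ln, beta_ln

gamma_ln, beta_ln = batchnorm_to_localnorm(np.ones(64), np.zeros(64), 4)
```

Immediately after conversion the network computes the same function as before (per group), so no accuracy is lost at the start of continued training.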
4 Image Noise
We test LocalNorm on a noisy-object classification task where synthetic Gaussian, Poisson and Bernoulli noise is added to images, as in Noise2Noise (Lehtinen et al., 2018). All three kinds of independent noise are added to each channel of the image as follows:
For Additive Gaussian Noise (AGN), Gaussian noise with zero mean is added to the image on each channel, defined as $\tilde{x} = x + n$ with $n \sim \mathcal{N}(0, \sigma^2)$.
Additive Poisson Noise (APN) is one of the dominant noise sources in photographs, and is easily visible in low-light images. APN is a type of zero-mean noise and is hard to remove by pre-processing because it is distributed independently at each channel. Mathematically, APN can be computed as $\tilde{x} = \mathrm{Pois}(\lambda x)/\lambda$, where $\lambda > 0$ controls the noise magnitude.
Multiplicative Bernoulli Noise (MBN) removes random pixels from the image with probability $p$. MBN is defined by $\tilde{x} = x \odot m$ with $m \sim \mathrm{Bernoulli}(1-p)$.
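The three degradations can be sketched as follows for images with values in $[0, 1]$ (the exact APN parameterization is our assumption, following the Noise2Noise convention where smaller $\lambda$ gives stronger noise):

```python
import numpy as np

def add_agn(x, sigma, rng):
    """Additive Gaussian noise: zero mean, independent per channel."""
    return np.clip(x + rng.normal(0.0, sigma, x.shape), 0.0, 1.0)

def add_apn(x, lam, rng):
    """Additive Poisson noise sketch: Poisson counts with rate lam*x,
    rescaled back; smaller lam means stronger noise."""
    return np.clip(rng.poisson(lam * x) / lam, 0.0, 1.0)

def add_mbn(x, p, rng):
    """Multiplicative Bernoulli noise: zero out each pixel (all of its
    channels; a modeling choice of ours) independently with prob. p."""
    mask = rng.random(x.shape[:2]) >= p
    return x * mask[..., None]

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))
```

Each function returns an image in the same $[0, 1]$ range, so the degraded samples can be fed to the network unchanged.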
5 Experimental Results
5.1 Benchmark Accuracy
We apply LocalNorm to a number of classical benchmarks: MNIST (LeCun et al., 1998), CIFAR10 (Krizhevsky & Hinton, 2009), and Stanford Cars (Krause et al., 2013), and compare with other normalization methods. Where useful, we evaluate the benchmarks using all four types of LocalNorm evaluation; when not explicitly mentioned otherwise, LocalNorm refers to the LocalNorm-Batch evaluation method.
Results for all three normalization methods (BatchNorm, SwitchNorm and LocalNorm) are shown in Table 1 using otherwise identical network architectures, where we evaluate LocalNorm with LocalNorm-Single, LocalNorm-Batch and LocalNorm-Voting. For BatchNorm, SwitchNorm, LocalNorm-Batch and LocalNorm-Voting, we achieve near state-of-the-art accuracy on the original datasets, where in 3 out of 4 cases, LocalNorm-Voting and LocalNorm-Batch outperform BatchNorm and SwitchNorm. The improvement for CIFAR10 using the VGG architecture with LocalNorm-Voting stands out in particular, as accuracy improves from 88.8% to 95.3%; no such improvement is observed for the ResNet32 architecture, and only a slight improvement for ResNet152 as applied to Stanford Cars. We also observe that for the small images in CIFAR10, evaluating test images only a single image at a time (LocalNorm-Single) gives poor results. Comparing training time for CIFAR10, we find that LocalNorm incurs only a small computational cost (10-20%), while SwitchNorm proves much more computationally expensive (Table 1).
For MNIST, we designed a standard DNN (Input-16c-16c-32c-32c-512d-1024d-output) and set the batch size to 100; for LocalNorm, we divide the batch into 10 computational groups with 10 images per group. For CIFAR10, we use two classical network architectures, VGG19 and ResNet32. The classical VGG19 architecture (Simonyan & Zisserman, 2014) is often used as a baseline to test new network architectures. Residual Networks, or ResNets (He et al., 2016), have achieved state-of-the-art accuracy on many machine learning datasets, and ResNet32 (a ResNet with 32 layers) achieves competitive results on the CIFAR10 dataset (Zhang et al., 2018a). We use a batch size of 128, as in most recent state-of-the-art models. For LocalNorm, we divide the batch into 8 computational groups with 16 images per group by default.
The Stanford Cars dataset contains 16,185 images of 196 classes of cars. Each image is large, similar to images in the ImageNet dataset, allowing us to compare LocalNorm to the other normalization methods when applied to large networks and large images. The training and test sets are similarly large, and the images are taken under various conditions. We use ResNet152 for this dataset for improved accuracy; 16 images form a batch and are divided into 4 groups for LocalNorm. For ResNet152, we use pre-trained ImageNet weights (https://gist.github.com/flyyufelix/7e2eafb149f72f4d38dd661882c554a6) and then continue training this network with BatchNorm, SwitchNorm or LocalNorm.
In Figure 6 we plot the development of the mean and variance of the normalization scaling parameters $\gamma$ and $\beta$ for LocalNorm and BatchNorm (averaged over all channels) when training VGG19 on CIFAR10 using 8 Groups for LocalNorm. We see that LocalNorm converges to a spread of $\gamma$ and $\beta$ values during training.
5.2 Noisy Image degradation
To measure noise robustness and noise generalization, we use the networks trained with various normalization methods and the original training dataset, and test them on images degraded with different levels of noise.
We evaluated the CIFAR10 and Stanford Cars datasets for all variants of LocalNorm, both where a batch of images is used at test time to obtain summary statistics (LocalNorm-Batch and LocalNorm-Voting), and where only a single image at a time is used at test time to obtain summary statistics (LocalNorm-Single and LocalNorm-Single-Voting).
In the MNIST dataset, images have only one channel. We apply AGN to MNIST to demonstrate DNN performance on out-of-sample noise-degraded images. In Figure 7, we see that for all normalization methods, performance decreases as images become more degraded. At a moderate noise level, where the digit is still clearly visible despite some noise, the performance of BatchNorm and SwitchNorm decreases markedly, while LocalNorm still achieves an accuracy of 97.8%; at a higher noise level, where BatchNorm already yields random-choice performance (around 10%), LocalNorm still performs with only moderately reduced accuracy (SwitchNorm performs substantially worse). For very high noise levels, also difficult for humans, LocalNorm still outperforms SwitchNorm by a factor of two.
We tested VGG19 trained on CIFAR10 with the various normalization methods on test images degraded with AGN. Figure 8a shows that accuracy when using BatchNorm decreases rapidly with increasing noise. Among the LocalNorm evaluation types, we find that LocalNorm-Batch and LocalNorm-Voting substantially improve over BatchNorm and SwitchNorm: for LocalNorm-Voting, the network accuracy at $\sigma=1$ is almost three times that of the BatchNorm-based network. Evaluation using only single images (LocalNorm-Single and LocalNorm-Single-Voting), while more robust to noise, clearly underperforms on noiseless data. Similar observations apply to the other types of noise. For APN, the accuracy curves of both BatchNorm and LocalNorm drop sharply, while LocalNorm still substantially outperforms BatchNorm and SwitchNorm in general (Figure 8b). For MBN (Figure 8c), the accuracy of both SwitchNorm and BatchNorm drops exponentially and converges to random choice, while LocalNorm's performance decreases more slowly. We see the same performance order for a ResNet32 network applied to CIFAR10 (see Appendix, Figure 16).
The CIFAR10-C dataset was published specifically to test network robustness to image corruption (Hendrycks & Dietterich, 2018). It contains 19 types of algorithmically generated corruptions from the noise, blur, weather, and digital categories. To evaluate robustness, the networks are trained on the original CIFAR10 dataset and evaluated on the corrupted dataset using LocalNorm-Batch. The results are shown in Figure 9: we find that LocalNorm-Batch outperforms standard BatchNorm everywhere, with the largest improvements observed for the image corruptions that incur the largest performance drops (Noise, Blur). We also see that LocalNorm improves the accuracy of the VGG19 network much more than that of the ResNet32 network, to the point that VGG becomes substantially more accurate than ResNet32.
Stanford Car Dataset
For the large images in the Stanford Cars dataset, we find that when testing on noisy images (Figure 8d), all LocalNorm variants perform very similarly, demonstrating that here a single large image is sufficient to dynamically compute the summary statistics at test time. LocalNorm maintains a high test accuracy under any tested level of AGN, while under BatchNorm accuracy declines sharply for $\sigma=2.5$; similar behavior is observed for APN (Figure 8e). For MBN (Figure 8f), BatchNorm accuracy decreases exponentially while LocalNorm's performance declines essentially linearly. (For Stanford Cars, we omitted data for SwitchNorm, as we obtained near-zero performance on noise-degraded images with the publicly available code.)
To directly investigate generalization ability under different noise levels, we computed the confusion matrix for each model under various conditions, shown in Figures 17-19 in the Appendix. In general, we find that networks using BatchNorm increasingly default their classification to a select few classes for increasing noise levels, whereas for networks using LocalNorm this is not the case: their misclassifications are spread essentially randomly across classes.
5.3 Single Image Data augmentation at test-time
To improve the performance of LocalNorm-Single and LocalNorm-Single-Voting evaluation on small images, a simple suggestion is to enrich the summary statistics. Here, we augment the data by adding rotated versions of the image to the computation group. We find that this trick drastically improves LocalNorm-Single and LocalNorm-Single-Voting for the small images of CIFAR10: adding images rotated along the W and C axes improves single-image performance, as shown in Figure 11, since it enriches the mean of the computational group.
During classification, the prediction is made for the original image, and the rotated images are only used to compute the summary statistics (Figure 11); as before, classification can be done either by voting over the predictions of each group or by randomly selecting one group's prediction as the final result. As shown in Figure 12 for AGN, we find that for CIFAR10, enhancing the summary statistics for single-image evaluation in this way improves robustness and noiseless accuracy to the same level as LocalNorm-Batch; we observe the same for image degradation with APN and MBN (not shown).
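As an illustration of this augmentation, a single image's computation group can be filled with rotated copies of itself; the sketch below uses 90-degree rotations in the image plane (one simple choice of ours, not necessarily the paper's exact rotation scheme):

```python
import numpy as np

def rotation_group(image, copies=4):
    """Stack an (H, W, C) image with rotated versions of itself to form
    a richer computation group for single-image summary statistics.
    Predictions are later read out for index 0, the original image."""
    rots = [np.rot90(image, k, axes=(0, 1)) for k in range(copies)]
    return np.stack(rots)                     # (copies, H, W, C)

group = rotation_group(np.random.default_rng(0).random((32, 32, 3)))
mu = group.mean(axis=(0, 1, 2))               # per-channel mean over the group
```

Note that in-plane rotation reuses the same pixels, so it stabilizes rather than biases the per-channel statistics.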
While performance improves and such rotation allows a network to apply LocalNorm also to the small images of CIFAR10, this comes at the cost of filling one or multiple groups with rotated images and computing the corresponding network activity. While this substantially increases the computational cost per image, there is no added cost during training, and evaluation on such small images tends to be fast.
5.4 Training effects
Training on augmented noisy datasets. We next examine how network robustness improves when noisy AGN images are added to the training dataset. As can be seen in Figure 10, when testing on images with AGN noise, adding AGN samples to the training set does improve the accuracy of BatchNorm-trained networks on noisy test images. This AGN-trained network, however, hardly improves accuracy on test data containing Poisson noise (APN) or Bernoulli noise (MBN), confirming the observation in (Geirhos et al., 2018) that noise robustness is hard to generalize. Moreover, networks trained using LocalNorm without added noise samples still perform better, and we also find that for the noise-augmented BatchNorm network the test accuracy on the original dataset is slightly reduced. In practice, it is next to impossible to cover all noise conditions in the training dataset, and training with many such added examples is computationally expensive.
Group size. LocalNorm has as a parameter the number of groups which, for a given batch size, determines the number of images in each group. While we did not extensively optimize for group number, we found that a small-ish number of groups, 4-8, performed best in practice for the batch sizes used in this study (Figure 13).
6 Conclusion
We developed an effective and robust normalization layer, LocalNorm. LocalNorm regularizes the normalization layer during training, and includes a dynamic computation of the normalization layer's summary statistics at test time. The key insight is that out-of-sample conditions, like noise degradation, shift the summary statistics of an image, and the LocalNorm approach makes a DNN more robust to such shifts.
We demonstrate the effectiveness of the approach on classical benchmarks, including both small and large images, and find that LocalNorm decisively outperforms both classical Batch Normalization and modern variants like SwitchNorm. We show that computing LocalNorm incurs only a limited computational cost with respect to training time, of order 10-20%. LocalNorm furthermore can be evaluated on batches of test images and, for large enough images, also on single images passed through only a single group, then incurring the same evaluation cost as Batch Normalization. To enable the evaluation of small images one at a time, we demonstrated the use of image rotation as a form of data augmentation to sufficiently improve the summary statistics. For more general types of image distortions, we find that using LocalNorm also makes networks substantially more robust, as evidenced by the results on the CIFAR10-C dataset.
Acknowledgements
BY is funded by the NWO-TTW Programme “Efficient Deep Learning” (EDL); the Titan Xp used for this research was donated by the NVIDIA Corporation.
References
- Abadi et al. (2016) Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. Tensorflow: a system for large-scale machine learning. In OSDI, volume 16, pp. 265–283, 2016.
- Ba et al. (2016) Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- Chollet et al. (2015) Chollet, F. et al. Keras. https://github.com/fchollet/keras, 2015.
- Geirhos et al. (2018) Geirhos, R., Temme, C. R., Rauber, J., Schütt, H. H., Bethge, M., and Wichmann, F. A. Generalisation in humans and deep neural networks. In Advances in Neural Information Processing Systems, pp. 7549–7561, 2018.
- Goyal et al. (2017) Goyal, P., Dollár, P., Girshick, R. B., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch SGD: training imagenet in 1 hour. CoRR, abs/1706.02677, 2017. URL http://arxiv.org/abs/1706.02677.
- He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
- Hendrycks & Dietterich (2018) Hendrycks, D. and Dietterich, T. Benchmarking neural network robustness to common corruptions and perturbations. 2018.
- Ioffe & Szegedy (2015) Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
- Krause et al. (2013) Krause, J., Stark, M., Deng, J., and Fei-Fei, L. 3d object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia, 2013.
- Krizhevsky & Hinton (2009) Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
- LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Lehtinen et al. (2018) Lehtinen, J., Munkberg, J., Hasselgren, J., Laine, S., Karras, T., Aittala, M., and Aila, T. Noise2noise: Learning image restoration without clean data. CoRR, abs/1803.04189, 2018. URL http://arxiv.org/abs/1803.04189.
- Luo et al. (2018) Luo, P., Ren, J., and Peng, Z. Differentiable learning-to-normalize via switchable normalization. arXiv preprint arXiv:1806.10779, 2018.
- Mensi et al. (2016) Mensi, S., Hagens, O., Gerstner, W., and Pozzorini, C. Enhanced sensitivity to rapid input fluctuations by nonlinear threshold dynamics in neocortical pyramidal neurons. PLoS Comput. Biol., 12(2):e1004761, February 2016.
- Motwani et al. (2004) Motwani, M. C., Gadiya, M. C., Motwani, R. C., and Harris, F. C. Survey of image denoising techniques. In Proceedings of GSPX, pp. 27–30, 2004.
- Pan et al. (2010) Pan, S. J., Yang, Q., et al. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2010.
- Simonyan & Zisserman (2014) Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014. URL http://arxiv.org/abs/1409.1556.
- Smith et al. (2017) Smith, S. L., Kindermans, P., and Le, Q. V. Don’t decay the learning rate, increase the batch size. CoRR, abs/1711.00489, 2017. URL http://arxiv.org/abs/1711.00489.
- Ulyanov et al. (2016) Ulyanov, D., Vedaldi, A., and Lempitsky, V. S. Instance normalization: The missing ingredient for fast stylization. CoRR, abs/1607.08022, 2016. URL http://arxiv.org/abs/1607.08022.
- Ulyanov et al. (2017) Ulyanov, D., Vedaldi, A., and Lempitsky, V. S. Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis. CoRR, abs/1701.02096, 2017. URL http://arxiv.org/abs/1701.02096.
- Vincent et al. (2010) Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research, 11(Dec):3371–3408, 2010.
- Wu & He (2018) Wu, Y. and He, K. Group normalization. arXiv preprint arXiv:1803.08494, 2018.
- Zhang et al. (2018a) Zhang, G., Wang, C., Xu, B., and Grosse, R. Three mechanisms of weight decay regularization. arXiv preprint arXiv:1810.12281, 2018a.
- Zhang et al. (2018b) Zhang, X., Lu, Y., Liu, J., and Dong, B. Dynamically unfolding recurrent restorer: A moving endpoint control method for image restoration. CoRR, abs/1805.07709, 2018b. URL http://arxiv.org/abs/1805.07709.