CrossNorm and SelfNorm for Generalization under Distribution Shifts (ICCV 2021)
Normalization techniques are crucial for stabilizing and accelerating the training of deep neural networks. However, they are mainly designed for independent and identically distributed (IID) data and do not cover many real-world out-of-distribution (OOD) situations. Unlike most previous works, this paper presents two normalization methods, SelfNorm and CrossNorm, to promote OOD generalization. SelfNorm uses attention to recalibrate statistics (channel-wise mean and variance), while CrossNorm exchanges the statistics between feature maps. Although they explore opposite directions in statistics usage, SelfNorm and CrossNorm complement each other in OOD generalization. Extensive experiments on different domains (vision and language), tasks (classification and segmentation), and settings (supervised and semi-supervised) show their effectiveness.
Normalization methods, e.g., Batch Normalization, Layer Normalization, and Instance Normalization, play a pivotal role in training deep neural networks. These methods generally aim to make training more stable and convergence faster, assuming that training and test data come from the same distribution. However, few studies investigate normalization as a means of improving OOD generalization in real-world scenarios. For example, image corruptions, like snow and blur, can make test data deviate from the clean training distribution. This work aims to encourage the interaction between normalization and OOD generalization. Specifically, we manipulate feature mean and variance to make models generalize better to out-of-distribution data.
Our inspiration comes from the observation that the channel-wise mean and variance of feature maps carry some style information. For instance, exchanging the RGB means and variances between two images can transfer style between them, as shown in Figure 1(a). For many tasks such as CIFAR image classification, the style encoded by channel-wise mean and variance is usually less critical in recognizing the object than other information such as object shape. Therefore, we propose CrossNorm, which swaps the channel-wise mean and variance of feature maps. CrossNorm can augment styles in training, making the model more robust to appearance changes.
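The RGB-statistic swap behind Figure 1(a) can be sketched in a few lines of PyTorch. This is an illustrative sketch, not the authors' code; the function name and tensor layout are our assumptions.

```python
import torch

def swap_rgb_stats(content, style, eps=1e-5):
    """Give `content` the per-channel mean/std of `style` (both are (3, H, W)
    RGB images). Channel statistics alone carry appearance/style information,
    so this simple swap already acts as a crude style transfer."""
    c_mean = content.mean(dim=(1, 2), keepdim=True)
    c_std = content.std(dim=(1, 2), keepdim=True) + eps
    s_mean = style.mean(dim=(1, 2), keepdim=True)
    s_std = style.std(dim=(1, 2), keepdim=True) + eps
    # standardize the content image, then re-color it with the style statistics
    return (content - c_mean) / c_std * s_std + s_mean
```

The output image keeps the content image's spatial structure but inherits the style image's per-channel mean and variance exactly.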
Furthermore, given one image in different styles, we can reduce the style discrepancy by adjusting the RGB means and variances properly, as illustrated in Figure 1(b). Intuitively, the style recalibration can reduce appearance variance, which may be useful in bridging distribution gaps between training and unforeseen testing data. To this end, we propose SelfNorm, which uses attention to adjust channel-wise mean and variance automatically.
It is interesting to analyze the distinction and connection between CrossNorm and SelfNorm. At first glance, they take opposite actions (style augmentation vs. style reduction). Even so, they use the same tool (channel-wise statistics) and pursue the same goal (OOD robustness). Additionally, CrossNorm can increase the capacity of SelfNorm by style augmentation, making SelfNorm generalize better to OOD data.
Concept and Intuition. The style concept here refers to a family of weak cues associated with the semantic content of interest. For instance, the image style in object recognition can include many appearance-related factors such as color, contrast, and brightness. Style sometimes may help in decision-making, but the model should rely more on vital content cues to become robust. To reduce the style bias rather than discarding style entirely, we apply CrossNorm with some probability during training. The insight behind CrossNorm is that each instance, or feature map, has its unique style. Further, style cues are not equally important. For example, the yellow color seems more useful than other style cues in recognizing an orange. In light of this, the intuition behind SelfNorm is that attention may help emphasize essential styles and suppress trivial ones. Although we use the channel-wise mean and variance to modify styles, we do not assume that they are sufficient to represent all style cues. Better style representations are available with more complex statistics or even style transfer models [37, 17]. We choose the first- and second-order statistics mainly because they are simple, efficient to compute, and naturally connect normalization to out-of-distribution generalization. In summary, the key contributions are:
We propose SelfNorm and CrossNorm, two simple yet effective normalization techniques to enhance out-of-distribution generalization.
SelfNorm and CrossNorm form a unity of opposites in using feature mean and variance for model robustness.
They are domain agnostic and can advance state-of-the-art robustness performance for different domains (vision or language), settings (fully or semi-supervised), and tasks (classification and segmentation).
Out-of-distribution generalization. Although current deep models continue to break records on benchmark IID datasets, they still struggle to generalize to OOD data caused by common corruptions and dataset gaps.
To improve the robustness against corruption, Stylized-ImageNet conducts style augmentation to reduce the texture bias of CNNs. Recently, AugMix trains robust models by mixing multiple augmented images based on random image primitives or image-to-image networks. Adversarial noise training (ANT) and unsupervised domain adaptation can also improve the robustness against corruption. CrossNorm is domain agnostic and orthogonal to AugMix and ANT, making it possible to apply them jointly. Compared to Stylized-ImageNet, CrossNorm has two main advantages. First, CrossNorm is efficient as it transfers styles directly in the feature space of the target CNNs, whereas Stylized-ImageNet relies on external style datasets and pre-trained style transfer models. Second, CrossNorm can advance the performance on both clean and corrupted data, while Stylized-ImageNet hurts clean generalization because external styles can result in massive distribution shifts.
Besides common corruptions, generalization across datasets with distribution gaps also remains challenging. IBN mixes instance and batch normalization to shrink the domain distances. Domain randomization uses style augmentation for domain generalization on segmentation datasets. It suffers from the same issues as Stylized-ImageNet because it also uses pre-trained style transfer models and additional style datasets. Compared to IBN and domain randomization, SelfNorm can bridge the domain gaps by style recalibration, and CrossNorm is more efficient and strikes a better balance between source- and target-domain performance.
Beyond the vision field, many natural language processing (NLP) applications also face the out-of-distribution generalization challenges. Benefiting from the domain-agnostic property, SelfNorm and CrossNorm can also improve model robustness in the NLP area.
Normalization and attention. Batch Normalization is a milestone technique that inspired many subsequent normalization methods such as Instance Normalization, Layer Normalization, and Group Normalization. Recently, some works integrate attention into feature normalization. Mode normalization and attentive normalization use attention to weigh a mixture of batch normalizations. Exemplar normalization learns to combine multiple types of normalization by attention. By contrast, SelfNorm uses attention with only instance normalization. More importantly, unlike previous normalization approaches, SelfNorm and CrossNorm aim to improve out-of-distribution generalization.
In addition, SelfNorm is different from SE, though they use similar attention. First, SelfNorm learns to recalibrate channel-wise mean and variance instead of the channel features recalibrated in SE. Second, SE models the interdependency between channels, while SelfNorm deals with each channel independently. Also, a SelfNorm unit, with O(C) complexity, is more lightweight than a SE one, of O(C^2) complexity, where C denotes the channel number.
Data augmentation. Data augmentation is an important tool in training deep models. Current popular data augmentation techniques are either label-preserving [3, 24, 14] or label-perturbing [43, 41]. The label-preserving methods usually rely on domain-specific image primitives, e.g., rotation and color, making them inflexible for tasks beyond the vision domain. The label-perturbing techniques mainly work for classification and may have trouble in broader applications, e.g., segmentation. CrossNorm, as a data augmentation method, is readily applicable to diverse domains (vision and language) and tasks (classification and segmentation). The goal of CrossNorm is to boost out-of-distribution generalization, which is also different from many former data augmentation methods.
Background. Technically, SelfNorm and CrossNorm share the same origin: instance normalization. In 2D CNNs, each instance has C feature maps (channels) of size H × W. Given a channel A ∈ R^{H×W}, instance normalization first normalizes the feature map and then conducts an affine transformation:

IN(A) = γ (A − μ) / σ + β,

where μ and σ are the channel's mean and standard deviation, and γ and β denote learnable affine parameters. As shown in Figure 1 and also pointed out by the style transfer practices [7, 37, 17], μ and σ can encode some style information.
SelfNorm. SelfNorm replaces μ and σ with the recalibrated mean f(μ, σ)μ and standard deviation g(μ, σ)σ, as illustrated in Figure 2, where f and g are the attention functions. The adjusted channel becomes:

SN(A) = g(μ, σ)σ · (A − μ) / σ + f(μ, σ)μ.
As f and g learn to scale μ and σ based on the statistics themselves, the channel normalizes itself by self-gating, hence SelfNorm. SelfNorm is inspired by the fact that attention can help the model emphasize informative features and suppress less useful ones. In terms of recalibrating μ and σ, SelfNorm is expected to highlight the discriminative styles and understate trivial ones. In practice, we use a fully connected (FC) network to implement the attention functions f and g. The architecture is efficient as its input and output are both just two scalars. Since each channel has its own statistics, SelfNorm recalibrates each channel separately using lightweight FC networks, hence the complexity of O(C).
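The SelfNorm layer described above can be sketched in PyTorch as follows. This is a minimal sketch, not the authors' implementation: a grouped 1×1 convolution plays the role of the per-channel FC attention producing f and g, and the BatchNorm the paper places before the sigmoid is omitted for brevity; the class name and layer sizes are our assumptions.

```python
import torch
import torch.nn as nn

class SelfNorm(nn.Module):
    """Recalibrate each channel's (mean, std) with per-channel attention."""

    def __init__(self, channels, eps=1e-5):
        super().__init__()
        # groups=channels: each channel independently maps (mean, std) -> (f, g)
        self.attention = nn.Conv1d(2 * channels, 2 * channels, kernel_size=1,
                                   groups=channels)
        self.eps = eps

    def forward(self, x):                      # x: (N, C, H, W)
        n, c, _, _ = x.shape
        mean = x.mean(dim=(2, 3))              # (N, C)
        std = x.std(dim=(2, 3)) + self.eps     # (N, C)
        # interleave as [m0, s0, m1, s1, ...] so conv groups see one channel each
        stats = torch.stack([mean, std], dim=2).view(n, 2 * c, 1)
        f, g = torch.sigmoid(self.attention(stats)).view(n, c, 2).unbind(dim=2)
        new_mean = (f * mean).view(n, c, 1, 1)
        new_std = (g * std).view(n, c, 1, 1)
        # standardize, then re-scale with the recalibrated statistics
        return (x - mean.view(n, c, 1, 1)) / std.view(n, c, 1, 1) \
            * new_std + new_mean
```

Because each group of the convolution touches only one channel's two scalars, the parameter count grows linearly in C, matching the O(C) complexity claimed above.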
CrossNorm. CrossNorm exchanges μ_A and σ_A of channel A with μ_B and σ_B of channel B, i.e., changing A and B to each other's style, as shown in Figure 2:

CN(A) = σ_B (A − μ_A) / σ_A + μ_B,   CN(B) = σ_A (B − μ_B) / σ_B + μ_A,

where A and B seem to normalize each other, hence CrossNorm. CrossNorm is motivated by the key observation that a target dataset, such as a classification dataset, has rich, though subtle, styles. Specifically, each instance, or even every channel, has its unique style. Exchanging the statistics performs efficient style augmentation, reducing the style bias in decision-making. In mini-batch training, we turn on CrossNorm with some probability.
Unity of Opposites. SelfNorm and CrossNorm both start from instance normalization but head in opposite directions. SelfNorm recalibrates statistics to focus on only the necessary styles, reducing the diversity of mixtures of standardized features (zero mean and unit variance) and statistics. In contrast, CrossNorm transfers statistics between channels, enriching the combinations of standardized features and statistics. They perform opposite operations mainly because they target different stages: SelfNorm is dedicated to style recalibration at test time, while CrossNorm functions only in training. Note that SelfNorm is a learnable module, requiring training to work. Figure 3 shows the flowchart of SelfNorm and CrossNorm. Additionally, SelfNorm helps make the model less sensitive to appearance changes, while CrossNorm aims to lessen the model's style bias. Despite these differences, they both can facilitate out-of-distribution generalization. Further, CrossNorm can boost the performance of SelfNorm because its style augmentation can prevent SelfNorm from overfitting to specific styles. Overall, the two seemingly opposed methods form a unity of using normalization statistics to advance out-of-distribution robustness.
The core idea of CrossNorm is to swap mean and variance between feature maps. Different choices of feature maps can result in different CrossNorm variants.
1-instance mode. For 2D CNNs, given one instance, CrossNorm can exchange statistics between its channels, where the channel pair A and B in Equation 3 come from the same instance.
2-instance mode. If two instances are given, CrossNorm can swap statistics between their corresponding channels, i.e., A and B in Equation 3 become the same channel of two different instances.
Compared with 1-instance CrossNorm, the 2-instance one tends to consider instance-level style instead of channel-level style.
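The 2-instance mode lends itself to a simple batched sketch: pair each instance with a randomly permuted partner and swap their per-channel statistics. This is an illustrative sketch under our own naming; the paper does not prescribe how partners are drawn, and random batch permutation is our assumption.

```python
import torch

def crossnorm_2instance(x, eps=1e-5):
    """2-instance CrossNorm sketch for a mini-batch x of shape (N, C, H, W):
    each instance's channel statistics are swapped with those of a randomly
    paired instance (here: a random permutation of the batch)."""
    mean = x.mean(dim=(2, 3), keepdim=True)    # (N, C, 1, 1)
    std = x.std(dim=(2, 3), keepdim=True) + eps
    perm = torch.randperm(x.size(0))
    # standardize each channel, then re-dress it with a partner's statistics
    return (x - mean) / std * std[perm] + mean[perm]
```

In training one would gate this with the probability mentioned above, e.g. `if torch.rand(1).item() < p: x = crossnorm_2instance(x)`.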
Crop. Moreover, distinct spatial regions probably have different mean and variance statistics. To promote style diversity, we propose to crop regions for CrossNorm, where the crop function returns a square region whose area ratio is no less than a threshold. The whole channel is a special case of cropping. There are three cropping choices: content only, style only, and both. For content cropping, we crop A only when we use its standardized feature map; in other words, no cropping applies to A when it provides its statistics to B. Cropping both means cropping A and B regardless of whether we use their standardized feature maps or their statistics. The cropping strategy can produce diverse styles for both the 1-instance and 2-instance CrossNorms.
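The style side of the crop option can be sketched as computing statistics over a random square region instead of the whole map, so the same channel can donate many different styles. A minimal sketch under our assumptions (function name, and approximating the area-ratio constraint by sampling the square's side from the square root of the ratio range):

```python
import torch

def crop_style_stats(feat, ratio_threshold=0.5, eps=1e-5):
    """Return per-channel (mean, std) of a random square crop of `feat`
    (shape (N, C, H, W)) whose area ratio is at least `ratio_threshold`."""
    _, _, h, w = feat.shape
    side = max(1, int(min(h, w) *
                      torch.empty(1).uniform_(ratio_threshold ** 0.5, 1.0)))
    top = torch.randint(0, h - side + 1, (1,)).item()
    left = torch.randint(0, w - side + 1, (1,)).item()
    region = feat[:, :, top:top + side, left:left + side]
    # unbiased=False keeps the std finite even for tiny crops
    mean = region.mean(dim=(2, 3), keepdim=True)
    std = region.std(dim=(2, 3), keepdim=True, unbiased=False) + eps
    return mean, std
```

Feeding these cropped statistics into the CrossNorm swap, rather than whole-map statistics, is what gives the "style" cropping choice its extra diversity.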
SelfNorm and CrossNorm naturally work in feature space, making it flexible to plug them into many network locations. Two questions arise: how many units are necessary, and where should they be placed? To simplify the questions, we turn to a modular design by embedding them into a network cell. For example, in ResNet, we put them into a residual module. The search space shrinks significantly given the limited positions in a residual module. We investigate the position choices in experiments. The modular design allows using multiple SelfNorms and CrossNorms in a network. We will show in the ablation study that accumulated style recalibrations are helpful for model robustness. Since excessive style augmentation is harmful, we randomly turn on only some CrossNorms in a forward pass. Random sampling encourages diverse augmentations even when the same data pass through multiple times.
We evaluate SelfNorm (SN) and CrossNorm (CN) on out-of-distribution data arising from image corruptions and dataset differences. The evaluation covers supervised and semi-supervised settings as well as image classification and segmentation tasks. In addition to the vision tasks, we also apply them to an NLP task. Due to limited space, we leave all ablation studies to the appendix.
Image classification datasets. We use the benchmark datasets CIFAR-10, CIFAR-100, and ImageNet. To evaluate model robustness against corruption, we use CIFAR-10-C, CIFAR-100-C, and ImageNet-C. These datasets consist of the original test data corrupted by 15 common image corruptions from 4 general types: noise, blur, weather, and digital. Each corruption is injected at 5 severity levels.
Clean error and mCE (%) of ResNet50 trained 90 epochs on ImageNet. SNCN, using simple domain-agnostic statistics, achieves comparable performance as AugMix. Jointly applying SNCN with AugMix and IBN can produce the lowest clean and corruption errors.
Image segmentation datasets. We further validate our method in a domain generalization setting, where the models are trained without any target domain data and tested on the unseen domain. We use the synthetic dataset Grand Theft Auto V (GTA5) as the source domain and generalize to the real-world dataset Cityscapes. GTA5 has training, validation, and test splits of 12,403, 6,382, and 6,181 images, more than the 2,975, 500, and 1,525 of Cityscapes. Despite the differences, their pixel categories are compatible with each other, allowing us to evaluate models' generalization capability from one to the other.
Sentiment classification datasets.
Besides vision tasks, we demonstrate that our method can also work well on NLP tasks. We use an out-of-distribution (OOD) generalization setting in binary sentiment classification: the model is trained on the IMDb dataset and tested on the SST-2 test set. The IMDb dataset collects highly polarized full-length lay movie reviews, with 25,000 positive and 25,000 negative reviews. SST-2, also a binary sentiment classification dataset, contains 9,613 and 1,821 reviews for training and testing, consisting instead of pithy expert movie reviews.
Metric. For image classification, we use test errors to measure robustness. Given corruption type c and severity s, let E_{c,s} denote the test error. For the CIFAR datasets, we use the average over 15 corruptions and 5 severities: mCE = (1/75) Σ_c Σ_s E_{c,s}. In contrast, for ImageNet, we normalize the corruption errors by those of AlexNet: mCE = (1/15) Σ_c (Σ_s E_{c,s} / Σ_s E^{AlexNet}_{c,s}). The above two metrics follow the convention and are both denoted as mean corruption error (mCE), whether normalized or not. Different from classification, segmentation uses the mean Intersection over Union (mIoU) over all categories as the metric. For sentiment classification, we report accuracy.
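The two mCE conventions can be made concrete with a short sketch. The function names and the dictionary layout (corruption type mapped to a list of per-severity errors) are our assumptions for illustration.

```python
def cifar_mce(errors):
    """Unnormalized mCE: plain average over all corruptions and severities."""
    flat = [e for severities in errors.values() for e in severities]
    return sum(flat) / len(flat)

def imagenet_mce(errors, alexnet):
    """Normalized mCE: per corruption, the error summed over severities is
    divided by AlexNet's corresponding sum, then averaged over corruptions."""
    ratios = [sum(errors[c]) / sum(alexnet[c]) for c in errors]
    return 100.0 * sum(ratios) / len(ratios)
```

For example, with two corruptions and errors `{"noise": [10, 20], "blur": [30, 40]}`, the unnormalized mCE is simply 25.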
Hyper-parameters. In the experiments, a SN unit uses one fully connected layer, followed by Batch Norm and a sigmoid layer. We put CN ahead of SN and plug them into every cell in a network, e.g., each residual module in a ResNet. During training, we turn on only some CNs, each with some probability, to avoid excessive data augmentation. Unless specified, 2-instance CN is used with cropping. We sample the cropping bounding box ratio uniformly; please refer to the appendix for the threshold setting and the details of the active number and probability.
Supervised training on CIFAR. Following AugMix, we evaluate SN and CN with four different backbones: an All Convolutional Network, a DenseNet-BC (k = 12, d = 100), a 40-2 Wide ResNet, and a ResNeXt-29 (32×4). We also use the same hyper-parameters as the AugMix GitHub repository (https://github.com/google-research/augmix).
According to Table 1, individual SN and CN can outperform most previous approaches on robustness against unseen corruptions, and combining them can decrease the mean error by about 12% on both CIFAR-10-C and CIFAR-100-C. One possible explanation is that the corruptions mainly change image textures. SN and CN, through style recalibration and augmentation, may help reduce the texture sensitivity and bias, making the classifiers more robust to unseen corruptions. Also, the domain-agnostic SN and CN are orthogonal to AugMix, which relies on domain-specific operations. Their joint application can further lower the mCEs by 2.2% and 3.6% on top of AugMix.
Supervised training on ImageNet. Following the AugMix Github repository, we train a ResNet-50 for 90 epochs with weight decay 1e-4. The learning rate starts from 0.1, divided by 10 at epochs 30 and 60. Note that AugMix reports the results of 180 epochs in their paper. For a fair comparison, we also train it 90 epochs in our experiments. Besides, we also add Instance-batch normalization (IBN)  in the final combination with AugMix. It was initially designed for domain generalization but can also boost model robustness against corruption.
Table 2 gives the results on ImageNet. We can observe that both clean and corrupted errors decrease when applying SN and CN separately. Their joint usage can make the clean and corruption errors drop by 10.2% and 0.6% simultaneously, closing the gap with AugMix. Moreover, applying SN and CN on top of AugMix can significantly lower its clean and corruption errors by 1.1% and 5.6%, respectively, achieving state-of-the-art performance. IBN also makes some contributions here since it is complementary to other components.
Semi-supervised training on CIFAR.
Apart from supervised training, we also evaluate CN in semi-supervised learning. Following the state-of-the-art FixMatch setting, we train a 28-2 Wide ResNet for 1024 epochs on CIFAR-10. We use SGD with Nesterov momentum 0.9, learning rate 0.03, and weight decay 5e-4. The probability threshold to generate pseudo-labels is 0.95, and the weight for the unlabeled data loss is 1. We sample 250 and 4,000 labeled data with random seed 1, leaving the rest as unlabeled data. In each experiment, we apply CrossNorm to either all data or only unlabeled data and choose the better one. Our experiments use the PyTorch FixMatch implementation (https://github.com/kekmodel/FixMatch-pytorch).
Figure 4 shows the semi-supervised results. We run FixMatch with the strong RandAugment or with only weak random flip and crop augmentations. With either FixMatch version, CN can always decrease both the clean and corruption errors, demonstrating its effectiveness in semi-supervised training. Notably, with the help of CN, training with 250 labels even has 3% lower corruption error than with 1000 labels, according to the right sub-figure. Additionally, two points are noteworthy here. First, we try FixMatch with only weak augmentations to simulate more general situations: for new domains other than natural images, humans may have limited domain knowledge for designing advanced augmentation operations. Fortunately, CN is domain-agnostic and easily applicable to such situations. Moreover, previous semi-supervised methods mainly focus on in-distribution generalization. Here we introduce out-of-distribution robustness as another metric for a more comprehensive evaluation.
Training setup. We perform domain generalization from GTA5 (synthetic) to Cityscapes (realistic), following the setting of IBN. It uses 1/4 of the training data in GTA5 to match the data scale of Cityscapes. We train an FCN with a ResNet50 backbone on the source domain GTA5 for 80 epochs with batch size 16. The network is initialized with ImageNet pre-trained weights. We test the trained model on both the source and target domains. The training uses random scaling, flipping, rotation, and cropping for data augmentation. We use the 2-instance CN with style cropping in this setting. Besides, we re-implement domain randomization and make the training iterations the same as ours. It transfers the synthetic images to 15 auxiliary domains with ImageNet image styles.
Results. Based on Table 3, SN and CN both substantially increase the segmentation accuracy on the target domain, by 8.5% and 10.6%. SN learns to highlight the discriminative styles that are likely to be shared across domains. CN performs style augmentation to make the model focus more on domain-invariant features. SN and CN achieve generalization performance comparable to state-of-the-art IBN and domain randomization. However, CN significantly outperforms the domain randomization method by 12.2% on source accuracy, because domain randomization transfers external styles to the source training data, causing dramatic distribution shifts. Moreover, combining SN and CN gives the best generalization performance while still maintaining high source accuracy.
Setup. We also conduct out-of-distribution generalization experiments on the binary sentiment classification task in NLP to validate the versatility of SN and CN. The model is trained on the IMDb dataset and then tested on the SST-2 dataset. Following the setting of prior work, we use GloVe word embeddings and a Convolutional Neural Network (ConvNet) as the classification model. We use the ConvNet implementation from this repository: https://github.com/bentrevett/pytorch-sentiment-analysis. Convolutional layers with three kernel sizes (3, 4, 5) are used to extract features from the review texts. SN and CN units are placed between the embedding layer and the convolutional layers. We use the Adam optimizer and train the model for 20 epochs.
Results. From Table 4, we find that SN improves the performance in both the source and target domains, by 2.07% and 0.63%. CN can also increase target accuracy without much degradation in the source domain. Combining them gives a 3.05% accuracy boost. This experiment indicates that SN and CN can also work in the NLP area, not limited to vision tasks. Although the intuitive explanations for image data do not directly carry over, the mean and variance statistics of NLP features are also useful in facilitating out-of-distribution generalization.
Apart from the quantitative comparisons, we also provide some visualization results of SN and CN to better understand their effects. It is nontrivial to visualize them directly in feature space. Instead, we map the feature changes made by SN and CN back to image space by inverting the feature representations. For detailed experimental settings, refer to the appendix.
To visualize SN at a network location, we first forward an image to obtain the target representation immediately after the SN. Then we turn off the chosen SN and optimize the original image to make its representation fit the target one. In this way, we can examine SN’s effect by observing changes in image space. As shown in Figure 6, SN can primarily reduce the contrast and color at the first network block. The effect becomes more subtle as SN goes deeper into the network. One possible explanation is that the high-level representations lose many low-level details, making it difficult to visualize their changes.
In addition to visualizing individual SNs, it is also interesting to see their compound effect. To this end, we reconstruct an image from random noise by matching its representation with a given one. The reconstructed image shows what information is preserved by the feature representation. By comparing two reconstructed images from a network with and without SN, we can observe the joint recalibration effects of the SNs before a selected location. From Figure 7, we find that SNs in the first two network blocks can suppress much style information and preserve object shapes. The reconstructions from block 3 do not look visually informative due to the high-level abstraction. Even so, SNs can restrain the high-frequency signals kept in the vanilla network.
In visualizing CN, we pair one content image with multiple style images for better illustration. We first forward them to get their representations at a chosen position. Then we compute standardized features from the content image representation, and means and variances from the style image representations. The optimization starts from the content image and tries to fit its representation to the target one, which mixes the standardized features with the different means and variances. Figure 5 shows the diverse style changes made by CN. The style changes become more local and subtle as CN moves deeper into the network.
In this paper, we have presented SelfNorm and CrossNorm, two simple yet effective normalization techniques to improve OOD robustness. They form a unity of opposites as they confront and conform to each other in terms of approach (statistics usage) and goal (OOD robustness). Beyond their extensive applications, they may also shed light on developing domain agnostic methods applicable to multiple fields such as vision and language, and broad OOD generalization circumstances such as unseen corruptions and distribution gaps across datasets. Given the simplicity of SelfNorm and CrossNorm, we believe there is substantial room for improvement. The current channel-wise mean and variance are not optimal to encode diverse styles. One possible direction is to explore better style representations.
In each experiment, we conduct a grid search on four combinations of active numbers (1, 2) and probability (0.25, 0.5) and report the best result.
CN variants. CN can be in 1-instance or 2-instance mode with four cropping options. According to Table 5, the 2-instance mode consistently gets lower errors than the 1-instance mode. Furthermore, cropping can help decrease the error since it encourages style augmentation diversity. Note that style cropping may not always be superior; we give a more detailed study on cropping choices later.
Block choices for SN and CN. With both SN and CN placed at the post-addition location of a residual cell in WideResNet-40-2, we study which blocks in the network should use the modified cells. No residual cell is required when applying SN and CN directly to the image space. According to Table 6, they both perform the best on CIFAR-100-C when plugged into all blocks.
Order of SN and CN. In this experiment, we study two orders: SNCN and CNSN when plugging them into the post-addition place in all residual cells of WideResNet-40-2. Table 7 shows very close mCEs, indicating their order has little influence on the robustness performance.
SN vs. SE. Although SN shares a similar attention mechanism with SE, SN obtains much lower corruption error than SE, according to Table 8. SN recalibrates the feature map statistics, suppressing unseen styles in the OOD data, whereas SE, modeling the interdependence between feature channels, may not help OOD robustness.
Modular positions. Here we investigate the positions of SN and CN inside network cells. We give a comprehensive study on different cells and measure the performance on both CIFAR-10-C and CIFAR-100-C. Specifically, we conduct experiments using four backbones: AllConvNet, DenseNet, WideResNet, and ResNeXt, comprising three types of cells: a naive convolutional cell, a dense cell, and a residual module, illustrated in Figures 8 and 9. According to Tables 9 and 10, SN's optimal positions differ between CIFAR-10-C and CIFAR-100-C, while CN has stable best positions across the two datasets.
[Flattened appendix tables: per-position results of SN and CN on CIFAR-10-C and CIFAR-100-C; SNCN-with-cropping results (e.g., DenseNet, Conv1 Pre); and per-backbone incremental results for AllConvNet (1, style), DenseNet (Conv1 Pre, both), WideResNet (Post, both), and ResNeXt (Post, neither) on CIFAR-10-C and CIFAR-100-C.]
Cropping Choices for CN. Cropping enables diverse statistics transfer between feature maps. Here we study four cropping choices: neither (no cropping), style, content, and both. In Table 11, we can find the best cropping choice may change over backbones and datasets.
Incremental ablation study. SN and CN are general and straightforward normalization techniques to improve OOD robustness. They are orthogonal to each other and to other methods such as consistency regularization and AugMix. According to Table 12, they can lower the corruption error both separately and jointly. On top of them, proper cropping, consistency regularization, and domain-specific AugMix can further advance OOD robustness.
CN vs. Stylized-ImageNet. We also compare CN to Stylized-ImageNet, which transfers styles from external datasets to perform style augmentation. Stylized-ImageNet finetunes a pre-trained ResNet-50 for 45 epochs with double data (the stylized and original ImageNets) in each epoch. To compare CrossNorm with Stylized-ImageNet, we perform the finetuning for 90 epochs using only the original ImageNet. In Table 13, although Stylized-ImageNet has 2% lower corruption error than CN, its clean error is 3.8% higher. This is because the external styles in Stylized-ImageNet cause large distribution shifts, impairing clean generalization. In contrast, the more consistent yet diverse internal styles help CN decrease both corruption and clean errors.
[Table 13, clean error (%): 23.9 (baseline), 27.2 (Stylized-ImageNet), 23.4 (CN).]
SN and CN locations. Moreover, in Table 14, we also investigate the SN and CN locations in a residual module using ImageNet and ResNet50. Similar to the CIFAR results, the post-addition position performs the best for corruption robustness.
[Table 14, clean error (%) across four modular positions — SN: 24.0, 23.0, 23.2, 23.7; CN: 25.2, 23.4, 23.5, 23.4.]
Ablation study with IBN. Table 15 reports the results of applying SN or CN with IBN. We observe that they cooperate to improve the corruption robustness of ResNet50. Moreover, integrating SN, CN, IBN, and AugMix brings the lowest corruption error. This shows an advantage of SN and CN: they are general and simple additions that can boost other state-of-the-art methods.
Visualization setup. Our visualization builds on the technique of understanding deep image representations by inverting them. The goal is to find an image whose feature representation best matches a given one. The search is done automatically by an SGD optimizer with learning rate 1e4, momentum 0.9, and 200 iterations. The learning rate is divided by 10 every 40 iterations. During the optimization, the network is in evaluation mode with its parameters fixed. In the experiment, we use WideResNet-40-2 and images from CIFAR-10. In visualizing CN, we use training images and a model trained for 1 epoch. The SN visualization uses test images and a well-trained model. We use different settings because CN works during training, while SN works at test time.
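The inversion loop described above can be sketched as follows. This is a hedged sketch, not the paper's code: the function name and call signature are our assumptions, `model` stands for the frozen sub-network up to the chosen layer, and the default hyper-parameters mirror the text (SGD, momentum 0.9, learning rate divided by 10 every 40 iterations).

```python
import torch

def invert_representation(model, target_feat, init_img, steps=200, lr=1e4):
    """Optimize an image so its feature representation matches `target_feat`."""
    img = init_img.clone().requires_grad_(True)
    opt = torch.optim.SGD([img], lr=lr, momentum=0.9)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=40, gamma=0.1)
    model.eval()  # frozen network; only the image is updated
    for _ in range(steps):
        opt.zero_grad()
        loss = (model(img) - target_feat).pow(2).mean()
        loss.backward()
        opt.step()
        sched.step()
    return img.detach()
```

For the SN visualization, `target_feat` is taken right after a SelfNorm unit and the unit is then switched off; for the CN visualization, `target_feat` mixes the content image's standardized features with the style images' statistics.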
More visualization results. Figure 10, extending Figure 5, shows more CN visualizations in deeper network blocks. Figure 11 gives an illustration of 15 corruptions used in robustness evaluation on CIFAR and ImageNet. Moreover, Figure 12 shows some synthetic images from GTA5 and realistic ones from Cityscapes. The visualization of CN applied to the synthetic images is provided in Figure 13.