A simple normalization technique using window statistics to improve out-of-distribution generalization on medical images

Since data scarcity and data heterogeneity are prevalent in medical imaging, well-trained Convolutional Neural Networks (CNNs) using existing normalization methods may perform poorly when deployed to a new site. However, a reliable model for real-world clinical applications should generalize well on both in-distribution (IND) and out-of-distribution (OOD) data (e.g., data from a new site). In this study, we present a novel normalization technique called window normalization (WIN), a simple yet effective alternative to existing normalization methods. Specifically, WIN perturbs the normalizing statistics with local statistics computed within a window of the feature map. This feature-level augmentation regularizes models well and improves their OOD generalization significantly. Taking advantage of WIN, we propose a novel self-distillation method called WIN-WIN, which further improves OOD generalization in classification. WIN-WIN is easily implemented with two forward passes and a consistency constraint, and can serve as a simple extension of existing methods. Extensive experimental results on various tasks (such as glaucoma detection, breast cancer detection, chromosome classification, and optic disc and cup segmentation) across 26 datasets demonstrate the generality and effectiveness of our methods. The code is available at https://github.com/joe1chief/windowNormalizaion.


1 Introduction

Despite the tremendous success of CNNs in medical image analysis, they are mainly built upon the “i.i.d. assumption”: the training data and the test data are independent and identically distributed. This assumption barely holds in real-world applications due to the nature of medical images, causing sharp performance drops of well-trained models on unseen data with distribution shifts [9]. Because the expensive costs of data acquisition and annotation lead to data scarcity, the data used for model training capture only a small portion of the real data distribution. Meanwhile, the data heterogeneity caused by inconsistent standards (e.g., varying operating procedures and imaging equipment) worsens the distribution shifts among images. Therefore, OOD generalization on such heterogeneous data is a crucial challenge for real-world clinical applications [33, 18].

The most straightforward solution is to acquire sufficient data from various sites and train a robust model. However, this is costly and often impossible in practice. Another economical and popular solution is data augmentation. Data augmentation enhances the breadth of seen data with predefined class-preserving operations, such as mixing multiple augmented images [11] or mixing an augmented image with dreamlike images [12]. With augmented samples simulating unseen images, CNNs improve their generalization to OOD data. Although data augmentation methods no longer require significant manual design effort [2, 3], their performance on medical images is less appealing because they are mainly developed for natural images under the “i.i.d. assumption”. Besides, they significantly increase computing overhead and impede model convergence.

Normalization layers are essential components of modern CNNs. Traditional normalization techniques (e.g., Batch Normalization and Instance Normalization) are likewise built upon the “i.i.d. assumption”. For instance, Batch Normalization (BN) [13] constrains intermediate features within a normalized distribution using mini-batch statistics to stabilize and accelerate training. Its fundamental flaw is train-test statistics inconsistency, which worsens under distribution shifts. Instance Normalization (IN) [31] overcomes this limitation by computing the normalizing statistics consistently: during both training and testing, IN normalizes features with statistics over the spatial dimensions. Meanwhile, IN effectively reduces the distribution discrepancy (i.e., by eliminating style discrepancy) and improves OOD generalization [25], but the improvement is still inadequate. In the model generalization literature, existing incremental studies of IN are complicated and inefficient [25, 30, 14, 22, 23, 28]. For instance, Jin et al. [14] designed a normalization technique that consists of three components: a Style Normalization and Restitution Module, a Dual Causality loss, and a Dual Restitution loss.

In this work, we present a simple yet effective normalization technique called Window Normalization (WIN) to improve OOD generalization on heterogeneous data. WIN exploits the mean and variance of a stochastic window to perturb the normalizing statistics with a mixup operation, which is extremely easy to use and remarkably effective. Building on WIN, we introduce a novel self-distillation scheme, WIN-WIN, which forward passes the input in different model modes and compels consistency between the outputs. WIN-WIN can be implemented in a few lines of code and further improves OOD generalization in classification tasks. We demonstrate that our methods generally boost the OOD generalization of various tasks such as classification, segmentation, and retrieval across many datasets, while being parameter-free and having a negligible effect on IND data. Our main contributions are summarized as follows:

  • We develop a simple normalization technique, WIN, for improving OOD generalization on heterogeneous data. WIN is a good alternative to existing normalization methods, parameter-free and with a negligible effect on IND data.

  • We propose a novel self-distillation scheme, WIN-WIN, to fully exploit WIN in classification tasks; it is easy to implement and highly effective.

  • We demonstrate through extensive experiments that our methods can significantly and generally boost the OOD generalization across various tasks, including glaucoma detection, breast cancer detection, chromosome classification, optic disc and cup segmentation, etc. The code of our method is publicly available at https://github.com/joe1chief/windowNormalizaion.

2 Related work

2.1 Data Augmentation

Data augmentation is an important tool for training deep models; it effectively increases the data amount and enriches data diversity through random operations such as translation, flipping, and cropping [16]. Intuitively, simulated and augmented inputs or outputs help a model learn the invariances of the data domains and improve its generalization to novel domains. Popular data augmentation techniques operate at either the image level or the feature level. Image-level augmentation usually requires manual design, which is challenging due to the difficulty of simulating data from unforeseen target domains. To this end, several methods automatically search for augmentation policies in a predefined search space [2, 3]. However, their performance on OOD data is doubtful, as the augmentation policies are usually tuned to optimize performance on a particular domain's data. In the OOD generalization literature, Lopes et al. [21] added noise to randomly selected patches of the input image. AugMix [11] and PIXMIX [12] mixed multiple augmented images, or an augmented image with dreamlike images, respectively. Zhou et al. [43] synthesized pseudo-novel domain data with a data generator. DeepAugment [9] perturbed the input image with an image-to-image network. Feature-level data augmentation has also worked well in this context. Verma et al. [32] simply mixed up the latent features. Zhou et al. [44] designed a plug-and-play module that mixes feature statistics between instances. Li et al. introduced an augmentation module that perturbs the feature with a data-independent noise and an adaptive data-dependent noise.

However, the aforementioned methods inevitably introduce computational overhead. Our method, which outperforms many prior data augmentation techniques at improving OOD generalization, implements feature-level augmentation purely within the normalization layers that modern CNNs already contain.

2.2 Normalization

The normalization layer is an essential module of deep models. Numerous normalization techniques have been proposed for domains such as natural language processing, computer vision, and machine learning. A milestone study is Batch Normalization (BN) [13], which has verified its generality on a wide variety of tasks. Although BN has been empirically shown to accelerate model convergence and combat overfitting, it exhibits several drawbacks because it computes statistics along the batch dimension by default [34, 35]. For example, instance-specific information is lacking. To this end, Li et al. [19] re-calibrated features via an attention mechanism, and Gao et al. [6] added a feature calibration scheme to BN to incorporate this information. Instance Normalization (IN) [31] is another milestone study; it normalizes features with statistics over the spatial dimensions. It has been extensively applied to image style transfer since it can eliminate instance-specific style discrepancy (namely, by standardizing features with the mean and variance of IN). However, IN eliminates style information and results in inferior performance on in-distribution samples [25, 14]. Following BN and IN, Group Normalization (GN) [34] exploits the statistics of grouped channel dimensions to train models with small batches. In contrast, several normalization techniques utilize statistics of partial regions [24, 36] instead of all pixels within a dimension (e.g., Batch Normalization, Instance Normalization, Layer Normalization [1], and Group Normalization). Ortiz et al. [24] proposed Local Context Normalization (LCN), in which each feature is normalized based on a window around it and the filters in its group. Its effectiveness has been demonstrated in dense prediction tasks, including object detection, semantic segmentation, and instance segmentation.

Since it is a subtle choice which normalization technique should be applied in a particular scenario, many studies have explored combinations of multiple normalizations. Batch-Instance Normalization (BIN) [23] adaptively balances the BN output and the IN output with a learnable gate parameter. This gating mechanism also appears in the Switchable Normalization (SN) series [22, 28], which combines three types of statistics, estimated channel-wise [31], layer-wise [1], and minibatch-wise [13]. Besides, Qiao et al. [26] introduced another combination that stacks BN with GN rather than conducting a weighted summation. These multi-normalization combinations enable better adaptability and easy usage across various deep networks and tasks, but they usually incur extra computational cost.

The above normalization techniques assume that training and test data follow the same distribution. Meanwhile, numerous normalization studies address the inevitable distribution shifts between training and test distributions in real-world applications. For example, Mode Normalization [4] assigns samples to different modes and normalizes them with their corresponding statistics. Li et al. [18] used local batch normalization to address non-i.i.d. training data. Tang et al. [30] enlarged the training distribution with CrossNorm and bridged the gaps between training and test distributions with SelfNorm. Since IN can effectively alleviate the style discrepancy among domains/instances, Pan et al. [25] mixed IN and BN in the same layer to narrow the distribution gaps, and Jin et al. [14] proposed a style normalization and restitution module to eliminate style variation while preserving discrimination capability. Compared with these methods, WIN addresses OOD generalization simply yet effectively. It has no parameters to tune and requires no careful choice of where to plug it in. As a variant of IN, it can be employed directly as the normalization layer and brings free benefits for OOD generalization.

3 Methods

Figure 1: (a)-(c) Statistics calculation. Each subplot shows a feature map; the mean and variance are computed by aggregating the values of the pixels in blue, and these pixels share the same normalizing statistics. (d) Schematic illustration of WIN-WIN (right) and WIN (left). WIN is adopted as the normalization layer in CNNs; it uses mixing statistics during training and global statistics during evaluation. $x$ denotes the input, and $z$ and $z'$ are the prediction logits in training mode and evaluation mode, respectively. WIN-WIN passes the input twice and compels consistency between the outputs $z$ and $z'$.

3.1 Background

Normalization methods usually consist of a feature standardization and an affine transformation. Given an input feature $x \in \mathbb{R}^{B \times C \times H \times W}$ in a 2D CNN, where $B$, $C$, $H$, and $W$ are the batch size, channel number, height, and width, respectively, the feature standardization and affine transformation can be formulated as follows:

$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^{2} + \epsilon}}$ (1)
$y = \gamma \hat{x} + \beta$ (2)

$\mu$ and $\sigma^{2}$ denote the mean and variance, respectively. $\gamma$ and $\beta$ are learnable affine parameters. $\epsilon$ is a small constant used to prevent division by zero variance.

In BN, the mean and variance during training are defined as follows:

$\mu_{bn,c} = \frac{1}{BHW}\sum_{b,h,w} x_{b,c,h,w}, \quad \sigma^{2}_{bn,c} = \frac{1}{BHW}\sum_{b,h,w} \left(x_{b,c,h,w} - \mu_{bn,c}\right)^{2}$ (3)

$\mu_{bn}$ and $\sigma^{2}_{bn}$ are computed within each channel of $x$ (see Fig. 1(a)). During evaluation, the mean and variance are computed as exponential moving averages of $\mu_{bn}$ and $\sigma^{2}_{bn}$, accumulated as follows:

$\bar{\mu} \leftarrow m\,\bar{\mu} + (1-m)\,\mu_{bn}, \quad \bar{\sigma}^{2} \leftarrow m\,\bar{\sigma}^{2} + (1-m)\,\sigma^{2}_{bn}$ (4)

where $m$ is a momentum. However, the running mean and variance usually cause significant performance degradation on OOD data, since they cannot be aligned with the statistics estimated on that data [35]. In addition, for the affine transformation, BN employs the two learnable parameters $\gamma$ and $\beta$.

IN defines the mean and variance as follows:

$\mu_{in,b,c} = \frac{1}{HW}\sum_{h,w} x_{b,c,h,w}, \quad \sigma^{2}_{in,b,c} = \frac{1}{HW}\sum_{h,w} \left(x_{b,c,h,w} - \mu_{in,b,c}\right)^{2}$ (5)

$\mu_{in}$ and $\sigma^{2}_{in}$ are computed across the spatial dimensions independently for each channel and each instance (see Fig. 1(b)). Since $\mu_{in}$ and $\sigma^{2}_{in}$ encode instance-specific style information, standardizing features with them is a kind of style normalization. Style normalization eliminates the feature variance caused by appearance, which benefits OOD generalization; however, it inevitably removes some discriminative information and degrades IND generalization [25]. In IN, the calculation of the mean and variance is consistent between training and evaluation, and the affine transformation with $\gamma$ and $\beta$ is usually deactivated in practice.

3.2 Window Normalization

Our main inspiration is the observation from Ghost Batch Normalization (GBN) [5] that model generalization improves when the mean and variance are calculated on small parts of the input batch. Intuitively, GBN is equivalent to adding different noises to different slices of the input batch, which can be regarded as a kind of feature-level data augmentation. Accordingly, we implement another feature-level data augmentation by injecting noise into each instance rather than into slices of the batch [5].

Specifically, the perturbation (i.e., noise injection) for each instance is conducted within the feature standardization operation: we standardize the feature with approximations of the global statistics $\mu_{in}$ and $\sigma^{2}_{in}$ (see Eq. 1). As shown in Fig. 1(c), the mean and variance for each channel and instance are calculated within a random window. Formally, given a window specified by its top-left coordinate $(x_{1}, y_{1})$ and bottom-right coordinate $(x_{2}, y_{2})$, the mean $\mu_{win}$ and variance $\sigma^{2}_{win}$ are computed as:

$\mu_{win,b,c} = \frac{1}{A}\sum_{h=y_{1}}^{y_{2}}\sum_{w=x_{1}}^{x_{2}} x_{b,c,h,w}, \quad \sigma^{2}_{win,b,c} = \frac{1}{A}\sum_{h=y_{1}}^{y_{2}}\sum_{w=x_{1}}^{x_{2}} \left(x_{b,c,h,w} - \mu_{win,b,c}\right)^{2}$ (6)

where $A = (y_{2}-y_{1})(x_{2}-x_{1})$ is the window area. Algorithm 1 describes the window sampling strategy: we repeatedly generate a window and place it on the feature map until the window covers more than a threshold fraction $\tau$ of the feature area; $\tau$ is a threshold for the ratio of window size to feature size. LCN [24] also uses the statistics of windows, but it normalizes each feature based on the statistics of its local neighborhood and the corresponding feature group; moreover, its prime use is in dense prediction tasks such as object detection, semantic segmentation, and instance segmentation.

Data: Features $x$, threshold $\tau$, window ratio $\lambda$, window height $r_{h}$, window width $r_{w}$, window center $(c_{x}, c_{y})$;
Result: Top-left of window $(x_{1}, y_{1})$, bottom-right of window $(x_{2}, y_{2})$;
repeat
       // get a squared window
       Sample the window ratio $\lambda$ from the beta distribution;
       $r_{h} \leftarrow \sqrt{\lambda}\,H$;
       $r_{w} \leftarrow \sqrt{\lambda}\,W$;
       Uniformly sample the window center $(c_{x}, c_{y})$;
       // place the window
       $x_{1} \leftarrow \max(c_{x} - r_{w}/2,\, 0)$;
       $y_{1} \leftarrow \max(c_{y} - r_{h}/2,\, 0)$;
       $x_{2} \leftarrow \min(c_{x} + r_{w}/2,\, W)$;
       $y_{2} \leftarrow \min(c_{y} + r_{h}/2,\, H)$;
until $(x_{2}-x_{1})(y_{2}-y_{1}) > \tau HW$;
Algorithm 1: Window sampling
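For reference, here is a minimal NumPy sketch of Algorithm 1; the default values of the beta parameter alpha and the threshold tau are illustrative assumptions, not the paper's settings:

    import numpy as np

    def sample_window(H, W, tau=0.7, alpha=0.1):
        # Repeat: draw a window ratio, size a square-ish window, place it at a
        # uniformly sampled center, and accept once it covers more than a
        # fraction tau of the H x W feature map (Algorithm 1).
        while True:
            lam = np.random.beta(alpha, alpha)                     # window ratio
            rh, rw = int(H * np.sqrt(lam)), int(W * np.sqrt(lam))  # window size
            cy, cx = np.random.randint(H), np.random.randint(W)    # window center
            y1, x1 = max(cy - rh // 2, 0), max(cx - rw // 2, 0)
            y2, x2 = min(cy + rh // 2, H), min(cx + rw // 2, W)
            if (y2 - y1) * (x2 - x1) > tau * H * W:
                return y1, x1, y2, x2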

Considering that such a perturbation is marginal for images with consistent backgrounds (e.g., chromosomes), we propose another strategy, called Block, which computes $\mu$ and $\sigma^{2}$ within multiple small windows. First, we divide the input feature into non-overlapping blocks, each corresponding to a particular patch of the input image regardless of the feature scale; at scale $s$, the $i$-th block is denoted $b_{i}^{s}$. For example, an input image of resolution $H_{0} \times W_{0}$ with patch size $p$ is divided into $(H_{0}/p) \times (W_{0}/p)$ blocks. Then, we sample $n$ random blocks without replacement, following a uniform distribution. The mean $\mu_{blk}$ and variance $\sigma^{2}_{blk}$ at scale $s$ are computed as:

$\mu_{blk,b,c} = \frac{1}{|\mathcal{P}|}\sum_{(h,w)\in\mathcal{P}} x_{b,c,h,w}, \quad \sigma^{2}_{blk,b,c} = \frac{1}{|\mathcal{P}|}\sum_{(h,w)\in\mathcal{P}} \left(x_{b,c,h,w} - \mu_{blk,b,c}\right)^{2}$ (7)

where $\mathcal{P}$ is the set of pixels covered by the $n$ randomly indexed blocks.
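A possible tensor-level realization of the Block statistics is sketched below; the patch size and number of sampled blocks are hypothetical defaults, and the feature height and width are assumed divisible by the patch size:

    import torch

    def block_stats(x, patch=4, n_blocks=8):
        # Split the (B, C, H, W) feature map into non-overlapping patch x patch
        # blocks, sample n_blocks of them without replacement, and compute the
        # mean/variance over the pixels of the chosen blocks (Eq. 7).
        B, C, H, W = x.shape
        gh, gw = H // patch, W // patch
        blocks = (x.reshape(B, C, gh, patch, gw, patch)
                   .permute(0, 1, 2, 4, 3, 5)
                   .reshape(B, C, gh * gw, patch * patch))
        idx = torch.randperm(gh * gw)[:n_blocks]        # blocks sampled uniformly
        chosen = blocks[:, :, idx, :].reshape(B, C, -1)
        mu = chosen.mean(dim=2, keepdim=True)
        var = chosen.var(dim=2, unbiased=False, keepdim=True)
        return mu.unsqueeze(-1), var.unsqueeze(-1)      # each (B, C, 1, 1)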

In order to diversify the perturbations and smooth the model response, WIN mixes the local statistics with the global statistics:

$\mu_{mix} = \lambda\,\mu_{local} + (1-\lambda)\,\mu_{in}, \quad \sigma^{2}_{mix} = \lambda\,\sigma^{2}_{local} + (1-\lambda)\,\sigma^{2}_{in}$ (8)

where $\mu_{local}$ and $\sigma^{2}_{local}$ are the window (or block) statistics and $\lambda$ is a random instance-specific weight sampled from the beta distribution.

To keep the evaluation deterministic, we adopt $\mu_{in}$ and $\sigma^{2}_{in}$ during evaluation, as the mixing statistics are approximations of the global statistics (see Fig. 1(d)). The mixing statistics are equivalent to the global statistics when $\lambda = 0$ or when the window covers the entire feature map.
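Putting the pieces together, a minimal PyTorch module in the spirit of WIN might look as follows. It reuses the sample_window sketch above, omits the affine transformation (as with IN in practice), and its default alpha and tau are again assumptions rather than the authors' settings:

    import torch
    import torch.nn as nn

    class WindowNorm2d(nn.Module):
        def __init__(self, eps=1e-5, alpha=0.1, tau=0.7):
            super().__init__()
            self.eps, self.alpha, self.tau = eps, alpha, tau

        def forward(self, x):
            B, C, H, W = x.shape
            # Global (IN) statistics, used alone at evaluation time.
            mu_g = x.mean(dim=(2, 3), keepdim=True)
            var_g = x.var(dim=(2, 3), unbiased=False, keepdim=True)
            if not self.training:
                return (x - mu_g) / torch.sqrt(var_g + self.eps)
            # Local statistics from a random window (Eq. 6).
            y1, x1, y2, x2 = sample_window(H, W, self.tau, self.alpha)
            win = x[:, :, y1:y2, x1:x2]
            mu_l = win.mean(dim=(2, 3), keepdim=True)
            var_l = win.var(dim=(2, 3), unbiased=False, keepdim=True)
            # Instance-specific mixing weight and mixed statistics (Eq. 8).
            beta = torch.distributions.Beta(self.alpha, self.alpha)
            lam = beta.sample((B, 1, 1, 1)).to(x.device)
            mu = lam * mu_l + (1 - lam) * mu_g
            var = lam * var_l + (1 - lam) * var_g
            return (x - mu) / torch.sqrt(var + self.eps)

Dropping such a layer in place of each BatchNorm2d/InstanceNorm2d gives the training-time perturbation and deterministic evaluation described above.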

3.3 Training and Evaluation Discrepancy

Normalizing features with WIN regularizes CNN training effectively and improves OOD generalization significantly. However, the evaluation statistics cannot be strictly aligned with the training statistics, and the model parameters are trained to fit features normalized by $\mu_{mix}$ and $\sigma^{2}_{mix}$, which degrades model generalization. On the other hand, normalizing the features with the mixing statistics (i.e., training mode) or the global statistics (i.e., evaluation mode) produces two correlated views of the same sample. Minimizing the discrepancy between training and evaluation is thus a natural pretext task for self-supervised learning [7, 20].

To these ends, we introduce WIN-WIN for classification tasks, which requires two forward passes. As shown in Fig. 1(d), given an input $x$ with ground-truth label $y$, the first pass normalizes the features with the mixing statistics and outputs the logits $z$, while the second pass normalizes the features with the global statistics and outputs the logits $z'$. WIN-WIN compels the model to minimize the Jensen-Shannon divergence between $z$ and $z'$, together with the cross-entropy between $z$ and $y$ and between $z'$ and $y$. The Jensen-Shannon divergence loss can be written as:

$\mathcal{L}_{js} = \frac{1}{2}\,\mathrm{KL}\!\left(p \,\|\, m\right) + \frac{1}{2}\,\mathrm{KL}\!\left(p' \,\|\, m\right), \quad m = \frac{1}{2}\left(p + p'\right)$ (9)

where the logits $z$ and $z'$ are converted to probability vectors $p$ and $p'$ via the softmax function. The cross-entropy loss can be written as:

$\mathcal{L}_{ce} = \mathrm{CE}(p, y) + \mathrm{CE}(p', y)$ (10)
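For classification, a WIN-WIN training step can then be sketched as below; it assumes the model's normalization layers are WIN layers whose behavior switches with model.train()/model.eval(), and that the model contains no other mode-dependent layers such as dropout:

    import torch.nn.functional as F

    def win_win_step(model, x, y):
        model.train()
        z = model(x)                     # first pass: mixing statistics
        model.eval()
        z_prime = model(x)               # second pass: global statistics
        model.train()

        p, q = F.softmax(z, dim=1), F.softmax(z_prime, dim=1)
        log_m = (0.5 * (p + q)).clamp(min=1e-7).log()    # mixture distribution
        # Jensen-Shannon consistency between the two views (Eq. 9).
        loss_js = 0.5 * (F.kl_div(log_m, p, reduction='batchmean')
                         + F.kl_div(log_m, q, reduction='batchmean'))
        # Cross-entropy on both outputs (Eq. 10).
        loss_ce = F.cross_entropy(z, y) + F.cross_entropy(z_prime, y)
        return loss_ce + loss_js

The returned loss is backpropagated as usual (loss.backward() followed by an optimizer step); gradients flow through both forward passes.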

4 Experiments

4.1 Datasets and Evaluation Metrics

To demonstrate the generality of WIN, the experiments cover a wide range of OOD generalization tasks: binary classification, multiclass classification, image segmentation, and instance retrieval. The datasets range from common benchmarks to real-world applications; datasets within the same task have heterogeneous appearances but share the same labels (see Fig. 2).

Binary Classification. Following [41], we set up a benchmark for our methods using seven glaucoma detection datasets from seven different sites, including five public datasets (LAG, REFUGE, RIMONE-r2, ODIR, and ORIGA) and two private datasets (BY and ZR). BY and ZR were collected from Peking University Third Hospital and the Second Affiliated Hospital of Zhejiang University, respectively. Furthermore, experiments were also conducted on Camelyon17, a breast cancer detection benchmark from WILDS [15] that includes five data sources. All histopathological images in Camelyon17 were resized to facilitate the window sampling. Fig. 2 (a) and (b) show the two benchmarks' example cases and sample numbers. The evaluation metric is the mean clean area under the curve (mean clean AUC), following [41]. Each dataset was divided into training and validation splits, while the other datasets were held out for testing. Before aggregation, each test dataset's result (i.e., its AUC) is recalibrated by its inherent difficulty. A slight difference from [41] is that we run each experiment only once, considering the multiple test datasets and the time saved. A higher mean clean AUC means better OOD generalization.

Multiclass Classification. As shown in Fig. 2 (c), we collected chromosome images from two hospitals to cross-evaluate OOD generalization in chromosome classification: samples from the Obstetrics and Gynecology Hospital of Fudan University (RHH) and samples from the International Peace Maternity and Child Health Hospital of China Welfare Institute (IPMCH). The dataset from one hospital was divided into a training set and a validation set, and the other hospital's data was used as the test set. Results are reported in Top-1 accuracy (%). Note that blank areas occupy a large proportion of these chromosome images.

Besides, we also conducted experiments on two robustness benchmarks, CIFAR-10-C and CIFAR-100-C [10], and a commonly used domain generalization (DG) benchmark, Digit-DG [43]. CIFAR-10-C and CIFAR-100-C consist of corrupted versions of the original test sets. Their metric is the mean corruption error (mCE) across all fifteen corruptions and five severities per corruption; lower is better. Digit-DG consists of four handwritten digit recognition datasets (MNIST, MNIST-M, SVHN, and SYN) with distribution shifts in font style, stroke color, and background. Following prior DG works [44], we applied the leave-one-domain-out protocol for Digit-DG and report the Top-1 accuracy on each target dataset. Chromosome images, CIFAR-10-C, CIFAR-100-C, and the images of Digit-DG were resized to fixed input resolutions.

Image Segmentation. We employed the retinal fundus images in [33] for optic cup and disc (OC/OD) segmentation. As shown in Fig. 2 (d), they were collected from four sites; each image was center-cropped and resized as network input. We randomly divided the images of each site into training and test sets at a fixed ratio. In our experiments, the training sets of three sites were used for model training, and performance was evaluated on the test set of the remaining site; for example, the training sets of Sites A, B, and C are used for training, and the test set of Site D for testing. This is the common DG setting. Segmentation results are reported as the Dice coefficient (Dice); higher is better.

Instance Retrieval. Person re-identification (re-ID) aims to match people across disjoint camera views. Naturally, each camera view is itself a distinct domain, and camera views from different sites increase the distribution shifts in backgrounds, illumination, image resolution, viewpoints, etc. In our experiments, we adopted Market1501 [39] and CUHK03 [17], each containing images of pedestrian identities captured by multiple cameras. All images were resized to a common resolution. A model was trained on one dataset and tested on the other. Results are reported in ranking accuracy and mean average precision (mAP) [39]; higher is better.

Figure 2: Example cases and sample numbers of each data source in our experiments. (a) glaucoma detection, (b) breast cancer detection, (c) chromosome classification, and (d) optic disc and cup segmentation
Methods              IND    OOD                                                mean clean AUC
                     LAG    REFUGE  RIMONE-r2  ODIR   ORIGA  BY     ZR         (mean±std)
DeepAll              0.982  0.989   0.912      0.874  0.782  0.989  0.999      1.000±0.00
BN [13]              0.991  0.740   0.390      0.596  0.698  0.597  0.771      0.688±0.15
IN [31]              0.983  0.870   0.651      0.773  0.741  0.763  0.841      0.840±0.08
WIN-WIN              0.989  0.896   0.691      0.820  0.775  0.874  0.879      0.893±0.07
LCN [24]             0.988  0.820   0.578      0.698  0.675  0.712  0.771      0.769±0.08
SNR [14]             0.993  0.806   0.442      0.609  0.660  0.563  0.779      0.698±0.13
BIN [23]             0.984  0.875   0.560      0.771  0.760  0.819  0.889      0.845±0.11
SN [22]              0.992  0.791   0.621      0.629  0.644  0.591  0.751      0.729±0.08
IBN-a [25]           0.993  0.777   0.597      0.667  0.728  0.550  0.795      0.748±0.12
IBN-b [25]           0.991  0.862   0.700      0.730  0.706  0.637  0.811      0.805±0.08
GN [34]              0.991  0.837   0.499      0.724  0.683  0.661  0.821      0.764±0.12
RBN [6]              0.991  0.755   0.458      0.617  0.724  0.633  0.838      0.729±0.14
CNSN [30]            0.987  0.750   0.459      0.582  0.617  0.503  0.592      0.636±0.11
AutoAugment [2]      0.996  0.605   0.452      0.548  0.646  0.601  0.761      0.655±0.11
RandAugment [3]      0.992  0.712   0.462      0.650  0.761  0.771  0.779      0.750±0.14
Patch Gaussian [21]  0.995  0.653   0.428      0.562  0.614  0.531  0.606      0.617±0.10
randomErasing [40]   0.994  0.700   0.401      0.526  0.627  0.540  0.660      0.626±0.12
mixup [38]           0.993  0.605   0.443      0.583  0.640  0.520  0.746      0.643±0.12
CutMix [37]          0.995  0.679   0.455      0.590  0.683  0.675  0.768      0.697±0.11
AugMix [11]          0.995  0.679   0.455      0.590  0.683  0.675  0.768      0.697±0.11
PixMix [12]          0.997  0.702   0.423      0.640  0.693  0.708  0.783      0.715±0.13
manifold mixup [32]  0.993  0.637   0.389      0.502  0.653  0.664  0.348      0.583±0.16
MixStyle [44]        0.995  0.705   0.364      0.602  0.678  0.590  0.829      0.682±0.16
Table 1: Comparison of our method with other state-of-the-art methods on the glaucoma detection task. Experimental results are reported with ResNet-50 trained on LAG. DeepAll serves as the upper bound for each dataset. Performance on each dataset is evaluated with the area under the curve (AUC). Top results are bolded.
Methods    Training data  Epochs  H1     H2     H3     H4     H5     mean±std    Avg.
DeepAll    All            10      0.998  0.997  0.999  0.999  0.999  1.000±0.00  1.000
BN [13]    H1             50      0.999  0.868  0.696  0.908  0.528  0.752±0.18  0.659
           H2             50      0.721  0.999  0.184  0.702  0.565  0.544±0.25
           H3             40      0.718  0.844  1.000  0.327  0.553  0.612±0.22
           H4             25      0.730  0.697  0.410  1.000  0.602  0.611±0.14
           H5             25      0.767  0.708  0.668  0.946  1.000  0.774±0.12
IN [31]    H1             50      0.999  0.936  0.816  0.967  0.867  0.898±0.07  0.758
           H2             50      0.952  0.998  0.949  0.908  0.499  0.828±0.22
           H3             40      0.726  0.544  0.999  0.434  0.597  0.576±0.12
           H4             25      0.957  0.886  0.664  0.999  0.773  0.822±0.13
           H5             25      0.621  0.680  0.771  0.580  0.999  0.664±0.08
WIN-WIN    H1             50      0.998  0.944  0.935  0.973  0.942  0.950±0.02  0.778
           H2             50      0.967  0.997  0.918  0.966  0.573  0.857±0.19
           H3             40      0.734  0.504  0.999  0.461  0.795  0.625±0.17
           H4             25      0.921  0.892  0.506  0.999  0.629  0.738±0.20
           H5             25      0.665  0.800  0.796  0.609  0.999  0.719±0.10
Table 2: Comparison with baselines on the breast cancer detection task. Experimental results are reported with ResNet-50. DeepAll serves as the upper bound for each dataset. Performance on each dataset is evaluated with the area under the curve (AUC). Top results are bolded and in-distribution results are underlined.

4.2 Implementation Details

We conducted the experiments on two NVIDIA GeForce GTX 2080Ti GPUs with a PyTorch implementation. All models were trained from scratch using the cross-entropy loss, and the number of training epochs was determined by the task. The hyper-parameters used in the mixing statistics were set empirically. With the exception of chromosome classification, all tasks used the Window strategy; the Block strategy was adopted for chromosome classification because the chromosome images have consistent backgrounds (see Fig. 2). Furthermore, we used one ratio threshold $\tau$ for all tasks except instance retrieval, which used its own setting.

Binary Classification. We optimized ResNet-50 [8] with the following settings: SGD optimizer with Nesterov momentum, weight decay, and a fixed batch size. The learning rate increased linearly during the first epochs and then decayed following a cosine schedule. Each image was randomly flipped horizontally and randomly cropped before being fed into the model.

Multiclass classification. For the chromosome classification task, we ran five trials with the above settings. The CIFAR-10-C and CIFAR-100-C experiments were built on PIXMIX [12] (https://github.com/andyzoujm/pixmix). SGD with Nesterov momentum was adopted as the optimizer to train ResNet-18 [8], with a cosine learning rate schedule; the augmentation policy was the same as in binary classification. In the Digit-DG experiments, five ResNet-18 models were optimized with SGD (momentum, weight decay, and a cosine decay schedule). The code was built on mixstyle-release [44] (https://github.com/KaiyangZhou/mixstyle-release), and no augmentation was applied to the input images.

Image segmentation. We built the code based on DoFE [33] (https://github.com/emma-sjwang/Dofe) and MONAI (https://github.com/Project-MONAI/MONAI). We trained the 5-layer U-Net [27] five times, optimizing the cross-entropy loss with the Adam optimizer. The learning rate was kept fixed in the early epochs and decreased in the later epochs. The input images were randomly cropped and resized.

Instance Retrieval. We evaluated WIN on this task using two CNN architectures: ResNet-50 and OSNet [42]. Following [44], we used the mixstyle-release code and trained each model five times with the SGD optimizer. The learning rate was decayed in steps over training. Images were randomly flipped horizontally.

Methods    RHH→IPMCH (200 epochs)   IPMCH→RHH (60 epochs)
           IND        OOD           IND        OOD
BN [13]    93.3±0.3   21.4±1.4      97.6±0.1   24.8±1.2
IN [31]    92.6±0.2   35.2±2.7      97.5±0.1   34.2±3.1
WIN-WIN    92.9±0.3   41.3±3.2      97.6±0.2   31.4±1.9
Table 3: Comparison with baselines on chromosome classification. Experimental results are reported with ResNet-50 and averaged over five runs, in Top-1 accuracy (%). Top results are bolded.
Methods    CIFAR-10-C (180 epochs)   CIFAR-100-C (200 epochs)
           IND      OOD              IND      OOD
BN [13]    95.6     23.5             75.3     50.6
IN [31]    94.6     18.5             75.1     48.4
WIN        94.2     18.1             75.2     46.4
WIN-WIN    94.6     17.5             76.4     45.7
Table 4: Comparison with baselines on CIFAR-10-C and CIFAR-100-C. Experimental results are reported with ResNet-18 trained on the original CIFAR-10 or CIFAR-100. IND performance and OOD performance are reported in Top-1 accuracy (%) and mean corruption error (mCE, %), respectively. Top results are bolded.
Methods    Target domain                                Avg.
           M          MM         SV         SY
BN [13]    93.8±0.5   45.0±0.9   55.2±1.7   83.7±0.8    69.4
IN [31]    95.1±0.3   48.8±2.5   52.8±0.7   86.5±0.6    70.8
WIN-WIN    97.6±0.1   62.5±0.8   66.0±1.6   92.7±0.3    79.7
Table 5: Comparison with baselines on Digit-DG. Experimental results are reported with ResNet-18 and averaged over five random splits. Performance on each target domain is evaluated with Top-1 accuracy (%). M: MNIST; MM: MNIST-M; SV: SVHN; SY: SYN. Top results are bolded.

4.3 Comparison with Baselines

We adopt BN and IN as baselines since they are the most popular normalization methods, and both have demonstrated effectiveness in improving model generalization. In practice, as we conducted the experiments on two GPUs, BN is equivalent to GBN [5]. In the following, we use WIN-WIN as the competitor for classification tasks and WIN for the other tasks.

For the binary classification tasks, we aggregated all datasets of the same task and trained DeepAll. DeepAll provides a difficulty coefficient and serves as the upper bound for each dataset [41]. The experimental results of the baselines and our method are presented in Tables 1 and 2. Compared with BN, our method surpassed it remarkably in OOD generalization on both glaucoma detection and breast cancer detection. Compared with IN, our method also led in mean clean AUC on glaucoma detection and in average mean clean AUC on breast tumor detection. Besides, the IND results varied only slightly, indicating that IND performance is insensitive to the choice of normalization method.

Multiclass classification is a more challenging task, yet the superiority of our method persists. Table 3 presents the experimental results of chromosome classification, which asks the model to classify the input chromosome image into 24 types. A few images in RHH are mislabeled; thereby, distribution shifts appear in the input images and the labels simultaneously. RHH→IPMCH denotes training on RHH and testing on IPMCH, and IPMCH→RHH reverses this setting. The accuracy of our method exceeded BN and IN in RHH→IPMCH; in the other setting, our method was suboptimal and inferior to IN. Furthermore, we verified our method on the common benchmarks CIFAR-10-C, CIFAR-100-C, and Digit-DG. As shown in Table 4, WIN-WIN effectively improved corruption robustness compared with the baselines. In the Digit-DG experiments, three domains were used for training and one domain for testing; our method greatly outperformed the baselines regardless of the target domain (see Table 5), demonstrating that it can improve OOD generalization with multi-source domain data.

Methods   Site A                 Site B                 Site C                 Site D                 Avg. (Dice)
          OC          OD         OC          OD         OC          OD         OC          OD
BN [13]   0.709±0.05  0.921±0.01 0.666±0.02  0.834±0.01 0.779±0.01  0.906±0.01 0.813±0.01  0.905±0.02  0.817
IN [31]   0.753±0.03  0.948±0.00 0.710±0.02  0.867±0.01 0.819±0.02  0.938±0.00 0.848±0.02  0.927±0.00  0.851
WIN       0.756±0.03  0.952±0.00 0.734±0.01  0.848±0.01 0.839±0.01  0.940±0.01 0.846±0.01  0.924±0.00  0.855
Table 6: Comparison with baselines on the optic cup (OC) and optic disc (OD) segmentation task. Experimental results are reported with the 5-layer U-Net and averaged over five runs. Performance on each test set is evaluated with the Dice score. Top results are bolded.
Model      Method    Market1501→CUHK03                CUHK03→Market1501
                     mAP   R-1   R-5   R-10  R-20     mAP   R-1   R-5   R-10  R-20
ResNet-50  BN [13]   4.1   4.1   8.6   12.3  17.3     9.1   24.2  41.9  50.6  58.6
           IN [31]   1.6   1.2   4.2   6.4   9.8      3.7   12.7  26.0  32.7  41.4
           WIN       2.0   1.9   5.5   7.9   12.2     3.9   11.4  25.6  33.3  42.0
OSNet-50   BN [13]   3.1   3.1   6.7   9.3   13.9     8.6   22.6  39.5  49.0  58.6
           IN [31]   1.1   0.8   2.9   4.3   6.6      2.5   8.5   18.3  24.7  32.7
           WIN       2.1   1.1   4.9   9.1   14.5     4.1   12.9  27.2  34.9  43.2
Table 7: Comparison with baselines on the cross-dataset person re-ID task. ResNet-50 and OSNet-50 were each run once. Performance is evaluated with mean average precision (mAP) and ranking accuracy (R-k). Top results are bolded.

In addition to the classification tasks, we investigated the application of WIN to image segmentation and instance retrieval. Table 6 shows that WIN outperformed both BN and IN in average Dice score on OC/OD segmentation, demonstrating once again that our method can improve OOD generalization with multi-source domain data. In the instance retrieval task, BN consistently outperformed the other methods under both settings and with both architectures (see Table 7); the reason may be that the instance-specific style discrepancy was eliminated by IN or WIN [14]. Nevertheless, WIN was clearly better than IN.

In summary, WIN is a versatile method for improving OOD generalization. It significantly surpasses BN and IN on a number of tasks (such as classification and segmentation) regardless of the training data (i.e., single domain or multiple domains). Although WIN's characteristics approach those of IN, its performance clearly surpasses IN's.

4.4 Ablation Analysis

Methods   Strategy   Stat.            Mixing   Consist.   mean clean AUC
WIN       Window     μ, σ²            ✓        ×          0.873
          Window     μ, σ²            ×        ×          0.871
          Window     σ² (global μ)    ×        ×          0.704
          Window     μ (global σ²)    ×        ×          0.711
          Global     μ, σ²            ×        ×          0.840
          Block      μ, σ²            ×        ×          0.846
          Pixel      μ, σ²            ×        ×          0.814
          Mask       μ, σ²            ×        ×          0.779
          Speckle    μ, σ²            ×        ×          0.813
WIN-WIN   Window     μ, σ²            ✓        logits     0.893
          Window     μ, σ²            ×        logits     0.876
          Window     μ, σ²            ✓        features   0.867
Table 8: Ablation results on glaucoma detection using ResNet-50.

In this section, we first examine the options for computing $\mu$ and $\sigma^{2}$ in WIN. We then investigate the effects of the mechanisms behind WIN-WIN, and finally present a hyper-parameter sensitivity analysis. All experiments in this section use a ResNet-50 trained on LAG.

First, we removed the statistics mixing. Second, we examined the choice of the statistics $\mu$ and $\sigma^{2}$ in WIN, employing the global mean or the global variance in the third and fourth rows of Table 8, respectively. Third, we investigated the area contributing to $\mu$ and $\sigma^{2}$ (see Fig. 3). The Global strategy computes the mean and variance over all pixels, i.e., IN. The Block and Pixel strategies compute them over randomly selected blocks or pixels, respectively. The Mask strategy randomly erases a region and computes $\mu$ and $\sigma^{2}$ from the remaining pixels. Finally, we perturbed the global statistics in another way, by adding speckle noise (the Speckle strategy) [9]. The following conclusions can be drawn: 1) statistics mixing is beneficial; 2) employing only one local statistic dramatically degrades OOD generalization; 3) Window and Block are the best practices for local statistics computation; and 4) compared with IN, perturbing the statistics with speckle noise does not help OOD generalization.

In WIN-WIN, we removed the statistics mixing and the consistency constraint, respectively (cf. the first row of Table 8). As shown in Table 8, each mechanism alone helped OOD generalization only marginally; the best practice is to combine statistics mixing with the consistency constraint, as the two mechanisms complement each other well. Furthermore, we imposed the consistency constraint on the features extracted from the penultimate layer (before the linear classifier) using the NT-Xent loss [7]; the gain of the consistency constraint still existed but was inferior to the current design.

Fig. 4 shows the results under different hyper-parameter settings. The mean clean AUC was mainly impacted by one of the two hyper-parameters and was insensitive to the other; choosing appropriate values for both yields better results than IN.

Figure 3: Different strategies for computing the local statistics $\mu$ and $\sigma^{2}$, which are aggregated over the bright area. The computation is actually conducted in feature space, but we illustrate it with raw images.
Figure 4: Hyper-parameter sensitivity analysis.

4.5 Comparison with State-of-the-arts

The comprehensive comparison with state-of-the-art methods is presented in Table 1. All results are reported with ResNet-50 on the glaucoma detection task.

We first compared our method with several normalization methods. Among them, BN [13], IN [31], and GN [34] are the most popular; LCN [24], SNR [14], BIN [23], SN [22], IBN-a [25], and IBN-b [25] are incremental studies over IN; and RBN [6] and CNSN [30] are recently proposed methods for improving model generalization. Overall, these normalization methods presented no obvious advantage in IND or OOD generalization, and our method significantly outperformed them. It is worth noting that LCN and CNSN are the most relevant to our method: LCN normalizes each feature with the statistics of its neighbors, and CNSN proposes CrossNorm, which exchanges local and global statistics between channels. Although local statistics are used in LCN and CNSN, our method surpassed both considerably.

On the other hand, our method is also related to augmentation-based methods. The comparison with these methods was likewise conducted with ResNet-50 (using BN), covering image-space augmentation (AutoAugment [2], RandAugment [3], Patch Gaussian [21], randomErasing [40], mixup [38], CutMix [37], AugMix [11], and PixMix [12]) and feature-space augmentation (manifold mixup [32] and MixStyle [44]). As shown in Table 1, these methods showed superiority on IND data, but their effects on OOD generalization were negligible or even negative, except for RandAugment and PixMix. Overall, our method significantly outperformed these augmentation-based methods.

Furthermore, it should be noted that SNR [14], IBN-a [25], IBN-b [25], CNSN [30], Patch Gaussian [21], AugMix [11], PixMix [12], and MixStyle [44] are state-of-the-art methods for domain generalization and adaptation or for robustness, improving OOD generalization from only a single source domain. Our method significantly outperformed them as well.

Figure 5: T-SNE visualization of features on CIFAR-10-C. All plots are drawn with the features of in-distribution data (i.e., the validation set).
Figure 6: (a) Cosine similarity between the raw image and its corrupted versions (i.e., OOD data); (b) training loss curves on CIFAR-10-C; (c) validation loss curves on CIFAR-10-C; (d) time costs per epoch of ResNet-18 on CIFAR-10-C.

5 Discussions

In practice, the training data capture only a small population of the real data distribution, and an even smaller one for medical images due to the expensive costs of data acquisition and the high diversity caused by operating procedures and imaging equipment. Thereby, discrepancies between training data and test data from unseen deployment environments are prevalent and lead to performance dips. In this paper, we focus on improving OOD generalization on heterogeneous data with the same labels but different appearances. We propose a simple yet effective normalization technique, WIN, to address this issue; it is parameter-free and has a negligible effect on IND data. WIN conducts feature-level augmentation by perturbing the normalizing statistics with stochastic window statistics. On top of this augmentation, we propose a novel self-distillation scheme, WIN-WIN, to eliminate the train-test inconsistency and further improve OOD generalization in classification tasks; it is easily implemented with two forward passes and a consistency constraint. Extensive experiments demonstrated that our methods can significantly and generally boost OOD generalization across different tasks spanning 26 datasets.

Here, we present a comprehensive investigation of the properties of WIN, covering its impact on IND and OOD data, model convergence, and time cost. To ensure reproducibility and provide insight for subsequent studies, we conducted these experiments on the widely used OOD generalization benchmark CIFAR-10-C [10]. Fig. 5 shows the t-SNE visualization of the penultimate-layer features on CIFAR-10-C. Although BN reported the best result on in-distribution data (see Table 4), IN and WIN separated the classes better than BN. In addition, the averaged cosine similarity between a raw image and its corrupted versions was employed to show the impact on OOD data; as shown in Fig. 6 (a), WIN produced more invariant features at the penultimate layer. Finally, we analyzed the training loss curves of BN, IN, and WIN and their time costs. Fig. 6 (b) and (c) show that the convergence rate of WIN is close to those of BN and IN. For the time costs, we compared two schemes: WIN (offline), which caches the window parameters, and WIN (online), which generates them on the fly. According to Fig. 6 (d), WIN (offline) greatly reduced the time cost relative to WIN (online) but remained slower than both BN and IN. In short, despite the longer training time, WIN is a good alternative to existing normalization techniques for improving OOD generalization.

Despite yielding a superior improvement in OOD generalization, WIN still has some limitations. First, window sampling is inefficient for small feature sizes; fortunately, this is rarely a limitation for existing CNNs, as larger input sizes can improve model generalization [29]. Second, as a variant of IN, WIN may inherit IN's flaws: for instance, the results of IN and WIN are significantly inferior to those of BN in the instance retrieval task (see Table 7). Accordingly, a reasonable conjecture is that the advantages of IN are also retained in WIN. In future work, we will apply WIN to pixel2pixel tasks (e.g., super-resolution and style transfer) that widely use IN. Furthermore, given its superiority on heterogeneous data, its use in federated learning is worth exploring thoroughly.

6 Conclusions

In this paper, we propose a simple yet effective normalization technique, WIN, which uses stochastic local statistics to boost OOD generalization without sacrificing IND generalization. Based on WIN, we propose WIN-WIN to further improve OOD generalization in classification tasks, implementable with only a few lines of code. Our methods gracefully address the OOD generalization of heterogeneous data in real-world clinical practice. Extensive experiments on various tasks and datasets demonstrated their generality and superiority over the baselines and many state-of-the-art methods.

References

  • [1] J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §2.2, §2.2.
  • [2] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le (2019) Autoaugment: learning augmentation strategies from data. In CVPR, pp. 113–123. Cited by: §1, §2.1, §4.5, Table 1.
  • [3] E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le (2020) Randaugment: practical automated data augmentation with a reduced search space. In CVPR, pp. 702–703. Cited by: §1, §2.1, §4.5, Table 1.
  • [4] L. Deecke, I. Murray, and H. Bilen (2018) Mode normalization. arXiv preprint arXiv:1810.05466. Cited by: §2.2.
  • [5] N. Dimitriou and O. Arandjelovic (2020) A new look at ghost normalization. arXiv preprint arXiv:2007.08554. Cited by: §3.2, §4.3.
  • [6] S. Gao, Q. Han, D. Li, M. Cheng, and P. Peng (2021) Representative batch normalization with feature calibration. In CVPR, pp. 8669–8679. Cited by: §2.2, §4.5, Table 1.
  • [7] T. Gao, X. Yao, and D. Chen (2021) SimCSE: simple contrastive learning of sentence embeddings. In EMNLP, Cited by: §3.3, §4.4.
  • [8] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §4.2, §4.2.
  • [9] D. Hendrycks, S. Basart, N. Mu, S. Kadavath, F. Wang, E. Dorundo, R. Desai, T. Zhu, S. Parajuli, M. Guo, et al. (2021) The many faces of robustness: a critical analysis of out-of-distribution generalization. In ICCV, pp. 8340–8349. Cited by: §1, §2.1, §4.4.
  • [10] D. Hendrycks and T. G. Dietterich (2019) Benchmarking neural network robustness to common corruptions and perturbations. In ICLR, Cited by: §4.1, §5.
  • [11] D. Hendrycks, N. Mu, E. D. Cubuk, B. Zoph, J. Gilmer, and B. Lakshminarayanan (2020) AugMix: A simple data processing method to improve robustness and uncertainty. In ICLR, Cited by: §1, §2.1, §4.5, §4.5, Table 1.
  • [12] D. Hendrycks, A. Zou, M. Mazeika, L. Tang, B. Li, D. Song, and J. Steinhardt (2022) PixMix: dreamlike pictures comprehensively improve safety measures. CVPR. Cited by: §1, §2.1, §4.2, §4.5, §4.5, Table 1.
  • [13] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, Vol. 37, pp. 448–456. Cited by: §1, §2.2, §2.2, §4.5, Table 1, Table 2, Table 3, Table 4, Table 5, Table 6, Table 7.
  • [14] X. Jin, C. Lan, W. Zeng, and Z. Chen (2021) Style normalization and restitution for domain generalization and adaptation. IEEE TMM. Cited by: §1, §2.2, §2.2, §4.3, §4.5, §4.5, Table 1.
  • [15] P. W. Koh, S. Sagawa, H. Marklund, S. M. Xie, M. Zhang, A. Balsubramani, W. Hu, M. Yasunaga, R. L. Phillips, et al. (2021) Wilds: a benchmark of in-the-wild distribution shifts. In ICML, pp. 5637–5664. Cited by: §4.1.
  • [16] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. NeurIPS 25. Cited by: §2.1.
  • [17] W. Li, R. Zhao, T. Xiao, and X. Wang (2014) Deepreid: deep filter pairing neural network for person re-identification. In CVPR, pp. 152–159. Cited by: §4.1.
  • [18] X. Li, M. Jiang, X. Zhang, M. Kamp, and Q. Dou (2021) Fedbn: federated learning on non-iid features via local batch normalization. arXiv preprint arXiv:2102.07623. Cited by: §1, §2.2.
  • [19] X. Li, W. Sun, and T. Wu (2020) Attentive normalization. In ECCV, pp. 70–87. Cited by: §2.2.
  • [20] X. Liang, L. Wu, J. Li, Y. Wang, Q. Meng, T. Qin, W. Chen, M. Zhang, and T. Liu (2021) R-drop: regularized dropout for neural networks. In NeurIPS, Cited by: §3.3.
  • [21] R. G. Lopes, D. Yin, B. Poole, J. Gilmer, and E. D. Cubuk (2019) Improving robustness without sacrificing accuracy with patch gaussian augmentation. arXiv preprint arXiv:1906.02611. Cited by: §2.1, §4.5, §4.5, Table 1.
  • [22] P. Luo, R. Zhang, J. Ren, Z. Peng, and J. Li (2019) Switchable normalization for learning-to-normalize deep representation. IEEE TPAMI 43 (2), pp. 712–728. Cited by: §1, §2.2, §4.5, Table 1.
  • [23] H. Nam and H. Kim (2018) Batch-instance normalization for adaptively style-invariant neural networks. NeurIPS 31. Cited by: §1, §2.2, §4.5, Table 1.
  • [24] A. Ortiz, C. Robinson, D. Morris, O. Fuentes, C. Kiekintveld, M. M. Hassan, and N. Jojic (2020) Local context normalization: revisiting local normalization. In CVPR, pp. 11276–11285. Cited by: §2.2, §3.2, §4.5, Table 1.
  • [25] X. Pan, P. Luo, J. Shi, and X. Tang (2018) Two at once: enhancing learning and generalization capacities via ibn-net. In ECCV, Vol. 11208, pp. 484–500. Cited by: §1, §2.2, §2.2, §3.1, §4.5, §4.5, Table 1.
  • [26] S. Qiao, H. Wang, C. Liu, W. Shen, and A. Yuille (2019) Rethinking normalization and elimination singularity in neural networks. arXiv preprint arXiv:1911.09738. Cited by: §2.2.
  • [27] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. MICCAI 2015. Cited by: §4.2.
  • [28] W. Shao, T. Meng, J. Li, R. Zhang, Y. Li, X. Wang, and P. Luo (2019) Ssn: learning sparse switchable normalization via sparsestmax. In CVPR, pp. 443–451. Cited by: §1, §2.2.
  • [29] M. Tan and Q. V. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In ICML, Vol. 97, pp. 6105–6114. Cited by: §5.
  • [30] Z. Tang, Y. Gao, Y. Zhu, Z. Zhang, M. Li, and D. N. Metaxas (2021) CrossNorm and selfnorm for generalization under distribution shifts. In ICCV, pp. 52–61. Cited by: §1, §2.2, §4.5, §4.5, Table 1.
  • [31] D. Ulyanov, A. Vedaldi, and V. S. Lempitsky (2016) Instance normalization: the missing ingredient for fast stylization. CoRR abs/1607.08022. Cited by: §1, §2.2, §2.2, §4.5, Table 1, Table 2, Table 3, Table 4, Table 5, Table 6, Table 7.
  • [32] V. Verma, A. Lamb, C. Beckham, A. Najafi, I. Mitliagkas, D. Lopez-Paz, and Y. Bengio (2019) Manifold mixup: better representations by interpolating hidden states. In ICML, pp. 6438–6447. Cited by: §4.5, Table 1.
  • [33] S. Wang, L. Yu, K. Li, X. Yang, C. Fu, and P. Heng (2020) DoFE: domain-oriented feature embedding for generalizable fundus image segmentation on unseen datasets. IEEE TMI. Cited by: §1, §4.1, §4.2.
  • [34] Y. Wu and K. He (2018) Group normalization. In ECCV, pp. 3–19. Cited by: §2.2, §4.5, Table 1.
  • [35] Y. Wu and J. Johnson (2021) Rethinking” batch” in batchnorm. arXiv preprint arXiv:2105.07576. Cited by: §2.2, §3.1.
  • [36] T. Yu, Z. Guo, X. Jin, S. Wu, Z. Chen, W. Li, Z. Zhang, and S. Liu (2020) Region normalization for image inpainting. In AAAI, Vol. 34, pp. 12733–12740. Cited by: §2.2.
  • [37] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo (2019) Cutmix: regularization strategy to train strong classifiers with localizable features. In ICCV, pp. 6023–6032. Cited by: §4.5, Table 1.
  • [38] H. Zhang, M. Cissé, Y. N. Dauphin, and D. Lopez-Paz (2018) Mixup: beyond empirical risk minimization. In ICLR, Cited by: §4.5, Table 1.
  • [39] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian (2015) Scalable person re-identification: a benchmark. ICCV. Cited by: §4.1.
  • [40] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang (2020) Random erasing data augmentation. In AAAI, Vol. 34, pp. 13001–13008. Cited by: §4.5, Table 1.
  • [41] C. Zhou, J. Ye, J. Wang, Z. Zhou, L. Wang, K. Jin, Y. Wen, C. Zhang, and D. Qian (2022) Improving the generalization of glaucoma detection on fundus images via feature alignment between augmented views. Biomed. Opt. Express 13 (4), pp. 2018–2034. Cited by: §4.1, §4.3.
  • [42] K. Zhou, Y. Yang, A. Cavallaro, and T. Xiang (2019) Learning generalisable omni-scale representations for person re-identification. IEEE TPAMI. Cited by: §4.2.
  • [43] K. Zhou, Y. Yang, T. M. Hospedales, and T. Xiang (2020) Learning to generate novel domains for domain generalization. ECCV. Cited by: §2.1, §4.1.
  • [44] K. Zhou, Y. Yang, Y. Qiao, and T. Xiang (2021) Domain generalization with mixstyle. In ICLR, Cited by: §4.1, §4.2, §4.2, §4.5, §4.5, Table 1.