Robust and Generalizable Visual Representation Learning via Random Convolutions

07/25/2020 ∙ by Zhenlin Xu, et al. ∙ University of North Carolina at Chapel Hill ∙ Yale University

While successful for various computer vision tasks, deep neural networks have been shown to be vulnerable to texture style shifts and small perturbations to which humans are robust. Hence, our goal is to train models in a way that improves their robustness to these perturbations. We are motivated by the approximately shape-preserving property of randomized convolutions, which is due to distance preservation under random linear transforms. Intuitively, randomized convolutions create an infinite number of new domains with similar object shapes but random local texture. Therefore, we explore using outputs of multi-scale random convolutions as new images or mixing them with the original images during training. When applying a network trained with our approach to unseen domains, our method consistently improves the performance on domain generalization benchmarks and is scalable to ImageNet. Especially for the challenging scenario of generalizing to the sketch domain in PACS and to ImageNet-Sketch, our method outperforms state-of-the-art methods by a large margin. More interestingly, our method can benefit downstream tasks by providing a more robust pretrained visual representation.




1 Introduction

Generalizability and robustness on out-of-distribution samples have been the pain points of applying deep neural networks (DNNs) in real-world applications Volpi et al. (2018). While we are collecting datasets with millions of training samples, DNNs are still vulnerable to domain shift, small perturbations, and adversarial examples to which humans are remarkably robust Luo et al. (2019); Elsayed et al. (2018). Recent research has shown that neural networks tend to use superficial features rather than global shape information for prediction even when trained on large-scale datasets such as ImageNet Geirhos et al. (2019). These superficial features can be local textures or even patterns imperceptible to humans but detectable by the DNNs, as is the case for adversarial examples Ilyas et al. (2019). In contrast, image semantics often depend more on object shapes than on local textures. For image data, local texture differences are one of the main sources of domain shift, e.g., between synthetic virtual images and real data Sun and Saenko (2014). Therefore, our goal is to learn visual representations that are invariant to local texture such that they can generalize to unseen domains.

Figure 1: RandConv data augmentation examples. 1st and 2nd rows: the first column is the input image; the following columns are convolution results using random filters of different sizes. 3rd row: Mixup results between an image and one of its random convolution results with different mixing coefficients.

We address the challenging setting of robust visual representation learning from single domain data. Limited work exists in this setting. Proposed methods include data augmentation Volpi et al. (2018); Qiao et al. (2020); Geirhos et al. (2019), domain randomization Tobin et al. (2017); Yue et al. (2019), self-supervised learning Carlucci et al. (2019), and penalizing the predictive power of low-level network features Wang et al. (2019a). Following the spirit of adding inductive bias towards global shape information over local textures, we propose using random convolutions to improve the robustness to domain shifts and small perturbations. In addition, considering that many computer vision tasks rely on training deep networks based on ImageNet-pretrained weights (including many domain generalization benchmarks), we ask “Can a more robust pre-trained model make the finetuned model more robust on downstream tasks?” Different from Kornblith et al. (2019), which studied the transferability of a pretrained ImageNet representation to new tasks while focusing on in-domain generalization, we explore generalization performance on unseen domains for new tasks.

We make the following contributions:

  • We justify that random convolutions preserve shape information, based on the distance preserving property of random linear projections. The spatial extent of the convolution filter determines the scale at which shape information is maintained, and local textures are perturbed.

  • We develop RandConv, a data augmentation technique using multi-scale random convolutions to generate images with random texture while maintaining object shape. We explore either directly using the RandConv output as training images or mixing it with the original images. We show that a consistency loss can further enforce invariance under texture changes.

  • We validate RandConv and its mixup variant in extensive experiments on synthetic and real-world benchmarks as well as on the large-scale ImageNet dataset. Our methods outperform single domain generalization approaches by a large margin on the digits recognition datasets and for the challenging case of generalizing to the Sketch domain in PACS and to ImageNet-Sketch.

  • We explore if the robustness/generalizability of a pretrained representation can transfer. We show that transferring a model pretrained with RandConv on ImageNet can further improve domain generalization performance on new downstream tasks on the PACS dataset.

2 Related Work

Domain Generalization (DG) aims at learning domain invariant representations that generalize to unseen domains. Modern techniques include feature fusion Shen et al. (2019), meta-learning Li et al. (2018a); Balaji et al. (2018), and adversarial training Shao et al. (2019); Li et al. (2018b). Note that most current DG work Ghifary et al. (2016); Li et al. (2018a, b) requires a multi-source setting to work well. However, in practice, it might be difficult and expensive to collect data from multiple sources, such as collecting data from multiple medical centers Raghupathi and Raghupathi (2014). Instead, we consider the stricter single domain generalization setting, where we train the model on source data from a single domain and generalize it to new unseen domains Carlucci et al. (2019); Wang et al. (2019b).

Domain Randomization (DR) was first introduced as a DG technique by Tobin et al. (2017) to handle the domain gap between simulated and real data. As the training data in Tobin et al. (2017) is synthesized in a virtual environment, it is possible to generate diverse training samples by randomly selecting background images, colors, lighting, and textures of foreground objects. When a simulation environment is not accessible, image stylization can be used to generate new domains Yue et al. (2019); Geirhos et al. (2019). However, this requires extra effort to collect data and to train an additional model; further, the number of randomized domains is limited by the number of predefined styles.

Data Augmentation has been widely studied to improve the generalizability of machine learning models Simard et al. (2003). We can consider DR approaches a type of synthetic data augmentation. To improve performance on unseen domains, Volpi et al. (2018) generate adversarial examples to augment the training data, and Qiao et al. (2020) extend this approach via meta-learning. Like other adversarial training algorithms, these methods require significant extra computation to obtain adversarial examples.

Learning Representations biased by Global Shape Geirhos et al. (2019) demonstrated that convolutional neural networks (CNNs) tend to use superficial local features even when trained on large datasets. To counteract this effect, they proposed to train on stylized ImageNet, thereby forcing a network to rely on object shape instead of textures. Wang et al. improved out-of-domain performance by penalizing the correlation between a learned representation and superficial features such as the gray-level co-occurrence matrix Wang et al. (2019b), or by penalizing the predictive power of local, low-level layer features in a neural network via an adversarial classifier Wang et al. (2019a). Our approach shares the idea that learning representations invariant to local texture helps generalization to unseen domains. However, RandConv avoids searching over many hyper-parameters, collecting extra data, and training other networks. It adds minimal computation overhead and is thus scalable to large-scale datasets.

Random Projections in Robust Learning Introducing randomness improves the robustness of neural networks, e.g., Dropout Srivastava et al. (2014). Random projections have also been effective for dimension reduction based on the distance preserving property of the Johnson–Lindenstrauss lemma Johnson and Lindenstrauss (1984). Vinh et al. (2016) applied random projections on entire images as data augmentation to make neural networks robust to adversarial examples. Recent work Lee et al. (2020) uses random convolutions to improve the performance of reinforcement learning (RL) on unseen environments. Lee et al. (2020) explored adding randomized convolutional layers to a CNN and empirically demonstrated that adding random convolutions with filter size three as the input layer can improve the robustness for new domains on RL tasks. However, they provide no analysis of what a randomized convolution layer does and why it works. We show that RandConv is approximately shape-preserving by proving the relative distance preserving property of random linear projections. We also extend RandConv via a multi-scale and mixup design and test it extensively on domain generalization benchmarks. Further, we demonstrate the transferability of robustness with our method and shed light on how to better use pretrained models.

3 RandConv: Distance-Preservation and Random Convolutions

We propose using a convolution layer with random weights as the first layer of a DNN during training. This strategy generates shape-consistent DNN inputs with random local texture and is beneficial for robust visual representation learning. Sec. 3.1 presents a theoretical bound on distance-preservation under random linear projections. This bound motivates shape-preservation under random convolutions, which we also illustrate empirically on real image data. Sec. 3.2 describes RandConv, our data augmentation algorithm using a multi-scale randomized convolution layer and its mixup Zhang et al. (2018); Hendrycks* et al. (2020) variant.

3.1 A Randomized Convolution Layer Preserves Shapes

Convolution is a fundamental operation for image filtering and the key building block for deep convolutional neural networks (DCNNs). Consider a convolution layer with filters $\Theta \in \mathbb{R}^{h \times w \times C_{in} \times C_{out}}$ with an image $I \in \mathbb{R}^{H \times W \times C_{in}}$ as input and the output (with padding) $g = I * \Theta \in \mathbb{R}^{H \times W \times C_{out}}$, where $H, W$ and $h, w$ are the spatial sizes of the input/output and the filter respectively, and $C_{in}$ and $C_{out}$ denote the number of feature channels for the input and output respectively.

Convolution is linear, hence we can express a convolution layer as a local linear projection:

$$g(x, y) = U f(x, y), \qquad (1)$$

where $f(x, y) \in \mathbb{R}^{d}$ ($d = h \times w \times C_{in}$) is the vectorized image patch centered at location $(x, y)$, $g(x, y) \in \mathbb{R}^{C_{out}}$ is the output feature at location $(x, y)$, and $U \in \mathbb{R}^{C_{out} \times d}$ is the matrix expressing the convolution layer filters $\Theta$. I.e., for each sliding window centered at $(x, y)$, a convolution layer applies a linear transform projecting the $d$-dimensional local image patch $f(x, y)$ to its $C_{out}$-dimensional feature $g(x, y)$. Next, we show that a convolution layer preserves shape information when $\Theta$ is independently randomly sampled, e.g. from a Gaussian distribution.
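The local linear projection view of Eq. (1) can be made concrete with a small NumPy sketch. It is only an illustration under stated assumptions: the function name `random_conv_as_projection`, the zero padding, the odd filter size, and the weight scale `1 / sqrt(d)` are choices made here, not details taken from the paper.

```python
import numpy as np

def random_conv_as_projection(image, k, c_out, rng):
    """Apply a random k x k convolution by projecting each local patch with U.

    Hypothetical helper for illustration; assumes an odd filter size k.
    """
    h_img, w_img, c_in = image.shape
    d = k * k * c_in                              # dimension of a vectorized patch
    sigma = 1.0 / np.sqrt(d)                      # assumed weight scale
    U = rng.normal(0.0, sigma, size=(c_out, d))   # one row per output channel
    pad = k // 2
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)))  # zero padding
    out = np.empty((h_img, w_img, c_out))
    for x in range(h_img):
        for y in range(w_img):
            f_xy = padded[x:x + k, y:y + k, :].reshape(d)  # patch centered at (x, y)
            out[x, y] = U @ f_xy                           # local linear projection
    return out, U

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8, 3))
g, U = random_conv_as_projection(image, k=3, c_out=3, rng=rng)
print(g.shape)  # (8, 8, 3)
```

Every output pixel is the same matrix applied to a different patch, which is why a distance-preservation result for one random matrix carries over to the whole layer.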

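In images, pixel intensities, colors, or image patches tend to be more similar within a particular structure or shape than across. We informally define shapes as clusters of similar pixels. Writing $f(\cdot)$ for the local image patch and $g(\cdot)$ for the output feature at a location, as in Eq. (1): if $(x_1, y_1)$ and $(x_2, y_2)$ are pixel coordinates inside the same shape of the original image and $(x_3, y_3)$ is a location within a different shape, then we should have $\|f(x_1, y_1) - f(x_2, y_2)\| < \|f(x_1, y_1) - f(x_3, y_3)\|$. We say a transformation is shape-preserving if it maintains such relative distance relations for most pixel triplets: i.e., $\|g(x_1, y_1) - g(x_2, y_2)\| / \|f(x_1, y_1) - f(x_2, y_2)\| \approx r$ for any two spatial locations $(x_1, y_1)$ and $(x_2, y_2)$, where $r \geq 0$ is a constant. Thm. 1 shows that a random linear projection is approximately shape-preserving by bounding the range of $r$.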

Theorem 1.

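Suppose we have $N$ data points $z_1, \dots, z_N \in \mathbb{R}^{d}$. Let $\pi(z) = Uz$ be a random linear projection $\mathbb{R}^{d} \to \mathbb{R}^{m}$ such that $U \in \mathbb{R}^{m \times d}$ and $U_{i,j} \sim N(0, \sigma^2)$. Then we have:

$$P\Big(\sup_{i \neq j} r_{i,j} > \delta_1\Big) \leq \epsilon, \qquad P\Big(\inf_{i \neq j} r_{i,j} < \delta_2\Big) \leq \epsilon, \qquad r_{i,j} := \frac{\|\pi(z_i) - \pi(z_j)\|}{\|z_i - z_j\|},$$

where $\delta_1 := \sigma \sqrt{\chi^2_{\frac{2\epsilon}{N(N-1)}}(m)}$ and $\delta_2 := \sigma \sqrt{\chi^2_{1 - \frac{2\epsilon}{N(N-1)}}(m)}$. Here, $\chi^2_{\alpha}(m)$ denotes the $\alpha$-upper quantile of the $\chi^2$ distribution with $m$ degrees of freedom.

Thm. 1 tells us that for any data pair $(z_i, z_j)$ in a set of $N$ points, the distance rescaling ratio $r_{i,j}$ after a random linear projection is bounded above by $\delta_1$ and below by $\delta_2$, each with probability at least $1 - \epsilon$. A smaller $N$ and a larger output dimension $m$ give better bounds. Thm. 1 gives a theoretical bound for all the $N(N-1)/2$ pairs. However, in practice, preserving distances for a majority of pairs is sufficient. To empirically verify this, we test the range of the central portion of $r_{i,j}$ on real image data, which is significantly tighter than the strict bound. A proof of the theorem and simulation details are in the Appendix.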
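The gap between a worst-case bound over all pairs and the behaviour of most pairs can be explored with a short simulation. The sketch below is not the authors' simulation: it uses synthetic Gaussian points as a stand-in for image patches, and the values of `d`, `m`, `n_points`, and `sigma` as well as the helper name `distance_ratios` are illustrative assumptions.

```python
import numpy as np

def distance_ratios(points, U):
    """Return ||U z_i - U z_j|| / ||z_i - z_j|| for all pairs i < j."""
    projected = points @ U.T
    i, j = np.triu_indices(len(points), k=1)
    before = np.linalg.norm(points[i] - points[j], axis=1)
    after = np.linalg.norm(projected[i] - projected[j], axis=1)
    return after / before

rng = np.random.default_rng(0)
d, m, n_points, sigma = 27, 3, 200, 1.0       # illustrative values only
points = rng.normal(size=(n_points, d))       # synthetic stand-in for patches
U = rng.normal(0.0, sigma, size=(m, d))       # random linear projection
r = distance_ratios(points, U)
lo, hi = np.quantile(r, [0.1, 0.9])           # central 80% of the ratios
print(f"all pairs: [{r.min():.2f}, {r.max():.2f}], central 80%: [{lo:.2f}, {hi:.2f}]")
```

Comparing the full range with the central quantile range shows how much of the spread comes from a few extreme pairs.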

In summary, random linear projections approximately preserve relative distances for local image patches. Since a convolution layer is a linear projection per Eq. (1), its output inherits this property. However, the size of local patches controlled by the size of the convolution filters determines the smallest shape it can preserve. E.g., 1x1 random convolutions preserve shapes at the single-pixel level; using large filters perturbs shapes smaller than the filter size. See Fig. 1 for examples.

3.2 Multi-scale Image Augmentation with a Randomized Convolution Layer

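Input: Model $\Phi$, task loss $L_{task}$, training images $\{I_i\}_{i=1}^{N}$ and their labels $\{y_i\}_{i=1}^{N}$, pool of filter sizes $K = \{1, \dots, n\}$, fraction of original data $p$, whether to mix with original images, consistency loss weight $\lambda$
function RandConv($I$, $K$, mix, $p$)
     Sample $p_0 \sim U(0, 1)$
     if $p_0 < p$ and mix is False then
          return $I$ ▷ When not in mix mode, use the original image with probability $p$
     else
          Sample scale $k \sim K$
          Sample convolution weights $\Theta \in \mathbb{R}^{k \times k \times C_{in} \times C_{in}} \sim N(0, \frac{1}{k^2 C_{in}})$
          $I_{rc} = I * \Theta$ ▷ Apply convolution on $I$
          if mix is True then
               Sample $\alpha \sim U(0, 1)$
               return $\alpha I + (1 - \alpha) I_{rc}$ ▷ Mix with original images
          else
               return $I_{rc}$
Learning Objective:
for $i = 1 \to N$ do
     for $j = 1 \to 3$ do
          $\hat{y}_i^j = \Phi(\mathrm{RandConv}(I_i))$ ▷ Predict labels for three augmented variants of the same image
     $L_{cons} = \lambda \sum_{j=1}^{3} KL(\hat{y}_i^j \,\|\, \bar{y}_i)$ where $\bar{y}_i = \sum_{j=1}^{3} \hat{y}_i^j / 3$ ▷ Consistency Loss
     $L = L_{task}(\hat{y}_i^1, y_i) + L_{cons}$ ▷ Learning with the task loss and the consistency loss
Algorithm 1 Learning with Data Augmentation by Random Convolutions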

Sec. 3.1 showed that outputs of randomized convolution layers approximately maintain shape information at a scale larger than their filter sizes. Here, we develop our RandConv data augmentation technique using a randomized convolution layer with as many output channels as input channels ($C_{out} = C_{in}$) to generate shape-consistent images with randomized texture (see Alg. 1). Our key RandConv design choices are as follows:

$RC_{img}$: Augmenting Images with Random Texture A simple approach is to use the randomized convolution layer outputs, $I * \Theta$, as new images, where $\Theta$ are the randomly sampled weights and $I$ is a training image. If the original training data is in the domain $\mathcal{D}_0$, a sampled weight $\Theta_k$ generates images with consistent global shape but random texture, forming the random domain $\mathcal{D}_k$. Thus, by random weight sampling, we obtain an infinite number of random domains $\mathcal{D}_1, \mathcal{D}_2, \dots, \mathcal{D}_\infty$. Input image intensities are assumed to follow a standard normal distribution $N(0, 1)$ (via data whitening). As the outputs of RandConv should have the same value range, we sample the convolution weights from $N(0, \sigma^2)$ where $\sigma = 1/\sqrt{C_{in} \times h \times w}$, which is commonly applied for network initialization He et al. (2015).
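A short calculation shows why a weight variance equal to one over the patch dimension keeps the output in the same value range as a whitened input. It is a sketch under assumptions introduced here: the weights $U_{c,j}$ are i.i.d. zero-mean Gaussians with variance $\sigma^2$, independent of the patch, and each patch entry $f_j$ has unit second moment; $d$ denotes the number of entries in a patch (input channels times filter height times filter width).

```latex
\begin{aligned}
g_c &= \sum_{j=1}^{d} U_{c,j}\, f_j, \qquad \mathbb{E}[g_c] = 0, \\
\operatorname{Var}(g_c)
  &= \sum_{j=1}^{d} \mathbb{E}\big[U_{c,j}^2\big]\, \mathbb{E}\big[f_j^2\big]
   + \sum_{j \neq l} \mathbb{E}\big[U_{c,j}\big]\, \mathbb{E}\big[U_{c,l}\big]\, \mathbb{E}\big[f_j f_l\big]
   = d\, \sigma^2, \\
\sigma &= \frac{1}{\sqrt{d}} \;\Longrightarrow\; \operatorname{Var}(g_c) = 1 .
\end{aligned}
```

The cross terms vanish because the weights are independent with zero mean, so no independence between neighbouring pixels is required.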

$RC_{mix}$: Mixup Variant As shown in Fig. 1, outputs from $RC_{img}$ can vary significantly from the appearance of the original images. Although generalizing to domains with significantly different local texture distributions is desirable, we may not want to sacrifice much performance on domains similar to the training domains. A common compromise is to include the original images for training at a ratio $p$ (where $p$ is a hyperparameter). Inspired by the AugMix Hendrycks* et al. (2020) strategy, we propose to blend the original image with the outputs of the RandConv layer via linear convex combinations $\alpha I + (1 - \alpha)(I * \Theta)$, where $\alpha$ is the mixing weight uniformly sampled from $[0, 1]$. In $RC_{mix}$, the RandConv outputs provide shape-consistent perturbations of the original images. Varying $\alpha$, we continuously interpolate between the training domain and the randomly sampled domain of $RC_{img}$.


Multi-scale Texture Corruption As shown in Sec. 3.1, the distance between a spatial pair of RandConv outputs depends on the distance between the local input image patches; their size in turn is controlled by the convolution filter size. Specifically, image shape information at a scale smaller than a filter’s size will be corrupted. Therefore, we can use filters of varying sizes to preserve shapes at various scales. We choose to uniformly randomly sample a filter size $k$ from a pool $K = \{1, 3, \dots, n\}$ before sampling convolution weights $\Theta \in \mathbb{R}^{k \times k \times C_{in} \times C_{out}}$ from a Gaussian distribution $N(0, \frac{1}{k^2 C_{in}})$. Fig. 1 shows examples of multi-scale RandConv outputs.
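The design choices above (random texture images, the mixup variant, and multi-scale filter sampling) can be combined in one small function. The NumPy sketch below is a hedged illustration rather than the authors' implementation: the name `rand_conv`, the default pool `(1, 3, 5, 7)`, zero padding, and the channel-last image layout are assumptions made for this example.

```python
import numpy as np

def rand_conv(image, filter_sizes=(1, 3, 5, 7), mix=False, p=0.5, rng=None):
    """Return a RandConv-style augmented copy of a whitened H x W x C image.

    Hypothetical sketch: the default filter pool and zero padding are assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    if not mix and rng.uniform() < p:
        return image.copy()                      # keep the original image
    k = int(rng.choice(filter_sizes))            # sample an (odd) filter size
    h, w, c = image.shape
    sigma = 1.0 / np.sqrt(c * k * k)             # keeps output variance near 1
    theta = rng.normal(0.0, sigma, size=(k, k, c, c))
    pad = k // 2
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((h, w, c))
    for dx in range(k):                          # accumulate shifted copies
        for dy in range(k):
            out += padded[dx:dx + h, dy:dy + w, :] @ theta[dx, dy]
    if mix:
        alpha = rng.uniform()                    # mixing weight in [0, 1)
        return alpha * image + (1.0 - alpha) * out
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32, 3))
augmented = rand_conv(image, mix=True, rng=rng)
print(augmented.shape)  # (32, 32, 3)
```

New weights are drawn on every call, so each call corresponds to a freshly sampled random texture domain; a practical training pipeline would more likely apply a batched convolution on the GPU.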

Consistency-encouraged Learning To learn representations invariant to texture changes, we use a loss encouraging consistent network predictions for the same image under RandConv augmentation with different random filter samples. Approaches for transform-invariant domain randomization Yue et al. (2019) and data augmentation Hendrycks* et al. (2020) use similar strategies. We use Kullback-Leibler (KL) divergence to measure consistency. However, enforcing prediction similarity between two augmented variants may be too strong. Instead, we use RandConv to obtain 3 augmentation samples of image $I$: $G^j = \mathrm{RandConv}^j(I)$ for $j = 1, 2, 3$ and obtain their predictions with a model $\Phi$: $y^j = \Phi(G^j)$. We then compute the relaxed loss as $\lambda \sum_{j=1}^{3} KL(y^j \,\|\, \bar{y})$, where $\bar{y} = \sum_{j=1}^{3} y^j / 3$ is the sample average.
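The relaxed consistency term can be sketched in a few lines of NumPy. This is an illustration only: the function name `consistency_loss`, the small `eps` for numerical safety, and the KL direction (from each augmented prediction to the average prediction) are assumptions of this example.

```python
import numpy as np

def consistency_loss(probs, weight=1.0, eps=1e-12):
    """Weighted sum of KL(prediction || average prediction) over augmentations.

    probs: array of shape (num_augmentations, num_classes); rows sum to 1.
    """
    probs = np.asarray(probs, dtype=float)
    mean = probs.mean(axis=0)                        # average prediction
    kl = np.sum(probs * (np.log(probs + eps) - np.log(mean + eps)), axis=1)
    return weight * float(kl.sum())

agree = [[0.7, 0.2, 0.1]] * 3                        # identical predictions
disagree = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]]
print(consistency_loss(agree), consistency_loss(disagree))
```

The loss is essentially zero when the three predictions agree and grows as they diverge, while comparing against the average is a softer constraint than forcing any two variants to match exactly.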

4 Experiments

Secs. 4.1 to 4.3 evaluate our methods on the following datasets: multiple digit recognition datasets, PACS, and ImageNet-Sketch. Sec. 4.4 uses PACS to explore if transferring a representation pretrained on ImageNet with our method to a new task improves model performance. All experiments are in the single domain generalization setting where training and validation only have access to one domain.

4.1 Digit Recognition

The five digit recognition datasets MNIST LeCun et al. (1998), MNIST-M Ganin et al. (2016), SVHN Netzer et al. (2011), SYNTH Ganin and Lempitsky (2014) and USPS Denker et al. (1989) have been widely used for domain adaptation and generalization research Peng et al. (2019a, b); Qiao et al. (2020). We follow the setups in Volpi et al. (2018) and Qiao et al. (2020). We train a simple CNN with 10,000 MNIST samples and evaluate the accuracy on the test sets of the other four datasets. We also test on MNIST-C Mu and Gilmer (2019), a robustness benchmark with 15 common corruptions of MNIST, and report the average accuracy over all corruptions.

Selecting Hyperparameters and Ablation Study. Fig. 2(a) shows the effect of the hyperparameter $p$ on $RC_{img}$ with filter size 1. We see that adding only RandConv data ($p = 0$) immediately improves the average performance (DG-Avg) on MNIST-M, SVHN, SYNTH and USPS from 53.53 to 69.19, outperforming all other approaches (see Tab. 1) for every dataset. We choose $p = 0.5$, which obtains the best DG-Avg. Fig. 2(b) shows results for a multiscale ablation study. Increasing the pool of filter sizes improves DG-Avg performance. Therefore, we use the multi-scale design to study the consistency loss weight $\lambda$, shown in Fig. 2(c). Adding the consistency loss improves both RandConv variants on DG-Avg. We choose $\lambda = 10$ for all our subsequent experiments.

Figure 2: Average accuracy and 5-run variance of the MNIST model on MNIST-M, SVHN, SYNTH and USPS. Studies for: (a) original data fraction $p$ for $RC_{img}$; (b) multiscale design (1-n refers to using scales 1,3,..,n) for the two RandConv variants (orange and blue); (c) consistency loss weight $\lambda$ for the two RandConv variants (orange and blue).

Results. Tab. 1 compares the performance of $RC_{img}$ and $RC_{mix}$ with other state-of-the-art approaches. We show results of the adversarial training based methods GUD Volpi et al. (2018), M-ADA Qiao et al. (2020), and PAR Wang et al. (2019a). The baseline model is trained with only the classification loss. To show RandConv is more than a trivial color/contrast adjustment method, we also test the ColorJitter data augmentation (which randomly changes image brightness, contrast, and saturation; see the PyTorch documentation for implementation details, all parameters are set to 0.5) and GreyScale (where images are transformed to grey-scale for training and testing). RandConv and its mixup variant outperform the best competing method, M-ADA, by 17% on DG-Avg and achieve the best accuracy of 91.62% on MNIST-C. While the difference between the two variants is marginal, $RC_{mix}$ performs better on both DG-Avg and MNIST-C. Fig. 3 shows t-SNE image feature plots for unseen domains generated by the baseline approach and by our approach. The RandConv embeddings suggest better generalization to unseen domains.


Figure 3: t-SNE feature embedding visualization for digit datasets for models trained on MNIST without (top) and with our approach (bottom). Different colors denote different classes.

Method | MNIST | MNIST-M | SVHN | USPS | SYNTH | DG-Avg | MNIST-C
Baseline | 98.40(0.84) | 58.87(3.73) | 33.41(5.28) | 79.27(2.70) | 42.43(5.46) | 53.50(4.23) | 88.20(2.10)
GreyScale | 98.82(0.02) | 58.41(0.99) | 36.06(1.48) | 80.45(1.00) | 45.00(0.80) | 54.98(0.86) | 89.15(0.44)
ColorJitter | 98.72(0.05) | 62.72(0.66) | 39.61(0.88) | 79.18(0.60) | 46.40(0.34) | 56.98(0.39) | 89.48(0.18)
PAR (our imp) | 98.79(0.05) | 61.16(0.21) | 36.08(1.27) | 79.95(1.18) | 45.48(0.35) | 55.67(0.33) | 89.34(0.45)
GUD | - | 60.41 | 35.51 | 77.26 | 45.32 | 54.62 | -
M-ADA | - | 67.94 | 42.55 | 78.53 | 48.95 | 59.49 | -
$RC_{mix}$, $\lambda$=10 | 98.85(0.04) | 87.76(0.83) | 57.52(2.09) | 83.36(0.96) | 62.88(0.78) | 72.88(0.58) | 91.62(0.77)
$RC_{img}$, $p$=0.5, $\lambda$=5 | 98.86(0.05) | 87.67(0.37) | 54.95(1.90) | 82.08(1.46) | 63.37(1.58) | 72.02(1.15) | 90.94(0.51)
Table 1: Average accuracy and 5-run standard deviation (in parentheses) of the MNIST10K model on MNIST-M, SVHN, SYNTH, USPS and their average (DG-Avg); and average accuracy over 15 types of corruptions in MNIST-C. Both RandConv variants significantly outperform all other methods.

Base | Method | Photo | Art | Cartoon | Sketch | Average
Ours | Deep All | 86.77(0.42) | 60.11(1.33) | 64.12(0.32) | 55.28(4.71) | 66.57(1.36)
Ours | GreyScale | 83.93(1.47) | 61.60(1.18) | 62.12(0.61) | 60.07(2.47) | 66.93(0.83)
Ours | ColorJitter | 84.61(0.83) | 59.01(0.24) | 61.43(0.68) | 62.44(1.68) | 66.88(0.33)
Ours | PAR (our imp.) | 87.21(0.42) | 60.17(0.95) | 63.63(0.88) | 55.83(2.57) | 66.71(0.58)
Ours | $RC_{img}$, $p$=0.5 | 86.50(0.72) | 61.10(0.38) | 64.24(0.62) | 68.50(1.83) | 70.09(0.43)
Ours | $RC_{img}$, $p$=0.5, $\lambda$=10 | 81.15(0.76) | 59.56(0.79) | 62.42(0.59) | 71.74(0.43) | 68.72(0.58)
Ours | $RC_{mix}$ | 86.60(0.67) | 61.74(0.90) | 64.05(0.66) | 69.74(0.66) | 70.53(0.25)
Ours | $RC_{mix}$, $\lambda$=10 | 81.78(1.11) | 61.14(0.51) | 63.57(0.29) | 71.97(0.38) | 69.62(0.24)
Wang et al. (2019a) | Deep All (our run) | 88.40 | 66.26 | 66.58 | 59.40 | 70.16
Wang et al. (2019a) | PAR (our run) | 88.40 | 65.19 | 68.58 | 61.86 | 71.10
Wang et al. (2019a) | PAR (reported) | 89.6 | 66.3 | 68.3 | 64.1 | 72.08
Carlucci et al. (2019) | Deep All | 89.98 | 66.68 | 69.41 | 60.02 | 71.52
Carlucci et al. (2019) | Jigen | 89.00 | 67.63 | 71.71 | 65.18 | 73.38
Li et al. (2018a) | Deep All | 86.67 | 64.91 | 64.28 | 53.08 | 67.24
Li et al. (2018a) | MLDG* | 88.00 | 66.23 | 66.88 | 58.96 | 70.01
Li et al. (2018c) | Deep-All | 77.98 | 57.55 | 67.04 | 58.52 | 65.27
Li et al. (2018c) | CIDDG* | 78.65 | 62.70 | 69.73 | 64.45 | 68.88
Table 2: Mean and 5-run standard deviation (in parentheses) results for domain generalization on PACS. Best results are in bold. The domain name in each column represents the target domain. The Base column indicates different baselines. Approaches using domain labels for training are marked by *.

4.2 PACS Experiments

The PACS dataset Li et al. (2018b) considers 7-class classification on 4 domains: photo, art painting, cartoon, and sketch, with very different texture styles. Recent domain generalization work tests on this benchmark, but most work studies the multi-source domain setting and uses domain labels of the training data. Although we follow the convention to train on 3 domains and to test on the fourth, we simply pool the data from the 3 training domains as in Wang et al. (2019a), without using domain labels during the training.
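The evaluation protocol described here (pool three domains for training, hold out the fourth for testing, no domain labels) can be enumerated with a few lines of Python. The domain identifiers and the helper name `leave_one_domain_out` are illustrative assumptions, not names from a released code base.

```python
DOMAINS = ["photo", "art_painting", "cartoon", "sketch"]  # illustrative identifiers

def leave_one_domain_out(domains):
    """Yield (pooled source domains, held-out target domain) pairs."""
    for target in domains:
        sources = [d for d in domains if d != target]
        yield sources, target

for sources, target in leave_one_domain_out(DOMAINS):
    # The source domains are pooled into one training set without domain labels.
    print(f"train on {'+'.join(sources)} -> test on {target}")
```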

Baseline and State-of-the-Art. Following Li et al. (2017), we use Deep-All as the baseline, which finetunes an ImageNet-pretrained AlexNet on 3 domains using only the classification loss and tests on the fourth domain. We test our RandConv variants $RC_{img}$ and $RC_{mix}$ with and without consistency loss, and ColorJitter/GreyScale data augmentation as in the digit experiments. We also implemented PAR Wang et al. (2019a) using our baseline model. In addition, we compare to the following state-of-the-art approaches: Jigen Carlucci et al. (2019) using self-supervision, MLDG Li et al. (2018a) based on meta-learning, and the conditional invariant deep domain generalization method CIDDG Li et al. (2018c). Note that MLDG and CIDDG use domain labels for training. For comparison, we also report the Deep-All baseline performance of each of these approaches.

Results. Tab. 2 shows significant improvements on Sketch for both RandConv variants. Sketch is the most challenging domain with no color and much less texture compared to the other 3 domains. The success on Sketch demonstrates that our methods can guide the DNN to learn global representations focusing on shapes that are robust to texture changes. Without using the consistency loss, $RC_{mix}$ achieves the best overall result, improving over Deep-All by 4%. Adding the consistency loss with $\lambda = 10$, $RC_{mix}$ and $RC_{img}$ perform better on Sketch but degrade performance on the other 3 domains. This is also the case for GreyScale and ColorJitter.

4.3 Generalizing an ImageNet Model to ImageNet-Sketch

ImageNet-Sketch | Baseline Wang et al. (2019a) | PAR Wang et al. (2019a) | Baseline | $RC_{img}$, $p$=0.5, $\lambda$=10 | $RC_{mix}$, $\lambda$=10
Top1 | 12.04 | 13.06 | 10.28 | 18.09 | 16.91
Top5 | 25.60 | 26.27 | 21.60 | 35.40 | 33.99
Table 3: Accuracy of ImageNet-trained AlexNet on ImageNet-Sketch data. Our methods outperform PAR by 5% while PAR was built on top of a stronger baseline than our model.

ImageNet-Sketch Wang et al. (2019a) is an out-of-domain test set for models trained on ImageNet. We trained AlexNet models from scratch with $RC_{img}$ and $RC_{mix}$. We evaluate their performance on ImageNet-Sketch. We use the AlexNet model trained without RandConv as our baseline. Tab. 3 compares our results with PAR and its baseline model. Although PAR uses a stronger baseline, RandConv achieves significant improvements over our baseline and outperforms PAR by a large margin. Our methods achieve more than a 7% accuracy improvement over the baseline and surpass PAR by 5%.

4.4 Revisiting PACS with more Robust Pretrained Representations

Impact of ImageNet Pretraining A model trained on ImageNet may be biased towards textures Geirhos et al. (2019). Finetuning ImageNet pretrained models on PACS may inherit this texture bias, thereby benefitting generalization on the Photo domain (which is similar to ImageNet), but hurting performance on the Sketch domain. Therefore, as shown in Sec. 4.2, using RandConv to correct this texture bias improves results on Sketch, but degrades them on the Photo domain. Since pretraining has such a strong impact on transfer performance to new tasks, we ask: "Can the generalizability of a pretrained model transfer to downstream tasks? I.e., does a pretrained model with better generalizability improve performance on unseen domains on new tasks?" To answer this, we revisit the PACS tasks with ImageNet-pretrained weights using our two RandConv variants of Sec. 4.3 for initialization. We verify whether this changes the performance for the Deep-All baseline and for finetuning with RandConv.

PACS | ImageNet | Photo | Art | Cartoon | Sketch | Avg
Deep-All | Baseline | 86.77(0.42) | 60.11(1.33) | 64.12(0.32) | 55.28(4.71) | 66.57(1.36)
Deep-All | $RC_{img}$, $p$=0.5, $\lambda$=10 | 84.48(0.52) | 62.61(1.23) | 66.13(0.80) | 69.24(0.80) | 70.61(0.53)
Deep-All | $RC_{mix}$, $\lambda$=10 | 85.59(0.40) | 63.30(0.99) | 63.83(0.85) | 68.29(1.27) | 70.25(0.45)
$RC_{img}$, $p$=0.5, $\lambda$=10 | Baseline | 81.15(0.76) | 59.56(0.79) | 62.42(0.59) | 71.74(0.43) | 68.72(0.58)
$RC_{img}$, $p$=0.5, $\lambda$=10 | $RC_{img}$, $p$=0.5, $\lambda$=10 | 84.36(0.36) | 63.73(0.91) | 68.07(0.55) | 75.41(0.57) | 72.89(0.33)
$RC_{img}$, $p$=0.5, $\lambda$=10 | $RC_{mix}$, $\lambda$=10 | 84.63(0.97) | 63.41(1.22) | 66.36(0.43) | 74.59(0.84) | 72.25(0.54)
$RC_{mix}$, $\lambda$=10 | Baseline | 81.78(1.11) | 61.14(0.51) | 63.57(0.29) | 71.97(0.38) | 69.62(0.24)
$RC_{mix}$, $\lambda$=10 | $RC_{img}$, $p$=0.5, $\lambda$=10 | 85.16(1.03) | 63.17(0.38) | 67.68(0.60) | 76.11(0.43) | 73.03(0.46)
$RC_{mix}$, $\lambda$=10 | $RC_{mix}$, $\lambda$=10 | 86.17(0.56) | 65.33(1.05) | 65.52(1.13) | 73.21(1.03) | 72.56(0.50)
Table 4: Generalization results on PACS with RandConv pretrained ImageNet model. The PACS column indicates the methods used for finetuning on PACS; the ImageNet column shows how the pretrained model is trained on ImageNet (vanilla represents training the ImageNet model using only the classification loss). Best and second best accuracy for each target domain are highlighted in bold and underlined.

Better Performance via RandConv pretrained model We start by testing the Deep-All baselines using the two RandConv-trained ImageNet models of Sec. 4.3 as initialization. Tab. 4 shows significant improvements on Sketch. Results are comparable to finetuning with RandConv on a normal pretrained model. Art is also consistently improved. Performance drops slightly on Photo as expected, since we reduced the texture bias in the pretrained model, which is helpful for the Photo domain. Using RandConv for both ImageNet training and PACS finetuning, we achieve 76.11% accuracy on Sketch. As far as we know, this is the best performance using an AlexNet baseline. This approach even outperforms Jigen Carlucci et al. (2019) (71.35%) with a stronger ResNet18 baseline model. Cartoon and Art are also improved. The degradation on Photo is marginal. The best average domain generalization accuracy is 73.03%, with a more than 6% improvement over our initial Deep-All baseline.

This experiment confirmed that generalizability may transfer: removing texture bias may not only make a pretrained model more generalizable, but it may help generalization on downstream tasks. For similar target and pretraining domains, where texture bias may be helpful, performance may degrade.

5 Conclusion and Discussion

Randomized convolution (RandConv) is a simple but powerful data augmentation technique for randomizing local image texture. RandConv helps focus visual representations on global shape information rather than local texture. We theoretically justified the approximate shape-preserving property of RandConv and developed RandConv techniques using multi-scale and mixup designs. We also make use of a consistency loss to encourage texture invariance. RandConv outperforms state-of-the-art approaches on the digit recognition benchmark, on the sketch domain of PACS and on ImageNet-Sketch by a large margin. By finetuning a model pretrained with RandConv on PACS, we showed that the generalizability of a pretrained model may transfer to and benefit a new downstream task. This resulted in a new state-of-the-art performance on PACS, in particular on its Sketch domain.

However, local texture features can be useful for many computer vision tasks, especially for fine-grained visual recognition. In such cases, visual representations that are invariant to local texture may hurt the in-domain performance. Therefore, important future work includes learning representations that disentangle shape and texture features and building models that use such representations in an explainable way.

Broader Impact

Our work focuses on general visual representation learning. Anyone who uses a deep learning model for visual computing applications may benefit from our proposed RandConv approach to improve robustness and generalizability. As our approach is general and does not focus on a specific application, any potential biases or disadvantages will be application-specific. However, as our approach targets representations with greater levels of texture invariance, applications that would benefit from such invariance may show gains in robustness. Our approach does not leverage biases in the data beyond the deep neural networks that will be combined with it.

We thank Zhiding Yu and Nathan Cahill for the discussion and advice. Research reported in this publication was supported by the National Institutes of Health (NIH) under award number NIH 1R01AR072013. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.


  • Y. Balaji, S. Sankaranarayanan, and R. Chellappa (2018) Metareg: towards domain generalization using meta-regularization. In Advances in Neural Information Processing Systems, pp. 998–1008. Cited by: §2.
  • F. M. Carlucci, A. D’Innocente, S. Bucci, B. Caputo, and T. Tommasi (2019) Domain generalization by solving jigsaw puzzles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2229–2238. Cited by: §1, §2, §4.2, §4.4, Table 2.
  • J. S. Denker, W. Gardner, H. P. Graf, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel, H. S. Baird, and I. Guyon (1989) Neural network recognizer for hand-written zip code digits. In Advances in neural information processing systems, pp. 323–331. Cited by: §4.1.
  • G. Elsayed, S. Shankar, B. Cheung, N. Papernot, A. Kurakin, I. Goodfellow, and J. Sohl-Dickstein (2018) Adversarial examples that fool both computer vision and time-limited humans. In Advances in Neural Information Processing Systems, pp. 3910–3920. Cited by: §1.
  • Y. Ganin and V. Lempitsky (2014) Unsupervised domain adaptation by backpropagation. arXiv preprint arXiv:1409.7495. Cited by: §4.1.
  • Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17 (1), pp. 2096–2030. Cited by: §4.1.
  • R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel (2019) ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations. Cited by: §1, §1, §2, §2, §4.4.
  • M. Ghifary, D. Balduzzi, W. B. Kleijn, and M. Zhang (2016) Scatter component analysis: a unified framework for domain adaptation and domain generalization. IEEE transactions on pattern analysis and machine intelligence 39 (7), pp. 1414–1430. Cited by: §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034. Cited by: §3.2.
  • D. Hendrycks, N. Mu, E. D. Cubuk, B. Zoph, J. Gilmer, and B. Lakshminarayanan (2020) AugMix: a simple method to improve robustness and uncertainty under data shift. In International Conference on Learning Representations. Cited by: §3.2, §3.2, §3.
  • A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, and A. Madry (2019) Adversarial examples are not bugs, they are features. In Advances in Neural Information Processing Systems, pp. 125–136. Cited by: §1.
  • W. B. Johnson and J. Lindenstrauss (1984) Extensions of lipschitz mappings into a hilbert space. Contemporary mathematics 26 (189-206), pp. 1. Cited by: §2.
  • S. Kornblith, J. Shlens, and Q. V. Le (2019) Do better imagenet models transfer better?. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2661–2671. Cited by: §1.
  • Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §4.1.
  • K. Lee, K. Lee, J. Shin, and H. Lee (2020) Network randomization: a simple technique for generalization in deep reinforcement learning. In International Conference on Learning Representations. https://openreview. net/forum, Cited by: §2.
  • D. Li, Y. Yang, Y. Song, and T. M. Hospedales (2017) Deeper, broader and artier domain generalization. In Proceedings of the IEEE international conference on computer vision, pp. 5542–5550. Cited by: §4.2.
  • D. Li, Y. Yang, Y. Song, and T. M. Hospedales (2018a) Learning to generalize: meta-learning for domain generalization. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §2, §4.2, Table 2.
  • H. Li, S. Jialin Pan, S. Wang, and A. C. Kot (2018b) Domain generalization with adversarial feature learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5400–5409. Cited by: §2, §4.2.
  • Y. Li, X. Tian, M. Gong, Y. Liu, T. Liu, K. Zhang, and D. Tao (2018c) Deep domain generalization via conditional invariant adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 624–639. Cited by: §4.2, Table 2.
  • Y. Luo, L. Zheng, T. Guan, J. Yu, and Y. Yang (2019) Taking a closer look at domain shift: category-level adversaries for semantics consistent domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2507–2516. Cited by: §1.
  • N. Mu and J. Gilmer (2019) MNIST-c: a robustness benchmark for computer vision. arXiv preprint arXiv:1906.02337. Cited by: §4.1.
  • Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning. NIPS Workshop on Deep Learning and Unsupervised Feature Learning. Cited by: §4.1.
  • X. Peng, Q. Bai, X. Xia, Z. Huang, K. Saenko, and B. Wang (2019a) Moment matching for multi-source domain adaptation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1406–1415. Cited by: §4.1.
  • X. Peng, Z. Huang, X. Sun, and K. Saenko (2019b) Domain agnostic learning with disentangled representations. In ICML, Cited by: §4.1.
  • F. Qiao, L. Zhao, and X. Peng (2020) Learning to learn single domain generalization. arXiv preprint arXiv:2003.13216. Cited by: §1, §2, §4.1, §4.1.
  • W. Raghupathi and V. Raghupathi (2014) Big data analytics in healthcare: promise and potential. Health information science and systems 2 (1), pp. 3. Cited by: §2.
  • R. Shao, X. Lan, J. Li, and P. C. Yuen (2019) Multi-adversarial discriminative deep domain generalization for face presentation attack detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10023–10031. Cited by: §2.
  • W. B. Shen, D. Xu, Y. Zhu, L. J. Guibas, L. Fei-Fei, and S. Savarese (2019) Situational fusion of visual representation for visual navigation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2881–2890. Cited by: §2.
  • P. Y. Simard, D. Steinkraus, J. C. Platt, et al. (2003) Best practices for convolutional neural networks applied to visual document analysis. In ICDAR, Vol. 3. Cited by: §2.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §2.
  • B. Sun and K. Saenko (2014) From virtual to reality: fast adaptation of virtual object detectors to real domains. In Proceedings of the British Machine Vision Conference, Cited by: §1.
  • J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel (2017) Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS), pp. 23–30. Cited by: §1, §2.
  • N. X. Vinh, S. Erfani, S. Paisitkriangkrai, J. Bailey, C. Leckie, and K. Ramamohanarao (2016) Training robust models using random projection. In 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 531–536. Cited by: §2.
  • R. Volpi, H. Namkoong, O. Sener, J. C. Duchi, V. Murino, and S. Savarese (2018) Generalizing to unseen domains via adversarial data augmentation. In Advances in Neural Information Processing Systems, pp. 5334–5344. Cited by: §1, §1, §2, §4.1, §4.1.
  • H. Wang, S. Ge, Z. Lipton, and E. P. Xing (2019a) Learning robust global representations by penalizing local predictive power. In Advances in Neural Information Processing Systems, pp. 10506–10518. Cited by: §1, §2, §4.1, §4.2, §4.2, §4.3, Table 2, Table 3.
  • H. Wang, Z. He, and E. P. Xing (2019b) Learning robust representations by projecting superficial statistics out. In International Conference on Learning Representations. Cited by: §2, §2.
  • X. Yue, Y. Zhang, S. Zhao, A. Sangiovanni-Vincentelli, K. Keutzer, and B. Gong (2019) Domain randomization and pyramid consistency: simulation-to-real generalization without accessing target domain data. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2100–2110. Cited by: §1, §2, §3.2.
  • H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2018) Mixup: beyond empirical risk minimization. In International Conference on Learning Representations. Cited by: §3.

Appendix A Relative Distance Preservation Property of Random Linear Projections

Random linear projections can approximately preserve distances based on the following theorem.

Theorem 2.

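Suppose we have $N$ data points $z_1, \dots, z_N \in \mathbb{R}^d$. Let $f(z) = Uz$ be a random linear projection $f: \mathbb{R}^d \to \mathbb{R}^m$ such that $U \in \mathbb{R}^{m \times d}$ and $U_{i,j} \sim \mathcal{N}(0, \sigma^2)$. Then we have:

$$P\Big(\sup_{i \neq j;\, i,j \in [N]} r_{i,j} > \delta_1\Big) \le \epsilon, \qquad P\Big(\inf_{i \neq j;\, i,j \in [N]} r_{i,j} < \delta_2\Big) \le \epsilon, \qquad r_{i,j} := \frac{\|f(z_i) - f(z_j)\|}{\|z_i - z_j\|},$$

where $\delta_1 := \sigma \sqrt{\chi^2_{\frac{2\epsilon}{N(N-1)}}(m)}$ and $\delta_2 := \sigma \sqrt{\chi^2_{1 - \frac{2\epsilon}{N(N-1)}}(m)}$. Here, $\chi^2_{\alpha}(m)$ denotes the $\alpha$-upper quantile of the $\chi^2$ distribution with degree of freedom $m$.

Let $u_k$ represent the $k$-th row of $U$. It is easy to check that $\langle u_k, z_i - z_j \rangle / (\sigma \|z_i - z_j\|) \sim \mathcal{N}(0, 1)$. Therefore,

$$\frac{\|f(z_i) - f(z_j)\|^2}{\sigma^2 \|z_i - z_j\|^2} = \sum_{k=1}^{m} \Big( \frac{\langle u_k, z_i - z_j \rangle}{\sigma \|z_i - z_j\|} \Big)^2 \sim \chi^2(m).$$

Therefore, for $0 < \epsilon < 1$, we have

$$P\Big( \frac{\|f(z_i) - f(z_j)\|^2}{\sigma^2 \|z_i - z_j\|^2} > \chi^2_{\frac{2\epsilon}{N(N-1)}}(m) \Big) \le \frac{2\epsilon}{N(N-1)}.$$

From the above inequality, we have

$$P\Big( \sup_{i \neq j;\, i,j \in [N]} \frac{\|f(z_i) - f(z_j)\|^2}{\sigma^2 \|z_i - z_j\|^2} > \chi^2_{\frac{2\epsilon}{N(N-1)}}(m) \Big) \le \sum_{i < j} P\Big( \frac{\|f(z_i) - f(z_j)\|^2}{\sigma^2 \|z_i - z_j\|^2} > \chi^2_{\frac{2\epsilon}{N(N-1)}}(m) \Big) \le \frac{N(N-1)}{2} \cdot \frac{2\epsilon}{N(N-1)} = \epsilon,$$

which is equivalent to

$$P\Big( \sup_{i \neq j;\, i,j \in [N]} r_{i,j} > \delta_1 \Big) \le \epsilon.$$

Similarly, we have

$$P\Big( \inf_{i \neq j;\, i,j \in [N]} r_{i,j} < \delta_2 \Big) \le \epsilon.$$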
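As an informal sanity check of the distributional claim used in this argument, the following sketch (not part of the original experiments) repeatedly samples a Gaussian projection and confirms that the squared ratio of projected to original distance, divided by the projection variance, averages close to the output dimension, as expected for a chi-square variable with that many degrees of freedom. The function names, dimensions, trial count, and seed are illustrative assumptions.

```python
import random


def squared_ratio(z_i, z_j, m, sigma, rng):
    """Sample U (m x d, entries ~ N(0, sigma^2)) and return
    ||U z_i - U z_j||^2 / ||z_i - z_j||^2."""
    diff = [a - b for a, b in zip(z_i, z_j)]
    norm_sq = sum(x * x for x in diff)
    proj_sq = 0.0
    for _ in range(m):
        # One row u_k of U; <u_k, diff> ~ N(0, sigma^2 * ||diff||^2).
        row = [rng.gauss(0.0, sigma) for _ in diff]
        dot = sum(u * x for u, x in zip(row, diff))
        proj_sq += dot * dot
    return proj_sq / norm_sq


def mean_normalized_ratio(d=27, m=3, sigma=1.0, trials=5000, seed=0):
    """Monte Carlo mean of squared_ratio / sigma^2 for one fixed pair of points."""
    rng = random.Random(seed)
    z_i = [rng.uniform(0.0, 1.0) for _ in range(d)]
    z_j = [rng.uniform(0.0, 1.0) for _ in range(d)]
    total = 0.0
    for _ in range(trials):
        total += squared_ratio(z_i, z_j, m, sigma, rng) / sigma ** 2
    return total / trials


if __name__ == "__main__":
    # A chi-square variable with m degrees of freedom has mean m.
    print(mean_normalized_ratio(m=3))
```

The check does not depend on the input dimension, which only enters through the length of the difference vector.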

Simulation on Real Image Data To better understand the relative distance preservation property of random linear projections in practice, we use Algorithm 2 to empirically obtain a bound for real image data. We choose the same parameter values as in computing our theoretical bounds. We use real images from the PACS dataset for this simulation. Note that the image patch size, and hence the input dimension, does not affect the bound. We use a patch size of resulting in . This simulation tells us that, when applying linear projections with a randomly sampled projection matrix to local image patches in every image, the central 80% of the ratios between projected and original patch distances fall, at the chosen confidence level, within the range returned by Algorithm 2.

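Input: $M$ images $\{I_t\}_{t=1}^{M}$, number of data points $N$, projection output dimension $m$, standard deviation $\sigma$ of the normal distribution, confidence level $\epsilon$.
for $t = 1, \dots, M$ do
     Sample image patches in $I_t$ at 1,000 locations and vectorize them as $\{z_l\}_{l=1}^{N}$
     Sample a projection matrix $U \in \mathbb{R}^{m \times d}$ with $U_{i,j} \sim \mathcal{N}(0, \sigma^2)$
     for $1 \le i \le N$ do
          for $i < j \le N$ do
               Compute $r^{t}_{i,j} = \|f(z_i) - f(z_j)\| / \|z_i - z_j\|$, where $f(z) = Uz$
     $q^{t}_{10}$ = 10% quantile of $r^{t}_{i,j}$ for $I_t$
     $q^{t}_{90}$ = 90% quantile of $r^{t}_{i,j}$ for $I_t$     ▷ Get the central 80% of $r_{i,j}$ in each image
$\delta_2$ = $\epsilon$ quantile of all $q^{t}_{10}$
$\delta_1$ = $(1 - \epsilon)$ quantile of all $q^{t}_{90}$     ▷ Get the confident bound for $q_{10}$ and $q_{90}$
return $\delta_1$, $\delta_2$
Algorithm 2 Simulate the range of the central 80% of $r_{i,j}$ on real image data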
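The simulation procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for the reported numbers: it takes grayscale images given as nested lists (the paper uses real PACS images), treats the central 80% as the span between the 10% and 90% quantiles, and takes the final bounds as the lower and upper quantiles across images at the chosen confidence level. The names `quantile`, `sample_patches`, and `simulate_bounds`, and all default values, are hypothetical.

```python
import math
import random


def quantile(values, q):
    """Empirical q-quantile (0 <= q <= 1) with linear interpolation."""
    data = sorted(values)
    pos = q * (len(data) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(data) - 1)
    return data[lo] + (data[hi] - data[lo]) * (pos - lo)


def sample_patches(image, n, k, rng):
    """Vectorize n random k x k patches from a 2-D grayscale image."""
    h, w = len(image), len(image[0])
    patches = []
    for _ in range(n):
        top, left = rng.randrange(h - k + 1), rng.randrange(w - k + 1)
        patches.append([image[top + r][left + c]
                        for r in range(k) for c in range(k)])
    return patches


def simulate_bounds(images, n=30, m=3, sigma=1.0, eps=0.1, k=3, seed=0):
    """Return (upper, lower) bounds on the central 80% of distance ratios."""
    rng = random.Random(seed)
    q10s, q90s = [], []
    for image in images:
        z = sample_patches(image, n, k, rng)
        d = len(z[0])
        # One random projection per image, entries ~ N(0, sigma^2).
        U = [[rng.gauss(0.0, sigma) for _ in range(d)] for _ in range(m)]
        fz = [[sum(u * x for u, x in zip(row, p)) for row in U] for p in z]
        ratios = []
        for i in range(n):
            for j in range(i + 1, n):
                den = math.dist(z[i], z[j])
                if den > 0.0:  # skip identical patches
                    ratios.append(math.dist(fz[i], fz[j]) / den)
        q10s.append(quantile(ratios, 0.10))
        q90s.append(quantile(ratios, 0.90))
    return quantile(q90s, 1.0 - eps), quantile(q10s, eps)


if __name__ == "__main__":
    noise_rng = random.Random(42)
    images = [[[noise_rng.random() for _ in range(16)] for _ in range(16)]
              for _ in range(20)]
    print(simulate_bounds(images))
```

Synthetic noise images stand in for real data here, so the printed range is only indicative of how the procedure behaves.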

Appendix B Experimental Details

Digits Recognition The network for our digits recognition experiments is composed of two Conv5×5-ReLU-MaxPool2×2 blocks with 64/128 output channels and three fully connected layers with 1024/1024/10 output channels. We train the network with batch size 32 for 10,000 iterations. During training, the model is validated every 250 iterations, and the checkpoint with the best validation score is saved for testing. We apply the Adam optimizer with an initial learning rate of 0.0001.

PACS We use the official data splits for training/validation/testing; no extra data augmentation is applied. We use the official PyTorch implementation and the pretrained weights of AlexNet for our PACS experiments. AlexNet is finetuned for 50,000 iterations with a batch size of 128. Samples are randomly selected from the training data mixed between the three domains. We validate every 100 iterations, using only the validation data of the source domains. We use the SGD optimizer for training with an initial learning rate of 0.001, Nesterov momentum, and weight decay set to 0.0005. We let the learning rate decay by a factor of 0.1 after finishing 80% of the iterations.

ImageNet Following the PyTorch example on training ImageNet models, we set the batch size to 256 and train AlexNet from scratch for 90 epochs. We apply the SGD optimizer with an initial learning rate of 0.01, momentum 0.9, and weight decay 0.0001. We reduce the learning rate by a factor of 0.1 every 30 epochs.
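The two learning-rate schedules described for PACS and ImageNet can be written out explicitly. The sketch below is only an illustration of the arithmetic: the function names are hypothetical, and the exact boundary convention (whether the decay applies at or just after the threshold) is an assumption, since the text does not specify it.

```python
def pacs_lr(iteration, total_iters=50_000, base_lr=0.001, decay=0.1):
    """Single step decay once 80% of the iterations are done (0-indexed)."""
    decay_start = (total_iters * 4) // 5  # integer form of 80%
    return base_lr * decay if iteration >= decay_start else base_lr


def imagenet_lr(epoch, base_lr=0.01, decay=0.1, step=30):
    """Multiply the rate by `decay` every `step` epochs (0-indexed)."""
    return base_lr * decay ** (epoch // step)


if __name__ == "__main__":
    for it in (0, 39_999, 40_000, 49_999):
        print("PACS iteration", it, "lr", pacs_lr(it))
    for ep in (0, 29, 30, 60, 89):
        print("ImageNet epoch", ep, "lr", imagenet_lr(ep))
```

In PyTorch, comparable behavior is typically obtained with `torch.optim.lr_scheduler.MultiStepLR` and `torch.optim.lr_scheduler.StepLR`; whether the original experiments used these classes is not stated.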

Appendix C Hyperparameter Selections and Ablation Studies on Digits Recognition Benchmarks

We provide detailed experimental results for the digits recognition datasets. Table 5 shows results for different hyperparameter settings. Table 6 shows results for an ablation study on the multi-scale design for two RandConv variants. Table 7 shows results for studying the consistency loss weight for two RandConv variants. Tables 5, 6, and 7 correspond to Fig. 2 (a), (b), and (c) in the main text, respectively.

Baseline 98.40(0.84) 58.87(3.73) 33.41(5.28) 79.27(2.70) 42.43(5.46) 53.50(4.23) 88.20(2.10)
, =0.9 98.68(0.06) 83.53(0.37) 53.67(1.54) 80.38(1.41) 59.19(0.85) 69.19(0.34) 89.79(0.44)
, =0.7 98.64(0.07) 84.17(0.61) 54.50(1.55) 80.85(0.91) 60.25(0.85) 69.94(0.50) 89.20(0.60)
, =0.5 98.72(0.08) 85.17(1.12) 55.97(0.54) 80.31(0.85) 61.07(0.47) 70.63(0.42) 88.66(0.62)
, =0.3 98.71(0.12) 85.45(0.87) 54.62(1.52) 79.78(1.40) 60.51(0.41) 70.09(0.60) 89.02(0.32)
, =0.1 98.66(0.06) 85.57(0.79) 54.34(1.52) 79.21(0.44) 60.18(0.63) 69.83(0.38) 88.53(0.38)
, =0 98.55(0.13) 86.27(0.42) 52.48(3.00) 79.01(1.11) 59.53(1.14) 69.32(1.19) 88.01(0.36)
Table 5: Ablation study of hyperparameter for on digits recognition benchmarks. DG-Avg is the average performance on MNIST-M, SVHN, SYNTH and USPS. Best results are bold.
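The DG-Avg definition in the caption can be checked directly against the rows of Table 5: in each row, the sixth value should equal the mean of the second through fifth values, up to rounding. The short check below uses the numbers exactly as they appear in the table; the column positions are inferred from the arithmetic because the header row is not shown here, and the dictionary keys simply echo the visible row labels.

```python
# Each entry: (values in columns 2-5 of a Table 5 row, value in column 6).
TABLE5_ROWS = {
    "Baseline": ([58.87, 33.41, 79.27, 42.43], 53.50),
    "=0.9": ([83.53, 53.67, 80.38, 59.19], 69.19),
    "=0.7": ([84.17, 54.50, 80.85, 60.25], 69.94),
    "=0.5": ([85.17, 55.97, 80.31, 61.07], 70.63),
    "=0.3": ([85.45, 54.62, 79.78, 60.51], 70.09),
    "=0.1": ([85.57, 54.34, 79.21, 60.18], 69.83),
    "=0": ([86.27, 52.48, 79.01, 59.53], 69.32),
}


def dg_avg_gaps(rows):
    """Return |mean(columns 2-5) - reported DG-Avg| for every row."""
    return {name: abs(sum(vals) / len(vals) - reported)
            for name, (vals, reported) in rows.items()}


if __name__ == "__main__":
    for name, gap in dg_avg_gaps(TABLE5_ROWS).items():
        print(f"{name:>8}: gap = {gap:.4f}")
```

Every gap stays below 0.01, which is consistent with averaging values that were themselves rounded to two decimals.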
98.62(0.06) 83.98(0.98) 53.26(2.59) 80.57(1.09) 59.25(1.38) 69.26(1.35) 88.59(0.38)
98.76(0.02) 84.66(1.67) 55.89(0.83) 80.95(1.15) 60.07(1.05) 70.39(0.58) 89.80(0.94)
98.76(0.06) 84.32(0.43) 56.50(2.68) 81.85(1.05) 60.76(1.02) 70.86(0.86) 90.06(0.80)
98.82(0.06) 84.91(0.68) 55.61(2.63) 82.09(1.00) 62.15(1.30) 71.19(1.21) 90.30(0.44)
98.81(0.12) 85.13(0.72) 54.18(3.36) 82.07(1.28) 61.85(1.41) 70.81(1.24) 90.83(0.52)
, =0.5 98.66(0.05) 85.12(0.96) 55.59(0.29) 80.65(0.71) 60.85(0.48) 70.55(0.15) 89.00(0.45)
, =0.5 98.79(0.07) 85.36(1.04) 55.60(1.09) 80.99(0.99) 61.26(0.80) 70.80(0.86) 89.84(0.70)
, =0.5 98.83(0.07) 86.33(0.47) 54.99(2.48) 80.82(1.83) 62.61(0.75) 71.19(1.25) 90.70(0.43)
, =0.5 98.83(0.07) 86.08(0.27) 54.93(1.27) 81.58(0.74) 62.78(0.86) 71.34(0.61) 91.18(0.38)
, =0.5 98.80(0.12) 85.63(0.70) 52.82(2.01) 81.48(1.22) 62.55(0.74) 70.62(0.73) 90.79(0.48)
Table 6: Ablation study of multi-scale RandConv on digits recognition benchmarks for and . Best entries for each variant are bold.
20 98.90(0.05) 87.18(0.81) 57.68(1.64) 83.55(0.83) 63.08(0.50) 72.87(0.47) 91.14(0.53)
10 98.85(0.04) 87.76(0.83) 57.52(2.09) 83.36(0.96) 62.88(0.78) 72.88(0.58) 91.62(0.77)
5 98.94(0.09) 87.53(0.51) 55.70(2.22) 83.12(1.08) 62.37(0.98) 72.18(1.04) 91.46(0.50)
1 98.95(0.05) 86.77(0.79) 56.00(2.39) 83.13(0.71) 63.18(0.97) 72.27(0.82) 91.15(0.42)
0.1 98.84(0.07) 85.41(1.02) 56.51(1.58) 81.84(1.14) 61.86(1.44) 71.41(0.98) 90.72(0.60)
0 98.82(0.06) 84.91(0.68) 55.61(2.63) 82.09(1.00) 62.15(1.30) 71.19(1.21) 90.30(0.44)
20 98.79(0.04) 87.53(0.79) 53.92(1.59) 81.83(0.70) 62.16(0.37) 71.36(0.49) 91.20(0.53)
10 98.86(0.05) 87.67(0.37) 54.95(1.90) 82.08(1.46) 63.37(1.58) 72.02(1.15) 90.94(0.51)
5 98.90(0.04) 87.77(0.72) 55.00(1.40) 82.10(0.55) 63.58(1.33) 72.11(0.62) 90.83(0.71)
1 98.86(0.04) 86.74(0.32) 53.26(2.99) 81.51(0.48) 62.00(1.15) 70.88(0.93) 91.11(0.62)
0.1 98.85(0.14) 86.85(0.31) 53.55(3.63) 81.23(1.02) 62.77(0.80) 71.10(1.31) 91.13(0.69)
0 98.83(0.07) 86.08(0.27) 54.93(1.27) 81.58(0.74) 62.78(0.86) 71.34(0.61) 91.18(0.38)
Table 7: Ablation study of consistency loss weight on digits recognition benchmarks for and . DG-Avg is the average performance on MNIST-M, SVHN, SYNTH and USPS. Best results for each variant are bold.

Appendix D More Examples of RandConv Data Augmentation

We provide additional examples of RandConv outputs for different convolution filter sizes in Fig. 5 and for its mixup variants at scale with different mixing coefficients in Fig. 4. We observe that RandConv with different filter sizes retains object shapes at different scales. The mixup strategy can continuously interpolate between the training domain and a randomly sampled domain.
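To make the interpolation behavior concrete, the following minimal sketch applies one random filter to a small grayscale image and blends the result with the input. It is an illustration under several assumptions rather than the implementation used for the figures: it uses a single channel, zero padding, an odd filter size, a fan-in-scaled Gaussian for the filter weights, and the convention that a mixing coefficient of 1 returns the original image. The names `random_conv` and `randconv_mix` are hypothetical.

```python
import math
import random


def random_conv(image, k, rng):
    """Filter a 2-D grayscale image with one random k x k kernel.

    Uses zero padding so the output keeps the input size; k is assumed odd.
    """
    h, w = len(image), len(image[0])
    std = 1.0 / math.sqrt(k * k)  # assumed fan-in scaling of the weights
    kernel = [[rng.gauss(0.0, std) for _ in range(k)] for _ in range(k)]
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(k):
                for dx in range(k):
                    yy, xx = y + dy - pad, x + dx - pad
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[dy][dx] * image[yy][xx]
            out[y][x] = acc
    return out


def randconv_mix(image, k, alpha, rng):
    """Blend alpha * image + (1 - alpha) * random_conv(image)."""
    conv = random_conv(image, k, rng)
    return [[alpha * a + (1.0 - alpha) * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(image, conv)]


if __name__ == "__main__":
    img = [[float((r * 5 + c) % 7) for c in range(5)] for r in range(5)]
    for alpha in (1.0, 0.5, 0.0):
        mixed = randconv_mix(img, 3, alpha, random.Random(0))
        print(alpha, [round(v, 2) for v in mixed[2]])
```

Intermediate coefficients give images between the two extremes, which is the continuous interpolation between the training domain and a randomly sampled domain described above.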


Figure 4: Examples of the RandConv mixup variant on images of size with different mixing coefficients . When , the output is just the original image input; when , we use the output of the random convolution layer as the augmented image.


Figure 5: RandConv data augmentation examples on images of size . First column is the input image; following columns are convolution results using random filters of different sizes . We can see that the smaller filter sizes help maintain the finer object shapes.