Reducing Domain Gap via Style-Agnostic Networks

10/25/2019 · Hyeonseob Nam et al.

Deep learning models often fail to maintain their performance on new test domains. This problem has been regarded as a critical limitation of deep learning for real-world applications. One of the main causes of this vulnerability to domain changes is that the model tends to be biased toward image styles (i.e., textures). To tackle this problem, we propose Style-Agnostic Networks (SagNets) to encourage the model to focus more on image contents (i.e., shapes) shared across domains while ignoring image styles. SagNets consist of three novel techniques: style adversarial learning, style blending, and style consistency learning, each of which prevents the model from making decisions based on style information. In combination with a few additional training techniques and an ensemble of several model variants, the proposed method won first place in the semi-supervised domain adaptation task of the Visual Domain Adaptation 2019 (VisDA-2019) Challenge.




1 Introduction

Despite the success of deep neural networks learned with large-scale labeled data, their performance often drops significantly when they confront data from a new test domain, a problem known as domain shift [21]. For successful deployment of models in ever-changing real-world scenarios, it has become crucial to make models robust against domain shift.

A recent line of studies has explored the relationship between a model's robustness and the style (texture) of an image [8, 19, 14, 20, 1]. Geirhos et al. [8] showed that standard CNNs for image classification are biased towards styles, and reported that when a model is trained to concentrate on image contents (shapes), it becomes more robust under various image distortions. Furthermore, [19, 14] demonstrated that adjusting the proportion of style information in convolutional features helps overcome domain differences.

From these previous studies, we hypothesize that style information changes more easily across domains than content information. Inspired by this, we propose Style-Agnostic Networks (SagNets) to prevent the model from making decisions based on styles and allow it to focus more on contents. SagNets comprise three novel techniques (style adversarial learning, style blending, and style consistency learning) which complement each other to effectively reduce the style bias of CNNs. Our approach is applicable to a wide range of problems that suffer from heterogeneous domains, such as domain generalization (DG) [15], unsupervised domain adaptation (UDA) [21], and semi-supervised domain adaptation (SSDA) [24].

We start by describing the baseline architecture in Section 2 and introduce our SagNets in Section 3. We also provide details of our additional training techniques and model variants for ensembling in Sections 4 and 5, respectively. We then present the results of semi-supervised domain adaptation in Section 6 and conclude in Section 7.

Figure 1: Our style adversarial learning framework.

2 Baseline Architecture

We use ResNet-152 [9] pretrained on ImageNet [23] as our backbone CNN architecture. To tackle the distribution gap between different domains and utilize unlabeled target data, we integrate two additional components into our base domain adaptation framework: Minimax Entropy (MME) [24] and Domain-Specific Batch Normalization (DSBN) [3].

2.1 Minimax Entropy (MME)

We adopt MME [24]

as our baseline adaptation method, which alternatingly moves class prototypes toward the target data distribution by maximizing the entropy of predictions and updates features to be better clustered around the estimated prototypes by minimizing it. It also includes a similarity-based framework inspired by


, where the classification is made upon the similarity between a normalized feature vector and class prototypes. This framework is effective for harnessing few-shot labeled examples provided in the semi-supervised domain adaptation setting.
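As a rough illustration only (not the authors' implementation), MME's similarity-based classifier can be sketched in numpy as cosine similarity between normalized features and class prototypes, scaled by a temperature; the temperature value T=0.05 and the function names are our own assumptions.

```python
import numpy as np

def prototype_predictions(features, prototypes, T=0.05):
    """Similarity-based classifier: L2-normalize features and prototypes,
    score each example by cosine similarity to every class prototype,
    and convert the temperature-scaled scores to a softmax distribution."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = f @ w.T / T
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def entropy(probs, eps=1e-12):
    """Mean prediction entropy: MME maximizes this w.r.t. the prototypes
    and minimizes it w.r.t. the feature extractor on unlabeled target data."""
    return float(-(probs * np.log(probs + eps)).sum(axis=1).mean())
```

The adversarial minimax on the entropy is realized in practice with a gradient reversal between the feature extractor and the classifier.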

2.2 Domain-Specific Batch Normalization (DSBN)

Batch Normalization (BN) [12] is one of the standard tools used in deep neural networks, which normalizes feature responses to stabilize and accelerate training. When a model is learned with multiple domains, it has been reported that using separate BN modules for individual domains helps to align their feature distributions [16, 2, 3]. We call this method Domain-Specific Batch Normalization (DSBN), following [3] (unlike [3], we share the affine transformation parameters of BN across domains, as this performs better in our experiments). As DSBN centers the feature statistics of each domain around zero, we expect it to be effective for reducing the style gap between domains. Hence we adopt DSBN as our base normalization module.
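A minimal numpy sketch of DSBN under the convention above: per-domain running statistics, but shared affine parameters. The class name, momentum value, and training-mode-only behavior are our own simplifications for illustration.

```python
import numpy as np

class DomainSpecificBN:
    """Batch normalization with separate statistics per domain but
    shared affine parameters (gamma, beta) across domains."""
    def __init__(self, num_features, num_domains, eps=1e-5, momentum=0.1):
        self.eps, self.momentum = eps, momentum
        # affine transform shared across domains
        self.gamma = np.ones(num_features)
        self.beta = np.zeros(num_features)
        # running statistics kept separately for each domain
        self.running_mean = np.zeros((num_domains, num_features))
        self.running_var = np.ones((num_domains, num_features))

    def __call__(self, x, domain):
        # normalize with the current mini-batch statistics (training mode)
        mu, var = x.mean(axis=0), x.var(axis=0)
        m = self.momentum
        self.running_mean[domain] = (1 - m) * self.running_mean[domain] + m * mu
        self.running_var[domain] = (1 - m) * self.running_var[domain] + m * var
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta
```

At inference time one would normalize with the running statistics of the example's domain instead of the mini-batch statistics.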

3 Style-Agnostic Networks

Motivated by [8], we aim to make the model's decisions depend less on the style of an image in order to improve its robustness across different domains. It is well known in the style transfer literature [7, 11] that the feature statistics (mean and variance) of a CNN effectively capture the style of an image. Based on this relationship, we propose three novel techniques that reduce the style bias of a CNN by utilizing its feature statistics: style adversarial learning, style blending, and style consistency learning.

3.1 Style Adversarial Learning

We employ an adversarial learning framework to prevent the model from learning style-dependent feature representations. Specifically, we constrain the style features (the mean and variance of convolutional features) to be incapable of discriminating object class labels by introducing a novel adversarial loss. This can also be viewed as defending against adversarial attacks that fool the network by manipulating styles, which makes the network more robust under arbitrary style changes.

The overview of our style adversarial learning is illustrated in Figure 1. We apply the adversarial loss only to low- to middle-level layers of the network, since the feature statistics of higher layers may encode complex patterns that cannot be free from the object class categories. To this end, given a CNN, we extract an intermediate feature map at a certain layer $l$ (for ResNet-152, which comprises four stages, we select the last layer of the second stage as layer $l$), and take its channel-wise mean and variance as style features $z$. Then we construct a style-based network $D$ which takes the style features $z$ as input and learns to predict the class probability $D(z)$. The feature extractor $F$, composed of the layers up to layer $l$, is trained to fool the style-based network by inserting a gradient reversal layer (GRL) [6] between $F$ and $D$. Consequently, $F$ is encouraged to encode contents rather than styles, so that the rest of the network $C$, which we call the content-based network, can make the final prediction focusing on contents, which are more robust under domain shifts. We can also control the trade-off between content and style biases by adjusting the coefficient of the adversarial loss, which we set to 0.1.

This approach can be applied to both labeled and unlabeled examples in UDA or SSDA settings. For labeled examples on either the source or target domain, $D$ and $F$ are trained by minimizing and maximizing the cross-entropy loss, respectively. In the case of unlabeled examples on the target domain, $D$ is not trained, but $F$ can still be trained by maximizing the entropy of the prediction from $D$, which decreases the confidence of style-based decisions.
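The two building blocks of this scheme can be sketched in numpy: a gradient reversal layer written as an explicit backward rule, and the style features extracted from a feature map. The function names are our own; the coefficient 0.1 matches the adversarial coefficient stated above.

```python
import numpy as np

def grl_forward(x):
    # gradient reversal layer: identity in the forward pass
    return x

def grl_backward(grad_output, coeff=0.1):
    # in the backward pass, flip the sign (and scale) of gradients flowing
    # from the style-based network back into the feature extractor, so the
    # extractor is updated to *fool* the style-based network
    return -coeff * grad_output

def style_features(feature_map):
    # style features of a (C, H, W) feature map: channel-wise mean and
    # variance, concatenated into one vector fed to the style-based network
    return np.concatenate([feature_map.mean(axis=(1, 2)),
                           feature_map.var(axis=(1, 2))])
```

In an autograd framework the same effect is obtained by registering the reversed gradient as a custom backward function between the two sub-networks.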

3.2 Style Blending

To further make the model agnostic to styles, we introduce a novel style blending that randomizes the style information during training. Style blending is performed in feature space by interpolating the feature statistics between different examples regardless of their class labels. Given a random pair of examples $(x_i, x_j)$ within a mini-batch, it changes the feature map $f_i$ of sample $x_i$ to

$\hat{f}_i = \hat{\sigma} \cdot \frac{f_i - \mu(f_i)}{\sigma(f_i)} + \hat{\mu},$

where $\hat{\mu} = \alpha \mu(f_i) + (1 - \alpha) \mu(f_j)$, $\hat{\sigma} = \alpha \sigma(f_i) + (1 - \alpha) \sigma(f_j)$, $\mu(\cdot)$ and $\sigma(\cdot)$ denote channel-wise mean and standard deviation, and $\alpha \sim \mathrm{Uniform}(0, 1)$. By randomly blending the style features during training, the network can no longer rely on styles when making decisions.

The style blending module is inserted into lower parts of the network (for ResNet-152, we place a style blending module right after the first convolution layer and at the end of the first stage of the network) to make the network more agnostic to low-level styles, which are heavily susceptible to domain shifts. In our initial experiments, we observed a performance drop when style blending was applied up to middle-level features, as it may degrade the discriminative power of the network. Thus, the middle-level features are regularized only by style adversarial learning, without style blending.
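The blending described in this subsection can be sketched in numpy as AdaIN-style interpolation of channel-wise statistics between two feature maps; the uniform sampling of the mixing weight and the epsilon for numerical stability are our assumptions.

```python
import numpy as np

def style_blend(f_i, f_j, rng=None, eps=1e-5):
    """Blend the channel-wise feature statistics of two (C, H, W) feature
    maps: normalize f_i with its own statistics, then re-style it with a
    random interpolation of the statistics of f_i and f_j."""
    rng = rng or np.random.default_rng()
    alpha = rng.uniform()  # mixing weight, assumed ~ U(0, 1)
    mu_i = f_i.mean(axis=(1, 2), keepdims=True)
    sig_i = f_i.std(axis=(1, 2), keepdims=True) + eps
    mu_j = f_j.mean(axis=(1, 2), keepdims=True)
    sig_j = f_j.std(axis=(1, 2), keepdims=True) + eps
    mu_mix = alpha * mu_i + (1 - alpha) * mu_j
    sig_mix = alpha * sig_i + (1 - alpha) * sig_j
    return sig_mix * (f_i - mu_i) / sig_i + mu_mix
```

When the pair shares identical statistics the operation reduces to the identity, so blending only perturbs what actually differs between the two examples: their style.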

3.3 Style Consistency Learning

One of the effective approaches for leveraging unlabeled data in semi-supervised learning is consistency/smoothness enforcement [18, 27], which encourages the model prediction to be invariant to small perturbations of the data. We apply this approach to our semi-supervised domain adaptation problem by introducing a new consistency loss that makes the model prediction invariant to style variations. Instead of directly perturbing the style of image pixels, we perturb the style in latent space by simply applying different feature statistics for normalization. For each training example, we obtain two prediction vectors from the network: one normalized with mini-batch statistics throughout the network, and the other normalized with the global moving-averaged statistics. This approach efficiently creates style perturbations by utilizing the randomness inherent in stochastic mini-batch sampling. The consistency between the two final predictions is measured by the KL divergence and minimized for all unlabeled data. Following [27], training signal annealing with a log schedule and confidence-based masking with a threshold of 0.5 are applied to prevent overfitting.
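The consistency loss can be sketched in numpy as follows. The threshold of 0.5 comes from the text above; the direction of the KL divergence and the function names are our assumptions.

```python
import numpy as np

def kl_div(p, q, eps=1e-12):
    # KL(p || q) per example, for rows of class probabilities
    return (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=1)

def style_consistency_loss(p_batch, p_moving, threshold=0.5):
    """Consistency between predictions normalized with mini-batch
    statistics (p_batch) and with global moving-averaged statistics
    (p_moving), with confidence-based masking: examples whose maximum
    predicted probability falls below the threshold are ignored."""
    mask = p_batch.max(axis=1) >= threshold
    if not mask.any():
        return 0.0
    return float(kl_div(p_batch[mask], p_moving[mask]).mean())
```

Because both prediction vectors come from the same forward network with different normalization statistics, no extra data augmentation pass is needed to create the perturbation.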

4 Additional Training Techniques

In a semi-supervised domain adaptation problem where only a few labeled target-domain examples are available for training, using a large model such as ResNet-152 could lead to memorization of particular examples. Thus, we adopt synthetic data augmentation and Mixup [29] to further generalize the model to unseen images. To fully leverage the unlabeled data, we also introduce a simple semi-supervised learning method that repeatedly fine-tunes the model with pseudo labels for the unlabeled data.

4.1 Synthetic Data Augmentation

CyCADA [10] showed that domain adaptation methods operating at the feature level sometimes fail to capture pixel-level domain disparity. To tackle this issue, we train CycleGAN [30] to transfer styles between the source and target domains at the pixel level. We remove the identity loss term from the original CycleGAN objective, as one domain is quite far from the other in our case. Unlike CyCADA, which uses only target-domain images and target-stylized source-domain images, we additionally utilize source-domain images and source-stylized target-domain images. With these additional data, which imitate different styles of the original images, we can avoid overfitting and improve generalization across domains.

4.2 Intra- and Inter-Domain Mixup

Mixup [29] is a simple and effective data augmentation method, where the images and labels of two training examples are interpolated to create a mixed image and a mixed soft label. A naive implementation of Mixup in a multi-domain scenario is intra-domain Mixup, where only the examples from the same domain are mixed. An extended version is inter-domain Mixup, where the examples from different domains are mixed. In SSDA with synthetic data augmentation, we perform both intra- and inter-domain Mixup for four different domains: source, target, target-stylized source (CyCADA) and source-stylized target (CyCADA).
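The interpolation in Mixup [29] can be sketched in numpy as below; whether a pair constitutes intra- or inter-domain Mixup depends only on where the two examples are drawn from. The Beta parameter alpha=0.2 is a hypothetical value, not taken from the report.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Mixup: interpolate a pair of images and their one-hot labels with a
    single mixing weight lam ~ Beta(alpha, alpha). Drawing the pair from
    the same domain gives intra-domain Mixup; from two different domains
    (e.g., source and target-stylized source), inter-domain Mixup."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```

The mixed soft label stays a valid probability distribution because the same weight is applied to images and labels.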

4.3 Iterative Pseudo Labeling

Prior works [28, 26, 13] have demonstrated that various pseudo-labeling methods are effective for domain adaptation. However, MSTN [28] and clustering-based pseudo labels [26] did not show clear improvements in our experiments, possibly due to the high complexity of the given task. Instead, a simple pseudo-labeling method [13] with a labeling threshold $\tau$ yielded significant improvements: given a learned model, we assign a pseudo label to an unlabeled example if its prediction score is higher than $\tau$. This procedure is repeated several times until the final loss converges.
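One round of the thresholded pseudo labeling can be sketched as below; the threshold value 0.9 is our own assumption, as the report does not state the value of the threshold.

```python
import numpy as np

def assign_pseudo_labels(probs, tau=0.9):
    """Threshold-based pseudo labeling as in [13]: keep only the unlabeled
    examples whose maximum prediction score reaches tau, and return their
    indices together with the argmax class as the pseudo label."""
    keep = probs.max(axis=1) >= tau
    return np.flatnonzero(keep), probs.argmax(axis=1)[keep]
```

The pseudo-labeled subset is then merged with the labeled data and the model is fine-tuned again, repeating until the loss converges.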

5 Model Variants

For additional performance improvement, we train multiple models and ensemble their predictions. Keeping ResNet-152 as the backbone, we construct two variants equipped with Batch-Instance Normalization (BIN) [19] and the Style-based Recalibration Module (SRM) [14], respectively, both of which are effective at handling style variations.

5.1 Batch-Instance Normalization (BIN)

BIN [19] is a normalization method that combines the benefits of BN [12] and Instance Normalization (IN) [25]. Based on the property that IN removes the style of each image while BN maintains it, BIN learns to selectively remove unnecessary styles using IN while keeping important styles using BN, which can help alleviate the problem of domain shift.
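A simplified numpy sketch of the BIN interpolation: in the actual method the per-channel gate is a learned, clipped parameter, whereas here it is passed in for illustration, and the affine parameters are omitted.

```python
import numpy as np

def batch_instance_norm(x, gate, eps=1e-5):
    """Interpolate between batch-normalized and instance-normalized
    responses with a per-channel gate in [0, 1]. gate=1 keeps the style
    (pure BN); gate=0 removes it (pure IN). x has shape (N, C, H, W)."""
    bn = (x - x.mean(axis=(0, 2, 3), keepdims=True)) \
         / (x.std(axis=(0, 2, 3), keepdims=True) + eps)
    inorm = (x - x.mean(axis=(2, 3), keepdims=True)) \
            / (x.std(axis=(2, 3), keepdims=True) + eps)
    g = np.asarray(gate).reshape(1, -1, 1, 1)
    return g * bn + (1 - g) * inorm
```

Learning the gate end-to-end lets each channel decide how much of its style should survive normalization.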

5.2 Style-based Recalibration Module (SRM)

We also utilize SRM [14], an architectural unit that adaptively recalibrates intermediate feature maps by exploiting their style information. It estimates per-channel recalibration weights from style features and then performs channel-wise recalibration. By explicitly incorporating styles into CNN representations, SRM can alleviate the inherent style disparity between domains.
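As a simplified stand-in for SRM's style integration (the paper uses a channel-wise fully connected layer followed by BN and a sigmoid), the recalibration can be sketched with a per-channel linear combination of the style features; the weights w_mu and w_sigma here are hypothetical learnable parameters.

```python
import numpy as np

def srm_recalibrate(x, w_mu, w_sigma):
    """Compute per-channel style features (mean and std of each example),
    map them to a recalibration weight in (0, 1) with a per-channel linear
    combination and a sigmoid, then rescale the feature map channel-wise.
    x has shape (N, C, H, W); w_mu and w_sigma have shape (C,)."""
    mu = x.mean(axis=(2, 3))   # (N, C) style feature: channel mean
    sig = x.std(axis=(2, 3))   # (N, C) style feature: channel std
    g = 1.0 / (1.0 + np.exp(-(w_mu * mu + w_sigma * sig)))  # (N, C)
    return x * g[:, :, None, None]
```

In contrast to the adversarial components of SagNets, this module uses style information rather than suppressing it, which is why it serves well as an ensemble variant.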

Method                                             Accuracy (%)
Baseline (Sec. 2)                                  46.56
SagNet (Sec. 3)                                    55.70
SagNet + synthetic (Sec. 4.1 and 4.2)              60.73
SagNet + synthetic + pseudo (Sec. 4.3)             62.51
SagNet + synthetic + pseudo + ensemble (Sec. 5)    63.08
Table 1: Results on VisDA-2019 SSDA where the source is real and the target is sketch (the validation phase).

6 Experiments

We demonstrate the effectiveness of SagNets and the additional training techniques for SSDA on the DomainNet [22] dataset, which consists of 345 categories and 0.6 million images from 6 distinct domains. For data augmentation, input images are randomly cropped to 224×224 patches, then random horizontal flipping and AutoAugment [5] with a policy learned on ImageNet are applied. The networks are trained with SGD using an initial learning rate of 0.002, a momentum of 0.9, and a weight decay of 0.0001, for 30,000 iterations with a batch size of 256 and cosine learning rate decay [17].

Table 1 shows the results of SSDA where the source domain is real and the target domain is sketch. The proposed method significantly boosts the domain adaptation performance of the baseline by reducing style bias. Furthermore, our additional training techniques and the ensemble of model variants bring considerable further improvement. Our method was also the top-performing algorithm in the SSDA task of the VisDA-2019 Challenge.

7 Conclusion

We have presented Style-Agnostic Networks (SagNets), which are robust against style variations caused by domain shifts. SagNets are trained to concentrate more on contents than on styles in their decision-making process. We have also employed several additional training techniques and model variants for further performance improvement. Our experiments demonstrate the effectiveness of SagNets in reducing the disparity between domains on the DomainNet dataset. The principle of letting the network concentrate on image contents could also be applied to other problems, such as improving the robustness of neural networks under image corruptions and adversarial attacks.


  • [1] F. Brochu (2019) Increasing shape bias in ImageNet-trained networks using transfer learning and domain-adversarial methods. arXiv preprint. Cited by: §1.
  • [2] F. M. Cariucci, L. Porzi, B. Caputo, E. Ricci, and S. R. Bulò (2017) Autodial: automatic domain alignment layers. In ICCV, Cited by: §2.2.
  • [3] W. Chang, T. You, S. Seo, S. Kwak, and B. Han (2019) Domain-specific batch normalization for unsupervised domain adaptation. In CVPR, Cited by: §2.2, §2, footnote 2.
  • [4] W. Chen, Y. Liu, Z. Kira, Y. Wang, and J. Huang (2019) A closer look at few-shot classification. In ICLR, Cited by: §2.1.
  • [5] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le (2018) Autoaugment: learning augmentation policies from data. In CVPR, Cited by: §6.
  • [6] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. JMLR. Cited by: §3.1.
  • [7] L. A. Gatys, A. S. Ecker, and M. Bethge (2016) Image style transfer using convolutional neural networks. In CVPR, Cited by: §3.
  • [8] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel (2019) ImageNet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. In ICLR, Cited by: §1, §3.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §2.
  • [10] J. Hoffman, E. Tzeng, T. Park, P. Isola, K. Saenko, A. A. Efros, and T. Darrell (2018) CyCADA: cycle consistent adversarial domain adaptation. In ICML, Cited by: §4.1.
  • [11] X. Huang and S. Belongie (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, Cited by: §3.
  • [12] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, Cited by: §2.2, §5.1.
  • [13] D. H. Lee (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks.. In ICML Workshop, Cited by: §4.3.
  • [14] H. Lee, H. Kim, and H. Nam (2019) SRM: a style-based recalibration module for convolutional neural networks. In ICCV, Cited by: §1, §5.2, §5.
  • [15] D. Li, Y. Yang, Y. Song, and T. M. Hospedales (2017) Deeper, broader and artier domain generalization. In ICCV, Cited by: §1.
  • [16] Y. Li, N. Wang, J. Shi, J. Liu, and X. Hou (2016) Revisiting batch normalization for practical domain adaptation. arXiv preprint. Cited by: §2.2.
  • [17] I. Loshchilov and F. Hutter (2016) SGDR: stochastic gradient descent with warm restarts. In ICLR, Cited by: §6.
  • [18] T. Miyato, S. Maeda, M. Koyama, and S. Ishii (2018) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. TPAMI. Cited by: §3.3.
  • [19] H. Nam and H. Kim (2018) Batch-instance normalization for adaptively style-invariant neural networks. In NeurIPS, Cited by: §1, §5.1, §5.
  • [20] A. E. Orhan and B. M. Lake (2019) Improving the robustness of ImageNet classifiers using elements of human visual cognition. arXiv preprint. Cited by: §1.
  • [21] S. J. Pan and Q. Yang (2009) A survey on transfer learning. IEEE Transactions on knowledge and data engineering. Cited by: §1, §1.
  • [22] X. Peng, Q. Bai, X. Xia, Z. Huang, K. Saenko, and B. Wang (2018) Moment matching for multi-source domain adaptation. arXiv preprint. Cited by: §6.
  • [23] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. IJCV. Cited by: §2.
  • [24] K. Saito, D. Kim, S. Sclaroff, T. Darrell, and K. Saenko (2019) Semi-supervised domain adaptation via minimax entropy. In ICCV, Cited by: §1, §2.1, §2.
  • [25] D. Ulyanov, A. Vedaldi, and V. Lempitsky (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint. Cited by: §5.1.
  • [26] VisDA 2018 challenge openset classification winner presentation. Note: 2019-10-01 Cited by: §4.3.
  • [27] Q. Xie, Z. Dai, E. Hovy, M. Luong, and Q. V. Le (2019) Unsupervised data augmentation for consistency training. arXiv preprint. Cited by: §3.3.
  • [28] S. Xie, Z. Zheng, L. Chen, and C. Chen (2018) Learning semantic representations for unsupervised domain adaptation. In ICML, Cited by: §4.3.
  • [29] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2018) Mixup: beyond empirical risk minimization. In ICLR, Cited by: §4.2, §4.
  • [30] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, Cited by: §4.1.