Since the remarkable advance of deep neural networks, they have become ubiquitous in various fields, especially computer vision systems. However, potential risks still lie on the flip side of their success. One of the main concerns is their vulnerability to visual domain shift (ben2007analysis; ben2010theory; saenko2010adapting). Concretely, deep models behave unexpectedly when confronted with data that do not follow the training distribution. For example, an auto-tagging model trained on clean product images performs poorly when given real product images captured under various viewpoints and lighting conditions.
To equip models with the ability to cope with domain shift, previous studies tackle the problem of domain adaptation (daume2009frustratingly; sun2016return; tzeng2017adversarial; saito2018maximum; yue2019domain; kim2020learning). In this problem setting, two different domains sharing the same label space are prepared for training and testing, which are referred to as the “source” and the “target” domains, respectively. During training, a model has access to both labeled images from the source domain and unlabeled (or partially labeled) images from the target domain. Being aware of the target domain, existing works successfully minimize the discrepancy between the two distinct domains, leading to large performance boosts (ben2010theory; tzeng2017adversarial; sun2016return; hoffman2018cycada; saito2018maximum; yue2019domain; kim2020learning; Xu2018DeepCN).
However, the problem setting of domain adaptation is impractical in that domain shift is generally unpredictable in real-world scenarios, i.e., we do not know the target domain at training time. To address this, a new task has recently attracted much attention, namely domain generalization, which aims to learn domain robustness without accessing target domain data (carlucci2019domain; Li2018LearningTG; Li2019EpisodicTF; dou2019domain; li2018domain; ghifary2015domain; Seo2020LearningTO; Chattopadhyay2020LearningTB; huang2021fsdr; qiao2021uncertainty; choi2021robustnet; xu2021fourier). In this setting, multiple datasets from different source domains are typically utilized to learn domain-invariant representations.
Previous methods (zhou2020learning; nuriel2020permuted; xu2020robust; qiao2020learning) embrace the observation that domain robustness improves with the number of domains observable in the training stage (tobin2017domain). To that end, they utilize generative adversarial networks (GAN) (zhou2020learning) or adaptive instance normalization (AdaIN) (huang2017arbitrary; nuriel2020permuted) to synthesize novel (unseen) domains. Nonetheless, they have two clear limitations that are critical for domain generalization. First, GAN-based methods become prohibitively difficult to optimize as the number of novel domains increases, which limits the size of the observable domain space. Second, AdaIN-based approaches fail to preserve the semantics of the original images, as instance normalization (IN) tends to wash away class-discriminative information (nam2018batch; Seo2020LearningTO).
In this paper, we introduce a novel framework for domain generalization that overcomes the above limitations. Specifically, to synthesize novel domains without losing class-discriminative information, we propose a novel feature stylization block. First, we calculate batch-wise feature statistics of the source domains and sample novel domain styles from the feature distribution. We re-scale the standard deviation of the source feature distribution so that outlying style statistics are more likely to be sampled. However, the original semantics can be distorted during the stylization process, which would disturb training. To preserve the original semantics during stylization, inspired by a recent photo-realistic stylization method (Yoo2019PhotorealisticST), we decompose the original features into high and low frequency components, which contain structural and textural information, respectively. Afterwards, we manipulate the low frequency components while retaining the shape cues in the high frequency ones to prevent semantic distortion. Lastly, we re-merge the stylized low frequency components with the high frequency ones to obtain the stylized features. By incorporating them into training, our model learns representations that are robust against domain shift.
Rather than naively utilizing the stylized features for training, we seek better strategies that can provide guidance toward domain robustness. Intuitively, a model that is robust against domain shift should yield consistent predictions for the stylized features and the original ones. From this point of view, we adopt a consistency loss to maximize the agreement between the model predictions for them. Concretely, we measure the KL divergence between the two output distributions and minimize it with the consistency loss.
Moreover, we propose a domain-aware supervised contrastive loss that minimizes the distance between the stylized and the original features in order to achieve feature-level consistency. Although the conventional supervised contrastive loss has proven to be effective, we found that it is unsuitable for domain generalization. The loss repels samples from different domains and thus disturbs domain invariance, which conflicts with the goal of domain generalization. To address this, we introduce the novel domain-aware supervised contrastive loss, which ignores negative samples from different domains, hence preserving domain invariance while strengthening class discriminability.
The contributions of this paper are threefold. First, we propose a novel domain generalization framework, where diverse domain styles are generated and leveraged through the proposed feature stylization block. The stylized features are in turn used to enhance domain robustness by encouraging the model to produce consistent outputs. Second, we introduce the novel domain-aware supervised contrastive loss. The proposed loss strengthens domain invariance by contrasting features with respect to domain and class labels. Lastly, we demonstrate the effectiveness of each component of our model through analyses and ablation studies. Furthermore, experimental results show that our method surpasses previous methods by clear margins, achieving a new state of the art on the widely used benchmarks PACS and Office-Home. Even on the single-source domain generalization task, our method shows notable performance improvements over the baseline.
2. Related Works
2.1. Domain Adaptation
Domain adaptation aims to transfer learned knowledge from source domains to a target domain. In this setting, the source domain is usually a large-scale dataset with annotations, and the target domain data is either partially labeled or completely unlabeled. These two cases are referred to as semi-supervised domain adaptation (SSDA) (donahue2013semi; yao2015semi; ao2017fast; saito2019semi) and unsupervised domain adaptation (UDA) (daume2009frustratingly; sun2016return; tzeng2017adversarial; hoffman2018cycada; saito2018maximum; yue2019domain; kim2020learning), respectively.
Semi-supervised domain adaptation methods impose constraints on both labeled and unlabeled instances of the target domain in various ways. Donahue et al. (donahue2013semi) build a similarity graph to constrain unlabeled data and transfer knowledge with a projective model transfer method. Ao et al. (ao2017fast) distill knowledge from the source domain by generating pseudo labels for the unlabeled target data. Saito et al. (saito2019semi) estimate class-specific prototypes with sparsely labeled examples of the target domain, then update them by solving a minimax game on the unlabeled data.
In unsupervised domain adaptation, most methods (daume2009frustratingly; sun2016return; tzeng2017adversarial; hoffman2018cycada; kim2020learning) conduct feature alignment between source and target domains. To this end, CORAL (sun2016return) minimizes the distance between the covariance matrices, while ADDA (tzeng2017adversarial) employs a domain discriminator for adversarial learning. Meanwhile, CyCADA (hoffman2018cycada) adopts an image-to-image translation framework to transfer the source domain data to the target domain at the image level. More recently, domain randomization (tobin2017domain; yue2019domain; zakharov2019deceptionnet; kim2020learning) has emerged as another generative stream, which diversifies the textures of source domain images, allowing the model to learn texture-invariant representations. Yue et al. (yue2019domain) manipulate images into the style of an external class from ImageNet (krizhevsky2012imagenet), and LTIR (kim2020learning) exploits an artistic style transfer method to alter the textures of the source and target domains.
Our method relates to the domain randomization approaches in that it aims to generate features with diverse domain characteristics. However, these approaches are not suitable for domain generalization as they require external datasets (krizhevsky2012imagenet; yue2019domain; kim2020learning). In contrast, our proposed feature stylization block is able to generate various stylized features based on the statistics of the source domains, without access to additional data.
2.2. Domain Generalization
The goal of domain generalization is to learn domain invariant representations based on only source domains. Different from unsupervised domain adaptation, target domain data is inaccessible during training, making the task more challenging. In addition, multiple domains are typically utilized to achieve domain-agnostic representation without having target domain data. Previous domain generalization methods can be roughly categorized into four groups: meta-learning, architectural modification, regularization, and generative approaches.
The first group exploits meta-learning techniques (finn2017model; ravi2016optimization) to align different domains (Li2018LearningTG; balaji2018metareg; Li2019EpisodicTF; dou2019domain). These approaches borrow the powerful adaptability of meta-learning algorithms, whose effectiveness has been proven in the field of few-shot learning. As a representative example, Li et al. (Li2019EpisodicTF) separate the training set into multiple episodes, each of which handles only a single domain. During training, they update the backbone with aggregated regularization losses from domain-specific networks. Meanwhile, MASF (dou2019domain) simulates domain shift using different episodes and performs global alignment of class relationships while clustering local samples.
Secondly, some works (Motiian2017UnifiedDS; li2018domain; ghifary2015domain; Seo2020LearningTO; Chattopadhyay2020LearningTB; xu2021fourier) try architectural changes to model a shared embedding space (Motiian2017UnifiedDS; li2018domain; ghifary2015domain) or to build domain-specific networks (Chattopadhyay2020LearningTB; Seo2020LearningTO). Exploiting auxiliary pretext tasks is also favored as a sub-stream (carlucci2019domain; Wang2020LearningFE). As a pioneer, JiGen (carlucci2019domain) proposes to solve jigsaw puzzles as an auxiliary task to induce the model to learn the concept of spatial correlation. Inheriting from JiGen, EIS-Net (Wang2020LearningFE) employs a momentum metric learning task to provide extrinsic relationship supervision.
Thirdly, other approaches (wang2018learning; wang2019learning; Huang2020SelfChallengingIC; shi2020informative; choi2021robustnet) apply diverse regularization during training. HEX (wang2018learning) employs the neural gray-level co-occurrence matrix to find superficial representations related to the task. PAR (wang2019learning) penalizes the predictive power of earlier layers so that the model relies more on global representations from deeper layers. RSC (Huang2020SelfChallengingIC) masks out both spatial regions and channels which have high contributions to the task. RobustNet (choi2021robustnet) encourages the model to utilize domain-invariant features by selectively whitening domain-variant feature channels in the Gram matrix during training.
Lastly, based on the intuition that the generalization ability can be boosted with samples from more diverse domains (tobin2017domain), generative approaches have arisen (zhou2020learning; nuriel2020permuted; xu2020robust; qiao2020learning). They augment the training set with samples similar in semantics but different in domain characteristics. L2A-OT (zhou2020learning) adopts generative adversarial networks to synthesize images which are distant from the original ones in terms of the Wasserstein distance. Qiao et al. (qiao2020learning) apply adversarial perturbations to the images to augment the source domains. From the perspective of frequency, FACT (xu2021fourier) analyzes the frequency components of the image with the Fourier transformation, and conducts data augmentation by mixing the amplitude information.
Our method can be viewed as a harmonious combination of the generative method and the regularization-based approach. We generate features of novel domains via the novel feature stylization block during training, then apply regularization in terms of output consistency and feature similarity. The efficacy of our method is demonstrated through extensive experiments in Sec. 4.
In this section, we first describe the baseline setup of multi-source domain generalization for image classification, then introduce our novel feature stylization method and consistency learning process thereafter. The overall framework of our method is illustrated in Fig. 1 (a).
In the multi-source domain generalization task, multiple datasets of source domains $\mathcal{D} = \{D_1, D_2, \dots, D_S\}$ are accessible during training. Each dataset $D_s$ contains a set of images $\{x_i^s\}_{i=1}^{N_s}$ with the corresponding class label set $\{y_i^s\}_{i=1}^{N_s}$, where $N_s$ is the number of images in the $s$-th dataset. Naturally, the domain label of $x_i^s$ can be obtained as $d_i^s = s$. We also note that all datasets share the same label space, i.e., $\mathcal{Y}_1 = \mathcal{Y}_2 = \dots = \mathcal{Y}_S$. We train a neural network which consists of a feature extractor $f$ and a following classifier $g$. The feature extractor is composed of multiple convolutional layers and we denote the output features of the $l$-th convolutional layer by $z^l \in \mathbb{R}^{B \times C \times H \times W}$, where $B$ is the cardinality of a mini-batch and $C$ is the number of channels, while $H$ and $W$ are the height and width of the feature, respectively. The classifier is a single fully-connected layer. We train the network by minimizing the cross-entropy loss as follows.

$$\mathcal{L}_{CE} = -\frac{1}{B} \sum_{i=1}^{B} y_i^{\top} \log \sigma\big(g(f(x_i))\big), \tag{1}$$

where $\sigma(\cdot)$ indicates the softmax function. Consequently, with the above baseline setup, the network is trained to classify an image $x_i$ into its corresponding label $y_i$.
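As a minimal sketch of this baseline objective, the NumPy snippet below computes the mean softmax cross-entropy for a single fully-connected classifier applied to already-extracted features. The random stand-in features, the 16-dimensional feature size, and the helper names `softmax` and `cross_entropy` are illustrative assumptions, not the actual implementation.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def cross_entropy(logits, labels):
    """Mean cross-entropy over a mini-batch.

    logits: (B, K) classifier outputs; labels: (B,) integer class indices.
    """
    probs = softmax(logits)
    batch = np.arange(logits.shape[0])
    return float(-np.log(probs[batch, labels] + 1e-12).mean())

rng = np.random.default_rng(0)
features = rng.normal(size=(6, 16))      # stand-in for pooled extractor outputs (assumed size)
weight = rng.normal(size=(16, 7)) * 0.1  # single fully-connected classifier, 7 classes as in PACS
bias = np.zeros(7)
labels = rng.integers(0, 7, size=6)
loss = cross_entropy(features @ weight + bias, labels)
```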
3.2. Feature Stylization
An intuitive way to improve the generalization ability of a model would be to let it see diverse samples from different domains (tobin2017domain). From this point of view, we augment the source domains by synthesizing novel domains through the manipulation of feature statistics. Before stylization, we note that the generated feature must maintain the original semantics. To this end, we borrow the feature decomposition of a photo-realistic style transfer model (Yoo2019PhotorealisticST), where structural features and textural features are separated into high frequency and low frequency components, respectively. The feature decomposition process is formulated as:

$$z^L = \mathrm{UP}(\mathrm{AvgPool}(z)), \qquad z^H = z - z^L, \tag{2}$$

where “AvgPool” denotes the spatial average pooling operation with a kernel size of 2, and “UP” indicates the nearest neighbor upsampling operation. After decomposition, we perform stylization on the low frequency feature $z^L$ only to preserve structural information.
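The decomposition can be sketched in NumPy as follows. This is a minimal illustration under two assumptions that the text does not state explicitly: the pooling stride equals the kernel size (2), and the high frequency component is the residual left after subtracting the low frequency one. The function names are hypothetical.

```python
import numpy as np

def avg_pool2x2(z):
    """Spatial average pooling with kernel size 2 (stride 2 is an assumption)."""
    b, c, h, w = z.shape
    return z.reshape(b, c, h // 2, 2, w // 2, 2).mean(axis=(3, 5))

def upsample_nearest2x(z):
    """Nearest-neighbor upsampling by a factor of 2 along height and width."""
    return z.repeat(2, axis=2).repeat(2, axis=3)

def decompose(z):
    """Split z of shape (B, C, H, W), with even H and W, into low/high parts."""
    z_low = upsample_nearest2x(avg_pool2x2(z))  # smooth, texture-like content
    z_high = z - z_low                          # residual: edges and structure (assumed form)
    return z_low, z_high

z = np.random.default_rng(0).normal(size=(2, 3, 4, 6))
z_low, z_high = decompose(z)
```

By construction the two parts sum back to the input, so stylizing only `z_low` leaves the residual detail in `z_high` untouched.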
Since neither extra datasets nor pre-trained networks are available in our setting, we stylize the feature by utilizing its batch-wise statistics. Firstly, the mean and variance are obtained as follows:
where $\phi(\cdot)$ indicates the flattening operation, while $\mu$ and $\sigma^2$ denote the mean and the variance of the feature style, respectively.
As thoroughly investigated in previous studies (li2016adabn; pan2018two), these batch-wise statistics are highly related to the domain characteristics. In order to generate the domain statistics, we model the prior distributions of both style vectors ($\mu$, $\sigma$) as Gaussians. For this purpose, we calculate the channel-wise mean and variance of the style vectors as follows.
where $\hat{\mu}_{\mu}$ and $\hat{\Sigma}_{\mu}$ denote the channel-wise statistics of $\mu$, while $\hat{\mu}_{\sigma}$ and $\hat{\Sigma}_{\sigma}$ are the statistics of $\sigma$. $C$ is the number of channels of $z^L$.
To generate novel domain styles, we manipulate the variance of the distributions with the scaling parameters $\alpha$ and $\beta$, then sample new style vectors from the resulting distributions as:
As the variance increases, outlying style vectors, i.e., outliers, are more likely to be sampled from the distributions, whereas inliers are sampled with higher probability in the opposite case. We show the effects of the scale parameters through experiments in Sec. 4.4.
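The statistics and sampling steps can be sketched as below. This is one plausible reading rather than the exact formulation: it assumes that the batch-wise statistics are taken per channel over the batch and spatial axes, that the two Gaussians are fitted across the channel axis, and that the scaling parameters multiply the variance (not the standard deviation). All function and argument names are hypothetical.

```python
import numpy as np

def batch_style_stats(z_low, eps=1e-5):
    """Per-channel mean and std over batch and spatial axes (assumed axes)."""
    mu = z_low.mean(axis=(0, 2, 3))                   # shape (C,)
    sigma = np.sqrt(z_low.var(axis=(0, 2, 3)) + eps)  # shape (C,)
    return mu, sigma

def sample_style(mu, sigma, alpha, beta, rng):
    """Sample novel style vectors from Gaussians fitted across channels.

    alpha and beta rescale the variances of the two Gaussians (assumption).
    """
    mu_new = rng.normal(mu.mean(), np.sqrt(alpha * mu.var()), size=mu.shape)
    sigma_new = rng.normal(sigma.mean(), np.sqrt(beta * sigma.var()), size=sigma.shape)
    return mu_new, sigma_new

rng = np.random.default_rng(0)
z_low = rng.normal(size=(4, 8, 6, 6))  # stand-in low frequency feature (B, C, H, W)
mu, sigma = batch_style_stats(z_low)
mu_new, sigma_new = sample_style(mu, sigma, alpha=10.0, beta=10.0, rng=rng)
```

With a scale of 1 the sampled vectors follow the fitted distribution, whereas larger scales spread the samples further from its mean. Note that a large `beta` can produce negative scale values in this sketch; the text does not say how such cases are handled.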
After the sampling stage, the style vectors $\tilde{\mu}$ and $\tilde{\sigma}$ are applied to the original low frequency component $z^L$ via the affine transformation as follows.
We can interpret the sampling and transformation process as generating arbitrary domain statistics and applying them to the original ones. Notably, our affine transformation process is analogous to batch normalization (BN) (ioffe2015batch), where the affine parameters are $\tilde{\sigma}$ and $\tilde{\mu}$. Compared to adaptive instance normalization (AdaIN) (huang2017arbitrary), which washes away discriminative features (nam2018batch; Seo2020LearningTO), our feature stylization process conserves them while generating novel domain styles.
Lastly, the stylized low frequency feature $\tilde{z}^L$ is combined with the original high frequency feature $z^H$ via the following equation.

$$\tilde{z} = \tilde{z}^L + z^H. \tag{7}$$

We note that our feature stylization block can be inserted into any layer of the feature extractor $f$. The best location of the feature stylization block is analyzed through ablation studies.
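Putting the steps together, the whole block might look like the sketch below. It is written under stated assumptions, not as the reference implementation: the low frequency feature is first normalized with its batch statistics (following the batch normalization analogy) before the sampled style is applied, the scaling parameters multiply the variances, and the two components are merged by addition. The function name `feature_stylization` and its arguments are hypothetical.

```python
import numpy as np

def feature_stylization(z, alpha, beta, rng, eps=1e-5):
    """Stylize a feature map z of shape (B, C, H, W) with even H and W."""
    b, c, h, w = z.shape

    # 1) Frequency decomposition: average pooling (k=2) + nearest upsampling.
    pooled = z.reshape(b, c, h // 2, 2, w // 2, 2).mean(axis=(3, 5))
    z_low = pooled.repeat(2, axis=2).repeat(2, axis=3)
    z_high = z - z_low  # assumed residual form

    # 2) Batch-wise style statistics per channel (assumed axes).
    mu = z_low.mean(axis=(0, 2, 3))
    sigma = np.sqrt(z_low.var(axis=(0, 2, 3)) + eps)

    # 3) Sample novel style vectors from variance-scaled Gaussians.
    mu_new = rng.normal(mu.mean(), np.sqrt(alpha * mu.var()), size=c)
    sigma_new = rng.normal(sigma.mean(), np.sqrt(beta * sigma.var()), size=c)

    # 4) Affine transformation of the normalized low frequency feature
    #    (normalizing first is an assumption based on the BN analogy).
    shape = (1, c, 1, 1)
    normalized = (z_low - mu.reshape(shape)) / sigma.reshape(shape)
    z_low_new = sigma_new.reshape(shape) * normalized + mu_new.reshape(shape)

    # 5) Re-merge with the untouched high frequency component.
    return z_low_new + z_high

z = np.random.default_rng(0).normal(size=(4, 8, 6, 6))
z_stylized = feature_stylization(z, alpha=10.0, beta=10.0, rng=np.random.default_rng(1))
```

Because the affine transformation acts per channel on a map that is constant within each 2x2 block, the high frequency residual of the output equals that of the input, which is the property the decomposition is meant to guarantee.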
3.3. Consistency Regularization
The augmented feature $\tilde{z}$ is passed through the remaining layers in the same way as the original feature $z$. Given the stylized feature, we further encourage the model to output a prediction consistent with the original one. For this, we minimize the discrepancy between the output predictions from the original and the stylized features, which is formulated as:
where $\sigma(\cdot)$ indicates the softmax function, $g(\tilde{f}(\cdot))$ denotes the neural network output in which feature stylization is performed on an intermediate layer, and $\tau$ is the temperature hyper-parameter.
The effect of the consistency loss is illustrated in Fig. 1 (b). Through the consistency loss, the log-likelihood between the predictions is maximized. In addition, we apply temperature scaling with $\tau$ on the original prediction, denoted as $\sigma(g(f(\cdot))/\tau)$, to encourage the prediction of the stylized feature to have low entropy (grandvalet2004semi; hinton2015distilling).
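A minimal NumPy sketch of such a consistency term is given below. It assumes that the loss is the KL divergence from the temperature-sharpened original prediction to the stylized prediction, averaged over the mini-batch; the direction of the divergence and the names `log_softmax` and `consistency_loss` are assumptions made for illustration.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def consistency_loss(logits_orig, logits_styl, tau=0.5):
    """KL(softmax(orig / tau) || softmax(styl)), averaged over the batch.

    logits_orig, logits_styl: (B, K) outputs for original / stylized features.
    A temperature tau < 1 sharpens the target distribution.
    """
    log_p = log_softmax(logits_orig / tau)  # sharpened target
    log_q = log_softmax(logits_styl)
    return float((np.exp(log_p) * (log_p - log_q)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
logits_orig = rng.normal(size=(5, 7))
logits_styl = rng.normal(size=(5, 7))
loss = consistency_loss(logits_orig, logits_styl, tau=0.5)
```

In an actual training loop the sharpened target would typically be treated as a constant, so that gradients flow only through the stylized branch; that design choice is likewise an assumption here.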
3.4. Domain-aware Supervised Contrastive Loss
Furthermore, we bring another intuition that a robust feature extractor should embed stylized features adjacent to original ones. Hence, we minimize the distance between the original feature (anchor) and the stylized one in terms of the dot-product similarity. This is accomplished by contrasting stylized features (positives), with other samples (negatives) (he2020momentum).
In addition, to encourage class discriminability, we adopt a supervised contrastive learning framework (khosla2020supervised), where output features from augmented samples and those from samples with the same class label are treated as positives. Meanwhile, the others in the mini-batch are considered negatives. The basic formulation of supervised contrastive learning is defined as:

$$\mathcal{L}_{sup} = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a / \tau)}, \tag{9}$$

where $I$ indicates the set of indices of the features and their stylized augmentations, $A(i)$ contains the indices of all samples but the anchor $i$, $P(i)$ denotes the set of indices of all positives to the anchor, and $z_i$ denotes an output feature from the feature extractor after L2 normalization. The softmax function with temperature scaling $\tau$ is applied on the similarity matrix of the anchor.
As shown in Fig. 2 (a), the loss induces the model to attract positive features toward the anchor while repelling negative ones from it. However, performance degrades when the loss is directly adopted for the domain generalization task. Concretely, the feature space becomes domain-discriminative since the samples from different domains are pushed away from the anchor. This is widely known to be detrimental to achieving domain invariance (daume2009frustratingly; sun2016return; tzeng2017adversarial; li2018domain). To address this, we propose to modify Eq. (9) to be more suitable for domain generalization, yielding a domain-aware supervised contrastive loss:
where $D(i)$ is the set containing the indices of samples sharing the same domain label with the anchor.
As shown in Fig. 2 (b), samples from different domains that were included in the earlier negative set are excluded. For example, when the anchor is a “photo dog”, its positive set is “stylized photo dog”, “cartoon dog”, “sketch dog”, …, whereas the remaining negatives belong to different classes in the “photo” domain.
Consequently, with the proposed domain-aware supervised contrastive loss, our feature extractor produces features that are not only discriminative with respect to class labels but also domain-invariant, as positive samples from different domains are attracted to the anchor.
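The difference between the two losses can be illustrated with the sketch below. It assumes that the domain-aware variant keeps positives from any domain in the contrast set but drops negatives whose domain differs from the anchor's, and that the loss is averaged (rather than summed) over anchors. The function name, its signature, and the `domain_aware` switch are hypothetical.

```python
import numpy as np

def contrastive_loss(features, labels, domains, tau=0.15, domain_aware=True):
    """Supervised contrastive loss, optionally ignoring cross-domain negatives.

    features: (N, D) array; labels, domains: (N,) integer arrays. Stylized
    copies are expected to be stacked into the batch with the class and
    domain labels of their originals.
    """
    z = features / np.linalg.norm(features, axis=1, keepdims=True)  # L2 normalization
    sim = z @ z.T / tau
    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)
    positive = (labels[:, None] == labels[None, :]) & not_self
    same_domain = domains[:, None] == domains[None, :]
    if domain_aware:
        # Positives from any domain stay; negatives only from the anchor's domain.
        contrast = not_self & (positive | same_domain)
    else:
        contrast = not_self
    sim = sim - sim.max(axis=1, keepdims=True)  # stability shift, cancels in the ratio
    denom = (np.exp(sim) * contrast).sum(axis=1, keepdims=True)
    log_prob = sim - np.log(np.where(denom > 0, denom, 1.0))
    n_pos = positive.sum(axis=1)
    valid = n_pos > 0  # anchors without positives are skipped
    per_anchor = -(log_prob * positive).sum(axis=1)[valid] / n_pos[valid]
    return float(per_anchor.mean())

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 5))
labels = np.array([0, 0, 1, 1])
domains = np.array([0, 1, 0, 1])
loss_standard = contrastive_loss(feats, labels, domains, domain_aware=False)
loss_domain_aware = contrastive_loss(feats, labels, domains, domain_aware=True)
```

When every sample shares the same domain label, the two variants coincide in this sketch, since no cross-domain negatives exist to be dropped.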
3.5. Overall Training and Inference
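We train the neural network with the weighted sum of losses as follows:

$$\mathcal{L}_{total} = \mathcal{L}_{CE} + \lambda_1 \mathcal{L}_{cons} + \lambda_2 \mathcal{L}_{dsc}, \tag{11}$$

where $\mathcal{L}_{cons}$ and $\mathcal{L}_{dsc}$ denote the consistency loss and the domain-aware supervised contrastive loss, and $\lambda_1$ and $\lambda_2$ are the weighting factors. The overall training is conducted in an end-to-end manner. The whole network is updated with respect to $\mathcal{L}_{CE}$ and $\mathcal{L}_{cons}$, while $\mathcal{L}_{dsc}$ affects only the feature extractor $f$. During inference, we detach the feature stylization module from the forward path so that the model predicts based on the original feature.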
4.1. Experiment Details
Datasets. As our evaluation benchmarks, we use PACS (Li2017DeeperBA) and Office-Home (finn2017model) following conventional settings. PACS is made up of four domains, i.e., Photo (1,670 images), Art Painting (2,048 images), Cartoon (2,344 images) and Sketch (3,929 images). The total number of images is 9,991 and the image resolution is $227 \times 227$. The dataset contains 7 common categories: ‘dog’, ‘elephant’, ‘giraffe’, ‘guitar’, ‘horse’, ‘house’, ‘person’. The other benchmark is Office-Home, which consists of images from four different domains, namely Artistic, Clip Art, Product, and Real World images. Each domain contains images of 65 object categories that are found in office and home environments. The total number of images is roughly 15,500.
Evaluation. A common evaluation protocol in domain generalization is leave-one-domain-out evaluation (Li2017DeeperBA). Specifically, we first select one domain as the target domain. Then, the other domains are set as source domains. We train our model on the source domains and evaluate it on the target domain. We note that no sample from the target domain is used during training. This procedure is repeated so that every domain is chosen as the target domain exactly once, and we report the averaged accuracy.
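The protocol can be summarized by the short sketch below. The callable `train_and_eval` is a hypothetical placeholder for training on the source domains and returning the accuracy on the held-out target; the dummy version used here returns a constant and only illustrates the control flow.

```python
def leave_one_domain_out(domains, train_and_eval):
    """Hold out each domain once as the target and average the accuracies.

    train_and_eval(sources, target) is a user-supplied callable (hypothetical)
    that trains on `sources` only and returns the accuracy on `target`.
    """
    results = {}
    for target in domains:
        sources = [d for d in domains if d != target]  # target data is never used for training
        results[target] = train_and_eval(sources, target)
    average = sum(results.values()) / len(results)
    return results, average

def dummy_train_and_eval(sources, target):
    """Placeholder that returns a made-up score instead of training a model."""
    return 10.0 * len(sources)

pacs_domains = ["photo", "art_painting", "cartoon", "sketch"]
per_target, mean_acc = leave_one_domain_out(pacs_domains, dummy_train_and_eval)
```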
Implementation details. For fair comparison with previous studies, we adopt ResNet-18 and ResNet-50 (he2016deep) pre-trained on ImageNet (krizhevsky2012imagenet). We optimize our network with the stochastic gradient descent (SGD) optimizer. We set the initial learning rate to 0.004 and train for 40 epochs. The decay rate is set to 0.0005, which is applied after 20 epochs. A single mini-batch contains a total of 126 images, 42 images for each source domain.
Inspired by FixMatch (sohn2020fixmatch) and SupCon (khosla2020supervised), the temperature parameters of the consistency loss and the domain-aware supervised contrastive loss are set to 0.5 and 0.15, respectively. Considering the scale of the loss functions, the two weighting factors are set to (0.3, 12) and (0.9, 6) for ResNet-18 and ResNet-50, respectively. Besides, the scale parameters are set to 10 and 20 for ResNet-18 and ResNet-50, respectively. Our model is built upon the popular implementation of Zhou et al. (zhou2020domain).
4.2. Comparison with State-of-the-Art Methods
Results on PACS. In Table 1, we compare our method with previous domain generalization methods on the PACS dataset (Li2017DeeperBA). Our method outperforms previous approaches and achieves a new state-of-the-art average accuracy of 85.86% with ResNet-18. Consistently, our method shows large improvements when adopting ResNet-50 as the backbone, setting a new record with an average accuracy of 87.86% across the leave-one-domain-out scenarios.
Through the experiments, pronounced performance gains are observed when the art and sketch domains are set as target domains. This is reasonable since the strength of our method lies in generating novel styles while preserving the shape cues that are essential for accurate classification in those domains. Despite the slight performance degradation in the photo domain, our method outperforms the competitors in terms of the average accuracy, validating its better domain generalization ability. We note that our feature stylization module does not require additional network parameters, which makes our model more competitive in terms of memory.
Results on Office-Home. We also provide the results on the Office-Home benchmark (finn2017model) in Table 2. Again, our method sets a new record with ResNet-18, achieving an average accuracy of 66.2%. Notably, our method improves performance regardless of the target domain. The overall comparisons verify the effectiveness of our feature stylization strategy and the proposed contrastive loss.
4.3. Single-source Domain Generalization
In Table 3, we present the single-source domain generalization results with ResNet-18 on the PACS benchmark. In this setting, only a single domain is selected as the source dataset and the trained model is tested on the other target domains. Rows and columns indicate source and target domains, respectively. We use the same hyper-parameter settings described in the previous section, except for the scale parameters, both of which are scaled down to 5. In addition, since there is only a single source domain in this setting, the domain condition in the domain-aware supervised contrastive loss is naturally ignored.
Except for the diagonal elements, where the training and test domains are the same, we achieve improvements in most of the domain generalization scenarios. We observe remarkable performance gains in the “Photo-to-Sketch”, “Art-to-Cartoon”, “Sketch-to-Art”, and “Sketch-to-Cartoon” settings. In particular, a large performance improvement is observed when the sketch domain is used as the source domain, i.e., when only coarse shape information is available for training. This demonstrates the style diversification ability of our feature stylization block. The “Art-to-Photo” setting is the only generalization scenario where a performance degradation is observed, but it remains within an acceptable margin.
In this section, we analyze our method and conduct ablation studies on the PACS benchmark with the ResNet-18 backbone. Specifically, we first analyze the effect of each loss function, then investigate the correlation between the performance and the scale parameter. Thereafter, we examine the most suitable location of the feature stylization block, followed by an analysis of the feature decomposition into frequency components. We note that all remaining hyper-parameters are fixed throughout the ablation studies.
Ablation study on components. We conduct an ablation study to inspect the contribution of the feature stylization along with the loss functions. As shown in Table 4, every single component enlarges the generalization capacity of the model compared to the baseline. In detail, the effectiveness of feature stylization is observable in the overall performance improvement of 4.08% in terms of the average accuracy. In particular, the performance gain on the sketch domain is notable, reaching 8.71%. In addition, it is verified that the consistency loss boosts domain robustness by regularizing the output discrepancy between the original and stylized features. Moreover, the proposed domain-aware contrastive loss enhances the performance by pursuing feature similarity between samples from different domains but of the same category. Consequently, we find that all components are complementary and have a positive effect on the domain generalization task.
The scale parameter. In Fig. 3, we compare different scale parameters in the feature stylization block. We vary both scale parameters over a range of values. With scale parameters of 1, the augmented style vectors follow the original style distribution, leading to a marginal improvement. As the scale parameter increases, outlying style vectors are more likely to be sampled, thus increasing the generalization ability. However, excessive scale values produce features and outputs that are too distant from the original ones, resulting in an excessive bias toward shape cues. This is undesirable since shape cues alone are not sufficient for visual recognition and texture cues still contain class-discriminative information, as in the human visual system (sann2007perception). The best performance is found at the scale parameter of 10, which may be the “sweet spot” for exploiting both shape and textural cues.
Feature decomposition strategy. We verify the effect of decomposing the feature into high frequency and low frequency components. In Table 5, we compare stylizing the whole feature without decomposition (-), the high frequency components only, and the low frequency components only. As shown in Table 5, the best performance is achieved when feature stylization is applied only to the low frequency components. Applying stylization to the high frequency feature falls behind the other strategies, since only the shape information is partially distorted. Meanwhile, although transforming the whole feature seems to be a good strategy overall, it shows inferior performance on the domains where shape cues are crucial, such as the art and sketch domains.
Location of the feature stylization block. We discuss where the proposed feature stylization block should be located. We re-group the ResNet architecture into 5 groups of layers, Conv and ResBlock 1-4, where Conv denotes the first convolutional layer before the residual blocks.
As shown in Table 6, the best location for the proposed feature stylization block is right after the second residual block. This is quite reasonable considering the nature of deep neural networks (gatys2016image; donahue2014decaf; shi2020informative), where features at this level adequately represent low-level structural information as well as high-level semantic information.
In this paper, we proposed a novel framework for domain generalization, where the features are stylized into diverse domains. In detail, we sampled domain style vectors from the manipulated distribution of batch-wise feature statistics, then utilized the style vectors for affine transformation. To achieve domain robustness, we exploited the stylized features for regularization in terms of output consistency and feature similarity via the consistency loss and the novel domain-aware supervised contrastive loss, respectively. Through comparisons and extensive analyses on two popular benchmarks, we demonstrated the effectiveness of the proposed feature stylization and the two losses.
This research was partly supported by the MSIT (Ministry of Science, ICT), Korea, under the High-Potential Individuals Global Training Program (No. 2021-0-01696) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation), the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2019R1A2C2003760), and the Institute for Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2020-0-01361: Artificial Intelligence Graduate School Program (YONSEI UNIVERSITY)). This project was also supported by Microsoft Research Asia.