Deep learning usually relies on the assumption that training and testing data are independent and identically distributed (i.i.d.), while target tasks are usually significantly heterogeneous and diverse [2, 20]. This motivates many researchers to investigate domain adaptation (DA) and domain generalization (DG). In DA, the shift between the source and target domains is expected to be compensated for by a variety of adaptation steps. Despite the success of DA in several tasks, much of the prior work relies on massive labeled or unlabeled target samples for training.
The recently proposed DG task assumes that several labeled source domains are available in training, without access to target samples or labels. Collectively exploiting these source domains can potentially lead to a trained system that generalizes well to a target domain. A predominant stream in DG is domain invariant feature learning (IFL), which attempts to enforce $p_i(f(x)) = p_j(f(x))$, where $i$ and $j$ index two different source domains. The typical solution can be obtained via momentum or adversarial training.
However, different classes can follow different domain shift protocols, e.g., a street lamp will be sparkly at night, while a pedestrian is shrouded in darkness. Therefore, we would expect a fine-grained class-wise alignment of the conditional shift w.r.t. $p(f(x)|y)$, where $f(\cdot)$ is a feature extractor [32, 10, 3].
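Assuming there is no concept shift (i.e., $p_i(y|x) = p_j(y|x)$) and no label shift (i.e., $p_i(y) = p_j(y)$), and given Bayes' theorem, $p(f(x)|y) = p(y|f(x))\,p(f(x))/p(y)$, IFL is able to align $p(f(x)|y)$ if $p_i(f(x)) = p_j(f(x))$.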
However, the label shift $p_i(y) \neq p_j(y)$, i.e., class imbalance, is quite common in DG, as illustrated in Fig. 1. Since $f(\cdot)$ is a deterministic mapping function, IFL is able to encode the domain-invariant representation under the covariate shift assumption (i.e., only $p(x)$ differs across domains). Under the label shift, the $p(f(x))$ alignment cannot be used as a substitute for the $p(f(x)|y)$ alignment. In fact, both conditional and label shifts are realistic in most DG tasks.
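The argument above can be checked numerically on a toy problem. The following is a minimal sketch, assuming discrete features and labels; the probability tables and function names are hypothetical and chosen only for illustration. It shows that with a shared $p(y|f(x))$ and aligned $p(f(x))$ the class-conditionals coincide, whereas under label shift a shared class-conditional yields different marginals, so matching the marginals alone would not match the class-conditionals.

```python
def class_conditional(p_y_given_z, p_z):
    """Return p(z|y) via Bayes' theorem for discrete z and y.

    p_y_given_z[z][y] is p(y|z) and p_z[z] is p(z).
    The result is indexed as result[y][z].
    """
    num_z, num_y = len(p_z), len(p_y_given_z[0])
    # p(y) = sum_z p(y|z) p(z)
    p_y = [sum(p_y_given_z[z][y] * p_z[z] for z in range(num_z))
           for y in range(num_y)]
    # p(z|y) = p(y|z) p(z) / p(y)
    return [[p_y_given_z[z][y] * p_z[z] / p_y[y] for z in range(num_z)]
            for y in range(num_y)]


def marginal(p_z_given_y, p_y):
    """Return p(z) = sum_y p(z|y) p(y) (law of total probability)."""
    num_z = len(p_z_given_y[0])
    return [sum(p_z_given_y[y][z] * p_y[y] for y in range(len(p_y)))
            for z in range(num_z)]


# Case 1: no concept shift, no label shift, aligned p(z) in domains i and j.
shared_posterior = [[0.9, 0.1], [0.2, 0.8]]  # hypothetical p(y|z)
cond_i = class_conditional(shared_posterior, [0.5, 0.5])
cond_j = class_conditional(shared_posterior, [0.5, 0.5])

# Case 2: label shift with a shared class-conditional p(z|y).
shared_conditional = [[0.8, 0.2], [0.1, 0.9]]  # hypothetical p(z|y), rows are y
marg_i = marginal(shared_conditional, [0.5, 0.5])  # balanced domain
marg_j = marginal(shared_conditional, [0.9, 0.1])  # imbalanced domain
```

In the second case the two marginals differ even though the class-conditionals are identical, which is the situation in which marginal alignment is no longer a valid proxy.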
Recently, [18] proposed to align the conditional shift, assuming that there is no label shift. However, it is ill-posed to consider only one of the conditional or label shifts [32, 14]. To mitigate this, both the conditional and label shifts are taken into account for DA from a causal interpretation view [32, 10]. However, the linearity assumption of these methods might be too restrictive for real-world challenges.
In this work, we first analyze the different shift conditions in real-world DG, and rethink the limitation of conventional IFL under different shift assumptions. Targeting the conditional and label shifts, we propose to explicitly align the class-conditional distribution $p(f(x)|y)$ and the label posterior via variational Bayesian inference and posterior label alignment. Note that the fine-grained class-wise $p(f(x)|y)$ alignment can lead to $p(f(x))$ alignment following the law of total probability.
Aligning the conditional distribution across source domains under the label shift is usually intractable. Thus, we propose to infer the domain-specific variational approximations of these conditional distributions, and reduce the divergence among these approximations.
Specifically, we enforce the conditional domain invariance by optimizing two objectives. The first one makes the approximate conditional distributions indistinguishable across domains by matching their reparameterized formulations (i.e., the mean and variance of a Gaussian distribution). The second one maximizes the probability of observing the input $x$ given the latent representation and domain label, which is achieved by a domain-wise likelihood learning network. Assuming that the conditional shift is aligned, we can then align the posterior classifier with the label distribution in a plug-and-play manner.
The main contributions are summarized as follows:
We explore both the conditional and label shifts in various DG tasks, and investigate the limitation of conventional IFL methods under different shift assumptions.
We propose a practical and scalable method to align the conditional shift via the variational Bayesian inference.
The label shift can be explicitly aligned by the posterior alignment operation.
We empirically validate the effectiveness and generality of our framework on multiple challenging benchmarks with different backbone models, and demonstrate superior performance over the comparison methods.
2 Related Work
DG assumes that we do not have prior knowledge about the target domain, in that we do not have access to labeled or unlabeled samples of the target domain at the training stage. Conventional DG methods can be divided into two categories. The first strategy aims to extract the domain invariant features with IFL. A typical solution is based on adversarial learning, which reduces the inter-domain divergence between the feature representation distributions [17, 20].
The other strategy focuses on the fusion of domain-specific feature representations (DFR-fusion). [22] builds domain-specific classifiers with multiple independent convolutional networks. It then uses a domain-agnostic component to fuse the probabilities of a target sample belonging to the different source domains. [4] infers the domain-invariant feature by matching its low-rank structure with domain-specific features. Typically, these DG methods assume that $p(y)$ is invariant across domains. Therefore, aligning $p(f(x))$ can be a good alternative to aligning the conditional shift. However, this assumption is often violated due to the label shift in real-world applications. Therefore, independent conditional and label shift assumptions are more realistic in real-world applications.
Domain shifts in DA can be categorized into covariate, label, conditional, and concept shifts [25, 31]. In this work, we examine these concepts and adapt their causal relationships to DG, as summarized in Fig. 2. Conventionally, each shift is studied independently, by assuming that the other shifts are invariant. For example, [18] aligns the conditional shift assuming that no label shift occurs. We note that the concept shift usually has not been considered in DG tasks, since most of the prior work assumes that an object keeps the same label in different domains. Some recent works [32, 10, 3] assume that both conditional and label shifts exist in DA tasks and tackle the problem with a causal inference framework. However, their linearity assumption and location-scale transform are too restrictive to be applied in many real-world applications. It is worth noting that under the conditional and label shift assumption, $y$ is the cause of $x$, and therefore it is natural to infer $p(x|y)$ of different domains directly, as in [27, 9], with a likelihood maximization network.
In this work, we propose a novel inference graph as shown in Fig. 1 to explicitly incorporate the conditional dependence, which is trained via variational Bayesian inference.
We denote the input sample, class label, and domain spaces as $\mathcal{X}$, $\mathcal{Y}$, and $\mathcal{D}$, respectively. With random variables $x \in \mathcal{X}$, $y \in \mathcal{Y}$, and $d \in \mathcal{D}$, we can define the probability distribution of each domain as $p(x, y | d)$. For the sake of simplicity, we assume $y$ and $d$ are discrete variables, for which $\mathcal{Y}$ is the set of class labels. In DG, we are given $N$ source domains to train the latent representation encoder $f$ and the representation classifier. The trained and fixed encoder and classifier are then used to predict the labels of samples drawn from the marginal distribution of an "unseen" target domain.
The conventional IFL assumes that $p(y|x)$ and $p(y)$ are invariant across domains. Since $f(\cdot)$ is a deterministic mapping function, $p(y|f(x))$ should also be invariant across domains. Therefore, if $p(f(x))$ of different domains are aligned, the conditional shift, $p(f(x)|y)$, is also aligned. However, with the conditional and label shift assumption, the alignment of $p(f(x)|y)$ and $p(y)$ is more challenging than the covariate shift, which only requires aligning the marginal distribution $p(f(x))$.
We note that with the law of total probability, $p(f(x))$ can be aligned if the fine-grained $p(f(x)|y)$ is aligned. Moreover, $p(y)$ for all source domains can be calculated by simply using the class label proportion in each domain. Besides, modeling $p(f(x)|y)$ is natural, since it follows the inherent causal relation under the conditional and label shift (see Fig. 2).
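Estimating the label distribution of each source domain from class label proportions amounts to simple counting. The following is a minimal sketch; the function name, the domain names, and the toy label lists are hypothetical and serve only as an illustration.

```python
from collections import Counter


def label_proportions(labels, num_classes):
    """Estimate p(y) in one domain as the empirical class frequencies."""
    counts = Counter(labels)
    total = len(labels)
    return [counts.get(k, 0) / total for k in range(num_classes)]


# Hypothetical integer class labels for two source domains.
labels_by_domain = {
    "domain_a": [0, 0, 0, 1],
    "domain_b": [0, 1, 1, 1],
}

# One label distribution per source domain; they differ under label shift.
priors = {name: label_proportions(labels, num_classes=2)
          for name, labels in labels_by_domain.items()}
```

Here the two domains contain the same classes in different ratios, so the estimated label distributions differ, which is exactly the label shift setting discussed above.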
For simplicity and consistency with the notation of autoencoders, we denote $f(x)$ as $z$, which is the latent variable encoded from $x$ by $f$ (we use $f(x)$ and $z$ interchangeably to align with the conventional IFL and variational autoencoder literature, respectively). We note that $z$ is dependent on its corresponding input sample $x$. The class-conditional distribution $p(f(x)|y)$ can be reformulated as $p(z|x, y)$.
The corresponding inference graph and detailed framework are shown in Fig. 1 and Fig. 3, respectively. When inferring the latent representation $z$, we explicitly concatenate $x$ and $y$ as input. Moreover, the final class prediction is made by a posterior alignment of the label shift, which also depends on the label distribution $p(y)$.
3.1 Variational Bayesian Conditional Alignment
Although $p(f(x))$ and $p(y)$ for all source domains can be modeled by IFL and class label proportion, respectively, the class-conditional distribution is usually intractable for moderately complicated likelihood functions, e.g., neural networks with nonlinear hidden layers. While this could be addressed by Markov chain Monte Carlo (MCMC) simulation, it requires expensive iterative inference schemes per data point and does not scale to large-scale high-dimensional data.
To align the class-dependent conditional distribution across different domains, we first construct its approximate Gaussian distribution, and resort to variational Bayesian inference to bridge it with a simple Gaussian family for which the inference is tractable. Specifically, we have the following proposition:
Proposition 1. The minimization of the inter-domain conditional shift is achieved when its approximate distribution is invariant across domains, and the KL divergence between the approximate distribution and the true conditional distribution is minimized.
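Proof 1. A simple justification is as follows. If the approximate distributions of two domains are identical and each of them matches its true conditional distribution, then the true conditional distributions of the two domains are identical as well. The log-likelihood of an input decomposes into two terms: the KL divergence of the approximate from the true posterior, and the evidence lower bound (ELBO) of the likelihood. The ELBO can in turn be rewritten (Eq. (3)) as the negative KL divergence of the approximate distribution from the prior plus the expected log-likelihood of the input given the latent representation, where $\mathbb{E}$ denotes the expectation. Therefore, approximating the true conditional distribution with the approximate one requires two objectives, i.e., minimizing the KL divergence from the prior, while maximizing the expectation of the log-likelihood.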
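The decomposition used in this justification is the standard one from variational autoencoders, and it can be verified numerically. The following is a minimal sketch on a toy model with a discrete latent variable; the prior, likelihood, and approximate posterior values are hypothetical, and the conditioning on the class label and domain is omitted for brevity.

```python
import math


def kl(q, p):
    """KL divergence KL(q || p) between two discrete distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


# Hypothetical toy model with a latent z in {0, 1, 2} and one fixed input x.
prior = [0.5, 0.3, 0.2]          # p(z)
likelihood = [0.10, 0.40, 0.70]  # p(x|z) evaluated at the fixed x
q = [0.2, 0.3, 0.5]              # approximate posterior q(z|x)

# Evidence p(x) and true posterior p(z|x) by Bayes' theorem.
evidence = sum(pz * lx for pz, lx in zip(prior, likelihood))
posterior = [pz * lx / evidence for pz, lx in zip(prior, likelihood)]

# ELBO = E_q[log p(x|z)] - KL(q(z|x) || p(z)).
expected_log_lik = sum(qi * math.log(lx) for qi, lx in zip(q, likelihood))
elbo = expected_log_lik - kl(q, prior)

# Gap between the log-evidence and the ELBO: KL(q(z|x) || p(z|x)).
gap = kl(q, posterior)
```

Since `math.log(evidence)` equals `gap + elbo` and the gap is non-negative, raising the ELBO (small KL to the prior, large expected log-likelihood) pushes the approximate distribution toward the true posterior.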
For the $N$ source domains, the prior distribution in the variational model is, e.g., a multivariate Gaussian distribution. When the prior is the same Gaussian distribution for all source domains, the first objective in Eq. (3), i.e., the KL divergence from the prior, can explicitly enforce the approximate distribution to be invariant across domains.
Indeed, if the approximate distributions of two domains $i$ and $j$ both match the shared prior, then they also match each other.
The first optimization objective of Eq. (3) aims to align the conditional distribution across the source domains. Since the prior distribution is a multivariate Gaussian distribution, it is also natural to configure the approximate distribution as a multivariate Gaussian. In practice, we follow the reparameterization trick of the variational autoencoder [13], such that the inference model has two outputs, i.e., $\mu$ and $\sigma^2$, which are the mean and variance of the Gaussian distribution. Then, $z = \mu + \sigma \odot \epsilon$, where $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. Without loss of generality, the resulting KL objective is averaged over the inputs in a batch from each domain. Usually, we set the prior to be the standard multivariate Gaussian distribution, where the mean and variance are $\mathbf{0}$ and $\mathbf{I}$, respectively.
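For a diagonal Gaussian inference model and a standard Gaussian prior, both the reparameterized sample and the KL term have well-known closed forms. The following is a minimal sketch of these two pieces in plain Python; parameterizing the variance by its logarithm (`log_var`) and the function names are assumptions made for illustration, not details taken from the paper.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), element-wise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def batch_kl(batch):
    """Average the KL term over the (mu, log_var) pairs of one domain batch."""
    return sum(kl_to_standard_normal(mu, lv) for mu, lv in batch) / len(batch)


rng = random.Random(0)
z = reparameterize([0.0, 1.0], [0.0, 0.0], rng)  # one 2-d latent sample
loss = batch_kl([([0.0, 0.0], [0.0, 0.0]), ([1.0, 0.0], [0.0, 0.0])])
```

The KL term is zero exactly when the inferred mean is zero and the variance is one, so driving it down for every source domain pulls all domain-specific approximations toward the same prior.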
The second optimization objective of Eq. (3) aims to maximize the probability of observing the input $x$ given $z$ and $d$. We propose to configure a likelihood maximization network for this purpose. It maximizes the likelihood that the latent feature representation of images in a specific domain can effectively represent the same-class images in this domain. In practice, this network contains $N$ sub-branches, each of which corresponds to a domain. At the training stage, we choose the corresponding branch according to the source domain label $d$. Its loss solves the maximum likelihood problem by minimizing the difference between the input data and the generated data in the corresponding domain. We note that the likelihood maximization network is only used in training.
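The branch selection described above can be sketched as follows. This is a minimal illustration under stated assumptions: the decoders are toy linear maps rather than CNN decoders, and the difference between input and generated data is measured by the mean squared error, which corresponds to a Gaussian likelihood with fixed variance. All names are hypothetical.

```python
def make_linear_decoder(weights):
    """Toy decoder branch: x_hat[i] = sum_j weights[i][j] * z[j]."""
    def decode(z):
        return [sum(w * zj for w, zj in zip(row, z)) for row in weights]
    return decode


def reconstruction_loss(x, z, domain, decoders):
    """Mean squared error between x and the output of the domain's branch."""
    x_hat = decoders[domain](z)  # choose the branch by the source domain label
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


# One hypothetical decoder branch per source domain.
decoders = [
    make_linear_decoder([[1.0, 0.0], [0.0, 1.0]]),  # branch for domain 0
    make_linear_decoder([[2.0, 0.0], [0.0, 2.0]]),  # branch for domain 1
]

loss_domain_0 = reconstruction_loss([1.0, 2.0], [1.0, 2.0], 0, decoders)
loss_domain_1 = reconstruction_loss([1.0, 2.0], [1.0, 2.0], 1, decoders)
```

Only the branch indexed by the domain label receives a training signal for a given sample, which is why the branches are needed at training time only.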
Posterior collapse is a long-standing problem of VAEs that use a continuous Gaussian prior. Recent progress on discrete priors or flow models could potentially be added to our model. However, our training does not suffer from it significantly. Note that each of the decoders is trained with only about $1/N$ of the data seen by the encoder, and it is likely that the weak decoder is also helpful. Using an approximation with a multivariate Gaussian prior offers much better tractability, as in the variational autoencoder (VAE), which is good for posterior modeling.
Inferring the latent representation requires knowing the label information in advance, since we are modeling the approximate conditional distribution. Although $y$ is always available in training, the ground-truth $y$ is not available in testing. We note that the likelihood maximization network is only used in training, where the required labels are always available.
To alleviate this limitation, we infer the label from the input image as a prior to control the behavior of the conditional distribution matching module. Specifically, we configure a label-prior network to infer the pseudo-label, and use the pseudo-label as input to the posterior label alignment classifier in both training and testing. Our label-prior network is trained by the cross-entropy loss with the ground-truth label $y$.
Moreover, at the training stage, we can further utilize the ground-truth $y$ and minimize an additional KL divergence term to update the encoder. Minimizing this term is not mandatory, but it can encourage the encoder to become familiar with the noisy pseudo-label and learn to compensate for the noisy prediction. We note that assigning the label to a uniform histogram, as done in the dialog system [19] to fill the missing variate, can degenerate the class-conditional modeling into unconditional modeling in our DG task. Therefore, the pseudo-label will be post-processed by both the encoder and the classifier, which may adjust the unreliable pseudo-label.
3.3 Posterior Alignment with Label Shift
Finally, we align the label posterior to obtain the final classifier. Since the classifier is deployed on all of the source domains, we can regard all of the source domains as a single domain, and denote the classifier as $p(y|z) = p(z|y)\,p(y)/p(z)$, where $p(z|y)$, $p(z)$, and $p(y)$ are its class-conditional, latent representation, and label distributions, respectively.
Suppose that all the conditional distributions and the latent distributions are aligned to each other using variational Bayesian inference. They should then also be aligned with $p(z|y)$ and $p(z)$ of the merged domain. Therefore, the posterior alignment and the final prediction of a sample from a given source domain can be formulated in terms of the $k$-th element of the classifier's softmax prediction. Here, we also calculate the cross-entropy loss between the aligned prediction and the ground-truth label $y$.
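One common way to realize such a posterior alignment is the standard prior-ratio correction for label shift: each softmax element is reweighted by the ratio between the domain-specific and the pooled label distribution, and the result is renormalized. The following is a minimal sketch of that generic correction; it is an assumption about the form of the alignment rather than the exact formula of this work, and all names are hypothetical.

```python
def align_posterior(softmax_probs, domain_prior, pooled_prior):
    """Reweight class probabilities by domain_prior / pooled_prior, then renormalize.

    softmax_probs[k] is the classifier output for class k under the pooled
    label distribution; the result is the posterior under the domain-specific
    label distribution, assuming a shared class-conditional p(z|y).
    """
    weighted = [p * d / s
                for p, d, s in zip(softmax_probs, domain_prior, pooled_prior)]
    total = sum(weighted)
    return [w / total for w in weighted]


# An undecided prediction is pulled toward the class that dominates the domain.
aligned = align_posterior([0.5, 0.5], [0.9, 0.1], [0.5, 0.5])

# With no label shift the prediction is left unchanged.
unchanged = align_posterior([0.7, 0.3], [0.5, 0.5], [0.5, 0.5])
```

Because the correction only needs the label proportions, it can be applied on top of a trained classifier without retraining, in line with the plug-and-play use described in this section.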
As detailed in Algorithm 1, each of the four networks, i.e., the encoder, the likelihood maximization network, the label-prior network, and the classifier, is updated with its corresponding loss.
This section demonstrates the effectiveness of our variational Bayesian inference framework under conditional and label shift (VBCLS) on the classic VLCS DG benchmark for image classification and the PACS benchmark for object recognition with domain shift.
4.1 Implementation Details
The domain-invariant encoder and the posterior alignment classifier use the same encoder and classifier structures as the compared models (e.g., AlexNet and ResNet18). The label-prior network is simply the concatenation of the encoder and the classifier, and the likelihood maximization network uses the reversed CNN decoder structure of the encoder. We implement our method in PyTorch and set the loss weights empirically via grid search. In our experiments, the performance is not sensitive to these hyperparameters over a relatively large range.
Following [15, 16], we use models pre-trained on the ILSVRC dataset and apply label smoothing on the task classifier in order to prevent overfitting. The model is trained for 30 epochs via mini-batch stochastic gradient descent (SGD) with a batch size of 128, a momentum of 0.9, and weight decay. The initial learning rate is scaled by a factor of 0.1 after 80% of the epochs. For the VLCS dataset, a lower initial learning rate is used, since a high learning rate leads to early convergence due to overfitting in the source domains. In addition, the learning rate of the domain discriminator and the classifier is set to be larger (i.e., 10 times) than that of the feature extractor.
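The step schedule described above can be written down in a few lines. The following is a minimal sketch; `base_lr` is a placeholder argument because the initial learning rate is not recoverable from the text, epochs are assumed to be counted from zero, and the function names are hypothetical.

```python
def learning_rate(epoch, base_lr, total_epochs=30, decay_percent=80, factor=0.1):
    """Step schedule: scale base_lr by `factor` after `decay_percent`% of the epochs."""
    decay_epoch = total_epochs * decay_percent // 100  # epoch 24 for 30 epochs
    return base_lr * factor if epoch >= decay_epoch else base_lr


def group_learning_rates(epoch, base_lr, head_multiplier=10.0):
    """Return (feature extractor lr, classifier lr); the classifier uses a 10x larger rate."""
    lr = learning_rate(epoch, base_lr)
    return lr, lr * head_multiplier


# Learning rates for all 30 epochs with a placeholder base value of 1.0.
schedule = [learning_rate(epoch, base_lr=1.0) for epoch in range(30)]
```

With 30 epochs, the first 24 epochs use the base rate and the last 6 use one tenth of it.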
For the ablation study, VBCLS- denotes without posterior alignment and VBCLS- denotes without minimizing .
4.2 VLCS Dataset
The VLCS dataset contains images from four different datasets, including PASCAL VOC2007 (V), LabelMe (L), Caltech-101 (C), and SUN09 (S). Different from PACS, VLCS offers photo images taken with different camera types or composition biases. The domains V, L, C, and S have 3,376, 2,656, 1,415, and 3,282 instances, respectively. Five shared classes are collected to form the label space, including bird, car, chair, dog, and person. We follow previous works and use the publicly available pre-extracted DeCAF6 features [6] (4,096-dimensional vectors) for leave-one-domain-out validation, randomly splitting each domain into 70% training and 30% testing. We report the mean over five independent runs for our results.
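The evaluation protocol above can be sketched as follows. This is a minimal illustration, assuming that the 70% cut is rounded down and that a fixed seed is used for shuffling; neither detail is stated in the text, and the function names are hypothetical. The domain sizes are the ones listed above.

```python
import random

DOMAIN_SIZES = {"V": 3376, "L": 2656, "C": 1415, "S": 3282}


def split_indices(num_samples, train_percent=70, seed=0):
    """Randomly split sample indices into training and testing parts."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    cut = num_samples * train_percent // 100  # rounded down (assumption)
    return indices[:cut], indices[cut:]


def leave_one_domain_out(domains):
    """Yield (source domains, target domain), holding out each domain once."""
    for target in domains:
        yield [d for d in domains if d != target], target


# Per-domain 70/30 split and the four leave-one-domain-out protocols.
splits = {name: split_indices(size) for name, size in DOMAIN_SIZES.items()}
protocols = list(leave_one_domain_out(list(DOMAIN_SIZES)))
```

Each of the four protocols trains on the training parts of three domains and tests on the held-out domain.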
The mean accuracy on the test partition using leave-one-domain-out validation on VLCS is reported in Table 1. We also compare our method with the DG models trained using the same architecture and source domains.
Our results indicate that our VBCLS outperforms the covariate shift setting methods (e.g., CCSA [26], MMD-AAE [17]) by a large margin. The improvement over the conditional-shift-only method CIDDG [18] is significant, demonstrating the necessity of incorporating both conditional and label shifts. When compared with the recent SOTA methods, e.g., the self-challenging based RSC [12], the information bottleneck based MetaVIB [7], and the self-training based EISNet [29], we can observe that our VBCLS yields better performance in almost all cases. We note that JiGen [1] relies on solving jigsaw puzzles, which is essentially different from IFL.
The good performance of our strategy indicates that the invariant feature learning works well, and that the conditional and label shift assumption can be an inherent property that needs to be addressed for real-world DG challenges. The discrepancy between the marginal label distributions is measured via the KL divergence, as in semi-supervised learning with the sample selection bias problem [30]. The impact of the label shift is empirically illustrated in Fig. 4 (left). In Fig. 4 (right), we show that the label refinement can effectively estimate the testing label distribution without tuning the network. The label alignment is relatively accurate after 3 epochs and is almost stable after 5 epochs.
In the ablation study, we can see that posterior alignment is necessary if there is label shift. Besides, the performance of the label-prior network can be improved by minimizing the additional KL divergence term on the encoder.
Another evaluation protocol on VLCS is to examine whether eliminating examples from one source domain impacts the performance on the target domain. This protocol is designed to evaluate the impact of the diversity of the source domains on the target accuracy. In this experiment, each target domain is evaluated on models trained with every combination of the remaining domains as the source domains, as shown in Table 2. In addition, the CIDDG baseline is included for reference. The results in Table 2 demonstrate that, for all target domains, reducing the number of source domains from 3 (see Table 1) to 2 degrades the performance for all combinations of the source domains. In some cases, excluding a particular source from training substantially degrades the target performance. However, our VBCLS is still more robust in these cases.
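The set of experiments in this protocol can be enumerated mechanically. The following is a minimal sketch using the four VLCS domain abbreviations; the function name is hypothetical.

```python
from itertools import combinations


def two_source_protocols(domains):
    """List every (source pair, target) setting with one source domain removed."""
    protocols = []
    for target in domains:
        remaining = [d for d in domains if d != target]
        # Every pair of the three remaining domains serves as the sources.
        for sources in combinations(remaining, 2):
            protocols.append((sources, target))
    return protocols


protocols = two_source_protocols(["V", "L", "C", "S"])
```

For four domains this gives three source pairs per target, i.e., twelve settings in total.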
4.3 PACS Dataset
The object recognition benchmark PACS [15] consists of images divided into 7 classes from four different domains, including Photo (P), Art painting (A), Cartoon (C), and Sketch (S). In Table 3, we provide the detailed statistics of the number of samples in each domain. It is clear that the class ratio differs across domains, which indicates a significant label shift.
As shown in the results (cf. Table 4), the Sketch domain produces the lowest accuracy when used as the target domain, and therefore it is deemed the most challenging one. In light of this, we follow the previous work to tune the model using the S domain as the target domain and reuse the same hyperparameters for the experiments with the remaining domains. In Table 4, we show the results of our method, averaged over 5 different initializations, alongside all the other comparison methods. Overall, our method yields better average performance over all target domains than previous SOTA methods. Among them, CrossGrad [28] synthesizes data for a new domain, and MMLD [23] uses a mixture of multiple latent domains. Our method is also comparable to the recent self-challenging based RSC [12], the information bottleneck based MetaVIB [7], and the self-training based EISNet [29]. More importantly, we can observe that our method outperforms CIDDG [18], an adversarial conditional IFL strategy, by a large margin. The ablation studies on PACS are also consistent with the results on VLCS. We also provide the results using the ResNet18 backbone.
| Method | P | A | C | S | Avg. |
|---|---|---|---|---|---|
| JiGen 2019 [Res18] | 96.03 | 79.42 | 75.25 | 71.35 | 80.51 |
| MMLD 2020 [Res18] | 96.09 | 81.28 | 77.16 | 72.29 | 81.83 |
| EISNet 2020 [Res18] | 95.93 | 81.89 | 76.44 | 74.33 | 82.15 |
| RSC 2020 [Res18] | 95.99 | 83.43 | 80.85 | 80.31 | 85.15 |
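As a sanity check on the rows above, the last column can be compared with the mean of the four preceding values. The following is a minimal sketch, assuming that the first four numbers of each row are per-domain accuracies and the fifth is their reported average; the tolerance of 0.01 is an arbitrary choice that allows for two-decimal rounding.

```python
ROWS = {
    "JiGen": ([96.03, 79.42, 75.25, 71.35], 80.51),
    "MMLD": ([96.09, 81.28, 77.16, 72.29], 81.83),
    "EISNet": ([95.93, 81.89, 76.44, 74.33], 82.15),
    "RSC": ([95.99, 83.43, 80.85, 80.31], 85.15),
}


def mean_accuracy(per_domain):
    """Unweighted mean of the per-domain accuracies."""
    return sum(per_domain) / len(per_domain)


def flag_mismatches(rows, tolerance=0.01):
    """Return the methods whose reported average differs from the recomputed mean."""
    return [name for name, (per_domain, reported) in rows.items()
            if abs(mean_accuracy(per_domain) - reported) > tolerance]


mismatches = flag_mismatches(ROWS)
```

Under this reading, the JiGen, EISNet, and RSC rows agree with their recomputed means to within rounding, while the MMLD row differs by about 0.13, which may stem from how the average was computed in its original source.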
In this work, we aim to establish a more realistic assumption in DG, namely that both conditional and label shifts arise independently and simultaneously. We theoretically analyze the limitations of conventional IFL under the different shift assumptions. Motivated by this, a concise yet effective VBCLS framework based on variational Bayesian inference with posterior alignment is proposed to reduce both the conditional and label shifts. Extensive evaluations verify our analysis and demonstrate the effectiveness of our approach compared with IFL approaches.
This work was supported in part by the Hong Kong GRF 152202/14E, PolyU Central Research Grant G-YBJW, and Jiangsu NSF (BK20200238).
[1] (2019) Domain generalization by solving jigsaw puzzles. In CVPR.
[2] (2021) Deep verifier networks: verification of deep discriminative models with deep generative models. AAAI.
[3] (2020) Domain adaptation with conditional distribution matching and generalized label shift. arXiv.
[4] (2017) Deep domain generalization with structured low-rank constraint. TIP.
[5] (2018) Importance weighting and variational inference. In NIPS.
[6] (2014) Decaf: a deep convolutional activation feature for generic visual recognition. In ICML.
[7] (2020) Learning to learn with variational information bottleneck for domain generalization. ECCV.
[8] (2017) Scatter component analysis: a unified framework for domain adaptation and domain generalization. IEEE T-PAMI.
[9] (2018) Causal generative domain adaptation networks. arXiv.
[10] (2016) Domain adaptation with conditional transferable components. In ICML.
[11] (2019) Domain generalization via multidomain discriminant analysis. In UAI.
[12] (2020) Self-challenging improves cross-domain generalization. ECCV.
[13] (2013) Auto-encoding variational bayes. arXiv.
[14] An introduction to domain adaptation and transfer learning. arXiv preprint arXiv:1812.11806.
[15] (2017) Deeper, broader and artier domain generalization. In ICCV.
[16] (2019) Episodic training for domain generalization. In ICCV.
[17] (2018) Domain generalization with adversarial feature learning. In CVPR.
[18] (2018) Deep domain generalization via conditional invariant adversarial networks. In ECCV.
[19] (2019) Learning to select knowledge for response generation in dialog systems. IJCAI.
[20] (2021) Mutual information regularized feature-level frankenstein for discriminative recognition. PAMI.
[21] (2021) Subtype-aware unsupervised domain adaptation for medical diagnosis. AAAI.
[22] (2018) Best sources forward: domain generalization through source-specific nets. In ICIP.
[23] (2020) Domain generalization using a mixture of multiple latent domains. In AAAI.
[24] (2020) Domain generalization using a mixture of multiple latent domains. AAAI.
[25] (2012) A unifying view on dataset shift in classification. Pattern Recognition 45 (1), pp. 521–530.
[26] (2017) Unified deep supervised domain adaptation and generalization. In ICCV.
[27] (2012) On causal and anticausal learning. arXiv.
[28] (2018) Generalizing across domains via cross-gradient training. arXiv.
[29] (2020) Learning from extrinsic and intrinsic supervisions for domain generalization. ECCV.
[30] (2004) Learning and evaluating classifiers under sample selection bias. In ICML.
[31] (2015) Multi-source domain adaptation: a causal view. In AAAI.
[32] (2013) Domain adaptation under target and conditional shift. In ICML.