1 Introduction
Deep learning usually relies on the assumption that training and testing data are independent and identically distributed (i.i.d.), while target tasks are often significantly heterogeneous and diverse [2, 20]. This has motivated extensive research on domain adaptation (DA) [21] and domain generalization (DG) [24], in which the shift between source and target domains is expected to be compensated for by a variety of adaptation steps. Despite the success of DA in several tasks, much of the prior work relies on a large number of labeled or unlabeled target samples for training.
The recently proposed DG task [24] assumes that several labeled source domains are available in training, without access to target samples or labels. Collectively exploiting these source domains can potentially lead to a trained system that generalizes well to a target domain [11]. A predominant stream in DG is domain invariant feature learning (IFL) [8], which attempts to enforce $p_i(f(x)) = p_j(f(x))$, where $i$ and $j$ index two different source domains. The typical solution can be obtained via momentum or adversarial training [8].
However, different classes can follow different domain shift protocols, e.g., a street lamp is bright at night, while a pedestrian is shrouded in darkness. Therefore, we would expect a fine-grained class-wise alignment of the conditional shift w.r.t. $p(f(x)|y)$, where $f(\cdot)$ is a feature extractor [32, 10, 3].
Assuming there is no concept shift (i.e., $p_i(y|x) = p_j(y|x)$) and no label shift (i.e., $p_i(y) = p_j(y)$), and given Bayes' theorem, $p(f(x)|y) = \frac{p(y|f(x))\,p(f(x))}{p(y)}$, IFL is able to align $p(f(x)|y)$ if $p_i(f(x)) = p_j(f(x))$. However, label shift [21], i.e., class imbalance, is quite common in DG, as illustrated in Fig. 1. Since $f(\cdot)$ is a deterministic mapping function, IFL is able to encode the domain invariant representation under the covariate shift assumption (i.e., only $p_i(x) \neq p_j(x)$) [25]. Under label shift, the $p(f(x))$ alignment cannot be used as an alternative to the $p(f(x)|y)$ alignment (i.e., $p_i(f(x)|y) = p_j(f(x)|y)$) [18]. In practice, both conditional and label shifts occur in most DG tasks.
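The interaction between label shift and marginal feature alignment can be illustrated with a small numerical sketch. The class-conditionals and label priors below are hypothetical toy values, not taken from any benchmark: two domains share the same class-conditional feature distribution but differ in their label priors, and the resulting marginals differ, so forcing the marginals to match would have to distort the class-conditionals.

```python
# Toy check: identical class-conditionals p(z|y) but different label priors p(y)
# give different marginals p(z) via the law of total probability.

def marginal(cond, prior):
    """Return p(z) = sum_y p(z|y) p(y) for discrete z and y."""
    n_z = len(cond[0])
    return [sum(cond[y][z] * prior[y] for y in range(len(prior))) for z in range(n_z)]

# Hypothetical shared class-conditionals over 3 feature bins and 2 classes.
p_z_given_y = [[0.8, 0.1, 0.1],   # class 0
               [0.1, 0.1, 0.8]]   # class 1

prior_i = [0.5, 0.5]   # balanced source domain i
prior_j = [0.9, 0.1]   # imbalanced source domain j (label shift)

p_z_i = marginal(p_z_given_y, prior_i)
p_z_j = marginal(p_z_given_y, prior_j)

# Largest per-bin difference between the two marginals.
gap = max(abs(a - b) for a, b in zip(p_z_i, p_z_j))
print(p_z_i, p_z_j, gap)
```

Under these assumed values the marginals are clearly different even though the class-conditionals are identical, which is the situation in which marginal alignment alone is not a substitute for class-wise alignment.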
Recently, [18] proposed to align the conditional shift, assuming that there is no label shift. However, it is ill-posed to consider only one of the conditional and label shifts [32, 14]. To mitigate this, both the conditional and label shifts have been taken into account for DA from a causal interpretation view [32, 10]. However, the linearity assumption of these methods might be too restrictive for real-world challenges.
In this work, we first analyze the different shift conditions in real-world DG, and rethink the limitation of conventional IFL under different shift assumptions. Targeting the conditional and label shifts, we propose to explicitly align $p(f(x)|y)$ and $p(y)$ via variational Bayesian inference and posterior label alignment. Note that the fine-grained class-wise $p(f(x)|y)$ alignment can lead to $p(f(x))$ alignment following the law of total probability, $p(f(x)) = \sum_{y} p(f(x)|y)\,p(y)$ [32].
Aligning the conditional distribution across source domains under label shift is usually intractable. Thus, we propose to infer domain-specific variational approximations of these conditional distributions, and to reduce the divergence among these approximations.
Specifically, we enforce the conditional domain invariance by optimizing two objectives. The first one makes the approximate conditional distributions indistinguishable across domains by matching their reparameterized formulations (i.e., the mean and variance of a Gaussian distribution). The second one maximizes the probability of observing the input $x$ given the latent representation and domain label, which is achieved by a domain-wise likelihood learning network. Assuming that the conditional shift is aligned, we can then align the posterior classifier with the label distribution in a plug-and-play manner.
The main contributions are summarized as follows:
We explore both the conditional and label shifts in various DG tasks, and investigate the limitation of conventional IFL methods under different shift assumptions.
We propose a practical and scalable method to align the conditional shift via the variational Bayesian inference.
The label shift can be explicitly aligned by the posterior alignment operation.
We empirically validate the effectiveness and generality of our framework on multiple challenging benchmarks with different backbone models, and demonstrate superior performance over the comparison methods.
2 Related Work
DG assumes that we have no prior knowledge about the target domain, in that we have no access to labeled or unlabeled samples of the target domain at the training stage [24]. Conventional DG methods can be divided into two categories. The first strategy aims to extract domain invariant features with IFL [8]. A typical solution is based on adversarial learning, which reduces the inter-domain divergence between the feature representation distributions [17, 20].
The other strategy focuses on the fusion of domain-specific feature representations (DFR-fusion). [22] builds domain-specific classifiers with multiple independent convolutional networks, and then uses a domain-agnostic component to fuse the probabilities of a target sample belonging to different source domains. [4] infers the domain-invariant feature by matching its low-rank structure with domain-specific features. Typically, these DG methods assume that $p(y)$ is invariant across domains. Therefore, aligning $p(f(x))$ can be a good alternative to aligning the conditional shift. However, this assumption is often violated due to label shift in real-world applications. Therefore, independent conditional and label shift assumptions are more realistic.
Domain shifts in DA can be categorized into covariate, label, conditional, and concept shifts [25, 31]. In this work, we examine these concepts and adapt their causal relationships to DG, as summarized in Fig. 2. Conventionally, each shift is studied independently, by assuming that the other shifts are invariant [14]. For example, [18] aligns the conditional shift assuming that no label shift occurs. We note that the concept shift usually has not been considered in DG tasks, since most of the prior work assumes that an object has the same label in different domains. Some recent works [32, 10, 3] assume that both conditional and label shifts exist in DA tasks and tackle the problem with a causal inference framework. However, their linearity assumption and location-scale transform are too restrictive to be applied in many real-world applications. It is worth noting that under the conditional and label shift assumption, $y$ is the cause of $x$, and therefore it is natural to infer $p(x|y)$ of different domains directly, as in [27, 9], with a likelihood maximization network.
In this work, we propose a novel inference graph as shown in Fig. 1 to explicitly incorporate the conditional dependence, which is trained via variational Bayesian inference.
3 Methodology
We denote the input sample, class label, and domain spaces as $\mathcal{X}$, $\mathcal{Y}$, and $\mathcal{D}$, respectively. With random variables $x \in \mathcal{X}$, $y \in \mathcal{Y}$, and $d \in \mathcal{D}$, we can define the probability distribution of each domain as $p(x, y | d)$. For the sake of simplicity, we assume $y$ and $d$ are discrete variables, for which $\mathcal{Y} = \{1, 2, \dots, K\}$ is the set of class labels. In DG, we are given $N$ source domains $\{p_i(x, y)\}_{i=1}^{N}$ to train the latent representation encoder $f(x)$ and the representation classifier $p(y|f(x))$ [17]. The trained and fixed encoder and classifier are used to predict the labels of samples drawn from the marginal distribution $p_t(x)$ of an "unseen" target domain $p_t(x, y)$.
The conventional IFL assumes that $p(y|x)$ and $p(y)$ are invariant across domains. Since $f(\cdot)$ is a deterministic mapping function, $p(y|f(x))$ should also be invariant across domains. Therefore, if $p(f(x))$ of different domains are aligned, the conditional shift, $p(f(x)|y)$, is also aligned. However, with the conditional and label shift assumption, the alignment of $p(f(x)|y)$ and $p(y)$ is more challenging than the covariate shift, which only requires aligning the marginal distribution $p(f(x))$.
We note that with the law of total probability, $p(f(x))$ can be aligned if the fine-grained $p(f(x)|y)$ is aligned [32]. Moreover, $p(y)$ for all source domains can be calculated by simply using the class label proportion in each domain. Besides, modeling $p(f(x)|y)$ is natural, since it follows the inherent causal relation under the conditional and label shift (see Fig. 2).
For simplicity and consistency with the notation of autoencoders, we denote $f(x)$ as $z$, which is the latent variable encoded from $x$ by $f$. (We use $f(x)$ and $z$ interchangeably to align with the conventional IFL and variational autoencoder literature, respectively.) We note that $z$ is dependent on its corresponding input sample $x$. The class conditional distribution $p(f(x)|y)$ can be reformulated as $p(z|x, y)$.
The corresponding inference graph and detailed framework are shown in Fig. 1 and Fig. 3, respectively. When inferring the latent representation $z$, we explicitly concatenate $x$ and $y$ as input. Moreover, the final class prediction is made by a posterior alignment of the label shift, which also depends on the label distribution $p(y)$.
3.1 Variational Bayesian Conditional Alignment
Although $p(f(x))$ and $p(y)$ for all source domains can be modeled by IFL and class label proportion, respectively, $p(f(x)|y)$ is usually intractable for moderately complicated likelihood functions, e.g., neural networks with nonlinear hidden layers. While this could be addressed by Markov chain Monte Carlo (MCMC) simulation, MCMC requires expensive iterative inference schemes per data point and does not scale to large-scale, high-dimensional data.
To align the class-dependent $p(z|x, y)$ across different domains, we first construct its approximate Gaussian distribution $q(z|x, y)$, and resort to variational Bayesian inference [13] to bridge it with a simple Gaussian family for which the inference is tractable. Specifically, we have the following proposition:
Proposition 1. The minimization of the inter-domain conditional shift of $p(z|x, y)$ is achieved when its approximate distribution $q(z|x, y)$ is invariant across domains, and the KL divergence between $p(z|x, y)$ and $q(z|x, y)$ is minimized.
Proof 1. A simple justification is as follows: if we have $p_i(z|x, y) = q_i(z|x, y)$, $p_j(z|x, y) = q_j(z|x, y)$, and $q_i(z|x, y) = q_j(z|x, y)$, then we have $p_i(z|x, y) = p_j(z|x, y)$.
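Following the variational bound [13], minimizing the KL divergence between $p_i(z|x, y)$ and $q_i(z|x, y)$ is equivalent to maximizing the evidence lower bound (ELBO) [5] of the likelihood $\log p_i(x|y)$, denoted by $\mathcal{L}_i$:
$$\log p_i(x|y) = D_{KL}\big(q_i(z|x, y)\,\|\,p_i(z|x, y)\big) + \mathcal{L}_i, \tag{1}$$
where the term $D_{KL}\big(q_i(z|x, y)\,\|\,p_i(z|x, y)\big)$ is the KL divergence of the approximate from the true posterior, and the ELBO of the likelihood, i.e., $\mathcal{L}_i$, can be rewritten as
$$\mathcal{L}_i = \mathbb{E}_{q_i(z|x, y)}\big[\log p_i(x, z|y) - \log q_i(z|x, y)\big], \tag{2}$$
which can be reformulated as
$$\mathcal{L}_i = -D_{KL}\big(q_i(z|x, y)\,\|\,p_i(z|y)\big) + \mathbb{E}_{q_i(z|x, y)}\big[\log p_i(x|z, y)\big], \tag{3}$$
where $\mathbb{E}$ denotes the expectation. Therefore, approximating $p_i(z|x, y)$ with $q_i(z|x, y)$ requires two objectives, i.e., minimizing $D_{KL}\big(q_i(z|x, y)\,\|\,p_i(z|y)\big)$, while maximizing the expectation of $\log p_i(x|z, y)$.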
For the $N$ source domains, $p_i(z|y)$ is the prior distribution in the variational model, e.g., a multivariate Gaussian distribution. When $z$ is sampled from the same Gaussian distribution and the prior is invariant across the source domains, the first objective in Eq. (3), i.e., the KL term, can explicitly enforce $q_i(z|x, y)$ to be invariant across domains.
By further incorporating the second objective of Eq. (3), we attempt to minimize the KL divergence between $q_i(z|x, y)$ and $p_i(z|x, y)$ as in Eq. (1). Then, $p_i(z|x, y)$ should be invariant across the source domains, i.e.,
$$p_i(z|x, y) = p_j(z|x, y), \quad \forall\, i, j \in \{1, \dots, N\}. \tag{4}$$
Actually, if we have , , and , , then .
The first optimization objective of Eq. (3) aims to align the conditional distribution $q_i(z|x, y)$ across the source domains. Since the prior distribution is a multivariate Gaussian, it is also natural to configure $q_i(z|x, y)$ as a multivariate Gaussian. Practically, we follow the reparameterization trick of the variational autoencoder [13], such that the inference model has two outputs, i.e., $\mu$ and $\sigma^2$, which are the mean and variance of the Gaussian distribution $q_i(z|x, y)$. Then, $z = \mu + \sigma \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$. Without loss of generality, we have:
$$\mathcal{L}_{KL}^{i} = \frac{1}{2 M_i} \sum_{m=1}^{M_i} \sum_{l} \Big( \mu_{m,l}^2 + \sigma_{m,l}^2 - \log \sigma_{m,l}^2 - 1 \Big), \tag{5}$$
where $M_i$ is the number of inputs in a batch from domain $i$, and $l$ indexes the latent dimensions. Usually, we set the prior to be the standard multivariate Gaussian distribution, where the mean and variance are $0$ and $I$, respectively.
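For a diagonal Gaussian posterior and a standard Gaussian prior, the KL term has a well-known closed form, and the reparameterization trick draws a sample as the mean plus the standard deviation times unit Gaussian noise. The following is a minimal standard-library sketch of both steps; the function names and the example mean and variance are illustrative assumptions, and a practical implementation would operate on batched tensors.

```python
import math
import random

def kl_to_standard_normal(mu, var):
    """Closed-form KL( N(mu, diag(var)) || N(0, I) ) for one sample."""
    return 0.5 * sum(v + m * m - 1.0 - math.log(v) for m, v in zip(mu, var))

def reparameterize(mu, var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mu, var)]

def batch_kl(mus, variances):
    """Average the per-sample KL over a batch from one domain."""
    terms = [kl_to_standard_normal(m, v) for m, v in zip(mus, variances)]
    return sum(terms) / len(terms)

rng = random.Random(0)
mu, var = [0.5, -1.0], [1.0, 0.25]   # hypothetical encoder outputs
z = reparameterize(mu, var, rng)
kl_value = kl_to_standard_normal(mu, var)
print(z, kl_value)
```

The KL value is zero exactly when the mean is zero and the variance is one in every dimension, so driving this term down for every source domain pulls all domain-specific posteriors toward the same prior.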
The second optimization objective of Eq. (3) aims to maximize the probability of observing the input $x$ given the latent representation $z$ and the domain label. We propose to configure a likelihood maximization network for this purpose. It maximizes the likelihood that the latent feature representation of images in a specific domain can effectively represent the same-class images in this domain. Practically, this network contains $N$ sub-branches, each of which corresponds to a domain. At the training stage, we choose the corresponding branch according to the source domain label. Its loss can be formulated as
(6) 
which solves the maximum likelihood problem by minimizing the difference between the input data and the generated data in the corresponding domain. We note that the likelihood maximization network is only used in training.
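The branch selection by domain label can be sketched as follows. This is a hypothetical illustration only: the branches are toy element-wise linear maps rather than CNN decoders, and the mean squared error is an assumed choice for the difference between the input and the generated data.

```python
# Hypothetical element-wise linear decoders, one branch per source domain.
def make_branch(scale):
    return lambda z: [scale * v for v in z]

branches = {0: make_branch(1.0), 1: make_branch(2.0), 2: make_branch(0.5)}

def reconstruction_loss(x, z, domain):
    """Mean squared error between x and the output of the branch chosen by `domain`."""
    x_hat = branches[domain](z)
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

z = [1.0, -2.0]
x = [2.0, -4.0]   # equals what branch 1 generates from z

loss_matching = reconstruction_loss(x, z, domain=1)   # branch of the sample's own domain
loss_other = reconstruction_loss(x, z, domain=0)      # a branch of another domain
print(loss_matching, loss_other)
```

Only the branch indexed by the sample's domain label contributes to the loss for that sample, so each branch is fitted to the data of a single source domain.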
Posterior collapse is a long-standing problem of VAEs that use a continuous Gaussian prior. Recent progress on discrete priors or flow models could potentially be added to our model. However, our training does not suffer from it significantly. Note that each of the decoders is trained with only about $1/N$ of the data seen by the encoder, and it is likely that such a weak decoder is also helpful. Using an approximation with a multivariate Gaussian prior offers much better tractability, as in the variational autoencoder (VAE), which is good for posterior modeling.
3.2 Label-prior
Inferring the latent representation $z$ requires knowing the label $y$ in advance, since we are modeling the approximate conditional distribution $q(z|x, y)$. Although $y$ is always available in training, the ground-truth $y$ is not available in testing. We note that the likelihood maximization network is only used in training, where the required supervision is always available.
To alleviate this limitation, we infer the label from the input image as a prior to control the behavior of the conditional distribution matching module. Specifically, we configure a label-prior network to infer the pseudo-label $\hat{y}$, and use it as input to the posterior label alignment classifier in both training and testing. Our label-prior network is trained by the cross-entropy loss with the ground-truth label $y$.
Moreover, at the training stage, we can further utilize the ground-truth $y$ and minimize an additional KL divergence term to update the encoder. Minimizing this term is not mandatory, but it can encourage the encoder to become familiar with the noisy $\hat{y}$ [19] and learn to compensate for the noisy prediction. We note that assigning $y$ to a uniform histogram, as in the dialog system [19], to fill in the missing variable can degenerate the modeling of $q(z|x, y)$ to $q(z|x)$ in our DG task. Therefore, the pseudo-label $\hat{y}$ will be post-processed by both the encoder and the classifier, which may adjust an unreliable $\hat{y}$.
3.3 Posterior Alignment with Label Shift
Finally, we align the label shift to obtain the final classifier. Since the classifier is deployed on all of the source domains, we can regard all of the source domains as a single domain, and denote the classifier as
$$p(y|z) = \frac{p(z|y)\,p(y)}{p(z)}, \tag{7}$$
where $p(z|y)$, $p(z)$, and $p(y)$ are its class-conditional, latent representation, and label distributions, respectively.
Suppose that all the conditional distributions $p_i(z|y)$ and the latent distributions $p_i(z)$ are aligned to each other using variational Bayesian inference; they should then also be aligned with $p(z|y)$ and $p(z)$. Therefore, the posterior alignment and the final prediction of a sample from domain $i$ can be formulated as
$$p_i(y = k|z) = \frac{p_i(y = k)}{p(y = k)}\, g_k(z), \tag{8}$$
where $g_k(z)$ is the $k$-th element value of the classifier's softmax prediction. Here, we also calculate the cross-entropy loss between $p_i(y|z)$ and the ground-truth label $y$.
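As a minimal sketch of posterior alignment under label shift, the softmax output of a classifier trained on the pooled source domains can be reweighted by the ratio between a domain's label prior and the pooled label prior. The function name, the explicit renormalization step, and the numbers below are illustrative assumptions, not the authors' exact implementation.

```python
def align_posterior(softmax_probs, pooled_prior, domain_prior):
    """Reweight class probabilities by domain_prior / pooled_prior and renormalize."""
    weighted = [g * d / p for g, d, p in zip(softmax_probs, domain_prior, pooled_prior)]
    total = sum(weighted)
    return [w / total for w in weighted]

# Hypothetical three-class example.
probs = [0.5, 0.3, 0.2]            # softmax prediction of the pooled classifier
pooled = [1 / 3, 1 / 3, 1 / 3]     # label proportions over all source domains
domain = [0.1, 0.3, 0.6]           # label proportions of one specific domain

aligned = align_posterior(probs, pooled, domain)
print(aligned)
```

The renormalization is a safeguard added here so that the output is a valid distribution even when the alignment assumptions hold only approximately. Since the label priors are simply class proportions, this step needs no additional trainable parameters.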
As detailed in Algorithm 1, we update with , update with , update with , and update with , respectively.
4 Experiments
This section demonstrates the effectiveness of our variational Bayesian inference framework under conditional and label shift (VBCLS) on the classic VLCS DG benchmark for image classification and the PACS benchmark for object recognition with domain shift.
4.1 Implementation Details
The domain invariant encoder and the posterior alignment classifier use the same encoder and classifier structures as our compared models (e.g., AlexNet and ResNet18). The label-prior network is simply the concatenation of an encoder and a classifier, and the likelihood maximization network uses the reversed CNN decoder structure of the encoder. We implement our method using PyTorch, and set the hyperparameters empirically via grid search. In our experiments, the performance is not sensitive to these hyperparameters over a relatively large range.
Following previous work on domain generalization [15, 16], we use models pre-trained on the ILSVRC dataset and label smoothing on the task classifier in order to prevent overfitting. The model is trained for 30 epochs via mini-batch stochastic gradient descent (SGD) with a batch size of 128, a momentum of 0.9, and weight decay. The learning rate is scaled by a factor of 0.1 after 80% of the epochs. For the VLCS dataset, a lower initial learning rate is used, since a high learning rate leads to early convergence due to overfitting in the source domain. In addition, the learning rate of the domain discriminator and the classifier is set to be larger (i.e., 10 times) than that of the feature extractor.
For the ablation study, we report two variants of VBCLS: one without posterior alignment and one without minimizing the additional KL term on the encoder.
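The step schedule described above can be written as a small helper. This is a sketch under stated assumptions: epochs are counted from zero, `base_lr` is a placeholder argument because the concrete initial learning rate is not reproduced here, and the function name is hypothetical.

```python
def learning_rates(epoch, base_lr, total_epochs=30, decay=0.1, decay_point=0.8):
    """Return (feature extractor lr, classifier lr) for a zero-based epoch index."""
    # The rate is multiplied by `decay` once `decay_point` of the epochs have passed.
    decayed = epoch >= int(decay_point * total_epochs)
    extractor_lr = base_lr * decay if decayed else base_lr
    # The classifier uses a rate 10 times larger than the feature extractor.
    return extractor_lr, 10 * extractor_lr

schedule = [learning_rates(e, base_lr=1.0) for e in range(30)]
print(schedule[23], schedule[24])
```

With 30 epochs, the decay takes effect from the 25th epoch onward (index 24), and the ratio between the two rates stays fixed throughout training.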
4.2 VLCS Dataset
VLCS [8] contains images from four different datasets, including PASCAL VOC2007 (V), LabelMe (L), Caltech-101 (C), and SUN09 (S). Different from PACS, VLCS offers photo images taken with different camera types or composition biases. The domains V, L, C, and S have 3,376, 2,656, 1,415, and 3,282 instances, respectively. Five shared classes are collected to form the label space, including bird, car, chair, dog, and person. We follow previous works in exploiting the publicly available pre-extracted DeCAF6 features (4,096-dim vectors) [6] for leave-one-domain-out validation, randomly splitting each domain into 70% training and 30% testing. We report the mean over five independent runs for our results.
The mean accuracy on the test partition using leave-one-domain-out validation on VLCS is reported in Table 1. We also compare our method with the DG models trained using the same architecture and source domains.
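The evaluation protocol can be sketched with the domain sizes stated above. The helper names and the fixed seed are illustrative assumptions; how the train and test partitions of each domain are combined within a fold follows the cited prior work and is not spelled out here.

```python
import random

# Number of instances per VLCS domain, as stated in the text.
DOMAIN_SIZES = {"V": 3376, "L": 2656, "C": 1415, "S": 3282}

def split_indices(n, train_frac=0.7, seed=0):
    """Randomly split range(n) into train and test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = int(train_frac * n)
    return indices[:cut], indices[cut:]

def leave_one_domain_out(domains):
    """Yield (target, sources) pairs, holding out one domain at a time."""
    for target in domains:
        yield target, [d for d in domains if d != target]

splits = {name: split_indices(size) for name, size in DOMAIN_SIZES.items()}
folds = list(leave_one_domain_out(list(DOMAIN_SIZES)))
print(folds)
```

Each of the four folds trains on three domains and evaluates on the held-out one, and repeating the procedure with different seeds gives the independent runs over which the mean is reported.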
Our results indicate that our VBCLS outperforms the covariate shift setting methods (e.g., CCSA [26], MMD-AAE [17]) by a large margin. The improvement over the conditional-shift-only method CIDDG [18] is significant, demonstrating the necessity of incorporating both conditional and label shift. When compared with recent SOTA methods, e.g., the self-challenging based RSC [12], the information bottleneck based MetaVIB [7], and the self-training based EISNet [29], we can observe that our VBCLS yields better performance in almost all cases. We note that JiGen [1] uses Jigsaw puzzle solving, which is essentially different from IFL.
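Table 1. Mean accuracy (%) on the test partition using leave-one-domain-out validation on VLCS.

| Target (→) | V | L | C | S | Average |
|---|---|---|---|---|---|
| CCSA (2017) | 67.10 | 62.10 | 92.30 | 59.10 | 70.15 |
| MMD-AAE (2018) | 67.70 | 62.60 | 94.40 | 64.40 | 72.28 |
| CIDDG (2018) | 64.38 | 63.06 | 88.83 | 62.10 | 69.59 |
| Epi-FCR (2019) | 67.10 | 64.30 | 94.10 | 65.90 | 72.90 |
| JiGen (2019) | 70.62 | 60.90 | 96.93 | 64.30 | 73.19 |
| MetaVIB (2020) | 70.28 | 62.66 | 97.37 | 67.85 | 74.54 |
| RSC (2020) | 73.93 | 61.86 | 97.61 | 68.32 | 75.43 |
| VBCLS | 72.16 | 68.63 | 96.52 | 70.37 | 76.92±0.06 |
| VBCLS (w/o posterior alignment) | 69.40 | 65.00 | 94.60 | 65.60 | 73.70±0.08 |
| VBCLS (w/o KL term) | 71.74 | 68.18 | 96.20 | 70.02 | 76.56±0.07 |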
The good performance of our strategy indicates that the invariant feature learning works well, and that the conditional and label shift assumption can be an inherent property that needs to be addressed for real-world DG challenges. The discrepancy between the marginal distributions of two domains is measured via the KL divergence, as in semi-supervised learning with the sample selection bias problem [30]. The impact of the label shift is empirically illustrated in Fig. 4 (left). In Fig. 4 (right), we show that the label refinement can effectively estimate the testing label distribution without tuning the network. The label alignment is relatively accurate after 3 epochs and is almost stable after 5 epochs.
In the ablation study, we can see that posterior alignment is necessary if there is label shift. Besides, the performance of the label-prior network can be improved by minimizing the additional KL term on the encoder.
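Table 2. Accuracy (%) on VLCS when training on two source domains. Columns denote the pair of source domains; "-" marks combinations that include the target domain.

| Target | Method | VC | VL | VS | LC | LS | CS |
|---|---|---|---|---|---|---|---|
| V | CIDDG | - | - | - | 60.42 | 62.21 | 59.56 |
| V | VBCLS | - | - | - | 65.82 | 68.45 | 66.76 |
| L | CIDDG | 53.24 | - | 52.27 | - | - | 49.58 |
| L | VBCLS | 60.75 | - | 60.76 | - | - | 58.82 |
| C | CIDDG | - | 78.82 | 78.58 | - | 74.67 | - |
| C | VBCLS | - | 85.56 | 86.68 | - | 81.47 | - |
| S | CIDDG | 59.04 | 56.29 | - | 59.80 | - | - |
| S | VBCLS | 62.34 | 60.35 | - | 61.52 | - | - |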
Another evaluation protocol on VLCS is to examine whether eliminating examples from one source domain impacts the performance on the target domain. This protocol is designed to evaluate the impact of the diversity of the source domains on the target accuracy. In this experiment, each target domain is evaluated on models trained with all combinations of the remaining domains as the source domains, as shown in Table 2. In addition, the CIDDG baseline is included for reference. Results in Table 2 demonstrate that for all target domains, reducing the number of source domains from 3 (see Table 1) to 2 degrades the performance for all combinations of the source domains. In some cases, excluding a particular source from training substantially degrades the target performance. However, our VBCLS remains more robust in these cases.
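The source combinations used in this protocol can be enumerated directly. The sketch below uses a hypothetical helper name and lists, for each target domain, every pair of the remaining domains.

```python
from itertools import combinations

DOMAINS = ["V", "L", "C", "S"]

def two_source_settings(domains):
    """Map each target domain to all pairs of the remaining domains."""
    settings = {}
    for target in domains:
        remaining = [d for d in domains if d != target]
        settings[target] = ["".join(pair) for pair in combinations(remaining, 2)]
    return settings

settings = two_source_settings(DOMAINS)
for target, pairs in settings.items():
    print(target, pairs)
```

With four domains there are three source pairs per target, which gives twelve training configurations per method.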
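Table 3. Number of samples per class in each domain of PACS.

| Domain | Guitar | House | Giraffe | Person | Horse | Dog | Elephant |
|---|---|---|---|---|---|---|---|
| Art Paint | 184 | 295 | 285 | 449 | 201 | 379 | 255 |
| Cartoon | 135 | 288 | 346 | 405 | 324 | 389 | 457 |
| Photo | 186 | 280 | 182 | 432 | 199 | 189 | 202 |
| Sketch | 608 | 80 | 753 | 160 | 816 | 772 | 740 |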
4.3 PACS Dataset
The object recognition benchmark PACS [15] consists of images divided into 7 classes from four different datasets, including Photo (P), Art painting (A), Cartoon (C), and Sketch (S). In Table 3, we provide the detailed statistics of the number of samples in each domain. It is clear that the class ratio differs across domains, which indicates a significant label shift.
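The label shift in PACS can be quantified from the per-class sample counts of each domain. As a minimal sketch, the class proportions serve as the label prior of each domain, and the KL divergence between two priors gives one possible measure of the shift; the use of the KL divergence in this particular comparison is an illustrative choice.

```python
import math

# Per-class sample counts of each PACS domain (Guitar, House, Giraffe,
# Person, Horse, Dog, Elephant), as listed in the dataset statistics.
COUNTS = {
    "Art Paint": [184, 295, 285, 449, 201, 379, 255],
    "Cartoon": [135, 288, 346, 405, 324, 389, 457],
    "Photo": [186, 280, 182, 432, 199, 189, 202],
    "Sketch": [608, 80, 753, 160, 816, 772, 740],
}

def proportions(counts):
    """Estimate the label prior of a domain from its class counts."""
    total = sum(counts)
    return [c / total for c in counts]

def kl(p, q):
    """KL divergence between two discrete label distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

priors = {name: proportions(c) for name, c in COUNTS.items()}
kl_photo_sketch = kl(priors["Photo"], priors["Sketch"])
kl_photo_art = kl(priors["Photo"], priors["Art Paint"])
print(kl_photo_sketch, kl_photo_art)
```

The divergence between Photo and Sketch is far larger than between Photo and Art Paint, mainly because the House and Person classes are much rarer in Sketch.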
As shown in the results (cf. Table 4), the Sketch domain produces the lowest accuracy when used as the target domain, and therefore it is deemed the most challenging one. In light of this, we follow previous work in tuning the model using the S domain as the target domain and reusing the same hyperparameters for the experiments with the remaining domains. In Table 4, we show the results of our method, averaged over 5 different initializations, alongside all the other comparison methods. Overall, our method yields better average performance across domains than previous SOTA methods. Among them, CrossGrad [28] synthesizes data for a new domain, and MMLD [23] uses a mixture of multiple latent domains. Our method is also comparable to the recent self-challenging based RSC [12], the information bottleneck based MetaVIB [7], and the self-training based EISNet [29]. More importantly, our method outperforms CIDDG [18], an adversarial conditional IFL strategy, by a large margin. The ablation studies are also consistent with the results on VLCS. We also provide the results using the ResNet18 backbone.
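Table 4. Classification accuracy (%) on PACS with each domain used as the target. Rows marked [Res18] use the ResNet18 backbone.

| Target (→) | P | A | C | S | Average |
|---|---|---|---|---|---|
| CrossGrad (2018) | 87.6 | 61.0 | 67.2 | 55.9 | 67.9 |
| CIDDG (2018) | 78.7 | 62.7 | 69.7 | 64.5 | 68.9 |
| Epi-FCR (2019) | 86.1 | 64.7 | 72.3 | 65.0 | 72.0 |
| JiGen (2019) | 89.0 | 67.6 | 71.7 | 65.2 | 73.4 |
| MMLD (2020) | 88.98 | 69.27 | 72.83 | 66.44 | 74.38 |
| RSC (2020) | 90.88 | 71.62 | 66.62 | 75.11 | 76.05 |
| EISNet (2020) | 91.20 | 70.38 | 71.59 | 70.25 | 75.86 |
| MetaVIB (2020) | 91.93 | 71.94 | 73.17 | 65.94 | 75.74 |
| VBCLS | 92.12 | 70.60 | 77.36 | 70.19 | 77.55±0.07 |
| VBCLS (w/o posterior alignment) | 91.02 | 68.86 | 74.18 | 65.40 | 74.61±0.04 |
| VBCLS (w/o KL term) | 91.77 | 70.54 | 76.56 | 70.33 | 77.19±0.06 |
| JiGen (2019) [Res18] | 96.03 | 79.42 | 75.25 | 71.35 | 80.51 |
| MMLD (2020) [Res18] | 96.09 | 81.28 | 77.16 | 72.29 | 81.83 |
| EISNet (2020) [Res18] | 95.93 | 81.89 | 76.44 | 74.33 | 82.15 |
| RSC (2020) [Res18] | 95.99 | 83.43 | 80.85 | 80.31 | 85.15 |
| VBCLS [Res18] | 97.21 | 84.63 | 82.06 | 79.25 | 86.73±0.05 |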
5 Conclusion
In this work, we aim to establish a more realistic assumption in DG, namely that both conditional and label shifts arise independently and simultaneously. We theoretically analyze the limitation of conventional IFL under the different shift assumptions. Motivated by this, a concise yet effective VBCLS framework based on variational Bayesian inference with posterior alignment is proposed to reduce both the conditional and label shifts. Extensive evaluations verify our analysis and demonstrate the effectiveness of our approach compared with IFL approaches.
Acknowledgments
This work was supported in part by the Hong Kong GRF 152202/14E, PolyU Central Research Grant GYBJW, and Jiangsu NSF (BK20200238).
References
[1] (2019) Domain generalization by solving jigsaw puzzles. In CVPR.
[2] (2021) Deep verifier networks: verification of deep discriminative models with deep generative models. In AAAI.
[3] (2020) Domain adaptation with conditional distribution matching and generalized label shift. arXiv.
[4] (2017) Deep domain generalization with structured low-rank constraint. TIP.
[5] (2018) Importance weighting and variational inference. In NIPS.
[6] (2014) DeCAF: a deep convolutional activation feature for generic visual recognition. In ICML.
[7] (2020) Learning to learn with variational information bottleneck for domain generalization. In ECCV.
[8] (2017) Scatter component analysis: a unified framework for domain adaptation and domain generalization. IEEE TPAMI.
[9] (2018) Causal generative domain adaptation networks. arXiv.
[10] (2016) Domain adaptation with conditional transferable components. In ICML.
[11] (2019) Domain generalization via multi-domain discriminant analysis. In UAI.
[12] (2020) Self-challenging improves cross-domain generalization. In ECCV.
[13] (2013) Auto-encoding variational Bayes. arXiv.
[14] (2018) An introduction to domain adaptation and transfer learning. arXiv preprint arXiv:1812.11806.
[15] (2017) Deeper, broader and artier domain generalization. In ICCV.
[16] (2019) Episodic training for domain generalization. In ICCV.
[17] (2018) Domain generalization with adversarial feature learning. In CVPR.
[18] (2018) Deep domain generalization via conditional invariant adversarial networks. In ECCV.
[19] (2019) Learning to select knowledge for response generation in dialog systems. In IJCAI.
[20] (2021) Mutual information regularized feature-level Frankenstein for discriminative recognition. IEEE TPAMI.
[21] (2021) Subtype-aware unsupervised domain adaptation for medical diagnosis. In AAAI.
[22] (2018) Best sources forward: domain generalization through source-specific nets. In ICIP.
[23] (2020) Domain generalization using a mixture of multiple latent domains. In AAAI.
[24] (2020) Domain generalization using a mixture of multiple latent domains. In AAAI.
[25] (2012) A unifying view on dataset shift in classification. Pattern Recognition 45 (1), pp. 521–530.
[26] (2017) Unified deep supervised domain adaptation and generalization. In ICCV.
[27] (2012) On causal and anticausal learning. arXiv.
[28] (2018) Generalizing across domains via cross-gradient training. arXiv.
[29] (2020) Learning from extrinsic and intrinsic supervisions for domain generalization. In ECCV.
[30] (2004) Learning and evaluating classifiers under sample selection bias. In ICML.
[31] (2015) Multi-source domain adaptation: a causal view. In AAAI.
[32] (2013) Domain adaptation under target and conditional shift. In ICML.