Deep learning has shown impressive ability in solving numerous machine learning tasks. Most recent advances rely heavily on access to huge amounts of labeled data and on the assumption that training and test data are sampled from the same underlying distribution. However, there are many application scenarios where labeled data for the task of interest (target domain) is hard to obtain, while another correlated domain (source domain) with non-negligible dissimilarity contains sufficient annotated data. Hence, there is strong motivation to leverage the supervision signal from the source domain to help build an effective model in the target domain. Learning an accurate predictive model for the target domain in the presence of covariate shift [sugiyama2012machine] (i.e., the input data distributions of source and target domains are different) is known as domain adaptation. In this paper, we focus on a general and challenging setting where no label information is available in the target domain, which is termed unsupervised domain adaptation.
Recent advances in deep learning have stimulated a fruitful line of domain adaptation works, which leverage deep neural networks to infer the latent variables and match the marginal distributions of source and target domains in the latent space [ganin2016domain, long2015learning, RTN]. Inspired by Generative Adversarial Networks [goodfellow2014generative], an adversarial domain adaptation mechanism is utilized in [ganin2016domain, tzeng2017adversarial, luo2017label, xie2018learning]. This mechanism involves a two-player game between a discriminator and a feature extractor: the domain discriminator is trained to tell whether the samples come from the source or the target domain, while the feature extractor is trained to maximize the discriminator's classification error. Essentially, adversarial domain adaptation methods seek to minimize the Jensen-Shannon divergence between the source and target distributions of latent features.
However, it has been shown that matching the marginal distributions in latent feature space is not sufficient to ensure that the essential information is transferred [zhao2019learning]. The learned mapping may be misled by domain-invariant yet task-irrelevant factors and fail to capture the semantic information. Consider the following example: the adaptation task is to recognize animals in pictures. In the source domain, most of the sheep appear with grassland as the background and most of the horses appear with an animal house as the background, while in the target domain the backgrounds of the two species are random and the marginal distributions of the background are the same. In cases like this, directly matching the marginal feature distributions could cause the domain-invariant yet task-irrelevant information (i.e., the background in the above example) to outweigh the task-relevant information, which then leads to worse performance on the target domain (also known as negative transfer [pan2010survey]).
To tackle the lack of semantic alignment, many recent works proposed to enhance the label information of target domain based on some strong assumptions [luo2017label, shu2018dirt, xie2018learning, saito2017asymmetric]. One of the most widely used hypotheses is the cluster assumption [grandvalet2005semi] (also known as low density separation assumption), which states that the data instances are distributed into several separate clusters and samples in the same cluster share the same label. However, the cluster assumption is actually too strong and inappropriate for many practical scenarios, and directly using the cluster assumption could bring non-negligible undesired effects [shu2018dirt].
In this paper, inspired by the information bottleneck principle, we propose a simple yet effective regularization technique for domain adaptation methods that combines conditional entropy minimization with the variational information bottleneck. It forces the feature extractor to ignore irrelevant factors and focus on the essential information for the task of interest (i.e., the sufficient statistics for determining the parameters of the predictive models). Our method tends to learn a balanced and clean representation space (i.e., no information preference for the source or target domain and fewer irrelevant factors), which improves the generalization ability of the predictive model and renders strong yet widely used assumptions such as the cluster assumption more realistic. We further provide a theoretical analysis of the generalization error bound in Section 4.3. Extensive experimental results demonstrate that our model outperforms state-of-the-art methods across three domain adaptation benchmark datasets [officehome, office31, svhn].
2 Related Works
In this section, we discuss the most relevant works in the field of domain adaptation. [ganin2016domain] and [long2015learning] proposed to project the source and target domains into a common representation space, and encouraged the corresponding marginal feature distributions to be matched under the guidance of some distance or divergence. Adversarial techniques based on the GAN framework [goodfellow2014generative] are widely explored in the domain adaptation literature [saito2017asymmetric, hoffman2017cycada, tzeng2017adversarial], which corresponds to minimizing the symmetric Jensen-Shannon divergence. However, [JAN] pointed out that adversarial domain adaptation methods which only match the marginal distribution are problematic and insufficient for successful adaptation. To address this limitation, various methods have been proposed. For example, [JAN, CDAN] proposed to match the joint distribution instead of purely matching the marginal; [ghifary2016deep] introduced a decoder architecture for capturing the semantic information; and [hoffman2017cycada] utilized cycle consistency constraints to preserve semantic information.

However, the main limitation of these methods is that, although the semantic information is enhanced, the learned representation is still likely to preserve domain-invariant factors that are irrelevant to the predictive task, which may mislead the semantic alignment, especially when training samples are insufficient. In the animal recognition example mentioned in the Introduction, the background is such a domain-invariant yet irrelevant factor. The learned representation tends to preserve the background information because the background is statistically dependent on the class label in the source domain and the marginal distribution of the background is invariant between the source and target domains. This irrelevant information then disturbs the predictive task on the target domain. Hence there is strong motivation to force the feature extractor to focus only on the essential information for the task of interest and to ignore as many irrelevant factors as possible, whether they are domain-invariant or not.

Inspired by this intuition, we propose to regularize domain adaptation models with the information bottleneck principle [tishby2000information], which seeks to find the optimal tradeoff between representation accuracy and compression. Since the information bottleneck method has been successfully applied to supervised learning [alemi2016deep] and generative modeling [jeon2018ib, peng2018variational], we propose to exploit it in the context of domain adaptation to preserve sufficient statistics and remove irrelevant factors in the learned representations. While [motiian2016information] also augment domain adaptation with the information bottleneck, they focus on a specific scenario in which an auxiliary data view (e.g., skeleton data for gestures and bounding boxes for objects) is available, and the information bottleneck is incorporated to leverage this additional data view. In contrast, our method seeks to provide a new regularization technique for general unsupervised domain adaptation with deep neural networks.
On the other hand, to counter the lack of attention to target semantic information, conditional entropy minimization [grandvalet2005semi] is widely used in unsupervised domain adaptation [RTN, luo2017label]. These methods are based on the cluster assumption that the decision boundary should not cross high density regions, but instead lie in low density regions [chapelle2005semi]. In other words, it assumes that the data instances are distributed into several separate clusters, and samples in the same cluster share the same class label. However, the cluster assumption can be too strong to be satisfied in many practical scenarios, which harms the stability of training and the performance of the models. Essentially, the cluster assumption in the representation space is satisfied only when the learned representations merely preserve semantic information that is relevant to the predictive task. Our variational bottleneck domain adaptation framework intrinsically seeks such a clean representation space, which renders the cluster assumption more realistic and achieves better feature transferability.
3 Background & Notations
3.1 Domain Adaptation
To describe a domain, we introduce a joint data distribution with which we define both the marginals and conditionals. Let $p_s(x, y)$ denote the underlying joint data distribution of the data instance $x$ and the corresponding label $y$ for source domain, and let $p_s(x)$ denote the marginal distribution of $x$. $p_t(x, y)$ and $p_t(x)$ are defined analogously for target domain. In feature-based unsupervised domain adaptation, our objective is to train a classifier $h = f \circ g$ which can perform well on target domain. Specifically, $g: \mathcal{X} \to \mathcal{Z}$ is the feature extractor, which is a projection function from data space $\mathcal{X}$ to latent feature space $\mathcal{Z}$, and $f: \mathcal{Z} \to \Delta(\mathcal{Y})$ is a classification function on the representation space, where $\Delta(\mathcal{Y})$ denotes the set of probability distributions over the label set $\mathcal{Y}$. To address the covariate shift problem, many domain adaptation methods are proposed to minimize the following objective motivated by the theory in [ben2010theory, ganin2016domain]:

$$\min_{f, g} \; \mathcal{L}_c(f, g) + \lambda\, d\big(p_s(z), p_t(z)\big) \tag{1}$$

Here $z_s = g(x_s)$ and $z_t = g(x_t)$ are latent representations for source and target domain; $p_s(z)$ and $p_t(z)$ are the marginal distributions of $z_s$ and $z_t$, which are implicitly defined by the marginals $p_s(x)$, $p_t(x)$ and the deterministic mapping $g$; $\mathcal{L}_c$ is the cross entropy loss for training a classifier; $d(\cdot, \cdot)$ is some divergence or distance measure between two distributions and $\lambda$ is the weighting factor. For instance, in [long2015learning], the divergence is realized as maximum mean discrepancy (MMD) and in many adversarial domain adaptation methods [hoffman2017cycada, ganin2016domain, luo2017label], the Jensen-Shannon divergence between $p_s(z)$ and $p_t(z)$ is minimized within an adversarial learning framework [goodfellow2014generative]:

$$\min_{g} \max_{D} \; \mathbb{E}_{x \sim p_s(x)}\big[\log D(g(x))\big] + \mathbb{E}_{x \sim p_t(x)}\big[\log\big(1 - D(g(x))\big)\big] \tag{2}$$

where $D$ is a domain classifier on the representation space. Intuitively, the domain classifier is trained to distinguish the latent representations of source domain from those of target domain, while the feature extractor is jointly trained to confuse the discriminator by maximizing its classification error. At optimality, the marginal distributions of latent representations will be matched and the learned representations will be domain-invariant.
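The link between the adversarial objective and the Jensen-Shannon divergence can be checked numerically on a toy case. The sketch below is only an illustration under stated assumptions: the two discrete latent distributions are made-up values, the helper names are hypothetical, and the optimal discriminator for fixed features is taken to be $D^*(z) = p_s(z) / (p_s(z) + p_t(z))$, the standard result from [goodfellow2014generative]. At that optimum the adversarial value equals $2\,\mathrm{JS}(p_s \| p_t) - \log 4$.

```python
import math

def kl(p, q):
    """KL(p || q) in nats for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence between discrete distributions p and q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def adversarial_value(p_s, p_t, disc):
    """E_{z~p_s}[log D(z)] + E_{z~p_t}[log(1 - D(z))] for a tabular discriminator."""
    return sum(ps * math.log(d) + pt * math.log(1 - d)
               for ps, pt, d in zip(p_s, p_t, disc))

# Toy latent distributions over three discrete codes (illustrative values only).
p_s = [0.7, 0.2, 0.1]
p_t = [0.2, 0.3, 0.5]

# Optimal discriminator for fixed features (assumed closed form, see lead-in).
d_star = [ps / (ps + pt) for ps, pt in zip(p_s, p_t)]

value_at_optimum = adversarial_value(p_s, p_t, d_star)
js_value = js(p_s, p_t)
print(value_at_optimum, 2 * js_value - math.log(4))
```

When the two latent distributions coincide, the best the discriminator can do is output 0.5 everywhere, and the value drops to $-\log 4$, which is the sense in which the feature extractor "confuses" the discriminator.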
3.2 Information Bottleneck Principle
Let random variable $X$ denote the original signal and random variable $Y$ denote an output variable (e.g., desired label), whose information we want to preserve. Given their joint distribution $p(x, y)$, assuming statistical dependence between $X$ and $Y$, the mutual information $I(X; Y)$ measures the mutual dependence between these two random variables. In this case, $Y$ implicitly determines both the relevant and irrelevant features in $X$. The information bottleneck (IB) method seeks to find an optimal representation of $X$ which captures the relevant part and filters out the irrelevant part.

Formally, in the context of information bottleneck, we are interested in finding the relevant part of $X$ with respect to $Y$, denoted by $Z$, the minimal sufficient statistics of $X$ with respect to $Y$. Thus we assume the following Markov chain: $Y \to X \to Z$, and we can obtain the optimal representation by minimizing $I(X; Z)$ under a constraint on $I(Z; Y)$ (to ensure the predictive ability of $Z$).

The objective of finding the optimal representation can be further formulated as the maximization of the following Lagrangian [tishby2015deep]:

$$\max_{p(z|x)} \; I(Z; Y) - \beta\, I(X; Z)$$

subject to the Markov chain constraint. Here the positive Lagrangian multiplier $\beta$ represents a tradeoff between the complexity of the representation ($I(X; Z)$) and the amount of preserved relevant information ($I(Z; Y)$). In essence, the information bottleneck principle explicitly enforces the learned representation to only preserve the information in $X$ that is useful to the prediction of $Y$, i.e., the minimal sufficient statistics of $X$ with respect to $Y$.
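A small discrete example may make the tradeoff concrete. The sketch below is a toy construction, not taken from the paper: the input is assumed to consist of one signal bit and one independent nuisance bit, the label equals the signal bit, and two hypothetical deterministic encoders are compared. Both encoders keep all label information, but the one that drops the nuisance bit has a lower complexity term $I(X; Z)$, which is what the bottleneck objective prefers.

```python
import math
from collections import defaultdict
from itertools import product

def mutual_information(joint):
    """I(A; B) in bits from a dict {(a, b): probability}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

def pushforward(p_xy, encoder):
    """Joint tables of (X, Z) and (Z, Y) for a deterministic encoder z = encoder(x)."""
    xz, zy = defaultdict(float), defaultdict(float)
    for (x, y), p in p_xy.items():
        z = encoder(x)
        xz[(x, z)] += p
        zy[(z, y)] += p
    return xz, zy

# X = (signal, nuisance): two independent fair bits; the label Y equals the signal bit.
p_xy = {((s, n), s): 0.25 for s, n in product([0, 1], repeat=2)}

encoders = [
    ("identity", lambda x: x),        # keeps signal and nuisance
    ("signal_only", lambda x: x[0]),  # discards the nuisance bit
]

results = {}
for name, enc in encoders:
    xz, zy = pushforward(p_xy, enc)
    # Store (I(X; Z), I(Z; Y)) in bits for each encoder.
    results[name] = (mutual_information(xz), mutual_information(zy))

print(results)
```

In this toy setting the identity encoder yields $I(X; Z) = 2$ bits and the signal-only encoder yields $I(X; Z) = 1$ bit, while $I(Z; Y) = 1$ bit in both cases.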
In this paper, under the framework of the information bottleneck principle, we propose a novel domain adaptation method which forces the feature extractor to focus on the relevant factors implicitly defined by the task, and we provide a thorough analysis of the benefits brought by our method, both empirically and theoretically.
[ganin2016domain] claimed that a successful adaptation can be achieved when the source domain classification error and the domain confusion loss are both small, which can be realized through optimizing the objective in Equation (1).
From the perspective of information preference, we can reformulate the objective in Equation (1) and understand the weakness of the constraint in a more straightforward way. To begin with, we split the loss function in Equation (1) into two terms, $\mathcal{L}_c(f, g)$ and $d(p_s(z), p_t(z))$. In the following, we will show that minimizing the first term is equivalent to maximizing a variational lower bound of the mutual information between learned representations and the labels in source domain (i.e., $I(Z_s; Y_s)$), and minimizing the second term corresponds to finding the domain-invariant features. To see these, let us first rewrite the negative of cross entropy loss, with $q(y|z)$ denoting the predictive distribution given by the classifier $f$, as:

$$-\mathcal{L}_c = \mathbb{E}_{(x, y) \sim p_s(x, y)}\, \mathbb{E}_{z \sim p(z|x)}\big[\log q(y|z)\big] \tag{3}$$

where $p(z|x)$ denotes the conditional distribution implied by the projection function $g$ (when $g$ is a deterministic projection, $p(z|x)$ corresponds to a delta distribution with non-zero density at $z = g(x)$). With the Markov chain assumption introduced in Section 3.2, Equation (3) can be rewritten as:

$$-\mathcal{L}_c = \mathbb{E}_{(y, z) \sim p_s(y, z)}\big[\log q(y|z)\big] \le \mathbb{E}_{(y, z) \sim p_s(y, z)}\big[\log p_s(y|z)\big] = -H(Y_s | Z_s) = I(Z_s; Y_s) - H(Y_s)$$

Here, the inequality holds for the fact that $\mathrm{KL}\big(p_s(y|z) \,\|\, q(y|z)\big) \ge 0$. Since $H(Y_s)$ is a constant in our optimization procedure of $f$ and $g$, we know that minimizing the first term in Equation (1) corresponds to maximizing a lower bound of $I(Z_s; Y_s)$.
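The lower bound can be sanity-checked on a small table. The sketch below assumes a toy joint distribution over two latent codes and two labels (the numbers are illustrative, and the classifier tables are hypothetical). It evaluates the negative cross entropy plus the label entropy, which should never exceed the mutual information between representation and label, and should match it exactly when the classifier equals the true posterior.

```python
import math

# Toy joint p(z, y) over 2 latent codes and 2 labels (illustrative values only).
p_zy = [[0.4, 0.1],
        [0.1, 0.4]]

p_z = [sum(row) for row in p_zy]
p_y = [sum(p_zy[z][y] for z in range(2)) for y in range(2)]

# H(Y) and I(Z; Y), both in nats.
entropy_y = -sum(p * math.log(p) for p in p_y)
mi = sum(p_zy[z][y] * math.log(p_zy[z][y] / (p_z[z] * p_y[y]))
         for z in range(2) for y in range(2))

def neg_cross_entropy(q):
    """E_{p(z, y)}[log q(y | z)] for a tabular classifier q[z][y]."""
    return sum(p_zy[z][y] * math.log(q[z][y])
               for z in range(2) for y in range(2))

# The true posterior p(y | z) makes the bound tight; any other q is looser.
posterior = [[p_zy[z][y] / p_z[z] for y in range(2)] for z in range(2)]
imperfect = [[0.6, 0.4], [0.3, 0.7]]

bound_tight = neg_cross_entropy(posterior) + entropy_y
bound_loose = neg_cross_entropy(imperfect) + entropy_y
print(mi, bound_tight, bound_loose)
```

The gap between the loose bound and the mutual information is the expected KL divergence between the true posterior and the classifier, so lowering the cross entropy tightens the bound.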
The second term accounts for matching the marginal distributions of latent variables under the guidance of some distance or divergence. One notable example is the optimization of the Jensen-Shannon divergence with adversarial training. Essentially, this constraint seeks to find the domain-invariant features of the inputs. However, matching the marginals of latent features is agnostic to the task of interest, which implies that the preserved domain-invariant features are likely to contain factors that are irrelevant to the prediction of the desired labels. From learning theory [sontag1998vc], we know that when the sample size is finite, the irrelevant factors (for the predictive task) in the noisy inputs can decrease the generalization ability of the models. We provide a formal discussion of the generalization error bound in Section 4.3. From this perspective, one direction for improving domain adaptation models is to add more constraints on the representation space so that the preserved features are not only domain-invariant, but also relevant to the task of interest.
On the other hand, due to the supervised learning objective in source domain, the representation learned with Equation (1) will intrinsically tend to capture the relationship between data instances and labels from source domain, while paying less attention to target domain. To take the label information of target domain into account during feature learning, even though the exact labels are not available, the cluster assumption can be adopted [RTN, chapelle2005semi], where the input distribution is assumed to contain separated data clusters and data samples in the same cluster share the same class label. The cluster assumption introduces an inductive bias under which we seek decision boundaries that do not go through high-density regions, which can be implemented through the following conditional entropy minimization:

$$\min_{f, g} \; \mathcal{L}_{ce}(f, g) = -\mathbb{E}_{x \sim p_t(x)}\Big[\sum_{y \in \mathcal{Y}} f(g(x))_y \log f(g(x))_y\Big] \tag{4}$$

where $f(g(x))_y$ denotes the predicted probability of class $y$. Since the cluster assumption is satisfied only when the learned representations preserve nothing but semantic information that is relevant to the predictive task, there is strong motivation to find a clean representation space with the information bottleneck in order to justify the use of such strong assumptions in domain adaptation methods.
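As a rough illustration of conditional entropy minimization, the sketch below computes the mean prediction entropy over a batch of unlabeled target samples. The logits are made-up values and the function names are hypothetical; in practice this quantity would be computed on classifier outputs inside an automatic differentiation framework and minimized with respect to the network parameters.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def conditional_entropy(batch_logits):
    """Mean prediction entropy (nats) over a batch of unlabeled target samples."""
    total = 0.0
    for logits in batch_logits:
        probs = softmax(logits)
        total -= sum(p * math.log(p) for p in probs if p > 0)
    return total / len(batch_logits)

# Predictions far from the decision boundary versus predictions close to it.
confident = [[8.0, 0.0, 0.0], [0.0, 9.0, 0.0]]
uncertain = [[0.1, 0.0, 0.0], [0.0, 0.0, 0.1]]

print(conditional_entropy(confident), conditional_entropy(uncertain))
```

Confident predictions give an entropy close to zero, while near-uniform predictions approach $\log K$ for $K$ classes, so minimizing this term pushes target samples away from the decision boundary.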
4.2 Variational Bottleneck Domain Adaptation
Inspired by conditional entropy minimization in semi-supervised learning[chapelle2005semi, grandvalet2005semi] and deep variational information bottleneck [tishby2000information, alemi2016deep], to achieve better generalization ability, we propose a new regularization mechanism for domain adaptation, which explicitly enforces the feature extractor to only preserve the minimal sufficient statistics of the input data with respect to the labels for both source and target domain.
As discussed in the Motivations section, from the perspective of information bottleneck principle, we know that the objective in Equation (1) lacks a constraint for minimizing the mutual information between $X$ and $Z$:

$$I(X; Z) = \iint p(x)\, p(z|x) \log \frac{p(z|x)}{p(z)}\, dx\, dz$$

However, it should be noted that in general, directly computing and optimizing $I(X; Z)$ is computationally intractable [alemi2016deep], as it requires solving an integral over latent feature space. To achieve tractability, we follow the methods proposed in [alemi2016deep] and instead optimize a tractable variational upper bound:

$$I(X; Z) = \mathbb{E}_{x \sim p(x)}\big[\mathrm{KL}\big(p(z|x) \,\|\, p(z)\big)\big] \le \mathbb{E}_{x \sim p(x)}\big[\mathrm{KL}\big(p(z|x) \,\|\, r(z)\big)\big]$$

Here, $r(z)$ is the prior distribution of latent features and $p(z)$ denotes the marginal distribution implied by $p(x)$ and conditional distribution $p(z|x)$, and the inequality holds for the fact that $\mathrm{KL}\big(p(z) \,\|\, r(z)\big) \ge 0$.
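The variational upper bound on the mutual information between input and representation can be verified exactly in a discrete toy case. The sketch below assumes two inputs, two latent codes, a hypothetical stochastic encoder table and a uniform prior; all numbers are illustrative. It checks that the expected KL divergence to the prior exceeds the mutual information by exactly the KL divergence between the induced marginal and the prior.

```python
import math

def kl(p, q):
    """KL(p || q) in nats for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy setup: two inputs, two latent codes (illustrative values only).
p_x = [0.5, 0.5]
encoder = [[0.9, 0.1],   # p(z | x = 0)
           [0.2, 0.8]]   # p(z | x = 1)

# Marginal p(z) induced by p(x) and the encoder.
marginal = [sum(p_x[x] * encoder[x][z] for x in range(2)) for z in range(2)]

# I(X; Z) = E_x[ KL(p(z|x) || p(z)) ].
mi = sum(p_x[x] * kl(encoder[x], marginal) for x in range(2))

def upper_bound(prior):
    """E_x[ KL(p(z|x) || r(z)) ] for a candidate prior r(z)."""
    return sum(p_x[x] * kl(encoder[x], prior) for x in range(2))

prior = [0.5, 0.5]
gap = upper_bound(prior) - mi
print(mi, upper_bound(prior), gap, kl(marginal, prior))
```

The bound becomes tight only when the prior coincides with the induced marginal, which is why a mismatched prior adds extra regularization pressure on the marginal of the representation.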
To incorporate the above variational information bottleneck, with abuse of notation, we introduce a stochastic feature extracting function $g$, which maps a sample $x$ to a stochastic representation $z \sim p(z|x)$. Now we can add the following terms to the objective in order to enforce the feature extractor to only preserve task-relevant factors:

$$\mathbb{E}_{x \sim p_s(x)}\big[\mathrm{KL}\big(p(z|x) \,\|\, r(z)\big)\big] + \mathbb{E}_{x \sim p_t(x)}\big[\mathrm{KL}\big(p(z|x) \,\|\, r(z)\big)\big]$$

In our experiments, the stochastic feature extracting function is realized as a Gaussian distribution, where $g$ outputs the mean and diagonal covariance matrix of $p(z|x)$. When the prior $r(z)$ allows for the computation of the Kullback-Leibler divergence analytically, the variational upper bound on $I(X; Z)$ can be easily optimized. Thus we choose $r(z)$ to be a standard normal distribution, $r(z) = \mathcal{N}(0, I)$.

Note that although the objective here shares a similar mathematical form with the KL regularization term in the Variational Autoencoder (VAE) [kingma2013auto], the motivation and interpretation of the objectives are related but different. As a generative model, VAE consists of a pre-determined prior for the latent variables and a stochastic decoder for reconstruction. The amortized encoder is introduced as a variational approximation to the true posterior and the resulting evidence lower bound (ELBO) works as a tractable lower bound for the log-likelihood objective. In the variational information bottleneck, by contrast, $r(z)$ is introduced to derive a tractable upper bound for minimizing the mutual information term. Note that equality in this upper bound holds only when $r(z) = p(z)$. Therefore, by choosing a simple realization of $r(z)$ such as the standard normal distribution, we are also introducing an inductive bias of regularizing the marginal distribution of the learned representations (i.e., $p(z)$) to be as simple as possible.
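With a diagonal Gaussian encoder and a standard normal prior, the KL term has the familiar closed form $\frac{1}{2}\sum_i (\sigma_i^2 + \mu_i^2 - 1 - \log \sigma_i^2)$. The sketch below evaluates it for a single sample; parameterizing the encoder output as a mean and a log-variance vector is an assumption made here for illustration, and the function name is hypothetical.

```python
import math

def kl_diag_gaussian_to_standard_normal(mean, log_var):
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ) in nats."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mean, log_var))

# The penalty vanishes when the encoder output equals the prior ...
kl_at_prior = kl_diag_gaussian_to_standard_normal([0.0, 0.0], [0.0, 0.0])

# ... and grows as the mean moves away from zero or the variance from one.
kl_shifted = kl_diag_gaussian_to_standard_normal([1.0, 0.0], [0.0, 0.0])
kl_wide = kl_diag_gaussian_to_standard_normal([0.0], [math.log(4.0)])

print(kl_at_prior, kl_shifted, kl_wide)
```

Averaging this quantity over a mini-batch of source or target samples gives a Monte Carlo estimate of the corresponding bottleneck term.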
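Putting things together, the final objective function in our framework can be written as:

$$\min_{f, g} \; \mathcal{L}_c + \lambda\, d\big(p_s(z), p_t(z)\big) + \gamma\, \mathcal{L}_{ce} + \beta_s\, \mathbb{E}_{x \sim p_s(x)}\big[\mathrm{KL}\big(p(z|x) \,\|\, r(z)\big)\big] + \beta_t\, \mathbb{E}_{x \sim p_t(x)}\big[\mathrm{KL}\big(p(z|x) \,\|\, r(z)\big)\big]$$

Here, $\mathcal{L}_c$ is the classification loss; $p_s(z)$ and $p_t(z)$ are implicit marginal distributions induced by the marginal distributions $p_s(x)$, $p_t(x)$ and the conditional distribution $p(z|x)$; $\mathcal{L}_{ce}$ is the conditional entropy term defined in Equation (4); and $\lambda$, $\gamma$, $\beta_s$ and $\beta_t$ are hyperparameters controlling the optimization tradeoff among the terms. Since there is a stochastic structure in the model, we utilize the reparameterization trick introduced in [kingma2013auto] to back-propagate unbiased gradient estimates through a single sample.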
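The reparameterization step and the way the loss terms are combined can be sketched as follows. This is a schematic under stated assumptions, not the implementation used in the experiments: the individual loss values are plain numbers standing in for quantities computed by a network, and the weight names `lam`, `gamma`, `beta_s` and `beta_t` are hypothetical labels for the adversarial, conditional entropy and bottleneck weights.

```python
import math
import random

def sample_latent(mean, log_var, rng):
    """Reparameterized draw z = mean + std * eps with eps ~ N(0, I).

    The randomness lives in eps, so z is a deterministic, differentiable
    function of the encoder outputs (mean, log_var).
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]

def total_objective(cls_loss, domain_div, cond_entropy, kl_source, kl_target,
                    lam, gamma, beta_s, beta_t):
    """Weighted sum of the classification, alignment, entropy and bottleneck terms."""
    return (cls_loss + lam * domain_div + gamma * cond_entropy
            + beta_s * kl_source + beta_t * kl_target)

rng = random.Random(0)
draws = [sample_latent([2.0], [0.0], rng)[0] for _ in range(20000)]
sample_mean = sum(draws) / len(draws)

# Placeholder loss values and weights, chosen only to show the arithmetic.
loss = total_objective(1.0, 0.5, 0.2, 3.0, 4.0,
                       lam=1.0, gamma=0.1, beta_s=0.01, beta_t=0.001)
print(sample_mean, loss)
```

Because the sample is an affine function of the noise, gradients with respect to the mean and log-variance can flow through a single draw per example.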
4.3 Theoretical Analysis
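In this section, we analyze the theoretical properties of our proposed method.

Theorem ([ben2010theory]). Let $\mathcal{H}$ be the hypothesis space. Given $\langle p_s, \epsilon_s \rangle$ and $\langle p_t, \epsilon_t \rangle$ as the two domains and their corresponding test error functions, for any $h \in \mathcal{H}$ we have:

$$\epsilon_t(h) \le \epsilon_s(h) + \frac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(p_s, p_t) + \min_{h' \in \mathcal{H}} \big[\epsilon_s(h') + \epsilon_t(h')\big]$$

Here $d_{\mathcal{H}\Delta\mathcal{H}}(p_s, p_t)$ represents a discrepancy measure between source and target domain with respect to a hypothesis space $\mathcal{H}$, which is defined as:

$$d_{\mathcal{H}\Delta\mathcal{H}}(p_s, p_t) = 2 \sup_{h, h' \in \mathcal{H}} \Big| \Pr_{x \sim p_s}\big[h(x) \ne h'(x)\big] - \Pr_{x \sim p_t}\big[h(x) \ne h'(x)\big] \Big|$$

For a fixed hypothesis space $\mathcal{H}$, $d_{\mathcal{H}\Delta\mathcal{H}}(p_s, p_t)$ is the intrinsic difference between source and target domain, which is fixed and determined by the characteristics of the data distributions. Now we will show how the term from the information bottleneck principle can help minimize the test error terms in the bound above, i.e., $\epsilon_s(h)$ and $\min_{h' \in \mathcal{H}} [\epsilon_s(h') + \epsilon_t(h')]$.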
[[shamir2010learning]] For any probability distribution , with a probability of at least over the draw of the sample of size from , and are the empirical estimate of the mutual information and . Then for any ,
where and are constants. and correspond to the cardinality of variables and .
| Method | A→W | D→W | W→D | A→D | D→A | W→A | Avg |
| MADA [pei2018multi] | 90.0 ± 0.1 | 97.4 ± 0.1 | 99.6 ± 0.1 | 87.8 ± 0.2 | 70.3 ± 0.3 | 66.4 ± 0.3 | 85.2 |
| SimNet [pinheiro2018unsupervised] | 88.6 ± 0.5 | 98.2 ± 0.2 | 99.7 ± 0.2 | 85.3 ± 0.3 | 73.4 ± 0.8 | 71.6 ± 0.6 | 86.2 |
| GTA [GTA] | 89.5 ± 0.5 | 97.9 ± 0.3 | 99.8 ± 0.4 | 87.7 ± 0.5 | 72.8 ± 0.3 | 71.4 ± 0.4 | 86.5 |
The bound of [shamir2010learning] shows that the generalization gap, which is a measure of the difference between training and test error, is bounded by a monotonic function of $I(X; Z)$. Essentially, minimizing $I(X; Z)$ does reduce the generalization gap, but this alone is not enough. A degenerate case is $I(X; Z) = 0$, in which the prediction is random although the difference between training and test error is zero. So we also need to make sure that both the training error and the generalization gap are small. We can decrease the source error $\epsilon_s(h)$ with the information bottleneck (IB) principle, since we are explicitly minimizing the training error in source domain and the generalization gap in both domains. For the combined error $\min_{h' \in \mathcal{H}} [\epsilon_s(h') + \epsilon_t(h')]$, IB ideally does not harm predictive ability because it only removes irrelevant factors, so the combined training error should be the same with or without IB. Since the combined test error is the sum of the combined training error and the combined generalization gap, we are also able to reduce the combined test error.
We conduct experiments on various visual domain adaptation benchmarks including Office-31, Office-home and Digits, to compare our approach against state-of-the-art deep domain adaptation methods.
Office-31 [office31] is a widely-used dataset for visual domain adaptation, with 4,652 images and 31 categories from three distinct domains: Amazon (A), which contains images downloaded from amazon.com, Webcam (W) and DSLR (D), which contain images taken by web camera and digital SLR camera respectively. We denote the three domains as A, W and D. By permuting the 3 domains, we get 6 domain adaptation tasks.
Office-home [officehome] is a better organized and more difficult dataset than Office-31, which consists of 15,500 images in 65 object classes in office and home settings. It consists of four extremely dissimilar domains: Artistic images (Ar), Clip Art (Cl), Product images (Pr), and Real-World images (Rw). There are 12 domain adaptation tasks by permuting the 4 domains.
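The task counts quoted for the two benchmarks follow from taking every ordered (source, target) pair of distinct domains. A minimal sketch, using the domain abbreviations above and a hypothetical helper name:

```python
from itertools import permutations

def adaptation_tasks(domains):
    """All ordered (source, target) pairs of distinct domains."""
    return [f"{src}->{tgt}" for src, tgt in permutations(domains, 2)]

office31_tasks = adaptation_tasks(["A", "W", "D"])
office_home_tasks = adaptation_tasks(["Ar", "Cl", "Pr", "Rw"])

print(len(office31_tasks), len(office_home_tasks))
```

Each pair is counted in both directions because adapting from one domain to another is a different task from the reverse.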
Digits. We also explore three digits datasets of varying difficulty: MNIST, SVHN and USPS. Following the evaluation protocol of CyCADA [hoffman2017cycada], we investigate the following three tasks: USPS to MNIST (U→M), MNIST to USPS (M→U) and SVHN to MNIST (S→M).
We follow the standard protocols for evaluating unsupervised domain adaptation [long2015learning, ganin2016domain]. In the experiments, we observed that the hyperparameters (the adversarial weight $\lambda$, the conditional entropy weight $\gamma$, and the bottleneck weights $\beta_s$, $\beta_t$) are easy to choose and work well across multiple tasks. Specifically, we keep a fixed weight for the domain adversarial loss and we choose the values of $\gamma$, $\beta_s$, $\beta_t$ from a small candidate set, i.e., . The hyperparameters $\beta_s$, $\beta_t$ for the variational information bottleneck are selected according to the entropy of the domain. For example, a higher-entropy domain tends to hold more irrelevant information and needs a larger mutual information regularization weight. The experiments on Office-31 and Office-home are implemented based on ResNet-50 [he2016deep] pretrained on the ImageNet dataset [deng2009imagenet]. As for the digits datasets, we train our models with a small CNN [french2017self].
The results on the Office-31 dataset are reported in Table 1. For fair comparison, baseline results are taken directly from the original papers when the protocol is the same. Our VBDA model outperforms all comparison methods on most of the tasks. Notably, model performance is markedly improved on the hard tasks, e.g., , , where the two domains are significantly different. Our interpretation is that the variations of the source and target domains in these tasks are substantially different, and task-irrelevant information is the main obstacle to adapting the model. This demonstrates that VBDA is good at eliminating these factors and focusing on the essential information for the task of interest. Performance is also further improved on the relatively easy tasks, such as and . However, model performance on the tasks and is slightly lower than that of some approaches. This is because the average numbers of images per class over the 31 classes in Webcam and DSLR are only 26 and 16, which are much lower than the number of bins needed to represent the image distribution, making the empirically estimated mutual information bounds too unreliable for an effective information bottleneck.
The results on Office-home can be found in Table 2. VBDA improves the accuracy on most domain adaptation tasks and outperforms CDAN, a state-of-the-art method on this dataset, by 0.41% on average. Office-home is a more challenging dataset, with four domains that have larger domain gaps and more categories. The information difference between the four domains is more pronounced, i.e., Rw and Ar contain much more redundant information for the classification task than Cl and Pr, and the information bottleneck helps control the information flow flexibly to learn a clean representation for adaptation and classification. The strong performance on such challenging domain adaptation tasks highlights the effectiveness of matching essential information by utilizing the information bottleneck principle.
The results on the digits datasets are shown in Table 3. On the tasks MNIST→USPS and USPS→MNIST, VBDA performs better than or at least comparably with previous methods. On the more challenging setting, SVHN→MNIST, our model improves on existing methods by 3.3%. In particular, VBDA outperforms CyCADA, a state-of-the-art pixel-level adaptation method, which further supports the efficacy of VBDA.
5.3 Analysis and Ablation Study
To distinguish the utility of the two main components of VBDA, the conditional entropy term and the information bottleneck term, we conduct a case study on the task SVHN→MNIST. As shown in Fig. 1(c), with the conditional entropy term only (DANN+CE), training is quite unstable; with the bottleneck term only, training is stable but the performance declines; with both terms, the model converges stably to the best test accuracy on target domain.
We also conduct ablation studies on the hyperparameters $\beta_s$ and $\beta_t$ on the task SVHN→MNIST. $\beta_t$ is preferred to be smaller than $\beta_s$, since SVHN has more irrelevant information to be penalized than MNIST. From Fig. 1(a), we can observe that the accuracy suffers when $\beta_s$ is too large. As $\beta_s$ becomes larger, we forget more about the input and the learned representations become more and more indistinguishable. The best performance is achieved with an intermediate value of $\beta_s$; in this case, the best setting is . A similar phenomenon can be observed for $\beta_t$ in Fig. 1(b).
The changes in mutual information during optimization are shown in Fig. 1(d). As we can observe, the mutual information between the representation and the label, i.e., $I(Z_s; Y_s)$ and $I(Z_t; Y_t)$, increases in both domains during training, while the upper bounds on the mutual information between input and representation, i.e., on $I(X_s; Z_s)$ and $I(X_t; Z_t)$, gradually decline. This indicates that more semantic information has been embedded and more nuisance factors have been removed in the representation space.
5.4 Feature Visualization
The t-SNE visualization of the representations on the task A→W (31 classes) is shown in Fig. 2. Note that the source and target representations are not well aligned by ResNet. DANN can match the marginal feature distributions, but there are still target points near or across the class boundaries. MADA aligns the source and target domains and discriminates categories better, but each class is more scattered than in VBDA and some target points deviate from the corresponding cluster center. VBDA has clearer cluster boundaries and more compact and centered clusters, demonstrating that information irrelevant to classification is filtered out by the proposed variational information bottleneck and only information relevant to classification is preserved.
In this paper, we proposed Variational Bottleneck Domain Adaptation (VBDA), a simple yet effective regularization mechanism for unsupervised domain adaptation. VBDA enhances semantic information and removes irrelevant factors in the learned representation space, which improves generalization ability and renders strong hypotheses such as the cluster assumption more realistic. Comprehensive experiments demonstrate that the proposed approach achieves state-of-the-art performance on various domain adaptation benchmarks.