Conditional Bures Metric for Domain Adaptation

07/31/2021 · You-Wei Luo, et al. · Sun Yat-sen University

As a vital problem in classification-oriented transfer, unsupervised domain adaptation (UDA) has attracted widespread attention in recent years. Previous UDA methods assume that the marginal distributions of different domains are shifted while ignoring the discriminant information in the label distributions. This leads to classification performance degeneration in real applications. In this work, we focus on the conditional distribution shift problem, which is of central concern to current conditional-invariant models. We aim to seek a kernel covariance embedding for conditional distributions, which remains unexplored. Theoretically, we propose the Conditional Kernel Bures (CKB) metric for characterizing conditional distribution discrepancy, and derive an empirical estimation of the CKB metric without introducing the implicit kernel feature maps. This provides an interpretable approach to understanding the knowledge transfer mechanism. The established consistency theory of the empirical estimation provides a theoretical guarantee for convergence. A conditional distribution matching network is proposed to learn conditional invariant and discriminative features for UDA. Extensive experiments and analysis show the superiority of the proposed model.


1 Introduction

Large-scale data with sufficient annotations are a vital resource for machine learning. However, data collected from real-world scenarios are usually unlabeled, and manual annotation is expensive. Recent advances in transfer learning yield plenty of methods for dealing with the shortage of labeled data. These methods aim to transfer the knowledge from a labeled source domain to a target domain with few or no annotations, a setting also known as domain adaptation [pan2009survey].

The most common assumption in Unsupervised Domain Adaptation (UDA) is that the labeled source domain and the unlabeled target domain have the same feature space but different marginal distributions [pan2009survey]. This assumption is also called covariate shift [shimodaira2000improving] or sample selection bias [zadrozny2004learning]. Ben-David et al. [ben2007analysis] give a theoretical insight into the domain adaptation problem: they show that the risk on the target domain is mainly bounded by the risk on the source domain and the discrepancy between the distributions of the two domains. Inspired by this theory, many methods have been proposed to mitigate the discrepancy between the feature distributions of the source and target domains, e.g., explicit discrepancy minimization via Maximum Mean Discrepancy (MMD) [gretton2012kernel, long2015learning], domain invariant feature learning [pan2010domain], Optimal Transport (OT) based feature matching [courty2016optimal, li2020Enhanced, zhang2019optimal], manifold based feature alignment [gong2012geodesic], statistical moment matching [long2015learning, sun2016return] and adversarial domain adaptation [ganin2016domain]. These methods are proved to be effective in minimizing the marginal discrepancy and alleviating the domain shift problem. However, this assumption may lead to the omission of the discriminant information in the label distributions, as illustrated in Figure 1. Recent advances [li2020maximum, long2018conditional, luo2020unsupervised] show that adaptation models become more discriminative on the target domain if the target label information (e.g., pseudo labels) is explored carefully.

Figure 1: Illustration of the conditional shift problem. Previous metrics that only consider the marginal distribution discrepancy may lead to a misaligned conditional distribution, e.g., the red circle region. On the bottom, the class-level alignment is achieved by exploiting the conditional distribution embedding metric.

Extended from the marginal shift assumption, the conditional shift problem, where the class-conditional distributions differ across domains, is studied to build a conditional invariant model [zhang2013domain]. The most critical problem is to construct a framework which can explicitly reflect the relation between different conditional distributions. Zhao et al. [zhao2019learning] prove a new generalization bound which quantitatively reflects the underlying structure of the conditional shift problem. Several advances have also been made in the field of conditional/joint distribution matching for domain adaptation, e.g., multi-layer feature approximation [Long2017Deep], conditional variants of MMD [kang2020contrastive, li2020maximum, zhu2020deep], conditional invariant learning with causal interpretations [gong2016domain, ren2018generalized], and OT based joint distribution models [bhushan2018deepjdot, courty2017joint].

In this paper, we aim to estimate the transport cost in a Reproducing Kernel Hilbert Space (RKHS) for continuous conditional distributions. Inspired by the pioneering work [fukumizu2009kernel], which employs the conditional covariance operator on an RKHS to characterize conditional independence, we define a transport cost on the set of conditional covariance operators, called the Conditional Kernel Bures (CKB) metric. By virtue of the conditional covariance operator and OT theory, we prove that the CKB metric directly reflects the discrepancy between two conditional distributions. This result can be taken as an extension of the marginal distribution embedding property of MMD [gretton2012kernel] and the kernel Bures metric [zhang2019optimal]. An explicit empirical estimation of the CKB metric and its consistency theory are presented. Further, we apply it to the proposed conditional distribution matching network. Extensive experimental results show the effectiveness of the CKB metric and the superiority of the proposed model. Our contributions are summarized as follows.

  • A novel CKB metric for characterizing conditional distribution discrepancy is proposed, and the kernel embedding property of the CKB metric is proved to show that it is well-defined on conditional distributions. This metric is also exactly the OT cost between conditional distributions, which provides an interpretable approach to understanding the knowledge transfer mechanism.

  • An explicit empirical estimation of the CKB metric is derived, which provides a computable measurement of conditional domain discrepancy. The asymptotic property of the estimation is proved, which provides a rigorous theoretical guarantee for convergence.

  • A conditional distribution matching network based on the CKB metric is proposed for discriminative domain alignment, and a joint distribution matching variant is further developed. State-of-the-art results in extensive experiments validate the model’s effectiveness.

2 Related Work

Unsupervised Domain Adaptation. Based on the distribution shift assumption, UDA methods can be roughly categorized as follows. Domain invariant feature learning methods like Transfer Component Analysis (TCA) [pan2010domain] try to learn a set of transfer components that make the corresponding distribution robust to the change of domains. OT based methods mitigate the domain discrepancy by minimizing the cost of transporting the source samples to the target domain. It has been shown that OT alignment is equivalent to minimizing the KL divergence [courty2016optimal] or the Wasserstein distance [zhang2019optimal] between the distributions. Moment matching methods attempt to minimize the distribution discrepancy via statistical moments, e.g., the Domain Adaptation Network (DAN) [long2015learning] for first-order matching and CORAL [sun2016return] for second-order matching. Manifold alignment methods take the domains as points on a manifold and align the domains under the manifold metric [gong2012geodesic, luo2020unsupervised]. Adversarial based methods [ganin2016domain, tang2020discriminative] alternately optimize the feature generator and the domain discriminator, which are respectively supposed to be domain-confusing and discriminative, to achieve domain confusion. Extended from the marginal distribution assumption, recent works [bhushan2018deepjdot, courty2017joint, li2020Enhanced, long2018conditional, Long2017Deep] show that models yield promising results by introducing the label information. The Joint Adaptation Network (JAN) [Long2017Deep] builds a joint distribution alignment model via the features from different hidden layers. The Conditional Domain Adversarial Network (CDAN) [long2018conditional] extends the Domain Adversarial Neural Network (DANN) [ganin2016domain] by exploring a multilinear map to describe the conditional variables in adversarial training.

Optimal Transport. Recently, OT has been successfully applied to the UDA problem [bhushan2018deepjdot, courty2017joint, courty2016optimal, li2020Enhanced, zhang2019optimal]. Courty et al. [courty2016optimal] deal with UDA based on the Kantorovitch formulation of OT, which allows one to define the well-known Wasserstein distance between the domain distributions. As a variant of the Wasserstein distance, the Bures metric has been of great interest in various research fields such as quantum information, information theory and Riemannian geometry [bhatia2019bures]. The original Bures metric is defined on the set of Positive Semi-Definite (PSD) matrices and cannot be used to measure the distribution discrepancy. In [zhang2019optimal], Zhang et al. extend the OT problem to RKHS, and then define the kernel Wasserstein distance and the kernel Bures metric. They show that the covariance embedding in RKHS is injective, which implies that the kernel Bures metric defines a metric on the distributions. However, these discrepancy measures mainly focus on the marginal distribution. To exploit the label information, joint distribution OT models [bhushan2018deepjdot, courty2017joint] seek an optimal joint transport map that minimizes the generalized cost associated with the joint space of features and labels. The Enhanced Transport Distance (ETD) [li2020Enhanced] uses the prediction feedback from the classifier to reweigh the transport cost. Differing from the above OT based methods, which are formulated on discrete joint distributions or marginal distributions, our work focuses on the explicit estimation of OT between conditional distributions in the continuous case.

3 OT for Conditional Distribution

In this section, we first review the definitions and properties of conditional covariance operator and Kantorovitch’s OT in RKHS, which are the fundamentals of the proposed CKB metric. Then we present the theoretical definition and property of the CKB metric. Finally, we provide the empirical estimation and its asymptotic property.

3.1 Preliminary

Conditional Covariance Operators. Let $(\mathcal{X}, \mathcal{B}_{\mathcal{X}})$ and $(\mathcal{Y}, \mathcal{B}_{\mathcal{Y}})$ be measurable spaces with Borel $\sigma$-fields. Denote by $\mathcal{H}_{\mathcal{X}}$ and $\mathcal{H}_{\mathcal{Y}}$ the RKHSs on $\mathcal{X}$ and $\mathcal{Y}$, generated by the positive definite kernels $k_{\mathcal{X}}$ and $k_{\mathcal{Y}}$. The mean element $\mu_X \in \mathcal{H}_{\mathcal{X}}$ associated with the law $P_X$ is given by $\mu_X = \mathbb{E}_X[\phi(X)]$, where $\phi$ is the nonlinear feature map of $k_{\mathcal{X}}$. The kernel $k_{\mathcal{X}}$ satisfies the reproducing properties $\langle \phi(x), f\rangle_{\mathcal{H}_{\mathcal{X}}} = f(x)$ and $\langle \phi(x), \phi(x')\rangle_{\mathcal{H}_{\mathcal{X}}} = k_{\mathcal{X}}(x, x')$ for all $f \in \mathcal{H}_{\mathcal{X}}$ and $x, x' \in \mathcal{X}$; the same holds for $k_{\mathcal{Y}}$ with feature map $\psi$.

To explore the causal connection between $X$ and $Y$, we consider the pair $(X, Y)$ with probability measure $P_{XY} \in \mathcal{P}(\mathcal{X}\times\mathcal{Y})$, where $\mathcal{P}(\mathcal{X}\times\mathcal{Y})$ is the set of Borel probability measures on $\mathcal{X}\times\mathcal{Y}$. Given a joint measure $P_{XY}$, its corresponding cross-covariance operator $\Sigma_{YX}\colon \mathcal{H}_{\mathcal{X}} \to \mathcal{H}_{\mathcal{Y}}$ [baker1973joint] satisfies

$$\langle g, \Sigma_{YX} f\rangle_{\mathcal{H}_{\mathcal{Y}}} = \mathbb{E}_{XY}[f(X)g(Y)] - \mathbb{E}_X[f(X)]\,\mathbb{E}_Y[g(Y)], \qquad \forall f \in \mathcal{H}_{\mathcal{X}},\ g \in \mathcal{H}_{\mathcal{Y}}.$$

Formally, $\Sigma_{YX}$ is defined as [song2009hilbert]

$$\Sigma_{YX} = \mathbb{E}_{XY}\big[(\psi(Y) - \mu_Y) \otimes (\phi(X) - \mu_X)\big].$$

If $Y$ equals $X$, $\Sigma_{XX}$ is just the covariance operator on $\mathcal{H}_{\mathcal{X}}$. Based on the cross-covariance operator, we further consider the conditional covariance of $Y$ w.r.t. the conditioning variable $X$. The conditional covariance operator is usually written as [fukumizu2009kernel]

$$\Sigma_{YY|X} = \Sigma_{YY} - \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}.$$

Note that $\Sigma_{XX}$ may be non-invertible, especially in real-world applications with finite samples. When the necessary conditions are fulfilled [fukumizu2009kernel], the conditional covariance operator also satisfies

$$\langle g, \Sigma_{YY|X}\, g\rangle_{\mathcal{H}_{\mathcal{Y}}} = \mathbb{E}_X\big[\mathrm{Var}_{Y|X}[g(Y)\,|\,X]\big], \qquad \forall g \in \mathcal{H}_{\mathcal{Y}}.$$

Kantorovitch’s OT in RKHS. For any two distributions $P, Q \in \mathcal{P}(\mathcal{X})$, let $\Pi(P, Q)$ be the set of probabilistic couplings; the Kantorovitch formulation of OT in the RKHS is

$$\mathcal{W}^2(P, Q) = \inf_{\pi \in \Pi(P, Q)} \int_{\mathcal{X}\times\mathcal{X}} \|\phi(x) - \phi(x')\|_{\mathcal{H}_{\mathcal{X}}}^2 \, \mathrm{d}\pi(x, x'). \tag{1}$$

The Kantorovitch problem in Eq. (1) is also equivalent to the Wasserstein distance. Under Gaussian measures, if the distributions $P$ and $Q$ have the same expectations, the Wasserstein distance between them is equivalent to the Bures metric between their covariance matrices. Let $\mathcal{S}_+^d$ be the set of PSD matrices; for any PSD matrix $\Sigma \in \mathcal{S}_+^d$, its unique square root $\Sigma^{1/2}$ is defined by $\Sigma^{1/2}\Sigma^{1/2} = \Sigma$. The Bures metric is defined by

$$d_{\mathrm{B}}^2(\Sigma_P, \Sigma_Q) = \mathrm{tr}(\Sigma_P) + \mathrm{tr}(\Sigma_Q) - 2\,\mathrm{tr}\big[(\Sigma_P^{1/2}\Sigma_Q\Sigma_P^{1/2})^{1/2}\big],$$

where $\Sigma_P$ and $\Sigma_Q$ are the covariance matrices of $P$ and $Q$, respectively. Recent work shows that the Bures metric is also related to Riemannian geometry, as it can be taken as a metric on the PSD manifold [bhatia2019bures]. Though the Bures metric defines a metric on $\mathcal{S}_+^d$, it cannot reflect the discrepancy between the distributions $P$ and $Q$.
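To make the definition concrete, here is a minimal numerical sketch of the squared Bures metric between two PSD covariance matrices (the function name and the use of scipy.linalg.sqrtm are illustrative choices, not part of the original paper):

```python
import numpy as np
from scipy.linalg import sqrtm

def bures_sq(S1, S2):
    """Squared Bures metric between PSD matrices S1 and S2."""
    root = sqrtm(S1)                          # unique PSD square root of S1
    middle = sqrtm(root @ S2 @ root)          # (S1^{1/2} S2 S1^{1/2})^{1/2}
    # sqrtm may return a complex array with negligible imaginary parts
    return np.trace(S1) + np.trace(S2) - 2.0 * np.real(np.trace(middle))

# Example: covariance matrices of two zero-mean Gaussians
S1 = np.array([[2.0, 0.3], [0.3, 1.0]])
S2 = np.array([[1.0, 0.0], [0.0, 1.5]])
print(bures_sq(S1, S2))   # squared Wasserstein-2 distance between N(0, S1) and N(0, S2)
```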

The kernel Bures metric [zhang2019optimal] generalizes the PSD setting of the Bures metric to the infinite-dimensional RKHS $\mathcal{H}_{\mathcal{X}}$. Let $\mathcal{T}_+(\mathcal{H}_{\mathcal{X}})$ be the set of all positive, self-adjoint, and trace-class operators on $\mathcal{H}_{\mathcal{X}}$; the kernel Bures metric on $\mathcal{T}_+(\mathcal{H}_{\mathcal{X}})$ is written as

$$d_{\mathrm{KB}}^2(\Sigma_P, \Sigma_Q) = \mathrm{tr}(\Sigma_P) + \mathrm{tr}(\Sigma_Q) - 2\,\mathrm{tr}\big[(\Sigma_P^{1/2}\Sigma_Q\Sigma_P^{1/2})^{1/2}\big],$$

where $\Sigma_P$ and $\Sigma_Q$ are the covariance operators of $P$ and $Q$ on $\mathcal{H}_{\mathcal{X}}$, respectively. Note that the kernel Bures metric is exactly the transport cost in the RKHS when the push-forward measures $\phi_{\#}P$ and $\phi_{\#}Q$ are Gaussian [zhang2019optimal]. Zhang et al. [zhang2019optimal] prove that if the measurable space is locally compact and Hausdorff, the covariance embedding $P \mapsto \Sigma_P$ is injective. It turns out that $d_{\mathrm{KB}}$ defines a metric on the distributions, which no longer holds for the Bures metric. With this property, the kernel Bures metric can be used to quantify the discrepancy between two distributions.
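As a computational aside, when the covariance operators are replaced by their empirical counterparts, the kernel Bures metric can be evaluated from Gram matrices alone. The following sketch illustrates this; the function names, the Gaussian kernel choice and the centering/scaling conventions are our own assumptions rather than the paper's released code:

```python
import numpy as np

def gaussian_gram(A, B, sigma2=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma2))."""
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-d2 / (2.0 * sigma2))

def kernel_bures_sq(Xs, Xt, sigma2=1.0):
    """Empirical squared kernel Bures metric between samples Xs and Xt."""
    n, m = len(Xs), len(Xt)
    Hn = np.eye(n) - 1.0 / n                        # centering matrices H = I - 11^T/n
    Hm = np.eye(m) - 1.0 / m
    Ks = Hn @ gaussian_gram(Xs, Xs, sigma2) @ Hn    # centered Gram matrices
    Kt = Hm @ gaussian_gram(Xt, Xt, sigma2) @ Hm
    Kts = Hm @ gaussian_gram(Xt, Xs, sigma2) @ Hn   # centered cross-Gram matrix
    cross = np.linalg.norm(Kts, ord='nuc') / np.sqrt(n * m)
    return np.trace(Ks) / n + np.trace(Kt) / m - 2.0 * cross
```

The cross term uses a nuclear norm because the nonzero eigenvalues of the product of the two empirical covariance operators coincide with those of the centered cross-Gram matrix.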

3.2 Conditional Kernel Bures Metric

To introduce conditional distributions into OT, we develop the kernel covariance embedding property for conditional distributions and apply it to the kernel Bures metric. The CKB metric for conditional distributions is now defined.

Definition

The Conditional Kernel Bures (CKB) metric between two conditional distributions $P_{Y|X}$ and $Q_{Y|X}$ is defined as

$$d_{\mathrm{CKB}}^2(P_{Y|X}, Q_{Y|X}) = \mathrm{tr}\big(\Sigma^P_{YY|X}\big) + \mathrm{tr}\big(\Sigma^Q_{YY|X}\big) - 2\,\mathrm{tr}\Big[\big((\Sigma^P_{YY|X})^{1/2}\,\Sigma^Q_{YY|X}\,(\Sigma^P_{YY|X})^{1/2}\big)^{1/2}\Big], \tag{2}$$

where $\Sigma^P_{YY|X}$ and $\Sigma^Q_{YY|X}$ are the conditional covariance operators induced by $P_{XY}$ and $Q_{XY}$, respectively.

Proposition

$d_{\mathrm{CKB}}$ defines a metric on the set of positive, self-adjoint, and trace-class operators on $\mathcal{H}_{\mathcal{Y}}$.

Recall that the conditional covariance operator is also positive, self-adjoint, and trace-class on $\mathcal{H}_{\mathcal{Y}}$ [fukumizu2009kernel]. Thus, we can deduce from Proposition 3.2 that the CKB metric is well-defined on the conditional covariance operators.

The injective properties of the mean embedding [gretton2012kernel] and the covariance embedding [zhang2019optimal] in RKHS give theoretical insights into how two distributions are matched via the corresponding metrics, i.e., MMD and the kernel Bures metric. Similarly, we also make a connection between the CKB metric and conditional distributions. Note that though the above embedding properties are well studied, they only consider connections between the operators and the marginal distributions. As the embedding property between the covariance operators and conditional distributions is unexplored, our work focuses on extending the CKB metric to a metric on conditional distributions. For convenience, we consider the set of measures that satisfy the 3-splitting property [zhang2019optimal] and work on the direct sum of the corresponding RKHSs.

Theorem

Let $\mathcal{Y}$ be a locally compact and Hausdorff measurable space and $k_{\mathcal{Y}}$ be a $c_0$-universal kernel. Assume that the embedded variable $\psi(Y)$ is a Gaussian random variable in $\mathcal{H}_{\mathcal{Y}}$. For any $P_{XY}$ and $Q_{XY}$ satisfying these conditions, we have

$$d_{\mathrm{CKB}}(P_{Y|X}, Q_{Y|X}) = 0 \iff P_{Y|X} = Q_{Y|X}.$$

The above theorem shows that the CKB metric defines a metric on conditional distributions if some mild conditions are satisfied. Note that the CKB metric is exactly the minimized OT cost between two conditional distributions, since the corresponding push-forward measures in the RKHS are also Gaussian. Thus, it can be used to measure the discrepancy between two conditional distributions. The $c_0$-universal condition [sriperumbudur2011universality] in Theorem 3.2 is satisfied by many common kernels, e.g., the Gaussian and Laplacian kernels. The assumption of a Gaussian random variable can be taken as the extension of the Gaussian distribution to random variables taking values in an RKHS [klebanov2020rigorous]. Recall that the feature maps $\phi$ and $\psi$ are implicit, so the conditional covariance operator cannot be formed explicitly in the practical computation of the CKB metric. To present an explicit formulation of the CKB metric, we use the kernel trick, i.e., $\langle\phi(x), \phi(x')\rangle_{\mathcal{H}_{\mathcal{X}}} = k_{\mathcal{X}}(x, x')$, to avoid the explicit nonlinear maps in the next section.

3.3 Empirical Estimation of the Conditional Kernel Bures Metric

Let $\{(x^s_i, y^s_i)\}_{i=1}^{n}$ and $\{(x^t_j, y^t_j)\}_{j=1}^{m}$ be two sets of samples, which are assumed to be drawn i.i.d. from $P_{XY}$ and $Q_{XY}$, respectively. We map the data $X$ (resp. $Y$) to the RKHS $\mathcal{H}_{\mathcal{X}}$ (resp. $\mathcal{H}_{\mathcal{Y}}$) with the implicit feature map $\phi$ (resp. $\psi$). Let $K_{X^s}$, $K_{Y^s}$ and the cross-domain matrix $K_{Y^tY^s}$ be the explicit kernel matrices computed as $(K_{X^s})_{ij} = k_{\mathcal{X}}(x^s_i, x^s_j)$, $(K_{Y^s})_{ij} = k_{\mathcal{Y}}(y^s_i, y^s_j)$ and $(K_{Y^tY^s})_{ij} = k_{\mathcal{Y}}(y^t_i, y^s_j)$, respectively (the target-domain matrices are defined analogously). Denote the feature map matrices by $\Phi_s = [\phi(x^s_1), \ldots, \phi(x^s_n)]$ and $\Psi_s = [\psi(y^s_1), \ldots, \psi(y^s_n)]$. Their cross-covariance matrices can be written as $\hat\Sigma^s_{YX} = \frac{1}{n}\Psi_s H_n\Phi_s^\top$, where $H_n = I_n - \frac{1}{n}\mathbf{1}_n\mathbf{1}_n^\top$ is the centering matrix and $\mathbf{1}_n$ is the $n$-dimensional vector with all elements equal to 1. As the covariance matrix $\hat\Sigma^s_{XX}$ is always rank-deficient in the finite-sample case, we regularize it and estimate the conditional covariance operator as

$$\hat\Sigma^s_{YY|X} = \hat\Sigma^s_{YY} - \hat\Sigma^s_{YX}\big(\hat\Sigma^s_{XX} + \varepsilon I\big)^{-1}\hat\Sigma^s_{XY}, \tag{3}$$

where $\varepsilon > 0$ is the regularization parameter. Denote the matrices

$$\tilde H^s = \Big(\tfrac{1}{n\varepsilon}\bar K_{X^s} + I_n\Big)^{-1}, \qquad \tilde H^t = \Big(\tfrac{1}{m\varepsilon}\bar K_{X^t} + I_m\Big)^{-1},$$

where $\bar K_{X^s} = H_nK_{X^s}H_n$ and $\bar K_{X^t} = H_mK_{X^t}H_m$ are the centralized kernel matrices. With the decomposition $\tilde H^s = D_sD_s^\top$, the conditional covariance operator can be reformulated as ($\hat\Sigma^t_{YY|X}$ is the same)

$$\hat\Sigma^s_{YY|X} = \tfrac{1}{n}\,\bar\Psi_s\tilde H^s\bar\Psi_s^\top = \tfrac{1}{n}\big(\bar\Psi_sD_s\big)\big(\bar\Psi_sD_s\big)^\top, \qquad \bar\Psi_s = \Psi_sH_n. \tag{4}$$
Proposition

If $k_{\mathcal{X}}$ is a positive definite kernel, then $\tilde H^s$ and $\tilde H^t$ are positive definite for any $\varepsilon > 0$. In particular, we have $0 \prec \tilde H^s \preceq I_n$ and $0 \prec \tilde H^t \preceq I_m$.

Remark

Proposition 3.3 shows that $\tilde H^s$ is positive definite for a positive definite kernel (e.g., the Gaussian and Laplacian kernels), so the decomposition $\tilde H^s = D_sD_s^\top$ always exists. But such a decomposition is not unique, e.g., Cholesky factorization and eigendecomposition. Here we compute $D_s$ based on the Eigenvalue Decomposition (EVD) of $\tilde H^s$ as ($D_t$ is the same)

$$D_s = U_s\Lambda_s^{1/2},$$

where $U_s$ and $\Lambda_s$ are the eigenvector and eigenvalue matrices of $\tilde H^s$, respectively.

The reformulation in Eq. (4) affords an explicit insight into the conditional covariance operator. As $\tilde H^s$ is computed from the Gram matrix $\bar K_{X^s}$, it is highly related to the conditional variable $X$. Compared with the covariance operator $\frac{1}{n}\bar\Psi_s\bar\Psi_s^\top$ on the RKHS $\mathcal{H}_{\mathcal{Y}}$, the feature map in the conditional covariance operator is transformed by the modified centering matrix $H_nD_s$, which contains the conditional information. Based on the above reformulation, the following theorem provides the explicit computation of the CKB metric. Note that the reformulation in Eq. (4) is included in the proof of Theorem 3.3, and all proofs of theorems and propositions are provided in the supplementary material.
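For concreteness, a minimal sketch of the EVD-based square-root factor described in the remark (the function name is ours):

```python
import numpy as np

def evd_sqrt(T):
    """Factor D with D @ D.T == T for a symmetric PSD matrix T, via EVD."""
    w, U = np.linalg.eigh(T)                              # T = U diag(w) U^T
    return U @ np.diag(np.sqrt(np.clip(w, 0.0, None)))    # D = U diag(w)^{1/2}
```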

Theorem

The empirical estimation of the CKB metric is computed as

$$\widehat{d}_{\mathrm{CKB}}^{\,2} = \tfrac{1}{n}\mathrm{tr}\big(\tilde H^s\bar K_{Y^s}\big) + \tfrac{1}{m}\mathrm{tr}\big(\tilde H^t\bar K_{Y^t}\big) - \tfrac{2}{\sqrt{nm}}\big\|D_t^\top\bar K_{Y^tY^s}D_s\big\|_*, \tag{5}$$

where $\|\cdot\|_*$ is the nuclear norm, and $\bar K_{Y^s} = H_nK_{Y^s}H_n$, $\bar K_{Y^t} = H_mK_{Y^t}H_m$ and $\bar K_{Y^tY^s} = H_mK_{Y^tY^s}H_n$ are the centralized kernel matrices of the embedded variable.
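As a side note on why a nuclear norm appears in the third term (a short derivation sketch under the factorizations above): for PSD operators $A = aa^\top$ and $B = bb^\top$,

$$\mathrm{tr}\big[(A^{1/2}BA^{1/2})^{1/2}\big] = \sum_i\sqrt{\lambda_i(AB)} = \sum_i\sigma_i\big(b^\top a\big) = \big\|b^\top a\big\|_*,$$

so taking $a = \tfrac{1}{\sqrt{n}}\bar\Psi_sD_s$ and $b = \tfrac{1}{\sqrt{m}}\bar\Psi_tD_t$ turns the cross term of Eq. (2) into $\tfrac{1}{\sqrt{nm}}\|D_t^\top\bar K_{Y^tY^s}D_s\|_*$.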

Remark

The computational cost of the CKB metric consists of the three terms in Eq. (5). For the first term, computing the kernel matrices and the matrix inverse costs about $O(n^2(d+c) + n^3)$. Similarly, the cost of the second term is about $O(m^2(d+c) + m^3)$. For the third term, the cost of the cross kernel matrix, the EVD and the nuclear norm is about $O(nmc + n^3 + m^3)$. Thus, the computational complexity of the CKB metric is about $O\big((n+m)^2(d+c) + (n+m)^3\big)$, where $d$ and $c$ are the feature dimension and the number of classes, respectively.

3.4 Convergence Analysis

In this section, we focus on the convergence of the empirical estimation of the CKB metric. The convergence theorem is based on the properties of trace-class operators on the Hilbert space and the asymptotic theory of the conditional covariance operator established by Fukumizu et al. [fukumizu2009kernel]. Let $\hat\Sigma^{(n)}_{YY|X}$ be the conditional covariance operator estimated from $n$ i.i.d. samples of $P_{XY}$ as in Eq. (3); Proposition 7 in [fukumizu2009kernel] shows that the estimator $\hat\Sigma^{(n)}_{YY|X}$ converges to $\Sigma_{YY|X}$ in probability. Moreover, it shows that the estimation error is bounded in probability at a rate determined by the sample size $n$ and the regularization parameter $\varepsilon_n$.

With the consistency of the conditional covariance operator, we now establish the asymptotic theory for the CKB metric. Assuming that the conditional covariance operators are specified by the source and target domains, we denote the population and empirical squared CKB metrics by $d^{\,2}_{\mathrm{CKB}} = d^2_{\mathrm{CKB}}\big(\Sigma^s_{YY|X}, \Sigma^t_{YY|X}\big)$ and $\widehat d^{\,2}_{\mathrm{CKB}} = d^2_{\mathrm{CKB}}\big(\hat\Sigma^s_{YY|X}, \hat\Sigma^t_{YY|X}\big)$. The convergence of $\widehat d^{\,2}_{\mathrm{CKB}}$ is dominated by the convergence of the three terms in Eq. (2). Specifically, the convergence of the first two terms follows from the consistency of the conditional covariance operator, and the third term can be reduced to convergence in trace norm on the Hilbert space. We present the convergence theorem of the CKB metric as follows.

Theorem

Let the regularization parameter in Eq. (3) be a sequence $\varepsilon_n$ depending on the sample size. Assuming that $\varepsilon_n \to 0$ sufficiently slowly as the sample sizes grow, then we have

$$\big|\widehat d^{\,2}_{\mathrm{CKB}} - d^{\,2}_{\mathrm{CKB}}\big| \longrightarrow 0$$

in probability, at a rate determined by $\varepsilon_n$ and the sample sizes.

Theorem 3.4 shows that the empirical estimation error of the CKB metric converges to 0 in probability as the sample sizes grow. Specifically, the estimation error is bounded in probability at a square-root rate relative to that of the conditional covariance operator; this square-root rate comes from the convergence rate of the cross term, i.e., $\mathrm{tr}\big[\big((\Sigma^s_{YY|X})^{1/2}\Sigma^t_{YY|X}(\Sigma^s_{YY|X})^{1/2}\big)^{1/2}\big]$.

4 Unsupervised Domain Adaptation

In this section, we tackle the UDA problem by describing the domains as conditional distributions and minimizing the conditional distribution discrepancy under the CKB metric.

4.1 Conditional Distribution Matching Network

For UDA problems, $\mathcal{D}^s = \{(x^s_i, y^s_i)\}_{i=1}^{n}$ is taken as the labeled source domain and $\mathcal{D}^t = \{x^t_j\}_{j=1}^{m}$ as the unlabeled target domain, where $x$ represents the observations and $y$ the one-hot labels with $c$ classes. The primary task is to generalize the classifier trained on both $\mathcal{D}^s$ and $\mathcal{D}^t$ to predict the target labels. Previous UDA methods assume that the target distribution is shifted from the source distribution (i.e., the marginal feature distributions differ) and generalize by minimizing the marginal distribution discrepancy. This assumption only considers the feature distribution but ignores the discriminant information from the labels. Here we consider the shift of the conditional distribution of features given labels, which helps the adaptation model incorporate discriminant information. To learn a conditional distribution matching model, we first design a feature extractor $g$ based on Deep Neural Networks (DNNs), which aims to align the conditional distributions of the domains, i.e., $P^s(Z|Y)$ and $P^t(Z|Y)$, where $Z = g(X)$ denotes the extracted features. Then the classifier $f$ will be trained on the aligned features. Denote the extracted features by $Z^s = g(X^s)$ and $Z^t = g(X^t)$, and the soft predictions by $\hat Y = f(Z)$. The detailed network architecture is provided in the supplementary material.

Figure 2: Flowchart of the conditional matching model. The features are mapped into the RKHS, and the conditional distributions of the domains are represented by their conditional covariance operators in RKHS. Then the conditional distribution discrepancy is estimated based on the CKB metric, and the adaptation model is optimized according to the discrepancy feedback.

The flowchart of the proposed method is shown in Figure 2. It aligns the source and target domains in a conditional invariant space by minimizing the CKB metric between the conditional distributions of the extracted features, i.e., $P^s(Z|Y)$ and $P^t(Z|Y)$. Based on the conditional invariant features, a discriminative classifier is learned by applying an entropy-based criterion to both domains. A well-aligned feature space is preferable for training the classifier. Meanwhile, a more accurate classifier leads to a more precise estimation of the CKB metric and fewer misaligned sample pairs. Therefore, the two processes can benefit from each other and enhance the transferability and discriminability of the model alternately.

In general, the proposed network is trained with three loss terms. First, the cross-entropy function is applied to the labeled source data, which builds a basic network for classification. The cross-entropy loss is written as

$$\mathcal{L}_{\mathrm{ce}} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{c} y^s_{ik}\log \hat y^s_{ik}.$$

Then the entropy criterion is applied to the target predictions:

$$\mathcal{L}_{\mathrm{ent}} = -\frac{1}{m}\sum_{j=1}^{m}\sum_{k=1}^{c} \hat y^t_{jk}\log \hat y^t_{jk}.$$

This term has been proved to be effective in semi-supervised and unsupervised learning [grandvalet2005semi]. For UDA, it preserves the intrinsic structure of the target domain and reduces the uncertainty of the target predictions.

To match the conditional distributions of the two domains, the CKB metric is applied to the deep features learned by the nonlinear mapping $g$. Thus, the kernel matrices and feature maps are computed from the deep features hereinafter, i.e., from $Z^s = g(X^s)$ and $Z^t = g(X^t)$. In terms of the conditional variable, the kernel matrix and feature map are computed from the source labels $Y^s$. As the ground-truth labels of the target samples are unknown, we use the pseudo labels predicted by the classifier to approximate them and compute the corresponding feature map. The CKB loss $\mathcal{L}_{\mathrm{CKB}}$ is then computed according to Eq. (5) on these feature and label kernel matrices.
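A minimal sketch of such a mini-batch CKB loss, following the empirical form of Eq. (5) as reconstructed above (the Gram-matrix arguments, the helper psd_sqrt and all names are illustrative, not the authors' released code):

```python
import torch

def psd_sqrt(T):
    """EVD-based factor D with D @ D.T == T for a symmetric PSD matrix T."""
    w, U = torch.linalg.eigh(T)
    return U @ torch.diag(w.clamp_min(0).sqrt())

def ckb_loss(Kz_s, Kz_t, Kz_ts, Ky_s, Ky_t, eps=1e-3):
    """Empirical CKB between P^s(Z|Y) and P^t(Z|Y) from Gram matrices.
    Kz_s/Kz_t: feature Gram matrices; Kz_ts: target-vs-source feature Gram;
    Ky_s/Ky_t: Gram matrices of the conditioning labels / pseudo labels."""
    n, m = Kz_s.shape[0], Kz_t.shape[0]
    In = torch.eye(n, dtype=Kz_s.dtype, device=Kz_s.device)
    Im = torch.eye(m, dtype=Kz_t.dtype, device=Kz_t.device)
    Hn, Hm = In - 1.0 / n, Im - 1.0 / m                  # centering matrices
    Ts = torch.linalg.inv(Hn @ Ky_s @ Hn / (n * eps) + In)
    Tt = torch.linalg.inv(Hm @ Ky_t @ Hm / (m * eps) + Im)
    Ds, Dt = psd_sqrt(Ts), psd_sqrt(Tt)
    tr_s = torch.trace(Ts @ (Hn @ Kz_s @ Hn)) / n        # first term of Eq. (5)
    tr_t = torch.trace(Tt @ (Hm @ Kz_t @ Hm)) / m        # second term
    cross = torch.linalg.matrix_norm(Dt.T @ (Hm @ Kz_ts @ Hn) @ Ds, ord='nuc')
    return tr_s + tr_t - 2.0 * cross / (n * m) ** 0.5    # third term (nuclear norm)
```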

Let $\lambda_1$ and $\lambda_2$ be the trade-off parameters; the objective function of the conditional alignment model is written as

$$\min_{g,\,f}\ \mathcal{L}_{\mathrm{ce}} + \lambda_1\mathcal{L}_{\mathrm{ent}} + \lambda_2\mathcal{L}_{\mathrm{CKB}}. \tag{6}$$

According to Theorem 3.2, the domain conditional distributions are aligned (i.e., $P^s(Z|Y) = P^t(Z|Y)$) when $\mathcal{L}_{\mathrm{CKB}} = 0$. Further, if the marginal label distributions $P^s(Y)$ and $P^t(Y)$ are also aligned, then domain joint distribution matching is achieved as $P^s(Z, Y) = P^t(Z, Y)$. Since the target label distribution is unknown, we apply the marginal matching constraint to the label distributions estimated from the classifier's predictions. Specifically, the marginal discrepancy can be approximated by the MMD between $P^s(\hat Y)$ and $P^t(\hat Y)$, denoted $\mathcal{L}_{\mathrm{MMD}}$, where the target side is computed from the soft predictions $\hat Y^t$. Finally, the joint distribution alignment loss is the sum of $\mathcal{L}_{\mathrm{CKB}}$ and $\mathcal{L}_{\mathrm{MMD}}$, and the objective function is written as

$$\min_{g,\,f}\ \mathcal{L}_{\mathrm{ce}} + \lambda_1\mathcal{L}_{\mathrm{ent}} + \lambda_2\big(\mathcal{L}_{\mathrm{CKB}} + \mathcal{L}_{\mathrm{MMD}}\big). \tag{7}$$
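As an illustration, a small sketch of the MMD term on soft predictions (the Gaussian kernel choice and the names are ours; the paper does not specify the kernel used for this term in the visible text):

```python
import torch

def gaussian_gram_t(A, B, sigma2=1.0):
    """Gaussian kernel matrix between rows of A and B."""
    return torch.exp(-torch.cdist(A, B).pow(2) / (2.0 * sigma2))

def mmd_loss(P_s, P_t, sigma2=1.0):
    """Squared MMD between source and target soft predictions (rows sum to 1)."""
    k_ss = gaussian_gram_t(P_s, P_s, sigma2).mean()
    k_tt = gaussian_gram_t(P_t, P_t, sigma2).mean()
    k_st = gaussian_gram_t(P_s, P_t, sigma2).mean()
    return k_ss + k_tt - 2.0 * k_st
```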

In summary, $\mathcal{L}_{\mathrm{CKB}}$ and $\mathcal{L}_{\mathrm{MMD}}$ aim to integrate the samples from different domains by mitigating the conditional or joint distribution discrepancies, and the first two terms enhance the model's discriminability by using the label and prediction information from both domains.
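Putting the pieces together, one possible mini-batch objective following Eqs. (6)-(7) might look as follows (a sketch under the reconstructions above; the helper signature and the default trade-off values are assumptions, not the released implementation):

```python
import torch
import torch.nn.functional as F

def total_loss(logits_s, y_s, logits_t, L_ckb, L_mmd=None, lam1=0.5, lam2=10.0):
    """Eq. (6) when L_mmd is None, Eq. (7) otherwise."""
    L_ce = F.cross_entropy(logits_s, y_s)                     # source supervision
    p_t = F.softmax(logits_t, dim=1)
    L_ent = -(p_t * torch.log(p_t + 1e-8)).sum(dim=1).mean()  # target entropy
    align = L_ckb if L_mmd is None else L_ckb + L_mmd
    return L_ce + lam1 * L_ent + lam2 * align
```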

Office-Home Ar→Cl Ar→Pr Ar→Rw Cl→Ar Cl→Pr Cl→Rw Pr→Ar Pr→Cl Pr→Rw Rw→Ar Rw→Cl Rw→Pr Mean
Source [he2016deep] 34.9 50.0 58.0 37.4 41.9 46.2 38.5 31.2 60.4 53.9 41.2 59.9 46.1
DAN [long2015learning] 43.6 57.0 67.9 45.8 56.5 60.4 44.0 43.6 67.7 63.1 51.5 74.3 56.3
DANN [ganin2016domain] 45.6 59.3 70.1 47.0 58.5 60.9 46.1 43.7 68.5 63.2 51.8 76.8 57.6
KGOT [zhang2019optimal] 36.2 59.4 65.0 48.6 56.5 60.2 52.1 37.8 67.1 59.0 41.9 72.0 54.7
CDAN+E [long2018conditional] 50.7 70.6 76.0 57.6 70.0 70.0 57.4 50.9 77.3 70.9 56.7 81.6 65.8
ETD [li2020Enhanced] 51.3 71.9 85.7 57.6 69.2 73.7 57.8 51.2 79.3 70.2 57.5 82.1 67.3
DMP [luo2020unsupervised] 52.3 73.0 77.3 64.3 72.0 71.8 63.6 52.7 78.5 72.0 57.7 81.6 68.1
CKB 54.7 74.4 77.1 63.7 72.2 71.8 64.1 51.7 78.4 73.1 58.0 82.4 68.5
CKB+MMD 54.2 74.1 77.5 64.6 72.2 71.0 64.5 53.4 78.7 72.6 58.4 82.8 68.7

Image-CLEF-DA I→P P→I I→C C→I C→P P→C Mean
Source [he2016deep] 74.8±0.3 83.9±0.1 91.5±0.3 78.0±0.2 65.5±0.3 91.2±0.3 80.7
DAN [long2015learning] 74.5±0.4 82.2±0.2 92.8±0.2 86.3±0.4 69.2±0.4 89.8±0.4 82.5
DANN [ganin2016domain] 75.0±0.3 86.0±0.3 96.2±0.4 87.0±0.5 74.3±0.5 91.5±0.6 85.0
KGOT [zhang2019optimal] 76.3 83.3 93.5 87.5 74.8 89.0 84.1
CDAN+E [long2018conditional] 77.7±0.3 90.7±0.2 97.7±0.3 91.3±0.3 74.2±0.2 94.3±0.3 87.7
ETD [li2020Enhanced] 81.0 91.7 97.9 93.3 79.5 95.0 89.7
DMP [luo2020unsupervised] 80.7±0.1 92.5±0.1 97.2±0.1 90.5±0.1 77.7±0.2 96.2±0.2 89.1
CKB 80.7±0.1 93.7±0.1 97.0±0.1 93.5±0.2 79.2±0.1 97.0±0.1 90.2
CKB+MMD 80.7±0.2 92.2±0.1 96.5±0.1 92.2±0.2 79.9±0.2 96.7±0.1 89.7

Office10 A→C A→D A→W C→A C→D C→W D→A D→C D→W W→A W→C W→D Mean
Source [krizhevsky2012imagenet] 82.7 85.4 78.3 91.5 88.5 83.1 80.6 74.6 99.0 77.0 69.6 100.0 84.2
GFK [gong2012geodesic] 78.1 84.7 76.3 89.1 88.5 80.3 89.0 78.4 99.3 83.9 76.2 100.0 85.3
CORAL [sun2016return] 85.3 80.8 76.3 91.1 86.6 81.1 88.7 80.4 99.3 82.1 78.7 100.0 85.9
OT-IT [courty2016optimal] 83.3 84.1 77.3 88.7 90.5 88.5 83.3 84.0 98.3 88.9 79.1 99.4 87.1
KGOT [zhang2019optimal] 85.7 86.6 82.4 91.4 92.4 87.1 91.8 85.6 99.3 89.7 85.0 100.0 89.7
DMP [luo2020unsupervised] 86.6 90.4 91.3 92.8 93.0 88.5 91.4 85.3 97.7 91.9 85.6 100.0 91.2
CKB 87.0 93.6 90.2 93.4 93.6 90.8 92.7 83.5 100.0 92.4 84.3 100.0 91.8
CKB+MMD 87.5 93.0 89.8 93.3 91.7 92.9 92.3 83.4 99.7 92.8 85.8 100.0 91.9
Table 1: Accuracies (%) on Office-Home (ResNet-50), Image-CLEF-DA (ResNet-50) and Office10 (AlexNet).

4.2 Implementation Details

We train the proposed model with back-propagation in a mini-batch manner. As the CKB loss involves the inverses of the kernel matrices of the conditioning variables, we treat these inverse terms as constants to make the optimization stable. Thus, they are independent of the network parameters and no gradients flow through them. The regularization parameter $\varepsilon$ of the inverse in Eq. (5) is set empirically. In terms of the kernel function, the Gaussian kernel is adopted, and the bandwidth parameter is set as the mean of all squared Euclidean distances that form the corresponding kernel matrix. The kernel parameters are adaptively updated for each mini-batch. Thanks to the smoothness of the Gaussian kernel, the gradients of the network parameters always exist. The proposed methods in Eq. (6) and Eq. (7) are respectively abbreviated as CKB and CKB+MMD hereinafter.
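A small sketch of the adaptive Gaussian kernel described above (the exact parameterization of the bandwidth is our assumption):

```python
import torch

def adaptive_gaussian_gram(A, B):
    """Gaussian Gram matrix with bandwidth set to the mean squared distance."""
    d2 = torch.cdist(A, B).pow(2)
    sigma2 = d2.mean().detach()          # adaptive, treated as a constant per batch
    return torch.exp(-d2 / (2.0 * sigma2))
```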

5 Experiment

The proposed methods are evaluated and compared with state-of-the-art methods on four UDA datasets.

ImageCLEF-DA [caputo2014imageclef] consists of 3 domains with 12 common classes, i.e., Caltech (C), ImageNet (I) and Pascal (P), where each domain includes 600 images.

Office-Home [OfficeHome] contains 15,500 images from 4 domains with 65 classes, i.e., Art (Ar), Clipart (Cl), Product (Pr) and Real-World (Rw).

Office10 [gong2012geodesic] consists of 4 domains with 10 classes, i.e., Amazon (A), Caltech (C), DSLR (D) and Webcam (W).

Digits Recognition. Following the protocol in [hoffman2018cycada], we conduct the adaptation tasks between the handwritten digit datasets MNIST (M) and USPS (U).

5.1 Results

Comparison. Several state-of-the-art UDA approaches are compared with the proposed methods, and the results are shown in Tables 1-2. From the results on Office-Home in Table 1, we observe that the CKB+MMD method outperforms the compared methods in average accuracy, and the relaxed variant CKB also achieves an accuracy of 68.5%. The experimental results on ImageCLEF-DA are shown in the middle of Table 1. The CKB method improves the mean accuracy to 90.2% by further considering the discrepancy between the conditional distributions. The results show that the higher the accuracy of the target predictions, the more effective the CKB alignment, e.g., tasks P→I and P→C. Table 1 also shows the results on the Office10 dataset. The OT-IT and KGOT methods achieve accuracies of 87.1% and 89.7%, which shows the strength of OT theory in distribution matching. The CKB+MMD method achieves top-1 accuracy in most tasks and improves the mean accuracy to 91.9%. Table 2 shows the results on the digits recognition tasks. The proposed models surpass the advanced OT-based method ETD and achieve the highest accuracy in all tasks.

Figure 3: (a)-(b) Grid search for the hyper-parameters $\lambda_1$ and $\lambda_2$; (c)-(d) ablation analysis.
Figure 4: Feature visualization of the source-only and CKB models via t-SNE [maaten2008visualizing] on the Image-CLEF C→I task ('+' denotes source samples). (a) Before and (b) after adaptation, colored by domains; (c) before and (d) after adaptation, colored by classes.
Method M→U U→M
Source [hoffman2018cycada] 82.2±0.8 69.6±3.8
DANN [ganin2016domain] 95.7±0.1 90.0±0.2
CyCADA [hoffman2018cycada] 95.6±0.4 96.5±0.2
DeepJDOT [bhushan2018deepjdot] 95.7 96.4
ETD [li2020Enhanced] 96.4±0.3 96.3±0.1
CKB 96.3±0.1 96.6±0.4
CKB+MMD 96.6±0.1 96.3±0.1
Table 2: Accuracies (%) on Digits (LeNet).

Hyper-parameter. We investigate the selection of the hyper-parameters $\lambda_1$ and $\lambda_2$ on the ImageCLEF-DA dataset. The optimal $\lambda_1$ and $\lambda_2$ are respectively searched from $\{10^{-2}, 5\times10^{-2}, 10^{-1}, 5\times10^{-1}, 10^{0}\}$ and $\{10^{-1}, 10^{0}, 10^{1}, 10^{2}\}$. Figure 3 (a)-(b) show the results of the grid search; we observe that the model is stable across different hyper-parameter values and $(\lambda_1, \lambda_2) = (5\times10^{-1}, 10)$ is optimal among all settings.

Ablation. We compare the CKB metric with the Bures and kernel Bures metrics [zhang2019optimal], and evaluate the effectiveness of the loss terms in Eq. (6) on the ImageCLEF-DA dataset. The models without the CKB alignment loss and without the target entropy loss are abbreviated as w/o $\mathcal{L}_{\mathrm{CKB}}$ and w/o $\mathcal{L}_{\mathrm{ent}}$, respectively. The results in Figure 3 (c)-(d) show that the CKB metric is superior to the Bures and kernel Bures metrics, which indicates that the conditional operators help the model obtain the discriminant information from the labels and predictions.

Visualization. To evaluate the aligned features qualitatively, we use t-SNE [maaten2008visualizing] to visualize the features of the source-only model (before adaptation) and the CKB model (after adaptation) on the Image-CLEF C→I task. From Figure 4 (a), we observe that the conditional distribution is still shifted in the source-only model. In Figure 4 (b), all clusters are well aligned by the CKB method. Figure 4 (c)-(d) show the features colored by classes; we observe that the CKB model achieves inter-class separability and intra-class compactness on the target domain.

Figure 5: Time comparison.

Time Comparison. We conduct time comparison experiments on the Office-Home and Image-CLEF-DA datasets. The results in Figure 5 suggest that the CKB model is faster than CKB+MMD and DMP, which demonstrates that the conditional discrepancy metric is more efficient than the structure learning model DMP. As the proposed models are trained in a mini-batch manner, the time complexity of the CKB metric is only about $O(b^3)$, where $b$ is the batch size. Thus the CKB metric does not introduce much overhead compared to the DNN itself. The results show that the CKB model only takes about 10s longer than ResNet while significantly improving the accuracy by about 22% on the Office-Home dataset.

6 Conclusion

In this paper, we consider the conditional distribution shift problem in classification. Theoretically, we extend OT in RKHS by introducing the conditional variable, and prove that the proposed CKB metric defines a metric on conditional distributions. An empirical estimation is derived to provide an explicit computation of the CKB metric, and its asymptotic theory is established to guarantee consistency. By applying the CKB metric to DNNs, we propose a conditional distribution matching network which alleviates the shift of conditional distributions and preserves the intrinsic structures of both domains simultaneously. Extensive experimental results show the superiority of the proposed models on UDA problems.

Acknowledgement

This work is supported in part by the National Natural Science Foundation of China under Grants 61976229, 61906046, 11631015, and 12026601.

References