Bridging the Theoretical Bound and Deep Algorithms for Open Set Domain Adaptation

06/23/2020 ∙ by Li Zhong, et al. ∙ University of Technology Sydney ∙ Tsinghua University

In the unsupervised open set domain adaptation (UOSDA), the target domain contains unknown classes that are not observed in the source domain. Researchers in this area aim to train a classifier to accurately: 1) recognize unknown target data (data with unknown classes) and, 2) classify other target data. To achieve this aim, a previous study has proven an upper bound of the target-domain risk, and the open set difference, as an important term in the upper bound, is used to measure the risk on unknown target data. By minimizing the upper bound, a shallow classifier can be trained to achieve the aim. However, if the classifier is very flexible (e.g., deep neural networks (DNNs)), the open set difference will converge to a negative value when minimizing the upper bound, which causes an issue where most target data are recognized as unknown data. To address this issue, we propose a new upper bound of target-domain risk for UOSDA, which includes four terms: source-domain risk, ϵ-open set difference (Δ_ϵ), a distributional discrepancy between domains, and a constant. Compared to the open set difference, Δ_ϵ is more robust against the issue when it is being minimized, and thus we are able to use very flexible classifiers (i.e., DNNs). Then, we propose a new principle-guided deep UOSDA method that trains DNNs via minimizing the new upper bound. Specifically, source-domain risk and Δ_ϵ are minimized by gradient descent, and the distributional discrepancy is minimized via a novel open-set conditional adversarial training strategy. Finally, compared to existing shallow and deep UOSDA methods, our method shows the state-of-the-art performance on several benchmark datasets, including digit recognition (MNIST, SVHN, USPS), object recognition (Office-31, Office-Home), and face recognition (PIE).


I Introduction

Domain Adaptation (DA) methods aim to train a target-domain classifier with data in source and target domains [lu2015transfer]. Based on the variety of data in the target domain (i.e., fully-labeled, partially-labeled, and unlabeled), DA consists of three categories: supervised DA [motiian2017unified, zuo2018fuzzy01, zuo2017granular], semi-supervised DA [pereira2018semi, saito2019semi, zuo2018fuzzy02], and unsupervised DA (UDA) [liu2017heterogeneous, fang2019unsupervised]. In practice, UDA methods have been deployed to solve diverse real-world problems, such as object recognition [gopalan2011domain, kan2014domain], cross-domain recommendation [zhang2017cross], and sentiment analysis [liu2020heterogeneous].

There are two common settings in UDA: unsupervised closed set domain adaptation (UCSDA) and unsupervised open set domain adaptation (UOSDA). UCSDA is a classical scenario in which source and target domains share the same label sets. By contrast, in UOSDA, the target domain contains some unknown classes that are not observed in the source domain, and the data with unknown classes are called unknown target data. In Fig. 1, the source domain contains four known classes (i.e., monitor, mug, staple, and calculator), but the target domain contains some unknown classes in addition to the classes in the source domain.

Fig. 1: Unsupervised open set domain adaptation (UOSDA). When the target domain does not contain unknown classes, UOSDA will degenerate into the unsupervised closed set domain adaptation (UCSDA).

UOSDA is more general than UCSDA, since the label sets are usually not consistent between source and target domains in a real-world scenario. Namely, the target domain may contain classes that are not observed in the source domain. For example, a classifier trained with images of various kinds of cats is likely to encounter the image of a dog or another animal in reality. In this case, the UCSDA methods are unable to distinguish the unseen animals (i.e., unknown classes). UOSDA methods, however, can establish a boundary between known classes and unknown classes.

Panareda Busto et al. [panareda2017open] were the first to propose the setting of UOSDA, but in their setting the source domain also contains some unknown classes. Since it is expensive, and often prohibitive, to obtain source-domain data labeled with unknown classes, Saito et al. [saito2018open] proposed a new UOSDA setting where the source domain only contains known classes. In this paper, we focus on the same setting as Saito et al., which is more realistic [saito2018open, fang2019open].

In UOSDA, we aim to train a target-domain classifier with labeled data in the source domain and unlabeled data in the target domain. The trained classifier is expected to accurately 1) recognize unknown target data, and 2) classify other target data. Existing UOSDA methods can be divided into two groups: shallow methods and deep methods. For shallow methods, a recent work [fang2019open] proved an upper bound of target-domain risk, which provides a theoretical guarantee for the design of a shallow UOSDA method. For deep methods, since [long2013transfer, yosinski2014transferable, DBLP:conf/icml/DonahueJVHZTD14] have shown that DNNs can learn more transferable features, researchers have presented DNN-based methods to address the UOSDA problem [saito2018open, feng2019attract, liu2019separate]. Nevertheless, these deep UOSDA methods lack theoretical guarantees. Thus, bridging the theoretical bound and deep algorithms is both necessary and important for addressing the UOSDA problem.

In order to train an effective target-domain classifier, Fang et al. [fang2019open] proved an upper bound of target-domain risk (Eq. (14)) for the UOSDA problem and proposed a shallow UOSDA method. Specifically, the bound consists of four terms: source-domain risk, distributional discrepancy between domains, open set difference ($\Delta_o$), and a constant. Open set difference, as an important term in the upper bound, is leveraged to measure the risk of a classifier on unknown target data. The shallow method in [fang2019open] trains a target-domain classifier by minimizing the empirical estimation of the upper bound.

However, the theoretical bound presented in [fang2019open] is not applicable to flexible classifiers (e.g., deep neural networks (DNNs)). In Fig. 2, we show that if the classifier is a DNN, the accuracy (OS in Fig. 2 (b)) in the target domain drops significantly (yellow line in Fig. 2 (b)) when minimizing the empirical estimates of the upper bound. This phenomenon confirms that we cannot simply combine the existing theoretical bound and deep algorithms to address the UOSDA problem.

To reveal the nature of this phenomenon, we show that the distributional discrepancy is lower bounded by the negative of the open set difference. Since DNNs are very flexible and the empirical open set difference can be negative, the empirical open set difference is quickly minimized to a large negative value (yellow line in Fig. 2 (a)). By this lower bound, if the empirical open set difference is a large negative number, the distributional discrepancy must exceed a large positive number. Consequently, we fail to align the distributions of the two domains, resulting in very low accuracy on the target domain (yellow line in Fig. 2 (b)).

In this paper, we propose a new upper bound of target-domain risk for UOSDA (Eq. (20)), which includes four terms: source-domain risk, $\epsilon$-open set difference ($\Delta_\epsilon$), conditional distributional discrepancy between domains, and a constant. $\Delta_\epsilon$ is a new risk estimator that limits how far the open set difference can descend by bounding it from below at $-\epsilon$. $\Delta_\epsilon$ thus promptly prevents the lower bound of the distributional discrepancy between the two domains from increasing significantly. Fig. 2 shows that minimizing the empirical estimates of the new upper bound achieves higher accuracy (green line in Fig. 2(b)).

Then, we propose a new principle-guided deep UOSDA method that trains DNNs via minimizing empirical estimates of the new upper bound. The network structure is shown in Fig. 3. We employ a generator to extract the features of input data, a classifier to classify input data, and a domain discriminator to assist distribution alignment. The overall objective function consists of source classification loss, binary adversarial loss, domain adversarial loss, and empirical $\Delta_\epsilon$. Specifically, the source classification loss and empirical $\Delta_\epsilon$ are minimized by gradient descent, and a gradient reversal layer is adopted for the adversarial losses.

Fig. 2: The accuracy of OS and loss w.r.t. the task Ar → Cl. “c” denotes the conditional adversarial training strategy. $\Delta_\epsilon$ is the $\epsilon$-open set difference proposed in this paper and $\Delta_o$ is the open set difference proposed in [fang2019open]. The loss in (a) is the value of $\Delta_\epsilon$ or $\Delta_o$. It is worth noting that the green line and the red line in (a) partially coincide. Here, is set as .

To effectively align distributions between data with known classes, we propose a novel open-set conditional adversarial training strategy based on the tensor product between the feature representation and the label prediction to capture the multimodal structure of distribution. According to [song2009hilbert, long2018conditional], it is important to capture the multimodal structures of distributions using cross-covariance dependency between the features and classes. However, existing deep UOSDA methods align distributions by either the binary adversarial net [saito2018open, feng2019attract] or the multi-binary classifier [liu2019separate], which is not adequate for distributions with multimodal structure. Furthermore, this novel training strategy also pushes unknown target data away from data with known classes via . As shown in Fig. 2 (b), the novel distribution alignment strategy can further boost the performance of the classifier.

To validate the efficacy of the proposed method, we conduct extensive experiments on several standard benchmark datasets containing transfer tasks. Compared to existing shallow and deep UOSDA methods, our method shows state-of-the-art performance on digit recognition (MNIST, SVHN, USPS), object recognition (Office-31, Office-Home) and face recognition (PIE). The main contributions of this paper are:

  • A new theoretical bound of target-domain risk for UOSDA is proposed. It is essential since the existing bound does not apply to flexible classifiers (i.e., DNNs). Thus this work can bridge the gap between the existing theoretical bound and deep algorithms for the UOSDA problem.

  • A UOSDA method based on DNNs is proposed under the guidance of the proposed theoretical bound. The method can better estimate the risk of the classifier on unknown data than existing deep methods with the theoretical guarantee.

  • A novel open-set conditional adversarial training strategy is proposed to ensure that our method can align the distributions of two domains better than existing UOSDA methods.

  • Experiments on Digits, Office-31, Office-Home, and PIE show that the OS accuracy of our method significantly exceeds that of all baselines, i.e., our method achieves state-of-the-art performance.

This paper is organized as follows. Section II reviews the works related to UCSDA, open set recognition, and UOSDA. Section III introduces the definitions of notations and our problem. Section IV demonstrates the motivation of this paper. Theoretical results and the proposed method are shown in Section V. Experimental results and analyses are provided in Section VI. Finally, Section VII concludes this paper.

II Related Work

Unsupervised open set domain adaptation is a combination of unsupervised closed set domain adaptation and open set recognition. In this section, we present a systematic review of related studies.

II-A Closed Set Domain Adaptation

In [ben2007analysis], a theoretical bound for UCSDA is given, which indicates that minimizing the source risk and distributional discrepancy is the key to the UCSDA problem. Based on this point, there are two kinds of methods for UCSDA: one employs an explicit distributional discrepancy measure to quantify the domain gap [pan2010domain]; the other uses an adversarial training strategy [long2018conditional].

Transfer Component Analysis (TCA) [pan2010domain] utilizes MMD [gretton2012kernel] to learn domain-invariant features by aligning marginal distributions. Meanwhile, Joint Distribution Adaptation (JDA) [long2013transfer] aligns marginal and conditional distributions simultaneously. In order to simplify the training of a classifier, Easy Transfer Learning (EasyTL) [wang2019easy] exploits intra-domain information to obtain a non-parametric feature and classifier. CORrelation Alignment (CORAL) [sun2016return] aligns the second-order statistics of the source and target domains to minimize domain divergence. Manifold Embedded Distribution Alignment (MEDA) [wang2018visual] performs a dynamic distribution alignment in a Grassmann manifold subspace.
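As a rough illustration of the discrepancy-based family above, the following minimal sketch computes a squared MMD between two samples under a linear kernel, in which case the statistic reduces to the squared distance between the two sample means. This is a simplification chosen for brevity: TCA and DAN use other (multi-)kernel choices, and the function names here are hypothetical.

```python
def mean_vector(samples):
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    n = len(samples)
    dim = len(samples[0])
    return [sum(x[j] for x in samples) / n for j in range(dim)]


def linear_mmd2(source, target):
    """Squared MMD with a linear kernel: ||mean(source) - mean(target)||^2."""
    mu_s = mean_vector(source)
    mu_t = mean_vector(target)
    return sum((a - b) ** 2 for a, b in zip(mu_s, mu_t))


# Toy 2-D features: the target sample is the source shifted by 1 along axis 0.
source = [[0.0, 1.0], [2.0, 3.0]]
target = [[1.0, 1.0], [3.0, 3.0]]
print(linear_mmd2(source, target))  # 1.0
```

A value of zero under the linear kernel only says the means agree; richer kernels are needed to compare higher-order structure.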

Meanwhile, deep neural networks have also been introduced into domain adaptation and achieved competitive performance in UCSDA. Deep Adaptation Networks (DAN) [long2015learning] employs the multi-kernel MMD (MK-MMD) to align the features of layers 6-8 in AlexNet. Deep CORAL extends the shallow method CORAL to deep neural networks. Wasserstein Distance Guided Representation Learning (WDGRL) [shen2018wasserstein] employs the Wasserstein distance to learn an invariant representation in deep neural networks.

Representative adversarial-training-based methods are Domain-Adversarial Training of Neural Networks (DANN) [ganin2016domain] and conditional adversarial domain adaptation (CDAN) [long2018conditional]. DANN employs a domain discriminator to recognize which domain the data come from and deceives the domain discriminator by changing the features, so that an invariant representation is learned during the adversarial process. Furthermore, CDAN utilizes the tensor product between the feature and the classifier prediction to capture multimodal information, and an entropy condition to control the uncertainty of the classifier. However, these methods can only cope with the UCSDA problem and are unable to address the UOSDA problem.
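The tensor-product conditioning mentioned above can be sketched as a flattened outer product between a feature vector and a class-probability vector. This is only an illustration in plain Python; the function name and the flattening order are assumptions, not taken from the CDAN implementation.

```python
def multilinear_map(feature, prediction):
    """Flattened outer product of a feature vector and a prediction vector.

    The result has len(feature) * len(prediction) entries: one copy of the
    feature per class, each scaled by that class's predicted probability.
    """
    return [p * f for p in prediction for f in feature]


feature = [1.0, 2.0]        # hypothetical 2-D feature from a generator
prediction = [0.25, 0.75]   # hypothetical softmax output over 2 classes
print(multilinear_map(feature, prediction))  # [0.25, 0.5, 0.75, 1.5]
```

Because each block of the output is the feature weighted by one class probability, a discriminator fed this vector sees features together with the class they are assigned to, rather than the features alone.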

II-B Open Set Recognition

This setting allows some unknown classes to appear in the target domain, but assumes there is no distributional discrepancy between domains. Open Set SVM [jain2014multi] rejects the unknown classes via a fixed threshold. Open Set Nearest Neighbor (OSNN) [junior2017nearest] extends the Nearest Neighbor classifier to recognize unknown classes. Bendale et al. [bendale2016towards] introduce a layer named OpenMax to estimate the probability that an input belongs to an unknown class in DNNs. However, these methods do not consider distributional discrepancy, so they are also unable to address the UOSDA problem.
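To make the idea of threshold-based rejection concrete, here is a generic sketch that rejects an input as unknown when its highest softmax probability falls below a fixed threshold. It is not the Open Set SVM or OpenMax procedure; the threshold value, the `unknown_label` convention, and the function names are all assumptions for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_rejection(scores, threshold=0.5, unknown_label=-1):
    """Return the argmax class, or unknown_label if confidence is too low."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= threshold else unknown_label


print(predict_with_rejection([5.0, 0.0, 0.0]))  # confident: class 0
print(predict_with_rejection([0.1, 0.0, 0.0]))  # near-uniform: -1 (unknown)
```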

II-C Open Set Domain Adaptation

Panareda Busto et al. [panareda2017open] were the first to propose the setting of UOSDA. They employed a method named Assign-and-Transform-Iteratively (ATI) to assign labels to target data using a distance matrix between target data and source class centers, and aligned distributions through a mapping matrix. In their setting, however, the source domain contains some unknown classes to help the classifier recognize unknown data. Since obtaining unknown samples in the source domain is expensive and time-consuming, Open Set Backpropagation (OSBP) [saito2018open] assumes a more realistic and more challenging scenario in which the source domain has no unknown classes. An adversarial network is used to recognize unknown samples and align distributions during backpropagation.

Based on OSBP, Feng et al. [feng2019attract] proposed a method named SCI_SCM, which utilizes semantic structure among data to align the distribution of known classes and push unknown classes away from known classes. Separate to Adapt (STA) [liu2019separate] utilizes a coarse-to-fine weight mechanism to separate unknown samples from the target domain. In Distribution Alignment with Open Difference (DAOD) [fang2019open], a theoretical bound is proposed for UOSDA and a risk estimator is used to recognize unknown target data.

However, existing deep UOSDA methods lack the theoretical guidance and the upper bound in [fang2019open] is not applicable to DNNs, which causes a large distributional discrepancy (details are shown in Section IV). Obviously, for UOSDA, there is a gap between existing theoretical bound and deep algorithms. In this paper, we aim to fill this gap.

III Preliminary and Notations

The definitions of the UOSDA problem and some important concepts are introduced in this section. The notations used in this paper are summarized in Table I.

III-A Definitions and Problem Setting

Important definitions are presented as follows.

Definition 1 (Domain[fang2019open]).

Given a feature space $\mathcal{X}$ and a label space $\mathcal{Y}$, a domain is a joint distribution $P(X, Y)$, where the random variables $X \in \mathcal{X}$, $Y \in \mathcal{Y}$.

In Definition 1, $X \in \mathcal{X}$ and $Y \in \mathcal{Y}$ mean that the spaces $\mathcal{X}$ and $\mathcal{Y}$ contain the image sets of $X$ and $Y$ respectively. In this paper, we call the random variable $X$ the feature vector and the random variable $Y$ the label. Based on this definition, we have:

Definition 2 (Domains for Open Set Domain Adaptation[fang2019open]).

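Given a feature space $\mathcal{X}$ and the label spaces $\mathcal{Y}^s, \mathcal{Y}^t$, the source and target domains have different joint distributions $P(X^s, Y^s)$ and $P(X^t, Y^t)$, where the random variables $X^s, X^t \in \mathcal{X}$, $Y^s \in \mathcal{Y}^s$, $Y^t \in \mathcal{Y}^t$, and the label space $\mathcal{Y}^s \subset \mathcal{Y}^t$.

From the definitions above, we can notice that: 1) this paper focuses on the homogeneous situation, so $X^s$ and $X^t$ belong to the same space, and 2) $\mathcal{Y}^t$ contains $\mathcal{Y}^s$. The unknown target classes are the classes from $\mathcal{Y}^t \setminus \mathcal{Y}^s$, and the known classes are the classes from $\mathcal{Y}^s$. Thus, the UOSDA problem is: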

Problem 1 (Unsupervised Open Set Domain Adaptation (UOSDA) [fang2019open]).

Given labeled samples drawn i.i.d. from the joint distribution of the source domain and unlabeled samples drawn i.i.d. from the marginal distribution of the target domain, the aim of UOSDA is to find a target classifier that
1) classifies the known target samples into the correct known classes;
2) recognizes the unknown target samples as unknown.

According to the definition of the problem, the target-domain classifier only needs to recognize unknown target data as unknown and classify other target data. It is not necessary to classify unknown target data further, and all unknown target data are recognized as the “unknown class”. In general, we assume that the source label space is $\{\mathbf{y}_c\}_{c=1}^{C}$ and the target label space is $\{\mathbf{y}_c\}_{c=1}^{C+1}$, where the label $\mathbf{y}_{C+1}$ denotes the unknown class and the label $\mathbf{y}_c$ is a one-hot vector denoting the $c$-th class.

Notation           Description Notation           Description
feature space source, target joint distributions
source, target label sets source, target marginal distributions
random variables on the feature space open set difference
random variables on the label spaces
source, target risks partial risk on known target classes
one-hot vector (class ) partial risk on unknown target classes
feature transformation , classifier over risks that samples regarded as unknown
hypothesis space, set of classifiers

class-prior probability for unknown class

sample from empirical distribution, empirical risk
distance tensor discrepancy distance
TABLE I: Notations and their descriptions.

III-B Concepts and Notations

It is necessary to introduce some important concepts and notations before demonstrating our main results. Unless otherwise specified, all the following notations are used consistently throughout this paper without further explanations.

III-B1 Notations for distributions

For simplicity, we denote the joint distributions and by the notations and respectively. Similarly, we use and denote the marginal distributions and respectively.

denotes the target conditional distribution for the known classes, while denotes the target conditional distribution for the unknown classes. denotes the class-prior probability for the unknown target classes.

Given a feature transformation:

(1)

the induced distributions related to and are

(2)

Lastly, the notation denotes the corresponding empirical distribution to any distribution . For example, represents the empirical distribution corresponding to .

III-B2 Risks and Partial Risks

In learning theory, risks and partial risks are two important concepts, which are briefly explained below.

Following the notations in [DBLP:conf/icml/0002LLJ19], consider a multi-class classification task with a hypothesis space of the classifiers

(3)

Let

(4)

be the loss function. For convenience, we also require

to satisfy the following conditions in Theorem 1:
1. is symmetric and satisfies triangle inequality;
2. iff ;
3. if and are one-hot vectors.

We can check many losses satisfying the above conditions such as - loss and loss .

Then the risks of w.r.t. under and are given by

(5)

The partial risk of for the known target classes is

(6)

and the partial risk of for the unknown target classes is

(7)

Lastly, we denote

(8)

as the risks that the samples are regarded as the unknown classes.

Given a risk , it is convenient to use notation as the empirical risk that corresponds to .

III-B3 Discrepancy Distance

Measuring the difference between domains plays a critical role in domain adaptation. To this end, a well-known distance between distributions has been proposed as a measure of the distribution difference.

Definition 3 (Distributional Discrepancy [DBLP:conf/colt/MansourMR09]).

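Given a hypothesis space $\mathcal{H}$ containing a set of functions defined on a feature space $\mathcal{X}$, let $\ell$ be a loss function, and let $P$, $Q$ be distributions on the space $\mathcal{X}$. The discrepancy distance between the distributions $P$ and $Q$ over $\mathcal{X}$ is

$$\mathrm{disc}_{\ell}(P, Q) = \sup_{h, h' \in \mathcal{H}} \left| \mathbb{E}_{x \sim P}\, \ell\big(h(x), h'(x)\big) - \mathbb{E}_{x \sim Q}\, \ell\big(h(x), h'(x)\big) \right|.$$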
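To make Definition 3 concrete, the sketch below evaluates the discrepancy distance empirically for a small finite hypothesis set: for every pair of hypotheses it compares their average disagreement on a sample from each distribution and keeps the largest gap. The finite hypothesis set, the 0-1 loss, and all names are assumptions for illustration; in practice the supremum runs over an infinite class and is approximated adversarially.

```python
from itertools import combinations


def zero_one(a, b):
    """0-1 loss between two predicted labels."""
    return 0.0 if a == b else 1.0


def expected_disagreement(h1, h2, samples, loss):
    """Average loss between the outputs of two hypotheses on a sample."""
    return sum(loss(h1(x), h2(x)) for x in samples) / len(samples)


def empirical_discrepancy(hypotheses, samples_p, samples_q, loss=zero_one):
    """Largest gap in pairwise disagreement between the two samples.

    Pairs with h == h' contribute 0 for a loss with loss(y, y) = 0, and the
    loss is symmetric, so unordered pairs of distinct hypotheses suffice.
    """
    best = 0.0
    for h1, h2 in combinations(hypotheses, 2):
        gap = abs(expected_disagreement(h1, h2, samples_p, loss)
                  - expected_disagreement(h1, h2, samples_q, loss))
        best = max(best, gap)
    return best


# Three 1-D threshold classifiers and two toy samples.
hypotheses = [lambda x, t=t: int(x > t) for t in (0.0, 1.0, 2.0)]
samples_p = [0.5, 0.5, 1.5, 1.5]
samples_q = [1.5, 1.5, 1.5, 1.5]
print(empirical_discrepancy(hypotheses, samples_p, samples_q))  # 0.5
```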

In this paper, we use a tighter distance named the tensor discrepancy distance, which was first proposed in [long2018conditional]. The tensor discrepancy distance can further extract the multimodal structure of distributions, so that the knowledge carried by the learned classifier and pseudo labels can be utilized during the distribution alignment process.

We consider the following tensor mapping:

(9)

Then we induce two important distributions:

(10)

Using , we construct a new hypothesis set:

(11)

where . Then the distance between and is:

(12)

where is the sign function.

It is easy to prove that under the conditions (1)-(3) for loss and for any , we have

(13)

III-B4 Existing Theoretical Bound

Fang et al. [fang2019open] first proposed a theoretical bound for UOSDA:

(14)

There are four main terms: source risk, distributional discrepancy, a constant and open set difference. The fourth term, open set difference, is designed to estimate the risk of classifier on unknown data.

IV Motivation

In UOSDA, the target-domain classifier aims to accurately recognize unknown target data and classify the other target data. Since the knowledge about unknown classes is missing, the classifier is likely to be confused about the boundary between known and unknown target data. Thus, recognizing unknown target data plays a critical role in addressing the UOSDA problem.

In order to obtain an effective target-domain classifier, Fang et al. [fang2019open] proved an upper bound (Eq. (14)) for UOSDA and proposed a shallow method based on the bound. The bound consists of four terms: source-domain risk, distributional discrepancy, open set difference ($\Delta_o$), and a constant. In particular, open set difference, as an important term, is leveraged to estimate the risk of the classifier on unknown target data.

In order to verify whether open set difference works in DNNs, we introduced open set difference into DNNs and conducted a group of experiments on the task Ar → Cl in Office-Home. The network consists of a backbone (ResNet50), a generator (two linear layers), and a classifier (one linear layer), so the resulting model is very flexible. As shown in Fig. 2, the empirical open set difference converges to a negative value (refer to the yellow line in Fig. 2(a)), and the OS accuracy, i.e., the average accuracy over all classes including the unknown class (Eq. (29)), decreases significantly when the empirical open set difference converges to a negative value.
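The OS metric described above can be sketched as a per-class accuracy averaged over the known classes plus the unknown class. The exact form of Eq. (29) is not shown in this part of the paper, so the following is an assumed reading; integer labels with the last index standing for "unknown" are a convention adopted only for this example.

```python
def os_accuracy(y_true, y_pred, num_classes):
    """Mean of per-class accuracies over all classes that have samples.

    Classes are labelled 0..num_classes-1; the last index plays the role of
    the unknown class in this sketch.
    """
    per_class = []
    for c in range(num_classes):
        idx = [i for i, y in enumerate(y_true) if y == c]
        if not idx:
            continue  # skip classes absent from the evaluation set
        correct = sum(1 for i in idx if y_pred[i] == c)
        per_class.append(correct / len(idx))
    return sum(per_class) / len(per_class)


# Two known classes (0, 1) and one unknown class (2).
y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 2, 1, 1, 2, 2, 2, 0]
print(os_accuracy(y_true, y_pred, 3))   # per-class 0.5, 1.0, 0.75

# A classifier that labels every target sample as unknown scores only 1/3.
print(os_accuracy(y_true, [2] * 8, 3))
```

The second call mirrors the failure mode discussed here: when most target data are recognized as unknown, the unknown class is classified perfectly but every known class scores zero, so the averaged metric collapses.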

To reveal the nature of this phenomenon, we first investigate the distributional discrepancy and find that it has a lower bound. Specifically, the distributional discrepancy is greater than the negative of the open set difference (Eq. (18)). By this lower bound, if the open set difference is a large negative number, then the distributional discrepancy is greater than a large positive number. Hence, we may fail to align the distributions. In fact, experiments show that the empirical open set difference may converge to a large negative value if we introduce the open set difference into DNNs.

Clearly, there is a gap between the existing theoretical bound and DNNs. In order to bridge the theoretical bound and deep algorithms, in this paper we propose a new practical upper bound (Eq. (20)) for UOSDA that applies to DNNs. The $\epsilon$-open set difference term in the new bound effectively overcomes the defect of the open set difference. As shown in Fig. 2, the $\epsilon$-open set difference keeps the estimated risk of the classifier on unknown data bounded from below (refer to the green line in Fig. 2(a)). Furthermore, the $\epsilon$-open set difference significantly outperforms the open set difference (refer to the green line in Fig. 2(b)).

To sum up, the existing upper bound is not compatible with DNNs. That is why we propose a new upper bound that contains an amended risk estimator, the $\epsilon$-open set difference ($\Delta_\epsilon$). Details of the new upper bound and $\Delta_\epsilon$ are given in Section V.

V The Proposed Method

In this section, we first propose a theoretical bound for UOSDA that applies to DNNs. Under the guidance of the bound, we then propose a UOSDA method based on DNNs.

Notation           Description
cross entropy, mean square error loss function
set of predicted unknown target data with high confidence
set of predicted known target data with high confidence
number of source data
number of target data
number of
number of
source data
target data
TABLE II: Notations and their descriptions.

V-A Theoretical Results

V-A1 An Analysis for Open Set Difference

Eq. (15) is the open set difference:

(15)

where and are defined in Eq. (8). The positive term is used to recognize unknown data and the negative term is designed to prevent known data from being classified as unknown classes. By combining these two terms, the classifier can recognize unknown target samples. According to [fang2019open], the open set difference satisfies the following inequality:

(16)

The proof of Eq. (16) can be found in Appendix A (Proposition 1). Note that

(17)

hence, the distributional discrepancy is greater than the negative open set difference:

(18)

Theoretically, we hope that the optimized open set difference is not a large negative value; otherwise, it is impossible to eliminate the distributional discrepancy. In practice, however, the empirical open set difference may converge to a large negative value (see Fig. 2). As a result, the distributional discrepancy may remain large.
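As a short worked reading of this argument, write $d$ for the distributional discrepancy and $\Delta_o$ for the open set difference. Both symbols and the numerical value are assumptions made for illustration, and the exact arguments and constants of Eq. (18) are not reproduced here; only the qualitative relation stated in the text is used.

```latex
% Lower bound stated in the text: the discrepancy is at least the
% negative of the open set difference.
d \;\ge\; -\Delta_o .
% Illustrative value: if minimization drives the empirical open set
% difference down to -0.8, alignment can no longer reduce d below 0.8.
\Delta_o = -0.8 \quad\Longrightarrow\quad d \;\ge\; 0.8 .
% Conversely, keeping \Delta_o \ge -\epsilon for a small \epsilon \ge 0
% keeps this lower bound on d no larger than \epsilon.
\Delta_o \ge -\epsilon \quad\Longrightarrow\quad -\Delta_o \;\le\; \epsilon .
```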

V-A2 $\epsilon$-Open Set Difference

Based on the analyses above, we correct the open set difference to avoid the problem mentioned above. According to Eq. (18), the open set difference is lower bounded. We denote the lower bound of the open set difference by . One possibility is to limit the lower bound of the open set difference by a small negative constant $-\epsilon$. Hence, we propose an amended risk estimator, the $\epsilon$-open set difference ($\Delta_\epsilon$), to overcome the existing defect in the open set difference:

(19)

If we optimize the empirical $\epsilon$-open set difference, we can guarantee that the empirical $\epsilon$-open set difference never falls below $-\epsilon$. Lastly, combining Eqs. (12), (13) with Eq. (19), we develop a new theoretical bound for UOSDA.

Theorem 1.

Given a feature transformation , a loss function satisfying conditions 1-3 introduced in Section III-B2, a nonnegative constant $\epsilon$ and a hypothesis with a mild condition that the constant vector value function , then for any , we have

(20)

where and are the risks defined in (5), and are the risks defined in (8), is the partial risk defined in (6) and .

Proof.

The proof is given in Appendix A. ∎

It is notable that the theoretical bound introduced in Theorem 1 differs from the learning bound introduced by [fang2019open] in two main ways. The first is the $\epsilon$-open set difference. As mentioned before, the $\epsilon$-open set difference is designed to eliminate the distributional discrepancy caused by the open set difference when the model is based on DNNs. The other difference is that we use the tensor distributional discrepancy to estimate the domain difference. The tensor distributional discrepancy has two advantages over the distributional discrepancy (Definition 3): 1) the tensor distributional discrepancy is tighter than the distributional discrepancy (see Eq. (13)); 2) the tensor distributional discrepancy can extract the multimodal structure of distributions, so that the knowledge carried by the learned classifier and pseudo labels can be utilized during the process of distribution alignment [long2018conditional].

V-B Method Description

According to Theorem 1, we formally present our method (see Fig. 3), which consists of three parts. Part 1) Binary adversarial domain adaptation. Following [saito2018open], we employ a binary adversarial module to find a rough boundary between the class-known data (known data) and the class-unknown data (unknown data), so that this module can provide target samples with high confidence for the other modules. Part 2) $\epsilon$-open set difference ($\Delta_\epsilon$). $\Delta_\epsilon$ is leveraged to estimate the risk of the classifier on unknown data such that the classifier can accurately recognize the unknown target data. Part 3) Conditional adversarial domain adaptation. Existing deep UOSDA methods ignore the importance of the multimodal structure of distributions while aligning distributions for known classes. Based on the tensor distributional discrepancy, we design a novel open set conditional adversarial strategy to align distributions for known classes. Notations used in this section are summarized in Table II.

Fig. 3: Framework of the proposed method. The generator extracts the features of the input data and feeds them to the classifier to predict the labels. The whole framework consists of three parts. 1) Binary adversarial domain adaptation, which is made of source classification loss and binary adversarial loss. The classifier can find a rough boundary between known data and unknown data. 2) $\epsilon$-open set difference ($\Delta_\epsilon$). We propose this amended risk estimator to more properly estimate the risk of the classifier on unknown data. 3) Conditional adversarial domain adaptation, which aims to capture the multimodal structure of distributions for distribution alignment. In summary, our method can achieve better performance by accurately estimating the risk on unknown target data and aligning distributions more adequately.

V-B1 Binary adversarial domain adaptation (BADA)

According to our theoretical bound, the first term is source risk. For the source domain, the label is available. We utilize a cross-entropy for the classification of source samples:

(21)

For the target domain, it is imperative to recognize the unknown target data before aligning distributions. Following [saito2018open], we employ a binary cross-entropy and a gradient reversal layer between the generator and the classifier to find a boundary between the known data and the unknown data:

(22)

where is the -th value of hypothesis function .
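The binary adversarial term can be sketched as a binary cross-entropy between the predicted probability of the unknown class and a fixed soft target. The target value t = 0.5 used below is an assumption borrowed from the setup of [saito2018open], not a value stated in this section, and the function name is hypothetical.

```python
import math

def binary_adversarial_loss(p_unknown, t=0.5, eps=1e-12):
    """Binary cross-entropy between p(unknown | x) and a soft target t.

    p_unknown: probabilities assigned to the unknown class, one per target sample.
    t: assumed boundary value (0.5 here); eps guards against log(0).
    """
    total = 0.0
    for p in p_unknown:
        p = min(max(p, eps), 1.0 - eps)  # clamp for numerical safety
        total += -t * math.log(p) - (1.0 - t) * math.log(1.0 - p)
    return total / len(p_unknown)
```

Under this assumption the loss is smallest when the unknown-class probability equals t, so the classifier is pulled toward the boundary while the generator, through the gradient reverse layer, pushes each target sample toward either the known or the unknown side.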

The minimax game is shown in Section V-C. During adversarial training, the classifier attempts to minimize the binary adversarial loss in Eq. (22), whereas the generator attempts to maximize it. Unknown data are therefore recognized during the process of adversarial training.

However, this module can only find a coarse boundary between the known data and the unknown data, and therefore cannot accurately recognize the unknown target data. Table VI verifies that binary adversarial domain adaptation alone cannot achieve satisfactory performance. Therefore, we employ the ϵ-open set difference to recognize unknown target data more appropriately and the open-set conditional adversarial strategy to further align the distributions.

V-B2 ϵ-open set difference

The principle of the ϵ-open set difference (Δ_ϵ) is demonstrated in Sections IV and V-A. We now use Δ_ϵ to recognize unknown target data. According to Eqs. (19), (23), we can calculate the empirical ϵ-open set difference by:

(23)

Without more label information, in Eq.(19) is impossible to be evaluated accurately, thus, we introduce a parameter, , to replace it. The analysis of is discussed in Section VI.

V-B3 Conditional adversarial domain adaptation

Here we utilize the tensor distributional discrepancy to align the distribution between the known classes. Firstly, the empirical representations of and can be written as follows:

(24)

where is the set of target data from the known classes and is the Dirac measure.

Then, motivated by DANN [ganin2016domain] and CDAN [long2018conditional], we can reformulate the tensor distributional discrepancy between the known classes as follows:

(25)

where is the domain discriminator designed to classify domains.

Since the target data are unlabeled, Eq. (25) cannot be calculated directly. We therefore use the pseudo labels provided by BADA in place of the true labels. Since these pseudo labels are not completely accurate, we select only the samples whose confidence reaches 0.9. We then formulate the domain adversarial loss function below.

(26)

where denotes the set of samples from known classes with high confidence in the target domain, and .
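The confidence-based selection of pseudo-labeled target samples can be sketched as follows. This is a minimal illustration that assumes the last softmax output corresponds to the unknown class and that "a confidence of 0.9" means a maximum softmax probability of at least 0.9; the function name and return format are hypothetical.

```python
def select_confident(probabilities, threshold=0.9):
    """Split target samples into confident known and confident unknown sets.

    probabilities: list of softmax vectors of length K + 1; the last entry
        is assumed to be the unknown class.
    Returns (known, unknown), where known is a list of (index, pseudo_label)
    pairs and unknown is a list of sample indices.
    """
    known, unknown = [], []
    for i, probs in enumerate(probabilities):
        label = max(range(len(probs)), key=lambda k: probs[k])
        if probs[label] < threshold:
            continue  # low confidence: leave the sample out of both sets
        if label == len(probs) - 1:
            unknown.append(i)
        else:
            known.append((i, label))
    return known, unknown
```

Samples below the threshold are excluded from both sets, so noisy pseudo labels do not enter the alignment losses.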

The domain adversarial loss is minimized over the domain discriminator and maximized over the generator. The gradient reverse layer between the generator and the domain discriminator causes the discriminator to become confused about the source data and the target data. The minimax game is shown in Section V-C. The domain discriminator aims to identify which domain the input data belong to, whereas the generator aims to deceive it by changing the features of the input data. Distribution alignment is achieved during this process.

Furthermore, the unknown data may distract distribution alignment of the known data. Thus the unknown data should be pushed away from known data to prevent them from affecting distribution alignment. We construct the loss function below. It is worth noting that there is no gradient reverse between and during the process of backpropagation.

(27)

where is the unknown target samples with high confidence and .

In this subsection, we construct a domain discriminator to align the distributions of the known data by a tensor product, which can capture the multimodal structure of the distributions. Furthermore, we construct a loss function to push the unknown data away from the known data to prevent the unknown data from affecting distribution alignment.
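To make the tensor-product conditioning concrete, the sketch below forms the flattened outer product of a feature vector and a class-probability vector, which is the kind of joint input that CDAN [long2018conditional] feeds to its domain discriminator. The flattening order and the function name are assumptions made for illustration only.

```python
def tensor_product(feature, prediction):
    """Flattened outer product of a prediction vector and a feature vector.

    The entry at position j * len(feature) + k equals prediction[j] * feature[k],
    so each class probability gates its own copy of the feature vector.
    """
    return [p * f for p in prediction for f in feature]

# Example: a 2-dimensional feature and a 3-class prediction give a 6-dimensional input.
joint = tensor_product([1.0, 2.0], [0.5, 0.5, 0.0])
```

Because each class contributes its own scaled copy of the feature, a discriminator operating on this joint input can distinguish domains per class mode rather than only on the marginal feature distribution.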

V-C Training Procedure

Combining Eqs. (21), (22), (23), (26) and (27), we solve the UOSDA problem by the following minimax game:

(28)

We introduce the gradient reverse layer for adversarial learning. The whole training procedure is shown in Algorithm 1. First, we initialize the parameters of the generator, the classifier, and the domain discriminator (line 1). In each iteration, we sample a source minibatch and a target minibatch (lines 4-5). Then we calculate the source risk, the binary adversarial loss, and Δ_ϵ according to Eqs. (21), (22), and (23) (lines 6-7). After selecting target samples with high confidence (line 8), we calculate the domain adversarial loss and the loss that pushes unknown data away from known data according to Eqs. (26) and (27) (line 9). Finally, the parameters are updated via the SGD optimizer (line 10).

With the proposed method, binary adversarial domain adaptation finds a coarse boundary between known data and unknown data. Furthermore, the ϵ-open set difference (Δ_ϵ) adequately estimates the risk of the classifier on unknown data, which helps the classifier accurately recognize unknown target data. Then, we further align the distributions of known data and push unknown data away from known data using a domain discriminator. Finally, by combining these three modules, we can adequately solve the UOSDA problem.

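Input: source samples, target samples.
Parameter: learning rate, batch size, number of iterations, network parameters of the generator, the classifier, and the domain discriminator.
Output: predicted target labels.
1:  initialize the network parameters.
2:  set the iteration counter to 0.
3:  while the iteration counter is smaller than the number of iterations do
4:     sample a source minibatch.
5:     sample a target minibatch.
6:     calculate the source classification loss and the binary adversarial loss according to Eqs. (21) and (22).
7:     calculate Δ_ϵ according to Eq. (23).
8:     select high-confidence target samples according to the output of the softmax.
9:     calculate the domain adversarial loss and the loss that pushes unknown data away according to Eqs. (26) and (27), using the high-confidence target samples.
10:     update the network parameters.
11:     increment the iteration counter.
12:  end while
Algorithm 1 Training procedure of our method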

VI Experiments and Evaluations

In this section, we conduct extensive experiments on standard benchmark datasets (including transfer tasks) to demonstrate the effectiveness of our method. Several state-of-the-art UOSDA methods, namely ATI-λ [panareda2017open], OSBP [saito2018open], SCA_SCM [feng2019attract], STA [liu2019separate], and DAOD [fang2019open], are employed as our baselines.

VI-A Datasets

Digits contains three digit datasets: MNIST (M) [lecun1998gradient], SVHN (S) [netzer2011reading], and USPS (U) [hull1994database]. We construct three open set domain adaptation tasks as in previous work [saito2018open]: S → M, M → U, and U → M. Following the protocol of [saito2018open], we select classes - as the known classes and classes - as the unknown classes of the target domain.

Office-31 [saenko2010adapting] is an object recognition dataset with images, which consists of three domains with slight discrepancy: amazon (A), dslr (D), and webcam (W). Each domain contains kinds of objects, so there are 6 open set domain adaptation tasks on Office-31: A → D, A → W, D → A, D → W, W → A, and W → D. We follow the open set protocol of [saito2018open], selecting the first classes in alphabetical order as the known classes and classes - as the unknown classes of the target domain.

Office-Home [venkateswara2017deep] is an object recognition dataset with images, which contains four domains with a more obvious domain discrepancy than Office-31. These domains are Artistic (Ar), Clipart (Cl), Product (Pr), and Real-World (Rw). Each domain contains kinds of objects, so there are 12 open set domain adaptation tasks on Office-Home: Ar → Cl, Ar → Pr, Ar → Rw, …, Rw → Pr. Following the standard protocol, we choose the first classes as the known classes and classes - as the unknown classes of the target domain.

PIE [Rasouli_2019_ICCV] is a face recognition dataset containing images of people with various poses, illuminations, and expressions. Following the protocol of [fang2019open], we perform open set domain adaptation among out of poses: PIE1 (left pose), PIE2 (upward pose), PIE3 (downward pose), PIE4 (frontal pose), and PIE5 (right pose). We select classes - as the known classes and classes - as the unknown classes of the target domain. We construct 20 open set domain adaptation tasks, i.e., PIE1 → PIE2, PIE1 → PIE3, …, PIE5 → PIE4.

VI-B Implementation

Network structure. For Digits, we employ convolutional neural networks similar to those of [shu2018a, saito2018open] for S → M and the other tasks, respectively, and train the DNNs from scratch. For Office-31, we use VGGNet [simonyan2014very] as the backbone to extract image features. We employ two fully-connected layers as the generator and one fully-connected layer as the classifier. For Office-Home, we use ResNet-50 [he2016deep] as the backbone to extract image features. The network structures of the generator and the classifier are the same as for Office-31. PIE already provides features for all images, so a CNN is not necessary, and we adopt a generator and a classifier similar to those for Office-31. Details about the networks can be found in Appendix B. In the same manner as [saito2018open, feng2019attract], we do not update the parameters of the backbone during training.

Parameter setting. In the proposed method, there are two important parameters: and . We set as in all experiments, which is because distributional discrepancy is gradually approaching to during the process of domain adaptation and should be greater than or equal to when distributional discrepancy is . Besides, we set as for Office-31, for Digit and Office-Home, and for PIE. When the distributional discrepancy is relatively large, we advise that should be smaller for steady training. All experiment results are the accuracy averaged over three independent runs.

VI-C Baselines

We compare our method with five UOSDA methods: ATI-λ [panareda2017open], OSBP [saito2018open], SCA_SCM [feng2019attract], STA [liu2019separate], and DAOD [fang2019open]. We briefly introduce these baselines below.
ATI-λ [panareda2017open] employs integer programming to assign labels to the target domain and a mapping matrix to align the distributions.
OSBP [saito2018open] employs a classifier to align the distributions of known-class data in the source and target domains, and an adversarial net to reject unknown samples through the predicted probability of target samples.
SCA_SCM [feng2019attract] aligns the centroids of the source and target domains and pushes unknown samples away from the known classes.
STA [liu2019separate] uses a coarse-to-fine weighting mechanism to separate unknown samples in the target domain while simultaneously aligning the distributions.
DAOD [fang2019open] trains a target-domain classifier by minimizing Eq. (14), in which the open set difference is used to estimate the risk of the classifier on unknown classes.

VI-D Evaluation Metrics

Following previous works [panareda2017open, saito2018open, fang2019open], we employ the two metrics below to evaluate our method. OS: the average accuracy over all classes, including the unknown classes. OS*: the average accuracy over the known classes only.

(29)

where is the target classifier, and is the set of target samples with label .
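The two metrics can be computed as in the sketch below. It assumes the convention commonly used in this line of work: all unknown target classes are merged into a single extra class, OS averages the per-class accuracy over the K known classes plus that unknown class, and OS* averages over the K known classes only. The label encoding (known classes 0 to K-1, unknown class K) and the function name are illustrative assumptions.

```python
def open_set_accuracies(y_true, y_pred, num_known):
    """Return (OS, OS*) as per-class accuracies averaged over classes.

    Labels 0 .. num_known - 1 are known classes; label num_known is the
    merged unknown class (an assumed encoding).
    """
    per_class = []
    for c in range(num_known + 1):
        idx = [i for i, y in enumerate(y_true) if y == c]
        if not idx:
            raise ValueError(f"no target samples with label {c}")
        correct = sum(1 for i in idx if y_pred[i] == c)
        per_class.append(correct / len(idx))
    os_star = sum(per_class[:num_known]) / num_known
    os_all = sum(per_class) / (num_known + 1)
    return os_all, os_star

# Two known classes (0, 1) and one unknown class (2).
os_all, os_star = open_set_accuracies([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], num_known=2)
```

In this toy example the per-class accuracies are 0.5, 1.0, and 0.5, so OS* is 0.75 and OS is 2/3. Averaging per class rather than per sample keeps a large unknown class from dominating the score.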

VI-E Results

Dataset ATI-λ OSBP SCA_SCM STA DAOD OURS
OS OS* OS OS* OS OS* OS OS* OS OS* OS OS*
S→M 67.6 66.5 63.1 59.1 68.6 65.5 76.9 75.4 - - 82.9 82.6
M→U 86.8 89.6 92.1 94.9 91.3 92.0 93.0 94.9 - - 93.4 94.6
U→M 82.4 81.5 92.3 91.2 93.1 95.2 92.2 91.3 - - 90.7 92.7
Average 78.9 79.2 82.4 81.7 84.3 84.2 87.3 87.2 - - 89.0 90.0
TABLE III: Acc(OS*) and Acc(OS) (%) on Digits
Dataset ATI-λ OSBP SCA_SCM STA DAOD OURS
OS OS* OS OS* OS OS* OS OS* OS OS* OS OS*
A→D 79.8 86.8 85.8 85.8 90.1 92.0 88.6 92.8 89.2 91.1 96.0 97.5
A→W 86.4 93.0 76.9 76.6 86.4 87.7 91.9 94.3 90.5 91.9 92.5 93.7
D→A 75.0 81.5 89.4 91.5 81.6 88.4 73.4 74.3 75.4 73.6 85.3 86.0
D→W 91.7 98.6 96.0 96.6 97.9 99.8 96.5 99.5 98.6 100.0 98.4 100.0
W→A 75.8 82.0 83.4 83.1 80.3 82.6 71.3 71.3 75.6 74.7 83.2 83.9
W→D 91.5 99.3 97.1 97.3 98.2 99.3 95.4 100.0 98.6 99.3 98.6 100.0
Average 83.4 90.2 88.0 88.5 89.1 91.6 86.2 88.7 88.0 88.4 92.3 93.5
Ar→Cl 53.1 54.2 53.1 53.3 58.9 59.9 57.0 59.3 55.4 55.3 61.6 62.8
Ar→Pr 68.6 70.4 68.4 69.2 73.4 74.4 67.2 69.5 71.8 72.6 76.6 78.3
Ar→Rw 77.3 78.1 78.0 79.1 79.2 80.2 79.1 81.9 77.6 78.2 83.2 85.0
Cl→Ar 57.8 59.1 57.9 58.2 60.6 61.5 59.1 61.3 59.2 59.1 62.2 62.8
Cl→Pr 66.7 68.3 71.6 72.4 67.5 68.4 63.4 65.9 70.1 70.8 71.0 72.2
Cl→Rw 74.3 75.3 71.4 72.3 74.8 75.8 72.7 75.5 77.0 77.8 77.7 79.0
Pr→Ar 61.2 62.6 59.6 61.0 63.8 64.7 63.8 65.2 65.8 66.7 64.6 65.4
Pr→Cl 53.9 54.1 55.7 56.9 58.1 59.0 56.5 58.6 59.1 60.0 60.0 60.8
Pr→Rw 79.9 81.1 82.1 83.9 77.7 78.7 80.1 82.4 82.2 84.1 81.5 82.9
Rw→Ar 70.0 70.8 66.5 68.2 67.3 68.2 69.3 71.3 70.5 71.3 70.6 71.6
Rw→Cl 55.2 55.4 57.8 59.2 55.8 56.7 57.5 59.2 57.8 58.4 58.8 59.6
Rw→Pr 78.3 79.4 78.6 80.8 77.7 78.6 79.4 82.2 80.6 81.8 81.3 82.8
Average 66.4 67.4 66.7 67.9 67.9 68.8 67.1 69.4 68.9 69.6 70.8 71.9
TABLE IV: Acc(OS*) and Acc(OS) (%) on Office-31 (VGG-19) and Office-Home (ResNet-50).
Dataset ATI-λ OSBP SCA_SCM STA DAOD OURS
OS OS* OS OS* OS OS* OS OS* OS OS* OS OS*
P1→P2 41.9 44.0 64.2 66.6 60.7 60.9 54.2 55.0 56.5 57.3 76.4 78.1
P1→P3 53.6 56.3 66.4 69.1 65.7 66.0 67.7 68.8 52.2 53.1 75.7 77.4
P1→P4 64.6 67.9 76.2 80.0 79.5 80.3 81.6 83.6 82.4 85.2 89.6 91.6
P1→P5 43.3 45.4 49.1 50.2 45.7 45.3 42.4 41.7 46.1 47.3 57.2 58.0
P2→P1 56.7 59.5 52.9 54.2 63.6 65.2 51.0 51.6 68.1 69.7 81.6 83.9
P2→P3 53.6 56.3 61.5 63.5 66.9 68.5 58.3 59.0 69.9 71.7 76.5 78.3
P2→P4 73.5 77.1 90.4 92.9 91.2 93.6 78.6 80.6 88.2 91.2 94.0 96.4
P2→P5 34.9 36.7 45.1 45.9 45.3 46.0 39.6 39.6 49.4 49.8 51.8 52.6
P3→P1 66.9 68.4 61.3 61.0 75.2 77.3 69.2 70.7 66.6 68.3 82.7 85.0
P3→P2 52.4 55.0 64.1 64.6 68.9 70.7 59.5 61.0 68.5 70.4 76.0 78.0
P3→P4 70.5 74.0 74.7 76.9 86.6 89.1 77.6 79.8 83.9 87.1 84.9 87.2
P3→P5 44.8 47.1 46.3 46.7 59.7 61.0 46.3 46.7 52.3 53.3 62.8 64.2
P4→P1 63.7 66.8 67.2 68.7 85.7 86.9 84.4 86.6 84.4 87.1 93.1 95.4
P4→P2 74.4 78.1 82.2 85.0 90.0 91.3 89.7 92.5 82.4 84.8 93.9 96.2
P4→P3 58.7 61.7 66.9 67.6 86.0 87.1 81.6 84.4 77.6 80.0 85.1 86.9
P4→P5 46.2 48.5 61.7 63.8 63.2 63.6 68.8 71.0 59.9 61.3 71.3 72.7
P5→P1 30.2 23.5 64.2 66.6 54.3 55.7 61.2 62.6 59.2 60.6 62.8 64.3
P5→P2 34.9 36.7 35.4 35.8 48.8 49.7 49.8 50.0 35.0 34.8 50.2 51.1
P5→P3 39.9 41.9 45.1 46.3 58.7 60.0 46.5 46.3 44.6 44.4 69.2 70.8
P5→P4 55.8 58.6 52.2 53.5 71.1 73.0 70.2 71.7 68.6 70.3 80.2 82.4
Average 53.0 55.2 61.4 62.9 68.3 69.6 63.9 65.2 64.8 66.4 75.8 77.5
TABLE V: Acc(OS*) and Acc(OS) (%) on PIE.
Dataset A→D A→W D→A D→W W→A W→D Avg
OS OS* OS OS* OS OS* OS OS* OS OS* OS OS* OS OS*
BADA 85.8 85.8 76.9 76.6 89.4 91.5 96.0 96.6 83.4 83.1 97.1 97.3 88.0 88.5
BADA+ 92.7 93.3 89.8 90.6 81.6 81.7 98.0 99.5 83.6 78.9 98.5 100.0 89.9 90.7
BADA+c 92.2 94.1 87.6 89.0 81.5 84.1 97.7 100.0 80.3 83.4 97.3 100.0 89.5 91.8
BADA++c 94.1 94.6 89.2 89.7 83.2 83.4 98.5 100.0 83.3 81.9 98.6 100.0 90.9 91.7
BADA+ 95.5 97.0 92.6 94.0 82.3 82.6 98.0 99.5 83.4 79.5 98.4 100.0 91.0 92.2
OURS 96.0 97.5 92.5 93.7 85.3 86.0 98.4 100.0 83.2 83.9 98.6 100.0 92.3 93.5
TABLE VI: Ablation study on Office-31

Results on the three tasks of the Digits datasets are shown in Table III. Our method achieves the best average performance (89.0% on OS and 90.0% on OS*) over the three tasks. Moreover, compared to U → M and M → U, S → M is more challenging, because there is a larger distribution discrepancy between S and M. Even on this most difficult task, our method still outperforms the best baseline, STA, by 6.0% and 7.2% on OS and OS*, respectively. It is worth noting that DAOD is a shallow method, which cannot extract features with a convolutional neural network; therefore, it is not compared on Digits. The results of ATI-λ are from [liu2019separate].

Results on standard benchmark object datasets (Office-31 and Office-Home) are recorded in Table IV. For Office-31, our method significantly outperforms baselines among