Open Set Domain Adaptation: Theoretical Bound and Algorithm

07/19/2019 ∙ by Zhen Fang, et al. ∙ University of Technology Sydney 2

Unsupervised domain adaptation for classification tasks has achieved great progress in leveraging the knowledge in a labeled (source) domain to improve the task performance in an unlabeled (target) domain by mitigating the effect of distribution discrepancy. However, most existing methods can only handle unsupervised closed set domain adaptation (UCSDA), where the source and target domains share the same label set. In this paper, we target a more challenging but realistic setting: unsupervised open set domain adaptation (UOSDA), where the target domain has unknown classes that the source domain does not have. This study is the first to give the generalization bound of open set domain adaptation through theoretically investigating the risk of the target classifier on the unknown classes. The proposed generalization bound for open set domain adaptation has a special term, namely open set difference, which reflects the risk of the target classifier on unknown classes. According to this generalization bound, we propose a novel and theoretically guided unsupervised open set domain adaptation method: Distribution Alignment with Open Difference (DAOD), which is based on the structural risk minimization principle and open set difference regularization. The experiments on several benchmark datasets show the superior performance of the proposed UOSDA method compared with the state-of-the-art methods in the literature.


I Introduction

Standard supervised learning relies on the assumption that both training and test samples are drawn from the same distribution. Unfortunately, this assumption does not hold in many applications, since the process of collecting samples is prone to dataset bias [1]. In object recognition, for example, there can be a discrepancy in the distributions between training and testing as a result of specific conditions, device type, position, orientation, and so on. To address this problem, unsupervised domain adaptation (UDA) [2, 3] was proposed to transfer the related knowledge from the source domain, which has abundant labeled samples, to an unlabeled domain (the target domain).

The aim of UDA is to minimize the distribution difference between domains while learning the related knowledge. The existing work on UDA falls into two main categories: (1) feature matching, which seeks a new feature space where the marginal distributions or conditional distributions from two domains are similar [4, 5, 6], and (2) instance reweighting, which estimates the weights of the source samples so that the distribution discrepancy is minimized [7, 8]. There is an implicit assumption in most existing UDA methods [9, 10, 11] that the source and target domains share the same label set. UDA under this assumption is also known as Unsupervised Closed Set Domain Adaptation (UCSDA) [12].

Fig. 1: Unsupervised open set domain adaptation problem (UOSDA), where the target domain contains “unknown” classes that are not contained in the label set of the source domain.

However, the assumption in UCSDA methods is not realistic in an unsupervised setting (i.e., when there are no labels in the target domain), since it is not known whether the classes of target samples are from the label set of the source domain. It is possible that the target domain contains additional classes (unknown classes) which are not found in the label set of the source domain [13]. For example, in the Syn2Real task [14], real-world objects (target domain) may have more classes than synthetic objects (source domain). If existing UCSDA methods are used to solve the UDA problem without the assumption, negative transfer [15] may occur, due to the mismatch between unknown and known classes (see Fig. 2(b)).

To address the UDA problem without this assumption, Busto et al. [12] and Saito et al. [13] recently proposed a new problem setting, Unsupervised Open Set Domain Adaptation (UOSDA), in which the unlabeled target domain contains unknown classes that do not belong to the label set of the source domain (see Fig. 1). There are two key challenges [13] in addressing the UOSDA problem. The first challenge is how to classify unknown target samples, since there is insufficient knowledge to support learning which samples are from unknown classes. To address this challenge, it is necessary to mine deeper domain information to delineate a boundary between known and unknown classes. The second challenge in UOSDA is distribution difference. When distributions are matched, unknown target samples should not be matched; otherwise, negative transfer may occur.

Only a small number of methods have been proposed to address UOSDA [12, 13, 16, 17]. The first proposed UOSDA method is Assign-and-Transform-Iteratively (ATI-λ) [12], which recognizes unknown target samples by solving a constrained integer program and then learns a linear map that matches the source domain with the target domain while excluding the predicted unknown target samples. However, ATI-λ makes the additional assumption that the source domain also contains unknown classes which do not belong to the target classes. The first proposed deep UOSDA method is Open Set Back Propagation (OSBP) [13]. OSBP addresses the UOSDA problem without the assumption required by ATI-λ. It rejects unknown target samples by training with a binary cross entropy loss.

Fig. 2: 1) UCSDA methods match the source samples with target samples; however, as Fig. 2(b) shows, the unknown target samples interfere with distribution matching, which may lead to negative transfer. 2) UOSDA classifies known target samples into the correct known classes and recognizes unknown target samples as unknown.

ATI-λ and OSBP focus mainly on UOSDA algorithms and do not analyze UOSDA theoretically. Moreover, no existing work gives a generalization bound for the open set domain adaptation problem. To fill this gap, we study UOSDA from a theoretical perspective. We first study the risk of the target classifier on unknown classes. We find that this risk is closely related to a special term called the open set difference, which can be estimated from unlabeled samples. Minimizing the open set difference helps us to classify unknown target samples and thus addresses the first challenge.

Following our theory, we design a principle-guided UOSDA method referred to as Distribution Alignment with Open Difference (DAOD). This method can accurately classify unknown target samples while minimizing the discrepancy between two domains for known classes. DAOD learns the target classifier by simultaneously optimizing the structural risk functional [18], the joint distribution alignment, the manifold regularization [19], and open set difference. The reason DAOD is able to avoid negative transfer lies in its ability to minimize the open set difference, which enables the accurate classification of unknown target samples (addressing the first challenge). By excluding these recognized unknown target samples, the source and target domains can be precisely aligned, which addresses the second challenge.

There is no theoretical work in the literature for open set domain adaptation. The closest theoretical work is by Ben-David et al. [20], who give VC-dimension-based generalization bounds. Unfortunately, this work has several restrictions: 1) the theoretical analysis can only handle the closed setting; 2) the work only solves the binary classification task, whereas there are multiple classes in the target domain in the open setting. A significant contribution of our paper is that our theoretical work gives a generalization bound for open set domain adaptation.

The contributions of this paper are summarized as follows.

We provide the theoretical analysis and generalization bound for open set domain adaptation. The closed set domain adaptation theory [20] is a special case of our theoretical results. To the best of our knowledge, this is the first work on open set domain adaptation theory.

We develop a novel unsupervised open set domain adaptation method, Distribution Alignment with Open Difference (DAOD), which is based on our theoretical work. The method enables unknown target samples to be separated from known samples using open set difference.

We evaluate DAOD and existing UOSDA methods on real-world UOSDA tasks (including face recognition tasks and object recognition tasks). Extensive experiments demonstrate that DAOD outperforms the state-of-the-art UOSDA methods ATI-λ and OSBP.

This paper is organized as follows. Section II reviews existing work on unsupervised closed set domain adaptation, open set recognition and unsupervised open set domain adaptation. Section III presents the problem definitions, our main theoretical results and our proposed method. Theoretical analysis for open set domain adaptation is then presented in Section IV. Comprehensive evaluation results and analyses are provided in Section V. Lastly, Section VI concludes the paper.

II Related Work

In this section, we present related work on unsupervised closed set domain adaptation methods, open set recognition, and unsupervised open set domain adaptation.

Closed Set Domain Adaptation. Ben-David et al. [20] proposed generalization bounds for closed set domain adaptation. The bounds show that the performance of the target classifier depends on the performance of the source classifier and the discrepancy between the source and target domains. Many UCSDA methods [6, 10, 21] have been proposed according to this theoretical bound and attempt to minimize the discrepancy between domains. We roughly separate these methods into two categories: feature matching and instance reweighting.

Feature matching aims to reduce the distribution discrepancy by learning a new feature representation. Transfer component analysis (TCA) [4] learns a new feature space to match distributions by employing the Maximum Mean Discrepancy (MMD) [22]. Joint distribution adaptation (JDA) [5] improves TCA by jointly matching marginal distributions and conditional distributions. Adaptation Regularization Transfer Learning (ARTL) [23] considers a manifold regularization term [19] to learn the geometric relations between domains, while matching distributions. Joint Geometrical and Statistical Alignment (JGSA) [24] not only considers the distribution discrepancy but also matches the geometric shift. Recent advances show that deep networks can be successfully applied to closed set domain adaptation tasks. Deep Adaptation Networks (DAN) [25] considers three adaptation layers for matching distributions and applies multiple kernels (MK-MMD) [26] for adapting deep representations. Wasserstein Distance Guided Representation Learning (WDGRL) [27] minimizes the distribution discrepancy by employing the Wasserstein distance in neural networks.

The instance reweighting method reduces distribution discrepancy by weighting the source samples. Kernel mean matching (KMM) [7] defines the weights as the density ratio between the source domain and the target domain. Yu et al. [8] provided a theoretical analysis for important instance reweighting methods. However, when the domain discrepancy is substantially large, a large number of effective source samples will be down-weighted, resulting in the loss of effective information.

Unfortunately, the methods mentioned above cannot be applied to open set domain adaptation, because closed set methods use all target samples, including the unknown ones, to match distributions, which leads to negative transfer.

Open Set Recognition. When the source domain and target domain for known classes share the same distribution, open set domain adaptation reduces to Open Set Recognition. A common method for handling open set recognition relies on threshold-based classification strategies [28]. Establishing a threshold on the similarity score means rejecting samples that are distant from the training samples. Open Set Nearest Neighbor (OSNN) [29] decides whether a sample is from unknown classes by comparing a threshold with the ratio of the sample's similarity scores to its two most similar classes. Another trend relies on modifying Support Vector Machines (SVM) [30, 31, 32]. Multi-class open set SVM (OSVM) [32] uses a multi-class SVM as a basis to learn the unnormalized posterior probability, which is used to reject unknown samples.

Open Set Domain Adaptation. The open set domain adaptation problem was introduced together with Assign-and-Transform-Iteratively (ATI-λ) [12]. Using the distance between each target sample and the center of each source class, ATI-λ constructs a constrained integer program to recognize unknown target samples, then learns a linear transformation to match the source domain and the target domain excluding the recognized unknown target samples. However, ATI-λ requires the help of unknown source samples, which are unavailable in our setting. Recently, a deep learning method, Open Set Back Propagation (OSBP) [13], has been proposed. OSBP relies on an adversarial neural network and a binary cross entropy loss to learn the probability of target samples being unknown, then uses the estimated probability to separate the samples of unknown target classes. However, we have not found any paper that considers the generalization bound for open set domain adaptation. In this paper, we fill this gap in open set domain adaptation theory.

III Proposed Method

In this section, we first establish the basic definitions of domains, closed set domain adaptation (CSDA), and open set domain adaptation (OSDA), then introduce the problems which will be solved in this paper. Second, we present our main theoretical results. Lastly, we propose our UOSDA method based on our theoretical work.

III-A Notation and Problem Setting

A domain is a joint probability distribution $P_{X,Y}$ on $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ and $\mathcal{Y}$ are the feature and label spaces respectively. Let $P_X$ and $P_Y$ be the marginal distributions corresponding to the spaces $\mathcal{X}$ and $\mathcal{Y}$ respectively. We define closed set domain adaptation as follows.

Definition 1 (Closed Set Domain Adaptation).

Let and be the source domain and target domain respectively, where , and . , and are samples drawn from domains i.i.d. The task of closed set domain adaptation is to learn a good target classifier given as the training examples.

When there are no labeled target samples (), the scenario is called unsupervised closed set domain adaptation. It is noteworthy that the assumption is crucial in the definition of closed set domain adaptation. However, the assumption does not hold in the open set setting. In open set domain adaptation, the target classes are of two types: known classes and unknown classes. The unknown classes gather all additional classes which are not contained in the label set . The known classes are the same as the source classes . We define the open set domain as a joint distribution on , where . Let be marginal distributions corresponding to the feature space , and be the conditional distribution . We define the open set domain adaptation task as follows.

Definition 2 (Open Set Domain Adaptation).

Let and be the source domain and target domain respectively, where and . , and are samples drawn from domains i.i.d. Given as the training examples, the task of open set domain adaptation is to learn a good target classifier such that

1) classifies known target samples into correct known classes;

2) classifies unknown target samples as unknown.

When there are no labeled target samples (), the setting is called unsupervised open set domain adaptation.

Problem 1 (Unsupervised Open Set Domain Adaptation).

Let and be the source domain and target domain respectively, where and . and are samples drawn from domains i.i.d. How can we learn a good target classifier by using as the training samples?

Notations and their descriptions are summarized in Table I.

Notation           Description
feature space
number of source/target samples
the feature dimension
the number of known classes
class unknown target class
, source/target joint distribution
, source/target marginal distribution
, source/target conditional distribution for class
target marginal distribution for known classes
target joint distribution for known classes
data matrix , source samples
data matrix , target samples
data matrix , source samples with label
data matrix , target samples with pseudo label
data matrix , samples predicted as known
number of samples in
number of samples in
number of samples in
kernel feature map and kernel function induced by
TABLE I: Notations and their descriptions.

III-B Main Theoretical Results and Open Set Difference

We theoretically analyze the OSDA problem.

We consider multiclass classification with hypothesis space of classifiers

where , the classes and the class represents the unknown target classes.

Denoted by

partial risks, where

is the symmetric loss function satisfying the triangle inequality. We note that when

, is the risk of classifier on unknown target classes.

The risks of w.r.t. under , and are given by

(1)

where and

are class-prior probabilities. Specifically, let

(2)

be risks that unlabeled samples are regarded as unknown samples. For stating the main theoretical result of the paper, we need to introduce discrepancy distance, , which measures the difference between two distributions .

Definition 3 (Discrepancy Distance[33]).

Let be a set of functions from to , and be a loss function. The between distributions and over is

(3)

The following theorem provides an open set domain adaptation bound according to discrepancy distance.

Theorem 1.

Given a hypothesis space with the mild condition that the constant function , then for any , we have

(4)

where , .

The proof can be found in Section IV. It is noteworthy that the open set difference is the crucial term to bound the risk of on unknown target classes, since

(5)

The risk of on unknown target classes is intimately bound up with the open set difference ,

(6)

When , Theorem 1 degenerates to the closed set scenario with the theoretical bound

This is because when , the open set difference

The significance of Theorem 1 is twofold. First, it highlights that the open set difference is the main term for controlling the generalization performance in open set domain adaptation. Second, the bound shows a direct connection with the closed set domain adaptation theory.

In addition, the open set difference consists of two parts: positive term and negative term . A larger positive term implies more target samples are classified as unknown samples. The negative term is used to prevent source samples from being classified as unknown. According to Eq. (5), the negative term and the discrepancy distance jointly prevent all target samples from being recognized as unknown classes. Moreover, Corollary 1.1 tells us that the positive term and negative term can be estimated from unlabeled samples alone. Using Natarajan Dimension Theory [34] to bound the source risk , risks and by empirical estimates , and respectively, we obtain the following result.

Corollary 1.1.

Given a symmetric loss function satisfying the triangle inequality and bounded by , and a hypothesis with conditions: 1) and 2) the Natarajan dimension of is , if a random labeled sample of size is generated by -i.i.d and a random unlabeled sample of size is generated by -i.i.d, then for any and with probability at least , we have

where and empirical open set difference .

Next, we employ the open set difference to construct our model, Distribution Alignment with Open Difference (DAOD).

III-C Method

In this section, we propose our open set domain adaptation method. In Theorem 1, we derive the bound for open set domain adaptation which shows: 1) the first term (Source Risk) bounds the performance of the source domain; 2) the second term (Distribution Discrepancy) is a measure of the discrepancy between the source marginal distribution and the target marginal distribution for known classes ; 3) the third term is the open set difference , which is the difference between and . In this paper, we utilize the term to simulate the open set difference , where are free parameters.

Let , be the source and target data matrix respectively, and be the source label matrix. We can then write the bound as follows.

(7)

where is the distribution discrepancy for known classes.
Structural Risk Minimization. From a statistical machine learning perspective, we solve the UOSDA problem by the structural risk minimization (SRM) principle [18]. In SRM, the predicted function can be formulated as

(8)

where is the regularization term, and the hypothesis is defined as a subset of functional space

here is a reproducing kernel Hilbert space (RKHS) related to a kernel . Then, the classifier is

for any . Here the vector-value function is called the scoring function.

To effectively handle the different source domain and target domain for known samples, we can further divide the regularization term as

(9)

where is the manifold regularization [19], and the term means the joint distribution alignment for known classes, defined as follows.

(10)

Here is the empirical marginal distribution alignment for known samples, is the empirical conditional distribution alignment (), and is the adaptive factor [35] to represent the importance between the empirical marginal distribution alignment and the empirical conditional distribution alignment.

As formula (7) shows, we also add the open set difference to learn the unknown samples. Lastly, we formulate our optimization problem as follows.

(11)

where and is the regularization term for avoiding over-fitting.
Remark 1. In this paper, we employ Maximum Mean Discrepancy (MMD) [22] to match distributions. However, this results in a gap with discrepancy distance which is used to measure the distribution difference in Theorem 1. Inspired by Lemma 3, we also give a similar theoretical bound by using MMD distance. The details of the theoretical bound based on MMD are shown in Theorem 5. However, for proving Theorem 5, we also need an additional condition that the loss is squared loss . Thus, we use the squared loss to design our method. In addition, we use scoring functions to represent classifiers, and one-hot vectors to represent labels. Related theoretical analysis about scoring functions can be found in Section IV.

Using the representer theorem, if the optimization problem has a minimizer , then can be written as

where is the parameter and .
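As a rough illustration of the representer-theorem form, the sketch below fits only the squared-loss-plus-RKHS-norm part of the objective, which is ordinary kernel ridge regression on one-hot labels. It is not DAOD itself: the distribution alignment, manifold regularization, and open set difference terms are omitted, and the names `rbf_kernel`, `fit_kernel_scores`, `predict`, `rho`, and `gamma` are assumptions introduced for illustration.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def fit_kernel_scores(X, labels, n_classes, rho=1e-2, gamma=1.0):
    """Minimize ||Y - K beta||_F^2 + rho * tr(beta^T K beta) over beta.

    The gradient is 2 K [(K + rho I) beta - Y], so
    beta = (K + rho I)^{-1} Y is a stationary point (and a minimizer,
    because this simplified objective is convex).
    """
    K = rbf_kernel(X, X, gamma)
    Y = np.eye(n_classes)[labels]  # one-hot label matrix, one row per sample
    return np.linalg.solve(K + rho * np.eye(len(X)), Y)

def predict(X_train, beta, X_new, gamma=1.0):
    """Scoring function f(x) = sum_i beta_i K(x_i, x); predict the top score."""
    scores = rbf_kernel(X_new, X_train, gamma) @ beta
    return scores.argmax(axis=1)
```

In this simplified setting the scoring function is a kernel expansion over the training points, which is the structure the representer theorem guarantees; the full method adds further quadratic terms in the coefficient matrix.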

Distribution Alignment. We first introduce the definition of MMD distance and use MMD distance to match joint distributions and .

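Given two distributions $P$ and $Q$, the MMD distance between $P$ and $Q$ is defined as:

$$\mathrm{MMD}_{\mathcal{H}_K}(P, Q) = \left\| \mathbb{E}_{x \sim P}[\phi(x)] - \mathbb{E}_{x \sim Q}[\phi(x)] \right\|_{\mathcal{H}_K},$$

where $\mathcal{H}_K$ is the reproducing kernel Hilbert space (RKHS) and $\phi$ is the kernel feature map.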
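The MMD distance can be estimated from samples using kernel evaluations only, since the squared RKHS norm of the difference of mean embeddings expands into three averages of kernel values. The following is a minimal sketch under stated assumptions: a Gaussian kernel, the biased (V-statistic) estimator of the squared MMD, and illustrative names `gaussian_kernel`, `mmd_squared`, and `bandwidth` that do not come from the paper.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    """K[i, j] = exp(-||A[i] - B[j]||^2 / (2 * bandwidth^2))."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth ** 2))

def mmd_squared(Xs, Xt, bandwidth=1.0):
    """Biased estimate of the squared MMD between two samples.

    ||mean phi(xs) - mean phi(xt)||^2
        = mean K(xs, xs') + mean K(xt, xt') - 2 * mean K(xs, xt)
    """
    k_ss = gaussian_kernel(Xs, Xs, bandwidth).mean()
    k_tt = gaussian_kernel(Xt, Xt, bandwidth).mean()
    k_st = gaussian_kernel(Xs, Xt, bandwidth).mean()
    return k_ss + k_tt - 2.0 * k_st
```

The estimate is zero when both samples coincide and grows as the two samples move apart, which is what makes it usable as an alignment penalty.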

Let be the source samples with label , be the target samples with pseudo label . DAOD minimizes the MMD distances between empirical marginal distributions , and conditional distributions (). To make MMD a proper regularization for the scoring function , we adopt the projected MMD [23, 35], which is computed as

(12)
(13)

where is the number of predicted known target samples, is the predicted known target sample.

Then using the representer theorem and kernel trick, we can write Eq.(10) as

(14)

where , is the kernel matrix , and is the MMD matrix:

(15)
(16)

where .
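The role of the MMD matrix can be illustrated for the marginal case. The sketch below assumes the rank-one construction commonly used in JDA- and ARTL-style methods, restricted to target samples that are predicted as known; it does not reproduce the exact entries of Eqs. (15) and (16). The function name `marginal_mmd_matrix` and the boolean mask `target_is_known` are hypothetical.

```python
import numpy as np

def marginal_mmd_matrix(n_source, target_is_known):
    """Build M = e e^T so that f^T M f = (mean_source f - mean_known_target f)^2.

    e has 1/n_s on source entries, -1/n_known on target entries predicted
    as known, and 0 on target entries predicted as unknown, so the unknown
    samples do not take part in the alignment.
    """
    target_is_known = np.asarray(target_is_known, dtype=bool)
    n_known = int(target_is_known.sum())
    if n_source <= 0 or n_known == 0:
        raise ValueError("need at least one source and one known target sample")
    e = np.concatenate([
        np.full(n_source, 1.0 / n_source),
        np.where(target_is_known, -1.0 / n_known, 0.0),
    ])
    return np.outer(e, e)
```

For example, with source scores `[1, 2, 3]` and target scores `[10, 99, 20]` where the middle target sample is predicted as unknown, the quadratic form equals `(2 - 15)^2 = 169`; the excluded sample has no influence. A class-conditional matrix can be built the same way by restricting both index sets to one class.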

Manifold Regularization. To learn the geometrical relation between , DAOD uses manifold regularization. By the manifold assumption [19], if two points and are close in the support set of the distributions , then the values of the scores and are similar.

We denote the pair-wise affinity matrix as

(17)

where

is the similarity function such as cosine similarity,

denotes the set of -nearest neighbors to point and is a free parameter. The manifold regularization can then be formulated as follows.

(18)

where is the Laplacian matrix, which can be written as , here .
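The manifold regularizer can be sketched numerically. The code below assumes cosine similarity, a symmetric nearest-neighbor rule (an edge is kept if either point lists the other among its `p` neighbors), clipping of negative similarities to zero, and the unnormalized Laplacian $L = D - W$ with $D_{ii} = \sum_j W_{ij}$. These choices and the name `knn_graph_laplacian` are assumptions for illustration, not a statement of the exact construction in Eq. (17).

```python
import numpy as np

def knn_graph_laplacian(X, p=5):
    """Return (L, W) for a symmetric p-nearest-neighbor cosine-similarity graph."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    p = min(p, n - 1)                                # cannot have more than n - 1 neighbors
    unit = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    sim = np.clip(unit @ unit.T, 0.0, None)          # non-negative cosine similarity
    np.fill_diagonal(sim, -np.inf)                   # a point is not its own neighbor
    nearest = np.argsort(-sim, axis=1)[:, :p]        # p most similar points per row
    mask = np.zeros((n, n), dtype=bool)
    mask[np.repeat(np.arange(n), p), nearest.ravel()] = True
    mask |= mask.T                                   # keep an edge if either side selects it
    W = np.where(mask, sim, 0.0)
    L = np.diag(W.sum(axis=1)) - W
    return L, W
```

With a symmetric affinity matrix, the quadratic form `f @ L @ f` equals half the weighted sum of squared score differences over all pairs, so it is small exactly when neighboring points receive similar scores.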

Using the representer theorem and kernel trick, we can also write Manifold Regularization as

(19)


Open Set Loss Function. Here we use a matrix to rewrite the loss function and open set difference. Let the label matrix be , where is a one-hot vector such that if the sample is from with label , if the sample is from and , otherwise. , where is a one-hot vector such that if the sample is from and , otherwise.

Then

(20)

where is a diagonal matrix with if , if ; is a diagonal matrix with if , if .

Overall Reformulation. We formulate our method DAOD by incorporating the above three formulas (14), (19), (20):

(21)

III-D Training

The overall objective of Section III-C, Eq. (21), contains a negative term; hence, it may not be correct to compute the optimizer by directly solving the stationarity condition obtained by setting the gradient to zero, since the resulting “minimizer” might be a maximum point. Fortunately, the following theorem shows that there exists a unique optimizer, which can be obtained from this condition.

Theorem 2.

If the parameter is smaller than and the kernel function is universal, then Eq. (21) has a unique optimizer, which can be written as:

(22)
Proof.

See Appendix A. ∎

Input: Data ; source labels: ; iterations ; parameters and neighbor ; threshold ; kernel function K.
1. ; % Predict pseudo labels
2. Compute using , , and ;
3. ;
while do
    4. Compute using , , and ;
    5. Compute by solving Eq. (22);
    6. ; % Predict pseudo labels
    7. ;
end while
Output: Predicted target labels , classifier .
Algorithm 1 DAOD

Evaluating Eq. (22) exactly would require the ground-truth labels of the target domain. However, our problem setting is unsupervised, which means that no true target labels are available. Inspired by JDA [5], ARTL [23] and MEDA [35], we use pseudo labels instead of the ground-truth labels. Pseudo labels are generated by applying an open set classifier trained on the source data to the target data.

In this paper, we use Open Set Nearest Neighbor for Class Verification-$t$ (OSNN$^{cv}$-$t$) [29] to help us learn pseudo labels. We select the two nearest neighbors $v$ and $u$ of the test sample $s$. If both nearest neighbors have the same label, $s$ is classified with that label. Otherwise, we calculate the ratio

$$r = \frac{d(s, v)}{d(s, u)},$$

where we assume that $d(s, v) \leq d(s, u)$. If $r$ is smaller than or equal to a pre-defined threshold $t$, $0 < t < 1$, $s$ is classified with the same label as $v$. Otherwise, $s$ is recognized as an unknown sample.
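The decision rule described above can be sketched as follows. This is a minimal illustration that assumes Euclidean distance and takes the ratio as the distance to the nearer neighbor divided by the distance to the farther one; the function name `osnn_predict`, the `UNKNOWN` marker, and the default threshold are hypothetical choices, not values from the paper.

```python
import numpy as np

UNKNOWN = -1  # hypothetical marker for the unknown class

def osnn_predict(X_train, y_train, x, threshold=0.8, unknown=UNKNOWN):
    """Label one test sample from its two nearest labeled neighbors."""
    dists = np.linalg.norm(np.asarray(X_train, dtype=float) - x, axis=1)
    first, second = np.argsort(dists)[:2]      # nearest and second nearest
    if y_train[first] == y_train[second]:
        return int(y_train[first])             # both neighbors agree
    if dists[second] == 0.0:
        return unknown                         # two coincident neighbors that disagree
    ratio = dists[first] / dists[second]       # in [0, 1] since first is nearer
    return int(y_train[first]) if ratio <= threshold else unknown
```

A sample that sits close to one class and far from the other gets a small ratio and keeps the nearer label, while a sample roughly equidistant from two different classes gets a ratio near 1 and is rejected as unknown.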

To make the pseudo labels more accurate, we use the iterative pseudo label refinement strategy, proposed by JDA [5]. The implementation details are demonstrated in Algorithm 1.
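The iterative refinement strategy amounts to alternating between fitting a classifier with the current pseudo labels and re-predicting the target labels. The skeleton below is a sketch only: `fit_and_predict` is a hypothetical callable standing in for steps such as rebuilding the matrices and solving Eq. (22), and the early stop when labels no longer change is an added convenience that Algorithm 1 does not state.

```python
import numpy as np

def refine_pseudo_labels(initial_labels, fit_and_predict, n_iterations=10):
    """Repeatedly replace the pseudo labels with the model's new predictions.

    initial_labels : pseudo labels from an initial open set classifier
    fit_and_predict: callable mapping current pseudo labels to new ones
    n_iterations   : maximum number of refinement rounds
    """
    labels = np.asarray(initial_labels).copy()
    for _ in range(n_iterations):
        new_labels = np.asarray(fit_and_predict(labels))
        if np.array_equal(new_labels, labels):
            break                      # labels are stable, further rounds change nothing
        labels = new_labels
    return labels
```

Passing the model update as a callable keeps the loop independent of how the classifier is trained, so the same skeleton works with any closed-form or gradient-based solver.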

IV Generalization Bounds for Open Set Domain Adaptation

Since our method DAOD is based on MMD distance, but not the discrepancy distance used in Theorem 1, we also give a theoretical bound for OSDA that shows how MMD controls generalization performance in the case of the squared loss .

We first prove Theorem 1 as follows.

Proof of Theorem 1.

Given Eq.(1), we have

(23)

Let , and , then