Towards Novel Target Discovery Through Open-Set Domain Adaptation

05/06/2021 · Taotao Jing et al. · Tulane University, Brandeis University

Open-set domain adaptation (OSDA) considers that the target domain contains samples from novel categories unobserved in the external source domain. Unfortunately, existing OSDA methods ignore the demand for information about these unseen categories and simply group them into one "unknown" set without further explanation. This motivates us to understand the unknown categories more specifically by exploring their underlying structures and recovering their interpretable semantic attributes. In this paper, we propose a novel framework to accurately identify the seen categories in the target domain and effectively recover the semantic attributes of the unseen categories. Specifically, structure preserving partial alignment is developed to recognize the seen categories through domain-invariant feature learning. Attribute propagation over a visual graph is designed to smoothly transfer attributes from seen to unseen categories via visual-semantic mapping. Moreover, two new cross-domain benchmarks are constructed to evaluate the proposed framework on this novel and practical challenge. Experimental results on open-set recognition and semantic recovery demonstrate the superiority of the proposed method over the compared baselines.


1 Introduction

In recent years, domain adaptation (DA) has attracted great interest as a way to address label insufficiency or unavailability, a key bottleneck to the success of deep learning models [14]. DA transfers existing knowledge from a relevant source domain to the target domain of interest by eliminating the distribution gap across domains [10, 28, 24]. Most DA efforts focus on closed-set domain adaptation (CSDA) [10, 24], assuming the source and target domains share an identical label space, which is not always satisfied in real-world scenarios, since the target domain may contain more categories than we know from the source domain. Following this, open-set domain adaptation (OSDA) has been widely studied, where the source domain only covers a subset of the target domain label space [32, 28, 22, 19].

Unfortunately, these pioneering OSDA attempts simply identify the known categories while treating the remaining unobserved samples as an "unknown" outlier set. Without any further step, OSDA fails to discover what the unknown categories really are. Interestingly, the target domain may contain entirely new categories that have never been seen before. This motivates us to analyze the unknown set more specifically and discover novel categories.

In this paper, we define such a problem as Semantic Recovery Open-Set Domain Adaptation (SR-OSDA), where the source domain is annotated with both class labels and semantic attributes, while the target domain only contains unlabeled and unannotated samples drawn from more categories. The goal of SR-OSDA is to identify the seen categories and also recover the missing semantics of the unseen categories so as to interpret the new categories existing in the target domain. To the best of our knowledge, this is a completely new problem that has not been explored in the literature. The challenges are two-fold: (1) how to accurately identify seen and unseen categories in the target domain with well-labeled source knowledge; (2) how to effectively recover the missing attributes of unseen categories.

To this end, we propose a novel framework to simultaneously recognize the known categories and discover new categories in the target domain, as well as interpret them at the semantic level. The general idea of our model is to learn domain-invariant visual features by mitigating the cross-domain shift, and then build a visual-semantic projection to recover the missing attributes of unknown target categories. Our contributions are highlighted as follows:

  • We are the first to address the SR-OSDA problem, and propose a novel and effective solution to identify seen categories and discover unseen ones.

  • We propose structure preserving partial alignment to mitigate the domain shift when the target domain covers a larger label space than the source, and attribute propagation over a visual graph to seek the visual-semantic mapping for better missing-attribute recovery.

  • Two new benchmarks are built for SR-OSDA evaluation. Our proposed method achieves promising performance in both target sample recognition and semantic attribute recovery.

2 Related Work

Here we introduce related work on open-set domain adaptation and zero-shot learning, and highlight the differences between our work and the existing literature.

Open-Set Domain Adaptation. Compared to classic closed-set domain adaptation [40, 8, 17, 6, 37, 46, 36], open-set domain adaptation tackles a more realistic task in which the target domain contains data from classes never present in the source domain [4, 30, 25, 27, 19, 13, 35, 3, 32]. Busto and Gall study the scenario where the source and target domains each include classes exclusive to themselves [28]. Later on, Saito et al. focus on the situation where the source domain only covers a subset of the target domain label space, using an adversarial framework to learn features and labeling samples that deviate from a pre-defined threshold as "unknown" [32]. Instead of relying on a manually pre-defined threshold, [13] takes advantage of semantic categorical alignment and contrastive mapping to encourage target data from known classes to move close to the corresponding centroids while staying away from unknown classes. STA adopts a coarse-to-fine weighting mechanism to progressively separate the target data into known and unknown classes [22]. Most recently, SE-CC augments the self-ensembling technique with category-agnostic clustering in the target domain [27].

Zero-shot learning. The demand for leveraging annotated data to recognize novel, previously unseen classes has motivated a booming thread of research known as Zero-Shot Learning (ZSL) [26, 38, 7, 44, 39, 45, 11, 23, 16]. Early ZSL works explore class semantic attributes as an intermediate representation to classify data from unseen classes [21, 20]. Some ZSL methods learn a mapping between the visual and semantic spaces to compensate for the lack of visual features from the unseen categories [5, 1]. However, ZSL methods do not guarantee discrimination between the seen and unseen classes, leading to a bias towards seen classes under another realistic scenario, Generalized Zero-Shot Learning (GZSL). GZSL assumes the target data to be evaluated are drawn from the whole label space, including both seen and unseen classes [16, 33, 15, 5, 12]. Recently, generative frameworks that synthesize visual features for unseen classes have been explored to boost the performance of ZSL and GZSL [42, 41]. [48, 42] use a Wasserstein GAN [2] and a seen-category classifier to increase the discrimination of the synthesized features, [12] utilizes a cycle-consistency loss to optimize the synthesized feature generator, and [41] studies conditional VAEs [18] to learn the feature generator.

Different from open-set domain adaptation, the proposed SR-OSDA problem requires recovering interpretable knowledge of the target data from classes never present in the source domain, and uncovering new classes. Moreover, SR-OSDA is more challenging than the GZSL problem because we have access to neither the attributes nor any other semantic knowledge of the new target-domain categories, which makes SR-OSDA a more realistic and practical problem.

Figure 1: Illustration of our proposed framework, where the target domain $\mathcal{D}_t$ contains some categories unseen in the source domain $\mathcal{D}_s$. Convolutional neural networks (e.g., ResNet [14]) are used as the backbone to extract visual features, which are further fed into the embedding network $G$ to learn domain-invariant features through partial alignment. The projector $P$ then maps the embeddings to semantic attributes. Visual and semantic features are fused for the final classification tasks: one identifies seen/unseen samples in the target data, and the other recognizes all cross-domain data into $K_s+1$ classes (i.e., the seen classes plus one aggregated unseen category).

Notation                              Description
$\mathcal{D}_s$ / $\mathcal{D}_t$     source / target domain
$\mathcal{T}_k$ / $\mathcal{T}_u$     seen / unseen target set
$K_s$ / $K_t$                         number of source / target classes
$X_s$ / $X_t$                         source / target input features
$n_s$ / $n_t$                         number of source / target samples
$n_k$ / $n_u$                         number of samples in the seen / unseen set
$Y_s$ / $A_s$                         source domain labels / attributes
$x_i^s$ / $x_j^t$                     source / target domain instance
$z_i^s$ / $z_j^t$                     source / target domain embedding features
$\hat{y}^s$ / $\hat{y}^t$             predicted source / target label
$\hat{a}^s$ / $\hat{a}^t$             predicted source / target attributes
$\mu_k$ / $c_k$                       visual / embedding feature prototypes
$r^s$ / $r^t$                         source / target joint representations
Table 1: Notations and Descriptions

3 Motivations and Problem Definition

In this section, we illustrate our motivations and provide the problem definition of the semantic recovery open-set domain adaptation.

Open-set domain adaptation [28] focuses on the scenario where the target domain contains data from classes never observed in the source domain, which is more practical than conventional closed-set domain adaptation [10]. However, existing open-set domain adaptation efforts simply identify those unseen target samples as one large unknown category and give up exploring the discriminative and semantic knowledge inside the unknown set. The demand for further understanding the novel classes that only exist in the target domain motivates us to study how to recover missing semantic attributes that explain the target data and discover novel classes, which leads to the Semantic Recovery Open-Set Domain Adaptation (SR-OSDA) problem addressed in this paper. The main challenges of SR-OSDA lie in not only identifying the target samples of the unseen classes, but also providing the partitional structures of these samples with recovered semantic attributes for further interpretation.

For better understanding, we clarify the problem with mathematical notations. The target domain is defined as $\mathcal{D}_t = \{x_j^t\}_{j=1}^{n_t}$, containing $n_t$ samples with visual features $X_t$ from $K_t$ categories. The auxiliary source domain $\mathcal{D}_s = \{(x_i^s, y_i^s, a_i^s)\}_{i=1}^{n_s}$ consists of $n_s$ samples from $K_s$ classes with visual features $X_s$, labels $Y_s$, and semantic attributes $A_s$. For each source sample, the semantic attributes are obtained from the class-wise attribute descriptions of the source domain. SR-OSDA aims to recover the missing semantic attributes for the target data based on the visual features, and to uncover novel categories never present in the source domain. Table 1 summarizes the key notations used in the SR-OSDA setting.

It is noteworthy that the source and target domains are drawn from different distributions. Besides, the target data cover all classes in the source domain as well as exclusive categories that only exist in the target domain, i.e., $K_s < K_t$. SR-OSDA is different from open-set domain adaptation, which does not attempt to recover interpretable knowledge or discover new classes in the target domain. Moreover, the defined problem is different from generalized zero-shot learning [33], as we have no access to the semantic knowledge of the unseen target-domain categories.

To the best of our knowledge, SR-OSDA is proposed here for the first time, aiming to discover novel target classes by recovering semantic attributes from the auxiliary source data. In the following, we illustrate our solution for learning the relationship between visual features and semantic attributes under the guidance of the source data, which can be transferred to the target data to interpretably discover unseen classes.

4 The Proposed Method

4.1 Framework Overview

To tackle the above SR-OSDA problem, we propose a novel target discovery framework (Figure 1) to simultaneously recognize the target domain data from categories already observed in the source domain, and recover interpretable semantic attributes for the target classes unknown to the source. To achieve this, three modules are designed to address the cross-domain shift, semantic attribute prediction, and task-driven open-set classification. Specifically, the source data are adapted to the target domain feature space through partial alignment while preserving the target structure. A projector $P$ bridging the domain-invariant feature space and the semantic attribute space is trained on the source data as well as on target data with confident pseudo attributes. Moreover, the visual features guide the propagation of attributes from seen categories to unseen ones, and the semantic attributes in turn promote the discrimination of visual features through joint visual-semantic representation recognition with two classifiers $C$ and $D$, where $D$ is a binary classifier that identifies seen and unseen target samples, and $C$ is an extended multi-class classifier with $K_s+1$ outputs.

Since the target data are totally unlabeled while all three modules rely on label information in the target domain, we first discuss how to obtain pseudo labels for target samples through our progressive seen-unseen separation stage, which assigns target samples into observed and unobserved categories. In the following, we introduce the progressive seen-unseen separation and the three key modules of our proposed framework.

4.2 Modules and Objective Function

Progressive Seen-Unseen Separation. Here we describe the initialization strategy for separating the target domain data into seen and unseen sets based on the visual feature space. Intuitively, part of the source-style target samples can be promisingly identified by a well-trained source model, and such samples are more likely to belong to seen categories. On the other hand, target samples assigned even and mixed prediction probabilities across multiple classes tend to come from unseen categories, as no source classifier can easily recognize them. To achieve this, we apply a prototypical classifier to measure the similarity between each target sample and all source class prototypes [34]. For each target sample $x_j^t$ and the source prototypes $\{\mu_k\}_{k=1}^{K_s}$, the probability prediction is defined as:

$$p_{j,k} = \frac{\exp\big(-d(x_j^t, \mu_k)\big)}{\sum_{k'=1}^{K_s}\exp\big(-d(x_j^t, \mu_{k'})\big)}, \qquad (1)$$

where $d(\cdot,\cdot)$ is the distance function. The class with the highest probability is adopted as the pseudo label $\hat{y}_j^t$ for $x_j^t$. Next, we adopt a threshold $\tau$ to progressively separate all target samples into the seen and unseen sets $\mathcal{T}_k$ and $\mathcal{T}_u$, whose numbers of samples are denoted as $n_k$ and $n_u$, respectively. Specifically, we define $\tau$ as the mean of the highest probability prediction over all target samples, i.e., $\tau = \frac{1}{n_t}\sum_{j=1}^{n_t}\max_k p_{j,k}$. Based on that, we can build the two sets as:

$$\mathcal{T}_k = \{x_j^t \mid \max_k p_{j,k} \ge \tau\}, \qquad \mathcal{T}_u = \{x_j^t \mid \max_k p_{j,k} < \tau\}. \qquad (2)$$

Since we only have the source prototypes at the beginning, they are not accurate enough to identify the seen and unseen sets due to the domain shift. Thus, we gradually update the seen prototypes by involving newly-labeled target samples from $\mathcal{T}_k$ as $\mu_k \leftarrow (1-\epsilon)\,\mu_k + \epsilon\cdot\frac{1}{|\mathcal{S}_k|}\sum_{x_j^t\in\mathcal{S}_k} x_j^t$, where $\mathcal{S}_k$ denotes the set of target samples confidently predicted as class $k$, and $\epsilon$ is a small value controlling the mixture of cross-domain prototypes.

After obtaining pseudo labels for the seen set $\mathcal{T}_k$, we also need to explore more specific knowledge in the unseen set $\mathcal{T}_u$ instead of treating it as a whole as in OSDA [32].

To this end, we apply the K-means clustering algorithm to group $\mathcal{T}_u$ into $K_t - K_s$ clusters with cluster centers $\{\mu_k\}_{k=K_s+1}^{K_t}$. In this way, we obtain the prototypes of all seen and unseen categories, $\{\mu_k\}_{k=1}^{K_t}$. To refine the pseudo labels of the target samples, we run K-means over the whole target domain with centers initialized at these $K_t$ prototypes until the results converge.

At this point, we have pseudo labels for all target samples. We also assign semantic attributes to seen target samples according to the source categories indicated by their pseudo labels. Next, we explore structure preserving partial alignment, attribute propagation, and task-driven classification to solve SR-OSDA.
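
To make the separation stage concrete, the following PyTorch-style sketch walks through the steps above on precomputed visual features. It is a minimal illustration under our notation: the function name, the confidence criterion used to mix target samples into the prototypes, and the use of scikit-learn's KMeans are our own choices, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def progressive_separation(target_feats, source_protos, n_unseen_clusters, eps=0.1):
    """Sketch of the progressive seen-unseen separation on precomputed features."""
    # Prototypical probabilities: softmax over negative distances to source prototypes (Eq. 1).
    dists = torch.cdist(target_feats, source_protos)             # [n_t, K_s]
    probs = F.softmax(-dists, dim=1)
    conf, pseudo_labels = probs.max(dim=1)                        # highest probability and its class

    # Threshold tau = mean of the highest probabilities; split seen vs. unseen sets (Eq. 2).
    tau = conf.mean()
    seen_mask = conf >= tau

    # Progressively mix confidently predicted target samples into the seen prototypes.
    protos = source_protos.clone()
    for k in range(source_protos.size(0)):
        confident = seen_mask & (pseudo_labels == k)
        if confident.any():
            protos[k] = (1 - eps) * protos[k] + eps * target_feats[confident].mean(dim=0)

    # Cluster the unseen set to expose its structure (assumed K_t - K_s clusters).
    unseen = target_feats[~seen_mask].cpu().numpy()
    km = KMeans(n_clusters=n_unseen_clusters, n_init=10).fit(unseen)
    unseen_protos = torch.from_numpy(km.cluster_centers_).to(target_feats)

    # Refine pseudo labels: K-means over all target samples, initialized at the K_t prototypes.
    all_protos = torch.cat([protos, unseen_protos], dim=0)
    refine = KMeans(n_clusters=all_protos.size(0), init=all_protos.cpu().numpy(), n_init=1)
    refined_labels = refine.fit_predict(target_feats.cpu().numpy())
    return seen_mask, torch.from_numpy(refined_labels), all_protos
```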

Structure Preserving Partial Alignment. Due to the disparity between the source and target domain label spaces, directly matching the feature distributions across domains is destructive. Considering our goal of uncovering the unseen categories in the target domain, preserving the structural knowledge of the target domain data becomes even more crucial. Thus, instead of mapping the source and target domains into a new domain-invariant feature space, we seek to align the source data to the target domain distribution through partial alignment.

Specifically, with the help of the target domain pseudo labels $\hat{y}^t$, the prototype of each class $k$ in the pseudo label space, which contains $K_t$ categories, can be calculated as the class center in the embedding space, $c_k = \frac{1}{|\{j:\hat{y}_j^t = k\}|}\sum_{\hat{y}_j^t = k} z_j^t$. These prototypes describe the class-wise structural knowledge of the target domain in the embedding space. To address the domain disparity, we align each source sample to its specific target center while keeping it away from the other target centers:

$$\mathcal{L}_{sp}^{s} = -\frac{1}{n_s}\sum_{i=1}^{n_s}\log\frac{\exp\big(-d(z_i^s, c_{y_i^s})\big)}{\sum_{k=1}^{K_t}\exp\big(-d(z_i^s, c_k)\big)}, \qquad (3)$$

where $K_t$ is the total number of prototypes in the target domain. Moreover, we deploy a similar loss to make within-class target samples more compact while keeping between-class target samples discriminative:

$$\mathcal{L}_{sp}^{t} = -\frac{1}{n_t}\sum_{j=1}^{n_t}\log\frac{\exp\big(-d(z_j^t, c_{\hat{y}_j^t})\big)}{\sum_{k=1}^{K_t}\exp\big(-d(z_j^t, c_k)\big)}. \qquad (4)$$

Such a loss function makes within-class target samples more compact while pushing them away from the other classes.

These two loss functions help align the source and target domains to obtain domain-invariant visual features while also seeking more discriminative knowledge over target samples. The objective of structure preserving partial domain adaptation is then $\mathcal{L}_{sp} = \mathcal{L}_{sp}^{s} + \mathcal{L}_{sp}^{t}$.
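
The snippet below gives one plausible implementation of the partial alignment objective under the prototypical formulation reconstructed in Eqs. (3) and (4); the loss form and all names reflect our reading of the description above rather than a confirmed reference implementation.

```python
import torch
import torch.nn.functional as F

def prototype_alignment_loss(embeddings, labels, target_protos):
    """Pull each embedding toward its assigned target prototype and push it away
    from the other prototypes via a softmax over negative distances."""
    dists = torch.cdist(embeddings, target_protos)     # [n, K_t]
    log_probs = F.log_softmax(-dists, dim=1)
    return F.nll_loss(log_probs, labels)

def structure_preserving_loss(src_emb, src_labels, tgt_emb, tgt_pseudo, tgt_protos):
    """L_sp: align source embeddings to target prototypes (Eq. 3) and make target
    embeddings compact around their own pseudo-class prototypes (Eq. 4)."""
    return (prototype_alignment_loss(src_emb, src_labels, tgt_protos)
            + prototype_alignment_loss(tgt_emb, tgt_pseudo, tgt_protos))
```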

Attribute Propagation with Visual Structure. Since unseen target samples come without any annotation, neither class labels nor semantic attributes, our goal is to recover their semantic attributes via the visual-semantic projector $P$. However, only the attribute knowledge of the classes seen in the source domain is available for training, while the target samples from unseen categories cannot be used to optimize $P$, which may bias the projector towards the seen categories when dealing with unseen target samples. To this end, we propose an attribute propagation mechanism that aggregates visual graph knowledge into the semantic projection, which benefits the propagation of attributes from seen classes to unseen classes.

Specifically, for the embedding features of a training batch, the adjacency matrix is calculated as $W_{ij} = \exp(-d_{ij}^2/\sigma^2)$, where $W_{ii} = 0$, $d_{ij}$ is the distance between $z_i$ and $z_j$, and $\sigma^2$ is a scaling factor that we set following [31] to stabilize training. The semantic attributes $\hat{A}$ projected from the visual features by $P$ are then reconstructed as:

$$\tilde{A} = (I - \alpha S)^{-1}\hat{A}, \qquad (5)$$

where $S = \Lambda^{-1/2} W \Lambda^{-1/2}$ and $\Lambda$ is the diagonal degree matrix of $W$, in which $\alpha$ is a scaling factor fixed as suggested by [31], and $I$ is the identity matrix [47]. After the semantic attribute propagation, each row of $\tilde{A}$ is refined as a weighted combination of its neighbors guided by the visual graph. This keeps the attribute projector from overfitting to the seen categories, while removing undesired noise [31].

After refining the projected attributes via attribute propagation, we optimize the attribute projector on the seen categories across the two domains as:

$$\mathcal{L}_{attr} = \frac{1}{n_s + n_k}\sum_{x_i \in \mathcal{D}_s \cup \mathcal{T}_k} \ell_{bce}\big(\tilde{a}_i, a_i\big), \qquad (6)$$

where $\ell_{bce}$ is the binary cross-entropy loss, and $n_k$ is the number of samples in $\mathcal{T}_k$. Each dimension of the semantic attribute vector represents one specific semantic characteristic, and $\tilde{a}_i$ describes the predicted probability that the input sample exhibits each characteristic.
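
A minimal sketch of the batch-wise attribute propagation and the attribute loss, following the reconstruction in Eqs. (5) and (6); the alpha and sigma arguments stand in for the scaling factors chosen as in [31], and the function names are ours.

```python
import torch
import torch.nn.functional as F

def propagate_attributes(embeddings, attr_pred, alpha=0.5, sigma=1.0):
    """Refine projected attributes as weighted combinations of their neighbors on
    the visual graph (Eq. 5); alpha and sigma are placeholder scaling factors."""
    d2 = torch.cdist(embeddings, embeddings) ** 2                 # pairwise squared distances
    W = torch.exp(-d2 / sigma)                                    # visual-graph adjacency
    W.fill_diagonal_(0.0)
    deg = W.sum(dim=1).clamp(min=1e-8)
    S = W / torch.sqrt(deg[:, None] * deg[None, :])               # symmetrically normalized adjacency
    n = embeddings.size(0)
    prop = torch.linalg.inv(torch.eye(n, device=embeddings.device) - alpha * S)
    return prop @ attr_pred                                       # propagated attribute predictions

def attribute_loss(refined_attr, gt_attr):
    """Binary cross-entropy on seen-category samples of both domains (Eq. 6)."""
    return F.binary_cross_entropy(refined_attr.clamp(1e-6, 1 - 1e-6), gt_attr)
```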

Visual-Semantic Fused Recognition. Visual features and semantic attributes describe the data distribution from different perspectives. To leverage the benefits of both modalities simultaneously, we explore a joint visual-semantic representation that conveys the semantic discriminative information into the visual feature as $r = [z \oplus a]$, where $\oplus$ denotes concatenating the embedding feature $z$ and the attribute vector $a$ into the joint feature $r$.

It is noteworthy that during training, different kinds of semantic attributes are available at different stages, e.g., ground-truth attributes ($a^s$), pseudo attributes ($a^t$), and predicted attributes ($\tilde{a}$). We take them all into account and obtain the various joint representations as:

$$\mathcal{R}^s = \big\{[z_i^s \oplus a_i^s],\; [z_i^s \oplus \tilde{a}_i^s]\big\}, \qquad \mathcal{R}^t = \big\{[z_j^t \oplus a_j^t],\; [z_j^t \oplus \tilde{a}_j^t]\big\}, \qquad (7)$$

where $a_i^s$ are the ground-truth source attributes, $a_j^t$ are pseudo attributes available only for target samples in $\mathcal{T}_k$, and $\tilde{a}$ are the attributes predicted by $P$. All joint features in $\mathcal{R}^s$ and $\mathcal{R}^t$ are input into the classifiers $C$ and $D$ to optimize the framework.

To maintain the performance of the classifier $C$ under supervision from both the source and target domains, we construct the cross-entropy classification loss as:

$$\mathcal{L}_{C} = \frac{1}{n_s}\sum_{i=1}^{n_s}\ell_{ce}\big(C(r_i^s), y_i^s\big) + \frac{1}{n_t}\sum_{j=1}^{n_t}\ell_{ce}\big(C(r_j^t), \hat{y}_j^t\big), \qquad (8)$$

where $\ell_{ce}$ is the cross-entropy loss, and $y^s$ and $\hat{y}^t$ denote the source labels and target pseudo labels, respectively. Moreover, we train a binary classifier $D$ to separate the target domain into seen and unseen subsets, which is optimized by:

$$\mathcal{L}_{D} = \frac{1}{n_t}\sum_{j=1}^{n_t}\ell_{bce}\big(D(r_j^t), b_j\big), \qquad (9)$$

in which $b_j$ indicates whether the target sample is from the seen categories ($b_j = 1$) or from the unseen categories ($b_j = 0$).

Then we have our classification supervision objective on both the source and target domains with joint visual-semantic representations as $\mathcal{L}_{cls} = \mathcal{L}_{C} + \mathcal{L}_{D}$.
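
The sketch below shows how the fused representations can feed the two classifiers under the reconstructed Eqs. (8) and (9). Treating $D$ as a two-way softmax follows the implementation details in Section 5.1; the argument names (and passing pseudo labels for the target branch) are our assumptions.

```python
import torch
import torch.nn.functional as F

def fused_classification_loss(z_src, a_src, y_src,
                              z_tgt, a_tgt, y_tgt_pseudo, seen_mask,
                              classifier_C, classifier_D):
    """Joint visual-semantic classification: C over the K_s+1 classes (Eq. 8) and
    D separating seen from unseen target samples (Eq. 9)."""
    r_src = torch.cat([z_src, a_src], dim=1)          # joint source representation [z (+) a]
    r_tgt = torch.cat([z_tgt, a_tgt], dim=1)          # joint target representation [z (+) a]

    loss_C = (F.cross_entropy(classifier_C(r_src), y_src)
              + F.cross_entropy(classifier_C(r_tgt), y_tgt_pseudo))
    loss_D = F.cross_entropy(classifier_D(r_tgt), seen_mask.long())  # 1 = seen, 0 = unseen
    return loss_C, loss_D
```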

Overall Objective Function. To sum up, we obtain the overall objective function by integrating the structure preserving partial adaptation, the semantic attribute propagation and prediction, and the joint visual-semantic representation recognition as:

$$\mathcal{L} = \mathcal{L}_{cls} + \lambda_{sp}\mathcal{L}_{sp} + \lambda_{attr}\mathcal{L}_{attr}, \qquad (10)$$

where $\lambda_{sp}$ and $\lambda_{attr}$ are two trade-off parameters. By minimizing the proposed objective, the semantic descriptive knowledge is aggregated from the source data into the unlabeled target domain through the joint visual-semantic representation supervision and attribute propagation. Meanwhile, the discriminative visual structure in the target domain is promoted by the cross-domain partial adaptation.
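
Assuming the individual terms above have been computed for a batch, one optimization step under Eq. (10) could look as follows; the default lambda values and the optimizer variable are placeholders, not reported settings.

```python
def overall_objective(loss_cls, loss_sp, loss_attr, lambda_sp=1.0, lambda_attr=1.0):
    """Eq. (10): classification supervision plus weighted structure preserving
    alignment and attribute losses; the lambda defaults are placeholders."""
    return loss_cls + lambda_sp * loss_sp + lambda_attr * loss_attr

# Typical usage inside a training loop (optimizer over G, P, C, and D):
#   loss = overall_objective(loss_cls, loss_sp, loss_attr)
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
```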

Dataset   Domain   Role              #Images           #Attributes   #Classes
D2AwA     A        source / target   9,343 / 16,306    85            10 / 17
D2AwA     P        source / target   3,441 / 5,760     85            10 / 17
D2AwA     R        source / target   5,251 / 10,047    85            10 / 17
I2AwA     I        source            2,970             85            40
I2AwA     Aw       target            37,322            85            50
Table 2: Statistical characteristics of the D2AwA and I2AwA datasets
Method        A→P              A→R              P→A              P→R              R→A              R→P              I→Aw
OSBP [32]     49.6 10.8 46.0   74.2 13.6 68.7   76.0  9.1 69.9   63.3  6.9 58.2   90.1 13.7 83.2   55.9 10.6 51.7   67.6  7.5 66.2
STA [22]      60.1 33.0 57.6   85.5 10.8 78.7   90.2  5.7 82.5   82.8  7.4 76.0   88.5  7.2 81.1   66.9 13.5 62.0   51.5 45.5 51.4
AOD [13]      50.7  9.5 46.9   78.4 12.7 72.4   80.3  5.1 73.5   79.7  5.3 73.0   92.0 12.8 84.8   61.2  9.6 56.5   75.2  6.3 73.5
Ours(Init)    53.1 45.1 52.3   78.8 72.3 78.2   75.3 94.8 77.1   67.3 82.0 68.6   86.2 87.7 86.4   52.0 77.8 54.4   82.2  6.3 73.5
Ours(Vis)     54.1 76.1 56.1   75.4 70.3 75.0   69.5 98.5 72.1   57.4 83.1 59.7   88.3 98.8 89.2   58.7 91.2 61.6   48.2 70.3 48.7
Ours          62.8 47.2 61.4   90.9 71.4 89.1   79.2 98.5 81.0   78.3 83.7 78.8   94.9 90.5 94.5   61.2 80.4 63.0   83.2 70.2 82.8
Table 3: Open-set domain adaptation accuracy (%) on D2AwA (tasks A→P through R→P) and I2AwA (I→Aw). For each task, the three columns report the seen-class accuracy (OS*), the unseen-class accuracy (OS^), and the overall accuracy (OS).
Method         A→P              A→R              P→A              P→R              R→A              R→P              I→Aw
Source-only    67.6  0.0  0.0   87.6  0.0  0.0   91.3  0.0  0.0   85.3  0.0  0.0   94.1  0.0  0.0   71.1  0.0  0.0   77.2  0.3  0.7
ABP [49]       68.1  0.0  0.0   87.9  0.0  0.0   91.7  0.0  0.0   83.6  0.0  0.0   94.4  0.0  0.0   70.0  0.0  0.0   79.8  0.0  0.0
TF-VAE [26]    70.4  0.0  0.0   88.4  0.0  0.0   85.1  0.0  0.0   79.6  0.0  0.0   96.4  0.0  0.0   72.5  0.0  0.0   62.8  0.0  0.0
ABP* [49]      64.5  6.4 11.7   86.0  5.9 11.1   84.0 24.4 37.8   81.3 12.7 21.9   93.8 16.2 27.6   67.6  7.9 14.1   78.0 13.4 22.9
TF-VAE* [26]   59.7 12.8 21.0   77.9 16.4 27.1   35.1 35.6 35.3   34.8 32.7 33.7   68.5 36.1 47.3   50.7 21.0 29.7   37.7 20.0 26.2
Ours           62.5 27.0 37.7   90.7 30.0 45.1   79.2 36.7 50.2   78.0 15.7 26.1   95.2 37.8 54.1   59.0 20.8 30.8   83.1 22.0 34.8
Table 4: Semantic recovery accuracy (%) on D2AwA (tasks A→P through R→P) and I2AwA (I→Aw). For each task, the three columns report the seen-class accuracy (S), the unseen-class accuracy (U), and their harmonic mean (H).

5 Experiments

5.1 Experimental Settings

Datasets. We construct two datasets for the novel SR-OSDA setting. (1) D2AwA is constructed from the DomainNet dataset [29] and AwA2 [43]. Specifically, we choose the 17 classes shared between DomainNet and AwA2, select the alphabetically first 10 classes as the seen categories, and leave the remaining 7 classes as unseen. The corresponding attribute features in AwA2 are used as the semantic description. It is noteworthy that DomainNet contains 6 different domains, some of which barely share the semantic characteristics described by the AwA2 attributes (e.g., quickdraw). Thus, we only take the "real image" (R) and "painting" (P) domains into account, together with the AwA2 (A) data, for model evaluation. (2) I2AwA is collected by [50] and consists of 50 animal classes, split into 40 seen categories and 10 unseen categories following [43]. The source domain (I) includes 2,970 images of the seen categories collected via the Google image search engine, while the target domain comes from the AwA2 (Aw) dataset proposed in [43] for zero-shot learning, with 37,322 images covering all 50 classes. We use the binary attribute features of AwA2 as the semantic description, and only the seen-category attributes of the source data are available for training. Only one task, I→Aw, is evaluated on I2AwA. Table 2 shows the statistical characteristics of D2AwA and I2AwA.

Figure 2: t-SNE visualization of representations generated by (a) ResNet, (b) STA, and (c) Ours on I2AwA. (d) shows the joint visual-semantic features proposed in our paper. Red circles denote source data; blue and gray triangles denote target domain seen and unseen classes, respectively.
Figure 3: Ablation study of our proposed model on I2AwA, removing one component at a time: the structure preserving partial alignment term (w/o $\mathcal{L}_{sp}$), the binary classifier (w/o $D$), attribute propagation (w/o AP), or the joint visual-semantic representation (w/o VS).

Evaluation Metrics. We evaluate our method in two aspects: (1) target sample recognition under open-set domain adaptation and (2) generalized semantic attribute recovery. For the first, we follow conventional open-set domain adaptation studies [28, 32] and recognize every target sample as either one of the seen categories or the "unknown" category. The standard open-set average accuracy over all classes is reported as OS, the average accuracy over the target domain seen classes as OS*, and the accuracy on the target domain unseen categories as OS^. For semantic attribute recovery, we compare the predicted semantic description with the ground-truth semantic attributes. Specifically, we adopt a two-stage test: (a) identify whether a test sample belongs to the seen or unseen set, and (b) apply prototypical classification with the corresponding seen/unseen ground-truth attributes. We report the performance on the seen and unseen categories as S and U, respectively, and calculate the harmonic mean [33], defined as H = (2 × S × U)/(S + U). Note that all reported results are averages of class-wise top-1 accuracy, to eliminate the influence of class imbalance.
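
The two headline quantities can be computed as below; this is a plain sketch of class-wise average accuracy and the harmonic mean H, not the authors' evaluation script.

```python
import numpy as np

def class_average_accuracy(y_true, y_pred):
    """Mean of per-class top-1 accuracies, removing the effect of class imbalance."""
    classes = np.unique(y_true)
    return float(np.mean([(y_pred[y_true == c] == c).mean() for c in classes]))

def harmonic_mean(acc_seen, acc_unseen):
    """H = 2*S*U / (S + U) as in [33]; returns 0 when both accuracies are 0."""
    if acc_seen + acc_unseen == 0:
        return 0.0
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)
```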

Implementation. We use ResNet-50 [14] pre-trained on ImageNet as the backbone, and take the output of its second-to-last layer as the visual features [9, 14]. The embedding network $G$ is a two-layer fully connected network with a 1,024-dimensional hidden layer and a 512-dimensional output. The classifiers $C$ and $D$ are both two-layer fully connected networks with 256-dimensional hidden layers; the output dimension of $C$ is $K_s+1$, while $D$ outputs two dimensions indicating seen or unseen classes. The projector $P$ is a two-layer network with a 256-dimensional hidden layer, whose output dimension equals the semantic attribute dimension and is followed by a Sigmoid function. We employ the cosine distance for prototypical classification, while all other distances used in the paper are Euclidean. The trade-off parameters and the learning rate are fixed across all experiments, and we report the results at the 100-th epoch.
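
The stated architectures can be sketched as follows. The 2048-dimensional input (pooled ResNet-50 features) and the joint input dimension of $C$ and $D$ (embedding concatenated with the attribute vector) are our assumptions based on the description above.

```python
import torch.nn as nn

def build_modules(num_seen_classes, attr_dim, feat_dim=2048):
    """Module sketch; feat_dim assumes pooled ResNet-50 backbone features."""
    G = nn.Sequential(nn.Linear(feat_dim, 1024), nn.ReLU(),        # embedding network G
                      nn.Linear(1024, 512))
    C = nn.Sequential(nn.Linear(512 + attr_dim, 256), nn.ReLU(),   # (K_s + 1)-way classifier C
                      nn.Linear(256, num_seen_classes + 1))
    D = nn.Sequential(nn.Linear(512 + attr_dim, 256), nn.ReLU(),   # binary seen/unseen classifier D
                      nn.Linear(256, 2))
    P = nn.Sequential(nn.Linear(512, 256), nn.ReLU(),              # visual-semantic projector P
                      nn.Linear(256, attr_dim), nn.Sigmoid())
    return G, C, D, P
```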

Figure 4: Selected samples from the AwA2 dataset and attributes predicted by our method. Black entries are correctly predicted attributes, red ones are wrong predictions, and green ones are wrong predictions that are nevertheless reasonable for the specific instance. "P" and "R" denote the precision and recall of the attribute prediction for each sample, respectively.

Figure 5: Confusion matrix of target samples from I2AwA. (a) shows the results of STA and (b) lists ours. The unseen classes are zoomed in for better visualization.

Competitive Methods. Since the problem we address in this paper is a novel and practical setting, we mainly compare against two distinct branches of baselines: open-set domain adaptation and zero-shot learning methods.

For open-set domain adaptation, we compare our method with OSBP [32], AOD [13], and STA [22]. OSBP utilizes an adversarial training strategy to extract features for the target data, which are recognized into seen/unseen classes by a pre-defined threshold [32]. AOD exploits the semantic structure of open-set data via categorical alignment and contrastive mapping to push the unknown classes away from the decision boundary [13]. Differently, STA adopts a coarse-to-fine mechanism to progressively separate the known and unknown data without any manually set threshold [22].

For the semantic recovery tasks, we implement a source-only trained neural network and two zero-shot learning methods, ABP [49] and TF-VAE [26], under our setting as baselines. The source-only model is a fully-connected neural network trained with only the source domain ResNet-50 [14] features, which learns a projector mapping the visual features to semantic attributes. ABP trains a conditional generator mapping class-level semantic features and Gaussian noise to visual features [49]. TF-VAE proposes to enforce semantic consistency at the training, feature synthesis, and classification stages [26]. Besides, both ABP and TF-VAE can handle generalized zero-shot learning given the semantic attributes of the whole target label space. We therefore also report ABP* and TF-VAE*, which additionally take the semantics of the unseen target categories as inputs.

5.2 Algorithmic Performance

Table 3 shows the open-set domain adaptation accuracy on D2AwA and I2AwA. From the results we observe that our proposed method outperforms all compared baselines in terms of overall accuracy on most tasks, especially on the task A→R, where our model clearly improves over the second best compared method. The improvements come from our effective framework and the extra source semantic information. Note that in classical open-set domain adaptation, no semantic attributes are leveraged. For fair comparison, we provide the initialization results based on the visual features, reported as "Ours(Init)," and further implement another variant of our method with only visual features available for training, denoted as "Ours(Vis)." The performance decrease of "Ours(Vis)" proves the contribution and effectiveness of the semantic attributes for open-set domain adaptation. Moreover, our proposed method reaches promising results on the unseen classes while maintaining performance on the seen classes for all tasks. For example, STA achieves the best overall accuracy on task P→A, but completely fails on the unseen categories and overfits to the seen classes. Such an observation emphasizes the superiority of our method in exploring target domain unseen categories.

Table 4 shows the semantic recovery accuracy on D2AwA and I2AwA. As expected, all ZSL methods fail to recognize data from unseen categories and overfit to the seen classes due to their lack of capacity to handle the open-set setting. Our proposed method achieves promising results in recognizing both seen and unseen categories; e.g., on task R→A it achieves the highest unseen-class accuracy while keeping strong performance on the seen classes. Moreover, our proposed method even outperforms ABP* and TF-VAE*, which have access to both the seen and unseen categorical attributes in the source and target domains, while our method only employs the seen-category attribute information of the source domain.

5.3 In-Depth Factor Exploration

In this subsection, we first present the ablation study of our model, then visualize the learned representations, provide more details on the seen and unseen target categories via confusion matrices, and finally showcase several representative samples with their predicted attributes.

Ablation Study. We compare our complete method with several variants on the open-set domain adaptation and semantic recovery tasks to understand the contribution of each specific design in our framework. As shown in Figure 3, we make the following observations. (1) Compared to w/o $\mathcal{L}_{sp}$, which removes the structure preserving partial alignment term, our method achieves significant performance gains on the open-set domain adaptation task, especially for the seen categories. This demonstrates the effectiveness of aligning the source data to the target domain while preserving the target data structure. (2) Our method improves the performance on both tasks compared to w/o $D$, which removes the binary classifier and only uses the classifier $C$ to recognize seen/unseen categories. We conclude that the binary classifier refines the separation of seen and unseen classes. (3) Removing the attribute propagation mechanism (w/o AP) decreases the performance significantly on the semantic recovery tasks, especially for the unseen categories, proving the contribution of attribute propagation to semantic recovery and to uncovering unseen classes. (4) Our method outperforms the variant without the visual-semantic fusion (w/o VS), which only uses visual features for prediction, on both the seen classes of open-set domain adaptation and the unseen classes of semantic recovery, validating the effectiveness of injecting semantic knowledge into the visual features for both preserving performance on seen classes and exploring unseen categories.

Representation Visualization. We show the t-SNE embeddings of I2AwA produced by different models in Figure 2, where red circles denote source data, and blue and gray triangles denote target domain seen and unseen classes, respectively. The embedding of our method shows that same-class samples across domains are more compact and inter-class samples more discriminative than in the representations produced by the source-only ResNet-50 [14] and STA [22]. Moreover, our joint visual-semantic representation yields a more discriminative distribution and separates the unseen categories from the seen classes more clearly. This observation demonstrates the effectiveness of the semantic attributes, which not only benefit the unseen categories, but also promote the feature quality of the seen classes.

Confusion Matrix. We visualize the confusion matrices of STA and our method on I2AwA in Figure 5. STA only recognizes the target samples from unseen categories as "unknown." On the contrary, our proposed method can discover novel categories in the target domain; for instance, it attains surprisingly high accuracy on the category "Giraffe." Moreover, beyond uncovering unseen categories, our method also improves the accuracy on the seen classes compared to STA.

Qualitative Demonstration. To qualitatively illustrate the effectiveness of our method in discovering novel classes and recovering missing semantic information, we show several representative samples from the target domain unseen categories of I2AwA in Figure 4. For each sample, we show some of the correctly and wrongly predicted attributes with their prediction probabilities; "P" and "R" indicate the precision and recall of the attribute prediction for each sample. Notably, some predicted attributes are wrong for the corresponding category but reasonable for the specific image. These results demonstrate the ability of our model to transfer semantic knowledge from the source domain to the target data and to discover novel classes through recovering missing semantic information.

6 Conclusion

We addressed a novel and practical Semantic Recovery Open-Set Domain Adaptation problem, which aims to discover target samples from classes unobserved in the source domain and to interpret them based on recovered semantic attributes. To this end, we proposed a novel framework consisting of structure preserving partial alignment, attribute propagation via a visual graph, and task-driven classification over joint visual-semantic representations. Finally, two semantic recovery open-set domain adaptation benchmarks were constructed to evaluate our model in terms of open-set recognition and semantic attribute recovery.

References

  • [1] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid (2015) Label-embedding for image classification. IEEE transactions on pattern analysis and machine intelligence 38 (7), pp. 1425–1438. Cited by: §2.
  • [2] M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214–223. Cited by: §2.
  • [3] M. Baktashmotlagh, M. Faraki, T. Drummond, and M. Salzmann (2019) Learning factorized representations for open-set domain adaptation. In 7th International Conference on Learning Representations, ICLR 2019, Cited by: §2.
  • [4] S. Bucci, M. R. Loghmani, and T. Tommasi (2020) On the effectiveness of image rotation for open set domain adaptation. In European Conference on Computer Vision, pp. 422–438. Cited by: §2.
  • [5] W. Chao, S. Changpinyo, B. Gong, and F. Sha (2016) An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In European conference on computer vision, pp. 52–68. Cited by: §2.
  • [6] M. Chen, S. Zhao, H. Liu, and D. Cai (2020) Adversarial-learned loss for domain adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 3521–3528. Cited by: §2.
  • [7] X. Chen, X. Lan, F. Sun, and N. Zheng (2020) A boundary based out-of-distribution classifier for generalized zero-shot learning. In European Conference on Computer Vision, pp. 572–588. Cited by: §2.
  • [8] S. Cui, S. Wang, J. Zhuo, C. Su, Q. Huang, and Q. Tian (2020) Gradually vanishing bridge for adversarial domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12455–12464. Cited by: §2.
  • [9] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §5.1.
  • [10] J. Dong, Y. Cong, G. Sun, Y. Liu, and X. Xu (2020) CSCL: critical semantic-consistent learning for unsupervised domain adaptation. In European Conference on Computer Vision, pp. 745–762. Cited by: §1, §3.
  • [11] M. Elhoseiny and M. Elfeki (2019) Creativity inspired zero-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5784–5793. Cited by: §2.
  • [12] R. Felix, I. Reid, G. Carneiro, et al. (2018) Multi-modal cycle-consistent generalized zero-shot learning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 21–37. Cited by: §2.
  • [13] Q. Feng, G. Kang, H. Fan, and Y. Yang (2019) Attract or distract: exploit the margin of open set. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7990–7999. Cited by: §2, Table 3, §5.1.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1, Figure 1, §5.1, §5.1, §5.3.
  • [15] H. Huang, C. Wang, P. S. Yu, and C. Wang (2019) Generative dual adversarial network for generalized zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 801–810. Cited by: §2.
  • [16] H. Jiang, R. Wang, S. Shan, and X. Chen (2019) Transferable contrastive network for generalized zero-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9765–9774. Cited by: §2.
  • [17] X. Jiang, Q. Lao, S. Matwin, and M. Havaei (2020) Implicit class-conditioned domain alignment for unsupervised domain adaptation. In International Conference on Machine Learning, pp. 4816–4827. Cited by: §2.
  • [18] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §2.
  • [19] J. N. Kundu, N. Venkat, A. Revanur, R. V. Babu, et al. (2020) Towards inheritable models for open-set domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12376–12385. Cited by: §1, §2.
  • [20] C. H. Lampert, H. Nickisch, and S. Harmeling (2009) Learning to detect unseen object classes by between-class attribute transfer. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 951–958. Cited by: §2.
  • [21] C. H. Lampert, H. Nickisch, and S. Harmeling (2013) Attribute-based classification for zero-shot visual object categorization. IEEE transactions on pattern analysis and machine intelligence 36 (3), pp. 453–465. Cited by: §2.
  • [22] H. Liu, Z. Cao, M. Long, J. Wang, and Q. Yang (2019) Separate to adapt: open set domain adaptation via progressive separation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2927–2936. Cited by: §1, §2, Table 3, §5.1, §5.3.
  • [23] Y. Liu, J. Guo, D. Cai, and X. He (2019) Attribute attention for semantic disambiguation in zero-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6698–6707. Cited by: §2.
  • [24] M. Long, Z. Cao, J. Wang, and M. I. Jordan (2017) Conditional adversarial domain adaptation. arXiv preprint arXiv:1705.10667. Cited by: §1.
  • [25] Y. Luo, Z. Wang, Z. Huang, and M. Baktashmotlagh (2020) Progressive graph learning for open-set domain adaptation. In International Conference on Machine Learning, pp. 6468–6478. Cited by: §2.
  • [26] S. Narayan, A. Gupta, F. S. Khan, C. G. Snoek, and L. Shao (2020) Latent embedding feedback and discriminative features for zero-shot classification. In ECCV, Cited by: §2, Table 4, §5.1.
  • [27] Y. Pan, T. Yao, Y. Li, C. Ngo, and T. Mei (2020) Exploring category-agnostic clusters for open-set domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13867–13875. Cited by: §2.
  • [28] P. Panareda Busto and J. Gall (2017) Open set domain adaptation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 754–763. Cited by: §1, §2, §3, §5.1.
  • [29] X. Peng, Q. Bai, X. Xia, Z. Huang, K. Saenko, and B. Wang (2019) Moment matching for multi-source domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1406–1415. Cited by: §5.1.
  • [30] S. Rakshit, D. Tamboli, P. S. Meshram, B. Banerjee, G. Roig, and S. Chaudhuri (2020) Multi-source open-set deep adversarial domain adaptation. In European Conference on Computer Vision, pp. 735–750. Cited by: §2.
  • [31] P. Rodríguez, I. Laradji, A. Drouin, and A. Lacoste (2020) Embedding propagation: smoother manifold for few-shot classification. In European Conference on Computer Vision, pp. 121–138. Cited by: §4.2.
  • [32] K. Saito, S. Yamamoto, Y. Ushiku, and T. Harada (2018) Open set domain adaptation by backpropagation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 153–168. Cited by: §1, §2, §4.2, Table 3, §5.1, §5.1.
  • [33] E. Schonfeld, S. Ebrahimi, S. Sinha, T. Darrell, and Z. Akata (2019) Generalized zero- and few-shot learning via aligned variational autoencoders. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8247–8255. Cited by: §2, §3, §5.1.
  • [34] J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 4080–4090. Cited by: §4.2.
  • [35] S. Tan, J. Jiao, and W. Zheng (2019) Weakly supervised open-set domain adaptation by dual-domain collaboration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5394–5403. Cited by: §2.
  • [36] H. Tang, K. Chen, and K. Jia (2020) Unsupervised domain adaptation via structurally regularized deep clustering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8725–8735. Cited by: §2.
  • [37] H. Tang and K. Jia (2020) Discriminative adversarial domain adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 5940–5947. Cited by: §2.
  • [38] M. R. Vyas, H. Venkateswara, and S. Panchanathan (2020) Leveraging seen and unseen semantic relationships for generative zero-shot learning. In European Conference on Computer Vision, pp. 70–86. Cited by: §2.
  • [39] Z. Wan, D. Chen, Y. Li, X. Yan, J. Zhang, Y. Yu, and J. Liao (2019) Transductive zero-shot learning with visual structure constraint. In 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Cited by: §2.
  • [40] H. Wang, T. Shen, W. Zhang, L. Duan, and T. Mei (2020) Classes matter: a fine-grained adversarial approach to cross-domain semantic segmentation. In European Conference on Computer Vision, pp. 642–659. Cited by: §2.
  • [41] W. Wang, Y. Pu, V. Verma, K. Fan, Y. Zhang, C. Chen, P. Rai, and L. Carin (2018) Zero-shot learning via class-conditioned deep generative models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32. Cited by: §2.
  • [42] Y. Wen, K. Zhang, Z. Li, and Y. Qiao (2016) A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, pp. 499–515. Cited by: §2.
  • [43] Y. Xian, B. Schiele, and Z. Akata (2017) Zero-shot learning-the good, the bad and the ugly. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4582–4591. Cited by: §5.1.
  • [44] G. Xie, L. Liu, F. Zhu, F. Zhao, Z. Zhang, Y. Yao, J. Qin, and L. Shao (2020) Region graph embedding network for zero-shot learning. In European Conference on Computer Vision, pp. 562–580. Cited by: §2.
  • [45] H. Yu and B. Lee (2019) Zero-shot learning via simultaneous generating and learning. Advances in Neural Information Processing Systems 32, pp. 46–56. Cited by: §2.
  • [46] Y. Zhang, B. Deng, K. Jia, and L. Zhang (2020) Label propagation with augmented anchors: a simple semi-supervised learning baseline for unsupervised domain adaptation. In European Conference on Computer Vision, pp. 781–797. Cited by: §2.
  • [47] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf (2004) Learning with local and global consistency. Advances in neural information processing systems 16 (16), pp. 321–328. Cited by: §4.2.
  • [48] Y. Zhu, M. Elhoseiny, B. Liu, X. Peng, and A. Elgammal (2018) A generative adversarial approach for zero-shot learning from noisy texts. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1004–1013. Cited by: §2.
  • [49] Y. Zhu, J. Xie, B. Liu, and A. Elgammal (2019) Learning feature-to-feature translator by alternating back-propagation for generative zero-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9844–9854. Cited by: Table 4, §5.1.
  • [50] J. Zhuo, S. Wang, S. Cui, and Q. Huang (2019) Unsupervised open domain recognition by semantic discrepancy minimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 750–759. Cited by: §5.1.