On Universal Black-Box Domain Adaptation

by   Bin Deng, et al.

In this paper, we study an arguably least restrictive setting of domain adaptation in the sense of practical deployment, where only the interface of the source model is available to the target domain, and where the label-space relations between the two domains are allowed to be different and unknown. We term such a setting Universal Black-Box Domain Adaptation (UB^2DA). The great promise that UB^2DA makes, however, brings significant learning challenges, since domain adaptation can only rely on the predictions of unlabeled target data in a partially overlapped label space, obtained by accessing the interface of the source model. To tackle the challenges, we first note that the learning task can be converted into two subtasks of in-class[In this paper we use in-class (out-class) to describe the classes observed (not observed) in the source black-box model.] discrimination and out-class detection, which can be respectively learned by model distillation and entropy separation. We propose to unify them into a self-training framework, regularized by the consistency of predictions in local neighborhoods of target samples. Our framework is simple, robust, and easy to optimize. Experiments on domain adaptation benchmarks show its efficacy. Notably, by accessing the interface of the source model only, our framework outperforms existing methods of universal domain adaptation that make use of source data and/or source models, under a newly proposed (and arguably more reasonable) metric of H-score, and performs on par with them under the metric of averaged class accuracy.




1 Introduction

Unsupervised domain adaptation (UDA) aims to learn a prediction function for the unlabeled data of a target domain with the help of labeled data in a shifted but similar source domain. In the past decades, UDA has been heavily studied for different tasks such as object recognition [64, 62], semantic segmentation [8, 37], object detection [22, 56], and person re-identification [11, 14]. Existing UDA approaches mainly focus on learning domain-invariant features [28, 30, 13], or on taking advantage of semi-supervised learning [46, 63] and unsupervised learning [47] techniques. Apart from closed-set domain adaptation, where the same label space is shared across domains, problem variants of UDA have been introduced depending on the degree of overlap between the label spaces of the two domains. These variants give rise to other learning settings such as partial [4, 61], open-set [44, 3], and universal domain adaptation [57, 10, 42]. All these UDA methods specifically require source data in the target domain for adaptation.

Recently, source-free or source-absent domain adaptation methods [26, 34] have been proposed due to increasing concerns about data privacy. They assume that the source data are not accessible to the target domain; a pre-trained source model is instead required, and the adaptation is performed in a way similar to hypothesis transfer learning (HTL) [23]. In this paper, we consider an even less restrictive learning setting, black-box domain adaptation, where only the interface of the source model is available to the target domain [9, 60]. Considering that an increasing number of open AI interfaces (e.g., GPT-3, Google and Tencent AI platforms, etc.) are provided, it is convenient to directly use these off-the-shelf, black-box interfaces to assist the learning of the target models of interest. Such a learning strategy is attractive since it not only preserves data privacy but also maintains the commercial interests of the providers of the black-box models.

However, when putting black-box domain adaptation into practice, an inevitable challenge arises: it is hard to guarantee task consistency between the target task and the fixed source one. In other words, the label space of the source black-box predictor may not be identical to the target one. The reason for this is two-fold: first, the target domain is fully unlabeled and thus we may know nothing about its label space; second, the target label set may differ across applications, whereas the source one is usually fixed. To this end, we propose a more practical setting of universal black-box domain adaptation (UBDA), where we further relax the restrictions by allowing the target label space to be varying and unknown. Different from closed-set black-box domain adaptation [9, 60], the proposed UBDA aims to learn a robust target prediction model that not only recognizes the in-class target examples but also detects the out-class ones.

To address this challenging problem, we first note that the UBDA task can be converted into two subtasks of in-class discrimination and out-class detection. Specifically, in-class discrimination is achieved by learning a multi-class classifier via model distillation, and out-class detection is conducted with a binary classifier learned by entropy separation. We then propose to unify the two subtasks into a self-training framework, regularized by the consistency of predictions in local neighborhoods of target samples. Such a learning framework is simple, robust, and easy to optimize; it also maintains an objective consistent with the theoretical foundation of self-training [54]. Experiments on several domain adaptation benchmarks demonstrate the superiority of our proposed method over state-of-the-art methods of non-black-box universal domain adaptation. The main contributions of our paper are summarized as follows.

  1. To facilitate practical deployment, we propose a more realistic learning setting of UBDA, where only the interface of the source model is available to the target domain, and where the label-space relations between the two domains are allowed to be different and unknown.

  2. To address the challenging UBDA problem, we unify the subtasks of in-class discrimination and out-class detection within a regularized self-training framework, leading to a simple, robust, and easily optimized method. We also illustrate the connection between our method and an existing theoretical result.

  3. We show that our simple black-box solution, without access to either source data or the source model, significantly outperforms state-of-the-art non-black-box methods on most benchmark tasks with the metric of H-score, and performs on par with them with the metric of averaged class accuracy, justifying its efficacy.

2 Related Work

2.1 Unsupervised domain adaptation

Based on different relationships between the label spaces of source and target domains, unsupervised domain adaptation can be divided into two main categories: non-universal and universal domain adaptation. Non-universal domain adaptation requires specific knowledge of the relationship between the label spaces of source and target domains, including closed-set [29, 49, 28, 45, 24, 12, 48, 43, 47], partial [5, 61, 4, 6], and open-set [3, 44] domain adaptation. In contrast, universal domain adaptation [57] assumes that there is no prior information about the target label space, which thoroughly relaxes the unrealistic assumptions about the label spaces across domains. To solve this problem, You et al. [57] propose the universal adaptation network (UAN) by jointly training an adversarial domain adaptation network and a progressive instance-level weighting scheme, which quantifies the transferability of both source and target samples. Fu et al. [10] propose calibrated multiple uncertainties (CMU), which is similar to UAN in using an instance-level weighting scheme for adversarial alignment, but with a novel weighting solution composed of entropy, consistency, and confidence. Saito et al. [42] introduce domain adaptative neighborhood clustering (DANCE) via entropy optimization, which shows promising performance on universal domain adaptation. All these seminal works require source data for adapting to the target domain, which may violate privacy requirements in real-world applications. Different from these methods, [34] proposes a novel solution without access to the source data, which is similar to ours, but it still requires access to a well-designed source model.

Recently, black-box domain adaptation has attracted attention due to the widely available off-the-shelf interfaces [35] and security considerations [60]. However, these settings are proposed for addressing the closed-set problem, which is not realistic since the black-box interface is usually fixed by the provider while target tasks vary across application scenarios. Therefore, it is hard to guarantee that the target categories exactly match those of the black-box interface. In this work, we focus on universal black-box domain adaptation, which is more realistic and much more challenging.

2.2 Noisy label learning

Our learning setting is similar to noisy label learning [1], which assumes that incorrect labels exist in the training data. There are many ways to address this problem, such as leveraging the noise transition matrix [16, 39, 19], modifying the objective function [2, 27, 53, 31, 36], and exploiting memorization effects [21, 17, 58]. However, most noisy label learning methods cope with specific types of noise, such as symmetric noise [50], asymmetric noise [39], or pure open-set noise [52]. In our universal black-box learning paradigm, we make no assumption about the noise distribution, which means that the true labels of some noisy target data may not be included in the set of known classes of the training data. Moreover, our black-box setting assumes access to the noisy scores over the source training classes, which differs from noisy label learning, where the noisy labels are assumed to be one-hot vectors.

3 Problem Formulation

Assume a set of unlabeled data D_t = {x_i^t}_{i=1}^{n_t} on a target domain over an input space X, and a model f_s that was trained on data from a source domain over X × Y_s, where Y_s denotes the source label space. Both the source data and the parameters of f_s are unavailable; we thus also call f_s the black-box source model. The goal of interest is to learn a target model f such that the generalization risk E_{(x,y)}[ℓ(f(x), y)] is minimized, where ℓ denotes an appropriate loss function, e.g., the 0-1 loss. In this work, we consider the least restrictive setting where the target label space Y_t is unknown and may be different from Y_s, giving rise to the task of universal black-box domain adaptation (UBDA).

We note that the task of UBDA can be converted into (1) learning a binary classifier g_b to minimize E[ℓ(g_b(x), 1[y ∉ Y_s])], where 1[·] is an indicator function, and (2) learning a multi-class classifier g_c to minimize E[ℓ(g_c(x), y) | y ∈ Y_s]. As such, learning g_b amounts to detecting those target instances whose labels are not in the label space Y_s of f_s, i.e., out-class detection, and learning g_c amounts to discriminating the target instances whose labels are in (possibly a subset of) Y_s, i.e., in-class discrimination.

4 The Proposed Framework

In UBDA, given the black-box source model f_s, all we can access for the target data are the predictions p̄_i = f_s(x_i^t). Due to the inevitable domain gap between the source and target domains (and the possible difference between the label spaces Y_s and Y_t), each prediction p̄_i is inevitably noisy. In spite of being noisy, {p̄_i} is the only source of information that we can rely on to trigger the learning of g_b and g_c. Our overall idea is to decompose the learning objective into those respectively for learning g_b and g_c. For the former task, the idea of entropy separation [42] has demonstrated its efficacy in the setting of universal domain adaptation [57]. The latter task can be simply solved via model distillation [20]. For the UBDA studied in the present work, we are motivated by the recent theoretical result of [54], and propose to unify the aforementioned two learning tasks into a regularized self-training framework. Let K = |Y_s| denote the number of source classes. We first parameterize g_b and g_c with a function f such that

g_b(x) = 1[ H(σ(f(x))) / log K ≥ ρ ],    g_c(x) = argmax_k σ_k(f(x)),    (1), (2)

where σ is the softmax function, H is the entropy function, and ρ is a threshold. Assuming f has been learned, each target instance x_i^t can be directly classified as

ŷ_i = "unknown" (out-class) if g_b(x_i^t) = 1;  ŷ_i = g_c(x_i^t) otherwise.    (3)
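To make the inference rule concrete, the following minimal sketch (our own illustration, not the authors' released code; the function names and the normalized-entropy form are assumptions) flags a sample as out-class when the normalized entropy of its softmax prediction reaches the threshold ρ, and otherwise outputs the most likely in-class label:

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def normalized_entropy(probs):
    # Entropy of the prediction, normalized by log K so it lies in [0, 1].
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))

def classify(logits, rho=0.5):
    # Out-class detection via the entropy threshold, in-class discrimination via argmax.
    probs = softmax(logits)
    if normalized_entropy(probs) >= rho:
        return "unknown"  # out-class
    return max(range(len(probs)), key=lambda k: probs[k])  # in-class label

print(classify([8.0, 0.1, 0.2]))   # confident prediction -> in-class label 0
print(classify([1.0, 1.0, 1.0]))   # uniform prediction -> "unknown"
```

Normalizing the entropy by log K makes a single threshold such as ρ = 0.5 meaningful regardless of the number of source classes.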
4.1 A unified self-training

Since the parameters of the source model f_s are not accessible, we initially apply knowledge distillation [20] to train the network model f by minimizing

L_dis(f) = (1/n_t) Σ_i CE( f_s(x_i^t), σ(f(x_i^t)) ),    (4)

where CE(p, q) = -Σ_k p_k log q_k denotes the cross-entropy function. During distillation, the model f gradually becomes more discriminative for the target data, which provides a well-initialized model for further self-training by minimizing

L_st(f) = (1/n_t) Σ_i s(x_i^t),    (5)

where s(x) is defined as:

s(x) = -| H(σ(f(x))) / log K - ρ |,    (6)

where ρ is a threshold.

We emphasize that minimizing the loss (5) has two effects: (1) learning a binary classifier by entropy separation [42] and (2) learning a multi-class classifier by entropy minimization. Both effects are equivalent to self-training via pseudo-labeling [25]. The loss (5), combined with (4), thus forms a unified self-training objective.
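The two terms of the unified objective can be sketched as follows. This is a simplified illustration under our own assumptions about the exact loss forms, not the authors' implementation: the distillation term fits the target model's softmax outputs to the black-box soft predictions via cross-entropy, and the self-training term pushes the normalized prediction entropy away from the threshold ρ (toward zero for confident in-class samples, toward the maximum for ambiguous out-class ones):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(p, q):
    # CE(p, q) = -sum_k p_k log q_k, used to fit the black-box soft labels.
    return -sum(pk * math.log(max(qk, 1e-12)) for pk, qk in zip(p, q))

def normalized_entropy(q):
    h = -sum(qk * math.log(qk) for qk in q if qk > 0)
    return h / math.log(len(q))

def unified_loss(black_box_probs, target_logits, rho=0.5):
    # Distillation term: match the target model to the black-box predictions.
    # Self-training term: entropy separation, pushing normalized entropy away
    # from rho in either direction.
    n = len(target_logits)
    dis = st = 0.0
    for p_src, logits in zip(black_box_probs, target_logits):
        q = softmax(logits)
        dis += cross_entropy(p_src, q)
        st += -abs(normalized_entropy(q) - rho)
    return dis / n, st / n
```

In practice both terms would be computed on mini-batches with automatic differentiation; the pure-Python version above only illustrates the quantities being minimized.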

4.2 Regularization by consistency of predictions in local neighborhoods

The unified self-training proposed in the preceding section is a general strategy for deep unsupervised learning. However, it lacks considerations common in classical unsupervised or semi-supervised learning, such as minimum coding length [32, 59] or density separation [7]. To improve over self-training, we leverage a simple criterion that promotes the consistency of predictions for target samples in each local neighborhood. Given samples in such a neighborhood, we expect the learned model to produce consistent predictions on them. Technically, we construct the learning model f as a feature extractor g followed by a linear classifier h, i.e., f = h ∘ g. We introduce M learnable prototypes {c_j}_{j=1}^M to be the centers of clusters of target data features, where M is set to a relatively large number. Then, we search the neighborhoods of each target instance in the feature space via cosine distance. Let p_{ij} denote the similarity between target feature z_i = g(x_i^t) and prototype c_j, given as

p_{ij} = exp(d(z_i, c_j)) / Σ_{j'=1}^{M} exp(d(z_i, c_{j'})),    (7)

with d(z_i, c_j) the cosine similarity between z_i and prototype c_j, i.e., d(z_i, c_j) = z_i^T c_j / (||z_i|| ||c_j||). To ensure consistency of neighborhood predictions and prevent large clusters from distorting the discriminative feature space, we follow strategies similar to [55, 15] by constructing self-supervised targets for each x_i^t as

q_{ij} = (p_{ij}^2 / v_j) / Σ_{j'} (p_{ij'}^2 / v_{j'}),  with  v_j = Σ_i p_{ij}.    (8)

Then, our self-supervised regularization for neighborhood consistency is designed as

L_nc(f) = -(1/n_t) Σ_i Σ_j q_{ij} log p_{ij},    (9)

where the targets q_{ij} are treated as constants during optimization. The prototypes {c_j} are initialized by the k-means algorithm to form cluster centers and then updated during the optimization. Once learned, these prototypes are discarded, and only the model f is maintained for inference, as shown in (3).
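The prototype-based regularizer can be sketched as below. This is a hypothetical minimal implementation; in particular, the square-and-renormalize sharpening of the soft assignments (in the style of deep embedded clustering) is our assumption for the self-supervised targets, and the paper's exact form may differ:

```python
import math

def cosine(u, v):
    # Cosine similarity between two feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def prototype_assignments(feature, prototypes):
    # Soft assignment of a target feature to each prototype via a softmax
    # over cosine similarities.
    sims = [cosine(feature, c) for c in prototypes]
    m = max(sims)
    exps = [math.exp(s - m) for s in sims]
    z = sum(exps)
    return [e / z for e in exps]

def sharpened_targets(p):
    # Self-supervised targets: square-and-renormalize so each sample commits
    # more strongly to its nearest prototypes (assumed sharpening form).
    sq = [pj * pj for pj in p]
    z = sum(sq)
    return [s / z for s in sq]

def consistency_regularizer(features, prototypes):
    # Cross-entropy between sharpened targets and soft assignments, averaged
    # over the batch; targets are treated as constants (no gradient).
    loss = 0.0
    for z in features:
        p = prototype_assignments(z, prototypes)
        q = sharpened_targets(p)
        loss += -sum(qj * math.log(max(pj, 1e-12)) for qj, pj in zip(q, p))
    return loss / len(features)
```

Samples attracted to the same prototype thus receive consistent targets, which is what enforces agreement of predictions within each local neighborhood.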

4.3 The overall objective

With both the losses (4) and (5) of unified self-training and the regularizer (9), our overall objective of regularized self-training is written as

L(f) = α L_dis(f) + L_st(f) + λ L_nc(f),    (10)

where α and λ are penalty parameters.

5 Theoretical Insight

Since our technical scheme of a regularized self-training framework is motivated by Wei et al. [54], in this section we illustrate the relationship between our technical solution and their theory.

Let P denote a distribution of target examples over the input space X. Assume the target data is partitioned into K classes with ground-truth classifier G*: X → [K]. For any classifiers G and G', define L_{0-1}(G, G') = E_{x~P}[1(G(x) ≠ G'(x))] to be the disagreement between G and G'. Let Err(G) = L_{0-1}(G, G*) be the expected error of G. Denote by G_pl the pseudo-labeler and by err_k the fraction of examples in the k-th class which are mistakenly pseudo-labeled.

Under several conditions, Wei et al. [54] prove that the expected error of a classifier, trained to fit pseudo-labels while regularizing consistency of predictions in local neighborhoods, can be bounded by a term that is smaller than the error of the pseudo-labeler. The theory can be stated as follows.

Theorem 1

Suppose P satisfies the expansion property (Assumption 4.1 in [54]). Then for any minimizer Ĝ of the objective

L_{0-1}(G, G_pl) + (2c/(c-1)) · R_B(G),

we have

Err(Ĝ) ≤ (2/(c-1)) · Err(G_pl) + (2c/(c-1)) · μ.

Here, B(x) can be considered as a neighborhood around x, R_B(G) = E_{x~P}[1(∃ x' ∈ B(x): G(x') ≠ G(x))] represents the degree of inconsistency of the predictions of G within local neighborhoods, μ upper-bounds the corresponding neighborhood inconsistency of the ground-truth classifier, and c in Theorem 1 is a value that represents the expansion degree of the population distribution P.

Assuming μ is very small and approaches zero, the above theorem shows that, under the expansion assumption, if we have a pseudo-labeler G_pl with Err(G_pl) > 0 and the expansion degree satisfies c > 3 (so that 2/(c-1) < 1), then we can obtain a classifier Ĝ by minimizing the above objective such that Err(Ĝ) < Err(G_pl).

In our method, we can consider that there are two self-training tasks, for minimizing the errors of g_b and g_c, each corresponding to minimizing Err(Ĝ) in Theorem 1. Minimizing the loss (5) is equivalent to applying pseudo-labeling to fit the pseudo-labels, which corresponds to minimizing the term L_{0-1}(G, G_pl) in Theorem 1. Our distillation loss (4) and the designed strategy of (6) ensure that the pseudo-labels are mostly correct so that the error of the pseudo-labeler is small enough, and the designed regularization (9) matches the goal of neighborhood consistency of predictions by both g_b and g_c, giving an effect similar to minimizing R_B(G) in Theorem 1. Therefore, minimizing the overall objective (10) is approximately equivalent to minimizing the objective in Theorem 1 for both tasks. Assuming the conditions in Theorem 1 hold and the pseudo-labeler initialized from the source black-box model by distillation satisfies the error condition for both classification tasks, we obtain decreasing errors of both g_b and g_c during the iterative self-training process.

6 Experiment

In this section, we evaluate our method on three universal domain adaptation benchmarks under the UBDA setting.

H-score (%)
Setting Method A2W D2W W2D A2D D2A W2A Avg.
Non-Black-Box UAN [57] 58.6 70.6 71.4 59.7 60.1 60.3 63.5
Non-Black-Box CMU [10] 67.3 79.3 80.4 68.1 71.4 72.2 73.1
Non-Black-Box DANCE [42] 67.4 89.9 90.7 70.8 79.1 71.9 78.3
Black-Box SO++ 50.3 74.9 61.5 48.1 77.0 70.8 63.8
Black-Box Ours 78.2 92.6 87.9 80.9 92.6 89.4 86.8
AA (%)
Non-Black-Box UAN [57] 86.6 94.8 98.0 86.5 85.5 85.1 89.2
Non-Black-Box USFDA [34] 85.6 95.2 97.8 88.5 87.5 86.6 90.2
Non-Black-Box CMU [10] 86.9 95.7 98.0 89.1 88.4 88.6 91.1
Non-Black-Box DANCE [42] 92.8 97.8 97.7 91.6 92.2 91.4 93.9
Black-Box SO++ 75.0 94.1 94.6 83.0 81.2 83.6 85.3
Black-Box Ours 83.0 98.1 97.8 88.7 91.4 91.0 91.7
Table 1: Classification results (H-score and AA, %) of universal domain adaptation tasks on the Office31 dataset.
H-score (%)
Setting Method A2C A2P A2R C2A C2P C2R P2A P2C P2R R2A R2C R2P Avg.
Non-Black-Box UAN [57] 51.6 51.7 54.3 61.7 67.6 61.9 50.4 47.6 61.5 62.9 52.6 65.2 56.6
Non-Black-Box CMU [10] 56.0 56.9 59.2 67.0 64.3 67.8 54.7 51.1 66.4 68.2 57.9 69.7 61.6
Non-Black-Box DANCE [42] 35.9 29.3 35.2 42.6 18.0 29.3 50.2 45.4 41.1 19.8 38.8 52.6 36.5
Black-Box SO++ 49.8 50.1 53.2 62.4 50.3 57.4 61.1 49.5 56.1 56.9 53.0 53.7 54.5
Black-Box Ours 60.9 69.6 76.3 74.4 69.2 76.5 74.5 60.3 76.2 74.1 62.0 71.1 70.4
AA (%)
Non-Black-Box UAN [57] 63.0 82.8 87.9 76.9 78.7 85.4 78.2 58.6 86.8 83.4 63.2 79.4 77.0
Non-Black-Box USFDA [34] 63.4 83.3 89.4 71.0 72.3 86.1 78.5 60.2 87.4 81.6 63.2 88.2 77.0
Non-Black-Box DANCE [42] 64.1 84.3 91.2 84.3 78.3 89.4 83.4 63.6 91.4 83.3 63.9 86.9 80.4
Black-Box SO++ 55.4 81.0 91.0 71.2 71.0 83.6 70.3 50.9 89.5 77.6 56.6 84.7 73.6
Black-Box Ours 56.8 86.9 94.4 70.8 76.4 91.2 77.0 56.6 93.0 82.0 58.5 89.2 77.7
Table 2: Classification results (H-score and AA, %) of universal domain adaptation tasks on the Office-Home dataset.
Setting Method P2R R2P P2S S2P R2S S2R Avg.
Non-Black-Box UAN [57] 41.85 43.59 39.06 38.95 38.73 43.69 40.98
Non-Black-Box CMU [10] 50.78 52.16 45.12 44.82 45.64 50.97 48.25
Non-Black-Box DANCE [42] 35.15 49.24 43.32 40.18 46.24 36.57 41.78
Black-Box SO++ 47.30 48.19 43.89 35.93 41.99 43.71 43.50
Black-Box Ours 57.10 54.76 47.23 41.39 44.03 51.53 49.31
Table 3: Classification results (H-score (%)) of universal domain adaptation on the DomainNet dataset. Note that the compared methods did not report AA results for this dataset in their papers.

6.1 Setup

Datasets. Following most previous works, we use Office31 [41], Office-Home [51], and the large-scale DomainNet [40] in our experiments. For all datasets, we follow the same setup as [10]. Specifically, Office31 has 31 classes shared by three domains: Amazon (A), Dslr (D), and Webcam (W). The 10 classes shared by Office31 and Caltech-256 are used as the common label set between the source and target domains; then, in alphabetical order, the next 10 classes and the remaining 11 classes are used as the source private classes and the target private classes, respectively. Office-Home is a larger dataset containing 65 classes and four domains: Artistic (A), Clip-Art (C), Product (P), and Real-World (R). In alphabetical order, the first 10 classes are selected as the common classes, the next 5 classes as the source private classes, and the rest as the target private classes. DomainNet [40] is the largest domain adaptation dataset by far, containing six domains: Clipart (C), Infograph (I), Painting (P), Quickdraw (Q), Real (R), and Sketch (S), across 345 classes. Following [10], the three domains P, R, and S are selected in our experiments. In alphabetical order, we use the first 150, the next 50, and the remaining classes as the common classes, the source private classes, and the target private ones, respectively, using the same setup as [10].
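The alphabetical-order splits above can be sketched as follows. This is a hypothetical helper for illustration, not benchmark code; it reproduces the split rule used for Office-Home and DomainNet (for Office31, the common set is instead the 10 classes shared with Caltech-256 rather than the alphabetically first ones):

```python
def make_splits(classes, n_common, n_source_private):
    # Sort class names alphabetically, then take the first block as the common
    # label set, the next block as source-private, and the rest as target-private.
    classes = sorted(classes)
    common = classes[:n_common]
    source_private = classes[n_common:n_common + n_source_private]
    target_private = classes[n_common + n_source_private:]
    source_labels = common + source_private   # label space of the black-box model
    target_labels = common + target_private   # unknown at adaptation time
    return source_labels, target_labels

# Office-Home: 65 classes -> 10 common, 5 source-private, 50 target-private.
src, tgt = make_splits([f"class_{i:02d}" for i in range(65)], 10, 5)
```

Note that only the intersection (the common classes) overlaps between the two label spaces, which is exactly the partial-overlap situation UBDA must handle.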

Evaluation protocols. As pointed out in [10], previous works [57, 42] treated all out-class categories as a single extra class and used Average Class Accuracy (AA) for evaluation, which cannot truly reflect an algorithm's ability for out-class detection as it is badly biased toward the accuracy of the in-class categories. In our experiments, we follow the evaluation protocol of [10] and use the H-score as the main evaluation metric. The H-score is defined as

H-score = 2 · a_in · a_out / (a_in + a_out),

where a_in and a_out represent the instance accuracy on in-class and out-class categories, respectively. This score is high only when both tasks of in-class discrimination and out-class detection are well performed. Besides, we also report AA for more comprehensive comparisons to previous methods.
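The H-score is simply the harmonic mean of the two accuracies; a minimal helper:

```python
def h_score(acc_in, acc_out):
    # Harmonic mean of in-class accuracy and out-class accuracy; high only
    # when both in-class discrimination and out-class detection do well.
    if acc_in + acc_out == 0:
        return 0.0
    return 2 * acc_in * acc_out / (acc_in + acc_out)

print(h_score(0.8, 0.6))  # ≈ 0.6857
print(h_score(1.0, 0.0))  # 0.0: perfect in-class accuracy alone scores zero
```

The second call illustrates why the metric is harder to game than AA: a model that never predicts "unknown" gets an H-score of zero no matter how accurate it is on in-class samples.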

Compared methods. Source-only (SO) is the model trained only on the source data without using target data. However, SO cannot perform out-class detection. We modify it into SO++ by using the same inference rule as ours in (3), which builds a baseline for universal domain adaptation. Besides SO++, we mainly compare our method to previous universal domain adaptation approaches, including UAN [57], DANCE [42], CMU [10], and USFDA [34]. Note that these methods all need access to source data or the source model for adaptation and do not work in our black-box setting.


All experiments are implemented in PyTorch with the Stochastic Gradient Descent (SGD) optimizer. For fair comparison to previous methods, we use the same ResNet50 backbone pre-trained on ImageNet to obtain the source black-box model, which is fine-tuned on source examples with the cross-entropy loss; we then treat it as a black box by only accessing the input-output interface of this model in our experiments. For the target model, we use the same pre-trained ResNet50 backbone as the compared methods, but we also analyse the performance of our method with different backbones. We fix the hyperparameter ρ to 0.5 in all our experiments. For the number of prototypes M and the weight parameter λ, we set them to 100 and 1.0 for the small-scale datasets Office31 and Office-Home, and to 1000 and 0.05 for the large-scale DomainNet. The sensitivity of these hyperparameters is analysed below. The initial value of α is set to 1 and is decayed during training by a factor depending on the number of epochs. We run each experiment for 100 epochs and report the average result over three random runs. The code and the learning settings of our experiments are available in the supplementary material.

Figure 1: Convergence analysis in D2A task. (Best viewed in color)
(a) Source Model
(b) Ours
Figure 2: t-SNE feature visualization of target representations in D2A task. Red dots represent in-class target examples while black dots represent out-class examples. (Best viewed in color)
Losses A2W AA A2W H-score W2A AA W2A H-score
L_dis 76.9 67.5 83.3 76.6
L_dis + L_st 80.6 77.6 88.6 88.0
L_dis + L_st + L_nc 83.0 78.2 91.0 89.4
SO++ 75.0 50.3 83.6 70.8
Table 4: Ablation study on the A2W (left two columns) and W2A (right two columns) tasks. L_dis, L_st, and L_nc denote the loss functions defined in equations (4), (5), and (9), respectively. These three functions constitute our final optimization objective (10).

6.2 Results

Tables 1, 2, and 3 show the classification results on the Office31, Office-Home, and DomainNet datasets. We can see from the tables that the H-scores of the UAN and DANCE methods are even worse than the baseline SO++ on some tasks, even though their AA results are good. The reason is that these two methods perform badly on out-class detection due to an overemphasis on alignment between the source and target domains. In contrast, our method, learned in a black-box setting, significantly outperforms previous non-black-box universal domain adaptation methods with respect to the H-score and improves over the SO++ baseline by a large margin on all tasks. Interestingly, in terms of AA, although our method is inferior to DANCE, it still outperforms other non-black-box methods such as UAN [57], USFDA [34], and CMU [10] on the Office31 and Office-Home datasets, which demonstrates clear advantages of our black-box method.

6.3 Analysis

Ablation study. Here, we conduct ablation studies to demonstrate the effectiveness of each loss component in (10). The results are illustrated in Table 4. Firstly, compared to the baseline SO++, whose results are taken directly from the output of the black-box source model, the target model trained with only the distillation loss shows a large improvement in H-score while remaining almost the same in AA. This demonstrates that distillation from source to target can prevent over-confident predictions, leading to better out-class detection. After gradually adding the self-training loss and the neighborhood consistency regularization to the optimization, we see further improvement in both AA and H-score, verifying the effectiveness of the self-training module for addressing our learning task and the usefulness of the neighborhood consistency constraint for improving self-training, which is consistent with our theoretical analysis.

Convergence analysis. Since our algorithm is trained in an end-to-end manner with SGD optimization, we show its convergence behavior with respect to the classification results of H-score and AA on the D2A task, as illustrated in Figure 1. It can be seen that our algorithm converges very fast and shows an almost monotonically upward trend in both H-score and AA, demonstrating the stability of our proposed method.

Feature visualization. We visualize in Figures 2(a) and 2(b) the output representations extracted by the source model as well as by ours on the D2A task via t-SNE [33]. Compared to the source model, our method learns more discriminative representations for the target in-class examples and shows a clear separation between the in-class and out-class examples.

Analysis on different category relationships between source and target domains. Figure 3 shows the classification results on the D2A task under different degrees of category overlap between the source and target domains. Here, the current state-of-the-art method DANCE [42] and the baseline SO++ are compared with ours. Denote by C the common label set between the source and target domains, by C_s the source private label set, and by C_t the target private label set. Firstly, we set C, with |C| = 10, to be the classes shared between the Caltech256 and Office31 datasets and vary |C_s| to construct different source and target domains (|C_t| therefore changes correspondingly, since |C| + |C_s| + |C_t| = 31); in this case, we assign the remaining 21 classes to C_s and C_t in alphabetical order. The results of AA and H-score under this setting are illustrated in Figures 3(a) and 3(b), respectively. Second, we vary |C| while fixing the sizes of C_s and C_t to construct different source and target domains, with results correspondingly shown in Figures 3(c) and 3(d); here we select the class sets C, C_s, and C_t in alphabetical order. From these figures, we can see that our method consistently outperforms the baseline SO++ across all scenarios and improves over the state-of-the-art DANCE by a large margin in most tasks with respect to the H-score. Although our method does not show much advantage over DANCE in terms of AA, it is still significantly better than the baseline SO++ overall.

(a) AA w.r.t. |C_s|
(b) H-score w.r.t. |C_s|
(c) AA w.r.t. |C|
(d) H-score w.r.t. |C|
Figure 3: Average Class Accuracy and H-score with respect to the number of source private classes |C_s| and the number of common classes |C|. In (a) and (b), we fix |C| and vary |C_s|; in (c) and (d), we fix |C_s| and |C_t| and vary |C|. Note that the strategy of selecting the sets C, C_s, and C_t differs between the two settings: in the first setting, the fixed C is the class set shared between Caltech256 and Office31 and the remaining two sets are selected in alphabetical order, while in the second setting, we select all of them in alphabetical order. (Best viewed in color)

Parameter sensitivity analysis. To analyse how the hyperparameters ρ, λ, and M affect our experimental results, we run each experiment three times on the D2A task across a wide range of values of each hyperparameter, while fixing the others at their defaults, and report the average results and standard deviations in Figure 4. From the results in Figures 4(a), 4(b), and 4(c), it is clearly seen that our method is considerably insensitive to these hyperparameters and consistently outperforms the baseline SO++ across a wide range of values, demonstrating the robustness of our method to these hyperparameters.

Figure 4: (a)-(c): Parameter sensitivity analysis on the D2A task, where ρ is the threshold in (6), λ is the weight parameter in (10), and M is the number of prototypes defined in Section 4.2. The default values of these hyperparameters are set to ρ = 0.5, λ = 1.0, and M = 100 for this task. (a), (b), and (c) respectively show the effect of changing one of the parameters on the results. Note that we run each experiment three times with random seeds and report the average results and standard deviations in the figure, including for SO++. (Best viewed in color)
Network Avg. of AA Avg. of H-score
ResNet18 91.3±0.2 87.0±0.3
ResNet34 91.1±0.3 86.1±0.5
ResNet50 91.7±0.1 86.8±0.2
Table 5: Classification results on the Office31 dataset (averaged over 6 tasks) when using different network backbones for the target model, where the standard deviations are calculated over three random runs.

Analysis on different target networks. To analyse the effects of using different network backbones as the target model, we experiment on the Office31 dataset with two other networks, ResNet18 and ResNet34, as shown in Table 5, where the average result and standard deviation across three runs are reported. We observe that our method is robust to different network architectures for the target model, which is very beneficial in practical applications since we can choose a small-scale network for the target model without losing accuracy. The small standard deviations over three runs further demonstrate the robustness of our method.

7 Discussion

Our learning setting of UBDA is motivated by real applications, owing to privacy concerns and the widely accessible open AI interfaces. Our approach to learning a target model relies only on the output information from the black-box interface. As the experiments show, by accessing only the interface of the source model, we can already obtain results comparable or even superior to previous, more complex algorithms. We believe this provides a potential guideline for the universal domain adaptation community by rethinking whether source data are indeed needed when transferring to a new domain. Our findings suggest that a well-initialized prediction model may play a more important role than source data; hence, when tackling a target task of interest, we should first consider getting help from an auxiliary prediction model instead of collecting expensive annotated data.

There are still limitations to our learning setting and algorithm. Firstly, we assume that only a single interface of a source model is provided for learning, while multi-source universal black-box domain adaptation may be more practical in real applications. Furthermore, the proposed learning paradigm assumes that the output of the interface includes the scores of all source categories, which may not hold when an interface only outputs the top-K predicted categories. Such problems are more complex, and we leave them to future work.

8 Conclusion

This work studies an arguably least restrictive setting of UB^2DA, where only the interface of the source model is required and the target label space is allowed to vary and be unknown. To address this task, we decompose the learning objective into two subtasks of in-class discrimination and out-class detection, and propose to unify them in a self-training framework, regularized by consistency of predictions in local neighborhoods of target samples. The proposed framework is simple, robust, and easy to optimize. We extensively evaluate our method on three domain adaptation benchmarks. In most experiments, our method, learning in a black-box setting, outperforms state-of-the-art non-black-box methods by a large margin on the metric of H-score, and shows results comparable to theirs on the metric of averaged class accuracy.
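As a rough illustration of the decomposition above, the three ingredients (black-box distillation for in-class discrimination, entropy separation for out-class detection, and neighborhood consistency) could be combined as in the following sketch. The specific loss forms, the entropy threshold `rho`, and the weight `lam` are our assumptions for illustration; they do not reproduce the paper's Eqs. (6) and (10).

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def entropy(p, axis=-1):
    return -(p * np.log(p + 1e-12)).sum(axis=axis)

def ub2da_loss(logits, p_bb, logits_nbr, rho=0.5, lam=0.1):
    """Hedged sketch of a unified self-training objective.

    logits     -- target model outputs on a batch of target samples
    p_bb       -- soft predictions returned by the black-box source interface
    logits_nbr -- target model outputs on neighboring target samples
    """
    p = softmax(logits)
    # (i) In-class discrimination: distill the black-box soft labels
    # into the target model via cross-entropy.
    distill = -(p_bb * np.log(p + 1e-12)).sum(-1).mean()
    # (ii) Out-class detection: entropy separation -- push prediction
    # entropy down for confident (likely in-class) samples and up for
    # uncertain (likely out-class) samples, split at threshold rho.
    h = entropy(p)
    sep = np.where(h < rho, h, -h).mean()
    # (iii) Regularization: encourage consistent predictions within
    # local neighborhoods of target samples.
    consist = ((p - softmax(logits_nbr)) ** 2).sum(-1).mean()
    return distill + sep + lam * consist

# Toy usage with random logits standing in for real model outputs.
rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 5))
p_bb = softmax(rng.normal(size=(8, 5)))
logits_nbr = logits + 0.1 * rng.normal(size=(8, 5))
loss = ub2da_loss(logits, p_bb, logits_nbr)
print(float(loss))
```

In practice each term would be minimized by gradient descent on the target network's parameters, with only forward queries sent to the black-box interface.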


  • [1] D. Angluin and P. Laird (1988) Learning from noisy examples. Machine Learning 2 (4), pp. 343–370. Cited by: §2.2.
  • [2] S. Azadi, J. Feng, S. Jegelka, and T. Darrell (2016) Auxiliary image regularization for deep cnns with noisy labels. In ICLR, Cited by: §2.2.
  • [3] P. P. Busto and J. Gall (2017) Open set domain adaptation. In ICCV, pp. 754–763. Cited by: §1, §2.1.
  • [4] Z. Cao, M. Long, J. Wang, and M. I. Jordan (2018) Partial transfer learning with selective adversarial networks. In CVPR, pp. 2724–2732. Cited by: §1, §2.1.
  • [5] Z. Cao, L. Ma, M. Long, and J. Wang (2018) Partial adversarial domain adaptation. In ECCV, Vol. 11212, pp. 139–155. Cited by: §2.1.
  • [6] Z. Cao, K. You, M. Long, J. Wang, and Q. Yang (2019) Learning to transfer examples for partial domain adaptation. In CVPR, pp. 2985–2994. Cited by: §2.1.
  • [7] O. Chapelle and A. Zien (2005) Semi-supervised classification by low density separation.. In AISTATS, Vol. 2005, pp. 57–64. Cited by: §4.2.
  • [8] M. Chen, H. Xue, and D. Cai (2019) Domain adaptation for semantic segmentation with maximum squares loss. In ICCV, Cited by: §1.
  • [9] B. Chidlovskii, S. Clinchant, and G. Csurka (2016) Domain adaptation in the absence of source domain data. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 451–460. Cited by: §1, §1.
  • [10] B. Fu, Z. Cao, M. Long, and J. Wang (2020) Learning to detect open classes for universal domain adaptation. In ECCV, Cited by: §1, §2.1, §6.1, §6.1, §6.1, §6.2, Table 1, Table 2, Table 3.
  • [11] Y. Fu, Y. Wei, G. Wang, Y. Zhou, H. Shi, U. Uiuc, and T. Huang (2019) Self-similarity grouping: a simple unsupervised cross domain adaptation approach for person re-identification. In ICCV, Cited by: §1.
  • [12] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. S. Lempitsky (2016) Domain-adversarial training of neural networks. J. Mach. Learn. Res. 17, pp. 59:1–59:35. Cited by: §2.1.
  • [13] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17 (1), pp. 2096–2030. Cited by: §1.
  • [14] Y. Ge, D. Chen, and H. Li (2020) Mutual mean-teaching: pseudo label refinery for unsupervised domain adaptation on person re-identification. In ICLR, Cited by: §1.
  • [15] K. Ghasedi Dizaji, A. Herandi, C. Deng, W. Cai, and H. Huang (2017) Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization. In ICCV, Cited by: §4.2.
  • [16] J. Goldberger and E. Ben-Reuven (2016) Training deep neural-networks using a noise adaptation layer. In ICLR, Cited by: §2.2.
  • [17] B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, and M. Sugiyama (2018) Co-teaching: robust training of deep neural networks with extremely noisy labels. In NeurIPS, Cited by: §2.2.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §6.1.
  • [19] D. Hendrycks, M. Mazeika, D. Wilson, and K. Gimpel (2018) Using trusted data to train deep networks on labels corrupted by severe noise. In NeurIPS, Cited by: §2.2.
  • [20] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv:1503.02531. Cited by: §4.1, §4.
  • [21] L. Jiang, Z. Zhou, T. Leung, L. Li, and L. Fei-Fei (2018) Mentornet: learning data-driven curriculum for very deep neural networks on corrupted labels. In ICML, pp. 2304–2313. Cited by: §2.2.
  • [22] S. Kim, J. Choi, T. Kim, and C. Kim (2019) Self-training and adversarial background regularization for unsupervised domain adaptive one-stage object detection. In ICCV, Cited by: §1.
  • [23] Q. Lao, X. Jiang, and M. Havaei (2021) Hypothesis disparity regularized mutual information maximization. In AAAI, Cited by: §1.
  • [24] C. Lee, T. Batra, M. H. Baig, and D. Ulbricht (2019) Sliced wasserstein discrepancy for unsupervised domain adaptation. In CVPR, pp. 10285–10295. Cited by: §2.1.
  • [25] D. Lee (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In ICML workshop, Cited by: §4.1.
  • [26] J. Liang, D. Hu, and J. Feng (2020) Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation. In ICML, pp. 6028–6039. Cited by: §1.
  • [27] T. Liu and D. Tao (2015) Classification with noisy labels by importance reweighting. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (3), pp. 447–461. Cited by: §2.2.
  • [28] M. Long, Y. Cao, J. Wang, and M. I. Jordan (2015) Learning transferable features with deep adaptation networks. In ICML, Vol. 37, pp. 97–105. Cited by: §1, §2.1.
  • [29] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu (2013) Transfer feature learning with joint distribution adaptation. In ICCV, pp. 2200–2207. Cited by: §2.1.
  • [30] M. Long, H. Zhu, J. Wang, and M. I. Jordan (2016) Unsupervised domain adaptation with residual transfer networks. In NeurIPS, pp. 136–144. Cited by: §1.
  • [31] Y. Lyu and I. W. Tsang (2019) Curriculum loss: robust learning and generalization against label corruption. In ICLR, Cited by: §2.2.
  • [32] Y. Ma, H. Derksen, W. Hong, and J. Wright (2007) Segmentation of multivariate mixed data via lossy data coding and compression. IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (9), pp. 1546–1562. Cited by: §4.2.
  • [33] L. van der Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605. Cited by: §6.3.
  • [34] J. Nath Kundu, N. Venkat, M. V. Rahul, and R. Venkatesh Babu (2020) Universal source-free domain adaptation. In CVPR, Cited by: §1, §2.1, §6.1, §6.2, Table 1, Table 2.
  • [35] A. R. Nelakurthi, R. Maciejewski, and J. He (2018) Source free domain adaptation using an off-the-shelf classifier. In Big Data, pp. 140–145. Cited by: §2.1.
  • [36] D. T. Nguyen, C. K. Mummadi, T. P. N. Ngo, T. H. P. Nguyen, L. Beggel, and T. Brox (2019) Self: learning to filter noisy labels with self-ensembling. In ICLR, Cited by: §2.2.
  • [37] F. Pan, I. Shin, F. Rameau, S. Lee, and I. S. Kweon (2020) Unsupervised intra-domain adaptation for semantic segmentation through self-supervision. In CVPR, Cited by: §1.
  • [38] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §6.1.
  • [39] G. Patrini, A. Rozza, A. Krishna Menon, R. Nock, and L. Qu (2017) Making deep neural networks robust to label noise: a loss correction approach. In CVPR, pp. 1944–1952. Cited by: §2.2.
  • [40] X. Peng, Q. Bai, X. Xia, Z. Huang, K. Saenko, and B. Wang (2019) Moment matching for multi-source domain adaptation. In ICCV, Cited by: §6.1.
  • [41] K. Saenko, B. Kulis, M. Fritz, and T. Darrell (2010) Adapting visual category models to new domains. In ECCV, Cited by: §6.1.
  • [42] K. Saito, D. Kim, S. Sclaroff, and K. Saenko (2020) Universal domain adaptation through self-supervision. In NeurIPS, Cited by: §1, §2.1, §4.1, §4, §6.1, §6.1, §6.3, Table 1, Table 2, Table 3.
  • [43] K. Saito, K. Watanabe, Y. Ushiku, and T. Harada (2018) Maximum classifier discrepancy for unsupervised domain adaptation. In CVPR, pp. 3723–3732. Cited by: §2.1.
  • [44] K. Saito, S. Yamamoto, Y. Ushiku, and T. Harada (2018) Open set domain adaptation by backpropagation. In ECCV, pp. 153–168. Cited by: §1, §2.1.
  • [45] J. Shen, Y. Qu, W. Zhang, and Y. Yu (2018) Wasserstein distance guided representation learning for domain adaptation. In AAAI, pp. 4058–4065. Cited by: §2.1.
  • [46] R. Shu, H. Bui, H. Narui, and S. Ermon (2018) A DIRT-t approach to unsupervised domain adaptation. In ICLR, Cited by: §1.
  • [47] H. Tang, K. Chen, and K. Jia (2020) Unsupervised domain adaptation via structurally regularized deep clustering. In CVPR, Cited by: §1, §2.1.
  • [48] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell (2017) Adversarial discriminative domain adaptation. In CVPR, pp. 2962–2971. Cited by: §2.1.
  • [49] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell (2014) Deep domain confusion: maximizing for domain invariance. CoRR abs/1412.3474. Cited by: §2.1.
  • [50] B. Van Rooyen, A. K. Menon, and R. C. Williamson (2015) Learning with symmetric label noise: the importance of being unhinged. In NeurIPS, Cited by: §2.2.
  • [51] H. Venkateswara, J. Eusebio, S. Chakraborty, and S. Panchanathan (2017) Deep hashing network for unsupervised domain adaptation. In CVPR, Cited by: §6.1.
  • [52] Y. Wang, W. Liu, X. Ma, J. Bailey, H. Zha, L. Song, and S. Xia (2018) Iterative learning with open-set noisy labels. In CVPR, pp. 8688–8696. Cited by: §2.2.
  • [53] Y. Wang, A. Kucukelbir, and D. M. Blei (2017) Robust probabilistic modeling with bayesian data reweighting. In ICML, pp. 3646–3655. Cited by: §2.2.
  • [54] C. Wei, K. Shen, Y. Chen, and T. Ma (2021) Theoretical analysis of self-training with deep networks on unlabeled data. In ICLR, Cited by: §1, §4, §5, §5, Theorem 1.
  • [55] J. Xie, R. Girshick, and A. Farhadi (2016) Unsupervised deep embedding for clustering analysis. In ICML, Cited by: §4.2.
  • [56] M. Xu, H. Wang, B. Ni, Q. Tian, and W. Zhang (2020) Cross-domain detection via graph-induced prototype alignment. In CVPR, Cited by: §1.
  • [57] K. You, M. Long, Z. Cao, J. Wang, and M. I. Jordan (2019) Universal domain adaptation. In CVPR, pp. 2720–2729. Cited by: §1, §2.1, §4, §6.1, §6.1, §6.2, Table 1, Table 2, Table 3.
  • [58] X. Yu, B. Han, J. Yao, G. Niu, I. Tsang, and M. Sugiyama (2019) How does disagreement help generalization against label corruption?. In ICML, pp. 7164–7173. Cited by: §2.2.
  • [59] Y. Yu, K. H. R. Chan, C. You, C. Song, and Y. Ma (2020) Learning diverse and discriminative representations via the principle of maximal coding rate reduction. In NeurIPS, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Cited by: §4.2.
  • [60] H. Zhang, Y. Zhang, K. Jia, and L. Zhang (2021) Unsupervised domain adaptation of black-box source models. arXiv:2101.02839. Cited by: §1, §1, §2.1.
  • [61] J. Zhang, Z. Ding, W. Li, and P. Ogunbona (2018) Importance weighted adversarial nets for partial domain adaptation. In CVPR, pp. 8156–8164. Cited by: §1, §2.1.
  • [62] L. Zhang and X. Gao (2020) Transfer adaptation learning: a decade survey. arXiv:1903.04687. Cited by: §1.
  • [63] Y. Zhang, B. Deng, K. Jia, and L. Zhang (2020) Label propagation with augmented anchors: a simple semi-supervised learning baseline for unsupervised domain adaptation. In ECCV, pp. 781–797. Cited by: §1.
  • [64] Y. Zhang, B. Deng, H. Tang, L. Zhang, and K. Jia (2020) Unsupervised multi-class domain adaptation: theory, algorithms, and practice. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1. External Links: Document Cited by: §1.