Source-Free Progressive Graph Learning for Open-Set Domain Adaptation

by   Yadan Luo, et al.
The University of Queensland

Open-set domain adaptation (OSDA) has gained considerable attention in many visual recognition tasks. However, most existing OSDA approaches are limited for three main reasons: (1) the lack of essential theoretical analysis of the generalization bound, (2) the reliance on the coexistence of source and target data during adaptation, and (3) the failure to accurately estimate the uncertainty of model predictions. We propose a Progressive Graph Learning (PGL) framework that decomposes the target hypothesis space into the shared and unknown subspaces, and then progressively pseudo-labels the most confident known samples from the target domain for hypothesis adaptation. Moreover, we tackle a more realistic source-free open-set domain adaptation (SF-OSDA) setting that makes no assumption about the coexistence of source and target domains, and introduce a balanced pseudo-labeling (BP-L) strategy in a two-stage framework, namely SF-PGL. Unlike PGL, which applies a class-agnostic constant threshold to all target samples for pseudo-labeling, the SF-PGL model uniformly selects the most confident target instances from each category at a fixed ratio. The confidence thresholds in each class are regarded as the 'uncertainty' of learning the semantic information, and are then used to weight the classification loss in the adaptation step. We conducted unsupervised and semi-supervised OSDA and SF-OSDA experiments on benchmark image classification and action recognition datasets. Additionally, we find that balanced pseudo-labeling plays a significant role in improving calibration, which makes the trained model less prone to over-confident or under-confident predictions on the target data. Source code is available at



1 Introduction

While deep learning has made remarkable advances across a wide variety of machine-learning tasks and applications such as image and video recognition, it commonly comes at the great cost of curating large-scale training data annotations. To relieve the burden of expensive data labeling, transfer learning has been introduced to extract knowledge from the existing annotated training data (i.e., source domain) and convey it to the unlabeled or partially labeled test data (i.e., target domain). However, the source and target domains are generally constructed under varying conditions such as illuminations, camera poses, and backgrounds, which is referred to as domain shift. For instance, the Gameplay-Kinetics [DBLP:conf/iccv/ChenKAYCZ19] dataset for action recognition is built under the challenging “Synthetic-to-Real” protocol, where the training videos are synthesized by game engines and the test samples are collected from real scenes. In this case, the domain shift between the source and target domains inevitably leads to severe degradation of the model generalization performance.

To mitigate the aforementioned domain gap, unsupervised domain adaptation (UDA) techniques have been proposed to align source and target distributions through statistical matching [dip, manifold, distribution, DDC, JDA, DAN, DANN, ADDA, jingjing] or adversarial learning [DANN, ADDA, JAN, DRCN, DBLP:conf/mm/LuoHW0B20, DBLP:conf/mm/WangLHB20], which provide rigorous error bounds on the target data [david1, discrepancy, bridgingtheory]. Although UDA methods have been advanced and applied to many tasks such as object detection, semantic segmentation, and action recognition, the evaluation protocols were restricted to a scenario where the target domain shares an identical set of classes with the source domain. Such a scenario is typically referred to as a closed-set setting, which can hardly be guaranteed in real-world applications, where the test samples may come from unknown classes that are not seen during training.

In light of the above discussion, a more realistic open-set domain adaptation (OSDA) setting [OSBP] has been introduced, which allows the target data to contain an additional “unknown” category, covering all classes that are not present in the source domain. The key challenge of OSDA is to safely transfer knowledge across the domains while recognizing the unknown classes accurately. To tackle this, different strategies such as confidence manipulation [OSBP, Attract-UTS], subspace reconstruction [mahsa], instance weighting [STA], and extreme value theory [DBLP:conf/aaai/Jing0ZDLY21] have been studied. Nevertheless, there are three non-negligible obstacles that prevent existing OSDA methods from being applied successfully in real-world scenarios:

  • The lack of theoretical analysis of generalization bound for OSDA methods: According to [david1, discrepancy], the target error is bounded by four factors including the source risk, discrepancy across the domains, the shared error coming from the conditional shift [conditionalshift], and the open-set risk. Among all, open-set risk contributes the most to the error bound, specifically when a large percentage of data belongs to the unknown class. However, designing an effective strategy to minimize the open-set risk remains an open problem.

  • The reliance on co-existing source and target data during adaptation: Deploying the existing OSDA approaches on portable and mobile devices is infeasible, as they require loading and processing large-scale source data. Source videos in the Gameplay [DBLP:conf/iccv/ChenKAYCZ19] dataset may consume hundreds of gigabytes of storage. In addition, the assumption of data accessibility is likely to trigger concerns for data sharing and digital privacy, especially in the medical and biometrics communities.

  • The failure to estimate the predictive uncertainty of models: Solely focusing on improving target accuracy at inference time can lead mainstream OSDA methods to produce over-confident predictions. This issue, typically referred to as miscalibration, can be a major problem in decision-critical scenarios.

In this paper, we therefore propose a generic Progressive Graph Learning (PGL) framework and its source-free variant (SF-PGL) for OSDA. We theoretically analyze the generalization and calibration properties of the proposed frameworks. The PGL method follows a common open-set domain adaptation setting, where the source and target data are available during training. The deep model consists of a feature extraction module, a graph neural network, and a classifier (hypothesis). To minimize partial risks and achieve a tighter error bound for open-set adaptation, the proposed PGL integrates four different strategies: (1) To suppress the source risk, we decompose the original hypothesis space into two subspaces, a shared subspace and an unknown subspace, where the former includes classifiers for the shared classes of the source and target domains and the latter is specific to classifying unknowns in the target domain. With a restricted size of the unknown subspace, the possibility of misclassifying source data as unknowns is reduced. (2) To control the open-set risk, the progressive learning paradigm [curriculum] is adopted, where the target samples with low classification confidence are gradually rejected from the target domain and inserted as the pseudo-labeled unknown set in the source domain. This mechanism suppresses the potential negative transfer where the private representations across domains are falsely aligned. (3) To address conditional shift [conditionalshift] at both the sample and manifold levels, we design an episodic training scheme and align conditional distributions across domains by gradually replacing the source data with the pseudo-labeled known data in each episode. We learn class-specific representations by aggregating the source and target features and passing episodes through deep graph neural networks. (4) An adversarial domain discriminator is incorporated, which closes the gap between the source and target marginal distributions for the known categories.

While effective, the PGL model relies on the adversarial learning mechanism to align cross-domain distributions, thereby failing to handle source-free scenarios with no access to the source data. To overcome this limitation, we further put forward a balanced pseudo-labeling (BP-L) strategy and form a complete two-stage framework, namely, SF-PGL. More specifically, we pass all target data through the frozen feature extraction module and the hypothesis, both pre-trained on the source data, to obtain the respective predictions. Then we sort the target predictions by confidence and evenly select a fixed-size group of high-confidence samples from each category to initialize the pseudo-labeled known set. The confidence threshold in each class is recorded, which measures the ‘uncertainty’ of learning the class-specific information. In the second stage, the base model along with the graph neural network is iteratively trained with the pseudo-labeled and unlabeled target samples until all target samples are labeled. The predictions produced by the trained model are uncertainty-aware, because the classification loss for each instance is weighted by the class uncertainty.

A preliminary version of this work was presented in [pgl]. In this work, we additionally (1) introduce a novel SF-PGL variant tailored for the source-free setting and provide a thorough experimental evaluation of its calibration capacity, where the proposed SF-PGL model achieves the lowest expected calibration error (ECE) compared to all conventional open-set and source-free open-set domain adaptation methods, and (2) apply the proposed PGL approach to four action recognition datasets, i.e., UCF-HMDB_small, UCF-Olympic, UCF-HMDB_full and Kinetics-Gameplay, which are manually formed for the tasks of OSVDA and S-OSVDA. Our approaches achieve state-of-the-art results in all settings, both for unsupervised and source-free open-set domain adaptation.

2 Related Work

2.1 Open-set Domain Adaptation

Different from canonical closed-set domain adaptation, open-set domain adaptation (OSDA) [ATI, OSBP, DBLP:journals/pami/BustoIG20] addresses the interference from the unshared categories in the target domain when adapting the learned models. To avoid the potential risk of negative transfer [DBLP:journals/tkde/PanY10] brought by the unshared categories, it is important for OSDA methods to accurately determine the irrelevant target samples as the ‘unknown’ class while aligning the shared classes. The most intuitive way is leveraging OSVM [OSVM], which uses a class-wise confidence threshold to classify target instances into the shared classes, or reject them as unknown. Busto and Gall [ATI] introduced the ATI-λ method, which assigns the target data a pseudo class label or an unknown label based on its distance to each source cluster. Saito et al. [OSBP] derived an objective in the adversarial Open Set Back-Propagation (OSBP) framework, which balances the classifier’s confidence on the known and unknown class with a threshold. Baktashmotlagh et al. [mahsa] proposed to learn factorized representations of the source and target data, so that unknown points can be identified by examining reconstructions from domain-specific subspaces. Liu et al. [STA] and Feng et al. [Attract-UTS] aimed to push the unknown class away from the decision boundary by a multi-binary classifier or semantic contrastive mapping. Luo et al. [pgl] followed the progressive learning paradigm, which globally ranks all target samples and gradually isolates the ones of lower confidence as unknown samples. Bucci et al. [DBLP:conf/eccv/BucciLT20] built the framework upon the recent success of self-supervised learning, which separates the unknowns with the confidence of predicting rotations and semantics. Jing et al. [DBLP:conf/aaai/Jing0ZDLY21] leveraged a Distance-Rectified Weibull model for rejecting known samples based on the angular distance. Chen et al. [DBLP:conf/mmasia/ChenLB21] applied a class-conditional extreme value theory for open-set video domain adaptation. Of late, Universal Domain Adaptation (UDA) [DBLP:conf/cvpr/YouLCWJ19] has been proposed, which further handles the case when the source domain holds private classes. While effective, all these methods assume the target user’s access to the source domain, which could be infeasible due to privacy and security issues.

2.2 Source-free Domain Adaptation

Source-free domain adaptation studies how to transfer knowledge when the source data is absent. Kundu et al. [DBLP:conf/cvpr/KunduVRVB20, DBLP:conf/cvpr/KunduVVB20] proposed to generate out-of-domain (OOD) samples by applying the feature splicing technique, which enhances the model generalization capacity to the unseen samples in the target domain. Liang et al. [DBLP:conf/icml/LiangHF20] aimed to force the target representations to resemble the source features by information maximization while augmenting target features with self-supervised pseudo-labeling. Li et al. proposed a 3C-GAN to augment the target sets for training, where the weight regularization and clustering based regularization are adopted for preventing classifier drifting and smoothing distribution. In the same vein, Kurmi et al. [DBLP:conf/wacv/KurmiSN21] leveraged generated data points as proxy samples for adaptation, where the generative model is modeled as an energy-based function.

3 Preliminaries

In this section, we introduce the notations, problem settings, definitions, and theoretical analysis for the tasks of OSDA and SF-OSDA.

3.1 Definitions and Problem Settings

Definition 1.

Closed-set Unsupervised Domain Adaptation (UDA). Consider the joint probability distribution of the source domain and the marginal distribution of the target domain. The corresponding label spaces for both domains are equal, i.e., both are {1, …, C}, where C is the number of classes. Given the labeled source data and the unlabeled target data, the aim of UDA is to learn a feature transformation and an optimal classifier from the hypothesis space of classifiers, such that the learnt model can correctly classify the target samples. The source and target datasets each have their own size.

Definition 2.

Open-set Domain Adaptation (OSDA) [OSBP]. Different from the UDA setup, OSDA allows the target label space to include an additional unknown class, which is not present in the source label space. Given independent and identically distributed (i.i.d.) samples drawn from the source domain and the target domain, the goal of OSDA is to train a model that can classify the samples from the known classes and identify the samples coming from the additional unknown class.

Definition 3.

Source-free Open-set Domain Adaptation (SF-OSDA). SF-OSDA aims to adapt the model to the target domain without having access to the source data. Given the model pre-trained on the source set and the unlabeled target set, the goal of SF-OSDA is to adapt the model to the target set, such that the adapted model is able to correctly classify target samples into the shared classes and the unknown class.

3.2 Risks and Partial Risks

Risks and partial risks are fundamental notions in the learning-theoretic bounds for OSDA and SF-OSDA. The source risk and target risk of a classifier are defined with respect to the source joint distribution and the target joint distribution, respectively, and involve the class-prior probabilities of the source and target distributions. The bounded loss function satisfies symmetry and the triangle inequality. In particular, the target risk can be split into two partial risks, indicating the risks for the known target classes and for the unknown class.

To derive the generalization bound for open-set domain adaptation, we first define a discrepancy measure between the source and target domains:

Definition 4.

Discrepancy Distance [discrepancy]. For any pair of hypotheses in the hypothesis space, the discrepancy between the distributions of the source and target domains can be formulated as:


Notably, the discrepancy distance is symmetric and satisfies the triangle inequality.
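For reference, the discrepancy distance is commonly written as below. This is a sketch of the standard form and assumes that [discrepancy] refers to that usual definition; the symbols P and Q (source and target marginal distributions), L (the bounded loss) and the hypothesis space H are notation chosen here, not taken from the text above.

```latex
\operatorname{disc}_{L}(P, Q) \;=\; \sup_{h,\,h' \in \mathcal{H}}
\Big|\, \mathbb{E}_{x \sim P}\, L\big(h(x), h'(x)\big)
      \;-\; \mathbb{E}_{x \sim Q}\, L\big(h(x), h'(x)\big) \Big|
```

Swapping P and Q leaves the absolute value unchanged, which is why the distance is symmetric.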

Given the definition of the discrepancy distance, generalization bounds for open-set domain adaptation can be derived as:

Theorem 3.1.

OSDA Generalization Bounds [open_theory]. Given the hypothesis space, under a mild condition on a constant vector-valued function, the expected error on target samples is bounded by four terms: the source risk, the discrepancy distance between the domains, the shared error, and the open-set risk.


To compute the error upper bound for closed-set unsupervised domain adaptation, Theorem 3.1 can be reduced to the closed-set case, in which no unknown class is present.
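As a point of comparison, the familiar closed-set bound from the domain adaptation literature has the shape sketched below. The notation is assumed here (R_s and R_t for source and target risks, disc_L for the discrepancy distance between the marginals P and Q, and λ for the shared error of the joint optimal hypothesis), and the exact form used in [open_theory] may differ in its constants or conditions.

```latex
R_t(h) \;\le\; R_s(h) \;+\; \operatorname{disc}_{L}(P, Q) \;+\; \lambda ,
\qquad
\lambda \;=\; \min_{h^{*} \in \mathcal{H}} \big( R_s(h^{*}) + R_t(h^{*}) \big)
```

The open-set bound adds a fourth term, the open-set risk, on top of these three.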

According to Eq. (5), the target error is bounded by four terms, which opens four directions for improvement:

  • Source risk. A part of the source risk can be avoided based on the assumption that the source domain does not include any unknown samples. This, in turn, lowers the upper bound of the error. This direction is rarely investigated in the existing literature on open-set domain adaptation.

  • Discrepancy distance. Minimizing the discrepancy distance between the source and the target domains has been well investigated in recent years in statistics-based [MMD] or adversarial-based approaches [DANN].

  • Shared error of the joint optimal hypothesis. The mismatch in class-wise conditional distributions enlarges the shared error, even when the marginal distributions are aligned.

  • Open-set risk. As shown in Eq. (5), the first term of the open-set risk can be interpreted as the misclassification rate for the unknown samples in the target domain, and the second term is the rate of misclassifying the source samples as unknown. Therefore, when a large percentage of the data is unknown, the open-set risk contributes the most to the error bound.

Fig. 1: Proposed PGL framework. By alternating between Steps 2 and 3, we progressively achieve the optimal classification model for the shared classes and pseudo-labeling function for rejecting the unknowns.

4 Progressive Graph Learning (PGL)

Aiming to minimise the four partial risks mentioned above, we reformulate the open-set unsupervised domain adaptation in a progressive way, and as such, we redefine the task at hand as follows.

4.1 Definitions and Risks

Definition 5.

Progressive Open-Set Domain Adaptation (POSDA). Given the labeled source data and the unlabeled target data, the main goal is to learn an optimal target classifier for the shared classes and a pseudo-labeling function for the unknown class.

Given that the target set will be pseudo-labeled through a fixed number of steps, an enlarging factor is defined for each step. As long as the target classifier and the pseudo-labeling function share the same feature extraction part, we can decompose the shared hypothesis into a feature learning module and a shared classifier, and define the pseudo-labeling function at each step in line with the shared classifier’s prediction. Two index-based thresholds are used to classify the unknown and known samples, respectively. A hyperparameter measures the openness of the given target set as the ratio of unknown samples. A global ranking function ranks the predicted probabilities in ascending order and returns the sorted index list as an output. The output of the pseudo-labeling function is the predicted shared class for the possible known samples, and the unknown class for the unknown ones.

In our case, the upper bound of expected target risk is formulated in the following theorem,

Theorem 4.1.

POSDA Generalization Bound. Given the hypothesis spaces of the shared classifier and the pseudo-labeling function, and under the condition that the openness of the target set is fixed, the expected error on the target samples is bounded by terms that include the shared error and the prior probability of target samples being pseudo-labeled by the pseudo-labeling function (refer to the supplementary material for the proof).

Comparing this bound with Theorem 3.1, we can observe that our progressive learning framework achieves a tighter upper bound than the conventional open-set domain adaptation framework.

4.2 Overview

In this section, we go through the details of the proposed Progressive Graph Learning (PGL) framework, as illustrated in Fig. 1. Our approach is mainly motivated by two aims: minimizing the shared error, and effectively controlling the progressive open-set risk.

Minimizing the shared error. Conditional shift [conditionalshift] arises when the class-conditional distributions of the input features substantially differ across the domains, and it is the most significant obstacle to finding an optimal classifier for the source and target data. Specifically, when the class-conditional distributions of the source and target domains are unaligned, there is no guarantee of finding an optimal classifier for both domains. Therefore, we address the conditional shift in a transductive setting from two perspectives:

  • Sample-level: Motivated by [meta, meta1], we adopt the episodic training scheme (Section 4.3), and leverage the source samples from each class to “support” predictions on unlabeled data in each episode. While the labeled set is expanding through the pseudo-labeling process (Section 4.4), we progressively update training episodes by replacing the source samples with pseudo-labeled target samples (Section 4.5).

  • Manifold-level: To regularize the class-specific manifold, we construct multi-layer Graph Neural Networks (GNNs) on top of the backbone network (e.g., ResNet). The GNN consists of paired node update networks and edge update networks. The source nodes and pseudo-labeled target nodes from the same class are densely connected, aggregating information through multiple layers.

Controlling the progressive open-set risk. As discussed in Section 4.4, we iteratively squeeze the two index-based thresholds to approximate the optimal threshold, as illustrated in Fig. 3. Since the thresholds are mainly determined by the enlarging factor, we can always seek a proper value of this factor to alleviate the misclassification error and the subsequent negative transfer. Our experimental results characterize the trade-off between computational complexity and performance improvement.

The overall learning procedure can be divided into several steps: (1) Episodic training with graph neural networks: the shared classifier is learned in a transductive setting, along with adversarial objectives for closing the domain discrepancy; (2) Progressive paradigm: in agreement with the shared classifier’s prediction, all unlabeled target samples are ranked based on confidence, among which we select those with higher scores to form the pseudo-labeled known set and reject those with lower scores as the unknown set; (3) Mix-up strategy: we randomly replace source samples in each episode with samples from the pseudo-labeled known set obtained in the previous step. We elaborate on each of the steps in the following subsections.

4.3 Step1: Initial Episodic Training with GNNs

Firstly, we define the initial episodic formulation of a batch input as a collection of episodes, with a given batch size. Each episode in the batch consists of two parts, i.e., the source episode randomly sampled from each class and the target episode randomly sampled from the target set. All instances in a mini-batch form an undirected graph. Each vertex is associated with a source or a target feature, and the edge between two nodes measures their affinity. The integrated GNNs are naturally able to perform transductive inference, taking advantage of labeled source data and unlabeled target data. The propagation rules for the edge update and node update are elaborated below.

Edge Update. The generic propagation rule for the normalized edge features at each layer is built from the sigmoid function, a degree matrix, the identity matrix, and a non-linear edge network with learnable parameters.

Fig. 2: The network architecture of the node network and the edge network.

Node Update. Similarly, the propagation rule for the node features at each layer is defined over the neighbor set of each node, using a concatenation operation and the node network, which consists of two convolutional layers, LeakyReLU activations and dropout layers. The node embedding is initialized with the representations extracted from the backbone embedding model.
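The edge and node updates can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the affinity is taken as the sigmoid of an edge network applied to the absolute feature difference, the degree matrix is computed from the affinity matrix with self-loops added, and `edge_net` and `node_net` are hypothetical callables standing in for the learned networks.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def edge_update(nodes, edge_net):
    """Return normalized edge features for a list of node feature vectors.

    Assumption: A_ij = sigmoid(edge_net(|v_i - v_j|)), followed by the
    symmetric normalization D^{-1/2} (A + I) D^{-1/2}.
    """
    n = len(nodes)
    affinity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                diff = [abs(a - b) for a, b in zip(nodes[i], nodes[j])]
                affinity[i][j] = sigmoid(edge_net(diff))
    # Add the identity matrix (self-loops) before computing node degrees.
    with_self = [[affinity[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                 for i in range(n)]
    degree = [sum(row) for row in with_self]
    return [[with_self[i][j] / math.sqrt(degree[i] * degree[j])
             for j in range(n)] for i in range(n)]


def node_update(nodes, edges, node_net):
    """Concatenate each node with its edge-weighted neighbor aggregate."""
    n = len(nodes)
    dim = len(nodes[0])
    updated = []
    for i in range(n):
        aggregate = [sum(edges[i][j] * nodes[j][d] for j in range(n) if j != i)
                     for d in range(dim)]
        updated.append(node_net(nodes[i] + aggregate))  # list concatenation
    return updated


# Toy usage: two identical nodes and one distant node.
toy_nodes = [[0.0, 0.0], [0.0, 0.0], [5.0, 5.0]]
toy_edges = edge_update(toy_nodes, lambda diff: -sum(diff))
toy_out = node_update(toy_nodes, toy_edges, lambda x: x)
```

In the toy usage the two identical nodes receive a much larger mutual edge weight than either has with the distant node, which is the behavior the affinity is meant to capture.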

Adaptive Learning. We exploit an adversarial loss to align the distributions of the source and target features extracted from the backbone network. Specifically, a domain classifier is trained to discriminate between the features coming from the source and target domains, while a generator is trained to fool this discriminator. The two-player minimax game shown in Eq. (12) is expected to reach an equilibrium that yields domain-invariant features:


Node Classification. By decomposing the shared hypothesis into a feature learning module and a shared classifier, we train both networks to classify the source node embedding. To alleviate the inherent class imbalance issue, we adopt the focal loss to down-weight the loss assigned to correctly classified examples. The loss has one hyperparameter and takes as input the node embedding from each node update layer. The total loss combines the losses from all layers to improve the gradient flow in the lower layers.
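A minimal sketch of the focal loss for a single node is given below. It uses the standard focal loss form; the name `gamma` for the hyperparameter and its default value are assumptions, since the symbol is not recoverable from the text.

```python
import math


def focal_loss(probs, label, gamma=2.0, eps=1e-12):
    """Focal loss for one example.

    probs: predicted class probabilities (summing to 1).
    label: index of the ground-truth class.
    gamma: focusing hyperparameter (assumed name); gamma = 0 gives
           the ordinary cross-entropy loss.
    """
    p_true = max(probs[label], eps)  # clamp to avoid log(0)
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


# A confidently correct prediction is down-weighted relative to cross-entropy.
easy_focal = focal_loss([0.9, 0.1], 0)
easy_ce = focal_loss([0.9, 0.1], 0, gamma=0.0)
```

With `gamma = 0` the modulating factor equals one, so the function falls back to cross-entropy; larger values shrink the contribution of examples that are already classified with high confidence.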

Edge Classification. Based on the given labels of the source data, we construct the ground-truth edge map, in which an entry is 1 if the two corresponding nodes belong to the same class and 0 otherwise. The networks are trained by minimizing the binary cross-entropy loss between the predicted edges and this ground-truth edge map.

Final Objective Function. Formally, our ultimate goal is to learn the optimal parameters for the proposed model by optimizing a weighted combination of the node loss, the edge loss and the adversarial loss, with two coefficients weighting the edge loss and the adversarial loss, respectively.
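The edge classification term that enters this objective can be sketched as below. It is illustrative only: it assumes the predicted edges are values in [0, 1], averages the loss over all node pairs, and uses hypothetical function names.

```python
import math


def edge_ground_truth(labels):
    """Entry (i, j) is 1.0 if nodes i and j share a class label, else 0.0."""
    return [[1.0 if a == b else 0.0 for b in labels] for a in labels]


def edge_bce_loss(pred_edges, true_edges, eps=1e-12):
    """Mean binary cross-entropy between predicted and ground-truth edges."""
    total, count = 0.0, 0
    for pred_row, true_row in zip(pred_edges, true_edges):
        for p, y in zip(pred_row, true_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to keep log finite
            total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
            count += 1
    return total / count


# Three labeled nodes: the first two share a class.
truth = edge_ground_truth([0, 0, 1])
uninformed = [[0.5] * 3 for _ in range(3)]
loss_uninformed = edge_bce_loss(uninformed, truth)
loss_perfect = edge_bce_loss(truth, truth)
```

An uninformed prediction of 0.5 for every edge yields a loss of log 2, while predicting the ground-truth map exactly drives the loss to nearly zero.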

4.4 Step2: Pseudo-Labeling in Progressive Paradigm

With the optimal model parameters obtained at the current step, we freeze the model and feed forward all the target samples, as shown in Step 2 of Fig. 1. Then, we rank the maximum likelihood produced by the shared classifier in ascending order. Giving priority to the “easier” samples with relatively high or low confidence scores, we select samples to enlarge the pseudo-labeled known set and unknown set (refer to Eq. (7)). The newly annotated samples form the known set and the unknown set, respectively, and the pseudo-label of a known sample is given by the shared classifier’s prediction. The enlarging factor involves a trade-off: aggressively setting a large value completes the progressive paradigm in fewer steps, but results in potentially noisy and unreliable pseudo-labeled candidates; on the contrary, choosing a small value results in a steady increase in model performance at a higher computational cost.
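The global ranking step can be sketched as follows. The exact threshold formulas are not recoverable from the text, so this sketch assumes that after `step` steps a fraction `alpha * step` of the target set is labeled and that this budget is split between unknown and known samples according to the openness `beta`; all parameter names are hypothetical.

```python
def progressive_pseudo_label(max_probs, pred_classes, step, alpha, beta,
                             unknown_label=-1):
    """Pseudo-label the least and most confident target samples.

    max_probs:    maximum class probability per target sample.
    pred_classes: predicted shared class per target sample.
    alpha:        enlarging factor (assumed: fraction labeled per step).
    beta:         openness, i.e. assumed ratio of unknown samples.
    Returns a dict mapping sample index to pseudo-label.
    """
    n = len(max_probs)
    order = sorted(range(n), key=lambda i: max_probs[i])  # ascending confidence
    budget = min(1.0, alpha * step)
    n_unknown = int(beta * budget * n)
    n_known = int((1.0 - beta) * budget * n)
    labels = {}
    for i in order[:n_unknown]:        # lowest confidence: reject as unknown
        labels[i] = unknown_label
    for i in order[n - n_known:]:      # highest confidence: keep prediction
        labels[i] = pred_classes[i]
    return labels


probs = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6]
preds = [0, 1, 0, 1, 0, 1, 0, 1]
first_step = progressive_pseudo_label(probs, preds, step=1, alpha=0.5, beta=0.5)
second_step = progressive_pseudo_label(probs, preds, step=2, alpha=0.5, beta=0.5)
```

In this toy run the first step labels only the two least and two most confident samples, and the second step covers the whole target set, mirroring how the two thresholds squeeze towards each other.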

Fig. 3: An illustration of the progressive learning to construct the pseudo-labeled target set. The figure also marks the ideal threshold for classifying known and unknown samples.

4.5 Step3: Episodic Update with Mix-up Strategy

We mix the source data with the samples from the updated pseudo-labeled known set at the current step, and construct new episodes for the next step, as depicted in Step 3 of Fig. 1. In particular, we randomly replace the source samples with pseudo-labeled known data with a given probability. Each episode in the new batch consists of three parts: source samples, pseudo-labeled known target samples drawn from the conditional distribution of the pseudo-labeled known set at the current step, and unlabeled target samples. Then, we update the model parameters according to Eq. (15) and repeat pseudo-labeling with the newly constructed episodes until convergence.
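The replacement of source samples by pseudo-labeled target samples can be sketched as below. The function and argument names are hypothetical, and the sketch only builds the labeled ("support") part of one episode, with one sample per class.

```python
import random


def mix_episode(source_by_class, pseudo_by_class, replace_prob, rng=random):
    """Build the labeled part of an episode, one sample per class.

    With probability replace_prob, the source sample of a class is replaced
    by a pseudo-labeled known target sample of the same class, provided
    that such a sample exists.
    """
    episode = {}
    for cls, source_pool in source_by_class.items():
        pseudo_pool = pseudo_by_class.get(cls, [])
        if pseudo_pool and rng.random() < replace_prob:
            episode[cls] = ("target", rng.choice(pseudo_pool))
        else:
            episode[cls] = ("source", rng.choice(source_pool))
    return episode


source = {0: ["s0_a", "s0_b"], 1: ["s1_a"]}
pseudo = {0: ["t0_a"]}  # class 1 has no pseudo-labeled samples yet
never_mixed = mix_episode(source, pseudo, replace_prob=0.0)
always_mixed = mix_episode(source, pseudo, replace_prob=1.0)
```

Classes without any pseudo-labeled target samples keep their source sample, so early episodes remain dominated by source data and the mixture shifts towards the target domain as the known set grows.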

4.6 Extension to Video Domain Adaptation

We provide an extension of the proposed PGL approach for tasks of open-set video domain adaptation (OSVDA), where the target video data is curated under a different condition and contains additional classes of actions or events that do not exist in the source domain. Our OSVDA setting differs from vanilla OSDA in the sense that the domain shift is present in video clips rather than still images, as in the Gameplay-Kinetics [DBLP:conf/iccv/ChenKAYCZ19] dataset. Specifically, we sample a fixed number of frames with equal spacing from each video for training. We then encode each frame with a ResNet-101 pretrained on ImageNet into a 2048-D vector. Without loss of generality, the extracted frame features for the source and target domains are then aggregated through an average pooling layer to obtain the video-level representations. Likewise, the graph-based model is jointly trained to align the source and target features, as illustrated in Section 4.3 to Section 4.5.
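The frame sampling and temporal pooling described above can be sketched as follows. The choice of the centre of each equal-length segment is an assumption about what "equal spacing" means, and the frame features are plain lists standing in for the 2048-D backbone outputs.

```python
def sample_frame_indices(num_frames, k):
    """Pick k frame indices with equal spacing (centre of each segment)."""
    if k >= num_frames:
        return list(range(num_frames))
    segment = num_frames / k
    return [int(segment * i + segment / 2) for i in range(k)]


def average_pool(frame_features):
    """Average frame-level feature vectors into one video-level vector."""
    dim = len(frame_features[0])
    count = len(frame_features)
    return [sum(feat[d] for feat in frame_features) / count for d in range(dim)]


indices = sample_frame_indices(num_frames=10, k=5)
video_feature = average_pool([[1.0, 2.0], [3.0, 4.0]])
```

Average pooling discards temporal order, which keeps the video-level representation compatible with the image-based pipeline of Sections 4.3 to 4.5.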

We also provide an extension of the PGL method in a semi-supervised setting for video data (S-OSVDA), where part of the target labels are observable. To leverage the supervision, the node objective and edge loss are adapted to take both source samples and the labeled target samples, while the unlabeled target videos are pseudo-labeled iteratively.

Fig. 4: Proposed SF-PGL framework. The black line represents the data flow of both the source and target data.

5 Source-free Progressive Graph Learning (SF-PGL)

In this section, we propose simple yet effective modifications to make our progressive graph learning model work in a source-free setup (SF-PGL). The overall workflow is presented in Fig. 4, and consists of (1) pre-training the backbone model, (2) balanced pseudo-labeling, and (3) uncertainty-aware updating.

5.1 Step1: Pre-training Backbone Model

First, we train the backbone network and the source hypothesis module on the source domain, using the cross-entropy objective.

Due to the absence of target samples, the graph model is not trained in Step 1, as the model would otherwise overfit the source data.

5.2 Step2: Balanced Pseudo-Labeling (BP-L)

After the backbone model and classifier are warmed up, we pass all target samples through them to get the predictions for pseudo-labeling. With no access to the labeled source data, we cannot simply apply the global ranking strategy in Section 4.4 to get the pseudo-labeled known set and the unknown set. This is because the global ranking may be biased towards some ‘easy’ classes where samples tend to have high confidence, which can cause under-sampling of ‘difficult’ classes. The imbalance in the pseudo-labeled set is prone to trigger overfitting for certain categories and the notorious overconfidence issue [DBLP:conf/icml/GuoPSW17], which makes the model poorly calibrated and less generalizable. To be specific, overconfidence refers to the problem that the produced confidence scores are typically higher than the predictive accuracy, which may increase the risks for decision-critical applications.
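The gap between confidence and accuracy is commonly summarized by the expected calibration error (ECE). The sketch below uses the standard equal-width binning definition; the number of bins and the half-open bin boundaries are choices made here, not taken from the text.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| per bin.

    confidences: predicted confidence in [0, 1] per sample.
    correct:     True if the corresponding prediction was right.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Bins are [k/n_bins, (k+1)/n_bins); confidence 1.0 goes to the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Over-confident model: 95% confidence but only half of the predictions right.
overconfident = expected_calibration_error([0.95] * 4, [True, True, False, False])
# Calibrated model: 75% confidence and three out of four predictions right.
calibrated = expected_calibration_error([0.75] * 4, [True, True, True, False])
```

The over-confident example yields an ECE of 0.45, whereas the calibrated one yields zero, which illustrates what "confidence scores higher than the predictive accuracy" means in numbers.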

To solve this, in SF-PGL framework, we adopt a balanced pseudo-labeling (BP-L) mechanism in the inference stage. Firstly, we separate all target data into groups according to their potential labels and sort the samples based on their confidence scores. For each shared class , we create an empty label bank of the size and then insert the highest confidence scores from the -th group into the label bank until fully filled. Afterwards, we merge label banks to form the pseudo-labeled known set at the -th step:


The pseudo-labeled unknown set is collected in the same way described in Section 4.4, by obtaining the target samples with the lowest maximum confidence scores, such as


To measure the class uncertainty, we further record the confidence threshold of each label bank and concatenate the thresholds of all shared classes into a single vector,


where the thresholds of all shared classes are joined by concatenation. Then we normalize the confidence thresholds to calculate the class importance at the current step,


Here, the normalization is the standard softmax function, so the importance of each class varies from 0 to 1. A smaller value indicates that the class is relatively easy to learn, whilst a larger value means that the class is difficult and that more attention should be paid to learning it.
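The balanced pseudo-labeling step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the released SF-PGL code: the function name `balanced_pseudo_label` and the parameters `bank_size` and `num_unknown` are hypothetical, samples already placed in a label bank are excluded from the unknown set, and the softmax is applied to the negated thresholds so that a class with a lower threshold (a more difficult class) receives a larger importance. The last point matches the interpretation given in the text but is an assumption about the sign convention.

```python
import math
from collections import defaultdict


def balanced_pseudo_label(probs, bank_size, num_unknown):
    """Class-balanced selection of pseudo-labeled target samples (sketch).

    probs: list of per-sample probability lists over the shared classes.
    bank_size, num_unknown: hypothetical parameters for the per-class
    label-bank size and the size of the pseudo-labeled unknown set.
    """
    num_classes = len(probs[0])
    conf = [max(p) for p in probs]           # confidence = highest class probability
    pred = [p.index(max(p)) for p in probs]  # predicted label of each sample

    # Group sample indices by predicted class.
    groups = defaultdict(list)
    for i, c in enumerate(pred):
        groups[c].append(i)

    known, thresholds = [], []
    for c in range(num_classes):
        # Fill the label bank with the most confident samples of this group.
        bank = sorted(groups[c], key=lambda i: conf[i], reverse=True)[:bank_size]
        known.extend((i, c) for i in bank)
        # Threshold = lowest confidence admitted to the bank
        # (0.0 for an empty bank is an assumption of this sketch).
        thresholds.append(conf[bank[-1]] if bank else 0.0)

    # Unknown set: remaining samples with the lowest maximum confidence.
    taken = {i for i, _ in known}
    rest = sorted((i for i in range(len(probs)) if i not in taken),
                  key=lambda i: conf[i])
    unknown = rest[:num_unknown]

    # Class importance: softmax over negated thresholds (assumed sign).
    exps = [math.exp(-t) for t in thresholds]
    total = sum(exps)
    importance = [e / total for e in exps]
    return known, unknown, thresholds, importance
```

Because every class contributes at most `bank_size` samples, a class whose predictions are uniformly confident cannot crowd out a class with lower confidence, which is the imbalance that global ranking suffers from.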

5.3 Step3: Uncertainty-aware Updating

With the pseudo-labeled known set constructed, we leverage the same episodic training paradigm used in Section 4.3 to train the backbone, the node and edge networks, and the classifier. At each step, every episode in the new batch consists of labeled data sampled from the pseudo-labeled set and unlabeled target data sampled i.i.d.,


The networks are trained by minimizing the edge loss, the supervised node classification loss, and a soft entropy loss for unlabeled target data, as defined below,

where is the loss coefficient. is empirically set to 1 for the VisDA-17 dataset and 3 for the Syn2Real-O dataset. Unlike the objectives in PGL, the node loss is weighted by the class importance obtained from the last step, so that the learning of difficult categories is enhanced. This strategy also alleviates overfitting and miscalibration, as evidenced by the experimental results in Section 6.5. Finally, we update the model parameters and repeat balanced pseudo-labeling with the newly constructed episodes until convergence.
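The effect of weighting the node classification loss by class importance can be illustrated with a short sketch. It is a simplification under stated assumptions: the function name `weighted_node_loss` is hypothetical, and a plain negative log-likelihood is used in place of the actual node classification objective (the ablation in Section 6.6 indicates that PGL uses a focal loss there).

```python
import math


def weighted_node_loss(probs, labels, importance):
    """Mean negative log-likelihood, weighted per sample by the
    importance of its (pseudo-)label class.

    probs: per-sample probability lists over the shared classes.
    labels: pseudo-labels of the samples.
    importance: per-class weights, e.g. from balanced pseudo-labeling.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        # Clamp to avoid log(0); the epsilon is a choice of this sketch.
        total += -importance[y] * math.log(max(p[y], 1e-12))
    return total / len(labels)


# Two samples, two classes; class 1 is treated as the more difficult one.
example = weighted_node_loss([[0.5, 0.5], [0.25, 0.75]], [0, 1], [0.4, 0.6])
```

Raising the importance of a class scales the gradient contribution of its samples, so errors on difficult classes are penalized more strongly than errors on easy ones.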

6 Experiments

In this section, we quantitatively compare our proposed model against various domain adaptation baselines on three image classification and four action recognition datasets.

6.1 Datasets

To verify its versatility, we evaluate the proposed PGL and SF-PGL methods on three image recognition and four action recognition benchmarks, introduced below.

Office-Home [officehome] is a challenging domain adaptation benchmark, which comprises 15,500 images from 65 categories of everyday objects. The dataset consists of 4 domains: Art (Ar), Clipart (Cl), Product (Pr), and Real-World (Rw). Following the same splits used in [STA], we select the first 25 classes in alphabetical order as the known classes, and group the remaining classes as the unknown.

VisDA-17 [visda2017] is a cross-domain dataset with 12 categories in two distinct domains. The Synthetic domain consists of 152,397 synthetic images generated by 3D rendering, and the Real domain contains 55,388 real-world images from the MSCOCO [MSCOCO] dataset. Following the same protocol used in [OSBP, STA], we construct the known set with 6 categories and group the remaining 6 categories as the unknown set.

Syn2Real-O [visda18] is the most challenging synthetic-to-real testbed, which is constructed from VisDA-17. The Syn2Real-O dataset significantly increases the openness to 0.9 by introducing additional unknown samples in the target domain. According to the official setting, the Synthetic source domain contains the training data from VisDA-17 as the known set, and the target domain Real includes the test data from VisDA-17 (known set) plus 50k images from irrelevant categories of the MSCOCO dataset (unknown set).

Property | UCF-HMDB_small | UCF-HMDB_full | UCF-Olympic | Kinetics-Gameplay
Video Length | 21 Seconds | 33 Seconds | 39 Seconds | 10 Seconds
Classes | 5 | 12 | 6 | 30
Training Videos | UCF: 482 / HMDB: 350 | UCF: 1,438 / HMDB: 840 | UCF: 601 / Olympic: 250 | Kinetics: 43,378 / Gameplay: 2,625
Validation Videos | UCF: 189 / HMDB: 571 | UCF: 360 / HMDB: 350 | UCF: 240 / Olympic: 54 | Kinetics: 3,246 / Gameplay: 749
Known/Unknown Classes | 4/1 | 6/6 | 5/1 | 15/15
TABLE I: The general statistics of the four action recognition datasets for tasks of OSVDA and S-OSVDA.
Category | UCF-101 | HMDB
climb | RockClimbingIndoor, RopeClimbing | climb
fencing | Fencing | fencing
golf | GolfSwing | golf
kick_ball | SoccerPenalty | kick_ball
pullup | PullUps | pullup
punch | Punch, BoxingPunchingBag, BoxingSpeedBag | punch
pushup | PushUps | pushup
ride_bike | Biking | ride_bike
ride_horse | HorseRiding | ride_horse
shoot_ball | Basketball | shoot_ball
shoot_bow | Archery | shoot_bow
walk | WalkingWithDog | walk
TABLE II: The summary of all collected categories in the UCF-101 and HMDB datasets. Classes highlighted in blue represent the unknown class.

We adapted four video DA benchmark datasets to the open-set domain adaptation setting by re-splitting the label spaces; their statistics are summarized in Table I. For each source and target video, we sample a fixed number of frames with equal spacing and encode each frame into a 2048-D vector with a ResNet-101 pretrained on ImageNet. In our experiments, the number of sampled frames is empirically set to 5 for all OSVDA approaches.
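Equally spaced frame sampling can be sketched as follows. The function name `sample_frame_indices` is hypothetical, and taking the centre of each of the equal-length segments is one possible reading of "equal spacing", assumed here for illustration; the feature encoding with ResNet-101 is not shown.

```python
def sample_frame_indices(num_frames, num_samples=5):
    """Return `num_samples` frame indices spread evenly over a video.

    The video is split into `num_samples` equal segments and the centre
    frame of each segment is picked (an assumption of this sketch).
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    return [
        min(num_frames - 1, int((k + 0.5) * num_frames / num_samples))
        for k in range(num_samples)
    ]


# A 100-frame video sampled with the default of 5 frames.
indices = sample_frame_indices(100)
```

Videos shorter than the requested number of samples simply repeat some indices, so every video yields a feature sequence of the same length.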

UCF-HMDB_small and UCF-HMDB_full are the overlapping subsets of two large-scale action recognition datasets, i.e., UCF101 [ucf] and HMDB51 [hmdb], covering 5 and 12 highly relevant categories, respectively. Each category may correspond to multiple categories in the original UCF101 or HMDB51 dataset, as shown in Table II. UCF-HMDB_small contains only golf, pullup, ride_bike, ride_horse, and shoot_ball, where shoot_ball is treated as the unknown class. For UCF-HMDB_full, the classes highlighted in blue (Table II) are grouped as the unknown class.

UCF-Olympic selects the 6 classes shared by the UCF101 and Olympic Sports [olympic] datasets, including Basketball, Clean and Jerk, Diving, Pole Vault, Discus Throw and Tennis, where Tennis acts as the unknown class.

Kinetics-Gameplay: The fourth and most challenging dataset is the cross-domain Kinetics-Gameplay dataset, which has a large domain gap between its synthetic videos and real-world videos. To create this dataset, 30 shared categories were selected from both the Gameplay [TAN] dataset and one of the largest public video datasets, Kinetics-600 [kinetics]: break, carry, clean floor, climb, crawl, crouch, cry, dance, drink, drive, fall down, fight, hug, jump, kick, light up, news anchor, open door, paint brush, paraglide, pour, push, read, run, shoot gun, stare, talk, throw, walk, and wash dishes. Each category in Kinetics-Gameplay may also correspond to multiple categories in both datasets, which poses another challenge of class imbalance. We manually select the last 15 classes as the unknown class.

6.2 Baselines

We compare the performance of the proposed PGL and SF-PGL models with 1) a basic ResNet-50 [resnet] deep classification model; 2) closed-set domain adaptation methods: Maximum Mean Discrepancy (MMD) [MMD], Domain-Adversarial Neural Networks (DANN) [DANN], Residual Transfer Networks (RTN) [RTN], Joint Adaptation Networks (JAN) [JAN], Maximum Classifier Discrepancy (MCD) [MCD]; 3) partial domain adaptation methods: Adaptive Batch Normalization (AdaBN) [AdaBN], Importance Weighted Adversarial Nets (IWAN) [IWAN], Example Transfer Network (ETN) [ETN]; 4) open-set domain adaptation methods: Assign-and-Transform-Iteratively (ATI-λ) [ATI], Open Set domain adaptation by Back-Propagation (OSBP) [OSBP], STA [STA], DAOD [open_theory], ROS [DBLP:conf/eccv/BucciLT20], Self-Ensembling with Category-agnostic Clusters (SE-CC) [DBLP:conf/cvpr/PanYLNM20]; 5) a universal domain adaptation method, UAN [DBLP:conf/cvpr/YouLCWJ19]; and 6) the source-free open-set domain adaptation approach Inherit [DBLP:conf/cvpr/KunduVRVB20]. To apply the non-open-set baseline methods in the open-set setting, we follow previous work [STA, Attract-UTS] and reject unknown outliers from the target data using


Method | Ar→Cl | Ar→Pr | Ar→Rw | Cl→Rw | Cl→Pr | Cl→Ar | Pr→Ar | Pr→Cl | Pr→Rw | Rw→Ar | Rw→Cl | Rw→Pr | Avg.
OSNN [OSNN] | 33.7 / 32.1 | 40.6 / 39.4 | 57.0 / 56.6 | 47.7 / 46.9 | 40.3 / 39.1 | 34.0 / 32.3 | 39.7 / 38.5 | 36.3 / 35.0 | 59.7 / 59.6 | 52.1 / 51.4 | 39.2 / 38.0 | 59.2 / 59.2 | 45.0 / 44.0
OSVM [OSVM] | 37.5 / 38.7 | 42.2 / 42.6 | 49.2 / 51.4 | 53.8 / 55.5 | 48.5 / 50.0 | 39.2 / 40.3 | 53.4 / 55.1 | 43.5 / 44.8 | 70.6 / 72.9 | 65.6 / 67.4 | 49.5 / 50.8 | 72.7 / 75.1 | 52.1 / 53.7
DANN [DANN] | 52.3 / 52.1 | 71.3 / 72.4 | 82.3 / 83.8 | 73.2 / 74.5 | 62.8 / 64.1 | 61.4 / 62.3 | 63.5 / 64.5 | 46.0 / 46.3 | 77.2 / 78.3 | 70.5 / 71.3 | 55.5 / 56.2 | 79.1 / 80.7 | 66.2 / 67.2
ATI-λ [DBLP:journals/pami/BustoIG20] | 53.1 / 54.2 | 68.6 / 70.4 | 77.3 / 78.1 | 74.3 / 75.3 | 66.7 / 68.3 | 57.8 / 59.1 | 61.2 / 62.6 | 53.9 / 54.1 | 79.9 / 81.1 | 70.0 / 70.8 | 55.2 / 55.4 | 78.3 / 79.4 | 66.4 / 67.4
OSBP [OSBP] | 56.1 / 57.2 | 75.8 / 77.8 | 83.0 / 85.4 | 75.5 / 77.2 | 69.2 / 71.3 | 64.6 / 65.9 | 64.6 / 65.3 | 48.3 / 48.7 | 79.5 / 81.6 | 72.1 / 73.5 | 54.3 / 55.3 | 80.2 / 81.9 | 68.6 / 70.1
STA [STA] | 58.1 / - | 71.6 / - | 85.0 / - | 75.8 / - | 69.3 / - | 63.4 / - | 65.2 / - | 53.1 / - | 80.8 / - | 74.9 / - | 54.4 / - | 81.9 / - | 69.5 / -
STA | 46.6 / 45.9 | 67.0 / 67.2 | 76.2 / 76.6 | 64.9 / 65.2 | 57.7 / 57.6 | 50.2 / 49.3 | 49.5 / 48.4 | 42.9 / 40.8 | 76.6 / 77.3 | 68.7 / 68.6 | 46.0 / 45.4 | 73.9 / 74.5 | 60.0 / 59.8
ROS [DBLP:conf/eccv/BucciLT20] | 51.5 / 50.6 | 68.5 / 68.4 | 75.9 / 75.8 | 65.6 / 65.3 | 60.3 / 59.8 | 54.1 / 53.6 | 57.6 / 57.3 | 46.5 / 46.5 | 71.1 / 70.8 | 67.1 / 67.0 | 52.3 / 51.5 | 72.3 / 72.0 | 62.0 / 61.6
DAOD [open_theory] | 56.1 / 55.5 | 69.1 / 69.2 | 78.7 / 79.3 | 77.3 / 78.2 | 69.6 / 70.2 | 62.6 / 62.9 | 66.8 / 67.7 | 59.7 / 60.3 | 83.3 / 85.0 | 72.3 / 73.2 | 59.9 / 60.4 | 81.8 / 82.8 | 69.8 / 70.4
PGL | 61.6 / 63.3 | 77.1 / 78.9 | 85.9 / 87.7 | 82.8 / 85.9 | 72.0 / 73.9 | 68.8 / 70.2 | 72.2 / 73.7 | 58.4 / 59.2 | 82.6 / 84.8 | 78.6 / 81.5 | 65.0 / 68.8 | 83.0 / 84.8 | 74.0 / 76.1
TABLE III: Recognition accuracies (%) on 12 pairs of source/target domains from Office-Home benchmark using ResNet-50 as the backbone. Each cell reports OS / OS*. Ar: Art, Cl: Clipart, Pr: Product, Rw: Real-World. indicates our re-implementation with the officially released code.

6.3 Evaluation Metrics

To evaluate the proposed method and the baselines, we utilize four widely used measures [OSBP, STA], i.e., the accuracy of the unknown class (UNK), the normalized accuracy for all classes (OS), the normalized accuracy for the known classes only (OS*), and the harmonic mean accuracy (H-score),



Here, the accuracy of each class is computed over the set of target samples in that class with the classifier. In our case, we use the shared classifier for the known classes and the pseudo-labeling function for the unknown one. Notably, the H-score is considered the fairest evaluation metric, as it trades off the performance of the methods on known and unknown class samples.
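The four measures can be computed from per-class accuracies as in the sketch below. The function name `open_set_metrics` and the label convention (known classes numbered from 0, the unknown class given the next index) are assumptions made for illustration. The H-score is taken as the harmonic mean of OS* and UNK, which agrees with the reported numbers: for the PGL row of Table V, 2 × 66.8 × 49.6 / (66.8 + 49.6) ≈ 56.9.

```python
def open_set_metrics(y_true, y_pred, num_known):
    """Compute UNK, OS, OS* and H-score from label lists.

    Labels 0 .. num_known-1 are the known classes and label `num_known`
    is the unknown class (a convention assumed by this sketch).
    """
    per_class = []
    for c in range(num_known + 1):
        idx = [i for i, y in enumerate(y_true) if y == c]
        if not idx:
            raise ValueError(f"class {c} has no target samples")
        per_class.append(sum(y_pred[i] == c for i in idx) / len(idx))

    os_star = sum(per_class[:num_known]) / num_known   # known classes only
    unk = per_class[num_known]                         # unknown class
    os_all = sum(per_class) / (num_known + 1)          # all classes
    denom = os_star + unk
    h_score = 2 * os_star * unk / denom if denom > 0 else 0.0
    return {"UNK": unk, "OS": os_all, "OS*": os_star, "H": h_score}
```

Averaging per-class accuracies, rather than per-sample correctness, keeps a large unknown class from dominating OS, while the H-score drops sharply whenever either the known or the unknown accuracy is low.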

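In addition to classification accuracy, we explore the calibration capacity of the adapted model using the expected calibration error (ECE). Given the model predictions and their respective confidence scores, ECE is calculated by grouping the test data into $M$ interval bins of equal size. Let $B_m$ be the set of indices of samples whose maximum prediction score falls into the $m$-th bin. The accuracy and average confidence for $B_m$ are defined as,

$$\mathrm{acc}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \mathbb{1}(\hat{y}_i = y_i), \qquad \mathrm{conf}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \hat{p}_i,$$

where $\hat{p}_i$ is the highest confidence of the sample $i$, and $\hat{y}_i$ and $y_i$ are its predicted and true labels. Given the accuracy and confidence scores for each bin, ECE is computed as the weighted sum of the mismatch over bins,

$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|,$$

where $n$ is the total number of samples.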
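A compact sketch of the ECE computation is given below. The function name `expected_calibration_error` is hypothetical, and two details are assumptions of this sketch: bins are half-open intervals of equal width over [0, 1], and a confidence of exactly 1.0 is placed in the last bin.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Expected calibration error over equal-width confidence bins.

    confidences: highest predicted probability of each sample.
    correct: 1/True if the prediction of the sample is right, else 0/False.
    """
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for c, ok in zip(confidences, correct):
        m = min(num_bins - 1, int(c * num_bins))  # bin index of this sample
        bins[m].append((c, ok))

    ece = 0.0
    for b in bins:
        if not b:
            continue  # empty bins contribute nothing
        acc = sum(ok for _, ok in b) / len(b)
        conf = sum(c for c, _ in b) / len(b)
        ece += len(b) / n * abs(acc - conf)
    return ece
```

A perfectly calibrated model has matching accuracy and confidence in every bin and thus an ECE of zero; an over-confident model shows confidence above accuracy in the high-confidence bins.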

6.4 Implementation Details

A PyTorch implementation of the proposed PGL is available in a GitHub repository, and the source code of SF-PGL is also made available. In our experiments, we employ ResNet-50, ResNet-152 [resnet] and VGGNet [VGG] pre-trained on ImageNet as the backbone network. For VGGNet, we only fine-tune the parameters in FC layers. The networks are trained with the ADAM optimizer with a weight decay of . The learning rate is initialized as and for the GNNs and the backbone module, respectively, and then decayed by a factor of every epochs. The dropout rate is fixed to and the depth of GNN is set to for all experiments. The loss coefficients and are empirically set to and , respectively. The batch sizes of the proposed PGL are set to 2, 8, 6 for three open-set benchmarks. The batch sizes of SF-PGL are fixed to 4 and 8 for VisDA-17 and Syn2Real-O datasets. The enlarging factor is 0.05. The image feature extracted by the fc7 layer of the VGGNet backbone is a 4096-D vector, and the deep feature extracted from ResNet-50 is a 2048-D vector. For video tasks, we set the batch size to 12 for two UCF-HMDB datasets, UCF Olympic task, 10 for UCFOlympic task and 8 for GameplayKinetics. More details can be found in the GitHub repository for reproduction.

6.5 Results of Domain-adaptive Image Classification

To validate the effectiveness of the proposed PGL and SF-PGL models, we compare them with state-of-the-art OSDA and SF-OSDA approaches. As reported in Table III, Table IV, and Table VI, our method PGL consistently outperforms the state-of-the-art results, improving mean accuracy (OS) on the Office-Home, VisDA-17 and Syn2Real-O benchmarks. Note that our proposed approach provides significant performance gains on the more challenging Syn2Real-O and VisDA-17 datasets, which require knowledge transfer across different modalities. This phenomenon can also be observed in transfer sub-tasks with a large domain shift, e.g., Rw→Cl and Pr→Ar in Office-Home, which demonstrates the strong adaptation ability of the proposed framework. For the SF-OSDA task, the proposed SF-PGL model surpasses not only the Inherit approach but also all OSDA methods by a large margin, as reported in the last row of Table IV and rows 5-7 and 11-13 of Table VI. Compared with the original PGL model, SF-PGL is capable of balancing the learning of ‘easy’ and ‘difficult’ concepts by weighting the classification loss with the uncertainty-aware class importance. For instance, comparing row 4 and row 7 of Table VI, the accuracies of the Person and Knife categories improve from 41.0% to 94.8% and from 45.4% to 84.5%, respectively.

Method Bic Bus Car Mot Tra Tru UNK OS OS* H
MMD[MMD] 39.0
ATI-[DBLP:journals/pami/BustoIG20] 46.2
STA[STA] 52.4
Inherit[DBLP:conf/cvpr/KunduVRVB20] 53.5 88.5
PGL 93.5 93.8 75.7 98.8 96.2 38.5 80.7 82.8 75.0
Method Bic Bus Car Mot Tra Tru UNK OS OS* H
OSVM [OSVM] 40.2
DANN[DANN] 32.4 - - -
RTN[RTN] 31.6 - - -
ETN[ETN] 31.6
UAN[DBLP:conf/cvpr/YouLCWJ19] 42.6
ATI-[DBLP:journals/pami/BustoIG20] 33.6
STA[STA] 50.1 82.4
SF-PGL w/o BP-L 91.5
SF-PGL 93.6 97.6 89.6 95.3 96.7 95.6 91.0 94.7 79.6
TABLE IV: Performance comparisons on the VisDA-17. indicates methods with OSVM.
Model | UNK | OS | OS* | H-Score
PGL w/o Progressive | 43.6 | 54.4 | 55.3 | 48.8
PGL w NLL | 48.6 | 56.9 | 57.6 | 52.7
PGL w/o GNNs | 49.2 | 57.8 | 58.5 | 53.4
PGL w/o Mix-up | 49.8 | 62.5 | 63.6 | 55.9
PGL | 49.6 | 65.5 | 66.8 | 56.9
TABLE V: Ablation performance on the Syn2Real-O (ResNet-50). “w” indicates with and “w/o” indicates without.
Method | Aer | Bic | Bus | Car | Hor | Kni | Mot | Per | Pla | Ska | Tra | Tru | UNK | OS | OS* | H-Score
DANN [DANN] | 50.8 | 44.1 | 19.0 | 58.5 | 76.8 | 26.6 | 68.7 | 50.5 | 82.4 | 21.1 | 69.7 | 1.1 | 33.6 | 46.3 | 47.4 | 39.3
OSBP [OSBP] | 75.5 | 67.7 | 68.4 | 66.2 | 71.4 | 0.0 | 86.0 | 3.2 | 39.4 | 23.2 | 68.1 | 3.7 | 79.3 | 50.1 | 47.7 | 59.6
STA [STA] | 64.1 | 70.3 | 53.7 | 59.4 | 80.8 | 20.8 | 90.0 | 12.5 | 63.2 | 30.2 | 78.2 | 2.7 | 59.1 | 52.7 | 52.2 | 55.4
PGL | 81.5 | 68.3 | 74.2 | 60.6 | 91.9 | 45.4 | 92.2 | 41.0 | 87.9 | 67.5 | 79.2 | 6.4 | 49.6 | 65.5 | 66.8 | 56.9
SF-PGL () | 82.7 | 84.7 | 86.1 | 82.0 | 87.9 | 49.8 | 82.7 | 86.8 | 88.8 | 67.1 | 77.3 | 0.0 | 85.0 | 73.9 | 73.0 | 78.5
SF-PGL () | 91.5 | 88.8 | 90.4 | 87.4 | 89.8 | 80.7 | 92.2 | 88.5 | 74.9 | 87.3 | 87.2 | 0.0 | 88.9 | 80.6 | 79.9 | 84.2
SF-PGL () | 90.7 | 90.5 | 93.0 | 90.0 | 93.6 | 84.5 | 92.6 | 94.8 | 88.8 | 93.4 | 91.7 | 0.0 | 94.3 | 84.5 | 83.6 | 88.6

Method | Aer | Bic | Bus | Car | Hor | Kni | Mot | Per | Pla | Ska | Tra | Tru | UNK | OS | OS* | H-Score
OSVM [OSVM] | 53.8 | 54.2 | 50.3 | 48.7 | 72.7 | 5.3 | 82.0 | 27.0 | 49.6 | 43.4 | 78.0 | 5.1 | 44.2 | 47.3 | 47.5 | 45.8
OSBP [OSBP] | 80.2 | 63.1 | 59.1 | 63.1 | 83.2 | 12.1 | 89.1 | 5.0 | 61.0 | 14.0 | 79.2 | 0.0 | 69.0 | 52.2 | 50.8 | 58.5
SE-CC [DBLP:conf/cvpr/PanYLNM20] | 82.1 | 80.7 | 59.7 | 50.0 | 80.6 | 36.7 | 83.1 | 56.2 | 56.6 | 21.9 | 57.7 | 4.0 | 70.6 | 56.9 | 55.8 | 62.3
SF-PGL () | 94.8 | 87.9 | 90.0 | 83.5 | 84.0 | 49.3 | 92.2 | 82.8 | 76.9 | 91.2 | 93.1 | 0.2 | 84.2 | 77.7 | 77.2 | 80.5
SF-PGL () | 97.5 | 90.5 | 94.7 | 89.6 | 90.0 | 76.8 | 95.0 | 91.3 | 84.1 | 95.2 | 96.9 | 0.0 | 85.3 | 83.6 | 83.5 | 84.4
SF-PGL () | 98.4 | 95.4 | 96.4 | 93.9 | 93.0 | 88.4 | 95.0 | 93.2 | 87.2 | 98.2 | 98.3 | 0.0 | 85.0 | 86.4 | 86.5 | 86.4
TABLE VI: Recognition accuracies (%) for open-set domain adaptation experiments on the Syn2Real-O (ResNet-50).
Fig. 5: Performance comparisons w.r.t. varying (a) openness of the Syn2Real-O (ResNet-50); (b) the two loss coefficients on the Ar→Cl task (Office-Home) with the ResNet-50 backbone.

6.6 Model Analysis of PGL

Ablation Study: To investigate the impact of the derived progressive paradigm, GNNs, node classification loss, and mix-up strategy, we compare four variants of the PGL model on the Syn2Real-O dataset, as shown in Table V. Except for PGL w/o Progressive that takes and , all experiments are conducted under the default setting of hyperparameters. PGL w/o Progressive corresponds to the model directly trained in one step, followed by the pseudo-labeling function for classifying the unknown samples. As shown in Table V, without the progressive learning strategy, the OS result of PGL w/o Progressive drops significantly by 16.9%, because the model does not leverage the pseudo-labeled target samples and therefore fails to minimize the shared error at the sample level. In PGL w NLL, the focal loss of the node classification objective is replaced with the negative log-likelihood (NLL) loss, resulting in OS performance dropping from 65.5% to 56.9%. Without the focal loss re-weighting module, the model tends to assign more pseudo-labels to easy-to-classify samples, which consequently hinders effective graph learning in the episodic training process. In PGL w/o GNNs, we use ResNet-50 as the backbone for feature learning, which causes a 12.5% drop in OS performance compared to the graph learning model. The inferior results reveal that the GNN module can learn the class-wise manifold, which mitigates the potential noise and permutation by aggregating the neighboring information. PGL w/o Mix-up refers to the model that constructs episodes without taking any pseudo-labeled target data. We observe that the OS performance of PGL w/o Mix-up is 4.6% lower than that of the proposed model, confirming that progressively replacing the source samples with pseudo-labeled target samples can alleviate the side effect of conditional shift.