Zero-Round Active Learning

07/14/2021 ∙ by Si Chen, et al. ∙ Harvard University; Virginia Polytechnic Institute and State University

Active learning (AL) aims at reducing labeling effort by identifying the most valuable unlabeled data points from a large pool. Traditional AL frameworks have two limitations: first, they perform data selection in a multi-round manner, which is time-consuming and impractical; second, they usually assume that a small amount of labeled data is available in the same domain as the data in the unlabeled pool. Recent work proposes a solution for one-round active learning based on data utility learning and optimization, which fixes the first issue but still requires initially labeled data points in the same domain. In this paper, we propose D^2ULO as a solution that addresses both issues. Specifically, D^2ULO leverages the idea of domain adaptation (DA) to train a data utility model that can effectively predict the utility of any given unlabeled data in the target domain once it is labeled. The trained data utility model can then be used to select high-utility data and, at the same time, provide an estimate of the utility of the selected data. Our algorithm does not rely on any feedback from annotators in the target domain and hence can be used to perform zero-round active learning or to warm-start existing multi-round active learning strategies. Our experiments show that D^2ULO outperforms existing state-of-the-art AL strategies equipped with domain adaptation over various domain shift settings (e.g., real-to-real and synthetic-to-real data). In particular, D^2ULO is applicable to the scenario where the source and target labels are mismatched, which is not supported by existing works.


1 Introduction

Deep neural networks have been successful on various tasks across different fields with the help of large-scale labeled datasets. However, data labeling processes are often expensive and time-consuming. One popular framework to reduce labeling costs is active learning (AL), which strategically selects and labels data instances from the unlabeled data pool with the goal of achieving comparable performance with fewer labeled instances.

In a typical AL framework [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], a learner begins with a small number of labeled data points and requests labels for more data points iteratively. At each round, a subset of points is selected based on its utility to the current model, which is trained on all the points selected in previous rounds. However, the multi-round nature could be a limitation for applying AL to real-world applications, because the most common labeling platforms, e.g., Amazon Mechanical Turk and annotation outsourcing companies, usually do not support a timely interaction between the learner and data annotators. Moreover, multi-round AL does not allow complete parallelization of labeling efforts, which can otherwise greatly improve annotation efficiency.

A recent work [13] proposes a framework, DULO, that brings AL to a new setting where the selection is performed in only one round. Specifically, DULO starts by querying the labels of a small number of randomly selected unlabeled instances; after the annotator returns the labels for these instances, it trains a utility model that takes a set of points as input and outputs the corresponding utility. The utility of any possible set of unlabeled instances can then be predicted by this model, and DULO selects the set of unlabeled instances via a greedy search that maximizes the utility model. While the one-round setting of DULO is attractive in practice, it still requires one round of interaction, as it needs feedback from annotators to obtain the initial labeled data. These initial labeled data are the basis for designing strategies for subsequent selection.

In this paper, we explore the possibility of zero-round AL and ask: can we select unlabeled data in a way that does not rely on any feedback from potential annotators yet works better than random selection? Such data selection strategies, if they exist, can be directly plugged into widely used labeling platforms to help reduce labeling costs. Moreover, they can serve as a warm-start for existing AL approaches, one-round or multi-round, which all require labeled data at the beginning. Our key idea for enabling effective zero-round data selection is inspired by the observation that labeled datasets are often available from related domains, and that in some applications, such as autonomous driving, off-the-shelf simulators can generate a large set of labeled data points related to the dataset to be labeled. Intuitively, these labeled data, although from a different domain, might still provide useful information about what types of data are worth labeling.

In this paper, we present Domain adaptive Data Utility function Learning and Optimization (D^2ULO), an algorithm that can leverage labeled datasets from a source domain to help select unlabeled data instances in the target domain. Importantly, our approach does not rely on any labeled instances from the target domain and hence provides a zero-round AL strategy. Specifically, we train a utility model that predicts the utility of any given unlabeled dataset, along with a feature extractor. We design our training scheme so that it forces the feature extractor to extract domain-invariant features that are, at the same time, effective for predicting the utility of a dataset. One important benefit of the utility modeling is that our approach can provide an estimate of the utility of the selected data, which is very useful in practice for learners to decide how many unlabeled points to select and annotate. We apply D^2ULO to unlabeled data selection on various object recognition tasks across domains. Experiments show that our algorithm achieves state-of-the-art results in various domain shift settings, including real source and target data as well as more challenging ones where the source domain is synthetic while the target domain is real, and where the source and target domains have label mismatches.

Compared with DULO, D^2ULO enables more novel applications. For example, since existing unsupervised domain adaptation often falls behind its supervised counterpart, it is necessary to select data points in the target domain to further improve performance, and D^2ULO provides a strategy that can select such data more efficiently. Besides, our method can also perform data selection on the target domain even when its labels are inconsistent with those of the source domain, while typical AL strategies cannot.

2 Related Work

Active Learning. Active learning aims to reduce labeling effort by selecting data that are most valuable for model training, and it usually proceeds in an iterative manner. Earlier works [1, 2, 3, 4, 5, 6, 7] select only one sample each round. Such AL strategies cannot parallelize labeling efforts and are often time-consuming in practice. Batch-mode active learning [8], by contrast, queries data in groups and hence improves learning efficiency; in particular, it can better handle models with slow training procedures (e.g., deep neural networks). Many other works have investigated batch-mode AL as well. For example, Core-set [9] performs k-center clustering to select informative data points while preserving their geometry. BADGE [8] attempts to capture the diversity and informativeness of data points in the gradient space and selects data whose gradients have diverse directions and high magnitudes. Other works [10, 12, 11] exploit the properties of submodular functions and hence further improve selection efficiency.

Different from the AL strategies above, which are designed to proceed iteratively until the labeling budget is exhausted, a recent work [13] creates a new setting for active learning: it proposes DULO for one-round AL, which selects the desired amount of unlabeled points all at once based on an initially labeled set. More specifically, DULO formulates one-round AL as the problem of maximizing a data utility function, which maps a dataset to some performance measure of the model trained on that set. In this paper, we propose D^2ULO and show that it is possible to get rid of labeled instances in the target domain altogether by exploiting information from a related but different domain where annotated data is available.

Domain Adaptation. Domain adaptation (DA) is a common solution to distribution shifts between the source and target domains. The core idea is to learn domain-invariant features so that the task model trained on the source domain can be readily applied to the target domain. Depending on the data available from the target domain, DA falls into three categories: unsupervised DA, semi-supervised DA, and supervised DA. Unsupervised DA, where labeled target data is not available, matches the problem setting studied in this paper. Earlier works in this setting focus on minimizing specific measures of distributional discrepancy in the feature space. For example, [14, 15, 16] characterize the distribution distance via the Maximum Mean Discrepancy (MMD) of kernel embeddings; [17, 18] utilize category predictions from two task classifiers to measure the domain discrepancy. These approaches were further improved by the use of an adversarial objective with a domain discriminator that tries to distinguish between source and target feature embeddings [19, 20, 21, 22]. However, adversarial training may encounter the technical difficulty of mode collapse [23]. A recent work [24] combines generative adversarial networks (GANs) with cycle-consistency constraints and effectively adapts representations at both the feature level and the pixel level.

Combining Active Learning and Domain Adaptation. Although active learning and domain adaptation are both possible solutions to the problem of insufficient labels, only a few works in the literature integrate the two methodologies into a single framework. [25] proposes a method that simultaneously re-weights source data and selects target data to query, so that the dataset consisting of re-weighted source samples, labeled target samples, and queried target samples is closest to the distribution of the unlabeled target data. [26, 27] propose ALDA, which consists of three models: a domain adaptation classifier that adapts the feature representation of the source domain; a domain classifier that avoids querying labels for "source-like" target data; and a source classifier that provides labels for those "source-like" target samples. AADA, recently proposed by [28], first trains an unsupervised domain adaptation classifier and then selects target samples using importance weights; model retraining and selection are performed iteratively. These methods share the same limitation as most existing AL strategies, as they are all designed to proceed for multiple rounds until the selection budget is exceeded. To the best of our knowledge, D^2ULO is the first method that combines domain adaptation with AL in the zero-round setting.

Figure 1: Overall workflow of D^2ULO.

3 Approach

Setting.

Existing AL strategies rely on a small amount of labeled data in the target domain. By contrast, our goal is to develop zero-round AL strategies, which do not require any labeled data in the target domain. We assume that there exists a source domain D_S whose distribution is closely related to the target distribution D_T; unlike in the target domain, labels for instances from D_S are already available or easily accessible.

In this section, we introduce our algorithm D^2ULO. The key idea is to first learn a data utility model that can predict the utility of any set of unlabeled instances. We leverage domain adaptation techniques to ensure that the model is useful for predicting the utility of unlabeled instances in the target domain, and we further use this model to guide data selection. We will denote labeled data by D_L, unlabeled data by D_U, and the input and output spaces by X and Y, respectively.

3.1 Overview

Input : a set of N samples chosen from the source training set; validation set D_val; classifier f; feature extractor F; metric function u.
Output : utility dataset U for DeepSets training.
1 Initialize utility dataset U ← ∅
2 for t = 1, …, T do
3       Randomly choose a subset S_t of the N samples. Train classifier f_t on S_t, compute u_t = u(f_t, D_val), and add (S_t, u_t) to U.
4 end for
return U
Algorithm 1 Data Utility Sampling

The concept central to our AL strategy is a data utility function, which maps any set of unlabeled instances to the performance of the ML model trained on the set once it is labeled. With such a function, AL can be done by simply selecting the unlabeled instances that maximize the output of the data utility model. Although data utility functions may have a closed form for certain types of learning algorithms and model performance metrics (e.g., the test classification accuracy of K-Nearest-Neighbors [10]), for most models data utility functions cannot be analytically derived. Recent work [13] proposed to learn data utility functions from data. Note that data utility functions are set functions: the input is a dataset and the output is a real value indicating the utility of the data. Hence, each training sample for data utility function learning consists of a set of data points and the corresponding utility score, indicating the performance of the ML model trained on that set. Constructing the training set for data utility learning could be expensive, because labeling each training sample requires re-training the model. Fortunately, [13] presents empirical evidence that learning data utility functions can be sample-efficient due to their "diminishing returns" property. Also, one can replace the original ML model with an efficiently trainable proxy model (such as logistic regression) while still retaining good data selection performance.

Note that data utility learning requires labeled data instances, which makes it possible to construct the training set. [13] assumes a small labeled set in the target domain for data utility learning. However, this assumption no longer holds in the zero-round AL setting. To resolve this problem, we propose to learn the data utility model on the source domain and mitigate the effects of domain shift via domain adaptation.

3.2 Algorithm

The workflow of our algorithm is summarized in Figure 1.

Utility Sampling.

The goal of this step is to construct the training set for learning data utility functions. Given a set of samples and a validation set D_val in the source domain, each time we randomly sample a subset and train a classifier on it. The utility of this subset is then given by a utility metric, which in this paper is the validation accuracy of the trained classifier on D_val. The utility training set is thus U = {(S_t, u_t)}_{t=1}^{T}. The general utility sampling workflow is demonstrated in Algorithm 1.
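Below is a minimal sketch of the utility sampling step (Algorithm 1), assuming a scikit-learn logistic regression as the small proxy classifier; the function name, subset sizes, and defaults are illustrative rather than the authors' released implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sample_utility_dataset(X_src, y_src, X_val, y_val,
                           n_subsets=2000, subset_size=64, seed=0):
    """Build (subset indices, utility) pairs for training the utility model.

    The utility of a subset is the validation accuracy of a small proxy
    classifier trained on that subset, as in Algorithm 1.
    """
    rng = np.random.default_rng(seed)
    utility_dataset = []
    for _ in range(n_subsets):
        idx = rng.choice(len(X_src), size=subset_size, replace=False)
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X_src[idx], y_src[idx])
        utility = clf.score(X_val, y_val)   # validation accuracy as utility metric
        utility_dataset.append((idx, utility))
    return utility_dataset
```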

Utility model training.

The goal of this step is to train a utility model that is effective for predicting the utility of unlabeled data in the target domain. Following [13], we adopt the popular set function model DeepSets [29] as the data utility model. DeepSets is a deep neural network with the properties of permutation invariance and equivariance, which make it suitable for set function modeling. Specifically, with the utility samples from the last step, a feature extractor F is used to obtain embeddings of the training instances in each sampled subset, and the DeepSets model maps the feature embeddings of a set of points to the set's corresponding utility.

Input : labeled source data D_S; unlabeled target data D_T; utility dataset U = {(S_i, u_i)}.
Model : feature extractor F; class predictor C; discriminator D; DeepSets utility model Û.
1 for each training iteration do
2       for k steps do
3             Train F, C, D with the domain adaptation losses L_GAN and L_cls
4       end for
5       Fix F; extract the feature embeddings of the utility dataset U; train the DeepSets model Û on the embedded utility samples
6       Fix Û; train F by minimizing the DeepSets loss L_DS
7 end for
return F, C, Û
Algorithm 2 D^2ULO

In the setting of interest to our paper, labeled data is not available in the target domain and the utility model can only be trained on data from another domain. Hence, domain adaptation is needed to mitigate the performance drop caused by domain shift.

A domain adaptation framework usually consists of three components: a feature extractor F, a class predictor C that takes the output embedding of F and makes class predictions, and a discriminator D that aims to distinguish between source and target domain data. DA typically has two goals: 1) map examples from the two domains to a common feature space; and 2) retain information useful for classification. These two goals are usually achieved by optimizing the GAN loss L_GAN and the classification loss L_cls, given by

L_GAN(F, D) = E_{x_s ~ D_S}[log D(F(x_s))] + E_{x_t ~ D_T}[log(1 - D(F(x_t)))],    (1)
L_cls(F, C) = E_{(x_s, y_s) ~ D_S}[ℓ_CE(C(F(x_s)), y_s)],    (2)

where ℓ_CE denotes the cross-entropy loss.
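For concreteness, the PyTorch fragment below sketches how these two losses might be computed in one adversarial DA step; it is a schematic illustration with assumed module interfaces (feat_extractor, classifier, discriminator), not code from any specific DA framework used in this paper.

```python
import torch
import torch.nn.functional as nnF

def da_losses(feat_extractor, classifier, discriminator, x_src, y_src, x_tgt):
    """Compute the classification loss (Eq. 2) and the discriminator GAN loss (Eq. 1).

    Convention: the discriminator is trained to output 1 on source features
    and 0 on target features; the feature extractor is updated adversarially.
    """
    f_src, f_tgt = feat_extractor(x_src), feat_extractor(x_tgt)

    # Eq. (2): cross-entropy on labeled source data.
    loss_cls = nnF.cross_entropy(classifier(f_src), y_src)

    # Eq. (1): binary GAN loss of the domain discriminator.
    d_src, d_tgt = discriminator(f_src), discriminator(f_tgt)
    loss_gan = nnF.binary_cross_entropy_with_logits(d_src, torch.ones_like(d_src)) \
             + nnF.binary_cross_entropy_with_logits(d_tgt, torch.zeros_like(d_tgt))
    return loss_cls, loss_gan
```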

We now discuss how to leverage domain adaptation in data utility learning to train a utility model useful for data selection in the target domain. A naive approach would break the task into two steps: 1) apply domain adaptation techniques to obtain a feature extractor F that extracts domain-invariant features; and 2) train a DeepSets utility model Û on the source features extracted by the pre-trained F. However, a feature extractor learned in this way is only optimized towards being useful for classification, ignoring the goal of being useful for predicting data utility. Hence, we propose a joint training process for F and Û that is mindful of the two goals simultaneously.

Specifically, given the labeled source data D_S, the unlabeled target data D_T, and a utility training set U obtained from Algorithm 1, we alternate between k steps of general domain adaptation training and one step of utility training. The former simply follows the usual DA framework. For the latter, a DeepSets model Û is first trained on U given the current feature extractor F; Û is then fixed and used to optimize F in turn, both under the objective of minimizing the DeepSets loss:

L_DS(F, Û) = Σ_{(S_i, u_i) ∈ U} (Û(F(S_i)) - u_i)^2,    (3)

where F(S_i) denotes applying the feature extractor to every point in S_i.

The main reason that we add L_DS is that the features suitable for classification may not be equally helpful for learning data utilities. For example, the best possible feature for a classification task would simply be the label of each data point; however, such a feature contains no information about the quality of the data points. Intuitively, L_DS encourages F to map the source and target domains to a feature space that is more suitable for data utility learning.

Note that D^2ULO can be combined with any state-of-the-art DA framework; we use CyCADA [24], UDA [30], and AFN [31] in this paper.
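A possible realization of one utility-training step in Algorithm 2 is sketched below in PyTorch: the feature extractor is first frozen while the DeepSets model is fitted, and the DeepSets model is then frozen while the feature extractor is updated with the same loss. Function and argument names are assumptions for illustration; the actual alternation schedule and optimizers may differ.

```python
import torch

def utility_step(feat_extractor, deepsets, utility_sets, utilities,
                 opt_deepsets, opt_feat):
    """One utility step of Algorithm 2 (a sketch).

    utility_sets: list of tensors, each holding one sampled subset S_i.
    utilities:    list of scalars u_i, the measured utility of each subset.
    """
    mse = torch.nn.MSELoss()

    # Step 1: fix F, train the DeepSets utility model on embedded utility samples.
    for p in feat_extractor.parameters():
        p.requires_grad_(False)
    for S, u in zip(utility_sets, utilities):
        pred = deepsets(feat_extractor(S)).squeeze()
        loss = mse(pred, torch.as_tensor(u, dtype=pred.dtype))
        opt_deepsets.zero_grad(); loss.backward(); opt_deepsets.step()
    for p in feat_extractor.parameters():
        p.requires_grad_(True)

    # Step 2: fix the DeepSets model, update F by minimizing the same DeepSets loss.
    for p in deepsets.parameters():
        p.requires_grad_(False)
    for S, u in zip(utility_sets, utilities):
        pred = deepsets(feat_extractor(S)).squeeze()
        loss = mse(pred, torch.as_tensor(u, dtype=pred.dtype))
        opt_feat.zero_grad(); loss.backward(); opt_feat.step()
    for p in deepsets.parameters():
        p.requires_grad_(True)
```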

Unlabeled Data Selection.

The last step of D^2ULO is to seek the unlabeled data that attain maximal utility under the learned utility model. Formally, we solve the following optimization problem:

max_{S ⊆ D_T, |S| ≤ B} Û(F(S)),    (4)

where D_T is the unlabeled target pool and B is the selection budget.

Similar to [13], we apply a stochastic greedy algorithm to solve it.
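The sketch below illustrates stochastic ("lazier-than-lazy") greedy selection against the learned utility model, following the general recipe of [43]; the subsample-size formula, the ε default, and all identifiers are illustrative assumptions.

```python
import numpy as np
import torch

def stochastic_greedy_select(candidates, feat_extractor, deepsets,
                             budget, eps=0.01, seed=0):
    """Greedily grow a subset that maximizes the predicted utility (Eq. 4).

    At each step only a random subsample of the remaining pool is evaluated,
    keeping the number of utility-model queries roughly linear in the pool size.
    """
    rng = np.random.default_rng(seed)
    n = len(candidates)
    remaining = list(range(n))
    selected = []
    sample_size = max(1, int(np.ceil(n / budget * np.log(1.0 / eps))))

    with torch.no_grad():
        feats = feat_extractor(candidates)                 # embed the pool once
        for _ in range(budget):
            pool = rng.choice(remaining,
                              size=min(sample_size, len(remaining)),
                              replace=False)
            best_util, best_idx = -float("inf"), None
            for j in pool:
                trial = selected + [int(j)]
                util = deepsets(feats[trial]).item()       # predicted set utility
                if util > best_util:
                    best_util, best_idx = util, int(j)
            selected.append(best_idx)
            remaining.remove(best_idx)
    return selected
```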

4 Evaluation

4.1 Evaluation Settings

4.1.1 Evaluation Protocol

We use two approaches to evaluate the utility of the selected subset: 1) Train-from-Scratch: we train a model from scratch on the data points selected from the target domain, and the utility of the selected points is given by the trained model's accuracy; 2) Fine-tune: we adopt the fine-tuning method proposed in [32] to fine-tune the classifier obtained from our algorithm. Specifically, given a batch of labeled target samples chosen by the strategy, we compute the centroid of each class in the feature space and generate a hypothesized label for each unlabeled sample based on its similarities to the different centroids. We use the inverse of the Wasserstein distance [33] as the similarity metric.
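The fragment below sketches the centroid-based hypothesized labeling used in the Fine-tune evaluation. The text specifies the inverse Wasserstein distance as the similarity; treating each feature vector as a one-dimensional empirical distribution is only one possible reading of that metric, and all names here are illustrative.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def hypothesize_labels(unlabeled_feats, labeled_feats, labeled_ys, n_classes):
    """Assign each unlabeled sample the label of its most similar class centroid."""
    centroids = np.stack([labeled_feats[labeled_ys == c].mean(axis=0)
                          for c in range(n_classes)])
    hypo_labels = []
    for f in unlabeled_feats:
        # Similarity = inverse Wasserstein distance to each class centroid.
        sims = [1.0 / (wasserstein_distance(f, c) + 1e-8) for c in centroids]
        hypo_labels.append(int(np.argmax(sims)))
    return np.array(hypo_labels)
```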

4.1.2 Baseline Algorithms

For baseline algorithms, we combine state-of-the-art active learning strategies with domain adaptation. Specifically, we pre-train a feature extractor that minimizes the distance between the source and target domains in the feature space, apply it to extract features for the unlabeled data pool, and perform active data selection on the extracted features. Note that most of these existing AL strategies are not directly applicable to the zero-round AL setting.

We compare D^2ULO with the following state-of-the-art batch active learning strategies equipped with domain adaptation:

  • FASS. [10] performs subset selection as maximization of Nearest Neighbor submodular function on unlabeled data with hypothesized labels.

  • BADGE. [34] selects a subset of samples with hypothesized labels whose gradients span a diverse set of directions.

  • GLISTER. [11] formulates the selection as a discrete bi-level optimization on samples with hypothesized labels.

  • AADA. [28] uses a sample selection criterion which is the product of importance estimation and entropy of unlabeled data.

  • Random. In this setting we randomly select a subset from all the unlabeled target data.

Moreover, we also train an "optimal" DeepSets model on labeled target domain data, which serves as an upper bound on the active learning performance achievable with only labeled source domain data available. We label this upper bound Optimal. Note that this upper bound is not realizable in the zero-round AL setting because of the lack of labeled target domain data; we plot it in the figures only to better understand how much room there is for further improvement of our strategy.

4.1.3 Datasets and Implementation Details

Source Target Domain Adaptation
MNIST[35] USPS CyCADA[24]
USPS [36] MNIST CyCADA
SVHN [37] MNIST CyCADA
CIFAR-10 [38] STL-10 [39] UDA [30]
VISDA-Synthetic [40] VISDA-Real AFN [31]
MNIST-04 MNIST-59 N/A
MNIST-04 USPS-59 CyCADA
Table 1: Dataset and Training Settings.

Table 1 summarizes the datasets and implementation settings. We evaluate the performance of D^2ULO and the baseline approaches on four pairs of domain shifts: MNIST → USPS, USPS → MNIST, SVHN → MNIST, and CIFAR-10 → STL-10. We also evaluate two more challenging transfer settings, where the source domain has inconsistent labels with the target domain: MNIST with digits 0-4 → MNIST with digits 5-9, and MNIST with digits 0-4 → USPS with digits 5-9. None of the baselines is applicable to these two settings by design. Following the settings in prior work [11, 13], we also examine the effectiveness of different strategies on robust data selection, where part of the data is corrupted by white noise.

For all the source datasets, we randomly sample 300 (MNIST, USPS) or 500 (SVHN, CIFAR-10, STL-10) data points of the training set to perform the data utility sampling described in Algorithm 1. We follow the implementation of DULO [13] for the number of sampled subsets and split the obtained utility dataset into training and validation sets. We use small models (i.e., SVM, logistic regression, and a small CNN) as the classifier in Algorithm 1 to obtain the data utilities, because the classifier needs to be trained thousands of times to construct a utility dataset. DULO [13] empirically finds that data utility functions for small models are positively correlated with those for large models. Since data selection based on utility models only relies on the relative utility values between different sets, utility models trained on samples obtained from small proxy models can still be useful for selecting data for large models.

We consider state-of-the-art domain adaptation techniques for the specific transfer settings. Specifically, we combine our method with three different DA frameworks: CyCADA [24], UDA [30], and AFN [31]; the DA framework used for each transfer setting is given in Table 1. For training the DeepSets model, we use the same hyper-parameters as [13]: the Adam optimizer with a mini-batch size of 32, β1 = 0.9, and β2 = 0.999.

For Fine-tune performance evaluation, the starter model is the class predictor C from the corresponding domain adaptation framework in each setting. For Train-from-Scratch evaluation, we use three types of models to measure performance: 1) an SVM, implemented with scikit-learn [41]; 2) a logistic regression model; and 3) a small CNN with two convolutional layers, two max-pooling layers, and three fully-connected layers, trained with the Adam optimizer.

We use a GeForce RTX 2080 Ti for the experiments on VISDA and an NVIDIA Tesla K80 GPU for all other experiments.

4.2 Experiment Results

Real-to-Real Adaptation.

We start by comparing D^2ULO with the baselines on various domain shifts between real datasets.

Figure 2 shows the results averaged over multiple random seeds. The x-axis shows the number of target samples selected by different strategies, and the y-axis shows the accuracy of the model trained on the selected points. D^2ULO outperforms all baselines across the various shifts. Interestingly, the margin between D^2ULO and Optimal is small, except in Figure 2 (d), where Optimal is worse than D^2ULO; this may be caused by overfitting of the DeepSets model.

Figure 2: Performance of D^2ULO on various adaptation shifts. The first row gives the results of Train-from-Scratch, where 'SVM', 'Logistic' and 'SmallCNN' indicate the model used for obtaining utilities. The second row gives the results of Fine-tune, and the starting points are the classifier accuracy on the target validation set after domain adaptation.
Figure 3: True Utility vs. Estimated Utility.

Another advantage of D^2ULO over existing AL strategies is that we can provide a utility estimate for the selected data using the data utility model. Such a utility estimate could be very useful in practice for making an informed decision about the labeling budget. We compare our DeepSets-estimated utility with the true utility in the MNIST → USPS setting. Here, we randomly choose 300 data points of the target domain (USPS) and perform Algorithm 1 to sample 4,000 subsets. The true utilities are given by an SVM model trained on the subsets. Note that these 300 data points are unseen during DeepSets training. Figure 3 shows that the estimated utility is positively correlated with the true utility but systematically underestimates it. Hence, the utility estimates provided by D^2ULO can serve as a lower bound on the actual utility, which is still useful for guiding the choice of labeling budget. With better modeling of the relationship between the estimated and the true utility, one may be able to correct the bias in our estimation; we leave the exploration of this interesting direction to future work. Figure 3 also sheds light on the strong ability of D^2ULO to differentiate unlabeled data quality in the target domain, even without access to any labeled data from that domain.

Synthetic-to-Real Adaptation.

We further study the effectiveness of different strategies in the synthetic-to-real transfer setting. This setting has great practical value because in many application domains there exist sophisticated simulators that can generate a large amount of labeled data. We experiment on the VISDA-2017 dataset, which has a significant synthetic-to-real domain gap. The source domain of VISDA consists of synthetic images generated by rendering 3D models; the target domain consists of real object images collected from Microsoft COCO [42] and contains natural variations in image quality.

As shown in Figure 4, D^2ULO achieves the best performance among all the strategies. We also notice that even a very small amount of labeled target data can improve the classifier accuracy by a large margin. For instance, in Figure 4 (b), the Fine-tune accuracy increases rapidly at the beginning, when only 100 data points are selected. This emphasizes the need for selecting data in the target domain to further improve domain adaptation performance. Since the target domain is relatively clean, the random baseline already works very well.

Figure 4: VISDA-2017 results: synthetic → real. (a) gives the results of Train-from-Scratch; (b) and (c) give the results of Fine-tune. One interesting finding is that all the strategies achieve a large improvement in classification accuracy, which indicates the need to select data points from the target domain.
Figure 5: Performance of D^2ULO on domains that have inconsistent label spaces.
Label Mismatch.

There are many real-world datasets that have no overlap in the label space or share only a few common classes. Hence, we also conduct experiments in the setting where the source domain has entirely different object categories from the target domain. Specifically, we use digits 0-4 of the MNIST dataset as the source domain and digits 5-9 of the MNIST and USPS datasets as the target domains. This setting has great practical value yet has not been studied in previous AL literature, because most AL strategies rely on hypothesized labels generated by a classifier trained on the source domain and thus become infeasible here. For the same reason, we omit the Fine-tune performance metric and only report the accuracy of Train-from-Scratch. As we can see from Figure 5, D^2ULO outperforms random selection by a large margin in both settings and is comparable to Optimal.

5 Conclusion

In this paper, we propose D^2ULO, a zero-round active learning strategy that does not require any labeled data in the target domain. We propose a novel training algorithm for the data utility model, which extracts features from a dataset that are useful for both classification and utility prediction. We evaluate the effectiveness of D^2ULO on various types of domain shifts and show that it achieves state-of-the-art performance.

There are many interesting avenues for future work. For instance, in our experiments, we observe that the DeepSets-based utility learning often overfits the training samples, which directly affects the efficacy of the subsequent data selection task. One interesting direction is to develop preventative measures against overfitting in DeepSets training via new training algorithms, model architectures, and regularization techniques. It is also interesting to study how to customize synthetic data generation to the goal of improving active learning performance.

References

Appendix A Details of Datasets Used in Section 4

MNIST [35].

The MNIST dataset contains a training set of 60,000 examples and a test set of 10,000 examples. The images are 28×28 grayscale handwritten digits. We resize the images to 32×32 in the SVHN → MNIST setting.

USPS [36].

The USPS dataset is a digit dataset scanned from envelopes. It contains a total of 9,298 grayscale images of size 16×16. We resize them to the MNIST resolution (28×28) in both the MNIST → USPS and USPS → MNIST settings.

SVHN [37].

SVHN is a real-world color house-number dataset containing 73,257 images for training and 26,032 images for testing. We use the version where all digits have been resized to 32×32 pixels.

CIFAR-10 [38].

CIFAR-10 is an image recognition dataset containing 60,000 3-channel images in 10 classes.

STL-10 [39].

The STL-10 dataset consists of 13,000 color images of size 96×96 in 10 classes. We resize them to the CIFAR-10 resolution (32×32) in the experiments.

VISDA2017 [40].

The VISDA2017 dataset is designed for the unsupervised domain adaptation challenge and contains more than 280K images across 12 object categories with large domain gaps. The source domain consists of synthetic 2D renderings of 3D models under different angles and lighting conditions; the target domain consists of photo-realistic or real images. In the experiments, we resize all the images and then crop them at the center. An example synthetic-real image pair is shown in Figure 6.

Figure 6: Example images in VISDA2017. The left is an image of source domain (synthetic) while the right is an image of target domain (real).

Appendix B Details of Models and Baseline Algorithms in Section 4

SVM.

We use Linear Support Vector Classification (LinearSVC) implemented by scikit-learn [41] with an L2 penalty; other settings remain at their defaults.

Logistic Regression.

We use Logistic Regression implemented by scikit-learn [41]. We set the maximum number of iterations to be 1000.

Small CNN.

The small CNN model we used has two convolutional layers, two max-pooling layers, and three fully-connected layers. We use the Adam optimizer with a batch size of 32 for training it.

DeepSets Model.

A DeepSets model can be represented as ρ( Σ_{x ∈ S} φ(x) ), where both φ and ρ are neural networks. In our experiments, both φ and ρ contain three linear layers with ELU activations; we set the number of neurons in each hidden layer to 256 and the dimension of the set feature (the output of the φ network) to 256. For training DeepSets models, we use the Adam optimizer with a batch size of 32, β1 = 0.9, and β2 = 0.999.
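A minimal PyTorch sketch of a DeepSets utility model with the sizes described above (three linear layers with ELU activations, 256 hidden units, 256-dimensional set feature); layer counts and defaults are illustrative, not the exact released architecture.

```python
import torch
import torch.nn as nn

class DeepSets(nn.Module):
    """DeepSets utility model: rho( sum_{x in S} phi(x) ) -> scalar utility."""

    def __init__(self, in_dim, hidden=256, set_feat_dim=256):
        super().__init__()
        # phi: per-element encoder.
        self.phi = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, set_feat_dim),
        )
        # rho: maps the pooled set feature to a scalar utility.
        self.rho = nn.Sequential(
            nn.Linear(set_feat_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):                      # x: (set_size, in_dim)
        pooled = self.phi(x).sum(dim=0)        # permutation-invariant sum pooling
        return self.rho(pooled).squeeze(-1)    # predicted utility of the set
```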

Baseline AL Techniques.

We use the BADGE, FASS, and GLISTER implementations from DISTIL (https://github.com/decile-team/distil). Specifically, we set the batch size to 32 for all three strategies and the learning rate to 0.001 for GLISTER.

Appendix C Other Implementation Details

Domain Adaptation.

We test our method with three state-of-the-art domain adaptation frameworks in this paper: CyCADA [24], UDA [30], AFN [31].

For CyCADA (https://github.com/jhoffman/cycada_release), we follow the official implementation: a source classifier is first trained using the Adam optimizer with a batch size of 128, and its weights are then used to initialize the target classifier for domain adaptation. The same optimizer is used for training the target classifier. We set the number of domain adaptation steps k in Algorithm 2 to 10.

For UDA (https://github.com/yueatsprograms/uda_release), we use the SGD optimizer with an initial learning rate of 0.1, decayed to 0.001 after 10 epochs, and we set k to 5.

For AFN (https://github.com/jihanyang/AFN), we use the SGD optimizer with a learning rate of 0.001 and weight decay for training the feature extractor, and the SGD optimizer with a learning rate of 0.001, momentum of 0.9, and weight decay for training the class predictor. We set k to 5.

When integrating each of the above three DA frameworks into DULO, we use the same Adam optimizer settings for back-propagating the DeepSets loss.

Data Selection.

We apply stochastic greedy optimization [43] to solve Equation (4).