1 Introduction
Domain adaptation is the task of adapting a model trained on a label-sufficient source domain to a label-scarce target domain when their input distributions are different. It has played crucial roles in applications that involve significant input distribution shifts, such as recognition under adverse conditions (e.g., climate changes [10, 49, 48, 28] and nighttime [47]) and synthetic-to-real adaptation [38]. The most popular direction in this field is unsupervised domain adaptation [4, 15], which assumes a totally unlabeled target domain. However, in practice, labeling a small portion of target data is usually feasible. Hence, label-efficient domain adaptation tasks such as semi-supervised domain adaptation [46, 64, 29, 30, 65] and active domain adaptation [55, 14, 39, 43] have attracted increasing attention.
In this paper, we consider active domain adaptation (ADA) [55, 14, 39, 43], where we can interact with an oracle to obtain annotations on a subset of target data under a budget constraint, while utilizing the annotations for domain adaptation. The key to the success of ADA is to co-design the sampling mechanism, which selects the subset of target data to be annotated, and the utilization of the annotations. Existing ADA methods utilize the obtained annotations only for supervised learning, similar to existing Active Learning (AL) methods [50, 54, 3, 59]. Accordingly, they consider diversity, representativeness, and uncertainty of the data to boost the effect of supervised learning.
We argue that for domain adaptation, there is another use of the sampled data, which deserves attention but is missing in the previous work: matching the label distributions of the source and target domains. In practice, domain adaptation often encounters label distribution shift, i.e., the frequencies of classes significantly differ between the source and target domains. It has been proven in [66, 6] that matching the label distributions of the source and target domains is a necessary condition for successful domain adaptation. Also, it has been empirically verified in [6] that mismatched label distributions restrict or even deteriorate the performance of existing domain adaptation methods [15, 32, 33].
Motivated by this, we present a new method that addresses the label distribution shift for the first time in ADA. At the heart of our method lies LAbel distribution Matching through Density-aware Active sampling, and thus it is dubbed LAMDA. Its key idea is to use sampled data for label distribution matching as well as supervised learning. During training, it estimates the label distribution of the target domain through the annotated labels of sampled target data, and builds each source data mini-batch in a way that the label frequencies of the batch follow the estimated target label distribution. To this end, we design a new sampling strategy useful for label distribution estimation as well as supervised learning. For supervised learning, sampled data are encouraged to be representative, diverse, and uncertain. For label distribution estimation, on the other hand, sampled data should well approximate the entire data distribution of the target domain. As will be demonstrated empirically, existing ADA methods often fail to satisfy the second condition since they blindly select uncertain instances or do not take the overall target distribution into account.
Our sampling method satisfies both of the above conditions. Specifically, it selects a subset of target data whose statistical distance from the entire target data is minimized. Since the distribution of the sampled data well approximates that of the entire target data, their labels are expected to follow the latent target label distribution. They also spontaneously become diverse and representative in order to cover the entire target data distribution. In addition, LAMDA asks the oracle to label only the uncertain instances in the sampled subset; it in turn utilizes the manually labeled samples for both supervised learning and label distribution estimation, while the rest are assigned pseudo-labels by the model's predictions and used only for label distribution estimation. This strategy lets LAMDA annotate and exploit only uncertain data in the subset for supervised learning, while estimating the target label distribution accurately by using the entire subset. The advantage of our sampling method is illustrated in Fig. 1.
In addition, we propose to use the cosine classifier [16, 41], instead of the conventional linear classifier, in order to further alleviate the adverse effect of label distribution shift. The cosine classifier is known to be less biased toward dominant classes since its classification weights are normalized, and thus has been used for long-tailed recognition [24] and few-shot learning [41, 16, 5]. We find that such a property is also useful to combat label distribution shift; it is empirically verified that the cosine classifier significantly improves ADA performance when combined with a domain alignment method.

To evaluate and compare LAMDA with existing ADA methods thoroughly, we present a unified evaluation protocol for ADA. Extensive experiments based on this protocol demonstrate the impressive performance of LAMDA, which largely surpasses that of existing ADA methods [14, 39, 43] on four public benchmarks for domain adaptation [58, 56, 37, 38]. The main contribution of this paper is threefold:


LAMDA is the first attempt to tackle the label distribution shift for ADA. The importance of this research direction is demonstrated by the outstanding performance of LAMDA.

We propose a new sampling strategy for choosing target data best preserving the entire target data distribution as well as being representative, diverse, and uncertain. Data selected by our strategy are useful for both label distribution matching and supervised learning.

In our experiments on each of the four domain adaptation datasets, LAMDA substantially outperforms all the existing ADA models.
2 Related work
Unsupervised domain adaptation (UDA). Major approaches in UDA aim at learning domain-invariant features so that a classifier trained on the labeled source domain data can be transferred to the unlabeled target domain data [4]. To do so, previous methods align the feature distributions of the two domains using various domain discrepancy measures such as MMD [31, 33], Wasserstein discrepancy [7, 8, 11, 27], and divergence [1, 15, 32, 9, 40, 57]. On the other hand, recent studies [66, 6] found that such domain alignment is only effective when the label distributions of the two domains are matched. This condition is difficult to satisfy due to limited access to the target class distribution. In this work, we propose to utilize the actively sampled data in ADA to estimate the target label distribution and match the label distributions of the two domains for effective domain alignment.
Active learning (AL). AL is the task of selecting the most performance-profitable samples to be annotated by an oracle [51]. Previous methods design various selection strategies, where they often refer to uncertainty [2, 19, 36], diversity [50, 54], or both [3, 60, 59] for the selection. Uncertainty-based methods prefer samples that are difficult for the model, e.g., samples with high entropy. Diversity-based methods prefer samples that are different from the already selected ones. Our method shares a similar idea with Wang et al. [60, 59] in that we use MMD [17], but we additionally select easy-but-representative samples as a pseudo-labeled set to precisely estimate the target label distribution, which in turn helps the domain adaptation process.
Active domain adaptation (ADA). ADA is a variant of active learning that selects samples so as to maximize domain adaptation performance. ADA was first introduced by Rai et al. [42] and first adapted to image classification by AADA [55]. Existing methods mainly refer to the difficulty of samples (i.e., uncertainty) for selection. TQS [14] selects uncertain samples by combining three sampling criteria: disagreement among ensemble models [52], the top-2 margin of predictive probabilities [45], and the confidence of a domain discriminator [55]. CLUE [39] additionally considers the diversity among the selected samples along with their uncertainty by using entropy-weighted k-means clustering [20]. VAADA [43] designs a set-based scoring function that favors three properties: vulnerability to adversarial perturbation, diversity within the sampled set, and representativeness to avoid outliers. More recent methods utilize the free energy biases [26] of the two domains [61], the K-medoids algorithm [44, 12], and the distance to different class centers [62] for the selection. To newly tackle the critical issue of label distribution shift in ADA, we propose a sampling strategy that considers the data distribution of the target domain. The main technical difference between the sampling of conventional ADA methods and ours is illustrated in Fig. 1.

3 Problem formulation
Given a labeled source dataset $\mathcal{D}_s$ of size $N_s$ and an unlabeled target dataset $\mathcal{D}_t$ of size $N_t$, we study a standard ADA scenario of $R$ rounds, in each of which $b$ samples of target data are newly labeled and utilized for model update, i.e., the per-round budget is $b$ and the total budget is $B = Rb$. Let $\mathcal{L}_t$ be the labeled target dataset actively collected, which grows up to size $B$. We consider image classification such that $x$ is an image and $y \in \{1,\dots,C\}$ is a categorical variable, where a model $h$, parameterized by $\theta$, predicts $p(y \mid x;\theta)$ for an input image $x$. The goal of ADA is to maximize the test accuracy of $h$ in the target domain, where $h$ is trained on $\mathcal{D}_s$, $\mathcal{D}_t$, and $\mathcal{L}_t$ in the iterative manner described above.
4 Proposed method

4.1 Overview of LAMDA
We present a novel ADA method, named LAMDA, that addresses label distribution shift between source and target domains. Our core idea is to select and utilize target samples useful for both label distribution matching and supervised learning. This idea is implemented in LAMDA by three components: prototype sampling, label distribution matching, and model training. First, LAMDA selects a set of prototypes, i.e., target data that best approximate the entire target data distribution. Uncertain prototypes in the set are then identified by the model and annotated by the oracle, while the rest are assigned pseudo-labels (Sec. 4.2). Next, LAMDA estimates the target label distribution using the assigned labels of the prototypes and adjusts the label distribution of the source data drawn within each mini-batch according to the estimated target label distribution (Sec. 4.3). Under the matched label distributions, the model is trained with both a cross-entropy loss and a domain adversarial loss (Sec. 4.4). The overall framework of LAMDA is illustrated in Fig. 2. In what follows, we describe each component at a given round.
4.2 Prototype set sampling in target data
We begin with a model $h$, which is from the previous round or pretrained on the source dataset for the first round. Let $X_t$ denote the set of images in $\mathcal{D}_t$ for notational simplicity. To select the prototype set that represents the target data distribution, we first seek a subset $S \subset X_t$ that minimizes a statistical distance between $S$ and the entire target data $X_t$. Inspired by the sampling technique for example-based model explanation [25], we employ the squared Maximum Mean Discrepancy (MMD) [17] between $S$ and $X_t$ on the feature space, which is formally given by

$\mathrm{MMD}^2(S, X_t) = \frac{1}{|S|^2}\sum_{x, x' \in S} k\big(f(x), f(x')\big) - \frac{2}{|S||X_t|}\sum_{x \in S}\sum_{x' \in X_t} k\big(f(x), f(x')\big) + \frac{1}{|X_t|^2}\sum_{x, x' \in X_t} k\big(f(x), f(x')\big),$  (1)

where $f(x)$ denotes the feature of input $x$ extracted by $h$, and $k(\cdot,\cdot)$ is the Radial Basis Function (RBF) kernel. Noting that the last term in Eq. (1) is constant with respect to $S$, we define $J(S)$ as follows:

$J(S) = \frac{2}{|S||X_t|}\sum_{x \in S}\sum_{x' \in X_t} k\big(f(x), f(x')\big) - \frac{1}{|S|^2}\sum_{x, x' \in S} k\big(f(x), f(x')\big) + c,$  (2)

where the constant $c$ is added to keep $J(S)$ non-negative, and the first and second terms measure the representativeness and diversity of $S$, respectively.
The prototypes can then be identified by a constrained combinatorial optimization that maximizes $J(S)$ given a certain size limit $m$, i.e.,

$S^{*} = \arg\max_{S \subseteq X_t,\, |S| \le m} J(S).$  (3)

This is generally intractable due to the exponentially many candidate subsets. However, a greedy process that selects samples one by one to locally maximize $J(S)$ can efficiently find a near-optimal solution in polynomial time since $J(S)$ is monotone submodular when $k$ is an RBF kernel [25]. To be specific, the greedy process is proven to achieve at least $(1 - 1/e)$ of the optimum [35]. We hence adopt the greedy process to select the subset $S$ from the unlabeled target data.
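For illustration, below is a straightforward (unoptimized) NumPy sketch of this greedy selection under an RBF kernel; the function and variable names are our own, and `F` is assumed to be the matrix of target features extracted by the current model.

```python
import numpy as np

def rbf_kernel(F, gamma):
    """RBF kernel matrix for an (n, d) feature matrix F."""
    sq = np.sum(F ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * F @ F.T
    return np.exp(-gamma * dist2)

def greedy_prototypes(F, m, gamma):
    """Greedily pick m indices that (approximately) maximize J(S) of Eq. (2).
    Written for clarity rather than speed."""
    n = F.shape[0]
    K = rbf_kernel(F, gamma)
    col_mean = K.mean(axis=0)          # (1/n) * sum_i k(x_i, x_j) for each candidate j
    selected = []
    for _ in range(m):
        best_i, best_J = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            S = selected + [i]
            s = len(S)
            # Eq. (2): representativeness term minus within-set similarity (diversity penalty)
            J = (2.0 / s) * col_mean[S].sum() - K[np.ix_(S, S)].sum() / (s * s)
            if J > best_J:
                best_J, best_i = J, i
        selected.append(best_i)
    return selected
```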
We note that setting $m = b$ and spending all the budget on $S$ would be a waste of budget when $S$ includes easy prototypes, whose labels are accurately predicted by $h$. We hence set $m$ in an adaptive way so that we spend the budget only on hard prototypes. To be specific, starting from $\mathcal{L}_t$ and $\mathcal{P}_t$ of the previous round (or empty sets for the first round), each greedy selection is added to either $\mathcal{L}_t$ or $\mathcal{P}_t$. $\mathcal{P}_t$ includes only easy prototypes, whose margin between the top-1 and top-2 predictions is larger than a threshold $\tau$, and only the hard prototypes in $\mathcal{L}_t$ are labeled by the oracle. For $\mathcal{P}_t$, we use the top-1 prediction as the pseudo-label, which is given by

$\hat{y} = \arg\max_{c \in \{1,\dots,C\}} p(y = c \mid x; \theta).$  (4)
In each round, we continue the sampling process until $b$ hard samples are newly annotated by the oracle. Thus, $m$ is determined by the adaptation scenario and the model at hand. This is possible because the greedy selection can return a near-optimal solution at any iteration. The sampling process is illustrated in Fig. 2 and described formally in Algorithm 1.
We denote the set of labeled prototypes by $\mathcal{L}_t$ and that of pseudo-labeled prototypes by $\mathcal{P}_t$. $\mathcal{L}_t$ is used for both supervised learning and label distribution estimation, while $\mathcal{P}_t$ is used only for label distribution estimation; details are described in the following section.
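A schematic sketch of the margin-based split described above is given below (PyTorch-style); it assumes `model` returns class logits and uses a hypothetical threshold argument `tau`, so it should be read as an illustration of the margin rule and Eq. (4), not as the exact Algorithm 1.

```python
import torch
import torch.nn.functional as F

def split_by_margin(model, prototypes, tau=0.8):
    """Split greedily selected prototypes into hard ones (sent to the oracle)
    and easy ones (pseudo-labeled), using the top-1/top-2 prediction margin."""
    hard, easy = [], []
    with torch.no_grad():
        for x in prototypes:
            probs = F.softmax(model(x.unsqueeze(0)), dim=1).squeeze(0)
            top2 = torch.topk(probs, k=2).values
            margin = (top2[0] - top2[1]).item()
            if margin > tau:                      # confident: keep the top-1 pseudo-label
                easy.append((x, int(probs.argmax())))
            else:                                 # uncertain: request oracle annotation
                hard.append(x)
    return hard, easy
```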
4.3 Label distribution matching
We use the prototype set to estimate the target label distribution, which is in turn used for label distribution matching. To this end, we count the frequency of each class within $\mathcal{L}_t$ and $\mathcal{P}_t$. The frequency $N_c^{\mathcal{L}}$ of class $c$ in $\mathcal{L}_t$ is computed by

$N_c^{\mathcal{L}} = \sum_{(x, y) \in \mathcal{L}_t} \mathbb{1}[y = c],$  (5)
where $\mathbb{1}[\cdot]$ is an indicator function. On the other hand, the class frequency in $\mathcal{P}_t$ is weighted by the corresponding predictive probability, which is given by

$N_c^{\mathcal{P}} = \sum_{(x, \hat{y}) \in \mathcal{P}_t} p(y = c \mid x; \theta)\, \mathbb{1}[\hat{y} = c].$  (6)
Then, the target label distribution is estimated by

$\hat{q}_t(c) = \frac{N_c^{\mathcal{L}} + N_c^{\mathcal{P}} + 1}{\sum_{c'=1}^{C}\big(N_{c'}^{\mathcal{L}} + N_{c'}^{\mathcal{P}} + 1\big)},$  (7)

where $c \in \{1,\dots,C\}$ and $C$ is the number of classes. Note that we add an offset of 1 to each category frequency in Eq. (7) to ensure that at least a single instance is considered to be present for every class in the target domain. This is consistent with the assumption of UDA, where both domains share the same label space. To make the observed source label distribution follow $\hat{q}_t$, we apply class-weighted sampling when building source mini-batches. The ratio between the source label distribution $q_s$ and the estimated target label distribution is denoted by $w(c) = \hat{q}_t(c)/q_s(c)$. Then, the probability of sampling $(x_i, y_i)$ from $\mathcal{D}_s$ for source mini-batch construction is defined by

$P(i) = \frac{w(y_i)}{\sum_{j=1}^{N_s} w(y_j)},$  (8)

where $i$ indicates the sample index.
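The estimation of Eqs. (5)–(7) and the class-weighted source sampling of Eq. (8) amount to simple counting and re-weighting; the following NumPy sketch, with hypothetical helper names, illustrates one way they could be implemented.

```python
import numpy as np

def estimate_target_label_distribution(labeled, pseudo_labeled, num_classes):
    """labeled: list of (x, y); pseudo_labeled: list of (x, y_hat, confidence)."""
    counts = np.ones(num_classes)                 # offset of 1 per class, as in Eq. (7)
    for _, y in labeled:
        counts[y] += 1.0                          # Eq. (5): hard counts from oracle labels
    for _, y_hat, conf in pseudo_labeled:
        counts[y_hat] += conf                     # Eq. (6): confidence-weighted counts
    return counts / counts.sum()                  # Eq. (7): estimated target label distribution

def source_sampling_probs(source_labels, q_target):
    """Eq. (8): per-sample probabilities so that a source mini-batch follows q_target."""
    source_labels = np.asarray(source_labels)
    q_source = np.bincount(source_labels, minlength=len(q_target)) / len(source_labels)
    w = q_target / np.maximum(q_source, 1e-12)    # per-class ratio w(c)
    probs = w[source_labels]
    return probs / probs.sum()
```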
4.4 Model training
Loss functions. As the label frequencies of a source mini-batch match those of the target domain by Eq. (8), we can now apply a domain alignment loss while alleviating the label distribution shift. We choose the domain adversarial loss [15], but any other loss for domain alignment [15, 32, 33] can be employed. For domain adversarial training, a domain discriminator $D$, parameterized by $\phi$, is trained to classify the domain of an input feature by the probability $D_\phi(f(x))$ of belonging to the source domain. In the meantime, the feature extractor $f$ is adversarially trained to confuse the discriminator. The domain adversarial loss with the matched label distributions is given by

$\mathcal{L}_{\mathrm{adv}} = \mathbb{E}_{x_s}\big[\log D_\phi(f(x_s))\big] + \mathbb{E}_{x_t}\big[\log\big(1 - D_\phi(f(x_t))\big)\big],$  (9)

where the first expectation is taken over the class-weighted sampling distribution of Eq. (8) on $\mathcal{D}_s$ and the second one is taken over the uniform distribution on $\mathcal{D}_t$. The discriminator is updated to maximize $\mathcal{L}_{\mathrm{adv}}$, while the feature extractor is updated to minimize it. Meanwhile, the cross-entropy loss for the labeled data $\mathcal{D}_s$ and $\mathcal{L}_t$ is given by

$\mathcal{L}_{\mathrm{ce}} = \mathbb{E}_{(x, y) \in \mathcal{D}_s \cup \mathcal{L}_t}\big[-\log p(y \mid x; \theta)\big].$  (10)
In summary, the total training loss of the proposed framework is given by

$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{ce}} + \mathcal{L}_{\mathrm{adv}}.$  (11)
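For clarity, a minimal PyTorch-style sketch of Eqs. (9)–(11) is given below; it assumes a binary discriminator producing a single logit per feature and omits the gradient reversal machinery used in practice (see Sec. 0.C.1).

```python
import torch
import torch.nn.functional as F

def lamda_losses(feat_ext, classifier, discriminator,
                 src_x, src_y, tgt_x, lbl_x, lbl_y):
    """src_x, src_y: class-weighted source mini-batch drawn with Eq. (8);
    tgt_x: unlabeled target mini-batch; lbl_x, lbl_y: labeled target prototypes."""
    # Cross-entropy loss of Eq. (10) on source data and labeled target prototypes.
    logits = classifier(feat_ext(torch.cat([src_x, lbl_x])))
    ce = F.cross_entropy(logits, torch.cat([src_y, lbl_y]))

    # Domain adversarial loss of Eq. (9); the discriminator outputs one logit,
    # interpreted via a sigmoid as the probability of the source domain.
    d_src = discriminator(feat_ext(src_x))
    d_tgt = discriminator(feat_ext(tgt_x))
    adv = F.logsigmoid(d_src).mean() + F.logsigmoid(-d_tgt).mean()

    # Eq. (11): the discriminator ascends `adv` while the feature extractor
    # descends it, typically realized with a gradient reversal layer (Sec. 0.C.1).
    return ce, adv
```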
Cosine classifier. To further alleviate the negative effect of label distribution shift, LAMDA employs a cosine classifier [16, 41], which measures the cosine similarities between the classifier weights and an embedding vector as classification scores. The norm of a classifier weight is known to be greatly affected by the label distribution [24, 16, 63]. Since the norm does not interfere with the classification score in the cosine classifier, it can alleviate the label distribution shift. Specifically, let $W = [w_1, \dots, w_C]$, where $w_c \in \mathbb{R}^{d}$ indicates the classifier weight for class $c$ with embedding dimension $d$. Then, the class probability predicted by the cosine classifier is given by

$p(y = c \mid x) = \frac{\exp\!\big(T \cdot \frac{w_c^{\top} g(f(x))}{\|w_c\|\,\|g(f(x))\|}\big)}{\sum_{c'=1}^{C}\exp\!\big(T \cdot \frac{w_{c'}^{\top} g(f(x))}{\|w_{c'}\|\,\|g(f(x))\|}\big)},$  (12)

where $g$ is a single hidden layer that projects the feature vector into the $d$-dimensional embedding space, and $T$ is a temperature term that adjusts the sharpness of the predicted probability.
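The following PyTorch-style module sketches the cosine classifier of Eq. (12); the weight initialization and default temperature value are illustrative assumptions, not the values used in our experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineClassifier(nn.Module):
    """Cosine classifier of Eq. (12): class scores are temperature-scaled cosine
    similarities between L2-normalized class weights and the projected embedding."""
    def __init__(self, feat_dim, emb_dim, num_classes, temperature=10.0):
        super().__init__()
        self.proj = nn.Linear(feat_dim, emb_dim)              # single hidden layer g(.)
        self.weight = nn.Parameter(0.01 * torch.randn(num_classes, emb_dim))
        self.temperature = temperature                         # T in Eq. (12)

    def forward(self, feat):
        z = F.normalize(self.proj(feat), dim=1)                # unit-norm embedding
        w = F.normalize(self.weight, dim=1)                    # unit-norm class weights
        return self.temperature * z @ w.t()                    # logits for softmax / CE loss
```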
5 Experiments
We first describe datasets, experiment setup, and implementation details in Sec. 5.1. Then LAMDA is evaluated and compared with previous work in Sec. 5.2, and contributions of its components are scrupulously analyzed in Sec. 5.3.
5.1 Setup
Datasets. We use four domain adaptation datasets with different characteristics: Office-Home [58], Office-Home (RSUT) [56], VisDA-2017 [38], and DomainNet [37]. Office-Home contains 16k images from four domains {Art, Clipart, Product, Real}, where we conduct domain adaptation for each of the 12 source-target permutations. Office-Home (RSUT) is a dataset subsampled from three domains {Clipart, Product, Real} of Office-Home, where the subsampling protocol, called reversely-unbalanced source and unbalanced target (RSUT), is employed to create a large label shift between the source and target domains. VisDA-2017 is a large-scale dataset consisting of 207k images from two domains {Synthetic, Real} in a realistic scenario of synthetic-to-real domain adaptation. DomainNet is also a large-scale dataset but has prevalent labeling noise. In DomainNet, we use five domains {Real, Clipart, Painting, Sketch, Quickdraw} consisting of 362k images; the domains are chosen considering their consistency with existing benchmarks [39]. We use 10% of each dataset for validation and keep the rest for training. While DomainNet includes an independent test set, the other datasets do not provide an explicit test set. Hence, for Office-Home, Office-Home (RSUT), and VisDA-2017, we use the whole dataset (i.e., the train-val set) as the test set, following the conventional protocol of UDA and previous work on ADA [14, 39].
Experimental setup. We compare LAMDA to the state-of-the-art ADA methods: TQS [14], CLUE [39], and VAADA [43]. We note that the existing ADA works have evaluated their methods under different evaluation protocols (e.g., budget size, sampling interval, and dataset). For a fair comparison, we first benchmark them on four public datasets for domain adaptation through a unified evaluation protocol. We conduct 5 rounds of data sampling, each of which updates the model from the previous round after newly acquiring labels of a 2% budget, i.e., a 10% budget in total, where a k% budget denotes k% of the target train-set size. For both our method and the previous methods, the model is selected based on the validation accuracy. For each of the methods, we use the original authors' official implementation. Detailed descriptions are provided in the supplementary material (Sec. 0.C.2).
Implementation details. We use a ResNet-50 [18] backbone initialized with weights pretrained on ImageNet [13] classification for both our method and the previous methods. Our classifier consists of 2 fully connected layers, where the embedding dimension is $d$. For all experiments, we use an identical set of hyperparameters. Our model is trained using the SGD optimizer with a learning rate of 0.1 and weight decay for 100 epochs. We set the margin threshold $\tau$ to 0.8, the bandwidth of the RBF kernel in Eq. (1) to the inverse of the feature dimension, and keep the temperature $T$ in Eq. (12) fixed across all experiments.

5.2 Results
Overall superiority of LAMDA with varying budget. In Fig. 3, we compare the performance of LAMDA and the existing approaches while varying the budget on each of the Office-Home, Office-Home (RSUT), VisDA-2017, and DomainNet datasets. (VAADA [43] requires infeasible memory consumption on DomainNet and VisDA-2017; in the supplementary material (Sec. 0.B.3), we report its performance on the DomainNet scenarios that our resources allow.) Note that each method is equipped with its own domain adaptation technique (e.g., VADA [53] for VAADA and MME [46] for CLUE) and classifier (i.e., the cosine classifier for LAMDA). We evaluate these methods while varying their adaptation techniques or classifier to examine the contribution of their components thoroughly. The results show that LAMDA clearly outperforms the previous arts in every setting on all the datasets. In particular, LAMDA with only a fraction of the budget is often as competitive as or even outperforms the other methods with the full 10% budget. The performance gap between LAMDA and the other methods increases as the budget increases. This suggests that LAMDA utilizes the budget effectively in both ways: label distribution matching and supervised learning.
DA method  AL method  Office-Home
           A→C  A→P  A→R  C→A  C→P  C→R  P→A  P→C  P→R  R→A  R→C  R→P  Avg
–  TQS [14]  64.3  84.8  83.5  66.1  81.0  76.7  66.5  61.4  82.0  73.7  65.9  88.5  74.5
MME  CLUE [39]  62.1  80.6  73.9  55.2  76.4  75.4  53.9  62.1  80.7  67.5  63.0  88.1  69.9
VAADA  VAADA [43]  67.8  83.9  82.9  67.0  81.5  79.5  65.8  65.9  82.4  74.8  68.6  87.8  75.7
DANN  TQS [14]  68.7  80.1  83.1  64.0  83.1  76.9  67.7  71.0  84.4  76.4  72.7  90.0  76.5
DANN  CLUE [39]  70.3  81.9  80.4  65.6  83.8  75.8  64.7  73.9  82.7  76.1  74.3  87.0  76.4
DANN  VAADA [43]  65.5  79.6  80.0  65.4  82.2  75.5  68.4  68.1  84.0  73.5  70.7  88.6  75.1
DANN  Ours w/o COS  73.0  87.6  84.2  69.5  85.9  81.0  71.9  74.6  85.3  77.3  75.9  91.6  79.8
DANN  Ours  74.8  88.5  86.9  73.8  88.2  83.3  74.6  75.5  86.9  80.8  77.8  91.7  81.9
Advantages of LAMDA across diverse source-target domain pairs. In Tables 1 and 2, we compare LAMDA and the existing ADA methods in every domain adaptation scenario of the four datasets given a 10% budget, where LAMDA always outperforms the others. Given that Office-Home (RSUT) exhibits a significant class distribution shift compared to Office-Home, the advantage of LAMDA equipped with label distribution matching becomes clearer on Office-Home (RSUT) (Table 2(a)) than on Office-Home (Table 1). Table 2(b) demonstrates the scalability of LAMDA, where it clearly outperforms the previous work by about 4% or more in all scenarios of the large-scale datasets, VisDA-2017 and DomainNet. In the supplementary material (Sec. 0.B.1), we also show that LAMDA surpasses state-of-the-art SSDA methods [29, 30].
5.3 Analysis
Prototype  Matching  Cosine  Office-Home  Office-Home (RSUT)
✓  ✓  ✓  81.9 (+2.1)  75.8 (+1.7)
✓  ✓  ✗  79.8 (+2.7)  74.1 (+6.8)
✓  ✗  ✗  77.1 (+3.8)  67.3 (+3.7)
✗  ✗  ✗  73.3  63.6
Contribution of each component of LAMDA. Table 3 quantifies the contribution of each component of LAMDA: (i) prototype set sampling in Sec. 4.2; (ii) label distribution matching in Sec. 4.3; and (iii) the cosine classifier in Sec. 4.4. Every component of LAMDA improves the performance on both Office-Home and Office-Home (RSUT). The performance gap between the last row (random sampling with DANN [15]) and the second-to-last row verifies that our prototype sampling method boosts the effect of supervised learning. Comparing the second and third rows, one can see the remarkable performance gain from our label distribution matching strategy, in particular on Office-Home (RSUT) with its significant label distribution shift. Finally, the use of the cosine classifier further improves the performance by 1.9% on average. The results for every individual adaptation scenario are reported in the supplementary material (Sec. 0.A.1).
Quality of the estimated label distribution. As described in Sec. 4.3, estimating the target label distribution plays a prominent role in LAMDA. In Fig. 4a, we visualize the label distributions of the data sampled by LAMDA and by the previous work, and compute the Jensen–Shannon divergence (JSD) between the estimated distributions and the true one. The results demonstrate that LAMDA estimates the target label distribution most accurately compared to the previous work and random sampling, which is a naive but intuitive sampling strategy for the estimation. Note that the previous work is even worse than random sampling in terms of estimation accuracy, which empirically reconfirms that the sampling strategies of the previous work are not aware of the target data distribution. When all pseudo-labels from the source-pretrained model are used solely for the estimation, the JSD is 0.025, which is worse than the 'Source' baseline. This is mainly due to the bias of the pseudo-labeled data toward highly confident samples. Our sampling method avoids this bias by combining labeled and pseudo-labeled data.
Benefit of label distribution matching. In Fig. 4b, we plot training curves of domain alignment combined with label distribution matching, where each method utilizes the identical source classification loss and domain alignment loss as in Eq. (9), but with a different label distribution estimated by each sampling method in Fig. 4a. Training without label distribution matching (e.g., Source) or matching with an inaccurately estimated target label distribution degrades accuracy, while ours does not, thanks to the accurate estimation of the target label distribution. It is worth noting that our model using a 10% budget shows accuracy comparable to Oracle, which has access to the true target label distribution.
Visualization of sampled data by t-SNE. Fig. 5 visualizes the distributions of target features and those selected by LAMDA and TQS to show the difference in their sampling strategies. Since TQS prefers to select uncertain data, mostly located in unclustered regions, its samples do not reflect the target data distribution, e.g., the certain instances in the clustered regions are under-sampled. In contrast, LAMDA considers the certain samples ignored by TQS and assigns them pseudo-labels for label distribution estimation, while it requests the oracle to annotate uncertain data within the budget. Such a sampling strategy allows us to mainly invest the budget in uncertain data while utilizing density-aware samples to estimate the target label distribution. These observations align with our design rationale depicted in Fig. 1. We also visualize the target feature vectors selected by CLUE and VAADA in the supplementary material (Sec. 0.A.4).
Hyperparameter analysis. In Fig. 6, we evaluate the sensitivity of LAMDA to the choice of the margin threshold $\tau$ in Algorithm 1. LAMDA is surprisingly robust to the change of $\tau$: the change in accuracy is less than 1% on both Office-Home and Office-Home (RSUT) when $\tau$ is between 0.7 and 0.9. We note that while the optimal value of $\tau$ varies among the datasets, we use the same value for all of our experiments. When we do not utilize pseudo-labels in LAMDA, the accuracy drops by 4% and 1.1% on Office-Home and Office-Home (RSUT), respectively. This shows the effectiveness of our prototype sampling strategy.
Analysis of the cosine classifier. To inspect the cosine classifier, we compare in Fig. 7 the class frequencies and the per-class weight norms of the linear classifier. The norm of the linear classifier weight is positively correlated with the frequency of each class within the source domain (blue and green lines). Since a large norm of classifier weights is known to result in predictions biased toward major classes [24], the mismatch between the classifier norms and the target-domain class frequencies (red and blue lines) is undesirable. The cosine classifier alleviates this issue by normalizing its weight scale. In the supplementary material (Sec. 0.B.2), we evaluate the existing ADA methods combined with the cosine classifier.
6 Conclusion
We proposed LAMDA, a new method to address the issue of label distribution shift in ADA. It selects target data that best preserve the target data distribution while being representative, diverse, and uncertain. During training, LAMDA estimates the label distribution of the target domain and builds each source data mini-batch in a way that the label frequencies of the batch follow the estimated target label distribution. On four different domain adaptation datasets, the proposed method substantially outperforms all the existing ADA models.
Acknowledgement. We sincerely appreciate Moonjeong Park for fruitful discussions. This work was supported by the NRF grant and the IITP grant funded by Ministry of Science and ICT, Korea (NRF2018R1A5A1060031, NRF2021R1A2C3012728, IITP2019001906, IITP2020000842, IITP2021002068, IITP2022000290).
References
 [1] Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M.: Domain-adversarial neural networks. arXiv preprint arXiv:1412.4446 (2014)
 [2] Asghar, N., Poupart, P., Jiang, X., Li, H.: Deep active learning for dialogue generation. In: Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017) (2017)
 [3] Ash, J.T., Zhang, C., Krishnamurthy, A., Langford, J., Agarwal, A.: Deep batch active learning by diverse, uncertain gradient lower bounds. In: Proc. International Conference on Learning Representations (ICLR) (2020)

 [4] Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., Vaughan, J.W.: A theory of learning from different domains. Machine Learning 79(1), 151–175 (2010)
 [5] Chen, W.Y., Liu, Y.C., Kira, Z., Wang, Y.C.F., Huang, J.B.: A closer look at few-shot classification. In: Proc. International Conference on Learning Representations (ICLR) (2019)
 [6] Tachet des Combes, R., Zhao, H., Wang, Y.X., Gordon, G.J.: Domain adaptation with conditional distribution matching and generalized label shift. In: Proc. Neural Information Processing Systems (NeurIPS) (2020)

 [7] Courty, N., Flamary, R., Habrard, A., Rakotomamonjy, A.: Joint distribution optimal transportation for domain adaptation. In: Proc. Neural Information Processing Systems (NeurIPS) (2017)
 [8] Courty, N., Flamary, R., Tuia, D., Rakotomamonjy, A.: Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 39(9), 1853–1865 (2016)
 [9] Cui, S., Wang, S., Zhuo, J., Su, C., Huang, Q., Tian, Q.: Gradually vanishing bridge for adversarial domain adaptation. In: Proc. Neural Information Processing Systems (NeurIPS) (2020)

 [10] Dai, D., Sakaridis, C., Hecker, S., Van Gool, L.: Curriculum model adaptation with synthetic and real data for semantic foggy scene understanding. International Journal of Computer Vision (IJCV) (2020)
 [11] Damodaran, B.B., Kellenberger, B., Flamary, R., Tuia, D., Courty, N.: Deepjdot: Deep joint distribution optimal transport for unsupervised domain adaptation. In: Proceedings of the European Conference on Computer Vision (ECCV) (2018)
 [12] Deheeger, F., Mougeot, M., Vayatis, N., et al.: Discrepancy-based active learning for domain adaptation. In: Proc. International Conference on Learning Representations (ICLR) (2021)

 [13] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2009)
 [14] Fu, B., Cao, Z., Wang, J., Long, M.: Transferable query selection for active domain adaptation. In: Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021)

 [15] Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., Lempitsky, V.: Domain-adversarial training of neural networks. Journal of Machine Learning Research (JMLR) 17(1), 2096–2030 (2016)
 [16] Gidaris, S., Komodakis, N.: Dynamic few-shot visual learning without forgetting. In: Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
 [17] Gretton, A., Borgwardt, K.M., Rasch, M.J., Schölkopf, B., Smola, A.: A kernel two-sample test. Journal of Machine Learning Research (JMLR) 13(1), 723–773 (2012)
 [18] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2016)

 [19] He, T., Jin, X., Ding, G., Yi, L., Yan, C.: Towards better uncertainty sampling: Active learning with multiple views for deep convolutional neural network. In: 2019 IEEE International Conference on Multimedia and Expo (ICME) (2019)

 [20] Huang, J.Z., Ng, M.K., Rong, H., Li, Z.: Automated variable weighting in k-means type clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 27(5), 657–668 (2005)
 [21] Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Proc. International Conference on Machine Learning (ICML) (2015)

 [22] Jiang, J., Chen, B., Fu, B., Long, M.: Transfer-Learning-Library. https://github.com/thuml/TransferLearningLibrary (2020)
 [23] Jin, Y., Wang, X., Long, M., Wang, J.: Minimum class confusion for versatile domain adaptation. In: Proc. European Conference on Computer Vision (ECCV). pp. 464–480. Springer (2020)
 [24] Kang, B., Xie, S., Rohrbach, M., Yan, Z., Gordo, A., Feng, J., Kalantidis, Y.: Decoupling representation and classifier for long-tailed recognition. In: Proc. International Conference on Learning Representations (ICLR) (2020)
 [25] Kim, B., Khanna, R., Koyejo, O.O.: Examples are not enough, learn to criticize! criticism for interpretability. In: Proc. Neural Information Processing Systems (NeurIPS) (2016)
 [26] LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M., Huang, F.: A tutorial on energy-based learning. Predicting Structured Data 1(0) (2006)
 [27] Lee, C.Y., Batra, T., Baig, M.H., Ulbricht, D.: Sliced Wasserstein discrepancy for unsupervised domain adaptation. In: Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
 [28] Lee, S., Son, T., Kwak, S.: FIFO: Learning fog-invariant features for foggy scene segmentation. In: Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)
 [29] Li, J., Li, G., Shi, Y., Yu, Y.: Cross-domain adaptive clustering for semi-supervised domain adaptation. In: Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021)
 [30] Li, K., Liu, C., Zhao, H., Zhang, Y., Fu, Y.: ECACL: A holistic framework for semi-supervised domain adaptation. In: Proc. IEEE/CVF International Conference on Computer Vision (ICCV) (2021)
 [31] Long, M., Cao, Y., Wang, J., Jordan, M.: Learning transferable features with deep adaptation networks. In: Proc. International Conference on Machine Learning (ICML). PMLR (2015)
 [32] Long, M., Cao, Z., Wang, J., Jordan, M.I.: Conditional adversarial domain adaptation. Proc. Neural Information Processing Systems (NeurIPS) (2018)
 [33] Long, M., Zhu, H., Wang, J., Jordan, M.I.: Deep transfer learning with joint adaptation networks. In: Proc. International Conference on Machine Learning (ICML). PMLR (2017)
 [34] Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research (JMLR) 9(11) (2008)
 [35] Nemhauser, G.L., Wolsey, L.A., Fisher, M.L.: An analysis of approximations for maximizing submodular set functions—i. Mathematical programming 14(1), 265–294 (1978)

 [36] Ostapuk, N., Yang, J., Cudré-Mauroux, P.: ActiveLink: deep active learning for link prediction in knowledge graphs. In: The World Wide Web Conference (WWW) (2019)

 [37] Peng, X., Bai, Q., Xia, X., Huang, Z., Saenko, K., Wang, B.: Moment matching for multi-source domain adaptation. In: Proc. IEEE/CVF International Conference on Computer Vision (ICCV) (2019)
 [38] Peng, X., Usman, B., Kaushik, N., Wang, D., Hoffman, J., Saenko, K.: VisDA: A synthetic-to-real benchmark for visual domain adaptation. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition Workshops. pp. 2021–2026 (2018)
 [39] Prabhu, V., Chandrasekaran, A., Saenko, K., Hoffman, J.: Active domain adaptation via clustering uncertainty-weighted embeddings. In: Proc. IEEE/CVF International Conference on Computer Vision (ICCV) (2021)
 [40] Purushotham, S., Carvalho, W., Nilanon, T., Liu, Y.: Variational recurrent adversarial deep domain adaptation. In: Proc. International Conference on Learning Representations (ICLR) (2016)
 [41] Qi, H., Brown, M., Lowe, D.G.: Low-shot learning with imprinted weights. In: Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2018)

 [42] Rai, P., Saha, A., Daumé III, H., Venkatasubramanian, S.: Domain adaptation meets active learning. In: Proceedings of the NAACL HLT 2010 Workshop on Active Learning for Natural Language Processing (2010)
 [43] Rangwani, H., Jain, A., Aithal, S.K., Babu, R.V.: S3VAADA: Submodular subset selection for virtual adversarial active domain adaptation. In: Proc. IEEE/CVF International Conference on Computer Vision (ICCV) (2021)
 [44] Rdusseeun, L., Kaufman, P.: Clustering by means of medoids. In: Proceedings of the statistical data analysis based on the L1 norm conference, neuchatel, switzerland. vol. 31 (1987)
 [45] Roth, D., Small, K.: Margin-based active learning for structured output spaces. In: European Conference on Machine Learning. pp. 413–424. Springer (2006)
 [46] Saito, K., Kim, D., Sclaroff, S., Darrell, T., Saenko, K.: Semi-supervised domain adaptation via minimax entropy. In: Proc. IEEE/CVF International Conference on Computer Vision (ICCV) (2019)
 [47] Sakaridis, C., Dai, D., Gool, L.V.: Guided curriculum model adaptation and uncertaintyaware evaluation for semantic nighttime image segmentation. In: Proc. IEEE/CVF International Conference on Computer Vision (ICCV) (2019)
 [48] Sakaridis, C., Dai, D., Hecker, S., Van Gool, L.: Model adaptation with synthetic and real data for semantic dense foggy scene understanding. In: Proc. European Conference on Computer Vision (ECCV) (2018)
 [49] Sakaridis, C., Dai, D., Van Gool, L.: Semantic foggy scene understanding with synthetic data. International Journal of Computer Vision (IJCV) (2018)
 [50] Sener, O., Savarese, S.: Active learning for convolutional neural networks: A core-set approach. In: Proc. International Conference on Learning Representations (ICLR) (2018)
 [51] Settles, B.: Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison (2009)

 [52] Seung, H.S., Opper, M., Sompolinsky, H.: Query by committee. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory. pp. 287–294 (1992)
 [53] Shu, R., Bui, H.H., Narui, H., Ermon, S.: A DIRT-T approach to unsupervised domain adaptation. In: Proc. International Conference on Learning Representations (ICLR) (2018)
 [54] Sinha, S., Ebrahimi, S., Darrell, T.: Variational adversarial active learning. In: Proc. IEEE/CVF International Conference on Computer Vision (ICCV) (2019)
 [55] Su, J.C., Tsai, Y.H., Sohn, K., Liu, B., Maji, S., Chandraker, M.: Active adversarial domain adaptation. In: Proc. IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (2020)
 [56] Tan, S., Peng, X., Saenko, K.: Class-imbalanced domain adaptation: an empirical odyssey. In: Proc. European Conference on Computer Vision (ECCV) (2020)
 [57] Tzeng, E., Hoffman, J., Darrell, T., Saenko, K.: Simultaneous deep transfer across domains and tasks. In: Proc. IEEE/CVF International Conference on Computer Vision (ICCV) (2015)
 [58] Venkateswara, H., Eusebio, J., Chakraborty, S., Panchanathan, S.: Deep hashing network for unsupervised domain adaptation. In: Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
 [59] Wang, Z., Du, B., Tu, W., Zhang, L., Tao, D.: Incorporating distribution matching into uncertainty for multiple kernel active learning. IEEE Transactions on Knowledge and Data Engineering 33(1), 128–142 (2019)
 [60] Wang, Z., Ye, J.: Querying discriminative and representative samples for batch mode active learning. ACM Transactions on Knowledge Discovery from Data (TKDD) 9(3), 1–23 (2015)
 [61] Xie, B., Yuan, L., Li, S., Liu, C.H., Cheng, X., Wang, G.: Active learning for domain adaptation: An energy-based approach. In: Proc. AAAI Conference on Artificial Intelligence (AAAI) (2022)
 [62] Xie, M., Li, Y., Wang, Y., Luo, Z., Gan, Z., Sun, Z., Chi, M., Wang, C., Wang, P.: Learning distinctive margin toward active domain adaptation. In: Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)
 [63] Xu, R., Li, G., Yang, J., Lin, L.: Larger norm more transferable: An adaptive feature norm approach for unsupervised domain adaptation. In: Proc. IEEE/CVF International Conference on Computer Vision (ICCV) (2019)
 [64] Yang, L., Wang, Y., Gao, M., Shrivastava, A., Weinberger, K.Q., Chao, W.L., Lim, S.N.: Deep co-training with task decomposition for semi-supervised domain adaptation. In: Proc. IEEE/CVF International Conference on Computer Vision (ICCV) (2021)
 [65] Yoon, J., Kang, D., Cho, M.: Semi-supervised domain adaptation via sample-to-sample self-distillation. In: Proc. IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 1978–1987 (2022)
 [66] Zhao, H., Des Combes, R.T., Zhang, K., Gordon, G.: On learning invariant representations for domain adaptation. In: Proc. International Conference on Machine Learning (ICML). PMLR (2019)
Appendix 0.A Further analysis of LAMDA
0.A.1 Component analysis for each adaptation scenario

Tables a4 and a5 quantify the impact of each component of LAMDA in every domain adaptation scenario of the two datasets. Every component of LAMDA generally improves the performance of each domain adaptation scenario. Comparing the second and third rows of Table a5, one can see the remarkable performance gain from our label distribution matching strategy, since it alleviates the significant label distribution shift of Office-Home (RSUT). Additionally, our random sampling baseline incorporating DANN achieves 86.7% on VisDA-2017 and 51.3% on DomainNet.
Prototype  Matching  Cosine  Office-Home
           A→C  A→P  A→R  C→A  C→P  C→R  P→A  P→C  P→R  R→A  R→C  R→P  Avg
✓  ✓  ✓  75.4  88.5  85.9  73.3  88.7  83.8  75.2  75.3  87.1  80.9  77.8  91.8  82.0
✓  ✓  ✗  74.0  87.8  85.4  69.4  85.6  81.7  71.5  74.2  86.0  78.1  76.3  91.8  80.2
✓  ✗  ✗  69.6  82.7  82.8  64.6  84.2  78.4  66.3  72.1  84.9  74.8  74.1  91.3  77.1
✗  ✗  ✗  65.2  77.4  79.1  61.3  77.5  73.9  65.6  68.0  81.2  73.7  69.4  84.6  73.1
The accuracy is the mean of three runs, and the subscript denotes the standard deviation.
Prototype  Matching  Cosine  Office-Home (RSUT)
           C→P  C→R  P→C  P→R  R→C  R→P  Avg
✓  ✓  ✓  82.0  74.9  60.8  81.5  65.4  85.9  75.1
✓  ✓  ✗  79.2  72.6  61.3  81.0  62.7  86.2  73.8
✓  ✗  ✗  70.5  64.6  52.8  75.7  53.8  79.6  66.2
✗  ✗  ✗  64.2  60.1  52.0  73.7  54.6  73.6  63.1
AL strategy  Computation time (forward + sampling)
             Office-Home (A→C)  DomainNet (C→S)
TQS [14]     5.41s     0.88m
CLUE [39]    35.98s    43.94m
VAADA [43]   132.20s   487.11m
Ours         7.11s     2.37m
0.A.2 Computational complexity of LAMDA

In Table a6, we compare the computational complexity and per-round computation time of LAMDA and the previous ADA methods. The computation time is measured on an Intel(R) Xeon(R) Gold 6240 CPU. While the computational complexity of LAMDA grows quadratically with the size of the target dataset, it is significantly faster than CLUE [39] and VAADA [43], even on the large-scale Clipart domain of DomainNet, which consists of 30K images. This is because the calculation for our sampling consists of simple matrix multiplications, which can be computed efficiently in parallel. LAMDA achieves the best performance among the previous arts with a reasonable sampling time.
0.A.3 Combining LAMDA with other DA methods

In Fig. a8, we compare the performance of LAMDA with that of CLUE [39] over varying budgets, when both are equipped with MME [46], on source-target domain pairs of Office-Home (Clipart and Product) and Office-Home (RSUT) (Clipart and Real). We follow the training protocol of CLUE, where at each round the model is provided with the budget and is trained for 20 epochs with the MME loss. While CLUE is specialized for MME, the result shows that LAMDA consistently outperforms CLUE in all four domain adaptation scenarios. The performance gap between LAMDA and CLUE increases as the budget increases. In Table a7, we compare the performance of the ADA methods coupled with MCC [23], a state-of-the-art UDA method, when using a 10% budget; the cosine classifier is not used in this comparison to verify the generality of our sampling strategy. While MCC increases the overall performance of the ADA methods, LAMDA still shows the best performance among them. This result demonstrates that LAMDA can significantly improve performance regardless of the adaptation technique by using the budget effectively for both label distribution matching and supervised learning.
DA method  AL method  Office-Home
           A→C  A→P  A→R  C→A  C→P  C→R  P→A  P→C  P→R  R→A  R→C  R→P  Avg
MCC [23]  TQS [14]  70.9  91.4  87.8  74.2  90.1  84.6  74.4  72.8  87.3  78.7  75.6  92.7  81.7
MCC [23]  CLUE [39]  77.3  91.1  87.7  74.3  90.6  85.4  74.7  77.2  87.2  81.3  78.4  93.1  83.2
MCC [23]  VAADA [43]  71.3  89.1  87.7  74.2  87.5  85.5  74.9  71.8  86.3  81.3  76.2  92.4  81.5
MCC [23]  Ours w/o COS  76.9  90.2  89.0  76.7  91.5  86.8  77.8  77.4  88.5  83.5  78.5  93.5  84.2
0.A.4 t-SNE visualization

In Fig. a9, we visualize the target feature vectors selected by LAMDA and the previous ADA methods from the source-pretrained model. When the feature vectors are clustered, LAMDA selects a comparably large number of prototypes within the cluster to reflect its density (Fig. a9a), which provides a better estimation of the target statistics as in Fig. 4. In the clustered regions, most prototypes are pseudo-labeled since they tend to include easy samples (GT class in Fig. a9). For the unclustered regions, in contrast, the prototypes are labeled by the oracle (Fig. a9a), providing ground-truth supervision for the uncertain prototypes. This allows us to mainly invest the budget in uncertain samples while utilizing density-aware samples to estimate the target label distribution.

On the other hand, the samples selected by the previous methods preserve the density of the target data less faithfully, as these methods avoid selecting easy samples within the clustered regions. Even though CLUE [39] and VAADA [43] select part of their samples within the clustered regions, they do not reflect the density of the clusters. Since CLUE and VAADA spend part of their budgets in the clustered regions, they cannot select as many samples as LAMDA in the unclustered regions (Fig. a9c and Fig. a9d). Conversely, while TQS [14] selects as many samples as LAMDA within the unclustered regions, it ignores samples in the clustered regions (Fig. a9b).
0.A.5 Detailed component analysis of prototype set sampling

Table a8 quantifies the contribution of each component of our sampling strategy: (i) subset sampling using the MMD measure; (ii) preventing labeling of the easy prototypes; and (iii) utilizing the pseudo-labeled prototypes for label distribution estimation. Every component of our sampling strategy improves the performance for every adaptation scenario on Office-Home. The performance gap between the last row (random sampling with DANN [15]) and the second-to-last row verifies that our distribution-aware sampling boosts the effect of supervised learning. Comparing the second and third rows, one can see the performance gain from considering uncertainty in the sampling. The performance gap between the first and second rows shows that utilizing the pseudo-labeled prototypes improves the quality of the label distribution estimation.
Density  Uncertainty  Pseudo label  Office-Home
           A→C  A→P  A→R  C→A  C→P  C→R  P→A  P→C  P→R  R→A  R→C  R→P  Avg
✓  ✓  ✓  75.4  88.5  85.9  73.3  88.7  83.8  75.2  75.3  87.1  80.9  77.8  91.8  82.0
✓  ✓  ✗  72.5  87.3  84.1  73.2  87.6  81.3  74.0  73.6  84.8  78.7  76.0  90.6  80.3
✓  ✗  ✗  69.4  84.1  82.6  71.0  83.5  79.6  71.6  70.4  83.9  77.2  73.9  88.0  77.9
✗  ✗  ✗  66.1  76.7  79.2  62.2  76.1  73.6  66.1  70.3  82.1  72.9  71.3  83.1  73.3
0.A.6 The ratio of pseudo-labeled prototypes

In Fig. a10, we measure the ratio of the pseudo-labeled prototypes among all the prototypes with different threshold values. As shown in Fig. a10, a considerable amount of prototypes is pseudo-labeled. When $\tau$ increases, the portion of pseudo-labeled prototypes decreases, while their precision increases.
Appendix 0.B More results
0.B.1 Comparison with state-of-the-art semi-supervised DA methods

Semi-Supervised Domain Adaptation (SSDA) is a label-efficient domain adaptation task that allows a small amount of labeled data per class (i.e., a few shots per class) in the target domain. While SSDA and ADA are similar in that both utilize a small amount of labeled target data for domain adaptation, the focus of their methods is slightly different: ADA methods focus on sampling the most performance-profitable data, whereas SSDA methods focus on effective training strategies to utilize the few labeled target data.

In Tables a9 and a10, we compare LAMDA with state-of-the-art SSDA methods [29, 30] using a 10% budget on the source-target domain pairs of Office-Home and Office-Home (RSUT). LAMDA surpasses the previous arts of both ADA and SSDA in every setting of the datasets. The performance of the state-of-the-art SSDA methods is often as competitive as, or even better than, that of the previous ADA methods; thus, combining the training strategies of SSDA with ADA methods could further boost the performance of ADA.
Task  Method  Office-Home
           A→C  A→P  A→R  C→A  C→P  C→R  P→A  P→C  P→R  R→A  R→C  R→P  Avg
SSDA  ECACL [30]  72.2  86.7  82.8  70.5  85.0  82.6  70.9  71.5  82.9  76.0  74.0  88.9  78.7
SSDA  CDAC [29]  69.5  83.2  80.2  66.9  82.4  78.7  66.1  70.6  80.9  72.3  70.5  87.2  75.7
ADA  TQS [14]  64.3  84.8  83.5  66.1  81.0  76.7  66.5  61.4  82.0  73.7  65.9  88.5  74.5
ADA  CLUE [39]  62.1  80.6  73.9  55.2  76.4  75.4  53.9  62.1  80.7  67.5  63.0  88.1  69.9
ADA  VAADA [43]  67.8  83.9  82.9  67.0  81.4  79.5  65.8  65.9  82.4  74.8  68.6  87.9  75.7
ADA  Ours  74.8  88.5  86.9  73.8  88.2  83.3  74.6  75.5  86.9  80.8  77.8  91.7  81.9
Task  Method  Office-Home (RSUT)
           C→P  C→R  P→C  P→R  R→C  R→P  Avg
SSDA  ECACL [30]  78.6  68.6  59.5  77.1  61.9  82.0  71.3
SSDA  CDAC [29]  73.0  58.7  55.8  73.3  50.3  77.3  64.7
ADA  TQS [14]  69.4  65.7  53.0  76.3  53.1  81.1  66.4
ADA  CLUE [39]  69.7  65.9  57.1  73.4  59.5  82.7  68.1
ADA  VAADA [43]  73.0  63.0  50.7  69.6  52.6  78.3  64.5
ADA  Ours  81.2  75.7  64.1  81.6  65.1  87.2  75.8
0.B.2 Previous methods equipped with the cosine classifier

In Fig. a11, we combine the cosine classifier with the existing approaches and compare their performance with that of LAMDA over varying budgets on the Office-Home and Office-Home (RSUT) datasets. LAMDA consistently outperforms the previous arts in every setting on both datasets, where LAMDA with a 6% budget is often as competitive as or even surpasses the previous methods with a 10% budget. The performance gap between LAMDA and the other methods increases as the budget increases (Fig. a11a), indicating that LAMDA effectively utilizes the budget for both label distribution matching and supervised learning.
0.B.3 Evaluation on DomainNet

In Fig. a12, we compare the performance of LAMDA and the existing methods over varying budgets for two source-target domain pairs, Art to Product and Product to Clipart, of DomainNet. LAMDA clearly outperforms the previous arts in both domain adaptation settings, which demonstrates the scalability of LAMDA to large-scale datasets. In particular, LAMDA with only a 2% budget is often as competitive as or even outperforms the previous methods with a 10% budget.
Appendix 0.C Experiment details
0.C.1 Configuration of DANN

We adopt the implementation of DANN [15] from [22] and follow its default configurations. The discriminator consists of three fully connected layers, where each hidden layer is followed by a batch normalization layer [21] and ReLU. We utilize a Gradient Reversal Layer (GRL) for adversarial learning, where the coefficient of the GRL follows the default schedule of [22] as a function of the training progress, which scales from 0 to 1. We use the same DANN configuration and training schedule when combining the previous ADA methods with DANN.
0.C.2 Implementation details of previous ADA methods

We evaluate the previous ADA methods based on the official implementations of TQS [14] (https://github.com/thuml/TransferableQuerySelection), CLUE [39] (https://github.com/virajprabhu/CLUE), and VAADA [43] (https://github.com/valiisc/s3vaada). We fit their implementations into our evaluation protocol with minimal modifications. We change the backbone of CLUE from ResNet-34 to ResNet-50, and we add missing parts of the TQS implementation to make the code follow the original paper. When training each method, we use the hyperparameter values stated in each original paper, and if not stated, we tune them on the validation set.