Many applications of machine learning require learning a regression task, for instance estimating the performance of manufactured products, analysing the sentiment of customer reactions, forecasting supply and demand, or predicting the time spent by a patient in a hospital. In most of these applications, groups of products or patients define several domains with different distributions. Acquiring enough labeled data to provide a model that performs well on all of these domains is often difficult and expensive. In practical cases, only a few labeled data are available for the target domain of interest, whereas a large amount of labeled data is available from other source domains. One then seeks to leverage information from these source domains to learn the task efficiently on the target domain through supervised training with a small sample of labeled target data.
Most previous work on domain adaptation has focused on the unsupervised scenario, where no target labels are available. Unlabeled target data are then used to correct the difference between source and target distributions, either by creating a new feature space (feature-based methods Ganin2016DANN, Sun2016CORAL, Chen2012mSDA) or by reweighting the training instance losses (instance-based methods Huang2007KMM, Sugiyama2007KLIEP, Cortes2014DAregression).
Available domain adaptation methods for regression are essentially instance-based methods, which reweight the loss of source instances in order to minimize a distance between source and target distributions, such as the KL-divergence Sugiyama2007KLIEP, Garcke2014weightingDAregression, the MMD Huang2007KMM, Sun20112SWMDA, or the discrepancy Mansour2009DATheory, Cortes2014DAregression, Mohri2010Ydiscrepancy, adlam2019GanDisc. The latter offers the advantage of being adapted to the underlying task and to the particular class of hypotheses chosen to learn this task. An extension of the discrepancy, known as the Y-discrepancy and introduced in Mohri2010Ydiscrepancy, is defined as the maximal difference between source and target risks over a set of hypotheses and yields tighter theoretical bounds on the target risk than the discrepancy Medina2015thesis. However, as far as we know, previous discrepancy minimization algorithms do not estimate the Y-discrepancy directly. As they focus on the use of unlabeled target data, they choose instead to consider unsupervised approximations of this distance, such as the discrepancy Mohri2010Ydiscrepancy or the generalized discrepancy Cortes2019GeneralDisc.
Most instance-based methods rely on functions induced by positive semi-definite kernels, and their weighting strategy generally consists in solving a quadratic problem Cortes2014DAregression, Cortes2019GeneralDisc, Huang2007KMM, Sun20112SWMDA. Thus, these methods become computationally expensive when the number of data points is large. The work of Pardoe2010boost
In this paper, we present a novel instance-based method for supervised domain adaptation for regression tasks using a few labeled target data. We propose the Weighting Adversarial Neural Network (WANN) algorithm to learn the optimal weights that correct the difference between source and target distributions. WANN minimizes, in a single gradient descent, an original objective function composed of the empirical Y-discrepancy between source and target domains and the task risk on these same domains (section 2). Compared to other discrepancy minimization algorithms, we use adversarial neural networks to estimate, at each gradient step, the importance weights of source instances and the empirical Y-discrepancy. We thus propose an efficient way to extend adversarial domain adaptation to regression tasks. After presenting related work in section 3, we show in several experiments that the novel weighting strategy of WANN leads to results which outperform state-of-the-art methods for domain adaptation in regression, and provides a method which scales better to large datasets. All the code for the experiments presented in this paper is available on GitHub. We also provide an online demo of our algorithm, which can be found via the links given in section 4.
2 Weighting Adversarial Neural Network
Given and , we consider the supervised domain adaptation regression setting where and are respectively the labeled source and target datasets. In this setting, the sample size of is typically much smaller than , the one of ().
We consider two classes of neural networks
of a given architecture and activation functions. We introduce three networksand called respectively the task, the discrepancy and the weighting network.
2.2 Objective function
Our approach is based on the instance-based assumption that an optimal task hypothesis for the regression task on the target domain can be computed by optimally weighting the loss of the source instances during the training phase. We suppose in addition that the optimal source weights can be computed by a weighting network such that learns the relationship between the input space and the source weights. A reweighting of the source instances is considered optimal if it minimizes the empirical Y-discrepancy between the reweighted source and the target distributions. However, computing the empirical Y-discrepancy requires finding a maximum over the set of hypotheses, which is difficult in general. We therefore introduce a discrepancy hypothesis to approximate the empirical Y-discrepancy using the adversarial techniques introduced by Ganin2016DANN.
This leads to the objective function provided below (equation 1). The function can be understood as a regularization of the target risk (second term of ) by the weighted source risk (first term of ). Since the sample size of is small, adding a selection of labeled source data helps prevent overfitting. The weighting network is trained to "select" the most informative source instances, those which are "close" to the target instances in terms of the estimated Y-discrepancy (third term of ).
In order to approximate the Y-discrepancy, which consists of the maximal difference between source and target risks, the network is trained to maximize the third term of , i.e. seeks to provide antagonistic performance on the two distributions. Thus, by looking for the source weights minimizing the Y-discrepancy, the network learns a new source distribution on which any hypothesis in will perform as well as on the target distribution.
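As a minimal sketch of how the three terms described above could be combined, the following Python function evaluates them from per-instance losses that are assumed to be already computed. The function name, the argument names and the normalization by sample size are assumptions made for illustration; the absolute difference follows the verbal definition of the Y-discrepancy given above. This is not the authors' implementation.

```python
def wann_objective(weights, src_task_loss, tgt_task_loss,
                   src_disc_loss, tgt_disc_loss):
    """Return (weighted source risk, target risk, discrepancy term, total).

    weights        : source weights produced by the weighting network
    src_task_loss  : task-network loss on each source instance
    tgt_task_loss  : task-network loss on each labeled target instance
    src_disc_loss  : discrepancy-network loss on each source instance
    tgt_disc_loss  : discrepancy-network loss on each labeled target instance
    All names and the 1/m, 1/n averaging are illustrative assumptions.
    """
    m, n = len(weights), len(tgt_task_loss)

    # First term: source risk of the task network, reweighted per instance.
    weighted_source_risk = sum(w * l for w, l in zip(weights, src_task_loss)) / m

    # Second term: empirical target risk of the task network.
    target_risk = sum(tgt_task_loss) / n

    # Third term: gap between reweighted source risk and target risk of the
    # discrepancy network, which is trained to make this gap as large as possible.
    weighted_disc_risk = sum(w * l for w, l in zip(weights, src_disc_loss)) / m
    discrepancy = abs(weighted_disc_risk - sum(tgt_disc_loss) / n)

    total = weighted_source_risk + target_risk + discrepancy
    return weighted_source_risk, target_risk, discrepancy, total
```

In this reading, the task and weighting networks would lower `total` while the discrepancy network would raise the third term.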
The purpose of using a weighting neural network is to capture the underlying dependence between source instances. Indeed, the weights of dependent source instances should increase or decrease together, as these instances will be similarly related to the target data. Using a neural network for this purpose provides a way to preserve the spatial structure in the source weighting scheme, such that the weights of source instances close to each other in the input space will be similar (section 4.1).
2.3 Weighting Adversarial Neural Network Algorithm
We define here the parameters of the respective networks and which will be denoted in this section and for the sake of clarity. In the same way, the objective function from equation 1 will be denoted as a function of the parameters.
The details of WANN gradient descent are presented in Algorithm 1. The goal of this algorithm is to approximate the saddle point verifying:
A gradient reversal layer, as defined in Ganin2016DANN, is used to change the sign of the gradient during back-propagation. Thus, in the same gradient step, the current objective function is back-propagated through and whereas its opposite value is returned to (Figure 1).
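The effect of the sign change can be illustrated on a toy problem. The sketch below is not the WANN objective: it uses an assumed scalar function G(a, b) = a^2 - b^2 + a*b, where `a` stands for the minimizing parameters and `b` for the maximizing ones. Both are updated in the same step, and the gradient sent to `b` is negated, which is what a gradient reversal layer achieves inside a single backward pass.

```python
def grad_a(a, b):
    # dG/da for the toy function G(a, b) = a**2 - b**2 + a*b
    return 2 * a + b


def grad_b(a, b):
    # dG/db for the same toy function
    return -2 * b + a


def descent_ascent(a, b, lr=0.1, steps=200):
    """Simultaneous update: `a` descends on G, `b` receives the reversed gradient."""
    for _ in range(steps):
        ga, gb = grad_a(a, b), grad_b(a, b)
        # Same descent rule for both players; the sign flip on gb turns the
        # descent step into an ascent step for the maximizing player.
        a, b = a - lr * ga, b - lr * (-gb)
    return a, b
```

For this particular toy function the iterates contract toward the saddle point (0, 0); as noted in the text, no such guarantee is available for the actual objective.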
There are no theoretical guarantees that admits this saddle point. However, in practice, we observe numerical convergence of WANN and improved performance on the target task compared to other methods (section 4).
A constraint is applied on each network by projecting the weights of their layers at each gradient step on the Euclidean ball of radius Srivastava2014dropout. This constraint is used in several adversarial algorithms such as WGAN Arjovsky2017WGAN and DANN Ganin2016DANN. Mini-batch gradient descent is also used. Notice finally that we consider, in the algorithm, the squared Y-discrepancy in order to make differentiable.
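The projection step can be sketched as follows. The function name is hypothetical, and applying the projection to one flat weight vector at a time is an assumption, since the text does not state the granularity (per layer or per unit).

```python
import math


def project_on_ball(weights, radius):
    """Project a flat weight vector onto the Euclidean ball of the given radius."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= radius:
        # Already inside the ball: the projection leaves the vector unchanged.
        return list(weights)
    # Outside the ball: rescale so that the norm equals the radius.
    scale = radius / norm
    return [w * scale for w in weights]
```

Calling this after each optimizer update keeps the weight norm bounded without changing the direction of the weight vector.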
3 Related work
3.1 Discrepancy Minimization
The present work is in line with discrepancy minimization methods, which were first introduced in Mansour2009DATheory and further developed in Cortes2014DAregression, Mohri2010Ydiscrepancy, Kuroki2019SDisc, Zhang2019MDD and Cortes2019GeneralDisc. More specifically, the WANN algorithm aims at minimizing the empirical Y-discrepancy introduced in Mohri2010Ydiscrepancy.
The definition of the Y-discrepancy and related theoretical results are provided in Medina2015thesis (Definition 5, Proposition 1), where theoretical bounds on the average loss over the target domain can be found. Considering the empirical source and target distributions and and labeling functions , it is shown that the task risk on the target distribution can be upper bounded by the empirical risk on any reweighted source distribution plus the empirical Y-discrepancy. Besides, the target risk is also upper bounded by its empirical estimate plus a Rademacher complexity term.
Following these considerations, it appears that to minimize the target risk, one should minimize the Y-discrepancy and the task risk on the empirical distributions and . This is the purpose of the WANN algorithm, which aims at solving the following optimization formulation:
where , and are respectively the reweighted source risk and the target risk with
a loss function over pairs of labels.
The optimization formulation (3) is a min-max optimization problem. The algorithms in Cortes2014DAregression and Cortes2019GeneralDisc solve a related problem for respectively the discrepancy and the generalized discrepancy on the class of functions induced by PSD kernels using quadratic programming. In this paper, we choose instead to estimate the "max" part of the equation with a discrepancy network trained with adversarial techniques.
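For readers who want to see the min-max structure written out, one plausible form of such a problem is given below. All symbols are assumptions introduced here for illustration: $W$ is the weighting network, $h_t$ the task hypothesis, $h_d$ the discrepancy hypothesis, $\widehat{L}_{S,W}$ the reweighted empirical source risk and $\widehat{L}_{T}$ the empirical target risk. It is a sketch consistent with the verbal description in this section, not a transcription of formulation (3).

```latex
\min_{W,\; h_t} \; \max_{h_d} \;
  \widehat{L}_{S,W}(h_t) \;+\; \widehat{L}_{T}(h_t)
  \;+\; \left| \widehat{L}_{S,W}(h_d) - \widehat{L}_{T}(h_d) \right|
```

The inner maximum over $h_d$ plays the role of the empirical Y-discrepancy; the outer minimum trades it off against the task risk on both domains.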
3.2 Adversarial neural networks
As far as we know, our WANN algorithm is the first application of adversarial techniques to domain adaptation for supervised regression tasks. Indeed, adversarial techniques, originally introduced for domain adaptation in Ganin2016DANN, are essentially used in unsupervised feature-based methods for classification tasks. The DANN Ganin2016DANN and ADDA Tzeng2017ADDA algorithms focus on finding a new representation of the input features in which source and target instances cannot be distinguished by any discriminative hypothesis. This process aims at minimizing the H-divergence introduced by BenDavid2006DATheory. Considering other distances, the adversarial methods MCD Saito2018MCD and MDD Zhang2019MDD learn a new feature representation by minimizing, respectively, the absolute difference between the predictions of two classifiers and the disparity discrepancy between source and target domains. Similarly, in adlam2019GanDisc, the discrepancy distance is considered for the training of GANs.
4 Experiments
In this section, we report the results of the WANN algorithm compared to other domain adaptation methods for regression. The experiments are conducted on one synthetic and three public datasets: Superconductivity Hamidieh2018Superconductor, Kin-family Rasmussen1996delve and Amazon review Blitzer2007SA. Following the standards of reproducible experiments, the source code of the methods used and all the scripts needed to obtain the presented results are available on GitHub (https://github.com/AnonymousAccount0/WANN) with an online demo. The GDM code used is the one provided by the authors of Cortes2019GeneralDisc (https://cims.nyu.edu/~munoz/). All results presented in this section have been computed on a ( GHz, G RAM) computer. The following competitors are selected to compare the performance of the WANN algorithm:
TrAdaBoostR2 Pardoe2010boost is based on a reverse-boosting principle where the weights of poorly predicted source instances are decreased at each boosting iteration. We choose the two-stage version of TrAdaBoostR2 with first stage and second stage iterations. A fold cross-validation is performed at each first stage and the best hypothesis is returned.
Generalized Discrepancy Minimization (GDM) Cortes2019GeneralDisc is an adaptation of DM algorithm Cortes2014DAregression to the supervised scenario. The GDM hyper-parameter is selected from the set and the and hyper-parameters from the set: , , the selection is made with cross-validation on the few available labeled target data.
Kullback-Leibler Importance Estimation Procedure (KLIEP) Sugiyama2007KLIEP is a sample bias correction method minimizing the KL-divergence between a reweighted source and target distributions. We choose the KLIEP likelihood cross validation (LCV) version with selection of Gaussian kernel bandwidth in the set .
Kernel Mean Matching (KMM) Huang2007KMM reweights source instances in order to minimize the MMD between domains. A Gaussian kernel is used with selected with cross-validation on labeled target data. Parameters and are set to and .
Domain-Adversarial Neural Network (DANN) Ganin2016DANN is used here for regression tasks by considering the mean squared error as task loss instead of the binary cross-entropy proposed in the original algorithm. In the following, DANN uses all labeled data to learn the task and all available training target data (including unlabeled ones) to find a common feature space. The trade-off parameter is selected on values between and with cross-validation on the training labeled target data.
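To make the quantity targeted by KMM in the list above concrete, the sketch below computes the squared MMD between a reweighted source sample and a target sample with a Gaussian kernel. The function names, the kernel parameterization exp(-||x - y||^2 / (2 sigma^2)) and the default bandwidth are assumptions; KMM itself solves a constrained quadratic program over the weights, which is not shown here.

```python
import math


def gaussian_kernel(x, y, sigma):
    # Assumed parameterization of the Gaussian kernel.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))


def weighted_mmd2(source, target, weights, sigma=1.0):
    """Squared MMD between the reweighted source sample and the target sample."""
    m, n = len(source), len(target)
    # Source-source, source-target and target-target kernel averages.
    ss = sum(weights[i] * weights[j] * gaussian_kernel(source[i], source[j], sigma)
             for i in range(m) for j in range(m)) / m ** 2
    st = sum(weights[i] * gaussian_kernel(source[i], t, sigma)
             for i in range(m) for t in target) / (m * n)
    tt = sum(gaussian_kernel(t, u, sigma) for t in target for u in target) / n ** 2
    return ss - 2 * st + tt
```

With a source sample `[[0.0], [5.0]]` and a target sample `[[0.0]]`, uniform weights leave a clearly positive value, while the weights `[2.0, 0.0]` drive it to zero by discarding the source point that is far from the target.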
To compare only the adaptation effect of each method, we use, for all of them, the same class of functions to learn the task which is the class of fully-connected neural networks with ReLU activation functions and a static architecture. All networks implement a projecting regularization of parameter . Adam optimizer kingma2014adam is used in all experiments for the training of neural networks. For WANN algorithm, the two networks , are chosen in the specified class . For DANN algorithm, a linear discriminative network is placed at the last hidden layer of the task network, thus DANN uses the same hypothesis as other compared methods to learn the task.
4.1 Synthetic Experiment
We first propose to give an intuitive understanding of WANN behavior through a one-dimensional dataset. For this purpose, we consider the synthetic experiment where source and target input instances are drawn uniformly on
. Source instances follow (with equal probability) one of these five labeling functions:for , with . Target instances follow the labeling function . As presented in Figure 2.A, we thus model a domain adaptation scenario where target and source data have fairly the same behaviour on the first half of the distribution but differ on the second. We consider labeled target data equally separated along the domain with additional noise (black squares).
Figure 2.B displays the predictions computed with the "No reweight" method, which attributes uniform weights to all training instances. It appears that the "No reweight" strategy fails to provide a suitable hypothesis for the target task, as it follows the mean of the source tasks on the second half of the domain. On the contrary, the two domain adaptation methods TrAdaBoostR2 (Figure 2.C) and WANN (Figure 2.D) are able to "select", with an appropriate reweighting, the source instances whose behaviour is similar to that of the target data and to discard the others. WANN, however, presents a more continuous reweighting than TrAdaBoostR2 due to the use of a weighting network which conserves some spatial structure. In this case, a slight benefit is observed for WANN in terms of target risk. We notice, however, suboptimal convergence of WANN in some cases of this synthetic setup (in particular for high learning rates), which may be due to the instability of adversarial training Mescheder2018ConvergenceAdversarial.
4.2 Experiments on a large dataset
A major advantage of the WANN algorithm over previous instance-based methods for domain adaptation in regression is that it provides a weighting strategy suited to neural networks. Thus, our method scales better to large datasets than other methods involving kernels and quadratic programming. We demonstrate here the efficiency of WANN on the UCI dataset Superconductivity Hamidieh2018Superconductor, Dua2019UCI, against "No Reweight", TrAdaBoostR2 and DANN, which can also handle large datasets.
The goal is to predict the critical temperature of superconductors based on features extracted from their chemical formula. This is a common regression problem in industry, as manufacturers are particularly interested in modeling the relationship between a material and its properties.
We divide this dataset into separate domains following the setup of Pardoe2010boost. We select an input feature with a moderate correlation factor with the output (). We then sort the set according to this feature and split it into four parts: low (l), middle-low (ml), middle-high (mh) and high (h), each part defining a domain with around instances. The considered feature is then withdrawn from the dataset. We conduct an experiment for each ordered pair of domains, which leads to 12 experiments. All source and target labeled instances are used in the training phase; the other target data are used to compute the results reported in Table 1. We also report the average MSE as well as the average rank over the 12 experiments. Notice that each experiment is repeated times to obtain the standard deviation in brackets. Here, the networks from used by all methods to learn the task are composed of a layer of neurons, with a projecting parameter . The WANN weighting network is taken in with a projecting constant equal to . The learning rate is set to , the number of epochs to and the batch size to . Standard scaling is performed with the training data on both input and output features. We also consider two basic methods: "Src Only", trained on source data only, and "Tgt Only", trained on the few labeled target data only.
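The domain construction described above can be sketched as follows. The function name, the row representation (lists of numbers) and the equal-size split are assumptions; the text only states that the sorted set is split into four parts and that the splitting feature is then removed.

```python
from itertools import permutations

DOMAIN_NAMES = ("l", "ml", "mh", "h")


def split_domains(rows, feature_index, names=DOMAIN_NAMES):
    """Sort rows by one feature, cut them into len(names) parts, drop the feature."""
    ordered = sorted(rows, key=lambda r: r[feature_index])
    k = len(names)
    size = len(ordered) // k
    domains = {}
    for i, name in enumerate(names):
        start = i * size
        # The last part absorbs the remainder when the split is not exact.
        stop = (i + 1) * size if i < k - 1 else len(ordered)
        # Withdraw the splitting feature from every row of the domain.
        domains[name] = [r[:feature_index] + r[feature_index + 1:]
                         for r in ordered[start:stop]]
    return domains


# Every ordered (source, target) pair of distinct domains: 4 * 3 = 12 experiments.
experiment_pairs = list(permutations(DOMAIN_NAMES, 2))
```

Using ordered pairs matters because adapting from "l" to "h" and from "h" to "l" are different experiments.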
The results of Table 1 underline the ability of WANN to adapt efficiently between domains. In particular, we observe significant gains against DANN, "No Reweight", "Src Only" and "Tgt Only" when the source and target domains are less related, for instance when adapting from "middle-high" to "low" (mh → l). TrAdaBoostR2 shows results competitive with WANN in some experiments. It should be mentioned, however, that this method requires training networks, whereas the others only have to train one. Besides, as boosting iterations need to be executed successively, the training of these networks cannot be parallelized. The fact that the WANN algorithm is based on the minimization of a theoretically well-founded objective function may explain its better performance over TrAdaBoostR2.
Table 1: MSE for each source → target experiment on Superconductivity (standard deviations in brackets).

|Expe.|l → ml|l → mh|l → h|ml → l|ml → mh|ml → h|mh → l|
|---|---|---|---|---|---|---|---|
|Tgt Only|432 (54)|412 (68)|385 (63)|726 (46)|488 (60)|527 (76)|875 (82)|
|Src Only|563 (35)|512 (50)|883 (176)|525 (34)|316 (23)|513 (55)|1996 (85)|
|No Re.|444 (28)|436 (35)|498 (56)|499 (12)|317 (19)|477 (63)|1167 (105)|
|DANN|597 (39)|589 (66)|755 (90)|484 (21)|343 (20)|465 (62)|1526 (254)|
|TrAdaB.|313 (17)|340 (39)|383 (62)|483 (44)|281 (6)|383 (35)|659 (22)|
|WANN|235 (10)|324 (18)|392 (40)|430 (9)|261 (13)|352 (26)|626 (25)|

|Expe.|mh → ml|mh → h|h → l|h → ml|h → mh|Avg MSE|Avg rank|
|---|---|---|---|---|---|---|---|
|Tgt Only|669 (77)|599 (59)|982 (106)|725 (81)|602 (80)|618 (71)|4.67|
|Src Only|629 (29)|353 (14)|3092 (531)|1101 (209)|372 (16)|904 (105)|5.00|
|No Re.|566 (34)|344 (13)|740 (41)|493 (40)|345 (17)|527 (39)|3.42|
|DANN|503 (29)|387 (39)|1808 (553)|527 (65)|355 (23)|695 (105)|4.33|
|TrAdaB.|458 (36)|338 (9)|656 (23)|555 (61)|404 (23)|438 (31)|2.25|
|WANN|392 (12)|339 (10)|625 (30)|503 (28)|331 (23)|401 (20)|1.33|
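The "Avg rank" column can be recomputed from the mean MSE values of the table. The sketch below copies those means (standard deviations omitted) in the column order of the table and ranks the six methods within each experiment, rank 1 being the lowest MSE. The function name is hypothetical and ties are not handled, which is sufficient here since no experiment contains tied means.

```python
# Mean MSE per method, in the order of the 12 experiment columns of Table 1.
TABLE_1 = {
    "Tgt Only": [432, 412, 385, 726, 488, 527, 875, 669, 599, 982, 725, 602],
    "Src Only": [563, 512, 883, 525, 316, 513, 1996, 629, 353, 3092, 1101, 372],
    "No Re.":   [444, 436, 498, 499, 317, 477, 1167, 566, 344, 740, 493, 345],
    "DANN":     [597, 589, 755, 484, 343, 465, 1526, 503, 387, 1808, 527, 355],
    "TrAdaB.":  [313, 340, 383, 483, 281, 383, 659, 458, 338, 656, 555, 404],
    "WANN":     [235, 324, 392, 430, 261, 352, 626, 392, 339, 625, 503, 331],
}


def average_ranks(table):
    """Average, over experiments, of each method's rank (1 = lowest MSE)."""
    methods = list(table)
    n_experiments = len(next(iter(table.values())))
    totals = dict.fromkeys(methods, 0)
    for j in range(n_experiments):
        # Sort methods by their MSE on experiment j and accumulate their ranks.
        ordered = sorted(methods, key=lambda name: table[name][j])
        for rank, name in enumerate(ordered, start=1):
            totals[name] += rank
    return {name: totals[name] / n_experiments for name in methods}


ranks = average_ranks(TABLE_1)
```

Rounded to two decimals, `ranks` reproduces the reported column: 4.67, 5.00, 3.42, 4.33, 2.25 and 1.33.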
4.3 Experiments on small datasets
In order to compare our method against KMM, KLIEP and GDM, we consider several experiments on smaller datasets extracted from Kin-8xy Rasmussen1996delve and Amazon review Blitzer2007SA. We choose the same experimental setups as GDM in Cortes2019GeneralDisc for the choice of training and testing data. Notice that the training set in all experiments is composed of the source instances and a few labeled target instances, as well as unlabeled target instances. However, WANN and TrAdaBoostR2 do not use the unlabeled ones. It should also be underlined that KMM and KLIEP are two-stage methods which first reweight the training instances and then learn the task hypothesis. Gaussian kernels are used only in the first stage. To learn the task, the same class of hypotheses is used for all methods. An exception is made for the GDM algorithm, which is a one-stage algorithm implemented for hypotheses induced by PSD kernels.
The first experiments are conducted on Kin-8xy Rasmussen1996delve which is a family of datasets synthetically generated from a realistic simulation of the forward kinematics of an 8 link all-revolute robot arm. The task consists in predicting the distance of the end-effector from a target. The task for each dataset has a specific degree of noise (moderate "m" or high "h") and linearity (fairly-linear "f", non-linear "n"). We conduct one experiment on each of the 12 pairs of domains defined by these 4 datasets. We pick 200 source, 200 target unlabeled and 10 target labeled instances. 400 other target instances are used to compute the MSE scores reported in Figure 3.
We conduct the next experiments on the cross-domain sentiment analysis dataset of Amazon review Blitzer2007SA where reviews from four domains: dvd, kitchen, electronics and books are rated between 1 and 5. The task consists in predicting the rating given one review. For pre-processing, we select the top 1000 uni-grams and bi-grams. Here, 700 labeled source and unlabeled target data are given as well as 50 labeled target data. The results are computed on 1000 target data.
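The n-gram selection mentioned above can be sketched with the standard library. Selecting by corpus frequency, lowercasing and whitespace tokenization are assumptions, since the text only states that the top 1000 uni-grams and bi-grams are kept; the function names are hypothetical.

```python
from collections import Counter


def extract_ngrams(text):
    """Return the uni-grams and bi-grams of a review as a list of strings."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


def top_ngrams(reviews, k=1000):
    """Vocabulary made of the k most frequent uni-grams and bi-grams."""
    counts = Counter()
    for text in reviews:
        counts.update(extract_ngrams(text))
    return [gram for gram, _ in counts.most_common(k)]


def vectorize(text, vocabulary):
    """Count vector of a review over a fixed vocabulary."""
    counts = Counter(extract_ngrams(text))
    return [counts[gram] for gram in vocabulary]
```

Each review is then represented by a vector of length `k`, which serves as the input of the regression networks.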
For both datasets, the networks of used to learn the task are composed of layers of respectively and neurons, parameter is set to . Dropouts are added at the end of each layer with the respective rates . A learning rate of , epochs and a batch size of are used in the optimization for the experiments on Kin-8xy. For the ones on Amazon review the number of epochs is set to and the batch size to . All experiments are run times to compute standard deviations.
The choice of WANN hyper-parameters lies in the choice of network architecture. In the experiments, we arbitrarily choose the same architecture as task networks from . For each dataset, we choose the same projecting parameter of in all experiments, the choice of is done using cross-validation on one of the experiments using the few training labeled target data. The constant selected here is for Kin-8xy and for Amazon review. We try other architectures and choices of parameter for and also notice leading results for WANN algorithm.
Figure 3.A presents the results of the kin experiments. WANN provides the best MSE in a majority of experiments, in particular when labeling functions differ between source and target domains. Our algorithm again presents better performance than the other methods on the sentiment analysis experiments (Table 2). As the only difference between WANN and the other domain adaptation methods (with the exception of GDM and DANN) is the weighting strategy, these results underline the efficiency of using a neural network to learn the source instance weights. Notice that our method and TrAdaBoostR2 do not take advantage of unlabeled target data; nevertheless, the two methods present the best scores in almost all experiments. These considerations highlight the difficulty of performing consistent unsupervised adaptation on a regression task. This fact can also be observed in Figure 3.B, which presents the impact of the number of labeled target data on the target risk. We observe that significant decreases in MSE are due to the presence of labeled target data more than to the method used, in particular when labeling functions differ between domains (fm → nh or nh → fm). In these cases, it appears that it is better to use a "No reweight" strategy with a few labeled target data than to use an unsupervised algorithm. However, we observe that the "No Reweight" strategy needs between to labeled target data to obtain the same level of MSE obtained with the WANN algorithm using only of them.
|Avg MSE||970 (24)||1002 (33)||1127 (36)||1017 (19)||1019 (18)||1020 (18)||1039 (2)|
In this work, we present a novel instance-based approach for regression tasks in the context of supervised domain adaptation. We show that the weights given to source instance losses during the training phase can be optimally adjusted with a neural network in order to learn the target task efficiently. We propose the WANN algorithm, which minimizes with adversarial techniques an original objective function involving the Y-discrepancy. In various experiments, the WANN algorithm provides results which outperform baselines for regression domain adaptation and offers a weighting strategy able to handle large datasets. We show that using a weighting network for instance-based domain adaptation provides an efficient way to conserve spatial structure in the weighting scheme. Our work also reveals the importance of labeled target data for obtaining well-performing models in the context of domain adaptation for regression tasks.
-  Ben Adlam, Corinna Cortes, Mehryar Mohri, and Ningshan Zhang. Learning gans and ensembles using discrepancy. In Advances in Neural Information Processing Systems, pages 5788–5799, 2019.
-  Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 214–223, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
-  Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 137–144. MIT Press, 2007.
-  John Blitzer, Mark Dredze, and Fernando Pereira. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. https://www.cs.jhu.edu/~mdredze/datasets/sentiment/index2.html. In Proceedings of the 45th annual meeting of the association of computational linguistics, pages 440–447, 2007.
-  Minmin Chen, Zhixiang Xu, Kilian Q. Weinberger, and Fei Sha. Marginalized denoising autoencoders for domain adaptation. In Proceedings of the 29th International Conference on Machine Learning, ICML’12, page 1627–1634, Madison, WI, USA, 2012. Omnipress.
-  Corinna Cortes and Mehryar Mohri. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519, 2014.
-  Corinna Cortes, Mehryar Mohri, and Andrés Muñoz Medina. Adaptation based on generalized discrepancy. J. Mach. Learn. Res., 20(1):1–30, January 2019.
-  Dheeru Dua and Casey Graff. UCI machine learning repository. http://archive.ics.uci.edu/ml, 2017.
-  Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. J. Mach. Learn. Res., 17(1):2096–2030, January 2016.
-  Jochen Garcke and Thomas Vanck. Importance weighted inductive transfer learning for regression. In Toon Calders, Floriana Esposito, Eyke Hüllermeier, and Rosa Meo, editors, Machine Learning and Knowledge Discovery in Databases, pages 466–481, Berlin, Heidelberg, 2014. Springer Berlin Heidelberg.
-  Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS’10). Society for Artificial Intelligence and Statistics, 2010.
-  Kam Hamidieh. A data-driven statistical model for predicting the critical temperature of a superconductor. https://archive.ics.uci.edu/ml/datasets/superconductivty+data#. Computational Materials Science, 154:346–354, 2018.
-  Jiayuan Huang, Arthur Gretton, Karsten Borgwardt, Bernhard Schölkopf, and Alex J. Smola. Correcting sample selection bias by unlabeled data. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 601–608. MIT Press, 2007.
-  Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
-  Seiichi Kuroki, Nontawat Charoenphakdee, Han Bao, Junya Honda, Issei Sato, and Masashi Sugiyama. Unsupervised domain adaptation based on source-guided discrepancy. Proceedings of the AAAI Conference on Artificial Intelligence, 33:4122–4129, 07 2019.
-  Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In COLT, 2009.
-  Andrés Munoz Medina. Learning Theory and Algorithms for Auctioning and Adaptation Problems. PhD thesis, New York University, 2015.
-  Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs do actually converge? In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3481–3490, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
-  Mehryar Mohri and Andres Muñoz Medina. New analysis and algorithm for learning with drifting distributions. In Nader H. Bshouty, Gilles Stoltz, Nicolas Vayatis, and Thomas Zeugmann, editors, Algorithmic Learning Theory, pages 124–138, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.
-  David Pardoe and Peter Stone. Boosting for regression transfer. In Proceedings of the 27th International Conference on Machine Learning (ICML), June 2010.
-  Carl Edward Rasmussen, Radford M Neal, Geoffrey Hinton, Drew van Camp, Michael Revow Zoubin Ghahramani, Rafal Kustra, and Rob Tibshirani. The delve project. http://www.cs.toronto.edu/~delve/data/datasets.html, 1996.
-  Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3723–3732, 2018.
-  Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
-  Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul von Bünau, and Motoaki Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Proceedings of the 20th International Conference on Neural Information Processing Systems, NIPS’07, page 1433–1440, Red Hook, NY, USA, 2007. Curran Associates Inc.
-  Baochen Sun and Kate Saenko. Deep coral: Correlation alignment for deep domain adaptation. In European conference on computer vision, pages 443–450. Springer, 2016.
-  Qian Sun, Rita Chattopadhyay, Sethuraman Panchanathan, and Jieping Ye. A two-stage weighting framework for multi-source domain adaptation. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 505–513. Curran Associates, Inc., 2011.
-  Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7167–7176, 2017.
-  Yuchen Zhang, Tianle Liu, Mingsheng Long, and Michael Jordan. Bridging theory and algorithm for domain adaptation. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 7404–7413, Long Beach, California, USA, 09–15 Jun 2019. PMLR.