1 Introduction
With advances in algorithms and hardware, the amount of high-quality, labelled training data is becoming the bottleneck for many machine learning tasks. Methods for making good use of available unlabelled data are thus an active area of research with great potential. Two established methods addressing this issue are semi-supervised learning and domain adaptation. Semi-supervised learning aims to improve a model of the conditional $P(Y|X)$ through a better estimate of the marginal $P(X)$, obtainable via unlabelled data from the same distribution (Chapelle et al., 2010). However, due to different data sources, experimental setups, or sampling processes, this i.i.d. assumption is often violated in practice (Storkey, 2009). Domain adaptation, on the other hand, aims to adapt a model trained on a source domain (or distribution) to a different, but related target distribution from which no, or only limited, labelled data is available (Pan and Yang, 2010; Quiñonero-Candela et al., 2009). This situation arises, for example, when training and test sets are not drawn from the same distribution.

This paper investigates the possibility of semi-supervised learning in a domain adaptation setting, that is, not only adapting but also actively improving a model given unlabelled data from different distributions. Here, we focus on the most commonly used and well-studied assumption in domain adaptation: the covariate-shift assumption (Shimodaira, 2000; Sugiyama and Kawanabe, 2012).
With superscripts $S$ and $T$ indicating source and target domains respectively, covariate shift states that the difference in distributions arises exclusively as a consequence of a shift in the marginal distributions, $P^S(X) \neq P^T(X)$, while the conditional, $P(Y|X)$, remains invariant. Using the domain variable $D$, this assumption can thus be formulated as $P(Y|X, D) = P(Y|X)$. Assuming that changes in $P(X)$ are caused externally ($D \to X$), as opposed to some internal process like, for example, a sampling bias ($X \to D$ or $Y \to D$), this covariate-shift assumption thus implicitly treats all features as causal ($X \to Y$) (Storkey, 2009), for otherwise the v-structure at $X$ ($Y \to X \leftarrow D$) would introduce a conditional dependence of $Y$ on the domain $D$ given $X$ (Koller and Friedman, 2009).
Recent work argued that semi-supervised learning should not be possible in such a causal learning setting ($X \to Y$), as $P(X)$ and $P(Y|X)$ should be independent mechanisms in this case (Janzing and Schölkopf, 2010; Schölkopf et al., 2012). In other words, the conditional distributions of each variable given its causes (i.e., its mechanism) represent “autonomous modules that do not inform or influence each other” (Peters et al., 2017). In the causal setting, a better estimate of $P(X)$ obtainable from unlabelled data should thus not help to improve the estimate of the independent mechanism $P(Y|X)$. With effect features ($Y \to X$), on the other hand, semi-supervised learning is, in principle, possible (Janzing and Schölkopf, 2015).
This need for effect features for semi-supervised learning motivates considering the specific case of covariate shift shown in Fig. 1. Note that, by the same v-structure argument as before, we require that the domain $D$ directly affects only the causal features for covariate shift to hold. We thus assume throughout that, through prior causal discovery, expert knowledge, or background information, the underlying causal structure is known and compatible with Fig. 1. We will make this assumption precise and discuss a possible relaxation in Sec. 3.1.
While requiring particular causal relationships between variables to be known a priori may seem a restrictive assumption, we have already seen that other commonly made, untestable assumptions such as covariate shift also carry implicit assumptions of a causal nature. Due to the lack of labels from the target distribution, the problem of unsupervised domain adaptation considered in this paper is ill-posed, and thus requires such strong assumptions. Our assumptions enable us to go beyond adaptation and to explore the possibility of semi-supervised learning away from the i.i.d. setting when the underlying causal structure is known.
The following two examples constitute real-world scenarios which are compatible with the considered setting of prediction from cause and effect features.

Predicting disease, $Y$, from risk factors like genetic predisposition or smoking, $X$, and symptoms, $Z$: while we might have (possibly unlabelled) data from multiple geographical regions or demographic groups leading to different distributions over risk factors ($P^S(X) \neq P^T(X)$), we would not necessarily expect this to affect the behaviour of the disease itself ($P(Y|X)$ and $P(Z|Y)$).

Predicting a hidden intermediate state $Y$ of a physical system with inputs $X$ and outputs $Z$: again, we might have data from various experiments with differing input distributions ($P^S(X) \neq P^T(X)$), but the laws of physics or nature ($P(Y|X)$ and $P(Z|Y)$) should not change.
We highlight the following contributions:

We introduce the causally-inspired semi-generative model, $P(Y, Z \mid X)$, for learning with cause and effect features, and show how its parameters can be fitted from both labelled and unlabelled data in a covariate-shift adaptation setting using a maximum likelihood approach (Sec. 3).

We show how our method may also be applied for regression, using real-world protein data (Sec. 4).
2 Related Work
A sizeable body of literature has been published on the topic of domain adaptation; see, e.g., Patel et al. (2015) for a recent survey. Our focus is on unsupervised domain adaptation under covariate shift, where no labels from the target domain are available and the conditional $P(Y|X)$ remains invariant. In general, the aim is to find a predictor, $f$, which minimizes the target risk, $\mathbb{E}^T[\ell(Y, f(X))]$, for a given loss function, $\ell$. Most previous works on this setting fit into one of two families.

Importance weighting approaches make use of the invariance of $P(Y|X)$ to rewrite the unknown target distribution as $P^T(X, Y) = w(X)\, P^S(X, Y)$, where the importance weights $w(X) = p^T(X)/p^S(X)$ can be estimated from unlabelled data (Shimodaira, 2000; Sugiyama et al., 2007; Quiñonero-Candela et al., 2009; Sugiyama and Kawanabe, 2012). This allows for empirical risk minimization on the reweighted labelled source sample to approximate the expected target risk.
Feature transformation approaches, on the other hand, are based on finding domain-invariant features in a new (sub)space (Fernando et al., 2013; Gong et al., 2012). Generally, they learn a map $\phi$ such that the projected features are as domain invariant as possible, $P^S(\phi(X)) \approx P^T(\phi(X))$. Various criteria have been used to measure such similarity, e.g., MMD (Pan et al., 2011), HSIC (Yan et al., 2017), mutual information with the domain (Shi and Sha, 2012), or the performance of a domain classifier (Ganin et al., 2016). The final model is trained on the transformed labelled sample.

Note that in either approach unlabelled data is used only for adaptation, while the final model is trained on labelled data only. The current work aims to also include unlabelled data in the model fitting when labelled data is scarce. To the best of our knowledge, this is the first work addressing this novel setting.
3 Learning With Cause and Effect Features
We now state our assumptions, show how they lead us to a semi-generative model, and show how to fit its parameters using a maximum-likelihood approach. Note, however, that our semi-generative model can also be applied in a Bayesian way; see Appendix D of the supplementary material for details and further experiments using a Bayesian approach.
3.1 Assumptions
Consider the setting of predicting the outcome of a target random variable, $Y$, from the observation of two disjoint, non-empty sets of random variables, or features, $X$ and $Z$. Assume that we are given a small, labelled sample from a source domain ($D = 0$) and a potentially large, unlabelled sample from a target domain ($D = 1$). We formalise our causal assumptions as motivated in Sec. 1 using Pearl's framework of a structural causal model (SCM) (Pearl, 2009).

An SCM over a set of random variables $V_1, \dots, V_n$ with corresponding causal graph $G$ is defined by a set of structural equations, $V_i := f_i(\mathrm{PA}_i, U_i)$, where $\mathrm{PA}_i$ is the set of causal parents of $V_i$ in $G$, the $U_i$ are mutually independent random noise variables, and the $f_i$ are deterministic functions.
Assumption 1 (Causal structure).
The relationship between the random variables $X$, $Y$, $Z$, and the domain indicator $D$ is accurately captured by the SCM

$$X := f_X(D, U_X), \quad (1)$$
$$Y := f_Y(X, U_Y), \quad (2)$$
$$Z := f_Z(Y, U_Z), \quad (3)$$

where $U_X$, $U_Y$, and $U_Z$ are mutually independent, and $P(X|D)$, $P(Y|X)$, and $P(Z|Y)$ represent independent mechanisms.
This SCM is shown schematically in Fig. 2. The (unknown) noise distributions together with Eqs. (1)–(3) induce a range of observational and interventional distributions over $(X, Y, Z)$ which depend on $D$. Here, we focus on the two observational distributions arising from the two values of $D$, which we denote by $P^S$ (source domain) and $P^T$ (target domain).^1

^1 Note that even though we focus on the two-domain case here, it should be straightforward to include additional labelled or unlabelled data from different sources, as in domain generalisation (Rojas-Carulla et al., 2018).
It is worth pointing out that Assumption 1 does not allow a direct causal influence of $X$ on $Z$, and is thus strictly stronger than necessary. (As stated in Sec. 1, it is sufficient for covariate shift that $D$ directly affects only $X$.) This assumption of two conditionally independent feature sets given $Y$ also plays a key role in the popular co-training algorithm (Blum and Mitchell, 1998). Interestingly, it has been shown for co-training that performance deteriorates once this assumption is violated and the two feature sets are correlated beyond a certain degree (Krogel and Scheffer, 2004). Similar behaviour can reasonably be expected for our related setting, justifying the stronger assumption $X \perp Z \mid Y$.
3.2 Analysis
Given that the joint distribution induced by an SCM factorises into independent mechanisms (Pearl, 2009), it follows from Assumption 1 that

$$P(X, Y, Z \mid D) = P(X \mid D)\, P(Y \mid X)\, P(Z \mid Y). \quad (4)$$
It is clear from Eq. (4) that only the distribution of causes, $P(X|D)$, is directly affected by the domain change, while the two mechanisms generating $Y$ from $X$, and $Z$ from $Y$, are invariant across domains. It is this invariance which we will exploit by learning a map from $X$ to $Z$ from unlabelled data, which can be thought of as a noisy composition of $f_Y$ and $f_Z$, as indicated by the dashed arrow in Fig. 2.
Note that changes in the distribution of causes are still propagated through the two independent, domain-invariant mechanisms, $P(Y|X)$ and $P(Z|Y)$, and thereby also indirectly affect the distributions over $Y$ and $Z$. We also note that for importance weighting it is sufficient to correct for the shift in $X$. Writing $w(X) = p^T(X)/p^S(X)$, it follows from Eq. (4) that

$$P^T(X, Y, Z) = w(X)\, P^S(X, Y, Z). \quad (5)$$
Thus, conditioning on causal features is sufficient to obtain domain invariance, an idea which also plays a central role in "Causal inference using invariant prediction" (Peters et al., 2016).
Since it is the aim of domain adaptation to minimise the target-domain risk, we are interested in obtaining a good estimate of the target conditional, $P^T(Y \mid X, Z)$. From Eq. (4), we have

$$P(Y \mid X, Z, D) = \frac{P(Y \mid X)\, P(Z \mid Y)}{P(Z \mid X)}, \quad (6)$$

where the distribution of causes $P(X|D)$ cancels and no remaining term depends on $D$. This shows that covariate shift indeed holds, as intended by construction. While it would be possible to write the target conditional differently, only conditioning on $X$ as in Eq. (6) leads to a domain-invariant expression. Such invariance is necessary since, due to a lack of target labels, the numerator involving $P(Y \mid X)$ can only be estimated in the source domain.
Moreover, Eq. (6) shows that the conditional can be expressed exclusively in terms of the mechanisms $P(Y|X)$ and $P(Z|Y)$, and is thus independent of the distribution over causes, $P(X|D)$. A better estimate of $P(X)$ obtainable from unlabelled data will thus not help improve our estimate of $P(Y|X)$. This is consistent with the claims of Schölkopf et al. (2012) that the distribution of causal features is useless for semi-supervised learning, while that of effect features may help. Another way to see this is directly from the data-generating process, i.e., the SCM in Assumption 1. While Eq. (1) does not depend on $Y$ (which is only drawn after $X$), Eq. (3) clearly does.
What is novel about our approach is explicitly considering both cause and effect features at the same time. Substituting Eq. (2) into Eq. (3), we obtain

$$Z = f_Z(f_Y(X, U_Y), U_Z),$$

so that by learning to predict $Z$ from $X$ we may hope to improve our estimates of $f_Y$ and $f_Z$. In terms of the induced distributions, this corresponds to improving our estimates of $P(Y|X)$ and $P(Z|Y)$ via a better estimate of $P(Z|X)$, which we will refer to as the unsupervised model. This is possible since parameters are shared between the supervised and unsupervised models.
3.3 Semi-Generative Modelling Approach
Our analysis of the different roles played by $X$ and $Z$ suggests explicitly modelling the distribution of $(Y, Z)$, while conditioning on $X$,

$$p_\theta(y, z \mid x) = p_{\theta_1}(y \mid x)\, p_{\theta_2}(z \mid y), \quad (7)$$

where $\theta = (\theta_1, \theta_2)$. We refer to the model on the LHS as semi-generative, as it can be seen as an intermediate between fully generative, $p_\theta(x, y, z)$, and fully discriminative, $p_\theta(y \mid x, z)$.
As opposed to a fully-generative model, our semi-generative model is domain invariant due to conditioning on $X$ and can thus be fitted using data from both domains. At the same time, as opposed to a fully-discriminative model, the semi-generative model also allows including unlabelled data by summing (or integrating, if $Y$ is continuous) out $y$,

$$p_\theta(z \mid x) = \sum_y p_{\theta_1}(y \mid x)\, p_{\theta_2}(z \mid y). \quad (8)$$
For our setting, a semi-generative framework thus combines the best of both worlds: domain invariance and the possibility to include unlabelled data in the parameter fitting process.
It is clear from Eq. (8) that we can always obtain the unsupervised model exactly for classification tasks. For regression, however, we are restricted to particular types of mechanisms $p_{\theta_1}(y \mid x)$ and $p_{\theta_2}(z \mid y)$ for which the integral can be computed analytically. Otherwise, we have to resort to approximating Eq. (8).
Our approach can then be summarised as follows. We train a semi-generative model $p_\theta(y, z \mid x)$, formed by the two mechanisms $p_{\theta_1}(y \mid x)$ and $p_{\theta_2}(z \mid y)$, on the labelled sample, such that the corresponding unsupervised model (Eq. 8) agrees well with the unlabelled cause-effect pairs. For prediction, given a parameter estimate $\hat\theta$, the conditional $p_{\hat\theta}(y \mid x, z)$ can then easily be recovered from $p_{\hat\theta_1}(y \mid x)$ and $p_{\hat\theta_2}(z \mid y)$ as in Eq. (6).
3.4 Fitting by Maximum Likelihood
The average log-likelihood of our semi-generative model given the labelled source data $\{(x_i, y_i, z_i)\}_{i=1}^{n_L}$ is given by

$$\mathcal{L}_S(\theta) = \frac{1}{n_L} \sum_{i=1}^{n_L} \log p_\theta(y_i, z_i \mid x_i), \quad (9)$$
and importance weighting by $w(x) = p^T(x)/p^S(x)$ as described in Eq. (5) yields the weighted, or adapted, form

$$\mathcal{L}_S^w(\theta) = \frac{1}{n_L} \sum_{i=1}^{n_L} w(x_i) \log p_\theta(y_i, z_i \mid x_i). \quad (10)$$
The corresponding average log-likelihood of the unsupervised model given unlabelled target data $\{(x_j, z_j)\}_{j=1}^{n_U}$ is

$$\mathcal{L}_T(\theta) = \frac{1}{n_U} \sum_{j=1}^{n_U} \log p_\theta(z_j \mid x_j). \quad (11)$$
We propose to combine labelled and unlabelled data in a pooled log-likelihood by interpolating between the average source (Eq. 9) and target (Eq. 11) log-likelihoods,

$$\mathcal{L}_{\mathrm{pool}}(\theta) = \lambda\, \mathcal{L}_S(\theta) + (1 - \lambda)\, \mathcal{L}_T(\theta), \quad (12)$$

where the hyperparameter $\lambda \in (0, 1]$ has an interpretation as the weight of the labelled sample. For example, $\lambda = 1$ corresponds to using only the labelled sample, whereas $\lambda = n_L/(n_L + n_U)$ gives equal weight to labelled and unlabelled examples; see Sec. 4.4 for more details.

4 Experiments
Since it is our goal to improve model performance with unlabelled data ($n_U$) when the amount of labelled data ($n_L$) is the main limiting factor, we focus in our experiments on the case of small $n_L$ (relative to the dimensionality) and compare learning curves as $n_U$ is increased.
4.1 Estimators and Compared Methods
We compare our approach with purely-supervised and importance-weighting approaches which take the known causal structure (Assumption 1) into account:

$\hat\theta_S$ – training on the labelled source data only (baseline, no adaptation)

$\hat\theta_W$ – training on reweighted source data (adaptation by importance weighting, using known weights on the synthetic datasets)

$\hat\theta_P$ – training on the entire pooled data set, combining unweighted labelled and unlabelled data via Eq. (12) (our proposed estimator)
Where applicable, we report the performance of a linear/logistic regression model, LR, trained on the joint feature set $(X, Z)$, i.e., ignoring the known causal structure. Moreover, we also consider LR trained after applying different feature transformation methods: TCA (Pan et al., 2011), MIDA (Yan et al., 2017), SA (Fernando et al., 2013), and GFK (Gong et al., 2012). For this, we use the domain-adaptation toolbox by Ke Yan with default parameters (Yan, 2016).

4.2 Synthetic Classification Data
To generate synthetic domain-adaptation datasets for binary classification which satisfy the assumed causal structure, we draw from an SCM of the form of Assumption 1: Gaussian causes $X$ whose distribution depends on the domain, a logistic label mechanism $P(Y = 1 \mid X) = \sigma(\cdot)$, where $\sigma$ is the logistic sigmoid function, and Gaussian effects $Z$ whose mean depends on $Y$. The resulting datasets all have linear decision boundaries, but can differ in domain discrepancy, class imbalance, and class overlap (or difficulty), depending on the choice of the SCM's parameters. For one such choice, an example draw is shown in Fig. 3. This data-generating process induces a logistic conditional $p_{\theta_1}(y \mid x)$ and Gaussian conditionals $p_{\theta_2}(z \mid y)$.
The corresponding unsupervised model (Eq. 8) for an unlabelled cause-effect pair $(x, z)$ is thus the two-component mixture

$$p_\theta(z \mid x) = \sum_{y \in \{0, 1\}} p_{\theta_1}(y \mid x)\, \varphi(z; \mu_y, \sigma^2), \quad (13)$$

where $\varphi(\,\cdot\,; \mu, \sigma^2)$ denotes the pdf of a normal random variable with mean $\mu$ and variance $\sigma^2$. Together with $p_{\theta_1}(y \mid x)$ and $p_{\theta_2}(z \mid y)$ given above, Eq. (13) suffices to compute our estimator. Note that, like a logistic regression model, our model has three parameters.

In addition, to test our approach in a discrete and higher-dimensional setting, we apply it to the LUCAS toy dataset^2, treating ’Lung Cancer’ as target $Y$, ’Smoking’ and ’Genetics’ as causes $X$, ’Coughing’ and ’Fatigue’ as effects $Z$, and ’Anxiety’ as domain indicator $D$.

^2 http://www.causality.inf.ethz.ch/data/LUCAS.html
4.3 Real-World Regression Data
To demonstrate how a semi-generative model can be used for linear regression, we apply our approach to the “Causal Protein-Signaling Network” data by Sachs et al. (2005), which contains single-cell measurements of 11 phosphoproteins and phospholipids under 14 different experimental conditions, as well as, importantly for our method, the corresponding inferred causal graph. We focus on a subset of variables which seems most compatible with our assumptions^3, and from which we extract two domain adaptation datasets by taking source data to correspond to normal conditions, while target data is obtained by an intervention on the causal feature; see Fig. 4. As can be seen, the first dataset (MEK → ERK → AKT) shows a high similarity between domains, whereas the second (PKC → PKA → AKT) seems more challenging due to high domain discrepancy.

^3 Assumption 1 is not fully satisfied because of the existence of confounding variables (e.g., PKA, see Fig. 4), so conclusions drawn may be limited. With causal inference and causal structures becoming of interest in more and more areas, however, more suitable real-world data will eventually become abundant. At this point, our work should thus be considered more methodological in nature.

As is often the case with biological data, the variables span multiple orders of magnitude and seem to be reasonably well approximated by power laws. We therefore first transform the data by taking logarithms and then fit a linear model in log-space, corresponding to a power-law relationship in the original space. Denoting the log-transformed cause, target, and effect by $X$, $Y$, and $Z$ as before, and using Gaussian noise with unknown variance, this corresponds to the model

$$Y := a X + b + U_Y, \qquad Z := c Y + d + U_Z, \quad (14)$$

with $U_Y \sim \mathcal{N}(0, \sigma_Y^2)$ and $U_Z \sim \mathcal{N}(0, \sigma_Z^2)$, and with corresponding distributions

$$p(y \mid x) = \varphi(y;\, a x + b,\, \sigma_Y^2), \qquad p(z \mid y) = \varphi(z;\, c y + d,\, \sigma_Z^2). \quad (15)$$
Substituting the first line of Eq. (14) for $y$ in the second line, and given that the sum of two independent Gaussian random variables is again Gaussian, we can compute the unsupervised model (Eq. 8) in this case as

$$p(z \mid x) = \varphi\big(z;\; c(a x + b) + d,\; c^2 \sigma_Y^2 + \sigma_Z^2\big). \quad (16)$$
Eqs. (14) and (16) combined allow us to compute our proposed estimator. To make predictions given a parameter estimate, we need to compute the mean of the conditional (Eq. 6). It is given by

$$\mathbb{E}[Y \mid x, z] = \frac{\sigma_Z^2\,(a x + b) + c\, \sigma_Y^2\,(z - d)}{\sigma_Z^2 + c^2 \sigma_Y^2}, \quad (17)$$

which can be interpreted as a weighted average of the predictions of each of the two independent mechanisms. A detailed derivation of Eq. (17) can be found in the supplementary material, Appendix A.
To investigate how background knowledge can aid our approach in challenging real-world applications, we also fit a model under the constraint that both slopes are negative, that is, fitting lines with negative slope on the harder dataset (PKC → PKA → AKT). This constraint captures that both PKC → PKA and PKA → AKT appear to be inverse relationships, something which may be known in advance from domain expertise.
4.4 Choosing the Hyperparameter $\lambda$
To choose $\lambda$, we performed extensive empirical evaluation on synthetic data considering different combinations of $n_L$ and $n_U$, the results of which can be found in the supplement, Appendix C. For classification, data was generated as detailed in Sec. 4.2 with a fixed choice of parameters. For regression, we used a linear Gaussian model to generate synthetic data.
For classification, we found that $\lambda = n_L/(n_L + n_U)$, giving equal weight to all observations (cf. Eq. 12), i.e., more weight to the unsupervised model as $n_U$ is increased, seems to be a good choice across settings.
In contrast, for linear regression a good choice of $\lambda$ does not seem to depend strongly on $n_L$ and $n_U$. Rather than weighting all observations equally, values of $\lambda$ giving a fixed majority weight to the average supervised log-likelihood appear to be preferred. We thus choose a constant value of $\lambda$ for our regression experiments. Note, however, that this value can be further increased when more labelled data becomes available and the unsupervised model becomes obsolete.
4.5 Simulations and Evaluation
For synthetic classification experiments, we fix the data-generating parameters and vary $n_L$, $n_U$, and the class overlap as indicated in the figure captions. We thus consider different amounts of labelled data and different levels of difficulty. We perform repeated simulations, each time drawing a new training set of size $n_L + n_U$ and a new target-domain test set. We report test-set averages of error rate and semi-generative negative log-likelihood (NLL), $-\log p_{\hat\theta}(y, z \mid x)$. The latter is the quantity our model is trained to minimise, and thus acts as a proxy or surrogate for the non-convex, discontinuous 0-1 loss.
For real-world regression experiments, we draw $n_L$ labelled source training observations and reserve 200 target observations as a test set. From the remaining target data, we then draw $n_U$ additional unlabelled training observations. (Each experiment performed by Sachs et al. (2005) contains ca. 1000 measurements.) We perform repeated simulations and report test-set averages of the root mean squared error (RMSE).
Code to reproduce all our results is available online.^4

^4 https://github.com/Juliusvk/SemiGenerativeModelling
5 Discussion
Classification results for two synthetic datasets are shown in Fig. 5. For both the more difficult and the simpler dataset, average error rate and variance decrease monotonically as a function of $n_U$, leading to significant (paired t-test) improvements of the pooled estimator $\hat\theta_P$ over $\hat\theta_S$, $\hat\theta_W$, and LR when sufficient unlabelled data is available. A very similar behaviour is observed for the semi-generative NLL, indicating that it is a suitable surrogate loss. Whereas the largest absolute drop in error rate is achieved on the more difficult dataset, the largest relative improvement and earlier saturation occur when $Z$ depends more strongly on $Y$ and thus carries more information about it. The latter is intuitive, as $Z$ can then be interpreted as a second label.

Results for the LUCAS toy data in Table 1 show similar behaviour to those in Fig. 5, and demonstrate that our approach is also suitable for discrete data and higher-dimensional features.
Table 1: Error rates on the LUCAS data as functions of $n_L$ (rows) and $n_U$ (columns).

$n_L \backslash n_U$    0      1      4      16     64     256
8                       0.232  0.230  0.226  0.220  0.212  0.208
16                      0.206  0.205  0.203  0.198  0.192  0.188
Regression results on the real datasets are shown in Fig. 6. On the simpler dataset (MEK → ERK → AKT), our approach outperforms the others when only four labelled observations are available. As $n_L$ is increased to 16, however, feature transformation methods gain the upper hand. Given that even LR (coinciding with the curve of TCA) yields better results in this case, a possible explanation is that, due to the common confounder PKA (see Fig. 4), our assumptions are violated. On the much more challenging dataset (PKC → PKA → AKT), none of the methods yields a low RMSE, but the restricted version of our approach performs best, followed by the restricted version of the purely-supervised baseline.
Comparison with Feature-Transformation Methods
The case of the harder protein dataset illustrates a potential advantage of our approach for real-world applications. Since we use the raw features, it is possible to incorporate available domain expertise into the model. Variables resulting from a transformation of the joint feature set, by contrast, are no longer easily interpretable, so including background knowledge is much harder for transformed features. As such transformations can also introduce new dependencies between variables, it is not clear how our approach and feature transformations could be easily combined. An interesting idea, though, could be to relax the assumption that the domain affects only the causes $X$, and then to correct for a shift in $Z$ due to $D$ by learning a transformation of $Z$ only which maximises domain invariance, prior to applying our approach. As a final note, the runtime of our method is roughly an order of magnitude less than that of the feature-transformation methods.
Combination with Importance Weighting
Importance weighting, on the other hand, should not be seen as an alternative, but rather as complementary to our approach. Through the unlabelled target sample we obtain an estimate of $p^T(x, z) = p^T(x)\, p(z \mid x)$. The first factor can be used to estimate importance weights, whereas our work has focused on improving the model via the information carried by the second factor. Both ideas could be combined by forming a weighted pooled log-likelihood, replacing the source term $\mathcal{L}_S$ in Eq. (12) with its weighted form $\mathcal{L}_S^w$ from Eq. (10).
Model Flexibility and the Role of $Y$
It seems our approach is more promising for classification than for regression tasks. Too much emphasis on the unlabelled data (as controlled by $\lambda$) can, for regression in particular, lead to overfitting of the unsupervised model. This can be observed for the pooled estimator on the harder protein dataset for large enough $n_U$, and is further illustrated on synthetic data in the supplement, Appendix B. Since the main difference between regression and classification in our approach is summing over a finite, or integrating over an infinite, number of values $y$ when computing the unsupervised model (Eq. 8), we conjecture that model flexibility plays an important role in determining the success of our approach. If there is a bottleneck at $Y$, so that only few values $y$ can explain a given cause-effect pair $(x, z)$, then the unsupervised model can help to improve our estimates of $p_{\theta_1}(y \mid x)$ and $p_{\theta_2}(z \mid y)$, as demonstrated for the case of binary classification. If, on the other hand, many possible $y$ can explain the observed $(x, z)$ equally well, then the unsupervised model appears to be less useful.
Acknowledgements
The authors would like to thank Adrian Weller and Michele Tonutti for helpful feedback on the manuscript.
References

A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92–100. ACM, 1998.

O. Chapelle, B. Schölkopf, and A. Zien. Semi-Supervised Learning. The MIT Press, 1st edition, 2010.

B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars. Unsupervised visual domain adaptation using subspace alignment. In Proceedings of the IEEE International Conference on Computer Vision, pages 2960–2967, 2013.

Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1–35, 2016.

B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2066–2073. IEEE, 2012.

D. Janzing and B. Schölkopf. Causal inference using the algorithmic Markov condition. IEEE Transactions on Information Theory, 56(10):5168–5194, 2010.

D. Janzing and B. Schölkopf. Semi-supervised interpolation in an anticausal learning scenario. Journal of Machine Learning Research, 16:1923–1948, 2015.

D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

M.-A. Krogel and T. Scheffer. Multi-relational learning, text mining, and semi-supervised learning for functional genomics. Machine Learning, 57(1-2):61–81, 2004.

S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.

S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199–210, 2011.

V. M. Patel, R. Gopalan, R. Li, and R. Chellappa. Visual domain adaptation: A survey of recent advances. IEEE Signal Processing Magazine, 32(3):53–69, 2015.

J. Pearl. Causality. Cambridge University Press, 2009.

J. Peters, P. Bühlmann, and N. Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):947–1012, 2016.

J. Peters, D. Janzing, and B. Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms. Adaptive Computation and Machine Learning Series. The MIT Press, Cambridge, MA, USA, 2017.

J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence. Dataset Shift in Machine Learning. The MIT Press, 2009.

M. Rojas-Carulla, B. Schölkopf, R. Turner, and J. Peters. Invariant models for causal transfer learning. Journal of Machine Learning Research, 19(36), 2018.

K. Sachs, O. Perez, D. Pe'er, D. A. Lauffenburger, and G. P. Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721):523–529, 2005.

B. Schölkopf, D. Janzing, J. Peters, E. Sgouritsa, K. Zhang, and J. Mooij. On causal and anticausal learning. In 29th International Conference on Machine Learning (ICML 2012), pages 1–8. International Machine Learning Society, 2012.

Y. Shi and F. Sha. Information-theoretical learning of discriminative clusters for unsupervised domain adaptation. In Proceedings of the 29th International Conference on Machine Learning, pages 1275–1282. Omnipress, 2012.

H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.

A. Storkey. When training and test sets are different: characterizing learning transfer. In Dataset Shift in Machine Learning, pages 3–28, 2009.

M. Sugiyama and M. Kawanabe. Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation. MIT Press, 2012.

M. Sugiyama, M. Krauledat, and K.-R. Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8(May):985–1005, 2007.

K. Yan. Domain adaptation toolbox. https://github.com/viggin/domainadaptationtoolbox, 2016.

K. Yan, L. Kou, and D. Zhang. Learning domain-invariant subspace using domain features and independence maximization. IEEE Transactions on Cybernetics, 2017.