1 Introduction
Due to the scarcity and often high acquisition cost of labelled data, machine learning methods that make effective use of large quantities of unlabelled data are crucial. One such method is semi-supervised learning (SSL)
(Zhu2005; Chapelle2010), where in addition to labelled data, possibly large numbers of unlabelled observations are available to the learner at training time. While positive results have been obtained on a range of problems, a shortcoming of SSL is that it can actually degrade performance if certain assumptions are not met (Chapelle2010, Chapter 4). For example, BenDavid2008 show that the cluster assumption, commonly used in SSL settings, can lead to degraded performance even in simple cases, e.g., for binary classification with data generated from two unimodal Gaussians. Such examples make it clear that many aspects of SSL are not yet well understood.
Building on the principle of independent causal mechanisms (ICM) (daniuvsis2010inferring; JanSch10; Peters2017), Schoelkopf2012 have pointed out a link between the possibility of SSL and the causal structure underlying a given learning problem. Specifically, they argue that SSL should be impossible when predicting a target variable from its causes (referred to as causal learning), but possible when predicting it from its effects (referred to as anticausal learning); see Sec. 2 for details. Empirical evidence from a meta-analysis of various SSL scenarios supports this claim (Schoelkopf2012).
In this work, we extend the investigation of connections between SSL and causality to a more general setting. Rather than treating causal and anticausal learning in isolation, we consider predicting a target variable from both causes and effects at the same time. As an example, consider the setting of predicting disease from medical data, where both types of features are commonly found. A patient’s age, sex, medical family history, genetic information, diet, and other risk factors such as smoking all constitute (possible) causal features, whereas examples of effect features include the clinical symptoms exhibited by the patient, as well as the results of medical tests such as imaging, serum tests, or tissue samples.
As our main result, we show that for this general setting of learning with both cause and effect features, the relevant information that additional unlabelled data may provide for prediction is contained in the conditional distribution of effect features given causal features (Sec. 3). We then discuss how this new insight may be used to reformulate classical SSL assumptions (Sec. 4), and propose algorithms based on these assumptions (Sec. 5). Results from evaluating our methods against well-established SSL algorithms on synthetic and medical datasets (Sec. 6) provide empirical support for our analysis.
2 Background and related work
We start by reviewing previous work and key concepts upon which our work builds. Throughout, we use $X$ to denote a random variable taking values in $\mathcal{X}$, which is assumed to be a subset of $\mathbb{R}^d$. $P$ denotes a probability measure and $P_X$ the probability distribution of $X$ with density $p(x)$. We write $x$ for a scalar, $\mathbf{x}$ for a vector, and $\mathbf{X}$ for a matrix or collection of samples.

2.1 Semi-supervised learning
SSL describes a learning setting where, in addition to a labelled sample $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, we have access to an unlabelled sample $\{\mathbf{x}_j\}_{j=n+1}^{n+m}$ at training time. It is usually assumed that $m \gg n$. At test time, the task is to predict targets $y$ from inputs $\mathbf{x}$. If predictions are made on the unlabelled training data only, we speak of transductive learning (vapnik1998).
The aim and hope of SSL is that additional unlabelled data helps in making better predictions. Unlabelled data can improve the estimate of $P_X$, but SSL aims at improving the estimate of $P_{Y|X}$. This can only work if there is a link between $P_X$ and $P_{Y|X}$. Indeed, many approaches to SSL establish such a link through additional assumptions (Zhu2005; Chapelle2010; Schoelkopf2012). Two common ones are the cluster assumption, positing that points in the same cluster of $P_X$ have the same label $y$; and low-density separation, stating that class boundaries should lie in an area where $p(x)$ is small. For original references, as well as for a discussion of how these assumptions relate to various SSL methods, refer to Chapelle2010.
We briefly mention four of the more common methods, starting with self-learning (sometimes also called the Yarowsky algorithm). This is a wrapper algorithm that initializes the learner based on the labelled data, updates the labels for the unlabelled data, and then retrains based on all labelled data available, possibly iterating this procedure (Scudder65; Blum1998; Abney2004). Secondly, generative model approaches maximise the likelihood of a generative model,

$$\max_\theta \; \prod_{i=1}^{n} p_\theta(\mathbf{x}_i, y_i) \prod_{j=n+1}^{n+m} \sum_{y} p_\theta(\mathbf{x}_j, y). \qquad (1)$$
While this is a hard optimization problem due to the latent labels of the unlabelled points, a local optimum can be found via the expectation maximisation (EM) algorithm (dempster1977maximum). The third class of common methods we mention are the graph-based approaches. These construct a similarity-based graph representation of the data and propagate labels to neighbours in this graph (zhu2002learning; zhu2003semi; zhou2004learning). Finally, transductive SVMs assign labels which maximise a (soft) margin over labelled and unlabelled data while minimizing a regularized risk on the labelled data (vapnik1998; joachims1999transductive).
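To make the self-learning wrapper concrete, here is a minimal numpy sketch; the nearest-centroid base learner and the toy data are our own illustrative choices, not one of the implementations compared later in this paper:

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """Fit one centroid per class; returns a dict label -> centroid."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def nearest_centroid_predict(centroids, X):
    labels = sorted(centroids)
    d = np.stack([np.linalg.norm(X - centroids[c], axis=1) for c in labels])
    return np.array(labels)[d.argmin(axis=0)]

def self_train(X_lab, y_lab, X_unl, n_iter=5):
    """Self-learning: fit, pseudo-label the unlabelled data, refit, iterate."""
    X_all, y_pseudo = X_lab, y_lab
    for _ in range(n_iter):
        centroids = nearest_centroid_fit(X_all, y_pseudo)
        y_unl = nearest_centroid_predict(centroids, X_unl)
        X_all = np.vstack([X_lab, X_unl])
        y_pseudo = np.concatenate([y_lab, y_unl])
    return nearest_centroid_fit(X_all, y_pseudo)

# One labelled point per class, plus two unlabelled clusters.
rng = np.random.default_rng(0)
X_lab = np.array([[-2.0, 0.0], [2.0, 0.0]])
y_lab = np.array([0, 1])
X_unl = np.vstack([rng.normal([-2, 0], 0.5, (50, 2)),
                   rng.normal([2, 0], 0.5, (50, 2))])
model = self_train(X_lab, y_lab, X_unl)
pred = nearest_centroid_predict(model, X_unl)
acc = (pred == np.repeat([0, 1], 50)).mean()
```

On well-clustered data like this, the pseudo-labels quickly stabilise and the refit centroids recover the clusters.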
2.2 Causality
Despite data showing a positive correlation between chocolate consumption and the number of Nobel prizes per capita (messerli2012chocolate), we would not expect that force-feeding the population chocolate would result in higher research output. The correlation in this example may make chocolate consumption a useful predictor in an i.i.d. setting, but it does not allow one to answer interventional questions of the form “what would happen if we actively changed some of the variables?”.
This notion of intervention is at the heart of the difference between correlation and causation. While much of machine learning is concerned with using correlations between variables to make predictions, reichenbach1956direction has argued that such correlations always result from underlying causal relationships: statistical dependence is an epiphenomenon, a byproduct of a causal process.[1]

[1] For the given example, a possible causal explanation for the observed correlation would be that a healthy economy acts as a common cause of both chocolate consumption and a good education system.
Structural causal model (SCM)

To reason about causality in SSL, we adopt the SCM framework (Pearl2000), which defines a causal model over variables $X_1, \dots, X_n$ as: (i) a collection of structural assignments $X_i := f_i(\mathrm{PA}_i, U_i)$, where the $f_i$ are deterministic functions computing each variable $X_i$ from its causal parents $\mathrm{PA}_i$; and (ii) a factorizing joint distribution over the unobserved noise variables $U_1, \dots, U_n$. Together, (i) and (ii) define a causal generative process and imply an observational joint distribution which factorises over the causal graph[2] as

$$p(x_1, \dots, x_n) = \prod_{i=1}^{n} p(x_i \mid \mathrm{pa}_i). \qquad (2)$$

[2] The causal graph is obtained by drawing a directed edge from each node in $\mathrm{PA}_i$ (i.e., the direct causes of $X_i$) to $X_i$ for all $i$. We assume throughout that this graph is acyclic.
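As a toy illustration of (i), (ii), and the factorisation in eq. (2), consider ancestral sampling from a three-variable chain; the structural equations below are our own invented example:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# (ii) jointly independent noise variables
U1, U2, U3 = rng.normal(size=(3, n))

# (i) structural assignments, evaluated in topological order of the graph
X1 = U1                      # X1 := U1           (no parents)
X2 = 2.0 * X1 + U2           # X2 := f2(X1, U2)
X3 = -0.5 * X2 + 0.5 * U3    # X3 := f3(X2, U3)

# The induced joint factorises as p(x1) p(x2|x1) p(x3|x2), so X1 and X3
# are dependent, but conditionally independent given X2.
corr_13 = np.corrcoef(X1, X3)[0, 1]
r1 = X1 - np.polyval(np.polyfit(X2, X1, 1), X2)   # X1 with X2 regressed out
r3 = X3 - np.polyval(np.polyfit(X2, X3, 1), X2)   # X3 with X2 regressed out
partial_corr = np.corrcoef(r1, r3)[0, 1]
```

Here `corr_13` is clearly non-zero while `partial_corr` is near zero, matching the conditional independences implied by the graph.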
Principle of independent causal mechanisms (ICM)

Motivated by viewing the $f_i$ in the definition of SCMs as independent physical mechanisms of nature, ICM states that “the causal generative process (…) is composed of independent and autonomous modules that do not inform or influence each other” (Peters2017). In other words, the conditional distributions of each variable given its parents, $p(x_i \mid \mathrm{pa}_i)$, in eq. (2) are independent modules which do not share information. Note that this notion of independence is different from statistical independence (indeed, variables can still be statistically dependent), but it can be formalized as an algorithmic independence of distributions. Intuitively, two distributions are considered algorithmically independent if encoding them jointly does not admit a shorter description than describing each of them separately. In this case we say that they do not share information. This notion has been formalized in terms of Kolmogorov complexity (or algorithmic information) by Janzing2010, who show that when $X$ is a cause of $Y$,

$$K(P_X) + K(P_{Y|X}) \overset{+}{=} K(P_{X,Y}).$$

Here, the notation $\overset{+}{=}$ refers to equality up to a constant due to the choice of a Turing machine in the definition of algorithmic information. For the bivariate setting, ICM reduces to an
independence of cause and mechanism (daniuvsis2010inferring; Lemeire2006). This is illustrated in Fig. 0(a) using the information geometric causal inference (IGCI) model (janzing2012information) for $X \to Y$, in which a deterministic invertible function $f$ generates effect $Y = f(X)$ from cause $X$. If the input distribution of the cause, $P_X$, is chosen independently from the mechanism $f$ (or more generally, $P_{Y|X}$), then this independence is violated in the backward (non-causal) direction, since $p(y)$ has large density where $f$ has small slope and thereby contains information about $f^{-1}$.

2.3 Causal and anticausal learning
For the task of predicting a target $Y$ from features $X$, Schoelkopf2012 distinguish between causal learning, where $X$ is a cause of $Y$ (see Fig. 0(b)), and anticausal learning, where $X$ is an effect of $Y$ (see Fig. 0(c)). In a causal learning setting, it then follows from the independence of cause and mechanism that $P_X$ and $P_{Y|X}$ are algorithmically independent. Recalling the goal of SSL (improving the estimate of $P_{Y|X}$ using additional information about $P_X$), SSL should thus be impossible. In the anticausal direction, on the other hand, this independence relation is between $P_Y$ and $P_{X|Y}$, whereas $P_X$ may (and in some cases provably will, daniuvsis2010inferring) share information with $P_{Y|X}$; SSL is thus, in principle, possible.
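A small numerical illustration of this asymmetry, in the spirit of the IGCI setting described above (the specific mechanism $f(x) = x^3$ is our own choice):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100_000)   # cause: P_X chosen independently of the mechanism
y = x ** 3                        # effect: deterministic invertible f on (0, 1)

# Backward direction: p(y) concentrates where f has small slope, so p(y)
# carries information about f^{-1}. Half of all x lie in [0, 0.5], hence
# half of all y lie in [0, 0.5**3], far more than a uniform p(y) would put there.
mass = (y <= 0.5 ** 3).mean()     # ≈ 0.5 rather than 0.125
```

The pile-up of $p(y)$ in regions where $f$ is flat is exactly the dependence between effect distribution and inverse mechanism that IGCI exploits.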
3 Semi-supervised learning with cause and effect features
[Figure caption fragment: unlabelled points are classified according to whether they are better explained by the red or the blue function; best viewed in color.]
In this work, we assume that we are given a small labelled sample and a large unlabelled sample generated from the following SCM:

$$X_C := N_C, \qquad (3)$$
$$Y := f_Y(X_C, N_Y), \qquad (4)$$
$$X_E := f_E(X_C, Y, N_E), \qquad (5)$$

with jointly independent noise variables $N_C$, $N_Y$, and $N_E$.
This causal model is shown in Figure 1(a). We will refer to $X_C$ as causal features and $X_E$ as effect features and assume this partitioning to be known (e.g., think of the medical example with risk factors and diagnostic tests). Analogous to eq. (2), the SCM of eqs. (3)-(5) induces an observational distribution which factorises into independent causal mechanisms as

$$p(x_C, y, x_E) = p(x_C) \, p(y \mid x_C) \, p(x_E \mid x_C, y). \qquad (6)$$
Note that this setting generalises the cases of only causes or only effects considered by Schoelkopf2012 without positing any new statistical independences. It thus remains widely applicable.[3]

[3] Note that, for example, omitting the link $X_C \to X_E$ renders the two feature sets conditionally independent given $Y$ (vonkugelgen2019semi), which can be a restrictive assumption for realistic scenarios and can already be well addressed by approaches like co-training (Blum1998).
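The generative process of eqs. (3)-(5) can be sketched by ancestral sampling; the concrete functional forms below are our own illustrative choices, not those used in the experiments:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_scm(n, rng):
    """Ancestral sampling from eqs. (3)-(5): X_C -> Y -> X_E, plus X_C -> X_E."""
    x_c = rng.normal(0.0, 1.0, n)                              # (3) X_C := N_C
    y = (rng.uniform(size=n) < sigmoid(2 * x_c)).astype(int)   # (4) Y := f_Y(X_C, N_Y)
    n_e = rng.normal(0.0, 0.3, n)
    x_e = np.where(y == 1, x_c + 1.0, -x_c - 1.0) + n_e        # (5) X_E := f_E(X_C, Y, N_E)
    return x_c, y, x_e

rng = np.random.default_rng(0)
x_c, y, x_e = sample_scm(10_000, rng)
```

The sample exhibits the factorisation of eq. (6): $X_C$ is drawn first, $Y$ depends only on $X_C$, and $X_E$ depends on both.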
Our goal is to predict the target $Y$ from features $(X_C, X_E)$, so we are interested in estimating

$$p(y \mid x_C, x_E), \qquad (7)$$

while having additional information about $p(x_C, x_E)$ from unlabelled data.
In analogy to the case of causal learning (see Sec. 2.3 and Schoelkopf2012), by the principle of independent causal mechanisms, the distribution over causes $p(x_C)$ does not contain any information about $p(y \mid x_C)$ or $p(x_E \mid x_C, y)$ (see RHS of eq. (6)) and thereby also not about $p(y \mid x_C, x_E)$ (see RHS of eq. (7)). Indeed, $p(y \mid x_C, x_E)$ is completely determined by the structural equations for $Y$ as a function of $X_C$, eq. (4), and for $X_E$ as a function of $X_C$ and $Y$, eq. (5), and does not depend on what distribution of causal features $p(x_C)$, eq. (3), is fed into this generative process.
Having established that $p(x_C)$ does not contain useful information for our task,[4] we are left with $p(y, x_E \mid x_C)$, which according to the chain rule of probability admits two possible factorisations:

$$p(y, x_E \mid x_C) = p(y \mid x_C) \, p(x_E \mid x_C, y) \qquad (8)$$
$$p(y, x_E \mid x_C) = p(x_E \mid x_C) \, p(y \mid x_C, x_E). \qquad (9)$$

[4] Since we generally aim to minimise an expected loss, $p(x_C)$ can still be helpful in getting a better estimate of the expectation operator (Peters2017). By useful information here we mean information about $p(y \mid x_C, x_E)$.
Eq. (8) is a causal factorization into independent mechanisms which do not share any information. Eq. (9), however, corresponds to a non-causal factorization, implying that the factors on the RHS may share information. Since we care about estimating $p(y \mid x_C, x_E)$ and we have additional information about $p(x_E \mid x_C)$ from unlabelled data, it is precisely this potential dependence or link between $P_{X_E \mid X_C}$ and $P_{Y \mid X_C, X_E}$ that SSL approaches should aim to exploit in our setting (Figure 1(a)). We formulate this result as follows.
Main insight.
When learning with both causes and effects of a target as captured by the causal model in eqs. (3)-(5), $p(x_E \mid x_C)$ contains all relevant information provided by additional unlabelled data about $p(y \mid x_C, x_E)$. Therefore, SSL approaches for such a setting should aim at exploiting this information and linking these two distributions via suitable assumptions.
We note that this contains previous results for causal and anticausal learning as special cases: in the absence of causal features (i.e., for anticausal learning), $p(x_E \mid x_C)$ reduces to $p(x_E)$, recovering the known setting of $P_X$ containing information about $P_{Y|X}$; whereas in the absence of effect features, $p(x_E \mid x_C)$ becomes meaningless and SSL thereby impossible. Both are consistent with the findings of Schoelkopf2012.
However, our result goes further than this, since having additional unlabelled data of both cause and effect features can be strictly more informative than having only unlabelled effects. To illustrate this point, consider the example where $X_E$ is a possibly noisy copy, or proxy label, of $Y$. In this case, unlabelled data contains information which is very similar to the information contained in the labelled data, so that learning to predict $X_E$ from $X_C$ can be very helpful in solving the problem.
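The proxy-label example can be made concrete with a short numpy sketch (the data-generating process and the simple threshold classifier are our own constructions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000
x_c = rng.normal(size=n)
y = (x_c + 0.3 * rng.normal(size=n) > 0).astype(int)   # Y := f_Y(X_C, N_Y)
flip = rng.uniform(size=n) < 0.1
x_e = np.where(flip, 1 - y, y)                          # X_E: noisy copy of Y

# Unlabelled (x_C, x_E) pairs act almost like labelled data:
# use x_E as a pseudo-label and fit a simple threshold on x_C.
thr_pseudo = 0.5 * (x_c[x_e == 0].mean() + x_c[x_e == 1].mean())
thr_true = 0.5 * (x_c[y == 0].mean() + x_c[y == 1].mean())

acc_pseudo = ((x_c > thr_pseudo).astype(int) == y).mean()
acc_true = ((x_c > thr_true).astype(int) == y).mean()
```

A classifier trained purely on the unlabelled pairs ends up nearly as accurate as one trained with the true labels, because here $p(x_E \mid x_C)$ carries almost all of the information in $p(y \mid x_C)$.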
4 New assumptions for semi-supervised classification
While the previous analysis (Sec. 3) applies to general prediction tasks including regression, we now focus our attention on classification. For conceptual simplicity and ease of illustration, we will assume binary classification in what follows, but extensions to the multiclass setting are straightforward.
First, we note that for a binary label $Y \in \{0, 1\}$ we can rewrite eqs. (4) and (5) as follows:

$$Y := \mathbb{1}\{N_Y \le s(X_C)\}, \qquad (10)$$
$$X_E := Y \cdot f_1(X_C, N_E) + (1 - Y) \cdot f_0(X_C, N_E), \qquad (11)$$

where $\mathbb{1}\{\cdot\}$ is the indicator function and $N_Y$ is a uniform random variable on $[0, 1]$, so that $s(x_C)$ computes $P(Y = 1 \mid X_C = x_C)$. Allowing arbitrary $f_0$ and $f_1$, this comes without loss of generality.
Next, we use our insight to reformulate standard SSL assumptions (see Sec. 2.1) for the setting of Fig. 1(a), where both $X_C$ and $X_E$ are observed. Our aim is to adapt these assumptions such that they make use of potential information shared between $P_{X_E \mid X_C}$ and $P_{Y \mid X_C, X_E}$.
4.1 Conditional cluster assumption
While the standard cluster assumption advocates for sharing labels within clusters in the marginal distribution of all features, in view of the above we postulate that points in the same cluster of $p(x_E \mid x_C)$ share the same label $y$. We refer to this as the conditional cluster assumption.
Here, one can think of clusters of $p(x_E \mid x_C)$ as clusters in the space of functions computing effects from causes. Different functions in this space can arise from different choices of $f_0$ and $f_1$ in the structural equation for $X_E$, eq. (11). The conditional cluster assumption can then be understood as saying that the two class-dependent mechanisms $f_0$ and $f_1$ form clusters in this function space.[5]

[5] Note that due to the general form of eq. (11), it is possible to have more than one cluster per class in this function space. For handwritten digits, for example, $N_E$ could act as a switch between 7s with and without the horizontal stroke.
This idea is illustrated for two cases of linear and non-linear functions with additive and unimodal noise in Figs. 1(b) and 1(c), respectively (best viewed in colour). These are simple examples where the asymmetry introduced by knowing the causal partitioning of features can help identify the true mechanisms (shown as solid and dashed lines) and therefore the correct labelling. Standard SSL approaches agnostic to this causally-induced asymmetry between features, on the other hand, can easily fail in these situations. For the data shown in Fig. 1(b), for example, large-margin methods or graph-based approaches (see Sec. 2.1) operating in the joint feature space will learn to classify by the sign of $x_E$ (i.e., the boundary $x_E = 0$), leading to an error rate of almost 50%.
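This failure mode is easy to reproduce numerically; the mechanisms $f_0(x) = x$ and $f_1(x) = -x$ below are our own stand-ins for the functions in the figure:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000
x_c = rng.uniform(-2, 2, n)                   # causal feature
y = rng.integers(0, 2, n)                     # balanced labels
x_e = np.where(y == 1, -x_c, x_c) + 0.05 * rng.normal(size=n)  # f1 = -x, f0 = x

# A method that ignores the causal asymmetry and splits the joint feature
# space effectively classifies by the sign of x_E, which is near chance here:
acc_sign = ((x_e > 0).astype(int) == y).mean()

# Assigning each point to the mechanism (line) that explains it better
# implements the conditional cluster assumption and recovers the labels:
acc_mech = ((np.abs(x_e + x_c) < np.abs(x_e - x_c)).astype(int) == y).mean()
```

The sign-based rule is wrong on roughly half the points, while the mechanism-based rule only errs near the intersection of the two lines.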
4.2 Low conditional-density separation
In a similar vein, we adapt the low-density separation assumption to our setting. While in its original form low-density separation is a statement about the joint density of all features, we have argued that (subject to the ICM principle) $p(x_C)$ contains no information about $p(y \mid x_C, x_E)$, but that the conditional density $p(x_E \mid x_C)$ may do so. We therefore propose that a more justified notion of separation is that class boundaries should lie in regions where $p(x_E \mid x_C)$ is small. We refer to this as low conditional-density separation.
5 Algorithms
While the main contribution of the present paper is conceptual, it is illustrative to discuss the implications of our assumptions from Sec. 4 for some of the standard approaches to SSL introduced in Sec. 2.1, and to propose variations thereof which explicitly aim to make use of the information shared between $P_{X_E \mid X_C}$ and $P_{Y \mid X_C, X_E}$.
5.1 Semi-generative models
While a naive generative model would unnecessarily model the full distribution $p(x_C, y, x_E)$, including the uninformative part $p(x_C)$, this approach to SSL is easily adapted to our new assumptions by only modelling the informative part of the generative process, $p(y, x_E \mid x_C)$. This type of semi-generative model has been introduced by vonkugelgen2019semi in the context of domain adaptation and under the additional assumption of conditionally independent feature sets.
Given a model $p_\theta(y, x_E \mid x_C)$ parameterised by $\theta$, a maximum likelihood approach similar to eq. (1) then yields

$$\max_\theta \; \prod_{i=1}^{n} p_\theta(y_i, x_{E,i} \mid x_{C,i}) \prod_{j=n+1}^{n+m} \sum_{y} p_\theta(y, x_{E,j} \mid x_{C,j}). \qquad (12)$$

Equivalently, we minimise the negative log-likelihood (NLL) which, for fixed labels, decomposes according to eq. (6) into separate terms which can be optimised independently for $p(y \mid x_C)$ and $p(x_E \mid x_C, y)$:

$$\mathrm{NLL}(\theta) = -\sum_{i} \log p_{\theta_1}(y_i \mid x_{C,i}) - \sum_{i} \log p_{\theta_2}(x_{E,i} \mid x_{C,i}, y_i).$$
This separation leads us to an EM approach (dempster1977maximum) to find a local optimum of eq. (12) by iteratively computing the expected label given the current parameters (E-step) and then minimising the NLL w.r.t. the parameters keeping the labels fixed (M-step). This is summarized in Algorithm 1. For the specific case of a logistic regression model for $p(y \mid x_C)$ and a class-dependent linear Gaussian model for $p(x_E \mid x_C, y)$, we provide a more detailed procedure for both soft and hard labels in Algorithm 3.

5.2 Conditional self-learning
The second algorithm we propose is loosely related to the ideas of label propagation and self-learning (Scudder65). However, instead of propagating labels based on similarities between points computed in the joint feature space as in the conventional approach, we argue that $p(x_C)$ contains no information about $p(y \mid x_C, x_E)$ and that we should instead focus on the information contained in $p(x_E \mid x_C)$. To achieve this, we assume an additive noise model (HoyJanMooPetetal09) for $X_E$ in eq. (11), i.e.,

$$X_E := f_Y(X_C) + N_E. \qquad (13)$$
Note, however, that unlike in the probabilistic approach of Sec. 5.1, we do not make additional assumptions about the exact noise distribution, such as Gaussianity. We do, however, assume that the noise has zero mean and is unimodal, so that there is one function from $X_C$ to $X_E$ for each label.
Our approach then aims at learning these functions and can be summarised as follows. We first initialise two functions $\hat{f}_0$ and $\hat{f}_1$ from the labelled sample by regressing $X_E$ on $X_C$ within each class. We then proceed to iteratively compute the predictions of the $\hat{f}_y$ on the unlabelled data, label the point with the smallest prediction error as the respective class, and then use this point to update the corresponding $\hat{f}_y$, until all points are labelled. We refer to this approach as conditional self-learning; see Algorithm 2.
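A minimal sketch of this procedure with per-class least-squares lines (our own stand-in for the ridge regressions used in the experiments; the toy data is also our own):

```python
import numpy as np

def fit_line(x, z):
    """Least-squares fit z ≈ a*x + b; returns (a, b)."""
    A = np.stack([x, np.ones_like(x)], axis=1)
    return np.linalg.lstsq(A, z, rcond=None)[0]

def cond_self_learn(xc_l, xe_l, y_l, xc_u, xe_u):
    """Sketch of conditional self-learning (Algorithm 2): repeatedly label the
    unlabelled point best explained by one of the per-class regressions
    x_E ≈ f_y(x_C), then refit that class's regression."""
    xc = {c: list(xc_l[y_l == c]) for c in (0, 1)}
    xe = {c: list(xe_l[y_l == c]) for c in (0, 1)}
    coef = {c: fit_line(np.asarray(xc[c]), np.asarray(xe[c])) for c in (0, 1)}
    remaining = list(range(len(xc_u)))
    labels = np.zeros(len(xc_u), dtype=int)
    while remaining:
        idx = np.asarray(remaining)
        # absolute residual of each remaining point under each class mechanism
        res = {c: np.abs(coef[c][0] * xc_u[idx] + coef[c][1] - xe_u[idx])
               for c in (0, 1)}
        k = int(np.minimum(res[0], res[1]).argmin())  # most confident point
        c = int(res[1][k] < res[0][k])                # class that explains it best
        i = remaining.pop(k)
        labels[i] = c
        xc[c].append(xc_u[i])
        xe[c].append(xe_u[i])
        coef[c] = fit_line(np.asarray(xc[c]), np.asarray(xe[c]))
    return labels

# Toy problem: two linear mechanisms f0(x) = x and f1(x) = 1 - x.
rng = np.random.default_rng(0)
n = 200
xc_u = rng.uniform(-2, 2, n)
y_true = rng.integers(0, 2, n)
xe_u = np.where(y_true == 1, -xc_u + 1.0, xc_u) + 0.05 * rng.normal(size=n)
xc_l = np.array([-1.5, 1.5, -1.5, 1.5])
y_l = np.array([0, 0, 1, 1])
xe_l = np.where(y_l == 1, -xc_l + 1.0, xc_l) + 0.05 * rng.normal(size=4)
labels = cond_self_learn(xc_l, xe_l, y_l, xc_u, xe_u)
acc = (labels == y_true).mean()
```

With only two labelled points per class, the per-class fits recover both mechanisms and almost all unlabelled points end up correctly labelled; the only ambiguity is near the intersection of the two lines.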
Soft labels and connection to probabilistic approach
It is also possible to use the above approach with soft labels (as often done in conventional label propagation, zhu2003semi; zhou2004learning) by using a weighted regression scheme. This would require a method of computing regression weights from the prediction errors of $\hat{f}_0$ and $\hat{f}_1$, though, and therefore needs additional assumptions or heuristics. We note that choosing a particular noise distribution for $N_E$ and using $p(y \mid x_C)$ as a class prior leads to a soft-label EM approach, see Algorithm 3. We therefore presently restrict ourselves to hard labels.

While it is conceptually based on the ICM assumption and an analysis of the causal structure among the feature set, the conditional self-learning approach is linked to a number of known methods, including not only self-learning, but also methods building on a competition of experts, as recently applied to the problem of learning causal mechanisms (parascandolo2018learning). In that work, the competing functions are generative models competing for data that has undergone unknown transformations, each eventually specializing in inverting one of those transformations.
6 Experiments
To corroborate our analysis with empirical evidence, we evaluate our algorithms from Sec. 5 on synthetic data as well as on two medical datasets from the UCI repository. We compare with TSVMs (vapnik1998; joachims1999transductive) with linear and RBF kernels using the q3svm implementation (q3svm), and with label propagation (zhu2003semi; zhou2004learning) using the implementation in scikit-learn (scikitlearn). We use the default parameters in all cases. For our conditional self-learning algorithm we use linear ridge regression (scikitlearn) with default regularization strength 1, and for the EM algorithms we use a logistic regression model for $p(y \mid x_C)$ and linear, class-dependent Gaussians for $p(x_E \mid x_C, y)$; see Algorithm 3 and Appendix B.2 for details.

Synthetic data
As controlled environments, we generate three different types of datasets, S1, S2, and S3, with cause and effect features: S1 represents linearly-separable data; S2 corresponds to a non-linear decision boundary similar to Fig. 1(c); and S3 is a version of S2 with multi-dimensional features. Details of how exactly the synthetic data was generated are provided in Appendix B.2.
Medical data
As real-world data, we chose the two medical datasets Pima Indians Diabetes (smith1988using) and Heart Disease (detrano1989international), as both plausibly contain cause and effect features. We select those features which are most strongly correlated with the target variable and categorise them into cause and effect features to the best of our knowledge (see Appendix B.1).
Results
The results of our experiments are summarised in Table 1; see the table caption for details. On the synthetic datasets, our causally-motivated methods outperform the purely supervised logistic regression baseline as well as the other SSL approaches, which in the case of S1 and S3 actually lead to a decrease in performance. The probabilistic approaches perform particularly well on the synthetic datasets, which is expected since the generative model for these cases was specified by us and thus known (see Appendix B.2). However, our conditional self-learning algorithm is rather competitive, as shown by the results on the real data. Notably, it is the only method which improves upon the baseline (i.e., achieves SSL) for each of the five datasets considered. Moreover, it also leads to significant improvements on the Heart dataset, where the EM approaches fail (likely due to a violation of the underlying assumptions on the generative model).
Method  S1 (lin)  S2 (nonlin)  S3 (mult. dim)  Diabetes  Heart Disease

Lin. log. reg.  .968 ± .023  .823 ± .080  .945 ± .039  .626 ± .058  .526 ± .066
Lin. TSVM  .865 ± .093  .878 ± .074  .822 ± .117  .602 ± .065  .746 ± .060
RBF TSVM  .863 ± .094  .876 ± .075  .821 ± .116  .601 ± .064  .745 ± .060
RBF label propag.  .924 ± .082  .909 ± .065  –  .650 ± .030  .528 ± .068
Semi-gen. (sup.)  .968 ± .076  .935 ± .074  .949 ± .082  .669 ± .064  .550 ± .096
Semi-gen. + soft EM  .986 ± .081  .989 ± .024  .991 ± .067  .661 ± .063  .518 ± .050
Semi-gen. + hard EM  .985 ± .079  .972 ± .058  .987 ± .076  .695 ± .064  .518 ± .050
Cond. self-learning  .980 ± .052  .923 ± .090  .961 ± .069  .659 ± .079  .719 ± .076

Table 1: Accuracy ± one standard deviation across 100 runs, each time randomly drawing 10 (for S1, S2, S3) or 20 (for Diabetes, Heart) new labelled and 200 new unlabelled samples. Results refer to transductive evaluation for ease of comparison with other methods. The best method for each dataset is highlighted in bold. The last four rows are our causally-motivated methods. The "–" indicates that label propagation did not converge on S3.
7 Discussion
The present paper looks at SSL from the point of view of causal modelling. Its main contribution is conceptual rather than algorithmic. We argue that if we know how the feature set in the input can be partitioned into cause and effect features ($X_C$ and $X_E$), then this has surprising theoretical implications for how SSL should utilize unlabelled data: rather than simply exploiting links between $P_X$ and $P_{Y|X}$, as formalized for instance in the standard SSL cluster assumption, one should exploit links between $P_{X_E \mid X_C}$ and $P_{Y \mid X_C, X_E}$. Note that we view this not as a contradiction to the usual cluster assumption, but rather as an explication or refinement thereof, taking into account the causal structure; indeed, it subsumes SSL in the anticausal setting (Schoelkopf2012) as a special case. It does not subsume SSL in the causal setting, but as argued by Schoelkopf2012, SSL fails in this case.
We do not mean to claim that all assumptions underlying these insights always apply in practice. We may not know the causal structure of the features, and in practice, some of the features may be neither causes nor effects, but correlated to the target by unobserved confounders (a case which would be interesting to study in future work). Moreover, the principle of independent causal mechanisms (ICM) underlying both our analysis and the one of Schoelkopf2012 may not strictly hold for a task at hand.
While the present analysis is intriguing and points out a previously unexplored link between two conditional distributions, the jury is still out on how to best exploit unlabelled data in machine learning. The present insight is but one step, and in particular, while encouraging, the algorithms and experiments based upon it can only be a starting point. We hope that they will lead to new approaches that make explicit use of causal structure and exploit the conditional cluster assumption in more elegant and effective ways. Ultimately, the value of novel assumptions and conceptual models lies in whether they provide a fertile basis to inspire further algorithm development and theoretical understanding, and we expect that the present ideas and analysis will constitute such a contribution.
Appendix A Algorithms
In Algorithm 1, the negative log-likelihood (NLL) terms implicitly also depend on the observed feature values, where the additive constants do not depend on the respective parameters and can thus be ignored for finding a minimum. For the sake of brevity, and since only the values of fitted labels and parameter estimates change throughout the algorithm while the observed feature values are held fixed, we omit explicitly conditioning on them.
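A minimal hard-label EM sketch of Algorithm 1 under the model assumptions used in our experiments (logistic regression for $p(y \mid x_C)$, class-dependent linear Gaussians for $p(x_E \mid x_C, y)$); all function names and the toy data are our own, and this is a simplification of Algorithm 3, not the released implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def fit_logistic(x, y, steps=300, lr=0.5):
    """Fit p(y=1|x_C) = sigmoid(w*x + b) by gradient ascent on the log-likelihood."""
    w = b = 0.0
    for _ in range(steps):
        p = sigmoid(w * x + b)
        w += lr * np.mean((y - p) * x)
        b += lr * np.mean(y - p)
    return w, b

def fit_lin_gauss(x, z):
    """Fit p(x_E|x_C, y): x_E ~ N(a*x_C + c, s^2) for one class."""
    A = np.stack([x, np.ones_like(x)], axis=1)
    a, c = np.linalg.lstsq(A, z, rcond=None)[0]
    s = np.std(z - (a * x + c)) + 1e-6
    return a, c, s

def log_gauss(z, mu, s):
    return -0.5 * ((z - mu) / s) ** 2 - np.log(s)

def hard_em(xc_l, xe_l, y_l, xc_u, xe_u, iters=10):
    """Hard-label EM: alternately impute labels for the unlabelled data (E-step)
    and refit both conditional models on all data (M-step)."""
    w, b = fit_logistic(xc_l, y_l)
    g = {c: fit_lin_gauss(xc_l[y_l == c], xe_l[y_l == c]) for c in (0, 1)}
    xc = np.concatenate([xc_l, xc_u])
    xe = np.concatenate([xe_l, xe_u])
    for _ in range(iters):
        # E-step: most likely label under the current p(y|x_C) p(x_E|x_C, y)
        ll = {c: np.log(sigmoid((2 * c - 1) * (w * xc_u + b)))
                 + log_gauss(xe_u, g[c][0] * xc_u + g[c][1], g[c][2])
              for c in (0, 1)}
        y_u = (ll[1] > ll[0]).astype(int)
        # M-step: refit p(y|x_C) and p(x_E|x_C, y) with the imputed labels fixed
        y_all = np.concatenate([y_l, y_u])
        w, b = fit_logistic(xc, y_all)
        g = {c: fit_lin_gauss(xc[y_all == c], xe[y_all == c]) for c in (0, 1)}
    return y_u

rng = np.random.default_rng(0)
n = 300
xc_u = rng.uniform(-2, 2, n)
y_true = (rng.uniform(size=n) < sigmoid(1.5 * xc_u)).astype(int)
xe_u = np.where(y_true == 1, -xc_u + 1.0, xc_u) + 0.1 * rng.normal(size=n)
y_l = np.repeat([0, 1], 5)
xc_l = np.tile([-1.5, -0.75, 0.0, 0.75, 1.5], 2).astype(float)
xe_l = np.where(y_l == 1, -xc_l + 1.0, xc_l) + 0.1 * rng.normal(size=10)
y_hat = hard_em(xc_l, xe_l, y_l, xc_u, xe_u)
acc = (y_hat == y_true).mean()
```

The E-step uses both the class prior $p(y \mid x_C)$ and the effect likelihood $p(x_E \mid x_C, y)$, so points far from the intersection of the two class mechanisms are labelled confidently from the first iteration.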
In Algorithm 2, the superscript notation on the feature matrices refers to those samples of causal and effect features with a given label. Moreover, the residual terms denote the residuals from regressing the effect features on the causal features using the corresponding fitted function.
Algorithm 3 below describes, in some more detail, concrete soft and hard labelling versions of the EM approach proposed in Algorithm 1 under the model assumption of a logistic regression for $p(y \mid x_C)$ and linear Gaussian distributions for $p(x_E \mid x_C, y)$. It was used for our experiments in Sec. 6.

Appendix B Experimental details
B.1 Real-world data
For the Pima Indians Diabetes dataset we used the partitioning {DiabetesPedigreeFunction, Pregnancies, BMI} and {Glucose}. DiabetesPedigreeFunction is a measure of the family history of diabetes, and BMI stands for body mass index.

For the (Coronary) Heart Disease dataset we used the partitioning {sex, ca, thal} and {chest pain}. Here, “ca” refers to the number of major vessels (0-3) that contained calcium (colored by fluoroscopy), and “thal” to thallium scintigraphy results, a nuclear medicine test that images the blood supply to the muscles of the heart.
For further details we refer to smith1988using; detrano1989international.
B.2 Synthetic data
The synthetic datasets used in our experiments were generated as follows. First, we draw $X_C$ from a mixture of Gaussians. Next, we draw $Y$ and $X_E$ according to the SCM of eqs. (10) and (11), with class-dependent functions and diagonal matrices of standard deviations for the noise terms; $\sigma(\cdot)$ denotes the logistic sigmoid function. This induces corresponding distributions $p(y \mid x_C)$ and $p(x_E \mid x_C, y)$. For experiments on synthetic data we draw a new dataset according to the above generative process in each run, keeping parameters fixed at the following values.
S1: Linear synthetic dataset
We use the following parameters to generate S1.

feature dimensions:

: components with weights , means and standard deviations

:

:
S2: Nonlinear synthetic dataset
We use the following parameters to generate S2.

feature dimensions:

:

:
S3: Nonlinear multidimensional synthetic dataset
We use the following parameters to generate S3.

feature dimensions:

: components with weights , means and covariances

:

:
Appendix C Code
We will release Python code for our algorithms, as well as scripts to reproduce our results, with the camera-ready version of this paper.