Semi-Supervised Learning, Causality and the Conditional Cluster Assumption

05/28/2019, by Julius von Kügelgen et al.

While the success of semi-supervised learning (SSL) is still not fully understood, Schölkopf et al. (2012) have established a link to the principle of independent causal mechanisms. They conclude that SSL should be impossible when predicting a target variable from its causes, but possible when predicting it from its effects. Since both these cases are somewhat restrictive, we extend their work by considering classification using cause and effect features at the same time, such as predicting a disease from both risk factors and symptoms. While standard SSL exploits information contained in the marginal distribution of the inputs (to improve our estimate of the conditional distribution of target given inputs), we argue that in our more general setting we can use information in the conditional of effect features given causal features. We explore how this insight generalizes the previous understanding, and how it relates to and can be exploited for SSL.







1 Introduction

Due to the scarcity and often high acquisition cost of labelled data, machine learning methods that make effective use of large quantities of unlabelled data are crucial. One such method is semi-supervised learning (SSL) (Zhu2005; Chapelle2010), where in addition to labelled data, possibly large numbers of unlabelled observations are available to the learner at training time.

While positive results have been obtained on a range of problems, a shortcoming of SSL is that it can actually degrade performance if certain assumptions are not met (Chapelle2010, Chapter 4). For example, BenDavid2008 show that the cluster assumption, commonly used in SSL settings, can lead to degraded performance even in simple cases, e.g., for binary classification with data generated from two unimodal Gaussians. Such examples make it clear that many aspects of SSL are as yet not well understood.

Building on the principle of independent causal mechanisms (ICM) daniuvsis2010inferring; JanSch10; Peters2017, Schoelkopf2012 have pointed out a link between the possibility of SSL and the causal structure underlying a given learning problem. Specifically, they argue that SSL should be impossible when predicting a target variable from its causes (referred to as causal learning), but possible when predicting it from its effects (referred to as anticausal learning)–see Sec. 2 for details. Empirical evidence from a meta-analysis of various SSL scenarios supports this claim Schoelkopf2012.

In this work, we extend the investigation of connections between SSL and causality to a more general setting. Rather than treating causal and anticausal learning in isolation, we consider predicting a target variable from both causes and effects at the same time. As an example, consider the setting of predicting disease from medical data, where both types of features are commonly found. A patient’s age, sex, medical family history, genetic information, diet, and other risk factors such as smoking all constitute (possible) causal features, whereas examples of effect features include the clinical symptoms exhibited by the patient, as well as results of medical tests such as imaging results, serum tests, or tissue samples.

As our main result we show that for this general setting of learning with both cause and effect features, the relevant information that additional unlabelled data may provide for prediction is contained in the conditional distribution of effect features given causal features (Sec. 3). We then argue how this new insight may be used to reformulate classical SSL assumptions (Sec. 4), and propose algorithms based on these assumptions (Sec. 5). Results from evaluating our methods against well-established SSL algorithms on synthetic and medical datasets (Sec. 6) provide empirical support for our analysis.

2 Background and related work

We start by reviewing previous work and key concepts upon which our work builds. Throughout, we use a capital letter such as X to denote a random variable taking values in a space 𝒳, which is assumed to be a subset of ℝ^d. P denotes a probability measure and P_X the probability distribution of X with density p(x). We write x for a scalar, 𝐱 for a vector, and 𝐗 for a matrix or collection of samples.

2.1 Semi-supervised learning

SSL describes a learning setting where, in addition to a labelled sample {(x_i, y_i)}, we have access to an unlabelled sample {x_j} at training time. It is usually assumed that the unlabelled sample is much larger than the labelled one. At test time, the task is to predict targets y from inputs x. If predictions are made on the unlabelled training data only, we speak of transductive learning vapnik1998.

The aim and hope of SSL is that additional unlabelled data helps in making better predictions. Unlabelled data can improve the estimate of the marginal P(X), but SSL aims at improving the conditional P(Y|X). This can only work if there is a link between P(X) and P(Y|X). Indeed, many approaches to SSL establish such a link through additional assumptions Zhu2005; Chapelle2010; Schoelkopf2012. Two common ones are the cluster assumption, positing that points in the same cluster of P(X) have the same label Y; and low-density separation, stating that class boundaries should lie in areas where the density p(x) is small. For original references, as well as for a discussion of how these assumptions relate to various SSL methods, refer to Chapelle2010.

We briefly mention four of the more common methods, starting with self-learning (sometimes also called the Yarowsky algorithm). This is a wrapper algorithm that initialises the learner based on the labelled data, assigns labels to the unlabelled data, and then retrains based on all labelled data available, possibly iterating this procedure Scudder65; Blum1998; Abney2004. Secondly, generative model approaches maximise the likelihood of a generative model p(x, y) over both the labelled and the unlabelled sample, treating the missing labels as latent variables:

max_θ ∏_i p_θ(x_i, y_i) ∏_j Σ_y p_θ(x_j, y). (1)

While this is a hard optimisation problem due to the latent labels of the unlabelled points, a local optimum can be found via the expectation maximisation algorithm (EM) dempster1977maximum. The third class of common methods we mention are the graph-based approaches. These construct a similarity-based graph representation of the data and propagate labels to neighbours in this graph zhu2002learning; zhu2003semi; zhou2004learning. Finally, transductive SVMs assign labels which maximise a (soft) margin over labelled and unlabelled data while minimising a regularised risk on the labelled data vapnik1998; joachims1999transductive.
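To make the self-learning wrapper concrete, here is a minimal sketch using a nearest-centroid base classifier; the base learner and the iteration scheme are illustrative choices of ours, not those of the cited references.

```python
import numpy as np

def self_learn(X_lab, y_lab, X_unl, n_iter=10):
    """Self-learning wrapper: fit a base classifier on the labelled data,
    pseudo-label the unlabelled data, then retrain on the union, iterating.
    The base learner here is a simple nearest-centroid classifier."""
    X, y = X_lab, y_lab
    for _ in range(n_iter):
        # fit: one centroid per class from the current labelled pool
        centroids = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])
        # pseudo-label unlabelled points by their nearest centroid
        d = np.linalg.norm(X_unl[:, None, :] - centroids[None], axis=2)
        y_pseudo = d.argmin(axis=1)
        # retrain on all available (pseudo-)labelled data
        X = np.vstack([X_lab, X_unl])
        y = np.concatenate([y_lab, y_pseudo])
    return y[len(y_lab):]   # final labels assigned to the unlabelled points
```

With well-separated clusters, a handful of labelled seeds per class suffices for the pseudo-labels to stabilise after a few rounds.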

2.2 Causality

Despite data showing a positive correlation between chocolate consumption and the number of Nobel prizes per capita messerli2012chocolate, we would not expect that force-feeding the population chocolate would result in higher research output. The correlation in this example may make chocolate consumption a useful predictor in an i.i.d. setting, but it does not allow one to answer interventional questions of the form “what would happen if we actively changed some of the variables?”.

This notion of intervention is at the heart of the difference between correlation and causation. While much of machine learning is concerned with using correlations between variables to make predictions, reichenbach1956direction has argued that such correlations always result from underlying causal relationships: statistical dependence is an epiphenomenon, a by-product of a causal process.[1]

[1] For the given example, a possible causal explanation for the observed correlation would be that a healthy economy acts as a common cause for both chocolate consumption and a good education system.

Structural causal model (SCM)

To reason about causality in SSL, we adopt the SCM framework Pearl2000, which defines a causal model over variables X_1, …, X_d as: (i) a collection of structural assignments X_j := f_j(PA_j, N_j), where the f_j are deterministic functions computing each variable X_j from its causal parents PA_j and a noise variable N_j; and (ii) a factorising joint distribution over the jointly independent unobserved noise variables N_1, …, N_d. Together, (i) and (ii) define a causal generative process and imply an observational joint distribution over X_1, …, X_d which factorises over the causal graph[2] as

p(x_1, …, x_d) = ∏_j p(x_j | pa_j). (2)

[2] The causal graph is obtained by drawing a directed edge from each node in PA_j (i.e., the direct causes of X_j) to X_j for all j. We assume throughout that the causal graph is acyclic.

(a) IGCI model for Y = f(X), taken from janzing2012information

(b) Causal learning

(c) Anticausal learning
Figure 1: (a) Illustration of the idea of ICM: if the distribution P(X) of the cause is chosen independently of the mechanism f, then this independence is violated in the backward direction, as P(Y) contains information about f^{-1}. In a causal learning setting (b), SSL is thus impossible as P(X) contains no information about P(Y|X), whereas in an anticausal learning setting (c), P(X) may contain information about P(Y|X) and SSL is, in principle, possible.
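As a minimal illustration of the SCM formalism just defined, the following sketch samples from a toy two-variable model in which X causes Y; the particular assignments and noise distributions are arbitrary choices for illustration, not prescribed by the framework.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_scm(n):
    """Toy SCM over (X, Y) with X a cause of Y:
    X := N_X,  Y := 2*X + N_Y, with jointly independent noises."""
    n_x = rng.normal(size=n)                 # noise variable N_X
    n_y = rng.normal(scale=0.1, size=n)      # noise variable N_Y
    x = n_x                                  # structural assignment for X
    y = 2.0 * x + n_y                        # structural assignment for Y
    return x, y

# the induced observational joint factorises as p(x, y) = p(x) * p(y | x)
x, y = sample_scm(10_000)
```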

Principle of independent causal mechanisms (ICM)

Motivated by viewing the functions f_j in the definition of SCMs as independent physical mechanisms of nature, ICM states that “the causal generative process (…) is composed of independent and autonomous modules that do not inform or influence each other” Peters2017. In other words, the conditional distributions of each variable given its parents, p(x_j | pa_j), in eq. (2) are independent modules which do not share information. Note that this notion of independence is different from statistical independence (indeed, the variables can still be statistically dependent), but it can be formalized as an algorithmic independence of distributions. Intuitively, two distributions are considered algorithmically independent if encoding them jointly does not admit a shorter description than describing each of them separately. In this case we say that they do not share information. This notion has been formalized in terms of Kolmogorov complexity (or algorithmic information) by Janzing2010, who show that when X is a cause of Y,

K(P_X,Y) = K(P_X) + K(P_Y|X) + O(1).

Here, the notation O(1) refers to a constant due to the choice of a Turing machine in the definition of algorithmic information. For the bivariate setting, ICM reduces to an independence of cause and mechanism daniuvsis2010inferring; Lemeire2006. This is illustrated in Fig. 1(a) using the information geometric causal inference (IGCI) model janzing2012information, in which a deterministic invertible function f generates effect Y from cause X. If the input distribution of the cause, p(x), is chosen independently from the mechanism f (or more generally, p(y|x)), then this independence is violated in the backward (non-causal) direction, since p(y) has large density where f has small slope and thereby contains information about f^{-1}.

2.3 Causal and anticausal learning

For the task of predicting a target Y from features X, Schoelkopf2012 distinguish between causal learning, where X is a cause of Y (see Fig. 1(b)), and anticausal learning, where X is an effect of Y (see Fig. 1(c)). In a causal learning setting, it then follows from the independence of cause and mechanism that P(X) and P(Y|X) are algorithmically independent. Recalling the goal of SSL, improving the estimate of P(Y|X) using additional information about P(X), SSL should thus be impossible. In the anticausal direction, on the other hand, this independence relation is between P(Y) and P(X|Y), whereas P(X) may (and in some cases provably will daniuvsis2010inferring) share information with P(Y|X), and SSL is thus, in principle, possible.

3 Semi-supervised learning with cause and effect features

(a) Learning with cause (X_C) and effect (X_E) features
(b) Example of linear mechanisms between X_C and X_E
(c) Example of non-linear mechanisms between X_C and X_E
Figure 2: (a) Causal setting assumed in this paper, and two datasets which could arise from it (b), (c). As shall be explained below, the conditional cluster assumption links class labels to membership in clusters of the conditional p(x_E | x_C), suggesting to classify unlabelled points according to whether they are better explained by the red or the blue function. Best viewed in colour.

In this work, we assume that we are given a small labelled sample and a large unlabelled sample generated from the following SCM:

X_C := N_C, (3)
Y := f(X_C, N_Y), (4)
X_E := g(X_C, Y, N_E). (5)

This causal model is shown in Figure 2(a). We will refer to X_C as causal features and X_E as effect features and assume this partitioning to be known (e.g., think of the medical example with risk factors and diagnostic tests). Analogous to eq. (2), the SCM eqs. (3)-(5) induces an observational distribution which factorises into independent causal mechanisms as

p(x_C, y, x_E) = p(x_C) p(y | x_C) p(x_E | x_C, y). (6)

Note that this setting generalises the cases of only causes or effects considered by Schoelkopf2012 without positing any new statistical independences. It thus remains widely applicable.[3]

[3] Note that, for example, omitting the link X_C → X_E renders the two feature sets conditionally independent given Y vonkugelgen2019semi, which can be a restrictive assumption for realistic scenarios and can already be well addressed by approaches like co-training Blum1998.
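To make the assumed causal structure concrete, the following sketch simulates data from an SCM of the form in eqs. (3)-(5); the specific mechanisms and noise scales below are hypothetical stand-ins chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    """Simulate the assumed graph X_C -> Y -> X_E with X_C -> X_E:
    X_C := N_C,  Y := 1{f(X_C) + N_Y > 0},  X_E := g(X_C, Y) + N_E.
    The mechanisms f and g here are hypothetical stand-ins."""
    x_c = rng.normal(size=n)                                    # eq. (3)
    y = (x_c + rng.normal(scale=0.5, size=n) > 0).astype(int)   # eq. (4)
    # effect features depend on both the causal features and the label
    x_e = 0.5 * x_c + 2.0 * y + rng.normal(scale=0.3, size=n)   # eq. (5)
    return x_c, y, x_e

x_c, y, x_e = sample(5_000)
```

Note that x_e carries information about y beyond what x_c provides, which is exactly the structure the methods below try to exploit.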

Our goal is to predict the target Y from features (X_C, X_E), so we are interested in estimating

p(y | x_C, x_E) = p(y | x_C) p(x_E | x_C, y) / p(x_E | x_C), (7)

while having additional information about P(X_C, X_E) from unlabelled data.

In analogy to the case of causal learning (see Sec. 2.3 and Schoelkopf2012), by the principle of independent causal mechanisms, the distribution over causes p(x_C) does not contain any information about p(y | x_C) or p(x_E | x_C, y) (see RHS of eq. (6)), and thereby also not about p(y | x_C, x_E) (see RHS of eq. (7)). Indeed, p(y | x_C, x_E) is completely determined by the structural equations for Y as a function of X_C, eq. (4), and for X_E as a function of X_C and Y, eq. (5), and does not depend on what distribution of causal features X_C, eq. (3), is fed into this generative process.

Having established that p(x_C) does not contain useful information for our task,[4] we are left with p(y, x_E | x_C), which according to the chain rule of probability admits two possible factorisations,

p(y, x_E | x_C) = p(y | x_C) p(x_E | x_C, y), (8)
p(y, x_E | x_C) = p(x_E | x_C) p(y | x_C, x_E). (9)

Eq. (8) is a causal factorization into independent mechanisms which do not share any information. Eq. (9), however, corresponds to a non-causal factorization, implying that the factors on the RHS may share information. Since we care about estimating p(y | x_C, x_E) and we have additional information about p(x_E | x_C) from unlabelled data, it is precisely this potential dependence or link between p(x_E | x_C) and p(y | x_C, x_E) that SSL approaches should aim to exploit in our setting (Figure 2(a)). We formulate this result as follows.

[4] Since we generally aim to minimise an expected loss, p(x_C) can still be helpful in getting a better estimate of the expectation operator Peters2017. By useful information here we mean information about p(y | x_C, x_E).

Main insight.

When learning with both causes and effects of a target as captured by the causal model in eqs. (3)-(5), p(x_E | x_C) contains all relevant information provided by additional unlabelled data about p(y | x_C, x_E). Therefore, SSL approaches for such a setting should aim at exploiting this information and linking these two distributions via suitable assumptions.

We note that this contains previous results for causal and anticausal learning as special cases: in the absence of causal features (i.e., for anticausal learning), p(x_E | x_C) reduces to the known setting of p(x_E) containing information about p(y | x_E), whereas in the absence of effect features, p(x_E | x_C) becomes meaningless and SSL thereby impossible, both consistent with the findings of Schoelkopf2012.

However, our result goes further than this, since having additional unlabelled data of both cause and effect features can be strictly more informative than having only unlabelled effects. To illustrate this point, consider the example where X_E is a possibly noisy copy, or proxy label, of Y. In this case, unlabelled data contains information which is very similar to the information contained in the labelled data, so that learning to predict X_E from X_C can be very helpful in solving the problem.
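The proxy-label example can be illustrated numerically: if X_E is a noisy copy of Y, the "unlabelled" pairs (x_C, x_E) behave almost like labelled examples. This is a toy simulation of ours, not one of the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000
x_c = rng.normal(size=n)
y = (x_c > 0).astype(int)                # true label, caused by X_C
flip = rng.random(n) < 0.05              # 5% corruption
x_e = np.where(flip, 1 - y, y)           # X_E: a noisy copy (proxy) of Y
# the "unlabelled" pairs (x_c, x_e) then almost constitute labelled data:
agreement = (x_e == y).mean()
```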

4 New assumptions for semi-supervised classification

While the previous analysis (Sec. 3) applies to general prediction tasks including regression, we now focus our attention on classification. For conceptual simplicity and ease of illustration, we will assume binary classification in what follows, but extensions to the multi-class setting are straightforward.

First, we note that for a binary label Y ∈ {0, 1} we can rewrite eqs. (4) and (5) as follows:

Y := 𝟙{N_Y ≤ h(X_C)}, (10)
X_E := g_Y(X_C, N_E), (11)

where 𝟙 is the indicator function and N_Y is a uniform random variable on [0, 1], so that h computes p(y = 1 | x_C). Allowing arbitrary h and g_0, g_1, this comes without loss of generality.

Next, we use our insight to reformulate standard SSL assumptions (see Sec. 2.1) for the setting of Fig. 2(a) where both X_C and X_E are observed. Our aim is to adapt these assumptions such that they make use of potential information shared between p(x_E | x_C) and p(y | x_C, x_E).

4.1 Conditional cluster assumption

While the standard cluster assumption advocates for sharing labels within clusters in the marginal distribution p(x) of all features, in view of the above we postulate that points in the same cluster of p(x_E | x_C) share the same label Y. We refer to this as the conditional cluster assumption.

Here, one can think of clusters of p(x_E | x_C) as clusters in the space of functions computing effects from causes. Different functions in this space can arise from different choices of g_y and N_E in the structural equation for X_E, eq. (11). The conditional cluster assumption can then be understood as saying that the two class-dependent mechanisms g_0 and g_1 form clusters in this function space.[5]

[5] Note that due to the general form of eq. (11), it is possible to have more than one cluster per class. For handwritten digits, for example, N_E could act as a switch between 7s with and without the horizontal stroke.

This idea is illustrated for two cases of linear and non-linear functions with additive and unimodal noise in Figs. 2(b) and 2(c), respectively (best viewed in colour). These are simple examples where the asymmetry introduced by knowing the causal partitioning of features can help identify the true mechanisms (shown as solid and dashed lines) and therefore the correct labelling. Standard SSL approaches agnostic to the causally-induced asymmetry between features, on the other hand, can easily fail in these situations. For the data shown in Fig. 2(b), for example, large-margin methods or graph-based approaches (see Sec. 2.1) operating in the joint feature space will learn to classify by the sign of X_E (i.e., the boundary X_E = 0), leading to an error rate of almost 50%.

4.2 Low conditional-density separation

In a similar vein, we adapt the low-density separation assumption to our setting. While in its original form low-density separation is a statement about the joint density p(x) of all features, we have argued that (subject to the ICM principle) the marginal p(x_C) contains no information about p(y | x_C, x_E), but that the conditional density p(x_E | x_C) may do so. We therefore propose that a more justified notion of separation is that class boundaries should lie in regions where the conditional density p(x_E | x_C) is small. We refer to this as low conditional-density separation.

5 Algorithms

While the main contribution of the present paper is conceptual, it is instructive to discuss the implications of our assumptions from Sec. 4 for some of the standard approaches to SSL introduced in Sec. 2.1, and to propose variations thereof which explicitly aim to make use of the information shared between p(x_E | x_C) and p(y | x_C, x_E).

5.1 Semi-generative models

While a naive generative model would unnecessarily model the full distribution, including the uninformative part p(x_C), this approach to SSL is easily adapted to our new assumptions by only modelling the informative part of the generative process, p(y | x_C) p(x_E | x_C, y). This type of semi-generative model was introduced by vonkugelgen2019semi in the context of domain adaptation, under the additional assumption of conditionally independent feature sets.

Given a model p_θ(y | x_C) p_ψ(x_E | x_C, y) parameterised by (θ, ψ), a maximum likelihood approach similar to eq. (1) then yields

max_{θ,ψ} ∏_i p_θ(y_i | x_C^i) p_ψ(x_E^i | x_C^i, y_i) ∏_j Σ_y p_θ(y | x_C^j) p_ψ(x_E^j | x_C^j, y). (12)

Equivalently, we minimise the negative log-likelihood (NLL), which for fixed labels decomposes according to eq. (6) into separate terms that can be optimised independently for θ and ψ. This separation leads us to an EM approach dempster1977maximum to find a local optimum of eq. (12) by iteratively computing the expected label given the current parameters (E-step) and then minimising the NLL w.r.t. the parameters keeping the labels fixed (M-step). This is summarised in Algorithm 1. For the specific case of logistic regression for p(y | x_C) and a class-dependent linear Gaussian model for p(x_E | x_C, y), we provide a more detailed procedure for both soft and hard labels in Algorithm 3.
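For intuition, here is a compact hard-label variant of such an EM scheme for scalar features, with the label model logistic in x_C and the effect model a class-dependent linear Gaussian. This is a simplified sketch of ours, not the authors' implementation, and all numerical details (step sizes, iteration counts, variance floor) are ad hoc.

```python
import numpy as np

def fit_hard_em(xc_l, xe_l, y_l, xc_u, xe_u, n_em=20):
    """Hard-label EM for a semi-generative model with scalar features:
    p(y=1|x_c) = sigmoid(w0*x_c + w1), p(x_e|x_c,y) = N(a_y*x_c + b_y, s_y^2).
    Returns hard labels for the unlabelled points."""

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

    def fit_logreg(xc, y, steps=200, lr=0.5):
        # plain gradient ascent on the logistic log-likelihood
        w = np.zeros(2)
        for _ in range(steps):
            p = sigmoid(w[0] * xc + w[1])
            w += lr * np.array([((y - p) * xc).mean(), (y - p).mean()])
        return w

    def fit_gauss(xc, xe):
        # least-squares line x_c -> x_e plus residual variance (floored)
        a, b = np.polyfit(xc, xe, 1)
        s2 = max(np.mean((xe - a * xc - b) ** 2), 1e-6)
        return a, b, s2

    y_u = None
    for _ in range(n_em):
        # M-step: refit parameters on labelled + currently imputed data
        if y_u is None:
            xc, xe, y = xc_l, xe_l, y_l
        else:
            xc = np.concatenate([xc_l, xc_u])
            xe = np.concatenate([xe_l, xe_u])
            y = np.concatenate([y_l, y_u])
        w = fit_logreg(xc, y)
        theta = [fit_gauss(xc[y == c], xe[y == c]) for c in (0, 1)]
        # E-step: hard labels maximising p(y|x_c) * p(x_e|x_c,y)
        p1 = sigmoid(w[0] * xc_u + w[1])
        ll = []
        for c, (a, b, s2) in enumerate(theta):
            prior = p1 if c == 1 else 1.0 - p1
            resid = xe_u - a * xc_u - b
            ll.append(np.log(prior + 1e-12)
                      - 0.5 * resid ** 2 / s2 - 0.5 * np.log(s2))
        y_u = np.argmax(np.stack(ll), axis=0)
    return y_u
```

Because the effect model p(x_e | x_c, y) enters the E-step, points whose (x_c, x_e) pair is well explained by one class's mechanism are labelled accordingly even when the marginal prior is ambiguous.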

5.2 Conditional self-learning

The second algorithm we propose is loosely related to the ideas of label propagation and self-learning Scudder65. However, instead of propagating labels based on similarities between points computed in the joint feature space as in the conventional approach, we argue that p(x_C) contains no information about p(y | x_C, x_E) and that we should instead focus on the information contained in p(x_E | x_C). To achieve this, we assume an additive noise model HoyJanMooPetetal09 for X_E in eq. (11), i.e.,

X_E := g_Y(X_C) + N_E.

Note, however, that unlike in the probabilistic approach of Sec. 5.1, we do not make additional assumptions about the exact noise distribution, such as Gaussianity. We do, however, assume that the noise has zero mean and is unimodal, so that there is one function from X_C to X_E for each label.

Our approach then aims at learning these functions and can be summarised as follows. We first initialise two functions ĝ_0 and ĝ_1 from the labelled sample by regressing X_E on X_C within each class. We then iteratively compute the predictions of the ĝ_y on the unlabelled data, label the point with the smallest prediction error with the respective class, and use this point to update the corresponding ĝ_y, until all points are labelled. We refer to this approach as conditional self-learning, see Algorithm 2.
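A minimal sketch of conditional self-learning for scalar features might look as follows; the linear regressors and the one-point-at-a-time update mirror the description above, but the concrete choices (least-squares lines, absolute residuals) are illustrative assumptions of ours.

```python
import numpy as np

def conditional_self_learn(xc_l, xe_l, y_l, xc_u, xe_u):
    """Conditional self-learning sketch (scalar features): keep one linear
    regressor x_c -> x_e per class, repeatedly label the unlabelled point
    that one of the regressors explains best, and refit that regressor."""
    xc = {c: list(xc_l[y_l == c]) for c in (0, 1)}
    xe = {c: list(xe_l[y_l == c]) for c in (0, 1)}
    coef = {c: np.polyfit(xc[c], xe[c], 1) for c in (0, 1)}
    remaining = list(range(len(xc_u)))
    labels = np.empty(len(xc_u), dtype=int)
    while remaining:
        # prediction error of every remaining point under each regressor
        res = {c: np.abs(np.polyval(coef[c], xc_u[remaining]) - xe_u[remaining])
               for c in (0, 1)}
        best = np.minimum(res[0], res[1])
        i = int(np.argmin(best))            # most confidently explained point
        c = int(res[1][i] < res[0][i])      # class whose regressor fits it best
        idx = remaining.pop(i)
        labels[idx] = c
        xc[c].append(xc_u[idx])
        xe[c].append(xe_u[idx])
        coef[c] = np.polyfit(xc[c], xe[c], 1)   # update only that regressor
    return labels
```

Labelling the most confidently explained point first means early, reliable assignments refine the regressors before ambiguous points near the crossing of the two mechanisms are decided.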

Input: labelled data; unlabelled data; models for p(y | x_C) and p(x_E | x_C, y)
Output: fitted labels; parameter estimates
while not converged do
      …
end while
return fitted labels and parameter estimates
Algorithm 1: EM-like algorithm for fitting a semi-generative model by maximum likelihood

Input: labelled data; unlabelled data; Regress() method
Output: fitted labels; functions ĝ_0, ĝ_1
while unlabelled data left do
      for y ∈ {0, 1} do
            ĝ_y ← Regress(…)
      end for
      …
end while
Algorithm 2: Conditional self-learning

Soft labels and connection to probabilistic approach

It is also possible to use the above approach with soft labels (as often done in conventional label propagation zhu2003semi; zhou2004learning) by using a weighted regression scheme. This would, however, require a method of computing regression weights from the prediction errors of ĝ_0 and ĝ_1, and therefore needs additional assumptions or heuristics. We note that choosing a particular noise distribution for N_E and using p(y | x_C) as a class prior leads to a soft-label EM approach, see Algorithm 3. We therefore presently restrict ourselves to hard labels.

While it is conceptually based on the ICM assumption and an analysis of the causal structure among the features, the conditional self-learning approach is linked to a number of known methods, including not only self-learning, but also methods building on a competition of experts, as recently applied to the problem of learning causal mechanisms. In that work parascandolo2018learning, the competing functions are generative models competing for data that has undergone unknown transformations, each eventually specialising in inverting one of those transformations.

6 Experiments

To corroborate our analysis with empirical evidence, we evaluate our algorithms from Sec. 5 on synthetic data as well as on two medical datasets from the UCI repository. We compare with T-SVMs vapnik1998; joachims1999transductive with linear and RBF kernels using the q3svm implementation q3svm, and with label propagation zhu2003semi; zhou2004learning using the implementation in scikit-learn scikit-learn. We use the default parameters in all cases. For our conditional self-learning algorithm we use linear ridge regression scikit-learn with default regularisation strength 1, and for the EM algorithms we use a logistic regression model for p(y | x_C) and linear, class-dependent Gaussians for p(x_E | x_C, y); see Algorithm 3 and Appendix B.2 for details.

Synthetic data

As controlled environments, we generate three different types of datasets S1, S2, and S3 with cause and effect features: S1 represents linearly separable data; S2 corresponds to a non-linear decision boundary similar to Fig. 2(c); and S3 is a version of S2 with multi-dimensional features. Details of exactly how the synthetic data was generated are provided in Appendix B.2.

Medical data

As real-world data, we chose the two medical datasets Pima Indians Diabetes smith1988using and Heart Disease detrano1989international, since both plausibly contain cause and effect features. We select those features which are most strongly correlated with the target variable Y and categorise them into cause and effect features to the best of our knowledge (see Appendix B.1).


The results of our experiments are summarised in Table 1, see the table caption for details. On the synthetic datasets, our causally-motivated methods outperform the purely supervised logistic regression baseline as well as the other SSL approaches, which in the case of S1 and S3 actually lead to a decrease in performance. The probabilistic approaches perform particularly well on the synthetic datasets, which is expected since the generative model for these cases was specified by us and thus known (see Appendix B.2). However, our conditional self-learning algorithm is rather competitive as shown by the results on the real data. Notably, it is the only method which improves upon the baseline (i.e., achieves SSL) for each of the five datasets considered. Moreover, it also leads to significant improvements on the Heart dataset where–likely due to a violation of the underlying assumptions on the generative model–the EM approaches fail.

Method | S1 (lin) | S2 (non-lin) | S3 (mult.dim) | Diabetes | Heart Disease
Lin. log. reg. | .968 ± .023 | .823 ± .080 | .945 ± .039 | .626 ± .058 | .526 ± .066
Lin. T-SVM | .865 ± .093 | .878 ± .074 | .822 ± .117 | .602 ± .065 | .746 ± .060
RBF T-SVM | .863 ± .094 | .876 ± .075 | .821 ± .116 | .601 ± .064 | .745 ± .060
RBF label propag. | .924 ± .082 | .909 ± .065 | - | .650 ± .030 | .528 ± .068
Semi-gen. (sup.) | .968 ± .076 | .935 ± .074 | .949 ± .082 | .669 ± .064 | .550 ± .096
Semi-gen. + soft EM | .986 ± .081 | .989 ± .024 | .991 ± .067 | .661 ± .063 | .518 ± .050
Semi-gen. + hard EM | .985 ± .079 | .972 ± .058 | .987 ± .076 | .695 ± .064 | .518 ± .050
Cond. self-learning | .980 ± .052 | .923 ± .090 | .961 ± .069 | .659 ± .079 | .719 ± .076
Table 1: Average accuracies on unlabelled data (higher is better) ± one standard deviation across 100 runs, each time randomly drawing 10 (for S1, S2, S3) or 20 (for Diabetes, Heart) new labelled and 200 new unlabelled samples. Results refer to transductive evaluation for ease of comparison with other methods. The best method for each dataset is highlighted in bold. The last four rows are our causally-motivated methods. The “-” indicates that label propagation did not converge on S3.

7 Discussion

The present paper looks at SSL from the point of view of causal modelling. Its main contribution is conceptual rather than algorithmic. We argue that if we know how the feature set in the input can be partitioned into cause and effect features (X_C and X_E), then this has surprising theoretical implications for how SSL should utilise unlabelled data: rather than simply exploiting links between P(X) and P(Y|X), as formalised for instance in the standard SSL cluster assumption, one should exploit links between p(x_E | x_C) and p(y | x_C, x_E). Note that we view this not as a contradiction of the usual cluster assumption, but rather as an explication or refinement thereof, taking into account the causal structure; indeed, it subsumes SSL in the anticausal setting Schoelkopf2012 as a special case. It does not subsume SSL in the causal setting, but as argued by Schoelkopf2012, SSL fails in this case.

We do not mean to claim that all assumptions underlying these insights always apply in practice. We may not know the causal structure of the features, and in practice, some of the features may be neither causes nor effects, but correlated to the target by unobserved confounders (a case which would be interesting to study in future work). Moreover, the principle of independent causal mechanisms (ICM) underlying both our analysis and the one of Schoelkopf2012 may not strictly hold for a task at hand.

While the present analysis is intriguing and points out a previously unexplored link between two conditional distributions, the jury is still out on how to best exploit unlabelled data in machine learning. The present insight is but one step, and in particular, while encouraging, the algorithms and experiments based upon it can only be a starting point. We hope that they will lead to new approaches that make explicit use of causal structure and exploit the conditional cluster assumption in more elegant and effective ways. Ultimately, the value of novel assumptions and conceptual models lies in whether they provide a fertile basis to inspire further algorithm development and theoretical understanding, and we expect that the present ideas and analysis will constitute such a contribution.

Appendix A Algorithms

First, we make some clarifying remarks regarding Algorithms 1 and 2 presented in the main paper.

In Algorithm 1, the negative log-likelihood (NLL) terms implicitly also depend on the observed feature values; the additive constants do not depend on the respective parameters of the two model components and can thus be ignored when finding a minimum. For the sake of brevity, and since only the values of the fitted labels and parameter estimates change throughout the algorithm while the observed feature values are held fixed, we omit explicitly conditioning on the observed features.

In Algorithm 2, the superscripted samples refer to those samples of causal and effect features with label y. Moreover, the residuals are those from regressing X_E on X_C using the corresponding ĝ_y.

Algorithm 3 below describes, in some more detail, concrete soft- and hard-labelling versions of the EM approach proposed in Algorithm 1, for the model assumption of a logistic regression for p(y | x_C) and linear Gaussian distributions for p(x_E | x_C, y). It was used for our experiments in Sec. 6.

Input: labelled data; unlabelled data; regularisation strength for ridge regression; labelling type (hard/soft)
Output: fitted labels; parameters
// initialise parameter estimates using only the labelled sample
LogisticRegression(…)
while not converged do
      // compute soft (class probabilities) and hard labels (E-step)
      // compute weights (1 for hard labels, class probabilities for soft labels)
      if hard labelling then
            …
      else
            …
      end if
      // update parameter estimates keeping estimated labels fixed (M-step)
      WeightedLogisticRegression(…)
end while
return fitted labels and parameters
Algorithm 3: Soft- and hard-label EM algorithms for a semi-generative model with logistic regression for p(y | x_C) and class-dependent linear Gaussian models for p(x_E | x_C, y).

Appendix B Experimental details

b.1 Real-world data

For the Pima Indians Diabetes dataset we used the partitioning {DiabetesPedigreeFunction, Pregnancies, BMI} and {Glucose}. DiabetesPedigreeFunction is a measure of the family history of diabetes and BMI stands for body mass index.

For the (Coronary) Heart Disease dataset we used the partitioning {sex, ca, thal} and {chest pain}. Here, “ca” refers to the number of major vessels (0-3) that contained calcium (colored by fluoroscopy), and “thal” to thallium scintigraphy results, a nuclear medicine test that images the blood supply to the muscles of the heart.

For further details we refer to smith1988using; detrano1989international.

b.2 Synthetic data

The synthetic datasets used in our experiments were generated as follows. First, we draw X_C from a mixture of multivariate Gaussians. Next, we draw Y and X_E according to the SCM

Y := 𝟙{N_Y ≤ σ(h(X_C))},
X_E := g_Y(X_C) + S_Y N_E,

with N_Y ~ U[0, 1]; N_E standard normal; and S_0, S_1 diagonal matrices of standard deviations. σ denotes the logistic sigmoid function.

This induces the distributions

p(y = 1 | x_C) = σ(h(x_C)),
p(x_E | x_C, y) = N(x_E; g_y(x_C), S_y²).
For experiments on synthetic data we draw a new dataset according to the above generative process in each run, keeping parameters fixed at the following values.

S1: Linear synthetic dataset

We use the following parameters to generate S1.

  • feature dimensions:

  • : components with weights , means and standard deviations

  • :

  • :

S2: Nonlinear synthetic dataset

We use the following parameters to generate S2.

  • feature dimensions:

  • : components with weights , means

    and variances

  • :

  • :

S3: Nonlinear multidimensional synthetic dataset

We use the following parameters to generate S3.

  • feature dimensions:

  • : components with weights , means and covariances

  • :

  • :

Appendix C Code

We will release Python code for our algorithms, as well as scripts to reproduce our results with the camera ready version of this paper.