Clinical data mining is the application of machine learning to medical databases as a tool to assist clinical research and decision making. Its impact depends on the clinical pertinence of the models, but also on their predictive performance, interpretability, and appropriate integration in clinical software. While many methodological successes exist, examples that affect patient management are seldom found. Measurement uncertainty limits the application of data mining, as it impairs model performance and generalizability [3, 4]. Improving data quality is, however, resource-intensive and often unfeasible. We hypothesize that acknowledging uncertainty, in particular by integrating domain knowledge about the reliability of each measurement, can improve models and leverage them as an asset for clinical research and decision making.
The uncertainty of a measurement reflects the lack of knowledge about its exact value, and is often caused by noise in the acquisition. Clinical measurement uncertainty originates from multiple sources, including distinct diagnostic practices, inter- and intra-observer variability, manufacturer-dependent technology, the use of distinct modalities for the same measurement, and patient factors such as body habitus or claustrophobia, which affect the choice and the quality of an imaging test. For example, in the determination of device size for left-atrial appendage closure, consistency between CT, transesophageal echocardiography and angiography was observed in only 21.6% of the cases. In the estimation of EF, the variability between CMR and echocardiography resulted in 28% of the population having opposing device eligibility. Clinical reasoning involves making guideline-abiding decisions based on such unreliable data. To employ scientific evidence in practice, the experienced clinician assesses the reliability of each measurement and integrates it with their training and experience. This endeavor is all the more challenging considering the emphasis of medical research on internal validity, as opposed to external validity. The contrast between the carefully designed populations used for research and the actual populations where evidence is employed has been considered an obstacle to evidence-based medicine.
DT are a knowledge-representation structure, where decisions at the internal nodes lead to a prediction at the leaves. They can be efficiently built using the well-known algorithms ID3, C4.5, CART or CHAID. Compared to other methods, DT can be interpreted as a sequence of decisions. Although recent methods such as deep neural networks can offer better performance, their output is often a black box. Interpretability is all the more necessary as recent European law secures the right to an explanation of all algorithmically-made decisions.
Learning DT from noisy measurements can overfit to the noise and fail to generalize. Likewise, evaluating a DT on noisy instances can produce incorrect predictions. Each test at a DT node compares a measurement with a hard threshold, such that small errors can send the instance down the opposite path. Moreover, the distance between the measurement and the threshold is disregarded. Several algorithms explore the idea of soft thresholds to make DT robust to uncertainty [20, 21, 22, 23, 24]. Such approaches weight the contribution of all child branches to the prediction. Some methods focus on cognitive uncertainty, while others focus on statistical uncertainty or noise. The notion of fractional tuple was first introduced in C4.5 for handling missing values.
Fuzzy DT use fuzzy logic to handle cognitive uncertainty, which deals with inconsistencies in human reasoning. Using fuzzy DT to handle statistical noise involves setting the fuzzy membership functions to express uncertainties around the DT thresholds, known in advance. Alternatively, discretization algorithms can be used, but those methods are computationally costly, and the accuracy of the resulting DT strongly depends on the chosen algorithm.
Probabilistic DT, on the other hand, focus on stochastic uncertainty or random noise, arising from the unpredictable behavior of physical systems and measurement limitations. One method uses PLT for evaluating a DT, assuming a two-modal uniform distribution for the noise. The parameters of this distribution are set using a statistical heuristic based on the training data. Dvorák and Savický employed a variation of this method, where the parameters were estimated through simulated annealing. Experiments in one dataset led to 2-3% error rate reductions compared to CART, suggesting the potential of the method and the need for an evaluation on more data. No significant differences were found between the two parameter-estimation methods, but the authors highlight the computational cost of simulated annealing. The approach is applied to a finalized DT, so the uncertainty is not accounted for during training.
The UDT algorithm extends the probabilistic splits to the training phase, assuming Gaussian noise, and achieves accuracy gains on the datasets evaluated. The method takes an oversampling strategy, where each measurement is replaced by a set of sample points. The authors propose optimized searches to control the resulting increase in training time. In UDT, the same noise model is used for training and evaluation. In medical research and practice, however, the noise in the data used to obtain evidence can be very different from the noise in the data used in practice. A learning algorithm should ideally support independent uncertainty models for the training data and the target data.
Soft approaches have also been proposed for multivariate DT. One algorithm employs logistic regression, treating the separation of data among the children as a classification problem. The method led to small accuracy gains in 10 classification datasets, with significant reductions in tree size. Hierarchical mixtures of experts are another multivariate tree architecture, where each test is a softmax linear combination of all variables, with performance gains compared to CART. Multivariate DT can compactly model complex phenomena, and generally improve accuracy with fewer nodes. They are, however, intended to generate predictions that do not need to be understood, and lack the interpretability of univariate DT.
The aforementioned approaches show the benefit of probabilistic DT in improving DT performance. However, the strategies used to model noise are either limited to the evaluation phase, or they do not use independent noise distributions for learning and evaluation. Moreover, the behavior of the methods upon increasing levels of noise was not investigated, and no distinction was made between the impact of noise in the training data and noise in the test data. A practical, flexible and well-understood approach for handling measurement uncertainty has not been established for DT learning.
This manuscript proposes a probabilistic DT learning approach to handle uncertainty, modeled as a distribution of noise added to the real measurements. Unlike previous methods, the uncertainty distribution is orthogonally considered:
during training, when searching for the best threshold at each node, denoted SS;
during training, in the propagation of the training data through the DT, denoted STP;
in the propagation of the test instances when evaluating a finalized tree, denoted SE.
To promote interpretability, we consider univariate DT where each node contains a decision based on a single variable. The proposed SS keeps the computational cost under control. We address the problem of integrating knowledge about the uncertainty coming from clinical experience and from the meta-analysis of clinical studies. As a proof-of-concept, we opted for a normal noise model, and evaluate its impact on the distinct learning phases. We also separately study the effect of corrupting the training or test data.
In the following, we introduce the ID3 and C4.5 algorithms and discuss a probabilistic interpretation. The manuscript then proceeds with a description of the proposed soft DT algorithm components, followed by the experiments to illustrate and evaluate them.
2 Decision tree learning
Consider a multi-dimensional input variable and an output variable, related by an unknown joint distribution. Drawing samples from this distribution composes the training dataset. The supervised learning problem consists in learning a model from the training dataset that predicts the output for an unseen sample. DT algorithms follow an algorithmic approach that does not attempt to learn the underlying distribution.
Learning an optimal DT that maximizes generalization accuracy with a minimum number of decisions is NP-complete . Although non-greedy methods exist for multivariate DT, locally-optimal approaches offer a good trade-off of accuracy and computational complexity. Notable examples include the ID3 , CART  and C4.5 . The ID3 selects binary splits of numerical variables using the information gain, and the C4.5 extends it to categorical features. For a survey on top-down DT induction, refer to Rokach et al. .
DT are regularized by employing node pruning algorithms. C4.5 employs a postpruning approach that cuts branches with high estimates of generalization error, based on the training data.
2.1 Split search
In top-down DT learning, suppose that a new node sees the training subset . In a binary DT, we define the branch function at as that indicates if the instance goes to the left or right child of , or . is parametrized by the attribute index and the threshold :
where is the observation of variable for . The split search consists in finding the attribute and threshold that split with maximum class separation.
The entropy of variable is defined as:
We take such that the entropy is measured in bit, and omit . In ID3, increasing class purity corresponds to reducing in and , compared to . In other words, and are chosen to maximize the Mutual Information between and , :
In the DT literature, is known as information gain, and is equal to the difference between the entropy of in and the entropy of in the resulting nodes:
The term is the conditional entropy of given the split variable . Equation 2 is equivalent to:
The maximum-likelihood estimates of the probabilities in Equations 3 and 4 from the training data are:
where is the number of training instances in , , and is the number of instances in for which event was observed, .
In this manuscript, we consider numerical variables, employ binary splits, and use the information gain as the split criterion. C4.5 additionally supports multi-way splits of categorical variables. Since the information gain is biased towards splits with many outcomes, C4.5 normalizes it by the entropy of the split variable, defining the gain ratio. In C4.5 release 8, the best split for each numerical attribute is first selected using information gain. Subsequently, the gain ratio criterion is used to compare the best splits found for all variables, categorical and numerical.
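As an illustration of the hard split search described in this section, the following Python sketch computes the entropy and the information gain of a binary split on a single numerical variable. The function names are hypothetical, and using the distinct dataset values as candidate thresholds is a simplifying assumption, not the exact C4.5 procedure.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by the hard test `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    conditional = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - conditional


def best_hard_split(values, labels):
    """Return (threshold, gain) maximizing the information gain, or None.

    Assumption: candidates are the distinct observed values except the largest,
    since a threshold at the largest value would leave the right child empty.
    """
    candidates = sorted(set(values))[:-1]
    if not candidates:
        return None
    scored = ((t, information_gain(values, labels, t)) for t in candidates)
    return max(scored, key=lambda pair: pair[1])
```

For instance, `best_hard_split([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1])` selects the threshold `2.0`, which separates the two classes perfectly.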
2.2 Probabilistic interpretation of the splits
Consider the probability of given instance , based on which we can make a prediction about . We can estimate from the DT rooted in node , with child nodes and as:
where and are conditionally independent given . Equation 7 is based on Bayesian model averaging , which translates the uncertainty of the distinct models, in this case the two subtrees and , into uncertainty in the class prediction.
Probabilistic DT use this idea to instead express the uncertainty about the observed instance . This uncertainty is modeled by the posterior , known as the gating function, . The estimate of becomes:
If we assume that the observed value is accurate, node performs a hard split as in Equation 1, and so:
In this case, if the observed value is close to the threshold, small variations in its value can drastically change the class probability estimate and produce incorrect predictions. Probabilistic DT instead model the uncertainty of each variable as a distribution of noise, and the gating function becomes the CDF of the chosen distribution. A small variation of the observed value around the threshold will then smoothly alter the estimate.
When searching for the best split for a node , we use the training subset to obtain the probability estimates of Equation 5. E.g. we can estimate , or equivalently , in terms of :
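The contrast between a hard and a soft gating function can be sketched as follows for a single node with two leaves. This is a minimal illustration assuming additive zero-mean Gaussian noise with standard deviation `sigma`; the function names and the representation of leaf estimates as lists of class probabilities are hypothetical.

```python
from statistics import NormalDist


def hard_gate(x, threshold):
    """Probability of taking the left branch under a hard split."""
    return 1.0 if x <= threshold else 0.0


def soft_gate(x, threshold, sigma):
    """Probability that the underlying value lies at or below the threshold,
    assuming the observation x carries zero-mean Gaussian noise."""
    return NormalDist(mu=x, sigma=sigma).cdf(threshold)


def node_class_probabilities(x, threshold, p_left, p_right, sigma=None):
    """Mix the class probabilities of the two children with the gate value.

    With sigma=None the hard gate is used, so one child gets all the weight.
    """
    g = hard_gate(x, threshold) if sigma is None else soft_gate(x, threshold, sigma)
    return [g * pl + (1.0 - g) * pr for pl, pr in zip(p_left, p_right)]
```

With the hard gate, observations of 4.9 and 5.1 against a threshold of 5 receive opposite predictions, whereas with `sigma=1.0` their gate values differ by less than 0.1.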
3 Proposed approach
We propose a probabilistic DT approach to handle uncertainty by modeling it as a distribution of noise around the observed value. The noise model should express existing knowledge about the uncertainty. This model is independently considered:
When searching for the split thresholds, during training (Section 3.1, SS),
When propagating the training instances, during training (Section 3.2, STP),
When propagating test instances through the constructed DT, during evaluation (Section 3.3, SE).
As a proof-of-concept, we consider that the noise of each variable is additive and captured by a normally distributed random variable. We consider the normal distribution to be a good assumption to study our approach. It can be useful for various types of data, owing to the Central Limit Theorem, and is fully described by its mean and variance.
3.1 Soft search
Let us focus on the computation of the information gain for variable . Suppose that takes distinct values sorted as with , in the training data . Note that . We focus on finding a split for node , and consider that all the concepts in this Section refer to , and omit the subscript. is the number of instances with value , and the number of such instances that belong to class . Let denote the candidate threshold for a split based on at node .
In C4.5, the split search is done by computing the information gain in Equation 2 for each candidate , . Since the method assumes certain measurements, we have . The probabilities and are estimated as in Equation 5. We now describe how to estimate these probabilities efficiently, considering uncertain measurements.
If we consider the noise model , the gating function will be the normal CDF. Let us denote the numerator of Equation 9 as , representing the density of training examples falling on the left child node:
where is the Gaussian kernel function centered on the measurement with variance :
This corresponds to using Gaussian kernel density estimation for the probabilities used to compute the information gain. Since the resulting information gain is no longer constant between consecutive dataset values, the candidate thresholds need not be the dataset values. We take:
with and . The parameter controls the resolution of the search, and is the window factor, . For efficiency, we consider the kernel adjusted to be zero for . Using Equation 12, the number of information gain computations is limited to .
To compute the information gain efficiently for the candidate splits, we keep two running sums of , on the left and on the right of , and . We initialize the search with and . Each time the candidate threshold is incremented, the left and right sums are respectively incremented and decremented by the quantity :
with the CDF of the standard normal distribution. The contribution of measurement to is proportional to . The last point ensures that the density contributions of sum to .
Similarly, to estimate , we consider the sum of the densities per class, . The density increments per class are computed by replacing the number of instances by in Equation 13. Finally, the estimated probabilities and are used to minimize the conditional entropy in Equation 3. Searching for the threshold using the set of values in Equation 12 and the density increments and acts as a Gaussian filter to the information gain. We refer to this approach as SS.
The standard deviation of the noise model is set to the product of the SS uncertainty factor and the sample mean of the variable. This choice was made because clinicians often report the uncertainty in terms of the absolute values of the attributes. In the following experiments, the uncertainty factor is assumed to be the same for all variables, although it could be specified independently.
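The idea behind SS can be sketched in Python as below. This is a simplified illustration, not the procedure described above: it recomputes the soft counts for every candidate instead of maintaining running sums, it does not truncate the kernel with a window factor, and it places candidates on a regular grid over the observed range. The names `soft_search`, `uncertainty_factor` and `n_candidates` are hypothetical, and taking the absolute value of the mean to keep the standard deviation positive is an assumption.

```python
import math
from statistics import NormalDist, fmean


def entropy_from_counts(counts):
    """Entropy in bits from (possibly fractional) per-class counts."""
    total = sum(counts)
    if total <= 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def soft_information_gain(values, labels, threshold, sigma, classes):
    """Information gain where each instance contributes fractionally to both sides."""
    left = {c: 0.0 for c in classes}
    right = {c: 0.0 for c in classes}
    for x, y in zip(values, labels):
        p_left = NormalDist(x, sigma).cdf(threshold)  # Gaussian mass at or below threshold
        left[y] += p_left
        right[y] += 1.0 - p_left
    n = len(values)
    parent = entropy_from_counts([left[c] + right[c] for c in classes])
    conditional = (
        sum(left.values()) / n * entropy_from_counts(list(left.values()))
        + sum(right.values()) / n * entropy_from_counts(list(right.values()))
    )
    return parent - conditional


def soft_search(values, labels, uncertainty_factor, n_candidates=50):
    """Return (threshold, gain) maximizing the soft information gain on a grid."""
    sigma = uncertainty_factor * abs(fmean(values))
    if sigma <= 0 or n_candidates < 2:
        raise ValueError("need a positive standard deviation and at least 2 candidates")
    classes = sorted(set(labels))
    lo, hi = min(values), max(values)
    step = (hi - lo) / (n_candidates - 1)
    candidates = [lo + k * step for k in range(n_candidates)]
    scored = ((t, soft_information_gain(values, labels, t, sigma, classes)) for t in candidates)
    return max(scored, key=lambda pair: pair[1])
```

On the values `[1, 2, 4, 5]` with labels `[0, 0, 1, 1]`, this sketch picks a threshold near 3, in the gap between the two classes, rather than at a dataset value. Its cost grows with the product of the number of instances and candidates, which is what the running sums in the text avoid.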
3.2 Soft training propagation
We can also account for the uncertainty when splitting the training data, which we call STP. The two soft training approaches, SS and STP can be used in combination or independently.
Consider the split variable and value at node . As before, a soft split is achieved by setting the gating function to the CDF of centered in :
with the CDF of . Each training instance is fractionally divided between the child nodes, according to the probability . The standard deviation is set to , with the STP uncertainty factor, and the variable mean.
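A minimal Python sketch of the fractional propagation used by STP is given below, assuming Gaussian noise centered on each observed value. Instances are represented by weights, so that a weight of 1 at the root is divided between the two children. The function names are hypothetical.

```python
from statistics import NormalDist


def soft_partition(values, weights, threshold, sigma):
    """Divide each weighted instance between the left and right child.

    Returns two weight lists aligned with `values`; for every instance the
    left and right weights add up to its incoming weight.
    """
    left, right = [], []
    for x, w in zip(values, weights):
        p_left = NormalDist(x, sigma).cdf(threshold)
        left.append(w * p_left)
        right.append(w * (1.0 - p_left))
    return left, right


def weighted_class_probabilities(labels, weights):
    """Class probability estimates at a node from fractional instance weights."""
    total = sum(weights)
    probs = {}
    for y, w in zip(labels, weights):
        probs[y] = probs.get(y, 0.0) + w / total
    return probs
```

With two instances at 1.0 and 3.0 from different classes, a threshold of 2.0 and `sigma=1.0`, the left child estimates the class of the first instance with probability of about 0.84 instead of 1, which illustrates why the estimates become less extreme.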
STP leads to increased learning times. Using hard splits, each instance is sent down one branch, and the total number of instances remains constant at each level of the tree. The number of information gain computations at any DT depth is bounded by the dataset size and the number of attributes, . With STP, each instance is sent down all the branches, with weights determined by the gating function. Therefore, a node sees a greater number of instances compared to the hard approach. The number of information gain computations at a given DT depth is raised to .
3.3 Soft evaluation
The uncertainty may also be accounted for when classifying test cases with a finalized DT. SE is achieved by setting the gating function as in Equation 14, but with . The evaluation uncertainty factor is denoted as and is obtained from the training data. From Equation 8, the class probability estimates for instance become:
As described in Section 2.2, this approach adjusts to reflect the choice of noise distribution, in this case, Gaussian.
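Soft evaluation of a finalized tree can be sketched recursively, as below. The nested-dictionary tree layout, the key names and the per-variable list `sigmas` are assumptions made for illustration; each standard deviation would be derived from the evaluation uncertainty factor as described above.

```python
from statistics import NormalDist


def soft_predict(node, x, sigmas):
    """Class probabilities for instance x, weighting both subtrees at every split.

    A leaf is a dict with key "probs"; an internal node has "feature",
    "threshold", "left" and "right". `sigmas[j]` is the noise standard
    deviation assumed for feature j.
    """
    if "probs" in node:
        return node["probs"]
    j, threshold = node["feature"], node["threshold"]
    g = NormalDist(x[j], sigmas[j]).cdf(threshold)  # probability of the left branch
    left = soft_predict(node["left"], x, sigmas)
    right = soft_predict(node["right"], x, sigmas)
    return [g * pl + (1.0 - g) * pr for pl, pr in zip(left, right)]
```

An instance observed exactly at a threshold receives an even mixture of the two subtrees, while an instance several standard deviations away is predicted almost as by the hard tree.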
3.4 Motivation on a toy example
To better motivate the use of the soft algorithm components, we consider a toy example with an input variable and class .
Suppose we have a training dataset of four examples , with and . Figure 0(a) displays the information gain computed by C4.5 and the SS for the corresponding candidate splits, denoted by . C4.5 would choose as a split, while SS would choose a split close to , potentially avoiding the misclassification of test examples in the interval , for which there is no training data. If the measurements are noisy such that e.g. and , C4.5 would select or , as seen in Figure 0(b). The SS would select a value close to zero.
Let us now discard two training examples, and keep and . We model the uncertainty of , such that and are drawn from normal distributions with mean and , and common variance . Let us also consider the test point . We analyze the probability of misclassifying . We note that the sum and difference of normal distributions are also normally-distributed:
Let us assume that the search selects the midpoint as a split value. The prediction depends on whether the difference is greater than or less than . Specifically, the probability of misclassifying is composed by two terms:
In case is smaller than , C4.5 assigns class to iff . In case is greater than , C4.5 assigns class to iff . Each of the above probabilities is given by the CDF associated with the corresponding distribution. If we shift each distribution to have zero mean, we can express the probabilities as:
is the CDF of . Figure 1(a) displays the probability of misclassifying as a function of when is certain. When , and so , is nearly zero until the distributions of the training instances start to overlap with increasing . The inverse occurs for , where the error probability starts to decrease. Figure 1(a) shows how the prediction probabilities change as the model expresses less confidence on the training data, when using STP.
Let us now express the uncertainty about the test example, such that is also normally distributed with variance . We can obtain and as:
Figure 1(b) displays for and , which now converges faster to .
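The two-point analysis can be reproduced numerically with the sketch below. It rests on several assumptions that are not taken from the text: the test point truly belongs to the class of the first training point, the hard split is placed at the midpoint of the two noisy training values, and the closed form uses the fact that the sum and the difference of two independent normal variables with equal variance are independent. All names are hypothetical.

```python
from statistics import NormalDist


def misclassification_probability(mu_1, mu_2, sigma, x_test, sigma_test=0.0):
    """Probability that a midpoint split learned from two noisy points
    misclassifies a test point whose true class is that of the first point.

    mu_1, mu_2: means of the two training values (one per class).
    sigma: common standard deviation of the training noise (must be > 0).
    x_test: mean of the test value; sigma_test: its noise standard deviation.
    """
    # Difference of the training values decides which class sits on the left.
    difference = NormalDist(mu_1 - mu_2, sigma * 2 ** 0.5)
    p_first_on_left = difference.cdf(0.0)
    # Midpoint minus test value decides on which side the test value falls.
    gap_sd = (sigma ** 2 / 2 + sigma_test ** 2) ** 0.5
    gap = NormalDist((mu_1 + mu_2) / 2 - x_test, gap_sd)
    p_test_on_right = gap.cdf(0.0)
    # Error: first class on the left and test on the right, or the reverse.
    return (p_first_on_left * p_test_on_right
            + (1.0 - p_first_on_left) * (1.0 - p_test_on_right))
```

Under these assumptions the error is close to zero for small `sigma`, approaches 0.5 as `sigma` grows, and rises earlier when `sigma_test` is also positive.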
4 Experiment design
The experiments assess the effect of the proposed uncertainty model in the distinct DT learning phases, and the impact of corrupting the training versus the test data with noise.
Improving prediction performance consists in maximizing the generalization accuracy, i.e. the fraction of correctly classified test examples, while minimizing model complexity . DT complexity is assessed by measuring its number of leaves.
Each proposed soft component, SS, STP and SE is independently compared to C4.5. The C4.5 pruning and missing-value strategies are equally employed in all experiments. Pruning is extended with the Laplace correction, as recommended to favor the exploration of smaller DT for the same accuracy . We also extend the evaluation of the PLT  and UDT  approaches to more data and noise scenarios.
4.1 Data description and pruning confidence factor
Table I displays the employed datasets. We sought clinical data from open resources, including UCI ML  and KEEL  repositories. Non-ordinal features were excluded. We synthesized additional datasets using an adaptation of the method by Guyon , available in Scikit-learn’s implementation make_classification . The datasets were generated with varying properties.
Since optimal DT size is problem- and dataset-dependent, the pruning confidence factor is not optimized. Instead, for each dataset the confidence factor is fixed such that the average DT size achieved by C4.5 through CV is 15 leaves. This corresponds to a binary tree with nearly levels, considered a manageable and interpretable number of decisions in a visual clinical guideline. Some of the datasets were too small to reach 15 leaves, so their confidence factor is set to reach either 10 or 5 leaves. The confidence factor is fixed across all experiments.
For each real dataset, random train-test permutations were created, containing respectively 70% and 30% of the data. Stratified sampling was used, so that class proportions are equal in all samples. For each synthetic dataset, distinct instances were generated and divided into 30 different sets, which were then split into 70%-30% train-test samples.
The experiments are performed with varying degrees of noise to the data. The noise added to a variable in a data subset is sampled according to , with the noise factor and the training subset mean of . The same is used for all its variables. All randomness was generated with fixed seeds for reproducibility.
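The noise corruption described above can be sketched as follows, assuming zero-mean Gaussian noise whose standard deviation is the noise factor times the training mean of the variable. The function name and arguments are hypothetical, and the absolute value of the mean is an assumption that keeps the standard deviation non-negative.

```python
import random
from statistics import fmean


def add_noise(column, noise_factor, training_mean, rng):
    """Return a noisy copy of one variable's values.

    column: list of measurements of a single variable.
    noise_factor: multiplier applied to the training mean to get the std.
    training_mean: mean of the variable in the training subset.
    rng: a random.Random instance, seeded by the caller for reproducibility.
    """
    sigma = noise_factor * abs(training_mean)
    return [x + rng.gauss(0.0, sigma) for x in column]


# Usage: corrupt a column with a fixed seed.
column = [55.0, 60.0, 35.0, 48.0]
noisy = add_noise(column, 0.1, fmean(column), random.Random(0))
```

Passing the training mean explicitly lets the same scale be reused when corrupting validation or test subsets.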
Table I: Employed datasets.
|Dataset||Source||No. features||No. instances||No. classes|
|Heart disease||UCI ML||8||303||2|
|Pima Indians diabetes||UCI ML||8||768||2|
|South African heart||UCI ML||8||462||2|
|Breast cancer Wisconsin||UCI ML||39||569||2|
|Haberman’s breast cancer||UCI ML||3||306||2|
|Indian liver patient||UCI ML||9||582||2|
|BUPA liver disorders||UCI ML||6||345||2|
|Vertebral column (2 classes)||UCI ML||12||310||2|
|Vertebral column (3 classes)||UCI ML||12||310||3|
|Thyroid gland||UCI ML||5||215||3|
|Oxford Parkinson’s disease||UCI ML||22||194||2|
|Thoracic surgery||UCI ML||3||470||2|
4.2.1 Experiment 1: noise added to the training data
Experiment 1 studies the approaches with noise added to the training data, so we denote the training noise factor as . The uncertainty factors , , of the SS, STP and SE approaches, as well as the UDT parameter , control the standard deviation of the uncertainty model. As such, we tune them by CV for each dataset and . The SS parameters and are respectively set to and . Initial experiments showed that they do not significantly impact the results, provided that is large enough to contain most of the density of the uncertainty distribution, and is small enough to ensure a large set of candidate splits. The PLT parameters and are derived using the method proposed by Quinlan. We summarize the steps for model tuning and evaluation:
For each train-test permutation, and model:
1. Hold out the 30% test set
2. If model has parameter to set (, , , or ), use the 70% training set to tune it by CV:
(a) Compute 10 stratified CV folds.
(b) Add noise to the CV training folds, distributed as . Do not add noise to the CV validation folds.
(c) Tune the parameter to maximize CV accuracy.
3. Add noise to the initial 70% training set, distributed as .
4. Learn a tree using the noisy training set, and the selected parameter value, if applicable.
5. Evaluate the tree on the 30% test set, to which no noise was added.
C4.5 and PLT do not undergo step 2.
4.2.2 Experiment 2: noise added to the test data
Experiment 2 evaluates the merit of the models built on data without added noise in predicting the labels of noisy test cases. Noise is added to the data used for evaluation, and we denote the test noise factor as . Model evaluation is done as:
For each train-test permutation, and model:
1. Hold out the 30% test set
2. If model has parameter to set (, , , or ), use the 70% training set to tune it by CV:
(a) Compute 10 stratified CV folds.
(b) Do not add noise to the CV training folds. Add noise to the CV validation folds, distributed as .
(c) Tune the parameter to maximize CV accuracy.
3. Learn a tree using the initial 70% training set, and the selected parameter value, if applicable
4. Add noise to the 30% test set, distributed as .
5. Evaluate the tree on the noisy 30% test set.
As before, C4.5 and PLT do not undergo step 2. Note that noise is added to the CV validation folds.
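The parameter tuning step shared by both experiments can be sketched as a generic routine in which the two protocols differ only in which CV folds are corrupted. Everything below is a hypothetical skeleton: `fit`, `accuracy`, `noise_train` and `noise_val` are caller-supplied functions, and the fold assignment is a simple stratified round-robin rather than any specific library implementation.

```python
import random


def stratified_folds(labels, k, rng):
    """Assign every index to one of k folds, spreading each class evenly."""
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, i in enumerate(indices):
            folds[position % k].append(i)
    return folds


def tune_parameter(X, y, candidates, fit, accuracy, noise_train, noise_val, k=10, seed=0):
    """Return the candidate value with the highest mean CV accuracy.

    Experiment 1: noise_train corrupts the rows, noise_val is the identity.
    Experiment 2: noise_train is the identity, noise_val corrupts the rows.
    """
    folds = stratified_folds(y, k, random.Random(seed))
    best_value, best_accuracy = None, -1.0
    for value in candidates:
        scores = []
        for fold in folds:
            held_out = set(fold)
            train = [i for i in range(len(y)) if i not in held_out]
            model = fit(noise_train([X[i] for i in train]), [y[i] for i in train], value)
            scores.append(accuracy(model, noise_val([X[i] for i in fold]), [y[i] for i in fold]))
        mean_score = sum(scores) / len(scores)
        if mean_score > best_accuracy:
            best_value, best_accuracy = value, mean_score
    return best_value
```

The selected value would then be used to learn a tree on the full 70% training set before evaluating on the held-out 30%, with noise added to whichever side the experiment prescribes.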
5.1 Illustration on a single variable
We take the EF estimates of the Data Science Bowl Cardiac Challenge . EF is a variable of critical importance in cardiology. Implantable device therapy is officially recommended for . Therefore, we take cases with to have positive eligibility, i.e. , as shown in Figure 2(a). Adding random noise to these data results in FN and FP, as seen in Figure 2(b).
SS: In Section 3.4, we motivated the SS as a way of increasing the set of candidate splits and smoothing the information gain. We now observe this on real measurements. Figure 3(a) displays the number of patients for each class and EF value, , like a histogram. Figure 3(b) shows the same data with noise. Figure 3(c) shows the SS density increments , and SS information gain.
Soft search methods such as SS or UDT increase the set of candidate splits. This dataset has instances. Employing UDT with a resampling factor raises the number of information gain computations from to . Setting the SS parameters e.g. as , and limits this number to a maximum of approximately .
STP: To illustrate how STP alters the class probability estimates, we consider the noisy data of Figure 2(b). We learned two single-node two-leaf trees using the standard training propagation and STP with . Table II shows that the class probability estimates are less extreme when STP is employed, as they reflect the choice of noise distribution.
|Standard training propagation||STP|
SE: Figure 2(b) shows that 7 FN and 5 FP were introduced by the noise added to the EF dataset. Table III shows the probability estimates of those misclassified examples, obtained using hard or soft evaluation, with . The numbers of FN and FP are different because the algorithm learned a threshold of rather than .
5.2 Experimental results
The average number of leaves, test accuracy and running time are computed over 30 train-test permutations, for each experiment and dataset. The absolute difference to C4.5, averaged over all datasets, is displayed in Tables IV and V.
However, since absolute results of distinct datasets cannot be directly compared , we focus on standardized metrics. The results of each dataset and method were standardized by the dataset’s baseline. The baseline result is obtained by C4.5 without added noise, and estimated with the 30 permutations. The standardization consists in subtracting the baseline mean, and dividing by the baseline standard deviation. E.g. the baseline of the Heart disease dataset has mean leaves and standard deviation leaves. In this case, standardized results , and translate into , and leaves, respectively. Tables IV and V display the standardized metrics, averaged over all datasets, and Figure 5 shows the corresponding boxplots. Computational times are merely indicative, as the experiments were run on a cluster, and the specifications of the machines may vary slightly.
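The standardization amounts to a z-score against the baseline runs, as in the short sketch below. The function name is hypothetical, and the use of the sample standard deviation (rather than the population one) is an assumption.

```python
from statistics import fmean, stdev


def standardize(result, baseline_runs):
    """Express a result in baseline standard deviations from the baseline mean.

    baseline_runs: results of C4.5 without added noise over the permutations.
    """
    return (result - fmean(baseline_runs)) / stdev(baseline_runs)
```

For example, with hypothetical baseline runs of 14, 15 and 16 leaves, a tree with 17 leaves has a standardized size of 2.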
All approaches show an increase in the number of leaves and a reduction in test accuracy as the noise factor grows. The decrease in accuracy was sharper for the predictions made on noisy test data, as seen in Table V.
SS, STP and UDT show maintenance or non-statistically significant improvement in accuracy compared to C4.5, in all noise scenarios. In Table IV, we see that STP had higher accuracy compared to the other methods for all and . For the SS and STP approaches, the maintenance of accuracy was accompanied by statistically significant reductions in the number of leaves compared to C4.5 and UDT. The SS tree size reduction was statistically significant for noise factors greater than . STP had a further reduction in tree size, significant for all and . The maintenance of accuracy by UDT compared to C4.5 was accompanied by an increase in the number of leaves. The method was considerably slower than SS and STP.
In Experiment 1, the accuracies obtained by SE and PLT were equal to or smaller than those obtained through hard evaluation, as seen in Table IV. This is particularly evident when . The number of leaves remains unchanged as these methods do not affect training.
On the contrary, Table V shows that SE accuracy was superior to that of hard evaluation in Experiment 2 for . However, this increase was not statistically significant. It suggests that the uncertainty distribution considered in the SE approach better captures the noise added to the data, compared to PLT.
We propose a probabilistic DT approach to handle uncertain data, which separates the uncertainty model in three independent algorithm components. Our experiments evaluate these components in their ability to handle varying degrees of noise in the training and test data.
The first observation is that corrupting the data decreases the accuracy of the predictions, especially if the noise is in the test data. Accordingly, learning on data with increasing uncertainty results in DT with a larger number of leaves, as the models attempt to learn the particularities of the training set.
The results indicate that SS, STP and UDT are at least as robust to noise as C4.5, with non-significant improvements in accuracy. This was observed both for training or test data noise. UDT results are consistent with previous experiments with accuracy gains in the order of , but where DT size was not reported .
All soft training methods had longer running times compared to C4.5, the slowest being UDT. When comparing the search approaches, we observe that, by employing the discretization in Equation 12, SS increases the set of possible thresholds compared to C4.5, while preventing the computation of the information gain for values that are very close. UDT generates samples for each measurement. The number of entropy computations per attribute is bounded by , and therefore grows with the size of the dataset. Using SS, this number is bounded by , and does not grow with .
While maintaining accuracy, SS and STP led to significantly smaller DT. All approaches built larger trees for increasing , as they start to overfit to the noise. SS and STP were able to cope with this by reducing the number of splits, acting as regularizers. This can be interpreted as a consequence of having class probability estimates that reflect the uncertainty, illustrated in Figure 1(a) and Table II, on the pruning algorithm. The C4.5 pruning method uses a statistical test on the training data to make a pessimistic estimate of the generalization error. Using a soft training approach expresses less confidence in the data, causing the pruning algorithm to remove more nodes. Tree size reductions were also observed for the multivariate sigmoid-split approach. However, they were most likely caused by the use of multivariate split functions, which are able to express complex rules more compactly, at the cost of reduced interpretability.
To investigate if the DT size reduction could be obtained by changing the C4.5 pruning confidence factor , in Appendix C we show the result of varying when using C4.5, SS or STP for the first datasets of Table I. For lower , SS resulted in smaller models with similar accuracy. In the overfitting range, this tendency is inverted. This indicates that the estimated class probabilities are more accurate with SS, when the model actually learned representative splits. STP has led to consistently smaller trees than C4.5, except when is very close to .
Given that the uncertainty models in SS, STP and UDT are all Gaussian, an explanation for the disparity in results obtained by the proposed soft training approaches and UDT could be the mismatch between the uncertainty model considered in the algorithms and the distribution of the noise added to the data. In our experiments, the same distribution was used in SS and STP and in the noise model, i.e. the standard deviation was a factor of the variable mean. In UDT, the standard deviation is instead a factor of the range of the data. This is also suggested by the values of the parameter selected through CV displayed in Figures A.1 and A.2, which are smaller than and . Some degree of correlation between the parameters , and the noise factors , was observed. But no significant correlation was observed between and or .
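The difference between the two scaling conventions discussed above can be made concrete with a minimal sketch. It assumes only what the text states: in SS and STP the Gaussian standard deviation is a factor of the variable mean, whereas in UDT it is a factor of the variable range. The function names, the example values and the factor 0.1 are hypothetical and chosen purely for illustration.

```python
from statistics import fmean


def sigma_from_mean(values, factor):
    """Standard deviation as a factor of the variable mean (SS/STP convention).

    The absolute value is an assumption added here so that the result
    stays non-negative for variables with a negative mean.
    """
    return factor * abs(fmean(values))


def sigma_from_range(values, factor):
    """Standard deviation as a factor of the variable range (UDT convention)."""
    return factor * (max(values) - min(values))


# Hypothetical ejection-fraction measurements (%), for illustration only.
ef = [25.0, 35.0, 45.0, 55.0, 60.0]
print(sigma_from_mean(ef, 0.1))   # scale tied to the mean of the variable
print(sigma_from_range(ef, 0.1))  # scale tied to the spread of the variable
```

For the same factor, the two conventions generally give different standard deviations, so an uncertainty model scaled by the range does not match noise generated with a scale tied to the mean.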
We hypothesize that the soft learning works more effectively as the uncertainty model approximates the real noise distribution. We therefore recommend the use of uncertainty models specified by domain experts. In our experiments, noise was simulated as an experimental proof-of-concept. To validate our approach on concrete clinical decision problems, the uncertainty distribution and its parameters shall be estimated a priori for each variable. Such an estimation may be based, for instance, on the meta-analysis of clinical studies and on empirical clinical knowledge.
|Metric||Method||Training data noise factor ()|
|No. leaves standardized by C4.5 with||C4.5||0.00||0.44||0.72||1.09||1.42||1.75||2.03|
|Absolute diff. in no. leaves to C4.5 with same||SS||-2.9||-5.1||-5.6||-5.4||-5.4||-5.3||-5.0|
|Accuracy standardized by C4.5 with||C4.5||0.00||-0.52||-0.87||-1.27||-1.59||-1.65||-2.06|
|Absolute diff. in accuracy (%) to C4.5 with same||SS||0.8||1.2||1.7||1.6||1.8||1.3||2.1|
|Absolute diff. in running time (s) to C4.5 with same||SS||2.7||5.6||4.5||4.9||5.1||4.3||5.2|
|Metric||Method||Test data noise factor ()|
|No. leaves standardized by C4.5 with||C4.5||0.00||0.00||0.00||0.00||0.00||0.00||0.00|
|Absolute diff. in no. leaves to C4.5 with same||SS||-2.9||-4.3||-4.3||-3.4||-4.0||-3.5||-3.4|
|Accuracy standardized by C4.5 with||C4.5||0.00||-0.97||-1.38||-2.01||-2.59||-3.06||-3.50|
|Absolute diff. in accuracy (%) to C4.5 with same||SS||0.8||1.5||2.0||1.8||2.2||2.4||2.7|
|Absolute diff. in running time (s) to C4.5 with same||SS||2.05||4.8||6.0||4.8||4.7||3.3||5.29|
SE and PLT did not improve accuracy given noisy training data, compared to the standard hard split approach. When noise was added to the test data, SE led to non-significant increases in accuracy. This suggests that modeling uncertainty to target training data noise is only effective when this model is incorporated in the training phase, and not during evaluation. As such, we do not recommend the use of soft evaluation to handle training noise.
The disparity between the SE and the PLT results may also be explained by the consistency between the uncertainty model considered by the algorithms, and the model used to corrupt the data. PLT has demonstrated 2-3% error rate reductions on a previous experiment using a single dataset, where the shape of the uncertainty model had been optimized.
This paper presents a probabilistic DT learning approach to handle the uncertainty in the data, where the uncertainty model is separated into three independent algorithm components. The context is providing interpretable models for clinical decision support, with the motivation that the acknowledgement of uncertainty will facilitate the adoption of automated learning approaches in practice.
Previous DT algorithms suggest the potential of probabilistic approaches to improve prediction robustness, and the need for an evaluation on more datasets and levels of noise. However, the impact of considering an uncertainty model in the learning phase or during evaluation was not reported, nor was the impact of having noise in the training examples or in the test examples.
In our approach, the uncertainty representation is incorporated: in the learning phase when searching for the optimal thresholds (SS), when propagating the training data through the tree (STP), and in the evaluation phase when obtaining predictions for unseen data (SE). Any model can be chosen to capture the uncertainty. Our purpose is to incorporate clinical knowledge about the reliability of measurements. But as a proof-of-concept, we model the uncertainty as normally-distributed noise.
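To illustrate how a normally-distributed uncertainty model can enter the evaluation phase, the following is a minimal sketch of one common way to soften a univariate split. It is not necessarily the formulation used in this paper. Each measurement is treated as the mean of a Gaussian, the instance is sent down both branches with weights given by the Gaussian mass on each side of the threshold, and the leaf class distributions are averaged with those weights. The nested-dictionary tree layout, the function names `prob_left` and `soft_predict`, and the attribute name `ef` are hypothetical.

```python
from math import erf, sqrt


def prob_left(x, threshold, sigma):
    """P(true value <= threshold), assuming true value ~ N(x, sigma^2)."""
    if sigma <= 0:
        # No uncertainty: fall back to the usual hard split.
        return 1.0 if x <= threshold else 0.0
    return 0.5 * (1.0 + erf((threshold - x) / (sigma * sqrt(2.0))))


def soft_predict(node, instance, sigmas):
    """Return class probabilities, weighting both branches of every split."""
    if "probs" in node:  # leaf: class distribution stored at training time
        return dict(node["probs"])
    attr = node["attr"]
    w = prob_left(instance[attr], node["threshold"], sigmas[attr])
    left = soft_predict(node["left"], instance, sigmas)
    right = soft_predict(node["right"], instance, sigmas)
    classes = set(left) | set(right)
    return {c: w * left.get(c, 0.0) + (1.0 - w) * right.get(c, 0.0)
            for c in classes}


# Hypothetical one-split tree on an ejection-fraction attribute.
tree = {
    "attr": "ef", "threshold": 35.0,
    "left": {"probs": {"pos": 0.9, "neg": 0.1}},
    "right": {"probs": {"pos": 0.2, "neg": 0.8}},
}
print(soft_predict(tree, {"ef": 35.0}, {"ef": 5.0}))
```

A measurement lying exactly on the threshold is split evenly between the two branches, while a standard deviation of zero reproduces the hard evaluation of a standard tree.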
In our experiments, corrupting the test data seems to have a more severe impact on accuracy than adding noise to the training data. Upon increased noise, the soft training components, SS, STP and UDT, show maintained or improved accuracy compared to C4.5. STP and SS act as regularizers, showing significant reductions in tree size, with STP outperforming SS. This was not the case for UDT, possibly given the disparity between the noise model in the data and the uncertainty model in the algorithm. The running times of SS and STP were lower than those of UDT. None of the soft evaluation approaches shows a significant benefit compared to hard evaluation. Overall, we recommend using SS and STP with an uncertainty model that approximates as closely as possible the real noise in the data. Finally, our study shows the importance of acknowledging data uncertainty when learning decision models. Ideally, when designing clinical studies, an assessment of the reliability of each measurement should be considered part of the database.
Future work directions include evaluating the approach with domain-specific noise distributions, and studying the conditions under which the soft training provides benefit, regarding the complexity of the data. For highly separable data, the benefit of any soft approach is expected to be limited.
This work was supported by the European Union Horizon 2020 research and innovation programme under grant agreement 642676 (Cardiofunxion), by the Spanish Ministry of Economy and Competitiveness (grant TIN2014-52923-R; Maria de Maeztu Units of Excellence Programme - MDM-2015-0502), by the European Union FP7 for research, technological development and demonstration under grant agreement VP2HF (611823), and FEDER.
-  V. L. Patel, E. H. Shortliffe, M. Stefanelli, P. Szolovits, M. R. Berthold, R. Bellazzi, and A. Abu-Hanna, “The coming of age of artificial intelligence in medicine,” Artificial Intelligence in Medicine, vol. 46, no. 1, pp. 5–17, 2009.
-  D. A. Clifton, K. E. Niehaus, P. Charlton, and G. W. Colopy, “Health Informatics via Machine Learning for the Clinical Management of Patients,” Yearb Med Inform, vol. 10, no. 1, pp. 38–43, 2015.
-  J. Roddick, P. Fule, and W. Graco, “Exploratory medical knowledge discovery: Experiences and issues,” ACM SIGKDD Explorations Newsletter, pp. 2–7, 2003.
-  N. Lavrač, I. Kononenko, E. Keravnou, M. Kukar, and B. Zupan, “Intelligent data analysis for medical diagnosis: Using machine learning and temporal abstraction,” AI Communications, vol. 11, no. 3, pp. 191–218, 1998.
-  A. F. Karr, A. P. Sanil, and D. L. Banks, “Data quality: A statistical perspective,” Statistical Methodology, vol. 3, no. 2, pp. 137–173, 2006.
-  Joint Committee for Guides in Metrology (Working Group 1), “Evaluation of measurement data - Guide to the expression of uncertainty in measurement,” Tech. Rep. JCGM 100:2008 (BIPM, IEC, IFCC, ILAC, ISO, IUPAC, IUPAP and OIML), 2008.
-  K. J. Cios and G. William Moore, “Uniqueness of medical data mining,” Artificial Intelligence in Medicine, vol. 26, no. 1-2, pp. 1–24, 2002.
-  K. Singh, B. K. Jacobsen, S. Solberg, K. H. Bønaa, S. Kumar, R. Bajic, and E. Arnesen, “Intra- and interobserver variability in the measurements of abdominal aortic and common iliac artery diameter with computed tomography. The Tromsø study,” European Journal Vascular and Endovascular Surgery, vol. 25, no. 5, pp. 399–407, 2003.
-  T. Foley, S. Mankad, N. Anavekar, C. Bonnichsen, M. Morris, T. Miller, and P. Araoz, “Measuring left ventricular ejection fraction-techniques and potential pitfalls,” European Cardiology, vol. 8, no. 2, pp. 108–114, 2012.
-  J. R. Lopez-Minguez, R. Gonzalez-Fernandez, C. Fernandez-Vegas, V. Millan-Nunez, M. E. Fuentes-Canamero, J. M. Nogales-Asensio, J. Doncel-Vecino, M. Yuste Dominguez, L. Garcia Serrano, and D. Sanchez Quintana, “Comparison of imaging techniques to assess appendage anatomy and measurements for left atrial appendage closure device selection.” The Journal of invasive cardiology, vol. 26, no. 9, pp. 462–467, sep 2014.
-  T. S. Genders, B. S. Ferket, and M. M. Hunink, “The quantitative science of evaluating imaging evidence,” JACC: Cardiovascular Imaging, vol. 10, no. 3, pp. 264–275, 2017.
-  S. de Haan, K. de Boer, J. Commandeur, A. M. Beek, A. C. van Rossum, and C. P. Allaart, “Assessment of left ventricular ejection fraction in patients eligible for ICD therapy: Discrepancy between cardiac magnetic resonance imaging and 2D echocardiography,” Netherlands Heart Journal, vol. 22, no. 10, pp. 449–455, 2014.
-  L. W. Green, “Closing the chasm between research and practice: evidence of and for change,” Health Promotion Journal of Australia, vol. 25, no. 1, pp. 25–29, 2014.
-  J. Quinlan et al., “Interactive dichotomizer, id3,” Eds. Morgan Kauffmann, Springer-Verlag, 1979.
-  R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann Publishers, 1993.
-  L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. Belmont, CA: Wadsworth International Group, 1984.
-  G. V. Kass, “An exploratory technique for investigating large quantities of categorical data,” Applied statistics, pp. 119–127, 1980.
-  S. F. Weng, J. Reps, J. Kai, J. M. Garibaldi, and N. Qureshi, “Can machine-learning improve cardiovascular risk prediction using routine clinical data?” PLOS ONE, vol. 12, no. 4, 2017.
-  “Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC,” 2016 O.J. L 119, 4.5.:1–88.
-  J. R. Quinlan, “Decision trees as probabilistic classifiers,” in Proceedings of the 4th International Workshop on Machine Learning. Morgan Kaufmann, 1987, pp. 31–37.
-  J. Dvorák and P. Savický, “Softening splits in decision trees using simulated annealing,” in Adaptive and Natural Computing Algorithms, 8th International Conference, ICANNGA 2007, Warsaw, Poland, April 11-14, 2007, Proceedings, Part I, 2007, pp. 721–729.
-  S. Tsang, B. Kao, K. Y. Yip, W.-S. Ho, and S. D. Lee, “Decision trees for uncertain data,” IEEE transactions on knowledge and data engineering, vol. 23, no. 1, pp. 64–78, 2011.
-  O. Irsoy, O. T. Yıldız, and E. Alpaydın, “Soft decision trees,” in Pattern Recognition (ICPR), 2012 21st International Conference on. IEEE, 2012, pp. 1819–1822.
-  Y. Yuan, “Induction of fuzzy decision trees,” Fuzzy Sets and Systems, vol. 69, no. 2, pp. 125–139, 1995.
-  X. Wang, B. Chen, G. Qian, and F. Ye, “On the optimization of fuzzy decision trees,” Fuzzy Sets and Systems, vol. 112, no. 1, pp. 117–125, may 2000.
-  A. Segatori, F. Marcelloni, and W. Pedrycz, “On Distributed Fuzzy Decision Trees for Big Data,” IEEE Transactions on Fuzzy Systems, pp. 1–1, 2017.
-  J. R. Quinlan, “Probabilistic decision trees,” Machine learning: an artificial intelligence approach, vol. 3, pp. 140–152, 1990.
-  M. I. Jordan and R. A. Jacobs, “Hierarchical mixtures of experts and the em algorithm,” Neural computation, vol. 6, no. 2, pp. 181–214, 1994.
-  L. Hyafil and R. L. Rivest, “Constructing optimal binary decision trees is NP-complete,” Information Processing Letters, vol. 5, no. 1, pp. 15–17, 1976.
-  J. R. Quinlan, “Induction of decision trees,” Machine Learning, vol. 1, no. 1, pp. 81–106, 1986.
-  L. Rokach and O. Maimon, “Top-down induction of decision trees classifiers - A survey,” IEEE Transactions on Systems, Man and Cybernetics Part C: Applications and Reviews, vol. 35, no. 4, pp. 476–487, 2005.
-  Quinlan, Ross. Ross Quinlan’s personal homepage. Accessed: 2018-06-03. [Online]. Available: www.rulequest.com/Personal/
-  J. A. Hoeting, D. Madigan, A. E. Raftery, and C. T. Volinsky, “Bayesian model averaging: a tutorial,” Statistical science, pp. 382–401, 1999.
-  J. D’Hooge, D. Barbosa, H. Gao, P. Claus, D. Prater, J. Hamilton, P. Lysyansky, Y. Abe, Y. Ito, H. Houle et al., “Two-dimensional speckle tracking echocardiography: standardization efforts based on synthetic ultrasound data,” Eur Heart J Cardiovasc Imaging, vol. 17, no. 6, pp. 693–701, 2016.
-  M. Kearns, Y. Mansour, A. Y. Ng, and D. Ron, “An experimental and theoretical comparison of model selection methods,” Machine Learning, vol. 50, pp. 7–50, 1997.
-  T. Niblett and I. Bratko, “Learning decision rules in noisy domains,” in Proceedings of Expert Systems ’86, The 6th Annual Technical Conference on Research and Development in Expert Systems III. Cambridge University Press, 1986, pp. 25–34.
-  M. Lichman, “UCI Machine Learning Repository,” 2013. [Online]. Available: http://archive.ics.uci.edu/ml
-  J. Alcalá-Fdez, A. Fernández, J. Luengo, J. Derrac, S. García, L. Sánchez, and F. Herrera, “KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework,” Journal of Multiple-Valued Logic and Soft Computing, vol. 17, no. 2-3, pp. 255–287, 2011.
-  I. Guyon, “Design of experiments for the nips 2003 variable selection benchmark,” 2003. [Online]. Available: clopinet.com/isabelle/Projects/NIPS2003
-  F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
-  D. Jensen and T. Oates, “The effects of training set size on decision tree complexity,” in Proceedings of the 14th International Conference on Machine Learning, 1999, pp. 254–262.
-  National Heart, Lung, and Blood Institute, “Data Science Bowl Cardiac Challenge Data,” 2015. [Online]. Available: www.kaggle.com/c/second-annual-data-science-bowl
-  P. Ponikowski et al., “2016 ESC Guidelines for the diagnosis and treatment of acute and chronic heart failure: The Task Force for the diagnosis and treatment of acute and chronic heart failure of the European Society of Cardiology (ESC) Developed with the special contribution of the Heart Failure Association (HFA) of the ESC,” European heart journal, vol. 37, no. 27, pp. 2129–2200, 2016.
-  J. Demsar, “Statistical Comparison of Classifiers over Multiple Data Sets,” Journal of Machine Learning Research, vol. 7, no. 7, pp. 1–30, 2006.
-  F. Wilcoxon, “Individual Comparisons by Ranking Methods,” Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, 1945.
-  C. Clopper and E. Pearson, “The use of confidence or fiducial limits illustrated in the case of the binomial,” Biometrika, vol. 26, no. 4, p. 404, 1934.
Appendix A Parameter tuning
Appendix B Pruning algorithm in C4.5
C4.5 employs the following pruning algorithm. Consider a leaf with majority class , where an instance misclassification follows a Bernoulli distribution with probability . We now show how to obtain an estimate of from the training data. During training, the leaf sees instances, belong to class . Let us assume that the number of training errors follows a binomial distribution, . One approach to estimating is to consider the upper limit of the confidence interval of this binomial under a specified confidence level, . In C4.5, a two-tailed confidence level is set by specifying the confidence factor , such that the probability that is smaller than . To estimate the binomial confidence limits, the Clopper-Pearson method was used .
The number of errors is finally estimated by multiplying the number of instances by the probability estimate, . This is computed for three scenarios: 1) maintaining the node, 2) replacing the node by a leaf, and 3) replacing the node by the subbranch with the smallest predicted error. The scenario with the smallest predicted error is chosen. If is small, will be higher for leaves with fewer instances, and the tree will be more aggressively pruned. If is high, will be smaller for nodes with few instances compared to the parent, and the tree will be less pruned.
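The pessimistic error estimate described in this appendix can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the C4.5 source code. The symbols are hypothetical: `n` is the number of training instances reaching the leaf, `errors` the number of them that are misclassified, and `cf` the confidence factor. The sketch further assumes that the upper Clopper-Pearson limit is the error probability at which observing at most `errors` errors has probability `cf`, and it finds that limit by bisection on the exact binomial distribution function.

```python
from math import comb


def binom_cdf(errors, n, p):
    """P(X <= errors) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1.0 - p)**(n - k)
               for k in range(errors + 1))


def upper_error_limit(errors, n, cf=0.25, tol=1e-10):
    """Upper Clopper-Pearson limit on the leaf error probability.

    Assumption: the limit p solves P(X <= errors | n, p) = cf.
    """
    if errors >= n:
        return 1.0
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        # The distribution function decreases as p grows.
        if binom_cdf(errors, n, mid) > cf:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def pessimistic_errors(errors, n, cf=0.25):
    """Predicted number of errors: instances times the upper error limit."""
    return n * upper_error_limit(errors, n, cf)


print(upper_error_limit(0, 6))   # leaf with 6 instances and no training errors
print(pessimistic_errors(0, 6))
```

Under these assumptions, lowering `cf` or reducing the number of instances in a leaf raises the upper limit, which is the mechanism by which a small confidence factor leads to more aggressive pruning.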
Appendix C Changing the pruning confidence factor
To complement the results, we investigate whether the reduction in the number of leaves with maintained accuracy could have been achieved using C4.5 with a distinct pruning confidence factor , in order to achieve the same regularizing effect. We evaluated the models learned using C4.5 and SS for with noise levels and . The results can be seen in Figures C.1 and C.2 for 5 of the datasets. Both in the case with no added noise and with , the SS approach led to smaller trees for the lower range of . The differences in accuracy between the two methods were small. This indicates that for the lower range of , SS was able to make more assertive estimates of the generalization error compared to C4.5.
Appendix D Absolute results
This section of the Appendix shows the absolute value of the accuracy and number of leaves for each, obtained with train and test noise levels 0.0 and 0.2.
|Pima Indians diabetes||15.2||2.9||5.0||15.2||15.2||19.1||17.5||13.5||4.0||17.5||17.5||20.2|
|South African heart||15.0||6.1||3.0||15.0||15.0||15.3||18.0||6.9||3.8||18.0||18.0||19.2|
|Breast cancer Wisconsin||10.0||12.8||11.6||10.0||10.0||12.2||13.6||14.9||17.4||13.6||13.6||14.4|
|Indian liver patient||15.5||1.4||1.0||15.5||15.5||15.5||17.7||1.5||1.0||17.7||17.7||19.2|
|BUPA liver disorders||15.2||10.7||9.1||15.2||15.2||15.2||16.9||10.1||6.4||16.9||16.9||18.4|
|Vertebral column (2c)||15.1||15.1||11.8||15.1||15.1||15.1||16.5||17.8||13.6||16.5||16.5||17.8|
|Vertebral column (3c)||15.1||12.4||11.4||15.1||15.1||15.0||17.4||16.1||11.1||17.4||17.4||16.1|
|Average difference to C4.5||-2.9||-5.4||0.0||0.0||1.5||-5.4||-9.3||0.0||0.0||0.7|
|Pima Indians diabetes||74.3||74.5||75.0||74.5||74.3||74.3||71.8||72.7||73.3||71.6||71.4||71.8|
|South African heart||65.5||67.9||67.5||65.9||65.5||66.7||65.4||67.5||67.7||65.0||65.4||66.1|
|Breast cancer Wisconsin||93.2||94.1||94.0||93.3||93.0||94.6||92.3||92.4||93.3||93.2||92.4||93.0|
|Indian liver patient||68.2||70.9||71.0||68.1||68.1||68.2||67.5||70.6||71.0||67.7||67.7||67.9|
|BUPA liver disorders||66.1||66.9||64.6||65.2||64.3||66.1||60.6||64.1||62.1||60.1||60.6||60.4|
|Vertebral column (2c)||79.1||79.1||79.6||79.0||79.0||79.1||78.2||77.0||77.5||77.7||77.2||77.5|
|Vertebral column (3c)||81.4||80.5||82.4||81.2||80.4||81.0||76.2||78.8||79.3||77.9||76.0||77.8|
|Avg. difference to C4.5||0.8||0.9||-2.0||-1.2||0.3||1.6||1.9||-0.1||-0.3||0.4|
|Pima Indians diabetes||74.3||74.5||75.0||74.5|