A One-Class Decision Tree Based on Kernel Density Estimation

05/14/2018
by   Sarah Itani, et al.

One-Class Classification (OCC) is a domain of machine learning which achieves training by means of a single class sample. The present work aims at developing a one-class model which addresses concerns of both performance and readability. To this end, we propose a hybrid OCC method which relies on density estimation as part of a tree-based learning algorithm. Within a greedy and recursive approach, our proposal rests on kernel density estimation to split a data subset on the basis of one or several intervals of interest. Our method shows favorable performance in comparison with common methods of the literature on a range of benchmark datasets.

1 Introduction

Many data science problems must be addressed with unbalanced datasets. Indeed, it may be quite affordable to gather data on the representatives of a given pathology in medicine, or on positive operating scenarios of machines in industry [Khan2009]. The complementary occurrences are, by contrast, scarce and/or expensive to collect. The practice of One-Class Classification (OCC) has been developed with this consideration in mind [Moya1993, Khan2009].

One-class classifiers are trained on a single-class sample, possibly in the presence of a few counter-examples. The underlying task consists in understanding and isolating a given class from the rest of the universe. The resulting model predicts target (or positive) patterns and rejects outlier (or negative) ones.

One-Class Support Vector Machine (OCSVM) is a popular OCC method [Scholkopf2001, Chang2011]. Statistics-based techniques such as Gaussian models and Kernel Density Estimation (KDE) [Silverman1986] are also commonly considered, as parametric and non-parametric approaches respectively, to estimate a sample distribution. Thresholded at a given level of confidence, this estimate is used to reject any instance located beyond the decision boundary thus established [Tarassenko1995]. However, these OCC methodologies lose performance and readability on high-dimensional samples [Desir2013]. Yet in a number of applications, such as clinical decision support, it is crucial to aim for readable (and thus interpretable) predictions, beyond accuracy alone.

Basically devoted to supervised classification, decision trees [Quinlan1986] perform well on both objectives of performance and interpretability. Built on a sequential reasoning scheme, their inference mechanism is indeed very close to the human way of thinking [Duch2004]. However, this quality is lost in tree-based one-class methodologies, which generally rely on ensemble strategies like random forests to boost performance [Desir2013, Goix2016]. In this case, a one-class problem is converted into a binary one by generating artificial outliers, on which a supervised decision tree is finally trained.

Our work puts forward a hybrid one-class classifier, called One-Class decision Tree (OC-Tree). The construction of OC-Trees relies on an innovative splitting mechanism supported by Kernel Density Estimation (KDE): a parent node is divided into one or several intervals of interest, based on indications provided by the density estimate. The contributions of our work are summarized below.

  • As a single and readable tree-based classifier, the OC-Tree proves competitive with an ensemble technique such as the well-performing One-Class Random Forest (OCRF), which offers little potential for interpretability [Desir2013].

  • The OC-Tree proves robust to high-dimensional data in comparison with reference methods of the literature, including the multi-dimensional KDE.

  • As a result of (1) and (2), the OC-Tree integrates multi-dimensional KDE within an intuitive and structured decision scheme, based on a subset of significant attributes, with increased performance compared to the original method.

  • By making KDE the key element of learning, the OC-Tree is also suitable for clustering: it is able to recover clusters as hyper-rectangles while preserving their structure.

The remainder of the paper is organized as follows. In section 2, we describe our algorithm. The assessment procedure is presented in section 3. The results are reported and discussed in section 4. Finally, we conclude this paper and give future prospects in section 5.

2 Our proposal

Our One-Class Tree (OC-Tree) is implemented in a divide-and-conquer spirit, in order to find target groupings, i.e. parts of the space where target samples are concentrated, and to describe these groupings in a simple and readable way. Let us denote:

  • X, the initial set of training instances;

  • D, a space of d dimensions including X;

  • A, the set of continuous training attributes;

  • X_t, the set of training instances available at a given node t;

  • D_t, the sub-space of d dimensions related to node t (D_t ⊆ D).

At each node t, the algorithm searches for the attribute a_m ∈ A which best raises, from X_t and D_t, a number s of (not necessarily adjacent) target sub-space(s) D_{t_1}, …, D_{t_s} such that:

(1)   D_{t_j} = { x ∈ D_t : l_j ≤ x^(m) ≤ r_j },   j = 1, …, s

where x^(m) is the value of instance x for attribute a_m; l_j and r_j are respectively the left and right bounds of the closed sub-intervals raised to split the current node t into s target nodes t_1, …, t_s, based on attribute a_m. As the divisions are made parallel to the axes, the target sub-spaces may be seen as hyper-rectangles of interest. To achieve this result at a given node t, the training algorithm processes each training attribute according to the following steps.

  • Compute a Kernel Density Estimation (KDE), i.e. an estimate f of the probability density function of the attribute, based on the available training instances (see section 2.1).

  • Divide the attribute domain based on the modes of f (see section 2.2).

  • Assess the quality of the division by computing the resulting impurity decrease (see section 2.3).

The attribute that achieves the best impurity decrease is selected to split the current node into child nodes. If necessary, some branches are pre-pruned in order to preserve the interpretability of the tree (see section 2.4). The algorithm is run recursively; termination occurs when a stopping condition is reached (see section 2.5).
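To make the splitting mechanics concrete, the following minimal Python sketch partitions the instances of a node according to closed sub-intervals on the selected attribute, in the spirit of Eq. (1), with the remaining instances routed to an "else" branch; the function name, the array layout and this explicit "else" branch are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def split_node(X, attr, intervals):
    """Partition the instances of a node according to closed sub-intervals
    [l_j, r_j] on the selected attribute (cf. Eq. (1)): one child per target
    sub-interval, plus an 'else' branch gathering the remaining instances."""
    children, covered = [], np.zeros(len(X), dtype=bool)
    for lo, hi in intervals:
        mask = (X[:, attr] >= lo) & (X[:, attr] <= hi)
        children.append(X[mask])
        covered |= mask
    return children, X[~covered]   # target child nodes, 'else' (outlier) branch

# Example: split a toy node on attribute 0 into two target sub-intervals.
X = np.array([[0.1, 3.0], [0.4, 2.2], [5.2, 1.0], [5.6, 0.7], [9.9, 4.0]])
children, rest = split_node(X, attr=0, intervals=[(0.0, 1.0), (5.0, 6.0)])
```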

2.1 Density estimation

In order to identify concentrations of target instances, we have to estimate their distribution over the space; this estimate is provided by a Kernel Density Estimation (KDE). In particular, our proposal is based on the popular Gaussian kernel [Silverman1986]:

f(x) = 1/(n_t · h) · Σ_{x_i ∈ X_t} K((x - x_i)/h),   with K(u) = exp(-u²/2) / √(2π),

where X_t is the set of n_t training instances available at node t, K the kernel function and h, a parameter called bandwidth.

The parameter h influences the shape of the resulting function f [Silverman1986]. As h tends towards zero, f becomes overly spiky, while high values of h induce a less detailed density estimate. Adaptive methods, such as least-squares cross-validation, may help set the bandwidth value [Jones1996, Li2007]. However, such iterative techniques are computationally expensive; their use can hardly be considered in this context of recursive divisions. Hence, we compute h with the following formula [Silverman1986]:

(2)   h = 0.9 · min(σ, IQR/1.34) · n_t^(-1/5)   if IQR > 0
      h = 0.9 · σ · n_t^(-1/5)                   if IQR = 0

where σ is the standard deviation of the sample available at the node and IQR, the associated inter-quartile range. The first relation corresponds to Silverman's rule of thumb [Silverman1986]. We consider the second relation to address samples having a zero IQR. Indeed, a zero IQR may reveal very concentrated data, with the potential presence of some singularities that should be eliminated.
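For illustration, the following Python sketch computes a bandwidth in the spirit of Eq. (2) and evaluates the corresponding one-dimensional Gaussian KDE; the fallback on the standard deviation alone when the IQR is zero, as well as the function names, are assumptions of this example rather than the authors' exact code.

```python
import numpy as np

def silverman_bandwidth(x):
    """Bandwidth in the spirit of Eq. (2): Silverman's rule of thumb, falling
    back on the standard deviation alone when the inter-quartile range is zero."""
    n = x.size
    sigma = x.std(ddof=1)
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    spread = min(sigma, iqr / 1.34) if iqr > 0 else sigma
    return 0.9 * spread * n ** (-1 / 5)

def gaussian_kde_1d(x, grid, h):
    """Evaluate the Gaussian-kernel density estimate of the 1-D sample x on `grid`."""
    u = (grid[:, None] - x[None, :]) / h               # pairwise scaled distances
    kernel = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return kernel.sum(axis=1) / (x.size * h)

# Example: density estimate of a bimodal 1-D sample.
x = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(6, 0.5, 100)])
grid = np.linspace(x.min() - 1, x.max() + 1, 400)
h = silverman_bandwidth(x)
density = gaussian_kde_1d(x, grid, h)
```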

2.2 Division

At node t, division is executed based on the density estimate f, in four steps.

Figure 1: Division mechanism
  • Clipping the KDE (a)
    The KDE f is thresholded at a level ε.
    This allows a set S of target sub-intervals to be raised.

  • Revision (b)
    If f is k-modal (k > 1) and S contains fewer than k sub-intervals, revision occurs, since some modes were not identified. Each sub-interval of S is thus analyzed: if its image by f includes at least one significant local minimum, intermediate apertures are created around this (these) local minimum (minima).

  • Assessment (c)
    The sub-intervals of S covering a number of training instances lower than a proportion ν of the training set are dropped. This ensures that only the most significant target nodes are kept.

  • Shrinking (d)
    The detected sub-intervals are shrunk into closed intervals, so as to fit the domain strictly covered by the related target training instances, as defined by Eq. (1).

The set S is thus potentially updated at the end of steps (b), (c) and (d).

Consider the KDE presented in Fig. 1: clipping (a) raises two target sub-intervals. As the density estimate is 3-modal in this case, a revision of the interval partitioning (b) is launched. There is no need to split the first sub-interval, since the corresponding piece of f includes a single maximum. By contrast, a local minimum is detected in the second sub-interval [A,B]. Concretely, a split occurs if this local minimum is significant, i.e. sufficiently deep in comparison with both nearby local maxima; in mathematical terms, the value of f at the local minimum must not exceed a fraction γ of the value of f at the lower of the two surrounding local maxima. The sub-interval [A,B] is thus split into three parts around this local minimum. Steps (c) and (d) are then launched, and the sub-intervals are shrunk around the target training instances (represented by crosses in Fig. 1). The complement of the resulting sub-intervals represents the set of outlier sub-spaces; it may be represented by a single branch entitled "else".

Except for prior knowledge that would help choosing its value more specifically, there is no reason to set a high clipping threshold ε, since the training set is supposed to include target instances only; a high value would be penalizing, with the exclusion of real target nodes as a consequence. The influence of parameters γ and ν is discussed further in section 4. Basically, the value of the clipping threshold ε should be low (e.g. 0.05), because it only aims at rejecting outliers. Parameter ν is related to the width of the intermediate apertures created between a couple of close target groupings: if such an aperture includes very few instances, it is dropped. Thus, we may set ν = 0 by default and use a non-zero value in the case of noisy datasets. Finally, a non-zero value of γ (e.g. 0.5) will lead to revision, which is interesting if we want to detect target groupings precisely.
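The sketch below illustrates the clipping, assessment and shrinking steps on a single attribute (the revision step is omitted for brevity); the relative form of the clipping level, the grid-based interval detection and the parameter names eps and nu are assumptions of this illustration, not the authors' exact procedure.

```python
import numpy as np

def divide_attribute(x, density, grid, eps=0.05, nu=0.0):
    """Sketch of the division of section 2.2 on a single attribute.

    (a) Clipping : keep the grid regions where the KDE exceeds a low level.
    (c) Assessment: drop sub-intervals covering fewer than nu * len(x) instances.
    (d) Shrinking : shrink each sub-interval to the closed range actually
                    covered by the training instances it contains.
    The revision step (b) around significant local minima is not sketched here.
    """
    above = density > eps * density.max()    # clipping; the relative level is an assumption
    idx = np.flatnonzero(above)
    # contiguous runs of above-threshold grid points -> candidate target sub-intervals
    runs = np.split(idx, np.flatnonzero(np.diff(idx) > 1) + 1)

    intervals = []
    for run in runs:
        lo, hi = grid[run[0]], grid[run[-1]]
        inside = x[(x >= lo) & (x <= hi)]
        if inside.size and inside.size >= nu * x.size:      # assessment
            intervals.append((inside.min(), inside.max()))  # shrinking to the data
    return intervals
```

Applied to the bimodal sample of the previous sketch, divide_attribute(x, density, grid) returns one closed interval per blob; everything outside these intervals falls into the "else" (outlier) branch.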

2.3 Impurity decrease computation

The adaptation of the classical supervised decision tree to OCC is generally achieved through the physical or virtual generation of outliers in each node, to enable the emergence of target concentrations [Hempstalk2008, Desir2013, Goix2016]. As a result of the division, each child node includes a number of outliers which has to be estimated. The work of [Goix2016] assumes that the outliers are uniformly distributed, so that the expected number of outliers falling in a child node is proportional to the measure of the hyper-rectangle to which the node relates. Based on this predictive calculation, [Goix2016] gives a proxy for the Gini impurity decrease for the purpose of OCC. We adapt this result to our proposal, where more than two child nodes may result from the division: the impurity decrease is the Gini impurity of the parent node minus the weighted sum of the Gini impurities of its s' child nodes, where s' is the total number of target and outlier sub-intervals raised by the division, and where each child impurity is evaluated on its target instances and its expected (virtual) outliers. Thus, the impurity decrease is computed with no need for the physical generation of outliers.
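As a rough illustration of this criterion, the snippet below computes a Gini impurity decrease in which the outliers of each child node are not generated but counted virtually, in proportion to the volume of its hyper-rectangle; the number of virtual outliers assigned to the parent and the exact weighting are assumptions inspired by [Goix2016], not the precise proxy used in the paper.

```python
def gini(n_target, n_outlier):
    """Gini impurity of a node holding n_target targets and n_outlier (virtual) outliers."""
    n = n_target + n_outlier
    if n == 0:
        return 0.0
    p = n_target / n
    return 2 * p * (1 - p)

def impurity_decrease(n_parent, parent_volume, children):
    """children: list of (n_target, volume) pairs, one per target or outlier sub-interval.

    The virtual number of outliers in a child is taken proportional to its volume,
    assuming uniformly spread virtual outliers in the parent hyper-rectangle."""
    n_out_parent = n_parent            # one virtual outlier per target instance (assumption)
    total = n_parent + n_out_parent
    weighted = 0.0
    for n_target, volume in children:
        n_out = n_out_parent * volume / parent_volume
        weighted += (n_target + n_out) / total * gini(n_target, n_out)
    return gini(n_parent, n_out_parent) - weighted
```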

Figure 2: Pre-pruning mechanism

2.4 Pre-pruning mechanism

A branch of an OC-Tree is pre-pruned if there are no more eligible attributes for division. An attribute is not eligible if:

  • for this attribute, all the instances have the same value;

  • the computed bandwidth h is strictly smaller than the minimum difference between two (distinct) successive values in the set of available instances, i.e. the data granularity.

At a given node, a division based on a non-eligible attribute no longer makes sense. Besides, the eligibility of the whole set of training attributes is lost if the algorithm successively selects the same attribute, or the same sequence of attributes, to cut the same target node, i.e. merely tightening the domain covered by the available training instances in the associated hyper-rectangle. Such successive refinement splits only adjust the bounds of the hyper-rectangle in question; they are useless and contribute to the erosion of the target space. Fig. 2 shows a tree learned on two training attributes. The nodes in dotted lines are developed in the absence of a pre-pruning mechanism; the latter yields a shorter and more readable decision tree. Note that the branches related to outliers are omitted for the sake of clarity.
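A minimal check of the two eligibility conditions above might look as follows; it assumes numeric attribute values and a bandwidth computed as in section 2.1.

```python
import numpy as np

def attribute_is_eligible(x, bandwidth):
    """An attribute is not eligible if all its values are equal, or if the computed
    bandwidth falls below the data granularity, i.e. the smallest gap between
    two distinct successive values of the available instances."""
    values = np.unique(x)                 # sorted distinct values
    if values.size < 2:
        return False                      # constant attribute
    granularity = np.diff(values).min()
    return bandwidth >= granularity
```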

It should be noted that the user may keep either the full tree, as a predictive model describing how the space division was developed, or only the description of the final target hyper-rectangles, as a set of sub-intervals of interest over the attributes used for division.

2.5 Stopping conditions

The algorithm converges under some global and local conditions.

  • Global condition (δ, maxit)
    At each iteration, we compute the training accuracy, i.e. the ratio of training instances included in the target nodes. The algorithm is stopped if this accuracy, rounded to a precision δ, remains stable over maxit iterations in which no additional target node was raised. Indeed, in this case, the training process has reached a stage where the target sub-spaces are simply delimited more precisely on the basis of additional attributes, without further multiplying. As a result, the higher maxit, the more complex the predictive model. This parameter therefore tunes the length of the model. We can set maxit to a small value by default, so that the resulting model focuses only on the attributes with the most significant separative power (a minimal sketch of this rule is given after this list).

  • Local conditions
    Divisions may be stopped locally if there are compelling reasons to convert a node into a leaf, i.e. when pre-pruning is necessary (see section 2.4).
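As announced above, here is a small sketch of the global stopping rule; the bookkeeping through a history of (accuracy, number of target nodes) pairs, the rounding precision and the default maxit value are assumptions of this illustration.

```python
def should_stop(history, maxit=1, precision=2):
    """Stop once the training accuracy, rounded to `precision` decimals, has remained
    stable over the last maxit iterations while no additional target node was raised.
    history: one (training_accuracy, n_target_nodes) pair per iteration."""
    if len(history) <= maxit:
        return False
    recent = history[-(maxit + 1):]
    stable_accuracy = len({round(acc, precision) for acc, _ in recent}) == 1
    no_new_node = recent[-1][1] == recent[0][1]
    return stable_accuracy and no_new_node
```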

3 Experimental procedure

Figure 3: Synthetic datasets

3.1 Single evaluation

First, we propose a targeted qualitative evaluation of our method. We thus use synthetic data in order to assess the advocated methodology in ideal conditions, with respect to the expected objective of delineating target hyper-rectangles. These datasets are composed of two-dimensional Gaussian blobs (see Figure 3): by means of different blob arrangements, sizes and spans, we can study the influence of the algorithm parameters related to the tree construction. The Gaussian blobs play the role of different groupings representing the same target class.
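Such two-dimensional target samples can be produced, for instance, with scikit-learn's blob generator; the centres, sizes and spreads below are arbitrary and only meant to mimic the kind of datasets shown in Fig. 3.

```python
from sklearn.datasets import make_blobs

# Two well-separated Gaussian blobs acting as two groupings of the same target class.
X, _ = make_blobs(n_samples=[300, 200],
                  centers=[(0.0, 0.0), (6.0, 5.0)],
                  cluster_std=[1.0, 0.7],
                  random_state=0)
# All instances share the single "target" label; no outliers are used for training.
```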

3.2 Comparison with reference methods

In the absence of a universal experimental protocol and benchmark data for OCC, it is standard practice to convert multi-class problems into One-Class (OC) ones for evaluation purposes. In that regard, the one-vs-rest approach [Desir2013] consists in considering one class as the target and the others as outliers [Wang2006, Hempstalk2008, Desir2013, Nguyen2015, Fragoso2016, Wang20162]. Following the appropriate conversion of reference datasets, the evaluation of an OC classifier is generally conducted under a Cross-Validation (CV) strategy, with a range of possible variants depending on the options envisaged, i.e. with/without stratification, number of folds, repetition(s). In the context of a one-class problem, a CV strategy is led in such a way that, once the folds are created, the folds on which the classifier is trained are devoid of outliers [Ratle2007, Hempstalk2008, Desir2013, Nguyen2015, Fragoso2016, Wang20162].

Let us denote by TT (resp. TO) the number of True Targets (resp. True Outliers), i.e. the number of instances correctly detected as targets (resp. outliers); FT (resp. FO) denotes the number of False Targets (resp. False Outliers) [Nguyen2015]. In the context of OCC, it is convenient to resort to the Matthews Correlation Coefficient (MCC). In particular, the work of [Desir2013] shows that the MCC is well suited to the assessment of OCC classifiers [Maldonado2014, Fragoso2016, Wang20162]. Derived from the Pearson correlation coefficient for binary configurations [Baldi2000], the MCC, given by Eq. (3), measures the correlation between the predictions and the real instance labels.

(3)   MCC = (TT·TO - FT·FO) / √((TT + FT)(TT + FO)(TO + FT)(TO + FO))

A zero MCC indicates that the classifier makes arbitrary decisions or fails to predict both outputs simultaneously [Zhang2012, Desir2013]. One can therefore understand that the accuracy, given by Eq. (4), complemented by the MCC, forms an interesting way of measuring the real performance of an OC classifier.

(4)   Acc = (TT + TO) / (TT + TO + FT + FO)
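Treating targets as the positive class, both measures can be computed directly from the four counts; the short helper below follows Eq. (3) and Eq. (4) as written above, with purely illustrative numbers.

```python
import math

def mcc_and_accuracy(tt, to, ft, fo):
    """MCC and accuracy from the counts of True Targets (tt), True Outliers (to),
    False Targets (ft) and False Outliers (fo), cf. Eq. (3) and Eq. (4)."""
    denom = math.sqrt((tt + ft) * (tt + fo) * (to + ft) * (to + fo))
    mcc = (tt * to - ft * fo) / denom if denom > 0 else 0.0
    accuracy = (tt + to) / (tt + to + ft + fo)
    return mcc, accuracy

# Example: 90 targets accepted, 40 outliers rejected, 10 outliers wrongly accepted,
# 5 targets wrongly rejected.
print(mcc_and_accuracy(90, 40, 10, 5))
```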

Under a one-vs-rest approach, we adopt a repeated stratified cross-validation strategy. Indeed, stratification and repetition in cross-validation procedures help reduce the variability of the performance measures, averaged over the iterations [Witten2005]. The tests are based on data including continuous attributes, extracted from the UCI repository [UCIRepository]. We compare our results with the recent ones of [Desir2013], which proposes a learning methodology for One-Class Random Forests (OCRF). Moreover, the work of [Desir2013] provides a comparison with reference OCC methods, which allows us to extend our assessment scope to the performances of the OCSVM, KDE, Gaussian Mixture Model (GMM) and Gaussian estimator. To ensure a fair comparison, we assessed our OC-Tree under the same conditions as in [Desir2013], i.e. a stratified 10-fold CV strategy, repeated five times.
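A possible implementation of this protocol with scikit-learn utilities is sketched below; the fit and predict callables stand for any one-class learner, and the accuracy-only scoring is a simplification (the MCC of Eq. (3) could be computed in the same loop).

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

def evaluate_one_vs_rest(X, y, target_label, fit, predict, n_splits=10, n_repeats=5):
    """One-vs-rest evaluation under stratified 10-fold CV repeated 5 times: the model
    is fit on the target instances of the training folds only, and tested on the
    mixed (target + outlier) test folds. X, y are NumPy arrays."""
    y_bin = (y == target_label)                          # True = target, False = outlier
    cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=0)
    scores = []
    for train_idx, test_idx in cv.split(X, y_bin):
        train_targets = train_idx[y_bin[train_idx]]      # discard outliers from the training fold
        model = fit(X[train_targets])
        pred = predict(model, X[test_idx])               # boolean predictions: target or not
        scores.append(np.mean(pred == y_bin[test_idx]))  # fold accuracy
    return float(np.mean(scores))
```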

4 Results & Discussion

Preliminary remark: unless otherwise specified, the results related to our OC-Tree are obtained with the following parameter settings (cf. section 2).

  • ν set so that each target node covers a minimum of 10 instances;

  • maxit at its default value.

4.1 Comparison of one-class and multi-class tasks

We start by comparing the results of the training process on a first synthetic dataset (see Fig. 4):

  • with algorithm C4.5, through the resolution of a multi-class problem, supposing that each Gaussian blob is associated with a distinct class. The associated space division is represented by dashed lines in Fig. 4.

  • with the OC-Tree. In this case, the Gaussian blobs are all representatives of the same single class. The limits of the corresponding hyper-rectangles are represented by continuous lines in Fig. 4.

The same observation can be made for the second synthetic dataset, represented in Fig. 5. The C4.5 decision tree (see Fig. 6) rests on a single division based on one attribute, while the OC-Tree (see Fig. 7) uses both dimensions in order to isolate the sub-spaces covered by the two blobs.

Figure 4: Splitting limits by algorithm C4.5 (dashed line) and our OC-Tree (continuous line)
Figure 5: Splitting limits by algorithm C4.5 (dashed line) and our OC-Tree (continuous line)

Figure 6: C4.5 tree learned on the dataset of Fig. 5 (Class 1, Class 2)

Figure 7: OC-Tree learned on the dataset of Fig. 5 (Target, Outlier)

As expected, multi- and one-class learning processes lead to different predictive models. Indeed, in the context of a multi-class problem, the class representatives are supposed to share the whole domain in which the attributes take their values. Hence, a decision tree learned with an algorithm like C4.5 proposes a decomposition of the whole space into hyper-rectangles. By contrast, aiming at solving a one-class classification problem, we propose a learning process looking for target hyper-rectangles that do not necessarily cover the whole domain in which the attributes take their values, since there may exist outliers to discard.

Figure 8: Evolution of the training accuracy according to the clipping threshold ε

4.2 Parameters influence

The second part of our experimental procedure is dedicated to the study of the influence of the parameters. In particular, recall that the parameter ε conditions the level at which the estimate of the probability density function is clipped in order to get target zones of interest (see Fig. 1). To understand the influence of ε, we ran the learning process on the synthetic datasets, making ε evolve over a range of increasing values. We notice that as ε increases, the training accuracy decreases (see Fig. 8). This is expected, since a high value of ε leads to rejecting a high proportion of training instances as outliers.

The pair of parameters γ and ν allows concentrations of instances which are close along some dimension to be identified more precisely. Indeed, such a proximity may impact the corresponding density estimate through the existence of shallow local minima located above the clipping level ε. In the absence of any revision of the splitting (see section 2.2), such concentrations would be roughly included in a same hyper-rectangle.

We examine the marginal effect of γ while setting ν = 0. Fig. 9 shows the related influence on one of the synthetic datasets. As expected, a zero value of γ leads to the identification of a single hyper-rectangle including both Gaussian blobs. A non-zero γ gives a more pertinent result, identifying each blob separately. For a given value of γ, a non-zero value of ν allows some buffer zones between close concentrations of instances to be isolated. As long as they are significant, i.e. include a number of training instances greater than a proportion ν of the training set, such intermediate zones may be raised for the purpose of a more nuanced interpretation, e.g. to localize regions of uncertainty or transition between two different sub-concepts of the target class. Should these intermediate zones not be significant, their elimination nevertheless reinforces noise rejection in the neighborhood of close sub-concepts of the class. This fact is illustrated in Fig. 10, which shows the result of the training process with a non-zero ν. It turns out that, during training, an aperture was created between two close regions of concentration. As it included too few instances with respect to the training set size, it was dropped with no further processing, which actually emphasizes the individuality of each zone. In particular, the borders of both hyper-rectangles are better defined in comparison with the result shown in Fig. 4, achieved with ν = 0. Let us note the negative impact of high values of ν, not on the training accuracy, but on the quality of the localization of the target groups. Obviously, increasing the value of ν encourages the creation of large and sparse apertures, which extend over nearby consistent concentrations. Fig. 11 illustrates this effect: with a high ν, a third hyper-rectangle appears, since it includes enough instances.

Figure 9: Synthetic dataset – Effect of γ
Figure 10: Synthetic dataset – Effect of ν
Figure 11: Synthetic dataset – Effect of a higher value of ν
Figure 12: Combined effects of γ and ν

The marginal effects of parameters γ and ν combine into a global effect. The interaction of these parameters is illustrated in Fig. 12. The tests are run with non-zero values of γ and a zero value of ν. Depending on the value of γ, the results appear qualitatively different. In particular, a value of γ close to one involves a systematic revision of the subdivisions, which may cause the emergence of additional small hyper-rectangles and a less precise localization of the instance concentrations (see Fig. 12). Conversely, a low value of γ may lead to no revision of the subdivisions at all, because the related constraint to be satisfied for splitting is severe. A good compromise is achieved with an intermediate value of γ (see Fig. 12), through a nearly perfect detection of the blobs. That being said, a non-zero value of ν allows intermediate apertures to be created, which mitigates the problem of unstructured divisions observed with high values of γ, as illustrated in Fig. 12. The result with a non-zero ν indeed gives a more structured division in comparison with the one obtained with the same high value of γ and ν = 0.

In view of the foregoing, parameter γ conditions the granularity of the subdivisions. A value of γ equal to 0.5 appears reasonable in this regard. As far as parameter ν is concerned, it allows noisy training sets to be dealt with, by reinforcing the rejection of small groupings of outliers. It may also be used to highlight intermediate regions between close significant groupings, for the sake of a deeper interpretation of the data distribution in the space. One can thus set ν = 0 by default and adapt the value if necessary.

Dataset Class | Accuracy (%): OC-Tree, OCRF [Desir2013] | MCC: OC-Tree, OCRF [Desir2013]
Diabetes Positive 50.3 46.4 0.131 0.139
Negative 69.1 68.7 0.323 0.241
Ionosphere Bad 77.6 56.7 0.511 0.169
Good 62.8 83.3 0.439 0.683
Glass Build wind float 69.7 66.2 0.324 0.403
Build wind non-float 61.0 56.5 0.179 0.229
Vehic wind float 66.4 69.0 0.135 0.064
Containers 81.8 90.0 0.433 0.498
Headlamps 90.2 95.0 0.401 0.813
Iris Versicolor 87.2 81.5 0.748 0.579
Virginica 92.0 82.7 0.835 0.614
Setosa 87.5 87.1 0.772 0.722
Sonar Mines 45.5 53.3 -0.087 0.048
Rocks 59.0 59.0 0.193 0.179
Pendigits 0 97.2 99.6 0.842 0.976
1 83.3 85.8 0.550 0.585
2 96.2 96.3 0.791 0.835
3 97.2 98.5 0.832 0.918
4 96.9 99.3 0.822 0.961
5 94.7 94.1 0.764 0.756
6 97.4 99.7 0.845 0.985
7 93.7 97.6 0.689 0.887
8 94.3 89.3 0.697 0.634
9 86.1 85.9 0.454 0.577
Mfeat factors 0 63.5 97.2 0.326 0.844
1 60.5 97.8 0.247 0.873
2 90.5 97.9 0.590 0.879
3 76.7 98.0 0.332 0.887
4 95.3 98.0 0.739 0.884
5 51.4 97.3 0.223 0.843
6 68.3 98.5 0.353 0.910
7 88.4 97.9 0.614 0.879
8 59.8 90.6 0.282 0.613
9 81.6 97.6 0.486 0.866
Mfeat morphology 0 94.3 91.6 0.738 0.698
1 76.6 56.5 0.429 0.304
2 70.7 54.0 0.381 0.291
3 62.4 63.5 0.334 0.335
4 79.6 56.8 0.471 0.294
5 74.3 67.4 0.427 0.378
6 77.6 88.7 0.438 0.637
7 84.4 70.0 0.541 0.398
8 90.8 98.9 0.647 0.943
9 75.9 76.7 0.424 0.456
Table 1: Comparison with OCRF on benchmark data

4.3 Performance comparison

Interestingly, our OC-Tree compares favorably with another tree-based method, the One-Class Random Forest (OCRF) [Desir2013]. The performances of both classifiers on different benchmark datasets are reported in Table 1, in terms of averaged accuracy and MCC. Note that the high-dimensional datasets pendigits and multiple features (mfeat) relate to the recognition of numerals, from 0 to 9. We selected from mfeat the subsets related to profile correlations (mfeat-fac) and morphological features (mfeat-morph). Indeed, in terms of MCC, OCRF dominates the other reference methods on the mfeat-fac set, while it performs less well on mfeat-morph.

We can notice that the OC-Tree stands out advantageously on some datasets, which is generally reflected by a joint increase in accuracy and MCC. There are also some improved accuracy rates associated with positive correlations. In particular, the OC-Tree tackles the problem of numeral recognition well, as regards the pendigits and mfeat-morph datasets. By contrast, on the mfeat-fac dataset, our proposal appears less pertinent. Actually, the related training instances overlap in the space in a rather confusing distribution that ensemble techniques like OCRF may naturally better address.

In Table 2, we extend the comparison of our OC-Tree, in terms of MCC, to non-greedy methods which treat all the training attributes at once. These methods are the One-Class Support Vector Machine (OCSVM), the Gaussian estimator (Gauss), the Kernel Density Estimator (KDE) and the Gaussian Mixture Model (GMM); the related results are extracted from [Desir2013]. These classification techniques, except for Gauss, fail to tackle a certain number of problems, which is noticeable through the numerous MCC values close or equal to 0. In that regard, greedy methods like the OC-Tree and OCRF present a clear advantage.

Recall that the OC-Tree is based on successive one-dimensional kernel density estimations, followed by the detection of target intervals of interest, within a recursive process. In a way, the OC-Tree may be perceived as an adaptation of the KDE, making way for a certain transparency in data processing and selecting only the most meaningful attributes in the sense of the purity criterion. We thus focus on the comparison of the OC-Tree with the (multi-dimensional) KDE. The MCC values achieved with the KDE and improved upon by the OC-Tree are marked with an asterisk in Table 2. It appears that the OC-Tree stands out in 80% of the cases, and is able to deal with high-dimensional data like pendigits and mfeat-factors, where the multi-dimensional KDE fails completely.

Dataset Class | MCC: OC-Tree, OCRF [Desir2013], OCSVM [Desir2013], Gauss [Desir2013], KDE [Desir2013], GMM [Desir2013]
Diabetes Positive 0.131 0.139 0 0.147 0.188 0.219
Negative 0.323 0.241 0 -0.046 0.064* 0.020
Ionosphere Bad 0.511 0.169 -0.348 -0.410 0.106* -0.346
Good 0.439 0.683 0.785 0.781 0.180* 0.584
Glass Build wind float 0.324 0.403 0.896 0.465 0.484 0.509
Build wind non-float 0.179 0.229 0.880 0.212 0.322 0.365
Vehic wind float 0.135 0.064 0.908 0.179 0.145 0.091
Containers 0.433 0.498 0.465 0.964 0.307* 0.823
Headlamps 0.401 0.813 0.703 0.308 0.877 0.749
Iris Versicolor 0.748 0.579 0.897 0.903 0.685* 0.607
Virginica 0.835 0.614 0.900 0.813 0.716* 0.604
Setosa 0.772 0.722 0.903 0.921 0.799 0.643
Sonar Mines -0.087 0.048 0.882 0.342 0* 0.222
Rocks 0.193 0.179 0.889 0.120 0* 0.274
Pendigits 0 0.842 0.976 0 0.970 0.100* 0.961
1 0.550 0.585 0 0.652 0.212* 0.835
2 0.791 0.835 0 0.957 0* 0.956
3 0.832 0.918 0 0.969 0.092* 0.949
4 0.822 0.961 0 0.969 0* 0.953
5 0.764 0.756 0 0.880 0.092* 0.942
6 0.845 0.985 0 0.970 0* 0.954
7 0.689 0.887 0 0.887 0* 0.937
8 0.697 0.634 0 0.716 0* 0.951
9 0.454 0.577 0 0.577 0.093* 0.936
Mfeat factors 0 0.326 0.844 0 0.737 0* 0
1 0.247 0.873 0 0.712 0* 0
2 0.590 0.879 0 0.740 0* 0
3 0.332 0.887 0.017 0.695 0* 0
4 0.739 0.884 0 0.743 0* 0
5 0.223 0.843 0.013 0.738 0* 0
6 0.353 0.910 0.068 0.770 0* 0
7 0.614 0.879 0.017 0.841 0* 0
8 0.282 0.613 0 0.647 0* 0
9 0.486 0.866 0.026 0.751 0* 0
Mfeat morphology 0 0.738 0.698 0.136 0.682 0.765 0.764
1 0.429 0.304 0 0.345 0.375* 0.395
2 0.381 0.291 0 0.400 0.457 0.407
3 0.334 0.335 0.030 0.326 0.298* 0.328
4 0.471 0.294 0 0.432 0.443* 0.430
5 0.427 0.378 0.013 0.468 0.388* 0.468
6 0.438 0.637 0.057 0.397 0.398* 0.416
7 0.541 0.398 0.026 0.524 0.505* 0.540
8 0.647 0.943 0.013 0.682 0.666 0.645
9 0.424 0.456 0.013 0.389 0.395* 0.398
Table 2: Comparison with reference methods on benchmark data

5 Conclusion & Future work

The reality of poor data availability, notably in medical and industrial applications, has led to the search for alternatives to traditional supervised techniques. The practice of one-class classification has been proposed with this consideration in mind. This recent discipline of machine learning has generated considerable interest, with the development of new classification techniques, some of which were adapted from supervised classification.

In this work, we proposed a one-class decision tree by completely rethinking the splitting mechanism used to build such models. Our One-Class Tree (OC-Tree) may actually be seen as an adaptation of the KDE for the sake of readability and interpretability, based on a subset of attributes that are significant for prediction. In that respect, our method has proved successful in comparison with the multi-dimensional KDE, as well as with the One-Class Random Forest (OCRF).

This work leaves some interesting perspectives. In particular, our proposal deals with continuous attributes; it would thus be judicious to consider the treatment of nominal and ordinal variables. Theoretically, our work supports such variables by means of an adaptation of the density estimation technique: the use of discrete histograms is an avenue worth exploring in that regard. Furthermore, the parametrization of the KDE remains an open question as regards the computation of the bandwidth h and the use of other kernels K. On the one hand, our proposal is based on a Gaussian kernel, attractive for its mathematical properties, but the pertinence of other configurations may be studied on a comparative basis. On the other hand, h, deduced from Silverman's rule of thumb, is quite sensitive to the training set content. In our proposal, this sensitivity is controlled by a pre-pruning mechanism. In the future, we would like to rise to the challenge of establishing a rule able to address this issue of sensitivity.

6 Acknowledgments

This work is funded by the Fonds de la Recherche Scientifique - FNRS (F.R.S.- FNRS), Brussels (Belgium).

References