Graph-based Selective Outlier Ensembles

04/17/2018 ∙ by Hamed Sarvari, et al. ∙ George Mason University ∙ Sapienza University of Rome

An ensemble technique is characterized by the mechanism that generates the components and by the mechanism that combines them. A common way to achieve the consensus is to enable each component to equally participate in the aggregation process. A problem with this approach is that poor components are likely to negatively affect the quality of the consensus result. To address this issue, alternatives have been explored in the literature to build selective classifier and cluster ensembles, where only a subset of the components contributes to the computation of the consensus. Of the family of ensemble methods, outlier ensembles are the least studied. Only recently, the selection problem for outlier ensembles has been discussed. In this work we define a new graph-based class of ranking selection methods. A method in this class is characterized by two main steps: (1) Mapping the rankings onto a graph structure; and (2) Mining the resulting graph to identify a subset of rankings. We define a specific instance of the graph-based ranking selection class. Specifically, we map the problem of selecting ensemble components onto a mining problem in a graph. An extensive evaluation was conducted on a variety of heterogeneous data and methods. Our empirical results show that our approach outperforms state-of-the-art selective outlier ensemble techniques.




1 Introduction

An ensemble technique is characterized by the mechanism that generates the components and by the mechanism that combines them. A common way to achieve the consensus is to enable each component to equally participate in the process. A problem with this approach is that poor components are likely to negatively affect the quality of the consensus result. To address this issue, alternatives have been explored in the literature to build selective classifier and cluster ensembles, where only a subset of the components contributes to the computation of the consensus. Typically, in classifier ensembles the selection is driven by the trade-off between accuracy and diversity [1]. Boosting, perhaps the most well-known example, achieves the consensus by weighting the components based on their accuracy. In an unsupervised scenario, such as clustering and outlier detection, defining the selection mechanism is more challenging due to the lack of ground truth. Quality and diversity have been used as measures to drive the selection of components for cluster ensembles.


Of the family of ensemble methods, outlier ensembles are the least studied [3, 4, 5, 6, 7, 8, 9, 10, 11]. In particular, only recently the selection problem for outlier ensembles has been discussed [4, 5], and its potential positive effect on event detection has been shown [5]. In this work, we further explore the selection issue for outlier ensembles, and define a new graph-based class of ranking selection methods, of which we detail specific instances.

To better understand the nature of the problem we want to tackle, let’s consider Figure 1. Plots (a)-(f) show six ranking components generated from the WDBC data using the LOF algorithm under different conditions (see Section 5 for details). Each row corresponds to a ranking. The horizontal axis captures the data points (in a fixed order across all six rows), and the vertical axis measures the LOF scores assigned to each point. The 10 leftmost points are the actual outliers, and the red vertical bars highlight the top-10 LOF score values in the respective rankings. The four rankings (a)-(d) identify many of the outliers among the top-10 ranked points, while rankings (e) and (f) have at most one outlier among the top-10 ranked points. Figure 1(g) shows the area under the PR curve (AUCPR) for an ensemble of 20 rankings, of which six are the ones illustrated. Rankings (a)-(d) correspond to the most accurate ones, and (e)-(f) are the two least accurate. As also observed in [5], the best rankings tend to agree on the high-scored points, but the actual scores change. As a consequence, their aggregation may produce an improved ranking. On the other hand, rankings (e) and (f) largely rank non-outliers as the highest, and including them in the aggregation process may affect the consensus ranking negatively. We aim at identifying such poor rankings and removing them from further consideration. Our technique, called Core (described in Section 4), was able to select the five top rankings from the ensemble of 20 components given in Figure 1. The consensus ranking achieved by averaging the five selected components gives an AUCPR of 0.8, while the consensus ranking achieved by averaging all 20 components gives an AUCPR of 0.2. This result is indicative of the great potential of our graph-based approach to selective outlier ensembles.

The paper is organized as follows. Section 2 discusses related work. We introduce our framework and methodology in Sections 3 and 4, and in Section 5 we present our experiments, results, and analysis. Section 7 concludes the paper with thoughts for future work.

Figure 1: WDBC data set. (a)-(f): Outlier scores generated using the LOF algorithm; (g): Area under the PR curve for 20 components.

2 Related Work

Ensemble methods have been exploited in the literature to boost the performance of classifiers, e.g. [12, 13, 14, 15, 16, 17], and more recently clustering ensembles have emerged as a technique for overcoming problems with clustering algorithms, e.g. [18, 19, 20]. It is well known that off-the-shelf clustering methods may discover different patterns in a given set of data. This is because each clustering algorithm has its own bias resulting from the optimization of different criteria. Furthermore, there is no ground truth against which the clustering result can be validated. Thus, no cross-validation technique can be carried out to tune input parameters involved in the clustering process. Clustering ensembles offer a solution to challenges inherent to clustering arising from its ill-posed nature: they can provide more robust and stable solutions by making use of the consensus across multiple clustering results, while averaging out emergent spurious structures that arise due to the various biases to which each participating algorithm is tuned, or to the variance induced by different data samples.

Anomaly (or outlier) detection is another unsupervised problem that suffers from many of the same challenges as clustering. As such, many different anomaly detection techniques (e.g., density-based and distance-based, global vs. local), and multiple variations of each, have been studied [21, 22, 23, 24, 25, 26, 27, 28, 29, 30]. A comprehensive survey of these methods can be found in [31]. Invariably, all anomaly detection algorithms involve parameters that are problematic to set. Ensemble techniques can provide a framework to address these issues for anomaly detection algorithms, in a way similar to clustering ensembles. Nevertheless, the discussion on outlier ensembles has started only recently [3, 4, 5, 6, 7, 8, 9, 10, 11], and the avenue remains largely unexplored.

In this paper we focus on the problem of component selection for outlier ensembles. As discussed above with the example shown in Figure 1, poor components can negatively affect the consensus ranking. Only recently has the selection problem for outlier ensembles been discussed [4, 5], and its potential positive effect on event detection has been shown [5]. The selection models presented in [4] (DivE) and in [5] (SelectV) are both greedy selective strategies based upon a target ranking treated as pseudo ground-truth. Rankings are selected in sequence if they increase the weighted Pearson correlation of the current ensemble prediction with the target vector. SelectH is also based on a pseudo ground-truth, for anomalies this time. Components that do not rank the estimated anomalies sufficiently high are candidates to be discarded. To compute the pseudo ground-truth, a mixture modeling approach is used to convert each component ranking into a binary vector. A majority vote applied to these binary lists identifies the anomalies.

In this work, we take a different approach. We tackle the problem of selecting outlier ensemble components by mapping it onto a graph mining problem, which does not use the concept of pseudo ground-truth. The main idea is to capture high quality rankings as nodes forming patterns in a graph. We advocate that such transformation can lead to a family of new effective approaches for the selective outlier ensembles problem. The presentation of our framework follows.

3 Selective Outlier Ensembles Framework

Let the input be a collection of data points. From the data, a collection of outlier score rankings is generated. Each ranking is a sequence of outlier scores, one for each data point, sorted in non-decreasing order. The collection of rankings constitutes the ensemble.

In Algorithm 1, the Selective Outlier Ensembles framework (Soul) is presented. The Soul framework takes as input the collection of rankings (however generated), and applies a two-phase algorithm. The first phase is Ranking Selection, which allows plugging in any selection function that specifies which rankings to retain. The second phase, Ranking Aggregation, enables the use of a variety of aggregators to compute a consensus ranking. The Soul framework does not require access to the original features of the data, and is transparent to the process that generates the ensemble. Soul can therefore be used with any outlier detection algorithm that produces a ranking, and any combination thereof.

The two steps, Ranking Selection and Ranking Aggregation, can also be merged to produce the consensus ranking. In this work, though, we focus on the design of an effective Ranking Selection function of the components. As such, we make a distinction between the two phases, and apply only commonly used consensus functions, i.e. Maximum, Average, and Minimum [26, 27, 32], for Ranking Aggregation.

Algorithm 1 The Soul Framework
Input: A collection of rankings.
Output: A consensus ranking.
1: Ranking Selection;
2: Ranking Aggregation;
3: return the consensus ranking;

Algorithm 2 Core Ranking Selection
Input: A collection of rankings.
Output: A subset of rankings.
1: Construct the complete weighted graph;
2: Derive the pruned graph;
3: Compute the $k$-core subgraph of the pruned graph with the largest $k$;
4: return Vertices (rankings) with largest coreness values;

Algorithm 3 Cull Ranking Selection
Input: A collection of rankings.
Output: A subset of rankings.
1: Construct the complete weighted graph;
2: Compute weighted degrees of each node;
3: Discard nodes with lowest weighted degree;
4: return Remaining vertices (rankings);
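
The two phases of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes each ranking is stored as a list of scores indexed by data point (so that scores of the same point line up across rankings), and the names `soul`, `aggregate`, `select_all`, and the `select` argument are hypothetical.

```python
from statistics import mean

# Consensus functions named in the text; each maps the scores a point
# received from the retained rankings to a single consensus score.
AGGREGATORS = {"max": max, "avg": mean, "min": min}


def aggregate(rankings, how="avg"):
    """Combine score vectors point by point (assumes a shared point order)."""
    combine = AGGREGATORS[how]
    return [combine(scores) for scores in zip(*rankings)]


def soul(rankings, select, how="avg"):
    """Ranking Selection followed by Ranking Aggregation.

    `select` is any function returning the indices of the rankings to keep.
    """
    kept = select(rankings)
    return aggregate([rankings[i] for i in kept], how)


def select_all(rankings):
    """The All baseline: keep every component."""
    return list(range(len(rankings)))
```

Any selection function with the same signature can be passed in place of `select_all`, which is what keeps the selection and aggregation phases independent.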

4 Ranking Selection

Figure 2: (a) Core ranking selection: Graph obtained for the 20 rankings of Figure 1(g). (b) Weighted Kendall tau similarity values between the six rankings given in Figure 1.

We define a graph-based class of ranking selection methods. A method in this class is characterized by two major steps:

  1. Mapping the rankings onto a graph structure

  2. Mining the resulting graph to identify a subset of rankings.

Several approaches in the literature formulate the ensemble consensus function as a graph partitioning problem [18, 33, 34]. Our aim here is different, since we are not concerned with the aggregation step. Instead, we organize the components (rankings) in a graph, with the goal of removing poor rankings from further consideration. The challenge is to relate the quality of the rankings with the structure of the graph, in absence of supervision.

We define two instances of the graph-based ranking selection class, called Core and Cull. In the Core approach, we map the problem of selecting ensemble components onto a community detection problem in a graph. To this end, given the ranking ensemble, we construct a complete weighted graph $G = (V, E)$, where the vertices in $V$ correspond to the rankings. In connecting the vertices with one another, we want the good rankings to form strongly connected components, so that they can emerge as a dense community. Looking at the rankings in Figure 1, we observe that pairs of good rankings are similar in their top-ranked objects, while a good and a poor ranking will be dissimilar in the way they rank objects at the top. As such, a similarity measure that emphasizes the top-ranked points will in general consider two good rankings as more similar than a good and a poor ranking. The weighted Kendall tau correlation measure [35] is a measure of similarity that satisfies this property. Let’s consider, as an example, Figure 2(b), which shows the weighted Kendall tau similarity values between all pairs of rankings given in Figure 1. We observe high values for all pairs among rankings 1 through 4 (the good rankings), and significantly smaller values between any of rankings 1 through 4 and rankings 19 and 20 (the poor rankings). This suggests the following definition of the edge weights.

For every two distinct rankings $r_i$ and $r_j$, the weight of the edge connecting them is $w_{ij} = \tau_w(r_i, r_j)$, where $\tau_w$ is the weighted Kendall tau measure. In order to enable the strongly connected components to emerge, we then prune the edges of the graph by retaining only the edges corresponding to the $m$ largest values, where $m$ is equal to the number of vertices. We observe that a connected graph with $m$ vertices has at least $m - 1$ edges. In pruning the edges we wanted to enable graph connectivity, thus the choice of $m$ as threshold on the number of edges. Figure 2(a) shows the graph obtained for the ensemble of 20 rankings of Figure 1. Notably, the five best components form a 5-clique in this case, which is also the $k$-core subgraph of the resulting graph with the largest $k$ ($k = 4$). A $k$-core subgraph is the maximal connected subgraph of a graph in which all vertices have degree at least $k$. Nodes 19 and 20 (the poorest rankings) end up being isolated nodes.
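
As a rough illustration of how such edge weights could be computed, the sketch below implements a top-weighted Kendall tau between two score vectors. It is a simplified quadratic-time variant assumed here for readability: it uses additive hyperbolic weights on rank positions and averages the two directions to obtain a symmetric value, so it may differ in details from the measure and the algorithm of [35]. All function names are hypothetical.

```python
import math


def _sign(v):
    return (v > 0) - (v < 0)


def _tau_one_way(x, y):
    """Weighted tau with weights taken from the rank positions induced by x."""
    n = len(x)
    # Position 0 goes to the highest score, so top-ranked points weigh most.
    order = sorted(range(n), key=lambda i: -x[i])
    pos = [0] * n
    for p, i in enumerate(order):
        pos[i] = p
    num = norm_x = norm_y = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            w = 1.0 / (pos[i] + 1) + 1.0 / (pos[j] + 1)
            sx, sy = _sign(x[i] - x[j]), _sign(y[i] - y[j])
            num += w * sx * sy
            norm_x += w * sx * sx
            norm_y += w * sy * sy
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant vector carries no ordering information
    return num / math.sqrt(norm_x * norm_y)


def weighted_tau(x, y):
    """Symmetric top-weighted Kendall tau in [-1, 1]."""
    return 0.5 * (_tau_one_way(x, y) + _tau_one_way(y, x))


def similarity_matrix(rankings):
    """Pairwise similarities; the diagonal is left at 0 (no self-loops)."""
    m = len(rankings)
    sim = [[0.0] * m for _ in range(m)]
    for a in range(m):
        for b in range(a + 1, m):
            sim[a][b] = sim[b][a] = weighted_tau(rankings[a], rankings[b])
    return sim
```

With these weights, a disagreement between the two highest-scored points lowers the similarity more than a disagreement between the two lowest-scored ones, which is the property the text relies on.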

We also observe that the three poor rankings 15, 16, and 17 in Figure 2(a) form a 3-clique, revealing pair-wise correlations above the threshold. Under the assumption that the ranking components are affected by diverse errors, cliques of “poor” rankings will stay small, and the $k$-core subgraph with the largest $k$ can identify the subset of good-quality rankings to provide as input to the aggregation function. We call this ranking selection algorithm Core, and summarize its steps in Algorithm 2.
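
The steps of Core can be sketched as follows, starting from a precomputed symmetric similarity matrix. This is an illustrative sketch, not the released code: the function names are hypothetical, ties among edge weights are broken arbitrarily by the sort, and the connectivity requirement of the $k$-core is not enforced (the function simply returns all vertices with the largest coreness).

```python
def core_numbers(adj):
    """Coreness of every vertex via repeated removal of a minimum-degree vertex."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    coreness, k = {}, 0
    while remaining:
        v = min(remaining, key=lambda u: (degree[u], u))
        k = max(k, degree[v])
        coreness[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return coreness


def core_select(sim):
    """Indices of the rankings with the largest coreness in the pruned graph.

    `sim` is a symmetric similarity matrix (e.g. weighted Kendall tau values).
    """
    m = len(sim)
    edges = [(sim[a][b], a, b) for a in range(m) for b in range(a + 1, m)]
    # Keep only the m heaviest edges, m being the number of vertices.
    edges.sort(reverse=True)
    adj = {v: set() for v in range(m)}
    for _, a, b in edges[:m]:
        adj[a].add(b)
        adj[b].add(a)
    coreness = core_numbers(adj)
    best = max(coreness.values())
    return sorted(v for v, c in coreness.items() if c == best)
```

For example, with six rankings of which four are mutually similar, the six retained edges form a 4-clique and `core_select` returns exactly those four vertices.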

The Cull ranking selection technique takes a different approach to pruning the complete weighted graph. The Core technique typically retains a minority of the ensemble components (25% on average in our experiments). In an effort to select a larger number of (good) components, rather than keeping the components in the largest $k$-core, we discard the ones deemed poor, and keep the rest. To estimate the poor components we proceed as follows. For each vertex in the graph, we compute its weighted degree, i.e. the sum of the weights of its incident edges. Under the assumption that rankings make diverse errors, we expect poor components to have small weights associated with their incident edges, and therefore a low weighted degree. For example, in the adjacency matrix of Figure 2(b), the poor components (19 and 20) have the lowest weighted degrees. In our experiments we discard the 20% of the vertices with the lowest weighted degrees. We call the resulting algorithm Cull. Cull strives to preserve a larger pool of components than Core. A summary of the steps is given in Algorithm 3.
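
A minimal sketch of Cull, again starting from a symmetric similarity matrix, is given below. The function name and the `discard_fraction` parameter are hypothetical, and rounding the number of discarded vertices down is an assumption made for this example.

```python
def cull_select(sim, discard_fraction=0.2):
    """Indices of the rankings kept after dropping the lowest weighted degrees.

    `sim` is a symmetric similarity matrix over the ensemble components.
    """
    m = len(sim)
    # Weighted degree: sum of the weights of the edges incident to a vertex.
    weighted_degree = [
        sum(sim[v][u] for u in range(m) if u != v) for v in range(m)
    ]
    n_discard = int(discard_fraction * m)  # assumption: round down
    by_degree = sorted(range(m), key=lambda v: weighted_degree[v])
    return sorted(by_degree[n_discard:])
```

Unlike Core, this selection never needs the pruned graph: it works directly on the complete weighted graph and always keeps a fixed share of the components.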

Hierarchical versions of both Core and Cull can also be adopted. One can run Core on independent ensembles, then aggregate all the selected components in a new ensemble, and run Core again on it. We can proceed similarly for Cull. If poor components are sifted out at each level, improvements upon the Core (Cull) technique are expected. The depth of the hierarchy can be extended beyond two as well. In our experiments, we test the two-level hierarchy, and refer to the resulting techniques as hierarchical Core and hierarchical Cull.
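
The two-level scheme can be written generically over any selection function. The sketch below is illustrative only; `pool_selected`, `hierarchical_select`, and the `select` argument (a function mapping a list of rankings to the indices to keep) are hypothetical names.

```python
def pool_selected(ensembles, select):
    """First level: the union of the components kept in each ensemble."""
    pooled = []
    for rankings in ensembles:
        pooled.extend(rankings[i] for i in select(rankings))
    return pooled


def hierarchical_select(ensembles, select):
    """Two-level selection: select within each ensemble, pool, select again."""
    pooled = pool_selected(ensembles, select)
    return [pooled[i] for i in select(pooled)]
```

Deeper hierarchies would follow the same pattern, by grouping the pooled outputs into new ensembles before the next round of selection.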

5 Experiments

5.1 Datasets

To evaluate outlier methods, data for classification is typically used and adapted to the task of anomaly detection. The majority class, or a combination of different large classes, is considered as the inliers. The rest of the data, mostly downsampled, plays the role of outliers. For our experiments, we used datasets from two publicly available repositories. In particular, Lymphography, Shuttle, SpamBase, Waveform, WDBC, Wilt, and WPBC were generated as described in [36]; Ecoli4, Pima, Segment0, Yeast2v4, and Yeast05679v4 were generated as described in [37, 38]. SatImage is taken from the UCI repository [39]: the majority class is used as inliers, and a portion of the rest of the data is subsampled to derive the outliers. A summary of the datasets is available in Table 1.

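| Dataset | Instances | Attributes | Outliers % |
|---|---|---|---|
| Ecoli4 | 314 | 8 | 6 |
| Glass | 214 | 7 | 4 |
| Lymphography | 148 | 47 | 4 |
| PageBlocks | 5,473 | 10 | 10 |
| Pima | 510 | 8 | 2 |
| SatImage | 1,072 | 37 | 3 |
| Segment0 | 2,308 | 20 | 14 |
| Shuttle | 1,013 | 9 | 1 |
| SpamBase | 2,528 | 59 | 2 |
| Stamps | 340 | 9 | 9 |
| Waveform | 3,433 | 21 | 3 |
| WDBC | 357 | 32 | 3 |
| Wilt | 4,839 | 5 | 5 |
| WPBC | 198 | 35 | 23 |
| Yeast05679v4 | 528 | 8 | 10 |
| Yeast2v4 | 514 | 8 | 10 |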

Table 1: Characteristics of the datasets for outlier detection used in the experiments.

5.2 Ensemble construction

To evaluate ranking selection algorithms, we first need to construct an ensemble of outlier rankings. We construct homogeneous ensembles, where all components are generated using the LOF algorithm [26] as the base detector. In order to generate diverse components, we perform subsampling. It has been shown that subsampling can create diverse outlier components, and under specific conditions can improve the overall performance, compared to using the entire data [8]. To encourage diversity, for each component, we randomly select the subsampling rate from the (%) values . We also select the value of the MinPts parameter of LOF in the set . Once the subsample is selected, for each point in the dataset, the nearest neighbors, their distances, and the relative densities required in the LOF algorithm are calculated only with respect to the points in the subsample and using the selected MinPts value. In our experiments, we fix the ensemble size to .
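
The construction above can be sketched in plain Python as follows. This is a simplified illustration, not the code used in the experiments: the subsampling rates and MinPts values are passed in as arguments because the specific values are not reproduced here, distances are Euclidean, duplicate points are handled only by a small numerical guard, and all function names are hypothetical.

```python
import math
import random


def _neighbours(i, data, pool, k):
    """The k points of `pool` closest to data[i], as (distance, index) pairs."""
    found = sorted((math.dist(data[i], data[j]), j) for j in pool if j != i)
    return found[:k]


def lof_scores(data, sample, min_pts):
    """LOF score of every point, with neighbourhoods restricted to `sample`."""
    n = len(data)
    nbrs = [_neighbours(i, data, sample, min_pts) for i in range(n)]
    k_dist = {j: nbrs[j][-1][0] for j in sample}

    def lrd(i):
        # Local reachability density: inverse of the mean reachability distance.
        reach = [max(k_dist[j], d) for d, j in nbrs[i]]
        return len(reach) / max(sum(reach), 1e-12)

    density = [lrd(i) for i in range(n)]
    return [
        sum(density[j] for _, j in nbrs[i]) / (len(nbrs[i]) * density[i])
        for i in range(n)
    ]


def build_ensemble(data, size, rates, min_pts_values, seed=0):
    """One LOF score vector per component, each on its own random subsample."""
    rng = random.Random(seed)
    ensemble = []
    for _ in range(size):
        rate = rng.choice(rates)
        min_pts = rng.choice(min_pts_values)
        # The subsample must hold at least min_pts + 1 points.
        n_sample = max(min_pts + 1, round(rate * len(data)))
        sample = rng.sample(range(len(data)), n_sample)
        ensemble.append(lof_scores(data, sample, min_pts))
    return ensemble
```

Every component scores all points of the dataset, including those left out of its subsample, so the resulting score vectors are aligned and can be compared or aggregated directly.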

5.3 Consensus functions

Various methods exist in the literature to unify outlier scores obtained from different base detectors [40, 41], and to merge different rank lists [42, 43]. However, since in our setting each ensemble component is generated by the LOF algorithm, no unification is needed.

Our methods focus on the design of the selection mechanism of the ensemble components. As such, we use simple consensus functions across all approaches being compared, i.e. the Maximum, Average, and Minimum functions. The use of more sophisticated aggregation functions is out of scope for the current study, and will be considered in the future. The Maximum function assigns to each point the largest score among those received from the various rankings. Average and Minimum work analogously. Maximum and Average are among the most commonly used consensus functions to aggregate rankings [26, 27, 32].

5.4 Methods and Evaluation

We compare our techniques against state-of-the-art selective outlier ensemble methods, namely SelectV and SelectH (we used the code available from the author’s website) [5], and DivE [4] (we used the implementation provided by the authors of [5]). The code of Core and Cull is publicly available. In our experiments, we also include the baseline that selects all the components (called All). Moreover, to assess the effectiveness of the ensemble, we apply the simple LOF algorithm [26] on the whole data. We ran all methods (Core, Cull, SelectV, SelectH, DivE, and All) on multiple independent ensembles, and report the average performance of each method. The LOF baseline was also applied the same number of times on each data set, with a random choice of the MinPts parameter. Performance is measured using the area under the Precision-Recall (PR) curve, namely average precision [44]. This measure was also used by the authors of SelectV and SelectH to assess their methods [5]. We observe that, although the area under the ROC curve is widely used to evaluate outlier detection methods [36], it has been shown that the Precision-Recall curve is more informative than ROC plots when evaluating imbalanced datasets [44].
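
Average precision can be computed directly from a score vector and binary outlier labels. The sketch below is a minimal illustration, under the assumption that ties between scores are broken by the original point order; `average_precision` is a hypothetical helper name and not the evaluation code used in the experiments.

```python
def average_precision(scores, labels):
    """Mean of the precision values at the rank of each true outlier.

    `scores`: outlier scores, higher means more anomalous.
    `labels`: 1 (or True) for outliers, 0 (or False) for inliers.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            total += hits / rank  # precision among the top `rank` points
    return total / hits if hits else 0.0
```

A ranking that places every outlier ahead of every inlier obtains a value of 1, and the value drops as inliers are ranked above outliers.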

To run the hierarchical versions of Core and Cull, we consider batches of 20 ensembles. For each batch, we run Core (Cull) on each ensemble; we then assemble the outputs of the 20 selections in a new ensemble, and run Core (Cull) again on it. The process is repeated for multiple independent batches of 20 ensembles, and average performance is reported for each method. We also run a variant in which, for each batch of 20 ensembles, we just assemble the 20 outputs of Core (Cull) in a new ensemble and directly apply the consensus function. These variants are called Core.U and Cull.U.

For a fair comparison, we also set up runs of SelectV, SelectH, DivE, and All, where we enable the techniques to have access to all the ensembles in each batch. That is, we generate a single ensemble of components from a given batch, and run each competitor method on it. The techniques in this setting are denoted as SelectV.U, SelectH.U, DivE.U, and All.U, respectively.

5.5 Results

Table 2 gives the average performances (areas under the PR curve) of Core, Cull, DivE, SelectV, SelectH, and All across all datasets and for the three consensus functions. For WDBC, WPBC, Pima, Yeast05679v4, Ecoli4, Shuttle, and SpamBase, averages are computed over 400 independent ensembles. For the remaining datasets, averages are computed over 200 independent ensembles.

Table 3 gives the average performances (areas under the PR curve) of hierarchical Core, hierarchical Cull, Core.U, Cull.U, DivE.U, SelectV.U, SelectH.U, and All.U across all datasets and for the three consensus functions. For WDBC, WPBC, Pima, Yeast05679v4, Ecoli4, Shuttle, and SpamBase, averages are computed over 20 batches (of 20 ensembles each). For the remaining datasets, averages are computed over 10 batches.

For both tables, statistical significance is assessed using a one-way ANOVA with a post-hoc Tukey HSD test with a p-value threshold equal to . For each dataset, boldface indicates the technique with the best performance score, and any technique which is not statistically significantly inferior to it. For each dataset, the best performance score is also underlined.

5.6 Analysis

Table 2 shows that, out of the 16 datasets, Core and Cull are ranked among the top performers in 12 datasets; SelectV and SelectH in 6; DivE and All in 9; and LOF in 3. Overall, our selective techniques are superior to the state-of-the-art approaches for selective outlier ensembles (SelectV, SelectH, and DivE), and to All. In particular, Core and Cull give the best performance scores (underlined values) in 10 datasets; DivE in 1; SelectH and SelectV in 2; All in 3; and LOF in 3. Core and Cull win by a large margin.

It’s known that the All technique, especially when combined with average, is a strong baseline and hard to defeat [45]. Our results confirm this fact. In particular, SelectV and SelectH are not competitive against All on the wide range of problems considered in our experiments. We also observe that DivE often selects all the components, and therefore reduces to All. Core and Cull emerge as the strongest competitors against All. Single LOF is among the top performers in only three cases; this supports the overall effectiveness of the constructed ensemble. It’s interesting to observe that in two out of these three cases (Glass and Segment0), LOF is the only top performer, indicating that the constructed ensemble did not work well for these two problems, regardless of the selective or consensus techniques used.

| Dataset | Core avg. | Core max. | Core min. | Cull avg. | Cull max. | Cull min. | DivE avg. | DivE max. | DivE min. | SelectV avg. | SelectV max. | SelectV min. | SelectH avg. | SelectH max. | SelectH min. | All avg. | All max. | All min. | LOF avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ecoli4 | 0.133 | 0.135 | 0.125 | 0.127 | 0.124 | 0.108 | 0.123 | 0.115 | 0.094 | 0.123 | 0.114 | 0.093 | 0.121 | 0.112 | 0.094 | 0.123 | 0.114 | 0.093 | 0.053 |
| Glass | 0.115 | 0.115 | 0.117 | 0.125 | 0.123 | 0.107 | 0.131 | 0.113 | 0.098 | 0.098 | 0.086 | 0.072 | 0.098 | 0.085 | 0.077 | 0.131 | 0.114 | 0.097 | 0.155 |
| Lymphography | 0.311 | 0.294 | 0.321 | 0.585 | 0.379 | 0.651 | 0.567 | 0.368 | 0.627 | 0.291 | 0.266 | 0.32 | 0.322 | 0.288 | 0.331 | 0.348 | 0.3 | 0.365 | 0.475 |
| PageBlocks | 0.413 | 0.392 | 0.366 | 0.428 | 0.383 | 0.313 | 0.428 | 0.375 | 0.272 | 0.426 | 0.373 | 0.272 | 0.423 | 0.373 | 0.276 | 0.428 | 0.375 | 0.272 | 0.252 |
| Pima | 0.028 | 0.028 | 0.027 | 0.027 | 0.028 | 0.026 | 0.027 | 0.027 | 0.024 | 0.026 | 0.026 | 0.024 | 0.026 | 0.025 | 0.024 | 0.027 | 0.027 | 0.024 | 0.02 |
| SatImage | 0.514 | 0.521 | 0.472 | 0.505 | 0.526 | 0.36 | 0.479 | 0.523 | 0.231 | 0.464 | 0.502 | 0.23 | 0.46 | 0.5 | 0.232 | 0.486 | 0.526 | 0.232 | 0.21 |
| Segment0 | 0.104 | 0.105 | 0.108 | 0.103 | 0.106 | 0.111 | 0.103 | 0.108 | 0.113 | 0.104 | 0.108 | 0.113 | 0.104 | 0.109 | 0.113 | 0.103 | 0.108 | 0.113 | 0.118 |
| Shuttle | 0.157 | 0.157 | 0.139 | 0.168 | 0.16 | 0.129 | 0.17 | 0.152 | 0.134 | 0.151 | 0.138 | 0.131 | 0.126 | 0.123 | 0.124 | 0.17 | 0.152 | 0.134 | 0.113 |
| SpamBase | 0.088 | 0.092 | 0.076 | 0.091 | 0.095 | 0.068 | 0.094 | 0.096 | 0.064 | 0.126 | 0.129 | 0.076 | 0.125 | 0.128 | 0.079 | 0.094 | 0.096 | 0.064 | 0.072 |
| Stamps | 0.081 | 0.077 | 0.088 | 0.087 | 0.077 | 0.104 | 0.089 | 0.072 | 0.109 | 0.076 | 0.061 | 0.093 | 0.077 | 0.062 | 0.089 | 0.089 | 0.073 | 0.11 | 0.098 |
| Waveform | 0.115 | 0.13 | 0.098 | 0.105 | 0.123 | 0.081 | 0.099 | 0.115 | 0.071 | 0.105 | 0.117 | 0.082 | 0.101 | 0.117 | 0.073 | 0.099 | 0.115 | 0.071 | 0.062 |
| WDBC | 0.815 | 0.8 | 0.798 | 0.815 | 0.8 | 0.798 | 0.814 | 0.796 | 0.747 | 0.752 | 0.732 | 0.688 | 0.749 | 0.735 | 0.716 | 0.814 | 0.796 | 0.748 | 0.514 |
| Wilt | 0.075 | 0.063 | 0.084 | 0.076 | 0.059 | 0.085 | 0.078 | 0.058 | 0.083 | 0.076 | 0.058 | 0.083 | 0.078 | 0.058 | 0.084 | 0.078 | 0.058 | 0.084 | 0.071 |
| WPBC | 0.226 | 0.226 | 0.224 | 0.225 | 0.225 | 0.222 | 0.224 | 0.225 | 0.219 | 0.224 | 0.224 | 0.222 | 0.223 | 0.224 | 0.219 | 0.224 | 0.225 | 0.219 | 0.21 |
| Yeast05679v4 | 0.132 | 0.131 | 0.132 | 0.133 | 0.132 | 0.134 | 0.134 | 0.133 | 0.136 | 0.133 | 0.132 | 0.136 | 0.133 | 0.132 | 0.135 | 0.134 | 0.133 | 0.136 | 0.137 |
| Yeast2v4 | 0.207 | 0.22 | 0.187 | 0.19 | 0.219 | 0.158 | 0.186 | 0.206 | 0.153 | 0.228 | 0.239 | 0.188 | 0.224 | 0.234 | 0.188 | 0.186 | 0.208 | 0.154 | 0.147 |

Table 2: Average performance (area under the PR curve) for all methods and datasets (no hierarchy).

| Dataset | Hier. Core avg. | Hier. Core max. | Hier. Core min. | Hier. Cull avg. | Hier. Cull max. | Hier. Cull min. | Core.U avg. | Core.U max. | Core.U min. | Cull.U avg. | Cull.U max. | Cull.U min. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ecoli4 | 0.14 | 0.141 | 0.131 | 0.13 | 0.131 | 0.116 | 0.135 | 0.139 | 0.128 | 0.129 | 0.123 | 0.089 |
| Glass | 0.109 | 0.115 | 0.116 | 0.119 | 0.136 | 0.102 | 0.117 | 0.105 | 0.118 | 0.126 | 0.111 | 0.076 |
| Lymphography | 0.28 | 0.269 | 0.289 | 0.527 | 0.282 | 0.692 | 0.31 | 0.27 | 0.336 | 0.621 | 0.293 | 0.69 |
| PageBlocks | 0.416 | 0.391 | 0.399 | 0.437 | 0.356 | 0.303 | 0.436 | 0.368 | 0.323 | 0.439 | 0.354 | 0.261 |
| Pima | 0.03 | 0.032 | 0.03 | 0.027 | 0.029 | 0.025 | 0.028 | 0.03 | 0.026 | 0.028 | 0.028 | 0.022 |
| SatImage | 0.524 | 0.532 | 0.486 | 0.519 | 0.535 | 0.332 | 0.522 | 0.535 | 0.413 | 0.514 | 0.551 | 0.224 |
| Segment0 | 0.103 | 0.104 | 0.108 | 0.102 | 0.103 | 0.11 | 0.102 | 0.102 | 0.11 | 0.102 | 0.104 | 0.114 |
| Shuttle | 0.158 | 0.158 | 0.141 | 0.17 | 0.171 | 0.138 | 0.165 | 0.172 | 0.143 | 0.172 | 0.158 | 0.135 |
| SpamBase | 0.084 | 0.096 | 0.071 | 0.09 | 0.098 | 0.059 | 0.089 | 0.094 | 0.064 | 0.091 | 0.106 | 0.055 |
| Stamps | 0.071 | 0.076 | 0.088 | 0.08 | 0.076 | 0.1 | 0.077 | 0.068 | 0.108 | 0.093 | 0.078 | 0.119 |
| Waveform | 0.108 | 0.121 | 0.096 | 0.11 | 0.146 | 0.079 | 0.114 | 0.159 | 0.087 | 0.105 | 0.13 | 0.067 |
| WDBC | 0.815 | 0.806 | 0.803 | 0.815 | 0.806 | 0.803 | 0.815 | 0.798 | 0.768 | 0.815 | 0.798 | 0.768 |
| Wilt | 0.079 | 0.065 | 0.091 | 0.077 | 0.054 | 0.089 | 0.075 | 0.053 | 0.092 | 0.077 | 0.053 | 0.089 |
| WPBC | 0.23 | 0.232 | 0.228 | 0.226 | 0.224 | 0.224 | 0.227 | 0.227 | 0.225 | 0.225 | 0.228 | 0.224 |
| Yeast05679v4 | 0.131 | 0.131 | 0.132 | 0.132 | 0.133 | 0.131 | 0.132 | 0.133 | 0.132 | 0.133 | 0.136 | 0.13 |
| Yeast2v4 | 0.255 | 0.268 | 0.246 | 0.195 | 0.235 | 0.157 | 0.207 | 0.241 | 0.171 | 0.191 | 0.22 | 0.147 |

| Dataset | DivE.U avg. | DivE.U max. | DivE.U min. | SelectV.U avg. | SelectV.U max. | SelectV.U min. | SelectH.U avg. | SelectH.U max. | SelectH.U min. | All.U avg. | All.U max. | All.U min. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ecoli4 | 0.097 | 0.097 | 0.097 | 0.125 | 0.107 | 0.076 | 0.126 | 0.115 | 0.08 | 0.125 | 0.107 | 0.076 |
| Glass | 0.106 | 0.106 | 0.106 | 0.1 | 0.077 | 0.053 | 0.1 | 0.083 | 0.052 | 0.134 | 0.094 | 0.07 |
| Lymphography | 0.444 | 0.444 | 0.444 | 0.288 | 0.256 | 0.336 | 0.325 | 0.261 | 0.314 | 0.353 | 0.277 | 0.362 |
| PageBlocks | 0.318 | 0.318 | 0.318 | 0.435 | 0.32 | 0.214 | 0.432 | 0.327 | 0.217 | 0.437 | 0.322 | 0.214 |
| Pima | 0.024 | 0.024 | 0.024 | 0.027 | 0.026 | 0.02 | 0.027 | 0.026 | 0.022 | 0.028 | 0.027 | 0.02 |
| SatImage | 0.429 | 0.429 | 0.429 | 0.469 | 0.526 | 0.156 | 0.453 | 0.508 | 0.176 | 0.492 | 0.551 | 0.156 |
| Segment0 | 0.115 | 0.115 | 0.115 | 0.102 | 0.108 | 0.112 | 0.103 | 0.108 | 0.114 | 0.102 | 0.108 | 0.112 |
| Shuttle | 0.118 | 0.118 | 0.118 | 0.157 | 0.127 | 0.135 | 0.124 | 0.123 | 0.122 | 0.173 | 0.135 | 0.143 |
| SpamBase | 0.081 | 0.081 | 0.081 | 0.132 | 0.138 | 0.057 | 0.132 | 0.131 | 0.059 | 0.095 | 0.113 | 0.05 |
| Stamps | 0.079 | 0.079 | 0.079 | 0.085 | 0.055 | 0.114 | 0.082 | 0.057 | 0.099 | 0.099 | 0.066 | 0.137 |
| Waveform | 0.093 | 0.093 | 0.093 | 0.11 | 0.141 | 0.069 | 0.1 | 0.14 | 0.063 | 0.099 | 0.136 | 0.06 |
| WDBC | 0.773 | 0.773 | 0.773 | 0.755 | 0.726 | 0.655 | 0.751 | 0.737 | 0.708 | 0.815 | 0.791 | 0.731 |
| Wilt | 0.072 | 0.065 | 0.077 | 0.078 | 0.051 | 0.089 | 0.079 | 0.052 | 0.09 | 0.079 | 0.051 | 0.09 |
| WPBC | 0.219 | 0.226 | 0.203 | 0.226 | 0.227 | 0.223 | 0.224 | 0.228 | 0.218 | 0.225 | 0.224 | 0.216 |
| Yeast05679v4 | 0.126 | 0.126 | 0.126 | 0.133 | 0.136 | 0.133 | 0.132 | 0.137 | 0.131 | 0.134 | 0.137 | 0.133 |
| Yeast2v4 | 0.186 | 0.186 | 0.186 | 0.233 | 0.221 | 0.161 | 0.232 | 0.234 | 0.166 | 0.187 | 0.197 | 0.137 |

Table 3: Average performance (area under the PR curve) for all methods and datasets (with hierarchy).

The results reveal an interesting fact about Core and Cull: they manifest their best behavior under different scenarios. They are among the best performing methods in 7 and 11 datasets, respectively, of which only 6 are in common. On the other hand, SelectH and SelectV seem highly correlated. They both become competitive in the same 6 datasets. The complementary nature of Core and Cull enables them to succeed in a wide range of problems, since they seem to induce a different learning bias. This opens a new research path to investigate characteristics of the datasets to which each of these methods is tuned, and suggests the potential for a hybrid approach that leverages their diversity.

Table 3 shows that hierarchical Core and hierarchical Cull are ranked among the top performers in 15 datasets; Core.U and Cull.U in 15 as well; DivE.U in 4; SelectV.U and SelectH.U in 12; and All.U in 12. Again, the hierarchical versions of our techniques emerge as the strongest competitors. In particular, hierarchical Core and hierarchical Cull give the best performance scores (underlined values) in 7 datasets; Core.U and Cull.U in 5; DivE.U in 1; SelectH.U and SelectV.U in 2; and All.U in 5.

The two-level pruning mechanisms of hierarchical Core and hierarchical Cull effectively prune poor components among a large pool of rankings. On the other hand, All.U deals with large ensembles, which are likely to contain a fair number of poor components, thus hurting its relative performance against the competitors. Overall, the behavior of SelectV.U and SelectH.U is comparable to All.U, while DivE.U gives the worst performance.

An insightful observation from the results in Table 3 is the strong performance of Core.U. It’s among the best performers in 14 (out of 16) datasets, and its overall performance is superior to . In a way, Core.U achieves the best-of-both-worlds: it first uses Core to discard poor components across different ensembles; then it aggregates all selected rankings, acting like All, but on a ”boosted” pool of components. Core.U is superior to (or tied with) All.U in almost all scenarios (15 out of 16), and therefore a very promising candidate for outlier ensemble selection.

We finally observe that the best performing consensus functions depend on the dataset, and to a lesser extent on the method. A deeper understanding of this behavior is on our agenda for future work.

6 Complexity Analysis

We analyze the theoretical complexity of all the considered methods. For simplicity, we omit the cost of sampling, the cost needed to compute the anomaly scores, and the cost of running the aggregation function. These steps are common to all methods. Let $n$ be the size of each ranking and $m$ the size of the ensemble.

  • All: The cost of selecting all the components is simply equal to the size of the ensemble: .

  • Core and Cull: The cost of the graph construction is , obtained by multiplying the number of edges with the cost of computing the weighted tau, which is as reported in [35]. The rest of the computation is linear in the number of edges, which is . So the total cost is: , which is dominated by the factor .

  • DivE: As reported in [4], the first operation performed by DivE is the Union of the top-k outliers, with a cost of when . The next step consists of sorting the converted rankings using the weighted Pearson correlation, with a cost of . This includes the cost of computing Pearson’s coefficients (), and the cost of sorting the rankings according to the computed coefficients. This sorting step is performed twice before the loop that contains the rankings. As a result, the overall cost of DivE is: , which is dominated by the factor .

  • SelectV: As reported in [5], the first operation performed by SelectV is Unification, which converts scores to probability estimates. Even if we consider the cost of Unification as constant, the total cost of performing Unification for all rankings is . The cost of rank sorting is . The next step consists of sorting the converted rankings using the weighted Pearson correlation, with a cost of , exactly as in DivE. This sorting step is repeated times, and the final running cost of SelectV is: , which is dominated by the factor .

  • SelectH: As reported in [5], the first expensive procedure performed is the computation of MixtureModel. Its cost depends on the number of iterations , which was set to 100 as suggested in [5], on the size of the ensemble, and on the length of the score vectors; the resulting cost is . The second expensive procedure is RobustRankAggregation, which costs . The subsequent loop is dominated by the number of estimated outliers, and in the worst case its cost is . The algorithm concludes with a clustering phase (using k-means [46]), which costs . The final running cost of SelectH is: , which is dominated by the factor .
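The graph construction step counted in the Core and Cull bullet can be illustrated with a minimal sketch that computes one similarity per pair of rankings. It assumes the graph has an edge for every pair of rankings, and it substitutes a naive, unweighted Kendall tau (quadratic in the ranking size) for the weighted tau of [35], so it shows the pairwise structure rather than the actual cost. The function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, Sequence, Tuple


def _sign(value: float) -> int:
    """Return -1, 0, or 1 according to the sign of `value`."""
    return (value > 0) - (value < 0)


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """Naive Kendall tau-a between two score vectors.

    Stand-in for the weighted tau of [35]: unweighted, and quadratic in
    the vector length. Tied pairs contribute 0 to the numerator.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("vectors must have equal length >= 2")
    concordance = 0
    for i, j in combinations(range(n), 2):
        concordance += _sign(x[i] - x[j]) * _sign(y[i] - y[j])
    return concordance / (n * (n - 1) / 2)


def build_similarity_graph(
    rankings: Sequence[Sequence[float]],
) -> Dict[Tuple[int, int], float]:
    """Return one weighted edge per unordered pair of rankings."""
    return {
        (i, j): kendall_tau(rankings[i], rankings[j])
        for i, j in combinations(range(len(rankings)), 2)
    }


# Three made-up rankings over four data points.
rankings = [[1, 2, 3, 4], [1, 2, 3, 4], [4, 3, 2, 1]]
graph = build_similarity_graph(rankings)
print(len(graph), graph)
```

The number of similarity computations equals the number of edges, which is why the per-edge cost of the correlation measure drives the total. In practice a weighted variant such as `scipy.stats.weightedtau` could replace `kendall_tau`.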

The theoretical analysis shows that SelectH is dominated by , our methods Core and Cull by , and DivE and SelectV by . In real-world scenarios, as the amount of data grows large, SelectH may become prohibitively expensive.
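As a worked illustration, the Core and Cull bound can be spelled out under assumed notation. The symbols $n$ (size of each ranking) and $m$ (size of the ensemble) are names chosen here for illustration, and the sketch assumes that the graph has an edge for every pair of rankings; the $O(n \log n)$ cost of the weighted tau is the one reported in [35].

```latex
\begin{align*}
|E| &= \binom{m}{2} = O(m^2)
  && \text{(number of edges, assumed complete graph)} \\
T_{\text{graph}} &= |E| \cdot O(n \log n) = O(m^2 \, n \log n)
  && \text{(one weighted tau per edge)} \\
T_{\text{mining}} &= O(|E|) = O(m^2)
  && \text{(linear in the number of edges)} \\
T_{\text{Core/Cull}} &= O(m^2 \, n \log n + m^2) = O(m^2 \, n \log n)
\end{align*}
```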

Dataset       Core/Cull  DivE    SelectH  SelectV  Core.U/Cull.U  DivE.U   SelectH.U    SelectV.U
Ecoli4        0.1955     0.0407  0.3507   0.0139   3.7343         2.4337   9833.1550    1.8934
Glass         0.1202     0.0310  0.2313   0.0124   5.1370         2.2313   6139.0900    1.6363
Lymphography  0.0918     0.0228  0.2686   0.0107   37.3748        2.1586   1798.3125    0.7797
PageBlocks    3.5285     0.1021  7.7770   0.1160   98.0259        18.2131  134998.5714  11.9919
Pima          0.3028     0.0286  0.9487   0.0205   5.3593         2.8979   14137.0000   2.5817
SatImage      0.7446     0.0323  1.2203   0.0353   7.4759         4.0789   30714.4000   3.4625
Segment0      1.5537     0.0691  3.3306   0.0481   20.7469        6.0357   64577.3000   5.9603
Shuttle       0.6247     0.0346  1.1076   0.0214   9.6138         3.9340   27573.0500   3.0115
SpamBase      1.9090     0.0456  4.8094   0.0448   14.6854        7.1720   71212.8333   6.4726
Stamps        0.1834     0.0265  0.3725   0.0136   1.5201         2.3445   9135.5000    1.8553
Waveform      2.3860     0.0553  6.5074   0.0693   73.8695        9.1355   103188.4444  9.5774
WDBC          0.2162     0.0330  0.3878   0.0138   1.4657         2.5056   10659.8000   1.8951
Wilt          3.3298     0.0648  4.6988   0.0887   29.2481        13.7465  117418.8889  10.7646
WPBC          0.1116     0.0647  1.9563   0.0175   1.7298         2.6575   3197.7000    1.0396
Yeast2v4      0.3031     0.0373  0.5683   0.0152   3.7115         2.6427   14770.1000   2.2619
Yeast05679v4  0.3110     0.0538  0.5941   0.0154   3.8485         2.8501   15323.2500   2.3146
Average       0.2532     0.0473  0.4724   0.0146   3.7914         2.6419   12578.2025   2.1040

Table 4: Running times of experiments in Table 3 expressed in seconds.

We have also computed the empirical running times. For each dataset, the average running time over all runs of each method is recorded in Table 4. Experiments were run on a laptop with an Intel Core i7 processor @ 2.80GHz and 16GB RAM. The empirical running times are consistent with the complexity analysis given above. In particular, we observe that the running times of DivE and SelectV have the same order of magnitude. Core and Cull are faster than SelectH but slower than DivE and SelectV. When the number of components increases, the running times of Core.U, Cull.U, DivE.U, and SelectV.U are almost identical; in contrast, the running time of SelectH.U increases much more rapidly than that of the other methods.
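A measurement of this kind, the average wall-clock time per method over repeated runs, can be sketched as below. This is a generic harness written under assumptions and not the setup used for Table 4: the name `average_runtime`, the number of runs, and the two toy "methods" are all hypothetical.

```python
import time
from statistics import mean
from typing import Callable, Dict, Sequence


def average_runtime(
    methods: Dict[str, Callable[[Sequence[float]], object]],
    data: Sequence[float],
    runs: int = 5,
) -> Dict[str, float]:
    """Return the mean wall-clock time in seconds of each method over `runs` runs."""
    averages: Dict[str, float] = {}
    for name, method in methods.items():
        elapsed = []
        for _ in range(runs):
            start = time.perf_counter()
            method(data)  # the result is discarded; only the time matters
            elapsed.append(time.perf_counter() - start)
        averages[name] = mean(elapsed)
    return averages


# Toy stand-ins for selection methods, timed on made-up data.
data = [float((i * 7919) % 1000) for i in range(10_000)]
timings = average_runtime({"sort": sorted, "sum": sum}, data, runs=3)
for name, seconds in timings.items():
    print(f"{name}: {seconds:.6f} s")
```

`time.perf_counter` is used because it is a monotonic, high-resolution clock intended for measuring short durations.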

7 Conclusion

We have introduced a new graph-based class of ranking selection methods for outlier ensembles. In particular, we have defined two specific approaches, Core and Cull, and hierarchical extensions of the same. Our extensive evaluation on a variety of heterogeneous data and methods shows that our approach outperforms state-of-the-art selective outlier ensemble techniques in a number of cases. Interesting and challenging questions remain open for future investigation, including characterizing the scenarios in which Core outperforms Cull, or vice versa; studying how our selective techniques affect the accuracy/diversity trade-off; and exploring hybrid methods, different outlier detection techniques, and alternative consensus functions, and analyzing their effects in more depth.


  • [1] N. Li and Z.-H. Zhou, “Selective ensemble of classifier chains,” in Proceedings of the International Workshop on Multiple Classifier Systems, pp. 146–156, Springer, 2013.
  • [2] X. Z. Fern and W. Lin, “Cluster ensemble selection,” Stat. Anal. Data Min., vol. 1, pp. 128–141, Nov. 2008.
  • [3] A. Zimek, R. J. Campello, and J. Sander, “Ensembles for unsupervised outlier detection: challenges and research questions a position paper,” ACM SIGKDD Explorations Newsletter, vol. 15, no. 1, pp. 11–22, 2014.
  • [4] E. Schubert, R. Wojdanowski, A. Zimek, and H.-P. Kriegel, “On evaluation of outlier rankings and outlier scores,” in Proceedings of the SIAM International Conference on Data Mining, pp. 1047–1058, SIAM, 2012.
  • [5] S. Rayana and L. Akoglu, “Less is more: Building selective anomaly ensembles with application to event detection in temporal graphs,” in Proceedings of the 2015 SIAM International Conference on Data Mining, pp. 622–630, SIAM, 2015.
  • [6] C. C. Aggarwal and S. Sathe, “Theoretical foundations and algorithms for outlier ensembles,” SIGKDD Explor. Newsl., vol. 17, pp. 24–47, Sept. 2015.
  • [7] C. C. Aggarwal, “Outlier ensembles: Position paper,” SIGKDD Explor. Newsl., vol. 14, pp. 49–58, Apr. 2013.
  • [8] A. Zimek, M. Gaudet, R. J. Campello, and J. Sander, “Subsampling for efficient and effective unsupervised outlier detection ensembles,” in Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 428–436, ACM, 2013.
  • [9] H. V. Nguyen, H. H. Ang, and V. Gopalkrishnan, “Mining outliers with ensemble of heterogeneous detectors on random subspaces,” in Proceedings of the 15th International Conference on Database Systems for Advanced Applications - Volume Part I, DASFAA’10, (Berlin, Heidelberg), pp. 368–383, Springer-Verlag, 2010.
  • [10] B. Micenkova, B. McWilliams, and I. Assent, “Learning representations for outlier detection on a budget,” in CoRR abs/1507.08104, 2015.
  • [11] F. T. Liu, K. M. Ting, and Z.-H. Zhou, “Isolation forest,” in Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, ICDM ’08, (Washington, DC, USA), pp. 413–422, IEEE Computer Society, 2008.
  • [12] T. G. Dietterich et al., “Ensemble methods in machine learning,” Multiple classifier systems, vol. 1857, pp. 1–15, 2000.
  • [13] L. Breiman, “Bagging predictors,” Machine learning, vol. 24, no. 2, pp. 123–140, 1996.
  • [14] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” in European conference on computational learning theory, pp. 23–37, Springer, 1995.
  • [15] L. Breiman, “Random forests,” Machine learning, vol. 45, no. 1, pp. 5–32, 2001.
  • [16] P. Domingos, “Bayesian averaging of classifiers and the overfitting problem,” in ICML, vol. 2000, pp. 223–230, 2000.
  • [17] S. Džeroski and B. Ženko, “Is combining classifiers with stacking better than selecting the best one?,” Machine learning, vol. 54, no. 3, pp. 255–273, 2004.
  • [18] A. Strehl and J. Ghosh, “Cluster ensembles—a knowledge reuse framework for combining multiple partitions,” Journal of machine learning research, vol. 3, no. Dec, pp. 583–617, 2002.
  • [19] S. Bickel and T. Scheffer, “Multi-view clustering,” in ICDM, vol. 4, pp. 19–26, 2004.
  • [20] E. Muller, S. Gunnemann, I. Farber, and T. Seidl, “Discovering multiple clustering solutions: Grouping objects in different views of the data,” in Data Engineering (ICDE), 2012 IEEE 28th International Conference on, pp. 1207–1210, IEEE, 2012.
  • [21] E. M. Knorr and R. T. Ng, “A unified notion of outliers: Properties and computation,” in KDD, pp. 219–222, 1997.
  • [22] S. Ramaswamy, R. Rastogi, and K. Shim, “Efficient algorithms for mining outliers from large data sets,” in ACM SIGMOD Record, vol. 29, pp. 427–438, ACM, 2000.
  • [23] K. Zhang, M. Hutter, and H. Jin, “A new local distance-based outlier detection approach for scattered real-world data,” Advances in knowledge discovery and data mining, pp. 813–822, 2009.
  • [24] S. D. Bay and M. Schwabacher, “Mining distance-based outliers in near linear time with randomization and a simple pruning rule,” in Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 29–38, ACM, 2003.
  • [25] W. Jin, A. K. Tung, and J. Han, “Mining top-n local outliers in large databases,” in Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 293–298, ACM, 2001.
  • [26] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, “LOF: identifying density-based local outliers,” in ACM SIGMOD Record, vol. 29, pp. 93–104, ACM, 2000.
  • [27] S. Papadimitriou, H. Kitagawa, P. B. Gibbons, and C. Faloutsos, “LOCI: Fast outlier detection using the local correlation integral,” in Data Engineering, 2003. Proceedings. 19th International Conference on, pp. 315–326, IEEE, 2003.
  • [28] W. Jin, A. K. Tung, J. Han, and W. Wang, “Ranking outliers using symmetric neighborhood relationship,” in PAKDD, vol. 6, pp. 577–593, Springer, 2006.
  • [29] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek, “LoOP: local outlier probabilities,” in Proceedings of the 18th ACM conference on Information and knowledge management, pp. 1649–1652, ACM, 2009.
  • [30] E. M. Knorr and R. T. Ng, “Algorithms for mining distance-based outliers in large datasets,” in Proceedings of the International Conference on Very Large Data Bases, pp. 392–403, Citeseer, 1998.
  • [31] V. Chandola, A. Banerjee, and V. Kumar, “Outlier detection: A survey,” ACM Computing Surveys, 2007.
  • [32] A. Lazarevic and V. Kumar, “Feature bagging for outlier detection,” in Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pp. 157–166, ACM, 2005.
  • [33] X. Z. Fern and C. E. Brodley, “Solving cluster ensemble problems by bipartite graph partitioning,” in Proceedings of the Twenty-first International Conference on Machine Learning, ICML ’04, (New York, NY, USA), pp. 36–, ACM, 2004.
  • [34] C. Domeniconi and M. Al-Razgan, “Weighted cluster ensembles: Methods and analysis,” ACM Trans. Knowl. Discov. Data, vol. 2, pp. 17:1–17:40, Jan. 2009.
  • [35] S. Vigna, “A weighted correlation index for rankings with ties,” in Proceedings of the 24th international conference on World Wide Web, pp. 1166–1176, ACM, 2015.
  • [36] G. O. Campos, A. Zimek, J. Sander, R. J. Campello, B. Micenková, E. Schubert, I. Assent, and M. E. Houle, “On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study,” Data Mining and Knowledge Discovery, vol. 30, no. 4, pp. 891–927, 2016.
  • [37] J. Alcalá-Fdez, L. Sanchez, S. Garcia, M. J. del Jesus, S. Ventura, J. M. Garrell, J. Otero, C. Romero, J. Bacardit, V. M. Rivas, et al., “KEEL: a software tool to assess evolutionary algorithms for data mining problems,” Soft Computing-A Fusion of Foundations, Methodologies and Applications, vol. 13, no. 3, pp. 307–318, 2009.
  • [38] J. Alcalá-Fdez, A. Fernández, J. Luengo, J. Derrac, S. García, L. Sánchez, and F. Herrera, “KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework,” Journal of Multiple-Valued Logic & Soft Computing, vol. 17, 2011.
  • [39] C. L. Blake and C. J. Merz, “UCI repository of machine learning databases [http://www.ics.uci.edu/~mlearn/mlrepository.html]. Irvine, CA: University of California,” Department of Information and Computer Science, vol. 55, 1998.
  • [40] J. Gao and P.-N. Tan, “Converting output scores from outlier detection algorithms into probability estimates,” in Data Mining, 2006. ICDM’06. Sixth International Conference on, pp. 212–221, IEEE, 2006.
  • [41] H.-P. Kriegel, P. Kroger, E. Schubert, and A. Zimek, “Interpreting and unifying outlier scores,” in Proceedings of the 2011 SIAM International Conference on Data Mining, pp. 13–24, SIAM, 2011.
  • [42] J. G. Kemeny, “Mathematics without numbers,” Daedalus, vol. 88, no. 4, pp. 577–591, 1959.
  • [43] R. Kolde, S. Laur, P. Adler, and J. Vilo, “Robust rank aggregation for gene list integration and meta-analysis,” Bioinformatics, vol. 28, no. 4, pp. 573–580, 2012.
  • [44] T. Saito and M. Rehmsmeier, “The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets,” PLoS ONE, vol. 10, no. 3, p. e0118432, 2015.
  • [45] A. Chiang and Y.-R. Yeh, “Anomaly detection ensembles: In defense of the average,” in Web Intelligence and Intelligent Agent Technology (WI-IAT), 2015 IEEE/WIC/ACM International Conference on, vol. 3, pp. 207–210, IEEE, 2015.
  • [46] S. Lloyd, “Least squares quantization in PCM,” IEEE Trans. Inf. Theor., vol. 28, pp. 129–137, Mar. 1982.