LSCP: Locally Selective Combination in Parallel Outlier Ensembles

12/04/2018 ∙ by Yue Zhao, et al. ∙ University of Toronto ∙ PwC

In unsupervised outlier ensembles, the absence of ground truth makes the combination of base detectors a challenging task. Specifically, existing parallel outlier ensembles lack a reliable way of selecting competent base detectors during model combination, which affects accuracy and stability. In this paper, we propose a framework, called Locally Selective Combination in Parallel Outlier Ensembles (LSCP), which addresses this issue by defining a local region around a test instance using the consensus of its nearest neighbors in randomly generated feature spaces. The top-performing base detectors in this local region are selected and combined as the model's final output. Four variants of the LSCP framework are compared with seven widely used combination algorithms for parallel ensembles. Experimental results demonstrate that one of these LSCP variants consistently outperforms baseline algorithms on the majority of twenty real-world datasets.




1 Introduction

Outlier detection methods aim to identify anomalous data objects from the general data distribution and are useful for problems such as credit card fraud prevention and network intrusion detection [8]. Since the ground truth is often absent in outlier mining [1], unsupervised detection methods are commonly used for this task [5, 17, 8]. However, unsupervised approaches are susceptible to generating high false positive and false negative rates [10]. To improve model accuracy and stability in these scenarios, recent research explores ensemble approaches to outlier detection [1, 3, 24, 32]. Ensemble learning combines multiple base estimators to achieve superior detection performance and reliability when compared to an individual estimator [12, 15]. It is important to ensure that the combination process is robust because constituent estimators, if synthesized inappropriately, may be detrimental to the predictive capability of an ensemble [22, 23]. Similar to prior works in classification [24], an outlier ensemble may be characterized as parallel if detectors are generated independently. In contrast, it may be considered sequential if the detector generation, selection or combination is iterative.

Model combination is important for parallel ensembles to ensure diversity exists among base detectors; however, existing works have not jointly addressed two key limitations in this process. First, most parallel ensembles generically combine all detectors without considering selection. This limits the benefits of model combination since individual base detectors may not be proficient at identifying all outlier instances [9]. For example, prior work has demonstrated that the value of good detectors can be neutralized by the inclusion of poor detectors in a generic averaging framework [3]. Second, data locality is rarely emphasized in the context of detector selection and combination, leading to potentially sub-optimal outcomes. While it is acknowledged that certain types of outliers are better identified by local data relationships [27], detectors are often evaluated at a global scale, where all training points are considered, instead of in the local region related to a test instance.

To address the aforementioned limitations, we propose a fully unsupervised framework called Locally Selective Combination in Parallel Outlier Ensembles (LSCP) to selectively combine base detectors by emphasizing data locality. The idea is motivated by an established supervised ensemble framework known as Dynamic Classifier Selection (DCS) [15]. DCS selects the best classifier for each test instance by evaluating base classifier competency at a local scale [9]. The rationale behind this is that base classifiers will generally not excel at categorizing all unknown test instances and that an individual classifier is more likely to specialize in a specific local region [9, 16]. Similarly, LSCP first defines the local region of a test instance by the consensus of the nearest training points in randomly selected feature subspaces, and then identifies the most competent base detector in this local region by measuring similarity relative to a pseudo ground truth (see [7, 23] for examples). To further improve algorithm stability and capacity, ensemble variations of LSCP are proposed where promising base detectors are kept for a second-phase combination instead of using the single most competent detector. As a whole, LSCP is intuitive, stable and effective for unsupervised combination of independent outlier detectors.

Our technical contributions in this paper are:

  1. We propose a novel combination framework which, to the best of our knowledge, is the first published effort to adapt DCS from supervised classification tasks to unsupervised parallel outlier ensembles.

  2. Extensive experiments on 20 real-world datasets show that LSCP consistently yields better performance than existing parallel combination methods.

  3. As a general framework, LSCP is formulated to be compatible with different types of base detectors; we demonstrate its effectiveness with a homogeneous pool of Local Outlier Factor [5] detectors.

  4. We employ various analysis methods to improve model interpretability. First, theoretical explanations and complexity are provided. Second, visualization techniques are used to intuitively explain why LSCP works and when to use it. Third, statistical tests are used to compare experimental results.

  5. Effort has been made to streamline the accessibility of LSCP. All source code, experiment results and figures are shared for reproduction in a public repository. In addition, the framework has been implemented in the Python Outlier Detection (PyOD) toolbox with a unified API and detailed documentation [31].

2 Related Works

2.1 Dynamic Classifier Selection and Dynamic Ensemble Selection.

Dynamic Classifier Selection (DCS) is an established combination framework for classification tasks. The technique was first proposed by Ho et al. in 1994 [15] and then extended, under the name DCS Local Accuracy, by Woods et al. in 1997 [28] to select the most accurate base classifier in a local region. The motivation behind this approach is that base classifiers often make distinctive errors and offer a degree of complementarity [6]. Consequently, selectively combining base classifiers can result in a performance improvement over generic ensembles which use the majority vote of all base classifiers. Subsequent theoretical work by Giacinto and Roli validated that, under certain assumptions, the optimal Bayes classifier could be obtained by selecting non-optimal classifiers [14]. DCS was later expanded by Ko et al. to Dynamic Ensemble Selection (DES), which selects multiple base classifiers for a second-phase combination given each test instance [16]. By minimizing reliance on a single classifier and delegating the classification task to a group of competent classifiers, DES has been shown to be more robust than DCS [16]. Motivated by these approaches, LSCP adapts dynamic selection to unsupervised outlier detection tasks. It should be noted that LSCP is an extension of our prior work on DCSO [29], a recent dynamic detector combination framework. Comparatively, LSCP refines the local region definition process to achieve a more stable combination mechanism.

2.2 Data Locality in Outlier Detection.

The relationship among data objects is critical in outlier detection and existing algorithms can roughly be categorized as either global or local [17, 24, 26]. The former considers all objects during inference while the latter only considers a local selection of objects [26]. In both cases, their applicability is dependent on the structure of the data. Global outlier detection algorithms, for example, offer superior performance when outliers are highly distinctive from the data distribution [23] but often fail to identify outliers in the local neighborhoods of high-dimensional data [5, 27]. Global models also struggle with data represented by a mixture of distributions, where global characteristics do not necessarily represent the distribution of objects in local regions [27]. To address these limitations, numerous works have explored local algorithms such as Local Outlier Factor (LOF) [5], Local Outlier Probabilities (LoOP) [17] and GLOSS [27]. However, data locality is rarely considered in the context of detector combination; instead, most combination methods utilize all training data points, e.g., the weight calculation in weighted averaging [32]. LSCP explores both global and local data relationships by training base detectors on the entire dataset and emphasizing data locality during detector combination.

2.3 Outlier Detector Combination.

Recently, studying outlier ensembles has become a popular research area [1, 2, 3, 32], resulting in numerous well-known works including: (i) parallel ensembles such as Feature Bagging [18] and Isolation Forest [19]; (ii) sequential methods including CARE [24], SELECT [23] and BoostSelect [7]; and (iii) hybrid approaches like BORE [21] and XGBOD [30]. When the ground truth is unavailable, combining outlier models is challenging. Existing unsupervised combination algorithms in parallel ensembles are often generic and global (GG); a list of representative GG methods is given below (see [1, 2, 3, 32, 24, 18] for details):

  1. Averaging (GG_A): average scores of all detectors.

  2. Maximization (GG_M): maximum score across all detectors.

  3. Weighted Averaging (GG_WA): weight each base detector when averaging.

  4. Threshold Sum (GG_TH): discard all scores below a threshold and sum over the remaining scores.

  5. Average-of-Maximum (GG_AOM): divide base detectors into subgroups and take the maximum score for each subgroup. The final score is the average of all subgroup scores.

  6. Maximum-of-Average (GG_MOA): divide base detectors into subgroups and take the average score for each subgroup. The final score is the maximum of all subgroup scores.

  7. Feature Bagging (GG_FB): generate a diversified set of base detectors by training on randomly selected feature subsets. The final score is the average of base detector scores.
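The score-level combiners in the list above can be sketched in a few lines. This is a minimal illustration rather than the implementation used in the experiments: the function name `combine`, its parameters and the toy scores are hypothetical, and subgroups are formed from consecutive detectors for simplicity, whereas a real ensemble would typically assign detectors to subgroups at random.

```python
from statistics import mean


def combine(scores, method="avg", n_groups=1):
    """Combine per-detector scores for ONE sample into a single outlier score.

    scores   : list of floats, one (already normalised) score per base detector
    method   : "avg", "max", "aom" (average of maximum) or "moa" (maximum of average)
    n_groups : number of equally sized subgroups used by "aom" and "moa"
    """
    if method == "avg":
        return mean(scores)
    if method == "max":
        return max(scores)
    # Assumption: subgroups are consecutive, non-overlapping slices.
    size = len(scores) // n_groups
    groups = [scores[i * size:(i + 1) * size] for i in range(n_groups)]
    if method == "aom":
        return mean(max(g) for g in groups)
    if method == "moa":
        return max(mean(g) for g in groups)
    raise ValueError(f"unknown method: {method}")


sample_scores = [0.0, 2.0, 1.0, 3.0]  # toy scores from four detectors
```

With two subgroups, the toy scores split into `[0.0, 2.0]` and `[1.0, 3.0]`, so the two second-phase schemes operate on the same groups but aggregate in opposite order.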

As discussed in §2.2, GG methods ignore the importance of data locality while evaluating and combining detectors, which may be inappropriate given the characteristics of outliers [5, 27]. Moreover, without a selection process, poor detectors may hurt the overall detection performance of an ensemble [23, 24]. All seven GG algorithms are thus included as baselines.

There have been attempts to build selective outlier ensembles sequentially in a boosting style. Rayana and Akoglu introduced SELECT [23] and CARE [24] to pick promising detectors and exclude the underperforming ones iteratively, which yielded impressive results on both temporal graphs and multi-dimensional outlier data. Campos et al. further extend this idea by proposing an unsupervised boosting strategy BoostSelect for outlier ensemble selection [7]. As an alternative to sequential selection models, this paper chooses to focus on parallel detector selection which stresses the importance of data locality. Compared with sequential detector selection methods, our approach can select detectors without iteration which may reduce computational cost.

Figure 1: LSCP flow chart. Steps requiring re-computation are highlighted in yellow; cached steps are in gray.
Algorithm 1 Locally Selective Combination
Input: the pool of detectors $\mathcal{C}$, training data $X_{train}$, test data $X_{test}$, the local region factor $k$
Output: outlier scores of $X_{test}$
Train all base detectors in $\mathcal{C}$ on $X_{train}$
Get training outlier scores $O(X_{train})$ with Eq.(1)
Get pseudo ground truth $target$ with Eq.(2)
for each test instance $X_{test}^{(j)}$ in $X_{test}$ do
     Define local region $\psi_j$ by kNN ensemble (§3.3)
     Get local pseudo ground truth $target^{\psi_j}$ by selecting objects in $\psi_j$ from $target$
     for each base detector $C_r$ in $\mathcal{C}$ do
         Get the outlier scores $C_r(\psi_j)$ associated with training data in the local region
         Evaluate the local competency of $C_r$ by the Pearson correlation between $target^{\psi_j}$ and $C_r(\psi_j)$
     end for
     if LSCP_A or LSCP_M then
         return $C_r^*(X_{test}^{(j)})$ where $C_r^*$ has the highest Pearson correlation to $target^{\psi_j}$
     else // LSCP ensembles
         Select a group of most similar detectors and add to the empty set $\mathcal{C}_s^*$ (§3.5)
         if LSCP_AOM then
              return $avg(\mathcal{C}_s^*(X_{test}^{(j)}))$
         else
              return $max(\mathcal{C}_s^*(X_{test}^{(j)}))$
         end if
     end if
end for

3 Algorithm Design

LSCP starts with a group of diversified detectors to be combined. For each test instance, LSCP first defines its local region and then picks the most competent local detector(s). The selected detector(s) are used to generate the outlier score for the test instance. The workflow of all four proposed LSCP methods is shown in Algorithm 1 and Fig. 1.

3.1 Base Detector Generation.

An effective ensemble should be constructed with diversified base estimators [24, 32] to promote learning distinct characteristics in the data. With a group of homogeneous base detectors, diversity can be induced by subsampling the training set and feature space, or by varying model hyperparameters [6, 32]. In this study, we demonstrate the effectiveness of LSCP by using distinct hyperparameters to construct a pool of models with the same base algorithm. However, in practice, LSCP can also be used as a general framework with heterogeneous base detectors.

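Let $X_{train} \in \mathbb{R}^{n \times d}$ denote training data with $n$ points and $d$ features, and $X_{test} \in \mathbb{R}^{m \times d}$ denote a test set with $m$ points. The algorithm first generates a pool of base detectors $\mathcal{C} = \{C_1, \dots, C_R\}$ initialized with a range of hyperparameters, e.g., a group of LOF detectors with distinct neighborhood sizes [5]. All base detectors are first trained on $X_{train}$ and then inference is performed on the same dataset. The results are combined into an outlier score matrix $O(X_{train})$, formalized in Eq.(1), where $C_r(\cdot)$ denotes the score vector from the $r^{th}$ base detector. Each detector score is normalized using Z-normalization as per prior work [2, 32].

$$O(X_{train}) = [C_1(X_{train}), \dots, C_R(X_{train})] \in \mathbb{R}^{n \times R} \tag{1}$$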
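As a minimal sketch of this step, the snippet below Z-normalizes each detector's raw training scores and stacks them into the score matrix, stored as a list of rows. The helper names `z_normalize` and `score_matrix` and the toy scores are hypothetical, and the use of the population standard deviation is an assumption; the text does not specify which estimator is used.

```python
from statistics import mean, pstdev


def z_normalize(column):
    """Rescale one detector's training scores to zero mean and unit variance."""
    mu = mean(column)
    sigma = pstdev(column)  # assumption: population standard deviation
    if sigma == 0:  # constant scores carry no ranking information
        return [0.0 for _ in column]
    return [(v - mu) / sigma for v in column]


def score_matrix(raw_scores_per_detector):
    """Stack normalised score vectors column-wise into an n x R matrix (list of rows)."""
    columns = [z_normalize(col) for col in raw_scores_per_detector]
    return [list(row) for row in zip(*columns)]


# Toy raw scores: R = 2 detectors, n = 3 training points.
raw = [[1.0, 2.0, 3.0], [10.0, 10.0, 40.0]]
O_train = score_matrix(raw)
```

Normalizing per detector puts scores from differently parameterized detectors on a comparable scale before they are averaged or maximized.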


3.2 Pseudo Ground Truth Generation.

Since LSCP evaluates detector competency without ground truth labels, two methods are used for generating a pseudo ground truth (denoted $target$) with the training score matrix $O(X_{train})$: (i) LSCP_A: averages base detector scores and (ii) LSCP_M: maximum score across detectors. This is generalized in Eq.(2) where $\phi$ represents the aggregation (average or max) taken across all base detectors.

$$target = \phi(O(X_{train})) \in \mathbb{R}^{n \times 1} \tag{2}$$


It should be noted that the pseudo ground truth in LSCP is generated using training data and used solely for detector selection.
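The aggregation can be illustrated as follows. This is a sketch under the assumption that the normalized training scores are stored as a list of rows (one row per training point, one column per detector); the function name `pseudo_ground_truth` and the toy matrix are hypothetical.

```python
from statistics import mean


def pseudo_ground_truth(score_rows, how="avg"):
    """Aggregate each training point's R detector scores into one pseudo target value."""
    if how not in ("avg", "max"):
        raise ValueError(f"unknown aggregation: {how}")
    aggregate = mean if how == "avg" else max
    return [aggregate(row) for row in score_rows]


# Toy normalised score matrix: n = 3 training points, R = 2 detectors.
O_train = [[-1.0, 0.0], [0.0, 0.5], [1.0, 2.0]]
target_avg = pseudo_ground_truth(O_train, "avg")  # averaging, as in LSCP_A
target_max = pseudo_ground_truth(O_train, "max")  # maximization, as in LSCP_M
```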

3.3 Local Region Definition.

The local region of a test instance $X_{test}^{(j)}$ is defined as the set of its $k$ nearest training objects. Formally, this is denoted as:

$$\psi_j = \{x_i \mid x_i \in X_{train},\ x_i \in kNN_{ens}^{(j)}\}$$

where $kNN_{ens}^{(j)}$ describes the set of a test instance's nearest neighbours subject to an ensemble criterion. This variation of kNN, which is similar to Feature Bagging [18], is proposed to alleviate concerns involving the curse of dimensionality on kNN [4] while leveraging its better precision compared to clustering algorithms in DCS [9]. The process is as follows: (i) $t$ groups of $[d/2, d]$ features are randomly selected to construct new feature spaces, (ii) the $k$ nearest training objects to $X_{test}^{(j)}$ in each group are identified using Euclidean distance and (iii) training objects that appear more than $t/2$ times are added to $kNN_{ens}^{(j)}$, thus defining the local region. The size of the region is not fixed because it is dependent on the number of training objects that meet the selection criteria.
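The three-step procedure can be sketched with a brute-force neighbor search. Several details here are assumptions made for illustration: the function name `local_region`, the default of 20 feature groups, subspaces that keep between half and all of the features, and a majority-vote threshold for membership. A practical implementation would use a spatial index rather than a full scan.

```python
import math
import random


def local_region(x_test, X_train, k, n_groups=20, rng=None):
    """Return indices of training points that are among the k nearest neighbours
    of x_test in more than half of n_groups random feature subspaces."""
    rng = rng or random.Random(0)
    d = len(x_test)
    votes = [0] * len(X_train)
    for _ in range(n_groups):
        # Assumption: each subspace keeps between d/2 and d randomly chosen features.
        n_feat = rng.randint(max(1, d // 2), d)
        feats = rng.sample(range(d), n_feat)
        dists = [
            (math.sqrt(sum((row[f] - x_test[f]) ** 2 for f in feats)), i)
            for i, row in enumerate(X_train)
        ]
        for _, i in sorted(dists)[:k]:
            votes[i] += 1
    # Assumption: consensus means appearing in a strict majority of subspaces.
    return [i for i, v in enumerate(votes) if v > n_groups / 2]


# Toy data: three points near the origin and three far away, in four dimensions.
X_toy = [[0.0] * 4, [0.5] * 4, [1.0] * 4, [10.0] * 4, [11.0] * 4, [12.0] * 4]
region = local_region([0.2] * 4, X_toy, k=3)
```

In this toy case the three points near the origin are the nearest neighbors in every subspace, so they form the local region regardless of which features are drawn.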

The local region factor $k$ decides the number of nearest neighbors to consider during this process; care is given to avoid selecting extreme values. Smaller values of $k$ give more attention to local relationships, which can result in instability, while large values of $k$ may place too much emphasis on global relationships and have higher computational costs. While it is possible to experimentally determine an optimal $k$ with cross-validation [16] when ground truth is available, a similarly straightforward approach does not exist in an unsupervised setting. For these reasons, we recommend setting $k$ to 10% of the number of training samples, bounded within a fixed range, which yielded good results in practice.
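The recommendation amounts to a clipped proportion, sketched below. The function name `local_region_factor` is hypothetical, and the default bounds `k_min=30` and `k_max=100` are placeholder values chosen for illustration; they should be replaced with the bounds appropriate to the application.

```python
def local_region_factor(n_train, fraction=0.1, k_min=30, k_max=100):
    """Pick k as a fraction of the training set size, clipped to [k_min, k_max].

    k_min and k_max are placeholder bounds, not values taken from the text.
    """
    k = int(round(fraction * n_train))
    return max(k_min, min(k, k_max))
```

For example, with these placeholder bounds a training set of 500 points gives `k = 50`, while very small or very large training sets fall back to the lower or upper bound.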

3.4 Model Selection and Combination.

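For each test instance, the local pseudo ground truth $target^{\psi_j}$ can be obtained by retrieving values associated with the local region $\psi_j$ from $target$:

$$target^{\psi_j} = \{target_{x_i} \mid x_i \in \psi_j\} \in \mathbb{R}^{|\psi_j| \times 1}$$

where $|\psi_j|$ denotes the cardinality of $\psi_j$. Similarly, the local training outlier scores $O(\psi_j)$ can be retrieved from the pre-calculated training score matrix $O(X_{train})$ as:

$$O(\psi_j) = [C_1(\psi_j), \dots, C_R(\psi_j)] \in \mathbb{R}^{|\psi_j| \times R}$$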


Consequently, although the local region needs to be re-computed for each test instance, the local outlier scores and targets can be efficiently retrieved from pre-calculated values (see Fig. 1).

For evaluating base estimator competency in a local region, DCS measures the accuracy of base classifiers as the percentage of correctly classified points [16], while LSCP measures the similarity between base detector scores and the pseudo target instead. This distinction is motivated by the lack of direct and reliable ways to access binary labels in unsupervised outlier mining. Although converting pseudo outlier scores to binary labels is feasible, defining an accurate conversion threshold is challenging. Additionally, since imbalanced datasets are common in outlier detection tasks, it is more stable to use similarity measures over absolute accuracy for competency evaluation. Therefore, LSCP measures the local competency of each base detector by the Pearson correlation, which has proven useful in outlier ensemble model combination [25], between the local pseudo ground truth $target^{\psi_j}$ and the local detector score $C_r(\psi_j)$. The detector with the highest similarity is regarded as the most competent local detector for $X_{test}^{(j)}$, and its outlier score can be considered the final score for the corresponding test sample.
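A small sketch of this competency evaluation follows. It assumes the training scores are held as a list of rows and that the local region is given as a list of training indices; the names `pearson` and `most_competent` and the toy values are hypothetical. Returning a correlation of 0.0 for constant inputs is a design choice made here to avoid division by zero.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)


def most_competent(O_train, target, region):
    """Return (best detector index, all correlations) inside the local region."""
    local_target = [target[i] for i in region]
    n_detectors = len(O_train[0])
    corrs = [
        pearson([O_train[i][r] for i in region], local_target)
        for r in range(n_detectors)
    ]
    best = max(range(n_detectors), key=corrs.__getitem__)
    return best, corrs


# Toy example: 4 training points, 2 detectors, local region = first three points.
O_train = [[0.0, 3.0], [1.0, 1.0], [2.0, 2.0], [9.0, 0.0]]
target = [0.1, 1.2, 1.9, 5.0]
best, corrs = most_competent(O_train, target, region=[0, 1, 2])
```

Here the first detector tracks the local pseudo target closely while the second does not, so the first is selected for this region.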

3.5 Dynamic Outlier Ensemble Selection.

Selecting only one detector, even if it is most similar to the pseudo ground truth, can be risky in unsupervised learning. This risk can be mitigated by selecting a group of detectors for a second-phase combination. This idea can be viewed as an adaptation of supervised DES [16] to outlier detection tasks; correspondingly, we introduce ensemble variations of LSCP which employ Maximum of Average (LSCP_MOA) and Average of Maximum (LSCP_AOM) ensembling methods. Specifically, when the pseudo ground truth is generated by averaging, LSCP_MOA selects a subset of competent detectors in the local region of a test instance and then takes the maximum of their predictions as the outlier score. Inversely, LSCP_AOM computes the average of the selected subset when the pseudo target is generated with maximization. Setting the group size of selected detectors equal to 1 is a special case of the ensembles yielding the original LSCP algorithms (LSCP_A and LSCP_M). Larger group sizes may be considered more global in their detector selection while a group size of $R$, the total number of base detectors, results in a fully global algorithm. In response to this, we recommend using a group size selection process which includes some variance. Specifically, a histogram of detector Pearson correlation scores (to the pseudo ground truth) is built with $b$ equal intervals. The detectors belonging to the most frequent interval are kept for the second-phase combination. A large $b$ results in selecting fewer detectors, which flexibly controls the group size in LSCP ensembles.
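The histogram-based group selection and the second-phase combination can be sketched as below. The function names, the default of 10 bins and the tie-breaking rule (the lowest-valued bin wins when two bins are equally populated) are assumptions for illustration only.

```python
def select_detectors(corrs, n_bins=10):
    """Keep detectors whose correlation falls in the most populated histogram bin."""
    lo, hi = min(corrs), max(corrs)
    if hi == lo:  # all detectors look equally competent
        return list(range(len(corrs)))
    width = (hi - lo) / n_bins
    # Bin index per detector; the maximum value is placed in the last bin.
    bins = [min(int((c - lo) / width), n_bins - 1) for c in corrs]
    counts = [bins.count(b) for b in range(n_bins)]
    top_bin = counts.index(max(counts))  # assumption: ties go to the lowest bin
    return [r for r, b in enumerate(bins) if b == top_bin]


def second_phase(test_scores, selected, how="avg"):
    """Combine the selected detectors' scores for one test instance."""
    picked = [test_scores[r] for r in selected]
    return sum(picked) / len(picked) if how == "avg" else max(picked)


# Toy example: five detectors, three of which have similar correlations.
corrs = [0.10, 0.80, 0.82, 0.85, 0.30]
selected = select_detectors(corrs, n_bins=3)
score_avg = second_phase([5.0, 1.0, 2.0, 3.0, 9.0], selected, "avg")
score_max = second_phase([5.0, 1.0, 2.0, 3.0, 9.0], selected, "max")
```

Averaging the selected scores corresponds to the second phase described for LSCP_AOM, and taking their maximum to that described for LSCP_MOA.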

The time complexity for training each base detector and generating the pseudo ground truth is dependent on the underlying model and number of training samples. However, since this work proposes a combination framework, our discussion focuses on the overhead introduced during the combination stage. With the appropriate implementation of LSCP_A and LSCP_M, e.g., using a k-d tree, the additional time complexity for defining the local region of each test instance is $O(nd + n\log n)$: $O(nd)$ for the distance calculation and $O(n\log n)$ for summation and sorting [16]. Here, $n$ refers to the number of training objects compared against each test instance and $d$ refers to its dimensionality. While defining the local region requires multiple iterations, the number of iterations is fixed and is thus excluded in the complexity analysis. Combining the base detectors in LSCP_MOA and LSCP_AOM adds a further term to this total.

3.6 Theoretical Considerations.

Recently, Aggarwal and Sathe laid the theoretical foundation for outlier ensembles [2] using the bias-variance tradeoff, a widely used framework for analyzing generalization error in classification problems. The reducible generalization error in outlier ensembles may be minimized by either reducing squared bias or variance where a tradeoff between these two channels usually exists. A high variance detector is sensitive to data variation with high instability; a high bias detector is less sensitive to data variation but may fit complex data poorly. The goal of outlier ensembles is to control both bias and variance to reduce the overall generalization error. Various newly proposed algorithms have been analyzed using this new framework to enhance interpretability [23, 24, 30].

It has been shown that combining diversified base detectors, by averaging them for example, results in variance reduction [23, 24, 2]. However, a combination of all base detectors may also include inaccurate ones leading to higher bias. This explains why generic global averaging does not work well. Within Aggarwal’s bias-variance framework, LSCP possesses a combination of both variance and bias reduction. It induces diversity by initializing various base detectors with distinct hyperparameters and indirectly promotes variance reduction in the way that the pseudo ground truth is generated, e.g., averaging in LSCP_A. Furthermore, LSCP focuses on detector selection by local competency, which helps identify base detectors with conditionally low model bias. LSCP_M is also expected to be more stable than global maximization (GG_M) since the variance is reduced by using the most competent detector’s output rather than the global maximum of all base detectors. LSCP_MOA and LSCP_AOM further decrease generalization error through bias and variance reduction, respectively, in their second-phase combination. However, LSCP is still a heuristic framework and can yield unpredictable results on pathological datasets.

4 Numerical Experiments

4.1 Datasets and Evaluation Metrics.

Table 1 summarizes the 20 public outlier detection benchmark datasets used in this study from the ODDS Library and the DAMI datasets. In each experiment, 60% of the data is used for training and the remaining 40% is set aside for validation. Performance is evaluated by taking the average score of 30 independent trials using area under the receiver operating characteristic (ROC-AUC) and mean average precision (mAP). Both metrics are widely used in outlier research [3, 4, 21, 23, 30, 13] and statistical measures are used to analyze the results [11]. Specifically, we use a non-parametric Friedman test followed by a post-hoc Nemenyi test. For these tests, $p < 0.05$ is considered to be statistically significant.
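For readers unfamiliar with the two metrics, the following sketch computes them for a single trial from binary labels (1 for outlier) and outlier scores. The function names and toy data are hypothetical, the pairwise ROC-AUC computation is quadratic and meant only for small examples, and the handling of tied scores in the precision ranking is left to the sort order; an established library implementation is preferable in practice.

```python
def roc_auc(labels, scores):
    """Probability that a random outlier scores higher than a random inlier (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of precision values measured at the rank of each true outlier."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)


# Toy trial: five validation points, two of which are true outliers.
labels = [0, 0, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.2, 0.9]
auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
```

Averaging such per-trial values over independent train/validation splits gives the kind of summary score reported per dataset.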

Dataset Pts Dim Outliers %Outlier
Annthyroid 7200 6 534 7.41
Arrhythmia 452 274 66 14.60
Breastw 683 9 239 34.99
Cardio 1831 21 176 9.61
Letter 1600 32 100 6.25
MNIST 7603 100 700 9.21
Musk 3062 166 97 3.17
PageBlocks 5393 10 510 9.46
Pendigits 6870 16 156 2.27
Pima 768 8 268 34.90
Satellite 6435 36 2036 31.64
Satimage-2 5803 36 71 1.22
Shuttle 49097 9 3511 7.15
SpamSpace 4207 57 1679 39.91
Stamps 340 9 31 9.12
Thyroid 3772 6 93 2.47
Vertebral 240 6 30 12.50
Vowels 1456 12 50 3.43
WBC 378 30 21 5.56
Wilt 4819 5 257 5.33
Table 1: Real-world datasets used for evaluation

4.2 Experimental Design.

This study compares the seven GG algorithms introduced in §2.3 with the four proposed LSCP variations described in Algorithm 1. All models use a pool of 50 LOF base detectors, ensuring consistency during performance evaluation. To induce diversity among base detectors, distinct initialization hyperparameters, specifically the number of neighbors used in each LOF detector, are randomly selected from a fixed range. A narrower range is used for certain datasets due to limitations associated with size or computational cost. For GG_AOM and GG_MOA, the base detectors are divided into 5 subgroups and each group contains 10 base detectors selected without replacement. For Feature Bagging (GG_FB), the base detector is randomly chosen from the same pool of LOF detectors and then trained on 50 individual subsamples with randomly selected features. For all LSCP algorithms, the default hyperparameters mentioned in §3 are used.

Dataset LSCP_A LSCP_MOA LSCP_M LSCP_AOM GG_A GG_MOA GG_M GG_AOM GG_WA GG_TH GG_FB
Annthyroid 0.7548 0.7590 0.7849 0.7520 0.7642 0.7660 0.7769 0.7730 0.7632 0.7552 0.7854
Arrhythmia 0.7746 0.7715 0.7729 0.7763 0.7758 0.7749 0.7656 0.7690 0.7758 0.7313 0.7709
Breastw 0.6553 0.7044 0.7236 0.7845 0.7362 0.7140 0.6590 0.6838 0.7453 0.6285 0.3935
Cardio 0.8691 0.8908 0.8491 0.9013 0.8770 0.8865 0.8798 0.8903 0.8782 0.8830 0.8422
Letter 0.7818 0.7954 0.8361 0.7867 0.7925 0.8031 0.8434 0.8300 0.7908 0.8001 0.7640
MNIST 0.8576 0.8623 0.7812 0.8633 0.8557 0.8588 0.8349 0.8553 0.8563 0.8272 0.8468
Musk 0.9950 0.9970 0.9931 0.9981 0.9937 0.9960 0.9960 0.9970 0.9953 0.9958 0.7344
PageBlocks 0.9349 0.9343 0.8687 0.9488 0.9443 0.9440 0.9240 0.9371 0.9453 0.9418 0.9284
Pendigits 0.8238 0.8656 0.7238 0.8744 0.8378 0.8509 0.8488 0.8622 0.8425 0.8548 0.8034
Pima 0.7059 0.6991 0.6640 0.7061 0.7030 0.7003 0.6730 0.6856 0.7037 0.6349 0.6989
Satellite 0.5814 0.6106 0.6006 0.6015 0.5881 0.5992 0.6258 0.6220 0.5876 0.6101 0.5818
Satimage-2 0.9852 0.9931 0.9878 0.9935 0.9872 0.9907 0.9909 0.9925 0.9880 0.9881 0.9181
Shuttle 0.5392 0.5551 0.5373 0.5514 0.5439 0.5504 0.5612 0.5602 0.5413 0.5561 0.3702
SpamSpace 0.3792 0.4594 0.4305 0.4744 0.4487 0.4377 0.4060 0.4128 0.4580 0.4104 0.3312
Stamps 0.8888 0.8719 0.8525 0.8985 0.8946 0.8927 0.8559 0.8763 0.8953 0.8904 0.8715
Thyroid 0.9579 0.9624 0.9413 0.9700 0.9656 0.9647 0.9385 0.9510 0.9665 0.9644 0.8510
Vertebral 0.3324 0.3662 0.4306 0.3478 0.3433 0.3467 0.3662 0.3614 0.3442 0.3678 0.3385
Vowels 0.9276 0.9185 0.9238 0.9199 0.9265 0.9275 0.9313 0.9271 0.9261 0.9299 0.9148
WBC 0.9379 0.9344 0.9242 0.9451 0.9421 0.9409 0.9321 0.9367 0.9420 0.9314 0.9407
Wilt 0.5275 0.5517 0.6550 0.4286 0.5101 0.5358 0.6384 0.6056 0.5037 0.5586 0.5868
Table 2: ROC-AUC scores (average of 30 independent trials, highest score highlighted in bold)

4.3 Algorithm Performances.

Tables 2 and 3 summarize the ROC-AUC and mAP scores on the 20 datasets. Our experiments demonstrate that LSCP can bring consistent performance improvement over its GG counterparts, which is especially noticeable in the mAP scores. The Friedman test shows there is a statistically significant difference between the 11 algorithms in both ROC-AUC and mAP; however, the Nemenyi test fails to identify which pairs of algorithms are significantly different. The latter result is expected in an unsupervised setting due to the difficulty of this task relative to the limited number of datasets [11]. In general, LSCP algorithms show great potential: they achieve the highest ROC-AUC scores on 15 datasets and the highest mAP scores on 18 datasets. GG_M yields higher ROC-AUC on four datasets while GG_FB and GG_AOM perform better on Annthyroid in terms of ROC-AUC and mAP respectively. In all other cases, GG algorithms are outperformed by a variant of LSCP. Specifically, LSCP_AOM is the best performing method and ranks highest on 13 datasets in terms of ROC-AUC and 14 datasets in terms of mAP. It should be noted that GG methods with a second-phase combination (GG_MOA and GG_AOM) demonstrate better performance than GG_A and better stability than GG_M. For instance, GG_M has lower mAP than GG_AOM on 15 datasets. These observations agree with the conclusions in Aggarwal’s work [2, 3].

Dataset LSCP_A LSCP_MOA LSCP_M LSCP_AOM GG_A GG_MOA GG_M GG_AOM GG_WA GG_TH GG_FB
Annthyroid 0.2283 0.2375 0.2349 0.2453 0.2301 0.2395 0.2413 0.2516 0.2306 0.2277 0.1864
Arrhythmia 0.3780 0.3744 0.3790 0.3796 0.3766 0.3769 0.3690 0.3722 0.3766 0.3468 0.3707
Breastw 0.4334 0.4766 0.4728 0.5655 0.4995 0.4849 0.4249 0.4577 0.5085 0.4366 0.2854
Cardio 0.3375 0.3960 0.3197 0.4117 0.3516 0.3708 0.3666 0.3864 0.3535 0.3629 0.3643
Letter 0.2302 0.2396 0.3346 0.2407 0.2388 0.2473 0.3160 0.2867 0.2372 0.2416 0.2193
MNIST 0.3933 0.3974 0.3353 0.3979 0.3911 0.3941 0.3701 0.3896 0.3918 0.3836 0.3928
Musk 0.8478 0.8773 0.8433 0.9240 0.8245 0.8718 0.8479 0.8806 0.8608 0.8629 0.5806
PageBlocks 0.5805 0.5707 0.4684 0.6360 0.6043 0.6016 0.5297 0.5733 0.6077 0.6064 0.6094
Pendigits 0.0709 0.0893 0.0625 0.0944 0.0777 0.0823 0.0834 0.0895 0.0780 0.0832 0.0834
Pima 0.5092 0.5045 0.4716 0.5142 0.5089 0.5054 0.4813 0.4920 0.5095 0.4599 0.5094
Satellite 0.4077 0.4268 0.4223 0.4196 0.4047 0.4139 0.4385 0.4352 0.4047 0.4031 0.4049
Satimage-2 0.3477 0.6248 0.3994 0.6249 0.3959 0.5089 0.5344 0.5922 0.4159 0.4114 0.4851
Shuttle 0.1228 0.1296 0.1167 0.1330 0.1297 0.1316 0.1239 0.1294 0.1293 0.1316 0.0549
SpamSpace 0.3326 0.3615 0.3592 0.3665 0.3572 0.3521 0.3379 0.3413 0.3612 0.3601 0.3079
Stamps 0.3596 0.3310 0.3193 0.3779 0.3694 0.3660 0.3144 0.3387 0.3706 0.3638 0.3535
Thyroid 0.3544 0.3955 0.2638 0.4651 0.4045 0.4123 0.2850 0.3488 0.4130 0.4071 0.1186
Vertebral 0.0948 0.1020 0.1230 0.0988 0.0971 0.0975 0.1029 0.1000 0.0972 0.1067 0.0965
Vowels 0.3913 0.3678 0.3482 0.3539 0.3783 0.3790 0.3760 0.3732 0.3784 0.3783 0.3340
WBC 0.6033 0.5983 0.5472 0.6131 0.6097 0.6069 0.5579 0.5925 0.6105 0.6045 0.5933
Wilt 0.0518 0.0557 0.0770 0.0423 0.0493 0.0523 0.0715 0.0633 0.0486 0.0537 0.0591
Table 3: mAP scores (average of 30 independent trials, highest score highlighted in bold)
Figure 2: t-SNE visualizations on Cardio (left), Thyroid (middle) and Letter (right), where normal (N) and outlying (O) points are denoted as grey dots and orange squares, respectively. Points that can only be correctly classified by a particular framework are shown in green and blue for GG and LSCP respectively.

LSCP_A and LSCP_M do not demonstrate strong performance relative to their GG counterparts. Given that both pseudo ground truth generation methods are heuristic, they may result in poor local competency evaluation. For example, as discussed in §3.6, LSCP_A theoretically benefits from both variance and bias reduction by averaging and focusing on locality. In practice though, by only selecting the single most competent detector, it is possible that this approach yields weaker variance reduction compared to GG_A, which uses all detector scores. As a consequence, the variance reduction may not be able to sufficiently offset the bias inherent to the pseudo ground truth generation process, leading to diminished performance. Comparatively, when the pseudo ground truth is generated by taking the maximum among multiple detectors, LSCP_M exhibits unstable behaviour similar to GG_M. LSCP_M only outperforms LSCP_AOM on Letter, Vertebral and Wilt in terms of both ROC-AUC and mAP. As discussed in [2, 3], selecting maximum scores across detectors yields high model variance, which explains these results; a second-phase combination may mitigate this risk.

A Friedman test confirms there is a significant difference among the four LSCP algorithms in both ROC-AUC and mAP. Correspondingly, the LSCP ensemble variation LSCP_AOM shows promise. Building on GG_AOM’s success as one of the most effective combination methods [3], LSCP_AOM averages outlier scores within the selected group of detectors, which could be viewed as an additional reduction of model variance over LSCP_M. Moreover, LSCP_AOM’s concentration on the local competency evaluation may have improved model bias in addition to the variance reduction by the second-phase averaging, leading to better accuracy. The results show that LSCP_AOM outperforms all models on 13 datasets in terms of ROC-AUC and 14 datasets in terms of mAP. The latter improvement over GG methods is especially considerable on Breastw, Cardio, Satimage-2 and Thyroid.

The benefit of a second-phase combination is weaker for LSCP_MOA, which does not outperform LSCP_A or GG_MOA. As discussed in [2, 3], a second-phase combination after averaging is less effective because information has already been lost to blunting. The experimental results confirm that, in an LSCP scheme, the benefit from a second-phase maximization cannot offset the information loss due to the initial averaging. Overall, only LSCP_AOM is recommended for detector combination due to its combined bias and variance reduction capability.
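The difference between the four variants can be sketched for a single test instance as follows. This is a minimal illustration, not the authors' implementation: it assumes detector scores are already standardized, takes local competency to be the Pearson correlation between each detector's local-region scores and the pseudo ground truth, and simplifies detector selection to a plain top-`s` cut. The function and parameter names (`lscp_combine`, `target`, `s`, `second_phase`) are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def lscp_combine(local_train_scores, test_scores, target="max", s=1, second_phase="avg"):
    """Combine detector scores for one test instance.

    local_train_scores[i][r]: score of detector r on the i-th local-region point.
    test_scores[r]:           score of detector r on the test instance.
    target:       "avg" or "max" pseudo ground truth over all detectors.
    s:            number of most competent detectors kept (assumed top-s rule).
    second_phase: "avg" or "max" over the kept detectors' test scores.
    """
    n_det = len(test_scores)
    if target == "max":
        pseudo = [max(row) for row in local_train_scores]
    else:
        pseudo = [sum(row) / len(row) for row in local_train_scores]
    # Local competency of each detector against the pseudo ground truth.
    competency = [
        pearson([row[r] for row in local_train_scores], pseudo) for r in range(n_det)
    ]
    top = sorted(range(n_det), key=lambda r: competency[r], reverse=True)[:s]
    kept = [test_scores[r] for r in top]
    return max(kept) if second_phase == "max" else sum(kept) / len(kept)


# Toy local region: 4 points x 3 detectors (illustrative values only).
local = [[1, 4, 1], [2, 3, 2], [3, 2, 3], [4, 1, 5]]
test = [0.2, 0.9, 0.6]

lscp_a = lscp_combine(local, test, target="avg", s=1)                         # LSCP_A
lscp_m = lscp_combine(local, test, target="max", s=1)                         # LSCP_M
lscp_aom = lscp_combine(local, test, target="max", s=2, second_phase="avg")   # LSCP_AOM
lscp_moa = lscp_combine(local, test, target="avg", s=2, second_phase="max")   # LSCP_MOA
print(lscp_a, lscp_m, lscp_aom, lscp_moa)
```

In this sketch LSCP_A and LSCP_M differ only in how the pseudo ground truth is built, while LSCP_AOM and LSCP_MOA add a second-phase average or maximum over the kept detectors.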

4.4 Visualization Analysis.

Figure 2 visually compares the performance of the best-performing GG and LSCP methods on Cardio, Thyroid and Letter using t-Distributed Stochastic Neighbor Embedding (t-SNE) [20]. The green and blue markers highlight objects that can only be correctly classified by GG and LSCP methods respectively, to emphasize the mutual exclusivity of the two approaches. The visualizations of Cardio (left) and Thyroid (middle) illustrate that LSCP methods have an edge over GG methods in detecting local outliers when they cluster together (highlighted by red dotted circles in Fig. 2). Additionally, LSCP methods can contribute to classifying both outlying and normal points when locality is present in the data. However, the outlying data distribution in Letter (right) is more dispersed: outliers do not form local clusters but rather mix with normal points. This causes LSCP to perform worse than GG_M in terms of ROC-AUC despite showing an improvement in terms of mAP. These visualizations suggest that LSCP is useful when outlying and normal objects are well separated, but less effective when they are interleaved. They also suggest that the size of LSCP’s local region should be informed by the estimated proportion of outliers in the dataset. For instance, outliers account for only 3.43% and 6.25% of Vowels and Letter respectively, which may not be sufficient to form outlier clusters when the local region size is set to 10% of the training data. A smaller local region size is more appropriate when a small number of outliers is assumed.
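The arithmetic behind this observation can be made explicit with a small sketch. If a local region covers a fraction of the training data and outliers make up a given rate of it, then outliers can fill at most rate / fraction of any region. The helper name `max_outlier_share` is hypothetical, and the 5% region size is an illustrative alternative rather than a setting taken from the experiments.

```python
def max_outlier_share(outlier_rate, region_fraction):
    """Upper bound on the share of a local region that outliers can occupy.

    A region holds region_fraction * n points and the data holds
    outlier_rate * n outliers in total, so the share is at most
    outlier_rate / region_fraction, capped at 1.
    """
    if not 0 < region_fraction <= 1:
        raise ValueError("region_fraction must be in (0, 1]")
    if not 0 <= outlier_rate <= 1:
        raise ValueError("outlier_rate must be in [0, 1]")
    return min(1.0, outlier_rate / region_fraction)


# Outlier rates quoted in the text; 0.10 is the region size used in the text.
for name, rate in [("Vowels", 0.0343), ("Letter", 0.0625)]:
    for fraction in (0.10, 0.05):
        share = max_outlier_share(rate, fraction)
        print(f"{name}: region={fraction:.0%}, outliers fill at most {share:.1%}")
```

Even in the best case, a 10% region around a Vowels outlier is dominated by normal points, which is consistent with the recommendation to shrink the region when few outliers are expected.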

4.5 Limitations and Future Directions.

Firstly, the local region definition in LSCP depends on finding nearest neighbors by Euclidean distance. However, this approach is not ideal due to: (i) high time complexity and (ii) degraded performance when many irrelevant features are present in high-dimensional space [4]. It may be improved by using prototype selection [9] or by defining the local region using advanced clustering methods [9]. Secondly, only simple pseudo ground truth generation methods (averaging or maximization) are explored in this study; more accurate methods should be considered, such as actively pruning base detectors [23] and greedy model combination [25]. Lastly, DCS has proven to work with heterogeneous base classifiers in classification problems [16, 9], but this has yet to be verified for LSCP. A more significant performance improvement is expected for the LSCP framework when the base detectors are more diversified.
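For reference, the nearest-neighbor local region that the first limitation refers to can be sketched roughly as below. This is an assumption-laden illustration rather than the authors' code: it assumes each round samples between d/2 and d features at random, finds the `k` nearest training points by Euclidean distance in that subspace, and keeps points that appear in more than half of the rounds. The names `local_region`, `k`, `n_rounds` and `seed` are hypothetical.

```python
import random


def local_region(train, x, k=3, n_rounds=10, seed=0):
    """Return indices of training points in the local region of test point x.

    train: list of feature vectors; x: one feature vector of the same length.
    """
    rng = random.Random(seed)
    d = len(x)
    counts = [0] * len(train)
    for _ in range(n_rounds):
        # Random feature subspace with between d/2 and d features (assumed range).
        size = rng.randint(max(1, d // 2), d)
        feats = rng.sample(range(d), size)
        # Squared Euclidean distance gives the same neighbor ordering as the distance.
        dist = [sum((p[f] - x[f]) ** 2 for f in feats) for p in train]
        nearest = sorted(range(len(train)), key=lambda i: dist[i])[:k]
        for i in nearest:
            counts[i] += 1
    # Consensus rule: keep points selected in more than half of the rounds.
    return [i for i, c in enumerate(counts) if c > n_rounds / 2]


# Toy data on a diagonal, so the nearest neighbors agree in every subspace.
train = [[float(i)] * 4 for i in range(10)]
print(local_region(train, [0.0, 0.0, 0.0, 0.0], k=3))
```

Each round scans every training point, which is where the time-complexity concern arises, and irrelevant features in the sampled subspace distort the distances, which is the second concern.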

5 Conclusions

In this work, we propose four variants of a novel unsupervised outlier detection framework called Locally Selective Combination in Parallel Outlier Ensembles (LSCP). Unlike traditional combination approaches, LSCP identifies the top-performing base detectors for each test instance relative to its local region. To validate its effectiveness, the proposed framework is assessed on 20 real-world datasets and is found to be superior to baseline algorithms. The ensemble approach LSCP_AOM demonstrates the best performance, achieving the highest detection score on 13/20 datasets with respect to ROC-AUC and 14/20 datasets with respect to mAP. Theoretical considerations under the bias-variance framework and visualizations are also presented to give a holistic overview of the framework. Since LSCP demonstrates the promise of data locality, future work can extend this exploration by investigating the use of heterogeneous base detectors and more reliable pseudo ground truth generation methods.


  • [1] C. C. Aggarwal, Outlier ensembles: position paper, ACM SIGKDD Explorations, 14 (2013), pp. 49–58.
  • [2] C. C. Aggarwal and S. Sathe, Theoretical Foundations and Algorithms for Outlier Ensembles, ACM SIGKDD Explorations, 17 (2015), pp. 24–47.
  • [3] C. C. Aggarwal and S. Sathe, Outlier ensembles: An introduction, Springer, 1st ed., 2017.
  • [4] L. Akoglu, H. Tong, J. Vreeken, and C. Faloutsos, Fast and Reliable Anomaly Detection in Categorical Data, in CIKM, 2012.
  • [5] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, LOF: Identifying Density-Based Local Outliers, ACM SIGMOD, (2000), pp. 1–12.
  • [6] A. S. Britto, R. Sabourin, and L. E. Oliveira, Dynamic selection of classifiers - A comprehensive review, Pattern Recognition, 47 (2014), pp. 3665–3680.
  • [7] G. O. Campos, A. Zimek, and W. Meira, An Unsupervised Boosting Strategy for Outlier Detection Ensembles, PAKDD, (2018), pp. 564–576.
  • [8] V. Chandola, A. Banerjee, and V. Kumar, Anomaly detection: A survey, CSUR, 41 (2009), p. 15.
  • [9] R. M. Cruz, R. Sabourin, and G. D. Cavalcanti, Dynamic classifier selection: Recent advances and perspectives, Information Fusion, 41 (2018), pp. 195–216.
  • [10] S. Das, W.-K. Wong, T. Dietterich, A. Fern, and A. Emmott, Incorporating Expert Feedback into Active Anomaly Discovery, ICDM, (2016), pp. 853–858.
  • [11] J. Demšar, Statistical Comparisons of Classifiers over Multiple Data Sets, JMLR, 7 (2006), pp. 1–30.
  • [12] T. G. Dietterich, Ensemble Methods in Machine Learning, MCS, 1857 (2000), pp. 1–15.
  • [13] A. Emmott, S. Das, T. Dietterich, A. Fern, and W.-k. Wong, A Meta-Analysis of the Anomaly Detection Problem, arXiv preprint, (2015).
  • [14] G. Giacinto and F. Roli, A theoretical framework for dynamic classifier selection, ICPR, 2 (2000), pp. 0–3.
  • [15] T. K. Ho, J. J. Hull, and S. N. Srihari, Decision Combination in Multiple Classifier Systems, TPAMI, 16 (1994), pp. 66–75.
  • [16] A. H. Ko, R. Sabourin, and A. S. Britto, From dynamic classifier selection to dynamic ensemble selection, Pattern Recognition, 41 (2008), pp. 1735–1748.
  • [17] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek, LoOP: local outlier probabilities, CIKM, (2009), pp. 1649–1652.
  • [18] A. Lazarevic and V. Kumar, Feature bagging for outlier detection, ACM SIGKDD, (2005), p. 157.
  • [19] F. T. Liu, K. M. Ting, and Z. H. Zhou, Isolation forest, ICDM, (2008), pp. 413–422.
  • [20] L. v. d. Maaten and G. Hinton, Visualizing data using t-SNE, JMLR, 9 (2008), pp. 2579–2605.
  • [21] B. Micenková, B. McWilliams, and I. Assent, Learning Representations for Outlier Detection on a Budget, arXiv preprint, (2015).
  • [22] S. Rayana and L. Akoglu, An Ensemble Approach for Event Detection and Characterization in Dynamic Graphs, in ACM SIGKDD ODD Workshop, 2014.
  • [23] S. Rayana and L. Akoglu, Less is More: Building Selective Anomaly Ensembles, TKDD, 10 (2016), pp. 1–33.
  • [24] S. Rayana, W. Zhong, and L. Akoglu, Sequential ensemble learning for outlier detection: A bias-variance perspective, ICDM, (2017), pp. 1167–1172.
  • [25] E. Schubert, R. Wojdanowski, A. Zimek, and H.-P. Kriegel, On evaluation of outlier rankings and outlier scores, in SDM, SIAM, 2012, pp. 1047–1058.
  • [26] E. Schubert, A. Zimek, and H. P. Kriegel, Local outlier detection reconsidered: A generalized view on locality with applications to spatial, video, and network outlier detection, DMKD, 28 (2014), pp. 190–237.
  • [27] B. van Stein, M. van Leeuwen, and T. Bäck, Local subspace-based outlier detection using global neighbourhoods, IEEE International Conference on Big Data, (2016), pp. 1136–1142.
  • [28] K. Woods, W. Kegelmeyer, and K. Bowyer, Combination of multiple classifiers using local accuracy estimates, TPAMI, 19 (1997), pp. 405–410.
  • [29] Y. Zhao and M. K. Hryniewicki, DCSO: dynamic combination of detector scores for outlier ensembles, in ACM SIGKDD ODD Workshop, London, UK, 2018.
  • [30] Y. Zhao and M. K. Hryniewicki, XGBOD: Improving Supervised Outlier Detection with Unsupervised Representation Learning, IJCNN, (2018).
  • [31] Y. Zhao, Z. Nasrullah, and Z. Li, PyOD: A python toolbox for scalable outlier detection, arXiv preprint arXiv:1901.01588, (2019).
  • [32] A. Zimek, R. J. G. B. Campello, and J. Sander, Ensembles for unsupervised outlier detection: Challenges and research questions, ACM SIGKDD Explorations, 15 (2014), pp. 11–22.