1 Introduction
Outlier detection methods aim to identify anomalous data objects from the general data distribution and are useful for problems such as credit card fraud prevention and network intrusion detection [8]. Since the ground truth is often absent in outlier mining [1], unsupervised detection methods are commonly used for this task [5, 17, 8]. However, unsupervised approaches are susceptible to generating high false positive and false negative rates [10]. To improve model accuracy and stability in these scenarios, recent research explores ensemble approaches to outlier detection [1, 3, 24, 32]. Ensemble learning combines multiple base estimators to achieve detection performance and reliability superior to that of any individual estimator [12, 15]. It is important to ensure that the combination process is robust because constituent estimators, if synthesized inappropriately, may be detrimental to the predictive capability of an ensemble [22, 23]. Similar to prior work in classification [24], an outlier ensemble may be characterized as parallel if detectors are generated independently; in contrast, it may be considered sequential if the detector generation, selection or combination is iterative.

Model combination is important for parallel ensembles to ensure diversity exists among base detectors; however, existing works have not jointly addressed two key limitations in this process. First, most parallel ensembles generically combine all detectors without considering selection. This limits the benefits of model combination since individual base detectors may not be proficient at identifying all outlier instances [9]. For example, prior work has demonstrated that the value of good detectors can be neutralized by the inclusion of poor detectors in a generic averaging framework [3]. Second, data locality is rarely emphasized in the context of detector selection and combination, leading to potentially suboptimal outcomes. While it is acknowledged that certain types of outliers are better identified by local data relationships [27], detectors are often evaluated at a global scale, where all training points are considered, instead of the local region related to a test instance.
To address the aforementioned limitations, we propose a fully unsupervised framework called Locally Selective Combination in Parallel Outlier Ensembles (LSCP) to selectively combine base detectors by emphasizing data locality. The idea is motivated by an established supervised ensemble framework known as Dynamic Classifier Selection (DCS) [15]. DCS selects the best classifier for each test instance by evaluating base classifier competency at a local scale [9]. The rationale behind this is that base classifiers will generally not excel at categorizing all unknown test instances and that an individual classifier is more likely to specialize in a specific local region [9, 16]. Similarly, LSCP first defines the local region of a test instance by the consensus of the nearest training points in randomly selected feature subspaces, and then identifies the most competent base detector in this local region by measuring similarity relative to a pseudo ground truth (see [7, 23] for examples). To further improve algorithm stability and capacity, ensemble variations of LSCP are proposed where promising base detectors are kept for a second-phase combination instead of using the single most competent detector. As a whole, LSCP is intuitive, stable and effective for unsupervised combination of independent outlier detectors.

Our technical contributions in this paper are:

We propose a novel combination framework which, to the best of our knowledge, is the first published effort to adapt DCS from supervised classification tasks to unsupervised parallel outlier ensembles.

Extensive experiments on 20 real-world datasets show that LSCP consistently yields better performance than existing parallel combination methods.

As a general framework, LSCP is formulated to be compatible with different types of base detectors; we demonstrate its effectiveness with a homogeneous pool of Local Outlier Factor [5] detectors.

We employ various analysis methods to improve model interpretability. First, theoretical explanations and complexity are provided. Second, visualization techniques are used to intuitively explain why LSCP works and when to use it. Third, statistical tests are used to compare experimental results.

Effort has been made to streamline the accessibility of LSCP. All source code, experiment results and figures are shared for reproduction (repository: https://github.com/yzhao062/LSCP). In addition, the framework has been implemented in the Python Outlier Detection (PyOD) toolbox with a unified API and detailed documentation [31].
2 Related Works
2.1 Dynamic Classifier Selection and Dynamic Ensemble Selection.
Dynamic Classifier Selection (DCS) is an established combination framework for classification tasks. The technique was first proposed by Ho et al. in 1994 [15] and then extended, under the name DCS Local Accuracy, by Woods et al. in 1997 [28] to select the most accurate base classifier in a local region. The motivation behind this approach is that base classifiers often make distinctive errors and offer a degree of complementarity [6]. Consequently, selectively combining base classifiers can result in a performance improvement over generic ensembles which use the majority vote of all base classifiers. Subsequent theoretical work by Giacinto and Roli validated that, under certain assumptions, the optimal Bayes classifier could be obtained by selecting non-optimal classifiers [14]. DCS was later expanded by Ko et al. to Dynamic Ensemble Selection (DES), which selects multiple base classifiers for a second-phase combination given each test instance [16]. By minimizing reliance on a single classifier and delegating the classification task to a group of competent classifiers, the algorithm has been demonstrated to be more robust than DCS [16]. Motivated by these approaches, LSCP adapts dynamic selection to unsupervised outlier detection tasks. It should be noted that LSCP is an extension of our prior work on DCSO [29], a recent dynamic detector combination framework. Comparatively, LSCP refines the local region definition process to achieve a more stable combination mechanism.
2.2 Data Locality in Outlier Detection.
The relationship among data objects is critical in outlier detection, and existing algorithms can roughly be categorized as either global or local [17, 24, 26]. The former considers all objects during inference while the latter only considers a local selection of objects [26]. In both cases, their applicability is dependent on the structure of the data. Global outlier detection algorithms, for example, offer superior performance when outliers are highly distinctive from the data distribution [23] but often fail to identify outliers in the local neighborhoods of high-dimensional data [5, 27]. Global models also struggle with data represented by a mixture of distributions, where global characteristics do not necessarily represent the distribution of objects in local regions [27]. To address these limitations, numerous works have explored local algorithms such as Local Outlier Factor (LOF) [5], Local Outlier Probabilities (LoOP) [17] and GLOSS [27]. However, data locality is rarely considered in the context of detector combination; instead, most combination methods utilize all training data points, e.g., the weight calculation in weighted averaging [32]. LSCP explores both global and local data relationships by training base detectors on the entire dataset and emphasizing data locality during detector combination.

2.3 Outlier Detector Combination.
Recently, outlier ensembles have become a popular research area [1, 2, 3, 32], resulting in numerous notable works including: (i) parallel ensembles such as Feature Bagging [18] and Isolation Forest [19]; (ii) sequential methods including CARE [24], SELECT [23] and BoostSelect [7] and (iii) hybrid approaches like BORE [21] and XGBOD [30]. When the ground truth is unavailable, combining outlier models is challenging. Existing unsupervised combination algorithms in parallel ensembles are often generic and global (GG); representative GG methods are described below (see [1, 2, 3, 32, 24, 18] for details):

Averaging (GG_A): average scores of all detectors.

Maximization (GG_M): maximum score across all detectors.

Weighted Averaging (GG_WA): weight each base detector when averaging.

Threshold Sum (GG_TH): discard all scores below a threshold and sum over the remaining scores.

Average-of-Maximum (GG_AOM): divide base detectors into subgroups and take the maximum score for each subgroup. The final score is the average of all subgroup scores.

Maximum-of-Average (GG_MOA): divide base detectors into subgroups and take the average score for each subgroup. The final score is the maximum of all subgroup scores.

Feature Bagging (GG_FB): generate a diversified set of base detectors by training on randomly selected feature subsets. The final score is the average of base detector scores.
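As a concrete reference, the four score-based GG combinations above can be sketched over an outlier score matrix with NumPy. This is a minimal sketch, not the original implementations: the function name, the random equal-size subgroup split and the default of 5 subgroups are illustrative assumptions.

```python
import numpy as np

def gg_combine(scores, n_subgroups=5, seed=0):
    """Sketch of four generic-global (GG) combinations of an
    (n_samples, n_detectors) outlier score matrix. The random equal-size
    subgroup split used for AOM/MOA is an assumption for illustration."""
    rng = np.random.default_rng(seed)
    n_det = scores.shape[1]
    groups = rng.permutation(n_det).reshape(n_subgroups, -1)

    gg_a = scores.mean(axis=1)   # GG_A: average of all detectors
    gg_m = scores.max(axis=1)    # GG_M: maximum across all detectors
    sub_max = np.stack([scores[:, g].max(axis=1) for g in groups], axis=1)
    sub_avg = np.stack([scores[:, g].mean(axis=1) for g in groups], axis=1)
    gg_aom = sub_max.mean(axis=1)  # GG_AOM: average of subgroup maxima
    gg_moa = sub_avg.max(axis=1)   # GG_MOA: maximum of subgroup averages
    return gg_a, gg_m, gg_aom, gg_moa
```

Note that, for equal-size subgroups, the two second-phase variants always fall between the two extremes elementwise: GG_A ≤ GG_AOM ≤ GG_M and GG_A ≤ GG_MOA ≤ GG_M.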
As discussed in §2.2, GG methods ignore the importance of data locality while evaluating and combining detectors, which may be inappropriate given the characteristics of outliers [5, 27]. Moreover, without a selection process, poor detectors may hurt the overall detection performance of an ensemble [23, 24]. All seven GG algorithms are thus included as baselines.
There have been attempts to build selective outlier ensembles sequentially in a boosting style. Rayana and Akoglu introduced SELECT [23] and CARE [24] to iteratively pick promising detectors and exclude underperforming ones, which yielded impressive results on both temporal graphs and multidimensional outlier data. Campos et al. further extended this idea by proposing an unsupervised boosting strategy, BoostSelect, for outlier ensemble selection [7]. As an alternative to sequential selection models, this paper focuses on parallel detector selection, which stresses the importance of data locality. Compared with sequential detector selection methods, our approach can select detectors without iteration, which may reduce computational cost.
3 Algorithm Design
LSCP starts with a group of diversified detectors to be combined. For each test instance, LSCP first defines its local region and then picks the most competent local detector(s). The selected detector(s) are used to generate the outlier score for the test instance. The workflow of all four proposed LSCP methods is shown in Algorithm 1 and Fig. 1.
3.1 Base Detector Generation.
An effective ensemble should be constructed with diversified base estimators [24, 32] to promote learning distinct characteristics in the data. With a group of homogeneous base detectors, diversity can be induced by subsampling the training set and feature space, or by varying model hyperparameters [6, 32]. In this study, we demonstrate the effectiveness of LSCP by using distinct hyperparameters to construct a pool of models with the same base algorithm. However, in practice, LSCP can also be used as a general framework with heterogeneous base detectors.

Let X_train denote the training data with n points and d features, and X_test denote a test set with m points. The algorithm first generates a pool of r base detectors initialized with a range of hyperparameters, e.g., a group of LOF detectors with distinct numbers of neighbors [5]. All base detectors are first trained on X_train and then inference is performed on the same dataset. The results are combined into an n-by-r outlier score matrix O(X_train), formalized in Eq. (1), where C_j(X_train) denotes the score vector from the j-th base detector. Each detector score is normalized using Z-normalization as per prior work [2, 32].

O(X_train) = [C_1(X_train), C_2(X_train), ..., C_r(X_train)]    (1)
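This step can be sketched with scikit-learn's LocalOutlierFactor as a stand-in for the paper's LOF base detectors; the helper name and the per-column Z-normalization layout are our assumptions.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def train_score_matrix(X_train, k_list):
    """Train one LOF detector per n_neighbors value in k_list on X_train and
    return the Z-normalized n-by-r training score matrix O(X_train)."""
    cols = []
    for k in k_list:
        lof = LocalOutlierFactor(n_neighbors=k)
        lof.fit(X_train)
        # negate so that larger values mean "more outlying"
        cols.append(-lof.negative_outlier_factor_)
    O = np.column_stack(cols)
    # Z-normalize each detector's column so detectors are comparable
    return (O - O.mean(axis=0)) / O.std(axis=0)
```

After this call, each column has zero mean and unit variance, so scores from different detectors can be averaged or compared directly.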
3.2 Pseudo Ground Truth Generation.
Since LSCP evaluates detector competency without ground truth labels, two methods are used for generating a pseudo ground truth vector (denoted target) from O(X_train): (i) LSCP_A: the average of base detector scores and (ii) LSCP_M: the maximum score across detectors. This is generalized in Eq. (2), where f represents the aggregation (average or maximum) taken across all base detectors.

target = f(O(X_train)), f ∈ {mean, max}    (2)
It should be noted that the pseudo ground truth in LSCP is generated using training data and used solely for detector selection.
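The two generation rules above amount to a one-line reduction over the score matrix; the function name below is ours, chosen for illustration.

```python
import numpy as np

def pseudo_ground_truth(O_train, method="average"):
    """Sketch of Eq. (2): collapse the n-by-r training score matrix into a
    single pseudo target vector, by averaging across detectors (LSCP_A)
    or by taking the row-wise maximum (LSCP_M)."""
    if method == "average":
        return O_train.mean(axis=1)
    return O_train.max(axis=1)
```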
3.3 Local Region Definition.
The local region ψ_i of a test instance x_i is defined as the set of its k nearest training objects. Formally, this is denoted as:

ψ_i = {x_j | x_j ∈ ens(kNN(x_i)), x_j ∈ X_train}    (3)

where ens(kNN(·)) describes the set of a test instance's nearest neighbours subject to an ensemble criterion. This variation of kNN, which is similar to Feature Bagging [18], is proposed to alleviate concerns involving the curse of dimensionality on kNN [4] while leveraging its better precision compared to the clustering algorithms used in DCS [9]. The process is as follows: (i) t groups of features are randomly selected to construct t new feature spaces, (ii) the k nearest training objects to x_i in each group are identified using Euclidean distance and (iii) training objects that appear more than t/2 times are added to ψ_i, thus defining the local region. The size of the region is not fixed because it is dependent on the number of training objects that meet the selection criteria.

The local region factor k decides the number of nearest neighbors to consider during this process; care is given to avoid selecting extreme values. Smaller values of k give more attention to local relationships, which can result in instability, while large values of k may place too much emphasis on global relationships and have higher computational costs. While it is possible to experimentally determine an optimal k with cross-validation [16] when ground truth is available, a similarly trivial approach does not exist in an unsupervised setting. For these reasons, we recommend setting k to 10% of the training samples, bounded below and above to avoid extremes, which yielded good results in practice.
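The three-step procedure can be sketched as follows. This is a simplified illustration: the subspace size (half of the features) and the default of t = 10 iterations are our assumptions, and the consensus threshold follows the more-than-t/2 rule described above.

```python
import numpy as np

def local_region(X_train, x_test, k, t=10, seed=0):
    """Sketch of the local region definition: run kNN in t randomly selected
    feature subspaces and keep the training points that appear in more than
    t/2 of the neighbour sets."""
    rng = np.random.default_rng(seed)
    n, d = X_train.shape
    counts = np.zeros(n, dtype=int)
    for _ in range(t):
        feats = rng.choice(d, size=max(1, d // 2), replace=False)
        dist = np.linalg.norm(X_train[:, feats] - x_test[feats], axis=1)
        counts[np.argsort(dist)[:k]] += 1  # k nearest in this subspace
    return np.flatnonzero(counts > t // 2)  # indices defining psi_i
```

Note that the returned region size is not fixed: it depends on how consistently the same training points rank among the k nearest across subspaces.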
3.4 Model Selection and Combination.
For each test instance x_i, the local pseudo ground truth target_ψ can be obtained by retrieving the values associated with the local region ψ_i from target:

target_ψ = [target_j], ∀ x_j ∈ ψ_i    (4)

where |ψ_i| denotes the cardinality of ψ_i. Similarly, the local training outlier scores can be retrieved from the precalculated training score matrix O(X_train) as:

O(ψ_i) = [C_1(ψ_i), C_2(ψ_i), ..., C_r(ψ_i)] ∈ R^{|ψ_i| × r}    (5)
Consequently, although the local region needs to be recomputed for each test instance, the local outlier scores and targets can be efficiently retrieved from precalculated values (see Fig. 1).
For evaluating base estimator competency in a local region, DCS measures the accuracy of base classifiers as the percentage of correctly classified points [16], while LSCP instead measures the similarity between base detector scores and the pseudo target. This distinction is motivated by the lack of direct and reliable ways to access binary labels in unsupervised outlier mining. Although converting pseudo outlier scores to binary labels is feasible, defining an accurate conversion threshold is challenging. Additionally, since imbalanced datasets are common in outlier detection tasks, it is more stable to use similarity measures over absolute accuracy for competency evaluation. Therefore, LSCP measures the local competency of each base detector by the Pearson correlation, which has proven useful in outlier ensemble model combination [25], between the local pseudo ground truth target_ψ and the local detector scores C_j(ψ_i). The detector with the highest similarity is regarded as the most competent local detector for x_i, and its outlier score can be considered the final score for the corresponding test sample.
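The competency evaluation reduces to a column-wise correlation against the local pseudo target; the helper below is a sketch with an assumed name and return convention.

```python
import numpy as np

def most_competent_detector(O_local, target_local):
    """Sketch: score each of the r detectors by the Pearson correlation
    between its column of local training scores (|psi|-by-r matrix) and the
    local pseudo ground truth; return the index of the most competent one."""
    r = O_local.shape[1]
    corrs = np.array([np.corrcoef(O_local[:, j], target_local)[0, 1]
                      for j in range(r)])
    return int(np.argmax(corrs)), corrs
```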
3.5 Dynamic Outlier Ensemble Selection.
Selecting only one detector, even if it is most similar to the pseudo ground truth, can be risky in unsupervised learning. This risk can be mitigated by selecting a group of detectors for a second-phase combination. This idea can be viewed as an adaptation of supervised DES [16] to outlier detection tasks; correspondingly, we introduce ensemble variations of LSCP which employ Maximum-of-Average (LSCP_MOA) and Average-of-Maximum (LSCP_AOM) ensembling methods. Specifically, when the pseudo ground truth is generated by averaging, LSCP_MOA selects a subset of competent detectors in the local region of a test instance and then takes the maximum of their predictions as the outlier score. Inversely, LSCP_AOM computes the average of the selected subset when the pseudo target is generated by maximization. Setting the group size of selected detectors equal to 1 is a special case of the ensembles, yielding the original LSCP algorithms (LSCP_A and LSCP_M). Larger group sizes may be considered more global in their detector selection, while a group size equal to the number of base detectors results in a fully global algorithm. In response to this, we recommend a group size selection process which includes some variance. Specifically, a histogram of detector Pearson correlation scores (to the pseudo ground truth) is built with b equal intervals, and the detectors belonging to the most frequent interval are kept for the second-phase combination. A large b results in selecting fewer detectors, which flexibly controls the strength of the group size in LSCP ensembles.

The time complexity for training each base detector and generating the pseudo ground truth is dependent on the underlying model and number of training samples. However, since this work proposes a combination framework, our discussion focuses on the overhead introduced during the combination stage. With an appropriate implementation of LSCP_A and LSCP_M, e.g., using a k-d tree, the additional cost of defining the local region of each test instance comes from the distance calculations over the training points and from summation and sorting [16]. While defining the local region requires multiple iterations, the number of iterations is fixed and is thus excluded from the complexity analysis. Combining the base detectors in LSCP_MOA and LSCP_AOM requires only an additional pass over the selected detectors' scores.
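The histogram-based group selection and the second-phase combination can be sketched as below; the function name and argument layout are illustrative assumptions, and the inclusive upper edge is a simplification of NumPy's half-open bins.

```python
import numpy as np

def second_phase_score(corrs, test_scores, b=10, method="aom"):
    """Sketch: histogram the r detectors' Pearson correlations into b equal
    intervals, keep the detectors in the most frequent interval, then average
    their test scores (LSCP_AOM) or take their maximum (LSCP_MOA)."""
    hist, edges = np.histogram(corrs, bins=b)
    i = int(np.argmax(hist))
    keep = (corrs >= edges[i]) & (corrs <= edges[i + 1])
    kept = np.asarray(test_scores)[keep]
    return kept.max() if method == "moa" else kept.mean()
```

With b = 1 every detector falls in the same interval, recovering a fully global combination; a larger b narrows each interval and hence keeps fewer detectors.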
3.6 Theoretical Considerations.
Recently, Aggarwal and Sathe laid the theoretical foundation for outlier ensembles [2] using the bias-variance tradeoff, a widely used framework for analyzing generalization error in classification problems. The reducible generalization error in outlier ensembles may be minimized by reducing either squared bias or variance, where a tradeoff between these two sources of error usually exists. A high-variance detector is sensitive to data variation and unstable; a high-bias detector is less sensitive to data variation but may fit complex data poorly. The goal of outlier ensembles is to control both bias and variance to reduce the overall generalization error. Various newly proposed algorithms have been analyzed under this framework to enhance interpretability [23, 24, 30].
It has been shown that combining diversified base detectors, by averaging them for example, results in variance reduction [23, 24, 2]. However, a combination of all base detectors may also include inaccurate ones, leading to higher bias. This explains why generic global averaging does not work well. Within Aggarwal's bias-variance framework, LSCP achieves a combination of both variance and bias reduction. It induces diversity by initializing various base detectors with distinct hyperparameters and indirectly promotes variance reduction in the way the pseudo ground truth is generated, e.g., averaging in LSCP_A. Furthermore, LSCP focuses on detector selection by local competency, which helps identify base detectors with conditionally low model bias. LSCP_M is also expected to be more stable than global maximization (GG_M) since variance is reduced by using the most competent detector's output rather than the global maximum of all base detectors. LSCP_MOA and LSCP_AOM
further decrease generalization error through bias and variance reduction, respectively, in their secondphase combination. However, LSCP is still a heuristic framework and can yield unpredictable results on pathological datasets.
4 Numerical Experiments
4.1 Datasets and Evaluation Metrics.
Table 1 summarizes the 20 public outlier detection benchmark datasets used in this study from the ODDS Library (http://odds.cs.stonybrook.edu) and the DAMI datasets (http://www.dbs.ifi.lmu.de/research/outlierevaluation/DAMI). In each experiment, 60% of the data is used for training and the remaining 40% is set aside for validation. Performance is evaluated by taking the average score of 30 independent trials using area under the receiver operating characteristic curve (ROC-AUC) and mean average precision (mAP). Both metrics are widely used in outlier research [3, 4, 21, 23, 30, 13], and statistical measures are used to analyze the results [11]. Specifically, we use a nonparametric Friedman test followed by a post-hoc Nemenyi test to assess statistical significance.
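For reference, both per-trial metrics can be computed with scikit-learn; the toy labels and scores below are illustrative only, not taken from the experiments.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Toy example: 0 = inlier, 1 = outlier; higher score = more outlying.
y_true = np.array([0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.3, 0.2, 0.8, 0.7])

roc = roc_auc_score(y_true, y_score)           # ranking quality of the scores
ap = average_precision_score(y_true, y_score)  # average precision, one trial
# The paper's mAP is the mean of such AP values over 30 independent trials.
```

Here both outliers are ranked above every inlier, so both metrics reach their maximum of 1.0.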
Dataset  Pts  Dim  Outliers  %Outlier 
Annthyroid  7200  6  534  7.41 
Arrhythmia  452  274  66  14.60 
Breastw  683  9  239  34.99 
Cardio  1831  21  176  9.61 
Letter  1600  32  100  6.25 
MNIST  7603  100  700  9.21 
Musk  3062  166  97  3.17 
PageBlocks  5393  10  510  9.46 
Pendigits  6870  16  156  2.27 
Pima  768  8  268  34.90 
Satellite  6435  36  2036  31.64 
Satimage2  5803  36  71  1.22 
Shuttle  49097  9  3511  7.15 
SpamSpace  4207  57  1679  39.91 
Stamps  340  9  31  9.12 
Thyroid  3772  6  93  2.47 
Vertebral  240  6  30  12.50 
Vowels  1456  12  50  3.43 
WBC  378  30  21  5.56 
Wilt  4819  5  257  5.33 
4.2 Experimental Design.
This study compares the seven GG algorithms introduced in §2.3 with the four proposed LSCP variations described in Algorithm 1. All models use a pool of 50 LOF base detectors, ensuring consistency during performance evaluation. To induce diversity among base detectors, distinct initialization hyperparameters, specifically the number of neighbors used in each LOF detector, are randomly selected from a predefined range. It is noted that a narrower range is taken for certain datasets due to limitations associated with size or computational cost. For GG_AOM and GG_MOA, the base detectors are divided into 5 subgroups and each group contains 10 base detectors selected without replacement. For Feature Bagging (GG_FB), the base detector is randomly chosen from the same pool of LOF detectors and then trained on 50 individual subsamples with randomly selected features. For all LSCP algorithms, the default hyperparameters mentioned in §3 are used.
Dataset  LSCP_ A  LSCP_ MOA  LSCP_ M  LSCP_ AOM  GG_ A  GG_ MOA  GG_ M  GG_ AOM  GG_ WA  GG_ TH  GG_ FB 
Annthyroid  0.7548  0.7590  0.7849  0.7520  0.7642  0.7660  0.7769  0.7730  0.7632  0.7552  0.7854 
Arrhythmia  0.7746  0.7715  0.7729  0.7763  0.7758  0.7749  0.7656  0.7690  0.7758  0.7313  0.7709 
Breastw  0.6553  0.7044  0.7236  0.7845  0.7362  0.7140  0.6590  0.6838  0.7453  0.6285  0.3935 
Cardio  0.8691  0.8908  0.8491  0.9013  0.8770  0.8865  0.8798  0.8903  0.8782  0.8830  0.8422 
Letter  0.7818  0.7954  0.8361  0.7867  0.7925  0.8031  0.8434  0.8300  0.7908  0.8001  0.7640 
MNIST  0.8576  0.8623  0.7812  0.8633  0.8557  0.8588  0.8349  0.8553  0.8563  0.8272  0.8468 
Musk  0.9950  0.9970  0.9931  0.9981  0.9937  0.9960  0.9960  0.9970  0.9953  0.9958  0.7344 
PageBlocks  0.9349  0.9343  0.8687  0.9488  0.9443  0.9440  0.9240  0.9371  0.9453  0.9418  0.9284 
Pendigits  0.8238  0.8656  0.7238  0.8744  0.8378  0.8509  0.8488  0.8622  0.8425  0.8548  0.8034 
Pima  0.7059  0.6991  0.6640  0.7061  0.7030  0.7003  0.6730  0.6856  0.7037  0.6349  0.6989 
Satellite  0.5814  0.6106  0.6006  0.6015  0.5881  0.5992  0.6258  0.6220  0.5876  0.6101  0.5818 
Satimage2  0.9852  0.9931  0.9878  0.9935  0.9872  0.9907  0.9909  0.9925  0.9880  0.9881  0.9181 
Shuttle  0.5392  0.5551  0.5373  0.5514  0.5439  0.5504  0.5612  0.5602  0.5413  0.5561  0.3702 
SpamSpace  0.3792  0.4594  0.4305  0.4744  0.4487  0.4377  0.4060  0.4128  0.4580  0.4104  0.3312 
Stamps  0.8888  0.8719  0.8525  0.8985  0.8946  0.8927  0.8559  0.8763  0.8953  0.8904  0.8715 
Thyroid  0.9579  0.9624  0.9413  0.9700  0.9656  0.9647  0.9385  0.9510  0.9665  0.9644  0.8510 
Vertebral  0.3324  0.3662  0.4306  0.3478  0.3433  0.3467  0.3662  0.3614  0.3442  0.3678  0.3385 
Vowels  0.9276  0.9185  0.9238  0.9199  0.9265  0.9275  0.9313  0.9271  0.9261  0.9299  0.9148 
WBC  0.9379  0.9344  0.9242  0.9451  0.9421  0.9409  0.9321  0.9367  0.9420  0.9314  0.9407 
Wilt  0.5275  0.5517  0.6550  0.4286  0.5101  0.5358  0.6384  0.6056  0.5037  0.5586  0.5868 
4.3 Algorithm Performances.
Tables 2 and 3 summarize the ROC-AUC and mAP scores on the 20 datasets. Our experiments demonstrate that LSCP can bring consistent performance improvement over its GG counterparts, which is especially noticeable in the mAP scores. The Friedman test shows there is a statistically significant difference among the 11 algorithms in both ROC-AUC and mAP; however, the Nemenyi test fails to identify which pairs of algorithms are significantly different. The latter result is expected in an unsupervised setting due to the difficulty of this task relative to the limited number of datasets [11]. In general, LSCP algorithms show great potential: they achieve the highest ROC-AUC scores on 15 datasets and the highest mAP scores on 18 datasets. GG_M yields higher ROC-AUC on four datasets while GG_FB and GG_AOM perform better on Annthyroid in terms of ROC-AUC and mAP respectively. In all other cases, GG algorithms are outperformed by a variant of LSCP. Specifically, LSCP_AOM is the best performing method and ranks highest on 13 datasets in terms of ROC-AUC and 14 datasets in terms of mAP. It should be noted that GG methods with a second-phase combination (GG_MOA and GG_AOM) demonstrate better performance than GG_A and better stability than GG_M. For instance, GG_M has lower mAP than GG_AOM on 15 datasets. These observations agree with the conclusions in Aggarwal's work [2, 3].
Dataset  LSCP_ A  LSCP_ MOA  LSCP_ M  LSCP_ AOM  GG_ A  GG_ MOA  GG_ M  GG_ AOM  GG_ WA  GG_ TH  GG_ FB 
Annthyroid  0.2283  0.2375  0.2349  0.2453  0.2301  0.2395  0.2413  0.2516  0.2306  0.2277  0.1864 
Arrhythmia  0.3780  0.3744  0.3790  0.3796  0.3766  0.3769  0.3690  0.3722  0.3766  0.3468  0.3707 
Breastw  0.4334  0.4766  0.4728  0.5655  0.4995  0.4849  0.4249  0.4577  0.5085  0.4366  0.2854 
Cardio  0.3375  0.3960  0.3197  0.4117  0.3516  0.3708  0.3666  0.3864  0.3535  0.3629  0.3643 
Letter  0.2302  0.2396  0.3346  0.2407  0.2388  0.2473  0.3160  0.2867  0.2372  0.2416  0.2193 
MNIST  0.3933  0.3974  0.3353  0.3979  0.3911  0.3941  0.3701  0.3896  0.3918  0.3836  0.3928 
Musk  0.8478  0.8773  0.8433  0.9240  0.8245  0.8718  0.8479  0.8806  0.8608  0.8629  0.5806 
PageBlocks  0.5805  0.5707  0.4684  0.6360  0.6043  0.6016  0.5297  0.5733  0.6077  0.6064  0.6094 
Pendigits  0.0709  0.0893  0.0625  0.0944  0.0777  0.0823  0.0834  0.0895  0.0780  0.0832  0.0834 
Pima  0.5092  0.5045  0.4716  0.5142  0.5089  0.5054  0.4813  0.4920  0.5095  0.4599  0.5094 
Satellite  0.4077  0.4268  0.4223  0.4196  0.4047  0.4139  0.4385  0.4352  0.4047  0.4031  0.4049 
Satimage2  0.3477  0.6248  0.3994  0.6249  0.3959  0.5089  0.5344  0.5922  0.4159  0.4114  0.4851 
Shuttle  0.1228  0.1296  0.1167  0.1330  0.1297  0.1316  0.1239  0.1294  0.1293  0.1316  0.0549 
SpamSpace  0.3326  0.3615  0.3592  0.3665  0.3572  0.3521  0.3379  0.3413  0.3612  0.3601  0.3079 
Stamps  0.3596  0.3310  0.3193  0.3779  0.3694  0.3660  0.3144  0.3387  0.3706  0.3638  0.3535 
Thyroid  0.3544  0.3955  0.2638  0.4651  0.4045  0.4123  0.2850  0.3488  0.4130  0.4071  0.1186 
Vertebral  0.0948  0.1020  0.1230  0.0988  0.0971  0.0975  0.1029  0.1000  0.0972  0.1067  0.0965 
Vowels  0.3913  0.3678  0.3482  0.3539  0.3783  0.3790  0.3760  0.3732  0.3784  0.3783  0.3340 
WBC  0.6033  0.5983  0.5472  0.6131  0.6097  0.6069  0.5579  0.5925  0.6105  0.6045  0.5933 
Wilt  0.0518  0.0557  0.0770  0.0423  0.0493  0.0523  0.0715  0.0633  0.0486  0.0537  0.0591 
LSCP_A and LSCP_M do not demonstrate strong performance relative to their GG counterparts. Given that both pseudo ground truth generation methods are heuristic, they may result in poor local competency evaluation. For example, as discussed in §3.6, LSCP_A theoretically benefits from both variance and bias reduction by averaging and focusing on locality. In practice though, by only selecting the single most competent detector, this approach may yield weaker variance reduction than GG_A, which uses all detector scores. As a consequence, the variance reduction may not sufficiently offset the bias inherent to the pseudo ground truth generation process, leading to diminished performance. Comparatively, when the ground truth is generated by taking the maximum among multiple detectors, LSCP_M exhibits unstable behaviour similar to GG_M. LSCP_M only outperforms LSCP_AOM on Letter, Vertebral and Wilt in terms of both ROC-AUC and mAP. As discussed in [2, 3], selecting maximum scores across detectors yields high model variance, which explains these results; a second-phase combination may mitigate this risk.
A Friedman test confirms there is a significant difference among the four LSCP algorithms in both ROC-AUC and mAP. Correspondingly, the LSCP ensemble variation LSCP_AOM shows promise. Building on GG_AOM's success as one of the most effective combination methods [3], LSCP_AOM averages outlier scores within the selected group of detectors, which can be viewed as an additional reduction of model variance over LSCP_M. Moreover, LSCP_AOM's concentration on local competency evaluation may improve model bias in addition to the variance reduction from the second-phase averaging, leading to better accuracy. The results show that LSCP_AOM outperforms all models on 13 datasets in terms of ROC-AUC and 14 datasets in terms of mAP. The latter improvement over GG methods is especially considerable on Breastw, Cardio, Satimage2 and Thyroid.
The benefit of the second-phase combination is less pronounced for LSCP_MOA, which does not outperform LSCP_A or GG_MOA. As discussed in [2, 3], it is less effective to perform a second-phase combination after averaging because information has already been lost to blunting. The experimental results confirm that, in an LSCP scheme, the benefit of a second-phase maximization cannot offset the information loss caused by the initial averaging. Overall, only LSCP_AOM is recommended for detector combination due to its combined bias and variance reduction capability.
4.4 Visualization Analysis.
Figure 2 visually compares the performance of the best performing GG and LSCP methods on Cardio, Thyroid and Letter using t-Distributed Stochastic Neighbor Embedding (t-SNE) [20]. The green and blue markers highlight objects that can only be correctly classified by GG and LSCP methods respectively, to emphasize the mutual exclusivity of the two approaches. The visualizations of Cardio (left) and Thyroid (middle) illustrate that LSCP methods have an edge over GG methods in detecting local outliers when they cluster together (highlighted by red dotted circles in Fig. 2). Additionally, LSCP methods can contribute to classifying both outlying and normal points when locality is present in the data. However, the outlying data distribution in Letter (right) is more dispersed: outliers do not form local clusters but rather mix with normal points. This causes LSCP to perform worse than GG_M in terms of ROC-AUC despite showing an improvement in terms of mAP. Based on these visualizations, one could assume that LSCP is useful when outlying and normal objects are well separated, but less effective when they are interleaved. Additionally, this suggests that the size of LSCP's local region should be informed by the estimated proportion of outliers in the dataset. For instance, outliers account for only 3.43% and 6.25% of Vowels and Letter respectively, which may not be sufficient to form outlier clusters when the local region size is set to 10% of the training data. A smaller local region size is more appropriate when a small number of outliers is assumed.
4.5 Limitations and Future Directions.
Firstly, the local region definition in LSCP depends on finding nearest neighbors by Euclidean distance. This approach is not ideal due to: (i) high time complexity and (ii) degraded performance when many irrelevant features are present in high-dimensional space [4]. It may be improved by using prototype selection [9] or by defining the local region with advanced clustering methods [9]. Secondly, only simple pseudo ground truth generation methods (averaging and maximization) are explored in this study; more accurate methods should be considered, such as actively pruning base detectors [23] and greedy model combination [25]. Lastly, DCS has proven to work with heterogeneous base classifiers in classification problems [16, 9], but this remains to be verified for LSCP. A more significant performance improvement is expected for the LSCP framework when the base detectors are more diversified.
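The Euclidean nearest-neighbor definition of the local region that this limitation refers to can be sketched as below. This is a simplified stand-in, not the framework's full procedure (function name and parameters are assumptions); it also makes the cost concern concrete, since each query scans all training points in all dimensions:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_region(X_train, x_test, k=10):
    """Return indices of the k training points nearest to the test
    instance under Euclidean distance -- the region on which base
    detectors would be evaluated and selected for that instance."""
    nn = NearestNeighbors(n_neighbors=k, metric="euclidean").fit(X_train)
    _, idx = nn.kneighbors(x_test.reshape(1, -1))
    return idx.ravel()
```

Because every feature contributes equally to the distance, irrelevant dimensions dilute the neighborhood, which is exactly the degradation in high-dimensional space noted above.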
5 Conclusions
In this work, we propose four variants of a novel unsupervised outlier detection framework called Locally Selective Combination in Parallel Outlier Ensembles (LSCP). Unlike traditional combination approaches, LSCP identifies the top-performing base detectors for each test instance relative to its local region. To validate its effectiveness, the proposed framework is assessed on 20 real-world datasets and is found to be superior to baseline algorithms. The ensemble approach LSCP_AOM demonstrates the best performance, achieving the highest detection score on 13/20 datasets with respect to ROC-AUC and 14/20 datasets with respect to mAP. Theoretical considerations under the bias-variance framework and visualizations are also provided to give a holistic overview of the framework. Since LSCP demonstrates the promise of data locality, future work can extend this exploration by investigating the use of heterogeneous base detectors and more reliable pseudo ground truth generation methods.
References
 [1] C. C. Aggarwal, Outlier ensembles: position paper, ACM SIGKDD Explorations, 14 (2013), pp. 49–58.
 [2] C. C. Aggarwal and S. Sathe, Theoretical Foundations and Algorithms for Outlier Ensembles, ACM SIGKDD Explorations, 17 (2015), pp. 24–47.
 [3] C. C. Aggarwal and S. Sathe, Outlier ensembles: An introduction, Springer, 1st ed., 2017.
 [4] L. Akoglu, H. Tong, J. Vreeken, and C. Faloutsos, Fast and Reliable Anomaly Detection in Categorical Data, CIKM, (2012).
 [5] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, LOF: Identifying Density-Based Local Outliers, ACM SIGMOD, (2000), pp. 1–12.
 [6] A. S. Britto, R. Sabourin, and L. E. Oliveira, Dynamic selection of classifiers: a comprehensive review, Pattern Recognition, 47 (2014), pp. 3665–3680.
 [7] G. O. Campos, A. Zimek, and W. Meira, An Unsupervised Boosting Strategy for Outlier Detection Ensembles, PAKDD, (2018), pp. 564–576.
 [8] V. Chandola, A. Banerjee, and V. Kumar, Anomaly detection: A survey, CSUR, 41 (2009), p. 15.
 [9] R. M. Cruz, R. Sabourin, and G. D. Cavalcanti, Dynamic classifier selection: Recent advances and perspectives, Information Fusion, 41 (2018), pp. 195–216.
 [10] S. Das, W.-K. Wong, T. Dietterich, A. Fern, and A. Emmott, Incorporating Expert Feedback into Active Anomaly Discovery, ICDM, (2016), pp. 853–858.
 [11] J. Demšar, Statistical Comparisons of Classifiers over Multiple Data Sets, JMLR, 7 (2006), pp. 1–30.
 [12] T. G. Dietterich, Ensemble Methods in Machine Learning, MCS, 1857 (2000), pp. 1–15.
 [13] A. Emmott, S. Das, T. Dietterich, A. Fern, and W.-K. Wong, A Meta-Analysis of the Anomaly Detection Problem, arXiv preprint, (2015).
 [14] G. Giacinto and F. Roli, A theoretical framework for dynamic classifier selection, ICPR, 2 (2000), pp. 0–3.
 [15] T. K. Ho, J. J. Hull, and S. N. Srihari, Decision Combination in Multiple Classifier Systems, TPAMI, 16 (1994), pp. 66–75.
 [16] A. H. Ko, R. Sabourin, and A. S. Britto, From dynamic classifier selection to dynamic ensemble selection, Pattern Recognition, 41 (2008), pp. 1735–1748.
 [17] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek, LoOP: local outlier probabilities, CIKM, (2009), pp. 1649–1652.
 [18] A. Lazarevic and V. Kumar, Feature bagging for outlier detection, ACM SIGKDD, (2005), p. 157.
 [19] F. T. Liu, K. M. Ting, and Z. H. Zhou, Isolation forest, ICDM, (2008), pp. 413–422.
 [20] L. van der Maaten and G. Hinton, Visualizing data using t-SNE, JMLR, 9 (2008), pp. 2579–2605.
 [21] B. Micenková, B. McWilliams, and I. Assent, Learning Representations for Outlier Detection on a Budget, arXiv preprint, (2015).
 [22] S. Rayana and L. Akoglu, An Ensemble Approach for Event Detection and Characterization in Dynamic Graphs, in ACM SIGKDD ODD Workshop, 2014.
 [23] S. Rayana and L. Akoglu, Less is More: Building Selective Anomaly Ensembles, TKDD, 10 (2016), pp. 1–33.
 [24] S. Rayana, W. Zhong, and L. Akoglu, Sequential ensemble learning for outlier detection: A bias-variance perspective, ICDM, (2017), pp. 1167–1172.
 [25] E. Schubert, R. Wojdanowski, A. Zimek, and H.-P. Kriegel, On evaluation of outlier rankings and outlier scores, SDM, (2012), pp. 1047–1058.
 [26] E. Schubert, A. Zimek, and H. P. Kriegel, Local outlier detection reconsidered: A generalized view on locality with applications to spatial, video, and network outlier detection, DMKD, 28 (2014), pp. 190–237.
 [27] B. van Stein, M. van Leeuwen, and T. Bäck, Local subspace-based outlier detection using global neighbourhoods, IEEE International Conference on Big Data, (2016), pp. 1136–1142.
 [28] K. Woods, W. Kegelmeyer, and K. Bowyer, Combination of multiple classifiers using local accuracy estimates, TPAMI, 19 (1997), pp. 405–410.
 [29] Y. Zhao and M. K. Hryniewicki, DCSO: Dynamic Combination of Detector Scores for Outlier Ensembles, ACM SIGKDD ODD Workshop, (2018).
 [30] Y. Zhao and M. K. Hryniewicki, XGBOD: Improving Supervised Outlier Detection with Unsupervised Representation Learning, IJCNN, (2018).
 [31] Y. Zhao, Z. Nasrullah, and Z. Li, PyOD: A python toolbox for scalable outlier detection, arXiv preprint arXiv:1901.01588, (2019).
 [32] A. Zimek, R. J. G. B. Campello, and J. Sander, Ensembles for unsupervised outlier detection: Challenges and research questions, ACM SIGKDD Explorations, 15 (2014), pp. 11–22.