A Julia package for One-Class Active Learning.
Active learning stands for methods which increase classification quality by means of user feedback. An important subcategory is active learning for one-class classifiers, i.e., for imbalanced class distributions. While various methods in this category exist, selecting one for a given application scenario is difficult. This is because existing methods rely on different assumptions, have different objectives, and often are tailored to a specific use case. All this calls for a comprehensive comparison, the topic of this article. This article starts with a categorization of the various methods. We then propose ways to evaluate active learning results. Next, we run extensive experiments to compare existing methods, for a broad variety of scenarios. One result is that the practicality and the performance of an active learning method strongly depend on its category and on the assumptions behind it. Another observation is that there only is a small subset of our experiments where existing approaches outperform random baselines. Finally, we show that a well-laid-out categorization and a rigorous specification of assumptions can facilitate the selection of a good method for one-class classification.
Scripts and notebooks to benchmark one-class active learning strategies.
A Julia package for Support Vector Data Description.
Active learning involves users in machine learning tasks by asking for ancillary information, such as class labels. Naturally, providing such information requires time and intellectual effort of the users. To allocate these resources efficiently, active learning employs query selection to identify observations for feedback that are likely to benefit classifier training. In this article, we focus on active learning for outlier detection where so-called one-class classifiers learn to discern between objects from a majority class and unusual observations. Examples are network security (Görnitz et al., 2009; Stokes et al., 2008) or fault monitoring (Yin et al., 2018) where unusual observations like breaches or catastrophic failures are rare to non-existent.
The imbalance between majority-class observations and outliers has important implications on active learning. Well-established concepts for query selection, like the margin between two classes, are no longer applicable. This has motivated specific research on one-class active learning (Barnabé-Lortie et al., 2015; Ghasemi et al., 2011b, a; Juszczak and Duin, 2003b; Görnitz et al., 2013). However, as we will show, query selection methods proposed for one-class classifiers differ in their objectives and in the assumptions behind them, and not all of them are suited for outlier detection. For instance, outliers do not follow a joint distribution, i.e., different outliers may be from different classes. So active learning methods that rely on density estimation for the minority class are inadequate. This distinguishes outlier detection from other applications of one-class classification, like collaborative filtering (Pan et al., 2008).
In addition, evaluation of active learning may lack reliability and comparability (Kottke et al., 2017), in particular with one-class classification. Evaluations often are use-case specific, and there is no standard way to report results. This makes it difficult to identify a learning method suitable for a certain use case, and to assess novel contributions in this field. These observations give way to the following questions, which we study in this article:
What may be a good categorization of learning objectives and assumptions behind one-class active learning?
How to evaluate one-class active learning, in a standardized way?
Which active learning methods perform well with outlier detection?
Answering these questions is difficult for two reasons. First, we are not aware of any existing categorization of learning objectives and assumptions. To illustrate, a typical learning objective is to improve the accuracy of the classifier. Another, different learning objective is to present a high share of observations from the minority class to the user for feedback (Das et al., 2016). In general, active learning methods may perform differently with different learning objectives. Next, assumptions limit the applicability of active learning methods. For instance, a common assumption is that some labeled observations are already available before active learning starts. Naturally, methods that rely on this assumption are only applicable if such labels indeed exist. So knowing the range of objectives and assumptions is crucial to assess one-class active learning. Related work however tends to omit respective specifications. We deem this one reason why no overview article or categorization is available so far that could serve as a reference point.
Second, there is no standard to report active learning results. The reason is that “quality” can have several meanings with active learning, as we now explain.
Figure 1 is a progress curve. Such curves are often used to compare active learning methods. The y-axis shows the values of a metric for classification quality, such as the true-positive rate. The x-axis shows the progress of active learning, such as the percentage of observations for which the user has provided a label. Figure 1 plots two active learning methods A and B from an initial state $t_{\text{init}}$ to the final iteration $t_{\text{end}}$. Both methods apparently have different strengths. A yields better quality at $t_{\text{init}}$, while B improves faster in the first few iterations. However, quality increases non-monotonically, because feedback can bias the classifier temporarily. At $t_{\text{end}}$, the quality of B is lower than the one of A.
The question that follows is which active learning method one should prefer. One might choose the one with higher quality at $t_{\text{end}}$. However, the choice of $t_{\text{end}}$ is arbitrary, and one can think of alternative criteria such as the stability of the learning rate.
These missing evaluation standards are in the way of establishing comprehensive benchmarks that go beyond comparing individual progress curves.
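To make the ambiguity concrete, the following minimal sketch compares two progress curves under two different summaries. The quality values are made up purely for illustration (they are not results from this article), and the two summary functions are hypothetical examples of possible criteria, not the summaries proposed later.

```python
# Illustrative only: invented quality values per feedback iteration.
quality_a = [0.60, 0.62, 0.66, 0.70, 0.74, 0.78]  # starts higher, improves steadily
quality_b = [0.50, 0.68, 0.74, 0.76, 0.73, 0.75]  # improves fast, then dips


def end_quality(curve):
    """Quality after the final iteration."""
    return curve[-1]


def average_quality(curve):
    """Mean quality over all iterations."""
    return sum(curve) / len(curve)


for name, summary in [("end quality", end_quality), ("average quality", average_quality)]:
    better = "A" if summary(quality_a) > summary(quality_b) else "B"
    print(f"{name}: A={summary(quality_a):.3f}, B={summary(quality_b):.3f}, prefers {better}")
```

With these invented curves, the two summaries disagree on which method to prefer, which is exactly why a single progress curve comparison is not sufficient for a benchmark.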
This article contains two parts: an overview on one-class active learning for outlier detection, and a comprehensive benchmark of state-of-the-art methods. We make the following specific contributions.
(i) We propose a categorization of one-class active learning methods by introducing learning scenarios. A learning scenario is a combination of a learning objective and an initial setup. One important insight from this categorization is that the learning scenario and the learning objective are decisive for the applicability of active learning methods. In particular, some active learning methods and learning scenarios are incompatible. This suggests that a rigorous specification of the learning scenario is important to assess novel contributions in this field. We then (ii) introduce several complementary ways to summarize progress curves, to facilitate a standard evaluation of active learning in benchmarks. The evaluation by progress-curve summaries has turned out to be very useful, since they ease the comparison of active-learning methods significantly. As such, the categorization and evaluation standards proposed give way to a more reliable and comparable evaluation.
In the second part of our article, we (iii) put together a comprehensive benchmark with around 84,000 combinations of learning scenarios, classifiers, and query strategies for the selection of one-class active learning methods. To facilitate reproducibility, we make our implementations, raw results and notebooks publicly available (https://www.ipd.kit.edu/ocal). A key observation from our benchmark is that none of the state-of-the-art methods stands out in a competitive evaluation. We have found that the performance largely depends on the parametrization of the classifier, the data set, and on how progress curves are summarized. In particular, a good parametrization of the classifier is as important as choosing a good query selection strategy. We conclude by (iv) proposing guidelines on how to select active learning methods for outlier detection with one-class classifiers.
One-class classification is a machine learning method that is popular in different domains. Thus, we fix some terminology before we review the concepts of one-class active learning. We then address Question Categorization with a discussion of the building blocks and assumptions of one-class active learning.
In this article, we focus on one-class classification for outlier detection. This is a subset of the broader class of one-class classification problems, which includes other applications, like collaborative filtering (Pan et al., 2008). The objective of one-class classification for outlier detection is to learn a decision function that discerns between normal and unusual observations. What constitutes a normal and an unusual class may depend on the context, see Section 2.2.2.
One may additionally distinguish between categories of one-class classifiers. One category is unsupervised one-class classifiers, which learn a decision function without any class label information. If one-class classifiers make use of such information, they fall into the category of semi-supervised learning. A special case of semi-supervised methods is learning from positive and unlabeled observations (Li and Liu, 2005).
There are different ways to design one-class active learning (AL) systems, and several variants have recently been proposed. Yet we have found that variants follow different objectives and make implicit assumptions. Existing surveys on active learning do not discuss these objectives and assumptions, and they rather focus on general classification tasks (Ramirez-Loaiza et al., 2017; Settles, 2012; Beyer et al., 2015; Olsson, 2009) and on benchmarks for balanced (Bernard et al., 2018) and multi-class classification (Juszczak and Duin, 2003b).
In the remainder of this section, we discuss assumptions for one-class AL, structure the aspects where one-class AL systems differ from each other, and discuss implications of design choices on the AL system. We structure our discussion into three parts corresponding to the building blocks of a one-class AL system. Figure 2 graphs the building blocks. The first block is the AL Setup, which establishes assumptions regarding the training data and the process of gathering user feedback. It specifies the initial configuration of the system before the actual active learning starts. The second building block is the Base Learner, i.e., a one-class classifier that learns a binary decision function based on the data and user feedback available. The third building block is the Query Strategy. It is a method to select observations that a user is asked to provide feedback for.
We call observations that a query strategy selects query objects, the entity that provides the label an oracle, and the process of providing label information feedback. In a real scenario, the oracle is a user. For benchmarks, the oracle is simulated, based on a given gold standard. In what follows, we explain the blocks and discuss dependencies between them.
Researchers make assumptions regarding the interaction between system and user as well as assumptions regarding the application scenario. Literature on one-class AL often omits an explicit description of these assumptions, and one must instead derive them for instance from the experimental evaluation. Moreover, assumptions often do not come with an explicit motivation, and the alternatives are unclear.
We now review the various assumptions found in the literature. We distinguish between two types, general and specific assumptions.
General assumptions specify modalities of the feedback and impose limits on how applicable AL is in real settings. These assumptions have been discussed for standard binary classification (Settles, 2012), and many of them are accepted in the literature. We highlight the ones important for one-class AL.
Feedback Type: Existing one-class AL methods assume that feedback is a class label, i.e., the decision whether an observation belongs to a class or not. However, other types of feedback are conceivable as well, such as feature importance (Raghavan et al., 2006; Druck et al., 2009). But to our knowledge, research on one-class AL has been limited to label feedback. Next, the most common mechanism in literature is sequential feedback, i.e., for one observation at a time. However, asking for feedback in batches might have certain advantages, such as increased efficiency of the labeling process. But a shift from sequential to batch queries is not trivial and requires additional diversity criteria (Juszczak, 2006).
Feedback Budget: A primary motivation for active learning is that the amount of feedback a user can provide is bounded. For instance, the user can have a time or cost budget or a limited attention span to interact with the system. Assigning costs to feedback acquisition is difficult, and a budget restriction is likely to be application-specific. In some cases, feedback on observations from the minority class may be more costly. However, a common simplification here is to assume that labeling costs are uniform, and that there is a limit on the number of feedback iterations.
A user is expected to have sufficient domain knowledge to provide feedback purposefully. However, this implies that the user can interpret the classification result in the first place, i.e., the user understands the output of the one-class classifier. This is a strong assumption, and it is difficult to evaluate. For one thing, “interpretation” already has various meanings for non-interactive supervised learning (Lipton, 2016), and it has only recently been studied for interactive learning (Phillips et al., 2018; Teso and Kersting, 2018). Concepts to support users with an explanation of outliers (Micenková et al., 2013; Kauffmann et al., 2018) have not been studied in the context of active learning either. In any case, a thorough evaluation would require a user study. However, existing one-class AL systems bypass the difficulty of interpretation and assume a perfect oracle, i.e., an oracle which provides feedback with a predefined accuracy.
Specific assumptions confine the learning objective and the data for a particular AL application. One must define specific assumptions carefully, because they restrict which base learners and query strategies are applicable. We partition specific assumptions into the following categories.
Class Distribution: One-class learning is designed for highly imbalanced domains. There are two different definitions of “minority class”. The first one is that the minority class is unusual observations, also called outliers, that are exceptional in a bulk of data. The second definition is that the minority class is the target in a one-vs-all multi-class classification task, i.e., where all classes except for the minority class have been grouped together (Juszczak and Duin, 2003a; Ghasemi et al., 2011a). With this definition, the minority class is not exceptional, and it has a well-defined distribution. Put differently, one-class classification is an alternative to imbalanced binary classification in this case. So both definitions of “minority class” relate to different problem domains. The first one is in line with the intent of our paper, and we stick to it in the following.
Under the first definition, one can differentiate between characterizations of outliers. The prevalent characterization is that outliers do not follow a common underlying distribution. This assumption has far-reaching implications. For instance, if there is no joint distribution, it is not meaningful to estimate a probability density from a sample of the minority class.
Another characterization of outliers is to assume that they stem from a mixture of several distributions of rare classes. In this case, a probability density for each mixture component exists. So the probability density for the mixture as a whole exists as well. Its estimation however is hard, because the sample for each component is tiny. The characterization of the outlier distribution has implications on the separation of the data into train and test partitions, as we will explain in Section 3.4.
Learning Objective: The learning objective is the benefit expected from an AL system. A common objective is to improve the accuracy of a classifier. But there are alternatives. For instance, users of one-class classification often have a specific interest in the minority class (Das et al., 2016). In this case, it is reasonable to assume that users prefer giving feedback on minority observations if they will examine them anyhow later on. So a good active learning method yields a high proportion of queries from the minority class. This may contradict the objective of accuracy improvement.
There also are cases where the overall number of available observations is small, even for the majority class. The learning objective in this case can be a more robust estimate of the majority-class distribution (Ghasemi et al., 2011a, b). A classifier benefits from extending the number of majority-class labels. This learning objective favors active learning methods that select observations from the majority class.
Initial Pool: The initial setup is the label information available at the beginning of the AL process. There are two cases: (i) Active learning starts from scratch, i.e., there are no labeled examples, and the initial learning step is unsupervised. (ii) There are some labeled instances available (Juszczak, 2006). The number of observations and the share of class labels in the initial sample depend on the sampling mechanism. A special case is if the labeled observations exclusively are from the majority class (Ghasemi et al., 2011a). In our article, we consider different initial pool strategies:
Pool unlabeled (Pu): All observations are unlabeled.
Pool percentage (Pp): Stratified proportion of labels for $p$ percent of the observations.
Pool number (Pn): Stratified proportion of labels for a fixed number of observations $n$.
Pool attributes (Pa): As many labeled inliers as number of attributes.
The rationale behind Pa is that the correlation matrix of labeled observations is singular if there are fewer labeled observations than attributes.
With a singular correlation matrix, some query strategies are infeasible.
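The four initial pool strategies can be sketched as a small sampling routine. This is a minimal illustration under stated assumptions, not the implementation used in the benchmark: the function name `initial_pool`, the strategy strings, the `"in"`/`"out"` label encoding, and the rounding of the stratified class shares are all choices made here for the example.

```python
import random


def initial_pool(labels, strategy, p=None, n=None, n_attributes=None, seed=0):
    """Return the indices whose labels are revealed before active learning starts.

    labels:   ground-truth labels, "in" or "out" (used like a simulated oracle)
    strategy: "unlabeled", "percentage", "number" or "attributes" (names assumed here)
    """
    rng = random.Random(seed)
    inliers = [i for i, y in enumerate(labels) if y == "in"]
    outliers = [i for i, y in enumerate(labels) if y == "out"]

    def stratified(k):
        # Keep the class proportions of the full sample, rounded to whole observations.
        k_out = round(k * len(outliers) / len(labels))
        k_in = k - k_out
        return sorted(rng.sample(inliers, k_in) + rng.sample(outliers, k_out))

    if strategy == "unlabeled":
        return []
    if strategy == "percentage":
        return stratified(round(len(labels) * p / 100))
    if strategy == "number":
        return stratified(n)
    if strategy == "attributes":
        # Only inliers, as many as there are attributes.
        return sorted(rng.sample(inliers, n_attributes))
    raise ValueError(f"unknown strategy: {strategy}")


labels = ["in"] * 90 + ["out"] * 10
print(initial_pool(labels, "percentage", p=10))
print(initial_pool(labels, "attributes", n_attributes=5))
```

The sketch does not guard against requesting more labels than a class contains; a real setup would need to handle that case explicitly.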
How general and specific assumptions manifest depends on the use case, and different combinations of assumptions are conceivable. We discuss how we set assumptions for our benchmark in Section 4.
Before we introduce the remaining two building blocks, we specify some notation. $\mathcal{X} \subseteq \mathbb{R}^M$ is a data space with $M$ attributes. $X$ is a sample from $\mathcal{X}$ of $N$ observations $\{x_1, \dots, x_N\}$, where each observation $x_i$ is a vector of $M$ attribute values, i.e., $x_i = (x_{i,1}, \dots, x_{i,M})$. In this article, each observation either belongs to the minority or to the majority class. For brevity, we call an observation from the minority class outlier and one from the majority class inlier, and we encode them with a categorical class label $y_i \in \{\text{in}, \text{out}\}$. Synonyms for inlier are “target”, “positive observation” or “regular observation”, and for outlier “anomalous observation” or “exceptional observation”. $X$ can be partitioned into the unlabeled set of observations $\mathcal{U}$, i.e., observations where $y_i$ is unknown, and the labeled set of observations $\mathcal{L}$. We distinguish between the labeled inliers $\mathcal{L}_{\text{in}}$ and the labeled outliers $\mathcal{L}_{\text{out}}$.
A base learner is a one-class classifier that discerns between inliers and outliers. It takes observations and a set of class labels as input and returns a decision function.
A decision function is a function of type $f: \mathcal{X} \to \mathbb{R}$. An observation $x$ is assigned to the minority class if $f(x) > 0$ and to the majority class otherwise.
One-class classifiers fall into two categories: support-vector methods and non-support-vector classifiers (Khan and Madden, 2014). In our article, we focus on support-vector methods, the prevalent choice as base learners for one-class AL. However, the query strategy is independent from a specific instantiation, as long as the base learner returns a decision function.
One can further distinguish between semi-supervised and unsupervised one-class classifiers. Both have been used with one-class AL, but whether they are applicable depends on the learning scenario. A semi-supervised base learner uses both unlabeled data and labeled data with class labels for training. Labels can either come from both classes or only from the minority class (Tax and Duin, 2004). An unsupervised base learner does not have any mechanism to use label information directly to train the decision function. Instead, one can manipulate the unsupervised base learner by exposing it only to specific subsets of the training data. For instance, one can train on labeled inliers only.
In this article, we restrict our discussion to base learners that have been used in previous work on active learning for outlier detection with one-class classifiers. In particular, we use unsupervised SVDD (Tax and Duin, 2004), semi-supervised SVDDneg (Tax and Duin, 2004) with labels from the minority class, and the semi-supervised SSAD (Görnitz et al., 2013) with labels from both classes.
One of the most popular support-vector methods is Support Vector Data Description (SVDD) (Tax and Duin, 2004). The core idea is to fit a sphere around the data that encompasses all or most observations. This can be expressed as a Minimum Enclosing Ball (MEB) optimization problem of the following form

$$
\begin{aligned}
\min_{a, R, \xi} \quad & R^2 + C \sum_{i=1}^{N} \xi_i \\
\text{s.t.} \quad & \lVert \phi(x_i) - a \rVert^2 \leq R^2 + \xi_i, \quad \xi_i \geq 0, \quad i = 1, \dots, N
\end{aligned}
\tag{1}
$$

with the center of the ball $a$, the radius $R$, slack variables $\xi$, a cost parameter $C$, and a function $\phi: \mathcal{X} \to \mathcal{H}$ which maps into a reproducing kernel Hilbert space $\mathcal{H}$. Solving the optimization problem gives a fixed $a$ and $R$ and the decision function

$$
f(x) = \lVert \phi(x) - a \rVert^2 - R^2.
$$

The slack variables relax the MEB, i.e., they introduce a trade-off to allow observations to fall outside the sphere at cost $C$. If $C$ is high, observations falling outside of the sphere are expensive. In other words, $C$ controls the share of objects that are outside the decision boundary.
The optimization problem from Equation 1 can be solved in the dual space. In this case, the problem contains only inner products of the form $\langle \phi(x), \phi(x') \rangle$. This allows using the kernel trick, i.e., replacing the inner products with a kernel function $k(x, x')$. A common kernel is the Radial Basis Function (RBF) kernel

$$
k(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^2\right)
$$

with parameter $\gamma$. Larger values of $\gamma$ correspond to more flexible decision boundaries.
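To illustrate how the kernel trick enters the decision function, the sketch below evaluates an SVDD-style decision function purely through RBF kernel evaluations. It assumes that the dual solution is already given: the coefficients `alpha` (so that the center is the weighted sum of the mapped support vectors) and the squared radius `radius_sq` are inputs here, and no optimization problem is solved. The function names and the toy data are hypothetical.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def svdd_decision(x, support, alpha, radius_sq, gamma):
    """Squared feature-space distance of x to the center, minus the squared radius.

    With center a = sum_i alpha_i * phi(x_i), the distance expands into kernel terms:
    k(x, x) - 2 * sum_i alpha_i k(x_i, x) + sum_ij alpha_i alpha_j k(x_i, x_j).
    """
    k_xx = rbf_kernel(x, x, gamma)  # always 1 for the RBF kernel
    cross = sum(a_i * rbf_kernel(x_i, x, gamma) for a_i, x_i in zip(alpha, support))
    const = sum(
        a_i * a_j * rbf_kernel(x_i, x_j, gamma)
        for a_i, x_i in zip(alpha, support)
        for a_j, x_j in zip(alpha, support)
    )
    return k_xx - 2.0 * cross + const - radius_sq


# Toy example: two support vectors with equal weight (assumed, not optimized).
support = [(0.0, 0.0), (1.0, 0.0)]
alpha = [0.5, 0.5]
gamma = 1.0
# Choose the radius so that the first support vector lies exactly on the boundary.
radius_sq = svdd_decision(support[0], support, alpha, 0.0, gamma)

for point in [(0.5, 0.0), (10.0, 10.0)]:
    value = svdd_decision(point, support, alpha, radius_sq, gamma)
    print(point, "outlier" if value > 0 else "inlier")
```

A positive value means the observation lies outside the ball and is treated as an outlier, in line with the sign convention of the decision function.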
SVDDneg (Tax and Duin, 2004) extends the vanilla SVDD by using different costs $C_1$ for $\mathcal{L}_{\text{in}}$ and $\mathcal{U}$ and costs $C_2$ for $\mathcal{L}_{\text{out}}$. An additional constraint places observations in $\mathcal{L}_{\text{out}}$ outside the decision boundary.
SSAD (Görnitz et al., 2013) additionally differentiates between labeled inliers and unlabeled observations in the objective and in the constraints. In its original version, SSAD assigns different costs to $\mathcal{U}$, $\mathcal{L}_{\text{in}}$, and $\mathcal{L}_{\text{out}}$. We use a simplified version where the cost for both $\mathcal{L}_{\text{in}}$ and $\mathcal{L}_{\text{out}}$ is $C_2$. SSAD further introduces an additional trade-off parameter, which we call $\kappa$. High values of $\kappa$ increase the weight of $\mathcal{L}$ on the solution, i.e., SSAD is more likely to overfit to instances in $\mathcal{L}$.
Under mild assumptions, SSAD can be reformulated as a convex problem (Görnitz et al., 2013).
Parameterizing the kernel function and cost parameters is difficult, because a good parameterization typically depends on the data characteristics and the application. Further, one has to rely on heuristics to find a good parametrization in unsupervised settings. Several such heuristics exist. They use data characteristics (Silverman, 2018; Scott, 2015; Xiao et al., 2014), artificial outliers (Tax and Duin, 2002; Bánhalmi et al., 2007; Wang et al., 2018), or SVDD-specific properties like the number of support vectors (Wang et al., 2013). However, optimizing the parametrization is not a focus of this article, and we rely on established methods to select the kernel and the cost parameters, see Section 4.2.
A query strategy is a method that selects observations for feedback. In this section, we review the respective principles, as well as existing strategies for one-class classification. The varying notation in the literature would make an overview difficult to follow. So we rely on notation introduced earlier.
To decide on observations for feedback, query strategies rank unlabeled observations according to an informativeness measure.
Let a decision function $f$, unlabeled observations $\mathcal{U}$ and labeled observations $\mathcal{L}$ be given. Informativeness is a function $x \mapsto \tau(x, \mathcal{U}, \mathcal{L}, f)$ that maps an observation $x \in \mathcal{U}$ to $\mathbb{R}$.
For brevity, we only write $\tau(x)$. $\tau(x)$ quantifies how valuable feedback for observation $x$ is for the classification model. This definition is general, and there are different ways to interpret valuable. Feedback can be valuable if the model is uncertain with the prediction of an observation, or if the classification error is expected to decrease. Some query strategies also balance between the representativeness of observations and the exploration of sparse regions. In this case, local density estimates affect the value of an observation.
In general, a query strategy selects one or more observations based on their informativeness. We define it as follows.
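A query strategy $QS$ is a function of type

$$
QS: \mathcal{U} \times \mathbb{R} \to Q
$$

with $Q \subseteq \mathcal{U}$.
The feedback on $Q$ from an oracle results in an updated set of labeled data $\mathcal{L}' = \mathcal{L} \cup Q$ and unlabeled data $\mathcal{U}' = \mathcal{U} \setminus Q$. In this article, we only consider single queries (cf. Section 2.2). Given this, we assume query strategies to always return the observation with the highest informativeness

$$
Q = \operatorname*{arg\,max}_{x \in \mathcal{U}} \tau(x).
$$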
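The interplay of base learner, informativeness, query strategy and oracle can be sketched as a short loop. This is a minimal illustration under assumptions: `train`, `make_informativeness` and `oracle` are hypothetical callables supplied by the caller, observations are assumed to be hashable and unique, and the toy base learner in the usage example ignores the labels entirely.

```python
def query(unlabeled, informativeness):
    """Single-query strategy: the unlabeled observation with the highest informativeness."""
    return max(unlabeled, key=informativeness)


def active_learning(unlabeled, labeled, train, make_informativeness, oracle, budget):
    """Run at most `budget` feedback iterations and return the updated sets."""
    unlabeled = list(unlabeled)
    labeled = dict(labeled)  # maps observation -> label ("in" or "out")
    for _ in range(budget):
        if not unlabeled:
            break
        f = train(unlabeled, labeled)                 # base learner returns a decision function
        tau = make_informativeness(f, unlabeled, labeled)
        q = query(unlabeled, tau)                     # select the query object
        labeled[q] = oracle(q)                        # feedback from the (simulated) oracle
        unlabeled.remove(q)
    return unlabeled, labeled


# Toy usage with one-dimensional observations (all components below are stand-ins).
data = [-0.2, 0.1, 0.9, 1.2, 3.0]
train = lambda U, L: (lambda x: abs(x) - 1.0)         # fixed boundary at |x| = 1
make_tau = lambda f, U, L: (lambda x: -abs(f(x)))     # prefer observations near the boundary
oracle = lambda x: "out" if abs(x) > 2.0 else "in"    # gold standard for the simulation

remaining, feedback = active_learning(data, {}, train, make_tau, oracle, budget=2)
print(remaining, feedback)
```

Because the loop only needs a decision function from the base learner, any classifier that returns one can be plugged in without changing the query step.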
We now review existing query strategies from literature that have been proposed for one-class active learning.
To this end, we partition them into three categories.
The first category is data-based query strategies. (Others have used the term “model-free” instead (O’Neill et al., 2017). However, we deliberately deviate from this nomenclature since the strategies we discuss still rely on some kind of underlying model, e.g., a kernel-density estimator.) These strategies approach query selection from a statistical side. The second category is model-based query strategies. These strategies rely on the decision function returned by the base learner. The third category is hybrid query strategies. These strategies use both the data statistics and the decision function.
The concept behind data-based query strategies is to compare the posterior probabilities of an observation, $p(\text{out} \mid x)$ and $p(\text{in} \mid x)$. This is well known from binary classification and is referred to as a measure of uncertainty (Settles, 2012). If a classifier does not explicitly return posterior probabilities, one can use the Bayes rule to infer them. But this is difficult, for two reasons. First, applying the Bayes rule requires knowing the prior probabilities for each class, i.e., the proportion of outliers in the data. It may not be known in advance. Second, outliers do not follow a homogeneous distribution. This renders estimating $p(x \mid \text{out})$ infeasible. There are two types of data-based strategies that have been proposed to address these difficulties.
The first type deems observations informative if the classifier is uncertain about their class label, i.e., observations with equal probability of being classified as inlier and outlier.
The following two strategies quantify informativeness in this way.
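Minimum Margin (Ghasemi et al., 2011b): This QS relies on the difference between posterior class probabilities

$$
\begin{aligned}
\tau_{\text{MM}}(x) &= -\left| p(\text{in} \mid x) - p(\text{out} \mid x) \right| && \text{(7a)} \\
&= -\left| \frac{p(x \mid \text{in})\, p(\text{in}) - p(x \mid \text{out})\, p(\text{out})}{p(x)} \right| && \text{(7b)} \\
&= -\left| \frac{2\, p(x \mid \text{in})\, p(\text{in}) - p(x)}{p(x)} \right| && \text{(7c)}
\end{aligned}
$$

where Equation 7b and Equation 7c follow from the Bayes rule. If $p(\text{in})$ and $p(\text{out})$ are known priors, one can make direct use of Equation 7c. Otherwise, the inventors of Minimum Margin suggest taking the expected value under the assumption that $p(\text{out})$, i.e., the share of outliers, is uniformly distributed

$$
\tau_{\text{EMM}}(x) = \mathbb{E}_{p(\text{out})}\left[ \tau_{\text{MM}}(x) \right], \quad p(\text{out}) \sim \mathrm{Uniform}(0, 1). \tag{8}
$$

We find this an unrealistic assumption, because a share of outliers of 0.1 would be as likely as 0.9.
In our experiments, we evaluate both $\tau_{\text{MM}}$ with the true outlier share as a prior and $\tau_{\text{EMM}}$.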
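The effect of the uniform-prior assumption can be explored numerically. The sketch below writes the Minimum-Margin informativeness in terms of the density ratio $r = p(x \mid \text{in}) / p(x)$, so that the posterior difference becomes $2 r\, p(\text{in}) - 1$, and approximates the expected value over a uniformly distributed outlier share with a midpoint rule. It assumes that $r$ is held fixed while the prior varies; the function names are hypothetical.

```python
def tau_mm(ratio, p_in):
    """Minimum-Margin informativeness for ratio = p(x|in) / p(x) and inlier prior p_in."""
    return -abs(2.0 * ratio * p_in - 1.0)


def tau_emm(ratio, steps=10_000):
    """Expected Minimum-Margin under a uniform outlier share (midpoint rule).

    Assumption of this sketch: the density ratio does not change with the prior.
    """
    total = 0.0
    for j in range(steps):
        p_out = (j + 0.5) / steps          # midpoint of the j-th cell in (0, 1)
        total += tau_mm(ratio, 1.0 - p_out)
    return total / steps


# Known prior (10 % outliers) versus the uniform-prior expectation.
for ratio in [0.2, 0.55, 1.0]:
    print(ratio, tau_mm(ratio, p_in=0.9), tau_emm(ratio))
```

Comparing the two columns for a few ratios shows how much the ranking of observations can shift when the true outlier share is replaced by the uniform assumption.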
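Maximum-Entropy (Ghasemi et al., 2011b): This QS selects observations where the distribution of the class probability has a high entropy

$$
\tau_{\text{ME}}(x) = -\left[ p(\text{in} \mid x) \log p(\text{in} \mid x) + p(\text{out} \mid x) \log p(\text{out} \mid x) \right].
$$

Applying the Bayes rule and taking the expected value as in Equation 8 gives

$$
\tau_{\text{EME}}(x) = \mathbb{E}_{p(\text{out})}\left[ \tau_{\text{ME}}(x) \right] = \frac{-r^2 \log r + (r - 1)^2 \log(1 - r) + r}{2r}, \quad r = \frac{p(x \mid \text{in})}{p(x)}.
$$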
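As a sanity check on the expected entropy, the sketch below compares a numerical average of the binary entropy over a uniformly distributed inlier prior with a closed form in the density ratio $r = p(x \mid \text{in}) / p(x)$. The closed form is what one obtains by integrating the entropy of $r \cdot p(\text{in})$ over $p(\text{in}) \in (0, 1)$ with $r$ held fixed; that fixed-ratio assumption and the function names are choices of this example. The closed form is only evaluated for $0 < r < 1$.

```python
import math


def binary_entropy(p):
    """Entropy of a Bernoulli(p) distribution in nats; 0 at the boundaries."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def tau_eme_numeric(ratio, steps=20_000):
    """Average entropy of p(in|x) = ratio * p_in for p_in uniform on (0, 1)."""
    total = 0.0
    for j in range(steps):
        p_in = (j + 0.5) / steps           # midpoint rule
        total += binary_entropy(ratio * p_in)
    return total / steps


def tau_eme_closed(ratio):
    """Closed form of the same average, valid for 0 < ratio < 1."""
    r = ratio
    return (-r * r * math.log(r) + (r - 1.0) ** 2 * math.log(1.0 - r) + r) / (2.0 * r)


for ratio in [0.2, 0.6, 0.9]:
    print(ratio, tau_eme_numeric(ratio), tau_eme_closed(ratio))
```

The logarithm of $1 - r$ makes the closed form break down once the ratio reaches 1, which can happen when densities rather than probabilities are inserted.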
To give an intuition of the Minimum-Margin and the Maximum-Entropy strategies, we visualize their informativeness on sample data. Figure 3 visualizes the Minimum-Margin informativeness $\tau_{\text{MM}}$, its expected-value variant $\tau_{\text{EMM}}$, and the expected Maximum-Entropy informativeness $\tau_{\text{EME}}$ for univariate data generated from two Gaussian distributions. The authors of $\tau_{\text{EME}}$ suggest estimating the densities with kernel density estimation (KDE) (Ghasemi et al., 2011b). However, entropy is defined on probabilities and is not applicable to densities, so just inserting densities into the formula yields ill-defined results. Moreover, $\tau_{\text{EME}}$ is not defined for $p(x \mid \text{in}) / p(x) \geq 1$. We set $\tau_{\text{EME}}(x) = 0$ in this case. For $\tau_{\text{MM}}$, we use Equation 7c with prior class probabilities. Not surprisingly, all three depicted formulas result in a similar pattern, as they follow the same general motivation. The tails of the inlier distribution yield high informativeness. The informativeness decreases more slowly on the right tail of the inlier distribution, where the outlier distribution has some support.
The second type of data-based query strategies strives for a robust estimation of the inlier density. The idea is to give high informativeness to observations that are likely to reduce the loss between the estimated and the true inlier density. There is one strategy of this type.
Minimum-Loss (Ghasemi et al., 2011a): Under the minimum-loss strategy, observations have high informativeness if they are expected to increase the estimate of the inlier density. The idea to calculate this expected value is as follows. The feedback for an observation is either “outlier” or “inlier”. The minimum-loss strategy calculates an updated density for both cases and then takes the expected value by weighting each case with the prior class probabilities. Similarly to Equation 7c, this requires knowledge of the prior class probabilities.
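We now describe Minimum-Loss formally. Let $g^{\text{in}}$ be an estimated probability density over all inlier observations $\mathcal{L}_{\text{in}}$. Let $\mathcal{L}_{\text{in}}^{x} = \mathcal{L}_{\text{in}} \cup \{x\}$, and let $g^{\text{in},x}$ be its corresponding density. Similarly, we define $\mathcal{L}_{\text{out}}^{x} = \mathcal{L}_{\text{out}} \cup \{x\}$. Then $g^{\text{in}}_{-i}$ stands for the density estimated over all $\mathcal{L}_{\text{in}} \setminus \{x_i\}$ and $g^{\text{in},x}_{-i}$ for $\mathcal{L}_{\text{in}}^{x} \setminus \{x_i\}$ respectively. In other words, for $g^{\text{in}}_{-i}(x_i)$, one first estimates the density without $x_i$ and then evaluates the estimated density at $x_i$. One can now calculate how well an observation $x$ matches the inlier distribution by using leave-one-out cross validation for both cases.
Case 1: x is inlier

$$
\tau_{\text{ML-in}}(x) = \frac{1}{|\mathcal{L}_{\text{in}}^{x}|} \sum_{x_i \in \mathcal{L}_{\text{in}}^{x}} g^{\text{in},x}_{-i}(x_i) - \frac{1}{|\mathcal{L}_{\text{out}}|} \sum_{x_i \in \mathcal{L}_{\text{out}}} g^{\text{in},x}(x_i) \tag{11}
$$

Case 2: x is outlier

$$
\tau_{\text{ML-out}}(x) = \frac{1}{|\mathcal{L}_{\text{in}}|} \sum_{x_i \in \mathcal{L}_{\text{in}}} g^{\text{in}}_{-i}(x_i) - \frac{1}{|\mathcal{L}_{\text{out}}^{x}|} \sum_{x_i \in \mathcal{L}_{\text{out}}^{x}} g^{\text{in}}(x_i) \tag{12}
$$

The expected value over both cases is

$$
\tau_{\text{ML}}(x) = p(\text{in}) \cdot \tau_{\text{ML-in}}(x) + \left(1 - p(\text{in})\right) \cdot \tau_{\text{ML-out}}(x).
$$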
We illustrate Equation 11, Equation 12 and the resulting Minimum-Loss informativeness $\tau_{\text{ML}}$ in Figure 4. As expected, $\tau_{\text{ML}}$ yields high informativeness in regions of high inlier density. $\tau_{\text{ML}}$ gives an almost inverse pattern compared to the Minimum-Margin and the Maximum-Entropy strategies. This illustrates that existing query strategies are markedly different. It is unclear how to decide between them solely based on theoretical considerations, and one has to study them empirically instead.
Model-based strategies rely on the decision function of a base learner.
Recall that an observation is an outlier if and an inlier for .
Observations with are on the decision boundary.
High-Confidence (Barnabé-Lortie et al., 2015): This QS selects observations that match the inlier class the least. For SVDD this is
Decision-Boundary: This QS selects observations closest to the decision boundary
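The two model-based strategies can be illustrated with a small Python sketch. It assumes a decision function that is positive for outliers and non-positive for inliers. The hypersphere in input space (`center`, `radius`) is a hypothetical stand-in for a trained SVDD, and the scores $f(x)$ and $-|f(x)|$ are one plausible reading of the two descriptions above, not necessarily the exact equations of the cited work.

```python
import math

def decision_value(x, center, radius):
    """Assumed SVDD-like decision function: > 0 outside the sphere, <= 0 inside."""
    return math.dist(x, center) - radius

def high_confidence(x, center, radius):
    # Larger values: the observation matches the inlier class less.
    return decision_value(x, center, radius)

def decision_boundary(x, center, radius):
    # Larger values (closer to zero): the observation is nearer to the boundary.
    return -abs(decision_value(x, center, radius))

def select_query(candidates, score):
    """Return the index of the candidate with the highest informativeness."""
    return max(range(len(candidates)), key=lambda i: score(candidates[i]))

center, radius = (0.0, 0.0), 1.0
unlabeled = [(0.1, 0.0), (0.9, 0.0), (3.0, 0.0)]

hc_pick = select_query(unlabeled, lambda x: high_confidence(x, center, radius))
db_pick = select_query(unlabeled, lambda x: decision_boundary(x, center, radius))
print(hc_pick, db_pick)
```

In this toy setting, the high-confidence score favors the point far outside the sphere, while the decision-boundary score favors the point just inside it.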
Hybrid query strategies combine data-based and model-based strategies.
Neighborhood-Based (Görnitz et al., 2013): This QS explores unknown neighborhoods in the feature space. The first part of the query strategy calculates the average number of labeled instances among the k-nearest neighbors
with k-nearest neighbors .
A high number of neighbors in makes an observation less interesting.
The strategy then combines this number with the distance to the decision boundary, i.e., . Parameter controls the influence of the number of already labeled instances in the neighborhood on the decision.
The authors do not recommend any specific parameter value, and we use in our experiments.
Boundary-Neighbor-Combination (Yin et al., 2018): The core of this query strategy is a linear combination of the normalized distance to the hypersphere and the normalized distance to the first-nearest neighbor
with a distance function , and trade-off parameter .
The actual query strategy is to choose a random observation with probability $p$ and to use the linear combination above with probability $1 - p$.
The authors recommend to set and .
In addition to the strategies introduced so far, we use the following baselines.
Random: This QS draws each unlabeled observation with equal probability
Random-Outlier: This QS is similar to Random, but with informativeness 0 for observations predicted to be inliers
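The two baselines can be sketched as follows. This is a minimal illustration under our own assumptions: informativeness is modeled as a random score per unlabeled observation, and the query is the observation with the highest score, which amounts to a uniform draw. The function names and the `predicted_outlier` flags are hypothetical.

```python
import random

def random_scores(n_unlabeled, rng):
    """Random baseline: every unlabeled observation is equally likely to win."""
    # 1.0 - rng.random() lies in (0, 1], so scores are strictly positive.
    return [1.0 - rng.random() for _ in range(n_unlabeled)]

def random_outlier_scores(predicted_outlier, rng):
    """Random-Outlier baseline: informativeness 0 for predicted inliers."""
    return [1.0 - rng.random() if is_out else 0.0 for is_out in predicted_outlier]

def argmax(scores):
    return max(range(len(scores)), key=scores.__getitem__)

rng = random.Random(42)
predicted_outlier = [False, True, False, False, True]

scores = random_outlier_scores(predicted_outlier, rng)
query = argmax(scores)
print(query, scores)
```

If the classifier predicts no outliers at all, every score is 0 and the sketch falls back to the first observation; a practical implementation would need an explicit rule for this case.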
In general, adapting other strategies from standard binary active learning is conceivable as well. For instance, one could learn a committee of several base learners and use disagreement-based query selection (Settles, 2012). In this article, however, we focus on strategies that have been explicitly adapted to and used with one-class active learning.
Evaluation of active learning methods is more involved than that of static methods. Namely, the result of an AL method is not a single number, but rather a sequence of numbers that result from a quality evaluation in each iteration.
We now address Question Evaluation in several steps. We first discuss characteristics of active learning progress curves. We then review common quality metrics (QM) for one-class classification, i.e., metrics that take the class imbalance into account. We then discuss different ways to summarize active learning curves. Finally, we discuss the peculiarities of common train/test-split strategies for evaluating one-class active learning and limitations of the design choices just mentioned.
The sequence of quality evaluations can be visualized as a progress curve, see Figure 1. We call the interval from $t_{init}$ to $t_{end}$ an active learning cycle. Literature tends to use the percentage or the absolute number of labeled observations to quantify progress on the x-axis. However, this percentage may be misleading if the total number of observations varies between data sets. Other measures are conceivable as well, such as the time the user spends answering a query. While this might be even more realistic, it is very difficult to validate. We deem the absolute number of labeled objects during the active learning cycle the most appropriate scaling. It is easy to interpret, and the budget restriction is straightforward. However, the evaluation methods proposed in this section are independent of a specific progress measure.
The y-axis is a metric for classification quality. There are two ways to evaluate it for imbalanced class distributions: by computing a summary statistic on the binary confusion matrix, or by assessing the ranking induced by the decision function.
In this article, we use the Matthews Correlation Coefficient (MCC) and Cohen’s kappa to evaluate the binary output. They can be computed from the confusion matrix. MCC returns values in $[-1, 1]$, where high values indicate good classification on both classes, $0$ equals a random prediction, and $-1$ is the total disagreement between classifier and ground truth. Kappa returns $1$ for a perfect agreement with the ground truth and $0$ for one not better than a random allocation.
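Both metrics follow directly from the four cells of the confusion matrix. The sketch below uses the standard textbook definitions and treats "outlier" as the positive class, which is our assumption; returning 0 when a denominator vanishes is a common convention and is not taken from the article.

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

def cohen_kappa(tp, fp, tn, fn):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = tp + fp + tn + fn
    p_observed = (tp + tn) / n
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    return (p_observed - p_chance) / (1 - p_chance) if p_chance < 1 else 0.0

# 100 observations, 10 of them outliers.
perfect = dict(tp=10, fp=0, tn=90, fn=0)
chance_level = dict(tp=1, fp=9, tn=81, fn=9)   # flags 10% independently of the truth
inverted = dict(tp=0, fp=90, tn=0, fn=10)      # every prediction is wrong

for name, cm in [("perfect", perfect), ("chance", chance_level), ("inverted", inverted)]:
    print(name, round(mcc(**cm), 3), round(cohen_kappa(**cm), 3))
```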
One can also use the distance to the decision boundary to rank observations. The advantage is the finer differentiation between strong and less strong outliers. A common metric is the area under the ROC curve (AUC), which has been used in other outlier-detection benchmarks (Campos et al., 2016). An interpretation of the AUC is the probability that an outlier is ranked higher than an inlier. So an AUC of $1$ indicates a perfect ranking; $0.5$ means that the ranking is no better than random.
If the data set is large, users tend to only inspect the top of the ranked list of observations. Then it can be useful to use the partial AUC (pAUC). It evaluates classifier quality at thresholds on the ranking where the false-positive rate (FPR) is low. An example for using pAUC to evaluate one-class active learning is (Görnitz et al., 2013).
The visual comparison of active learning via progress plots does not scale with the number of experiments. For instance, our benchmark would require comparing 84,000 different learning curves, which is prohibitive. For large-scale comparisons, one should instead summarize a progress curve. Recently, true performance of the selection strategy (TP) has been proposed as a summary of increase and decrease of classifier performance over the number of iterations (Reyes et al., 2018). However, TP is a single aggregate measure, which is likely to overgeneralize and is difficult to interpret. For a more comprehensive evaluation, we therefore propose to use several summary statistics. Each of them captures some characteristic of the learning progress and has a distinct interpretation.
We use for the quality metric at the active learning progress .
We use and to refer to the labeled examples at and .
Start Quality (SQ): The Start Quality is the baseline classification quality before the active learning starts, i.e., the quality of the base learner at the initial setup
Ramp-Up (RU): The ramp-up is the quality increase after the initial progress steps. A high RU indicates that the query strategy adapts well to the initial setup
Quality Range (QR): The Quality Range is the increase in classification quality over an interval . A special case is , the overall improvement achieved with an active learning strategy
Average End Quality (AEQ): In general, the progress curve is non-monotonic because each query introduces a selection bias in the training data. So a query can lead to a quality decrease. The choice of often is arbitrary and can coincide with a temporary bias. So we propose to use the Average End Quality to summarize the classification quality for the final progress steps
Learning Stability (LS): Learning Stability summarizes the influence of the last progress steps on the quality. A high LS indicates that one can expect further improvement from continuing the active learning cycle. A low LS on the other hand indicates that the classifier tends to be saturated, i.e., additional feedback does not increase the quality. We define LS as the ratio of the average QR in the last steps over the average QR between init and end
Ratio of Outlier Queries (ROQ): The Ratio of Outlier Queries is the proportion of queries that the oracle labels as outlier
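The summary statistics can be computed from a progress curve in a few lines. The following Python sketch is one possible reading of the verbal definitions above, under these assumptions: `curve[k]` holds the quality metric after `k` feedback labels, each step adds exactly one label, and LS is set to 0 when the overall quality range is not positive. The function names are ours.

```python
def start_quality(curve):
    """SQ: quality of the base learner before any feedback."""
    return curve[0]

def ramp_up(curve, k):
    """RU(k): quality increase after the first k progress steps."""
    return curve[k] - curve[0]

def quality_range(curve, i, j):
    """QR(i, j): quality increase between progress steps i and j."""
    return curve[j] - curve[i]

def average_end_quality(curve, k):
    """AEQ(k): mean quality over the final k progress steps."""
    return sum(curve[-k:]) / k

def learning_stability(curve, k):
    """LS(k): average QR of the last k steps relative to the average QR overall."""
    end = len(curve) - 1
    overall = quality_range(curve, 0, end)
    if overall <= 0:
        return 0.0  # assumption: no overall improvement means nothing to compare against
    return (quality_range(curve, end - k, end) / k) / (overall / end)

def ratio_of_outlier_queries(query_labels):
    """ROQ: share of queries that the oracle labels as outlier."""
    return sum(1 for y in query_labels if y == "outlier") / len(query_labels)

curve = [0.0, 0.2, 0.4, 0.5, 0.5, 0.5]
feedback = ["inlier", "outlier", "outlier", "inlier", "inlier"]

print(start_quality(curve), ramp_up(curve, 2), quality_range(curve, 0, 5))
print(average_end_quality(curve, 3), learning_stability(curve, 2))
print(ratio_of_outlier_queries(feedback))
```

In this toy curve, the quality saturates after three labels, so the learning stability over the last two steps is 0 even though the overall quality range is 0.5.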
In practice, the usefulness of a summary statistic to select a good active learning strategy depends on the learning scenario.
For instance, ROQ is only meaningful if the user has a specific interest in observations from the minority class.
We conclude the discussion of summary statistics with two comments. The first comment is on Area under the Learning Curve (AULC), which also can be used to summarize active learning curves (Cawley, 2011; Reyes et al., 2018). We deliberately choose to not include AULC as a summary statistic for the following reasons. First, active learning is discrete, i.e., the minimum increment during learning is one feedback label. But since the learning steps are discrete, the “area” under the curve is equivalent to the sum of the quality metric over the learning progress . In particular, approximating the AULC by, say, a trapezoidal approximation (Reyes et al., 2018) is not necessary. Second, AULC is difficult to interpret. For instance, two curves can have different shapes and end qualities, but yet result in the same AULC value. In our article we therefore rely on AEQ and SQ, which one can interpret as a partial AULC, with distinct interpretation.
Our second comment is that using summary statistics to select different query strategies for different phases of the active learning cycle is conceivable in principle. For instance, one could start the cycle with a strategy that has a good RU and then switch to a strategy with a good AEQ. However, this leads to further questions, e.g., how to identify a good switch point, that go beyond this article.
A split strategy specifies how data is partitioned between training and testing. With binary classifiers, one typically splits data into disjoint train and test partitions, which ideally are identically distributed. However, since outliers do not come from a joint distribution, measuring classification quality on an independent test set is misleading. In this case, one may measure classification quality as the resubstitution error, i.e., the classification quality on the training data. This error is an optimistic estimate of classification quality. But we deem this shortcoming acceptable if only a small percentage of the data has been labeled.
The learning objective should also influence how the data is split. For instance, if the learning objective is to reliably estimate the majority-class distribution, one may restrict the training set to inliers (cf. (Ghasemi et al., 2011a, b)). Three split strategies are used in the literature.
Split holdout (Sh): Model fitting and query selection on the training split, and testing on a distinct holdout sample.
Split full (Sf): Model fitting, query selection and testing on the full data set.
Split inlier (Si): Like Sf, but model fitting on labeled inliers only.
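To make the difference between the three split strategies concrete, the sketch below returns, for each strategy, the index sets used for model fitting, query selection and testing. The function `make_split`, its arguments and the dictionary layout are hypothetical; the 20% holdout fraction follows the Sh setting in the experimental space.

```python
import random

def make_split(strategy, n, labeled_inliers=(), holdout_frac=0.2, seed=0):
    """Return index lists for fitting, query selection and testing."""
    idx = list(range(n))
    if strategy == "Sf":
        # Everything happens on the full data set (resubstitution).
        return {"fit": idx, "query": idx, "test": idx}
    if strategy == "Si":
        # Like Sf, but the model only sees observations labeled as inliers.
        return {"fit": sorted(labeled_inliers), "query": idx, "test": idx}
    if strategy == "Sh":
        shuffled = idx[:]
        random.Random(seed).shuffle(shuffled)
        n_test = round(holdout_frac * n)
        test, train = sorted(shuffled[:n_test]), sorted(shuffled[n_test:])
        return {"fit": train, "query": train, "test": test}
    raise ValueError(f"unknown split strategy: {strategy}")

sh = make_split("Sh", n=10)
sf = make_split("Sf", n=10)
si = make_split("Si", n=10, labeled_inliers=[2, 5, 7])
print(sh)
print(si["fit"])
```

With Si and an empty set of labeled inliers, the `fit` partition is empty, which is the reason this split cannot be combined with an initial pool without labels.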
Split strategies increase the complexity of evaluating active learning, since they must be combined with an initial pool strategy. Most combinations of split strategies and initial pool strategies are conceivable. Only no labels (Pu) does not work with a split strategy that fits a model solely on inliers (Si) – the train partition would be empty in this case. Figure 5 is an overview of all combinations of an initial pool strategy and a split strategy.
Initial setups, split strategies, base learners and query strategies (QS) all come with prerequisites. One cannot combine them arbitrarily, because some of the prerequisites are mutually exclusive, as follows.
Pu rules out any data-based QS. This is because data-based QS require labeled observations for the density estimations.
Fully unsupervised base learners, e.g., SVDD, are only useful when the learning objective is a robust estimate of the majority distribution, and when the split strategy is Si. The reason is that feedback can only affect the classifier indirectly, by changing the composition of the training data.
A combination of Pu and Si is not feasible, see Section 3.4.
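The three prerequisites listed above can be encoded as a small filter over the design space. This sketch only covers the rules stated in this section; the full feasibility overview in Table 1 may contain further restrictions. The category labels and function names are our own.

```python
from itertools import product

def is_feasible(initial_pool, split, qs_category):
    """Check the mutually exclusive prerequisites stated in the text."""
    if initial_pool == "Pu" and qs_category == "data-based":
        return False  # density estimation needs labeled observations
    if initial_pool == "Pu" and split == "Si":
        return False  # the train partition would be empty
    return True

def is_useful(base_learner, split):
    """Unsupervised SVDD only reacts to feedback via the training data (Si)."""
    return not (base_learner == "SVDD" and split != "Si")

pools = ["Pu", "Pp", "Pn", "Pa"]
splits = ["Sf", "Sh", "Si"]
categories = ["data-based", "model-based", "hybrid", "baseline"]

feasible = [c for c in product(pools, splits, categories) if is_feasible(*c)]
print(len(feasible), "of", len(pools) * len(splits) * len(categories))
```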
Table 1 is an overview of the feasibility of query strategies. In what follows, we only consider feasible combinations.
The plethora of ways to design and to evaluate AL systems makes selecting a good configuration for a specific application difficult. Although certain combinations are infeasible, the remaining options are still too numerous to analyze. This section addresses question Comparison and provides some guidance on how to navigate the overwhelming design space. We have implemented the base learners, the query strategies and the benchmark setup in Julia (Bezanson et al., 2017). Our implementation, the raw results of all settings and notebooks to reproduce experiments and evaluation are publicly available at https://www.ipd.kit.edu/ocal.
We begin by explaining our experiments conducted on well-established benchmark data sets for outlier detection (Campos et al., 2016). In total, we run experiments on over 84,000 configurations: 72,000 configurations in Section 4.3.1 to Section 4.3.4 are the cross product of 20 data sets, 3 resampled versions, 3 split strategies, 4 initial pool strategies, 5 models with different parametrization, 2 kernel parameter initializations and 10 query strategies; 12,000 additional configurations in Section 4.3.5 are the cross product of 20 data sets with 3 resampled versions each, 2 models, 2 kernel parameter initializations, 5 initial pool resamples and 10 query strategies. Table 2 lists the experimental space, see Section 4.2 for details.
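The configuration counts follow from multiplying the sizes of the experimental dimensions. The short Python check below reproduces them from the factors named in the paragraph above; the dictionary keys are only descriptive labels.

```python
from math import prod

main_grid = {
    "data sets": 20,
    "resampled versions": 3,
    "split strategies": 3,
    "initial pool strategies": 4,
    "models": 5,
    "kernel parameter initializations": 2,
    "query strategies": 10,
}
extra_grid = {
    "data sets": 20,
    "resampled versions": 3,
    "models": 2,
    "kernel parameter initializations": 2,
    "initial pool resamples": 5,
    "query strategies": 10,
}

n_main = prod(main_grid.values())
n_extra = prod(extra_grid.values())
print(n_main, n_extra, n_main + n_extra)
```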
Each specific experiment corresponds to a practical decision which query strategy to choose in a specific setting. We illustrate this with the following example.
Assume the data set is Arrhythmia, and there are no initial labels, i.e., the initial pool strategy is Pu, and data-based QS are not applicable. The classifier is SVDDneg, and we use Sf to evaluate the classification quality. Our decision is to choose and as potential query strategies and to terminate the active-learning cycle after 100 iterations.
Figure 6 graphs the progress curves for both query strategies. A first observation is that it depends on the progress which one is better. For example, results in a better MCC after 10 iterations, while is superior after 90 iterations. After 50 iterations, both and perform equally well. Until iteration 60, the learning stability (LS) decreases to 0, which speaks for stopping. Indeed, although there is some increase after 60 iterations, it is small compared to the overall improvement.
For a more differentiated comparison, we now look at several progress curve summaries. If only the final classification quality is relevant, i.e., the budget is fixed to 100 observations, is preferred because of higher EQ and AEQ values. For a fast adaption, one should prefer with RU(5) = 0.57, compared to a RU(5) = 0.00 for . Regarding the outlier ratio, both query strategies perform similarly with and . Users can now weigh these criteria based on their preferences to decide on the most appropriate query strategy.
|Initial Pools||Pu, Pp (), Pn (), Pa|
|Split Strategy||Sf, Sh (80% train, 20% test), Si|
|Base Learner||SVDD, SVDDneg, SSAD ()|
|Kernel Initialization||Wang, Scott|
|Query strategy||, , , , , , , , ,|
In our benchmark, we strive for general insights and trends regarding such decisions. In the following, we first discuss assumptions we make for our benchmark and the experiment setup. We then report on results. Finally, we propose guidelines for outlier detection with active learning and discuss extensions to our benchmark towards conclusive decision rules.
We now specify the assumptions behind our benchmark.
General Assumptions. In our benchmark, we focus on “sequential class label” as the feedback type. We set the feedback budget to a fixed number of labels a user can provide. The reason for a fixed budget is that the number of queries in an active learning cycle depends on the application, and estimating active learning performance at runtime is difficult (Kottke et al., 2019). There is no general rule on how to select the number of queries for evaluation. In our experiments, we perform 50 iterations. Since we benchmark on many publicly available data sets, we do not have any requirements regarding interpretability. Instead, we rely on the ground truth shipped with the data sets to simulate a perfect oracle.
We have referred to specific assumptions throughout this article and explained how they affect the building blocks and the evaluation.
For clarity, we briefly summarize them.
For the class distribution, we assume that outliers do not have a joint distribution.
The primary learning objective is to improve the accuracy of the classifier.
However, we also use the ROQ summary statistic to evaluate whether a method yields a high proportion of queries from the minority class.
For the initial setup, we do not make any further assumptions.
Instead, we compare the methods on all feasible combinations of initial pools and split strategies.
Our experiments cover several instantiations of the building blocks. Table 3 lists the data sets, and Table 2 lists the experimental space. For each data set we use three resampled versions with an outlier percentage of 5% that have been normalized and cleaned from duplicates. We have downsampled large data sets to . This is comparable to the size of the data sets used in previous work for active learning for one-class classification. Additionally, one may use sampling techniques for one-class classifiers to scale to large data sets, e.g., (Krawczyk et al., 2018; Li, 2011). However, further studying the influence of the data set size on the query strategies is out of the scope of this article.
|Dataset||Observations (N)||Attributes (M)|
Parameters: Parameter selection for base learners and query strategies is difficult in an unsupervised scenario. One must rely on heuristics to select the kernel and cost parameters for the base-learners, see Section 2.4. We use Scott’s rule of thumb (Scott, 2015) and state-of-the-art self-adapting data shifting by Wang et al. (Wang et al., 2018) for the kernel parameter . For cost we use the initialization strategy of Tax et al. (Tax and Duin, 2004). For SSAD, the authors suggest to set the trade-off parameter (Görnitz et al., 2013). However, preliminary experiments of ours indicate that SSAD performs better with smaller parameter values in many settings. Thus, we include and as well. For the query strategies, the selection of strategy-specific parameters is described in Section 2.5. The data-based query strategies use the same value for kernel density estimation as the base learner.
We now discuss general insights and trends we have distilled from the experiments. We start with a broad overview and then fix some experimental dimensions step by step to analyze specific regions of the experimental space. We begin by comparing the expressiveness of evaluation metrics and the influence of base learner parametrization. Then we study the influence of the split strategy, the initial pool strategy, and the query strategy on result quality.
Recall that our evaluation metrics are of two different types: ranking metrics (AUC and pAUC) and metrics based on the confusion matrix (kappa, MCC). On all settings, metrics of the same type have a high correlation for AEQ, see Table 4. So we simplify the evaluation by selecting one metric of each type.
Further, there is an important difference between both types. Figure 7 depicts the AEQ for pAUC and MCC. For high MCC values, pAUC is high as well. However, high pAUC values often do not coincide with high MCC values, see the shaded part of the plot. In extreme cases, there even are instances where pAUC = 1 and MCC is close to zero. In this case, the decision function induces a good ranking of the observations, but the actual decision boundary does not discern well between inliers and outliers. An intuitive explanation is that outliers tend to be farthest from the center of the hypersphere. Because pAUC only considers the top of the ranking, it merely requires a well-located center to arrive at a high classification quality. But the classifier actually may not have fit a good decision boundary.
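A toy example illustrates how a perfect ranking can coincide with an MCC close to zero. The sketch uses the full AUC instead of pAUC for simplicity (with a perfect ranking, both equal 1), and synthetic distances to a hypothetical hypersphere center as scores. The two thresholds play the role of a poorly and a well fitted radius.

```python
import math

def auc(scores, is_outlier):
    """Probability that a random outlier is ranked above a random inlier."""
    out = [s for s, y in zip(scores, is_outlier) if y]
    inl = [s for s, y in zip(scores, is_outlier) if not y]
    wins = sum(1.0 if o > i else 0.5 if o == i else 0.0 for o in out for i in inl)
    return wins / (len(out) * len(inl))

def mcc_at_threshold(scores, is_outlier, threshold):
    """MCC when every observation with a score above the threshold is flagged."""
    pred = [s > threshold for s in scores]
    tp = sum(p and y for p, y in zip(pred, is_outlier))
    fp = sum(p and not y for p, y in zip(pred, is_outlier))
    tn = sum((not p) and (not y) for p, y in zip(pred, is_outlier))
    fn = sum((not p) and y for p, y in zip(pred, is_outlier))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

# 95 inliers close to the center, 5 outliers far away: the ranking is perfect.
scores = [i / 100 for i in range(95)] + [1.0, 1.1, 1.2, 1.3, 1.4]
is_outlier = [False] * 95 + [True] * 5

ranking_quality = auc(scores, is_outlier)
mcc_tight = mcc_at_threshold(scores, is_outlier, 0.045)  # radius far too small
mcc_good = mcc_at_threshold(scores, is_outlier, 0.97)    # radius separates the classes
print(ranking_quality, round(mcc_tight, 3), mcc_good)
```

The ranking metric is identical for both thresholds because it ignores where the boundary lies, whereas MCC drops sharply once the boundary cuts through the inliers.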
Our conclusion is that pAUC and AUC may be misleading when evaluating one-class classification. Hence, we only use MCC from now on.
Recall that the kernel parameter influences the flexibility of the decision boundary; high values correspond to more flexible boundaries. Our hypothesis is that a certain flexibility is necessary for models to adapt to feedback.
Table 5 shows the SQ and AEQ for two heuristics to initialize . In both summary statistics, the Wang strategy significantly outperforms the simpler Scott rule of thumb on the median over all data sets and models. A more detailed analysis shows that there are some data sets where Scott outperforms Wang, e.g., KDD-Cup. However, there are many instances where Wang performs well, but Scott results in very poor active learning quality. For instance, the AEQ on Glass for Scott is , and for Wang . We hypothesize that this is because Scott yields very low values for all data sets, and the decision boundary is not flexible enough to adapt to feedback. The average value is for Scott and for Wang.
We draw two conclusions from these observations. First, the choice of influences the success of active learning significantly. When the value is selected poorly, active learning only results in minor improvements on classification quality – regardless of the query strategy. Second, Wang tends to select better values than Scott, and we use it as the default in our experiments. Our observations also motivate further research on how to select the parameters in an active learning setting. However, studying this issue goes beyond the scope of our article.
Our experiments show that split strategies have a significant influence on classification quality. Figure 8 graphs the AEQ for the different split strategies grouped by base learners.
We first compare the three split strategies. For Sh, the AEQ on the holdout sample is rather low for all base learners. For Sf, SVDDneg and SSAD_0.1 achieve high quality. Some of this difference may be explained by the more optimistic resubstitution error in Sf. However, the much lower AEQ in Sh, for instance for SVDDneg, rather confirms that outliers do not follow a homogeneous distribution (cf. Section 2.2). In this case, the quality on the holdout sample is misleading.
For Si, all classifiers yield about the same quality. This is not surprising. The classifiers are trained on labeled inliers only. So the optimization problems for the base learners coincide. The average quality is lower than with Sf, because the training split only contains a small fraction of the inliers. Based on all this, we question whether Si leads to an insightful evaluation, and we exclude Si from now on.
Next, we compare the quality of the base learners. For Sf, SVDD fails because it is fully unsupervised, i.e., cannot benefit from feedback. For SSAD, the quality fluctuates with increasing . Finding an explanation for this is difficult. We hypothesize that this is because SSAD overfits to the feedback for high values. For Sf, empirically is the best choice.
In summary, the split strategy has a significant effect on classification quality.
SVDDneg and SSAD_0.1 for Sf yield the most reasonable results.
We fix these combinations for the remainder of this section.
|Data set||Initial Pool||n||Initially labeled||SQ||AEQ|
The initial pool strategy specifies the number of labeled observations at . Intuitively, increasing it should increase the start quality, as more information on the data is available to the classifier. If the initial pool is representative of the underlying distribution, little benefit can be expected from active learning.
Our results confirm this intuition. Figure 9 shows the SQ for the initial pool strategies grouped by SVDDneg and SSAD_0.1. For Pu, there are no labeled observations, and the corresponding SQ is low. When labeled data is available, Pp tends to yield a better SQ than Pn. However, the figure is misleading, because the actual number of labels depends on the data set. This becomes clear when looking at ALOI and WBC, see Table 6. For WBC, Pp and Pn result in a similar number of initial labels. For ALOI however, the number of labels with Pp is five times larger than with Pn. So the SQ on ALOI is higher for Pp, but AEQ is only slightly higher than SQ. This means that active learning has comparatively little effect. Pa has a technical motivation, i.e., it is the minimal number of labels required by the data-based strategies. This strategy is not feasible for data sets where the number of attributes is larger than the number of observations. Other than this, the interpretation of Pa is similar to Pp with .
In summary, different initial pool strategies lead to substantially different results. We deem Pn more intuitive than Pp when reporting results, since the size of the initial sample, and hence the initial labeling effort is explicit. In any case, one must carefully state how the initial sample is obtained. Otherwise, it is unclear whether high quality according to AEQ is due to the query strategy or to the initial pool.
We have arrived at a subset of the experimental space where comparing different query strategies is reasonable. To do so, we fix the initial pool strategy to Pn with $n = 20$. In this way, we can include the data-based QS, which all require initial labels. We obtain the initial pool by uniform stratified sampling. Additionally, we exclude the Hepatitis data set because it only contains 60 observations; this is incompatible with 20 initially labeled observations and 50 iterations. We repeat each setting 5 times and average the results to reduce the bias of the initial sample.
Table 7 shows the median QR(init, end) grouped by data set. By design of the experiment, SQ is equal for all query strategies. This means that AEQ coincides with QR. On some data sets (indicated by “ - ”), data-based query strategies fail. The reason is that the rank of the matrix of observations, on which the kernel density is estimated, is smaller than . For the remaining data sets, we make two observations. First, the QR achieved differs between data sets. Some data sets, e.g., Annthyroid and PageBlocks, seem to be more difficult and only result in a small QR. Second, the quality of a specific QS differs significantly between data sets. For instance, is the best strategy on Lymphography, but does not increase the classification quality on PageBlocks. In several cases, clearly outperforms the remaining strategies. There is neither a QS category nor a single QS that is superior on all data sets. This also holds for other metrics like RU and ROQ.
Next, runtimes for are an order of magnitude larger than for all other strategies. For PageBlocks, the average runtime per query selection for is , compared to for .
To summarize, there is no one-fits-all query strategy for one-class AL. The requirements for data-based query strategies may be difficult to meet in practice. If the requirements are met, all model-based and hybrid strategies we have evaluated except for may be a good choice. In particular, and are a good choice in the majority of cases. They result in significant increases over 50 iterations for most data sets and scale well with the number of observations. Even in the few cases where other query strategies outperform them, they still yield acceptable results.
The results from previous sections are conclusive and give way to general recommendations for outlier detection with active learning. We summarize them as guidelines for the selection of query strategies and for the evaluation of one-class active learning.
Learning scenario: We recommend specifying general and specific assumptions on the feedback process and the application. This narrows down the design space of building-block combinations. Regarding research, it may also help others to assess novel contributions more easily.
Initial Pool: The initial pool strategy should either be Pu, i.e., a cold start without labels, or Pn with an absolute number of labels. It is important to make explicit if and how an initial sample has been obtained.
Base Learner: A good parametrization of the base learner is crucial. To this end, selecting the bandwidth of the Gaussian kernel by self-adaptive data shifting (Wang et al., 2018) works well. When parameters are well-chosen, SVDDneg is a good choice across data sets and query strategies.
Query Strategies: Good choices across data sets are and . One should give serious consideration to random baselines, as they are easy to implement and outperform the more complex strategies in many cases.
Evaluation: Progress curve summaries yield a versatile and differentiated view on the performance of active learning. We recommend using them to select query strategies for a specific use case. As the quality metric, we suggest using MCC or kappa. Calculating this metric as a resubstitution error based on an Sf split is reasonable for outlier detection.
From the results presented so far, one may also think about deriving a formal and strict set of rules to select an active learning method that are even more rigorous than the guidelines presented. However, this entails major difficulties, as we now explain. Addressing them requires further research that goes beyond the scope of a comparative study.
One can complement the benchmark with additional real-world data sets. But they are only useful to validate whether rules that have already been identified are applicable to other data as well. So, given our current level of understanding, we expect additional real-world data sets to only confirm our conclusion that formal rules currently are beyond reach.
One may strive for narrow rules, e.g., rules that only apply to data with certain characteristics. This would require a different kind of experimental study, for instance with synthetic data. This also is difficult, for at least two reasons. First, it is unclear what interesting data characteristics would be in this case. Even if one can come up with such characteristics, it still is difficult to generate synthetic data with all these interesting characteristics. Second, reliable statements on selection rules would require a full factorial design of these characteristics. This entails a huge number of combinations with experiment runtimes that are likely to be prohibitive. To illustrate, even just 5 characteristics with 3 manifestations each result in $3^5 = 243$ data sets instead of 20 data sets, and a total of 874,800 experiments, an order of magnitude more than the experiments presented here. Yet our experiments already have a sequential run time of around 482 days.
One could strive for theoretical guarantees on query strategies. But the strategies discussed in Section 4.3.5 are heuristics and do not come with any guarantees. A discussion of the theoretical foundations of active learning may provide further insights. However, this goes beyond the scope of this current article as well.
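The size of the full factorial design mentioned above can be checked with a short calculation. It rests on one inference of ours: the number of configurations per data set is taken from the main grid of 72,000 configurations over 20 data sets, which reproduces the stated total.

```python
n_characteristics = 5
n_manifestations = 3

# Full factorial design: one synthetic data set per combination of manifestations.
n_synthetic_data_sets = n_manifestations ** n_characteristics

# Assumption: configurations per data set as in the main grid (72,000 over 20 data sets).
configs_per_data_set = 72_000 // 20

n_experiments = n_synthetic_data_sets * configs_per_data_set
print(n_synthetic_data_sets, configs_per_data_set, n_experiments)
```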
To conclude, deriving a set of formal rules based on our results is not within reach. So one should still select active learning methods for a use case individually. Our systematic approach from the previous sections does facilitate such a use-case specific selection. It requires carefully defining the learning scenario and using summary statistics for comparisons.
Active learning for outlier detection with one-class classifiers relies on several building blocks: the learning scenario, a base learner, and a query strategy. While the literature features several approaches for each of the building blocks, finding a suitable combination for a particular use case is challenging. In this article, we have approached this challenge in two steps. First, we have provided a categorization of active learning for one-class classification and proposed methods to evaluate active learning beyond progress curves. Second, we have evaluated existing methods, using an extensive benchmark. Our experimental results show that there is no one-fits-all strategy for one-class active learning. Thus, we have distilled guidelines on how to select a suitable active learning method for specific use cases. Our categorization, evaluation standards and guidelines give way to a more reliable and comparable assessment of active learning for outlier detection with one-class classifiers.
Toward Supervised Anomaly Detection. JAIR (2013), 235–262.
The Knowledge Engineering Review (2014), 345–374. https://doi.org/10.1017/S026988891300043X
Selecting training points for one-class support vector machines. Pattern Recognit. Lett. 32, 11 (Aug. 2011), 1517–1522. https://doi.org/10.1016/j.patrec.2011.04.013
A literature survey of active machine learning in the context of natural language processing. Technical Report. Swedish Institute of Computer Science.
Synthesis Lectures on Artificial Intelligence and Machine Learning (2012), 1–114.
A modified support vector data description based novelty detection approach for machinery components. Appl Soft Comput (2013), 1193–1205.