Active learning for online training in imbalanced data streams under cold start

07/16/2021 ∙ by Ricardo Barata, et al. ∙ Feedzai

Labeled data is essential in modern systems that rely on Machine Learning (ML) for predictive modelling. Such systems may suffer from the cold-start problem: supervised models work well but, initially, there are no labels, which are costly or slow to obtain. This problem is even worse in imbalanced data scenarios. Online financial fraud detection is an example where labeling is either i) expensive, or ii) subject to long delays, if relying on victims filing complaints. The latter may not be viable if a model has to be in place immediately, so an option is to ask analysts to label events while minimizing the number of annotations to control costs. We propose an Active Learning (AL) annotation system for datasets with orders of magnitude of class imbalance, in a cold start streaming scenario. We present a computationally efficient Outlier-based Discriminative AL approach (ODAL) and design a novel 3-stage sequence of AL labeling policies where it is used as warm-up. Then, we perform empirical studies on four real world datasets, with various magnitudes of class imbalance. The results show that our method can reach a high performance model more quickly than standard AL policies. Its observed gains over random sampling can reach 80 annotation budget or additional historical data (with 1/10 to 1/50 of the labels).




1. Introduction

Currently, supervised Machine Learning (ML) models are essential and widespread tools in electronic services, where vast amounts of data are generated daily in domains as diverse as finance, entertainment or consumer goods. Those models are often central in decisions that enhance system efficiency, user experience or even safety. Their performance relies heavily on the quality of the data they are trained on and, specifically for the supervised setting, suitably labeled data is crucial. In many domains, labeled data is expensive to collect, often requiring human annotation. In such scenarios, it is common that the system collects large amounts of unlabeled data with a limited budget for annotations, so it becomes essential to select the most informative samples for labeling.

Active Learning (AL) addresses the problem of selecting the smallest possible sample of data to label and train a high performance ML model. In this paper we study AL-based annotation methods for real-time data streams, to train a high performance model with no historical data, i.e., in a cold start scenario, for datasets with a large class imbalance. AL powered annotation in streaming can be particularly useful in the financial fraud detection domain where, often, there is a considerable delay between the event and the collection of the true label (e.g., through client complaints or reports from financial institutions) unless a human analyst is consulted. In our study, we are interested in real world datasets of credit card transactions, for which there is a high class imbalance. In this setting we aim to address several important questions, namely:

  • Which AL policies can more efficiently produce a high performance model with a small budget of events to label?

  • How much better is AL compared with random sampling?

  • Do we have empirical guarantees on the stability of a given policy, i.e., how small is the variance of its learning curve?

  • How many labels are needed for a high performance model?

We test well known AL policies, as well as our proposed sequences of policies that are especially designed for imbalanced datasets, to achieve a high performance, with reduced variance, in few iterations. Our main contributions are:

  • A new computationally efficient approach to the Discriminative Active Learning method (discriminative_AL_DBLP:journals/corr/abs-1907-06347) named Outlier based Discriminative Active Learning (ODAL) – Section 3.2.2.

  • Two variations of uncertainty sampling policies using an epistemic uncertainty measure, as well as a measure based on the fraud rate percentile – Section 3.2.3.

  • A 3-stage sequence of policies using ODAL as warm-up suited for highly imbalanced datasets – Section 4.3.

  • An extensive set of experiments on four credit card transactions datasets, to compare and rank AL policy sequences, to identify the best AL setup for fraud detection – Section 5.

2. Related Work

Various AL methods have been proposed and surveyed in the literature over the last decades (settles2009active; review_YANG2018401). We now discuss the methods most relevant for our experiments.

2.1. Data Querying

In this study, all AL policies will be based on pool based sampling (settles2009active), i.e., an unlabeled data pool is available whose instances must be prioritized, on each AL iteration, for annotation. The number of instances in one querying request, i.e., the batch size, is a parameter that may influence how fast AL improves the model.

As for data availability, usually a large number of unlabeled instances is provided, as well as either a small initial labeled data pool or no initially labeled data. In contrast, in our experiments we focus on the cold start scenario with no historical data. Furthermore, typical AL setups involve scenarios where the data source is static. Instead, we are interested in a streaming data scenario, where the unlabeled data pool grows – when an AL policy selects a new batch of queries, more data will be available than on the previous iteration.

Some studies have appeared in the literature discussing AL methods in a streaming data scenario (AL_streaming_10.1007/978-3-642-23808-6_39; AL_streaming_5440901; OAL_8910490; Carcillo2018StreamingAL; text_classification_modifiedQBC; OAL_drift; network_data_10.1145/2661829.2661981; JANARDAN2017804; partial_dishcarge_use_case_7245026; sentiment_analysis_KRANJC2015187). Notably, Carcillo et al. (Carcillo2018StreamingAL) investigated several AL methods for a credit card fraud dataset. Instances were selected once a day with AL, according to a fixed budget. For some of their methods, a budget was also reserved for semi-supervised labeling using a model trained on the labeled data. In contrast, we consider scenarios where several small batches of instances are processed during the day, so that the collected labels are exploited more frequently to update the AL policy, which is important to avoid the selection of many similar instances in one large batch. Furthermore, we present a detailed analysis of AL curves in the fraud domain, to provide a more complete understanding of its effectiveness for fraud detection, and we investigate new policies that were not considered in reference (Carcillo2018StreamingAL). In reference (Carcillo2018StreamingAL) no analysis of AL curves was presented, nor of their variability, which is essential to observe the boost in ML model performance at early stages of the AL process. Finally, most other studies in our literature review, cited above, are either: i) focused on applying AL to address concept drift, or ii) not focused on highly imbalanced problems, or iii) not focused on dealing with the cold start problem.

2.2. AL Policies

The central ingredient in an AL based annotation system is the policy determining which instances are the most relevant to label. We categorize the types of policies to mirror our three-stage strategy to efficiently train a model from cold start (discussed in Section 3.2):

  1. Cold policies (unsupervised): In a first stage, while no labeled data is available, a method is used to select the first instances for labeling before AL can start – Section 2.2.1.

  2. Warm-up policies: After some labels are collected, there may be a transient period with only labels of a given type available (e.g., only negative class for binary classification) or too few labels to train a supervised policy – Section 2.2.2.

  3. Hot policies (supervised): These are the most common, and they make full use of the collected labels to differentiate classes and select the best instances to query – Section 2.2.3.

2.2.1. Cold policies

AL studies in the literature often focus on scenarios where a labeled pool is available to start the AL process. However, in many real world scenarios one may be faced with a system that has just been deployed and contains no labeled data (pmlr-v32-houlsby14). In that case, the initial sampling can only be guided by the unlabeled instances. The simplest choice is to randomly sample an initial batch of instances – Random Policy. Another simple option is to use an unsupervised learning method to build a representation of the unlabeled data and select outliers – Outlier Detection Policy. The latter is useful if one or more of the classes behave as outliers. Another criterion is to sample denser regions of the feature space (which relates to the Density-Weighted policies – see Section 2.2.2).

2.2.2. Warm-up policies

These exploit the distribution of the features in the unlabeled and labeled pools without using the labels.

Discriminative Active Learning (DAL) (discriminative_AL_DBLP:journals/corr/abs-1907-06347) is based on the principle that a good labeled pool should be difficult to discriminate from the unlabeled pool. In this approach, the labeled pool instances are labeled positive and the unlabeled pool instances are labeled negative. Then a binary classification model is fit to discriminate between pools. Finally, the unlabeled pool is scored and instances with low scores are queried (i.e., those easy to discriminate from the labeled pool). This can be computationally heavy because it always trains on all available data (labeled and unlabeled). Though this can be mitigated by randomly sampling the unlabeled pool, we will propose a lighter method, in Section 3.2.2.

Density-weighted methods: In the next section we will see that hot policies use informativeness criteria that are also prone to detecting outliers (e.g., uncertainty sampling assumes that the most relevant instances are closer to the decision boundary; however, those instances can, simultaneously, be outliers), which may not be of interest in AL – such outliers may not provide information that improves the ML model. Density-weighted methods aim to select instances that cover well the most dense areas of the data distribution. This can be achieved, e.g., through density estimation (density_based_fujii-etal-1998-selective) or clustering algorithms (density_based_cluster10.1145/1015330.1015349; density_based_cluster10.1007/978-3-540-71496-5_24). These methods tend to be heavier and harder to implement in streaming because the unlabeled pool may grow and its distribution may drift in real-time. Due to these additional complexities in applying density based methods in streaming we leave their analysis for future work.

2.2.3. Hot policies

We now review policies that use the labels in the labeled pool to select queries based either on: i) an uncertainty measure, or ii) expected changes in model error or parameters.

Uncertainty Sampling: This is the most common active learning technique, originally discussed by Lewis and Gale (uncertainty_sampling_original_10.5555/188490.188495). It trains a machine learning model on each AL iteration using the labeled pool instances. Then the unlabeled pool is scored and the queries are ranked by a measure of uncertainty related to the distance to the classification boundary. Instances closer to the classification boundary are assumed to be more likely to improve the model. A common criterion is to select instances with the highest expected entropy over the possible class labels given the model scores as the probabilities. For binary classification those instances have scores closest to 0.5. This method assumes scores that are well calibrated probabilities, which may not hold. Nevertheless, studies show that it is an efficient AL uncertainty measure ((review_YANG2018401) and references therein).
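The entropy criterion described above can be sketched in a few lines. This is a minimal, dependency-free illustration and not the authors' implementation; the function names and the example scores are hypothetical, and the model scores are assumed to be usable as probabilities of the positive class.

```python
import math

def binary_entropy(p):
    """Shannon entropy (in bits) of a Bernoulli variable with P(positive) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def select_by_entropy(scores, batch_size):
    """Return the indices of the batch_size instances with the highest entropy."""
    ranked = sorted(range(len(scores)),
                    key=lambda i: binary_entropy(scores[i]),
                    reverse=True)
    return ranked[:batch_size]

# Hypothetical model scores for five unlabeled instances.
scores = [0.02, 0.48, 0.91, 0.55, 0.30]
print(select_by_entropy(scores, 2))  # -> [1, 3], the scores closest to 0.5
```

For binary classification the entropy is maximal at a score of 0.5, so this ranking is equivalent to ranking by distance to 0.5.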

Query by committee: Query by committee (qbc_10.1145/130385.130417) is a simple but potentially computationally heavier method that combines knowledge from an ensemble of ML models, chosen by the user, where each model in the ensemble is trained on the labeled data pool and used to score the unlabeled data. A measure of disagreement among the models is computed for each instance based on the model scores. Instances with higher disagreement rank higher. Often, it also assumes that the scores are well calibrated probabilities. In Section 3.2, we discuss an alternative criterion based on rank disagreement.

Expected Model Change, Error Reduction and Variance Reduction: These methods compute, for each possible query, an estimate for the expected value of either: i) the change in model parameters (expected_model_change_NIPS2007_3252), ii) the error reduction in the model predictions (expected_error_reduction_10.5555/645530.655646), or iii) the variance reduction in model predictions (variance_reduction_COHN19961071). The basic principles are, respectively, to query the instance that is expected to change the model the most, reduce the total prediction error the most, or reduce the variance of the predictions the most. Expected error reduction is often impractical, because it requires retraining the model for all label assignments for each possible query.

3. Methods

Figure 1. Experimental framework architecture overview.

An illustrative diagram of the architecture of our experimental framework is presented in Figure 1. Its main components are:

  • Data Components: This contains a Data Stream collecting events in real time and storing them in the Unlabeled pool. The Labeled pool stores labeled data. Both pools start empty.

  • Process Startup: This is responsible for training pre-processing pipelines, enriching the raw incoming data stream with features and applying feature selection and/or dimensionality reduction.

  • AL loop: This iteratively collects labels and trains the model. At each step the Data is accessed and manipulated as follows:

    1. Select Instances: A batch of unlabeled events is selected for labeling. An arbitrary sequence of AL policies chained together with switching criteria is possible (left of block 1), though in our experiments we only consider up to 3-stage sequences.

    2. Label Instances: Here we simply move the instances selected for labeling from the Unlabeled to the Labeled pool and reveal their label (our data sources contain the true label). In a live system analysts would provide the labels.

    3. Train Model: The labeled data is used to train and evaluate the ML model. We continuously iterate this loop up to a maximum fixed duration – e.g., until a week of unlabeled data is collected by the stream and a corresponding fixed number of labels is collected according to the number of batches and the batch size. Because we use historical data to simulate the streaming scenario, we can evaluate the sequence of AL models, obtained while iterating, on a separate test set offline – see Section 4.
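The three steps of the AL loop can be sketched as follows. This is a simplified illustration under stated assumptions, not the framework's actual code: the switching rule in `choose_stage`, the `random_policy` placeholder, the event format `(features, label)` and the toy "model" are all hypothetical.

```python
import random

def choose_stage(labeled):
    """Hypothetical switching rule between the three policy stages."""
    if not labeled:
        return "cold"      # no labels collected yet
    if len({label for _, label in labeled}) < 2:
        return "warmup"    # only one class observed so far
    return "hot"           # both classes observed

def random_policy(unlabeled, labeled, batch_size, rng):
    """Placeholder policy: uniform sampling of indices into the unlabeled pool."""
    return rng.sample(range(len(unlabeled)), min(batch_size, len(unlabeled)))

def run_al_loop(stream, policies, train_model, batch_size, seed=0):
    rng = random.Random(seed)
    unlabeled, labeled, history = [], [], []
    for new_events in stream:
        unlabeled.extend(new_events)  # the stream grows the unlabeled pool
        stage = choose_stage(labeled)
        # 1. Select instances with the policy of the current stage.
        picked = set(policies[stage](unlabeled, labeled, batch_size, rng))
        # 2. Label instances: move them to the labeled pool (label is revealed).
        labeled.extend(e for i, e in enumerate(unlabeled) if i in picked)
        unlabeled = [e for i, e in enumerate(unlabeled) if i not in picked]
        # 3. Train the model on the labeled pool.
        history.append((stage, len(labeled), train_model(labeled)))
    return history

# Toy stream: three time steps with ten all-negative events each.
stream = [[((float(t), float(i)), 0) for i in range(10)] for t in range(3)]
policies = {"cold": random_policy, "warmup": random_policy, "hot": random_policy}

def positive_rate_model(labeled):
    """Stand-in for model training: returns the observed positive rate."""
    return sum(label for _, label in labeled) / len(labeled)

for stage, n_labeled, model in run_al_loop(stream, policies, positive_rate_model, batch_size=2):
    print(stage, n_labeled, model)
```

In a real deployment the entries of `policies` would be distinct cold, warm-up and hot policies, and step 2 would wait for an analyst instead of reading the stored label.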

3.1. Startup and preprocessing

In our experiments we train random forest (RF) classifiers, which require a suitable set of engineered features. In a cold start scenario we may not know in advance which features are useful to predict the target. Thus, we apply a preprocessing pipeline using, as minimal information, the schema of the raw data fields collected by the system. The two transformations in the pipeline are described next.

Automatic Feature Engineering: We use Feedzai’s AutoML tool, which can generate automatically a feature engineering plan based only on the semantics of the raw fields. This only requires a semantic mapping file to tag the raw fields (specifying, e.g., grouping entities, numerical fields, or the semantics of fields to be used in pre-defined types of feature engineering operations), together with a specification of window durations to compute profile feature aggregations (e.g., count of transactions per card in the last hour). Further details on Feedzai’s AutoML are found in reference (automl_feedzai_patent).

Unsupervised Feature Selection: The automatic feature engineering plan may produce several hundreds of features. The predictive performance of the ML model may degrade if too many noisy or redundant features are provided. Furthermore, from a system perspective, computing and saving more features than necessary is computationally wasteful. Therefore it is useful to apply feature selection or another dimensionality reduction strategy. We have considered three options. The simplest one, Domain Knowledge Reduction, consists of asking a domain expert to suggest the most relevant features. The second option, Pairwise Correlations Reduction, uses a training set to evaluate feature correlations, starts with the most correlated pair of numerical features, removes one of the two features and continues iteratively. The process stops when a (small enough) threshold value of pairwise correlation is attained or when a pre-specified number of features is left. A third option is to apply Principal Component Analysis (PCA) (pca_paper) to reduce the dimensionality of the numerical features while explaining most of the variance in the data. Both Pairwise Correlations Reduction and PCA dimensionality reduction require a sample of unlabeled data. In real applications this is often not an issue, because unlabeled data is easy to collect through an initial waiting period (e.g., we use one day in our experiments).
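As an illustration of the Pairwise Correlations Reduction option, the following is a minimal sketch of the greedy procedure. The function names, the default threshold and the choice of which feature of a pair to drop are assumptions made for the example; the text does not specify them.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equally long numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / math.sqrt(var_a * var_b)

def reduce_by_correlation(columns, max_corr=0.9, min_features=1):
    """Greedily drop one feature of the most correlated pair.

    columns: dict mapping feature name -> list of values.
    Stops when the largest absolute correlation is <= max_corr
    or when only min_features features are left.
    """
    kept = list(columns)
    while len(kept) > min_features:
        best = max(
            ((abs(pearson(columns[a], columns[b])), a, b)
             for i, a in enumerate(kept) for b in kept[i + 1:]),
            default=None,
        )
        if best is None or best[0] <= max_corr:
            break
        kept.remove(best[2])  # assumption: drop the second feature of the pair
    return kept

columns = {
    "amount": [1.0, 2.0, 3.0, 4.0],
    "amount_x2": [2.0, 4.0, 6.0, 8.0],   # redundant copy of "amount"
    "hour": [4.0, 1.0, 3.0, 2.0],
}
print(reduce_by_correlation(columns))  # -> ['amount', 'hour']
```

Recomputing all pairwise correlations at every step is quadratic in the number of features; for several hundred features a correlation matrix computed once would be the natural optimization.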

In our study, we performed a limited set of experiments on one dataset to compare the three methods, which indicated that PCA dimensionality reduction is a suitable method that typically performs as well as or better than the other methods. Due to space limitations, and since the results were not very different among methods, we will only present results with PCA preprocessing.

3.2. Policies

We now discuss our specific choice of policies for the experiments.

3.2.1. Cold policies

We test a Random policy, but also an Outlier detection policy. For the latter we use an isolation forest (ISF_paper_4781136) trained on the unlabeled pool and use the isolation score to rank its transactions from most outlier-like (to query) to most inlier-like. Experiments using this method will be identified with the tag OutlierDetect. In all policies that require an isolation forest we use the scikit-learn (scikit-learn) implementation with 100 trees, using all features to grow each tree, and a maximum number of samples per tree which is the minimum between 256 and the total number of samples. Cold policies are also baselines for AL (if used as a single policy, i.e., one-stage sequence, throughout the experiment).

3.2.2. Warm-up policies

Regarding warm-up we propose a new method, Outlier-based Discriminative Active Learning (ODAL), where an outlier detection model is trained on the labeled pool, and is then used to score the unlabeled pool to find the greatest outliers relative to the labeled pool, which are selected for querying. In typical AL scenarios the labeled pool is much smaller than the unlabeled pool. Therefore this provides a computationally lighter policy, because it can be trained on the labeled pool only, in contrast with regular Discriminative AL (DAL) where the (large) unlabeled pool is also used to train a discriminative model to differentiate between the labeled and unlabeled pool. Furthermore, another advantage of ODAL over DAL can be observed by expanding the posterior probability, $p(0 \mid x)$, for an instance with features $x$ to be in the unlabeled pool (denoted by 0) using Bayes' theorem:

$$p(0 \mid x) = \frac{(1-\lambda)\, p_U(x)}{(1-\lambda)\, p_U(x) + \lambda\, p_L(x)} = \left[ 1 + \frac{\lambda}{1-\lambda}\, \frac{p_L(x)}{p_U(x)} \right]^{-1} \tag{1}$$

Here $p_U(x)$, $p_L(x)$ are, respectively, the distributions of the unlabeled and labeled pool and $\lambda$ is the fraction of labeled data. In Eq. (1) we can see that, up to the constant $\lambda/(1-\lambda)$, the DAL score prioritizes instances both with high density in the unlabeled pool and low density in the labeled pool, which may not be desirable if the labeled pool is missing examples in lower density regions of the unlabeled pool. On the other hand ODAL only models $p_L(x)$, so it favours, by design, instances that are not well represented in the labeled pool regardless of how well they are represented in the unlabeled pool. For problems with a large class imbalance this is especially important. Thus, ODAL is both computationally feasible for our large scale experiments and less biased by the unlabeled data distribution. Finally, as we will see in Section 5, ODAL warm-up adds an earlier boost to the learning curves in imbalanced datasets. In the experiments, we use the same isolation forest outlier detection algorithm mentioned in Section 3.2.1. Thus, the unlabeled pool instances are ranked by isolation score and the ones that rank higher are selected for querying.
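The selection step of ODAL can be illustrated with a small dependency-free sketch. Note the substitution: the paper uses an isolation forest as the outlier detector, whereas this example swaps in a mean k-nearest-neighbour distance to the labeled pool, only to keep the code short and deterministic. The function names, the value of `k` and the toy data are hypothetical.

```python
import math

def knn_outlier_score(x, reference, k=3):
    """Mean Euclidean distance from x to its k nearest neighbours in reference.

    Stand-in for the isolation score: larger means x is less well
    represented by the reference (labeled) pool.
    """
    dists = sorted(math.dist(x, r) for r in reference)
    k = min(k, len(dists))
    return sum(dists[:k]) / k

def odal_select(unlabeled, labeled, batch_size, k=3):
    """Rank unlabeled instances by how much they are outliers w.r.t. the labeled pool."""
    scores = [knn_outlier_score(x, labeled, k) for x in unlabeled]
    ranked = sorted(range(len(unlabeled)), key=lambda i: scores[i], reverse=True)
    return ranked[:batch_size]

# The detector is fitted on the (small) labeled pool only.
labeled = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
unlabeled = [(0.05, 0.05), (5.0, 5.0), (0.2, 0.1), (-3.0, 0.0)]
print(odal_select(unlabeled, labeled, batch_size=2))  # -> [1, 3]
```

The score of an unlabeled instance depends only on the labeled pool, never on the density of the unlabeled pool, which is the property the discussion above attributes to ODAL.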

3.2.3. Hot policies

We now describe the supervised policies.

Uncertainty Sampling: As discussed in Section 2, the most common uncertainty criterion consists of selecting instances with the highest expected entropy, which assumes that the ML model scores provide well calibrated probabilities. This may not hold for many algorithms and is especially problematic for high class imbalance (calibration_pozzolo_7376606).

One approach to this issue is to perform score calibration (scores_calibration_10.1145/1102351.1102430; scores_calibration_nn_10.5555/3305381.3305518), but it requires a separate calibration set (or cross validation). In the case of AL, this implies further splitting a labeled pool that is already small so, instead, we introduce an alternative for binary classification. We first observe that the score function of most ML algorithms is a monotonic function of the class posterior probability. Thus we still expect that instances with higher scores will have a higher probability of being positive. Given a sample of data, the classification boundary can be equivalently characterized by a score percentile, i.e., a position in the sorted set of scores. We then note that the percentile of the classification boundary, for a perfect classifier that knows the labels, would be equal to the negative class rate. This motivates an alternative uncertainty criterion, which is independent of score calibration, where the uncertainty boundary is at the percentile given by the estimated negative class rate. Then the uncertain instances are considered to be the ones closest to that boundary. In the experiments, we will show results with the classic entropy criterion, as well as with our fraud percentile criterion.
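The fraud percentile criterion can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: the function name is hypothetical, distance to the boundary is measured in rank positions, and how the fraud rate is estimated (for instance from the labeled pool) is an assumption left to the caller.

```python
def select_by_fraud_percentile(scores, est_fraud_rate, batch_size):
    """Pick the instances whose score rank is closest to the boundary percentile.

    The boundary sits at the percentile given by the estimated negative
    class rate, i.e. 1 - est_fraud_rate, in the ascending ordering of scores.
    Only the ordering of the scores is used, so calibration is irrelevant.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])   # ascending by score
    boundary = (1.0 - est_fraud_rate) * (n - 1)         # position in the ordering
    closest = sorted(range(n), key=lambda pos: abs(pos - boundary))
    return [order[pos] for pos in closest[:batch_size]]

# Hypothetical scores for eleven unlabeled instances and a 20% fraud rate estimate.
scores = [0.90, 0.05, 0.40, 0.10, 0.70, 0.20, 0.95, 0.30, 0.60, 0.50, 0.80]
print(select_by_fraud_percentile(scores, est_fraud_rate=0.2, batch_size=1))  # -> [10]
```

In the example the boundary falls on the instance with score 0.80: two of the eleven scores lie above it, which is roughly the top 20% that the assumed fraud rate would label as positive.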

Query By Committee: In this policy we introduce an alternative measure of disagreement, among the models in the committee, that is insensitive to whether or not the scores output by each model are calibrated as probabilities. This can be important if the committee contains a mixture of models with and without a probabilistic outcome. For each model in the committee, we rank the unlabeled pool instances by descending model score and compute the average pairwise absolute difference of ranks between any two models. Instances on which the models disagree will have very different rankings across models. In the experiments we use a committee with: a Random Forest with 100 trees and a maximum depth of 3, an L2 regularized Logistic Regression, a Gaussian naive Bayes classifier, and a Gradient Boosting Classifier with 100 estimators. (We use scikit-learn (scikit-learn) implementations for all mentioned ML models unless stated otherwise. For the unspecified hyper-parameters we use the library defaults.)
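The rank-based disagreement measure can be written down directly from its description. The sketch below is a minimal illustration with hypothetical function names and made-up committee scores; ties in scores are broken by position, which is an assumption.

```python
from itertools import combinations

def ranks_descending(scores):
    """Rank of each instance when sorting by descending score (0 = highest score)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks

def rank_disagreement(committee_scores):
    """Average pairwise absolute rank difference per instance.

    committee_scores: one list of scores per committee member, all over
    the same unlabeled pool. Scores need not be probabilities.
    """
    all_ranks = [ranks_descending(s) for s in committee_scores]
    pairs = list(combinations(range(len(all_ranks)), 2))
    n = len(committee_scores[0])
    return [
        sum(abs(all_ranks[a][i] - all_ranks[b][i]) for a, b in pairs) / len(pairs)
        for i in range(n)
    ]

def select_by_disagreement(committee_scores, batch_size):
    d = rank_disagreement(committee_scores)
    return sorted(range(len(d)), key=lambda i: d[i], reverse=True)[:batch_size]

# Three hypothetical committee members; the third outputs uncalibrated scores.
committee = [
    [0.9, 0.5, 0.1, 0.3],
    [0.8, 0.2, 0.6, 0.4],
    [5.0, 1.0, 3.0, 2.0],
]
print(rank_disagreement(committee))
print(select_by_disagreement(committee, 2))
```

All members agree on the ranks of the first and last instances, so their disagreement is zero, while the two middle instances are ranked differently by the first member and are selected.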

Expected Model Change: For this method, we use the simplest approach in the literature. First a gradient-based classifier is trained on the labeled data pool. Then, for each unlabeled instance, the expected gradient norm for the given instance is computed assuming that the model parameters are near an optimum of the model’s loss function. Finally, the unlabeled pool instances are ranked so that instances with larger expected gradient are prioritized. In our implementation, we use a logistic regression with L2 regularization.
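For a logistic regression, the expected gradient norm has a simple closed form, sketched below. This is an illustration under assumptions, not the authors' code: the gradient of the log-loss for a single instance with label y is taken to be (p - y) times the feature vector (augmented with a constant 1 for the bias), the regularization term is ignored because it does not depend on the candidate instance, and the weights in the example are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def expected_gradient_norm(weights, bias, x):
    """Expected norm of the log-loss gradient that instance x would add.

    The expectation is over the unknown label, using the model's own
    prediction p as the probability of the positive class.
    """
    p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    x_norm = math.sqrt(sum(xi * xi for xi in x) + 1.0)  # +1 for the bias input
    norm_if_positive = (1.0 - p) * x_norm   # |p - 1| * ||x||
    norm_if_negative = p * x_norm           # |p - 0| * ||x||
    return p * norm_if_positive + (1.0 - p) * norm_if_negative

def select_by_expected_gradient(weights, bias, pool, batch_size):
    scores = [expected_gradient_norm(weights, bias, x) for x in pool]
    return sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)[:batch_size]

# Hypothetical fitted parameters and a small unlabeled pool.
weights, bias = [1.0, 0.0], 0.0
pool = [[0.0, 1.0], [5.0, 0.0], [0.0, 3.0]]
print(select_by_expected_gradient(weights, bias, pool, 3))  # -> [2, 0, 1]
```

The expression simplifies to 2p(1 - p) times the feature norm, so the criterion favours instances that are both uncertain (p near 0.5) and have large feature norm, which is one reason feature standardization matters for this policy.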

Expected Variance Reduction and Epistemic Uncertainty: The expected variance reduction method estimates the variance of the model predictions. This is tightly related to the notion of epistemic uncertainty discussed in the literature (rf_uncertainty_10.1007/978-3-030-44584-3_35). The latter is the reducible part of the total uncertainty, composed of i) the model uncertainty (or bias), which is due to the restricted choice of hypothesis space when fixing a type of model, plus ii) the approximation uncertainty (variance), which is reducible by collecting more data. The remaining uncertainty (also known as aleatoric) is intrinsic to the data generating process and can never be removed.

The uncertainty sampling criterion that uses the entropy of the model scores is precisely the total uncertainty criterion. The epistemic uncertainty, being the difference between the total and aleatoric uncertainty, may give a better measure of uncertainty for AL, because it is only sensitive to the reducible components. Though it still contains the uncertainty from the bias, it can be more tractable than variance estimates, which often rely on analytic expressions assuming differentiability. In our analysis, we train models using a random forest classifier. This is non-differentiable but it offers a convenient way of controlling regularization, using a large number of shallow trees, which is important to train on small labeled pools. The epistemic uncertainty for random forests is estimated from the outputs of each tree in the forest (rf_uncertainty_10.1007/978-3-030-44584-3_35).
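One common way to estimate this decomposition from per-tree outputs is entropy based: the total uncertainty is the entropy of the averaged prediction, the aleatoric part is the average entropy of the individual trees, and the epistemic part is their difference. The sketch below illustrates that estimator; whether it matches the exact estimator used in the experiments is an assumption, and the function names and tree outputs are hypothetical.

```python
import math

def binary_entropy(p):
    """Shannon entropy (in bits) of a Bernoulli variable with P(positive) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def uncertainty_decomposition(tree_probs):
    """Split the total predictive uncertainty of an ensemble for one instance.

    tree_probs: positive-class probability output by each tree.
    Returns (total, aleatoric, epistemic), all in bits.
    """
    mean_p = sum(tree_probs) / len(tree_probs)
    total = binary_entropy(mean_p)                        # entropy of the mean
    aleatoric = sum(binary_entropy(p) for p in tree_probs) / len(tree_probs)
    return total, aleatoric, total - aleatoric            # epistemic = difference

# Trees that all say 0.5: uncertainty is intrinsic to the data (aleatoric).
print(uncertainty_decomposition([0.5, 0.5, 0.5, 0.5]))   # -> (1.0, 1.0, 0.0)
# Confident trees that contradict each other: uncertainty is epistemic.
print(uncertainty_decomposition([0.0, 1.0, 0.0, 1.0]))   # -> (1.0, 0.0, 1.0)
```

Both examples have the same total (entropy) uncertainty, so plain entropy sampling cannot tell them apart, whereas an epistemic criterion would prioritize only the second instance.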

4. Experiments

In this section we present results of experiments with several real world credit card fraud datasets.

4.1. Data preparation

We cover several representative use cases in the fraud detection domain, namely card issuing banks (Banking), platforms that process online payments for several merchants (Payment Processors) and single merchant online platforms (Merchants).

Dataset              Class Imbalance    Sampling fraction
Bank 1                                  11.0 %
Bank 2                                  3.0 %
Payment Processor                       2.5 %
Merchant                                100.0 %
Table 1. Dataset properties: Due to privacy reasons we do not provide further details (see detailed description in text).

In Table 1 we provide some properties of each dataset, all of which contain fraudulent (positive) and legitimate (negative) transactions. The fraud rates span several orders of magnitude, from an extremely large imbalance (Bank 1), to moderate imbalances of a few percent. The datasets contain raw fields collected when transactions arrived at a fraud detection system in real-time, including the monetary amount of the transaction, the timestamp of the event, several identifiers (e.g., card ID), categorical fields and the fraud label.

The volume of transactions varies across datasets from a few millions to several hundreds of millions per year. To speed up our experiments, we applied undersampling to reduce the volume to a manageable (and similar) level for all datasets. This allowed us to scale up our experiments to cover many different types of policies and to perform a more extensive Temporal Cross Validation (TCV) over a longer period. We applied the sampling before feature engineering to speed up the preprocessing. Fraudulent and non-fraudulent card id entities were randomly sampled separately with the sampling rate indicated in Table 1 (this preserves the fraud rate) and all transactions were kept for each sampled card id. This keeps complete card histories, allowing us to compute sliding window profiles that are important to characterize the event (rnns_feedzai_10.1145/3394486.3403361).
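The entity-level undersampling described above can be sketched as follows. The field names `card_id` and `is_fraud`, the function name, and the rule that a card counts as fraudulent if it has at least one fraudulent transaction are assumptions made for this illustration.

```python
import random

def sample_by_card(transactions, fraction, seed=0):
    """Undersample by card: sample fraud and non-fraud cards separately,
    then keep every transaction of each sampled card.

    transactions: list of dicts with hypothetical keys 'card_id' and 'is_fraud'.
    """
    rng = random.Random(seed)
    all_cards = {t["card_id"] for t in transactions}
    fraud_cards = {t["card_id"] for t in transactions if t["is_fraud"]}
    legit_cards = all_cards - fraud_cards

    def pick(cards):
        cards = sorted(cards)  # sort for reproducibility across runs
        return set(rng.sample(cards, round(fraction * len(cards))))

    kept = pick(fraud_cards) | pick(legit_cards)
    return [t for t in transactions if t["card_id"] in kept]

# Toy data: 100 cards with 3 transactions each; the first 10 cards have one fraud.
transactions = [
    {"card_id": c, "is_fraud": c < 10 and k == 0}
    for c in range(100) for k in range(3)
]
sample = sample_by_card(transactions, fraction=0.5)
print(len(sample), len({t["card_id"] for t in sample}))  # -> 150 50
```

Sampling the two groups of cards with the same fraction keeps the share of fraudulent cards unchanged, and keeping whole cards preserves the per-card history needed for sliding window features.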

We applied automatic feature engineering, which generated between 600 and 800 features depending on the dataset – see Section 3.1. The categorical fields were encoded both with ordinal and frequency encoding and standardized to zero mean and unit variance similarly to other numerical features. The remaining pre-processing is scenario specific – details provided in Section 4.2.

4.2. Experimental Setup

In this section we describe details of the experimental setup that are common to all data sets.

Figure 2. Time folds for the five simulation periods in the experiments (see detailed description in the text).

In Figure 2 we present a diagram of the various slices of data for any given data set. We define Folds, which consist of 8 week periods (two pairs of 4 weeks). Within each fold, the first 4 weeks (green), are used for model training, whereas the following 4 weeks (blue), are for model evaluations. The Train period is used differently according to the type of experimental run. We define two types of scenarios:

- AL in streaming: This case mimics a scenario where the AL system is deployed for the first time in streaming without access to previous data. Since the goal is to collect labels quickly to obtain a good model, without waiting for labels to arrive by other means, applying AL is typically relevant for a few weeks. Thus, we only reserve the two last weeks of the Train period (darker green: weeks 3 and 4) to sample data with AL for training (weeks 1 and 2 are used for the strong optimistic baseline discussed next). The Test set allows us to measure the model performance after the deployment of the last AL model. In practice, for most data sets we only use one week for AL training (except for Bank 1 which, due to the extreme class imbalance, needs a longer period for the performance to stabilize).

- Optimistic Baseline: Here we train a strong model that has access to all data and labels (weeks 1 to 4: light plus dark green). The goal is to obtain a “best case scenario” upper bound performance.

Each experiment (either AL or Optimistic Baseline) consists of 35 repetitions of the train-test procedure with different pseudo-random number generator seeds. This allows us to assess the stability of the AL policies by observing the variance of our metrics. We choose 35 seeds as a good trade off between run time and a high chance of observing a wider range of values around the center of the distribution. As displayed in Figure 2, we repeat each experiment in 5 different folds (Train+Test pairs) to observe the robustness of the AL procedure against temporal variations.

4.2.1. Streaming AL Training

In all AL experiments we include an initial waiting period of one day to simulate the collection of some unlabeled data to fit the pre-processing pipeline. This mimics a realistic scenario of deployment with no previous data. To reduce the number of numerical features generated by the AutoML pipeline (which may contain redundant information) we apply PCA on the numerical features. In preliminary experiments on Bank 2, we checked that about 90 features can explain most of the data variance. We then decided to fix 90 features after PCA for all datasets to keep the run time similar across experiments. In real life systems, there are often performance constraints that impose such limits.

Observe that our pre-processing pipeline is trained on the first day of unlabeled data, and used to transform all future data arriving at the stream (Train or Test period). This is to mimic a day-1 system deployment. However, after day-1, the pipeline could be updated frequently but, for simplicity, we chose to fix it in our experiments.

For each run, several labeling iterations are processed after the waiting period of one day, according to the diagram of Figure 1 – see Section 3. Therefore the unlabeled pool grows with time, as does the labeled pool during the AL training iterations, whose growth is indirectly controlled by the time assumed for the team of analysts to label each queried batch of events. Thus, if the team is, e.g., a single analyst taking one hour to review a batch, we assume that one hour of new data is inserted in the unlabeled pool after the batch is labeled. For simplicity we use a fixed batch size and a fixed time to review corresponding to an overall review rate of 1000 events per day. The only exception is for Bank 1, where, due to the extreme class imbalance, we assumed twice the daily budget.

Regarding the ML model to train on the AL labeled data, we chose a highly regularized Random Forest (RF) classifier from the scikit-learn library with a maximum tree depth of 3 and 200 trees (other hyper-parameters set to defaults). We did a small study on Bank 2 on two time folds, where we either i) varied the number of trees up to 1000, ii) reduced or increased the maximum depth, or iii) used other models with different levels of regularization (Feed Forward Neural Networks, Support Vector Machines and Naïve Bayes). This confirmed the benefits of regularization. Despite improvements with 1000 trees, we chose 200 to speed up our simulations.

4.2.2. Optimistic Baseline Training

Here we assume access to fully labeled data in the 4 weeks of the Train period. Additionally we apply a more robust training methodology. We train a RF classifier with 300 trees and a maximum depth of 20. For each of the 35 models (one per seed) we train 5 random configurations of hyper-parameters on the first 3 weeks and evaluate on week 4 to select the best configuration. The final configuration is re-fit on the 4 weeks.

For each model trained above, we also apply supervised feature selection. Thus, each training proceeds in three stages: i) first we fit the data with all features, ii) then we select a fraction of the top importance features, and iii) we retrain with only those top features. The fraction of features to use is a hyper-parameter to vary. In addition, we also vary the minimum number of samples in a leaf node, a binary parameter to use class weights or not, and the complexity parameter for minimal cost-complexity pruning.

4.2.3. Evaluation metrics

We now discuss the performance metrics used to measure the quality of a single AL experiment, as well as to aggregate and summarize an experiment to compare runs.

Learning curves: A single AL experiment consists of several iterations in which the labeled pool grows and a sequence of models is trained, each of which can be evaluated on the Test set. Given a performance metric (e.g., recall at a fixed false positive rate), we obtain a learning curve where the metric usually improves during the simulation. Since we run 35 simulations, we obtain a distribution of learning curves, which we will visualize as percentile band plots in Section 5.

Since we run hundreds of experiments to test different policies, datasets and time periods, it is not feasible to observe all learning curves. Therefore we now define three aggregations to summarize each set of learning curves and be able to interpret the results.

Learning curves rise: To summarise how quickly the learning curves rise throughout the iterations (see, e.g., Figure 3), we compute the Area Under the percentile 50 learning curve (Area P50), defined as the curve tracing the median performance (over the 35 seeds) on each iteration. In addition, we normalize it by the area under the median optimistic baseline, which is the horizontal line corresponding to the median performance of the optimistic model (denoted by Norm Area P50). This allows us to compare folds relative to their optimistic baseline, while correcting for temporal drift unrelated to AL that also shifts that baseline.

Learning curves variability: To measure the variance of the learning curves (a good policy will always rise fast for all seeds), we use the Area between the percentiles 10 and 90 (denoted Var in the results). This is also normalized by the optimistic baseline area.

Quality of the final AL model: This is defined as the median performance of the final AL model normalized by the performance of the optimistic baseline (we denote it as Norm Final P50).
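As a rough illustration of the three aggregations, the sketch below computes them from a set of per-seed learning curves. Several details are assumptions not stated in the text: unit spacing between iterations, trapezoidal integration, linearly interpolated percentiles, and the names `percentile`, `trapezoid_area` and `summarize`.

```python
from statistics import median


def percentile(values, q):
    """Linearly interpolated percentile, with q in [0, 100]."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def trapezoid_area(curve):
    """Area under a curve sampled at unit-spaced iterations."""
    return sum((a + b) / 2 for a, b in zip(curve, curve[1:]))


def summarize(curves, baseline_scores):
    """curves: one learning curve per seed, all of equal length.
    baseline_scores: one optimistic-baseline score per seed."""
    per_iteration = list(zip(*curves))  # values across seeds at each iteration
    p10 = [percentile(v, 10) for v in per_iteration]
    p50 = [percentile(v, 50) for v in per_iteration]
    p90 = [percentile(v, 90) for v in per_iteration]
    base = median(baseline_scores)
    # Area under the horizontal line at the median baseline performance.
    base_area = base * (len(p50) - 1)
    return {
        "norm_area_p50": trapezoid_area(p50) / base_area,
        "var": (trapezoid_area(p90) - trapezoid_area(p10)) / base_area,
        "norm_final_p50": p50[-1] / base,
    }


# Toy input with three seeds and three iterations (made-up numbers).
toy_curves = [[0.0, 0.2, 0.4], [0.0, 0.4, 0.8], [0.0, 0.6, 1.2]]
print(summarize(toy_curves, baseline_scores=[0.8, 0.8, 0.8]))
```

Because all three quantities are divided by the baseline, a value of 1 for Norm Final P50 means the median final AL model matches the median optimistic model.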

4.3. Policy Sequences and Parameters

We will present experiments with AL policy sequences with 1-stage (cold), 2-stages (cold + hot) and 3-stages (cold + warm-up + hot). We use the Random policy for the cold phase, with the following exceptions: i) a 1-stage sequence where the Outlier Detection policy is used for comparison, and ii) a baseline policy denoted QueryAll. The latter corresponds to a scenario of unbounded labeling resources where all incoming transactions are labeled. In 2-stage policies, we combine the Random policy with: uncertainty sampling using the entropy uncertainty (Unc. (entropy)); ODAL; Query By Committee (QBC); and Expected Model Change (EMC) – see Section 3. In the 3-stage combinations, we use ODAL for warm-up and then, for the third stage, the same supervised policies as in 2-stage sequences. For uncertainty sampling in 3-stage sequences, we also include the two other measures of uncertainty discussed in Section 3 for comparison – denoted Unc. (epistemic) and Unc. (percentile), respectively for the epistemic uncertainty and the fraud percentile criteria.

In the 2-stage policies we switch policies after we have at least one label from each class. Regarding the 3-stage sequence, the same criterion is used to switch between warm-up and hot policy, however, the switch to the warm-up policy from the cold policy is done after the first batch is collected with the cold policy, to start exploiting ODAL immediately.
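The switching rules just described can be written as a small selection function. This is a minimal sketch; the function name `active_policy` and its arguments are hypothetical, and labels are assumed to be encoded as 0 (legitimate) and 1 (fraud).

```python
def active_policy(sequence, n_batches_labeled, labels):
    """Return the name of the policy to use for the next query.

    sequence: tuple with 1, 2 or 3 policy names,
              e.g. ("Random", "ODAL", "Unc. (entropy)").
    n_batches_labeled: number of batches labeled so far.
    labels: the 0/1 labels collected so far.
    """
    has_both_classes = len(set(labels)) >= 2
    if len(sequence) == 1:
        return sequence[0]
    if len(sequence) == 2:
        cold, hot = sequence
        # Switch to the hot policy once each class has at least one label.
        return hot if has_both_classes else cold
    cold, warm_up, hot = sequence
    if n_batches_labeled == 0:
        # The first batch is always collected with the cold policy.
        return cold
    # Afterwards use the warm-up policy until both classes are observed.
    return hot if has_both_classes else warm_up


three_stage = ("Random", "ODAL", "Unc. (entropy)")
print(active_policy(three_stage, 0, []))
print(active_policy(three_stage, 1, [0] * 100))
print(active_policy(three_stage, 3, [0] * 299 + [1]))
```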

As for the batch size, we set it to 100 for all policies. We performed some preliminary experiments with larger and smaller batch sizes and did not see substantial improvements with a smaller batch size.

5. Results

We present results of the AL experiments for the various datasets focusing first on the most imbalanced.

Table 2. Bank 1 rankings of AL policies using various folds (see also detailed description in the text).
Figure 3. Learning curves distribution for Bank 1 in the best fold (5): The best sequence of policies (left panel green bands), and the Random policy (right panel green band), normalized by the percentile 50 of the optimistic baseline (gray bands).

In Table 2, we show a summary of metrics for the five folds and all policies for Bank 1. Each row displays the metric values for a specific policy sequence (1-stage with only a cold policy and 2-stage without warm-up – see dashed lines). In the columns we have five groups of columns (one per Fold) with three metrics each (see Section 4.2.3): i) the normalized area under percentile 50 (Norm Area P50, blue density scale), ii) the ranking of the policy for the fold (center) according to Norm Area P50, and iii) the percentile 50 of the final normalized AL model performance (Norm Final P50, green density scale).

The rightmost pair of columns in Table 2 contains two metrics that summarize the five folds, namely the average of the ranks of each fold for each sequence (AVG Rank, orange density scale) and the average of the normalized area between percentiles 10 and 90 (AVG Var, red density scale). The former provides an overall measure of how fast the policy performance rises, whereas the latter measures how noisy the policy is, for this dataset. The table rows are sorted by ascending AVG Rank. Therefore policies that perform better on various folds are at the top. We choose to rank by Norm Area P50 rather than Norm Final P50 because it is more sensitive to how quickly the learning curves rise, which is critical in systems that need a good model to start acting as early as possible. Nevertheless, the final model performance is important because it tells us how close we get to the optimistic baseline. We include 12 sequences specified on the left. Random and QueryAll are baselines (Section 4.3).
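The AVG Rank column can be reproduced from the per-fold Norm Area P50 values along the following lines. The sketch assumes that a higher Norm Area P50 gives a better (lower) rank and breaks ties by sort order, which is an arbitrary choice; the policy names and scores in the usage example are made up and do not come from Table 2.

```python
def average_ranks(norm_area_p50):
    """norm_area_p50: {policy: [score per fold]}; a higher score is better.

    Returns {policy: mean rank over folds}, with rank 1 as the best policy.
    Ties are broken by sort order, an arbitrary choice for this sketch.
    """
    policies = list(norm_area_p50)
    n_folds = len(next(iter(norm_area_p50.values())))
    rank_sums = dict.fromkeys(policies, 0)
    for fold in range(n_folds):
        ordered = sorted(policies, key=lambda p: norm_area_p50[p][fold],
                         reverse=True)
        for rank, policy in enumerate(ordered, start=1):
            rank_sums[policy] += rank
    return {p: rank_sums[p] / n_folds for p in policies}


# Made-up scores for three hypothetical policies over two folds.
toy_scores = {
    "policy_a": [0.9, 0.5],
    "policy_b": [0.6, 0.7],
    "policy_c": [0.3, 0.2],
}
avg_rank = average_ranks(toy_scores)
for policy in sorted(avg_rank, key=avg_rank.get):  # ascending AVG Rank
    print(policy, avg_rank[policy])
```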

Bank 1 is the most challenging dataset with an extremely large class imbalance. Therefore we doubled the daily review budget and trained on the full two weeks available for AL in the Train period (see weeks 3 and 4 of each Fold in Figure 2). The best policies in Table 2 outperform Random by a large margin (close to doubling the performance in some cases). Furthermore, they are on par with QueryAll on folds 1, 4 and 5, both for the Area metric and the Final performance. In folds 2 and 3, although QueryAll performs substantially better, the group of top performing AL policies, based on uncertainty sampling, continues to rank highly.

Observe that, except for the rank, all the metrics have been normalized by the optimistic baseline, which is trained on extra data (full 4 weeks of the train period vs 2 weeks in Figure 2) with supervised feature selection and hyper-parameter tuning. This additional data would not be available in a realistic production setting and the improved training is challenging for AL in streaming. This explains why most metrics are smaller than 1. The exception is Fold 5, where Norm Final P50 is larger than 1 for various policies. This can be explained by observing the learning curves for Fold 5 in Figure 3, where we show the distribution of learning curves for the best AL policy (left) and the Random policy (right), represented by the rising green bands. Three equally spaced percentile bands are included, together with a solid gray line that traces the median. The distribution of values for the optimistic baseline is represented in the horizontal gray bands. All values have been normalized by the percentile 50 of the optimistic baseline. In this fold we can see that the distribution of values for the training of the optimistic baseline is quite wide. Thus, despite being above 1, the final performance of the AL model for the best policy is still within the central part of the distribution. Comparing left and right, we confirm that the 3-stage policy rises quickly to high performance with a narrow variance.

It is also important to note that 3-stage sequences, i.e., with ODAL in the warm-up, tend to outperform simpler setups, especially when paired with uncertainty based policies.

The overall conclusions, up to data set specific noise and some temporal drift effects, are confirmed for the other datasets. Note that AL typically only uses 1/10 to 1/50 of the number of samples available to the optimistic baselines. For other datasets we only present the policy rankings in Section 5.1, due to space constraints.

5.1. Aggregation over Datasets

In the previous section we discussed policy rankings and a pattern emerged: 3-stage sequences were the best performing policies, some 2-stage sequences also showed a good performance, and the rankings of the least performing policies were unstable across folds.

A convenient way of aggregating this information, to provide a clearer picture of the overall rankings, is to average out the policy ranks over the studied datasets. This is displayed in Table 3.

Table 3. Overall policy ranking: Average ranks for each dataset (four central columns) and their overall average (right column). Rows are sorted by the AVG column.

As expected, the QueryAll policy ranks first overall, even though it is not the top one for every dataset. The 3-stage policies based on entropy or epistemic uncertainty rank very close to it, which indicates that these are high quality AL policies. Sequences with Expected Model Change or the fraud percentile based Uncertainty policy rank in the middle of the table overall, but for some datasets they rank very low, so they are not very stable or consistent. On the other hand, the 2-stage policy with ODAL ranks between 5 and 7 across datasets, which reinforces its value as a stable warm-up policy. The Random policy ranks low, as expected. QBC also ranks low, but this may be due to our simple choice of committee (a more detailed study is left to future work). Another important observation is that all 3-stage policies rank higher than their 2-stage counterparts.

Figure 4. Boost in the number of positives sampled in 3-stages vs 2-stages for the entropy based uncertainty policy (see detailed description in the text).

In Figure 4 we display a visualization that helps to understand this improvement for the entropy based uncertainty policy. On each row we present the average increase of sampled positives, over all folds, when adding ODAL as a warm-up policy. For each fold, the increase is the 10th percentile difference between the positives obtained with a 3-stage sequence and the corresponding 2-stage sequence, divided by the mean positives of the 2-stage sequence. We can clearly observe that, for datasets with larger imbalances, including ODAL lifts up this low percentile considerably in early iterations (e.g., the mean value for Bank 1). The effect progressively disappears for milder imbalances – Merchant.
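One possible reading of this quantity is sketched below. It assumes that the "10th percentile difference" is the difference between the 10th percentiles of the two per-seed distributions of sampled positives at a given iteration (rather than the 10th percentile of per-seed differences), and that percentiles are linearly interpolated. The function names and the numbers in the usage example are hypothetical.

```python
from statistics import mean


def percentile(values, q):
    """Linearly interpolated percentile, with q in [0, 100]."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def positives_boost(pos_3stage, pos_2stage, q=10):
    """Relative boost for one fold at one iteration.

    Inputs are the numbers of positives sampled per seed. The mean of the
    2-stage positives must be non-zero.
    """
    gain = percentile(pos_3stage, q) - percentile(pos_2stage, q)
    return gain / mean(pos_2stage)


def average_boost(folds):
    """folds: list of (pos_3stage, pos_2stage) pairs, one pair per fold."""
    return mean(positives_boost(three, two) for three, two in folds)


# Made-up counts for five seeds in a single fold.
print(positives_boost([5, 7, 9, 11, 13], [0, 2, 4, 6, 8]))
```

Using a low percentile rather than the mean emphasizes the unlucky seeds, which is where a warm-up policy is expected to help most in early iterations.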

6. Conclusions

We studied the problem of creating a small labeled dataset, with a limited budget of annotations by analysts, in a streaming environment, in a cold start scenario (no previously labeled data and little or no unlabeled data) for highly imbalanced datasets. We proposed an AL system adapted to these conditions and performed a detailed study on four real world credit card fraud detection datasets, covering three use cases with several orders of magnitude in class imbalances. We proposed various ingredients that proved essential, namely: i) ODAL, a computationally efficient version of discriminative active learning to quickly represent well the unlabeled pool in the labeled pool, relying only on the labeled pool features distribution, and ii) the combination of ODAL, as a warm-up policy, with other AL policies, in a 3-stage sequence to alleviate the cold start problem in highly imbalanced datasets where it may take a long time until some of the labels are found. We also proposed two alternative uncertainty measures for the Uncertainty Sampling policy – epistemic uncertainty and the fraud percentile measure – as well as an alternative measure of disagreement based on rank differences for Query By Committee.

In Section 4 we conducted detailed experimental studies, including optimistic baselines and 12 different policy sequences to be ranked. Our analysis showed that the best performing AL policies are 3-stage sequences with ODAL warm-up and Uncertainty Sampling as Hot policy (either entropy or epistemic). In particular, we showed that the ODAL warm-up boosts the learning curves in the earlier AL iterations. As a general rule, the final overall ranking shows that including ODAL warm-up before any Hot policy boosts its learning curves, especially for large class imbalance. Furthermore, the best performing sequence is often as good as the QueryAll policy, it has low variance learning curves, it is competitive with the optimistic baseline and substantially better than Random. Our results show that the required amount of labeled examples until the learning curves stabilize often ranges between to for mild to intermediate class imbalances, and a bit over for extreme imbalances (1/10 to 1/50 of the optimistic baseline data).

To conclude, we comment on some future directions. In this study, we have simulated the analyst queries by using the real labels in the datasets. It would be interesting to perform experiments with real analysts in a live environment to see if the performance gains are confirmed. Another interesting use case, where label scarcity is a severe problem, is the detection of Money Laundering activities. The tool and methods developed for this study have already been applied to this use case in reference (lorenz2020machine) on a Bitcoin dataset with promising results. It would be interesting to study other Money Laundering datasets. Finally, we have not touched upon other possible problems and improvements that could be important in a real system. This includes the issue of evaluating the AL models online – in our study we used an independent test set in the future of the train set for evaluation. Related to this, it would also be interesting to include online hyper-parameter tuning and model selection, as well as online supervised feature selection, instead of using a static set of features selected in an unsupervised way on the first day.

We thank Jacopo Bono for reviewing the manuscript.