The task of automatic fraud detection has mainly been studied under the framework of imbalanced binary classification (bhattacharyya2011data). Given the description of a transaction, the goal is to predict a binary label indicating whether this transaction is fraudulent or not. The main difficulties arising in fraud detection, highlighted earlier in (bolton2002statistical), include among others:
The strong imbalance between the output labels. Fraudulent behaviors are assumed to be rare and are thus harder to find. Previous work has proposed different solutions that help build efficient predictors in the case of imbalanced classification. Such approaches mainly consist in introducing instance reweighting or bootstrap-based schemes (chawla2004special) in order to transform the imbalanced learning problem into a related balanced problem on which learning can be done with off-the-shelf predictors.
The large amount of unlabeled data relative to labeled data, which calls for methods that scale to large datasets and generalize well.
In the case of fraud detection in financial transactions, these properties have been highlighted in work involving both supervised (owen2007infinitely) and unsupervised (damez2012dynamic) learning approaches, where the problem of handling large datasets is specifically studied. Whereas this setting assumes that enough labeled samples are available to confidently follow the decisions of a learned predictor, another line of work relies on an active learning procedure that optimizes the accuracy of a predictor by iteratively labelling a set of well-chosen samples (Carcillo2018StreamingAL; zhang2018online). The main objective of this approach is to minimize the amount of work necessary to build a correctly performing classifier, since obtaining reliable labels is an expensive operation. A common feature of existing active learning strategies is that they select examples so as to keep the pool of labeled samples balanced, whether in terms of labels or of input space (ertekin2007learning). Indeed, training a predictor on imbalanced data is known to degrade its performance, and it also raises scalability issues because the labeled examples of the majority class are difficult to hold in memory. This online re-balancing process thus addresses the two issues raised above at the same time.
In the standard active learning framework, the true labels are sequentially queried from an oracle, and a good active learning strategy should provide good classification performance on new samples while making as few oracle queries as possible. Whereas previous work on fraud detection has focused on optimizing metrics computed on a test set from the predictions of the resulting classifier, we argue that in many practical applications with financial data this setting is not suitable, since it does not rely on the right metric. Indeed, under Article 22 of the European GDPR regulation (automated individual decision-making, including profiling), initiating legal proceedings and sanctions against a fraudulent user requires human verification of the corresponding decision (eu-269-2014). Since the fitted classifier will never be used without consulting an oracle, it is not necessary for the fitted classifier to excel on a held-out dataset. It is preferable, in this configuration, to minimize the number of verifications that correspond to non-fraudulent operations and thus to maximize the number of frauds discovered and treated over time. This setting differs from the active learning setting in that the goal is not to build the best classifier over a given horizon but instead to recommend as many fraudulent objects as possible to the oracle. In the next sections we present each framework and stress their similarities and differences, as well as the consequences in terms of suitable strategies.
2 Mathematical Settings
2.1 The Active Learning Setting
In this section we assume that we have access to a sample $\{(x_i, y_i)\}_{i=1}^{n}$, where the $x_i$ are input feature representations and the $y_i \in \{0, 1\}$ are binary labels indicating whether the transaction is a fraud ($y_i = 1$) or not. This sample is partitioned into an active set $\mathcal{D}_{\mathrm{active}}$ and a finite testing set $\mathcal{D}_{\mathrm{test}}$. The active set is in turn partitioned into a labeled set $\mathcal{L}_t$ and an unlabeled set $\mathcal{U}_t$, which evolve over time since querying the label of an example makes it move from $\mathcal{U}_t$ to $\mathcal{L}_{t+1}$ at each iteration $t$. Initially, labels are only available for a fraction of the data, which forms $\mathcal{L}_0$. We suppose additionally that we have access to an active learning strategy $\phi$, i.e. a function based on the current labeled sample which returns the next unlabeled point that will be provided to the oracle. Examples of such strategies are detailed in Section 2.2. The active learning procedure can then be divided into the following steps:
(1) Based on the current labeled dataset $\mathcal{L}_t$, build a predictor $h_t$.
(2) Choose an unlabelled point $x_t \in \mathcal{U}_t$ based on $\phi$ for which we want to obtain the label.
(3) Query the corresponding label $y_t$ from an oracle.
(4) Update $\mathcal{L}_{t+1} = \mathcal{L}_t \cup \{(x_t, y_t)\}$ and $\mathcal{U}_{t+1} = \mathcal{U}_t \setminus \{x_t\}$.
(5) Increment $t$ and repeat from (1) until $t = T$.
The performance of an active learning strategy can be measured through the performance of $h_t$ on the testing set. For a non-negative performance measure $m$, the goal is to find the strategy that maximizes $m(h_t)$ for all the time steps $t \le T$.
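The loop described in the steps above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the inputs are one-dimensional, the predictor is a toy threshold (`fit_threshold`), the performance measure is plain accuracy, and `closest_to_boundary` is a hypothetical strategy. None of these names come from the paper.

```python
def fit_threshold(labeled):
    """Toy 1-D predictor (assumption): midpoint between the two class means."""
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == 0]
    if not pos or not neg:
        return 0.5  # arbitrary fallback when one class is still unseen
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0


def accuracy(threshold, test_set):
    """Performance measure: fraction of test points classified correctly."""
    hits = sum((x > threshold) == (y == 1) for x, y in test_set)
    return hits / len(test_set)


def closest_to_boundary(threshold, unlabeled):
    """Hypothetical strategy: query the point nearest the decision threshold."""
    return min(unlabeled, key=lambda x: abs(x - threshold))


def active_learning(labeled, unlabeled, oracle, strategy, test_set, horizon):
    """Run the query loop and record the test performance after each query."""
    labeled, unlabeled = list(labeled), list(unlabeled)  # work on copies
    scores = []
    for _ in range(horizon):
        if not unlabeled:
            break
        predictor = fit_threshold(labeled)   # build a predictor
        x = strategy(predictor, unlabeled)   # choose an unlabeled point
        y = oracle(x)                        # query its label from the oracle
        labeled.append((x, y))               # move the point to the labeled set
        unlabeled.remove(x)
        scores.append(accuracy(fit_threshold(labeled), test_set))
    return labeled, unlabeled, scores
```

A strategy is judged here by the whole list `scores`, which is what distinguishes this setting from the reward-based one introduced later.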
2.2 Active learning strategies
A large body of work has focused on designing active learning strategies that take into account properties of the data or specificities of the underlying class of predictors. For instance, (ertekin2007learning) focuses on learning on the border using SVM properties, and (zhang2016online) uses the notion of distance induced by the SVM hyperplane to decide which points to query. On the other hand, strategies can be defined without relying on properties of the underlying predictor, only taking advantage of its ability to produce class-wise probability estimates. Such strategies can be grouped into two categories:
Unitary methods (base methods): Uncertainty sampling, Random sampling. This type of method relies on a single hypothesis explaining the insufficient performance of the predictor. Based on this hypothesis, a sampling method is proposed. In the case of Uncertainty sampling, the hypothesis is that the most important points are those where the probabilities estimated by the model itself have a high variance (lewis1994heterogeneous; cohn1995active). In practice the strategy will tend to select samples in zones that are at the known frontier of two distinct classes. In the case of Random sampling, the strategy ignores the learned predictor and makes no hypothesis on the evolution of its performance with respect to the chosen labeled points.
Adaptive methods: While unitary methods have been designed with the idea of choosing samples that optimize a single criterion, (hsu2015active) proposes a meta-algorithm that chooses the best unitary method to use at each time step in order to maximize a specifically designed reward function (a weighted accuracy computed on the points submitted to the oracle). Note for example that different uncertainty sampling approaches could be built upon different probability estimates of the output labels, and the adaptive approach would choose at each time step which unitary strategy to use. Similarly, (Konyushkova2017LearningAL) fits a model able to predict the expected increase of a test metric. The point picked by the algorithm is then the one with the greatest expected improvement in that metric.
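As an illustration of the two unitary methods, the sketch below selects a query index from a list of estimated fraud probabilities. It assumes a binary problem, so that the variance of the predicted label is `p * (1 - p)`; the function names are hypothetical.

```python
import random


def uncertainty_sampling(probabilities):
    """Index of the point whose predicted label has the highest variance.

    For a binary label with estimated probability p, the variance is
    p * (1 - p), which is maximal when p is closest to 1/2.
    """
    return max(range(len(probabilities)),
               key=lambda i: probabilities[i] * (1.0 - probabilities[i]))


def random_sampling(probabilities, rng=random):
    """Index drawn at random, ignoring the predictor entirely."""
    return rng.randrange(len(probabilities))
```

An adaptive method would sit one level above these functions and decide, at each time step, which of them to call.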
Now we turn to the presentation of our framework that differs from the active learning one.
2.3 The Reward Maximization framework
We now propose a new setting that is intended to reflect real-life constraints more closely. The goal is no longer to optimize a metric evaluated on a held-out dataset but instead to iteratively submit to the oracle only examples of class 1 (fraudulent operations). The available data $\mathcal{D}$ is thus only partitioned into a labeled set $\mathcal{L}_t$ and an unlabeled set $\mathcal{U}_t$ such that $\mathcal{D} = \mathcal{L}_t \cup \mathcal{U}_t$. Suppose that we can build a strategy $\phi$ that returns an unlabelled example $x_t \in \mathcal{U}_t$, with true label $y_t$. Given a non-negative reward function $r$, the goal is then to find the strategy that maximises the cumulated reward:
$$\sum_{t=1}^{T} r(x_t, y_t).$$
The reward can take into account the amount of money contained in a fraudulent transaction. When this information is not available, we can simply provide a unitary reward when a fraud is identified:
$$r(x_t, y_t) = \mathbb{1}\{y_t = 1\}.$$
At each time step, the optimal strategy would return an element of $\mathcal{U}_t$ in the set of highest expected reward, $\arg\max_{x \in \mathcal{U}_t} \mathbb{E}_{y \sim P(\cdot \mid x)}\left[r(x, y)\right]$, where $P(\cdot \mid x)$ is the true conditional distribution of the data. Since the conditional probability is not directly available, it is instead estimated by a function $h_t$ taken in a hypothesis class $\mathcal{H}$ and learned on the labeled sample $\mathcal{L}_t$:
$$h_t \in \arg\min_{h \in \mathcal{H}} \sum_{(x_i, y_i) \in \mathcal{L}_t} \ell(h(x_i), y_i) + \Omega(h),$$
where $\ell$ is a loss function penalizing wrong predictions of $h$ and $\Omega$ a penalty function enforcing the choice of regular candidates.
In the case where we want to compute class probabilities, one can choose the cross-entropy loss function, which for labels $y \in \{0, 1\}$ reads:
$$\ell(h(x), y) = -y \log h(x) - (1 - y) \log\left(1 - h(x)\right).$$
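For concreteness, the cross-entropy loss and a penalized empirical risk can be written as follows. This is a minimal sketch under assumptions that are not in the text: labels are coded in {0, 1}, the estimator is a one-dimensional logistic model with parameters `w` and `b`, and the penalty is a squared norm weighted by a hypothetical `penalty` coefficient.

```python
import math


def cross_entropy(p, y, eps=1e-12):
    """Cross-entropy between a predicted probability p and a label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def regularized_risk(w, b, labeled, penalty=0.1):
    """Mean cross-entropy of a 1-D logistic model plus a squared-norm penalty."""
    data_term = sum(cross_entropy(sigmoid(w * x + b), y) for x, y in labeled)
    return data_term / len(labeled) + penalty * w * w
```

Minimizing `regularized_risk` over `w` and `b` would give one possible probability estimator; any other loss or model class fits the same template.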
Probability estimators trained with the cross-entropy loss are well known and can be parameterized by a linear model (logistic regression) or a non-linear model (neural networks). Different choices of loss and parameterization lead to different classes of predictors that may be used to construct the estimator (Gaussian processes, random forests, boosting-based algorithms). Up to this point we have provided an approximation of the conditional probability of fraud based on the labeled sample only. This has two consequences:
Based on the knowledge of the learned estimator, the points proposed to the oracle will be the ones with the highest estimated probability of carrying the label 1. For a correctly regularized predictor, these points will be the ones located close to already detected frauds. By analogy with the bandit literature (audibert2009exploration), this step can be seen as an exploitation phase, where the strategy relies on its estimate of the expected rewards to pick the arm that yields a gain with the highest probability among all possible candidates.
When unlabeled parts of the space contain objects labeled 1, or when the frauds in the regions already found have been exhausted, a good strategy needs to quickly explore the space to find new instances labeled 1. During this step, instead of choosing the point that maximizes the corresponding reward, we try to find the one that gives the most information to the estimator. Once again, this is analogous to the exploration phase in the bandit literature.
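The two phases can be contrasted with a small sketch. The exploitation rule follows the text directly. The exploration rule is only one possible proxy for "most informative", chosen here as an assumption: it queries the unlabeled point farthest from every labeled input, in one dimension. All function names are hypothetical.

```python
def unit_reward(label):
    """Unitary reward: 1 when the queried point is a fraud, 0 otherwise."""
    return 1 if label == 1 else 0


def cumulated_reward(queried_labels):
    """Sum of the rewards collected over the queries made so far."""
    return sum(unit_reward(y) for y in queried_labels)


def exploit(probabilities):
    """Exploitation: index with the highest estimated probability of fraud."""
    return max(range(len(probabilities)), key=lambda i: probabilities[i])


def explore(unlabeled, labeled_inputs):
    """Exploration proxy (assumption): index of the unlabeled point that is
    farthest from its nearest labeled input."""
    return max(range(len(unlabeled)),
               key=lambda i: min(abs(unlabeled[i] - z) for z in labeled_inputs))
```

A complete strategy has to decide when to call `exploit` and when to call `explore`, which is the role of the algorithm presented in the next section.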
3 A Computer-Assisted Fraud Detection Algorithm (CAFDA)
The exploration and exploitation steps presented previously can be combined in a simple algorithm that, in practice, works surprisingly well on benchmark datasets for the task of computer-assisted fraud detection. It is inspired by the EXP4.P bandit algorithm (beygelzimer2011contextual), which maintains a set of probabilities of picking each of the possible strategies and updates them according to the reward received.
Similarly to (beygelzimer2011contextual; hsu2015active), we suppose that we have access to a set of active learning algorithms, each providing an advice vector of the size of the unlabelled set, which contains the probability of querying each example. We additionally maintain a vector that stores the probability of using each strategy, and choose two update parameters which control the variation of this vector depending on the rewards received. We also introduce two threshold levels on the probabilities stored in this vector; they are used to reduce the time needed to switch from the current best strategy (the index with a high probability) to another one as the number of iterations increases. In order to maximize our custom reward, we propose the following fraud detection algorithm (CAFDA):
The main difference with (hsu2015active) is the use of the update heuristic. In the original paper, the reward update scheme is chosen to optimize the accuracy of the resulting predictor on a held-out dataset, which differs from our reward based only on the label found. Concerning the update, EXP4.P has been designed to achieve optimal regret in a stationary context, which is not the case here. By carefully choosing the two update parameters and the two thresholds, CAFDA achieves competitive results, which we detail in Section 4.
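To make the idea of mixing advice vectors concrete, here is a sketch of a plain exponential-weights update over strategies, in the spirit of EXP4-type algorithms. It is not the CAFDA update: it omits the confidence term of EXP4.P and the two thresholds, and the single learning rate `eta` and all function names are assumptions made for illustration.

```python
import math


def normalize(weights):
    """Turn non-negative strategy weights into probabilities."""
    total = sum(weights)
    return [w / total for w in weights]


def query_distribution(strategy_probs, advices):
    """Mix the advice vectors: q[j] = sum_k p[k] * advice_k[j]."""
    n_points = len(advices[0])
    return [
        sum(p * advice[j] for p, advice in zip(strategy_probs, advices))
        for j in range(n_points)
    ]


def update_weights(weights, advices, query, chosen, reward, eta):
    """Exponential update using an importance-weighted reward estimate.

    Strategies that put more mass on the chosen point receive a larger
    share of the observed reward.
    """
    estimate = reward / query[chosen]  # importance weighting
    return [w * math.exp(eta * advice[chosen] * estimate)
            for w, advice in zip(weights, advices)]
```

With the unitary reward, a strategy's weight only grows when it recommended a point that turned out to be a fraud, which is the behavior the custom reward is meant to encourage.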
We simulate the framework described in Section 2.3 in the following way. Given an imbalanced fraud dataset, we first sample a small fraction of the points to constitute an initial labeled set, and then iteratively select an unlabelled point which is shown to the oracle. If this point is labeled 1, a reward of 1 is gained, and we display the cumulated reward over time. We compare CAFDA against baselines and state-of-the-art active learning strategies:
base: Use the predictor trained only once on the initial labeled set and perform only the exploitation phase at each time step, i.e. query the unlabeled point with the highest estimated probability of fraud.
base_refit: Same as base, but the predictor is retrained on the current labeled set at each time step.
random: The point queried is picked at random in the unlabeled set.
us (uncertainty sampling): The point queried is the one of maximal uncertainty for the predictor, i.e. the unlabeled point whose estimated probability of fraud is closest to 1/2.
lal_independent (Learning Active Learning with an independent strategy): The point queried is the one with the maximal expected improvement in a chosen loss. The expected improvement is predicted by a model fitted on a synthetic dataset. In the independent strategy, a Monte Carlo procedure is simulated to query some points at random and associate them with an improvement in the loss (Konyushkova2017LearningAL).
lal_iterative (Learning Active Learning with an iterative strategy): This algorithm differs from the previous one only in the way the synthetic dataset is constructed: the points are queried so as to minimize the selection bias.
albl (Active Learning By Learning): A multi-armed bandit chooses among multiple active learning strategies at each time step in order to maximise an expected cumulated reward, which is a weighted accuracy on the already queried points.
As base strategies for CAFDA, we take 5 strategies (base, base_refit, random, lal_independent, lal_iterative) and exclude ALBL as it is also a meta-algorithm. For all scenarios, the same values of the two update parameters and the two thresholds are used.
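The simulation protocol can be sketched as follows for two simple strategies. This is a toy version under stated assumptions: inputs are one-dimensional, the score used by `pick_greedy` (closeness to an already labeled fraud, recomputed at every step in the spirit of base_refit) stands in for a real probability estimator, and all names are hypothetical.

```python
import random


def fraud_score(x, labeled):
    """Toy score (assumption): higher when x is close to a labeled fraud."""
    frauds = [z for z, y in labeled if y == 1]
    if not frauds:
        return 0.0
    return -min(abs(x - z) for z in frauds)


def pick_greedy(unlabeled, labeled, rng):
    """Exploitation only: query the point with the highest score.

    Only the input item[0] is used; the hidden label item[1] is never read.
    """
    return max(unlabeled, key=lambda item: fraud_score(item[0], labeled))


def pick_random(unlabeled, labeled, rng):
    """Baseline: query a point at random."""
    return rng.choice(unlabeled)


def simulate(data, initial_fraction, pick, horizon, seed=0):
    """Return the cumulated reward after each query shown to the oracle."""
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)
    n_init = max(1, int(initial_fraction * len(data)))
    labeled = [data[i] for i in indices[:n_init]]
    unlabeled = [data[i] for i in indices[n_init:]]  # labels stay hidden
    curve, total = [], 0
    for _ in range(min(horizon, len(unlabeled))):
        item = pick(unlabeled, labeled, rng)
        unlabeled.remove(item)
        labeled.append(item)                  # the oracle reveals the label
        total += 1 if item[1] == 1 else 0     # unitary reward
        curve.append(total)
    return curve
```

Plotting the returned curve for each strategy gives the kind of cumulated-reward comparison described in this section.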
The different methods are compared in two scenarios:
The active learning strategy is run during the entire experiment. In this experiment, we show empirically that active learning methods do not maximize the cumulated reward we defined.
The active learning algorithm is run for 100 steps, then the resulting classifier is used to select the points with the highest probability of being labeled 1. Here we aim to show that early exploration using an active learning strategy does not help, even in the long run.
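The second scenario amounts to a two-phase query rule, sketched below. The switch point of 100 steps comes from the text. The use of uncertainty sampling as the default first-phase strategy and the function names are assumptions made for illustration.

```python
def most_uncertain(probabilities):
    """Index whose estimated fraud probability is closest to 1/2."""
    return min(range(len(probabilities)),
               key=lambda i: abs(probabilities[i] - 0.5))


def most_likely_fraud(probabilities):
    """Index with the highest estimated fraud probability."""
    return max(range(len(probabilities)), key=lambda i: probabilities[i])


def two_phase_query(step, probabilities, switch_at=100, explorer=most_uncertain):
    """Active learning for the first `switch_at` steps, exploitation afterwards."""
    if step < switch_at:
        return explorer(probabilities)
    return most_likely_fraud(probabilities)
```

Any other active learning strategy can be passed as `explorer`, so the same rule covers each method compared in this scenario.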
For all our experiments, we used a Random Forest classifier as the base probability estimator and selected the hyperparameters by cross-validation on the initially labeled training set.
We display results obtained with 3 standard benchmark anomaly detection datasets since they share the imbalance property of financial fraud detection databases and are freely available.
| Number of samples | 85849 | 295541 | 284807 |
In all experiments, we first sample an initial labeled dataset, which contains 1% of the points for all the datasets. In the case of the covtype dataset, instead of using the full dataset, we worked with a subsample of the examples in order to keep the number of initially observed ’frauds’ fairly low.
4.1 Scenario 1
Figure 4 shows the cumulated reward over time obtained with each strategy.
The best results are obtained using CAFDA, base and base_refit, with a slight advantage on the creditcard dataset for the methods that retrain their underlying model. As expected, the active learning strategies do not specifically try to submit examples labeled 1 to the oracle, which explains their behavior. We now test experimentally whether an early active-learning-based exploration can provide a benefit in a subsequent exploitation phase.
4.2 Scenario 2
We now turn to the case where each active learning strategy is used for the first 100 steps. In the following iterations, the points provided to the oracle are those with the highest estimated probability of fraud under the resulting learned predictor. The results presented in Figure 11 show that the active learning strategies do not benefit, even later on, from their early exploration. Indeed, CAFDA, base and base_refit remain competitive while being simpler than active learning procedures. We focus on the first 300 iterations, where we observe that the lag in cumulated reward of the active learning procedures appears at the very beginning and persists until all the examples labeled 1 have been found.
5 Conclusion and perspectives
We presented a new fraud detection framework that differs from the active learning setting: here, the quality of a strategy is measured by its ability to submit the rare labels to the oracle. We have shown that our algorithm CAFDA, as well as simple baselines, provides better results than state-of-the-art active learning algorithms on these problems. Future work will focus on the statistical properties of the computer-assisted fraud detection problem, in order to design theoretically grounded optimal algorithms for the task at hand, and will explore how to adapt other adaptive active learning strategies to our setting.