Distributed Deep Forest and its Application to Automatic Detection of Cash-out Fraud

05/11/2018 · Ya-Lin Zhang, et al. · Ant Financial, Nanjing University

Internet companies face the need to handle large-scale machine learning applications on a daily basis, so distributed systems that can handle extra-large scale tasks are needed. Deep forest is a recently proposed deep learning framework which uses tree ensembles as its building blocks, and it has achieved highly competitive results on tasks from various domains. However, it had not been tested on extremely large scale tasks. In this work, based on our parameter server system and artificial intelligence platform, we developed a distributed version of deep forest with an easy-to-use GUI. To the best of our knowledge, this is the first implementation of distributed deep forest. To meet the needs of real-world tasks, many improvements are introduced to the original deep forest model. We tested the deep forest model on an extra-large scale task, i.e., automatic detection of cash-out fraud, with more than 100 million training samples. Experimental results show that the deep forest model gives the best performance according to evaluation metrics from different perspectives, even with very little effort spent on parameter tuning. The model can block fraudulent transactions amounting to a large sum of money [detail is business confidential] each day. Even compared with the best deployed model, the deep forest model brings a significant additional decrease of economic loss.


1 Introduction

Internet companies such as Ant Financial and Alibaba face the need to develop algorithms for large scale machine learning applications, among which the detection of cash-out fraud for financial services is a crucial task. Cash-out fraud is a common hazard for on-line credit financial firms like Ant Financial: the shopper makes a transaction to the seller via Ant Credit Pay (a credit service issued by Ant Financial) and receives cash from the seller in return. Without a proper strategy to detect such fraud, a large amount of money may be lost each day, making it a serious threat. Machine learning based methods such as logistic regression (LR) and multiple additive regression trees (MART) are currently employed, and they have brought great success in handling this task. However, there is always strong demand for a more effective method, since the task is closely tied to economic outcomes, and even a small improvement yields an appreciable decrease of economic loss.

On the other hand, due to the increasing effectiveness of data-driven machine learning models, the data scientists in the company often work closely with the product department to design and deploy efficient statistical models for such tasks. An ideal platform for data scientists and machine learning engineers should handle large scale learning tasks (often millions or billions of training samples) with high performance. In addition, it should make it easy to build and run different tasks, for the sake of productivity.

At Ant Financial and Alibaba, a distributed parameter server based system called KunPeng has been successfully developed and deployed zhou2017kunpeng ; zhou2017psmart . It is among the world's largest on-line learning systems, able to process data at the peta-byte scale and models with billions of parameters. The system combines the convenience of a parameter server with the flexibility of MPI. Concretely, KunPeng has a robust fail-over mechanism which guarantees a high success rate for large-scale jobs of all kinds, including SQL, Spark, MPI, etc., and it provides an efficient communication implementation for sparse data as well as general communication interfaces compatible with MPI. Some popular machine learning models, such as Logistic Regression hosmer2013applied , Multiple Additive Regression Trees friedman2001greedy and Deep Neural Networks goodfellow2016deep , have been implemented with fairly user-friendly APIs. This system currently supports most of the machine learning tasks at Ant Financial and Alibaba, including many predictive tasks during the Double 11 on-line shopping festival as well as daily on-line financial services.

Currently, tree based models such as random forest breiman2001random and multiple additive regression trees friedman2001greedy remain among the dominant methods for a variety of tasks fernandez2014we . In fact, most winning Kaggle competitions or data science projects use an ensemble of MART or its variants chen2016xgboost , due to its superior performance. This is particularly true for applications inside Ant Financial: the data is often sparse and high-dimensional, and as a discrete or hybrid modeling problem it is often not well suited to other popular choices such as deep neural networks.

Recently, Zhou and Feng proposed the deep forest approach, which opens a new way of building deep models with non-differentiable components, especially trees zhou2017deep . This new kind of deep model achieves the best performance among all non-DNN methods and gives results competitive with state-of-the-art DNN models across a variety of task domains. In addition, the number of layers is determined automatically, so the model complexity adapts to the data at hand (rather than following a pre-defined DNN structure), and the deep forest approach has far fewer hyper-parameters to tune. In fact, according to the paper, a default setting produces highly competitive results across different tasks, making it a good candidate for an off-the-shelf classifier.

In particular, many real-world tasks contain discrete features, and such discrete or hybrid modeling problems become troublesome for deep neural networks, since an explicit or implicit transformation of discrete information into continuous form is needed, and this transformation usually introduces additional bias or loses information. Deep forest, by contrast, is tree based and thus naturally suited to these kinds of problems. Moreover, the task of cash-out fraud detection discussed in this paper belongs exactly to this kind of task, so the deep forest model may provide excellent performance for it.

In this work, we implemented and deployed a distributed version of the deep forest model, based on our distributed learning system KunPeng zhou2017kunpeng . This is the first parameter server based distributed implementation of deep forest built to an industrial standard, able to handle over a hundred million high-dimensional samples. In addition, a web based graphical user interface is provided through Ant Financial's artificial intelligence platform, which allows data scientists to use the model without any coding: with a few drags and clicks, a deep forest model is ready to go. This makes the modeling process extremely efficient and convenient, so that data scientist teams can build and evaluate models much more easily.

To meet the needs of real world tasks, many improvements have been introduced over the original version of deep forest. To name a few: MART is employed as the base learner for both efficiency and effectiveness; a cost based method is applied to handle extremely imbalanced data; feature selection with MART is adopted for high dimensional data; and different evaluation metrics are provided for automatically determining the number of layers in the cascade.

We applied the deep forest model to the automatic detection of cash-out fraud to validate its effectiveness. The method is able to block fraudulent transactions amounting to a large sum of money per day, and it achieves the best performance compared with the fraud detection system currently deployed. Given its good performance with far fewer hyper-parameters to tune, we believe that the deep forest model can be one of the default modeling choices for data scientists inside Ant Financial.

Briefly speaking, the main contributions of this work can be summarized as follows:

We implement and deploy the first distributed version of the deep forest model on top of the existing distributed system KunPeng, and provide it with an easy-to-use interface with the help of our artificial intelligence platform PAI.

Many improvements are made over the original deep forest model, including MART as the base learner for efficiency and effectiveness, a cost based method for handling the prevalent class-imbalance problem, MART based feature selection for high dimensional data, and different evaluation metrics for automatically determining the cascade depth.

We validate the performance of the deep forest model on a crucial, extremely large scale task, namely automatic detection of cash-out fraud. The results show that deep forest performs significantly better than all existing methods with regard to different evaluation metrics. Moreover, the robustness of deep forest is also verified through the experiments.

The rest of the paper is organized as follows: First, we give an introduction to the system. Then, experiments on the task of automatic detection of cash-out fraud are presented and results from different perspectives are analyzed. Finally, we conclude the paper.

2 System Overview

2.1 KunPeng System

KunPeng zhou2017kunpeng is a parameter server based distributed learning system with parallel optimization algorithms developed to handle the large-scale problems that arise in industry. A parameter server system dean2012large ; li2014scaling ; xing2015petuum is composed of two main parts: the stateless workers, which perform the bulk of the computation for model training, and the stateful servers, which maintain the parameters of the model. Specifically, the huge set of model parameters is distributed across the servers, and parameters are passed to the workers through network communication. In this way, hundreds of billions of model parameters can be handled. Besides, the parameter server also provides a solution to node failures in the cluster: when a node fails, its parameters are recovered automatically with the help of checkpoints.

With these advantages in mind, KunPeng was developed as a production-level parameter server based distributed learning system. Generally speaking, KunPeng is built with many optimizations: (1) a robust failover mechanism which guarantees a high success rate for large-scale jobs; (2) an efficient communication implementation for sparse data and general communication interfaces; (3) user-friendly C++ and Python SDKs zhou2017kunpeng .

Many popular algorithms are implemented, such as Follow-the-Regularized-Leader Proximal (FTRL-Proximal) mcmahan2013ad , the Multiple Additive Regression Trees (MART) algorithm friedman2001greedy and its extension LambdaMART burges2010ranknet , the Sparse Logistic Regression algorithm liu2009large , Factorization Machines li2016difacto , the Latent Dirichlet Allocation (LDA) algorithm blei2003latent , and a CPU-cluster based deep learning goodfellow2016deep framework, among others. On top of this system, further algorithms can be developed to handle extremely large-scale tasks.

Figure 1: The simplified architecture of KunPeng, including ML-Bridge and PS-Core. Users simply operate on the ML-Bridge.

To make the system more convenient to use, ML-Bridge, a practical machine learning (ML) pipeline layer, is provided on top of the core of KunPeng, so that users can drive the system by writing simple scripts. The simplified architecture of the whole system is illustrated in Figure 1, with its two main parts shown: the ML-Bridge and the PS-Core. Users only need to operate on the ML-Bridge layer.

2.2 Distributed MART

In this section, we give an introduction to Multiple Additive Regression Trees (MART) and their distributed implementation in KunPeng. Since MART is already implemented in KunPeng with great efficiency and effectiveness, we use it as the basic building block for the distributed deep forest implementation; other kinds of building blocks can be developed for the distributed version of deep forest in the future.

Multiple Additive Regression Trees (MART), also known as Gradient Boosting Decision Trees (GBDT) or Gradient Boosting Machines (GBM) friedman2001greedy , is a machine learning algorithm widely used in both academia and industry because of its high effectiveness and good interpretability zhou2017kunpeng .

To give a quick glance at MART, we first give a brief explanation of the boosting decision tree drucker1996boosting , which constructs the model by additively fitting a tree model to the current residual. In detail, let $x_i$ denote the $i$-th instance and $y_i$ the corresponding label. At the $t$-th iteration, having obtained the prediction $\hat{y}_i^{(t-1)}$ from the first $t-1$ rounds, the tree model $f_t$ is learned to minimize the following objective,

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} \ell\big(y_i,\; \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t) \qquad (1)$$

in which $\ell$ is the loss function and $\Omega$ is the regularization term which controls the complexity of the tree model $f_t$. Then, the prediction after the $t$-th round is

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i). \qquad (2)$$

Similar to the boosting decision tree, MART is also additively constructed with the philosophy of fitting the residual. However, in the boosting decision tree, finding the best model for an arbitrary loss function may be computationally infeasible in many cases. To handle this, MART friedman2001greedy was proposed: an approximation to the real residual is made with the steepest-descent method, and the so-called pseudo-residual is fitted. Furthermore, a second-order approximation is widely used to efficiently optimize the objective friedman2000additive , and it has been implemented in most systems, such as XGBoost chen2016xgboost . The objective becomes

$$\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \Big[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t) \qquad (3)$$

in which $g_i$ and $h_i$ are the first order and second order gradients of the loss function with respect to the prediction $\hat{y}_i^{(t-1)}$.

Many scalable systems, such as XGBoost chen2016xgboost and LightGBM ke2017lightgbm , are well-known implementations of this model and its variants, with additional optimization of speed and memory usage. KunPeng-MART is the parameter server based implementation on the KunPeng system. Currently, many machine learning tasks at Ant Financial and Alibaba use KunPeng-MART on a daily basis.
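To make the second-order update in Eq. (3) concrete, the following is a minimal single-machine sketch of one boosting round for the logistic loss, with scikit-learn's DecisionTreeRegressor standing in for the distributed tree learner; the function name and parameters are illustrative, not KunPeng's API.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosting_round(X, y, y_hat, learning_rate=0.1, max_depth=5):
    """One second-order boosting round for logistic loss, y in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-y_hat))   # sigmoid of the current margins
    g = p - y                          # first-order gradient of the log loss
    h = p * (1.0 - p)                  # second-order gradient (Hessian)
    # Fitting a regression tree to -g/h with sample weights h minimizes
    # sum_i [g_i f(x_i) + 0.5 h_i f(x_i)^2] up to a constant, i.e. the
    # objective of Eq. (3) with the regularizer Omega omitted.
    tree = DecisionTreeRegressor(max_depth=max_depth)
    tree.fit(X, -g / h, sample_weight=h)
    return y_hat + learning_rate * tree.predict(X), tree
```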

There are two particular aspects of MART worth noticing, both of which are important for its further use in our system. First, in real world applications, severely class-imbalanced data japkowicz2002class is often present. Taking two-class classification as an example, one class may have far fewer samples than the other. This problem frequently arises in tasks such as fraud detection, anomaly detection and medical diagnosis. If we simply use the common strategy without special design, the performance may be sub-optimal. To handle this problem, two families of solutions are commonly employed, i.e., cost based methods zhou2006training and sampling based methods liu2009exploratory .

To handle the class-imbalance problem, a cost based method can be naturally embedded into MART by assigning a weight to each sample: higher weights are set for the minority class (whose samples usually incur a larger cost when wrongly classified) and lower weights for the majority class. The objective in Eq. (3) is then modified as follows,

$$\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} w_i \Big[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t) \qquad (4)$$

in which $w_i$ is the importance weight associated with instance $x_i$.

Figure 2: The new pipeline of deep forest.

Second, feature selection dash1997feature is an important step in the machine learning pipeline. Fortunately, MART has built-in functionality for this task: estimates of feature importance can be calculated and feature selection can be performed accordingly xu2014gradient . Generally speaking, an importance score can be calculated for each attribute, indicating the value of that feature when constructing the trees; the more a feature contributes to the splitting decisions, the more important it is. Concretely, for a single tree $T$ with $J$ leaf nodes (and hence $J-1$ non-terminal nodes), the importance of attribute $j$ is calculated by

$$\hat{I}_j^2(T) = \sum_{t=1}^{J-1} \hat{i}_t^2 \, \mathbb{1}(v_t = j) \qquad (5)$$

in which $v_t$ is the feature associated with non-terminal node $t$, $\hat{i}_t^2$ is the corresponding empirical improvement in square-error from the splitting, and $\mathbb{1}(\cdot)$ is the indicator function. Then, as shown in Eq. 6, the global importance of attribute $j$ is calculated by averaging the importance values the feature obtains over the $M$ single trees friedman2001greedy ,

$$\hat{I}_j^2 = \frac{1}{M} \sum_{m=1}^{M} \hat{I}_j^2(T_m). \qquad (6)$$
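For illustration, the selection step can be sketched with scikit-learn's gain-based importances, which aggregate each feature's split improvement over all trees in the spirit of Eqs. (5) and (6); GradientBoostingClassifier stands in for KunPeng-MART here, and the cutoff k is a tunable choice, not a fixed part of the method.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def top_k_features(X, y, k=300):
    """Rank features by MART gain importance and keep the top k."""
    mart = GradientBoostingClassifier(n_estimators=100).fit(X, y)
    # feature_importances_ aggregates each feature's squared-error
    # improvement over all trees, analogous to Eq. (6).
    ranked = np.argsort(mart.feature_importances_)[::-1]
    return ranked[:k]
```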

To handle industrial tasks, a distributed version of MART is implemented in KunPeng, namely KunPeng-MART. Many challenges are encountered when developing distributed MART, such as storage as well as computation and communication cost zhou2017kunpeng . To meet the extremely large storage requirement, a data parallelization mechanism is employed in KunPeng-MART. To be specific, each worker stores only a subset of the whole data for each feature, and the main workflow for splitting a node is as follows: (1) each worker calculates a local weighted quantile sketch with the data stored on it; (2) each worker pushes its local weighted quantile sketch to the servers, which merge them into a global weighted quantile sketch and find the splitting value; (3) each worker pulls the splitting value from the servers and splits its samples into the two child nodes. Another key challenge is that the computation and communication cost of the split-finding algorithm may become very high. To handle this, the communication schema of KunPeng is employed to reduce the cost of merging local sketches, which greatly speeds up the whole process. Note that since MART is already implemented in KunPeng, we use it as the building block for the distributed version of deep forest.
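A schematic sketch of the worker-side step in this workflow is shown below; the push/pull calls are hypothetical placeholders for KunPeng's parameter-server primitives, whose real interfaces are not public.

```python
import numpy as np

def local_quantile_sketch(values, weights, n_bins=64):
    """Weighted quantile candidates from one worker's shard of a feature."""
    order = np.argsort(values)
    cum = np.cumsum(weights[order]) / weights.sum()
    qs = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    return values[order][np.searchsorted(cum, qs)]

# Worker side (hypothetical parameter-server API, for illustration only):
#   ps.push("sketch/feature_7", local_quantile_sketch(v, w))   # steps (1)-(2)
#   threshold = ps.pull("split/feature_7")                     # step (3)
# Server side: merge the workers' sketches into global candidates, score
# each candidate with the aggregated g/h statistics, and publish the best.
```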

Figure 3: The normal job scheduling strategy; each rectangle (except the red ones) represents a process, and the red rectangles indicate waiting time.

2.3 The Specified Deep Forest Structure

Deep forest zhou2017deep is a recently proposed deep learning framework which uses tree ensembles as building blocks. The original version consists of two modules, i.e., the fine-grained module and the cascading module. In our task, the fine-grained module is unnecessary and is removed. The cascading module builds a multi-layer structure: each layer consists of several base learners, and each base learner is either a random forest breiman2001random or a completely-random tree forest liu2008isolation (as shown in Figure 2). For each base learner, the input is the concatenation of the class vector produced by the previous layer and the original input features, and the output of a layer is the combination of the outputs of its base learners. K-fold validation is conducted for each layer, and the cascading process terminates automatically when the accuracy on the validation set stops increasing.

To meet the needs of the applications at Ant Financial, we propose our own version of the deep forest pipeline, shown in Figure 2. Several challenges must be taken into consideration when building a statistical model for real-world tasks. First, the raw training data often lies in a high-dimensional space: usually thousands or more raw features represent a single entity or transaction, and many of them may be irrelevant. In addition, when a model is deployed for real-time prediction, it is not economically efficient to compute every attribute for each prediction. To handle this, the raw training data is first used to train a MART for feature importance evaluation; based on the importance scores, feature selection is then conducted on the training set. Second, the data we face is often extremely class-imbalanced: positive samples may be much rarer than negative ones, so a mechanism for this situation is needed in order to get reasonable results. To this end, the cost based strategy is employed in each base learner. Third, for extremely large scale tasks, efficiency and effectiveness are both important, so all base learners are replaced with the MART implementation in KunPeng, which provides excellent efficiency and effectiveness. Finally, for many tasks the evaluation metric is specific and accuracy cannot meet every need, so we provide more metrics (such as AUC and F1-score) for the automatic growing of the cascade. Moreover, note that in the original paper random forests and completely-random tree forests are used to provide diversity (which is crucial for ensemble methods); replacing all base learners with MART damages this diversity to some extent, so a feature subsampling strategy is applied in MART to alleviate the problem. A minimal sketch of this modified cascade is given below.
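The following single-machine sketch captures the layer-growing logic under simplifying assumptions: GradientBoostingClassifier stands in for KunPeng-MART, a held-out validation set replaces the k-fold procedure, and AUC is the chosen growth metric. None of this is the production code.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

def grow_cascade(X, y, X_val, y_val, n_learners=4, max_layers=10):
    """Grow cascade layers while the validation AUC keeps improving."""
    layers, aug, aug_val, best = [], X, X_val, -np.inf
    for _ in range(max_layers):
        # max_features="sqrt" gives the feature subsampling mentioned above
        layer = [GradientBoostingClassifier(n_estimators=50,
                                            max_features="sqrt").fit(aug, y)
                 for _ in range(n_learners)]
        probs = np.column_stack([m.predict_proba(aug)[:, 1] for m in layer])
        probs_val = np.column_stack([m.predict_proba(aug_val)[:, 1]
                                     for m in layer])
        score = roc_auc_score(y_val, probs_val.mean(axis=1))
        if score <= best:                 # metric stopped improving: stop
            break
        best, layers = score, layers + [layer]
        aug = np.hstack([X, probs])       # class vector + original features
        aug_val = np.hstack([X_val, probs_val])
    return layers, best
```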

Note from the above description that each base learner can be trained in a distributed fashion, and all the base learners within a layer can also be trained in parallel, making the whole process easy to implement in a distributed fashion. The next section briefly explains how to build such a model on a parameter server based distributed learning system.

Figure 4: Job scheduling with a DAG; each rectangle represents a process, and only the corresponding processes are connected with each other.

2.4 Distributed Implementation and Job Scheduling for Deep Forest

In industrial scenarios, we are always confronted with data of tremendous size and high dimensionality, which means that a distributed version of the algorithm is needed. In this section, we introduce the distributed framework of deep forest.

The distributed deep forest framework is built upon the parameter server KunPeng. Following the KunPeng architecture, the distributed version of deep forest contains three kinds of nodes: (1) the worker nodes, which execute the heavy computing tasks; (2) the server nodes, which maintain the globally shared parameters; (3) the coordinator nodes, which coordinate the workers and servers and perform job scheduling.

To deploy the deep forest algorithm efficiently, a key problem to solve is job scheduling.

In each layer, the deep forest process consists of the following sub-jobs: (1) data preparation, which splits the data into the different training and validation folds, since k-fold cross validation is needed to reduce the risk of over-fitting; (2) model training, which trains the different base learners on the split training data; (3) prediction, which makes predictions on the split validation data; (4) combination and concatenation, which combines the results of the different base learners and concatenates the predictions with the original features.

A straightforward job scheduling strategy is shown in Figure 3, in which the coordinator sequentially calls the workers and servers to execute each of these processes. However, scheduling jobs this way can be very inefficient, since the model training module has to wait until all of the data preparation work is finished, and the prediction module does not execute until all of the models are trained successfully.

To perform more efficient job scheduling, we employ a directed acyclic graph (DAG) thulasiraman2011graphs , i.e., a finite directed graph with no directed cycles. As shown in Figure 4, we regard each process as a node in the graph, and only the corresponding processes are connected. The pre-conditions of a node are its inputs, and a node may be executed only when all of its pre-conditions are satisfied. Each node is executed separately, which means that the failure of one node does not influence other nodes. In this way, the waiting time is significantly shortened, since each node only needs to wait for its own predecessors to finish. For example, once the splitting of one training fold is finished, the corresponding model can start training, rather than waiting until all of the data is prepared. Moreover, this design provides a better solution for failover: if some node crashes for any reason, since its pre-conditions have finished successfully, we only need to rerun from this node instead of running the whole algorithm from the beginning.
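A toy version of this scheduler can be sketched with a thread pool: each job is submitted as soon as its prerequisites are done, and a failed job can simply be resubmitted without touching the finished ones. The job names and the run callable are hypothetical placeholders, not KunPeng's coordinator API.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def run_dag(jobs, deps, run, max_workers=8):
    """Run `jobs`, where deps[j] lists the jobs that must finish before j.
    Assumes the dependencies form a DAG (no cycles)."""
    done, running = set(), {}   # running maps future -> job name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while len(done) < len(jobs):
            # submit every job whose prerequisites have all finished
            for j in jobs:
                if j not in done and j not in running.values() \
                        and all(d in done for d in deps.get(j, ())):
                    running[pool.submit(run, j)] = j
            # block until at least one running job completes
            finished, _ = wait(running, return_when=FIRST_COMPLETED)
            for f in finished:
                f.result()                  # re-raises if the job failed
                done.add(running.pop(f))

# Dependencies for one cascade layer with two folds (illustrative names):
deps = {"train_1": ["split"], "train_2": ["split"],
        "pred_1": ["train_1"], "pred_2": ["train_2"],
        "concat": ["pred_1", "pred_2"]}
run_dag(["split", "train_1", "train_2", "pred_1", "pred_2", "concat"],
        deps, run=print)   # `print` stands in for the real job runner
```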

2.5 Graphical User Interface

Data scientists at Ant Financial and Alibaba face hundreds of different machine learning tasks each day. Numerous new tasks are created, and each task is different by its own nature; therefore, how to efficiently build and evaluate a model is critical for productivity. To solve this problem, Ant Financial developed the Platform of Artificial Intelligence (PAI for short, pai.alipay.com), which decouples the algorithms from the different algorithm engines (for example KunPeng, MaxCompute, MPI, etc.) and provides a uniform graphical user interface (GUI) for data scientists to process data, invoke multiple machine learning algorithms, create task pipelines in the cloud, and so on.

Figure 5: The overall GUI of deep forest on PAI; each node represents an atomic operation.

As noted earlier, the deep forest algorithm is robust enough to handle tasks from different domains, making it one of the best choices when facing a new task. The parameter server based distributed implementation of the base learner enables the model to handle even extremely large scale real world problems. We have implemented a deep forest module on the PAI platform, so data scientists can create a deep forest model within a browser: with only a few mouse clicks, the model is ready to train on massive training data and ready for deployment.

The overall GUI of the deep forest workflow is illustrated in Figure 5. Each node represents an atomic operation, such as loading the data, building the model, and making predictions. For instance, all the details of the deep forest model are encapsulated in a single node; the only things to specify are which base learner to use, how many per layer, and the detailed configuration of each base learner. The default base learner is MART, as introduced before.

The arrowed lines indicate the sequential dependencies and the data flow from one node to the next. With only a few drags and clicks, the user can create a deep forest model within minutes, and the evaluation results are analyzed once training is finished.

3 Application

In this section, we validate the effectiveness of the deep forest model on one important application for Ant Financial, namely the automatic detection of cash-out fraud. We first give a brief introduction to the task, and then the empirical results are shown and analyzed from different perspectives.

3.1 Task Description and Data Preparation

Cash-out, which means pursuing cash gains by illegal or insincere means, is a troublesome problem for credit cards and public accumulation funds. Similar to a credit card, Ant Credit Pay is a credit product provided by Alipay, and automatic detection of cash-out fraud for Ant Credit Pay is crucial for risk control. Cash-out fraud via Ant Credit Pay typically follows this process: the shopper makes a transaction with the seller through Alipay by scanning a QR code, Ant Credit Pay is selected as the method of payment, and the shopper then receives cash from the seller. Such cash-out activities pose a potential threat to the credit system, and without a proper strategy to detect them, millions of CNY may be lost each day. What we need to do is detect the potential hazard of cash-out when a QR code is scanned for a transaction, and disable Ant Credit Pay as the payment option if the transaction is judged highly likely to be cash-out fraud, so that the economic loss can be avoided.

We formulate this recognition task as a binary classification problem and collect the original features from four different aspects: the seller features, which describe the identity information of the seller; the buyer features, which describe the identity information of the buyer; the transaction features, which describe the information of the specific transaction; and the history features, which describe the historical information of both seller and buyer. All together, more than 5,000 features, both numerical and categorical, are collected as each transaction happens. The details cannot be disclosed for business confidentiality.

To construct the training and test data, we sampled the training data from O2O (Online to Offline) transactions made with Ant Credit Pay during several successive months, and sampled transactions of the same scenario during the following months are used as test data. The statistics of the data are shown in Table 1. As we can see, this task is extremely large in scale and severely class-imbalanced.

        # Pos. Ins.   # Neg. Ins.    # All Ins.
Train   171,784       131,235,963    131,407,704
Test    66,221        52,423,308     52,489,529
Table 1: The number of training and test samples.

As discussed above, the originally collected features number up to 5,000, among which many irrelevant attributes may be included. If we simply used all these raw features, the whole procedure would become too time-consuming, and efficiency would suffer when the model is deployed as a real-time service. To bypass this obstacle, feature importance is calculated and feature selection is performed with the help of MART.

Concretely, we first train a MART model with all collected features, and feature importance scores are then calculated from the obtained model. Based on these scores, feature selection is performed. Empirical results show that with the top 300 selected features (those with the highest importance scores), the retrained model already achieves performance competitive with the model trained on all features, which also confirms the redundancy among the features. Thus, the 300 features with the highest importance scores are used in all subsequent experiments.

3.2 Empirical Results

In this section, empirical results are shown from different perspectives and a detailed discussion is provided.

Since the data is extremely large, the experiments are conducted on the distributed learning system KunPeng. To validate the effectiveness of the deep forest algorithm on this extremely large dataset, several KunPeng based algorithms are compared, including logistic regression (LR) liu2009large , deep neural network (DNN) goodfellow2016deep and multiple additive regression trees (MART) friedman2001greedy , of which LR and MART are the previously and currently deployed machine learning methods, which have achieved great success. Note that since the data is severely imbalanced, the cost based strategy is employed. We should stress that, for MART, 600 trees are used to obtain better performance (a larger number of trees does not improve it further), while in deep forest each MART has only 200 trees rather than 600.

          AUC      F1       KS
LR        0.9887   0.4334   0.8956
DNN       0.9722   0.3861   0.8551
MART      0.9957   0.5201   0.9424
gcForest  0.9970   0.5440   0.9480
Table 2: The results using the common metrics.

Common Metrics: We first evaluate performance with the metrics widely used for binary classification tasks at Ant Financial, namely the AUC (Area Under the ROC Curve) score, the F1 score and the KS (Kolmogorov-Smirnov) score. These metrics do not directly reflect the economic impact of each model, but according to business experience even a slight improvement at the third decimal place is significant, since it can translate into millions in reduced economic loss; we therefore first give results and analysis in terms of these metrics.
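For reference, the KS score used here is the maximum gap between the cumulative score distributions of the two classes, which can be read off the ROC curve; a minimal sketch:

```python
import numpy as np
from sklearn.metrics import roc_curve

def ks_score(y_true, scores):
    """KS statistic: the maximum over thresholds of TPR - FPR."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    return np.max(tpr - fpr)
```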

The results are shown in Table 2. As we can see, the deep forest method (gcForest for short) performs much better than all the other methods. MART performs second best, validating its effectiveness. We should stress that the MART model is fine-tuned, and 600 trees are needed to reach this performance (a larger number of trees does not improve it further), which may be the upper limit reachable by this model. In contrast, deep forest with only 200 trees per MART and slight tuning already performs much better than the best baseline, and as we will show later, even with only 50 trees per MART (the default setting) its performance is still better. Here, DNN performs rather unsatisfactorily, which confirms the weakness of DNN in handling hybrid modeling problems (DNN is more suitable for continuous modeling tasks such as image recognition). LR performs unsatisfactorily as well.

          1/10000   1/1000   1/100
LR        0.3708    0.5603   0.8762
DNN       0.3165    0.4991   0.8471
MART      0.4661    0.6716   0.9358
gcForest  0.4880    0.6950   0.9470
Table 3: The results using the specified metrics. Here, 1/100 means that 1/100 of all transactions are interrupted.

Specified Metrics: Besides the common metrics, real world tasks often require specialized metrics for analysis, with practical use in mind. For our task, one important metric is the recall of the positive samples (the potential cash-out cases) under a given interrupt rate, i.e., the fraction of transactions for which Ant Credit Pay is disabled. For example, if we can tolerate 1/10000 of all transactions being interrupted, we certainly hope that the selected cases contain as many positive samples as possible. When these models are deployed, a threshold is chosen so that a given fraction of transactions is interrupted. Since the transaction volume is tremendous, a small improvement on this metric corresponds to a large number of transactions, and a large amount of money can be saved for the company.
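This metric is straightforward to compute offline; a small sketch (the function name is ours, not part of any deployed tooling, and y_true is assumed to be a 0/1 numpy array):

```python
import numpy as np

def recall_at_interrupt_rate(y_true, scores, rate=1/10000):
    """Interrupt the top `rate` fraction by score; report recall of positives."""
    k = max(1, int(round(rate * len(scores))))
    top = np.argsort(scores)[::-1][:k]      # highest-scored transactions
    return y_true[top].sum() / y_true.sum()
```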

In Table 3, the recalls under different interrupt rates are provided. As we can see, LR and DNN perform rather unsatisfactorily, and MART brings a great improvement over them. At every interrupt rate, the deep forest model captures the most potential cash-out transactions, which means it would provide the best performance when deployed. The improvement exceeds 2% even compared with the fine-tuned MART currently deployed for this task. According to the business specialists, these improvements are highly significant, as these metrics are much more closely tied to the real economic impact of a deployed model.

Figure 6: The PR curve of LR, DNN, MART and gcForest.

PR curve: Furthermore, according to business experience, the PR (Precision-Recall) curve is a good way to provide a visual comparison, and decisions are often made based on it, so improvements on this metric are more convincing.

We draw the PR curves in Figure 6, where the result is even clearer: the PR curve of deep forest dominates those of all other methods, meaning that deep forest performs much better and validating the effectiveness of the model. One interesting observation is that MART behaves unsatisfactorily in the high-score region, making it somewhat unsatisfactory for deployment.

Economic benefit: So far we have provided analysis from different perspectives, and all of these results validate the effectiveness of the deep forest model on this extremely large scale task. Furthermore, when deployed, the deep forest model can block cash-out fraud transactions amounting to a large sum of money per day (detail is business confidential). Even compared with the best deployed model (MART with 600 trees), the deep forest model (200 trees per MART) brings a significant additional decrease of economic loss each month, making it the better choice for this task.

3.3 Robustness Analysis

In this section, we give a brief discussion on the robustness of the deep forest model.

According to the original deep forest paper zhou2017deep , a default setting of deep forest produces highly competitive results, meaning that little effort on parameter tuning is needed when using the model. To validate this, we compare the deep forest model with the default setting (using 4 MARTs as base learners, each with only 50 trees), the slightly-tuned deep forest model (changing only the number of trees per MART to 200), and the best tuned MART (with 600 trees). We should mention that, having only just finished developing the distributed version of deep forest, we have not had time to fine-tune it, so only slight tuning is performed for this analysis.

            AUC      F1       KS
MART        0.9957   0.5201   0.9424
gcForest-d  0.9962   0.5247   0.9444
gcForest-t  0.9970   0.5440   0.9480
Table 4: The results using the common metrics.
            1/10000   1/1000   1/100
MART        0.4661    0.6716   0.9358
gcForest-d  0.4703    0.6775   0.9397
gcForest-t  0.4880    0.6950   0.9470
Table 5: The results using the specified metrics. Here, 1/100 means that 1/100 of all transactions are interrupted.

The corresponding results are shown in Table 4, Table 5 and Figure 7 (gcForest-d denotes the default setting of deep forest and gcForest-t the slightly-tuned one). As we can see, even without any tuning, the default deep forest model (with only 50 trees per MART) still performs much better than the fine-tuned MART model (with up to 600 trees), which is consistent with the claim of the original paper that the default setting of deep forest produces highly competitive performance. Note that even with 4 forests in each layer, the number of trees in deep forest is still much smaller than that of the fine-tuned MART, leading to great resource savings. On the other hand, slight tuning of the deep forest model (changing only the number of trees to 200) already leads to much better performance. We believe that fine-tuning the deep forest model would lead to even better results.

Figure 7: The PR curve of deep forest with default parameters, slightly-tuned deep forest and fine-tuned MART.

4 Conclusion

Nowadays, more and more data is collected at Internet companies, and large scale machine learning applications are widely needed. Distributed machine learning systems that can handle extra-large scale tasks are necessary for more and more tasks, and a platform with an easy-to-use interface provides much help.

In this paper, we introduce the distributed version of the deep forest model, developed and deployed on the parameter server system KunPeng and the artificial intelligence platform PAI. To meet the needs of real world applications, many improvements are further introduced into the distributed version of deep forest. To name a few: MART is used as the base learner for efficiency and effectiveness; a cost based method is employed to handle the prevalent class-imbalance problem; MART based feature selection is performed for high dimensional data; and different evaluation metrics are included for automatically determining the number of cascade levels. With the deep forest model, experiments on the task of automatic detection of cash-out fraud are performed, and results are analyzed from different perspectives. All of these results show that the deep forest model provides highly competitive performance and achieves a significant decrease of economic loss, suggesting that this framework is a great choice for data scientists facing large scale data.

References

  • (BNJ03) David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.
  • (Bre01) Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
  • (Bur10) Christopher JC Burges. From RankNet to LambdaRank to LambdaMART: An overview. Microsoft Research Technical Report MSR-TR-2010-82, 2010.
  • (CG16) Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM, 2016.
  • (DC96) Harris Drucker and Corinna Cortes. Boosting decision trees. In Advances in Neural Information Processing Systems, pages 479–485, 1996.
  • (DCM12) Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1223–1231, 2012.
  • (DL97) Manoranjan Dash and Huan Liu. Feature selection for classification. Intelligent Data Analysis, 1(3):131–156, 1997.
  • (FDCBA14) Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim. Do we need hundreds of classifiers to solve real world classification problems? The Journal of Machine Learning Research, 15(1):3133–3181, 2014.
  • (FHT00) Jerome Friedman, Trevor Hastie, Robert Tibshirani, et al. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2):337–407, 2000.
  • (Fri01) Jerome Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.
  • (GBCB16) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, Cambridge, 2016.
  • (HJLS13) David W Hosmer Jr, Stanley Lemeshow, and Rodney X Sturdivant. Applied Logistic Regression, volume 398. John Wiley & Sons, 2013.
  • (JS02) Nathalie Japkowicz and Shaju Stephen. The class imbalance problem: A systematic study. Intelligent Data Analysis, 6(5):429–449, 2002.
  • (KMF17) Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pages 3149–3157, 2017.
  • (LAP14) Mu Li, David G Andersen, Jun Woo Park, Alexander J Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J Shekita, and Bor-Yiing Su. Scaling distributed machine learning with the parameter server. In Proceedings of 11th USENIX Symposium on Operating Systems Design and Implementation, volume 14, pages 583–598, 2014.
  • (LCY09) Jun Liu, Jianhui Chen, and Jieping Ye. Large-scale sparse logistic regression. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 547–556. ACM, 2009.
  • (LLSW16) Mu Li, Ziqi Liu, Alexander J Smola, and Yu-Xiang Wang. Difacto: Distributed factorization machines. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, pages 377–386. ACM, 2016.
  • (LTZ08) Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation forest. In Proceedings of the 8th IEEE International Conference on Data Mining, pages 413–422. IEEE, 2008.
  • (LWZ09) Xu-Ying Liu, Jianxin Wu, and Zhi-Hua Zhou. Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(2):539–550, 2009.
  • (MHS13) H Brendan McMahan, Gary Holt, David Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, et al. Ad click prediction: a view from the trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1222–1230. ACM, 2013.
  • (TS11) Krishnaiyan Thulasiraman and Madisetti NS Swamy. Graphs: Theory and Algorithms. John Wiley & Sons, 2011.
  • (XHD15) Eric P Xing, Qirong Ho, Wei Dai, Jin Kyu Kim, Jinliang Wei, Seunghak Lee, Xun Zheng, Pengtao Xie, Abhimanu Kumar, and Yaoliang Yu. Petuum: A new platform for distributed machine learning on big data. IEEE Transactions on Big Data, 1(2):49–67, 2015.
  • (XHWZ14) Zhixiang Xu, Gao Huang, Kilian Q Weinberger, and Alice X Zheng. Gradient boosted feature selection. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 522–531. ACM, 2014.
  • (ZCL17) Jun Zhou, Qing Cui, Xiaolong Li, Peilin Zhao, Shenquan Qu, and Jun Huang. PSMART: Parameter server based multiple additive regression trees system. In Proceedings of the 26th International Conference on World Wide Web Companion, pages 879–880. International World Wide Web Conferences Steering Committee, 2017.
  • (ZF17) Zhi-Hua Zhou and Ji Feng. Deep forest: Towards an alternative to deep neural networks. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 3553–3559, 2017.
  • (ZL06) Zhi-Hua Zhou and Xu-Ying Liu. Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Transactions on Knowledge and Data Engineering, 18(1):63–77, 2006.
  • (ZLZ17) Jun Zhou, Xiaolong Li, Peilin Zhao, Chaochao Chen, Longfei Li, Xinxing Yang, Qing Cui, Jin Yu, Xu Chen, Yi Ding, et al. KunPeng: Parameter server based distributed learning systems and its applications in Alibaba and Ant Financial. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1693–1702. ACM, 2017.