A Time Attention based Fraud Transaction Detection Framework

12/26/2019 ∙ by Longfei Li, et al.

With online payment platforms now ubiquitous, fraud transaction detection has become key to ensuring user account safety and platform security. In this work, we present a novel method for detecting fraud transactions by leveraging patterns from both users' static profiles and users' dynamic behaviors in a unified framework. To exploit users' behaviors in continuous time spaces, we propose time attention based recurrent layers that embed the detailed information of the time interval, such as the durations of specific actions, time differences between different actions, and sequential behavior patterns, in the same latent space. We further combine the learned embeddings with users' static profiles in a unified framework. Extensive experiments validate the effectiveness of our proposed method over state-of-the-art methods on various evaluation metrics, especially on recall at top percent, an important metric for measuring the balance between service experience and the risk of potential losses.




1. Introduction

Online payment platforms play an increasingly important role in our daily life as we head towards a cashless society (https://en.wikipedia.org/wiki/Cashless_society). The major online payment platforms, such as Alipay (https://intl.alipay.com/), PayPal (https://www.paypal.com) and Paytm (https://paytm.com/), currently serve hundreds of millions of users around the world and process millions of cashless transactions each day. To provide a credible service, a crucial and challenging issue is to ensure the safety of all transactions, in which the detection and prevention of fraud transactions is a critical task.

To handle this task, a key issue is how to construct the detection system. In recent years, machine learning based methods have been applied, in which the detection of fraud transactions is formulated as a classification problem and a model is trained on collected labeled data (panigrahi2009credit; mahmoudi2015detecting; srivastava2008credit). When deployed, the trained model assigns each transaction a score that measures its fraud risk. A threshold is then set so that transactions whose scores exceed it are suspended for further verification, using authentication methods such as face recognition, Short Message Service (SMS) codes, and verification emails. However, this approach has some awkward problems.

When building a model, another important issue is that the features in this task are quite complicated, so specific consideration and a more effective model are needed. Roughly speaking, there are two different kinds of features in this task. On the one hand, users’ static profiles, such as demographics and average spending, are basic features that describe a user and indicate the risk of the account. On the other hand, users’ dynamic behaviors, such as click and transaction histories, change over time. Thus, we claim that the model should pay enough attention to the dynamic features, and effective methods should be explored to reduce the expense of computing and storage.

To handle sequence data, deep learning based methods, such as long short-term memory (LSTM), convolutional neural networks (CNN) (lecun1998gradient), and their variants (DBLP:journals/corr/ChungGCB14), have been developed in recent years, and significant improvements have been achieved in various applications, such as speech recognition (DBLP:conf/icassp/GravesMH13), natural language processing (DBLP:conf/interspeech/MikolovKBCK10), and video processing (DBLP:conf/cvpr/DonahueHGRVDS15; DBLP:conf/icml/SrivastavaMS15). However, most of these methods only address the order of events in the sequence, while the detailed information of the time intervals is not considered. In fact, time regularity is informative and time interval information is meaningful in real-world applications. For example, a person who trades at a regular frequency is quite different from one who trades irregularly, and a person who takes a short time between two operations is quite different from one who takes a long time. Thus, the detailed information of the time interval is a key signal that should be exploited.

To address the problem above, we propose to introduce an attention mechanism. Attention mechanisms have proven to be very powerful (DBLP:conf/nips/VaswaniSPUJGKP17) and have brought improvements in many areas, such as natural language translation (DBLP:journals/corr/BahdanauCB14) and speech recognition (DBLP:conf/nips/ChorowskiBSCB15). Attention mechanisms are also applied on top of CNN (DBLP:journals/tacl/YinSXZ16) or LSTM to integrate extra sources of information and to guide the extraction of embeddings that are highly correlated with the specific task. As discussed above, detailed time information (not only the sequence information) is of great value and plays a crucial role in our task. In this paper, inspired by (DBLP:journals/corr/LinFSYXZB17; DBLP:conf/nips/NeilPL16), we propose to use time attention based recurrent layers to embed the detailed information of the time interval, such as the durations of specific actions, time differences between different actions, and sequential behavior patterns, in the same latent space. We further combine the learned embeddings and users’ static profiles in a unified framework for the final training of the fraud transaction detection model. The main contributions of this paper are summarized as follows.

  • We propose a novel time attention based recurrent layer which can operate on sequential data in continuous time spaces while accounting for the detailed information of the time interval.

  • We perform experiments on several datasets, from which we conclude that our method is a competitive alternative to existing time-LSTM works.

  • We further deploy the proposed framework as a real system at Alipay, and the results on real-world tasks also validate the effectiveness of the proposed method in terms of various metrics.

The rest of this paper is organized as follows. We summarize the related literature in Section 2, and describe the detailed architecture of the proposed approach in Section 3. We report and discuss experimental results in Section 4, and conclude in Section 5.

2. Background

In this section, we will introduce the related literature that formed the basis of our work.

2.1. Phased LSTM

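Phased LSTM (DBLP:conf/nips/NeilPL16) is an RNN architecture for modeling event-based sequential data. It extends LSTM by adding a time gate $k_t$. The gate is controlled by three parameters, $\tau$, $s$ and $r_{on}$, where $\tau$ is the total period of the model, $s$ is the phase shift and $r_{on}$ is the ratio of the duration of the open phase to the full period. $\tau$, $s$ and $r_{on}$ are learned during the training process. $k_t$ is formally defined as:

$$\phi_t = \frac{(t - s) \bmod \tau}{\tau}, \qquad k_t = \begin{cases} \dfrac{2\phi_t}{r_{on}}, & \phi_t < \frac{1}{2} r_{on} \\ 2 - \dfrac{2\phi_t}{r_{on}}, & \frac{1}{2} r_{on} \le \phi_t < r_{on} \\ \alpha \phi_t, & \text{otherwise} \end{cases}$$

where $t$ is the time stamp, $\phi_t$ is an auxiliary variable, and $\alpha$ is the leak rate applied while the gate is closed, as in (DBLP:conf/nips/NeilPL16). The modified model can be described as follows:

$$c_t = k_t \odot \tilde{c}_t + (1 - k_t) \odot c_{t-1}, \qquad h_t = k_t \odot \tilde{h}_t + (1 - k_t) \odot h_{t-1}$$

where $\tilde{c}_t$ and $\tilde{h}_t$ are the cell memory and hidden state proposed by the standard LSTM update from the input features $x_t$ at timestamp $t$, $h_t$ denotes the $u$-dimensional hidden units, and $c_t$ denotes the cell memory. However, this method is designed for high-frequency sampling scenarios, which are quite different from our task.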
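To make the gating behavior concrete, the following is a minimal sketch of the Phased LSTM time gate in plain Python. It assumes scalar timestamps and fixed (not learned) values for the period, shift and open ratio; the function names `time_gate` and `gated_update`, and the default leak value, are illustrative choices and not taken from the paper.

```python
def time_gate(t, tau, s, r_on, alpha=0.001):
    """Openness k_t of the time gate at timestamp t (fixed parameters, for illustration)."""
    phi = ((t - s) % tau) / tau       # phase inside the current period, in [0, 1)
    if phi < 0.5 * r_on:              # first half of the open phase: gate rises towards 1
        return 2.0 * phi / r_on
    if phi < r_on:                    # second half of the open phase: gate falls back to 0
        return 2.0 - 2.0 * phi / r_on
    return alpha * phi                # closed phase: only a small leak passes


def gated_update(k, proposed, previous):
    """Blend a proposed state with the previous state using gate openness k."""
    return [k * p + (1.0 - k) * q for p, q in zip(proposed, previous)]
```

With a small open ratio the state is updated only in a narrow window of each period, which suits regularly and densely sampled signals better than sparse, irregular transaction events.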

2.2. Time LSTM

Time LSTM (DBLP:conf/ijcai/ZhuLLWGLC17) adds specific inner gated units to LSTM to maintain the long-term and short-term effects of past actions on the current action in the sequence. These gates are controlled by the time interval $\Delta t$ between two actions; the full gate equations are given in (DBLP:conf/ijcai/ZhuLLWGLC17). Such a method has been successfully applied to predicting users’ next actions in recommendation systems (RS), a task quite similar to ours. However, as noted in (DBLP:journals/corr/BahdanauCB14), when the last hidden state of such a model is used as the representation of the sequence, it is difficult for a fixed-length vector to represent a long sequence.

3. Time attention based Heterogeneous Network

The proposed heterogeneous network’s architecture is shown in Fig 1. In this section, we will introduce the components of our architecture.

3.1. Representation of Static Features

From a user’s transaction and shopping records on the platform, we can collect profile features, such as working place, living place, credit score (similar to the FICO score, http://www.fico.com/), trading amounts, etc., which reflect the person’s spending ability and habits. The rationale for using such features is that an unusual transaction amount or location may be suspicious.

As many static features are continuous, data preprocessing, such as normalization and discretization, is needed before feeding them into a neural network, and different features may call for different normalization or discretization methods. For tree-based models, however, the raw features can be used directly, as the model is able to split the numerical values accordingly. This property and the strong representation power of tree-based models make them widely adopted in industry. Despite this, a Random Forest (RF) or Gradient Boosting Decision Tree (GBDT) model is a linear combination of separate trees, as can be seen from the following equation:

$$\hat{y}(x) = \sum_{k=1}^{K} w_k f_k(x)$$

where $K$ is the number of trees, $w_k$ is the weight of the $k$-th tree, and $f_k(x)$ is the output of the $k$-th tree.

Boosted decision trees have been shown to be a powerful model for transforming the original features of an instance (DBLP:conf/kdd/HePJXLXSAHBC14); the transformed features can then be used by other models to obtain even higher accuracy. Specifically, we treat each learned tree as a categorical feature whose value is the index of the leaf node the instance falls in. As a result, if there are $K$ trees in the GBDT model, the transformed feature of an instance is a structured vector $[e_{1,i_1}, e_{2,i_2}, \dots, e_{K,i_K}]$, where $e_{k,i_k}$ is the $i_k$-th unit vector of dimension $L_k$, $L_k$ is the number of leaf nodes of the $k$-th tree, and $i_k$ is the index of the leaf node that the current instance falls into in the $k$-th tree.
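The leaf-index transformation can be sketched in a few lines of Python. This is only an illustration: it assumes the leaf index reached in each tree has already been obtained from a trained GBDT model, uses 0-based leaf indices, and the helper name `leaf_one_hot` is hypothetical.

```python
def leaf_one_hot(leaf_indices, leaves_per_tree):
    """Concatenate one unit vector per tree, selected by the leaf the instance reached."""
    if len(leaf_indices) != len(leaves_per_tree):
        raise ValueError("need exactly one leaf index per tree")
    encoded = []
    for index, n_leaves in zip(leaf_indices, leaves_per_tree):
        if not 0 <= index < n_leaves:
            raise ValueError("leaf index out of range")
        unit = [0] * n_leaves         # unit vector for this tree
        unit[index] = 1
        encoded.extend(unit)
    return encoded
```

For three trees with 2, 3 and 3 leaves, an instance that reaches leaves 1, 0 and 2 is encoded as an 8-dimensional vector with exactly three ones, one per tree.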

3.2. Dynamic Behaviors

3.2.1. Click Behavior

When users use services provided by Alipay, a record is kept describing the service the user has used, which is quite similar to the click history used in recommendation systems. We can formulate the user behavior sequence as a tuple $(u, a, t)$ with $u \in \mathcal{U}$ and $a \in \mathcal{A}$, where $\mathcal{U}$ is the user set, $\mathcal{A}$ is the action set, and $t$ is the time stamp at which user $u$ performed action $a$. For a user $u$, his/her click behaviors can be represented as $\{(a_1, t_1), (a_2, t_2), \dots, (a_n, t_n)\}$. To capture the time effect, we separate the click behaviors into two parts: the click history $\{a_1, \dots, a_n\}$ and the time behavior $\{t_1, \dots, t_n\}$. For the time behavior, we pay more attention to the interval between two actions, so we transform it into $\{\Delta t_1, \dots, \Delta t_n\}$, where $\Delta t_i = t_i - t_{i-1}$. However, since the values of $\Delta t_i$ fall into a large range, some values appear rarely, which makes it hard for the network to converge, so a discretization process is needed.
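The split into click history and time intervals, followed by discretization, might look like the sketch below. The paper does not specify the discretization scheme, so the log-spaced bucketing, the bucket count, the zero interval assigned to the first action, and the function names are all assumptions made for illustration.

```python
import math


def split_click_sequence(events):
    """Split [(action, timestamp), ...] into an action list and an interval list."""
    events = sorted(events, key=lambda e: e[1])   # order by timestamp
    actions = [a for a, _ in events]
    times = [t for _, t in events]
    # Assumption: the first action has no predecessor, so its interval is set to 0.
    intervals = [0.0] + [times[i] - times[i - 1] for i in range(1, len(times))]
    return actions, intervals


def bucketize(interval_seconds, n_buckets=16):
    """Map a non-negative interval to a log2-spaced bucket id (assumed scheme)."""
    bucket = int(math.log2(1.0 + interval_seconds))
    return min(bucket, n_buckets - 1)             # clip rare, very long intervals
```

Log-spaced buckets are one common way to keep rare, very large intervals from each getting their own category; other binning schemes would fit the description equally well.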

3.2.2. Transaction Behavior

When a user makes a transaction, a lot of information is saved that covers many aspects of the transaction. For example, an event contains the scene, the location, and the time of the transaction. The event also includes the formal transaction place and the registered place, which can indicate whether the user is trading in an abnormal place. For time data, we use the same notation as in Section 3.2.1.

3.3. Time Attention based Recurrent Layers

Since our attention mechanism is added on top of RNN layers, we first introduce the basic LSTM, followed by our proposed time attention mechanism.

3.3.1. LSTM

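LSTM has been applied successfully to many sequential modeling tasks. Compared with a plain Recurrent Neural Network (RNN), an LSTM unit additionally comprises a forget gate, an input gate, an output gate, and a memory cell. The standard LSTM equations are as follows:

$$\begin{aligned} f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\ i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\ o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\ g_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\ c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\ h_t &= o_t \odot \tanh(c_t) \end{aligned}$$

where the $W$, $U$ and $b$ are parameters of the LSTM, $x_t$ represents the input vector of the LSTM at timestamp $t$, $\sigma$ is the sigmoid function, and $\tanh$ is the hyperbolic tangent function.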

3.3.2. Time Attention

Assume we have a sequence consisting of $n$ actions, represented as a sequence of embeddings:

$$X = (x_1, x_2, \dots, x_n)$$

where $x_i$ is a vector of dimension $d$, and $X$ is a 2-D matrix whose shape is $n$-by-$d$. The hidden state of the RNN at time $t$ is given by:

$$h_t = \mathrm{RNN}(x_t, h_{t-1})$$

where $h_t$ is a $u$-dimensional vector and $u$ is the number of hidden units of the RNN. All $n$ hidden states $h_t$ are collected into a matrix $H$, whose shape is $n$-by-$u$ when the RNN is single-directional, or $n$-by-$2u$ when it is bi-directional.

Time data can carry multiple meanings. For example, how long a user stays in a session reflects the user’s degree of interest or familiarity, and how long a user waits before using another service can characterize the user’s behavior. Here we use $t_i$ to denote the time data of the $i$-th action. Since $t_i \in \mathbb{R}$, where $\mathbb{R}$ is the one-dimensional real value space, we first discretize the time data, use the result $\hat{t}_i$ as a categorical feature, and then encode it as:

$$e_i = \mathrm{Embedding}(\hat{t}_i)$$

where $e_i$ is the embedding representation of the discretized time data, whose dimension is $d_t$. We then stack all time embeddings together into a matrix $E$, the embedding representation of the time data, whose dimension is $n$-by-$d_t$. Following the self-attention mechanism, we use the following equations to calculate the weights of the hidden states $H$ we get from the RNN:


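$$a = \mathrm{softmax}\left(v^{\top} \tanh\left(W_h H^{\top} + W_e E^{\top}\right)\right)$$

where $W_h$ and $W_e$ are the matrices to be learned, $W_h \in \mathbb{R}^{d_a \times u}$, $W_e \in \mathbb{R}^{d_a \times d_t}$, $v$ is a vector, $v \in \mathbb{R}^{d_a}$. Here $H$ stacks the RNN hidden states and $E$ stacks the time embeddings. $a$ is the attention weight which quantifies the relevance of the features in $H$. The $\mathrm{softmax}$ ensures that all the computed weights sum up to 1. After getting $a$, we can use the standard attention mechanism to gather the embeddings extracted from different time states together by the following equation:

$$m = a H$$

where $m$ is the new representation of the sequence.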
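As an illustration of how time embeddings can steer the pooling of hidden states, the sketch below implements attention pooling in plain Python. It assumes an additive scoring form, $a = \mathrm{softmax}(v^{\top}\tanh(W_h H^{\top} + W_e E^{\top}))$, which is one plausible reading of the description and not necessarily the exact formulation used in the deployed system; all names (`time_attention_pool`, `W_h`, `W_e`, `v`) are chosen here for illustration.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def time_attention_pool(H, E, W_h, W_e, v):
    """H: n hidden states (length u each); E: n time embeddings (length d_t each).
    W_h: d_a x u, W_e: d_a x d_t, v: length d_a. Returns (weights a, pooled vector m)."""
    scores = []
    for h_i, e_i in zip(H, E):
        # Assumed additive score: v . tanh(W_h h_i + W_e e_i)
        hidden = [math.tanh(p + q) for p, q in zip(matvec(W_h, h_i), matvec(W_e, e_i))]
        scores.append(sum(v_k * z for v_k, z in zip(v, hidden)))
    a = softmax(scores)
    # Weighted sum of hidden states: m = a H
    m = [sum(a[i] * H[i][j] for i in range(len(H))) for j in range(len(H[0]))]
    return a, m
```

Because the score depends on the time embedding as well as the hidden state, two steps with identical hidden states can still receive different weights when their time intervals differ.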

Figure 1. Architecture of our framework.

3.4. Heterogeneous network

Since our system has two different kinds of behavior data as well as static features, we want to blend the heterogeneous data into a unified architecture, which makes the whole system more compact and at the same time reduces the work of feature engineering. Following the method described in Section 3.1, we extract tree embeddings from the user profile. At the same time, we extract two kinds of behavior embeddings, from click and transaction behavior respectively, with our time attention RNN architecture. Since the values from different parts are on different scales, directly concatenating them makes the whole network hard to converge. We therefore add a batch normalization (BN) layer (DBLP:conf/icml/IoffeS15) on top of each time attention layer and the tree embedding layer, then concatenate the outputs of the BN layers and feed them to a multi-layer neural network. The whole architecture is shown in Figure 1.
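The normalize-then-concatenate step can be sketched as follows. This is a simplified stand-in for batch normalization: it standardizes each column over the current batch only and omits the learned scale and shift parameters and the running statistics that a real BN layer keeps. The function names are illustrative.

```python
import math


def batch_standardize(batch, eps=1e-5):
    """Normalize each column of a batch (list of equal-length rows) to zero mean, unit variance."""
    n = len(batch)
    dims = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(dims)]
    return [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(dims)]
        for row in batch
    ]


def fuse(*blocks):
    """Standardize every embedding block separately, then concatenate them row by row."""
    normalized = [batch_standardize(block) for block in blocks]
    return [sum(rows, []) for rows in zip(*normalized)]
```

After this step, a tree-embedding column with values 0 and 2 and a behavior-embedding column with values 100 and 300 end up on the same scale, so neither block dominates the input of the following multi-layer network.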

4. Experiments

In this section, we will describe the comprehensive experiments that we conducted to show the effectiveness of our proposed model. We first describe the dataset we use, the comparison methods, hyper-parameter settings, and evaluation metrics. We then report the comparison result and finally study model parameter effects.

4.1. Dataset

We use real transaction data from Alipay as our experimental dataset, in which both legitimate and fraud transactions are available. The fraud detection task is quite different from traditional classification tasks, because the methods used to commit fraud vary across time periods. Thus, to test our model’s performance as realistically as possible, we split the original five-month transaction data into three parts according to the time at which each transaction occurred: the first three months are used as the training set, the fourth month as the validation set, and the last month as the test set. Since the total number of transactions is extremely large and fraud transactions are rare, we down-sample the non-fraud samples of the training and validation sets to accelerate our experiments. To simulate the real online situation, we sample the test set from the original data, so its ratio of fraud to non-fraud samples differs considerably from that of the training and validation sets. We report the details of the dataset after preprocessing in Table 1.

Dataset          #User       #Sequences   #Non-Fraud Transactions   #Fraud Transactions
train set        1,221,706   3,837,624    3,832,560                 5,064
validation set   656,521     1,248,912    1,247,315                 1,597
test set         674,057     1,302,226    1,302,091                 135
Table 1. Fraud detection dataset description
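The chronological split and the down-sampling of non-fraud samples can be sketched as below. The record schema (`month`, `is_fraud`), the keep ratio, and the seed are hypothetical; the sampling ratios actually used are not stated in the text.

```python
import random


def chronological_split(records, train_months, valid_months, test_months):
    """records: dicts with 'month' and 'is_fraud' keys (hypothetical schema)."""
    def pick(months):
        return [r for r in records if r["month"] in months]
    return pick(train_months), pick(valid_months), pick(test_months)


def downsample_non_fraud(records, keep_ratio, seed=0):
    """Keep every fraud record and a random fraction of the non-fraud records."""
    rng = random.Random(seed)
    return [r for r in records if r["is_fraud"] or rng.random() < keep_ratio]
```

Splitting by time rather than at random keeps future fraud patterns out of the training data, which matters when fraud methods drift from month to month.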

As described in Section 3, the features in the fraud transaction scenario are divided into three groups: user static features, user click behavior features, and transaction behavior features. User static features reflect a person’s spending ability and habits. For click behavior, we use the user’s interactions with the Alipay app during the most recent two days. Based on experience, we set the maximum number of interactions to 200; if a user has more than 200 interactions, we keep only the latest 200. For transaction behavior, we use the user’s trading history in Alipay during the most recent ten days. We likewise set the maximum length of the trading history to 32 based on experience; if a user has more than 32 transactions, we keep only the latest 32. Moreover, for each transaction, we select the 28 most important attributes, e.g., the trading location, IP location, trading amount and so on. Finally, we summarize the dimensions of each feature group in Table 2.

Static feature dimension          89
User behavior feature dimension   2,300
Transaction feature dimension     28
Table 2. Feature dimension description
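The truncation rule described above (keep only the latest 200 interactions or the latest 32 transactions) can be written as a small helper. The event layout, a list of `(timestamp, payload)` pairs, and the names below are assumptions for illustration.

```python
MAX_CLICKS = 200         # maximum click-sequence length stated in the text
MAX_TRANSACTIONS = 32    # maximum transaction-sequence length stated in the text


def keep_latest(events, max_len):
    """events: list of (timestamp, payload); return the max_len most recent, oldest first."""
    ordered = sorted(events, key=lambda e: e[0])
    return ordered[max(0, len(ordered) - max_len):]
```

Sequences shorter than the limit are returned unchanged; any padding needed for batching is left out of this sketch.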

4.2. Comparison Methods

In order to study whether our proposed time attention mechanism works, we compare our time attention mechanism with the following methods by varying the building block that generates the behavior embedding.

  • Bi-LSTM: We use the bidirectional LSTM (graves2009offline) to model the user’s behavior. We take the last state as the user embedding and concatenate it with the tree embedding extracted from the trees.

  • Phased LSTM: This method is introduced in Section 2. We use the implementation provided by TensorFlow.

  • Self-attention LSTM: We add the self-attention layer introduced in (DBLP:journals/corr/LinFSYXZB17) on top of the Bi-LSTM.

  • CNN+Max pooling: We use a traditional CNN with max pooling to extract the embeddings of click and transaction behavior. The window size is set from 4 to 10. For click behavior, the kernel size is set to 32. For transaction behavior, the kernel size is set to 16. In both cases the kernel size equals the embedding dimension of the corresponding behavior.

  • Time LSTM: This method is introduced in Section 2, and we use the implementation available on GitHub (https://github.com/DarryO/time_lstm).

4.3. Hyperparameter Setting

We fix the tree model’s parameters, so that different models use the same tree embeddings. For all the LSTM-derived algorithms, we set the stack depth to 1 and use the same shape to make a fair comparison. The detailed settings are described below.

  • Tree Embedding: We choose the large-scale GBDT model implemented on KunPeng (DBLP:conf/kdd/ZhouLZCLYCYCDQ17) as the tree model, and we set the number of trees to 100 and the max depth to 5.

  • Network shape: For LSTM, GRU, and the derived algorithms, we set the number of hidden units to 256. For the MLP, the number of hidden layers is set to 1, and the number of hidden units is 128.

  • Learning rate: We use SGD as the optimizer, and select the best learning rate in {0.1, 0.01, 0.001}.

  • Embedding Dimension: At every time stamp, the transaction event contains 28 different features, each feature contains a different number of components, and each component uses a 16-dimensional embedding. For click behavior, the embedding dimension is set to 32. For the time embedding dimension, we select the best value in {8, 16, 32, 64}.

  • Batch Size: We set the batch size to 512 for all the models.

  • Regularization: We use L2 regularization, with its coefficient set to 1e-5.

Note that for each model, we use the validation set to select the best model parameters, and evaluate them on the test set.

4.4. Evaluation Metrics

We use three kinds of evaluation metrics to measure the performance of our proposed method. We adopt two standard metrics: Area Under the ROC Curve (AUC) and F1-score. In addition, because a real fraud detection system cannot disturb too many users in order to improve the recall rate, we use a more practical indicator, Recall At Top Percent (RATP). RATP@$p$ is the recall measured on the subset of instances whose prediction scores are in the top $p$ percent. For example, RATP@0.05 means that only 5 out of every 10,000 transactions will be disturbed.
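A minimal sketch of how RATP could be computed is given below. It assumes binary labels (1 for fraud), a percentage argument such as 0.05 for the top 0.05%, and rounding to the nearest whole number of transactions with at least one selected; the rounding rule and the function name are assumptions, not details from the paper.

```python
def recall_at_top_percent(scores, labels, percent):
    """Recall of fraud (label 1) among the top `percent` % highest-scored transactions."""
    total_fraud = sum(labels)
    if total_fraud == 0:
        raise ValueError("no positive labels")
    # Number of transactions that would be interrupted for verification.
    k = max(1, int(round(len(scores) * percent / 100.0)))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    caught = sum(label for _, label in ranked[:k])
    return caught / total_fraud
```

With 10,000 scored transactions, `recall_at_top_percent(scores, labels, 0.05)` inspects only the 5 highest-scored ones, which is why the metric reflects the trade-off between user disturbance and caught fraud.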

Method                    F1-score   AUC     RATP@0.05   RATP@0.01
GBDT                      0.701      0.981   0.807       0.637
CNN+Max pooling           0.702      0.982   0.815       0.652
GRU                       0.708      0.981   0.822       0.652
Bi-LSTM                   0.712      0.983   0.815       0.659
Bi-LSTM+self attention    0.714      0.984   0.830       0.674
PLSTM                     0.714      0.986   0.835       0.689
TLSTM                     0.716      0.986   0.844       0.692
Bi-LSTM+time attention    0.721      0.99    0.864       0.706
GRU+time attention        0.718      0.988   0.859       0.703
Table 3. Experiment Results

Figure 2. Recall@Top 0.05% results on the test set. The vertical axis denotes RATP@0.05 on the test set and the horizontal axis denotes training iterations. In each iteration, the model processes 512*1000 sequences.

Figure 3. Recall@Top 0.01% results on the test set; the axes have the same meaning as in Fig 2.

4.5. Comparison Results

We report the comparison results in Table 3. From it, we can see that: (1) Compared with the original GBDT, which uses manually extracted behavior features, modeling the user behavior with LSTM or GRU gives our proposed model a significant improvement in terms of RATP. Take RATP@0.05 for example: our proposed method achieves a relative improvement of about 7% over GBDT (0.864 vs. 0.807), because sequence modeling can extract more complex patterns. (2) The improvements of our proposed model over other models are not significant in terms of AUC, because non-fraud transactions are far more numerous than fraud transactions. Thus, an improvement in the high-score region does not change the AUC much. (3) All the methods that consider the time influence between different actions outperform Bi-LSTM, GRU and Bi-LSTM with self-attention, which means that time is important information in the fraud detection task. (4) At the same time, LSTM and GRU with our proposed time attention mechanism outperform PLSTM and TLSTM in our task. A likely reason is that, compared with adding inner gates to the RNN, a time attention mechanism that uses time information to guide the generation of the sequence embedding may provide a better representation of the time sequence.

We also show the convergence speed of different models in Fig 2 and Fig 3. From them, we can see that our proposed method converges the fastest, and that LSTM with time attention converges slightly faster than GRU with time attention.

4.6. Parameter Analysis

We will study the effects of the hidden units and the time embedding dimension on our model performance.

4.6.1. Effect of the hidden units

We first vary the number of LSTM/GRU hidden units to study its effect on model performance, while fixing the other hyperparameters. The result is shown in Fig 4. As we can see, as the number of hidden units increases, the RATP performance improves. However, beyond a certain size, adding more hidden units does not perform much better. This is because as the number of hidden units increases, the number of parameters also increases, so more data is needed to fit the model.

Figure 4. Effect of different numbers of hidden units. The vertical axis indicates RATP@0.05 on the test set and the horizontal axis indicates the number of RNN hidden units.

4.6.2. Effect of the time embedding dimension

Figure 5. Effect of different time embedding dimensions. The vertical axis shows RATP@0.05 on the test set and the horizontal axis shows the dimension of the time embedding.

We then vary the embedding dimension of the time data to study its effect on model performance, while fixing the other hyperparameters. As shown in Fig 5, the performance does not always improve as the time embedding dimension increases. We get the best result when the time embedding dimension is 32. This is because as the time embedding dimension increases, the feature space becomes sparse, which makes the model harder to converge.

5. Conclusions and future work

In this paper, we proposed a framework that handles heterogeneous data and introduced a new attention mechanism that brings the time aspect into the whole framework. We implemented our proposed method and evaluated it against several baseline approaches, and showed that it achieves the best results.

In the future, we will evaluate our model on more datasets and improve its computational efficiency. Moreover, we will try to handle users who have little history information on our platform.