Ant Credit Pay is a consumer credit service of Ant Financial Service Group. Similar to a credit card, it offers users a credit line that they can use to pay for online and offline shopping. Each month, users have to repay their debts before the due date (usually the 10th); otherwise a default occurs, which can adversely affect their future loan applications. As with other credit products, loan default is a major risk for Ant Credit Pay, so default prediction is a key part of risk management. The predicted default probability is one of the most important factors in admittance management and credit limit granting. Hence, an algorithm that predicts defaults effectively is key to reducing losses and increasing profits for the company.
Ant Credit Pay is an online credit service. In contrast with the offline credit card services of conventional banks, it has different business characteristics, which means that the challenges facing Ant Credit Pay also differ from those facing conventional banks.
The first challenge we face is scalability. Ant Credit Pay serves hundreds of millions of users, which may be tens or even hundreds of times more than the number of credit card users of a single bank. The industrial-scale number of users and their behaviors requires an industrial-scale data processing and machine learning platform for feature engineering and model training. It also requires well-designed distributed algorithms that can learn from big data efficiently.
On the other hand, conventional solutions to the default prediction problem tend to learn explainable models (e.g., linear or tree-based models) based on careful feature engineering. Their performance mainly depends on the effectiveness of the input features. Since most features come from user-provided information and users’ behaviors in related scenarios, the quality of these data determines the effectiveness of feature engineering and the performance of the default prediction model. Recently, researchers have also tried to apply new methods, such as deep learning, to the default prediction problem. Deep models are able to capture subtle interactions between input features and thus yield better performance. It should be noted that, since a deep model is still learned from the same feature space as conventional models, its performance also depends on the quality of the raw data.
Conventional banks usually manage credit card applications offline. Their employees manually review applicants’ information, which contributes greatly to the quality of the data. However, in our online service scenario, the huge number of users prevents us from manually reviewing each application. Our decisions are made according to users’ behaviors in related businesses in the Alipay App (e.g., payment, credit history, etc.). Moreover, for those who are inactive in the Alipay App, it is hard to acquire high-quality data for conventional feature engineering, which may result in poor predictions.
Hence here comes the second challenge: the cold-start problem. As in recommender systems, the cold-start problem in our scenario refers to the difficulty of predicting the default probability of inactive or new users due to the lack of sufficient data. Hence, new data sources as well as new algorithms need to be applied to alleviate the cold-start problem.
In Ant Financial, a user interacts with other users in various kinds of businesses (e.g., social relations, fund interactions, common interests, etc.). It is natural to build a social network in which users act as nodes and their interactions act as edges. In the real world, people with different levels of credit risk tend to interact with different crowds. We demonstrate this observation in Figure 1(a). We first calculate the default rate of each user group with a certain number of default neighbors, and then show the lift in default rate, as a percentage, compared with users who have no default neighbors. As shown in Figure 1(a), the more default neighbors a user has, the higher his or her default rate is. In particular, the credit risk of users with no fewer than 5 default neighbors is notably higher than that of users who never interact with default users. The aggregation of users with similar risk in the social network implies that the structural information of social relations can be beneficial to the default prediction problem.
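The lift computation behind Figure 1(a) can be sketched as follows. This is a minimal illustration on toy data, not the production pipeline: the function name `default_rate_lift`, the dictionary inputs, and the cap of 5 (grouping users with 5 or more default neighbors together) are assumptions made for the example.

```python
from collections import defaultdict


def default_rate_lift(neighbors, defaulted, cap=5):
    """Return {k: lift %} of the default rate of users with k default
    neighbors, relative to users with zero default neighbors.

    neighbors: dict mapping user -> list of neighbor users (hypothetical input)
    defaulted: dict mapping user -> bool
    cap:       users with >= cap default neighbors share one group
    """
    totals = defaultdict(int)
    defaults = defaultdict(int)
    for user, nbrs in neighbors.items():
        k = min(sum(1 for n in nbrs if defaulted.get(n, False)), cap)
        totals[k] += 1
        defaults[k] += int(defaulted.get(user, False))
    if totals[0] == 0 or defaults[0] == 0:
        raise ValueError("the zero-default-neighbor group needs a non-zero default rate")
    base = defaults[0] / totals[0]
    return {
        k: 100.0 * (defaults[k] / totals[k] - base) / base
        for k in sorted(totals)
        if k > 0
    }


# Toy data: u1..u4 have no default neighbors, u5 and u6 each have one (u1).
toy_neighbors = {"u1": [], "u2": [], "u3": [], "u4": [], "u5": ["u1"], "u6": ["u1"]}
toy_defaulted = {"u1": True, "u5": True}
lift = default_rate_lift(toy_neighbors, toy_defaulted)
```

In the toy data the baseline group has a default rate of 1/4 and the one-default-neighbor group 1/2, so the lift for that group is 100%.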
On the other hand, users who rarely participate in credit-related businesses may still have rich social interactions. Figure 1(b) shows the average number of neighbors of user groups with different activity levels. New users are those who signed up for the Alipay App within a month, while other users are divided into two groups (i.e., active users and inactive users) based on how frequently they use the App. Active users have the most neighbors due to their higher frequency of use. But even for inactive users and new users, more than 40 neighbors are found in the social network, which is enough to learn high-quality network representations for them.
However, conventional methods for default prediction rarely utilize network structural information effectively. It is hard to find literature on applying network structural information to the default prediction problem, especially in an industrial scenario with billions of users and tens of billions of connections between users.
Considering the above challenges and the special scenario in Ant Financial, we present NetDP (DP is short for Default Prediction), an industrial-scale distributed network representation framework for default prediction in Ant Credit Pay. NetDP is a flexible framework that supports both unsupervised and supervised network representation simultaneously. The unsupervised module aims to depict the global structural information of the whole network, while the supervised module is responsible for modeling the local structural information of labeled data. Then, the ensemble module applies Multiple Additive Regression Tree (MART) to blend the outputs of the unsupervised and supervised models, and assigns a final predicted default probability to a given user.
Thanks to its succinct modeling and efficient distributed implementation, NetDP can model the structural information of a social network with billions of nodes and tens of billions of edges in several hours. To the best of our knowledge, no published method can efficiently learn representations in a network of such magnitude. Moreover, experimental results show that the proposal improves the performance of default prediction, especially for new users.
The rest of this paper is organized as follows. Section 2 gives the preliminaries of the proposal and defines the notation used in the following sections. Section 3 presents NetDP, the proposed distributed network representation framework for default prediction. The unsupervised module, the supervised module, and the ensemble module are introduced in turn. We also present details of the distributed implementation of NetDP. Section 4 describes the experimental settings and results that demonstrate the performance of our proposal on the default prediction problem, especially in alleviating the cold-start problem. We also show the efficiency of our distributed implementation. Section 5 reviews related work on the default prediction problem and on our proposal. We conclude this work in the last section.
We are given a directed network $G = (V, E)$, in which $V$ denotes the node set of size $|V|$ and $E$ denotes the directed edge set (a pair $(v_i, v_j) \in E$ represents the edge from node $v_i$ to node $v_j$). Let $N(v_i)$ denote the set of neighbor nodes of $v_i$. In our scenario, a node $v_i$ represents a user of Alipay and an edge $(v_i, v_j)$ represents that user $v_i$ interacts with $v_j$. The interactions between users include social friend relationships, fund transfers between users, transactions between buyers and sellers, etc. To simplify the problem, different types of interactions are treated in a unified manner. Irrelevant or weak interactions between users are dropped so that they do not introduce useless or even harmful noise.
The unsupervised representation learning module assigns a learned $d$-dimensional vector $\mathbf{h}_i^{u}$ (the superscript $u$ indicates that this representation is generated by unsupervised learning) to each node $v_i$, which represents the global structural information of $v_i$. A small portion of users who have paid with Ant Credit Pay are labeled according to whether they defaulted or not. The label $y_i = 1$ means that $v_i$ has defaulted on a loan; otherwise $y_i = 0$. After model training, the supervised module assigns a learned $d'$-dimensional vector $\mathbf{h}_i^{s}$ (the superscript $s$ indicates that this representation is generated by supervised learning) to represent the local structural information of $v_i$, as well as a score $\hat{y}_i$ to $v_i$ as the predicted default probability.
III The Proposed Method: NetDP
In this section, we first briefly introduce the overall framework of the proposed NetDP, then give detailed formalizations of the unsupervised and supervised representation learning modules. Finally, we present the efficient distributed implementation of each component of NetDP.
III-A Overall Framework
Figure 2 shows the overall framework of NetDP. The input includes two parts: a directed network consisting of Alipay users and the social interactions between them, and the default labels attached to a small portion of users (gray blocks with 0 or 1 inside, attached to nodes in Figure 2). The whole network without labels is the input of the unsupervised representation learning module, and the learned representation of each node (i.e., $\mathbf{h}_i^{u}$) is passed to the ensemble module. The supervised representation learning module takes the labeled nodes and their neighbors as input. It produces a learned representation (i.e., $\mathbf{h}_i^{s}$) for each node and the predicted default probability (i.e., $\hat{y}_i$). The unsupervised representation vector and the supervised predicted score are concatenated into a $(d+1)$-dimensional vector, which becomes the input of the ensemble module. A distributed Multiple Additive Regression Tree (MART) is applied to ensemble the outputs of the unsupervised and supervised modules. A final predicted default probability is given to represent the credit risk of a given user.
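The concatenation step that feeds the ensemble module can be sketched as follows. This is only an illustration of how the input of the ensemble module is assembled: the function name `build_ensemble_input`, the dictionary layout, and the toy values are assumptions, and the MART model itself is not shown.

```python
def build_ensemble_input(unsup_vectors, sup_scores):
    """Concatenate each unsupervised vector with the supervised score.

    unsup_vectors: dict node -> list of d floats (unsupervised representation)
    sup_scores:    dict node -> float (supervised predicted default probability)
    Returns dict node -> list of d + 1 floats.
    """
    features = {}
    for node, vec in unsup_vectors.items():
        if node in sup_scores:  # keep only nodes scored by the supervised module
            features[node] = list(vec) + [sup_scores[node]]
    return features


# Toy values with d = 3; real representations come from the two modules.
unsup_vectors = {"alice": [0.1, -0.3, 0.7], "bob": [0.4, 0.2, -0.1]}
sup_scores = {"alice": 0.08, "bob": 0.61}
features = build_ensemble_input(unsup_vectors, sup_scores)
```

Each resulting row has one more entry than the unsupervised vector, and a tree ensemble would be trained on these rows against the default labels.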
III-B Formalization of Network Representation Learning
Our goal is to predict whether a user will default or not based on the structural information learned from the given social network $G$. To achieve this, NetDP performs unsupervised and supervised network representation learning simultaneously. On the one hand, the unsupervised method, which takes the whole network as input, can learn effective representations that encode the global structural information of each node. Without adjustment by supervised information, the unsupervised representations can objectively reflect the structural characteristics of nodes. On the other hand, default prediction is still a supervised learning task. Hence, we design a supervised network representation method to capture the local structural information of labeled nodes by modeling the relationship between labeled nodes and their neighbors. We introduce the formalizations of these two methods in the following subsections.
III-B1 The Unsupervised Method
Many state-of-the-art unsupervised network representation methods represent a node as a vector in a low-rank hidden space, which encodes the structural information of the corresponding node within the whole network. A common assumption of these works is that closely connected nodes should be close to each other in the low-rank hidden space. In our case, closely connected users tend to have similar credit risk (Figure 1(a)). Hence, the ability to encode closely connected nodes into close low-rank vector representations is the key to building a default prediction model with unsupervised network representation.
Some recent works ([5, 2, 6]) provide a new direction for capturing structural information from social networks. These methods learn a low-rank vector for each node that preserves the relations between nodes in the network. DeepWalk [5] and Node2vec [2] have two main steps: apply random walks to generate node sequences, and then apply the skip-gram model to generate a representation for each node.
Random walk is used to extract high-order topological information from the network. At each step of a walk, a neighbor of the last visited node is sampled uniformly until the maximum walk length is reached. Random walk is efficient only if the whole network fits into the memory of a single machine. In the single-machine case, the lookup of neighbors for a given node is much faster because no inter-machine communication happens during the lookup. However, the situation becomes challenging when the network is so large that it has to be partitioned and stored on several machines. If a node and its neighbors are not stored on the same machine, the time of the neighbor lookup increases from $t_l$ to $2t_l + 2t_c$, where $t_l$ denotes the time of a lookup on the same machine and $t_c$ denotes the communication time between two machines. The term $2t_l$ means that one local lookup and one remote lookup are required, while $2t_c$ means one communication to issue the lookup command from the local machine to the remote one and one communication to return the lookup results. In general, $t_c$ is much larger than $t_l$. Therefore, random walk becomes inefficient if the whole network is too large to be stored in the memory of a single machine.
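A small back-of-the-envelope sketch makes this cost argument concrete. It assumes the cost model described above (a remote lookup costs one local lookup, one remote lookup, and two communications); the names `t_local` and `t_comm` and the numeric values are hypothetical and chosen only for illustration.

```python
def lookup_time(t_local, t_comm, same_machine):
    """Time of one neighbor lookup under the cost model described above."""
    if same_machine:
        return t_local
    # one local + one remote lookup, plus request and response communication
    return 2 * t_local + 2 * t_comm


def expected_walk_time(walk_length, remote_fraction, t_local, t_comm):
    """Expected time of a random walk when a fraction of steps hit a remote machine."""
    per_step = (
        (1 - remote_fraction) * lookup_time(t_local, t_comm, True)
        + remote_fraction * lookup_time(t_local, t_comm, False)
    )
    return walk_length * per_step


# Hypothetical units: a local lookup costs 1, a communication costs 100.
local_only = expected_walk_time(10, 0.0, 1, 100)
half_remote = expected_walk_time(10, 0.5, 1, 100)
```

With these illustrative numbers, a 10-step walk that stays on one machine costs 10 units, while the same walk with half of its steps landing on remote machines costs 1015 units.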
In our scenario, the social network contains billions of users and tens of billions of interactions between them, so it has to be stored on several machines. It is impractical to run random-walk-based or higher-order-proximity-based network representation methods efficiently on such a huge network. On the other hand, the social network in the Alipay App has sufficiently rich first-order proximity (i.e., more than 40 neighbors per node) for modeling network representations. Therefore, in our proposed unsupervised method, only the direct neighbors are sampled to optimize the representation of the target node.
In our algorithm, a node $v_i$ is represented as a $d$-dimensional vector $\mathbf{h}_i^{u}$. At each step, for a target node $v_i$, several neighbors of $v_i$ are sampled from its neighbor set (i.e., $N(v_i)$). We denote the log-likelihood of the co-occurrence of $v_i$ and $v_j$ as:
To preserve the structural information of the social network, the representations of nodes are optimized to minimize the negative log-likelihood of co-occurrences as follows:
In order to optimize the loss efficiently, the negative sampling technique is applied and the loss can be revised as:
$\sigma(\cdot)$ is the sigmoid function.
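For reference, a standard first-order formulation that matches the description above can be written as follows, with $\mathbf{h}_i$ denoting the representation of node $v_i$ and $N(v_i)$ its neighbor set. The softmax form of the co-occurrence likelihood, the number of negative samples $K$, and the noise distribution $P_n$ are assumptions taken from the usual skip-gram setup, so the exact notation of the displayed equations may differ.

```latex
\begin{align}
\log p(v_j \mid v_i) &= \log \frac{\exp\left(\mathbf{h}_j^{\top}\mathbf{h}_i\right)}
                                   {\sum_{v_k \in V} \exp\left(\mathbf{h}_k^{\top}\mathbf{h}_i\right)}, \\
\mathcal{L} &= -\sum_{v_i \in V} \sum_{v_j \in N(v_i)} \log p(v_j \mid v_i), \\
\mathcal{L}_{\mathrm{neg}} &= -\sum_{v_i \in V} \sum_{v_j \in N(v_i)}
    \Big[ \log \sigma\left(\mathbf{h}_j^{\top}\mathbf{h}_i\right)
    + \sum_{k=1}^{K} \mathbb{E}_{v_k \sim P_n} \log \sigma\left(-\mathbf{h}_k^{\top}\mathbf{h}_i\right) \Big].
\end{align}
```

Negative sampling replaces the normalization over all nodes in the first equation, which would be prohibitive on a network with billions of nodes, with $K$ sampled terms per observed pair.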
By minimizing the above loss, the learned node representations preserve the structural information of the whole network, which is beneficial for predicting users’ credit risk.
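A minimal single-machine sketch of a negative-sampling update on one (target, neighbor) pair is given below. It assumes a dot-product score and a single embedding per node; the function names, the learning rate, and the toy embeddings are illustrative choices and not the NetDP implementation.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pair_loss(emb, target, neighbor, negatives):
    """Negative-sampling loss for one observed (target, neighbor) pair."""
    loss = -math.log(sigmoid(dot(emb[target], emb[neighbor])))
    for neg in negatives:
        loss -= math.log(sigmoid(-dot(emb[target], emb[neg])))
    return loss


def sgd_step(emb, target, neighbor, negatives, lr=0.1):
    """One gradient step on pair_loss; the neighbor has label 1, negatives 0."""
    h_t = emb[target][:]  # snapshot so every gradient uses the old target vector
    grad_t = [0.0] * len(h_t)
    for node, label in [(neighbor, 1.0)] + [(n, 0.0) for n in negatives]:
        g = sigmoid(dot(h_t, emb[node])) - label  # d loss / d score
        for k in range(len(h_t)):
            grad_t[k] += g * emb[node][k]
            emb[node][k] -= lr * g * h_t[k]
    for k in range(len(h_t)):
        emb[target][k] -= lr * grad_t[k]


random.seed(0)
emb = {n: [random.uniform(-0.5, 0.5) for _ in range(4)] for n in "abcd"}
loss_before = pair_loss(emb, "a", "b", ["c", "d"])
for _ in range(50):
    sgd_step(emb, "a", "b", ["c", "d"])
loss_after = pair_loss(emb, "a", "b", ["c", "d"])
```

Repeating the step pulls the target toward its sampled neighbor and pushes it away from the negative samples, so the loss on this pair decreases.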
III-B2 The Supervised Method
Although the network representations generated by the unsupervised method can be used as input features of a classifier that recognizes whether a user will default or not, an end-to-end supervised network representation method is still needed for this problem. It is easy to observe that the credit risk of a user is reflected by the credit risk of his or her neighbors (Figure 1(a)), which means that local label information is beneficial to default prediction.
Based on this observation, the proposed supervised method represents a target user by aggregating the representations of his or her neighbors. That is, at the $k$-th step, the representation of a target node is a non-linear transformation of the average of its neighbors’ representations in step $k-1$:
where $\mathbf{W}$ is a trainable matrix. The prediction of the default probability is given based on the node representation.
where $\mathbf{w}$ is another trainable vector. Finally, the node representations, $\mathbf{W}$, and $\mathbf{w}$ are optimized to minimize the cross-entropy loss between the prediction and the ground truth of each labeled sample.
where $\Omega$ is the norm regularization of all trainable parameters. Unlike the representations generated by the unsupervised method, the supervised representations pay more attention to the local structural information of the labeled data. The predicted score is used in the ensemble module as one of the input features for MART training.
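The forward pass of such a neighbor-averaging model can be sketched as follows. This is a simplified illustration under stated assumptions: the non-linearity is taken to be the sigmoid, the matrix `W` is square so that dimensions are preserved, isolated nodes keep their previous representation, and the regularization term is omitted. None of these choices is confirmed by the text above.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def aggregate(reps, neighbors, W):
    """One step: new h_i = sigmoid(W @ mean of the neighbors' representations)."""
    new_reps = {}
    for node, nbrs in neighbors.items():
        if not nbrs:  # assumption: isolated nodes keep their representation
            new_reps[node] = reps[node][:]
            continue
        dim = len(reps[node])
        mean = [sum(reps[n][k] for n in nbrs) / len(nbrs) for k in range(dim)]
        new_reps[node] = [
            sigmoid(sum(W[r][k] * mean[k] for k in range(dim))) for r in range(len(W))
        ]
    return new_reps


def predict(rep, w):
    """Predicted default probability from a node representation."""
    return sigmoid(sum(a * b for a, b in zip(w, rep)))


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy for one labeled node (regularization omitted)."""
    return -(y_true * math.log(y_pred + eps) + (1 - y_true) * math.log(1 - y_pred + eps))


# Toy graph with 2-dimensional representations and an identity matrix for W.
reps = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
neighbors = {"a": ["b", "c"], "b": ["a"], "c": []}
W = [[1.0, 0.0], [0.0, 1.0]]
step1 = aggregate(reps, neighbors, W)
score_a = predict(step1["a"], [0.0, 0.0])
loss_a = cross_entropy(1, score_a)
```

In training, the gradients of this loss with respect to the trainable parameters would be computed on the labeled nodes only, which is why the module focuses on the local neighborhoods of labeled users.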
III-C Distributed Implementations
In order to learn node representations efficiently in such an industrial-scale social network, we implement and deploy the proposed NetDP on the KunPeng platform [8]. KunPeng is a high-performance distributed machine learning platform that provides a parameter-server-based API for implementing parallel machine learning algorithms that learn from industrial-scale data.
In the proposed NetDP, to optimize the representation of a target node, we only need to look up the representations of its direct neighbors. To support this, the social network is first organized as adjacency lists (i.e., each record stores one target node together with all of its neighbors). The whole set of adjacency lists is then divided into several parts and stored in the memory of several machines. Hence, the neighbor lookup for a given node happens on a single machine, which shortens the communication time between machines and makes the training procedure efficient.
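The adjacency-list layout and its partitioning can be sketched as follows. The sketch assumes integer node ids, treats the targets of outgoing edges as a node's neighbors, and places each record by a simple modulo rule; the function names and the placement rule are illustrative and do not describe the KunPeng data layout.

```python
from collections import defaultdict


def build_adjacency_lists(edges):
    """Group directed edges (src, dst) into one record per source node."""
    adj = defaultdict(list)
    for src, dst in edges:
        adj[src].append(dst)
    return dict(adj)


def partition(adj, num_machines):
    """Assign each adjacency record to exactly one machine (modulo placement)."""
    shards = [dict() for _ in range(num_machines)]
    for node, nbrs in adj.items():
        shards[node % num_machines][node] = nbrs
    return shards


def lookup_neighbors(shards, node):
    """Neighbor lookup touches a single shard, so no cross-machine hop is needed."""
    return shards[node % len(shards)].get(node, [])


toy_edges = [(0, 1), (0, 2), (1, 2), (3, 0)]
adj = build_adjacency_lists(toy_edges)
shards = partition(adj, num_machines=2)
```

Because a record holds the complete neighbor list of its target node, a worker that owns the record can sample neighbors without asking any other machine for graph structure.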
Moreover, we implement parallel mini-batch Stochastic Gradient Descent (SGD) to accelerate the training procedure. Mini-batch SGD solves the optimization problem iteratively. At each step, each worker randomly selects a mini-batch of nodes and retrieves their neighbors. It then computes the gradients of the objective function with respect to the trainable parameters and updates the parameters. The training procedure continues until convergence or until the maximum number of epochs is reached. Algorithm 1 shows the distributed implementation of unsupervised network representation in NetDP. The implementation of the supervised method is similar to Algorithm 1.
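The training loop described above can be sketched in a few lines. This is a sequential simulation under stated assumptions, not Algorithm 1 and not the KunPeng API: a plain dictionary stands in for the parameter server, workers are visited one after another instead of running in parallel, negatives are drawn uniformly from all nodes, and all names and hyperparameters are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def batch_gradients(params, shard, batch, all_nodes, num_neg, rng):
    """Accumulate negative-sampling gradients for a mini-batch of target nodes."""
    grads = {}

    def add(node, vec, scale):
        g = grads.setdefault(node, [0.0] * len(vec))
        for k, v in enumerate(vec):
            g[k] += scale * v

    for target in batch:
        pos = rng.choice(shard[target])  # sample one direct neighbor (local lookup)
        # negatives are uniform over all nodes and may coincide with true neighbors
        samples = [(pos, 1.0)] + [(rng.choice(all_nodes), 0.0) for _ in range(num_neg)]
        for node, label in samples:
            g = sigmoid(dot(params[target], params[node])) - label
            add(target, params[node], g)
            add(node, params[target], g)
    return grads


def train(shards, dim=4, epochs=20, batch_size=2, num_neg=2, lr=0.1, seed=0):
    rng = random.Random(seed)
    all_nodes = sorted({n for s in shards for t, nbrs in s.items() for n in [t] + nbrs})
    # "server" state: one embedding per node
    params = {n: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for n in all_nodes}
    for _ in range(epochs):
        for shard in shards:  # each shard plays the role of one worker
            targets = [t for t in shard if shard[t]]
            batch = rng.sample(targets, min(batch_size, len(targets)))
            grads = batch_gradients(params, shard, batch, all_nodes, num_neg, rng)
            for node, g in grads.items():  # "push": apply the update on the server
                params[node] = [p - lr * gk for p, gk in zip(params[node], g)]
    return params


toy_shards = [{0: [1, 2], 2: [0]}, {1: [2], 3: [0]}]
params = train(toy_shards)
```

In a real parameter-server deployment, the pull of the embeddings and the push of the gradients would be network calls, and workers would proceed concurrently on their own shards.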
Table I: Statistics of the social network (columns: # of users, # of relations, avg. in-degree, avg. out-degree).
Table II: Statistics of the labeled data (columns: # of users, # of default users, the first month, the last month).
In this section, we demonstrate the effectiveness of NetDP in solving the default prediction problem, especially the cold-start problem, as well as its efficiency in handling industrial datasets.
IV-A Experimental Settings
The dataset consists of two parts: the social network and the labeled data. Table I shows the statistics of the social network. Around 1.04 billion users and 58.5 billion relations between them are involved in the social network. To the best of our knowledge, this could be one of the largest social networks mentioned in recent literature, and none of the state-of-the-art network representation methods has reported results on a social network of similar scale.
Table II shows the statistics of the labeled data. We drop many of the users who have never defaulted to keep the default rate near a target value. The labeled data are collected from March 2017 to November 2017. We divide them into two parts: the training set from March 2017 to July 2017, and the test set from August 2017 to November 2017. We train NetDP on the training set and report the performance on the test set as results.
Three results are reported as follows:
NetDP: the output of the ensemble module (MART) of NetDP.
BenchDP: the output of a conventional default prediction model based on credit-related features.
NetDP+BenchDP: a weighted average of the outputs of NetDP and BenchDP. The weights of the two models are adjusted on the training set.
We use the Kolmogorov-Smirnov (KS) statistic as the metric of default prediction performance. A higher KS value means better performance.
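The KS statistic of a scoring model can be computed as the largest gap between the cumulative score distributions of defaulters and non-defaulters. The sketch below is a minimal pure-Python version; the function name `ks_statistic` and the toy inputs are chosen for illustration.

```python
def ks_statistic(labels, scores):
    """Maximum gap between the cumulative score distributions of the two classes.

    labels: list of 0/1 (1 = defaulted); scores: predicted default probabilities.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes are required")
    pairs = sorted(zip(scores, labels))
    cum_pos = cum_neg = 0
    best = 0.0
    i = 0
    while i < len(pairs):
        j = i
        # consume all samples that share the same score before comparing
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            cum_pos += pairs[j][1]
            cum_neg += 1 - pairs[j][1]
            j += 1
        best = max(best, abs(cum_pos / n_pos - cum_neg / n_neg))
        i = j
    return best


ks_separated = ks_statistic([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
ks_interleaved = ks_statistic([0, 1, 0, 1], [0.1, 0.2, 0.3, 0.4])
ks_constant = ks_statistic([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5])
```

A model that perfectly separates the two classes reaches a KS of 1, while a constant score yields 0; the interleaved toy case above gives 0.5.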
IV-B Experimental Results
As shown in Figure 3(a), BenchDP outperforms NetDP. This is because the conventional model is based on well-designed credit features, while NetDP only models the network structural information without any features. However, a simple weighted average (i.e., NetDP+BenchDP) achieves a significant improvement over BenchDP, which means that the structural information is beneficial to conventional models. It also illustrates that the proposed NetDP is able to capture effective structural information in an industrial-scale social network.
Moreover, to demonstrate the ability of NetDP to alleviate the cold-start problem, we evaluate the KS performance of the different methods on different types of user groups (Figure 6). In Figures 3(b) and 3(c), BenchDP still outperforms NetDP thanks to sufficiently rich credit features, but the gap is smaller for inactive users than for active users. However, the situation is different in Figure 3(d). For new users, NetDP outperforms BenchDP, and the weighted average of NetDP and BenchDP achieves a greater improvement than it does for active or inactive users. These experimental results demonstrate the effectiveness of network representation methods for predicting defaults of users who have less credit information.
We have implemented and deployed NetDP in the data center of AlibabaCloud. The unsupervised network representation module uses around 1000 CPU cores to model the network described above and finishes training in around 5 hours and 40 minutes. The supervised module requires 20 CPU cores and finishes training in 1 hour. In summary, the proposed NetDP is quite efficient in modeling industrial-scale social networks.
V Related Work
Support Vector Machine (SVM) and neural networks are introduced as promising data mining tools, which provide an alternative to statistical techniques in building default prediction models [3, 1, 4].
Recently, network representation models have played an increasingly important role in encoding an existing network into a low-rank representation space to facilitate network structure analysis. Our work is mainly related to structure-based methods. DeepWalk [5] first deployed truncated random walks on networks to generate node sequences, and then leveraged the skip-gram model to learn node representations. Node2vec [2] designed a biased random walk to balance breadth-first sampling and depth-first sampling. LINE [6] designed objective functions to preserve the first-order and second-order proximity. Although these methods are able to scale up to large datasets on a single machine, it is still necessary to propose an efficient distributed network representation method that is able to handle industrial-scale networks.
This paper aims to improve the prediction of loan default in Ant Credit Pay. In order to overcome the scalability and cold-start challenges, we propose to incorporate social network information into default prediction, and present NetDP, an industrial-scale distributed network representation framework, to learn the structural information of users in a very large social network. The proposal models global structural information with an unsupervised network representation method and local structural information with a supervised network representation method, and blends the outputs with MART for the final prediction. We also present a distributed implementation of NetDP based on the KunPeng platform, which is able to learn representations from a social network with billions of users and tens of billions of relations. The experimental results show that, with the help of NetDP, a conventional default prediction model can achieve better performance, especially for cold-start users.
- [1] (2016) Predicting creditworthiness in retail banking with limited scoring data. Knowledge-Based Systems 103, pp. 89–103.
- [2] (2016) Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864.
- [3] (2015) Benchmarking state-of-the-art classification algorithms for credit scoring: an update of research. European Journal of Operational Research 247 (1), pp. 124–136.
- [4] (2016) Classification methods applied to credit scoring: systematic review and overall comparison. Surveys in Operations Research and Management Science 21 (2), pp. 117–134.
- [5] (2014) DeepWalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 701–710.
- [6] (2015) LINE: large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pp. 1067–1077.
- [7] (2017) Credit scoring and its applications. Vol. 2, SIAM.
- [8] (2017) KunPeng: parameter server based distributed learning systems and its applications in Alibaba and Ant Financial. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1693–1702.