1 Introduction
Network-guaranteed loans (also known as guarantee circles) are a widespread economic phenomenon in Asian countries, and they are attracting increasing attention from banks, financial regulatory authorities, and governments. In order to obtain loans from banks, groups of small and medium enterprises (SMEs) back each other to enhance their financial security. As more and more enterprises become involved, they form complex directed-network structures [meng2017netrating]. Figure 1 illustrates a guaranteed-loan network consisting of around enterprises and guarantee relations, where a node represents a small or medium enterprise and a directed edge indicates that one enterprise guarantees another enterprise.
The existing loan decision-making mechanism in the financial industry falls behind the demand for loans from businesses. Most of the criteria are designed for independent major players, while, in practice, small and medium enterprises may provide inaccurate or manipulated data or induce intertwined risk factors [jian2012determinants]. Thousands of guaranteed-loan networks of different complexities have coexisted for a long period and have evolved over time. This requires an adaptive strategy in order to prevent, identify, and dismantle systematic crises.
Challenges. Against the complex background of the growth period, the structural adjustment of the pain period, and the early stage of the stimulus period, structural and deep-level contradictions have emerged in the economic development system. Many kinds of risk factors have emerged throughout the guaranteed-loan network that might accelerate the transmission and amplification of risk, and the guarantee network may degenerate from a “mutual aid group” into a “breach of contract” group. An appropriate guarantee union may reduce the default risk, but significant contagious damage throughout the networked enterprises may still occur in practice [mcmahon2014loan]. The guaranteed loan is a debt obligation promise; if one corporation gets trapped in risks, it may spread the contagion to other corporations in the network. When defaults diffuse across the network, a systemic financial crisis may occur. Thus, it is critical to consider the contagion damage in guaranteed-loan networks. Moreover, it is desirable to quickly identify the most vulnerable nodes (i.e., enterprises with high default risks) so that banks or financial regulatory authorities can pay extra attention to them for the purpose of financial risk management, which is more urgent than ever given the current worldwide economic slowdown.
In the literature, some advanced approaches (e.g., [DBLP:conf/ijcai/ChengTMN019]) have been proposed to predict the default risk of an enterprise in a guaranteed-loan network, in which the profile of the node (i.e., enterprise), as well as its context, is carefully considered. For instance, a high-order, graph-attention-based network representation method has been designed in [DBLP:conf/ijcai/ChengTMN019] to infer the possibility of loan default events. These approaches do consider the structure of the guaranteed-loan network. However, they cannot properly capture the uncertain nature of the contagion behavior of the networks.
Our Approach. In this paper, we introduce VulnDS: a top-k vulnerable SME detection system for large-scale financial networks, which is deployed in our collaborating bank. In particular, we model the risks of the guaranteed-loan network with a probabilistic graph model and infer the default probability of a node following the possible world semantics [DBLP:conf/sigmod/AbiteboulKG87], which has been widely used to capture the contagion of the network in practice. As shown in Figure 2, we use a probabilistic guaranteed-loan network with two types of probabilities to model the occurrence and propagation of the default risks in the guaranteed-loan network. Specifically, for each enterprise node, the self-risk probability denotes its learned default probability without considering the contagion damage. For each guarantee relation, the diffusion probability denotes the likelihood that one enterprise defaults in case of the other enterprise's default. Note that we can obtain the self-risk probabilities and diffusion probabilities based on the existing works (e.g., [cheng2018prediction, DBLP:conf/ijcai/ChengTMN019]).
Figures 2(a) and (e) illustrate the structure of a toy probabilistic guaranteed-loan network with enterprises and guarantee relations, as well as the associated self-risk probabilities and diffusion probabilities. Given the probabilistic graph, we may derive the default probability of a node following the possible world semantics, where each possible world (i.e., instance graph in this paper) corresponds to a subgraph (i.e., possible occurrence) of the probabilistic graph. Figures 2(b)-(d) show three example possible worlds (instance graphs). In each possible world, a node (i.e., enterprise) exists if it defaults, and an edge appears if the default of one endpoint indeed leads to the default of the other. A node may default because of itself, which is represented by a shaded node, as shown in Figure 2(b), or because of the contagion damage initiated by other nodes, as shown in Figures 2(c)-(d). In Section 2, we introduce how to derive default probabilities of the nodes (i.e., SMEs) given the probabilistic guaranteed-loan network.
In this paper, we show that the problem of calculating the default probability of a node alone is already #P-hard, let alone computing the top-k vulnerable nodes. A straightforward solution for the top-k vulnerable nodes computation is to enumerate all possible worlds and then aggregate the computation results in each possible world. However, this is computationally prohibitive, as the number of possible worlds of a probabilistic guaranteed-loan network may be up to 2^(n+m), where n and m are the numbers of nodes and edges in the graph, respectively. In this paper, we first show that we can identify the top-k nodes with a limited number of sampled instance graphs, with tight theoretical guarantees. To reduce the sample size required and speed up the computation, a pruning strategy based on lower/upper bounds and a reverse sampling method are developed. In addition, to further accelerate the computation, a bottom-k sketch-based method is proposed.
Contribution. The principal contributions of this paper are summarized as follows.

We advocate the problem of top-k vulnerable nodes detection in guaranteed-loan networks, which is essential in financial risk management. We propose a probabilistic guaranteed-loan network model to properly capture the contagion damage among the networked-guaranteed loans.

We develop lower and upper bound techniques to prune the search space effectively. An advanced sampling method is designed to speed up the computation, supported by rigorous theoretical analysis.

To further accelerate the search, a bottom-k sketch-based approach is proposed, which can greatly speed up the computation and return a comparable result.

We conduct extensive experiments to evaluate the efficiency and scalability of our proposed algorithms on financial datasets and benchmark networks. The evaluation results show that the proposed method can achieve over x speedup compared with the baseline method. In addition, through experiments on a real-life financial dataset, we verify that our proposed model can significantly improve the prediction accuracy, thanks to the probabilistic guaranteed-loan network model.

The proposed techniques are integrated into our current loan risk control system, which further demonstrates the effectiveness and efficiency of our proposed methods.
Roadmap. The rest of the paper is organized as follows. Section 2 introduces the preliminaries and the problem definition. Section 3 presents the basic sampling-based algorithm and our optimized algorithms. The system deployment details are presented in Section 4. We report the experimental results in Section 5 and case studies in Section 6. Section 7 surveys the related work. The conclusion and future work are described in Section 8.
2 Preliminaries
In this section, we first introduce the business procedure for the construction of the guaranteed-loan network. Then we present some key concepts in the context of our methods as well as a formal description of our top-k vulnerable nodes detection problem for networked-guarantee loans, followed by the introduction of related techniques.
2.1 Business Procedure
In order to obtain a loan, a borrower needs to open an account and provide detailed information to the bank. Banks assess the loan application by rule checking and grant evaluation. Normally, the bank may be reluctant to issue the loan to SMEs, as it is difficult for small businesses to meet the bank’s lending criteria, which are designed for large-scale companies. There is something of a blank area in setting the criteria for SMEs due to their lack of security. However, they are permitted to offer other corporations as an endorsement. Usually, banks need to collect as much fine-grained information as possible to make the decision, including transaction information, customer information, asset information such as mortgage status, and history of loan approval.
As Figure 3 shows, if one or multiple guarantors back the loan, the bank then processes the application in a pre-loan risk assessment system and issues the loan to the borrower if it passes. Afterwards, the borrower is expected to repay the interest and principal (or part of it) regularly according to the loan contract. If the borrower fails to repay, its guarantors are obligated to pay the remainder of the loan, which is illustrated by the dashed line. The bank regularly conducts a post-loan risk assessment for all issued loans. In this procedure, our proposed VulnDS plays a vital role in the risk control of loans and is applied in both the pre-loan assessment and the post-loan control. The detected risky loans are flagged and escalated to the responsible account managers and risk managers so that appropriate measures can be taken.
2.2 Bottom-k Sketch
In this section, we briefly introduce the bottom-k sketch [cohen2007summarizing, beyer2007synopses], which is used in our BSRBK framework to obtain the statistics information for early stopping. The bottom-k sketch is designed for estimating the number of distinct values in a multiset. Given a multiset and a truly random hash function , each distinct value in the set is hashed to and for . The bottom-k sketch consists of the smallest hash values, i.e., , where is the th smallest hash value. So the number of distinct values can be estimated with . The estimation converges fast as increases, where the expected relative error is and the coefficient of variation is no more than . To distinguish it from the k in top-k, hereafter in this paper, we use to denote the parameter in the bottom-k sketch.
2.3 Key Concept
Definition 1.
Guaranteed-loan Network. A guaranteed-loan network (GN) is defined as a directed graph , where each node is a small or medium enterprise (SME). For each edge, the direction is from the warrantor to the borrower.
Definition 2.
Self-risk Probability. Given a guaranteed-loan network , we define the default probability of a node , which is caused by its own factors, as the self-risk probability .
Definition 3.
Diffusion Probability. Given a guaranteed-loan network , if a node provides a warrant to another node , has the obligation to repay the loans in case defaults. Thus, the diffusion probability of caused by is defined as :
In this paper, we assume the self-risk probabilities and diffusion probabilities are readily available. Readers interested in how to derive these probabilities may refer to our previous studies [DBLP:conf/ijcai/ChengTMN019, cheng2018prediction]. Then we have the definition of a probabilistic guaranteed-loan network as follows.
Definition 4.
Probabilistic Guaranteed-loan Network. A probabilistic guaranteed-loan network is a guaranteed-loan network equipped with a self-risk probability for each node and a diffusion probability for each edge, denoted by .
For simplicity, when there is no ambiguity, we use network, guaranteed-loan network, and probabilistic guaranteed-loan network interchangeably. In this paper, we derive the default probability of a node by considering both the self-risk probability and the diffusion probability, which is formally defined as follows.
Definition 5.
Default Probability. Given a network , for each node , its default probability, denoted by , is obtained by considering both the self-risk probability and the diffusion probability in . can be computed as follows.
(1) 
where is the collection of nodes that are guaranteed by , i.e., the in-neighbors of in the network.
It is easy to verify that the equation above is equivalent to aggregating the probability over all the possible worlds, i.e.,
where is the set of all possible worlds, is the probability of a possible world, and is an indicator function that indicates whether defaults in or not.
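To make the possible world semantics concrete, the following Python sketch enumerates every possible world of a tiny network and sums the probabilities of the worlds in which each node defaults. It is illustrative only: the function name `exact_default_probabilities`, the dictionary-based graph encoding, and the convention that an edge `(src, dst)` points in the direction of contagion are assumptions of this sketch, not notation from the paper. The enumeration is exponential in the number of nodes and edges, so it is usable only on toy inputs.

```python
from itertools import product


def exact_default_probabilities(self_risk, diffusion):
    """Brute-force default probabilities by enumerating every possible world.

    self_risk: dict node -> probability that the node defaults by itself.
    diffusion: dict (src, dst) -> probability that a default of src
               propagates to dst (edge orientation is an assumption here).
    """
    nodes = list(self_risk)
    edges = list(diffusion)
    result = {v: 0.0 for v in nodes}
    # One bit per node (self-default or not) and one bit per edge (live or not).
    for node_bits in product([False, True], repeat=len(nodes)):
        for edge_bits in product([False, True], repeat=len(edges)):
            prob = 1.0
            for v, bit in zip(nodes, node_bits):
                prob *= self_risk[v] if bit else 1.0 - self_risk[v]
            for e, bit in zip(edges, edge_bits):
                prob *= diffusion[e] if bit else 1.0 - diffusion[e]
            if prob == 0.0:
                continue  # impossible world, contributes nothing
            # Propagate defaults along live edges until nothing changes.
            defaulted = {v for v, bit in zip(nodes, node_bits) if bit}
            live = [e for e, bit in zip(edges, edge_bits) if bit]
            changed = True
            while changed:
                changed = False
                for src, dst in live:
                    if src in defaulted and dst not in defaulted:
                        defaulted.add(dst)
                        changed = True
            # Indicator function: add the world's probability to defaulted nodes.
            for v in defaulted:
                result[v] += prob
    return result
```

For example, with two nodes where only the first can default by itself (probability 0.5) and the single edge has diffusion probability 0.5, the second node defaults with probability 0.25.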
2.4 Problem Definition
In this work, we aim to identify the top-k vulnerable nodes, which are at high risk of loan default. While there are many classification-based methods that can be used to predict default probability (e.g., [DBLP:conf/ijcai/ChengTMN019], etc.), they often need to retrain the model to fit distribution shift or suffer from efficiency issues. The problem is essential in real scenarios, because we observe that the risk of guaranteed loans changes frequently and over 60% of them remain unsupervised once the loan is issued. Thus, it is desirable that the top-k vulnerable SMEs can be accurately and efficiently located.
Input. The input is a probabilistic guaranteed-loan network .
Output. The output of our method is the identified top-k vulnerable nodes with , which may have delinquent loans in the next time window. In our scenario, we have the first-hand real defaulted nodes in the loan record system.
Objective. We have two objectives: 1) the method should be efficient, and 2) it should maintain competitive accuracy.
Problem Hardness. According to Theorem 1, it is #P-hard to compute the default probability .
Theorem 1.
It is #P-hard to compute the default probability.
Proof.
We show the hardness of the problem by considering a simple case, where the self-risk probability equals 1 for node , and equals 0 for . Therefore, for the node , the default probability is only caused by the default of node . Then the default probability of equals the reliability from to , which is #P-hard to compute [DBLP:journals/pvldb/KeKQ19]. Thus, it is #P-hard to compute the default probability. The theorem is correct. ∎
3 Our Approaches
3.1 Basic Sampling Approach
Due to the hardness of computing the default probability, in this section, we propose a sampling-based method. A rigorous theoretical analysis of the required sample size is conducted in order to bound the worst-case performance.
3.1.1 Sampling Framework
To compute the default probability, we can enumerate all the possible worlds and aggregate the results. However, the possible world space is usually large. Sampling-based methods are widely adopted for this case. That is, we randomly select a set of possible worlds and take the average value as the estimated default probability. By carefully choosing the sample size, we can return a result with a performance guarantee.
Algorithm 1 shows the details of the basic sampling-based method. The input is a given network, where each node/edge is associated with a self-risk/diffusion probability. In each iteration, we generate a random number for each node to determine if it defaults by itself or not (Lines 4-7). Then we conduct a breadth-first search from these nodes, i.e., , to locate the nodes that will be influenced by them in the current simulation. For each encountered edge, we generate a random number to decide if the propagation will continue or not. For each node, the number of default times is accumulated in Lines 21. The final default probability is calculated by taking an average over the accumulated value . Finally, the algorithm returns the top-k results with the largest estimated values.
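The sampling loop described above can be sketched in Python as follows. This is a minimal illustration rather than Algorithm 1 itself: the function name `forward_sampling_top_k`, the dictionary encoding of the network, the `seed` parameter, and the convention that an edge `(src, dst)` points in the direction of contagion are all assumptions made for the example.

```python
import random
from collections import deque


def forward_sampling_top_k(self_risk, diffusion, k, num_samples, seed=0):
    """Estimate default probabilities by forward sampling and return the top-k.

    self_risk: dict node -> self-risk probability.
    diffusion: dict (src, dst) -> diffusion probability (orientation assumed).
    """
    rng = random.Random(seed)
    out_edges = {v: [] for v in self_risk}
    for (src, dst), p in diffusion.items():
        out_edges[src].append((dst, p))

    counts = {v: 0 for v in self_risk}
    for _ in range(num_samples):
        # Step 1: decide which nodes default by themselves in this sample.
        seeds = [v for v, p in self_risk.items() if rng.random() < p]
        defaulted = set(seeds)
        # Step 2: breadth-first search from the self-defaulted nodes.
        # Each node is dequeued once, so each edge is tested at most once.
        queue = deque(seeds)
        while queue:
            u = queue.popleft()
            for v, p in out_edges[u]:
                if v not in defaulted and rng.random() < p:
                    defaulted.add(v)
                    queue.append(v)
        # Step 3: accumulate the number of samples in which each node defaults.
        for v in defaulted:
            counts[v] += 1

    estimates = {v: c / num_samples for v, c in counts.items()}
    top = sorted(estimates, key=estimates.get, reverse=True)[:k]
    return top, estimates
```

The returned `estimates` are plain sample averages, so their quality depends directly on `num_samples`, which is the subject of the sample size analysis that follows.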
3.1.2 Sample Size Analysis
For a sampling-based method, a critical problem is to determine the sample size needed. In order to bound the quality of the returned result, in this section, we conduct a rigorous theoretical analysis of the required sample size. Specifically, we say an approximation algorithm is a approximation if the following conditions hold.
Definition 6 (approximation).
Given an approximation algorithm for the top problem studied, let be the set of nodes returned by , and be default probability of the rank th node. Given , we say is a approximation if fulfills the following conditions with at least probability.

For , ;

For , .
If an algorithm is approximation, it means that the default probabilities of returned nodes are at least . For the nodes not in , their default probabilities are at most with high probability. To derive the sample size required, we use the following inequality.
Theorem 2 (Hoeffding Inequality).
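For reference, a standard textbook form of the Hoeffding inequality for bounded independent random variables is given below. The symbols $t$ (number of samples), $X_i$, and $\varepsilon$ are notational assumptions made here and may differ from the notation used elsewhere in this paper.

```latex
\Pr\left[\,\left|\bar{X} - \mathbb{E}[\bar{X}]\right| \ge \varepsilon\,\right]
  \le 2\exp\left(-2 t \varepsilon^{2}\right),
\qquad
\bar{X} = \frac{1}{t}\sum_{i=1}^{t} X_i,
\quad X_1,\dots,X_t \in [0,1] \text{ independent}.
```

In the sampling setting, each $X_i$ can be read as the indicator of whether a given node defaults in the $i$-th sampled possible world, so $\bar{X}$ is the estimated default probability.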
Based on the Hoeffding inequality, we have the following theorem.
Theorem 3.
Given the sample size , and two nodes , if , then
Proof.
We have
The last 2 steps take as the estimator of and . Then we can feed into the Hoeffding inequality and obtain the result. ∎
Theorem 4.
Algorithm 1 is approximation if the sample size is no less than
(3) 
where is the number of nodes, i.e., .
Proof.
Suppose we sort the nodes based on their real default probabilities, i.e., . Then we show the two conditions in approximation hold if we have with for . 1) We must have for hold if , which implies that the selected nodes must fulfill the first condition. 2) For , if does not belong to the top-k result, the second condition holds naturally. Otherwise, there must be a node that does not belong to the top-k result being selected into . If , it means . Therefore, should also be selected into the top-k, which contradicts the assumption. Thus, the second condition holds.
As we can see, we need to bound the order of pairs of nodes in order to fulfill the approximation requirement, i.e.,
Therefore, the theorem is correct. ∎
3.2 Optimized Sampling Approach
Based on Theorem 4, Algorithm 1 can return a result with a tight theoretical guarantee. However, it still suffers from some drawbacks, which make it hard to scale to large networks. First, to bound the quality of returned results, we need to bound the order of node pairs. The number of nodes is usually large in real networks. Therefore, if we can reduce the size of (i.e., reduce candidate space) and (i.e., verify some nodes without estimation), then the sample size can be reduced significantly. Second, in each sampled possible world, we only need to determine if the candidate node can be influenced, i.e., compute . If the candidate space is greatly reduced, the previous sampling method may explore a lot of unnecessary space.
Following this intuition, in this section, we first derive the lower and upper bounds of the default probability to reduce the candidate space. In addition, a reverse sampling framework is proposed in order to reduce the search cost.
3.2.1 Candidate Reduction
To compute the lower and upper bounds of the default probability, we utilize the equation in the default probability definition, i.e., Equation 1. The idea is that the default probability for each node is in [0, 1] if no further information is given. By treating each node’s default probability as 0 and 1, we can aggregate the probability over its neighbors to shrink the interval based on Equation 1. Then, with the newly derived lower and upper bounds for neighbor nodes, we can further aggregate the information and update the bounds. The details of deriving lower and upper bounds are shown in Algorithms 2 and 3. The algorithm iteratively uses the lower and upper bounds derived in the previous iteration as the current default probability. It is easy to verify that a larger order results in tighter bounds. Users can make a trade-off between efficiency and bound tightness.
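The iterative tightening can be sketched as below. This is not Algorithm 2 or 3: it assumes that Equation 1 has the common noisy-or form, in which a node survives only if it does not default by itself and no in-neighbor's default propagates to it, i.e. p(v) = 1 - (1 - self(v)) * prod over in-neighbors u of (1 - diff(u, v) * p(u)). That form, the function name `iterate_bounds`, and the edge orientation `(src, dst)` are assumptions of this sketch, and the resulting intervals are meaningful only to the extent that the assumed equation holds for the network.

```python
def iterate_bounds(self_risk, diffusion, order):
    """Tighten [lower, upper] intervals by applying the assumed Equation 1.

    self_risk: dict node -> self-risk probability.
    diffusion: dict (src, dst) -> diffusion probability (orientation assumed).
    order:     number of refinement rounds (the "order" of the bounds).
    """
    in_edges = {v: [] for v in self_risk}
    for (src, dst), p in diffusion.items():
        in_edges[dst].append((src, p))

    def step(current):
        # One application of the assumed noisy-or update to every node.
        updated = {}
        for v, p_self in self_risk.items():
            survive = 1.0 - p_self
            for u, p_edge in in_edges[v]:
                survive *= 1.0 - p_edge * current[u]
            updated[v] = 1.0 - survive
        return updated

    # With no information, every default probability lies in [0, 1].
    lower = {v: 0.0 for v in self_risk}
    upper = {v: 1.0 for v in self_risk}
    for _ in range(order):
        lower, upper = step(lower), step(upper)
    return lower, upper
```

Because the update is monotone in the neighbors' probabilities, starting from 0 and 1 keeps `lower[v] <= upper[v]` in every round, and each extra round costs one pass over the edges.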
Given the lower bound and upper bound derived, we can filter some unpromising candidates and verify some candidates with large probability. Lemma 1 shows the pruning rules to verify and filter the candidate space.
Lemma 1.
Given the upper and lower bounds derived for each node, let and be the th largest value in and , respectively. Then, we have

For , must be in the top if ;

For , must not be in the top if ;
Proof.
For the first case, suppose there is a node with that is not selected in the top-k results. It means a node should have a default probability of at least to be selected into the top-k result. Because is the k-th largest value in , there are no more than nodes that satisfy the condition. Therefore, the first case holds. For the second case, since is the k-th largest value of , must be at least . Therefore, the second case holds. ∎
Based on Lemma 1, Algorithm 4 shows the details of reducing candidate space. The algorithm takes the derived lower and upper bounds as input and outputs the candidate nodes and the number of verified nodes. Note that, if we can verify nodes based on the first pruning rule, then we only need to find top nodes from the candidate . In this case, we reduce both the value and of Equation 3 to and , respectively.
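One way to read the two pruning rules, consistent with the proof of Lemma 1, is sketched below: a node is verified when its lower bound exceeds the k-th largest upper bound, and it is filtered out when its upper bound falls below the k-th largest lower bound. The function name `reduce_candidates`, the use of strict inequalities for ties, and the requirement `1 <= k <= number of nodes` are assumptions of this sketch rather than details taken from Algorithm 4.

```python
def reduce_candidates(lower, upper, k):
    """Split nodes into verified top-k members and remaining candidates.

    lower, upper: dict node -> lower / upper bound of the default probability.
    Returns (verified, candidates); pruned nodes appear in neither list.
    """
    nodes = list(lower)
    kth_lower = sorted(lower.values(), reverse=True)[k - 1]
    kth_upper = sorted(upper.values(), reverse=True)[k - 1]

    # Rule 1: fewer than k nodes can possibly beat a node whose lower bound
    # is above the k-th largest upper bound, so it must be in the top-k.
    verified = [v for v in nodes if lower[v] > kth_upper]

    # Rule 2: at least k nodes certainly beat a node whose upper bound is
    # below the k-th largest lower bound, so it can be dropped.
    candidates = [
        v for v in nodes
        if lower[v] <= kth_upper and upper[v] >= kth_lower
    ]
    return verified, candidates
```

After this step, only `k - len(verified)` nodes still need to be selected from `candidates` by sampling, which is what shrinks the required sample size.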
3.2.2 Reverse Sampling Approach
Based on Algorithm 4, we can greatly reduce the candidate space, whose performance is verified in our experiments on real-world datasets. The basic sampling method aims to estimate the default probability for each node. Here, we only need to compute the probability for the candidate nodes. In particular, when the candidate size is small, the previous sampling method will explore a lot of unnecessary space. Intuitively, given a sampled possible world, for each candidate node, we only need to verify if it can be reached by a node with . Therefore, we can conduct a reverse traversal from the candidate nodes to see if it can meet the criteria. The details are shown in Algorithm 5, where is the graph obtained by reversing the direction of each edge in .
The input is the graph and candidate . After traversing a sample, it returns for each node in . At first, we set for all the nodes. Then we conduct a breadth-first search from each node in the candidate set. For each encountered node and edge, we mark it as checked and store the corresponding information (e.g., survived and ) in order to avoid generating random numbers for the same node multiple times. The BFS terminates if it encounters a node with or there are no more nodes to be explored (Lines 5-8). If it encounters a node with , it means the candidate node is influenced, and vice versa. In this way, we can greatly reduce the computation cost by filtering out unnecessary search space.
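The reverse traversal described above can be sketched as follows. It is an illustration under stated assumptions, not Algorithm 5: the names `reverse_sample_once` and `reverse_sampling_estimates`, the memo dictionaries `node_coin` and `edge_coin`, and the edge orientation `(src, dst)` (direction of contagion) are hypothetical. Random decisions for nodes and edges are drawn lazily and cached, so every candidate sees the same possible world within one sample.

```python
import random
from collections import deque


def reverse_sample_once(self_risk, in_edges, candidates, rng):
    """Return the candidates that default in one lazily sampled possible world."""
    node_coin = {}  # node -> True if the node defaults by itself in this world
    edge_coin = {}  # (src, dst) -> True if the edge propagates in this world

    def self_defaults(v):
        if v not in node_coin:
            node_coin[v] = rng.random() < self_risk[v]
        return node_coin[v]

    def edge_live(src, dst, p):
        if (src, dst) not in edge_coin:
            edge_coin[(src, dst)] = rng.random() < p
        return edge_coin[(src, dst)]

    hit = set()
    for c in candidates:
        # Breadth-first search backwards from the candidate over live edges.
        visited = {c}
        queue = deque([c])
        while queue:
            v = queue.popleft()
            if self_defaults(v):
                hit.add(c)  # a self-defaulted node reaches c: stop early
                break
            for src, p in in_edges.get(v, []):
                if src not in visited and edge_live(src, v, p):
                    visited.add(src)
                    queue.append(src)
    return hit


def reverse_sampling_estimates(self_risk, diffusion, candidates, num_samples, seed=0):
    """Estimate default probabilities of the candidates only."""
    rng = random.Random(seed)
    in_edges = {}
    for (src, dst), p in diffusion.items():
        in_edges.setdefault(dst, []).append((src, p))
    counts = {c: 0 for c in candidates}
    for _ in range(num_samples):
        for c in reverse_sample_once(self_risk, in_edges, candidates, rng):
            counts[c] += 1
    return {c: counts[c] / num_samples for c in candidates}
```

Only the part of the network that can reach a candidate is ever touched, which is where the savings over forward sampling come from when the candidate set is small.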
By integrating the bounds based pruning method and the reverse sampling techniques, we have the reverse sampling based method. It is easy to verify the approach is approximation if the sample size fulfills Theorem 5.
Theorem 5.
Algorithm 5 is approximation if the sample size is no less than
(4) 
3.3 Bottom-k Based Approach
Based on the lower and upper bounds derived, we can reduce the candidate space. In addition, by using the reverse sampling technique, we can reduce the cost of exploring samples. The reverse sampling based algorithm can return a result with a tight theoretical guarantee, and it reduces the sample size from Equation 3 to Equation 4. However, in many real cases, the sample size and computation cost are still large. Intuitively, we only need sufficient samples to obtain a competitive result. In this section, we derive a method based on the bottom-k technique, which can greatly accelerate the procedure with competitive top-k results.
3.3.1 Find the Top-1 Result
In the reverse sampling approach, we process the samples one by one, and we can terminate the processing if there is a node that has sufficient statistics. In this paper, we use the bottom-k sketch to serve this role. The idea is that we first apply the lower and upper bound technique to obtain and . Let be the sample size computed by using Equation 4. We assign each sample an id and generate a random hash value in for each of them. Since we do not materialize the samples, the time complexity of generating the hash values is only . We sort the samples in increasing order of hash value and materialize the samples accordingly by using the reverse sampling framework. For each node in the candidate set, we record its accumulated value . Based on Theorem 6, the node whose first reaches , which is the preselected threshold, is the top-1 result.
Theorem 6.
The node selected by using the above procedure is the top1 node.
Proof.
Suppose node is the first node that reaches the criteria and the hash value of its th encountered sample is . Then we can estimate as . If is the second node that reaches the criteria, we must have . Therefore, the corresponding estimated value is smaller than that of . The theorem is correct. ∎
Here, we use to measure if the statistic is sufficient or not. Although the bottom-k based method does not offer as tight a theoretical guarantee as the previous approaches, it shows a great advantage over the others in the experimental evaluation.
3.3.2 Find the Top-k Result
By extending Theorem 6, we can stop exploring the samples when there are k nodes with sufficient statistics, i.e., their counters reach . Note that there may be cases where the stop condition cannot be met after all the samples are processed. In that case, the algorithm falls back to the reverse sampling based method, and we simply return the k nodes with the largest estimated values. According to the experiments over real-world datasets, however, the algorithm can converge quickly with .
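The early-stopping loop can be sketched as follows. This is a simplified illustration, not the BSRBK implementation: the function name `bottom_k_early_stop`, the callback `sample_fn`, and the parameter name `bk` for the bottom-k threshold are hypothetical, and ties between nodes that reach the threshold in the same sample are broken by iteration order. The callback is expected to materialize one sample on demand, for example one reverse-sampled possible world restricted to the candidate set.

```python
import random


def bottom_k_early_stop(sample_fn, num_samples, k, bk, seed=0):
    """Process samples in increasing hash order; stop once k nodes have bk hits.

    sample_fn(sample_id) must return an iterable of the distinct candidate
    nodes that default in that sample.
    Returns (selected_nodes, number_of_samples_materialized).
    """
    rng = random.Random(seed)
    # Hash values are drawn up front; the samples are not materialized yet.
    hashes = [rng.random() for _ in range(num_samples)]
    order = sorted(range(num_samples), key=hashes.__getitem__)

    counts = {}
    selected = []
    used = 0
    for sample_id in order:
        used += 1
        for v in sample_fn(sample_id):
            counts[v] = counts.get(v, 0) + 1
            if counts[v] == bk:
                selected.append(v)  # nodes are ranked by when they reach bk
                if len(selected) == k:
                    return selected, used
    # Fallback: the stop condition was never met, so rank the remaining
    # nodes by their hit counts over all processed samples.
    rest = sorted((v for v in counts if v not in selected),
                  key=counts.get, reverse=True)
    return (selected + rest)[:k], used
```

Setting `k=1` corresponds to the top-1 case, and the second return value shows how many samples were actually materialized before the stop condition fired.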
4 System Implementation and Deployment
In this section, we first present the overall architecture for VulnDS. We then describe the details of system implementation. Finally, we report the interface and observation after system deployment.
4.1 Architecture Overview
Figure 4 shows the architecture overview of VulnDS in a loan management system. We collect the original data from three upstream sources: the loan data warehouse, the data market, and external loan data. In the preprocessing layer, raw records are extracted, merged, and aggregated for risk control. We employ an in-memory database to store the frequently queried data, a graph database to preserve networked relationships, and a relational DB for conventional tables. We utilize a monitoring platform for scheduling submitted tasks from the preprocessing and risk control modules. The risk assessment results are consumed by the tools and application platform, which is the main scenario for controlling loan risks. Business users in different roles access these capabilities through a unified application interface.
The risk control center consists of three main parts: the rule engine, the vulnerable detection system, and the evaluation module. The rule engine mainly includes the loan blacklist, the white list, and compliance rules. If a loan passes the rule check, it is then processed by our proposed vulnerable detection system. VulnDS assesses the self-risk of SMEs and the risk of guarantee relationships, and detects the top-k vulnerable nodes with BSRBK. The evaluation module leverages the output of VulnDS to quantify the loan grant amount, time limit, interest rate, etc. Once the bank issues a loan, the post-loan process is activated immediately. All three steps in the risk control center are employed to evaluate all issued loans regularly. In our implementation, we examine all loans monthly with the proposed VulnDS in the risk control center.
4.2 Implementation Details
Figure 6 shows an overview of the data association, which is extracted by the preprocessing layer. We employ the internal black and white lists from our collaborating bank. The rules mainly comply with the new Basel protocol [engelmann2011basel]. In the vulnerable detection system, we employ HGAR [DBLP:conf/ijcai/ChengTMN019] for self-risk assessments and pwkNN [cheng2018prediction] to infer the probability of risk guarantee relationships. The proposed BSRBK is utilized for the final vulnerable SME detection. During implementation, we use Drools [thu2017transforming] on Apache Flink as the rule engine, in which the hot data are stored in Redis [patel2015sales]. We employ neo4j as the graph database and visualize the graph with the open-source software package D3.js and the ForceAtlas2 layout [jacomy2014forceatlas2]. The training model and system implementation are written in Python, Java, and Scala [andersen2016evaluating].
4.3 System Deployment
Our proposed VulnDS is deployed in a loan management system of our collaborating bank. Figure 5 presents the system interface and main components, where Figure 5(a) presents the control and metric panel, including the risk statistics of each of the loan communities and control menus. Figure 5(b) displays the loan status monitoring screen. The node size indicates the delinquent probability predicted by BSRBK, which is dynamic and changes periodically according to the time window. Thus, risk managers can focus on risky and dominant companies. Figure 5(c) lists all risky diffusion patterns discovered. During the observation time window, it successfully warns of the true positive ratio of all risky loans with an acceptable false positive rate, involving 2538 SMEs and 7006 guarantors. It is worth noting that these risky loans failed to be discovered by the conventional scorecard-based risk control system.
5 Experiment
In this section, we conduct extensive experiments to evaluate the effectiveness and efficiency of our proposed methods.
Table 2: Statistics of the datasets.
Datasets     # Nodes    # Edges   Avg Deg.   Max Deg.
Bitcoin        3,783     24,186       6.39        888
               4,039     88,234      21.85      1,045
Wiki           7,115    103,689      14.57      1,167
P2P           62,586    147,892       2.36         95
Citation       2,617      2,985       1.14         44
Interbank        125        249       1.99         47
Guarantee     31,309     35,987       1.15     14,362
Fraud         14,242    236,706      16.62     85,074
5.1 Experimental Settings
Datasets. We conduct the experiments on 3 real-world financial datasets, i.e., Interbank (https://github.com/carloscinelli/NetworkRiskMeasures), Fraud, and Guarantee, and 5 public benchmark datasets, with drastically varying sizes and characteristics. The statistical details are shown in Table 2.
The Interbank network is generated by the maximum-entropy (ME) approach [anand2015filling], in which each node represents a bank and each edge corresponds to an interbank loan from the lender bank to the borrower bank. The dataset is publicly available. Fraud and Guarantee are our contributed datasets, whose details are described as follows.

Fraud. The credit card fraud network with nodes and edges is constructed based on credit card fraud transactions from a major commercial bank. Each edge represents a trade between a consumer and a merchant.

Guarantee. The guaranteed-loan network is from a major commercial bank and spans 4 years. The names of the customers in the records are encrypted and replaced by IDs. We can access the guarantee relationships, each of which denotes an edge from the guarantor to the borrower. Besides, in the case studies, we also obtain basic profile information such as the enterprise scale, and loan information such as the guarantee ID and the loan credit.
Besides the real-world financial datasets, we also employ 5 benchmark datasets, which are publicly available. We download Citation from networkrepository (http://networkrepository.com/). The others are downloaded from SNAP (https://snap.stanford.edu/data/).
Algorithms. We evaluate the following algorithms to demonstrate the performance of the proposed techniques.

N (Naive): Algorithm 1 with a fixed sample size of 3000.

SR (Sample+Reverse): an algorithm that uses the reverse sampling method with the sample size computed by Equation 3.

BSR (Bound+Sample+Reverse): an optimized sampling method that integrates reverse sampling and bound-based filtering, with the sample size calculated by Equation 4.

BSRBK (Bound+Sample+Reverse+Bottom-k): a bottom-k based method that integrates reverse sampling and bound-based filtering.
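To make the sampling idea behind the naive baseline concrete, the following is a minimal sketch of forward Monte Carlo estimation under possible-world semantics. It is not the paper's Algorithm 1. The function names, the input layout, and the convention that an edge (u, v) means a default at u may spread to v are all assumptions made for illustration.

```python
import random
from collections import deque


def estimate_default_probs(nodes, edges, self_risk, diffusion,
                           samples=3000, seed=0):
    """Estimate default probabilities by forward Monte Carlo sampling.

    Assumed (hypothetical) inputs:
      nodes:     iterable of node ids
      edges:     list of (u, v) pairs; a default at u may spread to v
      self_risk: dict node -> probability that the node defaults by itself
      diffusion: dict (u, v) -> probability that a default spreads along (u, v)
    """
    rng = random.Random(seed)
    nodes = list(nodes)
    out = {u: [] for u in nodes}
    for u, v in edges:
        out[u].append(v)

    hits = {u: 0 for u in nodes}
    for _ in range(samples):
        # One possible world: first decide which nodes default on their own.
        defaulted = {u for u in nodes if rng.random() < self_risk[u]}
        queue = deque(defaulted)
        # Then spread defaults along the edges that are live in this world.
        while queue:
            u = queue.popleft()
            for v in out[u]:
                if v not in defaulted and rng.random() < diffusion[(u, v)]:
                    defaulted.add(v)
                    queue.append(v)
        for u in defaulted:
            hits[u] += 1
    return {u: hits[u] / samples for u in nodes}


def top_k(probs, k):
    """Return the k nodes with the largest estimated default probability."""
    return sorted(probs, key=lambda u: (-probs[u], u))[:k]
```

With a fixed sample size, the cost of this sketch does not depend on how many top nodes are requested, which matches the behavior reported for N in the efficiency experiments.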
Parameters and Workload. To evaluate the effectiveness of the proposed techniques, we report the precision. We use 10000 samples to obtain the ground truth for the top-k results. In the case study, we directly observe labels from real-world behavior and validate the predictions against these labels. For efficiency evaluation, we report the response time.
For the Fraud and Guarantee datasets, the self-risk and diffusion probabilities are obtained from our previous research [fu2016credit, cheng2018prediction]. For the other datasets, the probability is randomly selected from . For parameter , we vary it from to , where is the corresponding graph node size. We set and for computing the sample size.
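The precision measure used in this evaluation can be sketched as follows: it is the fraction of returned nodes that also appear in the ground-truth top results. The function name and argument layout are illustrative assumptions, not taken from the paper.

```python
def precision_at_k(predicted, ground_truth):
    """Fraction of predicted nodes that also appear in the ground truth.

    predicted:    list of node ids returned by the algorithm under test
    ground_truth: list of node ids obtained from the reference run
    """
    if not predicted:
        return 0.0
    truth = set(ground_truth)
    return sum(1 for u in predicted if u in truth) / len(predicted)
```

For example, if an algorithm returns four nodes and two of them are in the ground-truth set, the precision is 0.5.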
5.2 Parameter Tuning
In this section, we tune the parameters and the order of bounds on 4 datasets, i.e., Citation, Interbank, Fraud and Guarantee.
Tuning . As analyzed in the paper, the precision of BSRBK should converge rapidly with the increase of . We vary from 4 to 64. The results are shown in Figure 8. Note that means is set to . With the increase of , the algorithm converges quickly on all the datasets. When reaches 8, the drop in performance becomes less significant. Thus, in the following experiments, we set to 16.
Tuning Order of Bounds. Since the tightness of the lower and upper bounds may greatly affect the sample size and computation cost, we conduct experiments to tune the order of the bounds. We vary the order of bounds from 1 to 5 and set as of the number of nodes. The candidate size is reported.
Figure 8 visualizes the results with heatmaps. The lighter the color, the fewer the candidates. As we can see, the candidate size decreases rapidly at the beginning and becomes steady once the order reaches 2 in most cases. Therefore, we set the order of the upper and lower bounds to 2 for the following experiments.
5.3 Efficiency Evaluation
To demonstrate the efficiency of the proposed techniques, we conduct experiments on all the datasets and report the response time. The results are shown in Figure 9. For all methods except the naive approach N, the computation time gradually increases with k. N is the exception because it uses a large fixed sample size, whereas the sample size of the other methods may change as k increases. As we can observe, N is the most time-consuming method, and the algorithms run faster as more acceleration techniques are involved. SR is better than SN because the reverse sampling technique greatly reduces the sampling cost. BSR is better than SR because the derived lower and upper bounds reduce both the candidate space and the sample size. BSRBK is better than BSR because of its novel stop condition. BSRBK always outperforms the others and achieves up to 100x acceleration. These observations support the effectiveness of the proposed techniques.
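One way to see why reverse sampling can reduce the sampling cost is the sketch below. It estimates the default probability of a single target node by exploring backward from it and flipping coins lazily, so only the part of the graph that can reach the target is touched, and a sample stops as soon as a self-defaulting node is found. This is an illustrative sketch, not the paper's SR algorithm. The input layout and the convention that an edge (u, v) spreads a default from u to v are assumptions.

```python
import random


def reverse_sample_default_prob(target, in_edges, self_risk, diffusion,
                                samples=1000, seed=0):
    """Estimate the default probability of `target` by reverse sampling.

    Assumed (hypothetical) inputs:
      in_edges:  dict v -> list of u such that a default at u may spread to v
      self_risk: dict node -> probability that the node defaults by itself
      diffusion: dict (u, v) -> probability that a default spreads along (u, v)
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        visited = {target}
        stack = [target]
        found = False
        while stack and not found:
            v = stack.pop()
            # The target defaults if any node that reaches it through
            # live edges defaults on its own in this possible world.
            if rng.random() < self_risk[v]:
                found = True
                break
            for u in in_edges.get(v, []):
                # Each in-edge coin is flipped at most once per sample.
                if u not in visited and rng.random() < diffusion[(u, v)]:
                    visited.add(u)
                    stack.append(u)
        hits += found
    return hits / samples
```

Compared with a forward simulation over the whole graph, the work per sample here is bounded by the size of the backward-reachable region of the target, which is typically much smaller.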
5.4 Effectiveness Evaluation
To evaluate the effectiveness of the proposed methods, the precision is reported by varying from to . The results are shown in Figure 10. Generally, the precision of the 5 methods is very close, and the largest gap between the naive method N and BSRBK is only 3%. Compared with the speedup in efficiency, the precision difference is much less noticeable. The naive method N is slightly better than the other methods because it uses more samples. SN, SR and BSR report almost the same results because they share the same theoretical guarantee. It should be noted that for the Interbank dataset, and all methods successfully detect that node. Therefore, the precision is 1, as shown in Figure 10(c). The experimental results show that BSRBK achieves a significant speedup at the cost of a tolerable reduction in precision.
6 Case Studies
In this section, we conduct case studies by deploying our methods on a real-world loan management system. First, we compare the proposed methods with previously used prediction methods on a real financial dataset. Then we show how to discover risky paths and patterns by leveraging the identified vulnerable nodes, together with corresponding case studies. Finally, we present the deployed system, which integrates the proposed techniques, and demonstrate its performance in real scenarios.
6.1 Loan Default Prediction
To further demonstrate the performance of the proposed methods, we compare them with several baseline approaches designed for the default prediction task in real-world systems. The baseline methods include Wide [mcmahan2011follow], Wide and Deep [cheng2016wide], CNNmax [zheng2017joint], GBDT [ke2017lightgbm], crDNN [tan2018deep], INDDP [cheng2018prediction], and HGAR [DBLP:conf/ijcai/ChengTMN019]. We conduct the experiments on a real-world dataset, i.e., the Guarantee dataset, which spans 4 years, from 2012 to 2016. As observed, most of the loans are repaid monthly. Hence, we aggregate the behavior features within a one-month time window and mark the delinquent loans as the target label for the month. The records of 2012 are used as the training data, and we then predict the defaults over the next three years. For the baseline methods, the training data is used to train the prediction models. For our methods, the training data is used to train the probabilities involved in the networks, the details of which are given in our previous research [cheng2018prediction].
The results are shown in Table 3, where the AUC (Area Under the Curve) for each year is reported. As we can see, GBDT and Wide & Deep outperform the Wide model because of their increased model capacity. INDDP and HGAR are competitive among the baselines. BSR and BSRBK surpass all the other approaches, which suggests that the graph structure and default diffusion properties are effective for default prediction tasks. BSR is slightly better than BSRBK because it offers a tight theoretical guarantee.
Method  AUC(2014)  AUC(2015)  AUC(2016)
Wide  0.75509  0.77751  0.78195 
Wide & Deep  0.76464  0.79825  0.81053 
GBDT  0.77263  0.80627  0.81182 
CNNmax  0.77645  0.80049  0.81492 
crDNN  0.77429  0.79565  0.81054 
INDDP  0.79015  0.80927  0.81588 
HGAR  0.81310  0.80988  0.81875 
BSRBK  0.82367  0.82835  0.83709 
BSR  0.82539  0.83004  0.83917 
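For readers unfamiliar with the metric in Table 3, the sketch below computes the AUC directly from its ranking interpretation: the probability that a randomly chosen defaulting loan receives a higher score than a randomly chosen non-defaulting one, with ties counted as one half. The function name and the 0/1 label encoding are assumptions for illustration. This quadratic-time version is meant for clarity rather than for large datasets.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: list of 0/1 values (1 = default observed)
    scores: list of predicted default scores, same length as labels
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))
```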
6.2 Risk Path and Pattern Discovery
For a given network, identifying risky patterns can provide valuable information for the company. However, directly analyzing the original network (e.g., mining frequent patterns) may not return meaningful results. To find useful patterns, we first locate risky paths by leveraging the vulnerable nodes found. In our top-k vulnerable nodes detection problem, we locate the nodes with large default probability. By backtracking from these nodes through highly risky paths, we can find the sources that cause the default. To define highly risky paths, we accumulate the diffusion and self-risk probabilities of the edges and nodes on the path. A path is defined as risky if the accumulated probability is larger than a preselected threshold. Figure 12(a) shows a toy network where node is the top result. By backtracking from , we find two risky paths and . Note that the risky paths may share nodes. We then use the subgraph constructed from the risky paths as input and mine the frequent patterns. Figure 5(c) shows the risky pattern screen of our deployed system, where the discovered patterns are presented. The system details will be presented later.
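The backtracking step can be sketched as follows. The text does not specify how the probabilities are accumulated, so this sketch assumes one plausible choice: the score of a path is the self-risk of its source node multiplied by the diffusion probabilities of its edges. The function name, the input layout, and this scoring rule are all assumptions, not the deployed system's definition.

```python
def risky_paths(target, in_edges, self_risk, diffusion, threshold):
    """Backtrack from `target` and return the paths scoring above `threshold`.

    Assumed (hypothetical) inputs:
      in_edges:  dict v -> list of u such that a default at u may spread to v
      self_risk: dict node -> probability that the node defaults by itself
      diffusion: dict (u, v) -> probability that a default spreads along (u, v)
    Assumed score of a path s -> ... -> target:
      self_risk[s] times the product of the diffusion probabilities on the path.
    """
    results = []

    def walk(node, path, edge_prob):
        # `path` runs from the current node to the target.
        score = self_risk[node] * edge_prob
        if len(path) > 1 and score > threshold:
            results.append((list(path), score))
        for u in in_edges.get(node, []):
            if u in path:
                continue  # do not revisit nodes, which avoids cycles
            p = edge_prob * diffusion[(u, node)]
            # Self-risk is at most 1, so a branch whose edge product is
            # already at or below the threshold can never qualify.
            if p > threshold:
                walk(u, [u] + path, p)

    walk(target, [target], 1.0)
    return sorted(results, key=lambda r: -r[1])
```

The union of the returned paths forms the subgraph that can then be passed to a frequent pattern miner.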
Besides using the risky paths to identify patterns, we can use these paths to analyze the cause of a default. In Figure 11(a), we visualize the network, where the vulnerable nodes are shown in dark color. When a node is selected, the deployed system visualizes the corresponding risky paths (i.e., Figure 11(b)). In addition, we provide the detailed guarantee flow of upstream and downstream companies using a Sankey diagram. For example, node 18 provides many more guarantees than it receives.
7 Related Work
In this section, we introduce related work on credit evaluation, diffusion in financial problems and probabilistic graphs.
7.1 Credit evaluation
Consumer credit risk evaluation is often addressed in a data-driven fashion and has been extensively investigated [baesens2003using, hand1997statistical]. Since the seminal “Partial Credit” model [masters1982rasch], numerous statistical approaches have been introduced for credit scoring: logistic regression, kNN, neural networks, and support vector machines. More recently, [baesens2003using] presents an in-depth analysis of how to interpret the knowledge embedded in neural networks by using explanatory rules, and discusses how to visualize these rules. Researchers combine the debt-to-income ratio with consumer banking transactions and use a linear regression model with a time-windowed dataset to predict default rates in the near future. They claim 85% default prediction accuracy and cost savings of between 6% and 25% [khandani2010consumer].
7.2 Diffusion in finance
The relationship between network structure and financial system risk has been carefully studied, and several insights have been drawn. Network structure has little impact on system welfare, but it is important in determining systemic risk and welfare in short-term debt [allen2010financial].
Network theory attracted more attention after the 2008 global financial crisis. The crisis brought about by Lehman Brothers infected connected corporations, which is similar to the 2002 Severe Acute Respiratory Syndrome (SARS) epidemic. Both started with limited damage but hit a networked society and caused serious consequences [bougheas2015complex].
The dynamic network produced by overnight interbank loans may serve as an early warning of a crisis [van2013using]. Moreover, research that aims to understand individual behavior and interactions in social networks has also attracted extensive attention [onnela2006complex, qiu2016lifecycle]. Although preliminary efforts have been made to use network theory to understand fundamental problems in financial systems [van2013using, chow2008social], there is little work on systemic risk analysis in networked-guarantee loans [meng2015credit].
7.3 Probabilistic graph
The probabilistic (uncertain) graph, where each node or edge may appear with a certain probability, has been widely used to model graphs with uncertainty in a wide spectrum of graph applications. A large number of classical graph problems have been studied in the context of probabilistic graphs. For instance, Jin et al. [jin2011distance] investigate the distance-constraint reachability problem in probabilistic graphs. Potamias et al. [potamias2010k] introduce a framework that can address k-nearest neighbor (kNN) queries on probabilistic graphs. The problem of vulnerable nodes detection has been investigated in the context of network reliability (e.g., [articleli13, DBLP:conf/infocom/SenMBDC14, Cetinay2018]). Nevertheless, their models are inherently different from ours, and hence the existing techniques cannot be trivially applied.
The problem investigated in this paper is similar to the study of node influence under the IC model [DBLP:conf/kdd/KempeKT03] in the sense that the influence of a node can be modeled by possible world semantics. Although a large body of work (e.g., [DBLP:conf/kdd/KempeKT03, DBLP:journals/tkde/WangZZLC17]) has been developed for the problem of influence maximization under the IC model, the proposed techniques cannot be applied to our problem due to the inherent difference between the two problems. Firstly, the nodes in the IC model do not carry any probability. Secondly, their focus is to select nodes such that the spread of influence is maximized, whereas we aim to find the nodes with the largest default probabilities.
8 Conclusion
In this paper, we propose a vulnerable SME detection system (VulnDS), which addresses the problem of top-k vulnerable nodes detection to identify the riskiest enterprises involved in networked-guaranteed loans. In particular, we propose a probabilistic guaranteed-loan network model to capture the contagion damage caused by both the self-risks of the nodes and the propagation of defaults. Following the possible world semantics, we derive the default probability of the nodes. As shown in the case study, this model significantly improves the accuracy of default prediction on real-life guaranteed-loan networks. To tackle the efficiency and scalability issues in top-k vulnerable nodes detection, effective pruning techniques and advanced sampling methods are proposed with rigorous theoretical guarantees. To further accelerate the search, a bottom-k based approach is developed. Moreover, the proposed techniques are integrated into a loan management system, which is deployed in our collaborating banks and further demonstrates the effectiveness and efficiency of the proposed methods. The experimental results demonstrate that VulnDS is useful in financial management for banks, regulatory authorities, and governments.