
Modelling graph dynamics in fraud detection with "Attention"

by   Susie Xi Rao, et al.
ETH Zurich

At online retail platforms, detecting fraudulent accounts and transactions is crucial to improve customer experience, minimize loss, and avoid unauthorized transactions. Despite the variety of models for deep learning on graphs, few approaches have been proposed for dealing with graphs that are both heterogeneous and dynamic. In this paper, we propose DyHGN (Dynamic Heterogeneous Graph Neural Network) and its variants to capture both temporal and heterogeneous information. We first construct dynamic heterogeneous graphs from registration and transaction data from eBay. Then, we build models with diachronic entity embedding and heterogeneous graph transformer. We also use model explainability techniques to understand the behaviors of DyHGN-* models. Our findings reveal that modelling graph dynamics with heterogeneous inputs needs to be conducted with "attention" depending on the data structure, distribution, and computation cost.





1. Introduction

Fraud detection is an important task for e-commerce platforms. They are dealing with a large amount of user activities daily, so it is crucial to minimize all sorts of risks involved in online activities, be they account registrations, user interactions, or user transactions. There are two aspects one can model in the aforementioned applications: structural and temporal. Structural refers to the relations between the nodes involved in the activities, temporal the evolution of nodes and edges over time.

We take the scenario of suspicious massive registrations as an illustration. First, abusive accounts are usually interlinked. For example, they may share the same phone number or be registered with the same IP address. They naturally form a graph with heterogeneous types of nodes (e.g., accounts, transactions, IP address, email, phone number). Second, we want to capture suspicious accounts and transactions within a certain time frame. Indeed, studying the patterns of suspicious registrations shows that, because fraudsters tend to abuse accounts when they are recently registered, the temporal dynamic is a critical factor for detection.

Consider another application scenario in e-commerce. We want to understand how accounts evolve over time in terms of risk scores. Some accounts might be hacked and stolen, some might engage in suspicious transactions after a while, and others might be part of a ring attack on the platform. Therefore, it is meaningful to add a temporal aspect when monitoring account behaviors.

Graph Neural Networks (GNNs) aim to learn a representation vector for each node based on the graph structure. Recent studies of fraud detection have shown that GNN-based methods are powerful in flagging frauds (cf. Li et al. (2019); Wen et al. (2020); Zhang et al. (2019b); Liu et al. (2018); Wang et al. (2019a); Weber et al. (2019); Ma et al. (2018); Liu et al. (2019); Liang et al. (2019); Rao et al. (2022); Lu et al. (2021); Wang et al. (2021)).

In this paper, we propose the DyHGN model and its variants, built with diachronic embeddings (Goel et al., 2020) and the heterogeneous graph transformer (HGT) (Hu et al., 2020b). DyHGN stands for Dynamic Heterogeneous Graph Neural Network. We then apply the DyHGN-* models to three use cases highly relevant at eBay: (a1) detecting suspicious account registration, (a2) flagging risky transactions, and (a3) identifying risky accounts. This allows us to gain insights into modelling dynamic heterogeneous graphs in real-world applications, which we share with audiences in both academia and industry.

First, we discuss DyHGN (Figure 1), which integrates both structural and temporal information by putting them into one graph. We then extend the vanilla DyHGN model with a diachronic entity embedding function with LSTM (DyHGN-DE, Figure 2 (left)), which provides the characteristics of entities at any point in time. We also investigate the benefits of capturing the dynamic heterogeneous relation patterns with the self-attention based HGT (Hu et al., 2020b) (DyHGN-DE-HGT, Figure 2 (right)). Having explored fraud detection datasets with dynamics, our results echo the finding in Lv et al. (2021) on the comparison between heterogeneous GNNs (HGNNs) and GAT. Furthermore, we discuss the uneven distribution of node labels (an uneven proportion of risky labels across time) in the datasets and its influence on classification performance. Interestingly, we show that DyHGN-* models are powerful in learning long-term dependencies from the past, especially when the label distribution fluctuates strongly over time. Last but not least, we use model explainability (Shapley values) in ML models with graph-derived features to gain insights about DyHGN-* models.

Our key contributions in this work are:


  • We have developed a prototype (DyHGN) that models the account dynamics in a heterogeneous transaction graph. On top of DyHGN, we have built other variants with diachronic entity embeddings (Goel et al., 2020) and HGT (Hu et al., 2020b) modules and evaluated their performance on real-world industrial datasets.

  • We benchmark multiple GNNs against three datasets from an e-commerce retail platform.

  • We use ML models trained on graph-derived features and use the insights to better explain DyHGN-* models.

  • We open-source the codebase and data [1] to facilitate further research in these directions. The codebase/data are made available on this GitHub repository:

    [1] Note that the eBay-small dataset in (Rao et al., 2022) (desensitized transaction records, denoted as xFraud here) is used in this work. Please contact the authors for the DATA USE AND RESEARCH AGREEMENT (eBay) and obtain the usage rights of the eBay-small dataset. In the long run, it would be possible to share the MassReg dataset after the legal review at eBay.

2. Research Question and Methodology

In this section, we first present our research question, the problem definition (Sec. 2.2) of fraudulent accounts and transactions, from the perspective of modeling a dynamic heterogeneous graph. Then, we explain the graph construction (Sec. 2.3), our DyHGN architecture (Sec. 2.4), the diachronic embedding component (Sec. 2.5.1), and the DyHGN-* variants (Sec. 2.5).

2.1. Research Question

In relation to daily use cases at eBay, we aim to address the following question: how can we develop an end-to-end framework that leverages the entities and time information available in the following scenarios: (a1) detecting suspicious account registration, (a2) flagging risky transactions, and (a3) identifying risky accounts?

2.2. Problem Definition

As listed above, we study three application scenarios in the paper which would benefit from leveraging both dynamic and heterogeneous information from transaction graphs: (a1) detect suspicious account registration, (a2) flag risky transactions, and (a3) identify risky accounts. Here, we take scenario (a1) as an example to illustrate how we build a dynamic heterogeneous graph and a model around the use case. The other two use cases (a2 and a3) follow a similar design.

Massive account registration refers to a process where a user or an organization can add a group of accounts simultaneously [2]. This functionality is enabled on many e-commerce websites to speed up batch registration for a group of accounts. By providing the entities required in a registration template, a user can create several accounts within a short period. Apart from template-based operations, a bot can also be built to submit forms multiple times with generic information. In practice, a fraudster can make use of this functionality on an e-commerce website to register a set of accounts for multiple purposes: using stolen financial means, providing fake reviews, redeeming coupons, purchasing with an abusive gift card, etc. Oftentimes, these registrations share some common patterns of disguise.

[2] See this illustration for an experience of massive user registration, (last accessed: Nov. 9, 2020).

Think of the critical entities involved in massive account registration. A credit card might be stolen and used by one of the newly registered accounts. A suspicious account can be registered with a common shipping address such as a warehouse, which is a disguise of suspicious activity. These accounts are oftentimes registered using email addresses with spam patterns, also using telephone numbers that are listed as spam calls by third-party collaborators for risk detection. We can also detect a chain of illegal activities within an e-commerce platform: an account has been hacked and a fraudster registers a batch of new accounts to use the gift cards and financial instruments that are bundled with the hacked account.

Furthermore, the time dimension is crucial in massive registration. According to the business unit's manual analysis of suspicious massive registrations, fraudster activities usually occur within three months after registration. An effective system should be able to link suspicious entities used in the past or uncover suspicious patterns discovered in past graph snapshots. Later, when such a system is deployed in production, it is expected to provide both feature level detection and graph level detection. Feature level detection is achievable via a rule-based system, while graph level detection is better modeled by an end-to-end model that learns the time-variant representations quickly and effectively.

Formally, a heterogeneous network is defined as a graph $G = (V, E)$, where the node type mapping function is $\phi: V \to A$ and the link type mapping function is $\psi: E \to R$. Each node $v \in V$ has only one type $\phi(v) \in A$, and each edge $e \in E$ has only one type $\psi(e) \in R$. Each node or edge can be associated with attributes, denoted by $X$. In addition, each node or edge can be labeled, denoted by $Y$. For each time $t$, an entity in $V$, or a link in $E$, can be removed from or added into the graph snapshot $G_t$.

Figure 1. DyHGN: Dynamic Heterogeneous Graph Neural Network for Suspicious Massive Registration.

2.3. Graph Construction

In our work, we populate the graph with three different applications that we introduce with further details in Sec. 3.1. Again, we take the application of detecting suspicious registrations (a1) as an illustration. We formulate it as a binary classification problem in a transductive setting in a heterogeneous graph. Thus, we specify the problem definition from the heterogeneous and temporal aspects as follows.

Heterogeneous. In a heterogeneous transaction graph, each node has a type in {account, address, IP, phone, email}, referring to account, registration address, IP address, phone, and email, respectively [3]. If an account uses a linking entity of type address, IP, phone, or email, we put an edge between the account node and this linking entity in the heterogeneous graph. Each node carries node attributes provided by a risk identification system. Each account ID is flagged as benign or suspicious. We use these flags as labels in our binary classification.

[3] For this study, we choose these attributes because they reflect typical patterns of massive registration activities as reported by our business unit. Other types of attributes such as device type are incorporated in the in-house risk detection system.

Temporal. In general, there are two ways (static and dynamic) to incorporate the time dimension into a heterogeneous graph using accounts and their linking entities.

Static. One can construct a separate static snapshot of the graph, with all entities involved, for each time step. This means a detection system has to process every snapshot, one per time step, and a graph is constructed in each snapshot. This is the graph construction DySAT (Sankar et al., 2020) employs. However, as analyzed in Section 4, DySAT is not suitable for our application because not all entities appear in every snapshot. If we break down the graph into snapshots, we lose many time-dependent linkages, because many accounts do not yet exist in a given snapshot and hence cannot be linked to any entity. Another reason is that newly registered suspicious accounts are typically abused only within the first few weeks after registration (usually a small number, say 1-12). Consequently, we only need to encode the linking entities such as IP addresses and registration addresses for those periods. We therefore propose an alternative way of constructing the graph for our applications and present it below.

Dynamic. Instead of treating a heterogeneous graph as a set of static snapshots, we unroll the time snapshots into one graph to incorporate nodes and edges that (dis)appear over time. TIMESAGE (Shekhar et al., 2020) discusses a similar design: a time-dependent network representation. We build a graph that tracks entities which could be present or absent in any snapshot. Entities such as phone numbers and addresses are present in all snapshots, but accounts can only exist after certain snapshots, i.e., after the accounts have been created.

We have two main components in our dynamic heterogeneous transaction network: a structural subgraph (Figure 1 (a)) and a temporal subgraph (Figure 1 (b)). Here we discuss the design of these two components and the intuitions behind it.

The structural subgraph is designed to reflect the linkages among various types of entities. The nodes are accounts and the attributes used in account registration. For instance, if two accounts are registered using the same IP, an edge is added from account 1 to this IP, and another edge is added from account 2 to this IP. Consequently, account 1 and account 2 are linked via this IP (see Figure 1 (a)). Hence, the structural subgraph captures the relationships among the entities and allows us to uncover the patterns in account registration.

The temporal subgraph builds on top of the structural components of each time step. For each time step, we observe a structural subgraph constructed as described above. Then, we add temporal edges from the structural nodes to a hub node. The structural nodes attached to the same hub node represent the identical entity at different time steps, and nodes are indexed with the timestamp(s) at which they appear. In the example shown in Figure 1, the timestamped copies of an entity are connected to its hub node via temporal edges. The timestamped nodes and their hub node form a star graph that represents whether an entity has appeared at a given time or not. The unrolled dynamic heterogeneous graph contains all the nodes and edges that appear at any time step; it is composed of the structural subgraph and the temporal subgraph.
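The unrolled construction described above can be illustrated with a small sketch. It is not the authors' implementation: the record layout `(time_step, account_id, entity_type, entity_value)`, the function name `build_unrolled_graph`, and the tuple-based node identifiers are assumptions made for this example only.

```python
def build_unrolled_graph(registrations):
    """Build structural and temporal edge sets from registration records.

    `registrations` is an iterable of hypothetical records
    (time_step, account_id, entity_type, entity_value).
    """
    structural_edges = set()  # account node <-> timestamped entity node
    temporal_edges = set()    # timestamped entity node <-> hub node
    for t, account, etype, value in registrations:
        account_node = ("account", account)
        entity_at_t = (etype, value, t)   # copy of the entity seen at time t
        hub = (etype, value, "hub")       # one hub per entity over all time steps
        structural_edges.add((account_node, entity_at_t))
        temporal_edges.add((entity_at_t, hub))
    return structural_edges, temporal_edges


# Toy data: three accounts share one IP across two time steps.
records = [
    (1, "acc1", "IP", "10.0.0.1"),
    (1, "acc2", "IP", "10.0.0.1"),
    (2, "acc3", "IP", "10.0.0.1"),
    (2, "acc3", "phone", "555-0100"),
]
structural, temporal = build_unrolled_graph(records)
```

In this toy graph, the two timestamped copies of the IP address attach to a single hub node, so accounts registered in different time steps remain connected through the star around that hub.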

Notation Description
a graph snapshot at time
the number of time steps
a dynamic heterogeneous graph from timestamps
structural subgraph in Figure 1 (a)
temporal subgraph in Figure 1 (b)
account level features
the output of structural message passing
the output of FC transformation of
the output of structural message passing
the diachronic entity embeddings
an edge in each time for type , e.g.,
a node in each time for type , e.g.,
a hub node in from to for type , e.g.,
Table 1. Notations.

2.4. DyHGN Architecture

Now we present the architecture of DyHGN, which is applicable in all three scenarios discussed in this paper. DyHGN is composed of two subgraphs, a structural subgraph to capture the relations between different types of entities and a temporal subgraph to capture the dynamic aspect of the entities and to determine whether an entity appears in time or not. Temporal edges are added from the structural nodes to their timestamp-indexed counterparts in the temporal subgraph. For each subgraph, a graph convolutional network (GCN) layer is used to learn the message passing between the nodes.

Structural message passing. The first GCN layer takes as input the structural subgraph and the node level features (e.g., the 264-dimensional account features in MassReg). Only the nodes to classify have features, while the other node types do not have initial feature values and are initialized with zero vectors. Then, we use a feedforward layer to connect the message passing between the structural and temporal parts. The output is layer-normalized and fed into a nonlinear transformation using ReLU as the activation function. Finally, a Dropout layer is applied to regularize and avoid overfitting. The output of this step is the input to the temporal message passing.


Temporal message passing. The second GCN layer takes as input the temporal subgraph and the output of the previous step. More specifically, in order to obtain the temporal subgraph, we create temporal nodes: each linking entity of a given type that appears in a given time snapshot gives us one temporal node. Similarly, we add temporal edges that connect the original node from the structural subgraph to its temporal counterparts at each timestamp. After the GCN layer, we apply dropout, layer normalization, and ReLU.

Prediction. There can be multiple convolution layers of structural and temporal message passing (see Table 8 about n_layers). The output from the final convolution is fed into a fully connected (FC) network. We then apply layer normalization, ReLU, and dropout before calculating a risk score (i.e., a probability) for the input node, using softmax. Based on the risk score, a label is generated (this part is summarized under MLP in the architecture figure). For the xFraud datasets, the loss function is the cross entropy with respect to the true label, whereas for the MassReg dataset, the loss function is the average of the binary cross entropy and the multi-class cross entropy, because accounts are also labeled with different types of risk levels (see the MassReg dataset description).
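As a worked illustration of the MassReg loss described above (the average of a binary cross entropy and a multi-class cross entropy), the following sketch computes the loss for a single node. The function names, the equal 0.5 weighting written out explicitly, and the three-level risk example are assumptions for illustration; the source does not spell out these details.

```python
import math


def binary_cross_entropy(p_risky, is_risky):
    """BCE for one node; p_risky is the predicted probability of the risky class."""
    eps = 1e-12
    p = min(max(p_risky, eps), 1 - eps)  # clamp to avoid log(0)
    return -(is_risky * math.log(p) + (1 - is_risky) * math.log(1 - p))


def multiclass_cross_entropy(probs, true_class):
    """Cross entropy for one node given a probability vector over risk levels."""
    return -math.log(max(probs[true_class], 1e-12))


def massreg_style_loss(p_risky, is_risky, risk_level_probs, risk_level):
    """Average of the binary and the multi-class cross entropy (assumed equal weights)."""
    bce = binary_cross_entropy(p_risky, is_risky)
    mce = multiclass_cross_entropy(risk_level_probs, risk_level)
    return 0.5 * (bce + mce)


# One suspicious node with hypothetical predictions over three risk levels.
loss = massreg_style_loss(0.5, 1, [0.25, 0.5, 0.25], 1)
```

With both predicted probabilities equal to 0.5 for the true classes, each term equals ln 2, so the averaged loss is also ln 2.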


Figure 2. Architecture of DyHGN-DE and DyHGN-DE-HGT. “DE”: diachronic embedding, “HGT”: heterogeneous graph transformer.

2.5. DyHGN-* Variants

It is worth noting that while DyHGN is applied to heterogeneous graphs, its convolution layer uses GCN, a convolution designed for homogeneous graphs. We further explore methods to modify the DyHGN model with better representations of temporal and structure modules. We introduce two model variants, DyHGN-DE and DyHGN-DE-HGT, which modify the temporal and structural aspects, respectively. In the subsequent sections, we first introduce the diachronic embedding module, which enables us to model the temporal aspects of an entity.

2.5.1. Diachronic Embedding

To take the temporal dynamic further into account, we equip the DyHGN model with a diachronic entity embedding following Goel et al. (2020), which gives us DyHGN-DE. This approach builds upon static entity embeddings and proposes an alternative which takes time as input as well, to provide the characteristics of entities at any point in time. The use of diachronic embedding has proved beneficial in temporal knowledge graph (KG) completion. In our work, we construct a diachronic embedding based on the DistMult (Yang et al., 2014) score function, which we denote as DE-DistMult and whose formal definition we provide below. Let $V$ be a finite set of entities, $R$ be a set of relation types, and $T$ be a finite set of timestamps.

Definition 2.1 (Diachronic entity embedding).

A diachronic entity embedding, denoted as DEEMB, is a function which maps every pair $(v, t)$, where $v \in V$ and $t \in T$, to a hidden representation.

Definition 2.2 (DE-DistMult).

In the DE-DistMult scoring function ,


  • we define for the nodes their entity embedding for every where ,

  • we define for the edges their relation embedding for every where , and .

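Finally, we define the diachronic embedding, following Goel et al. (2020), as

$$
z_v^t[n] =
\begin{cases}
a_v[n]\,\sigma\left(w_v[n]\,t + b_v[n]\right), & \text{if } 1 \le n \le \gamma d, \\
a_v[n], & \text{if } \gamma d < n \le d,
\end{cases}
\tag{1}
$$

where $z_v^t[n]$ represents the $n$-th element of the $d$-dimensional vector $z_v^t$, $a_v$, $w_v$ and $b_v$ are (entity-specific) vectors with learnable parameters, and $\sigma$ is an activation function. $\gamma$ is a hyperparameter controlling the percentage of temporal features. The first $\gamma d$ elements of the vector in Equation (1) capture temporal features and the other elements capture static features. We use sine as the activation function.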
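To make the split between temporal and static features concrete, here is a minimal sketch of a diachronic embedding in the form given by Goel et al. (2020), with sine as the activation. The function name, the plain-list parameters, and the toy values are assumptions for illustration; in a trained model the vectors would be learnable parameters.

```python
import math


def diachronic_embedding(a, w, b, t, gamma):
    """Return the embedding of one entity at time t.

    a: amplitude vector of length d (hypothetical learned values)
    w, b: frequency and phase vectors for the temporal part
    gamma: fraction of the d elements that are time-dependent
    """
    d = len(a)
    n_temporal = int(gamma * d)
    z = []
    for n in range(d):
        if n < n_temporal:
            # temporal feature: amplitude * sin(frequency * t + phase)
            z.append(a[n] * math.sin(w[n] * t + b[n]))
        else:
            # static feature: independent of t
            z.append(a[n])
    return z


# Toy entity with d = 4 and half of the features temporal.
z = diachronic_embedding(
    a=[1.0, 2.0, 3.0, 4.0],
    w=[0.0, 1.0],
    b=[math.pi / 2, 0.0],
    t=0,
    gamma=0.5,
)
```

Only the first two elements depend on `t` here; the last two are returned unchanged for every timestamp.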

Diachronic embedding vs. temporal subgraph. While DyHGN constructs a temporal subgraph for each time step, the diachronic embedding takes an entity and its timestamp as input and provides a hidden representation for the entity at that time, where the parameters of the hidden representations are learned from the data. We obtain a diachronic representation of each node that carries temporal and structural information by computing the DE-DistMult scores for all relations in which the entity is involved and aggregating them.

Now, we explain in detail how we incorporate the diachronic embedding in a heterogeneous setting into our DyHGN model. We start from the node level features. For every edge in the structural graph, with its two end nodes, its relation type, and its timestamp, we compute the DE-DistMult score of their diachronic embeddings. Then, we use an LSTM (Long Short-Term Memory) layer to aggregate the scores of the edges that are linked to the same node. Finally, we concatenate the obtained diachronic embedding of each node to the node feature table. The output of this operation is the augmented node feature table. Using an LSTM deals nicely with the variable number of edges that a node can be involved in.
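The per-edge scoring step can be sketched as follows. The DistMult score is the sum of the element-wise product of the head, relation, and tail embeddings. For the per-node aggregation, the sketch uses a simple average as a stand-in, which is a deliberate simplification: the model described here aggregates with an LSTM layer. All names and toy values are hypothetical.

```python
def distmult_score(z_head, z_rel, z_tail):
    """DistMult score: sum over dimensions of head * relation * tail."""
    return sum(h * r * u for h, r, u in zip(z_head, z_rel, z_tail))


def average_scores_per_node(edges, scores):
    """Average the edge scores over the edges each node takes part in.

    Stand-in for the LSTM aggregation; `edges` is a list of (head, tail)
    node identifiers and `scores` the matching list of edge scores.
    """
    totals, counts = {}, {}
    for (head, tail), s in zip(edges, scores):
        for node in (head, tail):
            totals[node] = totals.get(node, 0.0) + s
            counts[node] = counts.get(node, 0) + 1
    return {node: totals[node] / counts[node] for node in totals}


score = distmult_score([1.0, 2.0], [3.0, 4.0], [5.0, 6.0])
per_node = average_scores_per_node(
    edges=[("acc1", "ip1"), ("acc2", "ip1")],
    scores=[2.0, 4.0],
)
```

In the toy example, the shared IP node receives the mean of the two edge scores, while each account keeps the score of its single edge.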

2.5.2. DyHGN-DE and DyHGN-DE-HGT

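Now, we explain the two DyHGN-* variants. (1) (DyHGN-DE) We directly apply the DyHGN model to the new input, i.e., the node features augmented with diachronic embeddings together with the structural and temporal subgraphs. (2) (DyHGN-DE-HGT) We first apply a Heterogeneous Graph Transformer (HGT) layer (Hu et al., 2020b) to the structural subgraph and the augmented node features, and take its output as the new node representation. HGT has node- and edge-type dependent parameters, which makes our graph convolution layer heterogeneous as well. Finally, we apply the DyHGN model to the HGT output together with the structural and temporal subgraphs.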

| Node Type | Count | Edge Type | Count |
|---|---|---|---|
| Account ID | 111,691 | Temporal Edge | 135,614 |
| Address | 7,221 | Account ID - Address | 29,217 |
| IP Address | 6,762 | Account ID - IP Address | 104,719 |
| Phone | 4,958 | Account ID - Phone | 18,542 |
| Email | 134 | Account ID - Email | 608 |
| TOTAL | 130,766 | TOTAL | 288,700 |

Table 2. Node & Edge Count/Types for the MassReg Graph.
First graph (transaction-centric edges):

| Node Type | Count | Edge Type | Count |
|---|---|---|---|
| Transaction | 207,749 | Temporal Edge | 350,846 |
| Account ID | 28,815 | Transaction - Account ID | 124,818 |
| Payment token | 22,273 | Transaction - Payment | 102,569 |
| Address | 7,138 | Transaction - Address | 199,957 |
| Email | 25,878 | Transaction - Email | 185,560 |
| TOTAL | 291,853 | TOTAL | 963,750 |

Second graph (account-centric edges):

| Node Type | Count | Edge Type | Count |
|---|---|---|---|
| Account ID | 28,815 | Temporal Edge | 488,654 |
| Transaction | 207,749 | Account ID - Transaction | 124,818 |
| Payment token | 22,273 | Account ID - Payment token | 101,728 |
| Address | 7,138 | Account ID - Address | 117,158 |
| Email | 25,878 | Account ID - Email | 124,796 |
| TOTAL | 291,853 | TOTAL | 957,154 |

Table 3. Node & Edge Count/Types for the xFraud Graphs.

3. Experiments, Evaluation, and Discussions

In this section, we introduce our datasets and their preprocessing (Sec. 3.1), show the experimental setups (Sec. 3.2 and 3.3), evaluate the model performances (Sec. 3.4) with ablation studies (Sec. 3.5), and discuss our findings (Sec. 3.6).

3.1. Dataset, Preprocessing, and Evaluation Metric

We use two eBay data sources (MassReg and xFraud [4]), from which we derive three application scenarios: (a1) detecting suspicious account registration (MassReg), (a2) flagging risky transactions (xFraudTxn), and (a3) identifying risky accounts (xFraudAccount). In Tables 2 and 3, we report node and edge types and their counts.

[4] The ebay-small dataset in (Rao et al., 2022).

MassReg dataset. This dataset was sampled from the real-time account registration logs from September to December 2019. We create a heterogeneous graph where each node has a type in {account, address, IP, phone, email}, referring to account, registration address, IP address, phone, and email, respectively. Account features are 264-dimensional vectors, encoding risk features generated by an in-house risk detection system. Each account is flagged as benign or suspicious. The labels are either generated automatically by rule-based filtering on transaction behaviors (e.g., riskiness, payment rejection, chargeback, abusive buyers, etc.) or assigned by manual annotation based on the registration rules. The dataset is balanced, with about 50% suspicious registrations, yet these registrations are not evenly spread across time: there is a peak of suspicious registrations during the first two weeks of our time window (cf. Figure 4 in Appendix C). For a more detailed description of MassReg, refer to (Rao et al., 2020).

xFraud dataset. This dataset contains financial transactions carried out on the platform across 6 weeks. Transaction records have a rich set of relations, which enables us to derive two application scenarios: (a2) detecting risky transactions and (a3) flagging suspicious accounts. For this, we create two heterogeneous graphs where, in both graphs, each node has one of five types: transaction, payment token, email, shipping address, or buyer. For a more detailed description of xFraud, refer to Rao et al. (2022) (the ebay-small dataset therein). We present the two graphs as follows.


  • xFraudTxn. If a transaction has a relation with a node of another type, we put an edge between the two nodes. Each transaction carries node attributes as a 114-dimensional vector, encoding different features, e.g., item type, device type, and the IP from which the transaction was made. Each transaction is flagged as legit or fraud, which is what we want to predict.

  • xFraudAccount. If an account has a relation with a node of another type, we put an edge between the two nodes. Each account has features that are initialized as the average of the attributes of the transactions to which it is connected.

Different from the unevenly distributed labels across time in the MassReg graph, we see an even distribution of labels in the xFraud dynamic graphs. We provide the detailed graphs of labels distribution across time in Figure 4 (Appendix C). To prevent data leakage, accounts’ labels are not used in the transaction classification task, and reciprocally for transactions’ labels in the account classification. Since we only have transactions timestamps in this dataset, we define the date of account creation as the day of their first (possible) transaction. These datasets are highly imbalanced, with 1.5% of fraudulent transactions and 3.5% of fraudulent accounts (note that we report the average percentage across time stamps).

For all three application scenarios, we perform a binary classification task on nodes in a transductive setting (similar to Rossi et al. (2020)), where all edges are available during training and node labels are split chronologically in the ratio of 70%-10%-20% [5].

[5] While the test set was split chronologically, the train/validation split was performed randomly, which can be questionable since we are mainly interested in the temporal aspect of the graph. For this reason, we also evaluate the DyHGN model with a chronological Train/Val split. We observed a slight decrease in performance for all models, but it did not change the score significantly.
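A chronological 70%-10%-20% split of labeled nodes can be sketched as below. This shows the fully chronological variant; the `(node_id, timestamp)` input format and the function name are assumptions for the example, and integer truncation of the split sizes is one of several reasonable choices.

```python
def chronological_split(nodes, train=0.7, val=0.1):
    """Split (node_id, timestamp) pairs into train/val/test by time order.

    The oldest nodes go to train, the next block to validation,
    and the most recent nodes to test.
    """
    ordered = sorted(nodes, key=lambda item: item[1])
    n = len(ordered)
    n_train = int(n * train)
    n_val = int(n * val)
    ids = [node_id for node_id, _ in ordered]
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


# Ten hypothetical nodes; n0 is the most recent, n9 the oldest.
toy_nodes = [(f"n{i}", 10 - i) for i in range(10)]
train_ids, val_ids, test_ids = chronological_split(toy_nodes)
```

Because the test block always contains the latest timestamps, the model is never evaluated on nodes older than those it was trained on.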

Since we have binary labels, we use a popular evaluation metric in risk management: the average precision (AP), which corresponds to the area under the precision-recall curve. We prefer this metric over the Area Under the ROC Curve (ROC-AUC) since we care more about correctly classifying the positive class, and we also deal with imbalanced data. However, we still report the AUC scores in Appendix A for the interested reader.
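For readers unfamiliar with the metric, the following sketch computes average precision as the mean of the precision values at the ranks where a positive example is retrieved. It is a plain-Python illustration that ignores tied scores; the function name and toy data are hypothetical, and it is not the evaluation code used for the reported results.

```python
def average_precision(labels, scores):
    """Average precision for binary labels (1 = positive) and risk scores."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")
    # Rank examples from the highest to the lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / n_pos


# Positives are ranked first and third: AP = (1/1 + 2/3) / 2.
ap = average_precision(labels=[1, 0, 1, 0], scores=[0.9, 0.8, 0.7, 0.1])
```

On highly imbalanced data such as the xFraud graphs, a random ranking yields an AP close to the fraction of positives, which gives a useful reference point when reading AP values.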

3.2. Baselines

We select simple GNN models as baselines, namely the graph convolutional network (GCN) and the graph attention network (GAT). They are designed to model homogeneous graphs, but their performance on heterogeneous graphs has been re-evaluated by Lv et al. (2021), who show that vanilla GAT can even outperform existing HGNNs in most cases. We also compare our models to Simple-HGN (Lv et al., 2021). We report below the detailed settings of each model.


  • GCN is one of the first GNN models. It uses average aggregation from neighbors, with the objective of learning a shared "convolution kernel" that can be applied to every part of the graph in order to absorb the information from the neighbors into each node.

  • GAT uses an attention mechanism to perform a weighted aggregation from one-hop neighbors. Along with GCN, these homogeneous GNNs can handle heterogeneous graphs by simply ignoring node and edge types.

  • Simple-HGN. Starting from the GAT model, it includes edge type information in the attention calculation through a learnable edge-type embedding, making it possible to model heterogeneous graphs. The model is also enhanced with residual connections and $L_2$ normalization on the output embedding.

3.3. Implementation Details

All models are implemented in PyTorch and were run on an NVIDIA TITAN X with 12 GB of memory. Concerning the diachronic embedding, since the timestamps in our MassReg dataset are dates rather than single numbers, we apply the temporal part of Equation (1) to week and day separately (with different parameters) and thus obtain two temporal vectors. Then we take an element-wise sum of the resulting vectors to obtain a single temporal vector. Intuitively, this can be viewed as converting a date into a timestamp in the embedded space. We report in Table 8 (Appendix B) the hyperparameters chosen for our models after hyperparameter tuning on the validation set. We train our models with 5 different seeds and report the average scores.
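The handling of dates can be illustrated with a short sketch: the sine-based temporal part is evaluated once for the week and once for the day, each with its own parameters, and the two vectors are summed element-wise. The parameter layout `(a, w, b)`, the function names, and the toy values are assumptions for illustration only.

```python
import math


def temporal_part(a, w, b, t):
    """Temporal features a[n] * sin(w[n] * t + b[n]) for a scalar time t."""
    return [a_n * math.sin(w_n * t + b_n) for a_n, w_n, b_n in zip(a, w, b)]


def date_temporal_vector(week, day, week_params, day_params):
    """Element-wise sum of the week-based and the day-based temporal vectors."""
    week_vec = temporal_part(*week_params, week)
    day_vec = temporal_part(*day_params, day)
    return [x + y for x, y in zip(week_vec, day_vec)]


# Hypothetical parameters: (amplitudes, frequencies, phases) per time unit.
week_params = ([1.0, 1.0], [0.0, 0.0], [math.pi / 2, 0.0])
day_params = ([2.0, 2.0], [0.0, 0.0], [math.pi / 2, 0.0])
vec = date_temporal_vector(week=3, day=5, week_params=week_params, day_params=day_params)
```

Keeping separate parameters per time unit lets the week and the day contribute different frequencies to the same embedded dimensions.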

3.4. Performance Evaluation

Now we present the results of DyHGN-* models on three datasets.

3.4.1. Overall Performances: AP scores

We report the AP scores of our models and their baselines in Table 4. Comparing the results across models, we make observations as follows:

  1. The power of GAT. GAT is a very strong model (which is in line with the findings in Lv et al. (2021)), especially on datasets whose labels are evenly distributed across time, like xFraudTxn and xFraudAccount.

  2. The power of temporal modeling. We use the "DE" module to model time-related entity embeddings across different timestamps. In MassReg (a dataset with an uneven label distribution), our DyHGN and DyHGN-DE models outperform GAT. This is because the message passing from previous timestamps is better learned in DyHGN-(DE) models with a temporal focus (node/edge appearance and disappearance). Also, the LSTM layer helps in learning long-term dependencies from the past. In MassReg, where the temporal patterns fluctuate strongly across time, DyHGN-DE is well suited to model the dynamics. On the xFraud datasets, where the temporal patterns do not vary much over time, DyHGN-DE does not perform strongly. Another possible reason why DyHGN-DE does better on MassReg is that we were able to use a larger diachronic embedding there than for the other datasets (due to the GPU memory limit), as shown in Table 8 (Appendix B). Therefore, using a larger diachronic embedding size could be beneficial.

  3. The power of heterogeneous graph transformer. In DyHGN-DE-HGT, we adopt the "HGT" module as the heterogeneous graph transformer between different types of nodes and edges. We notice that a complex convolution like HGT does not always outperform Simple-HGN (Lv et al., 2021).

  4. Varying performances among DyHGN-* models. It is worth noting that DyHGN-DE-HGT does not outperform the other two DyHGN-* variants and GAT on all three datasets. This suggests that modeling the temporal aspects contributes substantially to the prediction, and that using a more complex convolution (HGT vs. GCN/GAT) might deteriorate performance.

| Model | MassReg | xFraudTxn | xFraudAccount |
| --- | --- | --- | --- |
| GCN | 0.7995 ± 0.0058 | 0.1380 ± 0.0126 | 0.0606 ± 0.0039 |
| GAT | 0.8193 ± 0.0049 | 0.1504 ± 0.0087 | 0.1638 ± 0.0041 |
| Simple-HGN | 0.7566 ± 0.0253 | 0.0615 ± 0.0069 | 0.2588 ± 0.0267 |
| DyHGN | 0.8239 ± 0.0086 | 0.0950 ± 0.0018 | 0.1195 ± 0.0525 |
| DyHGN-DE | 0.8298 ± 0.0032 | 0.0323 ± 0.0010 | 0.0625 ± 0.0038 |
| DyHGN-DE-HGT | 0.8047 ± 0.0159 | 0.0321 ± 0.0037 | 0.0669 ± 0.0095 |

Table 4. Experiment Results with the Average Precision (AP) score. Our models are denoted as DyHGN-*. “DE”: diachronic embedding, “HGT”: heterogeneous graph transformer.

3.4.2. Ablation Studies

We describe ablation studies on the MassReg dataset (where DyHGN-* outperforms the baselines) with several variants of the proposed models to provide a better understanding of their performance.

GCN vs. DyHGN-*. We compare two model variants, DE + HGT and DyHGN-DE-HGT. The main difference is that the former has only the structural subgraph using HGT (no temporal subgraph), whereas the latter has both subgraphs. The former performs worse than the latter. This comparison indicates that diachronic embeddings, temporal subgraphs, and structural subgraphs are all important in modelling graph dynamics, and that diachronic embeddings and the temporal subgraph capture different aspects of the dynamics.

Aggregation of diachronic embedding. So far, we have used an LSTM layer to aggregate the DE-DistMult scores of the edges linked to a node. We also tried taking the simple average of all scores, but observed worse performance with that variant. This suggests that the model benefits from the LSTM layer's ability to capture long-term dependencies.
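To make the simple-average variant concrete, the following is a minimal sketch, not our implementation. It assumes the diachronic embedding form of Goel et al. (2020), in which a fraction gamma of the dimensions oscillates with time while the rest stays static, and it assumes that the per-edge score is kept as the element-wise DistMult product rather than summed to a scalar. All parameter values and the names `diachronic_embedding`, `de_distmult_vector`, and `mean_aggregate` are hypothetical; the LSTM variant is not shown.

```python
import math

def diachronic_embedding(a, w, b, t, gamma=0.5):
    """Goel et al. (2020)-style embedding: the first gamma*d entries
    oscillate with time t, the remaining entries are static."""
    d = len(a)
    n_temporal = int(gamma * d)
    return [
        a[i] * math.sin(w[i] * t + b[i]) if i < n_temporal else a[i]
        for i in range(d)
    ]

def de_distmult_vector(src, rel, dst):
    """Element-wise DistMult product; summing it gives the scalar score."""
    return [s * r * o for s, r, o in zip(src, rel, dst)]

def mean_aggregate(vectors):
    """Simple-average aggregation over all edges linked to one node."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Hypothetical parameters for one account and two linking entities (d = 4).
account = dict(a=[0.5, 1.0, 0.2, 0.8], w=[0.1, 0.3, 0.0, 0.0], b=[0.0, 1.0, 0.0, 0.0])
entities = [
    dict(a=[1.0, 0.4, 0.6, 0.1], w=[0.2, 0.1, 0.0, 0.0], b=[0.5, 0.0, 0.0, 0.0]),
    dict(a=[0.3, 0.9, 0.7, 0.2], w=[0.4, 0.2, 0.0, 0.0], b=[0.0, 0.3, 0.0, 0.0]),
]
relation = [1.0, 0.5, 1.0, 0.5]  # static relation embedding (assumed values)
t = 3.0

src = diachronic_embedding(t=t, **account)
edge_vectors = [
    de_distmult_vector(src, relation, diachronic_embedding(t=t, **entity))
    for entity in entities
]
node_representation = mean_aggregate(edge_vectors)
print(node_representation)
```

An order-aware aggregator such as an LSTM would consume `edge_vectors` as a sequence instead of collapsing them with an unweighted mean, which is the difference this ablation probes.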

Adding diachronic embedding for relations. Following Goel et al. (2020), we hypothesize that relation evolution is negligible compared with node evolution, so that modeling relations with a static rather than a diachronic embedding suffices. We tested this hypothesis by running our models with relation embeddings that are also a function of time. We observed no significant improvement, meaning that modeling the evolution of relations is not helpful in our setting.

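| account | day | week | relations_address | relations_ip | relations_phone | relations_email | snapshots_address | snapshots_ip | snapshots_phone | snapshots_email |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 27 | 3 | 0 | 41 | 0 | 0 | 0 | 13 | 0 | 0 |
| 1 | 63 | 9 | 4 | 2 | 9 | 0 | 2 | 2 | 2 | 0 |
| 2 | 23 | 3 | 2 | 28 | 0 | 0 | 2 | 1 | 0 | 0 |
| 3 | 58 | 8 | 0 | 8 | 0 | 0 | 0 | 3 | 0 | 0 |
| 4 | 20 | 2 | 0 | 121 | 0 | 0 | 0 | 10 | 0 | 0 |

Table 5. Example of Graph-derived Features.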

| Split | Global | Incremental |
| --- | --- | --- |
| Random train/test split | 0.8246 | 0.7953 |
| Chronological train/test split | 0.5598 | 0.5301 |

Table 6. AP Scores of XGBoost. “Global”: using features computed from all time stamps, “Incremental”: using only features available until the time of the account's creation.


3.5. Deep Dive into DyHGN-* using ML Models with Graph-derived Features

To assess the performance of our models, we establish another series of ML baselines (logistic regression, random forest, XGBoost) using purely graph-derived features. Indeed, after studying multiple graph features (e.g., Table 5) for the MassReg dataset, we observe, for instance, a correlation between the total number of relations of the linking entities (shipping address, phone number, …) and the account riskiness (see Figure 3). A natural explanation is that suspicious accounts are oftentimes registered using a common shipping address, such as a warehouse, or telephone numbers that are listed as spam calls. As a consequence, these linking entities are linked to many accounts and are indicative of suspicious activities. Since DyHGN-* models perform best on MassReg, we only conduct the ML experiments on this dataset.

Feature description. For each account in MassReg, we derive 8 features from the graph structural and temporal information. We provide an example to illustrate them in Table 5. Row 1 can be read as follows: account n°1 was created on day 63 (in week 9); it gave a shipping address and a phone number that were used 4 and 9 times, respectively, both appearing in two different time snapshots; this account also used an IP address that has been used twice and appeared in two different time snapshots; it did not provide any email address.
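The feature derivation described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not our pipeline: the input record layout, the function name `graph_features`, and the toy registrations are hypothetical; a time snapshot is assumed to be one week (`week = day // 7`, which is consistent with the day/week pairs in Table 5); and "number of times used" is counted as the number of distinct accounts linked to an entity.

```python
from collections import defaultdict

ENTITY_TYPES = ("address", "ip", "phone", "email")

def graph_features(registrations):
    """registrations: list of dicts with keys 'account', 'day', and
    optionally one value per entity type (missing or None = not provided)."""
    users = defaultdict(set)      # (type, value) -> accounts linked to the entity
    snapshots = defaultdict(set)  # (type, value) -> weeks in which it appears
    for reg in registrations:
        week = reg["day"] // 7
        for etype in ENTITY_TYPES:
            value = reg.get(etype)
            if value is not None:
                users[(etype, value)].add(reg["account"])
                snapshots[(etype, value)].add(week)

    features = {}
    for reg in registrations:
        row = {"day": reg["day"], "week": reg["day"] // 7}
        for etype in ENTITY_TYPES:
            key = (etype, reg.get(etype))
            # A missing entity is never a key, so both counts fall back to 0.
            row[f"relations_{etype}"] = len(users.get(key, ()))
            row[f"snapshots_{etype}"] = len(snapshots.get(key, ()))
        features[reg["account"]] = row
    return features

# Hypothetical toy registrations: three accounts share one shipping address.
registrations = [
    {"account": "a1", "day": 2, "ip": "10.0.0.1", "address": "warehouse"},
    {"account": "a2", "day": 3, "ip": "10.0.0.1", "address": "warehouse"},
    {"account": "a3", "day": 9, "address": "warehouse", "email": "x@example.com"},
]
features = graph_features(registrations)
print(features["a1"])
```

In this toy example the shared address yields `relations_address = 3` and `snapshots_address = 2` for every account, because it is used by three accounts across weeks 0 and 1.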

Incremental setting. In Table 6, we report the results of the best ML performer, XGBoost, under various settings. The global model in Table 6 relies on the features of the whole graph over all time stamps, which means that we take the features of the linking entities regardless of the time snapshot in which the account was created. In contrast, we can also construct the same features incrementally, meaning that we only take the features from the graph that are available at the time of the account’s creation. As expected, the incremental setting performs worse than the global one. However, it is common in graph-related tasks to investigate both cases (cf. Hu et al., 2020a).
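The difference between the two settings can be illustrated with a small sketch. It is a simplification under stated assumptions: each account is linked to a single entity, the function name `relation_counts` and the toy data are hypothetical, and "available at the time of creation" is taken to mean registrations with a day no later than that of the account.

```python
from collections import defaultdict

def relation_counts(registrations, incremental):
    """registrations: list of (account, day, entity) tuples.
    Returns {account: number of registrations sharing its entity}."""
    days_by_entity = defaultdict(list)
    for _, day, entity in registrations:
        days_by_entity[entity].append(day)

    counts = {}
    for account, day, entity in registrations:
        days = days_by_entity[entity]
        if incremental:
            # Only registrations visible when this account was created.
            counts[account] = sum(1 for d in days if d <= day)
        else:
            # Global: every registration in the whole observation window.
            counts[account] = len(days)
    return counts

# Hypothetical toy data: one IP address is reused on days 1, 5, and 9.
registrations = [("a1", 1, "ip1"), ("a2", 5, "ip1"), ("a3", 9, "ip1"), ("a4", 9, "ip2")]
global_counts = relation_counts(registrations, incremental=False)
incremental_counts = relation_counts(registrations, incremental=True)
print(global_counts)
print(incremental_counts)
```

The first account on a later heavily reused IP address looks unremarkable in the incremental view (count 1) but already carries the full count (3) in the global view, which is one way the global features can leak information from the future.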

In the DyHGN-* architecture, we allow the edges and nodes in the test set to appear already in the training set, since the whole structural and temporal subgraphs are given as input to the GCN layers. This is a global setting. In the case of flagging frauds in a dynamic setting, both global and incremental settings are useful: the former allows us to pre-train a large model on historical+current data and capture fraudulent patterns with long-term dependencies, while the latter allows us to capture newly established fraudulent patterns better.

Train/test split on the performance. In Table 6, we report the results of our best ML performer, XGBoost, using two different types of train/test split. We observe a large discrepancy between them. For comparison, we also evaluated DyHGN on a random train/test split and found an AP score of about 0.89, which is about 7% higher than the score with the chronological split. While this confirms that a random split is overly optimistic compared to a chronological split, the discrepancy in the scores observed with global and incremental features also indicates that it is hard to predict the future based on past account creation logs. This aligns with our finding from Figure 4 in Appendix C that, in MassReg, the uneven distribution of the labels across time makes it difficult to predict future labels based on the past.
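The two kinds of split compared above can be sketched as follows. The function names, the 80/20 ratio, and the toy accounts are assumptions made for illustration; the exact ratios used in our experiments are not implied by this sketch.

```python
import random

def random_split(accounts, test_ratio=0.2, seed=0):
    """Shuffle accounts, ignoring creation time."""
    shuffled = accounts[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * test_ratio)
    return shuffled[:cut], shuffled[cut:]

def chronological_split(accounts, test_ratio=0.2):
    """Train on the earliest accounts, test on the most recent ones."""
    ordered = sorted(accounts, key=lambda account: account["day"])
    cut = len(ordered) - round(len(ordered) * test_ratio)
    return ordered[:cut], ordered[cut:]

# Hypothetical accounts, one created per day.
accounts = [{"id": i, "day": i} for i in range(10)]

rand_train, rand_test = random_split(accounts)
chrono_train, chrono_test = chronological_split(accounts)

print([a["day"] for a in chrono_test])  # the two most recent days
print(sorted(a["day"] for a in rand_test))
```

With the chronological split, every test account is created after every training account, so the model is evaluated on predicting the future; with the random split, test accounts are interleaved with training accounts in time.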

Feature importance. One advantage of using handcrafted features is that the model can be explained more easily (vs. GNN explainability in Rao et al. (2022)): we can look at the Shapley (SHAP) values to identify the key features of the graph. The SHAP values show the impact of each feature on the prediction (x-axis), i.e., whether the feature contributes positively or negatively; features are listed on the y-axis in descending order of importance. In Figure 3, we plot the SHAP values of the best performer (XGBoost with global features). Note that its performance is on par with the best DyHGN model (DyHGN-DE). The use frequency of an entity (e.g., IP, address, phone) is a crucial indicator of fraud. The time dimension (e.g., day, week) is also influential when flagging fraud. We see that if an account provides an IP address that has been used multiple times (high feature value of relations_ip), it is more likely to be a risky account (the impact on the model output leans toward the positive class). However, if the IP address appears in multiple time snapshots (snapshots_ip), the account is less likely to be risky. Indeed, since we are concerned with massive registration of suspicious accounts here, we are mostly looking for IP addresses that are used frequently within a restricted time period. The impact of snapshots_email and relations_email follows a pattern similar to that of IP. For shipping addresses and phone numbers, the interpretation is not as easy. We see that rarely used addresses and phone numbers are mostly linked with non-risky accounts, but when they are heavily used they can indicate either legitimate or fraudulent accounts. On the temporal aspects, accounts created at the beginning (low day and week numbers) are more likely to be suspicious, which correlates with the label distribution of the MassReg dataset shown in Figure 4. This is reflected in the AP performance of DyHGN-DE, where long-term dependencies from the past can be learned.
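To illustrate what a Shapley value measures, the sketch below computes exact Shapley values by enumerating all feature coalitions for a toy model. It is not the procedure behind Figure 3: the linear `toy_risk` function, its weights, the instance, and the baseline are hypothetical stand-ins for the XGBoost model, and features outside a coalition are simply set to a baseline value, which is only one of several possible conventions.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, instance, baseline):
    """Exact Shapley values; the cost is exponential in the number of features."""
    names = list(instance)
    n = len(names)

    def value(coalition):
        # Features in the coalition take the instance value, others the baseline.
        mixed = {k: (instance[k] if k in coalition else baseline[k]) for k in names}
        return model(mixed)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                total += weight * (value(coalition | {name}) - value(coalition))
        phi[name] = total
    return phi

def toy_risk(x):
    # Hypothetical score: heavy IP reuse raises risk, spread over snapshots lowers it.
    return 0.5 * x["relations_ip"] - 0.3 * x["snapshots_ip"]

instance = {"relations_ip": 10, "snapshots_ip": 2}
baseline = {"relations_ip": 1, "snapshots_ip": 1}
phi = shapley_values(toy_risk, instance, baseline)
print(phi)
```

The values sum to the difference between the model output on the instance and on the baseline, which is the additivity property that makes per-feature plots such as Figure 3 readable as contributions to a single prediction.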

To summarize, looking at ML models trained on graph-derived features helps us gain more insight into learning graph dynamics. It can also guide model interpretability when using complex models such as HGNNs and DyHGN.

Figure 3. Shapley Values in XGBoost with Global Features.

3.6. Discussions

We share thoughts about modelling graph dynamics with GCN, GAT, HGT and diachronic embeddings and discuss the architecture efficiency in prototyping and production.

Modeling dynamic graphs with “Attention”. Lv et al. (2021) discuss the efficiency and performance of 10 models (incl. GAT) on 7 datasets, but none of the models and datasets include graph dynamics. Our results echo their findings in the setting of graph dynamics. In line with the findings in Lv et al. (2021), on datasets with evenly distributed labels across time (xFraudTxn, xFraudAccount), GAT significantly outperforms models with HGT. Since the latter take longer to train, it is beneficial to use GAT on such dynamic datasets. For the dataset with unevenly distributed labels across time (MassReg), our models (DyHGN-DE and DyHGN) perform better than GAT.


Note that the largest DyHGN-* model can take between 4 and 15 min per epoch (on average 5.5 for MassReg, 14 for xFraudTxn, 9 for xFraudAccount), because we need to group by source nodes and the current implementation does not include batch training. In the future, we plan to extend the implementation with (mini-)batch training once the prototype needs to run on an industrial-scale dataset as in xFraud (Rao et al., 2022), where a deployment in a distributed setting is also explored and implemented.

4. Related Work

We review several key areas relevant to our work.

Heterogeneous Graph Neural Networks. Learning in heterogeneous graphs has gained interest in recent years, and recent works aim to generalize the traditional GCN and GAT to heterogeneous graphs (Hu et al., 2020b; Wang et al., 2019b; Hong et al., 2020; Zhang et al., 2019a). These models specify node types when constructing graphs and perform sampling over different types of nodes during message passing, which brings improvement in modelling heterogeneous graphs.

Dynamic Graph Neural Networks. The temporal dynamics are oftentimes investigated within a homogeneous graph setting. One representative work is DySAT (Sankar et al., 2020) that discusses an approach to learn deep neural representations on dynamic homogeneous graphs via self-attention networks. DySAT is applicable in cases where all entities in a graph have a dynamic perspective. However, in our application case, we need to differentiate between two types of entities in the time dimension: (1) accounts and transactions being dynamically added/removed across time, (2) hard linking entities like registration addresses and telephone numbers that stay static across time. Dynamic graphs could also be represented as a sequence of time events, which is the approach used in Temporal Graph Networks (TGN) (Rossi et al., 2020). Applying memory modules and graph-based operators, the model is able to outperform other approaches for link prediction in both transductive and inductive settings while being more computationally efficient.

Fraud Detection. In the context of fraud detection, we already described DyHGN (Rao et al., 2020), which tackles suspicious massive registration detection with a homogeneous graph neural network that is composed of two subgraphs, a structural one and a temporal one, and that can be applied to heterogeneous graphs. xFraud (Rao et al., 2022) uses a self-attentive heterogeneous graph neural network as a predictor in a static setting and provides an explainer that generates meaningful insights to facilitate further processing in business units. Lambda Neural Network (LNN) (Lu et al., 2021) uses a directed dynamic snapshot linkage design for graph construction to ensure that the information passed through neighbours only comes from the past. While LNN performs node classification in an inductive setting, Asynchronous Propagation Attention Network (APAN) (Wang et al., 2021) adopts a temporal encoding similar to TGN (Rossi et al., 2020) and decouples graph computation and inference to perform edge classification regarding fraudulent interactions.

5. Conclusion

In this paper, we investigate the benefits of building a dynamic heterogeneous graph neural network (DyHGN) with a diachronic entity embedding to provide characteristics of entities at any point in time. We also make the model heterogeneous and compare it with its homogeneous counterpart. We conduct experiments on three real-world graphs from application scenarios at eBay and benchmark several models on them. We find that our models are the best performers only in specific settings with uneven label distributions across time, and we also discuss the trade-off between performance and computation cost. In addition, we share insights on feature importance using Shapley values when investigating temporal graphs with XGBoost trained on graph-derived features. This angle provides inspiration for future work in this direction.


References

  • R. Goel, S. M. Kazemi, M. Brubaker, and P. Poupart (2020) Diachronic embedding for temporal knowledge graph completion. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 3988–3995.
  • H. Hong, H. Guo, Y. Lin, X. Yang, Z. Li, and J. Ye (2020) An attention-based graph neural network for heterogeneous structural learning. In AAAI, pp. 4132–4139.
  • W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec (2020a) Open graph benchmark: datasets for machine learning on graphs. Advances in neural information processing systems 33, pp. 22118–22133.
  • Z. Hu, Y. Dong, K. Wang, and Y. Sun (2020b) Heterogeneous graph transformer. In Proceedings of The Web Conference 2020, pp. 2704–2710.
  • A. Li, Z. Qin, R. Liu, Y. Yang, and D. Li (2019) Spam review detection with graph convolutional networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 2703–2711.
  • C. Liang, Z. Liu, B. Liu, J. Zhou, X. Li, S. Yang, and Y. Qi (2019) Uncovering insurance fraud conspiracy with network learning. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1181–1184.
  • Z. Liu, C. Chen, L. Li, J. Zhou, X. Li, L. Song, and Y. Qi (2019) Geniepath: graph neural networks with adaptive receptive paths. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4424–4431.
  • Z. Liu, C. Chen, X. Yang, J. Zhou, X. Li, and L. Song (2018) Heterogeneous graph neural networks for malicious account detection. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 2077–2085.
  • M. Lu, Z. Han, Z. Zhang, Y. Zhao, and Y. Shan (2021) Graph neural networks in real-time fraud detection with lambda architecture. arXiv preprint arXiv:2110.04559.
  • Q. Lv, M. Ding, Q. Liu, Y. Chen, W. Feng, S. He, C. Zhou, J. Jiang, Y. Dong, and J. Tang (2021) Are we really making much progress? revisiting, benchmarking, and refining heterogeneous graph neural networks.
  • J. Ma, D. Zhang, Y. Wang, Y. Zhang, and A. Pozdnoukhov (2018) GraphRAD: a graph-based risky account detection system.
  • S. X. Rao, S. Zhang, Z. Han, Z. Zhang, W. Min, Z. Chen, Y. Shan, Y. Zhao, and C. Zhang (2022) XFraud: explainable fraud transaction detection. VLDB.
  • S. X. Rao, S. Zhang, Z. Han, Z. Zhang, W. Min, M. Cheng, Y. Shan, Y. Zhao, and C. Zhang (2020) Suspicious massive registration detection via dynamic heterogeneous graph neural networks. arXiv preprint arXiv:2012.10831.
  • E. Rossi, B. Chamberlain, F. Frasca, D. Eynard, F. Monti, and M. Bronstein (2020) Temporal graph networks for deep learning on dynamic graphs. arXiv preprint arXiv:2006.10637.
  • A. Sankar, Y. Wu, L. Gou, W. Zhang, and H. Yang (2020) Dysat: deep neural representation learning on dynamic graphs via self-attention networks. In Proceedings of the 13th International Conference on Web Search and Data Mining, pp. 519–527.
  • S. Shekhar, D. Pai, and S. Ravindran (2020) Entity resolution in dynamic heterogeneous networks. In Companion Proceedings of the Web Conference 2020, pp. 662–668.
  • J. Wang, R. Wen, C. Wu, Y. Huang, and J. Xion (2019a) Fdgars: fraudster detection via graph convolutional networks in online app review system. In Companion Proceedings of The 2019 World Wide Web Conference, pp. 310–316.
  • X. Wang, H. Ji, C. Shi, B. Wang, Y. Ye, P. Cui, and P. S. Yu (2019b) Heterogeneous graph attention network. In The World Wide Web Conference, pp. 2022–2032.
  • X. Wang, D. Lyu, M. Li, Y. Xia, Q. Yang, X. Wang, X. Wang, P. Cui, Y. Yang, B. Sun, et al. (2021) APAN: asynchronous propagation attention network for real-time temporal graph embedding. In Proceedings of the 2021 International Conference on Management of Data, pp. 2628–2638.
  • M. Weber, G. Domeniconi, J. Chen, D. K. I. Weidele, C. Bellei, T. Robinson, and C. E. Leiserson (2019) Anti-money laundering in bitcoin: experimenting with graph convolutional networks for financial forensics. arXiv preprint arXiv:1908.02591.
  • R. Wen, J. Wang, C. Wu, and J. Xiong (2020) ASA: adversary situation awareness via heterogeneous graph convolutional networks. In Companion Proceedings of the Web Conference 2020, pp. 674–678.
  • B. Yang, W. Yih, X. He, J. Gao, and L. Deng (2014) Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575.
  • C. Zhang, D. Song, C. Huang, A. Swami, and N. V. Chawla (2019a) Heterogeneous graph neural network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 793–803.
  • Y. Zhang, Y. Fan, Y. Ye, L. Zhao, and C. Shi (2019b) Key player identification in underground forums over attributed heterogeneous information network embedding framework. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 549–558.

Appendix A Experiment Results using AUC score

We report the AUC scores of our models and their baselines in Table 7. The AUC score corresponds to the Area Under the Curve of the Receiver Operating Characteristic, which is obtained by plotting the true positive rate against the false positive rate at various thresholds. We make the following observations:

  1. GAT outperforms DyHGN-* models according to the AUC metric on the MassReg dataset, while DyHGN-* models perform better in terms of the AP score in Table 4. We observe that the differences in AUC scores between models are tighter than those in the AP metric on all datasets. However, we prefer the AP score, since it is more sensitive to improvements for the positive class, which is our main focus. For xFraudTxn and xFraudAccount, GAT and Simple-HGN remain the best performers, as the AP scores indicate.

  2. Compared with the results in Rao et al. (2022), our AUC is higher because we run more epochs on the models. However, this comes at the cost of a more complicated design of graph construction and model architecture. Therefore, when one wants to model the graph dynamics, one should do it with “attention” (!) and benchmark different solutions against the computation cost.
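The difference between the two metrics can be made concrete with a small, self-contained sketch. The functions below are plain re-implementations for illustration (they assume no tied scores for AP and at least one positive and one negative label), and the ten-example ranking is hypothetical.

```python
def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits

def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical ranking: two risky accounts at ranks 2 and 5 out of 10.
scores = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
labels = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]

ap = average_precision(labels, scores)
auc = roc_auc(labels, scores)
print(ap, auc)
```

On this ranking the AUC is 0.75 while the AP is only 0.45: a few negatives ranked above the positives barely affect the AUC when negatives are plentiful, but they directly lower the precision at each positive, which is why AP reacts more strongly to changes in how the positive class is ranked.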

| Model | MassReg | xFraudTxn | xFraudAccount |
| --- | --- | --- | --- |
| GCN | 0.8627 ± 0.0044 | 0.7647 ± 0.0078 | 0.5793 ± 0.0284 |
| GAT | 0.8806 ± 0.0026 | 0.8880 ± 0.0066 | 0.7572 ± 0.0178 |
| Simple-HGN | 0.8260 ± 0.0204 | 0.6124 ± 0.0316 | 0.8456 ± 0.0270 |
| DyHGN | 0.8749 ± 0.0059 | 0.7878 ± 0.0060 | 0.7527 ± 0.0733 |
| DyHGN-DE | 0.8793 ± 0.0031 | 0.6288 ± 0.0065 | 0.6779 ± 0.0081 |
| DyHGN-DE-HGT | 0.8600 ± 0.0078 | 0.6263 ± 0.0175 | 0.6851 ± 0.0051 |

Table 7. AUC Scores. “DE”: diachronic embedding, “HGT”: heterogeneous graph transformer.

Appendix B Hyperparameters

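| Model | n_layer (MassReg / xFraudTxn / xFraudAccount) | n_hid (MassReg / xFraudTxn / xFraudAccount) | dropout | optimizer | lr | max_epochs | patience | n_heads | diachronic_embedding (MassReg / xFraudTxn / xFraudAccount) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GCN | 4 / 4 / 4 | 256 / 256 / 256 | 0.1 | adamw | 0.001 | 2048 | 64 | - | - / - / - |
| GAT | 8 / 2 / 2 | 256 / 256 / 128 | 0.1 | adamw | 0.001 | 2048 | 64 | 4 | - / - / - |
| Simple-HGN | 2 / 2 / 2 | 256 / 64 / 256 | 0.1 | adamw | 0.001 | 2048 | 64 | 4 | - / - / - |
| DyHGN | 4 / 2 / 4 | 256 / 256 / 128 | 0.1 | adamw | 0.001 | 2048 | 64 | - | - / - / - |
| DyHGN-DE | 4 / 2 / 2 | 256 / 128 / 128 | 0.1 | adamw | 0.001 | 128 | 64 | - | 60 / 10 / 10 |
| DyHGN-DE-HGT | 4 / 2 / 2 | 256 / 128 / 128 | 0.1 | adamw | 0.001 | 128 | 64 | 4 | 30 / 10 / 10 |

Table 8. Hyperparameters on our dataset. “DE”: diachronic embedding, “HGT”: heterogeneous graph transformer.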

We list the hyperparameters of the GNN models in Table 8. The best XGBoost parameters are colsample_bytree: 0.9, eval_metric: error, gamma: 0, learning_rate: 0.1, max_depth: 7, min_child_weight: 5, n_estimators: 200, objective: binary:logistic, reg_alpha: 0, reg_lambda: 1, scale_pos_weight: 1, subsample: 0.7.
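For convenience, the XGBoost parameters listed above can be written as a Python dictionary. The variable name `xgb_params` is our choice for this sketch; the keys and values are copied from the list above.

```python
# Best XGBoost parameters reported above, collected in one dictionary.
xgb_params = {
    "colsample_bytree": 0.9,
    "eval_metric": "error",
    "gamma": 0,
    "learning_rate": 0.1,
    "max_depth": 7,
    "min_child_weight": 5,
    "n_estimators": 200,
    "objective": "binary:logistic",
    "reg_alpha": 0,
    "reg_lambda": 1,
    "scale_pos_weight": 1,
    "subsample": 0.7,
}

for name, value in sorted(xgb_params.items()):
    print(f"{name}: {value}")
```

Assuming the `xgboost` package is installed, such a dictionary can be unpacked into the scikit-learn style estimator, e.g. `xgboost.XGBClassifier(**xgb_params)`.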

Appendix C Distribution of risky nodes across time

Figure 4 shows the distribution of labels across time for the three datasets. We observe that the MassReg dataset is particularly unevenly distributed: the proportion of risky accounts varies from 40% to 65% and has two peaks at weeks 2 and 8 (the peak at week 13 can be discarded since this week was truncated and thus includes fewer accounts). In contrast, the xFraud datasets are more evenly distributed. We believe this major difference to be the reason why certain models work well on the MassReg dataset but not on the xFraud datasets. Indeed, in the case of unevenly distributed datasets, the diachronic embeddings from the previous time stamps might be helpful in learning long-distance dependencies.

Figure 4. Distribution of risky accounts across time for the MassReg dataset and the two xFraud datasets.
Figure 5. t-SNE visualization of the diachronic embedding at the beginning (epoch 0, AP 0.430) and at the end (epoch 170, AP 0.815) of the training.

Appendix D Other Experiments concerning the Diachronic Embedding

In order to fully investigate the potential effect of the diachronic embedding, we carried out ablation studies in addition to the ones already described in Section 3.4.2. Note that these experiments were only carried out on the MassReg dataset, since it is there that DyHGN-* models have outperformed the GNN baselines in AP scores.

Types of scores. In MassReg, when we consider a relation between an account (source node) and a linking entity (address, email, phone number, IP address), the information carried by the edge is redundant with the information carried by the linking entity (target node), since the edge already indicates the type of the target node. Therefore, we tried to compute the DE-DistMult score with the source node and the relation only. This gave us a boost in terms of computation time per epoch (about 3 minutes per epoch); however, we observed a decrease in performance. This could stem from the fact that target nodes are used multiple times, so the diachronic embedding can exploit this information, which is beneficial to the model.

Embedding size. We experimented with different embedding sizes and found that a larger embedding size seems to entail a higher score. However, increasing the embedding size is computationally heavy in terms of both space and time; therefore, one needs to find a trade-off between complexity and performance.

Visualization. We visualize the evolution of the diachronic embedding throughout training in Figure 5. The visualization is obtained by applying t-SNE to the diachronic embedding at different epochs of the training. Each point represents an account of the MassReg dataset. The color of the point indicates the label of the account (0/blue for non-risky and 1/orange for risky).

We observe that at epoch 0, there is no distinction between the two labels. The clusters that we observe are actually indicative of the number of linking entities (email, phone number, address, etc.) that the account has. These clusters are natural, since the diachronic embedding of a given account is obtained by averaging the scores of all the relations of this account. Then, at epoch 170, we see that the two classes are partly separated, which indicates that the diachronic embedding was able to learn to classify the accounts to some extent. However, we also observe that many accounts are still mixed with accounts of the opposite label, which might explain why the diachronic embedding does not improve the performance of the model very significantly. We also see different degrees of separation across clusters (linking entities); this corresponds to the feature importance (SHAP values and their impact on the prediction) shown in Figure 3 and discussed in Sec. 3.5.