Navigating the Dynamics of Financial Embeddings over Time

by Antonia Gogoglou, et al.

Financial transactions constitute connections between entities, and through these connections a large-scale heterogeneous weighted graph is formulated. In this labyrinth of interactions that are continuously updated, there exists a variety of similarity-based patterns that can provide insights into the dynamics of the financial system. With the current work, we propose the application of Graph Representation Learning in a scalable dynamic setting as a means of capturing these patterns in a meaningful and robust way. We proceed to perform a rigorous qualitative analysis of the latent trajectories to extract real-world insights from the proposed representations and their evolution over time, which is to our knowledge the first analysis of its kind in the financial sector. Shifts in the latent space are associated with known economic events, and in particular the impact of the recent Covid-19 pandemic on consumer patterns. Capturing such patterns indicates the value added to financial modeling through the incorporation of latent graph representations.





1 Introduction

Financial transactions, from credit card payments to stock purchases, can be viewed as edges on a graph where the nodes represent the parties involved in the transaction. This graph can contain features at various levels. For instance, the edges can carry information such as transaction amount and frequency, while the nodes themselves can contain rich features such as FICO score, income, and account balance. This structure lends itself to the application of Graph Neural Networks, as shown previously in [4] and [21].

At the same time, graphs of financial transactions have some unique properties not typical of the graphs used to develop GNNs. They can exhibit extreme power-law degree distributions. They are often heterogeneous, with a high level of variance in degree across nodes of different types. Additionally, they tend to be highly non-stationary in a multi-dimensional way. Firstly, new edges can form between nodes. Secondly, the weights of existing edges can change over time. Finally, the set of nodes active on the network can grow and/or shrink.

For the purpose of our work, we focus on credit card transactions. Credit card transactions form a bipartite graph between account holders and the merchants they shop at. Transactions are processed in a continuous stream; to define this bipartite graph, one must therefore look at a specific window in time, defining a snapshot for each of T discrete windows. This graph is defined over the set of accounts and merchants active in the time window t, and the weight placed on an edge can be defined as the frequency of transactions in that window. Between any two consecutive time windows the following might be true:

  • New accounts and merchants will appear

  • Previously observed accounts and merchants will disappear

  • New edges will be observed

  • Old edges will have their weight updated
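As a concrete illustration, the windowed snapshot construction can be sketched in a few lines of Python. The transaction records and field names below are hypothetical toy data, not the paper's actual schema:

```python
from collections import Counter
from datetime import date

# Toy transaction stream of (account, merchant, timestamp) records.
transactions = [
    ("acct_1", "CoffeeCo", date(2020, 3, 2)),
    ("acct_1", "CoffeeCo", date(2020, 3, 9)),
    ("acct_2", "CoffeeCo", date(2020, 3, 5)),
    ("acct_2", "GroceryCo", date(2020, 4, 1)),  # outside the March window
]

def snapshot(txns, year, month):
    """Weighted bipartite edges for one monthly window: the weight of an
    (account, merchant) edge is the transaction frequency in the window."""
    return Counter(
        (acct, mrch)
        for acct, mrch, ts in txns
        if ts.year == year and ts.month == month
    )

g_march = snapshot(transactions, 2020, 3)
```

Edges outside the window simply do not exist in that snapshot, which is how accounts, merchants, and edges appear and disappear between windows.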

Representation learning comprises a well-established set of techniques for embedding high-cardinality sets (e.g. nodes in a graph or words in a vocabulary) in lower dimensional vector spaces in a way that preserves their locality and co-occurrence structure. Graph representation learning is often used as a generalized approach to feature generation from a graph structure that can be used in a number of downstream applications; in financial services these include fraud detection and credit decisions [21], [4]. Many of the traditional representation learning techniques assume stationarity in the underlying structures that they embed. For instance, word embedding models are trained once on very large corpora and can be used for many downstream NLP tasks for years. Out-of-vocabulary words are often dropped or replaced with an unknown-word token. Recent work has started to look at dynamic language and graph representation learning. However, the assumption of stationarity is only slightly relaxed, as the datasets change slowly over time.

To solve the problem of graph representation learning on non-stationary financial graphs, this paper provides the following:

  • A description of our framework for training shallow embeddings from highly dynamic graphs over multiple timeframes that can empower downstream applications.

  • An extensive in-depth qualitative analysis of embedding shift over time and identification of meaningful shifts.

  • Association of temporal patterns in the representation space with real-world transaction dynamics (e.g. shopping patterns, market changes) and merchant categories, with a particular focus on the effects of the Covid-19 pandemic on financial transactions.

  • Time series analysis to filter rotational noise from the dynamic retraining process and demonstration of how short-term representation shift can be effectively inferred from prior shift of the embedding space.

  • In light of recent events related to the global pandemic, we investigate the ability of dynamic representations to quantify the extent of spending shifts, which can prove useful in supporting decision makers in times of crisis.

Our analysis unveils the expressive power of embeddings for financial entities and their ability to encode the dynamics of consumer patterns in a dense representation fit for multiple downstream applications.

2 Related Works

Graph representation learning seeks to embed the nodes of a network into a low dimensional vector space in such a way that the topological properties of the network are preserved [19], [10], [14], [3]. Recent work investigating the application of these techniques to financial graphs shows promise as a general approach to modeling rich transaction networks [4], [21]. Much of the research in graph representation learning, including its applications in finance, has focused on static graphs. However, recent work on learning dynamic representations can be found both in the word embedding literature and in graph-based approaches.

2.1 NLP Approaches

Language is not static. New words are introduced, old words go out of fashion, and words can change their meaning. Research in NLP seeks to model these semantic changes in word usage over time. There are three main approaches explored within NLP for allowing words to evolve their meaning over time: chained initialization approaches, latent variable approaches, and temporal alignment approaches. An important thing to note about dynamic word embedding methods is that the time scales of change in language at a global level can be much longer, and more robust to exogenous influences, than those of financial transaction graphs.

Chained initialization approaches tackle the problem of embeddings over time pragmatically by modifying static word embedding models to allow them to update dynamically over time. The simplest approach is to use the embeddings of the previous time period as the initialization for the current time period and retrain on the current data [13], [22]. Other variants use this same methodology but make use of word2vec's two embedding spaces, one for the observed words and one for the context, by fixing the context space over time and allowing the observed word vectors to train in each time period [5]. The baseline methodology presented here for financial transactions most closely resembles these approaches.

Latent variable approaches take a Bayesian formulation of the problem, assuming that the words observed in a corpus are draws from a latent distribution parameterized by the word embeddings. The Bayesian formulation allows them to explicitly model the dynamics of word evolution in a Markovian manner. In [20], vectors of words and their contexts are used to parameterize a Bernoulli distribution which defines the probability of the word in that context. Similar to [5], the context vectors are fixed on some reference point while the word vectors are allowed to update over time. Another approach, proposed by [1], is to treat the evolution of words as an Ornstein-Uhlenbeck process to ensure the evolutionary process does not stray too far from the origin. This work proposes two Kalman inference methods: one for a streaming setting (filtering) and one where the entirety of the corpus is available (smoothing). We present our work utilizing Kalman filtering over the dynamic embeddings as a meta-model following chained initialization training.

Both of these approaches seek to chain the training of individual word embeddings over time. A third approach, taken by [11], seeks to achieve alignment at a more global scale. In this approach, embeddings at different time periods are trained independently of one another to allow them to capture the specific semantic structures within each time period. However, most embedding algorithms are not rotation invariant and therefore their outputs cannot be compared when trained independently in this way. This approach therefore learns an alignment matrix between each pair of time periods to ensure the coordinate systems are aligned.

2.2 Graph Approaches

Graph approaches to dynamic graph problems have focused on transmitting changing graph information through time while also allowing full expressiveness at each time step. One approach is to track this temporal information by learning dynamic representations through time using recurrent architectures [18, 8]. In [9], a combination of chained initialization and dynamically widening and/or deepening techniques on deep auto-encoders is employed to preserve stability across time steps. Additionally, the importance of community changes over time is underlined as a metric that captures the meaningful shift between snapshots of the graph. Other approaches seek to make the temporal dependencies between graphs explicit by injecting this information into the embeddings themselves. In [15], the concept of node trajectory is introduced and the embeddings of each snapshot are treated as instances of a time series of the overall trajectory of the node. Temporal dependencies are seen as a physical process, and a context vector can be created to track this evolution and provide usable insight into shifting embeddings. [23] further explores the temporal dependencies between graphs by investigating long-term topological changes in graph structure and short-term interactions between nodes, treating these short-term interactions as a temporal point process parameterized by an attention mechanism.

All of these approaches extend static graph methods to dynamic ones by examining the relationships between graphs stacked on top of each other; by introducing novel architectures, or changing existing ones to maintain a temporal element, they capture temporal dependencies effectively. However, the increase in the volume of information quickly makes large temporal graph problems intractable as more and more time steps are considered. Simplifications, such as restricting how much the set of nodes can change, are necessary to compensate for the increase in graph size as well as the increase in parameters for more complicated models.

3 Methodology

3.1 Generating Dynamic Embeddings

As discussed in Section 1, there are particular challenges with the representation of highly dynamic graphs such as transaction graphs. To address these challenges we devise a training framework that ensures both coherence and flexibility across multiple time frames. The static baseline method for each snapshot of the transaction graph is described in [4]. The incoming data are represented in tabular format, where each data entry corresponds to a particular transaction between an account and a merchant associated with a given timestamp. A bipartite graph of accounts and merchants connected through transactions can subsequently be inferred from this dataset. As described in [7], random walks can be generated across different types of nodes for heterogeneous graphs. Building off this concept, and to enhance flexibility and training capacity, we consider each node of a particular type as a bridge that facilitates a connection between two nodes of another type by interacting with both of them. For merchant-type nodes this results in {merchant, account, merchant} triplets that represent a connection between two merchants if the same account has interacted with both of them. Similarly, accounts are linked by performing transactions at the same merchant within the time window. This results in two homogeneous projections of the original heterogeneous graph. In the present work, we focus on connections between Brand Level Merchants, meaning that merchants are represented by their brand name without distinguishing between different locations of the same merchant. This approach results in a highly interconnected graph representing spending patterns across the nation, and this connectivity can be efficiently controlled by tuning the time window appropriately.
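The merchant-side projection described above can be sketched as follows; the account and merchant names are made up for illustration:

```python
from collections import defaultdict
from itertools import combinations

def merchant_pairs(edges):
    """Project the bipartite graph onto merchants: a {merchant, account,
    merchant} triplet links two merchants whenever the same account
    transacted with both. `edges` is an iterable of (account, merchant)
    tuples from one time window; pair counts act as edge weights."""
    by_account = defaultdict(set)
    for acct, mrch in edges:
        by_account[acct].add(mrch)
    weights = defaultdict(int)
    for merchants in by_account.values():
        for pair in combinations(sorted(merchants), 2):
            weights[pair] += 1
    return dict(weights)

pairs = merchant_pairs([
    ("acct_1", "CoffeeCo"), ("acct_1", "GroceryCo"),
    ("acct_2", "CoffeeCo"), ("acct_2", "GroceryCo"),
    ("acct_2", "GasCo"),
])
```

The symmetric account-side projection follows by swapping the roles of the two node types.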

This link construction approach can be viewed as formulating random walks of length two with a context window of one. The resulting pairs are subsequently fed into a skip-gram model [17] to generate dense low dimensional representations for each node. We posit that, given the noisiness and high interconnectivity of transactional data, contemplating longer-range interactions through larger walk lengths would make it challenging to guarantee that a negatively sampled merchant does not also appear as part of the positive context. Considering a larger context window would pose the risk of interrelating merchants that are not actually meaningfully similar. The frequency of appearance of each training pair represents the strength of the corresponding link, i.e. the edge weight.
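For intuition, a single skip-gram update with negative sampling on one such pair might look like the toy sketch below. This is a didactic re-implementation, not the authors' production trainer; the node names, learning rate, and dimensions are arbitrary:

```python
import math

def sgd_step(z, ctx, pos_pair, neg_nodes, lr=0.025):
    """One skip-gram negative-sampling update on a (target, context)
    merchant pair. `z` and `ctx` map node -> embedding (target and
    context spaces, as in word2vec)."""
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    target, context = pos_pair
    samples = [(context, 1.0)] + [(n, 0.0) for n in neg_nodes]
    grad_target = [0.0] * len(z[target])
    for node, label in samples:
        score = sigmoid(sum(a * b for a, b in zip(z[target], ctx[node])))
        g = (score - label) * lr
        for i in range(len(grad_target)):
            grad_target[i] += g * ctx[node][i]   # accumulate target gradient
            ctx[node][i] -= g * z[target][i]     # update context vector
    for i in range(len(grad_target)):
        z[target][i] -= grad_target[i]           # apply target gradient

z = {"MerchantA": [0.1, 0.2]}
ctx = {"MerchantB": [0.3, -0.1], "MerchantC": [-0.2, 0.4]}
sgd_step(z, ctx, ("MerchantA", "MerchantB"), ["MerchantC"])
```

One step pulls the positive pair's vectors together and pushes the negatively sampled merchant away, which is exactly why a negative sample that is secretly a true neighbor (the risk noted above) would corrupt training.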

In order to capture the state of the transaction graph at different moments in time, we form monthly snapshots at the end of each month following the aforementioned methodology. One issue that has been discussed in the literature [16, 12] is the random rotation of the embedding space in the skip-gram model. This rotation would make each snapshot's embeddings radically different in value from the previous one's, and their use in downstream models would therefore be compromised. To address this matter, we opt for warm-start training, where each snapshot's model is initialized with the previous month's final state. This leads to nodes being cumulatively added to the embedding space. Once a merchant has entered the embedding space, its position is maintained and is updated only when new training pairs appear that include this merchant. Newly introduced nodes are added to the embedding space with a random initialization.
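The warm-start step can be sketched as follows, assuming embeddings live in a simple node-to-vector mapping; the dictionary layout and initialization range are illustrative choices:

```python
import random

def warm_start(prev_emb, current_nodes, dim=64, seed=0):
    """Chained initialization: carry over last month's vectors untouched;
    nodes appearing for the first time get a small random initialization.
    Nodes absent this month keep their old position (they simply receive
    no gradient updates)."""
    rng = random.Random(seed)
    emb = {node: vec[:] for node, vec in prev_emb.items()}
    for node in current_nodes:
        if node not in emb:
            emb[node] = [rng.uniform(-0.5, 0.5) / dim for _ in range(dim)]
    return emb

prev = {"MerchantA": [1.0, 2.0]}
emb = warm_start(prev, ["MerchantA", "MerchantB"], dim=2)
```

Because vectors are only ever added or updated in place, the embedding space grows cumulatively across snapshots, as described above.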

Notation Consider a set of snapshots {G_1, …, G_T} corresponding to time stamps t = 1, …, T. For each time step t, a set of positive context pairs D_t^+ is generated based on co-occurrence of transactions within the given time window, while pairs D_t^− are sampled for negative context. In terms of probabilities, the objective is to maximize the log likelihood of the positive context pairs being observed as opposed to the negative pairs:

    L_t = Σ_{(i,j) ∈ D_t^+} log σ(z_i · z_j) + Σ_{(i,j) ∈ D_t^−} log σ(−z_i · z_j)

where z_i denotes the embedding of node i and σ is the sigmoid function.
The embedding lookup table, whose number of rows N is the cardinality of the node set, is updated only for the nodes that are present in the context pairs of the current month. However, N includes all the unique nodes that have appeared in context pairs up to time stamp t. Details of the training process are depicted in Figure 1.

Figure 1: Framework overview for dynamic training of skip-gram model for merchant embeddings with chained initialization over T sequential time stamps.

Dataset Our dataset consists of a set of Brand Level Merchants and their transactions over monthly snapshots between 2017-11 and 2020-03. For each merchant, we have an externally provided category code that identifies the type of business. Based on the dynamic training described above, the number of merchants active per time stamp averages 50,000, while the total size of the embedding space extends to 260,000 merchants.

3.2 Quantification of Temporal Shifts

The main focus of this work is to investigate the effectiveness of our dynamic training framework and explore the behavior of the resulting representations. In this section we will describe the methodology we designed to extract that knowledge from multiple snapshots of the embedding space including a series of downstream models that we applied on the sequence of representations.

Measuring shift Since nodes in the latent space are represented as dense vectors, we select a set of first-order difference statistics to calculate their shift over time. The goal is to associate representation shift as measured in the latent space with real-world shift, i.e. semantic shift. Our selected metrics are Euclidean distance and cosine distance, which capture both magnitude shift in the absolute values of the embedding dimensions and similarity shift as measured by the angle between embedding vectors. For a node i at time stamp t with embedding z_i^t, δ_euc and δ_cos are defined as:

    δ_euc(i, t) = || z_i^t − z_i^{t−1} ||_2
    δ_cos(i, t) = 1 − (z_i^t · z_i^{t−1}) / ( ||z_i^t||_2 ||z_i^{t−1}||_2 )
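In code, the two shift metrics amount to the following minimal pure-Python sketch:

```python
import math

def delta_euc(z_prev, z_curr):
    """Magnitude shift: Euclidean distance between consecutive vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z_prev, z_curr)))

def delta_cos(z_prev, z_curr):
    """Similarity shift: one minus the cosine of the angle between them."""
    dot = sum(a * b for a, b in zip(z_prev, z_curr))
    norms = (math.sqrt(sum(a * a for a in z_prev))
             * math.sqrt(sum(b * b for b in z_curr)))
    return 1.0 - dot / norms
```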
We proceed to investigate how these statistics change over time and what seasonal patterns arise from observing deviations from the average shift. Of particular interest is the timing of maximum shift for each merchant or a subset of them. For each merchant i across a set of time stamps t = 1, …, T, the maximum shift is defined as:

    t_max(i) = argmax_t [ δ(i, t) / Σ_j δ(j, t) ]
The normalization by the cumulative shift of all the nodes within time stamp t ensures that the shift ranking is sensitive to nodes that drifted more than the rest of the nodes did.
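A sketch of this normalized argmax over time stamps, assuming each node's shift series is available as a plain list:

```python
def max_shift_month(shifts_by_node):
    """shifts_by_node maps node -> list of shift values, one per time
    stamp. Returns, per node, the index of its maximum shift normalized
    by the cumulative shift of all nodes in that time stamp."""
    n_stamps = len(next(iter(shifts_by_node.values())))
    totals = [sum(series[t] for series in shifts_by_node.values())
              for t in range(n_stamps)]
    return {
        node: max(range(n_stamps), key=lambda t: series[t] / totals[t])
        for node, series in shifts_by_node.items()
    }
```

With the normalization, a month where every node drifts (e.g. after a market-wide event) does not automatically dominate each node's individual max-shift timing.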

Another important measure of shift is the formulation of a node's neighborhood in the representation space. Over different time stamps, and given that nodes from a previous snapshot maintain their position in the space if they do not receive any updates, we expect the neighborhood to evolve but not deviate largely between consecutive snapshots. By defining the neighborhood N_k^t(i) as the k highest-ranking nodes in cosine similarity around a target node i, neighborhood shift can be quantified as:

    shift_k(i, t, τ) = | N_k^t(i) ∩ N_k^{t−τ}(i) | / k
where τ is the number of time stamps elapsed between two different calculations of the top-k neighborhoods for every node, for varying values of k. Per-node shifts are subsequently aggregated across all nodes.
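A minimal sketch of the top-k neighborhood computation and the normalized intersection of Equation 5, using toy embeddings and brute-force search rather than whatever nearest-neighbor index a production system would use:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def topk_neighbors(target, embeddings, k):
    """k nearest nodes to `target` by cosine similarity (brute force)."""
    others = [n for n in embeddings if n != target]
    others.sort(key=lambda n: cosine(embeddings[target], embeddings[n]),
                reverse=True)
    return set(others[:k])

def neighborhood_shift(nbrs_old, nbrs_new, k):
    """Normalized intersection of two top-k neighborhoods."""
    return len(nbrs_old & nbrs_new) / k

emb = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0], "D": [-1.0, 0.0]}
```

A value of 1 means the neighborhood is unchanged between the two time stamps; 0 means complete turnover.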

Shift trajectories In order to formalize the quantification of shift over time, we aggregate the first-order statistics across timeframes that differ by τ and create a time series of shifts:

    s_i = ( δ(i, t), δ(i, t+τ), δ(i, t+2τ), … )
The resulting time series may be viewed as a state space model where individual components such as a financial trend component or a seasonal one are combined to produce the observed values of shift. By decomposing the time series into its constituent components we aim to identify the effects of the trend component as well as the expected random component that arises from the rotation of the embedding space after multiple rounds of Skip-gram training.

We employ filtering of the time series, which entails the estimation of the current value of the state from past and current observations. A particular method of time series filtering that has been utilized in conjunction with embedding approaches is Kalman filtering [15, 1]. The Kalman filter operates on a series of measurements observed over time, e.g. the embedding shift, and assumes that they contain some level of Gaussian noise or inaccuracy, using the time series to estimate the underlying state x_t from the observations y_t:

    x_t = A x_{t−1} + w_t,   w_t ~ N(0, Q)
    y_t = H x_t + v_t,       v_t ~ N(0, R)
Building on that assumption, it produces estimates that tend to be more accurate than those based on individual measurements alone. Therefore, it can provide an estimation of the true shift trajectory across time stamps. To explore the trajectories themselves, we compute neighborhood shift based on Equation 5. We applied Kalman smoothing to the normalized embeddings elementwise, assuming an independent difference vector, namely velocity, for each embedding component, and calculating the next time step's components as a linear combination of predicted and actual values computed by an EM algorithm. We therefore compute the normalized difference vector, or velocity:

    v_i^t = z̃_i^t − z̃_i^{t−1}

where z̃_i^t is the normalized embedding and v_i^t lies in R^d, with d the dimension of the embedding vectors.
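A scalar Kalman filter over one component of the shift series can be sketched as below; the noise parameters q and r are illustrative stand-ins for values that would in practice be fit with EM, and a simple random-walk state model is assumed:

```python
def kalman_filter(observations, q=1e-3, r=0.05):
    """Scalar Kalman filter over one shift series: predict the state
    forward, then correct it toward each noisy observation by the
    Kalman gain. q is process noise, r is observation noise."""
    x, p = observations[0], 1.0   # initial state estimate and variance
    estimates = [x]
    for y in observations[1:]:
        p = p + q                 # predict: variance grows with process noise
        k = p / (p + r)           # Kalman gain
        x = x + k * (y - x)       # correct toward the new observation
        p = (1.0 - k) * p
        estimates.append(x)
    return estimates
```

Run on a noisy, roughly flat shift series, the estimates hug the underlying level more tightly than the raw observations do, which is the behavior exploited here to separate true trajectory from rotational noise.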
Predicting shift

Finally, a predictive sequence model is applied to attempt inference of future shifts given past observations. If there are meaningful short-term and/or long-term dependencies, then a sequence model should be able to identify them and outperform a naive average-based baseline. Additionally, by altering the size of the considered sequence we can discover the shift rate of transaction patterns and the duration of trends. A Long Short Term Memory (LSTM) model for regression [6] is utilized to predict the shift h time stamps in the future by looking back at s time stamps (sequence length), over a training period of m slices, which represents the training length on which the model is trained:

    δ̂(i, t+h) = LSTM( δ(i, t−s+1), …, δ(i, t) )
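The windowing used to feed such a sequence model, together with the naive moving-average baseline, can be sketched as:

```python
def make_sequences(series, seq_len, horizon=1):
    """Build (lookback, target) training pairs for a sequence regressor:
    each input is `seq_len` consecutive shift values and the target is
    the value `horizon` steps after the lookback window."""
    xs, ys = [], []
    for i in range(len(series) - seq_len - horizon + 1):
        xs.append(series[i:i + seq_len])
        ys.append(series[i + seq_len + horizon - 1])
    return xs, ys

def moving_average_baseline(window):
    """Naive baseline: predict the next shift as the mean of the lookback."""
    return sum(window) / len(window)
```

Varying `seq_len` corresponds to the sequence-length sweep in the experiments, and the baseline provides the frame of reference against which the LSTM is evaluated.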
4 Results

Utilizing the metrics and methods described in Section 3, we perform a rigorous qualitative analysis of the temporal evolution of merchant behavior as represented in the embedding space. The objective of this analysis is threefold. First, if seasonal patterns and well-known financial events are encoded in the embedding space, that increases the applicability and effectiveness of the proposed representation. Second, we aim to quantify meaningful shift that represents changes in transaction patterns, as opposed to arbitrary embedding space rotation, and ensure the consistency of the representation over time. Finally, by treating the sequence of the aforementioned statistics over different snapshots as a time series, we seek to address the predictability of this shift.

4.1 Shift distribution

Representation shift Figures 2 and 3 visualize the distribution of shifts across different time frames. Based on Figure 2, which depicts the shift distribution on a yearly and monthly cadence, we can see that embedding vectors gradually move away from their original position as time progresses, indicating the smoothness of the embedding trajectories and a directed motion. We notice that the average shift also reduces over time (Figure 3); however, the distribution is still far from stationary. As more nodes are added and maintain their position in the embedding space, the amount of shift in more recent time frames is mostly attributed to changes in transaction patterns, while in the original time frames shift is observed due to the first-time appearances of merchants. Additionally, we observe that even though the general trends are shared across categories of merchants, meaning spikes and valleys usually happen across all categories in the same time frames, there are some categories that steadily demonstrate higher deltas. For instance, Government, Services and Travel appear to maintain higher shift values across all time frames, probably due to the seasonal volatility experienced in Travel and Government, while Services represent a high percentage of the merchant space and thus incorporate the changes of a wide range of nodes.

Figure 2: Distribution of shift over different time frames on a yearly and monthly cadence, indicating smooth temporal shifts
Figure 3: Average shift over all time stamps for different merchant categories, indicating how the size of shifts differs significantly across types of merchants

Since the latent representations are trained to bring the most similar nodes closer together, it is interesting to see how the ranking of the closest neighbors evolves. Therefore we measure the neighborhood shift from Equation 5 for k = 10, 50 and 100 and time drift values of 2, 3 and 4. Results are exhibited in Figure 4. As expected, the longer the time elapsed, the smaller the intersection between the different versions of a merchant's neighborhood. However, the shift seems to be concentrated mostly in the closest neighbors (k=10), whereas the broader neighborhood (k=100) remains roughly stable. This can be attributed to general shopping patterns being relatively similar over the years, with small seasonal trends that alter the ranking within the neighborhood. In Table 1 we observe specific examples of the nodes that belong in the intersection of neighborhoods from four different time stamps for two well-known merchants, namely the Ritz-Carlton hotel and the Banana Republic retailer. As can be observed, the intersection between 2019-10 and 2019-12 is generally higher for both, compared to a longer time window of 5 months (2019-10 to 2020-03), indicating that similarity shifts happen gradually over time.

Semantic Shift The question arises whether all merchant embeddings shift at the same pace or whether there exist categories that showcase larger shifts in particular time periods. We measure the percentage of each merchant category and repeat this calculation amongst the top 10,000 shifting merchants between different time stamps. In Figure 5 we report results for time periods of interest. For this experiment, the set of merchants for each time stamp has been trimmed based on frequency of transactions, so that only nodes that received significant updates in this training round are included. The first chart showcases the effect of seasonal spending patterns, such as holiday seasons, that cause the categories of Retail and Social to exhibit the largest shift between December and February. Services as well as Travel follow a similar trend. An interesting pattern appears in the next two charts, comparing the shift in 2019-03 and 2020-03. It appears that Social, Retail and Services achieve higher percentages amongst the maximum shifting merchants compared to the same time the previous year, which is in line with the effects of the Covid-19 pandemic on consumer behavior.

Similar patterns appear when we explore the maximum shift in magnitude from Equation 4. From the time series of magnitude shift for each merchant, the month in which the merchant exhibited the highest shift relative to the total magnitude shift of all merchants in that same month is selected as the max shift month. By displaying the counts of merchants that exhibit their max shift in each given month (Figure 6), we notice trends appearing that coincide with known financial events. In the beginning of 2019, and then again in the summer of 2019, market volatility may have influenced consumer behavior. Interestingly, counts spike in the first few months of 2020, with a peak in March 2020, when the Covid-19 pandemic caused major changes in spending patterns.

Figure 4: Normalized intersection size between top-k neighborhoods over each time stamp and past ones, with varying time drift.
2019-10 to 2019-12 (k size 5):
  The Ritz-Carlton: Link Restaurant Group, La Meridien, Apple Store, Four Seasons Hotel, Nopsi Hotel, Double Tree, Ruby Slipper Cafe, Miami Beach Resort, Superior Seafood
  Banana Republic: Saks Fifth Avenue, Kiehl's Since 1851, Apple Store, Abercrombie and Fitch, David's tea

2019-10 to 2020-03 (k size 5):
  The Ritz-Carlton: Link Restaurant Group, Four Seasons Hotel, Nopsi Hotel, The Daily Beet
  Banana Republic: Gap, Neiman Marcus, Kiehl's Since 1851, Nike Outlet, Bath & Body Works

Table 1: Examples of top-k neighbors for time drift equal to 2 months (top row) and 5 months (bottom row)
Figure 5: Percentages of different merchant categories in the set of most frequent merchants and in the set of top 10,000 shifting merchants showing the periods when a category changes more than the rest
Figure 6: Normalized count of merchants that exhibit their maximum shift in each time stamp.

Moving towards a microscopic view of the relationship between representation shift and neighborhood evolution, we explore individual merchants and how their top-k neighbors across different time stamps relate to their cosine shift. Results are depicted in Figure 7. We consider the neighbors in the first time stamp (2017-11) and the last one (2020-03). If a node belongs to the original neighbor set, it is depicted in red even if it belongs to the final set of neighbors as well. The neighboring nodes represented with blue lines, however, are selected so that they are not part of the original set of neighbors. We search for top-k neighbors with k=10, but we remove the ones whose shift remained unchanged for more than 70% of the considered time stamps. Two patterns arise from this analysis. Firstly, seed nodes tend to have generally high similarity in representation shift with their neighbors (e.g. J Crew with Banana Republic, or Walmart with Lowe's). Secondly, merchants that emerge later as neighbors of the seed node tend to be more volatile and to have started following similar shift trends with the seed during intermediate time stamps. Equinox and Starbucks, for instance, with their neighbors SoulCycle and Wendy's, appear to follow common trends during the last 5 time stamps. These findings suggest that cosine shift correlates with meaningful changes in transaction patterns that alter neighborhood formation.

Figure 7: Cosine shift of seed node (depicted in black) and the respective shifts of their neighbors from the first time stamp (red) and the last one (blue) comparing the trajectories of seed node and neighbors

4.2 Shift Trajectories as Time Series

Visualizing and filtering trajectories Moving away from selecting the biggest shifts from the time series, we also explore the sequence as a whole and attempt to identify its constituent components. To explicitly visualize patterns in both magnitude and direction of embedding shift, we calculate 120 different 2-hop neighborhoods of merchants. After normalizing these embeddings, we smooth them using Kalman smoothing on a per-merchant basis and take their element-wise difference per time stamp to generate velocity vectors.

We first note that, according to Figure 8, smoothing lowered both the mean and variance of cosine distance across time stamps. We believe this indicates that arbitrary embedding shift was reduced, operating on the assumption that embeddings follow a linear trajectory in the embedding space, as assumed in the Kalman filter. This follows the results from [15, 1], where a similar concept was employed during embedding calculation rather than afterwards. We benefit from filtering after embedding generation rather than beforehand because the filter can be agnostic to the source of the data. It also allows us to use embeddings calculated with well-established embedding algorithms on massive datasets, with more downstream flexibility depending on the use case.

Subsequently, the velocity vectors from Equation 9 of all merchants across all time stamps are aggregated and passed through t-SNE dimensionality reduction. Dynamic t-SNE is a challenging problem and its limitations have been discussed in the literature [2]. In this work we opt to calculate t-SNE coordinates for all time stamps at once, so that comparisons are enabled across different snapshots of the embedding space. As we observe in Figure 9, for the non-smoothed velocity embeddings two clusters emerge that correspond to regions of high and low frequency. After smoothing (second chart), the frequency pattern is maintained, and by shading according to quartiles we notice a higher degree of separation among the clusters in the high-frequency region. In these clusters, the denser ones appear to be correlated with high cosine distance. Finally, in the last chart of Figure 9 we observe that neighborhood memberships expand across regions of zero and non-zero movement (low and high frequency), which indicates that the inclusion of non-updating nodes in our sequential training workflow allows nodes to maintain connections from previous time frames and expand their neighborhood in both highly changing and more stable regions. Both small and large merchant neighborhoods have this tendency; when reduced to a single merchant neighborhood, a large retailer and a single parking meter both show momentum from previous snapshots, even if their neighbors are speeding away from their original positions in the high cosine distance area.

Figure 8: Box plot of average cosine differences between merchant embeddings in each time step without smoothing on the left and with Kalman smoothing on the right.
Figure 9: t-SNE of 2020-03 velocity vectors shaded by log frequency, quartile cosine distance, and neighborhood membership out of 120 neighborhoods. Center and right are Kalman smoothed, while left chart is raw embeddings.

Predicting trajectories Finally, we explore the predictability of representation shift, as measured by the cosine distance between consecutive embeddings, and attempt to identify the effect of long- and short-term dependencies in the latent space. As discussed in Section 3, we employ an LSTM model trained over multiple past time stamps, with a sequence of previous shift values available at each time stamp. Since the transaction space is highly dynamic, extending the sequence length beyond a certain point could add little value to the estimation of the next shift value. Therefore the sequence length in our experiments varies between 1 and 7 months, meaning the shift of each month relative to the previous one is available for the past 1 to 7 months. Training length lies in the same range, to investigate whether a large number of training points is needed to make effective predictions. A naive baseline is added as a frame of reference, in which we estimate the next month's shift as a moving average of the past sequence values. For performance evaluation, Mean Square Error is reported. Results are shown in Tables 3 and 4 for the test months of two different time periods. A steady improvement over the baseline is observed for all sequence and training lengths, which indicates that non-trivial and perhaps repetitive temporal patterns arise in the embedding space. It appears that extending the sequence length up to the 5 previous time steps achieves the highest performance; increasing it further offers limited to no benefit. This could be attributed to seasonal trends that move embeddings further from their representation 7 time steps ago, as well as the overall rotation of the embedding space. Moreover, error values are on average higher for March 2020, when an unprecedented change in consumer behavior occurred due to the Covid-19 pandemic.
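The naive moving-average baseline and its evaluation can be written in a few lines. The shift matrix below is synthetic and the window size is one of the sequence lengths from the experiments; the LSTM itself is omitted here:

```python
import numpy as np

def moving_average_forecast(history, window):
    """Predict next month's shift per merchant as the mean of the
    last `window` observed shift values (the naive baseline)."""
    return np.mean(history[-window:], axis=0)

def mse(pred, true):
    """Mean Square Error between predicted and observed shifts."""
    return float(np.mean((pred - true) ** 2))

# shifts[t, m]: cosine-distance shift of merchant m at month t (synthetic)
rng = np.random.default_rng(1)
shifts = 0.5 + 0.1 * rng.normal(size=(12, 100))

# forecast the held-out final month from the preceding history
pred = moving_average_forecast(shifts[:-1], window=5)
err = mse(pred, shifts[-1])
```

Comparing `err` against the LSTM's test error on the same held-out month reproduces the evaluation protocol of the tables below, where both sequence (window) length and training length are varied.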

Table 3: Training up to 2019-12, test month 2020-01

  sequence length | Baseline |    LSTM (training length)
                  |          |    1      3      5      7
  ----------------+----------+---------------------------
         1        |   1.75   |  1.08   0.95   0.94   0.92
         3        |   1.52   |  0.68   0.64   0.62   0.61
         5        |   1.37   |  0.62   0.56   0.56   0.47
         7        |   1.38   |  0.63   0.59   0.58   0.50

Table 4: Training up to 2020-02, test month 2020-03

  sequence length | Baseline |    LSTM (training length)
                  |          |    1      3      5      7
  ----------------+----------+---------------------------
         1        |   1.38   |  0.95   0.91   0.91   0.89
         3        |   1.30   |  0.70   0.69   0.70   0.69
         5        |   1.21   |  0.69   0.68   0.68   0.62
         7        |   1.26   |  0.70   0.69   0.68   0.63

Table 2: Mean Square Error for the naive baseline model and the LSTM with different timeframes

5 Conclusion

This work introduces a dynamic scalable graph representation workflow deployed on financial graphs of transactions with accounts and merchants as entities. The resulting representations across a range of timestamps are evaluated for consistency and the concept of semantic shift is introduced as a means to measure the effectiveness of latent representations in real world application scenarios. We perform a multi-step qualitative analysis that is to our knowledge the first of its kind in the financial sector to extract trends and patterns from the evolution of the merchant representations over time.


  • [1] R. Bamler and S. Mandt (2017) Dynamic word embeddings. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 380–389. Cited by: §2.1, §3.2, §4.2.
  • [2] A. Boytsov, F. Fouquet, T. Hartmann, and Y. L. Traon (2017) Visualizing and exploring dynamic high-dimensional datasets with lion-tsne. CoRR abs/1708.04983. External Links: Link, 1708.04983 Cited by: §4.2.
  • [3] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst (2017) Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine 34 (4), pp. 18–42. Cited by: §2.
  • [4] C. B. Bruss, A. Khazane, J. Rider, R. T. Serpe, A. Gogoglou, and K. Hines (2019) DeepTrax: embedding graphs of financial transactions. 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), pp. 126–133. Cited by: §1, §1, §2, §3.1.
  • [5] V. D. Carlo, F. Bianchi, and M. Palmonari (2019) Training temporal word embeddings with a compass. ArXiv abs/1906.02376. Cited by: §2.1, §2.1.
  • [6] Ş. C. Ciucu (2013) Time series forecasting using neural networks. In Challenges of the Knowledge Society, pp. 1402–1408. Cited by: §3.2.
  • [7] Y. Dong, N. V. Chawla, and A. Swami (2017) Metapath2vec: scalable representation learning for heterogeneous networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’17, New York, NY, USA, pp. 135–144. External Links: ISBN 9781450348874 Cited by: §3.1.
  • [8] P. Goyal, S. R. Chhetri, and A. Canedo (2018) Dyngraph2vec: capturing network dynamics using dynamic graph representation learning. CoRR abs/1809.02657. External Links: Link, 1809.02657 Cited by: §2.2.
  • [9] P. Goyal, N. Kamra, X. He, and Y. Liu (2018) DynGEM: deep embedding method for dynamic graphs. CoRR abs/1805.11273. External Links: Link, 1805.11273 Cited by: §2.2.
  • [10] A. Grover and J. Leskovec (2016) Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855–864. Cited by: §2.
  • [11] W. L. Hamilton, J. Leskovec, and D. Jurafsky (2016) Diachronic word embeddings reveal statistical laws of semantic change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1489–1501. Cited by: §2.1.
  • [12] W. L. Hamilton, R. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Red Hook, NY, USA, pp. 1025–1035. External Links: ISBN 9781510860964 Cited by: §3.1.
  • [13] Y. Kim, Y. Chiu, K. Hanaki, D. Hegde, and S. Petrov (2014) Temporal analysis of language through neural language models. ACL 2014, pp. 61. Cited by: §2.1.
  • [14] T. N. Kipf and M. Welling (2017) Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of the 5th International Conference on Learning Representations, ICLR ’17. External Links: Link Cited by: §2.
  • [15] S. Kumar, X. Zhang, and J. Leskovec (2019) Predicting dynamic embedding trajectory in temporal interaction networks. In Proceedings of the 25th ACM SIGKDD international conference on Knowledge discovery and data mining, Cited by: §2.2, §3.2, §4.2.
  • [16] O. Levy and Y. Goldberg (2014) Neural word embedding as implicit matrix factorization. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, Cambridge, MA, USA, pp. 2177–2185. Cited by: §3.1.
  • [17] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS’13, Red Hook, NY, USA, pp. 3111–3119. Cited by: §3.1.
  • [18] A. Pareja, G. Domeniconi, J. Chen, T. Ma, T. Suzumura, H. Kanezashi, T. Kaler, and C. E. Leiserson (2019) EvolveGCN: evolving graph convolutional networks for dynamic graphs. CoRR abs/1902.10191. External Links: Link, 1902.10191 Cited by: §2.2.
  • [19] B. Perozzi, R. Al-Rfou, and S. Skiena (2014) Deepwalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 701–710. Cited by: §2.
  • [20] M. Rudolph and D. Blei (2018) Dynamic embeddings for language evolution. In Proceedings of the 2018 World Wide Web Conference, pp. 1003–1011. Cited by: §2.1.
  • [21] V. Shumovskaia, K. Fedyanin, I. Sukharev, D. Berestnev, and M. Panov (2020) Linking bank clients using graph neural networks powered by rich transactional data. External Links: 2001.08427 Cited by: §1, §1, §2.
  • [22] I. Stewart, D. Arendt, E. Bell, and S. Volkova (2017) Measuring, predicting and visualizing short-term change in word representation and usage in vkontakte social network. In Eleventh International AAAI Conference on Web and Social Media, Cited by: §2.1.
  • [23] R. Trivedi, M. Farajtabar, P. Biswal, and H. Zha (2019) DyRep: learning representations over dynamic graphs. In International Conference on Learning Representations, External Links: Link Cited by: §2.2.