Personal financial management (PFM) services and financial aggregators are software applications that collect and bring together information from multiple sources to provide users with a one-stop shop for tracking and managing their personal finances (Gupta et al., 2014). For individuals with multiple bank accounts, credit cards, and utility bills, seeing the big picture and gaining insights into their financial health can be incredibly valuable. Indeed, services of this sort are used by millions of people in the US alone (Green and Craven, 2017).
One of the most important types of information collected and analyzed by PFM services is transaction data. Bank and credit card transactions are retrieved from financial institutes after users provide the appropriate credentials. Together, these records essentially tell the full financial story of an individual. However, in order to distill the most relevant insights and suggestions for users, PFMs must fully understand the nature of the observed transactions, their source and meaning. One case of fundamental importance is bank transfers. Across the plethora of Financial Institutes (FIs) in the US, the information consistently retrieved by the service is the date, dollar amount, and a variable-length string describing the transaction. These strings are semi human-readable, and in the general case do not include an explicit identification of the issuing FI of the bank transfer. (This information is often available to the customer in the online bank account display, but is not obtained by the PFM due to technical issues.)
Transaction data includes many types of implicit information that can be extracted using machine learning and data mining methods. Previous research addressed the location of transactions (Resheff and Shahar, 2018b), and demographic attributes of users (Resheff and Shahar, 2018a). In this paper we utilize the description string associated with bank transfer transactions (i.e., the transaction signature). Treating this as short textual data, we apply Natural Language Processing (NLP) and supervised learning techniques to learn the mapping between description strings and the identity of the FI issuing the transfer.
Description strings are formed by the FIs, presumably by formatting a template with information regarding a specific transaction. At first glance it might seem that Recurrent Neural Networks (RNNs) and other machine learning approaches are not the right tool for inferring these deterministic mappings (although they are commonly used successfully for tasks over short strings (Limsopatham and Collier, 2016; Korpusik et al., 2016; Yen et al., 2018)). The simplicity and accessibility of these methods, and the excellent results obtained in our task (see Section 4), lead us to favor them over more traditional data mining tools designed specifically for finding patterns in strings.
The rest of the paper is structured as follows: Section 2 contains a precise problem definition. Next, in Section 3 the methods are presented, followed by results on a large real-world dataset in Section 4.
| date | amount | from | to | description on receiving end |
|---|---|---|---|---|
| 11.01.16 | 1,000$ | Bank of America | Bank of America | Online Banking transfer to SAV XXXX Confirmation# XXXXX |
| 12.01.16 | 1,000$ | Bank of America | Chase | Online Transfer XXXXX fromBofA main account ########XXXX t |
| 01.01.17 | 1,000$ | Chase | Wells Fargo | CHASE EPAY XXXXX XXXXX <Sender Name> |
| 02.01.17 | 1,000$ | Wells Fargo | Bank of America | Payment |
| 03.01.17 | 1,000$ | ING direct | Chase | CAPITAL ONE N.A. CAPITALONE XXXXX WEB ID: XXXXX |
2. Problem Definition
We formalize the problem as recovering the identity of the formatter program run by the transaction issuer. Consider a transaction with a set of attributes $a$. The issuing formatter $f_s$, run by the sending FI, is a mapping from $a$ to a string. The receiving formatter $f_r$, run by the receiving FI, is a mapping from the received string and $a$ to the final description string we observe. Thus, we observe the string:

$$d = f_r\big(f_s(a),\, a\big)$$
From an in-depth exploration of the data we conclude that the receiving formatters leave much of the structure produced by the issuing formatters intact. Furthermore, we observe that transactions originating from different FIs have uniquely identifying patterns, albeit this is a many-to-one relation (see Table 1).
Given many transaction strings, our goal is to recover the pairs of sending and receiving formatters that produced them. Note that since one side of the transaction is known (this is the financial institute in which we saw the transaction), we only need to infer the other side. In this paper we concentrate on incoming transactions, where the receiving formatter is known, and infer the issuing formatter from the transaction strings.
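The composition of the two formatters can be illustrated with a minimal sketch. Everything named here (the `Transaction` fields, the two template functions, and the template texts) is hypothetical and chosen only for illustration; the actual templates used by FIs are unknown, which is precisely why they have to be learned.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    """Hypothetical attribute set of a transfer (field names are illustrative)."""
    amount: float
    confirmation: str
    sender_name: str


def send_format_bank_a(t):
    # Hypothetical issuing template of a sending FI.
    return f"BANK A EPAY {t.confirmation} {t.sender_name}"


def recv_format_bank_b(sent, t):
    # Hypothetical receiving template: wraps the sender's string and
    # leaves its structure intact.
    return f"Online Transfer {sent} WEB ID: {t.confirmation}"


def observed_description(send_format, recv_format, t):
    # The observed string is the receiving formatter applied to the output
    # of the issuing formatter together with the transaction attributes.
    return recv_format(send_format(t), t)
```

Calling `observed_description(send_format_bank_a, recv_format_bank_b, t)` on a transaction `t` yields a string that still contains the issuing formatter's output verbatim, which is the property that makes the sender recoverable from the receiver's description.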
3.1. Generating the Labeled Dataset
Data used for this work was collected by a large financial data aggregation service. During registration, users provide credentials that allow us to continuously obtain transaction data from over financial institutions including banks and credit card companies. A record describing a transaction typically contains the date of the purchase, a dollar amount, and a description string explaining the nature of the transaction. Overall, available data contains over 15 billion transactions per year, arriving from over 10 million users. This represents several percent of all private transactions in the US. In our experiments, we use slices of this data pertaining to money transfer between known financial institutes. All experiments were conducted with data from the year starting November 2016.
In order to generate a labeled dataset of transactions between known financial institutes we use transactions for which both sides are visible to the data aggregation service. More specifically, we concentrate on transactions where both the source and the destination are within the same user account. In such cases we are able to obtain the identity of both financial institutes, as well as the descriptions produced by both of them.
The labeled dataset obtained this way contains million records, from users. Each record consists of the name of the sending and receiving financial institute, a dollar amount and date, and the description of the transaction as recorded both by the sender and the receiver (see illustrative examples in Table 1). Experiments reported here were conducted using a random sample of records from this dataset.
3.2. Tokenization, Feature Crafting, and Models
Description strings were tokenized using a standard NLTK tokenizer (Bird and Loper, 2004), limited to a dictionary of size . No text pre-processing was performed, other than replacing digits with Xs (so that tokens represent the lengths of numbers rather than their individual values).
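The pre-processing step can be sketched as follows. This is an illustration under assumptions, not the pipeline used in the experiments: a simple regular-expression tokenizer stands in for the NLTK tokenizer, and the `UNK` placeholder for out-of-dictionary tokens and the `max_size` parameter are hypothetical.

```python
import re
from collections import Counter

UNK = "<unk>"  # hypothetical placeholder for out-of-dictionary tokens


def mask_digits(description):
    # Replace every digit with "X" so that numbers of equal length share a token.
    return re.sub(r"\d", "X", description)


def tokenize(description):
    # Stand-in tokenizer (assumption): runs of word characters, or single
    # punctuation marks. The paper uses an NLTK tokenizer instead.
    return re.findall(r"\w+|[^\w\s]", mask_digits(description))


def build_vocabulary(descriptions, max_size):
    # Keep only the max_size most frequent tokens.
    counts = Counter(tok for d in descriptions for tok in tokenize(d))
    return {tok for tok, _ in counts.most_common(max_size)}


def encode(description, vocabulary):
    # Map tokens outside the dictionary to the UNK placeholder.
    return [tok if tok in vocabulary else UNK for tok in tokenize(description)]
```

With digits masked, a five-digit confirmation number and a three-digit identifier become the distinct tokens `XXXXX` and `XXX`, so the dictionary captures number lengths without growing with every new number.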
In addition to the tokenized representation of the description strings, we computed hand-crafted features describing textual patterns that are not expressed as tokens. These features include indicators (e.g., is the entire string upper case?) and more complex regular expression patterns found to be useful for this task.
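A minimal sketch of such features is given below. The specific patterns are hypothetical, loosely modeled on the example strings in Table 1; the patterns actually used in the experiments are not listed here.

```python
import re

# Hypothetical patterns, for illustration only.
PATTERNS = {
    "has_confirmation": re.compile(r"confirmation#", re.IGNORECASE),
    "has_web_id": re.compile(r"\bWEB ID:"),
    "has_masked_account": re.compile(r"#{4,}X{4}"),
}


def handcrafted_features(description):
    letters = [c for c in description if c.isalpha()]
    features = {
        # Indicator: every alphabetic character in the string is upper case.
        "all_upper": float(bool(letters) and all(c.isupper() for c in letters)),
    }
    # One binary indicator per regular expression pattern.
    for name, pattern in PATTERNS.items():
        features[name] = float(pattern.search(description) is not None)
    return features
```

The resulting dictionary can be concatenated with the token distribution of the same string to form the input of the logistic-features model.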
We compare the following classification methods and baselines:
max-label baseline: as a baseline for all other methods we use the proportion of data from the largest FI in the set under consideration.
logistic-raw: logistic regression on the distribution of tokens only.
logistic-features: logistic regression with the additional computed features and the raw token distributions combined.
LSTM/GRU: The model structure for all RNN based methods used here consists of a token embedding layer (in all cases the embedding size is ), followed by a single LSTM or GRU layer. The final output of the RNN is then fed into a cascade of dense layers, and a softmax readout of the identity of the financial institute. In the RNN setting only the tokenized sequence is used (with no hand-crafted features). Description string length was limited to tokens (longer ones were truncated).
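For concreteness, the max-label baseline over a restricted set of FIs can be sketched as follows. The function names are hypothetical, and the sketch assumes that "the set under consideration" is formed by keeping the k most frequent FIs.

```python
from collections import Counter


def restrict_to_top_k(labels, k):
    # Keep only records whose FI is among the k most frequent ones.
    top = {fi for fi, _ in Counter(labels).most_common(k)}
    return [fi for fi in labels if fi in top]


def max_label_baseline(labels):
    # Accuracy obtained by always predicting the most frequent FI.
    _, top_count = Counter(labels).most_common(1)[0]
    return top_count / len(labels)
```

Because the baseline depends on how many FIs are retained, it has to be recomputed for each setting before comparing it with the learned classifiers.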
4. Results and Discussion
In an exploratory phase, we examined the manner in which the distribution of tokens in descriptions reflects the relations between financial institutes in the US. After tokenization, we measured the association between financial institutes as pairwise distances between their token distributions. Plotting a clustered heatmap of these distances (see Figure 1) shows that the textual data captures associations between different banks.
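The pairwise distance computation can be sketched as follows. The distance measure is an assumption: the Jensen-Shannon divergence is used here as one reasonable choice for comparing token distributions, and the function names are hypothetical.

```python
import math
from collections import Counter


def token_distribution(token_lists):
    # Relative frequency of each token over all descriptions of one FI.
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def js_divergence(p, q):
    # Jensen-Shannon divergence with base-2 logarithm, bounded in [0, 1].
    def kl(a, m):
        return sum(pa * math.log2(pa / m[t]) for t, pa in a.items() if pa > 0)

    vocab = set(p) | set(q)
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in vocab}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def pairwise_distances(tokens_by_fi):
    # tokens_by_fi maps an FI name to a list of tokenized descriptions.
    dists = {fi: token_distribution(tl) for fi, tl in tokens_by_fi.items()}
    names = sorted(dists)
    return {(a, b): js_divergence(dists[a], dists[b]) for a in names for b in names}
```

The resulting matrix of distances is symmetric with a zero diagonal, and can be passed to any hierarchical clustering routine to order the rows and columns of a heatmap.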
For example, the token distribution seems to easily capture the relatedness of different branches or divisions of the same bank, as is the case for Citibank and Chase bank (including the Amazon award Visa, which Chase operates). This view of the data also surfaced mergers and acquisitions in the FI market, such as Capital One’s acquisition of the ING Direct division. Finally, we learned that the descriptions may also reflect geographical attributes, as demonstrated by the moderate similarity between CIBC and National Bank of Canada, which are two distinct institutes. The latter observation raises the potential of learning more characteristics of financial institutes through their transaction descriptions. It might also imply a limitation on the learnability of the mapping from description strings to FIs. More precisely, it indicates that we are likely to have to rely on structure and deeper features of these strings, and not just the distributions of tokens. This notion is reinforced by the results presented below.
We test multiple methods for determining the identity of the financial institute from which a transaction originated based on the description of the transaction (see Section 3 for data and model details). Experiments show overall satisfactory results, with classification accuracy ranging from over when only the top FIs are considered to approximately for the top (Table 2). In all cases the LSTM-based classifier outperformed all other methods, followed closely by the GRU. It is noteworthy that the logistic regression operated on single-token distributions and manually crafted features; multi-grams were not tested for computational reasons. The vast superiority of both RNN-based methods (which operate on the raw token sequences) over the logistic regressions, which are not able to take the order of tokens into consideration, indicates again that the structure of the description string, and not just the actual tokens used, has an important role in determining the identity of the source FI.
Since the experiments presented in this paper are conducted on a subset of the available data, we test to determine the sensitivity of the classification results to the amount of training data used. Results for the LSTM based method in the FI setting (Figure 2) show that performance reaches a plateau at of the data used in practice, indicating that the use of additional data would be unlikely to achieve better results. We do not however rule out the possibility of utilizing the full amount of data available with more complex models, or when classifying a larger number of FIs, and leave this to future work.
Next we test the trade-off between the number of FIs we classify and classification performance. The US banking system is comprised of tens of thousands of institutions with a long tail distribution of number of customers. In the data used for these experiments the top institutions are responsible for approximately of all transactions, and the top for approximately . The decline in performance in the LSTM based method as additional FIs are added follows this structure closely (Figure 3), with a reduction from with FIs to for . The decline then slows down, and reaches with FIs.
Understanding the source and meaning of transactions is a key component in the ability of financial data aggregators and personal financial management systems to deliver value through deep insights and suggestions. Money transfers are an especially important type of transaction, but the identity of the sending financial institute is not readily available in PFM and aggregator systems.
In this paper we investigate the problem of supervised learning of the identity of a sending financial institute from the description string provided by the receiver. Using word embeddings, RNNs and other methods borrowed from NLP we are able to achieve excellent accuracy on this task, possibly limited only by the multiplicity of banking brands within the same family of banks. Interestingly, RNN methods with the ability to process the order of tokens in the transaction strings vastly outperform linear methods (even when additional hand-crafted features were added to the latter). This finding further supports our original hypothesis that the structure of these strings is tied to issuing FIs, and not merely the distribution of tokens.
Future work will attempt to enrich the information regarding incoming money transfers beyond the identity of the sending FI by utilizing and extending the methods presented in the current work to recover the structure of description strings and extract the attributes of the transaction embedded within them.
- Bird and Loper (2004) Steven Bird and Edward Loper. 2004. NLTK: the natural language toolkit. In Proceedings of the ACL 2004 on Interactive poster and demonstration sessions. Association for Computational Linguistics, 31.
- Green and Craven (2017) James R Green and Annette E Craven. 2017. Account Aggregation Tools: History and Use for the Future. Academy of Business Research Journal 1 (2017), 74.
- Gupta et al. (2014) Vipul Gupta, Sameer Khanna, and Iljoo Kim. 2014. Personal Financial Aggregation and Social Media Mining: A New Framework for Actionable Financial Business Intelligence (AFBI). International Journal of Business Intelligence Research (IJBIR) 5, 4 (2014), 14–25.
- Korpusik et al. (2016) Mandy Korpusik, Shigeyuki Sakaki, Francine Chen, and Yan-Ying Chen. 2016. Recurrent Neural Networks for Customer Purchase Prediction on Twitter. In CBRecSys@RecSys. 47–50.
- Limsopatham and Collier (2016) Nut Limsopatham and Nigel Henry Collier. 2016. Bidirectional LSTM for named entity recognition in Twitter messages. (2016).
- Resheff and Shahar (2018a) Yehezkel S. Resheff and Moni Shahar. 2018a. Fusing Multifaceted Transaction Data for User Modeling and Demographic Prediction. In Workshop on Multi-dimensional Information Fusion for User Modeling and Personalization. ACM.
- Resheff and Shahar (2018b) Yehezkel S. Resheff and Moni Shahar. 2018b. A Statistical Approach to Inferring Business Locations Based on Purchase Behavior. (2018). Manuscript submitted for publication.
- Yen et al. (2018) An-Zi Yen, Hen-Hsen Huang, and Hsin-Hsi Chen. 2018. Detecting Personal Life Events from Twitter by Multi-Task LSTM. In Companion of the The Web Conference 2018 on The Web Conference 2018. International World Wide Web Conferences Steering Committee, 21–22.