Data science processes such as data exploration, manual feature engineering,
hypothesis validation and baseline construction can become convoluted, and often call for empirical and domain knowledge. As a data science company, one of our concerns at BBVA D&A is to provide our data scientists with tools to improve these workflows, in the spirit of works such as [Duvenaud13, grosse2012, snoek2012, kanter2015, boselli2017ai, Lloyd2014].
Most data analytics and commercial campaigns in retail banking revolve around the concept of behavioral similarity, for instance: studies and campaigns on client retention; product recommendations; web applications where our clients can compare their expenses with those of similar people in order to better manage their own finances; and data-integrity tools.
The analytic work behind each of these products normally requires the construction of a set of customer attributes and a model, both typically tailored to the problem of interest.
We aim to systematize this process in order to encourage model and code reuse, reduce project feasibility assessment times and promote homogeneous practices.
Our contribution to this end is client2vec: a library to speed up the construction of informative baselines for behavior-centric banking applications. In particular, client2vec focuses on behaviors that can be extracted from account transaction data by encoding that information into vector form (a client embedding). These embeddings make it possible to quantify how similar two customers are and, when fed into clustering or regression algorithms, outperform the sociodemographic customer attributes traditionally used for customer segmentation or marketing campaigns. We pursued a solution with minimal computational and preprocessing requirements that could run even on simple infrastructures. This constraint is not harmful: baselines need to provide an informative starting point rather than an off-the-shelf solution, and client2vec helps generate them in a few minutes, with a few lines of code. Additionally, client2vec offers our data scientists the possibility to optimize the embeddings against the business problem at hand. For instance, the embedding may be tuned to maximize the average precision for the task of retrieving suitable targets for a campaign.
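As an illustration of the kind of retrieval metric such tuning could target, average precision over a ranked list of candidate clients can be computed as follows. This is a generic sketch, not part of client2vec's actual API; the function name and inputs are illustrative:

```python
import numpy as np

def average_precision(scores, labels):
    """AP for one query: rank candidates by descending score and
    average the precision measured at each relevant (label == 1) hit."""
    scores = np.asarray(scores, dtype=float)
    hits = np.asarray(labels)[np.argsort(-scores)]
    if hits.sum() == 0:
        return 0.0
    precision_at_rank = np.cumsum(hits) / (np.arange(len(hits)) + 1)
    return float((precision_at_rank * hits).sum() / hits.sum())
```

For example, if the top-ranked and third-ranked candidates are true campaign targets out of three retrieved, the AP is (1/1 + 2/3) / 2 = 5/6.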
This paper describes our experience and what we learned while building client2vec; it is organized as follows:
in Section 2 we go through the principles we built client2vec on; in Section 3 we mention relevant related work, from both the application and the algorithmic perspectives; in Section 4 we describe the structure of the account transactions data we used; in Section LABEL:sec:methodology we dig deeper into the algorithms we studied when developing our library, why we chose them and how we trained them; finally, in Section LABEL:sec:experiments we present experiments in the tasks of client segmentation, reconstruction of missing expenses and retrieval of targets for a commercial campaign.
2 Our approach
We built client2vec following an analogy with unsupervised word embeddings [mikolov:2013, dhillon2012two, Chen:12, socher2011parsing], whereby account transactions can be seen as words, clients as documents (bags or sequences of words) and the behavior of a client as the summary of a document. Just like word or document embeddings, client embeddings should exhibit the fundamental property that neighboring points in the space of embeddings correspond to clients with similar behaviors.
We thus identified two possible paths: one is to extract vector representations of transactions and compose them into client embeddings, as done with word embeddings to extract phrase or document embeddings via averaging or more sophisticated techniques [socher2011parsing]; the other is to embed clients straight away. We explored the former option by applying the famed word2vec algorithm [mikolov:2013] to our data and then pooling the embeddings of individual transactions into client representations with a variety of methods. For the latter approach, which is the one currently employed by client2vec, we built client embeddings via a marginalized stacked denoising autoencoder (mSDA) [Chen:12]. For comparison and benchmarking purposes, we also tested the embedding consisting of a client's raw transactional data and the one produced by sociodemographic variables.
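As a rough illustration of the mSDA path (a sketch of the closed-form construction from [Chen:12], not client2vec's actual implementation), a marginalized denoising autoencoder layer can be computed without iterative training: the expectation over infinitely many feature corruptions has an analytic solution. Here the corruption probability `p`, the regularization term and the number of layers are free parameters chosen for illustration:

```python
import numpy as np

def mda_layer(X, p, reg=1e-5):
    """One marginalized denoising autoencoder layer in closed form.
    X: (d, n) matrix, one column per client; p: feature corruption probability."""
    d, n = X.shape
    Xb = np.vstack([X, np.ones((1, n))])        # append a constant bias feature
    S = Xb @ Xb.T                                # scatter matrix
    q = np.full(d + 1, 1.0 - p)                  # survival probability per feature
    q[-1] = 1.0                                  # the bias is never corrupted
    Q = S * np.outer(q, q)                       # E[x_tilde x_tilde^T], off-diagonal
    np.fill_diagonal(Q, q * np.diag(S))          # diagonal uses q, not q^2
    P = S[:d, :] * q                             # E[x x_tilde^T] (no bias output row)
    W = np.linalg.solve(Q + reg * np.eye(d + 1), P.T).T   # W = P Q^{-1}
    return np.tanh(W @ Xb)                       # nonlinear hidden representation

def msda_embed(X, p=0.5, layers=3):
    """Stack mDA layers and concatenate input plus all hidden layers."""
    reps = [X]
    for _ in range(layers):
        reps.append(mda_layer(reps[-1], p))
    return np.vstack(reps)
```

Because each layer reduces to solving a single linear system, the whole stack trains in seconds on category-aggregated data, which is what makes this family attractive for rapid baseline construction.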
Embeddings are then turned into actionable baselines by casting business problems as nearest neighbor regressions. This builds on successful works in computer vision [Torralba2008, hays2008, rodriguez2016] which adopt the principle of the unreasonable effectiveness of data [halevy2009unreasonable]. As we will show, such an approach produces effective baselines in a variety of scenarios.
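A minimal sketch of this nearest-neighbor casting, assuming client embeddings as row vectors and one numeric target per client (e.g. spending in a category to be imputed); the function and variable names are illustrative, not client2vec's API:

```python
import numpy as np

def knn_predict(embeddings, targets, query_emb, k=5):
    """Predict a target value for a query client as the mean target of its
    k nearest neighbors in embedding space (Euclidean distance)."""
    dists = np.linalg.norm(embeddings - query_emb, axis=1)
    neighbors = np.argsort(dists)[:k]
    return float(np.mean(targets[neighbors]))
```

The same primitive covers the scenarios discussed later: retrieval (return the neighbor indices), imputation of missing expenses (average the neighbors' values) and segmentation sanity checks (inspect the neighbors directly).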
3 Related work
The need for automation has been noticed by the wider data science community. The AutoML Challenge asked contestants to solve classification and regression tasks without human intervention [guyon2015]. Projects like the Automatic Statistician (https://www.automaticstatistician.com/), which aims at developing an “artificial intelligence for data science”, automatically and efficiently explore the space of models, for instance for non-parametric regression by composing kernel structures [Duvenaud13], or for matrix decomposition by organizing and exploring models in a context-free grammar [grosse2012]. Other strategies involve using Gaussian Processes for hyperparameter search [snoek2012] and autonomous feature engineering from raw data [kanter2015]. Along the same lines, automation may concern data cleaning [boselli2017ai] or even summarizing data and models in natural language [Lloyd2014].
A de facto standard for the rapid construction of baselines is the class of gradient boosting methods (https://github.com/dmlc/xgboost/tree/master/demo#machine-learning-challenge-winning-solutions, accessed 2018-02-09) [friedman2001gbm]. However, we seek a method that can produce an unsupervised embedding of a client that is semantically interpretable, with neighboring points representing behaviorally similar clients, and that we can reliably employ in tasks as diverse as retrieval or clustering, as we do in Section LABEL:sec:experiments.
Consequently, we rather draw inspiration from the straightforward, successful application of textual embeddings produced by models like word2vec [mikolov:2013] or GloVe [pennington2014glove] to data science problems such as predicting the outcome of soccer matches from Tumblr hashtags [radosavljevic2014large] or improving the accessibility of the information in medical corpora [minarro2014exploring]. More interestingly, mechanisms for embedding words are applicable to any indexable entity, and such models have been extended to embed data other than text. This includes embedding nodes in a graph [figueiredo2017struc2vec], items in recommender systems [barkan:2016item2vec], Pinterest’s pins [Liu2017pin2vec] and even Twitter users [benton2016user2vectwitter], where different embeddings of the same entity are mixed using a form of canonical correlation analysis.
One drawback of these embedding methods is that they lack a formalism to embed sets or sequences of items, rather than the items themselves. On sequential data, methods such as RNNs offer state-of-the-art performance in most cases [lipton2015critical]. However, these models are unsuitable for our use case: they are usually data-hungry and computationally burdensome, hence inadequate for rapidly and systematically building baselines. Autoencoders, on the other hand, have been shown to be effective at creating unsupervised embeddings of formal representations of sets (e.g. bags of items) [bengio2013representation, Chen:12, Sec. 7] and have thus found application, for instance, in embedding patient data [miotto2016deeppatient]. The latter work, in particular, leverages stacked denoising autoencoders (SDAs) and offers a scenario very similar to the ones we consider in our applications. We chose, however, to focus on mSDAs to reduce the computational cost associated with SDAs (Section LABEL:sec:msdaembed).
4 Account transactions data
Current account transactions include movements like direct debits, commission fees, coupons of financial products or debit card expenses. Such movements normally have an associated description, either in formatted or free text, and a movement code whose descriptiveness is out of our control and ranges from the extremely general (a money transfer of any kind) to the very specific (the purchase of tickets from a given airline). The taxonomy of transactions is further enriched by assigning each movement code to one of 70 categories, e.g. Utilities or Car insurance. This operation is carried out by an internal tool employing search and NLP algorithms. Besides adding structure, this procedure can also introduce noise by miscategorizing some movements; the algorithms we tested showed good resilience to this noise.
We chose to focus client2vec on an aggregation of current account data by client, year and transaction category. This aggregation can easily be applied to most commercial cases of interest, is readily available within BBVA and preserves enough information within a fairly succinct dataset. Past applications within BBVA include segmentation, campaigns, recommendations and automated finance management. Specifically, for a year's worth of data, we aggregate each client into a vector of 70 numbers, one per category, as depicted in Figure LABEL:fig:raw_data. Unless otherwise specified, this is the data format that we will refer to throughout the rest of the paper.
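This per-client, per-year aggregation over categories can be sketched with pandas. The column names (`client_id`, `year`, `category`, `amount`) and the toy records below are illustrative assumptions, not the schema of BBVA's internal data:

```python
import pandas as pd

# Hypothetical transaction records (one row per categorized movement).
tx = pd.DataFrame({
    "client_id": [1, 1, 2, 2, 2],
    "year":      [2017, 2017, 2017, 2017, 2017],
    "category":  ["Utilities", "Travel", "Utilities", "Utilities", "Car insurance"],
    "amount":    [80.0, 300.0, 60.0, 40.0, 200.0],
})

# One row per (client, year), one column per category, amounts summed;
# with the full taxonomy this yields the 70-dimensional client vectors.
client_vectors = tx.pivot_table(
    index=["client_id", "year"],
    columns="category",
    values="amount",
    aggfunc="sum",
    fill_value=0.0,
)
```

Categories in which a client had no movements are filled with zeros, so every client is represented by a dense vector of the same length regardless of activity.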