Personal Financial Management (PFM) services and financial aggregators are software applications that collect and combine information from multiple sources to provide users with a one-stop shop for tracking and managing their personal finances. For individuals with multiple bank accounts, credit cards, and utility bills, seeing the big picture and gaining insights into their financial health can be incredibly valuable. Indeed, services of this sort are used by millions of people in the US alone.
One of the most important types of information collected and analyzed by PFM services is financial transactions. Bank and credit card transactions are retrieved from financial institutions after users provide the appropriate credentials. Together, these pieces of information essentially amount to the full financial story of an individual (even for small transactions, where cash still dominates [11], ATM withdrawals are recorded, and they tell part of the story).
Across the plethora of financial institutions in the US, the information consistently retrieved by the service is the date, a dollar amount, and a variable-length string describing the transaction. These strings are semi human-readable, and include information such as the merchant, timestamps, and other pertinent details, including a (typically numeric) store identifier when a purchase is made in a chain store.
A piece of information that is not directly present in the transaction data, and that is of extreme importance to PFM services, is precise location information. When a purchase is made at a chain store, the identifier accessible in the transaction description string is not directly mapped to a physical location via publicly accessible directories and data sources. A database of all businesses in the US is commercially available via several providers. However, the list of stores in a chain or franchise will not contain these arbitrary internal identifiers. The task we tackle in this paper is to infer store locations based on purchasing patterns across users.
The need for location information arises in many aspects of the activity of a PFM service, first and foremost as an enabler of personalization and of recommending more relevant content. This information is also useful in various tasks such as fraud detection, advertising, and user profiling, to name a few. Moreover, the location of the businesses and individuals, together with the purchase data, can be used for higher-level economic analysis of both stores and populations.
II Data Collection and Processing
Data used for this work was collected by a large financial data aggregation service. During registration, users provide credentials that allow the aggregator to continuously obtain transaction data from over financial institutions including banks and credit card companies. A record describing a transaction typically contains the date of the purchase, a dollar amount, and a description string explaining the nature of the transaction. The first step in the process is to extract structured information out of these strings. Namely, we would like to obtain the identity of the merchant from which a purchase was made. For chain stores this includes the name and branch identifier (see Table I for some examples).
The main difficulty in extracting structured information from the transaction description strings arises from the variability in the processing that the information undergoes along the way. The identity of the merchant, as well as the financial institutions processing the information en route, all affect the structure and the information available in the final string obtained by the service. As a result, a number of heuristic and machine learning methods are used to obtain structured information. The extracted information includes fields describing where and when the transaction took place, such as exact timestamps, location information, merchant name, and branch identifier.
The information we use in this work is the set of purchases for individual users, each characterized only by the merchant and branch ID. These branch identifiers uniquely represent physical real-world locations, but the mapping to the real world is unknown. The main task we tackle in this paper is to utilize the full set of data to infer the real-world locations of these arbitrary identifiers, based on the key insight that individual stores that share a large percentage of their customer base are likely to be near each other.
Overall, the available data contains over 15,000,000,000 transactions per year, arriving from over 10,000,000 users. This represents several percent of all private transactions in the US. In our experiments (Section IV), we use slices of this data pertaining to specific chains in well-defined geographical areas. All experiments were conducted with data from the year 2017 and the first quarter of 2018.
A ground truth location dataset is constructed by scraping websites of the relevant store chains and obtaining both store identifiers and address information. This approach is limited for two reasons. First, the process of building the mechanism to scrape each website is time consuming, since the tool has to be adapted to each new store chain. More importantly though, only a small fraction of chains list their stores with both their addresses and the internal identifiers we see in the transaction data. The ground truth dataset is used in this work to test the methods (Section IV). It is however also an important component in the general inference system for the millions of store locations in the US, since it serves as a seed with which we can start inferring new locations. The methods we use to do this are presented in the next section.
Table I. Example description strings and the fields extracted from them.

| description string | merchant | store ID | city | exact location |
|---|---|---|---|---|
| Starbucks Store 06607 | Starbucks | 06607 | ? | ? |
| PIZZA HUT 030579 | Pizza Hut | 030579 | ? | ? |
| SHELL OIL 5908 SAN DIEGO CA | Shell | 5908 | San Diego | ? |
| The Who Knows where shoe store | ? | ? | ? | ? |
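As a toy illustration of the extraction step, the sketch below pulls a merchant name and a branch identifier out of description strings like those in Table I. The regular expression and the function name `parse_description` are assumptions made for this example only; the actual system relies on a mix of heuristic and machine learning methods that is not specified here, and it normalizes merchant names further (for instance to "Shell" rather than "Shell Oil").

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern: a merchant name, an optional "Store" or "#" token,
# then a numeric branch identifier of at least three digits.
_PATTERN = re.compile(
    r"^(?P<merchant>[A-Za-z][A-Za-z &'.-]*?)\s+(?:store\s+|#\s*)?(?P<store_id>\d{3,})\b",
    re.IGNORECASE,
)


def parse_description(description: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (merchant, store_id), or (None, None) when no identifier is found."""
    match = _PATTERN.match(description.strip())
    if match is None:
        return None, None
    merchant = match.group("merchant").strip().title()
    # Keep the identifier as a string so that leading zeros survive.
    return merchant, match.group("store_id")
```

Strings without a numeric identifier, such as the last row of Table I, fall through to `(None, None)` and would need to be handled by other means.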
In this section we develop a statistical method used to infer locations based on user transactions. The method assumes a subset of known locations, and a store-customer relationship matrix indicating which users are customers of which stores. The subset of known locations arises in our data via chains with public-facing store IDs that are also maintained in the transaction data, and thus allow us to map the signature to physical locations. The store-customer relationship matrix is a Boolean indicator matrix, where the entry at position $(i, j)$ indicates whether user $i$ frequents store $j$.
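The store-customer relationship can be sketched as follows, assuming transactions have already been reduced to (user, store) pairs. The identifiers and the helper names `customers_by_store` and `indicator_matrix` are hypothetical; at the scale of the real data a sparse representation would be needed rather than nested lists.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Transaction = Tuple[str, str]  # (user_id, store_key); both are hypothetical keys


def customers_by_store(transactions: Iterable[Transaction]) -> Dict[str, Set[str]]:
    """Group users by the store at which they made at least one purchase."""
    stores: Dict[str, Set[str]] = defaultdict(set)
    for user_id, store_key in transactions:
        stores[store_key].add(user_id)
    return dict(stores)


def indicator_matrix(
    stores: Dict[str, Set[str]]
) -> Tuple[List[str], List[str], List[List[bool]]]:
    """Boolean matrix with one row per user and one column per store."""
    users = sorted({u for members in stores.values() for u in members})
    store_keys = sorted(stores)
    matrix = [[u in stores[s] for s in store_keys] for u in users]
    return users, store_keys, matrix
```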
Two key insights allow us to develop the proposed model. The first is the assumption of locality of consumer behavior, and the second is the real-world metric of locality. By locality of consumer behavior we mean that, in the probabilistic sense, a customer of a given shop is more likely to be a customer of a shop in the same neighborhood, than of a shop which is far away. This intuitive insight is backed by the data, as seen in Figure 2.
The second insight is that Euclidean distance alone is not a good determinant of locality. In a dense urban environment the distance between shops is small, and even at modest distances we don't expect to see much overlap in customer base. In a more rural area, on the other hand, the shops a single user frequents can easily be separated by distances on the order of the radius of a city.
As a result, we must either consider separately areas with different density (either by population or business place density), or otherwise enter this source of variability into our model. In the current work we maintain a rather homogeneous density by inferring locations for stores one city at a time, but our framework can easily accommodate extra parameters and, given enough data, can even learn their effect.
III-A Maximum Likelihood Inference of Unknown Store Locations
We consider a dataset ordered as a tuple:

$$\mathcal{D} = (K, U, S)$$

where $K = \{(k, x_k)\}$ is a set of known stores and their locations, $U$ indexes the stores with unknown locations, and $S$ gives the customer sharing matrix between stores in the sets of known and unknown locations. This matrix is a representation of the bipartite graph where stores are nodes and an edge exists between each store $k$ with a known location and each store $u$ with an unknown location; the element $s_{ku}$ in the $(k, u)$-th position of the matrix gives the customer sharing index between the two stores. As a measure of customer sharing it would be natural to use the Jaccard index:

$$J(k, u) = \frac{|C_k \cap C_u|}{|C_k \cup C_u|}$$

where $C_a$ denotes the set of customers of a store $a$. The Jaccard index notion of customer sharing does however have some pitfalls. For instance, when comparing a small shop to a large one, even if all the customers of the small shop also frequent the large shop, the computed similarity will still be small. For this reason we propose an alternative minimum-normalized version:

$$s_{ku} = \frac{|C_k \cap C_u|}{\min(|C_k|, |C_u|)}$$

which better captures the desired notion of customer sharing when stores are very different in size of customer base. An important component of the model will be the conditional distribution of customer sharing (denoted $P(s_{ku} \mid x_k, x_u)$), given the locations of a known shop $k$ and an unknown shop $u$. We will assume that this distribution depends only on the distance between the two shops, namely:

$$P(s_{ku} \mid x_k, x_u) = P(s_{ku} \mid d_{ku})$$

Here and elsewhere, we use $d_{ku} = \lVert x_k - x_u \rVert$ to denote the distance between the known location $x_k$ and the unknown location $x_u$. In practice, we estimate $P(s \mid d)$ empirically based on the set of known locations and the customer sharing between them. The joint likelihood of the dataset is then:

$$P(\mathcal{D}) \propto \prod_{u \in U} \prod_{k \in K} P(s_{ku} \mid d_{ku})$$

where we assume a flat and i.i.d. distribution of locations of both known and unknown entities. Recall that our aim is to resolve the unknown locations $\{x_u\}_{u \in U}$:

$$\{\hat{x}_u\}_{u \in U} = \arg\max_{\{x_u\}} \prod_{u \in U} \prod_{k \in K} P(s_{ku} \mid d_{ku}) = \arg\min_{\{x_u\}} \; -\sum_{u \in U} \sum_{k \in K} \log P(s_{ku} \mid d_{ku})$$

At this point the problem statement indicates that all known stores are relevant for the determination of each of the unknown locations. However, the problem is further simplified by revisiting the locality property of purchase behavior. Overall, a shop $u$ of unknown location will have significant customer overlap with very few of the known locations $k$, meaning $S$ is essentially a sparse matrix. We will thus only consider those known locations where $s_{ku} > \tau$ for some threshold $\tau$. Customer sharing values below this threshold are treated as zeros. The correct value for this parameter is determined experimentally. The need for such a threshold also arises statistically: rare co-occurrences, where a customer happened to make a purchase at an unusual location, cause small-sample-size effects that make inference very noisy. The introduction of the threshold on the customer sharing values thus adds robustness to our model.
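The difference between the two customer sharing measures, and the effect of the threshold, can be illustrated with a small sketch. Customer sets are plain Python sets of user identifiers; the function names and the example threshold are assumptions for illustration only.

```python
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index: shared customers over the union of both customer sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def min_normalized(a: Set[str], b: Set[str]) -> float:
    """Shared customers normalized by the smaller of the two customer bases."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def neighbors_above(
    unknown: Set[str], known: Dict[str, Set[str]], tau: float
) -> Dict[str, float]:
    """Known stores whose min-normalized sharing with `unknown` exceeds tau."""
    scores = {k: min_normalized(c, unknown) for k, c in known.items()}
    return {k: s for k, s in scores.items() if s > tau}
```

For a small shop with 10 customers, all of whom also frequent a large shop with 1000 customers, `jaccard` gives 0.01 while `min_normalized` gives 1.0, which is the behavior motivating the minimum-normalized version.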
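Finally, taking the threshold on customer sharing into account, the problem becomes:

$$\{\hat{x}_u\}_{u \in U} = \arg\min_{\{x_u\}} \; -\sum_{u \in U} \; \sum_{k \in K:\, s_{ku} > \tau} \log P(s_{ku} \mid d_{ku}) \qquad (5)$$

where the inner sum ranges over the known stores $k$ whose customer sharing $s_{ku}$ with the unknown store $u$ exceeds the threshold $\tau$, and $d_{ku}$ is the distance between the two.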
We now briefly discuss some properties of formulation (5):
Separability: since each distance $d_{ku}$ depends on a single unknown location $x_u$, and $k$ is a store with known location, the problem stated in (5) is completely separable in the unknown stores $u \in U$. This means we may resolve each unknown point independently of all others. This property has consequences for the scalability of the approach to big datasets, by enabling a trivial parallelization of the store location inference. In short, the process begins with a global step of estimating the underlying conditional customer sharing probabilities; then the unknown locations are segmented into batches and their locations are inferred.
The drawback of this parallel approach is that information is not shared between unknown locations. Ideally, we would want each new unknown location we infer to benefit from the locations of the already-inferred places. Note that in this case the graph is no longer bipartite; however, this doesn't change the formulation. A trade-off between the need for scalability and local information sharing of this sort is achieved by segmenting the data by areas, and processing each area (such as a city or county) separately. This is the approach we take (see Section IV). It is however possible to treat the already-inferred locations as less trusted information (relative to locations which are known as fact) for the inference of the current location, by giving a lower weight to such terms in the sum. For the sake of simplicity we did not use this information in the procedure we adopted, or in the experimental results presented in this paper.
In the single neighbor limit, where for an unknown point $u$ there is a single known point $k$ for which $s_{ku} > \tau$, the problem formula for resolving that point becomes:

$$\hat{x}_u = \arg\min_{x} \; -\log P(s_{ku} \mid \lVert x_k - x \rVert)$$

which is the circle around the point $x_k$ of radius $r^*$ determined by an argmin operation over distances, $r^* = \arg\min_{d} -\log P(s \mid d)$, with $s = s_{ku}$. Intuitively, since each known location active in the problem gives a single radius constraint of this sort, in the general case we should need at least three locations (in general position) in order to be able to uniquely determine the unknown location. In the single known point case however, for a large enough $s_{ku}$, $P(s_{ku} \mid d)$ is maximal for $d = 0$, and thus the solution collapses to the single known point (this property of $P(s \mid d)$ is further discussed below, and can be seen empirically in Figure 3).
In many cases it is reasonable to force the inference problem for an unknown point to be based only on a few known points in its vicinity. Technically, this may be achieved by selecting a large value of the threshold $\tau$, i.e. using only shops with a large shared customer group with the unknown location under consideration (furthermore, the value of the threshold could in theory be chosen dynamically per location in order for the problem to contain a number of known locations within pre-determined bounds). In this regime, the solution to (5) is in the convex hull of the set of known locations used. To see this, we note that in the high-threshold regime of the problem, values of customer sharing are monotonically more likely as the distance between the shops decreases (this can be seen empirically in Figure 3). We formalize this notion:
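Theorem 1. Let $\{x_k\}$ be a set of known locations, and $x_u$ an arbitrary unknown location. Assume that $P(s \mid d)$ is monotonically decreasing in $d$ for all $s$ s.t. $s > \tau$; then the solution $\hat{x}_u$ to problem (5), where the sum ranges over the known locations with $s_{ku} > \tau$, is in the convex hull of $\{x_k\}$.

Proof. Suppose, on the contrary, that the solution $\hat{x}_u$ is not in the convex hull of $\{x_k\}$. By the following lemma (Lemma 1), the projection of $\hat{x}_u$ onto the convex hull is closer than $\hat{x}_u$ is to each of the points $x_k$, and thus by the monotonicity of $P(s \mid d)$ in $d$ we get that each element in the sum is decreased, and therefore the objective is decreased, in contradiction to $\hat{x}_u$ being the optimum.

Lemma 1. Let $C \subset \mathbb{R}^n$ be a compact convex set. For every $x \notin C$ and every $y \in C$, $\lVert y - p \rVert < \lVert y - x \rVert$, where $p$ is the projection of $x$ onto $C$.

Proof. Let $p$ be the projection of $x$ onto $C$, i.e., the closest point in $C$ to $x$. Note that such a point exists due to the compactness of $C$. Since $x \notin C$, it is easy to see that $p$ is on the boundary of $C$, for otherwise there exists a ball of radius $\epsilon > 0$ around $p$, contained in $C$, on which there is a point closer to $x$. From the Hahn-Banach theorem there exists a supporting hyperplane $H$ of $C$ at $p$, orthogonal to $x - p$, such that $C$ lies on the side of $H$ opposite to $x$. Consider the triangle formed by $x$, $p$, and $y$. The angle $\angle xpy$ is at least a right angle, and hence the largest angle of the triangle (since the supporting hyperplane of $C$ at $p$ separates the set from $x$), and therefore the edge $xy$ which is opposite to it is longer than the proximal edge $py$. ∎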
This concludes the proof of Theorem 1. As a consequence of this theorem, we must be mindful when selecting values of the threshold $\tau$ that imply a solution in the convex hull when inferring locations for stores in areas where the coverage of the known locations is insufficient for us to assume that the unknown location is in their convex hull. On the other hand, we will use larger values of $\tau$ and the convex hull property to help in resolving locations in other cases; this is further discussed in the following items.
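For reference, the inequality of Lemma 1 can also be checked algebraically rather than through the triangle argument. The derivation below is a sketch in notation introduced here: $C$ is the compact convex set, $x \notin C$, $p$ is the projection of $x$ onto $C$, and $y \in C$ is arbitrary. It relies on the standard projection property $\langle y - p,\, x - p \rangle \le 0$ for all $y \in C$.

```latex
\begin{align*}
\lVert y - x \rVert^2
  &= \lVert (y - p) + (p - x) \rVert^2 \\
  &= \lVert y - p \rVert^2 + \lVert p - x \rVert^2 - 2\,\langle y - p,\; x - p \rangle \\
  &\ge \lVert y - p \rVert^2 + \lVert p - x \rVert^2 \\
  &> \lVert y - p \rVert^2 .
\end{align*}
```

The last step is strict because $x \notin C$ implies $p \neq x$.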
A key quantity we use throughout this discussion is the conditional distribution of customer sharing given store distance, $P(s \mid d)$. Up until now we have assumed it to be known; in practice, however, this quantity must be estimated from the data. More specifically, we look at the customer sharing and distances between pairs of known locations. Both distance ($d$) and customer sharing ($s$) are binned, and the appropriately normalized frequency table is computed once and stored for the computation of each inferred location.
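A minimal sketch of this binned estimate is given below, assuming the input is a list of (distance, customer sharing) pairs computed between known stores. The bin edges, the function names, and the additive `pseudo_count` smoothing (used here only to avoid empty rows) are assumptions of this example, not part of the described system.

```python
import bisect
from typing import List, Sequence, Tuple


def bin_index(value: float, edges: Sequence[float]) -> int:
    """Index of the bin containing `value`; out-of-range values are clamped."""
    i = bisect.bisect_right(edges, value) - 1
    return min(max(i, 0), len(edges) - 2)


def estimate_conditional(
    pairs: Sequence[Tuple[float, float]],
    d_edges: Sequence[float],
    s_edges: Sequence[float],
    pseudo_count: float = 1.0,
) -> List[List[float]]:
    """Return table[d_bin][s_bin], an estimate of P(s in bin | d in bin).

    Each row (one per distance bin) is normalized to sum to one.
    """
    n_d, n_s = len(d_edges) - 1, len(s_edges) - 1
    counts = [[pseudo_count] * n_s for _ in range(n_d)]
    for distance, sharing in pairs:
        counts[bin_index(distance, d_edges)][bin_index(sharing, s_edges)] += 1.0
    return [[c / sum(row) for c in row] for row in counts]
```

The resulting table can be computed once and then looked up, via `bin_index`, for every candidate location considered during inference.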
Finally, when it comes to resolving an unknown location, the natural solution for maximum likelihood problems of this sort is to use gradient-based methods; in this case, with $d_{ku} = \lVert x_u - x_k \rVert$:

$$\nabla_{x_u} \Big[ -\sum_{k:\, s_{ku} > \tau} \log P(s_{ku} \mid d_{ku}) \Big] = -\sum_{k:\, s_{ku} > \tau} \frac{\partial \log P(s_{ku} \mid d)}{\partial d}\bigg|_{d = d_{ku}} \cdot \frac{x_u - x_k}{\lVert x_u - x_k \rVert}$$

leading to the gradient descent algorithm (Algorithm 1). It is noteworthy that the partial derivative with respect to distance can only be computed numerically, since we do not assume any functional form for this conditional distribution. As a consequence, for numeric stability we must estimate $P(s \mid d)$ in relatively narrow bins of $d$. This leads to a small amount of data in each bin, and hence again to stability issues. One could instead select a parametric family for this distribution and proceed with an analytic partial derivative; in that case, the binned estimation of the conditional distribution described above would be replaced by estimating the parameters.
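Algorithm 1 is not reproduced in this text. The sketch below shows one way a gradient descent with a numerically differentiated log-likelihood could look; it is an illustration under stated assumptions, not the authors' algorithm. Locations are assumed to be planar coordinates, `log_prob(s, d)` is a hypothetical callable returning $\log P(s \mid d)$, and the step size, iteration count, and finite-difference width are arbitrary choices.

```python
import math
from typing import Callable, Sequence, Tuple

Neighbor = Tuple[float, float, float]  # (x, y, customer sharing with the unknown store)


def gradient_descent_location(
    neighbors: Sequence[Neighbor],
    log_prob: Callable[[float, float], float],
    start: Tuple[float, float],
    lr: float = 0.1,
    steps: int = 100,
    h: float = 1e-3,
) -> Tuple[float, float]:
    """Minimize the negative log-likelihood over the unknown location (x, y)."""
    x, y = start
    for _ in range(steps):
        grad_x = grad_y = 0.0
        for nx, ny, s in neighbors:
            d = math.hypot(x - nx, y - ny)
            if d == 0.0:
                continue  # direction is undefined exactly at a known store
            # Central finite difference of -log P(s | d) with respect to d.
            slope = -(log_prob(s, d + h) - log_prob(s, d - h)) / (2.0 * h)
            # Chain rule: the gradient of d with respect to (x, y) is the unit vector.
            grad_x += slope * (x - nx) / d
            grad_y += slope * (y - ny) / d
        x, y = x - lr * grad_x, y - lr * grad_y
    return x, y
```

With a binned, non-smooth estimate of the conditional distribution the finite difference becomes noisy, which is the stability issue discussed above.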
These problems make the gradient descent solution less appealing for our specific problem. Luckily, by Theorem 1, for large enough values of the threshold $\tau$ the solution is in the convex hull of the customer-sharing neighbors of the unknown shop. In this regime we are looking for a solution in a relatively contained area in $\mathbb{R}^2$, and can apply an exhaustive (or in practice grid) search. We empirically find that a threshold of is sufficient for the convex hull property to hold (see also Figure 3).
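The grid search can be sketched as follows. For simplicity this example scans the axis-aligned bounding box of the neighbors, which is a superset of their convex hull; planar coordinates, the `log_prob(s, d)` callable, and the grid resolution are all assumptions of this illustration.

```python
import math
from typing import Callable, Sequence, Tuple

Neighbor = Tuple[float, float, float]  # (x, y, customer sharing with the unknown store)


def grid_search_location(
    neighbors: Sequence[Neighbor],
    log_prob: Callable[[float, float], float],
    steps: int = 50,
) -> Tuple[float, float]:
    """Return the grid point maximizing the summed log-likelihood."""
    if not neighbors:
        raise ValueError("at least one known neighbor is required")
    xs = [n[0] for n in neighbors]
    ys = [n[1] for n in neighbors]
    x_min, x_max, y_min, y_max = min(xs), max(xs), min(ys), max(ys)
    best, best_score = (x_min, y_min), -math.inf
    for i in range(steps + 1):
        x = x_min + (x_max - x_min) * i / steps
        for j in range(steps + 1):
            y = y_min + (y_max - y_min) * j / steps
            score = sum(
                log_prob(s, math.hypot(x - nx, y - ny)) for nx, ny, s in neighbors
            )
            if score > best_score:
                best, best_score = (x, y), score
    return best
```

With a single neighbor the bounding box degenerates to a point and the search returns that neighbor's location, matching the single known point case discussed earlier.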
IV Experimental Evaluation
In this section we present a thorough evaluation of the proposed method. Ground truth locations were obtained for a subset of approximately shops (a tiny portion of the many hundreds of thousands of merchants in the US), including a large coffee shop chain, a fast food franchise, a pharmacy, a department store, and a discount supermarket. The store-to-store customer sharing graph was constructed for the entire list of stores in this dataset, and thresholded at ; this value was chosen based on Figure 2 as the minimal value where customer sharing is monotonically more likely as the distance between shops decreases, a property important in our algorithm for resolving locations (see Section III-A above).
We test the proposed method on stores from two lists of cities. The first experiment is conducted on a list that is representative of large cities in the US. The second consists of large to medium cities in the state of California. Together, these experiments test the method on cities representing the core user base of the financial data aggregation service the data is obtained from. Table II lists the cities used in this section. The number of stores per city in the dataset varies substantially between (Chicago) and (Bakersfield, CA).
Another variable that is expected to have a strong effect on the ability to recover locations is the store density in each area. This value is estimated as either the mean or the median distance to the nearest store in the data, and is calculated per city. The median distance ranges from meters (Fresno) to meters (Chicago). Note that a distance value of zero is obtained when two shops are in the same shopping center or mall, and thus share the same address, although the real-world distance can be up to several hundred meters (for our purpose, placing everything in a mall at the same location makes sense, because the effect of distance on customer sharing within this space is expected to be minimal).
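The density estimate described above can be sketched as follows, assuming store locations are given as planar coordinates in meters. The function names are hypothetical, and the quadratic-time scan is only meant for the per-city store counts considered here.

```python
import math
import statistics
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def nearest_store_distances(points: Sequence[Point]) -> List[float]:
    """Distance from each store to its nearest other store (needs >= 2 stores)."""
    return [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]


def density_summary(points: Sequence[Point]) -> Tuple[float, float]:
    """Return (median, mean) of the nearest-store distances for one city."""
    distances = nearest_store_distances(points)
    return statistics.median(distances), statistics.fmean(distances)
```

Two stores sharing the same address contribute a distance of zero, as in the shopping-mall case mentioned above.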
The average value of customer sharing with the nearest shop (final column in Table II) does not seem to relate directly to the store density value. We interpret this as an indication that distances have a different meaning to customers depending on the environment. The tendency to walk or travel longer distances in a specific city should ideally be taken into account. We further discuss this issue with respect to future work in the Conclusions section below.
In each of the following experiments, the ability of the proposed method to infer store location was tested and compared to two baseline options. The NN-1 baseline consists of the nearest neighbor in customer sharing space. Namely, for each test store the location of the known store with the highest value in the customer sharing index is adopted as the inferred location. Likewise, the NN-3 method uses the center of mass of the three stores with the largest extent of customer sharing with the test store, and adopts that as the inferred location for the unknown store.
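Both baselines can be expressed with a single helper, sketched below. Whether the NN-3 center of mass is weighted by customer sharing is not stated in the text; this example assumes an unweighted centroid, and the tuple layout is likewise an assumption.

```python
from typing import Sequence, Tuple

Neighbor = Tuple[float, float, float]  # (x, y, customer sharing with the test store)


def nn_baseline(neighbors: Sequence[Neighbor], k: int = 1) -> Tuple[float, float]:
    """Unweighted centroid of the k known stores with the highest customer sharing."""
    if not neighbors:
        raise ValueError("at least one known store is required")
    top = sorted(neighbors, key=lambda n: n[2], reverse=True)[:k]
    return (
        sum(n[0] for n in top) / len(top),
        sum(n[1] for n in top) / len(top),
    )
```

Here `nn_baseline(neighbors, k=1)` corresponds to NN-1 and `nn_baseline(neighbors, k=3)` to NN-3.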
The tests are performed in a leave-one-out scheme. One at a time, each of the stores in the dataset is treated as an unknown location, and the location is to be inferred by each of the methods based on all other stores. The misplacement error is then computed as the distance between the inferred location and the actual location of the store. This testing scheme is meant to approximate the error expected when utilizing these methods to find the locations of stores with currently unknown locations, based on the full dataset of known locations.
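The leave-one-out scheme can be sketched generically, with the inference method passed in as a callable. The signature of `infer` (held-out store identifier plus the remaining known locations) is an assumption of this example, as is the use of planar coordinates for the error computation.

```python
import math
import statistics
from typing import Callable, Dict, Tuple

Point = Tuple[float, float]


def leave_one_out_median_error(
    locations: Dict[str, Point],
    infer: Callable[[str, Dict[str, Point]], Point],
) -> float:
    """Median misplacement error when each store in turn is treated as unknown."""
    errors = []
    for store_id, true_xy in locations.items():
        # Hide the held-out store's location from the inference method.
        known = {k: v for k, v in locations.items() if k != store_id}
        guess = infer(store_id, known)
        errors.append(math.dist(guess, true_xy))
    return statistics.median(errors)
```

Any of the methods compared in this section could be plugged in as `infer`, provided it looks up customer sharing for `store_id` internally.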
The California cities evaluation included Los Angeles, San Diego, San Francisco, San Jose, Sacramento, Fresno, and Bakersfield. Firstly, for all cities in the list both the baseline NN-1 and the proposed method (MaxLike column in Table III) achieve the desired neighborhood precision. This serves as an indication that the general method of inferring location based on shared customers is feasible. Secondly, in all cases the proposed maximum likelihood based method outperforms both baseline methods, often by a substantial margin.
The purpose of the NN-3 baseline is to determine whether the proposed maximum likelihood method succeeds mostly due to the integration of information from several nearby locations. However, here and in all other experiments we performed, the NN-3 method is strongly dominated by NN-1, meaning that if we are to simply use a weighted average location, then adding extra locations is of no use. This result persisted also when going beyond three neighbors, further indicating the usefulness of our statistical model.
[Table II: City | n | distance to nearest store (median) | distance to nearest store (mean) | customer sharing with nearest store]
The median displacement errors in inferring store locations in California cities (Table III) are highly variable, as we would expect from the varying sample size and density (Table II). In San Francisco, a dense city with a large sample size, we obtain a median displacement of meters. This goes up to nearly kilometers in Fresno, where the sample size is only , and the stores are sparsely distributed with a mean distance of to the nearest store in this dataset.
These two parameters (number and density of known locations in a region) emerge as important determinants of success both for the proposed method and the baseline alternative. Essentially, since our approach is to infer locations based on neighbors, we will inherently be limited by the quality of our known neighbor data. This has practical consequences in directing the type of additional information we should collect in order to improve results where they are most important to us.
For the most part, we are interested in locating shops to within the typical distance between shops of the same chain. The per-chain results are shown for the maximum likelihood method in Table V. Results for the pharmacy chain are excellent, mostly under meters. Since pharmacy shops of the same chain do not tend to appear in close proximity to each other, these numbers are satisfactory.
For the coffee shop franchise, on the other hand, we see mixed results. In some cities we are able to infer their locations with median accuracy in the meter range. In other cities we are substantially above meters. For the urban density of popular coffee shops this may well not be sufficient. This directs us to the type of data we must collect, namely additional shops in these specific areas where coverage is currently insufficient for our needs.
In the large city experiment (Table IV) results are very similar in essence. Overall the proposed method outperforms the baseline in almost all cases. In a single case (Philadelphia) the baseline ties with the maximum likelihood approach. Here again results are tied to a large extent to the number and density of the known locations in the city in our dataset. The great accuracy of meters in NYC, for instance, is tied to the large sample size and high density in the dataset (see Table II), whereas for Houston we have a large sample size, but low store density, and a median displacement error of meters.
[Table V: per-chain results of the maximum likelihood method; columns: city/chain | coffee shop | fast food | pharmacy | department shop]
V Related Work
The Weighted Graph Matching Problem (WGMP)
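A weighted graph is an ordered pair $G = (V, w)$ where $V$ is a set of nodes and $w$ is a function $w: V \times V \to \mathbb{R}_{\ge 0}$ assigning a weight to each pair of nodes. If the function $w$ is symmetric the graph is undirected, whereas for an arbitrary $w$ the graph is directed. The Weighted Graph Matching Problem (WGMP) is the following: given two weighted graphs $G_1 = (V_1, w_1)$ and $G_2 = (V_2, w_2)$ where $|V_1| = |V_2|$, find a one-to-one correspondence $\pi: V_1 \to V_2$ that minimizes:

$$\sum_{u, v \in V_1} \big( w_1(u, v) - w_2(\pi(u), \pi(v)) \big)^2$$

A closely related problem is the Graph Isomorphism Problem (GIP), where the goal is to decide whether, for two given graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$, there exists a one-to-one correspondence $\pi$ between their vertices such that $(u, v) \in E_1 \iff (\pi(u), \pi(v)) \in E_2$. GIP can easily be formulated as a version of WGMP for which $w_i: V_i \times V_i \to \{0, 1\}$, where $w_i(u, v) = 1$ if and only if $(u, v) \in E_i$.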
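To make the matching objective concrete, the sketch below solves a tiny WGMP instance by brute force over all permutations. It assumes the common squared-difference form of the objective and weight matrices given as nested lists; the function names are hypothetical, and the factorial running time is exactly why this is not a practical approach.

```python
from itertools import permutations
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def matching_cost(w1: Matrix, w2: Matrix, pi: Sequence[int]) -> float:
    """Sum of squared weight differences when node u of graph 1 maps to pi[u]."""
    n = len(w1)
    return sum(
        (w1[u][v] - w2[pi[u]][pi[v]]) ** 2 for u in range(n) for v in range(n)
    )


def brute_force_wgmp(w1: Matrix, w2: Matrix) -> Tuple[Tuple[int, ...], float]:
    """Exhaustively search all one-to-one correspondences (tiny graphs only)."""
    candidates: List[Tuple[int, ...]] = list(permutations(range(len(w1))))
    best = min(candidates, key=lambda pi: matching_cost(w1, w2, pi))
    return best, matching_cost(w1, w2, best)
```

A cost of zero means the two weighted graphs are identical up to relabeling, which is the sense in which graph isomorphism appears as a special case.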
Both GIP and WGMP (as well as other versions of them) have been extensively studied. These problems appear naturally in applications of pattern recognition, computer vision, neuroscience and others (see for example, , ). From a theoretical perspective, the question of whether GIP is in P or is NP-complete is one of the oldest open questions in complexity theory [6]. WGMP is NP-hard, and despite many attempts (see for example ,), no constant-ratio approximation algorithm is known for it.
Inferred Location Problems
One of the motivating use-cases for the inference of store location in the current work is to extend this layer of spatial information to users, by associating them with the locations at which their transactions take place. The resulting representation is a heatmap describing a small number of areas which together represent the regular habitat of a person. Previous work has already used purchase history information to predict demographic characteristics of consumers [10, 12], and this work may be seen as a further extension of this to geo-spatial information.
Another field where locations are being inferred from relation graphs combined with a subset of known locations is digital archaeology. An especially inspiring case is that of the Old Assyrian long-distance trade routes. These routes spanned vast distances, between numerous cities and outposts, many of which have since been covered by dust [7]. Four thousand years later, a combination of archaeological and economic graph-based analytics techniques was able to uncover lost locations, based on travel logs detailing the time spent along the route [1].
One should note that the inference problem we tackle in this paper is different from WGMP also in the structure of the data it operates on. Applying WGMP for our problem would require that the input would be two graphs, a graph of locations whose nodes are indexed in an arbitrary way, and which is weighted by the actual distances between the shops, and a graph (derived from the transactions) whose nodes are indexed according to the internal index of the merchant, and weighted according to customer purchases (we formulate this weighting in Section III). The goal of WGMP is then to find the permutation that maps the arbitrary indices to the merchant indices, and by this get the locations.
Even if the list of locations of unknown stores were to be given, the hardness of the graph matching alternative would still require an alternative approach to the global matching of these two graphs. In that case, our approach might be seen as a local matching relaxation, where each node is matched to a location in the graph of real-world locations, based on the already-matched locations.
VI Conclusions
For a large financial aggregation service, knowing where transactions took place is a key enabler for a vast number of location-based features. This information is not present in the raw data arriving from financial institutions, and thus has to be inferred. We present a statistical inference formulation of the problem, based on a seed of known locations, and the graph of store-to-store relationships defined by the shared customer base.
Experiments on many cities show the proposed method is able to locate stores within neighborhood precision or better. Since the inference process is limited by the density of known locations, as the number of those increases we expect to see improvement in the accuracy of place assignment for unknown shops.
A drawback of the simple approach described in this paper is that in reality the meaning of real-world distances varies according to the type of store one is visiting. For instance, one might travel a relatively longer distance to a preferred supermarket than to a coffee shop (evidence of this is indeed seen in Figure 2). As a result, a relatively distant shop of one type might share a substantial customer base with a local shop, whereas two distant shops of another kind are unlikely to do so.
Although the algorithmic framework is general enough to support this property of customer behavior in the model, that would require estimating separate conditional distributions $P(s \mid d)$ for each pair of store classes. However, when scaling to the large dataset containing all US stores, and a very large set of store classes, the functions become prohibitively many.
Furthermore, we would ideally want to share information between some of the classes of stores. For instance, super-markets and hardware shops may very well behave in the same way regarding their profile of customer sharing with local coffee-shops. We do not want to assume any clustering of types of shops, but rather learn these relationships from the data. Future work will focus on the extension of the current method to incorporate these and other attributes of shops and not just the distance between them.
Another avenue for further improvement is the measure of distance between shops. In this paper we use the Euclidean aerial distance between locations (as the crow flies). A better measure of effective distance should consider road topology, or perhaps measure the time to travel between locations instead. In order to incorporate this richer notion of distance we will need to overcome substantial hurdles in solving the resulting optimization problems.
Future work will also focus on incorporating a notion of trajectory analysis (for instance by building on the trajectory locality notion introduced in [9]). By understanding the path of users in the space of merchants they make purchases from, and enforcing speed constraints, we will be able to further improve the store localization method developed in this paper. Our hope is that by adding trajectory and other additional sources of information we will be able to pin-point locations, without giving up the favorable scalability properties of the approach that enable the application to real-world financial big data.
- [1] Barjamovic, G., Chaney, T., Coşar, K.A., Hortaçsu, A.: Trade, merchants, and the lost cities of the bronze age. Tech. rep., National Bureau of Economic Research (2017)
- [2] Berg, A.C., Berg, T.L., Malik, J.: Shape matching and object recognition using low distortion correspondences. In: Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on. vol. 1, pp. 26–33. IEEE (2005)
- [3] Bunke, H.: Graph matching: Theoretical foundations, algorithms, and applications. In: Proc. Vision Interface. vol. 2000, pp. 82–88 (2000)
- [4] Conte, D., Foggia, P., Sansone, C., Vento, M.: Thirty years of graph matching in pattern recognition. International Journal of Pattern Recognition and Artificial Intelligence 18(03), 265–298 (2004)
- [5] Fiori, M., Sapiro, G.: On spectral properties for graph matching and graph isomorphism problems. Information and Inference: A Journal of the IMA 4(1), 63–76 (2015)
- [6] Garey, M.R., Johnson, D.S.: Computers and intractability, vol. 29. W. H. Freeman, New York (2002)
- [7] Kool, J., et al.: The Old Assyrian Trade Network from an Archaeological Perspective. B.S. thesis (2012)
- [8] Lyzinski, V., Fishkind, D.E., Fiori, M., Vogelstein, J.T., Priebe, C.E., Sapiro, G.: Graph matching: Relax at your own risk. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(1), 60–73 (2016)
- [9] Resheff, Y.S.: Online trajectory segmentation and summary with applications to visualization and retrieval. In: Big Data (Big Data), 2016 IEEE International Conference on. pp. 1832–1840. IEEE (2016)
- [10] Resheff, Y.S., Shahar, M.: Fusing multifaceted transaction data for user modeling and demographic prediction. arXiv preprint arXiv:1712.07230 (2017)
- [11] Wakamori, N., Welte, A.: Why do shoppers use cash? Evidence from shopping diary data. Journal of Money, Credit and Banking 49(1), 115–169 (2017)
- [12] Wang, P., Guo, J., Lan, Y., Xu, J., Cheng, X.: Your cart tells you: Inferring demographic attributes from purchase data. In: Proceedings of the Ninth ACM International Conference on Web Search and Data Mining. pp. 173–182. ACM (2016)
- [13] Zhou, F., De la Torre, F.: Factorized graph matching. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. pp. 127–134. IEEE (2012)