Linking Sequences of Events with Sparse or No Common Occurrence across Data Sets

11/12/2017 ∙ by Yunsung Kim, et al. ∙ Columbia University

Data of practical interest - such as personal records, transaction logs, and medical histories - are sequential collections of events relevant to a particular source entity. Recent studies have attempted to link sequences that represent a common entity across data sets to allow more comprehensive statistical analyses and to identify potential privacy failures. Yet, current approaches remain tailored to their specific domains of application, and they fail when co-referent sequences in different data sets contain sparse or no common events, which occurs frequently in practice. To address this, we formalize the general problem of "sequence linkage" and describe "LDA-Link," a generic solution that is applicable even when co-referent event sequences contain no common items at all. LDA-Link is built upon the "Split-Document" model, a new mixed-membership probabilistic model for the generation of event sequence collections. It detects the latent similarity of sequences and thus achieves robustness particularly when co-referent sequences share sparse or no event overlap. We apply LDA-Link in the context of social media profile reconciliation where users make no common posts across platforms, comparing it to the state-of-the-art generic solution to sequence linkage.


1 Introduction

Given multiple independent data sources, it is desirable to link representations of identical entries to allow valuable statistical analyses or to save clerical efforts of identification. When the identifying attributes of the objects are absent, sanitized, or prone to error, resolving identity uncertainty becomes a highly non-trivial task. State-of-the-art solutions to identity uncertainty can effectively associate co-referent records across relational databases (record linkage) or dissimilar representations of the same object (author identification, noun coreference, image association), and they have widespread applications to reference matching lawrence1999autonomous ; mccallum2000efficient ; mccallum2004conditional ; pasula2003identity , public policy making sadinle2014detecting ; sadinle2013generalized ; jaro1989advances , behavioral analysis ji2014structure ; kazemi2015growing ; narayanan2009anonymizing , biomedical science christen2002febrl , and database cleansing bhattacharya2007collective .

Most data of practical interest today, however, is a collection in which each entry is a sequence of events generated by an entity. This happens ubiquitously, especially when these entries reflect individual traits: shopping services keep track of different consumption histories, location-based applications maintain geo-tagged records of an individual’s whereabouts, and medical records and prescription histories contain a sequence of health-related incidents. Sequences collected in relevant domains reflect the common features of the same entity (e.g., consumption preferences, locational trajectories, or health conditions), which can provide evidence for matching ones that refer to a common source.

Although these sequences share an abstract generative pattern, their domain disparity prevents them from sharing rich common events. For example, in many mobility-related applications, especially with granular records of time and location, one type of spatiotemporal event (such as a phone call) is never guaranteed to occur in tandem with another (a credit card transaction). This poses a critical challenge for sequence linkage and also distinguishes it from the previously studied problems of identity uncertainty. While the entries being matched in the latter consist of distortions or alternate forms of “ground-truth” features such as canonical names, objective measurements, or semantic definitions, no such ground-truth exists in sequence linkage, so the solutions to these problems are crucially inapplicable. In addition, the frequent absence of common events makes sequence linkage an even more intricate task.

In this work, we present the simplest form of a generically applicable Bayesian framework that addresses the issues of rare common occurrence in cross-domain sequence linkage. This framework consists of a mixed-membership model for the generation of event sequences whose source entities are shared across data sets (Split-Document model), and a 3-phase unsupervised algorithm for inferring their identities across data sets (LDA-Link). Split-Document frames each event incident in terms of semantic “motifs,” and LDA-Link uses this characterization to determine the semantic similarities of a pair of views, ensuring robust linkage even when the co-referent sequence pairs share no common events at all.

To validate the empirical robustness of LDA-Link against common-occurrence sparsity, we also provide a case study with real-world geo-tagged data sets. Mobility-related data are among the richest and most omnipresent types of data, available through numerous location-based smartphone applications and other external services such as social media, cellular logs, and credit card transaction histories. As diverse as these applications are, collections from one application are often independent of collections from another and their locations seldom coincide, which makes them a challenging target for sparsity-robust sequence linkage. LDA-Link achieves up to 37% identifiability when linking profiles with no common posts across Instagram and Twitter, significantly outperforming the current state-of-the-art generic solution to sequence linkage.

The paper is organized as follows. Section 2 presents an overview of relevant works and formalizes the problem of sequence linkage. Section 3 reviews Latent Dirichlet Allocation and its inference methods that are essential to the development of our work. Section 4 presents Split-Document model and LDA-Link. In Section 5 we perform a case study of this method in the context of social media identity reconciliation, and validate its robustness against sparsity. We conclude our paper in Section 6. The Appendix analyzes each step of the algorithm in theory and studies its convergence properties as well as its effectiveness.

2 Related Works and Problem Formulation

2.1 Previous Works

Sequence linkage belongs to the class of problems on identity uncertainty. Existing solutions to identity uncertainty are customized to linking representations of an entity with “ground-truth” field values such as canonical names, unambiguous semantics, numerical features, or objective measurements. Deterministic association can be done to declare exact matches on these features when error-free identifiers are available, but entries in large data sets tend to be prone to noise and distortions such as human inscription or measurement errors, use of alternative forms, subjective observations, or deliberate data sanitization.

In the absence of trustworthy identifiers, probabilistic methods can be used to address uncontrollable noise. The seminal work of fellegi1969theory provided the first probabilistic framework for linking records in relational databases that refer to the same entity (record linkage) based on agree/disagree match statuses of each field. Methods using generalized Expectation-Maximization winkler2000frequency ; winkler1993improved , scoring thibaudeau1993discrimination , or Gibbs sampling meng1993maximum ; larsen2001iterative for parameter estimation have been developed to overcome the assumption of conditional independence in fellegi1969theory . winkler2000frequency ; winkler1988using suggest a similar method that uses relative frequencies of the field values in place of dichotomous match statuses of individual fields to determine the weight parameters for linkage. The downside of the aforementioned family of models is that they disregard the generative patterns of each observation and thus suffer the loss of evidence contained in the actual values of the noisy observations.

In this regard, Bayesian methods have the advantage of naturally handling noise present in observations. By incorporating uncertainty into a generative process, Bayesian approaches allow the computation of match probabilities conditioned on the actual observations without neglecting the presence of simultaneous matches fortini2001bayesian . Recent solutions to identity uncertainty have benefited from the improvement of Bayesian inference techniques, and these solutions can largely be classified as “parametric,” “cluster-based,” or “correlational” depending on their generative assumptions and the criteria for inferring the co-reference structure.

The parametric family of methods is the most prominent; it encodes the latent linkage structure into a parameter of interest - such as matching matrices fortini2001bayesian ; steorts2015entity ; liseo2013some ; tancredi2011hierarchical ; steorts2014smered ; steorts2015bayesian or co-reference partitions sadinle2013generalized ; sadinle2014detecting - of a probabilistic generative model. Models in tancredi2011hierarchical and liseo2011bayesian represent the linkage structure as a one-hot matching matrix, and co-referent records separately undergo “hit-miss” distortions of latent “true” categorical or continuous normal attributes. Similarly using matching matrices, steorts2014smered ; steorts2015bayesian provide a unified framework for record linkage and de-duplication that can be extended to simultaneously linking records across multiple files. sadinle2013generalized developed a block-partitioning method to find co-referent partitions across multiple files under a normal mixture model, and jaro1989advances extends this model to achieve de-duplication. Apart from the generative models mentioned above, discriminative models such as

Cluster-based methods bind similar objects into clusters, and they are more widely used in other branches of identity uncertainty such as author identification, text classification, and noun coreference, for which co-occurrence is well explained by close-knit groups of relevance. pasula2003identity applies the concept of identity clustering to sample the posterior over all relationships between objects, classes, and attributes modeled with the Relational Probability Model (RPM). bhattacharya2006latent proposes an LDA-based model for the generation of author and citation entries in which authors and publications in the same membership group are more frequently observed together. Non-parametric Bayesian Dirichlet processes (DP) allow the number of such groups to be flexible, and dai2011grouped applies DP to modeling groups of authors associated with topics.

Correlational methods, on the other hand, compute the statistical interrelations of each pair of records rather than attempting to infer the linkage structure a posteriori. klami2013bayesian uses the covariance matrix of a correlational, multivariate normal generative model as a measure of statistical dependency for finding objects with the same identity.

The aforementioned methods suffer from two critical drawbacks when applied to cross-domain sequence linkage. First and foremost, these methods are tailored to the treatment of objects with unambiguous ground-truth features such as relational records with categorical, string-valued, or continuous attributes. As mentioned previously, this makes these methods fundamentally inapplicable to modeling event sequences. Also, the failure to isolate the unknown linkage structure when estimating hidden model parameters adds an extra layer of uncertainty, which can compromise both the accuracy and the convergence rate of the whole process. In addition, the mixture assumptions in many of the above methods are more ill-suited for sequence linkage than mixed-membership models, since in reality each entity exhibits a unique pattern of event generation while a mixture assumption binds entities to a restrictive set of patterns.

One notable method that particularly aims at sequence linkage is unnikrishnan2015asymptotically , which computes the distance of two sequences in the simplex space of empirical distributions. Yet, as our case study in Section 5 reveals, this method fails when the empirical distributions have sparse intersections. In contrast, we propose a method that analyzes the semantics of each event incident and determines the similarity of a pair of views within the latent semantic space, allowing for a more macroscopic pattern analysis. This method resolves all of the above-mentioned drawbacks of existing methods through an information-theoretic interpretation of Latent Dirichlet Allocation as a mechanism for dimension reduction.

2.2 Problem Formulation

Assume a world of real-world entities e_1, …, e_{N_A}, where N_A and N_B denote the sizes of two data sets A and B. (We assume that N_A ≤ N_B.) Entity e_i generates exactly one sequence of events A_i in the A data set, and one or more sequences {B_j : j ∈ σ(i)} in the B data set. Here, the function σ is the “identity” association, which indicates that sequence A_i and sequences {B_j : j ∈ σ(i)} are generated by the same real-world entity, namely e_i, and such sequences in different data sets are said to be ‘linked’ or ‘co-referent.’ (σ maps {1, …, N_A} into the set of all subsets of {1, …, N_B}, and it is a function whose image forms a partition of the set {1, …, N_B}.)

The sequence generation proceeds as follows. Each entity performs a sequence of actions, each drawn from V possible actions. (This categorical assumption can be easily extended to the continuous case.) When an action is performed by entity e_i, this appends an entry to exactly one of the sequences among A_i and {B_j : j ∈ σ(i)}. Every sequence is therefore an ordered collection of events, which is equivalent to the concept of a “word” in information theory and a “document” in topic modeling. We will call each such sequence a “view.” See Figure 1 for a visual understanding.

(a) Generation of Views: Hidden entities generate sequences across data sets. (b) Sequence Linkage: σ is the true linkage structure

Figure 1: Description of View Generation and the Linkage Problem. (Colors represent events.)

A typical problem of identity uncertainty is to find the exact identity mapping σ. When the data sets A and B consist of an identical number of entries, this becomes a well-known problem of finding the optimal bipartite graph matching between the A and B views. When the data sets A and B differ in size, however, graph matching methods are no longer straightforwardly applicable. In this paper, we study instead the problem of finding for each entity a set of up to R candidates for a modest choice of R, which is a more realistic approach in cases where the number of co-referent target views is unknown or different for each entity. Although the candidate sets will no longer be mutually exclusive, this method has the additional advantage of preventing close misses at the slight cost of preciseness. We will call this type of matching “Rank-R” matching, since it associates a view with the R supposedly most relevant candidate views.

Notations

In Section 4 and the Appendix, we use indices i and j to refer to views in the A- and B-domain data sets without specific knowledge of or reference to their real-world entities. To refer to a view in relation to a specific entity, we use the index e, so that A_e and B_e identify the A and B views generated by entity e. Sometimes we will use D to indicate an arbitrary view regardless of its domain. Although a view is a sequence of events, the Split-Document model is a bag-of-words model in which events are exchangeable, as we shall see in the later sections. We can thus represent a view as a vector of event frequencies and use subscripts such as n_{D,v} (or equivalently n_{A_e,v}, n_{B_e,v}) to refer to the frequency of event v in view D (or A_e, B_e). Sometimes we may want to normalize the frequency vector to create a vector of relative frequencies (an empirical distribution). We denote this normalized relative frequency vector by p, and use superscripts such as p^A and p^B to specify its domain. Lastly, we index topics with k ∈ {1, …, K}, and each topic β_k is a probability distribution over the V events.

3 Latent Dirichlet Allocation and Variational Bayes

The proposed sequence linkage framework is closely related to Latent Dirichlet Allocation (LDA) blei2003latent . The following subsections review LDA and the algorithms used for its statistical inference.

LDA is the simplest topic model, a Bayesian probabilistic model for generating documents. Each topic is a probability distribution over the vocabulary space. A total of K topics are assumed, and each topic is drawn from a Dirichlet prior, β_k ∼ Dirichlet(η). Given these topics, each of the D documents is generated in the following way. First a “topic proportion” θ_d ∼ Dirichlet(α) is drawn. Then for each word w_{d,n}, a topic index z_{d,n} is drawn according to the topic proportion, z_{d,n} ∼ Multinomial(θ_d), and w_{d,n} ∼ Multinomial(β_{z_{d,n}}). Therefore the complete joint probability distribution becomes

p(β, θ, z, w) = ∏_k p(β_k | η) ∏_d p(θ_d | α) ∏_{d,n} p(z_{d,n} | θ_d) p(w_{d,n} | β, z_{d,n}),     (1)

whose graphical model is shown in Figure 2.

Figure 2: Latent Dirichlet Allocation
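To make the generative process concrete, the following minimal sketch samples a small synthetic corpus from this model; the vocabulary size, topic count, document count, and hyperparameter values are illustrative assumptions, not values from the paper.

import numpy as np

rng = np.random.default_rng(0)
V, K, D, N_d = 50, 5, 20, 100        # vocabulary size, topics, documents, words per document (assumed)
eta, alpha = 0.1, 0.5                # Dirichlet hyperparameters (assumed)

beta = rng.dirichlet(np.full(V, eta), size=K)      # each topic beta_k is a distribution over the V words

docs = []
for d in range(D):
    theta_d = rng.dirichlet(np.full(K, alpha))     # per-document topic proportion
    z = rng.choice(K, size=N_d, p=theta_d)         # topic index for each word
    words = [rng.choice(V, p=beta[t]) for t in z]  # each word drawn from its chosen topic
    docs.append(words)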

Exact MAP inference on LDA is infeasible blei2003latent , and usually one resorts to approximate inference techniques such as MCMC and variational inference. In our approach, a particular variant of variational inference is used, namely stochastic variational inference hoffman2013stochastic .

3.1 Variational Bayesian Inference for LDA

Variational Bayesian (VB) methods approximate an intractable posterior distribution by a member of a family of simpler distributions. In particular, VB attempts to find the distribution q that is closest in KL divergence to the true posterior distribution p(β, θ, z | w), where the conditioning hyperparameters in p are omitted for simplicity. The simplest and most frequently utilized family is the “mean-field” family, a family of product distributions that factorizes into a term per latent variable:

q(β, θ, z) = ∏_k q(β_k | λ_k) ∏_d q(θ_d | γ_d) ∏_{d,n} q(z_{d,n} | φ_{d,n}),     (2)

where q(β_k | λ_k) is Dirichlet, q(θ_d | γ_d) is Dirichlet, and q(z_{d,n} | φ_{d,n}) is multinomial for conjugacy.

Instead of attempting to minimize the KL divergence directly, which involves the intractable distribution p(β, θ, z | w), one may use the following relation

log p(w) = L(q) + KL( q(β, θ, z) ‖ p(β, θ, z | w) ),

where L(q) is the quantity to maximize instead. L(q) is called the Evidence Lower BOund (ELBO), which is conceptually the lower bound on log p(w) given by Jensen’s Inequality. Early implementations of variational inference used the method of coordinate ascent, which iteratively maximizes the ELBO for each variational parameter while keeping the others fixed wainwright2008graphical . Although this method guarantees local convergence, it requires batch updates that become costly for large corpora, which led to the development of online variational Bayes, which uses stochastic gradient descent for faster convergence hoffman2010online .

3.2 Stochastic Gradient Descent and Online VB for LDA

We first briefly discuss the nature of stochastic gradient descent before discussing online VB. Stochastic gradient descent (SGD) optimizes an objective function when only noisy estimates of the true gradient are available bottou2004stochastic . Given an objective function of the form f(λ) = E_x[g(x, λ)], SGD applies the update formula λ_{t+1} = λ_t + ρ_t ∇_λ g(x_t, λ_t) to find the optimal value of λ, where x_t is an event drawn from the distribution of x and ρ_t is the learning rate. The update term satisfies E_x[∇_λ g(x, λ)] = ∇_λ f(λ), and thus can be understood as a noisy yet consistent estimate of the true gradient of f at λ. It is shown that λ_t converges almost surely to a local optimum bottou1998online .
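As a toy illustration of such noisy-gradient updates (not part of the paper’s algorithm), the following sketch minimizes the expected squared loss E_x[(x − λ)^2] using a single sample per step and a decaying learning rate; it is the descent form of the update above.

import numpy as np

rng = np.random.default_rng(1)
lam = 0.0                          # parameter to optimize
for t in range(1, 10001):
    x = rng.normal(loc=3.0)        # one sample from the data distribution
    grad = 2.0 * (lam - x)         # noisy but unbiased estimate of the true gradient
    rho = 1.0 / (t + 10.0)         # decaying learning rate
    lam -= rho * grad
print(lam)                         # converges toward E[x] = 3.0, the minimizer of E[(x - lam)^2]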

Stochastic Maximization of the ELBO

SGD can be used to optimize the ELBO for LDA hoffman2013stochastic (in fact, SGD can be applied in VB for any generative model that involves local and global latent variables hoffman2013stochastic ). Consider the factorization of p and q from Equations 1 and 2. Letting d be a random variable that chooses an index uniformly over the document indices {1, …, D}, we can rewrite the ELBO as an expectation of per-document terms, so that the “natural gradient” of the per-document term with respect to each global variational parameter λ_k is a noisy yet unbiased estimate of the natural gradient of the variational objective. Computing the natural gradient instead of the Euclidean gradient corrects for the geometry of the space of the variational parameters by using the symmetrized KL divergence as the measure of spatial distance amari1998natural and thus leads to a more effective convergence to the local optimum hoffman2013stochastic . Refer to hoffman2010online for the resulting online variational Bayesian algorithm for LDA.
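In schematic form, and assuming the standard online LDA parameterization (the step-size constants tau and kappa below are illustrative), each stochastic step blends the current topic estimate with the noisy per-document estimate:

def stochastic_topic_update(lam, lam_hat, t, tau=1.0, kappa=0.7):
    # One stochastic natural-gradient step: move the global topics lam toward
    # the noisy per-document estimate lam_hat with a decaying step size.
    rho = (t + tau) ** (-kappa)
    return (1.0 - rho) * lam + rho * lam_hat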

4 Split-Document Model and LDA-Link Algorithm

We now introduce the Split-Document model, a simple probabilistic generative model for co-referent views based on LDA. Building on this model, we suggest LDA-Link as a solution for identifying co-referent views across data sets of different domains.

4.1 Split-Document Model

Figure 3: Split-Document Model

The Split-Document model displayed in Figure 3 extends LDA to model the generation of views across domains. In this model, a real-world entity generates a sequence of i.i.d. events according to LDA, and each event falls into exactly one among multiple co-referent views. (For now we assume that each entity generates a single view in each data set, and that views in the same data set have a fixed number of events.) A set of co-referent views are thus generated through a mixture model whose mixture proportions depend on the sequence-generating entity.

Although the Split-Document model is kept simple in this paper for the sake of lucid presentation, its assumptions can be generalized to allow more than two data sets, each entity generating more than one view in each data set, and events occurring in different data sets with different probabilities. As in steorts2014smered ; steorts2015bayesian , this approach would combine the problem of record linkage with deduplication.
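As an illustration of the basic (two-data-set, single-view) model just described, the following sketch generates one entity’s events through the LDA process of Section 3 and then splits them at random into a view for data set A and a view for data set B, so that the two co-referent views share a single topic proportion; the sizes and hyperparameters are assumed for illustration.

import numpy as np

rng = np.random.default_rng(2)
V, K = 50, 5                                      # event space size and topic count (assumed)
eta, alpha = 0.1, 0.5                             # Dirichlet hyperparameters (assumed)
beta = rng.dirichlet(np.full(V, eta), size=K)     # topics shared by all entities

def generate_coreferent_views(n_events=200, p_A=0.5):
    # Generate one entity's events via LDA, then let each event fall into exactly one view.
    theta = rng.dirichlet(np.full(K, alpha))      # entity-specific topic proportion
    z = rng.choice(K, size=n_events, p=theta)
    events = np.array([rng.choice(V, p=beta[t]) for t in z])
    in_A = rng.random(n_events) < p_A             # which data set each event lands in
    return events[in_A], events[~in_A]            # (view in data set A, view in data set B)

view_A, view_B = generate_coreferent_views()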

4.2 LDA-Link Algorithm

We now introduce LDA-Link, a co-reference linkage algorithm based on the Split-Document model. The key idea is to consider topic proportions as a reduction of dimensionality from the size V of the entire event space to the number of topics K, and to compare these dimension-reduced representations. This distinguishes LDA-Link from other methods that leverage only the common events that appear in both views, which causes them to fail when a sequence pair displays sparse common occurrence.

The algorithm works in three separate phases. First, in the “Learning” phase, topics are estimated from the views in the two domains. In the second phase, the topic estimates are used to find, for each view, the topic proportion that maximizes the posterior distribution given the estimated topics. This is the “Dimensionality Reduction” phase, which effectively reduces the dimension of each view from V to K. A score is then computed for every pair of views as the Jensen-Shannon distance between their topic proportions. In the last “Rank-R Linkage” phase, up to R candidate views are declared as potential matches for each view based on these scores.

The rest of this section discusses each phase in detail. Appendices A through D will explore the guarantees of this algorithm in theory.

Learning Phase: Topic Estimation

In the Split-Document model, co-referent views share the same topic proportion, and these views are essentially a bipartition of a document generated through the normal, un-split LDA process described in Figure 2. If the true co-reference linkage structure were known, estimating the topics for the Split-Document model would amount to finding the MAP LDA topics with all co-referent views combined into one document. Yet, since the co-reference structure is the unknown that we aim to find, it is difficult to directly compute the MAP estimates of the generative model in Figure 3.

A workaround is to consider a slightly different surrogate model in which each view is treated as a separate document with a topic proportion in its own right, and instead learn the topics optimal to this model as an approximation. This surrogate model is called the Independent-View model, shown in Figure 4. Intuitively, in large-document limits where the number of words reaches infinity, learning the Independent-View model is equivalent to learning LDA with duplicates of each document. (We study the effectiveness of this surrogate learning in Appendix A.)

Figure 4: Independent-View Model (Surrogate for Learning Topics in Split-Document Model)

We will call the ELBO of the Independent-View model L_IV, which is given by

(3)

where the per-view term is given as

(4)

Note its difference from the ELBO of the Split-Document model, in which every pair of co-referent views shares a single topic proportion.

Algorithm 1 summarizes the Learning phase.

Define ρ_t = (t + τ)^(−κ). Initialize the topics λ
for each view d, sampled at step t = 1, 2, …, do
     Initialize γ_d
     repeat
          φ_{d,n,k} ∝ exp{ E_q[log θ_{d,k}] + E_q[log β_{k,w_{d,n}}] }
          γ_{d,k} = α + Σ_n φ_{d,n,k}
     until γ_d has converged
     Compute λ̂_{k,v} = η + D Σ_n φ_{d,n,k} 1[w_{d,n} = v]
     λ ← (1 − ρ_t) λ + ρ_t λ̂
end for
return λ
Algorithm 1 Learning Phase: Topic Estimation
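A minimal stand-in for this phase (not the paper’s own implementation) is to run an off-the-shelf online variational Bayes LDA, treating every view in both data sets as a separate document, as the Independent-View surrogate prescribes. The scikit-learn estimator below plays that role; the topic count and the random count data are assumed for illustration only.

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def learn_topics(view_counts, n_topics=20, seed=0):
    # Learning phase surrogate: online variational Bayes LDA over all views.
    # view_counts: array of shape (number of A views + number of B views, V),
    # one row of event frequencies per view from either data set.
    lda = LatentDirichletAllocation(
        n_components=n_topics,
        learning_method="online",      # stochastic variational inference
        random_state=seed,
    )
    lda.fit(view_counts)
    return lda

rng = np.random.default_rng(0)
counts = rng.poisson(0.2, size=(100, 50))          # stand-in for real view-by-event counts
model = learn_topics(counts)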

Dimension Reduction and Rank-R Linkage Phases

With the topics obtained in the learning phase, the topic proportions are learned for each view using the conventional coordinate ascent method, mapping each view to a vector on the K-dimensional latent semantic simplex. Once these topic proportions are learned, a dissimilarity score based on the Jensen-Shannon distance is computed for every pair of views as

s(A_i, B_j) = JSdist(θ_{A_i}, θ_{B_j}),     where JSdist(p, q) = ( ½ KL(p ‖ m) + ½ KL(q ‖ m) )^{1/2} and m = ½(p + q).     (5)

Algorithm 2 summarizes the Dimension Reduction phase.

Initialize γ_d , φ_d for every view d
for each view d in the two data sets do
     repeat
          φ_{d,n,k} ∝ exp{ E_q[log θ_{d,k}] + E_q[log β_{k,w_{d,n}}] }
          γ_{d,k} = α + Σ_n φ_{d,n,k}
     until γ_d has converged
end for
return the topic proportions of the A views and of the B views
Algorithm 2 Dimension Reduction Phase
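A sketch of the Dimension Reduction phase and of the scoring step under the same stand-in: transform() returns each view’s normalized variational topic proportions, which are then compared with the Jensen-Shannon distance. The helper names are illustrative.

import numpy as np
from scipy.spatial.distance import jensenshannon

def topic_proportions(model, view_counts):
    # Map each view to a point on the K-dimensional topic simplex (rows sum to 1).
    return model.transform(view_counts)

def score_matrix(theta_A, theta_B):
    # Jensen-Shannon distance between every pair of A and B topic proportions.
    scores = np.zeros((len(theta_A), len(theta_B)))
    for i, p in enumerate(theta_A):
        for j, q in enumerate(theta_B):
            scores[i, j] = jensenshannon(p, q, base=2)
    return scores

# Example usage with the model and counts from the learning-phase sketch:
# theta_A = topic_proportions(model, counts[:50]); theta_B = topic_proportions(model, counts[50:])
# scores = score_matrix(theta_A, theta_B)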

The final Rank-R Linkage phase selects, for each view, the R candidate views with the smallest dissimilarity scores, as shown in Algorithm 3.

Compute s(i, j) for every pair of views, where s(i, j) is the Jensen-Shannon distance of Equation 5
for each A view i do
     S_i = {j_1, …, j_R}, where j_1, …, j_R are the R B views with the smallest scores s(i, j)
     add S_i to the output
end for
return the candidate sets S_1, …, S_{N_A}
Algorithm 3 Rank-R Linkage Phase
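The final phase then keeps, for each A view, the R candidate B views with the smallest scores; a minimal sketch (with an illustrative function name) follows.

import numpy as np

def rank_r_linkage(scores, r=10):
    # For each A view (row), return the indices of the r B views with the smallest dissimilarity.
    order = np.argsort(scores, axis=1)     # ascending dissimilarity per row
    return order[:, :r]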

5 Case Study: Breaking Anonymity in Location-Based Social Media

We now apply LDA-Link to location-based social media profile linkage, where profiles of the same individual in different social media are matched based on their online activities. As explained in the introduction, location-based data sets present critical challenges of sparse common occurrence. The ability to reconcile identities and bridge data across online social platforms of different thematic nature has significant commercial as well as privacy implications riederer2016linking . First the data sets and the baseline algorithms are explained, and the performance of LDA-Link is evaluated.

5.1 Data Sets

A total of two pairs of datasets were used, both of which contain only the spatio-temporal information of public activities of profiles in two different social media collected during a common time span. Both of these datasets were studied and explained previously in riederer2016linking .

  • Foursquare-Twitter (FT): This dataset contains checkins on Foursquare and posts on Twitter, both of which are geo-temporally tagged. Each account activity is therefore a (user id, time, GPS coordinate) tuple. By selecting only the users who have records in both social media services, a total of 862 users, 13,177 Foursquare checkins, and 174,618 tweets were obtained. The imbalance of activity volumes between the two services is an additional challenge.

    While Foursquare checkins are typically associated with a user exposing their current activities, tweets are associated with more general behaviors. In this dataset, only 260 pairs of checkins (less than 0.3%) had exactly matching GPS coordinates, and none of them were made within 10 seconds of each other, suggesting that it is highly unlikely that there is a pair of account activities forwarded by software across both services riederer2016linking .

  • Instagram-Twitter (IG): The second dataset is also a collection of (user id, time, GPS coordinate) from public posts on the photo-sharing site Instagram, and the microblogging service Twitter. This data set was obtained by linking Instagram and Twitter accounts that were associated with the same URL in their user profiles, and downloading the spatio-temporal tags of the tweets made by these Twitter accounts riederer2016linking . The collection includes 1717 unique users, 337,934 Instagram posts, and 447,336 Tweets.

Both pairs of data sets contain the timestamped visits of each user, and while users are communicating an action or a message to the general public, the events (posts) in each data set are collected within different thematic contexts. The Split-Document model dictates that each recorded visit occurs with a certain purpose, e.g., shopping, sports, travel, or hobbies. LDA-Link attempts to discover such motifs (or “topics”) and associate with every view a proportional mixture of these topics to represent its characteristic features.

Modulating the Sparsity of Common Events with Spatiotemporal Granularity

The full GPS coordinates and timestamps of the posts in each of these data sets never coincide precisely, which raises the problem of determining the meaning of “common occurrence.” Our solution is to bin time and locations based on spatiotemporal proximity, and declare events belonging to the same bin to have occurred in common. The size of the bin can be controlled to represent different levels of measurement precision, and changing the event space granularity in this manner modulates the discreteness and size of the event space. This allows the study of the robustness of the algorithm under different degrees of sparsity available to the linking agent. In our experiment we bin event locations by truncating the coordinate values to a certain number of decimal digits and bin event timestamps into a fixed number of bins. We call each of these numbers the “spatial” and “temporal” granularity.
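A sketch of this binning under the scheme just described (coordinates truncated to a chosen number of decimal digits, timestamps folded into a fixed number of equal-width bins); the parameter values are illustrative.

import numpy as np

def bin_events(lats, lons, timestamps, spatial_granularity=2, n_time_bins=24):
    # Map each (lat, lon, timestamp) record to a discrete event identifier.
    scale = 10.0 ** spatial_granularity
    lat_bin = np.trunc(np.asarray(lats) * scale) / scale     # keep only the leading decimal digits
    lon_bin = np.trunc(np.asarray(lons) * scale) / scale
    t = np.asarray(timestamps, dtype=float)
    edges = np.linspace(t.min(), t.max(), n_time_bins + 1)
    time_bin = np.clip(np.digitize(t, edges) - 1, 0, n_time_bins - 1)
    # Two records are a "common occurrence" iff all three binned values coincide.
    return list(zip(lat_bin, lon_bin, time_bin))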

5.2 Prior Algorithms

Here we summarize three baseline algorithms for identity reconciliation that were inspired by state-of-the-art algorithms in the social computing literature.

Sparsity-Based Scoring: The “Netflix Attack” (NFLX)

Based on the algorithm used to de-anonymize the Netflix Prize dataset in narayanan2008robust , riederer2016linking describes a variation for cross-domain reconciliation, in which a score between a pair of views A_i and B_j is computed from the locations and timestamps of their common events; the remaining quantities in the score are model parameters.

This algorithm relies on the exact timestamps. It matches an A view with the B view with the smallest score, and leaves it unmatched if the best candidate and the second-best differ in score by no more than a set number of standard deviations. In resemblance to narayanan2008robust , this score favors locations that are rarely visited, frequent visits to the same location, and visits that occur close in time, thus exploiting sparsity riederer2016linking .

Density-Based Scoring: JS-Distance Matching (JS-Dist)

In unnikrishnan2015asymptotically , the authors proved the asymptotic optimality of the JS-Distance between relative frequencies (empirical distributions) of two observation sequences as a measure of disparity.

Their algorithm estimates the true matching as the matching that minimizes the sum of the JS measures. It relies on the density of the data, based on the asymptotic convergence of empirical distributions implied by Sanov’s Theorem.

Leveraging Both Sparsity and Density: Poisson Process (POIS)

riederer2016linking suggests a simple generative model for mobility records in which the number of visits to each location within a certain time period follows a Poisson distribution whose rate parameters are specific to the location and period of the visit. Based on this model, a similarity score between two views, summed over location and time indices, can be defined for a MAP estimate of the identity linkage structure. The estimated identity mapping is the mapping that maximizes the expected sum of scores.

5.3 Performance Analysis

We now present the empirical performance of LDA-Link. In light of the “Rank-R” matching described in Section 2.2, we measure performance in terms of the Rank-R recall, which is the proportion of views in the source data set whose co-referent views in the target data set are fully contained in the set of R best candidates, not allowing ties. We focus on LDA-Link’s performance relative to the baseline algorithms, and evaluate its robustness against sparse common occurrence by (1) modulating the granularity of the event space, and (2) testing on a sample of sequences with sparse common events.
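Concretely, the Rank-R recall reported below can be computed as in the following sketch, where true_matches[i] holds the indices of the co-referent B views of A view i; the names are illustrative.

def rank_r_recall(candidates, true_matches):
    # candidates: (n_A, R) array from the Rank-R linkage phase.
    # true_matches: list whose i-th element is the set of co-referent B view indices for A view i.
    hits = sum(set(truth).issubset(set(cand)) for cand, truth in zip(candidates, true_matches))
    return hits / len(true_matches)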

LDA-Link and Domain-Specific Alternatives

Figure 5 plots the best Rank-R recalls of each algorithm for different values of R. LDA-Link outperforms the domain-specific reconciliation algorithms as the size of the candidate set is increased. Although NFLX and POIS perform better for small values of R, LDA-Link’s recall increases more rapidly, exceeding POIS and NFLX respectively at R values corresponding to 1.16% and 2.20% of the total number of views for FSQ-TWT, and to 1.05% and 1.34% for IG-TWT. The early plateau for POIS occurs due to the lack of rules for evaluating the similarity of a pair of views when none of their events belong to the same location or time bins. The plateau is reached more slowly and at a higher recall for NFLX, because NFLX depends on the precise time difference instead of binning by time and is thus slightly less vulnerable to temporal granularity. LDA-Link, on the other hand, computes the similarity of a pair of views on a dimension-reduced space of topic proportions, which removes the dependence on the granularity of the event space. When R reaches 10% of the total number of views, LDA-Link outperforms POIS and NFLX by over 50% and 20% respectively on FSQ-TWT.

(a) Recall on FSQ-TWT (b) Recall on IG-TWT

Figure 5: Best Rank-R Recall Plots on the Two Data Sets

(a) Spatial Granularity = 0 (b) Spatial Granularity = 1 (c) Spatial Granularity = 2 (d) Spatial Granularity = 3

Figure 6: Rank-R Recall of LDA-Link and JS-Dist for Different Spatiotemporal Granularities

(a) FSQ-TWT (R = 10) (b) IG-TWT (R = 20)

Figure 7: Rank-R Linkage Recall of Sequence Pairs with Top 10% L1 Distance

Robustness against the Event Space Granularity

In Figure 6, we tested LDA-Link against different levels of spatiotemporal granularity of the event space. The number of learned topics was fixed separately for FSQ-TWT and IG-TWT. The plot displays the Rank-R recall of LDA-Link and of JS-Distance matching (dashed lines) on the two data sets, where R was set to 10 and 20 for FSQ-TWT and IG-TWT respectively for a consistent comparison. Although the Rank-R recall of JS-Dist is greater than that of LDA-Link when the temporal granularity is small, JS-Dist is rapidly compromised at higher spatiotemporal granularities, dropping steeply to zero. Meanwhile, LDA-Link maintains a more-or-less stable performance, which demonstrates its robustness against the sparsity of the event space.

Linking Views with Sparse or No Common Events

Lastly, we assess LDA-Link’s robustness to the second type of data sparsity, namely the actual degree of event overlap for a co-referent pair of sequences. In Figure 7, we took the 10% of the population whose views have the highest L1 distance, and plotted the best Rank-R recall of LDA-Link and JS-Dist on this sample for different spatial granularities (bar graph). The plot also displays the best Rank-R recall of the two algorithms on the sample of users whose profiles have no common posts at all (No Overlap). Although the overall best performance of JS-Dist is greater than that of LDA-Link (Figures 5 and 6), it is far eclipsed by LDA-Link on the sparse sample, and the difference is even more striking for greater input granularity. Most critically, LDA-Link is able to achieve up to 37% Rank-R recall on IG-TWT and 17% recall on FSQ-TWT for users with no common posts at all, while JS-Dist fails to reconcile any. Again, this is the effect of LDA-Link’s dimension reduction and semantic comparison.

6 Conclusion

We defined the problem of sequence linkage, a newly studied problem of identity uncertainty. As a solution to sparsity-robust sequence linkage, we described Split-Document and LDA-Link. Split-Document is a mixed-membership model for the generation of event sequences across data sets of different domains, which uses the concept of motifs to account for the generation of individual events and their collective patterns. Based on this model, LDA-Link can infer the correct identity linkage structure across data sets through a semantic comparison of each sequence pair. By conducting an empirical validation in linking profiles across different location-based social media, we tested LDA-Link’s robustness against factors of common event sparsity by modulating the granularity of the event space and testing against a selective sample of co-referent views with rare common occurrence. We showed that LDA-Link is able to significantly outperform the current state-of-the-art solution to sequence linkage when linking social media profiles that have no commonly occurring posts at all.

Split-Document can be extended to accommodate more than two data sets, each potentially having different views with duplicate identities. Extra layers of stochasticity can also be embedded into the original Split-Document model to construct more complex models. For example, one can inject an “observation” layer into the original model to take into consideration different rules of observation emission, which may include the probability of observation or different distortion processes (e.g., “hit-miss” distortion). Continuous or non-categorical variants for Gaussian or Poisson events are also a potential future direction of study. The incorporation of a Poisson model should be especially suitable for discretizing events in continuous time. Another area of development is the incorporation of Bayesian non-parametric clustering models such as the Dirichlet Process and the Chinese Restaurant Process as a “clustering” layer to model multiple duplicate identities of different views.

Acknowledgement

The author gratefully acknowledges Professor Augustin Chaintreau and Professor David Blei for their valuable comments and feedback.

References

  • [1] Steve Lawrence, C Lee Giles, and Kurt D Bollacker. Autonomous citation matching. In Proceedings of the third annual conference on Autonomous Agents, pages 392–393. ACM, 1999.
  • [2] Andrew McCallum, Kamal Nigam, and Lyle H Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 169–178. ACM, 2000.
  • [3] Andrew McCallum and Ben Wellner. Conditional models of identity uncertainty with application to noun coreference. In NIPS, pages 905–912, 2004.
  • [4] Hanna Pasula, Bhaskara Marthi, Brian Milch, Stuart Russell, and Ilya Shpitser. Identity uncertainty and citation matching. Advances in neural information processing systems, pages 1425–1432, 2003.
  • [5] Mauricio Sadinle et al. Detecting duplicates in a homicide registry using a bayesian partitioning approach. The Annals of Applied Statistics, 8(4):2404–2434, 2014.
  • [6] Mauricio Sadinle and Stephen E Fienberg. A generalized fellegi–sunter framework for multiple record linkage with application to homicide record systems. Journal of the American Statistical Association, 108(502):385–397, 2013.
  • [7] Matthew A Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of tampa, florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
  • [8] Shouling Ji, Weiqing Li, Mudhakar Srivatsa, Jing Selena He, and Raheem Beyah. Structure based data de-anonymization of social networks and mobility traces. In International Conference on Information Security, pages 237–254. Springer, 2014.
  • [9] Ehsan Kazemi, S Hamed Hassani, and Matthias Grossglauser. Growing a graph matching from a handful of seeds. Proceedings of the VLDB Endowment, 8(10):1010–1021, 2015.
  • [10] Arvind Narayanan and Vitaly Shmatikov. De-anonymizing social networks. In Security and Privacy, 2009 30th IEEE Symposium on, pages 173–187. IEEE, 2009.
  • [11] Peter Christen, Tim Churches, et al. Febrl-freely extensible biomedical record linkage. Australian national University, Department of Computer Science, 2002.
  • [12] Indrajit Bhattacharya and Lise Getoor. Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):5, 2007.
  • [13] Ivan P Fellegi and Alan B Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183–1210, 1969.
  • [14] William E Winkler. Frequency-based matching in fellegi-sunter model of record linkage. Bureau of the Census Statistical Research Division, 14, 2000.
  • [15] William E Winkler. Improved decision rules in the fellegi-sunter model of record linkage. In in American Statistical Association Proceedings of Survey Research Methods Section. Citeseer, 1993.
  • [16] Yves Thibaudeau. The discrimination power of dependency structures in record linkage. In Survey Methodology. Citeseer, 1993.
  • [17] Xiao-Li Meng and Donald B Rubin. Maximum likelihood estimation via the ecm algorithm: A general framework. Biometrika, 80(2):267–278, 1993.
  • [18] Michael D Larsen and Donald B Rubin. Iterative automated record linkage using mixture models. Journal of the American Statistical Association, 96(453):32–41, 2001.
  • [19] William E Winkler. Using the em algorithm for weight computation in the fellegi-sunter model of record linkage. In Proceedings of the Section on Survey Research Methods, American Statistical Association, volume 667, page 671, 1988.
  • [20] Marco Fortini, Brunero Liseo, Alessandra Nuccitelli, and Mauro Scanu. On bayesian record linkage. Research in Official Statistics, 4(1):185–198, 2001.
  • [21] Rebecca C Steorts et al. Entity resolution with empirically motivated priors. Bayesian Analysis, 10(4):849–875, 2015.
  • [22] Brunero Liseo and Andrea Tancredi. Some advances on bayesian record linkage and inference for linked data. http://www.ine.es/e/essnetdi_ws2011/ppts/Liseo_Tancredi.pdf, 2013.
  • [23] Andrea Tancredi, Brunero Liseo, et al. A hierarchical bayesian approach to record linkage and population size problems. The Annals of Applied Statistics, 5(2B):1553–1585, 2011.
  • [24] Rebecca C Steorts, Rob Hall, and Stephen E Fienberg. Smered: A bayesian approach to graphical record linkage and de-duplication. In AISTATS, pages 922–930, 2014.
  • [25] Rebecca C Steorts, Rob Hall, and Stephen E Fienberg. A bayesian approach to graphical record linkage and deduplication. Journal of the American Statistical Association, 111(516):1660–1672, 2016.
  • [26] Brunero Liseo and Andrea Tancredi. Bayesian estimation of population size via linkage of multivariate normal data sets. Journal of Official Statistics, 27(3):491–505, 2011.
  • [27] Indrajit Bhattacharya and Lise Getoor. A latent dirichlet model for unsupervised entity resolution. In Proceedings of the 2006 SIAM International Conference on Data Mining, pages 47–58. SIAM, 2006.
  • [28] Andrew M Dai and Amos J Storkey. The grouped author-topic model for unsupervised entity resolution. In International Conference on Artificial Neural Networks, pages 241–249. Springer, 2011.
  • [29] Arto Klami. Bayesian object matching. Machine learning, 92(2-3):225–250, 2013.
  • [30] Jayakrishnan Unnikrishnan. Asymptotically optimal matching of multiple sequences to source distributions and training sequences. IEEE Transactions on Information Theory, 61(1):452–468, 2015.
  • [31] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. Journal of machine Learning research, 3(Jan):993–1022, 2003.
  • [32] Matthew D Hoffman, David M Blei, Chong Wang, and John William Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, 2013.
  • [33] Martin J Wainwright and Michael I Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1-2):1–305, 2008.
  • [34] Matthew Hoffman, Francis R Bach, and David M Blei. Online learning for latent dirichlet allocation. In advances in neural information processing systems, pages 856–864, 2010.
  • [35] Léon Bottou. Stochastic learning. In Advanced lectures on machine learning, pages 146–168. Springer, 2004.
  • [36] Léon Bottou. Online learning and stochastic approximations. On-line learning in neural networks, 17(9):142, 1998.
  • [37] Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural computation, 10(2):251–276, 1998.
  • [38] Christopher Riederer, Yunsung Kim, Augustin Chaintreau, Nitish Korula, and Silvio Lattanzi. Linking users across domains with location data: Theory and validation. In Proceedings of the 25th International Conference on World Wide Web, pages 707–719. International World Wide Web Conferences Steering Committee, 2016.
  • [39] Arvind Narayanan and Vitaly Shmatikov. Robust de-anonymization of large sparse datasets. In 2008 IEEE Symposium on Security and Privacy (sp 2008), pages 111–125. IEEE, 2008.
  • [40] Pranjal Awasthi and Andrej Risteski. On some provably correct cases of variational inference for topic models. In Advances in Neural Information Processing Systems, pages 2098–2106, 2015.
  • [41] Feng Qi and Bai-Ni Guo. An inequality involving the gamma and digamma functions. Journal of Applied Analysis, 22(1):49–54, 2016.
  • [42] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.
  • [43] Jianhua Lin. Divergence measures based on the shannon entropy. IEEE Transactions on Information theory, 37(1):145–151, 1991.

Appendix A Learning Phase:
Learning Topics With and Without Omniscient Knowledge

The learning step in LDA-Link is equivalent to performing online variational inference for LDA with each view in the separate domains treated as a single input document. In this section we lay out the steps required for proving the high-probability asymptotic proximity of the topics learned by LDA-Link to the topics learned through online LDA when every co-referent pair of views is reconciled and combined into a single document.

A.1 SKL Divergence between the Learned Topics

The topic-learning step is a stochastic variational inference step that optimizes the per-document ELBO given in Equation 3. We start with the simple and slightly less realistic case in which the two co-referent views are fetched together at each iteration. The topic parameter λ moves in the direction of the natural gradient of the per-document ELBO, which is given by

(6)

where the quantities on the right-hand side are the local optima of the variational parameters γ and φ. See Appendix D.1 for its full derivation. LDA-Link’s topic-learning algorithm applies the update formula

(7)

to obtain the topic estimate λ.

The gradient of the per-document ELBO of the Split-Document model is given by [34]

The optimal update formula for the Split-Document model would thus be

(8)

We state the following claim: with high probability, the SKL distance between the convergent values of the topics learned through LDA-Link (using Equation 7) and the topics learned through the optimal update formula for the Split-Document model (using Equation 8) is bounded.

Claim 1.

where depends on the number of records in the view and

A Potential Validation Approach.

The idea is to (1) find high-probability bound on the co-referent views in the latent semantic space, (2) bound the distance between the two gradients when the co-referent views are nearby in the latent space, and then (3) to prove that the distance between the gradient descent estimates are convergent.

  1. Prove that there is a function that measures the dissimilarity of the two views in the latent semantic space, and that there is a high-probability bound on a pair of co-referent views having latent dissimilarity below a certain threshold that depends on the number of words in each view. That is, define a suitable choice of dissimilarity measure for which there is a moderate choice of threshold and probability bound, both depending on the number of words in each view, that satisfies

  2. Prove the following for the topic update procedures (Equations 7 and 8): given a pair of co-referent views that are close to each other in the latent space, when the SKL distance between the two topic estimates at a given iteration is within a certain threshold, then the gradients also lie close to each other at the next iteration. That is, find a suitable choice of measure for the difference between two gradients, for which, if the above closeness conditions hold at some iteration, then

    for a moderate bound that depends on .

  3. Prove that when the difference between gradients is small, the next step topic estimate is also bounded in a convergent manner. That is, if and , then

    where

    depends on the previous bound , the learning rate , and the difference between gradients in such a way that

This will prove Claim 1. ∎

A.2 Distance between Posterior Distributions

The argument in the earlier paragraph provides a bound for the symmetrized KL divergence between the optimal topic estimates of the omniscient Split-Document model and its practical surrogate that considers the information geometry of the mean-field Dirichlet posterior approximation for the topic parameter - namely the Independent-View model. Symmetrized KL divergence measures the distance between two topic parameters as the Jeffrey’s divergence between the Dirichlet distributions that they parameterize. For the purpose of topic estimation, however, the distance of interest is not the distance between the posterior approximations, but rather the distance between the MAP estimates.

The core of this section is Theorem 1, which provides a bound on the JS distance between the MAP estimates of the two Dirichlet distributions in terms of their symmetrized KL divergence. This theorem is useful when, given only the symmetrized KL distance between two posterior Dirichlet distributions, it is desirable to find a bound on the JS distance between their MAP estimates (modes).

Theorem 1.

Let and each be the modes of and . Let , , and assume . If for all , then

where vanishes when and .

Proof.

See Appendix D.2. ∎

Using this theorem we can easily derive the following corollary, which proves that the modes of the two surrogate posterior Dirichlet distributions that are close in terms of the SKL distance are also close in JS distance.

Corollary 1.

Given Dirichlet distributions, each parameterized by and with all parameters greater than 1, we have

where and for and

Appendix B Dimensionality Reduction:
Proximity of the Co-Referent Views in the Semantic Space

Once the topics are learned, LDA-Link computes the optimal topic proportions for each view through a variational Bayesian approach (Appendix A). In this section, we attempt to prove that the topic proportions of the co-referent views learned through the EM step in the LDA-Link algorithm are close in the simplex space with high probability.

To achieve this we first revisit a reasonable simplification of the VB updates suggested in [40] that will simplify our analysis. Since and , the iterative updates for a particular view in Algorithm 2 can be rewritten as

where we have omitted the entity index for simplicity.

Using the digamma inequality from [41] and considering it in relation to the update in Algorithm 2, in large-document limits the update equation becomes

(10)

where and . A detailed study of this simplification and its correctness and convergence properties is presented in [40]. We will use this approximation for the rest of this section.

The iterative procedure in Equation 10 converges at a point and for which

(11)

This relation implicitly defines a set of topic proportions ’s at which the iteration converges for some initial parameters when the empirical distribution (relative frequencies) of words is . The set includes, but is not limited to, the global optimum of the ELBO.

We need to compute the change in the topic proportions caused by a difference in the relative frequencies. From a slight variation of Sanov’s theorem [30] we get:

(12)

for the relative frequencies and of the views generated from the same distribution, so the two views are close in the simplex space with high probability. Since [42],

(13)

so that if the two relative frequencies are close in the simplex space, they are also close in Euclidean space. This allows us to describe differential change in relative frequencies in terms of Euclidean gradients.

Consider a specific choice of for a particular empirical distribution . We can make the following Taylor approximation to when is within a small neighborhood of :

(14)

where the gradient and the Hessian are computed at .

To compute the gradient and Hessian, we must resort to implicit differentiation.

First and Second Order Necessary Conditions at Convergence

Combining the two equations in Equation 10 under the limit , we obtain the following necessary conditions for the point of convergence after some rearrangement:

(15)

Taking partial derivatives with respect to and setting , we obtain the following first order condition:

Proposition 1.

When and are probability distributions that satisfy Equation 15,

(16)

where .

Proof.

See Appendix D.3. ∎

Note that from Equations 15 and 16, when , then or .

Taking a second partial derivative with respect to we obtain the following second order condition:

Proposition 2.

When and are probability distributions that satisfy Equation 15,

(17)
Proof.

See Appendix D.4. ∎

Appendix C Rank-R Linkage:
Asymptotic Optimality of the Ranking Given the True Topics

In this section, we provide a sketch for proving the theoretical guarantee of the correctness of LDA-Link’s linking algorithm. In order to do so, we will first model the problem of finding the co-referent views as a hypothesis testing problem, in light of the approach in [30].

Given a particular view, the objective is to find, among all views in the other data set, the view for which the match is optimal. We formulate this problem as testing a set of hypotheses, one per candidate view, where each hypothesis states that the corresponding view is the optimal match. Therefore, finding the correct pair of views is equivalent to finding the optimal rule for testing these hypotheses, where, for the ease of analysis, we have introduced a rejection hypothesis corresponding to failing to find a match. Our goal is to compute a bound on the probability of error for the decision rule that links a view with the candidate view whose JS distance in the latent semantic space is minimum.

We will more formally restate the decision rule designed in our algorithm. Let , and . The decision rule , where

is the acceptance region for hypothesis , and the rejection region is

The following preliminary theorem, inspired by Theorem IV.3 of [30], may be useful for proving the error probability of .

Theorem 2.

Consider the hypothesis testing problem with the decision rule given as defined above. If