1 Introduction
Given multiple independent data sources, it is desirable to link representations of identical entries to enable valuable statistical analyses or to save the clerical effort of identification. When the identifying attributes of the objects are absent, sanitized, or prone to error, resolving identity uncertainty becomes a highly nontrivial task. State-of-the-art solutions to identity uncertainty can effectively associate coreferent records across relational databases (record linkage) or dissimilar representations of the same object (author identification, noun coreference, image association), and they have widespread applications to reference matching lawrence1999autonomous ; mccallum2000efficient ; mccallum2004conditional ; pasula2003identity , public policy making sadinle2014detecting ; sadinle2013generalized ; jaro1989advances , behavioral analysis ji2014structure ; kazemi2015growing ; narayanan2009anonymizing , biomedical science christen2002febrl , and database cleansing bhattacharya2007collective .
Most data of practical interest today, however, is a collection in which each entry is a sequence of events generated by an entity. This happens ubiquitously, especially when these entries reflect individual traits: shopping services keep track of different consumption histories, location-based applications maintain geotagged records of an individual’s whereabouts, and medical records and prescription histories contain a sequence of health-related incidents. Sequences collected in relevant domains reflect the common features of the same entity (e.g., consumption preferences, locational trajectories, or health conditions), which can provide evidence for matching sequences that refer to a common source.
Although these sequences share an abstract generative pattern, their domain disparity prevents them from sharing rich common events. For example, in many mobility-related applications, especially those with granular records of time and location, one type of spatiotemporal event (such as a phone call) is never guaranteed to occur in tandem with another (a credit card transaction). This poses a critical challenge for sequence linkage and also distinguishes it from previously studied problems of identity uncertainty. While the entries being matched in the latter consist of distortions or alternate forms of “ground-truth” features such as canonical names, objective measurements, or semantic definitions, no such ground truth exists in sequence linkage, so solutions to those problems do not carry over. In addition, the frequent absence of common events makes sequence linkage an even more intricate task.
In this work, we present the simplest form of a generically applicable Bayesian framework that addresses the issue of rare common occurrence in cross-domain sequence linkage. This framework consists of a mixed-membership model for the generation of event sequences whose source entities are shared across data sets (the SplitDocument model), and a three-phase unsupervised algorithm for inferring their identities across data sets (LDALink). SplitDocument frames each event incident in terms of semantic “motifs,” and LDALink uses this characterization to determine the semantic similarity of a pair of views, ensuring robust linkage even when the coreferent sequence pairs share no common events at all.
To validate the empirical robustness of LDALink against common occurrence sparsity, we also provide a case study with real-world geotagged data sets. Mobility-related data are among the richest and most omnipresent types of data, available through numerous location-based smartphone applications and other external services such as social media, cellular logs, and credit card transaction histories. As diverse as these applications are, collections from one application are often independent from collections from another and their locations seldom coincide, which makes them a challenging target for sparsity-robust sequence linkage. LDALink achieves up to 37% identifiability when linking profiles with no common posts across Instagram and Twitter, significantly outperforming the current state-of-the-art generic solution to sequence linkage.
The paper is organized as follows. Section 2 presents an overview of relevant works and formalizes the problem of sequence linkage. Section 3 reviews Latent Dirichlet Allocation and the inference methods that are essential to the development of our work. Section 4 presents the SplitDocument model and LDALink. In Section 5 we perform a case study of this method in the context of social media identity reconciliation, and validate its robustness against sparsity. We conclude the paper in Section 6. The Appendix analyzes each step of the algorithm in theory and studies its convergence properties as well as its effectiveness.
2 Related Works and Problem Formulation
2.1 Previous Works
Sequence linkage belongs to the class of problems on identity uncertainty. Existing solutions to identity uncertainty are customized to linking representations of an entity with “ground-truth” field values such as canonical names, unambiguous semantics, numerical features, or objective measurements. Deterministic association can declare exact matches on these features when error-free identifiers are available, but entries in large data sets tend to be prone to noise and distortions such as human inscription or measurement errors, use of alternative forms, subjective observations, or deliberate data sanitization.
In the absence of trustworthy identifiers, probabilistic methods can be used to address uncontrollable noise. The seminal work of fellegi1969theory provided the first probabilistic framework for linking records in relational databases that refer to the same entity (record linkage) based on agree/disagree match statuses of each field. Methods using generalized Expectation-Maximization winkler2000frequency ; winkler1993improved , scoring thibaudeau1993discrimination , or Gibbs sampling meng1993maximum ; larsen2001iterative for parameter estimation have been developed to overcome the assumption of conditional independence in fellegi1969theory . winkler2000frequency ; winkler1988using suggest a similar method that uses relative frequencies of the field values in place of dichotomous match statuses of individual fields to determine the weight parameters for linkage. The downside of the aforementioned family of models is that they disregard the generative patterns of each observation and thus lose the evidence contained in the actual values of the noisy observations.
In this regard, Bayesian methods have the advantage of naturally handling noise present in observations. By incorporating uncertainty into a generative process, Bayesian approaches allow the computation of match probabilities conditioned on the actual observations without neglecting the presence of simultaneous matches fortini2001bayesian . Recent solutions to identity uncertainty have benefited from improvements in Bayesian inference techniques, and these solutions can largely be classified as “parametric,” “cluster-based,” or “correlational” depending on their generative assumptions and the criteria for inferring the coreference structure.
The parametric family of methods is the most prominent. It encodes the latent linkage structure into a parameter of interest of a probabilistic generative model, such as matching matrices fortini2001bayesian ; steorts2015entity ; liseo2013some ; tancredi2011hierarchical ; steorts2014smered ; steorts2015bayesian or coreference partitions sadinle2013generalized ; sadinle2014detecting . Models in tancredi2011hierarchical and liseo2011bayesian represent the linkage structure as a one-hot matching matrix, and coreferent records separately undergo “hit-miss” distortions of latent “true” categorical or continuous normal attributes. Similarly using matching matrices, steorts2014smered ; steorts2015bayesian provide a unified framework for record linkage and deduplication that can be extended to simultaneously linking records across multiple files. sadinle2013generalized developed a block-partitioning method to find coreferent partitions across multiple files under a normal mixture model, and jaro1989advances develops on this model to achieve deduplication. Apart from the generative models mentioned above, discriminative models such as
Cluster-based methods bind similar objects into clusters, and these methods are more widely used in other branches of identity uncertainty such as author identification, text classification, and noun coreference, for which co-occurrence is well explained by close-knit groups of relevance. pasula2003identity applies the concept of identity clustering to find the sample posterior over all relationships between objects, classes, and attributes modeled with the Relational Probability Model (RPM). bhattacharya2006latent proposes an LDA-based model for the generation of author and citation entries in which authors and publications in the same membership group are more frequently observed together. Nonparametric Bayesian Dirichlet processes (DP) allow the number of such groups to be flexible, and dai2011grouped applies DP to modeling groups of authors associated with topics.
Correlational methods, on the other hand, compute the statistical interrelations of each pair of records rather than attempting to infer the linkage structure a posteriori. klami2013bayesian uses the covariance matrix of a correlational, multivariate normal generative model as a measure of statistical dependency for finding the objects with the same identity.
The aforementioned methods suffer from two critical shortcomings when applied to cross-domain sequence linkage. First and foremost, these methods are tailored to objects with unambiguous ground-truth features such as relational records with categorical, string-valued, or continuous attributes. As mentioned previously, this makes them fundamentally inapplicable to modeling event sequences. Second, the failure to isolate the unknown linkage structure when estimating hidden model parameters adds an extra layer of uncertainty, which can compromise both the accuracy and the convergence rate of the whole process. In addition, the mixture assumptions in many of the above methods are less suited to sequence linkage than mixed-membership models, since in reality each entity exhibits a unique pattern of event generation while a mixture assumption binds entities to restrictive patterns.
One notable method aimed specifically at sequence linkage is unnikrishnan2015asymptotically , which computes the distance between two sequences in the simplex space of empirical distributions. Yet, as our case study in Section 5 reveals, this method fails when the empirical distributions have sparse intersections. In contrast, we propose a method that analyzes the semantics of each event incident and determines the similarity of a pair of views within the latent semantic space, allowing for a more macroscopic pattern analysis. This method resolves all of the above-mentioned shortcomings of existing methods through an information-theoretic interpretation of Latent Dirichlet Allocation as a mechanism for dimension reduction.
2.2 Problem Formulation
Assume a world of real-world entities, , where . and each denotes the size of two data sets and . (We assume that .) Entity generates exactly one sequence of events in the data set, and one or more sequences in the data set. Here, the function is the “identity” association, which indicates that sequence and sequences are generated by the same real-world entity, namely , and such sequences in different data sets are said to be “linked” or “coreferent.” ( is the set of all subsets of , and is a function whose image forms a partition of the set .)
The sequence generation proceeds as follows. Each entity performs a sequence of actions, each being one of possible actions. (This categorical assumption can be easily extended to the continuous case.) When an action is performed by entity , this appends an entry to exactly one of the sequences among and . Every sequence is therefore an ordered collection of events, which is equivalent to the concept of a “word” in information theory and a “document” in topic modeling. We call such a sequence a “view.” See Figure 1 for an illustration.
A typical problem of identity uncertainty is to find the exact identity mapping . When the data sets and consist of an identical number of entries, this becomes the well-known problem of finding the optimal bipartite graph matching between the and views. When the data sets and differ in size, however, graph matching methods are no longer straightforwardly applicable. In this paper, we study instead the problem of finding for each entity a set of up to candidates for a modest choice of , which is a more realistic approach in cases where the number of coreferent target views is unknown or different for each entity. Although the candidate sets will no longer be mutually exclusive, this method has the additional advantage of preventing close misses at a slight cost in precision. We will call this type of matching “Rank” matching since it associates a view with supposedly most relevant candidate views.
Notations
In Section 4 and the Appendix, we use and to index views in the and domain data sets without specific knowledge of or reference to their real-world entities. To refer to a view in relation to a specific entity, we use the index so that and identify the and views generated by entity . Sometimes we will use to indicate an arbitrary view regardless of its domain. Although a view is a sequence of events, the SplitDocument model is a bag-of-words model in which events are exchangeable, as we shall see in later sections. We can thus represent a view as a vector of event frequencies and use the subscripts (or equivalently, , ) to refer to the frequency of the event in view (or , ). Sometimes we may want to normalize the frequency vector to create a vector of relative frequencies (an empirical distribution). We denote this normalized relative frequency vector by , and use superscripts such as and to specify its domain. Lastly, we index topics with the index, and each topic is a probability distribution over the events.
3 Latent Dirichlet Allocation and Variational Bayes
The proposed sequence linkage framework is closely related to Latent Dirichlet Allocation (LDA) blei2003latent . The following subsections review LDA and the algorithms used for its statistical inference.
LDA is the simplest topic model, a Bayesian probabilistic model for generating documents. Each topic is a probability distribution over the vocabulary space. A total of topics are assumed, and each topic is drawn from a Dirichlet prior, . Given these topics, each of the documents is generated in the following way. First a “topic proportion” is drawn. Then for each word , a topic index is drawn according to the topic proportion, , and . Therefore the complete joint probability distribution becomes
(1) 
whose graphical model is shown in Figure 2.
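For reference, the generative process described above corresponds to the standard LDA joint distribution. The sketch below writes it in the notation of hoffman2010online, which is an assumption made here for illustration and not necessarily the notation of Equation 1: $\beta_k$ denotes topic $k$ with Dirichlet hyperparameter $\eta$, $\theta_d$ the topic proportion of document $d$ with Dirichlet hyperparameter $\alpha$, $z_{dn}$ the topic index of the $n$-th word, and $w_{dn}$ the word itself.

```latex
p(\beta, \theta, z, w \mid \alpha, \eta)
  = \prod_{k=1}^{K} p(\beta_k \mid \eta)
    \prod_{d=1}^{D} \Big[ p(\theta_d \mid \alpha)
      \prod_{n=1}^{N_d} p(z_{dn} \mid \theta_d)\,
                        p(w_{dn} \mid \beta_{z_{dn}}) \Big],
\qquad
\beta_k \sim \mathrm{Dirichlet}(\eta),\;
\theta_d \sim \mathrm{Dirichlet}(\alpha),\;
z_{dn} \sim \mathrm{Categorical}(\theta_d),\;
w_{dn} \sim \mathrm{Categorical}(\beta_{z_{dn}}).
```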
Exact MAP inference on LDA is infeasible blei2003latent , and one usually resorts to approximate inference techniques such as MCMC and variational inference. In our approach, a particular variant of variational inference is used, namely stochastic variational inference hoffman2013stochastic .
3.1 Variational Bayesian Inference for LDA
Variational Bayesian (VB) methods approximate an intractable posterior distribution by a member of a family of simpler distributions. In particular, VB seeks the distribution that is closest in KL divergence to the true posterior distribution , where the conditioning hyperparameters in are omitted for simplicity. The simplest and most frequently used family is the “mean-field” family, a family of product distributions that factorize over the individual latent variables:
(2)
where , , and for conjugacy.
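As an illustration of the objective and of the mean-field family, the following sketch uses the variational parameters of hoffman2010online ($\lambda_k$ for topics, $\gamma_d$ for topic proportions, $\phi_{dn}$ for topic assignments). These symbols are assumptions adopted here and may differ from those of Equation 2.

```latex
q^{*} = \operatorname*{arg\,min}_{q \in \mathcal{Q}}
        \mathrm{KL}\big( q(\beta, \theta, z) \,\|\, p(\beta, \theta, z \mid w) \big),
\qquad
q(\beta, \theta, z)
  = \prod_{k=1}^{K} q(\beta_k \mid \lambda_k)
    \prod_{d=1}^{D} \Big[ q(\theta_d \mid \gamma_d)
      \prod_{n=1}^{N_d} q(z_{dn} \mid \phi_{dn}) \Big],

q(\beta_k \mid \lambda_k) = \mathrm{Dirichlet}(\lambda_k), \quad
q(\theta_d \mid \gamma_d) = \mathrm{Dirichlet}(\gamma_d), \quad
q(z_{dn} \mid \phi_{dn}) = \mathrm{Categorical}(\phi_{dn}).
```

Choosing Dirichlet and categorical factors mirrors the priors of the generative model, which is what makes the coordinate updates available in closed form.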
Instead of attempting to minimize the KL divergence directly, which involves the intractable distribution , one may use the following relation
where , and maximize instead. is called the Evidence Lower BOund (ELBO), which is conceptually the lower bound on given by Jensen’s inequality. Early implementations of variational inference used coordinate ascent, which iteratively maximizes the ELBO with respect to each variational parameter while keeping the others fixed wainwright2008graphical . Although this method guarantees local convergence, it requires batch updates that become costly for large corpora, which led to the development of online variational Bayes, which uses stochastic gradient descent for faster convergence hoffman2010online .
3.2 Stochastic Gradient Descent and Online VB for LDA
We first briefly discuss the nature of stochastic gradient descent before discussing online VB. Stochastic gradient descent (SGD) optimizes an objective function when only noisy estimates of the true gradient are available bottou2004stochastic . Given an objective function of the form , SGD applies the following update formula
to find the optimal value of , where is an event from distribution and is the learning rate. The update term satisfies the condition
and thus can be understood as a noisy yet consistent estimate of the true gradient of at . It is shown that converges almost surely to the local optimum bottou1998online .
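The behavior described above can be illustrated with a minimal sketch. It assumes a scalar parameter, a minimization problem, and a step-size schedule of the form $(t + \tau)^{-\kappa}$ with $\kappa \in (0.5, 1]$; the function names (`sgd`, `noisy_grad`) and the toy objective are hypothetical and chosen only for illustration.

```python
import random


def sgd(noisy_grad, x0, n_steps=5000, tau=1.0, kappa=0.7, seed=0):
    """Minimize an expected loss using only noisy gradient samples.

    noisy_grad(x, rng) must return an unbiased estimate of the true gradient.
    The step sizes rho_t = (t + tau)^(-kappa) sum to infinity while their
    squares have a finite sum (an assumed Robbins-Monro-style schedule).
    """
    rng = random.Random(seed)
    x = x0
    for t in range(n_steps):
        rho = (t + tau) ** (-kappa)
        x = x - rho * noisy_grad(x, rng)
    return x


def noisy_grad(x, rng):
    """Gradient of (x - xi)^2 / 2 for a single random draw xi ~ N(3, 1).

    Its expectation is x - 3, the gradient of the expected loss.
    """
    xi = rng.gauss(3.0, 1.0)
    return x - xi


# The expected loss E[(x - xi)^2] / 2 is minimized at x = 3.
x_hat = sgd(noisy_grad, x0=0.0)
print(round(x_hat, 2))
```

Each update uses a single random draw, yet the iterate settles near the minimizer because the noise is averaged out by the decaying step sizes.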
Stochastic Maximization of the ELBO
SGD can be used to optimize the ELBO for LDA hoffman2013stochastic . (More generally, SGD can be applied in VB for any generative model that involves local and global latent variables hoffman2013stochastic .) Considering the factorization of and from Equations 1 and 2,
Letting be a random variable that chooses an index over the document indices , we can rewrite as where so that the “natural gradient” of with respect to each global variational parameter is a noisy yet unbiased estimate of the natural gradient of the variational objective, . Computing the natural gradient instead of the Euclidean gradient corrects for the geometry of the space of the variational parameters by using the symmetrized KL divergence as the measure of spatial distance amari1998natural and thus leads to more effective convergence to the local optimum hoffman2013stochastic . Refer to hoffman2010online for the resulting online variational Bayesian algorithm for LDA.
4 SplitDocument Model and LDALink Algorithm
We now introduce the SplitDocument model, a simple probabilistic generative model for coreferent views that is based on LDA. Based on this model we propose LDALink as a solution for identifying coreferent views across data sets of different domains.
4.1 SplitDocument Model
The SplitDocument model displayed in Figure 3 extends LDA to model the generation of views across domains. In this model, a real-world entity generates a sequence of i.i.d. events according to LDA, and each event falls into exactly one among multiple coreferent views. (For now we assume that each entity generates a single view in each data set, and that views in the same data set have a fixed number of events.) A set of coreferent views is thus generated through a mixture model whose mixture proportions depend on the sequence-generating entity.
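The generative story above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the hyperparameter values, the vocabulary size, and the names `sample_dirichlet`, `sample_categorical`, and `generate_coreferent_views` are all hypothetical. The one property taken from the model is that both views of an entity are drawn from the same topic proportion, with a fixed number of events per data set.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw one index according to the given probability vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_coreferent_views(topics, alpha, n_x, n_y, rng):
    """Generate two coreferent views that share one topic proportion.

    topics: list of K probability vectors over the event space.
    alpha:  Dirichlet hyperparameter of length K (assumed value).
    n_x, n_y: fixed number of events in the two data sets.
    """
    theta = sample_dirichlet(alpha, rng)  # one proportion per entity

    def draw_view(n_events):
        view = []
        for _ in range(n_events):
            k = sample_categorical(theta, rng)              # topic of the event
            view.append(sample_categorical(topics[k], rng))  # the event itself
        return view

    return theta, draw_view(n_x), draw_view(n_y)


rng = random.Random(0)
n_topics, n_event_types = 3, 20
topics = [sample_dirichlet([0.5] * n_event_types, rng) for _ in range(n_topics)]
theta, view_x, view_y = generate_coreferent_views(
    topics, [0.5] * n_topics, n_x=30, n_y=50, rng=rng
)
```

Because the two views are drawn independently given the shared proportion, they can have few or no events in common while still reflecting the same mixture of topics, which is the situation the linkage algorithm is designed for.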
Although the SplitDocument model is kept simple in this paper for the sake of clear presentation, its assumptions can be generalized to allow more than two data sets, with each entity generating more than one view in each data set and events occurring in different data sets with different probabilities. As in steorts2014smered ; steorts2015bayesian , this approach would combine the problem of record linkage with deduplication.
4.2 LDALink Algorithm
We now introduce LDALink, a coreference linkage algorithm based on the SplitDocument model. The key idea is to consider topic proportions as a reduction of dimensionality from the size of the entire event space to the number of topics , and to compare these dimension-reduced representations. This distinguishes LDALink from other methods that leverage only the common events that appear in both views, and that therefore fail when a sequence pair displays sparse common occurrence.
The algorithm works in three separate phases. First, in the “Learning” phase, topics are estimated from the views in the two domains. In the second phase, the topic estimates are used to find, for each view, the topic proportions that maximize the posterior distribution given the estimated topics. This phase is the “Dimensionality Reduction” phase that effectively reduces the dimension of each view from to . A score is then computed for every pair of views as the Jensen-Shannon distance between their topic proportions. In the last “Rank Linkage” phase, up to candidate views are declared as potential matches for each view based on these scores.
The rest of this section discusses each phase in detail. Appendices A through D will explore the guarantees of this algorithm in theory.
Learning Phase: Topic Estimation
In the SplitDocument model, coreferent views share the same topic proportion, and these views are essentially a bipartition of a document generated through the normal, unsplit LDA process described in Figure 2. If the true coreference linkage structure were known, estimating the topics for the SplitDocument model would amount to finding the MAP LDA topics where all coreferent views are combined into one document. Yet, since the coreference structure is the unknown that we aim to find, it is difficult to directly compute the MAP estimates of the generative model in Figure 3.
A workaround is to consider a slightly different surrogate model in which each view is treated as a separate document with a topic proportion of its own, and to learn the topics optimal for this model as an approximation. This surrogate model is called the IndependentView model, shown in Figure 4. Intuitively, in the large-document limit where the number of words goes to infinity, learning the IndependentView model is equivalent to learning LDA with duplicates of each document. (We study the effectiveness of this surrogate learning in Appendix A.)
We will call the ELBO of the IndependentView model , which is given by
(3) 
where is given as
(4) 
and . Note its difference from the ELBO of the SplitDocument model, which is
Algorithm 1 summarizes the Learning phase.
Dimension Reduction and Rank Linkage Phases
With the topics obtained in the Learning phase, the topic proportions are learned for each view using the conventional coordinate ascent method, mapping each view to a vector on the dimensional latent semantic simplex space. Once these topic proportions are learned, a dissimilarity score based on the Jensen-Shannon distance is computed for every pair of views as
(5) 
Algorithm 2 summarizes the Dimension Reduction phase.
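The scoring step can be sketched as follows, assuming that each view has already been mapped to a topic-proportion vector. Whether the score in Equation 5 is the Jensen-Shannon divergence itself or its square root cannot be determined from the text, so both are shown; the function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) with natural logarithm; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two probability vectors."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def js_distance(p, q):
    """Square root of the JS divergence, which is a metric."""
    return math.sqrt(max(js_divergence(p, q), 0.0))


# Topic proportions of two views (illustrative values only).
theta_a = [0.7, 0.2, 0.1]
theta_b = [0.6, 0.3, 0.1]
theta_c = [0.0, 0.1, 0.9]
print(js_distance(theta_a, theta_b) < js_distance(theta_a, theta_c))
```

The divergence is bounded above by $\ln 2$, which is attained exactly when the two vectors have disjoint support. Applied directly to empirical event distributions, this means that any two views with no common events receive the same maximal score, whereas topic proportions typically still overlap in such cases.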
The final Rank Linkage phase selects, for each view, the candidate views with the smallest dissimilarity scores, as shown in Algorithm 3.
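A minimal sketch of this selection step is given below. It assumes the pairwise dissimilarity scores are available as a nested list indexed by source view and target view; the name `rank_linkage` and the parameter `k` (the size of the candidate set) are introduced here for illustration.

```python
import heapq


def rank_linkage(scores, k):
    """Return, for each source view, the indices of its k best candidates.

    scores[i][j] is the dissimilarity between source view i and target
    view j; smaller means more similar. Candidates are ordered from best
    to worst, and ties are broken by target index.
    """
    candidates = []
    for row in scores:
        best = heapq.nsmallest(k, range(len(row)), key=row.__getitem__)
        candidates.append(best)
    return candidates


scores = [
    [0.9, 0.1, 0.5],
    [0.2, 0.8, 0.3],
]
print(rank_linkage(scores, k=2))  # [[1, 2], [0, 2]]
```

Candidate sets of different source views may overlap, which is consistent with the observation in Section 2.2 that the sets are no longer mutually exclusive.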
5 Case Study: Breaking Anonymity in Location-Based Social Media
We now apply LDALink to location-based social media profile linkage, where profiles of the same individual in different social media are matched based on their online activities. As explained in the introduction, location-based data sets present critical challenges of sparse common occurrence. The ability to reconcile identities and bridge data across online social platforms of different thematic nature has significant commercial as well as privacy implications riederer2016linking . We first describe the data sets and the baseline algorithms, and then evaluate the performance of LDALink.
5.1 Data Sets
A total of two pairs of datasets were used, both of which contain only the spatiotemporal information of public activities of profiles in two different social media collected during a common time span. Both of these datasets were studied and explained previously in riederer2016linking .

Foursquare-Twitter (FT): This dataset contains check-ins on Foursquare and posts on Twitter, both of which are geotemporally tagged. Each account activity is therefore a (user id, time, GPS coordinate) tuple. By selecting only the users who have records in both social media accounts, a total of 862 users, 13,177 Foursquare check-ins, and 174,618 tweets were obtained. The imbalance of activities between the two social media is an additional challenge.
While Foursquare check-ins are typically associated with a user exposing their current activities, tweets are associated with more general behaviors. In this dataset, only 260 pairs of check-ins (less than 0.3%) had exactly matching GPS coordinates, and none of them were made within 10 seconds of each other, suggesting that it is highly unlikely that there is a pair of account activities forwarded by software across both services riederer2016linking .

Instagram-Twitter (IG): The second dataset is also a collection of (user id, time, GPS coordinate) tuples from public posts on the photo-sharing site Instagram and the microblogging service Twitter. This data set was obtained by linking Instagram and Twitter accounts that were associated with the same URL in their user profiles, and downloading the spatiotemporal tags of the tweets made by these Twitter accounts riederer2016linking . The collection includes 1717 unique users, 337,934 Instagram posts, and 447,336 tweets.
Both pairs of data sets contain the timestamped visits of each user, and while users are communicating an action or a message to the general public, the events (posts) in each data set are collected within different thematic contexts. The SplitDocument model posits that each recorded visit occurs with a certain purpose, e.g., shopping, sports, travel, or hobbies. LDALink attempts to discover such motifs (or “topics”) and to associate with every view a proportional mixture of these topics that represents its characteristic features.
Modulating the Sparsity of Common Events with Spatiotemporal Granularity
The full GPS coordinates and timestamps of the posts in each of these data sets never coincide precisely, which raises the problem of determining the meaning of “common occurrence.” Our solution is to bin times and locations based on spatiotemporal proximity, and to declare events belonging to the same bin to have occurred in common. The size of the bin can be controlled to represent different levels of measurement precision, and changing the event space granularity in this manner modulates the discreteness and size of the event space. This allows the study of the robustness of the algorithm under different degrees of sparsity available to the linking agent. In our experiment we bin event locations by truncating the coordinate values after a certain number of digits after the decimal point, and bin event timestamps into a fixed number of bins. We call these numbers the “spatial” and “temporal” granularity, respectively.
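The binning scheme can be sketched as follows. Coordinate truncation follows the description above; the temporal binning is an assumption, since the text only states that timestamps fall into a fixed number of bins, and is implemented here as equal-width bins over an observation window. All function and parameter names are hypothetical.

```python
from decimal import Decimal, ROUND_DOWN


def spatial_bin(coord, digits):
    """Truncate a coordinate after `digits` decimal places (toward zero).

    Decimal arithmetic avoids binary floating-point artifacts such as
    0.29 * 100 evaluating to 28.999999999999996.
    """
    step = Decimal(10) ** -digits
    return str(Decimal(str(coord)).quantize(step, rounding=ROUND_DOWN))


def temporal_bin(t, t_start, t_end, n_bins):
    """Map a timestamp to one of n_bins equal-width bins over [t_start, t_end).

    Equal-width binning over the whole window is an assumed choice.
    """
    idx = int((t - t_start) * n_bins / (t_end - t_start))
    return min(max(idx, 0), n_bins - 1)


def to_event(lat, lon, t, digits, t_start, t_end, n_bins):
    """Discretize one post into an event of the binned event space."""
    return (
        spatial_bin(lat, digits),
        spatial_bin(lon, digits),
        temporal_bin(t, t_start, t_end, n_bins),
    )


# Two nearby posts coincide at spatial granularity 2 but not at 3.
post_a = (40.7128, -74.0060, 10)
post_b = (40.7199, -74.0011, 20)
coarse = [to_event(*p, digits=2, t_start=0, t_end=100, n_bins=4) for p in (post_a, post_b)]
fine = [to_event(*p, digits=3, t_start=0, t_end=100, n_bins=4) for p in (post_a, post_b)]
print(coarse[0] == coarse[1], fine[0] == fine[1])  # True False
```

Raising either granularity enlarges the event space and makes exact coincidences rarer, which is how the sparsity of common events is modulated in the experiments.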
5.2 Prior Algorithms
Here we summarize three baseline algorithms for identity reconciliation that were inspired by state-of-the-art algorithms in the social computing literature.
Sparsity-Based Scoring: The “Netflix Attack” (NFLX)
Based on the algorithm used to de-anonymize the Netflix prize dataset in narayanan2008robust , riederer2016linking describes a variation for cross-domain reconciliation, where a score between views and is defined as
where
and are model parameters.
This algorithm relies on the exact timestamps. It matches an view with the view with the smallest score, and leaves it unmatched if the best and second-best candidates differ in score by no more than a given number of standard deviations. As in narayanan2008robust , this score favors locations that are rarely visited, frequent visits to the same location, and visits that occur close in time, thus exploiting sparsity riederer2016linking .
Density-Based Scoring: JS-Distance Matching (JSDist)
In unnikrishnan2015asymptotically , the authors proved the asymptotic optimality of the JS-Distance between relative frequencies (empirical distributions) of two observation sequences as a measure of disparity.
Their algorithm estimates the true matching as the matching that minimizes the sum of the JS measures. It relies on the density of the data, based on the asymptotic convergence of empirical distributions implied by Sanov’s Theorem.
Leveraging Both Sparsity and Density: Poisson Process (POIS)
suggests a simple generative model for mobility records in which the number of visits to each location within a certain time period follows a Poisson distribution whose rate parameters are specific to the location and period of the visit. Based on this model, the following similarity score between two views can be defined for an MAP estimate of the identity linkage structure:
where and are location and time indices and
The identity mapping is the mapping that maximizes the expected sum of scores.
5.3 Performance Analysis
We now present the empirical performance of LDALink. In light of the “Rank” matching described in Section 2.2, we measure performance in terms of the Rank recall, which is the proportion of views in the source data set whose coreferent views in the target data set are fully contained in the set of best candidates, not allowing ties. We focus on LDALink’s performance relative to the baseline algorithms, and evaluate its robustness against sparse common occurrence by (1) modulating the granularity of the event space, and (2) testing on a sample of sequences with sparse common events.
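The recall measure described above can be computed as in the following sketch, which assumes the candidate sets and the true coreferent sets are given per source view; the name `rank_recall` and the data layout are hypothetical.

```python
def rank_recall(candidates, truth):
    """Proportion of source views whose coreferent views are all retrieved.

    candidates[i]: iterable of candidate target indices for source view i.
    truth[i]:      set of target indices that are coreferent with view i.
    A source view counts as a hit only if every one of its coreferent
    target views appears among its candidates.
    """
    hits = sum(
        1 for cand, true in zip(candidates, truth) if set(true) <= set(cand)
    )
    return hits / len(truth)


candidates = [[1, 2], [0, 2], [0, 1]]
truth = [{1}, {1}, {0, 1}]
print(rank_recall(candidates, truth))  # 2 of 3 source views are fully covered
```

Requiring full containment makes the measure strict for entities with several coreferent target views, since retrieving only some of them does not count.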
LDALink and Domain-Specific Alternatives
Figure 5 plots the best Rank recalls of each algorithm for different values of . LDALink outperforms the domain-specific reconciliation algorithms as the size of the candidate set is increased. Although NFLX and POIS perform better for small values of , LDALink’s recall increases more rapidly, exceeding POIS and NFLX respectively at (1.16% of the total number of views) and (2.20%) for FSQTWT, and at (1.05%) and (1.34%) for IGTWT. The early plateau for POIS occurs due to the lack of rules for evaluating the similarity of a pair of views when none of their events belong to the same location or time bins. The plateau is reached more slowly and at a higher recall for NFLX because NFLX depends on precise time differences instead of binning by time and is thus slightly less vulnerable to time granularity. LDALink, on the other hand, computes the similarity of a pair of views on a dimension-reduced space of topic proportions, which removes the dependence on the granularity of the event space. When reaches up to 10% of the total number of views, LDALink outperforms POIS and NFLX by over 50% and 20% on FSQTWT.
Robustness against the Event Space Granularity
In Figure 6, we test LDALink against different levels of spatiotemporal granularity of the event space. The number of learned topics was fixed to and for FSQTWT and IGTWT respectively. The plot displays the Rank matching of LDALink and JS-Distance matching (dashed lines) on the two data sets, where was set to 10 and 20 for FSQTWT and IGTWT respectively for a consistent comparison. Although the rank recall of JSDist is greater than that of LDALink when the temporal granularity is small, the performance of JSDist degrades rapidly as the spatiotemporal granularity increases, dropping steeply toward zero at higher granularities. Meanwhile, LDALink maintains a more-or-less stable performance, which demonstrates its robustness against the sparsity of the event space.
Linking Views with Sparse or No Common Events
Lastly, we assess LDALink’s robustness to the second type of data sparsity, which is the actual degree of event overlap for a coreferent pair of sequences. In Figure 7, we took the 10% of the population whose views have the highest L1 distance, and plotted the best Rank recall of LDALink and JSDist on this sample for different spatial granularities (bar graph). The plot also displays the best Rank recall of the two algorithms on the sample of users whose profiles have no common posts at all (No Overlap). Although the overall best performance of JSDist is greater than that of LDALink (Figures 5, 6), its performance is far eclipsed by LDALink on the sparse sample, and the difference is even more striking for greater input granularity. Most critically, LDALink is able to achieve up to 37% Rank recall on IGTWT and 17% recall on FSQTWT for users with no common posts at all, while JSDist fails to reconcile any. Again, this is the effect of LDALink’s dimension reduction and semantic comparison.
6 Conclusion
We defined the problem of sequence linkage, a newly studied problem of identity uncertainty. As a solution to sparsity-robust sequence linkage, we described SplitDocument and LDALink. SplitDocument is a mixed-membership model for the generation of event sequences across data sets of different domains, which uses the concept of motifs that account for the generation of individual events and their collective patterns. Based on this model, LDALink can infer the correct identity linkage structure across data sets through a semantic comparison of each sequence pair. In an empirical validation that links profiles across different location-based social media, we tested LDALink’s robustness against factors of common event sparsity by modulating the granularity of the event space and by testing against a selective sample of coreferent views with rare common occurrence. We showed that LDALink is able to significantly outperform the current state-of-the-art solution to sequence linkage when linking social media profiles that have no commonly occurring posts at all.
SplitDocument can be extended to accommodate more than two data sets, each potentially having different views with duplicate identities. Extra layers of stochasticity can also be embedded into the original SplitDocument model to construct more complex models. For example, one can inject an “observation” layer into the original model to take into consideration different rules of observation emission, which may include the probability of observation or different distortion processes (e.g. “hit-miss” distortion). Continuous or non-categorical variants for Gaussian or Poisson events are also a potential future direction of study. The incorporation of a Poisson model should be especially suitable for discretizing continuous-time events. Another area of development is the incorporation of Bayesian nonparametric clustering models such as the Dirichlet Process and the Chinese Restaurant Process as a “clustering” layer to model multiple duplicate identities of different views.
Acknowledgement
The author gratefully acknowledges Professor Augustin Chaintreau and Professor David Blei for their valuable comments and feedback.
References
 [1] Steve Lawrence, C Lee Giles, and Kurt D Bollacker. Autonomous citation matching. In Proceedings of the third annual conference on Autonomous Agents, pages 392–393. ACM, 1999.

 [2] Andrew McCallum, Kamal Nigam, and Lyle H Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 169–178. ACM, 2000.
 [3] Andrew McCallum and Ben Wellner. Conditional models of identity uncertainty with application to noun coreference. In NIPS, pages 905–912, 2004.
 [4] Hanna Pasula, Bhaskara Marthi, Brian Milch, Stuart Russell, and Ilya Shpitser. Identity uncertainty and citation matching. Advances in neural information processing systems, pages 1425–1432, 2003.
 [5] Mauricio Sadinle et al. Detecting duplicates in a homicide registry using a bayesian partitioning approach. The Annals of Applied Statistics, 8(4):2404–2434, 2014.
 [6] Mauricio Sadinle and Stephen E Fienberg. A generalized fellegi–sunter framework for multiple record linkage with application to homicide record systems. Journal of the American Statistical Association, 108(502):385–397, 2013.
 [7] Matthew A Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of tampa, florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
 [8] Shouling Ji, Weiqing Li, Mudhakar Srivatsa, Jing Selena He, and Raheem Beyah. Structure based data de-anonymization of social networks and mobility traces. In International Conference on Information Security, pages 237–254. Springer, 2014.
 [9] Ehsan Kazemi, S Hamed Hassani, and Matthias Grossglauser. Growing a graph matching from a handful of seeds. Proceedings of the VLDB Endowment, 8(10):1010–1021, 2015.
 [10] Arvind Narayanan and Vitaly Shmatikov. De-anonymizing social networks. In Security and Privacy, 2009 30th IEEE Symposium on, pages 173–187. IEEE, 2009.
 [11] Peter Christen, Tim Churches, et al. Febrl - freely extensible biomedical record linkage. Australian National University, Department of Computer Science, 2002.
 [12] Indrajit Bhattacharya and Lise Getoor. Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):5, 2007.
 [13] Ivan P Fellegi and Alan B Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183–1210, 1969.
 [14] William E Winkler. Frequency-based matching in fellegi–sunter model of record linkage. Bureau of the Census Statistical Research Division, 14, 2000.
 [15] William E Winkler. Improved decision rules in the fellegi–sunter model of record linkage. In American Statistical Association Proceedings of Survey Research Methods Section. Citeseer, 1993.
 [16] Yves Thibaudeau. The discrimination power of dependency structures in record linkage. In Survey Methodology. Citeseer, 1993.
 [17] Xiao-Li Meng and Donald B Rubin. Maximum likelihood estimation via the ecm algorithm: A general framework. Biometrika, 80(2):267–278, 1993.
 [18] Michael D Larsen and Donald B Rubin. Iterative automated record linkage using mixture models. Journal of the American Statistical Association, 96(453):32–41, 2001.
 [19] William E Winkler. Using the em algorithm for weight computation in the fellegi–sunter model of record linkage. In Proceedings of the Section on Survey Research Methods, American Statistical Association, volume 667, page 671, 1988.
 [20] Marco Fortini, Brunero Liseo, Alessandra Nuccitelli, and Mauro Scanu. On bayesian record linkage. Research in Official Statistics, 4(1):185–198, 2001.
 [21] Rebecca C Steorts et al. Entity resolution with empirically motivated priors. Bayesian Analysis, 10(4):849–875, 2015.
 [22] Brunero Liseo and Andrea Tancredi. Some advances on bayesian record linkage and inference for linked data. http://www.ine.es/e/essnetdi_ws2011/ppts/Liseo_Tancredi.pdf, 2013.
 [23] Andrea Tancredi, Brunero Liseo, et al. A hierarchical bayesian approach to record linkage and population size problems. The Annals of Applied Statistics, 5(2B):1553–1585, 2011.
 [24] Rebecca C Steorts, Rob Hall, and Stephen E Fienberg. Smered: A bayesian approach to graphical record linkage and deduplication. In AISTATS, pages 922–930, 2014.
 [25] Rebecca C Steorts, Rob Hall, and Stephen E Fienberg. A bayesian approach to graphical record linkage and deduplication. Journal of the American Statistical Association, 111(516):1660–1672, 2016.
 [26] Brunero Liseo and Andrea Tancredi. Bayesian estimation of population size via linkage of multivariate normal data sets. Journal of Official Statistics, 27(3):491–505, 2011.
 [27] Indrajit Bhattacharya and Lise Getoor. A latent dirichlet model for unsupervised entity resolution. In Proceedings of the 2006 SIAM International Conference on Data Mining, pages 47–58. SIAM, 2006.

 [28] Andrew M Dai and Amos J Storkey. The grouped author-topic model for unsupervised entity resolution. In International Conference on Artificial Neural Networks, pages 241–249. Springer, 2011.
 [29] Arto Klami. Bayesian object matching. Machine learning, 92(2–3):225–250, 2013.
 [30] Jayakrishnan Unnikrishnan. Asymptotically optimal matching of multiple sequences to source distributions and training sequences. IEEE Transactions on Information Theory, 61(1):452–468, 2015.
 [31] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. Journal of machine Learning research, 3(Jan):993–1022, 2003.
 [32] Matthew D Hoffman, David M Blei, Chong Wang, and John William Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, 2013.
 [33] Martin J Wainwright and Michael I Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305, 2008.
 [34] Matthew Hoffman, Francis R Bach, and David M Blei. Online learning for latent dirichlet allocation. In advances in neural information processing systems, pages 856–864, 2010.
 [35] Léon Bottou. Stochastic learning. In Advanced lectures on machine learning, pages 146–168. Springer, 2004.
 [36] Léon Bottou. Online learning and stochastic approximations. Online learning in neural networks, 17(9):142, 1998.
 [37] Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural computation, 10(2):251–276, 1998.
 [38] Christopher Riederer, Yunsung Kim, Augustin Chaintreau, Nitish Korula, and Silvio Lattanzi. Linking users across domains with location data: Theory and validation. In Proceedings of the 25th International Conference on World Wide Web, pages 707–719. International World Wide Web Conferences Steering Committee, 2016.
 [39] Arvind Narayanan and Vitaly Shmatikov. Robust de-anonymization of large sparse datasets. In 2008 IEEE Symposium on Security and Privacy (sp 2008), pages 111–125. IEEE, 2008.
 [40] Pranjal Awasthi and Andrej Risteski. On some provably correct cases of variational inference for topic models. In Advances in Neural Information Processing Systems, pages 2098–2106, 2015.
 [41] Feng Qi and Bai-Ni Guo. An inequality involving the gamma and digamma functions. Journal of Applied Analysis, 22(1):49–54, 2016.
 [42] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.
 [43] Jianhua Lin. Divergence measures based on the shannon entropy. IEEE Transactions on Information theory, 37(1):145–151, 1991.
Appendix A Learning Phase:
Learning Topics With and Without Omniscient Knowledge
The learning step in LDALink is equivalent to performing online variational inference for LDA, treating each view in each domain as a single input document. In this section we lay out the steps required for proving the high-probability asymptotic proximity of the topics learned by LDALink to the topics learned through online LDA when every coreferent pair of views is reconciled and combined into a single document.
A.1 SKL Divergence between the Learned Topics
The topiclearning step is a stochastic variational inference step that optimizes the perdocument ELBO given in Equation 3. We start with the simple and slightly less realistic case that and are fetched together at each iteration. moves in the direction of the natural gradient of , which is given by
(6) 
where and are the local optimum of the variational parameters and . See Appendix D.1 for its full derivation. LDALink’s topiclearning algorithm applies the update formula
(7) 
to obtain the topic estimate .
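As an illustration of the kind of step involved, the sketch below implements one stochastic natural-gradient update in the style of online LDA [34], on which the learning phase is based. It is a sketch under stated assumptions rather than a transcription of Equation 7, whose exact form may differ: the intermediate estimate treats the corpus as `num_docs` copies of the sampled document, and the learning-rate schedule follows the form used in [34]. All names (`lam`, `phi`, `counts`, `eta`, `tau0`, `kappa`) are hypothetical.

```python
def learning_rate(t, tau0=1.0, kappa=0.7):
    """Decaying step size of the form (tau0 + t) ** (-kappa), as used in [34]."""
    return (tau0 + t) ** (-kappa)

def online_topic_update(lam, phi, counts, eta, num_docs, rho):
    """One online-LDA style step on the variational topic parameters.

    lam[k][w] : current topic parameters (K topics, W event types)
    phi[w][k] : locally optimal topic responsibilities for the sampled document
    counts[w] : event counts of the sampled document
    eta       : Dirichlet prior on the topics (assumed symmetric)
    """
    num_topics, vocab_size = len(lam), len(lam[0])
    new_lam = []
    for k in range(num_topics):
        row = []
        for w in range(vocab_size):
            # Topic estimate if the whole corpus looked like this one document.
            lam_hat = eta + num_docs * counts[w] * phi[w][k]
            # Convex combination of the old value and the one-document estimate.
            row.append((1.0 - rho) * lam[k][w] + rho * lam_hat)
        new_lam.append(row)
    return new_lam

# Toy example with two topics and two event types (assumed values).
lam0 = [[1.0, 1.0], [1.0, 1.0]]
phi0 = [[1.0, 0.0], [0.0, 1.0]]
lam1 = online_topic_update(lam0, phi0, counts=[2, 3], eta=0.5, num_docs=10, rho=0.5)
```

Under this form, the only difference between training on separate views and training on reconciled documents lies in which counts and responsibilities enter the intermediate estimate, which is why the analysis compares the two resulting gradients.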
The gradient of the perdocument ELBO of the SplitDocument model is given by [34]
The optimal update formula for the SplitDocument model would thus be
(8) 
We claim that, with high probability, the SKL distance between the convergent values of the topics learned through LDALink (using Equation 7) and the topics learned through the optimal update formula for the SplitDocument model (using Equation 8) is bounded.
Claim 1.
where depends on the number of records in the view and
A Potential Validation Approach.
The idea is to (1) find a high-probability bound on the dissimilarity of the coreferent views in the latent semantic space, (2) bound the distance between the two gradients when the coreferent views are nearby in the latent space, and then (3) prove that the distance between the gradient descent estimates converges.

Prove that there is a function that measures the dissimilarity of the two views in the latent semantic space, and that, with high probability, a pair of coreferent views and has latent dissimilarity below a certain threshold that depends on the number of words in each view. That is, define a suitable choice of a dissimilarity measure for which there is a moderate choice of and that both depend on and satisfy

Prove the following for the topic update procedure for and (Equations 7 and 8): Given a pair of coreferent views that are close to each other in the latent space, when the SKL distance between the th iteration of and is within a certain threshold, then the gradients also lie close to each other for the ()th iteration. That is, find a suitable choice of for measuring the difference between two gradients, for which if and for some , then
for a moderate bound that depends on .

Prove that when the difference between gradients is small, the next step topic estimate is also bounded in a convergent manner. That is, if and , then
where
depends on the previous bound , the learning rate , and the difference between gradients in such a way that
This will prove Claim 1. ∎
A.2 Distance between Posterior Distributions
The argument in the previous section provides a bound for the symmetrized KL divergence between the optimal topic estimates of the omniscient SplitDocument model and those of its practical surrogate, namely the IndependentView model, in a way that considers the information geometry of the mean-field Dirichlet posterior approximation for the topic parameter. Symmetrized KL divergence measures the distance between two topic parameters as the Jeffreys divergence between the Dirichlet distributions that they parameterize. For the purpose of topic estimation, however, the distance of interest is not the distance between the posterior approximations, but rather the distance between the MAP estimates.
The core of this section is Theorem 1, which provides a bound on the JS distance between the MAP estimates of the two Dirichlet distributions in terms of their symmetrized KL divergence. This theorem is useful when, given only the symmetrized KL distance between two posterior Dirichlet distributions, it is desirable to find the bound on the JS distance between their MAP estimates (modes).
Theorem 1.
Let and each be the modes of and . Let , , and assume . If for all , then
where vanishes when and .
Proof.
See Appendix D.2. ∎
Using this theorem we can easily derive the following corollary, which proves that the modes of the two surrogate posterior Dirichlet distributions that are close in terms of the SKL distance are also close in JS distance.
Corollary 1.
Given Dirichlet distributions, each parameterized by and with all parameters greater than 1, we have
where and for and
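The two quantities related by Theorem 1 and Corollary 1 can be computed numerically as in the following sketch. It assumes that the symmetrized KL divergence is the Jeffreys divergence, i.e. the sum of the two directed KL divergences between the Dirichlet distributions (in which case the log-gamma terms cancel and only digamma terms remain), and that the JS quantity is the Jensen-Shannon divergence between the modes; if the intended JS distance is the square root of that divergence, take `math.sqrt` of the result. The helper names and the pure-Python `digamma` approximation are assumptions made to keep the example self-contained.

```python
import math

def digamma(x):
    """Digamma function for x > 0, via the recurrence and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        # psi(x) = psi(x + 1) - 1 / x
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # ln x - 1/(2x) - 1/(12 x^2) + 1/(120 x^4) - 1/(252 x^6)
    result += math.log(x) - 0.5 * inv - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result

def dirichlet_skl(alpha, beta):
    """Jeffreys divergence KL(Dir(alpha)||Dir(beta)) + KL(Dir(beta)||Dir(alpha))."""
    psi_a0, psi_b0 = digamma(sum(alpha)), digamma(sum(beta))
    return sum((a - b) * ((digamma(a) - psi_a0) - (digamma(b) - psi_b0))
               for a, b in zip(alpha, beta))

def dirichlet_mode(alpha):
    """Mode (MAP estimate) of a Dirichlet whose parameters are all greater than 1."""
    total = sum(alpha) - len(alpha)
    return [(a - 1.0) / total for a in alpha]

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Two nearby Dirichlet posteriors over one topic (assumed parameter values).
alpha, beta = [2.0, 3.0, 5.0], [3.0, 3.0, 4.0]
skl = dirichlet_skl(alpha, beta)
js_modes = js_divergence(dirichlet_mode(alpha), dirichlet_mode(beta))
```

Evaluating both quantities on pairs of parameters of increasing similarity gives a quick numerical check that the JS quantity between the modes shrinks together with the SKL divergence, as the corollary states.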
Appendix B Dimensionality Reduction:
Proximity of the CoReferent Views in the Semantic Space
Once the topics are learned, LDALink computes the optimal topic proportions for each view through a stochastic variational Bayesian approach (Appendix A). In this section, we attempt to prove that the topic proportions of the coreferent views that are learned through the EM step in the LDALink algorithm are close in the simplex space with high probability.
To achieve this we first revisit a reasonable simplification of the VB updates suggested in [40] that will simplify our analysis. Since and , the iterative updates for a particular view in Algorithm 2 can be rewritten as
where we have omitted the entity index for simplicity.
Since [41], we have . Considering this in relation to the update in Algorithm 2, in large document limits where the update equation becomes
(10) 
where and . A detailed study of this simplification and its correctness and convergence properties is presented in [40]. We will use this approximation for the rest of this section.
The iterative procedure in Equation 10 converges at a point and for which
(11) 
This relation implicitly defines a set of topic proportions ’s at which the iteration converges for some initial parameters when the empirical distribution (relative frequencies) of words is . The set includes, but is not limited to, the global optimum of the ELBO.
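The fixed-point behavior described above can be illustrated with a short sketch. It assumes that, in the large-document limit, the simplified update takes the multiplicative form studied in [40], in which each topic proportion is rescaled by the frequency-weighted ratio between the topic's event probabilities and the current mixture. This may not match Equation 10 term by term. The names `freqs`, `topics`, and `fit_topic_proportions` are hypothetical, and the topics are assumed to be strictly positive.

```python
def fit_topic_proportions(freqs, topics, num_iters=200):
    """Iterate a simplified multiplicative update for one view's topic proportions.

    freqs[w]     : relative frequency of event type w in the view
    topics[k][w] : probability of event type w under topic k (assumed > 0)
    """
    num_topics, vocab_size = len(topics), len(freqs)
    gamma = [1.0 / num_topics] * num_topics
    for _ in range(num_iters):
        # Event distribution implied by the current topic proportions.
        mix = [sum(gamma[k] * topics[k][w] for k in range(num_topics))
               for w in range(vocab_size)]
        # Rescale each proportion; the result still sums to one.
        gamma = [gamma[k] * sum(freqs[w] * topics[k][w] / mix[w]
                                for w in range(vocab_size) if freqs[w] > 0)
                 for k in range(num_topics)]
    return gamma

# Relative frequencies generated exactly by proportions (0.3, 0.7) (assumed data).
topics = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
freqs = [0.28, 0.20, 0.52]
gamma_hat = fit_topic_proportions(freqs, topics)
```

At convergence, every topic with a nonzero proportion has a rescaling factor equal to one, which is the kind of implicit relation between the proportions and the relative frequencies that the subsequent perturbation analysis differentiates.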
We need to compute the change in ’s in caused by the difference in the relative frequencies . From a slight variation of Sanov’s theorem [30] we get:
(12) 
for the relative frequencies and of the views generated from the same distribution, so the two views are close in the simplex space with high probability. Since [42],
(13) 
so that if the two relative frequencies are close in the simplex space, they are also close in Euclidean space. This allows us to describe differential changes in relative frequencies in terms of Euclidean gradients.
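The step from closeness in KL divergence to closeness in Euclidean distance can be checked numerically. The sketch below assumes that the inequality taken from [42] is Pinsker's inequality in nats, so that the L1 distance is at most the square root of twice the KL divergence, combined with the elementary fact that the Euclidean distance never exceeds the L1 distance. The helper names and the random test distributions are hypothetical.

```python
import math
import random

def kl_divergence(p, q):
    """KL divergence D(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)

def l1_distance(p, q):
    return sum(abs(x - y) for x, y in zip(p, q))

def l2_distance(p, q):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def random_distribution(rng, size):
    """Random strictly positive distribution over `size` event types."""
    weights = [rng.random() + 1e-3 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]

def chain_holds(p, q, tol=1e-12):
    """Check ||p - q||_2 <= ||p - q||_1 <= sqrt(2 * D(p || q))."""
    bound = math.sqrt(2.0 * max(kl_divergence(p, q), 0.0))
    l1, l2 = l1_distance(p, q), l2_distance(p, q)
    return l2 <= l1 + tol and l1 <= bound + tol

rng = random.Random(0)
all_ok = all(chain_holds(random_distribution(rng, 5), random_distribution(rng, 5))
             for _ in range(1000))
```

Because the chain holds for every pair of distributions, a high-probability bound on the KL divergence between the relative frequencies of two coreferent views carries over directly to a bound on their Euclidean distance.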
Consider a specific choice of for a particular empirical distribution . We can make the following Taylor approximation to when is within a small neighborhood of :
(14) 
where the gradient and the Hessian are computed at .
To compute the gradient and Hessian, we must resort to implicit differentiation.
First and Second Order Necessary Conditions at Convergence
Combining the two equations in Equation 10 under the limit , we obtain the following necessary conditions for the point of convergence after some rearrangement:
(15) 
Taking partial derivatives with respect to and setting , we obtain the following first order condition:
Proposition 1.
Proof.
See Appendix D.3. ∎
Taking a second partial derivative with respect to we obtain the following second order condition:
Proposition 2.
When and are probability distributions that satisfy Equation 15,
(17) 
Proof.
See Appendix D.4. ∎
Appendix C Rank Linkage:
Asymptotic Optimality of the Ranking Given the True Topics
In this section, we provide a sketch for proving the theoretical guarantee of the correctness of LDALink’s linking algorithm. In order to do so, we will first model the problem of finding the coreferent views as a hypothesis testing problem, in light of the approach in [30].
Given a particular view , the objective is to find among all views the view for which the match is optimal. We formulate this problem as testing a set of hypotheses, each of which states that a view in is the optimal match for for different views, so that for corresponds to the hypothesis that . Therefore, finding the correct pair of views is equivalent to finding the optimal rule for testing the hypotheses , where, for ease of analysis, we have introduced the rejection hypothesis as failing to find a match. Our goal is to compute the bound on the probability of error for the decision rule that links a view with the candidate view whose JS distance in the latent semantic space is minimum.
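The minimum-distance decision rule described above can be sketched as follows. The sketch assumes that the views are already represented by topic proportions and that the rejection hypothesis is accepted when even the closest candidate is farther than a threshold; the threshold `reject_above` and the function names are hypothetical and are not taken from the formal definition of the acceptance regions.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def link_view(query, candidates, reject_above=None):
    """Return the index of the closest candidate, or None for the rejection hypothesis."""
    best_index, best_dist = None, float("inf")
    for j, cand in enumerate(candidates):
        dist = js_divergence(query, cand)
        if dist < best_dist:
            best_index, best_dist = j, dist
    # Reject when no candidate is close enough (assumed form of the rejection region).
    if reject_above is not None and best_dist > reject_above:
        return None
    return best_index

# One query view and two candidate views in the latent space (assumed data).
query = [0.8, 0.1, 0.1]
candidates = [[0.1, 0.8, 0.1], [0.7, 0.2, 0.1]]
match = link_view(query, candidates, reject_above=0.5)
```

An error in this setting is either a link to the wrong candidate or a rejection when a coreferent view exists, which are the events whose probability the analysis aims to bound.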
We will more formally restate the decision rule designed in our algorithm. Let , and . The decision rule , where
is the acceptance region for hypothesis , and the rejection region is
The following preliminary theorem, inspired by Theorem IV.3 of [30], may be useful for proving the error probability of .
Theorem 2.
Consider the hypothesis testing problem with the decision rule given as defined above. If