Our study deals with a problem that is simple to state: given a full name, how can one estimate the number of people bearing this particular name in a large population? Originally, the study was motivated by an applied record linkage task in a large database, where occurrences of personal names were accompanied by little or no additional information.
Record linkage – the task of matching records referring to the same real-world entity – is a well-studied field within database technology that is also known under such names as name matching, entity resolution, object identification, deduplication, and others. The task arises when several databases are merged or when one is interested in linking duplicate records within a single database. Records referring to people are the most common objects of linkage tasks. There is a wide variety of application domains, such as social network profiles, medical and census data processing, human resources and customer relationship management, bibliographic and genealogy databases, etc. Discrepancies in data can occur due to different attribute sets, as well as at the level of individual fields due to misspellings and OCR errors, name variations (cf. nicknames), and transliteration variations. In contrast, in our setting there is no additional information, only identical names. Name popularity estimates can serve as an additional signal for matching when information is limited. In our study we use a large dataset of open government data that initially gave rise to the applied task. The dataset contains about 20 million records that correspond to 13.4 million real-world persons, which constitutes about one tenth of the entire Russian population. Russia is a multi-ethnic country, and we may hope that the methods described herein are not heavily dependent on language and culture and can be applied to other name collections.
Accurate name popularity estimation based on a limited number of observations is a hard task. Even very large collections contain many unique names – names are a good example of large number of rare events (LNRE) distributions. This is quite natural: names serve primarily to distinguish people and to avoid collisions. Therefore maximum likelihood estimates based even on large name samples are poor predictors, since there are always many unseen names. A simplistic assumption that all names are unique may work in small communities, but in larger populations one can observe a whole spectrum from singletons to very common names. To address the problem we employ several smoothing techniques that redistribute probability mass from already seen names towards yet unseen ones. In addition, we use the fact that first, middle and last names are dependent on each other. We use LNRE models to estimate the number of unique names and use this estimate as a smoothing parameter. In the case of full name triples (first, middle, and last name) we apply the Markov assumption, i.e. we use only pairwise conditional probabilities. We also propose a simple yet effective technique for name count estimation that takes into account the large number of unique names.
We conducted two experiments: 1) name popularity estimation and 2) record linkage guided solely by the name popularity estimates. We performed the evaluation both for name triples (first, middle, and last) and doubles (first and last). The results suggest that theoretically informed approaches outperform simple heuristics. Name popularity estimates can be a good supplemental signal in record linkage tasks and can help distinguish unrealistic (artificial), rare, and more common names. The main contribution of our study is a thorough comparative evaluation of several statistical techniques applied to the name popularity estimation task on a sizable dataset. The study provides guidance for choosing the most appropriate model depending on available data, task, and performance requirements.
Knowing an estimate of the number of people bearing a particular name is beneficial not only for record linkage in databases, but also for social network analysis (especially in detecting fake and duplicate accounts [Footnote 1: Facebook estimated that duplicate and fraudulent accounts represented up to 14% of its worldwide monthly active users in the fourth quarter of 2017, see https://investor.fb.com/financials/sec-filings-details/default.aspx?FilingId=12512043]), people search, information security, and information extraction. Quantitative analysis of personal names is also of interest for genealogy, demographic, sociological, and human biology research. Last but not least, name popularity estimates can be helpful in such an important matter as the choice of a name for a newborn baby.
The paper is organized as follows. Section 2 gives an overview of various studies on numerical name analysis; section 3 describes the data used in the study. In section 4 we describe name popularity prediction models investigated within the experiment, as well as experimental design, evaluation approaches and measures. Section 5 reports experimental results. Section 6 concludes and outlines directions of future work.
2 Related Work
Our study is close to personal name matching [7], a special case of record linkage – the task of matching records referring to the same real-world person in the presence of errors, spelling variants, omissions, abbreviations, etc. Most name matching methods rely on pre-defined or machine-learned similarity measures for field values and tuples, see [8, 15, 23]. The main difference of our study is that we deal with identical names and no additional fields. Our approach is close to record linkage methods that use conditional probabilities for field values (see, for example, an early work by Winkler [32]). However, we do not adjust our methods to a particular database; we rather aim at modeling name popularity at a global scale. As such, name popularity models can deliver additional evidence for record linkage tasks applied to different databases and in cases of scarce additional information. Personal name matching and deduplication attract a great deal of attention. For example, 238 teams participated in the Author Disambiguation Challenge in 2013 [Footnote 2: https://www.kaggle.com/c/kdd-cup-2013-author-disambiguation/]. The task was to identify which authors in a large bibliographic database correspond to the same person. The winning solution [6] used string similarity measures and an ensemble classifier for two concurrent matcher implementations, and processed Chinese and non-Chinese names separately. A recent Multilingual Web Person Name Disambiguation shared task [Footnote 3: http://nlp.uned.es/IberEval-2017/index.php/Tasks/M-WePNaD] consisted of clustering Web search results for a person name query according to the different real-world persons they refer to [30].
In a related study, Popescu et al. [26] address the problem of estimating the number of people with identical names mentioned in a corpus in the context of information extraction. The authors assess name popularity on the basis of phone books, Web search statistics, and name counts in Wikipedia. In contrast to our study, they neither evaluate the name popularity estimates nor provide a formal justification of the method.
Name frequencies and their dynamics along with demographic information can provide valuable insights for psychology [28, 33], human biology [9], sociology and history [29, 20]. The advent and proliferation of online social networks had a powerful impact on quantitative research on names, as a name is often the only available information about the user. There is a series of studies that derive ethnicity [4, 22, 27] and gender [2, 24] from names in social network profiles. Perito et al. [25] and Liu et al. [19] introduce the problem of linking user profiles belonging to the same physical person between online social networks based solely on usernames. These studies are relevant to ours since the central notion in both approaches is username uniqueness. The latter study models username unexpectedness with a character-level Markov model. The authors of the former study first perform username segmentation, then estimate the rareness or commonness of a segmented username using web n-gram statistics. Minkus et al. [21] match population registry entries from a small US city to Facebook accounts based on straightforward name and location matching. Thomas et al. [31] analyze naming patterns in fraudulent Twitter accounts.
The smoothing techniques we employ in the study have been actively developed within statistical language modeling [14, 5]. Khmaladze [18] introduced the notion of large number of rare events (LNRE) distributions and studied their statistical properties. Baayen [1] and Evert [10] elaborated the models for a better fitting of frequency distributions of words in large corpora, with special attention to the estimation of the hapax legomena count (that is, the count of words with frequency 1). We use LNRE models for a more accurate choice of smoothing parameters in several evaluated methods. To the best of our knowledge, the application of smoothing and LNRE models to the name popularity prediction task is novel.
In our study we experiment with a dataset that originates from the Russian registry of legal entities and individual entrepreneurs [Footnote 4: http://egrul.nalog.ru/]. There is a many-to-many relationship between persons and companies: each legal entity is associated with one or more persons – managers and/or founders; each real-world person can be associated with several companies. The registry contains about 32 million name mentions. The minimal piece of information about a person is his or her full name. Full names in Russian official documents are triples comprising first, middle (patronymic), and last names, for example, Alexander Sergeyevich Pushkin. Patronymics have gender-specific endings (cf. Sergeyevich and Sergeyevna – literally Sergey’s son and daughter, respectively), as many (but not all) Slavic last names do (Pushkin and Pushkina for male and female variants of the same family name, respectively). In our experiment we unify gender-specific variants of last names and patronymics.
A subset of records contains persons’ taxpayer identification numbers (TINs) that can be used as a key. In the rest of the paper we focus on about 20.6 million records containing both TIN and full name that refer to about 13.4 million real persons, which constitutes about one tenth of the entire Russian population [Footnote 5: According to the 2010 census, Russian population is 143,666,931, see http://www.gks.ru/free_doc/new_site/perepis2010/croc/perepis_itogi1612.htm (in Russian)]. There are about 63.2 million pairs of identical names among 20.6 million occurrences, i.e. potential links between same-person records; 32% of them are correct according to TINs.
Figure 1 illustrates that first, middle, and last names, taken separately or as full names, are a good example of the LNRE regime: the majority of names occur only once, while a small number of combinations are relatively common. As expected, last names tend to be rarer than first names and patronymics (the latter are derivatives of male first names). Figure 2 shows proportions of unique name combinations in random samples of different sizes. For example, in a random population of 100,000 a combination of first, middle and last name is an almost perfect identifier (about 96% of people bear a unique name), while name pairs (first, last) reliably distinguish less than 75% of people in the same sample [Footnote 6: Names of inhabitants of a particular city/region are presumably less diverse due to a higher ethnic and cultural homogeneity].
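The proportion of people bearing a unique name in a sample can be computed directly from name counts. The following is a minimal sketch in Python; the function names and the toy name strings are illustrative assumptions and are not taken from the study.

```python
from collections import Counter
import random


def unique_name_share(names):
    """Fraction of people in `names` whose name is borne by nobody else."""
    counts = Counter(names)
    # Each name with count 1 corresponds to exactly one person.
    unique_people = sum(1 for c in counts.values() if c == 1)
    return unique_people / len(names)


def unique_share_in_sample(names, sample_size, seed=0):
    """Same measure on a random sample drawn without replacement."""
    rng = random.Random(seed)
    return unique_name_share(rng.sample(names, sample_size))


# Toy data (hypothetical): two people share a name, two are unique.
toy = ["ivanov ivan", "ivanov ivan", "petrova anna", "sidorov oleg"]
print(unique_name_share(toy))  # 0.5
```

Repeating the computation for growing sample sizes would yield a curve of the kind described for Figure 2.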
4.1 Name Popularity Prediction Methods
In this section, we informally describe the name popularity prediction models evaluated in the study. In what follows, $c(x)$ is the number of people with a name $x$ in the training set $T$, where $x$ can be either a full name or one of its constituents; $f$ stands for first name, and $m$ and $l$ for middle and last names, respectively; $N_r$ is the number of names that occur exactly $r$ times in $T$, and $N$ is the total number of persons in $T$.
We start with a naïve approach assuming all people have unique names (model I). So, the number of people with the name $x$ is equal to 1 in a population of any size. Then, we proceed with straightforward maximum likelihood estimates (MLE) for full names (II):
$$P_{II}(f, m, l) = \frac{c(f, m, l)}{N}$$
Model II assigns zero probabilities to names unseen in the training set. To partially mitigate the problem we can assume independence of name constituents and approximate the probability of a full name by the product of individual first, middle, and last name probabilities (or just first and last name probabilities in the case of name doubles), which defines model III:
$$P_{III}(f, m, l) = \frac{c(f)}{N} \cdot \frac{c(m)}{N} \cdot \frac{c(l)}{N}$$
This model still assigns a zero probability to a name if one of its components does not occur in the training set.
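The difference between the full-name MLE (model II) and the independence approximation (model III) can be illustrated with a minimal Python sketch. The data layout (one `(first, middle, last)` tuple per person), the function names, and the toy names are assumptions made for illustration only.

```python
from collections import Counter


def fit_counts(people):
    """people: list of (first, middle, last) tuples, one per person."""
    full = Counter(people)
    first = Counter(f for f, _, _ in people)
    middle = Counter(m for _, m, _ in people)
    last = Counter(l for _, _, l in people)
    return full, first, middle, last, len(people)


def p_mle(name, full, n):
    """Model II: relative frequency of the full name."""
    return full[name] / n


def p_independent(name, first, middle, last, n):
    """Model III: product of the constituents' relative frequencies."""
    f, m, l = name
    return (first[f] / n) * (middle[m] / n) * (last[l] / n)


# Toy training set (hypothetical names).
people = [
    ("ivan", "ivanovich", "ivanov"),
    ("ivan", "petrovich", "petrov"),
    ("petr", "ivanovich", "petrov"),
    ("petr", "petrovich", "ivanov"),
]
full, first, middle, last, n = fit_counts(people)

unseen = ("ivan", "ivanovich", "petrov")  # every part seen, full name not
print(p_mle(unseen, full, n))                         # 0.0
print(p_independent(unseen, first, middle, last, n))  # 0.125
```

A name containing a constituent absent from the training set (for example, a new first name) still receives zero under both functions, which is the limitation noted above.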
Some combinations of first, middle, and last names occur together more frequently than others. The reasons are diverse: cultural and ethnic traditions, fashion (e.g. celebrities’ names), or euphony of a combination. To capture these dependencies we use conditional probabilities. In the case of name triples we apply the Markov assumption, in other words, we account only for dependencies between pairs of constituents (model IV). [Footnote 7: This approach corresponds to the bigram language model; however, in the case of names the order of constituents is irrelevant and we can experiment with different dependencies.]
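A pairwise (Markov) factorization of a name triple can be sketched as follows. The particular chain used here, $P(x_a)\,P(x_b \mid x_a)\,P(x_c \mid x_b)$ with a configurable `order`, is an assumption chosen for illustration; it is not necessarily the combination of dependencies used in model IV. Function and variable names are likewise hypothetical.

```python
from collections import Counter


def fit_chain(people, order=(0, 1, 2)):
    """Return prob(name) = P(x_a) * P(x_b | x_a) * P(x_c | x_b).

    `order` selects which positions of the (first, middle, last) tuple
    play the roles a, b, c; the default chain is first -> middle -> last.
    """
    a, b, c = order
    n = len(people)
    uni_a = Counter(p[a] for p in people)
    uni_b = Counter(p[b] for p in people)
    pair_ab = Counter((p[a], p[b]) for p in people)
    pair_bc = Counter((p[b], p[c]) for p in people)

    def prob(name):
        xa, xb, xc = name[a], name[b], name[c]
        if uni_a[xa] == 0 or uni_b[xb] == 0:
            return 0.0  # conditioning event never observed
        p_a = uni_a[xa] / n
        p_b_given_a = pair_ab[(xa, xb)] / uni_a[xa]
        p_c_given_b = pair_bc[(xb, xc)] / uni_b[xb]
        return p_a * p_b_given_a * p_c_given_b

    return prob


# Toy training set (hypothetical): first and middle names are fully dependent.
people = [
    ("ivan", "petrovich", "ivanov"),
    ("ivan", "petrovich", "sidorov"),
    ("petr", "ivanovich", "ivanov"),
    ("petr", "ivanovich", "sidorov"),
]
prob = fit_chain(people)
print(prob(("ivan", "petrovich", "ivanov")))  # 0.25
```

On this toy set the chain estimate for the seen name equals its relative frequency (0.25), whereas the product of independent constituent probabilities would give 0.125.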
In the case of LNRE distributions it is highly beneficial to have an estimate of unseen events for smoothing. LNRE models implemented in the zipfR package for the R environment [12] allow us, starting with name frequency distributions in the training set, to estimate the number of different names in a set of doubled size and consequently the number of names not appearing in the training set. As Table 1 (columns 1 and 2) shows, the Generalized Inverse Gauss-Poisson (GIGP) model implemented in zipfR performs very well; the third column contains country-wide estimates for reference.
[Table 1: LNRE estimates of the number of different names compared with actual counts. Columns: Name, model estimates, actual counts, country-wide estimates.]
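Laplace smoothing (models V and VI) is a simple additive smoothing method: pretend that every name occurs $\alpha$ more times than it has been observed in the training set. Thus, the number of people with a previously unseen name is $\alpha$. If $V$ is the set of unique names, $c(x)$ the count of name $x$ in the training set, and $N$ the number of persons in the training set, then
$$P(x) = \frac{c(x) + \alpha}{N + \alpha |V|}$$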
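Additive smoothing can be sketched in a few lines of Python. The sketch assumes that the vocabulary size passed in already includes names not seen in training (for example, a figure obtained from an LNRE model); the concrete numbers and names below are toy values, not results from the study.

```python
from collections import Counter


def laplace_prob(name, counts, n, alpha, vocab_size):
    """Additive smoothing: (c(x) + alpha) / (N + alpha * |V|).

    vocab_size is the assumed total number of distinct names, seen and
    unseen; how it is obtained is outside the scope of this sketch.
    """
    return (counts[name] + alpha) / (n + alpha * vocab_size)


counts = Counter({"a": 3, "b": 1})  # toy training counts
n = sum(counts.values())            # 4 persons
vocab_size = 6                      # 2 seen names + 4 assumed unseen names

p_a = laplace_prob("a", counts, n, alpha=1.0, vocab_size=vocab_size)
p_b = laplace_prob("b", counts, n, alpha=1.0, vocab_size=vocab_size)
p_new = laplace_prob("zzz", counts, n, alpha=1.0, vocab_size=vocab_size)
print(p_a, p_b, p_new)  # 0.4 0.2 0.1
```

With add-1 smoothing the four assumed unseen names together receive 0.4 of the probability mass in this toy case, which shows how aggressive the method becomes when unseen names are numerous; a smaller `alpha` shifts less mass.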
Good-Turing smoothing [13] is a gentler approach widely employed in language modeling (VII). The general idea is to estimate the total probability of all unseen names in the test set as roughly equal to the total probability of names that appear only once in the training set, i.e. $N_1/N$, where $N_r$ is the number of names occurring exactly $r$ times in the training set and $N$ is the number of persons in it. The counts of all other names are discounted accordingly:
$$c^*(x) = (c(x) + 1)\,\frac{N_{c(x)+1}}{N_{c(x)}}$$
The resulting probability estimates rely on an LNRE estimate of the number of hapaxes in the test set obtained from the training set. Note that this implies we know the size of the test set beforehand.
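The textbook form of Good-Turing discounting can be sketched as below. This is a simplified illustration, not the variant evaluated in the study: it does not use LNRE estimates, and when no name occurs exactly $c+1$ times it simply keeps the raw count, which is an assumption made here to keep the sketch short.

```python
from collections import Counter


def good_turing(counts):
    """Return (unseen_mass, adjusted_counts) for a Counter of name counts.

    unseen_mass = N_1 / N, the probability mass reserved for unseen names.
    adjusted count c* = (c + 1) * N_{c+1} / N_c; when N_{c+1} is zero the
    raw count is kept (a simplification assumed for this sketch).
    """
    n = sum(counts.values())
    freq_of_freq = Counter(counts.values())  # N_r: names occurring r times
    unseen_mass = freq_of_freq[1] / n
    adjusted = {}
    for name, c in counts.items():
        if freq_of_freq[c + 1] > 0:
            adjusted[name] = (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]
        else:
            adjusted[name] = float(c)
    return unseen_mass, adjusted


# Toy counts: three hapaxes, one name seen twice, one seen three times.
counts = Counter({"a": 1, "b": 1, "c": 1, "d": 2, "e": 3})
unseen_mass, adjusted = good_turing(counts)
print(unseen_mass)    # 0.375
print(adjusted["d"])  # 3.0
```

In the toy example the hapaxes are discounted from 1 to 2/3, while the name seen twice is inflated to 3.0 because the frequency-of-frequency table is tiny and not smoothed; this kind of instability at higher counts is one reason practical implementations smooth $N_r$ or restrict discounting to low counts.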
One of the drawbacks of Good-Turing smoothing is that it discounts probabilities uniformly in different frequency ranges. This often leads to severely distorted probabilities for high-frequency items. Katz smoothing [16] uses MLE for high-frequency names (those whose count exceeds a fixed threshold in our experiment; model VIII) and Good-Turing smoothing for low-frequency ones. [Footnote 8: Katz smoothing as described in the original work incorporates two approaches: 1) a combination of ML and GT estimates and 2) “backing off” to lower-order n-grams in case of sparse data. In this study, we use only the former.]
Aiming to combine the simplicity of Laplace smoothing with the selectivity of Katz smoothing, we introduce pseudo-Laplace smoothing with a small $\alpha$ (model IX).
The idea is quite simple: names present in the training set obtain a probability close to the MLE, while unseen names get reasonable non-zero probabilities. In a strict mathematical sense, these are not probabilities, since they do not sum to unity (which is why we use a distinct notation for them). Such probability-like scores are widely used in many practical applications; see, for example, the “stupid back-off” introduced in [3].
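One reading that is consistent with this description is to add a small $\alpha$ to the observed count while keeping the MLE denominator, so that seen names stay close to their relative frequency and unseen names receive $\alpha / N$. The sketch below implements that reading; the formula, the default value of `alpha`, and all names are assumptions made for illustration and are not confirmed by the text.

```python
from collections import Counter


def pseudo_laplace_score(name, counts, n, alpha=0.1):
    """Probability-like score (c(x) + alpha) / N.

    ASSUMPTION: this form is inferred from the verbal description only.
    The denominator is not enlarged, so scores do not sum to one.
    """
    return (counts[name] + alpha) / n


counts = Counter({"a": 3, "b": 1})  # toy training counts
n = sum(counts.values())

score_seen = pseudo_laplace_score("a", counts, n)
score_unseen = pseudo_laplace_score("zzz", counts, n)
total_over_seen = sum(pseudo_laplace_score(x, counts, n) for x in counts)
print(score_seen, score_unseen, total_over_seen)
```

Under this reading the scores of the seen names alone already exceed one, which matches the remark that the values are not probabilities in the strict sense.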
4.2 Experimental design
We conducted two experiments: 1) estimation of name popularity (that is, estimation of the number of people bearing each name) and 2) record linkage based solely on the name popularity estimates. In the first experiment we used a list of 13.4 million real-world persons represented by TINs and corresponding names compiled from the original dataset. In the second experiment, we performed record linkage on the original dataset of 20.6 million records.
Name popularity estimation.
Evaluation of models on samples with a large number of unique events is not an easy task. Evaluation results may diverge significantly on different test samples and depend on the size of the test sample, particularly in low frequency ranges. For example, LNRE models are traditionally evaluated by looking at how well the expected values they generate fit empirical counts extracted from the same dataset used for parameter estimation [10, 1]. In this experiment we follow the extrapolation setting for evaluation described in [11]: the parameters of the model are estimated on a subset of the data used subsequently for testing. We randomly sampled a training set of 6.7 million names, which is half of the whole dataset. [Footnote 9: We also performed experiments accounting for the historical dimension: we ranked all persons with an available year of birth by age and trained parameters on the ‘older’ half of the population. The results showed a general decrease in quality, which supports the hypothesis of name popularity dynamics [17]. We do not cite the results here due to limited space.] We employ the root-mean-square error (RMSE) between the estimates and actual counts, averaged over all names, as the evaluation measure. The RMSE of model $M$ on the test set $S$ over the set $U$ of unique full names in $S$ is defined as follows:
$$\mathrm{RMSE}_M = \sqrt{\frac{1}{|U|} \sum_{x \in U} \left(\hat{c}_M(x) - c(x)\right)^2}$$
where $\hat{c}_M(x)$ is the number of people named $x$ that model $M$ predicts for a population of the size of $S$. [Footnote 10: Note that in this case $c(x)$ corresponds to the number of persons bearing name $x$ in $S$ (not in $T$ as in the equations above).]
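The RMSE over unique names, optionally restricted to a frequency range, can be computed as in the following Python sketch. The dictionary-based layout, the function name, and the bucket limits in the usage example are illustrative assumptions.

```python
import math


def rmse(estimated, actual, lo=1, hi=None):
    """RMSE between estimated and actual counts over unique names.

    estimated, actual: dicts mapping name -> count in the test set.
    Only names whose actual count lies in [lo, hi] are included;
    hi=None means no upper limit.
    """
    names = [x for x, c in actual.items()
             if c >= lo and (hi is None or c <= hi)]
    if not names:
        raise ValueError("no names in the requested frequency range")
    squared = sum((estimated.get(x, 0.0) - actual[x]) ** 2 for x in names)
    return math.sqrt(squared / len(names))


# Toy test set: two hapaxes and one name borne by four people.
actual = {"a": 1, "b": 1, "c": 4}
always_one = {x: 1.0 for x in actual}  # the "all names are unique" baseline

print(rmse(always_one, actual))              # sqrt(3), about 1.732
print(rmse(always_one, actual, lo=1, hi=1))  # 0.0 on hapaxes
```

The toy run shows why the "always 1" baseline is perfect on hapaxes and degrades as soon as more frequent names enter the evaluation set.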
In order to have a better understanding of the models’ behavior and their applicability to different tasks and data volumes, we calculate RMSE separately for five name frequency buckets, ranging from hapaxes (names with a single bearer) to very frequent names.
For the second task we calculate the probability that there is a single person with a given name in a population of a given size, using the estimates produced by the different models. If this probability surpasses a threshold, we link records with identical names. Note that all identical names are linked at once, whereby $k$ records with a given name trigger $k(k-1)/2$ linkages. The evaluation measures for the task are the standard classification measures: precision – the fraction of linked record pairs that are correct, i.e. both refer to the same real-world person, and recall – the fraction of correct links identified. As stated before, there are about 63.2 million pairs of identical names among 20.6 million occurrences, i.e. potential links between same-person records; 32% of them are correct according to TINs. Taking into account these figures, linking all possible pairs results in a precision of 0.32 and a recall of 1.
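Pairwise precision and recall for threshold-based linkage can be computed as in the sketch below. It assumes records are available as `(tin, name)` tuples and that the per-name probability of a single bearer has already been computed by some model; the function names, the data layout, and the toy probabilities are hypothetical.

```python
from collections import Counter


def pairs(k):
    """Number of unordered pairs among k records."""
    return k * (k - 1) // 2


def evaluate_linkage(records, single_person_prob, threshold):
    """records: list of (tin, name); returns (precision, recall).

    All records sharing a name are linked when the name's probability
    of having a single bearer exceeds the threshold.
    """
    by_name = Counter(name for _, name in records)
    by_person_and_name = Counter(records)

    # Correct pairs per name: same name and same TIN.
    correct_by_name = Counter()
    for (_, name), k in by_person_and_name.items():
        correct_by_name[name] += pairs(k)
    total_correct = sum(correct_by_name.values())

    linked = [x for x in by_name if single_person_prob.get(x, 0.0) > threshold]
    made = sum(pairs(by_name[x]) for x in linked)
    hit = sum(correct_by_name[x] for x in linked)

    precision = hit / made if made else 1.0
    recall = hit / total_correct if total_correct else 1.0
    return precision, recall


# Toy data: one rare name with a single bearer, one common name with two.
records = [("t1", "rare")] * 3 + [("t2", "common"), ("t3", "common"), ("t3", "common")]
probs = {"rare": 0.9, "common": 0.1}  # hypothetical model outputs

print(evaluate_linkage(records, probs, threshold=0.5))   # (1.0, 0.75)
print(evaluate_linkage(records, probs, threshold=-1.0))  # links everything
```

Linking everything (the second call) reproduces the baseline situation described above: recall is 1 and precision equals the share of same-name pairs that are correct.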
In contrast to the first experiment that presumably reflects a global distribution of names, the second experiment deals with a concrete database and its particular characteristics, e.g. the number of companies associated with a person.
5.1 Name count prediction
Table 2 summarizes evaluation results for nine name popularity prediction models. The first model (I) is a naïve “always 1” baseline that assumes all names are unique. Obviously, the model performs ideally on hapaxes. The MLE model for full name triples (II) demonstrates the best prediction results in higher frequency ranges. The product of individual probabilities for first, middle and last names (III) performs slightly better on hapaxes, but substantially underestimates the probability of more frequent names. We investigated different dependencies between full name constituents, and the combination used in model IV performed best. As one can see, conditional probabilities considerably improve over model III, which assumes independence of name constituents.
The next five models incorporate smoothing. Add-1 smoothing (V) is too aggressive in the case of LNRE distributions, as the evaluation results show. A more delicate Laplace smoothing with a smaller $\alpha$ (VI) delivers better results, equal to those of model IV. Good-Turing and Katz methods with LNRE-based estimates (VII and VIII, respectively) perform slightly worse, but comparably to other models with smoothing. Our method (IX) performs best in the low-frequency range and as well as models IV and VI in higher-frequency ranges.
Table 3 summarizes the performance of the same models for first-last name doubles (model IV applied to name doubles coincides with model III). Although the general trend is the same as in the case of name triples, the results contain some peculiarities. The Good-Turing model (VII) is the best at predicting hapaxes and the most frequent names. The latter fact is somewhat unexpected, and we will study this outcome in depth in our future work. The proposed pseudo-Laplace method performs best in the middle frequency range (2–100).
5.2 Record linkage
Results of the record linkage experiment are presented in Figures 2(a) (name triples) and 2(b) (name doubles). The threshold governs the linkage process: the higher the threshold, the fewer name mentions are linked. One can imagine the process of gradual data linkage going from right to left, from higher to lower threshold values. The stepped curves of the MLE models are due to the fact that at some threshold values a large number of links is established at once. In the case of full name triples (Figure 2(a)) all ‘advanced’ methods deliver almost identical results. The simplest MLE method for full names works well when we favor precision over recall. A suitably chosen threshold delivers precision of about 90% and recall above 70%. In the case of first and last name doubles, record linkage in such a sizable dataset based solely on name popularity estimates is much less effective (see Figure 2(b)).
6 Conclusion and future work
In our experiments we make use of a large name dataset with unique identifiers that contains names of approximately one tenth of the Russian population. We conducted a series of experiments with different name popularity prediction models built upon the name dataset. We thoroughly evaluated several models, including well-known smoothing approaches and proposed a new simple yet effective method for adjusting probability estimates accounting for unseen events. Results show that the considered methods behave differently depending on the frequency range of names to be estimated, the name structure (full name triples vs. first and last name doubles), and the population size for which the prediction is made. These experimental results can serve as guidelines for choosing the most suitable method for a specific task and available data.
Furthermore, we conducted a record linkage experiment in the large database based solely on name popularity estimates. The outcomes suggest that name popularity estimates are a valuable signal for personal name matching. Results show that all methods using smoothing perform almost identically and the simplest method based on maximum likelihood estimates can be a good choice, when precision is more important than recall. However, these results reflect the peculiarities of a specific database and serve merely as an illustration of feasibility of the approach.
The proposed statistical techniques can incorporate other components alongside names, such as location, gender, age and so on. In the case of our dataset, locations associated with a person can be derived either from the TIN, which encodes the federal district where the TIN was issued, or from the legal address of the associated company. A record linkage experiment accounting for location achieved precision of 95% and recall of 83% on the dataset.
The proposed methods are applied to identical name strings and do not account for misspellings, OCR errors, or spelling and transliteration variants. An interesting direction for future research could be the combination of name popularity estimates with string similarity measures traditionally used in record linkage tasks.
In future work we plan to incorporate other sources of name popularity information such as phone books, open electoral registers, and social network sites and to compare results obtained using different datasets. It is also interesting to juxtapose name popularity distributions in different countries and cultures.
We thank Kontur for preparing the dataset and granting access to it for research. We are very grateful to Leonid Boytsov, Julia Efremova, James Lu, Boris Novikov, Guillaume Obozinski, Julia Stoyanovich, and Yana Volkovich for reading the paper draft and making valuable comments and suggestions.
- [1] Baayen, H.: Word frequency distributions. Text, speech and language technology, Kluwer Academic Publishers (2001)
- [2] Bergsma, S., Dredze, M., Van Durme, B., Wilson, T., Yarowsky, D.: Broadly improving user classification via communication-based name and location clustering on Twitter. In: Proceedings of NAACL-HLT. pp. 1010–1019 (2013)
- [3] Brants, T., Popat, A.C., Xu, P., Och, F.J., Dean, J.: Large language models in machine translation. In: Proceedings of the Joint EMNLP-CoNLL Conference. pp. 858–867 (2007)
- [4] Chang, J., Rosenn, I., Backstrom, L., Marlow, C.: ePluribus: Ethnicity on social networks. In: Proceedings of ICWSM. pp. 18–25 (2010)
- [5] Chen, S.F., Goodman, J.: An empirical study of smoothing techniques for language modeling. Computer Speech & Language 13(4), 359–393 (1999)
- [6] Chin, W.S., Zhuang, Y., Juan, Y.C., Wu, F., Tung, H.Y., Yu, T., Wang, J.P., Chang, C.X., Yang, C.P., Chang, W.C., et al.: Effective string processing and matching for author disambiguation. The Journal of Machine Learning Research 15(1), 3037–3064 (2014)
- [7] Christen, P.: A comparison of personal name matching: Techniques and practical issues. Tech. Rep. TR-CS-06-02, Australian National University (September 2006)
- [8] Christen, P.: Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer (2012)
- [9] Colantonio, S., Lasker, G.W., Kaplan, B.A., Fuster, V.: Use of surname models in human population biology: A review of recent developments. Human Biology 75(6), 785–807 (2003)
- [10] Evert, S.: A simple LNRE model for random character sequences. In: Proceedings of JADT. pp. 411–422 (2004)
- [11] Evert, S., Baroni, M.: Testing the extrapolation quality of word frequency models. In: Proceedings from the Corpus Linguistics Conference Series. vol. 1 (2005)
- [12] Evert, S., Baroni, M.: zipfR: Word frequency distributions in R. In: Proceedings of ACL. pp. 29–32 (2007)
- [13] Good, I.J.: The population frequencies of species and the estimation of population parameters. Biometrika 40(3 and 4), 237–264 (1953)
- [14] Goodman, J.T.: A bit of progress in language modeling. Computer Speech & Language 15(4), 403–434 (2001)
- [15] Ilyas, I.F., Chu, X.: Trends in cleaning relational data: Consistency and deduplication. Found. Trends Databases 5(4), 281–393 (2015)
- [16] Katz, S.M.: Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing 35(3), 400–401 (1987)
- [17] Kessler, D.A., Maruvka, Y.E., Ouren, J., Shnerb, N.M.: You name it–how memory and delay govern first name dynamics. PloS one 7(6), e38790 (2012)
- [18] Khmaladze, E.V.: The statistical analysis of a large number of rare events. Tech. Rep. MS-R8804, CWI (1988)
- [19] Liu, J., Zhang, F., Song, X., Song, Y.I., Lin, C.Y., Hon, H.W.: What’s in a name? An unsupervised approach to link users across communities. In: Proceedings of WSDM. pp. 495–504 (2013)
- [20] Mateos, P., Longley, P.A., O’Sullivan, D.: Ethnicity and population structure in personal naming networks. PloS one 6(9), e22943 (2011)
- [21] Minkus, T., Ding, Y., Dey, R., Ross, K.W.: The city privacy attack: Combining social media and public records for detailed profiles of adults and children. In: Proceedings of COSN. pp. 71–81 (2015)
- [22] Mislove, A., Lehmann, S., Ahn, Y.Y., Onnela, J.P., Rosenquist, J.: Understanding the demographics of Twitter users. In: Proceedings of ICWSM (2011)
- [23] Naumann, F., Herschel, M.: An Introduction to Duplicate Detection. Morgan and Claypool Publishers (2010)
- [24] Panchenko, A., Teterin, A.: Detecting gender by full name: Experiments with the Russian language. In: Analysis of Images, Social Networks and Texts, pp. 169–182 (2014)
- [25] Perito, D., Castelluccia, C., Kaafar, M., Manils, P.: How unique and traceable are usernames? In: Proceedings of the 11th International Symposium on Privacy Enhancing Technologies (PETS’2011), pp. 1–17 (2011)
- [26] Popescu, O., Corcoglioniti, F., Zanoli, R.: Person number estimation in large corpora. Intelligenza Artificiale 6(2), 135–148 (2012)
- [27] Rao, D., Yarowsky, D.: Typed graph models for semi-supervised learning of name ethnicity. In: Proceedings of ACL-HLT: Vol. 2. pp. 514–518 (2011)
- [28] Savage, B.M., Wells, F.L.: A note on singularity in given names. The Journal of Social Psychology 27(2), 271–272 (1948)
- [29] Scapoli, C., Mamolini, E., Carrieri, A., Rodriguez-Larralde, A., Barrai, I.: Surnames in Western Europe: A comparison of the subcontinental populations through isonymy. Theoretical Population Biology 71(1), 37–48 (2007)
- [30] Soto Montalvo, R.M., Fresno, V., Delgado, A.D., Zubiaga, A., Berendsen, R.: Overview of the M-WePNaD task: Multilingual web person name disambiguation at IberEval 2017. In: Proceedings of the 2nd Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2017). pp. 113–127 (2017)
- [31] Thomas, K., McCoy, D., Grier, C., Kolcz, A., Paxson, V.: Trafficking fraudulent accounts: The role of the underground market in Twitter spam and abuse. In: USENIX Security Symposium. pp. 195–210 (2013)
- [32] Winkler, W.E.: Using the EM algorithm for weight computation in the Fellegi-Sunter model of record linkage. In: Proceedings of the Section on Survey Research Methods, American Statistical Association. pp. 667–671 (1988)
- [33] Zweigenhaft, R.L., Hayes, K.N., Haagen, C.H.: The psychological impact of names. The Journal of Social Psychology 110(2), 203–210 (1980)