Searching for a person’s name is a frequent activity in information systems [hall1980approximate, yang2006web]. Retrieving a paper or news article by its author’s name, examining patient records [pfeifer1996retrieval], or finding usernames via given emails [minkov2006contextual] are all daily activities carried out using individuals’ names. Moreover, the dependency on names for searching on the web is consistently growing. In 2004, 30% of search engine queries included personal names [guha2004disambiguating]. One decade later, one billion names were entered in the Google search engine every day [GoogleYourself].
While the need for online searching of people’s names has increased, the results from web search engines have not kept pace. The well-known web search engines like Google, Yahoo and Bing suffer from a low number of accurate results in response to a query containing a name. This acute problem has created a new market need [organicweb] that has been filled by companies such as Pipl [pipl] and ZoomInfo [zoominfo], which specialize in searching for information about specific people. However, in many cases, users do not know the exact name, or correct form of the name, that they are searching for. Therefore, search engines will not provide the desired results.
The main reason for the poor results provided by the well-known web search engines lies in the queries which contain names. As opposed to a simple word with one correct spelling, there can be many legitimate spelling variations for a given name [christen2006comparison]. Furthermore, first names sometimes change over time due to adopting nicknames, marriage, religious conversion (e.g., from Cassius Clay Jr. to Muhammad Ali), or gender reassignment, and these are heavily influenced by a person’s cultural background [christen2006comparison, smalheiser2009author]. For instance, the English first name of John has several variations in other languages: Jean (French), Giovanni (Italian), Johannes (German and Latin), João (Portuguese), and Juan (Spanish) [familyeducation] (see Figure 1). Also detecting aliases for names is very difficult. For example, Lionel Messi, the famous football player, is called “La Pulga” (the flea) and “Messiah.” Therefore, the matching of first names is a more challenging problem than the matching of general text [borgman1992getty].
The problem of name matching is well known and has been explored in many research fields, including statistics, databases, and artificial intelligence [cohen2003comparison]. Today, most of the techniques for related name retrieval are based on pattern matching, phonetic encoding, or a combination of these two approaches [christen2006comparison]. However, the retrieval of related names with these techniques still leads to poor results [friedman1992tolerating].
Alongside the existing problem of poorly suggested related names, online genealogical research has gained great popularity with the growth of digital genealogical documents and the spreading of the internet across the world [heinlen2007genealogy]. The increased interest in genealogy has encouraged a series of online companies that specialize in genealogy, such as MyHeritage [myheritage] and WikiTree [wikitree], to fill the gap of historical knowledge for the public. Based on personal data provided by users, these companies construct personal digitized family trees. Over time, the many individually constructed family trees merge into a single enormous forest by utilizing the wisdom of the crowd and entity matching [jacobs2018s, kaplanis2018quantitative].
In this paper, we propose an innovative approach to confront the problem of similar name suggestions. Our novel algorithm utilizes historical data collected from digitized family trees, combined with graph algorithms and genealogy (the study of families, family history, and the tracing of their lineages) [yakel2004seeking]. As opposed to previous approaches that retrieve related names based on the same encoded representation or pattern [holmes2002improving, uzzaman2004bangla], we propose a general approach that suggests first names based on the construction and analysis of digitized family trees assembled by millions of people in a joint effort to trace their past. Namely, we collected data from a genealogy website to construct a large weighted graph of first names, which contains information about how names have evolved over the centuries (see Figure 1 and Section 3). Then, we developed a novel algorithm that utilizes the names existing in this graph to suggest similar names for a given name. We show that our general algorithm provides significantly superior results compared to other existing methods that focus on encoding or detection of specific patterns. For example, the average precision@1 obtained by our approach was more than twice that of the well-known Soundex algorithm (0.272, as opposed to 0.102) (see Section 6 and Table 2). This means that the fraction of relevant names among the names retrieved, given a name as input, is significantly higher than for phonetic encoding algorithms, which retrieve names based on similar sounds.
The remainder of this paper is organized as follows: In Section 2, we provide a brief overview of the studies that have focused on similar issues to what we have discussed in this study. Section 3 describes the proposed method for suggesting related names based on the construction and analysis of digitized family trees. In Section 4, we describe in detail the datasets used throughout this study. In Section 5, we review the setup done for conducting the experiments. Section 6 presents the performance of the proposed method, as well as other phonetic and string similarity algorithms for the task of suggesting related names. In Section 7, we discuss the results obtained, and Section 8 presents our conclusions and future directions.
In the following subsections, we provide the necessary background and related work to this study: In Section 2.1, we review the issue of the digitized family trees, as well as their usages. Next, in Section 2.2, we provide a brief overview of a few well-known string similarity metrics, which we later used in our presented algorithm. Then, in Section 2.3, we provide the necessary background for several well-known phonetic algorithms that were used by us to compare our proposed method. Lastly, in Section 2.4, we review previous studies that suggested similar names based on a given name.
2.1 Digitized Family Tree Usages
Three decades ago, the creation, as well as the use of family trees was very limited due to their reliance on domestic data repositories of churches or vital records offices [gulcher2001decode, albright2008utah, gauvin2016french]. The two main reasons for such limited use were the lack of comprehensive and accurate genealogical information in large populations [thorsson2003systematic] and the extensive effort needed to digitize and organize the manual genealogical records [albright2006computerized].
However, in the last two decades, there has been impressive growth in the online digitizing of genealogical documents. Today, many universities, libraries, and public institutions digitize these documents to preserve this valuable information and provide open access to them. This phenomenon of open access to genealogical documents, together with the inner interest and curiosity of the public for knowing their origins [smolenyak2004trace], has contributed to the popularity of online genealogical research [heinlen2007genealogy].
Today, a family tree visually presents a person’s ancestry simply and conveniently. In most cases, its structure depicts a mathematical graph that attempts to capture natural processes, such as mating and parenthood [kaplanis2018quantitative]. This structure, which is based on one’s ancestors, is an important and useful tool to observe family evolution over generations by presenting the relationships between family members [kingman1982genealogy]. Furthermore, the valuable information that is captured by these family trees can be utilized in a wide range of research domains. Currently, the main research domain that utilizes family trees is genetics, which leverages genotype data from relatives [kong2008detection], analyzes parental-origin effects [kong2009parental], estimates heritabilities [ober2001genetic], and studies disease prevention [valdez2010family]. Beyond genetics, family trees have played a major role in a wide range of other domains, such as human evolution [lahdenpera2004fitness], anthropology [helgason2008association], economics [modalsli2016multigenerational], and even behavior analysis over generations [mann1985reliability]. Inspired by the convenience and simplicity of presenting the evolution of families over generations, researchers have utilized this concept to analyze the evolution of other elements, such as myosin protein [hodge2000myosin] and cancer [chung2002molecular].
In recent years, alongside these domains, researchers who specialize in the advancing domain of data science have used big genealogical datasets for analysis. In 2015, Fire and Elovici [fire2015data] used machine learning algorithms on a genealogical dataset with more than a million individuals to uncover features that affect individuals’ lifespans over time. In 2018, Kaplanis et al. [kaplanis2018quantitative] obtained a genealogical dataset from Geni.com, which consists of over 86 million publicly available profiles. After an extensive cleaning process, they constructed family trees in which the largest pedigree consisted of 13 million people. They analyzed these family trees and provided insights into population genetic theories. In the same year, Charpentier and Gallic [charpentier2018internal] used family trees of 2.5 million individuals collected from the Geneanet website to study internal migration in France in the 19th century.
In addition to researchers using family trees for network evolution analysis, the desire of people to learn about their origins created a new market for online companies that specialize in genealogy [heinlen2007genealogy]. Examples of these companies are Ancestry.com [ancestry], FamilySearch [familysearch], MyHeritage [myheritage], and WikiTree [wikitree]. These companies encourage genealogy enthusiasts to upload their family tree by creating profiles for each family member [kaplanis2018quantitative]. In many cases, the profile includes basic information such as first and last name, nickname, demographic information, birth and death date, and a photo. The popularity of these companies has grown, and each company now reaches millions of customers worldwide [wikitree, ancestry_about, myheritage_about, familysearch_about]. The relative advantage of these companies is the scanning operations they perform to detect similar profiles using entity-matching metrics. Upon detecting similar profiles, the websites encourage customers to merge two given profiles into a single profile [jacobs2018s, kaplanis2018quantitative], connecting separate digitized family trees into a larger digitized family tree. These larger family trees provide users with additional information about their ancestors beyond their own knowledge.
2.2 String Similarity Algorithms
In this study, we utilize digitized family trees by connecting family members who share a similar name to their ancestors. The condition for detecting similar names is determined by well-known string similarity functions. These functions usually have been used to match individuals or families between samples and censuses for tasks like measuring the coverage of a decennial census, or for combining two databases, such as tax information and population surveys [cohen2003comparison, casanova2007database]. Such functions attempt to determine the similarity of two strings by measuring the “distance” between the two strings. Two strings that are found similar by the functions are considered related. In this study, we will use the following string similarity functions:
Damerau-Levenshtein Distance (DLD). The Damerau-Levenshtein distance was developed in 1964 by Damerau [damerau1964technique]. This string similarity algorithm measures the minimal number of editing operations required to transform a given word into another, where four types of operations are allowed: insertion, deletion, and substitution of a single character, and transposition of two adjacent characters.
Edit Distance (ED). The edit distance, also known as the Levenshtein distance, was developed two years later, in 1966, by Levenshtein [levenshtein1966binary]. It is a string similarity algorithm which measures the minimal number of operations required to transform one word into the other [levenshtein1966binary]. These operations are insertions, deletions, and substitutions of a single character. For example, the edit distance between the names John and Johan is 1.
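The two distances above can be illustrated with a short sketch. This is a minimal dynamic-programming version written for illustration only; the function names are our own, and the Damerau-Levenshtein function implements the restricted variant (often called optimal string alignment), which is an assumption about the variant rather than something stated in this study.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def damerau_levenshtein(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Same as edit_distance, plus transposition of two adjacent characters.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]
```

With these definitions, swapping two adjacent letters (for example, "Jon" and "Jno") counts as two operations under the edit distance but as a single operation under the Damerau-Levenshtein distance.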
2.3 Phonetic Encoding Algorithms
Phonetic encoding algorithms are methods that convert a given word into a code according to the way it is pronounced. These algorithms are commonly used for spelling suggestion [uzzaman2004bangla], entity matching [cohen2003comparison, peled2013entity], and searching for names in websites [khan2017application] or databases [patman2001soundex].
Soundex. The Soundex algorithm is one of the first phonetic encoding techniques, devised over a century ago by Russell and O’Dell [hall1980approximate]. Given a name, it provides a code that reflects how the name sounds when spoken. It keeps the first letter of the given name and encodes the remaining letters as digits, producing a code of one letter followed by three digits. Vowels and the letters h, w, and y are converted to 0. The letters b, f, p, and v are converted to 1. The letters c, g, j, k, q, s, x, and z are converted to 2. The letters d and t are converted to 3, the letter l to 4, the letters m and n to 5, and the letter r to 6. Adjacent letters with the same digit are encoded once, and the zeros are then dropped. Codes that are generated from longer names are cut off after three digits, whereas shorter codes are padded with zeros. For example, the Soundex code for the name “Robert” is “R163.”
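The encoding rules described above can be sketched as follows. This is a simplified illustration that follows the letter-to-digit table in the text; the treatment of w as 0 and the collapsing of adjacent equal digits before the zeros are dropped are assumptions taken from common descriptions of Soundex, and other published variants differ in small details.

```python
# Letter groups and their digits, following the table in the text.
# Assumption: "w" is grouped with the vowels, h, and y (digit 0).
SOUNDEX_DIGITS = {}
for letters, digit in (("aeiouhwy", "0"), ("bfpv", "1"), ("cgjkqsxz", "2"),
                       ("dt", "3"), ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        SOUNDEX_DIGITS[ch] = digit


def soundex(name: str) -> str:
    """Return a four-character code: first letter plus three digits."""
    # Keep only letters that appear in the table (drops spaces, accents, etc.).
    letters = [ch for ch in name.lower() if ch in SOUNDEX_DIGITS]
    if not letters:
        return ""
    digits = [SOUNDEX_DIGITS[ch] for ch in letters]
    # Collapse runs of identical adjacent digits into one digit.
    collapsed = [digits[0]]
    for d in digits[1:]:
        if d != collapsed[-1]:
            collapsed.append(d)
    # Skip the first letter's own digit and drop all zeros.
    tail = "".join(d for d in collapsed[1:] if d != "0")
    # Pad with zeros or truncate so the code has exactly four characters.
    return (letters[0].upper() + tail + "000")[:4]
```

Under these rules "Robert" and "Rupert" receive the same code, which is the grouping behavior that makes the algorithm useful for retrieving similar-sounding names.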
Metaphone. The Metaphone algorithm was developed in 1990 by Lawrence Philips [philips1990hanging]. It is an improvement over Soundex because words are encoded to a representation that allows them to be grouped together despite minor differences [binstock1995practical]. This algorithm assumes English phonetics and works equally well for forenames and surnames [pimpalkhute2014phonetic]. It is widely used in spell checkers, search interfaces, genealogy websites, etc. [khan2017application]. As an example, the Metaphone code for the forename “Robert” is “RBRT.”
Double Metaphone. The Double Metaphone algorithm was developed almost two decades ago by Lawrence Philips [philips2000double]. The Double Metaphone is a variation of the Metaphone algorithm. It retrieves a code that solely consists of letters. As opposed to the previous two algorithms, the Double Metaphone also attempts to encode non-English words (European and Asian names). Moreover, unlike all other phonetic algorithms, it returns two phonetic codes. As an example, the Double Metaphone code for the forename “Jean” is “JN” and “AN.”
NYSIIS. The phonetic encoding algorithm titled the New York State Identification Intelligence System (NYSIIS) also returns a code that solely consists of alphabetic letters [borgman1992getty]. It preserves the vowels’ positions in the given name by converting all the vowels to the letter ‘A’ [de1986guth]. For example, the NYSIIS code for the forename “Robert” is “RABAD.”
Match Rating Approach (MRA). This phonetic encoding algorithm was developed by Gwendolyn Moore in 1977 [moore1977accessing]. The algorithm includes a small set of encoding rules, as well as a more lengthy set of comparison rules. For example, the returned code for the forename “Robert” is “RBRT.”
2.4 Related Name Suggestion Algorithms
In 1996, Pfeifer et al. [pfeifer1996retrieval] examined the differences in performance between a few known phonetic similarity measures and exact-match metrics for the task of improving the retrieval of names. For evaluation, Pfeifer et al. manually collected surnames from a few sources, such as the TREC collection [harman1992overview], the CACM collection from the SMART system [buckley1985implementation], the phonebook of the University of Dortmund, Germany, and author names from a local bibliographic database. At the end of this process, all these surnames were combined into one large dataset titled COMPLETE with approximately 14,000 names. Afterward, they determined the queries for this dataset as follows: First, they chose 90 names randomly from the COMPLETE dataset. Second, for each of the selected 90 queries, they manually determined the relevant names. They showed that an information system based on phonetic similarity measures, such as Soundex and variations of phonetic algorithms, outperforms exact-match search metrics for searching related names.
In 2010, Bollegala et al. [bollegala2010automatic] presented a method for extracting aliases for a given personal name based on the web. For example, “Fresh Prince” is an alias of Will Smith. They proposed a lexical pattern-based approach for extracting aliases of a given name using snippets returned by a web search engine. They then defined numerous ranking scores to evaluate candidate aliases using three approaches: lexical pattern frequency, word co-occurrences in an anchor text graph, and page counts on the web. Their method outperformed numerous baselines, achieving a mean reciprocal rank of 0.67.
Alongside researchers who attempted to suggest related names, several companies emerged for finding people by names, due to the constant need of internet users to find people and the poor results provided by the biggest search engines [organicweb]. Among them are Pipl [pipl], which utilizes names to search for the real person behind online identities [pipl_about_us], and ZoomInfo [zoominfo], which provides people’s information that is company- or organizational-oriented. According to ZoomInfo [zoominfo_about_us], their data includes 67 million emails and 20 million company profiles.
Other services that are free online are PeekYou [peekyou], a people search website that collects and combines content from online social networks, news sources, and blogs to help retrieve the online identity of American users, and TruePeopleSearch [truepeoplesearch], which helps find people by name, phone number, or address. Websites such as TruthFinder [truthfinder] and BeenVerified [beenverified] provide background checking services for people. These services can help reconnect Americans with their friends and relatives, as well as provide a way to look up criminal records online.
In this paper, we propose a novel approach for improving the suggestion of related names associated with a given name. Our pioneering method is based on the construction and analysis of digitized family trees, combined with network science. By constructing digitized family trees, we utilize the historical and valuable information over generations that exists in these family trees for detecting family members who share a similar name. Afterward, by connecting names that many family members have preserved over generations, we construct a name-based graph that reflects the evolution of names over generations (see Figure 1). In the last phase, we search for the given name in the constructed name-based graph and select candidates for similar names according to a general ordering function that takes into account several parameters, such as the network’s structure, and the string and phonetic similarity between the given name and the candidate (see Section 3.1).
3.1 Suggesting Related Names Based on a Name-Based Graph
The proposed method consists of five main phases: data collection, preprocessing, digitized family tree construction, name-based graph construction, and name suggestion (see Figure 2).
Genealogical Data Collection. Our proposed method utilizes the inherent “wisdom” that exists in digitized family trees. Therefore, in the first phase, we use an available genealogical dataset, which includes valuable information regarding people and their ancestors, such as first and last names, nicknames, parents’ names, and more.
Preprocessing. After obtaining such a genealogical dataset, we have to clean the names it contains, in particular first and last names that include short abbreviations. For example, a person named “Aaron T Jones” was changed to “Aaron Jones” because the “T” character is an abbreviation of an unknown middle name. Therefore, we remove all name parts with fewer than three characters in order to avoid abbreviations and English honorific titles, such as Mr., Dr., Jr., etc. (Footnote 1: While short names are widely used in public, including these problematic names in the digitized family trees would damage the analysis of the evolution of names.)
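The cleaning rule above can be sketched in a few lines. This is an illustrative version only: the function name, the whitespace tokenization, and the stripping of trailing periods and commas before measuring the length are our assumptions, not details given in this study.

```python
def clean_name(full_name: str, min_length: int = 3) -> str:
    """Drop name parts shorter than min_length characters.

    Assumption: parts are separated by whitespace, and trailing or leading
    periods and commas (as in "Dr." or "Jr.,") do not count toward the length.
    """
    kept = [part for part in full_name.split()
            if len(part.strip(".,")) >= min_length]
    return " ".join(kept)
```

For instance, `clean_name("Aaron T Jones")` yields "Aaron Jones", matching the example in the text. Note that a threshold of three characters also removes genuine short names such as "Ed", which is the trade-off acknowledged in the footnote.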
Constructing Digitized Family Trees. Using the cleaned genealogical dataset, we construct digitized family trees as a giant graph by linking child and parent profiles to each other. Namely, we construct a directed graph $G_T := \langle V, E \rangle$, where $V$ is a set of profiles in the cleaned genealogical dataset and $E$ is a set of links between profiles, where each link, $e := (u, v) \in E$, connects two profiles $u, v \in V$, where $v$ is a parent of $u$ (see Figure 3). At the end of this step, a large graph with millions of vertices and links is created.
Constructing a Name-Based Graph. By using $G_T$, we create a new weighted graph in which each vertex is a first name, each link connects the first names of a parent and his/her children, and each link’s weight is the number of times links between two exact first names exist in $G_T$. To reduce the size of the graph, we only establish links between two vertices where the “distance” between their names is small. Namely, we create a first name graph $G_N := \langle V_N, E_N \rangle$, where $V_N$ is a set of vertices defined as follows: $V_N := \{ f(v) \mid v \in V \}$, and $f(v)$ is defined to be the first name of a profile $v \in V$. Additionally, we define $E_N$ to be the following set of links $E_N := \{ (n_1, n_2) \}$, where $(u, v) \in E$ and $n_1, n_2 \in V_N$, $n_1 = f(u)$, and $n_2 = f(v)$. Moreover, we define $w(n_1, n_2)$ to be equal to the following: $w(n_1, n_2) := |\{ (u, v) \in E \mid f(u) = n_1, f(v) = n_2 \}|$, i.e., the number of times two exact first names exist in $G_T$. Lastly, we remove links between two vertices if their names are too far apart.
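The construction of the weighted name-based graph can be sketched as follows. This is a minimal illustration under stated assumptions: the input is a plain list of (child first name, parent first name) pairs, one per child-parent link; the graph is stored as a nested dictionary; and the distance function and its threshold are passed in by the caller. The function name and parameters are hypothetical.

```python
from collections import Counter, defaultdict


def build_name_graph(child_parent_names, distance=None, max_distance=3):
    """Build a weighted first-name graph from child/parent name pairs.

    child_parent_names: iterable of (child_first_name, parent_first_name),
        one pair per child-parent link in the family-tree graph.
    distance: optional function (name, name) -> number; when given, a link
        is kept only if 1 <= distance <= max_distance.
    Returns {child_name: {parent_name: weight}}.
    """
    # The weight of a link is the number of times the exact pair occurs.
    weights = Counter(child_parent_names)
    graph = defaultdict(dict)
    for (child, parent), count in weights.items():
        if distance is not None:
            d = distance(child, parent)
            if d < 1 or d > max_distance:
                continue  # identical names or names that are too far apart
        graph[child][parent] = count
    return dict(graph)
```

Counting pairs first and filtering afterward keeps the two steps of the text separate: the weights reflect how often a pair occurs in the family trees, while the distance filter only decides which links remain in the reduced graph.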
Name Suggestion. By using $G_N$, we suggest similar names as follows (see Figure 4): Given the first name $n$, we search for $n \in V_N$. In case the given name does not exist in $G_N$, we retrieve no similar names. In case $n \in V_N$, we search for candidates to be similar names using the following algorithm: First, we traverse $G_N$ using breadth-first search (BFS) starting from $n$. This means that in the first iteration, we pass over all the neighbors that are directly connected to $n$. Next, in the second iteration, we pass over the neighbors of $n$’s neighbors, and so forth. After passing over all the reachable vertices from $n$ (defined as $R(n)$), we provide a score to each reachable vertex $r \in R(n)$, according to the predefined name similarity scoring function $S(n, r)$, which measures how similar each reachable vertex is to $n$. Usually, for each $r \in R(n)$, $S(n, r)$ will take into account the distance between $n$ and $r$ in $G_N$, as well as the string and phonetic similarity between $n$ and $r$. Lastly, we sort all the vertices and suggest as similar names the top-$k$ reachable vertices in $R(n)$ which received the highest scores.
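The suggestion phase described above can be sketched as a depth-limited BFS followed by a ranking step. This is an illustration under stated assumptions: the graph is given as an adjacency dictionary in which every vertex appears as a key, the scoring function is supplied by the caller and receives the graph distance as a third argument, and the function names and the example score are hypothetical rather than the ordering functions used in this study.

```python
from collections import deque


def reachable_with_depth(graph, start, max_depth=3):
    """BFS from start; return {vertex: depth} for vertices within max_depth."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depths[node] == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbor in graph.get(node, ()):
            if neighbor not in depths:
                depths[neighbor] = depths[node] + 1
                queue.append(neighbor)
    del depths[start]  # the query name is not a candidate for itself
    return depths


def suggest_names(graph, name, score, k=10, max_depth=3):
    """Return the top-k reachable names, ranked by score (higher is better).

    score: function (query, candidate, depth) -> number.
    """
    if name not in graph:
        return []  # unknown name: no suggestions
    depths = reachable_with_depth(graph, name, max_depth)
    ranked = sorted(depths, key=lambda c: score(name, c, depths[c]),
                    reverse=True)
    return ranked[:k]
```

As a usage example, a score such as `lambda query, candidate, depth: 1 / depth` ranks candidates purely by their proximity in the graph; a fuller scoring function would also combine string and phonetic similarity, as described in the text.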
4 Data Description
In this study, to evaluate our proposed algorithm we used the WikiTree and Behind the Name datasets. In the following subsections, we describe each dataset:
4.1 WikiTree Dataset
According to the proposed method, we have to use a genealogy dataset to utilize the inherent knowledge which exists in these historical records. Therefore, for evaluation, we used the online and open genealogical records obtained from the WikiTree website [Wikitree_dump]. WikiTree is an online genealogical website that was founded in 2008 by Chris Whitten [wikitree]. The main goal of WikiTree is to provide an accurate single family tree, built from genealogical sources, that makes genealogy free and accessible worldwide. As of September 2019, WikiTree had over 641,000 registered users and maintained over 21 million ancestral profiles [wikitree]. Many of these profiles contain specific details about each individual, such as full name, nickname, gender, birth and death dates, children’s profiles, etc. The WikiTree dump we worked with includes more than 17 million profiles and more than 250,000 unique first names.
4.2 Behind the Name Dataset
In order to estimate the performance of the proposed method and compare other methods to ours, we had to obtain a ground truth dataset. Therefore, we created the following ground truth dataset by combining the information included in WikiTree dataset with the data existing in the Behind the Name website [behindthename]. This website was founded in 1996 by Mike Campbell to study aspects of given names [behindthename_info]. It holds many given names from all cultures and periods, as well as mythological and fictional names. Currently, it includes 22,263 names.
The creation of the ground truth dataset was performed as follows: First, we extracted all the distinct first names that exist in the WikiTree dataset with a length greater than two letters, to avoid English honorific titles. Among more than 17 million profiles, we extracted 250,039 unique first names. Using the public application programming interface (API) exposed by Behind the Name, we collected related names for the distinct first names. For example, for the given name of Ed, we collected Eddie, Edgar, Edward, Ned, Teddy, etc. [ed_behindthename]. For the given name of Elisabeth, we retrieved Eli, Elisa, Ella, Elsa, Lisa, Liz, and so on [elisabeth_behindthename]. In total, related names were found for 7,399 of the distinct names, yielding 37,916 related names. The names with the largest number of synonyms were Ina, Nina, and Jan, with 127, 119, and 92 synonyms, respectively (see Figure 5). On average, 5.12 synonyms were provided per first name.
5 Experimental Setup
5.1 Setting Experimental Parameters
To evaluate our proposed name suggestion algorithm, we executed the following large-scale experiments: First, as a data source, we used the WikiTree dataset (see Section 4.1). As described above, in the preprocessing phase, we cleaned the first names by removing short abbreviations which were fewer than three characters (see Section 3). Next, we constructed digitized family trees as a large-scale graph, $G_T$, by linking the WikiTree user profiles with those of their parents. This giant graph consisted of 208,774 vertices, 3,323,554 links, and 126 connected components. Subsequently, using $G_T$, we generated an additional new weighted first name graph, $G_N$, where its vertices were first names, and each link $e := (n_1, n_2)$ connected two first names $n_1$ and $n_2$, with $w(e)$ equal to the number of links in $G_T$ that connected users with the first name of $n_1$ to their parents with the first name of $n_2$ (see Section 3). Then, we generated the graph $G'_N$ (consisting of 2,810 vertices, 9,302 links, and 214 connected components), by using $G_N$ and leaving only links between related parent and child first names with edit distance values ranging from 1 to 3. (Footnote 2: We limited the edit distance values to be smaller than or equal to 3 because we observed that names with edit distance values greater than 3 were highly different from the searched name, and in the vast majority of the cases these will not provide relevant similar name suggestions.) Afterward, we defined the following four ordering functions:
where the two arguments of each function are names, the shortest-path term is a function that retrieves the shortest path from the start vertex to the goal vertex in the name-based graph, the edit-distance term is a function that returns the minimal number of editing operations required to transform one word into the other (see Section 2.2), and the phonetic term is a function that returns the phonetic sound code of a given name.
The motivation behind the first ordering function was to take into account the similarity between the names in two dimensions: first, in the sense that the names as strings were similar, and second, that the names’ vertices were also near each other in the name-based graph. The second ordering function is similar to the first; however, it prioritizes the proximity of the names in the graph. As opposed to the first and second functions, which combine string similarity and network structure, the third function focuses on the performance of phonetic algorithms. For our phonetic algorithm, we chose Double Metaphone because this algorithm improves on Soundex and Metaphone, and it returns both a primary and a secondary code for a name, a mechanism that can assist in finding similar names. The fourth function takes into account all the factors that can assist in name suggestion: string similarity, phonetic similarity, and network structure.
5.2 Evaluation Process
We evaluated the proposed algorithms for suggesting similar names and also compared them to well-known phonetic algorithms and string similarity algorithms.
We performed the evaluation process in the following manner: First, we created a ground truth dataset of names that appeared both in the Behind the Name and in the WikiTree datasets. Each one of these names consisted of a list of related first names. Overall, our generated ground truth dataset consisted of 7,399 first names that were linked to 37,916 related names according to the Behind the Name dataset.
Second, for each one of the 7,399 first names in the ground truth dataset, we searched for the name in the filtered name-based graph. If the searched name appeared in the graph, we traversed the graph using BFS starting from the given name and collecting its neighbors up to a depth of 3. (Footnote 3: We limited the BFS’s search depth to be smaller than or equal to 3 in order to improve the search run time. Moreover, in most cases, candidate names that existed in the name-based graph at a depth higher than 3 were not good candidates to be true related names for a given name.) If the name did not appear in the graph, we moved on to the next name on the list.
Third, after obtaining these candidates, we measured the similarity between the given name and each of the candidates by ranking the retrieved names using the proposed similarity scoring functions. For example, assume that we searched for the given name “Robert” in the name-based graph. After detecting this name in the graph, we traversed from this name and collected the following candidate names: “Rob” and “Reuben.” Both were located at a depth of 1 from the given name of “Robert.” In the next phase of this example, we applied the ordering function to the given name and each of its candidates. For example, for the name “Robert,” we calculated the following:
Fourth, we sorted the candidates according to the proposed similarity score. Therefore, according to the provided example above, we retrieved “Rob” and only afterward “Reuben.”
Fifth, we evaluated the performance of the top 10 suggestions, as well as the total suggestions provided. The evaluation was carried out by comparing the suggestions against the true synonyms existing in the Behind the Name dataset. For this, we used the performance metrics of accuracy, F1, precision, and recall. Concerning the precision measure, for each given name in the Behind the Name dataset, we took the top 10 suggestions provided according to our proposed functions and calculated the metric of precision@$k$ where $k \in \{1, \dots, 10\}$. We chose to evaluate the top 10 suggestions because, as with results in any search engine, people are only willing to look at the first few tens of results [brin1998anatomy]. To understand the limitations of all metrics, we also evaluated the performance of the total suggestions provided by each metric.
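The per-name evaluation measures mentioned above can be sketched as follows. This is a minimal illustration: the function names are our own, and dividing precision@k by the number of suggestions actually returned (rather than by k) is an assumption chosen to match the description of precision as the fraction of relevant names among the names retrieved.

```python
def precision_at_k(suggested, relevant, k):
    """Fraction of the top-k suggestions that are true related names."""
    top = suggested[:k]
    if not top:
        return 0.0
    return sum(1 for name in top if name in relevant) / len(top)


def recall(suggested, relevant):
    """Fraction of the true related names that were suggested."""
    if not relevant:
        return 0.0
    return sum(1 for name in relevant if name in suggested) / len(relevant)


def f1_score(precision, recall_value):
    """Harmonic mean of precision and recall."""
    if precision + recall_value == 0:
        return 0.0
    return 2 * precision * recall_value / (precision + recall_value)


# Toy data for illustration only (not taken from the evaluated datasets).
suggested = ["Eddie", "Eduardo", "Edward", "Teddy"]
relevant = {"Eddie", "Edgar", "Edward", "Ned", "Teddy"}
p4 = precision_at_k(suggested, relevant, 4)  # 3 of 4 suggestions are relevant
r = recall(suggested, relevant)              # 3 of 5 relevant names are found
```

The reported averages would then be obtained by computing these values for every name in the ground truth and taking the mean.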
Lastly, we compared our proposed approach to several other methods for suggesting similar names. Namely, we utilized five well-known phonetic algorithms, Soundex, Metaphone, Double Metaphone, NYSIIS, and Match Rating Approach, as well as two string similarity metrics, edit distance and Damerau-Levenshtein distance. The evaluation process for the phonetic algorithms was performed as follows: For each given name in the ground truth Behind the Name dataset, we calculated the phonetic code according to the given phonetic algorithm. For example, assume that the given name was “Abraham” and the selected phonetic algorithm was Soundex. Thus, we calculated the Soundex code of the name “Abraham,” which equaled “A165.” Next, we calculated the Soundex phonetic code for all the other names existing in the WikiTree dataset (more than 250,000 first names). Afterward, we chose as candidates the first names that shared the same phonetic code. To retrieve the names in a defined order, we sorted the candidates according to their edit distance from the given name (the lower the distance, the higher the similarity) and retrieved them as similar names. Therefore, we labeled this algorithm as Soundex + edit distance.
Unlike the other phonetic algorithms, which return a single sound code for a given name, Double Metaphone can return two phonetic codes (primary and secondary). Therefore, for this algorithm, we collected all the names that shared a phonetic code with the given name (whether primary or secondary) and ranked them by their edit distance from the given name.
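The primary-or-secondary matching rule can be sketched as follows. Because a full Double Metaphone encoder is too long to reproduce here, the sketch takes the codes from a small lookup table: `TOY_CODES` and its values are illustrative assumptions, not verified output of a real encoder, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


# Assumed (primary, secondary) codes for illustration only.
TOY_CODES = {
    "John": ("JN", "AN"),
    "Joan": ("JN", "AN"),
    "Jean": ("JN", "AN"),
    "Ann": ("AN", ""),
    "Mary": ("MR", ""),
}


def double_code_candidates(given, vocabulary, codes):
    """Names sharing any non-empty code with the given name, closest first."""
    given_codes = {c for c in codes[given] if c}
    matches = [
        name for name in vocabulary
        if name != given and given_codes & {c for c in codes[name] if c}
    ]
    return sorted(matches, key=lambda n: (edit_distance(given.lower(), n.lower()), n))


dm_suggestions = double_code_candidates("John", list(TOY_CODES), TOY_CODES)
```

In this toy table “Ann” is retrieved only because its primary code equals the secondary code assumed for “John”, which is the case that a single-code algorithm would not cover.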
For the string similarity algorithms (edit distance and Damerau-Levenshtein distance), we computed the given string distance between each name in the ground truth and every candidate name in the WikiTree dataset. For example, assuming that the given name was “Abraham” and the string similarity algorithm was edit distance, we calculated the edit distance between each name in the WikiTree dataset (more than 250,000 first names) and the given name “Abraham.” As candidates, we chose only the first names whose distance from the given name was between 1 and 3. We limited the edit distance to at most 3 because we observed that greater edit distances yielded names that were very different from the given name and were not useful for suggesting similar names. In the final step, we sorted the candidates according to their distance and evaluated the performance of each algorithm using the performance measures described above.
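The distance-threshold baseline can be sketched as below. The sketch uses the restricted Damerau-Levenshtein variant (optimal string alignment), which counts a swap of two adjacent characters as one edit; whether the evaluation used this variant or the unrestricted one is not stated, so this is an assumption. The function names and the small vocabulary are hypothetical.

```python
def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]


def distance_candidates(given, vocabulary, distance, low=1, high=3):
    """Names whose distance from the given name lies in [low, high], closest first."""
    scored = [(distance(given, name), name) for name in vocabulary]
    return [name for dist, name in sorted(scored) if low <= dist <= high]


# Hypothetical lowercase vocabulary standing in for the WikiTree first names.
vocabulary = ["abraham", "abrahm", "abarham", "ibrahim", "bram", "john"]
near = distance_candidates("abraham", vocabulary, osa_distance)
```

The lower bound of 1 drops the given name itself, and the upper bound of 3 drops “john”, which is six edits away from “abraham”.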
In this section, we present the results obtained from the experiment described in Section 5. First, all the evaluated methods suggested, on average, about ten similar names per given name. The methods giving the highest number of suggestions were the string similarity metrics, edit distance and Damerau-Levenshtein distance, with 9.968 and 9.965 suggested names per given name, respectively. They were followed by the phonetic algorithms Soundex, Double Metaphone, and Metaphone, which suggested 9.922, 9.617, and 9.567 first names per given name, respectively. Our constructed name-based graph provided 9.217 suggestions per given name (see Table 1).
Among the suggestions provided by each algorithm, we calculated how many were relevant. The highest average number of relevant suggestions came from the name-based graph combined with the best of our ordering functions: among ten suggestions, it provided almost one relevant similar name (0.88) per given name. Two of our other ordering functions took second and third places, with 0.785 and 0.748 relevant similar names per given name, respectively. The algorithms that provided the lowest average number of relevant similar names were the Match Rating Approach + edit distance, edit distance, and Damerau-Levenshtein distance, with 0.432, 0.444, and 0.46, respectively (see Table 1).
In addition, among the 7,399 first names that had synonyms, we checked the number of given names for which each algorithm was able to suggest similar names. The algorithms that gave suggestions for the highest number of given names were the string similarity algorithms, edit distance and Damerau-Levenshtein distance, with 7,396 given names each. After them, the phonetic algorithms suggested similar names for between 6,267 and 6,618 given names (see Table 1). The algorithms that covered the fewest given names were our name-based graph algorithms, which gave suggestions for only 1,265 given names.
We also measured the success rate of each algorithm: the number of given names for which at least one suggestion was relevant, divided by the number of given names for which the algorithm suggested similar names. The highest success rate, 50.12%, was obtained by the name-based graph with the best of our ordering functions. Two of our other ordering functions obtained success rates of 47.59% and 46%, respectively. Among the phonetic algorithms, Metaphone + edit distance and Soundex + edit distance reached the highest success rates of 44% and 43.03%, respectively (see Table 1).
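As a worked check of the success rate, the following sketch recomputes the percentages for two phonetic baselines from the counts reported in Table 1. It reads the success rate as the share of given names with at least one relevant suggestion among the given names that received any suggestion, which is the reading consistent with the table columns. The function name `success_rate` is hypothetical.

```python
def success_rate(names_with_relevant, names_with_suggestions):
    """Percentage of served given names that got at least one relevant suggestion."""
    if names_with_suggestions == 0:
        return 0.0
    return 100.0 * names_with_relevant / names_with_suggestions


# Counts taken from Table 1 (top 10 suggestions).
soundex_rate = round(success_rate(2848, 6618), 2)    # Soundex + ED
metaphone_rate = round(success_rate(2876, 6537), 2)  # Metaphone + ED
```

Both values round to the percentages listed in the last column of Table 1 (43.03 and 44).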
Regarding performance, we obtained the following results. With respect to the accuracy and F1 measures, all four of our proposed algorithms obtained the highest scores among the eleven algorithms. The highest accuracy score, 0.096, was obtained by the best of our ordering functions, followed by two others with 0.086 and 0.083, respectively. The phonetic algorithms obtained an accuracy of about 0.06, while the lowest accuracy scores came from the string similarity algorithms, edit distance and Damerau-Levenshtein distance, with 0.045 and 0.046, respectively. The highest F1 score, 0.152, was also obtained by one of our ordering functions. Here too, the lowest F1 scores were obtained by the edit distance and Damerau-Levenshtein distance, with 0.078 and 0.08, respectively (see Table 2).
With respect to precision, the four ordering algorithms we proposed provided the highest precision scores for all the evaluated values of k (see Figure 6). The best of them obtained an average precision@1 score of 0.272, and the next two obtained 0.237 and 0.221, respectively. The phonetic algorithms obtained scores of around 0.1. The lowest scores were obtained by the string similarity metrics, edit distance and Damerau-Levenshtein distance, with an average precision@1 score of 0.071 (see Table 2).
With regard to the recall measure, the method that obtained the highest score was Double Metaphone + edit distance, with a recall score of 0.221. In second and third places were the phonetic algorithms Soundex and Metaphone, each with a recall score of about 0.21. Our algorithms reached an average recall score of approximately 0.15. The lowest recall scores are reported in Table 2.
|Method||#Relevant Suggested||#Suggested||#At Least one Relevant||#Given Names’ Suggestions||Percent|
|Soundex + ED||0.592||9.922||2,848||6,618||43.03|
|Metaphone + ED||0.617||9.567||2,876||6,537||44|
|DMetaphone + ED||0.642||9.617||2,814||6,550||42.96|
|NYSIIS + ED||0.487||8.594||2,360||6,322||37.33|
|MRA + ED||0.432||8.345||2,087||6,267||33.33|
|Soundex + ED||0.06||0.102||0.101||0.096||0.092||0.08||0.06||0.208|
|Metaphone + ED||0.066||0.11||0.107||0.1||0.097||0.086||0.066||0.209|
|DMetaphone + ED||0.068||0.112||0.107||0.102||0.098||0.088||0.068||0.221|
|NYSIIS + ED||0.064||0.11||0.105||0.093||0.087||0.079||0.064||0.163|
|MRA + ED||0.058||0.0919||0.093||0.086||0.082||0.073||0.058||0.144|
To understand the limitations of our proposed algorithms, we also evaluated the overall performance of all the algorithms. That is, we analyzed all the suggestions provided by each algorithm, not just the top 10. With respect to the average number of suggestions, the string similarity algorithms, edit distance and Damerau-Levenshtein distance, suggested the highest number of similar names (about 3,000 similar names per given name). The name-based graph, under all four ordering functions, suggested about 700 similar names per given name. The phonetic algorithms provided between about 37 and 172 similar names per given name (see Table 3).
Regarding finding relevant suggestions, the algorithms that provided the highest average number of relevant suggestions were again the edit distance and Damerau-Levenshtein distance, each with about 2.4 relevant similar names per given name. After them was the name-based graph with 2.01 relevant similar names per given name (see Table 3).
With respect to the overall accuracy, precision, and F1, all the methods performed poorly. Regarding recall, the string similarity algorithms, edit distance and Damerau-Levenshtein distance, obtained the highest scores of approximately 0.58 (see Table 4).
|Method||#Relevant Suggested||#Suggested||#At Least One Relevant||#Given Names’ Suggestions||Percent|
|Soundex + ED||1.21||172.394||3,734||6,618||56.42|
|Metaphone + ED||1.164||142.385||3,540||6,537||54.15|
|DMetaphone + ED||1.37||153.11||3,561||6,550||54.37|
|NYSIIS + ED||0.73||54.65||2,776||6,322||43.91|
|MRA + ED||0.633||36.68||2,431||6,267||38.79|
|Soundex + ED||0.011||0.02||0.011||0.313|
|Metaphone + ED||0.019||0.033||0.019||0.297|
|DMetaphone + ED||0.019||0.033||0.019||0.333|
|NYSIIS + ED||0.039||0.061||0.039||0.204|
|MRA + ED||0.038||0.059||0.038||0.176|
Upon analyzing the results presented in Section 6, we can conclude the following: First, suggesting similar names based on a name-based graph derived from a genealogical dataset is superior to other well-known algorithms, such as phonetic encoding and string similarity algorithms (see Table 2). Observing the top 10 suggestions for similar names per given name in the ground truth, we can see that the suggestions provided by the constructed name-based graph and sorted according to the four proposed ordering functions score significantly higher than those of the other algorithms (found statistically significant using t-tests) in terms of accuracy, F1, and precision, but not recall (see Table 2). For example, similar name retrieval using the name-based graph with one of our ordering functions obtained an average F1 of 0.152, as opposed to Double Metaphone + edit distance, which obtained an average F1 score of 0.112.
Second, all four proposed algorithms reached the top four places with respect to the performance measures of accuracy, F1, and precision (see Table 2). Only after them, with a significant difference, come the other algorithms, such as the phonetic and string similarity algorithms (e.g., Double Metaphone + edit distance).
Third, concerning particular information retrieval metrics, we measured precision@k for the provided top 10 suggestions. The best performance was obtained at precision@1 (see Figure 6). The ordering function that reached the highest score, with an average precision of 0.272 (see Table 2), took into account the name-based graph structure, the string similarity between the given name and its neighbors, and the string similarity between the sound codes of the given name and its neighbors. In second place, with an average precision of 0.237, was the function that took into account the graph structure and the string similarity between the given name and its neighbors. In third place, with an average precision of 0.221, was the function that was affected more by the position of the suggested name in the graph. Finally, the function that ranked the names solely by the string similarity between the sound codes reached an average precision@1 of 0.114. These results emphasize the effectiveness of our generic approach for suggesting similar names: all four suggested ordering functions, each of which can be replaced by any other ordering function for suggesting names for a given name, reached the highest scores. We can also conclude that the ordering of the candidates suggested from the constructed name-based graph is very significant. Moreover, these results emphasize the importance of a variety of parameters, such as the graph structure, the string similarity, and the phonetic similarity, when suggesting similar names from the name-based graph.
Fourth, alongside the strong performance of the proposed approach with respect to the accuracy, F1, and precision metrics, the name-based graph provided low recall scores (0.15, 0.139, 0.129, and 0.165), as opposed to Double Metaphone + edit distance, which obtained the highest recall score of 0.221. The recall measure estimates how many of the true synonyms of a given name appear among the suggested names. According to the obtained results, the name-based graph provided more precise similar names than any other algorithm. However, when the given name does not exist in the graph, our proposed method returns nothing. This behavior explains the low recall of all the proposed functions derived from the name-based graph.
Fifth, despite the significantly high performance of the proposed generic approach for suggesting similar names, the approach also has a limitation. This limitation is visible in the number of given names covered: The proposed algorithm suggested similar names for 1,265 given names, as opposed to the string similarity and phonetic algorithms, which suggested similar names for 6 and 7 times more given names (see Table 1). The reason is that, as a first step to suggesting similar names for a given name, our approach must find the given name in the generated name-based graph. If the given name does not exist in the graph, we cannot retrieve similar names. Thus, constructing a new name-based graph from a larger genealogical dataset, or increasing the criteria for generating the name-based graph, will result in a larger name-based graph with a larger number of names, but also in higher run times. Moreover, our algorithm is generic; therefore, using a larger family tree dataset with more profiles will likely provide more accurate results with higher recall rates.
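The lookup step behind this limitation can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the graph is a plain adjacency dictionary, only direct neighbors are taken as candidates, and plain edit distance stands in for the ordering function. The names `suggest_from_graph`, `toy_graph`, and the graph contents are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def suggest_from_graph(graph, given, order_key, k=10):
    """Top-k neighbors of the given name, ranked by a pluggable ordering function."""
    if given not in graph:
        return []  # the name is absent from the graph, so nothing can be suggested
    neighbors = graph[given]
    return sorted(neighbors, key=lambda n: (order_key(given, n), n))[:k]


# Hypothetical name-based graph: each name maps to the names it is linked to.
toy_graph = {
    "john": {"jon", "johan", "jack"},
    "jon": {"john"},
    "johan": {"john"},
    "jack": {"john"},
}

found = suggest_from_graph(toy_graph, "john", edit_distance)
missing = suggest_from_graph(toy_graph, "abraham", edit_distance)
```

The empty result for “abraham” mirrors the behavior described above: coverage depends entirely on which names the graph contains, while `order_key` can be swapped for any other ordering function without touching the lookup.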
Sixth, with respect to the name suggestions provided, we also checked how many of them were relevant for each algorithm. Our four proposed algorithms found the highest number of relevant similar names compared to the other algorithms. Among the top 10 suggestions for a given name, our graph-based ordering functions succeeded in finding almost one suggestion that is a true synonym of the given name (up to 0.88 on average). The best phonetic algorithm obtained 0.642 relevant suggestions on average. The lowest numbers of relevant suggestions, on average, were obtained by the string similarity algorithms (approximately 0.45 relevant suggestions). These results emphasize that suggesting similar names according to our proposed approach is more effective than the other algorithms.
Seventh, regarding the overall results and not just the top 10 results, we can conclude that all the methods provided poor results when every suggestion was analyzed. The best performance was obtained by NYSIIS + edit distance, with an average accuracy, F1, and precision of 0.039, 0.061, and 0.039, respectively. The best recall was obtained by the edit distance, with an average score of 0.59. According to the results obtained, the proposed approach works well only when a small number of candidates (at most 10) needs to be suggested as similar names for a given name. This fits the goal of the task, which belongs to the information retrieval domain, where people examine only the top results and not all the results.
Eighth, it is important to note that our proposed approach is generic and can be used both as a standalone algorithm and to improve other algorithms. It adds more dimensions to the suggestion of names compared to other algorithms, such as Soundex and the edit distance, which base their suggestions solely on phonetics or string similarity.
8 Conclusion & Future Work
This paper addresses the acute problem of attaining accurate results when searching for a person’s name online. To ease this problem, we present a novel and generic approach for suggesting similar names based on a name-based graph constructed from digitized family trees. In our approach, we constructed digitized family trees based on more than 17 million people who exist in the WikiTree dataset. After creating the digitized family trees, we constructed a name-based graph based on parents and children who share similar names according to string similarity algorithms. Using this graph as well as four proposed algorithms, we suggested similar names for each given name in the ground truth. To compare the results obtained, we evaluated the performance of seven other well-known phonetic and string similarity algorithms. We concluded the following: First, we found that our proposed algorithms performed significantly higher ( 2 times higher than any other algorithm) in terms of accuracy, F1, and precision (see Section 6).
Second, the proposed approach was also superior (analyzed using t-tests) when analyzing precision@k. We also analyzed the overall performance of each method and concluded that our proposed algorithms performed well only for small values of k, while their performance dropped when all the results were analyzed. However, our approach is data-driven; therefore, using larger family trees is expected to increase the overall performance.
Third, in our approach, the process that determines the order of the candidates, once they have been collected, is generic. This means that any ordering algorithm can be assigned to improve the suggested names by combining several approaches, as opposed to phonetic and string similarity algorithms, which base their suggestions on a single domain.
Our research currently considers suggestions for first names using name-based networks and digitized family trees. A possible future research direction is to examine the proposed method on other elements, such as last names, nicknames, etc. Also, the sorting functions can be configured to assist in detecting aliases, or machine learning techniques could be used to improve the name suggestions.
The authors would like to thank Carol Teegarden for proofreading this article, and the icons8 website (https://icons8.com) for their beautiful icons.