How Does That Sound? Multi-Language SpokenName2Vec Algorithm Using Speech Generation and Deep Learning

05/24/2020 ∙ by Aviad Elyashar, et al. ∙ Ben-Gurion University of the Negev

Searching for information about a specific person is an online activity frequently performed by many users. In most cases, users rely on queries that contain a name, which they submit to Web search engines to find what they are looking for. Typically, Web search engines provide just a few accurate results associated with a name-containing query. Currently, most solutions for suggesting synonyms in online search are based on pattern matching and phonetic encoding; however, the performance of such solutions is often less than optimal. In this paper, we propose SpokenName2Vec, a novel and generic approach which addresses the similar name suggestion problem by utilizing automated speech generation and deep learning to produce spoken name embeddings. These embeddings capture the way people pronounce names in any language and accent. Utilizing the name pronunciation can be helpful for both differentiating and detecting names that sound alike but are written differently. The proposed approach was demonstrated on a large-scale dataset consisting of 250,000 forenames and evaluated using a machine learning classifier and 7,399 names with their verified synonyms. The performance of the proposed approach was found to be superior to 12 other algorithms evaluated in this study, including widely used phonetic and string similarity algorithms, and two recently proposed algorithms. The results obtained suggest that the proposed approach could serve as a useful and valuable tool for solving the similar name suggestion problem.


1 Introduction

In information systems, searching for a username is a frequently performed activity [yang2006web]; for example, retrieving a patient’s electronic medical record from a medical records system [pfeifer1996retrieval], and searching for a research paper by the author’s name or a news article by a journalist’s name are daily tasks performed using individuals’ names. Names are also the focus of online search, and individuals’ reliance on names, as reflected in search engine queries, is steadily increasing. For example, in 2004, 30% of all search engine queries provided by users included personal names [guha2004disambiguating]. A decade later, in 2014, one billion names were used in Google search engine queries each day [GoogleYourself].

While the use of personal names in online search has increased, the results retrieved from Web search engines have not kept pace [jansen2000real]. Leading online search engines retrieve sub-optimal results in response to searches for a person’s name [spink2002sex]. These poor results created a new customer need [organicweb] which has been fulfilled by companies, such as Pipl (https://pipl.com/) and ZoomInfo (https://www.zoominfo.com/), which have dedicated their efforts towards providing information about specific people. Despite these new services, in many cases, users experience difficulty when selecting the exact name to search for or the correct form when formulating a name-containing query. Therefore, searching for people by name online remains a challenging problem.

There are several reasons for the poor search engine performance for queries containing names. First, unlike words, which, in most cases, have a single correct spelling, there are several legitimate variations for a given name [christen2006comparison]. Second, there are cases in which a name changes over time due to the use of a nickname, marriage, religious conversion (e.g., from Lewis Alcindor Jr. to Kareem Abdul Jabbar), or gender reassignment. Third, many names are heavily influenced by a person’s cultural background [christen2006comparison]. For example, the English forename of Anthony has several variations in other languages: Antoine (French), Antonius (Ancient Roman), Anton (Russian), and Antonio (Spanish) [anthony_behindthename]. The detection of aliases for people also poses a challenge; for instance, the nickname of Kobe Bryant, the famous basketball player, is the “Black Mamba.” Therefore, finding a match for a name is more difficult than it is for general text [borgman1992getty].

Today, techniques used for name matching and the retrieval of similar names are mainly based on pattern matching and phonetic encoding [christen2006comparison]. For example, in the context of names, phonetic encoding algorithms (e.g., Soundex) encode a given name into a plain-text code that reflects the way people pronounce the name. This plain-text code assists in finding similar names in cases in which the codes of two different names are identical (e.g., Smith and Smyt). However, the performance of these algorithms has been poor [friedman1992tolerating].

In recent decades, there has been a data science revolution resulting in the development of products and services that utilize machine and deep learning algorithms to help people in various aspects of modern life, for example, searching for information on the Internet, filtering spam email, and image recognition [grapov2018rise]. These advanced algorithms, which are capable of learning from a large set of examples, were found to be much more effective and robust than those designed by explicitly specifying rules [gulshan2016development]. For example, Word2Vec [word2vec_paper] is a deep learning-based model that utilizes large-scale text to transform words into continuous vector space representations (also known as word embeddings). These fixed-dimensional vector representations were found to have semantic meaning, which can be used for many natural language processing (NLP) tasks, such as text classification, word similarity, and more.

Inspired by Word2Vec, we propose a novel and generic approach that leverages the power of human speech and deep learning to address several issues associated with names, such as similar name suggestion and record linkage. The proposed SpokenName2Vec approach is an innovative multi-language framework that uses names, languages, accents, and automated speech generation to produce spoken name embeddings. These novel embeddings capture linguistic and acoustic content, which is used to detect names that sound alike. In contrast to phonetic encoding algorithms, such as Soundex and Double Metaphone, which represent names with plain-text code, the proposed approach utilizes neural networks to create more advanced name representations, viewed as fixed-length vectors. The continuous vector space representation for names is based on the way humans pronounce names in any language, with any accent (e.g., American and British English). To the best of our knowledge, we are the first to represent names using spoken name embeddings.

In this paper, we demonstrate the proposed approach on the task of suggesting synonyms associated with a given name, a common task required of search engines today. The SpokenName2Vec approach consists of five phases: (1) the name collection phase, in which we collect names; (2) the speech segment generation phase, in which we generate spoken names based on the given name, targeted language, and accent; (3) the feature extraction phase, in which we extract audio features which serve as a continuous vector space representation for each name; (4) the classification phase, in which a machine learning classifier is used to classify candidates that sound like the given name; and (5) the name suggestion phase, in which candidates are filtered according to a predefined threshold (the remaining candidates serve as synonyms for the given name).

In our evaluation, the performance of the proposed algorithm is compared with the performance of other state-of-the-art machine and deep learning algorithms. The performance was evaluated using the Behind the Name dataset with over 7,300 forenames and over 37,000 synonyms. We show that the SpokenName2Vec algorithm outperforms all other algorithms evaluated, including commonly used phonetic encoding and string similarity algorithms, as well as novel approaches suggested more recently (e.g., the graph-based names [elyashar2019runs] and Name2Vec [foxcroft2019name2vec] algorithms), in terms of the average accuracy, F1, precision@5, and precision@10 measures. For example, SpokenName2Vec trained on spoken names in the Italian accent obtained an average accuracy score of 0.151, in contrast to the graph-based names and Double Metaphone algorithms, which obtained scores of 0.096 and 0.068, respectively.

The remainder of this paper is organized as follows: In Section 2, we provide a brief overview of related work focused on issues similar to those addressed in this study. Section 3 presents the SpokenName2Vec framework. We provide a detailed description of the datasets used in this study in Section 4. In Section 5, we review the experimental setup, and in Section 6, we present the performance (for the task of suggesting synonyms) of the proposed algorithm and other algorithms evaluated. In Section 7, we discuss the results obtained, and our conclusions and future directions are provided in Section 8.

2 Background

In the subsections that follow, we provide the necessary background on this study and review related work. In Section 2.1, we provide a brief background related to speech, including existing mechanisms for generating speech automatically, and present previous representations for speech and names. Our proposed spoken name embedding relies on the extraction of audio features from audio segments, and Section 2.2 presents the mechanism used for extracting this embedding. Then, we provide a brief overview of a few well-known string similarity algorithms (Section 2.3) and phonetic algorithms (Section 2.4) that our proposed algorithm is compared to when evaluating the performance. Lastly, in Section 2.5, we review previous studies that focused on suggesting similar names associated with a given name.

2.1 Speech and Name Representation

In this paper, we propose a novel representation for names, which uses automated speech and deep learning, to deal with problems associated with names, such as similar name suggestion [elyashar2019runs] and record linkage [foxcroft2019name2vec]. Most of the well-known approaches confronting these problems emphasize character or word similarities (e.g., the edit distance string similarity algorithm). In contrast to these approaches, SpokenName2Vec addresses these problems by utilizing the power of speech to find similar names. In addition to the use of speech for conveying ideas and expressing feelings [tiwari2012voice], sound has been found helpful for other tasks, such as voice recognition [klevans1997voice], speaker recognition [jayamaha2008voizlock], analyzing human behavior [lepine1998predicting], Internet communication [goode2002voice], name suggestion [hall1980approximate], and more. Often there are several variations of names (e.g., Smith and Smyt), which are written differently but pronounced the same. Focusing on the way names are pronounced instead of how they are written can be a salient advantage for the detection of similar names. For this, we use open source and publicly available services for generating automated speech, e.g., the Text2Speech website (https://www.text2speech.org/), Google Text-to-Speech (https://cloud.google.com/text-to-speech), and many others.

The data science revolution of the last decade has resulted in the development of many products that use machine and deep learning algorithms, including products for spam filtering, image recognition, and more. Among the pioneers of these algorithms were Mikolov et al. [word2vec_paper], who in 2013 introduced the Word2Vec architecture for word embedding. Word2Vec is a general term encompassing two representation learning models: continuous bag of words (CBOW) and skip-gram. Both models are simple feed-forward neural network architectures that are used for computing continuous vector representations of words from very large datasets. The vector representations of words learned by Word2Vec were found to be promising for carrying semantic meanings, a trait that is useful for various natural language processing (NLP) tasks, such as text classification [lilleberg2015support], information retrieval [ganguly2015word], etc. In 2014, Le and Mikolov [le2014distributed] extended the Word2Vec methodology and suggested Doc2Vec, a fixed-dimensional vector representation for sentences and documents using a paragraph vector. This additional vector remembers the context or the topic of each paragraph, which was shown to be useful for capturing the semantics of paragraphs, sentences, and documents. In recent years, many researchers, inspired by the novel Word2Vec, have suggested utilizing the power of representation learning on various domains that are not necessarily related to NLP. Examples of the models proposed include Node2Vec [grover2016node2vec], App2Vec [ma2016app2vec], Song2Vec [rosssong2vec], Emoji2Vec [eisner2016emoji2vec], and more.

In 2018, Chung and Glass [chung2018speech2vec] proposed Speech2Vec, a speech version of Word2Vec. To learn the Speech2Vec embeddings, they trained their model on LibriSpeech, a corpus of 500 hours of read English speech. They compared their model with the classic Word2Vec algorithm on word similarity tasks. Later that year, Chung et al. [weng2018towards] tested the Speech2Vec models on the task of speech-to-text translation. In 2019, Haque et al. [haque2019audio] proposed spoken sentence embeddings. Their results demonstrated that the proposed spoken sentence embeddings outperformed phoneme and word-level baselines on speech and emotion recognition tasks. In the same year, Foxcroft et al. [foxcroft2019name2vec] presented Name2Vec, a method for name embeddings that employs the Doc2Vec methodology, where each surname is viewed as a document, and each letter constructing the name is considered a word. They demonstrated the task of record linkage by training a few name embedding models on a dataset containing 250,000 surnames and tested their model on 25,000 verified name pairs from Ancestry.com. They used the Records dataset as positive samples and another 25,000 random name pairs as negative samples. The authors concluded that the name embeddings generated can predict whether a pair of names match.

2.2 Audio Feature Extraction

In the feature extraction phase, in order to analyze the audio data obtained by generating spoken names and produce spoken name embeddings, we extract audio features using open source frameworks that specialize in extracting features from audio files. Such frameworks are mainly used for tasks like audio event recognition and surveillance, speech recognition, and music information retrieval [giannakopoulos2015pyaudioanalysis]; examples of libraries and frameworks for this include Yaafe (http://yaafe.sourceforge.net/), librosa (https://github.com/librosa/librosa), PyCASP (https://github.com/egonina/pycasp), Bob (http://idiap.github.io/bob/), pyAudioAnalysis [giannakopoulos2015pyaudioanalysis], and Turi Create’s sound classifier [sound_classifier].

In this study, we extract audio features using two frameworks: the Turi Create sound classifier and pyAudioAnalysis. With Turi Create [sound_classifier], this phase includes the following signal processing steps to transform the audio segments into convenient data for use as neural network input: First, the raw audio signals are converted into a series of digital numbers (between -1 and 1) using pulse code modulation (PCM) [shorter1972application]. All of the signals are re-sampled to 16,000 samples per second. The data is then divided into several overlapping windows. To each window, the Hamming window, a mathematical function that is zero-valued outside of some chosen interval, is applied; this window function is widely used in digital signal processing applications [podder2014comparative]. The power spectrum is calculated using the fast Fourier transform, and finally Mel frequency filter banks are applied and the natural logarithm of all of the values is used as features.
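The signal processing steps above can be illustrated with a minimal NumPy sketch. It is not Turi Create's implementation: the frame length (400 samples), hop (160 samples), FFT size (512), and number of filters (40) are assumed values chosen for illustration, the function names are hypothetical, and the input is assumed to be already scaled to [-1, 1] and sampled at 16,000 samples per second.

```python
import numpy as np


def hz_to_mel(hz):
    # Common Mel-scale formula (an assumption; several variants exist).
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_bank(n_filters, n_fft, sample_rate):
    # Triangular filters spaced evenly on the Mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            bank[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):
            bank[i - 1, k] = (right - k) / (right - center)
    return bank


def log_mel_features(signal, sample_rate=16000, frame_len=400, hop=160,
                     n_fft=512, n_filters=40):
    signal = np.asarray(signal, dtype=np.float64)
    if len(signal) < frame_len:
        signal = np.pad(signal, (0, frame_len - len(signal)))
    # Split the signal into overlapping windows.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    # Apply the Hamming window to every frame.
    frames = signal[idx] * np.hamming(frame_len)
    # Power spectrum via the fast Fourier transform.
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2 / n_fft
    # Mel filter banks, then the natural logarithm (small offset avoids log(0)).
    mel_energy = power @ mel_filter_bank(n_filters, n_fft, sample_rate).T
    return np.log(mel_energy + 1e-10)
```

Each row of the returned array describes one window, so a spoken name yields a matrix with one row per window and one column per Mel filter.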

The pyAudioAnalysis framework was implemented by Giannakopoulos [giannakopoulos2015pyaudioanalysis] in 2015. This framework includes the calculation of 11 types of audio features: zero crossing rate; energy; entropy of energy; spectral centroid, spread, entropy, flux, and rolloff; Mel-frequency cepstral coefficients (MFCCs); chroma vector; and chroma deviation.

2.3 String Similarity Algorithms

To evaluate SpokenName2Vec, we compared its results with the results of string similarity algorithms. These well-known algorithms have typically been used to match individuals or families of samples for tasks such as measuring the coverage of a decennial census or combining two databases, such as tax information and population surveys [cohen2003comparison, casanova2007database]. Such algorithms determine the similarity of two given strings by measuring the “distance” between the two strings. Two strings that are found similar by the functions are considered related. In this study, we evaluate the performance of the following string similarity functions:

Damerau-Levenshtein Distance. The Damerau-Levenshtein distance was developed in 1964 by Damerau [damerau1964technique]. This string algorithm measures the minimal number of editing operations required to transform a given word into another, where four types of operations are allowed: insertion, deletion, permutation (swapping two adjacent characters), and replacement.

Edit Distance. The edit distance, also known as the Levenshtein distance, was developed two years later by Levenshtein [levenshtein1966binary]. This string similarity algorithm measures the minimal number of operations required to transform one word into another [levenshtein1966binary]. These operations are insertions, deletions, and substitutions of a single character. For example, the edit distance between the names John and Johan is 1.
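The two distances described above can be sketched with standard dynamic programming. This is an illustrative implementation, not the code used in the study; the function names are hypothetical, and the Damerau-Levenshtein function below is the restricted "optimal string alignment" variant, in which a swap of two adjacent characters counts as one operation.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def damerau_levenshtein(a, b):
    """Edit distance that also allows swapping two adjacent characters."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
            # Adjacent transposition, e.g. "ia" <-> "ai".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(edit_distance("John", "Johan"))         # 1
print(edit_distance("Brian", "Brain"))        # 2
print(damerau_levenshtein("Brian", "Brain"))  # 1
```

The pair Brian and Brain shows where the two measures differ: the swapped letters cost two substitutions under the edit distance but a single operation under Damerau-Levenshtein.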

2.4 Phonetic Encoding Algorithms

Other algorithm families whose performance we compare to SpokenName2Vec’s performance are the phonetic encoding algorithms. These algorithms are methods that transform a given word into code according to the way the word is pronounced. These algorithms are commonly used for spelling suggestion [uzzaman2004bangla], entity matching [cohen2003comparison, peled2013entity], and searching for names in websites [khan2017application] or databases [patman2001soundex]. In this paper, we evaluate the Soundex, Metaphone, Double Metaphone, the New York State Identification and Intelligence System Phonetic Code (NYSIIS), and the match rating approach (MRA).

Soundex. Devised over a century ago by Russell and O’Dell, the Soundex algorithm is one of the first phonetic encoding techniques [hall1980approximate]. Given a name, it provides a code that reflects how it sounds when spoken. It keeps the first letter of a given name and reduces the remaining letters to three digits, yielding a code of one letter and three numbers. Vowels and the letters h and y are converted to zero. The letters b, f, p, and v are converted to one. The letters c, g, j, k, q, s, x, and z are converted to two. The letters d and t are converted to three, while m and n are converted to five. The letter l is converted to four, and r is converted to six. Zeros are then dropped, and adjacent repeated digits are collapsed into one. The final code includes the original first letter and three numbers. Codes that are generated for longer names are cut off, whereas shorter codes are extended with zeros. For example, the Soundex code for the name Robert is R163.

Metaphone. The Metaphone algorithm was developed in 1990 by Lawrence Philips [philips1990hanging]. It is an improvement over Soundex, because the words are encoded to a representation so that they can be combined into a group despite minor differences [binstock1995practical]. This algorithm assumes English phonetics and works equally well for forenames and surnames [pimpalkhute2014phonetic]. It is widely used in spell checkers, search interfaces, genealogy websites, etc. [khan2017application]. The Metaphone code for the forename Robert is RBRT.

Double Metaphone. The Double Metaphone algorithm was developed almost two decades ago by Lawrence Philips [philips2000double]. A variation of the Metaphone algorithm, Double Metaphone retrieves a code that consists solely of letters. As opposed to the previous two algorithms, Double Metaphone also attempts to encode non-English words (European and Asian names). Moreover, unlike all other phonetic algorithms, it returns two phonetic codes. For example, the Double Metaphone codes for the forename Jean are JN and AN.

NYSIIS. The New York State Identification and Intelligence System (NYSIIS) phonetic encoding algorithm also returns a code that solely consists of alphabetic letters [borgman1992getty]; however, it preserves the vowels’ positions in a given name by converting all of the vowels to the letter ‘A’ [de1986guth]. For example, the NYSIIS code for the forename Robert is RABAD.

Match Rating Approach (MRA). This phonetic encoding algorithm was developed by Gwendolyn Moore in 1977 [moore1977accessing]. The algorithm includes a small set of encoding rules, as well as a lengthier set of comparison rules. For example, the returned code for the forename Robert is RBRT.
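Of the phonetic encoders above, Soundex has the simplest rules, and they can be sketched in a few lines. The sketch below follows the common American Soundex convention rather than any specific implementation evaluated in this study; as an assumption beyond the description above, the letter w is treated like h (neither is coded, and neither separates two letters with the same digit). The names `CODES` and `soundex` are illustrative.

```python
# Digit assigned to each consonant group; vowels, h, w, and y get no digit.
CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        # Keep a digit only if it differs from the previous coded letter.
        if code and code != prev:
            digits.append(code)
        # h and w do not reset the previous digit; vowels and y do.
        if ch not in "hw":
            prev = code
    # Pad short codes with zeros and cut long codes to four characters.
    return (first.upper() + "".join(digits) + "000")[:4]


print(soundex("Robert"))  # R163
print(soundex("Smith"))   # S530
print(soundex("Smyt"))    # S530
```

Smith and Smyt receive the same code, which is how a phonetic encoder groups names that are written differently but sound alike.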

2.5 Similar Name Suggestion Algorithms

Over the last two decades, several studies have confronted the problem of similar name suggestion. In 1996, Pfeifer et al. [pfeifer1996retrieval] compared the differences in the performance of a few known phonetic similarity measures and exact match metrics for the task of improving the retrieval of names. For the evaluation process, Pfeifer et al. collected surnames manually from a few sources, such as the TREC collection [harman1992overview], the CACM collection from the SMART system [buckley1985implementation], the phonebook of the University of Dortmund, Germany, and author names from a local bibliographic database. They combined all of the surnames into the COMPLETE dataset, which includes approximately 14,000 names. They determined the queries for this dataset as follows: First, they chose 90 names randomly from the COMPLETE dataset. Second, for each of the 90 queries, they manually determined the relevant names. They reported that an information system based on phonetic similarity measures, such as Soundex and variations of phonetic algorithms, outperforms exact-match search metrics in the task of searching for synonyms.

In 2010, Bollegala et al. [bollegala2010automatic] suggested a method for extracting aliases for a given personal name based on the Web; for example, “fresh prince” is an alias of Will Smith. They proposed a lexical pattern-based approach for extracting aliases of a given name using snippets returned by a Web search engine. Then, they defined numerous ranking scores to evaluate candidate aliases using three approaches: lexical pattern frequency, word co-occurrences in an anchor text graph, and page counts on the Web. Their method outperformed numerous baselines, achieving a mean reciprocal rank of 0.67. There are a few differences between this study and ours. First, our study focuses on the task of suggesting similar names that sound like a given name, while Bollegala et al. focused on suggesting aliases. An alias, as opposed to a similar name, can be very different from a given name. For example, the aliases of the famous basketball players LeBron James and Earvin Johnson are “The King” and “Magic,” respectively.

In 2019, Elyashar et al. [elyashar2019runs] proposed a novel approach for suggesting synonyms using the construction and analysis of digitized family trees. Using a large-scale online genealogical WikiTree dataset, Elyashar et al. constructed a name-based graph derived from digitized family trees. Utilizing this very large graph, they suggested synonyms by searching for the given name in the graph and traversing from this point to collect the suggested candidates. In the next stage, they applied four ordering functions determining the order of the suggested names. Suggesting similar names based on the graph-based names derived from digitized family trees outperformed phonetic and string similarity algorithms. In contrast to this approach, which utilizes historical knowledge to detect similar names based on ancestors, the main advantage of SpokenName2Vec is its ability to detect many similar names that sound like the given name, without the need for historical data which may or may not be available.

In addition to studies aimed at developing techniques for suggesting synonyms, several companies have emerged to address the task of using names to find people online, in response to the growing need of Internet users and the poor results provided by the largest search engines [organicweb]. Among them are Pipl, which utilizes names to search for the real person behind online identities [pipl_about_us], and ZoomInfo, which provides company or organizational oriented information for a searched name. According to ZoomInfo [zoominfo_about_us], their database includes 67 million emails and 20 million company profiles.

Other free online services include PeekYou (https://www.peekyou.com/), a people search website that collects and combines content from online social networks, news sources, and blogs to help retrieve the online identity of American users, and TruePeopleSearch (https://www.truepeoplesearch.com/), which helps find people by name, phone number, or address. Websites such as TruthFinder (https://www.truthfinder.com/) and BeenVerified (https://www.beenverified.com/) provide background checking services for people. These services can help reconnect Americans with their friends and relatives, as well as provide a way to look up criminal records online.

3 Methods

In this paper, we present SpokenName2Vec, a novel and generic deep learning algorithm utilizing multi-language automated speech for various tasks related to names. In this section, we present the steps of the proposed algorithm and demonstrate its effectiveness for the task of suggesting names that are similar to a given name. Similarly to phonetic encoding algorithms (e.g., Soundex and Double Metaphone), the proposed SpokenName2Vec algorithm transforms a given name into a single representation. However, in contrast to those methods, which encode text into a simple plain-text code, the proposed algorithm generates a fixed-dimensional vector representation derived from an audio segment expressing the way people articulate a given name in a given language and accent. This results in a deep neural network-based model, which takes into account the given name, as well as the language and accent. This model is much more sophisticated, and its ability to detect names that are written differently but sound alike is notable, particularly in comparison to other algorithms.

3.1 Multi-Language SpokenName2Vec

The proposed method consists of the following five steps (see Figure 1):

Figure 1: Overview of the algorithm’s steps.
  1. Name Collection. To produce the proposed innovative SpokenName2Vec algorithm, a dataset of names is required. Names can be obtained from genealogical websites, online social networks, other designated websites, and other services. Of course, a preprocessing step is required to remove noisy and unnecessary data from these names, such as short abbreviations, honorific titles, etc.

  2. Speech Segment Generation. After obtaining a collection of names, audio segments are generated, reflecting how humans say each name according to a given language and accent. The generation of audio segments is performed using available tools that transform the text of a given name into speech segments automatically. This step is generic, i.e., we can convert text to speech by selecting any of the languages provided by the tool used, along with its associated accent, to generate the speech segment. The speech segment generation step results in a collection of speech segments, reflecting the names collected in Step 1, and spoken according to a target language and accent.

  3. Speech Segment-Based Feature Extraction. In this step, each speech segment generated is transformed into a fixed-dimensional vector space representation using deep learning implemented by an artificial neural network-based model. This representation, which consists of several dimensions, captures linguistic and acoustic content concerning the spoken name. For this, we use state-of-the-art algorithms to transform an audio segment into a fixed-dimensional vector representation. The resulting vectors, also known as spoken name embeddings, serve as features for each of the given names; these features are used in the next steps.

  4. Name Classification. Utilizing the extracted speech segment-based features, we use supervised machine learning classifiers for suggesting synonyms associated with a given name. This step is generic and compatible with many classifiers, such as nearest neighbor-based classifiers, classifiers that apply kernel functions, and others. For a given name, this step results in a limited number of candidate names, suggested (based on the classifier’s predictions) as synonyms, as well as a confidence score associated with each candidate reflecting the candidate’s likelihood of being a correct synonym for the given name.

  5. Name Suggestion. The name suggestion step consists of two actions: filtering candidates and applying an order function. First, the candidates that sound different from the given name are filtered out by determining a threshold. The filtering action is performed using the confidence score provided by the chosen classifier. To this end, we order all of the candidates associated with the given name according to the confidence score provided, where the candidates with the highest confidence score are placed first. A threshold is then determined; as a rule of thumb, the threshold should be set such that all of the candidates the classifier is not certain about are removed. Second, we order the remaining candidates using an order function. In this step, a variety of ordering functions can be used, such as Damerau-Levenshtein, edit distance, and more.
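Steps 4 and 5 can be sketched as follows, under loud assumptions: cosine similarity between embeddings stands in for the classifier's confidence score, the embeddings are made-up two-dimensional toy vectors rather than real spoken name embeddings, and the values of `k` and `threshold` are arbitrary. The function `suggest_names` is hypothetical and only illustrates the flow of candidate retrieval, threshold filtering, and ordering by edit distance.

```python
import numpy as np


def edit_distance(a, b):
    # Compact Levenshtein distance used here as the order function.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def suggest_names(query, names, embeddings, k=10, threshold=0.8):
    # Step 4 (stand-in): score every name by cosine similarity to the query.
    vectors = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = vectors @ vectors[names.index(query)]
    # Keep at most k candidates, highest confidence first.
    ranked = [i for i in np.argsort(-scores) if names[i] != query][:k]
    # Step 5a: drop candidates whose confidence is below the threshold.
    kept = [names[i] for i in ranked if scores[i] >= threshold]
    # Step 5b: order the remaining candidates with the order function.
    return sorted(kept, key=lambda name: edit_distance(query, name))


# Toy data: hypothetical embeddings, not produced by any real model.
names = ["Smith", "Smyt", "Smythe", "Jones"]
embeddings = np.array([[1.0, 0.0], [0.99, 0.1], [0.9, 0.3], [0.0, 1.0]])
print(suggest_names("Smith", names, embeddings))  # ['Smyt', 'Smythe']
```

In this toy run, Jones is removed by the threshold, while Smyt and Smythe both pass and are then ordered by their edit distance to the query.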

4 Data Description

To evaluate the proposed algorithm, we used three datasets: the WikiTree, Spoken Name, and Behind the Name datasets. The WikiTree dataset includes names from previous generations. The Spoken Name dataset is a collection of audio segments taken from an automated speech generation process, including a name, as well as its language and accent. The Behind the Name dataset provides the ground truth for evaluating the performance of the SpokenName2Vec algorithm, as well as that of the other algorithms.

4.1 WikiTree Dataset

We used genealogical records available on the WikiTree website [Wikitree_dump]. WikiTree is an online genealogical website founded in 2008 by Chris Whitten [wikitree]. Its main aim is to provide a framework and genealogical sources for creating an accurate single family tree, making genealogy free and accessible worldwide. As of February 2020, WikiTree had over 680,000 registered users and maintained over 22 million profiles [wikitree]. Many of these profiles contain specific details about each individual, such as full name, nickname, gender, birth and death dates, children’s profiles, etc. The massive WikiTree dump we worked with includes more than 17 million profiles and over 250,000 unique first names.

4.2 Spoken Name Dataset

This dataset is a collection of audio segments (WAV files) of names pronounced by an automated text-to-speech framework. We used the Google Text-to-Speech Python library (gTTS) [gtts], which supports multiple languages and accents. For each name in the WikiTree dataset, we generated a speech segment reflecting how people utter it in a target language and accent. The Spoken Name dataset covers six different languages: American English, French, Spanish, Chinese, Russian, and Italian. Each language includes 250,038 WAV files associated with the names in the dataset.
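A minimal sketch of generating one spoken name with gTTS is shown below. It is an illustration rather than the study's pipeline: the directory layout, the `LANG_CODES` mapping, and both function names are hypothetical, and only five of the six languages are listed. Note that gTTS saves MP3 audio and needs network access, so producing WAV files as described above would require a separate conversion step that is not shown.

```python
from pathlib import Path

# Hypothetical mapping from a language label to a gTTS language code.
LANG_CODES = {
    "english": "en",
    "french": "fr",
    "spanish": "es",
    "russian": "ru",
    "italian": "it",
}


def speech_path(name, language, out_dir="spoken_names"):
    # One file per (language, name) pair; this layout is an assumption.
    return Path(out_dir) / language / f"{name.lower()}.mp3"


def generate_spoken_name(name, language, out_dir="spoken_names", tld="com"):
    # Imported lazily so the helpers above work without the third-party package.
    from gtts import gTTS

    path = speech_path(name, language, out_dir)
    path.parent.mkdir(parents=True, exist_ok=True)
    # In gTTS, `lang` selects the language and `tld` the regional accent.
    gTTS(text=name, lang=LANG_CODES[language], tld=tld).save(str(path))
    return path
```

Calling `generate_spoken_name("Robert", "french")` would request the French pronunciation of Robert and write it under `spoken_names/french/`.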

4.3 Behind the Name Dataset

To evaluate the performance of the proposed algorithm and compare it to other methods, we needed a ground truth dataset. Therefore, we generated the following ground truth dataset by combining the information included in the WikiTree dataset with the data on the Behind the Name website [behindthename]. This website was founded in 1996 by Mike Campbell in order to study various aspects of names [behindthename_info]. It contains names from all cultures and time periods, as well as mythological and fictional names. Currently, the website contains 22,263 names.

The ground truth dataset was created as follows: First, we extracted all of the distinct forenames in the WikiTree dataset with a length greater than two letters (to avoid honorific titles). From the over 17 million profiles available at the time of this research, we extracted 250,038 unique forenames. Using the public application programming interface (API) provided by Behind the Name, we collected synonyms for the unique forenames in the WikiTree dataset. For example, for the given name of Ed, we collected Eddie, Edgar, Edward, Ned, Teddy, etc. [ed_behindthename]. For the given name of Elisabeth, we retrieved Eli, Elisa, Ella, Elsa, Lisa, and Liz [elisabeth_behindthename]. In total, 37,916 synonyms were retrieved for 7,399 distinct names. The names that provided the greatest number of synonyms were Ina, Nina, and Jan, with 127, 119, and 92 synonyms, respectively. On average, the Behind the Name dataset contains 5.12 synonyms for a given first name.
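The forename extraction and the reported average can be illustrated with a short sketch. The profile structure and the `first_name` field are hypothetical stand-ins for the WikiTree dump format, which is not described here; only the totals of 37,916 synonyms and 7,399 names come from the text.

```python
def distinct_forenames(profiles, min_length=3):
    """Collect unique forenames, keeping only those longer than two letters."""
    names = {p["first_name"].strip() for p in profiles if p.get("first_name")}
    return sorted(name for name in names if len(name) >= min_length)


# Toy profiles; "first_name" is an assumed field name.
profiles = [{"first_name": "Eddie"}, {"first_name": "Dr"},
            {"first_name": "Eddie"}, {"first_name": "Elisabeth"}]
forenames = distinct_forenames(profiles)

# Average number of synonyms per name, from the totals reported above.
average_synonyms = round(37916 / 7399, 2)
print(forenames, average_synonyms)
```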

5 Experimental Setup

5.1 Setting Experimental Parameters

In this study, we conducted experiments aimed at answering two research questions. First, is the proposed SpokenName2Vec vector representation valid and useful? Second, can the proposed algorithm’s performance be improved by utilizing specific languages and accents in the speech segment generation step?

5.1.1 Vector Representation Validation

To answer the first research question, we evaluated the performance of the proposed SpokenName2Vec algorithm on the task of suggesting similar names for a given name by conducting a large-scale experiment, as follows: First, we obtained a collection of forenames. For this, we used the WikiTree dataset (see Section 4.1). As mentioned earlier, preprocessing was required; therefore, we cleaned the forenames by removing short names that contained fewer than three characters (see Section 3.1, Step 1).

Second, we used the gTTS library [gtts] to transform the forenames collected into speech segments reflecting the names as expressed by humans in their native tongue according to four different languages and accents: American English, French, Spanish, and Italian. In total, for each language, 250,038 WAV files were generated.

Then, for each speech segment representing a name, we extracted audio features using two open-source frameworks: Turi Create’s sound classifier [sound_classifier], and pyAudioAnalysis [giannakopoulos2015pyaudioanalysis]. In total, for each name, we generated 12,288 audio features using Turi Create and 136 audio features using pyAudioAnalysis, which served as a fixed-dimensional vector representation for each name.

Next, using the audio features obtained, together with the K-nearest-neighbors (KNN) classifier, we selected the k nearest neighbors as candidates to be suggested as synonyms for each given name, where k = 10.

For each given name, we first sorted the candidate names according to their Euclidean distance from the vector representation of the given name and removed forenames for which the vector representation’s distance was greater than one (for the audio features extracted using Turi Create’s sound classifier) and greater than zero (for the audio features extracted using pyAudioAnalysis). The difference between the thresholds is related to the audio feature extraction step: the Euclidean distance between the given name and its candidates for the 12,288 audio features extracted using Turi Create was diverse (from zero to 12), whereas for the 136 audio features extracted using pyAudioAnalysis, all of the Euclidean distance scores ranged between zero and one.
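The neighbor selection and distance filtering described above can be sketched in a few lines. This is a brute-force illustration under stated assumptions: the three-dimensional toy vectors stand in for the 12,288 or 136 audio features, and the function name `nearest_candidates` is hypothetical.

```python
import math


def nearest_candidates(given, vectors, k=10, max_distance=1.0):
    """Return up to k nearest names whose Euclidean distance is within the threshold."""
    target = vectors[given]
    scored = sorted((math.dist(target, vec), name)
                    for name, vec in vectors.items() if name != given)
    return [(name, dist) for dist, name in scored[:k] if dist <= max_distance]


# Toy feature vectors (assumed values, not real audio features).
vectors = {
    "Victoria": (0.0, 0.0, 0.0),
    "Viktoria": (0.1, 0.0, 0.0),
    "Vittoria": (0.3, 0.4, 0.0),
    "Abraham": (3.0, 4.0, 0.0),
}
candidates = nearest_candidates("Victoria", vectors, k=3, max_distance=1.0)
print(candidates)
```

For a collection of 250,038 names, an indexed nearest-neighbor search would normally replace this exhaustive scan; the scan is used here only because it makes the filtering rule easy to read.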

Then, we used the edit distance as an ordering function (i.e., the edit distance score is calculated between each given name and the remaining candidates). Finally, we sorted the candidates in ascending order according to the edit distance score and these served as the suggested synonyms for each given name. To measure the validity of the proposed representation, we evaluated it for the task of suggesting synonyms using objective performance metrics, such as accuracy, F1, precision, and recall.

5.1.2 Language and Accent Comparison

To answer the second research question, we conducted an empirical experiment in which we evaluated the performance of the proposed method, using a specific language and its associated accent, on suggesting names that are commonly used in countries and regions where that language is mainly spoken. In other words, we analyzed whether the selected language and accent should be taken into account in order to improve performance. For this, we conducted the following experiment: First, we used the Behind the Name dataset. For each name, we utilized the Behind the Name website to identify the countries and regions in which the name is commonly used; for example, according to the website, the forename Alfredo is commonly used in Italy, Spain, and Portugal (https://www.behindthename.com/name/alfredo).

Second, we applied five versions of the SpokenName2Vec algorithm using Turi Create’s sound classifier and five different languages: English, French, Spanish, Italian, and Russian.

Then, for each version reflecting a language and accent, we selected the names in the Behind the Name (ground truth) dataset that are commonly used in specific countries and regions. For example, we defined English names as names whose usage, according to the Behind the Name website, was English (General, Modern, Rare, Archaic). We also included Australian, British, New Zealand, and American names as English, as well as Hispanic, African American, and Anglo-Saxon names. We defined French names as names whose usage was French (General, Modern, Rare, Archaic). We handled Italian, Russian, and Spanish the same way. For Spanish names, we also included names that are commonly used in Latin America.

Next, for each given name in the dataset, we retrieved the 10 most similar names according to each language version and suggested them as synonyms. Finally, to evaluate the performance of the proposed method, we identified the correct synonyms (according to the Behind the Name dataset) among the suggested names, using precision as the performance measure.
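The per-usage aggregation in this experiment can be sketched as follows. The data are toy values invented for illustration, the helper name `mean_precision_by_usage` is hypothetical, and dividing the number of hits by k is an assumed definition of precision, since the exact formula is not spelled out here.

```python
from collections import defaultdict


def mean_precision_by_usage(suggestions, synonyms, usage, k=10):
    """Average precision@k of one language version over each usage group."""
    per_group = defaultdict(list)
    for name, suggested in suggestions.items():
        hits = sum(1 for s in suggested[:k] if s in synonyms.get(name, ()))
        for group in usage.get(name, ()):
            per_group[group].append(hits / k)
    return {group: sum(vals) / len(vals) for group, vals in per_group.items()}


# Toy inputs (illustrative only): suggestions of one version, ground truth, usage labels.
suggestions = {"Alfredo": ["Alfred", "Alfie"], "Victoria": ["Viktoria", "Vittoria"]}
synonyms = {"Alfredo": {"Alfred"}, "Victoria": {"Viktoria", "Vittoria"}}
usage = {"Alfredo": ["Italian", "Spanish"], "Victoria": ["English", "Spanish"]}
scores = mean_precision_by_usage(suggestions, synonyms, usage, k=2)
print(scores)
```

Running the same function once per language version would yield one column of a usage-by-version comparison.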

5.2 Evaluation Process

To analyze and evaluate the performance of the proposed SpokenName2Vec algorithm on the task of name suggestion, we evaluated its performance (see Section 5.2.1), as well as that of other algorithms used for suggesting synonyms, such as phonetic encoding algorithms (see Section 5.2.2), string similarity algorithms (see Section 5.2.3), and other recently proposed approaches, namely graph-based names derived from digitized family trees (see Section 5.2.4) and Name2Vec (see Section 5.2.5). The performance of each of the algorithms was evaluated using the performance metrics of accuracy, F1, precision, and recall. For the precision measure, we used the top k suggestions provided by each algorithm and calculated precision@k for k = 1, 2, 3, 5, and 10 (the values reported in Table 1). We chose to evaluate the top suggestions based on the assumption that our case is similar to the search engine ranking domain, where in most cases, people are only interested in the first page of the results and do not bother to move on to subsequent pages [brin1998anatomy].
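For a single given name, the ranking metrics can be computed as in the sketch below. The definitions are the common textbook ones (hits in the top k divided by k, hits divided by the number of verified synonyms, and the harmonic mean of the two) and are assumptions here, as the exact formulas are not given in this section; the ranking is toy data.

```python
def precision_at_k(suggested, relevant, k):
    """Fraction of the top-k suggestions that are verified synonyms."""
    return sum(1 for name in suggested[:k] if name in relevant) / k


def recall(suggested, relevant):
    """Fraction of the verified synonyms that appear among the suggestions."""
    if not relevant:
        return 0.0
    return len(set(suggested) & set(relevant)) / len(relevant)


def f1_score(precision, recall_value):
    """Harmonic mean of precision and recall."""
    total = precision + recall_value
    return 0.0 if total == 0 else 2 * precision * recall_value / total


# Toy ranking for the given name Beatrice (illustrative values only).
suggested = ["Beatris", "Beatrics", "Beatrix", "Beatricx", "Beatriz"]
relevant = {"Beatris", "Beatrix", "Beatriz", "Beatryce"}
p1 = precision_at_k(suggested, relevant, 1)
p5 = precision_at_k(suggested, relevant, 5)
r = recall(suggested, relevant)
print(p1, p5, r, f1_score(p5, r))
```

Averaging `precision_at_k` over all given names in the ground truth gives an average precision@k of the kind reported in the results tables.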

5.2.1 Evaluation of SpokenName2Vec

For each first name in the ground truth, we searched for its 10 nearest neighbors using the KNN algorithm (k = 10). Next, we filtered the candidate names based on a predefined threshold representing the maximal Euclidean distance between the candidate and the given name: if the Euclidean distance of a candidate name from the given name was above the threshold, the candidate was removed. The remaining candidates were placed in ascending order according to their edit distance from the given name. Finally, we evaluated the performance of the top suggestions provided.

5.2.2 Comparison to Phonetic Encoding Algorithms

We evaluated the performance of five well-known phonetic algorithms on the task of suggesting similar names: Soundex, Metaphone, Double Metaphone, NYSIIS, and the Match Rating Approach (MRA).

The evaluation process was performed as follows: For each given name in the ground truth Behind the Name dataset, we calculated the phonetic code according to the given phonetic algorithm. Take, for example, the name Abraham and the Soundex phonetic algorithm. First, the name Abraham was encoded by Soundex as A165. Then, we derived the Soundex phonetic code for all of the other names in the WikiTree dataset. After that, we chose the first names that shared the same phonetic code as Abraham as candidates; we sorted the candidates according to their edit distance from the given name (the lower the distance, the higher the similarity) and retrieved the top k as synonyms.

Unlike the other phonetic algorithms, which produce a single sound code for a given name, Double Metaphone produces two phonetic codes (primary and secondary). Therefore, for this algorithm, we collected all of the names that shared the same phonetic code (as either the primary or secondary code) and ordered them according to their edit distance from the given name.
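The candidate selection for a single-code phonetic algorithm can be illustrated with a compact Soundex. The sketch follows the standard American Soundex rules and is not necessarily the implementation used in the experiments; the helper `same_code_candidates` and the toy name list are assumptions, and the subsequent edit-distance ordering is omitted for brevity.

```python
# Digit assigned to each consonant group in American Soundex.
CODES = {ch: digit
         for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                                ("l", "4"), ("mn", "5"), ("r", "6"))
         for ch in letters}


def soundex(name):
    """Encode a name as a letter followed by three digits."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same digit
        digit = CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # a vowel resets prev, so a repeated digit is kept after it
        if len(code) == 4:
            break
    return code.ljust(4, "0")


def same_code_candidates(given, names, encode=soundex):
    """Names (other than the given one) that share the given name's phonetic code."""
    target = encode(given)
    return [n for n in names if n != given and encode(n) == target]


print(soundex("Abraham"))
print(same_code_candidates("Abraham", ["Abram", "Abrahim", "Victoria", "Abraham"]))
```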

5.2.3 Comparison to String Similarity Algorithms

We evaluated the performance of two well-known string similarity algorithms (edit distance and Damerau-Levenshtein distance). For this, we measured the given string similarity between each name in the ground truth and each candidate name in the WikiTree dataset. Take, for example, the name Abraham and the edit distance string similarity algorithm: First, we calculated the edit distance between each name in the WikiTree dataset and the name Abraham. As candidates, we chose only the first names whose distance from the given name was between one and three. We limited the edit distance to be less than or equal to three, since we observed that larger edit distance values resulted in names that were highly different from the given name. In the final step, we sorted the candidates according to their distance.
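Both distances and the candidate rule (a distance between one and three) can be sketched together. The Damerau-Levenshtein variant shown is the restricted "optimal string alignment" form, which is an assumption, as the variant used in the experiments is not specified; the helper names and the toy name list are likewise illustrative.

```python
def string_distance(a, b, transpositions=False):
    """Edit distance; with transpositions=True, adjacent swaps cost one edit."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]


def candidates_by_distance(given, names, low=1, high=3, transpositions=False):
    """Names whose distance from the given name lies in [low, high], closest first."""
    scored = [(string_distance(given, n, transpositions), n) for n in names]
    return [n for dist, n in sorted(scored) if low <= dist <= high]


names = ["Abram", "Abraham", "Abrahma", "Victoria"]
print(candidates_by_distance("Abraham", names))
print(string_distance("Abraham", "Abrahma"),
      string_distance("Abraham", "Abrahma", transpositions=True))
```

The two distances differ only when adjacent characters are swapped, which is why both algorithms tend to produce very similar candidate lists.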

5.2.4 Comparison to the Graph-Based Names Derived From Family Trees

To evaluate this method, we followed the steps presented by Elyashar et al. [elyashar2019runs], including the construction of the digitized family trees and the graph-based names, using the WikiTree dataset. For the suggestion of similar names, we applied the ordering functions as described in that paper. For convenience, we named the functions : and , respectively.

5.2.5 Comparison to Name2Vec

We performed two experiments to compare the SpokenName2Vec and Name2Vec [foxcroft2019name2vec] algorithms. In the first experiment, we evaluated the performance of the Name2Vec approach on our first name datasets (the WikiTree and Behind the Name datasets). In the second experiment, we utilized the Ancestry Surnames dataset provided by Foxcroft et al. [foxcroft2019name2vec].

Evaluation on Forenames.

First, we used the WikiTree dataset as a data source and trained a Doc2Vec model based on these first names. Foxcroft et al. reported that their best model was trained on the Ancestry dataset, which consisted of 250,000 surnames. Since the datasets are nearly equal in size (250,000 records), and there are generally no great differences between first and last names, we set the parameters so they were the same as those reported by Foxcroft et al. (640 epochs, 30 dimensions, and a window size of two). Next, using the trained model, we collected the 10 most similar candidate names for each first name existing in the Behind the Name dataset. To improve performance, we applied the edit distance similarity function between each given name and its candidates. Then, we sorted the candidates in ascending order based on their edit distance score and filtered out those candidates with an edit distance score greater than one. (We tested several predefined thresholds and present here the threshold providing the best results.) Finally, we used the remaining candidates as suggested synonyms for the given names that are part of the Behind the Name ground truth dataset.
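A training setup along these lines might look like the sketch below, which uses gensim's Doc2Vec with the reported parameters (30 dimensions, a window of two, 640 epochs). Treating each name as a document whose words are its characters follows the description of Name2Vec given in this paper; the helper names, the lowercasing, and `min_count=1` are assumptions, and this is not the code of Foxcroft et al.

```python
def name_to_document(name):
    """Treat a name as a document whose words are its individual characters."""
    return list(name.lower())


def train_name_model(names, vector_size=30, window=2, epochs=640):
    """Train a Doc2Vec model with one tagged document per name (requires gensim)."""
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument  # lazy import
    docs = [TaggedDocument(words=name_to_document(n), tags=[n]) for n in names]
    model = Doc2Vec(vector_size=vector_size, window=window,
                    epochs=epochs, min_count=1)
    model.build_vocab(docs)
    model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)
    return model


print(name_to_document("Victoria"))
# model = train_name_model(["Victoria", "Viktoria", "Abraham"])
# model.dv.most_similar("Victoria", topn=10)  # the 10 most similar names (gensim 4.x)
```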

Evaluation on Surnames. In this experiment, we performed the same steps described in the previous paragraph, with two changes. First, we trained the Doc2Vec model, with the parameters described above, on the Ancestry Surnames dataset, which includes 250,000 surnames. Second, the predefined threshold for last names was set to three, i.e., candidates with an edit distance greater than three were filtered out (instead of one, as was done in the previous experiment). In this case as well, we evaluated a few thresholds to maximize the performance of the algorithm. Finally, the remaining candidates were evaluated using the Ancestry Records ground truth dataset.

6 Results

6.1 Performance Comparison

In this section, we present the results of the experiments described in Section 5. The results of this evaluation are presented in Table 1.

SpokenName2Vec Evaluation. In our evaluation, we assessed the performance of the SpokenName2Vec algorithm using several languages and accents: four versions of SpokenName2Vec were developed for four languages (English, French, Spanish, and Italian) using Turi Create (TC), and one English version was generated using pyAudioAnalysis (pyAA). As seen in the table, most of the methods performed similarly. For the accuracy measure, all of the methods obtained scores between 0.137 and 0.151. The best method was the SpokenName2Vec version that based its suggestions on Italian spoken names using Turi Create, which achieved an accuracy score of 0.151. The Turi Create versions that used Spanish, English, and French spoken names obtained similarly high accuracy scores of 0.148, 0.147, and 0.142, respectively.

For the F1 measure, the highest scores were obtained by the two methods that used the English language (Turi Create and pyAudioAnalysis), with F1 scores of 0.181 and 0.182, respectively. The methods that used the French, Spanish, and Italian languages obtained average F1 scores of 0.175, 0.173, and 0.173, respectively.

For the precision measure, it can be seen that the highest average precision@1 scores were obtained by the two English versions (pyAA and TC) and the French version, with scores of 0.186, 0.184, and 0.183, respectively. We can also see that as k increases, the average precision@k decreases. The similarity in performance among the leading methods is also seen for average precision@5 and average precision@10, although for these two measures the highest scores were obtained by the Italian version, with 0.152 and 0.151, respectively. The Spanish, French, and English versions using Turi Create obtained similarly high performance.

Regarding recall, the table shows that the highest recall score (0.169) was obtained by the English version using pyAudioAnalysis. The next highest recall scores were obtained by the French and English versions using Turi Create, with 0.133 and 0.13, respectively. The Spanish and Italian versions obtained recall scores of 0.116 and 0.113, respectively.

| Method | Accuracy | F1 | AP@1 | AP@2 | AP@3 | AP@5 | AP@10 | Recall |
|---|---|---|---|---|---|---|---|---|
| SpokenName2Vec TC (En) | 0.147 | 0.181 | 0.184 | 0.172 | 0.162 | 0.147 | 0.147 | 0.13 |
| SpokenName2Vec TC (Fr) | 0.142 | 0.175 | 0.183 | 0.167 | 0.158 | 0.149 | 0.143 | 0.133 |
| SpokenName2Vec TC (Sp) | 0.148 | 0.173 | 0.177 | 0.161 | 0.157 | 0.150 | 0.148 | 0.116 |
| SpokenName2Vec TC (It) | 0.151 | 0.173 | 0.165 | 0.16 | 0.157 | 0.152 | 0.151 | 0.113 |
| SpokenName2Vec pyAA (En) | 0.137 | 0.182 | 0.186 | 0.171 | 0.159 | 0.148 | 0.137 | 0.169 |
| FTG (Net + SS) | 0.086 | 0.139 | 0.237 | 0.185 | 0.157 | 0.125 | 0.086 | 0.15 |
| FTG ( + SS) | 0.083 | 0.133 | 0.221 | 0.172 | 0.146 | 0.116 | 0.083 | 0.139 |
| FTG (PE + SS) | 0.07 | 0.114 | 0.164 | 0.132 | 0.114 | 0.094 | 0.07 | 0.129 |
| FTG (Net + PE + SS) | 0.096 | 0.152 | 0.272 | 0.211 | 0.178 | 0.136 | 0.096 | 0.165 |
| Name2Vec | 0.021 | 0.037 | 0.079 | 0.063 | 0.052 | 0.038 | 0.021 | 0.075 |
| Soundex | 0.06 | 0.102 | 0.101 | 0.096 | 0.092 | 0.08 | 0.06 | 0.208 |
| Metaphone | 0.066 | 0.11 | 0.107 | 0.1 | 0.097 | 0.086 | 0.066 | 0.209 |
| DMetaphone | 0.068 | 0.112 | 0.107 | 0.102 | 0.098 | 0.088 | 0.068 | 0.221 |
| NYSIIS | 0.064 | 0.11 | 0.105 | 0.093 | 0.087 | 0.079 | 0.064 | 0.163 |
| MRA | 0.058 | 0.0919 | 0.093 | 0.086 | 0.082 | 0.073 | 0.058 | 0.144 |
| Edit Distance | 0.045 | 0.078 | 0.071 | 0.067 | 0.062 | 0.055 | 0.045 | 0.179 |
| Damerau-Levenshtein | 0.046 | 0.08 | 0.071 | 0.065 | 0.062 | 0.056 | 0.046 | 0.182 |
Table 1: Performance Obtained by the SpokenName2Vec and other algorithms

6.1.1 Phonetic Encoding Algorithm Evaluation

For the accuracy measure, we can see that all of the algorithms obtained scores of around 0.06; the highest score was obtained by Double Metaphone, with an accuracy score of 0.068, and the lowest was obtained by MRA, with a score of 0.058.

For the F1 measure, we can see that all of the algorithms had scores around 0.1. Similarly, they all obtained an average precision score of 0.1.

For recall, we can see that the phonetic algorithms outperformed all other algorithms. The highest recall score was obtained by Double Metaphone, which had a score of 0.221. The second highest recall scores were achieved by Metaphone and Soundex, with scores of 0.209, and 0.208, respectively.

6.1.2 String Similarity Algorithm Evaluation

As seen in the table, the two algorithms (Damerau-Levenshtein and edit distance) had similar performance on the accuracy and F1 measures, with accuracy scores of 0.046 and 0.045 and F1 scores of 0.08 and 0.078, respectively. For the precision measure, both algorithms obtained their highest score for average precision@1, with a score of 0.071. For the recall measure, the string similarity algorithms obtained scores second only to the phonetic encoding algorithms, with 0.182 and 0.179, respectively.

6.1.3 Evaluation of the Graph-Based Names Derived From Family Trees

Regarding the accuracy measure, the FTGs obtained a high average accuracy score of 0.084. For the F1 measure, the FTGs obtained an average F1 score of 0.1345. For the precision metric, we can see that, of the algorithms evaluated, three of the FTG algorithms achieved the highest average precision scores for small values of k. For example, FTG (Net + PE + SS) obtained an average precision@1 of 0.272; FTG (Net + SS) and FTG ( + SS) followed with average precision@1 scores of 0.237 and 0.221, respectively. However, for average precision@5 and average precision@10, the trend changes, and the FTG algorithms are outperformed by the SpokenName2Vec algorithm. Among the FTGs, the highest scores for average precision@5 and average precision@10 were obtained by FTG (Net + PE + SS), with 0.136 and 0.096, respectively. Concerning recall, the highest recall score among the FTGs was obtained by FTG (Net + PE + SS), with a recall score of 0.165.

6.1.4 Name2Vec Evaluation

Table 2 provides a comparison of the performance of SpokenName2Vec and Name2Vec. As can be seen, SpokenName2Vec outperformed Name2Vec on each measure. For the forenames evaluation, Name2Vec obtained accuracy and F1 scores of 0.092 and 0.113, respectively, in contrast to the SpokenName2Vec version that used the English language and Turi Create, which obtained higher results (average accuracy and F1 scores of 0.147 and 0.181, respectively). For precision, SpokenName2Vec obtained an average precision@1 score of 0.184, while Name2Vec obtained an average precision@1 score of 0.091. The same trend can be seen with respect to recall, where SpokenName2Vec and Name2Vec obtained scores of 0.13 and 0.072, respectively.

| Algorithm | Dataset | Type | Accuracy | F1 | AP@1 | AP@5 | AP@10 | Recall |
|---|---|---|---|---|---|---|---|---|
| SpokenName2Vec | Behind the Name | First Names | 0.147 | 0.181 | 0.184 | 0.147 | 0.147 | 0.13 |
| Name2Vec | Behind the Name | First Names | 0.092 | 0.113 | 0.091 | 0.092 | 0.092 | 0.072 |
| SpokenName2Vec | Ancestry | Surnames | 0.521 | 0.563 | 0.578 | 0.522 | 0.521 | 0.667 |
| Name2Vec | Ancestry | Surnames | 0.211 | 0.287 | 0.4 | 0.223 | 0.211 | 0.611 |
Table 2: Comparison of SpokenName2Vec and Name2Vec

A similar picture can be seen for the surname evaluation: SpokenName2Vec outperformed Name2Vec on every measure. For example, SpokenName2Vec obtained an average accuracy score of 0.521, while Name2Vec had an accuracy score of 0.211. The same pattern is seen for F1, precision, and recall.

6.2 Language and Accent Comparison

Regarding the second research question, which focuses on improving SpokenName2Vec’s performance by determining the optimal language and associated accent, we found that the SpokenName2Vec version that used the spoken names in the French language was the most successful of the five versions. For 2,801 English names (commonly used in the United Kingdom and United States), the French SpokenName2Vec version obtained the highest precision score of 0.033 (see Table 3). The version that came in second place was the English version, with an average precision score of 0.025. For 404 French names, the French version obtained the highest precision score of 0.03. In second, third, and fourth places were the English, Spanish, and Russian versions, which obtained average precision scores of 0.02, 0.018, and 0.015, respectively. For 379 Spanish names and 307 Russian names, the French SpokenName2Vec version achieved first place with average precision scores of 0.03 and 0.06, respectively. Surprisingly, the SpokenName2Vec version that was the best for suggesting synonyms for 476 Italian names was the Spanish version, which obtained an average precision score of 0.028; the French version came next with an average precision score of 0.026.

| Usage | English SpokenName2Vec | French SpokenName2Vec | Spanish SpokenName2Vec | Italian SpokenName2Vec | Russian SpokenName2Vec |
|---|---|---|---|---|---|
| English | 0.025 | 0.033 | 0.019 | 0.009 | 0.01 |
| French | 0.02 | 0.03 | 0.018 | 0.009 | 0.015 |
| Spanish | 0.02 | 0.03 | 0.025 | 0.004 | 0.011 |
| Italian | 0.009 | 0.026 | 0.028 | 0.01 | 0.006 |
| Russian | 0.034 | 0.06 | 0.034 | 0.018 | 0.018 |
Table 3: Performance of five versions of the SpokenName2Vec algorithm for suggesting similar names

To illustrate the evaluation performed, we chose two first names: Beatrice and Victoria. According to the Behind the Name website, Beatrice is a commonly used female name in France and is probably derived from a feminine form of the Late Latin name Viator, which means voyager or traveler (https://www.behindthename.com/name/be10atrice). The name Victoria, meaning victory in Latin, was very rare in the English-speaking world until the 19th century, when Queen Victoria began her long rule of the British Empire (https://www.behindthename.com/name/victoria). We collected the candidates associated with each of the names using the SpokenName2Vec versions and a KNN classifier. Then, for each name, we projected the name and its associated candidates, for each language version, onto a two-dimensional vector space by applying dimensionality reduction using principal component analysis (PCA). Doing so enables us to view the given name and its 10 associated candidates in multiple languages, as seen in Figures 2 and 3.
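A projection of this kind can be sketched with NumPy. The snippet computes PCA through a singular value decomposition of the centered data; the random 11 x 8 matrix is a stand-in for the feature vectors of a given name and its 10 candidates, and the function name `pca_2d` is an assumption.

```python
import numpy as np


def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    x = np.asarray(vectors, dtype=float)
    centered = x - x.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Stand-in data: one given name plus 10 candidates, 8 toy features each.
rng = np.random.default_rng(0)
features = rng.normal(size=(11, 8))
points = pca_2d(features)
print(points.shape)
```

Each row of `points` is an (x, y) coordinate that can be passed to any plotting library to draw a scatter plot of the names.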

Figure 2 presents the distribution of the given name Beatrice and the names associated with it, as provided by the SpokenName2Vec algorithm using three languages (English, French, and Spanish). In the figure, we can see that the French version was successful in suggesting four out of ten correct synonyms: Beatris (Russian), Beatrix (Dutch), Beatriz (Portuguese), and Beatryce (a rare form used by Americans and Brazilians). The name Beatriz (Portuguese) was detected as a true synonym by both the French and English SpokenName2Vec versions, whereas the name Beatryce (America and Brazil) was detected by all three versions. It is interesting to note that the French version was also successful in identifying six additional variations of the given name Beatrice that are not included on the website, including Beaatrice, Beatricx, Beatrics, Beatryx, and Beatriks. Similarly, the English version identified the names Beatries and Beattris, and its Spanish counterpart found the names Beattrice and Beatrich.

Figure 2: Similar name distribution for the French name of Beatrice by the English, French, and Spanish SpokenName2Vec’s versions. The red and turquoise colors represent the given name in the Spanish and English versions, respectively. The green, purple and orange colors represent the suggestions provided by the Spanish, French, and English versions, respectively.

Figure 3 presents the distribution of Victoria and its associated names, as provided by the English and French versions of SpokenName2Vec. As can be seen, the French version successfully suggested two verified correct synonyms: Wiktoria (Polish) and Viktoria (German, Swedish, Norwegian, Danish, and many more). The English version was successful at suggesting the following synonyms: Vittoria (Italian) and Viktoriya (Bulgarian, Russian, and Ukrainian). We can also see that the French version identified the following names, which do not exist on the website: Wicktoria, Wictoria, Victorya, Viktorya, Vicktoria, and Victtoria.

Figure 3: Similar name distribution for the English name of Victoria by the English and French SpokenName2Vec’s versions. The green color represents the given name in the French version. The red and turquoise colors represent the suggestions provided by the French and English versions, respectively.

7 Discussion

Upon analyzing the results presented in Section 6, we can conclude the following:

First, the proposed novel SpokenName2Vec approach, which represents names by automated speech, obtained promising results and was found useful for the task of suggesting synonyms for a given name.

Second, the suggested algorithm is generic. For example, in the audio feature extraction step, features can be extracted using any available tool, and the algorithm does not depend on a single technique. This was demonstrated by extracting audio features using two different tools: Turi Create and pyAudioAnalysis (see Section 5). This finding leaves room for improvement: other available tools that are capable of converting audio into a fixed-dimensional vector could be assessed in order to optimize the name suggestions. Our demonstration of the approach on both forenames and surnames also shows its generality.

Third, unlike many approaches, such as Soundex and Name2Vec, which support only the English language, the SpokenName2Vec algorithm supports multiple languages. Its ability to extract valuable information based on speech, without the necessity of working with text and grammar, allows it to support many languages. This ability, which was demonstrated in our evaluation of versions that used English, Latin languages (French, Spanish, and Italian), and an East Slavic language (Russian), also shows the generality of the algorithm.

Fourth, with respect to performance on the task of suggesting synonyms, SpokenName2Vec was found to be superior to the phonetic encoding and string similarity algorithms in terms of accuracy, F1, and precision (a difference found to be statistically significant using t-tests; see Table 1). For example, the English SpokenName2Vec method obtained an accuracy score of 0.147, whereas Soundex and edit distance obtained scores of 0.06 and 0.045, respectively. Given this, we can conclude that the suggested name representation based on speech embedding is much more effective and accurate than the plain text code produced by phonetic algorithms.

Fifth, based on our comparison of SpokenName2Vec and Name2Vec, we conclude that SpokenName2Vec outperforms the Name2Vec approach presented by Foxcroft et al. [foxcroft2019name2vec]. We base this conclusion on an evaluation using the WikiTree and Behind the Name datasets, in which the SpokenName2Vec algorithm was found to be superior on all metrics. For instance, the average precision@1 of SpokenName2Vec was 0.184, as opposed to 0.091 obtained by Name2Vec. A similar picture can be seen when evaluating last names using the Ancestry dataset provided by Foxcroft et al. [foxcroft2019name2vec]: SpokenName2Vec obtained an average precision@1 score of 0.578, in contrast to Name2Vec’s score of 0.4. The main disadvantage of the Name2Vec approach is related to its architecture. Name2Vec is a Doc2Vec model that treats each name as a document and each character that composes the given name as a word [foxcroft2019name2vec]. This limits the approach to suggesting synonyms composed of only the characters of the given name. Thus, it fails to suggest synonyms that include additional characters that do not exist in the given name. For example, for the given name Victoria, Name2Vec cannot suggest the correct synonym Viktoria due to the absence of the character “k” in the given name. Unlike Name2Vec, SpokenName2Vec does not depend on the characters, but rather on a similar sound. Therefore, the absence of the character “k” is not an obstacle, and all of the SpokenName2Vec versions suggested the name Viktoria as a correct synonym for the given name Victoria, as demonstrated in Figure 3.

Sixth, it can be seen that SpokenName2Vec is superior in terms of accuracy, F1, and recall to the method derived from graph-based names (see Table 1). For example, the average F1 score obtained by the English SpokenName2Vec version was 0.181, as opposed to 0.152 obtained by FTG (Net + PE + SS). With respect to precision, we can see that the FTGs outperformed SpokenName2Vec for small values of k. However, the picture changes when k increases (precision@5 and precision@10), as in those cases SpokenName2Vec outperformed the FTGs. It is important to understand that the algorithms are totally different from one another: SpokenName2Vec is capable of detecting names that sound alike but are written differently. In contrast, FTG suggests similar names based on historical ancestral relationships that are not necessarily related to sound. We can therefore conclude from these results that speech, as well as ancestral trees, can be utilized to improve similar name suggestion. We believe that future research in which these two approaches are combined should be very helpful and effective for this purpose.

Seventh, for the recall performance, we can see that the phonetic encoding and string similarity algorithms outperformed all other algorithms. The highest recall scores were obtained by Double Metaphone, Metaphone, and Soundex, with 0.221, 0.209, and 0.208, respectively. The string similarity algorithms (Damerau-Levenshtein and edit distances) obtained average recall scores of 0.182 and 0.179, respectively. The recall measure estimates the fraction of the total number of relevant names that were actually suggested. Therefore, we deduce that these well-known algorithms can detect the largest number of correct synonyms in the long run; however, their mechanism misses many correct synonyms in the short term (the top 10 suggestions). In contrast, SpokenName2Vec suggests similar names with the highest likelihood first. This is the reason for its low recall scores.

Finally, with regard to the utilization of specific languages and accents for improving SpokenName2Vec’s performance, our initial assumption was that the best version of the SpokenName2Vec algorithm would be the one that used the language of the targeted country or region, i.e., the English version would be the best for suggesting similar English names, the French version would be the best for suggesting similar French names, etc. However, as can be seen, the SpokenName2Vec version that used French speech was the best for suggesting synonyms for English, French, Spanish, and Russian given names (see Table 3). In addition, the French version was also very good at suggesting Italian names, although the best method for suggesting Italian names was the one that used Spanish speech. We can deduce that in most cases, it is recommended to use the French version of SpokenName2Vec for suggesting similar names in order to improve performance.

8 Conclusion & Future Work

This paper introduces the multi-language SpokenName2Vec, a novel and generic algorithm that uses automated speech generation in different languages and accents, together with deep learning, to address some of the challenges associated with suggesting synonyms for names.

We provided a comprehensive description of our framework's steps, which start with the compilation of a collection of names using genealogical datasets; these names were used to generate audio segments reflecting the way humans pronounce the given names in several languages, such as English, French, Spanish, and Italian. Based on these speech segments, we extracted audio features, which serve as vector representations for each name. A supervised machine learning classifier was used to find the top 10 candidates with the highest likelihood of being correct synonyms for a given name. Using a threshold function, we filtered out candidates that sound different from the given name, and using an ordering function, we ranked and retrieved the remaining names. In this way, SpokenName2Vec was used to suggest synonyms for each given name in the ground truth. We compared the performance of SpokenName2Vec on the task of similar name suggestion to the performance of 12 other algorithms, including well-known phonetic and string similarity algorithms, as well as the graph-based names derived from digitized family trees and Name2Vec. We make the following observations and conclusions:

The SpokenName2Vec approach was very useful for confronting the problem of suggesting similar names for a given name, outperforming the other evaluated algorithms with respect to accuracy, F1, and precision@k at larger values of k.

The proposed approach is very generic. This is reflected in its demonstrated ability to (1) detect similar names based on how they sound rather than how they are written, a capability that shows its potential to support a large number of languages, in contrast to other well-known algorithms (e.g., Soundex) that only support English; (2) extract audio features using two different frameworks (Turi Create's sound classifier and pyAudioAnalysis), which shows SpokenName2Vec's ability to support various tools for feature extraction; and (3) use any supervised machine learning algorithm for name classification. The generality of the algorithm was also demonstrated by its effectiveness in suggesting both first and last names.

Furthermore, our evaluation showed that the scores obtained by all of the proposed SpokenName2Vec versions are significantly higher than those obtained by all of the other algorithms evaluated, including the graph-based names derived from digitized family trees, Name2Vec, phonetic encoding, and string similarity algorithms, for the performance measures of accuracy and F1 (the difference was found to be statistically significant using t-tests). Given this, we conclude that the proposed SpokenName2Vec method should be used for suggesting similar names.

In terms of precision, SpokenName2Vec was second only to the algorithms based on the name-based graph for small values of k. However, for larger values of k, the SpokenName2Vec methods provided the highest precision scores (see Section 6.1). On the basis of this, we conclude that utilizing both automated speech and digitized family trees is essential for similar name suggestion. However, unlike the graph-based names derived from digitized family trees, which require historical information (e.g., a father-son connection) to construct the graph of names, SpokenName2Vec does not need this type of information, which in many cases does not exist.

Our final conclusion is based on the results of the language comparison (see Section 6.2): the preferred version for improving performance is the French version. A possible future research direction is to examine other groups of names and datasets in order to understand the usefulness of the French version. Another avenue to pursue is combining the sound and family tree approaches to improve the suggestion of similar names.
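The candidate filtering and ordering step summarized in this section can be illustrated with a short sketch. It assumes that a trained classifier has already assigned each candidate name a likelihood score; the function name `suggest_synonyms`, the score values, and the 0.5 cutoff are hypothetical, and a plain score cutoff with a descending sort stands in for the paper's threshold and ordering functions, which are not specified here.

```python
from typing import Dict, List


def suggest_synonyms(scores: Dict[str, float],
                     threshold: float = 0.5,
                     top_k: int = 10) -> List[str]:
    """Drop low-scoring candidates, then return the best top_k names.

    scores maps each candidate name to a classifier likelihood (assumed input).
    threshold and top_k are illustrative defaults, not values from the paper.
    """
    # Threshold step: keep only candidates scored at or above the cutoff.
    kept = [(name, score) for name, score in scores.items() if score >= threshold]
    # Ordering step: highest score first; ties are broken alphabetically.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return [name for name, _ in kept[:top_k]]


# Hypothetical classifier scores for candidates of the query name "John".
candidate_scores = {"Jon": 0.91, "Joan": 0.62, "Gene": 0.18, "Johnny": 0.77}
suggestions = suggest_synonyms(candidate_scores)
```

With these made-up scores, "Gene" falls below the cutoff and is removed, and the remaining names are returned with the most likely synonym first.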

9 Availability

This study is reproducible research. Therefore, the Spoken Name dataset, as well as code for suggesting synonyms for a given name, is available at https://github.com/aviade5/SpokenName2Vec. Other datasets for evaluation are available upon request.

10 Acknowledgments

The authors would like to thank the icons8 website (https://icons8.com) for their beautiful icons.

References