Comparing Name Nationality Classification Services
Nationality identification unlocks important demographic information, with many applications in biomedical and sociological research. Existing name-based nationality classifiers use name substrings as features and are trained on small, unrepresentative sets of labeled names, typically extracted from Wikipedia. As a result, these methods achieve limited performance and cannot support fine-grained classification. We exploit the phenomenon of homophily in communication patterns to learn name embeddings, a new representation that encodes gender, ethnicity, and nationality and is readily applicable to building classifiers and other systems. Through our analysis of 57M contact lists from a major Internet company, we are able to design a fine-grained nationality classifier covering 39 groups representing over 90% of the world's population. Compared against other published systems over 13 common classes, our F1 score (0.795) is substantially better than that of our closest competitor, Ethnea (0.580). To the best of our knowledge, this is the most accurate, fine-grained nationality classifier available. As a social media application, we apply our classifiers to the followers of major Twitter celebrities over six different domains. We demonstrate stark differences in the ethnicities of the followers of Trump and Obama, and in the sports and entertainments favored by different groups. Finally, we identify an anomalous political figure whose presumably inflated following appears largely incapable of reading the language he posts in.
Nationality and ethnicity are important demographic categorizations of people, standing in as proxies to represent a range of cultural and historical experiences. Names are important markers of cultural diversity, and have often served as the basis of automatic nationality classification for biomedical and sociological research. For example, nationality from names has been used as a proxy to reflect genetic differences (Burchard et al., 2003; Banda et al., 2015) and public health disparity (Barr, 2014; Quesada et al., 2011) among groups. Nationality identification is also important in ads targeting, academic studies of political campaigns and social media analysis (Chang et al., 2010; Appiah, 2001). Name analysis is often the only practical way to gather ethnicity/nationality annotations, because of privacy concerns.
Several previous name-based ethnicity/nationality classification approaches have been presented (Treeratpituk and Giles, 2012; Chang et al., 2010; Torvik and Agarwal, 2016), including (Ambekar et al., 2009) at KDD '09. However, the performance of these methods has been constrained by small and artificial training sets, such as celebrity names from Wikipedia, and restricted to coarse ethnicity/nationality taxonomies. The long tail of names makes these approaches dependent on surface forms (like substring distributions), which are by definition ineffective for logograms. Almost all existing methods are designed only for Latinized names, while other writing systems (e.g. Arabic, Cyrillic) are also widely used.
In this paper, we present NamePrism, a new name nationality and ethnicity classifier which offers a finer-grained taxonomy of ethnic groups. Fig. 1 demonstrates the performance of our system by presenting the ethnicity/nationality probability distributions of some data mining researchers. We believe our results will generally agree with the reader's judgement.
Unlike previous methods that rely on substring features, we propose a more robust representation of names, which exploits the phenomenon of homophily in communication. The homophily principle, that people tend to associate with similar people or popularly that “birds of a feather flock together,” is one of the most striking and empirically robust regularities in social life (McPherson et al., 2001; Kossinets and Watts, 2009). Leskovec and Horvitz observed that, in instant messages, people tend to communicate more frequently with others of similar age, language and location (Leskovec and Horvitz, 2008). We analyze over 57 million contact lists from an email company, where the account holders are anonymized. The homophily-induced coherence of these contact lists enables us to derive meaningful features using word embedding methods (Mikolov et al., 2013; Pennington et al., 2014) as the basis for a comprehensive and effective nationality classifier.
We collected 74M labeled names from 118 different countries, covering over 90% of the world's population. We use these labels to define a natural taxonomy of 39 leaf nationalities. As far as we know, our classifier is the most fine-grained and effective one accessible to the public. The main contributions of our work are:
Introducing Name Embeddings: The contact-list derived name embeddings prove to be a powerful way to capture latent properties of gender, nationality, and age in features readily applicable to classification and regression tasks. Projections of these embeddings are very compelling, creating maps in embedding space that correspond to maps of national boundaries. We believe these embeddings will prove widely applicable to other applications and domains, including those in data privacy and security.
Improved Nationality Classification: Our name-based nationality classifier NamePrism performs considerably better than previous classifiers. In particular, on a 13-class evaluation over email/Twitter data, our F1 score (0.795) proves to be much better than the competing systems Ethnea (http://abel.lis.illinois.edu/cgi-bin/ethnea/search.py) (0.580) (Torvik and Agarwal, 2016), HMM (http://www.textmap.com/ethnicity/) (0.364) (Ambekar et al., 2009), and (on a reduced 10-class scale) EthnicSeer (http://singularity.ist.psu.edu/ethnicity) (0.571) (Treeratpituk and Giles, 2012). NamePrism uses a Naive Bayes approach within a nationality taxonomy over 39 leaf nodes, employing name embeddings as the primary features.
Improved Ethnicity Classification: A benefit of a fine-grained nationality taxonomy is its flexibility to apply to different task settings. The six ethnic groups defined by the U.S. Census Bureau over the U.S. population largely correspond to distinct nations of origin. Our ethnicity classifier simply reduces the nationality taxonomy from 39 leaf nodes to 6 and incorporates census-based ground truth parameters into the Naive Bayes model.
Online Classification Resources: We release NamePrism as a free web service (NamePrism open API: http://www.name-prism.com/) for research in sociology, linguistics, and biomedical applications. To the best of our knowledge, it is the only nationality classifier that handles various writing systems and works on a fine-grained 39-class taxonomy.
Social Media Analysis: We use NamePrism to analyze social media, specifically the nationalities/ethnicities of the followers of 600 major celebrities on Twitter. Our results show that: (1) Donald Trump's U.S. followers are disproportionately White compared with the followers of Obama and Clinton, (2) ethnicities exhibit different preferences in sports and entertainment, and (3) the follower count of a particular Indonesian politician appears to have been artificially inflated with Russian names.
Name nationality classification is a fundamental problem with a variety of important applications: (i) biomedical research and clinical practice: it is critical to study the genetic and dietary differences among distinct groups (Burchard et al., 2003; Banda et al., 2015); (ii) sociology: health care, employment, and education disparities among different groups (Barr, 2014; Quesada et al., 2011); (iii) online targeting: recommending more accurate ads/news/social media posts to users (Chang et al., 2010; Appiah, 2001). Other applications include population demographic studies (Aries and Moorehead, 1989; Lauderdale and Kestenbaum, 2000; Mateos, 2007; Mateos et al., 2007). Despite the widespread demand for nationality labels, it is hard to collect such information via self-reporting because of privacy concerns. Meanwhile, manual annotation of nationality from names is, in fact, a very difficult task, especially for a fine-grained taxonomy.
Most recent works use name substrings as features for ethnicity/nationality classification (Ambekar et al., 2009; Chang et al., 2010; Treeratpituk and Giles, 2012; Torvik and Agarwal, 2016). Ambekar et al. (Ambekar et al., 2009) propose to combine decision trees and an HMM to conduct classification on a taxonomy with 13 leaf classes. Treeratpituk and Giles (Treeratpituk and Giles, 2012) utilize both alphabet and phonetic sequences in names to improve performance, and apply it to analyze how ethnicities evolve in the computer science research community (Wu et al., 2014). Chang et al. (Chang et al., 2010) use Bayesian methods to infer the ethnicity of Facebook users with U.S. census data and study the interactions between ethnic groups. Torvik and Agarwal (Torvik and Agarwal, 2016) propose instance-based classifiers using scientists' names from PubMed. In comparison, we propose name embedding in light of the homophily principle in social life (Leskovec and Horvitz, 2008; Kossinets and Watts, 2009; McPherson et al., 2001). It is a better representation because substrings are limited to phonograms. Other relevant efforts are binary ethnicity classifiers, including Hispanic (Buechley, 1976), Chinese (Coldman et al., 1988), and South Asian (Harding et al., 1999).
Word embeddings have many applications in natural language processing (Al-Rfou et al., 2013; Le and Mikolov, 2014; Bengio and Corrado, 2015). Other types of data can also benefit from the same assumption that underlies word embeddings, namely that a data point is governed by the other data in its context (Perozzi et al., 2014; Tang et al., 2015; Rudolph et al., 2016). DeepWalk (Perozzi et al., 2014) learns node embeddings for graph data by generating contexts from simulated random walks on graphs. Rudolph et al. (Rudolph et al., 2016) propose a more general formulation of learning embeddings in different application settings. Similarly, name embeddings treat the email contacts with the most recency and frequency as context.
Word embedding methods aim to learn similar embeddings (vectors) for two words that co-occur frequently in their contexts. In articles, the context of a word is naturally the words around it. To generate contexts from contact lists, we need to assign an order to contacts. In light of the homophily principle, we weight contacts by the recency and frequency of communications; as a result, names with large weights tend to share the same nationality. We then construct a "sentence" by keeping the top contacts of a sorted list. Note that while the ordering of sentences in an article is informative for word embeddings, the ordering of the contact lists is not useful, because email account holders are mutually independent.
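The sentence construction above can be sketched in a few lines of Python. The `Contact` record, the `alpha` mixing weight, and the field names are illustrative assumptions, not the paper's actual data format:

```python
from collections import namedtuple

# Hypothetical record of one contact in a list; field names are illustrative.
Contact = namedtuple("Contact", ["name", "frequency", "recency"])

def contact_sentence(contacts, top_k=20, alpha=0.5):
    """Order a contact list into a pseudo-sentence for embedding training.

    Contacts are weighted by a mix of communication frequency and recency
    (alpha is an assumed mixing weight), sorted in descending order of
    weight, and truncated to the top_k names.
    """
    weighted = sorted(
        contacts,
        key=lambda c: alpha * c.frequency + (1 - alpha) * c.recency,
        reverse=True,
    )
    return [c.name for c in weighted[:top_k]]

contacts = [
    Contact("wei_zhang", frequency=12, recency=0.9),
    Contact("john_smith", frequency=3, recency=0.2),
    Contact("li_wang", frequency=8, recency=0.7),
]
print(contact_sentence(contacts, top_k=2))  # ['wei_zhang', 'li_wang']
```

The resulting lists of names can then be fed to any word2vec-style trainer (CBOW or skip-gram) exactly as sentences of words would be.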
We use t-SNE (Van Der Maaten, 2014) to project the 100D name embeddings into 2D, and create the map visualization with gvmap (Hu et al., 2010). U.S. census data are used as ground truth to visualize and evaluate the name embeddings. More specifically, we use U.S. 1990 Census data to label popular first names (4.7K female and 1.2K male) and U.S. 2000 Census data to label popular last names (115K White, 5K Black, 6K Asian/Pacific Islander (API), 0.2K American Indian/Alaskan Native (AIAN), 0.1K Two or more races (2PRACE) and 7K Hispanic). As shown in Fig. 2 to Fig. 4, genders and ethnicities are marked with different colors; we are surprised by how well names with the same gender, ethnicity and nationality cluster together.
Fig. 2 (left) illustrates the landscape of first names. Using 1990 Census data, we color male names orange, female names pink, and names with unknown gender gray. In general, names of the same gender form mostly contiguous regions. Fig. 2 (right) is an inset showing a region along the male/female border. We can see that "Ollie" is labeled as a female name by the Census data (a 2:1 ratio of female to male instances), while in fact it is often used as a nickname for both "Oliver" and "Olivia" in daily use. Therefore the name embedding is correct in placing it near the border. The embedding also correctly placed "Imani" and "Darian", two names not labelled by the Census data, near the border but within the female and male regions, respectively.
Fig. 3 (left) shows a map of last names. We color a name according to the dominant ethnicity classification from 2000 Census data. The four major ethnicities are White (pink), Black (orange), Hispanic (yellow), and API (green). Names beyond the census data are colored gray. The three insets in Fig. 3 highlight the homogeneity of regions by ethnicity. White, Hispanic and API names stand in large contiguous regions, while Black names are more dispersed. This makes sense because many Black people adopted White surnames during the era of American slavery. More interestingly, there are two distinct Asian regions in the map. Fig. 4 presents insets for these two regions, revealing that one cluster consists of Chinese and Vietnamese names (left) while the other (right) contains Indian names. Even within the left subfigure, Vietnamese names gather toward the bottom while Chinese names sit toward the top. These observations strongly indicate that name embeddings capture gender, ethnicity and nationality signals.
We run experiments to validate our observations quantitatively and explore the sensitivity of name embeddings under different parameters. The parameters that we test include: (i) the embedding learning method: CBOW (Continuous Bag Of Words) or SG (Skip-Gram); (ii) whether first/last names share a joint embedding space or separate ones; (iii) the number of nearest neighbors.
We can see from Tab. 1 that the joint variants generally perform best, although the differences between the variants are relatively small. In addition, the CBOW model generally outperforms the SG model. The F1 score on Black names seems relatively low (0.35-0.59); however, it is essentially a harder task to find a Black name, because a random name from the contact lists has a probability of 0.03 of being Black, versus 0.74 of being White.
NamePrism uses a Naive Bayes model because of its effectiveness and interpretability. We argue that name nationality depends on both the first name and the last name. This is especially effective for names used across different nationalities but with different popularities. It also helps to reduce errors when names are mixtures because of immigration or cross-nationality marriages. We put much effort into estimating the parameters, i.e. the name part likelihoods, using features from training data, name embeddings, substrings and string characters. Therefore, each parameter has at most 4 estimates, and NamePrism uses the one with the largest confidence for predictions.
In many cases, our last names reveal our national origins. For example, "Zhang" is a common Chinese last name. It is easy to predict one's nationality if his last name is unique to that nation. However, there are many last names that are popular across nationalities. For example, "Lee" is popular in both China (especially in Hong Kong) and the UK. For "Qiang Lee" and "John Lee", we would make mistakes if we only took signals from the last name. Combining with first names, we can perform better, because it is easy to see whether the first name is more common in China or the UK. Similarly, using both name parts also helps when names are mixtures due to immigration or cross-nationality marriage.
Our method, NamePrism, can be formalized in Eq. 1:

P(n | f, l) ∝ P(n) · P(f | n) · P(l | n)    (1)

where n denotes nationality, l the last name and f the first name. We will describe our methods to estimate the likelihoods (i.e. P(f | n), P(l | n)) for frequent and rare names in the next subsection. We obtain Eq. 1 by applying Bayes' rule under the assumption that f and l are conditionally independent given n.
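This decision rule can be sketched with toy parameter tables standing in for the estimated likelihoods and priors. All names, probabilities, and the dictionary layout below are invented for illustration:

```python
import math

def classify(first, last, prior, lik_first, lik_last):
    """Score each nationality n by log P(n) + log P(first|n) + log P(last|n)
    and return the argmax. A small floor `eps` avoids log(0) for unseen
    (name part, nationality) pairs."""
    eps = 1e-9
    best, best_score = None, float("-inf")
    for n in prior:
        score = (math.log(prior[n] + eps)
                 + math.log(lik_first.get((first, n), eps))
                 + math.log(lik_last.get((last, n), eps)))
        if score > best_score:
            best, best_score = n, score
    return best

prior = {"China": 0.5, "UK": 0.5}
lik_first = {("qiang", "China"): 0.01, ("john", "UK"): 0.02}
lik_last = {("lee", "China"): 0.03, ("lee", "UK"): 0.01}
print(classify("qiang", "lee", prior, lik_first, lik_last))  # China
print(classify("john", "lee", prior, lik_first, lik_last))   # UK
```

With these toy tables, the shared last name "lee" is disambiguated by the first name, as discussed above.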
We estimate name part likelihoods from 4 sources: (i) training data, i.e. the names that appear in the training data (denoted V_t); (ii) name embeddings, i.e. the names from the contact lists that have embeddings (V_e); (iii) prefix/suffix strings, i.e. names that share a prefix/suffix with names in the training data (V_s); (iv) name characters, i.e. names that use the same language characters (e.g. Arabic) seen in the training data (V_c). Intuitively, the increasing order of vocabulary size is V_t, V_e, V_s, V_c, which is also the decreasing order of estimation confidence.
Eq. 2 shows the simplest and most effective way to estimate P(f | n) and P(l | n) directly from the training data:

P(x | n) = C(x, n) / C(n)    (2)

where x is either a first name or a last name from the training vocabulary, C(x, n) is the count of x with nationality n, and C(n) = Σ_x C(x, n). Note that each name part in the training vocabulary has more than 5 occurrences in the training data, so we have high confidence in these estimates.
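Eq. 2's counting estimate can be sketched directly, assuming the training data arrives as (name part, nationality) pairs; the threshold of 5 matches the occurrence filter mentioned above, and the example pairs are invented:

```python
from collections import Counter

def estimate_likelihood(labeled_names, min_count=5):
    """Estimate P(x | n) = C(x, n) / C(n) from (name_part, nationality)
    pairs, keeping only name parts seen more than min_count times overall."""
    pair_counts = Counter(labeled_names)               # C(x, n)
    part_counts = Counter(x for x, _ in labeled_names)
    nat_counts = Counter(n for _, n in labeled_names)  # C(n)
    return {
        (x, n): c / nat_counts[n]
        for (x, n), c in pair_counts.items()
        if part_counts[x] > min_count
    }

pairs = [("kim", "Korea")] * 6 + [("park", "Korea")] * 2 + [("kim", "US")]
lik = estimate_likelihood(pairs, min_count=5)
# "kim" appears 7 times overall, so both its entries survive;
# "park" appears only twice and is dropped.
```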
The likelihood of a name part x in the embedding vocabulary can be estimated using k-NN, i.e. by averaging over its k nearest neighbors in the training vocabulary. However, we do not estimate the likelihood directly from the kNNs' likelihoods. Instead, we find it performs better to first estimate x's posterior from its neighbors' posteriors and then apply Bayes' rule to recover the likelihood. This makes sense because names with similar embeddings do not necessarily have similar popularity (see Fig. 3). The estimation is formulated by Eq. 3 and 4:

P(n | x) = (1/k) Σ_{x' ∈ K(x)} P(n | x')    (3)

P(x | n) ∝ P(n | x) / P(n)    (4)

where K(x) is the set of name parts that are x's kNNs in the training vocabulary.
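The posterior-averaging step (Eq. 3) can be sketched as follows; the toy 2D embeddings and posterior tables are invented for illustration, and the conversion back to a likelihood via Bayes' rule (Eq. 4) is omitted:

```python
def knn_posterior(vec, known_vecs, known_posts, k=2):
    """Estimate P(n | x) for an unseen name part x by averaging the
    posteriors of its k nearest known name parts (cosine similarity
    in embedding space)."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = sum(a * a for a in u) ** 0.5
        nv = sum(b * b for b in v) ** 0.5
        return dot / (nu * nv)

    nearest = sorted(known_vecs, key=lambda w: cos(vec, known_vecs[w]),
                     reverse=True)[:k]
    nats = {n for w in nearest for n in known_posts[w]}
    return {n: sum(known_posts[w].get(n, 0.0) for w in nearest) / k
            for n in nats}

known_vecs = {"smith": (1.0, 0.0), "jones": (0.9, 0.1), "wang": (0.0, 1.0)}
known_posts = {"smith": {"UK": 1.0},
               "jones": {"UK": 0.8, "US": 0.2},
               "wang": {"China": 1.0}}
post = knn_posterior((1.0, 0.05), known_vecs, known_posts, k=2)
# nearest neighbors are "smith" and "jones", so UK dominates
```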
As mentioned in (Ambekar et al., 2009), prefixes and suffixes of name parts are indicative features. For a name part x, we can estimate its likelihood by averaging those of the training name parts that share a prefix/suffix:

P(x | n) = (1/|S(x)|) Σ_{s ∈ S(x)} Q(s | n)    (5)

where S(x) is the set of prefix and suffix strings of x (we use substrings of length 3 to 5), and Q(s | n) is the average likelihood of the name parts in the training vocabulary that have prefix/suffix s.
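This fallback can be sketched assuming a precomputed table mapping substrings (length 3 to 5) to the averaged likelihoods of the training name parts that contain them; the table entries below are invented:

```python
def substring_likelihood(part, table, min_len=3, max_len=5):
    """Average the training-derived likelihood tables of every prefix
    and suffix of `part` (length 3-5) that appears in `table`, which
    maps substring -> {nationality: averaged likelihood}."""
    hits = []
    for length in range(min_len, max_len + 1):
        if len(part) >= length:
            for s in (part[:length], part[-length:]):
                if s in table:
                    hits.append(table[s])
    if not hits:
        return {}
    nats = {n for h in hits for n in h}
    return {n: sum(h.get(n, 0.0) for h in hits) / len(hits) for n in nats}

table = {"ich": {"Russia": 0.2}, "vich": {"Russia": 0.6}}
res = substring_likelihood("petrovich", table)
# two suffix hits ("ich" and "vich") are averaged
```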
A name part may be so rare that it is in neither the training nor the embedding vocabulary, and contains no valid prefix or suffix strings; for example, a name written in Hangul, such as "근혜". It is very likely to be a Korean name, because most names written in Hangul are Korean. Therefore, for such a name part x, we estimate its likelihood by averaging over the training names written in the same characters:

P(x | n) = (1/|H(x)|) Σ_{x' ∈ H(x)} P(x' | n)    (6)

where H(x) is the set of names in the training data that are written in the same writing system as x.
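Identifying the writing system of a rare name can be sketched with the standard library's Unicode database; averaging the likelihoods of training names in that script would then follow as described above:

```python
import unicodedata

def dominant_script(name):
    """Guess the writing system of a name from the Unicode character
    names of its letters: the first word of unicodedata.name() is the
    script block, e.g. 'HANGUL SYLLABLE GEUN' -> 'HANGUL'."""
    scripts = [unicodedata.name(ch).split()[0] for ch in name if ch.isalpha()]
    return max(set(scripts), key=scripts.count) if scripts else None

print(dominant_script("근혜"))   # HANGUL
print(dominant_script("Иван"))  # CYRILLIC
```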
As shown in the previous subsections, the name part likelihoods are estimated from email/Twitter users. However, Internet services (email and Twitter) have varying popularity in different countries. Therefore, we need to assign different priors if a name is not sampled from Internet users. For example, the UK and South Africa have similar populations (around 50M to 60M), yet in our datasets we have an order of magnitude more names from the UK than from South Africa. We therefore need to adjust the priors to the real populations of countries when predicting a random name from the world population.
Formally, let P_I(n) be the prior over the Internet population, and P_W(n) the prior over the world population. We assume the likelihoods are unchanged, i.e. names of the Internet population are random samples from the corresponding countries. Let C_I(n) be the number of names with nationality n in the Internet population and C_W(n) the corresponding number in the world population. Thus P_I(n) = C_I(n) / Σ_{n'} C_I(n') and P_W(n) = C_W(n) / Σ_{n'} C_W(n'). We obtain the relation between the two posteriors with Eq. 7:

P_W(n | f, l) ∝ P_I(n | f, l) · P_W(n) / P_I(n)    (7)
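This prior adjustment can be sketched as a simple re-weighting of the Internet-based posterior; the population counts below are invented for illustration:

```python
def reweight_posterior(posterior_internet, internet_counts, world_counts):
    """Convert P(n | name) estimated on Internet users to one under
    world-population priors: multiply by the ratio of world prior to
    Internet prior (likelihoods assumed unchanged), then renormalize."""
    total_i = sum(internet_counts.values())
    total_w = sum(world_counts.values())
    unnorm = {
        n: p * (world_counts[n] / total_w) / (internet_counts[n] / total_i)
        for n, p in posterior_internet.items()
    }
    z = sum(unnorm.values())
    return {n: v / z for n, v in unnorm.items()}

# UK is heavily overrepresented online relative to South Africa here,
# so the adjusted posterior shifts toward South Africa.
post = reweight_posterior(
    {"UK": 0.8, "SouthAfrica": 0.2},
    internet_counts={"UK": 10_000_000, "SouthAfrica": 1_000_000},
    world_counts={"UK": 60_000_000, "SouthAfrica": 55_000_000},
)
```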
Names are classified on a predefined taxonomy in top-down fashion (see Fig. 5). The detailed algorithm is shown in Alg. 1. We start from the root class of the taxonomy (line 1). In each iteration, we pick the class that maximizes the posterior P(n | f, l) (lines 2 to 19) until we reach a leaf class. Since we have higher confidence in parameters estimated from the training data and name embeddings than in those from substrings and characters, we prefer the former (lines 3 to 10). If neither name part appears in the training or embedding vocabularies, we fall back to the substring or character parameters (lines 11 to 17). Note that if only one of the name parts is covered, we use that partial signal and smooth the other name part.
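The top-down traversal can be sketched as a greedy walk over the taxonomy; the toy tree, marker lists, and scorer below are stand-ins for the actual posterior computation:

```python
def classify_topdown(name, tree, score):
    """Walk the nationality taxonomy from the root, at each level picking
    the child with the highest score(name, child), until reaching a leaf.
    `tree` maps each internal class to its children; leaves have no entry."""
    node = "root"
    while node in tree:
        node = max(tree[node], key=lambda child: score(name, child))
    return node

tree = {
    "root": ["EastAsian", "European"],
    "EastAsian": ["Chinese", "Korean"],
    "European": ["British", "French"],
}
# Toy scorer: counts class-specific marker substrings in the name.
markers = {"EastAsian": ["kim", "lee"], "European": ["smith"],
           "Chinese": ["wang"], "Korean": ["kim"],
           "British": ["smith"], "French": ["du"]}
score = lambda name, c: sum(name.count(m) for m in markers[c])
print(classify_topdown("kim minjun", tree, score))  # Korean
```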
The nationality taxonomy is a key component of our method. Mateos et al. proposed a nationality taxonomy based on Cultural, Ethnic and Linguistic (CEL) similarities (Mateos et al., 2007). Our name-based nationality taxonomy is constructed on top of the CEL-based taxonomy, especially for the top level. While there is no "gold standard" name-based nationality taxonomy, because of the complexity of naming customs around the world, we consulted linguists and people from different cultures to reach a common ground as a useful approximation. Moreover, as shown in Sec. 4.3.2, we can compute similarities between countries using name part distributions. These similarities help construct the bottom levels of the taxonomy. For example, Hispanic countries are divided into three subgroups: Spanish, Portuguese and Philippines. The reason is that countries within the Spanish and Portuguese groups are very similar to each other according to name similarities, indicating that finer-granularity groupings are neither feasible nor necessary.
In order to estimate the parameters mentioned above, we need name labels, i.e. pairs of full name and nationality. We collected 68M such pairs from the email source and 6M pairs from Twitter, totaling 74M labeled names from 118 major countries (Fig. 5). These countries account for over 90% of the world's population. To remove noise, we filter out names where both parts appear only once; 91% of names remain. Note that we are interested in nationalities, thus countries of immigration, including the U.S., Canada and Australia, are not included in our dataset. To preserve privacy for the email data, the IDs of these users (e.g. email addresses) are removed. Furthermore, we only retain the counts of first/last names and countries. We used the full name and country labels solely for the purpose of performance measurement; they are not retained for classification.
90% of the name labels come from email. Each name part appears at least twice, so that typos and random strings are filtered out. Note that the email contact lists and the labeled names are different sets of users. We set 5 as the occurrence threshold for both the training and embedding vocabularies; the resulting training vocabulary contains 1.02M name parts and the embedding vocabulary 4.09M. This makes sense because the contact lists contain names from many email providers and thus a larger population.
Although the email data offers the majority of name labels, its imbalanced popularity across the world leaves some regions with inadequate name labels. We noticed that Twitter (API: https://dev.twitter.com/rest/public), as an emerging Web service, has wider coverage and thus can act as a supplementary source of name labels.
In order to get name labels from the regions of interest, we (i) obtain lists of the most popular regional celebrities (e.g. https://www.socialbakers.com/statistics/twitter/profiles/kenya/); (ii) collect the Twitter profiles of all of these celebrities' followers. Each profile record contains "name" and "location" fields, though many users leave the latter blank. In summary, we gathered 43M unique Twitter user profiles, of which 9M have a non-empty "location" field and well-formed names (e.g. two name parts, each longer than one character). However, these location tags are not well defined: among the 9M profiles there are 1.5M unique locations. Some are simply noise, while others offer too much detail (e.g. a university name without country information). Therefore, we use the Google Maps API (https://developers.google.com/maps/) to query for country names using the 10% most popular "locations". As a result, we have 6M labeled names, supplementary to the labels from the email source.
Since our labeled names are collected from the Internet, it is important to check their quality. In this subsection, we provide an interesting perspective for validating the quality of the datasets.
We compute the similarities between countries using aggregations of names, and check whether they agree with common sense. In fact, we observe that the cultural/spatial closeness between countries is well captured by country name similarities. Take the African continent as an example (Fig. 6). In the bottom-right part of the figure, the continent map is divided into 4 major parts based on how close they are culturally and geographically. In the remaining part of the figure, countries with name labels are colored in accordance with the map. It is apparent that countries with the same color are clustered, indicating that nearby countries have similar names. One interesting case is that Angola is connected with Mozambique, even though one is on the west coast of the continent while the other is on the east coast. The reason is that both countries were once colonized by the Portuguese, so many Internet users there have Portuguese names.
We compute the similarities between countries with the following steps: (i) aggregate the name parts of each country, so that countries are represented by name part count vectors, where each dimension indicates how many times a name part occurs in that country; (ii) compute the cosine similarity between vectors, i.e. the name similarity between countries. Note that in Fig. 6, the thickness of an edge indicates the magnitude of the similarity. An edge is drawn if either the similarity is larger than 0.5, or it is needed to ensure that each country is linked to at least its most similar country. Therefore, Ethiopia is linked to Sudan with a very small weight, even though it is distinct from the other countries.
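Steps (i) and (ii) can be sketched directly; the name lists below are invented for illustration:

```python
from collections import Counter

def country_similarity(names_a, names_b):
    """Cosine similarity between two countries' name part count vectors."""
    va, vb = Counter(names_a), Counter(names_b)
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = sum(c * c for c in va.values()) ** 0.5
    norm_b = sum(c * c for c in vb.values()) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy name-part samples: shared Portuguese surnames link the first two.
angola = ["silva", "santos", "silva", "pereira"]
mozambique = ["silva", "santos", "machel"]
kenya = ["otieno", "wanjiru", "kamau"]
print(country_similarity(angola, mozambique))  # high overlap
print(country_similarity(angola, kenya))       # no overlap
```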
In this subsection, we first compare our method with existing systems on smaller nationality taxonomies (a 13-leaf taxonomy and a 10-leaf flat taxonomy (Ambekar et al., 2009; Torvik and Agarwal, 2016; Treeratpituk and Giles, 2012)). Note that we use their Web APIs to collect the classification results. Two independent datasets are used: the smaller one is from Wikipedia (used in (Ambekar et al., 2009; Treeratpituk and Giles, 2012)), the other is from our test set of labeled names. Finally, we give more details about NamePrism's performance on the finer-grained nationality taxonomy.
Ambekar et al. proposed an HMM-based method which uses signals from substrings of names to classify name nationalities (Ambekar et al., 2009). Their taxonomy contains 13 leaf nodes and 18 nodes in total (see (Ambekar et al., 2009) for the definition of this taxonomy). For comparison, all methods need to be evaluated on the same taxonomy. HMM is designed for this taxonomy; NamePrism and Ethnea are adapted to it, as both are defined on finer-grained taxonomies. EthnicSeer is compared separately on a flat 10-nationality taxonomy.
Two datasets are available for comparison: (i) the labeled names from Wikipedia (150K in total, the same dataset used to train HMM and EthnicSeer); (ii) our Email/Twitter data, divided into training and testing sets (60% vs. 40%). We then sample 2% of the test data (380K names) for evaluation, because it is not efficient to collect classification results from the baselines' Web APIs. Some small nationalities are given a larger sampling ratio to obtain large enough test samples.
As shown in Tab. 2, we compare the results of five methods: HMM (Ambekar et al., 2009), Ethnea (Torvik and Agarwal, 2016), Embd, NamePrism, and NamePrism with world-population priors. Embd uses only the parameters estimated from name embeddings. The NamePrism variants perform best on most classes for both datasets. On the Wikipedia data, our methods achieve the best performance on 15 (out of 18) classes; some classes get a +10% F1 boost, including Indian, Nordic and EastAsian. On the Email/Twitter data, the improvement is more significant: NamePrism outperforms the rest on all classes, with some classes, including Muslim and African, improving by +30%. Note that Embd also achieves considerably high performance, indicating that name embeddings capture nationality signals well.
EthnicSeer is defined on a 10-leaf flat taxonomy. For comparison purposes, we removed labeled names of the African, Jewish and Nordic classes from both datasets, and shrank NamePrism's 39-leaf taxonomy to fit this smaller one. The weighted average F1 score shows EthnicSeer performs slightly better on Wikipedia, but that is the same dataset EthnicSeer was trained on. In contrast, NamePrism performs significantly better on the Email/Twitter test set.
Tab. 4 shows NamePrism's F1 scores on the large nationality taxonomy. Note that we randomly split the Email/Twitter data into training and testing sets (60% vs. 40%) three times; all reported performances of our methods (i.e. Embd and the NamePrism variants) are average F1 over the 3 runs, and the standard deviations are all below 0.005. As we can see from Tab. 4, NamePrism performs well on most nationalities. For some less developed countries with few Internet users, including Central Asian and Maghreb countries, we have a limited number of name labels and contact lists, so performance on these nationalities is limited. To the best of our knowledge, our work is the first effort to classify names belonging to these regions.
As mentioned in Sec. 3, the U.S. Census Bureau defines six race/ethnicity categories: White, Black, API, Hispanic, AIAN and 2PRACE. In order to build a classifier for these ethnicities, we need labeled names to estimate the parameters. Fortunately, the U.S. Census Bureau publishes the ethnicity distributions of popular last names. We can estimate the ethnicity distributions of first names by connecting the census labels with email names from the U.S.
More formally, let L_c be the set of popular last names from the Census Bureau, for which we have ground truth posteriors P(e | l), where e denotes ethnicity. We can estimate the posteriors of first names with Eq. 8:

P(e | f) = (1/|L(f)|) Σ_{l ∈ L(f)} P(e | l)    (8)

where L(f) is the list of last names paired with first name f in the U.S. email data. Note that some of the last names paired with f may not have a ground truth label (i.e. they are not in L_c). To make the estimation reliable, we only keep first names for which at least half of the paired last names have a ground truth label. Thus we obtain a set of first names with estimated ethnicity distributions, and we can get the likelihoods for first and last names by applying Bayes' rule.
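This estimation can be sketched assuming simple averaging over the labeled last names; the census table and name pairs below are invented for illustration:

```python
from collections import defaultdict

def first_name_posteriors(full_names, last_posterior, min_labeled=0.5):
    """Estimate P(ethnicity | first) by averaging the census posteriors
    P(ethnicity | last) over the last names each first name co-occurs
    with; first names whose paired last names are labeled less than
    min_labeled of the time are dropped as unreliable."""
    paired = defaultdict(list)
    for first, last in full_names:
        paired[first].append(last)
    out = {}
    for first, lasts in paired.items():
        labeled = [l for l in lasts if l in last_posterior]
        if len(labeled) / len(lasts) < min_labeled:
            continue
        eths = {e for l in labeled for e in last_posterior[l]}
        out[first] = {
            e: sum(last_posterior[l].get(e, 0.0) for l in labeled) / len(labeled)
            for e in eths
        }
    return out

census = {"garcia": {"Hispanic": 0.9, "White": 0.1},
          "lopez": {"Hispanic": 0.8, "White": 0.2}}
names = [("maria", "garcia"), ("maria", "lopez"), ("alex", "rarename")]
post = first_name_posteriors(names, census)
# "maria" is kept (2/2 paired last names labeled); "alex" is dropped (0/1)
```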
So far, the ethnicity classifier can handle names with popular first/last names. For rare names, we make use of the Email/Twitter name labels: the 118 countries are assigned to the six ethnicities based on their definitions. For example, we label names from European countries as White and names from Asian countries as API. We can then follow similar steps to Algorithm 1. The difference is that we first check whether a name part appears in the census-derived sets; if so, we use those posteriors for the prediction, because they are estimated from ground truth with high confidence. Otherwise, we check the remaining vocabularies as in Algorithm 1 and follow the remaining steps.
Nationality and ethnicity classification have broad applications in sociological research and media analysis. Here we present some interesting observations from applying our classifiers to the followers of Twitter celebrities.
To collect data, we identified the 100 most followed celebrities in each of six categories: actors, singers, news, athletes, governments and politicians, all of whom have from 1M to 100M followers. For each celebrity, we selected 50,000 random followers, and filtered out accounts with irregular names using the same method as discussed in Sec. 4.3.1. We then applied our nationality and ethnicity classifiers to the remaining followers.
Our primary observations here include:
Ethnicity and the 2016 U.S. Presidential Election – There has been considerable concern that the recent election exacerbated tensions between ethnic groups in the United States. Indeed, our analysis of the U.S.-based followers of the primary figures in the race (Obama, Clinton, and Trump) shows stark differences in composition. Fig. 8 shows that Whites are substantially overrepresented among Trump's followers, while Clinton and Obama have disproportionately more followers among minorities.
Interests and Ethnicity – Fig. 7 similarly breaks down the followers of major celebrities in the sports, entertainment, and news categories. The followers of cricket and Bollywood stars are overwhelmingly Indian, while Hispanics disproportionately favor soccer and boxing.
Anomaly Detection through Nationality Analysis – We were surprised to learn that an Indonesian politician named Jeffrie Geovanie is one of the most heavily followed figures on Twitter, because there are only 45K Google search results about him, mostly in Indonesian. Yet our name analysis of his followers shows that only 13% are Indonesian, with over 50% of British, Russian, or Indian nationality (Fig. 9). This is quite peculiar given that Indonesian is the primary language of his Twitter stream.
We demonstrate that homophily patterns in communications can be exploited to learn name embeddings that capture interesting properties of gender, nationality and ethnicity. Further, we use these embeddings to build state-of-the-art name nationality and ethnicity classifiers. Through extensive experiments, we show that NamePrism substantially outperforms existing methods on two independent datasets. Finally, we apply our classifiers to the followers of Twitter celebrities, with interesting results.
We believe that NamePrism will become an important tool for biomedical and sociological research. Future work revolves around applying name embeddings to other classification tasks, such as those that arise in demographics, security and social media analysis.
Gouws, S., Bengio, Y., and Corrado, G. BilBOWA: Fast bilingual distributed representations without word alignments. ICML (2015).
Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. A neural probabilistic language model. Journal of Machine Learning Research 3, Feb (2003), 1137–1155.