Rethnicity: Predicting Ethnicity from Names

09/19/2021 ∙ Fangzhou Xie (Rutgers University)

I provide an R package, rethnicity, for predicting ethnicity from names. I use a Bidirectional LSTM as the model and the Florida Voter Registration data as training data. Special care is given to the accuracy for minority groups by adjusting for the imbalance in the dataset. I also compare the availability, accuracy, and performance of the package with other solutions for predicting ethnicity from names. A sample code snippet and an analysis of the DIME dataset are shown as applications of the package.


1 Introduction

Studies of the differential effects of ethnicity usually require researchers to have ethnic information available in their datasets. However, such information is often not readily available (health care is one of the literatures that often needs to study ethnic disparities in insurance plans, and researchers in this field must deal with missing ethnic information; see fiscella2006). When only names are available in the dataset, one naturally wants to predict people's ethnicity from them, as names are usually highly correlated with race.

In fact, surname analysis of ethnicity has been used for many years (see fiscella2006 for a survey), but the application of deep learning can make it even simpler, as illustrated by sood2018.

In this note, I offer a novel approach to predicting ethnicity from names and provide an R package, rethnicity (https://github.com/fangzhou-xie/rethnicity). I demonstrate that it achieves good accuracy while also being fast and free.

The rest of the article is organized as follows. Section 2 describes the methodology of the package. Section 3 offers details on the implementation of the model. Section 4 highlights the notable features unique to the package. Section 5 compares its availability, accuracy, and performance against other solutions. Section 6 shows an example code snippet and applies the package to an analysis of racial differences in political donations. Lastly, Section 7 concludes the paper.

2 Methodology

In this section, I briefly introduce the methodology behind the prediction method offered by this package and the procedures taken to build it.

2.1 Undersampling for the Imbalanced Racial Distribution

Most classification algorithms assume a relatively balanced dataset and equal misclassification costs (sun2009). When applied to imbalanced data, where some classes have significantly more instances than others, these algorithms mainly focus on the majority class and hence ignore the minority classes. One example is fraud detection, where most transactions are normal but a small number are fraudulent (fawcett1997).

In our application, this problem is also of concern. We are trying to predict ethnic groups from people's names, and this is naturally an example of classification with imbalanced data. (We take the dataset from the Florida Voter Registration records (sood2017), so the predicted ethnic groups are defined in the U.S. context. One can, of course, apply the same methodology and build a classifier for another country or region, given access to a suitable dataset where both names and races are available.)

To overcome this problem, one important method is to over-sample the minority class (chawla2002; fernandez2018). However, since we have abundant data points in our case, I decided instead to under-sample the majority classes to achieve a balanced dataset. This also helps reduce training and testing time for the model, as we have a very large dataset (details in Section 3.1).

What is more, first names are not only associated with gender but also correlated with race (fryer2004). Hence, I group the dataset by both ethnicity and gender and undersample every group except the smallest one. Later on, I train two different models for the classification of ethnicity: one using only last names, and another leveraging first names as well. It is thus crucial to adjust the dataset on both gender and race to avoid disproportionate classification errors on minority classes.
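To make the procedure concrete, here is a minimal sketch of the grouped undersampling, assuming a data frame florida with columns race and gender (the column names and the dplyr-based approach are illustrative assumptions, not the package's internal code):

library(dplyr)

set.seed(1)  # for reproducible sampling
# size of the smallest race-gender group (Asian Male in Table 1)
min_n <- florida %>% count(race, gender) %>% pull(n) %>% min()

balanced <- florida %>%
  group_by(race, gender) %>%
  slice_sample(n = min_n) %>%  # randomly keep min_n rows per group
  ungroup()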

2.2 Character-level Dictionary

Classic Natural Language Processing (NLP) models consider the "token" to be the building block of language. This seems natural to humans, since we consider words and phrases to be the smallest elements in our daily use of language. However, for sentences to be processed by algorithms, this requires us to tokenize the sentences, build a vocabulary, and then build the model on top of that vocabulary. This quickly becomes cumbersome as the data grows and we need to maintain an extremely large vocabulary, in which some tokens are very common while a large number are extremely infrequent (zipf1936).

Moreover, the vocabulary is usually built at training time, so there may be out-of-vocabulary (OOV) tokens at inference time once the model is deployed. The usual practice is to map OOV tokens to some "unknown" token, treating every token not seen during training as the same one. This inevitably loses information, since some tokens are extremely informative yet infrequent (e.g., special words or domain-specific abbreviations).

To overcome this, there have been efforts in the machine learning community to build models directly on characters instead of tokens (zhang2015; sutskever2011). (Those character-level models only consider the 26 English letters and some symbols; recently, xue2021 proposed a byte-based model, fully compliant with the UTF-8 standard and capable of dealing with non-English text as well.) It is easier to enumerate all possible characters and maintain a dictionary of them than a dictionary of distinct tokens: in English, we only need 26 letters plus a few symbols. One could use a larger dictionary by including upper-case letters as well, but for the classification of names it is better to use only lower-case letters, since upper-case letters occur rarely and the model may not see enough of them to learn from. Hence we avoid the OOV problem entirely, since every character is already in the dictionary.

The other benefit of a character-level dictionary is that it greatly reduces the dictionary size, and hence the number of parameters in the model, since many, if not most, parameters in NLP models are needed solely to capture the meaning of tokens (xue2021). This way, we can make the model lightweight without sacrificing accuracy and gain speed at the inference stage. (Training models with a character-level dictionary may, however, be harder than training token-level models; this is arguably the only drawback of the approach.)

2.3 Bidirectional LSTM

Long Short-Term Memory (hochreiter1997, LSTM) has been widely used in sequence modeling since its proposal, and sequence modeling has since seen further exciting developments such as BERT (devlin2019) and its variants. Moreover, graves2005 proposed the Bidirectional LSTM (BiLSTM), which captures context even better than the unidirectional LSTM. (A BiLSTM runs both a forward-passing and a backward-passing LSTM to capture the context of sequential data, which is part of the reason it works well. A BiLSTM cannot be used for real-time prediction, but that is of little concern for our name classification task.)

In this package, I use the BiLSTM as the architecture of the model for predicting race from names. The model consists of an Embedding layer with 256 units, followed by 4 Bidirectional LSTM layers with 512 units each. The final output layer is a Dense layer with 4 units (equal to the number of races in the classification problem) and a softmax activation function. (The test accuracy is given in Section 5.2. The last name model and the full name model share the same architecture and differ only in their training data.)
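For illustration, a minimal Keras-for-R sketch of this architecture follows. The optimizer, loss, and the input length of 20 (the full name model) are assumptions for the sketch, not the package's actual training script; the dictionary size of 29 is taken from Section 3.2.

library(keras)

teacher <- keras_model_sequential() %>%
  layer_embedding(input_dim = 29, output_dim = 256, input_length = 20) %>%
  bidirectional(layer_lstm(units = 512, return_sequences = TRUE)) %>%
  bidirectional(layer_lstm(units = 512, return_sequences = TRUE)) %>%
  bidirectional(layer_lstm(units = 512, return_sequences = TRUE)) %>%
  bidirectional(layer_lstm(units = 512)) %>%      # final layer returns a vector
  layer_dense(units = 4, activation = "softmax")  # one unit per ethnic group

teacher %>% compile(optimizer = "adam",
                    loss = "categorical_crossentropy",
                    metrics = "accuracy")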


2.4 Distillation of Knowledge

To overcome the difficulty of training models with a character-level dictionary, I train the BiLSTM model with many parameters for better accuracy. But the trained model is very large and would be difficult to deploy in production.

To compress the model, hinton2015 proposed the "distillation" technique for extracting information from a large model and teaching a smaller model to achieve similar predictions (or taking the predictions of an ensemble of models and distilling the information into a single small model). More precisely, a "student" model is trained to match the "teacher" model, and the knowledge is considered to be transferred from teacher to student. This way, the student "learns" the interclass relationships better than it would by learning directly from the data.

I apply this distillation trick to the trained large model to obtain a smaller model with the same kind of architecture but fewer parameters and layers. (The student model also stacks Dense, Bidirectional LSTM, and Dense layers, but with only 32 units for the first layer and 64 units each for 2 layers of BiLSTM.) The compressed student model is the one I use for inference in production. (The accuracy comparison is shown in Section 5.2.)
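A minimal sketch of the distillation step, assuming the teacher model from the sketch above and an integer-encoded training matrix x_train. The softening temperature, the training hyperparameters, and the use of an embedding as the student's 32-unit first layer are illustrative assumptions, not details from the paper.

temperature <- 2
# soften the teacher's predicted distribution to form the training targets
p_teacher <- predict(teacher, x_train)
soft <- exp(log(p_teacher + 1e-8) / temperature)
soft <- soft / rowSums(soft)

student <- keras_model_sequential() %>%
  layer_embedding(input_dim = 29, output_dim = 32, input_length = 20) %>%
  bidirectional(layer_lstm(units = 64, return_sequences = TRUE)) %>%
  bidirectional(layer_lstm(units = 64)) %>%
  layer_dense(units = 4, activation = "softmax")

student %>% compile(optimizer = "adam", loss = "categorical_crossentropy")
# the student matches the teacher's soft targets instead of the hard labels
student %>% fit(x_train, soft, epochs = 10, batch_size = 256)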

2.5 Export to C++

After training the student model, it is converted to JSON format via the frugally-deep project (https://github.com/Dobiasd/frugally-deep). The JSON file holds all the parameters of the model, which then no longer depends on an installation of Keras (or tensorflow). We can later load the model directly in C++ with very few dependencies. (frugally-deep is a lightweight header-only C++ library that depends only on the FunctionalPlus, Eigen, and json projects, all of which are themselves header-only.)

To make the model callable from R, we create an interface with Rcpp (eddelbuettel2011). This gives us a wrapper around the underlying C++ code that loads the model and runs predictions on names. The prediction can also be parallelized via multi-threading. These features enable extremely fast prediction of ethnicity from names.

3 Data and Preprocessing

3.1 Name and Ethnicity Data

To train the ethnicity classification model from names, we need a dataset of names along with individual-level racial information. Fortunately, the Florida Voter Registration Dataset (sood2017) is a very good candidate for this purpose.

However, the dataset contains names and races for almost 13 million people in Florida, and the racial distribution is naturally imbalanced. First, I dropped the names of Native Americans (defined in the Florida Voter Registration dataset as "American Indian or Alaskan Native") and multi-racial individuals, since we have very little data on these groups. Further, I define Asian or Pacific Islander, Hispanic, Non-Hispanic Black, and Non-Hispanic White as the 4 categories for the classification problem (hereafter called Asian, Hispanic, Black, and White, respectively).

Table 1 lists the frequency of names, grouped by ethnic group and gender. The undersampling procedure discussed in Section 2.1 takes the smallest group, namely Asian Male, and randomly samples from all other groups so that every group ends up the same size. After undersampling, each race-gender group contains exactly the same number (104,632) of names.

Race Gender Count Before Count After
Asian Female 131602 104632
Asian Male 104632 104632
Black Female 989142 104632
Black Male 717118 104632
Hispanic Female 1137594 104632
Hispanic Male 925623 104632
White Female 4419030 104632
White Male 3963833 104632
Table 1: Count of Names, Grouped by Ethnicity and Gender.

3.2 Character Encoding

For the characters to be processed by the algorithm, we need to encode them as numeric values. Since we are building a character-level dictionary (I take the 26 lower-case English letters, a space character (" "), an Empty character ("E"), and an Unknown character ("U"), so the dictionary has size 29), the dictionary is small and pre-defined. We can therefore map every character in every name to a value in the dictionary.

In practice, I first map upper-case letters to their lower-case counterparts, then map all punctuation symbols to space and all other characters to "U" (Unknown). Further, to make all inputs equal in length, I insert "E" (Empty) for names shorter than the threshold and trim the extra characters from names longer than it. (The threshold is 10 for each name component. In other words, for the last name model the input length is 10; the full name model takes both a first name and a last name, each of length 10, for a final length of 20.) After this, the input name is transformed into a vector of integers, according to the mapping described in Section 2.2.
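A minimal sketch of this encoding, using the 29-character dictionary described above (the helper function is illustrative, not the package's internal implementation):

chars <- c(letters, " ", "E", "U")  # 26 letters + space + Empty + Unknown = 29

encode_name <- function(name, len = 10) {
  x <- strsplit(tolower(name), "")[[1]]   # lower-case and split into characters
  x[grepl("[[:punct:]]", x)] <- " "       # punctuation -> space
  x[!x %in% c(letters, " ")] <- "U"       # anything else -> Unknown
  x <- c(x, rep("E", len))[seq_len(len)]  # pad with Empty or trim to `len`
  match(x, chars)                         # characters -> integer indices
}

encode_name("Smith")  # 19 13 9 20 8 28 28 28 28 28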

3.3 Sequence Padding and Alignment

People's names differ in the number of characters, and so do their encoded numeric representations. However, the model only accepts input vectors of equal length. It is therefore necessary to fix an expected input length and process the data so that each string has that length: longer strings are truncated and shorter ones are padded. (This process is called "padding" and is standard practice in preprocessing for RNN models.)

Since I have chosen to pad the sequences to length 10, all surnames have the same length after padding. For example, for the last name "Smith", we first lower the case to get "smith" and then add 5 Empty ("E") characters to obtain a vector of length 10. Conversely, for "Christensen" we again lower the case first and then trim the final "n".

For the model that leverages both first names and last names, special care has to be taken. Since first names also vary in length, if we concatenated the first and last name into a single string, the starting position of the last name would also vary. (This fact makes an unadjusted full name model insensitive to names whose first name and last name do not come from the same ethnic group, e.g., descendants of immigrants. Early models trained without adjusting the position of last names tended to focus more on first names and, for instance, predicted "Andrew Yang" as White instead of Asian; hence the need for an alignment procedure.) The solution is to pad the first names and last names separately and then concatenate them into a single vector. This guarantees the same starting position of all last names across the sample. In particular, both first names and last names are padded to 10 characters, so the concatenated input for the full name model has length 20 in total, and the last name always starts at position 11.
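Using the encode_name() helper from the sketch above, the alignment for the full name model can be expressed as:

# pad first and last names separately, then concatenate, so the last
# name always starts at position 11 of the length-20 input vector
encode_fullname <- function(first, last) {
  c(encode_name(first, 10), encode_name(last, 10))
}

encode_fullname("Andrew", "Yang")  # integer vector of length 20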

4 Features of the Package

In this section, I lay out several notable features of the package.

4.1 Native in R

The R community has incorporated mature deep learning libraries from Python (falbel2021; kalinowski2021) via the reticulate package (ushey2021). In essence, users need to install a separate Python environment with the required libraries (e.g., tensorflow), and reticulate then provides the interface from R to Python so that we can build neural networks in R.

In practice, this can be problematic for researchers, as the approach relies heavily on another language (namely, Python) and may cause issues when replicating studies. Moreover, most applied researchers only need to run inference on the dataset at hand. Installing and maintaining those libraries is often troublesome even for veterans. (The installation of those libraries and their dependencies is not trivial, and even after a correct installation they take up multiple gigabytes of storage.)

To ease the burden on users, I separate the modeling and deployment processes, so that end-users only need to install minimal dependencies on their machines. After training the teacher and distilling the student model, I export the student model to C++ via frugally-deep. At this stage, the model no longer depends on any deep learning framework. Further, to run inference in R, I use Rcpp to provide an interface to the C++ model and make it callable from R. The package is therefore lightweight, with no need for external languages (e.g., Python), for anyone who wants to predict ethnicity from names.

4.2 Mature Dependencies

The torch package (falbel2021a) aims to bring PyTorch natively to R, in sharp contrast to the approach of tensorflow and keras for R. However, this project is experimental and not yet ready for production. Besides, the approach still requires installing the entire, gigantic torch library. Again, such an installation may not be desirable for many users.

The rethnicity package has only three dependencies: Rcpp (eddelbuettel2011), RcppThread (nagler2021), and RcppEigen (bates2013), all well-tested, mature packages published on CRAN and widely used by many other packages in the R community. In particular, Rcpp provides the interface to C++, RcppThread provides multi-threading support for fast inference, and RcppEigen provides fast and efficient matrix computation for the neural network. (Apart from these three packages, rethnicity also depends on frugally-deep and its dependencies, but those header-only libraries are bundled into the package during installation, so there is no need to link against external packages.)

4.3 High Performance

With the rise of empirical economics, economists increasingly find themselves dealing with larger and larger datasets (einav2014). It is therefore critical to build packages with performance in mind, so that others can process their datasets fast enough.

In the domain of predicting ethnicity/nationality from names, there are some good API services (for example, nationalize.io, https://nationalize.io/, and NamePrism, https://www.name-prism.com/). However, as is the case for most API services, they are rate-limited to keep the service reliable and sustainable. This can create bottlenecks for researchers with a large collection of names waiting to be predicted.

The rethnicity package is built with this concern in mind. The trained model is exported to C++ via the frugally-deep project (which achieves very fast performance, on par with single-core tensorflow; see https://github.com/Dobiasd/frugally-deep#performance) and then made available in R via Rcpp (eddelbuettel2011). Moreover, I leverage multi-threading via RcppThread, which further boosts inference performance. (This is possible because models exported by frugally-deep are thread-safe. See Section 5.3 for a detailed comparison against the ethnicolr package in Python.)

5 Comparison with Existing Packages

5.1 Availability

Table 2 shows the differences between rethnicity and other solutions for predicting race from names. The comparison covers 4 aspects: cost, whether the service is rate-limited, whether heavy dependencies are required, and the language in which it is implemented.

rethnicity ethnicolr NamePrism nationalize.io
Cost free free free free & paid
Rate Limit No No Yes Yes
Dependency Low High N/A N/A
Language R Python API API
Table 2: A comparison across some publicly available services/packages for predicting ethnicity from names. rethnicity aims to provide a free and light-weight package for the R community without rate-limiting.

NamePrism is free but rate-limited to 60 requests per minute. nationalize.io offers 1000 free requests per day and requires a paid subscription to process more names per day. ethnicolr is perhaps the closest in scope to rethnicity, but it is written in Python and requires an installation of tensorflow to run inference, which may be daunting for those who only want to run inference on a name dataset they have at hand.

In general, what I want to achieve is a package that is simple, easy, and free to use for the R community, with guaranteed accuracy and performance. The rethnicity package does exactly that.

5.2 Accuracy

Tables 3 and 4 show the prediction accuracy on test data unseen during training: Table 3 for the teacher model and Table 4 for the student model. The full name model performs better than the model that only leverages last name information, for both the teacher and the student. The student model loses some precision compared to the teacher, but remains sufficiently accurate and close to its teacher.

Moreover, comparing the results within each ethnic group, the accuracy across groups is roughly balanced, performing slightly better for the minority groups. This holds for both the teacher and the student models, suggesting that our undersampling approach for adjusting the imbalance in the dataset works well. Indeed, compared to ethnicolr (sood2018), rethnicity shows significantly better results for the prediction of Asian, Hispanic, and Black people, albeit losing some precision for White people. (The accuracies in sood2018 are disproportionately high for White people, which might suggest that their classifier tends to always predict White to minimize loss. This is why I adjust for the imbalanced dataset, as discussed in Section 2.1.)

Fullname Lastname
precision recall f1-score precision recall f1-score support
asian 0.87 0.76 0.81 0.87 0.69 0.77 41861
black 0.74 0.77 0.76 0.65 0.80 0.72 41904
hispanic 0.86 0.87 0.86 0.84 0.85 0.85 41940
white 0.67 0.73 0.70 0.62 0.58 0.60 41707
total 0.79 0.78 0.78 0.74 0.73 0.73 167412
Table 3: Accuracy on the test data for the teacher model before distillation.
Fullname Lastname
precision recall f1-score precision recall f1-score support
asian 0.86 0.73 0.79 0.84 0.64 0.73 41861
black 0.70 0.76 0.73 0.61 0.75 0.67 41904
hispanic 0.83 0.87 0.85 0.80 0.84 0.82 41940
white 0.67 0.68 0.68 0.57 0.53 0.55 41707
total 0.77 0.76 0.76 0.70 0.69 0.69 167412
Table 4: Accuracy on the test data for the student model after distillation.

5.3 Performance

Figure 1: Comparison of elapsed time between rethnicity and ethnicolr. At each sample size, I run the inference 5 times and take the average running time as the elapsed-time measurement. The comparison is also made for different numbers of threads available to rethnicity: the default single-threaded inference speed is shown as "rethnicity_0" in the plot, along with performance under 2-thread, 4-thread, and 8-thread pools.

The performance of the package is achieved by leveraging distillation for model compression, and C++ and multi-threading for low overhead, as discussed in Sections 2.4 and 4.3. But to put the speed into perspective, we still need to test the performance rigorously and compare it against the ethnicolr package as a baseline.

In Figure 1, we can see that the single-threaded performance is on par with that of ethnicolr, and the multi-threaded modes achieve further speed-ups. First, the distillation method successfully compresses the model and gains performance from the smaller model: comparing the single-threaded, distilled model in rethnicity with the multi-threaded, larger model in ethnicolr, the inference speeds are roughly comparable. This suggests that distillation alone closes the gap created by multi-threaded tensorflow. Second, there is very little overhead from multi-threading: the speedup is almost linear in the number of threads available, so the threading is very efficient. In practice, when processing a large dataset, one may want to leverage more threads, depending on the size of the dataset and the total threads available on the machine.
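A sketch of such a timing exercise in R, assuming a character vector nm of sample last names (the helper below mirrors the 5-run averaging used for Figure 1 but is not the actual benchmarking script):

library(rethnicity)

time_avg <- function(...) {
  mean(replicate(5, system.time(
    predict_ethnicity(lastnames = nm, method = "lastname", ...)
  )["elapsed"]))
}

time_avg()                                              # default single-threaded run
sapply(c(2, 4, 8), function(k) time_avg(threads = k))   # multi-threaded pools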

6 Using the Package

6.1 Code Snippet

The usage of the package is very straightforward, as there is only one exported function. (More examples can be found at the GitHub repository: https://github.com/fangzhou-xie/rethnicity.)

> predict_ethnicity(firstnames = "Morgan", lastnames = "Freeman")
  firstname lastname prob_asian prob_black prob_hispanic prob_white  race
1    Morgan  Freeman 0.06973174  0.5086463    0.02247369  0.3991482 black
Listing 1: Example of the predict_ethnicity function.

There are only 5 arguments for the function predict_ethnicity: firstnames, lastnames, method, threads, and na.rm.

The firstnames argument accepts a vector of strings (a character vector in R) and is only needed when method = ‘fullname’.

lastnames also accepts a character vector and is needed for both method = ‘fullname’ and method = ‘lastname’.

method can only be either ‘fullname’ or ‘lastname’, indicating whether to work with last names only or with both first names and last names.

threads accepts an integer greater than 1 to leverage multi-threading for even faster processing of the data. (Theoretically, one could set it to the number of threads on the machine; but the more threads used, the more overhead is introduced by parallel processing, and the smaller the marginal performance gain.)

Lastly, there is the na.rm argument, which removes missing values from the input names. (For the last name model, only non-missing names are kept for processing and returned. For the full name model, which needs both first and last names, only names with both components present are processed.) Otherwise, an error is thrown if any of the input data has missing values. This guarantees that the model receives valid input and returns meaningful predictions.
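For example, a hypothetical batch call over a data frame donors (an illustrative object, not shipped with the package) might look like:

library(rethnicity)

results <- predict_ethnicity(
  firstnames = donors$firstname,
  lastnames  = donors$lastname,
  method     = "fullname",
  threads    = 4,      # use 4 threads for faster processing
  na.rm      = TRUE    # drop names with missing components
)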

6.2 DIME data

The DIME dataset offers rich information on campaign finance and ideology (bonica2014; bonica2019). Following the practice of sood2018, I also use this dataset to illustrate one potential application of the rethnicity package.

I take all the donors in the dataset, predict their races using the full name model, aggregate the total donation amounts by predicted race, and then calculate the share of donations across ethnicities. The results for 2000 and 2010 are shown in Table 5.

rethnicity ethnicolr
2000 2010 2000 2010
asian 6.29% 5.90% 2.00% 2.28%
black 20.83% 18.00% 8.93% 7.92%
hispanic 4.01% 4.44% 3.23% 3.31%
white 68.87% 71.66% 85.84% 86.49%
Table 5: Comparison of total donation amounts grouped by the predicted race of donors. The right half of the table is taken from sood2018.
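The aggregation itself is straightforward. A minimal sketch, assuming a data frame dime with columns firstname, lastname, and amount (hypothetical column names, not the actual DIME schema) and using dplyr:

library(dplyr)
library(rethnicity)

preds <- predict_ethnicity(firstnames = dime$firstname,
                           lastnames  = dime$lastname,
                           method     = "fullname")

dime %>%
  mutate(race = preds$race) %>%
  group_by(race) %>%
  summarise(total = sum(amount)) %>%
  mutate(share = total / sum(total))  # donation share by predicted race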

7 Conclusion

In this article, I demonstrate the methodology and potential usage of the rethnicity package in R.

The package leverages several techniques to perform ethnicity prediction. First, I use undersampling to adjust for the imbalance of the racial distribution in the dataset. Second, I use a character-level dictionary to reduce the dictionary size and make it independent of the training data. Third, I choose the BiLSTM architecture for its strength in capturing context. Fourth, after training the gigantic teacher model, I distill its knowledge by letting it instruct a much smaller student model. Last, the student model is exported to C++ and loaded via Rcpp.

To train the model, I take the Florida Voter Registration dataset and use the voters' names, along with their identified ethnicity, for training. After training the large model, a smaller student model is also trained and tested.

My objective in building this package is to ease the burden of installation and usage for anyone interested in predicting ethnicity from names for their research. The package is entirely native to R, with its only dependencies being several mature packages published on CRAN. It also achieves very high performance, especially for ethnic minorities, by delegating the heavy computation to multi-threaded C++. All these efforts come together in the rethnicity package, which is free, fast, and available to the R community.

A code snippet is given as an example of the package's usage, and an application to data on the finance and ideology of political candidates is also illustrated.

References