Entity-Switched Datasets: An Approach to Auditing the In-Domain Robustness of Named Entity Recognition Models

04/08/2020 ∙ by Oshin Agarwal, et al. ∙ Google ∙ Northeastern University ∙ University of Pennsylvania

Named entity recognition systems perform well on standard datasets comprising English news. But given the paucity of data, it is difficult to draw conclusions about the robustness of systems with respect to recognizing a diverse set of entities. We propose a method for auditing the in-domain robustness of systems, focusing specifically on differences in performance due to the national origin of entities. We create entity-switched datasets, in which named entities in the original texts are replaced by plausible named entities of the same type but of different national origin. We find that state-of-the-art systems' performance varies widely even in-domain: in the same context, entities from certain origins are more reliably recognized than entities from elsewhere. Systems perform best on American and Indian entities, and worst on Vietnamese and Indonesian entities. This auditing approach can facilitate the development of more robust named entity recognition systems, and will allow research in this area to consider fairness criteria that have received heightened attention in other predictive technology work.







1 Introduction

Named Entity Recognition (NER) systems work well on domains such as English news, achieving high performance on standard datasets such as MUC-6 Grishman and Sundheim (1996), CoNLL 2003 Tjong Kim Sang and De Meulder (2003) and OntoNotes Pradhan and Xue (2009). Research in other areas of predictive technology has revealed, however, that ostensibly strong predictive performance may obscure wide variation in performance for certain types of data. For example, gender recognition systems attain high accuracy on what used to be considered standard datasets for this task, but have large error rates on people with dark skin tone, particularly on women with dark skin Buolamwini and Gebru (2018). Language identification is also highly accurate on standard datasets Zhang et al. (2018) but may fail to recognize dialects, e.g., failing to identify African American English as English Blodgett et al. (2016).

Original sentence: Defender Hassan Abbas rose to intercept a long ball …
New sentence: Defender Ritwika Tomar rose to intercept a long ball …

Original sentence: The Democratic Convention signed an agreement on government and parliamentary support with its coalition partners the Social Democratic Union and the Hungarian Democratic Union.
New sentence: The Democratic Convention signed an agreement on government and parliamentary support with its coalition partners the Jharkhand Mukti Morcha and the Mizo National Front.

Table 1: Example of switching entities.

Here, we set out to develop methods for testing two intertwined properties of NER models: (i) Their robustness on a variety of entities, and, (ii) their relative performance across groups, which here correspond to national origin. It is known that the robustness of NER methods depends on entities being represented in the training data Augenstein et al. (2017). Probing their performance through the lens of national origin is then one way to question the choice of training data, and what a system can learn from it. This has important fairness implications: Our approach enables auditing of the systems for entity groups, e.g., ethnic groups within a country of origin, in line with guidelines for model card reporting of system strengths and weaknesses Mitchell et al. (2019).

We propose a diagnostic to evaluate the in-domain robustness of NER models by creating datasets that contain a variety of entity names. We evaluate state-of-the-art systems on these entity-switched datasets and find that they have highest performance (F1) on American and Indian entities, and lowest performance on Vietnamese and Indonesian entities. We make the code to generate these datasets from the original ones publicly available.

2 Entity-Switched Datasets

To create entity-switched datasets, we replace entities in existing datasets with names from various countries, while retaining the rest of the text and maintaining its coherence. An example is shown in Table 1, where the original entities are replaced by Indian ones. While some types of entities (such as persons) can be readily replaced, others (such as organizations) require more care to maintain the coherence of the text. For this reason, we use two versions of the datasets: in one, we replace all entities; in the other, we replace only PER. Perturbation techniques have also been used by Prabhakaran et al. (2019), who study the impact of person names on toxicity in online comments by substituting in names of controversial personalities, and by Emami et al. (2019), who make coreference resolution robust to gender and number cues by ensuring that antecedent and candidate are of the same type.

Country | Huang et al. (2015): GloVe words (P, R, F1) | Lample et al. (2016): GloVe words+chars (P, R, F1) | Devlin et al. (2019): BERT subwords (P, R, F1)
Original test set 96.9 96.5 96.7 97.1 98.1 97.6 98.3 98.1 98.2
Super recall
US 96.9 99.6 98.2 96.9 99.6 98.3 98.4 99.7 99.1
Russia 96.8 99.5 98.1 97.1 99.8 98.4 98.4 99.3 98.9
India 96.5 99.5 98.0 97.1 99.3 98.2 98.4 98.8 98.6
Mexico 96.7 98.9 97.8 97.1 98.9 98.0 98.4 99.2 98.8
Poor recall
China-Taiwan 95.4 93.2 93.9 97.0 94.9 95.6 98.3 92.0 94.8
US (Difficult) 95.9 87.4 90.2 96.6 87.9 90.7 98.1 88.5 92.3
Indonesia 95.3 84.6 88.7 96.5 91.0 93.3 97.8 85.8 92.0
Vietnam 94.6 78.2 84.2 96.0 78.5 84.5 98.0 84.2 89.8
BERT not best
Ethiopia 96.5 96.8 96.6 96.6 98.6 97.9 98.3 90.6 94.1
Nigeria 96.3 92.2 94.1 97.1 96.6 96.8 98.2 90.2 93.8
Philippines 97.3 97.9 97.5 97.5 98.9 98.2 98.6 94.7 96.4
Bangladesh 96.7 97.5 97.1 97.1 97.6 97.3 98.4 97.8 98.0
Brazil 96.6 96.8 96.6 97.1 96.2 96.5 98.4 96.7 97.5
China-Mainland 95.7 97.9 96.7 97.0 97.4 97.2 98.4 96.7 97.5
Egypt 96.6 99.2 97.8 97.0 98.2 97.6 98.4 97.4 97.9
Japan 96.7 97.2 96.8 97.0 98.7 97.8 98.5 99.0 98.7
Pakistan 96.2 92.6 94.1 97.0 96.5 96.6 98.3 95.3 96.7
Table 2: Performance of systems on PER entity of CoNLL 03 test data. Original refers to the unchanged data. The rest of the rows are averages over 20 names typical for each country.
Country | Huang et al. (2015): GloVe words (P, R, F1) | Lample et al. (2016): GloVe words+chars (P, R, F1) | Devlin et al. (2019): BERT subwords (P, R, F1)
Original 94.7 95.6 95.2 97.5 95.0 96.2 97.0 96.8 96.9
India 94.2 95.5 94.8 97.0 95.7 96.2 96.3 96.9 96.6
Vietnam 93.1 82.3 85.8 96.3 82.3 86.9 96.5 85.2 90.5
Table 3: Performance of systems on PER entity of OntoNotes newswire test data. Original refers to the unchanged data. The rest of the rows report averages over 20 names typical for each country.
Highest performance (Name, Country, F1) | Lowest performance (Name, Country, F1)
Jose Mari Andrada Philippines 98.8 Trinity Washington U.S.(Difficult) 37.8
Chris Collins U.S. 98.4 My On Vietnam 37.9
Alex Mikhailov Russia 98.4 Thien Thai Vietnam 62.5
Kalpana Chawla India 98.4 Thu Giang Vietnam 64.9
Alejandro Garcia Mexico 98.4 Elaf Zahaar Pakistan 69.9
Table 4: Names on which Huang et al. (2015) achieved highest and lowest F1 scores.
Country | Huang et al. (2015): GloVe words (P, R, F1) | Lample et al. (2016): GloVe words+chars (P, R, F1) | Devlin et al. (2019): BERT subwords (P, R, F1)
Original 90.9 91.4 91.2 90.9 92.6 91.7 95.5 93.3 94.4
India 84.3 77.3 80.7 83.8 82.9 83.3 95.6 87.8 91.5
Vietnam 74.3 73.0 73.6 77.9 76.4 77.2 96.2 81.5 88.2
Table 5: Performance of systems on all entity types in the CoNLL 03 test data.

2.1 Replacing PER entities

In the first group of datasets, we change only names of people. We replace all PER entities in the test set with the same string. For example, all sequences of PER might be replaced with the name ‘John Smith’, or the name ‘Marijuana Pepsi’ (https://tinyurl.com/yxdhupr9). Names for replacement are drawn from lists of popular names in the countries with the largest populations. This allows us to examine system performance with respect to country of origin, and also in terms of the number of people whose names would potentially be affected by recognition failures. Specifically, we selected the 15 most populous countries and found 20 common first names and 10 common family names for each (sources include Wikipedia, websites with baby names, and websites listing popular names). We used two sets of Chinese names, from mainland China and Taiwan, respectively. We also use two sets of U.S. names: the first comprising common names (e.g., John Smith) and the second composed of Native American names, African American names, and names that could be locations (e.g., Madison) or regular words (e.g., Brown).

The first names used in the experiment are a mix of male, female, and unisex names. We create full names by matching each first name to a random family name. For some countries, names have additional constraints that we account for. In Indonesia (https://en.wikipedia.org/wiki/Indonesian_names), names might consist of only a single name or of multiple first names without a family name. In Pakistan (https://en.wikipedia.org/wiki/Pakistani_name), some female names consist of a first name followed by the father's or husband's most-called name.
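The full-name generation step can be sketched as follows. This is a minimal illustration, not the released code; the name lists here are tiny hypothetical stand-ins for the Wikipedia- and baby-name-site lists described above.

```python
import random

# Hypothetical stand-in name lists; the actual lists are drawn from
# Wikipedia and popular-name websites and are much larger.
FIRST_NAMES = {"India": ["Kalpana", "Ritwika"], "Indonesia": ["Budi", "Siti"]}
FAMILY_NAMES = {"India": ["Chawla", "Tomar"], "Indonesia": []}

def make_full_name(country, rng=random):
    """Pair a first name with a random family name for the given country.
    Indonesian names may consist of a single given name, so an empty
    family-name list yields the first name alone."""
    first = rng.choice(FIRST_NAMES[country])
    family = FAMILY_NAMES.get(country, [])
    if not family:  # e.g. Indonesia: single given name, no family name
        return first
    return f"{first} {rng.choice(family)}"
```

Country-specific constraints beyond this (such as the Pakistani pattern above) would need their own branches in the same spirit.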

We replace all sequences of PER in the test set with a single name (training data is unchanged, as our goal is to quantify the robustness of identifying various names). We attempt to be consistent in replacing the names, i.e., full names are replaced by full names, first names by first names, and last names by last names. We treat multi-word names separated by spaces as full names. For each full name, we take a Western-centric view and consider the first word to be the first name and the remainder to be the last name, and determine whether other occurrences in the text are first or last names by string matching. If a multi-word name is part of a larger name, we do not break it down, but replace it based on the larger name it matches.
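The consistency scheme above (full name maps to full name, first to first, last to last, with other mentions resolved by string matching) can be sketched roughly as:

```python
def replacement_map(original_full, new_full):
    """Map each surface form of an original PER name (full name, first
    name, last name) to the matching piece of the replacement name.
    Western-centric: the first token is the first name, the rest the
    last name, as described in the text."""
    mapping = {original_full: new_full}
    orig_toks, new_toks = original_full.split(), new_full.split()
    if len(orig_toks) > 1 and len(new_toks) > 1:
        mapping[orig_toks[0]] = new_toks[0]                        # first name
        mapping[" ".join(orig_toks[1:])] = " ".join(new_toks[1:])  # last name
    return mapping

def switch_person(text, original_full, new_full):
    """Replace all mentions of one person consistently within a text."""
    mapping = replacement_map(original_full, new_full)
    # Replace longer strings first so a full name is not partially
    # rewritten by its own first- or last-name entry.
    for old in sorted(mapping, key=len, reverse=True):
        text = text.replace(old, mapping[old])
    return text
```

A real implementation would operate on the token/label sequences of the annotated data rather than raw strings, but the mapping logic is the same.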

We use the English CoNLL’03 data and calculate the F1 for each country, replacing every PER entity in the test set with each of the 20 names in turn and averaging over the 20 versions of the dataset. We evaluate the word-based biLSTM-CRF Huang et al. (2015), the word- and character-based biLSTM-CRF Lample et al. (2016), and BERT Devlin et al. (2019). For the first two, we used 300-d cased GloVe Pennington et al. (2014) vectors trained on Common Crawl (http://nlp.stanford.edu/data/glove.840B.zip). For BERT, we use the public large uncased model (uncased_L-24_H-1024_A-16; uncased performed better than cased) and apply the default fine-tuning strategy. We use IO labeling and evaluate systems using micro-F1 at the token level.
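For concreteness, token-level precision, recall, and F1 under IO tagging can be computed as below; this sketch is simplified to a single label (here `I-PER`), whereas the full micro-average would sum counts over all entity labels.

```python
def micro_f1(gold, pred, target="I-PER"):
    """Token-level precision/recall/F1 for one IO label.

    gold, pred: aligned lists of per-token tags, e.g. ["I-PER", "O", ...].
    """
    tp = sum(g == target and p == target for g, p in zip(gold, pred))
    fp = sum(p == target and g != target for g, p in zip(gold, pred))
    fn = sum(g == target and p != target for g, p in zip(gold, pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

For example, gold `["I-PER", "I-PER", "O"]` against predicted `["I-PER", "O", "O"]` gives precision 1.0 and recall 0.5, mirroring the pattern in Table 2 where precision stays stable while recall drops on switched names.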

We present the results in Table 2. All the models achieve higher performance on typical American names and names from Russia, India, and Mexico than on the original dataset (Super recall). Precision remains the same but recall improves to almost perfect. This finding has fairness implications, as it shows systems would work almost perfectly for names from some countries, but comparatively poorly on names from many other countries.

For the two GloVe models, performance drops by up to 10 points F1 for certain countries (Poor recall). Names from Indonesia and Vietnam fare the worst, along with the difficult US names and names from Taiwan, with small degradation of precision and a precipitous drop of recall. BERT exhibits a similar pattern, with stable precision and varying recall which remains above 84% for all name origins. The names with the highest and lowest F1 with Huang et al. (2015) are shown in Table 4.

Notably, BERT performance is lowest on names from Ethiopia, Nigeria, and the Philippines (see the 'BERT not best' rows). In light of these findings, one might wonder whether accepting current architectures trained on standard corpora as state-of-the-art is the NER equivalent of developments in photography, which was optimized for perfect exposure of white skin, and which is the assumed technical reason for many failures of computer vision applications when applied to dark-skinned people Benjamin (2019). At the very least, practitioners should be cognizant of these systematic performance differences.

BERT and character LSTM-CRF results are higher and more stable, but it is nevertheless clear that one need not perform a completely out-of-domain test to observe deteriorating performance; changing the name strings is sufficient. We perform similar tests on the newswire section of OntoNotes. We use the original train and test splits, with PER switched in the latter. We use names from India and Vietnam, because these were among the top- and bottom-performing entity-switched sets, respectively, for the CoNLL data. We observe a similar drop in performance (Table 3). We hypothesize that this change in performance is due to the names that systems have seen during pre-training and training: they memorize these names and are therefore better able to predict their types.

2.2 Replacing other entity types

In addition to the above, we construct two more datasets derived from the original English CoNLL’03 test set. In these, we replace all person (PER), location (LOC), and organization (ORG) instances with entities of the same type from a particular country. We do not replace MISC entities, because these are not usually country-specific. In one dataset, we replace all entities with corresponding entities of Indian origin; in the other, with entities of Vietnamese origin. Unlike in the above datasets, where we replaced every entity of type PER with the same name throughout, we perform a stochastic (though type-constrained) mapping from the original set of entities to the new entity set. That is, when replacing a target entity of a given type, we sample a replacement i.i.d. at random from the country-specific entities of that type.
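The stochastic, type-constrained mapping can be sketched as follows; the entity pools here are small hypothetical examples standing in for the country-specific lists described in the paper.

```python
import random

# Hypothetical country-specific entity pools, keyed by entity type.
POOLS = {
    "PER": ["Kalpana Chawla", "Ritwika Tomar"],
    "LOC": ["Jharkhand", "Mizoram"],
    "ORG": ["Jharkhand Mukti Morcha", "Mizo National Front"],
}

def switch_entities(entities, pools=POOLS, rng=random):
    """Build a replacement mapping for a document's entities.

    entities: list of (type, name) pairs as they occur in the document.
    Each distinct (type, name) gets one replacement sampled i.i.d. from
    the pool of the same type, so repeated mentions stay consistent
    within the document.  MISC entities would simply be omitted.
    """
    mapping = {}
    for etype, name in entities:
        if (etype, name) not in mapping:
            mapping[(etype, name)] = rng.choice(pools[etype])
    return mapping
```

Sampling once per distinct entity (rather than per mention) is what keeps the document coherent while still varying the entity set across documents.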

We select PER names as in the previous section. For each document, we then generate, by string matching, a list of the possible names each entity was referred to, and replace each entity with a new name as consistently as possible, as in the previous section. For LOC, we select names of villages, cities, and provinces, and again replace each location in a document with a location of the same type from the country-specific list.

Consistently replacing ORG entities is more complicated than replacing PER and LOC entities, because not every organization is suitable for every context. We cannot, for instance, reasonably replace ‘Bank of America’ in ‘We withdrew money from the Bank of America’ with ‘New York Times’ or ‘Mayo Clinic’. For this reason, we divided organizations into sub-categories, selected candidates for each category from websites similar to those above, manually labelled each test ORG with its sub-category, and then replaced it with a country-specific ORG of the same sub-category, again being consistent within a document based on string matching. The sub-categories we used are: Airline, Bank, Corporation, Newspaper, Political Party, Restaurant, Sports Team, Sports Union, University, and Others. Others includes international or intergovernmental organizations such as the United Nations, which we did not replace.
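The sub-category constraint amounts to a category-keyed lookup before sampling; a minimal sketch, with hypothetical pools standing in for the manually collected candidates:

```python
import random

# Hypothetical sub-categorised ORG pools for one country of origin.
ORG_POOLS = {
    "Bank": ["State Bank of India", "Punjab National Bank"],
    "Newspaper": ["The Hindu", "Dainik Bhaskar"],
}

def switch_org(name, subcategory, rng=random):
    """Replace an ORG only with one of the same sub-category, so the
    surrounding context (e.g. 'We withdrew money from ...') stays
    coherent.  'Others' covers international bodies and is left as-is."""
    if subcategory == "Others":
        return name  # e.g. United Nations: not replaced
    return rng.choice(ORG_POOLS[subcategory])
```

Within a document, the sampled replacement would be cached per original ORG (as in the type-constrained mapping for PER and LOC) so repeated mentions stay consistent.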

We observe a drop in performance on both datasets (Table 5). Unlike in the PER-only setting, there is a drop in both precision and recall for the two GloVe systems. For the BERT system, however, precision remains the same and only recall drops.

3 Conclusion

Standard NER datasets, such as those of English news, contain a limited number of unique entities, many of which occur in both the train and the test sets. State-of-the-art models may thus memorize observed names, rather than learning to identify entities on the basis of context. As a result, models perform less well on ‘foreign’ entity instances.

To measure this phenomenon we introduced a practical approach of entity-switching to create datasets to test the in-domain robustness of systems. We selected entities from different countries and showed that modern NER models perform extremely well on entities from certain countries, but not as well on others. This finding has fairness implications when NER is used in practical applications. We hope that these datasets — which we publicly release — will facilitate research in developing more robust NER systems.


  • I. Augenstein, L. Derczynski, and K. Bontcheva (2017) Generalisation in named entity recognition: a quantitative analysis. Computer Speech & Language 44, pp. 61–83. Cited by: §1.
  • R. Benjamin (2019) Race after technology: abolitionist tools for the new jim code. Polity. External Links: ISBN 978-1509526406 Cited by: §2.1.
  • S. L. Blodgett, L. Green, and B. T. O’Connor (2016) Demographic dialectal variation in social media: a case study of African-American English. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pp. 1119–1130. External Links: Link Cited by: §1.
  • J. Buolamwini and T. Gebru (2018) Gender shades: intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency, FAT 2018, 23-24 February 2018, New York, NY, USA, pp. 77–91. External Links: Link Cited by: §1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Link Cited by: §2.1, Table 2, Table 3, Table 5.
  • A. Emami, P. Trichelair, A. Trischler, K. Suleman, H. Schulz, and J. C. K. Cheung (2019) The KnowRef coreference corpus: removing gender and number cues for difficult pronominal anaphora resolution. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3952–3961. External Links: Link, Document Cited by: §2.
  • R. Grishman and B. Sundheim (1996) Message understanding conference-6: a brief history. In COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics, Cited by: §1.
  • Z. Huang, W. Xu, and K. Yu (2015) Bidirectional lstm-crf models for sequence tagging. arXiv preprint arXiv:1508.01991. Cited by: §2.1, §2.1, Table 2, Table 3, Table 4, Table 5.
  • G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer (2016) Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 260–270. External Links: Link, Document Cited by: §2.1, Table 2, Table 3, Table 5.
  • M. Mitchell, S. Wu, A. Zaldivar, P. Barnes, L. Vasserman, B. Hutchinson, E. Spitzer, I. D. Raji, and T. Gebru (2019) Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* 2019, Atlanta, GA, USA, January 29-31, 2019, pp. 220–229. External Links: Link Cited by: §1.
  • J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543. External Links: Link Cited by: §2.1.
  • V. Prabhakaran, B. Hutchinson, and M. Mitchell (2019) Perturbation sensitivity analysis to detect unintended model biases. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 5740–5745. External Links: Link, Document Cited by: §2.
  • S. S. Pradhan and N. Xue (2009) OntoNotes: the 90% solution. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Tutorial Abstracts, Boulder, Colorado, pp. 11–12. External Links: Link Cited by: §1.
  • E. F. Tjong Kim Sang and F. De Meulder (2003) Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pp. 142–147. External Links: Link Cited by: §1.
  • Y. Zhang, J. Riesa, D. Gillick, A. Bakalov, J. Baldridge, and D. Weiss (2018) A fast, compact, accurate model for language identification of codemixed text. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pp. 328–337. External Links: Link Cited by: §1.