1. Introduction: Disparity in NLP
As machine-learned algorithms govern more and more real-world outcomes, how to make them fair, and what that should mean, is of increasing concern. One strand of research, heavily represented at the FAT-ML series of workshops (http://www.fatml.org/), considers scenarios where a learning algorithm must make decisions about people, such as approving prospective applicants for employment, or deciding who should be the targets of police actions (Goel et al., 2017), and seeks to develop learners or algorithms whose decisions have only small differences in behavior between persons from different groups (Feldman et al., 2015) or that satisfy other notions of fairness (e.g., Joseph et al., 2016b; Joseph et al., 2016a).
Another recent strand of research has examined a complementary aspect of bias and fairness: disparate accuracy in language analysis. Linguistic production is a critically important form of human behavior, and a major class of artificial intelligence algorithms (natural language processing, or language technologies) may or may not fairly analyze language produced by different types of authors (Hovy and Spruit, 2016). For example, Tatman (Tatman, 2017) finds that YouTube autocaptioning has a higher word error rate for female speakers than for male speakers in videos. This has implications for downstream uses of language technology:
Viewing: users who rely on autocaptioning have a harder time understanding what women are saying in videos, relative to what men are saying.
Access: search systems are necessary for people to access information online, and for videos they may depend on indexing text recognized from the audio. Tatman’s results (Tatman, 2017) imply that such a search system will fail to find information produced by female speakers more often than for male speakers.
This bias affects interests of the speakers—it is more difficult for their voices to be communicated to the world—as well as other users, who are deprived of information or opinions from females, or more generally, any social group whose language experiences lower accuracy of analysis by language technologies.
Gender and dialect are well-known confounds in speech recognition, since they can implicate pitch, timbre, and the pronunciation of words (the phonetic level of language); domain adaptation is always a challenge and research continues on how to apply domain transfer to speech recognizers across dialects (Lehr et al., 2014). More broadly, decades of research in the field of sociolinguistics have documented an extensive array of both the social factors that affect how people produce language (e.g. community, geography, ethnicity) and the specific ways in which language is affected (e.g. the lexicon, syntax, semantics). We might expect a minority teenager in school and a white middle-aged software engineer to both speak English, but they may exhibit variation in their pronunciation, word choice, slang, or even syntactic structures. Dialect communities often align with geographic and sociological factors, as language variation emerges within distinct social networks, or is affirmed as a marker of social identity.
Dialects pose a challenge to fairness in NLP, because they entail language variation that is correlated with social factors, and we believe there needs to be greater awareness of dialects among technologists using and building language technologies. In the rest of this paper, we focus on the dialect of African-American English as used on Twitter, which previous work (Blodgett et al., 2016; Jones, 2015; Jørgensen et al., 2015) has established is very prevalent and sometimes quite different from mainstream American English. We use an African-American English Twitter corpus (from Blodgett et al. (Blodgett et al., 2016), described in §3), and analyze racial disparity in language identification, a crucial first step in any NLP application. Our previous work found that off-the-shelf tools display racial disparity: they tend to erroneously classify messages from African-Americans as non-English more often than those from whites. We extend this analysis from 200 to 20,000 tweets, finding that the disparity persists when controlling for message length (§4), and evaluate the racial disparity for several black-box commercial services. We conclude with a brief discussion (§5).
2. African-American English and social media
We focus on language in social media, which is often informal and conversational. Social media NLP tools may be used for, say, sentiment analysis applications, which seek to measure opinions from online communities. But current NLP tools are typically trained on traditional written sources, which are quite different from social media language, and even more so from dialectal social media language. Not only does this imply social media NLP may be of lower accuracy, but since language can vary across social groups, any such measurements may be biased—incorrectly representing ideas and opinions from people who use non-standard language.
Specifically, we investigate dialectal language in publicly available Twitter data, focusing on African-American English (AAE), a dialect of American English spoken by millions of people across the United States (Labov, 1972; Rickford, 1999; Green, 2002). AAE is a linguistic variety with defined syntactic-semantic, phonological, and lexical features, which have been the subject of a rich body of sociolinguistic literature. In addition to the linguistic characterization, reference to its speakers and their geographical location or speech communities is important, especially in light of the historical development of the dialect. Not all African-Americans speak AAE, and not all speakers of AAE are African-American; nevertheless, speakers of this variety have close ties with specific communities of African-Americans (Green, 2002).
The phenomenon of “BlackTwitter” has been noted anecdotally; indeed, African-American and Hispanic minorities were markedly over-represented in the early years of the Twitter service (as well as younger people) relative to their representation in the American general population (http://www.pewinternet.org/fact-sheet/social-media/). It is easy to find examples of non-Standard American English (SAE) language use, such as:
he woke af smart af educated af daddy af coconut oil af GOALS AF & shares food af
Bored af den my phone finna die!!!
The first example has low punctuation usage (there is an utterance boundary after every “af”), but more importantly, it displays a key syntactic feature of the AAE dialect, a null copula: “he woke” would be written, in Standard American English, as “he is woke” (meaning, politically aware). “af” is an online-specific term meaning “as f—.” The second example displays two more traditional AAE features: “den” is a spelling of “then” which follows a common phonological transform in AAE (initial “th” changing to a “d” sound: “dat,” “dis,” etc. are also common), and the word “finna” is an auxiliary verb, short for “fixing to,” which indicates an immediate future tense (“my phone is going to die very soon”); it is part of AAE’s rich verbal auxiliary system capable of encoding different temporal semantics than mainstream English (Green, 2002).
3. Demographic Mixed Membership Model for Social Media
In order to test racial disparity in social media NLP, Blodgett et al. (Blodgett et al., 2016) collect a large-scale AAE corpus from Twitter, inferring soft demographic labels with a mixed-membership probabilistic model; we use this same corpus and method, briefly repeating the earlier description of the method. This approach to identifying AAE-like text makes use of the connection between speakers of AAE and African-American neighborhoods; we harvest a set of messages from Twitter, cross-referenced against U.S. Census demographics, and then analyze words against demographics with a mixed-membership probabilistic model. The data is a sample of millions of publicly posted geo-located Twitter messages (from the Decahose/Gardenhose stream (Morstatter et al., 2013)), most of which are sent on mobile phones, by authors in the U.S. in 2013.
For each message, we look up the U.S. Census blockgroup geographic area that the message was sent in, and use race and ethnicity information for each blockgroup from the Census’ 2013 American Community Survey, defining four covariates: percentages of the population that are non-Hispanic whites, non-Hispanic blacks, Hispanics (of any race), and (non-Hispanic) Asians. Finally, for each user, we average the demographic values of all their messages in our dataset into a length-four vector.
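The per-user averaging step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name, the `(user_id, vector)` input layout, and the use of proportions in [0, 1] rather than percentages are all assumptions made here.

```python
from collections import defaultdict

# Order of the four covariates named in the text (an ordering chosen here).
GROUPS = ("white", "black", "hispanic", "asian")


def average_user_demographics(messages):
    """Average blockgroup demographics over each user's messages.

    messages: iterable of (user_id, vector) pairs, where vector holds the
    four blockgroup values for the place the message was sent from.
    Returns a dict mapping user_id to a length-four list of means.
    """
    sums = defaultdict(lambda: [0.0] * len(GROUPS))
    counts = defaultdict(int)
    for user_id, vector in messages:
        for i, value in enumerate(vector):
            sums[user_id][i] += value
        counts[user_id] += 1
    return {u: [s / counts[u] for s in sums[u]] for u in sums}
```

A user who tweets from several blockgroups thus receives a soft demographic vector rather than a single hard label.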
Given this set of messages and author-associated demographics, we infer statistical associations between language and demographics with a mixed membership probabilistic model. It directly associates each of the demographic variables with a topic; i.e. a unigram language model over the vocabulary. The model assumes an author’s mixture over the topics tends to be similar to their Census-associated demographic weights, and that every message has its own topic distribution. This allows for a single author to use different types of language in different messages, accommodating multidialectal authors. The message-level topic probabilities $\theta_m$ are drawn from an asymmetric Dirichlet centered on $\pi_u$, whose scalar concentration parameter $\alpha$ controls whether authors’ language is very similar to the demographic prior, or can have some deviation. A token $t$’s latent topic $z_t$ is drawn from $\theta_m$, and the word itself is drawn from $\phi_{z_t}$, the language model for the topic. Thus, the model learns demographically-aligned language models for each demographic category. Our previous work (Blodgett et al., 2016) verifies that its African-American language model learns linguistic attributes known in the sociolinguistics literature to be characteristic of AAE, in line with other work that has also verified the correspondence of geographical AA prevalence to AAE linguistic features on Twitter (Jørgensen et al., 2016; Stewart, 2014).
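To make the generative story concrete, the sketch below samples one message under the process described above: a message-level topic distribution drawn from a Dirichlet centered on the author's demographic prior, then a topic and a word for each token. It only simulates the forward process and does not perform the inference used to fit the model. The names `pi_u`, `alpha`, `theta_m`, and `phi`, and the toy vocabularies in the usage note, are assumptions chosen for illustration.

```python
import random


def sample_dirichlet(alpha_vec, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha_vec]
    total = sum(draws)
    return [d / total for d in draws]


def generate_message(pi_u, phi, alpha, n_tokens, rng):
    """Sample one message from the mixed-membership generative process.

    pi_u:  author's demographic prior over K topics (all entries > 0).
    phi:   list of K dicts mapping word -> probability (unigram models).
    alpha: scalar concentration; larger values keep the message's topic
           distribution closer to pi_u.
    Returns (theta_m, [(topic_index, word), ...]).
    """
    theta_m = sample_dirichlet([alpha * p for p in pi_u], rng)
    tokens = []
    for _ in range(n_tokens):
        # Latent topic for this token, drawn from the message distribution.
        z = rng.choices(range(len(theta_m)), weights=theta_m)[0]
        # Word drawn from the unigram language model of that topic.
        words = list(phi[z])
        w = rng.choices(words, weights=[phi[z][x] for x in words])[0]
        tokens.append((z, w))
    return theta_m, tokens
```

For example, `generate_message([0.4, 0.4, 0.1, 0.1], phi, 10.0, 6, random.Random(1))` with four toy word distributions in `phi` returns a topic distribution that sums to one and six (topic, word) pairs.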
This publicly available corpus contains 59.2 million tweets. We filter its messages to ones strongly associated with demographic groups; for example, for each message we infer the posterior proportion of its tokens that came from the African-American language model, which can be high either due to demographic prior, or from a message that uses many words exclusive to the AA language model (topic); these proportions are available in the released corpus. When we filter to messages with AA proportion greater than 0.8, this results in AAE-like text. We call these AA-aligned messages and we also select a set of white-aligned messages in the same way. While Blodgett et al. verified that the AA-aligned tweets contain well-known features of AAE, we hesitate to call these “AAE” and “SAE” corpora, since technically speaking they are simply demographically correlated language models. The Census refers to the categories as “Black or African-American” and “White” (codes B03002E4 and B03002E3 in ACS 2013). And, while Hispanic- and Asian-associated language models of Blodgett et al.’s model are also of interest, we focus our analysis here on the African-American and White language models.
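The filtering step amounts to a threshold on each message's posterior proportion. A minimal sketch, assuming a hypothetical record layout in which each message is a dict with a `posterior` mapping from group name to proportion (the released corpus may use a different format):

```python
def select_aligned(messages, group, threshold=0.8):
    """Keep messages whose posterior proportion for `group` exceeds threshold.

    messages: iterable of dicts with keys 'text' and 'posterior', where
    'posterior' maps a group name (e.g. 'aa', 'white') to the inferred
    proportion of the message's tokens attributed to that language model.
    """
    return [m for m in messages if m["posterior"][group] > threshold]
```

Calling `select_aligned(corpus, "aa")` and `select_aligned(corpus, "white")` would give the two message sets compared in the rest of the paper, under these layout assumptions.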
4. Bias in NLP Tools
4.1. Language identification
Language identification, the task of classifying the major world language in which a message is written, is a crucial first step in almost any web or social media text processing pipeline. For example, in order to analyze the opinions of U.S. Twitter users, one might throw away all non-English messages before running an English sentiment analyzer. (Some of the coauthors of this paper have done this as a simple expedient step in the past.)
A variety of methods for language identification exist (Hughes et al., 2006); social media language identification is particularly challenging since messages are short and also use non-standard language (Baldwin et al., 2013). In fact, a popular language identification system, langid.py (Lui and Baldwin, 2012), classifies both example messages in §2 as Danish with more than 99.9% confidence.
We take the perspective that since AAE is a dialect of American English, it ought to be classified as English for the task of major world language identification. We hypothesize that if a language identification tool is trained on standard English data, it may exhibit disparate performance on AA- versus white-aligned tweets. In particular, we wish to assess the racial disparity accuracy difference:

$$p(\text{correct} \mid \text{Wh}) - p(\text{correct} \mid \text{AA})$$
From manual inspection of a sample of hundreds of messages, it appears that nearly all white-aligned and AA-aligned tweets are actually English, so accuracy is the same as proportion of English predictions by the classifier. A disparity of 0 indicates a language identifier that is fair across these classes. (An alternative measure is the ratio of accuracies, corresponding to Feldman et al.’s disparate impact measure (Feldman et al., 2015).)
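Both measures are simple to compute from classifier output. The sketch below treats every message as truly English, as the text does, so accuracy is the share of "en" predictions. The function names, the returned dictionary keys, and the direction of the ratio (AA accuracy over white accuracy) are assumptions made here.

```python
def accuracy(predictions, gold="en"):
    """Share of predictions equal to the gold label (English by default)."""
    return sum(1 for p in predictions if p == gold) / len(predictions)


def disparity(white_preds, aa_preds):
    """Accuracy difference and ratio between white- and AA-aligned samples."""
    acc_wh = accuracy(white_preds)
    acc_aa = accuracy(aa_preds)
    return {
        "wh_acc": acc_wh,
        "aa_acc": acc_aa,
        # 0 means equal accuracy; positive means AA-aligned text fares worse.
        "difference": acc_wh - acc_aa,
        # Ratio form, in the spirit of a disparate impact measure.
        "ratio": acc_aa / acc_wh if acc_wh > 0 else float("nan"),
    }
```

With nine of ten white-aligned and seven of ten AA-aligned messages labeled English, the difference is 0.2 and the ratio is 0.7/0.9.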
We conduct an evaluation of four different off-the-shelf language identifiers, which are popular and straightforward for engineers to use when building applications:
langid.py (software): One of the most popular open source language identification tools, langid.py was originally trained on over 97 languages and evaluated on both traditional corpora and Twitter messages (Lui and Baldwin, 2012).
IBM Watson (API): The Watson Developer Cloud’s Language Translator service supports language identification of 62 languages (https://www.ibm.com/watson/developercloud/doc/language-translator/index.html).
Microsoft Azure (API): Microsoft Azure’s Cognitive Services supports language identification of 120 languages (https://docs.microsoft.com/en-us/azure/cognitive-services/text-analytics/overview#language-detection).
Twitter (metadata): The output of Twitter’s in-house identifier, whose predictions are included in a tweet’s metadata (from 2013, the time of data collection), which Twitter intends to “help developers more easily work with targeted subsets of Tweet collections” (https://blog.twitter.com/developer/en_us/a/2013/introducing-new-metadata-for-tweets.html).
Google (API, excluded): We attempted to test Google’s language detection service (https://cloud.google.com/translate/docs/detecting-language), but it returned a server error for every message we gave it to classify.
We queried the remote API systems in May 2017.
From manual inspection, we observed that longer tweets are significantly more likely to be correctly classified, which is a potential confound for a race disparity analysis, since the length distribution is different for each demographic group. To minimize this effect in our comparisons, we group messages into four bins (shown in Table 1) according to the number of words in the message. For each bin, we sampled 2,500 AA-aligned tweets and 2,500 white-aligned tweets, yielding a total of 20,000 messages across the two categories and four bins. (Due to a data processing error, there are 5 duplicates, leaving 19,995 unique tweets; we report on all 20,000 messages for simplicity.) We limited pre-processing of the messages to fixing of HTML escape characters and removal of URLs, keeping “noisy” features of social media text such as @-mentions, emojis, and hashtags. We then calculated, for each bin in each category, the number of messages predicted to be in English by each classifier. Accuracy results are shown in Table 1. (We have made the 20,000 messages publicly available at http://slanglab.cs.umass.edu/TwitterAAE/.)
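The pre-processing and length-stratified sampling described above can be sketched as follows. The URL pattern, whitespace tokenization, the bin edges `(5, 10, 15)`, and the choice to bin after cleaning are all assumptions made for illustration; they are not taken from the paper's code.

```python
import html
import random
import re

URL_RE = re.compile(r"https?://\S+")  # simple URL pattern, assumed here
BIN_EDGES = (5, 10, 15)  # hypothetical upper bounds of the first three bins


def preprocess(text):
    """Fix HTML escapes and drop URLs; keep mentions, emojis and hashtags."""
    return URL_RE.sub("", html.unescape(text)).strip()


def length_bin(text, edges=BIN_EDGES):
    """Index of the length bin for a message, by whitespace word count."""
    n = len(text.split())
    for i, edge in enumerate(edges):
        if n <= edge:
            return i
    return len(edges)


def sample_per_bin(texts, per_bin, rng, edges=BIN_EDGES):
    """Clean the messages, group them by length bin, and sample each bin."""
    bins = {i: [] for i in range(len(edges) + 1)}
    for t in texts:
        clean = preprocess(t)
        bins[length_bin(clean, edges)].append(clean)
    return {i: rng.sample(items, min(per_bin, len(items)))
            for i, items in bins.items()}
```

Running `sample_per_bin` once on the AA-aligned set and once on the white-aligned set with `per_bin=2500` would yield the balanced design described in the text, so that accuracy comparisons are made only between messages of similar length.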
Table 1. Language identification accuracy by classifier and message-length bin (columns: AA Acc., WH Acc., Diff.).
As predicted, classifier accuracy does increase as message lengths increase; classifier accuracy is generally excellent for all messages containing at least 10 tokens. This result agrees with previous work finding short texts to be challenging to classify (e.g. (Baldwin and Lui, 2010)), since there are fewer features (e.g. character n-grams) to give evidence for the language used. (A reviewer asked if length is used as a feature; we know that the open-source langid.py system does not (explicitly) use it.)
However, the classifier results display a disparity in performance among messages of similar length; for all but one length bin under one classifier, accuracy on the white-aligned sample is higher than on the AA-aligned sample. The disparity in performance between AA- and white-aligned messages is greatest when messages are short; the gaps in performance for extremely short messages range across classifiers from 6.6 to 19.7 percentage points. This gap in performance is particularly critical as 41.7% of all AA-aligned messages in the corpus as a whole have 5 or fewer tokens. (For most (system, length) combinations, the accuracy difference is significant under a two-sided t-test, except for two rows, one under langid.py and one under Twitter. Accuracy rate standard errors range from […] to […].)
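For readers who want to check significance on their own samples, the sketch below computes the standard error of an accuracy rate under a binomial assumption, and a two-sided p-value for the difference between two accuracy rates. The test shown is a normal-approximation test on two proportions, which for samples of 2,500 messages is close to, but not identical to, the t-test reported above; the function names are chosen here.

```python
import math


def standard_error(acc, n):
    """Standard error of an accuracy rate estimated from n messages."""
    return math.sqrt(acc * (1.0 - acc) / n)


def two_sided_p_value(acc_a, acc_b, n_a, n_b):
    """Two-sided p-value for the difference between two accuracy rates.

    Uses an unpooled normal approximation (a swapped-in stand-in for the
    t-test mentioned in the text).
    """
    se = math.sqrt(acc_a * (1.0 - acc_a) / n_a + acc_b * (1.0 - acc_b) / n_b)
    if se == 0.0:
        # Both rates are exactly 0 or 1: no sampling variance to test against.
        return 1.0 if acc_a == acc_b else 0.0
    z = (acc_a - acc_b) / se
    # P(|Z| >= |z|) for a standard normal Z.
    return math.erfc(abs(z) / math.sqrt(2.0))
```

With 2,500 messages per cell, the standard error is at most 0.01 (reached at an accuracy of 0.5), so differences of several percentage points are far outside sampling noise.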
Are these disparities substantively significant? It is easy to see how statistical bias could arise in downstream applications. For example, consider an analyst trying to look at major opinions about a product or political figure, with a sentiment analysis system that only gathers opinions from messages classified as English by Twitter. For messages of length 5 or less, opinions from African-American speakers will be shown to be less frequent than they really are, relative to white opinions. Fortunately, the accuracy disparities are often only a few percentage points; nevertheless, it is important for practitioners to keep potential biases like these in mind.
One way forward to create less disparate NLP systems will be to use domain adaptation and other methods to extend algorithms to work on different distributions of data; for example, our demographic model’s predictions can be used to improve a language identifier, since the demographic language model’s posteriors accurately identify some cases of dialectal English (Blodgett et al., 2016). In the context of speech recognition, Lehr et al. (Lehr et al., 2014) pursue a joint modeling approach, learning pronunciation model parameters for AAE and SAE simultaneously.
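One hypothetical way to use the demographic model's posteriors alongside an existing language identifier is a simple override rule: accept a message as English if the identifier says so, or if the message's dialect-aligned posterior is high. The function name, the argument layout, and the 0.9 threshold below are assumptions for illustration, not a description of any published system.

```python
def ensemble_is_english(langid_label, dialect_posterior, threshold=0.9):
    """Treat a message as English if either signal supports it.

    langid_label:      label from an off-the-shelf identifier, e.g. 'en'.
    dialect_posterior: posterior proportion of the message's tokens
                       attributed to an English-dialect language model.
    """
    if langid_label == "en":
        return True
    # Override a non-English label when the dialect model is confident.
    return dialect_posterior >= threshold
```

A rule like this can only raise the share of messages labeled English, so in practice it would need to be checked against truly non-English text to make sure it does not admit too many false positives.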
One important issue may be the limitation of perspective of technologists versus users. In striking contrast to Twitter’s (historically) minority-heavy demographics, major U.S. tech companies are notorious for their low representation of African-Americans and Hispanics; for example, Facebook and Google report only 1% of their tech employees are African-American (https://newsroom.fb.com/news/2016/07/facebook-diversity-update-positive-hiring-trends-show-progress/, https://www.google.com/diversity/), as opposed to 13.3% in the overall U.S. population (https://www.census.gov/quickfacts/table/RHI225215/00), and the population of computer science researchers in the U.S. has similarly low minority representation. It is of course one example of the ever-present challenge of software designers understanding how users use their software; in the context of language processing algorithms, such understanding must be grounded in an understanding of dialects and sociolinguistics.
- Baldwin et al. (2013) Timothy Baldwin, Paul Cook, Marco Lui, Andrew MacKinlay, and Li Wang. 2013. How Noisy Social Media Text, How Diffrnt Social Media Sources?. In International Joint Conference on Natural Language Processing. 356–364.
- Baldwin and Lui (2010) Timothy Baldwin and Marco Lui. 2010. Language identification: The long and the short of the matter. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 229–237.
- Blodgett et al. (2016) Su Lin Blodgett, Lisa Green, and Brendan O’Connor. 2016. Demographic Dialectal Variation in Social Media: A Case Study of African-American English. Proceedings of EMNLP (2016).
- Feldman et al. (2015) Michael Feldman, Sorelle A Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. 2015. Certifying and removing disparate impact. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 259–268.
- Goel et al. (2017) Sharad Goel, Maya Perelman, Ravi Shroff, and David Alan Sklansky. 2017. Combatting police discrimination in the age of big data. New Criminal Law Review: An International and Interdisciplinary Journal 20, 2 (2017), 181–232.
- Green (2002) Lisa J. Green. 2002. African American English: A Linguistic Introduction. Cambridge University Press.
- Hovy and Spruit (2016) Dirk Hovy and L. Shannon Spruit. 2016. The Social Impact of Natural Language Processing. In Proceedings of ACL.
- Hughes et al. (2006) Baden Hughes, Timothy Baldwin, Steven Bird, Jeremy Nicholson, and Andrew MacKinlay. 2006. Reconsidering Language Identification for Written Language Resources. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06). European Language Resources Association (ELRA). http://aclweb.org/anthology/L06-1274
- Jones (2015) Taylor Jones. 2015. Toward a Description of African American Vernacular English Dialect Regions Using “Black Twitter”. American Speech 90, 4 (2015), 403–440.
- Jørgensen et al. (2016) Anna Jørgensen, Dirk Hovy, and Anders Søgaard. 2016. Learning a POS tagger for AAVE-like language. In Proceedings of NAACL. Association for Computational Linguistics.
- Jørgensen et al. (2015) Anna Katrine Jørgensen, Dirk Hovy, and Anders Søgaard. 2015. Challenges of studying and processing dialects in social media. In Proceedings of the Workshop on Noisy User-generated Text. 9–18.
- Joseph et al. (2016b) Matthew Joseph, Michael Kearns, Jamie Morgenstern, Seth Neel, and Aaron Roth. 2016b. Rawlsian fairness for machine learning. arXiv preprint arXiv:1610.09559 (2016).
- Joseph et al. (2016a) Matthew Joseph, Michael Kearns, Jamie H Morgenstern, and Aaron Roth. 2016a. Fairness in Learning: Classic and Contextual Bandits. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.). Curran Associates, Inc., 325–333. http://papers.nips.cc/paper/6355-fairness-in-learning-classic-and-contextual-bandits.pdf
- Labov (1972) William Labov. 1972. Language in the inner city: Studies in the Black English vernacular. Vol. 3. University of Pennsylvania Press.
- Lehr et al. (2014) Maider Lehr, Kyle Gorman, and Izhak Shafran. 2014. Discriminative pronunciation modeling for dialectal speech recognition. In Proc. Interspeech.
- Lui and Baldwin (2012) M. Lui and T. Baldwin. 2012. langid.py: An Off-the-shelf Language Identification Tool. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012), Demo Session, Jeju, Republic of Korea. http://www.aclweb.org/anthology-new/P/P12/P12-3005.pdf
- Morstatter et al. (2013) Fred Morstatter, Jürgen Pfeffer, Huan Liu, and Kathleen Carley. 2013. Is the Sample Good Enough? Comparing Data from Twitter’s Streaming API with Twitter’s Firehose. In International AAAI Conference on Weblogs and Social Media. http://www.aaai.org/ocs/index.php/ICWSM/ICWSM13/paper/view/6071
- Rickford (1999) John Russell Rickford. 1999. African American vernacular English: Features, evolution, educational implications. Wiley-Blackwell.
- Stewart (2014) Ian Stewart. 2014. Now We Stronger than Ever: African-American English Syntax in Twitter. In Proceedings of the Student Research Workshop at the 14th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, Gothenburg, Sweden, 31–37. http://www.aclweb.org/anthology/E14-3004
- Tatman (2017) Rachael Tatman. 2017. Gender and Dialect Bias in YouTube’s Automatic Captions. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing. Association for Computational Linguistics, Valencia, Spain, 53–59. http://www.aclweb.org/anthology/W/W17/W17-1606