A key idea behind the current success of neural network models for language is pretrained representations such as word embeddings Mikolov2013; pennington2014glove and pretrained language models peters_deep_2018; howard_universal_2018; devlin_bert_2019; radford_language_2019; Liu2019. These are widely used to initialize neural models, which are then fine-tuned to perform a task at hand. Typically, these are learned from massive text corpora using variants of the language modeling objective, i.e., correctly predicting a word given its surrounding context. In recent years, these representations have empowered neural models to attain unprecedented levels of performance on multiple language tasks. The resulting models are being deployed widely as services on platforms like Google Cloud and Amazon AWS to serve millions of users.
While this growth is commendable, there are concerns about the fairness of these models. Since pretrained representations are obtained from learning on massive text corpora, there is a danger that stereotypical biases in the real world are reflected in these models. For example, GPT2 radford_language_2019, a pretrained language model, has been shown to generate unpleasant stereotypical text when prompted with context containing certain races such as African-Americans Sheng2019. In this work, we assess the stereotypical biases of popular pretrained language models.
The seminal works of bolukbasi_man_2016 and caliskan_semantics_2017 show, using diagnostic methods like word analogies and association tests, that word embeddings such as word2vec Mikolov2013 and GloVe pennington2014glove contain stereotypical biases. For example, caliskan_semantics_2017 show that male names are more likely to be associated with career terms than female names, where the association between two terms is measured using embedding similarity, and similarly that African-American names are more likely to be associated with unpleasant terms than European-American names.
Recently, such studies have been extended to evaluate bias in contextual word embeddings obtained from pretrained language models, where a word is provided with artificial context (may_measuring_2019; Kurita2019), e.g., the contextual embedding of man is obtained from the embedding of man in the sentence This is a man. However, these studies have a few drawbacks. First, the context is artificial and does not reflect the natural usage of a word. Second, they require stereotypical attribute terms to be predefined (e.g., pleasant and unpleasant terms). Third, they focus on single-word target terms (and attributes) and ignore multiword terms like construction worker.
In this work, we propose methods to evaluate bias of pretrained language models. These methods do not have the aforementioned limitations. Specifically, we design two different association tests, one for measuring bias at sentence level (intrasentence), and the other at discourse level (intersentence). In these tests, each target term (e.g., tennis player) is provided with a natural context in which it appears, along with three possible associative contexts. The associative contexts help us test how good a model is at language modeling as well as evaluate the biases of the model. We crowdsource StereoSet, a dataset for associative contexts in English containing 4 target domains, 321 target terms and 16,995 test instances (triplets). A few instances are shown in Figure 1.
2 Task Formulation
What are the desiderata of an idealistic language model? First, it should be able to perform the task of language modeling, i.e., it should rank meaningful contexts higher than meaningless contexts. For example, it should tell us that Our housekeeper is a Mexican is more probable than Our housekeeper is a round. Second, it should not exhibit stereotypical bias, i.e., it should avoid ranking stereotypical contexts higher than anti-stereotypical contexts, e.g., Our housekeeper is a Mexican and Our housekeeper is an American should be equally possible. If the model consistently prefers stereotypes over anti-stereotypes, we can say that the model exhibits stereotypical bias. Based on these observations, we develop the Context Association Test (CAT), a test that measures the language modeling ability as well as the stereotypical bias of pretrained language models.
In CAT, given a context containing a target group (e.g., housekeeper), we provide three different ways to instantiate this context. Each instantiation corresponds to either a stereotypical, an anti-stereotypical, or an unrelated association. The stereotypical and anti-stereotypical associations are used to measure stereotypical bias, and the unrelated association is used to measure language modeling ability.
Specifically, we design two types of association tests, intrasentence and intersentence CATs, to assess language modeling and stereotypical bias at sentence level and discourse level. Figure 1 shows an example for each.
Our intrasentence task measures the bias and the language modeling ability for sentence-level reasoning. We create a fill-in-the-blank style context sentence describing the target group, and a set of three attributes, which correspond to a stereotype, an anti-stereotype, and an unrelated option (Figure 1(a)). In order to measure language modeling and stereotypical bias, we determine which attribute has the greatest likelihood of filling the blank, in other words, which of the instantiated contexts is most likely.
Our intersentence task measures the bias and the language modeling ability for discourse-level reasoning. The first sentence contains the target group, and the second sentence contains an attribute of the target group. Figure 1(b) shows the intersentence task. We create a context sentence with a target group that can be followed by three attribute sentences corresponding to a stereotype, an anti-stereotype and an unrelated option. We measure the bias and language modeling ability based on which attribute sentence is most likely to follow the context sentence.
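To make the structure of a CAT concrete, the following is a minimal sketch of how one instance could be represented. It is an illustration only and not the released data format: the class name `CATInstance`, the `BLANK` placeholder and the helper methods are hypothetical, and the sketch assumes that both meaningful associations are compared against the unrelated one when measuring language modeling ability.

```python
from dataclasses import dataclass
from typing import List, Tuple

BLANK = "BLANK"  # hypothetical marker for the blank in an intrasentence context


@dataclass
class CATInstance:
    """One Context Association Test: a context and three candidate associations."""
    target: str            # target term, e.g. "housekeeper"
    context: str           # sentence with a blank (intrasentence) or first sentence (intersentence)
    stereotype: str
    anti_stereotype: str
    unrelated: str
    kind: str = "intrasentence"  # or "intersentence"

    def instantiate(self, association: str) -> str:
        """Fill the blank (intrasentence) or append the follow-up sentence (intersentence)."""
        if self.kind == "intrasentence":
            return self.context.replace(BLANK, association)
        return f"{self.context} {association}"

    def bias_pair(self) -> Tuple[str, str]:
        """(stereotype, anti-stereotype) pair used to measure stereotypical bias."""
        return (self.instantiate(self.stereotype),
                self.instantiate(self.anti_stereotype))

    def language_modeling_pairs(self) -> List[Tuple[str, str]]:
        """(meaningful, meaningless) pairs used to measure language modeling ability."""
        unrelated = self.instantiate(self.unrelated)
        return [(self.instantiate(self.stereotype), unrelated),
                (self.instantiate(self.anti_stereotype), unrelated)]


# The housekeeper example from the task formulation.
example = CATInstance(
    target="housekeeper",
    context="Our housekeeper is BLANK.",
    stereotype="a Mexican",
    anti_stereotype="an American",
    unrelated="a round",
)
```

A model under test only needs to assign a score to each instantiated string; the two kinds of pairs then feed the bias and language modeling measurements separately.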
3 Related Work
Our work is inspired by several related attempts that aim to measure bias in pretrained representations such as word embeddings and language models.
3.1 Bias in word embeddings
The two popular methods of testing bias in word embeddings are word analogy tests and word association tests. In word analogy tests, given two words in a certain syntactic or semantic relation (man → king), the goal is to generate a word that is in a similar relation to a given word (woman → queen). Mikolov2013 showed that word embeddings capture syntactic and semantic word analogies, e.g., gender, morphology, etc. bolukbasi_man_2016 build on this observation to study gender bias. They show that word embeddings capture several undesired gender biases (semantic relations), e.g., doctor : man :: woman : nurse. manzini_black_2019 extend this to show that word embeddings capture several stereotypical biases such as racial and religious biases.
In the word embedding association test (WEAT, caliskan_semantics_2017), the association of two complementary classes of words, e.g., European names and African names, with two other complementary classes of attributes that indicate bias, e.g., pleasant and unpleasant attributes, is studied to quantify the bias. The bias is defined as the difference in the degree to which European names are associated with pleasant and unpleasant attributes in comparison with African names being associated with pleasant and unpleasant attributes. Here the association is defined as the similarity between the word embeddings of the names and the attributes. This is the first large-scale study that showed word embeddings exhibit several stereotypical biases and not just gender bias. Our inspiration for CAT comes from WEAT.
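The association test described above can be sketched in a few lines. This is a minimal illustration of the WEAT test statistic with cosine similarity as the association measure; the function names and the two-dimensional toy vectors are hypothetical and stand in for real word embeddings.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def association(word: Vector, attrs_a: List[Vector], attrs_b: List[Vector]) -> float:
    """Mean similarity of a word to attribute set A minus its mean similarity to set B."""
    mean_a = sum(cosine(word, a) for a in attrs_a) / len(attrs_a)
    mean_b = sum(cosine(word, b) for b in attrs_b) / len(attrs_b)
    return mean_a - mean_b


def weat_statistic(targets_x: List[Vector], targets_y: List[Vector],
                   attrs_a: List[Vector], attrs_b: List[Vector]) -> float:
    """Differential association of target sets X and Y with attribute sets A and B."""
    return (sum(association(x, attrs_a, attrs_b) for x in targets_x)
            - sum(association(y, attrs_a, attrs_b) for y in targets_y))


# Toy embeddings (hypothetical): X aligns with the "pleasant" axis, Y with the "unpleasant" axis.
pleasant, unpleasant = [(1.0, 0.0)], [(0.0, 1.0)]
names_x, names_y = [(1.0, 0.0)], [(0.0, 1.0)]
statistic = weat_statistic(names_x, names_y, pleasant, unpleasant)
```

A statistic near zero indicates that the two target sets are associated with the attribute sets to a similar degree; a large positive or negative value indicates a differential association.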
3.2 Bias in pretrained language models
may_measuring_2019 extend WEAT to sentence encoders, calling it the Sentence Encoder Association Test (SEAT). For a target term and its attribute, they create artificial sentences using generic context of the form "This is [target]." and "They are [attribute]." and obtain contextual word embeddings of the target and the attribute terms. They repeat caliskan_semantics_2017's study using these embeddings and cosine similarity as the association metric, but their study was inconclusive. Later, Kurita2019 show that cosine similarity is not the best association metric and define a new association metric based on the probability of predicting an attribute given the target in generic sentential context, e.g., [target] is [mask], where [mask] is the attribute. They show that observations similar to those of caliskan_semantics_2017 hold for contextual word embeddings too. Our intrasentence CAT is similar to their setting but with natural context. We also go beyond intrasentence to propose intersentence CATs, since language modeling is not limited to the sentence level.
3.3 Measuring bias through extrinsic tasks
Another popular method to evaluate bias of pretrained representations is to measure bias on extrinsic applications like coreference resolution rudinger2018gender; zhao_gender_2018; kiritchenko_examining_2018. In this method, neural models for downstream tasks are initialized with pretrained representations, and then fine-tuned on the target task. The bias in pretrained representations is estimated based on the performance on the target task. However, it is hard to separate the bias of task-specific training data from that of the pretrained representations. Our CATs are an intrinsic way to evaluate bias in pretrained models.
4 Dataset Creation
We select four domains as the target domains of interest for measuring bias: gender, profession, race and religion. For each domain, we select terms (e.g., Asian) that represent a social group. For collecting target term contexts and their associative contexts, we employ crowdworkers via Amazon Mechanical Turk (screenshots of our Mechanical Turk interface and details about the task setup are available in Section A.2). We restrict ourselves to crowdworkers in the USA since stereotypes could change based on the country they live in.
4.1 Target terms
We curate a diverse set of target terms for the target domains using Wikidata relation triples Vrandecic:2014:WFC:2661061.2629489. A Wikidata triple is of the form subject, relation, object (e.g., Brad Pitt, P106, Actor). We collect all objects occurring with the relations P106 (profession), P172 (race), and P140 (religion) as the target terms. We manually filter terms that are either infrequent or too fine-grained (e.g., assistant producer is merged with producer). We collect gender terms from Nosek. A list of target terms is available in Section A.3. A target term can contain multiple words (e.g., software developer).
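The collection step can be sketched as a simple filter over (subject, relation, object) triples. This is only an illustration under stated assumptions: the function name and the in-memory list of triples are hypothetical, and the manual filtering and merging of terms described above is not modeled.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

# Relation identifiers and domain names as given in the text.
DOMAIN_RELATIONS = {"P106": "profession", "P172": "race", "P140": "religion"}


def collect_target_terms(triples: Iterable[Triple]) -> Dict[str, Set[str]]:
    """Group the objects of the selected relations by target domain."""
    terms: Dict[str, Set[str]] = defaultdict(set)
    for _subject, relation, obj in triples:
        domain = DOMAIN_RELATIONS.get(relation)
        if domain is not None:  # ignore relations outside the target domains
            terms[domain].add(obj)
    return dict(terms)


# Toy input: the first triple is the example from the text, the others are hypothetical.
toy_triples = [
    ("Brad Pitt", "P106", "Actor"),
    ("Some Person", "P140", "Buddhism"),
    ("Some Person", "P19", "Some City"),  # a relation outside the target domains
]
target_terms = collect_target_terms(toy_triples)
```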
4.2 CATs collection
In the intrasentence CAT, for each target term, a crowdworker writes attribute terms that correspond to stereotypical, anti-stereotypical and unrelated associations of the target term. Then they provide a context sentence containing the target term. The context is a fill-in-the-blank sentence, where the blank can be filled either by the stereotype term or the anti-stereotype term but not the unrelated term.
In the intersentence CAT, the crowdworker first provides a sentence containing the target term. Then they provide three associative sentences corresponding to stereotypical, anti-stereotypical and unrelated associations. These associative sentences are such that the stereotypical and the anti-stereotypical sentences can follow the target term sentence but the unrelated sentence cannot.
Moreover, we ask annotators to only provide stereotypical and anti-stereotypical associations that are realistic (e.g., for the target term receptionist, the anti-stereotypical instantiation You have to be violent to be a receptionist is unrealistic since being violent is not a requirement for being a receptionist).
4.3 CATs validation
In order to ensure that stereotypes were not simply the opinion of one particular crowdworker, we validate the data collected in the above step with additional workers. For each context and its associations, we ask five validators to classify each association as a stereotype, an anti-stereotype or an unrelated association. We only retain CATs where at least three validators agree on the classification labels. This filtering retains 83% of the CATs, indicating that there is regularity in stereotypical views among the workers.
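The retention rule can be sketched as a majority check over the five validator labels. This is a minimal sketch and not the actual filtering script: the function names are hypothetical, and it assumes that the majority label must also match the label intended by the original writer for every association of a CAT.

```python
from collections import Counter
from typing import Dict, List, Optional


def majority_label(votes: List[str], min_agreement: int = 3) -> Optional[str]:
    """Return the most frequent label if at least `min_agreement` validators chose it."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agreement else None


def is_retained(intended: Dict[str, str], votes: Dict[str, List[str]]) -> bool:
    """Keep a CAT only if every association reaches agreement on its intended label.

    intended: association text -> label assigned by the writer
    votes:    association text -> labels assigned by the validators
    """
    return all(majority_label(votes[text]) == label
               for text, label in intended.items())


intended = {"a Mexican": "stereotype",
            "an American": "anti-stereotype",
            "a round": "unrelated"}
votes_agree = {"a Mexican": ["stereotype"] * 3 + ["anti-stereotype"] * 2,
               "an American": ["anti-stereotype"] * 4 + ["unrelated"],
               "a round": ["unrelated"] * 5}
votes_split = dict(votes_agree)
votes_split["a Mexican"] = ["stereotype"] * 2 + ["anti-stereotype"] * 2 + ["unrelated"]
```

With these toy votes, the first CAT would be retained and the second discarded, since no label for "a Mexican" reaches three votes.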
|Domain||# Target||# CATs||Avg Len|
5 Dataset Analysis
Are people prone to associate stereotypes with negative associations? To answer this question, we classify stereotypes into positive and negative sentiment classes using a two-class sentiment classifier (details in Section A.5). The classifier also classifies neutral sentiment such as My housekeeper is a Mexican as positive. Table 2 shows the results. As evident, people do not always associate stereotypes with negative associations (e.g., Asians are good at math is a stereotype with positive sentiment). However, people associate stereotypes with relatively more negative associations than anti-stereotypes (41% vs. 33%).
We also extract keywords in StereoSet to analyze which words are most commonly associated with the target groups. We define a keyword as a word that is relatively frequent in StereoSet compared to the natural distribution of words in large general purpose corpora kilgarriff2009simple. Table 3 shows the top keywords of each domain when compared against TenTen, a 10 billion word web corpus TenTen. We remove the target terms from keywords (since these terms are given by us to annotators). The resulting keywords turn out to be attribute terms associated with the target groups, an indication that multiple annotators are using similar attribute terms. While the target terms in gender and race are associated with physical attributes such as beautiful, feminine, masculine, etc., professional terms are associated with behavioural attributes such as pushy, greedy, hardwork, etc., and religious terms are associated with belief attributes such as diety, forgiving, reborn, etc.
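As an illustration of the keyword definition above, the sketch below scores each word by the ratio of its normalized frequency in a focus corpus to that in a reference corpus. The add-constant smoothing and its default value are assumptions made for this sketch (in the spirit of the cited keyword method), and the toy counts are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def per_million(counts: Counter) -> Dict[str, float]:
    """Convert raw counts to frequencies per million tokens."""
    total = sum(counts.values())
    return {word: count * 1_000_000 / total for word, count in counts.items()}


def keyness(focus: Counter, reference: Counter,
            smoothing: float = 1.0) -> List[Tuple[str, float]]:
    """Rank focus-corpus words by smoothed relative frequency against a reference corpus."""
    focus_fpm = per_million(focus)
    ref_fpm = per_million(reference)
    scores = {word: (fpm + smoothing) / (ref_fpm.get(word, 0.0) + smoothing)
              for word, fpm in focus_fpm.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy counts: "caring" is far more frequent in the focus corpus.
focus_counts = Counter({"caring": 5, "the": 95})
reference_counts = Counter({"caring": 1, "the": 999})
ranked = keyness(focus_counts, reference_counts)
```

Words that are common everywhere, such as function words, receive scores near one, whereas words that are characteristic of the focus corpus rise to the top of the ranking.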
6 Experimental Setup
In this section, we describe the data splits, evaluation metrics and the baselines.
6.1 Development and test sets
We split StereoSet into two sets based on the target terms: 25% of the target terms and their instances for the development set and 75% for the hidden test set. We ensure terms in the development set and test set are disjoint. We do not have a training set since this defeats the purpose of StereoSet, which is to measure the biases of pretrained language models (and not the models fine-tuned on StereoSet).
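A term-disjoint split of this kind can be sketched as follows. The function name, the dictionary-based instance format and the fixed random seed are hypothetical; the sketch only illustrates splitting by target term rather than by instance.

```python
import random
from typing import Dict, List, Tuple

Instance = Dict[str, object]


def split_by_target_term(instances: List[Instance], dev_fraction: float = 0.25,
                         seed: int = 0) -> Tuple[List[Instance], List[Instance]]:
    """Assign whole target terms to either the development set or the test set."""
    terms = sorted({inst["target"] for inst in instances})
    rng = random.Random(seed)
    rng.shuffle(terms)
    n_dev = round(len(terms) * dev_fraction)
    dev_terms = set(terms[:n_dev])
    dev = [inst for inst in instances if inst["target"] in dev_terms]
    test = [inst for inst in instances if inst["target"] not in dev_terms]
    return dev, test


# Toy data: 8 hypothetical target terms with 2 instances each.
toy_instances = [{"target": f"term{i}", "id": j} for i in range(8) for j in range(2)]
dev_set, test_set = split_by_target_term(toy_instances)
```

Because the split is made over terms, every instance of a given target term lands on the same side, which keeps the development and test terms disjoint.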
6.2 Evaluation Metrics
Our desiderata for an idealistic language model are that it excels at language modeling while not exhibiting stereotypical biases. In order to determine success at both these goals, we evaluate both language modeling and stereotypical bias of a given model. We pose both problems as ranking problems.
Language Modeling Score ($lms$)
In the language modeling case, given a target term context and two possible associations of the context, one meaningful and the other meaningless, the model has to rank the meaningful association higher than the meaningless association. The meaningless association corresponds to the unrelated option in StereoSet and the meaningful association corresponds to either the stereotype or the anti-stereotype options. We define the language modeling score ($lms$) of a target term as the percentage of instances in which a language model prefers the meaningful over the meaningless association. We define the overall $lms$ of a dataset as the average $lms$ of the target terms in the split. The $lms$ of an ideal language model will be 100, i.e., for every target term in a dataset, the model always prefers the meaningful associations of the target term.
Stereotype Score ($ss$)
Similarly, we define the stereotype score ($ss$) of a target term as the percentage of examples in which a model prefers a stereotypical association over an anti-stereotypical association. We define the overall $ss$ of a dataset as the average $ss$ of the target terms in the dataset. The $ss$ of an ideal language model will be 50, i.e., for every target term in a dataset, the model prefers neither stereotypical associations nor anti-stereotypical associations; another interpretation is that the model prefers an equal number of stereotypes and anti-stereotypes.
Idealized CAT Score ($icat$)
We combine both $lms$ and $ss$ into a single metric called the idealized CAT score ($icat$) based on the following axioms:
An ideal model must have an $icat$ score of 100, i.e., when its $lms$ is 100 and $ss$ is 50, its $icat$ score is 100.
A fully biased model must have an $icat$ score of 0, i.e., when its $ss$ is either 100 (always prefer a stereotype over an anti-stereotype) or 0 (always prefer an anti-stereotype over a stereotype), its $icat$ score is 0.
A random model must have an $icat$ score of 50, i.e., when its $lms$ is 50 and $ss$ is 50, its $icat$ score must be 50.
Therefore, we define the $icat$ score as
$$icat = lms \cdot \frac{\min(ss,\ 100 - ss)}{50}$$
This equation satisfies all the axioms. Here $\frac{\min(ss,\ 100 - ss)}{50} \in [0, 1]$ is maximized when the model neither prefers stereotypes nor anti-stereotypes for each target term and is minimized when the model favours one over the other. We scale this value using the language modeling score. An interpretation of $icat$ is that it represents the ability of a model to behave in an unbiased manner while excelling at language modeling.
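The three scores can be sketched in code as follows. This is a minimal sketch rather than the official scorer: the `Outcome` record and function names are hypothetical, each instance is assumed to contribute one meaningful-versus-meaningless comparison and one stereotype-versus-anti-stereotype comparison, and the combination used is $lms \cdot \min(ss, 100 - ss)/50$, which satisfies the three axioms stated above.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, NamedTuple


class Outcome(NamedTuple):
    """Model preferences on one CAT instance (hypothetical record format)."""
    target: str
    prefers_meaningful: bool   # meaningful association ranked above the unrelated one
    prefers_stereotype: bool   # stereotype ranked above the anti-stereotype


def _percent_by_term(outcomes: List[Outcome], field: str) -> float:
    """Percentage of positive outcomes per target term, averaged over target terms."""
    by_term: Dict[str, List[bool]] = defaultdict(list)
    for outcome in outcomes:
        by_term[outcome.target].append(getattr(outcome, field))
    return mean(100.0 * sum(values) / len(values) for values in by_term.values())


def lms(outcomes: List[Outcome]) -> float:
    """Language modeling score: ideal value is 100."""
    return _percent_by_term(outcomes, "prefers_meaningful")


def ss(outcomes: List[Outcome]) -> float:
    """Stereotype score: ideal value is 50."""
    return _percent_by_term(outcomes, "prefers_stereotype")


def icat(lms_value: float, ss_value: float) -> float:
    """Idealized CAT score: lms scaled by how close ss is to 50."""
    return lms_value * min(ss_value, 100.0 - ss_value) / 50.0


# Toy outcomes for two hypothetical target terms.
outcomes = [
    Outcome("term_a", True, True), Outcome("term_a", True, False),
    Outcome("term_b", False, True), Outcome("term_b", True, True),
]
toy_icat = icat(lms(outcomes), ss(outcomes))
```

In the toy data, both averaged scores come out at 75, and the symmetric `min` term penalizes the deviation of the stereotype score from 50 regardless of its direction.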
6.3 Baselines
IdealLM
We define this model as the one that always picks correct associations for a given target term context. It also picks an equal number of stereotypical and anti-stereotypical associations over all the target terms. So the resulting $lms$, $ss$ and $icat$ scores are 100, 50 and 100 respectively.
StereotypedLM
We define this model as the one that always picks a stereotypical association over an anti-stereotypical association. So its $ss$ is 100. As a result, its $icat$ score is 0 for any value of $lms$.
RandomLM
We define this model as the one that picks associations randomly, and therefore its $lms$, $ss$ and $icat$ scores are 50, 50 and 50 respectively.
SentimentLM
In Section 5, we saw that stereotypical instantiations are more frequently associated with negative sentiment than anti-stereotypes. In this baseline, for a given pair of context associations, the model always picks the association with the most negative sentiment.
7 Main Experiments
In this section, we evaluate popular pretrained language models such as BERT devlin_bert_2019, RoBERTa Liu2019, XLNet Yang2019 and GPT2 radford_language_2019 on StereoSet.
In the intrasentence CAT (Figure 1(a)), the goal is to fill the blank of a target term's context sentence with an attribute term. This is a natural task for BERT since it is originally trained in a similar fashion (a masked language modeling objective). We leverage pretrained BERT to compute the log probability of an attribute term filling the blank. If the term consists of multiple subword units, we compute the average log probability over all the subwords. We rank a given pair of attribute terms based on these probabilities (the one with higher probability is preferred).
For the intersentence CAT (Figure 1(b)), the goal is to select a follow-up attribute sentence given a target term sentence. This is similar to the next sentence prediction (NSP) task of BERT. We use BERT's pretrained NSP head to compute the probability of an attribute sentence following a target term sentence. Finally, given a pair of attribute sentences, we rank them based on these probabilities.
Given that RoBERTa is based on BERT, the corresponding scoring mechanism remains very similar. However, RoBERTa does not contain a pretrained NSP classification head, so we train one ourselves on 9.5 million sentence pairs from Wikipedia (details in Section A.4). Our NSP classification head achieves 94.6% accuracy with RoBERTa-base and 97.1% accuracy with RoBERTa-large on a held-out set containing 3.5M Wikipedia sentence pairs (for reference, BERT-base obtains an accuracy of 97.8% and BERT-large obtains an accuracy of 98.5%). We follow the same ranking procedure as BERT for both intrasentence and intersentence CATs.
XLNet can be used in either an auto-regressive setting or a bidirectional setting. We use the bidirectional setting in order to mimic the evaluation setting of BERT and RoBERTa. For the intrasentence CAT, we use the pretrained XLNet model. For the intersentence CAT, we train an NSP head (Section A.4), which obtains 93.4% accuracy with XLNet-base and 94.1% accuracy with XLNet-large.
Unlike the above models, GPT2 is a generative model in an auto-regressive setting, i.e., it estimates the probability of the current word based on its left context. For the intrasentence CAT, we instantiate the blank with an attribute term and compute the probability of the full sentence. In order to avoid penalizing attribute terms with multiple subwords, we compute the average log probability of each subword. Formally, if a sentence is composed of subword units $x_0, x_1, \ldots, x_N$, then we compute $\frac{1}{N} \sum_{i=1}^{N} \log P(x_i \mid x_0, \ldots, x_{i-1})$. Given a pair of associations, we rank each association using this score. For the intersentence CAT, we could use a similar method; however, we found that it performed poorly (in this setting, the language modeling score of GPT2 on the intersentence CAT is 61.5). Instead, we trained an NSP classification head on the mean-pooled representation of the subword units (Section A.4). Our NSP classifier obtains 92.5% accuracy on GPT2-small, 94.2% on GPT2-medium, and 96.1% on GPT2-large.
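The length-normalized scoring described above can be illustrated without a model by starting from per-subword conditional probabilities. In this sketch the probabilities are hypothetical numbers standing in for the outputs of an auto-regressive language model, and the function names are illustrative.

```python
import math
from typing import Dict, Sequence


def average_log_prob(conditional_probs: Sequence[float]) -> float:
    """Mean of log P(x_i | x_0, ..., x_{i-1}) over the scored subword units."""
    return sum(math.log(p) for p in conditional_probs) / len(conditional_probs)


def total_log_prob(conditional_probs: Sequence[float]) -> float:
    """Unnormalized sentence log probability, shown only for comparison."""
    return sum(math.log(p) for p in conditional_probs)


def preferred(candidates: Dict[str, Sequence[float]]) -> str:
    """Name of the candidate with the highest average log probability."""
    return max(candidates, key=lambda name: average_log_prob(candidates[name]))


# Hypothetical conditional probabilities for two instantiated sentences.
candidates = {
    "short": [0.5, 0.5],             # two subword units
    "long": [0.9, 0.9, 0.1, 0.9],    # four subword units
}
choice = preferred(candidates)
```

With these numbers the longer candidate is preferred under the averaged score even though its total log probability is lower, which is the effect that the averaging is meant to achieve.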
|Model||Language Model Score ($lms$)||Stereotype Score ($ss$)||Idealized CAT Score ($icat$)|
8 Results and discussion
Table 4 shows the overall results of baselines and models on StereoSet.
Baselines vs. Models
As seen in Table 4, all pretrained models have higher $lms$ values than RandomLM, indicating that pretrained models are better language models. Among different architectures, GPT2-large is the best performing language model (88.9 on development) followed by GPT2-medium (87.1). We take a linear weighted combination of BERT-large, GPT2-medium, and GPT2-large to build the Ensemble model, which achieves the highest language modeling performance (90.7). We use $icat$ to measure how close the models are to an idealistic language model. All pretrained models perform better on $icat$ than the baselines. While GPT2-small is the most idealistic model of all pretrained models (71.9 on development), XLNet-base is the weakest model (61.6). The $icat$ scores of SentimentLM are close to those of RandomLM, indicating that sentiment is not a strong indicator for building an idealistic language model. The overall results exhibit similar trends on the development and test sets.
|Domain||Language Model Score ($lms$)||Stereotype Score ($ss$)||Idealized CAT Score ($icat$)|
Relation between $lms$ and $ss$
All models exhibit a strong correlation between $lms$ and $ss$ scores. As the language model becomes stronger, so does its stereotypical bias ($ss$). This is unfortunate and perhaps unavoidable as long as we rely on the real-world distribution of corpora to train language models, since these corpora are likely to reflect stereotypes (unless carefully selected). Among the models, GPT2 variants strike a good balance between $lms$ and $ss$ and thereby achieve high $icat$ scores.
Impact of model size
For a given architecture, all of its pretrained models are trained on the same corpora but with different numbers of parameters. For example, both BERT-base and BERT-large are trained on Wikipedia and BookCorpus zhu2015aligning with 110M and 340M parameters respectively. As the model size increases, we see that its language modeling ability ($lms$) increases, and correspondingly its stereotype score. However, this is not always the case with $icat$. Until the language model reaches a certain performance, the model does not seem to exhibit strong stereotypical behavior. For example, the $icat$ scores of RoBERTa and XLNet increase with model size, but not those of BERT and GPT2, which are strong language models to start with.
|Model||Language Model Score ($lms$)||Stereotype Score ($ss$)||Idealized CAT Score ($icat$)|
Impact of pretraining corpora
BERT, RoBERTa, XLNet and GPT2 are trained on 16GB, 160GB, 158GB and 40GB of text corpora, respectively. Surprisingly, the size of the corpus does not correlate with either $lms$ or $icat$. This could be due to the difference in architectures and the type of corpora these models are trained on. A better way to verify this would be to train the same model on increasing amounts of corpora. Due to lack of computing resources, we leave this work to the community. We conjecture that the high performance of GPT2 (on $lms$ and $icat$) is due to the nature of its training data. GPT2 is trained on documents linked from Reddit. Since Reddit has several subreddits related to target terms in StereoSet (e.g., relationships, religion), GPT2 is likely to be exposed to correct contextual associations. Also, since Reddit is moderated in these niche subreddits (e.g., /r/feminism), it could be the case that both stereotypical and anti-stereotypical associations are learned.
Table 5 shows domain-wise results of the Ensemble model on the test set. The model is relatively less biased on race than on the other domains ($icat$ score of 69.7). We also show the high and low biased target terms for each domain from the development set. We conjecture that the high biased terms are the ones that have well-established stereotypes in society and are also frequent in language. This is the case with mother (attributes: caring, cooking), software developer (attributes: geek, nerd), and Africa (attributes: poor, dark). The low biased terms are the ones that do not have well-established stereotypes, for example, producer and Crimean. The outlier to this observation is Muslim. Although the target term Muslim has strong stereotypical associations in StereoSet (e.g., the attribute term terrorist appears in 20% of stereotypes and other hatred terms in 43% of examples), the model exhibits idealistic behavior (with a slight preference towards anti-stereotypes). We could not explain this behavior, and leave explainability of such bias to future work.
Intrasentence vs Intersentence CATs
Table 6 shows the results of intrasentence and intersentence CATs on the test set. Since intersentence tasks have more words per instance, we expect the intersentence language modeling task to be harder than the intrasentence one. This is the case with most models (except BERT).
9 Limitations
StereoSet is crowdsourced on Mechanical Turk using a worker pool from the US. Since this pool is predominantly (80%) below the age of 50 10.1145/3159652.3159661, StereoSet may not reflect the stereotypes of the wider US population.
Although stereotypes are subjective opinions, we often find them to conflict with objective facts. For instance, "I'm going to visit Iraq. Everyone is a Muslim there" is a stereotype of Iraq in our dataset; however, this is objectively incorrect since 2% of Iraqis are Christians. Similar is the case with anti-stereotypes (e.g., "I travelled to France recently. Everyone there spoke English" is an anti-stereotype since the French are stereotyped to hate English). Even in these cases, our assumption is that an idealistic language model shows no favor to one over the other. Although we explicitly asked the annotators to avoid objective statements and the validators to filter them, some false statements still remain in the data.
In some cases, it is probably useful to favor stereotypes over anti-stereotypes. Consider the CAT, "The chef made ___ food", and its stereotypical and anti-stereotypical associations, delicious and disgusting respectively. Although the chance that a chef makes delicious food is high, we still assume that an idealistic language model shows no preference to one over the other. This could be problematic. We leave this for future work.
In this work, we develop the Context Association Test (CAT) to measure the stereotypical biases of pretrained language models with respect to their language modeling ability.
We introduce a new evaluation metric, the Idealized CAT (ICAT) score, that measures how close a model is to an idealistic language model.
We crowdsource StereoSet, a dataset containing 16,995 CATs to test biases in four domains: gender, race, religion and professions.
We show that current pretrained language models exhibit strong stereotypical biases, and that the best model is 27.0 ICAT points behind the idealistic language model.
We find that the GPT2 family of models exhibit relatively more idealistic behavior than other pretrained models like BERT, RoBERTa and XLNet.
Finally, we release our dataset to the public, and present a leaderboard with a hidden test set to track the bias of future language models.
We hope that StereoSet will spur further research in evaluating and mitigating bias in language models.
We would like to thank Jim Glass, Yonatan Belinkov, Vivek Kulkarni, Spandana Gella and Abubakar Abid for their helpful comments in reviewing this paper. We also thank Avery Lamp, Ethan Weber, and Jordan Wick for crucial feedback on the MTurk interface and StereoSet website.
Appendix A
A.1 Detailed Results
A.2 Mechanical Turk Task
Our crowdworkers were required to have a 95% HIT acceptance rate and to be located in the United States. In total, 475 and 803 annotators completed the intrasentence and intersentence tasks respectively. Restricting crowdworkers to the United States helps account for differing definitions of stereotypes based on regional social expectations, though limitations in the dataset remain as discussed in Section 9. Screenshots of our Mechanical Turk interface are available in Figures 2 and 3.
A.3 Target Words
Table 9 lists the target terms used in the dataset collection task.
|barber||coach||businessperson||football player||construction worker||manager|
|boxer||chess player||priest||swimmer||tennis player||supervisor|
|Sierra Leon||Qatari||Syrian||Arab||Saudi Arabia||Lebanon|
|Austrian||White people||Finland||Australia||South African||Eriteria|
A.4 General Methods for Training a Next Sentence Prediction Head
Given some context sentence $c$ and some sentence $s$, our intersentence task requires calculating the likelihood $P(s \mid c)$.
While BERT has been trained with a Next Sentence Prediction classification head to provide $P(s \mid c)$, the other models have not. In this section, we detail our creation of a Next Sentence Prediction classification head as a downstream task.
For some sentences $A$ and $B$, our task is simply determining if Sentence $A$ follows Sentence $B$, or if Sentence $B$ follows Sentence $A$. We trivially generate this corpus from Wikipedia by sampling some $i$-th sentence, the $(i+1)$-th sentence, and a randomly chosen negative sentence from any other article. We maintain a maximum sequence length of 256 tokens, and our training set consists of 9.5 million examples.
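The corpus generation step can be sketched as follows. This is an illustrative sketch under stated assumptions: articles are represented as a hypothetical in-memory mapping from titles to lists of sentences, one positive and one negative pair is drawn per article, and tokenization and the 256-token limit are not modeled.

```python
import random
from typing import Dict, List, Tuple

Example = Tuple[str, str, int]  # (first sentence, candidate next sentence, label)


def make_nsp_examples(articles: Dict[str, List[str]],
                      rng: random.Random) -> List[Example]:
    """Create one positive and one negative next-sentence pair per article."""
    examples: List[Example] = []
    titles = list(articles)
    for title, sentences in articles.items():
        other_titles = [t for t in titles if t != title and articles[t]]
        if len(sentences) < 2 or not other_titles:
            continue  # need a consecutive pair and another article to sample from
        i = rng.randrange(len(sentences) - 1)
        first, following = sentences[i], sentences[i + 1]
        negative = rng.choice(articles[rng.choice(other_titles)])
        examples.append((first, following, 1))  # true next sentence
        examples.append((first, negative, 0))   # sentence from a different article
    return examples


# Toy articles (hypothetical); the first character of a sentence names its article.
toy_articles = {"a": ["a1", "a2", "a3"], "b": ["b1", "b2"]}
examples = make_nsp_examples(toy_articles, random.Random(0))
```

Drawing the negative from a different article keeps the two classes easy to construct, at the cost of making negatives topically distant from the context sentence.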
We train until convergence with a batch size of 80 sequences (80 sequences / batch * 256 tokens / sequence = 20,480 tokens / batch), for 10 epochs over the corpus. For BERT, we use BertAdam as the optimizer, with a learning rate of 1e-5 and a linear warmup schedule from 50 steps to 500 steps, and minimize cross entropy as our loss function. Our results are comparable to devlin_bert_2019, with each model obtaining 93-98% accuracy against the test set of 3.5 million examples.
Additional models maintain the same experimental details. Our NSP classifier achieves 94.6% accuracy with roberta-base, 97.1% accuracy with roberta-large, 93.4% accuracy with xlnet-base and 94.1% accuracy with xlnet-large.
In order to evaluate GPT-2 on intersentence tasks, we feed the mean-pooled representations across the entire sequence length into the classification head. Our NSP classifier obtains 92.5% accuracy on gpt2-small, 94.2% on gpt2-medium, and 96.1% on gpt2-large. In order to fine-tune gpt2-large on our machines, we utilized gradient accumulation with a step size of 10, and mixed precision training from Apex.
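The mean-pooling step can be illustrated with plain lists in place of tensors. This sketch makes two assumptions that are not stated in the text: padding positions are excluded from the mean via an attention mask, and the classification head is a single logistic unit. The function names and toy values are hypothetical.

```python
import math
from typing import List, Sequence


def mean_pool(hidden_states: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> List[float]:
    """Average the per-token vectors, skipping positions whose mask is 0 (padding)."""
    kept = [h for h, m in zip(hidden_states, attention_mask) if m]
    dim = len(kept[0])
    return [sum(h[d] for h in kept) / len(kept) for d in range(dim)]


def nsp_probability(pooled: Sequence[float], weights: Sequence[float],
                    bias: float) -> float:
    """Logistic head: probability that the second sentence follows the first."""
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy sequence of three token vectors, the last of which is padding.
toy_hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]
pooled = mean_pool(toy_hidden, toy_mask)
prob = nsp_probability(pooled, weights=[0.0, 0.0], bias=0.0)
```

In practice the pooled vector would come from the final hidden layer of the language model and the head parameters would be learned during the fine-tuning described above.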
A.5 Fine-Tuning BERT for Sentiment Analysis
In order to evaluate sentiment, we fine-tune BERT devlin_bert_2019 on movie reviews maas-EtAl:2011:ACL-HLT2011 for seven epochs. We used a maximum sequence length of 256 WordPieces, a batch size of 32, and Adam with a learning rate of . Our fine-tuned model achieves a 92% test accuracy on the Large Movie Review dataset.