Automatically Neutralizing Subjective Bias in Text

11/21/2019 ∙ by Reid Pryzant, et al. ∙ Kyoto University ∙ Georgia Institute of Technology ∙ Stanford University

Texts like news, encyclopedias, and some social media strive for objectivity. Yet bias in the form of inappropriate subjectivity - introducing attitudes via framing, presupposing truth, and casting doubt - remains ubiquitous. This kind of bias erodes our collective trust and fuels social conflict. To address this issue, we introduce a novel testbed for natural language generation: automatically bringing inappropriately subjective text into a neutral point of view ("neutralizing" biased text). We also offer the first parallel corpus of biased language. The corpus contains 180,000 sentence pairs and originates from Wikipedia edits that removed various framings, presuppositions, and attitudes from biased sentences. Last, we propose two strong encoder-decoder baselines for the task. A straightforward yet opaque CONCURRENT system uses a BERT encoder to identify subjective words as part of the generation process. An interpretable and controllable MODULAR algorithm separates these steps, using (1) a BERT-based classifier to identify problematic words and (2) a novel join embedding through which the classifier can edit the hidden states of the encoder. Large-scale human evaluation across four domains (encyclopedias, news headlines, books, and political speeches) suggests that these algorithms are a first step towards the automatic identification and reduction of bias.




1 Introduction

Writers and editors of texts like encyclopedias, news, and textbooks strive to avoid biased language. Yet bias remains ubiquitous. 62% of Americans believe their news is biased [14] and bias is the single largest source of distrust in the media [13].

Figure 1: Example output from our modular algorithm. “Exposed” is a factive verb that presupposes the truth of its complement (that McCain is unprincipled). Replacing “exposed” with “described” neutralizes the headline because it conveys a similar main clause proposition (someone is asserting McCain is unprincipled), but no longer introduces the author’s subjective bias via presupposition.

This work presents data and algorithms for automatically reducing bias in text. We focus on a particular kind of bias: inappropriate subjectivity (“subjective bias”). Subjective bias occurs when language that should be neutral and fair is skewed by feeling, opinion, or taste (whether consciously or unconsciously). In practice, we identify subjective bias via the method of Recasens et al. [41]: using Wikipedia’s neutral point of view (NPOV) policy. This policy is a set of principles which includes “avoiding stating opinions as facts” and “preferring nonjudgemental language”.

For example, a news headline like “John McCain exposed as an unprincipled politician” (Figure 1) is biased because the verb expose is a factive verb that presupposes the truth of its complement; a non-biased sentence would use a verb like describe so as not to presuppose something that is the subjective opinion of the writer. “Pilfered” in “the gameplay is pilfered from DDR” (Table 1) subjectively frames the shared gameplay as a kind of theft. “His” in “a lead programmer usually spends his career” again introduces a biased and subjective viewpoint (that all programmers are men) through presupposition.

We aim to debias text by suggesting edits that would make it more neutral. This contrasts with prior research which has debiased representations of text by removing dimensions of prejudice from word embeddings [3, 15] and the hidden states of predictive models [55, 8]. To avoid overloading the definition of “debias,” we refer to our kind of text debiasing as neutralizing that text. Figure 1 gives an example.

Source | Target | Subcategory
A new downtown is being developed which will bring back… | A new downtown is being developed which its promoters hope will bring back… | Epistemological
The authors’ exposé on nutrition studies | The authors’ statements on nutrition studies | Epistemological
He started writing books revealing a vast world conspiracy | He started writing books alleging a vast world conspiracy | Epistemological
Go is the deepest game in the world. | Go is one of the deepest games in the world. | Framing
Most of the gameplay is pilfered from DDR. | Most of the gameplay is based on DDR. | Framing
Jewish forces overcome Arab militants. | Jewish forces overcome Arab forces. | Framing
A lead programmer usually spends his career mired in obscurity. | Lead programmers often spend their careers mired in obscurity. | Demographic
The lyrics are about mankind’s perceived idea of hell. | The lyrics are about humanity’s perceived idea of hell. | Demographic
Marriage is a holy union of individuals. | Marriage is a personal union of individuals. | Demographic
Table 1: Samples from our new corpus. 500 sentence pairs are annotated with “subcategory” information (Column 3).

We introduce the Wiki Neutrality Corpus (WNC). This is a new parallel corpus of 180,000 biased and neutralized sentence pairs along with contextual sentences and metadata. The corpus was harvested from Wikipedia edits that were designed to ensure texts had a neutral point of view. WNC is the first parallel corpus targeting biased and neutralized language.

We also define the task of neutralizing subjectively biased text. This task shares many properties with tasks like detecting framing or epistemological bias [41], or veridicality assessment/factuality prediction [43, 32, 42, 51]. Our new task extends these detection/classification problems into a generation task: generating more neutral text with otherwise similar meaning.

Finally, we propose a pair of novel sequence-to-sequence algorithms for this neutralization task. Both methods leverage denoising autoencoders and a token-weighted loss function. First, an interpretable and controllable modular algorithm breaks the problem into (1) detection and (2) editing, using (1) a BERT-based detector to explicitly identify problematic words, and (2) a novel join embedding through which the detector can modify an editor’s hidden states. This paradigm advances an important human-in-the-loop approach to bias understanding and generative language modeling. Second, an easy to train and use but more opaque concurrent system uses a BERT encoder to identify subjectivity as part of the generation process.

Large-scale human evaluation suggests that while not without flaws, our algorithms can identify and reduce bias in encyclopedias, news, books, and political speeches, and do so better than state-of-the-art style transfer and machine translation systems. This work represents an important first step towards automatically managing bias in the real world. We release data and code to the public.

2 Wiki Neutrality Corpus (WNC)

The Wiki Neutrality Corpus consists of aligned sentences pre- and post-neutralization by English Wikipedia editors (Table 1). We used regular expressions to crawl 423,823 Wikipedia revisions between 2004 and 2019 where editors provided NPOV-related justification [53, 41, 52]. To maximize the precision of bias-related changes, we ignored revisions where

  • More than a single sentence was changed.

  • Minimal edits (character Levenshtein distance < 4).

  • Maximal edits (more than half of the words changed).

  • Edits where more than half of the words were proper nouns.

  • Edits that fixed spelling or grammatical errors.

  • Edits that added references or hyperlinks.

  • Edits that changed non-literary elements like tables or punctuation.
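Two of the filters above are purely string-based and can be sketched in a few lines. The following is a minimal illustration, not the authors' crawler: it assumes the minimal-edit threshold is a character Levenshtein distance below 4, approximates "more than half of the words changed" with a word-level diff from Python's `difflib`, and uses hypothetical function names (`levenshtein`, `changed_word_fraction`, `keep_revision`).

```python
import difflib


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance computed with the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def changed_word_fraction(before: str, after: str) -> float:
    """Fraction of source words that do not survive into the edited sentence."""
    src, tgt = before.split(), after.split()
    if not src:
        return 0.0
    matcher = difflib.SequenceMatcher(a=src, b=tgt, autojunk=False)
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - kept / len(src)


def keep_revision(before: str, after: str,
                  min_chars: int = 4, max_fraction: float = 0.5) -> bool:
    """Apply the minimal-edit and maximal-edit filters (thresholds are assumptions)."""
    if levenshtein(before, after) < min_chars:
        return False  # too small: likely a typo fix
    if changed_word_fraction(before, after) > max_fraction:
        return False  # too large: likely a rewrite rather than a neutralization
    return True
```

For instance, `keep_revision("Most of the gameplay is pilfered from DDR.", "Most of the gameplay is based on DDR.")` passes both filters, while a two-character spelling correction is rejected by the first.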

We align sentences in the pre- and post-edit text by computing pairwise BLEU [36] between sentences within a fixed-size sliding window and matching sentences with the highest score [12, 47]. Last, we discarded pairs whose length ratios were beyond the 95th percentile [39].

Data | Sentence pairs | Total words | Seq length (mean) | # revised words
Biased-full | 181,496 | 10.2M | 28.21 | 4.05
Biased-word | 55,503 | 2.8M | 26.22 | 1.00
Neutral | 385,639 | 17.4M | 22.58 | 0.00
Table 2: Corpus statistics.

Corpus statistics are given in Table 2. The final data are (1) a parallel corpus of 180k biased sentences and their neutral counterparts, and (2) 385k neutral sentences that were adjacent to a revised sentence at the time of editing but were not changed by the editor. Note that following Recasens et al. [41], the neutralizing experiments in Section 4 focus on the subset of WNC where the editor modified or deleted a single word in the source text (“Biased-word” in Table 2).

Table 1 also gives a categorization of these sample pairs using a slight extension of the typology of Recasens et al. [41]. They defined framing bias as using subjective words or phrases linked with a particular point of view (like using words like best or deepest, or using pilfered from instead of based on), and epistemological bias as linguistic features that subtly (often via presupposition) focus on the believability of a proposition. We add to their two a third kind of subjectivity bias that also occurs in our data, which we call demographic bias: text with presuppositions about particular genders, races, or other demographic categories (like presupposing that all programmers are male).

Subcategory Percent
Epistemological 25.0
Framing 57.7
Demographic 11.7
Noise 5.6
Table 3: Proportion of bias subcategories in Biased (full).

The dataset does not include labels for these categories, but we hand-labeled a random sample of 500 examples to estimate the distribution of the 3 types. Table 3 shows that while framing bias is most common, all types of bias are represented in the data, including instances of demographic bias.

2.1 Dataset Properties

We take a closer look at WNC to identify characteristics of subjective bias on Wikipedia.

Topic. We use the Wikimedia Foundation’s categorization models [1] to bucket articles from WNC and the aforementioned random sample into a 44-category ontology, then compare the proportions of NPOV-driven edits across categories. Subjectively biased edits are most prevalent in history, politics, philosophy, sports, and language categories. They are least prevalent in the meteorology, science, landforms, broadcasting, and arts categories. This suggests that there is a relationship between a text’s topic and the realization of bias. We use this observation to guide our model design in Section 3.1.

Tenure. We group editors into “newcomers” (less than a month of experience) and “experienced” (more than a month). We find that newcomers are less likely to perform neutralizing edits (15% in WNC) compared to other edits (34% in a random sample of 685k edits). This difference is significant (p < 0.001), suggesting that the complex work of neutralizing text is typically left to more senior editors, which helps explain the performance of human evaluators in Section 6.1.

3 Methods for Neutralizing Text

We propose the task of neutralizing text, in which the algorithm is given an input sentence and must produce an output sentence whose meaning is as similar as possible to the input but with the subjective bias removed.

We propose two algorithms for this task, each with its own benefits. A modular algorithm enables human control and interpretability. A concurrent algorithm is simple to train and operate.

We adopt the following notation:

  • $s = [w^s_1, \ldots, w^s_n]$ is a source sequence of subjectively biased text.

  • $t = [w^t_1, \ldots, w^t_m]$ is a target sequence and the neutralized version of $s$.

3.1 Modular

The first algorithm we are proposing has two stages: BERT-based detection and LSTM-based editing. We pretrain a model for each stage and then combine them into a joint system for end-to-end fine tuning on the overall neutralizing task. We proceed to describe each module.

Detection Module

The detection module is a neural sequence tagger that estimates $p_i$, the probability that each input word $w^s_i$ is subjectively biased (Figure 2).

Module description. Each $p_i$ is calculated according to

$$p_i = \sigma(\mathbf{b}_i \mathbf{W}^b + \mathbf{e}_i \mathbf{W}^e + b) \tag{1}$$

  • $\mathbf{b}_i$ represents the semantic meaning of the $i$-th source word $w^s_i$. It is a contextualized word vector produced by BERT, a transformer encoder that has been pre-trained as a masked language model [9]. To leverage the bias-topic relationship uncovered in Section 2.1, we prepend a token indicating an article’s topic category (<arts>, <sports>, etc.) to the source $s$. The word vectors for these tokens are learned from scratch.

  • $\mathbf{e}_i$ represents expert features of bias proposed by [41]:

$$\mathbf{e}_i = \mathbf{W}^{in} \mathbf{f}_i \tag{2}$$

    $\mathbf{W}^{in}$ is a matrix of learned parameters, and $\mathbf{f}_i$ is a vector of discrete features (such as lexicons of hedges, factives, assertives, implicatives, and subjective words; see code release).

  • $\mathbf{W}^b$, $\mathbf{W}^e$, and $b$ are learnable parameters.
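The detector's per-word computation can be illustrated with a small sketch. This is one plausible reading of the description above, assuming a logistic layer over the BERT vector and the projected expert features; the functional form, the argument names (`b_i`, `f_i`, `W_in`, `w_b`, `w_e`, `bias`), and the use of plain Python lists instead of tensors are all assumptions made for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, vector) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bias_probability(b_i, f_i, W_in, w_b, w_e, bias):
    """Probability that one word is subjectively biased.

    b_i  : contextual word vector (stands in for the BERT output)
    f_i  : discrete expert features (e.g. lexicon membership flags)
    W_in : learned projection of the expert features
    w_b, w_e, bias : learned output-layer parameters
    """
    e_i = matvec(W_in, f_i)                        # project expert features
    logit = dot(b_i, w_b) + dot(e_i, w_e) + bias   # combine both signals
    return sigmoid(logit)
```

With all weights at zero the detector is maximally uncertain and returns 0.5 for every word; training moves the logit up for words that editors tend to change.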

Figure 2: The detection module uses discrete features and the BERT embedding to calculate a logit for each word.


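Module pre-training. We train this module using diffs between the source and target text. A label $p^*_i$ is 1 if the $i$-th source word $w^s_i$ was deleted or modified as part of the neutralizing process. A label is 0 if the word occurs in both the source and target text. The loss is calculated as the average negative log likelihood of the labels, where $p_i$ is the predicted probability for the $i$-th of $n$ source words:

$$\mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n} \left[ p^*_i \log p_i + (1 - p^*_i) \log (1 - p_i) \right]$$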
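The labeling and loss described here can be sketched as follows. This is a hedged illustration: Python's `difflib.SequenceMatcher` stands in for whatever diff tool the authors used, and the names `bias_labels` and `mean_nll` are hypothetical.

```python
import difflib
import math


def bias_labels(source_tokens, target_tokens):
    """Label each source token 1 if it was deleted or modified, else 0."""
    labels = [1] * len(source_tokens)
    matcher = difflib.SequenceMatcher(a=source_tokens, b=target_tokens,
                                      autojunk=False)
    for block in matcher.get_matching_blocks():
        # Tokens inside a matching block appear unchanged in the target.
        for i in range(block.a, block.a + block.size):
            labels[i] = 0
    return labels


def mean_nll(probs, labels, eps=1e-12):
    """Average negative log likelihood of binary labels under predicted probs."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


source = "John McCain exposed as an unprincipled politician".split()
target = "John McCain described as an unprincipled politician".split()
labels = bias_labels(source, target)  # only "exposed" is marked as biased
```

A detector that outputs 0.5 for every word incurs a loss of log 2 on this example, which is the usual chance-level reference point for a binary tagger.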

Editing Module

The editing module takes a subjective source sentence $s$ and is trained to edit it into a more neutral complement $t$.

Module description. This module is based on a sequence-to-sequence neural machine translation model [30]. A bi-LSTM [20] encoder turns the source $s$ into a sequence of hidden states $H = (\mathbf{h}_1, \ldots, \mathbf{h}_n)$. Next, an LSTM decoder generates text one token at a time by repeatedly attending to $H$ and producing probability distributions over the vocabulary. We also add two mechanisms from the summarization literature [44]. The first is a copy mechanism, where the model’s final output for each timestep becomes a weighted combination of the predicted vocabulary distribution and the attentional distribution from that timestep. The second is a coverage mechanism, which incorporates the sum of previous attention distributions into the final loss function to discourage the model from re-attending to a word and repeating itself.

Module pre-training. We pre-train the decoder as a language model of neutral text using the neutral portion of WNC (Section 2). Doing so expresses a data-driven prior about how target sentences should read. We accomplish this with a denoising autoencoder objective [19], maximizing the conditional log probability $\log p(x \mid \tilde{x})$ of reconstructing a sequence $x$ from a corrupted version of itself, $\tilde{x} = C(x)$, using noise model $C$.

Our $C$ is similar to [27]. We slightly shuffle $x$ such that $x_i$’s index in $\tilde{x}$ is randomly selected from $[i - k, i + k]$. We then drop words with probability $p$. Both $k$ and $p$ are fixed hyperparameters in our experiments.
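A noise model of this kind (a bounded local shuffle followed by random word dropping) can be sketched as below. The sketch assumes the common trick of sorting positions by a noisy key, which keeps every word within a fixed window of its original position; the window size `k=2` and drop probability `p=0.2` are illustrative placeholders rather than the settings used in the experiments, and the function names are hypothetical.

```python
import random


def local_shuffle(tokens, k, rng):
    """Permute tokens so that no token moves more than k positions."""
    # Adding noise in [0, k + 1) to each index and sorting by the noisy key
    # bounds every token's displacement by k.
    keys = [i + rng.uniform(0, k + 1) for i in range(len(tokens))]
    order = sorted(range(len(tokens)), key=keys.__getitem__)
    return [tokens[i] for i in order]


def drop_words(tokens, p, rng):
    """Drop each token independently with probability p (never return empty)."""
    kept = [tok for tok in tokens if rng.random() >= p]
    return kept if kept else tokens[:1]


def corrupt(tokens, k=2, p=0.2, rng=None):
    """Noise model: local shuffle, then word dropout."""
    rng = rng or random.Random()
    return drop_words(local_shuffle(tokens, k, rng), p, rng)
```

During pre-training, the decoder would be asked to reconstruct the clean sentence from `corrupt(sentence)`, so it learns what fluent neutral text looks like before it ever sees a biased source.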

Figure 3: The modular system uses join embedding to reconcile the detector’s predictions with an encoder-decoder architecture.

Final System

Once the detection and editing modules have been pre-trained, we join them and fine-tune together as an end-to-end system for translating the source $s$ into the target $t$.

This is done with a novel join embedding mechanism that lets the detector control the editor (Figure 3). The join embedding is a vector $\mathbf{v}$ that we add to each encoder hidden state $\mathbf{h}_i$ in the editing module. This operation is gated by the detector’s output probabilities $\mathbf{p} = (p_1, \ldots, p_n)$. Note that the same $\mathbf{v}$ is applied across all timesteps.

$$\mathbf{h}'_i = \mathbf{h}_i + p_i \cdot \mathbf{v} \tag{3}$$

We proceed to condition the decoder on the new hidden states $H' = (\mathbf{h}'_1, \ldots, \mathbf{h}'_n)$. Intuitively, $\mathbf{v}$ is enriching the hidden states of words that the detector identified as subjective. This tells the decoder what language should be changed and what is safe to be copied during the neutralization process. Error signals are allowed to flow backwards into both the encoder and detector, creating an end-to-end system from the two modules.
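The gating step can be sketched in a few lines. This is a minimal illustration assuming the join embedding is added to each hidden state after being scaled by that word's detector probability; plain lists stand in for tensors, and `apply_join_embedding` is a hypothetical name.

```python
def apply_join_embedding(hidden_states, probs, join_vector):
    """Add the join vector to each hidden state, scaled by the detector's probability.

    hidden_states : one vector per source word (encoder outputs)
    probs         : one detector probability per source word
    join_vector   : a single learned vector shared across all positions
    """
    return [
        [h_d + p * v_d for h_d, v_d in zip(h, join_vector)]
        for h, p in zip(hidden_states, probs)
    ]


# A word the detector ignores (p = 0) keeps its hidden state unchanged,
# while a flagged word (p = 1) is shifted by the full join vector.
H = [[1.0, 1.0], [2.0, 2.0]]
H_new = apply_join_embedding(H, probs=[0.0, 1.0], join_vector=[0.5, -0.5])
```

Because the operation is differentiable in both the probabilities and the join vector, gradients from the decoder's loss can reach the detector, which is what allows the two modules to be fine-tuned jointly.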

To fine-tune the parameters of the joint system, we use a token-weighted loss function that scales the loss on neutralized words (i.e. words unique to the target $t$) by a factor of $\alpha$:

$$\mathcal{L}(s, t) = -\sum_{i=1}^{m} \lambda(w^t_i, s) \log p(w^t_i \mid s, w^t_{<i}) + c$$

$$\lambda(w^t_i, s) = \begin{cases} \alpha & \text{if } w^t_i \notin s \\ 1 & \text{otherwise} \end{cases}$$

Here $w^t_i$ is the $i$-th of the $m$ target words. Note that $c$ is a term from the coverage mechanism (Section 3.1). The weight $\alpha$ is a fixed hyperparameter in our experiments. Intuitively, this loss function incorporates an inductive bias of the neutralizing process: the source and target have a high degree of lexical similarity, but the goal is to learn the structure of their differences, not simply to copy words into the output (something a pre-trained autoencoder should already know how to do). This loss function is related to previous work on grammar correction [24] and cost-sensitive learning [56].
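A token-weighted loss of this kind can be sketched as follows, assuming a summed negative log likelihood in which target words absent from the source are up-weighted by a constant factor. The name `token_weighted_loss`, the `alpha` argument, and the value 2.0 in the example are illustrative assumptions, not the authors' settings.

```python
import math


def token_weighted_loss(target_tokens, target_probs, source_tokens,
                        alpha, coverage=0.0):
    """Negative log likelihood with extra weight on words unique to the target.

    target_probs[i] is the model's probability for target_tokens[i];
    coverage is an optional additive penalty from the coverage mechanism.
    """
    source_vocab = set(source_tokens)
    loss = 0.0
    for token, prob in zip(target_tokens, target_probs):
        weight = alpha if token not in source_vocab else 1.0
        loss -= weight * math.log(prob)
    return loss + coverage


source = "the gameplay is pilfered from DDR".split()
target = "the gameplay is based on DDR".split()
# "based" and "on" are new words, so their errors count alpha times as much.
loss = token_weighted_loss(target, [0.5] * 6, source, alpha=2.0)
```

In this example four copied words contribute log 2 each and the two new words contribute 2 log 2 each, so the edit accounts for half of the total loss even though it covers a third of the tokens.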

3.2 Concurrent

Our second algorithm takes the problematic source $s$ and directly generates a neutralized output $\hat{t}$. While this renders the system easier to train and operate, it limits interpretability and controllability.

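Model description. The concurrent system is an encoder-decoder neural network. The encoder is BERT. The decoder is the same as that of Section 3.1: an attentional LSTM with copy and coverage mechanisms. The decoder’s inputs are set to:

  • Hidden states $H = \mathbf{W}^H \, \text{BERT}(s)$, where $\text{BERT}(s)$ is the BERT-embedded source and $\mathbf{W}^H$ is a matrix of learned parameters.

  • Initial states $\mathbf{c}_0$ and $\mathbf{h}_0$, each computed from the BERT-embedded source through a learned matrix ($\mathbf{W}^{c_0}$ and $\mathbf{W}^{h_0}$, respectively).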

Model training. The concurrent model is pre-trained with the same autoencoding procedure described in Section 3.1. It is then fine-tuned as a subjective-to-neutral translation system with the same loss function described in Section 3.1.

Method BLEU Accuracy Fluency Bias Meaning
Source Copy 91.33 0.00 - - -
Detector (always delete biased word) 92.43* 38.19* -0.253* -0.324* 1.108*
Detector (predict substitution from biased word) 92.51 36.57* -0.233* -0.327* 1.139*
Delete Retrieve (ST) [29] 88.46* 14.50* -0.209* -0.456* 1.294*
Back Translation (ST) [38] 84.95* 9.92* -0.359* -0.390* 1.126*
Transformer (MT) [49] 86.40* 24.34* -0.259* -0.458* 0.905*
Seq2Seq (MT) [30] 89.03* 23.93 -0.423* -0.436* 1.294*
Base 89.13 24.01 - - -
+ loss 90.32* 24.10 - - -
+ loss + pretrain 92.89* 34.76* - - -
+ loss + pretrain + detector (modular) 93.52* 45.80* -0.078 -0.467* 0.996*
+ loss + pretrain + BERT (concurrent) 93.94 44.87 0.132 -0.423* 0.758*
Target copy 100.0 100.0 -0.077 -0.551* 1.128*
Table 4: Performance of various systems. ST indicates a style transfer system. MT indicates a machine translation system. For quantitative metrics, rows with asterisks are significantly different than the preceding row. For qualitative metrics, rows with asterisks are significantly different from zero. Higher is preferable for fluency, while lower is preferable for bias and meaning.

4 Experiments

4.1 Experimental Protocol


We implemented nonlinear models with PyTorch [37] and optimized using Adam [25] as configured in [9] with a learning rate of 5e-5. We used a batch size of 16. All vectors shared a single fixed length unless otherwise specified. We use gradient clipping with a maximum gradient norm of 3 and a dropout probability of 0.2 on the inputs of each LSTM cell [46]. We initialize the BERT component of the tagging module with the publicly-released bert-base-uncased parameters. All other parameters were initialized uniformly at random.

Procedure. Following Recasens et al. [41], we train and evaluate our system on the subset of WNC where the editor changed or deleted a single word in the source text. This yielded 53,803 training pairs (about a quarter of the WNC), from which we sampled 700 development and 1,000 test pairs. For fair comparison, we gave our baselines additional access to the 385,639 neutral examples when possible. We pretrained the tagging module for 4 epochs. We pretrained the editing module on the neutral portion of our WNC for 4 epochs. The joint system was trained on the same data as the tagger for 25,000 steps (about 7 epochs). We perform inference using beam search and a beam width of 4. All computations were performed on a single NVIDIA TITAN X GPU; training the full system took approximately 10 hours. We report statistical significance with bootstrap resampling and a 95% confidence level [26, 11].
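Bootstrap significance testing can be illustrated with a paired resampling sketch. This is a generic version of the technique, not the authors' evaluation script: it assumes per-sentence scores (for example, 1 for an exact match and 0 otherwise) for two systems on the same test set, and the name `paired_bootstrap` and the default of 1,000 resamples are assumptions.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Fraction of resampled test sets on which system A outscores system B.

    Both score lists must be aligned: scores_a[i] and scores_b[i] refer
    to the same test sentence.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must be the same length")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Draw a test set of the same size, with replacement.
        sample = [rng.randrange(n) for _ in range(n)]
        total_a = sum(scores_a[i] for i in sample)
        total_b = sum(scores_b[i] for i in sample)
        if total_a > total_b:
            wins += 1
    return wins / n_resamples
```

Under a 95% confidence level, system A would be called significantly better than system B when the returned fraction is at least 0.95.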

Evaluation. We evaluate our models according to five metrics. BLEU [36] and accuracy (the proportion of decodings that exactly matched the editors’ changes) are quantitative. We also hired fluent English-speaking crowdworkers on Amazon Mechanical Turk. Workers were shown the Recasens et al. [41] and Wikipedia definitions of a “biased statement” and six example sentences, then given a five-question qualification test where they had to identify subjectivity bias. Approximately half of the 30,000 workers who took the qualification test passed. Those who passed were asked to compare pairs of original and edited sentences (not knowing which was the original) along three criteria: fluency, meaning preservation, and bias. Fluency and bias were evaluated on a semantic differential scale from -2 to 2. Here, a semantic differential scale can better evaluate attitude-oriented questions with two polarized options (e.g., “is text A or B more fluent?”). Meaning was evaluated on a Likert scale from 0 to 4, ranging from “totally different” to “identical”. Inter-rater agreement was fair to substantial (Krippendorff’s alpha of 0.65 for fluency, 0.33 for meaning, and 0.51 for bias; as a rule of thumb, values below 0 indicate “poor” agreement, 0 to .20 “slight”, .21 to .40 “fair”, .41 to .60 “moderate”, .61 to .80 “substantial”, and .81 to 1 “near perfect” [17]). We report statistical significance with a t-test and 95% confidence interval.

4.2 Wikipedia (WNC)

Results on WNC are presented in Table 4. In addition to methods from the literature we include (1) a BERT-based system which simply predicts and deletes subjective words, and (2) a system which predicts replacements (including deletion) for subjective words directly from their BERT embeddings. All methods appear to successfully reduce bias according to the human evaluators. However, many methods appear to lack fluency. Adding a token-weighted loss function and pretraining the decoder help the model’s coherence according to BLEU and accuracy. Adding the detector (modular) or a BERT encoder (concurrent) provides additional benefits. The proposed models retain the strong effects of systems from the literature while also producing target-level fluency on average. Our results suggest there is no clear winner between our two proposed systems. The modular system is better at reducing bias and has higher accuracy, while the concurrent system produces more fluent responses, preserves meaning better, and has higher BLEU.

Metric Fluency Bias Meaning
BLEU 0.65 0.34 0.16
Accuracy 0.56 0.52 0.20
Table 5: Spearman correlation ($\rho$) between quantitative and qualitative metrics.

Table 5 indicates that BLEU is more correlated with fluency but accuracy is more correlated with subjective bias reduction. The weak association between BLEU and human evaluation scores is corroborated by other research [7, 34]. We conclude that neither automatic metric is a true substitute for human judgment.

4.3 Real-world Media

To demonstrate the efficacy of the proposed methods on subjective bias in the wild, we perform inference on three out-of-domain datasets (Table 6). We prepared each dataset according to the same procedure as WNC (Section 2). After inference, we enlisted 1800 raters to assess the quality of 200 randomly sampled datapoints. Note that for partisan datasets we sample an equal number of examples from “conservative” and “liberal” sources. These data are:

  • The Ideological Books Corpus (IBC) consisting of partisan books and magazine articles [45, 23].

  • Headlines of partisan news articles identified as biased according to

  • Sentences from the campaign speeches of a prominent politician (United States President Donald Trump). We filtered out dialog-specific artifacts (interjections, phatics, etc.) by removing all sentences with fewer than 4 tokens before sampling a test set.

Overall, while modular does a better job at reducing bias, concurrent appears to better preserve the meaning and fluency of the original text. We conclude that the proposed methods, while imperfect, are capable of providing useful suggestions for how subjective bias in real-world news or political text can be reduced.

IBC Corpus
Method Fluency Bias Meaning
modular -0.041 -0.509* 0.882*
concurrent -0.001 -0.184 0.501*
Original Activists have filed a lawsuit…
modular Critics of it have filed a lawsuit…
concurrent Critics have filed a lawsuit…
News Headlines
Method Fluency Bias Meaning
modular -0.46* -0.511* 1.169*
concurrent -0.141* -0.393* 0.752*
Original Zuckerberg claims Facebook can…
modular Zuckerberg stated Facebook can…
concurrent Zuckerberg says Facebook can…
Trump Speeches
Method Fluency Bias Meaning
modular -0.353* -0.563* 1.052*
concurrent -0.117 -0.127 0.757*
Original This includes amazing Americans like…
modular This includes Americans like…
concurrent This includes some Americans like…
Table 6: Performance on out-of-domain datasets. Higher is preferable for fluency, while lower is preferable for bias and meaning. Rows with asterisks are significantly different from zero.

5 Error Analysis

To better understand the limits of our models and the proposed task of bias neutralization, we randomly sample 50 errors produced by our models on the Wikipedia test set and bin them into the following categories:

  • No change. The model failed to remove or change the source sentence.

  • Bad change. The model modified the source but introduced an edit which failed to match the ground-truth target (i.e. the Wikipedia editor’s change).

  • Disfluency. Errors in language modeling and text generation.

  • Noise. The datapoint is noisy and the target text is not a neutralized version of the source.

Error Type Proportion (%) Valid (%)
No change 38 0
Bad change 42 80
Disfluency 12 0
Noise 8 87
Table 7: Distribution of model errors on the Wikipedia test set. We also give the percent of errors that were valid neutralizations of the source despite failing to match the target sentence.

The distribution of errors is given in Table 7. Most errors are due to the subtlety and complexity of language understanding required for bias neutralization, rather than the generation of fluent text. These challenges are particularly pronounced for neutralizing edits that involve the replacement of factive and assertive verbs. As column 2 shows, a large proportion of the errors, though disagreeing with the edit written by the Wikipedia editors, nonetheless successfully neutralize bias in the source.

Examples of each error type are given in Table 9. As the examples show, our models have a tendency to simply remove words instead of finding a good replacement.

Method | Accuracy
Linguistic features | 0.395*
Bag-of-words | 0.584*
+ Linguistic features (Recasens, 2013) | 0.617
BERT | 0.744*
+ Linguistic features | 0.752
+ Linguistic features + Category (modular detector) | 0.759
concurrent encoder | 0.745
Human | 0.543*
Table 8: Performance of various bias detectors. Rows with asterisks are statistically different than the preceding row.
Error Type Source, Output, then Target
No change Existing hot-mail accounts were upgraded to on April 3, 2013.
Existing hot-mail accounts were upgraded to on April 3, 2013.
Existing hot-mail accounts were changed to on April 3, 2013.
Bad change His exploitation of leased labor began in 1874 and continued until his death in 1894…
His actions of leased labor began in 1874 and continued until his death in 1894…
His use of leased labor began in 1874 and continued until his death in 1894…
Disfluency Right before stabbing a cop, flint attacker shouted one thing that proves terrorism is still here.
Right before stabbing a cop, flint attacker shouted one thing that may may terrorism is still here.
Right before stabbing a cop, flint attacker shouted one thing that may prove terrorism is still here.
Noise …then whent to war with him in the Battle of Bassorah, and ultimately left that battle.
…then whent to war with him in the Battle of Bassorah, and ultimately left that battle.
…then whent to war with him in the Battle of the Camel, and ultimately left that battle.
Revised Word Source, Output, then Target
Magnificent After a dominant performance, Joshua…with a magnificent seventh-round knockout win.
After a dominant performance, Joshua…with a seventh-round knockout win.
After a dominant performance, Joshua…with a seventh-round knockout win.
Dominant Jewish history is…interacted with other dominant peoples, religions and cultures.
Jewish history is…other peoples, religions and cultures.
Jewish history is…other peoples, religions and cultures.
Selected Word Output
(input) In recent years, the term has often been misapplied to those who are merely clean-cut.
merely In recent years, the term has often been misapplied to those who are clean-cut.
misapplied In recent years, the term has often been shown to those who are merely clean-cut.
(input) He was responsible for the assassination of Carlos Marighella, and for the Lapa massacre.
assassination He was responsible for the killing of Carlos Marighella, and for the Lapa massacre.
massacre He was responsible for the assassination of Carlos Marighella, and for the Lapa incident.
(input) Paul Ryan desperately searches for a new focus amid Russia scandal.
desperately Paul Ryan searches for a new focus amid Russia scandal.
scandal Paul Ryan desperately searches for a new focus amid Russia.
Table 9: Top: examples of model errors from each error category. Middle: the model treats words differently based on their context; in this case, “dominant” is ignored when it accurately describes an individual’s winning performance, but deleted when it describes a group of people in arbitrary comparison. Bottom: the modular model can sometimes be controlled, for example by selecting words to change, to correct errors or otherwise change the model’s behavior.

6 Algorithmic Analysis

We proceed to analyze our algorithm’s ability to detect and categorize bias as well as the efficacy of the proposed join embedding.

6.1 Detecting Subjectivity

Identifying subjectivity in a sentence (explicitly or implicitly) is a prerequisite to neutralizing it. We accordingly evaluate our model’s (and 3,000 crowdworkers’) ability to detect subjectivity using the procedure of Recasens et al. [41] and the same 50k training examples as Section 4. For each sentence, we select the word with the highest predicted probability and test whether that word was in fact changed by the editor. The proportion of correctly selected words is the system’s “accuracy”. Results are given in Table 8.
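The detection accuracy described here reduces to an argmax check per sentence, sketched below. The function name `detection_accuracy` and the input layout (one list of per-word probabilities and one list of 0/1 edit labels per sentence) are assumptions for illustration.

```python
def detection_accuracy(predicted_probs, gold_labels):
    """Share of sentences whose highest-scoring word was actually edited.

    predicted_probs : per sentence, a list of per-word bias probabilities
    gold_labels     : per sentence, a list of 0/1 flags (1 = changed by the editor)
    """
    correct = 0
    for probs, labels in zip(predicted_probs, gold_labels):
        # Pick the single word the detector is most confident about.
        best = max(range(len(probs)), key=probs.__getitem__)
        correct += labels[best]
    return correct / len(predicted_probs)


probs = [[0.1, 0.9, 0.2], [0.8, 0.1, 0.1]]
gold = [[0, 1, 0], [0, 0, 1]]
accuracy = detection_accuracy(probs, gold)  # first sentence right, second wrong
```

Because only the top-ranked word counts, this metric rewards confident localization and is well defined for the single-word subset of the corpus, where exactly one source word carries the edit.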

Note that concurrent lacks an interpretive window into its detection behavior, so we estimate an upper bound on the model’s detection abilities by (1) feeding the encoder’s hidden states into a fully connected + softmax layer that predicts the probability of a token being subjectively biased, and (2) training this layer as a sequence tagger according to the procedure of Section 3.1.

The low human performance can be attributed to the difficulty of identifying bias. Issues of bias are typically reserved for senior Wikipedia editors (Section 2.1) and untrained workers performed worse (37.39%) on the same task in [41] (and can struggle on other tasks requiring linguistic knowledge [6]). The concurrent encoder, which is architecturally identical to BERT, had similar performance to a stand-alone BERT system. The linguistic and category-related features in the modular detector gave it slight leverage over the plain BERT-based models.

6.2 Join Embedding

We continue by analyzing the abilities of the proposed join embedding mechanism.

Join Embedding Ablation

The join embedding combines two separately pretrained models through a gated embedding instead of the more traditional practice of stripping off any final classification layers and concatenating the exposed hidden states [2]. We accordingly ablated the join embedding mechanism by training a new model where the pre-trained detector is frozen and its pre-output hidden states are concatenated to the encoder’s hidden states before decoding. Doing so reduced performance to 90.78 BLEU and 37.57 Accuracy (from the 93.52/46.8 with the join embedding). This suggests learned embeddings can be a high-performance and end-to-end conduit between sub-modules of machine learning systems.
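The two ways of combining the modules can be contrasted in a short sketch. It assumes that the gated form referred to as Equation 3 adds a single learned vector v to each encoder state in proportion to that token's detection probability, h'_i = h_i + p_i · v; that equation is not reproduced in this section, so treat the form as an assumption. The function names and toy values are hypothetical.

```python
from typing import List, Sequence


def join_embedding(
    hidden: Sequence[Sequence[float]],
    probs: Sequence[float],
    v: Sequence[float],
) -> List[List[float]]:
    """Gated combination (assumed form): h'_i = h_i + p_i * v.

    The hidden size is unchanged; tokens with p_i near 0 pass through
    almost untouched, tokens with p_i near 1 are shifted by v.
    """
    return [
        [h_j + p * v_j for h_j, v_j in zip(h, v)]
        for h, p in zip(hidden, probs)
    ]


def concatenate(
    hidden: Sequence[Sequence[float]],
    detector_hidden: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Ablation baseline: append the detector's hidden state to each
    encoder state, which enlarges the vector the decoder must consume."""
    return [list(h) + list(d) for h, d in zip(hidden, detector_hidden)]


hidden = [[1.0, 2.0], [3.0, 4.0]]        # toy encoder states
probs = [0.0, 1.0]                       # toy detector probabilities
v = [0.5, -0.5]                          # toy join embedding
joined = join_embedding(hidden, probs, v)
concatenated = concatenate(hidden, [[9.0], [8.0]])
print(joined)        # [[1.0, 2.0], [3.5, 3.5]]
print(concatenated)  # [[1.0, 2.0, 9.0], [3.0, 4.0, 8.0]]
```

Under this reading, the only channel between the detector and the editor is the scalar p_i per token, which is what makes the interface easy to inspect and override.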

Join Embedding Control

We proceed to demonstrate how the join embedding creates controllability in the neutralization process. Recall that modular relies on a probability distribution to determine which words require editing (Equation 3). Typically, this distribution comes from the detection module (Section 3.1), but we can also feed in user-specified distributions that force the model to target particular words. This can let human advisors correct errors or push the model’s behavior towards some desired outcome. We find that the model is indeed capable of being controlled, letting users target specific words for rewording in case they disagree with the model’s output or seek recommendations on specific language. However, doing so can also introduce errors into downstream language generation (Table 9).
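A user-specified distribution of this kind can be as simple as placing all probability mass on the word the user wants reworded. The sketch below is a hypothetical illustration: the helper `target_word`, the tokenization, and the example sentence are assumptions, and the resulting list would stand in for the detector's output when fed to the editing module.

```python
from typing import List, Sequence


def target_word(tokens: Sequence[str], index: int) -> List[float]:
    """Build a per-token distribution that forces editing of one word.

    All mass goes to tokens[index]; every other token gets 0.0.
    """
    if not 0 <= index < len(tokens):
        raise IndexError("index must point at a token in the sentence")
    return [1.0 if i == index else 0.0 for i in range(len(tokens))]


# Hypothetical sentence; the user flags "brilliant" (position 3).
tokens = ["he", "is", "a", "brilliant", "actor"]
user_probs = target_word(tokens, 3)
print(user_probs)  # [0.0, 0.0, 0.0, 1.0, 0.0]
```

Softer interventions are also possible, for example scaling down the detector's probability on a word the user considers acceptable, although the text above notes that any such override can introduce errors in the generated output.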

7 Related Work

Subjectivity Bias. The study of subjectivity in NLP was pioneered by the late Janyce Wiebe and colleagues [5, 18]. Several studies develop methods for highlighting subjective or persuasive frames in a text [40, 48], or detecting biased sentences [21, 35, 52, 22], of which the most similar to ours is Recasens et al. [41], whose early, smaller version of WNC and logistic regression-based bias detector inspired our study.

Debiasing. Many scholars have worked on removing demographic prejudice from meaning representations [31, 54, 55, 4, 50, inter alia]. Such studies begin by identifying a direction or subspace that captures the bias and then remove that bias component to make the representations fair across attributes like gender and age [3, 31]. For instance, [4] introduced a regularization term for the language model to penalize the projection of the word embeddings onto that gender subspace, while [50] used adversarial training to remove directions of bias from hidden states.
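The projection-removal step common to this line of work can be illustrated with a generic sketch. It shows only the basic linear-algebra operation of subtracting a vector's component along a single bias direction, not the full procedure of any cited paper; the function name and toy vectors are assumptions.

```python
import math
from typing import List, Sequence


def remove_direction(
    vec: Sequence[float], direction: Sequence[float]
) -> List[float]:
    """Return vec with its component along `direction` removed.

    Computes vec - (vec . u) u, where u is the unit vector of `direction`.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    # Scalar projection of vec onto the bias direction.
    proj = sum(v * u for v, u in zip(vec, unit))
    return [v - proj * u for v, u in zip(vec, unit)]


embedding = [3.0, 4.0]      # toy word vector
bias_axis = [1.0, 0.0]      # toy bias direction
debiased = remove_direction(embedding, bias_axis)
print(debiased)  # [0.0, 4.0]
```

After the subtraction the vector is orthogonal to the bias direction, so a linear probe along that axis can no longer separate it from other neutralized words.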

Neural Language Generation. Several studies propose stepwise procedures for text generation, including sampling from a corpus [16] and identifying language ripe for modification [28]. Most similar to us is [29] who localize a text’s style to a fraction of its words. Our modular detection module performs a similar localization in a soft manner, and our steps are joined by a smooth conduit (the join embedding) instead of discrete logic. There is also work related to our concurrent model. The closest is [10], where a decoder was attached to BERT for question answering, or [27], where machine translation systems are initialized to LSTM and Transformer-based language models of the source text.

8 Conclusion and Future Work

The growing presence of bias has marred the credibility of our news, educational systems, and social media platforms. Automatically reducing bias is thus an important new challenge for the Natural Language Processing and Artificial Intelligence community. By learning models to automatically detect and correct subjective bias in text, this work is a first step in this important direction. Nonetheless, our scope was limited to single-word edits, which only constitute a quarter of the edits in our data, and are probably among the simplest instances of bias. We therefore encourage future work to tackle broader instances of multi-word, multi-lingual, and cross-sentence bias. Another important direction is integrating aspects of fact-checking [33], since a more sophisticated system would be able to know when a presupposition is in fact true and hence not subjective. Finally, our new join embedding mechanism can be applied to other modular neural network architectures.

9 Acknowledgements

We thank the Japan-United States Educational Commission (Fulbright Japan) for their generous support. We thank Chris Potts, Hirokazu Kiyomaru, Abigail See, Kevin Clark, the Stanford NLP Group, and our anonymous reviewers for their thoughtful comments and suggestions. We gratefully acknowledge support of the DARPA Communicating with Computers (CwC) program under ARO prime contract no. W911NF15-1-0462 and the NSF via grant IIS-1514268. Diyi Yang is supported by a grant from Google.


  • [1] S. Asthana and A. Halfaker (2018) With few eyes, all hoaxes are deep. Proceedings of the ACM on Human-Computer Interaction 2 (CSCW), pp. 21. Cited by: §2.1.
  • [2] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle (2007) Greedy layer-wise training of deep networks. In Advances in neural information processing systems, pp. 153–160. Cited by: §6.2.
  • [3] T. Bolukbasi, K. Chang, J. Y. Zou, V. Saligrama, and A. T. Kalai (2016) Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In Advances in neural information processing systems, pp. 4349–4357. Cited by: §1, §7.
  • [4] S. Bordia and S. R. Bowman (2019) Identifying and reducing gender bias in word-level language models. In NAACL 2019 Student Research Workshop, Cited by: §7.
  • [5] R. F. Bruce and J. M. Wiebe (1999) Recognizing subjectivity: a case study in manual tagging. Natural Language Engineering 5 (2), pp. 187–205. Cited by: §7.
  • [6] C. Callison-Burch (2009) Fast, cheap, and creative: evaluating translation quality using amazon’s mechanical turk. In Proceedings of EMNLP, pp. 286–295. Cited by: §6.1.
  • [7] A. T. Chaganty, S. Mussman, and P. Liang (2018) The price of debiasing automatic metrics in natural language evaluation. In Proceedings of ACL, Cited by: §4.2.
  • [8] A. Das, A. Dantcheva, and F. Bremond (2018) Mitigating bias in gender, age and ethnicity classification: a multi-task convolution neural network approach. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §1.
  • [9] Cited by: 1st item, §4.1.
  • [10] P. Dun, L. Zhu, and D. Zhao (2019) Extending answer prediction for deep bi-directional transformers. 32nd Conference on Neural Information Processing Systems (NIPS). Cited by: §7.
  • [11] B. Efron and R. J. Tibshirani (1994) An introduction to the bootstrap. CRC press. Cited by: §4.1.
  • [12] M. Faruqui, E. Pavlick, I. Tenney, and D. Das (2018) WikiAtomicEdits: a multilingual corpus of wikipedia edits for modeling language and discourse. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Cited by: §2.
  • [13] O. S. Foundation (2018) Indicators of news media trust. Cited by: §1.
  • [14] Gallup (2018) Americans: much misinformation, bias, inaccuracy in news. Note:\\americans-misinformation-bias-\\inaccuracy-news.aspx Cited by: §1.
  • [15] H. Gonen and Y. Goldberg (2019) Lipstick on a pig: debiasing methods cover up systematic gender biases in word embeddings but do not remove them. North American Chapter of the Association for Computational Linguistics (NAACL). Cited by: §1.
  • [16] K. Guu, T. B. Hashimoto, Y. Oren, and P. Liang (2018) Generating sentences by editing prototypes. Transactions of the Association of Computational Linguistics 6, pp. 437–450. Cited by: §7.
  • [17] K. L. Gwet (2011) On the krippendorff’s alpha coefficient. Manuscript submitted for publication. Retrieved October 2 (2011), pp. 2011. Cited by: footnote 6.
  • [18] V. Hatzivassiloglou and J. M. Wiebe (2000) Effects of adjective orientation and gradability on sentence subjectivity. In Proceedings of COLING 2000, pp. 299–305. Cited by: §7.
  • [19] F. Hill, K. Cho, and A. Korhonen (2016) Learning distributed representations of sentences from unlabelled data. arXiv preprint arXiv:1602.03483. Cited by: §3.1.
  • [20] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §3.1.
  • [21] C. Hube and B. Fetahu (2018) Detecting biased statements in wikipedia. In The Web Conference, pp. 1779–1786. Cited by: §7.
  • [22] C. Hube and B. Fetahu (2019) Neural based statement classification for biased language. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pp. 195–203. Cited by: §7.
  • [23] M. Iyyer, P. Enns, J. Boyd-Graber, and P. Resnik (2014) Political ideology detection using recursive neural networks. In Proceedings of ACL, pp. 1113–1122. Cited by: 1st item.
  • [24] M. Junczys-Dowmunt, R. Grundkiewicz, S. Guha, and K. Heafield (2018) Approaching neural grammatical error correction as a low-resource machine translation task. In Proceedings of NAACL, Cited by: §3.1.
  • [25] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. International Conference for Learning Representations (ICLR). Cited by: §4.1.
  • [26] P. Koehn (2004) Statistical significance tests for machine translation evaluation. In Conference on Empirical Methods in Natural Language Processing, Cited by: §4.1.
  • [27] G. Lample, M. Ott, A. Conneau, L. Denoyer, and M. Ranzato (2018) Phrase-based & neural unsupervised machine translation. Cited by: §3.1, §7.
  • [28] W. Leeftink and G. Spanakis (2019) Towards controlled transformation of sentiment in sentences. International Conference on Agents and Artificial Intelligence (ICAART). Cited by: §7.
  • [29] J. Li, R. Jia, H. He, and P. Liang (2018) Delete, retrieve, generate: a simple approach to sentiment and style transfer. In Proceedings of NAACL, Cited by: Table 4, §7.
  • [30] M. Luong, H. Pham, and C. D. Manning (2015) Effective approaches to attention-based neural machine translation. In Proceedings of EMNLP, Cited by: §3.1, Table 4.
  • [31] T. Manzini, Y. C. Lim, Y. Tsvetkov, and A. W. Black (2019) Black is to criminal as caucasian is to police: detecting and removing multiclass bias in word embeddings. In NAACL 2019, Cited by: §7.
  • [32] M. d. Marneffe, C. D. Manning, and C. Potts (2012) Did it happen? the pragmatic complexity of veridicality assessment. Computational Linguistics 38 (2), pp. 301–333. External Links: Link Cited by: §1.
  • [33] T. Mihaylova, P. Nakov, L. Marquez, A. Barron-Cedeno, M. Mohtarami, G. Karadzhov, and J. Glass (2018) Fact checking in community forums. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §8.
  • [34] R. Mir, B. Felbo, N. Obradovich, and I. Rahwan (2019) Evaluating style transfer for text. In Proceedings of NAACL, Cited by: §4.2.
  • [35] F. Morstatter, L. Wu, U. Yavanoglu, S. R. Corman, and H. Liu (2018) Identifying framing bias in online news. ACM Transactions on Social Computing 1 (2), pp. 5. Cited by: §7.
  • [36] K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of ACL, pp. 311–318. Cited by: §2, §4.1.
  • [37] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §4.1.
  • [38] S. Prabhumoye, Y. Tsvetkov, R. Salakhutdinov, and A. W. Black (2018) Style transfer through back-translation. Association for Computational Linguistics (ACL). Cited by: Table 4.
  • [39] R. Pryzant, Y. Chung, D. Jurafsky, and D. Britz (2017) JESC: japanese-english subtitle corpus. 11th edition of the Language Resources and Evaluation Conference (LREC). Cited by: §2.
  • [40] H. Rashkin, E. Choi, J. Y. Jang, S. Volkova, and Y. Choi (2017) Truth of varying shades: analyzing language in fake news and political fact-checking. In Proceedings of EMNLP, pp. 2931–2937. Cited by: §7.
  • [41] M. Recasens, C. Danescu-Niculescu-Mizil, and D. Jurafsky (2013) Linguistic models for analyzing and detecting biased language. In Proceedings of ACL, pp. 1650–1659. Cited by: §1, §2, 2nd item, §6.1.
  • [42] R. Rudinger, A. S. White, and B. Van Durme (2018) Neural models of factuality. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 731–744. External Links: Link Cited by: §1.
  • [43] R. Saurí and J. Pustejovsky (2009) FactBank: a corpus annotated with event factuality. Language resources and evaluation 43 (3), pp. 227. Cited by: §1.
  • [44] A. See, P. J. Liu, and C. D. Manning (2017) Get to the point: summarization with pointer-generator networks. Proceedings of ACL. Cited by: §3.1.
  • [45] Y. Sim, B. D. Acree, J. H. Gross, and N. A. Smith (2013) Measuring ideological proportions in political speeches. In Proceedings of EMNLP, pp. 91–101. Cited by: 1st item.
  • [46] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §4.1.
  • [47] J. Tiedemann (2008) Synchronizing translated movie subtitles.. In Language Resources and Evaluation Conference (LREC), Cited by: §2.
  • [48] O. Tsur, D. Calacci, and D. Lazer (2015) A frame of mind: using statistical models for detection of framing and agenda setting campaigns. In Proceedings of ACL, pp. 1629–1638. Cited by: §7.
  • [49] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: Table 4.
  • [50] T. Wang, J. Zhao, K. Chang, M. Yatskar, and V. Ordonez (2018) Adversarial removal of gender from deep image representations. arXiv preprint arXiv:1811.08489. Cited by: §7.
  • [51] A. S. White, R. Rudinger, K. Rawlins, and B. Van Durme (2018) Lexicosyntactic inference in neural models. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4717–4724. External Links: Link Cited by: §1.
  • [52] D. Yang, A. Halfaker, R. Kraut, and E. Hovy (2017) Identifying semantic edit intentions from revisions in wikipedia. In Conference on Empirical Methods in Natural Language Processing, pp. 2000–2010. Cited by: §2, §7.
  • [53] F. M. Zanzotto and M. Pennacchiotti (2010) Expanding textual entailment corpora from wikipedia using co-training. In The People’s Web Meets NLP Workshop (COLING), pp. 28–36. Cited by: §2.
  • [54] J. Zhao, T. Wang, M. Yatskar, V. Ordonez, and K. Chang (2017) Men also like shopping: reducing gender bias amplification using corpus-level constraints. arXiv preprint arXiv:1707.09457. Cited by: §7.
  • [55] J. Zhao, T. Wang, M. Yatskar, V. Ordonez, and K. Chang (2018) Gender bias in coreference resolution: evaluation and debiasing methods. arXiv preprint arXiv:1804.06876. Cited by: §1, §7.
  • [56] Z. Zhou and X. Liu (2006) Training cost-sensitive neural networks with methods addressing the class imbalance problem. Transactions of IEEE. Cited by: §3.1.