NewB: 200,000+ Sentences for Political Bias Detection

by Jerry Wei, et al.

We present the Newspaper Bias Dataset (NewB), a text corpus of more than 200,000 sentences from eleven news sources regarding Donald Trump. While previous datasets have labeled sentences as either liberal or conservative, NewB covers the political views of eleven popular media sources, capturing more nuanced political viewpoints than a traditional binary classification system does. We train two state-of-the-art deep learning models to predict the news source of a given sentence from eleven newspapers and find that a recurrent neural network achieved top-1, top-3, and top-5 accuracies of 33.3%, 61.4%, and 77.6%, respectively, outperforming a baseline logistic regression model's accuracies of 18.3%, 42.6%, and 60.8%. We also analyze the top n-grams with our model to gain meaningful insight into the portrayal of Trump by media sources. We hope that the public release of our dataset will encourage further research in using natural language processing to analyze more complex political biases. Our dataset is posted at .




1 Introduction

Newspaper articles are often biased (Pinholster, 2016; Bernhardt et al., 2008) and reflect the political leaning of their news source (Morris, 2007). In recent American politics, the actions of current U.S. President Donald J. Trump have polarized the American people (Jacobson, 2016) and are thus an exemplary topic of political bias in media. Conservative news sources are more likely to report on Trump’s actions favorably, whereas liberal media outlets tend to portray Trump’s actions in a more negative light (Anuta et al., 2017).

| Source (Political Bias) | Sentences | Avg. words per sentence | Vocabulary size (V) |
|---|---|---|---|
| Newsday (L) | 24,000 | 23 | 66,366 |
| New York Times (L) | 24,000 | 18 | 88,982 |
| Cable News Network (L) | 24,000 | 25 | 53,096 |
| Los Angeles Times (L) | 24,000 | 25 | 70,073 |
| Washington Post (L) | 24,000 | 25 | 67,648 |
| Politico (N) | 24,000 | 28 | 203,725 |
| Wall Street Journal (C) | 24,000 | 18 | 60,677 |
| New York Post (C) | 24,000 | 23 | 50,182 |
| Daily Press (C) | 24,000 | 24 | 60,607 |
| Daily Herald (C) | 24,000 | 25 | 53,515 |
| Chicago Tribune (C) | 24,000 | 24 | 67,953 |
| Combined | 264,000 | 23 | 362,649 |

Table 1: Summary statistics of our presented Newspaper Bias Dataset (NewB): number of sentences, average number of words per sentence, and vocabulary size (V) for each source. L: liberal source. N: neutral source. C: conservative source. Political leanings of newspapers were retrieved from Media Bias/Fact Check.

The publication of newspaper articles online has generated a large amount of text data well suited to empirical text analysis. We collect tens of thousands of newspaper articles regarding Donald Trump and compile them into a sentence-level Newspaper Bias dataset called NewB. Whereas previous political bias datasets sort sentences into the broad categories of liberal and conservative, NewB is labeled by newspaper source and captures more intricate political biases and viewpoints than a binary labeling system can. Our work makes the following contributions:

  • We present NewB, a dataset of more than 200,000 sentences from articles regarding Donald Trump. Previous work has generically categorized sentences as either liberal or conservative, but NewB is labeled by news source and covers the political ideologies of eleven newspapers, allowing for more nuanced analysis of political bias.

  • We train deep learning classifiers to predict the source of a given sentence. While predicting the source from just a single sentence is challenging, we find that the patterns discovered by training on such a large text corpus offer meaningful insight into the biases of newspapers.

2 Related Work

Many previous works have used natural language processing for political language and implicit bias analysis. For news content, Gentzkow and Shapiro (2010) measured the political “slant” of news articles by identifying similarities between the articles’ language and that of Congressional representatives, and Iyyer et al. (2014) used recursive neural networks for political ideology detection on two sentence-level datasets. In terms of social media, Rao et al. (2010) classified Twitter users as liberal or conservative through their tweets, and Tumasjan et al. (2010) predicted election results using Twitter messages. Moreover, Rao and Spasojevic (2016) used word embeddings and recurrent neural networks to classify social media messages from Twitter, Facebook, and Google+ as leaning Democratic or Republican.

Related work also includes several datasets on political bias and subjectivity that have been released publicly. Thomas et al. (2006) released a dataset called Convote for detection of support for or opposition against proposed legislation from Congressional floor-debate transcripts, and Gross et al. (2013) generated a comprehensive dataset of sentences from books and magazine articles labeled as liberal, conservative, or neutral based on the political leaning of the author.

While these works are a solid foundation in political text analysis, our dataset takes a novel approach. Previous political bias and subjectivity datasets have straightforward binary labels but are time-intensive to annotate and often subject to annotators’ comprehension of the text. Our dataset, on the other hand, is labeled by indisputable ground truth classes (the origins of the articles) that are used for training and inference. Furthermore, while binary labels for political bias are usually sufficient for simple classification tasks, texts often have more intricate political bias than can be captured by binary labels. By having labels that span eleven news outlets, our dataset differentiates between and captures a wider range of political ideologies, allowing for more comprehensive bias analysis. Our paper is the first to our knowledge to propose a newspaper origin classification task, and our dataset is at least one order of magnitude larger than related datasets (Table 2).

| Dataset | Task | Size | Classes |
|---|---|---|---|
| IBC (Gross et al., 2013) | Political Bias | 4,062 | 3 |
| Convote (Thomas et al., 2006) | Legislation Support | 3,857 | 2 |
| SUBJ (Pang and Lee, 2004) | Sentence Subjectivity | 10,000 | 2 |
| NewB | News Origin | 264,000 | 11 |

Table 2: Comparison of the NewB dataset with other benchmark political bias and subjectivity datasets. Size: dataset size. Classes: number of classes.
Figure 1: Top-n accuracies by news source for a recurrent neural network trained on NewB.

3 The NewB Dataset

For data collection, we focus on articles that contain the search term “Donald Trump” as a topic of contention between liberal sources and conservative sources. We download texts from the ProQuest US News Stream Database that were written by journalists from the following media outlets: Newsday, New York Times, Cable News Network, Los Angeles Times, Washington Post, Politico, Wall Street Journal, New York Post, Daily Press, Daily Herald, and Chicago Tribune. We select these news sources based on a cross-reference between available sources on ProQuest US News Stream and sources whose political leaning is shown on Media Bias/Fact Check (the largest website that classifies news sources on the political spectrum), and further balance them to include exactly five liberal sources, five conservative sources, and one neutral source.

In terms of data pre-processing, we split all articles at the sentence level and balance the class distribution to 24,000 sentences per class, for a total of 264,000 distinct sentences across 11 classes. Then, we assign all sentences their respective origins as ground truth labels. For each class, we randomly hold out 1,000 sentences for an independent test set; the total size of our test set is 11,000 sentences. We use the remainder of the sentences for model training. Our dataset is separated such that there is a standard testing set, which can be useful for future comparisons between classifiers trained on NewB. News sources and their political leanings, as well as summary statistics of NewB, can be found in Table 1. Of note, the origin of a text does not necessarily determine the bias of a given article (e.g. a New York Times article may report positively about President Trump even though New York Times articles generally report negatively about him).
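The balancing and hold-out procedure described above can be sketched roughly as follows. This is an illustrative sketch, not the authors' released code: the function name `balance_and_split`, the `(text, source)` input format, the per-source de-duplication step, and the fixed random seed are all assumptions made for the example.

```python
import random
from collections import defaultdict

def balance_and_split(sentences, per_class=24000, test_per_class=1000, seed=0):
    """Balance classes and hold out a per-class test set.

    sentences: iterable of (text, source) pairs (hypothetical input format).
    Returns (train, test), each a list of (text, source) pairs.
    """
    rng = random.Random(seed)
    by_source = defaultdict(list)
    for text, source in sentences:
        by_source[source].append(text)

    train, test = [], []
    for source in sorted(by_source):
        # Assumption: keep only distinct sentences within each source.
        pool = list(dict.fromkeys(by_source[source]))
        if len(pool) < per_class:
            raise ValueError(f"{source} has only {len(pool)} distinct sentences")
        sample = rng.sample(pool, per_class)  # balance the class size
        test += [(t, source) for t in sample[:test_per_class]]
        train += [(t, source) for t in sample[test_per_class:]]
    return train, test
```

With the paper's settings (24,000 sentences per class, 1,000 held out), eleven sources would yield 253,000 training and 11,000 test sentences.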

4 Experimental Setup

4.1 Text Classification Models

To establish baselines for performance on our news source classification dataset, we implement the following text classifiers:

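  • Logistic Regression (LR): Logistic regression is the simplest baseline we consider and can be viewed as a neural network with no hidden layer. Sigmoidal units of this kind underlie the single-hidden-layer networks covered by the universal approximation theorem (Cybenko, 1989).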

  • Convolutional Neural Network (CNN): Convolutional neural networks, which are commonly used in computer vision, have been shown to achieve high performance on text classification tasks (Kim, 2014). We implement a CNN with a single 1D convolutional layer, followed by max pooling and a dense layer.

  • Recurrent Neural Network (RNN): Recurrent neural networks are commonly used in language processing because they are suitable for processing sequential data. We use an RNN with two bidirectional layers of LSTM cells (Liu et al., 2016).
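To make the CNN description concrete, the following is a minimal pure-Python sketch of a forward pass through one 1D convolutional layer, max pooling over time, and a dense layer. It is only illustrative: the paper does not state the kernel width, number of filters, or activation function, so the ReLU activation, the zero-padding of short sentences, and the function names `conv1d_max_pool` and `dense` are assumptions.

```python
def conv1d_max_pool(embedded, filters, biases):
    """One 1D convolution over word positions followed by max pooling over time.

    embedded: non-empty list of T word vectors, each of length D.
    filters:  list of F kernels, each a list of K vectors of length D.
    biases:   list of F floats.
    Returns a list of F pooled features.
    """
    dim = len(embedded[0])
    features = []
    for kernel, b in zip(filters, biases):
        k = len(kernel)
        # Assumption: zero-pad sentences shorter than the kernel width.
        padded = embedded + [[0.0] * dim] * max(0, k - len(embedded))
        responses = []
        for start in range(len(padded) - k + 1):
            window = padded[start:start + k]
            total = b + sum(w * x
                            for wv, xv in zip(kernel, window)
                            for w, x in zip(wv, xv))
            responses.append(max(0.0, total))  # ReLU (assumed activation)
        features.append(max(responses))        # max pooling over positions
    return features

def dense(features, weights, biases):
    """Fully connected layer: one output (logit) per row of weights."""
    return [b + sum(w * x for w, x in zip(row, features))
            for row, b in zip(weights, biases)]
```

In a full model, `dense` would produce eleven logits, one per news source, from the pooled features.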

4.2 Model Training and Evaluation

Our models take variable-length sentences as inputs and return softmax output vectors of length eleven representing the predicted confidence for each news source, which are compared to ground truth labels represented as one-hot encoded vectors. When inputting sentences into the model, we tokenize each sentence at the word level and convert each word into a vector using 300-dimensional distributed embeddings trained on the Common Crawl database with the GloVe method (Pennington et al., 2014). Each model was trained with a five percent validation split until convergence, as determined by early stopping.
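As a rough illustration of the output and stopping criteria described above, the sketch below computes a softmax over eleven logits, compares it to a one-hot label, and applies a simple early-stopping rule. The paper does not name its loss function or stopping patience, so the cross-entropy loss, the `patience` parameter, and all function names here are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(index, num_classes=11):
    """One-hot vector for a ground truth news source index."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]

def cross_entropy(probs, target):
    """Cross-entropy between predicted probabilities and a one-hot target
    (assumed loss; the paper does not specify one)."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

def early_stop_epoch(val_losses, patience=3):
    """Return the epoch index at which training would stop, or None.

    Stops once the validation loss has not improved for `patience` epochs
    (hypothetical criterion).
    """
    best, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return None
```

A uniform prediction over eleven classes gives a loss of ln(11), which is a useful reference point when reading training curves.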

Top-n accuracy (%):

| Model | Top-1 | Top-2 | Top-3 | Top-4 | Top-5 |
|---|---|---|---|---|---|
| LR | 18.3 | 32.1 | 42.6 | 52.2 | 60.8 |
| CNN | 34.0 | 50.3 | 61.5 | 70.0 | 77.4 |
| RNN | 33.3 | 50.6 | 61.4 | 70.5 | 77.6 |

Table 3: Top-n accuracies for each model. LR: logistic regression. CNN: convolutional neural network. RNN: recurrent neural network.

5 Results

We measure the performances of various baseline models on our dataset and analyze top n-grams with one of our models to gain insight on how Trump is portrayed by various media sources.

5.1 Evaluation Metrics

For each model, we calculate top-1 through top-5 accuracies, which are shown in Table 3. Both the CNN and RNN models significantly outperform the logistic regression baseline, likely as a result of their ability to account for sequential data. Given the difficulty of predicting the news source from just a single sentence, the CNN and RNN achieved respectable accuracies of 34.0% and 33.3% respectively, well above a random guessing accuracy of 9.09%.
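For reference, top-n accuracy counts a prediction as correct when the true class is among the n highest-scoring classes. A minimal sketch follows; the function name `top_n_accuracy` and the list-of-lists input format are assumptions for illustration.

```python
def top_n_accuracy(scores, labels, n):
    """Fraction of examples whose true label is among the n top-scoring classes.

    scores: list of per-class score lists (e.g. softmax outputs).
    labels: list of integer class indices.
    """
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label in ranked[:n]:
            hits += 1
    return hits / len(labels)

# Random guessing over eleven balanced classes: 1/11, about 9.09% top-1.
random_top1_percent = round(100 / 11, 2)
```

Under the same random-guessing assumption, top-n accuracy over eleven balanced classes would be n/11.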

In Figure 1, we show the top-n accuracy per class for the RNN. New York Times and Politico were the easiest to classify. This may be because Politico uses a larger variety of words than the other classes, as shown by its larger vocabulary size, and New York Times has one of the shortest average sentence lengths (Table 1). Furthermore, we display the confusion matrix of the RNN’s predicted labels in the form of a heatmap in Figure 2. Of note, there was a high false positive rate for predicting Wall Street Journal on New York Times sentences, likely because both newspapers tended to have short sentences, with an average length of only 18 words per sentence.

Figure 2: Confusion matrix of predicted and ground truth news sources for a recurrent neural network (RNN) trained on NewB.
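A confusion matrix of this kind can be computed as sketched below, with rows indexing the ground truth source and columns the predicted source. The row normalization is an assumption about how the heatmap might be scaled, and the function names are hypothetical.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """matrix[t][p] counts examples with true class t predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix

def row_normalize(matrix):
    """Convert counts to per-true-class fractions (rows with no examples stay zero)."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized
```

In this layout, the confusion noted above would appear as a large off-diagonal entry in the New York Times row and the Wall Street Journal column.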

5.2 Five-gram Analysis

For analysis of our dataset, we find the most significant five-grams for liberal and conservative newspapers by retrieving the most used five-grams that appear only in sources of that bias. We input these five-grams into the RNN and display the outputs in a heatmap, as shown in Figure 3.
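The five-gram selection described above could be sketched as follows: count five-grams in sources of one bias and keep only those that never appear in sources of any other bias. Lowercasing, whitespace tokenization, and the names `five_grams` and `exclusive_top_ngrams` are assumptions; the paper does not describe its tokenization for this step.

```python
from collections import Counter

def five_grams(sentence, n=5):
    """Word-level n-grams of a sentence (lowercased, whitespace-tokenized)."""
    tokens = sentence.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def exclusive_top_ngrams(sentences_by_bias, bias, top_k=10, n=5):
    """Most frequent n-grams that appear only in sources of the given bias.

    sentences_by_bias: dict mapping a bias label (e.g. "L", "C") to a list
    of sentences (hypothetical input format).
    """
    counts = Counter()
    seen_elsewhere = set()
    for group, sentences in sentences_by_bias.items():
        for sentence in sentences:
            grams = five_grams(sentence, n)
            if group == bias:
                counts.update(grams)
            else:
                seen_elsewhere.update(grams)
    exclusive = {g: c for g, c in counts.items() if g not in seen_elsewhere}
    return Counter(exclusive).most_common(top_k)
```

The resulting five-grams would then be passed through the trained classifier to obtain the per-source predictions shown in the heatmap.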

Liberal n-grams tend to use loaded words that convey a negative image of Trump. An in-depth investigation of top five-grams revealed that “trump has a history of” was used negatively ten out of twelve times, and “trump the republican presidential nominee” was used negatively fifteen out of eighteen times, suggesting that liberal sources reported negatively on and distanced themselves from Trump. Furthermore, the liberal origin of several of these phrases is intuitive: “the trump campaign declined to comment” is consistent with the Trump campaign being less likely to respond to liberal news sources, and “i voted against donald trump” is more likely to be found in an article from a liberal source.

On the other hand, conservative newspapers tend to portray Trump in a more positive light. They often refer to Donald Trump with titles of respect such as “President” and “Commander-in-Chief”. Moreover, they frequently supported his claims regarding his alleged collusion with Russia, emphasizing that “trump has denied collusion”. Overall, our model generally predicted liberal news sources for liberal n-grams, with an emphasis on New York Times, while conservative n-grams were predicted to originate from conservative news sources, with more varied predictions.

Figure 3: Recurrent neural network’s predicted labels for most significant liberal and conservative five-grams. Liberal five-grams (top half) are in blue, and conservative five-grams (bottom half) are in red. Significant n-grams are bolded.

6 Conclusion

In this paper, we have presented NewB, a newspaper bias dataset of sentences regarding Donald Trump. While previous works have typically classified political bias broadly (e.g. liberal and conservative), there are often more nuanced biases involved in political texts. Our main contribution is a dataset that captures these complex biases by labeling each sentence with its news source origin. We have shown that substantial insights on the political viewpoints of media sources can be identified by models trained on this data, without the use of manually annotated labels required by previous datasets. In terms of applications, a model trained on NewB could potentially be used as an internet browser extension to better inform readers of the biases present in online newspaper articles.

While NewB sets up the foundation for more complex political bias analysis, our current work has not used comprehensive pre-training strategies such as BERT (Devlin et al., 2018) and only identifies some of the political tendencies encompassed by this large text corpus. Important future work remains in developing new methods to translate the learned features of classifiers into well-defined political ideologies represented by news organizations. We hope that the release of NewB will encourage further research on using natural language processing for political bias analysis.


References

  • D. Anuta, J. Churchin, and J. Luo (2017) Election bias: comparing polls and twitter in the 2016 U.S. election. CoRR abs/1701.06232.
  • D. Bernhardt, S. Krasa, and M. Polborn (2008) Political polarization and the electoral effects of media bias. Journal of Public Economics 92 (5), pp. 1092–1104.
  • G. Cybenko (1989) Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems 2, pp. 303–314.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805.
  • M. Gentzkow and J. M. Shapiro (2010) What drives media slant? evidence from us daily newspapers. Econometrica 78, pp. 35–71.
  • J. Gross, B. Acree, Y. Sim, and N. A. Smith (2013) Testing the etch-a-sketch hypothesis: a computational analysis of mitt romney’s ideological makeover during the 2012 primary vs. general elections. In APSA 2013 Annual Meeting Paper.
  • M. Iyyer, P. Enns, J. Boyd-Graber, and P. Resnik (2014) Political ideology detection using recursive neural networks. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Vol. 1.
  • G. C. Jacobson (2016) Polarization, gridlock, and presidential campaign politics in 2016?. The ANNALS of the American Academy of Political and Social Science 667, pp. 226–246.
  • Y. Kim (2014) Convolutional neural networks for sentence classification. CoRR abs/1408.5882.
  • P. Liu, X. Qiu, and X. Huang (2016) Recurrent neural network for text classification with multi-task learning. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI’16, pp. 2873–2879.
  • J. S. Morris (2007) Slanted objectivity? perceived media bias, cable news exposure, and political attitudes. Social Science Quarterly 88 (3), pp. 707–728.
  • B. Pang and L. Lee (2004) A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd ACL, pp. 271–278.
  • J. Pennington, R. Socher, and C. Manning (2014) Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing.
  • G. Pinholster (2016) Journals and funders confront implicit bias in peer review. Science 352 (6289), pp. 1067–1068.
  • A. Rao and N. Spasojevic (2016) Actionable and political text classification using word embeddings and lstm. arXiv preprint arXiv:1607.02501.
  • D. Rao, D. Yarowsky, A. Shreevats, and M. Gupta (2010) Classifying latent user attributes in twitter. In Proceedings of the 2nd international workshop on Search and mining user-generated contents, pp. 37–44.
  • M. Thomas, B. Pang, and L. Lee (2006) Get out the vote: determining support or opposition from congressional floor-debate transcripts. In Proceedings of EMNLP, pp. 327–335.
  • A. Tumasjan, T. Sprenger, P. Sandner, and I. M. Welpe (2010) Predicting elections with twitter: what 140 characters reveal about political sentiment. ICWSM.