Nowadays, more and more readers consume news online. Reduced costs and generally less strict regulation compared with the traditional press have led to a proliferation of online sources. However, this does not necessarily mean that readers are exposed to a plurality of viewpoints. News consumed via social networks is known to reinforce the bias of the user (Flaxman et al., 2016). On the other hand, visiting multiple websites to gather a more comprehensive view of an event might be too time-consuming for the average reader.
News aggregators, such as Flipboard (https://flipboard.com), News Lens (https://newslens.berkeley.edu) and Google News (https://news.google.com), gather news from different sources and, in the case of the latter two, cluster them into events. In addition, News Lens displays all articles about an event in a timeline and provides additional information, such as a summary of the event and a description of each entity mentioned in an article.
While these news aggregators help readers to get more comprehensive coverage of an event, some of the sources might be unknown to the user, who could naturally question the validity and trustworthiness of the information provided. Deep analysis of the content published by news outlets has been performed by expert journalists. For example, Media Bias/Fact Check (http://mediabiasfactcheck.com) provides reports on the bias and factuality of reporting of entire news outlets, whereas Snopes (http://www.snopes.com/) and FactCheck (http://www.factcheck.org/) are popular fact-checking websites. However, such manual efforts cannot cope with the rate at which news is produced.
We propose Tanbih, a news platform that, in addition to displaying news grouped into events, provides additional information about the articles and their media source in order to develop the media literacy of users. Our system automatically generates media profiles with reports on the factuality, leading political ideology, hyper-partisanship, use of propaganda and bias of a news outlet. Furthermore, Tanbih automatically categorizes articles in English and Arabic, flags potentially propagandistic ones, and examines framing bias.
2 System Architecture
The architecture of Tanbih is sketched in Figure 1. The system consists of three main components: an online streaming processing pipeline for data collection and article level analysis, offline processing for event and media source level analysis and a website for delivering news to consumers. The online streaming processing pipeline continuously retrieves articles in English and Arabic. Translation, categorization, general frame of reporting classification and propaganda detection are performed for each article.
Every 30 minutes, clustering is performed on the collected articles. Offline processing covers analysis at the source level (factuality prediction, leading political ideology prediction, audience reach, and Twitter user-based bias prediction), stance detection, and the aggregation of article-level statistics, e.g. the propaganda index (see Section 2.3), for each medium. Offline processing does not have strict time requirements, so for these components we choose models that favour accuracy over speed.
In order to run everything in a streaming and scalable fashion, we use Kafka (https://kafka.apache.org) as a messaging queue and Kubernetes (https://kubernetes.io) to manage scalability and fault-tolerant deployment. In the following, we describe each component of the system. We have open-sourced the code for some of them, and we will release the remaining ones upon acceptance of the corresponding research papers.
2.1 Crawlers and Translation
Our crawlers collect articles from a growing list of sources (https://www.tanbih.org/about), which currently includes 155 RSS feeds, 82 Twitter accounts and 2 websites. Once a link to an article is obtained from any of these sources, we rely on the Newspaper3k Python library (https://newspaper.readthedocs.io) to retrieve its content. After de-duplication, the crawlers currently download 7k-10k articles every day, and we have more than 700k articles stored in our database. In order to display news both in English and in Arabic, we use QCRI Machine Translation (Dalvi et al., 2017) to translate English content into Arabic and vice versa. Since translation is performed offline, we select the most accurate system in Dalvi et al. (2017), i.e. the neural one.
2.2 Section Categorization
We build a model to classify an article into one of six news sections: Entertainment, Sports, Business, Technology, Politics, and Health. We build a corpus using the New York Times articles from the FakeNews dataset (https://github.com/several27/FakeNewsCorpus) published between Jan. 1st, 2000 and Dec. 31st, 2017. We extract the news section information embedded in the article URL, and in total we use 538k articles for training our models on TF-IDF representations of the contents. On a test set of 107k articles, the best-performing model is built based on Logistic Regression with F= and for Sports, Business, Technology, and Politics, the sections used in our system, respectively. The baseline F is 0.497.
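The label extraction step can be illustrated with a short sketch. It assumes, hypothetically, that the section appears as a lowercase path segment of the article URL; the example URL and the `section_from_url` helper are illustrative and are not taken from the system itself.

```python
from urllib.parse import urlparse

# The six sections named in the text, lowercased for matching.
SECTIONS = {"entertainment", "sports", "business", "technology", "politics", "health"}


def section_from_url(url):
    """Return the first URL path segment that names a known section, or None."""
    path = urlparse(url).path
    for segment in path.lower().split("/"):
        if segment in SECTIONS:
            return segment
    return None


# Hypothetical URL used only for illustration.
print(section_from_url("https://www.nytimes.com/2017/03/14/sports/example-story.html"))
```

Articles whose URL does not expose one of the target sections return `None` and would simply be left out of the training corpus under this assumption.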
2.3 Propaganda Detection
We developed a propaganda detection component to flag articles that are potentially propagandistic, i.e. purposefully biased to influence their readers and ultimately pursue a specific agenda. Given a corpus of news articles with binary labels, propagandistic vs. non-propagandistic (Barrón-Cedeño et al., 2019), we train a maximum entropy classifier on k articles, represented with various style-related features, such as character n-grams and a number of vocabulary richness and readability measures, and obtain state-of-the-art F= on a separate test set of 10k articles. We refer to the score of the classifier as the propaganda index, and we define the following propaganda labels, which we use to flag articles (see Figure 2): very unlikely, unlikely, somehow, likely, and very likely.
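To make the style-related features more concrete, the following minimal sketch computes character n-gram counts and one simple vocabulary richness measure, the type-token ratio. The choice of n = 3, the whitespace tokenization, and the type-token ratio itself are assumptions made for illustration; the text does not specify which measures the classifier actually uses.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def type_token_ratio(text):
    """Distinct words divided by total words; 0.0 for empty input."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


sample = "They lie, they always lie"
print(char_ngrams(sample).most_common(3))
print(type_token_ratio(sample))
```

Features of this kind describe how a text is written rather than what it is about, which is why they can transfer across topics.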
2.4 Framing Bias Detection
Framing is a central concept in political communication, which intentionally emphasizes (or ignores) certain dimensions of an issue (Entman, 1993). In Tanbih, we infer the frames of news articles in order to make framing transparent to the reader. We use the Media Frames Corpus (MFC) (Card et al., 2015) for training our model to detect topic-agnostic media frames.
We fine-tuned a BERT-based model on our training data using a small learning rate of 0.0002, a maximum sequence length of 128, and a batch size of 32. The model, trained on 11k articles from the MFC, achieves an accuracy of 66.7% on a test set of 1,138 articles, which is better than the reported state of the art (58.4%) on a subset of the MFC (Ji and Smith, 2017).
2.5 Factuality of Reporting and Leading Political Ideology of a Source
Verifying the reliability of the source is one of the basic tools used by investigative journalists to assess whether information can be trusted. To tackle this issue, we incorporated into Tanbih the findings from our recent research on classifying the political bias and factuality of reporting of news media (Baly et al., 2018). In order to predict the factuality and the bias of a given news medium, we considered: a representation of a typical article of the medium, obtained by averaging linguistic and semantic features of all its articles; features extracted from the Wikipedia page of the source and from the metadata of its Twitter account; the structure of the medium’s URL, used to identify malicious patterns (Ma et al., 2009); and web traffic, measured through the Alexa Rank (http://www.alexa.com/).
In order to collect gold labels for training our supervised models, we used the data from the Media Bias/Fact Check (MBFC) website (https://mediabiasfactcheck.com), which contains reliable annotations of factuality, bias and other aspects for over 2,000 news media. We train a Support Vector Machine (SVM) classifier to predict factuality or bias using the representations above. Factuality of reporting was modeled on a 3-point scale (low, mixed and high), and the model achieved 65% accuracy. Political ideology was modeled on a left-to-right scale, and the model achieved 69% accuracy.
2.6 Stance Detection
Stance detection aims to identify the relative perspective of a piece of text with respect to a claim, typically modeled using labels such as agree, disagree, discuss, and unrelated. An interesting application of stance detection is medium profiling with respect to controversial topics. In this setting, given a particular medium, the stance for each article is computed with respect to a set of predefined claims. The stance of a medium is then obtained by aggregating the stance at article level. In Tanbih the stance is used to profile media sources.
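The aggregation from article level to medium level can be sketched as follows. The text does not say how the aggregation is done, so this example assumes the simplest option: the share of each stance label among a medium's articles for one claim. The function name and the input format are hypothetical.

```python
from collections import Counter

LABELS = ("agree", "disagree", "discuss", "unrelated")


def medium_stance(article_stances):
    """Return the share of each stance label among a medium's articles for one claim."""
    counts = Counter(article_stances)
    total = sum(counts[label] for label in LABELS)
    if total == 0:
        return {label: 0.0 for label in LABELS}
    return {label: counts[label] / total for label in LABELS}


# Hypothetical article-level predictions for one medium and one claim.
print(medium_stance(["agree", "agree", "discuss", "unrelated"]))
```

Keeping the full distribution rather than a single majority label preserves the information that a medium covers a claim from several angles.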
We implemented our stance detection by fine-tuning the BERT classifier on the FNC-1 dataset from the Fake News Challenge (http://www.fakenewschallenge.org/). Our model outperforms the best submitted system (Hanselowski et al., 2018). In particular, our system obtained F and F and for agree, disagree, discuss, and unrelated classes, respectively.
2.7 Audience Reach
User interactions on Facebook enable the platform to generate comprehensive user profiles covering attributes such as gender, age, income bracket, and political preferences. After marketers have determined a set of criteria for their target audience, Facebook can provide them with an estimate of the size of this audience on its platform. To illustrate, there are an estimated 160K Facebook users who are 20-year-old, very liberal females with an interest in The New York Times. In our system, we exploit the demographic composition, and the political leaning in particular, of Facebook users who follow news media as a means to improve media bias prediction.
To get the audience of each news medium, we use Facebook’s Marketing API to identify the medium’s “Interest ID.” Using this ID, we then extract the demographic data of the medium’s audience with a focus on audience members who reside in the US and their political leanings (ideologies), which we categorize into five classes: Very Conservative, Conservative, Moderate, Liberal, and Very Liberal. Political leaning information is only available for US-based Facebook users.
2.8 Twitter User-Based Bias Classification
Controversial social and political issues may spur social media users to express their opinion by sharing supporting newspaper articles. Our intuition is that the bias (or ideological leaning) of news sources can be inferred from the bias of the users who share them. For example, if articles from a news source are shared strictly by left-leaning or by right-leaning users, then the source is likely far-left or far-right leaning, respectively. Similarly, if it is cited by both groups, then it is likely closer to the center. We used an unsupervised user-based stance detection method on different controversial topics to find core groups of right-leaning and left-leaning users (Darwish et al., 2019). Given that the stance detection produces clusters with nearly perfect purity (97% purity), we used the identified core users to train a deep learning-based classifier, fastText (Joulin et al., 2016), using the accounts that they retweeted as features, in order to tag more users. Next, we computed the so-called valence score for each news source for each topic. The valence score ranges between -1 and 1, with higher absolute values indicating that the source is cited in greater proportion by one group than by the other. The score is calculated as follows (Conover et al., 2011):

$$V(u) = 2\,\frac{\frac{tf(u, C_0)}{total(C_0)}}{\frac{tf(u, C_0)}{total(C_0)} + \frac{tf(u, C_1)}{total(C_1)}} - 1,$$

where $tf(u, C_0)$ is the number of times (term frequency) item $u$ is cited by group $C_0$, and $total(C_0)$ is the sum of the term frequencies of all items cited by $C_0$. $tf(u, C_1)$ and $total(C_1)$ are defined in a similar fashion. We subdivided the range between -1 and 1 into 5 equal-size ranges and assigned the labels far-left, left, center, right, and far-right to the ranges.
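A small worked sketch may help to see how the valence score and the five labels interact. It assumes the valence score takes the form commonly attributed to Conover et al. (2011): twice the normalized citation share of one group, divided by the sum of both groups' normalized shares, minus one. It further assumes that the group labelled 0 in the code is the right-leaning one, so that +1 maps to far-right; the text does not state which group carries which sign.

```python
LABELS = ("far-left", "left", "center", "right", "far-right")


def valence(tf_0, total_0, tf_1, total_1):
    """Valence in [-1, 1]; +1 means the item is cited only by group 0."""
    share_0 = tf_0 / total_0
    share_1 = tf_1 / total_1
    if share_0 + share_1 == 0:
        raise ValueError("item is not cited by either group")
    return 2 * share_0 / (share_0 + share_1) - 1


def bias_label(score):
    """Map a valence score to one of five equal-width bins over [-1, 1]."""
    index = min(int((score + 1) / 0.4), len(LABELS) - 1)
    return LABELS[index]


# Hypothetical counts: cited 30 times out of 1000 by group 0, 10 out of 1000 by group 1.
score = valence(30, 1000, 10, 1000)
print(score, bias_label(score))
```

Normalizing by each group's total citations keeps a larger or more active group from dominating the score simply because it shares more links overall.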
2.9 Event Identification / Clustering
The clustering module aggregates news articles into stories. The pipeline is divided into two stages: (i) local topic identification and (ii) long-term topic matching for story generation.
For stage (i), we represent each article as a vector built from the title and the body concatenated. The pre-processing consists of casefolding, lemmatization, punctuation removal, and stopword removal. In order to obtain the preliminary clusters, we compute the cosine similarity between all article pairs in a predefined time window. Each window covers a fixed number of days, and consecutive windows overlap by 3 days.
The resulting matrix of similarities for each window is then used to build a graph $G = (V, E)$, where $V$ is the set of vertices (the news articles) and $E$ is the set of edges. An edge between two articles is drawn only if their cosine similarity passes a fixed threshold. We select all parameters empirically on the training part of the corpus from Miranda et al. (2018).
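The graph construction for a single window can be sketched as below. The sparse dictionary representation, the toy documents, and the threshold value of 0.6 are illustrative assumptions; they are not the representation or the parameter values used by the system.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_edges(vectors, threshold):
    """Return index pairs (i, j), i < j, whose cosine similarity is at least threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(vectors), 2)
        if cosine(a, b) >= threshold
    ]


# Toy article vectors for one time window (hypothetical weights).
docs = [
    {"brexit": 1.0, "vote": 1.0},
    {"brexit": 1.0, "vote": 1.0, "deal": 1.0},
    {"football": 1.0},
]
print(similarity_edges(docs, threshold=0.6))
```

The resulting edge list is the input a community detection algorithm would consume; comparing all pairs is quadratic in the window size, which is one reason to bound the window length.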
The sequence of overlapping local graphs is merged in order of their creation, thus generating stories from topics. After merging, a community detection algorithm is used in order to find the correct assignment of the nodes into clusters. We use one of the fastest modularity-based algorithms: the Louvain method Blondel et al. (2008).
For stage (ii), two topics created in the preceding stage are merged if the cosine similarity between their centroids passes a fixed threshold, where the centroid of a topic is the mean of all vectors belonging to it.
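The merging criterion can be illustrated with dense toy vectors. The threshold of 0.8, the two-dimensional vectors, and the helper names are assumptions chosen for the example only.

```python
import math


def centroid(vectors):
    """Component-wise mean of equal-length dense vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def should_merge(topic_a, topic_b, threshold):
    """Merge two topics when their centroids are similar enough."""
    return cosine(centroid(topic_a), centroid(topic_b)) >= threshold


# Toy topics, each a list of article vectors (hypothetical values).
topic_a = [[1.0, 0.0], [1.0, 0.2]]
topic_b = [[0.9, 0.1]]
topic_c = [[0.0, 1.0]]
print(should_merge(topic_a, topic_b, 0.8), should_merge(topic_a, topic_c, 0.8))
```

Comparing centroids instead of individual articles keeps the cost of long-term matching proportional to the number of topics rather than the number of articles.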
The model has state-of-the-art performance, on the test partition of the corpus from Miranda et al. (2018): and F (F is an evaluation measure specifically designed to evaluate clustering algorithms Amigó et al. (2009)). As a comparison, the best model in Miranda et al. (2018) has (see Staykovski et al. (2019) for further details).
The home page of Tanbih (http://www.tanbih.org) displays news articles grouped into stories, i.e., clusters of articles (see the screenshot in Figure 2). Each story is displayed as a card. Users can go back and forth between the articles of a story by clicking on the left/right arrows below the title of the article. The propaganda label is displayed if the article is flagged as propagandistic.
In Figure 2, the article from Sputnik is flagged by our system as likely to be propagandistic. The source of each article is displayed with the logo or the avatar of the respective news organization, and it links to the profile page of this organization (see Figure 3). On the top-left of the home page, Tanbih provides language selection buttons, currently English and Arabic only, to switch the language in which the news is displayed. A search box in the top-right corner allows the user to find the profile page of a particular news medium of interest.
On the media profile page (Figure 3(a)), a short extract from the Wikipedia page of the medium is displayed at the top, with recently published articles on the right-hand side. The profile page includes a number of statistics automatically derived from the models in Section 2. As an example, Figure 3 shows screenshots of the profile of CNN (the full CNN profile page is available at https://www.tanbih.org/media/1). The first two charts in Figure 3(a) show centrality and hyper-partisanship (in the example, CNN is reported as fairly central and low in hyper-partisanship) and the distribution of propagandistic articles (CNN publishes mostly non-propagandistic articles). Figure 3(b) shows the overall framing bias distribution for the medium (CNN focuses mostly on cultural identity and politics) and the factuality of reporting (CNN is mostly factual). The profile also shows the leading political ideology distribution of the medium. Figure 3(c) shows the audience reach of the medium and the bias classification according to users’ retweets (see Section 2.8): CNN is popular among readers of any political view, although it tends to have a left-leaning ideology on the topics listed. The profile also features reports on the stance of CNN on a number of topics.
Using the topic search box on the Tanbih home page, users can find the dedicated page of a topic, for example Brexit or Khashoggi’s murder. The top of the page for Khashoggi’s murder is shown in Figure 4.
Recent stories on the topic are listed at the top of the page, followed by statistics such as the number of countries, the number of articles and the number of media covering it. A map showing the amount of reporting on the topic by each country gives users an overview of how important the topic is for these countries. The page also has two sets of charts showing (i) the top countries in terms of coverage of the topic, both in absolute numbers and as a ratio with respect to the total number of articles published; and (ii) the media sources that have the most propagandistic content on the topic, again both in absolute terms and as a ratio with respect to the total number of articles published by the medium on the topic. The topic page also features plots equivalent to the ones in Figure 3(b), showing the distribution of propagandistic articles and the framing bias on the topic.
4 Conclusions and Future Work
We have introduced Tanbih, a news aggregator that automatically computes media-level and article-level analyses to help users better understand what they are reading. Tanbih features factuality prediction, propaganda detection, stance detection, translation, leading political ideology analysis, media framing bias detection, and event clustering. The architecture of Tanbih is flexible and fault-tolerant, and it can scale to handle thousands of sources.
As future work, we plan to expand the system to include many more sources, especially from non-English speaking regions and to add interactive components, for example letting users ask questions about a topic.
References
- Amigó et al. (2009). A comparison of extrinsic clustering evaluation metrics based on formal constraints. Inf. Retr. 12 (4), pp. 461–486.
- Baly et al. (2018). Predicting factuality of reporting and bias of news media sources. In Proc. of EMNLP’18, pp. 3528–3539.
- Barrón-Cedeño et al. (2019). Proppy: organizing news coverage on the basis of their propagandistic content. Information Processing and Management 56 (5), pp. 1849–1864.
- Blondel et al. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008 (10), pp. P10008.
- Card et al. (2015). The media frames corpus: annotations of frames across issues. In Proc. of ACL’15, pp. 438–444.
- Conover et al. (2011). Political polarization on Twitter. In Proc. of ICWSM’11, pp. 89–96.
- Dalvi et al. (2017). QCRI live speech translation system. In Proc. of EACL’17, pp. 61–64.
- Darwish et al. (2019). Unsupervised user stance detection on Twitter. In Proc. of ICWSM’20.
- Entman (1993). Framing: toward clarification of a fractured paradigm. Journal of Communication 43 (4), pp. 51–58.
- Flaxman et al. (2016). Filter bubbles, echo chambers, and online news consumption. Public Opinion Quarterly 80 (S1), pp. 298–320.
- Hanselowski et al. (2018). A retrospective analysis of the fake news challenge stance-detection task. In Proc. of COLING’18, pp. 1859–1874.
- Ji and Smith (2017). Neural discourse structure for text categorization. In Proc. of ACL’17, pp. 996–1005.
- Joulin et al. (2016). Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.
- Ma et al. (2009). Identifying suspicious URLs: an application of large-scale online learning. In Proc. of the 26th ICML, pp. 681–688.
- Miranda et al. (2018). Multilingual clustering of streaming news. In Proc. of EMNLP’18, pp. 4535–4544.
- Staykovski et al. (2019). Dense vs. sparse representations for news stream clustering. In Proc. of Text2story’19, pp. 47–52.