365 Dots in 2019: Quantifying Attention of News Sources

03/22/2020 ∙ by Alexander C. Nwala, et al. ∙ Old Dominion University

We investigate the overlap of topics of online news articles from a variety of sources. To do this, we provide a platform for studying the news by measuring this overlap and scoring news stories according to the degree of attention in near-real time. This can enable multiple studies, including identifying topics that receive the most attention from news organizations and identifying slow news days versus major news days. Our application, StoryGraph, periodically (10-minute intervals) extracts the first five news articles from the RSS feeds of 17 US news media organizations across the partisanship spectrum (left, center, and right). From these articles, StoryGraph extracts named entities (PEOPLE, LOCATIONS, ORGANIZATIONS, etc.) and then represents each news article with its set of extracted named entities. Finally, StoryGraph generates a news similarity graph where the nodes represent news articles, and an edge between a pair of nodes represents a high degree of similarity between the nodes (similar news stories). Each news story within the news similarity graph is assigned an attention score which quantifies the amount of attention the topics in the news story receive collectively from the news media organizations. The StoryGraph service has been running since August 2017, and using this method, we determined that the top news story of 2018 was the "Kavanaugh hearings" with attention score of 25.85 on September 27, 2018. Similarly, the top news story for 2019 so far (2019-12-12) is "AG William Barr's release of his principal conclusions of the Mueller Report," with an attention score of 22.93 on March 24, 2019.




1. Introduction and background

It is natural to ask “what were the top news stories of 2019?” A partisanship study might ask, “how often do news stories from different partisan media organizations overlap?” A retrospective study might ask, “when did Hurricane Harvey begin to receive serious coverage?”, or “how did the attention given to Hurricane Harvey by the media differ from hurricanes that occurred in similar timeframes (but different locations) such as Irma or Maria?” Addressing these questions requires the fundamental operation of measuring overlap, or similarity, of news topics across different news sources.

We developed a method of measuring the similarity among news articles in near-real time and quantifying the level of attention the topics in the news stories receive. Specifically, we created a service called StoryGraph (http://storygraph.cs.odu.edu/ and https://twitter.com/storygraphbot) that creates a news similarity graph from 17 left, center, and right news media organizations. StoryGraph quantifies the level of attention the topics in the news stories receive by assigning each an attention score. Major breaking news stories are often reported by multiple news organizations within the same time period. Consequently, a major news story is characterized by a high degree of similarity between pairs of news stories from different news organizations. For example, below is a list of headlines showing a high degree of similarity among news reports collected on October 24, 2018, at 5:34 PM EST from four news organizations, following the incident in which mail bombs were sent to multiple Democratic public figures.

  • Vox: Explosive devices sent to Clintons, Obamas, CNN: what we know (Coaston, Jane and Emily, Stewart and Kirby, Jen, 2018)

  • FoxNews: FBI IDs 7 ‘suspicious packages’ sent to Dem figures containing ‘potentially destructive devices’ (Chamberlain, Samuel and Gibson, Jake, 2018)

  • CNN: Bombs and packages will be sent to FBI lab for analysis (CNN, 2018)

  • Breitbart: Live Updates: Democratic Leaders Receive Mail Bombs (BreitbartNews, 2018)

The prerequisite for deriving the attention score is calculating the similarity between documents (e.g., news articles). This problem has been studied extensively. Methods that represent documents as vectors (Atkins et al., 2018; Beel et al., 2016; Gabrilovich et al., 2007) often use the Cosine Similarity vector-based metric to quantify similarity between pairs of documents. Methods that represent documents as sets (Tran et al., 2014; Strehl et al., 2000) often use set-based metrics such as the Jaccard similarity or the Overlap coefficient metric to quantify the similarity between a pair of documents. In this work, we represent each news article as a set of named entities, and utilize a set similarity measure (Section 2, Step 4) to quantify the degree of similarity between a pair of news documents.

Our investigation into measuring near-real time news similarity and quantifying the attention of news sources has resulted in the following contributions. First, we proposed the attention score, a transparent method for quantifying attention given to a news story by different news sources. The attention score facilitates finding the top news stories for a given day, month, or year. This enabled us to show the top stories of 2018 and 2019 (Table 3). Second, we introduced the StoryGraph service, which has been running for over two years (since August 8, 2017), generating news similarity graphs every 10 minutes from 17 news organizations across the left, center, and right partisanship spectrum. Third, we showed that the StoryGraph service and dataset provide a platform for multiple longitudinal studies (Section 3). The code for StoryGraph is publicly available (Alexander Nwala, 2018b, c), and the entire StoryGraph dataset is available upon request.

Table 1. List of 17 left (blue), center (purple), and right (red) news media RSS feeds from which StoryGraph extracts news stories. The list of media sources was derived from Faris et al.’s (Faris et al., 2017) list of popular media sources.

2. Methodology

Figure 1. Overview of StoryGraph illustrating the process of generating a news similarity graph in four primary steps.
| | News Article 1 | News Article 2 |
|---|---|---|
| Title | Christine Blasey Ford’s Attorneys Say They Paid for Polygraph Test - Breitbart | Live updates: Brett Kavanaugh and Christine Blasey Ford hearing on sex assault allegations - CNNPolitics |
| Extracted entities (11 entities in common, 40 entities in total) | blasey, brett, christine, debra, dianne, donald, feinstein, ford, hanafin, jerry, katz, kavanaugh, mitchell, rachel, trump | ashley, banks, ben, blasey, botwinick, brett, bromwich, christine, cornyn, crapo, cruz, debra, diamond, don, donald, flake, ford, jeff, jeremy, john, katz, kavanaugh, kristina, lee, lisa, mcgahn, michael, mike, mitchell, rachel, ryan, sasse, sgueglia, ted, trump, zeleny |
| Similarity of News Article 1 and News Article 2 | Similar news | |
Table 2. StoryGraph: Worked news similarity (similar news) example. Only PERSON entities are shown here for brevity.

The StoryGraph process has four steps (Fig. 1), outlined below. First, StoryGraph collects the first five news articles from the RSS feeds of 17 news media organizations (Table 1). Second, StoryGraph dereferences the URLs of the news articles and extracts plaintext after removing the HTML boilerplate (Alexander Nwala, 2017). Third, StoryGraph utilizes the Stanford CoreNLP Named Entity Recognizer (Finkel et al., 2005; Alexander Nwala, 2018d) to extract seven entity classes (PERSONS, LOCATIONS, ORGANIZATIONS, DATES, TIME, MONEY, and PERCENT) from the news documents. In addition to these entity classes, we created and extracted text that belongs to two additional classes: TITLE and TOP-K-TERM. The TITLE class represents title terms from the news articles, while the TOP-K-TERM class represents the top k most frequent terms (for a fixed k). All text that does not belong to one of the entity classes is discarded. Subsequently, each news article is represented as a set of entities extracted from the article. Fourth, StoryGraph creates a graph where the nodes (sets of entities) represent news articles, and an edge between a pair of nodes represents a similarity score beyond some threshold between the nodes (similar news stories). Finally, the attention scores of the connected components of the newly generated graph are calculated.

Formally, consider a news similarity graph $G$ in which the nodes represent news articles, and an edge between a pair of nodes represents a high degree of similarity (Section 2, Step 4) between the nodes (similar news stories). Consider the set $C$ of $G$'s connected components such that, for every $c \in C$, the nodes (news articles) in $c$ originate from multiple news sources. The attention score $a(c)$ (Eqn. 1) of a news story represented by a connected component $c$ with $|V_c|$ nodes and $|E_c|$ edges is simply the average degree of the connected component:

$$a(c) = \frac{2|E_c|}{|V_c|} \tag{1}$$
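The attention score described above can be illustrated with a short sketch. This is not the StoryGraph implementation; the function names, the `source_of` mapping, and the toy graph are hypothetical, and the only thing taken from the text is that the score is the average degree of a connected component whose articles come from more than one news source.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Return (node_set, edge_count) for each connected component."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first traversal collecting every node reachable from `start`.
        stack, members = [start], set()
        seen.add(start)
        while stack:
            node = stack.pop()
            members.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        # Each undirected edge is counted once from each endpoint.
        edge_count = sum(len(adjacency[n]) for n in members) // 2
        components.append((members, edge_count))
    return components


def attention_score(node_count, edge_count):
    """Average degree of a component: 2|E| / |V|."""
    return 2 * edge_count / node_count


def story_scores(nodes, edges, source_of):
    """Score only components whose articles span multiple news sources."""
    scores = []
    for members, edge_count in connected_components(nodes, edges):
        if len({source_of[n] for n in members}) > 1:
            score = attention_score(len(members), edge_count)
            scores.append((score, sorted(members)))
    return sorted(scores, reverse=True)


# Hypothetical toy graph: three mutually similar articles from three
# sources, plus two similar articles from the same source.
NODES = ["a1", "a2", "a3", "a4", "a5"]
EDGES = [("a1", "a2"), ("a2", "a3"), ("a1", "a3"), ("a4", "a5")]
SOURCE_OF = {"a1": "vox", "a2": "cnn", "a3": "foxnews", "a4": "cnn", "a5": "cnn"}

if __name__ == "__main__":
    print(story_scores(NODES, EDGES, SOURCE_OF))
```

In this toy graph the three-article component has three edges, so its average degree is 2 × 3 / 3 = 2.0, while the two-article component is skipped because both articles share a source. A pair of articles from different sources joined by a single edge scores 1.0, which matches the slow-news-day score reported for NS Graph 1.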

Figure 2. Three News Similarity (NS) graphs illustrating the dynamics of the news cycle. In these graphs, a single node represents a news article, and a connected component (multiple connected nodes) represents a single news story reported by the connected nodes. The first (NS Graph 1 (StoryGraph, 2019c)) shows what is often referred to as a slow news day: low overlap across different news media organizations, resulting in a low attention score (1.0) for news stories (connected components A and B). The second graph (NS Graph 2 (StoryGraph, 2019b)) shows a scenario where the attention of the media is split across four different news stories (connected components A-D). The third graph (NS Graph 3 (StoryGraph, 2019a)), for the story of AG William Barr’s release of his principal conclusions of the Mueller Report, shows a major news event: a high degree of overlap/connectivity across different news media organizations, resulting in a high attention score of 22.93.
Figure 3. 365 dots in 2019 (Alexander Nwala, 2019): Top news stories for 365 days in 2019. Each dot represents the highest attention score across 144 story graphs for a given day.
(a) Kavanaugh and Christine Blasey Ford testify before congress (StoryGraph, 2020b) (attention score = 25.85)
(b) Nunes memo released (StoryGraph, 2020c) (attention score = 18.81)
(c) Trump and Kim Jong Un meet in Singapore (StoryGraph, 2020d) (attention score = 18.15)
Figure 4. StoryGraph: Top three news stories of 2018
(a) AG William Barr releases Mueller Report’s principal conclusions (StoryGraph, 2020a) (22.93)
(b) House Speaker Pelosi announces formal impeachment inquiry (StoryGraph, 2020a) (18.60)
(c) Impeachment inquiry public testimony (StoryGraph, 2020a) (18.18)
Figure 5. StoryGraph: Top three news stories of 2019
Section 1: Top News Stories of 2018

| Rank | Attention score | Date (MM-DD) | News story |
|---|---|---|---|
| 1 | 25.85 | 09-27 | Kavanaugh and Christine Blasey Ford testify before congress (Fig. 4a) |
| 2 | 18.81 | 02-02 | Nunes memo released (Fig. 4b) |
| 3 | 18.15 | 06-12 | Trump and Kim Jong Un meet in Singapore (Fig. 4c) |
| 4 | 17.03 | 10-24 | Bombs mailed to Clinton, Obama, etc. |
| 5 | 16.32 | 03-17 | Ex-FBI Deputy Director Andrew McCabe fired |

Section 2: Top News Stories of 2019

| Rank | Attention score | Date (MM-DD) | News story |
|---|---|---|---|
| 1 | 22.93 | 03-24 | AG William Barr releases Mueller Report’s principal conc. (Fig. 5a) |
| 2 | 18.60 | 09-24 | House Speaker Pelosi announces formal impeachment inquiry (Fig. 5b) |
| 3 | 18.18 | | Impeachment inquiry public testimony (Fig. 5c) |
| 4 | 17.19 | 01-19 | Mueller: BuzzFeed Report ‘Not Accurate’ |
| 5 | 15.39 | 07-31 | 2019 Democratic debates |

Table 3. StoryGraph: Top news stories of 2018 (Alexander Nwala, 2018a) & 2019 (Alexander Nwala, 2019)

2.1. Step 1: News article extraction

StoryGraph extracts the URLs of the first five news articles from each of the 17 RSS feeds (Table 1). Next, it dereferences each URL, yielding 85 HTML documents (5 articles from each of 17 feeds).
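As a rough illustration of this step, the sketch below pulls the first five article links out of an RSS 2.0 document using only the Python standard library. It is a minimal sketch, not the StoryGraph code: the function name, the sample feed, and the `example.com` URLs are hypothetical, and dereferencing the URLs over HTTP is deliberately left out.

```python
import xml.etree.ElementTree as ET


def first_article_links(rss_xml, limit=5):
    """Return the <link> text of the first `limit` <item> elements."""
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            links.append(link.strip())
        if len(links) == limit:
            break
    return links


# Hypothetical feed with seven items; only the first five are kept.
_ITEMS = "".join(
    f"<item><title>Story {i}</title>"
    f"<link>https://example.com/story-{i}</link></item>"
    for i in range(1, 8)
)
SAMPLE_FEED = (
    '<rss version="2.0"><channel><title>Example feed</title>'
    f"{_ITEMS}</channel></rss>"
)

if __name__ == "__main__":
    for url in first_article_links(SAMPLE_FEED):
        print(url)
```

Running the same function over 17 feeds would yield at most 85 URLs, matching the 85 documents mentioned above. Real feeds vary (Atom feeds, for example, use different element names), so a production collector would need more tolerant parsing.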

2.2. Step 2: Plaintext extraction

The HTML boilerplate is removed from the 85 documents from Step 1 (Alexander Nwala, 2017), yielding 85 plaintext documents.
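To give a sense of what boilerplate removal involves, here is a deliberately naive sketch based on Python's standard `html.parser`. It is not the boilerplate remover cited above; it simply drops text inside a hypothetical list of non-content tags (`SKIPPED_TAGS`) and keeps everything else, which is far cruder than a dedicated tool.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text that is not nested inside a skipped tag."""

    # Assumption: these tags rarely carry article text.
    SKIPPED_TAGS = {"script", "style", "nav", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when outside every skipped tag.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_plaintext(html):
    """Return the visible text of an HTML document, minus skipped tags."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


SAMPLE_HTML = (
    "<html><head><style>p { color: red; }</style></head><body>"
    "<nav>Home | Politics</nav>"
    "<p>Explosive devices were sent to public figures.</p>"
    "<script>var x = 1;</script>"
    "<footer>Copyright</footer></body></html>"
)

if __name__ == "__main__":
    print(extract_plaintext(SAMPLE_HTML))
```

A tag blacklist like this misses boilerplate placed in generic `div` elements, which is one reason dedicated boilerplate-removal tools are normally used for this step.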

2.3. Step 3: Named entities extraction

The 85 plaintext documents from Step 2 are passed into the Stanford CoreNLP Named Entity Recognizer (Finkel et al., 2005; Alexander Nwala, 2018d), yielding 85 different sets of named entities.

2.4. Step 4: News similarity graph generation

Given a pair of news articles represented by their respective sets of named entities $A$ and $B$, the weighted Jaccard-Overlap similarity $sim(A, B)$ is given by Eqn. 2, where $\alpha$ is the coefficient of similarity that weights the two measures. The coefficient, and the threshold two documents must reach to be considered similar, were empirically derived from a gold-standard dataset. An edge is formed between nodes for which $sim(A, B)$ reaches the threshold. Table 2 illustrates a simple worked example.

$$sim(A, B) = \alpha \, J(A, B) + (1 - \alpha) \, O(A, B) \tag{2}$$

$J(A, B)$ is the Jaccard index of both documents,

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|},$$

and $O(A, B)$ is the Overlap coefficient of both documents,

$$O(A, B) = \frac{|A \cap B|}{\min(|A|, |B|)}.$$
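The similarity computation can be sketched on the PERSON entities listed in Table 2. The Jaccard index and Overlap coefficient below follow their standard definitions. The weighted combination is written as a convex combination of the two, which is an assumption about the form of Eqn. 2, and `ALPHA` and `THRESHOLD` are hypothetical placeholder values, not the values the authors derived.

```python
def jaccard(a, b):
    """Jaccard index: |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap(a, b):
    """Overlap coefficient: |A ∩ B| / min(|A|, |B|)."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def weighted_similarity(a, b, alpha):
    """Assumed form: alpha * Jaccard + (1 - alpha) * Overlap."""
    return alpha * jaccard(a, b) + (1 - alpha) * overlap(a, b)


# PERSON entities from Table 2.
ARTICLE_1 = set(
    "blasey brett christine debra dianne donald feinstein ford hanafin "
    "jerry katz kavanaugh mitchell rachel trump".split()
)
ARTICLE_2 = set(
    "ashley banks ben blasey botwinick brett bromwich christine cornyn "
    "crapo cruz debra diamond don donald flake ford jeff jeremy john katz "
    "kavanaugh kristina lee lisa mcgahn michael mike mitchell rachel ryan "
    "sasse sgueglia ted trump zeleny".split()
)

ALPHA = 0.5      # hypothetical weight, not the authors' value
THRESHOLD = 0.3  # hypothetical threshold, not the authors' value

if __name__ == "__main__":
    print("common entities:", len(ARTICLE_1 & ARTICLE_2))
    print("total entities:", len(ARTICLE_1 | ARTICLE_2))
    print("jaccard:", jaccard(ARTICLE_1, ARTICLE_2))
    print("overlap:", overlap(ARTICLE_1, ARTICLE_2))
    score = weighted_similarity(ARTICLE_1, ARTICLE_2, ALPHA)
    print("edge formed:", score >= THRESHOLD)
```

The two sets share 11 entities out of 40 distinct ones, matching the counts in Table 2. The Overlap coefficient is more forgiving than the Jaccard index when one article is much longer than the other, which is a plausible motivation for blending the two measures.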

StoryGraph has been running since August 8, 2017, generating a news similarity graph once every 10 minutes (144 graphs per day). Since then, the application has generated 120,663+ graphs. For a given day, the connected component with the highest average degree (attention score) from the 144 candidate graphs maps to the top news story of the day. Similarly, for a given month, the connected component with the highest attention score maps to the top news story for the month. For a given year, the top k news stories are derived by finding the k connected components with the highest attention scores. Specifically, the top k news stories are the first k connected components in the list of connected components from all news similarity graphs, sorted in descending order by attention score.
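The selection of top stories can be sketched as follows. The record layout `(date, attention_score, label)` and the function names are hypothetical. The four high scores are taken from the results reported in this paper, while the low-scoring record is invented for illustration.

```python
import heapq


def top_story_per_day(components):
    """Map each date to its highest-scoring (score, label) component."""
    best = {}
    for day, score, label in components:
        if day not in best or score > best[day][0]:
            best[day] = (score, label)
    return best


def top_k_stories(components, k):
    """Return the k components with the highest attention scores."""
    return heapq.nlargest(k, components, key=lambda c: c[1])


COMPONENTS = [
    ("2019-03-22", 18.72, "Mueller submits report to AG Barr"),
    ("2019-03-24", 22.93, "Barr releases principal conclusions"),
    ("2019-03-24", 3.20, "Invented low-attention story"),  # hypothetical
    ("2019-09-24", 18.60, "Pelosi announces impeachment inquiry"),
    ("2019-07-31", 15.39, "2019 Democratic debates"),
]

if __name__ == "__main__":
    for day, (score, label) in sorted(top_story_per_day(COMPONENTS).items()):
        print(day, score, label)
    for day, score, label in top_k_stories(COMPONENTS, 3):
        print(score, day, label)
```

Note that a plain top-k over all components can return the same story several times when it dominates consecutive graphs, so producing a ranked list of distinct stories would likely need an additional grouping step that this sketch does not attempt.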

3. Results and Discussion

We can now quantify “slow news days” vs. major news, as well as show that the Mueller report was 2019’s top story.

3.1. Slow news day vs. Major news

Fig. 2 illustrates how the attention score (average degree) of the connected components in a news similarity graph helps characterize different news cycle scenarios. All news graphs in this section refer to Fig. 2. Most of the nodes (news articles) in NS Graph 1 are isolated with few connected components; the news sources mostly report on divergent topics. Consequently, no single news story (e.g., connected component A or B) receives attention from more than two different news sources.

The second news similarity graph (NS Graph 2), unlike the first, shows attention split among four primary news stories. The attention score of each news story represented by a connected component indicates the magnitude of attention given to that story. For example, connected component A (attention score = 6.13) represents the story "Poll: Pete Buttigieg becomes the presidential frontrunner in Iowa - Vox", and B (1.0) represents "Colin Kaepernick Skips NFL Organized Workout, Wears Shirt Likening Himself to a Slave - Breitbart".

The third news similarity graph (NS Graph 3) indicates a major news event, characterized by a giant connected component (news story) with a high attention score (22.93), indicating a high degree of overlap among news sources. This is a scenario in which most news sources report on the same story (AG William Barr’s release of his principal conclusions of the Mueller Report).

3.2. The top news stories of 2019

Stories surrounding the release of the Mueller Report (red dots in Fig. 3, Table 3 Section 2, No. 1) received the most attention in 2019. On March 22, 2019, Robert Mueller submitted his report to AG William Barr (attention score = 18.72). Two days later, AG William Barr released his summary (principal conclusions) of the report. This story received the most attention (attention score = 22.93) in 2019. AG William Barr’s principal conclusions of the Mueller report were received with skepticism by the Democrats, who claimed the conclusions were highly favorable to President Trump. In contrast, the Republicans claimed the summary exonerated the President from any wrongdoing. The next top story in 2019 (blue dots in Fig. 3, Table 3, Section 2, No. 2), with an attention score of 18.60, was Speaker Nancy Pelosi’s announcement of an official impeachment inquiry (September 24, 2019), four days after the whistleblower’s report. Similarly, at rank three (green dots in Fig. 3) were stories chronicling the public testimonies of the impeachment inquiry.

4. Future Work and Conclusion

StoryGraph has been generating one news similarity graph every 10 minutes since August 2017. A single graph file includes the URLs of the news articles, plaintext, entities, publication dates, etc. In this paper, we reported the results of only two studies. The first studied the dynamics of the news cycle (slow news cycle vs. major news event). The second utilized attention scores to facilitate finding top stories.

StoryGraph provides the opportunity for further study beyond the two presented here. For example, a study focused on the coverage of mass shootings can utilize StoryGraph to approximate how much attention the 2018 Parkland, Florida shooting received compared to the 2019 Dayton, Ohio and El Paso, Texas mass shootings. A different study could narrowly apply news similarity to focus on a single news organization, e.g., FoxNews, in order to identify the news stories where they focus the most attention, or compare the attention span of different events. Therefore, we believe the StoryGraph process of quantifying news similarity and the attention of news sources provides a valuable means for studying news.

This work was supported in part by IMLS LG-71-15-0077-15. This is an extended version of the paper accepted at Computation + Journalism Symposium 2020, which has been postponed because of COVID-19. We also appreciate the help of Sawood Alam in the deployment of StoryGraph.