Prioritizing Original News on Facebook

02/16/2021 ∙ by Xiuyan Ni, et al. ∙ Facebook

This work outlines how we prioritize original news, a critical indicator of news quality. By examining the landscape and life-cycle of news posts on our social media platform, we identify challenges of building and deploying an originality score. We pursue an approach based on normalized PageRank values and three-step clustering, and refresh the score on an hourly basis to capture the dynamics of online news. We describe a near real-time system architecture, evaluate our methodology, and deploy it to production. Our empirical results validate individual components and show that prioritizing original news increases user engagement with news and improves proprietary cumulative metrics.


1. Introduction

Large amounts of news are published online every day, and many people now primarily consume news online (Reis et al., 2015). News quality affects how people consume news and which platforms they prefer (Mosseri, 2018b; Facebook, 2021b). Expressing news quality numerically can facilitate significant improvements for users and platforms (Facebook, 2019). Among the various aspects of news quality, we focus on originality, which can be contrasted with duplicates, slightly edited text, and coverage that references original news. Producing original news is laborious and requires expertise, but such efforts initiate the typical news cycle and drive the entire news industry. Original news informs people around the world, from breaking news, eyewitness reports, and critical updates in times of crisis, to in-depth investigative reports that uncover new facts. Prioritizing original news online is in everyone's long-term interest (Facebook, 2019).

In this work, we first explore the landscape of online news, using the Facebook platform as an example. To enable a quantitative approach, we tabulate the spectrum of news originality from completely unoriginal to highly original. Our static analysis suggests that highly original news is rare, yet the large inventory must be indexed and processed to accurately identify the original articles. We also explore the dynamics of the news life-cycle on Facebook and find that news posts typically attain their greatest exposure in the first couple of hours, followed by a long tail. This result suggests that an originality score used to improve News Feed ranking must be computed promptly.

Given two challenges, search quality at scale and fast response, we build a near real-time system and construct a synthesized signal for news originality. News articles that cover the same news event are clustered together based on specialized BERT embeddings (Devlin et al., 2019), which are fine-tuned on pairwise-labeled data (same subject or different subjects). After evaluating several clustering algorithms against human-labeled pairwise data, we settle on a three-step clustering algorithm that is both effective and highly scalable to large datasets. To adequately capture news dynamics, our system performs incremental updates on an hourly basis.

We conclude that content alone is insufficient to judge news originality, but behavioral signals, such as citations of prior posts, can also be used. Integrity considerations are particularly important, given strong incentives to game online news distribution. To de-bias our algorithms, we filter out news articles produced within patterns of nefarious activity. We first evaluate the performance of our originality signal offline against ratings by professional journalists. Online evaluation is based on an A/B test, where we additionally monitor the impact on news article ranking (Xu et al., 2015). The signal is incorporated into the News Feed ranking system.

Our contributions include:

  • We examine the news originality landscape and the dynamics of the news life-cycle, then propose a quantitative approach to reason about the news ecosystem. We categorize the level of news originality by the effort spent to generate news content.

  • We propose a methodology and architect a near real-time system that processes individual news articles at a large scale. Using the PageRank algorithm and three-step clustering, it calculates a synthetic score to estimate news originality. PageRank normalization within clusters is particularly novel. The method can be applied to other news serving systems.

  • To facilitate live-data analysis of perceived news quality and of news quality scores, we develop quantitative and qualitative methods. These methods can zoom in on individual news articles and their distribution, and can also measure entire news ecosystems. Such analyses help both news publishers and consumers, who now depend on online news (Facebook, 2019).

Figure 1. Citations in news articles. The top snippet cites an article by another publisher. The cited article cites another article from the same publisher.

2. Background

In this section, we first review the ideas behind PageRank and introduce the news citation graph. Then we outline ranking at Facebook, where we deploy our originality signal. However, other social media use conceptually similar ranking systems and our contributions are not specific to Facebook.

2.1. The News Citation Graph

The PageRank algorithm was originally developed at Google to rank Web pages and sites to improve search results (Brin and Page, 1998; Cresci et al., 2015, 2017; Page et al., 1999; Ye and Skiena, 2019). Mathematically, it is a random-walk-based algorithm for ranking vertices in a graph. A Web page with many incoming links from high-weight pages has a greater weight itself; page weights propagate from each Web page to the pages it links to. In the news domain, Del Corso et al. (2005) introduced a related graph-based ranking algorithm where each vertex represents a news source, focusing on authoritative news sources and interesting news events.

Ye and Skiena (2019) built an automated ranking system called MediaRank to rank news sources. They applied the PageRank algorithm to news reporting citations and showed that PageRank values correlate positively with reporting quality as measured by peer reputation and other indicators. Zhang et al. (2018) introduced a set of signals, collected from expert annotators, that indicate the credibility of news. They grouped their indicators into two categories. The first group contains content indicators determined by the articles themselves (mentions of organizations, studies, etc.). Context indicators in the other group require analysis of external sources, such as author reputation and recognition by peers in terms of the PageRank algorithm, as in Cresci et al. (2015, 2017).

Similar to citations in academic papers, it is common to cite credible peers in the news industry, and such citations are important indicators of news source quality. Therefore, we introduce the news citation graph at the news article level, instead of the domain level, to estimate the credibility of individual news articles. The idea is that when a news article is disproportionately cited by its peers, this indicates higher journalistic credibility. Whereas academic papers itemize their references and use reference numbers in citations, news articles follow a different style. In this work, we only consider citations in the form of links in a news article to other news articles.

Figure 1 illustrates news article citations. The example at the top is from a news article by Publisher 1. This article cites multiple sources, one of which is shown: a news article from another publisher, which in turn cites another article by that same publisher. If a publisher breaks the story about an important news event, many other articles and publishers will cite it.

We take snapshots of the news ecosystem and index all our notation by time (Section 5.1). In particular, $A_t$ is the set of all news articles at time $t$, and $a \in A_t$ denotes an individual article. We cluster such articles by news event or news story (Section 5.2), denoting individual clusters $C \subseteq A_t$. When a news article $a$ cites another article $b$, we represent this by a directed edge $(a, b) \in E_t$, where $E_t$ is the set of edges in the citation graph. We also say that $(a, b)$ is $a$'s outbound edge and $b$'s inbound edge. Using these directed edges, we can compute the PageRank values of individual vertices (Section 5.1) by iteratively applying the following formula to every vertex in the graph in topological order:

$$PR_t(a) = (1 - d) + d \sum_{b \in N_{in}(a)} \frac{PR_t(b)}{|N_{out}(b)|} \qquad (1)$$

where $PR_t(a)$ is the PageRank of article $a$ (initialized to 1) at time $t$, $N_{in}(a)$ denotes the set of adjacent vertices (neighbors) that cite $a$, $|N_{out}(b)|$ is the number of neighbors that $b$ cites, and $d$ is a (constant) damping factor, usually set to 0.85. The latter parameter dampens the propagation of weights through multiple edges.
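For illustration, the following Python sketch applies Equation (1) to a toy citation graph until the values stabilize. The edge representation and parameter names are ours, not the production system's, which operates on the much larger hourly graph snapshots described in Section 5.1.

from collections import defaultdict

def pagerank(edges, damping=0.85, iterations=50):
    # edges: iterable of (a, b) pairs meaning "article a cites article b"
    out_degree = defaultdict(int)     # |N_out(b)|: number of articles b cites
    in_neighbors = defaultdict(list)  # N_in(a): articles citing a
    vertices = set()
    for a, b in edges:
        out_degree[a] += 1
        in_neighbors[b].append(a)
        vertices.update((a, b))
    pr = {v: 1.0 for v in vertices}   # every article starts at 1, as in Equation (1)
    for _ in range(iterations):
        pr = {v: (1.0 - damping)
                 + damping * sum(pr[b] / out_degree[b] for b in in_neighbors[v])
              for v in vertices}
    return pr

# Toy example: article b cites a; article c cites both a and b.
print(pagerank([("b", "a"), ("c", "a"), ("c", "b")]))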

2.2. The News Article Representation

When estimating article originality, it is important to check how similar two articles are. Such checks are commonly implemented with cosine similarity on vector embeddings. To produce the necessary embeddings, prior work uses the BERT (Bidirectional Encoder Representations from Transformers) network architecture (Devlin et al., 2019), which achieved state-of-the-art results in many natural language processing tasks across different applications (Reddy et al., 2019; Lee et al., 2020). BERT handles previously unseen words by breaking them down into known subword fragments, and it can be updated on a regular basis to handle emerging keywords such as "COVID". The original BERT models were DNNs pre-trained on the BooksCorpus (Zhu et al., 2015) and the English Wikipedia. However, BERT networks can be specialized to a given use case by adding one dense layer and training it on adequate labeled data. Along these lines, Reimers and Gurevych (2019) proposed the Sentence-BERT architecture, which uses a Siamese network structure in the context of semantic similarity estimation.
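As a small illustration of subword handling (not taken from the paper), the public bert-base-uncased tokenizer from HuggingFace breaks a word unseen at pre-training time into known fragments:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("COVID vaccine approved"))
# Expected output resembles ['co', '##vid', 'vaccine', 'approved']: the novel
# word survives as subword pieces, so sentences containing it can still be embedded.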

2.3. News Feed Ranking

Figure 2. News Feed ranking at Facebook

The ranking of news has been extensively studied both in academia and industry (Del Corso et al., 2005; Hu et al., 2006; Ye and Skiena, 2019). A number of publications in the information retrieval community address this subject (Gwadera and Crestani, 2009; Kanhabua et al., 2011; De Francisci Morales et al., 2012; Tatar et al., 2014; Reis et al., 2015; Zheng et al., 2018; Zhang et al., 2018). In 2018, Nuzzel announced a ranking system for news sources called NuzzelRank (https://nuzzel.com/rank) that integrates various signals, including publisher authority information, into a single score to rank news sources.

Facebook's News Feed ranks not only news content, but also events from users' social graphs (Mosseri, 2018b; Lada et al., 2021). Ranking objectives optimize long-term user satisfaction, account for communities (friends and family, etc.) (Mosseri, 2018a), and support News Feed integrity (Halevy et al., 2020), e.g., to discourage clickbait and prevent unlawful activities. When a user logs in to Facebook, they see their News Feed, which includes fresh updates from their friends, groups they joined, and pages they followed. News Feed ranking can be roughly divided into four stages: inventory, signals, prediction, and relevance scores (Mosseri, 2018b; Ni et al., 2019; Lada et al., 2021). Once a piece of content is posted, numerous signals are extracted: publication time, engagement counts, etc. These signals are used to estimate the probabilities of possible individual user actions for each piece of content in the inventory, should the user see it (Mosseri, 2018b; Ni et al., 2019; Lada et al., 2021). As a matter of notation, P(comment) represents the probability that a user comments on the update, while P(like) represents the probability that the user likes the content. At the last stage, we combine these predictions and compute a ranking relevance score for each piece of content. Our news originality signal is deployed within this system, summarized in Figure 2. News Feed ranking at Facebook incorporates many signals, and our originality signal enacts only subtle changes to the user experience, as we explain later.

3. Problem Analysis

Here we examine the news originality landscape and motivate our work. Then we investigate the life-cycle of news stories on social media platforms. Understanding the news life-cycle is critical to deploying the originality signal within News Feed ranking.

3.1. The Landscape of News Originality

Figure 3. News originality by bucket: (a) completely unoriginal; (b) highly unoriginal; (c) somewhat unoriginal; (d) potentially original but lacking peer recognition; (e) recognized as original by peers. For each bucket, we show estimated total views received by all news articles.

Our quantitative approach to news originality uses content buckets:

  (a) completely unoriginal: scraped or spun content with no editorial effort

  (b) highly unoriginal: produced with very low editorial effort

  (c) somewhat unoriginal: may be editorially produced, but heavily cites other content without original reporting or analysis

  (d) potentially original, but lacking peer recognition

  (e) recognized as original by peers: breaking news, eyewitness reports, exclusive scoops, investigative reporting, etc.

Publisher 1 (Original) | Publisher 2 (Spun)
Israel grants Rashida Tlaib West Bank visit on humanitarian grounds | Israel grants Rashida Tlaib West Financial Institution go to humanitarian grounds
Israel's interior minister on Friday said | Israel's inside minister on Friday said
Pod Foods gets VC backing to reinvent grocery distribution | Pod Meals will get VC backing to reinvent grocery distribution
Table 1. Examples of spun content. Publisher 1 posted original articles, while Publisher 2 replaced isolated words, phrases, and sentences in articles from Publisher 1.

Scraped content is copied from other sources without editorial effort. Spun content is taken from a post or a Web page and posted with only minor modifications by humans or machines (see examples in Table 1). Common methods include paraphrasing, replacing words, and reordering paragraphs. By automating the spinning of existing content, one can quickly produce a large amount of content without scraping. Scraped and spun content can eclipse original content and undermine its value, which warrants removal or limited distribution relative to original content.

Highly unoriginal articles are produced by low-effort text changes. We find that most news articles actually fall into the third bucket, somewhat unoriginal. These articles may provide useful information, but they do not require much effort to produce.

Potentially original but lacking peer recognition: this bucket includes content that does not fit the earlier buckets and so may be original, but for various reasons does not receive peer recognition throughout the news cycle. Opinion pieces that receive little support often fall into this category. Thus, citation signals alone cannot distinguish between this bucket and unoriginal articles.

Highly original news is produced with significant effort to fact-check information and produce clear narratives, high-quality writing, and visuals. Thoughtful and original news content is usually cited heavily by industry peers and contributes to the reputation of individual content creators. Due to the effort and expertise required, original news content is scarce. Prioritizing the distribution of original content can help it reach greater audiences and benefits both readers and the news industry in the long run (Facebook, 2019).

In general, it is difficult to judge an article for originality in isolation because this would require careful analysis of its contents with an understanding of current events. Particularly challenging would be distinguishing rumors and fake news from reasonable reporting. Therefore, we draw additional insights from the news citation graph and the dynamics of online news. The special cases of scraped and spun content are handled by dedicated systems based on text hashing and fingerprinting, as well as text similarity metrics. In practice, such content does not appear in users' News Feed inventory and is therefore not treated in our work.

3.2. The Dynamics of Online News

News content published on the Internet can be easily indexed and archived, but social media platforms tend to favor fresh news, which is why reporters strive to break a new story first. To re-examine this conventional wisdom and determine how to reflect it in our work, we explore a large volume of news articles shared on Facebook and track the dynamics of user engagement metrics. We also visualize the life-cycle of typical online news stories and check the impact of adding valuable information days after the original publication. As it turns out, the same pattern persists across different news categories: world and local news, politics, and entertainment.

Figure 4 illustrates how quickly users lose interest in a particular story. On September 27, 2019, Disney and Sony reached a deal for Spiderman movies, announcing that Spiderman would stay in the Marvel Universe. One publisher reported the story first, and almost 800 websites covered the news that same day. On the second day, the story's engagement metrics dropped significantly, and by September 29, just three days in, they had all but vanished.

Figure 4. The life-cycle of a Spiderman story

Figure 5 shows that adding information at a later time does not help regain traffic. On November 11, 2019, the Ebola vaccine by Johnson & Johnson was approved. Our inventory showed that 17 websites published 34 related articles on that day, and user engagement metrics hit a peak. The news was first reported by a publisher focused on life science and medicine, which gained the most traffic. Two days later, on November 13, the World Health Organization officially approved the vaccine. Many mainstream publishers covered this news, and we observed an increase in inventory. However, this did not stimulate another engagement peak: traffic was mostly flat and almost vanished after seven days.

Figure 5. The life-cycle of the J&J Ebola vaccine story (panels (a) and (b))

Our data analysis suggests that ranking interventions can be effective only early in the life-cycle of a news story. This directly impacts the architecture and implementation of News Feed ranking, posing challenges to both signal computation and ranking deployment. Therefore, we focus only on news articles published within the last seven days. Our originality ranking intervention does not dramatically change users' News Feed experience because we do not alter the existing inventory of posts. However, the aggregate effect should reward publishers and reporters who produce thoughtful and original content.

4. Estimating News Originality

Intuitively, news originality refers both to the process by which news content is created and to the quality of the content. However, capturing these notions computationally appears challenging, especially when the content creation process remains opaque. Professional journalists and raters often find isolated text insufficient to rate originality and need additional context. Useful context includes ongoing news events and how much coverage they enjoy, as well as how a given news article is perceived by peers in the news ecosystem. A major precept of our work is that direct content analysis is neither sufficient nor necessary, whereas adequate context may provide sufficient signals to estimate originality.

To capture the context of individual news articles, we construct a news citation graph (Section 2.1) for the entire news inventory at a fixed time. Peer recognition of each article is evaluated using the PageRank algorithm on this graph. An original piece of news could be cited by different publishers; it could also be a local news story cited by a major publisher with many subsequent citations — both cases are captured adequately by PageRank. Here we emphasize the use of global PageRank values not restricted to particular news events. That is because quality articles often cite out-of-topic background material and may be cited under later news events.

We try to capture news ecosystem dynamics and emulate how professional raters or journalists estimate the level of news originality. However, PageRank values cannot be compared across topics and news events with very different amounts of coverage. For a given news event or news story, we consider the entire news coverage as a cluster. Our insight is that articles with the highest global PageRank values within each news-event cluster are most likely to be original. Hence, we estimate news originality by normalizing global PageRank scores within each cluster $C$:

$$O_t(a) = \frac{PR_t(a)^{\alpha}}{\sum_{b \in C(a)} PR_t(b)^{\alpha}} \qquad (2)$$

where $C(a)$ is the cluster of article $a$, and $\alpha \geq 1$ is a constant exponent. Increasing $\alpha$ favors articles with higher PageRank values.
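A minimal Python sketch of Equation (2), under our reconstruction of the normalization, is shown below; cluster_of and alpha are illustrative names rather than the production API.

def originality_scores(pagerank, cluster_of, alpha=1.0):
    # pagerank: article id -> global PageRank value PR_t(a)
    # cluster_of: article id -> news-event cluster id C(a)
    cluster_total = {}
    for article, pr in pagerank.items():
        c = cluster_of[article]
        cluster_total[c] = cluster_total.get(c, 0.0) + pr ** alpha
    # Each article's score is its (exponentiated) PageRank share within its cluster.
    return {article: (pr ** alpha) / cluster_total[cluster_of[article]]
            for article, pr in pagerank.items()}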

Our process of estimating news originality is shown in Figure 6. Notably, we cannot evaluate a newly published article for originality before peers cite it. This introduces a delay and requires a near real-time system to deliver originality scores early in the news cycle.

When using originality scores in News Feed ranking, we first convert them into P(original) as follows:

$$P(\text{original}) = \max\big(O_t(a) - \theta,\ 0\big) \qquad (3)$$

Here $\theta$ is the promotion threshold, i.e., only content with $O_t(a) > \theta$ can be promoted. Then, we add P(original) to the relevance score as a second-order term:

$$\text{score} = \big(w_1 P(\text{comment}) + w_2 P(\text{share}) + w_3 P(\text{like}) + w_4 P(\text{click}) + \cdots\big)\big(1 + w_5 P(\text{original})\big) \qquad (4)$$

Here P(comment), P(share), P(like), and P(click) are probabilities of the respective events for the news article in question, and the weights $w_i$ are chosen to maximize long-term user satisfaction. Clearly, our originality signal is just one component of News Feed ranking that elevates peer-recognized content. Other signals elevate other content types.
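The sketch below instantiates Equations (3) and (4) as reconstructed above; the threshold, the weights, and the exact second-order form are illustrative assumptions, not the production values.

def p_original(originality, theta=0.5):
    # Only articles whose originality score exceeds the promotion
    # threshold theta receive a positive P(original).
    return max(originality - theta, 0.0)

def relevance_score(p_events, weights, p_orig, w_orig=0.1):
    # First-order relevance: weighted sum of predicted event probabilities.
    first_order = sum(w * p for w, p in zip(weights, p_events))
    # P(original) enters as a second-order term scaling the first-order score.
    return first_order * (1.0 + w_orig * p_orig)

score = relevance_score(
    p_events=[0.05, 0.02, 0.30, 0.40],  # P(comment), P(share), P(like), P(click)
    weights=[1.0, 1.5, 0.5, 0.3],       # illustrative weights only
    p_orig=p_original(0.8))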

Figure 6. The workflow of our methodology.

5. Implementation and Scaling

Our preliminary investigation found that news articles highly cited by other articles tend to exhibit a higher level of originality. Therefore, we first build a citation graph of all news articles published in a seven-day window. Then, we calculate global PageRank values for individual articles, cluster news articles by news event/story in a scalable way, and normalize PageRank values within each cluster.

5.1. PageRank with Integrity Considerations

We index all news articles shared on the platform by leveraging the Facebook Crawler tool (https://developers.facebook.com/docs/sharing/webmasters/crawler), which crawls the HTML of an app or website shared on Facebook via a pasted link or a Facebook social plugin. Other open-source crawlers serve the same purpose; for instance, Common Crawl (https://commoncrawl.org/) is a well-maintained open repository of Web crawl data that can be accessed and analyzed by anyone.

We limit news articles in the graph to those posted within a seven-day moving window. After parsing the HTML, we traverse the output to collect all <a> tags, which define hyperlinks to other Web pages. Hyperlinks specified in <a> tags with different URLs may point to the same Web page. Therefore, we resolve all URLs to canonical URLs (https://developers.facebook.com/docs/sharing/webmasters/getting-started/versioned-link) and assign each news citation graph vertex a unique ID based on its canonical URL. If the cited Web page is also a recent news article, we establish an edge between the two news article vertices. With this news citation graph, we compute PageRank values for each news article.

The raw citation graph is vulnerable to link farming (Du et al., 2007); that is, the graph may be manipulated by changing the interconnected link structure of pages to add many inbound edges to a target page. To counter such manipulations, we disregard several types of citations before applying the PageRank algorithm. As shown in Figure 1, one typical example is self-linking edges that cite an article published by the same publisher. Some Web sites link their articles to other Web sites without real content that automatically redirect to phishing sites or simply return to the citing article. These integrity filters mitigate the risk of manipulation; a sketch of the filtered graph construction appears below. A filtered citation graph snapshot at each hour typically contains 300K–500K edges. News articles without incoming or outgoing citations are excluded from the PageRank computation. Despite their long history, attempts to manipulate PageRank in Web search have been successfully addressed (Google, 2016).
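In this simplified sketch of the graph-building step, the helper names canonical_url and publisher_of are hypothetical stand-ins for the crawler and URL-resolution services described above.

def build_citation_graph(articles, links, canonical_url, publisher_of):
    # articles: set of canonical URLs in the seven-day window
    # links: iterable of (citing_url, raw_cited_href) pairs from <a> tags
    edges = set()
    for citing, href in links:
        cited = canonical_url(href)
        if cited not in articles:
            continue  # the cited page is not a recent news article
        if publisher_of(citing) == publisher_of(cited):
            continue  # integrity filter: drop self-citation edges
        edges.add((citing, cited))
    return edges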

The original PageRank calculations work well with graphs that exhibit cycles, created when popular Web pages are revised to link to pages published later. Unlike the Web link graph, our news citation graph mostly contains links to past content since news posts on social networks are typically not revised. PageRank calculations simplify significantly on acyclic graphs and require a single linear-time graph traversal. However, in practice our citation graph contains enough cycles to question such simplifications.

5.2. News Event Clustering

We now outline our clustering technique. As explained in Section 4, we normalize the PageRank score of each news article using the PageRank scores of other articles in the same cluster. Intuitively, an important national news event and a local breaking news story might carry similar amounts of originality, but original articles in a larger cluster get more citations and higher PageRank scores. In addition to cluster normalization, computational scalability is also important: on an uneventful day, our inventory snapshot contains 2M-3M articles, and we strive to process them in minutes.

We estimate the topical similarity of articles based on their titles, noting that articles with identical titles may have different PageRank scores. We first lowercase article titles, remove punctuation, and hash the titles to assemble duplicates into mini-clusters. For each unique title, we calculate a vector embedding based on the powerful and adaptable BERT DNN (Section 2.2). In addition to handling synonyms and equivalent phrases well, BERT also supports transfer learning. To this end, we use the Siamese-twins network architecture shown in Figure 7, previously proposed for semantic similarity estimation (Reimers and Gurevych, 2019). The two article titles are processed by the two constituent BERT models, which we implement in PyTorch using HuggingFace transformers (Wolf et al., 2020). An additional layer on top of BERT is a 128-dimensional fully connected (FC) layer with an activation function. In Figure 7, $w_i$ and $w'_i$ represent the $i$-th tokens of the two input sentences. With the BERT network weights fixed, the top layer is trained on labeled article pairs using the cosine embedding loss function

$$L(u, v, y) = \begin{cases} 1 - \cos(u, v), & y = 1 \\ \max\big(0, \cos(u, v) - m\big), & y = -1 \end{cases} \qquad (5)$$

where $u$ and $v$ are the embeddings of the two input sentences and $m$ is a margin. The label $y = 1$ means the two sentences describe the same news event, while $y = -1$ means they describe completely different news events.

Figure 7. Estimating sentence similarity using pre-trained BERT networks (Reimers and Gurevych, 2019). The shared dense layer is trained.
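A minimal PyTorch sketch of this setup follows: a frozen BERT encoder shared by both titles, a trainable 128-dimensional dense layer, and the cosine embedding loss of Equation (5). The checkpoint name, pooling choice, and activation are assumptions made for illustration.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
for p in bert.parameters():
    p.requires_grad = False  # the BERT weights stay fixed

dense = torch.nn.Sequential(torch.nn.Linear(768, 128), torch.nn.Tanh())
loss_fn = torch.nn.CosineEmbeddingLoss()

def embed(titles):
    batch = tokenizer(titles, padding=True, truncation=True, return_tensors="pt")
    hidden = bert(**batch).last_hidden_state
    return dense(hidden.mean(dim=1))  # mean-pool token states, then project

u = embed(["Ebola vaccine approved"])
v = embed(["WHO approves Ebola vaccine"])
y = torch.tensor([1.0])  # 1: same news event, -1: different news events
loss = loss_fn(u, v, y)  # only the dense layer receives gradients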

BERT-based vector embeddings, optimized so that cosine similarity captures title similarity, support vector-based clustering algorithms. The choice of algorithm is driven by quality considerations and by the ability to process millions of titles in several minutes, which we need to ensure frequent refresh of the news originality signal (in the context of Section 3.2). Clustering algorithms based on K-Nearest-Neighbors (KNN) are a natural starting point, but specifying $k$ is not straightforward, and for any given $k$ such algorithms risk producing inconsistent results in our application. Therefore, our three-step clustering in Figure 8 combines text hashing and KNN with greedy local search. Topical clusters often contain just a few distinct titles, while national news receives up to thousands of citations per article.

Figure 8. Three-step clustering

The set of unique article vectors is converted into an undirected KNN graph $G$. For each vector, we find its $k$ nearest neighbors based on cosine similarity (1 - cosine distance) and use cosine similarity as the edge weight between adjacent vertices $u$ and $v$. Lightweight edges are ignored, and subgraphs are defined by the connected components of the resulting graph. Reasonable weight thresholds are found with a form of binary search guided by a subgraph size target; see details in Algorithm 1.

Input: Weighted graph G, subgraph target size s, edge-weight threshold w = 0.0, threshold increment Δ
Output: A set of subgraphs of approximately the target size
Function findSubgraphs(G, s, w, Δ):
       while G is not empty and w < 1.0 do
             G′ ← G without edges of weight < w
             foreach component C of connectedComponents(G′) do
                   if |C| ≤ s then
                         Output C as a subgraph
                         Remove the vertices of C and their incident edges from G
                   end if
             end foreach
             w ← w + Δ    (equivalently, tail-recurse: findSubgraphs(G, s, w + Δ, Δ))
       end while
       return connectedComponents(G without edges of weight < w)
end
Algorithm 1. Split a graph into subgraphs with target size
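For illustration, the first two clustering steps can be sketched in Python as follows, using scikit-learn and SciPy; the values of k and the similarity threshold are placeholders rather than our production settings.

import re
from sklearn.neighbors import NearestNeighbors
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def title_miniclusters(titles):
    # Step 1: lowercase, strip punctuation, and hash titles to group duplicates.
    buckets = {}
    for i, t in enumerate(titles):
        key = re.sub(r"[^\w\s]", "", t.lower())
        buckets.setdefault(key, []).append(i)
    return list(buckets.values())

def knn_components(embeddings, k=10, min_similarity=0.8):
    # Step 2: build a KNN graph with cosine-similarity edge weights,
    # drop lightweight edges, and take connected components.
    nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(embeddings)
    dist, idx = nn.kneighbors(embeddings)
    rows, cols, vals = [], [], []
    for i in range(len(embeddings)):
        for d, j in zip(dist[i][1:], idx[i][1:]):  # skip the self-neighbor
            similarity = 1.0 - d
            if similarity >= min_similarity:
                rows.append(i); cols.append(j); vals.append(similarity)
    n = len(embeddings)
    graph = csr_matrix((vals, (rows, cols)), shape=(n, n))
    _, labels = connected_components(graph, directed=False)
    return labels  # component id per title, refined further by Algorithm 2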

An investigation of typical outputs of Algorithm 1 suggested that clusters were generally reasonable, but local news and events with low coverage were not handled well. To remedy this deficiency, we form local clusters using greedy optimization that maximizes the total edge weight inside clusters. We impart a default negative weight $w_{neg}$ to pairs of vertices within a top-down cluster that are not connected by edges (i.e., not nearest neighbors). The smaller $w_{neg}$ is, the harder it is to create subclusters. For details, see Algorithm 2.

Example 5.1.

Figure 8 illustrates two local clusters in a subgraph. The total weight of cluster 1 counts its internal edges plus the negative weight $w_{neg}$ for each internal vertex pair not connected by an edge; the same holds for cluster 2. Although the two clusters are connected, the weight of the connecting edge is so low that merging a vertex from cluster 2 into cluster 1 would not increase cluster 1's total weight, and vice versa. Therefore, local clustering produces two clusters.

Input: Weighted graph G, negative weight w_neg for missing edges, number of independent randomized passes p
Output: An integer for each vertex (its cluster assignment)
repeat p times
       Randomize the order of vertices in G
       Initialize each vertex in its own cluster
       foreach vertex v ∈ G do
             foreach cluster C adjacent to v do
                   Try moving v from its current cluster to C
                   Add up the internal weights of both affected clusters
                   Record the move with the highest sum of weights seen
             end foreach
             Move v so as to maximize the sum of weights of the two clusters
       end foreach
       if the total weight increased then
             Repeat the pass over all vertices
       else
             Record the solution with the highest total weight seen
       end if
until done
Algorithm 2. Greedy local clustering

5.3. Scalability

Building and processing the KNN graph with $k$ nearest neighbors per vertex is a major performance bottleneck. On a typical day, all news articles from the last week fit in the RAM of a single server and can be processed reasonably quickly. However, this architecture is insufficiently scalable, for the following reasons.

  • Potential surges of the news inventory during election seasons, New Year's Eve, etc.

  • Near real-time processing benefits from additional compute resources (lower processing latency through the use of multiple servers).

  • The need to scale to a larger content inventory. The challenge we are solving and our methods are fairly general, so they can be applied to other social-network platforms that value originality. Now or in the future, such platforms may operate at a much larger scale of content inventory.

The overall design described in Section 5.2 naturally supports distributed processing to ensure greater overall scalability and robustness to surges. In fact, this is why Algorithm 1 performs balanced partitioning. Our implementation supports distributed clustering as well. We found that the upper bound on single-server capacity is an important parameter: individual servers must receive a sufficient amount of work to justify distributed processing, but the data must fit into available RAM. Between the implied lower and upper bounds, there is a transition point where one can reduce the amount of computation at the cost of greater processing latency.

6. Evaluation and Deployment

Before deploying our news originality signal to production at Facebook, we evaluate its functional components individually, evaluate the entire signal with the help of professional raters, then embed the signal into News Feed ranking and explore examples to check that everything works as expected. The production deployment is evaluated with an industry-standard technique: an A/B test on live data with a limited subset of users before the signal is enabled for the main group of users (Xu et al., 2015).

6.1. Evaluation of Embeddings and Clustering

In our rating flow, we ask professional raters to review pairs of news articles. The raters assign one of four similarity levels to each pair: different subjects; different subjects but some common content; same subject with different aspects; and same subject (the four levels are explained in Table 2). For training, we collect 100K pairs of randomly sampled English news titles, using 40% for fine-tuning, 10% for validation, and 50% for testing. Separately, we collect another 10K pairs of news articles to evaluate clustering performance. To sample likely-positive examples, we take some number of closest neighbors in terms of document embeddings and/or text similarity. Likely-negative samples are drawn from further-away neighbors that are still sufficiently close to make the labeling task nontrivial.

Score | Rating | Criteria
0.0 | different subjects | the two articles cover completely different subjects
1.0 | different subjects / some commonality | the two articles cover different subjects but share some content
2.0 | same subject / different aspects | the two articles cover the same subject but report different aspects of the story
3.0 | same subject | the two articles cover the same subject
Table 2. Guidelines for rating the similarity of article pairs

To compare our vector embeddings with FastText (Joulin et al., 2016) and PyTorch-BigGraph (Lerer et al., 2019) embeddings, we represent similarity levels numerically as 0.0, 1.0, 2.0, and 3.0 during training, following Table 2. During evaluation, we binarize model scores at the thresholds 0.5, 1.5, and 2.5, then use ROC AUC as the evaluation metric; for example, AUC@0.5 treats article pairs rated above 0.5 as positives. Table 3 describes the performance of our BERTPairwise model, which consistently outperforms the pre-trained state-of-the-art embeddings.
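The evaluation can be sketched as follows, assuming human similarity labels in {0, 1, 2, 3} and model-predicted similarity scores for the same pairs; scikit-learn's roc_auc_score is applied at each cut point.

from sklearn.metrics import roc_auc_score

def auc_at_thresholds(labels, scores, thresholds=(0.5, 1.5, 2.5)):
    # Binarize the rated similarity level at each threshold, then measure
    # how well the model scores rank positive pairs above negative ones.
    return {t: roc_auc_score([int(y > t) for y in labels], scores)
            for t in thresholds}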

Model | AUC@0.5 (%) | AUC@1.5 (%) | AUC@2.5 (%)
FastText | 80.20 | 83.89 | 89.66
BigGraph | 82.95 | 84.87 | 89.61
BERTPairwise | 83.66 | 88.67 | 96.13
Table 3. Our pairwise embedding vs. FastText (Joulin et al., 2016) and PyTorch-BigGraph (Lerer et al., 2019) embeddings

To evaluate our news-event clustering against human labels, we randomly sample 10K pairs of English news articles from the candidate pool and send the pairs to professional annotators, along with the guidelines in Table 2. Then, we apply the clustering algorithms to the entire candidate pool. For each sampled pair, if the two articles appear in the same cluster, the predicted label is positive; otherwise, it is negative. The clustering algorithm is evaluated by precision and recall, then compared with two well-known algorithms in Table 4. DBSCAN (density-based spatial clustering of applications with noise) (Ester et al., 1996; Schubert et al., 2017) is a highly scalable density-based algorithm. The Louvain algorithm (Blondel et al., 2008) is one of the fastest and best-known community detection algorithms for large networks.

Algorithm | Precision (%) | Recall (%)
DBSCAN | 43.07 | 73.04
Louvain | 81.01 | 47.57
Stage 1 + Louvain | 81.85 | 32.63
Three-step clustering | 83.73 | 45.33
Table 4. The performance of three-step clustering vs. DBSCAN (Ester et al., 1996) and the Louvain algorithm (Blondel et al., 2008)

6.2. Evaluation by Professional Raters

To assess the accuracy of our citation-score signal, we sample the most-viewed news articles identified as original, and the most-viewed articles not identified as original, from the most-viewed news domains over a seven-day period. Our professional raters have many years of news-industry experience and follow a deliberate process to ensure a fair judgement for each article, rating each on a three-point scale of news originality (Table 5). For articles rated 3.0, our predicted labels match the raters' results 90% of the time. In other words, our signal attains 90% accuracy in identifying original news.

Score | Rating | Criteria
1.0 | unoriginal | borrows most of its content and language from other sources, or is extremely thin / low-information overall; includes anything that is not properly syndicated
2.0 | possibly/somewhat unoriginal | rewords borrowed content in its own language, but >70% is borrowed OR properly syndicated
3.0 | fully original | not a syndicated republishing; little to no content is borrowed
Table 5. Originality rating guidelines for human raters

6.3. An Illustrative Example

Besides the quantitative evaluation, we also performed qualitative case studies. Here we describe one example that illustrates how our system works. On January 26, 2020, an article about the death of Kobe Bryant in a Calabasas helicopter crash was first reported by the publisher TMZ (https://www.tmz.com). Within 10 minutes, many publishers covered this story and cited TMZ. Over 200 articles fell into this news-event cluster, and the original story by TMZ ranked the highest. For such events, users see news articles posted by the news pages they follow and shared by their friends. If the original news article is in a user's feed inventory, it gets prioritized. Note that our originality signal is only one component of the ranking formula; users with preferences for certain publishers or strong affinity with their friends continue to see articles shared by those actors.

6.4. Production Deployment and Evaluation

The originality signal feeds into the relevance score calculation (see Figure 2 and Equation 4) to increase the distribution of original news articles. To ensure its availability early in the news cycle, it is recalculated from scratch on an hourly basis. Building the news citation graph and news clusters takes only a few minutes, but system bottlenecks are observed in our current crawling infrastructure and in generating vector embeddings. In practice, it takes time for original articles to get cited, but running the workflow more often could find and promote original articles earlier. Such improvements are likely with further infrastructure optimization.

Before making the proposed changes to News Feed ranking at Facebook, we consulted with the academic and publishing communities and performed a careful empirical evaluation. In particular, we ran an A/B test on live data for several weeks, where the control group used prior production ranking rules and a small test group used the revised ranking rules (Xu et al., 2015). To estimate impact, we computed the increase in view counts at different thresholds (Table 6) and found that our technique works well across thresholds. We did not observe statistically significant deteriorations in our proprietary metrics (Mosseri, 2018a; Facebook, 2019; Lada et al., 2021) during the A/B test or after the subsequent full product launch. After additional checks and consultations, our signal was enabled for English-language content within Facebook's News Feed ranking system for most users in June 2020 (Brown and Levin, 2020).

Publishers may try to manipulate our news originality signal. However, PageRank can be protected from abuse (Google, 2016), whereas Facebook's integrity monitoring and enforcement (Halevy et al., 2020) has a particular focus on coordinated inauthentic behavior (Facebook, 2021a).

Originality threshold | Increase in num. views (%)
0.4 | 15.36
0.5 | 14.72
0.6 | 14.30
0.7 | 13.83
0.8 | 13.38
Table 6. User engagement lift from promoting original news

7. Conclusions and Perspectives

In this paper, we introduce a strategy to prioritize original news in social networks. This strategy computes PageRank scores of news articles and estimates originality by normalizing PageRank scores for each news event. Equation 2 is a particularly novel contribution.

We deployed the originality signal in Facebook's personalized News Feed, which compiles articles from sources followed by the user and the user's friends (Mosseri, 2018b; Facebook, 2021b; Lada et al., 2021; Facebook, 2019; Mosseri, 2018a; Ni et al., 2019). When multiple articles are available in a user's inventory, we promote the more original ones. While subtle, such changes influence what the community sees. As part of our work, we performed conceptual, qualitative, and quantitative evaluations to confirm that our techniques positively impact the news ecosystem. In particular, the exposure of original content has grown, and users received more content they liked. Over a longer timeframe, these developments should encourage publishers to invest more in original content.

Acknowledgements.
We would like to thank Jon Levin, Gabriella Schwarz, Lucas Adams, David Vickrey, Xiaohong Zeng, Joe Isaacson, Gedaliah Friedenberg, Pengfei Wang, Feng Yan, Jerry Fu, Songbin Liu, Yan Qi, Ranjan Subramanian, Adrian Le Pera, Vasu Vadlamudi, Julia Smekalina and others who supported and collaborated with us throughout.

References

  • V. D. Blondel, J. Guillaume, R. Lambiotte, and E. Lefebvre (2008) Fast unfolding of communities in large networks. J. Statistical Mechanics: Theory and Experiment 2008 (10), pp. 10008. Cited by: §6.1, Table 4.
  • S. Brin and L. Page (1998) The anatomy of a large-scale hypertextual web search engine. Computer Networks 30, pp. 107–117. External Links: Link Cited by: §2.1.
  • C. Brown and J. Levin (2020) Facebook Newsroom. External Links: Link Cited by: §6.4.
  • S. Cresci, R. Di Pietro, M. Petrocchi, A. Spognardi, and M. Tesconi (2015) Fame for sale: efficient detection of fake twitter followers. Decision Support Systems 80, pp. 56–71. Cited by: §2.1, §2.1.
  • S. Cresci, R. Di Pietro, M. Petrocchi, A. Spognardi, and M. Tesconi (2017) The paradigm-shift of social spambots: evidence, theories, and tools for the arms race. In Proc. WWW, Perth, Australia, pp. 963–972. Cited by: §2.1, §2.1.
  • G. De Francisci Morales, A. Gionis, and C. Lucchese (2012) From chatter to headlines: harnessing the real-time web for personalized news recommendation. In Proc. WSDM, Washington, USA, pp. 153–162. Cited by: §2.3.
  • G. M. Del Corso, A. Gulli, and F. Romani (2005) Ranking a stream of news. In Proc. WWW, Chiba, Japan, pp. 97–106. Cited by: §2.1, §2.3.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 17th NAACL, Minneapolis, Minnesota, pp. 4171–4186. External Links: Link, Document Cited by: §1, §2.2.
  • Y. Du, Y. Shi, and X. Zhao (2007) Using spam farm to boost pagerank. In Proc. Intl. Workshop on Adversarial Information Retrieval on the Web, Banff Alberta, Canada, pp. 29–36. Cited by: §5.1.
  • M. Ester, H. Kriegel, J. Sander, X. Xu, et al. (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. KDD, Vol. 96, Portland, Oregon, USA, pp. 226–231. Cited by: §6.1, Table 4.
  • Facebook (2019) Facebook Newsroom. External Links: Link Cited by: 3rd item, §1, §3.1, §6.4, §7.
  • Facebook (2021a) Facebook Newsroom. External Links: Link Cited by: §6.4.
  • Facebook (2021b) Facebook Business Help Center. External Links: Link Cited by: §1, §7.
  • Google (2016) Google Search Central Blog. External Links: Link Cited by: §5.1, §6.4.
  • R. Gwadera and F. Crestani (2009) Mining and ranking streams of news stories using cross-stream sequential patterns. In Proc. ACM Conference on Information and Knowledge Management, Hong Kong, China, pp. 1709–1712. Cited by: §2.3.
  • A. Halevy et al. (2020) Preserving integrity in online social networks. In Proc. KDD, USA; arXiv:2009.10311. Cited by: §2.3, §6.4.
  • Y. Hu, M. Li, Z. Li, and W. Ma (2006) Discovering authoritative news sources and top news stories. In Asia Information Retrieval Symposium, Beijing, China, pp. 230–243. Cited by: §2.3.
  • A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, and T. Mikolov (2016) FastText.zip: compressing text classification models. Cited by: §6.1, Table 3.
  • N. Kanhabua, R. Blanco, and M. Matthews (2011) Ranking related news predictions. In Proc. Intl. ACM SIGIR Conference on Research and Development in Information Retrieval, Beijing, China, pp. 755–764. Cited by: §2.3.
  • A. Lada, M. Wang, and T. Yan (2021) Facebook. External Links: Link Cited by: §2.3, §6.4, §7.
  • J. Lee et al. (2020) BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36 (4), pp. 1234–1240. Cited by: §2.2.
  • A. Lerer, L. Wu, J. Shen, T. Lacroix, L. Wehrstedt, A. Bose, and A. Peysakhovich (2019) Pytorch-biggraph: a large-scale graph embedding system. Proc. ML Sys Conference 1. Cited by: §6.1, Table 3.
  • A. Mosseri (2018a) Facebook Newsroom. External Links: Link Cited by: §2.3, §6.4, §7.
  • A. Mosseri (2018b) Facebook Inc.. External Links: Link Cited by: §1, §2.3, §7.
  • X. Ni et al. (2019) Feature selection for Facebook feed ranking system via a group-sparsity-regularized training algorithm. In Proc. 28th ACM Intl. CIKM, Beijing, China, pp. 2085–2088. Cited by: §2.3, §7.
  • L. Page, S. Brin, R. Motwani, and T. Winograd (1999) The pagerank citation ranking: bringing order to the web.. Technical report Stanford InfoLab. Cited by: §2.1.
  • S. Reddy, D. Chen, and C. D. Manning (2019) Coqa: a conversational question answering challenge. Trans. ACL 7, pp. 249–266. Cited by: §2.2.
  • N. Reimers and I. Gurevych (2019) Sentence-bert: sentence embeddings using siamese bert-networks. In Proc. of EMNLP, Hong Kong, China, pp. 3982–3992. External Links: Link Cited by: §2.2, Figure 7, §5.2.
  • J. Reis, F. Benevenuto, P. O. de Melo, R. Prates, H. Kwak, and J. An (2015) Breaking the news: first impressions matter on online news. Cited by: §1, §2.3.
  • E. Schubert, J. Sander, M. Ester, H. P. Kriegel, and X. Xu (2017) DBSCAN revisited, revisited: why and how you should (still) use dbscan. ACM Trans. on Database Systems (TODS) 42 (3), pp. 1–21. Cited by: §6.1.
  • A. Tatar, P. Antoniadis, M. D. De Amorim, and S. Fdida (2014) From popularity prediction to ranking online news. Social Network Analysis and Mining 4 (1), pp. 174. Cited by: §2.3.
  • T. Wolf et al. (2020) Transformers: state-of-the-art natural language processing. In Proc. of EMNLP, Online, pp. 38–45. External Links: Link Cited by: §5.2.
  • Y. Xu, N. Chen, A. Fernandez, O. Sinno, and A. Bhasin (2015) From infrastructure to culture: A/B testing challenges in large scale social networks. In Proc. KDD, pp. 2227–2236. https://doi.org/10.1145/2783258.2788602. Cited by: §1, §6.4, §6.
  • J. Ye and S. Skiena (2019) MediaRank: computational ranking of online news sources. In Proc. KDD, Anchorage, AK, USA, pp. 2469–2477. Cited by: §2.1, §2.1, §2.3.
  • A. X. Zhang et al. (2018) A structured response to misinformation: defining and annotating credibility indicators in news articles. In Proc. WWW, Lyon, France, pp. 603–612. Cited by: §2.1, §2.3.
  • G. Zheng et al. (2018) DRN: a deep reinforcement learning framework for news recommendation. In Proc. WWW, Lyon, France, pp. 167–176. Cited by: §2.3.
  • Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler (2015) Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In Proc. IEEE Intl. Conf. Computer Vision, Santiago, Chile, pp. 19–27. Cited by: §2.2.