Garbage, Glitter, or Gold: Assigning Multi-dimensional Quality Scores to Social Media Seeds for Web Archive Collections

07/06/2021 · by Alexander C. Nwala, et al. · Old Dominion University · Indiana University

From popular uprisings to pandemics, the Web is an essential source consulted by scientists and historians for reconstructing and studying past events. Unfortunately, the Web is plagued by reference rot, which causes important Web resources to disappear. Web archive collections help reduce the costly effects of reference rot by saving Web resources that chronicle important stories and events before they disappear. These collections often begin with URLs called seeds, hand-selected by experts or scraped from social media. The quality of social media content varies widely; therefore, we propose a framework for assigning multi-dimensional quality scores to social media seeds for Web archive collections about stories and events. We leveraged contributions from social media research for attributing quality to social media content and users based on credibility, reputation, and influence. We combined these with additional contributions from Web archive research, which emphasizes the importance of considering geographical and temporal constraints when selecting seeds. Next, we developed the Quality Proxies (QP) framework, which assigns seeds extracted from social media a quality score across 10 major dimensions: popularity, geographical, temporal, subject expert, retrievability, relevance, reputation, and scarcity. We instantiated the framework and showed that seeds can be scored across multiple QP classes that map to different policies for ranking seeds, such as prioritizing seeds from local news, reputable and/or popular sources, etc. The QP framework is extensible and robust. Our results showed that Quality Proxies resulted in the selection of quality seeds with increased precision (by ~0.13) when novelty is and is not prioritized. These contributions provide an explainable score applicable to ranking and selecting quality seeds for Web archive collections and other domains.

1. Introduction and Background

Figure 1. Web archive collections such as the Coronavirus collection (right image) begin with URLs called seeds or seed URLs (red annotations), often scraped from social media posts (left image).
#   Quality Proxy          Class          Acronym(s)
1   Posts Popularity       Popularity     rp, sh, and lk
2   Author Popularity      Popularity     -
3   Domain Popularity      Popularity     -
4   Geographical-author    Proximity      -
5   Geographical-domain    Proximity      -
6   Temporal               Proximity      -
7   Subject expert         Proximity      -
8   Retrievability         Proximity      -
9   Relevance              Proximity      rl
10  Reputation-broad       Uncategorized  -
11  Reputation-narrow      Uncategorized  -
12  Scarcity               Uncategorized  -
Table 1. Summary of the Quality Proxies (QPs) framework for assigning multi-dimensional QP scores to social media seeds for Web archive collections. A single QP, belonging to the popularity, proximity, or uncategorized class, assigns some quality trait to a seed, while different combinations of QPs express different notions of quality (post popularity, reputation, local authority, etc.). For each QP we provide an instantiation (Section 3) that approximates the QP class. The transparency and flexibility of the QP framework allow the user to extend and/or replace the instantiations of the QPs.

On March 11, 2020, the World Health Organization declared the 2020 Coronavirus outbreak a pandemic. One week later, Archive-It (https://archive-it.org/), an organization founded by the Internet Archive — the largest public Web archive (https://archive.org/) — sent out a tweet (https://twitter.com/archiveitorg/status/1240361850736381952) requesting that social media users contribute URLs about Coronavirus for preservation. It is important to preserve webpages chronicling important events such as the 2020 Coronavirus pandemic because, according to SalahEldeen and Nelson, 11% of Web resources shared on social media are lost after the first year of publication (SalahEldeen and Nelson, 2012), so we run the risk of losing a portion of our collective digital heritage if they are not preserved. The Internet Archive (IA) was founded in 1996, and since then it has been archiving the Web by collecting and saving public webpages. This is based on a simple idea: an archived copy of a webpage may be viewed in place of a lost original copy, but this is only possible if the original webpage was saved. Archive-It is a service created by the Internet Archive where individuals and institutions create Web archive collections (e.g., Fig. 1, right) that preserve webpages and their URLs about a particular topic (e.g., 2020 Coronavirus). These collections begin with the selection of an initial list of URLs called seed URLs or seeds, which are crawled as part of the preservation process called Web archiving.

Following the occurrence of major news events, various organizations collect and save webpages about the events before they are lost due to reference rot (Klein et al., 2018; Zittrain et al., 2014; Bar-Yossef et al., 2004). For example, archivists at the National Library of Medicine (NLM) saved seed URLs (National Library of Medicine, 2014) during the 2014 Western African Ebola Virus Outbreak. The NLM Ebola virus Web archive collection includes websites of organizations, journalists, healthcare workers, and scientists related to the 2014 Ebola virus discourse. Similarly, archivists at Michigan State University saved webpages (Michigan State University, 2016) chronicling the Flint Water Crisis story. Unfortunately, we do not have enough curators to collect seeds amidst an abundance of local and global events, primarily because it is time-consuming to collect seeds manually, and collecting quality seeds requires domain expertise about the topic, which imposes an additional burden on curators. Consequently, organizations and researchers scrape social media posts for seeds (e.g., Fig. 1, left) to cope with the shortage of curators with domain expertise (Yang et al., 2012; Priyatam et al., 2014). While social media offers a cheap method of crowd-sourcing domain expertise, the quality of social media content varies widely. Selecting quality seed URLs from social media is challenging and has not been extensively studied in the Web archiving community, which acknowledges the importance of selecting good seeds but often pays more attention to the mechanics of building collections. The challenge of selecting quality seeds is embodied in the idea that it is difficult to define "quality," which can be subjective and is approximated with various metrics that are sensitive to relevance, popularity, or reputation.

We developed the Quality Proxies (QPs) framework, which generates a multi-dimensional quality score for seeds. A single QP assigns some quality trait to a seed, while different combinations of QPs express different notions of quality (post popularity, reputation, local authority, etc.) that are used to score seeds and select those that exceed a user-defined threshold. The Quality Proxies framework was inspired by social media research for attributing quality to social media content and users based on credibility, reputation, and influence (Castillo et al., 2011; Pal and Counts, 2011; Canini et al., 2011). It was additionally informed by Web archive research, which emphasizes the importance of geographical and temporal constraints when selecting seeds (Risse et al., 2014; Nwala et al., 2017), and consists of three main classes (popularity, proximity, and uncategorized) that sub-divide into the additional classes enumerated in Table 1.

Given that Quality Proxies is a framework, we additionally instantiated it with metrics (Table 1) that approximate each class. The transparency and flexibility of the framework mean the user can extend it and/or replace a QP instantiation. The quality score of a seed can be assigned by extracting Quality Proxy metrics across all classes, or a subset of classes, as input to a quality score function (Eqn. 2), and selecting seeds that exceed a threshold.

Our contributions are as follows. First, Quality Proxies provide a flexible means of encoding multiple definitions of quality instead of a single definition based on relevance or popularity. By providing a multi-dimensional framework, a curator can use the Quality Proxies to score and select not just popular or relevant seeds but also seeds from reputable sources, local news organizations, popular residents of a local community, hard-to-find seeds, etc. This is because Quality Proxies independently behave like letters of an alphabet that can be combined in different ways to provide various policies for scoring seeds. Second, the QP framework is robust, enabling the assignment of quality scores even when a subset of the metrics is absent. Third, we instantiated each QP with metrics that approximate it. Fourth, the QP framework and quality score do not function as black boxes and thus produce explainable scores. Consequently, the QP instantiations can be critiqued, extended, or replaced if the requirements of the user demand it. Fifth, we compared seeds from Twitter Micro-collections that were scored/selected with QPs against seeds collected by human experts and scraped from Google (also scored/selected with QPs). Micro-collections (e.g., threaded conversations from single/multiple users) are social media posts that contain URLs and that are gathered by humans (as opposed to search engines) as a demonstration of domain expertise and editorial activity (Nwala et al., 2019). Our evaluation results showed that QPs resulted in the selection of quality seeds with increased precision (by ~0.13) when novelty is and is not prioritized. Our code and evaluation dataset (generated between 2014 and 2020) are publicly available (Nwala, 2021). The dataset is comprised of 1,552 seeds from reference collections (Google and manual selection by experts) and 2,027 seeds from 4,209 tweets.

2. Related work

The goal of determining the quality of URLs found in social media posts is one shared by Web archivists (Section 2.1) and social media researchers (Sections 2.2 and 2.3).

2.1. Seed selection and quality assessment

Risse et al. (Risse et al., 2014) addressed the problem of determining attributes indicative of quality for Web archive seeds in the digital humanities domain by surveying scientists in the social sciences, historical sciences, and law. Among other recommendations, they proposed that seeds should cover the evolution (topical dimension) of an event over time (temporal dimension), as opposed to a time or topic slice, which gives an incomplete picture. Nwala et al. (Nwala et al., 2017) proposed extracting seeds from local news sources for local events, showing that collections built from local news articles produced older and less exposed stories than their non-local counterparts. These contributions from the Web archiving community inform the Quality Proxies' proximity classes (e.g., geographical and temporal).

2.2. Content credibility and fake news detection

There are many studies that propose methods for assessing the credibility of information on social media platforms such as Twitter. These mostly focus on the content (e.g., text) of the social media posts and not the URLs (seeds) found in the posts, which is the focus of the QP framework. However, we posit that the quality of a seed URL can be approximated by the quality of the social media post that embeds it, and thus the following research is relevant to the QPs. Castillo et al. (Castillo et al., 2011) adopted Merriam-Webster's definition of credibility ("offering reasonable grounds for being believed") and automatically assessed the credibility of a given set of tweets, classifying them as credible or not credible based on features extracted from the tweet content, the posting user, and the topic. Similarly, to help extract credible tweets from a flood of tweets triggered by a major news event such as a natural disaster, Gupta and Kumaraguru (Gupta and Kumaraguru, 2012) identified content-based (frequency of unique characters, swear words, etc.) and user-based features (e.g., number of followers) to train a supervised machine learning and relevance feedback system. Their analysis of tweets posted about 14 high-impact events in 2011 showed that on average 30% of tweets contained situational information about the event, 14% were spam, and 17% contained situational awareness information that was credible.

Bozarth and Budak (Bozarth and Budak, 2020) demonstrated the importance of an evaluation framework for fake news detection to complement traditional evaluation metrics like F1 and precision. They used error analysis to show that classifiers' performance varied depending on multiple factors, including the choice of dataset and how the training data is split (e.g., 5-fold, 80/20). Shu et al. (Shu et al., 2020) approached fake news detection by studying the pattern of how news spreads on social media, from news publishers to social media posts that (re)tweet content or conversation threads between accounts. Their experiments showed that their multi-level propagation network approach for fake news detection outperformed state-of-the-art fake news detection methods by at least 1.7%, with an average F1 > 0.84. Similarly, Bal et al. (Bal et al., 2020) proposed an attention-based deep learning model that first identified tweets about the causes or cures of cancer and subsequently labeled those that spread misinformation.

The Quality Proxies framework includes the reputation class, which approximates the credibility of the domain of seed URLs found in social media posts. In this research, we did not develop a method for directly assigning credibility scores to seed domains but instead approximated them (Section 4.5) by counting how often the domains were cited by Wikipedia editors. However, the user of the framework could replace our instantiation of the reputation QP with a different method, such as those discussed in this section.

2.3. Ranking social media users and/or content

Popularity is widely used as a proxy for quality and credibility (Ciampaglia et al., 2018). However, algorithms that use popularity for ranking can be exploited (Ferrara et al., 2016; Ratkiewicz et al., 2011). Consequently, Abbasi and Liu (Abbasi and Liu, 2013) introduced the CredRank algorithm for ranking social media users by a credibility score determined by their online behavior. CredRank first attempts to detect coordinated accounts that artificially inflate the popularity of some content. Next, it suppresses the votes of the culprits in order to give preference to independently popular (credible) content. Similarly, Pal and Counts (Pal and Counts, 2011) addressed the task of identifying social media users who are authorities for a given topic by proposing a set of features for characterizing social media authors, such as original tweets, conversational tweets, repeated tweets, mentions, etc. Next, using probabilistic clustering over these features, they ranked users in order to identify authorities.

Agichtein et al. (Agichtein et al., 2008) explored using community feedback to identify high-quality content on question/answering social media platforms. They proposed a graph-based model of contributor relationships and combined it with content and usage statistics to identify quality questions and answers, applying it to Yahoo! Answers. Becker et al. (Becker et al., 2011) explored centrality-based approaches (Centroid, LexRank, and Degree) for identifying high-quality text content from tweets about events. They defined high-quality tweet text as that which contains relevant information (event time, location, participants, opinions) and is most representative of an event. Their results showed that the Centroid approach selected tweets most related to an event.

Bian et al. (Bian et al., 2009) addressed quantifying the quality of content by combining content quality estimation with user reputation estimation in order to identify quality content, improving the accuracy of search over community question answering archives above state-of-the-art methods. Similarly, Canini et al.'s (Canini et al., 2011) investigation into the factors that affect the credibility of users on social media led to the conclusion that both the topical content of information sources and the social network structure affect credibility. This conclusion led them to design a method that automatically identifies and ranks Twitter users according to their relevance and expertise for a given topic.

Similar to the reputation class, the QP framework includes the subject-expert class, which approximates the subject expertise of the domain (e.g., cdc.gov) of a seed URL found in a social media post. We did not develop a method for directly assigning subject-expertise scores to seed domains, but instead approximated them (Section 4.3) with search engines such as Google. However, the user of the framework could replace our instantiation of the subject-expert QP with a different method, such as those discussed in this section.

3. The Quality Proxies Framework

The seed URL quality problem is not unique to social media. A Search Engine (SE) must return a small list of URLs (from possibly millions of candidates) to fulfill an informational request encoded in a search query. It starts by identifying relevant pages — relevance is a proxy for quality — but goes beyond relevance to rank webpages with a preference for popular webpages. In summary, SEs use popularity as one method to approximate quality. This is reasonable, since one can argue that popularity is the reward for quality. Ciampaglia et al. (Ciampaglia et al., 2018) argue that measures such as the citation rates of scientific papers, the number of downloads of a song, or the number of social media followers are often used in the absence of measurable notions of quality. Additionally, the goal of algorithms that favor popular items "is to identify high-quality items such as reliable news, credible information sources, and important discoveries - in short, high-quality content should rank at the top."

However, popularity does not always mean quality, since popularity can be exploited by fake reviews, social bots, and astroturf campaigns (Ferrara et al., 2016; Ratkiewicz et al., 2011). In SEs, the use of popularity in ranking algorithms has been alleged to reduce novelty, a problem that could, however, be mitigated by diverse user queries (Fortunato et al., 2006). Consequently, we argue that popularity is not sufficient as a QP and explore additional non-popularity-based QPs (e.g., proximity and reputation, Section 4). Nonetheless, popularity remains one effective proxy for quality, and as such is included in our QP framework.

In this section, we introduce the classes (e.g., popularity) that make up the QP framework and the metrics (e.g., reply, share, and like counts) used to instantiate them. For each seed URL, the values (all normalized between 0 and 1) of each metric populate the seed QP vector, which holds the multi-dimensional quality score of the seed.
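To make the vector concrete, here is a minimal sketch (in Python, not the authors' released code) of a seed's QP vector; the metric names and values are purely illustrative, and since the framework is robust to missing dimensions, absent metrics can simply be omitted.

```python
# Illustrative QP vector for one seed: every metric normalized to [0, 1].
# Keys are descriptive stand-ins for the paper's QP metrics.
seed_qp_vector = {
    "post_replies":   0.10,  # rp
    "post_shares":    0.35,  # sh
    "post_likes":     0.20,  # lk
    "author_pop":     0.60,
    "domain_pop":     0.75,
    "geo_author":     0.40,
    "geo_domain":     0.55,
    "temporal":       0.80,
    "subject_expert": 0.70,
    "retrievability": 0.25,
    "relevance":      0.90,
    "rep_broad":      0.74,
    "rep_narrow":     0.02,
    "scarcity":       0.95,
}
```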

3.1. Popularity-based Quality Proxy classes

There are generally two approaches to quantifying the popularity of URLs. The first, a computationally expensive link-based approach (e.g., PageRank) (Cho et al., 1998), utilizes the link structure of the Web to assign weights to webpages. We adopted the second, less computationally expensive approach, which leverages social media post statistics to assign popularity scores to URLs found in social media posts. Social media platforms often keep statistics that track the number of times a post is shared (a "retweet" on Twitter), liked, or replied to. Transitively, the popularity of URLs from social media posts can be derived from these post statistics (Gupta and Kumaraguru, 2012; Duan et al., 2010; Nagmoti et al., 2010) and also used to rank posts.

3.2. Post popularity Quality Proxy classes

The post popularity classes assign popularity to a seed URL by quantifying the popularity of the post(s) containing the URL. We instantiated them with metrics that count how many people replied to (rp), shared (sh), and liked (lk) a social media post. All of these counts are normalized to [0, 1] in the QP vectors for seeds.
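As a sketch, the normalization can be implemented by dividing each raw count by the maximum count observed across the set of posts, which is consistent with the normalized vs. raw values shown later in Table 2; the counts below are made up.

```python
def normalize(values):
    """Normalize raw counts to [0, 1] by dividing by the maximum
    (normalization scheme inferred from Table 2's normalized/raw columns)."""
    top = max(values)
    return [v / top if top > 0 else 0.0 for v in values]

# Hypothetical raw post statistics (replies, shares, likes) for three posts
replies, shares, likes = [2, 40, 13], [10, 300, 97], [25, 900, 410]
rp, sh, lk = normalize(replies), normalize(shares), normalize(likes)
print(rp)  # [0.05, 1.0, 0.325]
```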

3.3. Author popularity Quality Proxy

The author popularity QP expresses the popularity of the author(s) who created the social media post(s) containing the seed URL. For example, Twitter and Instagram separately count followers (in-degree) and accounts followed (out-degree). Unlike Twitter, which separately counts in-degree and out-degree, Facebook only counts friends (a combined in-degree and out-degree).

For social media platforms like Facebook with bi-directional links, we instantiated author popularity with the normalized count of friends. For social media platforms like Twitter, we use the normalized difference between in-degree and out-degree (e.g., followers minus following for Twitter). If the in-degree < out-degree, the difference is negative. To fix this, the offset δ (the absolute value of the smallest difference between in-degree and out-degree) is added to each difference before normalization. Given a set of social media posts P, let d_i^{in} and d_i^{out} represent the in-degree and out-degree of social media post p_i, respectively; Eqn. 1 instantiates the author popularity QP:

$pop_{author}(p_i) = \frac{(d_i^{in} - d_i^{out}) + \delta}{\max_{p_j \in P}\left[(d_j^{in} - d_j^{out}) + \delta\right]}, \quad \delta = \left|\min_{p_j \in P}(d_j^{in} - d_j^{out})\right|$   (1)
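A small sketch of Eqn. 1 as reconstructed above from the surrounding prose; the follower/following counts are hypothetical.

```python
def author_popularity(in_degrees, out_degrees):
    """Eqn. 1 sketch: shift each (in-degree - out-degree) difference by the
    offset (absolute value of the smallest difference) so all values are
    non-negative, then normalize by the maximum shifted value."""
    diffs = [i - o for i, o in zip(in_degrees, out_degrees)]
    offset = abs(min(diffs))
    shifted = [d + offset for d in diffs]
    top = max(shifted)
    return [s / top if top > 0 else 0.0 for s in shifted]

# e.g., Twitter followers (in-degree) and following (out-degree)
followers = [120, 5_000, 80]
following = [300, 1_200, 75]
print(author_popularity(followers, following))  # [0.0, 1.0, 0.046...]
```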

3.4. Domain popularity Quality Proxy

The domain popularity QP quantifies the popularity of a seed's domain. We instantiated it with Eqn. 1 by approximating the popularity of the social media account (e.g., @CDCgov) associated with the seed domain (e.g., cdc.gov). To calculate domain popularity for a seed (e.g., https://www.cdc.gov/coronavirus/2019-nCoV/), using Twitter as an example, we first find the social media account (https://twitter.com/CDCgov) associated with the domain (cdc.gov). This is done by finding a bi-directional link between the social media account and the seed's website; for example, the cdc.gov domain links to the @CDCgov Twitter account and vice versa. Second, we extract the in-degree and out-degree details from the account. Third, we apply Eqn. 1.

4. Non-popularity QP classes

We have already discussed some limitations of popularity as a proxy for quality, such as the artificial manipulation of popularity by fake reviews, social bots, and astroturf campaigns. In addition, it is important to note that not all authoritative or credible sources are popular. For example, MLive, a local media organization located in Michigan, the epicenter of the Flint Water Crisis, is less popular than CNN, a national/international news organization, yet one can argue that MLive is a local authority on topics about the Flint Water Crisis, more so than CNN. In fact, according to Denise Robbins, it took the national media one year after the E. coli outbreak to report the Flint story (Robbins, 2016). Consequently, it is pertinent to quantify quality (e.g., authority) across other classes in addition to popularity. This is the rationale for the following non-popularity-based Quality Proxy classes (Table 1, No. 4 – 12).

4.1. Geographical (author and domain) Quality Proxy

Stories and events are often associated with some geographical location. For example, Hurricane Harvey made landfall in Texas in August 2017. The geographical QP gives credit to a local source (local authority) when geographical location information is present. The local source could be an individual (the author geographical QP) or an organization (the domain geographical QP). For example, if our reference epicenter is Texas, USA, given two seeds about Hurricane Harvey, one from CNN and one from TexasObserver (Texas local media), the domain geographical QP would assign a higher value to TexasObserver. Similarly, given two individuals, a resident of Rockport, Texas, and a resident of San Francisco, California, the author geographical QP would give more credit to the Rockport resident.

We instantiated the author (or domain) geographical QP with the normalized ([0, 1]) distance (measured with the Haversine formula) between a reference epicenter and the geo-location associated with the post author (for the author QP) or with the social media account linked bi-directionally (similar to domain popularity) to the seed domain (for the domain QP). We utilized the Google Maps Services Places API (Google, 2020) to normalize place names (e.g., "NYC" and "New York") into a single name and geo-coordinates.
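A sketch of the distance computation, using the standard Haversine formula and assuming normalization by the maximum distance in the set of seeds (the paper does not spell out the normalizer); the coordinates are illustrative.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance (km) between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))

# Distance of each author from a reference epicenter (e.g., Flint, MI),
# normalized by the maximum distance. Table 3 rewards distance from the
# epicenter; flip the value (1 - value) to reward nearness as in Table 5.
epicenter = (43.0125, -83.6875)                        # Flint, Michigan
authors = [(42.3314, -83.0458), (37.7749, -122.4194)]  # Detroit, San Francisco
dists = [haversine_km(*epicenter, *a) for a in authors]
geo = [d / max(dists) for d in dists]
```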

4.2. Temporal Quality Proxy

Stories and events often happen at a place (or places), but always happen at some time, and news organizations report them before, during, or after their occurrence. For example, some of the earliest reports of the Flint Water Crisis story are from MLive. The temporal Quality Proxy rewards seeds published "early," when a priori information about what constitutes early is present. We instantiated it with the normalized time difference between the publication date of the seed and the reference point considered early.

4.3. Subject-expert Quality Proxy

The subject expert QP approximates the subject expertise of a seed's domain. For example, given two seeds about the Coronavirus, one from the CDC and another from the blog of a high school senior, the subject expert QP would assign the CDC a higher subject expert score, since the CDC is an authority on health topics. However, how does one measure the subject expertise of cdc.gov?

We instantiated the subject-expert QP based on this simple assumption: a subject expert often has more to say about their subject of expertise. This means that if the CDC is indeed an expert on Coronavirus, we would expect to see many more reports from the CDC about Coronavirus than from, say, ESPN. We acknowledge that this is a simplifying assumption that could be exploited. We used Document Frequency (DF) to instantiate the subject expertise of the domain of a seed. We extract DF scores by counting the number of result pages returned by Google for a given query, normalized by the total number of pages indexed by the search engine for the site. This normalization is needed to avoid giving an advantage to larger websites.
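A sketch under the stated assumption; the counts are hypothetical stand-ins for SERP result estimates (e.g., for queries like "site:cdc.gov coronavirus" vs. "site:cdc.gov").

```python
def subject_expertise(hits_for_query_on_site, total_pages_indexed_for_site):
    """DF-based subject expertise: query hits on the site, normalized by
    the site's total indexed pages to avoid favoring larger websites."""
    if total_pages_indexed_for_site == 0:
        return 0.0
    return hits_for_query_on_site / total_pages_indexed_for_site

# Hypothetical counts: a health agency vs. a sports site, query "coronavirus"
se_cdc = subject_expertise(90_000, 400_000)     # 0.225
se_espn = subject_expertise(1_200, 2_000_000)   # 0.0006
```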

4.4. Retrievability Quality Proxy

Seeds extracted from social media posts could also be scraped from Search Engine Result Pages (SERPs). The retrievability QP approximates how easy a seed is to find (Azzopardi and Vinay, 2008). For example, Wikipedia pages for various entities (e.g., political figures) are often placed on the front page of SERPs, meaning they have high retrievability. Accordingly, the retrievability QP quantifies the level of difficulty of finding a seed. It is often desirable to identify relevant seeds that are not easy to find, to increase the novelty of a collection. We instantiated the retrievability of a seed (e.g., https://www.cdc.gov/vhf/ebola/index.html) with its reciprocal rank (e.g., 1/2) when searching the first k Google SERPs for the seed with the query (e.g., "ebola virus") used to extract seeds.
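A minimal sketch of the reciprocal-rank instantiation; serp_urls stands for the flattened, ordered results of the first k SERPs for the query.

```python
def retrievability(seed_url, serp_urls):
    """Reciprocal rank of a seed within the ordered SERP results for the
    query used to extract the seeds; 0 if the seed is not found at all
    (i.e., the seed is hard to find)."""
    for rank, url in enumerate(serp_urls, start=1):
        if url == seed_url:
            return 1.0 / rank
    return 0.0

serp = ["https://en.wikipedia.org/wiki/Ebola",
        "https://www.cdc.gov/vhf/ebola/index.html"]
print(retrievability("https://www.cdc.gov/vhf/ebola/index.html", serp))  # 0.5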

4.5. Reputation (broad and narrow) Quality Proxy

Social media seed URLs originate from sources with varying reputations. Given two URLs about Coronavirus, one from InfoWars (which promotes conspiracy theories (Ramadan and Shantz, 2016)) and another from the CDC, it would be problematic to consider the quality of information derived from both sources equal. Similar to the subject expert QP, the reputation QP approximates the reputation of the domain of seeds.

We defined two kinds of reputation QPs. First, reputation-broad attributes reputation to the domain of a seed for having a record of publishing content about a topic (e.g., a health topic), while reputation-narrow attributes reputation to the domain of a seed for having a record of publishing content focused specifically on a story (e.g., Coronavirus). But the question remains: how does one approximate reputation? We instantiated reputation by leveraging the expertise of Wikipedia editors, who presumably sample reputable sources (Chesney, 2006). Specifically, the reputation of the domain of a seed corresponds to the fraction of times it was cited as a reference by a gold-standard set of Wikipedia articles.

For reputation-broad, the gold standard is a collection of Wikipedia articles that focus on the topic (e.g., Disease outbreaks) of the seed. For reputation-narrow, the gold standard is the canonical Wikipedia page for the story. The canonical page can be found by searching for the top-ranked Wikipedia page for the query (e.g., "ebola virus outbreak") representing the topic. To assign reputation-broad or reputation-narrow to the domain of a seed, we extracted the URIs from the references of the gold-standard Wikipedia articles and calculated the fraction of times each domain was referenced. For example, in our reputation gold standard for the Disease outbreaks topic (https://en.wikipedia.org/wiki/List_of_epidemics), cdc.gov appeared in 42 out of 57 gold-standard articles; therefore, the cdc.gov domain has a reputation-broad score of 42/57 ≈ 0.74. The cdc.gov domain appears 14 times out of 720 references in the canonical 2014 Western African Ebola Virus Outbreak Wikipedia page (https://en.wikipedia.org/wiki/Western_African_Ebola_virus_epidemic), and thus its reputation-narrow score is 14/720 ≈ 0.02. In contrast, sputniknews.com has a reputation-broad score of 1/57 ≈ 0.02 and a reputation-narrow score of 0/720 = 0.
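A sketch of the reputation-broad computation, assuming one vote per gold-standard article (per the caption of Table 4); gold_articles is a hypothetical list of per-article reference URL lists. Reputation-narrow would instead count the fraction of all references in the single canonical page that point to the domain (e.g., 14/720 for cdc.gov).

```python
from urllib.parse import urlparse

def broad_reputation(domain, gold_articles):
    """Fraction of gold-standard Wikipedia articles that cite `domain`
    at least once (one vote per article). `gold_articles` is a list of
    articles, each represented as a list of its reference URLs."""
    votes = sum(
        any(urlparse(ref).netloc.endswith(domain) for ref in refs)
        for refs in gold_articles
    )
    return votes / len(gold_articles) if gold_articles else 0.0

# e.g., cdc.gov cited in 42 of 57 Disease-outbreaks articles -> ~0.74
```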

4.6. Relevance Quality Proxy

The relevance QP measures the degree to which a seed is on-topic. A seed that receives high marks across all the other QP vector dimensions remains non-relevant if it is off-topic. We approximate relevance by measuring the cosine similarity between a seed's document vector and a gold-standard document vector that captures our definition of relevance. The gold standard is created by concatenating the text of hand-selected documents (Section 8.2, Step 1) that are relevant to a topic, and creating a feature (vocabulary) vector consisting of the TF or TF-IDF weights of the terms in the concatenated document.
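A sketch using scikit-learn's TF-IDF vectorizer; the paper specifies TF or TF-IDF weighted vectors and cosine similarity but not a specific library, so this is one plausible instantiation with made-up text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def relevance(seed_text, gold_standard_text):
    """Cosine similarity between a seed's TF-IDF vector and the
    gold-standard vector built from hand-selected relevant documents."""
    tfidf = TfidfVectorizer().fit_transform([gold_standard_text, seed_text])
    return cosine_similarity(tfidf[0], tfidf[1])[0, 0]

gold = "ebola virus outbreak west africa epidemic health response cases"
print(relevance("cdc reports new ebola virus cases in west africa", gold))
```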

4.7. Scarcity Quality Proxy

The scarcity QP rewards seeds from domains that are rare in a collection of seeds. It is not surprising to find multiple seeds from news organizations (e.g., cnn.com, foxnews.com, bbc.co.uk) for news topics. Sometimes far-reaching news events are covered by organizations for which news is not their primary domain (e.g., eonline.com and espn.com) and which may offer a novel reporting perspective. The scarcity QP was created to surface such seeds and is approximated by $1 - f/N$, where $f$ is the frequency of a seed's domain out of $N$ total domains.
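A sketch of the scarcity computation over a collection of seed URLs, assuming f counts occurrences of the seed's domain and N counts all domain occurrences in the collection.

```python
from collections import Counter
from urllib.parse import urlparse

def scarcity_scores(seed_urls):
    """Sc = 1 - f/N: seeds from rare domains score close to 1, seeds from
    heavily represented domains score close to 0."""
    domains = [urlparse(u).netloc for u in seed_urls]
    counts, n = Counter(domains), len(domains)
    return {u: 1 - counts[urlparse(u).netloc] / n for u in seed_urls}

seeds = ["https://cnn.com/a", "https://cnn.com/b", "https://espn.com/c"]
print(scarcity_scores(seeds))  # cnn.com seeds -> 1/3, espn.com seed -> 2/3
```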

5. Additional QPs: Flipping

Thus far, the Quality Proxies have been presented with the assumption that the higher the QP value, the better the trait the QP captures. For example, a high author popularity score is a desirable trait, and a low author popularity score is not. However, desirability can be subjective. This means a curator might desire to surface seeds from authors that are not popular, in an effort to amplify the voices of obscure users. Consequently, this requires flipping the direction of the reward system of the QP under consideration. For example, before flipping, the most popular author would have a score of 1, but if we flip the Quality Proxy (represented with a bar over the QP), a score of 0 is assigned to the most popular author. Since all the Quality Proxies were designed to fall within [0, 1], a QP $q$ is simply flipped by $\bar{q} = 1 - q$.

The ability to flip QPs provides us with additional QPs (e.g., flipped author popularity, flipped relevance, flipped retrievability). But it must be noted that the unflipped ($q$) state and the flipped ($\bar{q}$) state of a QP are mutually exclusive.

6. The QP vector and scoring seeds

The seed Quality Proxy vector is a 14-dimensional vector of all the values of the metrics (e.g., rp, sh, lk, relevance) that instantiate the classes (popularity, proximity, and uncategorized) of the Quality Proxies framework. The QP vector of a seed assigns quality scores to a seed across multiple dimensions. Each metric's value expresses some quality trait of a seed and is normalized to [0, 1] such that 0 represents lowest quality and 1 represents highest quality. The dimensions of the QP vector, representing multiple quality traits, can be combined into a single score that can be used to score and/or rank seeds. We instantiated the QP score function (Eqn. 2) of a seed simply with the 2-norm of the n-dimensional QP vector of the seed:

$QP(s) = \|\vec{q}_s\|_2 = \sqrt{\sum_{i=1}^{n} q_i^2}$   (2)

A user can control the relative importance of the metrics of the QP vector depending on prior information or specific needs. Therefore, one can multiply a weight vector $\vec{w}$ element-wise with the QP vector $\vec{q}_s$ to reflect the importance of each metric and obtain a new Quality Proxy score. The weight vector can also be used to switch off specific metrics; for example, to switch off metric $q_i$, we set $w_i = 0$, such that $w_i q_i = 0$.
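A sketch of the scoring function; the division by the norm of the active weights is an assumption made so that scores fall in [0, 1], which matches the scores shown in Table 2 (e.g., a seed with rp = sh = lk = 1.0 scores 1.0).

```python
import numpy as np

def qp_score(qp_vector, weights=None):
    """Eqn. 2 sketch: 2-norm of the (optionally weighted) QP vector, scaled
    by the norm of the active weights so scores stay in [0, 1] (assumed,
    consistent with Table 2). A zero weight switches a metric off."""
    q = np.asarray(qp_vector, dtype=float)
    w = np.ones_like(q) if weights is None else np.asarray(weights, dtype=float)
    active = w != 0
    if not active.any():
        return 0.0
    return float(np.linalg.norm(q[active] * w[active]) / np.linalg.norm(w[active]))

print(qp_score([1.0, 1.0, 1.0]))       # 1.0, as in Table 2, row 1
print(qp_score([0.26, 0.67, 0.59]))    # ~0.54, as in Table 2, row 2
print(qp_score([0.26, 0.67, 0.59], weights=[1, 1, 0]))  # lk switched off
```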

#  domain: title (user's twitter handle) | QP score | QP values (rp, sh, lk) | Raw values in thousands (rp, sh, lk)
1  reuters.com: Most Americans, unlike Trump, want mail-in… (@HillaryClinton) | 1.0 | 1.0, 1.0, 1.0 | 13.4, 31.7, 101
2  cnbc.com: Chamath Palihapitiya: US shouldn't bail out hedge funds, billionaires (@CNBC) | .54 | .26, .67, .59 | 3.48, 21.3, 59.9
3  gov.uk: New immigration system: what you need to know (@nicktolhurst) | .39 | .56, .23, .30 | 7.54, 7.20, 30.1
4  washingtonpost.com: When coronavirus hits, but the water is shut off (@SenSanders) | .32 | .08, .32, .46 | 1.09, 9.99, 46.4
5  wsj.com: Trump's Wasted Briefings (@TheRickWilson) | .25 | .15, .28, .29 | 2.07, 8.97, 29.8
Table 2. 2020 Coronavirus Pandemic: top five seeds extracted by combining three popularity-based QPs (rp, sh, and lk) to produce a single quality score (Eqn. 2) and ranking the seeds by their scores. Popularity QPs unsurprisingly give more credit to seeds from popular (well-known) domains/users.
Epicenter: New York City
#  domain (domain org. location): title (user's twitter handle, user's location) | QP score | geo (author) | geo (domain)
1  thejakartapost.com (Jakarta): Finland discovers masks bought from China not hospital-safe (@ick_forPH, Philippines) | 0.92 | 0.83 | 1.00
2  rappler.com (Philippines): FACT CHECK: Duque claims PH has 'low' coronavirus infection (@rapplerdotcom, Philippines) | 0.84 | 0.83 | 0.86
3  bylinetimes.com (London): COVID-19 SPECIAL INVESTIGATION: Leaked Home Office …(@GHNeale, NA) | 0.75 | 1.00 | 0.34
4  kru.co.ke (Nairobi): Kenya Rugby Union announces cancellation of 2019/20 season as Corona virus…(@OfficialKRU, Nairobi) | 0.72 | 0.71 | 0.73
5  jesusislordradio.info (Nakuru, Kenya): Welcome To Jesus Is Lord Radio (@_lameckongeri, Kisii, Kenya) | 0.71 | 0.69 | 0.72
Table 3. 2020 Coronavirus Pandemic: top five seeds extracted by combining the author and domain geographical QPs (rewarding seeds from users and domains distant from New York) and ranking the seeds by their QP scores (Eqn. 2). The table illustrates what seeds users from distant geographical regions share.
#  domain: title (user's twitter handle) | reputation-broad | Hits
1  who.int: Tobacco (@SergioBowers1) | 0.82 | 47
2  nih.gov: Ventilator-Associated Pneumonia: Diagnosis, Treatment, and Prevention (@HITNTNotTalkin) | 0.81 | 46
3  cdc.gov: 2009 H1N1 Pandemic (H1N1pdm09 virus) Pandemic Influenza (Flu) (@2020DoOver) | 0.74 | 42
4  cdc.gov: Legal Authorities for Isolation and Quarantine (@peabodypress) | 0.74 | 42
5  cdc.gov: 2019-2020 U.S. Flu Season: Preliminary Burden Estimates (@Rick51224214) | 0.74 | 42
Table 4. 2020 Coronavirus Pandemic: top five seeds with the highest broad reputation values. For a single seed (e.g., https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1592694/ from #2), the score (e.g., 0.81) was approximated by counting the number of times (Hits) the seed domain (e.g., nih.gov) was cited (e.g., 46 times) in a reputation gold standard of 57 representative Wikipedia documents (one vote per document) about Disease outbreaks.
#  domain (domain org. location): title (user's twitter handle, user's location) | QP score | rl | geo
1  mlive.com (Michigan): As Flint was slowly poisoned, Snyder's inner circle failed to act (@PhilRevard, Michigan) | 0.91 | 0.85 | 0.97
2  eclectablog.com (Ann Arbor, Michigan): The deceptive corporatist rewriting of the history of the #FlintWaterCrisis is in full swing (@LOLGOP, Ann Arbor, Michigan) | 0.86 | 0.72 | 0.99
3  detroitnews.com (Detroit, Michigan): AG's office got Flint complaints a year before probe (@PhilRevard, Michigan) | 0.85 | 0.68 | 0.99
4  michiganadvance.com (Michigan): Judge allows Flint water class-action lawsuit to proceed, adds Snyder…(@jmlarkin, Cambridge, MA) | 0.84 | 0.70 | 0.97
5  michigan.gov (Michigan): EGLE - Flint's water remains stable, continues to meet federal and new stricter state standards (@nreza21, NA) | 0.84 | 0.69 | 0.97
Table 5. Flint Water Crisis: top five seeds extracted by combining relevance (rl) and geographical QPs to surface local media (e.g., detroitnews.com) in Flint, Michigan.
#  domain: title (user's twitter handle) | QP score | rl | sc | Hits
1  texasmonthly.com: Voices from the Storm (@TexasMonthly) | 0.71 | 0.13 | 0.99 | 1
2  texasobserver.org: Even Hurricane Harvey Can't Temper GOP Hostility Toward Texas' Big Cities (@texasdemocrats) | 0.70 | 0.11 | 0.99 | 1
3  eonline.com: Taylor Swift Makes "Very Sizable Donation" to Houston Food Bank After Hurricane Harvey (@enews) | 0.70 | 0.10 | 0.99 | 1
4  espn.com: J.J. Watt's Hurricane Harvey charity fundraising closes with $37M-plus in donations (@SportsCenter) | 0.70 | 0.08 | 0.99 | 1
5  rollingstone.com: Houston Astros After Hurricane Harvey (@RollingStone) | 0.70 | 0.08 | 0.99 | 1
Table 6. Hurricane Harvey: top five seeds extracted by combining relevance (rl) and the scarcity (sc) QP, used to increase the diversity of news sources (e.g., texasmonthly.com, eonline.com, and espn.com) by extracting seeds from domains with the smallest representation (Hits) in the collection.

7. Selecting seeds with QP scores

In this section, we explore how different combinations of QPs map to different notions of quality and policies for selecting seeds for the 2020 Coronavirus Pandemic (Tables 2, 3, & 4), the Flint Water Crisis (Table 5), and Hurricane Harvey (Table 6).

Table 2 illustrates that a combination of the popularity-based Quality Proxies rp, sh, and lk unsurprisingly gives more credit to seeds from popular (well-known) domains (e.g., reuters.com, cnbc.com, washingtonpost.com) posted by popular authors (e.g., @HillaryClinton, @CNBC, and @SenSanders). Seeds from well-known domains are more likely to be replied to (rp), shared (sh), or liked (lk) as a result of the large audience they enjoy. Sampling seeds from popular sources could help reduce spam or reduce the number of non-credible sources.

Unlike Table 2, Table 3 shifts the reward system by prioritizing authors and domains geographically distant from New York. This resulted in the surfacing of authors and domains outside the United States with an international perspective. The top five authors are residents of two different countries (@ick_forPH - Philippines and @OfficialKRU - Kenya), while the organizations behind the domains are from four different countries (thejakartapost.com - Indonesia, rappler.com - Philippines, bylinetimes.com - England, and kru.co.ke and jesusislordradio.info - Kenya).

Given concerns about the spread of (mis/dis)information surrounding the Coronavirus pandemic, curators could impose stringent rules that restrict the sources of seeds to reputable sources. This reputable-sources-only selection criterion aligns with the goal of the reputation-broad QP. Table 4 outlines the top five seeds when seeds are scored by their respective reputation-broad scores. For a single seed (e.g., https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1592694/) in Table 4, the score (0.81) was approximated by counting the number of times the seed domain (nih.gov) was cited (46 times) in a reputation gold standard of 57 representative Wikipedia documents (one vote per document) about Disease outbreaks. Accordingly, the most dominant seeds were from world-renowned health institutions such as the World Health Organization (who.int), referenced in 47 of the 57 representative Wikipedia documents about public disease outbreaks, the National Institutes of Health (nih.gov), referenced in 46, and the Centers for Disease Control and Prevention (cdc.gov), referenced in 42.

Table 5 illustrates how the geographical QP helps surface local news organizations, such as mlive.com, which was critical to the coverage of the Flint Water Crisis, by giving credit to seed domains from organizations near a geographical reference (e.g., Flint, Michigan).

Table 6 illustrates how the scarcity QP can help increase the diversity of sources by surfacing seeds from non-conventional news media outlets, such as "Taylor Swift Makes 'Very Sizable Donation' to Houston Food Bank After Hurricane Harvey" (eonline.com) and "J.J. Watt's Hurricane Harvey charity fundraising closes with $37M-plus in donations" (espn.com).

8. Framework Evaluation

The goal of this evaluation was two-fold. First, to assess the precision of the seeds selected by their Quality Proxy-assigned scores when novelty is not prioritized (Section 8.2). For brevity, we define QP seeds as the top-ranked seeds selected when seed URLs extracted from social media posts (e.g., tweets) are ranked by their QP scores. It would be unreasonable to collect QP seeds if they are of poor quality compared to expert-generated seeds. We modeled good quality with prototypical reference seeds scraped from Google and/or hand-selected by human experts on Archive-It.

Second, to assess the precision of seeds when novelty is prioritized (Section 8.3). It is a positive trait for QP seeds to be highly similar (low novelty) to Google and/or expert-generated seeds, since this could be indicative of their high quality. The goal of the first evaluation was to quantify the degree of similarity between QP seeds and reference seeds. However, we often need our seeds to be novel, or in other words different from seeds produced by Google and/or experts, but not at the expense of quality. Therefore, we assessed the precision of QP seeds when novelty is prioritized. Novelty of seeds was measured (Section 8.2, Step 4) by comparing them with reference (Google or Expert) seeds.

8.1. Evaluation Dataset

To evaluate social media seed URLs selected with their QP scores (QP seeds), we generated a dataset (Table 7, (Nwala, 2021)) consisting of seeds extracted from reference collections (Section 8.1.1) and Twitter Micro-collections (Section 8.1.2) for multiple topics.

8.1.1. Generating reference (Google/Expert) seeds

The reference collections served as baselines for defining quality. Seeds from Google were scraped, while seeds from expert-generated collections were extracted from the Archive-It API (Archive-It, 2020).

8.1.2. Extracting seeds from Micro-collections

In addition to the reference Google/Expert seeds, we extracted seeds from Twitter Micro-collections to be compared to the reference seeds. Micro-collections are social media posts that contain URLs and that are gathered by humans as a demonstration of domain expertise and editorial activity (Nwala et al., 2019). On Twitter, they manifest as the threaded conversations created by single or multiple users. Seeds extracted from Twitter Micro-collections are different from those scraped exclusively from SERPs (Nwala et al., 2018).

In total, the evaluation dataset (extracted from 2014 – 2020) consisted of 1,552 seeds from reference collections, and 2,027 seeds from 4,209 tweets from Twitter Micro-collections. Even though we utilized Twitter for evaluation, our framework is applicable to other social media platforms such as Reddit and Facebook.

Topic Extraction-Range Seeds Count
Reference Google Collections (808 Seeds)
hurricane harvey 2020-04-11 199 (Page 1 - 20)
flint water crisis 2020-04-10 173 (Page 1 - 20)
coronavirus 2020-04-09 176 (Page 1 - 20)
2018 world cup 2019-01-09 112 (Page 1 - 10)
ebola virus 2017-11-29 97 (Page 1 - 10)
hurricane harvey 2017-09-(02 to 29) 51 (Page 1)
Reference Expert Collection from Archive-It (744 Seeds)
coronavirus (National Library of Medicine, 2020) 2020-03-15 574
hurricane harvey (Internet Archive Global Events, 2017) 2017-(08-25 to 09-29) 37
ebola virus (National Library of Medicine (NLM), 2014) 2014-10-01 133
Twitter-Top (1,310 Seeds, 2,221 tweets)
hurricane harvey 2020-04-11 201 (500 tweets)
flint water crisis 2020-04-09 312 (500 tweets)
coronavirus 2020-04-09 533 (500 tweets)
2018 world cup 2019-01-09 121 (500 tweets)
ebola virus 2017-(11-30 to 12-31) 48 (68 tweets)
hurricane harvey 2017-09-(02 to 31) 95 (153 tweets)
Twitter-Latest (717 Seeds, 1,988 tweets)
flint water crisis 2020-04-09 92 (500 tweets)
coronavirus 2020-04-09 541 (500 tweets)
2018 world cup 2019-01-09 84 (488 tweets)
Table 7. Framework evaluation dataset (Nwala, 2021) consisting of 1,552 seeds from Reference (Google & Expert) collections, and 2,027 seeds from 4,209 tweets from Twitter (Top/Latest) extracted at different date ranges.

8.2. Precision when Novelty is not Prioritized

The following five steps describe how we assessed the precision of QP seeds when novelty is not prioritized.

Step 1: Extracting Quality Proxies for Seeds

We instantiated the QP vectors for all seeds in the evaluation dataset by extracting the values of all QP metrics (Table 1) except subject-expert and temporal, resulting in the use of 12 QPs. The document-frequency instantiation of the subject-expert QP from Google was not a dependable approximation, since it fluctuated (for the same seed) with high variance, hence we excluded it from our evaluation. Additionally, we did not impose a temporal bias to favor old or new documents, hence we excluded the temporal QP.

We approximated the relevance QP with the cosine similarity between document vectors for a seed and a gold-standard document created from the text of the references of Wikipedia articles corresponding to each dataset topic. The author popularity QP corresponds to the popularity of the social media author of the post. Since seeds from Google and Archive-It are not posted by social media authors, we approximated their author popularity QP with the reciprocal rank of their seeds to ensure they are comparable to QP seeds.

Table 8. A sample of 12 QP combinatorial states for 1-combinations, 2-combinations, and 3-combinations. A single 1-combination, 2-combination, or 3-combination of QPs can be used to score (Eqn. 2) a seed.

Step 2: Generating QPs Combinatorial States

We utilized the 12 QPs from the previous step to score (Eqn. 2) seeds, selected the top K seeds, and compared them with the top K reference seeds scored with the same QPs. We did not assign weights to the Quality Proxies. Additionally, we expanded the options for scoring seeds beyond 12 QPs as follows. First, we permitted flipping the QPs, resulting in 12 additional QPs (24 QPs total). Second, we permitted using a subset of the 24 QPs, leading to a combinatorial explosion of possible QP states for scoring seeds. However, we restricted our scoring to 1-, 2-, and 3-combinations, which produced a total of 2,049 possible QP combinations to score seeds (Table 8 shows a sample); a sketch of the enumeration appears below.
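A sketch of the enumeration using a small illustrative subset of QPs; flipped states are marked with a "~" prefix, and combinations containing both a QP and its flipped state are excluded since the two are mutually exclusive (Section 5).

```python
from itertools import combinations

qps = ["rp", "sh", "lk", "rel"]      # illustrative subset of the 12 QPs
flipped = ["~" + q for q in qps]     # flipped counterparts

def valid(combo):
    # exclude combinations containing both a QP and its flipped state
    return not any(q.lstrip("~") == p.lstrip("~") and q != p
                   for q in combo for p in combo)

states = [c for k in (1, 2, 3)
          for c in combinations(qps + flipped, k) if valid(c)]
print(len(states))  # number of usable 1-, 2-, and 3-combinations
```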

Step 3: Scoring Seeds with a Combination of QPs

To score seeds from Twitter or the reference Google or Expert collections, we first selected a single combination of Quality Proxies (e.g., a 2-combination of relevance and scarcity). Next, using only the selected QPs, we assigned a score to each seed with Eqn. 2.

Step 4: Twitter vs. Google/Expert: comparing top K QP seeds

Recall the definition of QP seeds: the top-ranked seeds selected when seed URLs extracted from social media posts (e.g., tweets) are ranked by their QP scores. The top K QP seeds with scores assigned by a given combination of QPs were compared to the top K reference (Google/Expert) seeds scored with the same QP combination. Comparison was done by measuring the domain (e.g., cdc.gov) overlap between the Twitter QP seeds and the reference (Google and/or Expert) seeds. We also measured the precision of the selected QP seeds and reference seeds. For the precision evaluation, if the cosine similarity between a seed and the gold-standard document vector is at least a predefined relevance threshold (set at 0.20 for all topics except Hurricane Harvey: 0.10), the seed is considered relevant. The threshold was estimated by finding the median similarity between each gold-standard document and the rest of the gold-standard documents. Median scores exceeding 0.20 — which was empirically determined to produce satisfactory baseline relevance — were set to 0.20.
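A sketch of this comparison step; the paper does not give the exact overlap formula, so a Jaccard-style overlap over domain sets is assumed here, and is_relevant stands for the thresholded cosine-similarity test described above.

```python
from urllib.parse import urlparse

def domain_overlap(twitter_seeds, reference_seeds):
    """Assumed instantiation: Jaccard-style overlap between the domain
    sets of the top-K Twitter QP seeds and the top-K reference seeds."""
    a = {urlparse(u).netloc for u in twitter_seeds}
    b = {urlparse(u).netloc for u in reference_seeds}
    return len(a & b) / len(a | b) if a | b else 0.0

def precision_at_k(seeds, is_relevant, k):
    """P@K: fraction of the top-K seeds judged relevant, where
    is_relevant applies the cosine-similarity relevance threshold."""
    top = seeds[:k]
    return sum(map(is_relevant, top)) / len(top)
```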

Step 5: Seed Precision when Novelty is not Prioritized

The final step of assessing the precision of seeds when novelty is not prioritized involved reporting the average overlap and average precision for the QP combinations used to score and select the top K seeds. This was achieved by reporting the top 10 (out of 2,049 QP combinations) overlap scores between Twitter and reference seeds and reporting Precision at K (P@K) for the associated QP combinations used to score the seeds. Selecting the top 10 overlap scores enables us to learn the precision of seeds when overlap is at its best, albeit at the expense of novelty, since the higher the overlap between Twitter and reference seeds, the lower the novelty. Section 9.1 presents and discusses the results.

8.3. Precision when Novelty is Prioritized

Since we consider reference seeds to be quality seeds, a high overlap between reference and Twitter QP seeds could result in a high precision of the Twitter QP seeds. However, since novelty (low overlap) is also a desirable quality of seeds, it is crucial to additionally assess the precision of Twitter QP seeds when novelty is prioritized.

The steps for assessing the precision of seeds when novelty is prioritized are the same as the previous section (when novelty is not prioritized) except for Step 5. Instead of reporting the P@K for the associated QP combinations with the top 10 overlap scores, to prioritize novelty, we measured and reported the precision of QP combinations that produced no overlap (highest novelty) between Twitter and reference QP seeds. Section 9.2 discusses the results.

9. Evaluation results and Discussion

Figure 2. Overlap vs. P@20 for Google (orange dots) and Twitter (blue dots) 2020 Coronavirus Pandemic Twitter-Latest seeds scored with different QPs. A single dot represents the overlap (X-axis) and P@20 (Y-axis) for seeds scored by a single Quality Proxy combination. The scatterplot shows how different combinations of QPs result in high or low overlap/P@20. Unsurprisingly, combinations that include the flipped relevance QP resulted in a low P@20, because relevance was penalized.

Our overlap and precision results were statistically significant by a one-tailed Student's t-test (p < .05) with K = 30 across all dataset topics.

Figure 3. Average P@K (left) and average overlap with Expert seeds (right), showing that Twitter seed URLs scored and selected with Quality Proxies improved (higher solid lines) the P@K and overlap above the baseline (lower red dotted lines) precision and overlap, which did not use QPs. However, the improvement diminished as K (the number of seeds) increased.
Figure 4. Median of average P@K for minimum overlap ([0.00]) and maximum overlap (e.g., (.80, .90]) intervals between top K QP seeds and reference Google/Expert seeds. The black line (0.20) marks the relevance threshold. In all cases except (red annotation) Hurricane Harvey (collected 2020), the median of the average P@K of QP seeds for the 0-overlap (maximum novelty) interval was always above the relevance threshold.
Table 9. Combinations of QPs that produced the highest (QP Novel) and lowest (QP Non-Novel) novelty for each dataset topic (Coronavirus, 2018 World Cup, Hurricane Harvey 2020, Hurricane Harvey 2017, Flint Water Crisis, and Ebola Virus 2014), for both Google+Twitter and Expert+Twitter seed comparisons. Non-novelty mostly favors seeds from broadly-reputable sources that are easy to find, while novelty mostly favors seeds that are non-popular (non-popular authors and non-popular posts) and hard to find.

9.1. Results when Novelty is not prioritized

Consider the results (P@K/overlap) when novelty is not prioritized.

9.1.1. P@K of Twitter (and overlap with Google) QP seeds

Across all topics, for Google and Twitter seeds, the Minimum, Median, and Maximum (MMM) average overlap were 0.04, 0.32, and 1.0, respectively, when Quality Proxies were used to score seeds. Without the utilization of QP scores, the MMM average overlap values were smaller: 0.04, 0.14, and 0.27, respectively. These results (e.g., Fig. 3, right) suggest that the utilization of QP scores to rank and select seeds helped surface seeds from a common set of domains between Twitter and Google, unlike when QPs were not used. Additionally, they illustrate that different combinations of QPs can result in high or low overlap/precision, as expressed by Fig. 2. In Fig. 2, unsurprisingly, combinations that include the flipped relevance QP resulted in a low P@20, because relevance was penalized. Our results (Table 9) also suggest that non-novelty mostly favors seeds from broadly-reputable sources that are easy to find.

Across all topics, for Twitter seeds, with Google seeds as the reference, the MMM average Precision at K (P@K) values were 0.0, 0.53, and 0.99, respectively, when QP scores were used. Without the utilization of QP scores, the MMM values were smaller: 0.06, 0.45, and 0.65, respectively. These results (e.g., Fig. 3, left) show that the utilization of Quality Proxies to score, rank, and select seeds improved the precision of seeds by 0.08 (0.53 vs. 0.45).

9.1.2. P@K of Twitter (and overlap with Expert) QP seeds

Across all topics, for Expert and Twitter seeds, the MMM average overlap were 0.09, 0.67, and 1.0, respectively, when Quality Proxies were used to score and select seeds. Without the utilization of QPs, they were smaller: 0.03, 0.13, and 0.19, respectively. Similar to the overlap between Google and Twitter seeds, these results suggest that the utilization of QP scores to rank and select seeds facilitated the selection of seeds from a common set of domains for Twitter and Expert seeds.

Across all topics, for Twitter seeds, with Expert seeds as the reference, the MMM average precision were 0.0, 0.72, and 0.95, respectively. Further investigation of the seeds that generated 0.0 precision showed that 5/10 were actually relevant based on human judgment. This means our relevance threshold of 0.20 was set too high, and thus resulted in the production of false negative labels. The MMM of the average precision of seeds not scored with QPs were smaller (0.06/0.55/0.71) by 0.17 (0.72 vs. 0.55), suggesting again (as previously seen with Google as the reference) that the utilization of QP scores improved the precision of seeds.

9.2. Novelty is prioritized: P@K of QP seeds

In Fig. 4, the bar heights represent the median of the average P@K for different overlap intervals, and horizontal lines mark the relevance threshold for each dataset topic. In all cases except Hurricane Harvey (collected 2020), the median of the average P@K of Twitter QP seeds for the 0-overlap (maximum novelty) interval was always above the relevance threshold. This suggests that maximum novelty (0 overlap) did not adversely affect the P@K for Twitter QP seeds, even though higher overlap resulted in a higher P@K. Our results (Table 9) also suggest that novelty mostly favors seeds that are non-popular (non-popular authors and non-popular posts) and hard to find.

9.3. Correlation of Quality Proxies

Our correlation analysis (Table 10) showed a strong positive correlation between popularity-based and reputation QP metrics. All positive correlations were statistically significant (p < .05), unlike the negative correlations. These results are not surprising. For example, a post with many likes (lk) is highly likely to be shared (sh) and/or replied to (rp). Similarly, many domains (e.g., cdc.gov) with high broad (topic) reputation also have high narrow (story) reputation.

Rank   Least Correlated (r)   Most Correlated (r)
1.     -0.13                  0.94
2.     -0.05                  0.82
3.     -0.04                  0.82
4.     -0.03                  0.82
5.     -0.03                  0.66
6.     -0.03                  0.60
7.     -0.02                  0.45
Table 10. Pairs of QPs least and most correlated (Pearson's r), showing a strong positive correlation between popularity-based and reputation QPs.

10. Future work and Conclusions

The QP framework and the metrics that instantiate its classes have some limitations, which future work will address. First, the evaluation topics, such as the 2020 Coronavirus Pandemic, are well documented; we expect our framework to under-perform for esoteric or obscure stories due to sparse data. Second, the high correlation (Table 10) between QPs (e.g., among the popularity-based QPs) suggests popularity could be given more weight when combined with other QPs. Third, measuring relevance is limited by short texts, which could result in false negative errors.

The Web is one of the greatest outcomes of human endeavor, but it has some major flaws, one of which is that the Web forgets, causing the disappearance of Web resources chronicling important stories and events. Web archive collections reduce this problem by preserving Web resources, and they begin with seed URLs hand-selected by experts or scraped from social media posts. While social media is a valuable source of seed URLs, the quality of social media content varies widely. In this paper, we presented the Quality Proxies framework (and instantiations) for assigning quality scores to seed URLs extracted from social media posts. A QP assigns a quality trait to a seed within a single dimension. Seeds can be assigned a quality score by selecting different combinations of Quality Proxies, which map to different notions of quality across multiple dimensions such as popularity, reputation, geographical proximity, etc. The QP framework is flexible (it enables multiple definitions of quality), robust (it operates with subsets of QPs), explainable, and extensible. Our results showed that Quality Proxies resulted in the selection of quality seeds with increased precision (by ~0.13) when novelty is and is not prioritized. To encourage reproducibility, we have provided our research data and code (Nwala, 2021).

References

  • Abbasi and Liu (2013) Mohammad-Ali Abbasi and Huan Liu. 2013. Measuring user credibility in social media. In Proceedings of SBP 2013. 441–448.
  • Agichtein et al. (2008) Eugene Agichtein, Carlos Castillo, Debora Donato, Aristides Gionis, and Gilad Mishne. 2008. Finding high-quality content in social media. In Proceedings of ACM WSDM 2008. 183–194.
  • Archive-It (2020) Archive-It. 2020. Archive-It. https://partner.archive-it.org/api/.
  • Azzopardi and Vinay (2008) Leif Azzopardi and Vishwa Vinay. 2008. Retrievability: an evaluation measure for higher order information access tasks. In Proceedings of ACM CIKM 2008. 561–570.
  • Bal et al. (2020) Rakesh Bal, Sayan Sinha, Swastika Dutta, Risabh Joshi, Sayan Ghosh, and Ritam Dutt. 2020. Analysing the Extent of Misinformation in Cancer Related Tweets. In Proceedings of AAAI ICWSM 2020, Vol. 14. 924–928.
  • Bar-Yossef et al. (2004) Ziv Bar-Yossef, Andrei Z Broder, Ravi Kumar, and Andrew Tomkins. 2004. Sic transit gloria telae: towards an understanding of the web’s decay. In Proceedings of WWW 2004. ACM, 328–337.
  • Becker et al. (2011) Hila Becker, Mor Naaman, and Luis Gravano. 2011. Selecting Quality Twitter Content for Events. In Proceedings of AAAI ICWSM 2011, Vol. 11. 495–495.
  • Bian et al. (2009) Jiang Bian, Yandong Liu, Ding Zhou, Eugene Agichtein, and Hongyuan Zha. 2009. Learning to recognize reliable users and content in social media with coupled mutual reinforcement. In Proceedings of WWW 2009. 51–60.
  • Bozarth and Budak (2020) Lia Bozarth and Ceren Budak. 2020. Toward a Better Performance Evaluation Framework for Fake News Classification. In Proceedings of AAAI ICWSM 2020, Vol. 14. 60–71.
  • Canini et al. (2011) Kevin R Canini, Bongwon Suh, and Peter L Pirolli. 2011. Finding credible information sources in social networks based on content and social structure. In Proceedings of the IEEE PASSAT 2011 and SocialCom 2011. 1–8.
  • Castillo et al. (2011) Carlos Castillo, Marcelo Mendoza, and Barbara Poblete. 2011. Information credibility on twitter. In Proceedings of WWW 2011. 675–684.
  • Chesney (2006) Thomas Chesney. 2006. An empirical examination of Wikipedia’s credibility. First Monday (2006).
  • Cho et al. (1998) Junghoo Cho, Hector Garcia-Molina, and Lawrence Page. 1998. Efficient crawling through URL ordering. Computer Networks and ISDN Systems 30, 1-7 (1998), 161–172.
  • Ciampaglia et al. (2018) Giovanni Luca Ciampaglia, Azadeh Nematzadeh, Filippo Menczer, and Alessandro Flammini. 2018. How algorithmic popularity bias hinders or promotes quality. Scientific reports 8, 1 (2018), 1–7.
  • Duan et al. (2010) Yajuan Duan, Long Jiang, Tao Qin, Ming Zhou, and Heung-Yeung Shum. 2010. An empirical study on learning to rank of tweets. In Proceedings of COLING 2010. 295–303.
  • Ferrara et al. (2016) Emilio Ferrara, Onur Varol, Clayton Davis, Filippo Menczer, and Alessandro Flammini. 2016. The rise of social bots. Commun. ACM 59, 7 (2016), 96–104.
  • Fortunato et al. (2006) Santo Fortunato, Alessandro Flammini, Filippo Menczer, and Alessandro Vespignani. 2006. Topical interests and the mitigation of search engine bias. Proceedings of the National Academy of Sciences (PNAS) 103, 34 (2006), 12684–12689.
  • Google (2020) Google. 2020. Place Search. https://developers.google.com/places/web-service/.
  • Gupta and Kumaraguru (2012) Aditi Gupta and Ponnurangam Kumaraguru. 2012. Credibility ranking of tweets during high impact events. In Proceedings of PSOSM 2012. ACM, 2–8.
  • Internet Archive Global Events (2017) Internet Archive Global Events. 2017. Archive-It Hurricane Harvey 2017. https://archive-it.org/collections/9323.
  • Klein et al. (2018) Martin Klein, Harihar Shankar, and Herbert Van de Sompel. 2018. Robust Links in Scholarly Communication. In Proceedings of JCDL 2018. 357–358.
  • Michigan State University (2016) Michigan State University. 2016. Flint Water Crisis Websites Archive. https://archive-it.org/collections/6811.
  • Nagmoti et al. (2010) Rinkesh Nagmoti, Ankur Teredesai, and Martine De Cock. 2010. Ranking approaches for microblog search. In Proceedings of the IEEE/WIC/ACM WI-IAT 2010. 153–157.
  • National Library of Medicine (2014) National Library of Medicine. 2014. Global Health Events. https://archive-it.org/collections/4887.
  • National Library of Medicine (2020) National Library of Medicine. 2020. Global Health Events Web Archive - Coronavirus. https://archive-it.org/collections/4887.
  • National Library of Medicine (NLM) (2014) National Library of Medicine (NLM). 2014. Global Health Events Web Archive - Ebola Virus. https://archive-it.org/collections/4887.
  • Nwala (2021) Alexander C Nwala. 2021. QP Framework dataset/code - Git Repo. https://github.com/oduwsdl/quality-proxies-framework.
  • Nwala et al. (2018) Alexander C Nwala, Michele C Weigle, and Michael L Nelson. 2018. Scraping SERPs for archival seeds: it matters when you start. In Proceedings of JCDL 2018. 263–272.
  • Nwala et al. (2019) Alexander C Nwala, Michele C Weigle, and Michael L Nelson. 2019. Using Micro-collections in Social Media to Generate Seeds for Web Archive Collections. In Proceedings of ACM/IEEE JCDL 2019. 251–260.
  • Nwala et al. (2017) Alexander C Nwala, Michele C Weigle, Adam B Ziegler, Anastasia Aizman, and Michael L Nelson. 2017. Local Memory Project: Providing Tools to Build Collections of Stories for Local Events from Local Sources. In Proceedings of ACM/IEEE JCDL 2017. 219–228.
  • Pal and Counts (2011) Aditya Pal and Scott Counts. 2011. Identifying topical authorities in microblogs. In Proceedings of ACM WSDM 2011. 45–54.
  • Priyatam et al. (2014) Pattisapu Nikhil Priyatam, Ajay Dubey, Krish Perumal, Sai Praneeth, Dharmesh Kakadia, and Vasudeva Varma. 2014. Seed selection for domain-specific search. In Proceedings of WWW 2014. 923–928.
  • Ramadan and Shantz (2016) Hisham Ramadan and Jeff Shantz. 2016. Manufacturing Phobias: The political production of fear in theory and practice. University of Toronto Press.
  • Ratkiewicz et al. (2011) Jacob Ratkiewicz, Michael D Conover, Mark Meiss, Bruno Gonçalves, Alessandro Flammini, and Filippo Menczer. 2011. Detecting and tracking political abuse in social media. In Proceedings of AAAI ICWSM 2011.
  • Risse et al. (2014) Thomas Risse, Elena Demidova, and Gerhard Gossen. 2014. What Do You Want to Collect from the Web?. In Proceedings of BWOW 2014.
  • Robbins (2016) Denise Robbins. 2016. ANALYSIS: How Michigan And National Reporters Covered The Flint Water Crisis. https://mediamatters.org/research/2016/02/02/analysis-how-michigan-and-national-reporters-co/208290.
  • SalahEldeen and Nelson (2012) Hany M SalahEldeen and Michael L Nelson. 2012. Losing my revolution: How many resources shared on social media have been lost?. In Proceedings of TPDL 2012. 125–137.
  • Shu et al. (2020) Kai Shu, Deepak Mahudeswaran, Suhang Wang, and Huan Liu. 2020. Hierarchical propagation networks for fake news detection: Investigation and exploitation. In Proceedings of AAAI ICWSM 2020, Vol. 14. 626–637.
  • Yang et al. (2012) Seungwon Yang, Kiran Chitturi, Gregory Wilson, Mohamed Magdy, and Edward A Fox. 2012. A study of automation from seed URL generation to focused web archive development: the CTRnet context. In Proceedings of JCDL 2012. 341–342.
  • Zittrain et al. (2014) Jonathan Zittrain, Kendra Albert, and Lawrence Lessig. 2014. Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations. Legal Information Management 14 (2014), 88–99.