1. Introduction and Background
In an attempt to save the digital history of unfolding world events before they are lost due to link rot (Klein et al., 2018b; Zittrain et al., 2014; Bar-Yossef et al., 2004), we often see the creation of Web archive collections following the occurrence of a major news event. For example, an Archive-It Ebola virus collection (National Library of Medicine, 2014) was created months after the 2014 Ebola outbreak. It consists of groups of webpages of government organizations and public health care workers associated with the Ebola outbreak event. However, some important events occur without the creation of Web archive collections. For example, on February 14, 2018, there was a shooting that claimed the lives of 17 people at the Marjory Stoneman Douglas (MSD) High School in Florida. In the aftermath of the tragic event, the teenage students boldly stepped into the highly politically divisive gun control debate, demanding stricter gun control measures (Sean Rossman, 2018; Stephanie Ebbs, 2018). Less than two weeks after the shooting, major retailers Walmart and Dick's Sporting Goods increased the minimum age required to purchase firearms and ammunition from 18 to 21 (Colin Dwyer, Camila Domonoske, and Emily Sullivan, 2018). Dick's additionally discontinued the sale of assault-style rifles, high capacity magazines, and bump stocks. On March 9, 2018, Governor Rick Scott of Florida signed the Marjory Stoneman Douglas High School Public Safety Act into law. Among other gun control measures, it raised the minimum age for buying rifles to 21, banned bump stocks, and instituted background checks. The ripple effects of the activism of the MSD students are still being felt, and most would agree that this incident deserves to be highlighted as part of the broader gun control discourse in the US, and is thus worthy of a Web archive collection. However, one year after the shooting there is still no corresponding Archive-It collection.
Any subsequent collection would likely not be able to collect all the contemporary resources shared in social media or even reported in the news (SalahEldeen and Nelson, 2012; Nwala et al., 2018b).
The MSD shooting example illustrates gaps in Web archive collections for important events. One major reason for the lack of Web archive collections for important events is tied to how such collections are created. Web archive collections begin with high-quality seed URIs (Uniform Resource Identifiers) selected by curators, a time-consuming process often done manually. Amidst an abundance of important local and global events, various organizations such as the Internet Archive cope with the shortage of curators by routinely requesting (Fig. 7) that users contribute links (seeds) to Archive-It collections, e.g., the 2012 Hurricane Sandy (Internet Archive, 2012) and the 2013 Boston Marathon Bombing (Internet Archive, 2013) collections. But this crowd-sourced approach to collection building, while useful, is not enough. In other cases, archive collections are initiated months or years after the precipitating event. This can have serious consequences, since Web archive collections that start late may omit webpages that address the early stages of events (Nwala et al., 2018b; Klein et al., 2018a). Consequently, it is important to start collecting seeds for Web archive collections early. This calls for a method for generating seeds automatically and on demand. Two prominent sources for automatically generating seeds have been adopted for extracting (scraping) links over the years: Web (e.g., Google, Fig. 8) and social media (e.g., Twitter, Fig. 9) Search Engine Result Pages (SERPs).
In this work, we explore a new source (we call micro-collections) for generating seeds beyond URIs returned by SERPs. It is important to note that our proposed method is not just concerned with finding what a search engine such as Google may find. Even though search engines often produce quality seeds, our micro-collection method of generating seeds is more concerned with finding quality, "hard-to-find," and heterogeneous seeds that may not be popular enough to be easily retrievable through a simple Web search. We define a micro-collection as a post or group of social media posts that exhibit some properties associated with collection building. Web archive curators spend time selecting and filtering seed URI candidates. Similarly, social media users often perform similar tasks when faced with the decision of choosing what URIs to include in a "non-standard" social media post. For example, the Twitter account Doing Things Differently (@dtdchange) (Doing Things Differently, 2012) created a chain of tweets (Fig. 3 (Doing Things Differently, 2015)) by replying to each subsequent tweet in order to chronicle the Flint water crisis story. This reply thread spans almost 3 years and consists of 75 tweets (as of April 8, 2019), each containing a URI. These tweets exhibit curatorial discretion (selection and filtering), and thus we consider the thread a micro-collection for the Flint water crisis story. Other examples of micro-collections are the Reddit posts (Fig. 6 (Ilsensine, 2014a, b)) created by the user Ilsensine (Ilsensine, 2013) for the 2014 Ebola virus outbreak story. In total, the posts contain over 102 external references and were published less than two weeks after the World Health Organization (WHO) declared the 2014 Ebola outbreak a Public Health Emergency of International Concern (WHO, 2014). We distinguish micro-collections from standard social media posts by showing that micro-collections can be identified by considering the properties of the posts.
The rationale for considering micro-collections as a good source for seeds is that the effort taken to create micro-collections is an indication of editorial effort and a demonstration of domain expertise.
Conventional techniques use SERPs from search engines and social media to extract seeds. Since seeds highly influence the nature of collections generated after the seeds are crawled, we consider it pertinent to understand the nature of the seeds returned from the services often used to generate seeds. Accordingly, we conducted a study to investigate the nature of the seeds generated from different sources on popular social media sites (Reddit and Twitter) and a less popular social media site (Scoop.it). First, we created a classification called post class from four pairs (P1A1, P1An, PnA1, PnAn - Table 1) of acronyms for identifying social media posts regardless of platform. A post class is formed by combining two acronyms, P and A, with subscripts (1 - single or n - multiple), which represent the counts of Posts and Authors, respectively. Second, we generated 23,112 collections of seeds extracted from the various post classes by issuing five queries against the following social media sources: Reddit, Twitter (and Twitter Moments, a service launched by Twitter on October 6, 2015, that enables users to collect and share tweets of noteworthy events as they unfold; a collection of tweets is called a moment), and Scoop.it. In total we collected 120,444 URIs from 449,347 social media posts. Third, for a combination of social media and post classes, we studied the resultant collections across the following dimensions: the distribution of links, the probability distribution of URI counts for various post classes, the precision of the seeds, the ages of webpages, the diversity of seed hostnames, and overlap with the Google SERPs.
Our study resulted in the following contributions that collectively provide some insight on the nature of seeds generated from various social media post classes. First, the provision of a simple cross-platform vocabulary (post class) for describing social media posts facilitates comparing posts across different platforms, ranging from tweets (P1A1) on Twitter to Reddit (PnAn) posts.
|Acronym||Post Count||Author Count||Definition/Example|
|P1A1||Single (1)||Single (1)||e.g., a tweet|
|P1An||Single (1)||Multiple (n)||e.g., a Wikipedia reference|
|PnA1||Multiple (n)||Single (1)||e.g., a Twitter thread|
|PnAn||Multiple (n)||Multiple (n)||e.g., a Twitter conversation|
|Topic [Wikipedia Page]||Expectation (Expected/Unexpected)||Recurrence (Recurring/Non-Recurring)||Start definition (Defined/Undefined)||End definition (Defined/Undefined)|
|Ebola Virus Outbreak (Wikipedia, 2014)||Unexpected||Recurring (Irregular)||December 2013 (Centers for Disease Control and Prevention (CDC), 2019)||June 2016 (Centers for Disease Control and Prevention (CDC), 2019)|
|Flint Water Crisis (Wikipedia, 2016)||Unexpected||Non-Recurring||March 2014 (Robbins, 2016)||Undefined|
|MSD Shooting (Wikipedia, 2017)||Unexpected||Non-Recurring||February 14, 2018||February 14, 2018|
|2018 World Cup (Wikipedia, 2018a)||Expected||Recurring||June 14, 2018||July 15, 2018|
|2018 Midterm Elections (Wikipedia, 2018b)||Expected||Recurring||November 6, 2018||November 6, 2018|
Second, we introduced micro-collections (MCs) as social media posts that exhibit properties associated with collection building, and proposed generating seeds from them. MCs are formed by combining seeds from P1An, PnA1, and PnAn. Seeds generated from scraping SERPs belong to the P1A1 post class. We showed that seeds generated from social media sources are not easily discoverable from Google. Third, we provided a means of characterizing and comparing seeds generated from different post classes. Fourth, we showed that MCs produced more seeds than P1A1, but P1A1 had a higher median probability (0.63) of producing relevant URIs than MCs (0.5) for all social media and SERP combinations, excluding seeds generated with hashtags. Finally, we showed that the ages of extracted webpages depend on a combination of features such as the topic and SERP vertical (e.g., Top vs. New). Similarly, we showed that the diversity of seed hostnames varied with post classes. These findings may provide useful information to curators using social media to generate seeds. For example, if a seed generation process prioritizes quantity of URIs (HTML and non-HTML), the curator may consult MCs first. However, if precision is the priority, then P1A1 is the better source. Our research dataset comprising 120,444 links extracted from 449,347 social media posts, as well as the source code for the application used to generate the seeds, is publicly available (Alexander Nwala, 2019).
2. Related work
The collection building process starts with seeds. The seeds are fed into a focused crawler's crawl frontier to start the process of discovering more Web resources related to the collection topic. Chakrabarti et al. (Chakrabarti et al., 1999) introduced the first focused crawler in 1999 as a means to build collections for specific topics, as opposed to a general-purpose crawler, which does not take the topics of the documents into consideration during the crawling process. Since the first focused crawlers, there have been many variants. Bergmark (Bergmark, 2002)
used a focused crawler to crawl and classify webpages into various topics in science, mathematics, engineering, and technology, discarding off-topic pages. Farag et al. (Farag et al., 2017) introduced the Event Focused Crawler, a focused crawler for events that uses an event model to represent documents and a similarity measure to quantify the degree of relevance between a candidate URI and a collection. An event is represented as a triple - Topic, Location, and Date. Similar to Farag et al., Risse et al. (Risse et al., 2014) introduced a new crawler architecture based on the ARCOMEM project. Instead of conventionally crawling all webpages, ARCOMEM performs a semantic crawl of only webpages related to events and entities such as persons, locations, and organizations. Most focused crawling is performed on the live Web. Unfortunately, the live Web is plagued by link rot and content drift; consequently, Klein et al. (Klein et al., 2018a) demonstrated that focused crawling on the archived Web results in more relevant collections than focused crawling on the live Web, for events that occurred in the distant past. Additionally, Klein et al. proposed extracting seeds from external references contained in the Wikipedia page of an event. We consider Wikipedia references examples of P1An micro-collections. Our focus in this work is the seed generation process; therefore, we did not utilize a focused crawler. Instead, we explored various sources for extracting seeds from social media posts.
Not all collection building uses focused crawling. Gossen et al. (Gossen et al., 2016) proposed a methodology for extracting sub-collections from Web archive collections focused on specific topics and events (called the topic and event focused sub-collection). The topic and event focused sub-collection is defined as a collection of documents in a Web archive collected using a sub-collection specification. Our research differs from Gossen et al. in two major ways. First, Gossen proposes generating collections from within the Web archives, but we propose generating seeds from the live social Web. Second, Gossen proposed running an algorithm over a sub-collection specification on a Web archive to generate a sub-collection. This means the decision of whether a URI belongs in a sub-collection is encoded in the specification of an algorithm. However, in this work, we leverage the judgment of humans on social media.
In a similar work, Gossen et al. (Gossen et al., 2017) adapted some portions of the topic and event focused sub-collection in a method to extract event-centric documents from Web archives based on a specialized focused extraction algorithm. They defined two broad kinds of events based on time: planned and unexpected. The goal of the event-centric extraction process is, given an event input and a Web archive, to generate an interlinked collection of documents relevant to the input event that meet the collection specification. The differences between our research and Gossen's previous work (Gossen et al., 2016) carry over to this work. However, we adapted Gossen's categorization of events as either planned or unexpected, and we renamed planned to expected (Table 2). Similar to Gossen et al., Nanni et al. (Nanni et al., 2018) presented an approach for extracting event-centric sub-collections from Web archives. Their method extracts documents not only related to the input event, but also documents describing related events (e.g., premises and consequences). Nanni et al.'s method utilized Wikipedia pages as inputs to generate event-centric collections. In this work, however, we used Wikipedia references to generate our gold standard dataset.
Selecting good seeds is challenging and has not been extensively studied. Collection building research often acknowledges the importance of selecting good seeds and its link to the performance of the resulting systems, but typically pays more attention to the mechanics of building the collection than to seed selection. The challenge of selecting good seeds is embodied in the idea that it is difficult to define "good." This challenge is captured by Bergmark's statement (Bergmark, 2002): "It is unclear what makes a good seed URL, but intuitively it should be rich in links, yet not too broad in scope." Zheng et al. (Zheng et al., 2009) argued that the seed selection problem for Web crawlers is not trivial, and proposed different seed selection strategies based on PageRank, number of outlinks, and website importance. They also showed that different seeds may result in collections that are considered "good" or "bad." While there have been efforts to automatically generate seeds, many of these methods (e.g., Prasath and Öztürk (Prasath and Öztürk, 2011)) target generating seeds for Web crawlers that build indexes for search engines, not seeds for focused crawlers or Web archive collections.
Du et al. (Du et al., 2014) proposed a customized method of generating seeds for focused crawlers based on a user's past Web usage information, which captures the interests of the user. Since this method depends on historical usage information, its performance is tied to the availability of such historical data, which might be lacking due to the absence of domain knowledge or privacy concerns. As part of the Crisis, Tragedy, and Recovery Network project, Yang et al. (Yang et al., 2012) proposed using URIs found in tweet collections (generated with hashtags and keywords) as seeds to quickly bootstrap Web archiving tasks for sudden emergencies and disasters. Similarly, we consider extracting seeds from tweets, but expand the areas for extracting seeds beyond scraping Twitter SERPs. Additionally, we identify post classes of tweets as part of an effort to characterize the nature of seeds generated from different post classes (Table 1). Priyatam et al. (Priyatam et al., 2014) proposed extracting diverse seeds from tweets in a Twitter URI graph for the Web crawlers of digital libraries such as CiteSeerX. Even though their work does not target the generation of seeds for collections of stories and events, which is a focus of our work, the notion of seed diversity is adopted in our work (Section 4.4.4).
In previous work (Nwala et al., 2018a), we showed that collections generated from social media sources such as Reddit, Storify, Twitter, and Wikipedia are similar to Archive-It collections across multiple dimensions such as the distribution of sources and topics, content and URI diversity, etc. These findings suggest that curators may consider extracting URIs from these sources in order to begin or augment collections about various news topics. Here, we adopt a subset of the dimensions for comparing collections. Similarly, in other previous work (Nwala et al., 2018b), as part of an effort to understand the behavior of SERPs, a popular source for generating seeds, we investigated "refinding" news stories on the Google SERP by tracking the URIs returned from Google every day for over seven months. We discovered that the probability of finding the same URI of a news story diminished drastically after a week (0.01 - 0.11). These findings suggest it becomes more difficult over time to find the same news story with the same query on the Google SERP; therefore, collection building efforts that scrape SERPs are highly sensitive to the query issue dates.
3. Research questions
Before generating seeds from micro-collections, we must first identify them. This leads to our first research question:
RQ1: How do we identify, extract, and characterize micro-collections in social media?
Identifying micro-collections makes it easier to accurately describe and extract them. Subsequently, it is important to quantify the prevalence of micro-collections relative to conventional social media posts: as part of proposing the extraction of seeds from micro-collections, it is pertinent to verify whether they are prevalent on the Web.
There are currently two popular sources for automatically or semi-automatically generating seeds. The first involves extracting seeds from SERPs (e.g., Google). The second involves extracting seed URIs from tweets surfaced by hashtags or text queries on Twitter. We propose a third source: extracting seeds from micro-collections. Therefore, it is important that we compare the new source to the previous popular sources. Such a comparison could enable us to understand whether these sources are similar, and such information would be highly informative to future collection building processes. This leads to our second research question:
RQ2: Do seeds from micro-collections differ from seeds from SERPs?
4. Methodology
Here we explain the considerations made in the selection of our dataset topics, the dataset generation process, the measures extracted from the dataset, and how they informed our research questions.
4.1. Topic selection
A central objective of our research was to outline the characteristics of, and differences between, collections generated by scraping SERPs (P1A1 post class - Table 1) and micro-collections (MC post class). Therefore, the choice of queries was not arbitrary. Instead, we developed a temporal classification system (partly informed by Gossen et al. (Gossen et al., 2017)) of real world stories and events based on three temporal attributes (Table 2): Expectation, Recurrence, and Occurrence definition (start and end date definitions). A story can be described by a combination of different states of the temporal attributes.
For the expectation attribute, an event may be expected or unexpected; for example, the Ebola outbreak was unexpected, so we classify it as an unexpected event. For the recurrence attribute, an event may occur repeatedly at regular or irregular intervals. For example, the FIFA World Cup tournaments recur at four-year intervals, so we consider this event a recurring event. Ebola outbreaks in general may also be considered a recurring event, even though they occur at irregular intervals. For the occurrence definition attribute, an event may have a defined or undefined start and end date. For example, the MSD Shooting event started and ended on the same day (February 14, 2018), but the Flint water crisis event started in April 2014 and is still ongoing (no end definition).
Following the specification of the temporal classification system, we selected five topics (Table 2) specified by the following queries and hashtags (for Twitter):
“ebola virus outbreak” (#ebolavirus)
“flint water crisis” (#FlintWater)
“stoneman douglas high school shooting” (#MSDStrong)
“2018 world cup” (#WorldCup)
“2018 midterm elections” (#election2018)
In addition to text queries, for Twitter, we selected hashtag queries for each topic to discern if seeds generated with text-based queries differ from those extracted with hashtag queries.
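For reference, the topic-to-query mapping above can be captured in a small configuration structure; the variable name and layout here are illustrative, not taken from the study's released code:

```python
# Illustrative mapping of the five dataset topics to their text queries and
# Twitter hashtags, as listed above. The structure is our own sketch.
TOPIC_QUERIES = {
    "Ebola Virus Outbreak": {"query": "ebola virus outbreak", "hashtag": "#ebolavirus"},
    "Flint Water Crisis": {"query": "flint water crisis", "hashtag": "#FlintWater"},
    "MSD Shooting": {"query": "stoneman douglas high school shooting", "hashtag": "#MSDStrong"},
    "2018 World Cup": {"query": "2018 world cup", "hashtag": "#WorldCup"},
    "2018 Midterm Elections": {"query": "2018 midterm elections", "hashtag": "#election2018"},
}
```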
|P1A1 Counts||PnA1 Counts||PnAn Counts|
|Total||Class: 23,112||Posts: 449,347||URIs: 120,444|
4.2. Dataset generation and segmentation of social media posts into post classes
For Reddit, we issued all five queries to four Reddit SERPs (Relevance, Top, New, and Comments), and extracted posts from the SERPs. For each query we extracted a maximum of 500 posts and recursively extracted a maximum of 500 comment replies from each post extracted from the SERP.
For Twitter, similar to Reddit, we issued all five text and hashtag queries to the two Twitter SERPs (Top and Latest), and extracted tweets from the SERPs with the use of the Local Memory Project (Nwala et al., 2017) local news generator (Alexander Nwala, 2016). For each query, we extracted a maximum of 500 tweets and recursively extracted a maximum of 500 tweet replies for each tweet extracted from the SERP.
For Reddit and Twitter, the posts directly visible from the SERP were assigned to the P1A1 post class. We use the term "post" in order to be general; different social media platforms have different names for posts, for example, on Twitter, a post is called a tweet. Posts with replies were assigned either to the PnA1 or PnAn class depending on the number of authors. Posts from the SERP with a reply or a contiguous set of replies exclusively authored by a single user were assigned to the PnA1 post class. Finally, posts with a reply or a series of replies authored by multiple users were assigned to the PnAn post class. The P1An micro-collection post class is rare and not available in Twitter, Reddit, or Scoop.it. However, our gold standard data was extracted from Wikipedia references, which belong to P1An.
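The assignment rules above can be sketched as a small function; `post_class` is a hypothetical helper for illustration, not part of the study's released code:

```python
def post_class(num_posts: int, num_authors: int) -> str:
    """Assign a post class label from post and author counts (Table 1).

    Subscript 1 means single, n means multiple, applied to Posts (P)
    and Authors (A) respectively. Hypothetical helper for illustration.
    """
    p = "1" if num_posts == 1 else "n"
    a = "1" if num_authors == 1 else "n"
    return f"P{p}A{a}"

# A lone SERP post -> P1A1; a thread by one author -> PnA1;
# a conversation among several users -> PnAn.
```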
For Twitter Moments, we issued all five queries to Google with "site:twitter.com/i/moments" in order to restrict the search results to links from Twitter Moments. Next, we extracted Twitter Moments URIs from the first two pages of the Google default SERP. We then dereferenced the URIs and extracted the tweets. Tweets from Twitter Moments are authored by multiple users, and thus were assigned the PnAn label.
In addition to the extraction of posts from well-known social media sites (Reddit and Twitter), we considered a lesser-known social media site, Scoop.it (https://www.scoop.it/). Scoop.it is a content curation social media service that enables users to bookmark a single URI (scoop) or multiple URIs (topics). For Scoop.it, we issued all five queries to the Scoop.it SERPs (Scoops and Topics), and extracted posts (scoops) from the SERPs. The scoops visible from the Scoops SERP were assigned to the P1A1 post class. For a single dataset topic, the scoops found in the Topics SERP were assigned to the PnAn post class since they are authored by multiple users.
From all social media posts, we extracted the URIs to create collections corresponding to the post class from which the URIs were extracted. Social media posts often link to intra-site posts (e.g., tweet URI in a tweet). We dereferenced and extracted seeds from such intra-site URIs, and substituted them with the extracted seeds.
4.3. Gold standard dataset generation
The following steps were taken in order to generate the gold standard dataset to facilitate measuring precision of URI collections extracted from the various post classes.
First, we selected a corresponding Wikipedia page for each of the five topics (Table 2). Second, we extracted the URIs from the references section of each Wikipedia page. Third, we dereferenced the URIs from each reference corresponding to a topic (e.g., Flint water crisis) and removed the HTML boilerplate, leaving only the plaintext documents (stopwords removed). The set of plaintext documents was concatenated into one document. Fourth, for each topic, we created a collection vector consisting of the normalized Term Frequency (TF) weights of the concatenated document.
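A minimal sketch of the vector-building step above, assuming a simple whitespace tokenizer and an abbreviated stopword list (the study's exact tokenization and stopword list are not specified here):

```python
from collections import Counter

# Abbreviated stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in"}

def tf_vector(plaintext_docs):
    """Build a normalized Term Frequency vector for one topic.

    Concatenates the boilerplate-free plaintext documents, drops stopwords,
    and normalizes raw term counts by the total number of remaining terms,
    so the weights sum to 1.
    """
    combined = " ".join(plaintext_docs).lower()
    terms = [t for t in combined.split() if t not in STOPWORDS]
    counts = Counter(terms)
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()}
```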
4.4. Primitive measures extraction
We counted the number of URIs (HTML, non-HTML, and both) per topic, per social media source, and per post class (Table 3). Additionally, we extracted the distribution of posts with URIs by counting the number of posts with a specified number of links for a given social media source (e.g., Reddit) to facilitate probability distribution calculation (Table 4). The distribution answers questions such as: “for Reddit posts with links, how many posts had 1 link or 2 links?” Subsequently, the following measures were extracted from the dataset to address our research questions.
4.4.1. Probability distribution of posts with links
For all topics t (e.g., World Cup), given the set of post classes C = {P1A1, PnA1, PnAn} and a social media seed source (e.g., Reddit), the probability of the event that a post of post class c ∈ C with at least one URI has k URIs (e.g., 1 URI) is calculated using Eqn. 1:

P_t(X_c = k) = N_{t,c,k} / Σ_{j≥1} N_{t,c,j}    (1)

where N_{t,c,k} is the number of posts of class c for topic t that contain exactly k URIs. For example, P_t(X_{P1A1} = 1) reads: "What is the probability of the event that a Reddit P1A1 post with a URI has one (i.e., k = 1) URI?"

The general probability of the event that a post with a URI from a social media source, of any post class, has k URIs is calculated over all topics using Eqn. 2:

P(X_c = k) = Σ_t N_{t,c,k} / Σ_t Σ_{j≥1} N_{t,c,j}    (2)

In Eqns. 1 & 2, if c = P1A1 and t = 1, N_{1,c,k} represents the count of P1A1 posts (with k URIs) for the first (Ebola virus outbreak) topic.
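The per-class URI-count distribution of Eqn. 1 can be computed empirically from per-post URI counts; a sketch (the function name is ours):

```python
from collections import Counter

def uri_count_distribution(post_uri_counts):
    """Empirical probability that a post with at least one URI has exactly k URIs.

    `post_uri_counts` is a list of URI counts, one per post of a given class,
    topic, and source. Posts with zero URIs are excluded, matching the
    conditioning in Eqns. 1 & 2.
    """
    counts = Counter(k for k in post_uri_counts if k >= 1)
    total = sum(counts.values())
    return {k: n / total for k, n in counts.items()}
```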
4.4.2. Precision of the URIs in post class collections
Given a candidate collection W of seed URIs to be evaluated, the URIs may be extracted from a single post (P1A1) or multiple posts (e.g., PnAn) from a social media site (e.g., Reddit). We calculated the precision of W as follows. First, the URIs in W were processed in the same manner as the gold standard (Section 4.3), i.e., dereferenced, boilerplate removed, and the plaintext documents concatenated into a single document d. Second, a document collection matrix was created from d and its corresponding gold standard (e.g., the Flint water crisis gold standard). The first row of the matrix consisted of the gold standard vector, and the second row consisted of the vector of d (the document to be evaluated). The columns represent the normalized TF weights. Third, the cosine similarity between the pair of rows was calculated. If the similarity exceeded an empirically learned relevance threshold of 0.25, d was declared relevant; otherwise, it was declared non-relevant.
For a given topic (e.g., Flint water crisis) and SERP vertical (e.g., Twitter-Top), a URI or multiple URIs may be extracted from a post authored by a single user (P1A1, PnA1) or multiple users (PnAn). Each group of URIs extracted from a post has an associated precision value (Relevant URIs / Total URIs). The average precision metric for a post class (e.g., PnA1) is an average over the precision values of all posts in the post class. It provides answers to questions such as: "what is the average precision of the URIs in the PnA1 post class?" For non-HTML URIs, we evaluated precision by extracting text from the post that embedded the URI.
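The relevance test above reduces to a cosine similarity between two sparse TF vectors, compared against the 0.25 threshold; a minimal sketch assuming dict-based vectors (term to weight):

```python
import math

def cosine_similarity(v1, v2):
    """Cosine similarity between two sparse TF vectors (dicts of term -> weight)."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def is_relevant(candidate_vec, gold_vec, threshold=0.25):
    """Declare a candidate document relevant if its cosine similarity with the
    topic's gold standard vector exceeds the empirically learned threshold."""
    return cosine_similarity(candidate_vec, gold_vec) > threshold
```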
4.4.3. Age distribution of relevant webpages per post class
The distribution of ages is an aggregation of the ages of the relevant webpages in a given post class of a given social media. The age of a webpage was calculated by finding the difference between the publication date of a webpage and the date the post containing the webpage URI was retrieved. The publication dates of webpages were extracted with CarbonDate (SalahEldeen and Nelson, 2013)
which estimates the creation date of webpages based on information polled from multiple sources such as the document timestamps, web archives, backlinks, etc. Publication dates of webpages may potentially provide useful information about the kinds of events discussed. For example, the Democratic Republic of Congo in Central Africa has been grappling with another Ebola outbreak (2017-Present). Therefore, webpages published before 2017 are not expected to discuss the 2017 outbreak.
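The age computation described above is a simple date difference; a sketch assuming publication dates have already been estimated (e.g., by CarbonDate):

```python
from datetime import date

def webpage_age_days(publication_date: date, retrieval_date: date) -> int:
    """Age of a webpage in days: the difference between the (estimated)
    publication date and the date the post containing its URI was retrieved."""
    return (retrieval_date - publication_date).days
```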
4.4.4. Distribution of hostname diversity per post class
Given a collection W of URIs for a given post class of a given social media source, the hostname diversity (Nwala et al., 2018a) of W is a single value in [0, 1] that reports whether W consists of URIs from a single host (0, e.g., only www.cnn.com) or all distinct hosts (1, e.g., www.cnn.com and www.foxnews.com). It answers questions such as: "how diverse are the hosts in the Reddit PnA1 post class?"
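A sketch of one way to compute such a [0, 1] diversity value; we assume the (U - 1)/(N - 1) normalization (U unique hosts, N total URIs), which should be checked against the definition in Nwala et al. (2018a):

```python
from urllib.parse import urlparse

def hostname_diversity(uris):
    """Hostname diversity of a URI collection: 0.0 when all URIs share one
    host, 1.0 when every URI has a distinct host. Assumes the (U-1)/(N-1)
    normalization; treat this as an illustrative sketch."""
    hosts = [urlparse(u).netloc for u in uris]
    n, u = len(hosts), len(set(hosts))
    return 0.0 if n <= 1 else (u - 1) / (n - 1)
```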
4.4.5. Overlap between Google collections and post class
We measured the overlap between URIs extracted from Google and URIs extracted from a combination of social media and post class. This was done in order to determine how easy it was to find the URIs scraped from social media micro-collections. Extracting seeds from micro-collections requires more effort than scraping Web search engine SERPs. For example, generating a collection of URIs of the PnA1 or PnAn post class requires independently dereferencing each social media post and extracting the replies from the post. Therefore, if the URIs discovered from micro-collections are easily discoverable via a search engine such as Google, the extra effort of extracting seeds from micro-collections is not justified.
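A set-based sketch of the overlap measurement; note that in practice URIs should be canonicalized (e.g., scheme lowercased, trailing slashes removed) before comparison, which is omitted here:

```python
def serp_overlap(google_uris, social_uris):
    """Fraction of social media seed URIs that also appear among the URIs
    scraped from the Google SERP. Assumes pre-canonicalized URIs."""
    g, s = set(google_uris), set(social_uris)
    return len(g & s) / len(s) if s else 0.0
```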
5. Results and Discussion
Recall the post class (Table 1) acronyms and their respective meanings and examples: P1A1 (e.g., a tweet) - single Post from a single Author, P1An (e.g., a Wikipedia reference) - single Post from multiple Authors, PnA1 (e.g., a Twitter thread) - multiple Posts by a single Author, and PnAn (e.g., a Twitter conversation) - multiple Posts from multiple Authors.
To address the first research question, we identified micro-collections (MC = PnA1 ∪ PnAn) as the collection of social media posts that show some properties of collection building. (Some P1A1 posts that are visible to SERP scrapers could be added to MC if they contain more links than the median number of links, calculated from the same pool of social media posts; however, we did not make such a distinction in our study.) Next, we extracted the PnA1 and PnAn post classes by identifying social media posts with replies (comments) and extracting the parent post as well as the child posts.
Following the identification and extraction of micro-collections, to address the second research question, we characterized MCs and compared seeds extracted from them to seeds extracted from SERPs (P1A1). Here we present the results for each of the respective measures introduced in Section 4.4, and Appendices 1 - 10 include additional figures for these metrics.
5.1. URI and post counts per post class
Micro-collections (MCs) are prevalent on the Web and outnumber (12,917 vs. 10,195) conventional SERP posts (P1A1). Also, in general, MCs produced more URIs (Appendices 2 - 4) than conventional SERP posts. Additionally, MCs produced more non-HTML URIs than P1A1 across all topics; in fact, the total number of P1A1 non-HTML URIs was between 19% and 44% of the MC total. These findings are potentially consequential for curators interested in enriching their collections with non-HTML resources.
From Table 3, for all topics in the Reddit SERPs except Reddit-New, mPmA mostly produced the largest count of URIs (41,160), followed by 1P1A (51% of mPmA), then mP1A (3% of mPmA): mPmA > 1P1A > mP1A. The relatively low number of Reddit mP1A posts and URIs shows that it is rare for a Reddit user to reply to his/her own initial post, especially since Reddit does not impose any size restriction on the length of posts. For the Reddit-New SERP, 1P1A had more URIs (8,056) than mPmA (80% of 1P1A): 1P1A > mPmA > mP1A. This is likely because in the New SERP, mPmA conversations do not get sufficient opportunity to grow: they must compete with newer posts, since the SERP is in “newest first” order. Consequently, before mPmA conversations sufficiently grow, they are pushed down (rank demotion) by newer 1P1A posts and do not get sufficient exposure, leading to fewer replies and thus a reduced mPmA size.
The results show a high degree of inter/extra-user engagement on Twitter, and thus for Post and URI counts (Table 3), mPmA > mP1A > 1P1A. In contrast, Scoop.it showed less user engagement, and thus: 1P1A > mPmA.
5.2. Probability distribution of posts with links
From Table 4, unsurprisingly, the probability that a social media post with a URI of a given post class (1P1A - mPmA) had more than one HTML URI seemed to correlate with whether the social media platform restricts the size of posts. For example, due to the character limit imposed on tweets, the probability that a tweet with a URI has only one HTML URI is 0.98. On the other hand, single tweets with 3+ HTML URIs are rare: we observed three tweets with 3 or 4 HTML URIs (out of 3,501 tweets).
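The per-class probabilities in Table 4 are empirical estimates of this kind. A sketch, assuming the HTML-URI count for each post has already been extracted:

```python
from collections import Counter

def uri_count_probabilities(html_uri_counts):
    """Empirical probability that a post (with at least one URI) contains
    exactly k HTML URIs, given a list of per-post HTML-URI counts."""
    counts = Counter(html_uri_counts)
    total = sum(counts.values())
    return {k: v / total for k, v in sorted(counts.items())}

# e.g., 98 posts with one HTML URI and 2 posts with two HTML URIs:
print(uri_count_probabilities([1] * 98 + [2] * 2))  # {1: 0.98, 2: 0.02}
```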
5.3. Precision of post class URIs
Table 5 shows the conditional probability that the URIs contained in a post of a given post class are relevant, given that the post has a specified count of URIs. Across almost all post classes, we see that the seeds generated from 1P1A posts had a higher probability (maximum: 1, median: 0.63, minimum: 0.33) of being relevant than MC (0.61, 0.5, 0.0). For example, for Reddit, for the same URI count, 1P1A - 0.80, while MC - 0.59. This shows that 1P1A posts benefit from SERP filters; 1P1A posts are returned directly by SERPs and their text often matches a subset of the query. This indicates that a match between a query and a post's text lends some relevance to the URI extracted from the post. However, given that MCs largely do not benefit from SERP filters (the vast majority of MCs are extracted not directly from the SERP, but from reply or comment threads), the 0.5 median precision value indicates that comments and replies possess quality URIs. See also Appendices 5 - 7 for additional figures of the average precision of URIs per post class, per social media platform.
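The precision values above are the fraction of extracted URIs judged relevant. A sketch follows; the relevance judgment itself is represented by a hypothetical predicate standing in for the study's actual judgment procedure.

```python
def precision(uris, is_relevant):
    """Fraction of extracted URIs judged relevant. `is_relevant` is a
    hypothetical predicate standing in for the study's actual relevance
    judgment of a dereferenced webpage."""
    uris = list(uris)
    if not uris:
        return 0.0
    return sum(1 for u in uris if is_relevant(u)) / len(uris)

# e.g., 4 of 5 candidate seeds judged relevant:
print(precision(["u1", "u2", "u3", "u4", "u5"], lambda u: u != "u3"))  # 0.8
```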
In general, 1P1A post URIs (all URIs, HTML, and non-HTML) had the highest average precision compared to mP1A and mPmA for Reddit, Scoop.it, and Twitter posts extracted with text queries. For tweets extracted with hashtags, mPmA posts had the highest average precision compared to 1P1A and mP1A.
For Reddit, 1P1A > mP1A > mPmA: across all topics, 1P1A posts had the highest average precision (all URIs) 80% of the time compared to mP1A and mPmA. The Maximum, Median, and Minimum (MMM) average precision values were 0.88, 0.59, and 0.15, respectively. Next, mP1A posts had a higher average precision than mPmA 70% of the time, MMM - (0.88, 0.50, 0.00), for mPmA - (0.70, 0.42, 0.07).
For tweets exposed with text queries, 1P1A > mPmA > mP1A: 1P1A (0.91, 0.66, 0.45) had the highest average precision 90% of the time compared to mPmA and mP1A. mPmA (0.74, 0.46, 0.28) had a higher average precision 70% of the time than mP1A (0.58, 0.39, 0.35).
For tweets exposed with hashtags, mPmA > 1P1A > mP1A: mPmA (0.65, 0.29, 0.27) posts had the highest average precision 60% of the time compared to 1P1A and mP1A. 1P1A (0.45, 0.39, 0.21) posts had a higher average precision than mP1A (0.50, 0.26, 0.11) 70% of the time. For example, from Fig. 12a, the average precision for 1P1A URIs in the Twitter-Top vertical for the Ebola virus outbreak topic was 0.86 (mPmA - 0.74) for posts extracted with the text query “ebola virus outbreak.” However, mPmA (0.65) outperformed 1P1A (0.49) when the query used to extract posts was the hashtag “#ebolavirus.”
For Scoop.it, 1P1A > mPmA: 1P1A (0.87, 0.78, 0.55) posts had a higher average precision than mPmA (0.80, 0.55, 0.27) 100% of the time. Similar to Twitter 1P1A, Scoop.it 1P1A posts are derived directly from the SERP, and thus benefit from SERP filtering. mPmA posts do not benefit from SERP filtering since they are not extracted directly from the SERP.
5.4. Age distribution of relevant webpages
We compared the ages of 1P1A and MC post class URIs by focusing on the older topics (Ebola virus outbreak and Flint water crisis) for the social media platforms that support 1P1A, mP1A, and mPmA: Reddit and Twitter. MC posts consistently produced older webpages in the Twitter-Latest vertical. A possible explanation is that 1P1A tweets (extracted directly from the Twitter-Latest SERP) are highly likely to be new tweets if the topic is ongoing. Even though new tweets can include URIs of old stories, for ongoing news stories such as those we considered, new tweets are likely to include the URIs of the latest developments. We observed that the Twitter-Latest 1P1A tweets were created within days of the query issue dates, and thus were more likely to produce new URIs for both topics. In contrast, MCs are extracted from conversations that can mix new and old tweets; a new tweet can reply to an old tweet that contains old URIs. Therefore, Twitter-Latest MCs produced a mix of tweets created within days and within years of the query issue dates. See also Appendices 8 and 9 for additional figures of the age distribution of URIs per post class, per social media platform.
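The age comparison can be sketched as follows, assuming each webpage's creation date has already been estimated (e.g., via carbon dating (SalahEldeen and Nelson, 2013)):

```python
from datetime import date
from statistics import median

def median_age_days(creation_dates, query_date):
    """Median age (in days) of webpages relative to the query issue date.
    `creation_dates` are assumed to be pre-estimated creation dates."""
    return median((query_date - d).days for d in creation_dates)

# Three pages of mixed ages, queried on 2019-01-20:
pages = [date(2014, 10, 1), date(2018, 12, 20), date(2019, 1, 5)]
print(median_age_days(pages, date(2019, 1, 20)))  # 31
```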
For the Reddit-Top/Relevance/Comments SERPs for the Ebola virus outbreak topic, MCs and 1P1A produced older webpages with similar distributions: both post classes had a median webpage age of 4.3 years.
As expected, for the Reddit-New SERP, for both topics, MCs and 1P1A produced the newest webpages compared to the other Reddit SERPs, with a median age of 1 year.
1P1A and MC posts from Twitter-Top produced webpages with similar age distributions. For example, for the Flint water crisis, both post classes had a median webpage age of 5 months. In contrast, in the Twitter-Latest vertical, for both topics, MCs produced older webpages than 1P1A. For example, MCs for the Ebola virus outbreak produced older webpages (median: 4.2 years) than those from 1P1A (median: 19 days) (Fig. 12b).
5.5. Distribution of hostname diversity
For Reddit, mP1A posts produced the highest hostname diversity. For Twitter, 1P1A posts produced the highest hostname diversity.
For Reddit, mP1A > 1P1A > mPmA: across all topics, mP1A posts had the highest hostname diversity (HTML URIs) 95% of the time compared to 1P1A and mPmA. The Maximum, Median, and Minimum (MMM) hostname diversity values were 1.0, 0.55, and 0.0, respectively. Next, 1P1A posts had more diverse hostnames than mPmA 61% of the time, MMM - (0.6, 0.33, 0.11), for mPmA - (0.55, 0.28, 0.1).
For Twitter, 1P1A > mPmA > mP1A: 1P1A (0.70, 0.60, 0.43) produced more diverse hostnames 74% of the time than mPmA and mP1A. Similarly, mPmA (0.61, 0.45, 0.39) produced more diverse hostnames 79% of the time than mP1A (0.74, 0.37, 0.31). Scoop.it did not produce enough URIs for two topics, and as a result had too few mPmA posts to support a fair comparison with 1P1A.
Reddit and Twitter had 1P1A > mPmA in common. This is not unexpected: hostname diversity rewards unique hosts, and since the 1P1A collection is smaller than mPmA, it is more likely for 1P1A to fill its hostname slots with additional distinct hosts than mPmA. However, unlike on Reddit, for Twitter mP1A had the lowest diversity, for the following reason: mP1A is the set of all threads authored by the same user, and these threads on Twitter, especially those from News (e.g., @nytimes, @vice) and non-News organizations (e.g., @splcenter, @TurnoutPAC), tend to link to webpages within the organizations' own websites, leading to lower hostname diversity. This phenomenon was most prominent in the 2018 World Cup and midterm elections topics.
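A sketch of one plausible hostname diversity measure follows; the normalization used here (0 when all URIs share a single host, 1 when every URI has a distinct host) is an illustrative assumption, since the exact measure is defined in Section 4.4.

```python
from urllib.parse import urlsplit

def hostname_diversity(uris):
    """Hostname diversity of a URI collection: 0.0 when all URIs share one
    host, 1.0 when every URI has a distinct host. This normalization is an
    illustrative assumption (Section 4.4 defines the actual measure)."""
    hosts = [urlsplit(u).netloc.lower() for u in uris]
    if len(hosts) < 2:
        return 1.0
    return (len(set(hosts)) - 1) / (len(hosts) - 1)

# A news organization's thread linking within its own site scores low:
print(hostname_diversity(["https://nytimes.com/a", "https://nytimes.com/b",
                          "https://nytimes.com/c"]))  # 0.0
```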
5.6. Overlap: Google collections vs. post classes
All post classes showed a small amount of overlap with the collections of URIs returned from the first 10 pages of Google for the respective dates the post class URIs were extracted. This highlights the fluidity of the Google SERP. Thus, URIs extracted from MC and 1P1A collections are not easily discoverable via Google.
Reddit 1P1A and MC posts had overlap ≤ 0.1 85% of the time, with a maximum overlap of 0.13 and a median of 0.04. Twitter 1P1A posts had overlap (0.09, 0.02, 0.0) ≤ 0.1 100% of the time. Similarly, Twitter MC posts had overlap (0.13, 0.04, 0.0) ≤ 0.1 80% of the time.
5.7. Recommendations for generating seeds
Considering the results presented, it is clear that collections generated from social media SERPs (1P1A) are different from collections generated from micro-collections (MCs), and both post classes yield seeds not easily discoverable by scraping Google. Consider the following highlights and how they could affect decisions made when generating seeds from social media.
MCs are more prevalent and produce more seeds than 1P1A, so seed generation that prioritizes quantity would benefit from extracting seeds from MCs. 1P1A produced higher-quality URIs for all social media/SERP combinations except seeds generated with hashtags. The poorer precision of hashtag queries compared to text queries shows that hashtags can be used as a vehicle for spreading non-relevant content, especially when the hashtag is popular. However, when users reply to a tweet that contains a link and a hashtag (the composition of the mPmA set), it is likely they are responding to a relevant tweet; replies may serve as a quality check. Accordingly, MCs produced more relevant URIs when hashtags were used to surface tweets. Consequently, seed generation that prioritizes quality would benefit from extracting seeds from 1P1A, but for Twitter, if hashtags are used, MCs should be considered first.
MCs consistently produced older webpages than 1P1A posts for the Twitter-Latest vertical because MCs included older tweets. Consequently, if seed generation from the Twitter-Latest vertical intends to surface older stories, MCs should be prioritized. Finally, we showed that, unlike on Reddit, 1P1A produced more diverse hostnames than MCs on Twitter. Therefore, seed generation that intends to include different hosts should consider 1P1A instead of Twitter mP1A, since the latter showed a low level of hostname diversity due to reuse of the same domains, a common practice especially among news organizations.
6. Future work and Conclusions
We populated the MC set with posts from mP1A and mPmA exclusively. However, we believe posts from 1P1A with a higher than median number of URIs could also be added to the MC set: the median number of URIs in social media posts was 1 for 86% of all social media/SERP combinations, so we could add to MC those 1P1A posts with more URIs than the median. Also, our precision evaluation was biased in favor of pages rich in text. Consequently, we observed false negative URIs (relevant URIs marked as non-relevant), a problem we plan to address. Finally, we plan to expand the set of metrics for comparing seeds to include metrics that account for the authority of hosts. For example, such a metric would assign a higher authority weight to a CDC (Centers for Disease Control and Prevention) webpage than to an obscure webpage for the same Ebola virus topic.
The seed URIs that form the building blocks of Web archive collections are often hand-selected by curators. Manual selection produces high-quality URIs, but it does not scale and requires domain knowledge. Due to a shortage of curators amidst an abundance of unfolding events, Google and Twitter have been widely adopted for automatically scraping seeds. In this work, we introduced a cross-platform vocabulary (post classes: 1P1A, 1PmA, mP1A, and mPmA) for describing social media posts. While it is common practice to scrape social media platforms such as Twitter for 1P1A seeds, we introduced an overlooked source of social media posts, micro-collections (MCs), and showed that seeds generated from MCs are more prevalent than, and different from, their 1P1A counterparts across multiple dimensions. Finally, we provided recommendations for curators generating seeds from these sources. For example, seed generation that prioritizes quantity may target MCs first, while a precision priority favors 1P1A. Our study outlines how to compare seeds generated from different venues on social media, and our findings may inform the decisions made during seed generation. Our research dataset, comprising 120,444 links extracted from 449,347 social media posts, as well as the source code for the application used to generate the collections, is publicly available (Alexander Nwala, 2019).
This work was supported in part by IMLS LG-71-15-0077-15.
- Alexander Nwala (2016) Alexander Nwala. 2016. Local Memory Project - Local Stories Collection Generator. https://chrome.google.com/webstore/detail/local-memory-project/khineeknpnogfcholchjihimhofilcfp.
- Alexander Nwala (2019) Alexander Nwala. 2019. Using Micro-collections in Social Media to Generate Seeds for Web Archive Collections - Git Repo. https://github.com/anwala/MicroCollectionsJCDL2019.
- Bar-Yossef et al. (2004) Ziv Bar-Yossef, Andrei Z Broder, Ravi Kumar, and Andrew Tomkins. 2004. Sic transit gloria telae: towards an understanding of the web’s decay. In International conference on World Wide Web (WWW 2004). ACM, 328–337.
- Bergmark (2002) Donna Bergmark. 2002. Collection synthesis. In Joint Conference on Digital Libraries (JCDL 2002). 253–262.
- Centers for Disease Control and Prevention (CDC) (2019) Centers for Disease Control and Prevention (CDC). 2019. Years of Ebola Virus Disease Outbreaks. https://www.cdc.gov/vhf/ebola/history/chronology.html.
- Chakrabarti et al. (1999) Soumen Chakrabarti, Martin Van den Berg, and Byron Dom. 1999. Focused crawling: a new approach to topic-specific Web resource discovery. Computer networks 31, 11 (1999), 1623–1640.
- Colin Dwyer, Camila Domonoske, and Emily Sullivan (2018) Colin Dwyer, Camila Domonoske, and Emily Sullivan. 2018. Dick’s Sporting Goods ends sale of assault-style rifles, citing Florida shooting. https://www.npr.org/sections/thetwo-way/2018/02/28/589436112/dicks-sporting-goods-ends-sale-of-assault-style-rifles-citing-florida-shooting.
- Doing Things Differently (2012) Doing Things Differently. 2012. Twitter User. https://twitter.com/dtdchange/.
- Doing Things Differently (2015) Doing Things Differently. 2015. Tweet. https://twitter.com/dtdchange/status/676902666153406465.
- Du et al. (2014) YaJun Du, YuFeng Hai, ChunZhi Xie, and XiaoMing Wang. 2014. An approach for selecting seed URLs of focused crawler based on user-interest ontology. Applied Soft Computing 14 (2014), 663–676.
- Farag et al. (2017) Mohamed MG Farag, Sunshin Lee, and Edward A Fox. 2017. Focused crawler for events. International Journal on Digital Libraries (IJDL 2017) (2017), 1–17. https://doi.org/10.1007/s00799-016-0207-1
- Gossen et al. (2016) Gerhard Gossen, Elena Demidova, and Thomas Risse. 2016. Analyzing web archives through topic and event focused sub-collections. In Proceedings of Web Science Conference (WebSci 2016). 291–295.
- Gossen et al. (2017) Gerhard Gossen, Elena Demidova, and Thomas Risse. 2017. Extracting event-centric document collections from large-scale Web archives. In International Conference on Theory and Practice of Digital Libraries (TPDL 2017). 116–127.
- Ilsensine (2013) Ilsensine. 2013. Reddit User. https://www.reddit.com/user/Ilsensine.
- Ilsensine (2014a) Ilsensine. 2014a. [Ebola] 2014 Outbreak Report. https://www.reddit.com/r/OutbreakNews/comments/2cn9yq/ebola_2014_outbreak_report/.
- Ilsensine (2014b) Ilsensine. 2014b. [Ebola] 2014 Outbreak Report. https://www.reddit.com/r/OutbreakNews/comments/2cn9yq/ebola_2014_outbreak_report/.
- Internet Archive (2012) Internet Archive. 2012. Colleagues at Virginia Tech started an Archive-It collection on Hurricane Sandy. Submit websites to capture content. https://twitter.com/archiveitorg/status/263680818784911360.
- Internet Archive (2013) Internet Archive. 2013. Archive-It has created a collection on the ”Boston Marathon Bombing” Add your Url’s here. We’ll start crawling asap. https://twitter.com/archiveitorg/status/325016028700631040.
- Klein et al. (2018a) Martin Klein, Lyudmila Balakireva, and Herbert Van de Sompel. 2018a. Focused Crawl of Web Archives to Build Event Collections. In Web Science Conference (WebSci 2018). 333–342.
- Klein et al. (2018b) Martin Klein, Harihar Shankar, and Herbert Van de Sompel. 2018b. Robust Links in Scholarly Communication. In Joint Conference on Digital Libraries (JCDL 2018). 357–358.
- Nanni et al. (2018) Federico Nanni, Simone Paolo Ponzetto, and Laura Dietz. 2018. Toward comprehensive event collections. International Journal on Digital Libraries (IJDL 2018) (2018), 1–15. https://doi.org/10.1007/s00799-018-0246-x
- National Library of Medicine (2014) National Library of Medicine. 2014. Global Health Events. https://archive-it.org/collections/4887.
- Nwala et al. (2018a) Alexander C Nwala, Michele C Weigle, and Michael L Nelson. 2018a. Bootstrapping Web Archive Collections from Social Media. In Hypertext and Social Media (HT 2018). 64–72.
- Nwala et al. (2018b) Alexander C Nwala, Michele C Weigle, and Michael L Nelson. 2018b. Scraping SERPs for archival seeds: it matters when you start. In Joint Conference on Digital Libraries (JCDL 2018). 263–272.
- Nwala et al. (2017) Alexander C Nwala, Michele C Weigle, Adam B Ziegler, Anastasia Aizman, and Michael L Nelson. 2017. Local Memory Project: Providing Tools to Build Collections of Stories for Local Events from Local Sources. In Joint Conference on Digital Libraries (JCDL 2017). 219–228.
- Prasath and Öztürk (2011) Rajendra Prasath and Pinar Öztürk. 2011. Finding potential seeds through rank aggregation of web searches. In International Conference on Pattern Recognition and Machine Intelligence (ICPRAI 2011). 227–234.
- Priyatam et al. (2014) Pattisapu Nikhil Priyatam, Ajay Dubey, Krish Perumal, Sai Praneeth, Dharmesh Kakadia, and Vasudeva Varma. 2014. Seed selection for domain-specific search. In International conference on World Wide Web (WWW 2014). 923–928.
- Risse et al. (2014) Thomas Risse, Elena Demidova, Stefan Dietze, Wim Peters, Nikolaos Papailiou, Katerina Doka, Yannis Stavrakas, Vassilis Plachouras, Pierre Senellart, Florent Carpentier, et al. 2014. The ARCOMEM architecture for social-and semantic-driven web archiving. Future Internet 6, 4 (2014), 688–716.
- Robbins (2016) Denise Robbins. 2016. ANALYSIS: How Michigan And National Reporters Covered The Flint Water Crisis. https://mediamatters.org/research/2016/02/02/analysis-how-michigan-and-national-reporters-co/208290.
- SalahEldeen and Nelson (2012) Hany M SalahEldeen and Michael L Nelson. 2012. Losing my revolution: How many resources shared on social media have been lost?. In International Conference on Theory and Practice of Digital Libraries (TPDL 2012). 125–137.
- SalahEldeen and Nelson (2013) Hany M SalahEldeen and Michael L Nelson. 2013. Carbon dating the web: estimating the age of web resources. In Proceedings of WWW 2013. 1075–1082.
- Sean Rossman (2018) Sean Rossman. 2018. ’We’re children. You guys are the adults’: Shooting survivor, 17, calls out lawmakers. https://www.usatoday.com/story/news/nation-now/2018/02/15/were-children-you-guys-adults-shooting-survivor-17-calls-out-lawmakers/341002002/.
- Stephanie Ebbs (2018) Stephanie Ebbs. 2018. Survivors of Florida high school shooting call for action on gun control. https://abcnews.go.com/Politics/survivors-florida-high-school-shooting-call-action-gun/story?id=53111278.
- WHO (2014) WHO. 2014. Statement on the 1st meeting of the IHR Emergency Committee on the 2014 Ebola outbreak in West Africa. https://www.who.int/mediacentre/news/statements/2014/ebola-20140808/en/.
- Wikipedia (2014) Wikipedia. 2014. West African Ebola virus epidemic. https://en.wikipedia.org/wiki/West_African_Ebola_virus_epidemic.
- Wikipedia (2016) Wikipedia. 2016. Flint Water Crisis. https://en.wikipedia.org/wiki/Flint_water_crisis.
- Wikipedia (2017) Wikipedia. 2017. Stoneman Douglas High School shooting. https://en.wikipedia.org/wiki/Stoneman_Douglas_High_School_shooting.
- Wikipedia (2018a) Wikipedia. 2018a. 2018 FIFA World Cup. https://en.wikipedia.org/wiki/2018_FIFA_World_Cup.
- Wikipedia (2018b) Wikipedia. 2018b. 2018 United States elections. https://en.wikipedia.org/wiki/2018_United_States_elections.
- Yang et al. (2012) Seungwon Yang, Kiran Chitturi, Gregory Wilson, Mohamed Magdy, and Edward A Fox. 2012. A study of automation from seed URL generation to focused web archive development: the CTRnet context. In Joint Conference on Digital Libraries (JCDL 2012). 341–342.
- Zheng et al. (2009) Shuyi Zheng, Pavel Dmitriev, and C Lee Giles. 2009. Graph based crawler seed selection. In Proceedings of the 18th international conference on World Wide Web (WWW 2009). 1089–1090.
- Zittrain et al. (2014) Jonathan Zittrain, Kendra Albert, and Lawrence Lessig. 2014. Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations. Legal Information Management 14 (2014), 88–99.