Garbage, Glitter, or Gold: Assigning Multi-dimensional Quality Scores to Social Media Seeds for Web Archive Collections

07/06/2021
by Alexander C. Nwala, et al.

From popular uprisings to pandemics, the Web is an essential source consulted by scientists and historians for reconstructing and studying past events. Unfortunately, the Web is plagued by reference rot, which causes important Web resources to disappear. Web archive collections help reduce the costly effects of reference rot by saving Web resources that chronicle important stories/events before they disappear. These collections often begin with URLs called seeds, hand-selected by experts or scraped from social media. The quality of social media content varies widely; therefore, we propose a framework for assigning multi-dimensional quality scores to social media seeds for Web archive collections about stories and events. We leveraged contributions from social media research for attributing quality to social media content and users based on credibility, reputation, and influence. We combined these with additional contributions from Web archive research that emphasize the importance of considering geographical and temporal constraints when selecting seeds. Next, we developed the Quality Proxies (QP) framework, which assigns seeds extracted from social media a quality score across 10 major dimensions, including popularity, geographical, temporal, subject expert, retrievability, relevance, reputation, and scarcity. We instantiated the framework and showed that seeds can be scored across multiple QP classes that map to different policies for ranking seeds, such as prioritizing seeds from local news or from reputable and/or popular sources. The QP framework is extensible and robust. Our results showed that Quality Proxies resulted in the selection of quality seeds with increased precision (by 0.13), both when novelty is prioritized and when it is not. These contributions provide an explainable score that can be used to rank and select quality seeds for Web archive collections and other domains.
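
The idea of scoring seeds per dimension and then ranking them under different policies can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the `Seed` class, the scores in [0, 1], the weighted-average aggregation in `quality_score`, the two example policies, and the example URLs are all hypothetical. Only the dimension names come from the abstract.

```python
from dataclasses import dataclass

# Dimension names as listed in the abstract (identifier spelling is ours).
DIMENSIONS = ("popularity", "geographical", "temporal", "subject_expert",
              "retrievability", "relevance", "reputation", "scarcity")


@dataclass
class Seed:
    """Hypothetical container: a URL plus one score in [0, 1] per dimension."""
    url: str
    scores: dict  # dimension name -> score; missing dimensions count as 0.0


def quality_score(seed, weights):
    """Weighted average of per-dimension scores (an assumed aggregation)."""
    unknown = set(weights) - set(DIMENSIONS)
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(w * seed.scores.get(dim, 0.0) for dim, w in weights.items())
    return weighted / total


def rank_seeds(seeds, weights):
    """Return seeds ordered from highest to lowest score under one policy."""
    return sorted(seeds, key=lambda s: quality_score(s, weights), reverse=True)


# Two hypothetical ranking policies expressed as dimension weights.
LOCAL_NEWS_POLICY = {"geographical": 2.0, "relevance": 1.0}
POPULAR_REPUTABLE_POLICY = {"popularity": 1.0, "reputation": 1.0}

seeds = [
    Seed("https://example.com/local-report",
         {"geographical": 0.9, "relevance": 0.8,
          "popularity": 0.2, "reputation": 0.5}),
    Seed("https://example.org/viral-post",
         {"geographical": 0.1, "relevance": 0.6,
          "popularity": 0.9, "reputation": 0.7}),
]

for name, policy in [("local news", LOCAL_NEWS_POLICY),
                     ("popular/reputable", POPULAR_REPUTABLE_POLICY)]:
    ranked = rank_seeds(seeds, policy)
    print(name, [s.url for s in ranked])
```

Because each policy is just a set of weights over named dimensions, the per-dimension contributions stay visible, which is one way a score of this kind could remain explainable.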
