Should we trust web-scraped data?

08/04/2023
by Jens Foerderer, et al.
The increasing adoption of econometric and machine-learning approaches by empirical researchers has led to the widespread use of one data collection method: web scraping. Web scraping refers to the use of automated computer programs to access websites and download their content. The key argument of this paper is that naïve web scraping procedures can lead to sampling bias in the collected data. The paper describes three sources of sampling bias in web-scraped data: web content is volatile (i.e., subject to change), personalized (i.e., presented in response to request characteristics), and unindexed (i.e., lacking a population register). In a series of examples, I illustrate the prevalence and magnitude of sampling bias. To support researchers and reviewers, this paper provides recommendations on anticipating, detecting, and overcoming sampling bias in web-scraped data.
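
To make the volatility argument concrete, the following is a minimal simulation sketch and is not taken from the paper. It assumes a hypothetical site whose listings arrive uniformly over a year and stay online for an exponentially distributed lifetime with a mean of 10 days; the function names, the parameter values, and the scrape day are all illustrative assumptions. A single scrape on one day can only see listings that are online at that moment, so long-lived listings are overrepresented relative to the full population.

```python
import random


def simulate_listings(n, horizon, mean_lifetime=10.0, seed=0):
    """Create n hypothetical listings as (posting_day, lifetime_in_days) pairs."""
    rng = random.Random(seed)
    listings = []
    for _ in range(n):
        start = rng.uniform(0, horizon)                # assumed: uniform arrivals
        lifetime = rng.expovariate(1 / mean_lifetime)  # assumed: exponential lifetimes
        listings.append((start, lifetime))
    return listings


def snapshot(listings, day):
    """Return the listings online on `day`, i.e. what a one-time scrape would observe."""
    return [(start, life) for start, life in listings if start <= day < start + life]


def mean_lifetime_of(listings):
    """Average lifetime (in days) over a collection of listings."""
    return sum(life for _, life in listings) / len(listings)


listings = simulate_listings(n=20000, horizon=365)
scraped = snapshot(listings, day=200)

population_mean = mean_lifetime_of(listings)
scraped_mean = mean_lifetime_of(scraped)

print(f"listings in population: {len(listings)}")
print(f"listings seen by one scrape: {len(scraped)}")
print(f"mean lifetime, population: {population_mean:.1f} days")
print(f"mean lifetime, scraped sample: {scraped_mean:.1f} days")
```

Under these assumptions the scraped sample reports a noticeably longer mean lifetime than the population, because the probability of being online on the scrape day grows with a listing's lifetime. Any attribute correlated with lifetime (for example price, if cheap items sell faster) would be distorted in the same direction.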

research
06/16/2023

The Use of Web Archives in Disinformation Research

In recent years, journalists and other researchers have used web archive...
research
02/17/2023

Creating Knowledge Graphs for Geographic Data on the Web

Geographic data plays an essential role in various Web, Semantic Web and...
research
10/11/2020

On Spatial Lag Models estimated using crowdsourcing, web-scraping or other unconventionally collected data

The Big Data revolution is challenging the state-of-the-art statistical ...
research
03/31/2020

The Case For Alternative Web Archival Formats To Expedite The Data-To-Insight Cycle

The WARC file format is widely used by web archives to preserve collecte...
research
04/27/2023

Machine Learning for Detection and Mitigation of Web Vulnerabilities and Web Attacks

Detection and mitigation of critical web vulnerabilities and attacks lik...
research
10/31/2017

Calibration for Stratified Classification Models

In classification problems, sampling bias between training data and test...
research
12/03/2017

Always Lurking: Understanding and Mitigating Bias in Online Human Trafficking Detection

Web-based human trafficking activity has increased in recent years but i...
