Can Common Crawl reliably track persistent identifier (PID) use over time?

01/26/2018
by Henry S. Thompson, et al.

We report the results of two studies whose initial goal was to provide empirical evidence for the changing patterns of use of so-called persistent identifiers. The studies use two and four monthly web crawls respectively, taken from the Common Crawl (CC) initiative between 2014 and 2017. This paper focusses on the tooling needed for dealing with CC data, and on the problems we found with that data. The first study is based on over 10^12 URIs from over 5 * 10^9 pages crawled in April 2014 and April 2017; the second study adds a further 3 * 10^9 pages from the April 2015 and April 2016 crawls. We conclude with suggestions on specific actions needed to enable studies based on CC to give reliable longitudinal information.


