On the Persistence of Persistent Identifiers of the Scholarly Web

04/06/2020 · Martin Klein, et al. · Los Alamos National Laboratory

Scholarly resources, just like any other resources on the web, are subject to reference rot as they frequently disappear or significantly change over time. Digital Object Identifiers (DOIs) are commonplace to persistently identify scholarly resources and have become the de facto standard for citing them. We investigate the notion of persistence of DOIs by analyzing their resolution on the web. We derive confidence in the persistence of these identifiers in part from the assumption that dereferencing a DOI will consistently return the same response, regardless of which HTTP request method we use or from which network environment we send the requests. Our experiments show, however, that persistence, according to our interpretation, is not warranted. We find that scholarly content providers respond differently to varying request methods and network environments and even change their response to requests against the same DOI. In this paper we present the results of our quantitative analysis that is aimed at informing the scholarly communication community about this disconcerting lack of consistency.


1 Introduction

The web is a very dynamic medium where resources are frequently created, deleted, and moved [bar-yossef:sic-transit, cho:evolution, cho:estimating]. Scholars have realized that, due to this dynamic nature, reliably linking and citing scholarly web resources is not a trivial matter [lawrence:persistence, mccown:web_references]. Persistent identifiers such as the Digital Object Identifier (DOI, https://www.doi.org/) have been introduced to address this issue and have become the de facto standard to persistently identify scholarly resources on the web. The concept behind a DOI is that while the location of a resource on the web may change over time, its identifying DOI remains unchanged and, when dereferenced on the web, continues to resolve to the resource’s current location. This concept is based on the underlying assumption that the resource’s publisher updates the mapping between the DOI and the resource’s location if and when the location changes. If this mapping is reliably maintained, DOIs indeed provide a more persistent way of linking and citing web resources.

While this system is not perfect [bilder:doi_fail] and we have previously shown that authors of scholarly articles often do not utilize DOIs where they should [sompel:citation], DOIs have become an integral part of the scholarly communication landscape (https://data.crossref.org/reports/statusReport.html). Our work is motivated by questions related to the consistency of resolving DOIs to scholarly content. From past experience crawling the scholarly web, for example in [jones:content_drift, klein:one_in_five], we have noticed that publishers do not necessarily respond consistently to simple HTTP requests against DOIs. We have instead observed scenarios where their response changes depending on which HTTP client and method are used. If we can demonstrate at scale that this behavior is commonplace in the scholarly communication landscape, it would raise significant concerns about the persistence of such identifiers for the scholarly web. In other words, we are driven by the question: if we cannot trust that requests against the same DOI return the same result, how can we trust the identifier’s persistence?

In our previous study [klein:who_is_asking] we reported the outcome of our initial investigation into the notion of persistence of DOIs from the perspective of their behavior on the web. We found early indicators for scholarly publishers responding differently to different kinds of HTTP requests against the same DOI. In this paper we expand on our previous work by:

  • re-executing the previous experiments with an improved technical setup,

  • adding additional experiments from a different network environment,

  • adding additional experiments with different access levels to scholarly content, and

  • adding a comparison corpus to help interpret our findings and put them into perspective.

Adding these dimensions to our previous work and applying various simple HTTP request methods with different clients to a large and arguably representative corpus of DOIs, we address the following research questions:

  1. What differences in dereferencing DOIs can we detect and highlight?

  2. In what way (if at all) do scholarly content providers’ responses change depending on network environments?

  3. How do observed inconsistencies compare to responses by web servers providing popular (non-scholarly) web content?

  4. What effect do Open Access and non-Open Access content providers have on the overall picture?

  5. What is the effect of subscription levels on the observed inconsistencies?

These five research questions (RQs) aim at a quantitative analysis of the consistency of HTTP responses. We do not claim that such consistency is the only factor that contributes to persistence of scholarly resource identifiers. We argue, however, that without a reassuring level of consistency, our trust in the persistence of an identifier and its resolution to a resource’s current location is significantly diminished.

In the remainder of this paper we will briefly highlight previous related work (Section 2), outline the experiments’ setup (Section 3), and address our research questions (Section 4) before drawing our conclusions (Section 5).

2 Related Work

DOIs are the de facto standard for identifying scholarly resources on the web, supported by traditional scholarly publishers as well as repository platforms such as Figshare and Zenodo. When crawling the scholarly web for the purpose of aggregation, analysis, or archiving, DOIs are therefore often the starting point to access resources of interest. The use of DOIs for references in scholarly articles, however, is not as widespread as it should be. In previous work [sompel:citation], we presented evidence that authors often use the URL of a resource’s landing page rather than its DOI when citing the resource. This situation is undesirable as it requires unnecessary deduplication for efforts such as metrics analysis or crawling. These findings were confirmed in a large-scale study by Thompson and Jian [thompson:common_crawl] based on two samples of the web taken from Common Crawl (http://commoncrawl.org/) datasets. The authors were motivated to quantify the use of HTTP DOIs versus URLs of landing pages in these two samples generated from two snapshots in time. They found more than 5 million actionable HTTP DOIs in the first dataset from 2014, and in the second dataset from 2017 a share of the references appeared as the corresponding landing page URL, not the DOI. It is worth noting that not all resources referenced in scholarly articles have a DOI assigned to them; those without one are therefore subject to typical link rot scenarios on the web. In large-scale studies, we have previously investigated and quantified the “reference rot” phenomenon in scholarly communication [jones:content_drift, klein:one_in_five], focusing on “web at large” resources that do not have an identifying DOI.

Any large-scale analysis of the persistence of scholarly resources requires machine access, as human evaluations typically do not scale. Hence, making web servers that serve (scholarly) content friendlier to machines has been the focus of previous efforts by the digital library community, with the agreement that providing accurate and machine-readable metadata is a core requirement [brandman:crawler_friendly, nelson:harvesting]. To support these efforts, recently standardized frameworks are designed to help machines synchronize metadata and content between scholarly platforms and repositories [klein:technical_framework].

The study by Alam et al. [alam:methods] is related to ours in that the authors investigate the support of various HTTP request methods by web servers serving popular web pages. The authors issue OPTIONS requests and analyze the values of the “Allow” response header to evaluate which HTTP methods are supported by a web server. The authors conclude that a sizable number of web servers inaccurately report supported HTTP request methods.
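
To illustrate the kind of probe used in that study, here is a minimal sketch (our illustration, not the authors’ code) that sends an OPTIONS request and inspects the “Allow” response header; the target URL is an arbitrary placeholder:

    import requests

    # Send an OPTIONS request and report which HTTP methods the server
    # claims to support via the "Allow" response header. Servers may omit
    # the header or report it inaccurately, which is exactly the finding
    # of Alam et al.
    url = "https://example.com/"  # placeholder URL, not from the study
    response = requests.options(url, timeout=10)
    allowed = response.headers.get("Allow", "(no Allow header returned)")
    print(f"{url} -> HTTP {response.status_code}, Allow: {allowed}")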

3 Experimental Setup

3.1 Dataset Generation

To the best of our knowledge, no dataset of DOIs that identify content representative of the diverse scholarly web is available to researchers. Part of the problem is the scale and diversity of the publishing industry landscape, but another factor is that the Science, Technology, and Medicine (STM) market is dominated by a few large publishers [johnson:stmreport]. We therefore reuse the dataset generated for our previous work [klein:who_is_asking], which consists of DOIs randomly sampled from a set of millions of DOIs crawled by the Internet Archive. We refer to [klein:who_is_asking] for a detailed description of the data gathering process, an analysis of the composition of the dataset, and a discussion of why we consider this dataset representative of the scholarly landscape. In addition, to be able to put our findings from the DOI-based dataset in perspective, we created a dataset of the most popular URIs on the web as extracted from the freely available “Majestic Million” index (https://blog.majestic.com/development/majestic-million-csv-daily/) on November 14, 2019.
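
As an illustration of how such a comparison corpus can be assembled, the following minimal sketch reads a local copy of the Majestic Million CSV and collects the top-ranked domains as HTTPS URIs; the file name, the sample size, and the assumption that the export contains a “Domain” column are ours:

    import csv

    # Read the Majestic Million export and collect the top-ranked domains.
    # The file is assumed to be the daily CSV from
    # https://blog.majestic.com/development/majestic-million-csv-daily/;
    # the "Domain" column name matches current exports but may change.
    TOP_N = 10_000  # placeholder sample size, not necessarily the paper's figure

    with open("majestic_million.csv", newline="") as f:
        reader = csv.DictReader(f)
        top_uris = [f"https://{row['Domain']}/" for _, row in zip(range(TOP_N), reader)]

    print(f"Collected {len(top_uris)} URIs, e.g. {top_uris[0]}")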

3.2 HTTP Requests, Clients, and Environments

HTTP transactions on the web consist of a client request and a server response. As detailed in RFC 7231 [http:rfc7231], requests contain a request method and request headers, and responses contain corresponding response headers. GET and HEAD are two of the most common HTTP request methods (also detailed in RFC 7231). The main difference between the two methods is that upon receiving a client request with the HEAD method, a server only responds with its response headers but does not return a content body to the client. Upon receiving a client request with the GET method, on the other hand, a server responds by sending the representation of the resource in the response body in addition to the response headers.

It is important to note that, according to RFC 7231, we should expect a server to send the same headers in response to requests against the same resource, regardless of whether the request is of type HEAD or GET. RFC 7231 states: “The server SHOULD send the same header fields in response to a HEAD request as it would have sent if the request had been a GET…”.
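
This expectation can be probed directly. Below is a minimal sketch, using the Python requests library rather than the clients described in Section 3.2, that dereferences the same URI with both methods and reports header fields present in one response but not the other; the DOI, the timeout value, and the set of volatile headers excluded from the comparison are our illustrative choices:

    import requests

    # Dereference the same resource with HEAD and GET and compare the
    # header fields of the final response in each redirect chain.
    # Per RFC 7231, the fields SHOULD be the same for both methods.
    url = "https://doi.org/10.1007/978-3-030-30760-8_15"  # example DOI from Section 4

    head_resp = requests.head(url, allow_redirects=True, timeout=30)
    get_resp = requests.get(url, allow_redirects=True, timeout=30)

    # Header fields whose values legitimately vary between requests.
    volatile = {"date", "set-cookie", "expires", "age"}

    head_keys = {k.lower() for k in head_resp.headers} - volatile
    get_keys = {k.lower() for k in get_resp.headers} - volatile

    print("HEAD status:", head_resp.status_code, "| GET status:", get_resp.status_code)
    print("Only in HEAD response:", sorted(head_keys - get_keys))
    print("Only in GET response:", sorted(get_keys - head_keys))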

To address our research questions outlined earlier, we utilize the same four methods described in [klein:who_is_asking] to send HTTP requests (a sketch of all four follows the list):

  • HEAD, a HEAD request with cURL (a popular lightweight HTTP client for the command line interface, https://curl.haxx.se/),

  • GET, a simple GET request with cURL,

  • GET+, a GET request with cURL that includes typical browsing parameters such as a user agent string and accepted cookies, and

  • Chrome, a GET request with the Chrome web browser, controlled via the Selenium WebDriver (https://selenium.dev/projects/).
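
The following sketch approximates the four methods; the cURL flags, the user agent string, and the cookie handling shown here are illustrative stand-ins, not the exact parameters used in our experiments:

    import subprocess
    from selenium import webdriver

    doi_url = "https://doi.org/10.1007/978-3-030-30760-8_15"  # illustrative DOI

    # 1) HEAD: a HEAD request with cURL (-I sends HEAD, -L follows redirects).
    subprocess.run(["curl", "-sIL", doi_url])

    # 2) GET: a simple GET request with cURL, printing the final status code.
    subprocess.run(["curl", "-sL", "-o", "/dev/null", "-w", "%{http_code}\n", doi_url])

    # 3) GET+: a GET request with typical browsing parameters; the header
    #    and cookie values here are placeholders, not our experiments' exact setup.
    subprocess.run([
        "curl", "-sL", "-o", "/dev/null", "-w", "%{http_code}\n",
        "-A", "Mozilla/5.0 (X11; Linux x86_64)",   # assumed user agent string
        "-b", "cookies.txt", "-c", "cookies.txt",  # accept and store cookies
        doi_url,
    ])

    # 4) Chrome: a GET request with Chrome via the Selenium WebDriver
    #    (requires a local chromedriver installation).
    driver = webdriver.Chrome()
    driver.get(doi_url)
    print(driver.current_url)  # final URL after all redirects
    driver.quit()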

We send these four requests against the HTTPS-actionable format of a DOI, meaning the form https://doi.org/<DOI>. This is an important difference to our previous work [klein:who_is_asking], where we did not adhere to the format recommended by the DOI Handbook (https://www.doi.org/doi_handbook/3_Resolution.html). For the first set of experiments, and to address RQ1, we send these four HTTP requests against each of the DOIs from an Amazon Web Services (AWS) virtual machine located on the U.S. East Coast. The clients sending the requests are therefore not affiliated with our home institution’s network. Going forward, we refer to this as the external setup. In addressing RQ2, we anticipate possible discrepancies in HTTP responses from servers depending on the network from which a request is sent. Hence, for the second set of experiments, we send the same four requests against the same DOIs from a machine hosted within our institution’s network. Given that the machine’s IP address falls into a range that conveys certain institutional subscription and licensing levels to scholarly publishers, this internal setup should help surface possible differences. To address RQ3, we compare our findings to responses from servers providing non-scholarly content by sending the same four requests against each of the URIs in our dataset of popular websites. From here on, we refer to this corpus as the popular web dataset.
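
For concreteness, here is a minimal sketch of normalizing a DOI name into its HTTPS-actionable form; the helper function is our own illustration:

    # Normalize a DOI name into its HTTPS-actionable form, https://doi.org/<DOI>,
    # as recommended by the DOI Handbook. Legacy actionable forms such as
    # http://dx.doi.org/<DOI> still resolve but are no longer recommended.
    def actionable(doi: str) -> str:
        return f"https://doi.org/{doi}"

    print(actionable("10.1007/978-3-030-30760-8_15"))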

4 Experimental Results

In this section we report our observations when dereferencing HTTPS-actionable DOIs with our four methods. Each method automatically follows HTTP redirects and records information about each link in the redirect chain. For example, a HEAD request against https://doi.org/10.1007/978-3-030-30760-8_15 results in a redirect chain consisting of the following links:
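
A minimal sketch of how such a chain can be retrieved and printed, using the Python requests library rather than the instrumented clients from our experiments; running it outputs the status code and URL of every link in the chain:

    import requests

    # Follow redirects for a HEAD request and record every link in the chain.
    url = "https://doi.org/10.1007/978-3-030-30760-8_15"
    resp = requests.head(url, allow_redirects=True, timeout=30)

    # resp.history holds the intermediate responses, in order;
    # resp itself is the final response in the chain.
    for hop in resp.history + [resp]:
        print(hop.status_code, hop.url)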