The web is a very dynamic medium where resources are frequently created, deleted, and moved [bar-yossef:sic-transit, cho:evolution, cho:estimating]. Scholars have realized that, due to this dynamic nature, reliably linking and citing scholarly web resources is not a trivial matter [lawrence:persistence, mccown:web_references]. Persistent identifiers such as the Digital Object Identifier (DOI, https://www.doi.org/) have been introduced to address this issue and have become the de facto standard for persistently identifying scholarly resources on the web. The concept behind a DOI is that while the location of a resource on the web may change over time, its identifying DOI remains unchanged and, when dereferenced on the web, continues to resolve to the resource’s current location. This concept rests on the underlying assumption that the resource’s publisher updates the mapping between the DOI and the resource’s location if and when the location changes. If this mapping is reliably maintained, DOIs indeed provide a more persistent way of linking and citing web resources.
While this system is not perfect [bilder:doi_fail] and we have previously shown that authors of scholarly articles often do not utilize DOIs where they should [sompel:citation], DOIs have become an integral part of the scholarly communication landscape (https://data.crossref.org/reports/statusReport.html). Our work is motivated by questions related to the consistency of resolving DOIs to scholarly content. From past experience crawling the scholarly web, for example in [jones:content_drift, klein:one_in_five], we have noticed that publishers do not necessarily respond consistently to simple HTTP requests against DOIs. We have instead observed scenarios where their response changes depending on which HTTP client and method is used. If we can demonstrate at scale that this behavior is commonplace in the scholarly communication landscape, it would raise significant concerns about the persistence of such identifiers for the scholarly web. In other words, we are driven by the question: if we cannot trust that requests against the same DOI return the same result, how can we trust in the identifier’s persistence?
In our previous study [klein:who_is_asking] we reported the outcome of our initial investigation into the notion of persistence of DOIs from the perspective of their behavior on the web. We found early indications that scholarly publishers respond differently to different kinds of HTTP requests against the same DOI. In this paper we expand on our previous work by:
re-executing the previous experiments with an improved technical setup,
adding additional experiments from a different network environment,
adding additional experiments with different access levels to scholarly content, and
adding a comparison corpus to help interpret our findings and put them into perspective.
By adding these dimensions to our previous work and applying several simple HTTP request methods with different clients to a large and arguably representative corpus of DOIs, we address the following research questions:
What differences in dereferencing DOIs can we detect and highlight?
In what way (if at all) do scholarly content providers’ responses change depending on network environments?
How do observed inconsistencies compare to responses by web servers providing popular (non-scholarly) web content?
What effect do Open Access and non-Open Access content providers have on the overall picture?
What is the effect of subscription levels on the observed inconsistencies?
These five research questions (RQs) aim at a quantitative analysis of the consistency of HTTP responses. We do not claim that such consistency is the only factor that contributes to persistence of scholarly resource identifiers. We argue, however, that without a reassuring level of consistency, our trust in the persistence of an identifier and its resolution to a resource’s current location is significantly diminished.
2 Related Work
DOIs are the de facto standard for identifying scholarly resources on the web, supported by traditional scholarly publishers as well as repository platforms such as Figshare and Zenodo, for example. When crawling the scholarly web for the purpose of aggregation, analysis, or archiving, DOIs are therefore often the starting point to access resources of interest. The use of DOIs for references in scholarly articles, however, is not as widespread as it should be. In previous work [sompel:citation], we have presented evidence that authors often use the URL of a resource’s landing page rather than its DOI when citing the resource. This situation is undesirable as it requires unnecessary deduplication for efforts such as metrics analysis or crawling. These findings were confirmed in a large scale study by Thompson and Jian [thompson:common_crawl] based on two samples of the web taken from Common Crawl (http://commoncrawl.org/) datasets. The authors were motivated to quantify the use of HTTP DOIs versus URLs of landing pages in these two samples generated from two snapshots in time. They found more than 5 million actionable HTTP DOIs in the first dataset from 2014; in the second dataset from 2017, a share of these resources were identified by the corresponding landing page URL, not the DOI. It is worth noting that not all resources referenced in scholarly articles have a DOI assigned to them; such resources are therefore subject to typical link rot scenarios on the web. In large-scale studies, we have previously investigated and quantified the “reference rot” phenomenon in scholarly communication [jones:content_drift, klein:one_in_five], focusing on “web at large” resources that do not have an identifying DOI.
Any large-scale analysis of the persistence of scholarly resources requires machine access as human evaluations typically do not scale. Hence, making web servers that serve (scholarly) content more friendly to machines has been the focus of previous efforts by the digital library community, with the agreement that providing accurate and machine-readable metadata is a core requirement [brandman:crawler_friendly, nelson:harvesting]. To support these efforts, recently standardized frameworks have been designed to help machines synchronize metadata and content between scholarly platforms and repositories [klein:technical_framework].
The study by Alam et al. [alam:methods] is related to ours in that the authors investigate the support of various HTTP request methods by web servers serving popular web pages. The authors issue OPTIONS requests and analyze the values of the “Allow” response header to evaluate which HTTP methods a web server supports. They conclude that a sizable number of web servers inaccurately report supported HTTP request methods.
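This probing approach can be sketched in a few lines of Python; the helper names (`parse_allow`, `advertised_methods`) are our own illustration, not the authors’ code, and a live server must of course be reachable for the network portion:

```python
# Sketch of probing a server's advertised HTTP methods via an OPTIONS
# request and the Allow response header, in the spirit of Alam et al.
from http.client import HTTPSConnection

def parse_allow(allow_header):
    """Split an Allow header such as 'GET, HEAD, OPTIONS' into a set of methods."""
    if not allow_header:
        return set()
    return {m.strip().upper() for m in allow_header.split(",") if m.strip()}

def advertised_methods(host, path="/"):
    """Issue an OPTIONS request and return the methods the server claims to support."""
    conn = HTTPSConnection(host, timeout=10)
    conn.request("OPTIONS", path)
    response = conn.getresponse()
    allow = response.getheader("Allow")
    conn.close()
    return parse_allow(allow)

print(parse_allow("GET, HEAD, OPTIONS, POST"))
```

Note that, as Alam et al. show, the advertised set cannot be taken at face value: a server may reject methods it lists, or accept methods it omits.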
3 Experimental Setup
3.1 Dataset Generation
To the best of our knowledge, no dataset of DOIs that identify content representative of the diverse scholarly web is available to researchers. Part of the problem is the scale and diversity of the publishing industry landscape, but also the fact that the Science, Technology, and Medicine (STM) market is dominated by a few large publishers [johnson:stmreport]. We therefore reuse the dataset generated for our previous work [klein:who_is_asking], which consists of DOIs randomly sampled from a much larger set of DOIs crawled by the Internet Archive. We refer to [klein:who_is_asking] for a detailed description of the data gathering process, an analysis of the composition of the dataset, and a discussion of why we consider this dataset to be representative of the scholarly landscape. In addition, to be able to put our findings from the DOI-based dataset in perspective, we created a dataset of the most popular URIs on the web as extracted from the freely available “Majestic Million” index (https://blog.majestic.com/development/majestic-million-csv-daily/) on November 14, 2019.
3.2 HTTP Requests, Clients, and Environments
HTTP transactions on the web consist of a client request and a server response. As detailed in RFC 7231 [http:rfc7231], requests contain a request method and request headers, and responses contain corresponding response headers. GET and HEAD are two of the most common HTTP request methods (also detailed in RFC 7231). The main difference between the two methods is that upon receiving a client request with the HEAD method, a server only responds with its response headers but does not return a content body to the client. Upon receiving a client request with the GET method, on the other hand, a server responds by sending the representation of the resource in the response body in addition to the response headers.
It is important to note that, according to RFC 7231, we should expect a server to send the same headers in response to requests against the same resource, regardless of whether the request is of type HEAD or GET. RFC 7231 states: “The server SHOULD send the same header fields in response to a HEAD request as it would have sent if the request had been a GET…”.
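A minimal consistency check along these lines might compare the header sets returned for GET and HEAD against the same resource (a hypothetical helper for illustration, not the study’s actual tooling; the sample headers are made up):

```python
# Per RFC 7231, a HEAD response SHOULD carry the same header fields as a
# GET response for the same resource. This helper reports headers whose
# presence or value differs between the two responses.
def header_differences(get_headers, head_headers, ignore=("Date",)):
    """Return sorted header names that differ between GET and HEAD responses."""
    keys = (set(get_headers) | set(head_headers)) - set(ignore)
    return sorted(k for k in keys if get_headers.get(k) != head_headers.get(k))

get_h = {"Content-Type": "text/html", "Content-Length": "5120", "Date": "x"}
head_h = {"Content-Type": "text/html", "Date": "x"}
print(header_differences(get_h, head_h))  # → ['Content-Length']
```

An empty result indicates the server behaves as the RFC recommends; any reported header is a candidate inconsistency of the kind our experiments measure.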
To address our research questions outlined earlier, we utilize the same four methods described in [klein:who_is_asking] to send HTTP requests:
HEAD, a HEAD request with cURL (a popular lightweight HTTP client for the command line interface, https://curl.haxx.se/)
GET, a simple GET request with cURL
GET+, a GET request with cURL that includes typical browsing parameters such as a user agent and accepted cookies
Chrome, a GET request with the Chrome web browser, controlled via the Selenium WebDriver (https://selenium.dev/projects/)
We sent these four requests against the HTTPS-actionable format of a DOI, meaning the form https://doi.org/<DOI>.
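The three cURL-based methods can be approximated in Python as a sketch; the helper name and header values here are illustrative assumptions rather than the study’s exact configuration, and the Chrome/Selenium method is omitted:

```python
# Sketch of the HEAD, GET, and GET+ request methods against the
# HTTPS-actionable form of a DOI, using urllib instead of cURL.
import urllib.request

BROWSER_HEADERS = {  # typical browsing parameters for the GET+ method
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    "Accept": "text/html,application/xhtml+xml",
}

def build_request(doi, method="GET", browser_like=False):
    """Build a request against https://doi.org/<DOI> for the given method."""
    url = "https://doi.org/" + doi
    headers = BROWSER_HEADERS if browser_like else {}
    return urllib.request.Request(url, method=method, headers=headers)

req = build_request("10.1007/978-3-030-30760-8_15", method="HEAD")
print(req.get_method(), req.full_url)
# Dereferencing would be: urllib.request.urlopen(req, timeout=30)
```

Each built request can then be dispatched and its response headers recorded, one request per method per DOI.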
4 Experimental Results
In this section we report our observations when dereferencing HTTPS-actionable DOIs with our four methods. Each method automatically follows HTTP redirects and records information about each link in the redirect chain. For example, a HEAD request against https://doi.org/10.1007/978-3-030-30760-8_15 results in a redirect chain consisting of the following links: