More than seventeen years have passed since the definition of Open Access (OA) was agreed (Chan et al., 2002). OA, which refers to scientific literature that is online and available free of cost to the end user, questions the traditional publishing business model relying on paywalls and advocates for a shift towards alternative, more cost-effective publishing models delivering free access to research outputs for all (Sample, 2012; Shieber, 2013; Publishers Communication Group, 2017; Suber, 2003). These arguments have been gradually influencing researchers, research organisations, and funders, resulting in the creation of new OA policies. As of January 2019, according to the Registry of Open Access Repository Mandates and Policies (https://roarmap.eprints.org/), there are 732 institutional and 85 funder OA policies globally. OA policies provide authors with criteria for making their research outputs available as OA (Picarra, 2015). These criteria typically include when and where research outputs should be deposited or published and what version of the manuscript (e.g. pre-print vs. post-print) should be made openly accessible (Picarra, 2015). Arguably one of the most significant OA policies, the UK Research Excellence Framework (REF) 2021 Open Access Policy (https://www.ref.ac.uk/about/what-is-the-ref/), was introduced in the UK in March 2014 (Higher Education Funding Council for England (HEFCE), 2016). The significance of this policy lies in two aspects: 1) the requirement to make research outputs OA is linked to performance review, creating a strong incentive for compliance (Xia et al., 2012; Swan et al., 2015; Vincent-Lamarre et al., 2016), and 2) it affects over 5% of global research outputs (according to Scimago Journal & Country Rank, https://www.scimagojr.com/countryrank.php?year=2017, in 2017 the UK was the third largest producer of research outputs, representing 5.42% of global research outputs, after the US with a 17.71% share and China with a 14.38% share). Under this policy, only compliant research outputs will be evaluated in the national Research Excellence Framework. Over 52 thousand academic staff from 154 UK universities submitted over 190 thousand research outputs in the most recent REF (2014) (Research Excellence Framework, 2014). The UK REF 2021 OA policy is not the only major nation-wide development – the U.S. Public Access Plan (U.S. Agency for International Development, 2013), introduced in 2013, and the European Commission supported “Plan S” (European Commision, 2018) are just two more examples of a global shift towards Open Access.
The problem. The growth of OA and the introduction of new policies, such as the REF 2021 Open Access Policy, have brought forth important questions and implications, some universal and some policy-specific. Even when authors deposit their work in OA repositories, does this happen immediately, or is the deposit delayed? What effect does the introduction of policies have on the practice of publishing OA? Is there evidence that introducing OA policies reduces the time from acceptance to the open availability of research outputs? More importantly, how can compliance with OA policies be tracked, particularly when specific time-frames for making research outputs OA are in place? While recent studies analysing compliance with OA policies (Lariviere and Sugimoto, 2018; Khoo and Lay, 2018) and the prevalence of OA (Piwowar et al., 2018) have focused on whether articles are eventually made openly available, they have not taken into consideration the time lag between the acceptance/publication of an article and its online availability (deposit into an OA repository). Two existing studies which have taken deposit dates into consideration (Swan et al., 2015; Vincent-Lamarre et al., 2016) are now outdated, are not easily reproducible, and have not used these dates to assess compliance (i.e.
to understand whether authors deposit on time in accordance with existing policies) but instead used these dates to study policy effectiveness (i.e. to understand whether certain types of policies shorten the time between publication and deposit). If we can measure the time lag between publication and deposit, can we assist authors and institutions in improving their compliance with OA policies?
Research questions. In this paper we analyse the time lag between article publication dates and the dates of their deposit into OA repositories. We will refer to the time lag between these dates simply as deposit time lag. (Existing studies sometimes refer to the difference between the publication and deposit dates as “deposit latency” (Swan et al., 2015; Vincent-Lamarre et al., 2016). However, because the term “latency” typically has a different meaning in computer science, we chose to use the term “time lag” instead.) We analyse deposit time lag across country, time, repository, and discipline. Furthermore, we investigate whether introducing a mandatory policy in the UK – the REF 2021 Open Access Policy, which requires depositing research outputs within a specific period – affected this time lag. To study deposit time lag and compliance with the policy, we use data from Crossref (https://www.crossref.org/), the largest Digital Object Identifier (DOI) registration agency, and from CORE (https://core.ac.uk/), the largest full text aggregation service collecting OA research outputs from institutional and subject repositories and from journals around the world (Knoth and Zdrahal, 2012). After matching article metadata from Crossref and from CORE, we analyse the time lag between the publication dates we receive from Crossref and the deposit dates we receive from CORE. Using this data, we answer the following research questions:
How does deposit time lag vary across time, country, institution, and discipline?
What proportion of UK research outputs was not deposited on time to comply with the REF 2021 OA Policy?
Is the REF 2021 OA Policy affecting how soon publications are made OA?
How does the change in the deposit time lag in the UK over the past several years compare to other countries?
Findings. We show that the time between publication and deposit has decreased significantly worldwide. We also show that while there are notable differences in the deposit time lag of different subjects, there are even larger differences between different institutions, even when considering only publications from the same discipline. This suggests institutions may be stronger drivers of OA than discipline culture. Furthermore, we show the introduction of the UK REF OA Policy might have accelerated the UK’s move towards immediate OA compared to other countries. Contributions. We present a method for automated tracking of deposit time lag which can be applied to research outputs world-wide. Using this method, we provide the first large scale analysis of deposit time lag. Ours is also the first study to quantitatively analyse deposit time lag in relation to the REF 2021 OA Policy. Our results support the argument for the inclusion of a time-limited deposit requirement in OA policies. Finally, to support further studies on the deposit of research outputs into OA repositories, we release our dataset of 800 thousand publications and the source code of our analysis (https://github.com/oacore/jcdl_2019). Outline. This paper is organised as follows. First, in Section 2 we review previous work related to our study. Next, in Section 3 we describe our data collection process and the methodology used in our analysis. In Section 4 we explain how we prepare our dataset, and in Section 5 we present the results of our analysis. Finally, Section 6 discusses limitations of the present work and future goals.
2. Related Work
In this section we discuss work related to our research. In particular, we focus on two topics: 1) studies that try to estimate the proportion of all research publications that are openly accessible and 2) studies that analyse compliance with specific OA policies. We close this section by discussing the differences between our study and previous work. In recent years in particular, many studies have tried to estimate the proportion of existing research that is available as OA (Björk et al., 2010; Gargouri et al., 2012; Archambault et al., 2013, 2014; Khabsa and Giles, 2014; Piwowar et al., 2018; Lariviere and Sugimoto, 2018). While an earlier study identified OA articles using manual Google search (Björk et al., 2010), the later studies use automated methods based on web crawling (Gargouri et al., 2012; Khabsa and Giles, 2014), database searching (Piwowar et al., 2018; Lariviere and Sugimoto, 2018), or a combination of both (Archambault et al., 2013, 2014). One of the two most recent studies has estimated the proportion of OA articles to be at least 28% overall (a finding similar to (Gargouri et al., 2012; Khabsa and Giles, 2014)), with 45% of articles published in 2015 being OA (Piwowar et al., 2018). The most recent study we know of (Lariviere and Sugimoto, 2018) used the same method as (Piwowar et al., 2018), but focused on publications subject to OA policies of selected funders, revealing that two thirds of these publications were available as OA. Two of the studies (Gargouri et al., 2012; Lariviere and Sugimoto, 2018) are of particular interest because they investigated the proportion of OA articles in relation to specific policies. Gargouri et al. (Gargouri et al., 2012) have demonstrated that the proportion of OA articles at institutions with OA policies was three times as high as at institutions without them. 
Interestingly, the study has also shown that not all articles were made available online upon publication; some were instead deposited retrospectively. Lariviere and Sugimoto (Lariviere and Sugimoto, 2018) investigated twelve funders (the European Research Council and eleven funders from the UK, US and Canada) which implemented OA policies. The study has revealed significant differences in the proportion of OA publications between different funders, even when considering funders from the same discipline. In particular, funders which required depositing into a repository upon publication had a significantly higher proportion of OA articles than funders which allowed deposit after publication. While the authors have observed differences between disciplines, finding significant variations between funders within the same discipline has led them to conclude the funding agency may be a stronger driver of OA publishing than the culture within a discipline. The above-mentioned studies look at how many publications are available as OA compared to how many publications appear behind paywalls. However, as Gargouri et al. (Gargouri et al., 2012) have indirectly shown, the open online availability of a publication does not necessarily ensure compliance with a given policy. A number of policies, including the UK REF 2021 OA Policy and the US National Institutes of Health (NIH) Public Access Policy, require deposit by a certain date – three months after acceptance in the case of the REF 2021 OA Policy and upon publication in the case of the NIH Public Access Policy. The approach used by the above-mentioned works would typically mean that even publications which were deposited retrospectively could be considered compliant with these two policies. Only a handful of studies have investigated specific details of existing policies (Vincent-Lamarre et al., 2016; Swan et al., 2015; Khoo and Lay, 2018). Vincent-Lamarre et al. 
(Vincent-Lamarre et al., 2016) analysed research articles published by 67 institutions with an OA mandate, i.e. an OA policy which was mandatory rather than recommended. The studied mandates were broken down into eight specific conditions such as deposit timing and embargo length, and the study investigated how these conditions relate to mandate compliance. They found that one value for three of the eight conditions (immediate deposit required, deposit required for performance evaluation, unconditional opt-out allowed for the OA requirement but no opt-out for deposit requirement) was strongly associated with higher deposit rates as well as with lower deposit time lag. Swan et al. (Swan et al., 2015) have conducted a similar study and compared specific policy conditions with deposit rates and time lag for 122 institutions with mandatory OA policies. As in the case of (Vincent-Lamarre et al., 2016), the authors have identified three criteria which were associated with improved deposit rates (deposit mandatory, deposit cannot be waived, deposit should be linked with research evaluation). Khoo and Lay (Khoo and Lay, 2018) have focused on embargo periods and studied the rate at which neuroscientists in Australia and Canada publish in journals with embargo periods that are not compliant with funder policies, i.e. are longer than 12 months. Interestingly, they observed no reduction in the number of articles published in journals with non-compliant embargo periods after new funder policies were introduced in Australia and Canada, despite these policies being mandatory. In the present work we investigate how much time it takes authors to deposit their articles in OA repositories relative to when these articles are published. Our work differs from the aforementioned studies in a number of ways. 
In contrast to (Vincent-Lamarre et al., 2016) and (Swan et al., 2015) who correlated deposit time lag with specific policy conditions, we instead analyse how deposit time lag differs across a number of dimensions such as country and discipline. We also address what we envision as a future step in assisting the OA movement – automated and reproducible tracking of policy compliance. By utilising the CORE aggregator which harvests content from thousands of repositories globally, we are able to study how many publications get deposited in multiple places and whether recognising these multiple copies can enable faster access to research. Ours is also the first study to quantitatively analyse the UK REF 2021 OA Policy.
In this section, we describe the datasets and the methodology used to answer our research questions. As one of the aims of this work is to study compliance with the UK REF 2021 OA Policy, we start by introducing the policy. Compliance with the REF 2021 OA Policy is met when authors deposit (self-archive) the post-print (also called the “author accepted manuscript,” i.e. author’s final version of the manuscript where all the peer review suggestions have been addressed but without the publisher’s typesetting) into an institutional or a subject repository within three months from the acceptance of the publication (Higher Education Funding Council for England (HEFCE), 2016; Swan, 2014). The policy affects journal articles and conference proceedings with an International Standard Serial Number (ISSN), which constitute the majority (77%) of outputs submitted to the latest REF (Kerridge and Ward, 2014). Although the policy was introduced in 2014, the implementation period started in April 2016 to allow universities to create the necessary infrastructure for tracking compliance. To collect the data needed for the analysis of deposit time lag world-wide, we use the following data sources:
CORE (https://core.ac.uk/) is the world’s largest OA aggregation service (Notay, 2018), collecting OA research outputs from institutional and subject repositories and from journals worldwide (Knoth and Zdrahal, 2012). As such, CORE provides a single interface for accessing data from repositories around the world. At the time of writing, CORE aggregated content from over 3,700 repositories and contained 135 million article records. While other services, such as OpenAIRE and BASE, also aggregate data from repositories, OpenAIRE has a dataset an order of magnitude smaller (25 million records) and neither BASE nor OpenAIRE makes its datasets publicly available for download and analysis. Furthermore, judging from the user interfaces of both, deposit dates do not appear to be available. Subject repositories aggregated by CORE include e-print repositories such as ArXiv, which is often used to deposit pre-prints as well as post-prints. The latest REF 2021 submission guidelines state e-print repositories will be considered acceptable for compliance purposes (UK Research and Innovation, 2019). We have therefore included these repositories in our analysis.
Figure 1 shows Crossref and CORE along with the data they collect and depicts the process of how published articles get entered into these systems. The process is started when an author submits and a publisher accepts a manuscript. The REF 2021 Open Access Policy stipulates that the author’s final version of the manuscript (i.e. the post-print) must be deposited into a repository within three months of acceptance. The acceptance and deposit steps, which are mentioned in the policy, are shown in red in the figure.
Upon receiving the author’s final version of the manuscript, the publisher registers this manuscript with Crossref. Crossref then stores metadata associated with the publication, including the date of publication. Furthermore, once the author’s final version of the manuscript is deposited in a repository, the metadata of the publication including the date it was deposited into the repository is propagated into CORE through its aggregation service. The REF 2021 OA Policy requires papers to be deposited into a repository within a certain time frame relative to the date of acceptance. However, when the policy was introduced, the date of acceptance was not tracked by Crossref or by most repositories and other databases. Although Crossref metadata now contain an accepted field, this field is only populated for a small fraction of publications (this will be further discussed in Section 4.6). Furthermore, while repositories have since the introduction of the policy created infrastructure for recording the acceptance date, the date is unlikely to be available for publications published prior to the policy taking effect and for non-UK publications. Consequently, the acceptance date does not allow us to study compliance with the policy over time or compare the UK to other countries. Therefore, to measure deposit time lag and non-compliance with the policy, we use dates of publication instead of acceptance dates.
As mentioned above, we use Crossref and CORE to collect data for our analysis. More specifically, we use Crossref to obtain publication dates and ISSN numbers, and CORE to obtain deposit dates, repository names, and for institutional repositories also locations (specifically the country of the repository). Additionally, to ensure correct deposit dates for older documents, we have applied the following procedure. CORE harvests documents from repositories using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH, https://www.openarchives.org/pmh/). The OAI-PMH metadata do not contain a deposit date field, but only a last update field. Thus, the last update field will contain the deposit date of an article only until the article’s metadata is updated in the repository. The metadata does not distinguish which version of the article is presented. In September 2018, CORE created infrastructure which allows it to store the first date it receives as the deposit date and any subsequent dates as dates of updates. To ensure correct deposit dates for documents deposited prior to September 2018, we have created web scrapers for the following repositories: repositories using DSpace, EPrints, or Invenio software, and additional individual scrapers for ArXiv and Zenodo. The choice of repositories we created scrapers for was based on a) the availability of deposit dates on the website and b) whether we were able to match a repository page URL to a specific OAI-PMH metadata record. Furthermore, we used Mendeley (https://www.mendeley.com/) to obtain information about publications’ subjects using the profiles of those who read the publications. Mendeley is a reference manager that can be used to manage a research library and provides an API that can be queried to obtain information about how many people have added a certain publication to their libraries. When users create Mendeley accounts, they are asked about their fields of study. 
We have used the information about how many users from each field of study have bookmarked a certain publication to categorise publications into subject categories. The details of how we did this are described in Section 4.5.
3.2. Compliance categories
Based on the available data, for the analysis of the REF 2021 OA Policy we can assign each publication to one of the following compliance categories:
3.2.1. Definitely non-compliant:
a publication has been deposited into a repository and its first date of deposit is later than three months after its original date of publication. This category may not include all non-compliant publications as some may fall into the “likely compliant” category below, depending on their actual date of acceptance. However, using this classification, we can be certain that all publications within the non-compliant category are indeed non-compliant, i.e. this category will have 100% precision but not 100% recall.
3.2.2. Likely compliant:
a publication has been deposited into a repository and its deposit date is within a three months period of its original publication date or earlier. This category may include some non-compliant publications, depending on the actual date of acceptance. However, given the way it’s defined, we can be certain that all truly compliant publications will fall into this category, i.e. this category will have 100% recall but not 100% precision.
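The two categories above can be expressed as a small decision rule. The following is only a sketch under stated assumptions: the function names are hypothetical, and "three months" is interpreted here as three calendar months with the day clamped to the end of shorter months, which the text does not specify.

```python
import calendar
from datetime import date


def add_months(d: date, months: int) -> date:
    """Shift a date by whole calendar months, clamping the day if needed."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def compliance_category(published: date, first_deposit: date) -> str:
    """Assign a deposited publication to one of the two categories.

    Assumption: the deadline is three calendar months after publication.
    Deposits before publication count as "likely compliant".
    """
    deadline = add_months(published, 3)
    if first_deposit > deadline:
        return "definitely non-compliant"
    return "likely compliant"
```

Because the publication date is used as a stand-in for the acceptance date, a deposit flagged as late by this rule is also late relative to acceptance, whereas a deposit that passes may still have missed the real deadline.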
4. Data Preparation
We started by obtaining a complete data dump from Crossref and CORE. Our Crossref data dump was obtained in May 2018 and our CORE dump in March 2019 (the reason why our CORE dump was obtained later was to allow enough time for publications to be deposited and aggregated by CORE). We then filtered out all documents with a missing title, year of publication, or author names. Additionally, we filtered out any Crossref documents where the metadata contained only the year of publication but not the month of publication. If a day of publication, but not the year or month, was missing, we used the first day of the month as the day when the paper was published, e.g. if we knew a paper was published in 2017-09, we replaced the date with 2017-09-01. Finally, we removed all documents from both datasets which were published prior to 2013. After this filtering we were left with 18,753,649 CORE articles and 15,832,311 Crossref articles. Title, year of publication, and the last name of the first author were then used to merge the two datasets. As not all documents in CORE contain a DOI, we were unable to use DOIs to match documents between Crossref and CORE. On the other hand, title, author, and year information are available for most documents. Matching documents by title, year, and first author name is a strict approach which results in lower recall, because authors may not be listed in the correct order, different spelling or hyphenation of the titles and author names may be used, etc. However, this approach produces cleaner and more reliable data (the accuracy of this matching method is 95.27%, a more detailed analysis of the accuracy is provided in Section 4.1) and for the purposes of the analysis this was our aim. Titles and author names were cleaned by removing all characters other than alphanumeric characters and underscores, and by converting all characters to lowercase. 
Additionally, we have normalised the text by replacing accented characters and special characters appearing in non-English alphabets with their non-accented/English versions (e.g. by converting “François” to “Francois”). The data was then merged using exact match on the title, year of publication, and last name of the first author. Because one article can be deposited in multiple repositories (for example if the authors of the article are affiliated with different institutions and all deposit the article in their respective repositories), we have additionally grouped all CORE articles that were matched to the same Crossref article into one record using Crossref DOI. This grouping reduces the size of the dataset by about half a million records and the merged and grouped dataset contains 1,589,469 rows. Finally, we have used our repository scrapers (Section 3.1) to obtain correct deposit dates. We were able to obtain deposit dates for 808,984 documents in our dataset. Table 1 shows the final dataset size.
| Unique CORE articles | 948,044 |
| Unique Crossref articles | 808,984 |
| Links between Crossref & CORE | 985,175 |
| Final dataset size (after grouping) | 808,984 |
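The cleaning and matching steps described above can be sketched as follows. This is an illustration, not the released analysis code: the record field names (`title`, `year`, `author`, `doi`, `repository`) are hypothetical, and Unicode NFKD decomposition is an assumed way of removing accents; characters without a decomposition (for example "ø") are simply dropped by this sketch.

```python
import re
import unicodedata
from collections import defaultdict


def normalise(text: str) -> str:
    """Strip accents, drop non-alphanumeric characters, and lowercase."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    # Keep only letters, digits, and underscores (spaces are removed too).
    return re.sub(r"[^A-Za-z0-9_]", "", ascii_text).lower()


def match_key(title, year, first_author_last_name):
    """Exact-match key: cleaned title, publication year, first author."""
    return (normalise(title), int(year), normalise(first_author_last_name))


def merge(crossref_records, core_records):
    """Group CORE records under the DOI of the Crossref record they match."""
    doi_by_key = {
        match_key(r["title"], r["year"], r["author"]): r["doi"]
        for r in crossref_records
    }
    grouped = defaultdict(list)
    for record in core_records:
        key = match_key(record["title"], record["year"], record["author"])
        doi = doi_by_key.get(key)
        if doi is not None:
            grouped[doi].append(record["repository"])
    return dict(grouped)
```

Grouping by the Crossref DOI is what collapses several repository copies of the same article into a single record.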
4.1. Analysis of our matching method
As the results of our analysis are impacted by the above-mentioned matching method, we need to be confident that the accuracy of the matching is high. To measure this accuracy, we compare DOIs between all pairs of matched documents. There are 985,175 document pairs in total (Table 1), out of which 354,897 (36.02%) do not have a DOI in CORE. Of the remaining 630,278 pairs that have a DOI both in Crossref and in CORE, 595,202 have exactly matching DOIs (94.43% of the 630 thousand pairs) and 35,076 have DOIs that do not match (5.57%). We have investigated the non-matches and observed that they are often caused by minor differences which appear to be errors introduced during deposit in the repository. More specifically, DOIs obtained from CORE often have additional text appended at the end (Table 2, Example 1) while clearly referring to the same document. The opposite scenario cannot be treated the same way, as CORE DOIs with missing characters can often match multiple Crossref DOIs (Table 2, Example 2). There are 5,264 DOI pairs (15.01% of the non-matching DOI pairs) where the Crossref DOI is a substring of the CORE DOI, i.e. the CORE DOI contains additional characters. If we consider these as correct matches, the accuracy of the matching method is 95.27%.
|Crossref DOI 1||10.1088/0031-8949/2013/t156/014026|
|Crossref DOI 2||10.1088/0031-8949/90/9/095101|
Given the 95.27% matching accuracy, we estimate that 338,110 document pairs, which do not have a DOI in CORE, were matched correctly. If we were to match documents by DOIs instead, we would have missed these. Furthermore, evaluating the accuracy of the method would have been more time consuming (it would require a manual check) and would likely be less precise.
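The accuracy figures above can be reproduced from the reported counts. The sketch below recomputes them; the helper `doi_agrees` is a hypothetical name, and lowercasing both DOIs before comparing is an added assumption rather than a step described in the text.

```python
def doi_agrees(crossref_doi: str, core_doi: str) -> bool:
    """True for an exact match or when the CORE DOI has extra characters.

    A truncated CORE DOI is not accepted, because it could match
    several different Crossref DOIs.
    """
    a = crossref_doi.strip().lower()
    b = core_doi.strip().lower()
    return a in b  # also covers a == b


# Counts reported in this section.
pairs_total = 985_175
pairs_without_core_doi = 354_897
pairs_with_both_dois = pairs_total - pairs_without_core_doi  # 630,278
exact_matches = 595_202
crossref_doi_inside_core_doi = 5_264

accuracy = (exact_matches + crossref_doi_inside_core_doi) / pairs_with_both_dois
estimated_correct_without_doi = accuracy * pairs_without_core_doi

print(f"accuracy: {accuracy:.2%}")
print(f"estimated correct pairs without a CORE DOI: {estimated_correct_without_doi:,.0f}")
```

The second figure extrapolates the measured accuracy to the pairs that could not be checked, which assumes those pairs are matched as reliably as the ones with a DOI on both sides.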
4.2. Repository distribution
We are interested in studying the differences in deposit time lag at different institutions. However, Crossref only contains affiliation information for a small subset of the publications in our dataset – 129,405 (~16%) documents have affiliation information for at least one author. Therefore, as an approximation, we use information about publications’ repositories instead, i.e. we assume authors deposit publications into repositories of institutions they are affiliated with. There are 728 unique repositories in the dataset; each publication was deposited into 1.16 repositories on average, and the largest number of repositories per publication is 31. Conversely, there are on average 1,286 publications per repository, while 315 repositories contain fewer than 100 publications and 255 fewer than 50. Appendix A, Table 3 presents the ten largest repositories.
4.3. Country distribution
To assign publications to countries we use information about repository locations. Figure 2 shows the distribution of publications per country for the top 20 countries. Publications affiliated with multiple countries are counted as a full publication for each country (instead of counting only the relevant fraction of the publication).
There are several possible reasons why a large number of publications in our dataset are from the UK. Firstly, the UK had a leading role in the adoption and implementation of repositories compared to other countries. Furthermore, depositing into a repository is now a requirement included in the REF 2021 OA Policy.
4.4. Date of publication
In all experiments we use the date of publication we obtained from Crossref instead of the date of publication from CORE, as Crossref metadata typically contains more detailed information (e.g. year, month, and day vs. just year). Figure 3 shows the age of publications in our dataset. As part of our study we are interested in analysing deposit time lag in the UK with regard to the UK OA policy. To show how many publications in our dataset are from the UK, we distinguish them in the figure by colour – blue represents UK publications, while green represents all other publications.
The drop in publication count in 2018 is due to us not having data for the complete year (we collected data from Crossref in May 2018). The drop in 2017 is likely caused by late deposits – it is possible that some publications from 2017 had not been deposited yet due to looser policy requirements, authors forgetting to deposit, publisher embargoes, etc.
4.5. Subject distribution
Figure 4 shows the subject distribution of publications in our dataset. For publications with multiple subjects we only counted the relevant proportion towards each subject. For example, a publication assigned to two subjects is counted as 0.5 towards each subject.
The subjects were obtained from Mendeley in the following way. We used Crossref DOIs to query the Mendeley API (https://dev.mendeley.com/) to obtain the metadata Mendeley stores for each article. This metadata contains information about how many readers from each of Mendeley’s 28 subjects saved each article in their Mendeley library. Each article was then tagged with the subject in which it accumulated the most readers – e.g. if an article was read by 20 people in “Medicine and Dentistry” and by 5 people in “Immunology”, we would tag the article with the subject “Medicine and Dentistry”. In case multiple subjects had the same number of readers, the article was tagged with all of those subjects. According to (Haunschild and Bornmann, 2016), reader counts in Mendeley tend to be skewed towards certain disciplines. The obtained subjects are therefore only an approximation. We were able to obtain Mendeley metadata for 664,277 publications (~82%). There are 19 readers per publication on average. Using our subject tagging method described above, 86,731 documents were tagged with multiple subjects (~11%). Out of those, 65,419 were tagged with two subjects (75%) and 15,390 with three subjects (18%), while the rest (5,922, or 7%) were tagged with between four and ten subjects. While these numbers are lower than existing estimates of the proportion of interdisciplinary research (Van Noorden et al., 2015), this could be due to our tagging method. Additionally, we manually assigned each of the Mendeley subject categories to one of the four REF 2021 Main Assessment Panels (https://www.ref.ac.uk/about/uoa/). These panels are “A: Medicine, health and life sciences”, “B: Physical sciences, engineering and mathematics”, “C: Social sciences”, and “D: Arts and humanities”. The mapping between Mendeley subjects and REF 2021 panels we used is shown in Appendix A, Table 4. Figure 5 shows the distribution of UK publications in our dataset between the four REF 2021 assessment panels.
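The tagging rule and the fractional counting used for Figure 4 can be illustrated with a short sketch. The input format, a dictionary mapping subject names to reader counts, is an assumption made for illustration and is not the shape of the Mendeley API response.

```python
from collections import defaultdict


def tag_subjects(reader_counts: dict) -> list:
    """Return the subject(s) with the most readers; ties keep all of them."""
    if not reader_counts:
        return []
    top = max(reader_counts.values())
    if top <= 0:
        return []
    return sorted(s for s, n in reader_counts.items() if n == top)


def fractional_subject_counts(publications) -> dict:
    """Count each publication as 1/k towards each of its k tagged subjects."""
    totals = defaultdict(float)
    for reader_counts in publications:
        subjects = tag_subjects(reader_counts)
        for subject in subjects:
            totals[subject] += 1 / len(subjects)
    return dict(totals)
```

For the example in the text, `tag_subjects({"Medicine and Dentistry": 20, "Immunology": 5})` yields only "Medicine and Dentistry", while equal reader counts would keep both subjects and contribute 0.5 to each.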
4.6. Crossref acceptance date
Crossref metadata contains an accepted field which, according to the Crossref API documentation (https://github.com/Crossref/rest-api-doc/blob/master/api_format.md), contains “date on which a work was accepted, after being submitted, during a submission process”. We have analysed this field for the 800 thousand articles in our dataset. However, we found only 975 articles with the date of acceptance populated. Additionally, for 684 (70%) of these the date was the same as the date of publication and for 272 (28%) the date of acceptance was later than the date of publication. The date of acceptance is therefore unavailable in Crossref in 99.9% of cases and, where it is available, incorrect in 98% of cases. We therefore do not use this date in further analysis.
As the REF 2021 Open Access Policy applies only to publications with an ISSN, we have included Crossref ISSNs in our dataset. We found that 55,014 publications do not have an ISSN, of which 12,463 are from a UK institution. We have excluded these 12 thousand publications from our analysis of compliance with the REF 2021 OA Policy, as the policy does not apply to them.
To calculate deposit time lag for publications in our dataset, we subtracted dates of publication from deposit dates and expressed the difference in days. As a result, negative values mean an article was deposited before being published and positive values mean it was deposited after being published. A histogram of deposit time lag for all publications in our dataset is shown in Appendix B, Figure 14.
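As a minimal sketch of this calculation (the function name `deposit_time_lag` and the example dates are illustrative assumptions, not taken from the released source code):

```python
from datetime import date


def deposit_time_lag(published: date, deposited: date) -> int:
    """Days from publication to deposit; negative if deposited before publication."""
    return (deposited - published).days


# Hypothetical example dates, not taken from the dataset.
early = deposit_time_lag(date(2017, 3, 1), date(2017, 2, 20))
late = deposit_time_lag(date(2017, 3, 1), date(2017, 6, 1))
print(early)  # -9: deposited nine days before publication
print(late)   # 92: deposited 92 days after publication
```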
5.1. Deposit time lag per country
Figure 6 reveals significant differences in deposit time lag between the five countries with the highest number of publications in our dataset. UK publications appear to have the shortest deposit time lag of all five countries, with a large number of articles deposited before or at the time of publication. US publications display a similar pattern; however, deposit time lag in the US peaks a few weeks after publication. On the other hand, Italy, Switzerland, and the Netherlands show a long-tail distribution where deposits peak at the time of publication but decrease more slowly than in the UK and the US. Furthermore, a large proportion of publications from these countries is deposited with long delays.
Next, we wanted to compare how deposit time lag in these countries has changed over time. One way of doing this is to use all data available to us to calculate the average deposit time lag per country and year. This approach has a limitation, which we illustrate with the following example. Consider the deposit dates present in our dataset for articles published in 2013 and in 2017. While articles published in 2013 had just over six years during which they could have been deposited in a repository (our dataset extends to early 2019), publications from 2017 had a much shorter time to appear in a repository. It is possible that some publications from both years have not been deposited yet, but this is more likely for publications from 2017. As a result, this approach slightly underestimates deposit time lag for all publication years, but especially for newer publications.

Another option is to set a maximum limit on deposit time lag and filter out all publications deposited after that limit. To give an example, consider limiting deposit time lag to one year. In this case, only publications from 2013 that were deposited within a year of their publication date (and none of the publications deposited later) would be compared to the same set from 2017. This approach also slightly underestimates deposit time lag for all years, but especially for older publications, because late deposits have become less common over time.

As we are not aware of a better way to compare deposit time lag across years that would alleviate the limitations of both approaches at the same time, we use the two approaches in conjunction. Figures 7 and 8 show average deposit time lag per year and country.
For Figure 7, the deposit time lag was calculated using all available data, while for Figure 8 it was calculated using a maximum deposit time lag limit of one year. In Figure 8, the year 2018 was excluded, as we do not have a complete year of data for it. An additional figure created by applying a maximum deposit time lag limit of two years is shown in Appendix B, Figure 15.
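The two ways of averaging described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the tuple layout `(country, year, lag_days)`, the function name `average_lag`, and the sample values are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def average_lag(records, max_lag_days=None):
    """Mean deposit time lag per (country, publication year).

    records: iterable of (country, year, lag_days) tuples.
    max_lag_days: if given, publications deposited later than this are dropped.
    """
    groups = defaultdict(list)
    for country, year, lag in records:
        if max_lag_days is not None and lag > max_lag_days:
            continue
        groups[(country, year)].append(lag)
    return {key: mean(lags) for key, lags in groups.items()}


# Hypothetical records, not taken from the dataset.
records = [
    ("UK", 2013, 30), ("UK", 2013, 900),
    ("UK", 2017, -10), ("UK", 2017, 50),
]

all_data = average_lag(records)                     # all available data
within_year = average_lag(records, max_lag_days=365)  # one-year limit
print(all_data[("UK", 2013)], within_year[("UK", 2013)])
```

In this toy example the single late deposit from 2013 dominates the unrestricted average, while the one-year limit removes it entirely, which is the trade-off between the two approaches discussed in the text.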
The figures reveal several interesting trends. Since 2016, the deposit time lag of UK publications has been the lowest of all five countries, and in 2018 it is negative in Figure 7 (-3.69 days). However, this has not always been the case: when considering all data including late deposits (Figure 7), the UK was fourth of the selected five countries in 2013 and 2014. Interestingly, this change in average deposit time lag in the UK coincides with the introduction of the REF 2021 OA Policy in 2014. When considering only publications deposited within a year (Figure 8), the UK started as the first of the selected five countries; however, its average deposit time lag increased in 2014. A possible explanation is the introduction of the REF 2021 OA Policy: researchers started shifting their deposit habits to comply with the policy and, as a result, deposited more often, but it took time for this shift to become common practice.

There has been a decreasing trend in deposit time lag for all countries, particularly since 2016. Italy has seen the largest decrease in average deposit time lag, from 706 days in 2013 to 48 days in 2018 in Figure 7, and from 244 days in 2013 to 86 days in 2017 in Figure 8. In 2013, the Italian government passed legislation requiring all research in which at least 50% of funding was public funding to be made OA (OpenAIRE, 2019). While we are not aware of any specific deposit time frames associated with this requirement, it is possible that it affected deposit practice.

Finally, we analyse deposit time lag with respect to the UK REF 2021 Open Access Policy. To do this, we assign each UK publication to one of the two compliance categories described in Section 3.2: “definitely non-compliant” (publications with a deposit time lag of more than 90 days) and “likely compliant” (publications with a deposit time lag of 90 days or less). The proportion of publications belonging to each category per year is shown in Figure 9.
The figure shows that prior to the REF 2021 OA Policy taking effect in 2016, more than 50% of publications each year were deposited later than three months after the date of publication. However, the situation changed after the policy took effect in April 2016. In 2017, 80% of papers were made available in an OA repository within three months of the date of publication, or even earlier. While we do not yet have complete data for 2018 (our sample contains data until May 2018), we can observe that compliance is still increasing.
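The assignment to the two compliance categories and the yearly proportions can be sketched as below. The 90-day threshold comes from the text; the function names, the `(year, lag_days)` record layout, and the sample values are assumptions made for illustration only.

```python
from collections import defaultdict

LIMIT_DAYS = 90  # threshold between the two categories, as stated in the text


def compliance_category(lag_days, limit_days=LIMIT_DAYS):
    """Return the compliance category for a deposit time lag in days."""
    if lag_days <= limit_days:
        return "likely compliant"
    return "definitely non-compliant"


def likely_compliant_share(records):
    """Proportion of likely compliant publications per publication year."""
    totals = defaultdict(int)
    compliant = defaultdict(int)
    for year, lag in records:
        totals[year] += 1
        if compliance_category(lag) == "likely compliant":
            compliant[year] += 1
    return {year: compliant[year] / totals[year] for year in totals}


# Hypothetical (year, lag_days) records, not taken from the dataset.
records = [(2013, 400), (2013, 20), (2017, -5), (2017, 90), (2017, 91), (2017, 10)]
shares = likely_compliant_share(records)
print(shares)
```

Note that a lag of exactly 90 days falls into the "likely compliant" category, matching the "90 days or less" wording.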
5.2. Deposit time lag per repository
Our next question is whether there is a difference between deposit time lag of different repositories and how this has changed over time. Figure 10 shows deposit time lag per year for all repositories with more than 100 publications in a given year. To produce this figure, we have calculated the following two statistics for each repository:
Single repository deposit time lag. Deposit time lag with respect to the publications’ deposit date in a given repository. In this case, we do not take into account that a publication may have been deposited into multiple repositories. For example, if a publication was deposited into the University of Cambridge repository, we only consider the date of deposit into this repository.
Any repository deposit time lag. Deposit time lag calculated with respect to the publications’ deposit date in any repository. For example, if a publication was deposited into the University of Cambridge repository as well as elsewhere, we simply use the first of the two dates to calculate deposit time lag.
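The difference between the two statistics can be illustrated with a short sketch. The repository names, dates, and function names below are hypothetical; the only element taken from the text is the rule that the "any repository" variant uses the earliest deposit date.

```python
from datetime import date


def single_repository_lag(published, deposits, repository):
    """Lag in days using only the deposit date in the given repository."""
    return (deposits[repository] - published).days


def any_repository_lag(published, deposits):
    """Lag in days using the earliest deposit date in any repository."""
    return (min(deposits.values()) - published).days


# Hypothetical publication deposited into two repositories.
published = date(2016, 5, 1)
deposits = {
    "Repository A": date(2016, 7, 1),
    "Repository B": date(2016, 5, 11),
}

print(single_repository_lag(published, deposits, "Repository A"))  # 61
print(any_repository_lag(published, deposits))                     # 10
```

By construction, the "any repository" lag can never exceed the "single repository" lag for the same publication.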
To produce the full lines in Figure 10, we have sorted the repositories according to their “single repository deposit time lag” values from the lowest to the highest. The dashed lines were produced the same way, but using the “any repository deposit time lag” values. The figure reveals significant differences between repositories, which have reduced over time but remain high. For 2013 publications, the difference between the repository with the lowest and the highest “single repository deposit time lag” was 1,982 days, and the standard deviation across all repositories was 377 days. For 2017 publications, these numbers dropped to 991 days and 108 days, respectively. The figure also reveals that by aggregating data from all repositories, the deposit time lag can be lowered.

We have produced a similar figure for UK repositories showing the proportion of “likely compliant” publications per repository. Similarly to Figure 10, Figure 11 was produced by calculating two statistics for each repository: single repository compliance (full lines), i.e. the proportion of likely compliant publications when considering deposits only in a single repository, and any repository compliance (dashed lines), i.e. the proportion of likely compliant publications with respect to their deposit date in any repository. In both cases, the repositories were sorted from the most to the least compliant. It can be seen that repository compliance has increased rapidly from 2014 onward, particularly between 2015 and 2016. As the UK REF 2021 OA Policy was introduced in 2014, it may be one of the reasons for this increase. The figure also shows that aggregating research outputs from multiple repositories may help improve repository compliance.
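The sorting and the spread statistics quoted for Figure 10 (range and standard deviation across repositories) can be sketched as follows. The text does not say whether a sample or population standard deviation was used, so the use of the sample standard deviation here is an assumption, as are the function name `spread` and the example values.

```python
from statistics import stdev


def spread(avg_lag_by_repository):
    """Sort per-repository average lags and summarise their spread.

    Returns the sorted values, their range, and the sample standard deviation.
    """
    values = sorted(avg_lag_by_repository.values())
    return values, values[-1] - values[0], stdev(values)


# Hypothetical per-repository average deposit time lags in days.
avg_lag = {"Repository A": 100.0, "Repository B": 10.0, "Repository C": 40.0}

ordered, lag_range, lag_std = spread(avg_lag)
print(ordered)    # values in the order used for plotting a line
print(lag_range)  # difference between the highest and lowest average lag
print(round(lag_std, 2))
```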
5.3. Deposit time lag per subject
Finally, we investigated whether there were any differences in deposit time lag between different subjects. Figure 12 shows average deposit time lag per subject in 2013 and 2017. To produce this figure, we removed a single subject (Decision Sciences) with fewer than 100 publications in one year. The figure shows that while there were significant differences between subjects in 2013, these had largely diminished by 2017. The figure also reveals smaller differences between subjects than the differences observed between repositories shown in Figure 10. In 2013, the difference between the highest and the lowest average deposit time lag per subject was 532 days and the standard deviation across all subjects was 107 days. In 2017, the range was 295 days and the standard deviation was 57 days.
On the other hand, as we have shown in Section 5.2, the range and standard deviation across all repositories were 1,982 and 377 days in 2013, and 991 and 108 days in 2017. If we consider only publications from a single subject, the differences between repositories remain high. For example, using only publications from “Physics and Astronomy” (our largest subject), the range and standard deviation were 1,787 and 370 days in 2013, and 940 and 174 days in 2017. The situation is similar for other subjects. This suggests that institutional policies, particularly when harmonised with funder policies, may be stronger drivers of OA than disciplinary culture. Finally, Figure 13 shows the proportion of likely compliant and non-compliant publications across the four main REF 2021 assessment panels (Section 4.5) in 2013 and in 2017. The figure shows that there has been a significant increase in compliance over the five-year period, which has been similar across all four panels.
Our findings indicate that deposit time lag has been decreasing globally. However, we have observed major differences in deposit time lag across institutions and significant differences between subjects. Furthermore, we have shown that the deposit time lag has been shortening over the last five years both globally and in the UK. Our results suggest that the REF 2021 OA Policy likely helped to reduce deposit time lag.

The results outlined in this paper represent a preliminary study of deposit time lag and compliance with existing OA policies. There are many areas where this study could be enhanced and broadened. The matching of articles between Crossref and CORE was done by means of the articles’ metadata (titles, years of publication, and first author names). This is a strict approach that may result in lower recall due to minor differences in metadata, such as authors listed in an incorrect order, typos, differences in punctuation, etc. While our present study has been precision oriented, i.e. our aim was to produce data that is as clean as possible, in the future we would like to improve our recall. This would also allow us to study deposit rates, i.e. the proportion of articles that get deposited into OA repositories compared to articles that do not, in addition to deposit time lag. Improving our recall could be done in a number of ways. For example, in addition to the metadata we already use for the matching, we could utilise all other metadata available to us, such as abstracts, and employ looser matching techniques such as those used in article deduplication (Jiang et al., 2014).

For this initial study, we make the assumption that if a metadata record is in the repository, the full text is also deposited. This is because validating whether the full text is deposited is a complicated process which is outside the scope of this work.
The OAI-PMH protocol does not guarantee that a link to the publication full text will be in the metadata, even if the full text was deposited into the repository. To check whether an article's full text was deposited, we would have to crawl all links provided in the OAI-PMH metadata and correctly match the identified documents to the publication metadata. Therefore, as our present study focuses on deposit time lag rather than the presence of the full text, we decided not to perform this check. As our analysis relies on deposit dates, publications that have never been deposited into a repository are not included in our study. Consequently, the proportion of publications that are potentially compliant with the REF 2021 OA Policy is compared against non-compliant but deposited publications, rather than against all publications. To quantify missing deposits, we would have to be able to correctly match all CORE publications to their Crossref metadata. This is outside the scope of our study, which focuses on deposit time lag rather than on the proportion of missing deposits. However, to allow as many publications as possible to be included in our study, we collected deposit dates in March 2019, almost a year after collecting publication metadata (May 2018).
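To illustrate the precision and recall trade-off of the metadata-based matching discussed in this section, the sketch below contrasts a strict matching key with a looser one. Both key functions, the normalisation rules, and the sample records are hypothetical; they are not the matching procedure used in the study, which is only described as relying on titles, years of publication, and first author names.

```python
import string

# Translation table that deletes ASCII punctuation characters.
_PUNCT = str.maketrans("", "", string.punctuation)


def strict_key(title, year, first_author):
    """Strict key: case-insensitive, but punctuation and spacing must agree."""
    return (title.strip().lower(), year, first_author.strip().lower())


def loose_key(title, year, first_author):
    """Looser key: also ignores punctuation and repeated whitespace."""
    def norm(text):
        return " ".join(text.lower().translate(_PUNCT).split())
    return (norm(title), year, norm(first_author))


# Hypothetical records describing the same article in two sources.
crossref_record = ("Open Access: A Study", 2017, "Smith, J.")
core_record = ("Open access - a study", 2017, "Smith J")

print(strict_key(*crossref_record) == strict_key(*core_record))  # False
print(loose_key(*crossref_record) == loose_key(*core_record))    # True
```

The looser key recovers this pair, but any relaxation of this kind also increases the risk of false matches, which is why a precision-oriented study might prefer the strict variant.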
The aim of this study was to investigate how much time it takes authors to deposit their articles in OA repositories in relation to when these articles are published. Furthermore, our goal was to investigate whether OA policies might have reduced this time, and whether compliance with such policies can be effectively tracked. We collected dates of publication and deposit dates for 800 thousand articles published around the world between 2013 and 2018, and compared the difference between these dates across time, country, subject, and repository. We have shown that the time between publication and deposit decreased significantly over the 2013-2017 period globally, by 472 days per country on average across all countries in our dataset. We have also shown that after the introduction of the UK REF 2021 OA Policy, this decrease accelerated in the UK, and in 2018 the mean difference between publication and deposit dates became negative (-3.69 days), meaning that, as of early 2018, UK publications on average potentially become OA immediately or even slightly before publication. The key message of our paper is that this observation supports the argument for the inclusion of a strictly time-limited deposit requirement in OA policies. Furthermore, our work demonstrates that countries which now have a time frame on deposits included in their OA policies can develop reliable tracking mechanisms for monitoring the effects of such policies. Based on the presented methodology, we have developed a tool for tracking the time lag between article publication and deposit, which relies on data from thousands of repositories. We hope the tool will be useful to authors, funders, and institutions who intend to improve the accessibility of research and compliance with existing OA policies.
To support further studies on the deposit of research outputs in OA repositories, we release our dataset of 800 thousand publications and the source code of our analysis (http://github.com/oacore/jcdl_2019).
- Archambault et al. (2014) Eric Archambault, Didier Amyot, Philippe Deschamps, Aurore Nicol, Francoise Provencher, Lise Rebout, and Guillaume Roberge. 2014. Proportion of open access papers published in peer-reviewed journals at the European and world levels–1996-2013. Report, European Commission DG Research & Innovation (2014).
- Archambault et al. (2013) Eric Archambault, Didier Amyot, Philippe Deschamps, Aurore Nicol, Lise Rebout, and Guillaume Roberge. 2013. Proportion of open access peer-reviewed papers at the European and world levels–2004-2011. Report, European Commission DG Research & Innovation (Aug 2013).
- Björk et al. (2010) Bo-Christer Björk, Patrik Welling, Mikael Laakso, Peter Majlender, Turid Hedlund, and Gudni Gudnason. 2010. Open access to the scientific journal literature: situation 2009. PloS one 5, 6 (Jan. 2010), e11273.
- Chan et al. (2002) Leslie Chan, Darius Cuplinskas, Michael Eisen, Fred Friend, Yana Genova, Jean-Claude Guedon, Melissa Hagemann, Stevan Harnad, Rick Johnson, Rima Kupryte, Manfredi La Manna, Istvan Rev, Monika Segbert, Sidnei de Souza, Peter Suber, and Jan Velterop. 2002. Budapest Open Access Initiative. https://www.budapestopenaccessinitiative.org/read. Accessed: 2018-11-19.
- European Commision (2018) European Commision. 2018. ’Plan S’ and ’cOAlition S’ – Accelerating the transition to full and immediate Open Access to scientific publications. https://europa.eu/!hw84rX. Accessed: 2018-11-20.
- Gargouri et al. (2012) Yassine Gargouri, Vincent Larivière, Yves Gingras, Les Carr, and Stevan Harnad. 2012. Green and Gold Open Access Percentages and Growth, by Discipline. arXiv e-prints (Jun 2012). Preprint, https://arxiv.org/abs/1206.3664.
- Haunschild and Bornmann (2016) Robin Haunschild and Lutz Bornmann. 2016. Normalization of Mendeley reader counts for impact assessment. Journal of Informetrics 10, 1 (2016), 62–73.
- Higher Education Funding Council for England (HEFCE) (2016) Higher Education Funding Council for England (HEFCE). 2016. Policy for Open Access in Research Excellence Framework 2021. http://webarchive.nationalarchives.gov.uk/20180319114140/http://www.hefce.ac.uk/pubs/year/2016/201635/. Accessed: 2018-10-10.
- Jiang et al. (2014) Yu Jiang, Can Lin, Weiyi Meng, Clement Yu, Aaron M Cohen, and Neil R Smalheiser. 2014. Rule-based deduplication of article records from bibliographic databases. Database 2014 (2014).
- Kerridge and Ward (2014) Simon Kerridge and Phil Ward. 2014. Open access for REF2020. Insights 27, 1 (2014). https://doi.org/10.1629/2048-7754.115
- Khabsa and Giles (2014) Madian Khabsa and C Lee Giles. 2014. The number of scholarly documents on the public web. PloS one 9, 5 (2014), e93949.
- Khoo and Lay (2018) Shaun Yon-Seng Khoo and Belinda Po Pyn Lay. 2018. A very long embargo: Journal choice reveals active non-compliance with funder open access policies by Australian and Canadian neuroscientists. Liber Quarterly 28, 1 (2018).
- Knoth and Zdrahal (2012) Petr Knoth and Zdenek Zdrahal. 2012. CORE: Three Access Levels to Underpin Open Access. D-Lib Magazine 18, 11/12 (nov 2012). https://doi.org/10.1045/november2012-knoth
- Lariviere and Sugimoto (2018) Vincent Lariviere and Cassidy R. Sugimoto. 2018. Do authors comply when funders enforce open access to research? Nature 562 (2018), 483–486. https://doi.org/10.1038/d41586-018-07101-w
- Notay (2018) Balviar Notay. 2018. CORE becomes the world’s largest aggregator. https://scholarlycommunications.jiscinvolve.org/wp/2018/06/01/core-becomes-the-worlds-largest-aggregator/. Accessed: 2019-03-19.
- OpenAIRE (2019) OpenAIRE. 2019. Italy: Open Science Policy. https://www.openaire.eu/item/italy. Accessed: 2013-03-19.
- Picarra (2015) Mafalda Picarra. 2015. Monitoring Compliance with Open Access Policies.
- Piwowar et al. (2018) Heather Piwowar, Jason Priem, Vincent Lariviere, Juan Pablo Alperin, Lisa Matthias, Bree Norlander, Ashley Farley, Jevin West, and Stefanie Haustein. 2018. The State of OA: A large-scale analysis of the prevalence and impact of Open Access articles. PeerJ 6 (2018), e4375.
- Publishers Communication Group (2017) Publishers Communication Group. 2017. Library Budget Predictions for 2017. http://www.pcgplus.com/wp-content/uploads/2017/05/Library-Budget-Predictions-for-2017-public.pdf. Accessed: 2018-11-19.
- Research Excellence Framework (2014) Research Excellence Framework. 2014. REF2014: Key Facts. https://www.ref.ac.uk/2014/media/ref/content/pub/REF%20Brief%20Guide%202014.pdf. Accessed: 2018-11-20.
- Sample (2012) Ian Sample. 2012. Harvard University says it can’t afford journal publishers’ prices. https://www.theguardian.com/science/2012/apr/24/harvard-university-journal-publishers-prices. The Guardian (24 Apr 2012). Accessed: 2018-11-19.
- Shieber (2013) Stuart Shieber. 2013. Why open access is better for scholarly societies. https://blogs.harvard.edu/pamphlet/2013/01/29/why-open-access-is-better-for-scholarly-societies/. Accessed: 2018-11-19.
- Suber (2003) Peter Suber. 2003. The taxpayer argument for open access. SPARC Open Access Newsletter (4 Sep 2003). http://nrs.harvard.edu/urn-3:HUL.InstRepos:4725013
- Swan (2014) Alma Swan. 2014. HEFCE announces Open Access policy for the next REF in the UK: Why this Open Access policy will be a game-changer. Impact of Social Sciences Blog (2014).
- Swan et al. (2015) Alma Swan, Yassine Gargouri, Megan Hunt, and Stevan Harnad. 2015. Working Together to Promote Open Access Policy Alignment in Europe. Work Package 3 Report: Open Access Policies. Deliverable 3.1: Report on policy recording exercise, including policy typology and effectiveness and list of further policymaker targets.
- UK Research and Innovation (2019) UK Research and Innovation. 2019. Guidance on Submissions. https://www.ref.ac.uk/media/1092/ref-2019_01-guidance-on-submissions.pdf. Accessed: 2019-04-05.
- U.S. Agency for International Development (2013) U.S. Agency for International Development. 2013. Public Access Plan: Increasing Access to the Results of Federally Funded Scientific Research. https://www.usaid.gov/sites/default/files/documents/15396/USAID_PublicAccessPlan.pdf. Accessed: 2018-11-20.
- Van Noorden et al. (2015) Richard Van Noorden et al. 2015. Interdisciplinary research by the numbers. Nature 525, 7569 (2015), 306–307.
- Vincent-Lamarre et al. (2016) Philippe Vincent-Lamarre, Jade Boivin, Yassine Gargouri, Vincent Larivière, and Stevan Harnad. 2016. Estimating open access mandate effectiveness: the MELIBEA score. Journal of the Association for Information Science and Technology 67, 11 (2016), 2815–2828.
- Xia et al. (2012) Jingfeng Xia, Sarah B Gilchrist, Nathaniel XP Smith, Justin A Kingery, Jennifer R Radecki, Marcia L Wilhelm, Keith C Harrison, Michael L Ashby, and Alyson J Mahn. 2012. A review of open access self-archiving mandate policies. portal: Libraries and the Academy 12, 1 (2012), 85–102.
Appendix A Data preparation and statistics
|ArXiv e-Print Archive||97,594|
|White Rose Research Online||24,019|
|Utrecht University Repository||20,304|
|Università di Roma La Sapienza Repository||14,795|
|Online Research @ Cardiff||14,261|
|Università di Padova Repository||14,077|
| Mendeley subject | REF Main Panel |
|---|---|
| Agricultural and Biological Sciences | A |
| Arts and Humanities | D |
| Biochemistry, Genetics and Molecular Biology | A |
| Business, Management and Accounting | C |
| Earth and Planetary Sciences | B |
| Economics, Econometrics and Finance | C |
| Immunology and Microbiology | A |
| Medicine and Dentistry | A |
| Nursing and Health Professions | A |
| Pharmacology, Toxicology and Pharmaceutical Science | A |
| Physics and Astronomy | B |
| Sports and Recreations | C |
| Veterinary Science and Veterinary Medicine | A |
Appendix B Analysis results
Figure 14 shows a histogram of deposit time lag in days for all publications in our dataset. The vertical red line in the figure represents 3 months after the date of publication which is the cut-off between our “likely compliant” and “definitely non-compliant” categories (Section 3.2), meaning that all publications that fall on the right side of the line would not be compliant with the REF 2021 OA Policy. The maximum deposit time lag in our dataset is 2,241 days. This large time lag is possible, as the earliest date of publication in our dataset is January 1, 2013, while the deposit dates were collected between March 7 and March 18, 2019 – the difference between January 1, 2013 and March 18, 2019 is 2,268 days. The graph shows a large portion of articles in our dataset was deposited retrospectively many years after publication.
Figure 15 shows average deposit time lag per country and year. To prepare the figure, data was first filtered by removing all publications which were deposited later than two years after being published.