How is research carried out? For example, how is a journal article produced? Individual researchers can answer these questions from experience, because they actually conduct research and generally summarize their findings in the form of papers. However, such answers are individual episodes, and it is not always clear whether they are common or to what extent they vary.
In recent years, digital transformation (DX) has been progressing, and various activities have come to be conducted mainly on the Internet. As a result, it has become possible to observe and analyze the processes of various social phenomena, which was difficult in the past, and research fields such as computational social science are gaining momentum.
In the early days of the Internet, it was actively used for exchanging research information, including papers, and the progress of DX described above has now made it possible to visualize and analyze the research process itself. For example, as of 2021, in the fields of physics and information science, the research style of publishing preprints, which are drafts of papers before submission to journals, through preprint servers has become widespread. In addition, with the worldwide spread of COVID-19, this style is also becoming popular in the fields of medicine and biology. This has made it possible to examine the differences between a preprint and the journal article it finally becomes, as well as the differences between preprints that have become journal articles and those that have not.
In this paper, we report the results of a survey on the differences between journal articles and their preprints, and between preprints that have become journal articles and those that have not, using bioRxiv, a preprint server for the biological sciences, with the ultimate goal of understanding how research is conducted.
2 Related Works
Klein et al. [Klein19] compare journal articles with their preprints, which is very similar to the content of this paper. They compared more than 12,000 preprints and their journal versions on arXiv and bioRxiv, and reported that there was little difference between them. In the case of bioRxiv, their analysis covered the period from the site’s launch in November 2013 to November 2016, with a total of 7,000 preprints and a relatively small number of full-text articles (220). For the content, they adopted bag-of-words cosine similarity, where each word is an independent dimension. As described below, this paper is limited to bioRxiv and to preprints submitted in 2019. It covers more than 20,000 preprints in total and uses about 7,000 full-text articles, so it examines a more recent state of bioRxiv more extensively. As for the content, similarity is judged based on distributed representations.
Carneiro et al. [Carneiro20] also analyzed the change in quality for papers that were posted on bioRxiv and later published in journals indexed in PubMed, and reported that the quality of peer-reviewed journal articles was slightly higher than that of preprints, but the difference was not significant. Carneiro’s study covers bioRxiv preprints submitted in 2016, and the change in quality is assessed mainly through questionnaires. Because it is questionnaire-based, only 56 pairs of preprints and journal articles were used in the analysis. This paper differs in that Carneiro’s survey period is relatively old (2016) and a small number of items were surveyed on a questionnaire basis.
Akbaritabar and Stephen [Akbaritabar21] investigate the difference in references between preprints and journal articles. In addition to quantitatively examining more than 6,000 pairs of preprints and journal articles, they also have human readers evaluate about 100 of them to classify the types of context added and to identify differences by field. The difference is that their paper focuses on references and attempts a precise analysis, while this paper provides a comprehensive overview of various indicators.
3 Data Description
In this section, we will discuss the analysis data.
3.1 Preprint Server; bioRxiv
In this study, we selected bioRxiv, which is used in the field of biological science, as the preprint server. The reasons are described below.
As a preprint server, arXiv, which is often used in the fields of physics and information science, is well known, has the longest history, and has a large number of submissions and downloads (the numbers of submissions and downloads can be checked at https://arxiv.org/help/stats). From this point of view, arXiv is considered to be appropriate. However, in the field of information science, journal papers are not necessarily important, and the proceedings of top conferences have the same value as journal papers, which complicates the discussion when we consider comparison with journal papers. In addition, although the sources of the manuscripts are available in TeX format, analysis is not always easy because of the high degree of freedom in customization. Furthermore, it is not clear whether the sources of journal papers and proceedings of top conferences are available, or whether they can be obtained in HTML or XML format for easy analysis.
On the other hand, in fields other than information science, the proceedings of top conferences are not generally regarded as being as valuable as journal articles. In addition, bioRxiv provides full-text XML in JATS (Journal Article Tag Suite) format. Many of the journal articles that originated as bioRxiv preprints are open access, and many of them can also be obtained as full-text XML in JATS format, as described below. We selected bioRxiv because its preprints are expected to lead to journal articles as the final output, and because the data of preprints and journal articles can be obtained in the same format. The disadvantage of bioRxiv is that the number of analysis targets is less than 10,000, because bioRxiv is not yet as widely used as arXiv.
3.2 Target of analysis (period), number of data
This time, we chose preprints submitted in 2019 as the target (period) for analysis. The reasons are as follows.
First, the field of biological science is also closely related to COVID-19, and a certain number of COVID-19-related contributions have been observed since 2020. Although these numbers are not large in the context of bioRxiv as a whole, they may act as a disturbance and complicate the discussion. Second, using older data can eliminate these effects, but if the data is too old, the possibility that it is not applicable to the current situation increases due to changes in trends.
Here, COVID-19-related preprints started to increase around the end of December 2019 [Koshiba20], so the 2019 data are relatively recent and are essentially unaffected by COVID-19. In addition, as previously reported, bioRxiv preprints that are accepted by journals are published as journal articles within 6 to 8 months on average from the time of preprint registration [Abdill19, Hayashi21]. In this paper, the analysis was conducted in May 2021, almost 17 months after the end of December 2019, so we can expect that the conversion of preprints registered in 2019 into journal articles is almost complete and that few more of them will become journal articles in the future.
The number of journal articles with full-text XML for this analysis is shown below.
- Number of journal preprints
- Number of journal articles
- Number of full-text XML holdings
Although preprints can be revised as many times as needed, only the first version is considered in this analysis. This is because the first version exists for every submitted preprint and is easy to handle. If the version closest to the journal article were used, the discussion would become more complicated: whether to look at only the latest version even if the preprint has not been published in a journal, how to handle updates made after publication in the journal, and whether to look at the difference between the first version and the latest version.
3.3 Data acquisition method
The data were obtained using the following methods.
For bioRxiv, detailed information on each preprint was obtained through the bioRxiv API (https://api.biorxiv.org/) in April 2021. Since the detailed information includes “jats xml path”, we retrieved the full-text XML from those URLs. Several hundred URLs returned an error when accessed, and these were excluded from the analysis.
For journal articles, there are several steps to follow.
First, a preprint is linked to a journal article as follows. The aforementioned detailed information of each preprint contains “doi”, and the metadata for that DOI is obtained through the CrossRef API (https://api.crossref.org/). The metadata has an attribute “is-preprint-of”, whose value is set when there is a journal article based on the preprint. We used the cases where the “id-type” of “is-preprint-of” is “doi” and a specific ID (DOI) is set. Therefore, a journal article in this paper is one that a preprint is declared to be a preprint of, and all journal articles in this paper have a DOI.
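The linkage rule above can be sketched as follows. This is a minimal illustration, assuming the Crossref metadata for a preprint DOI has already been fetched and parsed into a Python dictionary in which the relation appears under a `relation` key. The function name `journal_doi_of` and the sample record are hypothetical and are not taken from the original analysis.

```python
from typing import Optional


def journal_doi_of(work: dict) -> Optional[str]:
    """Return the DOI of the journal article a preprint points to, if any."""
    relations = work.get("relation", {}).get("is-preprint-of", [])
    for rel in relations:
        # Keep only relations whose target is identified by a DOI.
        if rel.get("id-type") == "doi" and rel.get("id"):
            return rel["id"]
    return None


# Hypothetical records shaped like the metadata described above.
sample = {
    "DOI": "10.1101/000000",
    "relation": {"is-preprint-of": [{"id-type": "doi", "id": "10.0000/example"}]},
}
linked = journal_doi_of(sample)
unlinked = journal_doi_of({"DOI": "10.1101/111111"})
```

A preprint for which the function returns `None` would fall into the group of preprints that are not associated with a journal article.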
Next, we obtain the full-text XML of journal articles in two steps. For many open access journals, PubMed Central (PMC) provides full-text XML in JATS format centrally in the PMC Open Access Subset (https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/). We collected the data using the PubMed Central OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) service (https://www.ncbi.nlm.nih.gov/pmc/tools/oai/).
PMC OAI-PMH retrieves data based on PMCIDs, which are IDs uniquely assigned to each article by PMC, so we first converted DOIs to PMCIDs through PMC’s ID Converter API (https://www.ncbi.nlm.nih.gov/pmc/tools/id-converter-api/) and then retrieved the data from OAI-PMH. Not all DOIs can be converted to PMCIDs, and the existence of a PMCID does not necessarily mean that full-text XML exists. However, since it is difficult to confirm the existence of full-text XML in advance, we requested data from OAI-PMH whenever a PMCID was obtained.
This section describes the analysis method.
4.1 Object of comparison
We set up three patterns for comparison.
First, we compare a preprint that became a journal article (hereafter, a journal preprint) with that journal article. Second, we compare the journal preprint with another, similar preprint. Finally, as a baseline, we compare randomly combined pairs of preprints and journal articles.
“Similar preprint” requires some explanation, which is given later.
4.2 Selection of feature values
The following features were selected to observe the differences: number of authors, number of references, number of paragraphs, number of words, chapter titles, and similarity of contents.
For the numbers of authors and references, in addition to a simple comparison of the counts, we also examined the degree to which the author names and reference titles match. For chapter titles, in addition to the number of matches, we also looked at the number of words that are present in only one of the titles.
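As an illustration of how such counts might be read from JATS full-text XML, the sketch below uses Python's standard `xml.etree.ElementTree`. It assumes un-namespaced JATS elements (`contrib`, `ref`, `p`, `sec`/`title`) and counts every `p` element in the document, which may differ from the counting rules actually used in this paper. The function name `external_features` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def external_features(jats_xml: str) -> dict:
    """Count simple external features in a JATS XML string."""
    root = ET.fromstring(jats_xml)
    paragraphs = ["".join(p.itertext()) for p in root.iter("p")]
    return {
        "authors": sum(
            1 for c in root.iter("contrib") if c.get("contrib-type") == "author"
        ),
        "references": sum(1 for _ in root.iter("ref")),
        "paragraphs": len(paragraphs),
        # Words are approximated by whitespace-separated tokens.
        "words": sum(len(text.split()) for text in paragraphs),
        "chapter_titles": [
            "".join(t.itertext()).strip()
            for sec in root.iter("sec")
            for t in sec.findall("title")
        ],
    }


# Tiny hypothetical document, for illustration only.
sample = """<article>
  <front><article-meta><contrib-group>
    <contrib contrib-type="author"><name><surname>A</surname></name></contrib>
    <contrib contrib-type="author"><name><surname>B</surname></name></contrib>
  </contrib-group></article-meta></front>
  <body>
    <sec><title>Introduction</title><p>One two three.</p></sec>
    <sec><title>Results</title><p>Four five.</p></sec>
  </body>
  <back><ref-list><ref id="r1"/><ref id="r2"/><ref id="r3"/></ref-list></back>
</article>"""
features = external_features(sample)
```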
4.3 Concept of similarity
The similarity of contents listed among the features is defined as follows.
Basically, we compare two documents (a preprint and a journal article) on a paragraph-by-paragraph basis. To detect changes in the position of paragraphs, we take the following steps.
1. Calculate the similarity between all pairs of paragraphs of the two documents A and B.
2. For each paragraph in A, pair it with the paragraph in B that has the highest similarity.
3. Swap A and B and perform the process in 2. to create the pairs seen from B.
4. Take the pairs created in 2. and 3. together, and keep only those with at least a certain level of similarity as the common pairs of the two documents.
5. Calculate the Jaccard coefficient, which is used as the similarity between the two documents A and B.
The threshold value used in 4. is determined by looking at the overall distribution of the paragraph-to-paragraph similarities calculated in the steps above.
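The five steps can be sketched in plain Python as follows. Several details are assumptions made only for illustration, because the text does not fix them: cosine distance is used as the distance between paragraph vectors, and the Jaccard coefficient is computed as (number of common pairs) / (number of common pairs + paragraphs of A in no common pair + paragraphs of B in no common pair). The function names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two vectors (1.0 if either vector is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm if norm else 1.0


def document_jaccard(doc_a, doc_b, threshold=0.3):
    """Jaccard-style similarity of two documents given as lists of paragraph vectors."""
    if not doc_a or not doc_b:
        return 0.0
    # Step 1: distances between all paragraph pairs.
    dist = [[cosine_distance(a, b) for b in doc_b] for a in doc_a]
    pairs = set()
    # Step 2: nearest paragraph in B for each paragraph in A.
    for i in range(len(doc_a)):
        pairs.add((i, min(range(len(doc_b)), key=lambda j: dist[i][j])))
    # Step 3: nearest paragraph in A for each paragraph in B.
    for j in range(len(doc_b)):
        pairs.add((min(range(len(doc_a)), key=lambda i: dist[i][j]), j))
    # Step 4: keep only the pairs that are close enough.
    common = {(i, j) for i, j in pairs if dist[i][j] <= threshold}
    # Step 5: Jaccard coefficient (assumed definition, see the lead-in).
    only_a = len(doc_a) - len({i for i, _ in common})
    only_b = len(doc_b) - len({j for _, j in common})
    return len(common) / (len(common) + only_a + only_b)


same = document_jaccard([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
disjoint = document_jaccard([[1.0, 0.0]], [[0.0, 1.0]])
```

Under this assumed definition, two documents with identical paragraphs score 1.0, and documents with no paragraph pair under the threshold score 0.0.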
In practice, each paragraph is converted into a vector using a distributed representation, and the distance between the vectors is calculated. Note that distance and similarity run in opposite directions: the smaller the distance, the more similar the paragraphs, whereas the larger the similarity, the more similar the paragraphs.
4.4 Distributed representation
Similarity calculations of this kind commonly use word-based cosine similarity, but this time we used a distributed representation. This was done so that the similarity absorbs, to some extent, differences in words and phrases that are changed in the process of peer review.
FastText [Bojanowski16] (https://fasttext.cc/) was used to obtain the distributed representation: 300-dimensional skip-gram embeddings were trained on the titles, abstracts, and full texts of the preprints and journal articles to be analyzed. Beforehand, the data were lemmatized and stopwords were removed using NLTK and WordNet.
As a result, we obtained 300-dimensional distributed representations of 230,020 words. The distributed representation of a paragraph or other text unit was obtained by summing the distributed representations of the words it contains and normalizing the result.
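The construction of a paragraph vector by summation and normalization can be illustrated with a toy example. The two-dimensional vectors and the function name `paragraph_vector` are hypothetical, and skipping words that are missing from the dictionary is an assumption rather than something stated in the text.

```python
import math


def paragraph_vector(tokens, word_vectors):
    """Sum the vectors of known words and scale the result to unit length."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for token in tokens:
        vec = word_vectors.get(token)
        if vec is None:
            continue  # assumption: words without a stored vector are skipped
        total = [t + v for t, v in zip(total, vec)]
    norm = math.sqrt(sum(t * t for t in total))
    return [t / norm for t in total] if norm else total


# Toy 2-dimensional dictionary; the real one has 300 dimensions.
toy_vectors = {"gene": [1.0, 0.0], "cell": [0.0, 1.0]}
vec = paragraph_vector(["gene", "cell", "unknown"], toy_vectors)
```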
4.5 Methods for obtaining distributed representations
In the field of biology, BioBERT [Jinhyuk19] (https://github.com/dmis-lab/biobert) is publicly available and can be used to obtain distributed representations with higher accuracy. However, for reasons of processing speed, we decided to forgo BioBERT and use our own FastText-based distributed representation.
Specifically, we extracted five cases from the full-text data and measured the processing time. BioBERT took 27.8 seconds in total for the five cases, an average of 5.6 seconds per case, while the FastText-based processing took 0.1 seconds in total for the five cases. (Since FastText does not change the coordinate values depending on the context, the pre-calculated distributed representation of each word is stored in a DB and is read and used for calculation each time.)
Since the FastText-based processing also involves costs such as building the distributed representation dictionary, lemmatizing, and removing stop words, these measurements alone should not be used to compare the costs. However, if we simply compare the processing times for the five cases mentioned above, the difference is a factor of 278. Assuming that each document has 20 paragraphs, processing 20,000 documents would require 400,000 calculations, which would take 622 hours (about 26 days) with BioBERT. On the other hand, the FastText-based method is expected to take about 2 hours.
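The estimate above can be reproduced with a few lines of arithmetic. The variable names are illustrative, and the per-case time for BioBERT is rounded to 5.6 seconds as in the text.

```python
cases = 5
bert_total_s = 27.8      # BioBERT, five cases
fasttext_total_s = 0.1   # FastText-based processing, five cases

speed_ratio = bert_total_s / fasttext_total_s        # about 278
bert_per_case_s = round(bert_total_s / cases, 1)     # 5.6, rounded as in the text
fasttext_per_case_s = fasttext_total_s / cases       # 0.02

calculations = 20 * 20_000                           # 20 paragraphs x 20,000 documents
bert_hours = calculations * bert_per_case_s / 3600   # about 622 hours
bert_days = bert_hours / 24                          # about 26 days
fasttext_hours = calculations * fasttext_per_case_s / 3600  # about 2 hours
```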
Since the reason for using a distributed representation is to absorb fluctuations in words and phrases, and not to pursue exact accuracy, we adopted FastText-based processing for its speed. Evaluating the difference in accuracy between BioBERT and FastText-based processing is left for future work.
4.6 The concept of “similar preprints”
When comparing journal preprints with other, general preprints, overall statistics can easily be compared directly, but a pairwise comparison like the one between journal preprints and journal articles is difficult. Therefore, we search for preprints that are similar to each journal preprint, and use the results of the comparison with these similar preprints.
The “similar preprint” is the preprint whose abstract has the smallest distance (that is, the largest similarity) to that of the journal preprint in the distributed representation. The similar preprints are limited to those that have older timestamps than the journal preprint and that have not been published as journal articles. This ensures that a similar preprint is not itself a journal preprint and cannot have been influenced by the journal preprint or the journal article.
To find the preprints with the smallest distance, we used NGT [Iwasaki18] (https://github.com/yahoojapan/NGT) to retrieve approximately the 100 nearest preprints in order to reduce the computational cost. There were about 1,000 cases where none of the top 100 similar preprints satisfied the aforementioned conditions (older and not published in a journal), resulting in 6,905 pairs of journal preprints and similar preprints. In addition, the same preprint is often associated with multiple journal preprints, and the number of unique “similar preprints” is 3,874.
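The selection rule can be sketched as follows. Note that this sketch replaces the approximate nearest-neighbor search of NGT with an exact brute-force sort and uses Euclidean distance, both purely for illustration. The `Preprint` record, its field names, and the toy data are hypothetical.

```python
import math
from datetime import date
from typing import List, NamedTuple, Optional


class Preprint(NamedTuple):
    doi: str
    posted: date
    published: bool            # True if linked to a journal article
    abstract_vec: List[float]  # distributed representation of the abstract


def similar_preprint(target: Preprint, candidates: List[Preprint],
                     top_k: int = 100) -> Optional[Preprint]:
    """Return the nearest older, unpublished preprint among the top_k neighbors."""
    others = [c for c in candidates if c.doi != target.doi]
    # Brute-force stand-in for the approximate search: sort by distance.
    nearest = sorted(
        others, key=lambda c: math.dist(target.abstract_vec, c.abstract_vec)
    )[:top_k]
    for cand in nearest:
        if cand.posted < target.posted and not cand.published:
            return cand
    return None  # no neighbor satisfies the conditions


target = Preprint("t", date(2019, 6, 1), True, [1.0, 0.0])
pool = [
    Preprint("c1", date(2019, 7, 1), False, [1.0, 0.01]),  # nearest, but newer
    Preprint("c2", date(2019, 3, 1), True, [0.9, 0.1]),    # older, but published
    Preprint("c3", date(2019, 2, 1), False, [0.8, 0.2]),   # older and unpublished
]
chosen = similar_preprint(target, pool)
none_in_top2 = similar_preprint(target, pool, top_k=2)
```

Limiting the search to a fixed number of neighbors means that some journal preprints end up without a similar preprint, as in the roughly 1,000 cases mentioned above.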
5.1 Initial Setup
Figure 1 shows the distribution of similarity (distance) for all paragraph combinations of journal preprints - journals, journal preprints - similar preprints, and random matching pairs, which are used to define the threshold for similarity calculation.
Figure 1 shows that in the case of random matching, few distances below 0.3 are observed. On the other hand, for the journal preprint - journal pairs and the journal preprint - similar preprint pairs, a certain number of distances below 0.3 are observed. Therefore, we adopt a distance of 0.3 as the threshold for calculating the Jaccard coefficient between documents.
5.2 Similarity of content
Figure 2 shows the distribution of Jaccard coefficients between documents calculated based on the above threshold.
Looking at Figure 2, the baseline random matching has a distribution with few similarities above 0.95 and a peak around 0.93. Journal preprints and journal articles naturally have high similarity, peaking at 1.0. Some preprints that are similar to journal preprints also show high similarity, such as 1.0, but their distribution is broader, which is natural considering that the authors are basically different.
5.3 Other external criteria
External criteria other than the similarity of contents, such as the number of authors and the number of references, are shown below.
Figures 3 and 4 show the distributions of the differences and of the actual values for the number of authors and the number of references. The distribution of the number of references shifts neatly from that of “journal article” to that of “reference paper”, indicating that the number of references tends to increase when a preprint becomes a journal article.
Figures 5 and 6 show the distributions of the differences and of the actual values for the number of figures and tables. In Figures 5 and 6, tab refers to the table tag in JATS XML, and tabw refers to the table-wrap tag. Since table tags are rarely used in preprints and table-wrap tends to be used more frequently, care is needed in how table and table-wrap are handled.
Figures 7 and 8 show the distributions of the differences and of the actual values for the number of paragraphs and words. Figure 8 shows that the distributions of the number of paragraphs and the number of words for journal articles are shifted to the right, indicating that the numbers of paragraphs and words increased in the journal article. Figure 7 also shows that the numbers of paragraphs and words rarely decrease.
The distribution of the number of identical and different chapter titles is shown in Figure 9.
5.4 Investigation of differences in titles
In addition to simply looking at the number of titles, we also looked at overlapping titles and different titles.
5.4.1 Differences in chapter titles
The differences in chapter titles were organized into a word cloud of common titles, titles that existed only in journal preprints, and titles that existed only in journal articles or similar preprints.
In this comparison, numbers such as “1.” in front of the title, periods and spaces at the end of the title were removed, and case was ignored.
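The normalization described above might look like the following sketch. The regular expressions are one possible reading of "numbers such as “1.” in front of the title", and the function names and sample titles are hypothetical.

```python
import re


def normalize_chapter_title(title: str) -> str:
    """Drop a leading section number and trailing periods/spaces, and ignore case."""
    # Leading numbers such as "1." or "2.1 " (must be followed by a period or space).
    title = re.sub(r"^\s*\d+(\.\d+)*(\.\s*|\s+)", "", title)
    # Trailing periods and spaces.
    title = re.sub(r"[.\s]+$", "", title)
    return title.lower()


def compare_chapter_titles(titles_a, titles_b):
    """Return (common titles, titles only in A, titles only in B)."""
    a = {normalize_chapter_title(t) for t in titles_a}
    b = {normalize_chapter_title(t) for t in titles_b}
    return a & b, a - b, b - a


common, only_preprint, only_journal = compare_chapter_titles(
    ["1. Introduction. ", "2.1 Materials and Methods", "RESULTS"],
    ["Introduction", "Results", "Discussion"],
)
```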
5.4.2 Title differences
For journal preprints and journals, we also investigated the difference in the title of the article itself, not the chapter title.
Since we wanted to observe the changes in titles after peer review and proofreading, and since there are various variations in titles in general and duplication is not meaningful, we only investigated the degree to which the word sets of titles match (Jaccard) and the differences in specific words.
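A word-set comparison of two titles can be sketched as below. Lowercasing and whitespace tokenization are assumptions made for illustration, and the function name and example titles are hypothetical.

```python
def title_word_diff(preprint_title: str, journal_title: str):
    """Return (Jaccard coefficient, words only in the preprint title, words only in the journal title)."""
    a = set(preprint_title.lower().split())
    b = set(journal_title.lower().split())
    union = a | b
    jaccard = len(a & b) / len(union) if union else 1.0
    return jaccard, a - b, b - a


jaccard, removed, added = title_word_diff(
    "A study of gene expression",
    "A study of the gene expression",
)
```

In this example the only change is an added definite article, so `added` contains just "the" and the Jaccard coefficient is 5/6.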
Figure 16 shows that in most cases there is no difference in the word sets (the Jaccard coefficient is 1.0). As with the chapter titles, Figures 17 and 18 show no significant difference in the specific words. In other words, the changes vary case by case: in some cases the definite article was added, and in other cases it was removed.
5.5 Estimating the importance of external criteria
So far, there was no significant difference between journal preprints and journal articles, although there were some differences in word count and number of references.
However, even if there is no difference in one indicator, there may be a difference in some conditions and combinations of several indicators. Therefore, we investigated whether we could distinguish journal articles, journal preprints, and other preprints using machine learning.
In this analysis, we use the journal articles as before, but change the scope of the journal preprints and introduce a new category called “other preprints”. These are measures to increase the sample size.
So far, journal preprints included only those for which the full-text XML of the paired journal article could be obtained. Since the pair is not necessary here, we treat as journal preprints the 12,925 preprints that are associated with a journal article and whose own full-text XML was successfully obtained, regardless of whether the full-text XML of the journal article is available. In addition, the 14,673 preprints that are not associated with a journal article and whose full-text XML was successfully obtained are classified as other preprints.
For each of these types, we generated a classifier that classified them based on external criteria, and examined the importance of each criterion (feature). The following criteria were set in reference to the previous analysis.
| Feature | Description |
|---|---|
| auth | Number of authors |
| ref | Number of references |
| word | Total number of words |
| fig | Number of figures |
| tab | Number of tables |
| intro | Percentage of total words in chapters whose title contains “intro” |
| metho | Percentage of total words in chapters whose title contains “method” |
| resul | Percentage of total words in chapters whose title contains “result” |
| discu | Percentage of total words in chapters whose title contains “discuss” |
| concl | Percentage of total words in chapters whose title contains “conclusion” |
The classification method used was RandomForest in the Python machine learning package scikit-learn, with 70% of the data randomly selected as training data and 30% as test data.
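A minimal sketch of this kind of setup with scikit-learn is shown below. The data are random placeholders generated only so that the example runs; they are not the data of this paper, and the hyperparameters (`n_estimators=100`, a fixed random seed) are assumptions, not the settings actually used.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

FEATURES = ["auth", "ref", "word", "fig", "tab",
            "intro", "metho", "resul", "discu", "concl"]


def fit_and_rank(X, y, seed=0):
    """Train on a random 70% split, evaluate on the rest, and rank the features."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, clf.predict(X_test))
    ranking = sorted(
        zip(FEATURES, clf.feature_importances_), key=lambda t: t[1], reverse=True
    )
    return accuracy, ranking


# Placeholder data: 300 random samples in 3 classes
# (standing in for journal article, journal preprint, other preprint).
rng = np.random.default_rng(0)
X_demo = rng.random((300, len(FEATURES)))
y_demo = rng.integers(0, 3, size=300)
accuracy, ranking = fit_and_rank(X_demo, y_demo)
```

The `ranking` list pairs each feature name with its importance, which corresponds to the kind of feature importance plot referred to in the results.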
The results are shown below.
Table: Classification results of test data by RandomForest (* PPr: PrePrint)
The accuracy is 0.47, and the importance of the features is as shown in Figure 19.
With three classes and no prior knowledge, the chance accuracy is simply 1/3; the accuracy of 0.47 is higher than that, but it is not good. In addition, the number of words is the most important feature for discrimination, followed by the introduction and conclusion.
From this analysis, we can see that there seems to be no difference between journal preprints, journal articles, and similar preprints in general. In other words, it would be difficult for a person who does not have expertise in the field to distinguish between a journal preprint, a journal article, and a similar preprint based on the chapter structure and the number of references.
In addition, the differences in content between journal preprints and journal articles are relatively small. However, even in the case of random matching, the peak value of the content similarity distribution of our method is high, about 0.93. In comparison with similar preprints, not a few of them have a similarity of 1.0. Thus, this criterion tends to estimate the similarity relatively high. In addition, the similarity calculation method truncates formulas and numbers, but in a paper, these small differences may be significant.
In addition, there are various points to be considered when reading the results, which will be discussed later, but to simplify the discussion, it is difficult to distinguish preprints and journal articles externally within the scope of this analysis. Therefore, it is necessary to conduct more in-depth text mining to answer the initial question, “How are journal articles produced?” and to examine the differences between preprints and journal articles by adding experts in the field.
6.1 Points to be noted
This time, bioRxiv was selected as the preprint server. Therefore, the field is limited to biology only. In addition, although biological sciences are more active than other fields, the diffusion rate of bioRxiv in biological sciences is not high, considering the number of researchers in the field and the usage rate of arXiv in physical and information sciences. In addition to this, the number of journal articles we were able to survey was very small (about 7,000), and only open access articles were available. Therefore, it should be noted that the discussion is limited to Open Access articles of advanced users who use preprints in the biological sciences.
In addition, although some of the contents were analyzed, they were processed mechanically and the specific contents and contexts were not considered. In reality, even if the set of words is the same, the meaning may be completely different if the order changes, and even if the sentences are almost the same, small differences in numerical values may have critical meanings. These points should also be kept in mind.
The “quality” of a preprint and a journal article also requires careful discussion. It is not clear that the quality and value of a paper are low just because it has not been published in a journal. Some papers may remain unpublished simply because they have not been submitted to journals, while others may have been submitted but were not evaluated properly and remain as preprints because the reviewers’ fields of expertise happened to be different. There may be some papers that are not valued at present, but will be of great value when combined with other results in a few decades. There are thus various possible reasons why a paper has not been published in a journal. In this paper, we have investigated the differences between journal preprints, journal articles, and other preprints, but we have only investigated the differences and not how they affect the nature of the research. We did not clarify how critically the added references affected the content, how the changes in chapter titles and organization affected the content, or how much the value of the research changed as a result. It should be noted that the paper merely states that such differences exist.
In this paper, we attempted to obtain knowledge about how research is conducted, especially how journal articles are produced, by comparing preprints with journal articles that are finally published.
First, due to the recent trend of open journals, we were able to secure a certain amount of full-text XML of preprints and journal articles, and verified the technical feasibility of comparing preprints and journal articles.
On the other hand, within the scope of this trial, in which we tried to clarify the difference between them based on external criteria such as the number of references and the number of words, and simple document similarity, we could not find a clear difference between preprints and journal articles, or between preprints that became journal articles and those that did not. Even with the machine learning method, the classification accuracy was not high at about 47%.
The result that there is no significant difference between preprints and journal articles is a finding that has been shown in previous studies and has been replicated in larger and relatively recent situations. In addition to these, the new findings of this paper are that the differences in many external criteria, such as the number of authors, are small, and the differences with preprints that are not journal articles are not large.
Thus, in order to verify how journal articles are produced, what points are improved from the preprint stage to become a journal article, and what the critical differences are between preprints that are accepted as journal articles and those that are not, it is suggested that it is necessary to include experts in the field and conduct more advanced analysis using text mining and other methods.
As a derivative suggestion, preprints are externally equivalent to journal articles in terms of volume and other factors, and it is difficult to find out what is likely to be accepted for journal articles from preprints based on external criteria. In addition, there is a possibility that the necessity and significance of the peer review system can be reevaluated by pursuing the approach of this study.
- [Abdill19] Abdill, R.J. and Blekhman, R.: Meta-Research: Tracking the popularity and outcomes of all bioRxiv preprints. eLife. Apr 2019. https://doi.org/10.7554/eLife.45133
- [Akbaritabar21] Akbaritabar, A., and Stephen, D.: A Disciplinary View of Changes in Publications’ Reference Lists After Peer Review. arXiv preprint. Feb 2021. arXiv:2102.03110
- [Bojanowski16] Bojanowski, P., Grave, E., et al.: Enriching Word Vectors with Subword Information. arXiv preprint. Jul 2016. arXiv:1607.04606
- [Carneiro20] Carneiro, C.F.D., Queiroz, V.G.S., Moulin, T.C. et al.: Comparing quality of reporting between preprints and peer-reviewed articles in the biomedical literature. Research Integrity and Peer Review Vol.5, Article number 16. Dec 2020. https://doi.org/10.1186/s41073-020-00101-3
- [Iwasaki18] Iwasaki, M., Miyazaki, D.: Optimization of Indexing Based on k-Nearest Neighbor Graph for Proximity. arXiv preprint. Oct 2018. arXiv:1810.07355
- [Jinhyuk19] Lee, J., Yoon, W., et al.: BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. Sep 2019. https://doi.org/10.1093/bioinformatics/btz682
- [Klein19] Klein, M., Broadwell, P., Farb, S.E. et al.: Comparing Published Scientific Journal Articles to Their Pre-Print Versions. International Journal on Digital Libraries Vol.20, pp.335 – 350. Dec 2019. https://doi.org/10.1007/s00799-018-0234-1
- [Koshiba20] Koshiba H., et al.: A Trial Analysis of Research Trends Using COVID-19 / SARS-CoV-2 Related Preprints. NISTEP Discussion Paper, No.186, June 2020. http://doi.org/10.15108/dp186 (in Japanese)
- [Hayashi21] Hayashi K., et al.: Preprint analysis focused on bioRxiv. NISTEP Discussion Paper Vol.197, Jul 2021. https://doi.org/10.15108/dp197 (in Japanese, in Press)
Appendix A Comparison of the number of citations
The number of citations is not used in this report because there are many issues to be solved before it can be used as an indicator; for example, the evaluation differs depending on the timing of the measurement.
Figure 20 shows a comparison of the number of citations between journal preprints and other preprints as of the end of April 2021.
The number of citations is higher for journal preprints. It is unclear whether the journal preprints receive more attention and citations because they are published in a journal, or whether frequently cited articles are more likely to be published in a journal.
Appendix B Comparison of the number of versions
The number of versions is not used in this report because, as with citations, there are many issues to be solved before it can be used as an indicator; for example, the evaluation differs depending on the timing of the measurement.
Figure 20 shows a comparison of the number of versions between journal preprints and other preprints as of the end of April 2021.
There was no difference between journal preprints and other preprints, both averaging 1.4 versions, and the trends were consistent.