1 Introduction
The use of single-blind reviews (which obscure reviewer identities) and double-blind reviews (which obscure both reviewer and author identities) varies across fields. Where double-blind review processes are preferred, they are often justified in terms of fairness (to lesser-known authors and institutions, to gender equity, etc.) and reduced bias (Snodgrass, 2006). Insofar as obscuring author identity prevents reviewers from inferring author characteristics, reviewers cannot discriminate based on those characteristics. Blinding may also be an attempt to promote more objective reviewing, ensuring papers are judged only on their scientific merit.
Unfortunately, double-blinding measures are always imperfect. Withholding author names obscures their identity, but cannot guarantee that reviewers will not find out who wrote the paper some other way. Sources of deblinding include publication of a preprint prior to review, posting the paper on the authors’ webpages, and publicizing the paper through social media platforms.
Previous work has studied the efficacy of blinding measures in review processes by having authors guess the identity of their reviewers and vice versa. Such studies, in a variety of disciplines, report success rates for blinding of 53% to 73% (Snodgrass, 2006), i.e. in the worst case, 47% guessed correctly. Even where author names were removed from titles, identifiable details were sometimes left in the paper body or acknowledgements section. In small fields, the choice of project alone could be enough to identify authors.
In this paper, we study one possible source of deblinding in papers submitted to ICLR (International Conference on Learning Representations). arXiv lets anyone immediately publish a citable technical report to the web, without any peer review. When ICLR papers under review are published on arXiv during the review process, it is possible reviewers will see the preprint and discover the authors and their affiliations.
2 Related Works
A recent paper by Tomkins et al. (2017) addressed a similar research question with an experimental study of the differences between single-blind (reviewer names withheld) and double-blind (reviewer and author names withheld) review processes. The authors designed a randomized controlled trial within the review process of the 10th WSDM conference. They divided reviewers into two categories: one that had access to authors’ names and affiliations (single-blind) and another that did not have access to the author list (double-blind). The same group of papers was reviewed by both groups of reviewers. Analysis of the bidding process and review scores revealed that reviewers in the single-blind pool were significantly more likely to recommend acceptance of papers from famous authors, top companies and top universities.
Other papers relevant to our work have investigated different settings of reviewer bias. For example, Link (1998) investigated the existence of reviewer bias when reviewers were asked to review manuscripts from authors outside of their home countries. In particular, the study ran a controlled experiment during the review process for the journal Gastroenterology and found that reviewers in the US assigned significantly higher review scores to papers with authors from US institutions than to papers with authors outside the US.
3 Concrete Operationalization
We operationalize our research question as follows: Does the acceptance rate of ICLR papers correlate more strongly with author h-index (and total citations) when the identity of authors has potentially been revealed to reviewers by an arXiv preprint either during or before the review process?
Choice of data
The International Conference on Learning Representations (ICLR) is an emerging conference focused on deep learning. ICLR uses the OpenReview platform for open peer review, so all reviews are publicly available for analysis. We choose ICLR data (from 2019 and 2020) because it contains information about acceptances, rejections and reviews of all submitted papers, including author data and affiliations. We scraped data from a total of 5057 submissions, after ignoring papers that were desk rejected or withdrawn prior to decision.
Data collection setup
To collect ICLR data, we modified an existing tool (https://github.com/shaohua0116/ICLR2020OpenReviewData) to scrape paper metadata from the OpenReview platform. arXiv provides an API for bulk data access (https://arxiv.org/help/bulk_data). Based on the papers and respective author lists scraped from OpenReview, we searched for papers with the same author list on arXiv, and for papers whose preprint existed on arXiv, we recorded the first upload timestamp. We did not query by paper title because papers uploaded to arXiv often have titles that differ from those of the versions submitted for review.
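The author-list matching described above can be sketched as follows. This is only an illustration under stated assumptions: the record layout (`authors`, `first_uploaded`), the helper names, and the name-normalization rule are hypothetical and are not taken from the scraping tool or from the arXiv API. Matching on author sets alone can also pick up a different paper by the same authors, which is why a manual check remains necessary.

```python
from datetime import datetime
import unicodedata


def normalize_name(name):
    """Lowercase a name and drop accents and punctuation (assumed rule)."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in no_accents)
    return " ".join(kept.lower().split())


def author_key(authors):
    """Order-independent key for an author list."""
    return frozenset(normalize_name(a) for a in authors)


def first_arxiv_upload(submission_authors, arxiv_records):
    """Earliest upload time among records with the same author set, else None.

    `arxiv_records` is a hypothetical list of dicts with keys
    'authors' (list of str) and 'first_uploaded' (datetime).
    """
    target = author_key(submission_authors)
    matches = [
        rec["first_uploaded"]
        for rec in arxiv_records
        if author_key(rec["authors"]) == target
    ]
    return min(matches) if matches else None


if __name__ == "__main__":
    records = [
        {"authors": ["Ada Lovelace", "Alan Turing"],
         "first_uploaded": datetime(2019, 10, 1)},
        {"authors": ["Ada Lovelace"],
         "first_uploaded": datetime(2018, 1, 1)},
    ]
    print(first_arxiv_upload(["Alan Turing", "Ada Lovelace"], records))
```

Using a set as the key makes the match insensitive to author order, which may or may not be desirable; an ordered tuple would be the stricter alternative.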
Although Google Scholar does not provide an API for programmatic data access, there are existing scraping tools such as scholarly (https://github.com/OrganicIrradiation/scholarly). We found a brief news article in Nature (Else, 2018) written by a researcher who spent months scraping data from Google Scholar, for lack of an official public-facing API. Luckily, since we only needed h-indices and total citations for several thousand authors, we wrote a simple script based on scholarly’s source code that uses BeautifulSoup. To avoid getting our IP blocked by Google Scholar, we ran this scraping code on a server with timeouts between successive queries.
Finally, we manually inspected the collected data to ensure that the first upload timestamp we scraped from arXiv does indeed correspond to the paper submitted on OpenReview. In addition, we manually checked the list of authors to ensure that we scraped the h-index and total citations of the intended author from Google Scholar (since multiple people on Google Scholar might have the same name). We defer automated checks for these to future work.
Measuring author reputation
To operationalize our research question, we have to choose a reasonable quantitative measure of author reputation. For the purpose of this study, we define two metrics for the reputation of an author: their h-index and their total citations as calculated by Google Scholar. If an author does not have a Google Scholar page, they are excluded from our analysis. Overall, we had 5030 papers for our analysis.
Measuring paper reputation
Since we will be analyzing review outcomes for papers, most of which have multiple authors, we further define a measure for the pseudo-reputation of a paper. We will consider the following metrics as definitions of a paper’s pseudo-reputation:

the max of the h-indices/total citations of all authors,

and the average of the h-indices/total citations of the top 2 authors.
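The two definitions above can be written down directly. The following is a minimal sketch; the function name, the `method` labels, and the handling of single-author papers (the top-2 average falls back to the single available score) are assumptions made for illustration.

```python
def pseudo_reputation(author_scores, method="max"):
    """Pseudo-reputation of a paper from its authors' scores.

    `author_scores` holds one number per author with a known score
    (h-index or total citations). `method` is either "max" or
    "top2_mean"; both labels are hypothetical names for this sketch.
    """
    scores = sorted(author_scores, reverse=True)
    if not scores:
        raise ValueError("paper has no author with a known score")
    if method == "max":
        return scores[0]
    if method == "top2_mean":
        top = scores[:2]  # a single-author paper contributes one score (assumption)
        return sum(top) / len(top)
    raise ValueError(f"unknown method: {method!r}")


if __name__ == "__main__":
    h_indices = [12, 45, 3]
    print(pseudo_reputation(h_indices, "max"))        # largest h-index
    print(pseudo_reputation(h_indices, "top2_mean"))  # mean of the two largest
```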
There are certainly a number of flaws in using h-index/total citations as a measure of author reputation (Costas and Franssen, 2018). However, our intention in this work is not to come up with a better method of quantifying the reputation of researchers. Given a publicly available standard metric under which research output is quantified (namely h-index and total citation count), our intention is to group authors based on this metric and to show how acceptance rates of papers vary under two conditions across different bins of the metric.
Choice of an observational study
We analyze observational data from ICLR reviews and arXiv. This is not a randomized experiment (natural or controlled), so whatever correlations we discover, we will be unable to draw strong conclusions about causation. It is possible that a randomized controlled experiment (such as that conducted in Tomkins et al. (2017)) would better address an explicitly causal version of our research question. However, we believe the choice of an observational “counting” approach is still valuable. Our findings may not generalize well beyond arXiv and ICLR and will be vulnerable to drift as publication norms and channels evolve. However, we think there are enough people specifically interested in ICLR and similar machine-learning conferences, and that the impact of these conferences is high enough, that this study is still worth pursuing.
4 Analyses
In this section we describe the analyses we performed to understand the research questions. We grouped the analyses under the following headings:
4.1 Is there any significant difference between acceptance rates for papers that are arxived during/before the review phase and papers that are not?
We start our analysis by plotting the aggregate acceptance rates of papers in two categories: 1) those whose preprints were released on arXiv either during or before the review phase and 2) the rest, whose preprints were either released after the review phase or have not been released at all to date.
From Fig. 1 we obtain statistically significant differences between the two conditions. There might be a number of reasons for these differences, including the possibility that papers released on arXiv during/before review tend to be more polished than their unreleased counterparts. We therefore perform additional analyses in the subsequent sections, binning papers by pseudo-reputation, to understand these differences better.
4.2 Do papers with arXiv preprints tend to have higher acceptance rates in case of papers by well-known authors?
Method: We plot a histogram with binned paper pseudo-reputation along the x-axis and average % of papers accepted in each bin along the y-axis. We consider two different conditions for the plot:

only papers released on arXiv before or during the review process, i.e. before the date reviews were released on OpenReview.

all other papers that are either not present on arXiv or were published on arXiv after the date reviews were released on OpenReview.
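The two conditions and the per-bin acceptance percentages can be computed along the following lines. This is a sketch under assumptions: the paper record keys, the condition labels, and the interior bin edges passed in are all hypothetical (the bin boundaries used for the figures are not given in the text), and a single review-release date is used, so papers from different years would need to be processed separately.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import date


def condition(first_arxiv_upload, reviews_released):
    """'arxiv' if the preprint appeared before reviews were released."""
    if first_arxiv_upload is not None and first_arxiv_upload < reviews_released:
        return "arxiv"
    return "no_arxiv"  # never on arXiv, or uploaded after the release date


def acceptance_by_bin(papers, bin_edges, reviews_released):
    """Percent accepted per (bin index, condition).

    `bin_edges` are interior cut points in ascending order, so three
    edges give four bins indexed 0..3 (edge values are assumptions).
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [accepted, total]
    for p in papers:
        b = bisect_right(bin_edges, p["pseudo_reputation"])
        key = (b, condition(p["first_arxiv_upload"], reviews_released))
        counts[key][0] += int(p["accepted"])
        counts[key][1] += 1
    return {key: 100.0 * acc / total for key, (acc, total) in counts.items()}


if __name__ == "__main__":
    papers = [
        {"pseudo_reputation": 5, "first_arxiv_upload": None, "accepted": True},
        {"pseudo_reputation": 80, "first_arxiv_upload": date(2019, 10, 1),
         "accepted": True},
    ]
    print(acceptance_by_bin(papers, [10, 30, 60], date(2019, 11, 5)))
```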
Results: Inspecting the different bins in Fig. 2, we note two key trends: 1) the % acceptance increases with the pseudo-reputation of papers, and 2) in the first bin, the % acceptance for papers in the not arxived condition is higher than in the arxived condition, while in subsequent bins, in particular the third and fourth bins, the trend is reversed.
The first trend aligns with the intuition that papers with high pseudo-reputation have an overall higher % acceptance rate, because authors with high reputation scores perhaps submit genuinely better papers on average. However, this does not explain the second trend, the discrepancy we observe between the two conditions.
To identify if the discrepancies we observe are significant, we conduct pairwise tests for the four bins with the null hypothesis being that there is no difference between the arxiv and the no arxiv conditions. The alternate hypothesis is that there is a difference between the % acceptance in the arxiv condition and the % acceptance in the no arxiv condition. Specifically, for the first bin we hypothesize that the % acceptance in the arxiv condition is less than the % acceptance in the no arxiv condition, while for the fourth bin we hypothesize that the % acceptance in the arxiv condition is more than the % acceptance in the no arxiv condition. For the first and fourth bins of Fig. 1(a), we obtain and respectively. In Fig. 1(b), we repeat the same analysis, but with h-index of authors used to define the paper pseudo-reputation scores. For the first and fourth bins of Fig. 1(b), we obtain and respectively. Hence, we indeed conclude that there is a positive correlation between releasing preprints on arXiv and acceptance rates of papers by well-known authors, under our concretization of the problem.
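A directional comparison of acceptance rates between the two conditions within one bin can be sketched as below. The text as preserved here does not name the test that was used, so this example assumes a pooled two-proportion z-test purely for illustration; the function name, arguments, and counts are hypothetical and do not reproduce the reported values.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(acc_a, n_a, acc_b, n_b, alternative="two-sided"):
    """Pooled two-proportion z-test (an assumed choice of test).

    Group A could be the arxiv condition and group B the no arxiv
    condition within one bin. `alternative` is "less" (rate A < rate B),
    "greater" (rate A > rate B), or "two-sided".
    Returns (z statistic, p-value).
    """
    p_a, p_b = acc_a / n_a, acc_b / n_b
    pooled = (acc_a + acc_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("test undefined when all or no papers are accepted")
    z = (p_a - p_b) / se
    cdf = NormalDist().cdf(z)
    if alternative == "less":
        p_value = cdf
    elif alternative == "greater":
        p_value = 1 - cdf
    elif alternative == "two-sided":
        p_value = 2 * min(cdf, 1 - cdf)
    else:
        raise ValueError(f"unknown alternative: {alternative!r}")
    return z, p_value


if __name__ == "__main__":
    # Made-up counts: 30/100 accepted in group A versus 50/100 in group B.
    z, p = two_proportion_z_test(30, 100, 50, 100, alternative="less")
    print(f"z = {z:.3f}, one-sided p = {p:.4f}")
```

With small bins, an exact test (for example Fisher's exact test) may be a more appropriate choice than the normal approximation used in this sketch.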
Since the data for Fig. 2 consists of ICLR 2020 and ICLR 2019 papers combined, in order to ensure that our results are not confounded by temporal changes in the publication culture over one year, we perform a robustness check by repeating the analyses for ICLR 2020 papers alone in Fig. 3. For the first and fourth bins of Fig. 2(a), we obtain and respectively, while for the first and fourth bins of Fig. 1(b), we obtain and respectively. These results are consistent with those in Fig. 2 and hence our conclusions remain valid.
4.3 Are high pseudo-reputation papers more likely to be released on arXiv?
We hypothesize that papers with high pseudo-reputation are more likely to have preprints released on arXiv during or before review. Authors understandably want the best outcome for their paper, especially given the amount of time, effort, and analysis that has gone into it. So, it may be the case that well-known authors believe that deblinding via arXiv and publicizing their paper before/during peer review will likely work in their favor.
Method: To test this hypothesis we plot a histogram with binned paper pseudo-reputation along the x-axis and fraction of papers released on arXiv along the y-axis. Note that there is only a single condition per bin in this histogram unlike the previous plots, and we are interested in comparing the y-values corresponding to each bin.
Results: The result of this analysis is shown in Fig. 4. While it is evident that the fraction of papers arxived in the fourth category is higher than in the other three categories, the difference, at least through visual inspection, is not pronounced. To quantify whether there is a statistically significant difference in the fraction of papers arxived between high pseudo-reputation papers and low pseudo-reputation papers, we perform a test on the aggregate of the first three bins (the low condition) and the fourth bin (the high condition), with the alternate hypothesis that the fraction of papers arxived is higher in the high category than in the low category. The null hypothesis is that there is no difference in the fraction of papers arxived between the two conditions high and low. For this, we obtain , which does allow us to reject the null hypothesis in favor of the alternate hypothesis.
4.4 Are review scores by less confident reviewers higher in case of papers with high pseudo-reputation and lower in case of papers with low pseudo-reputation?
While writing reviews for ICLR papers, reviewers must self-specify their confidence in the review of the paper in the form of a field called experience assessment. There are four different confidence levels that reviewers can choose from; for example, the highest level is defined by “I have read many papers in this area.” This is publicly displayed along with the reviews. We denote the numerical values of the confidence scores as 1, 2, 3, and 4 (lowest to highest in this order).
Method: We consider bins of paper pseudo-reputations on the x-axis for all papers that have been released on arXiv and plot 3 histograms (corresponding to whether the average reviewer confidence score for the paper lies in the [1,2.5], (2.5,3], or (3,4] category), indicating the average review score assigned by each category of reviewers to papers in each bin.
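The grouping by average reviewer confidence can be sketched as follows, using the three intervals stated in the method. The record keys (`confidences`, `ratings`) and function names are hypothetical, and the sketch omits the additional split by pseudo-reputation bin for brevity.

```python
from collections import defaultdict
from statistics import mean


def confidence_category(avg_confidence):
    """Map an average confidence in [1, 4] to one of the three intervals."""
    if not 1 <= avg_confidence <= 4:
        raise ValueError("average confidence must lie in [1, 4]")
    if avg_confidence <= 2.5:
        return "[1, 2.5]"
    if avg_confidence <= 3:
        return "(2.5, 3]"
    return "(3, 4]"


def mean_score_by_category(papers):
    """Average of per-paper mean review scores within each category.

    Each paper is a dict with hypothetical keys 'confidences' and
    'ratings', one entry per review.
    """
    grouped = defaultdict(list)
    for p in papers:
        category = confidence_category(mean(p["confidences"]))
        grouped[category].append(mean(p["ratings"]))
    return {category: mean(scores) for category, scores in grouped.items()}


if __name__ == "__main__":
    papers = [
        {"confidences": [2, 3], "ratings": [6, 8]},
        {"confidences": [4, 3], "ratings": [1, 3]},
    ]
    print(mean_score_by_category(papers))
```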
Results: Fig. 5 shows the results of this analysis. Looking at the third and fourth bins in Fig. 4(a), Fig. 4(c), and Fig. 4(e), it is evident that for papers with a low average reviewer confidence score, the average review score in the arxiv condition is more than the average review score in the no arxiv condition. Looking at the first bin in Fig. 4(a), Fig. 4(c), and Fig. 4(e), it is evident that for papers with a low average reviewer confidence score, the average review score in the arxiv condition is less than the average review score in the no arxiv condition.
To analyze if these differences are significant, we conduct tests on the four bins. The null hypothesis is that there is no difference between the arxiv and the no arxiv conditions. The alternate hypothesis is that for low confidence reviewers, there is a difference between the arxiv and the no arxiv conditions. Specifically, we hypothesize that in the third and fourth bins, the average review score in the arxiv condition is more than the average review score in the no arxiv condition. We also hypothesize that in the first bin, the average review score in the arxiv condition is less than the average review score in the no arxiv condition.
For the fourth bins in Fig. 4(a), Fig. 4(c), and Fig. 4(e), we respectively obtain the values 0.02, 0.03, and 0.19. For Fig. 4(a) and Fig. 4(c), we can therefore reject the null hypothesis in favor of the alternate hypothesis, but we cannot reject the null hypothesis for Fig. 4(e). This offers evidence of a negative correlation between the confidence of reviewers and their likelihood of assigning high review scores to papers with high pseudo-reputation.
For the first bin in Fig. 4(a), Fig. 4(c), and Fig. 4(e), we respectively obtain the values 0.02, 0.04, and 0.24. For Fig. 4(a), we can therefore reject the null hypothesis in favor of the alternate hypothesis, but we cannot reject the null hypothesis for Fig. 4(c) and Fig. 4(e). This offers evidence of a negative correlation between the confidence of reviewers and their likelihood of assigning low review scores to papers with low pseudo-reputation.
4.5 Is the effect of difference between the arxiv and the no arxiv conditions stronger for borderline papers?
In Fig. 7 we analyze papers that have a borderline reviewer rating on average and papers that are highly rated by reviewers on average, under the two conditions arxived and not arxived prior to decision notification. This analysis aims to understand the existence of potential bias at the level of Area Chairs.
Method: To have a principled scheme of determining which papers are borderline, in Fig. 6, we plot fraction acceptance of papers per average reviewer rating bin, where the bins are created based on the twenty percentile values. Based on this, we define borderline papers to be the papers that received an average rating in the range and highly rated papers to be those that received an average rating in the range .
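One reading of "bins created based on the twenty percentile values" is that the 20th, 40th, 60th, and 80th percentiles of the average reviewer rating serve as cut points. The sketch below follows that interpretation, which is an assumption, as are the record keys and the function name; it does not attempt to recover the rating ranges that define the borderline and highly rated groups.

```python
from bisect import bisect_right
from statistics import quantiles


def acceptance_by_rating_quintile(papers):
    """Fraction of accepted papers per quintile of average rating.

    Each paper is a dict with hypothetical keys 'avg_rating' and
    'accepted'. Returns (cut points, list of five fractions); a
    fraction is None for an empty bin.
    """
    ratings = [p["avg_rating"] for p in papers]
    cuts = quantiles(ratings, n=5)  # 20th, 40th, 60th, 80th percentiles
    accepted = [0] * 5
    totals = [0] * 5
    for p in papers:
        b = bisect_right(cuts, p["avg_rating"])
        totals[b] += 1
        accepted[b] += int(p["accepted"])
    fractions = [a / t if t else None for a, t in zip(accepted, totals)]
    return cuts, fractions


if __name__ == "__main__":
    # Made-up data: ten papers rated 1..10, accepted when rated 6 or more.
    papers = [{"avg_rating": r, "accepted": r >= 6} for r in range(1, 11)]
    cuts, fractions = acceptance_by_rating_quintile(papers)
    print(cuts, fractions)
```

Bins in which the acceptance fraction is far from both 0 and 1 are natural candidates for the borderline range.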
Results: Fig. 6(a) and Fig. 6(b) respectively show fraction acceptance per h-index bin for the borderline papers and the highly rated papers. To identify if the discrepancies we observe are significant, we conduct tests for the four bins with the null hypothesis being that there is no difference between the arxiv and the no arxiv conditions. The alternate hypothesis is that there is a difference between the arxiv and the no arxiv conditions. Specifically, for the first bin, we hypothesize that the % acceptance in the arxiv condition is less than the % acceptance in the no arxiv condition, while for the fourth bin, we hypothesize that the % acceptance in the arxiv condition is more than the % acceptance in the no arxiv condition.
For the first and fourth bins of Fig. 6(a), we obtain and respectively, while for the first and fourth bins of Fig. 6(b), we obtain and respectively. So for the borderline papers, we conclude that releasing preprints on arXiv correlates positively with acceptance rates of papers by well-known authors, and correlates negatively with acceptance rates of papers by less well-known authors, under our concretization of the problem. The effect is indeed stronger for borderline papers.
5 Discussion and Limitations
In the previous section we performed a number of analyses and obtained three key inferences: 1) releasing preprints on arXiv has a positive correlation with acceptance rates of papers by well-known authors, 2) papers with well-known authors are more likely to be released on arXiv during or prior to the review phase, and 3) reviewers with a low confidence score are more likely to assign high review scores to deanonymized papers by well-known authors. In this section we put these inferences in perspective and address some of the limitations of our study.
It is important to note that our study is entirely based on observational data and hence it is not possible for us to make rigorous causal claims. Since the same set of reviewers was not exposed to the two conditions arxiv and no arxiv, we cannot make any conclusive claims with respect to the intent or bias of the reviewers. On the other hand, we believe that the in-the-wild nature of our study is helpful in putting into perspective the trends that emerge (albeit correlational and not necessarily causal) in the current publication and preprint culture of machine learning.
Another limitation of our study is that we only analyze data from two recent ICLR conferences (ICLR 2020 and ICLR 2019). ICLR served as the natural platform for this study as the entire list of submissions and reviews is publicly released, in the spirit of open science. It would be very helpful if we could validate our claims on other popular CS/AI/ML conferences to understand the interplay of deanonymization through arXiv and the type of reviews. This is our appeal to the community to consider adopting the OpenReview system and publicly releasing the entire list of submissions and all the reviews. Apart from facilitating analyses like ours, this also helps readers put into perspective the contributions of the papers and understand the potential shortcomings that were pointed out in the review phase and that hopefully have been addressed in the final version.
Finally, given our findings in this study and the implications they may have for the publication culture of our community, we discuss some solutions to mitigate potential reviewer bias caused by deanonymization through arXiv preprints. Since the point of a preprint is that the paper is either soon to be submitted for review or is currently under review, arxiv.org could offer authors the option of keeping the author list anonymized. Conferences that follow the double-blind review system could enforce that only papers that are anonymized on arXiv and that will remain anonymized during the review phase can be submitted to the conference. If the purpose of releasing a preprint is early dissemination of knowledge, having the author list anonymized for some duration would not be detrimental to this cause, as the anonymized paper would still be citable (a practice followed by ICLR on OpenReview). In order to strongly discourage people from putting up incomplete works under the cover of anonymity for ‘early flag-planting,’ strict rules can be enforced regarding the conditions under which a paper submitted on arXiv can later be updated. For example, it can be required that when a paper submitted in anonymous format is updated, the old version is still displayed by default and the new version has a separate upload timestamp and is linked to the old version. That is, the two versions (anonymous and updated) would be listed as separate papers with their respective timestamps in order to discourage people from trying to flag-plant incomplete work.
In addition to the above, we believe it is important to modify the typical peer-review process to have a maximum limit on the number of low confidence reviewers that are assigned to a paper. Since we have observed correlational evidence that low confidence reviewers are more likely to assign favorable ratings to papers with reputable authors, the number of low confidence reviewers overall in the review process should be decreased if possible, and if this is not possible then the number of such reviewers per paper should be limited to at most one.
Acknowledgements
We thank the numerous people who pursue meta-discussions about the current publication culture on social media platforms, in particular on Twitter. Those discussions served as an inspiration for us to pursue a principled study to investigate the correlations between reputation of authors and deblinding of their papers through arXiv submissions prior to double-blind peer review. We thank David Duvenaud and Florian Shkurti for helpful discussions and feedback.
References
Costas and Franssen (2018). Reflections around ‘the cautionary use’ of the h-index: response to Teixeira da Silva and Dobránszki. Scientometrics 115 (2), pp. 1125–1130.
Else (2018). How I scraped data from Google Scholar. Nature.
Link (1998). US and non-US submissions: an analysis of reviewer bias. JAMA 280 (3), pp. 246–247.
Snodgrass (2006). Single- versus double-blind reviewing: an analysis of the literature. ACM SIGMOD Record 35 (3), pp. 8–21.
Tomkins et al. (2017). Reviewer bias in single- versus double-blind peer review. Proceedings of the National Academy of Sciences 114 (48), pp. 12708–12713. ISSN 0027-8424. https://www.pnas.org/content/114/48/12708.full.pdf