Content and linguistic biases in the peer review process of artificial intelligence conferences
Philippe Vincent-Lamarre1,2 & Vincent Larivière1
We analysed a recently released dataset of scientific manuscripts that were either rejected or accepted at various conferences in artificial intelligence. We used a combination of semantic, lexical and psycholinguistic analyses of the full text of the manuscripts to compare them based on the outcome of the peer review process. We found that accepted manuscripts were written with words that are less frequent, that are acquired at an older age, and that are more abstract than those of rejected manuscripts. We also found that accepted manuscripts scored lower than rejected manuscripts on two indicators of readability, and that they used more artificial intelligence jargon. An analysis of the references included in the manuscripts revealed that accepted submissions were more likely to cite the same publications. This finding was echoed by pairwise comparisons of the word content of the manuscripts (i.e. an indicator of semantic similarity), which was higher among accepted manuscripts. Finally, we predicted the peer review outcome of manuscripts from their word content: words related to machine learning and neural networks were positively associated with acceptance, whereas words related to logic, symbolic processing and knowledge-based systems were negatively associated with acceptance.
Peer review is a fundamental component of the scientific enterprise and acts as one of the main sources of quality control of the scientific literature (ziman_real_2002). The primary form of peer review occurs before publication (wakeling_no_2019) and is often considered a stamp of approval from the scientific community (mulligan_is_2005; mayden_peer_2012). Peer-reviewed publications carry considerable weight in the attribution of research and academic resources (tregellas_predicting_2018; mckiernan_use_2019; moher_assessing_2018).
One of the main concerns about peer review is its lack of reliability (bailar_reliability_1991; lee_kuhnian_2012; cicchetti_reliability_1991). Most studies on the topic find that agreement between reviewers is barely greater than chance (bornmann_reliability-generalization_2010; price_nips_2014; forscher_how_2019), which highlights the considerable amount of subjectivity involved in the process. This leaves room for many potential sources of bias, which have been reported in several studies (lee_bias_2013; de_silva_preserving_2017; murray_gender_2018). A potential silver lining is that the process appears to have some validity. For instance, articles accepted at a general medicine journal (jackson_validity_2011) and at journals in the domain of ecology (paine_effectiveness_2018) were more cited than the rejected articles published elsewhere, and the process appears to improve the quality of manuscripts, although marginally (goodman_manuscript_1994; pierie_readers_1996; calcagno_flows_2012). It is therefore surprising that a process with so little empirical support for its effectiveness, but so much evidence of its downsides, carries so much weight (smith_classical_2010).
The vast majority of studies on peer review have focused on the relationship between the socio-demographic attributes of the actors involved in the process and its outcome (sabaj_meruane_what_2016). Comparatively little research has focused on the content of referees' reports. This is not surprising given that these reports are usually confidential, and whenever they are made available to researchers it is usually through small samples designed to answer specific questions. Another factor contributing to this gap in the literature is that analysing textual data (either the referees' reports or the reviewed manuscripts) is more time consuming than analysing papers' metadata. However, new developments in the field of natural language processing (NLP) make it possible to analyse large amounts of textual data more efficiently. Additionally, the increasing popularity of open access (piwowar_state_2018; sutton_popularity_2017) allows for greater access to the full text of scientific manuscripts.
The popularity of one open access repository, arXiv, allowed the development of a new method to identify manuscripts that were accepted at conferences after the peer review process (kang_dataset_2018), based on the scraping of arXiv submissions around the time of major NLP, machine learning (ML) and artificial intelligence (AI) conferences. These preprints were then matched with manuscripts that were published at the target venues to determine whether they were accepted or "probably-rejected". In addition, manuscripts and peer review outcomes were collected from conferences that agreed to share their data. kang_dataset_2018 achieved decent accuracy at predicting the acceptance of the manuscripts in their dataset. Other groups obtained good performance at predicting paper acceptance with different machine learning models based on the text of the manuscripts (jen_predicting_2018), sentiment analysis of referees' reports (ghosal_sentiment_2019), or the evaluation scores given by the reviewers (qiao_modularized_2018).
In this manuscript, we take a different approach and explore the importance of two types of biases that could be involved in the peer review process, namely language bias and content bias. Under the language bias, authors who are not native English speakers could receive more negative evaluations from reviewers due to the linguistic level of their manuscripts (ross_effect_2006; tregenza_gender_2002; herrera_language_1999). However, other studies found that manuscripts that were "linguistically criticized" had similar chances of acceptance as the rest (loonen_who_2005). Overall, there is little data available on possible language bias, which is worrying given an increasingly globalized scientific system that relies ever more on one language: English (lariviere_introduction:_2019). In addition, recent findings showed that the linguistic style of grant applications could play a role in funding success (kolev_is_2019). Another study found that the readability of scientific communications has been steadily decreasing throughout the last century (plaven-sigray_readability_2017).
Under the content bias, innovative and unorthodox methods are less likely to be judged favourably (lee_bias_2013). This type of bias is also quite likely to play a role in fields that are dominated by a few mainstream approaches, such as AI (hao_we_2019). Conservatism in the field of artificial intelligence could impede the emergence of breakthroughs or novel techniques that do not fit the current trends.
In this manuscript, we address both types of biases by comparing the textual data (title, abstract and introduction) of the manuscripts. We first compared the psycholinguistic and lexical attributes of the full text of accepted and rejected manuscripts and found that accepted manuscripts used words that were more abstract, less frequent and acquired at a later age than those of rejected manuscripts. We then used two readability metrics (the Flesch Reading Ease (FRE) and the New Dale-Chall (NDC) Readability Formula), as well as an indicator of AI jargon, and found that manuscripts that were less readable and used more jargon were more likely to be accepted. We then compared manuscripts on their word content and their referencing patterns through bibliographic coupling, and found that accepted manuscripts were semantically closer to each other than rejected manuscripts. Finally, we used the word content of the manuscripts to predict their acceptance, and found that specific topics were associated with greater odds of acceptance.
2.1 Manuscript data
We used the publicly available PeerRead dataset (kang_dataset_2018) to analyse the semantic and lexical differences between accepted and rejected submissions to several natural language processing, artificial intelligence and machine learning conferences. We used content from six platforms archived in the PeerRead dataset: three arXiv sub-repositories tagged by subject, including submissions from 2007 to 2017 (AI: artificial intelligence, CL: computation and language, LG: machine learning), as well as submissions to three other venues (ACL 2017: Association for Computational Linguistics, CoNLL 2016: Conference on Computational Natural Language Learning, ICLR 2017: International Conference on Learning Representations). This resulted in a dataset of 12,364 submissions. Although (kang_dataset_2018) reported an overall acceptance rate for ACL 2017 and CoNLL 2016, the outcome of each individual submission was not available in the dataset at the time of our analysis.
We limited our analysis to the title, abstract and introduction of the manuscripts (and not the other IMRaD sections), because the methods and results contained formulas, mathematical equations and variables, which made them unsuitable for textual analysis.
| Platform | # Papers | # Accepted |
2.2 Semantic distance
First, the text data of each article, including the title, abstract, body and references, were cleaned by making all words lowercase and eliminating punctuation, single-character words and common stopwords. For all analyses except the readability, scientific jargon and psycholinguistic matching, the stem of each word was extracted using the Porter algorithm (porter_algorithm_1980).
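As an illustration, the cleaning steps can be sketched as follows. The exact stopword list used is not specified in the text, so the short list below is an assumption; Porter stemming (e.g. NLTK's PorterStemmer) would be applied to the surviving tokens afterwards.

```python
import re

# Illustrative stopword list only; the actual list used is not specified.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "on"}

def clean(text):
    """Lowercase, keep alphabetic tokens only (punctuation and digits are
    dropped), then remove single-character words and stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) > 1 and t not in STOPWORDS]

cleaned = clean("The Porter algorithm, proposed in 1980, is a stemmer.")
```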
We then used the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm to create a vector representation based on the field of interest (title, abstract or body). We used the Euclidean distance between the documents' TF-IDF vectors as a measure of semantic distance (or dissimilarity).
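A minimal sketch of this procedure is given below, with a hand-rolled TF-IDF; the exact weighting variant used is not stated, so the common tf × log(N/df) scheme is assumed here.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TF-IDF vectors (raw term frequency x log(N/df) IDF) for a
    corpus of tokenized documents. Assumed weighting scheme, for illustration."""
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    df = {w: sum(1 for d in docs if w in d) for w in vocab}
    vecs = []
    for d in docs:
        tf = Counter(d)
        vecs.append([tf[w] * math.log(n / df[w]) for w in vocab])
    return vecs

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

docs = [["neural", "network", "learn"],
        ["neural", "network", "gradient"],
        ["logic", "rule", "knowledge"]]
v = tfidf_vectors(docs)
d_close = euclidean(v[0], v[1])  # docs sharing "neural network"
d_far = euclidean(v[0], v[2])    # no shared vocabulary
```

Documents that share vocabulary end up closer in TF-IDF space, which is what makes the Euclidean distance usable as a dissimilarity measure.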
2.3 Reference matching
In order to obtain the manuscripts' bibliographic coupling, we had to develop a reference matching algorithm, because reference formats were not standardized across manuscripts. The references were already parsed into subfields, so we used four conditions to match two references: 1) they were published in the same year; 2) they had the same number of authors; 3) the authors' names had a similarity score above 0.7 (empirically determined after manual inspection of matching results) with a fuzzy matching procedure (Token Set Ratio function from the FuzzyWuzzy Python library, https://github.com/seatgeek/fuzzywuzzy); and 4) the same held for the articles' titles.
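The four conditions can be sketched as below. This is not the actual implementation: difflib's SequenceMatcher stands in for FuzzyWuzzy's Token Set Ratio, and the reference field names are hypothetical.

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.7):
    """Stand-in for FuzzyWuzzy's token_set_ratio, using stdlib difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def same_reference(r1, r2):
    """Apply the four matching conditions: same year, same number of
    authors, fuzzy-matched author names, fuzzy-matched title."""
    return (r1["year"] == r2["year"]
            and len(r1["authors"]) == len(r2["authors"])
            and similar(" ".join(r1["authors"]), " ".join(r2["authors"]))
            and similar(r1["title"], r2["title"]))

a = {"year": 2012, "authors": ["Krizhevsky", "Sutskever", "Hinton"],
     "title": "ImageNet classification with deep convolutional neural networks"}
b = {"year": 2012, "authors": ["A. Krizhevsky", "I. Sutskever", "G. Hinton"],
     "title": "Imagenet classification with deep convolutional neural nets"}
```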
2.4 Psycholinguistic and readability variables
For the word frequency estimation, we used the SUBTLEXUS corpus (brysbaert_moving_2009), from which we used the logarithm of the estimated word frequency + 1. For concreteness, we used the (brysbaert_concreteness_2014) dataset, which provides concreteness ratings for 40,000 commonly known English words. For the age of acquisition, we used the (kuperman_age--acquisition_2012) ratings for 30,000 English words.
We used the readability functions as implemented in (plaven-sigray_readability_2017). We used the Flesch Reading Ease (FRE; flesch_new_1948; kincaid_derivation_1975) and the New Dale-Chall Readability Formula (NDC; chall_readability_1995). The FRE is calculated from the number of syllables per word and the number of words per sentence. The NDC is based on the number of words per sentence and the proportion of difficult words, i.e. words that are not part of a list of "common words". We also included two sources of jargon developed by (plaven-sigray_readability_2017). The first is science-specific common words: words commonly used by scientists that are not in the NDC's list of common words. The second is general science jargon: words frequently used in science but not specific to it (see plaven-sigray_readability_2017 for methods). Finally, we compiled a list of AI jargon from two online glossaries (https://developers.google.com/machine-learning/glossary/ and http://www.wildml.com/deep-learning-glossary/).
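For reference, a rough sketch of the FRE computation is shown below. The vowel-group syllable counter is a crude approximation; the implementation of (plaven-sigray_readability_2017) that we actually used is more careful.

```python
import re

def count_syllables(word):
    """Crude syllable count via vowel groups; real implementations use a
    pronunciation dictionary, so treat this as an approximation."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """FRE = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words).
    Higher scores mean more readable text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * len(words) / len(sentences)
            - 84.6 * syllables / len(words))

fre_simple = flesch_reading_ease("The cat sat. The dog ran.")
fre_dense = flesch_reading_ease(
    "Stochastic optimization of overparameterized architectures "
    "necessitates regularization.")
```

Short, monosyllabic sentences score high; long, polysyllabic ones score low (and can even go negative).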
2.5 Data analysis
Because of the large size of the datasets included in this study, no significance testing was performed. Our analyses relied on the effect size and the explained variation, as well as the cross-validated effects on the independent subsets of the PeerRead dataset (manuscripts from different venues and online repositories). All error bars represent the standard error.
3.1 Lexical correlates of peer review outcome
We first compared accepted and rejected submissions based on lexical and psycholinguistic attributes. For this analysis, we focused only on the content of the introduction of each submission. We used the number of tokens (total number of words in a document) as well as two measures of lexical diversity: the number of types (unique words in a document) and the Type-Token Ratio (TTR). We also used three psycholinguistic variables: age of acquisition (AOA), concreteness and frequency (on a logarithmic scale). We computed the average values of these psycholinguistic variables over all types and all tokens. The psycholinguistic variables covered on average 48.4%, 60.8% and 81.1% of the types in each document for the AOA, concreteness and frequency, respectively.
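The three lexical indicators can be computed directly from a token list, for example:

```python
def lexical_stats(tokens):
    """Return the number of tokens, the number of types (unique words),
    and the Type-Token Ratio (types / tokens)."""
    n_tokens = len(tokens)
    n_types = len(set(tokens))
    return n_tokens, n_types, n_types / n_tokens

tokens = "neural networks learn representations neural networks generalize".split()
stats = lexical_stats(tokens)  # repeated words lower the TTR
```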
We found strong and consistent effects for the psycholinguistic variables. Words used in accepted manuscripts were less frequent, acquired later in life and more abstract than in rejected manuscripts on average (Fig. 1). The effects were consistent across all platforms except ICLR (which is much smaller than the other ones). The frequency and AOA had the largest effects, followed by the concreteness (Table 2). The other lexical indicators (#Tokens, #Types and TTR) did not show such differences between accepted and rejected manuscripts.
| Variable | Pearson r | Variance explained |
The readability of scientific articles has been steadily declining over the last century (plaven-sigray_readability_2017). One possible explanation is that writing more complex sentences and using more scientific jargon increase the likelihood that a manuscript will be accepted at peer review. To address this question, we applied two measures of readability to our data: the Flesch Reading Ease (FRE) and the New Dale-Chall Readability Formula (NDC). The FRE counts the number of syllables per word and the number of words per sentence. The NDC counts the number of words in each sentence and assesses the proportion of easy words (based on a predefined list). We also included the proportion of words from the science-specific common words and general science jargon lists (constructed by plaven-sigray_readability_2017).
We found that both indicators of readability were correlated with the peer review outcome. The FRE (higher score = more readable) is lower for accepted manuscripts, while the NDC (higher score = less readable) is higher for accepted manuscripts (Fig. 2). This effect is found for almost every platform and for every section of the documents (title, abstract and introduction). However, results were not as consistent for the scientific jargon. There appears to be no effect for the general science jargon, where effects are inconsistent across platforms and sections. However, there is a greater proportion of science-specific common words in accepted articles, although this effect is mostly observable in the introduction section of the documents. The lists of science jargon that we used were generated from articles that were almost exclusively from the life sciences. We therefore generated an AI jargon list (see Methods) to test whether this would improve the robustness of the effect. Using this new list, we found a robust effect across platforms and document sections, where a larger proportion of AI jargon predicted greater odds of acceptance.
3.3 High-level semantic correlates of peer review outcome
3.3.1 Bibliographic coupling and semantic similarity
We then looked at how similar the accepted papers were compared to the rejected papers based on their semantic content. First, we looked at the similarity of their title, abstract or introduction based on a tf-idf representation of their word content. Second, we looked at their degree of bibliographic coupling (see Table 3). sainte-marie_you_2018 reported small to moderate correlations between the two measures in the field of economics.
| # of common cited references | Quantity |
But first, as the two approaches quantify the content similarity of the documents, we wanted to verify whether these two metrics measured different aspects of document content. We correlated the semantic distance with the bibliographic coupling of the documents submitted to each platform. We used a semantic distance metric based on the Euclidean distance between the tf-idf representations of each document, as well as both the citation intersection (# common references) and the Jaccard index (# references in common / # references in total) as measures of bibliographic coupling. We found a mild correlation (Pearson r between 0.35 and 0.40) between both measures of bibliographic coupling and semantic distance (Fig. 3). This suggests that these two measures are not redundant features of semantic content and might capture different aspects of it. This also validates our algorithm for citation disambiguation, as comparable correlations between bibliographic coupling and textual similarity were reported in (sainte-marie_you_2018).
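The two bibliographic coupling measures reduce to simple set operations over matched reference identifiers (the identifiers below are illustrative):

```python
def bibliographic_coupling(refs_a, refs_b):
    """Return the citation intersection (# common references) and the
    Jaccard index (intersection over union) for two reference sets."""
    common = refs_a & refs_b
    return len(common), len(common) / len(refs_a | refs_b)

a = {"mikolov2013", "vaswani2017", "lecun2015", "hochreiter1997"}
b = {"vaswani2017", "lecun2015", "kingma2015"}
n_common, jaccard = bibliographic_coupling(a, b)
```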
3.3.2 Bibliographic coupling and peer review outcome
We looked at how accepted and rejected manuscripts differed in their bibliographic coupling. We compared all pairs of manuscripts based on the two indicators of bibliographic coupling (intersection and Jaccard). Each pair of manuscripts was categorized as one of the following: "accepted", both submissions were accepted; "rejected", both submissions were rejected; and "mixed", one document was rejected and the other accepted.
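This pair labelling can be sketched as:

```python
from itertools import combinations

def pair_category(accepted_a, accepted_b):
    """Label a manuscript pair by the peer review outcomes of its members."""
    if accepted_a and accepted_b:
        return "accepted"
    if not accepted_a and not accepted_b:
        return "rejected"
    return "mixed"

# Hypothetical outcomes for three manuscripts; all pairs are labelled.
outcomes = {"p1": True, "p2": False, "p3": True}
labels = {frozenset(pair): pair_category(outcomes[pair[0]], outcomes[pair[1]])
          for pair in combinations(outcomes, 2)}
```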
We found that accepted pairs of manuscripts had more references in common (Fig. 4) than pairs in the two other categories. The effect was slightly weaker for the Jaccard similarity (intersection over union of citations) and less consistent across platforms than for the intersection. However, both metrics account for about 0.2% of the variance (all platforms, Jaccard: 0.228% and intersection: 0.21%). This suggests that the number of common references between manuscripts might be a more reliable determinant of their acceptance than the proportion of shared citations.
3.3.3 Semantic similarity and peer review outcome
Having established that semantic similarity and bibliographic coupling capture different aspects of the relationship between documents, we also analysed the semantic similarity of the documents from the four platforms. Thus, for each platform we computed the tf-idf distance between all pairs of documents based on their word stems.
Overall, we found that accepted manuscripts were more similar to each other than rejected manuscripts (Fig. 5). The effect grew gradually stronger when comparing the semantic similarity based on the titles, abstracts and introductions, in that order: accepted pairs of manuscripts had 0.04%, 0.59% and 1.04% less variance in their tf-idf scores, respectively. In other words, accepted pairs of manuscripts were more similar to each other than the other two pair types.
This analysis of the semantic similarity of documents (for both citations and text) showed some high-level trends based on whether or not the manuscripts were accepted after peer review. We then examined the text content of the manuscripts with a more detailed approach to gain more insight into the patterns uncovered by the analyses of bibliographic coupling and textual similarity.
Finally, after looking for systematic differences between accepted and rejected manuscripts based on high-level semantic and lexical indicators, we performed a more detailed analysis by looking directly at the word content of the manuscripts. We used a logistic regression to predict the acceptance of a submission with a bag-of-word approach.
Overall, the model was fairly successful at predicting the peer review outcome on a 10-fold cross-validated dataset (Table 6,6 & 6). The model was most successful when the text of the introduction was used, followed by the text of the abstract and of the title. There is, however, strong collinearity in the data (certain words tend to co-occur), so we avoided direct interpretation of the coefficients to identify the most important words for the classification. We instead computed the average count of each word for accepted and rejected manuscripts, and obtained a measure of "importance" based on the difference between the two averages (Fig. 6). This approach allowed us to identify the most important keywords predicting the acceptance of a manuscript.
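The importance measure (difference in average per-document word counts between accepted and rejected manuscripts) can be sketched as follows, with toy documents and labels:

```python
from collections import Counter

def word_importance(docs, labels):
    """Average per-document count of each word among accepted manuscripts
    minus the same average among rejected ones. Positive scores indicate
    words associated with acceptance; negative, with rejection."""
    acc = [d for d, ok in zip(docs, labels) if ok]
    rej = [d for d, ok in zip(docs, labels) if not ok]
    acc_counts, rej_counts = Counter(), Counter()
    for d in acc:
        acc_counts.update(d)
    for d in rej:
        rej_counts.update(d)
    vocab = set(acc_counts) | set(rej_counts)
    return {w: acc_counts[w] / len(acc) - rej_counts[w] / len(rej)
            for w in vocab}

docs = [["neural", "learn"], ["neural", "gradient"],
        ["logic", "rule"], ["fuzzy", "logic"]]
labels = [True, True, False, False]  # hypothetical outcomes
imp = word_importance(docs, labels)
```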
Although some differences were noticeable across platforms regarding the predictors of acceptance (Table LABEL:title_words_acc, LABEL:abstract_words_acc & LABEL:intro_words_acc) and rejection (Table LABEL:title_words_rej, LABEL:abstract_words_rej & LABEL:intro_words_rej), some patterns emerged. Words related to the subfields of neural networks (e.g., learn, neural, gradient, gener) increased the odds that a manuscript would be accepted, whereas words related to the subfields of logic, symbolic processing and knowledge representation (e.g., use, base, system, logic, fuzzi, knowledg, rule) decreased those odds.
4 Discussion and Conclusion
4.1 Summary of results
Our results suggest that both linguistic and content biases could occur during the peer review process at AI conferences. When considering the content of the introductions of accepted manuscripts, we found that they contained words that were acquired at a later age, more abstract and less common than the words of rejected manuscripts. We found no effect for the lexical indicators. Unsurprisingly, the effect sizes were small, given the highly multivariate determination of the peer review outcome. The effects were replicated across multiple independent datasets from different fields of AI, which strengthens the conclusions of our analysis.
From a linguistic point of view, these results suggest that accepted manuscripts could be written in a more complex, less accessible English. Using two indices of readability, one of which is agnostic to the word content of the manuscript (FRE), we found that accepted manuscripts obtained lower readability scores. Strikingly, we found the same effect in almost all of our independent datasets. The same pattern was also observed for the title, the abstract and the introduction. Using a different type of readability indicator, namely the proportion of scientific jargon words, we found weaker differences between accepted and rejected manuscripts that did not generalise to all datasets and manuscript sections. However, when using a list of AI jargon, we found that manuscripts containing a greater proportion of jargon words were more likely to be accepted. This may explain the recent finding that the readability of manuscripts has steadily declined over the last century (plaven-sigray_readability_2017). In light of our results, it is possible that part of this effect is driven by a selection process taking place during peer review.
From a content point of view, we compared manuscripts based on their referencing patterns and word content. We compared the coupling based both on the raw number of common references (intersection) and on the fraction of overlap between the manuscripts' references (Jaccard similarity). We found that accepted pairs had a larger intersection than other pairs, and found a similar but less reliable effect for the Jaccard similarity. In the same vein, we used a tf-idf vectorial representation of the text of all manuscripts in the database, compared all possible pairs of manuscripts, and found that pairs of accepted manuscripts had considerably more overlap in their word content. This high-level analysis of the manuscripts' content revealed that some topics might be associated with different odds of acceptance. We computed the correlation between bibliographic coupling and semantic similarity to get an idea of how independent the information provided by these two semantic indicators was. As reported previously (sainte-marie_you_2018), we found a weak to moderate correlation between the two, which suggests that they provide distinct sources of information about topic similarity in accepted manuscripts.
Finally, we built a logistic regression to predict the peer review outcome, which revealed that using the words of the title, abstract or introduction led to robust predictions. Our results suggest the presence of a content bias, where trending topics in AI such as machine learning and neural networks were linked with greater acceptance rates, whereas words related to symbolic processing and knowledge-based reasoning were linked with lower acceptance rates.
Taken together, our analyses of the linguistic aspects of the manuscripts are suggestive of a linguistic bias during peer review. It has been reported that writers using English as their first language (L1) use words that are more abstract and frequent than writers with English as their second language (L2) (crossley_computational_2009). Additionally, this effect is modulated by L2 proficiency, with larger differences observed for beginners than for advanced L2 speakers (crossley_predicting_2011). The complexity of L2 writing was also shown to correlate with proficiency (kim_predicting_2014; lahuerta_martinez_analysis_2018; radhiah_longitudinal_2018). Our results are therefore compatible with the hypothesis that L2 writers are less likely to get their manuscripts accepted at peer review.
Our results are also compatible with a content bias, where manuscripts on the topics of machine learning techniques and neural networks have greater odds of being accepted at peer review. Leading figures of the AI community have raised their voices against the overwhelming dominance of neural networks and deep learning in the domain of AI (marcus_deep_2018; jordan_artificial_2018; knight_one_2018). Recent successes of deep learning and neural networks might explain their dominance in the field, but a bias against other techniques might impede developments similar to the ones that led to the breakthroughs underlying the deep learning revolution (krizhevsky_imagenet_2012). Following this idea, several researchers have suggested that symbolic processing could hold the answer to the shortcomings of deep learning (geffner_model-free_2018; marcus_deep_2018; garnelo_reconciling_2019).
Although the main objective of our analysis was to investigate the presence of content or linguistic biases in peer review, all of our analyses were correlational, and there are possible confounds that could explain our results. For instance, while we found that some linguistic aspects of the manuscripts, their readability and psycholinguistic attributes, were correlated with the peer review outcome, we cannot infer that this relationship is causal. Such variables correlate with other factors, such as geographic location and the ranking of the authors' institution, which might also explain our findings.
Similarly, we cannot infer that there is a bias favouring manuscripts on machine learning techniques and neural networks over alternative approaches. For instance, reviewers favouring high benchmark performance might accept more manuscripts using state-of-the-art techniques. In this scenario, reviewers would not reject a manuscript using an alternative technique out of partiality in favour of a given technique, but simply because they value aspects on which the alternative technique underperforms.
Another limitation of our findings is the methodology of the PeerRead dataset (kang_dataset_2018). For most manuscripts included in the dataset, the status is inferred and the true outcome of the peer review process is unknown. Although (kang_dataset_2018) validated their method on a subset of their data, the accuracy is not perfect. However, we believe that the large size of the dataset is enough to counteract this source of noise. Only a minority of manuscripts included in the dataset had a true peer review outcome provided by the publishing venue. This highlights the need for publishers and conferences to open their peer review processes in order to further advance our understanding of the strengths and limitations of peer review.
In sum, our results are suggestive, but not confirmatory, of the presence of linguistic and content biases in the peer review process of major AI conferences. Both the linguistic aspects of the manuscripts and their content were related to their acceptance rates. Although we were able to replicate our results across different datasets, similar studies should be conducted both in the field of AI and in other disciplines to validate the conclusions of our study.