Automated Mining of Leaderboards for Empirical AI Research

08/31/2021 ∙ by Salomon Kabongo, et al. ∙ Technische Informationsbibliothek L3S Research Center

With the rapid growth of research publications, empowering scientists to keep oversight over scientific progress is of paramount importance. In this regard, the Leaderboards facet of information organization provides an overview of the state of the art by aggregating empirical results from various studies addressing the same research challenge. Crowdsourcing efforts like PapersWithCode, among others, are devoted to the construction of Leaderboards, predominantly for various subdomains in Artificial Intelligence. Leaderboards provide machine-readable scholarly knowledge that has proven to be directly useful for scientists to keep track of research progress. The construction of Leaderboards could be greatly expedited with automated text mining. This study presents a comprehensive approach for generating Leaderboards for knowledge-graph-based scholarly information organization. Specifically, we investigate the problem of automated Leaderboard construction using state-of-the-art transformer models, viz. Bert, SciBert, and XLNet. Our analysis reveals an optimal approach that significantly outperforms existing baselines for the task, with evaluation scores above 90%, and thus offers new state-of-the-art results for Leaderboard extraction. As a result, a vast share of empirical AI research can be organized in next-generation digital libraries as knowledge graphs.




1 Introduction

Our present rapidly amassing wealth of scholarly publications [24] poses a crucial dilemma for the research community, a trend that is only further bolstered in a number of academic disciplines by the sharing of PDF preprints ahead of (or even instead of) peer-reviewed publications [8]. The problem is: how to stay on track with past and current rapidly evolving research progress? In this era of the publications deluge [45], such a task is becoming increasingly infeasible even within one’s own narrow discipline. Thus, the need for novel technological infrastructures to support intelligent scholarly knowledge access models is only made more pressing. A viable solution to the dilemma is to make research results skimmable for the scientific community with the aid of advanced knowledge-based information access methods. This helps curtail the time-intensive and seemingly unnecessary cognitive labor that currently constitutes the researcher’s task of searching full-text articles just for the results information needed to track scholarly progress [4]. Thus, strategic reading of scholarly knowledge, focused on core aspects of research and powered by machine learning, may soon become essential for all users [42]. Since the current discourse-based form of results in a static PDF format does not support advanced computational processing, it would need to transition to truly digital formats (at least for some aspects of the research).

In this regard, an area gaining traction is empirical Artificial Intelligence (AI) research. There, Leaderboards are being crowdsourced from scholarly articles as alternative machine-readable versions of the performances of AI systems. Leaderboards typically show quantitative evaluation results (reported in scholarly articles) for defined machine learning tasks evaluated on standardized datasets using comparable metrics. As such, they are a key element for describing the state of the art in certain fields and tracking its progress. Thus, empirical results can be benchmarked in online digital libraries. Some well-known initiatives that exist to this end are: PapersWithCode [40], NLP-Progress [38], AI-metrics [1], SQuAD explorer [43], Reddit SOTA [41], and the Open Research Knowledge Graph.

Expecting scientists to alter their documentation habits to machine-readable versions rather than human-readable natural language is unrealistic, especially given that the benefits do not start to accrue until a critical mass of content is represented in this way. For this, the retrospective structuring of results from the pre-existing PDF format is essential to build a credible knowledge base. Prospectively, machine learning can assist scientists to record their results in the Leaderboards of next-generation digital libraries such as the Open Research Knowledge Graph (ORKG) [22]. In our age of the “deep learning tsunami,” there are many studies that have used neural network models to improve the construction of automated scholarly knowledge mining systems [31, 7, 3, 23]. With the recent introduction of language modeling techniques such as transformers [44], the opportunity to obtain boosted machine learning systems is further accentuated.

In this work, we empirically tackle the Leaderboard knowledge mining machine learning (ML) task via a detailed set of evaluations involving a large dataset and several ML models. The Leaderboard concept varies wrt. the domains or the captured data. Inspired by prior work [19, 21, 35], we define a Leaderboard comprising the following three scientific concepts: 1. Task, 2. Dataset, and 3. Metric (TDM). However, this base Leaderboard structure can be extended to include additional concepts such as method name, code links, etc. In this work, we restrict our evaluations to the core TDM triple. Thus, constructing a Leaderboard in our evaluations entails the extraction of all related TDM statements from an article. E.g., (Language Modeling, Penn Treebank, Test perplexity) is a Leaderboard triple of an article about the ‘Language Modeling’ Task on the ‘Penn Treebank’ Dataset in terms of the ‘Test perplexity’ Metric. Consequently, the construction of comparisons and visualizations over such machine-interpretable data can enable summarizing the performance of empirical findings across systems.
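As a minimal sketch of the Leaderboard structure just described, a TDM triple can be held in a small immutable record. The class name `TDMTriple` and the variable names are illustrative assumptions, not identifiers from the ORKG-TDM code base.

```python
from typing import NamedTuple


class TDMTriple(NamedTuple):
    """One Leaderboard entry: a (Task, Dataset, Metric) triple."""
    task: str
    dataset: str
    metric: str


# The example triple given in the text.
triple = TDMTriple("Language Modeling", "Penn Treebank", "Test perplexity")

# A paper's Leaderboard annotation is the set of all its TDM triples.
paper_leaderboard = {triple}
```

Because the record is hashable, triples can be collected in sets and compared directly across papers.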

While prior work [19, 35] has already initiated the automated learning of Leaderboards, these studies were mainly conducted under a single scenario, i.e. only one learning model was tested, and over a small dataset. For stakeholders in the Digital Library (DL) community interested in leveraging this model practically, natural questions may arise: Has the optimal learning scenario been tested? Would it work in the real-world setting of large amounts of data? Thus, we note that it should be made possible to recommend a technique for knowledge organization services from observations based on our prior comprehensive empirical evaluations [23]. Our ultimate goal with this study is to help DL stakeholders select the optimal tool to implement knowledge-based scientific information flows w.r.t. Leaderboards. To this end, we evaluated three state-of-the-art transformer models, viz. Bert, SciBert, and XLNet, each with its own unique strengths. The automatic extraction of Leaderboards is a challenging task because the location of the relevant information varies within written research, which lends credence to the creation of a human-in-the-loop model. Thus, our Leaderboard mining system will be prospectively effective as an intelligent helper in the crowdsourcing scenarios of structuring research contributions [39] within knowledge-graph-based DL infrastructures. Our approach, called ORKG-TDM, is developed and integrated into the scholarly knowledge organization platform Open Research Knowledge Graph (ORKG) [22].

In summary, the contributions of our work are:

  1. we construct a large empirical corpus containing over 4,500 scholarly articles and Leaderboard TDM triples for training and testing Leaderboard extraction approaches;

  2. we evaluate three different transformer model variants in experiments over the corpus and integrate these into the ORKG-TDM Leaderboard extraction platform;

  3. in a comprehensive empirical evaluation of ORKG-TDM we obtain scores of 93.0% micro and 92.8% macro F1, outperforming existing systems by over 20 points.

To the best of our knowledge, our ORKG-TDM models obtain state-of-the-art results for Leaderboard extraction defined as (Task, Dataset, Metric) triples extraction from empirical AI research articles. Thus ORKG-TDM can be readily leveraged within KG-based DLs and be used to comprehensively construct Leaderboards with more concepts beyond the TDM triples. Our data and code are made publicly available.

2 Related Work

Organizing scholarly knowledge extracted from scientific articles in a Knowledge Graph has been viewed from various Information Extraction (IE) perspectives.

2.0.1 Digitalization based on Textual Content Mining.

Building a scholarly knowledge graph with text mining involves two main tasks: 1. scientific term extraction and 2. extraction of scientific or semantic relations between the terms.

Addressing the first task, several dataset resources have been created with scientific term annotations and term concept typing to foster the training of supervised machine learners. For instance, the ACL RD-TEC dataset [17] annotates computational terminology in Computational Linguistics (CL) scholarly articles and categorizes them simply as technology and non-technology terms. The ScienceIE SemEval 2017 shared task [5] annotates the full text in articles from Computer Science, Material Sciences, and Physics domains for Process, Task and Material types of keyphrases. SciERC [32] annotates articles from the machine learning domain with six concepts: Task, Method, Metric, Material, Other-ScientificTerm and Generic. The STEM-ECR [12] corpus annotates Process, Method, Material, and Data concepts in article abstracts across ten STEM disciplines.

For the identification of relations between scientific terms in the natural language processing (NLP) community, within the context of human annotations on the abstracts of scholarly articles [5, 15], seven relation types between scientific terms have been studied. They are Hyponym-Of, Part-Of, Usage, Compare, Conjunction, Feature-Of, and Result. The annotations are in the form of generalized relation triples: experiment Compare another experiment; method Usage data; method Usage research task. Since human language exhibits the paraphrasing phenomenon, identifying each specific relation between scientific concepts is impractical. In the framework of an automated pipeline for generating knowledge graphs over massive volumes of scholarly records, the task of classifying scientific relations (i.e., identifying the appropriate relation type for each related concept pair from a set of predefined relations) is therefore indispensable.

In other text mining initiatives, comprehensive knowledge mining themes are being defined on scholarly investigations. The recent NLPContributionGraph Shared Task [11, 10, 9] released KG annotations of contributions including the facets of research problem, approach, experimental settings, and results, in an evaluation series that showed it to be a challenging task. Similarly, in the Life Sciences, comprehensive KGs from reports of biological assays, wet lab protocols and inorganic materials synthesis reactions and procedures [3, 2, 26, 27, 28, 36] are released in ontologized machine-interpretable formats for training machine readers.

2.0.2 Digitalization based on Table Mining

In the previous subsection, we discussed information extraction models defined for retrieving the relevant structured information from the textual body of articles. Recent efforts are geared toward mining information from the semi-structured format of tables in articles. Unlike the high performances seen in information mining systems applied to textual data, mining performances over tables are considerably lower.

Milosevic et al. [34] tested methods for extracting numerical (number of patients, age, gender distribution) and textual (adverse reactions) information from tables identified by the tag in the clinical literature as XML articles. Further, another line of work examined the classification of tables from HTML pages as entity, relational, matrix, list, and nondata leveraging specialized table embeddings called TabVec [16]. Wei et al.[46] defined a question answering task with data in Table cells as the answers over two different datasets, i.e. web data tables and news articles in text format tables. Another model called TaPaS [18] also addressed question answering over tabular data by extending BERT’s architecture to encode tables as input and training it end-to-end over tables crawled from Wikipedia. TableSeer [29] is a comprehensive tables mining search engine that crawls digital libraries, detects tables from documents, extracts their metadata, and indexes and ranks tables in a user-friendly search interface.

2.0.3 Digitalization based on Textual and Tabulated Content Mining

IBM’s science result extractor [19] first defined the extraction task from articles. They trained a BERT classification model leveraging context data from the abstract, tables, and from table headers and captions. Their dataset comprised pdf-to-text converted articles. Following this, AxCell [25] presented an automated machine learning pipeline for extracting results from papers. It used several novel components, including table segmentation, to learn relevant structural knowledge to aid extraction. Unlike the first system, AxCell was trained and tested over the LaTeX source code of machine learning papers. Furthermore, the SciRex corpus creation endeavor [21] defines mostly similar information targets as the science result extractor. However, SciRex is evaluated on clean LaTeX sources, unlike the IBM extractor and our objective of trying to identify robust machine learning pipelines over articles in PDF format. Another recent system, SciNLP-KG [35], reformulated the Leaderboard extraction task as one with the relation evaluatedOn between tasks and datasets, evaluatedBy between tasks and metrics, as well as coreferent and related relations between entities of the same type. Like us, they comprehensively investigated several transformer model variants. Nevertheless, owing to their task reformulation, relation-based evaluation, and different dataset, our results cannot be compared. Finally, Hou et al., the developers of IBM’s science result extractor, recently released the TDMSci corpus [20] as a sequence labeling task for extracting Task, Dataset, and Metric at the sentence level. We maintain the original document-level inference task definition.

In this paper, we investigate the science result extractor system [19], however, with a detailed empirical perspective. We comprehensively evaluate the potential of transformer models for the task by testing Bert, SciBert, and XLNet. We also test the models over a much larger empirical dataset emulating its application in practice in the framework of the scholarly digital libraries such as the ORKG.

3 Our Leaderboards Labeled Corpus

To facilitate supervised system development for the extraction of Leaderboards from scholarly articles, we build an empirical corpus that encapsulates the task. Leaderboard extraction is essentially an inference task over the document. To alleviate the otherwise time-consuming and expensive corpus annotation task involving expert annotators, we leverage distant supervision from the available crowdsourced metadata in the PwC KB. In the remainder of this section, we explain our corpus creation and annotation process.

3.0.1 Scholarly Papers and Metadata from the PwC Knowledge Base.

We created a new corpus as a collection of scholarly papers with their TDM triple annotations for evaluating the Leaderboards extraction task, inspired by the original IBM science result extractor [19] corpus. The collection of scholarly articles for defining our Leaderboard extraction objective is obtained from the publicly available crowdsourced leaderboards PapersWithCode (PwC). It predominantly represents articles in the Natural Language Processing and Computer Vision domains, among other AI domains such as Robotics, Graphs, Reasoning, etc. Thus, the corpus is representative of empirical AI research. The original downloaded collection (timestamp 2021-05-10 at 12:30:21) was pre-processed to be ready for analysis. (Our corpus was downloaded from the PwC GitHub repository and was constructed by combining the information in the files “All papers with abstracts” and “Evaluation tables”, which included article URLs and TDM crowdsourced annotation metadata.) While we use the same method here as the science result extractor, our corpus is different in terms of both labels and size, i.e. number of papers, as many more Leaderboards have been crowdsourced and added to PwC since the original work.

3.0.2 PDF pre-processing

While the respective articles’ metadata in machine-readable form was directly obtained from the PwC data release, the document itself being in PDF format needed to undergo pre-processing for pdf-to-text conversion so that its contents could be mined. For this, the GROBID parser [30] was applied to extract the title, abstract, and for each section, the section title and its corresponding content from the respective PDF article files. Each article’s parsed text was then annotated with TDM triples via distant labeling to create the final corpus.
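The extraction of title, abstract, and per-section content can be sketched as follows, assuming the parser output is available as TEI XML with the standard TEI namespace and the usual `titleStmt`, `abstract`, and `body/div/head` layout. The function name `parse_tei` and the returned dictionary keys are illustrative assumptions, not part of the original pipeline.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace (assumed to be the one used in the parser output).
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def _text(element):
    """Return the whitespace-normalized text of an element, or '' if missing."""
    if element is None:
        return ""
    return " ".join("".join(element.itertext()).split())


def parse_tei(tei_xml):
    """Extract the title, abstract, and (section title, content) pairs."""
    root = ET.fromstring(tei_xml)
    title = _text(root.find(".//tei:titleStmt/tei:title", TEI_NS))
    abstract = _text(root.find(".//tei:profileDesc/tei:abstract", TEI_NS))
    sections = []
    for div in root.iterfind(".//tei:body/tei:div", TEI_NS):
        head = _text(div.find("tei:head", TEI_NS))
        paragraphs = [_text(p) for p in div.iterfind("tei:p", TEI_NS)]
        sections.append((head, " ".join(paragraphs)))
    return {"title": title, "abstract": abstract, "sections": sections}
```

Only the standard library is used here, so the sketch stays independent of any particular client for the parser.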

3.0.3 Paper Annotation via Distant Labeling

Each paper was associated with its Leaderboard TDM triple annotations. These were available as the crowdsourced metadata of each article in the PwC knowledge base (KB). The number of triples per article varied between 1 (minimum) and 54 (maximum), with an average of 4.1 labels per paper. The corpus was thus annotated as a distant labeling task since the labels for each paper were directly imported from the PwC KB without additional human curation of the varying forms of label names. Additionally, Leaderboards that appeared in fewer than five papers were ignored. As a consequence of this TDM label filtering stage, some articles were left without TDM triples, and these articles were annotated with the label “unknown.”
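The label filtering step can be sketched as below, under the assumption that the imported annotations are held as a mapping from paper id to a set of (task, dataset, metric) tuples. The function name `filter_rare_leaderboards` and the `min_papers` parameter are hypothetical names chosen for illustration.

```python
from collections import Counter


def filter_rare_leaderboards(paper_triples, min_papers=5):
    """Drop triples seen in fewer than `min_papers` papers.

    paper_triples maps a paper id to its set of (task, dataset, metric)
    tuples. Papers left with no triple receive the label "unknown".
    """
    # Count in how many papers each triple occurs.
    counts = Counter(t for triples in paper_triples.values() for t in set(triples))
    labeled = {}
    for paper_id, triples in paper_triples.items():
        kept = {t for t in triples if counts[t] >= min_papers}
        labeled[paper_id] = kept if kept else {"unknown"}
    return labeled
```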

Our overall corpus statistics are shown in Table 1. We adopted a 70/30 split for the Train/Test folds for the empirical system development (described in detail in Section 6). In all, our corpus contained 5,361 articles split into 3,753 in the training data and 1,608 in the test data. The unique TDM-triples are counted in Table 1; note that since the test labels were a subset of the training labels, the unique labels overall can be considered as those in the training data. Table 1 also shows the distinct Tasks, Datasets, and Metrics in the last three rows. Our corpus contains 288 Tasks defined on 908 Datasets and evaluated by 550 Metrics. This is significantly larger than the original corpus, which had 18 Tasks defined on 44 Datasets and evaluated by 31 Metrics.

| | Ours (Train) | Ours (Test) | Original (Train) | Original (Test) |
|---|---|---|---|---|
| Papers | 3,753 | 1,608 | 170 | 167 |
| “unknown” annotations | 922 | 380 | 46 | 45 |
| Total TDM-triples | 11,724 | 5,060 | 327 | 294 |
| Avg. number of TDM-triples per paper | 4.1 | 4.1 | 2.64 | 2.41 |
| Distinct TDM-triples | 1,806 | 1,548 | 78 | 78 |
| Distinct Tasks | 288 | 252 | 18 | 18 |
| Distinct Datasets | 908 | 798 | 44 | 44 |
| Distinct Metrics | 550 | 469 | 31 | 31 |

Table 1: Ours vs. the original science result extractor [19] corpora statistics. The “unknown” labels were assigned to papers with no TDM-triples after the label filtering stage.

Figure 1: The DocTAET model with context features as a concatenation of the Scholarly Document’s Title, Abstract, Experimental-Setup, and Table content/captions for training NLI transformer models on a set of Leaderboard triples. The figure illustrates specifically six (task, dataset, metric) triples and their context in the original article text extracted as the feature for transformer models.

4 Leaderboard Extraction Task Definition

The task is defined on the dataset described in the previous section. The dataset can be formalized as follows. Let $p$ be a paper in the collection $P$. Each $p$ is annotated with at least one triple $(t, d, m)$, where $t$ is the task defined, $d$ the dataset, and $m$ the system evaluation metric. The number of triples per paper varies.

In the supervised inference task, the input data instance corresponds to the pair: a paper $p$ represented as the DocTAET context feature $p_{DocTAET}$ and its TDM-triple $(t, d, m)$. The inference data instance is then $(c; [(t, d, m), p_{DocTAET}])$, where $c \in \{\text{true}, \text{false}\}$ is the inference label. Thus, specifically, our Leaderboard extraction problem is formulated as a natural language inference task between the DocTAET context feature and the triple annotation. $c$ is true if the triple is among the paper’s TDM-triples, otherwise false. The false instances are artificially created by random selection of annotations from another paper. Cumulatively, Leaderboard construction is a multi-label, multi-class inference problem.
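The construction of true and false inference instances for one paper can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `build_nli_instances`, the `n_false` parameter, and the tuple layout `(label, triple, context)` are hypothetical, and false triples are sampled uniformly from the annotations of other papers.

```python
import random


def build_nli_instances(paper_id, context, paper_triples, n_false, rng):
    """Pair one paper's context with its true triples and sampled false ones.

    paper_triples maps a paper id to its set of (task, dataset, metric)
    tuples. Returns a list of (label, triple, context) tuples.
    """
    true_triples = set(paper_triples[paper_id])
    # Candidate false triples: annotations of other papers that are not
    # also annotations of this paper.
    other = set()
    for pid, triples in paper_triples.items():
        if pid != paper_id:
            other |= set(triples)
    candidates = sorted(other - true_triples)
    false_triples = rng.sample(candidates, min(n_false, len(candidates)))

    instances = [(True, t, context) for t in sorted(true_triples)]
    instances += [(False, t, context) for t in false_triples]
    return instances
```

Passing a seeded `random.Random` instance keeps the sampled false instances reproducible between runs.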

4.1 DocTAET Context Feature

In Figure 1, we depict the DocTAET context feature [19]. Essentially, the Leaderboard extraction task is defined on the full document content. However, the respective (task, dataset, metric) label annotations are mentioned only in specific places in the full paper such as in the Title, Abstract, Introduction, Tables. The DocTAET feature was thus defined to capture the targeted context information to facilitate the (task, dataset, metric) triple inference. It focused on capturing the context from four specific places in the text-parsed article, i.e. from the title, abstract, first few lines of the experimental setup section as well as table content and captions.
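The concatenation of the four context sources can be sketched as below. The function name `build_doctaet` and the `max_setup_lines` parameter are illustrative assumptions; the text only says "first few lines" of the experimental setup, so the default of 3 is a placeholder rather than the value used in the experiments.

```python
def build_doctaet(title, abstract, exp_setup, table_info, max_setup_lines=3):
    """Concatenate title, abstract, start of the setup section, and table text.

    exp_setup is the raw text of the experimental setup section;
    table_info is a list of strings (table captions and cell content).
    """
    # Keep only the first few lines of the experimental setup section.
    setup_head = " ".join(exp_setup.splitlines()[:max_setup_lines])
    parts = [title, abstract, setup_head, " ".join(table_info)]
    # Skip empty parts so that missing sections leave no stray whitespace.
    return " ".join(part.strip() for part in parts if part.strip())
```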

5 Transformer-based Leaderboard Extraction Models

For Leaderboard extraction [19], we employ deep transfer learning modeling architectures that rely on a recently popularized neural architecture, the transformer [44]. Transformers are arguably the most important architecture for natural language processing (NLP) today since they have shown and continue to show impressive results in several NLP tasks [14]. Owing to the self-attention mechanism in the transformer models, they can be fine-tuned on many downstream tasks. These models have thus crucially popularized the transfer learning paradigm in NLP. We investigate three transformer-based model variants for Leaderboard extraction in a Natural Language Inference configuration.

Natural language inference (NLI), generally, is the task of determining whether a “hypothesis” is true (entailment), false (contradiction), or undetermined (neutral) given a “premise” [37]. For Leaderboard extraction, the slightly adapted NLI task is to determine that the (task, dataset, metric) “hypothesis” is true (entailed) or false (not entailed) for a paper given the “premise” as the DocTAET context feature representation of the paper.

Currently, there exist several transformer-based models. In our experiments, we investigated three core models: two variants of Bert, i.e. the vanilla Bert [14] and the scientific Bert (SciBert) [6]. We also tried a different type of transformer model than Bert called XLNet [47] which employs Transformer-XL as the backbone model. Next, we briefly describe the three variants we use.

5.1 Bert Models

Bert (i.e., Bidirectional Encoder Representations from Transformers) is a bidirectional autoencoder (AE) language model. As a pre-trained language representation built on the deep neural technology of transformers, it provides NLP practitioners with high-quality language features from text data out of the box and thus improves performance on many NLP tasks. These models return contextualized word embeddings that can be directly employed as features for downstream tasks.


The first Bert model we employ is Bert_base (12 layers, 12 attention heads, and 110 million parameters), which was pre-trained on billions of words from the BooksCorpus (800M words) and the English Wikipedia (2,500M words).

The second Bert model we employ is the pre-trained scientific Bert called SciBert [6]. SciBert was pre-trained on a large corpus of scientific text. In particular, the pre-training corpus is a random sample of 1.14M papers from Semantic Scholar, consisting of full texts of which 18% are from the computer science domain and 82% from the broad biomedical domain. For both Bert_base and SciBert, we used their uncased variants.

5.2 XLNet

XLNet is an autoregressive (AR) language model [47] that enables learning bidirectional contexts using Permutation Language Modeling (PLM), unlike Bert which uses Masked Language Modeling (MLM). Thus, in PLM all tokens are predicted but in random order, whereas in MLM only the masked (15%) tokens are predicted. This is also in contrast to traditional language models, where all tokens were predicted in sequential order instead of random order. Random order prediction helps the model to learn bidirectional relationships and therefore better handle dependencies and relations between words. In addition, it uses Transformer-XL [13] as the base architecture, which models long contexts, unlike the Bert models with contexts limited to 512 tokens. Since only cased models are available for XLNet, we used the cased XLNet_base (12 layers, 12 attention heads, and 110 million parameters).
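The contrast between the two training objectives can be illustrated with a toy sketch that only tracks token positions. It mirrors the description above (all tokens predicted in random order versus roughly 15% masked tokens predicted) and is not the actual XLNet or Bert training code; the function names are hypothetical.

```python
import random


def mlm_targets(n_tokens, rng, mask_rate=0.15):
    """Positions that are masked, and therefore predicted, under MLM."""
    k = max(1, round(n_tokens * mask_rate))
    return sorted(rng.sample(range(n_tokens), k))


def plm_factorization(n_tokens, rng):
    """Prediction steps under PLM as (target position, visible positions).

    Every position is predicted exactly once, in a random order, and each
    step may condition on the positions predicted before it, which can lie
    to the left or to the right of the target.
    """
    order = list(range(n_tokens))
    rng.shuffle(order)
    return [(pos, sorted(order[:step])) for step, pos in enumerate(order)]
```

For a 20-token sequence, `mlm_targets` selects 3 positions, whereas `plm_factorization` returns 20 steps whose visible context grows from empty to 19 positions.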

6 Automated Leaderboard Mining

6.1 Experimental Setup

6.1.1 Parameter Tuning.

We used the Hugging Face Transformers library with its Bert variants and XLNet implementations. In addition to the standard fine-tuned setup for NLI, the transformer models were trained for 14 epochs, with the optimizer applying a weight decay of 0 to the bias, gamma, and beta parameters and 0.01 to the others. Our models’ hyperparameter details are available online.
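The two weight-decay groups described above can be sketched by partitioning parameter names, assuming the grouping is done by substring match on the name. The function name `group_weight_decay` and the example parameter names are hypothetical; in a real training script the groups would hold parameter tensors and be passed to the optimizer.

```python
def group_weight_decay(param_names, no_decay=("bias", "gamma", "beta"), decay=0.01):
    """Split parameter names into a decayed group and a non-decayed group."""
    groups = [
        {"params": [], "weight_decay": decay},  # all other parameters
        {"params": [], "weight_decay": 0.0},    # bias, gamma, beta
    ]
    for name in param_names:
        index = 1 if any(key in name for key in no_decay) else 0
        groups[index]["params"].append(name)
    return groups


# Hypothetical parameter names, for illustration only.
example_groups = group_weight_decay([
    "encoder.layer.0.attention.query.weight",
    "encoder.layer.0.attention.query.bias",
    "encoder.layer.0.LayerNorm.gamma",
    "encoder.layer.0.LayerNorm.beta",
])
```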


In addition, we introduce a task-specific parameter that was crucial in obtaining optimal task performance from the models. It was the number of false triples per paper. This parameter controls the discriminatory ability of the model. The original science result extractor system [19] considered $n - t$ false instances for each paper, where $n$ was the size of the distinct set of triples overall and $t$ was the number of true Leaderboard triples per paper. This approach would not generalize to our larger corpus with over 2,500 distinct triples. In other words, considering that each paper had on average 4 true triples, it would have 2,495 false triples, which would strongly bias the classifier learning toward only false inferences. Thus, we tuned this parameter over the set of values {10, 50, 100}, which at each experiment run was fixed for all papers.

Finally, we imposed an artificial trimming of the DocTAET feature to accommodate the Bert models’ maximum token length of 512. For this, the experimental setup and table info parts were first truncated to a fixed token length, after which the complete DocTAET feature was trimmed to 512 tokens.
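The two-stage trimming can be sketched on token lists as below. The function name `trim_doctaet` is hypothetical, and `part_limit` is left as a required argument because the per-part truncation length is not recoverable from the text; the token lists stand in for the output of the model tokenizer.

```python
def trim_doctaet(title, abstract, exp_setup, table_info, part_limit, max_len=512):
    """Truncate the long parts first, then trim the whole feature.

    All four inputs are lists of tokens. The experimental setup and table
    info are each cut to `part_limit` tokens before concatenation, and the
    concatenated feature is cut to `max_len` tokens.
    """
    tokens = title + abstract + exp_setup[:part_limit] + table_info[:part_limit]
    return tokens[:max_len]
```

Truncating the two long parts first keeps some table content in the feature even when the experimental setup section alone would fill the budget.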

6.1.2 Two-Fold Cross Validation.

To evaluate robust models, we performed two-fold cross validation experiments. In each fold experiment, we train a model on 70% of the overall dataset and test on the remaining 30%, ensuring that the test data splits are not identical between the folds. Thus, all cumulative results reported are averaged results over the two folds. Also, the Table 1 corpus statistics are averaged estimates over the two experimental folds.
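One way to draw two such 70/30 folds is sketched below, assuming random shuffling with a fixed seed and a redraw whenever the second test split would equal the first. The function name `two_fold_splits` and its parameters are illustrative assumptions, not the procedure used for the reported experiments.

```python
import random


def two_fold_splits(paper_ids, test_fraction=0.3, seed=0):
    """Return two (train, test) pairs of id sets with differing test sets."""
    ids = sorted(paper_ids)
    n_test = round(len(ids) * test_fraction)
    if not 0 < n_test < len(ids):
        raise ValueError("need at least one train and one test paper")
    rng = random.Random(seed)
    folds = []
    while len(folds) < 2:
        shuffled = ids[:]
        rng.shuffle(shuffled)
        test = set(shuffled[:n_test])
        # Redraw if this test split is identical to an earlier one.
        if any(test == previous_test for _, previous_test in folds):
            continue
        folds.append((set(shuffled[n_test:]), test))
    return folds
```

With 5,361 paper ids and the default fraction, each fold holds 3,753 training and 1,608 test papers, matching the split sizes reported for the corpus.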

6.1.3 Evaluation Metrics.

Within the two-fold experimental settings, we report macro- and micro-averaged precision, recall, and F1 scores for our Leaderboard extraction task on the test dataset. The macro scores capture the averaged class-level task evaluations, whereas the micro scores represent fine-grained instance-level task evaluations.

Further, the macro and micro evaluation metrics for the overall task have two evaluation settings: 1) considers papers with Leaderboards and papers with “unknown” in the metric computations; 2) only papers with Leaderboards are considered while the papers with “unknown” are excluded.
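The two averaging schemes and the two evaluation settings can be sketched as follows. This is one standard reading of the description above (macro as the mean of per-label scores, micro as scores over pooled counts), not necessarily the exact evaluation script used for the reported numbers; the function names and the `exclude_unknown` flag are hypothetical.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def evaluate(gold, pred, exclude_unknown=False):
    """gold and pred map a paper id to its set of labels."""
    papers = [p for p in gold if not (exclude_unknown and gold[p] == {"unknown"})]
    labels = set()
    for p in papers:
        labels |= gold[p] | pred.get(p, set())
    counts = []
    for label in labels:
        tp = sum(label in gold[p] and label in pred.get(p, set()) for p in papers)
        fp = sum(label not in gold[p] and label in pred.get(p, set()) for p in papers)
        fn = sum(label in gold[p] and label not in pred.get(p, set()) for p in papers)
        counts.append((tp, fp, fn))
    if not counts:
        return {"micro": (0.0, 0.0, 0.0), "macro": (0.0, 0.0, 0.0)}
    # Micro: pool the counts over all labels, then score once.
    micro = prf(*(sum(c[i] for c in counts) for i in range(3)))
    # Macro: score each label separately, then average the scores.
    per_label = [prf(*c) for c in counts]
    macro = tuple(sum(s[i] for s in per_label) / len(per_label) for i in range(3))
    return {"micro": micro, "macro": macro}
```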

(a) Task + Dataset + Metric Extraction

| Model | Macro P | Macro R | Macro F1 | Micro P | Micro R | Micro F1 |
|---|---|---|---|---|---|---|
| TDM-IE_Bert | 62.5 | 75.2 | 65.3 | 60.8 | 76.8 | 67.8 |
| ORKG-TDM_Bert | 68.1 | 67.5 | 65.5 | 79.6 | 63.3 | 70.5 |
| ORKG-TDM_SciBert | 65.7 | 77.2 | 68.3 | 65.7 | 76.8 | 70.8 |
| ORKG-TDM_XLNet | 71.7 | 73.9 | 70.6 | 77.1 | 70.9 | 73.9 |

(b) Task + Dataset + Metric Extraction (without “unknown” annotation)

| Model | Macro P | Macro R | Macro F1 | Micro P | Micro R | Micro F1 |
|---|---|---|---|---|---|---|
| TDM-IE_Bert | 54.1 | 65.9 | 56.6 | 60.2 | 73.1 | 66.0 |
| ORKG-TDM_Bert | 59.0 | 55.4 | 54.7 | 79.5 | 57.6 | 66.8 |
| ORKG-TDM_SciBert | 57.6 | 68.7 | 60.1 | 65.3 | 73.1 | 69.0 |
| ORKG-TDM_XLNet | 63.5 | 64.1 | 61.4 | 76.4 | 66.4 | 71.1 |

Table 2: Leaderboard triple extraction task, comparison of our ORKG-TDM models versus the original science result extractor model (first row in parts (a) and (b)) on the original corpus (see last two columns in Table 1 for the original corpus statistics).

6.2 Experimental Results

The results from our comprehensive evaluations with respect to four main research questions (noted as RQ1, RQ2, RQ3, and RQ4) are shown in Tables 2, 3, 4, and 5, respectively.

6.2.1 RQ1: How well do our transformer models perform for Leaderboard extraction compared to the original science result extractor when trained in the identical original experimental setting?

In the last two columns in Table 1, we showed statistics of the comparatively smaller original corpus that defined and evaluated the Leaderboard extraction task [19]. As a quick recap, the original corpus had 78 distinct TDM-triples including “unknown,” and a distribution of 170 papers in the train dataset and 167 papers in the test dataset as fixed partitions; there were 46 and 45 “unknown” papers in the train and test sets, respectively. We evaluate all the three transformer models on this original corpus to compare our model performances. These results are shown in Table 2. As we can see in the table, all three of our models outperform the original with XLNet reporting the best score. We obtain a 70.6 macro F1 versus 65.3 in the baseline. The Bert model is only a few fractional points better at 65.5. Further, we obtain a 73.9 micro F1 versus 67.8 in the baseline, with the Bert model at 70.5. With these results our models outperform the original system in the original settings reported for this task.

Next, we examine the results of our models on our larger corpus for Leaderboard extraction.

Average Evaluation Across 2-fold

| Model | Macro P | Macro R | Macro F1 | Micro P | Micro R | Micro F1 |
|---|---|---|---|---|---|---|
| ORKG-TDM_Bert | 92.8 | 93.9 | 92.4 | 95.5 | 89.1 | 92.1 |
| ORKG-TDM_SciBert | 90.9 | 93.4 | 91.1 | 94.1 | 88.5 | 91.2 |
| ORKG-TDM_XLNet | 92.8 | 94.8 | 92.8 | 94.9 | 91.2 | 93.0 |

Average Evaluation Across 2-fold (without “unknown” annotation)

| Model | Macro P | Macro R | Macro F1 | Micro P | Micro R | Micro F1 |
|---|---|---|---|---|---|---|
| ORKG-TDM_Bert | 91.7 | 92.1 | 90.8 | 95.7 | 88.3 | 91.8 |
| ORKG-TDM_SciBert | 89.7 | 91.4 | 89.4 | 94.4 | 87.6 | 90.9 |
| ORKG-TDM_XLNet | 91.6 | 93.1 | 91.2 | 95.0 | 90.5 | 92.7 |

Table 3: Top results for Bert (10neg; 1,302unk), SciBert (10neg; 1,302unk), and XLNet (10neg; 1,302unk)

| Entity | Macro P | Macro R | Macro F1 | Micro P | Micro R | Micro F1 |
|---|---|---|---|---|---|---|
| TDM | 91.6 | 93.1 | 91.2 | 95.0 | 90.5 | 92.7 |
| Task | 93.7 | 94.8 | 93.6 | 97.4 | 93.6 | 95.5 |
| Dataset | 92.9 | 93.6 | 92.4 | 96.6 | 91.5 | 94.0 |
| Metric | 92.5 | 94.2 | 92.5 | 96.0 | 92.5 | 94.2 |

Table 4: Performance of our best model, i.e. ORKG-TDM_XLNet, for Task, Dataset, and Metric concept extraction of the Leaderboard

| Document Representation | Macro P | Macro R | Macro F1 | Micro P | Micro R | Micro F1 |
|---|---|---|---|---|---|---|
| Title + Abstract | 88.6 | 92.9 | 89.4 | 92.6 | 90 | 91.3 |
| Title + Abstract + ExpSetup | 89.2 | 91.5 | 89.2 | 94.2 | 89 | 91.5 |
| Title + Abstract + TableInfo | 90.5 | 94.4 | 91.2 | 93.5 | 93.2 | 93.3 |
| Title + Abstract + ExpSetup + TableInfo | 92.3 | 93.5 | 91.7 | 95.1 | 92 | 93.5 |

Table 5: Ablation results of our best model, i.e. ORKG-TDM_XLNet, for Leaderboard extraction as triples

6.2.2 RQ2: How do the transformer models perform on a large corpus for Leaderboard construction?

We examine this question in light of the results reported in Table 3. We find that again, consistent with the observations made in Table 2, XLNet outperforms Bert and SciBert. Note that we have leveraged XLNet with the limited context length of 512 tokens (by truncating parts of the context as described in Section 6.1), and thus the potential of leveraging the full context in XLNet models remains untapped. Nevertheless, XLNet still distinguishes itself from the Bert models with its PLM language modeling objective versus Bert’s MLM. This has also been shown to perform better in practice [47]. Thus XLNet with limited context encoding can still outperform the Bert models. Among all the models, we find that SciBert shows a slightly lower performance compared to Bert. This differs slightly from the performance observed on the original smaller dataset (results in Table 2), where Bert was slightly lower than SciBert. These results indicate that for smaller datasets, the SciBert model should be preferred since its underlying science-specific pre-training corpus would compensate for signal absences in the task training corpus. However, with larger training datasets for fine-tuning the Bert models, the domain of the underlying pre-training corpus appears not to be critical to the overall model performance.

Further, we observe that the macro and micro evaluations for all three systems (ORKG-TDMBert, ORKG-TDMSciBert, ORKG-TDMXLNet) yield evenly balanced, similar scores. This tells us that our models handle the majority and minority TDM classes evenly. Thus, given our large training corpus, the transformers remain unaffected by the underlying dataset class distributions.
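To make the macro/micro distinction concrete, the following minimal sketch computes both averages in a multi-label setting where each paper carries a set of TDM labels. The function names, the toy labels, and the choice to macro-average over label classes are assumptions made for illustration; this is not the evaluation script used in the study.

```python
from collections import Counter


def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def macro_micro(gold, pred):
    """gold, pred: lists of label sets, one set per paper (assumed format)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        for label in p & g:
            tp[label] += 1
        for label in p - g:
            fp[label] += 1
        for label in g - p:
            fn[label] += 1
    labels = sorted(set(tp) | set(fp) | set(fn))
    # Macro: score every class separately, then average the class scores.
    per_class = [prf(tp[l], fp[l], fn[l]) for l in labels]
    macro = tuple(sum(s[i] for s in per_class) / len(per_class) for i in range(3))
    # Micro: pool the counts of all classes, then score once.
    micro = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Toy data: majority label "A" is always predicted, minority label "B" is missed.
gold = [{"A"}, {"A"}, {"A"}, {"B"}]
pred = [{"A"}, {"A"}, {"A"}, {"A"}]
macro, micro = macro_micro(gold, pred)
print("macro P/R/F1:", macro)
print("micro P/R/F1:", micro)
```

In this toy case the micro F1 is 0.75 while the macro F1 drops to about 0.43, because the missed minority class counts as much as the majority class in the macro average. Similar macro and micro scores therefore suggest that rare classes are not being neglected.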

6.2.3 RQ3: Which of the three Leaderboard concepts are easy or challenging to extract?

As a fine-grained examination of our best model, i.e. ORKG-TDMXLNet, we examined its performance in extracting each of the three concepts separately. These results are shown in Table 4. We observe that Task is the easiest concept to extract, followed by Metric, and then Dataset. We ascribe the lower performance on the Dataset concept to the variability in its naming across papers, even when referring to the same real-world entity. For example, the real-world dataset entity CIFAR-10 is labeled as CIFAR-10, 4000 Labels in some papers and CIFAR-10, 250 Labels in others. This phenomenon is less prevalent for the Task and Metric concepts. For example, the Task ‘Question Answering’ is rarely referenced differently across papers addressing the task. Similarly, a Metric such as Accuracy has very few naming variants.
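The effect of naming variability can be illustrated with a small sketch that groups dataset labels by a shared base name. The helper `group_variants` and its split-on-first-comma heuristic are hypothetical and serve only to show how one real-world dataset fragments into several label classes; they are not part of the ORKG-TDM pipeline.

```python
from collections import defaultdict


def group_variants(labels):
    """Group dataset labels by the text before the first comma (crude heuristic)."""
    groups = defaultdict(set)
    for label in labels:
        base = label.split(",", 1)[0].strip()
        groups[base].add(label)
    return dict(groups)


# Label strings taken from the examples discussed in the text.
labels = ["CIFAR-10", "CIFAR-10, 4000 Labels", "CIFAR-10, 250 Labels", "ImageNet"]
groups = group_variants(labels)
for base, variants in sorted(groups.items()):
    print(base, "->", len(variants), "label variant(s)")
```

Here the single entity CIFAR-10 accounts for three distinct labels that a classifier must tell apart, whereas ImageNet has only one.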

6.2.4 RQ4: Which aspect of the DocTAET context feature representation had the highest impact for Leaderboard extraction?

Further, in Table 5, we break down the performance of our best model, i.e. ORKG-TDMXLNet, by examining the impact of the individual features that make up the shortened article context used for TDM inference. We observe that adding contextual information beyond the title and abstract increases performance significantly, while the actual type of additional information (i.e. experimental setup, table information, or both) affects performance to a lesser extent.
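The ablation in Table 5 can be pictured as toggling which parts of the article are concatenated into the model input. The sketch below assembles the four variants and truncates each to a token budget. The field names, the helper `build_context`, the toy document, and the whitespace tokenizer are all illustrative assumptions; the actual models rely on subword tokenizers with a 512-token limit.

```python
def build_context(doc, parts, max_tokens=512):
    """Concatenate the selected document parts and truncate to max_tokens.

    Whitespace splitting stands in for a real subword tokenizer (assumption).
    """
    text = " ".join(doc[p] for p in parts if doc.get(p))
    tokens = text.split()
    return " ".join(tokens[:max_tokens])


# Hypothetical document with the four DocTAET fields.
doc = {
    "title": "A toy paper title",
    "abstract": "We study a toy task on a toy dataset.",
    "exp_setup": "We evaluate with accuracy on the test split.",
    "table_info": "Table 1: Accuracy on the toy dataset.",
}

# The four document representations compared in Table 5.
variants = {
    "Title + Abstract": ["title", "abstract"],
    "Title + Abstract + ExpSetup": ["title", "abstract", "exp_setup"],
    "Title + Abstract + TableInfo": ["title", "abstract", "table_info"],
    "Title + Abstract + ExpSetup + TableInfo": [
        "title", "abstract", "exp_setup", "table_info",
    ],
}

contexts = {name: build_context(doc, parts) for name, parts in variants.items()}
for name, context in contexts.items():
    print(f"{name}: {len(context.split())} tokens")
```

Because truncation cuts from the end, the order in which the parts are concatenated decides what is lost when an article exceeds the budget; the order used here simply follows the column labels of Table 5.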

6.3 Integrating ORKG-TDM in Scholarly Digital Libraries

Figure 2: Overview of the Leaderboard Extraction Machine Learning Process Flow for ORKG-TDM in the Open Research Knowledge Graph Digital Library

Ultimately, with Leaderboard construction we aim to give researchers oversight over the state-of-the-art with respect to certain research questions (i.e. tasks). Figure 2 shows how our ORKG-TDM Leaderboard mining is integrated into the Open Research Knowledge Graph (ORKG) scholarly knowledge platform. The incomplete and partially imprecise results produced by automated Leaderboard mining can be corrected through the crowdsourcing and curation features implemented in the ORKG. The ORKG also provides features for dynamic Leaderboard visualization, publication, versioning, etc.

Although the experiments of our study targeted empirical AI research, we are confident that the approach is transferable to similar scholarly knowledge extraction tasks in other domains. For example, in chemistry or materials science, experimentally observed properties of substances or materials under certain conditions could be obtained from various papers.

7 Conclusion and Future Work

In this paper, we investigated the Leaderboard extraction task with respect to three different transformer-based models. Our overarching aim with this work is to build a system for extracting comparable scientific concepts from scholarly articles. Therefore, as a next step, we will extend the current triples (task, dataset, metric) model with additional concepts that are suitable candidates for a Leaderboard, such as scores or code URLs. In this respect, we will adopt a hybrid system in which some elements are extracted by the machine learning system discussed in this work, while other elements are extracted by a system of rules and regular expressions. We also plan to combine the automated techniques presented herein with a crowdsourcing approach for further validating the extracted results and providing additional training data. Our work in this regard is embedded in a larger research and service development agenda, in which we build a comprehensive knowledge graph for representing and tracking scholarly advancements [22]. We also envision the task-dataset-metric extraction approach to be transferable to other domains (such as materials science, engineering simulations, etc.). Our ultimate target is to create a comprehensive structured knowledge graph tracking scientific progress in various scientific domains, which can be leveraged for novel machine-assistance measures in scholarly communication, such as question answering, faceted exploration, and contribution correlation tracing.


This work was co-funded by the Federal Ministry of Education and Research (BMBF) of Germany for the project LeibnizKILabor (grant no. 01DD20003) and by the European Research Council for the project ScienceGRAPH (Grant agreement ID: 819536).


  • [1] AI metrics. Note: 2021-04-26 Cited by: §1.
  • [2] M. Anteghini, J. D’Souza, V. A. M. dos Santos, and S. Auer (2020) Representing semantified biological assays in the open research knowledge graph. In International Conference on Asian Digital Libraries, pp. 89–98. Cited by: §2.0.1.
  • [3] M. Anteghini, J. D’Souza, V. A. M. Dos Santos, and S. Auer (2020) SciBERT-based semantification of bioassays in the open research knowledge graph. In EKAW-PD 2020, pp. 22–30. Cited by: §1, §2.0.1.
  • [4] S. Auer (2018-01) Towards an open research knowledge graph. Zenodo. External Links: Document, Link Cited by: §1.
  • [5] I. Augenstein, M. Das, S. Riedel, L. Vikraman, and A. McCallum (2017) SemEval 2017 task 10: scienceie - extracting keyphrases and relations from scientific publications. In SemEval@ACL, Cited by: §2.0.1, §2.0.1.
  • [6] I. Beltagy, K. Lo, and A. Cohan (2019) SciBERT: a pretrained language model for scientific text. arXiv preprint arXiv:1903.10676. Cited by: §5.1, §5.
  • [7] A. Brack, J. D’Souza, A. Hoppe, S. Auer, and R. Ewerth (2020) Domain-independent extraction of scientific concepts from research articles. In European Conference on Information Retrieval, pp. 251–266. Cited by: §1.
  • [8] A. Chiarelli, R. Johnson, E. Richens, and S. Pinfield (2019) Accelerating scholarly communication: the transformative role of preprints. Cited by: §1.
  • [9] J. D’Souza, S. Auer, and T. Pedersen (2021-08) SemEval-2021 task 11: nlpcontributiongraph - structuring scholarly nlp contributions for a research knowledge graph. In Proceedings of the Fifteenth Workshop on Semantic Evaluation, Bangkok (online). Cited by: §2.0.1.
  • [10] Cited by: §2.0.1.
  • [11] J. D’Souza and S. Auer (2021) Sentence, phrase, and triple annotations to build a knowledge graph of natural language processing contributions—a trial dataset. Journal of Data and Information Science, pp. 20210429. Cited by: §2.0.1.
  • [12] J. D’Souza, A. Hoppe, A. Brack, M. Y. Jaradeh, S. Auer, and R. Ewerth (2020-05) The stem-ecr dataset: grounding scientific entity references in stem scholarly content to authoritative encyclopedic and lexicographic sources. In LREC, Marseille, France, pp. 2192–2203. Cited by: §2.0.1.
  • [13] Z. Dai, Z. Yang, Y. Yang, J. G. Carbonell, Q. Le, and R. Salakhutdinov (2019) Transformer-xl: attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2978–2988. Cited by: §5.2.
  • [14] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §5, §5.
  • [15] K. Gábor, D. Buscaldi, A. Schumann, B. QasemiZadeh, H. Zargayouna, and T. Charnois (2018) Semeval-2018 task 7: semantic relation extraction and classification in scientific papers. In Proceedings of The 12th International Workshop on Semantic Evaluation, pp. 679–688. Cited by: §2.0.1.
  • [16] M. Ghasemi-Gol and P. Szekely (2018) Tabvec: table vectors for classification of web tables. arXiv preprint arXiv:1802.06290. Cited by: §2.0.2.
  • [17] S. Handschuh and B. QasemiZadeh (2014) The acl rd-tec: a dataset for benchmarking terminology extraction and classification in computational linguistics. In COLING 2014: 4th international workshop on computational terminology, Cited by: §2.0.1.
  • [18] J. Herzig, P. K. Nowak, T. Mueller, F. Piccinno, and J. Eisenschlos (2020) TaPas: weakly supervised table parsing via pre-training. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4320–4333. Cited by: §2.0.2.
  • [19] Y. Hou, C. Jochim, M. Gleize, F. Bonin, and D. Ganguly (2019) Identification of tasks, datasets, evaluation metrics, and numeric scores for scientific leaderboards construction. arXiv preprint arXiv:1906.09317. Cited by: §1, §1, §2.0.3, §2.0.3, §3.0.1, Table 1, §4.1, §5, §6.1.1, §6.2.1.
  • [20] Y. Hou, C. Jochim, M. Gleize, F. Bonin, and D. Ganguly (2021) TDMSci: a specialized corpus for scientific literature entity tagging of tasks datasets and metrics. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 707–714. Cited by: §2.0.3.
  • [21] S. Jain, M. van Zuylen, H. Hajishirzi, and I. Beltagy (2020) SciREX: a challenge dataset for document-level information extraction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7506–7516. Cited by: §1, §2.0.3.
  • [22] M. Y. Jaradeh, A. Oelen, K. E. Farfar, M. Prinz, J. D’Souza, G. Kismihók, M. Stocker, and S. Auer (2019) Open research knowledge graph: next generation infrastructure for semantic scholarly knowledge. In Proceedings of the 10th International Conference on Knowledge Capture, pp. 243–246. Cited by: §1, §1, §7.
  • [23] M. Jiang, J. D’Souza, S. Auer, and J. S. Downie (2020) Improving scholarly knowledge representation: evaluating bert-based models for scientific relation classification. In International Conference on Asian Digital Libraries, pp. 3–19. Cited by: §1, §1, §5.1.
  • [24] A. E. Jinha (2010) Article 50 million: an estimate of the number of scholarly articles in existence. Learned Publishing 23 (3), pp. 258–263. Cited by: §1.
  • [25] M. Kardas, P. Czapla, P. Stenetorp, S. Ruder, S. Riedel, R. Taylor, and R. Stojnic (2020) AxCell: automatic extraction of results from machine learning papers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 8580–8594. Cited by: §2.0.3.
  • [26] O. Kononova, H. Huo, T. He, Z. Rong, T. Botari, W. Sun, V. Tshitoyan, and G. Ceder (2019) Text-mined dataset of inorganic materials synthesis recipes. Scientific data 6 (1), pp. 1–11. Cited by: §2.0.1.
  • [27] C. Kulkarni, W. Xu, A. Ritter, and R. Machiraju (2018-06) An annotated corpus for machine reading of instructions in wet lab protocols. In NAACL: HLT, Volume 2 (Short Papers), New Orleans, Louisiana, pp. 97–106. External Links: Document Cited by: §2.0.1.
  • [28] F. Kuniyoshi, K. Makino, J. Ozawa, and M. Miwa (2020) Annotating and extracting synthesis process of all-solid-state batteries from scientific literature. In LREC, pp. 1941–1950. Cited by: §2.0.1.
  • [29] Y. Liu, K. Bai, P. Mitra, and C. L. Giles (2007) Tableseer: automatic table metadata extraction and searching in digital libraries. In Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries, pp. 91–100. Cited by: §2.0.2.
  • [30] P. Lopez (2009) GROBID: combining automatic bibliographic data recognition and term extraction for scholarship publications. In International conference on theory and practice of digital libraries, pp. 473–474. Cited by: §3.0.2.
  • [31] Y. Luan, L. He, M. Ostendorf, and H. Hajishirzi (2018) Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction. arXiv preprint arXiv:1808.09602. Cited by: §1.
  • [32] Y. Luan, L. He, M. Ostendorf, and H. Hajishirzi (2018) Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction. In EMNLP, Cited by: §2.0.1.
  • [33] C. D. Manning (2015) Computational linguistics and deep learning. Computational Linguistics 41 (4), pp. 701–707. Cited by: §1.
  • [34] N. Milosevic, C. Gregson, R. Hernandez, and G. Nenadic (2019-03) A framework for information extraction from tables in biomedical literature. Int. J. Doc. Anal. Recognit. 22 (1), pp. 55–78. External Links: ISSN 1433-2833, Link, Document Cited by: §2.0.2.
  • [35] I. Mondal, Y. Hou, and C. Jochim (2021) End-to-end nlp knowledge graph construction. arXiv preprint arXiv:2106.01167. Cited by: §1, §1, §2.0.3.
  • [36] S. Mysore, Z. Jensen, E. Kim, K. Huang, H. Chang, E. Strubell, J. Flanigan, A. McCallum, and E. Olivetti (2019) The materials science procedural text corpus: annotating materials synthesis procedures with shallow semantic structures. In Proceedings of the 13th Linguistic Annotation Workshop, pp. 56–64. Cited by: §2.0.1.
  • [37] Natural Language Inference. Note: 22 April 2021 Cited by: §5.
  • [38] NLP-progress. Note: 2021-04-26 Cited by: §1.
  • [39] A. Oelen, M. Stocker, and S. Auer (2021) Crowdsourcing scholarly discourse annotations. In 26th International Conference on Intelligent User Interfaces, pp. 464–474. Cited by: §1.
  • [40] Note: 2021-04-26 Cited by: §1.
  • [41] Reddit sota. Note: 2021-04-26 Cited by: §1.
  • [42] A. H. Renear and C. L. Palmer (2009) Strategic reading, ontologies, and the future of scientific publishing. Science 325 (5942), pp. 828–832. Cited by: §1.
  • [43] SQuAD explorer. Note: 2021-04-26 Cited by: §1.
  • [44] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1, §5.
  • [45] M. Ware and M. Mabe (2015-03) The stm report: an overview of scientific and scholarly journal publishing. pp. . Cited by: §1.
  • [46] X. Wei, B. Croft, and A. Mccallum (2006-11) Table extraction for answer retrieval. Inf. Retr. 9 (5), pp. 589–611. External Links: ISSN 1386-4564, Link, Document Cited by: §2.0.2.
  • [47] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) Xlnet: generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237. Cited by: §5.2, §5, §6.2.2.

Appendix 0.A Examples of Leaderboards in Our Corpus

The top-3 most common Leaderboards in our training set included: 1) Image Classification, ImageNet, Top 1 Accuracy; 2) Object Detection, COCO test-dev, box AP; 3) Image Classification, CIFAR-10, Percentage correct, occurring 93, 57, and 51 times, respectively.

Three randomly selected Leaderboards from among the least common in our training set were: 1) Word Sense Disambiguation, WiC-TSV, Task 1 Accuracy: all; 2) Entity Linking, WiC-TSV, Task 1 Accuracy: all; 3) Causal Inference, IDHP, Average Treatment Effect Error, each occurring once.