unarXive 2022: All arXiv Publications Pre-Processed for NLP, Including Structured Full-Text and Citation Network

03/27/2023
by   Tarek Saier, et al.

Large-scale data sets on scholarly publications are the basis for a variety of bibliometric analyses and natural language processing (NLP) applications. Data sets derived from publications' full text, in particular, have recently gained attention. While several such data sets already exist, we see key shortcomings in terms of their domain and time coverage, citation network completeness, and representation of full-text content. To address these points, we propose a new version of the data set unarXive. We base our data processing pipeline and output format on two existing data sets, and improve on each of them. Our resulting data set comprises 1.9 M publications spanning multiple disciplines and 32 years. It furthermore has a more complete citation network than its predecessors and retains a richer representation of document structure as well as non-textual publication content such as mathematical notation. In addition to the data set, we provide ready-to-use training/test data for citation recommendation and IMRaD classification. All data and source code are publicly available at https://github.com/IllDepence/unarXive.

research
03/27/2023

CoCon: A Data Set on Combined Contextualized Research Artifact Use

In the wake of information overload in academia, methodologies and syste...
research
02/06/2020

Citation Data of Czech Apex Courts

In this paper, we introduce the citation data of the Czech apex courts (...
research
11/07/2021

Cross-Lingual Citations in English Papers: A Large-Scale Analysis of Prevalence, Usage, and Impact

Citation information in scholarly data is an important source of insight...
research
03/23/2020

Interdisciplinarity metric based on the co-citation network

Quantifying the interdisciplinarity of a research is a relevant problem ...
research
09/12/2020

Fine-tuning Pre-trained Contextual Embeddings for Citation Content Analysis in Scholarly Publication

Citation function and citation sentiment are two essential aspects of ci...
research
07/13/2017

Incidental or influential? - Challenges in automatically detecting citation importance using publication full texts

This work looks in depth at several studies that have attempted to autom...