On the Use of ArXiv as a Dataset

04/30/2019
by   Colin B. Clement, et al.

The arXiv has collected 1.5 million pre-print articles over 28 years, hosting literature from scientific fields including Physics, Mathematics, and Computer Science. Each pre-print features text, figures, authors, citations, categories, and other metadata. These rich, multi-modal features, combined with the natural graph structure (created by citation, affiliation, and co-authorship), make the arXiv an exciting candidate for benchmarking next-generation models. Here we take the first necessary steps toward this goal by providing a pipeline that standardizes and simplifies access to the arXiv's publicly available data. We use this pipeline to extract and analyze a 6.7-million-edge citation graph, along with an 11-billion-word corpus of full-text research articles. We present some baseline classification results and motivate the application of more exciting generative graph models.


