SMAuC – The Scientific Multi-Authorship Corpus

11/04/2022
by Janek Bevendorff, et al.

The rapidly growing volume of scientific publications offers an interesting challenge for research on methods for analyzing the authorship of documents with one or more authors. However, most existing datasets lack scientific documents or the necessary metadata for constructing new experiments and test cases. We introduce SMAuC, a comprehensive, metadata-rich corpus tailored to scientific authorship analysis. Comprising over 3 million publications across various disciplines from over 5 million authors, SMAuC is the largest openly accessible corpus for this purpose. It encompasses scientific texts from humanities and natural sciences, accompanied by extensive, curated metadata, including unambiguous author IDs. SMAuC aims to significantly advance the domain of authorship analysis in scientific texts.

