Two Huge Title and Keyword Generation Corpora of Research Articles

02/11/2020
by Erion Çano, et al.

Recent developments in sequence-to-sequence learning with neural networks have considerably improved the quality of automatically generated text summaries and document keywords, increasing the need for even bigger training corpora. Metadata of research articles are usually easy to find online and can be used for research on various tasks. In this paper, we introduce two huge datasets for text summarization (OAGSX) and keyword generation (OAGKX) research, containing 34 million and 23 million records, respectively. The data were retrieved from the Open Academic Graph, which is a network of research profiles and publications. We carefully processed each record and also tried several extractive and abstractive methods for both tasks to create performance baselines for other researchers. We further illustrate the performance of those methods by previewing their outputs. In the near future, we would like to apply topic modeling to the two sets to derive subsets of research articles from more specific disciplines.


