KPTimes: A Large-Scale Dataset for Keyphrase Generation on News Documents

by Ygor Gallina, et al.

Keyphrase generation is the task of predicting a set of lexical units that conveys the main content of a source text. Existing datasets for keyphrase generation are only readily available for the scholarly domain and include non-expert annotations. In this paper we present KPTimes, a large-scale dataset of news texts paired with editor-curated keyphrases. Exploring the dataset, we show how editors tag documents, and how their annotations differ from those found in existing datasets. We also train and evaluate state-of-the-art neural keyphrase generation models on KPTimes to gain insights on how well they perform on the news domain. The dataset is available online at .



