How Document Pre-processing affects Keyphrase Extraction Performance

10/25/2016
by   Florian Boudin, et al.

The SemEval-2010 benchmark dataset has brought renewed attention to the task of automatic keyphrase extraction. This dataset is made up of scientific articles that were automatically converted from PDF format to plain text and thus require careful preprocessing so that irrelevant spans of text do not negatively affect keyphrase extraction performance. In previous work, a wide range of document preprocessing techniques has been described, but their impact on the overall performance of keyphrase extraction models remains unexplored. Here, we re-assess the performance of several keyphrase extraction models and measure their robustness against increasingly sophisticated levels of document preprocessing.


