Casting a Wide Net: Robust Extraction of Potentially Idiomatic Expressions

11/20/2019
by Hessel Haagsma, et al.

Idiomatic expressions like 'out of the woods' and 'up the ante' present a range of difficulties for natural language processing applications. We present work on the annotation and extraction of what we term potentially idiomatic expressions (PIEs), a subclass of multiword expressions covering both literal and non-literal uses of idiomatic expressions. Existing corpora of PIEs are small and have limited coverage of different PIE types, which hampers research. To further progress on the extraction and disambiguation of potentially idiomatic expressions, larger corpora of PIEs are required. In addition, larger corpora are a potential source for valuable linguistic insights into idiomatic expressions and their variability. We propose automatic tools to facilitate the building of larger PIE corpora, by investigating the feasibility of using dictionary-based extraction of PIEs as a pre-extraction tool for English. We do this by assessing the reliability and coverage of idiom dictionaries, the annotation of a PIE corpus, and the automatic extraction of PIEs from a large corpus. Results show that combinations of dictionaries are a reliable source of idiomatic expressions, that PIEs can be annotated with a high reliability (0.74-0.91 Fleiss' Kappa), and that parse-based PIE extraction yields highly accurate performance (88% F1-score). Combining complementary PIE extraction methods increases reliability further, to over 92% F1-score. Moreover, the extraction method presented here could be extended to other types of multiword expressions and to other languages, given that sufficient NLP tools are available.
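
To make the idea of dictionary-based pre-extraction concrete, here is a minimal sketch of the simplest possible variant: exact, case-insensitive string matching of dictionary entries against a sentence. It is not the parse-based method the abstract reports results for, and the two-entry `IDIOMS` list, the `tokenize` helper, and the `extract_pies` function are hypothetical names introduced only for illustration.

```python
import re

# Hypothetical mini-dictionary (assumption); a real idiom dictionary
# would contain far more entries.
IDIOMS = ["out of the woods", "up the ante"]


def tokenize(text):
    """Lowercase the text and split it into word tokens (a crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def extract_pies(sentence, idioms=IDIOMS):
    """Return (idiom, token_start) for each exact, contiguous idiom match."""
    tokens = tokenize(sentence)
    hits = []
    for idiom in idioms:
        idiom_tokens = tokenize(idiom)
        n = len(idiom_tokens)
        # Slide a window of the idiom's length over the sentence tokens.
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == idiom_tokens:
                hits.append((idiom, i))
    return hits


if __name__ == "__main__":
    print(extract_pies("We are not out of the woods yet."))
    print(extract_pies("The hikers walked out of the woods."))
    print(extract_pies("They upped the ante."))
```

This matcher returns both the figurative and the literal use of 'out of the woods', which fits the definition of a PIE as covering both readings. It misses the inflected variant 'upped the ante', which illustrates why exact matching alone gives limited coverage.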


