Why Can't Discourse Parsing Generalize? A Thorough Investigation of the Impact of Data Diversity

02/13/2023
by Yang Janet Liu, et al.

Recent advances in discourse parsing performance create the impression that, as in other NLP tasks, performance for high-resource languages such as English is finally becoming reliable. In this paper we demonstrate that this is not the case, and thoroughly investigate the impact of data diversity on RST parsing stability. We show that state-of-the-art architectures trained on the standard English newswire benchmark do not generalize well, even within the news domain. Using the two largest RST corpora of English with text from multiple genres, we quantify the impact of genre diversity in training data for achieving generalization to text types unseen during training. Our results show that a heterogeneous training regime is critical for stable and generalizable models, across parser architectures. We also provide error analyses of model outputs and out-of-domain performance. To our knowledge, this study is the first to fully evaluate cross-corpus RST parsing generalizability on complete trees, examine between-genre degradation within an RST corpus, and investigate the impact of genre diversity in training data composition.


