Automatic Correction of Syntactic Dependency Annotation Differences

01/15/2022
by   Andrew Zupon, et al.
0

Annotation inconsistencies between data sets can cause problems for low-resource NLP, where noisy or inconsistent data cannot be as easily replaced compared with resource-rich languages. In this paper, we propose a method for automatically detecting annotation mismatches between dependency parsing corpora, as well as three related methods for automatically converting the mismatches. All three methods rely on comparing an unseen example in a new corpus with similar examples in an existing corpus. These three methods include a simple lexical replacement using the most frequent tag of the example in the existing corpus, a GloVe embedding-based replacement that considers a wider pool of examples, and a BERT embedding-based replacement that uses contextualized embeddings to provide examples fine-tuned to our specific data. We then evaluate these conversions by retraining two dependency parsers – Stanza (Qi et al. 2020) and Parsing as Tagging (PaT) (Vacareanu et al. 2020) – on the converted and unconverted data. We find that applying our conversions yields significantly better performance in many cases. Some differences observed between the two parsers are observed. Stanza has a more complex architecture with a quadratic algorithm, so it takes longer to train, but it can generalize better with less data. The PaT parser has a simpler architecture with a linear algorithm, speeding up training time but requiring more training data to reach comparable or better performance.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
02/24/2020

Resources for Turkish Dependency Parsing: Introducing the BOUN Treebank and the BoAT Annotation Tool

In this paper, we describe our contributions and efforts to develop Turk...
research
10/17/2019

Cross-lingual Parsing with Polyglot Training and Multi-treebank Learning: A Faroese Case Study

Cross-lingual dependency parsing involves transferring syntactic knowled...
research
09/06/2023

Offensive Hebrew Corpus and Detection using BERT

Offensive language detection has been well studied in many languages, bu...
research
01/12/2022

Biaffine Discourse Dependency Parsing

We provide a study of using the biaffine model for neural discourse depe...
research
11/18/2021

To Augment or Not to Augment? A Comparative Study on Text Augmentation Techniques for Low-Resource NLP

Data-hungry deep neural networks have established themselves as the stan...
research
08/02/2019

Multilingual Speech Recognition with Corpus Relatedness Sampling

Multilingual acoustic models have been successfully applied to low-resou...
research
03/17/2019

Technical notes: Syntax-aware Representation Learning With Pointer Networks

This is a work-in-progress report, which aims to share preliminary resul...

Please sign up or login with your details

Forgot password? Click here to reset