A Cross-Genre Ensemble Approach to Robust Reddit Part of Speech Tagging

04/29/2020
by Shabnam Behzad, et al.

Part of speech tagging is a fundamental NLP task often regarded as solved for high-resource languages such as English. Current state-of-the-art models achieve high accuracy, especially on the news domain. However, when these models are applied to corpora from other genres, and especially to user-generated data from the Web, performance drops substantially. In this work, we study how a state-of-the-art tagging model trained on different genres performs on Web content from unfiltered Reddit forum discussions. More specifically, we use data from multiple sources: OntoNotes, a large benchmark corpus with 'well-edited' text; the English Web Treebank, with 5 Web genres; and GUM, with 7 further genres other than Reddit. We report results for training on different splits of the data and testing on Reddit. Our results show that even small amounts of in-domain data can outperform the contribution of data an order of magnitude larger coming from other Web domains. To make progress on out-of-domain tagging, we also evaluate an ensemble approach that uses multiple single-genre taggers as input features to a meta-classifier. We present state-of-the-art performance on tagging Reddit data, together with an error analysis of these models' results and a typology of their most common error types, broken down by training corpus.
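The ensemble idea in the abstract (single-genre taggers feeding a meta-classifier) can be illustrated with a minimal sketch. The abstract does not specify the base tagger architecture or the meta-classifier, so everything below is an assumption made for illustration: the base taggers are toy most-frequent-tag lookups, the meta-classifier is a lookup from tuples of base predictions to the gold tag most often seen with them, and the three tiny corpora and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_unigram_tagger(tagged_sents, default="NN"):
    """Toy single-genre tagger: assigns each word its most frequent tag.

    A stand-in (assumption) for a real tagger trained on one genre.
    """
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word.lower()][tag] += 1
    table = {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def tag(words):
        # Unknown words receive the default tag.
        return [table.get(w.lower(), default) for w in words]

    return tag


def base_features(taggers, words):
    """One feature tuple per token: the tag proposed by each base tagger."""
    return list(zip(*(tagger(words) for tagger in taggers)))


def train_meta(taggers, tagged_sents):
    """Toy meta-classifier over base-tagger outputs.

    Maps each tuple of base predictions to the gold tag most often
    observed with it; unseen tuples fall back to a majority vote.
    """
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        words = [w for w, _ in sent]
        gold = [t for _, t in sent]
        for feats, gold_tag in zip(base_features(taggers, words), gold):
            counts[feats][gold_tag] += 1
    table = {f: c.most_common(1)[0][0] for f, c in counts.items()}

    def tag(words):
        out = []
        for feats in base_features(taggers, words):
            if feats in table:
                out.append(table[feats])
            else:
                out.append(Counter(feats).most_common(1)[0][0])
        return out

    return tag


# Hypothetical miniature corpora, one per genre (not data from the paper).
news = [[("the", "DT"), ("report", "NN"), ("is", "VBZ"), ("good", "JJ")]]
web = [[("lol", "UH"), ("that", "DT"), ("is", "VBZ"), ("good", "JJ")]]
# A small in-domain set used only to train the meta-classifier.
reddit_dev = [[("lol", "UH"), ("the", "DT"), ("post", "NN"),
               ("is", "VBZ"), ("good", "JJ")]]

news_tagger = train_unigram_tagger(news)
web_tagger = train_unigram_tagger(web)
meta = train_meta([news_tagger, web_tagger], reddit_dev)

print(meta(["lol", "the", "report", "is", "good"]))
```

In this toy setting each base tagger mislabels one word it never saw in its own genre ("lol" for the news tagger, "the" for the Web tagger), and the meta-classifier learns from the small in-domain set which tagger to trust for each combination of predictions. In a realistic setup the meta-classifier's training data would be kept separate from the data used to train the base taggers, and both components would be far richer models.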


