DeepAI AI Chat
Log In Sign Up

A Cross-Genre Ensemble Approach to Robust Reddit Part of Speech Tagging

by   Shabnam Behzad, et al.
Georgetown University

Part of speech tagging is a fundamental NLP task often regarded as solved for high-resource languages such as English. Current state-of-the-art models have achieved high accuracy, especially on the news domain. However, when these models are applied to other corpora with different genres, and especially user-generated data from the Web, we see substantial drops in performance. In this work, we study how a state-of-the-art tagging model trained on different genres performs on Web content from unfiltered Reddit forum discussions. More specifically, we use data from multiple sources: OntoNotes, a large benchmark corpus with 'well-edited' text, the English Web Treebank with 5 Web genres, and GUM, with 7 further genres other than Reddit. We report the results when training on different splits of the data, tested on Reddit. Our results show that even small amounts of in-domain data can outperform the contribution of data an order of magnitude larger coming from other Web domains. To make progress on out-of-domain tagging, we also evaluate an ensemble approach using multiple single-genre taggers as input features to a meta-classifier. We present state of the art performance on tagging Reddit data, as well as error analysis of the results of these models, and offer a typology of the most common error types among them, broken down by training corpus.


page 1

page 2

page 3

page 4


Cross-Register Projection for Headline Part of Speech Tagging

Part of speech (POS) tagging is a familiar NLP task. State of the art ta...

Rapid Adaptation of POS Tagging for Domain Specific Uses

Part-of-speech (POS) tagging is a fundamental component for performing n...

Corpus and Models for Lemmatisation and POS-tagging of Classical French Theatre

This paper describes the process of building an annotated corpus and tra...

AlephBERT:A Hebrew Large Pre-Trained Language Model to Start-off your Hebrew NLP Application With

Large Pre-trained Language Models (PLMs) have become ubiquitous in the d...

Optimal Size-Performance Tradeoffs: Weighing PoS Tagger Models

Improvement in machine learning-based NLP performance are often presente...

Reliable Part-of-Speech Tagging of Historical Corpora through Set-Valued Prediction

Syntactic annotation of corpora in the form of part-of-speech (POS) tags...
04/20/2021 A Web Application for Consuming and Annotating Legal Discourse Learning

In this work, we create a web application to highlight the output of NLP...