Annotating the Tweebank Corpus on Named Entity Recognition and Building NLP Models for Social Media Analysis

by Hang Jiang, et al.

Social media data such as Twitter messages ("tweets") pose a particular challenge to NLP systems because of their short, noisy, and colloquial nature. Tasks such as Named Entity Recognition (NER) and syntactic parsing require highly domain-matched training data for good performance. While some annotated datasets of tweets are publicly available, each is purpose-built for a single task. As yet there is no complete training corpus for both syntactic analysis (e.g., part-of-speech tagging, dependency parsing) and NER of tweets. In this study, we create Tweebank-NER, an NER corpus based on Tweebank V2 (TB2), and we use these datasets to train state-of-the-art NLP models. We first annotate named entities in TB2 using Amazon Mechanical Turk and measure the quality of our annotations. We then train a Stanza NER model on the new benchmark, achieving competitive performance against other non-transformer NER systems. Finally, we train other Twitter NLP models (a tokenizer, lemmatizer, part-of-speech tagger, and dependency parser) on TB2 based on Stanza, and achieve state-of-the-art or competitive performance on these tasks. We release the dataset and make the models available to use in an "off-the-shelf" manner for future Tweet NLP research. Our source code, data, and pre-trained models are available at: <>.
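The abstract does not say which metric is used to measure NER performance. As an illustration only, the sketch below computes entity-level (exact span match) F1 over BIO-tagged sequences, a common choice for NER benchmarks. The function names `bio_to_spans` and `span_f1`, the BIO encoding, and the `PER`/`LOC` labels are assumptions made for this example and are not taken from the paper.

```python
from typing import List, Set, Tuple

# (start, end_exclusive, entity_type); hypothetical representation for this sketch
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Convert one BIO tag sequence into a set of entity spans."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            # A stray I- tag is treated as the start of a new entity.
            start, label = i, tag[2:]
    return spans


def span_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 over exact (sentence, start, end, type) matches."""
    gold_spans, pred_spans = set(), set()
    for idx, (g, p) in enumerate(zip(gold, pred)):
        gold_spans |= {(idx, *s) for s in bio_to_spans(g)}
        pred_spans |= {(idx, *s) for s in bio_to_spans(p)}
    tp = len(gold_spans & pred_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: the prediction finds the person but misses the location.
gold_tags = [["B-PER", "I-PER", "O", "B-LOC"]]
pred_tags = [["B-PER", "I-PER", "O", "O"]]
score = span_f1(gold_tags, pred_tags)
```

In the toy example, precision is 1.0 and recall is 0.5, so the score is 2/3. Exact-match scoring gives no credit for partially overlapping spans, which is one reason noisy, inconsistently capitalized tweet text tends to be hard for NER systems.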



WikiGoldSK: Annotated Dataset, Baselines and Few-Shot Learning Experiments for Slovak Named Entity Recognition

Named Entity Recognition (NER) is a fundamental NLP task with a wide ra...

Named Entity Recognition in Twitter: A Dataset and Analysis on Short-Term Temporal Shifts

Recent progress in language model pre-training has led to important impr...

Enriching the NArabizi Treebank: A Multifaceted Approach to Supporting an Under-Resourced Language

In this paper we address the scarcity of annotated data for NArabizi, a ...

Mitigating Temporal-Drift: A Simple Approach to Keep NER Models Crisp

Performance of neural models for named entity recognition degrades over ...

NorDial: A Preliminary Corpus of Written Norwegian Dialect Use

Norway has a large amount of dialectal variation, as well as a general t...

ner and pos when nothing is capitalized

For those languages which use it, capitalization is an important signal ...

Product Market Demand Analysis Using NLP in Banglish Text with Sentiment Analysis and Named Entity Recognition

Product market demand analysis plays a significant role for originating ...
