
Annotating the Tweebank Corpus on Named Entity Recognition and Building NLP Models for Social Media Analysis

by Hang Jiang, et al.

Social media data such as Twitter messages ("tweets") pose a particular challenge to NLP systems because of their short, noisy, and colloquial nature. Tasks such as Named Entity Recognition (NER) and syntactic parsing require highly domain-matched training data for good performance. While there are some publicly available annotated datasets of tweets, they are all purpose-built for solving one task at a time. As yet there is no complete training corpus for both syntactic analysis (e.g., part of speech tagging, dependency parsing) and NER of tweets. In this study, we aim to create Tweebank-NER, an NER corpus based on Tweebank V2 (TB2), and we use these datasets to train state-of-the-art NLP models. We first annotate named entities in TB2 using Amazon Mechanical Turk and measure the quality of our annotations. We train a Stanza NER model on the new benchmark, achieving competitive performance against other non-transformer NER systems. Finally, we train other Twitter NLP models (a tokenizer, lemmatizer, part of speech tagger, and dependency parser) on TB2 based on Stanza, and achieve state-of-the-art or competitive performance on these tasks. We release the dataset and make the models available to use in an "off-the-shelf" manner for future Tweet NLP research. Our source code, data, and pre-trained models are available at: <>.





1 Introduction

Researchers use text data from social media platforms such as Twitter and Reddit for a wide range of studies, including opinion mining, socio-cultural analysis, and language variation. Messages posted to such platforms are typically written in a less formal style than what is found in conventional data sources for NLP models, namely news articles, papers, websites, and books. Processing the noisy and informal language of social media is challenging for traditional NLP tools because such messages are usually short in length and irregular in spelling and structure. In response, the NLP community has been constructing language resources and building NLP pipelines for social media data, especially for Twitter.

Annotating social media language resources is important to the development of NLP tools. foster2011hardtoparse is one of the earliest attempts to annotate tweets in the Penn Treebank (PTB) format. Following a similar PTB-style convention suggested by schneider2013framework, kong2014dependency created Tweebank V1. However, the PTB annotation guidelines leave many annotation decisions unspecified and are therefore unsuitable for informal and user-generated text. After Universal Dependencies (UD)

[16] was introduced to enable consistent annotation across different languages and genres, liu2018parsing introduced a new tweet-based Tweebank V2 in UD, including tokenization, part-of-speech (POS) tags, and (labeled) Universal Dependencies. Besides syntactic annotation, NLP researchers have also annotated tweets on named entities. ritter2011named first introduced this task and found that NER systems trained on the news perform poorly on tweets. Since then, the noisy user-generated text (WNUT) workshop has proposed a few benchmark datasets including WNUT15 [30], WNUT16 [30], and WNUT17 [7] for Twitter lexical normalization and named entity recognition (NER). However, these benchmarks are not based upon TB2, which contains high-quality UD annotations.

Many researchers have invested in building better NLP pipelines for tokenization, POS tagging, parsing, and NER. The earliest work focuses on Twitter POS taggers [9, 18] and NER [23]. Later, kong2014dependency published TweeboParser on Tweebank V1, covering tokenization, POS tagging, and dependency parsing. liu2018parsing further improved the whole pipeline based on TB2. The current state-of-the-art (SOTA) pipeline for POS tagging and NER is based on BERT pre-trained on massive amounts of tweets nguyen2020bertweet. However, these efforts (1) are often no longer maintained [23, 12], (2) do not contain publicly available NLP modules (e.g., NER, POS tagger) [15], or (3) are written in C/C++ or R with complicated dependencies and installation processes (e.g., Twpipe [13] and UDPipe [27]), making them difficult to integrate into Python frameworks and to use in an “off-the-shelf” fashion. Many modern NLP tools in Python, such as spaCy, Stanza [21], and FLAIR [2], have been developed for standard NLP benchmarks but have never been adapted to Tweet NLP tasks. We choose Stanza over other NLP frameworks because (1) Stanza achieves SOTA or competitive performance on many NLP tasks across 66 languages [21], (2) Stanza supports both CPU and GPU training and inference, while transformer-based models (e.g., BERTweet) need a GPU, (3) Stanza shows superior performance to spaCy in our experiments, despite slower speeds, and (4) Stanza is competitive in speed with FLAIR at similar accuracy [21], while the dependency parser in FLAIR is still under development.

In this paper, we annotate Tweebank V2 on NER and also build SOTA Tweet NLP models for tweets. Our contributions are as follows:

  • We annotate Tweebank V2, the main treebank for English Twitter NLP tasks, on NER. This annotation not only provides a new benchmark (Tweebank-NER) for Twitter NER but also makes Tweebank a complete dataset for both syntactic tasks and NER.

  • We leverage the Stanza framework to present an accurate and fast Tweet NLP pipeline called Twitter-Stanza. Twitter-Stanza includes NER, tokenization, lemmatization, POS tagging, and dependency parsing modules, and it supports both CPU and GPU computation.

  • We compare Twitter-Stanza against existing models for each presented NLP task, confirming that Stanza’s simple neural architecture is effective and suitable for tweets. Among non-transformer models, the Twitter-Stanza tokenizer achieves SOTA performance on TB2, and its POS tagger, dependency parser, and NER model obtain competitive performance.

  • We also train transformer-based models to establish a strong performance on the Tweebank-NER benchmark and achieve the SOTA performance in dependency parsing on TB2.

  • We release our data, models, and code. Our pipeline is highly compatible with Stanza’s Python interface and is simple to use in an “off-the-shelf” fashion. We hope that Twitter-Stanza can serve as a convenient NLP tool and a strong baseline for future research and applications of Tweet analytic tasks.

2 Dataset and Annotation Scheme

In this study, we primarily work with the Tweebank V2 dataset and develop its NER annotations through rigorous annotation guidelines. We also evaluate the quality of our annotations, showing that they achieve a good F1 inter-annotator agreement score.

2.1 Datasets and Annotation Statistics

Tweebank V2 (TB2) [12, 13] is a collection of 3,550 labeled anonymous English tweets annotated in Universal Dependencies. It is a commonly used corpus for training and fine-tuning NLP systems on social media text. The statistics of TB2 are shown in Table 1.

Dataset Train Dev Test


Tweets 1,639 710 1,201
Tokens 24,753 11,742 19,112
Avg. token per tweet 15.1 16.6 15.9
Annotated spans 979 425 750
Annotated tokens 1,484 675 1,183
Avg. token per span 1.5 1.6 1.6
Table 1: Annotated corpus statistics.

2.2 Annotation Guidelines

We follow the CoNLL 2003 guidelines to annotate named entities. To help annotators understand the guidelines, we provide multiple examples for each rule and ask annotators to read them before the task. Our task focuses on the following four named entity types:

  • PER: persons (e.g., Joe Biden, joe biden, Ben, 50 Cent, Jesus)

  • ORG: organizations (e.g., Stanford University, stanford, IBM, Black Lives Matter, WHO, Boston Red Sox, Science Magazine, NYT)

  • LOC: locations (e.g., United States, usa, China, Boston, Bay Area, CA, MT Washington)

  • MISC: named entities which do not belong to the previous three. (e.g., Chinese, chinese, World Cup 2002, Democrat, Just Do It, Top 10, Titanic, The Shining, All You Need Is Love)

To handle challenges in tweets, we also add requirements consistent with [23]: (1) ignore numerical entities (MONEY, NUMBER, ORDINAL, PERCENT), (2) ignore temporal entities (DATE, TIME, DURATION, SET), (3) “at mentions” are not named entities (e.g., allow “Donald Trump” but not @DonaldTrump), (4) #hashtags are not named entities (e.g., allow “BLM” but not “#BLM”), and (5) URLs are not named entities.
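Rules (3)-(5) above are mechanically checkable. A minimal sketch of such a check (a hypothetical helper, not part of the released code) could look like this:

```python
import re

def violates_guidelines(span_text: str) -> bool:
    """Return True if a candidate entity span breaks rules (3)-(5):
    @-mentions, #hashtags, and URLs may not be annotated as named entities."""
    if span_text.startswith("@") or span_text.startswith("#"):
        return True  # rules (3) and (4): mentions and hashtags are excluded
    if re.match(r"https?://|www\.", span_text):
        return True  # rule (5): URLs are excluded
    return False
```

A check like this is also how the qualification test's "significant errors" (Section 2.3) can be detected automatically.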

2.3 Annotation Logistics

We use the Qualtrics software to design the sequence labeling task and use Amazon Mechanical Turk to recruit annotators. We first launch a pilot study, annotate each of the 100 tweets, and discuss tweets with divergent annotations. Based on the pilot study, we develop a series of annotation rules and precautions. During the recruiting process, each annotator is given an overview of the annotation conventions and our guidelines, after which they are asked to complete a qualification test. The qualification test consists of 7 tweets selected from the pilot study. To enter the final process, each annotator must meet two requirements on the qualification test: (1) make fewer than 2 errors, and (2) make no significant errors. The first significant error is annotating any URL, @USER, or hashtag as a named entity, and the second is confusing PER, LOC, and ORG.

After all tweets have been annotated by at least 3 annotators, we merge the annotation results and create the Tweebank-NER dataset in the BIO format [22]. In the merging process, if at least two annotators agree on the annotation for a tweet, we use that result as the final annotation. Otherwise, we discuss and re-annotate the tweet to reach a consensus. We identify 177 span annotations whose three annotations all differ from each other and decide their gold annotations collectively between two authors. For 155 of these 177 annotations, one of the three annotators’ answers matches the final annotation.
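The majority-vote merge described above can be sketched as follows (an illustrative reconstruction, not the authors' exact script):

```python
from collections import Counter

def merge_annotations(tag_rows):
    """tag_rows: list of three equal-length BIO tag sequences, one per annotator.
    Returns the merged tags plus the indices needing manual adjudication."""
    merged, needs_adjudication = [], []
    for i, tags in enumerate(zip(*tag_rows)):
        tag, freq = Counter(tags).most_common(1)[0]
        if freq >= 2:                # at least two annotators agree
            merged.append(tag)
        else:                        # three-way disagreement -> adjudicate
            merged.append(None)
            needs_adjudication.append(i)
    return merged, needs_adjudication
```

In the paper the adjudication step corresponds to the 177 spans resolved collectively by two authors.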

2.4 Annotation Quality

We first evaluate the quality of the annotations with the inter-annotator agreement (IAA). For NER, Cohen’s Kappa is not the best measure because it requires the number of negative cases, whereas NER is a sequence tagging task. Therefore, we follow previous work [11, 10, 4] in using the pairwise F1 score calculated without the O label as a better measure of IAA for NER [6]. In Table 2, we observe that PER, LOC, and ORG have higher F1 agreement than MISC, showing that MISC is more difficult to annotate than the other types. We also provide an additional Kappa measure (0.347) on annotated tokens for some insight, although it significantly underestimates IAA for NER. Finally, we calculate the scores by comparing the crowdsourced annotators against our own annotations on 100 sampled examples and obtain similar F1 and Kappa scores.
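A minimal sketch of the pairwise-F1 measure: treat one annotator as "gold" and the other as "prediction", ignore O tokens, and compute F1 over the remaining token-level labels. (The paper averages this over annotator pairs; the exact span-versus-token handling may differ from this simplification.)

```python
def pairwise_f1(tags_a, tags_b):
    """Token-level F1 between two annotators' BIO tags, excluding the O label."""
    tp = sum(1 for a, b in zip(tags_a, tags_b) if a != "O" and a == b)
    pred = sum(1 for b in tags_b if b != "O")   # non-O tokens of annotator B
    gold = sum(1 for a in tags_a if a != "O")   # non-O tokens of annotator A
    if pred == 0 or gold == 0:
        return 0.0
    p, r = tp / pred, tp / gold
    return 2 * p * r / (p + r) if p + r else 0.0
```

Excluding O matters because the overwhelming majority of tokens are O, which would otherwise inflate agreement, the same reason token-level Kappa underestimates IAA here.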

Label Quantity F1


PER 777 84.6
LOC 317 74.4
ORG 541 71.9
MISC 519 50.9
Overall 2,154 70.7
Table 2: Number of span annotations per entity type and Inter-annotator agreement scores in pairwise F1.

3 Methods for NLP Modeling

Stanza is a state-of-the-art and efficient framework for many NLP tasks [21, 32], and it supports both NER and syntactic tasks. We use Stanza to train NER models as well as syntactic models (tokenization, lemmatization, POS tagging, dependency parsing) on TB2. For more detailed information on Stanza, we refer readers to the Stanza paper [21] and its website. We use 100-dimensional Twitter GloVe embeddings [19] in our experiments and the default Stanza parameters for training.

Alternative NLP frameworks such as spaCy, FLAIR, and spaCy-transformers are compared with Stanza. Both spaCy and FLAIR are open-source NLP frameworks for NER and syntactic tasks. The spaCy-transformers framework provides a spaCy interface to SOTA transformer architectures via Hugging Face’s transformers. To train spaCy, we adopt the default NER setting and the default syntactic NLP pipeline. For FLAIR, we train its NER and syntactic modules with the default settings as well. Finally, we fine-tune the BERTweet-base and XLM-RoBERTa-base language models via spaCy-transformers for NER, POS tagging, and dependency parsing; we denote them as spaCy-BERTweet and spaCy-XLM-RoBERTa in this paper. BERTweet [15] is the first public large-scale language model for English tweets, based on RoBERTa, and XLM-RoBERTa-base is a multilingual version of RoBERTa-base. Both models show strong performance on Tweet NER and POS tagging [15].

3.1 Named Entity Recognition

In this paper, we adopt the four-class convention to define NER as a task to locate and classify named entities mentioned in unstructured text into four pre-defined categories: PER, ORG, LOC, and MISC

[24]. We use the Stanza NER architecture for training and evaluation, which is a contextualized string representation-based sequence tagger [3]. This model contains a forward and a backward character-level LSTM language model to extract token-level representations and a BiLSTM-CRF sequence labeler to predict the named entities. We also train the default NER models for spaCy, FLAIR, and spaCy-BERTweet for comparison.

3.2 Syntactic NLP Tasks

3.2.1 Tokenization

Tokenizers predict whether a given character in a sentence is the end of a token. The Stanza tokenizer jointly works on tokenization and sentence segmentation, by modeling them as a tagging problem over character sequences. In accordance with previous work [9, 13], we focus on the performance in tokenization, as tweets are usually short with a single sentence.
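The character-tagging view of tokenization described above can be made concrete with a small sketch: given raw text and its gold tokens, label each character 1 if it ends a token and 0 otherwise. (Stanza's actual tokenizer uses a richer neural feature set; this only illustrates the training-label construction.)

```python
def end_of_token_labels(text, tokens):
    """Per-character binary labels: 1 if the character ends a gold token."""
    labels = [0] * len(text)
    pos = 0
    for tok in tokens:
        start = text.index(tok, pos)   # locate the token in the raw text
        end = start + len(tok) - 1
        labels[end] = 1                # mark its final character
        pos = end + 1
    return labels
```

Sentence segmentation is handled jointly in Stanza by predicting an additional "end of sentence" class over the same character sequence.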

To compare with spaCy, we train a spaCy tokenizer named char_pretokenizer.v1. FLAIR uses spaCy’s tokenizer, so we exclude it from comparison. We also include baselines mentioned in previous work [12, 13]. Twokenizer [17] is a regex-based tokenizer and does not adapt to the UD tokenization scheme. Stanford CoreNLP [14], spaCy, and UDPipe v1.2 [28] are three popular NLP frameworks re-trained on TB2. Twpipe tokenizer [13] is similar to UDPipe, but replaces GRU in UDPipe with an LSTM and uses a larger hidden unit number. We do not compare with transformer-based models because they use subword-level tokenization schemes like WordPiece [29] and BPE [26].

3.2.2 Lemmatization

Lemmatization is the process of recovering each word in a sentence to its canonical form. We train the Stanza lemmatizer on TB2, which is implemented as an ensemble of a dictionary-based lemmatizer and a neural seq2seq lemmatizer. We compare the Stanza lemmatizer against three lemmatizers from spaCy, NLTK, and FLAIR (Table 7). Both the NLTK and spaCy lemmatizers are rule-based and use a dictionary to look up the canonical form given a word and its POS tag. The FLAIR lemmatizer is a character-level seq2seq model. We provide gold POS tags for lemmatization.
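The ensemble idea, dictionary lookup keyed on (word, POS) backed off to a learned model, can be sketched as follows. The dictionary entries and the suffix rule standing in for the neural seq2seq component are illustrative only:

```python
# Toy dictionary standing in for the lemmas memorized from the training set.
LEMMA_DICT = {("ran", "VERB"): "run", ("tweets", "NOUN"): "tweet"}

def seq2seq_stub(word, pos):
    """Stand-in for the neural seq2seq component: a trivial suffix rule."""
    if pos == "NOUN" and word.endswith("s"):
        return word[:-1]
    return word

def lemmatize(word, pos):
    """Dictionary first; back off to the learned model for unseen words."""
    return LEMMA_DICT.get((word, pos)) or seq2seq_stub(word, pos)
```

The design choice is that the dictionary handles frequent and irregular forms exactly, while the seq2seq model generalizes to out-of-vocabulary words, which are common in tweets.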

3.2.3 POS Tagging

POS tagging assigns each token in a sentence a POS tag. We train the Stanza POS tagger, which uses a bidirectional long short-term memory network as its basic architecture, to predict universal POS (UPOS) tags. We ignore the language-specific POS (XPOS) tags because TB2 only contains UPOS tags.

We also train the default POS taggers for spaCy, FLAIR, spaCy-BERTweet, and spaCy-XLM-RoBERTa. We include performance from existing work on Tweet POS tagging: (1) the Stanford CoreNLP tagger, (2) owoputi2013improved’s word cluster–enhanced greedy tagger, (3) owoputi2013improved’s word cluster–enhanced tagger with CRF, (4) ma2016end’s neural tagger, and (5) the BERTweet-based POS tagger [15]. The first four models were re-trained on the combination of the TB2 and UD_English-EWT training sets, whereas the BERTweet-based tagger was fine-tuned solely on TB2.

3.2.4 Dependency Parsing

Dependency parsing predicts a syntactic structure for a sentence, where every word in the sentence is assigned a syntactic head that points to either another word in the sentence or an artificial root symbol. Stanza’s dependency parser combines a Bi-LSTM-based deep biaffine neural parser [8] and two linguistic features, which can significantly improve parsing accuracy [20]. Gold-standard tokenization and automatic POS tags are used.
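The deep biaffine scoring at the core of this parser can be written out. In the notation of Dozat and Manning's paper, the score of a candidate arc with head word $j$ and dependent word $i$ is

```latex
% h vectors are MLP projections of the BiLSTM states; U and u are learned.
s^{\mathrm{arc}}_{ij} =
  \mathbf{h}^{(\mathrm{arc\text{-}dep})\top}_{i}\, U^{(\mathrm{arc})}\, \mathbf{h}^{(\mathrm{arc\text{-}head})}_{j}
  + \mathbf{h}^{(\mathrm{arc\text{-}head})\top}_{j}\, \mathbf{u}^{(\mathrm{arc})}
```

where the bilinear term scores the head-dependent pair and the linear term captures each word's prior propensity to act as a head; each dependent then takes the highest-scoring head (subject to tree constraints). A second biaffine classifier of the same shape predicts the dependency label for each chosen arc.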

We also re-train the spaCy, spaCy-BERTweet, and spaCy-XLM-RoBERTa dependency parsers with their default parser architectures (FLAIR does not support dependency parsing so far). Besides, we compare our Stanza models with previous work: (1) kong2014dependency’s graph-based parser with lexical features and word clusters, which uses dual decomposition for decoding, (2) dozat2016deep’s neural graph parser with biaffine attention, (3) ballesteros2015improved’s neural greedy stack-LSTM parser, (4) an ensemble of 20 transition-based parsers [13], and (5) a distilled graph-based parser of the previous ensemble [13]. These models are all trained on TB2+UD_English-EWT. We are aware that stymne2020cross trained a transition-based uuparser [5] on a combination of TB2, UD_English-EWT, and more out-of-domain data (English GUM [31], LinES [1], ParTUT [25]) to further boost performance, but we do not experiment with this data combination, to be consistent with the data settings in liu2018parsing.

4 Evaluation

We train the NER and syntactic NLP models described above with 1) TB2 training data (the default data setting), and 2) TB2 training data plus extra Twitter data (the combined data setting). For the combined data setting, we add the training and dev sets from other data sources to TB2’s training and dev sets, respectively. Specifically, we add WNUT17 [7] for NER, mapping both “group” and “corporation” to “ORG” and both “creative work” and “product” to “MISC”. For the syntactic NLP tasks, we add UD_English-EWT. We pick the best models based on the corresponding dev sets and report their performance on the TB2 test sets. For each task, we compare the Stanza models with existing work or frameworks.
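The WNUT17 label mapping used in the combined setting can be sketched as a BIO-tag remapping. The two ORG and two MISC mappings are stated in the text; the person and location mappings, and the exact WNUT17 label strings, are assumptions for illustration:

```python
# WNUT17 entity type -> four-type scheme used in Tweebank-NER.
WNUT17_TO_TB2 = {
    "person": "PER", "location": "LOC",
    "group": "ORG", "corporation": "ORG",          # stated in the text
    "creative-work": "MISC", "product": "MISC",    # stated in the text
}

def map_tag(bio_tag):
    """Map a WNUT17 BIO tag (e.g. 'B-corporation') onto the four-type scheme."""
    if bio_tag == "O":
        return "O"
    prefix, etype = bio_tag.split("-", 1)          # keep the B-/I- prefix
    return f"{prefix}-{WNUT17_TO_TB2[etype]}"
```

Applying this token-by-token to the WNUT17 train and dev files yields data that can be concatenated directly with the TB2 splits.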

4.1 Performance in NER

4.1.1 Main Findings

The NER experiments show that the Stanza NER model (TB2+W17) achieves the best performance among all non-transformer models. At the same time, the Stanza model is up to 75% smaller than the second-best FLAIR model [21]. For transformers, the BERTweet-based approach trained only on TB2 achieves the highest performance (75.51%) on Tweebank-NER, establishing a strong benchmark for future research. We also find that combining the training data from WNUT17 and TB2 improves the performance of the spaCy, FLAIR, and Stanza models. However, this is not true for BERTweet, likely because BERTweet generalizes well from small, high-quality data but is also prone to learning inconsistent annotations between the two datasets.

Systems F1


spaCy (TB2) 52.20
spaCy (TB2+W17) 53.89
FLAIR (TB2) 62.12
FLAIR (TB2+W17) 59.08
spaCy-BERTweet (TB2) 75.51
spaCy-BERTweet (TB2+W17) 75.44
Stanza (TB2) 60.14
Stanza (TB2+W17) 62.53
Table 3: NER comparison on the TB2 test set in entity-level F1. “TB2” indicates training on the TB2 train set. “TB2+W17” indicates training on the combined TB2 and WNUT17 train sets.

4.1.2 Confusion Matrix Analysis

In Figure 1, we plot a confusion matrix for all four entity types and “O”, the label for tokens that do not belong to any of these types. The diagonal and the vertical blue line are expected: the cells on the diagonal correspond to the algorithm predicting the correct entity, and the vertical line corresponds to the algorithm mistaking an entity for “O”, the most common error in NER. We notice that MISC entities are easily mistaken for “O”, which corresponds to the annotation statistics in Table 2, where MISC has the lowest IAA score in pairwise F1. This shows that MISC is the most challenging of the four types for both humans and machines.

Figure 1: Confusion matrix generated by the Stanza (TB2+W17) model to show percentages for each combination of predicted and true entity types.
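The tally behind a figure like this is straightforward; a minimal sketch (collapsing BIO prefixes to entity types and row-normalizing to percentages, an assumed simplification of the actual analysis) could be:

```python
from collections import defaultdict

def confusion_matrix(gold, pred):
    """Row-normalized token-level confusion matrix over entity types and O."""
    strip = lambda t: t if t == "O" else t.split("-", 1)[1]  # B-PER -> PER
    counts = defaultdict(lambda: defaultdict(int))
    for g, p in zip(gold, pred):
        counts[strip(g)][strip(p)] += 1
    return {g: {p: 100 * n / sum(row.values()) for p, n in row.items()}
            for g, row in counts.items()}
```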

4.1.3 Error Analysis

We identify the most common error types that Stanza (TB2+W17) makes on the TB2 test set in Figure 1: predicting PER, LOC, ORG, and MISC entities as O. We pick representative examples for each error type, shown in Table 4. For the PER → O error type, every word’s first letter is capitalized and the model fails to recognize the famous investor “Warren Buffet” in such a context. We find that person entities with abbreviations (e.g., “GD” for “G-dragon”), lower case (e.g., “kush” for “Kush”), or irregular contextual capitalization are challenging for the NER system. For the LOC → O error type, the structure encoding the location is complicated and sometimes interrupted by parentheses and dashes (e.g., “- day Adventist Church”); in this case, the error arises because “Seventh-day” is tokenized into three words in TB2. For the ORG → O and MISC → O examples, “Guess Who” is a rock band and “Sounds Live Feels Live” is a concert tour by the Australian pop-rock band 5 Seconds of Summer. These named entities tend to contain common English words with their first letters capitalized, and it is difficult to annotate them correctly if the model does not have access to world and domain knowledge. Our analysis suggests that future Twitter NER research should introduce text perturbations into training and encode commonsense knowledge into NER modeling.

Error type Tweet example


PER → O The 50 % Return Method Billionaire Investor Warren Buffet Wishes He Could Use
LOC → O Getting ready … @ Pasco Ephesus Seventh - day Adventist Church
ORG → O #bargains #deals 10.27.10 Guess Who “ American Woman ” Guhhh deeeh you !
MISC → O RT @USER1508 : Do you ever realize Sounds Live Feels Live Starts this month and just
Table 4: Common mistakes made by the Stanza (TB2+W17) NER model for each error type. “X → O” means the model mistakenly predicts an X entity as O. Colored texts are gold annotations of the corresponding type in each row. Correct predictions are in bold green and gold annotations missed by the model are in bold red.

4.1.4 NER Models Trained on WNUT17

We train spaCy, FLAIR, Stanza, and spaCy-BERTweet NER models on the four-type version of WNUT17 and evaluate their performance on the TB2 test set. In Table 5, we compare the performance of these models trained on WNUT17 against the ones trained on TB2. The performance of all models drops significantly when trained only on WNUT17, showing that the Tweebank-NER dataset remains challenging for current NER models.

Training data TB2 WNUT17 F1 Drop


spaCy 52.20 44.93 7.27
FLAIR 62.12 55.11 7.01
spaCy-BERTweet 73.79 60.77 13.02
Stanza 60.14 56.40 3.74
Table 5: Comparison among NER models trained on TB2 vs. WNUT17 on TB2 test in entity-level F1.

4.2 Performance in Syntactic NLP Tasks

Apart from NER, we train and evaluate Stanza models for tokenization, lemmatization, POS tagging, and dependency parsing by leveraging TB2 and UD_English-EWT. For each task, we compare our models against previous work on the TB2 test set.

4.2.1 Tokenization Performance

In Table 6

, we observe that the Stanza model trained on TB2 outperforms Twpipe tokenizer, the previous SOTA model, and it achieves slightly higher performance than the spaCy tokenizer. We also find that blending TB2 and UD_English-EWT for training brings down the tokenization performance slightly. This is probably because the data source of UD_English-EWT, which is collected from weblogs, newsgroups, emails, reviews, and Yahoo! Answers, represents a different dialect from Twitter English.

System F1


Twokenizer 94.6
Stanford CoreNLP 97.3
UDPipe v1.2 97.4
Twpipe 98.3
spaCy (TB2) 98.57
spaCy (TB2+EWT) 95.57
Stanza (TB2) 98.64
Stanza (TB2+EWT) 98.59
Table 6: Tokenizer comparison on the TB2 test set. “TB2” indicates training on TB2. “TB2+EWT” indicates training on the combined TB2 and UD English-EWT data. Note that the first four results are rounded to one decimal place by Liu et al. (2018).

4.2.2 Lemmatization Performance

None of the previous Twitter NLP work reports lemmatization performance on TB2. As shown in Table 7, the Stanza model outperforms the two rule-based (NLTK and spaCy) and one neural (FLAIR) baseline approaches on TB2. This is not surprising because the Stanza ensemble lemmatizer makes good use of both rule-based dictionary lookup and seq2seq learning. Similar to what we observe in the tokenization experiments, the combined data setting brings down the performance of both the FLAIR and Stanza models.

System F1


NLTK 88.23
spaCy 85.28
Flair (TB2) 96.18
Flair (TB2+EWT) 84.54
Stanza (TB2) 98.25
Stanza (TB2+EWT) 85.45
Table 7: Lemmatization results on the TB2 test set. “TB2” indicates training on TB2. “TB2+EWT” indicates training on the combined TB2 and UD English-EWT data.

4.2.3 POS Tagging Performance

As shown in Table 8, the best model that we train (spaCy-XLM-RoBERTa on TB2) is 1.3% lower in accuracy than the SOTA BERTweet tagger [15]. Our spaCy-BERTweet tagger also underperforms the original BERTweet results of nguyen2020bertweet; we conjecture that the difference is mainly caused by differences in the POS tagging layer implementations between spaCy and nguyen2020bertweet. Among the non-transformers, Stanza achieves competitive performance compared with the best model, owoputi2013improved’s tagger with CRF (93.53% vs. 94.6%), and outperforms all other non-transformer baselines including Stanford CoreNLP, spaCy, FLAIR, and ma2016end. Interestingly, we observe that adding UD_English-EWT for training improves the performance of non-transformer models but slightly brings down the performance of transformer models.

System UPOS


Stanford CoreNLP 90.6
owoputi2013improved (greedy) 93.7
owoputi2013improved (CRF) 94.6
ma2016end 92.5
BERTweet [15] 95.2
spaCy (TB2) 86.72
spaCy (TB2+EWT) 88.84
FLAIR (TB2) 87.85
FLAIR (TB2+EWT) 88.19
spaCy-BERTweet (TB2) 87.61
spaCy-BERTweet (TB2+EWT) 86.31
spaCy-XLM-RoBERTa (TB2) 93.90
spaCy-XLM-RoBERTa (TB2+EWT) 93.75
Stanza (TB2) 93.20
Stanza (TB2+EWT) 93.53
Table 8: POS tagging comparison in accuracy on the TB2 test set. “TB2” indicates training on TB2. “TB2+EWT” indicates training on the combined TB2 and UD English-EWT data. Note that the first five results are rounded to one decimal place by Liu et al. (2018).

4.2.4 Dependency Parsing Performance

For the dependency parsing experiments, spaCy-XLM-RoBERTa (TB2) achieves the SOTA performance (Table 9), surpassing liu2018parsing (Ensemble) by 0.42% in UAS (it is difficult to compare their LAS with ours due to the difference in decimal places). Besides, the Stanza parser matches the UAS and comes within 0.3% LAS of the best non-transformer performance (UAS 82.1%, LAS 77.9%) reported by the distilled parser. As liu2018parsing mentioned, the ensemble model is 20 times larger than the Stanza parser, although it performs better. Finally, we confirm that combining the TB2 and UD_English-EWT training sets boosts the performance of non-transformer models [13], while it brings down the performance of transformer-based models, consistent with our observations in NER and POS tagging.

System UAS LAS


kong2014dependency 81.4 76.9
dozat2017stanford 81.8 77.7
ballesteros2015improved 80.2 75.7
liu2018parsing (Ensemble) 83.4 79.4
liu2018parsing (Distillation) 82.1 77.9
spaCy (TB2) 66.93 58.79
spaCy (TB2 + EWT) 72.06 63.84
spaCy-BERTweet (TB2) 76.32 71.72
spaCy-BERTweet (TB2+EWT) 76.18 69.28
spaCy-XLM-RoBERTa (TB2) 83.82 79.39
spaCy-XLM-RoBERTa (TB2+EWT) 81.02 75.43
Stanza (TB2) 79.28 74.34
Stanza (TB2 + EWT) 82.10 77.60
Table 9: Dependency parsing comparison on the TB2 test set. “TB2” indicates training on TB2. “TB2+EWT” indicates training on the combined TB2 and UD English-EWT data. Note that the first six results are rounded to one decimal place by Liu et al. (2018).

5 Conclusion

In this paper, we introduce the four-type named entities to Tweebank V2, a popular Twitter dataset on Universal Dependencies, making it a new NER benchmark: Tweebank-NER. We evaluate our annotations and observe a good inter-annotator agreement score in pairwise F1 for NER annotation. We also train Twitter-specific NLP models (NER, tokenization, lemmatization, POS tagging, dependency parsing) on the dataset with Stanza and compare our models against existing work or NLP frameworks. Our Stanza models show SOTA performance on tokenization and lemmatization and competitive performance in NER, POS tagging, and dependency parsing on TB2. We also train BERT-based methods to establish a strong benchmark on Tweebank-NER and achieve SOTA performance in dependency parsing on TB2. Finally, we publish our dataset and release the Stanza NER and syntactic NLP models, which are easy to download and use with Stanza’s Python interface. We hope that our research not only contributes annotations to an important dataset but also enables other researchers to use off-the-shelf NLP models for social media analysis.

6 Acknowledgements

We would like to thank Alan Ritter, Yuhui Zhang, Zifan Lin, and anonymous reviewers, who gave precious advice and comments to our paper. We also want to thank John Bauer and Yijia Liu for answering questions related to Stanza and Twpipe. At last, we would like to thank MIT Center for Constructive Communication for funding our research.

7 Bibliographical References


  • [1] L. Ahrenberg (2007-05) LinES: an English-Swedish parallel treebank. In Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA 2007), Tartu, Estonia, pp. 270–273. External Links: Link Cited by: §3.2.4.
  • [2] A. Akbik, T. Bergmann, D. Blythe, K. Rasul, S. Schweter, and R. Vollgraf (2019) Flair: an easy-to-use framework for state-of-the-art nlp. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pp. 54–59. Cited by: §1.
  • [3] A. Akbik, D. Blythe, and R. Vollgraf (2018) Contextual string embeddings for sequence labeling. In Proceedings of the 27th international conference on computational linguistics, pp. 1638–1649. Cited by: §3.1.
  • [4] A. Brandsen, S. Verberne, K. Lambers, M. Wansleeben, N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, et al. (2020) Creating a dataset for named entity recognition in the archaeology domain. In Conference Proceedings LREC 2020, pp. 4573–4577. Cited by: §2.4.
  • [5] M. de Lhoneux, Y. Shao, A. Basirat, E. Kiperwasser, S. Stymne, Y. Goldberg, and J. Nivre (2017) From raw text to universal dependencies-look, no tags!. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pp. 207–217. Cited by: §3.2.4.
  • [6] L. Deleger, Q. Li, T. Lingren, M. Kaiser, K. Molnar, et al. (2012) Building gold standard corpora for medical natural language processing tasks. In AMIA Annual Symposium Proceedings, Vol. 2012, pp. 144. Cited by: §2.4.
  • [7] L. Derczynski, W. Xu, A. Ritter, and T. Baldwin (Eds.) (2017-09) Proceedings of the 3rd workshop on noisy user-generated text. Association for Computational Linguistics, Copenhagen, Denmark. External Links: Link, Document Cited by: §1, §4.
  • [8] T. Dozat and C. D. Manning (2017) Deep biaffine attention for neural dependency parsing. The International Conference on Learning Representations. Cited by: §3.2.4.
  • [9] K. Gimpel, N. Schneider, B. O’Connor, D. Das, D. Mills, J. Eisenstein, M. Heilman, D. Yogatama, J. Flanigan, and N. A. Smith (2010) Part-of-speech tagging for Twitter: annotation, features, and experiments. Technical report, Carnegie Mellon University, School of Computer Science, Pittsburgh, PA. Cited by: §1, §3.2.1.
  • [10] C. Grouin, S. Rosset, P. Zweigenbaum, K. Fort, O. Galibert, and L. Quintard (2011) Proposal for an extension of traditional named entities: from guidelines to evaluation, an overview. In 5th Linguistic Annotation Workshop (The LAW V), pp. 92–100. Cited by: §2.4.
  • [11] G. Hripcsak and A. S. Rothschild (2005) Agreement, the f-measure, and reliability in information retrieval. Journal of the American Medical Informatics Association 12 (3), pp. 296–298. Cited by: §2.4.
  • [12] L. Kong, N. Schneider, S. Swayamdipta, A. Bhatia, C. Dyer, and N. A. Smith (2014) A dependency parser for tweets. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1001–1012. Cited by: §1, §2.1, §3.2.1.
  • [13] Y. Liu, Y. Zhu, W. Che, B. Qin, N. Schneider, and N. A. Smith (2018) Parsing tweets into universal dependencies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 965–975. Cited by: §1, §2.1, §3.2.1, §3.2.1, §3.2.4, §4.2.4.
  • [14] C. D. Manning, M. Surdeanu, J. Bauer, J. R. Finkel, S. Bethard, and D. McClosky (2014) The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55–60. Cited by: §3.2.1.
  • [15] D. Q. Nguyen, T. Vu, and A. T. Nguyen (2020) BERTweet: a pre-trained language model for English tweets. Association for Computational Linguistics. Cited by: §1, §3.2.3, §3, §4.2.3, Table 8.
  • [16] J. Nivre, M. De Marneffe, F. Ginter, Y. Goldberg, J. Hajic, C. D. Manning, R. McDonald, S. Petrov, S. Pyysalo, N. Silveira, et al. (2016) Universal dependencies v1: a multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pp. 1659–1666. Cited by: §1.
  • [17] B. O’Connor, M. Krieger, and D. Ahn (2010) TweetMotif: exploratory search and topic summarization for Twitter. In Fourth International AAAI Conference on Weblogs and Social Media. Cited by: §3.2.1.
  • [18] O. Owoputi, B. O’Connor, C. Dyer, K. Gimpel, N. Schneider, and N. A. Smith (2013) Improved part-of-speech tagging for online conversational text with word clusters. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 380–390. Cited by: §1.
  • [19] J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543. Cited by: §3.
  • [20] P. Qi, T. Dozat, Y. Zhang, and C. D. Manning (2018) Universal dependency parsing from scratch. CoNLL 2018 UD Shared Task. Cited by: §3.2.4.
  • [21] P. Qi, Y. Zhang, Y. Zhang, J. Bolton, and C. D. Manning (2020) Stanza: a Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 101–108. Cited by: §1, §3, §4.1.1.
  • [22] L. Ratinov and D. Roth (2009) Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), pp. 147–155. Cited by: §2.3.
  • [23] A. Ritter, S. Clark, O. Etzioni, et al. (2011) Named entity recognition in tweets: an experimental study. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pp. 1524–1534. Cited by: §1, §2.2.
  • [24] E. F. Sang and F. De Meulder (2003) Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. arXiv preprint cs/0306050. Cited by: §3.1.
  • [25] M. Sanguinetti and C. Bosco (2015) ParTUT: the Turin University Parallel Treebank. In Harmonization and Development of Resources and Tools for Italian Natural Language Processing within the PARLI Project, pp. 51–69. Cited by: §3.2.4.
  • [26] R. Sennrich, B. Haddow, and A. Birch (2016) Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1715–1725. Cited by: §3.2.1.
  • [27] M. Straka, J. Hajic, and J. Straková (2016) UDPipe: trainable pipeline for processing CoNLL-U files performing tokenization, morphological analysis, POS tagging and parsing. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pp. 4290–4297. Cited by: §1.
  • [28] M. Straka and J. Straková (2017) Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pp. 88–99. Cited by: §3.2.1.
  • [29] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. Cited by: §3.2.1.
  • [30] W. Xu, B. Han, and A. Ritter (Eds.) (2015-07) Proceedings of the workshop on noisy user-generated text. Association for Computational Linguistics, Beijing, China. External Links: Link, Document Cited by: §1.
  • [31] A. Zeldes (2017) The GUM corpus: creating multilayer resources in the classroom. Language Resources and Evaluation 51 (3), pp. 581–612. External Links: Document Cited by: §3.2.4.
  • [32] Y. Zhang, Y. Zhang, P. Qi, C. D. Manning, and C. P. Langlotz (2021) Biomedical and clinical English model packages for the Stanza Python NLP library. Journal of the American Medical Informatics Association 28 (9), pp. 1892–1899. Cited by: §3.

8 Language Resource References
