
CrossWeigh: Training Named Entity Tagger from Imperfect Annotations

by   Zihan Wang, et al.
University of Illinois at Urbana-Champaign

Everyone makes mistakes. So do human annotators when curating labels for named entity recognition (NER). Such label mistakes might hurt model training and interfere with model comparison. In this study, we dive deep into one of the widely-adopted NER benchmark datasets, CoNLL03 NER. We are able to identify label mistakes in about 5.38% of the test sentences, which is a significant ratio considering that the state-of-the-art test F1 score is already around 93%. Therefore, we manually correct these label mistakes and form a cleaner test set. Our re-evaluation of popular models on this corrected test set leads to more accurate assessments, compared to those on the original test set. More importantly, we propose a simple yet effective framework, CrossWeigh, to handle label mistakes during NER model training. Specifically, it partitions the training data into several folds and trains independent NER models to identify potential mistakes in each fold. Then it adjusts the weights of training data accordingly to train the final NER model. Extensive experiments demonstrate significant improvements from plugging various NER models into our proposed framework on three datasets. All implementations and the corrected test set are available at our GitHub repo.


1 Introduction

Named entity recognition (NER), identifying both spans and types of named entities in text, is a fundamental task in the natural language processing pipeline. On one of the widely-adopted NER benchmarks, the CoNLL03 NER dataset sang2003introduction, the state-of-the-art NER performance has been pushed to an F1 score around 93% akbikpooled, through building end-to-end neural models lample2016neural; ma2016end and introducing language models for contextualized representations peters2017semi; peters2018deep; akbik2018contextual; liu2018efficient. Such high performance makes the label mistakes in manually curated “gold standard” data non-negligible. For example, given a sentence “Chicago won game 1 with Derrick Rose scoring 25 points.”, this “Chicago”, representing the NBA team Chicago Bulls, should be annotated as an organization. However, when annotators are not careful or lack background knowledge, this “Chicago” might be annotated as a location, thus being a label mistake.

These label mistakes bring up two challenges to NER: (1) mistakes in the test set can interfere with the evaluation results and even lead to an inaccurate assessment of model performance; and (2) mistakes in the training set can hurt NER model training. Therefore, in this paper, we conduct empirical studies to understand these mistakes, correct the mistakes in the test set to form a cleaner benchmark, and develop a novel framework to handle the mistakes in the training set.

We dive deep into the CoNLL03 NER dataset, and find label mistakes in about 5.38% of the test sentences. Considering that the state-of-the-art F1 score on this test set is already around 93%, these 5.38% mistakes should be considered significant. So we hire human experts to correct these label mistakes in the test set. We then re-evaluate recent state-of-the-art NER models on this new, cleaner test set. Compared to the results on the original test set, the re-evaluation results are more accurate and stable. Therefore, we believe this new test set can better reflect the performance of NER models.

Figure 1: An overview of our proposed CrossWeigh framework. It can better handle label mistakes, identify low quality annotations and conduct learning from a weighted training set.

We further propose a novel, general framework, CrossWeigh, to handle the label mistakes during the NER model training stage. Figure 1 presents an overview of our proposed framework. It contains two modules: (1) mistake estimation: it identifies the potential label mistakes in training data through a cross checking process and (2) mistake re-weighing: it lowers the weights of these instances during the training of the final NER model. The cross checking process is inspired by k-fold cross validation; differently, from each fold’s training data, it removes the sentences containing any of the entities that appear in the held-out fold. In this way, each sentence will be scored by a NER model trained on a subset of training data not containing any entity in this sentence. Once we know where the potential mistakes are, we lower the weights of these sentences and train the final NER model based on this weighted training set. The final NER model is trained in a mistake-aware way, thus being more accurate. Note that our proposed framework is general and fits most of, if not all, NER models that accept weighted training data.

To the best of our knowledge, we are the first to handle the label mistake issues systematically in the NER problem. We conduct extensive experiments on both the original CoNLL03 NER dataset and our corrected dataset. CrossWeigh is able to consistently improve performance when combined with different NER models. In addition, we verify the effectiveness of CrossWeigh on emerging-entity and low-resource NER datasets. In summary, our major contributions are the following:


  • We correct label mistakes in the test set of the CoNLL03 NER dataset and re-evaluate popular NER models. This establishes a more accurate NER benchmark.

  • We propose a novel framework CrossWeigh to accommodate the mistakes during the model training stage. The proposed framework fits most of, if not all, NER models.

  • Extensive experiments demonstrate significant, robust test F1 score improvements from plugging NER models into our proposed framework on three datasets: not only CoNLL03 but also emerging-entity and low-resource datasets.

Reproducibility. We release both the corrected test set and the implementation of the CrossWeigh framework.

Sentence | Original labels | Corrected labels
Sporting Gijon 15 4 4 7 15 22 16 | [Sporting]{ORG} | [Sporting Gijon]{ORG}
NZ ’s Bolger says Nats to meet NZ First on Sunday . | [NZ]{LOC}, [Bolger]{PER}, [Nats]{PER}, [NZ]{LOC} | [NZ]{LOC}, [Bolger]{PER}, [Nats]{ORG}, [NZ First]{ORG}
Seagramd ace 20/11/96 5,000 Japan | [Seagramd]{MISC}, [Japan]{LOC} | [Seagramd ace]{MISC}, [Japan]{LOC}
Table 1: Typical Examples of Our Corrections on the CoNLL03 NER dataset.

2 CoNLL03 NER Re-Examination

The CoNLL03 NER dataset is one of the widely-adopted NER benchmark datasets. Its annotation guideline is based on the MUC conventions sang2003introduction. Following this guideline, the annotators are asked to mark entities of person (PER), location (LOC), and organization (ORG), while using an extra miscellaneous (MISC) type to deal with entities that do not fall in these categories. This dataset has been split into training, development, and test sets.

2.1 Test Set Correction

In order to understand and correct the label mistakes, we have hired 5 human experts as annotators. Before looking at the data, we first train the annotators by carefully going through the aforementioned guideline. During the correction process, we strongly encourage the annotators to use search engines for suspicious token spans. This helps them have more background knowledge. We also allow annotators to look at the original paragraph containing the sentence. This helps them have a better understanding of the context.

For the whole test set, we randomly split the test sentences between each pair combination of the 5 annotators. In this way, each sentence in the test set is checked by exactly two annotators. The inter-annotator agreement is 95.66%. This is a reasonable score, given that the inter-annotator agreement in POS tagging annotations is about 97% manning2011part. After we collected all annotations, we ran a final round of verification on each sentence for which the original annotation and the two annotators’ annotations are not all the same. In the end, we have corrected label mistakes in 186 sentences, which is about 5.38% of the test set.
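The pairwise split described above can be sketched as follows. This is only an illustration under our own assumptions: the function name, the round-robin dealing, and the annotator identifiers are hypothetical and are not taken from the released code.

```python
import itertools
import random
from collections import Counter


def assign_to_annotator_pairs(num_sentences, annotators, seed=0):
    """Randomly split sentence indices across all annotator pairs.

    Returns a dict mapping sentence index -> (annotator_a, annotator_b),
    so every sentence is checked by exactly two annotators.
    """
    pairs = list(itertools.combinations(annotators, 2))  # 5 annotators -> 10 pairs
    indices = list(range(num_sentences))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled sentences to the pairs round-robin to balance the load.
    return {idx: pairs[pos % len(pairs)] for pos, idx in enumerate(indices)}


# Hypothetical example: 1000 sentences and five annotators.
assignment = assign_to_annotator_pairs(1000, ["A1", "A2", "A3", "A4", "A5"])
load = Counter(name for pair in assignment.values() for name in pair)
```

With five annotators there are ten pairs and each annotator belongs to four of them, so the workload is spread evenly.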

Table 1 presents some typical examples of our corrections. In the first sentence, as a sports team, “Sporting Gijon” was not annotated completely. In the second sentence, while “JAPAN” is correctly marked as LOC, “China” is wrongly identified as PER instead of LOC. One may notice that they both represent sports teams. However, according to the aforementioned guideline, country names should be marked as LOC even when they are sports teams. More details about this type of label are discussed in Section 5. In the third sentence, “NZ” is the abbreviation of New Zealand. However, “Nats” and “NZ First” in fact refer to political parties (i.e., New Zealand Young Nationals and New Zealand First). So they should be labelled as ORG. In the fourth sentence, looking at its paragraph, our annotators figured out that this is a table about ships and vessels loading items at different locations. Through comparing with other sentences in the context, such as “Algoa Day 21/11/96 6,000 Africa”, our annotators identified “Seagramd ace” as a vessel, thus marking it as MISC. We have verified that there is indeed a vessel called “Seagrand Ace” (“Seagramd ace” might be a typo).

2.2 CoNLL03 Re-Evaluation

NER Algorithms. We re-evaluate the following popular NER algorithms:


  • LSTM-CRF lample2016neural incorporates a long short-term memory (LSTM) neural network with a conditional random field (CRF). It also uses a word-wise character LSTM.

  • LSTM-CNNs-CRF ma2016end has a structure similar to LSTM-CRF, but captures character-level information through a convolutional neural network (CNN) over the character embeddings.

  • VanillaNER liu2018efficient also extends LSTM-CRF and LSTM-CNNs-CRF by using a sentence-wise character LSTM.

  • ELMo peters2018deep extends LSTM-CRF and leverages pre-trained word-level language models for better contextualized representations.

  • Flair akbik2018contextual also aims for contextualized representations, utilizing pretrained character level language models.

  • Pooled-Flair akbik2018contextual extends Flair and maintains an embedding pool for each word to bring in dataset-level word embedding.

We use the implementation released by the authors for each algorithm and report the performance on original test set and corrected test set averaging 5 runs.

Method | Original | Corrected
LSTM-CRF | 90.64 (0.23) | 91.47 (0.15)
LSTM-CNNs-CRF | 90.65 (0.57) | 91.87 (0.50)
VanillaNER | 91.44 (0.16) | 92.32 (0.16)
ELMo | 92.28 (0.19) | 93.42 (0.15)
Flair | 92.87 (0.08) | 93.89 (0.06)
Pooled-Flair | 93.14 (0.14) | 94.13 (0.11)
Table 2: CoNLL03 Re-Evaluation: Test F1 scores and standard deviations on both original and corrected datasets. The results are based on 5 different runs.

Results & Discussions. We re-evaluate the performance of the NER algorithms on the corrected test set. Their performance on the original test set is also listed for reference. From the results in Table 2, one can observe that all models have higher F1 scores on the corrected test set, with standard deviations that are no larger than those on the original test set. Moreover, LSTM-CRF has a similar performance to LSTM-CNNs-CRF on the original test set, but a lower average performance on the corrected test set. This indicates that the corrected test set may be more discriminative. Therefore, we believe this corrected test set can better reflect the accuracy of NER algorithms in a stable way.

3 Our Framework: CrossWeigh

In this section, we introduce our framework. It is worth mentioning that our framework is designed to be general and fits most of, if not all, NER models. The only requirement is the capability to consume a weighted training set.

3.1 Overview

As we have seen in Section 2, human curated NER datasets are by no means perfect. Label mistakes in the training set can directly hurt the model’s performance. As shown in Figure 1, if there are many similar mistakes like wrongly annotating “Chicago” in “Chicago won …” as LOC instead of ORG, the NER model will likely capture the wrong pattern “LOC won” and make wrong predictions in the future.

Our proposed CrossWeigh framework detects such mistakes automatically and reduces their influence on training. Figure 1 presents an overview. It contains two modules: (1) mistake estimation: it identifies the potential label mistakes in training data through a cross checking process and (2) mistake re-weighing: it lowers the weights of these instances for the NER model training. The workflow is summarized in Algorithm 1.

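Input: A NER model $f$, the training set $\mathcal{D} = (X, Y)$, and hyper-parameters $k$, $t$, and $\epsilon$.
Output: A final NER model $\mathcal{M}$
for $i = 1 \ldots n$ do
       $c_i \leftarrow 0$
for iter $= 1 \ldots t$ do
       Randomly partition $\mathcal{D}$ into $k$ folds.
       for each fold $\mathcal{D}_j$ do
             Obtain $\mathcal{E}_j$ (Eq. 2).
             Build $train_j$ (Eq. 3).
             Train a NER model $\mathcal{M}_j$ on $train_j$.
             for each $(x_i, y_i) \in \mathcal{D}_j$ do
                   $\hat{y}_i \leftarrow \mathcal{M}_j$’s prediction on $x_i$.
                   if $\hat{y}_i \neq y_i$ then
                         $c_i \leftarrow c_i + 1$
for $i = 1 \ldots n$ do
       Compute $w_i$ (Eq. 4).
Return $f(\mathcal{D}, \mathbf{w})$.
Algorithm 1 Our CrossWeigh Framework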

3.2 Preliminary

We denote the training sentences as $x_1, x_2, \ldots, x_n$, where $n$ is the number of sentences. Each sentence $x_i$ is a sequence of words. Correspondingly, the label sequence for each sentence $x_i$ is denoted as $y_i$. We use $\mathcal{D}$ to denote the training set, including both sentences and their labels. We use $w_i$ to represent the weight of the $i$-th sentence. In most NER papers, the weights are uniform, i.e., $w_i = 1$.

We use $f(\mathcal{D}, \mathbf{w})$ to describe the training process of an NER model $f$ using the training set $\mathcal{D}$ weighted by $\mathbf{w} = (w_1, \ldots, w_n)$. This training process returns an NER model $\mathcal{M}$. During this training, the weighted loss function is as below.

\[ \mathcal{J} = \sum_{i=1}^{n} w_i \cdot l(\mathcal{M}(x_i), y_i) \tag{1} \]

where $l(\mathcal{M}(x_i), y_i)$ is the loss of the prediction $\mathcal{M}(x_i)$ against its label sequence $y_i$. Typically, it is the negative log-likelihood of the model’s prediction compared to the labeling sequence $y_i$.
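As a small illustration of the weighted loss, the sketch below sums per-sentence losses scaled by their weights. The loss values are made up, and the function name is ours; a real NER model would supply the per-sentence negative log-likelihoods.

```python
def weighted_loss(per_sentence_losses, weights):
    """Weighted training loss: the sum over sentences of w_i times l_i."""
    if len(per_sentence_losses) != len(weights):
        raise ValueError("need exactly one weight per sentence")
    return sum(w * loss for w, loss in zip(weights, per_sentence_losses))


# Hypothetical per-sentence negative log-likelihoods.
losses = [0.5, 2.0, 1.0]
uniform = weighted_loss(losses, [1.0, 1.0, 1.0])       # standard training
downweighted = weighted_loss(losses, [1.0, 0.5, 1.0])  # second sentence is suspicious
```

Halving the weight of the second sentence halves its contribution to the total loss, so a suspicious annotation pulls the model less strongly.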

Method | Original CoNLL03, w/o CrossWeigh | Original CoNLL03, w/ CrossWeigh | Corrected CoNLL03, w/o CrossWeigh | Corrected CoNLL03, w/ CrossWeigh
VanillaNER | 91.44 (0.16) | 91.78 (0.06) | 92.32 (0.16) | 92.64 (0.08)
Flair | 92.87 (0.08) | 93.19 (0.09) | 93.89 (0.06) | 94.18 (0.06)
Pooled-Flair | 93.14 (0.14) | 93.43 (0.06) | 94.13 (0.11) | 94.28 (0.05)
Table 3: Test F1 scores and their standard deviations for models trained without or with CrossWeigh.

3.3 Mistake Estimation

Our mistake estimation module is designed to let an NER model itself decide which sentences contain mistakes and which do not. We would like to find as many sentences with label mistakes as possible (i.e., high recall), while avoiding wrongly flagging mistake-free sentences (i.e., high precision).

The basic idea of our mistake estimation module is similar to k-fold cross validation. However, from each fold’s training data, it further removes the sentences containing any of the entities appearing in the held-out fold. The details are presented as follows.

We first randomly partition the training data $\mathcal{D}$ into $k$ folds: $\mathcal{D}_1, \mathcal{D}_2, \ldots, \mathcal{D}_k$.

We then train $k$ NER models separately based on these folds. The $j$-th ($1 \le j \le k$) NER model $\mathcal{M}_j$ will be evaluated on the sentences in the hold-out fold $\mathcal{D}_j$.

During its training, we avoid any sentence that may lead to “easy prediction” on this hold-out set. Therefore, we inspect every sentence in $\mathcal{D}_j$ and get the set of entities as follows.

\[ \mathcal{E}_j = \bigcup_{(x_i, y_i) \in \mathcal{D}_j} \mathrm{NER}(y_i) \tag{2} \]

where $\mathrm{NER}(y_i)$ is the set of named entities in sentence $x_i$. We only consider the surface name in this entity set. That is, no matter whether “Chicago” is LOC or ORG, it only counts as its surface name “Chicago”.

All training sentences that have entities included in $\mathcal{E}_j$ will be excluded from the training process of the model $\mathcal{M}_j$. Specifically,

\[ train_j = \{ (x_i, y_i) \in \mathcal{D} \setminus \mathcal{D}_j : \mathrm{NER}(y_i) \cap \mathcal{E}_j = \emptyset \} \tag{3} \]

We call this step entity disjoint filtering. The intuition behind this step is that we want the model to make a prediction for an entity without prior information about the entity itself from training. This will be helpful to detect sentences that are inconsistent.
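Entity disjoint filtering can be sketched as below. We assume BIO-style labels and a dataset given as a list of (tokens, labels) pairs; both the data layout and the helper names are our own illustrative choices, not the released implementation.

```python
def entity_surfaces(tokens, labels):
    """Return the set of entity surface names in one BIO-labelled sentence."""
    surfaces, current = set(), []
    for token, label in zip(tokens, labels):
        if label.startswith("B-") or (label.startswith("I-") and not current):
            if current:
                surfaces.add(" ".join(current))
            current = [token]
        elif label.startswith("I-"):
            current.append(token)
        else:
            if current:
                surfaces.add(" ".join(current))
            current = []
    if current:
        surfaces.add(" ".join(current))
    return surfaces


def entity_disjoint_train(dataset, held_out_ids):
    """Indices outside the held-out fold that share no entity surface with it."""
    held_out_entities = set()
    for idx in held_out_ids:
        held_out_entities |= entity_surfaces(*dataset[idx])
    return [
        idx for idx, (tokens, labels) in enumerate(dataset)
        if idx not in held_out_ids
        and not (entity_surfaces(tokens, labels) & held_out_entities)
    ]


# Toy data: sentence 0 is held out, so sentence 1 (also "Chicago") is filtered.
dataset = [
    (["Chicago", "won", "game", "1"], ["B-LOC", "O", "O", "O"]),
    (["Chicago", "beat", "Boston"], ["B-ORG", "O", "B-ORG"]),
    (["Derrick", "Rose", "scored"], ["B-PER", "I-PER", "O"]),
    (["It", "rained"], ["O", "O"]),
]
train_ids = entity_disjoint_train(dataset, {0})
```

Only surface names are compared, so “Chicago” labelled as LOC in the held-out fold still removes the sentence where it is labelled as ORG.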

We train $k$ models $\mathcal{M}_j$ by feeding each $train_j$ into $f$ with default uniform weights, and we use each $\mathcal{M}_j$ to make predictions for $\mathcal{D}_j$ and check, for each sentence, whether the original label is the same as the model output. In this way, if the trained model makes correct predictions on some sentences in $\mathcal{D}_j$, they are more likely mistake-free. For those sentences whose labels disagree with the model output, we mark them as “potentially mistake”.

We run this mistake estimation module for multiple iterations (i.e., $t$ iterations) using different random partitions. Then, for each sentence in the training set, we get $t$ estimations. We denote $c_i$ ($0 \le c_i \le t$) as the confidence that sentence $x_i$ contains label mistakes. $c_i$ is defined as the number of “potentially mistake” indications among all $t$ estimations.

The number of folds $k$ plays the role of a trade-off between the efficiency of the mistake estimation process and the number of training examples that can be used in each $train_j$. When $k$ becomes larger, each fold $\mathcal{D}_j$ will be smaller, thus leading to a smaller $\mathcal{E}_j$; correspondingly, a larger $train_j$ will be picked. The model $\mathcal{M}_j$ can therefore be trained with more examples. However, it also slows down the whole mistake estimation process. On the CoNLL03 NER dataset, we observe that $k = 10$ leads to effective results, while having a reasonable running time.
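The counting loop of the mistake estimation module can be sketched as follows. The training routine and the construction of each fold's training set are passed in as callables; the stand-in "model" that tags everything as O and the plain hold-out split are placeholders of our own, used only to make the sketch runnable, and are not a real NER tagger or the entity disjoint filter.

```python
import random


def estimate_mistakes(dataset, train_fn, build_train_fn, k=10, t=3, seed=0):
    """Return counts where counts[i] is how often sentence i disagreed with a
    model trained without its fold, over t random k-fold partitions."""
    rng = random.Random(seed)
    n = len(dataset)
    counts = [0] * n
    for _ in range(t):
        order = list(range(n))
        rng.shuffle(order)
        folds = [set(order[j::k]) for j in range(k)]  # random partition into k folds
        for fold in folds:
            train_ids = build_train_fn(dataset, fold)
            model = train_fn([dataset[i] for i in train_ids])
            for i in fold:
                tokens, labels = dataset[i]
                if model(tokens) != labels:
                    counts[i] += 1  # one "potentially mistake" indication
    return counts


def all_outside_tagger(train_sentences):
    """Placeholder 'training': returns a model that tags every token as O."""
    return lambda tokens: ["O"] * len(tokens)


def plain_holdout(dataset, fold):
    """Placeholder for the training-set construction: drop only the fold itself."""
    return [i for i in range(len(dataset)) if i not in fold]


toy = [
    (["Chicago", "won"], ["B-ORG", "O"]),
    (["It", "rained"], ["O", "O"]),
    (["Rose", "scored"], ["B-PER", "O"]),
    (["They", "left"], ["O", "O"]),
]
counts = estimate_mistakes(toy, all_outside_tagger, plain_holdout, k=2, t=3)
```

Each sentence is held out exactly once per iteration, so every count lies between 0 and $t$.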

3.4 Mistake Reweighing

In the mistake reweighing module, we adjust the weight of each sentence that is marked as “potentially mistake” in the mistake estimation step. Here, we assign a lower weight to all marked sentences, while the weights of other sentences remain $1$. Specifically, we set

\[ w_i = \epsilon^{c_i} \tag{4} \]

where $\epsilon$ is a parameter. In practice, it can be chosen according to the quality of the mistake estimation module. Particularly, we first estimate the precision of the detected mistakes of a single iteration. Let $p$ be the ratio of the number of true detected label mistakes over the number of detected label mistakes. $p$ can be roughly estimated through a manual check of a random sample from the detected label mistakes. Then, we choose $\epsilon = 1 - p$, because $1 - p$ represents the fraction of these detected label mistakes that might still be useful during the model training. That is, of the sentences that are marked as “potentially mistake” in that iteration, a fraction $1 - p$ are actually correct. With more iterations, the confidence of being correct lowers like a binomial distribution, which is the reason that we chose an exponentially decaying weight function in Equation 4.
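The reweighing rule is simple enough to state in a few lines. The sketch below assumes the count-based definition of $c_i$ and the choice $\epsilon = 1 - p$; the function names and the example counts are ours.

```python
def epsilon_from_precision(p):
    """Choose epsilon as the estimated fraction of flagged sentences that are correct."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("precision must lie in [0, 1]")
    return 1.0 - p


def sentence_weights(counts, epsilon):
    """Exponentially decaying weights: epsilon raised to the number of flags."""
    return [epsilon ** c for c in counts]


# Hypothetical counts for three sentences after t = 3 iterations.
weights = sentence_weights([0, 1, 3], epsilon=0.7)
```

A sentence that was never flagged keeps weight 1, while one flagged in all three iterations is reduced to roughly a third of its original weight.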


4 Experiments

In this section, we conduct several experiments to show effectiveness of our CrossWeigh framework. We first evaluate the overall performance of CrossWeigh on benchmark NER datasets, by plugging it into three base NER models. Since we have two modules in CrossWeigh, we then dive into each module and explore different variants and ablations. In addition, we further verify the effectiveness of CrossWeigh on two more datasets: an emerging-entity NER dataset from WNUT’17 and a low-resource language NER dataset of the Sinhalese language.

4.1 Experimental Settings

Dataset. We use both the original and corrected CoNLL03 datasets. We follow the standard train/dev/test splits and use both the train set and dev set for training peters2017semi; akbik2018contextual. Entity-wise F1 score on the test set is the evaluation metric.
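For readers unfamiliar with the entity-wise metric, the following sketch computes a micro-averaged F1 over exact span-and-type matches. It assumes BIO labels and treats a stray I- tag as the start of a new span; official scoring scripts may handle such edge cases differently, so this is an approximation rather than the evaluation code used in our experiments.

```python
def bio_spans(labels):
    """Return the set of (start, end, type) entity spans in a BIO label sequence."""
    spans, start, kind = set(), None, None
    for i, label in enumerate(list(labels) + ["O"]):  # sentinel flushes the last span
        inside = label.startswith("I-") and kind == label[2:]
        if not inside:
            if start is not None:
                spans.add((start, i, kind))
            start, kind = (i, label[2:]) if label != "O" else (None, None)
    return spans


def entity_f1(gold_sequences, predicted_sequences):
    """Micro-averaged F1 over exactly matching entity spans."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_sequences, predicted_sequences):
        g, p = bio_spans(gold), bio_spans(pred)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


# "Hapoel Jerusalem 0 Maccabi Tel Aviv 4" with an incomplete second prediction.
gold = [["B-ORG", "I-ORG", "O", "B-ORG", "I-ORG", "I-ORG", "O"]]
pred = [["B-ORG", "I-ORG", "O", "O", "B-ORG", "I-ORG", "O"]]
f1 = entity_f1(gold, pred)
```

Because matching is exact, predicting only “Tel Aviv” instead of “Maccabi Tel Aviv” counts as both a missed entity and a spurious one.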

Base NER Algorithm. We mainly choose Flair as our base NER algorithm. Flair is a strong NER algorithm using external resources (a large corpus to train a language model). While Pooled-Flair has even better performance, its computational cost prevents us from doing extensive experiments.

Default Parameters in CrossWeigh. For all NER algorithms we experiment with, their default parameters are used. For CrossWeigh parameters, by default, we set $k = 10$, $t = 3$, and $\epsilon = 0.7$. We choose $\epsilon = 0.7$ because a manual check of randomly sampled sentences marked as “potentially mistake” suggests that the probability of one such annotation being correct is roughly $0.7$. We use both the train and development sets to train the models, and report the average F1 and its standard deviation on both the original test set and our corrected test set across 5 different runs peters2017semi.
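Reporting a mean and standard deviation over 5 runs can be done as below. The scores are made up for illustration, and we use the sample standard deviation here as an assumption; the text does not state whether the sample or population form is reported.

```python
import statistics

# Hypothetical test F1 scores from 5 runs with different random seeds.
run_scores = [93.1, 93.2, 93.3, 93.2, 93.2]

mean_f1 = statistics.mean(run_scores)
std_f1 = statistics.stdev(run_scores)  # sample standard deviation (n - 1)
summary = f"{mean_f1:.2f} ({std_f1:.2f})"
```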

4.2 Overall Performance

We pair CrossWeigh with our base algorithm (i.e. Flair) and two best-performing NER algorithms with or without language models in Table 2 (i.e. Pooled-Flair and VanillaNER), and evaluate their performance. As shown in Table 3, compared with the three algorithms, applying CrossWeigh always leads to a higher F score and a comparable, sometimes even smaller, standard deviation. Therefore, it is clear that CrossWeigh can improve the performance of NER models. The smaller standard deviations also imply that the models trained with CrossWeigh are more stable. All these results illustrate the superiority of training with CrossWeigh.

Method | Original | Corrected
w/o CrossWeigh | 92.87 (0.08) | 93.89 (0.06)
w/ CrossWeigh | 93.19 (0.09) | 94.18 (0.06)
w/ CrossWeigh – Entity Disjoint | 92.88 (0.11) | 93.84 (0.08)
w/ CrossWeigh + Random Discard | 93.01 (0.10) | 93.94 (0.10)
Table 4: Importance of Entity Disjoint Filtering.

4.3 Ablations and Variants

We pick Flair as the base algorithm to conduct the ablation study.

Entity Disjoint Filtering. There is an entity disjoint filtering step, when we are collecting training data for the NER model during the mistake estimation step. To study its importance, we have done a few ablation experiments.

We have evaluated the following variants:


  • Flair w/ CrossWeigh – Entity Disjoint: Skip the entity disjoint filtering step.

  • Flair w/ CrossWeigh + Random Discard: Instead of entity disjoint filtering, randomly discard from each fold’s training set the same number of sentences as entity disjoint filtering would remove.

The results are listed in Table 4. One can easily observe that without the entity disjoint filtering, the F1 scores are very close to those of the raw Flair model. This demonstrates that the entity disjoint filtering is critical to reduce the over-fitting risk in the mistake estimation step. Also, our proposed entity disjoint filtering strategy works more effectively than random discard. This further confirms the effectiveness of entity disjoint filtering.

Variants in Computing $c_i$. There is more than one way to determine $c_i$. Let $\hat{c}_i$ be the number of “potentially mistake” indications among the $t$ estimations; we can apply any of the following heuristics:

  • Ratio: $c_i$ is the number of “potentially mistake” indications (i.e., $c_i = \hat{c}_i$). This is the method described in Section 3, and used by default.

  • At Least One: $c_i$ is the indicator of at least one estimation being “potentially mistake” (i.e., $c_i = \mathbb{1}[\hat{c}_i \ge 1]$).

  • Majority: $c_i$ is the indicator of at least half of the estimations being “potentially mistake” (i.e., $c_i = \mathbb{1}[\hat{c}_i \ge t/2]$).

  • All: $c_i$ is the indicator of all estimations being “potentially mistake” (i.e., $c_i = \mathbb{1}[\hat{c}_i = t]$).

We evaluate the performance of these heuristics when used in CrossWeigh, as shown in Table 5. There is not much difference across these heuristics, while our default choice “Ratio” is the most stable.
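The four heuristics differ only in how the raw number of flags is mapped to a confidence value, which the sketch below makes explicit. The function and heuristic names are ours, and the majority threshold of half the estimations is an assumption on our part.

```python
def confidence(num_flags, t, heuristic="ratio"):
    """Map the number of 'potentially mistake' flags (out of t) to c_i."""
    if not 0 <= num_flags <= t:
        raise ValueError("num_flags must lie between 0 and t")
    if heuristic == "ratio":
        return num_flags                 # default: use the raw count
    if heuristic == "at_least_one":
        return int(num_flags >= 1)
    if heuristic == "majority":
        return int(num_flags >= t / 2)   # assumed threshold: at least half
    if heuristic == "all":
        return int(num_flags == t)
    raise ValueError(f"unknown heuristic: {heuristic}")


# A sentence flagged in 2 of t = 3 iterations under each heuristic.
names = ["ratio", "at_least_one", "majority", "all"]
flagged_twice = {name: confidence(2, 3, name) for name in names}
```

Only the default heuristic distinguishes between a sentence flagged once and one flagged in every iteration; the other three collapse the count to a binary decision.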

4.4 Label Mistake Identification Results

Another usage of CrossWeigh is to identify potential label mistakes during the label annotation process, thus improving the annotation quality. This could also be helpful for active learning.

Specifically in this experiment, we apply our noise estimation module to the concatenation of training and testing data. As we have manually corrected the label mistakes in the testing set, we are able to report the number of true mistakes among the potential mistakes discovered in the test set.

The results are presented in Table 9. The potential mistakes are the total number of mistakes identified by CrossWeigh, and the actual mistakes are the true positives among all identifications. From the results, we can see that when the base model is Flair, CrossWeigh is able to spot more than 75% of label mistakes, while maintaining a precision of about 25%. It is worth noting that 25% is a reasonably high precision, given that the label mistake ratio is only 5.38%. The 75% recall indicates that CrossWeigh is able to identify most of the label mistakes, which is extremely valuable for improving the annotation quality.
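Precision and recall of the flagged sentences against manually verified mistakes can be computed as in the sketch below. The index sets are invented and were chosen only to mirror the rough 25% precision and 75% recall discussed above; they are not our experimental data.

```python
def detection_quality(flagged_ids, true_mistake_ids):
    """Precision and recall of flagged sentences against verified mistakes."""
    flagged, truth = set(flagged_ids), set(true_mistake_ids)
    hits = flagged & truth
    precision = len(hits) / len(flagged) if flagged else 0.0
    recall = len(hits) / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical example: 12 flagged sentences, 4 verified mistakes, 3 in common.
precision, recall = detection_quality(range(12), {0, 1, 2, 20})
```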

Heuristic | Original | Corrected
At Least One | 93.10 (0.10) | 94.16 (0.07)
Majority | 93.20 (0.09) | 94.12 (0.07)
All | 93.16 (0.09) | 94.11 (0.09)
Ratio | 93.19 (0.09) | 94.18 (0.06)
Table 5: Different Estimation Heuristics.

$t$ | Original | Corrected
1 | 93.09 (0.14) | 94.07 (0.09)
5 | 93.23 (0.10) | 94.14 (0.08)
3 | 93.19 (0.09) | 94.18 (0.06)
Table 6: Different Numbers of Iterations $t$.

$k$ | Original | Corrected
2 | 92.11 (0.24) | 92.88 (0.11)
5 | 93.12 (0.08) | 94.12 (0.08)
10 | 93.19 (0.09) | 94.18 (0.06)
Table 7: Different Numbers of Folds $k$.

$\epsilon$ | Original | Corrected
0.3 | 92.79 (0.14) | 93.68 (0.15)
0.5 | 93.21 (0.09) | 94.18 (0.07)
0.9 | 93.01 (0.10) | 93.96 (0.09)
0.7 | 93.19 (0.09) | 94.18 (0.06)
Table 8: Different Weight Adjustments $\epsilon$.
Table 9: Quality of noise estimation, reporting potential mistakes and actual mistakes. The number of true mistakes, based on our manual correction, is 186. The potential mistakes are counted based on the average of 3 runs.

4.5 Parameter Study

We study how CrossWeigh performs with different hyper-parameters, i.e., $t$ (the number of iterations that we run mistake estimation), $k$ (the number of folds in mistake estimation), and $\epsilon$ (the weight scaling factor of identified potential mistakes).

In principle, a larger $t$ usually gives us a more stable mistake estimation. However, a larger $t$ also requires more computation resources. In our experiments (see Table 6), we find that $t = 3$ provides a good enough result.

Specifically, during mistake estimation, we have to choose the number of folds $k$ to partition the data. The more partitions made, the smaller each $\mathcal{E}_j$ is and the fewer sentences will be filtered, leading to more training data in $train_j$ and a better trained $\mathcal{M}_j$. On the other hand, this is at the cost of higher computational expense. As shown in Table 7, we observe that $k = 5$ and $k = 10$ are significantly better than $k = 2$. In fact, when $k = 2$, each $train_j$ has only around 5000 sentences and 1500 entities inside. These numbers become 7000 and 4000 when $k = 5$, and 9000 and 7000 when $k = 10$.

As we mentioned before, the value of $\epsilon$ can be chosen by estimating the quality of mistake estimation. Table 8 presents some results when other values are used. $\epsilon = 0.3$ leads to the worst performance. Since our estimation does not have high precision, assigning $\epsilon$ a low value like $0.3$ may not be a good choice. Interestingly, $\epsilon = 0.5$ performs on par with $\epsilon = 0.7$, and even slightly better on the original test set. We hypothesize that this is because there are some ambiguous sentences that we did not count when estimating the quality of mistake estimation (see Section 5), so the actual precision could be higher.

4.6 Other Datasets

To show the generalizability of our method across domains and languages, we further evaluate CrossWeigh on an emerging-entity NER dataset from WNUT’17 and a Sinhalese NER dataset from LORELEI (LDC2018E57). Sinhalese is a low-resource, morphology-rich language. For WNUT’17, we use Flair as our base NER algorithm. For Sinhalese, we use BERT devlin2018bert followed by a BiLSTM-CRF as our base NER algorithm. We use the same parameters as in the previous CoNLL03 experiments, namely $k = 10$, $t = 3$, and $\epsilon = 0.7$.

Dataset w/o CrossWeigh w/ CrossWeigh
WNUT’17 48.96 (0.97) 50.03 (0.40)
Sinhalese 66.34 (0.34) 67.68 (0.21)
Table 10: Applying CrossWeigh on other datasets

The results averaged across 5 runs are reported in Table 10. One can observe quite similar results as those in the previous CoNLL03 experiments. Training with CrossWeigh leads to a significantly higher F1 and a smaller standard deviation. This suggests that CrossWeigh works well in other datasets and languages.

Field | Training Set | Test Set
Text | Hapoel Haifa 3 Maccabi Tel Aviv 1 | Hapoel Jerusalem 0 Maccabi Tel Aviv 4
Original Annotations | [Hapoel Haifa]{ORG}, [Tel Aviv]{ORG} | [Hapoel Jerusalem]{ORG}, [Maccabi Tel Aviv]{ORG}
Correct Annotations | [Hapoel Haifa]{ORG}, [Maccabi Tel Aviv]{ORG} | [Hapoel Jerusalem]{ORG}, [Maccabi Tel Aviv]{ORG}

Model | Action | Result
Flair | Assumes this sentence is equally reliable as others. | [Hapoel Jerusalem]{ORG}, [Tel Aviv]{ORG}
Flair w/ CrossWeigh | Lowers the weight of this sentence as a potential mistake. | [Hapoel Jerusalem]{ORG}, [Maccabi Tel Aviv]{ORG}
Table 11: Case Study on the CoNLL03 dataset. The errors are the incomplete span [Tel Aviv]{ORG} in the original training annotation and in the Flair result.

5 Case Studies

Test Set Correction. Besides the label mistakes that we have corrected, we also find some ambiguous but consistent cases. For instance, (1) all NBA/NHL divisions such as “CENTRAL DIVISION” and “WESTERN DIVISION” were annotated as MISC, while European leagues, such as “SPANISH FIRST DIVISION” and “ENGLISH PREMIER LEAGUE”, are not marked as MISC as a whole; only “SPANISH” and “ENGLISH” are labelled as MISC. And (2) “Team A at Team B” is a way to say that “Team A” plays as the away team against “Team B” as the home team. However, in almost all cases (only 1 exception out of more than 100), “Team A” was labelled as ORG while “Team B” was labelled as LOC. For example, in “MINNESOTA AT MILWAUKEE”, “NEW YORK AT CALIFORNIA”, and “ORLANDO AT LA LAKERS”, the second sports teams “MILWAUKEE”, “CALIFORNIA” and “LA LAKERS” were always labelled as LOC. Because these cases are annotated consistently and generally follow the annotation guideline, we did not touch them during the test set correction.

CrossWeigh Framework. The mistakes in the training set can harm the generalizability of the trained model. For example, in Table 11, the original training sentence “Hapoel Haifa 3 Maccabi Tel Aviv 1” contains a label mistake, because “Maccabi Tel Aviv” is a sports team but was not annotated completely. Interestingly, there is a similar sentence in the test set: “Hapoel Jerusalem 0 Maccabi Tel Aviv 4”. In all 5 different runs, the original Flair model failed to predict “Maccabi Tel Aviv” in the test sentence as ORG because of the label mistake in the training sentence, even though “ORG number ORG number” is an obvious pattern in the training set. In CrossWeigh, this label mistake in the training set was detected in all iterations and therefore assigned a very low weight during training. After that, in all 5 different runs, Flair w/ CrossWeigh successfully predicted that “Maccabi Tel Aviv” is ORG as a whole.

6 Related Work

In this section, we review related works from three aspects, mistake identification, cross validation & boosting, and NER algorithms.

6.1 Mistake Identification

Researchers have noticed the label mistakes in sophisticated natural language processing tasks for a while. For example, it is reported that the inter-annotator agreement is about 97% on the Penn Treebank POS tagging dataset manning2011part; subramanya2010efficient.

There are a few attempts towards detecting label mistakes automatically. For example, nakagawa2002detecting designed a support vector machine-based model to assign weights to examples that were hard to classify in the POS tagging task. loftsson2009correcting further applied previous detection models and manually corrected the Icelandic Frequency Dictionary pind1991islensk POS tagging dataset. However, these two methods are specifically developed for POS tagging and cannot be directly applied to NER.

Recently, rehbein2017detecting extends variational inference with active learning to detect label mistakes in “silver standard” data generated by machines. In this paper, we focus on detecting label mistakes in “gold standard” data, which is a different scenario.

6.2 Cross Validation & Boosting

Our mistake estimation module shares some similarity with cross validation. Applying cross validation to the training set is the same as our mistake estimation module, except that we have an entity disjoint filtering step. Experiments in Table 4 show that this step is crucial to our performance gain. The choice of ten folds also stems from cross validation kohavi1995study.

Another similar thread of work is boosting, such as Adaboost freund1999short; schapire1999improved. For example, abney1999boosting has applied Adaboost on the Penn Treebank POS tagging dataset and gained encouraging results on model performance. In boosting algorithms, the training data is assumed to be perfect. Therefore, boosting trains models using the full training set and then increases the weights of training instances that the current model fails on in the next round of learning. In contrast, we decrease the weights of sentences that differ from the predictions of the model built upon the entity disjoint training set. More importantly, our framework is a better fit for neural models, because they can easily overfit the training data and are thus poor choices as weak classifiers in boosting.

6.3 NER Algorithms

Neural models have been widely used for Named Entity Recognition, and the state-of-the-art models integrate LSTMs, conditional random field and language models lample2016neural; ma2016end; liu2018empower; peters2018deep; akbik2018contextual. In this paper, we focus on improving the annotation quality for NER, and our method has a big potential to help other methods, especially for noisy datasets shang2018learning.

7 Conclusion & Future work

In this paper, we explore and correct the label mistakes in the CoNLL03 NER dataset. Based on the corrected test set, we re-evaluate popular recent NER models. We further propose a novel framework, CrossWeigh, that is able to detect label mistakes in the training set and then train a more robust NER model accordingly. Extensive experiments demonstrate the effectiveness of CrossWeigh on three datasets and also indicate the potential of using CrossWeigh to improve the annotation quality during the label curation process.

In future, we plan to extend our framework into an iterative setting, similar to those boosting algorithms. The bottleneck of doing this lies in the efficiency problems of training multiple deep neural models hundreds of times. One solution to overcome it is to apply meta learning. We can first train a meta model and only fine-tune on different training data on each fold. In this way, we can identify label mistakes more accurately and obtain a series of weighted models at the end.


We thank all reviewers for valuable comments and suggestions that brought improvements to our final version. Research was sponsored in part by U.S. Army Research Lab. under Cooperative Agreement No. W911NF-09-2-0053 (NSCTA), DARPA under Agreement No. W911NF-17-C-0099, National Science Foundation IIS 16-18481, IIS 17-04532, and IIS-17-41317, DTRA HDTRA11810026, Google Ph.D. Fellowship and grant 1U54GM114838 awarded by NIGMS through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative. Any opinions, findings, and conclusions or recommendations expressed in this document are those of the author(s) and should not be interpreted as the views of any U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon. This work was supported by Contracts HR0011-15-C-0113 and HR0011-18-2-0052 with the US Defense Advanced Research Projects Agency (DARPA).