Computationally Efficient NER Taggers with Combined Embeddings and Constrained Decoding

by Brian Lester, et al.

Current state-of-the-art models in Named Entity Recognition (NER) are neural models with a Conditional Random Field (CRF) as the final network layer, and pre-trained "contextual embeddings". The CRF layer is used to facilitate global coherence between labels, and the contextual embeddings provide a better representation of words in context. However, both of these improvements come at a high computational cost. In this work, we explore two simple techniques that substantially improve NER performance over a strong baseline with negligible cost. First, we use multiple pre-trained embeddings as word representations via concatenation. Second, we constrain the tagger, trained using a cross-entropy loss, during decoding to eliminate illegal transitions. While training a tagger on CoNLL 2003 we find a 786% speed-up over a contextual embeddings-based tagger without sacrificing strong performance. We also show that the concatenation technique works across multiple tasks and datasets. We analyze aspects of similarity and coverage between pre-trained embeddings and the dynamics of tag co-occurrence to explain why these techniques work. We provide an open source implementation of our tagger using these techniques in three popular deep learning frameworks: TensorFlow, PyTorch, and DyNet.




1 Introduction and Motivations

Named Entity Recognition (NER) is usually cast as a sequence labeling task where the goal is to identify objects in the world, such as people (“PER”) or locations (“LOC”). Multi-token spans are traditionally handled by having “Beginning” and “Inside” indicators identifying which tokens start, continue, or signal a change to a different entity. Ratinov and Roth (2009) show that the IOBES tagging scheme, where entity spans must begin with a “B” token, end with an “E” and single token entities are labeled with an “S”, performs better than BIO tagging schemes. IOBES tagging schemes dictate that some token sequences are illegal. It is possible to impose decoding constraints on the model rather than relying only on what is seen in the training data.
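To make the difference between the two tagging schemes concrete, the following minimal sketch converts a well-formed BIO sequence to IOBES. The function name `bio_to_iobes` and the example tags are illustrative assumptions and are not taken from the authors' implementation.

```python
from typing import List


def bio_to_iobes(tags: List[str]) -> List[str]:
    """Convert a well-formed BIO tag sequence to IOBES (illustrative helper)."""
    iobes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            iobes.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        # The span continues only if the next tag is an I- of the same label.
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == f"I-{label}"
        if prefix == "B":
            # A B- that is not continued is a single-token entity.
            iobes.append(f"B-{label}" if continues else f"S-{label}")
        else:
            # An I- that is not continued closes the span.
            iobes.append(f"I-{label}" if continues else f"E-{label}")
    return iobes


bio = ["B-PER", "I-PER", "I-PER", "O", "B-LOC", "O", "B-ORG", "I-ORG"]
print(bio_to_iobes(bio))
# ['B-PER', 'I-PER', 'E-PER', 'O', 'S-LOC', 'O', 'B-ORG', 'E-ORG']
```

Because every span now has an explicit end marker, sequences such as a "B-" followed directly by an "O" can be recognized as illegal.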

It is conventional wisdom in NER that models with a Linear Chain Conditional Random Field (CRF) Lafferty et al. (2001) layer perform better than those without Collobert et al. (2011); Ma and Hovy (2016); Lample et al. (2016), yielding relative performance increases between and percent Ma and Hovy (2016); Lample et al. (2016). A CRF with Viterbi decoding promotes global coherence where simple greedy decoding does not. Therefore, in a bidirectional LSTM (biLSTM) model with a CRF layer, illegal transitions are rare compared to models that select just the best-scoring tag for each token. However, because the CRF forward algorithm is $O(NT^2)$, where $N$ is the length of the sentence and $T$ is the number of possible tags, it slows down training significantly. Moreover, substantial effort is required to build an optimized, correct implementation of this layer. Alternatively, training with a cross-entropy loss runs in $O(N)$ for sparse labels and can be computed in parallel. Instead of traditional CRF training, we propose Viterbi decoding at test time with heuristically determined transition probabilities that prohibit illegal transitions. We find that this simple modification allows taggers trained with cross-entropy loss to compete with those trained using a CRF loss while yielding much faster training times.

Our approach is similar to previous work in NLP where constraints are introduced during inference Roth and Yih (2005); Punyakanok et al. (2005). There have been other attempts to eliminate the CRF layer in taggers, notably Shen et al. (2018), which found that an additional LSTM greedy decoder layer is competitive with the CRF layer, though their baseline is much weaker than the models found in other work, and the source code was never released. Additionally, their decoder has an auto-regressive relationship that is difficult to parallelize and, in practice, there is still significant overhead at training time. Chiu and Nichols (2016) mention good results with a similar decoding scheme but do not provide in-depth analysis or metrics, nor do they test its generality.

Many recent NLP publications have focused on better feature representations via contextual word embeddings Peters et al. (2018, 2017); Radford et al. (2018); Akbik et al. (2018); Devlin et al. (2018). These models vary in architecture and pre-training objective but they all encode the input based on the surrounding context in some way. For NER, these papers normally use biLSTM-CRF baselines where words are represented by single pre-trained word embeddings.

Contextual embeddings and transfer learning architectures are slow to train and evaluate, which may make them infeasible for many types of deployments. We find that concatenating multiple pre-trained word embeddings instead is much faster and shows consistent improvements over single embeddings, bringing performance much closer to that of contextual alternatives.
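The concatenation idea can be sketched in a few lines of plain Python. The toy tables `EMB_A` and `EMB_B`, the helper names, and the choice of a zero vector for out-of-vocabulary words are all assumptions made for illustration; a real tagger would load pre-trained vectors from disk and may handle unknown words differently.

```python
from typing import Dict, List, Sequence

Vector = List[float]

# Hypothetical toy tables standing in for two pre-trained embedding sets.
EMB_A: Dict[str, Vector] = {"paris": [0.1, 0.2, 0.3, 0.4], "visited": [0.5, 0.6, 0.7, 0.8]}
EMB_B: Dict[str, Vector] = {"paris": [1.0, 2.0, 3.0], "obama": [4.0, 5.0, 6.0]}


def lookup(token: str, table: Dict[str, Vector], dim: int) -> Vector:
    """Return the token's vector, or a zero vector if this table lacks the token."""
    return table.get(token.lower(), [0.0] * dim)


def concat_embed(tokens: Sequence[str], tables: Sequence[Dict[str, Vector]]) -> List[Vector]:
    """Represent each token by the concatenation of its vector from every table."""
    dims = [len(next(iter(t.values()))) for t in tables]
    return [
        [x for table, dim in zip(tables, dims) for x in lookup(tok, table, dim)]
        for tok in tokens
    ]


features = concat_embed(["Obama", "visited", "Paris"], [EMB_A, EMB_B])
print(len(features), len(features[0]))  # 3 7
```

Each token ends up with a 4 + 3 = 7 dimensional vector, and a word missing from one table can still be represented by the other.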

2 Experiments & Results

We use three sequential prediction tasks to test the performance of our concatenated embeddings: NER (CoNLL 2003 Tjong Kim Sang and De Meulder (2003), WNUT-17 Derczynski et al. (2017), and Ontonotes Hovy et al. (2006)), slot filling (Snips Coucke et al. (2018)), and POS tagging (TW-POS [1]). We also show results on three classification datasets: SST2 Socher et al. (2013), Snips intent classification Coucke et al. (2018), and AG-News. For each (task, dataset) pair we use the most common embedding used in the literature. For all tagging tasks, a biLSTM-CRF model is used. For all classification tasks, a single layer LSTM model is used except for the Snips classification dataset, where a convolutional word-based model is used. The hyperparameters are omitted here for brevity but can be found in our implementation.

The results are presented in Table 1. 6B, 27B and 840B are well-known GloVe embeddings Pennington et al. (2014); w2v-30M Pressel et al. (2018) and GN Mikolov et al. (2013) are Word2Vec embeddings trained on a corpus of 30 million tweets and Google News respectively; and the Senna embeddings were trained by Collobert et al. (2011). As hypothesized, we see improvements across tasks, datasets, and model architectures when multiple embeddings are concatenated (except for Ontonotes). When compared to a model that only uses a single pre-trained embedding, a model that uses the concatenation of pre-trained and randomly initialized embeddings does % worse on average, demonstrating that the performance gains are from the combination of different pre-trained embeddings rather than the increase in the number of parameters in the model. In some cases we were able to improve results further by adding several sets of additional embeddings.

Task            Dataset    Model       Embeddings          mean   std   min    max
NER             CoNLL      biLSTM-CRF  6B                  91.12  0.21  90.62  91.37
NER             CoNLL      biLSTM-CRF  Senna               90.48  0.27  90.02  90.81
NER             CoNLL      biLSTM-CRF  6B, Senna           91.47  0.25  91.15  92.00
NER             WNUT-17    biLSTM-CRF  27B                 39.20  0.71  37.98  40.33
NER             WNUT-17    biLSTM-CRF  27B, w2v-30M        39.52  0.83  38.09  40.39
NER             WNUT-17    biLSTM-CRF  27B, w2v-30M, 840B  40.33  1.13  38.38  41.99
NER             Ontonotes  biLSTM-CRF  6B                  87.43  0.13  87.15  87.57
NER             Ontonotes  biLSTM-CRF  6B, Senna           87.41  0.17  87.14  87.74
Slot Filling    Snips      biLSTM-CRF  6B                  95.84  0.29  95.39  96.21
Slot Filling    Snips      biLSTM-CRF  GN                  95.28  0.41  94.51  95.81
Slot Filling    Snips      biLSTM-CRF  6B, GN              96.04  0.28  95.39  96.21
POS             TW-POS     biLSTM-CRF  w2v-30M             89.21  0.28  88.72  89.74
POS             TW-POS     biLSTM-CRF  27B                 89.63  0.19  89.35  89.92
POS             TW-POS     biLSTM-CRF  27B, w2v-30M        90.35  0.20  89.99  90.60
POS             TW-POS     biLSTM-CRF  27B, w2v-30M, 840B  90.75  0.14  90.53  91.02
Classification  SST2       LSTM        840B                88.39  0.45  87.42  89.07
Classification  SST2       LSTM        GN                  87.58  0.54  86.16  88.19
Classification  SST2       LSTM        840B, GN            88.57  0.44  87.59  89.24
Classification  AG-NEWS    LSTM        840B                92.53  0.45  87.42  89.07
Classification  AG-NEWS    LSTM        GN                  92.20  0.18  91.80  92.40
Classification  AG-NEWS    LSTM        840B, GN            92.60  0.20  92.30  92.86
Classification  Snips      Conv        840B                97.47  0.33  97.01  97.86
Classification  Snips      Conv        GN                  97.40  0.27  97.00  97.86
Classification  Snips      Conv        840B, GN            97.63  0.52  97.00  98.29

Table 1: Results using multiple embeddings applied to several tasks and datasets. NER and Slot Filling tasks report entity-level F1. POS tagging and Classification report accuracy. All results are reported across 10 runs.

Dataset    Model      mean   std   max
CoNLL      CRF        91.47  0.25  92.00
CoNLL      Constrain  91.44  0.23  91.90
WNUT-17    CRF        40.33  1.13  41.99
WNUT-17    Constrain  40.59  1.06  41.71
Snips      CRF        96.04  0.28  96.35
Snips      Constrain  96.07  0.17  96.29
Ontonotes  CRF        87.43  0.26  87.57
Ontonotes  Constrain  86.13  0.17  86.72

Table 2: Results of tagging with constraints vs. a CRF. For each, we use the best embedding combination as found in Table 1. Scores are reported across 10 runs.

To test whether constrained decoding provides results comparable to a CRF layer, we implement a mask that effectively eliminates invalid IOBES transitions by setting the scores of those transitions to large negative values. This mask is generated from the rules of IOBES encoding. For example, an “I-” of one class cannot follow a “B-” of a different class, and entities must end in “E-”, i.e., “B-” cannot transition directly to an “O”. This mask can be applied to the CRF transition scores or used directly to facilitate Viterbi decoding when no CRF is used.
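A minimal sketch of this kind of masked Viterbi decoding is shown here in plain Python. It is not the authors' implementation: the helper names, the value of `NEG_INF`, the use of a score of zero for every legal transition, and the start and end constraints are assumptions chosen for illustration.

```python
from typing import List

NEG_INF = -1e4  # assumed "large negative value" that rules out a transition


def is_legal(prev: str, curr: str) -> bool:
    """IOBES rule: B-/I- must be followed by I-/E- of the same type;
    O, E-, and S- must be followed by O, B-, or S-."""
    p_pre, _, p_type = prev.partition("-")
    c_pre, _, c_type = curr.partition("-")
    if p_pre in ("B", "I"):
        return c_pre in ("I", "E") and c_type == p_type
    return c_pre in ("O", "B", "S")


def transition_mask(tags: List[str]) -> List[List[float]]:
    """mask[i][j] is 0.0 if tags[i] -> tags[j] is legal, else NEG_INF."""
    return [[0.0 if is_legal(p, c) else NEG_INF for c in tags] for p in tags]


def constrained_viterbi(scores: List[List[float]], tags: List[str]) -> List[str]:
    """Best legal tag path for per-token scores of shape (N, T)."""
    mask = transition_mask(tags)
    T = len(tags)
    # Assumption: a sentence may only start with O, B-, or S-.
    best = [s + (0.0 if is_legal("O", t) else NEG_INF) for s, t in zip(scores[0], tags)]
    backptrs = []
    for step in scores[1:]:
        ptrs, new_best = [], []
        for j in range(T):
            i = max(range(T), key=lambda k: best[k] + mask[k][j])
            ptrs.append(i)
            new_best.append(best[i] + mask[i][j] + step[j])
        backptrs.append(ptrs)
        best = new_best
    # Assumption: a sentence may only end with O, E-, or S-.
    best = [b + (0.0 if is_legal(t, "O") else NEG_INF) for b, t in zip(best, tags)]
    j = max(range(T), key=lambda k: best[k])
    path = [j]
    for ptrs in reversed(backptrs):
        j = ptrs[j]
        path.append(j)
    return [tags[k] for k in reversed(path)]


tags = ["O", "B-PER", "I-PER", "E-PER", "S-PER"]
scores = [
    [0.1, 2.0, 0.0, 0.0, 1.5],
    [1.0, 0.0, 0.0, 0.9, 0.2],
    [2.0, 0.0, 0.0, 0.0, 0.0],
]
greedy = [tags[max(range(len(tags)), key=row.__getitem__)] for row in scores]
print(greedy)                             # ['B-PER', 'O', 'O']
print(constrained_viterbi(scores, tags))  # ['B-PER', 'E-PER', 'O']
```

In this toy example, greedy per-token decoding yields the illegal sequence B-PER, O, whereas the masked decoder picks the best-scoring sequence that respects the IOBES rules. The decoder itself has no learned parameters, so it can be attached to a tagger trained with a plain cross-entropy loss.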

We investigate the effect of constrained decoding on three NER datasets and one Slot Filling dataset. The results are presented in Table 2. On three out of four datasets, constrained decoding performs comparably to or better than the CRF; again, Ontonotes is the only exception. We observe a 50% improvement in training time on average.

The models were trained using Baseline Pressel et al. (2018), an open-source framework for creating, training, evaluating and deploying models for NLP.

3 Analysis

Embeddings         Overlap (train)  Overlap (dev)  Attested (train)  Attested (dev)  Performance (mean)  Performance (std)
Senna              18.9             20.8           74.3              80.3            91.466              0.247
GloVe twitter 27B  24.9             27.2           68.1              76.1            91.098              0.135
GloVe 840B         41.7             40.6           83.2              88.5            91.011              0.228
GloVe 42B          45.5             45.3           90.4              93.8            91.163              0.146
GoogleNews         25.2             26.8           55.9              65.1            90.948              0.180

Table 3: Embedding similarity as defined by average Jaccard similarity of the 10 nearest neighbors on the top 200 words in CoNLL 2003. Performance is the F1 score of each embedding when paired with GloVe-6B-100d vectors.

We observe that concatenated embeddings trained on sufficiently different datasets perform well in our experiments. We hypothesized that each embedding set augments the meaning representations, making them more useful for the downstream tagging task. To test this theory, we looked at the similarity of various pre-trained embeddings. We define similarity as the Jaccard overlap percentage between the 10 nearest neighbors for each of the top 200 words in the dataset by frequency. Embeddings that complement the base embedding set should have a low similarity; otherwise they would not add much extra information. More similar embeddings experienced less of a performance boost than dissimilar embeddings. However, similarity was not a perfect predictor of whether the model will improve with concatenated embeddings. We also investigated coverage, the proportion of words in the dataset found in the pre-trained vocabulary. While Google News vectors have low overlap with the GloVe 6B embeddings, which should augment the information in the word representations, they cover so few of the dataset's words that they do not add a significant performance gain. Table 3 shows how model performance changes with different embedding combinations on the CoNLL 2003 dataset. When finding complementary embeddings, it seems necessary both that embedding sets are highly attested and that they have low similarity to one another to improve performance.
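The two diagnostics described above can be sketched as follows. This is an illustrative reading rather than the authors' code: the use of cosine similarity for nearest neighbors, the token-level definition of coverage, the function names, and the toy vectors are all assumptions.

```python
import math
from typing import Dict, List, Set

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(word: str, emb: Dict[str, Vector], k: int) -> Set[str]:
    """The k words in emb most cosine-similar to word (excluding itself)."""
    others = [w for w in emb if w != word]
    others.sort(key=lambda w: cosine(emb[word], emb[w]), reverse=True)
    return set(others[:k])


def jaccard(a: Set[str], b: Set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0


def similarity(words: List[str], emb_a: Dict[str, Vector], emb_b: Dict[str, Vector], k: int = 10) -> float:
    """Average Jaccard overlap of k-nearest-neighbor sets, over words present in both tables."""
    shared = [w for w in words if w in emb_a and w in emb_b]
    if not shared:
        return 0.0
    total = sum(
        jaccard(nearest_neighbors(w, emb_a, k), nearest_neighbors(w, emb_b, k))
        for w in shared
    )
    return total / len(shared)


def coverage(tokens: List[str], emb: Dict[str, Vector]) -> float:
    """Proportion of dataset tokens found in the pre-trained vocabulary."""
    return sum(tok in emb for tok in tokens) / len(tokens)


# Toy tables: the same vocabulary, but with different neighborhood structure.
emb_a = {"paris": [1.0, 0.0], "london": [0.9, 0.1], "apple": [0.0, 1.0], "pear": [0.1, 0.9]}
emb_b = {"paris": [1.0, 0.0], "london": [0.1, 0.9], "apple": [0.9, 0.1], "pear": [0.0, 1.0]}

print(similarity(["paris"], emb_a, emb_b, k=1))                # 0.0
print(coverage(["paris", "berlin", "pear", "paris"], emb_a))  # 0.75
```

In the analysis above, the word list would be the 200 most frequent words in the dataset and `k` would be 10.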

For constrained decoding, we leverage the IOBES tagging scheme rather than BIO tagging, allowing us to inject more structure into the decoding mask. Our tests with BIO tagging failed to show the large gains we realized when we applied IOBES tagging. When run on the development set of CoNLL 2003, an unconstrained tagger made 125 illegal transitions with 114 of them resulting in an entity shift. Of these, 46 entity shifts caused errors.

It might seem that the constrained decoder does not have performance parity with the CRF on Ontonotes because Ontonotes has many more entity types () than either CoNLL () or WNUT (). However the constrained decoder does offer superior performance on Snips with types.

To help analyze when and how constrained decoding works, we define the concept of “strictly dominated tokens”. An ambiguous token is a token whose type takes multiple tag values throughout the dataset. A strictly dominated token is one that is normally ambiguous but can only take a single tag value because of the previous tag. Because the constrained decoding eliminates illegal transitions, we would expect that on datasets where it performs well, a large proportion of tokens are strictly dominated. This tends to hold true — only % of Ontonotes’s ambiguous tokens are strictly dominated while % of CoNLL’s tokens are and for WNUT17 % are.
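One possible way to compute the strictly dominated ratio is sketched here. It rests on an interpretation of the definition above, not on the authors' code: a token occurrence counts as strictly dominated when its type is ambiguous in the corpus but exactly one of the tags seen for that type is a legal IOBES successor of the previous gold tag. The function names and the toy corpus are hypothetical.

```python
from collections import defaultdict
from typing import List, Tuple

Sentence = List[Tuple[str, str]]  # (token, IOBES tag) pairs


def is_legal(prev: str, curr: str) -> bool:
    """IOBES rule: B-/I- must be followed by I-/E- of the same type;
    O, E-, and S- must be followed by O, B-, or S-."""
    p_pre, _, p_type = prev.partition("-")
    c_pre, _, c_type = curr.partition("-")
    if p_pre in ("B", "I"):
        return c_pre in ("I", "E") and c_type == p_type
    return c_pre in ("O", "B", "S")


def strictly_dominated_ratio(corpus: List[Sentence]) -> float:
    """Fraction of ambiguous token occurrences left with one legal tag."""
    tags_by_type = defaultdict(set)
    for sent in corpus:
        for token, tag in sent:
            tags_by_type[token].add(tag)

    ambiguous = dominated = 0
    for sent in corpus:
        prev = "O"  # assumption: a sentence start behaves like following an O
        for token, tag in sent:
            candidates = tags_by_type[token]
            if len(candidates) > 1:
                ambiguous += 1
                legal = {t for t in candidates if is_legal(prev, t)}
                dominated += len(legal) == 1
            prev = tag
    return dominated / ambiguous if ambiguous else 0.0


corpus = [
    [("New", "B-LOC"), ("York", "E-LOC"), ("is", "O"), ("big", "O")],
    [("York", "S-LOC"), ("is", "O"), ("old", "O")],
    [("New", "O"), ("ideas", "O")],
]
print(strictly_dominated_ratio(corpus))  # 0.5
```

In the toy corpus, both occurrences of "York" are strictly dominated (only E-LOC can follow B-LOC, and only S-LOC can open a sentence), while both occurrences of "New" remain ambiguous.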

Despite the constrained decoder’s performance on Snips, its strictly dominated token ratio is only %; we believe the ambiguity of the first token in an entity plays a role. Since we are no longer benefiting from the strictly dominated tokens, we suspected that the B- and S- tokens must be fairly unambiguous. In the Snips dataset % of these tokens are unambiguous — they are only ever tagged as that entity token. CoNLL and WNUT have similarly high percentages (% and %) compared to Ontonotes with %.

While a single metric does not completely capture the effectiveness of the constrained decoding, the metrics chosen are good proxies to estimate the effectiveness of the constrained decoder and the trends are present across multiple datasets.

4 Conclusion

Recent large-scale contextual pre-training and transfer learning efforts are exciting but produce relatively slow models. For tagging tasks a CRF layer introduces substantial computational cost as well. We propose two lightweight techniques: concatenation of pre-trained embeddings and constrained decoding. We show that individually each of these techniques has a significant impact on error reduction and, when used together, improves speed significantly with very little cost in performance.

Our analysis suggests that each concatenated embedding should individually have good coverage over the training set and exhibit representational diversity from the rest of the embeddings. For constrained decoding, our performance either exceeds or is on par with that of a CRF while exhibiting a 50% wall clock improvement at training time. We show that the constrained decoder can be used on common datasets where many tokens are conditionally unambiguous based on the rules of IOBES encoding.

In future work, we intend to try other methods of embeddings combination. The constrained decoder can be extended to use transition probabilities estimated from the training set in addition to masking illegal moves. In theory this should boost performance without adversely impacting speed. Also, automated tools can be developed towards a more principled approach for finding out which embeddings should be combined or whether CRFs should be replaced by a constrained decoder for a particular (task, dataset) pair.


  • [1] Cited by: §2.
  • A. Akbik, D. Blythe, and R. Vollgraf (2018) Contextual string embeddings for sequence labeling. In COLING. Cited by: §1.
  • J. P. C. Chiu and E. Nichols (2016) Named entity recognition with bidirectional LSTM-CNNs. TACL 4, pp. 357–370. Cited by: §1.
  • R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. P. Kuksa (2011) Natural language processing (almost) from scratch. Journal of Machine Learning Research 12, pp. 2493–2537. Cited by: §1, §2.
  • A. Coucke, A. Saade, A. Ball, T. Bluche, A. Caulier, D. Leroy, C. Doumouro, T. Gisselbrecht, F. Caltagirone, T. Lavril, M. Primet, and J. Dureau (2018) Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. arXiv preprint arXiv:1805.10190. Cited by: §2.
  • L. Derczynski, E. Nichols, M. van Erp, and N. Limsopatham (2017) Results of the WNUT2017 shared task on novel and emerging entity recognition. In Proceedings of the 3rd Workshop on Noisy, User-generated Text. Cited by: §2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1.
  • E. Hovy, M. Marcus, M. Palmer, L. Ramshaw, and R. Weischedel (2006) OntoNotes: the 90% solution. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, NAACL-Short ’06, Stroudsburg, PA, USA, pp. 57–60. Cited by: §2.
  • J. D. Lafferty, A. McCallum, and F. C. N. Pereira (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, San Francisco, CA, USA, pp. 282–289. ISBN 1-55860-778-1. Cited by: §1.
  • G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer (2016) Neural architectures for named entity recognition. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17, 2016, pp. 260–270. Cited by: §1.
  • X. Ma and E. Hovy (2016) End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1064–1074. Cited by: §1.
  • T. Mikolov, K. Chen, G. S. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. Cited by: §2.
  • J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543. Cited by: §2.
  • M. E. Peters, W. Ammar, C. Bhagavatula, and R. Power (2017) Semi-supervised sequence tagging with bidirectional language models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pp. 1756–1765. Cited by: §1.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proc. of NAACL. Cited by: §1.
  • D. Pressel, S. Ray Choudhury, B. Lester, Y. Zhao, and M. Barta (2018) Baseline: a library for rapid modeling, experimentation and development of deep learning algorithms targeting NLP. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pp. 34–40. Cited by: §2, §2.
  • V. Punyakanok, D. Roth, W. Yih, and D. Zimak (2005) Learning and inference over constrained output. In Proceedings of the 19th International Joint Conference on Artificial Intelligence, IJCAI’05, San Francisco, CA, USA, pp. 1124–1129. Cited by: §1.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. Cited by: §1.
  • L. Ratinov and D. Roth (2009) Design challenges and misconceptions in named entity recognition. In CoNLL 2009 - Proceedings of the Thirteenth Conference on Computational Natural Language Learning, pp. 147–155. ISBN 1932432299. Cited by: §1.
  • D. Roth and W. Yih (2005) Integer linear programming inference for conditional random fields. In Proceedings of the 22nd International Conference on Machine Learning, ICML ’05, New York, NY, USA, pp. 736–743. ISBN 1-59593-180-5. Cited by: §1.
  • Y. Shen, H. Yun, Z. C. Lipton, Y. Kronrod, and A. Anandkumar (2018) Deep active learning for named entity recognition. In International Conference on Learning Representations. Cited by: §1.
  • R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642. Cited by: §2.
  • E. F. Tjong Kim Sang and F. De Meulder (2003) Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, CONLL ’03, Stroudsburg, PA, USA, pp. 142–147. Cited by: §2.