1 Introduction and Motivations
Named Entity Recognition (NER) is usually cast as a sequence labeling task whose goal is to identify mentions of real-world entities, such as people (“PER”) or locations (“LOC”). Multi-token spans are traditionally handled with “Beginning” and “Inside” indicators that identify which tokens start an entity, continue it, or signal a change to a different entity. Ratinov and Roth (2009) show that the IOBES tagging scheme, where entity spans must begin with a “B-” token, end with an “E-” token, and single-token entities are labeled with an “S-”, performs better than the BIO scheme. The IOBES scheme also dictates that some tag sequences are illegal, so it is possible to impose decoding constraints on the model rather than relying only on what is seen in the training data.
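As an illustration (the sentence is invented for exposition and is not drawn from the datasets used later), a two-token person name and a single-token location are encoded as follows; the BIO row is shown for contrast:

```python
tokens = ["Barack", "Obama", "visited", "Paris", "."]
iobes  = ["B-PER", "E-PER", "O", "S-LOC", "O"]   # spans open with B- and close with E-; S- marks single-token entities
bio    = ["B-PER", "I-PER", "O", "B-LOC", "O"]   # the same spans under the BIO scheme
```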
It is conventional wisdom in NER that models with a Linear Chain Conditional Random Field (CRF) Lafferty et al. (2001) layer perform better than those without Collobert et al. (2011); Ma and Hovy (2016); Lample et al. (2016), yielding consistent relative performance increases Ma and Hovy (2016); Lample et al. (2016). A CRF with Viterbi decoding promotes global coherence where simple greedy decoding does not; therefore, in a bidirectional LSTM (biLSTM) model with a CRF layer, illegal transitions are rare compared to models that simply select the best-scoring tag for each token. However, because the CRF forward algorithm runs in O(NT²), where N is the length of the sentence and T is the number of possible tags, it slows training significantly. Moreover, substantial effort is required to build an optimized, correct implementation of this layer. Training with a cross-entropy loss, in contrast, runs in O(NT) for sparse labels and can be computed in parallel across tokens. Instead of traditional CRF training, we propose Viterbi decoding at test time with heuristically determined transition scores that prohibit illegal transitions. We find that this simple modification allows taggers trained with cross-entropy loss to compete with those trained using a CRF loss while yielding much faster training times.
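The following is a minimal sketch of such test-time decoding (not the authors' released implementation; shapes and names are assumed for exposition): standard Viterbi over the per-token scores of a cross-entropy-trained tagger, with heuristic additive transition scores of 0 for legal moves and a large negative value for illegal ones in place of learned CRF transitions.

```python
import numpy as np

def constrained_viterbi(emissions, transitions):
    """emissions: [N, T] per-token tag scores from the tagger;
    transitions: [T, T] additive scores for moving from the row tag to the column tag
    (0 for legal IOBES transitions, a large negative value for illegal ones)."""
    N, T = emissions.shape
    score = emissions[0].copy()                      # best score ending in each tag so far
    backptr = np.zeros((N, T), dtype=np.int64)
    for i in range(1, N):
        # candidate[j, k]: best path ending in tag j at step i-1, then stepping to tag k
        candidate = score[:, None] + transitions + emissions[i][None, :]
        backptr[i] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)
    # follow back-pointers from the best final tag to recover the full sequence
    best = [int(score.argmax())]
    for i in range(N - 1, 0, -1):
        best.append(int(backptr[i, best[-1]]))
    return best[::-1]
```

A matching construction of these transition scores from the IOBES rules is sketched in the experiments section.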
Our approach is similar to previous work in NLP where constraints are introduced during inference Roth and Yih (2005); Punyakanok et al. (2005). There have been other attempts to eliminate the CRF layer in taggers, notably Shen et al. (2018), who found that an additional greedy LSTM decoder layer is competitive with the CRF layer; however, their baseline is much weaker than the models found in other work, and their source code was never released. Additionally, their decoder is auto-regressive, which makes it difficult to parallelize, and in practice it still adds significant overhead at training time. Chiu and Nichols (2016) mention good results with a similar decoding scheme but do not provide in-depth analysis or metrics, nor do they test its generality.
Many recent NLP publications have focused on better feature representations via contextual word embeddings Peters et al. (2018, 2017); Radford et al. (2018); Akbik et al. (2018); Devlin et al. (2018). These models vary in architecture and pre-training objective but they all encode the input based on the surrounding context in some way. For NER, these papers normally use biLSTM-CRF baselines where words are represented by single pre-trained word embeddings.
Contextual embeddings and transfer learning architectures are slow to train and evaluate, which may make them infeasible for many types of deployments. We find that concatenating multiple pre-trained word embeddings is much faster and yields consistent improvements over single embeddings, bringing performance much closer to contextual alternatives.
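A minimal sketch of this representation is shown below, assuming (for exposition only) that the pre-trained tables have already been re-indexed to a shared vocabulary; the module name and setup are illustrative, not the exact configuration used in our implementation.

```python
import torch
import torch.nn as nn

class ConcatEmbeddings(nn.Module):
    """Represent each token by the concatenation of several pre-trained tables."""
    def __init__(self, pretrained_matrices):
        # pretrained_matrices: list of [V, d_i] float tensors over one shared vocabulary
        super().__init__()
        self.tables = nn.ModuleList(
            nn.Embedding.from_pretrained(m, freeze=False) for m in pretrained_matrices
        )
        self.output_dim = sum(m.shape[1] for m in pretrained_matrices)

    def forward(self, token_ids):                    # token_ids: [batch, seq_len]
        # look up each table and concatenate along the feature dimension
        return torch.cat([table(token_ids) for table in self.tables], dim=-1)
```

The resulting vectors feed the biLSTM (or LSTM/convolutional classifier) exactly as a single embedding table would, only with a larger input dimension.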
2 Experiments & Results
We use three sequential prediction tasks to test the performance of our concatenated embeddings: NER (CoNLL 2003 Tjong Kim Sang and De Meulder (2003), WNUT-17 Derczynski et al. (2017), and Ontonotes Hovy et al. (2006)), slot filling (Snips Coucke et al. (2018)), and POS tagging (TW-POS). We also show results on three classification datasets: SST2 Socher et al. (2013), Snips intent classification Coucke et al. (2018), and AG-News (http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html). For each (task, dataset) pair we use the embedding most commonly used in the literature. For all tagging tasks, a biLSTM-CRF model is used. For all classification tasks, a single-layer LSTM model is used, except for the Snips classification dataset, where a convolutional word-based model is used. The hyperparameters are omitted here for brevity but can be found in our implementation.
The results are presented in Table 1. 6B, 27B, and 840B are the well-known GloVe embeddings Pennington et al. (2014); w2v-30M Pressel et al. (2018) and GN Mikolov et al. (2013) are Word2Vec embeddings trained on a corpus of 30 million tweets and on Google News, respectively; and the Senna embeddings were trained by Collobert et al. (2011). As hypothesized, we see improvements across tasks, datasets, and model architectures when multiple embeddings are concatenated (except on Ontonotes). Compared to a model that uses only a single pre-trained embedding, a model that uses the concatenation of pre-trained and randomly initialized embeddings performs worse on average, demonstrating that the performance gains come from combining different pre-trained embeddings rather than from the increase in the number of model parameters. In some cases we were able to improve results further by adding several sets of additional embeddings.
| Task | Dataset | Model | Embeddings | mean | std | min | max |
|---|---|---|---|---|---|---|---|
| NER | CoNLL | biLSTM-CRF | 6B | 91.12 | 0.21 | 90.62 | 91.37 |
| | | | Senna | 90.48 | 0.27 | 90.02 | 90.81 |
| | | | 6B, Senna | 91.47 | 0.25 | 91.15 | 92.00 |
| | WNUT-17 | biLSTM-CRF | 27B | 39.20 | 0.71 | 37.98 | 40.33 |
| | | | 27B, w2v-30M | 39.52 | 0.83 | 38.09 | 40.39 |
| | | | 27B, w2v-30M, 840B | 40.33 | 1.13 | 38.38 | 41.99 |
| | Ontonotes | biLSTM-CRF | 6B | 87.43 | 0.13 | 87.15 | 87.57 |
| | | | 6B, Senna | 87.41 | 0.17 | 87.14 | 87.74 |
| Slot Filling | Snips | biLSTM-CRF | 6B | 95.84 | 0.29 | 95.39 | 96.21 |
| | | | GN | 95.28 | 0.41 | 94.51 | 95.81 |
| | | | 6B, GN | 96.04 | 0.28 | 95.39 | 96.21 |
| POS | TW-POS | biLSTM-CRF | w2v-30M | 89.21 | 0.28 | 88.72 | 89.74 |
| | | | 27B | 89.63 | 0.19 | 89.35 | 89.92 |
| | | | 27B, w2v-30M | 90.35 | 0.20 | 89.99 | 90.60 |
| | | | 27B, w2v-30M, 840B | 90.75 | 0.14 | 90.53 | 91.02 |
| Classification | SST2 | LSTM | 840B | 88.39 | 0.45 | 87.42 | 89.07 |
| | | | GN | 87.58 | 0.54 | 86.16 | 88.19 |
| | | | 840B, GN | 88.57 | 0.44 | 87.59 | 89.24 |
| | AG-NEWS | LSTM | 840B | 92.53 | 0.45 | 87.42 | 89.07 |
| | | | GN | 92.20 | 0.18 | 91.80 | 92.40 |
| | | | 840B, GN | 92.60 | 0.20 | 92.30 | 92.86 |
| | Snips | Conv | 840B | 97.47 | 0.33 | 97.01 | 97.86 |
| | | | GN | 97.40 | 0.27 | 97.00 | 97.86 |
| | | | 840B, GN | 97.63 | 0.52 | 97.00 | 98.29 |

Table 1: Results (mean, std, min, max over runs) for single versus concatenated pre-trained embeddings across tasks, datasets, and model architectures.
| Dataset | Model | mean | std | max |
|---|---|---|---|---|
| CoNLL | CRF | 91.47 | 0.25 | 92.00 |
| | Constrained | 91.44 | 0.23 | 91.90 |
| WNUT-17 | CRF | 40.33 | 1.13 | 41.99 |
| | Constrained | 40.59 | 1.06 | 41.71 |
| Snips | CRF | 96.04 | 0.28 | 96.35 |
| | Constrained | 96.07 | 0.17 | 96.29 |
| Ontonotes | CRF | 87.43 | 0.26 | 87.57 |
| | Constrained | 86.13 | 0.17 | 86.72 |

Table 2: F1 comparison of models trained with a CRF layer (“CRF”) versus cross-entropy training with constrained Viterbi decoding (“Constrained”).
To test whether constrained decoding provides results comparable to a CRF layer, we implement a mask that eliminates invalid IOBES transitions by setting those transition scores to large negative values. The mask is generated from the rules of IOBES encoding: for example, an “I-” of one class cannot follow a “B-” of a different class, and entities must end with an “E-”, so a “B-” cannot transition directly to an “O”. This mask can be added to the CRF transition scores or used directly to facilitate Viterbi decoding when no CRF is used.
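A minimal sketch of this mask construction is given below; the tag-string convention (e.g. "B-PER", "O") and the specific negative constant are assumptions for exposition, and the resulting matrix can be passed as the transition scores to the decoding sketch shown earlier.

```python
import numpy as np

def iobes_legal(prev_tag, next_tag):
    """Return True if next_tag may follow prev_tag under the IOBES rules."""
    p_func, _, p_type = prev_tag.partition("-")      # e.g. "B-PER" -> ("B", "PER")
    n_func, _, n_type = next_tag.partition("-")
    if p_func in ("O", "E", "S"):
        # outside, or an entity just ended: only O, B-, or S- may follow
        return n_func in ("O", "B", "S")
    if p_func in ("B", "I"):
        # inside an entity: it must continue (I-) or end (E-) with the same type
        return n_func in ("I", "E") and n_type == p_type
    return False

def iobes_transition_mask(tags, neg_value=-1e4):
    """Build a [T, T] additive mask: 0 for legal transitions, neg_value otherwise."""
    mask = np.zeros((len(tags), len(tags)), dtype=np.float32)
    for i, prev in enumerate(tags):
        for j, nxt in enumerate(tags):
            if not iobes_legal(prev, nxt):
                mask[i, j] = neg_value
    return mask
```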
We investigate the effect of constrained decoding on three NER datasets and one slot filling dataset. The results are presented in Table 2. On three of the four datasets, constrained decoding performs comparably to or better than the CRF; again, Ontonotes is the only exception. We observe a 50% improvement in training time on average.
The models were trained using Baseline Pressel et al. (2018), an open-source framework for creating, training, evaluating and deploying models for NLP.
3 Analysis
| Embeddings | Overlap (train) | Overlap (dev) | Attested (train) | Attested (dev) | Performance (mean) | Performance (std) |
|---|---|---|---|---|---|---|
| Senna | 18.9 | 20.8 | 74.3 | 80.3 | 91.466 | 0.247 |
| GloVe twitter 27B | 24.9 | 27.2 | 68.1 | 76.1 | 91.098 | 0.135 |
| GloVe 840B | 41.7 | 40.6 | 83.2 | 88.5 | 91.011 | 0.228 |
| GloVe 42B | 45.5 | 45.3 | 90.4 | 93.8 | 91.163 | 0.146 |
| GoogleNews | 25.2 | 26.8 | 55.9 | 65.1 | 90.948 | 0.180 |
Table 3: Overlap is the embedding similarity, defined as the average Jaccard similarity of the 10 nearest neighbors for the top 200 words in CoNLL 2003; Attested is the proportion of words in the train/dev data found in each pre-trained vocabulary; Performance is the F1 score of each embedding when paired with the GloVe-6B-100d vectors.
We observe that concatenated embeddings trained on sufficiently different datasets perform well in our experiments. We hypothesized that each embedding set augments the meaning representations, making them more useful for the downstream tagging task. To test this theory, we looked at the similarity of various pre-trained embeddings. We define similarity as the Jaccard overlap percentage between the 10 nearest neighbors for each of the top 200 words in the dataset by frequency. Embeddings that complement the base embedding set should have low similarity; otherwise they would not add much extra information. More similar embeddings indeed saw less of a performance boost than dissimilar ones. However, similarity was not a perfect predictor of whether the model would improve with concatenated embeddings. We also investigated coverage, the proportion of words in the dataset found in the pre-trained vocabulary. While the Google News vectors have low overlap with the GloVe 6B embeddings, which should add complementary information to the word representations, their coverage is so low that they do not yield a significant performance gain. Table 3 shows how model performance changes with different embedding combinations on the CoNLL 2003 dataset. To find complementary embeddings, it appears that an embedding set must both be well attested in the data and have low similarity to the other embeddings in order to improve performance.
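The similarity measure described above can be sketched as follows; the data structures (a word list per vocabulary and a row-normalized embedding matrix) are assumptions for exposition rather than the exact implementation we used.

```python
import numpy as np

def nearest_neighbors(vectors, vocab, word, k=10):
    """vectors: [V, d] unit-normalized rows; vocab: list of V words."""
    sims = vectors @ vectors[vocab.index(word)]
    order = np.argsort(-sims)
    return {vocab[i] for i in order[1:k + 1]}        # drop the query word itself

def embedding_similarity(vec_a, vocab_a, vec_b, vocab_b, top_words, k=10):
    """Average Jaccard overlap of the k nearest neighbors over the top frequent words."""
    overlaps = []
    for word in top_words:                           # e.g. the 200 most frequent words
        if word not in vocab_a or word not in vocab_b:
            continue
        nn_a = nearest_neighbors(vec_a, vocab_a, word, k)
        nn_b = nearest_neighbors(vec_b, vocab_b, word, k)
        overlaps.append(len(nn_a & nn_b) / len(nn_a | nn_b))
    return 100.0 * float(np.mean(overlaps)) if overlaps else 0.0
```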
For constrained decoding, we leverage the IOBES tagging scheme rather than BIO tagging, allowing us to inject more structure into the decoding mask. Our tests with BIO tagging failed to show the large gains we realized with IOBES tagging. When run on the development set of CoNLL 2003, an unconstrained tagger made 125 illegal transitions, 114 of which resulted in an entity shift; 46 of these entity shifts caused errors.
It might seem that the constrained decoder does not reach performance parity with the CRF on Ontonotes because Ontonotes has many more entity types (18) than either CoNLL (4) or WNUT (6). However, the constrained decoder does offer superior performance on Snips, which also has a large number of slot types.
To help analyze when and how constrained decoding works, we define the concept of “strictly dominated tokens”. An ambiguous token is a token whose type takes multiple tag values throughout the dataset. A strictly dominated token is one that is normally ambiguous but can only take a single tag value because of the previous tag. Because constrained decoding eliminates illegal transitions, we would expect that on datasets where it performs well, a large proportion of tokens are strictly dominated. This tends to hold true: a far smaller share of Ontonotes’s ambiguous tokens are strictly dominated than of CoNLL’s or WNUT-17’s.
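A sketch of how this statistic can be measured is given below; the corpus format (a list of sentences of (token, tag) pairs) and the use of “O” as the context for sentence-initial tokens are assumptions, and iobes_legal refers to the helper from the earlier mask sketch.

```python
from collections import defaultdict

def strictly_dominated_ratio(sentences):
    """sentences: list of sentences, each a list of (token, tag) pairs with IOBES tags."""
    tag_values = defaultdict(set)                    # tags each token type takes in the data
    for sent in sentences:
        for token, tag in sent:
            tag_values[token].add(tag)

    ambiguous = dominated = 0
    for sent in sentences:
        prev_tag = "O"                               # assumed context for the first token
        for token, tag in sent:
            if len(tag_values[token]) > 1:           # the token type is ambiguous
                ambiguous += 1
                # tags this token could still take given the previous gold tag
                legal = {t for t in tag_values[token] if iobes_legal(prev_tag, t)}
                if len(legal) == 1:                  # strictly dominated by the context
                    dominated += 1
            prev_tag = tag
    return dominated / ambiguous if ambiguous else 0.0
```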
Despite the constrained decoder’s performance on Snips, its strictly dominated token ratio is comparatively low; we believe the ambiguity of the first token in an entity plays a role. Since we are no longer benefiting from strictly dominated tokens, we suspected that the B- and S- tokens must be fairly unambiguous. In the Snips dataset a large share of these tokens are unambiguous: they are only ever tagged as that entity token. CoNLL and WNUT have similarly high shares, while Ontonotes’s is much lower.
While no single metric completely captures the effectiveness of constrained decoding, the metrics chosen are good proxies for estimating it, and the trends are consistent across multiple datasets.
4 Conclusion
Recent large-scale contextual pre-training and transfer learning efforts are exciting but produce relatively slow models. For tagging tasks, a CRF layer introduces substantial computational cost as well. We propose two lightweight techniques: concatenation of pre-trained embeddings and constrained decoding. We show that each of these techniques individually has a significant impact on error reduction and that, used together, they improve speed significantly with very little cost in performance.
Our analysis suggests that each concatenated embedding should individually have good coverage of the training set and should exhibit representational diversity from the rest of the embeddings. For constrained decoding, our performance either exceeds or is on par with that of a CRF while yielding a 50% wall-clock improvement in training time. We show that the constrained decoder can be used on common datasets where many tokens are conditionally unambiguous given the rules of IOBES encoding.
In future work, we intend to try other methods of combining embeddings. The constrained decoder can be extended to use transition probabilities estimated from the training set in addition to masking illegal moves; in theory this should boost performance without adversely impacting speed. Automated tools could also be developed to provide a more principled approach for deciding which embeddings should be combined, or whether a CRF should be replaced by a constrained decoder, for a particular (task, dataset) pair.
References
- AG corpus of news articles. http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html.
- Akbik et al. (2018). Contextual string embeddings for sequence labeling. In COLING.
- Chiu and Nichols (2016). Named entity recognition with bidirectional LSTM-CNNs. TACL 4, pp. 357–370.
- Collobert et al. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research 12, pp. 2493–2537.
- Coucke et al. (2018). Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. arXiv preprint arXiv:1805.10190.
- Derczynski et al. (2017). Results of the WNUT2017 shared task on novel and emerging entity recognition. In Proceedings of the 3rd Workshop on Noisy, User-generated Text.
- Devlin et al. (2018). BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Hovy et al. (2006). OntoNotes: the 90% solution. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, pp. 57–60.
- Lafferty et al. (2001). Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML), pp. 282–289.
- Lample et al. (2016). Neural architectures for named entity recognition. In NAACL HLT 2016, pp. 260–270.
- Ma and Hovy (2016). End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1064–1074.
- Mikolov et al. (2013). Efficient estimation of word representations in vector space.
- Pennington et al. (2014). GloVe: global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
- Peters et al. (2017). Semi-supervised sequence tagging with bidirectional language models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1756–1765.
- Peters et al. (2018). Deep contextualized word representations. In Proceedings of NAACL.
- Pressel et al. (2018). Baseline: a library for rapid modeling, experimentation and development of deep learning algorithms targeting NLP. In Proceedings of the Workshop for NLP Open Source Software (NLP-OSS), pp. 34–40.
- Punyakanok et al. (2005). Learning and inference over constrained output. In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI), pp. 1124–1129.
- Radford et al. (2018). Improving language understanding by generative pre-training.
- Ratinov and Roth (2009). Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL), pp. 147–155.
- Roth and Yih (2005). Integer linear programming inference for conditional random fields. In Proceedings of the 22nd International Conference on Machine Learning (ICML), pp. 736–743.
- Shen et al. (2018). Deep active learning for named entity recognition. In International Conference on Learning Representations.
- Socher et al. (2013). Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642.
- Tjong Kim Sang and De Meulder (2003). Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pp. 142–147.