Neural Named Entity Recognition from Subword Units

08/22/2018 · Abdalghani Abujabal et al., Max Planck Society

Named entity recognition (NER) is a vital task in language technology. Existing neural models for NER rely mostly on dedicated word-level representations, which suffer from two main shortcomings: 1) the vocabulary size is large, yielding large memory requirements and training time, and 2) they cannot learn morphological representations. We adopt a neural solution based on bidirectional LSTMs and conditional random fields, where we rely on subword units, namely characters, phonemes, and bytes, to remedy the above shortcomings. We conducted experiments on a large dataset covering four languages with up to 5.5M utterances per language. Our experiments show that 1) with increasing training data, performance of models trained solely on subword units becomes closer to that of models with dedicated word-level embeddings (91.35 vs 93.92 F1 for English), while using a much smaller vocabulary size (332 vs 74K), 2) subword units enhance models with dedicated word-level embeddings, and 3) combining different subword units improves performance.


1 Introduction

Named Entity Recognition (NER) is an important task in language technology applications, such as smart assistants like the Amazon Echo or Google Home. For example, if a user requests an assistant to “play we are the champions by queen”, a named entity recogniser can be applied to the transcribed utterance to determine that ‘we are the champions’ refers to a song and ‘queen’ to an artist. As new utterances are collected over time, which are annotated with named entities, regular retraining with increasing data amounts is needed.

Recently, several neural models for NER have been proposed (e.g., Chiu and Nichols (2016); Gillick et al. (2016); Lample et al. (2016); Ma and Hovy (2016)), demonstrating promising performance on small, artificially generated datasets (Sang, 2002; Sang and Meulder, 2003). However, these models mostly rely on dedicated word-level representations, which suffer from three shortcomings:

  • The vocabulary size is large, yielding a large number of parameters, and hence, large memory requirements and training time, which is particularly problematic if large amounts of data are available.

  • The models cannot learn subword representations, which can potentially improve performance by taking advantage of morphology.

  • Out-of-vocabulary words cannot be handled.

We adopt a neural solution relying on subword units, namely characters, phonemes and bytes. For each word in an utterance, we learn representations from each of the three subword units. The character-level unit looks at the characters of each word, while the phoneme-level unit treats a word as a sequence of phonemes, using lexica that map a given word to its corresponding phoneme sequence. The byte-level unit reads a word as a sequence of bytes, using the variable-length UTF-8 encoding.

A major advantage of subword-based models is their small vocabulary size, which reduces memory requirements and training time. This is particularly relevant for large-scale applications and for systems that operate under memory constraints, e.g., handheld or voice-controlled devices. In addition, subword units improve the modelling of out-of-vocabulary words and support learning of morphological information. Character-level networks in particular have already proven to boost performance on many sequence tagging tasks, especially for morphologically rich languages (Chiu and Nichols, 2016; Klein et al., 2003; Lample et al., 2016). In this paper, we combine different types of subword units for NER.

We present experiments on a large-scale dataset covering four languages, namely English, German, French, and Spanish, with up to 5.5M utterances per language. Our experiments show that:

  • With increasing training data size, the performance of models trained solely on subword units approaches that of models with dedicated word-level embeddings (91.35 vs 93.92 F1 for English), while using a much smaller vocabulary (332 vs 74K).

  • Subword units enhance models with dedicated word-level embeddings, in particular, for languages with smaller training data sets.

  • Combining the three subword units (byte-, phoneme- and character-level) yields better results than using only one or two of them.

2 Model

We follow recent work on neural named entity recognition and base our solution on bidirectional LSTMs and conditional random fields (CRF) (Chiu and Nichols, 2016; Huang et al., 2015; Lample et al., 2016; dos Santos and Guimarães, 2015). For each word in an utterance, our model learns a low-dimensional representation from each subword unit (character-, phoneme- and byte-level). The outputs of the subword units are then concatenated and fed into a bidirectional LSTM-CRF model (Huang et al., 2015; Lample et al., 2016). Our model is depicted in Figure 1. Bidirectional LSTMs capture long-term dependencies among input tokens (Graves and Schmidhuber, 2005). In this work we use the LSTM implementation adopted by Lample et al. (2016):

i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)
c_t = (1 - i_t) \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)
o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)
h_t = o_t \odot \tanh(c_t)

where the W's are shared weight matrices, the b's are the biases, σ is the element-wise sigmoid function, x_t represents the token at position t, h_t is the hidden state at time t, and ⊙ is the element-wise product. For sequence tagging problems, a softmax layer can be placed on top of the output of the bidirectional LSTM network to compute a probability distribution over output tags for each token. However, this assumes independence among output tags, which does not hold for NER. As a remedy, a CRF layer is incorporated for decoding. For more details, we refer the reader to Lample et al. (2016).

Figure 1: Our model with a word-level bidirectional LSTM layer and CRF layer for decoding. For each word in an utterance, our model learns embeddings from the three subword units. Dedicated word embeddings are optional.
Figure 2: The outputs of the three subword units are concatenated to learn embeddings for the whole word (‘dark’). Optionally, we can concatenate dedicated word embeddings. Learned embeddings are then fed into a word-level bidirectional LSTM.

Subword units. We rely on subword units to learn embeddings that represent the full word. As shown in Figure 2, each subword unit is a bidirectional LSTM network whose last forward and backward hidden states are concatenated, yielding one embedding from each of the character-, phoneme- and byte-level networks. These three vectors are, in turn, concatenated to form the final embedding of a word. Optionally, we can also concatenate dedicated word embeddings that are either randomly initialized or pre-trained. Subword units enable the model to mitigate the out-of-vocabulary problem and to use a much smaller vocabulary than models that rely on word-level representations. This reduces the memory requirements of the model, which is crucial for handheld and voice-controlled devices.
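The sketch below is a minimal PyTorch rendering of this architecture, not the authors' implementation: each subword view is encoded by its own bidirectional LSTM, the per-unit embeddings are concatenated per word, and a word-level bidirectional LSTM produces tag scores that a CRF layer (omitted here) would decode. The vocabulary sizes, number of tags, and byte vocabulary size are placeholders.

```python
import torch
import torch.nn as nn

class SubwordEncoder(nn.Module):
    """BiLSTM over one subword view (characters, phonemes, or bytes) of a word."""
    def __init__(self, vocab_size, emb_dim=35, hidden_dim=35):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, symbol_ids):                    # (num_words, max_symbols)
        _, (h_n, _) = self.lstm(self.emb(symbol_ids))
        # h_n: (2, num_words, hidden_dim); concatenate last forward/backward states
        return torch.cat([h_n[0], h_n[1]], dim=-1)    # (num_words, 2 * hidden_dim)

class SubwordTagger(nn.Module):
    """Word-level BiLSTM over concatenated subword embeddings; emits per-tag scores
    that a CRF layer (e.g. from an external package) would decode."""
    def __init__(self, char_vocab, phoneme_vocab, byte_vocab=260, num_tags=10,
                 word_hidden=128):
        super().__init__()
        self.char_enc = SubwordEncoder(char_vocab)
        self.phoneme_enc = SubwordEncoder(phoneme_vocab)
        self.byte_enc = SubwordEncoder(byte_vocab)    # 256 byte values + specials (assumed)
        self.word_lstm = nn.LSTM(3 * 2 * 35, word_hidden, bidirectional=True,
                                 batch_first=True)
        self.proj = nn.Linear(2 * word_hidden, num_tags)

    def forward(self, chars, phonemes, bytes_):       # each: (num_words, max_len)
        word_embs = torch.cat([self.char_enc(chars),
                               self.phoneme_enc(phonemes),
                               self.byte_enc(bytes_)], dim=-1)
        out, _ = self.word_lstm(word_embs.unsqueeze(0))  # one utterance as a batch of 1
        return self.proj(out).squeeze(0)              # (num_words, num_tags)
```

With dedicated word embeddings enabled, one would additionally concatenate a word-embedding lookup to `word_embs` before the word-level LSTM, as described above.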

For the phoneme-level unit, we use lexica that map a given word to its phoneme sequence. Phonemes are represented using the X-SAMPA phoneme set with some additional symbols. For example, the word ‘dark’ is mapped to its X-SAMPA phoneme sequence, with dedicated symbols marking the first and the last phoneme of the word. In addition, we add a special symbol to which we map words that are not covered by our lexica. Including the additional symbols, the phoneme vocabulary remains small. In general, for the setting explored in our paper, i.e., voice-controlled devices, phoneme lexica with good coverage are already developed for the agent’s text-to-speech (TTS) and automatic speech recognition (ASR) components and can be re-used. In case lexica with good coverage are not available, grapheme-to-phoneme conversion tools can be used.

For the byte-level unit, we use the variable-length UTF-8 encoding to keep the vocabulary small. For example, ‘Schön’ is represented as {0x53 0x63 0x68 0xc3 0xb6 0x6e}. Note that the character ö corresponds to two bytes, {0xc3 0xb6}, which distinguishes this unit from the character-level one.
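As an illustration of the three subword views of a single word, the snippet below (not from the paper) decomposes a word into characters, phonemes via a toy stand-in for the X-SAMPA lexica, and UTF-8 bytes; the lexicon entry and the `<unk>` fallback symbol are hypothetical.

```python
def subword_views(word, phoneme_lexicon):
    """Return the character, phoneme, and byte views of a word."""
    chars = list(word)                                # character-level unit
    phonemes = phoneme_lexicon.get(word, ["<unk>"])   # phoneme-level unit via lexicon lookup
    utf8_bytes = list(word.encode("utf-8"))           # byte-level unit (variable-length UTF-8)
    return chars, phonemes, utf8_bytes

toy_lexicon = {"dark": ["d", "A", "r", "k"]}          # hypothetical X-SAMPA-style entry

print(subword_views("dark", toy_lexicon))
# (['d', 'a', 'r', 'k'], ['d', 'A', 'r', 'k'], [100, 97, 114, 107])
print(subword_views("Schön", toy_lexicon)[2])
# [83, 99, 104, 195, 182, 110] -- 'ö' expands to two bytes (0xc3, 0xb6)
```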

3 Experiments

In the following, we first describe our experimental setup and subsequently present our results. In our experiments, we explored 1) performance of individual subword units and different combinations of subword units, both with and without using word-level embeddings, 2) performance of subword only models versus models with dedicated word level embeddings versus models combining both, and 3) whether subword units help in the presence of out-of-vocabulary words.

EN DE FR ES
Size of train set M M K K
Size of dev set M M K K
Size of test set M M K K
Table 1: Number of utterances per language.

3.1 Experimental Setup

Datasets. We use a large dataset covering four languages, namely American English (EN), German (DE), French (FR) and Spanish (ES). The data is representative of user requests to voice-controlled devices, which were manually transcribed and annotated with named entities. Overall, the data covers several domains, comprising different intents and types of named entities. Table 1 shows data statistics for each language. While a large number of utterances is available for the languages collected with deployed systems (EN and DE), rather few are available for the other two languages.

Metric. To evaluate our models, we use the CoNLL script (Sang, 2002) to compute precision, recall and F1 scores on a per-token basis. We report the average F1 score.
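For reference, a minimal sketch of per-token precision, recall and F1 for a single tag is shown below; the actual evaluation uses the CoNLL script cited above, and the tag names are made up.

```python
def token_prf(gold, pred, tag):
    """Per-token precision, recall and F1 for one tag (illustrative only)."""
    tp = sum(g == tag and p == tag for g, p in zip(gold, pred))
    fp = sum(g != tag and p == tag for g, p in zip(gold, pred))
    fn = sum(g == tag and p != tag for g, p in zip(gold, pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

gold = ["O", "B-song", "I-song", "I-song", "O", "B-artist"]
pred = ["O", "B-song", "I-song", "O",      "O", "B-artist"]
print(token_prf(gold, pred, "I-song"))   # (1.0, 0.5, 0.666...)
```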

Language Subwords Word-level
EN K
DE K
FR K
ES K
Table 2: Vocabulary size of subword-based versus word-based models.

Training. We used a mini-batch Adam optimizer (Kingma and Ba, 2014). We also tried other optimizers with different learning rates (e.g., stochastic gradient descent), but they performed worse than Adam. The batch size was set to 1024, 256, 4 and 4 utterances for EN, DE, FR and ES, respectively. The embedding dimension of the subword units is set to 35, while that of the word-level network is set to 64 (when dedicated word-level representations are used). Both the subword and word-level networks have a single layer for the forward and the backward LSTMs, with hidden dimensions of 35 and 128, respectively. We tried several other values, but performance was inferior to that reported with the above settings. Training is terminated after a fixed number of epochs, and the model with the best F1 score on the development set is used to make predictions.

We used dropout training (Hinton et al., 2012), applying a dropout mask to the final embedding layer just before the input to the word-level bidirectional LSTM.
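The hyperparameters stated above are collected in the sketch below for convenience; the learning rate and dropout rate are not given in this excerpt, so library defaults stand in for them, and the placeholder module is only there to make the optimizer call runnable.

```python
import torch

config = {
    "optimizer": "Adam (Kingma and Ba, 2014)",
    "batch_size": {"EN": 1024, "DE": 256, "FR": 4, "ES": 4},  # utterances per batch
    "subword_embedding_dim": 35,
    "word_embedding_dim": 64,          # only when dedicated word embeddings are used
    "subword_lstm_hidden_dim": 35,
    "word_lstm_hidden_dim": 128,
    "model_selection": "best F1 on the development set",
}

model = torch.nn.LSTM(35, 128)                    # placeholder module
optimizer = torch.optim.Adam(model.parameters())  # default lr stands in for the unstated value
dropout = torch.nn.Dropout()                      # default rate stands in for the unstated value
```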

Table 2 shows the vocabulary sizes of the different languages for both subword-based and word-level models, highlighting the large differences between subword-based models and models with dedicated word embeddings. In terms of model complexity, subword-based models therefore have far fewer parameters to fine-tune during training.
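As a rough, back-of-the-envelope illustration of that difference, the snippet below counts only the embedding-table parameters, using the EN vocabulary sizes quoted in the abstract and the embedding dimensions from the training setup; the full models of course also carry LSTM and CRF parameters.

```python
subword_embedding_params = 332 * 35        # EN subword vocabulary x subword embedding dim
word_embedding_params = 74_000 * 64        # EN word vocabulary x word embedding dim
print(subword_embedding_params)            # 11,620
print(word_embedding_params)               # 4,736,000
print(word_embedding_params - subword_embedding_params)  # roughly 4.7M fewer parameters
```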

Char Phoneme Byte Char + Phoneme Char + Byte Phoneme + Byte All
EN
DE
FR
ES
Table 3: F1 scores of the subword-only models with different units being used. The model with the three subword units combined achieved best performance across languages, except for FR.
Char Phoneme Byte Char + Phoneme Char + Byte Phoneme + Byte All
EN
DE
FR
ES
Table 4: F1 scores of the combined models with different subword units being used.

3.2 Results

Subword-only models. Table 3 shows the performance of models that rely solely on subword units. When used individually, different subword units yield the best results for different languages. For example, for English the best individual subword unit is the phoneme unit (outperforming the character unit in F1), while the character-level unit achieved the best results for French. Note that the phoneme lexicon for French had much lower coverage than that for English, which explains the low F1 score.

When several subword units are combined, results improve for all languages and, except for French, the best results are achieved when using all subword units. For French, the best combination is character and byte units, i.e., without phonemes, which we attribute to the low coverage of the French phoneme lexicon. To explore whether these improvements are indeed due to using several subword units rather than to the increased dimensionality of the hidden embedding representation, we trained models for the different languages using a single subword unit but with higher embedding and LSTM hidden dimensions. The performance was inferior to that reported in Table 3 (last column), indicating that there are indeed additive gains from combining different subword units.

Combined models. For combining word-level embeddings and subword units, Table 4 shows that there are additive gains from using several subword units compared to only one. Depending on the language, phonemes yield the best results in combination with either characters or bytes, indicating that phonemes are useful for NER. For three out of four languages, the combination of characters, phonemes and word-level embeddings achieved the best results.

Lang Subwords Word-level Combined
EN () ()
DE () ()
FR () ()
ES () ()
Table 5: Comparison of subword-only models versus word-level models and models combining word-level and subword units. For the combined and subword models, the best combination is given.

Comparison. Table 5 compares the performance of the previously discussed models to that of models using only word-level representations. With increasing training data size, the performance of models trained solely on subword units becomes closer to that of models with dedicated word-level embeddings (91.35 vs 93.92 F1 for EN), with a much smaller vocabulary (332 vs 74K). The gap in performance widens as the size of the training data decreases, as for ES. That is, with sufficient training data, subword-based models achieve rather similar results to word-level ones. Models that use both word-level embeddings and subword units achieve the best results (Table 5, last column), showing that subword units can enhance word-level models; as training data decreases, the positive effect of subword units increases (with a larger gain for ES than for EN). We also trained different models (subword-only, word-level and combined) on different splits of the DE training data, and the observed trends confirm the observations above.

Out-of-vocabulary words. A subset of the utterances in the ES test set contains at least one OOV word. F1 values on these utterances follow the trends observed in Table 5 for the subword-only, word-level and combined models. We also computed F1 scores on the OOV words themselves, where, interestingly, the subword-based model outperformed the corresponding word-level model, while the combined model achieved the best F1, indicating that subword units are useful in the presence of out-of-vocabulary words.

4 Related Work

NER is a widely studied problem, where methods have traditionally been characterized by the use of CRFs with heavy feature engineering, gazetteers and external knowledge resources (Finkel et al., 2005; Florian et al., 2003; Kazama and Torisawa, 2007; Klein et al., 2003; Lin and Wu, 2009; Radford et al., 2015; Ratinov and Roth, 2009; Zhang and Johnson, 2003). Ratinov and Roth (2009) use non-local features and gazetteers extracted from Wikipedia, while Kazama and Torisawa (2007) harness type information of candidate entities. In our work, we opt for a neural solution without hand-crafted features or external resources.

Recently, the focus has shifted towards neural architectures for NER (Bharadwaj et al., 2016; Chiu and Nichols, 2016; Collobert et al., 2011; Gillick et al., 2016; Huang et al., 2015; Lample et al., 2016; Ma and Hovy, 2016; dos Santos and Guimarães, 2015; Yadav et al., 2018; Yang et al., 2016). Huang et al. (2015) use a word-level bidirectional LSTM-CRF for several sequence tagging problems including POS tagging and NER; they rely on heavy feature engineering to extract character-level features. Lample et al. (2016) extend this model with a character-level bidirectional LSTM unit, where a word is represented by concatenating word-level embeddings and embeddings learned from its characters. Bharadwaj et al. (2016) represent words as sequences of phonemes, which serve as a universal representation across languages to facilitate cross-lingual transfer learning. Chiu and Nichols (2016) use a convolutional neural network to learn character-level embeddings and LSTM units on the word level. dos Santos and Guimarães (2015) propose the CharWNN network, a model similar to that of Chiu and Nichols (2016). Gillick et al. (2016) employ a sequence-to-sequence model with a novel tagging scheme; the model relies on bytes, allowing joint training on different languages for NER and eliminating the need for tokenization. Finally, Yang et al. (2016) adopt a model similar to that of Lample et al. (2016), but replace LSTMs with GRU units. Furthermore, they study multi-lingual and multi-task joint training, which we plan to address in future work.

Overall, existing methods mostly utilize dedicated word embeddings rather than subword units. While some work has addressed characters or bytes, using phonemes or a combination of different subword units has not been explored, which we address in this work.

5 Conclusion

We presented a neural model for NER based on three subword units: characters, phonemes and bytes. Our experiments show that 1) with increasing training data, the performance of models trained solely on subword units becomes closer to that of models with dedicated word embeddings while using a much smaller vocabulary, 2) subword units enhance models with dedicated word embeddings, and 3) combining subword units improves performance.

References

  • Bharadwaj et al. (2016) Akash Bharadwaj, David R. Mortensen, Chris Dyer, and Jaime G. Carbonell. 2016. Phonologically aware neural model for named entity recognition in low resource transfer settings. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 1462–1472.
  • Chiu and Nichols (2016) Jason P. C. Chiu and Eric Nichols. 2016. Named entity recognition with bidirectional lstm-cnns. TACL, 4:357–370.
  • Collobert et al. (2011) Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel P. Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537.
  • Finkel et al. (2005) Jenny Rose Finkel, Trond Grenager, and Christopher D. Manning. 2005. Incorporating non-local information into information extraction systems by gibbs sampling. In ACL 2005, 43rd Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, 25-30 June 2005, University of Michigan, USA, pages 363–370.
  • Florian et al. (2003) Radu Florian, Abraham Ittycheriah, Hongyan Jing, and Tong Zhang. 2003. Named entity recognition through classifier combination. In Proceedings of the Seventh Conference on Natural Language Learning, CoNLL 2003, Held in cooperation with HLT-NAACL 2003, Edmonton, Canada, May 31 - June 1, 2003, pages 168–171.
  • Gillick et al. (2016) Dan Gillick, Cliff Brunk, Oriol Vinyals, and Amarnag Subramanya. 2016. Multilingual language processing from bytes. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17, 2016, pages 1296–1306.
  • Graves and Schmidhuber (2005) Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6):602–610.
  • Hinton et al. (2012) Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580.
  • Huang et al. (2015) Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. CoRR, abs/1508.01991.
  • Kazama and Torisawa (2007) Jun’ichi Kazama and Kentaro Torisawa. 2007. Exploiting wikipedia as external knowledge for named entity recognition. In EMNLP-CoNLL 2007, Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, June 28-30, 2007, Prague, Czech Republic, pages 698–707.
  • Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
  • Klein et al. (2003) Dan Klein, Joseph Smarr, Huy Nguyen, and Christopher D. Manning. 2003. Named entity recognition with character-level models. In Proceedings of the Seventh Conference on Natural Language Learning, CoNLL 2003, Held in cooperation with HLT-NAACL 2003, Edmonton, Canada, May 31 - June 1, 2003, pages 180–183.
  • Lample et al. (2016) Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17, 2016, pages 260–270.
  • Lin and Wu (2009) Dekang Lin and Xiaoyun Wu. 2009. Phrase clustering for discriminative learning. In ACL 2009, Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 2-7 August 2009, Singapore, pages 1030–1038.
  • Ma and Hovy (2016) Xuezhe Ma and Eduard H. Hovy. 2016. End-to-end sequence labeling via bi-directional lstm-cnns-crf. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers.
  • Radford et al. (2015) Will Radford, Xavier Carreras, and James Henderson. 2015. Named entity recognition with document-specific KB tag gazetteers. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 512–517.
  • Ratinov and Roth (2009) Lev-Arie Ratinov and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning, CoNLL 2009, Boulder, Colorado, USA, June 4-5, 2009, pages 147–155.
  • Sang (2002) Erik F. Tjong Kim Sang. 2002. Introduction to the conll-2002 shared task: Language-independent named entity recognition. In Proceedings of the 6th Conference on Natural Language Learning, CoNLL 2002, Held in cooperation with COLING 2002, Taipei, Taiwan, 2002.
  • Sang and Meulder (2003) Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the conll-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning, CoNLL 2003, Held in cooperation with HLT-NAACL 2003, Edmonton, Canada, May 31 - June 1, 2003, pages 142–147.
  • dos Santos and Guimarães (2015) Cícero Nogueira dos Santos and Victor Guimarães. 2015. Boosting named entity recognition with neural character embeddings. In Proceedings of the Fifth Named Entity Workshop, NEWS@ACL 2015, Beijing, China, July 31, 2015, pages 25–33.
  • Yadav et al. (2018) Vikas Yadav, Rebecca Sharp, and Steven Bethard. 2018. Deep affix features improve neural named entity recognizers. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, *SEM@NAACL-HLT, New Orleans, Louisiana, USA, June 5-6, 2018, pages 167–172.
  • Yang et al. (2016) Zhilin Yang, Ruslan Salakhutdinov, and William W. Cohen. 2016. Multi-task cross-lingual sequence tagging from scratch. CoRR, abs/1603.06270.
  • Zhang and Johnson (2003) Tong Zhang and David Johnson. 2003. A robust risk minimization based named entity recognition system. In Proceedings of the Seventh Conference on Natural Language Learning, CoNLL 2003, Held in cooperation with HLT-NAACL 2003, Edmonton, Canada, May 31 - June 1, 2003, pages 204–207.