Instance-Based Learning of Span Representations: A Case Study through Named Entity Recognition

April 29, 2020, by Hiroki Ouchi et al.

Interpretable rationales for model predictions play a critical role in practical applications. In this study, we develop models with an interpretable inference process for structured prediction. Specifically, we present a method of instance-based learning that learns similarities between spans. At inference time, each span is assigned a class label based on its similar spans in the training set, so it is easy to understand how much each training instance contributes to the predictions. Through empirical analysis on named entity recognition, we demonstrate that our method enables us to build models that have high interpretability without sacrificing performance.


1 Introduction

Neural networks have contributed to performance improvements in structured prediction. However, the rationales underlying the model predictions are difficult for humans to understand Lei et al. (2016). In practical applications, interpretable rationales play a critical role in driving human decisions and promoting human-machine cooperation Ribeiro et al. (2016). With this motivation, we aim to build models that have high interpretability without sacrificing performance. As an approach to this challenge, we focus on instance-based learning.

Instance-based learning Aha et al. (1991) is a machine learning method that learns similarities between instances. At inference time, the class labels of the most similar training instances are assigned to new instances. This transparent inference process answers the question: which points in the training set most closely resemble a test point or influenced the prediction? This is categorized as example-based explanation Plumb et al. (2018); Baehrens et al. (2010). Despite this preferable property, instance-based learning has recently received little attention and remains underexplored.

This study presents and investigates an instance-based learning method for span representations. A span is a unit that consists of one or more linguistically linked words. Why do we focus on spans instead of tokens? One reason concerns performance. Recent neural networks can induce good span feature representations and achieve high performance in structured prediction tasks, such as named entity recognition (NER) Sohrab and Miwa (2018); Xia et al. (2019), constituency parsing Stern et al. (2017); Kitaev et al. (2019), semantic role labeling (SRL) He et al. (2018); Ouchi et al. (2018), and coreference resolution Lee et al. (2017). Another reason concerns interpretability. The tasks above require recognizing linguistic structure that consists of spans. Thus, directly classifying each span based on its representation is more interpretable than token-wise classification such as BIO tagging, which reconstructs each span label from the predicted token-wise BIO tags.

Our method builds a feature space where spans with the same class label are close to each other. At inference time, each span is assigned a class label based on its neighbor spans in the feature space. We can easily understand why the model assigned a label to a span by looking at its neighbors. Through quantitative and qualitative analysis on NER, we demonstrate that our instance-based method enables us to build models with high interpretability and performance. To sum up, our main contributions are as follows.

  • This is the first work to investigate instance-based learning of span representations. (Our code is publicly available at https://github.com/hiroki13/instance-based-ner.git.)

  • Through empirical analysis on NER, we demonstrate that our instance-based method enables us to build models that have high interpretability without sacrificing performance.

2 Related Work

Neural models generally share a common technical challenge: the black-box property. The rationales underlying the model predictions are opaque to humans. Many recent studies have tried to look inside such black-box neural models Ribeiro et al. (2016); Lundberg and Lee (2017); Koh and Liang (2017). In this paper, instead of looking into the black box, we build interpretable models based on instance-based learning.

Before the current neural era, instance-based learning, sometimes called memory-based learning Daelemans and Van den Bosch (2005), was widely used for various NLP tasks, such as part-of-speech tagging Daelemans et al. (1996), dependency parsing Nivre et al. (2004), and machine translation Nagao (1984). For NER, some instance-based models have been proposed Tjong Kim Sang (2002); De Meulder and Daelemans (2003); Hendrickx and van den Bosch (2003). Despite its high interpretability, this direction has recently received little attention.

One exception is Wiseman and Stratos (2019), which used instance-based learning of token representations. Due to BIO tagging, it faces a technical challenge: inconsistent label prediction. For example, an entity candidate “World Health Organization” can be assigned inconsistent labels such as “B-LOC I-ORG I-ORG,” whereas the ground-truth labels are “B-ORG I-ORG I-ORG.” To remedy this issue, they presented a heuristic technique for encouraging contiguous token alignment. In contrast to such token-wise prediction, we adopt span-wise prediction, which naturally avoids this issue because each span is assigned one label.

NER is generally solved as (i) sequence labeling or (ii) span classification. (Very recently, a hybrid model of these two approaches has been proposed by Liu et al. (2019).) In the first approach, token features are induced by using neural networks and fed into a classifier, such as conditional random fields Lample et al. (2016); Ma and Hovy (2016); Chiu and Nichols (2016). One drawback of this approach is the difficulty of dealing with nested entities, although some studies have proposed sophisticated sequence labeling models for nested NER Ju et al. (2018); Zheng et al. (2019). By contrast, the span classification approach, adopted in this study, can straightforwardly solve nested NER Finkel and Manning (2009); Sohrab and Miwa (2018); Xia et al. (2019). There is also an approach specialized for nested NER using hypergraphs Lu and Roth (2015); Muis and Lu (2017); Katiyar and Cardie (2018); Wang and Lu (2018).

3 Instance-Based Span Classification

3.1 NER as span classification

NER can be solved as multi-class classification, where each possible span in a sentence is assigned a class label. As mentioned in Section 2, this approach naturally avoids inconsistent label prediction and straightforwardly deals with nested entities. Because of these advantages over token-wise classification, span classification has been gaining considerable attention Sohrab and Miwa (2018); Xia et al. (2019).

Formally, given an input sentence of T words, w_{1:T} = (w_1, w_2, …, w_T), we first enumerate the possible spans S(w_{1:T}) and then assign a class label y ∈ Y to each span s ∈ S(w_{1:T}). We write each span as s = (i, j), where i and j are word indices in the sentence with 1 ≤ i ≤ j ≤ T. Consider the following sentence.

Franz    Kafka    is    a    novelist

[       PER       ]

Here, the possible spans in this sentence are S(w_{1:5}) = {(1, 1), (1, 2), …, (5, 5)}. “Franz Kafka,” s = (1, 2), is assigned the person type entity label (y = PER). Note that the other, non-entity spans are assigned the null label (y = NULL). For example, “a novelist,” s = (4, 5), is assigned NULL. In this way, the NULL label is assigned to non-entity spans, which plays the same role as the O tag in the BIO tag set.
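To make this setup concrete, here is a minimal Python sketch (not from the released code; all names are illustrative) that enumerates the candidate spans of the example sentence and looks up their labels from a gold span-to-label mapping, assigning NULL to non-entity spans.

def enumerate_spans(num_tokens, max_span_size=None):
    """Return all (i, j) word-index pairs with 1 <= i <= j <= num_tokens,
    optionally restricted to spans of at most max_span_size words."""
    spans = []
    for i in range(1, num_tokens + 1):
        for j in range(i, num_tokens + 1):
            if max_span_size is None or j - i + 1 <= max_span_size:
                spans.append((i, j))
    return spans

sentence = ["Franz", "Kafka", "is", "a", "novelist"]
gold_labels = {(1, 2): "PER"}  # "Franz Kafka" is a PER entity

for span in enumerate_spans(len(sentence)):
    label = gold_labels.get(span, "NULL")  # non-entity spans get the NULL label
    print(span, label)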

The probability that each span s is assigned a class label y is modeled by using a softmax function:

    P(y | s) = exp(score(s, y)) / Σ_{y′ ∈ Y} exp(score(s, y′)) .

Typically, as the scoring function, the inner product between each label weight vector w_y and the span feature vector h_s is used:

    score(s, y) = w_y · h_s .

The score for the NULL label is set to a constant, score(s, NULL) = 0, similar to logistic regression He et al. (2018). For training, the loss function we minimize is the negative log-likelihood:

    L = − Σ_{(s, y*) ∈ D} log P(y* | s) ,

where D is the set of pairs of a span s and its ground-truth label y*. We call this kind of model, which uses label weight vectors for classification, the classifier-based span model.
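For concreteness, the following NumPy sketch computes this label distribution and loss for pre-computed span feature vectors, assuming one weight vector per non-NULL label; the function and variable names are our own, not those of the released implementation.

import numpy as np

def classifier_label_probs(h_s, label_weights):
    """P(y | s) via a softmax over score(s, y) = w_y . h_s,
    with the NULL label scored by a constant 0."""
    scores = {"NULL": 0.0}
    scores.update({y: float(w_y @ h_s) for y, w_y in label_weights.items()})
    max_score = max(scores.values())                      # for numerical stability
    exp_scores = {y: np.exp(v - max_score) for y, v in scores.items()}
    z = sum(exp_scores.values())
    return {y: v / z for y, v in exp_scores.items()}

def negative_log_likelihood(spans, label_weights):
    """Loss summed over (span feature, gold label) pairs."""
    return -sum(np.log(classifier_label_probs(h_s, label_weights)[y_gold])
                for h_s, y_gold in spans)

# Toy usage with 4-dimensional span features and two entity labels.
rng = np.random.default_rng(0)
weights = {"PER": rng.normal(size=4), "ORG": rng.normal(size=4)}
print(classifier_label_probs(rng.normal(size=4), weights))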

Figure 1: Illustration of our instance-based span model. An entity candidate “Franz Kafka” is used as a query and vectorized by an encoder. In the vector space, the similarities between the candidate and each of the training instances are computed. Based on the similarities, the label probability distribution is computed, and the label with the highest probability, PER, is assigned to “Franz Kafka.”

3.2 Instance-based span model

Our instance-based span model classifies each span based on similarities between spans. In Figure 1, an entity candidate “Franz Kafka” and the spans in the training set are mapped onto the feature vector space, and the label distribution is computed from the similarities between them. In this inference process, it is easy to understand how much each training instance contributes to the prediction. This property allows us to explain each prediction by specific training instances, which is categorized as example-based explanation Plumb et al. (2018).

Formally, within the neighbourhood component analysis framework Goldberger et al. (2005), we define the neighbor span probability that each span s_i will select another span s_j as its neighbor from the candidate spans in the training set:

    P(s_j | s_i) = exp(score(s_i, s_j)) / Σ_{s_k ∈ S(D′)} exp(score(s_i, s_k)) .        (1)

Here, we exclude the input sentence and its ground-truth labels from the training set D, yielding D′, and regard all spans of the sentences in D′ as the candidates S(D′). The scoring function score(s_i, s_j) returns a similarity between the spans s_i and s_j. Then we compute the probability that a span s_i will be assigned a label y:

    P(y | s_i) = Σ_{s_j ∈ S_y(D′)} P(s_j | s_i) .        (2)

Here, S_y(D′) is the set of candidate spans with the label y, so the equation indicates that we sum up the probabilities of the neighbor spans that have the same label y as the span s_i. The loss function we minimize is the negative log-likelihood:

    L = − Σ_{(s_i, y*_i) ∈ D} log P(y*_i | s_i) ,

where D is the set of pairs of a span s_i and its ground-truth label y*_i. At inference time, we predict ŷ_i to be the class label with maximal marginal probability:

    ŷ_i = argmax_{y ∈ Y} P(y | s_i) ,

where the probability P(y | s_i) is computed for each label y in the label set Y.
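The sketch below mirrors Equations 1 and 2 in NumPy: neighbor probabilities are a softmax over inner-product similarities to the candidate training spans, and the label distribution marginalizes those probabilities over the candidates' gold labels. Again, the names and data layout are illustrative assumptions rather than the authors' code.

import numpy as np

def neighbor_probs(h_query, candidate_vectors):
    """Eq. (1): probability of selecting each candidate training span as a
    neighbor, with similarity defined as the inner product of span vectors."""
    sims = candidate_vectors @ h_query          # (N,)
    sims = sims - sims.max()                    # numerical stability
    exp_sims = np.exp(sims)
    return exp_sims / exp_sims.sum()

def label_probs(h_query, candidate_vectors, candidate_labels, label_set):
    """Eq. (2): P(y | query span) = sum of neighbor probabilities of the
    candidate spans whose gold label is y."""
    p = neighbor_probs(h_query, candidate_vectors)
    labels = np.asarray(candidate_labels)
    return {y: float(p[labels == y].sum()) for y in label_set}

def predict(h_query, candidate_vectors, candidate_labels, label_set):
    """Assign the label with maximal marginal probability."""
    dist = label_probs(h_query, candidate_vectors, candidate_labels, label_set)
    return max(dist, key=dist.get)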

Efficient neighbor probability computation

The neighbor span probability in Equation 1 depends on the entire training set D, which leads to heavy computational cost. As a remedy, we use random sampling to retrieve sentences from the training set D. At training time, we randomly sample a fixed number of sentences for each mini-batch at each epoch. This simple technique realizes time- and memory-efficient training. In our experiments, it takes less than one day to train a model on a single GPU (NVIDIA DGX-1 with Tesla V100).
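A minimal sketch of this sampling step is given below, assuming each training sentence is represented as a dict with an 'id' field; the number of sampled sentences is a hyperparameter whose value we do not reproduce here.

import random

def sample_candidate_sentences(train_sentences, num_samples, batch_sentence_ids):
    """Randomly sample training sentences whose spans serve as neighbor
    candidates for one mini-batch, excluding the mini-batch's own sentences."""
    excluded = set(batch_sentence_ids)
    pool = [s for s in train_sentences if s["id"] not in excluded]
    return random.sample(pool, min(num_samples, len(pool)))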

4 Experiments

4.1 Experimental setup

Data

We evaluate the span models through two types of NER: (i) flat NER on the CoNLL-2003 dataset Tjong Kim Sang and De Meulder (2003) and (ii) nested NER on the GENIA dataset Kim et al. (2003), for which we use the version pre-processed by Zheng et al. (2019), available at https://github.com/thecharm/boundary-aware-nested-ner. We follow the standard training-development-test splits.

Baseline

We use the classifier-based span model (Section 3.1) as a baseline. The only difference between the instance-based and classifier-based span models is whether or not the softmax classifier with label weight vectors is used.

Encoder and span representation

We adopt the encoder architecture proposed by Ma and Hovy (2016), which encodes each token of the input sentence with a word embedding and a character-level CNN. The encoded token representations are fed to a bidirectional LSTM to compute contextual representations from the forward and backward directions, h→_t and h←_t. From them, we create a feature vector f_s for each span s = (i, j) based on LSTM-minus Wang and Chang (2016). For flat NER, we use the subtraction representation f_s = [h→_j − h→_{i−1}; h←_i − h←_{j+1}]. For nested NER, we additionally concatenate the addition features, h→_j + h→_{i−1} and h←_i + h←_{j+1}, to the subtraction features, because doing so improves performance in our preliminary experiments. We then multiply f_s with a weight matrix W and obtain the span representation h_s = W f_s. For the scoring function in Equation 1 in the instance-based span model, we use the inner product between a pair of span representations: score(s_i, s_j) = h_{s_i} · h_{s_j}.
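The following NumPy sketch illustrates one plausible instantiation of this span representation (subtraction features, optional addition features for nested NER, and a linear projection), assuming 1-based indices and forward/backward state sequences padded with zero vectors at positions 0 and T+1; the exact details may differ from the released code.

import numpy as np

def span_representation(h_fwd, h_bwd, i, j, W, nested=False):
    """Build h_s = W f_s for span (i, j) from BiLSTM states.
    h_fwd, h_bwd: arrays of shape (T + 2, d) with zero padding at rows 0 and T+1."""
    sub_fwd = h_fwd[j] - h_fwd[i - 1]           # forward LSTM-minus feature
    sub_bwd = h_bwd[i] - h_bwd[j + 1]           # backward LSTM-minus feature
    feats = [sub_fwd, sub_bwd]
    if nested:                                  # extra addition features for nested NER
        feats += [h_fwd[j] + h_fwd[i - 1], h_bwd[i] + h_bwd[j + 1]]
    f_s = np.concatenate(feats)
    return W @ f_s                              # span representation h_s

def span_similarity(h_si, h_sj):
    """Scoring function of Equation 1: inner product of two span representations."""
    return float(h_si @ h_sj)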

Model configuration

We train instance-based models by using training sentences randomly retrieved for each mini-batch. At test time, we use the nearest training sentences for each input sentence, selected based on the cosine similarities between their sentence vectors; for each sentence, its sentence vector is defined as the average of the word embeddings (GloVe) within the sentence. For the word embeddings, we use the GloVe 100-dimensional embeddings Pennington et al. (2014) and the BERT embeddings Devlin et al. (2019). Details on the experimental setup are described in Appendix A.1.
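A sketch of this test-time retrieval step under the stated definition of a sentence vector (the average of the GloVe vectors of its words) follows; the dictionary-based GloVe lookup and the handling of out-of-vocabulary words are our assumptions.

import numpy as np

def sentence_vector(tokens, glove, dim=100):
    """Average of the GloVe embeddings of the words in a sentence."""
    vecs = [glove[w] for w in tokens if w in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def nearest_training_sentences(test_tokens, train_sentences, glove, k):
    """Return the k training sentences most cosine-similar to the test sentence;
    their spans become the neighbor candidates at test time."""
    q = sentence_vector(test_tokens, glove)
    def cosine(a, b):
        return float(a @ b) / ((np.linalg.norm(a) * np.linalg.norm(b)) + 1e-12)
    scored = [(cosine(q, sentence_vector(s["tokens"], glove)), s)
              for s in train_sentences]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:k]]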

4.2 Quantitative analysis

                    Classifier-based   Instance-based
GloVe   Flat NER    90.68 ± 0.25       90.73 ± 0.07
        Nested NER  73.76 ± 0.35       74.20 ± 0.16
BERT    Flat NER    90.48 ± 0.18       90.48 ± 0.07
        Nested NER  73.27 ± 0.19       73.92 ± 0.20
Table 1: Comparison between classifier-based and instance-based span models. Cells show the F1 scores and standard deviations on each test set.

Figure 2: Performance on the CoNLL-2003 development set for different amounts of training data.

We report F1 scores averaged across five different runs of model training with different random seeds.

Overall F1 scores

We investigate whether or not our instance-based span model can achieve competitive performance with the classifier-based span model. Table 1 shows F1 scores on each test set. The models using GloVe yielded slightly better results than those using BERT. One possible explanation is that subword segmentation is not well suited to NER; in particular, tokens in upper case are segmented into overly small pieces, e.g., “LEICESTERSHIRE” becomes “L,” “##EI,” “##CE,” “##ST,” “##ER,” “##S,” “##H,” “##IR,” “##E.” Consistently, the instance-based span model yielded comparable results to the classifier-based span model. This indicates that our instance-based learning method enables us to build NER models without sacrificing performance.

Effects of training data size

Figure 2 shows F1 scores on the CoNLL-2003 development set for models trained on the full training set and on smaller subsets of it. We found that (i) the performance of both models gradually degrades as the training set becomes smaller and (ii) both models yield very competitive performance curves.

4.3 Qualitative analysis

To better understand model behavior, we analyze the instance-based model using GloVe in detail.

Examples of retrieved spans

Query: [Tom Moody] took six for 82 but …
Classifier-based
1  PER   [Billy Mayfair] and Paul Goydos and …
2  NULL  [Billy Mayfair and Paul Goydos] and …
3  NULL  [Billy Mayfair and Paul Goydos and]
4  NULL  [Billy] Mayfair and Paul Goydos and …
5  NULL  [Ducati rider Troy Corser] , last year …
Instance-based
1  PER   [Ian Botham] began his test career …
2  PER   [Billy Mayfair] and Paul Goydos and …
3  PER   [Mark Hutton] scattered four hits …
4  PER   [Steve Stricker] , who had a 68 , and …
5  PER   [Darren Gough] polishing off …
Table 2: Example of span retrieval. An entity candidate “Tom Moody” in the CoNLL-2003 development set is used as a query for retrieving the five nearest neighbor spans from the training set.

The span feature space learned by our method can be applied to various downstream tasks. In particular, it can be used as a span retrieval system. Table 2 shows the five nearest neighbor spans of an entity candidate “Tom Moody.” In the classifier-based span model, person-related but non-entity spans were retrieved. By contrast, in the instance-based span model, person (PER) entities were consistently retrieved. Interestingly, the query span “Tom Moody” refers to a cricketer, and some of the neighbors, “Ian Botham” and “Darren Gough,” are also cricketers. This tendency was observed in many other cases, and we confirmed that our method can build feature spaces that are preferable for applications.

Error analysis

Query: … spokesman for [Air France] ’s …   (Pred: LOC, Gold: ORG)
1  LOC  [Colombia] turned down American ’s …
2  LOC  … involving [Scotland] , Wales , …
3  LOC  … signed in [Nigeria] ’s capital Abuja …
4  LOC  … in the West Bank and [Gaza] .
5  LOC  … on its way to [Romania]
Table 3: Example of an error by the instance-based span model. Although the gold label is ORG (Organization), the wrong label LOC (Location) is assigned.

The instance-based span model tends to wrongly label spans that include location or organization names. For example, in Table 3, the wrong label LOC (Location) is assigned to “Air France,” whose gold label is ORG (Organization). Note that by looking at the neighbors, we can understand that country or district entities confused the model. This implies that prediction errors are easier to analyze because the neighbors serve as rationales for the predictions.

4.4 Discussion

       Classifier-based   Instance-based
GloVe  94.91 ± 0.11       94.96 ± 0.06
BERT   96.20 ± 0.03       96.24 ± 0.04
Table 4: Comparison on syntactic chunking. Cells show F1 scores and standard deviations on the CoNLL-2000 test set.

Generalizability

Are our findings in NER generalizable to other tasks? To investigate this, we perform an additional experiment on the CoNLL-2000 dataset Tjong Kim Sang and Buchholz (2000) for syntactic chunking, where the models are trained in the same way as in nested NER. While this task is similar to NER in terms of short-span classification, the class labels are based on syntax, not (entity) semantics. In Table 4, the instance-based span model achieved F1 scores competitive with the classifier-based one, which is consistent with the NER results. This suggests that our findings in NER are likely to generalize to other short-span classification tasks.

Future work

One interesting line of future work is to extend our method to span-to-span relation classification, such as SRL and coreference resolution. Another potential direction is to apply the learned span features to downstream tasks requiring entity knowledge, such as entity linking and question answering, and evaluate them there.

5 Conclusion

We presented and investigated an instance-based learning method that learns similarities between spans. Through NER experiments, we demonstrated that the models built by our method have (i) competitive performance with a classifier-based span model and (ii) an interpretable inference process in which it is easy to understand how much each training instance contributes to the predictions.

Acknowledgments

This work was partially supported by JSPS KAKENHI Grant Numbers JP19H04162 and JP19K20351. We would like to thank the members of the Tohoku NLP Laboratory and the anonymous reviewers for their insightful comments.

References

  • D. W. Aha, D. Kibler, and M. K. Albert (1991). Instance-based learning algorithms. Machine Learning 6(1), pp. 37–66.
  • D. Baehrens, T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, and K. Müller (2010). How to explain individual classification decisions. Journal of Machine Learning Research 11, pp. 1803–1831.
  • J. P. C. Chiu and E. Nichols (2016). Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics 4, pp. 357–370.
  • W. Daelemans and A. Van den Bosch (2005). Memory-Based Language Processing. Cambridge University Press.
  • W. Daelemans, J. Zavrel, P. Berck, and S. Gillis (1996). MBT: a memory-based part of speech tagger-generator. In Proceedings of the Fourth Workshop on Very Large Corpora.
  • F. De Meulder and W. Daelemans (2003). Memory-based named entity recognition using unannotated data. In Proceedings of HLT-NAACL, pp. 208–211.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pp. 4171–4186.
  • J. R. Finkel and C. D. Manning (2009). Nested named entity recognition. In Proceedings of EMNLP, pp. 141–150.
  • J. Goldberger, G. E. Hinton, S. T. Roweis, and R. R. Salakhutdinov (2005). Neighbourhood components analysis. In Proceedings of NIPS, pp. 513–520.
  • A. Graves, N. Jaitly, and A. Mohamed (2013). Hybrid speech recognition with deep bidirectional LSTM. In Proceedings of the 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).
  • L. He, K. Lee, O. Levy, and L. Zettlemoyer (2018). Jointly predicting predicates and arguments in neural semantic role labeling. In Proceedings of ACL, pp. 364–369.
  • I. Hendrickx and A. van den Bosch (2003). Memory-based one-step named-entity recognition: effects of seed list features, classifier stacking, and unannotated data. In Proceedings of CoNLL, pp. 176–179.
  • M. Ju, M. Miwa, and S. Ananiadou (2018). A neural layered model for nested named entity recognition. In Proceedings of NAACL-HLT, pp. 1446–1459.
  • A. Katiyar and C. Cardie (2018). Nested named entity recognition revisited. In Proceedings of NAACL-HLT, pp. 861–871.
  • J. Kim, T. Ohta, Y. Tateisi, and J. Tsujii (2003). GENIA corpus—a semantically annotated corpus for bio-textmining. Bioinformatics 19(suppl_1), pp. i180–i182.
  • D. P. Kingma and J. Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • N. Kitaev, S. Cao, and D. Klein (2019). Multilingual constituency parsing with self-attention and pre-training. In Proceedings of ACL, pp. 3499–3505.
  • P. W. Koh and P. Liang (2017). Understanding black-box predictions via influence functions. In Proceedings of ICML, pp. 1885–1894.
  • G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer (2016). Neural architectures for named entity recognition. In Proceedings of NAACL-HLT, pp. 260–270.
  • K. Lee, L. He, M. Lewis, and L. Zettlemoyer (2017). End-to-end neural coreference resolution. In Proceedings of EMNLP, pp. 188–197.
  • T. Lei, R. Barzilay, and T. Jaakkola (2016). Rationalizing neural predictions. In Proceedings of EMNLP, pp. 107–117.
  • T. Liu, J. Yao, and C. Lin (2019). Towards improving neural named entity recognition with gazetteers. In Proceedings of ACL, pp. 5301–5307.
  • W. Lu and D. Roth (2015). Joint mention extraction and classification with mention hypergraphs. In Proceedings of EMNLP, pp. 857–867.
  • S. M. Lundberg and S. Lee (2017). A unified approach to interpreting model predictions. In Proceedings of NIPS, pp. 4765–4774.
  • X. Ma and E. Hovy (2016). End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of ACL, pp. 1064–1074.
  • L. van der Maaten and G. Hinton (2008). Visualizing data using t-SNE. Journal of Machine Learning Research 9, pp. 2579–2605.
  • A. O. Muis and W. Lu (2017). Labeling gaps between words: recognizing overlapping mentions with mention separators. In Proceedings of EMNLP, pp. 2608–2618.
  • M. Nagao (1984). A framework of a mechanical translation between Japanese and English by analogy principle. Elsevier Science Publishers.
  • J. Nivre, J. Hall, and J. Nilsson (2004). Memory-based dependency parsing. In Proceedings of CoNLL, pp. 49–56.
  • H. Ouchi, H. Shindo, and Y. Matsumoto (2018). A span selection model for semantic role labeling. In Proceedings of EMNLP, pp. 1630–1642.
  • R. Pascanu, T. Mikolov, and Y. Bengio (2013). On the difficulty of training recurrent neural networks. In Proceedings of ICML, pp. 1310–1318.
  • J. Pennington, R. Socher, and C. Manning (2014). GloVe: global vectors for word representation. In Proceedings of EMNLP, pp. 1532–1543.
  • G. Plumb, D. Molitor, and A. S. Talwalkar (2018). Model agnostic supervised local explanations. In Proceedings of NIPS, pp. 2515–2524.
  • M. T. Ribeiro, S. Singh, and C. Guestrin (2016). “Why should I trust you?”: explaining the predictions of any classifier. In Proceedings of KDD, pp. 1135–1144.
  • A. M. Saxe, J. L. McClelland, and S. Ganguli (2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120.
  • M. G. Sohrab and M. Miwa (2018). Deep exhaustive model for nested named entity recognition. In Proceedings of EMNLP, pp. 2843–2849.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), pp. 1929–1958.
  • M. Stern, J. Andreas, and D. Klein (2017). A minimal span-based neural constituency parser. In Proceedings of ACL, pp. 818–827.
  • E. F. Tjong Kim Sang and S. Buchholz (2000). Introduction to the CoNLL-2000 shared task: chunking. In Proceedings of CoNLL.
  • E. F. Tjong Kim Sang and F. De Meulder (2003). Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In Proceedings of CoNLL, pp. 142–147.
  • E. F. Tjong Kim Sang (2002). Memory-based named entity recognition. In Proceedings of CoNLL.
  • B. Wang and W. Lu (2018). Neural segmental hypergraphs for overlapping mention recognition. In Proceedings of EMNLP, pp. 204–214.
  • W. Wang and B. Chang (2016). Graph-based dependency parsing with bidirectional LSTM. In Proceedings of ACL, pp. 2306–2315.
  • S. Wiseman and K. Stratos (2019). Label-agnostic sequence labeling by copying nearest neighbors. In Proceedings of ACL, pp. 5363–5369.
  • C. Xia, C. Zhang, T. Yang, Y. Li, N. Du, X. Wu, W. Fan, F. Ma, and P. Yu (2019). Multi-grained named entity recognition. In Proceedings of ACL, pp. 1430–1440.
  • C. Zheng, Y. Cai, J. Xu, H. Leung, and G. Xu (2019). A boundary-aware neural model for nested named entity recognition. In Proceedings of EMNLP-IJCNLP, pp. 357–366.

Appendix A Appendices

A.1 Experimental setup

Name Value
CNN window size 3
CNN filters 30
BiLSTM layers 2
BiLSTM hidden units 100 dimensions
Mini-batch size 8
Optimization Adam
Learning rate 0.001
Dropout ratio {0.1, 0.3, 0.5}
Table 5: Hyperparameters used in the experiments.

Network setup

Basically, we follow the encoder architecture proposed by Ma and Hovy (2016). First, the token-encoding layer encodes each token of the input sentence into a sequence of vector representations. For the models using GloVe, we use the GloVe 100-dimensional embeddings (https://nlp.stanford.edu/projects/glove/) Pennington et al. (2014) and a character-level CNN. For the models using BERT, we use BERT-Base, Cased (https://github.com/google-research/bert) Devlin et al. (2019), where we use the first subword embedding within each token from the last layer of BERT. During training, we fix the word embeddings (except the CNN). Then, the encoded token representations are fed to a bidirectional LSTM (BiLSTM) Graves et al. (2013) to compute contextual representations h→_t and h←_t. We use 2 layers of stacked BiLSTMs (2 forward and 2 backward LSTMs) with 100-dimensional hidden units. From h→_t and h←_t, we create a feature vector f_s for each span s = (i, j) based on LSTM-minus Wang and Chang (2016). For flat NER, we use the subtraction representation f_s = [h→_j − h→_{i−1}; h←_i − h←_{j+1}]; for nested NER, we additionally concatenate the addition features h→_j + h→_{i−1} and h←_i + h←_{j+1}. We then multiply f_s with a weight matrix W and obtain the span representation h_s = W f_s. Finally, we use the span representation for computing the label distribution in each model. For efficient computation, following Sohrab and Miwa (2018), we enumerate only the spans whose sizes are less than or equal to a maximum span size L_max, i.e., each span s = (i, j) satisfies j − i + 1 ≤ L_max, where L_max is fixed in advance.
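A compact PyTorch sketch of an encoder with these components (word embedding plus character-level CNN per token, followed by a stacked BiLSTM with the Table 5 hyperparameters) is shown below; this is our reading of the setup, not the authors' released implementation, and all class and argument names are illustrative.

import torch
import torch.nn as nn

class TokenEncoder(nn.Module):
    """Sketch: word embedding + character-level CNN per token, then a stacked BiLSTM.
    Hyperparameters follow Table 5 (window 3, 30 filters, 2 BiLSTM layers, 100 units)."""
    def __init__(self, n_words, n_chars, word_dim=100, char_dim=30,
                 cnn_filters=30, cnn_window=3, hidden=100, layers=2, dropout=0.3):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, word_dim)
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.char_cnn = nn.Conv1d(char_dim, cnn_filters, cnn_window,
                                  padding=cnn_window // 2)
        self.dropout = nn.Dropout(dropout)      # applied to the BiLSTM inputs
        self.bilstm = nn.LSTM(word_dim + cnn_filters, hidden, num_layers=layers,
                              bidirectional=True, batch_first=True)

    def forward(self, word_ids, char_ids):
        # word_ids: (B, T); char_ids: (B, T, C) character indices per token
        w = self.word_emb(word_ids)                                   # (B, T, word_dim)
        B, T, C = char_ids.shape
        c = self.char_emb(char_ids.view(B * T, C)).transpose(1, 2)    # (B*T, char_dim, C)
        c = torch.relu(self.char_cnn(c)).max(dim=2).values            # (B*T, cnn_filters)
        tokens = torch.cat([w, c.view(B, T, -1)], dim=-1)
        h, _ = self.bilstm(self.dropout(tokens))                      # (B, T, 2*hidden)
        return h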

Hyperparameters

Table 5 lists the hyperparameters used in the experiments. We initialize all the parameter matrices in the BiLSTMs with random orthonormal matrices Saxe et al. (2013). Other parameters are initialized following Glorot and Bengio (2010). We apply dropout Srivastava et al. (2014) to the token-encoding layer and to the input vectors of each LSTM, with the dropout ratios listed in Table 5.

Optimization

To optimize the parameters, we use Adam Kingma and Ba (2014). The initial learning rate is set to 0.001 (Table 5) and is decayed at each epoch as a function of a fixed decay rate and the number of completed epochs. Gradient clipping Pascanu et al. (2013) is also applied. Parameter updates are performed in mini-batches of 8. The number of training epochs is set to 100. We save the parameters that achieve the best F1 score on each development set and evaluate them on each test set. Training the models takes less than one day on a single GPU (NVIDIA DGX-1 with Tesla V100).
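For concreteness, here is a tiny sketch of an epoch-wise decay of this form, assuming the lr_0 / (1 + rho * t) schedule commonly used with this setup (e.g., by Ma and Hovy (2016)); the decay-rate value below is purely illustrative.

def decayed_learning_rate(initial_lr, decay_rate, epochs_completed):
    """Epoch-wise decay: lr_t = lr_0 / (1 + rho * t) (schedule assumed, not confirmed)."""
    return initial_lr / (1.0 + decay_rate * epochs_completed)

# Illustrative values only: initial_lr from Table 5, decay_rate assumed.
for epoch in range(5):
    print(epoch, decayed_learning_rate(0.001, 0.05, epoch))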

A.2 Feature space visualization

Figure 3: Visualization of entity span features computed by the (a) classifier-based and (b) instance-based models.

To better understand the span representations learned by our method, we observe the feature space. Specifically, we visualize the span representations on the CoNLL-2003 development set. Figure 3 visualizes two-dimensional entity span representations obtained by t-distributed Stochastic Neighbor Embedding (t-SNE) Maaten and Hinton (2008). Both models successfully learned feature spaces where instances with the same label come close to each other.
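The visualization can be reproduced along the following lines with scikit-learn's t-SNE; the plotting choices are our own and not taken from the paper.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_span_feature_space(span_vectors, labels):
    """Project span representations to 2D with t-SNE and color points by entity label."""
    coords = TSNE(n_components=2, random_state=0).fit_transform(np.asarray(span_vectors))
    for label in sorted(set(labels)):
        idx = [k for k, y in enumerate(labels) if y == label]
        plt.scatter(coords[idx, 0], coords[idx, 1], s=5, label=label)
    plt.legend()
    plt.show()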