Remedying BiLSTM-CNN Deficiency in Modeling Cross-Context for NER

08/29/2019 ∙ by Peng-Hsuan Li, et al. ∙ Academia Sinica

Recent research has prevalently used BiLSTM-CNN as a core module for NER in a sequence-labeling setup. This paper formally shows the limitation of BiLSTM-CNN encoders in modeling cross-context patterns for each word, i.e., patterns crossing past and future for a specific time step. Two types of cross-structures are used to remedy the problem: a BiLSTM variant with cross-links between layers and a multi-head self-attention mechanism. These cross-structures bring consistent improvements across a wide range of NER domains for a core system using BiLSTM-CNN without additional gazetteers, POS taggers, language-modeling, or multi-task supervision. The model surpasses comparable previous models on OntoNotes 5.0 and WNUT 2017 by 1.4% and 4.6% respectively, especially improving emerging, complex, confusing, and multi-token entity mentions, showing the importance of remedying the core module of NER.


1 Introduction

Named Entity Recognition (NER) is a core task for information extraction. Originally a structured prediction task, NER has since been formulated as a task of sequential token labeling. BiLSTM-CNN uses a CNN to encode each word and then uses bi-directional LSTMs to encode past and future context respectively at each time step. With state-of-the-art empirical results, most regard it as a robust core module for sequence-labeling NER [1, 2, 3, 4, 5].

However, each direction of BiLSTM only sees and encodes half of a sequence at each time step. For each token, the forward LSTM only encodes past context; the backward LSTM only encodes future context. When computing sentence representations for tasks such as sentence classification and machine translation, this is not a problem: only the rightmost hidden state of the forward LSTM and the leftmost hidden state of the backward LSTM are used, and each of these endpoint hidden states sees and encodes the whole sentence. For sequence-labeling tasks such as NER, however, this becomes a limitation, as each token uses its own midpoint hidden states, which do not model the patterns that happen to cross past and future at that specific time step.

This paper explores two types of cross-structures to help cope with the problem: Cross-BiLSTM-CNN and Att-BiLSTM-CNN. Previous studies have tried to stack multiple LSTMs for sequence-labeling NER [2]. As they follow the trend of stacking forward and backward LSTMs independently, the Baseline-BiLSTM-CNN is only able to learn higher-level representations of past or future per se. Instead, Cross-BiLSTM-CNN, which interleaves every layer of the two directions, models cross-context in an additive manner by learning higher-level representations of the whole context of each token. On the other hand, Att-BiLSTM-CNN models cross-context in a multiplicative manner by capturing the interaction between past and future with a dot-product self-attentive mechanism [6, 7].

Section 3 formulates the three models: Baseline-, Cross-, and Att-BiLSTM-CNN. The section gives a concrete proof that patterns forming an XOR cannot be modeled by the Baseline-BiLSTM-CNN used in all previous work. Cross-BiLSTM-CNN and Att-BiLSTM-CNN are shown to provide additive and multiplicative cross-structures respectively to deal with the problem. Section 4 evaluates the approaches on two challenging NER datasets spanning a wide range of domains with complex, noisy, and emerging entities. The cross-structures bring consistent improvements over the prevalently used Baseline-BiLSTM-CNN without additional gazetteers, POS taggers, language-modeling, or multi-task supervision. The improved core module surpasses comparable previous models on OntoNotes 5.0 and WNUT 2017 by 1.4% and 4.6% respectively. Experiments reveal that emerging, complex, confusing, and multi-token entity mentions benefit most from the cross-structures, and an in-depth entity-chunking analysis finds that the prevalently used Baseline-BiLSTM-CNN is flawed for real-world NER.

2 Related Work

Many have attempted to tackle the NER task with LSTM-based sequence encoders [8, 1, 2, 9]. Among these, the most sophisticated, state-of-the-art model is the BiLSTM-CNN proposed by [2]. They stack multiple layers of LSTM cells per direction and also use a CNN to compute character-level word vectors alongside pre-trained word vectors. This paper largely follows their work in constructing the Baseline-BiLSTM-CNN, including the selection of raw features, the CNN, and the multi-layer BiLSTM. A subtle difference is that they send the output of each direction through separate affine-softmax classifiers and then sum their probabilities, while this paper sums the scores from the affine layers before computing softmax once. While this does not change the modeling capacity regarded in this paper, the baseline model does perform better than their formulation.

The modeling of global contexts for sequence-labeling NER has been accomplished using traditional models with extensive feature engineering and conditional random fields (CRF). [10] build the Illinois NER tagger with feature-based perceptrons. In their analysis, the usefulness of Viterbi decoding is minimal and conflicts with their handcrafted global features. On the other hand, recent studies of LSTM- or CNN-based sequence encoders report empirical improvements brought by CRF [8, 1, 9, 11], as it discourages illegal predictions by explicitly modeling class transition probabilities. However, transition probabilities are independent of input sentences. In contrast, the cross-structures studied in this work provide for the direct capture of global patterns and extraction of better features to improve class observation likelihoods.

Attention mechanisms, thought to lighten the burden of compressing all relevant information into a single hidden state, have shown empirical success on top of LSTM sequence encoders [6, 7] and decoders [12]. Self-attention has also been used below encoders to compute word vectors conditioned on context [13]. This work further formally analyzes the deficiency of BiLSTM encoders for sequence labeling and shows that using self-attention on top actually provides one type of cross-structure that captures interactions between past and future context.

Besides using additional gazetteers or POS taggers [14, 3, 15], there is a recent trend of using additional large-scale language-modeling corpora [4] or additional multi-task supervision [5] to further improve NER performance beyond bare-bone models. However, they all rely on a core BiLSTM sentence encoder with the same limitation studied and remedied in Section 3, so they would indeed benefit from the improvements presented in this paper.

3 Model

3.1 CNN and Word Features

All models in the experiments use the same set of raw features: character embedding, character type, word embedding, and word capitalization.

For character embedding, 25d vectors are trained from scratch, and 4d one-hot character-type features indicate whether a character is uppercase, lowercase, a digit, or punctuation [2]. Word token lengths are unified to 20 by truncation and padding. The resulting 20-by-(25+4) feature map of each token is fed to a character-trigram CNN with 20 kernels per kernel length 1 to 3 and max-over-time pooling to compute a 60d character-based word vector [16, 2, 1].
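A rough PyTorch sketch of such a character CNN is given below; the module names and the padding scheme are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Character-trigram CNN: 20 kernels per width 1-3, max-over-time pooling -> 60d word vector."""
    def __init__(self, n_chars, char_dim=25, type_dim=4, n_kernels=20):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)       # 25d character embeddings, trained from scratch
        in_dim = char_dim + type_dim                          # 25 + 4 one-hot character-type features
        # one convolution per kernel width 1..3, 20 kernels each
        self.convs = nn.ModuleList(
            nn.Conv1d(in_dim, n_kernels, kernel_size=k, padding=k // 2) for k in (1, 2, 3)
        )

    def forward(self, char_ids, char_types):
        # char_ids: (batch, 20) int64; char_types: (batch, 20, 4) one-hot floats; words padded/truncated to 20 chars
        x = torch.cat([self.char_emb(char_ids), char_types], dim=-1)   # (batch, 20, 29)
        x = x.transpose(1, 2)                                          # (batch, 29, 20) for Conv1d
        # max-over-time pooling per kernel width, concatenated -> (batch, 60)
        return torch.cat([conv(x).max(dim=2).values for conv in self.convs], dim=1)
```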

For word embedding, either pre-trained 300d GloVe vectors [17] or 400d Twitter vectors [18] are used without further tuning. Also, 4d one-hot word capitalization features indicate whether a word is uppercase, upper-initial, lowercase, or mixed-caps [19, 2].

Throughout this paper, X denotes the n-by-d matrix of sequence features, where n is the sentence length and d is either 364 (with GloVe) or 464 (with Twitter).

3.2 Baseline-BiLSTM-CNN

On top of an input feature sequence, BiLSTM is used to capture the future and the past for each time step. Following [2], 4 distinct LSTM cells – two in each direction – are stacked to capture higher-level representations:

H^f = LSTM^f_2(LSTM^f_1(X)),    H^b = LSTM^b_2(LSTM^b_1(X)),    H = H^f || H^b

where LSTM^f_i, LSTM^b_i denote applying the i-th LSTM cell in forward and backward order, H^f, H^b denote the resulting feature matrices of the stacked application, and || denotes row-wise concatenation. In all the experiments, 100d LSTM cells are used, so H^f and H^b are n-by-100 and H is n-by-200.

Finally, suppose there are C token classes, the probability of each of which is given by the composition of affine and softmax transformations:

s_t = W h_t + b,    P(y_t = k) = exp(s_t[k]) / Σ_k' exp(s_t[k'])

where h_t is the t-th row of H, W (C-by-200) and b are a trainable weight matrix and bias, and s_t[k] and s_t[k'] are the k-th and k'-th elements of the score vector s_t.
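A minimal sketch of this baseline in PyTorch, assuming torch.nn.LSTM with two layers per direction as one way to realize the stacked cells (not the authors' code):

```python
import torch
import torch.nn as nn

class BaselineBiLSTM(nn.Module):
    """Two stacked 100d LSTM cells per direction; directions meet only at the affine-softmax layer."""
    def __init__(self, d_in, n_classes, d_hidden=100):
        super().__init__()
        # independent forward and backward stacks (no interleaving between layers)
        self.fwd = nn.LSTM(d_in, d_hidden, num_layers=2, batch_first=True)
        self.bwd = nn.LSTM(d_in, d_hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(2 * d_hidden, n_classes)    # single affine layer, softmax once

    def forward(self, x):                                # x: (batch, n, d_in) feature sequences
        h_f, _ = self.fwd(x)                             # H^f: past context per token
        h_b, _ = self.bwd(torch.flip(x, dims=[1]))       # H^b: future context (run on reversed input)
        h_b = torch.flip(h_b, dims=[1])
        h = torch.cat([h_f, h_b], dim=-1)                # H = H^f || H^b, (batch, n, 200)
        return torch.log_softmax(self.out(h), dim=-1)    # per-token class log-probabilities
```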

Following [2], the 5 chunk labels O, S, B, I, E denote whether a word token is Outside any entity mention, the Sole token of a mention, the Beginning token of a multi-token mention, In the middle of a multi-token mention, or the Ending token of a multi-token mention. Hence when there are e types of named entities, the actual number of token classes is C = 4e + 1 for sequence-labeling NER.
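For concreteness, a small hypothetical helper that converts gold mention spans into such OSBIE token labels:

```python
def spans_to_osbie(n_tokens, mentions):
    """mentions: list of (start, end_exclusive, type) spans; returns one chunk-typed tag per token."""
    tags = ["O"] * n_tokens
    for start, end, etype in mentions:
        if end - start == 1:
            tags[start] = f"{etype}:S"                  # sole token of a mention
        else:
            tags[start] = f"{etype}:B"                  # beginning token
            tags[end - 1] = f"{etype}:E"                # ending token
            for i in range(start + 1, end - 1):
                tags[i] = f"{etype}:I"                  # tokens in the middle
    return tags

# "Key and Peele" annotated as a single work-of-art mention:
assert spans_to_osbie(3, [(0, 3, "work-of-art")]) == \
    ["work-of-art:B", "work-of-art:I", "work-of-art:E"]
```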

3.2.1 XOR Limitation

Consider the following four phrases that form an XOR:

  1. Key and Peele (work-of-art)

  2. You and I (work-of-art)

  3. Key and I

  4. You and Peele

The first two phrases are respectively a show title and a song title. The other two are not entities as a whole; the last one actually occurs in an interview with Keegan-Michael Key. If each phrase is the sequence given to Baseline-BiLSTM-CNN for sequence tagging, then the token "and" should be tagged as work-of-art:I in the first two cases and as O in the last two cases.

Firstly, note that the score vector at each time step is simply the sum of contributions coming from the forward and backward directions plus a bias:

s_t = W h_t + b = W^f h^f_t + W^b h^b_t + b

where h^f_t, h^b_t denote the top half and bottom half of h_t, i.e., the forward and backward hidden states of the token, and W^f, W^b the corresponding halves of W.

Suppose the indices of work-of-art:I and O are i, j respectively, and for phrase p let f^(p) and g^(p) denote the forward and backward contributions to the score margin s_t[i] - s_t[j] at its token "and", i.e., f^(p) = (W^f h^f_t)_i - (W^f h^f_t)_j and g^(p) = (W^b h^b_t)_i - (W^b h^b_t)_j. Then, to predict each "and" correctly, it must hold that

f^(1) + g^(1) > 0,    f^(2) + g^(2) > 0,    f^(3) + g^(3) < 0,    f^(4) + g^(4) < 0

where superscripts denote the phrase number.

Now, the catch is that phrase 1 and phrase 3 have exactly the same past context for "and". Hence the same h^f and the same forward contribution, i.e., f^(1) = f^(3). Similarly, f^(2) = f^(4), g^(1) = g^(4), and g^(2) = g^(3). Rewriting the constraints with these equalities gives

f^(1) + g^(1) > 0,    f^(2) + g^(2) > 0,    f^(1) + g^(2) < 0,    f^(2) + g^(1) < 0.

Finally, summing the first two inequalities and the last two inequalities gives the two contradicting constraints f^(1) + f^(2) + g^(1) + g^(2) > 0 and f^(1) + f^(2) + g^(1) + g^(2) < 0, which cannot both be satisfied. In other words, even with an oracle providing the desired tags at training time, Baseline-BiLSTM-CNN can only tag at most 3 out of the 4 "and" tokens correctly. No matter how many LSTM cells are stacked for each direction, the formulation in previous studies simply does not have enough modeling capacity to capture cross-context patterns for sequence-labeling NER.
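The infeasibility is easy to check numerically as well. The toy sketch below samples random forward and backward contributions (the f and g above) and never satisfies more than three of the four constraints; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
vals = rng.uniform(-10, 10, size=(200_000, 4))      # candidate (f_key, f_you, g_peele, g_i) contributions
f_key, f_you, g_peele, g_i = vals.T
satisfied = (
    (f_key + g_peele > 0).astype(int)     # 1. "Key and Peele" -> work-of-art:I
    + (f_you + g_i > 0).astype(int)       # 2. "You and I"     -> work-of-art:I
    + (f_key + g_i < 0).astype(int)       # 3. "Key and I"     -> O
    + (f_you + g_peele < 0).astype(int)   # 4. "You and Peele" -> O
)
print(satisfied.max())   # almost surely 3; the algebraic argument above shows 4 is impossible
```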

3.3 Cross-BiLSTM-CNN

Motivated by the limitation of the conventional Baseline-BiLSTM-CNN for sequence labeling, this paper proposes the use of Cross-BiLSTM-CNN by changing the deep structure in Section 3.2 to

H^f = LSTM^f_2(LSTM^f_1(X) || LSTM^b_1(X)),    H^b = LSTM^b_2(LSTM^f_1(X) || LSTM^b_1(X)),    H = H^f || H^b

As the forward and backward hidden states are interleaved between stacked LSTM layers, Cross-BiLSTM-CNN models cross-context patterns by computing representations of the whole sequence in a feed-forward, additive manner.

Specifically, for the XOR cases introduced in Section 3.2.1, although phrase 1 and phrase 3 still have the same past context for "and" and hence the first forward layer can only extract the same low-level hidden features for them, the second layer considers the whole context and thus has the ability to extract different high-level hidden features for the two phrases.

As the higher-level LSTMs of Cross-BiLSTM-CNN take interleaved input from the forward and backward hidden states below, their weight parameters are double the size of those of the first-level LSTMs. Nevertheless, the cross formulation provides the modeling capacity absent in previous studies, no matter how many more LSTM layers they stack.
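Below is a sketch of the interleaved stacking under the same assumptions as the baseline sketch above; note the doubled input size of the second-layer LSTMs.

```python
import torch
import torch.nn as nn

class CrossBiLSTM(nn.Module):
    """Second-layer LSTMs read the concatenated first-layer forward AND backward states."""
    def __init__(self, d_in, n_classes, d_hidden=100):
        super().__init__()
        self.fwd1 = nn.LSTM(d_in, d_hidden, batch_first=True)
        self.bwd1 = nn.LSTM(d_in, d_hidden, batch_first=True)
        # cross-link: input size doubles because both 100d first-layer outputs are fed in
        self.fwd2 = nn.LSTM(2 * d_hidden, d_hidden, batch_first=True)
        self.bwd2 = nn.LSTM(2 * d_hidden, d_hidden, batch_first=True)
        self.out = nn.Linear(2 * d_hidden, n_classes)

    @staticmethod
    def _rev(x):                                         # reverse the time dimension
        return torch.flip(x, dims=[1])

    def forward(self, x):                                # x: (batch, n, d_in)
        h_f1, _ = self.fwd1(x)
        h_b1, _ = self.bwd1(self._rev(x))
        h1 = torch.cat([h_f1, self._rev(h_b1)], dim=-1)  # whole-context features per token
        h_f2, _ = self.fwd2(h1)                          # both directions now see past AND future
        h_b2, _ = self.bwd2(self._rev(h1))
        h2 = torch.cat([h_f2, self._rev(h_b2)], dim=-1)  # (batch, n, 200)
        return torch.log_softmax(self.out(h2), dim=-1)
```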

3.4 Att-BiLSTM-CNN

Another way to capture the interaction between past and future context per time step is to add a token-level self-attentive mechanism on top of the same BiLSTM formulation introduced in Section 3.2. Given the hidden features H of a whole sequence, the model projects each hidden state to different subspaces, depending on whether it is used as the query vector to consult other hidden states for each word token, the key vector to compute its dot-similarities with incoming queries, or the value vector to be weighted and actually convey information to the querying token. As different aspects of a task can call for different attention, multiple attention heads running in parallel are used [20].

Formally, let m be the number of attention heads and d_c be the subspace dimension. For each head i, the attention weight matrix A_i and context matrix C_i are computed by

A_i = σ((H Q_i)(H K_i)^T),    C_i = A_i (H V_i)

where Q_i, K_i, V_i (200-by-d_c) are trainable projection matrices and σ performs softmax along the second dimension. Each row of the resulting A_i (n-by-n) contains the attention weights of a token to its context, and each row of C_i (n-by-d_c) is its context vector.

For Att-BiLSTM-CNN, the hidden vector and the context vectors of each token are considered together for classification:

s_t = W h_t + W^c (c_{1,t} || ... || c_{m,t}) + b

where c_{i,t} is the t-th row of C_i, and W^c is a trainable weight matrix. The same number of heads m and subspace dimension d_c are used in all the experiments.

While the BiLSTM formulation stays the same as in Baseline-BiLSTM-CNN, the computation of attention weights and context features models the cross interaction between past and future. To see this, the computation of attention scores can be rewritten as follows:

(H Q_i)(H K_i)^T = H Q_i K_i^T H^T = (H^f || H^b) Q_i K_i^T (H^f || H^b)^T

With this un-shifted covariance matrix of the projected H, Att-BiLSTM-CNN correlates past and future context for each token in a dot-product, multiplicative manner.

One advantage of the multi-head self-attentive mechanism is that it only needs to be computed once per sequence, and the matrix computations are highly parallelizable, resulting in little computation time overhead. Moreover, in Section 4, the attention weights provide a better understanding of how the model learns to tackle sequence-labeling NER.
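A minimal PyTorch sketch of the attention layer and joint classification described above. The head count of 5 mirrors the per-head ablation columns in Table 5; the subspace size of 40 and the class count of 73 (4 x 18 OntoNotes types + 1) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TokenSelfAttention(nn.Module):
    """Multi-head dot-product self-attention over BiLSTM states H of shape (batch, n, 200)."""
    def __init__(self, d_model=200, n_heads=5, d_head=40, n_classes=73):
        super().__init__()
        self.q = nn.Linear(d_model, n_heads * d_head, bias=False)    # query projections Q_i
        self.k = nn.Linear(d_model, n_heads * d_head, bias=False)    # key projections K_i
        self.v = nn.Linear(d_model, n_heads * d_head, bias=False)    # value projections V_i
        self.n_heads, self.d_head = n_heads, d_head
        # hidden vector and all per-head context vectors are classified together
        self.out = nn.Linear(d_model + n_heads * d_head, n_classes)

    def forward(self, h):                                      # h: (batch, n, 200)
        b, n, _ = h.shape
        def split(x):                                          # -> (batch, heads, n, d_head)
            return x.view(b, n, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.q(h)), split(self.k(h)), split(self.v(h))
        att = torch.softmax(q @ k.transpose(-2, -1), dim=-1)   # A_i: (batch, heads, n, n), rows sum to 1
        ctx = att @ v                                          # C_i: per-head context vectors
        ctx = ctx.transpose(1, 2).reshape(b, n, -1)            # (batch, n, heads * d_head)
        scores = self.out(torch.cat([h, ctx], dim=-1))         # W h_t + W^c (c_1t || ... || c_mt) + b
        return torch.log_softmax(scores, dim=-1)
```

Because the whole score matrix is computed with a few batched matrix multiplications, the overhead over the plain BiLSTM is small, matching the parallelizability noted above.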

4 Experiments

4.1 Datasets

OntoNotes 5.0 Fine-Grained NER – a million-token corpus with diverse sources of newswires, web, broadcast news, broadcast conversations, magazines, and telephone conversations [21, 22]. Some are transcriptions of talk shows, and some are translations from Chinese or Arabic. The dataset contains 18 fine-grained entity types, including hard ones such as law, event, and work-of-art. All this diversity and noisiness requires models to be robust across broad domains and capable of capturing a multitude of linguistic patterns for complex entities.

WNUT 2017 Emerging NER – a dataset providing maximally diverse, noisy, and drifting user-generated text [23]. The training set consists of previously annotated tweets – social media text with non-standard spellings, abbreviations, and unreliable capitalization [24]; the development set consists of newly sampled YouTube comments; the test set includes text newly drawn from Twitter, Reddit, and StackExchange. Besides drawing new samples from diverse topics across different sources, the shared task also filtered out text containing surface forms of entities seen in the training set. The resulting dataset requires models to generalize to emerging contexts and entities instead of relying on familiar surface cues.

OntoNotes 5.0 WNUT 2017
train 1088.5 / 81.8 62.7 / 1.9
dev 147.7 / 11.0 15.7 / 0.8
test 152.7 / 11.2 23.3 / 1.0
Table 1: Datasets (K-tokens / K-entities).

4.2 Implementation and Baselines

All experiments for Baseline-, Cross-, and Att-BiLSTM-CNN used the same model parameters given in Section 3. Training minimized per-token cross-entropy loss with the Nadam optimizer [25], a uniform learning rate of 0.001, batch size 32, and 35% dropout. Each training run lasted 400 epochs when using GloVe embeddings (OntoNotes) and 1600 epochs when using Twitter embeddings (WNUT). The development set of each dataset was used to select the best epoch at which to restore model weights for testing. Following previous work on NER, model performance was evaluated with strict mention F1 score. Training of each model on each dataset was repeated 6 times to report the mean score and standard deviation.
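As a rough sketch of this training configuration, with torch.optim.NAdam standing in for Nadam and placeholder model and data so that the snippet is self-contained:

```python
import torch
import torch.nn as nn

# Placeholder model and data just to make the loop runnable; the real encoders are sketched in Section 3.
model = nn.Sequential(nn.Linear(364, 200), nn.Tanh(), nn.Dropout(0.35), nn.Linear(200, 73))
optimizer = torch.optim.NAdam(model.parameters(), lr=0.001)    # Nadam with uniform learning rate 0.001
loss_fn = nn.CrossEntropyLoss()                                # per-token cross-entropy

x = torch.randn(32, 30, 364)                  # one batch of 32 sentences, 30 tokens, 364d features
y = torch.randint(0, 73, (32, 30))            # gold token classes

for epoch in range(400):                      # 400 epochs with GloVe (1600 with Twitter vectors)
    optimizer.zero_grad()
    scores = model(x)                         # (32, 30, 73) per-token class scores
    loss = loss_fn(scores.transpose(1, 2), y) # CrossEntropyLoss expects (batch, classes, tokens)
    loss.backward()
    optimizer.step()
```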

Besides comparing to the Baseline implemented in this paper, results were also compared against previously reported results of BiLSTM-CNN [2], CRF-BiLSTM(-BiLSTM) [11, 26], and CRF-IDCNN [11] on the two datasets. Among them, IDCNN is a CNN-based sentence encoder, which should not have the XOR limitation raised in this paper. Only fair comparisons against models without additional resources were made. However, the models that use those additional resources (Section 2) all rely on a BiLSTM sentence encoder with the XOR limitation, so they could indeed integrate with and benefit from the cross-structures.

4.3 Overall Results

Table 2 shows overall results on the two datasets spanning broad domains of newswires, broadcast, telephone, and social media. The models proposed in this paper significantly surpassed previous comparable models by 1.4% on OntoNotes and 4.6% on WNUT. Compared to the re-implemented Baseline-BiLSTM-CNN, the cross-structures brought 0.7% and 2.2% improvements on OntoNotes and WNUT respectively. More substantial improvements were achieved for WNUT 2017 emerging NER, suggesting that cross-context patterns are even more crucial for emerging contexts and entities than for familiar entities, which might often be memorized by their surface forms.

OntoNotes 5.0 WNUT 2017
Prec. Rec. F1 Prec. Rec. F1
BiLSTM-CNN 86.04 86.53 86.28±0.26 - - -
CRF-IDCNN - - 86.84±0.19 - - -
CRF-BiLSTM(-BiLSTM*) - - 86.99±0.22 - - 38.24
Baseline-BiLSTM-CNN 88.37 87.14 87.75±0.14 53.24 32.93 40.68±1.78
Cross-BiLSTM-CNN 88.37 88.17 88.27±0.17 58.28 33.92 42.85±0.99
Att-BiLSTM-CNN 88.71 88.11 88.40±0.18 55.82 34.08 42.26±0.82
Table 2: Overall results (mean ± standard deviation over 6 runs). *Used on WNUT for character-based word vectors, reported better than CNN.

4.4 Complex and Confusing Entity Mentions

Table 3 shows entity types with significant differences against Baseline (3% absolute F1 difference for either Cross or Att). It can be seen that harder entity types generally benefitted more from the cross-structures. For example, work-of-art/creative-work entities can in principle take any surface form – unseen, the same as a person name, abbreviated, or written with unreliable capitalization on social media. Such mentions require models to learn a deep, generalized understanding of their context to accurately identify their boundaries and disambiguate their types. Both cross-structures were more capable of dealing with such hard entities (2.1%/5.6%/3.2%/2.0%) than the prevalently used, problematic Baseline.

Moreover, disambiguating fine-grained entity types is also a challenging task. For example, entities of language and NORP often take the same surface forms. Figure 1(a) shows an example containing "Dutch" and "English". While "English" was much more frequently used as a language and was identified correctly, the "Dutch" mention was tricky for Baseline. The attention heat map (Figure 2(a)) further tells the story that Att relied on its attention head to make context-aware decisions. Overall, both cross-structures were much better at disambiguating these fine-grained types (4.1%/0.8%/3.3%/3.4%).

OntoNotes 5.0 WNUT 2017
event language law NORP* work-of-art corporation creative-work location
Cross +3.0% +4.1% +4.5% +3.3% +2.1% +6.4% +3.2% +8.6%
Att +4.6% +0.8% +0.8% +3.4% +5.6% +0.3% +2.0% +5.3%
Table 3: Types with significant results (3% absolute F1 differences). *Nationalities, religious, political groups.
OntoNotes 5.0 WNUT 2017
1          2          3 1          2          3
Cross +0.3%  +0.6%  +1.8% +1.7%  +2.9%  +8.7%
Att +0.1%  +1.1%  +2.3% +1.5%  +2.0%  +2.6%
Table 4: Improvements against Baseline among different mention lengths.

4.5 Multi-Token Entity Mentions

Table 4 shows results among different entity lengths. It could be seen that cross-structures were much better at dealing with multi-token mentions (1.8%/2.3%/8.7%/2.6%) compared to the prevalently used, problematic Baseline.

In fact, identifying correct mention boundaries for multi-token mentions poses a unique challenge for sequence-labeling models – all tokens in a mention must be tagged with correct sequential labels to form one correct prediction. Models often rely on strong hints from a token itself or from a single side of the context; in general, however, cross-context modeling is required: a token should be tagged as Inside if and only if it immediately follows a Begin or an I and is immediately followed by an I or an End.
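This Inside constraint can be written down as a small check (a hypothetical helper for illustration, not part of the paper):

```python
def inside_tags_valid(chunk_tags):
    """Every I must immediately follow B or I and be immediately followed by I or E."""
    for t, tag in enumerate(chunk_tags):
        if tag == "I":
            prev_ok = t > 0 and chunk_tags[t - 1] in ("B", "I")
            next_ok = t + 1 < len(chunk_tags) and chunk_tags[t + 1] in ("I", "E")
            if not (prev_ok and next_ok):
                return False
    return True

assert inside_tags_valid(["O", "B", "I", "E", "O"])    # e.g. "the White house" tagged B I E
assert not inside_tags_valid(["O", "I", "O"])          # an I token needs support from both sides
```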

Figure 1(b) shows a sentence with multiple entity mentions. Among them, "the White house" is a triple-token facility mention with unreliable capitalization, resulting in an emerging surface form. Without the usual strong hints given by a seen surface form, Baseline predicted a false single-token mention "White". In contrast, Att utilized its multiple attention heads (Figure 2(b), 2(c), 2(d)) to consider the preceding and succeeding tokens for each token and correctly tagged the three tokens as facility:B, facility:I, facility:E.

(a) A confusing surface form for language and nationality.
(b) A triple-token mention with unreliable capitalization.
Figure 1: Example problematic entities for Baseline-BiLSTM-CNN.
                 Att-BiLSTM-CNN                                        Baseline-BiLSTM-CNN
   full     h        c       c1      c2      c3      c4      c5
O  99.05   -1.68    0.75    0.95   -1.67  -45.57   -0.81  -35.46       -0.03
S  93.74    2.69  -91.02  -90.56  -90.88  -25.61  -86.25  -84.32        0.13
B  90.99    1.21  -52.26  -90.78  -88.08  -90.88  -12.21  -87.45       -0.63
I  90.09  -28.18   -3.80  -87.93  -60.56  -50.19  -57.19  -79.63       -0.41
E  93.23    2.00  -71.50  -93.12  -36.45  -39.19  -91.90  -90.83       -0.38
Table 5: Entity-chunking ablation results.

4.6 Entity-Chunking

Entity-chunking is a subtask of NER concerned with locating entity mentions and their boundaries without disambiguating their types. For sequence-labeling models, this means correct O, S, B, I, E tagging for each token. In addition to showing that cross-structures achieved superior performance on multi-token entity mentions (Section 4.5), an ablation study focused on the chunking tags was performed to better understand how it was achieved.

Table 5 shows the entity-chunking ablation results on the OntoNotes 5.0 development set. Both the Att and Baseline models were taken without re-training for this subtask. The full column lists the performance of Att-BiLSTM-CNN on each chunking tag. The other columns list the performance differences compared to the full model. Columns h to c5 are when the full model is deprived of all other information at testing time by forcefully zeroing all vectors except the one specified by the column header: the token hidden vector, all context vectors, or the context vector of a single attention head. The figures shown in the table are per-token recalls for each chunking tag, which tell whether a part of the model is responsible for signaling the whole model to predict that tag.
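The zeroing procedure can be sketched as follows (a hypothetical recreation of the ablation, not the authors' code):

```python
import torch

def ablate(h, contexts, keep):
    """Zero every classifier input except the part named by `keep`.

    h:        (n, 200) BiLSTM hidden states
    contexts: list of per-head context matrices, each of shape (n, d_head)
    keep:     'h', 'c' (all heads), or 'c1'..'c5' (a single head)
    """
    h = h if keep == "h" else torch.zeros_like(h)
    kept = []
    for i, c in enumerate(contexts, start=1):
        use = keep == "c" or keep == f"c{i}"
        kept.append(c if use else torch.zeros_like(c))
    # the result is fed to the same affine-softmax classifier without re-training
    return torch.cat([h] + kept, dim=-1)
```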

Firstly, Att appeared to designate the task of scoring I to the attention mechanism: when only the context vectors were left (column c), the recall for I tokens dropped just a little (-3.80); when only the token hidden states were left (column h), the recall for I tokens seriously degraded (-28.18). When the hidden states and context vectors work together, the full Att model was then better at predicting multi-token entity mentions than Baseline.

Then, breaking the context vectors down to each attention head reveals that the heads have worked in cooperation: c2 and c3 focused more on scoring E (-36.45, -39.19) than I (-60.56, -50.19), while c4 focused more on scoring B (-12.21) than I (-57.19). It was when the information from all these heads was combined that Att was able to better identify a token as being Inside a multi-token mention than Baseline.

Finally, the quantitative ablation analysis of chunking tags in this section and the qualitative case-study attention visualizations in Section 4.5 explain each other: c2 and c3 especially tended to focus on looking for immediately preceding mention tokens (the diagonal shifted left in Figure 2(b), 2(c)), enabling them to signal for End and Inside; c4 tended to focus on looking for immediately succeeding mention tokens (the diagonal shifted right in Figure 2(d)), enabling it to signal for Begin and Inside. In fact, without context vectors, instead of BIE, Att would tag "the White house" as BSE and extract the same false mention "White" as the OSO of Baseline.

Lacking the ability to model cross-context patterns, Baseline inadvertently learned to fall back to predicting single-token entities (0.13 vs. -0.63, -0.41, -0.38) when an easy hint from a familiar surface form was not available. This indicates a major flaw in the BiLSTM-CNNs prevalently used for real-world NER today.

(a) Attention heat map (partial) for "…Dutch into English…".
(b) Attention heat map of one head for "…the White house…".
(c) Attention heat map of a second head for "…the White house…".
(d) Attention heat map of a third head for "…the White house…".
Figure 2: Attention heat maps for the mentions in Figure 1, best viewed on computer.

5 Conclusion

This paper has formally analyzed and remedied the deficiency of the prevalently used BiLSTM-CNN in modeling cross-context for NER. A concrete proof of its inability to capture XOR patterns has been given. Additive and multiplicative cross-structures have been shown to be crucial in modeling cross-context, significantly enhancing recognition of emerging, complex, confusing, and multi-token entity mentions. Against comparable previous models, 1.4% and 4.6% overall improvements on OntoNotes 5.0 and WNUT 2017 have been achieved, showing the importance of remedying the core module of NER.

References

  • [1] Xuezhe Ma and Eduard Hovy. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016.
  • [2] Jason Chiu and Eric Nichols. Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 2016.
  • [3] Gustavo Aguilar, Adrian Pastor López Monroy, Fabio González, and Thamar Solorio. Modeling noisiness to recognize named entities using multitask neural networks on social media. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018.
  • [4] Alan Akbik, Duncan Blythe, and Roland Vollgraf. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, 2018.
  • [5] Kevin Clark, Minh-Thang Luong, Christopher D. Manning, and Quoc Le. Semi-supervised sequence modeling with cross-view training. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018.
  • [6] Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017.
  • [7] Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. A structured self-attentive sentence embedding. In International Conference on Learning Representations (ICLR), 2017.
  • [8] Zhiheng Huang, Wei Xu, and Kai Yu. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991, 2015.
  • [9] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016.
  • [10] Lev Ratinov and Dan Roth. Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), 2009.
  • [11] Emma Strubell, Patrick Verga, David Belanger, and Andrew McCallum. Fast and accurate entity recognition with iterated dilated convolutions. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017.
  • [12] Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015.
  • [13] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • [14] Gustavo Aguilar, Suraj Maharjan, Adrian Pastor López Monroy, and Thamar Solorio. A multi-task approach for named entity recognition in social media data. In Proceedings of the 3rd Workshop on Noisy User-generated Text, 2017.
  • [15] Abbas Ghaddar and Phillippe Langlais. Robust lexical features for improved neural network named-entity recognition. In Proceedings of the 27th International Conference on Computational Linguistics, 2018.
  • [16] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M Rush. Character-aware neural language models. In AAAI, 2016.
  • [17] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
  • [18] Fréderic Godin, Baptist Vandersmissen, Wesley De Neve, and Rik Van de Walle. Multimedia lab @ ACL WNUT NER shared task: Named entity recognition for twitter microposts using distributed word representations. In Proceedings of the Workshop on Noisy User-generated Text, 2015.
  • [19] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 2011.
  • [20] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30, 2017.
  • [21] Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. Ontonotes: The 90% solution. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, 2006.
  • [22] Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Hwee Tou Ng, Anders Björkelund, Olga Uryupina, Yuchen Zhang, and Zhi Zhong. Towards robust linguistic analysis using OntoNotes. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, 2013.
  • [23] Leon Derczynski, Eric Nichols, Marieke van Erp, and Nut Limsopatham. Results of the WNUT2017 shared task on novel and emerging entity recognition. In Proceedings of the 3rd Workshop on Noisy User-generated Text, 2017.
  • [24] Benjamin Strauss, Bethany Toma, Alan Ritter, Marie-Catherine de Marneffe, and Wei Xu. Results of the WNUT16 named entity recognition shared task. In Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT), 2016.
  • [25] Timothy Dozat. Incorporating Nesterov momentum into Adam. In Proceedings of ICLR 2016 Workshop, 2016.
  • [26] Bill Y. Lin, Frank Xu, Zhiyi Luo, and Kenny Zhu. Multi-channel BiLSTM-CRF model for emerging named entity recognition in social media. In Proceedings of the 3rd Workshop on Noisy User-generated Text, 2017.