CKIP Neural Chinese Word Segmentation, POS Tagging, and NER
Recent research has prevalently used BiLSTM-CNN as a core module for NER in a sequence-labeling setup. This paper formally shows the limitation of BiLSTM-CNN encoders in modeling cross-context patterns for each word, i.e., patterns crossing past and future for a specific time step. Two types of cross-structures are used to remedy the problem: a BiLSTM variant with cross-links between layers, and a multi-head self-attention mechanism. These cross-structures bring consistent improvements across a wide range of NER domains for a core system using BiLSTM-CNN without additional gazetteers, POS taggers, language-modeling, or multi-task supervision. The model surpasses comparable previous models on OntoNotes 5.0 and WNUT 2017 by 1.4% and 4.6% respectively, especially improving emerging, complex, confusing, and multi-token entity mentions, showing the importance of remedying the core module of NER.
Named Entity Recognition (NER) is a core task for information extraction. Originally a structured prediction task, NER has since been formulated as a task of sequential token labeling. BiLSTM-CNN uses a CNN to encode each word and then uses bi-directional LSTMs to encode past and future context respectively at each time step. With state-of-the-art empirical results, most regard it as a robust core module for sequence-labeling NER [1, 2, 3, 4, 5].
However, each direction of BiLSTM only sees and encodes half of a sequence at each time step. For each token, the forward LSTM only encodes past context; the backward LSTM only encodes future context. For computing sentence representations for tasks such as sentence classification and machine translation, this is not a problem, as only the rightmost hidden state of the forward LSTM and only the leftmost hidden state of the backward LSTM are used, and each of the endpoint hidden states sees and encodes the whole sentence. For computing sentence representations for sequence-labeling tasks such as NER, however, this becomes a limitation, as each token uses its own midpoint hidden states, which do not model the patterns that happen to cross past and future at this specific time step.
This paper explores two types of cross-structures to help cope with the problem: Cross-BiLSTM-CNN and Att-BiLSTM-CNN. Previous studies have tried to stack multiple LSTMs for sequence-labeling NER. As they follow the trend of stacking forward and backward LSTMs independently, the Baseline-BiLSTM-CNN is only able to learn higher-level representations of past or future per se. Instead, Cross-BiLSTM-CNN, which interleaves every layer of the two directions, models cross-context in an additive manner by learning higher-level representations of the whole context of each token. On the other hand, Att-BiLSTM-CNN models cross-context in a multiplicative manner by capturing the interaction between past and future with a dot-product self-attentive mechanism [6, 7].
Section 3 formulates the three Baseline, Cross, and Att-BiLSTM-CNN models. The section gives a concrete proof that patterns forming an XOR cannot be modeled by Baseline-BiLSTM-CNN used in all previous work. Cross-BiLSTM-CNN and Att-BiLSTM-CNN are shown to have additive and multiplicative cross-structures respectively to deal with the problem. Section 4 evaluates the approaches on two challenging NER datasets spanning a wide range of domains with complex, noisy, and emerging entities. The cross-structures bring consistent improvements over the prevalently used Baseline-BiLSTM-CNN without additional gazetteers, POS taggers, language-modeling, or multi-task supervision. The improved core module surpasses comparable previous models on OntoNotes 5.0 and WNUT 2017 by 1.4% and 4.6% respectively. Experiments reveal that emerging, complex, confusing, and multi-token entity mentions benefitted much from the cross-structures, and the in-depth entity-chunking analysis finds that the prevalently used Baseline-BiLSTM-CNN is flawed for real-world NER.
The BiLSTM-CNN work that this paper builds on stacks multiple layers of LSTM cells per direction and also uses a CNN to compute character-level word vectors alongside pre-trained word vectors. This paper largely follows that work in constructing the Baseline-BiLSTM-CNN, including the selection of raw features, the CNN, and the multi-layer BiLSTM. A subtle difference is that the earlier formulation sends the output of each direction through separate affine-softmax classifiers and then sums their probabilities, while this paper sums the scores from the affine layers before computing softmax once. While this does not change the modeling capacity considered in this paper, the baseline model does perform better than the earlier formulation.
The modeling of global contexts for sequence-labeling NER has been accomplished using traditional models with extensive feature engineering and conditional random fields (CRF). The Illinois NER tagger, for instance, is built with feature-based perceptrons. In its authors' analysis, the usefulness of Viterbi decoding is minimal and conflicts with their handcrafted global features. On the other hand, recent research on LSTM- or CNN-based sequence encoders reports empirical improvements brought by CRF [8, 1, 9, 11], as it discourages illegal predictions by explicitly modeling class transition probabilities. However, transition probabilities are independent of input sentences. In contrast, the cross-structures studied in this work provide for the direct capture of global patterns and extraction of better features to improve class observation likelihoods.
Thought to lighten the burden of compressing all relevant information into a single hidden state, attention mechanisms on top of LSTMs have shown empirical success for sequence encoders [6, 7] and decoders. Self-attention has also been used below encoders to compute word vectors conditioned on context. This work further formally analyzes the deficiency of BiLSTM encoders for sequence labeling and shows that using self-attention on top actually provides one type of cross-structure that captures interactions between past and future context.
Besides using additional gazetteers or POS taggers [14, 3, 15], there is a recent trend to use additional large-scale language-modeling corpora or additional multi-task supervision to further improve NER performance beyond bare-bone models. However, they all rely on a core BiLSTM sentence encoder with the same limitation studied and remedied in Section 3. So they would indeed benefit from the improvements presented in this paper.
All models in the experiments use the same set of raw features: character embedding, character type, word embedding, and word capitalization.
For character embedding, 25d vectors are trained from scratch, and 4d one-hot character-type features indicate whether a character is uppercase, lowercase, digit, or punctuation. Word token lengths are unified to 20 by truncation and padding. The resulting 20-by-(25+4) feature map of each token is fed to a character-trigram CNN with 20 kernels per length 1 to 3 and max-over-time pooling to compute a 60d character-based word vector [16, 2, 1].
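The character-level encoder described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `char_cnn_word_vector`, the random kernels, the absence of bias and nonlinearity, and the use of unpadded ("valid") windows are all assumptions made for the example.

```python
import numpy as np

def char_cnn_word_vector(char_feats, kernels):
    """Encode one token from its character feature map.

    char_feats: (20, 29) array, one row per character position.
    kernels: dict mapping kernel width -> array (num_kernels, width, 29).
    Returns the concatenation of max-over-time pooled kernel responses.
    """
    pooled = []
    length = char_feats.shape[0]
    for width, weights in sorted(kernels.items()):
        # All windows of `width` consecutive characters: (positions, width, 29).
        windows = np.stack(
            [char_feats[i:i + width] for i in range(length - width + 1)]
        )
        # Response of every kernel at every position: (positions, num_kernels).
        responses = np.einsum("pwd,kwd->pk", windows, weights)
        # Max-over-time pooling keeps the strongest response per kernel.
        pooled.append(responses.max(axis=0))
    return np.concatenate(pooled)

rng = np.random.default_rng(0)
token = rng.normal(size=(20, 25 + 4))  # 25d embedding + 4d character type
kernels = {w: rng.normal(size=(20, w, 29)) for w in (1, 2, 3)}
word_vec = char_cnn_word_vector(token, kernels)
print(word_vec.shape)  # (60,)
```

With 20 kernels for each of the three widths, pooling yields 3 x 20 = 60 values, matching the 60d character-based word vector.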
For word embedding, either pre-trained 300d GloVe vectors or 400d Twitter vectors are used without further tuning. Also, 4d one-hot word capitalization features indicate whether a word is uppercase, upper-initial, lowercase, or mixed-caps [19, 2].
Throughout this paper, $X$ denotes the $n$-by-$d_x$ matrix of sequence features, where $n$ is the sentence length and $d_x$ is either 364 (with GloVe) or 464 (with Twitter).
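On top of an input feature sequence, BiLSTM is used to capture the future and the past for each time step. Following previous work, 4 distinct LSTM cells – two in each direction – are stacked to capture higher-level representations:
$$\overrightarrow{H} = \overrightarrow{\mathrm{LSTM}}_2(\overrightarrow{\mathrm{LSTM}}_1(X)), \qquad \overleftarrow{H} = \overleftarrow{\mathrm{LSTM}}_4(\overleftarrow{\mathrm{LSTM}}_3(X)), \qquad H = \overrightarrow{H} \,\|\, \overleftarrow{H}$$
where $\overrightarrow{\mathrm{LSTM}}_i$, $\overleftarrow{\mathrm{LSTM}}_i$ denote applying LSTM cell $i$ in forward, backward order, $\overrightarrow{H}$, $\overleftarrow{H}$ denote the resulting feature matrices of the stacked application, and $\|$ denotes row-wise concatenation. In all the experiments, 100d LSTM cells are used, so $H \in \mathbb{R}^{n \times d_h}$ and $d_h = 200$.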
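As a rough illustration of this stacking, the sketch below builds two forward and two backward LSTM layers in NumPy and checks that the forward half of a middle token's hidden state never sees the future. The helper names (`make_lstm`, `run_lstm`, `baseline_bilstm`), the gate ordering, and the random untrained weights are assumptions for the example, not details from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def make_lstm(d_in, d_out, rng):
    # Hypothetical parameter container: input weights, recurrent weights, bias.
    return {"W": rng.normal(scale=0.1, size=(d_in, 4 * d_out)),
            "U": rng.normal(scale=0.1, size=(d_out, 4 * d_out)),
            "b": np.zeros(4 * d_out)}

def run_lstm(params, X, reverse=False):
    """Apply one LSTM cell over the rows of X in forward or backward order."""
    d = params["U"].shape[0]
    h, c = np.zeros(d), np.zeros(d)
    out = np.zeros((X.shape[0], d))
    steps = range(X.shape[0] - 1, -1, -1) if reverse else range(X.shape[0])
    for t in steps:
        z = X[t] @ params["W"] + h @ params["U"] + params["b"]
        i, f, o, g = np.split(z, 4)  # input, forget, output gates and candidate
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        out[t] = h
    return out

def baseline_bilstm(X, cells):
    # Each direction is stacked independently of the other.
    fwd = run_lstm(cells[1], run_lstm(cells[0], X))
    bwd = run_lstm(cells[3], run_lstm(cells[2], X, reverse=True), reverse=True)
    return np.concatenate([fwd, bwd], axis=1)

rng = np.random.default_rng(0)
n, d_x, d = 3, 364, 100
cells = [make_lstm(d_x, d, rng), make_lstm(d, d, rng),
         make_lstm(d_x, d, rng), make_lstm(d, d, rng)]
X_a = rng.normal(size=(n, d_x))
X_b = X_a.copy()
X_b[2] = rng.normal(size=d_x)  # change only the last (future) token

H_a, H_b = baseline_bilstm(X_a, cells), baseline_bilstm(X_b, cells)
# The forward half at the middle token is blind to the future token.
print(np.allclose(H_a[1, :d], H_b[1, :d]))  # True
```

Only the backward half of the middle token's hidden state reacts to the changed future token, which is the separation the following analysis relies on.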
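Finally, suppose there are $d_p$ token classes, the probability of each of which is given by the composition of affine and softmax transformations:
$$s_t = H_t W_p + b_p, \qquad p_{t,c} = \frac{e^{s_{t,c}}}{\sum_{j=1}^{d_p} e^{s_{t,j}}}$$
where $H_t$ is the $t$-th row of $H$, $W_p \in \mathbb{R}^{d_h \times d_p}$, $b_p \in \mathbb{R}^{d_p}$ are a trainable weight matrix and bias, and $s_{t,c}$ and $s_{t,j}$ are the $c$-th and $j$-th elements of $s_t$.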
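The affine-plus-softmax classification step can be sketched as below. The function name `token_class_probabilities` and the random inputs are illustrative assumptions; the class count 73 is chosen only as an example (18 entity types with four chunk labels each, plus O).

```python
import numpy as np

def token_class_probabilities(H, W, b):
    """Affine scores followed by a row-wise softmax.

    H: (n, d_h) hidden states, W: (d_h, d_p) weights, b: (d_p,) bias.
    """
    scores = H @ W + b
    # Subtracting the row maximum leaves softmax unchanged but avoids overflow.
    scores = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, d_h, d_p = 5, 200, 73
H = rng.normal(size=(n, d_h))
W = rng.normal(scale=0.1, size=(d_h, d_p))
b = np.zeros(d_p)
P = token_class_probabilities(H, W, b)
print(P.shape)  # (5, 73)
```

Each row of `P` is a probability distribution over token classes for one time step.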
Following previous work, the 5 chunk labels O, S, B, I, E denote if a word token is Outside any entity mentions, the Sole token of a mention, the Beginning token of a multi-token mention, In the middle of a multi-token mention, or the Ending token of a multi-token mention. Hence when there are $P$ types of named entities, the actual number of token classes is $d_p = P \times 4 + 1$ for sequence-labeling NER.
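A small sketch of how such a token-class inventory can be enumerated: every entity type contributes S, B, I, and E classes, and a single O class is shared. The function name `build_token_classes` and the three example types are assumptions for illustration; the `type:chunk` naming mirrors labels such as work-of-art:I used in the text.

```python
def build_token_classes(entity_types):
    """List all token classes: one shared O plus S/B/I/E per entity type."""
    classes = ["O"]
    for entity_type in entity_types:
        for chunk in ("S", "B", "I", "E"):
            classes.append(f"{entity_type}:{chunk}")
    return classes

classes = build_token_classes(["person", "work-of-art", "facility"])
print(len(classes))  # 13
```

With three entity types this gives 3 x 4 + 1 = 13 classes; with 18 types it gives 73.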
Consider the following four phrases that form an XOR:
1. Key and Peele (work-of-art)
2. You and I (work-of-art)
3. Key and I
4. You and Peele
The first two phrases are respectively a show title and a song title. The other two are not entities as a whole, where the last one actually occurs in an interview with Keegan-Michael Key. Suppose each phrase is the sequence given to Baseline-BiLSTM-CNN for sequence tagging, then the token "and" should be tagged as work-of-art:I in the first two cases and as O in the last two cases.
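Firstly, note that the score vector at each time step is simply the sum of contributions coming from forward and backward directions plus a bias.
$$s_t = \overrightarrow{s}_t + \overleftarrow{s}_t + b_p, \qquad \overrightarrow{s}_t = \overrightarrow{H}_t \overrightarrow{W}_p, \qquad \overleftarrow{s}_t = \overleftarrow{H}_t \overleftarrow{W}_p$$
where $\overrightarrow{W}_p$, $\overleftarrow{W}_p$ denote the top half and bottom half of $W_p$.
Suppose the indices of work-of-art:I and O are $i$, $j$ respectively. Then, to predict each "and" (time step 2) correctly, it must hold that
$$\begin{aligned}
\overrightarrow{s}^{1}_{2,i} + \overleftarrow{s}^{1}_{2,i} + b_i &> \overrightarrow{s}^{1}_{2,j} + \overleftarrow{s}^{1}_{2,j} + b_j \\
\overrightarrow{s}^{2}_{2,i} + \overleftarrow{s}^{2}_{2,i} + b_i &> \overrightarrow{s}^{2}_{2,j} + \overleftarrow{s}^{2}_{2,j} + b_j \\
\overrightarrow{s}^{3}_{2,i} + \overleftarrow{s}^{3}_{2,i} + b_i &< \overrightarrow{s}^{3}_{2,j} + \overleftarrow{s}^{3}_{2,j} + b_j \\
\overrightarrow{s}^{4}_{2,i} + \overleftarrow{s}^{4}_{2,i} + b_i &< \overrightarrow{s}^{4}_{2,j} + \overleftarrow{s}^{4}_{2,j} + b_j
\end{aligned}$$
where superscripts denote the phrase number and $b_i$, $b_j$ are the $i$-th and $j$-th elements of the bias.
Now, the catch is that phrase 1 and phrase 3 have exactly the same past context for "and". Hence the same $\overrightarrow{H}_2$ and the same $\overrightarrow{s}_2$, i.e., $\overrightarrow{s}^{1}_{2} = \overrightarrow{s}^{3}_{2}$. Similarly, $\overrightarrow{s}^{2}_{2} = \overrightarrow{s}^{4}_{2}$, $\overleftarrow{s}^{1}_{2} = \overleftarrow{s}^{4}_{2}$, and $\overleftarrow{s}^{2}_{2} = \overleftarrow{s}^{3}_{2}$. Rewriting the constraints with these equalities gives
$$\begin{aligned}
\overrightarrow{s}^{1}_{2,i} + \overleftarrow{s}^{1}_{2,i} + b_i &> \overrightarrow{s}^{1}_{2,j} + \overleftarrow{s}^{1}_{2,j} + b_j \\
\overrightarrow{s}^{2}_{2,i} + \overleftarrow{s}^{2}_{2,i} + b_i &> \overrightarrow{s}^{2}_{2,j} + \overleftarrow{s}^{2}_{2,j} + b_j \\
\overrightarrow{s}^{1}_{2,i} + \overleftarrow{s}^{2}_{2,i} + b_i &< \overrightarrow{s}^{1}_{2,j} + \overleftarrow{s}^{2}_{2,j} + b_j \\
\overrightarrow{s}^{2}_{2,i} + \overleftarrow{s}^{1}_{2,i} + b_i &< \overrightarrow{s}^{2}_{2,j} + \overleftarrow{s}^{1}_{2,j} + b_j
\end{aligned}$$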
Finally, summing the first two inequalities and summing the last two inequalities gives two contradicting constraints that cannot both be satisfied. In other words, even if an oracle is available for training the model, Baseline-BiLSTM-CNN can tag at most 3 out of the 4 occurrences of "and" correctly. No matter how many LSTM cells are stacked for each direction, the formulation in previous studies simply does not have enough modeling capacity to capture cross-context patterns for sequence-labeling NER.
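The contradiction can also be checked numerically. The sketch below does not train any model: it draws arbitrary forward contributions (depending only on the past word), backward contributions (depending only on the future word), and a bias, then tests whether any draw tags all four "and" tokens correctly. The names `margin`, `f`, `g`, and the two-class simplification (work-of-art:I versus O) are assumptions for the example.

```python
import numpy as np

def margin(f, g, b, past, future):
    """Score of work-of-art:I minus score of O for the token "and"."""
    s = f[past] + g[future] + b  # additive: forward part + backward part + bias
    return s[0] - s[1]

rng = np.random.default_rng(0)
solved_all_four = 0
for _ in range(10000):
    f = {past: rng.normal(size=2) for past in ("Key", "You")}
    g = {future: rng.normal(size=2) for future in ("Peele", "I")}
    b = rng.normal(size=2)
    m1, m2 = margin(f, g, b, "Key", "Peele"), margin(f, g, b, "You", "I")
    m3, m4 = margin(f, g, b, "Key", "I"), margin(f, g, b, "You", "Peele")
    # The two sums contain exactly the same terms, so they are always equal.
    assert np.isclose(m1 + m2, m3 + m4)
    if m1 > 0 and m2 > 0 and m3 < 0 and m4 < 0:
        solved_all_four += 1
print(solved_all_four)  # 0
```

Because the margins of phrases 1 and 2 always sum to the same value as those of phrases 3 and 4, the first pair cannot both be positive while the second pair are both negative.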
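Motivated by the limitation of the conventional Baseline-BiLSTM-CNN for sequence labeling, this paper proposes the use of Cross-BiLSTM-CNN by changing the deep structure in Section 3.2 to
$$\overrightarrow{H} = \overrightarrow{\mathrm{LSTM}}_2\big(\overrightarrow{\mathrm{LSTM}}_1(X) \,\|\, \overleftarrow{\mathrm{LSTM}}_3(X)\big), \qquad \overleftarrow{H} = \overleftarrow{\mathrm{LSTM}}_4\big(\overrightarrow{\mathrm{LSTM}}_1(X) \,\|\, \overleftarrow{\mathrm{LSTM}}_3(X)\big), \qquad H = \overrightarrow{H} \,\|\, \overleftarrow{H}$$
so that the first-layer outputs of both directions are concatenated row-wise before being fed to each second-layer cell.
As the forward and backward hidden states are interleaved between stacked LSTM layers, Cross-BiLSTM-CNN models cross-context patterns by computing representations of the whole sequence in a feed-forward, additive manner.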
Specifically, for the XOR cases introduced in Section 3.2.1, although phrase 1 and phrase 3 still have the same past context for "and" and hence the first layer can only extract the same low-level forward hidden features, the second layer considers the whole context and thus has the ability to extract different high-level hidden features for the two phrases.
As the higher-level LSTMs of Cross-BiLSTM-CNN take interleaved input from the forward and backward hidden states below, their input weight parameters are double the size of those of the first-level LSTMs. Nevertheless, the cross formulation provides the modeling capacity that is absent in previous studies no matter how many more LSTM layers are stacked.
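The interleaving can be sketched in NumPy as follows. As before, the helper names (`make_lstm`, `run_lstm`, `cross_bilstm`), the gate ordering, and the random untrained weights are assumptions for illustration only. The check at the end shows that, unlike with independently stacked directions, the forward half of the middle token's hidden state now reacts to a change in the future token.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def make_lstm(d_in, d_out, rng):
    # Hypothetical parameter container: input weights, recurrent weights, bias.
    return {"W": rng.normal(scale=0.1, size=(d_in, 4 * d_out)),
            "U": rng.normal(scale=0.1, size=(d_out, 4 * d_out)),
            "b": np.zeros(4 * d_out)}

def run_lstm(params, X, reverse=False):
    """Apply one LSTM cell over the rows of X in forward or backward order."""
    d = params["U"].shape[0]
    h, c = np.zeros(d), np.zeros(d)
    out = np.zeros((X.shape[0], d))
    steps = range(X.shape[0] - 1, -1, -1) if reverse else range(X.shape[0])
    for t in steps:
        z = X[t] @ params["W"] + h @ params["U"] + params["b"]
        i, f, o, g = np.split(z, 4)  # input, forget, output gates and candidate
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        out[t] = h
    return out

def cross_bilstm(X, cells):
    # First layer: one forward and one backward pass, concatenated per token.
    first = np.concatenate(
        [run_lstm(cells[0], X), run_lstm(cells[2], X, reverse=True)], axis=1)
    # Second layer: both directions read the interleaved first-layer output.
    fwd = run_lstm(cells[1], first)
    bwd = run_lstm(cells[3], first, reverse=True)
    return np.concatenate([fwd, bwd], axis=1)

rng = np.random.default_rng(0)
n, d_x, d = 3, 364, 100
# Second-level cells take 2*d inputs, doubling their input weight size.
cells = [make_lstm(d_x, d, rng), make_lstm(2 * d, d, rng),
         make_lstm(d_x, d, rng), make_lstm(2 * d, d, rng)]
X_a = rng.normal(size=(n, d_x))
X_b = X_a.copy()
X_b[2] = rng.normal(size=d_x)  # change only the last (future) token

H_a, H_b = cross_bilstm(X_a, cells), cross_bilstm(X_b, cells)
print(np.allclose(H_a[1, :d], H_b[1, :d]))  # False
```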
Another way to capture the interaction between past and future context per time step is to add a token-level self-attentive mechanism on top of the same BiLSTM formulation introduced in Section 3.2. Given the hidden features of a whole sequence, the model projects each hidden state to different subspaces, depending on whether it is used as the query vector to consult other hidden states for each word token, the key vector to compute its dot-similarities with incoming queries, or the value vector to be weighted and actually convey information to the querying token. As different aspects of a task can call for different attention, multiple attention heads running in parallel are used.
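Formally, let $m$ be the number of attention heads and $d_c$ be the subspace dimension. For each head $i \in \{1, \dots, m\}$, the attention weight matrix $A^i$ and context matrix $C^i$ are computed by
$$A^i = \sigma\!\left(\frac{H W^{qi} \left(H W^{ki}\right)^\top}{\sqrt{d_c}}\right), \qquad C^i = A^i H W^{vi}$$
where $W^{qi}, W^{ki}, W^{vi} \in \mathbb{R}^{d_h \times d_c}$ are trainable projection matrices and $\sigma$ performs softmax along the second dimension. Each row of the resulting $A^1, \dots, A^m \in \mathbb{R}^{n \times n}$ contains the attention weights of a token to its context, and each row of $C^1, \dots, C^m \in \mathbb{R}^{n \times d_c}$ is its context vector.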
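For Att-BiLSTM-CNN, the hidden vector and context vectors of each token are considered together for classification:
$$s_t = \left(H_t \,\|\, C^1_t \,\|\, \dots \,\|\, C^m_t\right) W_c + b_p, \qquad p_{t,c} = \frac{e^{s_{t,c}}}{\sum_{j=1}^{d_p} e^{s_{t,j}}}$$
where $C^i_t$ is the $t$-th row of $C^i$, and $W_c \in \mathbb{R}^{(d_h + m d_c) \times d_p}$ is a trainable weight matrix. In all the experiments, the same $m$ and $d_c$ are used, which fixes the shape of $W_c$.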
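A minimal NumPy sketch of multi-head dot-product self-attention over BiLSTM hidden states, followed by concatenating each token's hidden vector with its context vectors. The names `softmax_rows` and `multi_head_context`, the scaling by the square root of the subspace dimension (the common scaled dot-product form), the random projections, and the values `m = 4`, `d_c = 50` are assumptions chosen for illustration, not settings taken from the paper.

```python
import numpy as np

def softmax_rows(Z):
    """Softmax along the second dimension (one distribution per query token)."""
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def multi_head_context(H, heads):
    """Return per-head attention weights (n, n) and context matrices (n, d_c)."""
    d_c = heads[0]["q"].shape[1]
    weights, contexts = [], []
    for head in heads:
        Q, K, V = H @ head["q"], H @ head["k"], H @ head["v"]
        A = softmax_rows(Q @ K.T / np.sqrt(d_c))  # token-to-token weights
        weights.append(A)
        contexts.append(A @ V)                    # weighted sum of values
    return weights, contexts

rng = np.random.default_rng(0)
n, d_h, m, d_c = 6, 200, 4, 50  # m and d_c are illustrative values
heads = [{name: rng.normal(scale=0.1, size=(d_h, d_c)) for name in "qkv"}
         for _ in range(m)]
H = rng.normal(size=(n, d_h))
A_list, C_list = multi_head_context(H, heads)

# Hidden vector and all context vectors of each token, side by side.
features = np.concatenate([H] + C_list, axis=1)
print(features.shape)  # (6, 400)
```

The concatenated `features` would then go through the affine and softmax layers to score token classes.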
While the BiLSTM formulation stays the same as Baseline-BiLSTM-CNN, the computation of attention weights and context features models the cross interaction between past and future. To see this, the computation of attention scores can be rewritten as follows.
$$H W^{qi} \left(H W^{ki}\right)^\top = H \left(W^{qi} W^{ki\top}\right) H^\top$$
With the un-shifted covariance matrix of the projected $H = \overrightarrow{H} \,\|\, \overleftarrow{H}$, Att-BiLSTM-CNN correlates past and future context for each token in a dot-product, multiplicative manner.
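The rewriting can be verified numerically. The sketch below checks that the attention scores equal a bilinear form in the hidden states, and that splitting each hidden state into its forward and backward halves exposes explicit forward-times-backward product terms. All matrices are random and the variable names are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_c = 4, 100, 40
fwd = rng.normal(size=(n, d))          # forward hidden states (past context)
bwd = rng.normal(size=(n, d))          # backward hidden states (future context)
H = np.concatenate([fwd, bwd], axis=1)
W_q = rng.normal(size=(2 * d, d_c))
W_k = rng.normal(size=(2 * d, d_c))

scores = (H @ W_q) @ (H @ W_k).T       # queries times keys
M = W_q @ W_k.T                        # combined (2d, 2d) bilinear weights
rewritten = H @ M @ H.T

# Block expansion: the two middle terms multiply past features of one
# token with future features of another, and vice versa.
blocks = (fwd @ M[:d, :d] @ fwd.T + fwd @ M[:d, d:] @ bwd.T
          + bwd @ M[d:, :d] @ fwd.T + bwd @ M[d:, d:] @ bwd.T)

print(np.allclose(scores, rewritten), np.allclose(scores, blocks))  # True True
```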
One advantage of the multi-head self-attentive mechanism is that it only needs to be computed once per sequence, and the matrix computations are highly parallelizable, resulting in little computation time overhead. Moreover, in Section 4, the attention weights provide a better understanding of how the model learns to tackle sequence-labeling NER.
OntoNotes 5.0 Fine-Grained NER – a million-token corpus with diverse sources of newswires, web, broadcast news, broadcast conversations, magazines, and telephone conversations [21, 22]. Some are transcriptions of talk shows, and some are translations from Chinese or Arabic. The dataset contains 18 fine-grained entity types, including hard ones such as law, event, and work-of-art. All the diversities and noisiness require that models are robust across broad domains and able to capture a multitude of linguistic patterns for complex entities.
WNUT 2017 Emerging NER – a dataset providing maximally diverse, noisy, and drifting user-generated text. The training set consists of previously annotated tweets – social media text with non-standard spellings, abbreviations, and unreliable capitalization; the development set consists of newly sampled YouTube comments; the test set includes text newly drawn from Twitter, Reddit, and StackExchange. Besides drawing new samples from diverse topics across different sources, the shared task also filtered out text containing surface forms of entities seen in the training set. The resulting dataset requires models to generalize to emerging contexts and entities instead of relying on familiar surface cues.
| | OntoNotes 5.0 | WNUT 2017 |
|---|---|---|
| train | 1088.5 / 81.8 | 62.7 / 1.9 |
| dev | 147.7 / 11.0 | 15.7 / 0.8 |
| test | 152.7 / 11.2 | 23.3 / 1.0 |
All models were trained with uniform learning rate 0.001, batch size 32, and 35% dropout. Each training run lasted 400 epochs when using GloVe embedding (OntoNotes), and 1600 epochs when using Twitter embedding (WNUT). The development set of each dataset was used to select the best epoch to restore model weights for testing. Following previous work on NER, model performance was evaluated with strict mention F1 score. Training of each model on each dataset was repeated 6 times to report the mean score and standard deviation.
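Strict mention F1 counts a predicted mention as correct only if its boundaries and type both match a gold mention exactly. The sketch below is one possible implementation over OSBIE tag sequences; the function names, the `type:chunk` tag format, and the policy of discarding ill-formed spans are assumptions, and official scorers may treat ill-formed sequences differently. The tag sequences in the example are made up for illustration.

```python
def extract_mentions(tags):
    """Return the set of (start, end, type) spans encoded by OSBIE tags."""
    mentions, start, current = set(), None, None
    for idx, tag in enumerate(tags):
        if tag == "O":
            start, current = None, None
            continue
        entity_type, chunk = tag.rsplit(":", 1)
        if chunk == "S":
            mentions.add((idx, idx, entity_type))
            start, current = None, None
        elif chunk == "B":
            start, current = idx, entity_type
        elif chunk == "I":
            if current != entity_type:
                start, current = None, None  # ill-formed: drop the open span
        elif chunk == "E":
            if start is not None and current == entity_type:
                mentions.add((start, idx, entity_type))
            start, current = None, None
    return mentions

def strict_mention_f1(gold_tags, pred_tags):
    """F1 over mentions that match exactly in boundaries and type."""
    gold, pred = extract_mentions(gold_tags), extract_mentions(pred_tags)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision, recall = correct / len(pred), correct / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = ["O", "facility:B", "facility:I", "facility:E", "O"]
pred = ["O", "O", "facility:S", "O", "O"]
print(strict_mention_f1(gold, pred))  # 0.0
print(strict_mention_f1(gold, gold))  # 1.0
```

A single-token prediction inside a three-token gold mention earns no credit under this metric, which is why boundary errors on multi-token mentions are costly.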
Besides comparing to the Baseline implemented in this paper, results were also compared against previously reported results of BiLSTM-CNN, CRF-BiLSTM(-BiLSTM) [11, 26], and CRF-IDCNN on the two datasets. Among them, IDCNN was a CNN-based sentence encoder, which should not have the XOR limitation raised in this paper. Only fair comparisons against models without additional resources were made. However, the models that used those additional resources (Section 2) actually all used a BiLSTM sentence encoder with the XOR limitation, so they could indeed integrate with and benefit from the cross-structures.
Table 2 shows overall results on the two datasets spanning broad domains of newswires, broadcast, telephone, and social media. The models proposed in this paper significantly surpassed previous comparable models by 1.4% on OntoNotes and 4.6% on WNUT. Compared to the re-implemented Baseline-BiLSTM-CNN, the cross-structures brought 0.7% and 2.2% improvements on OntoNotes and WNUT. More substantial improvements were achieved for WNUT 2017 emerging NER, suggesting that cross-context patterns were even more crucial for emerging contexts and entities than familiar entities, which might often be memorized by their surface forms.
Table 3 shows significant results per entity type compared to Baseline (3% absolute F1 differences for either Cross or Att). It could be seen that harder entity types generally benefitted more from the cross-structures. For example, work-of-art/creative-work entities could in principle take any surface form – unseen, the same as a person name, abbreviated, or written with unreliable capitalization on social media. Such mentions require models to learn a deep, generalized understanding of their context to accurately identify their boundaries and disambiguate their types. Both cross-structures were more capable of dealing with such hard entities (2.1%/5.6%/3.2%/2.0%) than the prevalently used, problematic Baseline.
Moreover, disambiguating fine-grained entity types is also a challenging task. For example, entities of language and NORP often take the same surface forms. Figure 0(a) shows an example containing "Dutch" and "English". While "English" was much more frequently used as a language and was identified correctly, the "Dutch" mention was tricky for Baseline. The attention heat map (Figure 1(a)) further tells the story that Att has relied on its attention head to make context-aware decisions. Overall, both cross-structures were much better at disambiguating these fine-grained types (4.1%/0.8%/3.3%/3.4%).
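| | OntoNotes 5.0, length 1 | OntoNotes 5.0, length 2 | OntoNotes 5.0, length 3 | WNUT 2017, length 1 | WNUT 2017, length 2 | WNUT 2017, length 3 |
|---|---|---|---|---|---|---|
| Cross | +0.3% | +0.6% | +1.8% | +1.7% | +2.9% | +8.7% |
| Att | +0.1% | +1.1% | +2.3% | +1.5% | +2.0% | +2.6% |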
Table 4 shows results among different entity lengths. It could be seen that cross-structures were much better at dealing with multi-token mentions (1.8%/2.3%/8.7%/2.6%) compared to the prevalently used, problematic Baseline.
In fact, identifying correct mention boundaries for multi-token mentions poses a unique challenge for sequence-labeling models – all tokens in a mention must be tagged with correct sequential labels to form one correct prediction. Although models often rely on strong hints from a token itself or a single side of the context, cross-context modeling is required in general. For example, a token should be tagged as Inside if and only if it immediately follows a Begin or an Inside and is immediately followed by an Inside or an End.
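The Inside rule stated above depends on both neighbors at once, which a short check makes explicit. The function name `inside_is_licensed` and the bare chunk-letter representation (ignoring entity types) are assumptions made for this illustration.

```python
def inside_is_licensed(chunks, t):
    """True if position t may be Inside given its left and right neighbors.

    chunks: sequence of chunk labels from "O", "S", "B", "I", "E".
    """
    has_left = t > 0 and chunks[t - 1] in ("B", "I")
    has_right = t + 1 < len(chunks) and chunks[t + 1] in ("I", "E")
    return has_left and has_right

print(inside_is_licensed(["O", "B", "I", "E", "O"], 2))  # True
print(inside_is_licensed(["O", "B", "I", "O", "O"], 2))  # False
print(inside_is_licensed(["O", "O", "I", "E", "O"], 2))  # False
```

Neither side alone is sufficient: the second call has a valid left neighbor but no valid right neighbor, and the third call has the reverse.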
Figure 0(b) shows a sentence with multiple entity mentions. Among them, "the White house" is a triple-token facility mention with unreliable capitalization, resulting in an emerging surface form. Without usual strong hints given by a seen surface form, Baseline predicted a false single-token mention "White". In contrast, Att utilized its multiple attention heads (Figure 1(b), 1(c), 1(d)) to consider the preceding and succeeding tokens for each token and correctly tagged the three tokens as facility:B, facility:I, facility:E.
Entity-chunking is a subtask of NER concerned with locating entity mentions and their boundaries without disambiguating their types. For sequence-labeling models, this means correct O, S, B, I, E tagging for each token. In addition to showing that cross-structures achieved superior performance on multi-token entity mentions (Section 4.5), an ablation study focused on the chunking tags was performed to better understand how it was achieved.
Table 5 shows the entity-chunking ablation results on the OntoNotes 5.0 development set. Both Att and Baseline models were taken without re-training for this subtask. One column lists the performance of the full Att-BiLSTM-CNN on each chunking tag, and the other columns list performance relative to it. The ablation columns report the full model deprived of all other information at test time by forcefully zeroing all vectors except the one specified by the column header. The figures shown in the table are per-token recalls for each chunking tag, which tell whether a part of the model is responsible for signaling the whole model to predict that tag. Colors mark relatively high and low values of interest.
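The two ingredients of such an ablation, zeroing all feature groups but one at test time and measuring per-token recall for a chunking tag, can be sketched as below. The function names, the toy feature layout with two attention heads, and the example tag sequences are hypothetical and serve only to illustrate the procedure.

```python
import numpy as np

def keep_only(features, slices, name):
    """Zero every feature group except `name` (test-time ablation)."""
    ablated = np.zeros_like(features)
    ablated[:, slices[name]] = features[:, slices[name]]
    return ablated

def per_token_recall(gold, pred, tag):
    """Fraction of tokens with gold label `tag` that were predicted as `tag`."""
    gold, pred = np.asarray(gold), np.asarray(pred)
    mask = gold == tag
    return float((pred[mask] == tag).mean()) if mask.any() else float("nan")

# Hypothetical layout: hidden states followed by two heads' context vectors.
slices = {"H": slice(0, 4), "C1": slice(4, 6), "C2": slice(6, 8)}
features = np.arange(24, dtype=float).reshape(3, 8)
only_c1 = keep_only(features, slices, "C1")
print(only_c1[0])  # [0. 0. 0. 0. 4. 5. 0. 0.]

# One of the two gold Inside tokens is recovered in this made-up prediction.
print(per_token_recall(list("OBIIE"), list("OBSIE"), "I"))  # 0.5
```

In an actual ablation the zeroed features would be passed through the trained classifier, and the recall of each chunking tag would be compared with that of the full model.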
Firstly, Att appeared to designate the task of scoring I to the attention mechanism: when context vectors were left alone, the recall for I tokens only dropped a little (-3.80); when token hidden states were left alone, the recall for I tokens seriously degraded (-28.18). When hidden states and context vectors worked together, the full Att model was then better at predicting multi-token entity mentions than Baseline.
Then, breaking the context vectors down by attention head reveals that the heads worked in cooperation: two of them focused more on scoring E (-36.45, -39.19) than I (-60.56, -50.19), while another focused more on scoring B (-12.21) than I (-57.19). Only when information from all these heads was combined was Att able to better identify a token as being Inside a multi-token mention than Baseline.
Finally, the quantitative ablation analysis of chunking tags in this section and the qualitative case-study attention visualizations in Section 4.5 explain each other: the two heads that focused on scoring E, one of them especially, tended to look for immediately preceding mention tokens (the diagonal shifted left in Figure 1(b), 1(c)), enabling them to signal for End and Inside; the head that focused on scoring B tended to look for immediately succeeding mention tokens (the diagonal shifted right in Figure 1(d)), enabling it to signal for Begin and Inside. In fact, without context vectors, instead of BIE, Att would tag "the White house" as BSE and extract the same false mention of "White" as the OSO of Baseline.
Lacking the ability to model cross-context patterns, Baseline inadvertently learned to retreat to predicting single-token entities (0.13 vs. -0.63, -0.41, -0.38) when an easy hint from a familiar surface form was not available. This indicates a major flaw in the BiLSTM-CNNs prevalently used for real-world NER today.
This paper has formally analyzed and remedied the deficiency of the prevalently used BiLSTM-CNN in modeling cross-context for NER. A concrete proof of its inability to capture XOR patterns has been given. Additive and multiplicative cross-structures have been shown to be crucial in modeling cross-context, significantly enhancing recognition of emerging, complex, confusing, and multi-token entity mentions. Against comparable previous models, 1.4% and 4.6% overall improvements on OntoNotes 5.0 and WNUT 2017 have been achieved, showing the importance of remedying the core module of NER.
Modeling noisiness to recognize named entities using multitask neural networks on social media. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018.
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018.
Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015.
Journal of Machine Learning Research, 2011.
Incorporating Nesterov momentum into Adam. In Proceedings of ICLR 2016 Workshop, 2016.