Named Entity Recognition Tool
State-of-the-art named entity recognition systems rely heavily on hand-crafted features and domain-specific knowledge in order to learn effectively from the small, supervised training corpora that are available. In this paper, we introduce two new neural architectures---one based on bidirectional LSTMs and conditional random fields, and the other that constructs and labels segments using a transition-based approach inspired by shift-reduce parsers. Our models rely on two sources of information about words: character-based word representations learned from the supervised corpus and unsupervised word representations learned from unannotated corpora. Our models obtain state-of-the-art performance in NER in four languages without resorting to any language-specific knowledge or resources such as gazetteers.READ FULL TEXT VIEW PDF
Named Entity Recognition Tool
Tensorflow-LSTM-CRF tool for Named Entity Recognizer
Named Entity Recognition
Chinese Word Segmentation Using Neural Architectures
Named entity recognition (NER) is a challenging learning problem. One the one hand, in most languages and domains, there is only a very small amount of supervised training data available. On the other, there are few constraints on the kinds of words that can be names, so generalizing from this small sample of data is difficult. As a result, carefully constructed orthographic features and language-specific knowledge resources, such as gazetteers, are widely used for solving this task. Unfortunately, language-specific resources and features are costly to develop in new languages and new domains, making NER a challenge to adapt. Unsupervised learning from unannotated corpora offers an alternative strategy for obtaining better generalization from small amounts of supervision. However, even systems that have relied extensively on unsupervised features[collobert2011natural, turian:2010, lin2009phrase, ando:2005, inter alia] have used these to augment, rather than replace, hand-engineered features (e.g., knowledge about capitalization patterns and character classes in a particular language) and specialized knowledge resources (e.g., gazetteers).
In this paper, we present neural architectures for NER that use no language-specific resources or features beyond a small amount of supervised training data and unlabeled corpora. Our models are designed to capture two intuitions. First, since names often consist of multiple tokens, reasoning jointly over tagging decisions for each token is important. We compare two models here, (i) a bidirectional LSTM with a sequential conditional random layer above it (LSTM-CRF; §2), and (ii) a new model that constructs and labels chunks of input sentences using an algorithm inspired by transition-based parsing with states represented by stack LSTMs (S-LSTM; §3). Second, token-level evidence for “being a name” includes both orthographic evidence (what does the word being tagged as a name look like?) and distributional evidence (where does the word being tagged tend to occur in a corpus?). To capture orthographic sensitivity, we use character-based word representation model [ling:2015]
to capture distributional sensitivity, we combine these representations with distributional representations[mikolov2013distributed]. Our word representations combine both of these, and dropout training is used to encourage the model to learn to trust both sources of evidence (§4).
Experiments in English, Dutch, German, and Spanish show that we are able to obtain state-of-the-art NER performance with the LSTM-CRF model in Dutch, German, and Spanish, and very near the state-of-the-art in English without any hand-engineered features or gazetteers (§5). The transition-based algorithm likewise surpasses the best previously published results in several languages, although it performs less well than the LSTM-CRF model.
We provide a brief description of LSTMs and CRFs, and present a hybrid tagging architecture. This architecture is similar to the ones presented by collobert2011natural and huang:2015.
Recurrent neural networks (RNNs) are a family of neural networks that operate on sequential data. They take as input a sequence of vectors and return another sequence that represents some information about the sequence at every step in the input. Although RNNs can, in theory, learn long dependencies, in practice they fail to do so and tend to be biased towards their most recent inputs in the sequence [bengio1994learning]
. Long Short-term Memory Networks (LSTMs) have been designed to combat this issue by incorporating a memory-cell and have been shown to capture long-range dependencies. They do so using several gates that control the proportion of the input to give to the memory cell, and the proportion from the previous state to forget[hochreiter:1997]. We use the following implementation:
is the element-wise sigmoid function, andis the element-wise product.
For a given sentence containing words, each represented as a -dimensional vector, an LSTM computes a representation of the left context of the sentence at every word . Naturally, generating a representation of the right context as well should add useful information. This can be achieved using a second LSTM that reads the same sequence in reverse. We will refer to the former as the forward LSTM and the latter as the backward LSTM. These are two distinct networks with different parameters. This forward and backward LSTM pair is referred to as a bidirectional LSTM [graves:2005].
The representation of a word using this model is obtained by concatenating its left and right context representations, . These representations effectively include a representation of a word in context, which is useful for numerous tagging applications.
A very simple—but surprisingly effective—tagging model is to use the ’s as features to make independent tagging decisions for each output [ling:2015]. Despite this model’s success in simple problems like POS tagging, its independent classification decisions are limiting when there are strong dependencies across output labels. NER is one such task, since the “grammar” that characterizes interpretable sequences of tags imposes several hard constraints (e.g., I-PER cannot follow B-LOC; see §2.4 for details) that would be impossible to model with independence assumptions.
Therefore, instead of modeling tagging decisions independently, we model them jointly using a conditional random field [lafferty2001conditional]. For an input sentence
we consider to be the matrix of scores output by the bidirectional LSTM network. is of size , where is the number of distinct tags, and corresponds to the score of the tag of the word in a sentence. For a sequence of predictions
we define its score to be
where is a matrix of transition scores such that represents the score of a transition from the tag to tag . and are the start and end tags of a sentence, that we add to the set of possible tags. is therefore a square matrix of size .
A softmax over all possible tag sequences yields a probability for the sequence:
During training, we maximize the log-probability of the correct tag sequence:
where represents all possible tag sequences (even those that do not verify the IOB format) for a sentence . From the formulation above, it is evident that we encourage our network to produce a valid sequence of output labels. While decoding, we predict the output sequence that obtains the maximum score given by:
The scores associated with each tagging decision for each token (i.e., the ’s) are defined to be the dot product between the embedding of a word-in-context computed with a bidirectional LSTM—exactly the same as the POS tagging model of ling:2015 and these are combined with bigram compatibility scores (i.e., the ’s). This architecture is shown in figure 1
. Circles represent observed variables, diamonds are deterministic functions of their parents, and double circles are random variables.
The parameters of this model are thus the matrix of bigram compatibility scores , and the parameters that give rise to the matrix , namely the parameters of the bidirectional LSTM, the linear feature weights, and the word embeddings. As in part 2.2, let denote the sequence of word embeddings for every word in a sentence, and be their associated tags. We return to a discussion of how the embeddings are modeled in Section 4. The sequence of word embeddings is given as input to a bidirectional LSTM, which returns a representation of the left and right context for each word as explained in 2.1.
These representations are concatenated () and linearly projected onto a layer whose size is equal to the number of distinct tags. Instead of using the softmax output from this layer, we use a CRF as previously described to take into account neighboring tags, yielding the final predictions for every word . Additionally, we observed that adding a hidden layer between and the CRF layer marginally improved our results. All results reported with this model incorporate this extra-layer. The parameters are trained to maximize Eq. 1 of observed sequences of NER tags in an annotated corpus, given the observed words.
The task of named entity recognition is to assign a named entity label to every word in a sentence. A single named entity could span several tokens within a sentence. Sentences are usually represented in the IOB format (Inside, Outside, Beginning) where every token is labeled as B-label if the token is the beginning of a named entity, I-label if it is inside a named entity but not the first token within the named entity, or O otherwise. However, we decided to use the IOBES tagging scheme, a variant of IOB commonly used for named entity recognition, which encodes information about singleton entities (S) and explicitly marks the end of named entities (E). Using this scheme, tagging a word as I-label with high-confidence narrows down the choices for the subsequent word to I-label or E-label, however, the IOB scheme is only capable of determining that the subsequent word cannot be the interior of another label. ratinov2009design and dai2015enhancing showed that using a more expressive tagging scheme like IOBES improves model performance marginally. However, we did not observe a significant improvement over the IOB tagging scheme.
As an alternative to the LSTM-CRF discussed in the previous section, we explore a new architecture that chunks and labels a sequence of inputs using an algorithm similar to transition-based dependency parsing. This model directly constructs representations of the multi-token names (e.g., the name Mark Watney is composed into a single representation).
This model relies on a stack data structure to incrementally construct chunks of the input. To obtain representations of this stack used for predicting subsequent actions, we use the Stack-LSTM presented by dyer:2015, in which the LSTM is augmented with a “stack pointer.” While sequential LSTMs model sequences from left to right, stack LSTMs permit embedding of a stack of objects that are both added to (using a push operation) and removed from (using a pop operation). This allows the Stack-LSTM to work like a stack that maintains a “summary embedding” of its contents. We refer to this model as Stack-LSTM or S-LSTM model for simplicity.
Finally, we refer interested readers to the original paper [dyer:2015] for details about the Stack-LSTM model since in this paper we merely use the same architecture through a new transition-based algorithm presented in the following Section.
We designed a transition inventory which is given in Figure 2 that is inspired by transition-based parsers, in particular the arc-standard parser of nivre2004. In this algorithm, we make use of two stacks (designated output and stack representing, respectively, completed chunks and scratch space) and a buffer that contains the words that have yet to be processed. The transition inventory contains the following transitions: The shift transition moves a word from the buffer to the stack, the out transition moves a word from the buffer directly into the output stack while the reduce transition pops all items from the top of the stack creating a “chunk,” labels this with label , and pushes a representation of this chunk onto the output stack. The algorithm completes when the stack and buffer are both empty. The algorithm is depicted in Figure 2, which shows the sequence of operations required to process the sentence Mark Watney visited Mars.
The model is parameterized by defining a probability distribution over actions at each time step, given the current contents of the stack, buffer, and output, as well as the history of actions taken. Following dyer:2015, we use stack LSTMs to compute a fixed dimensional embedding of each of these, and take a concatenation of these to obtain the full algorithm state. This representation is used to define a distribution over the possible actions that can be taken at each time step. The model is trained to maximize the conditional probability of sequences of reference actions (extracted from a labeled training corpus) given the input sentences. To label a new input sequence at test time, the maximum probability action is chosen greedily until the algorithm reaches a termination state. Although this is not guaranteed to find a global optimum, it is effective in practice. Since each token is either moved directly to the output (1 action) or first to the stack and then the output (2 actions), the total number of actions for a sequence of lengthis maximally .
|||||[Mark, Watney, visited, Mars]|
|Shift||||[Mark]||[Watney, visited, Mars]|
|Shift||||[Mark, Watney]||[visited, Mars]|
|REDUCE(PER)||[(Mark Watney)-PER]||||[visited, Mars]||(Mark Watney)-PER|
|OUT||[(Mark Watney)-PER, visited]||||[Mars]|
|SHIFT||[(Mark Watney)-PER, visited]||[Mars]|||
|REDUCE(LOC)||[(Mark Watney)-PER, visited, (Mars)-LOC]||||||(Mars)-LOC|
It is worth noting that the nature of this algorithm model makes it agnostic to the tagging scheme used since it directly predicts labeled chunks.
When the operation is executed, the algorithm shifts a sequence of tokens (together with their vector embeddings) from the stack to the output buffer as a single completed chunk. To compute an embedding of this sequence, we run a bidirectional LSTM over the embeddings of its constituent tokens together with a token representing the type of the chunk being identified (i.e., ). This function is given as , where is a learned embedding of a label type. Thus, the output buffer contains a single vector representation for each labeled chunk that is generated, regardless of its length.
The input layers to both of our models are vector representations of individual words. Learning independent representations for word types from the limited NER training data is a difficult problem: there are simply too many parameters to reliably estimate. Since many languages have orthographic or morphological evidence that something is a name (or not a name), we want representations that are sensitive to the spelling of words. We therefore use a model that constructs representations of words from representations of the characters they are composed of (4.1). Our second intuition is that names, which may individually be quite varied, appear in regular contexts in large corpora. Therefore we use embeddings learned from a large corpus that are sensitive to word order (4.2). Finally, to prevent the models from depending on one representation or the other too strongly, we use dropout training and find this is crucial for good generalization performance (4.3).
An important distinction of our work from most previous approaches is that we learn character-level features while training instead of hand-engineering prefix and suffix information about words. Learning character-level embeddings has the advantage of learning representations specific to the task and domain at hand. They have been found useful for morphologically rich languages and to handle the out-of-vocabulary problem for tasks like part-of-speech tagging and language modeling [ling:2015] or dependency parsing [lstmemnlp15].
Figure 4 describes our architecture to generate a word embedding for a word from its characters. A character lookup table initialized at random contains an embedding for every character. The character embeddings corresponding to every character in a word are given in direct and reverse order to a forward and a backward LSTM. The embedding for a word derived from its characters is the concatenation of its forward and backward representations from the bidirectional LSTM. This character-level representation is then concatenated with a word-level representation from a word lookup-table. During testing, words that do not have an embedding in the lookup table are mapped to a UNK embedding. To train the UNK embedding, we replace singletons with the UNK embedding with a probability . In all our experiments, the hidden dimension of the forward and backward character LSTMs are each, which results in our character-based representation of words being of dimension .
Recurrent models like RNNs and LSTMs are capable of encoding very long sequences, however, they have a representation biased towards their most recent inputs. As a result, we expect the final representation of the forward LSTM to be an accurate representation of the suffix of the word, and the final state of the backward LSTM to be a better representation of its prefix. Alternative approaches—most notably like convolutional networks—have been proposed to learn representations of words from their characters [zhang2015character, DBLP:journals/corr/KimJSR15]. However, convnets are designed to discover position-invariant features of their inputs. While this is appropriate for many problems, e.g., image recognition (a cat can appear anywhere in a picture), we argue that important information is position dependent (e.g., prefixes and suffixes encode different information than stems), making LSTMs an a priori better function class for modeling the relationship between words and their characters.
As in collobert2011natural, we use pretrained word embeddings to initialize our lookup table. We observe significant improvements using pretrained word embeddings over randomly initialized ones. Embeddings are pretrained using skip-n-gram[skipngram], a variation of word2vec [mikolov2013efficient] that accounts for word order. These embeddings are fine-tuned during training.
Word embeddings for Spanish, Dutch, German and English are trained using the Spanish Gigaword version 3, the Leipzig corpora collection, the German monolingual training data from the 2010 Machine Translation Workshop and the English Gigaword version 4 (with the LA Times and NY Times portions removed) respectively.222[graff2011spanish, biemann2007leipzig, callison2010findings, englishgigaword] We use an embedding dimension of for English, for other languages, a minimum word frequency cutoff of , and a window size of .
Initial experiments showed that character-level embeddings did not improve our overall performance when used in conjunction with pretrained word representations. To encourage the model to depend on both representations, we use dropout training [dropout], applying a dropout mask to the final embedding layer just before the input to the bidirectional LSTM in Figure 1. We observe a significant improvement in our model’s performance after using dropout (see table 5).
This section presents the methods we use to train our models, the results we obtained on various tasks and the impact of our networks’ configuration on model performance.
For both models presented, we train our networks using the back-propagation algorithm updating our parameters on every training example, one at a time, using stochastic gradient descent (SGD) with a learning rate of
and a gradient clipping of. Several methods have been proposed to enhance the performance of SGD, such as Adadelta [adadelta] or Adam [adam]. Although we observe faster convergence using these methods, none of them perform as well as SGD with gradient clipping.
Our LSTM-CRF model uses a single layer for the forward and backward LSTMs whose dimensions are set to . Tuning this dimension did not significantly impact model performance. We set the dropout rate to . Using higher rates negatively impacted our results, while smaller rates led to longer training time.
The stack-LSTM model uses two layers each of dimension for each stack. The embeddings of the actions used in the composition functions have dimensions each, and the output embedding is of dimension . We experimented with different dropout rates and reported the scores using the best dropout rate for each language.333English (D=), German, Spanish and Dutch (D=) It is a greedy model that apply locally optimal actions until the entire sentence is processed, further improvements might be obtained with beam search [zhang:2011] or training with exploration [ballesteros-arxiv16].
We test our model on different datasets for named entity recognition. To demonstrate our model’s ability to generalize to different languages, we present results on the CoNLL-2002 and CoNLL-2003 datasets [TjongKimSang:2002:ICS:1118853.1118877, TjongKimSang-DeMeulder:2003:CONLL] that contain independent named entity labels for English, Spanish, German and Dutch. All datasets contain four different types of named entities: locations, persons, organizations, and miscellaneous entities that do not belong in any of the three previous categories. Although POS tags were made available for all datasets, we did not include them in our models. We did not perform any dataset preprocessing, apart from replacing every digit with a zero in the English NER dataset.
Table 1 presents our comparisons with other models for named entity recognition in English. To make the comparison between our model and others fair, we report the scores of other models with and without the use of external labeled data such as gazetteers and knowledge bases. Our models do not use gazetteers or any external labeled resources. The best score reported on this task is by luojoint. They obtained a F of 91.2 by jointly modeling the NER and entity linking tasks [hoffart2011robust]. Their model uses a lot of hand-engineered features including spelling features, WordNet clusters, Brown clusters, POS tags, chunks tags, as well as stemming and external knowledge bases like Freebase and Wikipedia. Our LSTM-CRF model outperforms all other systems, including the ones using external labeled data like gazetteers. Our Stack-LSTM model also outperforms all previous models that do not incorporate external features, apart from the one presented by chiu2015named.
Tables 2, 3 and 4 present our results on NER for German, Dutch and Spanish respectively in comparison to other models. On these three languages, the LSTM-CRF model significantly outperforms all previous methods, including the ones using external labeled data. The only exception is Dutch, where the model of gillick2015multilingual can perform better by leveraging the information from other NER datasets. The Stack-LSTM also consistently presents state-the-art (or close to) results compared to systems that do not use external data.
As we can see in the tables, the Stack-LSTM model is more dependent on character-based representations to achieve competitive performance; we hypothesize that the LSTM-CRF model requires less orthographic information since it gets more contextual information out of the bidirectional LSTMs; however, the Stack-LSTM model consumes the words one by one and it just relies on the word representations when it chunks words.
|luojoint* + gaz||89.9|
|luojoint* + gaz + linking||91.2|
|LSTM-CRF (no char)||90.20|
|S-LSTM (no char)||87.96|
|LSTM-CRF – no char||75.06|
|S-LSTM – no char||65.87|
|LSTM-CRF – no char||73.14|
|S-LSTM – no char||69.90|
|LSTM-CRF – no char||83.44|
|S-LSTM – no char||79.46|
Our models had several components that we could tweak to understand their impact on the overall performance. We explored the impact that the CRF, the character-level representations, pretraining of our word embeddings and dropout had on our LSTM-CRF model. We observed that pretraining our word embeddings gave us the biggest improvement in overall performance of in F. The CRF layer gave us an increase of , while using dropout resulted in a difference of and finally learning character-level word embeddings resulted in an increase of about . For the Stack-LSTM we performed a similar set of experiments. Results with different architectures are given in table 5.
|LSTM||char + dropout + pretrain||89.15|
|LSTM-CRF||char + dropout||83.63|
|LSTM-CRF||pretrain + char||89.77|
|LSTM-CRF||pretrain + dropout||90.20|
|LSTM-CRF||pretrain + dropout + char||90.94|
|S-LSTM||char + dropout||80.88|
|S-LSTM||pretrain + char||89.32|
|S-LSTM||pretrain + dropout||87.96|
|S-LSTM||pretrain + dropout + char||90.33|
In the CoNLL-2002 shared task, carreras2002named obtained among the best results on both Dutch and Spanish by combining several small fixed-depth decision trees. Next year, in the CoNLL-2003 Shared Task, florian2003named obtained the best score on German by combining the output of four diverse classifiers. qi2009combining later improved on this with a neural network by doing unsupervised learning on a massive unlabeled corpus.
Several other neural architectures have previously been proposed for NER. For instance, collobert2011natural uses a CNN over a sequence of word embeddings with a CRF layer on top. This can be thought of as our first model without character-level embeddings and with the bidirectional LSTM being replaced by a CNN. More recently, huang:2015 presented a model similar to our LSTM-CRF, but using hand-crafted spelling features. zhou2015end also used a similar model and adapted it to the semantic role labeling task. lin2009phrase used a linear chain CRF with
regularization, they added phrase cluster features extracted from the web data and spelling features. passos2014lexicon also used a linear chain CRF with spelling features and gazetteers.
Language independent NER models like ours have also been proposed in the past. Cucerzan and Yarowsky cucerzan1999language,cucerzan2002language present semi-supervised bootstrapping algorithms for named entity recognition by co-training character-level (word-internal) and token-level (context) features. eisenstein2011structured use Bayesian nonparametrics to construct a database of named entities in an almost unsupervised setting. ratinov2009design quantitatively compare several approaches for NER and build their own supervised model using a regularized average perceptron and aggregating context information.
Finally, there is currently a lot of interest in models for NER that use letter-based representations. gillick2015multilingual model the task of sequence-labeling as a sequence to sequence learning problem and incorporate character-based representations into their encoder model. chiu2015named employ an architecture similar to ours, but instead use CNNs to learn character-level features, in a way similar to the work by santos2015boosting.
This paper presents two neural architectures for sequence labeling that provide the best NER results ever reported in standard evaluation settings, even compared with models that use external resources, such as gazetteers.
A key aspect of our models are that they model output label dependencies, either via a simple CRF architecture, or using a transition-based algorithm to explicitly construct and label chunks of the input. Word representations are also crucially important for success: we use both pre-trained word representations and “character-based” representations that capture morphological and orthographic information. To prevent the learner from depending too heavily on one representation class, dropout is used.
This work was sponsored in part by the Defense Advanced Research Projects Agency (DARPA) Information Innovation Office (I2O) under the Low Resource Languages for Emergent Incidents (LORELEI) program issued by DARPA/I2O under Contract No. HR0011-15-C-0114. Miguel Ballesteros is supported by the European Commission under the contract numbers FP7-ICT-610411 (project MULTISENSOR) and H2020-RIA-645012 (project KRISTINA).