Sequence Labeling: A Practical Approach

08/12/2018 · Adnan Akhundov et al., Technische Universität München

We take a practical approach to solving the sequence labeling problem, assuming unavailability of domain expertise and scarcity of informational and computational resources. To this end, we utilize a universal end-to-end Bi-LSTM-based neural sequence labeling model applicable to a wide range of NLP tasks and languages. The model combines morphological, semantic, and structural cues extracted from the data to arrive at informed predictions. The model's performance is evaluated on eight benchmark datasets (covering three tasks: POS-tagging, NER, and Chunking; and four languages: English, German, Dutch, and Spanish). We observe state-of-the-art results on four of them: CoNLL 2012 (English NER), CoNLL 2002 (Dutch NER), GermEval 2014 (German NER), and Tiger Corpus (German POS-tagging), and competitive performance on the rest.




1 Introduction

A variety of NLP tasks can be formulated as a general sequence labeling problem: given a sequence of tokens and a fixed set of labels, assign one of the labels to each token in the sequence. We consider three concrete sequence labeling tasks: part-of-speech (POS) tagging, Named Entity Recognition (NER), and Chunking (also known as shallow parsing). POS-tagging reduces to assigning a part-of-speech label to each word in a sentence; NER requires detecting (potentially multi-word) named entities, like person or organization names; Chunking aims at identifying syntactic constituents within a sentence, like noun or verb phrases.

Traditionally, sequence labeling tasks were tackled using linear statistical models, for instance Hidden Markov Models Kupiec (1992), Maximum Entropy Markov Models McCallum et al. (2000), and Conditional Random Fields Lafferty et al. (2001). In their seminal paper, Collobert et al. (2011) introduced a deep neural network-based solution to the problem, which has spawned immense research in this direction. Multiple works have since introduced different neural architectures for universal sequence labeling Huang et al. (2015); Lample et al. (2016); Ma and Hovy (2016); Chiu and Nichols (2016); Yang et al. (2016). However, aiming for better results on a particular dataset, these and numerous other works typically employ some form of feature engineering Ando and Zhang (2005); Shen and Sarkar (2005); Collobert et al. (2011); Huang et al. (2015); external data for training Ling et al. (2015); Lample et al. (2016) or in the form of lexicons and gazetteers Ratinov and Roth (2009); Passos et al. (2014); Chiu and Nichols (2016); extensive hyper-parameter search Chiu and Nichols (2016); Ma and Hovy (2016); or multi-task learning Durrett and Klein (2014); Yang et al. (2016).

Figure 1: Diagram of the proposed sequence labeling model. Learned components are shown in dashed rectangles. For clarity, computing byte and word embeddings is shown only for the first word of the sentence.

In this paper, we take an alternative stance by looking at the problem of sequence labeling from a practical perspective. The performance enhancements enumerated above typically require availability of expertise, time, or external resources. Some (or even all) of these may be unavailable to users in a practical situation. Therefore, we deliberately eschew any form of feature engineering, pre-training, external data (with the exception of publicly available word embeddings), and time-consuming hyper-parameter optimization. To this end, we formulate a single general-purpose sequence labeling model and apply it to eight different benchmark datasets to estimate the effectiveness of our approach.

Our model utilizes a bi-directional LSTM Graves and Schmidhuber (2005) to extract morphological information from the bytes of words in a sentence. These, combined with word embeddings bearing semantic cues, are fed to another Bi-LSTM to obtain word-level scores. Ultimately, the word-level scores are passed through a CRF layer Lafferty et al. (2001) to facilitate structured prediction of the labels.

The rest of the paper is organized as follows. Section 2 specifies the proposed model in detail. Sections 3 and 4 describe the datasets and the training procedure used in experiments. The results are presented in Section 5. We review related work in Section 6 and conclude in Section 7.

2 Model

Recurrent Neural Networks Elman (1990) are commonly used for modeling sequences in NLP Mikolov et al. (2010); Cho et al. (2014); Graves et al. (2013). However, because of the well-known challenges of capturing long-term dependencies with plain RNNs Pascanu et al. (2013), we instead turn to Long Short-Term Memory (LSTM) networks Hochreiter and Schmidhuber (1997), capable of alleviating the vanishing/exploding gradient problem by design. The specific LSTM formulation we are using Zaremba et al. (2014) can be described by the following equations:

i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
h_t = o_t \odot \tanh(c_t)

Here \sigma denotes the sigmoid activation function; \odot the element-wise (Hadamard) product; W_*, U_*, and b_* the learned network parameters; i_t, f_t, and o_t the input, forget, and output gates; and x_t, c_t, and h_t the network input, cell state, and network output at time step t, respectively.
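As a concrete illustration (our own sketch, not the authors' implementation), the gate equations can be written in a few lines of numpy; stacking the four gate weight matrices into one larger matrix is a layout choice of this sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; the four gate pre-activations are computed in a
    single stacked matrix multiply (a layout choice of this sketch)."""
    W, U, b = params                    # W: (4H, D), U: (4H, H), b: (4H,)
    z = W @ x + U @ h_prev + b
    H = h_prev.shape[0]
    i = sigmoid(z[0:H])                 # input gate
    f = sigmoid(z[H:2*H])               # forget gate
    o = sigmoid(z[2*H:3*H])             # output gate
    g = np.tanh(z[3*H:4*H])             # candidate cell state
    c = f * c_prev + i * g              # Hadamard products, as in the equations
    h = o * np.tanh(c)                  # new network output
    return h, c
```

Because the output gate lies in (0, 1) and tanh in (-1, 1), every component of h stays strictly inside (-1, 1).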

One issue with ordinary LSTMs is that they capture dependencies between sequence elements only in one direction, whereas it might be beneficial to learn also backward dependencies (e.g., for informed first label prediction). To overcome this limitation, we use bidirectional LSTM (Bi-LSTM) networks Graves and Schmidhuber (2005) comprising two independent LSTM instances (with separate sets of parameters): one processing an input sequence in forward direction and the other in backward direction. The output of Bi-LSTM is formed by concatenating the outputs of the two LSTMs corresponding to each sequence element.
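The forward/backward concatenation can be sketched as follows; the trivial `echo` cell in the usage below is a hypothetical stand-in for a full LSTM cell:

```python
import numpy as np

def bilstm_outputs(xs, step_fwd, step_bwd, h0, c0):
    """Run two independent recurrent cells over the sequence, one forward
    and one backward, and concatenate their outputs at each position."""
    def run(step, seq):
        h, c, outs = h0, c0, []
        for x in seq:
            h, c = step(x, h, c)
            outs.append(h)
        return outs
    fwd = run(step_fwd, xs)
    bwd = run(step_bwd, xs[::-1])[::-1]   # realign backward outputs to positions
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```

Note the realignment step: the backward LSTM consumes the reversed input, so its outputs must be reversed again before per-position concatenation.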

The diagram of the proposed model is shown in Figure 1. The model can be decomposed into three logical components: (1) computing byte embeddings of each word in a sentence by means of the Byte Bi-LSTM; (2) combining byte embeddings with word embeddings and feeding resulting joint word representations to the Word Bi-LSTM to obtain word-level scores; (3) passing word-level scores through the CRF Layer to arrive at joint prediction of labels. We describe each of these components in detail in the following subsections.

2.1 Byte Embeddings

Assuming that the input is available in tokenized form, we enable the model to extract morphological information from tokens by analyzing their character-level representations. Following Ling et al. (2015), we apply a Bi-LSTM network to this task. However, to be truly neutral with respect to languages and their character sets, we consider the sequence of bytes underlying the UTF-8 encoding of each word instead of its characters.

Formally, given a sequence of words w_1, …, w_n in a sentence, we first decompose each word w_i into its characters, including dummy start- and end-of-word characters (we assume that the i-th word consists of m_i characters, including the two dummy ones). We then convert the sequence of characters into the sequence of underlying UTF-8 bytes (for simplicity, here we assume that all characters are single-byte, but this is obviously not necessary). The dummy characters are encoded by the special byte values 0x01 (start-of-word) and 0x02 (end-of-word). Next, we compute the byte projections of the bytes in the sequence by multiplying the byte projection matrix B by a one-of-256 coded vector representation of each byte. The matrix B is a learned model parameter. Each of the obtained byte projection vectors has a fixed dimensionality d_b, which is a hyper-parameter of the model.

Next, the byte projection sequence is fed to the Byte Bi-LSTM network. The last outputs of its forward and backward LSTMs are concatenated to obtain the "byte embedding" vector of the i-th word. These fixed-size vectors are assumed to capture morphological information about the corresponding words in a sentence.
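A minimal sketch of the byte decomposition described above; the marker byte values are the ones named in the text, while the projection size is illustrative (and the random matrix stands in for the learned B):

```python
import numpy as np

def word_to_bytes(word):
    """UTF-8 bytes of a word, wrapped in the dummy start/end-of-word markers."""
    SOW, EOW = 0x01, 0x02
    return [SOW] + list(word.encode("utf-8")) + [EOW]

# Multiplying B by a one-of-256 coded vector is simply a column lookup:
rng = np.random.default_rng(0)
d_b = 8                                   # illustrative projection size
B = rng.normal(size=(d_b, 256))           # learned in the real model, random here
projections = [B[:, b] for b in word_to_bytes("Zürich")]
```

The multi-byte 'ü' (0xC3 0xBC) shows why the scheme is language-neutral: the model never needs a per-language character inventory, only the 256 possible byte values.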

2.2 Word-level Representation

Morphological information alone is probably not representative enough to reliably predict target labels in a general sequence labeling setting. For this reason, we would also want to supply our prediction framework with semantic information. We fulfill this requirement by mixing in pre-trained word embeddings of words in a sentence. Then, following huang2015bidirectional, we infer word-level representation using a Bi-LSTM network.

Formally, given a vocabulary of size |V| and a fixed (not learned) word embedding matrix E, the word embedding vector of the i-th word is obtained by multiplying E by a one-of-|V| coded vector representation of the word's position in the vocabulary (the dimensionality d_w of the embedding vectors depends on the choice of word embeddings). For all out-of-vocabulary words we use the same additional "unknown" word embedding vector.
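The lookup can be sketched as follows (function and variable names are ours, not the paper's):

```python
import numpy as np

def word_embedding(word, vocab, E, unk):
    """One-of-|V| lookup in the frozen embedding matrix E (one column per
    vocabulary word); every out-of-vocabulary word maps to the shared
    'unknown' vector."""
    idx = vocab.get(word)
    return E[:, idx] if idx is not None else unk
```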

Next, the computed byte embedding of the i-th word is concatenated with its word embedding to produce a joint embedding. This way we obtain a sequence of joint embeddings corresponding to the words in a sentence. The joint embeddings are assumed to capture both morphological and semantic information about the words. They are fed as inputs to the Word Bi-LSTM network. The outputs of the network at each time step are passed through a linear layer (with no activation function) to yield L-dimensional word-level score vectors s_1, …, s_n, where L denotes the number of distinct labels. In essence, the word-level score vectors may be interpreted as unnormalized log-probabilities (logits) over the labels at each time step.

Theoretically, we could stop here by applying softmax to each word-level score vector to infer a distribution over possible labels at each time step. However, this approach (albeit bearing a certain degree of context-awareness due to the presence of the upstream Word Bi-LSTM) would treat each word more or less locally, lacking a global view over predicted labels. This is why we turn to a CRF layer as the last step of label inference.

| Dataset | Training (sent. / tokens) | Development (sent. / tokens) | Testing (sent. / tokens) | Labels (classes) |
|---|---|---|---|---|
| CoNLL 2000 (English, Chunking) | 8,936 / 211,727 | - | 2,012 / 47,377 | 45 (11) |
| CoNLL 2002 (Spanish, NER) | 8,323 / 264,715 | 1,915 / 52,923 | 1,517 / 51,533 | 17 (4) |
| CoNLL 2002 (Dutch, NER) | 15,806 / 202,644 | 2,895 / 37,687 | 5,195 / 68,875 | 17 (4) |
| CoNLL 2003 (English, NER) | 14,041 / 203,621 | 3,250 / 51,362 | 3,453 / 46,435 | 17 (4) |
| CoNLL 2012 (English, NER) | 59,924 / 1,088,503 | 8,528 / 147,724 | 8,262 / 152,728 | 73 (18) |
| GermEval 2014 (German, NER) | 24,000 / 452,853 | 2,200 / 41,653 | 5,100 / 96,499 | 49 (12) |
| WSJ/PTB (English, POS-tag.) | 38,219 / 912,344 | 5,527 / 131,768 | 5,462 / 129,654 | 45 (45) |
| Tiger Corpus (German, POS-tag.) | 40,472 / 719,530 | 5,000 / 76,704 | 5,000 / 92,004 | 54 (54) |

Table 1: Statistics of the eight benchmark datasets used in the experiments.

2.3 CRF Layer

Oftentimes, labels predicted at different time steps follow certain structural patterns. For example, the IOB labeling scheme has specific rules constraining label transitions: an I-ORG label can follow only a B-ORG or another I-ORG, but no other label. To learn patterns like this one, following Lample et al. (2016), we utilize Conditional Random Fields (CRFs) Lafferty et al. (2001) in the final component of our model. CRFs can capture dependencies between labels predicted at different time steps by modeling probabilities of transitions from one step to the next. Linear-chain CRFs model transitions between neighboring pairs of labels in a sequence and allow solving the structured prediction problem in a computationally feasible way.

We recall that a sequence of word-level score vectors s_1, …, s_n is inferred by the Word Bi-LSTM. The j-th component of the i-th score vector, s_i[j], represents the unnormalized log-probability of assigning the j-th label to the word at the i-th position. Joint prediction is modeled by introducing a total score of a sequence of labels y = (y_1, …, y_n) given a sequence of words w = (w_1, …, w_n):

S(w, y) = \sum_{i=1}^{n} s_i[y_i] + \sum_{i=2}^{n} A_{y_{i-1}, y_i}

where A is an L × L matrix of label transition scores (A_{j,k} is the score of a transition from label j to label k; L represents the number of distinct labels) and the word-level scores s_i obviously depend on w. The matrix A is another learned model parameter.
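The total score of one label sequence, emission plus transition terms, can be sketched directly (our own transcription, with our indexing conventions):

```python
import numpy as np

def sequence_score(scores, A, labels):
    """Total score of one label sequence: the word-level score of the chosen
    label at each position, plus a transition score for each adjacent pair."""
    total = scores[0, labels[0]]
    for i in range(1, len(labels)):
        total += A[labels[i - 1], labels[i]] + scores[i, labels[i]]
    return total
```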

The probability of observing a particular sequence of labels y given a sequence of words w can be computed by applying softmax over the total scores of all possible label assignments Y(w) to the sequence of words w (θ denotes the set of all learned model parameters):

p(y | w; θ) = \frac{\exp(S(w, y))}{\sum_{y' \in Y(w)} \exp(S(w, y'))}     (1)

And the corresponding log-probability is:

\log p(y | w; θ) = S(w, y) - \log \sum_{y' \in Y(w)} \exp(S(w, y'))     (2)

We learn the model parameters θ by maximizing the log-likelihood (2) given the ground-truth labels y corresponding to the input sequence w. Computing the normalizing factor from equation (1), as well as deriving the most likely sequence of labels at test time, is performed using dynamic programming Rabiner (1989).
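Both dynamic-programming quantities mentioned here, the normalizing factor and the most likely label sequence, admit the standard forward and Viterbi recursions; a numpy sketch (ours, not the paper's code):

```python
import numpy as np

def log_partition(scores, A):
    """log of the sum of exp(total score) over all label sequences,
    via the forward algorithm (log-sum-exp at each step for stability)."""
    alpha = scores[0].copy()                 # log-scores of length-1 prefixes
    for t in range(1, len(scores)):
        m = alpha[:, None] + A               # prefix score + transition
        mx = m.max(axis=0)
        alpha = mx + np.log(np.exp(m - mx).sum(axis=0)) + scores[t]
    mx = alpha.max()
    return mx + np.log(np.exp(alpha - mx).sum())

def viterbi(scores, A):
    """Most likely label sequence under the same scoring."""
    n, L = scores.shape
    delta, back = scores[0].copy(), np.zeros((n, L), dtype=int)
    for t in range(1, n):
        m = delta[:, None] + A
        back[t] = m.argmax(axis=0)           # best previous label per current label
        delta = m.max(axis=0) + scores[t]
    path = [int(delta.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

Both run in O(n·L²) time, versus the L^n label sequences a brute-force sum or search would have to enumerate.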

3 Datasets

We evaluate the performance of our approach on eight benchmark datasets covering four languages and three sequence labeling tasks. Their statistics are shown in Table 1. The labels of all NER and Chunking datasets are converted to the IOBES tagging scheme, as it is reported to increase predictive performance Ratinov and Roth (2009). We briefly discuss each of the datasets in the subsections below.
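The IOB-to-IOBES conversion mentioned here can be sketched as follows (assuming well-formed IOB2 input; the function name is ours):

```python
def iob_to_iobes(tags):
    """Convert IOB2 tags to IOBES: single-token spans become S-, the last
    token of a multi-token span becomes E- (assumes well-formed IOB2 input)."""
    out = []
    for i, tag in enumerate(tags):
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == "I-" + tag[2:] if len(tag) > 2 else False
        if tag.startswith("B-"):
            out.append(("B-" if continues else "S-") + tag[2:])
        elif tag.startswith("I-"):
            out.append(("I-" if continues else "E-") + tag[2:])
        else:
            out.append(tag)   # "O" stays as-is
    return out
```

For example, `["B-PER", "I-PER", "O", "B-LOC"]` becomes `["B-PER", "E-PER", "O", "S-LOC"]`: the explicit span boundaries are what the cited work credits for the performance gain.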

3.1 CoNLL 2000

The CoNLL 2000 dataset Tjong Kim Sang and Buchholz (2000) was introduced as part of a shared task on Chunking. Sections 15-18 of the Wall Street Journal part of the Penn Treebank corpus Marcus et al. (1993) are used for training, and section 20 for testing. Due to the lack of a specifically demarcated development set, we use a randomly sampled 10% of the training set for this purpose (see Section 4 for the details of training).

3.2 CoNLL 2002

The CoNLL 2002 dataset Tjong Kim Sang (2002) was used for a shared task on language-independent Named Entity Recognition. The data consists of news wire text covering two languages: Spanish and Dutch. In our experiments we treat the Spanish and Dutch data separately, as two different datasets. The dataset is annotated with four entity types: persons (PER), organizations (ORG), locations (LOC), and miscellaneous names (MISC).

3.3 CoNLL 2003

CoNLL 2003 Tjong Kim Sang and De Meulder (2003) is an NER dataset structurally similar to CoNLL 2002 (including entity types), but covering English and German. The English data is based on news stories from the Reuters Corpus Rose et al. (2002). We use the English portion of the dataset in the experiments.

3.4 CoNLL 2012

The CoNLL 2012 dataset Pradhan et al. (2012) was created for a shared task on multilingual unrestricted coreference resolution. The dataset is based on OntoNotes corpus v5.0 Hovy et al. (2006) and, among others, has named entity annotations. It is substantially larger and more diverse than the previously described NER datasets (see Table 1 for detailed comparison). Although some sources Durrett and Klein (2014); Chiu and Nichols (2016) refer to the dataset as ”OntoNotes”, we stick to the name ”CoNLL 2012” as the train/dev/test split that is used by this and other works is not defined in the OntoNotes corpus. Following durrett2014joint, we exclude the New Testament part of the data, as it is lacking gold annotations.

3.5 GermEval 2014

GermEval 2014 Benikova et al. (2014) is a recently organized shared task on German Named Entity Recognition. The corresponding dataset has four main entity types (Location, Person, Organization, and Other) and two sub-types of each type, ”-deriv” and ”-part”, indicating derivation from and inclusion of a named entity respectively. The original dataset has two levels of labeling: outer and inner. However, we use only outer labels in the experiments and compare our results to other works on the ”M3: Outer Chunks” metric.

3.6 Wall Street Journal / Penn Treebank

The Wall Street Journal section of the Penn Treebank corpus Marcus et al. (1993) is commonly used as a benchmark dataset for the English POS-tagging task. We follow this tradition and use a standard split of sections: 0-18 for training, 19-21 for development, and 22-24 for testing Toutanova et al. (2003).

3.7 Tiger Corpus

Tiger Corpus Brants et al. (2002) is an extensive collection of German newspaper texts. The dataset has several different types of annotations; we use the part-of-speech annotations to set up a German POS-tagging task. Following Fraser et al. (2013), we use the first 40,472 of the originally ordered sentences for training, the next 5,000 for development, and the last 5,000 for testing.

4 Training

The training procedure described in this section is used for every experiment on every dataset mentioned in this paper. The model is trained end-to-end, accepting tokenized sentences as input and predicting per-token labels as output.

The model is trained using the Adam optimizer Kingma and Ba (2014). Following Ma and Hovy (2016), we apply staircase learning rate decay:

η_e = η_0 / (1 + d · e)

where η_e is the learning rate used throughout the e-th epoch (e starts at 0), η_0 is the initial learning rate, and d is the learning rate decay factor. In our experiments we keep η_0 and d fixed across all datasets.
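A staircase schedule of this kind keeps the rate constant within an epoch and shrinks it between epochs; a sketch with purely illustrative constants (the paper's exact values are not recoverable from this copy):

```python
def staircase_lr(epoch, eta0=0.01, d=0.05):
    """Learning rate held constant within an epoch and decayed between
    epochs: eta_e = eta_0 / (1 + d * e), epochs counted from 0.
    eta0 and d are placeholder values, not the paper's."""
    return eta0 / (1.0 + d * epoch)
```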

Motivated by initial experiments, we fix the dimension of the byte projections, the sizes of the Byte and Word Bi-LSTMs, the number of training epochs, and the batch size (in sentences), keeping them unchanged across all experiments. Early stopping Caruana et al. (2001) is used: at the end of every epoch the model is evaluated against the development set; eventually, the parameter values performing best on the development set are declared the final values.

Due to the high expressive power of the proposed model, we use dropout Srivastava et al. (2014) to reduce the possibility of overfitting. Dropout is applied to the word embeddings and byte projections, as well as to the outputs of the Byte Bi-LSTM and the Word Bi-LSTM. As per Zaremba et al. (2014), we do not apply dropout to the state transitions of the LSTM networks.

Publicly available word embeddings are used for every language in the experiments. For the English datasets we use uncased GloVe embeddings Pennington et al. (2014) trained on the English Wikipedia and Gigaword 5 corpora. For the other languages we use cased Polyglot embeddings Al-Rfou et al. (2013), trained on the respective Wikipedia corpus of each language. Maintaining our commitment to the practical approach, we freeze the word embeddings during training, not allowing them to train (except for the "unknown" word embedding, which is trained).

To achieve higher efficiency, we compute the joint embedding of every unique word in a batch only once Ling et al. (2015). The unique joint embeddings are scattered according to the word positions in the input sentences before being fed to the Word Bi-LSTM. The gradient with respect to each unique byte embedding is accumulated before being back-propagated once through the Byte Bi-LSTM. Albeit marginal during training (with small batches), the performance improvement becomes considerable during inference (with larger batches).
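The compute-once-and-scatter trick can be sketched as follows (the `embed_word` callable is a hypothetical stand-in for the full Byte Bi-LSTM pass):

```python
def embed_batch(sentences, embed_word):
    """Run the expensive per-word embedding computation once per unique
    word in the batch, then scatter the results to every occurrence."""
    unique_words = {w for sent in sentences for w in sent}
    table = {w: embed_word(w) for w in unique_words}       # one pass per unique word
    return [[table[w] for w in sent] for sent in sentences]
```

The savings grow with batch size, since larger batches contain proportionally more repeated words.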

| Dataset | word | word + crf | byte | byte + crf | byte + word | byte + word + crf |
|---|---|---|---|---|---|---|
| CoNLL 2000 (English, Chunking, F) | 91.39 | 92.70 | 92.75 | 93.41 | 93.93 | **94.74** |
| CoNLL 2002 (Spanish, NER, F) | 77.55 | 80.31 | 77.29 | 79.62 | 82.04 | **84.36** |
| CoNLL 2002 (Dutch, NER, F) | 74.95 | 76.91 | 75.14 | 78.06 | 83.26 | **85.61** |
| CoNLL 2003 (English, NER, F) | 86.35 | 87.58 | 81.80 | 82.85 | 89.70 | **91.11** |
| CoNLL 2012 (English, NER, F) | 79.00 | 82.95 | 79.93 | 83.70 | 84.69 | **87.84** |
| GermEval 2014 (German, NER, F) | 68.86 | 71.90 | 65.79 | 69.30 | 76.74 | **79.21** |
| WSJ/PTB (English, POS-tag., %) | 95.22 | 95.30 | 97.11 | 97.14 | **97.45** | 97.43 |
| Tiger Corpus (German, POS-tag., %) | 95.60 | 95.76 | 97.68 | 97.78 | 98.39 | **98.40** |

Table 2: Results of the ablation studies. word and byte indicate inclusion of the respective embeddings into joint embeddings; crf indicates presence of the CRF layer. The testing score of the best model per dataset is marked in bold.
Model F Score
shen2005voting 94.01
collobert2011natural 94.32
sun2008modeling 94.34
huang2015bidirectional 94.46
zhai2017neural 94.72
Our Model 94.74
yang2016multi 95.41
peters2017semi 96.37
Table 3: Comparison with other works on CoNLL 2000 dataset (English, Chunking).
Model F Score
carreras2002named 81.39
dos2015boosting 82.21
gillick2016multilingual 82.95
Our Model 84.36
lample2016neural 85.75
yang2016multi 85.77
Table 4: Comparison with other works on CoNLL 2002 dataset (Spanish, NER).
Model F Score
carreras2002named 77.05
nothman2013learning 78.60
lample2016neural 81.74
gillick2016multilingual 82.84
yang2016multi 85.19
Our Model 85.61
Table 5: Comparison with other works on CoNLL 2002 dataset (Dutch, NER).
Model F Score
collobert2011natural 89.59
huang2015bidirectional 90.10
lample2016neural 90.94
Our Model 91.11
luo2015joint 91.20
yang2016multi 91.20
ma2016end 91.21
chiu2016named 91.62
peters2017semi 91.93
Table 6: Comparison with other works on CoNLL 2003 dataset (English, NER).

5 Results

We present the results of our experiments in two different contexts. Table 2 shows the performance of different model configurations gauged in ablation studies. Tables 3, 4, 5, 6, 7, 8, 9, and 10 juxtapose our results on each dataset with those of other works reporting results on the same dataset. When citing results of other works, we indicate the best performance reported in the corresponding publication (independent of the methodology used). Each of our scores reported in Tables 2-10 was achieved by a trained model on the dataset's official test set. The scores were verified using the CoNLL 2000 evaluation script.

Model F Score
ratinov2009design 83.45 (result taken from durrett2014joint)
durrett2014joint 84.04
chiu2016named 86.28
strubell2017fast 86.99
Our Model 87.84
Table 7: Comparison with other works on CoNLL 2012 dataset (English, NER).
Model F Score
schuller2014mostner 73.24
reimers2014germeval 76.91
agerri2016robust 78.42
hanig2014modular 79.08
Our Model 79.21
Table 8: Comparison with other works on GermEval 2014 dataset (German, NER).
Model Accuracy
toutanova2003feature 97.24
collobert2011natural 97.29
santos2014learning 97.32
sun2014structure 97.36
Our Model 97.45
ma2016end 97.55
huang2015bidirectional 97.55
yang2016multi 97.55
ling2015finding 97.78
Table 9: Comparison with other works on WSJ/PTB dataset (English, POS-tag.).
Model Accuracy
labeau2015non 97.14
muller2013efficient 97.44
nguyen2016robust 97.46
muller2015robust 97.73
ling2015finding 98.08 (not directly comparable to ours, as the authors used a different train/dev/test split of Tiger Corpus)
Our Model 98.40
Table 10: Comparison with other works on Tiger Corpus dataset (German, POS-tag.).

The ablation studies examined different partial configurations of the full model described in Section 2 evaluated on each of the eight datasets. The configurations were obtained by altering the contents of joint embeddings and omitting the CRF layer. Joint embeddings were set to only word, only byte, or both word and byte embeddings. For each of these three settings, the CRF layer was included or excluded, amounting to six configurations in total.

The results of the ablation studies in Table 2 show that, on every dataset, word and byte embeddings used jointly substantially outperformed either of them used individually. This emphasizes the importance of using both embedding types, bringing in both semantic and morphological information about the words, for solving the general sequence labeling task. The CRF layer proved crucial for all NER and Chunking tasks, but not for POS-tagging: given the same word representation, configurations with and without the CRF layer demonstrated very similar results on both the English and German POS-tagging datasets. This supports the conjecture that a CRF layer provides a substantial incremental benefit when used for solving a structured prediction task (e.g., NER and Chunking). It is also worth mentioning that on five out of eight datasets, using byte embeddings alone yielded better results than using word embeddings alone, both with and without the CRF layer. This may be explained by the fact that word embeddings are external (and not necessarily related) to the data, while byte embeddings are always learned from the data itself.

Comparison of our results with those of other works, shown in Tables 3 through 10, demonstrates that our model has achieved state-of-the-art performance on four out of eight datasets: namely, 85.61 F on CoNLL 2002 (Dutch NER), 87.84 F on CoNLL 2012 (English NER), 79.21 F on GermEval 2014 (German NER), and 98.40% on Tiger Corpus (German POS-tagging). We consider the scores obtained on the remaining four datasets to be competitive.

6 Related Work

Arguably, the modern era of deep neural network-based NLP, and of sequence labeling in particular, emerged with the work of Collobert et al. (2011). Using pre-trained word and feature embeddings as word-level inputs, the authors applied a CNN with "max over time" pooling and a downstream MLP. As an objective, they maximized a "sentence-level log-likelihood" similar to the CRF-based approach from Section 2.3.

Since then, many more sophisticated neural models have been proposed. Combining a Bi-LSTM network with a CRF layer to model the sequence of words in a sentence was first introduced by Huang et al. (2015). The authors employed a substantial amount of engineered features, as well as external gazetteers. Ling et al. (2015) proposed extracting morphological information from words using a character-level Bi-LSTM. The authors combined the character-level embeddings with proprietary word embeddings (trained in-house) to obtain the current state-of-the-art result on the WSJ/PTB English POS-tagging dataset.

dos Santos and Guimarães (2015) augmented the model of Collobert et al. (2011) with a CNN-based character-level feature extractor network ("CharWNN") and applied the resulting model to NER. Chiu and Nichols (2016) combined CharWNN with a word-level Bi-LSTM. A similar approach was proposed by Ma and Hovy (2016); however, unlike Chiu and Nichols (2016), they did not use any feature engineering or external lexicons and still obtained respectably high performance.

One of the two models proposed by Lample et al. (2016) is quite similar to ours, except that the authors assumed a fixed set of characters for each language, whereas we turn to bytes as a universal medium for encoding character-level information. Also, Lample et al. (2016) pre-trained their own word embeddings, which turned out to have a crucial impact on their results.

An interesting approach applying cross-lingual multi-task learning to the sequence labeling problem was introduced in the work of Yang et al. (2016). The authors used a hierarchical bi-directional GRU (on the character and word levels) and optimized a modified version of a CRF objective function. Their model, in conjunction with the applied multi-task learning framework, allowed the authors to obtain state-of-the-art results on multiple datasets.

7 Conclusion

We evaluate the performance of a single general-purpose neural sequence labeling model, assuming unavailability of domain expertise and scarcity of informational and computational resources. The work explores the frontiers of what may be achieved with a generic and resource-efficient sequence labeling framework applied to a diverse set of NLP tasks and languages.

For this purpose, we’ve applied the model (Section 2) and the end-to-end training methodology (Section 4) to eight benchmark datasets (Section 3), covering four languages and three tasks. The obtained results have convinced us that, with a model of sufficient learning capacity, it is indeed possible to achieve competitive sequence labeling performance without the burden of delving into specificities of each particular task and language, and summoning additional resources.

For future work, we envision the integration of multi-task learning techniques (e.g., those used by Yang et al. (2016)) into the proposed approach. We suppose that this may improve the current results without compromising our general applicability and practicality assumptions.


  • Agerri and Rigau (2016) Rodrigo Agerri and German Rigau. 2016. Robust multilingual named entity recognition with shallow semi-supervised features. Artificial Intelligence, 238:63–82.
  • Al-Rfou et al. (2013) Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2013. Polyglot: Distributed word representations for multilingual nlp. CoNLL-2013, page 183.
  • Ando and Zhang (2005) Rie Kubota Ando and Tong Zhang. 2005. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(Nov):1817–1853.
  • Benikova et al. (2014) Darina Benikova, Chris Biemann, Max Kisselew, and Sebastian Pado. 2014. Germeval 2014 named entity recognition shared task: companion paper. Organization, 7:281.
  • Brants et al. (2002) Sabine Brants, Stefanie Dipper, Silvia Hansen, Wolfgang Lezius, and George Smith. 2002. The tiger treebank. In Proceedings of the workshop on treebanks and linguistic theories, volume 168.
  • Carreras et al. (2002) Xavier Carreras, Lluis Marquez, and Lluís Padró. 2002. Named entity extraction using adaboost. In proceedings of the 6th conference on Natural language learning-Volume 20, pages 1–4. Association for Computational Linguistics.
  • Caruana et al. (2001) Rich Caruana, Steve Lawrence, and C Lee Giles. 2001. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In Advances in neural information processing systems, pages 402–408.
  • Chiu and Nichols (2016) Jason PC Chiu and Eric Nichols. 2016. Named entity recognition with bidirectional lstm-cnns. Transactions of the Association for Computational Linguistics, 4:357–370.
  • Cho et al. (2014) Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder–decoder approaches. Syntax, Semantics and Structure in Statistical Translation, page 103.
  • Collobert et al. (2011) Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.
  • Durrett and Klein (2014) Greg Durrett and Dan Klein. 2014. A joint model for entity analysis: Coreference, typing, and linking. Transactions of the Association for Computational Linguistics, 2:477–490.
  • Elman (1990) Jeffrey L Elman. 1990. Finding structure in time. Cognitive science, 14(2):179–211.
  • Fraser et al. (2013) Alexander Fraser, Helmut Schmid, Richárd Farkas, Renjing Wang, and Hinrich Schütze. 2013. Knowledge sources for constituent parsing of german, a morphologically rich and less-configurational language. Computational Linguistics, 39(1):57–85.
  • Gillick et al. (2016) Dan Gillick, Cliff Brunk, Oriol Vinyals, and Amarnag Subramanya. 2016. Multilingual language processing from bytes. In Proceedings of NAACL-HLT, pages 1296–1306.
  • Graves et al. (2013) Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In Acoustics, speech and signal processing (icassp), 2013 ieee international conference on, pages 6645–6649. IEEE.
  • Graves and Schmidhuber (2005) Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural Networks, 18(5-6):602–610.
  • Hänig et al. (2014) C Hänig, S Bordag, and S Thomas. 2014. Modular classifier ensemble architecture for named entity recognition on low resource systems. In Proceedings of the KONVENS GermEval Shared Task on Named Entity Recognition, Hildesheim, Germany.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
  • Hovy et al. (2006) Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2006. OntoNotes: The 90% solution. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, pages 57–60. Association for Computational Linguistics.
  • Huang et al. (2015) Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Kupiec (1992) Julian Kupiec. 1992. Robust part-of-speech tagging using a hidden Markov model. Computer Speech & Language, 6(3):225–242.
  • Labeau et al. (2015) Matthieu Labeau, Kevin Löser, and Alexandre Allauzen. 2015. Non-lexical neural architecture for fine-grained POS tagging. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 232–237.
  • Lafferty et al. (2001) John D Lafferty, Andrew McCallum, and Fernando CN Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 282–289. Morgan Kaufmann Publishers Inc.
  • Lample et al. (2016) Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of NAACL-HLT, pages 260–270.
  • Ling et al. (2015) Wang Ling, Chris Dyer, Alan W. Black, Isabel Trancoso, Ramon Fermandez, Silvio Amir, Luís Marujo, and Tiago Luís. 2015. Finding function in form: Compositional character models for open vocabulary word representation. In EMNLP.
  • Luo et al. (2015) Gang Luo, Xiaojiang Huang, Chin-Yew Lin, and Zaiqing Nie. 2015. Joint entity recognition and disambiguation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 879–888.
  • Ma and Hovy (2016) Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064–1074.
  • Marcus et al. (1993) Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
  • McCallum et al. (2000) Andrew McCallum, Dayne Freitag, and Fernando CN Pereira. 2000. Maximum entropy Markov models for information extraction and segmentation. In ICML, volume 17, pages 591–598.
  • Mikolov et al. (2010) Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černockỳ, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.
  • Müller et al. (2013) Thomas Müller, Helmut Schmid, and Hinrich Schütze. 2013. Efficient higher-order CRFs for morphological tagging. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 322–332.
  • Müller and Schütze (2015) Thomas Müller and Hinrich Schütze. 2015. Robust morphological tagging with word representations. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 526–536.
  • Nguyen et al. (2016) Dat Quoc Nguyen, Dai Quoc Nguyen, Dang Duc Pham, and Son Bao Pham. 2016. A robust transformation-based learning approach using ripple down rules for part-of-speech tagging. AI Communications, 29(3):409–422.
  • Nothman et al. (2013) Joel Nothman, Nicky Ringland, Will Radford, Tara Murphy, and James R Curran. 2013. Learning multilingual named entity recognition from Wikipedia. Artificial Intelligence, 194:151–175.
  • Pascanu et al. (2013) Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318.
  • Passos et al. (2014) Alexandre Passos, Vineet Kumar, and Andrew McCallum. 2014. Lexicon infused phrase embeddings for named entity resolution. CoNLL-2014, page 78.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
  • Peters et al. (2017) Matthew Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1756–1765.
  • Pradhan et al. (2012) Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. CoNLL-2012 shared task: Modeling multilingual unrestricted coreference in OntoNotes. In Joint Conference on EMNLP and CoNLL - Shared Task, pages 1–40. Association for Computational Linguistics.
  • Rabiner (1989) Lawrence R Rabiner. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286.
  • Ratinov and Roth (2009) Lev Ratinov and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning, pages 147–155. Association for Computational Linguistics.
  • Reimers et al. (2014) Nils Reimers, Judith Eckle-Kohler, Carsten Schnober, Jungi Kim, and Iryna Gurevych. 2014. GermEval-2014: Nested named entity recognition with neural networks. In Proceedings of the KONVENS GermEval Shared Task on Named Entity Recognition, Hildesheim, Germany.
  • Rose et al. (2002) Tony Rose, Mark Stevenson, and Miles Whitehead. 2002. The Reuters Corpus Volume 1: From yesterday’s news to tomorrow’s language resources. In LREC, volume 2, pages 827–832. Las Palmas.
  • Santos and Guimarães (2015) Cicero Santos and Victor Guimarães. 2015. Boosting named entity recognition with neural character embeddings. In Proceedings of the Fifth Named Entity Workshop, pages 25–33.
  • Santos and Zadrozny (2014) Cicero D Santos and Bianca Zadrozny. 2014. Learning character-level representations for part-of-speech tagging. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1818–1826.
  • Schüller (2014) Peter Schüller. 2014. MoSTNER: Morphology-aware split-tag German NER with Factorie. In Proceedings of the KONVENS GermEval Shared Task on Named Entity Recognition, Hildesheim, Germany.
  • Shen and Sarkar (2005) Hong Shen and Anoop Sarkar. 2005. Voting between multiple data representations for text chunking. In Conference of the Canadian Society for Computational Studies of Intelligence, pages 389–400. Springer.
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.
  • Strubell et al. (2017) Emma Strubell, Patrick Verga, David Belanger, and Andrew McCallum. 2017. Fast and accurate entity recognition with iterated dilated convolutions. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2670–2680.
  • Sun (2014) Xu Sun. 2014. Structure regularization for structured prediction. In Advances in Neural Information Processing Systems, pages 2402–2410.
  • Sun et al. (2008) Xu Sun, Louis-Philippe Morency, Daisuke Okanohara, and Jun’ichi Tsujii. 2008. Modeling latent-dynamic in shallow parsing: a latent conditional model with improved inference. In Proceedings of the 22nd International Conference on Computational Linguistics-Volume 1, pages 841–848. Association for Computational Linguistics.
  • Tjong Kim Sang (2002) Erik F. Tjong Kim Sang. 2002. Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. In Proceedings of CoNLL-2002, pages 155–158. Taipei, Taiwan.
  • Tjong Kim Sang and Buchholz (2000) Erik F Tjong Kim Sang and Sabine Buchholz. 2000. Introduction to the CoNLL-2000 shared task: Chunking. In Proceedings of the 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning - Volume 7, pages 127–132. Association for Computational Linguistics.
  • Tjong Kim Sang and De Meulder (2003) Erik F Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, pages 142–147. Association for Computational Linguistics.
  • Toutanova et al. (2003) Kristina Toutanova, Dan Klein, Christopher D Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, pages 173–180. Association for Computational Linguistics.
  • Yang et al. (2016) Zhilin Yang, Ruslan Salakhutdinov, and William Cohen. 2016. Multi-task cross-lingual sequence tagging from scratch. arXiv preprint arXiv:1603.06270.
  • Zaremba et al. (2014) Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329.
  • Zhai et al. (2017) Feifei Zhai, Saloni Potdar, Bing Xiang, and Bowen Zhou. 2017. Neural models for sequence chunking. In AAAI, pages 3365–3371.