Transfer Learning for Sequence Tagging with Hierarchical Recurrent Networks
We present a deep hierarchical recurrent neural network for sequence tagging. Given a sequence of words, our model employs deep gated recurrent units on both character and word levels to encode morphology and context information, and applies a conditional random field layer to predict the tags. Our model is task independent, language independent, and feature engineering free. We further extend our model to multi-task and cross-lingual joint training by sharing the architecture and parameters. Our model achieves state-of-the-art results in multiple languages on several benchmark tasks including POS tagging, chunking, and NER. We also demonstrate that multi-task and cross-lingual joint training can improve the performance in various cases.
Sequence tagging is a fundamental problem in natural language processing with a wide range of applications, including part-of-speech (POS) tagging, chunking, and named entity recognition (NER). Given a sequence of words, sequence tagging aims to predict a linguistic tag for each word, such as the POS tag. Recently, progress has been made on neural sequence-tagging models which make only minimal assumptions about the language, task, and feature set [Collobert et al.2011].
This paper explores an important potential advantage of these task-independent, language-independent and feature-engineering free models: their ability to be jointly trained on multiple tasks. In particular, we explore two types of joint training. In multi-task joint training, a model is jointly trained to perform multiple sequence-tagging tasks in the same language—e.g., POS tagging and NER for English. In cross-lingual joint training, a model is trained to perform the same task in multiple languages—e.g., NER in English and Spanish.
Multi-task joint training can exploit the fact that different sequence tagging tasks in one language share language-specific regularities. For example, models of English POS tagging and English NER might benefit from using similar underlying representations for words, and in past work, certain sequence-tagging tasks have benefitted by leveraging the underlying similarity of related tasks [Ando and Zhang2005]. Currently, however, the best results on specific sequence-tagging tasks are usually achieved by approaches that target only one specific task, either POS tagging [Søgaard2011, Toutanova et al.2003], chunking [Shen and Sarkar2005], or NER [Luo et al.2015, Passos et al.2014]. Such approaches employ separate model development for each individual task, which makes joint training difficult. In other work, some recent neural approaches have been proposed to address multiple sequence tagging problems in a unified framework [Huang et al.2015]. Though gains have been shown using multi-task joint training, the prior models that benefit from multi-task joint training did not achieve state-of-the-art performance [Collobert et al.2011]; thus the question of whether joint training can improve over strong baseline methods is still unresolved.
For cross-lingual joint training, a key obstacle is that many successful approaches in sequence tagging rely heavily on feature engineering to handcraft language-dependent features, such as character-level morphological features and word-level N-gram patterns [Huang et al.2015, Toutanova et al.2003, Sun et al.2008], making it difficult to share latent representations between different languages. Some multilingual taggers that do not rely on feature engineering have also been presented [Lample et al.2016, dos Santos et al.2015], but while these methods are language-independent, they do not study the effect of cross-lingual joint training.
In this work, we focus on developing a general model that can be applied in both multi-task and cross-lingual settings by learning from scratch, i.e., without feature engineering or pipelines. Given a sequence of words, our model employs deep gated recurrent units on both character and word levels, and applies a conditional random field layer to make the structured prediction. On the character level, the gated recurrent units capture the morphological information; on the word level, the gated recurrent units learn N-gram patterns and word semantics.
Our model can handle both multi-task and cross-lingual joint training in a unified manner by simply sharing the network architecture and model parameters between tasks and languages. For multi-task joint training, we share both character and word level parameters between tasks to learn language-specific regularities. For cross-lingual joint training, we share the character-level parameters to capture the morphological similarity between languages without use of parallel corpora or word alignments.
We evaluate our model on five datasets of different tasks and languages, including POS tagging, chunking and NER in English; and NER in Dutch and Spanish. We achieve state-of-the-art results on several standard benchmarks: CoNLL 2000 chunking (95.41%), CoNLL 2002 Dutch NER (85.19%), CoNLL 2002 Spanish NER (85.77%), and CoNLL 2003 English NER (91.20%). We also achieve very competitive results on Penn Treebank POS tagging (97.55%, the second best result in the literature). Finally, we conduct experiments to systematically explore the effectiveness of multi-task and cross-lingual joint training on several tasks.
Ando and Zhang [2005] proposed a multi-task joint training framework that shares structural parameters among multiple tasks, and improved the performance on various tasks including NER. Collobert et al. [2011] presented a task independent convolutional network and employed multi-task joint training to improve the performance of chunking. However, there is still a gap between these multi-task approaches and the state-of-the-art results on individual tasks. Furthermore, it is unclear whether these approaches can be effective in a cross-lingual setting.
Multilingual resources have been used extensively for cross-lingual sequence tagging in various ways, such as cross-lingual feature extraction [Darwish2013], text categorization [Virga and Khudanpur2003], and Bayesian parallel data prediction [Snyder et al.2008]. Parallel corpora and word alignments are also used for training cross-lingual distributed word representations [Kiros et al.2014, Gouws et al.2014, Zhou et al.2015]. Unlike these approaches, our method mainly focuses on using morphological similarity for cross-lingual joint training.
Several neural architectures based on recurrent networks have been proposed for sequence tagging. Huang et al. [2015] used word-level Long Short-Term Memory (LSTM) units based on handcrafted features; dos Santos et al. [2015] employed convolutional layers on both character and word levels; Chiu and Nichols [2015] applied convolutional layers on the character level and LSTM units on the word level; Gillick et al. [2015] employed a sequence-to-sequence LSTM with a novel tagging scheme. We show experimentally in Section 5 that our architecture gives better performance than these approaches.
Most similar to our work is the recent approach independently developed by Lample et al. [2016] (published two weeks before our submission), which employs LSTM on both character and word levels. However, there are several crucial differences. First, we study cross-lingual joint training and show improvement over their approach in various cases. Second, while they mainly focus on NER, we generalize our model to other sequence tagging tasks, and also demonstrate the effectiveness of multi-task joint training. There are also technical differences, such as the cost-sensitive loss function and gated recurrent units used in our work.
In this section, we present our model for sequence tagging based on deep hierarchical gated recurrent units and conditional random fields. Our recurrent networks are hierarchical since we have multiple layers on both word and character levels in a hierarchy.
A gated recurrent unit (GRU) network is a type of recurrent neural network first introduced for machine translation [Cho et al.2014]. A recurrent network can be represented as a sequence of units, corresponding to the input sequence $(x_1, x_2, \ldots, x_T)$, which can be either a word sequence in a sentence or a character sequence in a word. The unit at position $t$ takes $x_t$ and the previous hidden state $h_{t-1}$ as input, and outputs the current hidden state $h_t$. The model parameters are shared between different units in the sequence.
A gated recurrent unit at position $t$ has two gates, an update gate $z_t$ and a reset gate $r_t$. More specifically, each gated recurrent unit can be expressed as follows:
$$r_t = \sigma(W_{rx} x_t + W_{rh} h_{t-1})$$
$$z_t = \sigma(W_{zx} x_t + W_{zh} h_{t-1})$$
$$\tilde{h}_t = \tanh(W_{hx} x_t + W_{hh} (r_t \odot h_{t-1}))$$
$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$$
where the $W$'s are model parameters of each unit, $\tilde{h}_t$ is a candidate hidden state that is used to compute $h_t$, $\sigma$ is an element-wise sigmoid logistic function defined as $\sigma(x) = 1 / (1 + e^{-x})$, and $\odot$ denotes element-wise multiplication of two vectors. Intuitively, the update gate $z_t$ controls how much the unit updates its hidden state, and the reset gate $r_t$ determines how much information from the previous hidden state needs to be reset.
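As a minimal sketch of the gated recurrent unit described above, the following pure-Python code performs the update at one position and unrolls it over a sequence. It assumes the standard formulation of Cho et al. with bias terms omitted; the parameter names (`W_rx`, `W_zh`, and so on), the helper functions, and the random initialization range are illustrative choices, not details taken from the paper.

```python
import math
import random

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * a for m, a in zip(row, v)) for row in M]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def sigmoid(v):
    """Element-wise logistic function."""
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def gru_step(x, h_prev, P):
    """One GRU update from (x_t, h_{t-1}) to h_t; P holds the weight matrices."""
    r = sigmoid(add(matvec(P["W_rx"], x), matvec(P["W_rh"], h_prev)))  # reset gate
    z = sigmoid(add(matvec(P["W_zx"], x), matvec(P["W_zh"], h_prev)))  # update gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)]                     # r_t * h_{t-1}
    cand = [math.tanh(a) for a in add(matvec(P["W_hx"], x), matvec(P["W_hh"], gated))]
    # Interpolate between the previous state and the candidate state.
    return [zi * hi + (1.0 - zi) * ci for zi, hi, ci in zip(z, h_prev, cand)]

def init_params(d_in, d_hid, seed=0):
    """Hypothetical small random initialization, for illustration only."""
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    params = {name: mat(d_hid, d_in) for name in ("W_rx", "W_zx", "W_hx")}
    params.update({name: mat(d_hid, d_hid) for name in ("W_rh", "W_zh", "W_hh")})
    return params

def run_gru(xs, P, d_hid):
    """Unroll the GRU over a sequence, sharing P across all positions."""
    h = [0.0] * d_hid
    states = []
    for x in xs:
        h = gru_step(x, h, P)
        states.append(h)
    return states
```

With all weights set to zero, both gates equal 0.5 and the candidate state is zero, so each step simply halves the previous hidden state; this gives a quick sanity check of the interpolation.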
Since a recurrent neural network only models the information flow in one direction, it is usually helpful to use an additional recurrent network that goes in the reverse direction. More specifically, we use bidirectional gated recurrent units, where given a sequence of length $T$, we have one GRU going from $1$ to $T$ and the other from $T$ to $1$. Let $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ denote the hidden states at position $t$ of the forward and backward GRUs respectively. We concatenate the two hidden states to form the final hidden state $h_t = [\overrightarrow{h}_t, \overleftarrow{h}_t]$.
We stack multiple recurrent layers together to form a deep recurrent network [Sutskever et al.2014]. Each layer learns a more effective representation taking the hidden states of the previous layer as input. Let $h_{l,t}$ denote the hidden state at position $t$ in layer $l$. The forward GRU at position $t$ in layer $l$ computes $\overrightarrow{h}_{l,t}$ using $\overrightarrow{h}_{l,t-1}$ and $h_{l-1,t}$ as input, and the backward GRU performs similar operations but in the reverse direction.
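The bidirectional and stacked recurrences described above can be sketched independently of the recurrent unit itself. In the following illustrative code, `step_fwd` and `step_bwd` are placeholders for any function mapping an input and a previous hidden state to a new hidden state (for example, a GRU update); the function names and the layer description format are assumptions made for this sketch.

```python
def run_direction(step, xs, h0, reverse=False):
    """Run a recurrent step function over xs, left-to-right or right-to-left."""
    order = range(len(xs) - 1, -1, -1) if reverse else range(len(xs))
    states = [None] * len(xs)
    h = h0
    for t in order:
        h = step(xs[t], h)
        states[t] = h  # store at the original position, whatever the direction
    return states

def bidirectional_layer(step_fwd, step_bwd, xs, h0):
    """Concatenate forward and backward hidden states at every position."""
    fwd = run_direction(step_fwd, xs, h0)
    bwd = run_direction(step_bwd, xs, h0, reverse=True)
    return [f + b for f, b in zip(fwd, bwd)]  # list concatenation

def deep_bidirectional(layers, xs):
    """Stack layers: the hidden states of one layer are the inputs of the next.

    `layers` is a list of (step_fwd, step_bwd, h0) triples, bottom layer first.
    """
    for step_fwd, step_bwd, h0 in layers:
        xs = bidirectional_layer(step_fwd, step_bwd, xs, h0)
    return xs

# Toy step function used only to make the data flow visible: a running sum.
def running_sum(x, h):
    return [h[0] + sum(x)]

toy_layer = (running_sum, running_sum, [0.0])
one_layer = deep_bidirectional([toy_layer], [[1.0], [2.0], [3.0]])
two_layers = deep_bidirectional([toy_layer, toy_layer], [[1.0], [2.0], [3.0]])
```

In the toy example, the forward state at each position summarizes the prefix and the backward state summarizes the suffix, which is the property the bidirectional design relies on.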
Our model employs a hierarchical GRU that encodes both word-level and character-level sequential information.
The input of our model is a sequence of words $(x_1, x_2, \ldots, x_T)$ of length $T$, where $x_t$ is a one-of-$K$ embedding of the $t$-th word. The word at each position $t$ also has a character-level representation, denoted as a sequence $(c_{t,1}, c_{t,2}, \ldots, c_{t,S_t})$ of length $S_t$, where $c_{t,s}$ is the one-of-$K$ embedding of the $s$-th character in the $t$-th word.
Given a word, we first employ a deep bidirectional GRU to learn a useful morphological representation from the character sequence of the word. Suppose the character-level GRU has $L_c$ layers; we then obtain forward and backward hidden states $\overrightarrow{h}^{c}_{L_c,s}$ and $\overleftarrow{h}^{c}_{L_c,s}$ at each position $s$ in the character sequence. Since recurrent networks usually tend to memorize more short-term patterns, we concatenate the first hidden state of the backward GRU and the last hidden state of the forward GRU to encode character-level morphology in both prefixes and suffixes. We further concatenate the character-level representation with the one-of-$K$ word embedding $x_t$ to form the final representation $h^{w}_t$ for the $t$-th word. More specifically, we have
$$h^{w}_t = [\overrightarrow{h}^{c}_{L_c,S_t}, \overleftarrow{h}^{c}_{L_c,1}, x_t]$$
where $h^{w}_t$ is a representation of the $t$-th word, which encodes both character-level morphology and word-level semantics, as shown in Figure 1.
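The concatenation that builds the word representation can be illustrated in a few lines. This is a sketch under the assumption that the top-layer character-level hidden states have already been computed and are stored as plain lists; the function name and the toy values are hypothetical.

```python
def word_representation(char_fwd_states, char_bwd_states, word_embedding):
    """Build the representation of one word.

    char_fwd_states[s] and char_bwd_states[s] are the top-layer forward and
    backward hidden states at character position s. The last forward state
    has read the whole word up to its suffix, and the first backward state
    has read the whole word back to its prefix.
    """
    last_forward = char_fwd_states[-1]
    first_backward = char_bwd_states[0]
    return last_forward + first_backward + word_embedding  # list concatenation

# Toy example: a three-character word with 2-dimensional character states
# and a 3-dimensional word embedding.
fwd = [[0.1, 0.0], [0.2, 0.1], [0.3, 0.4]]
bwd = [[0.9, 0.8], [0.5, 0.6], [0.1, 0.2]]
rep = word_representation(fwd, bwd, [1.0, 2.0, 3.0])
```

The resulting vector has dimension equal to twice the character-level hidden size plus the word embedding size, and it is this vector that the word-level GRU consumes.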
The character-level GRU outputs a sequence of word representations $(h^{w}_1, h^{w}_2, \ldots, h^{w}_T)$. We employ a word-level deep bidirectional GRU with $L_w$ layers on top of these word representations. The word-level GRU takes the sequence $(h^{w}_1, h^{w}_2, \ldots, h^{w}_T)$ as input, and computes a sequence of hidden states $h = (h_1, h_2, \ldots, h_T)$.
Different from the character-level GRU, the word-level GRU aims to extract the context information in the word sequence, such as N-gram patterns and neighbor word dependencies. Such information is usually encoded using handcrafted features. However, as we show in our experimental results, the word-level GRU can learn the relevant information without being language-specific or task-specific. The hidden states output by the word-level GRU will be used as input features for the next layers.
The goal of sequence tagging is to predict a sequence of tags $y = (y_1, y_2, \ldots, y_T)$. To model the dependencies between tags in a sequence, we apply a conditional random field [Lafferty et al.2001] layer on top of the hidden states $h$ output by the word-level GRU [Huang et al.2015]. Let $\mathcal{Y}(h)$ denote the space of tag sequences for $h$. The conditional log probability of a tag sequence $y$, given the hidden state sequence $h$, can be written as
$$\log p(y \mid h) = f(h, y) - \log \sum_{y' \in \mathcal{Y}(h)} \exp f(h, y') \qquad (1)$$
where $f$ is a function that assigns a score for each pair of $h$ and $y$.
To define the function $f(h, y)$, for each position $t$, we multiply the hidden state $h_t$ with a parameter vector $w_{y_t}$ that is indexed by the tag $y_t$, to obtain the score for assigning $y_t$ at position $t$. Since we also need to consider the correlation between tags, we impose first order dependency by adding a score $A_{y_{t-1}, y_t}$ at position $t$, where $A$ is a parameter matrix defining the similarity scores between different tag pairs. Formally, the function $f$ can be written as
$$f(h, y) = \sum_{t=1}^{T} w_{y_t}^{\top} h_t + \sum_{t=1}^{T} A_{y_{t-1}, y_t}$$
where we set $y_0$ to be a START token.
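The scoring function can be made concrete with a short sketch. It assumes tags are encoded as integers, that the per-tag parameter vectors are stored as the rows of a list `W`, and that the transition matrix `A` carries one extra row for the START token; these encodings are illustrative choices rather than details given in the text.

```python
def sequence_score(hidden, tags, W, A, start):
    """Score a tag sequence: per-position emission scores plus transition scores.

    hidden[t] is the hidden state at position t, W[k] is the parameter vector
    for tag k, and A[j][k] scores moving from tag j to tag k. `start` is the
    row index of A reserved for the START token.
    """
    score = 0.0
    prev = start
    for h_t, y_t in zip(hidden, tags):
        score += sum(w * v for w, v in zip(W[y_t], h_t))  # emission score
        score += A[prev][y_t]                             # transition score
        prev = y_t
    return score

# Toy example with two tags (0 and 1); row 2 of A is the START row.
hidden = [[1.0, 0.0], [0.0, 1.0]]
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, 0.0], [0.0, 0.5], [0.0, 0.25]]
score_01 = sequence_score(hidden, [0, 1], W, A, start=2)
score_11 = sequence_score(hidden, [1, 1], W, A, start=2)
```

In the toy example the sequence `[1, 1]` scores higher than `[0, 1]` because it collects both a larger emission score at the first position and the transition bonus for repeating tag 1.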
It is possible to directly maximize the conditional log likelihood based on Eq. (1). However, this training objective is usually not optimal since each possible $y'$ contributes equally to the objective function. Therefore, we add a cost function between $y$ and $y'$ based on the max-margin principle that high-cost tags should be penalized more heavily [Gimpel and Smith2010]. More specifically, the objective function to maximize for each training instance $y$ and $h$ is written as
$$f(h, y) - \log \sum_{y' \in \mathcal{Y}(h)} \exp \big( f(h, y') + \mathrm{cost}(y, y') \big) \qquad (2)$$
We employ mini-batch AdaGrad [Duchi et al.2011] to train our neural network in an end-to-end manner with backpropagation. Both the character embeddings and word embeddings are fine-tuned during training. We use dynamic programming to compute the normalizer of the CRF layer in Eq. (2). When making predictions, we again use dynamic programming in the CRF layer to decode the most probable tag sequence.
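The two uses of dynamic programming mentioned above can be sketched as a forward recursion for the normalizer and a Viterbi recursion for decoding. The code assumes emission scores `E[t][k]` have already been computed from the hidden states, and it uses a Hamming cost (1 for each position whose tag differs from the gold tag) as a stand-in for the cost function, since the text does not specify one; all function names are illustrative.

```python
import math

def logsumexp(values):
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def path_score(E, A, start, tags):
    """Score of one tag sequence from emissions E and transitions A."""
    score, prev = 0.0, start
    for t, y in enumerate(tags):
        score += A[prev][y] + E[t][y]
        prev = y
    return score

def log_partition(E, A, start, cost=None):
    """Forward algorithm: log-sum-exp of the scores of all tag sequences.

    An optional per-position cost[t][k] is added to the emissions, which
    gives the cost-augmented normalizer for a position-wise cost.
    """
    K = len(E[0])
    def emit(t, k):
        return E[t][k] + (cost[t][k] if cost is not None else 0.0)
    alpha = [A[start][k] + emit(0, k) for k in range(K)]
    for t in range(1, len(E)):
        alpha = [logsumexp([alpha[j] + A[j][k] for j in range(K)]) + emit(t, k)
                 for k in range(K)]
    return logsumexp(alpha)

def viterbi(E, A, start):
    """Decode the highest-scoring tag sequence."""
    K = len(E[0])
    delta = [A[start][k] + E[0][k] for k in range(K)]
    backpointers = []
    for t in range(1, len(E)):
        ptr = [max(range(K), key=lambda j, k=k: delta[j] + A[j][k]) for k in range(K)]
        delta = [delta[ptr[k]] + A[ptr[k]][k] + E[t][k] for k in range(K)]
        backpointers.append(ptr)
    best = max(range(K), key=lambda k: delta[k])
    path = [best]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path

def hamming_cost(gold, K):
    """Assumed cost: 1 for every tag that differs from the gold tag."""
    return [[0.0 if k == y else 1.0 for k in range(K)] for y in gold]

def cost_augmented_objective(E, A, start, gold):
    """Gold score minus the cost-augmented log normalizer (to be maximized)."""
    cost = hamming_cost(gold, len(E[0]))
    return path_score(E, A, start, gold) - log_partition(E, A, start, cost)

# Toy problem: three positions, two tags, START stored as row 2 of A.
E = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.0]]
A = [[0.2, 0.0], [0.0, 0.3], [0.0, 0.0]]
best_path = viterbi(E, A, start=2)
```

Because the assumed cost is non-negative, the cost-augmented objective is never larger than the plain conditional log likelihood of the same gold sequence, which is what makes it a margin-style penalty.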
In this section we study joint training of multiple tasks and multiple languages. On one hand, different sequence tagging tasks in the same language share language-specific regularities. For example, POS tagging and NER in English should learn similar underlying representation since they are in the same language. On the other hand, some languages share character-level morphologies, such as English and Spanish. Therefore, it is desirable to leverage multi-task and cross-lingual joint training to boost model performance.
Since our model is generally applicable to different tasks in different languages, it can be naturally extended to multi-task and cross-lingual joint training. The basic idea is to share part of the architecture and parameters between tasks and languages, and to jointly train multiple objective functions with respect to different tasks and languages.
We now discuss the details of our joint training algorithm in the multi-task setting. Suppose we have $D$ tasks, with the training instances of each task being $(X_1, X_2, \ldots, X_D)$. Each task $d$ has a set of model parameters $W_d$, which is divided into two sets, task specific parameters and shared parameters, i.e.,
$$W_d = W_{d,\mathrm{spec}} \cup W_{\mathrm{shared}}$$
where shared parameters $W_{\mathrm{shared}}$ are a set of parameters that are shared among the tasks, while task specific parameters $W_{d,\mathrm{spec}}$ are the rest of the parameters that are trained for each task separately.
During joint training, we optimize the average over the objective functions of all $D$ tasks. We iterate over each task $d$, sample a batch of training instances from $X_d$, and perform a gradient descent step to update model parameters $W_d$. Similarly, we can derive a cross-lingual joint training algorithm by replacing tasks with languages.
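The alternating update scheme can be sketched with a toy model in which the shared parameters receive gradients from every task while each task keeps its own specific parameters. This sketch uses plain stochastic gradient descent on a squared-error loss rather than AdaGrad on the tagging objective, and the linear model, the function names, and the hyperparameter values are all assumptions chosen to keep the example self-contained.

```python
import random

def joint_train(tasks, shared, specific, grad_fn, lr=0.1, steps=300,
                batch_size=2, seed=0):
    """Round-robin joint training over several tasks.

    tasks[d] is the list of training instances of task d, `shared` is the
    parameter list used by every task, and specific[d] is the parameter list
    used only by task d. grad_fn returns (shared gradient, specific gradient).
    """
    rng = random.Random(seed)
    for _ in range(steps):
        for d, data in enumerate(tasks):
            batch = rng.sample(data, min(batch_size, len(data)))
            g_shared, g_specific = grad_fn(batch, shared, specific[d])
            for i, g in enumerate(g_shared):
                shared[i] -= lr * g          # updated by every task
            for i, g in enumerate(g_specific):
                specific[d][i] -= lr * g     # updated by task d only
    return shared, specific

def linear_grad(batch, shared, spec):
    """Gradient of mean squared error for y = shared[0] * x + spec[0]."""
    n = len(batch)
    residuals = [(shared[0] * x + spec[0] - y, x) for x, y in batch]
    g_slope = sum(2.0 * r * x for r, x in residuals) / n
    g_offset = sum(2.0 * r for r, _ in residuals) / n
    return [g_slope], [g_offset]

# Two toy tasks that share a slope of 2 but have different offsets.
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
toy_tasks = [[(x, 2.0 * x + 1.0) for x in xs],
             [(x, 2.0 * x - 3.0) for x in xs]]
shared, specific = joint_train(toy_tasks, [0.0], [[0.0], [0.0]], linear_grad)
```

The toy tasks mirror the intended benefit of joint training: the regularity common to both tasks (the slope) is estimated from the data of both, while the task-specific parts (the offsets) stay separate.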
The network architectures we employ for joint training are illustrated in Figure 2. For multi-task joint training, we share all the parameters below the CRF layer including word embeddings to learn language-specific regularities shared by the tasks. For cross-lingual joint training, we share the parameters of the character-level GRU to capture the morphological similarity between languages. Note that since we do not consider using parallel corpus in this work, we mainly focus on joint training between languages with similar morphology. We leave the study of cross-lingual joint training by sharing word semantics based on parallel corpora to future work.
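The two sharing schemes can be summarized as a small configuration sketch. The parameter group names are hypothetical labels for the components described in the text, and sharing the character embeddings together with the character-level GRU in the cross-lingual setting is an assumption; the text only states that the character-level GRU parameters are shared.

```python
# Hypothetical names for the parameter groups, ordered from input to output.
PARAM_GROUPS = ("char_embeddings", "char_gru", "word_embeddings", "word_gru", "crf")

def shared_groups(setting):
    """Return the parameter groups shared across tasks or languages."""
    if setting == "multi_task":
        # Everything below the CRF layer, including word embeddings.
        return tuple(g for g in PARAM_GROUPS if g != "crf")
    if setting == "cross_lingual":
        # Character-level parameters only (embeddings included by assumption).
        return ("char_embeddings", "char_gru")
    raise ValueError("unknown joint training setting: " + repr(setting))

def specific_groups(setting):
    """Return the groups trained separately for each task or language."""
    shared = shared_groups(setting)
    return tuple(g for g in PARAM_GROUPS if g not in shared)
```

For example, `specific_groups("multi_task")` leaves only the CRF layer per task, while `specific_groups("cross_lingual")` keeps the word embeddings, the word-level GRU, and the CRF layer separate for each language.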
|Benchmark||Task||Language||# Training Tokens||# Dev Tokens||# Test Tokens|
|PTB [Toutanova et al.2003]||POS Tagging||English||912,344||131,768||129,654|
In this section, we use several benchmark datasets for multiple tasks in multiple languages to evaluate our model as well as the joint training algorithm.
We use the following benchmark datasets in our experiments: Penn Treebank (PTB) POS tagging, CoNLL 2000 chunking, CoNLL 2003 English NER, CoNLL 2002 Dutch NER and CoNLL 2002 Spanish NER. The statistics of the datasets are described in Table 1.
We construct the POS tagging dataset following the instructions described in Toutanova et al. [2003]. Note that as a standard practice, the POS tags are extracted from the parsed trees.
For the task of CoNLL 2003 English NER, we follow previous works [Collobert et al.2011, Huang et al.2015, Chiu and Nichols2015] and append one-hot gazetteer features to the input of the CRF layer for fair comparison. Although gazetteers are arguably a type of feature engineering, we note that unlike most feature engineering techniques they are straightforward to include in a model. We use only the gazetteer file provided by the CoNLL 2003 shared task, and do not use gazetteers for any other tasks or languages described here.
|Model (CoNLL 2003 English NER)||F1 (%)|
|Chieu et al. [2002]||88.31|
|Florian et al. [2003]||88.76|
|Ando and Zhang [2005]||89.31|
|Lin and Wu [2009]||90.90|
|Collobert et al. [2011]||89.59|
|Huang et al. [2015]||90.10|
We set the hidden state dimensions to be 300 for the word-level GRU. We set the number of GRU layers to two for both the word-level and character-level GRUs. The learning rate is fixed at 0.01. We use the development set to tune the other hyperparameters of our model. Since the CoNLL 2000 chunking dataset does not have a development set, we hold out one fifth of the training set for parameter tuning.
|Model (CoNLL 2003 English NER)||F1 (%)|
|Ratinov and Roth [2009]||90.80|
|Passos et al. [2014]||90.90|
|Chiu and Nichols [2015]||90.77|
|Luo et al. [2015]||91.2|
|Lample et al. [2016]||90.94|
|Ours no gazetteer||90.96|
|Ours no char GRU||88.00|
|Ours no word embeddings||77.20|
We truncate all words whose character sequence length is longer than a threshold (17 for English, 35 for Dutch, and 20 for Spanish). We replace all numeric characters with “0”. We also use the BIOES (Begin, Inside, Outside, End, Single) tagging scheme [Ratinov and Roth2009].
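The preprocessing steps above can be sketched as follows. The thresholds come from the text; the function names, the restriction of "numeric characters" to ASCII digits, and the assumption that input tags are in the BIO scheme (every entity starts with a `B-` tag) are choices made for this illustration.

```python
import re

# Maximum number of characters kept per word, as stated in the text.
MAX_CHARS = {"english": 17, "dutch": 35, "spanish": 20}

def normalize_word(word, language):
    """Replace digits with '0' and truncate overly long words."""
    word = re.sub(r"[0-9]", "0", word)  # assumption: ASCII digits only
    return word[:MAX_CHARS[language]]

def bio_to_bioes(tags):
    """Convert BIO tags to BIOES (Begin, Inside, Outside, End, Single)."""
    converted = []
    for i, tag in enumerate(tags):
        if tag == "O":
            converted.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == "I-" + label  # does the entity go on?
        if prefix == "B":
            converted.append(("B-" if continues else "S-") + label)
        else:
            converted.append(("I-" if continues else "E-") + label)
    return converted

example_tags = ["B-PER", "I-PER", "O", "B-LOC", "B-ORG", "I-ORG", "I-ORG"]
example_bioes = bio_to_bioes(example_tags)
```

In the example, the two-token person entity becomes `B-PER, E-PER`, the single-token location becomes `S-LOC`, and the three-token organization becomes `B-ORG, I-ORG, E-ORG`.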
Since the training corpus for a sequence tagging task is relatively small, it is difficult to train randomly initialized word embeddings to accurately capture the word semantics. Therefore, we leverage word embeddings pre-trained on large-scale corpora. All the pre-trained embeddings we use are publicly available.
On the English datasets, following previous works that are based on neural networks [Collobert et al.2011, Huang et al.2015, Chiu and Nichols2015], we use the 50-dimensional SENNA embeddings (http://ronan.collobert.com/senna/) trained on Wikipedia. For Spanish and Dutch, we use the 64-dimensional Polyglot embeddings (https://sites.google.com/site/rmyeid/projects/polyglot) [Al-Rfou et al.2013], which are trained on Wikipedia articles of the corresponding languages. We use pre-trained word embeddings as initialization, and fine-tune the embeddings during training.
|Model (CoNLL 2002 Dutch NER)||F1 (%)|
|Carreras et al. [2002]||77.05|
|Nothman et al. [2013]||78.6|
|Gillick et al. [2015]||82.84|
|Lample et al. [2016]||81.74|
|Ours joint training||85.19|
|Ours no char GRU||77.76|
|Ours no word embeddings||67.36|
|Model (CoNLL 2002 Spanish NER)||F1 (%)|
|Carreras et al. [2002]||81.39|
|dos Santos et al. [2015]||82.21|
|Gillick et al. [2015]||82.95|
|Lample et al. [2016]||85.75|
|Ours joint training||85.77|
|Ours no char GRU||83.03|
|Ours no word embeddings||73.34|
|Model (CoNLL 2000 chunking)||F1 (%)|
|Kudo and Matsumoto [2001]||93.91|
|Shen and Sarkar [2005]||94.01|
|Sun et al. [2008]||94.34|
|Collobert et al. [2011]||94.32|
|Huang et al. [2015]||94.46|
|Ours joint training||95.41|
|Ours no char GRU||94.44|
|Ours no word embeddings||88.13|
Note on Shen and Sarkar [2005]: this number is often mistakenly cited as 95.23, which is actually the score on base NP chunking rather than CoNLL 2000.
|Model (PTB POS tagging)||Accuracy (%)|
|Toutanova et al. [2003]||97.24|
|Shen et al. [2007]||97.33|
|Søgaard [2011]||97.50|
|Collobert et al. [2011]||97.29|
|Huang et al. [2015]||97.55|
|Ling et al. [2015]||97.78|
|Ling et al. [2015] (SENNA)||97.41|
|Ours no char GRU||96.69|
|Ours no word embeddings||95.43|
In this section, we report the results of our model on the benchmark datasets and compare to the previously-reported state-of-the-art results.
For English NER, there are two evaluation methods used in the literature. Some models are trained with both the training and development set, while others are trained with the training set only. We report our results in both cases. In the first case, we tune the hyperparameters by training on the training set and testing on the development set.
Besides our standalone model, we experimented with multi-task and cross-lingual joint training as well, using the architecture described in Section 4. For multi-task joint training, we jointly train all tasks in English, including POS tagging, chunking and NER. For cross-lingual joint training, we jointly train NER in English, Dutch and Spanish. We also remove the word embeddings and the character-level GRU respectively to analyze the contribution of different components.
The results are shown in Tables 2, 3, 4, 5, 6 and 7. We achieve state-of-the-art results on English NER, Dutch NER, Spanish NER and English chunking. Our model outperforms the best previously-reported results on Dutch NER and English chunking by 2.35 points and 0.95 points respectively. We also achieve the second best result on English POS tagging, which is 0.23 points worse than the current state-of-the-art.
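The margins quoted above can be checked against the scores listed in the comparison tables, taking the best previously reported result for each task (Gillick et al. for Dutch NER, Huang et al. for chunking, and Ling et al. for POS tagging):

```latex
\begin{align*}
\text{Dutch NER:} \quad & 85.19 - 82.84 = 2.35 \\
\text{English chunking:} \quad & 95.41 - 94.46 = 0.95 \\
\text{English POS tagging:} \quad & 97.78 - 97.55 = 0.23
\end{align*}
```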
Joint training improves the performance on Spanish NER, Dutch NER and English chunking by 1.08 points, 0.19 points and 0.75 points respectively, and yields no significant improvement on English POS tagging and English NER.
On POS tagging, the best result is 97.78%, reported by Ling et al. [2015]. However, the embeddings they used are not publicly available. To demonstrate the effectiveness of our model, we slightly revise our model to reimplement theirs with the parameter settings described in their original paper. We use SENNA embeddings to initialize the reimplemented model for fair comparison, and obtain an accuracy of 97.41%, which is 0.14 points worse than our result. This indicates that our model is more effective and that the main difference lies in using different pre-trained embeddings.
By comparing the results without the character-level GRU and without word embeddings, we can observe that both components contribute to the final results. It is also clear that word embeddings contribute significantly more than the character-level GRU, which indicates that our model largely depends on memorizing the word semantics. Character-level morphology, on the other hand, makes a relatively smaller but still critical contribution.
In this section, we analyze the effectiveness of multi-task and cross-lingual joint training in more detail. In order to explore possible gains in performance of joint training for resource-poor languages or tasks, we consider joint training of various task pairs and language pairs where different-sized subsets of the actual labeled corpora are made available. Given a pair of tasks or languages, we jointly train one task with full labels and the other with partial labels. In particular, we introduce a labeling rate $r$, and sample a fraction $r$ of the sentences in the training set, discarding the rest. Evaluation is based on the partially-labeled task. The results are reported in Table 8.
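The subsampling step can be sketched in a few lines. The text does not say how the fraction is drawn, so this sketch assumes uniform sampling of sentences without replacement with a fixed seed; the function name and the rounding rule are illustrative.

```python
import random

def subsample_sentences(sentences, rate, seed=0):
    """Keep a fraction `rate` of the training sentences and discard the rest."""
    if not 0.0 < rate <= 1.0:
        raise ValueError("labeling rate must be in (0, 1]")
    rng = random.Random(seed)
    n_keep = max(1, round(rate * len(sentences)))
    kept = sorted(rng.sample(range(len(sentences)), n_keep))  # keep corpus order
    return [sentences[i] for i in kept]

# Toy corpus of 100 "sentences" at the two low labeling rates discussed here.
corpus = ["sentence %d" % i for i in range(100)]
low = subsample_sentences(corpus, 0.1)
mid = subsample_sentences(corpus, 0.3)
```

The partially labeled task is then trained on the reduced set (`low` or `mid` in the toy example) while the other task of the pair keeps its full training set.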
We observe that the performance of a specific task with relatively lower labeling rates (0.1 and 0.3) can usually benefit from other tasks with full labels through multi-task or cross-lingual joint training. The performance gain can be up to 1.99 points when the labeling rate of the target task is 0.1. The improvement with 0.1 labeling rate is on average 0.37 points larger than with 0.3 labeling rate, which indicates that the improvement of joint training is more significant when the target task has less labeled data.
We also use t-SNE [Van der Maaten and Hinton2008] to obtain a 2-dimensional visualization of the character-level GRU output for the country names in English and Spanish, shown in Figure 3. We can clearly see that our model captures the morphological similarity between the two languages through joint training, since all corresponding pairs are nearest neighbors in the original embedding space.
We presented a new model for sequence tagging based on gated recurrent units and conditional random fields. We explored multi-task and cross-lingual joint training through sharing part of the network architecture and model parameters. We achieved state-of-the-art results on various tasks including POS tagging, chunking, and NER, in multiple languages. We also demonstrated that joint training can improve model performance in various cases.
In this work, we mainly focus on leveraging morphological similarities for cross-lingual joint training. In the future, an important problem will be joint training based on cross-lingual word semantics with the help of parallel data. Furthermore, it will be interesting to apply our joint training approach to low-resource tasks and languages.
This work was funded by the NSF under grant IIS-1250956.
[Cho et al.2014] On the properties of neural machine translation: Encoder-decoder approaches. In ACL.
[Florian et al.2003] Named entity recognition through classifier combination. In HLT-NAACL, pages 168–171.
[Gouws et al.2014] Bilbowa: Fast bilingual distributed representations without word alignments. In ICML.
[Kudo and Matsumoto2001] Chunking with support vector machines. In NAACL, pages 1–8.