Neural multi-task learning models have driven state-of-the-art results to new levels in a number of language processing tasks, ranging from part-of-speech (POS) tagging (Yang, Salakhutdinov, and Cohen, 2016; Søgaard and Goldberg, 2016), parsing (Peng, Thomson, and Smith, 2017; Guo et al., 2016), text classification (Liu, Qiu, and Huang, 2016; Liu, Qiu, and Huang, 2017) to machine translation (Luong et al., 2015; Firat, Cho, and Bengio, 2016).
Multi-task learning utilizes the correlation between related tasks to improve the performance of each task. In practice, existing work often models task relatedness by simply defining shared common parameters over some pre-defined task structures. Figure 1-(a,b) shows two typical pre-defined topology structures which have been popular. A flat structure (Collobert and Weston, 2008) assumes that all tasks jointly share a hidden space, while a hierarchical structure (Søgaard and Goldberg, 2016; Hashimoto et al., 2017) specifies a partial order of the direction of information flow between tasks.
There are two major limitations to the above approaches. First, static pre-defined structures represent a strong assumption about the nature of the interaction between tasks, restricting the model’s capacity to make use of shared information. For example, the structure in Figure 1-(a) does not allow the model to explicitly learn the strength of relatedness between tasks. This restriction prevents the model from fully utilizing and handling the complexity of the data (Li and Tian, 2015). Note that the strength of relatedness between tasks is itself not static but subject to change, depending on the data samples at hand. Second, these models are not interpretable to researchers and system developers, meaning that we learn little about what kinds of patterns have been shared besides the parameters themselves. Previous non-neural-network models (Bakker and Heskes, 2003; Kim and Xing, 2010; Chen et al., 2010) have demonstrated the importance of learning inter-task relationships for multi-task learning. However, there is little work giving an in-depth analysis in the neural setting.
The above issues motivate the following research questions: 1) How can we explicitly model complex relationships between different tasks? 2) Can we design models that learn interpretable shared structures?
To address these questions, we propose to model the relationships between language processing tasks over a graph structure, in which each task is regarded as a node. We take inspiration from the idea of message passing (Berendsen, van der Spoel, and van Drunen, 1995; Serlet, Boynton, and Tevanian, 1996; Gilmer et al., 2017; Kipf and Welling, 2016), designing two methods for communication between tasks, in which messages can be passed between any two nodes in a direct (Complete-graph in Figure 1-(c)) or an indirect way (Star-graph in Figure 1-(d)). Importantly, the strength of the relatedness is learned dynamically, rather than being pre-specified, which allows tasks to selectively share information when needed.
We evaluate our proposed models on two types of sequence learning tasks, text classification and sequence tagging, both well-studied NLP tasks (Li and Zong, 2008; Liu, Qiu, and Huang, 2017; Yang, Salakhutdinov, and Cohen, 2016). Moreover, we conduct experiments in both the multi-task setting and the transfer learning setting to demonstrate that the shared knowledge learned by our models can be useful for new tasks. Our experimental results not only show the effectiveness of our methods in terms of reduced error rates, but also provide good interpretability of the shared knowledge.
The contributions of this paper can be summarized as follows:
We explore the problem of learning the relationship between multiple tasks and formulate this problem as message passing over a graph neural network.
We present a state-of-the-art approach that allows multiple tasks to communicate dynamically rather than following a pre-defined structure.
Different from traditional black-box learned models, this paper makes a step towards learning transferable and interpretable representations, which enables us to know what types of patterns are shared.
Message Passing Framework for Multi-task Communication
We propose to use graph neural networks with message passing to deal with the problem of multi-task sequence learning. Two well-studied sequence learning tasks, text classification and sequence tagging, are used in our experiments. We denote the text sequence as $X = \{x_1, x_2, \dots, x_T\}$ and the output as $Y$. In text classification, $Y$ is a single label; whereas in sequence labelling, $Y = \{y_1, y_2, \dots, y_T\}$ is a sequence.
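Assuming that there are $K$ related tasks, we refer to $D_k$ as a dataset with $N_k$ samples for task $k$. Specifically,

$$D_k = \{(X_i^{(k)}, Y_i^{(k)})\}_{i=1}^{N_k} \tag{1}$$

where $X_i^{(k)}$ and $Y_i^{(k)}$ denote a sentence and a corresponding label sequence for task $k$.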
Generally, when combining multi-task learning with sequence learning, two kinds of interactions should be modelled: the first is the interactions between different words within a sentence, and the other is the interactions across different tasks.
For the first type of interaction (interaction of words within a sentence), many models have been proposed by applying a composition function in order to obtain a representation of the sentence. Typical choices for defining the composition function include recurrent neural networks (Hochreiter and Schmidhuber, 1997), convolutional neural networks (Kalchbrenner, Grefenstette, and Blunsom, 2014), and tree-structured neural networks (Tai, Socher, and Manning, 2015). In this paper, we adopt the LSTM architecture to learn the dependencies within a sentence, due to its impressive performance on many NLP tasks (Cheng, Dong, and Lapata, 2016). Formally, we refer to $\mathbf{h}_t$ as the hidden state of the word $x_t$ at time step $t$. Then, $\mathbf{h}_t$ can be computed as:

$$\mathbf{h}_t = \mathrm{LSTM}(x_t, \mathbf{h}_{t-1}, \theta) \tag{2}$$

Here, $\theta$ represents all the parameters of the LSTM.
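The recurrence above can be illustrated with a minimal sketch of a single LSTM step applied along a sentence. It uses the standard LSTM gate equations in pure Python; the dictionary layout of the parameters, the helper names (`lstm_step`, `init_theta`, `encode`), the toy dimensions, and the initialization range are all assumptions made for this illustration rather than details of our implementation.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, b, v):
    """Return W v + b, where W is a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, theta):
    """One standard LSTM step: (x_t, h_{t-1}, c_{t-1}) -> (h_t, c_t)."""
    z = x_t + h_prev  # list concatenation of the input and the previous hidden state
    i = [sigmoid(a) for a in affine(theta["Wi"], theta["bi"], z)]    # input gate
    f = [sigmoid(a) for a in affine(theta["Wf"], theta["bf"], z)]    # forget gate
    o = [sigmoid(a) for a in affine(theta["Wo"], theta["bo"], z)]    # output gate
    g = [math.tanh(a) for a in affine(theta["Wg"], theta["bg"], z)]  # candidate cell
    c_t = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)]
    h_t = [oj * math.tanh(cj) for oj, cj in zip(o, c_t)]
    return h_t, c_t

def init_theta(input_dim, hidden_dim, rng):
    """Toy initialization (the range is an assumption for this sketch)."""
    theta = {}
    for gate in "ifog":
        theta["W" + gate] = [
            [rng.uniform(-0.1, 0.1) for _ in range(input_dim + hidden_dim)]
            for _ in range(hidden_dim)
        ]
        theta["b" + gate] = [0.0] * hidden_dim
    return theta

def encode(sentence, theta, hidden_dim):
    """Run the LSTM over a sentence and return the hidden state at every step."""
    h, c = [0.0] * hidden_dim, [0.0] * hidden_dim
    states = []
    for x_t in sentence:
        h, c = lstm_step(x_t, h, c, theta)
        states.append(h)
    return states

rng = random.Random(0)
theta = init_theta(input_dim=3, hidden_dim=4, rng=rng)
sentence = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(5)]  # 5 toy word vectors
states = encode(sentence, theta, hidden_dim=4)
print(len(states), len(states[0]))
```

Each word yields one hidden state, so a five-word sentence with a hidden size of four gives five four-dimensional vectors.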
For the second type of interaction (interaction across different tasks), we propose to conceptualize tasks and their interactions as a graph, and utilize message passing mechanisms to allow them to communicate. Our framework is inspired by the idea of message passing, which is used ubiquitously in modern computer software (Berendsen, van der Spoel, and van Drunen, 1995) and programming languages (Serlet, Boynton, and Tevanian, 1996). The general idea of this framework is that we provide a graph network that allows different tasks to cooperate and interact with one another. Below, we describe our conceptualization of the graph construction process, then we describe the message passing mechanism used for inter-task communication.
Formally, a graph $G$ can be defined as an ordered pair $G = (V, E)$, where $V$ is a set of nodes and $E$ is a set of edges. In this work, we use directed graphs to model the communication flows between tasks, and an edge is therefore defined as an ordered pair of two nodes $(v_i, v_j)$.
In our models, we represent each task as a node. In addition, we allow virtual nodes to be introduced. These virtual nodes do not correspond to a task. Rather, their purpose is to facilitate communication among different tasks. Intuitively, the virtual node functions as a mailbox, storing messages from other nodes and distributing them as needed.
Tasks and virtual nodes are connected by weighted edges, which represent communication between different nodes. Previous flat and hierarchical architectures for multi-task learning can be considered as graphs with fixed edge connections. Our models dynamically learn the weight of each edge, which allows the models to adjust the strength of the communication signals.
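As a rough illustration of the two kinds of graph considered in this paper, the sketch below builds the directed edge sets for a graph in which every pair of tasks is connected and for a graph in which all tasks communicate through one virtual node. The function names and the `"shared"` label for the virtual node are hypothetical, and edge weights are omitted because they are learned rather than fixed.

```python
def complete_graph(tasks):
    """Directed edges (source, target) between every ordered pair of distinct tasks."""
    return [(u, v) for u in tasks for v in tasks if u != v]

def star_graph(tasks, hub="shared"):
    """Every task writes to and reads from a single virtual node (the mailbox)."""
    edges = []
    for t in tasks:
        edges.append((t, hub))  # the task puts a message into the mailbox
        edges.append((hub, t))  # the task takes messages out of the mailbox
    return edges

tasks = ["books", "apparel", "movie", "kitchen"]
cg = complete_graph(tasks)
sg = star_graph(tasks)

# K * (K - 1) directed edges versus 2 * K directed edges for K tasks.
print(len(cg), len(sg))
```

With K tasks, the first construction has K(K-1) directed edges while the second has 2K, which is why routing messages through a virtual node scales better as the number of tasks grows.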
In our graph structures, we use directed edges to model the communication between tasks. In other words, nodes communicate with each other by sending and receiving messages over edges. Given a sentence $X$ with a sequence of words from task $k$, we use $\mathbf{r}_t^{(k)}$ to represent the aggregated messages that the word of task $k$ at time step $t$ can get, and we use $\mathbf{h}_t^{(k)}$ to denote the task-dependent hidden representation of the word.
Below, we propose two basic communication architectures for message passing: Complete-graph and Star-graph which differ according to whether they allow direct communication between any pair of tasks, or whether the communication is mediated by an intermediate virtual node.
1. Complete-graph (CG): Direct Communication for Multi-Task Learning: in this model, each node can directly send (or receive) messages to (or from) any other node. Specifically, as shown in Fig. 2-(a), we first assign each task a task-dependent LSTM layer. Each sentence in task $k$ can be passed to all the other task-dependent LSTMs to get the corresponding representations $\mathbf{h}_t^{(i)}$, $i = 1, \dots, K$, $i \neq k$. (At training time, the loss is only calculated and used to compute the gradient for the task from which the sentence is drawn.) Then, these messages will be aggregated as:

$$\mathbf{r}_t^{(k)} = \sum_{i=1, i \neq k}^{K} \alpha_t^{(i \to k)} \, \mathbf{h}_t^{(i)} \tag{3}$$

Here, $\alpha_t^{(i \to k)}$ is a scalar that controls the relatedness between two tasks $k$ and $i$. It is computed dynamically by a scoring function with learnable parameters, and the resulting relatedness scores are normalized into a probability distribution over the source tasks.
2. Star-graph (SG): Indirect Communication for Multi-Task Learning: the potential limitation of our proposed CG-MTL model lies in its computational cost, because the number of pairwise interactions grows quadratically with the number of tasks. Inspired by the mailbox idea used in the traditional message passing paradigm (Netzer and Miller, 1995), we introduce an extra virtual node into our graph to address this problem. In this setting, messages are not sent directly from one node to another, but are bridged by the virtual node. Intuitively, the virtual node stores the shared messages across all the tasks; different tasks can put messages into this global space, then other tasks can take out the useful messages for themselves from the same space. Fig. 2-(b) shows how one task collects information from the mailbox (shared layer).
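In detail, we introduce an extra LSTM to act as the virtual node, whose parameters are shared across tasks. Given a sentence from task $k$, its information can be written into the shared LSTM by the following operation:

$$\mathbf{h}_t^{(s)} = \mathrm{LSTM}(x_t, \mathbf{h}_{t-1}^{(s)}, \theta^{(s)})$$

where $\theta^{(s)}$ denotes the parameters that are shared across all the tasks.
Then, the aggregated messages $\mathbf{r}_t^{(k)}$ at time step $t$ can be read from the shared LSTM through an attention layer, as a weighted sum of the shared hidden states:

$$\mathbf{r}_t^{(k)} = \sum_{i=1}^{T} \beta_{t,i} \, \mathbf{h}_i^{(s)}$$

where the attention scores $\beta_{t,i}$ control how much of each shared message is read.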
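The two communication schemes described above can be sketched side by side. In both cases, the aggregated message is a weighted sum whose weights come from a softmax over scores. The scoring function used here is a plain dot product, which is a simplifying assumption made only for this illustration (the models use a scoring function with learnable parameters), and the function names and toy vectors are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

def cg_aggregate(target, task_states):
    """Complete-graph: mix the states produced by the other tasks' LSTMs at one time step."""
    sources = [name for name in task_states if name != target]
    # Assumed scoring function: dot product with the target task's own state.
    scores = [dot(task_states[target], task_states[s]) for s in sources]
    weights = softmax(scores)  # relatedness weights, one per source task
    message = weighted_sum(weights, [task_states[s] for s in sources])
    return message, dict(zip(sources, weights))

def sg_read(query, shared_states):
    """Star-graph: read from the shared LSTM states of all time steps."""
    weights = softmax([dot(query, h) for h in shared_states])  # attention scores
    return weighted_sum(weights, shared_states), weights

# Toy hidden states of one word under three task-dependent LSTMs.
task_states = {"kitchen": [1.0, 0.0], "camera": [0.9, 0.1], "toys": [-0.5, 0.4]}
r_cg, relatedness = cg_aggregate("kitchen", task_states)

# Toy shared-LSTM states for a three-word sentence, read with a task-side query.
shared_states = [[0.1, 0.0], [0.9, 0.2], [-0.3, 0.5]]
r_sg, attention = sg_read([1.0, 0.0], shared_states)

print(relatedness)
print(attention)
```

In the first case the weights range over source tasks, so they can be inspected as task-to-task relatedness; in the second they range over positions in the shared layer, so they show which shared word-level states a task draws on.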
Once the graphs and the message passing between the nodes are defined, the next question is how to update the task-dependent representation $\mathbf{h}_t^{(k)}$ for node $k$ using the current input information $x_t$ and the aggregated messages $\mathbf{r}_t^{(k)}$. We employ a gating unit that allows the model to decide how much of the aggregated messages should be used for the target task, which avoids unnecessary information redundancy. Formally, $\mathbf{h}_t^{(k)}$ is computed from the current input and the aggregated messages, where the messages are projected by a parameter matrix and selected by a fusion gate. The fusion gate is in turn computed with its own parameter matrices.
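To make the role of the gating unit concrete, the following sketch shows one possible form of a fusion gate. The specific parameterization (a sigmoid gate computed from the current input and the aggregated messages, applied element-wise to a linear projection of the messages) is an assumption chosen for illustration, and the matrix names `W_f`, `W_gx`, and `W_gr` are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def fusion_gate(x_t, r_t, W_gx, W_gr):
    """Assumed gate: sigmoid(W_gx x_t + W_gr r_t), one value in (0, 1) per dimension."""
    a = matvec(W_gx, x_t)
    b = matvec(W_gr, r_t)
    return [sigmoid(p + q) for p, q in zip(a, b)]

def gated_messages(x_t, r_t, W_f, W_gx, W_gr):
    """Project the aggregated messages and let the gate decide how much passes through."""
    g = fusion_gate(x_t, r_t, W_gx, W_gr)
    projected = matvec(W_f, r_t)
    return [gi * pi for gi, pi in zip(g, projected)], g

x_t = [1.0, -1.0]                  # current input (toy values)
r_t = [0.5, 0.5]                   # aggregated messages (toy values)
W_f = [[1.0, 0.0], [0.0, 1.0]]     # identity projection for readability
W_gx = [[10.0, 0.0], [0.0, 10.0]]  # gate driven strongly by the input
W_gr = [[0.0, 0.0], [0.0, 0.0]]

selected, gate = gated_messages(x_t, r_t, W_f, W_gx, W_gr)
print(gate)
print(selected)
```

In this toy setting the first dimension of the gate is almost fully open and the second almost fully closed, so only the first dimension of the messages reaches the task-dependent representation.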
Comparison of Complete-graph (CG) and Star-graph (SG)
For CG-MTL, the advantage is that we can figure out the strength of the association from a word in a source task to a word in the target task. However, the computation of CG-MTL is not efficient if the number of tasks is too large. For SG-MTL, the advantage is that the learned shared structures are interpretable and more importantly, the learned knowledge of SG-MTL can be used for unseen tasks. To conclude, the CG-MTL can be used in these scenarios: 1) The number of tasks is not too large; 2) We need to explicitly analyze the relatedness between different tasks (as shown in Fig.3). By contrast, the SG-MTL can be used in the following scenarios: 1) The number of tasks is large; 2) We need to transfer shared knowledge to new tasks; 3) We need to analyze what types of shared patterns have been learned by the model (As shown in Tab.2).
Given a sentence $X$ from task $k$ with its label $Y$ (note that $Y$ is either a classification label or a sequence of labels) and its feature vector $\mathbf{h}^{(k)}$ emitted by the two communication methods above, we can adapt our models to different tasks by using different task-specific layers. We call the task-specific layer the Output-layer. For text classification tasks, the commonly used Output-layer is a softmax layer, while for sequence labelling tasks, it can be a conditional random field (CRF) layer. Finally, the output probability $P(Y \mid X)$ is computed by applying the Output-layer to the feature vector $\mathbf{h}^{(k)}$.
Then, we can maximize the above probability to optimize the parameters of each task.
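For the text classification case, the Output-layer and the training objective can be sketched as a softmax layer followed by a negative log-likelihood, since maximizing the probability of the gold labels is equivalent to minimizing this quantity. The sketch covers only the softmax variant (not the CRF layer), and the weights, feature vectors, and function names are toy assumptions.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def output_layer(h, W, b):
    """Softmax Output-layer: map a feature vector to a distribution over labels."""
    logits = [sum(w * x for w, x in zip(row, h)) + bi for row, bi in zip(W, b)]
    return softmax(logits)

def negative_log_likelihood(batch, W, b):
    """Sum of -log P(y | h) over (feature vector, gold label) pairs."""
    return -sum(math.log(output_layer(h, W, b)[y]) for h, y in batch)

W = [[1.0, 0.0], [0.0, 1.0]]  # toy weights for two sentiment labels
b = [0.0, 0.0]
batch = [([2.0, 0.0], 0), ([0.0, 2.0], 1)]  # (feature vector, gold label)

probs = output_layer(batch[0][0], W, b)
loss = negative_log_likelihood(batch, W, b)
print(probs, loss)
```

In a multi-task setting, such a loss would be computed per task from that task's own samples, in line with computing the gradient only for the task from which a sentence is drawn.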
Experiments and Results
In this section, we describe our hyperparameter settings and present the empirical performance of our proposed models on two types of multi-task learning datasets, first on text classification, then on sequence tagging. Each dataset contains several related tasks.
The word embeddings for all of the models are initialized with the 200-dimensional GloVe vectors (840B token version; Pennington, Socher, and Manning, 2014). The other parameters are initialized by random sampling from a uniform distribution. The mini-batch size is set to 8.
For each task, we select the hyperparameters that achieve the best performance on the development set via a grid search over combinations of the hidden size and the regularization strength. Additionally, for text classification tasks, we set an equal lambda for each task, while for tagging tasks, we run a grid search over lambda and again select the value that achieves the best performance on the development set. Based on the validation performance, the selected regularization strength is 0.0. We apply stochastic gradient descent with the diagonal variant of AdaDelta for optimization (Zeiler, 2012).
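The selection procedure described above amounts to a standard grid search scored on the development set, which can be sketched as follows. The candidate values and the `toy_dev_error` function are hypothetical stand-ins: in practice each configuration would require training a model and measuring its development-set error.

```python
from itertools import product

def grid_search(hidden_sizes, reg_strengths, dev_error):
    """Return (error, hidden_size, reg) for the configuration with the lowest dev error."""
    best = None
    for hidden_size, reg in product(hidden_sizes, reg_strengths):
        err = dev_error(hidden_size, reg)
        if best is None or err < best[0]:
            best = (err, hidden_size, reg)
    return best

def toy_dev_error(hidden_size, reg):
    # Stand-in for "train a model, then evaluate on the development set".
    return abs(hidden_size - 128) / 1000 + reg * 100

# Hypothetical candidate grids, chosen only to make the example run.
best_err, best_hidden, best_reg = grid_search(
    hidden_sizes=[64, 128, 256],
    reg_strengths=[0.0, 1e-5, 5e-5],
    dev_error=toy_dev_error,
)
print(best_hidden, best_reg, best_err)
```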
To investigate the effectiveness of multi-task learning, we experimented with 16 different text classification tasks involving different popular review corpora, such as books, apparel and movie (Liu, Qiu, and Huang, 2017). Each sub-task aims at predicting the correct sentiment label (positive or negative) for a given sentence. The dataset for each task is partitioned into training, validation, and test sets of 1400, 200, and 400 samples respectively.
|Task||Single Task||Multiple Tasks||Transfer|
We choose several relevant and representative models as baselines.
MT-CNN: This model is proposed by Collobert and Weston (2008) with a convolutional layer, in which lookup-tables are shared partially while other layers are task-specific.
FS-MTL: Fully shared multi-task learning framework. Different tasks fully share a neural layer (LSTM).
SP-MTL: Shared-private multi-task learning framework with adversarial learning (Liu, Qiu, and Huang, 2017). Different tasks not only have common layers to share information, but have their own private layers.
Results on Multi-task Learning: The experimental results show that our proposed models outperform all single-task baselines by a large margin, and here we show the averaged error due to the following reasons: 1) it is easier to show the performance gain of multi-task learning models over single task models. 2) BiLSTM and stacked LSTM are also the necessary baselines for SG-MTL, since the combination of shared and private layers in SG-MTL is similar to two-layer LSTM.
Table 1 shows the overall results on the 16 different tasks under three settings: single task, multiple task, and transfer learning. Generally, we can see that almost all tasks benefit from multi-task learning, which boosts the performance by a large margin. Specifically, CG-MTL achieves the best performance, surpassing SP-MTL by 1.4, which suggests that explicit communication makes it easier to share information. Although the improvement of SG-MTL is not as large as that of CG-MTL, SG-MTL is efficient to train. Additionally, the comparison between SG-MTL and SP-MTL shows the effectiveness of the selective sharing scheme. Moreover, we may further improve our models by incorporating the adversarial training mechanism introduced in SP-MTL, as it is an orthogonal innovation to our methods.
Evaluation on Transfer Learning: We next present the potential of our methods on transfer learning, as we expect that the shared knowledge learned by our model architectures can be useful for new tasks. In particular, the virtual node in the SG-MTL model can condense shared information into a common space after multi-task learning, which allows us to transfer this knowledge to new tasks. In order to test the transferability of the shared knowledge learned by SG-MTL, we design an experiment following the supervised pre-training paradigm. Specifically, we adopt a 16-fold “leave-one-task-out” paradigm: we take turns choosing 15 tasks to train our model via multi-task learning, then the learned shared layer is transferred to a second network that is used to test on the remaining target task. The parameters of the transferred layer are kept frozen, and the remaining parameters of the new network are randomly initialized.
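The leave-one-task-out protocol can be sketched as a simple generator over source/target splits. The task names below are placeholders, and the training and transfer steps themselves are omitted.

```python
def leave_one_task_out(tasks):
    """Yield (source_tasks, target_task) pairs, holding out each task once."""
    for i, target in enumerate(tasks):
        sources = tasks[:i] + tasks[i + 1:]
        yield sources, target

tasks = [f"task_{i}" for i in range(16)]  # placeholder names for the 16 tasks
folds = list(leave_one_task_out(tasks))

for sources, target in folds[:2]:
    # In each fold: train on `sources` via multi-task learning, freeze the
    # shared layer, then train and evaluate a new network on `target`.
    print(target, len(sources))
```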
Table 1 shows these results in the “Transfer” column, in which the task in each row is regarded as the target task. We observe that our model achieves an average improvement of 4.9 points in error rate over the single-task setting (13.1 vs. 18.0), surpassing SP-MTL by 1.2 points on average (13.1 vs. 14.3). This improvement suggests that our retrieval method with the selective mechanism (the attention layer in eq. 8) is more efficient in finding the relevant information from the shared space compared to SP-MTL, which reads the shared information without any selective mechanism and ignores the relationship between tasks.
In this section, we present the results of our models on the second task of sequence tagging. We conducted experiments by following the same settings as Yang, Salakhutdinov, and Cohen (2016). We use the following benchmark datasets in our experiments: Penn Treebank (PTB) POS tagging, CoNLL 2000 chunking, CoNLL 2003 English NER. The statistics of the datasets are described in Table 3.
|Model||CoNLL2000||CoNLL2003||PTB|
|LSTM + CRF||94.46||90.10||97.55|
Results and Analysis: Table 4 shows the performance of the models on the sequence tagging tasks. CG-MTL and SG-MTL significantly outperform the three strong multi-task learning baselines. Specifically, SG-MTL achieves a gain in F1 score over the best competitor FS-MTL on the CoNLL2003 dataset, indicating that our models are able to make use of the shared information by modelling the relationship between different tasks. Our models also achieve slightly better F1 scores on the CoNLL2000 and PTB datasets when compared to the best baseline model FS-MTL.
Discussion and Qualitative Analysis
In order to obtain more insights and detailed interpretability of how messages are passed between tasks in our proposed models, we design a series of experiments targeting the following aspects:
Can the relationship between different tasks be learned by CG-MTL?
Are there interpretable structures that the shared layer in SG-MTL can learn? Are these shared patterns similar to linguistic structures, and can they be transferred for other tasks?
Explicit Relationship Learning
To answer the first question, we visualize the weight $\alpha$ of CG-MTL in equation 3. As each task can receive messages from any other task in CG-MTL, $\alpha$ directly indicates the relevance of other tasks to the current task at time step $t$. As shown in Figure 3, we analyze the relationships learned by our models on randomly sampled sentences from different tasks. We find that the relationship between tasks cannot be modelled by a static score. Rather, it depends on the specific sample and context. Consider the example sentence in Figure 3-(a), drawn from the Kitchen task. Here, the words “easily” and “ads” are influenced by different sets of external tasks, in which those words express sentiment. For example, in the Camera and Toys tasks, “breaks easily” is usually used to express negative sentiment, while the word “ads” often appears in the Magazine task to express negative sentiment. Figure 3-(b) shows a similar case on “quality” and “name-brand”.
Interpretable Structure Learning
To answer the second question, we visualize the attention scores $\beta$ in equation 8 inside the SG-MTL model. As different tasks can read information from shared layers in SG-MTL, visualizing $\beta$ allows us to analyze what kinds of sentence structures are shared. Specifically, each word can receive the shared messages $\mathbf{h}_1^{(s)}, \dots, \mathbf{h}_T^{(s)}$, and the amount of messages is controlled by the scores $\beta$. To illustrate the interpretable structures learned by the shared layer in SG-MTL, we randomly sample several examples from different tasks and visualize their shared structures. Three randomly sampled cases are shown in Figure 4.
From the experiments we conducted in visualizing $\beta$ in SG-MTL, we observed the following:
The proposed model can not only utilize the shared information across different tasks, but can tell us what kinds of features are shared. As shown in Table 2, the short-term and long-term dependencies between different words can be captured. For example, the word “movie” is prone to connecting to emotional words, such as “boring, amazing, exciting” while “products” is more likely to make friends with “stable, great, fantastic”.
Comparing Figure 4-(b) and (c), we can see how task Software borrows useful information from task Kitchen. Concretely, the sentence “I would have to buy the software again” in the task “Software” has negative emotion. In this sentence, the key pattern is “would have”, which does not appear too much in the training set of Software. Fortunately, the training samples in the task Kitchen provide more hints about this pattern.
As shown in Figure 4-(a) and (b), the shared layer has learned an informative sentence pattern “would have to ...” from the training set of task Kitchen. This pattern is useful for the sentiment prediction of another task, Software, which suggests that we can analyze the sharable patterns in an interpretable way for the SG-MTL model.
Neural network-based multi-task frameworks have achieved success on many NLP tasks, such as POS tagging (Yang, Salakhutdinov, and Cohen, 2016; Søgaard and Goldberg, 2016), parsing (Peng, Thomson, and Smith, 2017; Guo et al., 2016), machine translation (Dong et al., 2015; Luong et al., 2015; Firat, Cho, and Bengio, 2016), and text classification (Liu, Qiu, and Huang, 2016; Liu, Qiu, and Huang, 2017). However, previous work does not focus on explicitly modelling the relationships between different tasks. These models are often trained with an opaque neural component, which makes it hard to understand what kind of knowledge is shared. By contrast, in this paper, we propose to explicitly learn the communication between different tasks, and learn some interpretable shared structures.
Before the rise of neural models, non-neural multi-task learning methods were also proposed to model the relationships between tasks. For example, Bakker and Heskes (2003) learn to cluster tasks by using Bayesian approaches. Kim and Xing (2010) utilize a given tree structure to design a regularizer, while Chen et al. (2010) learn a structured multi-task problem over a given graph. These models adopt complex learning strategies and introduce a priori information between different tasks, which are usually not suitable for sequence modelling. In this paper, we provide a new perspective on how to model the relationships using distributed graph models and message passing, which can be learned dynamically rather than following a pre-defined structure.
The technique of message passing is used ubiquitously in computer software (Berendsen, van der Spoel, and van Drunen, 1995) and programming languages (Serlet, Boynton, and Tevanian, 1996). Recently, there has also been growing interest in developing graph neural networks (Kipf and Welling, 2016) or neural message passing algorithms (Gilmer et al., 2017) for learning representations of irregular graph-structured data. In this paper, we formulate multi-task learning as a communication problem over graph structures, allowing different tasks to communicate via message passing.
More recently, Liu and Huang (2018) propose to learn multi-task communication by explicitly passing gradients. Both their work and ours try to incorporate an inductive bias into multi-task learning. However, the difference is that we focus on a structural bias, while Liu and Huang (2018) introduce an additional loss function.
Conclusion and Outlook
We have explored the problem of learning the relationships between multiple tasks, formulating the problem as message passing over a graph neural network. Our proposed methods explicitly model the relationships between different tasks and achieve improved performance in several multi-task and transfer learning settings. We also show that we can extract interpretable shared patterns from the outputs of our models. From our experiments, we believe that learning interpretable shared structures is a promising direction, which is also very useful for knowledge transfer.
The authors wish to thank the anonymous reviewers for their helpful comments. This work was partially funded by National Natural Science Foundation of China (No. 61751201, 61672162), STCSM (No.16JC1420401, No.17JC1404100), and Natural Sciences and Engineering Research Council of Canada (NSERC).
- Bakker and Heskes (2003) Bakker, B., and Heskes, T. 2003. Task clustering and gating for bayesian multitask learning. JMLR 4(May):83–99.
- Berendsen, van der Spoel, and van Drunen (1995) Berendsen, H. J.; van der Spoel, D.; and van Drunen, R. 1995. Gromacs: a message-passing parallel molecular dynamics implementation. Computer Physics Communications 91(1-3):43–56.
- Chen et al. (2010) Chen, X.; Kim, S.; Lin, Q.; Carbonell, J. G.; and Xing, E. P. 2010. Graph-structured multi-task regression and an efficient optimization method for general fused lasso. stat 1050:20.
- Cheng, Dong, and Lapata (2016) Cheng, J.; Dong, L.; and Lapata, M. 2016. Long short-term memory-networks for machine reading. In Proceedings of the 2016 Conference on EMNLP, 551–561.
- Collobert and Weston (2008) Collobert, R., and Weston, J. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of ICML.
- Dong et al. (2015) Dong, D.; Wu, H.; He, W.; Yu, D.; and Wang, H. 2015. Multi-task learning for multiple language translation. In Proceedings of the ACL.
- Firat, Cho, and Bengio (2016) Firat, O.; Cho, K.; and Bengio, Y. 2016. Multi-way, multilingual neural machine translation with a shared attention mechanism. In Proceedings of NAACL-HLT, 866–875.
- Gilmer et al. (2017) Gilmer, J.; Schoenholz, S. S.; Riley, P. F.; Vinyals, O.; and Dahl, G. E. 2017. Neural message passing for quantum chemistry. In ICML, 1263–1272.
- Guo et al. (2016) Guo, J.; Che, W.; Wang, H.; and Liu, T. 2016. Exploiting multi-typed treebanks for parsing with deep multi-task learning. arXiv preprint arXiv:1606.01161.
- Hashimoto et al. (2017) Hashimoto, K.; Tsuruoka, Y.; Socher, R.; et al. 2017. A joint many-task model: Growing a neural network for multiple nlp tasks. In Proceedings of the 2017 Conference on EMNLP, 1923–1933.
- Hochreiter and Schmidhuber (1997) Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
- Huang, Xu, and Yu (2015) Huang, Z.; Xu, W.; and Yu, K. 2015. Bidirectional lstm-crf models for sequence tagging. arXiv preprint arXiv:1508.01991.
- Kalchbrenner, Grefenstette, and Blunsom (2014) Kalchbrenner, N.; Grefenstette, E.; and Blunsom, P. 2014. A convolutional neural network for modelling sentences. In Proceedings of ACL.
- Kim and Xing (2010) Kim, S., and Xing, E. P. 2010. Tree-guided group lasso for multi-task regression with structured sparsity.
- Kipf and Welling (2016) Kipf, T. N., and Welling, M. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
- Li and Tian (2015) Li, Y., and Tian, X. 2015. Graph-based multi-task learning. In Communication Technology (ICCT), 2015 IEEE 16th International Conference on, 730–733. IEEE.
- Li and Zong (2008) Li, S., and Zong, C. 2008. Multi-domain sentiment classification. In Proceedings of the ACL, 257–260.
- Liu and Huang (2018) Liu, P., and Huang, X. 2018. Meta-learning multi-task communication. arXiv preprint arXiv:1810.09988.
- Liu, Qiu, and Huang (2016) Liu, P.; Qiu, X.; and Huang, X. 2016. Recurrent neural network for text classification with multi-task learning. In Proceedings of IJCAI.
- Liu, Qiu, and Huang (2017) Liu, P.; Qiu, X.; and Huang, X. 2017. Adversarial multi-task learning for text classification. In Proceedings of the 55th Annual Meeting of ACL, volume 1, 1–10.
- Luong et al. (2015) Luong, M.-T.; Le, Q. V.; Sutskever, I.; Vinyals, O.; and Kaiser, L. 2015. Multi-task sequence to sequence learning. arXiv preprint arXiv:1511.06114.
- Netzer and Miller (1995) Netzer, R. H., and Miller, B. P. 1995. Optimal tracing and replay for debugging message-passing parallel programs. The Journal of Supercomputing 8(4):371–388.
- Peng, Thomson, and Smith (2017) Peng, H.; Thomson, S.; and Smith, N. A. 2017. Deep multitask learning for semantic dependency parsing. In Proceedings of the 55th ACL, volume 1, 2037–2048.
- Pennington, Socher, and Manning (2014) Pennington, J.; Socher, R.; and Manning, C. D. 2014. Glove: Global vectors for word representation. Proceedings of the EMNLP 12:1532–1543.
- Serlet, Boynton, and Tevanian (1996) Serlet, B.; Boynton, L.; and Tevanian, A. 1996. Method for providing automatic and dynamic translation into operation system message passing using proxy objects. US Patent 5,481,721.
- Søgaard and Goldberg (2016) Søgaard, A., and Goldberg, Y. 2016. Deep multi-task learning with low level tasks supervised at lower layers. In Proceedings of ACL.
- Tai, Socher, and Manning (2015) Tai, K. S.; Socher, R.; and Manning, C. D. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd ACL, 1556–1566.
- Yang, Salakhutdinov, and Cohen (2016) Yang, Z.; Salakhutdinov, R.; and Cohen, W. 2016. Multi-task cross-lingual sequence tagging from scratch. arXiv preprint arXiv:1603.06270.
- Zeiler (2012) Zeiler, M. D. 2012. Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.