Massive Choice, Ample Tasks (MaChAmp): A Toolkit for Multi-task Learning in NLP

05/29/2020 · Rob van der Goot et al. · University of Groningen, Università di Trento, IT University of Copenhagen

Transfer learning, particularly approaches that combine multi-task learning with pre-trained contextualized embeddings and fine-tuning, has advanced the field of Natural Language Processing tremendously in recent years. In this paper we present MaChAmp, a toolkit for easily fine-tuning BERT-like models in multi-task settings. The benefits of MaChAmp are its flexible configuration options and its support for a variety of NLP tasks in a uniform toolkit, from text classification to sequence labeling and dependency parsing.


1 Introduction

Multi-task learning (MTL) Caruana (1993, 1997) has developed into a standard part of the NLP repertoire. It allows for learning tasks in parallel in neural networks Caruana (1993) while leveraging the benefits of sharing parameters. The shift, or the "tsunami" Manning (2015), of deep learning in NLP has facilitated the widespread use of MTL since the seminal work by Collobert et al. (2011), which has led to a multi-task learning "wave" Ruder and Plank (2018) in NLP. It has since been applied to a wide range of NLP tasks, developing into a viable alternative to classical pipeline approaches. This ranges from early adoption in Recurrent Neural Network-type models, e.g. Lazaridou et al. (2015); Chrupała et al. (2015); Plank et al. (2016); Søgaard and Goldberg (2016); Hashimoto et al. (2017), to the use of several unsupervised multi-task objectives to train BERT-like language models Devlin et al. (2019) and, ultimately, their combination in (low) supervised data regimes to fine-tune contextualized word embeddings with supervised objectives Sanh et al. (2019).

The key idea of language model pre-training and contextual embeddings Howard and Ruder (2018); Peters et al. (2018); Devlin et al. (2019) is to pre-train rich representations on large quantities of monolingual or multilingual text data. Taking these representations as a starting point has led to enormous improvements across a wide variety of NLP problems. Effective models emerged for many languages and tasks Hashimoto et al. (2017); Kondratyuk and Straka (2019); Sanh et al. (2019); Hu et al. (2020). MTL comes in many flavours, based on the type of sharing, the weighting of losses, and the design and relations of tasks and layers. In general, though, outperforming single-task settings remains a challenge Martínez Alonso and Plank (2017); Clark et al. (2019). For an overview of MTL in NLP we refer to Ruder (2017).

Figure 1: Overview of MaChAmp when training jointly for sentiment analysis and POS tagging (example input: "Smell ya later!"). A shared BERT encoder with layer attention and task-specific decoders (here a sentiment decoder and a UPOS decoder) are exploited to accomplish both tasks.

We introduce MaChAmp, a flexible toolkit for multi-task learning and fine-tuning on NLP problems. The main advantages of MaChAmp are:

  • Ease of configuration, especially for multi-task setups;

  • Support of a wide range of NLP tasks, from sequence labeling to dependency parsing and text classification;

  • Support for the processing of multiple datasets at once;

  • Support for initialization with and fine-tuning of BERT embeddings Devlin et al. (2019), which have been shown to achieve state-of-the-art results for many NLP tasks.

For MaChAmp, we took a recent multilingual multi-task Universal Dependencies parser (UDify) as a starting point, which in a single model obtains competitive performance on 75 languages and all UD tasks Kondratyuk and Straka (2019). UDify, however, is targeted at Universal Dependencies (UD) parsing only; consequently, its code and data handling logic were specifically designed and tailored for UD data. Kondratyuk and Straka (2019) used multilingual BERT (mBERT, https://github.com/google-research/bert/blob/master/multilingual.md) and fine-tuned its parameters for all UD tasks at hand Nivre et al. (2016): part-of-speech tagging, lemmatization, morphological tagging, and dependency parsing.

We propose MaChAmp, which includes the functionality of UDify, yet supports many more data formats and tasks, and with its easy configuration opens up for general use on more NLP tasks (see Section 2.2). Moreover, we plan to extend MaChAmp continuously and document the releases by updating this reference. The backbone of MaChAmp and UDify is AllenNLP Gardner et al. (2018), a PyTorch-based Paszke et al. (2019) Python library containing modules for a variety of deep learning methods and NLP tasks. It is designed to be modular, high-level and flexible.

2 Model

In this section we will discuss the model and its supported tasks.

2.1 Model overview

An overview of the model is shown in Figure 1. MaChAmp takes a pre-trained BERT model Devlin et al. (2019) as initial encoder, and fine-tunes its layers by applying an inverse square root learning rate decay with linear warm-up Howard and Ruder (2018), according to a given set of downstream tasks. For the task-specific predictions (i.e., decoding), each task has its own decoder, which is trained only for the corresponding task. By default, all task-specific decoders are placed on top of the encoder (BERT), though this configuration is flexible (see Section 4.1). To make optimal use of the BERT layers, each decoder uses attention over all layers.
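As a rough illustration of this per-decoder layer attention, the following sketch (assumed names, not MaChAmp's actual code) computes a learned softmax-weighted mix of all BERT layers, in the spirit of the scalar mix used by UDify and AllenNLP:

# Hedged sketch: per-decoder layer attention over BERT layers, i.e. a learned
# softmax-weighted sum of all hidden layers. Names and shapes are illustrative.
import torch
import torch.nn as nn

class LayerAttention(nn.Module):
    def __init__(self, num_layers: int):
        super().__init__()
        # One learnable scalar per BERT layer, plus a global scaling factor.
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layer_outputs: torch.Tensor) -> torch.Tensor:
        # layer_outputs: (num_layers, batch, seq_len, hidden_size)
        weights = torch.softmax(self.layer_weights, dim=0)
        mixed = (weights[:, None, None, None] * layer_outputs).sum(dim=0)
        return self.gamma * mixed

# Each task decoder would own its own LayerAttention instance, so different
# tasks can attend to different BERT layers.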

All the input is converted to WordPieces Wu et al. (2016), an extension of Byte Pair Encoding (BPE) Sennrich et al. (2016). For word-level tasks, the first WordPiece of each word is used for the prediction. For sentence-level tasks, the pooled output of the [CLS] token is used.
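The following minimal sketch (illustrative tensors and offsets, not MaChAmp internals) shows how the representation of the first WordPiece of every word can be selected for word-level prediction:

# Hedged sketch: selecting the first WordPiece of every word for word-level tasks.
import torch

def select_first_wordpieces(hidden_states, word_starts):
    """hidden_states: (seq_len, hidden_size) BERT output for one sentence.
    word_starts: indices of the first WordPiece of each original word."""
    return hidden_states[torch.tensor(word_starts)]

# Example: "Smell ya later !" -> [CLS] Sm ##ell ya later ! [SEP]
# The first WordPieces are at positions 1 ("Sm"), 3 ("ya"), 4 ("later"), 5 ("!").
hidden = torch.randn(7, 768)                      # dummy BERT output
word_reprs = select_first_wordpieces(hidden, [1, 3, 4, 5])
print(word_reprs.shape)                           # torch.Size([4, 768])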

When multiple datasets are used for training, they are first split into batches separately (so that each batch only contains instances from one dataset); the batches are then pooled and shuffled before training. This means that small datasets will be underrepresented, which can be overcome by enabling proportional sampling (Section 4.2). During decoding, the loss function is only activated for tasks which are present in the current batch. By default, all tasks have an equal weight in the loss function; the weights can be tuned (see Section 4).
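A minimal sketch of this batching scheme is given below; the function names and the loss-weight handling are illustrative assumptions, not MaChAmp's actual API:

# Hedged sketch: batch each dataset separately, pool and shuffle the batches,
# and only compute the loss for tasks present in the current batch.
import random

def make_batches(instances, batch_size):
    return [instances[i:i + batch_size] for i in range(0, len(instances), batch_size)]

def build_training_batches(datasets, batch_size=32):
    """datasets: dict mapping a dataset name to its list of instances."""
    all_batches = []
    for name, instances in datasets.items():
        for batch in make_batches(instances, batch_size):
            all_batches.append((name, batch))   # remember which dataset a batch came from
    random.shuffle(all_batches)                 # small datasets stay underrepresented
    return all_batches

def batch_loss(task_losses, tasks_in_batch, loss_weights):
    # Only tasks annotated in the current batch contribute to the loss.
    return sum(loss_weights.get(t, 1.0) * task_losses[t] for t in tasks_in_batch)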

2.2 Supported tasks

In the following, we describe the tasks MaChAmp supports. We distinguish two main types of tasks, one where the annotation is done on the word level (i.e. word-level tasks), and one where longer utterances of text are annotated with labels. For the latter, MaChAmp currently only supports sentence classification.

Sequence labeling

This supports classical token-level sequence prediction tasks, like part-of-speech tagging. Currently, MaChAmp uses greedy decoding with a softmax output layer over the hidden BERT WordPiece representation, similar to Kondratyuk and Straka (2019).

String2string

This is an extension to sequence labeling, which learns a conversion from each input word to its label. Instead of predicting the labels directly, the model learns to predict the conversion. This strategy is commonly used for lemmatization Chrupała (2006); Kondratyuk and Straka (2019), where it greatly reduces the label vocabulary. We use the transformation algorithm from UDPipe-Future Straka (2018), which was also used by Kondratyuk and Straka (2019).
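To illustrate the string2string idea, the sketch below uses a simplified prefix/suffix edit rule (not the actual UDPipe-Future transformation algorithm): the model would predict the rule rather than the lemma itself, which keeps the label vocabulary small.

# Hedged illustration of string2string labels for lemmatization: encode each
# word-lemma pair as a small edit rule ("strip N characters, append S").
def word_to_rule(word: str, lemma: str) -> str:
    i = 0
    while i < min(len(word), len(lemma)) and word[i] == lemma[i]:
        i += 1                                    # longest common prefix
    return f"strip{len(word) - i}+{lemma[i:]}"

def apply_rule(word: str, rule: str) -> str:
    strip, suffix = rule[len("strip"):].split("+", 1)
    return word[:len(word) - int(strip)] + suffix

rule = word_to_rule("walking", "walk")   # -> "strip3+"
assert apply_rule("walking", rule) == "walk"
rule = word_to_rule("ate", "eat")        # -> "strip3+eat" (irregular forms get longer rules)
assert apply_rule("ate", rule) == "eat"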

Dependency parsing

As in UDify Kondratyuk and Straka (2019), MaChAmp implements the deep biaffine parser Dozat and Manning (2017) and uses the Chu-Liu/Edmonds algorithm Chu (1965); Edmonds (1967) for decoding the final tree. The default evaluation metric is LAS over all tokens (as opposed to accuracy for the other task types).
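As an illustration of biaffine arc scoring, a minimal sketch in the spirit of Dozat and Manning (2017) is shown below; it is not MaChAmp's decoder code, and the MLPs and bias handling are simplified assumptions (the arc dimension follows Table 1):

# Hedged sketch of deep biaffine arc scoring.
import torch
import torch.nn as nn

class BiaffineArcScorer(nn.Module):
    def __init__(self, hidden_size=768, arc_dim=768):
        super().__init__()
        self.head_mlp = nn.Sequential(nn.Linear(hidden_size, arc_dim), nn.ReLU())
        self.dep_mlp = nn.Sequential(nn.Linear(hidden_size, arc_dim), nn.ReLU())
        self.U = nn.Parameter(torch.zeros(arc_dim, arc_dim))
        self.head_bias = nn.Linear(arc_dim, 1, bias=False)

    def forward(self, enc):                      # enc: (batch, seq_len, hidden)
        head = self.head_mlp(enc)                # (batch, seq_len, arc_dim)
        dep = self.dep_mlp(enc)                  # (batch, seq_len, arc_dim)
        # scores[b, i, j] = score of token j being the head of token i
        scores = dep @ self.U @ head.transpose(1, 2) + self.head_bias(head).transpose(1, 2)
        return scores                            # (batch, seq_len, seq_len)

# The highest-scoring arcs are then constrained to form a tree with the
# Chu-Liu/Edmonds maximum spanning arborescence algorithm.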

Text classification

For text classification, MaChAmp predicts a label for each text using the pooled [CLS] output of BERT. Pooling is performed by a feed-forward layer with a tanh activation, which is trained with the Next Sentence Prediction objective Devlin et al. (2019) during pre-training. For tasks which model a relation between multiple sentences (e.g., RTE), a special [SEP] token is automatically inserted, so that the model can take this into account.
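The sketch below (simplified tokenization and classifier head, assumed names, not MaChAmp's actual code) illustrates how a sentence pair can be packed with [CLS] and [SEP] and classified from the pooled [CLS] representation:

# Hedged sketch: sentence-pair packing and a [CLS]-based classification head.
import torch
import torch.nn as nn

def pack_pair(tokens_a, tokens_b):
    # Single sequence with a [SEP] between the two sentences, as for RTE-style tasks.
    return ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]

class ClsClassifier(nn.Module):
    def __init__(self, hidden_size=768, num_labels=2):
        super().__init__()
        self.pooler = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.Tanh())
        self.out = nn.Linear(hidden_size, num_labels)

    def forward(self, encoded):                  # encoded: (batch, seq_len, hidden)
        cls = self.pooler(encoded[:, 0])         # pooled [CLS] token
        return self.out(cls)                     # label logits

print(pack_pair(["Smell", "ya", "later", "!"], ["See", "you", "soon", "."]))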

3 Usage

To use MaChAmp, one needs a configuration file, input data and a command to start the training or prediction. In this section we will describe each of these requirements.

3.1 Configuration

The model requires two configuration files: one that specifies the datasets and tasks, and one for the hyperparameters. For the hyperparameter configuration a default is already provided (configs/params.json), which should give reasonably high performance for most tasks. In the following subsections, we describe the most useful options for both configurations.

{
    "UD": {
        "train_data_path": "data/ewt.train",
        "validation_data_path": "data/ewt.dev",
        "word_idx": 1,
        "tasks": {
            "lemma": {"task_type": "string2string", "column_idx": 2},
            "upos": {"task_type": "seq", "column_idx": 3}
        }
    },
    "RTE": {
        "train_data_path": "data/RTE.train",
        "validation_data_path": "data/RTE.dev",
        "sent_idxs": [0, 1],
        "tasks": {
            "rte": {"task_type": "classification", "column_idx": 2}
        }
    }
}

Figure 2: Example dataset configuration file, to predict UPOS, lemmas, and textual entailment simultaneously.

A simple example of a dataset configuration file is shown in Figure 2. On the first level, the dataset names are specified (i.e., “UD” and “RTE”), which should be unique identifiers. Each of these datasets needs at least a train_data_path, a validation_data_path, a word index (i.e., word_idx) or sentence indices (i.e., sent_idxs), and a list of tasks. The word index specifies on which column the input words are to be read (see Section 3.2). For each of the defined tasks, the user is required to define the task_type (Section 2.2), and the column index from which to read the labels (i.e., column_idx). More options that can be passed on the task level are discussed in Section 4.1.

3.2 Data format

MaChAmp supports two types of data formats, which correspond to the level of annotation (Section 2.2). For word-level tasks, we will use the term "word-level file format", whereas for sentence-level tasks, we will use "sentence-level file format".

The word-level file format is similar to the CoNLL-U format Nivre (2015) introduced for Universal Dependencies. It assumes one word per line, with the annotation layers following each word, separated by tab characters (Figure 3(a)). Sentences are delimited by an empty line. Comments are lines on top of a sentence which have a different number of columns than the token lines. (We do not identify comments by lines starting with a '#', because datasets might have words in the first column that start with a '#'.) It should be noted that for dependency parsing, the relation label is assumed to be in the column_idx column and the head index in the following column.

The sentence-level file format for sentence classification is very similar (Figure 3(b)), except that there can be multiple inputs. In contrast to word_idx, a list of sent_idxs is defined to enable modeling the relation between any arbitrary number of them.

1	smell	VERB
2	ya	PRON
3	later	ADV
4	!	PUNCT

(a) Example of a word-level file format, where word_idx should be 1, and task_idx 2.

smell ya later !	negative

(b) Example of a sentence-level file format, where sent_idxs should be [0] and task_idx 1.
Figure 3: Examples of data file formats.

3.3 Training

Given the setup illustrated in the previous sections, a model can be trained using the following command. It assumes the configuration file (Figure 2) is called configs/upos-lemma-rte.json.

python3 train.py --parameters_config \
 configs/params.json --dataset_config \
 configs/upos-lemma-rte.json

As is common in AllenNLP Gardner et al. (2018), by default the model and the logs will be written to logs/JSONNAME/DATE. The name of the directory can be set manually by providing --name NAME. Furthermore, --device can be used to specify which GPU to use (-1 for CPU is the default).

3.4 Inference

Prediction on new data can then be done using the following command:

python3 predict.py \
 logs/<NAME>/<DATE>/model.tar.gz \
 <INPUT FILE> <OUTPUT FILE>

It requires the path to the best model serialized during training stored as model.tar.gz in the log directory as specified above.

4 Options

For the full list of tuning hyperparameters, see the default configs/params.json in the repository (and Section 5.2). In this section we first discuss the parameters which can be defined for tasks individually, and then describe how to change the pre-trained embeddings.

4.1 Dataset configuration

The settings described in this section can only be set on the task level.

Metric

Can be used to specify the evaluation metric. If not set, it defaults to accuracy, except when task_type is set to dependency, in which case LAS, as defined by AllenNLP Gardner et al. (2018), is used. Possible metrics are: ‘acc’, ‘LAS’, ‘micro-f1’, ‘macro-f1’ and ‘span_f1’ (for span-based sequence labeling).

Layer

Defines which layers are used for predicting the task. The model uses all layers up to the specified one (e.g., if it is set to 8, it uses layers 1 to 8). As explained in Section 2, by default the model uses layer attention to mix the information from the specified layers.

Loss weights

In multi-task settings, not all tasks might be equally important, or some tasks might just be harder to learn, and therefore should gain more weight during training. This can be done by setting the loss_weight parameter on the task level (by default the value is 1.0 for all tasks).

Adaptive

This enables the adaptive softmax loss function Grave et al. (2017). This loss function groups the labels into several clusters based on their frequency. By first focusing on the more frequent labels the model can gain both in efficiency and in performance. This should mostly be beneficial for tasks with imbalanced label spaces.

Following UDify, the adaptive softmax loss is set to True by default (with cutoff values 8 and 15), as it generally improves performance. However, when the label vocabulary of a task is smaller than 8, the cutoff values do not apply and the adaptive softmax is not active.
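For illustration, the following sketch uses PyTorch's built-in implementation of the adaptive softmax of Grave et al. (2017) with cutoffs 8 and 15; it shows the mechanism and is not MaChAmp's decoder code:

# Hedged sketch of an adaptive softmax output layer with cutoffs 8 and 15.
import torch
import torch.nn as nn

hidden_size, num_labels = 768, 50
# Labels must be ordered by frequency: the 8 most frequent labels go into the
# head cluster, labels 8-14 into the second cluster, the rest into the tail.
adaptive_loss = nn.AdaptiveLogSoftmaxWithLoss(
    in_features=hidden_size, n_classes=num_labels, cutoffs=[8, 15]
)

hidden = torch.randn(32, hidden_size)            # decoder inputs for 32 tokens
gold = torch.randint(0, num_labels, (32,))       # frequency-ranked label ids
output = adaptive_loss(hidden, gold)
print(output.loss)                               # scalar training loss
predictions = adaptive_loss.predict(hidden)      # most likely label per token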

4.2 Hyperparameter configuration

Whereas most of the hyperparameters can simply be changed from the default parameters (configs/params.json) in the repository, we would like to highlight two settings.

BERT model

The path to a pre-trained BERT model can be set in the params.json file, specifically the pretrained_model value in the dataset reader's bert section. The model expects the embeddings to be in PyTorch format, which can be obtained with the pytorch_transformers conversion command.

Proportional sampling

To prevent larger datasets from overwhelming the model, proportional sampling can be enabled (iteration/proportional_sampling). In previous work this has been shown to reach high performance when modeling multiple tasks hierarchically Sanh et al. (2019). When enabled, the model will first pick a random task, and then pick a random batch from that task. In other words, all datasets will contribute a roughly equal number of batches: smaller datasets are up-scaled and larger datasets are down-scaled (the batch size itself remains the same). (We also experimented with only up-scaling and only down-scaling, but found the former to be too inefficient and the latter to lead to sub-optimal performance.) It should be noted that for specific tasks, more involved strategies have been devised Wang et al. (2020); Stickland and Murray (2019).
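A minimal sketch of proportional sampling as described above (illustrative names, not MaChAmp's actual implementation):

# Hedged sketch: pick a task uniformly at random, then a random batch from that
# task, so every dataset contributes roughly the same number of batches per epoch.
import random

def proportional_batches(batches_per_dataset, batches_per_epoch):
    """batches_per_dataset: dict mapping dataset name to its list of batches."""
    names = list(batches_per_dataset)
    for _ in range(batches_per_epoch):
        name = random.choice(names)                       # uniform over datasets
        yield name, random.choice(batches_per_dataset[name])

# Example: a 10x smaller dataset is sampled as often as the large one.
datasets = {"RTE": [f"rte_b{i}" for i in range(10)],
            "MNLI": [f"mnli_b{i}" for i in range(100)]}
epoch = list(proportional_batches(datasets, batches_per_epoch=20))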

Parameter            Value       Tested range
Optimizer            Adam
β1, β2               0.9, 0.99
Weight decay         0.01
Label smoothing      0.03
Dropout              0.5         0.3, 0.5, 0.7
BERT dropout         0.1         0.1, 0.2
Mask probability     0.1         0.1, 0.2, 0.3
Layer dropout        0.1
Batch size           32          16, 32, 64
Epochs               80
Patience             5
Base learning rate   .001        .0001, .001, .01
BERT learning rate
Warmup rate
Gradient clipping    5.0
Dep. tag dimension   256
Dep. arc dimension   768

Table 1: Final parameter settings, incl. tested ranges.

5 Experiments

In this section we describe the procedure how we determined robust default parameters for MaChAmp; note that the goal is not to achieve a new state-of-the-art, but generally to reach on-par performance for multiple tasks, while reaching one robust setting of hyperparameters. To this end, we will describe the datasets and hyperparameters used for tuning, and the obtained results in single-task and multi-task setups.

5.1 Datasets

We report performance over three benchmarks. They were selected to cover a range of NLP tasks, from syntactic to semantic and inference-level tasks, spanning the diverse supported tasks and different dataset setups. Next we describe each dataset and all its tasks. For simplicity (and due to availability), we focus on English datasets only.

Ud (Ewt)

The English Universal Dependencies data Silveira et al. (2014), derived from the English Web Treebank. It was among the earliest English datasets in UD and is the most commonly used dataset for English UD dependency parsing. It contains 5 tasks: fine-grained and coarse-grained POS tagging (XPOS and UPOS, respectively), lemmatization, morphological tagging, and dependency parsing. This dataset is an example where all tasks are annotated jointly for every instance.

Glue

The General Language Understanding Evaluation benchmark Wang et al. (2018) has become the default benchmark for inference-type or semantic tasks, including entailment, paraphrasing and sentiment analysis. It consists of a collection of several datasets Warstadt et al. (2019); Socher et al. (2013); Dolan and Brockett (2005); Cer et al. (2017); Williams et al. (2018); Rajpurkar et al. (2018); Bentivogli et al. (2009); Levesque et al. (2012), and hence represents an example of a setup with multiple disjointly labeled datasets. We use all datasets except for STS-B, which is a regression task, and WNLI, following previous work Devlin et al. (2019).

                       EWT v2.3                                  PMB v3.0
Task                   dep     feats   lemma   upos    xpos     lemma   semtag  supertag  verbnet  wordnet
Task type              dep     seq     s2s     seq     seq      s2s     seq     seq       seq      s2s
Train size             205k                                     43k
MaChAmp (single)       89.90   97.18   98.21   97.01   96.64    97.52   98.32   94.87     94.37    89.15
MaChAmp (multi)        89.61   97.15   97.79   97.01   96.79    97.33   98.23   94.91     94.54    89.32
UDify                  89.67   97.15   97.80   96.90

                       GLUE
Task                   cola    mnli    mnli-mis  mrpc    qnli    qqp     rte     snli    sst-2
Task type              c       c       c         c       c       c       c       c       c
Train size             8.5k    392k    392k      3.6k    108k    363k    2.5k    549k    67k
MaChAmp (single)       78.04   81.99   82.15     86.03   88.31   89.75   72.20   89.58   90.71
MaChAmp (multi)        72.20   82.35   82.80     82.11   86.58   89.27   73.65   89.61   90.25
MaChAmp (multi+prop.)  76.03   80.40   80.55     84.31   87.26   87.40   73.29   87.41   90.14
BERT-base (reported): 84.4, 86.7, 93.3

Table 2: Performance of MaChAmp (single-task), MaChAmp (multi-task), and MaChAmp (multi-task + proportional sampling) on the dev sets. For all tasks, accuracy is used as the metric, except for dependency parsing, where LAS is used. Training size is the number of annotated instances (words for EWT and PMB, sentences for GLUE). Results reported per dataset are from UDify Kondratyuk and Straka (2019) and BERT-base Devlin et al. (2019). The task types are: dep: dependency parsing; seq: sequence labeling; c: text classification; s2s: string2string.

Pmb

The Parallel Meaning Bank Bos (2015); Abzianidze et al. (2017) is a multilingual data collection for semantic processing. It includes a range of basic tasks, all of which form the basis to generate a Discourse Representation Structure (DRS) using Boxer Bos (2015). The base tasks, which we evaluate on here, include: CCG supertagging, semtagging, VerbNet tagging and WordNet tagging. We use the English part of PMB version 3.0.0. It should be noted that for the WordNet senses, we used the string2string task type, which reduced the vocabulary size from 4,443 to 1,804. This is arguably an unusual task to tackle, because we predict word senses without knowing which senses exist, which is why performance is lower compared to the other tasks (Table 2).

5.2 Hyperparameter tuning

Because UDify Kondratyuk and Straka (2019) focused on training on many UD parsing datasets and languages simultaneously, its hyperparameters were tuned towards massive data sizes. We compared a range of hyperparameter settings for our three setups using grid search (capping the dataset sizes to a maximum of 20,000 sentences for efficiency reasons), and used the parameter settings that reached the highest rank (averaged over the three sets). In Table 1 we report the best hyperparameters across all datasets (these are the defaults of the toolkit), and the range of parameters which we evaluated. Patience and model selection are based on the sum of the evaluation metrics of all tasks.

5.3 Results

The final performance for all datasets and tasks on the development data is reported in Table 2. For each dataset we ran a multi-task model, performing all tasks jointly (for GLUE, we train on all datasets jointly). We compare this to single-task models, where we train a separate model for each task.

First, we see that MaChAmp obtains state-of-the-art performance on EWT, for which we can directly compare on all tasks except fine-grained POS tagging. In more detail, the results of UDify Kondratyuk and Straka (2019) are in a similar range for all tasks; the largest difference and improvement for MaChAmp is on dependency parsing, which is probably due to parameter tuning. The multi-task model works well and for both EWT and PMB performs similarly to the single-task models; on some of the tasks it even slightly outperforms them.

However, on GLUE the performance of the multi-task model lags behind on 6 out of the 9 tasks. This is probably due to the immense size and disparity of the datasets. Proportional sampling helps to train the model more quickly and, especially for the smaller datasets, results in better accuracies. Comparing to the original results of BERT Devlin et al. (2019), we see that their scores are higher. This is probably because of differences in the setup: they used BERT-base whereas we use multilingual BERT, and they trained for only 3 epochs, which only makes sense for large datasets. Furthermore, they tuned the learning rate per task, whereas we were mainly looking for a robust setting over multiple datasets.

Learning Curve

As training a joint MTL model on GLUE is computationally expensive (it takes several days, particularly when SNLI with its over half a million training instances is included), we trained single-task models and examine the effect of increasing data size per GLUE task. We focus on the larger GLUE tasks, which have more than 100k training instances. The learning curves in Figure 4 show that all tasks benefit from more training data: even beyond 50k instances accuracy keeps increasing, except for SST-2, which starts to flatten out. MNLI is the task with the steepest learning curve and the lowest accuracy. This shows that the GLUE tasks remain challenging and that more data clearly helps most of the tasks.

Figure 4: Learning curves for MaChAmp on the dev sets for the large GLUE training datasets.

Proportional Sampling

We examine the effect of proportional sampling (Section 4.2) only for the GLUE benchmark, as it contains datasets of varying sizes. The performance per epoch is plotted in Figure 5. The plot clearly shows the advantage of using proportional sampling: higher performance is obtained with fewer epochs. These results also show that there is more potential to be gained; perhaps with a dynamic ratio we could benefit from both types of training (also reflected in Table 2, where both models show very different strengths). (It should be noted that the proportional sampling model could have been somewhat "unlucky" with the sampling, and might have benefited from training longer; the other model had many points where it did not improve for 3-4 epochs.)

Figure 5: Results of MaChAmp with and without proportional sampling.

6 Conclusions

We introduced MaChAmp, a flexible toolkit for BERT-based multi-task learning, and evaluated it on three multi-task benchmarks. Performance is on par with previous state-of-the-art models, and even higher in some low-resource settings. The source code is freely available.

Acknowledgments

This research was supported by an Amazon Research Award and an STSM in the Multi3Generation COST action (CA18231); Nvidia Corporation sponsored Titan GPUs. We thank the NLPL laboratory and the HPC team at ITU for the computational resources used in this work.

References

  • L. Abzianidze, J. Bjerva, K. Evang, H. Haagsma, R. van Noord, P. Ludmann, D. Nguyen, and J. Bos (2017) The parallel meaning bank: towards a multilingual corpus of translations annotated with compositional meaning representations. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Valencia, Spain, pp. 242–247. External Links: Link Cited by: §5.1.
  • L. Bentivogli, P. Clark, I. Dagan, and D. Giampiccolo (2009) The fifth pascal recognizing textual entailment challenge.. In TAC, Cited by: §5.1.
  • J. Bos (2015) Open-domain semantic parsing with boxer. In Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015), Vilnius, Lithuania, pp. 301–304. External Links: Link Cited by: §5.1.
  • R. Caruana (1993) Multitask learning: a knowledge-based source of inductive bias. In Proceedings of the Tenth International Conference on Machine Learning, Cited by: §1.
  • R. Caruana (1997) Multitask learning. In Learning to learn, pp. 95–133. Cited by: §1.
  • D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia (2017) SemEval-2017 task 1: semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Vancouver, Canada, pp. 1–14. External Links: Link, Document Cited by: §5.1.
  • G. Chrupała, Á. Kádár, and A. Alishahi (2015) Learning language through pictures. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Beijing, China, pp. 112–118. External Links: Link, Document Cited by: §1.
  • G. Chrupała (2006) Simple data-driven context-sensitive lemmatization. SEPLN. Cited by: §2.2.
  • Y. Chu (1965) On the shortest arborescence of a directed graph. Scientia Sinica 14, pp. 1396–1400. Cited by: §2.2.
  • K. Clark, M. Luong, U. Khandelwal, C. D. Manning, and Q. V. Le (2019) BAM! born-again multi-task networks for natural language understanding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 5931–5937. External Links: Link, Document Cited by: §1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Link, Document Cited by: 4th item, §1, §1, §2.1, §2.2, §5.1, §5.3, Table 2.
  • W. B. Dolan and C. Brockett (2005) Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005), External Links: Link Cited by: §5.1.
  • T. Dozat and C. D. Manning (2017) Deep biaffine attention for neural dependency parsing. Proceedings of 5th International Conference on Learning Representations (ICLR 2017). Cited by: §2.2.
  • J. Edmonds (1967) Optimum branchings. Journal of Research of the national Bureau of Standards B 71 (4), pp. 233–240. Cited by: §2.2.
  • M. Gardner, J. Grus, M. Neumann, O. Tafjord, P. Dasigi, N. F. Liu, M. Peters, M. Schmitz, and L. Zettlemoyer (2018) AllenNLP: a deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), Melbourne, Australia, pp. 1–6. External Links: Link, Document Cited by: §1, §3.3, §4.1.
  • E. Grave, A. Joulin, M. Cissé, H. Jégou, et al. (2017) Efficient softmax approximation for gpus. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1302–1310. Cited by: §4.1.
  • K. Hashimoto, C. Xiong, Y. Tsuruoka, and R. Socher (2017) A joint many-task model: growing a neural network for multiple NLP tasks. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 1923–1933. External Links: Link, Document Cited by: §1, §1.
  • J. Howard and S. Ruder (2018) Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 328–339. External Links: Link, Document Cited by: §1, §2.1.
  • J. Hu, S. Ruder, A. Siddhant, G. Neubig, O. Firat, and M. Johnson (2020) XTREME: a massively multilingual multi-task benchmark for evaluating cross-lingual generalization. arXiv preprint arXiv:2003.11080. Cited by: §1.
  • D. Kondratyuk and M. Straka (2019) 75 languages, 1 model: parsing universal dependencies universally. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 2779–2795. External Links: Link, Document Cited by: §1, §1, §2.2, §2.2, §2.2, §5.2, §5.3, Table 2.
  • A. Lazaridou, N. T. Pham, and M. Baroni (2015) Combining language and vision with a multimodal skip-gram model. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, pp. 153–163. External Links: Link, Document Cited by: §1.
  • H. Levesque, E. Davis, and L. Morgenstern (2012) The winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning, Cited by: §5.1.
  • C. D. Manning (2015) Computational linguistics and deep learning. Computational Linguistics 41 (4), pp. 701–707. Cited by: §1.
  • H. Martínez Alonso and B. Plank (2017) When is multitask learning effective? semantic sequence prediction under varying data conditions. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Valencia, Spain. External Links: Link Cited by: §1.
  • J. Nivre, M. de Marneffe, F. Ginter, Y. Goldberg, J. Hajič, C. D. Manning, R. McDonald, S. Petrov, S. Pyysalo, N. Silveira, R. Tsarfaty, and D. Zeman (2016) Universal dependencies v1: a multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), Portorož, Slovenia, pp. 1659–1666. External Links: Link Cited by: §1.
  • J. Nivre (2015) Towards a universal grammar for natural language processing. In International Conference on Intelligent Text Processing and Computational Linguistics, pp. 3–16. Cited by: §3.2.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8026–8037. External Links: Link Cited by: §1.
  • M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 2227–2237. External Links: Link, Document Cited by: §1.
  • B. Plank, A. Søgaard, and Y. Goldberg (2016) Multilingual part-of-speech tagging with bidirectional long short-term memory models and auxiliary loss. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Berlin, Germany, pp. 412–418. External Links: Link, Document Cited by: §1.
  • P. Rajpurkar, R. Jia, and P. Liang (2018) Know what you don’t know: unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Melbourne, Australia, pp. 784–789. External Links: Link, Document Cited by: §5.1.
  • S. Ruder and B. Plank (2018) Strong baselines for neural semi-supervised learning under domain shift. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 1044–1054. External Links: Link, Document Cited by: §1.
  • S. Ruder (2017) An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098. Cited by: §1.
  • V. Sanh, T. Wolf, and S. Ruder (2019) A hierarchical multi-task approach for learning embeddings from semantic tasks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6949–6956. Cited by: §1, §4.2.
  • N. Silveira, T. Dozat, M. de Marneffe, S. Bowman, M. Connor, J. Bauer, and C. D. Manning (2014) A gold standard dependency corpus for English. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), Cited by: §5.1.
  • R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, pp. 1631–1642. External Links: Link Cited by: §5.1.
  • A. Søgaard and Y. Goldberg (2016) Deep multi-task learning with low level tasks supervised at lower layers. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 231–235. Cited by: §1.
  • A. C. Stickland and I. Murray (2019) BERT and pals: projected attention layers for efficient adaptation in multi-task learning. In ICML, Cited by: §4.2.
  • M. Straka (2018) UDPipe 2.0 prototype at CoNLL 2018 UD shared task. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Brussels, Belgium, pp. 197–207. External Links: Link, Document Cited by: §2.2.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman (2018) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium, pp. 353–355. External Links: Link, Document Cited by: §5.1.
  • X. Wang, Y. Tsvetkov, and G. Neubig (2020) Balancing training for multilingual neural machine translation. In Annual Conference of the Association for Computational Linguistics (ACL), External Links: Link Cited by: §4.2.
  • A. Warstadt, A. Singh, and S. R. Bowman (2019) Neural network acceptability judgments. Transactions of the Association for Computational Linguistics 7, pp. 625–641. Cited by: §5.1.
  • A. Williams, N. Nangia, and S. Bowman (2018) A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1112–1122. External Links: Link Cited by: §5.1.
  • Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. Cited by: §2.1.