1 Introduction
Multiple tasks can often benefit from one another by leveraging more available data. For natural language tasks, a simple approach is to pre-train embeddings Mikolov et al. (2013); Peters et al. (2018) or a language model Radford et al. (2018); Devlin et al. (2018) over a large corpus. The learnt representations may then be used for downstream tasks such as part-of-speech tagging or parsing, for which there is less annotated data. Alternatively, multiple tasks may be trained simultaneously, either with a single model or by sharing some model components. In addition to potentially benefiting from multiple data sources, this approach also reduces memory use. However, multi-task models of similar size to single-task baselines often under-perform because of their limited capacity. The underlying multi-task model learns to improve on harder tasks, but may hit a plateau, while simpler (or data-poor) tasks can be over-trained (over-fitted). Regardless of data complexity, some tasks may be forgotten if the schedule is improper, a phenomenon known as catastrophic forgetting French (1999).
In this paper, we consider multilingual neural machine translation (NMT), where both of the above pathological learning behaviors are observed: sub-optimal accuracy on high-resource language pairs and forgetting on low-resource ones. Multilingual NMT models are generally trained by mixing language pairs in a predetermined fashion, such as sampling from each task uniformly Dong et al. (2015) or in proportion to dataset sizes Luong et al. (2015). While results are generally acceptable with a fixed schedule, it leaves little control over the performance of each task. We instead consider adaptive schedules that modify the importance of each task based on its validation set performance. The task schedule may be modified explicitly by controlling the probability of each task being sampled. Alternatively, the schedule may be fixed, with the impact of each task controlled by scaling the gradients or the learning rates. In this case, we highlight important subtleties that arise with adaptive learning rate optimizers such as Adam Kingma and Ba (2014). Our proposed approach improves low-resource pair accuracy while keeping high-resource accuracy intact within the same multi-task model.
2 Explicit schedules
A common approach for multi-task learning is to train on each task uniformly Dong et al. (2015). Alternatively, each task may be sampled following a fixed non-uniform schedule, often favoring either a specific task of interest or tasks with larger amounts of data Luong et al. (2015); Kiperwasser and Ballesteros (2018). Kiperwasser and Ballesteros (2018) also propose variable schedules that increasingly favor some tasks over time. As all these schedules are pre-defined (as a function of the training step or of the amount of available training data), they offer limited control over the performance of all tasks. As such, we consider adaptive schedules that vary based on the validation performance of each task during training.
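As a point of reference, the two fixed schedules mentioned above (uniform sampling and sampling in proportion to dataset size) can be sketched as follows. This is a minimal illustration rather than the training code used in the experiments; the function names and the rounded dataset sizes are assumptions made for the example.

```python
import random

def fixed_schedule(dataset_sizes, mode="uniform"):
    """Return fixed task-sampling probabilities for a dict of dataset sizes."""
    if mode == "uniform":
        return {task: 1.0 / len(dataset_sizes) for task in dataset_sizes}
    if mode == "proportional":
        total = sum(dataset_sizes.values())
        return {task: size / total for task, size in dataset_sizes.items()}
    raise ValueError(f"unknown mode: {mode}")

def sample_task(probs, rng):
    """Draw the task that provides the next training batch."""
    tasks = list(probs)
    return rng.choices(tasks, weights=[probs[t] for t in tasks], k=1)[0]

# Illustrative sizes only (number of sentence pairs per task).
sizes = {"en-de": 1_000_000, "en-fr": 40_000_000}
uniform = fixed_schedule(sizes, "uniform")
proportional = fixed_schedule(sizes, "proportional")

rng = random.Random(0)
batch_tasks = [sample_task(proportional, rng) for _ in range(10)]
```

With such an imbalance, proportional sampling visits the smaller task only about once every 41 batches, which is one reason intermediate or adaptive schedules are of interest.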
To do so, we assume that the baseline validation performance of each task, if trained individually, is known in advance (baseline scores can be obtained from already trained single-task models, or can be set to an expected value to be reached by the multi-task model). When training a multi-task model, validation scores are continually recorded in order to adjust task sampling probabilities. The unnormalized score $w_i$ of task $i$ is given by

$$w_i = \frac{1}{\min\left(1, \frac{s_i}{b_i}\right)^{\alpha} + \epsilon} \qquad (1)$$

where $s_i$ is the latest validation BLEU score and $b_i$ is the (approximate) baseline performance. Tasks that perform poorly relative to their baseline will be over-sampled, and vice-versa for language pairs with good performance. The hyper-parameter $\alpha$ controls how aggressive oversampling is, while $\epsilon$ prevents numerical errors and slightly smooths out the distribution. Final probabilities are simply obtained by dividing the raw scores by their sum.
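The scoring rule described above can be sketched in a few lines. The sketch assumes that the raw score is the inverse of the baseline-relative validation score (capped at 1) raised to a power, plus a small smoothing constant; the function name and the hyper-parameter values are hypothetical, and the baselines of 24 and 35 BLEU are the ones quoted in Appendix B.

```python
def explicit_sampling_probs(scores, baselines, alpha, eps):
    """Map validation scores to task-sampling probabilities.

    Assumed raw score: 1 / (min(1, score / baseline) ** alpha + eps).
    """
    raw = {}
    for task, score in scores.items():
        relative = min(1.0, score / baselines[task])
        raw[task] = 1.0 / (relative ** alpha + eps)
    total = sum(raw.values())
    # Normalize the raw scores so that they form a distribution.
    return {task: value / total for task, value in raw.items()}

baselines = {"en-de": 24.0, "en-fr": 35.0}

# En-De lags its baseline, so it should be sampled more often.
lagging = explicit_sampling_probs(
    {"en-de": 18.0, "en-fr": 35.0}, baselines, alpha=4.0, eps=0.1
)

# Both tasks at or above their baselines: the cap at 1 makes them equal.
at_baseline = explicit_sampling_probs(
    {"en-de": 24.0, "en-fr": 36.0}, baselines, alpha=4.0, eps=0.1
)
```

Capping the relative score at 1 means a task that exceeds its baseline is not penalized further, so the schedule falls back to uniform sampling once every task has reached its target.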
3 Implicit schedules
Explicit schedules may be too restrictive in some circumstances, such as when models are trained on a very large number of tasks, or when one task is sampled much more often than the others. Instead of explicitly varying task schedules, a similar impact may be achieved through learning rate or gradient manipulation. For example, the GradNorm Chen et al. (2017) algorithm scales task gradients based on the magnitude of the gradients as well as on the training losses.
As the training loss is not always a good proxy for validation and test performance, especially compared to a single-task baseline, we continue using validation set performance to guide gradient scaling factors. Here, instead of the previous weighting schemes, we consider one that satisfies the following desiderata. In addition to favoring tasks with low relative validation performance, we require that task weights be close to uniform early on, when performance is still low on all tasks. We also set a minimum task weight to avoid catastrophic forgetting.
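Task weights $w_i$, $i = 1, \dots, N$, follow

$$w_i = 1 + \operatorname{sign}(\bar{S} - S_i)\,\min\left(\gamma,\ \left(\max_j S_j\right)^{\alpha}\,\left|\bar{S} - S_i\right|^{\beta}\right) \qquad (2)$$

where $S_i = s_i / b_i$ is the ratio of the latest validation score $s_i$ of task $i$ to its baseline $b_i$, and $\bar{S}$ is the average relative score $\frac{1}{N}\sum_{j=1}^{N} S_j$. $\gamma$ sets the floor to prevent catastrophic forgetting, $\alpha$ adjusts how quickly and strongly the schedule may deviate from uniform, while a small $\beta$ emphasizes deviations from the mean score. With two tasks, the task weights already sum up to two, as in GradNorm Chen et al. (2017). With more tasks, the weights may be adjusted so that their sum matches the number of tasks.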
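The behaviour of such a weighting can be checked numerically. The sketch below assumes the form $w_i = 1 + \operatorname{sign}(\bar{S} - S_i)\min(\gamma, (\max_j S_j)^{\alpha}|\bar{S} - S_i|^{\beta})$ on baseline-relative scores; the function names, the hyper-parameter values and the three-task example are hypothetical and chosen only to illustrate the stated desiderata.

```python
def implicit_task_weights(scores, baselines, alpha, beta, gamma):
    """Compute per-task weights from baseline-relative validation scores."""
    relative = {t: scores[t] / baselines[t] for t in scores}
    mean_rel = sum(relative.values()) / len(relative)
    max_rel = max(relative.values())
    weights = {}
    for task, rel in relative.items():
        deviation = abs(mean_rel - rel)
        if deviation == 0.0:
            weights[task] = 1.0  # exactly at the mean: keep the uniform weight
            continue
        # The deviation from uniform is capped at gamma, so w >= 1 - gamma.
        magnitude = min(gamma, (max_rel ** alpha) * (deviation ** beta))
        direction = 1.0 if rel < mean_rel else -1.0
        weights[task] = 1.0 + direction * magnitude
    return weights

def renormalize(weights):
    """Rescale weights so that they sum to the number of tasks."""
    scale = len(weights) / sum(weights.values())
    return {t: w * scale for t, w in weights.items()}

baselines = {"en-de": 24.0, "en-fr": 35.0}
params = dict(alpha=8.0, beta=0.5, gamma=0.2)

# Early in training all scores are low, so weights stay close to uniform.
early = implicit_task_weights({"en-de": 3.0, "en-fr": 7.0}, baselines, **params)

# Later, the task that lags its baseline is up-weighted, up to the cap.
late = implicit_task_weights({"en-de": 20.0, "en-fr": 35.0}, baselines, **params)

# With three tasks the raw weights need not sum to three.
three_raw = implicit_task_weights(
    {"a": 10.0, "b": 20.0, "c": 20.0}, {"a": 20.0, "b": 20.0, "c": 20.0}, **params
)
three = renormalize(three_raw)
```

In the two-task case the deviations from the mean are symmetric, which is why the weights sum to two without any further adjustment.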
3.1 Optimization details
Scaling either the gradients $g_t$ or the per-task learning rates $\eta$ is equivalent with standard stochastic gradient descent, but not with adaptive optimizers such as Adam Kingma and Ba (2014), whose update rule is given in Eq. 3.

$$\theta_t = \theta_{t-1} - \eta\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \delta} \qquad (3)$$

Here $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected estimates of the first and second moments of the gradients, and $\delta$ is a small constant for numerical stability. Moreover, whether or not the optimizer accumulators (e.g. the running averages of the first and second moments of the gradients, $m_t$ and $v_t$) are shared across tasks also has an impact. Using separate optimizers and simultaneously scaling the gradients of individual tasks is ineffective. Indeed, Adam is scale-insensitive because the updates are divided by the square root of the second moment estimate. The opposite scenario, a shared optimizer across tasks with scaled learning rates, is also problematic as the momentum effect ($\hat{m}_t$) will blur all tasks together at every update. All experiments we present use distinct optimizers, with scaled learning rates. The converse, a shared optimizer with scaled gradients, could also potentially be employed.
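The scale-insensitivity argument can be verified with a scalar re-implementation of the standard Adam update. This is a self-contained sketch, not the optimizer used in the experiments; the gradient values are invented and the hyper-parameter defaults are the commonly used Adam defaults.

```python
import math

def adam_updates(grads, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the sequence of parameter updates Adam produces for scalar grads."""
    m = v = 0.0
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g        # first moment running average
        v = beta2 * v + (1 - beta2) * g * g    # second moment running average
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        updates.append(-lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates

def sgd_updates(grads, lr=0.001):
    """Return the sequence of parameter updates plain SGD produces."""
    return [-lr * g for g in grads]

grads = [0.5, -0.2, 0.8, 0.3]
scale = 0.1
scaled_grads = [scale * g for g in grads]

# SGD: scaling the gradients and scaling the learning rate are equivalent.
sgd_scaled_grads = sgd_updates(scaled_grads)
sgd_scaled_lr = sgd_updates(grads, lr=0.001 * scale)

# Adam with its own accumulators: scaling the gradients has almost no effect,
# whereas scaling the learning rate scales every update.
adam_plain = adam_updates(grads)
adam_scaled_grads = adam_updates(scaled_grads)
adam_scaled_lr = adam_updates(grads, lr=0.001 * scale)
```

The residual difference between `adam_plain` and `adam_scaled_grads` comes only from the stability constant in the denominator, which is why task weights need to act on the learning rate when each task has its own optimizer.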
4 Experiments
4.1 Data
We extract data from the WMT’14 English-French (En-Fr) and English-German (En-De) datasets. To create a clear dataset size imbalance between the tasks, the En-De data is artificially restricted to only 1 million parallel sentences, while the full En-Fr dataset, comprising almost 40 million parallel sentences, is used in its entirety. Words are split into subword units with a joint vocabulary of 32K tokens (the joint vocabulary is extracted from the full En-De and En-Fr datasets). BLEU scores are computed on the tokenized output with multi-bleu.perl from Moses Koehn et al. (2007).
4.2 Models
All baselines are Transformer models in their base configuration Vaswani et al. (2017), using 6 encoder and decoder layers, with model and hidden dimensions of 512 and 2048 respectively, and 8 heads for all attention layers. For initial multi-task experiments, all model parameters were shared Johnson et al. (2017), but performance was down by multiple BLEU points compared to the baselines. As the source language is the same for both tasks, in subsequent experiments, only the encoder is shared Dong et al. (2015). For En-Fr, 10% dropout is applied as in Vaswani et al. (2017). After observing severe overfitting on En-De in early experiments, the rate is increased to 25% for this lower-resource task. All models are trained on 16 GPUs, using the Adam optimizer with an inverse square root learning rate schedule Vaswani et al. (2017) and warmup.
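For readers unfamiliar with it, the inverse square root schedule with warmup can be sketched as below, following the formula given by Vaswani et al. (2017). How a task weight is combined with this schedule (a plain multiplication here) is an assumption of the sketch, as are the function names and the warmup length used in the example.

```python
def inverse_sqrt_lr(step, warmup_steps, d_model=512, scale=1.0):
    """Linear warmup followed by inverse square root decay."""
    step = max(step, 1)  # avoid dividing by zero at step 0
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

def per_task_lr(step, task_weight, warmup_steps):
    """Assumed composition: the task weight rescales the global schedule."""
    return task_weight * inverse_sqrt_lr(step, warmup_steps)

warmup = 4000  # hypothetical value for the example
peak = inverse_sqrt_lr(warmup, warmup)
during_warmup = inverse_sqrt_lr(1000, warmup)
after_decay = inverse_sqrt_lr(4 * warmup, warmup)  # half the peak value
boosted = per_task_lr(4 * warmup, task_weight=1.2, warmup_steps=warmup)
```

Because the decay is shared by all tasks, task weights only modulate the learning rate around a trajectory that is fixed in advance.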
4.3 Results
The main results are summarized in Table 1. Considering the amount of training data, we trained single-task baselines for 400K and 600K steps for En-De and En-Fr respectively, while multi-task models were trained for 900K steps. All reported scores are the average of the last 20 checkpoints. Within each general schedule type, model selection was performed by maximizing the average development BLEU score between the two tasks.
With uniform sampling, results improve by more than 1 BLEU point on En-De, but there is a significant degradation on En-Fr. Sampling En-Fr with a 75% probability gives similar results on En-De, but the En-Fr performance is now comparable to the baseline. Explicit adaptive scheduling behaves similarly on En-De and somewhat trails the En-Fr baseline.
| Method | Task 1 (En-De) Dev | Task 1 (En-De) Test | Task 2 (En-Fr) Dev | Task 2 (En-Fr) Test |
|---|---|---|---|---|
| Explicit - Constant (50% En-Fr) | 24.80 | 26.14 | 34.25 | 39.98 |
| Explicit - Constant (75% En-Fr) | 24.53 | 26.16 | 34.56 | 41.00 |
| Explicit - Validation based | 24.67 | 26.35 | 34.55 | 40.70 |
| Implicit - GradNorm | 24.69 | 26.42 | 34.33 | 40.28 |
| Implicit - Validation based | 24.32 | 25.58 | 34.67 | 40.89 |
For implicit schedules, GradNorm performs reasonably strongly on En-De, but suffers on En-Fr, although slightly less than with uniform sampling. Implicit validation-based scheduling still improves upon the En-De baseline, but less than the other approaches. On En-Fr, this approach performs about as well as the baseline and the multilingual model with a fixed 75% En-Fr sampling probability.
Overall, adaptive approaches satisfy our main desideratum, satisfactory performance on both tasks, but a hyper-parameter search over constant schedules led to slightly better results. One main appeal of adaptive models is their potential ability to scale much better to a very large number of tasks, where a large hyper-parameter search would prove prohibitively expensive.
Additional results are presented in the appendix.
5 Discussion and other related work
To train multi-task vision models, Liu et al. (2018) propose a similar dynamic weight average approach. Task weights are controlled by the ratio between a recent training loss and the loss at a previous time step, so that tasks that progress faster will be downweighted, while straggling ones will be upweighted. This approach contrasts with the curriculum learning framework proposed by Matiisen et al. (2017), where tasks with faster progress are preferred. Loss progress, as well as a few other signals, was also employed by Graves et al. (2017), who formulated curriculum learning as a multi-armed bandit problem. One advantage of using progress as a signal is that the final baseline losses are not needed. Dynamic weight average could also be adapted to employ a validation metric as opposed to the training loss. Alternatively, uncertainty may be used to adjust multi-task weights Kendall et al. (2018).
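To make the loss-ratio idea concrete, the following sketch computes weights from the ratio of a recent loss to an earlier one. It assumes that the ratios are passed through a softmax with a temperature and rescaled to sum to the number of tasks, which is our reading of dynamic weight average rather than a verified reproduction; the function name, the temperature and the loss values are hypothetical.

```python
import math

def dynamic_weight_average(recent_losses, previous_losses, temperature=2.0):
    """Weight tasks by how slowly their training loss is decreasing."""
    # A ratio near (or above) 1 means little progress; a small ratio means fast progress.
    ratios = {t: recent_losses[t] / previous_losses[t] for t in recent_losses}
    exps = {t: math.exp(r / temperature) for t, r in ratios.items()}
    total = sum(exps.values())
    n_tasks = len(ratios)
    # Softmax over ratios, rescaled so that the weights sum to the number of tasks.
    return {t: n_tasks * e / total for t, e in exps.items()}

weights = dynamic_weight_average(
    recent_losses={"fast": 1.0, "slow": 1.9},
    previous_losses={"fast": 2.0, "slow": 2.0},
)
```

Replacing the training losses with a validation metric, as suggested above, would only change how the ratios are computed.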
Sener and Koltun (2018) discuss multi-task learning as a multi-objective optimization. Their objective tries to achieve Pareto optimality, so that a solution to a multi-task problem cannot improve on one task without hurting another. Their approach is learning-based and, contrary to ours, does not require a somewhat ad-hoc mapping between task performance (or progress) and task weights. However, Pareto optimality of the training losses does not guarantee Pareto optimality of the evaluation metrics. Xu et al. (2018) present AutoLoss, which uses reinforcement learning to train a controller that determines the optimization schedule. In particular, they apply their framework to (single language pair) NMT with auxiliary tasks.
With implicit scheduling approaches, the effective learning rates are still dominated by the underlying predefined learning rate schedule. For single tasks, hypergradient descent Baydin et al. (2017) adjusts the global learning rate by considering the direction of the gradient and of the previous update. This technique could likely be adapted for multi-task learning, as long as the tasks are sampled randomly.
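The idea behind hypergradient descent can be illustrated on a one-dimensional problem. The sketch below assumes the plain SGD variant, in which the learning rate is increased when consecutive gradients point in the same direction and decreased otherwise; the function name, the quadratic objective and all numeric values are hypothetical.

```python
def sgd_hypergradient(grad_fn, theta, lr, hyper_lr, steps):
    """Run SGD while adapting the learning rate from consecutive gradients."""
    prev_grad = 0.0
    history = []
    for _ in range(steps):
        grad = grad_fn(theta)
        # The previous update was -lr * prev_grad, so agreement between the
        # current gradient and the previous one suggests a larger step.
        lr = lr + hyper_lr * grad * prev_grad
        theta = theta - lr * grad
        prev_grad = grad
        history.append((theta, lr))
    return history

# Minimize f(x) = x^2, whose gradient is 2x, starting from x = 1.
history = sgd_hypergradient(lambda x: 2.0 * x, theta=1.0, lr=0.01, hyper_lr=0.001, steps=5)
learning_rates = [lr for _, lr in history]
```

In a multi-task setting the same rule would see gradients from different tasks at consecutive steps, which is why random task sampling matters for the adaptation to remain meaningful.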
Tangentially, adaptive approaches may behave poorly if validation performance varies much faster than the rate at which it is computed. Figure 6 (appendix) illustrates a scenario, with an alternative parameter sharing scheme, where BLEU scores and task probabilities oscillate wildly. As one task is favored, the other is catastrophically forgotten. When new validation scores are computed, the sampling weights change drastically, and the first task now begins to be forgotten.
6 Conclusion
We have presented adaptive schedules for multilingual machine translation, where task weights are controlled by validation BLEU scores. The schedules may either be explicit, directly changing how tasks are sampled, or implicit, adjusting the optimization process. Compared to single-task baselines, performance improved on the low-resource En-De task and was comparable on the high-resource En-Fr task.
For future work, in order to increase the utility of adaptive schedulers, it would be beneficial to explore their use on a much larger number of simultaneous tasks. In this scenario, they may prove more useful as hyper-parameter search over fixed schedules would become cumbersome.
References
- Baydin et al. (2017). Online learning rate adaptation with hypergradient descent. arXiv preprint arXiv:1703.04782.
- Chen et al. (2017). GradNorm: gradient normalization for adaptive loss balancing in deep multitask networks. arXiv preprint arXiv:1711.02257.
- Devlin et al. (2018). BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Dong et al. (2015). Multi-task learning for multiple language translation. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Vol. 1, pp. 1723–1732.
- French (1999). Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences 3 (4), pp. 128–135.
- Graves et al. (2017). Automated curriculum learning for neural networks. arXiv preprint arXiv:1704.03003.
- Johnson et al. (2017). Google’s multilingual neural machine translation system: enabling zero-shot translation. Transactions of the Association of Computational Linguistics 5 (1), pp. 339–351.
- Kendall et al. (2018). Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. pp. 7482–7491.
- Kingma and Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Kiperwasser and Ballesteros (2018). Scheduled multi-task learning: from syntax to translation. Transactions of the Association for Computational Linguistics 6, pp. 225–240.
- Koehn et al. (2007). Moses: open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions, pp. 177–180.
- Liu et al. (2018). End-to-end multi-task learning with attention. arXiv preprint arXiv:1803.10704.
- Luong et al. (2015). Multi-task sequence to sequence learning. arXiv preprint arXiv:1511.06114.
- Matiisen et al. (2017). Teacher-student curriculum learning. arXiv preprint arXiv:1707.00183.
- Mikolov et al. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119.
- Peters et al. (2018). Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
- Radford et al. (2018). Improving language understanding by generative pre-training.
- Sener and Koltun (2018). Multi-task learning as multi-objective optimization. arXiv preprint arXiv:1810.04650.
- Vaswani et al. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
- Xu et al. (2018). AutoLoss: learning discrete schedules for alternate optimization. arXiv preprint arXiv:1810.02442.
Appendix A Impact of hyper-parameters
In this appendix, we present the impact of various hyper-parameters for the different schedule types.
Figure 1 illustrates the effect of sampling ratios in explicit constant scheduling. We varied the sampling ratio for a task from 10% to 90% and evaluated the development and test BLEU scores obtained when this fixed schedule is used throughout training. Considering the disproportionate dataset sizes of the two tasks (a ratio of 1/40), oversampling the high-resource task yields better overall performance for both tasks. While a uniform sampling ratio (50%-50%) favors the low-resource task, more balanced results are obtained with a 75%-25% split favoring the high-resource task.
Results for the explicit validation-based schedule are illustrated in Figure 2, where we varied the two hyper-parameters of the schedule to control oversampling and forgetting.
Appendix B Implicit validation-based scheduling progress
Here we present how the task weights, learning rates and validation BLEU scores change over time with an implicit schedule. For the implicit schedule hyper-parameters, we set , , with baselines being 24 and 35 for En-De and En-Fr respectively. For the best performing model, we used the inverse square root learning rate schedule Vaswani et al. (2017) with a learning rate of 1.5 and 40K warm-up steps.
Task weights are adaptively changed by the scheduler during training (Figure 5 top-left), and the predicted weights are used to adjust the learning rates for each task (Figure 5 top-right). The relative scores computed for each task following Eq. 2 are illustrated in Figure 5 bottom-left. Finally, the progression of the validation set BLEU scores, with their corresponding baselines shown as solid horizontal lines, is given in Figure 5 bottom-right.
Appendix C Possible training instabilities
This appendix presents a failed experiment with wildly varying oscillations. All encoder parameters were tied, as well as the first four layers of the decoder and the softmax. An explicit schedule was employed.