Task Selection Policies for Multitask Learning

07/14/2019 ∙ by John Glover, et al. ∙ 0

One of the questions that arises when designing models that learn to solve multiple tasks simultaneously is how much of the available training budget should be devoted to each individual task. We refer to any formalized approach to addressing this problem (learned or otherwise) as a task selection policy. In this work we provide an empirical evaluation of the performance of some common task selection policies in a synthetic bandit-style setting, as well as on the GLUE benchmark for natural language understanding. We connect task selection policy learning to existing work on automated curriculum learning and off-policy evaluation, and suggest a method based on counterfactual estimation that leads to improved model performance in our experimental settings.




1 Introduction

Recent work on language understanding has demonstrated the effectiveness of pretraining neural networks on large corpora using unsupervised objectives such as language modelling, and then fine-tuning the resulting models on downstream target tasks Dai and Le (2015); Peters et al. (2018); Howard and Ruder (2018); Radford et al. (2018); Devlin et al. (2018). This approach has produced new state-of-the-art results on a variety of popular benchmark datasets, such as the SQuAD question answering dataset Rajpurkar et al. (2016), and the General Language Understanding Evaluation (GLUE) benchmark for sentence (and sentence pair) classification Wang et al. (2018). Notably, these approaches typically fine-tune a full copy of the pretrained model on each target task individually, effectively multiplying the number of parameters that must be trained and stored by the number of tasks, and ruling out any potential performance improvements that may arise from sharing information between related tasks.

An alternative approach is to do Multitask Learning (MTL) Caruana (1997), where a model is learned that shares some number of parameters across all tasks. BERT is a recent example of the benefits of this approach — in the pretraining stage it is trained on two tasks simultaneously: masked language modelling (predicting missing tokens in the input), and next sentence prediction (predicting whether two sentences are consecutive or not). However, successfully applying MTL to a particular problem is not necessarily straightforward, and depends on resolving many questions that do not arise in the single task setting, such as:

  1. In what settings is MTL effective? Can we detect and mitigate negative transfer (where performance on some subset of tasks decreases when trained in an MTL setting)?

  2. Which parameters of the model should be shared between tasks, and which should be task specific?

  3. Should the model look at all of the data from all of the tasks? Should all of the training examples be weighted equally in the loss term?

  4. How much of the available training budget should be spent on each individual task?

This work focuses on question 4, providing an empirical evaluation of the performance of different policies for selecting how much to sample from each task in two experimental settings: first on a novel bandit-style task, and second when fine-tuning a pretrained language model (BERT) on the GLUE benchmark. GLUE represents an interesting challenge as it evaluates the performance of a single model across multiple tasks, such as textual entailment and question answering, with the size of the available training data for different tasks spanning multiple orders of magnitude. We find that policies based on common heuristics such as sampling tasks uniformly at random Wang et al. (2018); Subramanian et al. (2018) or proportionally to their size Phang et al. (2018) are not able to match the performance of fine-tuning models for each individual task.

We also show that this problem can be viewed through the lens of curriculum learning. We evaluate a previous method for automated curriculum learning Graves et al. (2017), but find that on our tasks it does not perform significantly better than a random policy. However, when learning a policy using counterfactual estimation Bottou et al. (2012), we are able to approach performance parity with the task-specific models.

2 Related Work

Multitask learning has been studied extensively, and can be motivated as a means of inductive bias learning Caruana (1993); Baxter (2000), representation learning Argyriou et al. (2007); Misra et al. (2016), or as a form of learning to learn Baxter (1997); Thrun and Pratt (1998); Heskes (2000); Lawrence and Platt (2004). In the context of natural language processing, MTL has been used to improve tasks such as semantic role labeling Collobert and Weston (2008); Strubell et al. (2018) and machine translation Luong et al. (2016); Firat et al. (2016); Johnson et al. (2016); Hokamp et al. (2018), and to learn general purpose sentence representations Subramanian et al. (2018). One of the best performing models on the GLUE benchmark at the time of publication Liu et al. (2019) combines MTL pretraining (BERT) with MTL fine-tuning on the GLUE tasks, although the model also incorporates non-trivial task-specific components and additional training using knowledge distillation Hinton et al. (2015).

However, MTL in NLP is not always successful: the GLUE baseline MTL models are significantly worse than the single task models Wang et al. (2018); Alonso and Plank, 2016 find significant improvements with MTL in only one of the five tasks evaluated; and the multitask model in McCann et al., 2018 does not quite reach the performance of the same model trained on each task individually.

One way to approach the question of how much training budget to spend on each task is to view this as a curriculum learning problem Elman (1993); Bengio et al. (2009). McCann et al., 2018 show the importance of using the correct curriculum to train their multitask model. Graves et al., 2017 treat curriculum learning as an adversarial multi-armed bandit problem Auer et al. (2002); Bubeck and Cesa-Bianchi (2012), and show that the learned policies can improve the performance on language modeling and the bAbI dataset tasks Weston et al. (2015). Probably the most similar recent work to ours is AutoSeM Guo et al. (2019), where they first learn to select useful auxiliary tasks, then learn an MTL curriculum over them using Bayesian optimization. Our work differs in the methods used (we use counterfactual estimation to learn the curriculum), as well as in the learning objective: Guo et al., 2019 focus on improving a single target task by incorporating auxiliary tasks, whereas we seek to jointly maximize the performance of a set of target tasks.

Counterfactual estimation Bottou et al. (2012) tries to answer the question “what would have happened if an agent had taken different actions?”, and so is closely related to the problem of off-policy evaluation in the context of reinforcement learning (counterfactual estimation has also been studied under the names “counterfactual reasoning” and “learning from logged bandit feedback”). This setting presents additional challenges compared with on-policy evaluation, as an agent only has partial feedback from the environment, and it is assumed that collecting additional feedback is either not possible or prohibitively expensive. Typically, approaches to off-policy evaluation are based on either modeling the environment dynamics and reward, using importance sampling, or a combination of the two Precup et al. (2000); Peshkin and Shelton (2002); Dudik et al. (2011); Swaminathan and Joachims (2015a); Jiang and Li (2015).

3 Task Selection Policies

We consider the case where we have a set of $N$ learning tasks. Each task $D_k$ ($k \in \{1, \ldots, N\}$) is a distribution over samples $x$ from the input space $\mathcal{X}$. In the supervised case for example, $x$ may be composed of $(u, v)$ pairs, where $u$ is an input example or sequence and $v$ is a target label or sequence. These individual samples are typically grouped into batches of multiple samples from the same task; we also refer to these batches as samples for convenience.

A model over $\mathcal{X}$ has parameters $\theta$. A loss function $L_k(x, \theta)$ is defined for each task, with the expected loss for the task given by

$$\mathcal{L}_k(\theta) = \mathbb{E}_{x \sim D_k}\left[ L_k(x, \theta) \right]$$

The objective is to maximize the performance on all tasks, or to minimize the total loss:

$$\mathcal{L}_{\mathrm{MTL}}(\theta) = \sum_{k=1}^{N} \mathcal{L}_k(\theta)$$

We assume that we have a fixed training budget of $T$ steps. At each step $t$, a model samples task (or action) $a_t$ according to a distribution defined by some task selection policy $\pi$, then processes the resulting sample $x_t$ and observes loss $L_t$. The probability of selecting a particular action at time $t$ is given by $\pi(a_t)$. We use $M_\theta^\pi$ to denote a model with parameters $\theta$ that uses policy $\pi$, and $\phi$ to refer to any additional parameters that $\pi$ itself may have. In general, $\pi$ may be a function of the sequence of the observed samples, losses and model parameters in the training history at time $t$. In this work we study different methods for specifying or learning $\pi$, and how these approaches influence the total loss $\mathcal{L}_{\mathrm{MTL}}$. We are primarily interested in cases where the model is a large neural network, and $D_k$ is a large dataset, so in general we seek methods that avoid evaluating $M_\theta^\pi$ for many different values of $\phi$, as this is computationally slow and expensive.

3.1 Baseline (Heuristic) Policies

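A common task selection policy is to sample from all tasks uniformly at random:

$$\pi_{\mathrm{random}}(a_k) = \frac{1}{N}$$

where $N$ is the number of tasks. This policy has been shown to be a strong baseline in previous work on curriculum learning Graves et al. (2017). Another common heuristic that we evaluate in this work is to sample tasks proportionally to their dataset size Phang et al. (2018):

$$\pi_{\mathrm{size}}(a_k) = \frac{|D_k|}{\sum_{j=1}^{N} |D_j|}$$

where $|D_k|$ is the number of training examples available for task $k$.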
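The two heuristic policies can be illustrated with a small sketch. The training-set sizes below are the rounded figures listed in Table 3; the function names (`uniform_policy`, `size_proportional_policy`, `sample_task`) are hypothetical helpers introduced only for this illustration and are not taken from the authors' code.

```python
import random

# Rounded GLUE training-set sizes as listed in Table 3 of this paper.
TRAIN_SIZES = {
    "CoLA": 10_000, "MNLI": 393_000, "MRPC": 4_000, "QNLI": 108_000,
    "QQP": 400_000, "RTE": 2_700, "SST-2": 67_000, "STS-B": 7_000,
}

def uniform_policy(sizes):
    """Every task gets probability 1/N, regardless of its size."""
    n = len(sizes)
    return {task: 1.0 / n for task in sizes}

def size_proportional_policy(sizes):
    """Each task gets probability proportional to its training-set size."""
    total = sum(sizes.values())
    return {task: size / total for task, size in sizes.items()}

def sample_task(policy, rng):
    """Draw one task name according to the policy's probabilities."""
    tasks = list(policy)
    weights = [policy[task] for task in tasks]
    return rng.choices(tasks, weights=weights, k=1)[0]

uniform = uniform_policy(TRAIN_SIZES)
by_size = size_proportional_policy(TRAIN_SIZES)

rng = random.Random(0)
schedule = [sample_task(by_size, rng) for _ in range(5)]  # tasks for 5 steps
```

With these sizes, the size-proportional policy selects RTE on roughly 0.3% of steps, compared with 12.5% under the uniform policy, which gives a sense of how strongly small tasks can be under-sampled.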


3.2 Learning Policies Using Automated Curriculum Learning

We also evaluate the approach to automated curriculum learning introduced in Graves et al., 2017, which is briefly described here for reference. A curriculum of $N$ tasks can be viewed as an $N$-armed bandit. At each round (time step) $t$ an agent uses a policy $\pi$ to select an action $a_t$ and sees a reward $r_t$. The goal is to create a policy that maximizes the total reward from the bandit.

Graves et al., 2017 use the Exp3.S algorithm Auer et al. (2002) for their policy, which is an adaptation of the Exp3 algorithm to the non-stationary setting. Exp3 tries to minimize the regret with respect to a single best arm in hindsight, assuming an adversarial setting in which the distribution of rewards over arms can change at every time step. It does this by using importance sampled rewards to update a set of weights $w_{t,i}$ (one per arm $i$), then acting stochastically according to a distribution based on these weights (with additional hyperparameters $\eta$, the step size of the weight update, and $\epsilon$, the weight of a uniform exploration term):

$$\pi_t^{\mathrm{exp3s}}(i) = (1 - \epsilon) \frac{e^{w_{t,i}}}{\sum_{j=1}^{N} e^{w_{t,j}}} + \frac{\epsilon}{N}$$
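To make the mechanics concrete, here is a minimal sketch of a basic Exp3-style selector: a softmax over per-task weights mixed with a uniform exploration term, updated with importance-sampled rewards. It is deliberately simpler than Exp3.S (it omits the weight-sharing step that handles non-stationarity), and the class name, the parameter names `eta` and `epsilon`, and their default values are assumptions made for this illustration.

```python
import math
import random

class Exp3Policy:
    """Basic Exp3-style task selector (a simplification, not full Exp3.S)."""

    def __init__(self, num_tasks, eta=0.001, epsilon=0.05, seed=0):
        self.num_tasks = num_tasks
        self.eta = eta            # step size for the weight update (assumed)
        self.epsilon = epsilon    # mass spread uniformly over tasks (assumed)
        self.weights = [0.0] * num_tasks
        self.rng = random.Random(seed)

    def probabilities(self):
        # Softmax over the weights (shifted by the max for stability),
        # mixed with a uniform distribution so every task keeps some mass.
        top = max(self.weights)
        exps = [math.exp(w - top) for w in self.weights]
        total = sum(exps)
        n = self.num_tasks
        return [(1 - self.epsilon) * e / total + self.epsilon / n for e in exps]

    def select(self):
        probs = self.probabilities()
        return self.rng.choices(range(self.num_tasks), weights=probs, k=1)[0]

    def update(self, task, reward):
        # Importance-sampled reward: only the chosen task is updated, and its
        # reward is divided by the probability with which it was chosen.
        prob = self.probabilities()[task]
        self.weights[task] += self.eta * reward / prob

# Toy usage: task 2 always yields reward +1, the others -1.
policy = Exp3Policy(num_tasks=3, eta=0.1)
for _ in range(200):
    task = policy.select()
    reward = 1.0 if task == 2 else -1.0
    policy.update(task, reward)
final_probs = policy.probabilities()
```

In this toy run the probability of the rewarding task grows over time, while the exploration term keeps every task's probability at or above `epsilon / num_tasks`.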


3.2.1 Automated Curriculum Learning Reward Function

In Graves et al., 2017 the underlying hypothesis is that the policy should produce a syllabus that focuses on tasks in order of increasing difficulty. This led the authors to design and evaluate several different ways of encoding measures of the rate at which learning progresses into a sample-level reward function. They find that the progress signal that they call prediction gain generally leads to the best performance across the tasks that they evaluated, and so that is the raw reward signal that we consider here. Prediction gain is defined as the change in loss before and after training on a sample $x$ (i.e., after a gradient update that changes the parameters from $\theta$ to $\theta'$): $\hat{r} = L(x, \theta) - L(x, \theta')$.

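When computing $r_t$ we also follow the reward scaling process described in Graves et al., 2017. Reservoir sampling is used to maintain a representative sample of the unscaled reward history up to time $t$, and from this we compute the 20th and 80th percentiles as $q_t^{\mathrm{lo}}$ and $q_t^{\mathrm{hi}}$ respectively. The unscaled reward $\hat{r}_t$ is then mapped to the interval $[-1, 1]$:

$$r_t = \begin{cases} -1 & \text{if } \hat{r}_t < q_t^{\mathrm{lo}} \\ 1 & \text{if } \hat{r}_t > q_t^{\mathrm{hi}} \\ \dfrac{2(\hat{r}_t - q_t^{\mathrm{lo}})}{q_t^{\mathrm{hi}} - q_t^{\mathrm{lo}}} - 1 & \text{otherwise} \end{cases}$$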
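The reward scaling step can be sketched as follows. This is an illustrative implementation under stated assumptions rather than the authors' code: the class name `RewardScaler`, the reservoir capacity, the nearest-rank percentile rule, and the choice to return 0 when the two percentiles coincide are all assumptions made here.

```python
import random

class RewardScaler:
    """Rescale raw rewards to [-1, 1] using percentiles of a reservoir sample."""

    def __init__(self, capacity=1000, low_pct=20, high_pct=80, seed=0):
        self.capacity = capacity
        self.low_pct = low_pct
        self.high_pct = high_pct
        self.reservoir = []
        self.seen = 0
        self.rng = random.Random(seed)

    def _remember(self, raw):
        # Reservoir sampling (Algorithm R): once the reservoir is full, the
        # new value replaces a random slot with probability capacity / seen.
        self.seen += 1
        if len(self.reservoir) < self.capacity:
            self.reservoir.append(raw)
        else:
            slot = self.rng.randrange(self.seen)
            if slot < self.capacity:
                self.reservoir[slot] = raw

    def _percentile(self, pct):
        # Nearest-rank percentile over the current reservoir contents.
        ordered = sorted(self.reservoir)
        index = round(pct / 100 * (len(ordered) - 1))
        return ordered[index]

    def scale(self, raw):
        self._remember(raw)
        low = self._percentile(self.low_pct)
        high = self._percentile(self.high_pct)
        if raw < low:
            return -1.0
        if raw > high:
            return 1.0
        if high == low:
            return 0.0  # degenerate history; returning 0 is an assumption
        return 2.0 * (raw - low) / (high - low) - 1.0
```

Clipping at the 20th and 80th percentiles keeps occasional extreme loss changes from dominating the bandit update, at the cost of discarding magnitude information in the tails.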


3.3 Learning Policies Using Counterfactual Estimation

Automated curriculum learning using Exp3.S describes a method for online learning of task selection policies from bandit-style feedback. However, in the context of MTL the ability to learn online is typically not required: in many situations we can run multiple variations of a given experiment, and can potentially use the results of earlier training runs to improve our policies. Learning task selection policies can therefore be viewed as a problem of counterfactual estimation, where the goal is to use data logged under an earlier policy to improve on that policy without further interaction with the environment.

Most approaches to counterfactual estimation either model the reward generating process, use importance sampling to correct for the changes introduced by the new policy, or combine the two Dudik et al. (2011). In the first case, the task initially reduces to a supervised regression problem, followed by a process of policy optimization using the reward model as the target reward that should be maximized. However, defining a suitable MTL sample-level reward function is non-trivial. In this work we are interested in maximizing the average performance of all training tasks, where performance is measured at the end of the training run. In problems with large models and/or datasets, a training run could take anywhere from hours to weeks to complete, so the most obvious reward signal (final average performance) is very sparse and difficult to optimize: the reward may be observed long after the actions that produced it were taken. We therefore tend to rely on reward signals that are defined in response to each action taken by the policy, in the hope that they correlate well with the global metric that we care about. We discuss the specific sample-level reward used in this work in Section 3.3.3.

As we are dealing with a surrogate reward signal, and previous work on reward modelling has found that this approach may not generalise well even with exact rewards Beygelzimer and Langford (2009), we instead focus on the counterfactual estimation methods that are based on importance sampling. We use the available data in a two-step process to create task selection policies:

  1. Create an estimator to evaluate a given policy, using some set of logged policy probabilities, decisions and resulting rewards from some initial training run(s).

  2. Create a new policy that maximises the expected reward according to this counterfactual estimator.

These steps are described in detail in Sections 3.3.1 and 3.3.2 respectively.

3.3.1 Counterfactual Estimation

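At each time step $t$ in a learning process, an MTL model selects a task $a_t$ to sample from using a policy $\pi$, and in response experiences a sample $x_t$ and reward signal $r_t$. The probability of selecting each $a_t$ is given by a distribution $\pi(a_t)$. The value of $\pi$ is the expected reward obtained when selecting tasks using that policy:

$$V(\pi) = \mathbb{E}_{a \sim \pi}\left[ r(a) \right]$$

$V(\pi)$ can be estimated by sampling trajectories (or rollouts) from $\pi$ (to simplify our notation it is assumed that we are just using a single rollout of $T$ steps to compute $\hat{V}(\pi)$, but multiple rollouts could be used):

$$\hat{V}(\pi) = \frac{1}{T} \sum_{t=1}^{T} r_t$$

In counterfactual estimation, the goal is to use rollouts from $\pi$ to approximate the value of a different policy $\pi'$, under the assumption that we cannot sample directly from $\pi'$. We also cannot evaluate the reward function for samples that are selected using $\pi'$ but not $\pi$. One way to estimate $V(\pi')$ is to use importance sampling Rosenbaum and Rubin (1983):

$$V(\pi') = \mathbb{E}_{a \sim \pi'}\left[ r(a) \right] = \mathbb{E}_{a \sim \pi}\left[ \frac{\pi'(a)}{\pi(a)} r(a) \right]$$

We can therefore use Monte Carlo to approximate $V(\pi')$ while only relying on samples from $\pi$:

$$\hat{V}_{\mathrm{IS}}(\pi') = \frac{1}{T} \sum_{t=1}^{T} \frac{\pi'(a_t)}{\pi(a_t)} r_t$$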


The importance sampling estimator (also known as the inverse propensity score estimator) is unbiased, and is defined as long as the logging policy $\pi$ has support everywhere that the new policy $\pi'$ does, or in other words if $\pi'(a) > 0 \Rightarrow \pi(a) > 0$. In practice this is not a significant limitation, as we generally have full control over $\pi$, and so can ensure that it always assigns some probability mass to each task.

However, the importance sampling estimator is known to suffer from high variance Bottou et al. (2012); Joachims et al. (2018). This problem is particularly noticeable in regions of the input space that are not well covered by the sampled policy: if the logging probability $\pi(a_t)$ is very low for a particular sample, then the importance weight will be high (leading to inaccurate estimates) unless the reward for this sample is very low. Different estimators have been proposed that reduce this variance in some way, generally at the expense of adding some bias Dudik et al. (2011); Bottou et al. (2012), but there is no single estimator that works best in all situations Nedelec et al. (2017).

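In this work we use the weighted importance sampling estimator Rubinstein (1981) to reduce the variance of the importance sampling estimate, as it has been shown to work well in a variety of settings Mahmood et al. (2014); Swaminathan and Joachims (2015b); Nedelec et al. (2017):

$$\hat{V}_{\mathrm{WIS}}(\pi') = \frac{\sum_{t=1}^{T} \frac{\pi'(a_t)}{\pi(a_t)} r_t}{\sum_{t=1}^{T} \frac{\pi'(a_t)}{\pi(a_t)}}$$

where $\pi$ is the logging policy, $\pi'$ is the policy being evaluated, and $a_t$ and $r_t$ are the logged actions and rewards.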


So far we have been assuming that the off-policy data is collected using a single policy $\pi$, but these methods can also be applied to data collected by multiple logging policies Peshkin and Shelton (2002); Agarwal et al. (2017).
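The difference between the plain and weighted importance sampling estimators can be seen in a short sketch. The function names and the toy log below are hypothetical and chosen only for illustration; the log records, for each step, the chosen task, the probability the logging policy assigned to it, and the observed reward.

```python
def importance_weights(actions, logging_probs, target_policy):
    """Ratio of target to logging probability for every logged step."""
    return [target_policy[a] / p for a, p in zip(actions, logging_probs)]

def is_estimate(actions, logging_probs, rewards, target_policy):
    """Importance sampling (inverse propensity score) value estimate."""
    weights = importance_weights(actions, logging_probs, target_policy)
    return sum(w * r for w, r in zip(weights, rewards)) / len(rewards)

def wis_estimate(actions, logging_probs, rewards, target_policy):
    """Weighted (self-normalised) importance sampling value estimate."""
    weights = importance_weights(actions, logging_probs, target_policy)
    return sum(w * r for w, r in zip(weights, rewards)) / sum(weights)

# Toy log: two tasks, collected with a uniform logging policy.
actions = [0, 1, 1, 1]
logging_probs = [0.5, 0.5, 0.5, 0.5]
rewards = [1.0, 0.0, 0.0, 0.0]
target_policy = [0.75, 0.25]  # hypothetical new policy favouring task 0

v_is = is_estimate(actions, logging_probs, rewards, target_policy)
v_wis = wis_estimate(actions, logging_probs, rewards, target_policy)
```

The only difference is the normaliser: the plain estimator divides by the number of logged steps, while the weighted estimator divides by the sum of the importance weights, which bounds the estimate by the range of observed rewards and typically lowers variance at the cost of some bias.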

3.3.2 Counterfactual Estimation: Policy Improvement

The estimators described in Section 3.3.1 can be used to evaluate arbitrary policies, and so they can be combined with policy search algorithms to learn an improved policy $\pi'$. In general these policies may be dynamic, but here we consider fixed stochastic policies parameterised by a vector $\phi \in \mathbb{R}^N$, of the form:

$$\pi_\phi(a_k) = \frac{e^{\phi_k}}{\sum_{j=1}^{N} e^{\phi_j}}$$

The learning objective is now to find an appropriate $\phi$:

$$\phi^* = \arg\max_{\phi} \hat{V}_{\mathrm{WIS}}(\pi_\phi)$$

To optimize for $\phi$ we use Covariance-Matrix Adaptation Evolution Strategy (CMA-ES) Hansen and Ostermeier (2001), as it has been shown to work well in low-dimensional parameter spaces Ha and Schmidhuber (2018).

In initial experiments, we found that the output of the weighted importance sampling estimator $\hat{V}_{\mathrm{WIS}}$ was still prone to overestimating the value of regions of the parameter space that were not well-represented in the data from the logging policy. In particular, it tended to assign high scores to policies with distributions that were very peaked around the individual tasks with the largest total reward. This is likely due to a combination of overfitting to a small amount of logging data, as well as deficiencies with the surrogate reward signal (described in Section 3.3.3). We therefore introduced a regularization term to the objective, limiting the parameter space to regions where the policy retains larger entropy ($H$) values (weighted by a hyperparameter $\lambda$):

$$\phi^* = \arg\max_{\phi} \left[ \hat{V}_{\mathrm{WIS}}(\pi_\phi) + \lambda H(\pi_\phi) \right]$$
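The effect of the entropy term can be illustrated with a small sketch. Two assumptions are made loudly here: the policy is taken to be a softmax over an unconstrained parameter vector, and a greedy Gaussian search (`simple_search`) stands in for CMA-ES so that the example needs only the standard library. Neither should be read as the authors' implementation, and the toy log is invented for the illustration.

```python
import math
import random

def softmax(params):
    top = max(params)
    exps = [math.exp(p - top) for p in params]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def wis_value(probs, log):
    """Weighted importance sampling estimate from
    (action, logging_prob, reward) triples."""
    weights = [probs[a] / p for a, p, _ in log]
    weighted = sum(w * r for w, (_, _, r) in zip(weights, log))
    return weighted / sum(weights)

def objective(params, log, entropy_weight):
    probs = softmax(params)
    return wis_value(probs, log) + entropy_weight * entropy(probs)

def simple_search(log, num_tasks, entropy_weight, iterations=20,
                  population=64, sigma=0.5, seed=0):
    """Greedy Gaussian search; a stand-in for CMA-ES, not CMA-ES itself."""
    rng = random.Random(seed)
    best = [0.0] * num_tasks  # all-zero parameters give the uniform policy
    best_score = objective(best, log, entropy_weight)
    for _ in range(iterations * population):
        candidate = [b + rng.gauss(0.0, sigma) for b in best]
        score = objective(candidate, log, entropy_weight)
        if score > best_score:
            best, best_score = candidate, score
    return softmax(best), best_score

# Toy log from a uniform logging policy: task 0 earns higher rewards.
log = [(0, 0.5, 1.0), (1, 0.5, 0.2), (0, 0.5, 0.8), (1, 0.5, 0.1)]
greedy_policy, _ = simple_search(log, num_tasks=2, entropy_weight=0.0)
regularised_policy, _ = simple_search(log, num_tasks=2, entropy_weight=1.0)
```

Without the entropy term the search collapses almost all probability onto the task with the larger logged reward, whereas with an entropy weight of 1 the learned policy in this toy problem keeps roughly a third of its mass on the other task.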


3.3.3 Counterfactual Estimation: Reward

To learn a task selection policy using counterfactual estimation we need to define a reward function $r$. Ideally, a policy that maximises the expected value of $r$ will also minimise the average task loss at the end of training.


We found that maximising the prediction gain reward proposed in Graves et al., 2017 did not correlate well with minimising the average task loss in our initial experiments, and so use a slightly different reward formulation here. Intuitively, as we want to maximise the average task performance, we want to incentivise spending more time on tasks that are performing poorly at a given phase of the training process, as long as we are continuing to make progress on those tasks. We want to penalise sampling from tasks that are not improving, as this is a waste of effort, regardless of their overall performance.

Concretely, we assume that the sample loss for a given task is a negative log likelihood value. We compare at time with at time , which is the time at which we last sampled from the same task. If the difference between the two () is negative (ie., the loss has decreased), then the reward received by the model for this action is , so the model receives a reward in the interval , with higher values for sampling from tasks that are performing poorly. If , the reward is . This reward process is described in Equation 16.


4 Experiments

To compare the different task selection policies for MTL we evaluate their performance on two tasks: a toy bandit-style problem, and the GLUE benchmark for natural language understanding. Further details are given in Sections 4.1 and 4.2 respectively.

4.1 Bandit Example

Our first experiment aims to verify that our approach to learning task selection policies works in a synthetic setting, that was designed to be a simplified environment that still presents some of the challenges that are experienced with MTL in more realistic scenarios. We define an MTL bandit as a bandit with arms, representing our tasks. We assume that the goal is to sample from these arms according to some fixed oracle distribution that is unknown to the agent interacting with the bandit environment. At the start of the experiment we sample , and the same is used for all experiment runs. Each arm has an associated score , which is initially zero. Each arm is assigned a maximum probability value () that it can obtain in the range , and then establishes a learning increment (LI) using the formula:

where is the arm and is the number of steps. Each arm is also assigned a forget increment (), which is a random number in the range multiplied by . At each time step, if arm is selected, is incremented by (and constrained to be ), simulating some improvement on that task. Similarly, is reduced by (constrained to be ) for all tasks at each time step, whether that task was selected or not, simulating some form of forgetting on each task. for the selected by the policy being evaluated is computed before and after applying and , and is used to compute rewards and for and respectively.

The MTL bandit creates an environment in which successful agents must learn to sample from all tasks periodically so as not to “forget”, but should learn that some tasks need to be sampled from more than others in order to maximise the overall average score. We evaluated , and on this task. In Graves et al., 2017 the authors used the same hyperparameters for all experiments, and so we use the same settings here: , . To set the entropy weight , we perform a grid search over , and select the best performing value (). For CMA-ES we used 20 iterations with a population size of 64. is computed based on 2 iterations of policy improvement, starting from a random uniform policy (ie., ). For the MTL bandit, we set and . We run each policy 10 times (with different initial random seeds). The results are shown in Figure 1. None of the methods are able to fully match the oracle performance, but our counterfactual method comes close. The random policy and Exp3.S perform similarly, both noticeably worse than the counterfactual policy.

Figure 1: Average score across all 8 arms over time for each policy. Each line is the median value over 10 seeds, with shaded regions encompassing all seeds.

4.2 GLUE

We evaluated the task selection policies in a more challenging and realistic environment — the GLUE benchmark. GLUE performance is based on an average score across the following tasks: CoLA (the Corpus of Linguistic Acceptability), MNLI (Multi-Genre Natural Language Inference), MRPC (the Microsoft Research Paraphrase Corpus), QNLI (Question Natural Language Inference, a version of the Stanford Question Answering Dataset), QQP (Quora Question Pairs), RTE (Recognizing Textual Entailment), SST-2 (the Stanford Sentiment Treebank), STS-B (the Semantic Textual Similarity Benchmark), and WNLI (Winograd Natural Language Inference). We refer readers to Wang et al., 2018 for further details on the various tasks.

We use the same underlying model with identical hyperparameters to evaluate each policy, only varying the policy itself and the random seed for each run. We use the pretrained BERT model, and follow a fine-tuning procedure on GLUE similar to the one described by the BERT authors in Devlin et al., 2018. We take the final hidden vector corresponding to the first input token as the representation of the sentence (or sentence pair) for each task. The only task-specific parameters are those of the output layer for each task (mapping from the BERT hidden size to the number of output labels for that task); all other model parameters are shared across all tasks. One departure from the details in Devlin et al., 2018 is that we use a maximum sequence length of 256 (instead of 512), as we did not notice a significant performance difference, and it allows us to fit one full batch (size 16) into memory on an NVIDIA P100 GPU. We use a learning rate of , and train for 200000 steps. To match the evaluation in Devlin et al., 2018, we only train on 8 of the GLUE tasks, excluding WNLI. When reporting GLUE test set results, we output the majority class label for the WNLI task. For we again use , . The counterfactual policy is computed based on a single policy improvement iteration, starting from the output of the uniform random policy. We used 50 iterations of CMA-ES with a population size of 64. We compute distributions for each value in and run one iteration for each policy. We then pick the highest performing model () and use this policy for additional runs. We run the experiment 3 times for each policy, with different random seeds each time.

Figure 2: Average dev set score across 8 GLUE tasks (excluding WNLI) over time for each policy. Each line is the median value over 3 seeds, with shaded regions encompassing all seeds.

Figure 2 shows average scores on the GLUE dev set over time for each policy, with the final scores for each individual task given in Table 1. As in the bandit experiment, we find that the uniform random policy and Exp3.S perform similarly on average, with neither reaching the final performance of our counterfactual method. The size-proportional policy performs much worse than all other methods, as the large differences in dataset sizes cause smaller tasks like CoLA to be under-sampled by this policy.

Policy Avg CoLA MNLI MRPC QNLI QQP RTE SST-2 STS-B
Random 79.1 48.2 82.9 88.8 88.2 86.9 73.4 92.1 72.1
Size-proportional 74.9 18.5 83.6 86.1 77.4 88.6 77.2 90.6 76.6
Exp3.S 78.9 49.2 83.6 87.6 87.3 83.7 74.5 91.8 73.8
Counterfactual 80.5 50.1 83.8 90.0 88.0 87.0 77.5 92.0 75.6
Table 1: Individual task scores on the GLUE dev set for each task selection policy. For tasks with multiple metrics, the task score is the average of their values, and we show the average task score across all 3 random seeds.
Task Single task MTL
CoLA 52.1 48.5
MNLI (m/mm) 84.6/83.4 83.5/83.1
MRPC 88.9/84.8 88.0/83.7
QNLI 90.5 90.5
QQP 71.2/89.2 70.4/88.7
RTE 66.4 74.5
SST-2 93.5 93.1
STS-B 87.1/85.1 80.7/80.6
WNLI 65.1 65.1
GLUE Score 78.3 77.9
Table 2: GLUE test set results, comparing the single-task fine-tuning results for BERT with MTL fine-tuning using our learned counterfactual policy.

We evaluated the best performing model using our learned counterfactual policy on the GLUE test set to make a better comparison with single-task fine-tuning of BERT, and give the results in Table 2. The MTL model comes close to the GLUE score of the single-task fine-tuned models, albeit with some differences in the individual task performances. The MTL model is noticeably worse on CoLA, MNLI matched and STS-B, considerably better on RTE, and similar on the remaining tasks.

4.3 Discussion

Our experiments show that choosing an appropriate task selection policy can have a large impact on MTL performance, as highlighted by the gap between the best and worst policies on the GLUE dataset. We see large performance gains for MTL on the RTE task in particular, as reported in previous work Wang et al. (2018); Devlin et al. (2018). However, the introduction of our task selection policy is not sufficient to prevent some reduction in performance on CoLA, MNLI matched and STS-B. We also note that a uniform random policy is a strong baseline in both of our experimental settings, which confirms findings (on a different set of experiments) in Graves et al., 2017.

One weakness of our counterfactual method is the need to weight the estimation of the target policy with a regularising entropy term, requiring an additional hyperparameter that must be tuned. We believe that this is largely due to a deficiency in our surrogate sample-level reward definition; perhaps this would not be necessary if we could devise a reward signal that aligns better with the global MTL objective. However, we are able to learn an improved policy on the GLUE benchmark with a relatively low number of training runs (4 in total, including parameter tuning): one initial run with a uniform random policy followed by 3 to pick the entropy weight.

The policy learned by our method on the GLUE dataset is shown in Table 3. In general we see that the tasks with larger datasets are sampled more frequently, but that it is important to boost the relative probability of sampling from tasks with smaller datasets to maintain performance on them. Task size alone is not completely indicative of sampling frequency for our learned policy, as evidenced by STS-B and SST-2 being weighted similarly despite an order of magnitude difference in their respective training set sizes, and by the difference in weight between MNLI and QQP, which have similar training set sizes. Our results support previous findings that it is often beneficial in MTL to spend more time on difficult or larger tasks (a strategy that is sometimes referred to as “anti-curriculum”) McCann et al. (2018); Hokamp et al. (2019), while suggesting a means for learning how to weight the tasks automatically.

Task Probability Training set size
CoLA 0.089 10k
MNLI 0.255 393k
MRPC 0.086 4k
QNLI 0.134 108k
QQP 0.154 400k
RTE 0.086 2.7k
SST-2 0.094 67k
STS-B 0.102 7k
Table 3: The fixed stochastic policy learned on the GLUE tasks using our counterfactual method (rounded to 3 decimal places), and the size of the respective training sets.

5 Conclusion

We evaluated several approaches to creating task selection policies for multitask learning, and highlighted that in the context of the GLUE benchmark, the choice of policy can have a large effect on overall performance. We showed how the problem of task selection is related to the areas of curriculum learning and off-policy evaluation, and suggested an approach based on counterfactual estimation that leads to improved performance on a synthetic bandit-style task, as well as on the more challenging GLUE tasks. Interesting possibilities for future work include extending our learned policies to be dynamic (instead of fixed stochastic), evaluating additional counterfactual estimators Nedelec et al. (2017), and devising (or learning) improved sample-level reward signals.