Recent work on language understanding has demonstrated the effectiveness of pretraining neural networks on large corpora using unsupervised objectives such as language modelling, and then fine-tuning the resulting models on downstream target tasks Dai and Le (2015); Peters et al. (2018); Howard and Ruder (2018); Radford et al. (2018); Devlin et al. (2018). This approach has produced new state-of-the-art results on a variety of popular benchmark datasets, such as the SQuAD question answering dataset Rajpurkar et al. (2016), and the General Language Understanding Evaluation (GLUE) benchmark for sentence (and sentence pair) classification Wang et al. (2018). Notably, these approaches typically fine-tune a full copy of the pretrained model on each target task individually, effectively multiplying the number of parameters that must be trained and stored by the number of tasks, and ruling out any potential performance improvements that may arise from sharing information between related tasks.
An alternative approach is to do Multitask Learning (MTL) Caruana (1997), where a model is learned that shares some number of parameters across all tasks. BERT is a recent example of the benefits of this approach — in the pretraining stage it is trained on two tasks simultaneously: masked language modelling (predicting missing tokens in the input), and next sentence prediction (predicting whether two sentences are consecutive or not). However, successfully applying MTL to a particular problem is not necessarily straightforward, and depends on resolving many questions that do not arise in the single task setting, such as:
In what settings is MTL effective? Can we detect and mitigate negative transfer (where performance on some subset of tasks decreases when trained in an MTL setting)?
Which parameters of the model should be shared between tasks, and which should be task specific?
Should the model look at all of the data from all of the tasks? Should all of the training examples be weighted equally in the loss term?
How much of the available training budget should be spent on each individual task?
This work focuses on question 4, providing an empirical evaluation of the performance of different policies for selecting how much to sample from each task in two experimental settings; firstly on a novel bandit-style task, and secondly when fine-tuning a pretrained language model (BERT) on the GLUE benchmark. GLUE represents an interesting challenge as it evaluates the performance of a single model across multiple tasks, such as textual entailment and question answering, with the size of the available training data for different tasks spanning multiple orders of magnitude. We find that policies based on common heuristics such as sampling tasks uniformly at random Wang et al. (2018); Subramanian et al. (2018) or proportionally to their size Phang et al. (2018) are not able to match the performance of fine-tuning models for each individual task.
We also show that this problem can be viewed through the lens of curriculum learning. We evaluate a previous method for automated curriculum learning Graves et al. (2017), but find that on our tasks it does not perform significantly better than a random policy. However, when learning a policy using counterfactual estimation Bottou et al. (2012), we are able to approach performance parity with the task-specific models.
2 Related Work
Multitask learning has been studied extensively, and can be motivated as a means of inductive bias learning Caruana (1993); Baxter (2000), representation learning Argyriou et al. (2007); Misra et al. (2016), or as a form of learning to learn Baxter (1997); Thrun and Pratt (1998); Heskes (2000); Lawrence and Platt (2004). In the context of natural language processing, MTL has been used to improve tasks such as semantic role labeling Collobert and Weston (2008); Strubell et al. (2018), and machine translation Luong et al. (2016); Firat et al. (2016); Johnson et al. (2016); Hokamp et al. (2018), and to learn general purpose sentence representations Subramanian et al. (2018). One of the best performing models on the GLUE benchmark at the time of publication Liu et al. (2019) combines MTL pretraining (BERT) with MTL fine-tuning on the GLUE tasks, although the model also incorporates non-trivial task-specific components and additional training using knowledge distillation Hinton et al. (2015).
However, MTL in NLP is not always successful — the GLUE baseline MTL models are significantly worse than the single task models Wang et al. (2018), Alonso and Plank (2016) find significant improvements with MTL in only one of five tasks evaluated, and the multitask model in McCann et al. (2018) does not quite reach the performance of the same model trained on each task individually.
One way to approach the question of how much training budget to spend on each task is to view this as a curriculum learning problem Elman (1993); Bengio et al. (2009). McCann et al., 2018 show the importance of using the correct curriculum to train their multitask model. Graves et al., 2017 treat curriculum learning as an adversarial multi-armed bandit problem Auer et al. (2002); Bubeck and Cesa-Bianchi (2012), and show that the learned policies can improve performance on language modeling and the bAbI dataset tasks Weston et al. (2015). Probably the most similar recent work to ours is AutoSeM Guo et al. (2019), where the authors first learn to select useful auxiliary tasks, and then learn an MTL curriculum over them using Bayesian optimization. Our work differs in the methods used (we use counterfactual estimation to learn the curriculum), as well as in the learning objective — Guo et al., 2019 focus on improving a single target task by incorporating auxiliary tasks, whereas we seek to jointly maximize the performance of a set of target tasks.
Counterfactual estimation Bottou et al. (2012) tries to answer the question “what would have happened if an agent had taken different actions?”, and so is closely related to the problem of off-policy evaluation in the context of reinforcement learning (counterfactual estimation has also been studied under the names “counterfactual reasoning” and “learning from logged bandit feedback”). This setting presents an added set of challenges compared to on-policy evaluation, as an agent only has partial feedback from the environment, and it is assumed that collecting additional feedback is either not possible or prohibitively expensive. Typically, approaches to off-policy evaluation are based on either modeling the environment dynamics and reward, using importance sampling, or a combination of the two Precup et al. (2000); Peshkin and Shelton (2002); Dudik et al. (2011); Swaminathan and Joachims (2015a); Jiang and Li (2015).
3 Task Selection Policies
We consider the case where we have a set of $N$ learning tasks. Each task $k$ is a distribution $D_k$ over samples $x$ from the input space $\mathcal{X}$. In the supervised case, for example, a sample $x$ may be composed of an (input, target) pair, where the input is an example or sequence and the target is a label or sequence. These individual samples are typically grouped into batches of multiple samples from the same task; we also refer to these batches as samples for convenience.
A model $f_\theta$ over $\mathcal{X}$ has parameters $\theta$. A loss function $\mathcal{L}_k$ is defined for each task, with the expected loss for task $k$ given by
$$\mathcal{L}_k(\theta) = \mathbb{E}_{x \sim D_k}\left[\mathcal{L}_k(x; \theta)\right].$$
The objective is to maximize the performance on all tasks, or equivalently to minimize the total loss:
$$\mathcal{L}(\theta) = \frac{1}{N}\sum_{k=1}^{N}\mathcal{L}_k(\theta).$$
We assume that we have a fixed training budget of $T$ steps. At each step $t$, a model samples a task (or action) $a_t$ according to a distribution defined by some task selection policy $\pi$, then processes the resulting sample $x_t$ and observes loss $\mathcal{L}_{a_t}(x_t; \theta)$. The probability of selecting a particular action $a$ at time $t$ is given by $\pi_t(a)$. We use $f_{\theta, \pi}$ to denote a model with parameters $\theta$ that uses policy $\pi$, and $\phi$ to refer to any additional parameters that $\pi$ itself may have. In general, $\pi_t$ may be a function of the sequence of the observed samples, losses and model parameters in the training history up to time $t$. In this work we study different methods for specifying or learning $\pi$, and how these approaches influence $\mathcal{L}(\theta)$. We are primarily interested in cases where the model is a large neural network, and each $D_k$ is a large dataset, so in general we seek methods that avoid evaluating $\mathcal{L}(\theta)$ for many different values of $\pi$, as this is computationally slow and expensive.
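The interaction loop above can be sketched in a few lines. The `train_step`/`next_batch` interfaces and the shape of the logged tuples are illustrative assumptions; the key point is that logging the sampled action, its probability, and the observed loss at each step is exactly what the counterfactual methods described later consume.

```python
import numpy as np

def train_mtl(model, tasks, policy, num_steps, rng=None):
    """Generic MTL training loop: at each step, sample a task index from the
    current policy distribution, train on one batch from that task, and log
    the (action, propensity, loss) tuple for later off-policy reuse.
    `model` is assumed to expose train_step(batch) -> loss, and each task
    to expose next_batch(); both are hypothetical interfaces."""
    rng = rng or np.random.default_rng(0)
    log = []
    for t in range(num_steps):
        probs = policy(t)                      # distribution over tasks at step t
        a = rng.choice(len(tasks), p=probs)    # sampled task (action)
        loss = model.train_step(tasks[a].next_batch())
        log.append((a, probs[a], loss))        # logged propensities for counterfactual estimation
    return log
```

The logged propensities `probs[a]` are what make importance-sampling-based policy evaluation possible later.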
3.1 Baseline (Heuristic) Policies
A common task selection policy is to sample from all tasks uniformly at random:
$$\pi^{uniform}_t(a) = \frac{1}{N}.$$
This policy has been shown to be a strong baseline in previous work on curriculum learning Graves et al. (2017). Another common heuristic that we evaluate in this work is to sample tasks proportionally to their dataset size Phang et al. (2018):
$$\pi^{size}_t(a) = \frac{|D_a|}{\sum_{k=1}^{N}|D_k|}.$$
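A minimal sketch of these two heuristic policies (the dataset sizes in the usage example below are arbitrary illustrative numbers):

```python
import numpy as np

def uniform_policy(dataset_sizes):
    """pi(a) = 1/N: every task is equally likely, regardless of size."""
    n = len(dataset_sizes)
    return np.full(n, 1.0 / n)

def size_proportional_policy(dataset_sizes):
    """pi(a) proportional to |D_a|: large tasks dominate the sampling."""
    sizes = np.asarray(dataset_sizes, dtype=float)
    return sizes / sizes.sum()
```

For example, `size_proportional_policy([10, 30, 60])` assigns probabilities `[0.1, 0.3, 0.6]`, which illustrates how a task two orders of magnitude smaller than another would be sampled very rarely.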
3.2 Learning Policies Using Automated Curriculum Learning
We also evaluate the approach to automated curriculum learning introduced in Graves et al., 2017, which is briefly described here for reference. A curriculum of $N$ tasks can be viewed as an $N$-armed bandit. At each round (time step) $t$ an agent uses a policy $\pi_t$ to select an action $a_t$ and sees a reward $r_t$. The goal is to create a policy that maximizes the total reward from the bandit.
Graves et al., 2017 use the Exp3.S algorithm Auer et al. (2002) for their policy, which is an adaptation of the Exp3 algorithm to the non-stationary setting. Exp3 tries to minimize the regret with respect to the single best arm in hindsight, assuming an adversarial setting in which the distribution of rewards over arms can change at every time step. It does this by using importance sampled rewards to update a set of weights $w_t$, then acting stochastically according to a distribution based on these weights (with additional hyperparameters $\eta$ and $\epsilon$):
$$\pi_t(a) = (1 - \epsilon)\frac{e^{w_{t,a}}}{\sum_{j=1}^{N} e^{w_{t,j}}} + \frac{\epsilon}{N}.$$
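The action distribution and weight update can be sketched as follows. This follows the Exp3.S formulation used by Graves et al., 2017, but the $\alpha_t = 1/t$ mixing schedule and the default values of `eta` and `eps` here are assumptions of this sketch:

```python
import numpy as np

class Exp3S:
    """Sketch of the Exp3.S bandit policy (Auer et al., 2002). `eta` is the
    step size and `eps` the exploration rate; their defaults are illustrative."""

    def __init__(self, n_arms, eta=0.001, eps=0.05):
        self.n, self.eta, self.eps = n_arms, eta, eps
        self.w = np.zeros(n_arms)
        self.t = 1

    def probs(self):
        # epsilon-smoothed softmax over the weights
        softmax = np.exp(self.w - self.w.max())
        softmax /= softmax.sum()
        return (1 - self.eps) * softmax + self.eps / self.n

    def update(self, arm, reward):
        p = self.probs()
        r_tilde = np.zeros(self.n)
        r_tilde[arm] = reward / p[arm]          # importance-sampled reward
        alpha = 1.0 / self.t                    # mixing weight for non-stationarity
        v = self.w + self.eta * r_tilde
        # Exp3.S mixes in a share of the other arms' weights at every step,
        # computed in log space for numerical stability
        m = v.max()
        exp_v = np.exp(v - m)
        total = exp_v.sum()
        self.w = m + np.log((1 - alpha) * exp_v
                            + (alpha / (self.n - 1)) * (total - exp_v))
        self.t += 1
```

Repeatedly rewarding one arm shifts the distribution toward it, while the `eps` floor guarantees every arm keeps some probability mass.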
3.2.1 Automated Curriculum Learning Reward Function
In Graves et al., 2017 the underlying hypothesis is that the policy should produce a syllabus that focuses on tasks in order of increasing difficulty. This led the authors to design and evaluate several different ways of encoding measures of the rate at which learning progresses into a sample-level reward function. They find that the progress signal that they call prediction gain generally leads to the best performance across the tasks that they evaluated, and so that is the raw reward signal that we consider here. Prediction gain is defined as the change in loss on a sample $x$ before and after training on that sample (i.e., after a gradient update): $\hat{r} = \mathcal{L}(x; \theta) - \mathcal{L}(x; \theta')$.
When computing $r_t$ we also follow the reward scaling process described in Graves et al., 2017. Reservoir sampling is used to maintain a representative sample of the unscaled reward history up to time $t$, and from this we compute the 20th and 80th percentiles as $q^{20}_t$ and $q^{80}_t$ respectively. The unscaled reward $\hat{r}$ is then mapped to the interval $[-1, 1]$:
$$r_t = \begin{cases} -1 & \text{if } \hat{r} < q^{20}_t \\ 1 & \text{if } \hat{r} > q^{80}_t \\ \dfrac{2(\hat{r} - q^{20}_t)}{q^{80}_t - q^{20}_t} - 1 & \text{otherwise.} \end{cases}$$
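A sketch of this scaling procedure; the reservoir capacity and seed here are illustrative choices:

```python
import numpy as np

class RewardScaler:
    """Maps raw prediction-gain rewards into [-1, 1] using the 20th/80th
    percentiles of a reservoir sample of the reward history, following the
    scaling described in Graves et al., 2017."""

    def __init__(self, capacity=1000, seed=0):
        self.capacity = capacity
        self.reservoir = []
        self.seen = 0
        self.rng = np.random.default_rng(seed)

    def observe(self, r):
        """Classic reservoir sampling: keep a uniform sample of all rewards."""
        self.seen += 1
        if len(self.reservoir) < self.capacity:
            self.reservoir.append(r)
        else:
            j = self.rng.integers(self.seen)
            if j < self.capacity:
                self.reservoir[j] = r

    def scale(self, r):
        """Linearly map [q20, q80] to [-1, 1], clipping outside that range."""
        lo, hi = np.percentile(self.reservoir, [20, 80])
        if hi == lo:
            return 0.0
        return float(np.clip(2 * (r - lo) / (hi - lo) - 1, -1, 1))
```

Clipping to the inter-percentile range makes the scaled reward robust to the occasional extreme prediction-gain value.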
3.3 Learning Policies Using Counterfactual Estimation
Automated curriculum learning using Exp3.S describes a method for online learning of task selection policies from bandit-style feedback. However, in the context of MTL the ability to learn online is typically not required — in many situations we have the ability to run multiple variations of a given experiment, and can potentially use the results of earlier training runs to improve our policies. Learning task selection policies can therefore be viewed as a problem of counterfactual estimation, where the goal is to use data logged under an old policy to improve on that policy without further interaction with the environment.
Most approaches to counterfactual estimation either model the reward generating process, use importance sampling to correct for the changes introduced by the new policy, or combine the two Dudik et al. (2011). In the first case, the task initially reduces to a supervised regression problem, followed by a process of policy optimization using the reward model as the target reward that should be maximized. However, defining a suitable MTL sample-level reward function is non-trivial. In this work we are interested in maximizing the average performance on all training tasks, where performance is measured at the end of the training run. In problems with large models and/or datasets, a training run could take anywhere from hours to weeks to complete, and so the most obvious reward signal (final average performance) is very sparse and difficult to optimize, as the time at which this reward is observed is potentially long after the time when an action must be taken. We therefore rely on reward signals that are defined in response to each action taken by the policy, with the hope that they correlate well with the desired global metric. We discuss the specific variant of the sample-level reward used in this work in Section 3.3.3 below.
As we are dealing with a surrogate reward signal, and previous work on reward modelling has found that this approach may not generalise well even with exact rewards Beygelzimer and Langford (2009), we instead focus on the counterfactual estimation methods that are based on importance sampling. We use the available data in a two-step process to create task selection policies:
Create an estimator to evaluate a given policy, using some set of logged policy probabilities, decisions and resulting rewards from some initial training run(s).
Create a new policy that maximises the expected reward according to this counterfactual estimator.
3.3.1 Counterfactual Estimation
At each time step $t$ in a learning process, an MTL model selects a task $a_t$ to sample from using a policy $\pi_0$, and in response experiences a sample $x_t$ and reward signal $r_t$. The probability of selecting each action $a$ is given by a distribution $\pi_0(a)$. The value $V(\pi_0)$ of $\pi_0$ is the expected reward obtained when selecting tasks using that policy:
$$V(\pi_0) = \mathbb{E}_{a \sim \pi_0}\left[r(a)\right].$$
$V(\pi_0)$ can be estimated by sampling trajectories (or rollouts) of $T$ steps from $\pi_0$ (to simplify our notation it is assumed that we are just using a single rollout to compute $\hat{V}$, but multiple rollouts could be used):
$$\hat{V}(\pi_0) = \frac{1}{T}\sum_{t=1}^{T} r_t.$$
In counterfactual estimation, the goal is to use rollouts from $\pi_0$ to approximate the value of a different policy $\pi$, under the assumption that we cannot sample directly from $\pi$. We also cannot evaluate the reward function for samples that would be selected under $\pi$ but not $\pi_0$. One way to estimate $V(\pi)$ is to use importance sampling Rosenbaum and Rubin (1983):
$$V(\pi) = \mathbb{E}_{a \sim \pi_0}\left[r(a)\frac{\pi(a)}{\pi_0(a)}\right].$$
We can therefore use Monte Carlo sampling to approximate $V(\pi)$ while only relying on samples from $\pi_0$:
$$\hat{V}_{IS}(\pi) = \frac{1}{T}\sum_{t=1}^{T} r_t\frac{\pi(a_t)}{\pi_0(a_t)}.$$
The importance sampling estimator (also known as the inverse propensity score estimator) is unbiased, and is defined as long as $\pi_0$ has support everywhere that $\pi$ does, or in other words if $\pi(a) > 0 \implies \pi_0(a) > 0$. In practice this is not a significant limitation, as we generally have full control over $\pi_0$, and so can ensure that it always assigns some probability mass to each task.
However, the importance sampling estimator is known to suffer from high variance Bottou et al. (2012); Joachims et al. (2018). This problem is particularly noticeable in regions of the input space that are not well covered by the logging policy — if $\pi_0(a_t)$ is very low for a particular sample, then the importance weight will be high (leading to inaccurate estimates) unless the reward for this sample is very low. Different estimators have been proposed that reduce this variance in some way, generally at the expense of adding some bias Dudik et al. (2011); Bottou et al. (2012), but there is no single estimator that works best in all situations Nedelec et al. (2017).
In this work we use the weighted importance sampling estimator Rubinstein (1981) to reduce the variance of $\hat{V}(\pi)$, as it has been shown to work well in a variety of settings Mahmood et al. (2014); Swaminathan and Joachims (2015b); Nedelec et al. (2017):
$$\hat{V}_{WIS}(\pi) = \frac{\sum_{t=1}^{T} r_t\frac{\pi(a_t)}{\pi_0(a_t)}}{\sum_{t=1}^{T}\frac{\pi(a_t)}{\pi_0(a_t)}}.$$
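Both estimators are straightforward to compute from logged (propensity, reward) pairs; a minimal sketch:

```python
import numpy as np

def is_estimate(rewards, target_probs, logged_probs):
    """Vanilla importance sampling (inverse propensity score) estimate of the
    target policy's value from logged data: unbiased but high variance."""
    w = np.asarray(target_probs) / np.asarray(logged_probs)
    return float(np.mean(np.asarray(rewards) * w))

def wis_estimate(rewards, target_probs, logged_probs):
    """Weighted importance sampling: normalise by the sum of the importance
    weights, trading a small bias for a large reduction in variance."""
    w = np.asarray(target_probs) / np.asarray(logged_probs)
    return float(np.sum(np.asarray(rewards) * w) / np.sum(w))
```

Here `target_probs[t]` is the probability the candidate policy assigns to the logged action $a_t$, and `logged_probs[t]` is the propensity recorded when the data was collected.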
So far we have been assuming that the off-policy data is collected using a single policy , but these methods can also be applied to data collected by multiple logging policies Peshkin and Shelton (2002); Agarwal et al. (2017).
3.3.2 Counterfactual Estimation: Policy Improvement
The estimators described in Section 3.3.1 can be used to evaluate arbitrary policies, and so they can be combined with policy search algorithms to learn an improved policy $\pi_\phi$. In general these policies may be dynamic, but here we consider fixed stochastic policies parameterised by a vector $\phi$, of the form:
$$\pi_\phi(a) = \frac{e^{\phi_a}}{\sum_{j=1}^{N} e^{\phi_j}}.$$
The learning objective is now to find an appropriate $\phi^*$:
$$\phi^* = \arg\max_\phi \hat{V}_{WIS}(\pi_\phi).$$
To optimize for we use Covariance-Matrix Adaptation Evolution Strategy (CMA-ES) Hansen and Ostermeier (2001), as it has been shown to work well in low-dimensional parameter spaces Ha and Schmidhuber (2018).
In initial experiments, we found that the output of $\hat{V}_{WIS}$ was still prone to overestimating the value of regions of the parameter space that were not well-represented in the data from the logging policy. In particular, it tended to assign high scores to policies with distributions that were very peaked around the individual tasks with the largest total reward. This is likely due to a combination of overfitting to a small amount of logging data, as well as deficiencies with the surrogate reward signal (described in Section 3.3.3). We therefore introduced a regularization term to the objective, limiting the parameter space to regions where the policy retains larger entropy ($H$) values (weighted by a hyperparameter $\lambda$):
$$\phi^* = \arg\max_\phi \left[\hat{V}_{WIS}(\pi_\phi) + \lambda H(\pi_\phi)\right].$$
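A sketch of the parameterised policy and the entropy-regularised objective. For brevity, a simple best-of-population evolution strategy stands in for CMA-ES proper (Hansen and Ostermeier, 2001); all hyperparameter values here are illustrative:

```python
import numpy as np

def softmax_policy(phi):
    """Fixed stochastic policy: pi_phi(a) = exp(phi_a) / sum_j exp(phi_j)."""
    z = np.exp(phi - phi.max())
    return z / z.sum()

def entropy(p):
    return float(-np.sum(p * np.log(p + 1e-12)))

def regularised_objective(phi, v_hat, lam):
    """Counterfactual value estimate plus an entropy bonus: lam keeps the
    learned policy from collapsing onto a single high-reward task."""
    p = softmax_policy(phi)
    return v_hat(p) + lam * entropy(p)

def improve_policy(v_hat, n_tasks, lam=0.1, iters=150, pop=32, sigma=0.5, seed=0):
    """Maximise the regularised objective with a simple evolution strategy:
    sample a population around the current phi and keep the best candidate."""
    rng = np.random.default_rng(seed)
    phi = np.zeros(n_tasks)
    for _ in range(iters):
        cands = phi + sigma * rng.standard_normal((pop, n_tasks))
        scores = [regularised_objective(c, v_hat, lam) for c in cands]
        phi = cands[int(np.argmax(scores))]
    return softmax_policy(phi)
```

With `lam = 0` the optimizer is free to put nearly all mass on whichever task the estimator scores highest; increasing `lam` pulls the solution back toward the uniform distribution, which is exactly the collapse-prevention behaviour described above.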
3.3.3 Counterfactual Estimation: Reward
To learn a task selection policy using counterfactual estimation we need to define a reward function $r$. Ideally, a policy that maximises the expected value of $r$ will also minimise the average task loss at the end of training (step $T$):
$$\frac{1}{N}\sum_{k=1}^{N}\mathcal{L}_k(\theta_T).$$
We found that maximising the prediction gain reward proposed in Graves et al., 2017 did not correlate well with minimising the final average task loss in our initial experiments, and so we use a slightly different reward formulation here. Intuitively, as we want to maximise the average task performance, we want to incentivise spending more time on tasks that are performing poorly at a given phase of the training process, as long as we are continuing to make progress on those tasks. We want to penalise sampling from tasks that are not improving, as this is a waste of effort, regardless of their overall performance.
Concretely, we assume that the sample loss $\mathcal{L}_{a_t}$ for a given task is a negative log likelihood value. We compare $\mathcal{L}_{a_t}$ at time $t$ with its value at time $t'$, the time at which we last sampled from the same task. If the difference between the two ($\Delta$) is negative (i.e., the loss has decreased), then the model receives a reward in the interval $(0, 1)$ that is a function of the current loss, with higher values for sampling from tasks that are performing poorly. If $\Delta \geq 0$, the reward is a fixed penalty. This reward process is described in Equation 16.
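A sketch of this reward. The text fixes the sign structure (a reward in $(0, 1)$ that grows with the current loss when progress is made, a fixed penalty otherwise); the specific mapping $1 - e^{-\mathcal{L}}$ (one minus the likelihood, given that the loss is an NLL) and the penalty value $-1$ are assumptions of this sketch:

```python
import numpy as np

def task_reward(loss_now, loss_prev):
    """Sample-level reward sketch: reward continued progress on high-loss
    tasks, penalise sampling from tasks that have stopped improving.
    The exact (0, 1) mapping and the -1 penalty are assumptions."""
    if loss_now - loss_prev < 0:            # delta < 0: still making progress
        return 1.0 - float(np.exp(-loss_now))   # higher loss -> higher reward
    return -1.0                              # stalled task: fixed penalty
```

Note that the reward depends on the current loss, not the size of the improvement, so a poorly performing task that is still improving keeps attracting samples.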
4 Experiments
To compare the different task selection policies for MTL we evaluate their performance in two settings: a toy bandit-style problem, and the GLUE benchmark for natural language understanding. Further details are given in Sections 4.1 and 4.2 respectively.
4.1 Bandit Example
Our first experiment aims to verify that our approach to learning task selection policies works in a synthetic setting, designed to be a simplified environment that still presents some of the challenges experienced with MTL in more realistic scenarios. We define an MTL bandit as a bandit with $N$ arms, representing our tasks. We assume that the goal is to sample from these arms according to some fixed oracle distribution $p^*$ that is unknown to the agent interacting with the bandit environment. At the start of the experiment we sample $p^*$, and the same $p^*$ is used for all experiment runs. Each arm $i$ has an associated score $s_i$, which is initially zero. Each arm is assigned a maximum probability value ($max_i$) that it can obtain, and then establishes a learning increment ($LI_i$) as a function of $max_i$, the arm $i$ and the number of steps $S$. Each arm is also assigned a forget increment ($FI_i$), which is a random number multiplied by $LI_i$. At each time step, if arm $i$ is selected, $s_i$ is incremented by $LI_i$ (and constrained to be $\leq max_i$), simulating some improvement on that task. Similarly, $s_j$ is reduced by $FI_j$ (constrained to be $\geq 0$) for all tasks $j$ at each time step, whether that task was selected or not, simulating some form of forgetting on each task. The loss for the arm selected by the policy being evaluated is computed before and after applying the learning and forget increments, and is used to compute the rewards for the Exp3.S and counterfactual policies respectively.
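The environment can be sketched as follows; the oracle distribution, score ceilings and increment formulas here are illustrative assumptions rather than the exact values used in the experiments:

```python
import numpy as np

class MTLBandit:
    """Sketch of the synthetic MTL bandit: arm scores rise by a learning
    increment when pulled and decay by a forget increment every step.
    All sampling ranges and the LI formula are illustrative assumptions."""

    def __init__(self, n_arms=5, n_steps=1000, seed=0):
        rng = np.random.default_rng(seed)
        self.p_star = rng.dirichlet(np.ones(n_arms))   # oracle distribution
        self.max_p = rng.uniform(0.5, 1.0, n_arms)     # per-arm score ceiling
        # assumption: an arm reaches its ceiling if pulled at the oracle rate
        self.li = self.max_p / (self.p_star * n_steps)
        self.fi = rng.uniform(0, 1, n_arms) * self.li  # per-arm forget increment
        self.scores = np.zeros(n_arms)

    def step(self, arm):
        """Pull one arm: boost its score, decay all scores, return the
        change in the average score as a progress signal."""
        before = self.scores.mean()
        self.scores[arm] = min(self.scores[arm] + self.li[arm], self.max_p[arm])
        self.scores = np.maximum(self.scores - self.fi, 0.0)  # global forgetting
        return float(self.scores.mean() - before)
```

Because every arm decays at every step, an agent that neglects any task loses score on it, which is what forces successful policies to revisit all tasks while still favouring the ones the oracle distribution weights most heavily.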
The MTL bandit creates an environment in which successful agents must learn to sample from all tasks periodically so as not to “forget”, but should learn that some tasks need to be sampled from more than others in order to maximise the overall average score. We evaluated $\pi^{uniform}$, $\pi^{Exp3.S}$ and $\pi^{CF}$ (the counterfactual policy) on this task. In Graves et al., 2017 the authors used the same hyperparameters for all experiments, and so we use the same settings for $\eta$ and $\epsilon$ here. To set the entropy weight $\lambda$, we perform a grid search and select the best performing value. For CMA-ES we used 20 iterations with a population size of 64. $\pi^{CF}$ is computed based on 2 iterations of policy improvement, starting from a random uniform policy (i.e., $\pi^{uniform}$). We fix the number of arms $N$ and the number of steps $S$ for all MTL bandit runs. We run each policy 10 times (with different initial random seeds). The results are shown in Figure 1. None of the methods are able to fully match the oracle performance, but our counterfactual method comes close. The random policy and Exp3.S perform similarly, both noticeably worse than the counterfactual policy.
4.2 GLUE
We evaluated the task selection policies in a more challenging and realistic environment — the GLUE benchmark. GLUE performance is based on an average score across the following tasks: CoLA (the Corpus of Linguistic Acceptability), MNLI (Multi-Genre Natural Language Inference), MRPC (the Microsoft Research Paraphrase Corpus), QNLI (Question Natural Language Inference, a version of the Stanford Question Answering Dataset), QQP (Quora Question Pairs), RTE (Recognizing Textual Entailment), SST-2 (the Stanford Sentiment Treebank), STS-B (the Semantic Textual Similarity Benchmark), and WNLI (Winograd Natural Language Inference). We refer readers to Wang et al., 2018 for further details on the various tasks.
We use the same underlying model with identical hyperparameters to evaluate each policy, only varying the policy itself and the random seed for each run. We use the pretrained BERT model, and follow a similar procedure to fine-tune on GLUE to the one described by the BERT authors in Devlin et al., 2018. We take the final hidden vector corresponding to the first input token as the representation of the sentence (or sentence pair) for each task. The only task-specific parameters are those of the output layers for each task (mapping from the BERT hidden size to the number of output labels for that task); all other model parameters are shared across all tasks. One departure from the details in Devlin et al., 2018 is that we use a maximum sequence length of 256 (instead of 512), as we don’t notice a significant performance difference, and it allows us to fit one full batch (size 16) into memory on an NVIDIA P100 GPU. We train for 200000 steps with a fixed learning rate. To match the evaluation in Devlin et al., 2018, we only train on 8 of the GLUE tasks, excluding WNLI. When reporting GLUE test set results, we output the majority class label for the WNLI task. For $\pi^{Exp3.S}$ we again use the same settings for $\eta$ and $\epsilon$. $\pi^{CF}$ is computed based on a single policy improvement iteration, starting from the output of the uniform random run. We used 50 iterations of CMA-ES with a population size of 64. We compute policy distributions for each value of $\lambda$ in the grid and run one iteration for each policy. We then pick the highest performing model and use this policy for additional runs. We run the experiment 3 times for each policy, with different random seeds each time.
Figure 2 shows average scores on the GLUE dev set over time for each policy, with the final scores for each individual task given in Table 1. As in the bandit experiment, we find that $\pi^{uniform}$ and $\pi^{Exp3.S}$ perform similarly on average, with neither reaching the final performance of our counterfactual method $\pi^{CF}$. $\pi^{size}$ performs much worse than all other methods, as the large differences in dataset sizes cause smaller tasks like CoLA to be under-sampled by this policy.
We evaluated the best performing model trained using $\pi^{CF}$ on the GLUE test set to make a better comparison with single-task fine-tuning of BERT, and give the results in Table 2. The MTL model comes close to the GLUE score of the single-task fine-tuned models, albeit with some differences in the individual task performances. The MTL model is noticeably worse on CoLA, MNLI matched and STS-B, considerably better on RTE, and similar on the remaining tasks.
5 Discussion
Our experiments show that choosing an appropriate task selection policy can have a large impact on MTL performance, as highlighted by the gap between the best and worst policies on the GLUE dataset. We see large performance gains for MTL on the RTE task in particular, as reported in previous work Wang et al. (2018); Devlin et al. (2018); however, the introduction of our task selection policy is not sufficient to prevent some reduction in performance on CoLA, MNLI matched and STS-B. We also note that a uniform random policy is a strong baseline in both of our experimental settings, which confirms findings (on a different set of experiments) in Graves et al., 2017.
One weakness with our counterfactual method is the need to weight the estimation of the target policy with a regularising entropy term, requiring an additional hyperparameter that must be tuned. We believe that this is largely due to a deficiency in our surrogate sample-level reward definition; perhaps this would not be necessary if we could devise a reward signal that aligns better with the global MTL objective. However, we are able to learn an improved policy on the GLUE benchmark with a relatively low number of training runs (4 in total, including parameter tuning): one initial run with a uniform random policy, followed by 3 to pick the entropy weight.
The policy learned by our method on the GLUE dataset is shown in Table 3. In general we see that the tasks with larger datasets are sampled more frequently, but that it is important to boost the relative probability of sampling from tasks with smaller datasets to maintain performance on them. Task size alone is not completely indicative of sampling frequency for our learned policy, as evidenced by STS-B and SST-2 being weighted similarly despite an order of magnitude difference in their respective training set sizes, and the difference in weight between MNLI and QQP, which have similar training set sizes. Our results support previous findings that it is often beneficial in MTL to spend more time on difficult or larger tasks (a strategy that is sometimes referred to as “anti-curriculum”) McCann et al. (2018); Hokamp et al. (2019), while suggesting a means for learning how to weight the tasks automatically.
6 Conclusion
We evaluated several approaches to creating task selection policies for multitask learning, and highlighted that in the context of the GLUE benchmark, the choice of policy can have a large effect on overall performance. We showed how the problem of task selection is related to the areas of curriculum learning and off-policy evaluation, and suggested an approach based on counterfactual estimation that leads to improved performance on a synthetic bandit-style task, as well as on the more challenging GLUE tasks. Interesting possibilities for future work include extending our learned policies to be dynamic (instead of fixed stochastic), evaluating additional counterfactual estimators Nedelec et al. (2017), and devising (or learning) improved sample-level reward signals.
References
- Agarwal et al. (2017) Aman Agarwal, Soumya Basu, Tobias Schnabel, and Thorsten Joachims. 2017. Effective Evaluation using Logged Bandit Feedback from Multiple Loggers. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’17, pages 687–696.
- Alonso and Plank (2016) Héctor Martínez Alonso and Barbara Plank. 2016. When is multitask learning effective? Semantic sequence prediction under varying data conditions. arXiv:1612.02251 [cs].
- Argyriou et al. (2007) Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. 2007. Multi-Task Feature Learning. In Advances in Neural Information Processing Systems.
- Auer et al. (2002) Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. 2002. The Nonstochastic Multiarmed Bandit Problem. SIAM Journal on Computing, 32(1):48–77.
- Baxter (2000) J. Baxter. 2000. A Model of Inductive Bias Learning. Journal of Artificial Intelligence Research, 12:149–198.
- Baxter (1997) Jonathan Baxter. 1997. A Bayesian/information theoretic model of learning to learn via multiple task sampling. Machine Learning, 28(1):7–39.
- Bengio et al. (2009) Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning - ICML ’09, pages 1–8, Montreal, Quebec, Canada. ACM Press.
- Beygelzimer and Langford (2009) Alina Beygelzimer and John Langford. 2009. The Offset Tree for Learning with Partial Labels. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 129–138.
- Bottou et al. (2012) Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X Charles, D Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. 2012. Counterfactual Reasoning and Learning Systems. CoRR, abs/1209.2355.
- Bubeck and Cesa-Bianchi (2012) Sébastien Bubeck and Nicolò Cesa-Bianchi. 2012. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine Learning, 5(1):1–122.
- Caruana (1993) Rich Caruana. 1993. Multitask Learning: A Knowledge-Based Source of Inductive Bias. In Proceedings of the Tenth International Conference on Machine Learning.
- Caruana (1997) Rich Caruana. 1997. Multitask Learning. Machine Learning, 28(1):41–75.
- Collobert and Weston (2008) Ronan Collobert and Jason Weston. 2008. A Unified Architecture for Natural Language Processing. In Proceedings of the 25th International Conference on Machine Learning - ICML ’08, pages 160–167.
- Dai and Le (2015) Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised Sequence Learning. Advances in Neural Information Processing Systems (NIPS ’15), pages 1–9.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs].
- Dudik et al. (2011) Miroslav Dudik, John Langford, and Lihong Li. 2011. Doubly Robust Policy Evaluation and Learning. arXiv:1103.4601 [cs, stat].
- Elman (1993) Jeffrey L. Elman. 1993. Learning and development in neural networks: The importance of starting small. Cognition, 48(1):71–99.
- Firat et al. (2016) Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016. Multi-Way, Multilingual Neural Machine Translation with a Shared Attention Mechanism. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 866–875, San Diego, California. Association for Computational Linguistics.
- Graves et al. (2017) Alex Graves, Marc G Bellemare, and Jacob Menick. 2017. Automated Curriculum Learning for Neural Networks.
- Guo et al. (2019) Han Guo, Ramakanth Pasunuru, and Mohit Bansal. 2019. AutoSeM: Automatic Task Selection and Mixing in Multi-Task Learning. In Proceedings of NAACL-HLT, pages 3520–3531.
- Ha and Schmidhuber (2018) David Ha and Jürgen Schmidhuber. 2018. Recurrent World Models Facilitate Policy Evolution. In Advances in Neural Information Processing Systems 31, pages 2451–2463.
- Hansen and Ostermeier (2001) Nikolaus Hansen and Andreas Ostermeier. 2001. Completely Derandomized Self-Adaptation in Evolution Strategies. Evolutionary Computation, 9(2):159–195.
- Heskes (2000) Tom Heskes. 2000. Empirical Bayes for Learning to Learn. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 367–374.
- Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the Knowledge in a Neural Network. arXiv:1503.02531 [cs, stat].
- Hokamp et al. (2019) Chris Hokamp, John Glover, and Demian Gholipour. 2019. Evaluating the Supervised and Zero-shot Performance of Multi-lingual Translation Models. arXiv:1906.09675 [cs].
- Hokamp et al. (2018) Chris Hokamp, Sebastian Ruder, and John Glover. 2018. Off-the-Shelf Unsupervised NMT. arXiv:1811.02278 [cs].
- Howard and Ruder (2018) Jeremy Howard and Sebastian Ruder. 2018. Universal Language Model Fine-tuning for Text Classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339.
- Jiang and Li (2015) Nan Jiang and Lihong Li. 2015. Doubly Robust Off-policy Value Evaluation for Reinforcement Learning. arXiv:1511.03722 [cs, stat].
- Joachims et al. (2018) Thorsten Joachims, Adith Swaminathan, and Maarten de Rijke. 2018. Deep Learning with Logged Bandit Feedback. In International Conference on Learning Representations.
- Johnson et al. (2016) Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. arXiv:1611.04558 [cs].
- Lawrence and Platt (2004) Neil D. Lawrence and John C. Platt. 2004. Learning to learn with the informative vector machine. In Twenty-First International Conference on Machine Learning - ICML ’04, page 65, Banff, Alberta, Canada. ACM Press.
- Liu et al. (2019) Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019. Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding. arXiv:1904.09482 [cs].
- Luong et al. (2016) Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2016. Multi-task Sequence to Sequence Learning. In International Conference on Learning Representations.
- Mahmood et al. (2014) A Rupam Mahmood, Hado Van Hasselt, and Richard Sutton. 2014. Weighted importance sampling for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems 27, pages 3014–3022.
- McCann et al. (2018) Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2018. The Natural Language Decathlon: Multitask Learning as Question Answering. arXiv:1806.08730 [cs, stat].
- Misra et al. (2016) Ishan Misra, Abhinav Shrivastava, Abhinav Gupta, and Martial Hebert. 2016. Cross-stitch Networks for Multi-task Learning. arXiv:1604.03539 [cs].
- Nedelec et al. (2017) Thomas Nedelec, Nicolas Le Roux, and Vianney Perchet. 2017. A comparative study of counterfactual estimators. arXiv:1704.00773 [cs, stat].
- Peshkin and Shelton (2002) Leonid Peshkin and Christian R Shelton. 2002. Learning from Scarce Experience. In Proceedings of the Nineteenth International Conference on Machine Learning, pages 498–505.
- Peters et al. (2018) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv:1802.05365 [cs].
- Phang et al. (2018) Jason Phang, Thibault Févry, and Samuel R. Bowman. 2018. Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks. arXiv:1811.01088 [cs].
- Precup et al. (2000) Doina Precup, Richard S. Sutton, and Satinder Singh. 2000. Eligibility Traces for Off-Policy Policy Evaluation. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 759–766.
- Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving Language Understanding by Generative Pre-Training. OpenAI Technical Report.
- Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.
- Rosenbaum and Rubin (1983) Paul R. Rosenbaum and Donald B. Rubin. 1983. The Central Role of the Propensity Score in Observational Studies for Causal Effects. Biometrika, 70(1):41–55.
- Rubinstein (1981) Reuven Rubinstein. 1981. Simulation and the Monte Carlo Method. Wiley, New York.
- Strubell et al. (2018) Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-Informed Self-Attention for Semantic Role Labeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5027–5038, Brussels, Belgium. Association for Computational Linguistics.
- Subramanian et al. (2018) Sandeep Subramanian, Adam Trischler, Yoshua Bengio, and Christopher J. Pal. 2018. Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning. arXiv:1804.00079 [cs].
- Swaminathan and Joachims (2015a) Adith Swaminathan and Thorsten Joachims. 2015a. Batch Learning from Logged Bandit Feedback through Counterfactual Risk Minimization. In Proceedings of the 24th International Conference on World Wide Web - WWW ’15 Companion, pages 939–941, Florence, Italy. ACM Press.
- Swaminathan and Joachims (2015b) Adith Swaminathan and Thorsten Joachims. 2015b. The Self-Normalized Estimator for Counterfactual Learning. In Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS’15, pages 3231–3239.
- Thrun and Pratt (1998) Sebastian Thrun and Lorien Pratt. 1998. Learning to Learn. Kluwer Academic Publishers, Norwell, MA, USA.
- Wang et al. (2018) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. arXiv:1804.07461 [cs].
- Weston et al. (2015) Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M. Rush, Bart van Merriënboer, Armand Joulin, and Tomas Mikolov. 2015. Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks. arXiv:1502.05698 [cs, stat].