1 Introduction
Multi-task learning (MTL) (Caruana, 1997) is an inductive transfer mechanism that leverages information from related tasks to improve the primary model's generalization performance. It achieves this goal by training multiple tasks in parallel while sharing representations, so that the training signals from the auxiliary tasks can help improve the performance of the primary task. Multi-task learning has been applied to a wide range of natural language processing problems (Luong et al., 2015; Pasunuru and Bansal, 2017; Hashimoto et al., 2017; Ruder et al., 2017b; Kaiser et al., 2017; McCann et al., 2018). Despite its impressive performance, the design of a multi-task learning system is non-trivial. In the context of improving the primary task's performance using knowledge from other auxiliary tasks (Luong et al., 2015; Pasunuru and Bansal, 2017), two major challenges are selecting the most relevant auxiliary tasks and learning a balanced mixing ratio for synergized training of these tasks. One can attempt this via manual intuition or hyperparameter tuning over all combinatorial task choices, but this either introduces human inductive bias or does not scale as the number of candidate auxiliary tasks grows. To this end, we present AutoSeM, a two-stage Bayesian optimization pipeline for this problem (we make all our code and models publicly available at: https://github.com/HanGuo97/AutoSeM). In our AutoSeM framework, the first stage addresses automatic task selection from a pool of auxiliary tasks. For this, we use a non-stationary multi-armed bandit (MAB) controller (Bubeck et al., 2012; Raj and Kalyani, 2017) that dynamically alternates among task choices within the training loop, and eventually returns estimates of the utility of each task w.r.t. the primary task. We model the utility of each task as a Beta distribution, whose expected value can be interpreted as the probability of that task making a non-negative contribution to the training performance of the primary task. Further, we model the observations as Bernoulli variables so that the posterior distribution is also Beta-distributed. We use Thompson sampling (Chapelle and Li, 2011; Russo et al., 2018) to trade off exploitation and exploration. The second stage then takes the auxiliary tasks selected in the first stage and automatically learns their training mixing ratio, through the framework of Bayesian optimization, by modeling the performance of each mixing ratio as a sample from a Gaussian Process (GP) and sequentially searching for the optimal values (Rasmussen, 2004; Snoek et al., 2012).
For the covariance function in the GP, we use the Matern kernel, which is parameterized by a smoothness hyperparameter that controls the level of differentiability of the samples from the GP. Further, following Hoffman et al. (2011), we use a portfolio of optimistic and improvement-based policies as acquisition functions (Shahriari et al., 2016) for selecting the next sample point from the GP search space. We conduct several experiments on the GLUE natural language understanding benchmark (Wang et al., 2018), where we choose each of RTE, MRPC, QNLI, CoLA, and SST-2 as the primary task, and treat the rest of the classification tasks from the GLUE benchmark as candidate auxiliary tasks. Results show that our AutoSeM framework can successfully find useful auxiliary tasks and automatically learn their mixing ratio, achieving significant performance boosts on top of strong baselines for several primary tasks, e.g., 5.2% improvement on QNLI, 4.7% improvement on RTE, and 2.8%/0.8% improvement on MRPC.
We also ablate the usefulness of our two stages of auxiliary task selection and automatic mixing-ratio learning. The first ablation removes the task selection stage and instead directly performs the second-stage GP mixing-ratio learning on all auxiliary tasks. The second ablation performs the task selection stage (with the multi-armed bandit) but replaces the second-stage Gaussian Process with manual tuning on the selected tasks. Our two-stage model performs better than both of these ablations, showing that both stages are crucial. Further, we discuss the learned auxiliary task choices in terms of their intuitive relevance w.r.t. the corresponding primary task.
2 Related Work
Multi-task learning (Caruana, 1998), known for improving the generalization performance of a task with the help of auxiliary tasks, has been successfully applied to many domains of machine learning, including natural language processing (Collobert and Weston, 2008; Luong et al., 2015; Pasunuru and Bansal, 2017; Pasunuru et al., 2017), computer vision (Girshick, 2015; Misra et al., 2016; Kendall et al., 2017; Dai et al., 2016), and reinforcement learning (Teh et al., 2017; Parisotto et al., 2015; Jaderberg et al., 2016). Although there are many variants of multi-task learning (Ruder et al., 2017b; Hashimoto et al., 2017; Luong et al., 2015; McCann et al., 2018), our goal is to improve the performance of a primary task using a set of relevant auxiliary tasks, where different tasks share some common model parameters and are optimized with alternating mini-batches, similar to Luong et al. (2015). To address the problem of automatic shared-parameter selection, Ruder et al. (2017a) automatically learned the latent multi-task sharing architecture, and Xiao et al. (2018) used a gate mechanism that filters the feature flows between tasks. On the problem of identifying task relatedness, Ben-David and Schuller (2003) provided a formal framework for task relatedness and derived generalization error bounds for learning of multiple tasks. Bingel and Søgaard (2017)
explored task relatedness via exhaustively experimenting with all possible two-task tuples in a non-automated multi-task setup. Other related works explored data selection, where the goal is to select or reorder the examples from one or more domains (usually in a single task) to either improve the training efficiency or enable better transfer learning. These approaches have been applied in machine translation (van der Wees et al., 2017), language models (Moore and Lewis, 2010; Duh et al., 2013), dependency parsing (Søgaard, 2011), etc. In particular, Ruder and Plank (2017) used Bayesian optimization to select relevant training instances for transfer learning, and Tsvetkov et al. (2016) applied it to learn a curriculum for training word embeddings via reordering data. Graves et al. (2017) used a bandit approach (the Exp3.S algorithm) in the context of automated curriculum learning, but in our work, we have two stages, with each stage addressing a different problem (automatic task selection and learning of the training mixing ratio). Recently, Sharma and Ravindran (2017) used multi-armed bandits (MAB) to learn the choice of hard vs. easy domain data selection as input feed for the model, and Guo et al. (2018) used MAB to effectively switch across tasks in a dynamic multi-task learning setup. In our work, we use MAB with Thompson Sampling for the novel paradigm of automatic auxiliary task selection; and next, we use a Matern-kernel Gaussian Process to automatically learn an exact (static) mixing ratio (i.e., relatedness ratio) for the small number of selected tasks. Many control problems can be cast as a multi-armed bandit problem, where the goal of the agent is to select the arm/action from one of the choices that minimizes regret (Bubeck et al., 2012). One core problem in bandit learning is the trade-off between exploration and exploitation, where the agent needs to decide between taking the action that yields the best payoff under current estimates and exploring new actions whose payoffs are not yet certain. Many previous works have explored various exploration and exploitation strategies to minimize regret, including Boltzmann exploration (Kaelbling et al., 1996), adversarial bandits (Auer et al., 2002b), UCB (Auer et al., 2002a), and information gain using variational approaches (Houthooft et al., 2016).
In this work, for task selection, we use Thompson Sampling (Russo et al., 2018; Chapelle and Li, 2011), an algorithm for sequential decision making problems, which addresses a broad range of problems in a computationally efficient manner and is therefore enjoying wide use.
A Gaussian Process (GP) is a non-parametric Bayesian approach that can capture a wide variety of underlying functions or relations between inputs and outputs by taking advantage of the full information provided by the history of observations, and it is thus very data-efficient (Rasmussen, 2004; Shahriari et al., 2016; Schulz et al., 2018). Gaussian Processes have been widely used as black-box optimizers and for hyperparameter optimization (Snoek et al., 2012; Brochu et al., 2010; Knudde et al., 2017; Cully et al., 2018; Swersky et al., 2013; Golovin et al., 2017). In our work, we use a Gaussian Process in our stage-2 for automatic learning of the multi-task mixing ratio among the tasks selected in stage-1.
3 Models
We first introduce our baseline model and its integration for multiple classification tasks in a multi-task learning (MTL) setup. Next, we introduce our AutoSeM framework, an automatic way of selecting auxiliary tasks and learning their optimal training mixing ratio w.r.t. the primary task, via a Beta-Bernoulli bandit with Thompson Sampling and a Gaussian Process framework.
3.1 Bi-Text Classification Model
Let s1 and s2 be the input sentence pair in our classification task, where we encode these sentences via bidirectional LSTM-RNNs, similar to that of Conneau et al. (2017). Next, we apply max-pooling over the output hidden states of both encoders, where h1 and h2 are the outputs of the max-pooling layer for s1 and s2, respectively. We then map these two representations (h1 and h2) into a single rich dense representation vector h:

h = [h1; h2; h1 ⊙ h2]    (1)

where [;] represents concatenation and ⊙ represents the element-wise multiplication of h1 and h2. We project this final representation h to the label space to classify the given sentence pair (see Fig. 1). We also use ELMo (Peters et al., 2018) representations for the word embeddings in our model. For this, we extract the three ELMo layer representations for each sentence in the pair and use their weighted sum as the ELMo output representation, where the weights are trainable.

3.2 Multi-Task Learning
In this work, we focus on improving a task (the primary task) by allowing it to share parameters with related auxiliary tasks via multi-task learning (MTL). Let {D1, ..., DN} be a set of N tasks, where we set D1 to be the primary task and the rest as auxiliary tasks. We can extend our single-task learning baseline (see Sec. 3.1) into a multi-task learning model by augmenting the model with per-task projection layers while sharing the rest of the model parameters across these tasks (see Fig. 1). We employ MTL training of these tasks in alternate mini-batches based on a mixing ratio η1:η2:···:ηN, similar to previous work (Luong et al., 2015), where we optimize ηi mini-batches of task Di and then go to the next task.
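As an illustration, the alternating mini-batch schedule induced by such a mixing ratio can be sketched as follows (a minimal sketch; the task names and ratio values are example inputs, not the ratios learned in our experiments):

```python
def mtl_schedule(tasks, mixing_ratio, num_rounds):
    """Return the task order for alternating MTL training:
    eta_i consecutive mini-batches of task i, then move to the next task."""
    order = []
    for _ in range(num_rounds):
        for task, eta in zip(tasks, mixing_ratio):
            order.extend([task] * eta)
    return order

# Example: a primary task with two auxiliary tasks and mixing ratio 9:1:4.
schedule = mtl_schedule(["MRPC", "RTE", "MultiNLI"], [9, 1, 4], num_rounds=2)
print(schedule[:10])  # nine MRPC mini-batches, then one RTE mini-batch
print(len(schedule))  # 2 * (9 + 1 + 4) = 28
```

In the actual training loop, each entry of the schedule would correspond to one gradient step on a mini-batch drawn from that task's data.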
In MTL, choosing the appropriate auxiliary tasks and properly tuning the mixing ratio can be important for the performance of multi-task models. The naive approach of trying all combinations of task selections is hardly tractable. To solve this issue, we propose AutoSeM, a two-stage pipeline, in the next section. In the first stage, we automatically find the relevant auxiliary tasks (out of the given options) that improve the performance of the primary task. After finding the relevant auxiliary tasks, in the second stage, we take these selected tasks along with the primary task and automatically learn their training mixing ratio.
3.3 Automatic Task Selection: Multi-Armed Bandit with Thompson Sampling
Tuning the mixing ratio for N tasks in MTL becomes exponentially harder as the number of auxiliary tasks grows. However, in most circumstances, only a small number of these auxiliary tasks are useful for improving the primary task at hand. Manually searching for this optimal choice of relevant tasks is intractable. Hence, in this work, we present a method for automatic task selection via multi-armed bandits with Thompson Sampling (see the left side of Fig. 2).
Let {a1, ..., aN} represent the set of arms (corresponding to the set of tasks {D1, ..., DN}) of the bandit controller in our multi-task setting, where the controller selects a sequence of actions/arms over the current training trajectory to maximize the expected future payoff. At each round t, the controller selects an arm x_t based on the noisy value estimates and observes a reward r_t for the selected arm. Let θ_k be the utility (usefulness) of task k. Initially, the agent begins with an independent prior belief over each θ_k. We take these priors to be Beta-distributed with parameters α_k and β_k, and the prior probability density function of θ_k is:

p(θ_k) = Γ(α_k + β_k) / (Γ(α_k) Γ(β_k)) · θ_k^(α_k − 1) (1 − θ_k)^(β_k − 1)    (2)
where Γ(·) denotes the gamma function. We formulate the reward r_t at round t as a Bernoulli variable, where an action k produces a reward of 1 with probability θ_k and a reward of 0 with probability 1 − θ_k. The true utility of task k, i.e., θ_k, is unknown, and may or may not change over time (depending on whether the task utility is stationary or non-stationary). We define the reward as whether sampling the task improves (or maintains) the validation metric of the primary task:

r_t = 1 if S_t ≥ S_{t−1}, and r_t = 0 otherwise    (3)
where S_t represents the validation performance of the primary task at time t. With our reward setup above, the utility of each task (θ_k) can be intuitively interpreted as the probability that multi-task learning with task k can improve (or maintain) the performance of the primary task. The conjugacy properties of the Beta distribution assert that the posterior distribution is also Beta, with parameters that can be updated using a simple Bayes rule, defined as follows (Russo et al., 2018):

(α_k, β_k) ← (α_k, β_k)    if x_t ≠ k    (4)
(α_k, β_k) ← (α_k + r_t, β_k + 1 − r_t)    if x_t = k    (5)

where x_t is the sampled task at round t. Finally, at the end of the training, we calculate the expected value of each arm as follows:

E[θ_k] = α_k / (α_k + β_k)    (6)
Here, the expectation measures the probability of improving (or maintaining) the primary task by sampling this task. To decide the next action to take, we apply Thompson Sampling (Chapelle and Li, 2011; Russo et al., 2018) to trade off exploitation (maximizing immediate performance) and exploration (investing to accumulate new information that might improve performance in the future). In Thompson Sampling (Russo et al., 2018), instead of taking the action that maximizes the expectation (i.e., argmax_k E[θ_k]), we randomly sample a primary-task improvement probability θ̂_k from the posterior θ̂_k ∼ Beta(α_k, β_k), and take the action that maximizes this sampled probability, i.e., x_t = argmax_k θ̂_k. At the end of the training, task selection can proceed either via a threshold on the expectation or by taking the top-K tasks, and we run stage-2 using the selected task subset as auxiliary tasks (details in Sec. 3.4).
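The Beta-Bernoulli Thompson-sampling loop above can be sketched as follows (a minimal illustration; a simulated Bernoulli reward stands in for the real validation-metric reward of Eq. (3), and the utility values are made up for the demo):

```python
import random

random.seed(0)

num_tasks = 4
alpha = [1.0] * num_tasks  # Beta prior parameters per arm/task
beta = [1.0] * num_tasks

# True utilities, unknown to the controller (used only to simulate rewards).
true_utility = [0.8, 0.5, 0.3, 0.6]

for t in range(2000):
    # Thompson sampling: draw theta_k ~ Beta(alpha_k, beta_k), pick the argmax
    # of the *samples*, not of the posterior means.
    samples = [random.betavariate(alpha[k], beta[k]) for k in range(num_tasks)]
    k = max(range(num_tasks), key=lambda i: samples[i])
    # Bernoulli reward: stands in for "did the primary-task metric improve?" (Eq. 3).
    r = 1 if random.random() < true_utility[k] else 0
    # Conjugate posterior update (Eqs. 4-5): only the pulled arm changes.
    alpha[k] += r
    beta[k] += 1 - r

# Expected utility of each arm (Eq. 6).
expected = [alpha[k] / (alpha[k] + beta[k]) for k in range(num_tasks)]
best = max(range(num_tasks), key=lambda k: expected[k])
print(best)  # index of the arm with the highest expected utility
```

After enough rounds, the posterior means concentrate near the true utilities of the arms that are pulled often, which is the basis for the task-selection step described above.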
Stronger Prior for Primary Task
Note that at the beginning of training, model performance is usually guaranteed to improve from the initial random choices. This causes issues in updating arm values because less useful tasks will be given high arm values when they happen to be sampled at the beginning. To resolve this issue, we initially set a slightly stronger prior/arm-value in favor of the arm corresponding to the primary task. Intuitively, the bandit will then sample the primary model more often at the beginning, and then start exploring auxiliary tasks when the primary model's performance stabilizes (as the arm value of the primary model will start decreasing because sampling it in later rounds produces smaller additional improvements).
Non-Stationary Multi-Armed Bandit
Also note that the intrinsic usefulness of each task varies throughout training (e.g., the primary task might be more important at the beginning, but not necessarily at the end), and thus the agent faces a non-stationary system. In such cases, the agent should always be encouraged to explore in order to track changes as the system drifts. One simple approach to inject non-stationarity is to discount the relevance of previous observations. Thus we introduce a tunable decay ratio γ, and modify the posterior update (Eqs. 4-5) as follows:

(α_k, β_k) ← (α̂_k, β̂_k)    if x_t ≠ k
(α_k, β_k) ← (α̂_k + r_t, β̂_k + 1 − r_t)    if x_t = k    (7)

where α̂_k = (1 − γ)α_k + γα_0 and β̂_k = (1 − γ)β_k + γβ_0, and γ ∈ (0, 1) controls how quickly uncertainty is injected into the system (α_0, β_0 are parameters of the prior). Algorithm 1 presents the Thompson Sampling algorithm with a Beta-Bernoulli MAB.
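The decayed update of Eq. (7) can be sketched as follows (an illustrative fragment; `gamma` and the prior values are example settings, not tuned ones):

```python
def decayed_update(alpha, beta, pulled, reward, gamma=0.1, alpha0=1.0, beta0=1.0):
    """Eq. (7): shrink every arm toward the prior, then credit the pulled arm."""
    new_alpha, new_beta = [], []
    for k in range(len(alpha)):
        # Discount previous observations for all arms (injects uncertainty).
        a_hat = (1.0 - gamma) * alpha[k] + gamma * alpha0
        b_hat = (1.0 - gamma) * beta[k] + gamma * beta0
        # Only the pulled arm receives the Bernoulli reward credit.
        if k == pulled:
            a_hat += reward
            b_hat += 1 - reward
        new_alpha.append(a_hat)
        new_beta.append(b_hat)
    return new_alpha, new_beta

alpha, beta = [5.0, 1.0], [1.0, 1.0]
alpha, beta = decayed_update(alpha, beta, pulled=1, reward=1)
print(alpha, beta)  # arm 0 drifts toward the prior; arm 1 is credited with its reward
```

Because every arm's parameters decay toward (α_0, β_0) at each round, posterior certainty about an arm can never grow without bound, which keeps the controller exploring as the task utilities drift.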
3.4 Automatic Mixing Ratio Learning via Gaussian Process
The right side of Fig. 2 illustrates our Gaussian Process controller for automatic learning of the MTL training mixing ratio (see definition in Sec. 3.2). Given the selected auxiliary tasks from the previous section, the next step is to find a proper mixing ratio for training these selected tasks along with the primary task. (Note that ideally a Gaussian Process could also learn to set the mixing ratio of less important tasks to zero, hence essentially also performing the task-selection step. In practice, however, first applying our Thompson-Sampling task-selection model (Sec. 3.3) allows the GP to search the mixing-ratio space more efficiently over the small number of filtered auxiliary tasks, as shown in the results of Sec. 6.1.) Manual tuning of this mixing ratio via a large grid search over the hyperparameter values is very time- and compute-expensive (even when the number of selected auxiliary tasks is small, e.g., 2 or 3). Thus, in our second stage, we instead apply a non-parametric Bayesian approach to search for the approximately-optimal mixing ratio. In particular, we use a Gaussian Process to sequentially search for the mixing ratio by trading off exploitation and exploration automatically. Next, we describe our Gaussian Process approach in detail.
A Gaussian Process (Rasmussen, 2004; Snoek et al., 2012; Shahriari et al., 2016), GP(μ0, k), is a non-parametric model that is fully characterized by a mean function μ0 and a positive-definite kernel or covariance function k(·, ·). Let x1, ..., xn denote any finite collection of n points, where each xi represents a choice of the mixing ratio (i.e., the ratio η1:η2:···:ηN described in Sec. 3.2), and fi = f(xi) is the (unknown) function value evaluated at xi (the true performance of the model given the selected mixing ratio). Let y1, ..., yn be the corresponding noisy observations (the validation performance at the end of training). In the context of GP regression (GPR), f1, ..., fn are assumed to be jointly Gaussian (Rasmussen, 2004), i.e., f | X ∼ N(m, K), where m_i = μ0(xi) is the mean vector and K_{i,j} = k(xi, xj) is the covariance matrix. The noisy observations y1, ..., yn are then normally distributed around f as follows: y | f ∼ N(f, σ²I). We start with a set of q random initial observations {(x1, y1), ..., (xq, yq)}, where each xi represents a mixing ratio and yi represents the corresponding model's validation performance. We model the GP based on these initial observations as described above, sample a next point x_{q+1} (a mixing ratio in our case) from this GP, obtain its corresponding model performance y_{q+1}, and update the GP again by now considering the q+1 points (Rasmussen, 2004). We continue this process for a fixed number of steps. Next, we discuss how we perform the sampling (based on acquisition functions) and the kernels used for calculating the covariance.
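The GP posterior used in this loop can be sketched with plain NumPy (a simplified one-dimensional illustration, assuming a zero-mean GP with a Matern-5/2 kernel; the length scale, noise level, and observation values are arbitrary example inputs):

```python
import numpy as np

def matern52(x1, x2, length_scale=0.5):
    """Matern kernel with nu = 5/2: an exponential times an order-2 polynomial."""
    d = np.abs(x1[:, None] - x2[None, :])
    s = np.sqrt(5.0) * d / length_scale
    return (1.0 + s + s**2 / 3.0) * np.exp(-s)

def gp_posterior(x_train, y_train, x_test, noise=1e-4):
    """Posterior mean and variance of a zero-mean GP at x_test, given noisy observations."""
    K = matern52(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = matern52(x_train, x_test)
    K_ss = matern52(x_test, x_test)
    # Solve K @ [v, W] = [y, K_s] once for both the mean and the covariance terms.
    solve = np.linalg.solve(K, np.column_stack([y_train[:, None], K_s]))
    mean = K_s.T @ solve[:, 0]
    cov = K_ss - K_s.T @ solve[:, 1:]
    return mean, np.diag(cov)

# Observed (mixing-ratio, validation-score) pairs, made up for illustration.
x_obs = np.array([0.1, 0.4, 0.8])
y_obs = np.array([0.70, 0.78, 0.74])
mean, var = gp_posterior(x_obs, y_obs, np.array([0.4, 0.95]))
print(mean, var)  # near-zero variance at the observed point 0.4, larger at 0.95
```

The sequential search then repeatedly appends the newest (mixing ratio, performance) pair to the observation set and refits this posterior before choosing the next candidate.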
Table 1: Results of our baseline and AutoSeM models, compared to previous work (Wang et al., 2018).

Models                                        | RTE  | MRPC      | QNLI | CoLA | SST-2
BiLSTM+ELMo (Single-Task) (Wang et al., 2018) | 50.1 | 69.0/80.8 | 69.4 | 35.0 | 90.2
BiLSTM+ELMo (Multi-Task) (Wang et al., 2018)  | 55.7 | 76.2/83.5 | 66.7 | 27.5 | 89.6
Our Baseline                                  | 54.0 | 75.7/83.7 | 74.0 | 30.8 | 91.3
Our AutoSeM                                   | 58.7 | 78.5/84.5 | 79.2 | 32.9 | 91.8
Acquisition Functions
Here, we describe the acquisition functions for deciding where to sample next. While one could select the points that maximize the mean function, this does not always lead to the best outcome (Hoffman et al., 2011). Since we also have the variance of the estimates along with the mean value of each point, we can incorporate this information into the optimization. In this work, we use the GP-Hedge approach (Hoffman et al., 2011; Auer et al., 1995), which probabilistically chooses one of three acquisition functions: probability of improvement, expected improvement, and upper confidence bound. The probability-of-improvement acquisition function measures the probability that the sampled mixing ratio x leads to an improvement upon the best observed value so far (τ): a_PI(x) = P(f(x) > τ). Expected improvement additionally incorporates the amount of improvement: a_EI(x) = E[(f(x) − τ) · 1(f(x) > τ)]. The Gaussian Process upper confidence bound (GP-UCB) algorithm measures the optimistic performance upper bound of the sampled mixing ratio (Srinivas et al., 2009): a_UCB(x) = μ(x) + κσ(x), for some hyperparameter κ.
Matern Kernel
The covariance function (or kernel) defines the nearness or similarity of two points in the Gaussian Process. Here, we use the automatic relevance determination (ARD) Matern kernel (Rasmussen, 2004), which is parameterized by a smoothness value ν that controls the level of smoothness; in particular, samples from a GP with such a kernel are ⌈ν⌉ − 1 times differentiable. When ν is a half-integer (i.e., ν = p + 1/2 for a non-negative integer p), the covariance function is a product of an exponential and a polynomial of order p. In the context of machine learning, usual choices of ν include 3/2 and 5/2 (Shahriari et al., 2016).
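The three acquisition functions above can be sketched in their standard closed forms under a Gaussian posterior (a minimal illustration; the candidate mean/variance values and the best-so-far score are arbitrary example inputs):

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def prob_improvement(mu, sigma, best):
    """a_PI: P(f(x) > best) under f(x) ~ N(mu, sigma^2)."""
    return norm_cdf((mu - best) / sigma)

def expected_improvement(mu, sigma, best):
    """a_EI: E[max(f(x) - best, 0)] under f(x) ~ N(mu, sigma^2)."""
    z = (mu - best) / sigma
    return (mu - best) * norm_cdf(z) + sigma * norm_pdf(z)

def ucb(mu, sigma, kappa=2.0):
    """a_UCB: optimistic upper bound mu + kappa * sigma."""
    return mu + kappa * sigma

best_so_far = 0.78
# Two candidate mixing ratios: one confident but mediocre, one uncertain but promising.
candidates = [(0.76, 0.01), (0.75, 0.08)]
for mu, sigma in candidates:
    print(prob_improvement(mu, sigma, best_so_far),
          expected_improvement(mu, sigma, best_so_far),
          ucb(mu, sigma))
```

Note how all three criteria favor the high-variance candidate here despite its slightly lower mean, which is exactly the exploration behavior the GP-Hedge portfolio relies on.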
4 Experiment Setup
Datasets: We evaluate our models on several datasets from the GLUE benchmark (Wang et al., 2018): RTE, QNLI, MRPC, SST-2, and CoLA. For all these datasets, we use the standard splits provided by Wang et al. (2018). For dataset details, we refer the reader to the GLUE paper. (We did not include the remaining tasks as primary tasks because STS-B is a regression task; MNLI is a very large dataset and does not benefit much from MTL with other tasks in the GLUE benchmark; and QQP and WNLI have dev/test discrepancies and adversarial label issues as per the GLUE website's FAQ: https://gluebenchmark.com/faq)
Training Details: We use pre-trained ELMo (https://allennlp.org/elmo) to obtain sentence representations as inputs to our model (Peters et al., 2018). The Gaussian Process implementation is based on Scikit-Optimize (https://scikit-optimize.github.io), and we adopt most of its default configurations. We use accuracy as the validation criterion for all tasks. For all of our experiments except QNLI and SST-2, we apply early stopping based on a plateau in validation performance. (In our initial experiments, we found that early stopping on larger datasets led to suboptimal performance, hence we used a pre-specified maximum number of steps for those instead.) The set of candidate auxiliary tasks consists of all two-sentence classification tasks when the primary task is a classification of two sentences, whereas it consists of all two-sentence and single-sentence classification tasks when the primary task is a classification of a single sentence. (We made this design decision because there are only two single-sentence tasks in GLUE, so we mix them with two-sentence tasks to allow more auxiliary choices.) Since the utility estimates from the multi-armed bandit controller are noisy, we choose the top two tasks based on expected task-utility estimates, and include additional tasks if their utility estimate is above 0.5. All reported results are the aggregate of the same experiment with two runs (with different random seeds) unless explicitly mentioned. (We use the average of validation results across runs as the tuning criterion, and use the ensemble of models across runs for reporting the test results.) We use a two-layer LSTM-RNN with a hidden size of 1024 for RTE and 512 for the rest of the models, and use the Adam optimizer (Kingma and Ba, 2014). The prior parameters (α_0, β_0) of each task in stage-1 are set to values commonly used in the literature.
For stage-1, the bandit controller iteratively selects batches of data from different tasks during training to learn the approximate importance of each auxiliary task (Graves et al., 2017). In stage-2 (Gaussian Process), we sequentially draw samples of mixing ratios and evaluate each sample after full training (Snoek et al., 2012). Without much tuning, we used approximately 200 rounds for the stage-1 bandit-based approach, where each round consists of approximately 10 mini-batches of optimization. For stage-2, we experimented with 15 and 20 as the number of samples to draw and found that 15 samples for MRPC and 20 samples for the rest of the tasks work well. This brings the total computational cost of our two-stage pipeline to approximately (15+1)x and (20+1)x, where x represents the time taken to run the baseline model for the given task. This is significantly more efficient than a grid-search-based manually-tuned mixing-ratio setup (which would scale exponentially with the number of tasks).
5 Results
5.1 Baseline Models
Table 1 shows the results of our baselines and previous work (Wang et al., 2018). We can see that our single-task baseline models achieve stronger performance on almost all tasks in comparison to previous work's single-task models. (Note that we do not report previous works which fine-tune large external language models for the task (e.g., OpenAI GPT and BERT), because they are not fairly comparable w.r.t. our models. Similarly, we report the best non-attention-based GLUE models (i.e., BiLSTM+ELMo) for a fair comparison to our non-attention baseline. Our approach should ideally scale to large pretraining/fine-tuning models like BERT, given appropriate compute resources.) Next, we present the performance of our AutoSeM framework on top of these strong baselines.
5.2 MultiTask Models
Table 1 also presents the performance of our AutoSeM framework-based MTL models. As can be seen, our MTL models improve significantly (see Table 3 for standard deviations) upon their corresponding single-task baselines for all tasks, and achieve strong improvements as compared to the fairly-comparable multi-task results of previous work (Wang et al., 2018). (Note that even though the performance improvement gaps of Wang et al. (2018) (MTL vs. baseline) and ours (AutoSeM vs. our improved baseline) are similar, these are inherently two different setups: the Wang et al. (2018) MTL is based on a 'one model for all' setup (Kaiser et al., 2017; McCann et al., 2018), whereas our approach interpretably chooses the 2-3 tasks that are most beneficial for the given primary task. Also see Sec. 4 for a comparison of training speeds for these two setups.) During the task selection stage of our AutoSeM framework, we observe that MultiNLI is chosen as one of the auxiliary tasks in all of our MTL models. This is intuitive given that MultiNLI contains multiple genres covering diverse aspects of the complexity of language (Conneau et al., 2017). Also, we observe that WNLI is sometimes chosen in the task selection stage; however, it is always dropped (mixing ratio of zero) by the Gaussian Process controller, showing that it is not beneficial to use WNLI as an auxiliary task (intuitive, given its small size). Next, we discuss the improvements on each of the primary tasks and the corresponding auxiliary tasks selected by the AutoSeM framework.
RTE: Our AutoSeM approach achieves stronger results w.r.t. the baseline on RTE (58.7 vs. 54.0). During our task selection stage, we found that the QQP and MultiNLI tasks are important for RTE as auxiliary tasks. For the second stage of automatic mixing-ratio learning via the Gaussian Process, the model learns that a mixing ratio of 1:5:5 works best to improve the primary task (RTE) using the related auxiliary tasks QQP and MultiNLI.
MRPC: AutoSeM here performs much better than the baseline on MRPC (78.5/84.5 vs. 75.7/83.7). During our task selection stage, we found out that RTE and MultiNLI tasks are important for MRPC as auxiliary tasks. In the second stage, AutoSeM learned a mixing ratio of 9:1:4 for these three tasks (MRPC:RTE:MultiNLI).
QNLI: Again, we achieve substantial improvements with AutoSeM w.r.t. baseline on QNLI (79.2 vs. 74.0). Our task selection stage learned that WNLI and MultiNLI tasks are best as auxiliary tasks for QNLI. We found that the Gaussian Process further drops WNLI by setting its mixing ratio to zero, and returns 20:0:5 as the best mixing ratio for QNLI:WNLI:MultiNLI.
CoLA: We also observe a strong performance improvement on CoLA with our AutoSeM model w.r.t. our baseline (32.9 vs. 30.8). During our task selection stage, we found out that MultiNLI and WNLI tasks are important for CoLA as auxiliary tasks. In the second stage, GP learns to drop WNLI, and found the mixing ratio of 20:5:0 for CoLA:MultiNLI:WNLI.
SST2: Here also our AutoSeM approach performs better than the baseline (91.8 vs. 91.3). The task selection stage chooses MultiNLI, MRPC, and WNLI as auxiliary tasks and the stage2 Gaussian Process model drops MRPC and WNLI by setting their mixing ratio to zero (learns ratio of 13:5:0:0 for SST2:MultiNLI:MRPC:WNLI).
Table 2: Ablation of our two stages on MRPC.

Name        | Validation | Test
Baseline    | 78.3       | 75.7/83.7
w/o Stage-1 | 80.3       | 76.3/83.8
w/o Stage-2 | 80.3       | 76.7/83.8
Final MTL   | 81.2       | 78.5/84.5
6 Analysis
6.1 Ablation on MTL stages
In this section, we examine the usefulness of each stage of our two-stage MTL pipeline. (We present this ablation only on MRPC for now, because the GP stage-2 takes a lot of time without the task selection stage.)
Removing Stage-1: The purpose of the Beta-Bernoulli MAB in stage-1 is to find useful auxiliary tasks for the given primary task. Here, to understand its importance, we remove the task selection part and instead directly run the Gaussian Process (GP) model on all tasks (see the 'w/o Stage-1' row in Table 2). We can see that by removing the task selection stage, the Gaussian Process model can still outperform the baseline, indicating the usefulness of the GP, but the large mixing-ratio search space prevents the GP from efficiently finding the best mixing-ratio setting.
Removing Stage-2: Given the selected tasks from stage-1, the goal of the Gaussian Process in stage-2 is to efficiently find the approximately-optimal mixing ratio. To examine its usefulness, we replace the Gaussian Process controller with manual tuning over a grid of mixing ratios, where the number of tuning experiments equals the number of steps used in the Gaussian Process model (for a fair comparison). Table 2 shows the results of removing stage-2. We can see that a grid search over hyperparameters can improve upon the baseline, indicating the usefulness of stage-1 task selection, but a reasonably-sized fair-comparison grid search (i.e., not exhaustive over all ratio values) is not able to match our stage-2 GP model, which leverages prior experimental results to more efficiently find the best setting.
6.2 Stability of MTL Models
In this section, we provide the mean and standard deviation of our baseline and multi-task models (over three runs) on the validation set. Note that the test set is hidden, so we cannot do these studies on it. As seen in Table 3, our multi-task models clearly surpass the baseline models beyond the standard-deviation gaps, on all tasks.
6.3 Visualization of Task Selection
In Fig. 3, we show an example of the task-utility estimates from the stage-1 multi-armed bandit controller (Sec. 3.3) on SST-2. The x-axis represents the task utility, and the y-axis represents the probability density over task utility. Each curve represents a task (the blue curve corresponds to the primary task, SST-2, and the rest of the curves correspond to auxiliary tasks), and the width of the bars represents the confidence interval of the estimates. We can see that the bandit controller gives the highest (and most confident) utility estimate to the primary task, which is intuitive given that the primary task should be the most useful task for learning itself. Further, it gives 2-3 tasks moderate utility estimates (the corresponding expected values are around 0.5), and relatively lower utility estimates to the remaining tasks (the corresponding expected values are below 0.5).
6.4 EducatedGuess Baselines
We additionally experimented with 'educated-guess' baseline models, where MTL is performed using manually-intuited task mixtures that seem a priori sensible. (These educated-guess models replace our stage-1 automatic auxiliary-task selection with manual-intuition task mixtures; we still use our stage-2 Gaussian Process for mixing-ratio learning, for a fair comparison.) For example, with MRPC as the primary task, our first educated-guess baseline is to choose other similar paraphrasing-based auxiliary tasks, i.e., QQP in the case of GLUE. This MRPC+QQP model achieves 80.8, whereas our AutoSeM framework chose MRPC+RTE+MultiNLI and achieved 81.2. Furthermore, as our second educated-guess baseline, we added MultiNLI as an auxiliary task (in addition to QQP), since MultiNLI was helpful for all tasks in our MTL experiments. This educated-guess MRPC+QQP+MultiNLI model achieves 80.9 (vs. 81.2 for our AutoSeM model). This suggests that our AutoSeM framework (which automatically chose the seemingly less-related RTE task for MRPC) is equal to or better than manual-intuition-based educated-guess models.
Name                 RTE    MRPC   QNLI   CoLA   SST-2
Baselines
  Mean               58.6   78.3   74.9   74.6   91.4
  Std                0.94   0.31   0.30   0.44   0.36
Multi-Task Models
  Mean               62.0   81.1   76.0   75.7   91.8
  Std                0.62   0.20   0.18   0.18   0.29
7 Conclusion
We presented the AutoSeM framework, a two-stage multi-task learning pipeline in which the first stage automatically selects the auxiliary tasks relevant to the given primary task and the second stage automatically learns their optimal mixing ratio. We showed that AutoSeM performs better than strong baselines on several GLUE tasks. Further, we presented ablations demonstrating the importance of each stage of the AutoSeM framework and discussed the intuition behind the selected auxiliary tasks.
Acknowledgments
We thank the reviewers for their helpful comments. This work was supported by DARPA (YFA17D17AP00022), ONR (N000141812871), Google, Facebook, Baidu, Salesforce, and Nvidia. The views contained in this article are those of the authors and not of the funding agency.
References
 Auer et al. (2002a) Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. 2002a. Finite-time analysis of the multi-armed bandit problem. Machine learning, 47(2-3):235–256.
 Auer et al. (1995) Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. 1995. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In FOCS, page 322. IEEE.
 Auer et al. (2002b) Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. 2002b. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77.
 Ben-David and Schuller (2003) Shai Ben-David and Reba Schuller. 2003. Exploiting task relatedness for multiple task learning. In Learning Theory and Kernel Machines, pages 567–580. Springer.
 Bingel and Søgaard (2017) Joachim Bingel and Anders Søgaard. 2017. Identifying beneficial task relations for multi-task learning in deep neural networks. arXiv preprint arXiv:1702.08303.
 Brochu et al. (2010) Eric Brochu, Vlad M Cora, and Nando De Freitas. 2010. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599.
 Bubeck et al. (2012) Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. 2012. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122.
 Caruana (1997) Rich Caruana. 1997. Multitask learning. Machine learning, 28(1):41–75.
 Caruana (1998) Rich Caruana. 1998. Multitask learning. In Learning to learn, pages 95–133. Springer.
 Chapelle and Li (2011) Olivier Chapelle and Lihong Li. 2011. An empirical evaluation of Thompson sampling. In Advances in neural information processing systems, pages 2249–2257.

 Collobert and Weston (2008) Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pages 160–167. ACM.
 Conneau et al. (2017) Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364.
 Cully et al. (2018) A. Cully, K. Chatzilygeroudis, F. Allocati, and J.-B. Mouret. 2018. Limbo: A Flexible High-performance Library for Gaussian Processes modeling and Data-Efficient Optimization. The Journal of Open Source Software, 3(26):545.

 Dai et al. (2016) Jifeng Dai, Kaiming He, and Jian Sun. 2016. Instance-aware semantic segmentation via multi-task network cascades. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3150–3158.
 Duh et al. (2013) Kevin Duh, Graham Neubig, Katsuhito Sudoh, and Hajime Tsukada. 2013. Adaptation data selection using neural language models: Experiments in machine translation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 678–683.
 Girshick (2015) Ross Girshick. 2015. Fast R-CNN. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448.
 Golovin et al. (2017) Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Karro, and D Sculley. 2017. Google vizier: A service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1487–1495. ACM.
 Graves et al. (2017) Alex Graves, Marc G Bellemare, Jacob Menick, Remi Munos, and Koray Kavukcuoglu. 2017. Automated curriculum learning for neural networks. arXiv preprint arXiv:1704.03003.
 Guo et al. (2018) Han Guo, Ramakanth Pasunuru, and Mohit Bansal. 2018. Dynamic multi-level multi-task learning for sentence simplification. arXiv preprint arXiv:1806.07304.
 Hashimoto et al. (2017) Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsuruoka, and Richard Socher. 2017. A joint many-task model: Growing a neural network for multiple NLP tasks. In EMNLP.
 Hoffman et al. (2011) Matthew D Hoffman, Eric Brochu, and Nando de Freitas. 2011. Portfolio allocation for Bayesian optimization. In UAI, pages 327–336. Citeseer.
 Houthooft et al. (2016) Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. 2016. Vime: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, pages 1109–1117.
 Jaderberg et al. (2016) Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. 2016. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397.

 Kaelbling et al. (1996) Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. 1996. Reinforcement learning: A survey. Journal of artificial intelligence research, 4:237–285.
 Kaiser et al. (2017) Lukasz Kaiser, Aidan N Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, and Jakob Uszkoreit. 2017. One model to learn them all. arXiv preprint arXiv:1706.05137.
 Kendall et al. (2017) Alex Kendall, Yarin Gal, and Roberto Cipolla. 2017. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. arXiv preprint arXiv:1705.07115.
 Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
 Knudde et al. (2017) Nicolas Knudde, Joachim van der Herten, Tom Dhaene, and Ivo Couckuyt. 2017. GPflowOpt: A Bayesian Optimization Library using TensorFlow. arXiv preprint arXiv:1711.03845.
 Luong et al. (2015) Minh-Thang Luong, Quoc V Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2015. Multi-task sequence to sequence learning. arXiv preprint arXiv:1511.06114.
 McCann et al. (2018) Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2018. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730.
 Misra et al. (2016) Ishan Misra, Abhinav Shrivastava, Abhinav Gupta, and Martial Hebert. 2016. Cross-stitch networks for multi-task learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3994–4003.
 Moore and Lewis (2010) Robert C Moore and William Lewis. 2010. Intelligent selection of language model training data. In Proceedings of the ACL 2010 conference short papers, pages 220–224. Association for Computational Linguistics.
 Parisotto et al. (2015) Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. 2015. Actor-mimic: Deep multitask and transfer reinforcement learning. arXiv preprint arXiv:1511.06342.
 Pasunuru and Bansal (2017) Ramakanth Pasunuru and Mohit Bansal. 2017. Multi-task video captioning with video and entailment generation. arXiv preprint arXiv:1704.07489.
 Pasunuru et al. (2017) Ramakanth Pasunuru, Han Guo, and Mohit Bansal. 2017. Towards improving abstractive summarization via entailment generation. In NFiS@EMNLP.
 Peters et al. (2018) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. of NAACL.
 Raj and Kalyani (2017) Vishnu Raj and Sheetal Kalyani. 2017. Taming non-stationary bandits: A Bayesian approach. arXiv preprint arXiv:1707.09727.
 Rasmussen (2004) Carl Edward Rasmussen. 2004. Gaussian processes in machine learning. In Advanced lectures on machine learning, pages 63–71. Springer.
 Ruder et al. (2017a) Sebastian Ruder, Joachim Bingel, Isabelle Augenstein, and Anders Søgaard. 2017a. Learning what to share between loosely related tasks. arXiv preprint arXiv:1705.08142.
 Ruder et al. (2017b) Sebastian Ruder, Joachim Bingel, Isabelle Augenstein, and Anders Søgaard. 2017b. Sluice networks: Learning what to share between loosely related tasks. arXiv preprint arXiv:1705.08142.
 Ruder and Plank (2017) Sebastian Ruder and Barbara Plank. 2017. Learning to select data for transfer learning with Bayesian optimization. arXiv preprint arXiv:1707.05246.
 Russo et al. (2018) Daniel J Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, Zheng Wen, et al. 2018. A tutorial on Thompson sampling. Foundations and Trends® in Machine Learning, 11(1):1–96.
 Schulz et al. (2018) Eric Schulz, Maarten Speekenbrink, and Andreas Krause. 2018. A tutorial on Gaussian process regression: Modelling, exploring, and exploiting functions. Journal of Mathematical Psychology, 85:1–16.
 Shahriari et al. (2016) Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando De Freitas. 2016. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175.
 Sharma and Ravindran (2017) Sahil Sharma and Balaraman Ravindran. 2017. Online multi-task learning using active sampling. In ICLR.
 Snoek et al. (2012) Jasper Snoek, Hugo Larochelle, and Ryan P Adams. 2012. Practical Bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951–2959.
 Søgaard (2011) Anders Søgaard. 2011. Data point selection for cross-language adaptation of dependency parsers. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, pages 682–686. Association for Computational Linguistics.
 Srinivas et al. (2009) Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. 2009. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995.
 Swersky et al. (2013) Kevin Swersky, Jasper Snoek, and Ryan P Adams. 2013. Multi-task Bayesian optimization. In Advances in neural information processing systems, pages 2004–2012.
 Teh et al. (2017) Yee Teh, Victor Bapst, Wojciech M Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell, Nicolas Heess, and Razvan Pascanu. 2017. Distral: Robust multitask reinforcement learning. In Advances in Neural Information Processing Systems, pages 4496–4506.
 Tsvetkov et al. (2016) Yulia Tsvetkov, Manaal Faruqui, Wang Ling, Brian MacWhinney, and Chris Dyer. 2016. Learning the curriculum with Bayesian optimization for task-specific word representation learning. arXiv preprint arXiv:1605.03852.
 Wang et al. (2018) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
 van der Wees et al. (2017) Marlies van der Wees, Arianna Bisazza, and Christof Monz. 2017. Dynamic data selection for neural machine translation. arXiv preprint arXiv:1708.00712.
 Xiao et al. (2018) Liqiang Xiao, Honglun Zhang, and Wenqing Chen. 2018. Gated multi-task network for text classification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 726–731.