AutoSeM: Automatic Task Selection and Mixing in Multi-Task Learning

Multi-task learning (MTL) has achieved success over a wide range of problems, where the goal is to improve the performance of a primary task using a set of relevant auxiliary tasks. However, when the usefulness of the auxiliary tasks w.r.t. the primary task is not known a priori, the success of MTL models depends on the correct choice of these auxiliary tasks and also a balanced mixing ratio of these tasks during alternate training. These two problems could be resolved via manual intuition or hyper-parameter tuning over all combinatorial task choices, but this introduces inductive bias or is not scalable when the number of candidate auxiliary tasks is very large. To address these issues, we present AutoSeM, a two-stage MTL pipeline, where the first stage automatically selects the most useful auxiliary tasks via a Beta-Bernoulli multi-armed bandit with Thompson Sampling, and the second stage learns the training mixing ratio of these selected auxiliary tasks via a Gaussian Process based Bayesian optimization framework. We conduct several MTL experiments on the GLUE language understanding tasks, and show that our AutoSeM framework can successfully find relevant auxiliary tasks and automatically learn their mixing ratio, achieving significant performance boosts on several primary tasks. Finally, we present ablations for each stage of AutoSeM and analyze the learned auxiliary task choices.


1 Introduction

Multi-task Learning (MTL) Caruana (1997) is an inductive transfer mechanism which leverages information from related tasks to improve the primary model’s generalization performance. It achieves this goal by training multiple tasks in parallel while sharing representations, where the training signals from the auxiliary tasks can help improve the performance of the primary task. Multi-task learning has been applied to a wide range of natural language processing problems Luong et al. (2015); Pasunuru and Bansal (2017); Hashimoto et al. (2017); Ruder et al. (2017b); Kaiser et al. (2017); McCann et al. (2018). Despite its impressive performance, the design of a multi-task learning system is non-trivial. In the context of improving the primary task’s performance using knowledge from other auxiliary tasks Luong et al. (2015); Pasunuru and Bansal (2017), two major challenges are selecting the most relevant auxiliary tasks and learning a balanced mixing ratio for synergized training of these tasks. One can achieve this via manual intuition or hyper-parameter tuning over all combinatorial task choices, but this introduces human inductive bias or is not scalable when the number of candidate auxiliary tasks is considerable. To this end, we present AutoSeM, a two-stage Bayesian optimization pipeline for this problem.

In our AutoSeM framework (we make all our code and models publicly available), the first stage addresses automatic task selection from a pool of auxiliary tasks. For this, we use a non-stationary multi-armed bandit controller (MAB) Bubeck et al. (2012); Raj and Kalyani (2017) that dynamically alternates among task choices within the training loop, and eventually returns estimates of the utility of each task w.r.t. the primary task. We model the utility of each task as a Beta distribution, whose expected value can be interpreted as the probability of each task making a non-negative contribution to the training performance of the primary task. Further, we model the observations as Bernoulli variables so that the posterior distribution is also Beta-distributed. We use Thompson sampling Chapelle and Li (2011); Russo et al. (2018) to trade off exploitation and exploration.

The second stage then takes the auxiliary tasks selected in the first stage and automatically learns the training mixing ratio of these tasks, through the framework of Bayesian optimization, by modeling the performance of each mixing ratio as a sample from a Gaussian Process (GP) to sequentially search for the optimal values Rasmussen (2004); Snoek et al. (2012). For the covariance function in the GP, we use the Matern kernel, which is parameterized by a smoothness hyperparameter so as to control the level of differentiability of the samples from the GP. Further, following Hoffman et al. (2011), we use a portfolio of optimistic and improvement-based policies as acquisition functions Shahriari et al. (2016) for selecting the next sample point from the GP search space.

We conduct several experiments on the GLUE natural language understanding benchmark Wang et al. (2018), where we choose each of RTE, MRPC, QNLI, CoLA, and SST-2 as the primary task, and treat the rest of the classification tasks from the GLUE benchmark as candidate auxiliary tasks. Results show that our AutoSeM framework can successfully find useful auxiliary tasks and automatically learn their mixing ratio, achieving significant performance boosts on top of strong baselines for several primary tasks, e.g., 5.2% improvement on QNLI, 4.7% improvement on RTE, and 2.8%/0.8% improvement on MRPC.

We also ablate the usefulness of our two stages of auxiliary task selection and automatic mixing ratio learning. The first ablation removes the task selection stage and instead directly performs the second GP mixing ratio learning stage on all auxiliary tasks. The second ablation performs the task selection stage (with multi-armed bandit) but replaces the second stage Gaussian Process with manual tuning on the selected tasks. Our 2-stage model performs better than both these ablations, showing that both of our stages are crucial. Further, we also discuss the learned auxiliary task choices in terms of their intuitive relevance w.r.t. the corresponding primary task.

2 Related Work

Multi-task learning Caruana (1998), known for improving the generalization performance of a task with auxiliary tasks, has successfully been applied to many domains of machine learning, including natural language processing Collobert and Weston (2008); Girshick (2015); Luong et al. (2015); Pasunuru and Bansal (2017); Pasunuru et al. (2017), computer vision Misra et al. (2016); Kendall et al. (2017); Dai et al. (2016), and reinforcement learning Teh et al. (2017); Parisotto et al. (2015); Jaderberg et al. (2016). Although there are many variants of multi-task learning Ruder et al. (2017b); Hashimoto et al. (2017); Luong et al. (2015); McCann et al. (2018), our goal is to improve the performance of a primary task using a set of relevant auxiliary tasks, where different tasks share some common model parameters with alternating mini-batches optimization, similar to Luong et al. (2015).

To address the problem of automatic shared parameter selection, Ruder et al. (2017a) automatically learned the latent multi-task sharing architecture, and Xiao et al. (2018) used a gate mechanism that filters the feature flows between tasks. On the problem of identifying task relatedness, Ben-David and Schuller (2003) provided a formal framework for task relatedness and derived generalization error bounds for learning of multiple tasks. Bingel and Søgaard (2017) explored task relatedness via exhaustively experimenting with all possible two-task tuples in a non-automated multi-task setup. Other related works explored data selection, where the goal is to select or reorder the examples from one or more domains (usually in a single task) to either improve the training efficiency or enable better transfer learning. These approaches have been applied in machine translation van der Wees et al. (2017), language models Moore and Lewis (2010); Duh et al. (2013), dependency parsing Søgaard (2011), etc. In particular, Ruder and Plank (2017) used Bayesian optimization to select relevant training instances for transfer learning, and Tsvetkov et al. (2016) applied it to learn a curriculum for training word embeddings via reordering data. Graves et al. (2017) used the bandit approach (Exp3.S algorithm) in the context of automated curriculum learning, but in our work, we have two stages with each stage addressing a different problem (automatic task selection and learning of the training mixing ratio). Recently, Sharma et al. (2017) used multi-armed bandits (MAB) to learn the choice of hard vs. easy domain data selection as input feed for the model. Guo et al. (2018) used MAB to effectively switch across tasks in a dynamic multi-task learning setup. In our work, we use MAB with Thompson Sampling for the novel paradigm of automatic auxiliary task selection; and next, we use a Matern-kernel Gaussian Process to automatically learn an exact (static) mixing ratio (i.e., relatedness ratio) for the small number of selected tasks.

Many control problems can be cast as a multi-armed bandit problem, where the goal of the agent is to select, from a set of choices, the arm/action that minimizes the regret Bubeck et al. (2012). One problem in bandit learning is the trade-off between exploration and exploitation, where the agent needs to decide between taking the action that yields the best payoff under current estimates and exploring new actions whose payoffs are not yet certain. Many previous works have explored various exploration and exploitation strategies to minimize regret, including Boltzmann exploration Kaelbling et al. (1996), adversarial bandits Auer et al. (2002b), UCB Auer et al. (2002a), and information gain using variational approaches Houthooft et al. (2016). In this work, for task selection, we use Thompson Sampling (Russo et al., 2018; Chapelle and Li, 2011), an algorithm for sequential decision making problems, which addresses a broad range of problems in a computationally efficient manner and is therefore enjoying wide use.

A Gaussian Process (GP) is a non-parametric Bayesian approach that can capture a wide variety of underlying functions or relations between inputs and outputs by taking advantage of the full information provided by the history of observations, and is thus very data-efficient Rasmussen (2004); Shahriari et al. (2016); Schulz et al. (2018). Gaussian Processes have been widely used for black-box optimization and hyper-parameter optimization Snoek et al. (2012); Brochu et al. (2010); Knudde et al. (2017); Cully et al. (2018); Swersky et al. (2013); Golovin et al. (2017). In our work, we use a Gaussian Process in stage-2 to automatically learn the multi-task mixing ratio among the tasks selected in stage-1.

3 Models

We will first introduce our baseline model and its integration for multiple classification tasks in a multi-task learning (MTL) setup. Next, we will introduce our AutoSeM framework, an automatic way of selecting auxiliary tasks and learning their optimal training mixing ratio w.r.t. the primary task, via a Beta-Bernoulli bandit with Thompson Sampling and a Gaussian Process framework.

3.1 Bi-Text Classification Model

Let $s_1$ and $s_2$ be the input sentence pair in our classification task, where we encode these sentences via a bidirectional LSTM-RNN, similar to that of Conneau et al. (2017). Next, we do max-pooling on the output hidden states of both encoders, where $u$ and $v$ are the outputs from the max-pooling layer for $s_1$ and $s_2$ respectively. Later, we map these two representations ($u$ and $v$) into a single rich dense representation vector $h$:

$$h = [u; v; u \star v; |u - v|]$$

where $[;]$ represents the concatenation and $u \star v$ represents the element-wise multiplication of $u$ and $v$. We project this final representation $h$ to label space to classify the given sentence pair (see Fig. 1). We also use ELMo Peters et al. (2018) representations for word embeddings in our model. For this, we extract the three ELMo layer representations for each sentence of the pair and use their weighted sum as the ELMo output representation, where the weights are trainable.
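The pooling and feature-combination step can be sketched in plain Python. This is only a minimal illustration, assuming an InferSent-style combination (concatenation, element-wise product, and absolute difference) over already-computed encoder hidden states; the function names `max_pool` and `combine` are hypothetical and are not taken from the authors' code.

```python
from typing import List, Sequence


def max_pool(hidden_states: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise max over time steps (rows) of encoder hidden states."""
    return [max(column) for column in zip(*hidden_states)]


def combine(u: Sequence[float], v: Sequence[float]) -> List[float]:
    """Build h = [u; v; u * v; |u - v|] from two pooled sentence vectors."""
    product = [a * b for a, b in zip(u, v)]
    abs_diff = [abs(a - b) for a, b in zip(u, v)]
    return list(u) + list(v) + product + abs_diff


# Toy hidden states: 2 time steps for sentence 1, 1 time step for sentence 2.
u = max_pool([[1.0, -2.0], [0.5, 3.0]])
v = max_pool([[2.0, 0.0]])
h = combine(u, v)  # length is 4x the hidden size
```

In a real model the resulting vector `h` would feed a task-specific projection layer; that layer is omitted here.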

Figure 1: Overview of our baseline model, where we use different projection layers for each task during MTL while sharing the rest of the model parameters.
Figure 2: Overview of our AutoSeM framework. Left: the multi-armed bandit controller used for task selection, where each arm represents a candidate auxiliary task. The agent iteratively pulls an arm, observes a reward, updates its estimates of the arm parameters, and samples the next arm. Right: the Gaussian Process controller used for automatic mixing ratio (MR) learning. The GP controller sequentially makes a choice of mixing ratio, observes a reward, updates its estimates, and selects the next mixing ratio to try, based on the full history of past observations.

3.2 Multi-Task Learning

In this work, we focus on improving a task (primary task) by allowing it to share parameters with related auxiliary tasks via multi-task learning (MTL). Let $\{D_1, \dots, D_N\}$ be a set of $N$ tasks, where we set $D_1$ to be the primary task and the rest of them as auxiliary tasks. We can extend our single-task learning baseline (see Sec. 3.1) into a multi-task learning model by augmenting the model with $N$ projection layers while sharing the rest of the model parameters across these $N$ tasks (see Fig. 1). We employ MTL training of these tasks in alternate mini-batches based on a mixing ratio $\eta_1 : \eta_2 : \dots : \eta_N$, similar to previous work Luong et al. (2015), where we optimize $\eta_i$ mini-batches of task $i$ and go to the next task.
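Alternate mini-batch training under a mixing ratio can be sketched as a simple task scheduler. This is an illustrative sketch, not the authors' implementation: the function name `alternate_schedule` is hypothetical, and the assumption that a task with ratio zero is simply skipped is ours.

```python
from itertools import cycle, islice
from typing import Dict, Iterator


def alternate_schedule(mixing_ratio: Dict[str, int]) -> Iterator[str]:
    """Yield task names forever: ratio-many consecutive mini-batches of one
    task, then move on to the next task, cycling through all tasks."""
    # Assumption: a task whose ratio is zero is dropped from the cycle.
    active = [(task, count) for task, count in mixing_ratio.items() if count > 0]
    if not active:
        raise ValueError("at least one task needs a positive mixing ratio")
    for task, count in cycle(active):
        for _ in range(count):
            yield task


# One full cycle for a 9:1:4 ratio spans 9 + 1 + 4 = 14 mini-batches.
first_cycle = list(islice(alternate_schedule({"MRPC": 9, "RTE": 1, "MultiNLI": 4}), 14))
```

Each yielded name would select which task's mini-batch (and projection layer) is used for the next optimization step.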

In MTL, choosing the appropriate auxiliary tasks and properly tuning the mixing ratio can be important for the performance of multi-task models. The naive way of trying all combinations of task selections is hardly tractable. To solve this issue, we propose AutoSeM, a two-stage pipeline in the next section. In the first stage, we automatically find the relevant auxiliary tasks (out of the given options) which improve the performance of the primary task. After finding the relevant auxiliary tasks, in the second stage, we take these selected tasks along with the primary task and automatically learn their training mixing ratio.

3.3 Automatic Task Selection: Multi-Armed Bandit with Thompson Sampling

Tuning the mixing ratio for tasks in MTL becomes exponentially harder as the number of auxiliary tasks grows very large. However, in most circumstances, only a small number of these auxiliary tasks are useful for improving the primary task at hand. Manually searching for this optimal choice of relevant tasks is intractable. Hence, in this work, we present a method for automatic task selection via multi-armed bandits with Thompson Sampling (see the left side of Fig. 2).

Let $\{a_1, \dots, a_N\}$ represent the set of $N$ arms (corresponding to the set of tasks $\{D_1, \dots, D_N\}$) of the bandit controller in our multi-task setting, where the controller selects a sequence of actions/arms over the current training trajectory to maximize the expected future payoff. At each round $t_b$, the controller selects an arm based on the noisy value estimates and observes a reward $r_{t_b}$ for the selected arm. Let $\theta_k \in [0, 1]$ be the utility (usefulness) of task $k$. Initially, the agent begins with an independent prior belief over $\theta_k$. We take these priors to be Beta-distributed with parameters $\alpha_k$ and $\beta_k$, and the prior probability density function of $\theta_k$ is

$$p(\theta_k) = \frac{\Gamma(\alpha_k + \beta_k)}{\Gamma(\alpha_k)\,\Gamma(\beta_k)}\, \theta_k^{\alpha_k - 1} (1 - \theta_k)^{\beta_k - 1}$$

where $\Gamma$ denotes the gamma function. We formulate the reward $r_{t_b} \in \{0, 1\}$ at round $t_b$ as a Bernoulli variable, where an action $a_k$ produces a reward of $1$ with a chance of $\theta_k$ and a reward of $0$ with a chance of $1 - \theta_k$. The true utility of task $k$, i.e., $\theta_k$, is unknown, and may or may not change over time (depending on whether the task utility is stationary or non-stationary). We define the reward as whether sampling the task $a_k$ improves (or maintains) the validation metric of the primary task,

$$r_{t_b} = \begin{cases} 1, & \text{if } R_{t_b} \geq R_{t_b - 1} \\ 0, & \text{otherwise} \end{cases}$$

where $R_{t_b}$ represents the validation performance of the primary task at time $t_b$. With our reward setup above, the utility of each task ($\theta_k$) can be intuitively interpreted as the probability that multi-task learning with task $k$ can improve (or maintain) the performance of the primary task. The conjugacy properties of the Beta distribution assert that the posterior distribution is also Beta with parameters that can be updated using a simple Bayes rule, which is defined as follows Russo et al. (2018),

$$(\alpha_k, \beta_k) = \begin{cases} (\alpha_k, \beta_k), & \text{if } x^s_{t_b} \neq k \\ (\alpha_k, \beta_k) + (r_{t_b}, 1 - r_{t_b}), & \text{if } x^s_{t_b} = k \end{cases}$$

where $x^s_{t_b}$ is the sampled task at round $t_b$. Finally, at the end of the training, we calculate the expected value of each arm as follows:

$$\mathbb{E}_p[\theta_k] = \frac{\alpha_k}{\alpha_k + \beta_k}$$

Here, the expectation measures the probability of improving (or maintaining) the primary task by sampling this task. To decide the next action to take, we apply Thompson Sampling (Russo et al., 2018; Chapelle and Li, 2011) to trade off exploitation (maximizing immediate performance) and exploration (investing to accumulate new information that might improve performance in the future). In Thompson Sampling Russo et al. (2018), instead of taking the action $k$ that maximizes the expectation (i.e., $\arg\max_k \mathbb{E}_p[\theta_k]$), we randomly sample the primary task improvement probability $\hat{\theta}_k$ from the posterior distribution, $\hat{\theta}_k \sim p(\theta_k)$, and take the action $k$ that maximizes the sampled primary task improvement probability, i.e., $\arg\max_k \hat{\theta}_k$. At the end of the training, the task selection can proceed either via a threshold on the expectation or by taking the top-$K$ tasks, and stage-2 is then run using the selected task subset as auxiliary tasks (details in Sec. 3.4).

Stronger Prior for Primary Task

Note that at the beginning of training, model performance is usually guaranteed to improve from the initial random choices. This causes issues in updating arm values because less useful tasks will be given high arm values when they happen to be sampled at the beginning. To resolve this issue, we initially set a slightly stronger prior/arm-value in favor of the arm corresponding to the primary task. Intuitively, the bandit will then sample the primary model more often at the beginning, and then start exploring auxiliary tasks when the primary model’s performance stabilizes (as the arm value of the primary model will start decreasing because sampling it in later rounds produces smaller additional improvements).

Non-Stationary Multi-Armed Bandit

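Also note that the intrinsic usefulness of each task varies throughout the training (e.g., the primary task might be more important at the beginning, but not necessarily at the end), and thus the agent faces a non-stationary system. In such cases, the agent should always be encouraged to explore in order to track changes as the system drifts. One simple approach to inject non-stationarity is to discount the relevance of previous observations. Thus we introduce a tunable decay ratio $\gamma$, and modify the Beta posterior update rule of Sec. 3.3 as follows:

$$(\alpha_k, \beta_k) = \begin{cases} (\hat{\alpha}_k, \hat{\beta}_k), & \text{if } k \neq x^s_{t_b} \\ (\hat{\alpha}_k, \hat{\beta}_k) + (r_{t_b}, 1 - r_{t_b}), & \text{if } k = x^s_{t_b} \end{cases}$$

where $\hat{\alpha}_k = (1 - \gamma)\alpha_k + \gamma\alpha_0$ and $\hat{\beta}_k = (1 - \gamma)\beta_k + \gamma\beta_0$, and $\gamma$ controls how quickly uncertainty is injected into the system ($\alpha_0, \beta_0$ are parameters of the prior). Algorithm 1 presents the Thompson Sampling algorithm with a Beta-Bernoulli MAB.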

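for $t_b = 1, 2, \dots$ do
     # sample model:
     for $k = 1, \dots, N$ do
          Sample $\hat{\theta}_k \sim \mathrm{Beta}(\alpha_k, \beta_k)$
     end for
     # select and apply action:
     $x^s_{t_b} \leftarrow \arg\max_k \hat{\theta}_k$
     Apply $x^s_{t_b}$ and observe $r_{t_b}$
     # non-stationarity
     for $k = 1, \dots, N$ do
          $\hat{\alpha}_k = (1 - \gamma)\alpha_k + \gamma\alpha_0$
          $\hat{\beta}_k = (1 - \gamma)\beta_k + \gamma\beta_0$
          if $k \neq x^s_{t_b}$ then
               $(\alpha_k, \beta_k) \leftarrow (\hat{\alpha}_k, \hat{\beta}_k)$
          else
               $(\alpha_k, \beta_k) \leftarrow (\hat{\alpha}_k, \hat{\beta}_k) + (r_{t_b}, 1 - r_{t_b})$
          end if
     end for
end for
Algorithm 1: Thompson Sampling with a non-stationary Beta-Bernoulli MAB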
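The stage-1 controller can be sketched in a few lines of Python using only the standard library. This is a hedged illustration rather than the authors' code: the function name `thompson_sampling`, the callback `observe_reward`, the decay value `gamma=0.1`, and the concrete prior values (including the stronger prior on arm 0, taken here to be the primary task) are all assumptions made for the example.

```python
import random
from typing import Callable, List, Tuple


def thompson_sampling(
    num_tasks: int,
    observe_reward: Callable[[int], int],
    rounds: int = 200,
    gamma: float = 0.1,                              # assumed decay ratio
    prior: Tuple[float, float] = (1.0, 1.0),         # assumed (alpha_0, beta_0)
    primary_prior: Tuple[float, float] = (2.0, 1.0), # assumed stronger prior
    seed: int = 0,
) -> List[float]:
    """Non-stationary Beta-Bernoulli bandit; returns E[theta_k] for each arm."""
    rng = random.Random(seed)
    alpha0, beta0 = prior
    # Arm 0 plays the role of the primary task and starts slightly favored.
    alpha = [primary_prior[0]] + [alpha0] * (num_tasks - 1)
    beta = [primary_prior[1]] + [beta0] * (num_tasks - 1)
    for _ in range(rounds):
        # Sample one utility per arm from its current Beta posterior.
        samples = [rng.betavariate(a, b) for a, b in zip(alpha, beta)]
        chosen = max(range(num_tasks), key=samples.__getitem__)
        # Reward is 1 if the primary validation metric did not drop, else 0.
        reward = observe_reward(chosen)
        for k in range(num_tasks):
            # Decay every arm toward the prior so the controller keeps exploring.
            alpha[k] = (1 - gamma) * alpha[k] + gamma * alpha0
            beta[k] = (1 - gamma) * beta[k] + gamma * beta0
            if k == chosen:
                alpha[k] += reward
                beta[k] += 1 - reward
    return [a / (a + b) for a, b in zip(alpha, beta)]


# Toy environment: only arm 1 ever helps the primary task.
utilities = thompson_sampling(3, lambda arm: 1 if arm == 1 else 0)
```

In actual training, `observe_reward` would run a few mini-batches of the chosen task and compare the primary task's validation score before and after; the toy lambda above only stands in for that step.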

3.4 Automatic Mixing Ratio Learning via Gaussian Process

The right side of Fig. 2 illustrates our Gaussian Process controller for automatic learning of the MTL training mixing ratio (see definition in Sec. 3.2). Given the selected auxiliary tasks from the previous section, the next step is to find a proper mixing ratio of training these selected tasks along with the primary task. (Note that ideally the Gaussian Process can also learn to set the mixing ratio of less important tasks to zero, hence allowing it to essentially also perform the task selection step. However, in practice, first applying our task selection Thompson-Sampling model (Sec. 3.3) allows the GP to more efficiently search the mixing ratio space for the small number of filtered auxiliary tasks, as shown in the results of Sec. 6.1.) Manual tuning of this mixing ratio via a large grid search over the hyperparameter values is very expensive in time and compute (even when the number of selected auxiliary tasks is small, e.g., 2 or 3). Thus, in our second stage, we instead apply a non-parametric Bayesian approach to search for the approximately-optimal mixing ratio. In particular, we use a ‘Gaussian Process’ to sequentially search for the mixing ratio by trading off exploitation and exploration automatically. Next, we describe our Gaussian Process approach in detail.

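A Gaussian Process Rasmussen (2004); Snoek et al. (2012); Shahriari et al. (2016), $GP(\mu_0, k)$, is a non-parametric model that is fully characterized by a mean function $\mu_0: \mathcal{X} \to \mathbb{R}$ and a positive-definite kernel or covariance function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. Let $x_1, x_2, \dots, x_n$ denote any finite collection of $n$ points, where each $x_i$ represents a choice of the mixing ratio (i.e., the ratio $\eta_1 : \eta_2 : \dots : \eta_N$ described in Sec. 3.2), and $f_i = f(x_i)$ is the (unknown) function value evaluated at $x_i$ (the true performance of the model given the selected mixing ratio). Let $y_1, y_2, \dots, y_n$ be the corresponding noisy observations (the validation performance at the end of training). In the context of GP Regression (GPR), $\mathbf{f} = \{f_1, \dots, f_n\}$ are assumed to be jointly Gaussian Rasmussen (2004), i.e., $\mathbf{f} \mid X \sim \mathcal{N}(\mathbf{m}, K)$, where $m_i = \mu_0(x_i)$ is the mean vector and $K_{i,j} = k(x_i, x_j)$ is the covariance matrix. Then the noisy observations $\mathbf{y} = \{y_1, \dots, y_n\}$ are normally distributed around $\mathbf{f}$ as follows: $\mathbf{y} \mid \mathbf{f} \sim \mathcal{N}(\mathbf{f}, \sigma^2 I)$.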

Let $D = \{(x_1, y_1), \dots, (x_{n_0}, y_{n_0})\}$ be the set of $n_0$ random initial observations, where $x_i$ represents a mixing ratio and $y_i$ represents the corresponding model’s validation performance. We first model the GP based on these initial observations as described above. We then sample a next point $x_{n_0+1}$ (a mixing ratio in our case) from this GP, obtain its corresponding model performance $y_{n_0+1}$, and update the GP again by now considering the $n_0 + 1$ points Rasmussen (2004). We continue this process for a fixed number of steps. Next, we will discuss how we perform the sampling (based on acquisition functions) and the kernels used for calculating the covariance.
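For reference, the update step relies on the standard GP regression posterior. As a sketch in generic notation (the symbols here are the usual textbook ones and are assumed rather than taken from the source): with prior mean $\mu_0$, kernel $k$, observation noise variance $\sigma^2$, observed mixing ratios $x_1, \dots, x_n$ with validation scores $\mathbf{y}$, kernel matrix $K_{i,j} = k(x_i, x_j)$, mean vector $m_i = \mu_0(x_i)$, and $\mathbf{k}(x) = [k(x, x_1), \dots, k(x, x_n)]^\top$, the predictive mean and variance at a candidate mixing ratio $x$ are:

```latex
\begin{aligned}
\mu_n(x) &= \mu_0(x) + \mathbf{k}(x)^\top \left(K + \sigma^2 I\right)^{-1} (\mathbf{y} - \mathbf{m}), \\
\sigma_n^2(x) &= k(x, x) - \mathbf{k}(x)^\top \left(K + \sigma^2 I\right)^{-1} \mathbf{k}(x).
\end{aligned}
```

The mean $\mu_n(x)$ and standard deviation $\sigma_n(x)$ are the two quantities an acquisition function needs in order to score a candidate mixing ratio.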

Models | RTE | MRPC | QNLI | CoLA | SST-2
BiLSTM+ELMo (Single-Task) Wang et al. (2018) | 50.1 | 69.0/80.8 | 69.4 | 35.0 | 90.2
BiLSTM+ELMo (Multi-Task) Wang et al. (2018) | 55.7 | 76.2/83.5 | 66.7 | 27.5 | 89.6
Our Baseline | 54.0 | 75.7/83.7 | 74.0 | 30.8 | 91.3
Our AutoSeM | 58.7 | 78.5/84.5 | 79.2 | 32.9 | 91.8

Table 1: Test GLUE results of previous work, our baseline, and our AutoSeM MTL framework. We report accuracy and F1 for MRPC, Matthews correlation for CoLA, and accuracy for all others.

Acquisition Functions

Here, we describe the acquisition functions for deciding where to sample next. While one could select the points that maximize the mean function, this does not always lead to the best outcome Hoffman et al. (2011). Since we also have the variance of the estimates along with the mean value of each point $x_i$, we can incorporate this information into the optimization. In this work, we use the GP-Hedge approach Hoffman et al. (2011); Auer et al. (1995), which probabilistically chooses one of three acquisition functions: probability of improvement, expected improvement, and upper confidence bound. Probability of improvement acquisition functions measure the probability that the sampled mixing ratio $x_i$ leads to an improvement upon the best observed value so far ($\tau$), $P(f(x_i) > \tau)$. Expected improvement additionally incorporates the amount of improvement, $\mathbb{E}[(f(x_i) - \tau)\,\mathbb{I}(f(x_i) > \tau)]$. The Gaussian Process upper confidence bound (GP-UCB) algorithm measures the optimistic performance upper bound of the sampled mixing ratio Srinivas et al. (2009), $\mu_i(x_i) + \lambda\sigma_i(x_i)$, for some hyper-parameter $\lambda$.
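Under a Gaussian posterior, the three acquisition functions have simple closed forms. The sketch below assumes a maximization setting and takes the posterior mean and standard deviation at a candidate point as plain numbers; the function names, the default `lam=2.0`, and the softmax-over-gains selection rule in `hedge_probabilities` (our reading of the Hedge-style portfolio, with a hypothetical learning rate `eta`) are illustrative assumptions, not the Scikit-Optimize API.

```python
import math
from typing import List, Sequence


def _normal_pdf(z: float) -> float:
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def _normal_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probability_of_improvement(mean: float, std: float, best: float) -> float:
    """P(f(x) > best) when f(x) ~ N(mean, std^2)."""
    if std <= 0.0:
        return 1.0 if mean > best else 0.0
    return _normal_cdf((mean - best) / std)


def expected_improvement(mean: float, std: float, best: float) -> float:
    """E[max(f(x) - best, 0)] when f(x) ~ N(mean, std^2)."""
    if std <= 0.0:
        return max(mean - best, 0.0)
    z = (mean - best) / std
    return (mean - best) * _normal_cdf(z) + std * _normal_pdf(z)


def upper_confidence_bound(mean: float, std: float, lam: float = 2.0) -> float:
    """Optimistic estimate: mean plus lam standard deviations."""
    return mean + lam * std


def hedge_probabilities(gains: Sequence[float], eta: float = 1.0) -> List[float]:
    """Softmax over cumulative gains: chance of trusting each acquisition."""
    top = max(gains)
    weights = [math.exp(eta * (g - top)) for g in gains]
    total = sum(weights)
    return [w / total for w in weights]
```

A portfolio step would let each acquisition nominate its own best candidate, pick one nominee according to `hedge_probabilities`, and then credit each acquisition's gain with the posterior mean at its nominee.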

Matern Kernel

The covariance function (or kernel) defines the nearness or similarity of two points in the Gaussian Process. Here, we use the automatic relevance determination (ARD) Matern kernel Rasmussen (2004), which is parameterized by $\nu > 0$ that controls the level of smoothness. In particular, samples from a GP with such a kernel are differentiable $\lfloor \nu - 1 \rfloor$ times. When $\nu$ is half-integer (i.e., $\nu = p + 1/2$ for non-negative integer $p$), the covariance function is a product of an exponential and a polynomial of order $p$. In the context of machine learning, usual choices of $\nu$ include $3/2$ and $5/2$ Shahriari et al. (2016).
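The half-integer cases have well-known closed forms, which makes the "exponential times polynomial" structure concrete. The sketch below is a standalone illustration with unit signal variance; the function name `ard_matern` and its signature are hypothetical, and only the three smoothness values 1/2, 3/2, and 5/2 are covered.

```python
import math
from typing import Sequence


def ard_matern(
    x: Sequence[float],
    y: Sequence[float],
    length_scales: Sequence[float],
    nu: float = 2.5,
) -> float:
    """ARD Matern kernel with unit variance for nu in {0.5, 1.5, 2.5}."""
    # ARD: every input dimension is rescaled by its own length scale.
    d = math.sqrt(sum(((a - b) / l) ** 2 for a, b, l in zip(x, y, length_scales)))
    if nu == 0.5:   # polynomial of order 0
        return math.exp(-d)
    if nu == 1.5:   # polynomial of order 1
        s = math.sqrt(3.0) * d
        return (1.0 + s) * math.exp(-s)
    if nu == 2.5:   # polynomial of order 2
        s = math.sqrt(5.0) * d
        return (1.0 + s + s * s / 3.0) * math.exp(-s)
    raise ValueError("only nu in {0.5, 1.5, 2.5} is implemented in this sketch")


# Similarity between two hypothetical (normalized) mixing-ratio vectors.
similarity = ard_matern([0.6, 0.1, 0.3], [0.5, 0.2, 0.3], [1.0, 1.0, 1.0])
```

Larger values of `nu` give smoother sample functions, while a small length scale in one dimension makes the kernel more sensitive to changes in that task's share of the mixing ratio.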

4 Experiment Setup

Datasets: We evaluate our models on several datasets from the GLUE benchmark Wang et al. (2018): RTE, QNLI, MRPC, SST-2, and CoLA. For all these datasets, we use the standard splits provided by Wang et al. (2018). For dataset details, we refer the reader to the GLUE paper. (We did not include the remaining tasks as primary tasks, because STS-B is a regression task; MNLI is a very large dataset and does not benefit much from MTL with other tasks in the GLUE benchmark; and QQP and WNLI have dev/test discrepancies and adversarial label issues as per the GLUE website’s FAQ.)

Training Details: We use pre-trained ELMo to obtain sentence representations as inputs to our model Peters et al. (2018). The Gaussian Process implementation is based on Scikit-Optimize, and we adopt most of the default configurations. We use accuracy as the validation criterion for all tasks. For all of our experiments except QNLI and SST-2, we apply early stopping on the validation performance plateau. (In our initial experiments, we found early stopping on larger datasets led to sub-optimal performance, and hence we used a pre-specified maximum number of steps instead.) The set of candidate auxiliary tasks consists of all two-sentence classification tasks when the primary task is a two-sentence classification task, whereas it consists of all two-sentence and single-sentence classification tasks when the primary task is a single-sentence classification task. (We made this design decision because there are only two single-sentence tasks in GLUE, so we mix them with two-sentence tasks to allow more auxiliary choices.) Since the utility estimates from the multi-armed bandit controller are noisy, we choose the top two tasks based on expected task utility estimates, and include additional tasks if their utility estimate is above 0.5. All the results reported are the aggregate of the same experiment with two runs (with different random seeds) unless explicitly mentioned. (We use the average of validation results across runs as the tuning criterion, and use the ensemble of models across runs for reporting the test results.) We use a two-layer LSTM-RNN with hidden size of 1024 for RTE and 512 for the rest of the models, and use the Adam optimizer Kingma and Ba (2014). The prior parameters of each task in stage-1, $\alpha_0$ and $\beta_0$, are set to values that are commonly used in other literature.

For stage-1, the bandit controller iteratively selects batches of data from different tasks during training to learn the approximate importance of each auxiliary task Graves et al. (2017). In stage-2 (Gaussian Process), we sequentially draw samples of mixing ratios and evaluate each sample after full training Snoek et al. (2012). Without much tuning, we used approximately 200 rounds for the stage-1 bandit-based approach, where each round consists of approximately 10 mini-batches of optimization. For stage-2, we experimented with 15 and 20 as the number of samples to draw and found that 15 samples for MRPC and 20 samples for the rest of the tasks work well. This brings the total computational cost for our two-stage pipeline to approximately (15+1)x and (20+1)x, respectively, where x represents the time taken to run the baseline model for the given task. This is significantly more efficient than a grid-search based manually-tuned mixing ratio setup (which would scale exponentially with the number of tasks).
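The selection rule and the cost estimate described above can be written down directly. This is a small sketch with hypothetical helper names (`select_auxiliary_tasks`, `pipeline_cost`) and made-up utility values; reading the "+1" as one baseline-sized stage-1 run is our interpretation of the stated cost, not something the text spells out.

```python
from typing import Dict, List


def select_auxiliary_tasks(
    utilities: Dict[str, float], top_k: int = 2, threshold: float = 0.5
) -> List[str]:
    """Keep the top_k tasks by expected utility, plus any other task
    whose expected utility exceeds the threshold."""
    ranked = sorted(utilities, key=utilities.get, reverse=True)
    selected = ranked[:top_k]
    selected += [task for task in ranked[top_k:] if utilities[task] > threshold]
    return selected


def pipeline_cost(num_gp_samples: int, baseline_time: float = 1.0) -> float:
    """(samples + 1) * x: one full training per GP sample, plus one
    baseline-sized run assumed here to account for stage-1."""
    return (num_gp_samples + 1) * baseline_time


# Hypothetical stage-1 utility estimates for some primary task.
estimates = {"MultiNLI": 0.80, "MRPC": 0.60, "WNLI": 0.55, "QQP": 0.30}
chosen = select_auxiliary_tasks(estimates)
```

With these made-up numbers, three tasks pass the rule (the top two plus one more above 0.5), and `pipeline_cost(15)` and `pipeline_cost(20)` give the 16x and 21x figures quoted above.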

5 Results

5.1 Baseline Models

Table 1 shows the results of our baseline and previous works Wang et al. (2018). We can see that our single-task baseline models achieve stronger performance on almost all tasks in comparison to previous work’s single-task models. (Note that we do not report previous works which fine-tune large external language models for the task (e.g., OpenAI-GPT and BERT), because they are not fairly comparable w.r.t. our models. Similarly, we report the non-attention based best GLUE models (i.e., BiLSTM+ELMo) for a fair comparison to our non-attention baseline. Our approach should ideally scale to large pre-training/fine-tuning models like BERT, given appropriate compute resources.) Next, we present the performance of our AutoSeM framework on top of these strong baselines.

5.2 Multi-Task Models

Table 1 also presents the performance of our AutoSeM framework-based MTL models. As can be seen, our MTL models improve significantly (see Table 3 for standard deviations) upon their corresponding single-task baselines for all tasks, and achieve strong improvements as compared to the fairly comparable (see Sec. 5.1) multi-task results of previous work Wang et al. (2018). (Note that even though the performance improvement gaps of Wang et al. (2018) (MTL vs. baseline) and our improvements (AutoSeM vs. our improved baseline) are similar, these are inherently two different setups. The MTL of Wang et al. (2018) is based on a ‘one model for all’ setup Kaiser et al. (2017); McCann et al. (2018), whereas our approach interpretably chooses the 2-3 tasks that are most beneficial for the given primary task. Also see Sec. 4 for comparison of training speeds for these two setups.) During the task selection stage of our AutoSeM framework, we observe that MultiNLI is chosen as one of the auxiliary tasks in all of our MTL models. This is intuitive given that MultiNLI contains multiple genres covering diverse aspects of the complexity of language Conneau et al. (2017). Also, we observe that WNLI is sometimes chosen in the task selection stage; however, it is always dropped (mixing ratio of zero) by the Gaussian Process controller, showing that it is not beneficial to use WNLI as an auxiliary task (intuitive, given its small size). Next, we discuss the improvements on each of the primary tasks and the corresponding auxiliary tasks selected by the AutoSeM framework.

RTE: Our AutoSeM approach achieves stronger results w.r.t. the baseline on RTE (58.7 vs. 54.0). During our task selection stage, we found out that QQP and MultiNLI tasks are important for RTE as auxiliary tasks. For the second stage of automatic mixing ratio learning via Gaussian Process, the model learns that a mixing ratio of 1:5:5 works best to improve the primary task (RTE) using related auxiliary tasks of QQP and MultiNLI.

MRPC: AutoSeM here performs much better than the baseline on MRPC (78.5/84.5 vs. 75.7/83.7). During our task selection stage, we found out that RTE and MultiNLI tasks are important for MRPC as auxiliary tasks. In the second stage, AutoSeM learned a mixing ratio of 9:1:4 for these three tasks (MRPC:RTE:MultiNLI).

QNLI: Again, we achieve substantial improvements with AutoSeM w.r.t. baseline on QNLI (79.2 vs. 74.0). Our task selection stage learned that WNLI and MultiNLI tasks are best as auxiliary tasks for QNLI. We found that the Gaussian Process further drops WNLI by setting its mixing ratio to zero, and returns 20:0:5 as the best mixing ratio for QNLI:WNLI:MultiNLI.

CoLA: We also observe a strong performance improvement on CoLA with our AutoSeM model w.r.t. our baseline (32.9 vs. 30.8). During our task selection stage, we found out that MultiNLI and WNLI tasks are important for CoLA as auxiliary tasks. In the second stage, GP learns to drop WNLI, and found the mixing ratio of 20:5:0 for CoLA:MultiNLI:WNLI.

SST-2: Here also our AutoSeM approach performs better than the baseline (91.8 vs. 91.3). The task selection stage chooses MultiNLI, MRPC, and WNLI as auxiliary tasks and the stage-2 Gaussian Process model drops MRPC and WNLI by setting their mixing ratio to zero (learns ratio of 13:5:0:0 for SST-2:MultiNLI:MRPC:WNLI).

Name          Validation   Test
Baseline      78.3         75.7/83.7
w/o Stage-1   80.3         76.3/83.8
w/o Stage-2   80.3         76.7/83.8
Final MTL     81.2         78.5/84.5
Table 2: Ablation results on the two stages of our AutoSeM framework on MRPC.

6 Analysis

6.1 Ablation on MTL stages

In this section, we examine the usefulness of each stage of our two-stage MTL pipeline. [Footnote 11: We present this ablation only on MRPC for now, because GP stage-2 takes a lot of time without the task selection stage.]

Removing Stage-1: The purpose of the Beta-Bernoulli MAB in stage-1 is to find useful auxiliary tasks for the given primary task. To understand its importance, we remove the task selection part and instead run the Gaussian Process (GP) model directly on all tasks (see ‘w/o Stage-1’ row in Table 2). Without the task selection stage, the GP model still outperforms the baseline, indicating the usefulness of the GP, but the much larger mixing-ratio search space prevents the GP from efficiently finding the best mixing ratio setting.
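
To give a rough sense of why the search space matters here, the sketch below counts candidate mixing-ratio settings when each task's ratio is drawn from a fixed set of discrete values. The numbers are assumptions for illustration only: ten candidate values per task is hypothetical, nine tasks corresponds to the size of the GLUE benchmark, and three tasks corresponds to a primary task plus two selected auxiliary tasks.

```python
def num_ratio_settings(num_tasks, values_per_task):
    """Count ratio settings when every task picks one of values_per_task values."""
    return values_per_task ** num_tasks


VALUES_PER_TASK = 10  # hypothetical discretization of each task's ratio

all_tasks = num_ratio_settings(9, VALUES_PER_TASK)  # no stage-1 selection
selected = num_ratio_settings(3, VALUES_PER_TASK)   # primary + 2 auxiliary tasks

print(all_tasks, selected, all_tasks // selected)
```

The count grows exponentially with the number of tasks, so with a fixed budget of GP steps, pruning tasks in stage-1 leaves a far smaller space to cover.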

Removing Stage-2: Given the selected tasks from stage-1, the goal of the Gaussian Process in stage-2 is to efficiently find an approximately optimal mixing ratio. To examine its usefulness, we replace the Gaussian Process controller with manual tuning over a grid of mixing ratios, where the number of tuning experiments equals the number of steps used in the Gaussian Process model (for a fair comparison). The ‘w/o Stage-2’ row in Table 2 shows the results of removing stage-2. A grid search over mixing ratios improves upon the baseline, indicating the usefulness of stage-1 task selection, but a reasonably sized fair-comparison grid search (i.e., not exhaustive over all ratio values) cannot match our stage-2 GP, which leverages prior experimental results to find the best setting more efficiently.

6.2 Stability of MTL Models

In this section, we report the mean and standard deviation of our baseline and multi-task models (over three runs) on the validation set. Note that the test set is hidden, so we cannot run these studies on it. As seen in Table 3, our multi-task models surpass the baseline models on all tasks, with mean gaps larger than the corresponding standard deviations.
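
As a quick arithmetic check of this comparison, the sketch below takes the means and standard deviations from Table 3 (in the table's column order) and tests whether each mean gap exceeds the larger of the two standard deviations. Using the larger standard deviation as the threshold is an assumption about how to read the comparison, not a criterion stated by the authors.

```python
# Values copied from Table 3, one entry per column.
baseline_mean = [58.6, 78.3, 74.9, 74.6, 91.4]
baseline_std = [0.94, 0.31, 0.30, 0.44, 0.36]
mtl_mean = [62.0, 81.1, 76.0, 75.7, 91.8]
mtl_std = [0.62, 0.20, 0.18, 0.18, 0.29]

# Mean improvement of the multi-task model over the baseline, per column.
gaps = [round(m - b, 1) for m, b in zip(mtl_mean, baseline_mean)]

# Assumed criterion: the gap must exceed the larger of the two standard deviations.
exceeds = [g > max(sb, sm) for g, sb, sm in zip(gaps, baseline_std, mtl_std)]

print(gaps)
print(all(exceeds))
```

The smallest margin is in the last column, where the gap of 0.4 only narrowly exceeds the baseline standard deviation of 0.36.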

Figure 3: Visualization of task utility estimates from the multi-armed bandit controller on SST-2 (primary task). The x-axis represents the task utility, and the y-axis represents the corresponding probability density. Each curve corresponds to a task, and each bar corresponds to the confidence interval of that task's estimate.

6.3 Visualization of Task Selection

In Fig. 3, we show an example of the task utility estimates from the stage-1 multi-armed bandit controller (Eq. 3.3) on SST-2. The x-axis represents the task utility, and the y-axis represents the probability density over task utility. Each curve represents a task (the blue curve corresponds to the primary task, SST-2, and the rest of the curves correspond to auxiliary tasks), and the width of the bars represents the confidence interval of their estimates. We can see that the bandit controller gives the highest (and most confident) utility estimate for the primary task, which is intuitive given that the primary task should be the most useful task for learning itself. Further, it gives 2-3 tasks moderate utility estimates (the corresponding expected values are around 0.5), and relatively lower utility estimates for the remaining tasks (the corresponding expected values are lower than 0.5).
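
The utility curves described above are Beta posteriors, one per task. As a minimal sketch of how a Beta-Bernoulli bandit with Thompson Sampling could produce such estimates, the code below keeps a Beta(alpha, beta) posterior per arm and updates it with binary rewards. The uniform Beta(1, 1) prior, the simulated reward probabilities, the arm names, and the `BetaArm` and `thompson_select` helpers are all assumptions for illustration; in particular, the mean and standard deviation shown here are only a rough stand-in for the confidence intervals plotted in Fig. 3.

```python
import random


class BetaArm:
    """Beta posterior over one task's utility under Bernoulli rewards."""

    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha = alpha  # 1 + number of observed successes
        self.beta = beta    # 1 + number of observed failures

    def mean(self):
        return self.alpha / (self.alpha + self.beta)

    def std(self):
        total = self.alpha + self.beta
        variance = self.alpha * self.beta / (total ** 2 * (total + 1))
        return variance ** 0.5

    def update(self, reward):
        # reward is 1 (success) or 0 (failure)
        self.alpha += reward
        self.beta += 1 - reward

    def sample(self, rng):
        return rng.betavariate(self.alpha, self.beta)


def thompson_select(arms, rng):
    """Draw one sample per arm and pick the arm with the largest draw."""
    samples = {name: arm.sample(rng) for name, arm in arms.items()}
    return max(samples, key=samples.get)


rng = random.Random(0)
# Hypothetical success probabilities, used only to simulate rewards.
true_utility = {"primary": 0.9, "aux_a": 0.5, "aux_b": 0.2}
arms = {name: BetaArm() for name in true_utility}

for _ in range(500):
    chosen = thompson_select(arms, rng)
    reward = 1 if rng.random() < true_utility[chosen] else 0
    arms[chosen].update(reward)

for name, arm in arms.items():
    print(name, round(arm.mean(), 3), round(arm.std(), 3))
```

Arms that are sampled often accumulate larger alpha + beta and therefore a smaller posterior standard deviation, which is consistent with the primary task having the most confident estimate in the figure.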

6.4 Educated-Guess Baselines

We additionally experimented with ‘educated-guess’ baseline models, where MTL is performed using manual-intuition task mixtures that seem a priori sensible. [Footnote 12: These educated-guess models replace our stage-1 automatic auxiliary task selection with manual-intuition task mixtures, but we still use our stage-2 Gaussian Process for mixing ratio learning, for fair comparison.] For example, with MRPC as the primary task, our first educated-guess baseline chooses other similar paraphrasing-based auxiliary tasks, i.e., QQP in the case of GLUE. This MRPC+QQP model achieves 80.8, whereas our AutoSeM framework chose MRPC+RTE+MultiNLI and achieved 81.2. As our second educated-guess baseline, we added MultiNLI as an auxiliary task (in addition to QQP), since MultiNLI was helpful for all tasks in our MTL experiments. This educated-guess MRPC+QQP+MultiNLI model achieves 80.9 (vs. 81.2 for our AutoSeM model). This suggests that our AutoSeM framework (which automatically chose the seemingly less-related RTE task for MRPC) is equal to or better than manual-intuition-based educated-guess models.

Name   RTE    MRPC   QNLI   CoLA   SST-2
Baselines
Mean   58.6   78.3   74.9   74.6   91.4
Std    0.94   0.31   0.30   0.44   0.36
Multi-Task Models
Mean   62.0   81.1   76.0   75.7   91.8
Std    0.62   0.20   0.18   0.18   0.29
Table 3: Validation-set performance mean and standard deviation (based on three runs) of our baselines and multi-task models, in accuracy.

7 Conclusion

We presented the AutoSeM framework, a two-stage multi-task learning pipeline, where the first stage automatically selects the relevant auxiliary tasks for the given primary task and the second stage automatically learns their optimal mixing ratio. We showed that AutoSeM performs better than strong baselines on several GLUE tasks. Further, we ablated the importance of each stage of our AutoSeM framework and also discussed the intuition of selected auxiliary tasks.


Acknowledgments

We thank the reviewers for their helpful comments. This work was supported by DARPA (YFA17-D17AP00022), ONR (N00014-18-1-2871), Google, Facebook, Baidu, Salesforce, and Nvidia. The views contained in this article are those of the authors and not of the funding agency.