Uncertainty in Multitask Transfer Learning

06/20/2018 ∙ by Alexandre Lacoste, et al. ∙ Element AI Inc

Using variational Bayes neural networks, we develop an algorithm capable of accumulating knowledge into a prior from multiple different tasks. The result is a rich and meaningful prior capable of few-shot learning on new tasks. The posterior can go beyond the mean field approximation and yields good uncertainty on the performed experiments. Analysis on toy tasks shows that it can learn from significantly different tasks while finding similarities among them. Experiments on Mini-Imagenet yield a new state of the art with 74.5% accuracy on 5-shot learning. Finally, we provide experiments showing that other existing methods can fail to perform well on different benchmarks.




1 Introduction

While conventional supervised learning is getting more stable and is used in a wide range of applications, learning a complex model may require a daunting amount of labeled data. For this reason, transfer learning is often considered as an option to reduce the sample complexity of learning a new task [Footnote 1: A task is defined as modeling the underlying distribution from a dataset of observations.]. While there has been a significant amount of progress in domain adaptation (Ganin et al., 2016), this particular form of transfer learning requires a source task highly related to the target task and a large amount of data on the source task. For this reason, we seek to make progress on multitask transfer learning (also known as few-shot learning), which is still far behind human-level transfer capabilities (Lake et al., 2017). In the few-shot learning setup, a potentially large number of tasks are available to learn parameters shared across all tasks. Once the shared parameters are learned, the objective is to obtain good generalization performance on a new task with a small number of samples.

Recently, significant progress has been made to scale Bayesian neural networks to large tasks and to provide better approximations of the posterior distribution (Blundell et al., 2015; Louizos and Welling, 2017; Krueger et al., 2017). This, however, comes with an important question: "What does the posterior distribution actually represent?" For neural networks, the prior is often chosen for convenience and the approximate posterior is often very limited (Blundell et al., 2015). For sufficiently large datasets, the observations overcome the prior, and the posterior becomes a single mode around the true model [Footnote 2: The true model must have positive probability under the prior. Also, when the true model can be parameterized differently, modeling one or multiple modes is equivalent.], justifying most uni-modal posterior approximations.

However, many usages of the posterior distribution require a meaningful prior, that is, a prior expressing our current knowledge on the task and, most importantly, our lack of knowledge on the task. In addition, a good approximation of the posterior under the small sample size regime is required, including the ability to model multiple modes. This is indeed the case for Bayesian optimization (Snoek et al., 2012), Bayesian active learning (Gal et al., 2017), continual learning (Kirkpatrick et al., 2017), safe reinforcement learning (Berkenkamp et al., 2017), and the exploration-exploitation trade-off in reinforcement learning (Houthooft et al., 2016). Gaussian processes (Rasmussen, 2004) have historically been used for these applications, but an RBF kernel is too generic a prior for many tasks. More recent tools such as deep Gaussian processes (Damianou and Lawrence, 2013) show great potential, yet their scalability when learning from multiple tasks needs to be improved.

Our aim in this work is to learn a good prior across multiple tasks and transfer it to a new task. To be able to express a rich and flexible prior learned across a large number of tasks, we use neural networks learned with a variational Bayes procedure. By doing so, we are able to (i) isolate a small number of task specific parameters and (ii) obtain a rich posterior distribution over this space. Additionally, the knowledge accumulated from the previous tasks provides a meaningful prior on the target task, yielding a meaningful posterior distribution which can be used in a small data regime.

The rest of the paper is organized as follows. We first describe the proposed approach in Section 2 while reviewing hierarchical Bayes modeling. In Section 3, we extend to three levels of hierarchy to obtain a model more suited for classification. Section 4 outlines key differences between our approach and related methods. In Section 5, we conduct experiments on toy tasks to gain insight into the behavior of the algorithm. Finally, we show that we can obtain a new state of the art on the Mini-Imagenet benchmark (Vinyals et al., 2016).

2 Learning a Deep Prior

By leveraging the variational Bayes approach, we show how we can learn a prior over models with neural networks. Also, by factorizing the posterior distribution into a task agnostic and task specific component, we show an important simplification resulting in a scalable algorithm, which we refer to as deep prior.

2.1 Hierarchical Bayes

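We consider learning a prior from previous tasks by learning a probability distribution $p(w \mid \alpha)$ over the weights $w$ of a network, parameterized by $\alpha$. This is done using a hierarchical Bayes approach across $N$ tasks, with hyper-prior $p(\alpha)$. Each task has its own parameters $w_j$, with $\mathcal{W} = \{w_j\}_{j=1}^{N}$. Using all datasets $\mathcal{D} = \{S_j\}_{j=1}^{N}$, where $S_j = \{(x_{ij}, y_{ij})\}_{i=1}^{n_j}$, we have the following posterior [Footnote 3: $p(x_{ij})$ cancelled with itself from the denominator since it does not depend on $w_j$ nor $\alpha$. This would have been different for a generative approach.]:

$$
\begin{aligned}
p(\mathcal{W}, \alpha \mid \mathcal{D}) &= p(\alpha \mid \mathcal{D}) \prod_{j} p(w_j \mid \alpha, S_j) && (1) \\
&\propto p(\alpha) \prod_{j} p(w_j \mid \alpha) \prod_{i=1}^{n_j} p(y_{ij} \mid x_{ij}, w_j) && (2)
\end{aligned}
$$

The term $p(y_{ij} \mid x_{ij}, w_j)$ corresponds to the likelihood of sample $i$ of task $j$ given a model parameterized by $w_j$, e.g., the probability of class $y_{ij}$ from the softmax of a neural network parameterized by $w_j$ with input $x_{ij}$. For the posterior $p(\alpha \mid \mathcal{D})$, we assume that the large amount of data available across multiple tasks will be enough to overcome a generic prior $p(\alpha)$, such as an isotropic Normal distribution. Hence, we consider a point estimate of the posterior $p(\alpha \mid \mathcal{D})$ using maximum a posteriori [Footnote 4: This can be done by simply minimizing the cross entropy of a neural network with regularization.].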

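We can now focus on the remaining term: $p(w_j \mid \alpha)$. Since $w_j$ is potentially high dimensional with intricate correlations among the different dimensions, we cannot use a simple Gaussian distribution. Following inspiration from generative models such as GANs (Goodfellow et al., 2014) and VAE (Kingma and Welling, 2013), we use an auxiliary variable $z \sim p(z)$ and a deterministic function $h_\alpha$ projecting the noise $z$ to the space of $w$, i.e., $w = h_\alpha(z)$. Marginalizing $z$, we have $p(w \mid \alpha) = \int p(z)\, p(w \mid z, \alpha)\, dz = \int p(z)\, \delta_{h_\alpha(z) - w}\, dz$, where $\delta$ is the Dirac delta function. Unfortunately, directly marginalizing $z$ is intractable for general $h_\alpha$. To overcome this issue, we add $z$ to the joint inference and marginalize it at inference time. Considering the point estimation of $\alpha$, the full posterior is factorized as follows:

$$
\begin{aligned}
p(\mathcal{W}, \mathbf{z} \mid \alpha, \mathcal{D}) &= \prod_{j} p(w_j, z_j \mid \alpha, S_j) && (3) \\
&\propto \prod_{j} p(z_j, w_j \mid \alpha) \prod_{i=1}^{n_j} p(y_{ij} \mid x_{ij}, w_j) && (4)
\end{aligned}
$$

where $p(y_{ij} \mid x_{ij}, w_j)$ is the conventional likelihood function of a neural network with weight matrices generated from the function $h_\alpha$, i.e., $w_j = h_\alpha(z_j)$. A similar architecture has been used in Krueger et al. (2017) and Louizos and Welling (2017), but we will soon show that it can be reduced to a simpler architecture in the context of multi-task learning. The other terms are defined as follows:

$$
\begin{aligned}
p(z_j) &= \mathcal{N}(0, I) \\
p(z_j, w_j \mid \alpha) &= p(z_j)\, \delta_{h_\alpha(z_j) - w_j} \\
p(z_j, w_j \mid \alpha, S_j) &= p(z_j \mid \alpha, S_j)\, \delta_{h_\alpha(z_j) - w_j}
\end{aligned}
$$

The task will consist of jointly learning a function $h_\alpha$ common to all tasks and a posterior distribution $p(z_j \mid \alpha, S_j)$ for each task. At inference time, predictions are performed by marginalizing $z$, i.e., $p(y \mid x, \mathcal{D}) = \mathbb{E}_{z_j \sim p(z_j \mid \alpha, S_j)}\, p(y \mid x, h_\alpha(z_j))$.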

2.2 Hierarchical Variational Bayes Neural Network

In the previous section, we described the different components for expressing the posterior distribution of Equation 4. While all those components are tractable, the normalization factor hidden behind the "$\propto$" sign is still intractable. To address this issue, we follow the Variational Bayes approach (Blundell et al., 2015).

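Conditioning on $\alpha$, we saw in Equation 1 that the posterior factorizes independently for all tasks. This reduces the joint Evidence Lower BOund (ELBO) to a sum of individual ELBOs, one for each task.

Given a family of distributions $q_{\theta_j}(z_j \mid S_j, \alpha)$, parameterized by $\{\theta_j\}_{j=1}^{N}$ and $\alpha$, and writing $q(z_j, w_j \mid S_j, \alpha) = q_{\theta_j}(z_j \mid S_j, \alpha)\, \delta_{h_\alpha(z_j) - w_j}$, the Evidence Lower Bound for task $j$ is:

$$
\begin{aligned}
\ln p(S_j \mid \alpha) &\geq \mathbb{E}_{q(z_j, w_j \mid S_j, \alpha)} \sum_{i=1}^{n_j} \ln p(y_{ij} \mid x_{ij}, w_j) - \mathrm{KL}_j \\
&= \mathbb{E}_{q_{\theta_j}(z_j \mid S_j, \alpha)} \sum_{i=1}^{n_j} \ln p(y_{ij} \mid x_{ij}, h_\alpha(z_j)) - \mathrm{KL}_j = \mathrm{ELBO}_j && (5)
\end{aligned}
$$

where

$$
\begin{aligned}
\mathrm{KL}_j &= \mathrm{KL}\left[ q(z_j, w_j \mid S_j, \alpha) \,\|\, p(z_j, w_j \mid \alpha) \right] \\
&= \mathbb{E}_{q(z_j, w_j \mid S_j, \alpha)} \ln \frac{q_{\theta_j}(z_j \mid S_j, \alpha)\, \delta_{h_\alpha(z_j) - w_j}}{p(z_j)\, \delta_{h_\alpha(z_j) - w_j}} \\
&= \mathrm{KL}\left[ q_{\theta_j}(z_j \mid S_j, \alpha) \,\|\, p(z_j) \right] && (6)
\end{aligned}
$$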
Notice that after simplification [Footnote 5: We can justify the cancellation of the Dirac delta functions by instead considering a Gaussian with finite variance $\epsilon$. For all $\epsilon > 0$, the cancellation is valid, so letting $\epsilon \to 0$, we recover the result.], $\mathrm{KL}_j$ is no longer over the space of $w_j$ but only over the space of $z_j$. Namely, the posterior distribution is factored into two components, one that is task specific and one that is task agnostic and can be shared with the prior. This amounts to finding a low dimensional manifold in the parameter space where the different tasks can be distinguished. Then, the posterior only has to model which of the possible tasks are likely given the observations $S_j$, instead of modeling the high dimensional $w_j$.

Most importantly, any explicit reference to $w$ has now vanished from both Equation 5 and Equation 6. This simplification has an important positive impact on the scalability of the proposed approach. Since we no longer need to explicitly calculate the KL on the space of $w$, we can simplify the likelihood function to $p(y_{ij} \mid x_{ij}, z_j, \alpha)$, which can be a deep network parameterized by $\alpha$, taking both $x_{ij}$ and $z_j$ as inputs. This contrasts with the previous formulation, where $h_\alpha(z_j)$ produces all the weights of a network, yielding an extremely high dimensional representation and slow training.

2.3 Posterior Distribution

For modeling $q_{\theta_j}(z_j \mid S_j, \alpha)$, we can use $\mathcal{N}(\mu_j, \sigma_j)$, where $\mu_j$ and $\sigma_j$ can be learned individually for each task. This, however, limits the posterior family to expressing a single mode. For more flexibility, we also explore the usage of a more expressive posterior, such as Inverse Autoregressive Flow (IAF) (Kingma et al., 2016). This gives a flexible tool for learning a rich variety of multivariate distributions. In principle, we can use a different IAF for each task, but for memory and computational reasons, we use a single IAF for all tasks and we condition [Footnote 6: We follow the architecture proposed in Kingma et al. (2016).] on an additional task specific context $c_j$.

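Note that with IAF, we cannot evaluate $q_{\theta_j}(z_j \mid S_j, \alpha)$ efficiently for arbitrary values of $z_j$, only for those which we just sampled, but this is sufficient for estimating the KL term with a Monte-Carlo approximation, i.e.:

$$
\mathrm{KL}_j \approx \frac{1}{n_{mc}} \sum_{i=1}^{n_{mc}} \left[ \ln q_{\theta_j}(z_j^{(i)} \mid S_j, \alpha) - \ln p(z_j^{(i)}) \right]
$$

where $z_j^{(i)} \sim q_{\theta_j}(z_j \mid S_j, \alpha)$. It is common to approximate $\mathrm{KL}_j$ with a single sample and let the mini-batch average the noise incurred on the gradient. We experimented with more than one sample, but this did not significantly improve the rate of convergence.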
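To make the Monte-Carlo KL estimate concrete, here is a minimal sketch that is not the paper's implementation. It assumes a diagonal Gaussian posterior (standing in for IAF) and a standard Normal prior, so that the sampled estimate can be compared against the known closed form. The function names `log_normal`, `kl_monte_carlo` and `kl_closed_form` are hypothetical.

```python
import math
import random


def log_normal(z, mu, sigma):
    """Log density of a diagonal Gaussian N(mu, diag(sigma^2)) at vector z."""
    return sum(
        -0.5 * math.log(2.0 * math.pi) - math.log(s) - 0.5 * ((v - m) / s) ** 2
        for v, m, s in zip(z, mu, sigma)
    )


def kl_monte_carlo(mu, sigma, n_mc, rng):
    """Estimate KL[q || p] from n_mc samples of q.

    Assumption: q = N(mu, diag(sigma^2)) and p = N(0, I). Only the log
    densities at the sampled points are needed, as with IAF.
    """
    zeros, ones = [0.0] * len(mu), [1.0] * len(mu)
    total = 0.0
    for _ in range(n_mc):
        z = [rng.gauss(m, s) for m, s in zip(mu, sigma)]
        total += log_normal(z, mu, sigma) - log_normal(z, zeros, ones)
    return total / n_mc


def kl_closed_form(mu, sigma):
    """Exact KL between N(mu, diag(sigma^2)) and N(0, I), for comparison."""
    return sum(0.5 * (s * s + m * m - 1.0) - math.log(s) for m, s in zip(mu, sigma))
```

With `n_mc = 1` the estimate is noisy but unbiased, which is the regime described above, where the mini-batch averages out the noise.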

2.4 Training Procedure

In order to compute the loss proposed in Equation 5, we would need to evaluate every sample of every task. To accelerate the training, we describe a procedure following the mini-batch principle. First we replace summations with expectations:

$$
\mathrm{ELBO} = \sum_{j=1}^{N} \mathrm{ELBO}_j = N\, \mathbb{E}_{j \sim U_N} \left[ n_j\, \mathbb{E}_{i \sim U_{n_j}}\, \mathbb{E}_{z_j \sim q_{\theta_j}} \ln p(y_{ij} \mid x_{ij}, z_j, \alpha) - \mathrm{KL}_j \right] \qquad (7)
$$

where $U_n$ denotes the uniform distribution over $\{1, \dots, n\}$. Now it suffices to approximate the gradient with samples across all tasks. Thus, we simply concatenated all datasets into a meta-dataset and added $j$ as an extra field. Then, we sample uniformly [Footnote 7: We also explored a sampling scheme that always makes sure to have a minimum number of samples from the same task. The aim was to reduce gradient variance on task specific parameters, but we did not observe any benefits.] $n_{mb}$ times with replacement from the meta-dataset. Notice the term $n_j$ appearing in front of the likelihood in Equation 7: this indicates that, individually for each task, it finds the appropriate trade-off between the prior and the observations. Refer to Algorithm 1 for more details on the procedure.

Algorithm 1: Calculating the loss for a mini-batch
for i in 1 .. $n_{mb}$:
    sample $x$, $y$ and $j$ uniformly from the meta-dataset
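The mini-batch procedure can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the posterior of each task is a one-dimensional Gaussian rather than IAF, the prior is a standard Normal, the KL term uses a single sample, and `log_lik` is a hypothetical callable standing in for the network's log-likelihood.

```python
import math
import random

LOG_2PI = math.log(2.0 * math.pi)


def minibatch_loss(meta_dataset, task_sizes, posteriors, log_lik, n_mb, rng):
    """Average of -(n_j * ln p(y | x, z) - KL_j) over n_mb sampled rows.

    meta_dataset: list of (x, y, j) rows, all tasks concatenated.
    task_sizes:   task_sizes[j] is n_j, the number of rows of task j.
    posteriors:   posteriors[j] = (mu_j, sigma_j), a 1-D Gaussian posterior
                  (an assumption; the text also considers IAF).
    log_lik:      hypothetical callable (y, x, z) -> ln p(y | x, z).
    """
    total = 0.0
    for _ in range(n_mb):
        x, y, j = rng.choice(meta_dataset)  # uniform draw, with replacement
        mu, sigma = posteriors[j]
        eps = rng.gauss(0.0, 1.0)
        z = mu + sigma * eps  # reparameterized sample from the posterior
        log_q = -0.5 * LOG_2PI - math.log(sigma) - 0.5 * eps * eps
        log_prior = -0.5 * LOG_2PI - 0.5 * z * z
        kl = log_q - log_prior  # single-sample KL estimate
        total += -(task_sizes[j] * log_lik(y, x, z) - kl)
    return total / n_mb
```

In a real implementation the same computation would be written in an automatic differentiation framework so that gradients flow to the shared parameters and to the per-task posterior parameters.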

3 Extending to 3 Levels of Hierarchy

Deep prior gives rise to a very flexible way to transfer knowledge from multiple tasks. However, there is still an important assumption at the heart of deep prior (and other VAE-based approaches such as Edwards and Storkey (2016)): the task information must be encoded in a low dimensional variable $z$. In Section 5, we show that this is appropriate for regression, but for image classification it is not the most natural assumption. Hence, we propose to extend to a third level of hierarchy by introducing a latent classifier on the obtained representation.

In Equation 5, for a given task $j$ [Footnote 8: We removed $j$ from the equations to lighten the notation.], we decomposed the likelihood $p(S \mid z)$ into $\prod_i p(y_i \mid x_i, z)$ by assuming that the neural network is directly predicting $p(y_i \mid x_i, z)$. Here, we introduce a latent variable $v$ to make the prediction $p(y_i \mid x_i, z, v)$. This can be, for example, a Gaussian linear regression on the representation $\phi_\alpha(x, z)$ produced by the neural network. The general form now factorizes as follows: $p(S \mid z) = \mathbb{E}_{v \sim p(v \mid z)} \prod_i p(y_i \mid x_i, z, v)$, which is commonly called the marginal likelihood.

To compute the ELBO in Equation 5 and update the parameters $\alpha$, the only requirement is to be able to compute the marginal likelihood $p(S \mid z)$. There are closed form solutions for, e.g., linear regression with a Gaussian prior, but our aim is to compare with algorithms such as Prototypical Networks (Proto Net) (Snell et al., 2017) on a classification benchmark. Alternatively, we can factor the marginal likelihood as follows: $p(S \mid z) = \prod_{i=1}^{n} p(y_i \mid x_i, S_{<i}, z)$, where $S_{<i}$ denotes the first $i - 1$ samples. If a well calibrated task uncertainty is not required, one can also use a leave-one-out procedure, $\prod_{i=1}^{n} p(y_i \mid x_i, S \setminus \{(x_i, y_i)\}, z)$. Both of these factorizations correspond to training the latent classifier $n$ times on a subset of the training set and evaluating on a left-out sample. We refer the reader to Rasmussen (2004, Chapter 5) for a discussion on the difference between leave-one-out cross validation and marginal likelihood.

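For a practical algorithm, we propose a closed form solution for leave-one-out in prototypical networks. In its standard form, the prototypical network produces a prototype $c_k$ by averaging all representations $\gamma_i = \gamma(x_i)$ of class $k$, i.e., $c_k = \frac{1}{|K|} \sum_{i \in K} \gamma_i$, where $K = \{i : y_i = k\}$. Then, predictions are made using $p(y = k \mid x) \propto \exp\left(-\|c_k - \gamma(x)\|^2\right)$.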

Theorem 1.

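Let $c_k^{-x}$, for all $k$, be the prototypes computed without example $(x, y)$ in the training set, and let $n_k$ be the number of training examples of class $k$. Then,

$$
\|\gamma(x) - c_k^{-x}\|^2 =
\begin{cases}
\left(\dfrac{n_k}{n_k - 1}\right)^2 \|\gamma(x) - c_k\|^2 & \text{if } y = k \\
\|\gamma(x) - c_k\|^2 & \text{otherwise}
\end{cases}
$$

We defer the proof to the supplementary materials. Hence, we only need to compute the prototypes once and rescale the Euclidean distance when comparing with a sample that was used for computing the current prototype. This gives an efficient algorithm with the same complexity as the original one and a good proxy for the marginal likelihood.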
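The leave-one-out rescaling can be checked numerically. The sketch below assumes that a prototype is the plain mean of the embeddings of its class; removing one example from a class of size n then scales the difference between that example and its own prototype by n / (n - 1). The helper names are hypothetical.

```python
def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def loo_sq_dist(embedding, proto, class_size, same_class):
    """Squared distance to the prototype recomputed without `embedding`.

    proto is the prototype computed on the full training set; class_size is
    the number of training examples that were averaged to obtain it.
    """
    d = sq_dist(embedding, proto)
    if not same_class:
        return d  # the prototype of another class does not change
    if class_size < 2:
        raise ValueError("leave-one-out needs at least two examples per class")
    return (class_size / (class_size - 1)) ** 2 * d
```

For example, with class embeddings `[[0, 1], [2, -1], [4, 3]]`, the rescaled distance for the first example agrees with the distance to the mean of the other two.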

4 Related Work

Hierarchical Bayes algorithms for multitask learning have a long history (Daumé III, 2009; Wan et al., 2012; Bakker and Heskes, 2003). However, most of the literature focuses on simple statistical models and does not consider transferring to new tasks.

More recently, Edwards and Storkey (2016) and Bouchacourt et al. (2017) explore hierarchical Bayesian inference with neural networks and evaluate on new tasks. Both of them use a two-level hierarchical VAE for modeling the observations. While similar, our approach differs in a few ways. We use a discriminative approach and focus on model uncertainty. We show that we can obtain a posterior on $z$ without having to explicitly encode $S_j$. We also explore the usage of a more complex posterior family such as IAF. Those differences make our algorithm simpler to implement and easier to scale to larger datasets.

Some recent works on meta-learning also target transfer learning from multiple tasks. Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017) finds a shared parameter $\theta$ such that, for a given task, one gradient step on $\theta$ using the training set will yield a model with good predictions on the test set. Then, a meta-gradient update is performed from the test error through the one gradient step on the training set, to update $\theta$. This yields a simple and scalable procedure which learns to generalize. Recently, Grant et al. (2018) consider a Bayesian version of MAML. Additionally, Ravi and Larochelle (2016) also consider a meta-learning approach where an encoding network reads the training set and generates the parameters of a model, which is trained to perform well on the test set.

Finally, recent interest in few-shot learning has given rise to various algorithms capable of transferring from multiple tasks. Many of these approaches (Vinyals et al., 2016; Snell et al., 2017) find a representation where a simple algorithm can produce a classifier from a small training set. Bauer et al. (2017) use a neural network pre-trained on a standard multi-class dataset to obtain a good representation and use class statistics to transfer prior knowledge to new classes.

5 Experimental Results

Through experiments, we want to answer: i) Can deep prior learn a meaningful prior on tasks? ii) Can it compete against the state of the art on a strong benchmark? iii) In which situations do deep prior and other approaches fail?

5.1 Regression on one dimensional Harmonic signals

To gain a good insight into the behavior of the prior and posterior, we choose a collection of one dimensional regression tasks. We also want to test the ability of the method to learn the task and not just match the observed points. For this, we will use periodic functions and test the ability of the regressor to extrapolate outside of its domain.

Specifically, each dataset consists of $(x, y)$ pairs (noisily) sampled from a sum of two sine waves with different phase and amplitude and a frequency ratio of 2: $f(x) = a_1 \sin(\omega x + b_1) + a_2 \sin(2 \omega x + b_2)$, where $y \sim \mathcal{N}(f(x), \sigma_y^2)$. We construct a meta-training set of 5000 tasks, sampling $\omega$, $(b_1, b_2)$ and $(a_1, a_2)$ independently for each task. To evaluate the ability to extrapolate outside of the task's domain, we make sure that each task has a different domain. Specifically, the $x$ values of a task are sampled from a task-specific distribution whose location is itself sampled from a meta-domain. The number of training samples ranges from 4 to 50 for each task, and evaluation is performed on 100 samples from tasks never seen during training.
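A task generator along these lines might look as follows. This is only a sketch: the sampling ranges for the frequency, phases, amplitudes, domain center and noise level are illustrative assumptions chosen for this example and are not taken from the text above.

```python
import math
import random


def sample_task(rng, n_samples, noise_std=0.1):
    """Sample one regression task: a sum of two sines with frequency ratio 2.

    All numeric ranges below are assumptions made for illustration.
    Returns the inputs, the noisy targets and the noise-free function.
    """
    omega = rng.uniform(5.0, 7.0)                   # base frequency (assumed range)
    a1, a2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)  # amplitudes (assumed)
    b1 = rng.uniform(0.0, 2.0 * math.pi)            # phases (assumed)
    b2 = rng.uniform(0.0, 2.0 * math.pi)
    center = rng.uniform(-4.0, 4.0)                 # task-specific domain center (assumed)

    def f(x):
        return a1 * math.sin(omega * x + b1) + a2 * math.sin(2.0 * omega * x + b2)

    xs = [rng.gauss(center, 1.0) for _ in range(n_samples)]
    ys = [f(x) + rng.gauss(0.0, noise_std) for x in xs]
    return xs, ys, f


def make_meta_dataset(rng, n_tasks, min_samples=4, max_samples=50):
    """Concatenate all tasks into (x, y, j) rows, with j the task index."""
    rows = []
    for j in range(n_tasks):
        xs, ys, _ = sample_task(rng, rng.randint(min_samples, max_samples))
        rows.extend((x, y, j) for x, y in zip(xs, ys))
    return rows
```

Because each task draws its inputs around a different center, evaluating on inputs far from that center tests extrapolation rather than interpolation.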


Once $z$ is sampled from IAF, we simply concatenate it with $x$ and use 12 densely connected layers of 128 neurons with residual connections between every other layer. The final layer linearly projects to 2 outputs, $\mu_y$ and $s$, where $s$ is used to produce a heteroskedastic noise $\sigma_y$. Finally, we use $p(y \mid x, z) = \mathcal{N}(\mu_y, \sigma_y^2)$ to express the likelihood of the training set. To help gradient flow, we use ReLU activation functions and Layer Normalization (Ba et al., 2016) [Footnote 9: Layer norm only marginally helped.].
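As an illustration of a heteroskedastic Gaussian likelihood built from two network outputs, consider the sketch below. The mapping from the raw output to a positive standard deviation (a softplus plus a small floor) is an assumption made for this example; the mapping used in the experiments is not recoverable from the text.

```python
import math


def softplus(s):
    """Numerically stable ln(1 + exp(s))."""
    return math.log1p(math.exp(-abs(s))) + max(s, 0.0)


def gaussian_log_lik(y, mu_y, s, min_std=1e-3):
    """ln N(y | mu_y, sigma_y^2) with sigma_y = softplus(s) + min_std.

    mu_y and s play the role of the two outputs of the final layer; the
    softplus mapping and min_std are hypothetical choices.
    """
    sigma = softplus(s) + min_std
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * ((y - mu_y) / sigma) ** 2
```

Summing `gaussian_log_lik` over the samples of a task gives the log-likelihood term of the ELBO for that task.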


Figure 1(a) depicts examples of tasks with 1, 2, 8, and 64 samples. The true underlying function is in blue while 10 samples from the posterior distributions are faded in the background. The thickness of the line represents 2 standard deviations. The first plot has only a single data point and mostly represents samples from the prior, passing near this observed point. Interestingly, all samples are close to some parametrization of the sum of two sine waves defined in Section 5.1. Next, with only 2 points, the posterior is starting to predict curves highly correlated with the true function. However, note that the uncertainty is over-optimistic and that the posterior failed to fully represent all possible harmonics fitting those two points. We discuss this issue in more depth in the supplementary materials. Next, with 8 points, it managed to mostly capture the task, with reasonable uncertainty. Finally, with 64 points the model is certain of the task.

To add a strong baseline, we experimented with MAML (Finn et al., 2017). After exploring a variety of hyper-parameter values and architecture designs, we could not make it work for our two-harmonics meta-task. We thus reduced the meta-task to a single harmonic and reduced the base frequency range by a factor of two. With those simplifications, we managed to make it converge, but the results are far behind those of deep prior even in this simplified setup. Figure 1(b) shows some form of adaptation with 16 samples per task, but the result is jittery and the extrapolation capacity is very limited. Those results were obtained with a densely connected network of 8 hidden layers of 64 units [Footnote 10: We also experimented with various other architectures.], with residual connections every other layer. The training is performed with two gradient steps and the evaluation with 5 steps. To make sure our implementation is valid, we first replicated their regression result with a fixed frequency as reported in Finn et al. (2017).

(a) Deep Prior
(b) MAML
Figure 1: Preview of a few tasks (blue line) with an increasing number of training samples (red dots). Samples from the posterior distribution are shown in semi-transparent colors. The width of each sample is two standard deviations (provided by the predicted heteroskedastic noise).
Figure 2: left: Mean Square Error on increasing dataset size. The baseline corresponds to the same model without the KL regularizer. Each value is averaged over 100 tasks and 10 different restarts. right: 4 sample tasks from the Synbols dataset. Each row is a class and each column is a sample from the class. In the two left tasks, the symbol has to be predicted, while in the two right tasks, the font has to be predicted.

Finally, to provide a stronger baseline, we removed the KL regularizer of deep prior and reduced the posterior to a deterministic distribution centered on $\mu_j$. The mean square error is reported in Figure 2 for an increasing dataset size. This highlights how the uncertainty provided by deep prior yields a systematic improvement.

5.2 Mini-Imagenet Experiment

Vinyals et al. (2016) proposed to use a subset of Imagenet to generate a benchmark for few-shot learning. Each task is generated by sampling 5 classes uniformly and 5 training samples per class; the remaining images from the 5 classes are used as query images to compute accuracy. There are 100 unique classes in total, each having 600 example images. To perform meta-validation and meta-test on unseen tasks (and classes), we isolate 16 and 20 classes respectively from the original set of 100, leaving 64 classes for the training tasks. This follows the procedure suggested in Ravi and Larochelle (2016).
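The task construction described above can be sketched as follows, assuming the images are available as a mapping from class identifier to a list of items. The function name `sample_episode` and the data layout are hypothetical.

```python
import random


def sample_episode(class_to_items, n_way, n_shot, rng):
    """Build one few-shot task: n_way classes, n_shot support items per class.

    All remaining items of the selected classes become query items. Labels
    are re-indexed from 0 to n_way - 1 within the task.
    """
    classes = rng.sample(sorted(class_to_items), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        items = list(class_to_items[cls])
        rng.shuffle(items)
        support += [(item, label) for item in items[:n_shot]]
        query += [(item, label) for item in items[n_shot:]]
    return support, query
```

With 600 images per class, a 5-way, 5-shot task built this way has 25 support items and 5 × 595 query items.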

The training procedure proposed in Section 2 requires training on a fixed set of tasks. We found that 1000 tasks yield enough diversity and that, with over 9000 tasks, the embeddings are not visited often enough over the course of training. To increase diversity during training, the training and test sets are re-sampled every time from a fixed train-test split of the given task [Footnote 11: If the train and test split is not fixed for a given task, one could leak the test information through the task embeddings across different resamplings of the task.].

We first experimented with the vanilla version of deep prior (Section 2). In this formulation, we use a ResNet (He et al., 2016) network, where we inserted FILM layers (Perez et al., 2017; de Vries et al., 2017) between each residual block to condition on the task. Then, after flattening the output of the final convolution layer and reducing to 64 hidden units, we apply a 64 × 5 matrix generated from a transformation of $z$. Finally, predictions are made through a softmax layer. We found this architecture to be slow to train, as the generated last layer is noisy for a long time and prevents the rest of the network from learning. Nevertheless, we obtained 62.6% accuracy on Mini-Imagenet, on par with many strong baselines.

To enhance the model, we combine task conditioning with prototypical networks as proposed in Section 3. This approach alleviates the need to generate the final layer of the network, thus accelerating training and increasing generalization performance. While we no longer have a well calibrated task uncertainty, the KL term still acts as an effective regularizer and prevents overfitting on small datasets [Footnote 12: We had to cross-validate the weight of the KL term and obtained our best results using values around 0.1.]. With this improvement, we obtain the new state of the art with 74.5% accuracy (Table 1). In Table 2, we perform an ablation study to highlight the contributions of the different components of the model. In sum, a deeper network with residual connections yields major improvements. Also, task conditioning does not yield an improvement if the leave-one-out procedure is not used. Finally, the KL regularizer is the final touch needed to obtain state-of-the-art results.

5.3 Heterogeneous Collection of Tasks

In Section 5.2, we saw that conditioning helps, but only yields a minor improvement. This is due to the fact that Mini-Imagenet is a very homogeneous collection of tasks where a single representation is sufficient to obtain good results. To support this claim, we provide a new benchmark [Footnote 13: Code and dataset will be provided.] of synthetic symbols which we refer to as Synbols. Images are generated using various font families on different alphabets (Latin, Greek, Cyrillic, Chinese) and background noise (Figure 2, right). For each task we have to predict either a subset of 4 font families or 4 symbols with only 4 examples. Predicting fonts and symbols with two separate Prototypical Networks yields 84.2% and 92.3% accuracy respectively, with an average of 88.3%. However, blending the two collections of tasks in a single benchmark brings the prototypical network down to 76.8%. Conditioning on the task with deep prior brings the accuracy back up to 83.5%. While there is still room for improvement, this supports the claim that a single representation will only work on a homogeneous collection of tasks and that task conditioning helps to learn a family of representations suitable for heterogeneous benchmarks.

| Method | Accuracy |
| Matching Networks (Vinyals et al., 2016) | 60.0% |
| Meta-Learner LSTM (Ravi and Larochelle, 2016) | 60.6% |
| MAML (Finn et al., 2017) | 63.2% |
| Prototypical Networks (Snell et al., 2017) | 68.2% |
| SNAIL (Mishra et al., 2018) | 68.9% |
| Discriminative k-shot (Bauer et al., 2017) | 73.9% |
| adaResNet (Munkhdalai et al., 2018) | 71.9% |
| Deep Prior (Ours) | 62.7% |
| Deep Prior + Proto Net (Ours) | 74.5% |
Table 1: Average classification accuracy on 5-shot Mini-Imagenet benchmark.

| Model | Mini-Imagenet (5-way, 5-shot) | Synbols (4-way, 4-shot) |
| Proto Net (ours) | 68.6 ± 0.5% | 69.6 ± 0.8% |
| + ResNet(12) | 72.4 ± 1.0% | 76.8 ± 0.4% |
| + Conditioning | 72.3 ± 0.6% | 80.1 ± 0.9% |
| + Leave One Out | 73.9 ± 0.4% | 82.7 ± 0.2% |
| + KL | 74.5 ± 0.5% | 83.5 ± 0.4% |
Table 2: Ablation study of our model. Accuracy is shown with 90% confidence interval over bootstrap of the validation set.

6 Conclusion

Using variational Bayes, we developed a scalable algorithm for hierarchical Bayes learning of neural networks, called deep prior. This algorithm is capable of transferring information from tasks that are potentially remarkably different. Results on the Harmonics dataset show that the learned manifold across tasks exhibits the properties of a meaningful prior. Finally, we found that MAML, while very general, will have a hard time adapting when tasks are too different. Also, we found that algorithms based on a single image representation only work well when all tasks can succeed with a very similar set of features. Together, those findings allowed us to develop the new state of the art on Mini-Imagenet.


7 Appendix

7.1 Proof of Leave One Out

Theorem 1.

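Let $c_k^{-x}$, for all $k$, be the prototypes computed without example $(x, y)$ in the training set, and let $n_k$ be the number of training examples of class $k$. Then,

$$
\|\gamma(x) - c_k^{-x}\|^2 =
\begin{cases}
\left(\dfrac{n_k}{n_k - 1}\right)^2 \|\gamma(x) - c_k\|^2 & \text{if } y = k \\
\|\gamma(x) - c_k\|^2 & \text{otherwise}
\end{cases}
$$

Let $\gamma = \gamma(x)$ and $n = n_k$, and assume $y = k$. Then,

$$
\|\gamma - c_k^{-x}\|^2 = \left\| \gamma - \frac{n c_k - \gamma}{n - 1} \right\|^2 = \left\| \frac{n}{n - 1} (\gamma - c_k) \right\|^2 = \left( \frac{n}{n - 1} \right)^2 \|\gamma - c_k\|^2 .
$$

When $y \neq k$, the result is trivially $\|\gamma - c_k^{-x}\|^2 = \|\gamma - c_k\|^2$, since $c_k^{-x} = c_k$. ∎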

7.2 Limitations of IAF

Figure 3: top: True function in the original space with 2 observed data points. middle: True posterior distribution, where the orange dot corresponds to the location of the true underlying function. bottom: Samples from IAF’s learned posterior.

When experimenting with the Harmonics toy dataset in Section 5.1, we observed issues with repeatability, most likely due to local minima. We decided to investigate further on the multimodality of posterior distributions with small sample size and the capacity of IAF to model them. For this purpose we simplified the problem to a single sine function and removed the burden of learning the prior. The likelihood of the observations is defined as follows:

where the noise level is given. Only the frequency and the bias are unknown [Footnote 14: We scale the frequency and the bias by a factor of 5 so that the range of interesting values is better suited to IAF.], yielding a two-dimensional problem that is easy to visualize and quick to train. We use a dataset of 2 points, and the corresponding posterior distribution is depicted in Figure 3-middle, with an orange point at the location of the true underlying function. Some samples from the posterior distribution can be observed in Figure 3-top.

We observe a high amount of multi-modality in the posterior distribution (Figure 3-middle). Some of the modes are just the mirror of another mode and correspond to the same function. But most of the time they correspond to different functions, and modeling them is crucial for some applications. The number of modes varies a lot with the choice of observed dataset, ranging from a few to several dozen. Now, the question is: "How many of those modes can IAF model?" Unfortunately, Figure 3-bottom reveals poor capability for this particular case. After carefully adjusting the hyperparameters of IAF [Footnote 15: 12 layers with a MADE network of 64 hidden units for each layer, learned with Adam.], exploring different initialization schemes and running multiple restarts, we rarely capture more than two modes (sometimes 4). Moreover, IAF is not able to fully separate the modes: there is systematically a thin path of density connecting the modes as a chain. With longer training, the path becomes thinner but never vanishes, and its magnitude stays significant.