1 Introduction
While conventional supervised learning has become more stable and is used in a wide range of applications, learning a complex model may require a daunting amount of labeled data. For this reason, transfer learning is often considered as a way to reduce the sample complexity of learning a new task¹.

¹ A task is defined as modeling the underlying distribution from a dataset of observations.

While there has been a significant amount of progress in domain adaptation (Ganin et al., 2016), this particular form of transfer learning requires a source task highly related to the target task and a large amount of data on the source task. For this reason, we seek to make progress on multitask transfer learning (also known as few-shot learning), which is still far behind human-level transfer capabilities (Lake et al., 2017). In the few-shot learning setup, a potentially large number of tasks are available to learn parameters shared across all tasks. Once the shared parameters are learned, the objective is to obtain good generalization performance on a new task with a small number of samples.

Recently, significant progress has been made to scale Bayesian neural networks to large tasks and to provide better approximations of the posterior distribution (Blundell et al., 2015; Louizos and Welling, 2017; Krueger et al., 2017). This, however, comes with an important question: "What does the posterior distribution actually represent?". For neural networks, the prior is often chosen for convenience and the approximate posterior is often very limited (Blundell et al., 2015). For sufficiently large datasets, the observations overcome the prior, and the posterior becomes a single mode around the true model², justifying most unimodal posterior approximations.

² The true model must have positive probability under the prior. Also, when the true model can be parameterized differently, modeling one or multiple modes is equivalent.

However, many usages of the posterior distribution require a meaningful prior, that is, a prior expressing our current knowledge of the task and, most importantly, our lack of knowledge of it. In addition, a good approximation of the posterior in the small sample size regime is required, including the ability to model multiple modes. This is indeed the case for Bayesian optimization (Snoek et al., 2012)
, Bayesian active learning (Gal et al., 2017), continual learning (Kirkpatrick et al., 2017), safe reinforcement learning (Berkenkamp et al., 2017), and the exploration-exploitation trade-off in reinforcement learning (Houthooft et al., 2016). Gaussian processes (Rasmussen, 2004) have historically been used for these applications, but an RBF kernel is too generic a prior for many tasks. More recent tools such as deep Gaussian processes (Damianou and Lawrence, 2013) show great potential, yet their scalability when learning from multiple tasks needs to be improved.

Our aim in this work is to learn a good prior across multiple tasks and transfer it to a new task. To be able to express a rich and flexible prior learned across a large number of tasks, we use neural networks trained with a variational Bayes procedure. By doing so, we are able to (i) isolate a small number of task-specific parameters and (ii) obtain a rich posterior distribution over this space. Additionally, the knowledge accumulated from previous tasks provides a meaningful prior on the target task, yielding a meaningful posterior distribution which can be used in the small data regime.
The rest of the paper is organized as follows. We first describe the proposed approach in Section 2, while reviewing hierarchical Bayes modeling. In Section 3, we extend the model to 3 levels of hierarchy to obtain a model better suited for classification. Section 4 outlines key differences between our approach and related methods. In Section 5, we conduct experiments on toy tasks to gain insight into the behavior of the algorithm. Finally, we show that we obtain a new state of the art on the Mini-Imagenet benchmark (Vinyals et al., 2016).
2 Learning a Deep Prior
By leveraging the variational Bayes approach, we show how we can learn a prior over models with neural networks. Also, by factorizing the posterior distribution into a task-agnostic and a task-specific component, we show an important simplification resulting in a scalable algorithm, which we refer to as deep prior.
2.1 Hierarchical Bayes
We consider learning a prior from previous tasks by learning a probability distribution p(w | α) over the weights w of a network, parameterized by α. This is done using a hierarchical Bayes approach across N tasks, with hyper-prior p(α). Each task j has its own parameters w_j, with w = {w_j}_{j=1}^N. Using all datasets S = {S_j}_{j=1}^N, where S_j = {(x_ij, y_ij)}_{i=1}^{n_j}, we have the following posterior:³

p(w, α | S) ∝ p(α) ∏_j p(w_j | α) ∏_i p(y_ij | x_ij, w_j).

³ p(x_ij) cancelled with itself from the denominator since it does not depend on w_j nor α. This would have been different for a generative approach.

The term p(y_ij | x_ij, w_j) corresponds to the likelihood of sample i of task j given a model parameterized by w_j, e.g. the probability of class y_ij from the softmax of a neural network parameterized by w_j with input x_ij. For the posterior p(α | S), we assume that the large amount of data available across multiple tasks will be enough to overcome a generic prior p(α), such as an isotropic Normal distribution. Hence, we consider a point estimate of the posterior p(α | S) using maximum a posteriori.⁴

⁴ This can be done through simply minimizing the cross entropy of a neural network with regularization.

We can now focus on the remaining term: p(w_j | α). Since w_j is potentially high dimensional with intricate correlations among the different dimensions, we cannot use a simple Gaussian distribution. Following inspiration from generative models such as GANs (Goodfellow et al., 2014) and VAE (Kingma and Welling, 2013), we use an auxiliary variable z_j and a deterministic function h_α projecting the noise z_j to the space of w_j, i.e. w_j = h_α(z_j). Marginalizing z_j, we have: p(w_j | α) = ∫ p(z_j) δ_{h_α(z_j)}(w_j) dz_j, where δ is the Dirac delta function. Unfortunately, directly marginalizing z_j is intractable for general h_α. To overcome this issue, we add z_j to the joint inference and marginalize it at inference time. Considering the point estimation of α, the full posterior is factorized as follows:

p(z_{1:N}, w_{1:N} | α, S) ∝ ∏_{j=1}^{N} p(z_j) p(w_j | z_j, α) ∏_{i=1}^{n_j} p(y_ij | x_ij, w_j),   (1)

where p(y_ij | x_ij, w_j) is the conventional likelihood function of a neural network with weight matrices generated from the function h_α, i.e.: w_j = h_α(z_j). A similar architecture has been used in Krueger et al. (2017) and Louizos and Welling (2017), but we will soon show that it can be reduced to a simpler architecture in the context of multitask learning. The other terms are defined as follows:

p(z_j) = N(z_j; 0, I_{d_z}),   (2)
p(w_j | z_j, α) = δ_{h_α(z_j)}(w_j),   (3)
p(z_j | α, S_j) ∝ p(z_j) ∏_i p(y_ij | x_ij, h_α(z_j)).   (4)

The task will consist of jointly learning a function h_α common to all tasks and a posterior distribution p(z_j | α, S_j) for each task. At inference time, predictions are performed by marginalizing z_j, i.e.: p(y | x, S_j, α) = E_{z_j ∼ p(z_j | α, S_j)} p(y | x, h_α(z_j)).
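The inference-time marginalization over z can be read as a simple Monte Carlo average. Below is a minimal sketch under illustrative assumptions: the Gaussian task posterior and the tiny two-class "network" are hypothetical stand-ins, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict(x, sample_z, likelihood, n_samples=100):
    """Monte Carlo estimate of p(y | x, S_j) = E_{z ~ posterior}[ p(y | x, h_alpha(z)) ]."""
    return np.mean([likelihood(x, sample_z()) for _ in range(n_samples)], axis=0)

# Hypothetical stand-ins: a 1-d Gaussian task posterior over z and a tiny
# "network" returning a softmax over two classes from (x, z).
def sample_z():
    return rng.normal(loc=0.5, scale=0.1)

def likelihood(x, z):
    logits = np.array([x * z, -x * z])
    e = np.exp(logits - logits.max())
    return e / e.sum()

p = predict(2.0, sample_z, likelihood)  # averaged class probabilities
```

Since each sampled likelihood is a proper distribution over classes, the Monte Carlo average remains one.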
2.2 Hierarchical Variational Bayes Neural Network
In the previous section, we described the different components for expressing the posterior distribution of Equation 4. While all those components are tractable, the normalization factor hidden behind the "∝" sign is still intractable. To address this issue, we follow the variational Bayes approach (Blundell et al., 2015).

Conditioning on α, we saw in Equation 1 that the posterior factorizes independently across all tasks. This reduces the joint Evidence Lower BOund (ELBO) to a sum of individual ELBOs, one per task.

Given a family of distributions q_{θ_j}(z_j | S_j), parameterized by θ_j, the Evidence Lower Bound for task j is:
ELBO_j = ∑_i E_{z_j ∼ q_{θ_j}} [ ln p(y_ij | x_ij, h_α(z_j)) ] − KL_j,   (5)

where,

KL_j = E_{z_j ∼ q_{θ_j}} [ ln ( q_{θ_j}(z_j | S_j) / p(z_j) ) ].   (6)

Notice that after simplification⁵, q_{θ_j} is no longer over the space of w_j but only over the space of z_j. Namely, the posterior distribution is factored into two components, one that is task specific and one that is task agnostic and can be shared with the prior. This amounts to finding a low dimensional manifold in the parameter space where the different tasks can be distinguished. Then, the posterior q_{θ_j}(z_j | S_j) only has to model which of the possible tasks are likely, given observations S_j, instead of modeling the high dimensional w_j.

⁵ We can justify the cancellation of the Dirac delta functions by instead considering a Gaussian with a finite variance ε. For all ε > 0, the cancellation is valid, so letting ε → 0, we recover the result.

But, most importantly, any explicit reference to w_j has now vanished from both Equation 5 and Equation 6. This simplification has an important positive impact on the scalability of the proposed approach. Since we no longer need to explicitly calculate the KL on the space of w_j, we can simplify the likelihood function to p(y_ij | x_ij, z_j, α), which can be a deep network parameterized by α, taking both x_ij and z_j as inputs. This contrasts with the previous formulation, where h_α(z_j) produces all the weights of a network, yielding an extremely high dimensional representation and slow training.
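The per-task ELBO of Equation 5 can be sketched as follows with a diagonal-Gaussian q_j and the closed-form KL against an N(0, I) prior. The toy linear-regression likelihood is an illustrative assumption; the paper's likelihood is a deep network taking (x, z) as input.

```python
import numpy as np

rng = np.random.default_rng(0)

def elbo_task(x, y, mu, log_sigma, log_lik, n_mc=32):
    """Per-task ELBO with q_j(z) = N(mu, diag(sigma^2)) and prior p(z) = N(0, I):
        ELBO_j = E_q[ sum_i ln p(y_i | x_i, z) ] - KL(q_j || p).
    `log_lik(x, y, z)` stands in for the shared network (hypothetical)."""
    sigma = np.exp(log_sigma)
    # Reparameterized samples z = mu + sigma * eps keep the estimator differentiable.
    eps = rng.standard_normal((n_mc, mu.size))
    z = mu + sigma * eps
    exp_ll = np.mean([log_lik(x, y, zi) for zi in z])
    # Closed-form KL between N(mu, diag(sigma^2)) and N(0, I).
    kl = 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * log_sigma)
    return exp_ll - kl

# Toy regression likelihood: y ~ N(x . z, 1), with a known true z.
def log_lik(x, y, z):
    resid = y - x @ z
    return -0.5 * np.sum(resid**2) - 0.5 * len(y) * np.log(2 * np.pi)

x = rng.standard_normal((10, 2))
y = x @ np.array([1.0, -1.0])
val = elbo_task(x, y, mu=np.zeros(2), log_sigma=np.zeros(2), log_lik=log_lik)
```

With mu = 0 and sigma = 1, the KL term vanishes and the ELBO reduces to the expected log-likelihood alone.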
2.3 Posterior Distribution
For modeling q_{θ_j}(z_j | S_j), we can use N(μ_j, σ_j), where μ_j and σ_j can be learned individually for each task. This, however, limits the posterior family to a single mode. For more flexibility, we also explore the usage of more expressive posteriors, such as Inverse Autoregressive Flow (IAF) (Kingma et al., 2016). This gives a flexible tool for learning a rich variety of multivariate distributions. In principle, we can use a different IAF for each task, but for memory and computational reasons, we use a single IAF for all tasks and we condition⁶ on an additional task-specific context c_j.

⁶ We follow the architecture proposed in Kingma et al. (2016).

Note that with IAF, we cannot evaluate q_{θ_j}(z_j | S_j) efficiently for arbitrary values of z_j, only for those which we just sampled, but this is sufficient for estimating the KL term with a Monte Carlo approximation, i.e.:

KL_j ≈ (1/n_s) ∑_{t=1}^{n_s} [ ln q_{θ_j}(z_j^{(t)} | S_j) − ln p(z_j^{(t)}) ],

where z_j^{(t)} ∼ q_{θ_j}(z_j | S_j). It is common to approximate the expectation with a single sample and let the minibatch average out the noise incurred on the gradient. We experimented with more samples, but this did not significantly improve the rate of convergence.
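This Monte Carlo KL estimator only needs log q at sampled points, which is exactly what a flow provides. A minimal sketch follows; the one-step affine flow standing in for a full IAF is an assumption, chosen so the estimate can be checked against the closed-form Gaussian KL.

```python
import numpy as np

rng = np.random.default_rng(1)

def mc_kl(sample_and_logq, log_p, n=1):
    """KL(q || p) ~ (1/n) sum [ ln q(z) - ln p(z) ] with z ~ q.
    Works for flows like IAF where ln q is only available at sampled points."""
    draws = [sample_and_logq() for _ in range(n)]
    return float(np.mean([lq - log_p(z) for z, lq in draws]))

# Hypothetical stand-in for an IAF sample: one affine step z = mu + sigma * eps.
mu, sigma = 0.3, 0.8
def sample_and_logq():
    eps = rng.standard_normal()
    z = mu + sigma * eps
    # log q(z) = log N(eps; 0, 1) - log|sigma|   (change of variables)
    log_q = -0.5 * (eps**2 + np.log(2 * np.pi)) - np.log(sigma)
    return z, log_q

def log_p(z):  # standard-normal prior
    return -0.5 * (z**2 + np.log(2 * np.pi))

kl_est = mc_kl(sample_and_logq, log_p, n=5000)
```

For this affine flow, q is exactly N(0.3, 0.8²), so the estimate should approach the analytic KL of about 0.088.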
2.4 Training Procedure
In order to compute the loss proposed in Equation 5, we would need to evaluate every sample of every task. To accelerate the training, we describe a procedure following the minibatch principle. First, we replace the summations with expectations:

∑_j ELBO_j = M · E_{(j,i)} [ E_{z_j ∼ q_{θ_j}} [ ln p(y_ij | x_ij, z_j, α) ] − (1/n_j) KL_j ],   (7)

where (j, i) is sampled uniformly from the concatenated meta-dataset of size M = ∑_j n_j. Now it suffices to approximate the gradient with samples across all tasks. Thus, we simply concatenated all datasets into a meta-dataset and added j as an extra field. Then, we sample uniformly⁷ with replacement from the meta-dataset. Notice the 1/n_j term in front of the KL in Equation 7: it indicates that, individually for each task, the objective finds the appropriate trade-off between the prior and the observations. Refer to Algorithm 1 for more details on the procedure.

⁷ We also explored a sampling scheme that always makes sure to have a minimum number of samples from the same task. The aim was to reduce gradient variance on task-specific parameters, but we did not observe any benefits.
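The minibatch estimator of Equation 7 can be sketched as follows. The likelihood and KL callables are hypothetical stand-ins; the point is that scaling each drawn sample's task KL by 1/n_j makes the full-dataset sum count each KL_j exactly once.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_meta_batch(task_sizes, batch_size):
    """Sample (task id, sample id) pairs uniformly with replacement from the
    concatenated meta-dataset, where each record carries its task id j."""
    meta = [(j, i) for j, n in enumerate(task_sizes) for i in range(n)]
    return [meta[k] for k in rng.integers(0, len(meta), size=batch_size)]

def minibatch_neg_elbo(batch, log_lik, kl_task, task_sizes):
    """Stochastic (negative) ELBO per sample: each drawn pair (j, i) contributes
    its expected log-likelihood minus the task KL scaled by 1/n_j."""
    total = 0.0
    for j, i in batch:
        total += log_lik(j, i) - kl_task(j) / task_sizes[j]
    return -total / len(batch)

# Hypothetical constant stand-ins for the per-sample likelihood and per-task KL.
task_sizes = [4, 10, 7]
loss = minibatch_neg_elbo(sample_meta_batch(task_sizes, 16),
                          log_lik=lambda j, i: -1.0,
                          kl_task=lambda j: 2.0,
                          task_sizes=task_sizes)
```

Summing this per-sample loss over the whole meta-dataset recovers −∑_j ELBO_j exactly, which is what makes uniform sampling with replacement an unbiased gradient estimator.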
3 Extending to 3 Levels of Hierarchy
Deep prior gives rise to a very flexible way to transfer knowledge from multiple tasks. However, there is still an important assumption at the heart of deep prior (and other VAE-based approaches such as Edwards and Storkey (2016)): the task information must be encoded in a low dimensional variable z. In Section 5, we show that this is appropriate for regression, but for image classification, it is not the most natural assumption. Hence, we propose to extend to a third level of hierarchy by introducing a latent classifier on the obtained representation.

In Equation 5, for a given⁸ task, we decomposed the likelihood into ∏_i p(y_i | x_i, z) by assuming that the neural network is directly predicting p(y_i | x_i, z). Here, we introduce a latent classifier v to make the prediction p(y_i | x_i, v). This can be, for example, a Gaussian linear regression on the representation produced by the neural network. The general form now factorizes as follows: p(y_{1:n} | x_{1:n}, z) = ∫ p(v | z) ∏_i p(y_i | x_i, v) dv, which is commonly called the marginal likelihood.

⁸ We removed j from the equations to alleviate the notation.

To compute the ELBO in Equation 5 and update the parameters, the only requirement is to be able to compute this marginal likelihood. There are closed form solutions for, e.g., linear regression with a Gaussian prior, but our aim is to compare with algorithms such as Prototypical Networks (Proto Net) (Snell et al., 2017) on a classification benchmark. Alternatively, we can factor the marginal likelihood as follows: ∏_i p(y_i | y_{1:i−1}, x_{1:i}, z). If a well calibrated task uncertainty is not required, one can also use a leave-one-out procedure: ∏_i p(y_i | y_{−i}, x_{1:n}, z). Both of these factorizations correspond to training the latent classifier n times on a subset of the training set and evaluating it on a left-out sample. We refer the reader to Rasmussen (2004, Chapter 5) for a discussion on the difference between leave-one-out cross validation and marginal likelihood.
For a practical algorithm, we propose a closed form solution for the leave-one-out procedure in prototypical networks. In its standard form, the prototypical network produces a prototype c_k by averaging all representations r_i of class k, i.e. c_k = (1/N_k) ∑_{i ∈ I_k} r_i, where I_k is the set of indices with label k and N_k = |I_k|. Then, predictions are made using p(y = k | x) ∝ exp(−‖r(x) − c_k‖²).

Theorem 1.

Let c_k^{−i} be the prototypes computed without example i in the training set. Then, for y_i = k:

‖r_i − c_k^{−i}‖ = (N_k / (N_k − 1)) ‖r_i − c_k‖,   (8)

while for k' ≠ k, c_{k'}^{−i} = c_{k'}.

We defer the proof to the supplementary materials. Hence, we only need to compute the prototypes once and rescale the Euclidean distance when comparing with a sample that was used for computing the current prototype. This gives an efficient algorithm with the same complexity as the original one and a good proxy for the marginal likelihood.
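The rescaling trick can be sketched as follows: prototypes are computed once, and only the own-class squared distance is multiplied by (N_k/(N_k − 1))². The function and variable names are illustrative; the identity itself follows directly from the theorem.

```python
import numpy as np

def loo_sq_distances(reps, labels, n_classes):
    """Squared distances to leave-one-out prototypes via the rescaling identity:
    for x_i in class k with N_k >= 2 members,
        || r_i - c_k^(-i) ||^2 = (N_k / (N_k - 1))^2 * || r_i - c_k ||^2,
    while distances to other classes' prototypes are unchanged."""
    protos = np.stack([reps[labels == k].mean(axis=0) for k in range(n_classes)])
    counts = np.array([(labels == k).sum() for k in range(n_classes)], dtype=float)
    d2 = ((reps[:, None, :] - protos[None, :, :]) ** 2).sum(-1)   # (n, n_classes)
    scale = (counts / (counts - 1.0)) ** 2
    d2[np.arange(len(reps)), labels] *= scale[labels]             # own-class column only
    return d2

rng = np.random.default_rng(0)
reps = rng.standard_normal((9, 3))
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
d2 = loo_sq_distances(reps, labels, 3)
```

This keeps the cost identical to the standard prototypical network forward pass: one prototype per class, one distance matrix, and a cheap per-row rescale.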
4 Related Work
Hierarchical Bayes algorithms for multitask learning have a long history (Daumé III, 2009; Wan et al., 2012; Bakker and Heskes, 2003). However, most of this literature focuses on simple statistical models and does not consider transferring to new tasks.

More recently, Edwards and Storkey (2016) and Bouchacourt et al. (2017) explored hierarchical Bayesian inference with neural networks and evaluated on new tasks. Both of them use a two-level hierarchical VAE for modeling the observations. While similar, our approach differs in a few ways. We use a discriminative approach and focus on model uncertainty. We show that we can obtain a posterior on z without having to explicitly encode the task. We also explore the usage of more complex posterior families such as IAF. Those differences make our algorithm simpler to implement and easier to scale to larger datasets.

Some recent works on meta-learning also target transfer learning from multiple tasks. Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017) finds a shared parameter θ such that, for a given task, one gradient step on θ using the training set will yield a model with good predictions on the test set. Then, a meta-gradient update is performed from the test error through the one gradient step on the training set, to update θ. This yields a simple and scalable procedure which learns to generalize. Recently, Grant et al. (2018) considered a Bayesian version of MAML. Additionally, Ravi and Larochelle (2016) also consider a meta-learning approach where an encoding network reads the training set and generates the parameters of a model, which is trained to perform well on the test set.
Finally, recent interest in few-shot learning has given rise to various algorithms capable of transferring from multiple tasks. Many of these approaches (Vinyals et al., 2016; Snell et al., 2017) find a representation where a simple algorithm can produce a classifier from a small training set. Bauer et al. (2017) use a neural network pretrained on a standard multiclass dataset to obtain a good representation and use class statistics to transfer prior knowledge to new classes.
5 Experimental Results
Through experiments, we aim to answer the following questions: i) Can deep prior learn a meaningful prior on tasks? ii) Can it compete with the state of the art on a strong benchmark? iii) In which situations do deep prior and other approaches fail?
5.1 Regression on one dimensional Harmonic signals
To gain a good insight into the behavior of the prior and posterior, we choose a collection of one dimensional regression tasks. We also want to test the ability of the method to learn the task and not just match the observed points. For this, we use periodic functions and test the ability of the regressor to extrapolate outside of its domain.

Specifically, each dataset consists of (x, y) pairs (noisily) sampled from a sum of two sine waves with different phases and amplitudes and a frequency ratio of 2: y = a₁ sin(ω·x + b₁) + a₂ sin(2ω·x + b₂) + ε, where ε ∼ N(0, σ_y²). We construct a meta-training set of 5000 tasks, sampling the amplitudes, the phases, and the base frequency ω independently for each task. To evaluate the ability to extrapolate outside of the task's domain, we make sure that each task has a different domain. Specifically, the x values of each task are sampled uniformly from an interval centered at a task-specific location, itself sampled from the meta-domain. The number of training samples ranges from 4 to 50 for each task, and evaluation is performed on 100 samples from tasks never seen during training.
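A task generator for this setup can be sketched as follows. Only the two-sine form with frequency ratio 2, the noise, and the task-specific domain come from the text; the exact sampling ranges below are illustrative assumptions.

```python
import numpy as np

def sample_harmonic_task(rng, n_samples, noise_std=0.1):
    """One regression task: y = a1*sin(w*x + b1) + a2*sin(2*w*x + b2) + noise.
    Sampling ranges are illustrative, not the paper's exact values."""
    a1, a2 = rng.uniform(0.5, 2.0, size=2)        # amplitudes
    b1, b2 = rng.uniform(0.0, 2 * np.pi, size=2)  # phases
    w = rng.uniform(5.0, 7.0)                     # base frequency
    center = rng.uniform(-4.0, 4.0)               # task-specific domain location
    x = center + rng.uniform(-2.0, 2.0, size=n_samples)
    y = a1 * np.sin(w * x + b1) + a2 * np.sin(2 * w * x + b2)
    return x, y + noise_std * rng.standard_normal(n_samples)

rng = np.random.default_rng(0)
tasks = [sample_harmonic_task(rng, n) for n in rng.integers(4, 51, size=10)]
```

Each call yields a fresh task; a meta-training set is simply a list of a few thousand such draws with per-task sample counts between 4 and 50.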
Model

Once z is sampled from the IAF posterior, we simply concatenate it with x and use 12 densely connected layers of 128 neurons with residual connections between every other layer. The final layer linearly projects to 2 outputs, μ(x) and s(x), where s(x) is used to produce a heteroskedastic noise scale σ(x). Finally, we use the Gaussian likelihood N(y; μ(x), σ(x)²) to express the likelihood of the training set. To help gradient flow, we use ReLU activation functions and Layer Normalization
(Ba et al., 2016).⁹

⁹ Layer norm only marginally helped.

Results
Figure 1(a) depicts examples of tasks with 1, 2, 8, and 64 samples. The true underlying function is in blue, while 10 samples from the posterior distribution are faded in the background. The thickness of each line represents 2 standard deviations. The first plot has only a single data point and mostly represents samples from the prior, passing near this observed point. Interestingly, all samples are close to some parameterization of the harmonic family defined above. Next, with only 2 points, the posterior starts to predict curves highly correlated with the true function. However, note that the uncertainty is overly optimistic and that the posterior fails to fully represent all possible harmonics fitting those two points. We discuss this issue more in depth in the supplementary materials. Next, with 8 points, it manages to mostly capture the task, with reasonable uncertainty. Finally, with 64 points the model is certain of the task.

To add a strong baseline, we experimented with MAML (Finn et al., 2017). After exploring a variety of hyperparameter values and architecture designs, we could not make it work for our two-harmonics meta-task. We thus reduced the meta-task to a single harmonic and reduced the base frequency range by a factor of two. With those simplifications, we managed to make it converge, but the results are far behind those of deep prior, even in this simplified setup. Figure 1(b) shows some form of adaptation with 16 samples per task, but the result is jittery and the extrapolation capacity is very limited. Those results were obtained with a densely connected network of 8 hidden layers of 64 units¹⁰, with residual connections every other layer. The training is performed with two gradient steps and the evaluation with 5 steps. To make sure our implementation is valid, we first replicated their regression result with a fixed frequency, as reported in Finn et al. (2017).

¹⁰ We also experimented with various other architectures.
Finally, to provide a stronger baseline, we remove the KL regularizer of deep prior and reduce the posterior to a deterministic distribution centered on its mean. The mean squared error is reported in Figure 2 for an increasing dataset size. This highlights how the uncertainty provided by deep prior yields a systematic improvement.
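The heteroskedastic Gaussian likelihood described in the Model paragraph above can be sketched as a negative log-likelihood over the two network outputs per input. Mapping the raw scale output through a softplus is an assumption here, not necessarily the paper's exact parameterization.

```python
import numpy as np

def heteroskedastic_nll(y, mu, rho):
    """Gaussian negative log-likelihood where the network's two outputs per
    input are the mean mu(x) and a raw scale rho(x). The softplus mapping
    from rho to sigma is an illustrative assumption."""
    sigma = np.log1p(np.exp(rho)) + 1e-6   # softplus keeps sigma > 0
    return 0.5 * np.sum(((y - mu) / sigma) ** 2
                        + 2.0 * np.log(sigma)
                        + np.log(2.0 * np.pi))

nll = heteroskedastic_nll(np.array([0.0]), np.array([0.0]), np.array([0.0]))
```

Letting the network predict its own noise scale allows the loss to be lenient where the data are genuinely noisy while staying sharp elsewhere, which matters for the calibrated uncertainties studied in this section.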
5.2 Mini-Imagenet Experiment
Vinyals et al. (2016) proposed to use a subset of Imagenet to generate a benchmark for few-shot learning. Each task is generated by sampling 5 classes uniformly and 5 training samples per class; the remaining images from the 5 classes are used as query images to compute accuracy. The number of unique classes sums to 100, each having 600 example images. To perform meta-validation and meta-test on unseen tasks (and classes), we isolate 16 and 20 classes respectively from the original set of 100, leaving 64 classes for the training tasks. This follows the procedure suggested in Ravi and Larochelle (2016).
The training procedure proposed in Section 2 requires training on a fixed set of tasks. We found that 1000 tasks yield enough diversity and that beyond 9000 tasks, the embeddings are not visited often enough over the course of the training. To increase diversity during training, the training and test sets are resampled every time from a fixed train-test split of the given task¹¹.

¹¹ If the train and test split is not fixed for a given task, one could leak test information through the task embeddings across different resamplings of the task.
We first experimented with the vanilla version of deep prior (Section 2). In this formulation, we use a ResNet (He et al., 2016) network, where we insert FiLM layers (Perez et al., 2017; de Vries et al., 2017) between each residual block to condition on the task. Then, after flattening the output of the final convolution layer and reducing to 64 hidden units, we apply a 64 × 5 matrix generated from a transformation of z. Finally, predictions are made through a softmax layer. We found this architecture slow to train, as the generated last layer is noisy for a long time and prevents the rest of the network from learning. Nevertheless, we obtained 62.6% accuracy on Mini-Imagenet, on par with many strong baselines.
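The FiLM conditioning used between residual blocks is a per-channel affine modulation of the feature map by parameters predicted from the task variable z (Perez et al., 2017). The linear conditioning head below is a hypothetical stand-in; only the modulation itself is FiLM.

```python
import numpy as np

def film(feature_map, gamma, beta):
    """FiLM: per-channel affine modulation of a conv feature map.
    feature_map: (batch, channels, H, W); gamma, beta: (batch, channels)."""
    return gamma[:, :, None, None] * feature_map + beta[:, :, None, None]

# Hypothetical conditioning head: a linear map from z to (gamma, beta),
# initialized near the identity (gamma ~ 1, beta ~ 0).
rng = np.random.default_rng(0)
C = 8
W_cond = rng.standard_normal((2 * C, 16)) * 0.01
def condition(z):
    gb = z @ W_cond.T
    return 1.0 + gb[:, :C], gb[:, C:]

z = rng.standard_normal((4, 16))      # task embeddings for a batch of 4 tasks
fmap = rng.standard_normal((4, C, 5, 5))
gamma, beta = condition(z)
out = film(fmap, gamma, beta)
```

Because gamma and beta depend only on z, the same convolutional backbone can be steered into a task-specific representation without generating any convolutional weights.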
To enhance the model, we combine task conditioning with prototypical networks, as proposed in Section 3. This approach alleviates the need to generate the final layer of the network, thus accelerating training and increasing generalization performance. While we no longer have a well calibrated task uncertainty, the KL term still acts as an effective regularizer and prevents overfitting on small datasets¹². With this improvement, we now set the new state of the art (Table 2). In Table 2, we also perform an ablation study to highlight the contributions of the different components of the model. In sum, a deeper network with residual connections yields major improvements. Also, task conditioning does not yield an improvement if the leave-one-out procedure is not used. Finally, the KL regularizer is the final touch needed to obtain the state of the art.

¹² We had to cross validate the weight of the KL term and obtained our best results using values around 0.1.
5.3 Heterogeneous Collection of Tasks
In Section 5.2, we saw that conditioning helps, but only yields a minor improvement. This is due to the fact that Mini-Imagenet is a very homogeneous collection of tasks, where a single representation is sufficient to obtain good results. To support this claim, we provide a new benchmark¹³ of synthetic symbols, which we refer to as Synbols. Images are generated using various font families on different alphabets (Latin, Greek, Cyrillic, Chinese) and background noise (Figure 2, right). For each task, we have to predict either a subset of 4 font families or 4 symbols with only 4 examples. Predicting either fonts or symbols with two separate Prototypical Networks yields 84.2% and 92.3% accuracy respectively, with an average of 88.3%. However, blending the two collections of tasks in a single benchmark brings the prototypical network down to 76.8%. Conditioning on the task with deep prior brings the accuracy back up to 83.5%. While there is still room for improvement, this supports the claim that a single representation only works on homogeneous collections of tasks and that task conditioning helps in learning a family of representations suitable for heterogeneous benchmarks.

¹³ Code and dataset will be provided.
6 Conclusion
Using variational Bayes, we developed a scalable algorithm for hierarchical Bayesian learning of neural networks, called deep prior. This algorithm is capable of transferring information between tasks that are potentially remarkably different. Results on the Harmonics dataset show that the learned manifold across tasks exhibits the properties of a meaningful prior. We also found that MAML, while very general, has a hard time adapting when tasks are too different. Moreover, algorithms based on a single image representation only work well when all tasks can succeed with a very similar set of features. Together, those findings allowed us to develop the new state of the art on Mini-Imagenet.
References
 Ba et al. (2016) J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

 Bakker and Heskes (2003) B. Bakker and T. Heskes. Task clustering and gating for Bayesian multitask learning. Journal of Machine Learning Research, 4(May):83–99, 2003.
 Bauer et al. (2017) M. Bauer, M. Rojas-Carulla, J. B. Świątkowski, B. Schölkopf, and R. E. Turner. Discriminative k-shot learning using probabilistic models. arXiv preprint arXiv:1706.00326, 2017.
 Berkenkamp et al. (2017) F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause. Safe modelbased reinforcement learning with stability guarantees. In Advances in Neural Information Processing Systems, pages 908–919, 2017.
 Blundell et al. (2015) C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
 Bouchacourt et al. (2017) D. Bouchacourt, R. Tomioka, and S. Nowozin. Multilevel variational autoencoder: Learning disentangled representations from grouped observations. arXiv preprint arXiv:1705.08841, 2017.
 Damianou and Lawrence (2013) A. Damianou and N. Lawrence. Deep gaussian processes. In Artificial Intelligence and Statistics, pages 207–215, 2013.
 Daumé III (2009) H. Daumé III. Bayesian multitask learning with latent hierarchies. In Proceedings of the TwentyFifth Conference on Uncertainty in Artificial Intelligence, pages 135–142. AUAI Press, 2009.
 de Vries et al. (2017) H. de Vries, F. Strub, J. Mary, H. Larochelle, O. Pietquin, and A. Courville. Modulating early visual processing by language. In Advances in Neural Information Processing Systems, pages 6597–6607, 2017.
 Edwards and Storkey (2016) H. Edwards and A. Storkey. Towards a neural statistician. arXiv preprint arXiv:1606.02185, 2016.
 Finn et al. (2017) C. Finn, P. Abbeel, and S. Levine. Modelagnostic metalearning for fast adaptation of deep networks. In Proc. International Conference on Machine Learning, pages 1126–1135, 2017.
 Gal et al. (2017) Y. Gal, R. Islam, and Z. Ghahramani. Deep bayesian active learning with image data. arXiv preprint arXiv:1703.02910, 2017.
 Ganin et al. (2016) Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domainadversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
 Goodfellow et al. (2014) I. Goodfellow, J. PougetAbadie, M. Mirza, B. Xu, D. WardeFarley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
 Grant et al. (2018) E. Grant, C. Finn, S. Levine, T. Darrell, and T. Griffiths. Recasting gradientbased metalearning as hierarchical bayes. arXiv preprint arXiv:1801.08930, 2018.

 He et al. (2016) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
 Houthooft et al. (2016) R. Houthooft, X. Chen, Y. Duan, J. Schulman, F. De Turck, and P. Abbeel. VIME: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, pages 1109–1117, 2016.
 Kingma and Welling (2013) D. P. Kingma and M. Welling. Autoencoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 Kingma et al. (2016) D. P. Kingma, T. Salimans, and M. Welling. Improving variational inference with inverse autoregressive flow. arXiv preprint arXiv:1606.04934, 2016.
 Kirkpatrick et al. (2017) J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. GrabskaBarwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.
 Krueger et al. (2017) D. Krueger, C.W. Huang, R. Islam, R. Turner, A. Lacoste, and A. Courville. Bayesian hypernetworks. arXiv preprint arXiv:1710.04759, 2017.
 Lake et al. (2017) B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J. Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.
 Louizos and Welling (2017) C. Louizos and M. Welling. Multiplicative normalizing flows for variational bayesian neural networks. arXiv preprint arXiv:1703.01961, 2017.
 Mishra et al. (2018) N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel. A simple neural attentive metalearner. In ICLR, 2018.
 Munkhdalai et al. (2018) T. Munkhdalai, X. Yuan, S. Mehri, and A. Trischler. Rapid adaptation with conditionally shifted neurons. In ICML, 2018.
 Perez et al. (2017) E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville. Film: Visual reasoning with a general conditioning layer. arXiv preprint arXiv:1709.07871, 2017.
 Rasmussen (2004) C. E. Rasmussen. Gaussian processes in machine learning. In Advanced lectures on machine learning, pages 63–71. Springer, 2004.
 Ravi and Larochelle (2016) S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. 2016.
 Snell et al. (2017) J. Snell, K. Swersky, and R. S. Zemel. Prototypical networks for fewshot learning. In Advances in Neural Information Processing Systems, pages 4080–4090, 2017.
 Snoek et al. (2012) J. Snoek, H. Larochelle, and R. P. Adams. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951–2959, 2012.
 Vinyals et al. (2016) O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638. 2016.
 Wan et al. (2012) J. Wan, Z. Zhang, J. Yan, T. Li, B. D. Rao, S. Fang, S. Kim, S. L. Risacher, A. J. Saykin, and L. Shen. Sparse bayesian multitask learning for predicting cognitive outcomes from neuroimaging measures in alzheimer’s disease. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 940–947. IEEE, 2012.
7 Appendix
7.1 Proof of Leave-One-Out

Theorem 1.

Let c_k^{−i} be the prototypes computed without example i in the training set. Then, for y_i = k:

‖r_i − c_k^{−i}‖ = (N_k / (N_k − 1)) ‖r_i − c_k‖,   (9)

while for k' ≠ k, c_{k'}^{−i} = c_{k'}.

Proof.

Let c_k = (1/N_k) ∑_{j ∈ I_k} r_j, and assume y_i = k. Then,

c_k^{−i} = (N_k c_k − r_i) / (N_k − 1),   (10)

so that

r_i − c_k^{−i} = r_i − (N_k c_k − r_i) / (N_k − 1)   (11)
             = (N_k r_i − N_k c_k) / (N_k − 1)   (12)
             = (N_k / (N_k − 1)) (r_i − c_k).   (13)

Taking norms on both sides yields the result. When y_i ≠ k, the result is trivially c_k^{−i} = c_k, since removing example i does not change the prototype of class k. ∎
7.2 Limitations of IAF
When experimenting with the Harmonics toy dataset in Section 5.1, we observed issues with repeatability, most likely due to local minima. We decided to investigate further the multimodality of posterior distributions under small sample sizes and the capacity of IAF to model them. For this purpose, we simplified the problem to a single sine function and removed the burden of learning the prior. The likelihood of the observations is defined as follows:

y ∼ N(sin(ω·x + b), σ_y²),

where σ_y is given. Only the frequency ω and the bias b are unknown¹⁴, yielding a bidimensional problem that is easy to visualize and quick to train. We use a dataset of 2 points, and the corresponding posterior distribution is depicted in Figure 3 (middle), with an orange point at the location of the true underlying function. Some samples from the posterior distribution can be observed in Figure 3 (top).

¹⁴ We scale ω and b so that the range of interesting values fits well within the prior's high-density region. This makes it more approachable for IAF.

We observe a high amount of multimodality in the posterior distribution (Figure 3, middle). Some of the modes are just the mirror of another mode and correspond to the same function, e.g. (ω, b) and (−ω, π − b) parameterize the same sine. But most of the time they correspond to different functions, and modeling them is crucial for some applications. The number of modes varies a lot with the choice of observed dataset, ranging from a few to several dozens. Now, the question is: "How many of those modes can IAF model?". Unfortunately, Figure 3 (bottom) reveals poor capability for this particular case. After carefully adjusting the hyperparameters¹⁵ of IAF, exploring different initialization schemes, and running multiple restarts, we rarely capture more than two modes (sometimes 4). Moreover, it is not able to fully separate the modes. There is systematically a thin path of density connecting the modes like a chain. With longer training, the path becomes thinner but never vanishes, and its magnitude stays significant.

¹⁵ 12 layers with a 64-hidden-unit MADE network for each layer, learned with Adam.