
A Generative Deep Recurrent Model for Exchangeable Data

We present a novel model architecture which leverages deep learning tools to perform exact Bayesian inference on sets of high dimensional, complex observations. Our model is provably exchangeable, meaning that the joint distribution over observations is invariant under permutation: this property lies at the heart of Bayesian inference. The model does not require variational approximations to train, and new samples can be generated conditional on previous samples, with cost linear in the size of the conditioning set. The advantages of our architecture are demonstrated on learning tasks requiring generalisation from short observed sequences while modelling sequence variability, such as conditional image generation, few-shot learning, set completion, and anomaly detection.



1 Introduction

We address the problem of modelling unordered sets of objects that have some characteristic in common. Set modelling has been a recent focus in machine learning, both due to relevant application domains and to efficiency gains when dealing with groups of objects Edwards and Storkey (2017); Szabo et al. (2016); Vinyals et al. (2016a); Zaheer et al. (2017). The relevant concept in statistics is the notion of an exchangeable sequence of random variables – a sequence where any re-ordering of the elements is equally likely. To fulfil this definition, subsequent observations must behave like previous ones, which implies that we can make predictions about the future. This property allows the formulation of some machine learning problems in terms of modelling exchangeable data. For instance, one can think of few-shot concept learning as learning to complete short exchangeable sequences Lake et al. (2015). A related example comes from the generative image modelling field, where we might want to generate images that are in some ways similar to the ones from a given set. At present, however, there are few flexible and provably exchangeable deep generative models to solve this problem.

Formally, a finite or infinite sequence of random variables $x_1, x_2, x_3, \dots$ is said to be exchangeable if for all $n$ and all permutations $\pi$

$$p(x_1, \dots, x_n) = p(x_{\pi(1)}, \dots, x_{\pi(n)}), \tag{1}$$

i.e. the joint probability remains the same under any permutation of the sequence. If random variables in the sequence are independent and identically distributed (i.i.d.), then it is easy to see that the sequence is exchangeable. The converse is false: exchangeable random variables can be correlated. One example of an exchangeable but non-i.i.d. sequence is a sequence of variables $x_1, \dots, x_n$, which jointly have a multivariate normal distribution $\mathcal{N}_n(\mathbf{0}, \boldsymbol{\Sigma})$ with the same variance and covariance for all the dimensions Aldous et al. (1985): $\Sigma_{ii} = 1$ and $\Sigma_{ij, i \neq j} = \rho$, with $0 \leq \rho < 1$.

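The concept of exchangeability is intimately related to Bayesian statistics. De Finetti’s theorem states that every exchangeable process (infinite sequence of random variables) is a mixture of i.i.d. processes: for every $n$,

$$p(x_1, \dots, x_n) = \int p(\theta) \prod_{i=1}^{n} p(x_i \mid \theta) \, d\theta, \tag{2}$$

where $\theta$ is some parameter (finite or infinite dimensional) conditioned on which, the random variables are i.i.d. Aldous et al. (1985). In our previous Gaussian example (unit variances, common covariance $\rho$), one can prove that $x_1, \dots, x_n$ are i.i.d. with $x_i \sim \mathcal{N}(\theta, 1 - \rho)$ conditioned on $\theta \sim \mathcal{N}(0, \rho)$.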
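The Gaussian example can be checked numerically. The sketch below assumes unit variances and a common covariance `rho` (the function names are illustrative, not from the paper's code): it evaluates the joint log-density at a vector and at a permuted copy, and draws samples through the mixture construction to compare the empirical covariance with the equicorrelated one.

```python
import numpy as np

def equicorrelated_cov(n, rho):
    """Covariance with unit variances and a common covariance rho (assumed setup)."""
    return (1.0 - rho) * np.eye(n) + rho * np.ones((n, n))

def gaussian_logpdf(x, cov):
    """Log-density of a zero-mean multivariate normal."""
    n = x.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    quad = x @ np.linalg.solve(cov, x)
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + quad)

def sample_mixture(n, rho, size, rng):
    """Mixture view: theta ~ N(0, rho), then x_i | theta ~ N(theta, 1 - rho) i.i.d."""
    theta = rng.normal(0.0, np.sqrt(rho), size=(size, 1))
    return theta + rng.normal(0.0, np.sqrt(1.0 - rho), size=(size, n))

rng = np.random.default_rng(0)
rho, n = 0.6, 4
cov = equicorrelated_cov(n, rho)

# Exchangeability: the joint density does not change when the entries are permuted.
x = rng.normal(size=n)
perm = rng.permutation(n)
lp, lp_perm = gaussian_logpdf(x, cov), gaussian_logpdf(x[perm], cov)

# Mixture of i.i.d. processes: the samples are correlated with covariance close to rho.
emp_cov = np.cov(sample_mixture(n, rho, 200_000, rng), rowvar=False)
```

The entries of each sampled sequence are dependent only through the shared `theta`, which is the role the parameter plays in de Finetti's representation.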

In terms of predictive distributions $p(x_n \mid x_{1:n-1})$, the stochastic process in Eq. 2 can be written as

$$p(x_n \mid x_{1:n-1}) = \int p(x_n \mid \theta) \, p(\theta \mid x_{1:n-1}) \, d\theta, \tag{3}$$

by conditioning both sides on $x_{1:n-1}$. Eq. 3 is exactly the posterior predictive distribution, where we marginalise the likelihood of $x_n$ given $\theta$ with respect to the posterior distribution of $\theta$. From this follows one possible interpretation of de Finetti’s theorem: learning to fit an exchangeable model to sequences of data is implicitly the same as learning to reason about the hidden variables behind the data.

One strategy for defining models of exchangeable sequences is through explicit Bayesian modelling: one defines a prior $p(\theta)$, a likelihood $p(x_i \mid \theta)$ and calculates the posterior in Eq. 2 directly. Here, the key difficulty is the intractability of the posterior and the predictive distribution $p(x_n \mid x_{1:n-1})$. Both of these expressions require integrating over the parameter $\theta$, so we might end up having to use approximations. This could violate the exchangeability property and make explicit Bayesian modelling difficult.

On the other hand, we do not have to explicitly represent the posterior to ensure exchangeability. One could define a predictive distribution $p(x_n \mid x_{1:n-1})$ directly, and as long as the process is exchangeable, it is consistent with Bayesian reasoning. The key difficulty here is defining an easy-to-calculate $p(x_n \mid x_{1:n-1})$ which satisfies exchangeability. For example, it is not clear how to train or modify an ordinary recurrent neural network (RNN) to model exchangeable data. In our opinion, the main challenge is to ensure that a hidden state contains information about all previous inputs $x_{1:n}$ regardless of sequence length.

In this paper, we propose a novel architecture which combines features of the approaches above, which we will refer to as BRUNO: Bayesian RecUrrent Neural mOdel. Our model is provably exchangeable, and makes use of deep features learned from observations so as to model complex data types such as images. To achieve this, we construct a bijective mapping between random variables $x_i \in \mathcal{X}$ in the observation space and features $z_i \in \mathcal{Z}$, and explicitly define an exchangeable model for the sequences $z_1, z_2, z_3, \dots$, where we know an analytic form of $p(z_n \mid z_{1:n-1})$ without explicitly computing the integral in Eq. 3.

Using BRUNO, we are able to generate samples conditioned on the input sequence by sampling directly from $p(x_n \mid x_{1:n-1})$. The latter is also tractable to evaluate, i.e. has linear complexity in the number of data points. With respect to model training, evaluating the predictive distribution requires a single pass through the neural network that implements the mapping. The model can be learned straightforwardly, since $p(x_n \mid x_{1:n-1})$ is differentiable with respect to the model parameters.

The paper is structured as follows. In Section 2 we will look at two methods selected to highlight the relation of our work with previous approaches to modelling exchangeable data. Section 3 will describe BRUNO, along with necessary background information. In Section 4, we will use our model for conditional image generation, few-shot learning, set expansion and set anomaly detection. Our code is available at

2 Related work

Bayesian sets Ghahramani and Heller (2006) aim to model exchangeable sequences of binary random variables by analytically computing the integrals in Eqs. 2 and 3. This is made possible by using a Bernoulli distribution for the likelihood and a beta distribution for the prior. To apply this method to other types of data, e.g. images, one needs to engineer a set of binary features Heller and Ghahramani (2006). In that case, there is usually no one-to-one mapping between the input space $\mathcal{X}$ and the feature space $\mathcal{Z}$: in consequence, it is not possible to draw samples from $p(x_n \mid x_{1:n-1})$. Unlike Bayesian sets, our approach does have a bijective transformation, which guarantees that inference in $\mathcal{Z}$ is equivalent to inference in space $\mathcal{X}$.

The neural statistician Edwards and Storkey (2017) is an extension of a variational autoencoder model Kingma and Welling (2014); Rezende et al. (2014) applied to datasets. In addition to learning an approximate inference network over the latent variable $z_i$ for every $x_i$ in the set, approximate inference is also implemented over a latent variable $c$ – a context that is global to the dataset. The architecture for the inference network $q(c \mid x_1, \dots, x_n)$ maps every $x_i$ into a feature vector and applies a mean pooling operation across these representations. The resulting vector is then used to produce parameters of a Gaussian distribution over $c$. Mean pooling makes $q(c \mid x_1, \dots, x_n)$ invariant under permutations of the inputs. In addition to the inference networks, the neural statistician also has a generative component $p(x_1, \dots, x_n \mid c)$ which assumes that the $x_i$’s are independent given $c$. Here, it is easy to see that $c$ plays the role of $\theta$ from Eq. 2. In the neural statistician, it is intractable to compute $p(x_1, \dots, x_n)$, so its variational lower bound is used instead. In our model, we perform an implicit inference over $\theta$ and can exactly compute predictive distributions and the marginal likelihood. Despite these differences, both the neural statistician and BRUNO can be applied in similar settings, namely few-shot learning and conditional image generation, albeit with some restrictions, as we will see in Section 4.

3 Method

We begin this section with an overview of the mathematical tools needed to construct our model: first the Student-t process Shah et al. (2014); and then the Real NVP – a deep, stably invertible and learnable neural network architecture for density estimation Dinh et al. (2017). We next propose BRUNO, wherein we combine an exchangeable Student-t process with the Real NVP, and derive recurrent equations for the predictive distribution such that our model can be trained as an RNN. Our model is illustrated in Figure 1.

Figure 1: A schematic of the BRUNO model. It depicts how Bayesian thinking can lead to an RNN-like computational graph in which Real NVP is a bijective feature extractor and the recurrence is represented by Bayesian updates of an exchangeable Student-t process.

3.1 Student-t processes

The Student-t process ($\mathcal{TP}$) is the most general elliptically symmetric process with an analytically representable density Shah et al. (2014). The more commonly used Gaussian processes ($\mathcal{GP}$s) can be seen as a limiting case of $\mathcal{TP}$s. In what follows, we provide the background and definition of $\mathcal{TP}$s.

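Let us assume that $z = (z_1, \dots, z_n) \in \mathbb{R}^n$ follows a multivariate Student-t distribution $MVT_n(\nu, \mu, K)$ with degrees of freedom $\nu \in \mathbb{R}_+ \setminus [0, 2]$, mean $\mu \in \mathbb{R}^n$ and a positive definite $n \times n$ covariance matrix $K$. Its density is given by

$$p(z) = \frac{\Gamma\left(\frac{\nu + n}{2}\right)}{\left((\nu - 2)\pi\right)^{n/2} \Gamma\left(\frac{\nu}{2}\right)} |K|^{-1/2} \left(1 + \frac{(z - \mu)^T K^{-1} (z - \mu)}{\nu - 2}\right)^{-\frac{\nu + n}{2}}. \tag{4}$$

For our problem, we are interested in computing a conditional distribution. Suppose we can partition $z$ into two consecutive parts $z_a \in \mathbb{R}^{n_a}$ and $z_b \in \mathbb{R}^{n_b}$, such that

$$\begin{bmatrix} z_a \\ z_b \end{bmatrix} \sim MVT_n\left(\nu, \begin{bmatrix} \mu_a \\ \mu_b \end{bmatrix}, \begin{bmatrix} K_{aa} & K_{ab} \\ K_{ba} & K_{bb} \end{bmatrix}\right). \tag{5}$$

Then the conditional distribution $p(z_b \mid z_a)$ is given by

$$\begin{aligned}
p(z_b \mid z_a) &= MVT_{n_b}\left(\nu + n_a, \; \tilde{\mu}_b, \; \frac{\nu + \beta_a - 2}{\nu + n_a - 2} \tilde{K}_{bb}\right), \\
\tilde{\mu}_b &= K_{ba} K_{aa}^{-1} (z_a - \mu_a) + \mu_b, \\
\beta_a &= (z_a - \mu_a)^T K_{aa}^{-1} (z_a - \mu_a), \\
\tilde{K}_{bb} &= K_{bb} - K_{ba} K_{aa}^{-1} K_{ab}.
\end{aligned} \tag{6}$$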
In the general case, when one needs to invert the covariance matrix, the complexity of computing $p(z_b \mid z_a)$ is $O(n_a^3)$. These computations become infeasible for large datasets, which is a known bottleneck for $\mathcal{GP}$s and $\mathcal{TP}$s Rasmussen and Williams (2005). In Section 3.3, we will show that exchangeable processes do not have this issue.
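As a concrete reading of the conditional distribution, the following sketch computes the parameters of a multivariate Student-t conditional with a direct linear solve. It assumes the covariance parameterisation of Shah et al. (2014), in which the predictive covariance is the Gaussian one scaled by $(\nu + \beta_a - 2)/(\nu + n_a - 2)$; the function name and argument layout are illustrative.

```python
import numpy as np

def mvt_conditional(z_a, nu, mu_a, mu_b, K_aa, K_ab, K_bb):
    """Parameters of p(z_b | z_a) for a jointly multivariate Student-t vector.

    Returns (degrees of freedom, mean, covariance, beta_a). The cost is
    dominated by solving with K_aa, i.e. cubic in len(z_a) in general.
    """
    n_a = z_a.shape[0]
    resid = z_a - mu_a
    solved = np.linalg.solve(K_aa, resid)            # K_aa^{-1} (z_a - mu_a)
    mu_tilde = mu_b + K_ab.T @ solved                # K_ba = K_ab^T
    beta_a = resid @ solved
    K_tilde = K_bb - K_ab.T @ np.linalg.solve(K_aa, K_ab)
    scale = (nu + beta_a - 2.0) / (nu + n_a - 2.0)   # data-dependent variance scaling
    return nu + n_a, mu_tilde, scale * K_tilde, beta_a

# Two variables with variance 2 and covariance 0.5; condition on z_a = 1.
nu_c, mu_c, cov_c, beta_a = mvt_conditional(
    z_a=np.array([1.0]), nu=5.0,
    mu_a=np.array([0.0]), mu_b=np.array([0.0]),
    K_aa=np.array([[2.0]]), K_ab=np.array([[0.5]]), K_bb=np.array([[2.0]]),
)
```

In this small case the Gaussian conditional variance would be 1.875; the Student-t version multiplies it by 0.875 because the observed value is less dispersed than expected ($\beta_a = 0.5 < n_a = 1$).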

The parameter $\nu$, representing the degrees of freedom, has a large impact on the behaviour of $\mathcal{TP}$s. It controls how heavy-tailed the t-distribution is: as $\nu$ increases, the tails get lighter and the t-distribution gets closer to the Gaussian. From Eq. 6, we can see that as $\nu$ or $n_a$ tends to infinity, the predictive distribution tends to the one from a $\mathcal{GP}$. Thus, for small $\nu$ and $n_a$, a $\mathcal{TP}$ would give less certain predictions than its corresponding $\mathcal{GP}$.

A second feature of the $\mathcal{TP}$ is the scaling of the predictive variance with a $\beta$ coefficient, which explicitly depends on the values of the conditioning observations. From Eq. 6, the value of $\beta_a$ is precisely the Hotelling statistic for the vector $z_a$, and has a $\chi^2_{n_a}$ distribution with mean $n_a$ in the event that $z_a \sim \mathcal{N}_{n_a}(\mu_a, K_{aa})$. Looking at the weight $\frac{\nu + \beta_a - 2}{\nu + n_a - 2}$, we see that the variance of $p(z_b \mid z_a)$ is increased over the Gaussian default when $\beta_a > n_a$, and is reduced otherwise. In other words, when the samples are dispersed more than they would be under the Gaussian distribution, the predictive uncertainty is increased compared with the Gaussian case. It is helpful in understanding these two properties to recall that the multivariate Student-t distribution can be thought of as a Gaussian distribution with an inverse Wishart prior on the covariance Shah et al. (2014).

3.2 Real NVP

Real NVP Dinh et al. (2017) is a member of the normalising flows family of models, where some density in the input space $\mathcal{X}$ is transformed into a desired probability distribution in space $\mathcal{Z}$ through a sequence of invertible mappings Rezende and Mohamed (2015). Specifically, Real NVP proposes a design for a bijective function $f: \mathcal{X} \to \mathcal{Z}$ with $\mathcal{X} = \mathbb{R}^D$ and $\mathcal{Z} = \mathbb{R}^D$ such that (a) the inverse is easy to evaluate, i.e. the cost of computing $x = f^{-1}(z)$ is the same as for the forward mapping, and (b) computing the Jacobian determinant takes linear time in the number of dimensions $D$. Additionally, Real NVP assumes a simple distribution for $z$, e.g. an isotropic Gaussian, so one can use a change of variables formula to evaluate $p(x)$:

$$p(x) = p(z) \left| \det\left(\frac{\partial f(x)}{\partial x}\right) \right|. \tag{7}$$
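The main building block of Real NVP is a coupling layer. It implements a mapping $\mathcal{X} \to \mathcal{Y}$ that transforms half of its inputs while copying the other half directly to the output:

$$\begin{aligned}
y^{1:d} &= x^{1:d}, \\
y^{d+1:D} &= x^{d+1:D} \odot \exp\left(s(x^{1:d})\right) + t(x^{1:d}),
\end{aligned} \tag{8}$$

where $\odot$ is an elementwise product, $s$ (scale) and $t$ (translation) are arbitrarily complex functions, e.g. convolutional neural networks.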

One can show that the coupling layer is a bijective, easily invertible mapping with a triangular Jacobian, and composition of such layers preserves these properties. To obtain a highly nonlinear mapping $f(x)$, one needs to stack coupling layers while alternating the dimensions that are being copied to the output.
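A coupling layer and its inverse fit in a few lines. In the sketch below, the scale and translation functions are hypothetical single-layer stand-ins (a `tanh` of a linear map and a linear map), not the convolutional networks of Real NVP, and the `flip` flag is one simple way to alternate which half is copied.

```python
import numpy as np

class AffineCoupling:
    """Copies the first d entries and applies an affine map to the rest."""

    def __init__(self, dim, d, rng, flip=False):
        self.d, self.flip = d, flip
        # Hypothetical stand-ins for s and t: small random linear maps.
        self.W_s = 0.1 * rng.normal(size=(d, dim - d))
        self.W_t = 0.1 * rng.normal(size=(d, dim - d))

    def _split(self, v):
        v = v[::-1] if self.flip else v      # reversing swaps the roles of the halves
        return v[:self.d], v[self.d:]

    def _merge(self, a, b):
        v = np.concatenate([a, b])
        return v[::-1] if self.flip else v

    def forward(self, x):
        x1, x2 = self._split(x)
        s, t = np.tanh(x1 @ self.W_s), x1 @ self.W_t
        # Triangular Jacobian: log|det| is the sum of the log-scales.
        return self._merge(x1, x2 * np.exp(s) + t), s.sum()

    def inverse(self, y):
        y1, y2 = self._split(y)
        s, t = np.tanh(y1 @ self.W_s), y1 @ self.W_t
        return self._merge(y1, (y2 - t) * np.exp(-s))

def flow_forward(layers, x):
    total = 0.0
    for layer in layers:
        x, logdet = layer.forward(x)
        total += logdet
    return x, total

def flow_inverse(layers, z):
    for layer in reversed(layers):
        z = layer.inverse(z)
    return z

rng = np.random.default_rng(0)
layers = [AffineCoupling(4, 2, rng), AffineCoupling(4, 2, rng, flip=True)]
x = rng.normal(size=4)
z, logdet = flow_forward(layers, x)
x_rec = flow_inverse(layers, z)   # recovers x up to floating-point error
```

The inverse never needs to invert `s` or `t` themselves, which is why they can be arbitrarily complex; the accumulated `logdet` is the term that enters the change of variables formula.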

To make good use of modelling densities, the Real NVP has to treat its inputs as instances of a continuous random variable Theis et al. (2016). To do so, integer pixel values in $x$ are dequantised by adding uniform noise $u \in [0, 1)^D$. The values $x + u \in [0, 256)^D$ are then rescaled to a $[0, 1)$ interval and transformed with an elementwise function: $f(x) = \mathrm{logit}(\alpha + (1 - 2\alpha)x)$ with some small $\alpha$.
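The preprocessing can be sketched as follows. The logit form $\mathrm{logit}(\alpha + (1 - 2\alpha)x)$ and the value `alpha=0.05` are assumptions made for illustration (the text only asks for some small $\alpha$), and `requantise` is a hypothetical helper that undoes the transform.

```python
import numpy as np

def dequantise(pixels, rng, alpha=0.05):
    """Map integer pixels in {0, ..., 255} to unbounded real values."""
    u = rng.uniform(0.0, 1.0, size=pixels.shape)   # uniform noise in [0, 1)
    x = (pixels + u) / 256.0                       # rescale to [0, 1)
    p = alpha + (1.0 - 2.0 * alpha) * x            # keep away from 0 and 1
    return np.log(p) - np.log1p(-p)                # logit(p)

def requantise(y, alpha=0.05):
    """Invert the transform and drop the noise by flooring."""
    p = 1.0 / (1.0 + np.exp(-y))
    x = (p - alpha) / (1.0 - 2.0 * alpha)
    return np.clip(np.floor(x * 256.0), 0, 255).astype(np.int64)

rng = np.random.default_rng(0)
pixels = np.array([0, 17, 128, 255])
y = dequantise(pixels, rng)
```

Squeezing the rescaled values into $[\alpha, 1 - \alpha]$ before the logit keeps the transformed inputs finite at the extreme pixel values.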

3.3 BRUNO: the exchangeable sequence model

We now combine Bayesian and deep learning tools from the previous sections and present our model for exchangeable sequences whose schematic is given in Figure 1.

Assume we are given an exchangeable sequence $x_1, \dots, x_n$, where every element is a D-dimensional vector: $x_i = (x_i^1, \dots, x_i^D)$. We apply a Real NVP transformation to every $x_i$, which results in an exchangeable sequence in the latent space: $z_1, \dots, z_n$, where $z_i \in \mathbb{R}^D$. The proof that the latter sequence is exchangeable is given in Appendix A.

We make the following assumptions about the latents:

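A1: dimensions $\{z^d\}_{d=1,\dots,D}$ are independent, so $p(z) = \prod_{d=1}^{D} p(z^d)$

A2: for every dimension $d$, we assume the following: $(z_1^d, \dots, z_n^d) \sim MVT_n(\nu^d, \mu^d \mathbf{1}, K^d)$, with parameters:

  • degrees of freedom $\nu^d \in \mathbb{R}_+ \setminus [0, 2]$

  • mean $\mu^d \mathbf{1}$ is a $1 \times n$ dimensional vector of ones multiplied by the scalar $\mu^d \in \mathbb{R}$

  • $n \times n$ covariance matrix $K^d$ with $K^d_{ii} = v^d$ and $K^d_{ij, i \neq j} = \rho^d$, where $0 \leq \rho^d < v^d$ to make sure that $K^d$ is a positive-definite matrix that complies with covariance properties of exchangeable sequences Aldous et al. (1985).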

The exchangeable structure of the covariance matrix, together with having the same mean for every $n$, guarantees that the sequence $z_1^d, z_2^d, \dots, z_n^d$ is exchangeable. Because the covariance matrix is simple, we can derive recurrent updates for the parameters of $p(z_{n+1}^d \mid z_{1:n}^d)$. Using the recurrence is a lot more efficient compared to the closed-form expressions in Eq. 6 since we want to compute the predictive distribution for every step $n$.

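We start from a prior Student-t distribution for $p(z_1)$ with parameters $\mu_1 = \mu$, $v_1 = v$, $\nu_1 = \nu$, $\beta_1 = 0$. Here, we will drop the dimension index $d$ to simplify the notation. A detailed derivation of the following results is given in Appendix B. To compute the degrees of freedom, mean and variance of $p(z_{n+1} \mid z_{1:n})$ for every $n$, we begin with the recurrent relations

$$\nu_{n+1} = \nu_n + 1, \qquad \mu_{n+1} = (1 - d_n)\mu_n + d_n z_n, \qquad v_{n+1} = (1 - d_n) v_n + d_n (v - \rho), \tag{9}$$

where $d_n = \frac{\rho}{v + \rho(n - 1)}$. Note that the $\mathcal{GP}$ recursions simply use the latter two equations, i.e. if we were to assume that $(z_1^d, \dots, z_n^d) \sim \mathcal{N}_n(\mu^d \mathbf{1}, K^d)$. For $\mathcal{TP}$s, however, we also need to compute $\beta$ – a data-dependent term that scales the covariance matrix as in Eq. 6. To update $\beta$, we introduce recurrent expressions for the auxiliary variables:

$$\begin{aligned}
\tilde{z}_i &= z_i - \mu, \\
a_n &= \frac{v + \rho(n - 2)}{(v - \rho)(v + \rho(n - 1))}, \qquad b_n = \frac{-\rho}{(v - \rho)(v + \rho(n - 1))}, \\
\beta_{n+1} &= \beta_n + (a_n - b_n)\tilde{z}_n^2 + b_n \Big(\sum_{i=1}^{n} \tilde{z}_i\Big)^2 - b_{n-1} \Big(\sum_{i=1}^{n-1} \tilde{z}_i\Big)^2.
\end{aligned}$$

From these equations, we see that the computational complexity of making predictions in exchangeable $\mathcal{GP}$s or $\mathcal{TP}$s scales linearly with the number of observations, i.e. $O(n)$ instead of the general $O(n^3)$ case where one needs to compute an inverse covariance matrix.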
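The linear-time claim can be checked against a direct matrix solve. The sketch below is for a single latent dimension and is not the authors' implementation: the update formulas are written out from the structure of an exchangeable covariance (variance `v` on the diagonal, `rho` elsewhere), with $d_n = \rho / (v + \rho(n-1))$, and `tp_closed_form` is a hypothetical reference that conditions with a full covariance matrix.

```python
import numpy as np

def tp_recurrent(z, nu, mu, v, rho):
    """Parameters (nu_n, mu_n, v_n, beta_n) of p(z_{n+1} | z_{1:n}) for n = 0..len(z).

    Each step costs O(1), so a whole sequence costs O(n).
    """
    nu_n, mu_n, v_n, beta_n = nu, mu, v, 0.0
    out = [(nu_n, mu_n, v_n, beta_n)]
    s_prev, b_prev = 0.0, 0.0          # running sum of (z_i - mu), previous b coefficient
    for n, z_n in enumerate(z, start=1):
        denom = v + rho * (n - 1)
        d_n = rho / denom
        a_n = (v + rho * (n - 2)) / ((v - rho) * denom)   # diagonal of K_n^{-1}
        b_n = -rho / ((v - rho) * denom)                  # off-diagonal of K_n^{-1}
        s_n = s_prev + (z_n - mu)
        beta_n = beta_n + (a_n - b_n) * (z_n - mu) ** 2 + b_n * s_n ** 2 - b_prev * s_prev ** 2
        nu_n = nu_n + 1
        mu_n = (1 - d_n) * mu_n + d_n * z_n
        v_n = (1 - d_n) * v_n + d_n * (v - rho)
        out.append((nu_n, mu_n, v_n, beta_n))
        s_prev, b_prev = s_n, b_n
    return out

def tp_closed_form(z, nu, mu, v, rho):
    """Same quantities via a direct solve with the n x n covariance (cubic cost)."""
    n = len(z)
    K = (v - rho) * np.eye(n) + rho * np.ones((n, n))
    k = rho * np.ones(n)
    resid = np.asarray(z) - mu
    solved = np.linalg.solve(K, resid)
    return nu + n, mu + k @ solved, v - k @ np.linalg.solve(K, k), resid @ solved

def predictive_variance(nu, nu_n, v_n, beta_n):
    """Student-t predictive variance: the Gaussian one scaled by the beta weight."""
    return (nu + beta_n - 2.0) / (nu_n - 2.0) * v_n

rng = np.random.default_rng(0)
nu, mu, v, rho = 5.0, 0.0, 1.0, 0.4
z = rng.normal(size=6)
rec = tp_recurrent(z, nu, mu, v, rho)
```

The tuple `rec[n]` plays the role of an RNN hidden state: it summarises all of `z[:n]`, and its value does not depend on the order in which those observations arrived.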

So far, we have constructed an exchangeable Student-t process in the latent space $\mathcal{Z}$. By coupling it with a bijective Real NVP mapping, we get an exchangeable process in space $\mathcal{X}$. Although we do not have an explicit analytic form of the transitions in $\mathcal{X}$, we still can sample from this process and evaluate the predictive distribution via the change of variables formula in Eq. 7.

3.4 Training

Having an easy-to-evaluate autoregressive distribution $p(x_{n+1} \mid x_{1:n})$ allows us to use a training scheme that is common for RNNs, i.e. maximise the likelihood of the next element in the sequence at every step. Thus, our objective function for a single sequence of fixed length $N$ can be written as $\mathcal{L} = \sum_{n=0}^{N-1} \log p(x_{n+1} \mid x_{1:n})$, which is equivalent to maximising the joint log-likelihood $\log p(x_1, \dots, x_N)$. While we do have a closed-form expression for the latter, we chose not to use it during training in order to minimise the difference between the implementation of training and testing phases. Note that at test time, dealing with the joint log-likelihood would be inconvenient or even impossible due to high memory costs when $n$ gets large, which again motivates the use of a recurrent formulation.
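The equivalence between the step-wise objective and the joint log-likelihood can be illustrated in the latent space. To keep the sketch short it uses the Gaussian special case, one latent dimension, and a negative log-likelihood that would be minimised; in observation space each term would additionally include the log-determinant of the Real NVP Jacobian. The function names are illustrative.

```python
import numpy as np

def sequence_nll_gp(z, mu, v, rho):
    """Negative sum over steps of log p(z_n | z_{1:n-1}), Gaussian case."""
    mu_n, v_n, total = mu, v, 0.0
    for n, z_n in enumerate(z, start=1):
        # Score the current element under the predictive distribution ...
        total += -0.5 * (np.log(2.0 * np.pi * v_n) + (z_n - mu_n) ** 2 / v_n)
        # ... then absorb it into the predictive parameters for the next step.
        d_n = rho / (v + rho * (n - 1))
        mu_n = (1 - d_n) * mu_n + d_n * z_n
        v_n = (1 - d_n) * v_n + d_n * (v - rho)
    return -total

def joint_nll_gp(z, mu, v, rho):
    """Negative joint log-density with the full exchangeable covariance."""
    n = len(z)
    K = (v - rho) * np.eye(n) + rho * np.ones((n, n))
    resid = np.asarray(z) - mu
    _, logdet = np.linalg.slogdet(K)
    return 0.5 * (n * np.log(2.0 * np.pi) + logdet + resid @ np.linalg.solve(K, resid))

rng = np.random.default_rng(1)
z = rng.normal(size=20)
step_wise = sequence_nll_gp(z, mu=0.0, v=1.0, rho=0.3)
joint = joint_nll_gp(z, mu=0.0, v=1.0, rho=0.3)
```

The step-wise version keeps only two scalars of state per dimension, whereas the joint version builds an $N \times N$ matrix, which is the memory cost the recurrent formulation avoids.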

During training, we update the weights of the Real NVP model and also learn the parameters of the prior Student-t distribution. For the latter, we have three trainable parameters per dimension: degrees of freedom $\nu^d$, variance $v^d$ and covariance $\rho^d$. The mean $\mu^d$ is fixed to 0 for every $d$ and is not updated during training.

4 Experiments

In this section, we will consider a few problems that fit naturally into the framework of modelling exchangeable data. We chose to work with sequences of images, so the results are easy to analyse; yet BRUNO does not make any image-specific assumptions, and our conclusions can generalise to other types of data. Specifically, for non-image data, one can use a general-purpose Real NVP coupling layer as proposed by Papamakarios et al. (2017). In contrast to the original Real NVP model, which uses a convolutional architecture for the scaling and translation functions $s$ and $t$ in Eq. 8, a general implementation has $s$ and $t$ composed from fully connected layers. We experimented with both convolutional and non-convolutional architectures, the details of which are given in Appendix C.

In our experiments, the models are trained on image sequences of length 20. We form each sequence by uniformly sampling a class and then selecting 20 random images from that class. This scheme implies that a model is trained to implicitly infer a class label that is global to a sequence. In what follows, we will see how this property can be used in a few tasks.

4.1 Conditional image generation

We first consider a problem of generating samples conditionally on a set of images, which reduces to sampling from a predictive distribution. This is different from a general Bayesian approach, where one needs to infer the posterior over some meaningful latent variable and then ‘decode’ it.

To draw samples from $p(x_n \mid x_{1:n-1})$, we first sample $z \sim p(z_n \mid z_{1:n-1})$ and then compute the inverse Real NVP mapping: $x = f^{-1}(z)$. Since we assumed that dimensions of $z$ are independent, we can sample each $z^d$ from a univariate Student-t distribution. To do so, we modified Bailey’s polar t-distribution generation method Bailey (1994) to be computationally efficient for GPU. Its algorithm is given in Appendix D.
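The latent sampling step can be sketched on CPU with NumPy's built-in Student-t sampler, used here in place of the modified Bailey method. Because the process is parameterised by a covariance rather than a scale, a standard t variate must be multiplied by $\sqrt{K(\nu - 2)/\nu}$; the function name and the numeric values below are illustrative.

```python
import numpy as np

def sample_predictive_t(nu_n, mu_n, cov_n, size, rng):
    """Draw from a univariate Student-t with mean mu_n and variance cov_n.

    A standard t variate with nu_n degrees of freedom has variance
    nu_n / (nu_n - 2), so it is rescaled to hit the requested variance.
    """
    scale = np.sqrt(cov_n * (nu_n - 2.0) / nu_n)
    return mu_n + scale * rng.standard_t(nu_n, size=size)

rng = np.random.default_rng(0)
samples = sample_predictive_t(nu_n=10.0, mu_n=0.3, cov_n=1.5, size=200_000, rng=rng)
```

Each latent dimension would be sampled this way with its own predictive parameters, after which the inverse Real NVP mapping turns the latent vector into an image.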

In Figure 2, we show samples from the prior distribution and conditional samples from a predictive distribution $p(x_{n+1} \mid x_{1:n})$ at steps $n = 1, \dots, 20$. Here, we used a convolutional Real NVP model as a part of BRUNO. The model was trained on Omniglot Lake et al. (2015) same-class image sequences of length 20 and we used the train-test split and preprocessing as defined by Vinyals et al. (2016b). Namely, we resized the images to $28 \times 28$ pixels and augmented the dataset with rotations by multiples of 90 degrees, yielding 4,800 and 1,692 classes for training and testing respectively.

Figure 2: Samples generated conditionally on the sequence of the unseen Omniglot character class. An input sequence is shown in the top row and samples in the bottom 4 rows. Every column of the bottom subplot contains 4 samples from the predictive distribution conditioned on the input images up to and including that column. That is, the 1st column shows samples from the prior $p(x)$ when no input image is given; the 2nd column shows samples from $p(x \mid x_1)$ where $x_1$ is the 1st input image in the top row and so on.

To better understand how BRUNO behaves, we test it on special types of input sequences that were not seen during training. In Appendix E, we give an example where the same image is used throughout the sequence. In that case, the variability of the samples reduces as the model gets more of the same input. This property does not hold for the neural statistician model Edwards and Storkey (2017), discussed in Section 2. As mentioned earlier, the neural statistician computes the approximate posterior $q(c \mid x_1, \dots, x_n)$ and then uses its mean to sample $x$ from a conditional model $p(x \mid c_{mean})$. This scheme does not account for the variability in the inputs as a consequence of applying mean pooling over the features of $x_1, \dots, x_n$ when computing $q(c \mid x_1, \dots, x_n)$. Thus, when all $x_i$’s are the same, it would still sample different instances from the class specified by $x_i$. Given the code provided by the authors of the neural statistician and following an email exchange, we could not reproduce the results from their paper, so we refrained from making any direct comparisons.

More generated samples from convolutional and non-convolutional architectures trained on MNIST LeCun et al. (1998), Fashion-MNIST Xiao et al. (2017) and CIFAR-10 Krizhevsky (2009) are given in the appendix. For a couple of these models, we analyse the parameters of the learnt latent distributions (see Appendix F).

4.2 Few-shot learning

Previously, we saw that BRUNO can generate images of the unseen classes even after being conditioned on a couple of examples. In this section, we will see how one can use its conditional probabilities not only for generation, but also for few-shot classification.

We evaluate the few-shot learning accuracy of the model from Section 4.1 on the unseen Omniglot characters from the 1,692 testing classes following the $n$-shot and $k$-way classification setup proposed by Vinyals et al. (2016b). For every test case, we randomly draw a test image $x_{n+1}$ and a sequence of $n$ images from the target class. At the same time, we draw $n$ images for each of the $k - 1$ random decoy classes. To classify an image $x_{n+1}$, we compute $p(x_{n+1} \mid x^{C}_{1:n})$ for each class $C = 1, \dots, k$ in the batch. An image is classified correctly when the conditional probability is highest for the target class compared to the decoy classes. This evaluation is performed 20 times for each of the test classes and the average classification accuracy is reported in Table 1.

For comparison, we considered three models from Vinyals et al. (2016b): (a) k-nearest neighbours (k-NN), where matching is done on raw pixels (Pixels), (b) k-NN with matching on discriminative features from a state-of-the-art classifier (Baseline Classifier), and (c) Matching networks.

We observe that the BRUNO model from Section 4.1 outperforms the baseline classifier, despite having been trained on relatively long sequences with a generative objective, i.e. maximising the likelihood of the input images. Yet, it cannot compete with matching networks – a model tailored for few-shot learning and trained in a discriminative way on short sequences such that its test-time protocol exactly matches the training-time protocol. One can argue, however, that a comparison between models trained generatively and discriminatively is not fair. Generative modelling is a more general, harder problem to solve than discrimination, so a generatively trained model may waste a lot of statistical power on modelling aspects of the data which are irrelevant for the classification task. To verify our intuition, we fine-tuned BRUNO with a discriminative objective, i.e. maximising the likelihood of correct labels in $n$-shot, $k$-way classification episodes formed from the training examples of Omniglot. While we could sample a different $n$ and $k$ for every training episode like in matching networks, we found it sufficient to fix $n$ and $k$ during training. Namely, we chose the setting with $n = 1$ and $k = 20$. From Table 1, we see that this additional discriminative training makes BRUNO competitive with state-of-the-art models across all $n$-shot and $k$-way tasks.

| Model | 5-way, 1-shot | 5-way, 5-shot | 20-way, 1-shot | 20-way, 5-shot |
| --- | --- | --- | --- | --- |
| Pixels Vinyals et al. (2016b) | 41.7% | 63.2% | 26.7% | 42.6% |
| Baseline Classifier Vinyals et al. (2016b) | 80.0% | 95.0% | 69.5% | 89.1% |
| Matching Nets Vinyals et al. (2016b) | 98.1% | 98.9% | 93.8% | 98.5% |
| BRUNO | 86.3% | 95.6% | 69.2% | 87.7% |
| BRUNO (discriminative fine-tuning) | 97.1% | 99.4% | 91.3% | 97.8% |

Table 1: Classification accuracy for a few-shot learning task on the Omniglot dataset.

As an extension to the few-shot learning task, we showed that BRUNO could also be used for online set anomaly detection. These experiments can be found in Appendix H.

4.3 $\mathcal{GP}$-based models

In practice, we noticed that training $\mathcal{TP}$-based models can be easier compared to $\mathcal{GP}$-based models as they are more robust to anomalous training inputs and are less sensitive to the choice of hyperparameters. Under certain conditions, we were not able to obtain convergent training with $\mathcal{GP}$-based models, which was not the case when using $\mathcal{TP}$s; an example is given in Appendix G. However, we found a few heuristics that make for successful training such that $\mathcal{TP}$ and $\mathcal{GP}$-based models perform equally well in terms of test likelihoods, sample quality and few-shot classification results. For instance, it was crucial to use weight normalisation with a data-dependent initialisation of parameters of the Real NVP Salimans and Kingma (2016). As a result, one can opt for using $\mathcal{GP}$s due to their simpler implementation. Nevertheless, a Student-t process remains a strictly richer model class for the latent space with negligible additional computational costs.

5 Discussion and conclusion

In this paper, we introduced BRUNO, a new technique combining deep learning and Student-t or Gaussian processes for modelling exchangeable data. With this architecture, we may carry out implicit Bayesian inference, avoiding the need to compute posteriors and eliminating the high computational cost or approximation errors often associated with explicit Bayesian inference.

Based on our experiments, BRUNO shows promise for applications such as conditional image generation, few-shot concept learning, few-shot classification and online anomaly detection. The probabilistic construction makes the BRUNO approach particularly useful and versatile in transfer learning and multi-task situations. To demonstrate this, we showed that BRUNO trained in a generative way achieves good performance in a downstream few-shot classification task without any task-specific retraining. The performance can, however, be significantly improved with discriminative fine-tuning.

Training BRUNO is a form of meta-learning or learning-to-learn: it learns to perform Bayesian inference on various sets of data. Just as encoding translational invariance in convolutional neural networks seems to be the key to success in vision applications, we believe that the notion of exchangeability is equally central to data-efficient meta-learning. In this sense, architectures like BRUNO and Deep Sets Zaheer et al. (2017) can be seen as the most natural starting point for these applications.

As a consequence of exchangeability-by-design, BRUNO is endowed with a hidden state which integrates information about all inputs regardless of sequence length. This desired property for meta-learning is usually difficult to ensure in general RNNs as they do not automatically generalise to longer sequences than they were trained on and are sensitive to the ordering of inputs. Based on this observation, the most promising applications for BRUNO may fall in the many-shot meta-learning regime, where larger sets of data are available in each episode. Such problems naturally arise in privacy-preserving on-device machine learning, or federated meta-learning Chen et al. (2018), which is a potential future application area for BRUNO.


We would like to thank Lucas Theis for his conceptual contributions to BRUNO, Conrado Miranda and Frederic Godin for their helpful comments on the paper, Wittawat Jitkrittum for useful discussions, and Lionel Pigou for setting up the hardware.


  • Aldous et al. (1985) Aldous, D., Hennequin, P., Ibragimov, I., and Jacod, J. (1985). Ecole d’Ete de Probabilites de Saint-Flour XIII, 1983. Lecture Notes in Mathematics. Springer Berlin Heidelberg.
  • Bailey (1994) Bailey, R. W. (1994). Polar generation of random variates with the t-distribution. Math. Comp., 62(206):779–781.
  • Chen et al. (2018) Chen, F., Dong, Z., Li, Z., and He, X. (2018). Federated meta-learning for recommendation. arXiv preprint arXiv:1802.07876.
  • Clevert et al. (2016) Clevert, D., Unterthiner, T., and Hochreiter, S. (2016). Fast and accurate deep network learning by exponential linear units (ELUs). In Proceedings of the 4th International Conference on Learning Representations.
  • Dinh et al. (2014) Dinh, L., Krueger, D., and Bengio, Y. (2014). NICE: non-linear independent components estimation. arXiv preprint, abs/1410.8516.
  • Dinh et al. (2017) Dinh, L., Sohl-Dickstein, J., and Bengio, S. (2017). Density estimation using Real NVP. In Proceedings of the 5th International Conference on Learning Representations.
  • Edwards and Storkey (2017) Edwards, H. and Storkey, A. (2017). Towards a neural statistician. In Proceedings of the 5th International Conference on Learning Representations.
  • Ghahramani and Heller (2006) Ghahramani, Z. and Heller, K. A. (2006). Bayesian sets. In Weiss, Y., Schölkopf, B., and Platt, J. C., editors, Advances in Neural Information Processing Systems 18, pages 435–442. MIT Press.
  • Heller and Ghahramani (2006) Heller, K. A. and Ghahramani, Z. (2006). A simple Bayesian framework for content-based image retrieval. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 2110–2117.
  • Kingma and Welling (2014) Kingma, D. P. and Welling, M. (2014). Auto-encoding variational bayes. In Proceedings of the 2nd International Conference on Learning Representations.
  • Krizhevsky (2009) Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Technical report.
  • Lake et al. (2015) Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science.
  • LeCun et al. (1998) LeCun, Y., Cortes, C., and Burges, C. J. (1998). The MNIST database of handwritten digits.

  • Papamakarios et al. (2017) Papamakarios, G., Murray, I., and Pavlakou, T. (2017). Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems 30, pages 2335–2344.
  • Rasmussen and Williams (2005) Rasmussen, C. E. and Williams, C. K. I. (2005). Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press.
  • Rezende and Mohamed (2015) Rezende, D. and Mohamed, S. (2015). Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1530–1538.
  • Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, pages 1278–1286.
  • Salimans and Kingma (2016) Salimans, T. and Kingma, D. P. (2016). Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems.
  • Shah et al. (2014) Shah, A., Wilson, A. G., and Ghahramani, Z. (2014). Student-t processes as alternatives to Gaussian processes. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics, pages 877–885.
  • Szabo et al. (2016) Szabo, Z., Sriperumbudur, B., Poczos, B., and Gretton, A. (2016). Learning theory for distribution regression. Journal of Machine Learning Research, 17(152).
  • Theis et al. (2016) Theis, L., van den Oord, A., and Bethge, M. (2016). A note on the evaluation of generative models. In Proceedings of the 4th International Conference on Learning Representations.
  • Tieleman and Hinton (2012) Tieleman, T. and Hinton, G. (2012). Lecture 6.5 - RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.
  • Vinyals et al. (2016a) Vinyals, O., Bengio, S., and Kudlur, M. (2016a). Order matters: Sequence to sequence for sets. In Proceedings of the 4th International Conference on Learning Representations.
  • Vinyals et al. (2016b) Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., and Wierstra, D. (2016b). Matching networks for one shot learning. In Advances in Neural Information Processing Systems 29, pages 3630–3638.
  • Xiao et al. (2017) Xiao, H., Rasul, K., and Vollgraf, R. (2017). Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint, abs/1708.07747.
  • Zaheer et al. (2017) Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., and Smola, A. J. (2017). Deep sets. In Advances in Neural Information Processing Systems 30, pages 3394–3404.

Appendix A Proofs

Lemma 1

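Given an exchangeable sequence $(x_1, x_2, \dots, x_n)$ of random variables $x_i \in \mathcal{X}$ and a bijective mapping $f: \mathcal{X} \mapsto \mathcal{Z}$, the sequence $(f(x_1), f(x_2), \dots, f(x_n))$ is exchangeable.

Proof. Consider a vector function $g$ such that
$(z_1, \dots, z_n) = g(x_1, \dots, x_n) = (f(x_1), \dots, f(x_n))$. A change of variable formula gives:

$$p(x_1, \dots, x_n) = |\det J| \, p(z_1, \dots, z_n),$$

where $\det J = \prod_{i=1}^{n} \det \frac{\partial f(x_i)}{\partial x_i}$ is the determinant of the Jacobian of $g$. Since both the joint probability of $(x_1, \dots, x_n)$ and $|\det J|$ are invariant to the permutation of sequence entries, so must be $p(z_1, \dots, z_n)$. This proves that $(z_1, \dots, z_n)$ is exchangeable.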

Lemma 2

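Given two exchangeable sequences $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$ of random variables, where $x_i$ is independent from $y_j$ for all $i, j$, the concatenated sequence $x \frown y = ((x_1, y_1), \dots, (x_n, y_n))$ is exchangeable as well.

Proof. For any permutation $\pi$, as both sequences $x$ and $y$ are exchangeable we have:

$$p(x_1, \dots, x_n) \, p(y_1, \dots, y_n) = p(x_{\pi(1)}, \dots, x_{\pi(n)}) \, p(y_{\pi(1)}, \dots, y_{\pi(n)}).$$

Independence between the elements of $x$ and $y$ allows us to write each side as a joint distribution:

$$p((x_1, y_1), \dots, (x_n, y_n)) = p((x_{\pi(1)}, y_{\pi(1)}), \dots, (x_{\pi(n)}, y_{\pi(n)})),$$

and thus the sequence $x \frown y$ is exchangeable.

This lemma justifies our construction with independent exchangeable processes in the latent space as given in A1 from Section 3.3.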

Appendix B Derivation of recurrent Bayesian updates for exchangeable Student-t and Gaussian processes

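We assume that $x = (x_1, \dots, x_n) \in \mathbb{R}^n$ follows a multivariate Student-t distribution $MVT_n(\nu, \mu, K)$ with degrees of freedom $\nu > 2$, mean $\mu \in \mathbb{R}^n$ and a positive definite $n \times n$ covariance matrix $K$. Its density is given by:

$$p(x) = \frac{\Gamma\left(\frac{\nu + n}{2}\right)}{((\nu - 2)\pi)^{n/2} \, \Gamma\left(\frac{\nu}{2}\right)} |K|^{-1/2} \left(1 + \frac{(x - \mu)^T K^{-1} (x - \mu)}{\nu - 2}\right)^{-\frac{\nu + n}{2}}$$

Note that this parameterization of the multivariate t-distribution, as defined by Shah et al. (2014), is slightly different from the commonly used one. We use it because it makes the formulas simpler.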

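If we partition $x$ into two consecutive parts $x_a \in \mathbb{R}^{n_a}$ and $x_b \in \mathbb{R}^{n_b}$:

$$\begin{bmatrix} x_a \\ x_b \end{bmatrix} \sim MVT_n\left(\nu, \begin{bmatrix} \mu_a \\ \mu_b \end{bmatrix}, \begin{bmatrix} K_{aa} & K_{ab} \\ K_{ba} & K_{bb} \end{bmatrix}\right),$$

the conditional distribution $p(x_b \mid x_a)$ is given by:

$$p(x_b \mid x_a) = MVT_{n_b}\left(\nu + n_a, \; \tilde{\mu}_b, \; \frac{\nu + \beta_a - 2}{\nu + n_a - 2} \tilde{K}_{bb}\right), \tag{11}$$

$$\tilde{\mu}_b = K_{ba} K_{aa}^{-1} (x_a - \mu_a) + \mu_b, \quad \beta_a = (x_a - \mu_a)^T K_{aa}^{-1} (x_a - \mu_a), \quad \tilde{K}_{bb} = K_{bb} - K_{ba} K_{aa}^{-1} K_{ab}.$$

Derivation of this result is given in the appendix of Shah et al. (2014). Let us now simplify these equations for the case of exchangeable sequences with the following covariance structure:

$$K = \begin{bmatrix} v & \rho & \cdots & \rho \\ \rho & v & \cdots & \rho \\ \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \cdots & v \end{bmatrix}$$

In our problem, we are interested in doing one-step predictions, i.e. computing a univariate density $p(x_{n+1} \mid x_{1:n})$ with parameters $\nu_{n+1}$, $\mu_{n+1}$, $v_{n+1}$. Therefore, in Eq. 11 we can take: $n_b = 1$, $n_a = n$, $x_a = x_{1:n} \in \mathbb{R}^n$, $x_b = x_{n+1} \in \mathbb{R}$, $K_{aa} = K_{1:n,1:n}$, $K_{ab} = K_{1:n,n+1}$, $K_{ba} = K_{n+1,1:n}$, and $K_{bb} = K_{n+1,n+1} = v$.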

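Computing the parameters of the predictive distribution requires the inverse of $K_{aa}$, which we can find using the Sherman-Morrison formula:

$$K_{aa}^{-1} = (A + u w^T)^{-1} = A^{-1} - \frac{A^{-1} u w^T A^{-1}}{1 + w^T A^{-1} u}, \quad A = (v - \rho) I_n, \quad u = \rho \mathbf{1}_n, \quad w = \mathbf{1}_n.$$

After a few steps, the inverse of $K_{aa}$ is:

$$K_{aa}^{-1} = \begin{bmatrix} a_n & b_n & \cdots & b_n \\ b_n & a_n & \cdots & b_n \\ \vdots & \vdots & \ddots & \vdots \\ b_n & b_n & \cdots & a_n \end{bmatrix}, \quad a_n = \frac{v + \rho(n - 2)}{(v - \rho)(v + \rho(n - 1))}, \quad b_n = \frac{-\rho}{(v - \rho)(v + \rho(n - 1))}.$$

Note that entries of $K_{aa}^{-1}$ explicitly depend on $n$.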
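The closed-form inverse can be sanity-checked numerically. The sketch below assumes the exchangeable covariance has a common variance `v` on the diagonal and a common covariance `rho` elsewhere, and takes the diagonal and off-diagonal entries of the inverse to be $a_n = \frac{v + \rho(n-2)}{(v-\rho)(v+\rho(n-1))}$ and $b_n = \frac{-\rho}{(v-\rho)(v+\rho(n-1))}$. All function names are illustrative and not taken from the authors' code.

```python
def exchangeable_cov(n, v, rho):
    """n x n covariance: v on the diagonal, rho everywhere else."""
    return [[v if i == j else rho for j in range(n)] for i in range(n)]


def exchangeable_inv(n, v, rho):
    """Closed-form inverse: a_n on the diagonal, b_n everywhere else."""
    denom = (v - rho) * (v + rho * (n - 1))
    a_n = (v + rho * (n - 2)) / denom
    b_n = -rho / denom
    return [[a_n if i == j else b_n for j in range(n)] for i in range(n)]


def matmul(A, B):
    """Plain matrix product of two square lists-of-lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def max_identity_error(n, v, rho):
    """Largest absolute deviation of K @ K^{-1} from the identity matrix."""
    P = matmul(exchangeable_cov(n, v, rho), exchangeable_inv(n, v, rho))
    return max(abs(P[i][j] - (1.0 if i == j else 0.0))
               for i in range(n) for j in range(n))
```

Calling `max_identity_error(5, 1.0, 0.3)` should return a value at floating-point round-off level, and the entries visibly change with `n`, as the note above says.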

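Equations for the mean and variance of the predictive distribution require the following term:

$$K_{ba} K_{aa}^{-1} = (\rho, \dots, \rho) \, K_{aa}^{-1} = \left\{ \frac{\rho}{v + \rho(n - 1)} \right\}_{i = 1:n},$$

which is a $1 \times n$ vector.

With this in mind, it is easy to derive the following recurrence, starting from the prior values $\mu_1 = \mu$ and $v_1 = v$:

$$d_n = \frac{\rho}{v + \rho(n - 1)}, \quad \mu_{n+1} = (1 - d_n) \mu_n + d_n x_n, \quad v_{n+1} = (1 - d_n) v_n + d_n (v - \rho).$$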
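As a check on this step, the following sketch compares a recurrent update of the predictive mean and variance with the direct closed forms $\mu + d_n \sum_{i \le n} (x_i - \mu)$ and $v - n \rho d_n$, where $d_n = \rho / (v + \rho(n-1))$. It assumes a scalar prior mean `mu`, variance `v` and covariance `rho` shared by all sequence entries; the function names are hypothetical.

```python
def predictive_params_recurrent(xs, mu, v, rho):
    """Return [(mu_{n+1}, v_{n+1})] for n = 1..len(xs) via the O(1) recurrence."""
    mu_n, v_n = mu, v  # parameters of the prior p(x_1)
    out = []
    for n, x in enumerate(xs, start=1):
        d_n = rho / (v + rho * (n - 1))
        mu_n = (1.0 - d_n) * mu_n + d_n * x
        v_n = (1.0 - d_n) * v_n + d_n * (v - rho)
        out.append((mu_n, v_n))
    return out


def predictive_params_direct(xs, mu, v, rho):
    """Same quantities from K_ba K_aa^{-1}, whose entries all equal d_n."""
    out = []
    for n in range(1, len(xs) + 1):
        d_n = rho / (v + rho * (n - 1))
        mean = mu + d_n * sum(x - mu for x in xs[:n])  # K_ba K_aa^{-1} (x_a - mu_a) + mu_b
        var = v - n * rho * d_n                        # K_bb - K_ba K_aa^{-1} K_ab
        out.append((mean, var))
    return out
```

The recurrent version touches each observation once, which is what gives a cost linear in the size of the conditioning set.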

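Finally, let us derive recurrent equations for $\beta_{n+1} = (x_a - \mu_a)^T K_{aa}^{-1} (x_a - \mu_a)$.

Let $\tilde{x} = x_a - \mu_a$, then:

$$\beta_{n+1} = \tilde{x}^T K_{aa}^{-1} \tilde{x} = (a_n - b_n) \sum_{i=1}^{n} \tilde{x}_i^2 + b_n \Big( \sum_{i=1}^{n} \tilde{x}_i \Big)^2.$$

Similarly, $\beta_n$ from $p(x_n \mid x_{1:n-1})$ is:

$$\beta_n = (a_{n-1} - b_{n-1}) \sum_{i=1}^{n-1} \tilde{x}_i^2 + b_{n-1} \Big( \sum_{i=1}^{n-1} \tilde{x}_i \Big)^2.$$

It is easy to show that $a_n - b_n = a_{n-1} - b_{n-1} = \frac{1}{v - \rho}$, so $\beta_{n+1}$ can be written recursively as:

$$\beta_{n+1} = \beta_n + (a_n - b_n) \tilde{x}_n^2 + b_n s_{n+1}^2 - b_{n-1} s_n^2,$$

with $s_{n+1} = s_n + \tilde{x}_n$, $s_1 = 0$ and $\beta_1 = 0$.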
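The recursion for the quadratic form can be checked against a direct evaluation. This sketch assumes $\beta_{n+1} = \tilde{x}^T K_{aa}^{-1} \tilde{x}$ with $\tilde{x}_i = x_i - \mu$, where $K_{aa}^{-1}$ has $a_n$ on the diagonal and $b_n = -\rho / ((v-\rho)(v+\rho(n-1)))$ elsewhere, and uses $a_n - b_n = 1/(v-\rho)$. It keeps a running sum `s` of the centred observations; names are illustrative only.

```python
def beta_direct(xs, mu, v, rho):
    """Quadratic form (x_a - mu)^T K_aa^{-1} (x_a - mu) for n = len(xs)."""
    n = len(xs)
    denom = (v - rho) * (v + rho * (n - 1))
    a_n = (v + rho * (n - 2)) / denom
    b_n = -rho / denom
    xt = [x - mu for x in xs]
    return sum((a_n if i == j else b_n) * xt[i] * xt[j]
               for i in range(n) for j in range(n))


def beta_recurrent(xs, mu, v, rho):
    """Same quantity, updated one observation at a time."""
    beta, s, b_prev = 0.0, 0.0, 0.0  # b_prev is irrelevant while s == 0
    for n, x in enumerate(xs, start=1):
        xt = x - mu
        b_n = -rho / ((v - rho) * (v + rho * (n - 1)))
        s_next = s + xt
        # (a_n - b_n) = 1 / (v - rho) for every n
        beta += xt * xt / (v - rho) + b_n * s_next ** 2 - b_prev * s ** 2
        s, b_prev = s_next, b_n
    return beta
```

Because $b_n$ changes with $n$, the sketch carries the previous coefficient along with the running sum rather than reusing the current one for both squared-sum terms.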

Appendix C Implementation details

For simple datasets, such as MNIST, we found it sufficient to use models that rely upon a general implementation of the Real NVP coupling layer, similarly to Papamakarios et al. (2017), namely one in which the scaling and translation functions $s$ and $t$ are fully-connected neural networks. In our model, networks $s$ and $t$ share the parameters in the first two dense layers with 1024 hidden units and ELU nonlinearity Clevert et al. (2016). Their output layers are different: $s$ ends with a dense layer with $\tanh$ and $t$ ends with a dense layer without a nonlinearity. We stacked 6 coupling layers, alternating the indices of the transformed dimensions between odd and even as described by Dinh et al. (2014). For the first layer, which implements a logit transformation of the inputs, namely , we used . The logit transformation ensures that when taking the inverse mapping during sample generation, the outputs always lie within .
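To make the coupling-layer construction concrete, here is a minimal sketch of an affine coupling layer with odd/even alternation and its inverse. It is not the authors' implementation: the scale and translation functions `toy_s` and `toy_t` are hypothetical stand-ins for the fully-connected networks, one pair is shared by all layers for brevity, the input length is assumed to be even, and the logit pre-processing layer is omitted.

```python
import math


def _split(n, transform_odd):
    """Indices that are transformed ('move') and those passed through ('keep')."""
    move = [i for i in range(n) if (i % 2 == 1) == transform_odd]
    keep = [i for i in range(n) if (i % 2 == 1) != transform_odd]
    return keep, move


def toy_s(cond):
    """Hypothetical scale function, bounded by tanh."""
    return [math.tanh(0.5 * c) for c in cond]


def toy_t(cond):
    """Hypothetical translation function without an output nonlinearity."""
    return [0.5 * c + 0.1 for c in cond]


def coupling_forward(x, transform_odd, s_fn=toy_s, t_fn=toy_t):
    """y_move = x_move * exp(s(x_keep)) + t(x_keep); returns (y, log|det J|)."""
    keep, move = _split(len(x), transform_odd)
    cond = [x[i] for i in keep]
    s, t = s_fn(cond), t_fn(cond)
    y = list(x)
    for k, i in enumerate(move):
        y[i] = x[i] * math.exp(s[k]) + t[k]
    return y, sum(s)


def coupling_inverse(y, transform_odd, s_fn=toy_s, t_fn=toy_t):
    """Exact inverse: the 'keep' half is unchanged, so s and t can be recomputed."""
    keep, move = _split(len(y), transform_odd)
    cond = [y[i] for i in keep]
    s, t = s_fn(cond), t_fn(cond)
    x = list(y)
    for k, i in enumerate(move):
        x[i] = (y[i] - t[k]) * math.exp(-s[k])
    return x


def flow_forward(x, n_layers=6):
    """Stack coupling layers, alternating which half is transformed."""
    log_det = 0.0
    for layer in range(n_layers):
        x, ld = coupling_forward(x, transform_odd=(layer % 2 == 0))
        log_det += ld
    return x, log_det


def flow_inverse(z, n_layers=6):
    """Undo the stack by inverting the layers in reverse order."""
    for layer in reversed(range(n_layers)):
        z = coupling_inverse(z, transform_odd=(layer % 2 == 0))
    return z
```

For example, `flow_inverse(flow_forward([0.1, -0.4, 0.7, 1.3])[0])` should recover the input up to round-off, and the accumulated `log_det` is the term that enters the change of variable formula.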

In the Omniglot, Fashion MNIST and CIFAR-10 experiments, we built upon a Real NVP model originally designed for CIFAR-10 by Dinh et al. (2017): a multi-scale architecture with deep convolutional residual networks in the coupling layers. Our main difference was the use of coupling layers with fully-connected $s$ and $t$ networks (as described above) placed on top of the original convolutional Real NVP model. We found that adding these layers allowed for faster convergence and improved results. This is likely due to a better mixing of the information before the output of the Real NVP gets into the Student-t layer. We also found that using weight normalisation Salimans and Kingma (2016) within every $s$ and $t$ function was crucial for successful training of large models.

The model parameters were optimized using RMSProp Tieleman and Hinton (2012) with a decaying learning rate starting from $10^{-3}$. Trainable parameters of a $\mathcal{GP}$ or $\mathcal{TP}$ were updated with a 10x smaller learning rate and were initialized as follows: , , for every dimension . The mean $\mu^d$ was fixed at 0. For the Omniglot model, we used a batch size of 32, a sequence length of 20, and trained for 200K iterations. The other models were trained for fewer iterations, ranging from 50K to 100K updates.

Appendix D Sampling from a Student-t distribution

  function sample($\mu, v, \nu$)
  end function
Algorithm 1: Efficient sampling on GPU from a univariate t-distribution with mean $\mu$, variance $v$ and degrees of freedom $\nu$
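For reference, one standard way to draw from a univariate Student-t with a given mean, variance and degrees of freedom is sketched below. This is not necessarily the procedure of Algorithm 1, and it is a CPU-only illustration: it uses the textbook construction $t = z / \sqrt{g / \nu}$ with $z \sim \mathcal{N}(0, 1)$ and $g \sim \chi^2_\nu$, then rescales so that the variance equals $v$ (which requires $\nu > 2$). The function name and the `rng` argument are assumptions made for this example.

```python
import math
import random


def sample_student_t(mu, v, nu, rng=random):
    """Draw one sample with mean mu, variance v and nu > 2 degrees of freedom."""
    if nu <= 2:
        raise ValueError("the variance is finite only for nu > 2")
    z = rng.gauss(0.0, 1.0)
    # chi-square with nu degrees of freedom == Gamma(shape=nu/2, scale=2)
    chi2 = rng.gammavariate(nu / 2.0, 2.0)
    t = z / math.sqrt(chi2 / nu)  # standard t, variance nu / (nu - 2)
    return mu + math.sqrt(v * (nu - 2.0) / nu) * t
```

Passing a seeded `random.Random` instance as `rng` makes the draws reproducible.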

Appendix E Sample analysis

Figure 3, which includes Figure 2 from the main text, illustrates how sample variability depends on the variance of the inputs. These examples show that, in the case of a repeated input image, samples become more coherent as the number of conditioning inputs grows. They also show that BRUNO does not merely generate samples according to the inferred class label.

While Omniglot is limited to 20 images per class, we can experiment with longer sequences using CIFAR-10 or MNIST. In Figure 4 and Figure 5, we show samples from the models trained on those datasets. In Figure 6, we also show more samples from the prior distribution $p(x)$.

Figure 3: Samples generated conditionally on images from an unseen Omniglot character class. Left: input sequence of 20 images from one class. Right: the same image is used as an input at every step.
Figure 4: CIFAR-10 samples from for every . Left: input sequence (given in the top row of each subplot) is composed of random same-class test images. Right: same image is given as input at every step. In both cases, input images come from the test set of CIFAR-10 and the model was trained on all of the classes.
Figure 5: MNIST samples from for every . Left: input sequence (given in the top row of each subplot) is composed of random same-class test images. Right: same image is given as input at every step. In both cases, input images come from the test set of MNIST and the model was trained only on even digits, so it did not see digit ‘1’ during training.
Figure 6: Samples from the prior for the models trained on Omniglot, CIFAR-10, Fashion MNIST and MNIST (only trained on even digits).

Appendix F Parameter analysis

After training a model, we observed that a majority of the processes in the latent space have low correlations $\rho^d / v^d$, and thus their predictive distributions remain close to the prior. Figure 7 plots the number of dimensions where correlations exceed the value given on the x-axis. For instance, the MNIST model has 8 dimensions where the correlation is higher than 0.1. While we have not verified it experimentally, it is reasonable to expect those dimensions to capture information about visual features of the digits.

Figure 7: Number of dimensions where the correlation $\rho^d / v^d$ exceeds the value on the x-axis, plotted on a double logarithmic scale. Left: Omniglot model. Middle: CIFAR-10 model. Right: Non-convolutional version of BRUNO trained on MNIST.
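The counts behind such a plot are straightforward to compute from the learned parameters. The sketch below assumes per-dimension lists of covariances `rhos` and variances `vs` and a list of thresholds for the x-axis; all names and the toy numbers in the usage line are made up for illustration.

```python
def count_dims_above(rhos, vs, thresholds):
    """For each threshold, count latent dimensions whose correlation rho/v exceeds it."""
    correlations = [rho / v for rho, v in zip(rhos, vs)]
    return [sum(c > th for c in correlations) for th in thresholds]
```

With the toy values `count_dims_above([0.5, 0.05, 0.3, 0.001], [1.0, 1.0, 2.0, 1.0], [0.01, 0.1, 0.3])`, the correlations are 0.5, 0.05, 0.15 and 0.001, so the counts are 3, 2 and 1.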

For $\mathcal{TP}$-based models, the degrees of freedom $\nu^d$ for every process in the latent space were initialized to 1000, which makes a $\mathcal{TP}$ close to a $\mathcal{GP}$. After training, most of the dimensions retain fairly high degrees of freedom, but some can have small $\nu^d$. Figure 8 shows that dimensions with high correlation tend to have smaller degrees of freedom.

Figure 8: Correlation $\rho^d / v^d$ versus degrees of freedom $\nu^d$ for every $d$. Degrees of freedom on the x-axis are plotted on a logarithmic scale. Left: Omniglot model. Middle: CIFAR-10 model. Right: Non-convolutional version of BRUNO trained on MNIST.

We noticed that exchangeable $\mathcal{TP}$s and $\mathcal{GP}$s can behave differently for certain settings of hyperparameters even when $\mathcal{TP}$s have high degrees of freedom. Figure 9 gives one example of such a case.

Figure 9: A toy example which illustrates how degrees of freedom $\nu$ affect the behaviour of a $\mathcal{TP}$ compared to a $\mathcal{GP}$. Here, we generate one sequence of 100 observations from an exchangeable multivariate normal distribution with parameters , , and evaluate predictive probabilities under exchangeable $\mathcal{TP}$ and $\mathcal{GP}$ models with parameters , , and different $\nu$ for $\mathcal{TP}$s in the left and the right plots.

Appendix G Training of $\mathcal{GP}$ and $\mathcal{TP}$-based models

When jointly optimizing Real NVP with a $\mathcal{GP}$ or a $\mathcal{TP}$ on top, we found that these two versions of BRUNO occasionally behave differently during training. Namely, with $\mathcal{GP}$s convergence was harder to achieve. We could pinpoint a few determining factors: (a) the use of weightnorm Salimans and Kingma (2016) in the Real NVP layers, (b) the initialisation of the covariance parameters, and (c) the presence of outliers in the training data. In Figure 10, we give examples of learning curves when BRUNO with $\mathcal{GP}$s tends not to work well. Here, we use a convolutional architecture and train on Fashion MNIST. To simulate outliers, every 100 iterations we feed a training batch where the last image of every sequence in the batch is completely white.
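The outlier simulation described above can be sketched as a small batch transformation. The data layout is an assumption made for this example: a batch is a list of sequences, each sequence is a list of images, each image is a flat list of pixel intensities in [0, 1], and white corresponds to 1.0. The function and parameter names are hypothetical.

```python
def inject_white_outliers(batch, iteration, every=100, white=1.0):
    """On every `every`-th iteration, replace the last image of each sequence
    with a completely white image; otherwise return the batch unchanged."""
    if iteration % every != 0:
        return batch
    corrupted = []
    for sequence in batch:
        white_image = [white] * len(sequence[-1])
        # keep all but the last image, then append the white outlier
        corrupted.append(sequence[:-1] + [white_image])
    return corrupted
```

The input batch is left unmodified, so the clean version remains available for logging or comparison.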

Figure 10: Negative log-likelihood of $\mathcal{GP}$ and $\mathcal{TP}$-based BRUNO on the training batches, smoothed using a moving average over 10 points. Left: not using weightnorm, initial covariances are sampled from for every dimension. Here, the $\mathcal{GP}$-based model diverged after a few hundred iterations. Adding weightnorm fixes this problem. Middle: using weightnorm, covariances are initialised to 0.1, learning rate is 0.002 (two times the default one). In this case, the learning rate is too high for both models, but the $\mathcal{GP}$-based model suffers from it more. Right: using weightnorm, covariances are initialised to 0.95.

We would like to note that there are many settings where both versions of BRUNO diverge or both work well, and that the results of this partial ablation study are not sufficient to draw general conclusions. We can nevertheless speculate that when extending BRUNO to new problems, it is reasonable to start from a $\mathcal{GP}$-based model with weightnorm, small initial covariances, and small learning rates. When finding a good set of hyperparameters proves difficult, it might be worth trying the $\mathcal{TP}$-based BRUNO.

Appendix H Set anomaly detection

Online anomaly detection for exchangeable data is one of the applications where we can use BRUNO. This problem is closely related to the task of content-based image retrieval, where we need to rank an image $x_{n+1}$ on how well it fits with the sequence $x_{1:n}$ Heller and Ghahramani (2006). For the ranking, we use the probabilistic score proposed in Bayesian sets Ghahramani and Heller (2006):

$$\text{score}(x_{n+1}) = \frac{p(x_{n+1} \mid x_{1:n})}{p(x_{n+1})}$$

When we care exclusively about comparing ratios of conditional densities of $x_{n+1}$ under different sequences $x_{1:n}$, we can compare densities in the latent space instead. This is because the Jacobian from the change of variable formula does not depend on the sequence we condition on.
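As an illustration of scoring in the latent space, the sketch below computes the log of the ratio between the predictive density and the prior density for each incoming latent vector, summed over independent latent dimensions. It assumes the Gaussian-process variant, with per-dimension prior mean `mu`, variance `v` and covariance `rho`, and updates the predictive mean and variance recurrently with $d_n = \rho / (v + \rho(n-1))$. Function names and the toy sequence in the usage note are invented for this example.

```python
import math


def log_normal_pdf(x, mean, var):
    """Log-density of a univariate normal distribution."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def latent_log_scores(zs, mu, v, rho):
    """zs: list of latent vectors z_1, z_2, ...; mu, v, rho: per-dimension lists.
    Returns log p(z_n | z_{1:n-1}) - log p(z_n) for every n."""
    dims = range(len(mu))
    mu_n, v_n = list(mu), list(v)  # predictive parameters, initially the prior
    scores = []
    for n, z in enumerate(zs, start=1):
        scores.append(sum(
            log_normal_pdf(z[d], mu_n[d], v_n[d]) - log_normal_pdf(z[d], mu[d], v[d])
            for d in dims))
        for d in dims:  # absorb z into the predictive distribution
            d_n = rho[d] / (v[d] + rho[d] * (n - 1))
            mu_n[d] = (1.0 - d_n) * mu_n[d] + d_n * z[d]
            v_n[d] = (1.0 - d_n) * v_n[d] + d_n * (v[d] - rho[d])
    return scores
```

On a one-dimensional toy sequence such as `[[2.0], [2.1], [1.9], [-3.0]]` with `mu=[0.0]`, `v=[1.0]`, `rho=[0.9]`, the first score is zero (nothing has been observed yet), the third is positive, and the last, inconsistent point gets a strongly negative score.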

For the following experiment, we trained a small convolutional version of BRUNO only on even MNIST digits (30,508 training images). In Figure 11, we give typical examples of how the score evolves as the model gets more data points and how it behaves in the presence of inputs that do not conform with the majority of the sequence. This preliminary experiment shows that our model can detect anomalies in a stream of incoming data.

Figure 11: Evolution of the score as the model sees more images from an input sequence. Identified outliers are marked with vertical lines and plotted on the right in the order from top to bottom. Note that the model was trained only on images of even digits. Left: a sequence of digit ‘1’ images with one image of ‘7’ correctly identified as an outlier. Right: a sequence of digit ‘9’ with one image of digit ‘5’.

Appendix I Model samples

Figure 12: Samples from a model trained on Omniglot. Conditioning images come from character classes that were not used during training, so when $n$ is small, the problem is equivalent to few-shot generation.
Figure 13: Samples from a model trained on CIFAR-10. The model was trained on the set with 10 classes. Conditioning images in the top row of each subplot come from the test set.
Figure 14: Samples from a convolutional BRUNO model trained on Fashion MNIST. The model was trained on the set with 10 classes. Conditioning images in the top row of each subplot come from the test set.
Figure 15: Samples from a non-convolutional model trained on MNIST. The model was trained on the set with 10 classes. Conditioning images in the top row of each subplot come from the test set.