# The Functional Neural Process

We present a new family of exchangeable stochastic processes, the Functional Neural Processes (FNPs). FNPs model distributions over functions by learning a graph of dependencies on top of latent representations of the points in the given dataset. In doing so, they define a Bayesian model without explicitly positing a prior distribution over latent global parameters; they instead adopt priors over the relational structure of the given dataset, a task that is much simpler. We show how we can learn such models from data, demonstrate that they are scalable to large datasets through mini-batch optimization and describe how we can make predictions for new points via their posterior predictive distribution. We experimentally evaluate FNPs on the tasks of toy regression and image classification and show that, when compared to baselines that employ global latent parameters, they offer both competitive predictions as well as more robust uncertainty estimates.

## 1 Introduction

Neural networks are a prevalent paradigm for approximating functions of almost any kind. Their highly flexible parametric form, coupled with large amounts of data, allows for accurate modelling of the underlying task, which usually leads to state-of-the-art predictive performance. While predictive performance is an important aspect, many safety-critical applications, such as self-driving cars, also require accurate uncertainty estimates about the predictions.

Bayesian neural networks (mackay1995probable; neal1995bayesian; graves2011practical; blundell2015weight) have been an attempt at imbuing neural networks with the ability to model uncertainty; they posit a prior distribution over the weights of the network and, through inference, represent their uncertainty in the posterior distribution. Nevertheless, for such complex models the choice of the prior is quite difficult, since understanding the interactions of the parameters with the data is a non-trivial task. As a result, priors are usually chosen for computational convenience and tractability. Furthermore, inference over the weights of a neural network can be a daunting task due to the high dimensionality and posterior complexity (louizos2017multiplicative; shi2017kernel).

An alternative way that can “bypass” the aforementioned issues is that of adopting a stochastic process (klenke2013probability). Stochastic processes posit distributions over functions, e.g. neural networks, directly, without the necessity of adopting prior distributions over global parameters, such as the neural network weights. Gaussian processes (GPs) (rasmussen2003gaussian) are a prime example of a stochastic process; they can encode any inductive bias in the form of a covariance structure among the datapoints in the given dataset, a more intuitive modelling task than positing priors over weights. Furthermore, for vanilla GPs, posterior inference is much simpler. Despite these advantages, they also have two main limitations: 1) the underlying model is not very flexible for high dimensional problems and 2) training and inference are quite costly, since they generally scale cubically with the size of the dataset.

Given the aforementioned limitations of GPs, one might seek a more general way to parametrize stochastic processes that can bypass these issues. To this end, we present our main contribution, Functional Neural Processes (FNPs), a family of exchangeable stochastic processes that posit distributions over functions in a way that combines the properties of neural networks and stochastic processes. We show that, in contrast to prior literature such as Neural Processes (NPs) (garnelo2018neural), FNPs do not require explicit global latent variables in their construction; rather, they operate by building a graph of dependencies among local latent variables, which is more reminiscent of autoencoder-type latent variable models (kingma2013auto; rezende2014stochastic). We further show that we can exploit the local latent variable structure in a way that allows us to easily encode inductive biases, and illustrate one particular instance of this ability by designing an FNP model that behaves similarly to a GP with an RBF kernel. Furthermore, we demonstrate that FNPs are scalable to large datasets, as they facilitate minibatch gradient optimization of their parameters, and have a posterior predictive distribution that is simple to evaluate and sample. Finally, we evaluate FNPs on toy regression and image classification tasks and show that they can obtain competitive performance and more robust uncertainty estimates.

## 2 The Functional Neural Process

For the following we assume that we are operating in the supervised learning setup, where we are given tuples of points $(x, y)$, with $x \in \mathcal{X}$ being the input covariates and $y \in \mathcal{Y}$ being the given label. Let $\mathcal{D} = \{(x_1, y_1), \dots, (x_N, y_N)\}$ be a sequence of $N$ observed datapoints. We are interested in constructing a stochastic process that can bypass the limitations of GPs and can offer the predictive capabilities of neural networks. There are two necessary conditions that have to be satisfied during the construction of such a model: exchangeability and consistency (klenke2013probability). An exchangeable distribution over $\mathcal{D}$ is a joint probability over these elements that is invariant to permutations of these points, i.e.

$$p(y_{1:N} \mid x_{1:N}) = p(y_{\sigma(1:N)} \mid x_{\sigma(1:N)}), \tag{1}$$

where $\sigma(\cdot)$ corresponds to the permutation function. Consistency refers to the phenomenon that the probability defined on an observed sequence of points $\{(x_1, y_1), \dots, (x_n, y_n)\}$, $p_n(\cdot)$, is the same as the probability defined on an extended sequence $\{(x_1, y_1), \dots, (x_{n+m}, y_{n+m})\}$, $p_{n+m}(\cdot)$, when we marginalize over the new points:

$$p_n(y_{1:n} \mid x_{1:n}) = \int p_{n+m}(y_{1:n+m} \mid x_{1:n+m}) \, \mathrm{d}y_{n+1:n+m}. \tag{2}$$

Ensuring that both of these conditions hold allows us to invoke the Kolmogorov extension and de Finetti theorems (klenke2013probability), and hence prove that the model we defined is an exchangeable stochastic process. In this way we can guarantee that there is an underlying Bayesian model with an implied prior over global latent parameters $p_\theta(w)$ such that we can express the joint distribution in a conditional i.i.d. fashion, i.e.

$$p_\theta(y_1, \dots, y_N \mid x_1, \dots, x_N) = \int p_\theta(w) \prod_{i=1}^{N} p(y_i \mid x_i, w) \, \mathrm{d}w.$$

This constitutes the main objective of this work: how can we parametrize and optimize such distributions? Essentially, our target is to introduce dependence among the points of $\mathcal{D}$ in a manner that respects the two aforementioned conditions. We can then encode prior assumptions and inductive biases into the model by considering the relations among said points, a task much simpler than specifying a prior over latent global parameters $p_\theta(w)$. To this end, we introduce in the following our main contribution, the Functional Neural Process (FNP).

### 2.1 Designing the Functional Neural Process

On a high level the FNP follows the construction of a stochastic process as described in (datta2016hierarchical); it posits a distribution over functions $h \in \mathcal{H}$ from $x$ to $y$ by first selecting a “reference” set of points from $\mathcal{X}$, and then basing the probability distribution over $h$ around those points. This concept is similar to the “inducing inputs” that are used in sparse GPs (snelson2006sparse; titsias2009variational). More specifically, let $R = \{x^r_1, \dots, x^r_K\}$ be such a reference set and let $O = \mathcal{X} \setminus R$ be the “other” set, i.e. the set of all possible points that are not in $R$. Now let $\mathcal{D}_x = \{x_1, \dots, x_N\}$ be any finite random set from $\mathcal{X}$ that constitutes our observed inputs. To facilitate the exposition we also introduce two more sets: $M = \mathcal{D}_x \setminus R$, which contains the points of $\mathcal{D}_x$ that are from $O$, and $B = R \cup M$, which contains all of the points in $\mathcal{D}_x$ and $R$. We provide a Venn diagram in Fig. 2. In the following we describe the construction of the model, shown in Fig. 2, and then prove that it corresponds to an infinitely exchangeable stochastic process.

##### Embedding the inputs to a latent space

The first step of the FNP is to embed each of the $x_i$ of $B$ independently to a latent representation $u_i$

$$p_\theta(U_B \mid X_B) = \prod_{i \in B} p_\theta(u_i \mid x_i), \tag{3}$$

where $p_\theta(u_i \mid x_i)$ can be any distribution, e.g. a Gaussian or a delta peak, whose parameters, e.g. the mean and variance, are given by a function of $x_i$. This function can be arbitrary, provided that it is flexible enough to provide a meaningful representation for $x_i$. For this reason, we employ neural networks, as their representational capacity has been demonstrated on a variety of complex high dimensional tasks, such as natural image generation and classification.

##### Constructing a graph of dependencies in the embedding space

The next step is to construct a dependency graph among the points in $B$; it encodes the correlations among the points in $\mathcal{D}$ that arise in the stochastic process. For example, in GPs such a correlation structure is encoded in the covariance matrix according to a kernel function $g(\cdot, \cdot)$ that measures the similarity between two inputs. In the FNP we adopt a different approach. Given the latent embeddings $U_B$ that we obtained in the previous step, we construct two directed graphs of dependencies among the points in $B$: a directed acyclic graph (DAG) $G$ among the points in $R$ and a bipartite graph $A$ from $R$ to $M$. These graphs are represented as random binary adjacency matrices, where e.g. $A_{ij} = 1$ corresponds to the vertex $j$ being a parent of the vertex $i$. The distribution of the bipartite graph can be defined as

$$p(A \mid U_R, U_M) = \prod_{i \in M} \prod_{j \in R} \mathrm{Bern}\left(A_{ij} \mid g(u_i, u_j)\right), \tag{4}$$

where $g(u_i, u_j)$ provides the probability that a point $i \in M$ depends on a point $j$ in the reference set $R$. This graph construction is reminiscent of graphon models (orbanz2015bayesian), with two important distinctions, however. Firstly, the embedding of each node is a vector rather than a scalar, and secondly, the prior distribution over $u$ is conditioned on an initial vertex representation $x$ rather than being the same for all vertices. We believe that the latter is an important aspect, as it is what allows us to maintain enough information about the vertices and construct more informative graphs.
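The sampling step of Eq. 4 can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names, the value of `tau`, and the random embeddings are hypothetical, and the RBF form of $g$ is assumed here because it is the choice the text reports using in practice.

```python
import numpy as np


def rbf_g(u_a, u_b, tau=1.0):
    """Pairwise g(u_i, u_j) = exp(-tau/2 * ||u_i - u_j||^2), values in [0, 1]."""
    diff = u_a[:, None, :] - u_b[None, :, :]
    return np.exp(-0.5 * tau * np.sum(diff ** 2, axis=-1))


def sample_bipartite_graph(u_m, u_r, rng, tau=1.0):
    """Sample A with A[i, j] ~ Bern(g(u_i, u_j)) for i in M and j in R (Eq. 4)."""
    probs = rbf_g(u_m, u_r, tau)
    adjacency = (rng.random(probs.shape) < probs).astype(int)
    return adjacency, probs


rng = np.random.default_rng(0)
u_r = rng.normal(size=(4, 2))  # hypothetical embeddings of 4 reference points
u_m = rng.normal(size=(3, 2))  # hypothetical embeddings of 3 non-reference points
A, probs = sample_bipartite_graph(u_m, u_r, rng)
print(A.shape)  # one row per point in M, one column per point in R
```

Because every entry of `A` is drawn independently given the embeddings, the rows can be sampled for any subset of $M$ without touching the others, which is the property the minibatching argument later relies on.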

The DAG among the points in $R$ is a bit trickier, as we have to adopt a topological ordering of the vectors in $U_R$ in order to avoid cycles. Inspired by the concept of stochastic orderings (shaked2007stochastic), we define an ordering according to a parameter-free scalar projection $t(\cdot)$ of $u$, i.e. $u_i > u_j$ when $t(u_i) > t(u_j)$. The function $t(\cdot)$ is defined as $t(u_i) = \sum_k t_k(u_{ik})$, where each individual $t_k(\cdot)$ is a monotonic function (e.g. the log CDF of a standard normal distribution); in this case we can guarantee that $u_i > u_j$ when individually for all of the dimensions $k$ we have that $u_{ik} > u_{jk}$ under $t_k(\cdot)$. This ordering can then be used in

$$p(G \mid U_R) = \prod_{i \in R} \prod_{j \in R, j \neq i} \mathrm{Bern}\left(G_{ij} \mid \mathbb{I}[t(u_i) > t(u_j)] \, g(u_i, u_j)\right), \tag{5}$$

which leads to random adjacency matrices $G$ that can be re-arranged into a triangular structure with zeros on the diagonal (i.e. DAGs). In a similar manner, such a DAG construction is reminiscent of digraphon models (cai2016priors), a generalization of graphons to the directed case. The same two important distinctions still apply: we are using vector instead of scalar representations, and the prior over the representation of each vertex depends on $x_i$. It is now straightforward to bake in any relational inductive biases that we want our function to have by appropriately defining the $g(\cdot, \cdot)$ that is used for the construction of $G$ and $A$. For example, we can encode an inductive bias that neighboring points should be dependent by choosing $g(u_i, u_j) = \exp\left(-\frac{\tau}{2} \|u_i - u_j\|^2\right)$. This is what we used in practice. We provide examples of the $A$, $G$ that FNPs learn in Figures 44 respectively.
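To make the ordering trick of Eq. 5 concrete, the following NumPy sketch samples a graph among reference points and relies on the indicator $\mathbb{I}[t(u_i) > t(u_j)]$ to rule out cycles. It is a minimal illustration under assumptions: the helper names, `tau`, and the random embeddings are hypothetical, $t_k$ is taken to be the log CDF of a standard normal, and $g$ is the RBF mentioned in the text.

```python
import math
import numpy as np

# Standard normal CDF applied elementwise (hypothetical helper).
_std_normal_cdf = np.vectorize(lambda v: 0.5 * (1.0 + math.erf(v / math.sqrt(2.0))))


def t_projection(u):
    """t(u_i) = sum_k log Phi(u_ik): a parameter-free, monotonic scalar projection."""
    return np.sum(np.log(_std_normal_cdf(u)), axis=-1)


def sample_dag(u_r, rng, tau=1.0):
    """Sample G with G[i, j] ~ Bern(I[t(u_i) > t(u_j)] * g(u_i, u_j)) (Eq. 5)."""
    t = t_projection(u_r)
    diff = u_r[:, None, :] - u_r[None, :, :]
    g = np.exp(-0.5 * tau * np.sum(diff ** 2, axis=-1))
    # j may be a parent of i only if it comes earlier in the ordering.
    allowed = t[:, None] > t[None, :]
    probs = allowed * g
    adjacency = (rng.random(probs.shape) < probs).astype(int)
    return adjacency, t


rng = np.random.default_rng(1)
u_r = rng.normal(size=(6, 3))  # hypothetical embeddings of 6 reference points
G, t = sample_dag(u_r, rng, tau=0.1)

# Sorting the vertices by t(u) exposes the strictly lower-triangular structure.
order = np.argsort(t)
G_sorted = G[np.ix_(order, order)]
```

Since an edge from $j$ to $i$ requires $t(u_i) > t(u_j)$ strictly, the diagonal and every "backward" entry have probability zero, so `G_sorted` has no nonzero entries on or above its diagonal.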

##### Parametrizing the predictive distribution

Having obtained the dependency graphs $A$, $G$, we are now interested in how to construct a predictive model that induces them. To this end, we parametrize predictive distributions for each target variable $y_i$ that explicitly depend on the reference set $R$ according to the structure of $G$ and $A$. This is realized via a local latent variable $z_i$ that summarizes the context from the selected parent points in $R$ and their targets $y_R$

$$\begin{aligned} \int p_\theta(y_B, Z_B \mid R, G, A) \, \mathrm{d}Z_B &= \int p_\theta(y_R, Z_R \mid R, G) \, \mathrm{d}Z_R \int p_\theta(y_M, Z_M \mid R, y_R, A) \, \mathrm{d}Z_M \\ &= \prod_{i \in R} \int p_\theta\left(z_i \mid \mathrm{par}_{G_i}(R, y_R)\right) p_\theta(y_i \mid z_i) \, \mathrm{d}z_i \prod_{j \in M} \int p_\theta\left(z_j \mid \mathrm{par}_{A_j}(R, y_R)\right) p_\theta(y_j \mid z_j) \, \mathrm{d}z_j \end{aligned} \tag{6}$$

where $\mathrm{par}_{G_i}(\cdot)$, $\mathrm{par}_{A_j}(\cdot)$ are functions that return the parents of the points $i$, $j$ according to $G$, $A$ respectively. Notice that we are guaranteed that the decomposition into the conditionals in Eq. 6 is valid, since the DAG $G$ coupled with $A$ corresponds to another DAG. Since permutation invariance in the parents is necessary for an overall exchangeable model, we define each distribution over $z$, e.g. $p_\theta\left(z_i \mid \mathrm{par}_{A_i}(R, y_R)\right)$, as an independent Gaussian distribution per dimension $k$ of $z$. (The factorized Gaussian distribution was chosen for simplicity, and it is not a limitation. Any distribution is valid for $z$ provided that it defines a permutation invariant probability density w.r.t. the parents.)

$$p_\theta\left(z_{ik} \mid \mathrm{par}_{A_i}(R, y_R)\right) = \mathcal{N}\left(z_{ik} \,\middle|\, C_i \sum_{j \in R} A_{ij} \, \mu_\theta(x^r_j, y^r_j)_k, \; \exp\left(C_i \sum_{j \in R} A_{ij} \, \nu_\theta(x^r_j, y^r_j)_k\right)\right) \tag{7}$$

where $\mu_\theta(\cdot, \cdot)$ and $\nu_\theta(\cdot, \cdot)$ are vector valued functions with a codomain in $\mathbb{R}^{|z|}$ that transform the data tuples of $R$, $y_R$. The $C_i$ is a normalization constant with $C_i = \left(\sum_j A_{ij} + \epsilon\right)^{-1}$, i.e. it corresponds to the reciprocal of the number of parents of point $i$, with an extra small $\epsilon$ to avoid division by zero when a point has no parents. By observing Eq. 6 we can see that the prediction for a given $y_i$ depends on the input covariates $x_i$ only indirectly via the graphs $G$, $A$, which are a function of $u_i$. Intuitively, it encodes the inductive bias that predictions on points that are “far away”, i.e. have very small probability of being connected to the reference set via $A$, will default to an uninformative standard normal prior over $z_i$ (both sums in Eq. 7 vanish, giving mean $0$ and variance $\exp(0) = 1$), hence a constant prediction for $y_i$. This is similar to the behaviour that GPs with RBF kernels exhibit.
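The parent aggregation of Eq. 7 amounts to a normalized matrix product. The sketch below is a minimal illustration; the function name and the toy arrays are hypothetical, and `mu_r`, `nu_r` stand in for the outputs of $\mu_\theta$ and $\nu_\theta$ on the reference tuples.

```python
import numpy as np


def z_prior_params(adjacency, mu_r, nu_r, eps=1e-8):
    """Mean and variance of p(z_i | parents) as in Eq. 7.

    adjacency: (n_points, n_ref) binary matrix; entry [i, j] = 1 if j is a parent of i
    mu_r, nu_r: (n_ref, dim_z) per-reference-point contributions
    """
    c = 1.0 / (adjacency.sum(axis=1, keepdims=True) + eps)  # C_i
    mean = c * (adjacency @ mu_r)
    var = np.exp(c * (adjacency @ nu_r))
    return mean, var


A = np.array([[1, 1, 0],   # a point with two parents
              [0, 0, 0]])  # a point with no parents
mu_r = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
nu_r = np.zeros((3, 2))
mean, var = z_prior_params(A, mu_r, nu_r)
# Row 0 averages its two parents; row 1 falls back to mean 0 and variance 1.
```

The sum over parents followed by division by their count is what makes the prior invariant to the order in which the reference points are listed, and the parentless row shows the default to a standard normal described in the text.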

Nevertheless, Eq. 6 can also hinder extrapolation, something that neural networks can do well. In case extrapolation is important, we can always add a direct path by conditioning the prediction on $u_i$, the latent embedding of $x_i$, i.e. $p_\theta(y_i \mid z_i, u_i)$. This can serve as a middle ground where we can allow some extrapolation via $u$. In general, it provides a knob, as we can now interpolate between GP and neural network behaviours by e.g. changing the dimensionalities of $z$ and $u$.

##### Putting everything together: the FNP and FNP+ models

Now by putting everything together we arrive at the overall definitions of the two FNP models that we propose

$$\mathrm{FNP}_\theta(\mathcal{D}) := \sum_{G, A} \int p_\theta(U_B \mid X_B) \, p(G, A \mid U_B) \, p_\theta(y_B, Z_B \mid R, G, A) \, \mathrm{d}U_B \, \mathrm{d}Z_B \, \mathrm{d}y_{i \in R \setminus \mathcal{D}_x}, \tag{8}$$

$$\mathrm{FNP}^+_\theta(\mathcal{D}) := \sum_{G, A} \int p_\theta(U_B, G, A \mid X_B) \, p_\theta(y_B, Z_B \mid R, U_B, G, A) \, \mathrm{d}U_B \, \mathrm{d}Z_B \, \mathrm{d}y_{i \in R \setminus \mathcal{D}_x}, \tag{9}$$

where the first makes predictions according to Eq. 6 and the second further conditions on $u$. Notice that besides the marginalizations over the latent variables and graphs, we also marginalize over any of the points in the reference set that are not part of the observed dataset $\mathcal{D}$. This is necessary for the proof of consistency that we provide later. For this work, we always chose the reference set to be a part of the dataset $\mathcal{D}$, so the extra integration is omitted. In general, the marginalization can provide a mechanism to include unlabelled data in the model, which could be used to e.g. learn a better embedding $u$ or “impute” the missing labels. We leave the exploration of such an avenue for future work. Having defined the models in Eqs. 8 and 9, we now prove that they both define valid permutation invariant stochastic processes by borrowing the methodology described in (datta2016hierarchical).

###### Proposition 1.

The distributions defined in Eqs. 8 and 9 are valid permutation invariant stochastic processes, hence they correspond to Bayesian models.

###### Proof sketch.

The full proof can be found in the Appendix. Permutation invariance can be proved by noting that each of the terms in the products is permutation equivariant w.r.t. permutations of $\mathcal{D}$, hence each of the individual distributions defined in Eqs. 8 and 9 is permutation invariant due to the products. To prove consistency we have to consider two cases (datta2016hierarchical): the case where we add a point that is part of $R$ and the case where we add one that is not part of $R$. In the first case, marginalizing out that point will lead to the same distribution (as we were marginalizing over that point already), whereas in the second case the point that we are adding is a leaf in the dependency graph, hence marginalizing it does not affect the other points. ∎

### 2.2 The FNPs in practice: fitting and predictions

Having defined the two models, we are now interested in how we can fit their parameters when we are presented with a dataset $\mathcal{D}$, as well as how to make predictions for novel inputs $x^*$. For simplicity, we assume that $R \subseteq \mathcal{D}_x$ and focus on the FNP, as the derivations for the FNP$^+$ are analogous. Notice that in this case we have that $B = \mathcal{D}_x = X_{\mathcal{D}}$.

##### Fitting the model to data

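Fitting the model parameters with maximum marginal likelihood is difficult, as the necessary integrals and sums of Eq. 8 are intractable. For this reason, we employ variational inference and maximize the following lower bound to the marginal likelihood of $\mathcal{D}$

$$\mathcal{L} = \mathbb{E}_{q_\phi(U_{\mathcal{D}}, G, A, Z_{\mathcal{D}} \mid X_{\mathcal{D}})}\left[\log p_\theta(U_{\mathcal{D}}, G, A, Z_{\mathcal{D}}, y_{\mathcal{D}} \mid X_{\mathcal{D}}) - \log q_\phi(U_{\mathcal{D}}, G, A, Z_{\mathcal{D}} \mid X_{\mathcal{D}})\right], \tag{10}$$

with respect to the model parameters $\theta$ and variational parameters $\phi$. For a tractable lower bound, we assume that the variational posterior distribution $q_\phi(U_{\mathcal{D}}, G, A, Z_{\mathcal{D}} \mid X_{\mathcal{D}})$ factorizes as $p_\theta(U_{\mathcal{D}} \mid X_{\mathcal{D}}) \, p(G \mid U_R) \, p(A \mid U_{\mathcal{D}}) \, q_\phi(Z_{\mathcal{D}} \mid X_{\mathcal{D}})$ with $q_\phi(Z_{\mathcal{D}} \mid X_{\mathcal{D}}) = \prod_{i=1}^{|\mathcal{D}|} q_\phi(z_i \mid x_i)$. The terms for $U_{\mathcal{D}}$, $G$ and $A$ then cancel inside the logarithms, which leads to

$$\begin{aligned} \mathcal{L}_R + \mathcal{L}_{M|R} = \; & \mathbb{E}_{p_\theta(U_R, G \mid X_R) \, q_\phi(Z_R \mid X_R)}\left[\log p_\theta(y_R, Z_R \mid R, G) - \log q_\phi(Z_R \mid X_R)\right] \\ & + \mathbb{E}_{p_\theta(U_{\mathcal{D}}, A \mid X_{\mathcal{D}}) \, q_\phi(Z_M \mid X_M)}\left[\log p_\theta(y_M \mid Z_M) + \log p_\theta\left(Z_M \mid \mathrm{par}_A(R, y_R)\right) - \log q_\phi(Z_M \mid X_M)\right] \end{aligned} \tag{11}$$

where we decomposed the lower bound into the terms for the reference set $R$, $\mathcal{L}_R$, and the terms that correspond to $M$, $\mathcal{L}_{M|R}$. For large datasets we are interested in doing efficient optimization of this bound. While the first term is not, in general, amenable to minibatching, the second term is. As a result, we can use minibatches that scale according to the size of the reference set $R$. We provide more details in the Appendix.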
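As a rough illustration of why the second term of Eq. 11 is amenable to minibatching, the sketch below forms a single-sample estimate from a batch of points in $M$. It is an assumption-laden sketch rather than the authors' procedure: the function names are hypothetical, all distributions over $z$ are taken to be diagonal Gaussians, and the rescaling by `n_m / batch_size` is the standard unbiased minibatch estimator, which may differ from the scheme described in the paper's Appendix.

```python
import numpy as np


def diag_gaussian_logpdf(z, mean, var):
    """Log density of a diagonal Gaussian, summed over the last axis."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (z - mean) ** 2 / var, axis=-1)


def m_term_estimate(log_lik, z, prior_mean, prior_var, q_mean, q_var, n_m):
    """Single-sample minibatch estimate of the M-term of the bound.

    log_lik: (batch,) values of log p(y_j | z_j)
    z:       (batch, dim_z) samples z_j ~ q(z_j | x_j)
    n_m:     total number of points in M, used to rescale the minibatch sum
    """
    per_point = (log_lik
                 + diag_gaussian_logpdf(z, prior_mean, prior_var)
                 - diag_gaussian_logpdf(z, q_mean, q_var))
    return n_m / len(per_point) * np.sum(per_point)


rng = np.random.default_rng(0)
batch, dim_z = 5, 3
q_mean, q_var = rng.normal(size=(batch, dim_z)), np.full((batch, dim_z), 0.5)
z = q_mean + np.sqrt(q_var) * rng.normal(size=(batch, dim_z))  # reparametrized sample
log_lik = rng.normal(size=batch) - 1.0  # placeholder values for log p(y_j | z_j)
estimate = m_term_estimate(log_lik, z, np.zeros((batch, dim_z)),
                           np.ones((batch, dim_z)), q_mean, q_var, n_m=100)
```

Each summand only involves one point of $M$ and its parents in $R$, so the cost per batch is governed by the size of the reference set rather than by the full dataset.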

In practice, for all of the distributions over $u$ and $z$ we use diagonal Gaussians, whereas for $G$, $A$ we use the concrete / Gumbel-softmax relaxations (maddison2016concrete; jang2016categorical) during training. In this way we can jointly optimize $\theta$, $\phi$ with gradient based optimization by employing the pathwise derivatives obtained with the reparametrization trick (kingma2013auto; rezende2014stochastic). Furthermore, we tie most of the parameters of the model and of the inference network, as the regularizing nature of the lower bound can alleviate potential overfitting of the model parameters $\theta$. More specifically, for $p_\theta(u_i \mid x_i)$, $q_\phi(z_i \mid x_i)$ we share a neural network torso and have two output heads, one for each distribution. We also parametrize the priors over the latent $z$ in terms of the $q_\phi(z_i \mid x_i)$ for the points in $R$; the $\mu_\theta(x^r_i, y^r_i)$, $\nu_\theta(x^r_i, y^r_i)$ are both defined as $\mu_q(x^r_i) + \mu^r_y$, $\nu_q(x^r_i) + \nu^r_y$, where $\mu_q(\cdot)$, $\nu_q(\cdot)$ are the functions that provide the mean and variance for $q_\phi(z_i \mid x_i)$ and $\mu^r_y$, $\nu^r_y$ are linear embeddings of the labels.
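For readers unfamiliar with the binary concrete relaxation used for the graph entries, here is a minimal NumPy sketch of how one relaxed Bernoulli sample is drawn. The function name and the temperature value are hypothetical choices for illustration and are not taken from the paper; in an actual training loop the same computation would be written in an autodiff framework so that gradients flow through `probs`.

```python
import numpy as np


def sample_binary_concrete(probs, temperature, rng, eps=1e-7):
    """Relaxed Bernoulli sample in (0, 1) via the binary concrete distribution."""
    probs = np.clip(probs, eps, 1.0 - eps)
    logits = np.log(probs) - np.log1p(-probs)
    noise = rng.uniform(eps, 1.0 - eps, size=probs.shape)
    logistic = np.log(noise) - np.log1p(-noise)  # Logistic(0, 1) noise
    x = (logits + logistic) / temperature
    return 0.5 * (1.0 + np.tanh(0.5 * x))  # numerically stable sigmoid


rng = np.random.default_rng(0)
edge_probs = np.full(20000, 0.3)  # hypothetical values of g(u_i, u_j)
soft_edges = sample_binary_concrete(edge_probs, temperature=0.5, rng=rng)
hard_edges = (soft_edges > 0.5).astype(int)  # thresholding recovers Bernoulli(0.3) draws
```

Lower temperatures push the samples towards $\{0, 1\}$ at the price of higher gradient variance; thresholding at $0.5$ gives exact Bernoulli samples with the original probabilities, which is one way to obtain discrete graphs at test time.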

It is interesting to see that the overall bound in Eq. 11 is reminiscent of the bound of a latent variable model such as a variational autoencoder (VAE) (kingma2013auto; rezende2014stochastic) or a deep variational information bottleneck model (VIB) (alemi2016deep). We aim to predict the label $y_i$ of a given point $x_i$ from its latent code $z_i$, where the prior, instead of being globally the same as in (kingma2013auto; rezende2014stochastic; alemi2016deep), is conditioned on the parents of that particular point. The conditioning is also intuitive, as it is what converts the i.i.d. model to the more general exchangeable model. This is also similar to the VAE for unsupervised learning described in associative compression networks (ACNs) (graves2018associative) and is reminiscent of works on few-shot learning (bartunov2018few).

##### The posterior predictive distribution

In order to perform predictions for unseen points $x^*$, we employ the posterior predictive distribution of FNPs. More specifically, we can show that by using Bayes' rule, the predictive distribution of the FNPs has the following simple form

$$\sum_{a^*} \int p_\theta(U_R, u^* \mid X_R, x^*) \, p(a^* \mid U_R, u^*) \, p_\theta\left(z^* \mid \mathrm{par}_{a^*}(R, y_R)\right) p_\theta(y^* \mid z^*) \, \mathrm{d}U_R \, \mathrm{d}u^* \, \mathrm{d}z^* \tag{12}$$

where $u$ are the representations given by the neural network and $a^*$ is the binary vector that denotes which points from $R$ are the parents of the new point. We provide more details in the Appendix. Intuitively, we first project the reference set and the new point onto the latent space with a neural network and then make a prediction by basing it on the parents from $R$ according to $a^*$. This predictive distribution is reminiscent of the models employed in few-shot learning (vinyals2016matching).
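Eq. 12 can be approximated by ancestral sampling: embed, sample parents, sample $z^*$, predict, and average. The sketch below does this on a toy two-class problem. Everything specific in it is a hypothetical stand-in chosen for illustration (the noisy identity embedding, the constants in `mu_r` and `nu_r`, the softmax-over-$z^*$ classifier, and the RBF $g$); it is not the architecture used in the paper.

```python
import numpy as np


def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()


def predict(x_star, x_r, y_r, rng, n_classes=2, n_samples=200, tau=1.0, eps=1e-8):
    """Monte Carlo estimate of the predictive distribution for one new input."""
    onehot = np.eye(n_classes)[y_r]
    mu_r = 3.0 * onehot              # toy stand-in for mu_theta(x_j^r, y_j^r)
    nu_r = np.full_like(mu_r, -2.0)  # toy stand-in for nu_theta(x_j^r, y_j^r)
    probs = np.zeros(n_classes)
    for _ in range(n_samples):
        # 1) embed the reference set and the new point (toy Gaussian embedding)
        u_r = x_r + 0.05 * rng.normal(size=x_r.shape)
        u_star = x_star + 0.05 * rng.normal(size=x_star.shape)
        # 2) sample the parent indicator vector a* using an RBF g
        g = np.exp(-0.5 * tau * np.sum((u_r - u_star) ** 2, axis=-1))
        a = (rng.random(g.shape) < g).astype(float)
        # 3) sample z* from the parent-conditioned Gaussian of Eq. 7
        c = 1.0 / (a.sum() + eps)
        mean, var = c * (a @ mu_r), np.exp(c * (a @ nu_r))
        z = mean + np.sqrt(var) * rng.normal(size=mean.shape)
        # 4) predict y* from z* (toy classifier: softmax over z*)
        probs += softmax(z)
    return probs / n_samples


rng = np.random.default_rng(0)
x_r = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
y_r = np.array([0, 0, 1, 1])
near = predict(np.array([0.1, 0.0]), x_r, y_r, rng)     # close to the class-0 references
far = predict(np.array([50.0, -50.0]), x_r, y_r, rng)   # far from every reference point
```

In this toy setting the point near the class-0 references receives a confident class-0 prediction, while the distant point picks up no parents, falls back to the standard normal prior over $z^*$, and ends up with a close to uniform prediction.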

## 3 Related work

There has been a long line of research in Bayesian Neural Networks (BNNs) (graves2011practical; blundell2015weight; kingma2015variational; hernandez2015probabilistic; louizos2017multiplicative; shi2017kernel). Many works have focused on the hard task of posterior inference for BNNs, by positing more flexible posteriors (louizos2017multiplicative; shi2017kernel; louizos2016structured; zhang2017noisy; bae2018eigenvalue). The exploration of more involved priors has so far not gained much traction, with the exception of a handful of works (kingma2015variational; louizos2017bayesian; atanov2018deep; hafner2018reliable). For flexible stochastic processes, there is a line of works that focus on (scalable) Gaussian Processes (GPs); these revolve around sparse GPs (snelson2006sparse; titsias2009variational), using neural networks to parametrize the kernel of a GP (wilson2016deep; wilson2016stochastic), employing finite rank approximations to the kernel (cutajar2017random; hensman2017variational) or parametrizing kernels over structured data (mattos2015recurrent; van2017convolutional). Most of these are unfortunately still quite involved and might not scale well to large datasets.

There have been interesting recent works that attempt to merge stochastic processes and neural networks. Neural Processes (NPs) (garnelo2018neural) define distributions over global latent variables in terms of subsets of the data, while Attentive NPs (kim2019attentive) extend NPs with a deterministic path that has a cross-attention mechanism among the datapoints. In a sense, FNPs can be seen as a variant where we discard the global latent variables and instead incorporate cross-attention in the form of a dependency graph among local latent variables. Another line of work includes Variational Implicit Processes (VIPs) (ma2018variational), which consider BNN priors and then use GPs for inference, and functional variational BNNs (fBNNs) (sun2019functional), which employ GP priors and use BNNs for inference. Both methods have their drawbacks: with VIPs we have to posit a meaningful prior over global parameters, and the objective of fBNNs does not always correspond to a bound on the marginal likelihood.

Similarities can also be seen in other works; Associative Compression Networks (ACNs) (graves2018associative) employ similar ideas for generative modelling with VAEs and condition the prior over the latent variable of a point on its nearest neighbors. Correlated VAEs (tang2019correlated) similarly employ an (a priori known) dependency structure across the latent variables of the points in the dataset. In few-shot learning, metric-based approaches (vinyals2016matching; bartunov2018few; sung2018learning; snell2017prototypical; koch2015siamese) similarly rely on similarities w.r.t. a reference set for predictions.

## 4 Experiments

We performed two main experiments in order to verify the effectiveness of FNPs. We implemented and compared against three baselines: a standard neural network (denoted as NN), a neural network trained and evaluated with Monte Carlo (MC) dropout (gal2016dropout) and a Neural Process (NP) (garnelo2018neural) architecture. The architecture of the NP was designed in a way that is similar to the FNP. For the first experiment we explored the inductive biases we can encode in FNPs by visualizing the predictive distributions in a toy 1-d regression task. For the second, we measured the prediction performance and uncertainty quality that FNPs can offer on the benchmark image classification tasks of MNIST and CIFAR 10. We provide the experimental details in the Appendix.

##### Exploring the inductive biases in toy regression

To visually assess the inductive biases we encode in the FNP we experiment with the toy 1-d regression task described in (osband2016deep). The generative process corresponds to drawing 12 points from , 8 points from and then parametrizing the target as with

. This generates a nonlinear function with “gaps” in between the data where we, ideally, want the uncertainty to be high. For all of the models we used a heteroscedastic noise model. Furthermore, due to the toy nature of this experiment, we also included a Gaussian Process (GP) with an RBF kernel. We used

dimensions for the global latent of NP and dimensions for the latents of the FNPs. For the reference set we used 10 random points for the FNPs and the full dataset for the NP.

The results we obtain are presented in Figure 5. We can see that the FNP with the RBF function for $g(\cdot, \cdot)$ has a behaviour that is very similar to the GP. This is not the case for MC-dropout or NP, where we see a more linear behaviour of the uncertainty and erroneous overconfidence in the areas in between the data. Nevertheless, they do seem to extrapolate better, whereas FNP and GP default to a flat zero prediction outside of the data. The FNP$^+$ seems to combine the best of both worlds, as it allows for extrapolation and GP-like uncertainty, although a free bits (chen2016variational) modification of the bound for $z$ was helpful in encouraging the model to rely more on these particular latent variables. Empirically, we observed that adding more capacity to $u$ can move the FNP$^+$ closer to the behaviour we observe for MC-dropout and NPs. In addition, increasing the number of model parameters can make FNPs overfit, which can result in a reduction of predictive uncertainty.

##### Prediction performance and uncertainty quality

For the second task we considered the image classification of MNIST and CIFAR 10. For MNIST we used a LeNet-5 architecture that had two convolutional and two fully connected layers, whereas for CIFAR we used a VGG-like architecture that had six convolutional and two fully connected layers. In both experiments we used 300 random points from $\mathcal{D}$ as $R$ for the FNPs; for NPs, in order to be comparable, we randomly selected up to 300 points from the current batch for the context points during training and used the same 300 points as the FNPs for evaluation. The dimensionality of was for the FNP models in both datasets, whereas for the NP the dimensionality of the global variable was for MNIST and for CIFAR.

As a proxy for the uncertainty quality we used the task of out of distribution (o.o.d.) detection; given the fact that FNPs are Bayesian models, we would expect their epistemic uncertainty to increase in areas where we have no data (i.e. o.o.d. datasets). The metrics that we report are the average entropy on those datasets as well as the area under an ROC curve (AUCR) that determines whether a point is in or out of distribution according to the predictive entropy. Notice that it is simple to increase the first metric by just learning a trivial model, but that would be detrimental for AUCR; in order to have good AUCR the model must have low entropy on the in-distribution test set but high entropy on the o.o.d. datasets. For the MNIST model we considered notMNIST, Fashion MNIST, Omniglot, Gaussian and uniform noise as o.o.d. datasets, whereas for CIFAR 10 we considered SVHN, a tinyImagenet resized to $32 \times 32$ pixels, iSUN and similarly Gaussian and uniform noise. The summary of the results can be seen in Table 1.
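The two reported quantities can be computed directly from predicted class probabilities. The sketch below is a generic illustration with hypothetical function names and made-up probability vectors; the AUROC is computed through its rank interpretation, i.e. the probability that a randomly chosen o.o.d. point has higher predictive entropy than a randomly chosen in-distribution point, with ties counted as one half.

```python
import numpy as np


def predictive_entropy(probs, eps=1e-12):
    """Entropy of each row of class probabilities, in nats."""
    return -np.sum(probs * np.log(probs + eps), axis=-1)


def auroc(scores_in, scores_out):
    """P(score_out > score_in) + 0.5 * P(score_out == score_in) over all pairs."""
    diff = scores_out[:, None] - scores_in[None, :]
    return np.mean((diff > 0) + 0.5 * (diff == 0))


# Made-up predictions: confident on in-distribution inputs, diffuse on o.o.d. inputs.
probs_in = np.array([[0.98, 0.01, 0.01], [0.90, 0.05, 0.05]])
probs_out = np.array([[1 / 3, 1 / 3, 1 / 3], [0.40, 0.30, 0.30]])

h_in, h_out = predictive_entropy(probs_in), predictive_entropy(probs_out)
print(h_out.mean())        # average entropy on the o.o.d. set
print(auroc(h_in, h_out))  # how well entropy separates the two sets
```

A trivial model that always outputs the uniform distribution maximizes the average o.o.d. entropy but gives identical scores on both sets, so its AUROC collapses to $0.5$, which is the failure mode the text warns about.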

We observe that both FNPs have comparable accuracy to the baseline models while having higher average entropies and AUCR on the o.o.d. datasets. FNP$^+$ in general seems to perform better than FNP. The FNP did have a relatively high in-distribution entropy for CIFAR 10, perhaps denoting that a larger $R$ might be more appropriate. We further see that the FNPs almost always have better AUCR than all of the baselines we considered. Interestingly, out of all the non-noise o.o.d. datasets, we observed that Fashion MNIST and SVHN were the hardest to distinguish on average across all the models. This effect seems to agree with the observations from (nalisnick2018deep), although more investigation is required. We also observed that, sometimes, the noise datasets can act as “adversarial examples” (szegedy2013intriguing) for all of the baselines, thus leading to lower entropy than the in-distribution test set (e.g. Gaussian noise for the NN on CIFAR 10). FNPs did have a similar effect on CIFAR 10, e.g. the FNP on uniform noise, although to a much lesser extent. We leave the exploration of this phenomenon for future work. It should be mentioned that other advances in o.o.d. detection, e.g. (liang2017enhancing; choi2018generative), are orthogonal to FNPs and could further improve performance.

We also provide some additional insights from ablation studies on MNIST w.r.t. the sensitivity to the number of points in $R$ for NP, FNP and FNP$^+$, as well as to the number of dimensions for $u$, $z$ in the FNP$^+$. The results can be found in the Appendix. We generally observed that NP models have lower average entropy on the o.o.d. datasets than both FNP and FNP$^+$, irrespective of the size of $R$. The choice of $R$ seems to be more important for the FNPs than for NPs, with FNP needing a larger $R$, compared to FNP$^+$, to fit the data well. In general, it seemed that it is not the quantity of points that matters but rather the quality; the performance did not always increase with more points. This supports the idea that there could be a “coreset” of points, so exploring ways to infer it is a promising direction for future research that could improve scalability and alleviate the dependence of FNPs on a reasonable $R$. As for the trade-off between $z$ and $u$ in FNP$^+$: a larger capacity for $z$, compared to $u$, leads to better uncertainty, whereas the other way around seems to improve accuracy. These observations are conditioned on having a reasonably large $u$ in order to learn a meaningful $G$, $A$.

## 5 Discussion

We presented a novel family of exchangeable stochastic processes, the Functional Neural Processes (FNPs). In contrast to NPs (garnelo2018neural), which employ global latent variables, FNPs operate by employing local latent variables along with a dependency structure among them, which allows for easier encoding of inductive biases. We verified the potential of FNPs experimentally, and showed that they can serve as competitive alternatives. We believe that FNPs open the door to plenty of exciting avenues for future research: designing better function priors by e.g. imposing a manifold structure on the FNP latents (falorsi2019reparameterizing), extending FNPs to unsupervised learning by e.g. adapting ACNs (graves2018associative), or considering hierarchical models similar to deep GPs (damianou2012deep).

#### Acknowledgments

We would like to thank Patrick Forré for helpful discussions over the course of this project, and Peter Orbanz and Benjamin Bloem-Reddy for helpful discussions during a preliminary version of this work. We would also like to thank Daniel Worrall, Tim Bakker and Stephan Alaniz for helpful feedback on an initial draft.

## Appendix A Experimental details

Throughout the experiments, the architectures for the FNP and FNP$^{+}$ were constructed as follows. We used a neural network torso in order to obtain an intermediate hidden representation of the inputs and then parametrized two linear output layers, one that led to the parameters of $p_\theta(u \mid x)$ and one that led to the parameters of $q_\phi(z \mid x)$, both of which were fully factorized Gaussians. The function for the Bernoulli probabilities was set to an RBF, whose parameter was optimized to maximize the lower bound. The temperature of the binary concrete / Gumbel-softmax relaxation was kept fixed throughout training and we used the log CDF of a standard normal as the function that defines the ordering for the DAG $G$. For the classifiers we used a linear model that operated on top of $z$ (FNP) or $[z, u]$ (FNP$^{+}$). We used a single Monte Carlo sample for each batch during training in order to estimate the bound of FNPs. We similarly used a single sample for the NP and MC-dropout. All of the models were implemented in PyTorch and were run across five Titan X (Pascal) GPUs (one GPU per model).
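As a rough illustration of the edge-sampling step described above, the following sketch computes an RBF score between two latent codes and draws a relaxed (binary concrete) Bernoulli sample from it. The exact kernel form, the parameter name `tau`, the temperature value and all function names are assumptions made for illustration; they are not taken from the original implementation.

```python
import math
import random

def rbf_score(u_i, u_j, tau=1.0):
    """RBF similarity in (0, 1]; the exact kernel form is an assumption."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u_i, u_j))
    return math.exp(-0.5 * tau * sq_dist)

def relaxed_bernoulli(prob, temperature, rng):
    """Binary concrete sample: lies in (0, 1), sharper for small temperatures."""
    eps = 1e-6
    prob = min(max(prob, eps), 1.0 - eps)            # keep the logit finite
    noise = min(max(rng.random(), eps), 1.0 - eps)   # uniform noise, clipped
    logit = math.log(prob) - math.log1p(-prob)
    logistic_noise = math.log(noise) - math.log1p(-noise)
    return 1.0 / (1.0 + math.exp(-(logit + logistic_noise) / temperature))

rng = random.Random(0)
p_edge = rbf_score([0.1, -0.3], [0.2, 0.4], tau=1.0)   # hypothetical latent codes
soft_edge = relaxed_bernoulli(p_edge, temperature=0.5, rng=rng)
```

Because the relaxed sample is a smooth function of `p_edge`, gradients can flow back to the latent codes and to the kernel parameter, which is the usual motivation for this kind of relaxation.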

The NN and MC-dropout had the same torso and classifier as the FNPs. As the NP has not been previously employed in the settings we considered, we designed the architecture in a way that is similar to the FNP. More specifically, we used the same neural network torso to provide an intermediate representation for the inputs. To obtain the global embedding we concatenated the labels to this representation, projected the result with a linear layer and then computed the average of each dimension across the context. The parameters of the distribution over the global latent variables were then given by a linear layer acting on top of this embedding. After sampling the global latent variables, we used a linear classifier to make the predictions.
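The aggregation used for the NP baseline can be sketched as follows: each context point's hidden representation is concatenated with its one-hot label, projected linearly, and the projections are averaged. This is a minimal pure-Python sketch; the toy sizes, the bias-free projection and the helper names are assumptions, not details from the original code.

```python
def one_hot(label, num_classes):
    return [1.0 if k == label else 0.0 for k in range(num_classes)]

def linear(vec, weights):
    """Bias-free linear layer given as a list of weight rows (assumed form)."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def global_embedding(hidden, labels, weights, num_classes):
    """Concatenate [h, y], project, then average each dimension over the context."""
    projected = [linear(h + one_hot(y, num_classes), weights)
                 for h, y in zip(hidden, labels)]
    n = len(projected)
    return [sum(col) / n for col in zip(*projected)]

hidden = [[0.5, -1.0], [2.0, 0.0], [1.0, 1.0]]            # toy torso outputs
labels = [0, 1, 1]
weights = [[1.0, 0.0, 0.5, -0.5], [0.0, 1.0, 1.0, 1.0]]   # 2 outputs, 2 + 2 inputs
r = global_embedding(hidden, labels, weights, num_classes=2)
r_shuffled = global_embedding(hidden[::-1], labels[::-1], weights, num_classes=2)
```

Since the mean ignores the order of the context points, `r` and `r_shuffled` agree, which is what makes this embedding suitable for an exchangeable model.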

In the regression experiment, for the initial transformation of the inputs we used 100 ReLUs for both the NP and FNP models via a single-layer MLP, whereas for the regressor we used a linear layer for the NP (more capacity led to overfitting and a decrease in predictive uncertainty) and a single-hidden-layer MLP of 100 ReLUs for the FNPs. For the MC-dropout network we used a single-hidden-layer MLP of 100 units and applied dropout with a rate of 0.5 at the hidden layer. In all of the neural network models, the heteroscedastic noise was parametrized as a function of a neural network output. For the GP, we optimized the kernel lengthscale according to the marginal likelihood. We also found it beneficial to apply a soft-free bits [7] modification of the bound to help with the optimization of $z$, where we initially allowed a budget of free bits on average across all dimensions and batch elements, set separately for the FNP and the FNP$^{+}$; both budgets were slowly annealed to zero over the course of 5k updates.

For the MNIST experiment, the model architecture was a 20C5 - MP2 - 50C5 - MP2 - 500FC - Softmax, where 20C5 corresponds to a convolutional layer of 20 output feature maps with a kernel size of 5, MP2 corresponds to max pooling with a size of 2, 500FC corresponds to a fully connected layer of 500 output units and Softmax corresponds to the output layer. The initial representation of the inputs for the NP and FNPs was provided by the penultimate layer of the network. For the MC-dropout network we applied 0.5 dropout to every layer. The number of points in $R$ was set to 300, a value that was selected from a range of candidate sizes by judging the performance of the NP and FNP models on the MNIST / notMNIST pair. For the FNP we used minibatches of 100 points from $M$, while we always appended the full $R$ to each of those batches. For the NP, since we were using a random set of contexts every time, we used a batch size of 400 points, where, in order to be comparable to the FNP, we randomly selected up to 300 points from the current batch for the context points during training and used the same 300 points as the FNP for evaluation. We set the upper bound of training epochs to 100 for the FNPs, NN and MC-dropout networks, and to 200 for the NP, as it did fewer parameter updates per epoch than the FNPs. Optimization was done with Adam [23] using the default hyperparameters. We further did early stopping according to the accuracy on the validation set and no other regularization was employed. Finally, we also employed a soft-free bits [7] modification of the bound to help with the optimization of $z$, where we allowed a fixed budget of free bits on average across all dimensions and batch elements throughout training.
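To make the layer string of the MNIST model concrete, the following sketch traces the spatial size of the feature maps through the network. It assumes 28x28 inputs, stride-1 convolutions without padding and non-overlapping pooling; none of these details are stated in the text, and the helper names are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, window):
    """Spatial output size of non-overlapping max pooling."""
    return size // window

size = 28                          # assumed MNIST input resolution
size = conv_out(size, 5)           # 20C5 -> 24
size = pool_out(size, 2)           # MP2  -> 12
size = conv_out(size, 5)           # 50C5 -> 8
size = pool_out(size, 2)           # MP2  -> 4
flat_features = 50 * size * size   # input width of the 500FC layer
```

Under these assumptions the 500FC layer receives 800 features, and its 500-dimensional output is the penultimate representation referred to above.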

The architecture for the CIFAR 10 experiment was a 2x(128C3) - MP2 - 2x(256C3) - MP2 - 2x(512C3) - MP2 - 1024FC - Softmax along with batch normalization [20] employed after every layer (besides the output one). Similarly to the MNIST experiment, the initial representation of the inputs for the NP and FNPs was provided by the penultimate layer of each of the networks. We did not optimize any hyperparameters for these experiments and used the same number of reference points, free bits, number of epochs, regularization and early stopping criteria that we used for MNIST. For the MC-dropout network we applied dropout with a rate of 0.2 at the beginning of each stack of convolutional layers that shared the same output channels and with a rate of 0.5 before every fully connected layer. Optimization was done with Adam with an initial learning rate that was decayed by a constant factor every thirty epochs for the NN, MC-dropout and FNPs, and at a separate fixed interval for the NP. We also performed data augmentation during training by doing random cropping with a padding of 4 pixels and random horizontal flips for both the reference and other points. We did not do any data augmentation during test time. The images were further normalized by subtracting the mean and dividing by the standard deviation of each channel, computed across the training dataset.
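The augmentation described above (random crop after padding, plus a random horizontal flip) can be sketched on a single-channel image stored as nested lists. Zero padding, the 0.5 flip probability and the function name are assumptions for illustration; the original pipeline may differ in these details.

```python
import random

def random_crop_flip(image, padding, rng):
    """image: list of rows (H x W). Zero-pad, crop back to H x W, maybe mirror."""
    height, width = len(image), len(image[0])
    blank = [0.0] * (width + 2 * padding)
    padded = ([blank[:] for _ in range(padding)]
              + [[0.0] * padding + list(row) + [0.0] * padding for row in image]
              + [blank[:] for _ in range(padding)])
    top = rng.randint(0, 2 * padding)     # inclusive bounds
    left = rng.randint(0, 2 * padding)
    crop = [row[left:left + width] for row in padded[top:top + height]]
    if rng.random() < 0.5:                # assumed flip probability
        crop = [row[::-1] for row in crop]
    return crop

rng = random.Random(0)
image = [[float(r * 32 + c) for c in range(32)] for r in range(32)]
augmented = random_crop_flip(image, padding=4, rng=rng)
```

The output always has the same height and width as the input, so augmented reference points and augmented minibatch points can be batched together without resizing.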

## Appendix B Ablation study on MNIST

In this section we provide the additional results we obtained on MNIST during the ablation study. The discussion of the results can be found in the main text. We measured the sensitivity of NPs and FNPs to the size of the reference set $R$ as well as the trade-offs we obtain by varying the dimensionalities of $u$ and $z$ for the FNP$^{+}$. The results from the former can be seen in Table 2, whereas the results from the latter can be seen in Table 3.

## Appendix C The Functional Neural Process is an exchangeable stochastic process

###### Proposition.

The distributions defined in Eq. 8 and Eq. 9 define valid permutation invariant stochastic processes, hence they correspond to Bayesian models.

###### Proof.

In order to prove the proposition we will rely on de Finetti’s and Kolmogorov's Extension Theorems [26] and show that $p(y_D \mid X_D)$ is permutation invariant and that its marginal distributions are consistent under marginalization. We will focus on the FNP, as the proof for the FNP$^{+}$ is analogous. As a reminder, we previously defined $R$ to be a set of reference inputs and $D_x$ to be the set of observed inputs, and we also defined the auxiliary sets $M = D_x \setminus R$, the set of all inputs in the observed dataset that are not a part of the reference set $R$, and $B = R \cup D_x$, the set of all points in the reference and observed dataset.

We will start with the permutation invariance. It will suffice to show that each of the individual probability densities described in Section 2.1 is permutation equivariant, as the products / sums will then make the overall probability permutation invariant. Without loss of generality we will assume a fixed arrangement of the elements in the set $B$. Consider applying a permutation $\sigma$ over $B$, $\sigma(B)$; this will also induce the same permutation over the corresponding labels $y_B$. Now consider the fact that in the FNP each individual $u_i$ is a function, let it be $f(\cdot)$, of the values of $x_i$; as a result we will have that:

$$
f(\sigma(B)) = \sigma(f(B)), \tag{13}
$$

i.e. the latent variables $U_B$ are permutation equivariant w.r.t. $B$. Continuing to the latent adjacency matrices $G, A$: in the FNP each particular element $(i, j)$ of these is a function of the values of the specific $u_i, u_j$. As a result, we will also have permutation equivariance for the rows / columns of $G, A$. Now, since $G, A$ are essentially used as a way to factorize the joint distribution over the $z_i$ in $Z_B$, and given the fact that the distribution of each $z_i$ is invariant to the permutation of its parents, we will have that the permutation of $B$ will result in the same re-ordering of the $z_i$'s, i.e.:

$$
\sigma(Z_B) = g(\sigma(B)), \tag{14}
$$

where $g(\cdot)$ is the function that maps $B$ to $Z_B$. Finally, as each $y_i$ is a function of the specific $z_i$, we will similarly have permutation equivariance for the $y_i$'s. We have thus described that all of the aforementioned random variables are permutation equivariant w.r.t. $B$ and as a result, due to the permutation invariant product / integral / summation operators, we will have that the FNP model is permutation invariant.
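The argument above combines two facts: a map applied independently to each point commutes with permutations, and a product over per-point terms ignores their order. A small numerical check makes this concrete. The map `pointwise_latent` is a hypothetical stand-in for a per-point function such as the one from $x_i$ to $u_i$; it is not the model's actual parametrization.

```python
import math

def pointwise_latent(x):
    """Stand-in for a per-point map such as x_i -> u_i (hypothetical)."""
    return math.tanh(2.0 * x - 0.5)

def apply_permutation(values, perm):
    return [values[i] for i in perm]

inputs = [0.3, -1.2, 0.8, 2.0]
perm = [2, 0, 3, 1]

# Equivariance: f(sigma(B)) equals sigma(f(B)).
f_then_permute = apply_permutation([pointwise_latent(x) for x in inputs], perm)
permute_then_f = [pointwise_latent(x) for x in apply_permutation(inputs, perm)]

# Invariance: a product over per-point terms does not depend on their order
# (up to floating point rounding).
score_permuted = math.prod(1.0 + v ** 2 for v in f_then_permute)
score_original = math.prod(1.0 + pointwise_latent(x) ** 2 for x in inputs)
```

The same reasoning applies to sums and integrals over per-point terms, which is why equivariance of each factor is enough for invariance of the whole model.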

Continuing to the consistency under marginalization. Following [11], let us define $\tilde{D}_x = D_x \cup \{x_0\}$ and consider two cases, one where $x_0$ belongs in $R$ and one where it doesn’t. We will show that in both cases $\int p(y_{\tilde{D}} \mid X_{\tilde{D}})\, dy_0 = p(y_D \mid X_D)$. Let us consider the case when $x_0 \in R$. In this case we have that the $R$ and $B$ sets will be the same across $D_x$ and $\tilde{D}_x$. As a result we can proceed as

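$$
\int p(y_{\tilde{D}} \mid X_{\tilde{D}})\, dy_0 = \sum_{G,A} \int p_\theta(U_B \mid X_B)\, p(G, A \mid U_B)\, p_\theta(y_B, Z_B \mid R, G, A)\, dU_B\, dZ_B\, dy_{i \in R \setminus \tilde{D}_x}\, dy_0. \tag{15}
$$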

Now we can notice that $\{R \setminus \tilde{D}_x\} \cup \{x_0\} = R \setminus D_x$, hence the measure that we are integrating over above can be rewritten as

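$$
\int p(y_{\tilde{D}} \mid X_{\tilde{D}})\, dy_0 = \sum_{G,A} \int p_\theta(U_B \mid X_B)\, p(G, A \mid U_B)\, p_\theta(y_B, Z_B \mid R, G, A)\, dU_B\, dZ_B\, dy_{i \in R \setminus D_x}, \tag{16}
$$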

where it is easy to see that we arrived at the same expression as the one provided in Eq. 8. Now we will consider the case where $x_0 \notin R$. In this case we have that $R \setminus \tilde{D}_x = R \setminus D_x$ and thus

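$$
\begin{aligned}
\int p(y_{\tilde{D}} \mid X_{\tilde{D}})\, dy_0 = \sum_{G,A,a_0} \int & p_\theta(U_B \mid X_B)\, p(G, A \mid U_B)\, p_\theta(y_B, Z_B \mid R, G, A) \\
& p_\theta(u_0 \mid x_0)\, p(a_0 \mid U_R, u_0)\, p_\theta(z_0 \mid \mathrm{par}_{a_0}(R, y_R))\, p_\theta(y_0 \mid z_0)\, dU_B\, dZ_B\, du_0\, dz_0\, dy_{i \in R \setminus D_x}\, dy_0.
\end{aligned} \tag{17}
$$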

Notice that in this case the new point that is added is a leaf in the dependency graph, hence it doesn’t affect any of the points in $B$. As a result we can easily marginalize it out sequentially:

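$$
\begin{align}
\int p(y_{\tilde{D}} \mid X_{\tilde{D}})\, dy_0
&= \sum_{G,A,a_0} \int p_\theta(U_B \mid X_B)\, p(G, A \mid U_B)\, p_\theta(y_B, Z_B \mid R, G, A)\, p_\theta(u_0 \mid x_0)\, p(a_0 \mid U_R, u_0) \notag \\
&\qquad p_\theta(z_0 \mid \mathrm{par}_{a_0}(R, y_R)) \underbrace{\left( \int p_\theta(y_0 \mid z_0)\, dy_0 \right)}_{=1} dU_B\, dZ_B\, du_0\, dz_0\, dy_{i \in R \setminus D_x} \tag{18} \\
&= \sum_{G,A,a_0} \int p_\theta(U_B \mid X_B)\, p(G, A \mid U_B)\, p_\theta(y_B, Z_B \mid R, G, A)\, p_\theta(u_0 \mid x_0)\, p(a_0 \mid U_R, u_0) \notag \\
&\qquad \underbrace{\left( \int p_\theta(z_0 \mid \mathrm{par}_{a_0}(R, y_R))\, dz_0 \right)}_{=1} dU_B\, dZ_B\, du_0\, dy_{i \in R \setminus D_x} \tag{19} \\
&= \sum_{G,A} \int p_\theta(U_B \mid X_B)\, p(G, A \mid U_B)\, p_\theta(y_B, Z_B \mid R, G, A)\, p_\theta(u_0 \mid x_0) \notag \\
&\qquad \underbrace{\left( \sum_{a_0} p(a_0 \mid U_R, u_0) \right)}_{=1} dU_B\, dZ_B\, du_0\, dy_{i \in R \setminus D_x} \tag{20} \\
&= \sum_{G,A} \int p_\theta(U_B \mid X_B)\, p(G, A \mid U_B)\, p_\theta(y_B, Z_B \mid R, G, A) \underbrace{\left( \int p_\theta(u_0 \mid x_0)\, du_0 \right)}_{=1} dU_B\, dZ_B\, dy_{i \in R \setminus D_x} \tag{21} \\
&= \sum_{G,A} \int p_\theta(U_B \mid X_B)\, p(G, A \mid U_B)\, p_\theta(y_B, Z_B \mid R, G, A)\, dU_B\, dZ_B\, dy_{i \in R \setminus D_x}, \tag{22}
\end{align}
$$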

where it is similarly easy to see that we arrived at Eq. 8. So we just showed that in both cases we have that $\int p(y_{\tilde{D}} \mid X_{\tilde{D}})\, dy_0 = p(y_D \mid X_D)$, hence the model is consistent under marginalization. ∎
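The leaf-marginalization step can be checked on a tiny discrete example: one reference label $y_R$, and a new point whose latent $z_0$ depends on $y_R$ and whose label $y_0$ depends only on $z_0$. All probability tables below are made up for illustration; they only need to be normalized.

```python
p_parent = {0: 0.3, 1: 0.7}                     # p(y_R) for one reference point
p_z_given_parent = {0: {0: 0.9, 1: 0.1},        # p(z_0 | y_R)
                    1: {0: 0.2, 1: 0.8}}
p_y_given_z = {0: {0: 0.6, 1: 0.4},             # p(y_0 | z_0)
               1: {0: 0.25, 1: 0.75}}

def joint(y_r, y_0):
    """p(y_R, y_0), with the leaf latent z_0 summed out."""
    return sum(p_parent[y_r] * p_z_given_parent[y_r][z] * p_y_given_z[z][y_0]
               for z in (0, 1))

# Summing the joint over the new label y_0 should give back p(y_R).
marginal = {y_r: sum(joint(y_r, y_0) for y_0 in (0, 1)) for y_r in (0, 1)}
```

Because the new point has no children, each of its conditional distributions sums to one in turn, and `marginal` matches `p_parent` up to rounding.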

## Appendix D Minibatch optimization of the bound of FNPs

As we mentioned in the main text, the objective of FNPs is amenable to minibatching, where the size of the batch scales according to the reference set $R$. We will only describe the procedure for the FNP, as the extension for the FNP$^{+}$ is straightforward. Let us remind ourselves that the bound of FNPs can be expressed as two terms:

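$$
\begin{aligned}
\mathcal{L} &= \mathcal{L}_R + \mathbb{E}_{p_\theta(U_D, A \mid X_D)\, q_\phi(Z_M \mid X_M)}\left[ \log p_\theta(y_M \mid Z_M) + \log p_\theta(Z_M \mid \mathrm{par}_A(R, y_R)) - \log q_\phi(Z_M \mid X_M) \right] \\
&= \mathcal{L}_R + \mathcal{L}_{M|R},
\end{aligned} \tag{23}
$$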

where we have a term that corresponds to the variational bound on the datapoints in $R$, $\mathcal{L}_R$, and a second term that corresponds to the bound on the points in $M$ when we condition on $R$, $\mathcal{L}_{M|R}$. While the $\mathcal{L}_R$ term of Eq. 23 cannot, in general, be decomposed into independent sums due to the DAG structure in $R$, the $\mathcal{L}_{M|R}$ term can; from the conditional i.i.d. nature of $M$ and the structure of the variational posterior we can express it as independent sums:

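$$
\mathcal{L}_{M|R} = \mathbb{E}_{p_\theta(U_D, A \mid X_D)}\left[ \sum_{i \in M} \mathbb{E}_{q_\phi(z_i \mid x_i)}\left[ \log p_\theta(y_i \mid z_i) + \log p_\theta(z_i \mid \mathrm{par}_{A_i}(R, y_R)) - \log q_\phi(z_i \mid x_i) \right] \right]. \tag{24}
$$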

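We can now easily use a minibatch $\hat{M}$ of points from $M$ in order to approximate the inner sum:

$$
\tilde{\mathcal{L}}_{M|R} = \mathbb{E}_{p_\theta(U_D, A \mid X_D)}\left[ \frac{|M|}{|\hat{M}|} \sum_{i \in \hat{M}} \mathbb{E}_{q_\phi(z_i \mid x_i)}\left[ \log p_\theta(y_i \mid z_i) + \log p_\theta(z_i \mid \mathrm{par}_{A_i}(R, y_R)) - \log q_\phi(z_i \mid x_i) \right] \right], \tag{25}
$$

and thus obtain the following unbiased estimate of the overall bound that depends on the minibatch $\hat{M}$:

$$
\mathcal{L} \approx \mathcal{L}_R + \tilde{\mathcal{L}}_{M|R}. \tag{26}
$$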
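The unbiasedness of such a minibatch estimate can be verified exactly on a toy problem by enumerating every possible minibatch. The sketch below assumes uniform sampling without replacement and rescales the minibatch sum by the ratio of the full set size to the batch size; the per-point values are arbitrary placeholders for the per-point bound terms.

```python
from itertools import combinations

per_point_terms = [0.5, -1.25, 2.0, 0.75, -0.5, 1.5]   # toy per-point bound terms
full_sum = sum(per_point_terms)

def minibatch_estimate(batch_indices, terms):
    """Rescale the minibatch sum by (full set size) / (batch size)."""
    scale = len(terms) / len(batch_indices)
    return scale * sum(terms[i] for i in batch_indices)

batch_size = 2
all_batches = list(combinations(range(len(per_point_terms)), batch_size))
estimates = [minibatch_estimate(b, per_point_terms) for b in all_batches]

# Averaging over all equally likely minibatches recovers the full sum.
average_estimate = sum(estimates) / len(estimates)
```

Individual estimates vary from batch to batch, but their expectation equals the full sum, which is what allows stochastic gradient optimization of the second term of the bound.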

In practice, this might limit us to relatively small reference sets, as training can become relatively expensive; in this case an alternative would be to also subsample the reference set and reweigh the corresponding term appropriately. This provides a biased gradient estimator but, after a limited set of experiments, it seems that it can work reasonably well.

## Appendix E Predictive distribution of FNPs

Given that the parameters of the model have been optimized, we now seek a way to make predictions for new unseen points. As we assumed that all of the reference points are a part of the observed dataset $D$, every new point $x^*$ will be a part of $M$. Furthermore, we will have that $B = D_x$. We will only provide the derivation for the FNP model, since the extension to the FNP$^{+}$ is straightforward. To derive the predictive distribution for this point we will rely on Bayes' theorem and thus have:

$$
p_\theta(y^* \mid x^*, X_D, y_D) = \frac{p_\theta(y^*, y_D \mid x^*, X_D)}{\int p_\theta(y^*, y_D \mid x^*, X_D)\, dy^*}. \tag{27}
$$

As we have established the consistency of the FNP, we know that the denominator is $p_\theta(y_D \mid X_D)$. Therefore we can expand the numerator and rewrite Eq. 27 as

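$$
\begin{aligned}
p_\theta(y^* \mid x^*, X_D, y_D) = \sum_{G,A,a^*} \int & \frac{p_\theta(U_D \mid X_D)\, p(G, A \mid U_D)\, p_\theta(Z_D, y_D \mid R, G, A)}{p_\theta(y_D \mid X_D)} \\
& p_\theta(u^* \mid x^*)\, p(a^* \mid U_R, u^*)\, p_\theta(z^* \mid \mathrm{par}_{a^*}(R, y_R))\, p_\theta(y^* \mid z^*)\, dU_D\, du^*\, dZ_D\, dz^*,
\end{aligned} \tag{28}
$$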

where $a^*$ is the binary vector that denotes which points from $R$ are the parents of the new point. We can now see that the top part is the posterior distribution of the latent variables of the model when we condition on $D$. We can thus replace it with its variational approximation and obtain

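$$
\begin{align}
p_\theta(y^* \mid x^*, X_D, y_D)
&\approx \sum_{G,A,a^*} \int p_\theta(U_D \mid X_D)\, p(G, A \mid U_D)\, q_\phi(Z_D \mid X_D) \notag \\
&\qquad p_\theta(u^* \mid x^*)\, p(a^* \mid U_R, u^*)\, p_\theta(z^* \mid \mathrm{par}_{a^*}(R, y_R))\, p_\theta(y^* \mid z^*)\, dU_D\, du^*\, dZ_D\, dz^* \tag{29} \\
&= \sum_{a^*} \int p_\theta(U_R \mid X_R)\, p_\theta(u^* \mid x^*)\, p(a^* \mid U_R, u^*)\, p_\theta(z^* \mid \mathrm{par}_{a^*}(R, y_R))\, p_\theta(y^* \mid z^*)\, dU_R\, du^*\, dz^* \tag{30}
\end{align}
$$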

after integrating / summing over the latent variables that do not affect the distributions that are specific to the new point.
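In practice, an expression of this kind is typically estimated by Monte Carlo: sample which reference points act as parents, sample the latent of the new point given those parents, and average the resulting class probabilities. The sketch below illustrates that loop only. The prior over the latent (a unit-variance Gaussian centred on the mean of the parent codes), the linear classifier and all names are hypothetical stand-ins, not the parametrization used in the paper.

```python
import math
import random

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predictive(edge_probs, parent_codes, weights, n_samples, rng):
    """Average p(y* | z*) over sampled parent sets a* and latents z*."""
    dim = len(parent_codes[0])
    avg = [0.0] * len(weights)
    for _ in range(n_samples):
        # a*: each reference point is a parent with its own Bernoulli probability.
        parents = [c for p, c in zip(edge_probs, parent_codes) if rng.random() < p]
        # Hypothetical prior: mean of the parent codes (zero without parents).
        mean = [sum(c[d] for c in parents) / len(parents) if parents else 0.0
                for d in range(dim)]
        z_star = [rng.gauss(m, 1.0) for m in mean]
        probs = softmax([sum(w * z for w, z in zip(row, z_star)) for row in weights])
        avg = [a + p / n_samples for a, p in zip(avg, probs)]
    return avg

rng = random.Random(0)
class_probs = predictive(edge_probs=[0.9, 0.1, 0.6],
                         parent_codes=[[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
                         weights=[[2.0, 0.0], [0.0, 2.0]],
                         n_samples=200, rng=rng)
```

Averaging probabilities rather than logits keeps the estimate a valid distribution, and the spread across samples is what gives rise to predictive uncertainty for points with few or inconsistent parents.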