Deep Convolutional Networks as shallow Gaussian Processes

08/16/2018 · by Adrià Garriga-Alonso et al., University of Cambridge

We show that the output of a (residual) convolutional neural network (CNN) with an appropriate prior over the weights and biases is a Gaussian process (GP) in the limit of infinitely many convolutional filters, extending similar results for dense networks. For a CNN, the equivalent kernel can be computed exactly and, unlike "deep kernels", has very few parameters: only the hyperparameters of the original CNN. Further, we show that this kernel has two properties that allow it to be computed efficiently; the cost of evaluating the kernel for a pair of images is similar to a single forward pass through the original CNN with only one filter per layer. The kernel equivalent to a 32-layer ResNet obtains 0.84% classification error on MNIST, a new record for GPs with a comparable number of parameters.


1 Introduction

Convolutional Neural Networks (CNN) have powerful pattern-recognition capabilities that have recently given dramatic improvements in important tasks such as image classification (Krizhevsky et al., 2012). However, as CNN are increasingly being applied in real-world, safety-critical domains, their vulnerability to adversarial examples (Szegedy et al., 2013; Kurakin et al., 2016), and their poor uncertainty estimates are becoming increasingly problematic. Bayesian inference is a theoretically principled and demonstrably successful (Snoek et al., 2012; Deisenroth & Rasmussen, 2011) framework for learning in the face of uncertainty, which may also help to address the problems of adversarial examples (Gal & Smith, 2018). Unfortunately, Bayesian inference in CNN is extremely difficult due to the very large number of parameters, requiring highly approximate factorised variational approximations (Blundell et al., 2015; Gal & Ghahramani, 2015), or requiring the storage (Lakshminarayanan et al., 2017) of large numbers of posterior samples (Welling & Teh, 2011; Mandt et al., 2017).

Other methods such as those based on GP are more amenable to Bayesian inference, allowing us to compute the posterior uncertainty exactly (Rasmussen & Williams, 2006). This raises the question of whether it might be possible to combine the pattern-recognition capabilities of CNN with the exact probabilistic computations in GP. Two such approaches exist in the literature. First, deep convolutional kernels (Wilson et al., 2016) parameterise a GP kernel using the weights and biases of a CNN, which is used to embed the input images into some latent space before computing their similarity. The CNN parameters of the resulting kernel are then optimised by gradient descent. However, the large number of kernel parameters in the CNN reintroduces the risk of overconfidence and overfitting. To avoid this risk, we need to infer a posterior over the CNN kernel parameters, which is as difficult as directly inferring a posterior over the parameters of the original CNN. Second, it is possible to define a convolutional GP (van der Wilk et al., 2017) or a deep convolutional GP (Kumar et al., 2018) by defining a GP that takes an image patch as input, and using that GP as a component in a larger CNN-like system. However, inference in such systems is very computationally expensive, at least without the use of potentially severe variational approximations (van der Wilk et al., 2017).

An alternative approach is suggested by the underlying connection between Bayesian Neural Networks (NN) and GP. In particular, Neal (1996) showed that the function defined by a single-layer fully-connected NN with infinitely many hidden units, and random independent zero-mean weights and biases is equivalent to a GP, implying that we can do exact Bayesian inference in such a NN by working with the equivalent GP. Recently, this result was extended to arbitrarily deep fully-connected NN with infinitely many hidden units in each layer (Lee et al., 2017; Matthews et al., 2018a). However, these fully-connected networks are rarely used in practice, as they are unable to exploit important properties of images such as translational invariance, raising the question of whether state-of-the-art architectures such as CNN (LeCun et al., 1990) and ResNets (He et al., 2016a) have equivalent GP representations. Here, we answer in the affirmative, giving the GP kernel corresponding to arbitrarily deep CNN and to (convolutional) residual neural networks (He et al., 2016a). In this case, if each hidden layer has an infinite number of convolutional filters, the network prior is equivalent to a GP.

Furthermore, we show that two properties of the GP kernel induced by a CNN allow it to be computed very efficiently. First, in previous work it was necessary to compute the covariance matrix for the output of a single convolutional filter applied at all possible locations within a single image (van der Wilk et al., 2017), which was prohibitively computationally expensive. In contrast, under our prior, the downstream weights are independent with zero-mean, which decorrelates the contribution from each location, and implies that it is necessary only to track the patch variances, and not their covariances. Second, while it is still necessary to compute the variance of the output of a convolutional filter applied at all locations within the image, the specific structure of the kernel induced by the CNN means that the variance at every location can be computed simultaneously and efficiently as a convolution.

Finally, we empirically demonstrate the performance increase coming from adding translation-invariant structure to the GP prior. Without computing any gradients, and without augmenting the training set (e.g. using translations), we obtain 0.84% error rate on the MNIST classification benchmark, setting a new record for nonparametric GP-based methods.

2 GP behaviour in a CNN

For clarity of exposition, we will treat the case of a 2D convolutional NN. The result applies straightforwardly to nD convolutions, dilated convolutions and upconvolutions (“deconvolutions”), since they can be represented as linear transformations with tied coefficients (see Fig. 1).

2.1 A 2D convolutional network prior

The network takes an arbitrary input image of given height and width, represented as a real matrix with one row per channel of the image (e.g. three rows for RGB), each channel flattened to form a vector. The first activations are a linear transformation of the inputs. For each channel of the first layer:

(1)

We consider a network with L hidden layers. The other activations of the network, from the second layer up to the output layer, are defined recursively:

(2)

The activations at each layer are matrices with one row per channel. Each row represents the flattened channel of the image that results from applying a convolutional filter to the nonlinearly transformed activations of the previous layer.

The structure of the pseudo-weight matrices and biases depends on the architecture. For a convolutional layer, each row of the pseudo-weight matrix represents one position of the filter, such that the dot product of all the rows with the image vector represents applying the convolutional filter to the corresponding channel. Thus, the elements of each row of the pseudo-weight matrix are zero where the filter does not apply, and equal to the corresponding filter coefficient where it does, as illustrated in Fig. 1.

The outputs of the network are the last activations. In the classification or regression setting, the outputs are not spatially extended, so the final feature maps contain a single spatial location, which is equivalent to a fully-connected output layer. In this case, the pseudo-weight matrices only have one row, and the activations are single-element vectors.

Figure 1: The 2D convolution expressed as a dot product with a pseudo-weight matrix. The blank elements of the matrix are zeros. Each row of the matrix corresponds to applying the filter to one convolutional patch of the channel.
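For concreteness, the construction in Fig. 1 can be sketched in a few lines of Python. The code below is purely illustrative (our own names and sizes: a single channel, a 3x3 filter, stride 1, no padding), not part of the model definition; it checks that multiplying the pseudo-weight matrix with the flattened image reproduces a direct 2D convolution.

import numpy as np

def pseudo_weight_matrix(filt, H, W):
    # Return U such that U @ image.ravel() equals the 2D convolution (in the NN sense,
    # i.e. cross-correlation) of the H x W image with filt, stride 1, no padding.
    h, w = filt.shape
    Ho, Wo = H - h + 1, W - w + 1                      # output spatial size
    U = np.zeros((Ho * Wo, H * W))
    for oy in range(Ho):                               # one row of U per output (patch) position
        for ox in range(Wo):
            for fy in range(h):
                for fx in range(w):
                    U[oy * Wo + ox, (oy + fy) * W + (ox + fx)] = filt[fy, fx]
    return U

rng = np.random.default_rng(0)
image = rng.normal(size=(5, 5))
filt = rng.normal(size=(3, 3))
U = pseudo_weight_matrix(filt, *image.shape)
direct = np.array([[np.sum(image[y:y+3, x:x+3] * filt) for x in range(3)] for y in range(3)])
assert np.allclose(U @ image.ravel(), direct.ravel())  # the two views of the convolution agree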

Finally, we define the prior distribution over functions by making the filters and biases independent Gaussian random variables (RV). For every layer, channel and location within the filter:

(3)

Note that, to keep the activation variance constant, the weight variance is divided by the number of input channels. The weight variance can also be divided by the number of elements of the filter, which makes it equivalent to the NN weight initialisation scheme introduced by He et al. (2016a).
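As an illustrative sketch of this prior (the function name and default values below are ours, chosen for illustration only), a layer's filters and biases could be sampled as follows; the optional flag additionally divides the weight variance by the number of filter elements, matching the alternative scaling just described.

import numpy as np

def sample_conv_layer(rng, c_in, c_out, filter_h, filter_w,
                      sigma_w=1.0, sigma_b=0.1, scale_by_filter_size=False):
    # Weight variance is sigma_w^2 divided by the fan-in: the number of input channels,
    # optionally also by the number of filter elements (the He et al. scaling).
    fan_in = c_in * (filter_h * filter_w if scale_by_filter_size else 1)
    W = rng.normal(0.0, sigma_w / np.sqrt(fan_in), size=(c_out, c_in, filter_h, filter_w))
    b = rng.normal(0.0, sigma_b, size=(c_out,))
    return W, b

rng = np.random.default_rng(0)
W, b = sample_conv_layer(rng, c_in=3, c_out=64, filter_h=3, filter_w=3)
print(W.var())   # close to sigma_w**2 / c_in = 1/3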

2.2 Argument for GP behaviour

We follow the proofs by Lee et al. (2017) and Matthews et al. (2018a) to show that the output of the CNN described in the previous section defines a GP indexed by the inputs. Their proof (Lee et al., 2017) proceeds by applying the multivariate Central Limit Theorem (CLT) to each layer in sequence, i.e. taking the number of hidden units in the first layer to infinity, then in the second layer, and so on. By analogy, we sequentially apply the multivariate CLT by taking the number of channels in each layer to infinity, one layer at a time. While this is the simplest approach to taking the limits, other potentially more realistic approaches also exist (Matthews et al., 2018a); in Appendix 7.2 we take the latter approach.

The fundamental quantity we consider is a vector formed by concatenating a feature map (or, equivalently, channel) evaluated at two data points,

(4)

This quantity (and the following arguments) can all be extended to the case of finitely many input points.

Induction base case.

For any pair of data points, the first-layer feature maps corresponding to a given channel have a multivariate Gaussian joint distribution. This is because each element is a linear combination of shared Gaussian random variables: the bias and the filters. Following Eq. (1),

(5)

where the bias multiplies a vector of all-ones. While the elements within a feature map display strong correlations, different feature maps are iid conditioned on the data, because the parameters of different feature maps (i.e. their biases and filters) are themselves iid.

Induction step.

Consider the feature maps at a given layer to be iid multivariate Gaussian RV across channels. Our goal is to show that, taking the number of channels at that layer to infinity, the same properties hold at the next layer: all of its feature maps are iid multivariate Gaussian RV. Writing Eq. (2) for two training examples, we obtain,

(6)

We begin by showing that each feature map of the next layer, evaluated at both inputs, is a multivariate Gaussian RV. The first term is multivariate Gaussian, as it is a linear function of the bias, which is itself Gaussian. We can apply the multivariate CLT to show that the second term is also Gaussian because, in the limit of infinitely many channels, it is the sum of infinitely many iid terms: the previous-layer feature maps are iid by assumption, and the weights are iid by definition. Note that the same argument applies to all feature maps jointly, so all elements of the concatenated vector (defined by analogy with Eq. 4) are jointly multivariate Gaussian.

Following Lee et al. (2017), to complete the argument we need to show that the output feature maps are iid across channels. They are identically distributed, as their biases and filters are iid and the previous layer's activations are shared between them. To show that they are independent, remember that they are jointly Gaussian, so it is sufficient to show that they are uncorrelated. This follows because the weights are independent with zero-mean, eliminating any correlations that might arise through the shared random vector of previous-layer activations.
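The induction step can be illustrated numerically. The toy check below (our own construction, not one of the paper's experiments) draws many two-layer "networks" with a single output unit: the first-layer activations are exactly Gaussian, and the second-layer output is a sum over C1 channels of a weight times a ReLU of an activation. Its excess kurtosis should shrink towards zero, the Gaussian value, as C1 grows.

import numpy as np

rng = np.random.default_rng(0)
n_samples, sigma_w, sigma_b = 200_000, 1.0, 0.1

def excess_kurtosis(x):
    z = (x - x.mean()) / x.std()
    return np.mean(z**4) - 3.0                      # zero for an exact Gaussian

for c1 in (1, 3, 10, 30, 100):
    a1 = rng.normal(0.0, 1.0, size=(n_samples, c1))                    # layer-1 activations (Gaussian)
    w2 = rng.normal(0.0, sigma_w / np.sqrt(c1), size=(n_samples, c1))  # layer-2 weights, variance sigma_w^2 / c1
    b2 = rng.normal(0.0, sigma_b, size=n_samples)
    a2 = b2 + np.sum(w2 * np.maximum(a1, 0.0), axis=1)                 # one layer-2 output per sampled network
    print(f"C1 = {c1:4d}   excess kurtosis = {excess_kurtosis(a2):+.3f}")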

3 The ConvNet and ResNet kernels

Here we derive a computationally efficient kernel corresponding to the CNN described in the previous section. It is surprising that we can compute the kernel efficiently, because the feature maps display rich covariance structure due to the shared convolutional filter. Computing and representing these covariances would be prohibitively computationally expensive. However, in many cases we only need the variance of the output, e.g. in the case of classification or regression with a final dense layer. It turns out that this requirement propagates backwards through the convolutional network, implying that for every layer we only need the “diagonal covariance” of the activations: the covariance between corresponding elements of the activations for the two inputs.

3.1 GP mean and covariance

A GP is completely specified by its mean and covariance (or kernel) functions. These give the parameters of the joint Gaussian distribution of the RV indexed by any two inputs. For the purposes of computing the mean and covariance, it is easiest to consider the network as being written entirely in index notation, with indices denoting the input and output layers, the input and output channels, and the locations within the input and output channels or feature maps.

The mean function is thus easy to compute: it is zero, because the weights and biases have zero mean and are independent of the activations at the previous layer.

Now we show that it is possible to efficiently compute the covariance function. This is surprising because for many networks we would need to compute the covariance of activations between all pairs of locations in the feature map, an extremely high-dimensional object. However, it turns out that we only need to consider the “diagonal” covariance between corresponding locations, a far more manageable quantity with one entry per location.

This is true at the output layer: in order to achieve an output suitable for classification or regression, we use only a single output location, with a number of “channels” equal to the number of outputs/classes, so it is only possible to compute the covariance at that single location. We now show that, if we only need the covariance at corresponding locations in the outputs, we only need the covariance at corresponding locations in the inputs, and that this requirement propagates backwards through the network.

Formally, as the activations are composed of a sum of terms, their covariance is the sum of the covariances of all those underlying terms,

(7)

As the terms in the covariance have mean zero, and as the weights and activations from the previous layer are independent,

(8)

The weights are independent for different channels, so the cross-channel terms in the covariance vanish. Further, each row of the weight matrices only contains independent variables or zeros (Fig. 1), so the cross-location terms vanish as well. Thus, we can eliminate the double sums over channels and locations:

(9)

Each row of the weight matrix is zero for indices that do not belong to its convolutional patch, so we can restrict the sum to that region. We also introduce notation for the resulting patch-wise covariance, to emphasise that it is independent of the output channel. The variance of the first layer is

(10)
And we do the same for the other layers,
(11)

where

(12)

is the covariance of the activations, which is again independent of the channel.
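As a concrete sketch of Eq. (10) (our own code and notation; a 3x3 filter, stride 1 and no padding are assumed), the first-layer covariance between corresponding patch locations of two images can be computed at every location at once by summing the channel-averaged elementwise product of the images over each patch, i.e. by a convolution with an all-ones filter.

import numpy as np

def first_layer_kernel_map(X, Xp, filter_shape=(3, 3), sigma_w=1.0, sigma_b=0.1):
    # X, Xp: images of shape (channels, H, W). Returns one covariance per patch position.
    C, H, W = X.shape
    h, w = filter_shape
    prod = np.sum(X * Xp, axis=0) / C                  # elementwise product, averaged over input channels
    out = np.empty((H - h + 1, W - w + 1))
    for y in range(out.shape[0]):                      # sum the product over every patch
        for x in range(out.shape[1]):
            out[y, x] = prod[y:y+h, x:x+w].sum()
    return sigma_b**2 + sigma_w**2 * out

rng = np.random.default_rng(0)
X, Xp = rng.normal(size=(2, 3, 28, 28))                # two random 3-channel "images"
print(first_layer_kernel_map(X, Xp).shape)             # (26, 26): one entry per patch position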

3.2 Covariance of the activities

The elementwise covariance on the right-hand side of Eq. (11) can be computed in closed form for many choices of nonlinearity if the activations are Gaussian. For each element of the activations, one needs to keep track of the 3 distinct entries of the bivariate covariance matrix between the two inputs: the two variances and the covariance.

For example, for the ReLU nonlinearity, one can adapt Cho & Saul (2009) in the same way as Matthews et al. (2018a, section 3) to obtain

(13)

where the angle appearing in the expression is the arc-cosine of the correlation between the activations at the two inputs. Another example is the error function (erf) nonlinearity, similar to the hyperbolic tangent (tanh). The form of its relevant expectation (Williams, 1997) is given in Appendix 7.4.
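For reference, this ReLU expectation can be written as a short function. The expression below is the first-order arc-cosine kernel of Cho & Saul (2009) in our own notation; its normalisation should be checked against Eq. (13), and the Monte Carlo comparison is included as a sanity check.

import numpy as np

def relu_expectation(k_xx, k_xpxp, k_xxp):
    # E[relu(a) relu(a')] for zero-mean jointly Gaussian (a, a') with variances k_xx, k_xpxp
    # and covariance k_xxp; all arguments may be arrays of the same shape.
    norm = np.sqrt(k_xx * k_xpxp)
    cos_theta = np.clip(k_xxp / norm, -1.0, 1.0)       # clip guards against rounding error
    theta = np.arccos(cos_theta)
    return norm / (2.0 * np.pi) * (np.sin(theta) + (np.pi - theta) * cos_theta)

rng = np.random.default_rng(0)                          # Monte Carlo sanity check
cov = np.array([[1.0, 0.3], [0.3, 0.8]])
a = rng.multivariate_normal([0.0, 0.0], cov, size=1_000_000)
mc = np.mean(np.maximum(a[:, 0], 0) * np.maximum(a[:, 1], 0))
print(mc, relu_expectation(cov[0, 0], cov[1, 1], cov[0, 1]))   # the two should agree closely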

1:  Input: two images, X and X′.
2:  Compute the first-layer covariances of the activations for the pairs (X, X), (X′, X′) and (X, X′) at every patch position, using Eq. (10).
3:  for each subsequent layer do
4:     Compute the corresponding expectations of the nonlinearity at every patch position, using Eq. (13) or the corresponding expression for another nonlinearity.
5:     Compute the covariances of that layer's activations at every patch position, using Eq. (11).
6:  end for
7:  Output the scalar covariance of the final, dense layer.
Algorithm 1 The ConvNet kernel

3.3 Efficiency of the ConvNet kernel

We now have all the pieces for computing the kernel, as written in Algorithm 1.

Putting together Eq. (11) and Eq. (13) gives the surprising result that the diagonal covariances of the activations at a given layer only depend on the diagonal covariances of the activations at the previous layer. This is very important, because it puts the computational cost of the kernel within a constant factor of the cost of a forward pass through the equivalent CNN with 1 filter per layer.

Thus, the algorithm is more efficient than one would naively think. A priori, one would need to compute the covariance between all the elements of the two feature maps combined, yielding a covariance matrix that is quadratic in the number of locations, for every pair of points. Instead, we only need to keep track of a vector with one entry per location, per layer and pair of points.

Furthermore, the particular form for the kernel (Eq. 1 and Eq. 2) implies that the required variances and covariances at all required locations can be computed efficiently as a convolution.
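Putting the pieces together, Algorithm 1 for a plain ReLU ConvNet can be sketched as follows. This is our own illustrative implementation, not the code used in the experiments: it assumes stride-1, unpadded 3x3 convolutions, a final dense layer and the prior scaling of Eq. (3); the helpers repeat the patch-sum and ReLU-expectation sketches above.

import numpy as np

def patch_sums(M, h, w):
    # Sum an (H, W) map over every h x w patch: the "convolution with an all-ones filter".
    H, W = M.shape
    return np.array([[M[y:y+h, x:x+w].sum() for x in range(W - w + 1)]
                     for y in range(H - h + 1)])

def relu_expectation(k_xx, k_xpxp, k_xxp):
    norm = np.sqrt(k_xx * k_xpxp)
    cos_t = np.clip(k_xxp / norm, -1.0, 1.0)
    theta = np.arccos(cos_t)
    return norm / (2.0 * np.pi) * (np.sin(theta) + (np.pi - theta) * cos_t)

def convnet_kernel(X, Xp, n_layers=3, filt=(3, 3), sigma_w=1.0, sigma_b=0.1):
    # Kernel between two images of shape (channels, H, W) in the infinite-filter limit.
    C = X.shape[0]
    # Layer 1 (Eq. 10): patch-wise inner products of the two inputs.
    maps = [sigma_b**2 + sigma_w**2 * patch_sums(np.sum(A * B, axis=0) / C, *filt)
            for A, B in ((X, X), (Xp, Xp), (X, Xp))]
    # Hidden layers (Eqs. 11 and 13): ReLU expectation, then patch sums again.
    for _ in range(n_layers - 1):
        v_xx = relu_expectation(maps[0], maps[0], maps[0])
        v_pp = relu_expectation(maps[1], maps[1], maps[1])
        v_xp = relu_expectation(maps[0], maps[1], maps[2])
        maps = [sigma_b**2 + sigma_w**2 * patch_sums(v, *filt) for v in (v_xx, v_pp, v_xp)]
    # Final dense layer: sum over all remaining locations (one could also divide by the
    # number of locations, matching the alternative scaling discussed after Eq. (3)).
    v_xp = relu_expectation(maps[0], maps[1], maps[2])
    return sigma_b**2 + sigma_w**2 * v_xp.sum()

rng = np.random.default_rng(0)
X, Xp = rng.normal(size=(2, 1, 10, 10))
print(convnet_kernel(X, Xp))

Evaluating such a function for every pair of images yields the kernel matrix used for GP regression in Sec. 4.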

3.4 Kernel for a residual CNN

The induction step in the argument for GP behaviour from Sec. 2.2 depends only on the previous activations being iid Gaussian. Since all the activations are iid Gaussian, we can add skip connections between the activations of different layers while preserving GP behaviour, e.g. connecting each layer to the layer a fixed number of steps before it, where that number is the span of the skip connection. If we change the NN recursion (Eq. 2) to

(14)

then the kernel recursion (Eq. 11) becomes

(15)

This way of adding skip connections is equivalent to the “pre-activation” shortcuts described by He et al. (2016b). Remarkably, the natural way of adding residual connections to NN is the one that performed best in their empirical evaluations.
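To illustrate the change, the sketch below (our construction; it tracks only the single-input variance map, uses size-preserving padding and made-up hyperparameters) shows how the residual recursion of Eq. (15) simply adds the covariance map from a few layers back to the usual convolutional update of Eq. (11): because the new weights are zero-mean and independent of the skipped activations, the cross-terms vanish.

import numpy as np

def ones_filter_same(M, h=3, w=3):
    # Sum an (H, W) map over h x w windows with zero padding, keeping the spatial size.
    H, W = M.shape
    P = np.pad(M, ((h // 2, h // 2), (w // 2, w // 2)))
    return np.array([[P[y:y+h, x:x+w].sum() for x in range(W)] for y in range(H)])

def residual_variance_maps(K_first, n_layers=8, span=2, sigma_w=0.3, sigma_b=0.1):
    # Propagate the diagonal covariance map K(X, X) through a residual ConvNet kernel.
    # The ReLU second moment in the diagonal case is simply K / 2 (Eq. 13 with angle zero).
    history = [K_first]
    for layer in range(1, n_layers):
        update = sigma_b**2 + sigma_w**2 * ones_filter_same(history[-1] / 2.0)
        skip = history[-span] if layer >= span else 0.0   # Eq. (15): add the map from `span` layers back
        history.append(skip + update)
    return history

K_first = np.ones((8, 8))                                 # toy first-layer variance map for one image
print(residual_variance_maps(K_first)[-1].mean())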

4 Experiments

We evaluate our kernel on the MNIST handwritten digit classification task. Classification likelihoods are not conjugate for GP, so we must make an approximation; we follow Lee et al. (2017) in re-framing classification as multi-output regression.

The training set is split into training and validation examples. The regression targets are a one-hot encoding of each example's class: the entry corresponding to the example's class is one, and all other entries are zero.

Training is exact conjugate-likelihood GP regression with noiseless targets (Rasmussen & Williams, 2006). First we compute the kernel matrix between every pair of training examples. Then we solve the resulting linear system against the regression targets using a linear system solver.

For the test set, we compute the matrix of kernel values between each test example and all the training examples. The predictions are given by the row-wise maximum of the product of this matrix with the solution of the training linear system.
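The whole training and prediction pipeline then amounts to a few lines. In the sketch below (ours, with random stand-in kernel matrices; in the paper these would come from the ConvNet or ResNet kernel), the "training" step is a single linear solve and prediction is a row-wise argmax.

import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_classes = 200, 50, 10

# Stand-in kernel matrices: any positive-definite train/train block works for illustration.
Z = rng.normal(size=(n_train + n_test, 500))
K = Z @ Z.T + 1e-6 * np.eye(n_train + n_test)
K_xx = K[:n_train, :n_train]                      # kernel between all pairs of training examples
K_sx = K[n_train:, :n_train]                      # kernel between test and training examples

labels = rng.integers(0, n_classes, size=n_train)
Y = np.eye(n_classes)[labels]                     # one-hot regression targets

alpha = np.linalg.solve(K_xx, Y)                  # "training": solve K_xx alpha = Y (noiseless regression)
predictions = np.argmax(K_sx @ alpha, axis=1)     # row-wise maximum of K_sx K_xx^{-1} Y
print(predictions[:10])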

For the “ConvNet GP” and “Residual CNN GP” (Table 1), we optimise the kernel hyperparameters by random search. We draw random hyperparameter samples, compute the resulting kernel's performance on the validation set, and pick the highest-performing run. The kernel hyperparameters are: the weight and bias variances; the number of layers; the convolution stride, filter sizes and edge behaviour; the nonlinearity (we consider the error function and ReLU); and the frequency of residual skip connections (for Residual CNN GPs). We do not retrain the model on the validation set after choosing hyperparameters.

The “ResNet GP” (Table 1) is the kernel equivalent to a 32-layer version of the basic residual architecture by He et al. (2016a). The differences are: an initial convolutional layer and a final dense layer instead of average pooling. We chose to remove the pooling because computing its output variance requires the off-diagonal elements of the filter covariance, in which case we could not exploit the efficiency gains described in Sec. 3.3.

We found that, despite its hyperparameters not being optimised, the 32-layer ResNet GP outperformed all other comparable architectures (Table 1), including the NNGP in Lee et al. (2017), which is state-of-the-art for non-convolutional networks, and the convolutional GPs (van der Wilk et al., 2017; Kumar et al., 2018). That said, our results have not reached state-of-the-art for methods that incorporate a parametric neural network, such as a standard ResNet (Chen et al., 2018) and a Gaussian process with a deep neural network kernel (Bradshaw et al., 2017).

Method | #samples | Validation error | Test error
NNGP (Lee et al., 2017) | | | 1.21%
Convolutional GP (van der Wilk et al., 2017) | SGD | | 1.17%
Deep Conv. GP (Kumar et al., 2018) | SGD | | 1.34%
ConvNet GP | 27 | 0.71% | 1.03%
Residual CNN GP | 27 | 0.71% | 0.93%
ResNet GP | | 0.68% | 0.84%
GP + parametric deep kernel (Bradshaw et al., 2017) | SGD | | 0.60%
ResNet (Chen et al., 2018) | | | 0.41%
Table 1: MNIST classification results. #samples gives the number of kernels that were randomly sampled for the hyperparameter search. “ConvNet GP” and “Residual CNN GP” are random CNN architectures with a fixed filter size, whereas “ResNet GP” is a slight modification of the architecture by He et al. (2016b). Entries labelled “SGD” used stochastic gradient descent for tuning hyperparameters, by maximising the likelihood of the training set. The last two methods use parametric neural networks. The hyperparameters of the ResNet GP were not optimised (they were fixed based on the architecture from He et al., 2016b). See Table 2 (appendix) for optimised hyperparameter values.

To check whether the GP limit is applicable to the relatively small networks used in practice (with on the order of 100 channels in the first layers), we randomly sampled 32-layer ResNets with 3, 10, 30 and 100 channels in the first layers. Following the usual practice for ResNets, we increase the number of hidden units when we downsample the feature maps. Then, we compare the sampled and limiting theoretical distributions of the scalar output for a given input.

The probability density plots show a good match around 100 channels (Fig. 2A), which matches a more sensitive graphical procedure based on quantile-quantile plots (Fig. 2B). Notably, even for only 30 channels, the empirical moments (computed over many input images) match the limiting ones closely (Fig. 2C). For comparison, typical ResNets use from 64 (He et al., 2016a) to 192 (Zagoruyko & Komodakis, 2016) channels in their first layers. We believe that this is because the moment propagation equations only require the Gaussianity assumption for propagation through the ReLU, and presumably this is robust to non-Gaussian input activations.

Figure 2: Distribution of the output: limiting density and samples of finite 32-layer ResNets (He et al., 2016b) with varying numbers of channels. A) Empirical histogram and limiting density function for one input image. B) A more sensitive test of Gaussianity is a quantile-quantile plot, which plots the value of each quantile in the limiting density against the corresponding quantile in the empirical one, for one input image. C) The empirical moments (variances and covariances) over 100 training images show a good match for all numbers of channels.
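A scaled-down version of this check can be written directly from the recursion. The code below (our own toy, with a single convolutional layer, a dense readout and arbitrary sizes, not the 32-layer ResNet of Fig. 2) samples finite random networks and compares the empirical output variance with the variance predicted by Eqs. (10), (11) and (13). As in Fig. 2C, the second moments match even for small numbers of channels; it is the full Gaussianity of the output distribution that requires many channels.

import numpy as np

rng = np.random.default_rng(0)
sigma_w, sigma_b = 1.0, 0.1
H = W_img = 6
h = w = 3
X = rng.normal(size=(H, W_img))                        # one fixed single-channel input image
n_nets = 5000

# All 3x3 patches of X, flattened: shape (num_patches, 9).
patches = np.array([X[y:y+h, x:x+w].ravel()
                    for y in range(H - h + 1) for x in range(W_img - w + 1)])
P = patches.shape[0]

# Kernel prediction: per-patch variance (Eq. 10), ReLU second moment K/2 (Eq. 13),
# then the dense readout sums over all locations (Eq. 11 with a single output location).
K1 = sigma_b**2 + sigma_w**2 * (patches**2).sum(axis=1)
predicted_var = sigma_b**2 + sigma_w**2 * (K1 / 2.0).sum()

for C in (3, 10, 30, 100):
    W1 = rng.normal(0, sigma_w, size=(n_nets, C, h * w))          # one input channel, so no 1/C scaling
    b1 = rng.normal(0, sigma_b, size=(n_nets, C, 1))
    A1 = np.maximum(b1 + W1 @ patches.T, 0.0)                     # (n_nets, C, P) post-ReLU activations
    W2 = rng.normal(0, sigma_w / np.sqrt(C), size=(n_nets, C, P)) # dense readout weights
    b2 = rng.normal(0, sigma_b, size=n_nets)
    out = b2 + (W2 * A1).sum(axis=(1, 2))                         # scalar output of each sampled network
    print(f"C = {C:4d}   empirical var = {out.var():8.3f}   limiting var = {predicted_var:8.3f}")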

Computational efficiency.

Asymptotically, computing the kernel matrix takes O(N²LD) time, where N is the number of training examples, L is the number of layers in the network and D is the dimensionality of the input, and inverting the kernel matrix takes O(N³) time. As such, we expect that for very large datasets, inverting the kernel matrix will dominate the computation time. However, on MNIST, N is only a small factor larger than LD. In practice, we found that it was more expensive to compute the kernel matrix than to invert it. For the ResNet kernel, the most expensive, computing the training, validation and test kernel matrices took far longer on two Tesla P100 GPUs than inverting the kernel matrix and computing validation and test performance, which took only seconds on a single Tesla P100 GPU.

5 Related work

Van der Wilk et al. (2017) also adapted GP to image classification. They defined a prior on functions that take an image and output a scalar: first, draw a patch-response function from a GP; then, the output for an image is the sum of that function's outputs over each of the image's convolutional patches. Their approach is also inspired by convolutional NN, but their patch-response kernel is applied to all pairs of patches of the two images. This makes their convolutional kernel expensive to evaluate, requiring inter-domain inducing point approximations to remain tractable. The kernels in this work, directly motivated by the infinite-filter limit of a CNN, only apply something like a patch-response kernel to the corresponding pairs of patches within the two images (Eq. 10). As such, the CNN kernels are cheaper to compute and exhibit superior performance (Table 1), despite the use of an approximate likelihood function.

Kumar et al. (2018) define a prior over functions by stacking several GP with van der Wilk’s convolutional kernel, forming a “Deep GP” (Damianou & Lawrence, 2013). In contrast, the kernel in this paper confines all hierarchy to the definition of the kernel, and the resulting GP is shallow.

Wilson et al. (2016) introduced and Bradshaw et al. (2017) improved deep kernel learning. The inputs to a classic GP kernel (e.g. RBF) are preprocessed by applying a feature extractor (a deep NN) prior to computing the kernel. The NN parameters are optimised by gradient ascent using the likelihood as the objective, as in standard GP kernel learning (Rasmussen & Williams, 2006, Chapter 5). Since deep kernel learning incorporates a state-of-the-art NN with a very large number of parameters, we expect it to perform similarly to a NN applied directly to the task of image classification. At present both CNN and deep kernel learning display superior performance to the GP kernels in this work. However, the kernels defined here have far fewer parameters: only the hyperparameters of the original CNN, rather than the full set of NN weights.

Borovykh (2018) also suggests that a CNN exhibits GP behaviour. However, they take the infinite limit with respect to the filter size, not the number of filters. Thus, their infinite network is inapplicable to real data which is always of finite dimension.

Finally, there is a series of papers analysing the mean-field behaviour of deep NN and CNN which aims to find good random initializations, i.e. those that do not exhibit vanishing or exploding gradients or activations (Schoenholz et al., 2016; Yang & Schoenholz, 2017). Apart from their very different focus, the key difference to our work is that they compute the variance for a single training-example, whereas to obtain the GP kernel, we additionally need to compute the output covariances for different training/test examples (Xiao et al., 2018).

6 Conclusions and future work

We have shown that deep Bayesian CNN with infinitely many filters are equivalent to a GP with a recursive kernel. We also derived the kernel for the GP equivalent to a CNN, and showed that, in handwritten digit classification, it outperforms all previous GP approaches that do not incorporate a parametric NN into the kernel. Given that most state-of-the-art neural networks incorporate structure (convolutional or otherwise) into their architecture, the equivalence between CNN and GP is potentially of considerable practical relevance. In particular, we hope to apply GP CNN in domains as widespread as adversarial examples, lifelong learning and k-shot learning, and we hope to improve them by developing efficient multi-layered inducing point approximation schemes.

References

7 Appendix

7.1 Technical notes on limits

The key technical issues in the proof (and the key differences between Lee et al. 2017 and Matthews et al. 2018b) arise from exactly how and where we take limits. In particular, consider the activations as being functions of the activities at the previous layer,

(16)

Now, there are two approaches to taking limits. First, both our argument in the main text and the argument in Lee et al. (2017) are valid if we are able to take limits “inside” the network,

(17)

However, Matthews et al. (2018a, b) argue that it is preferable to take limits “outside” the network. In particular, Matthews et al. (2018b) take the limit with all layers simultaneously,

(18)

where each layer's width goes to infinity as the limiting index grows. That said, similar technical issues arise if we take limits in sequence, but outside the network.

7.2 Extending the derivations of Matthews et al. (2018b) to the convolutional case

In the main text, we follow Lee et al. (2017) in sequentially taking the limit of each layer's number of channels to infinity (the first hidden layer, then the second, and so on). This dramatically simplified the argument, because taking the number of units in the previous layer to infinity means that the inputs from that layer are exactly Gaussian distributed. However, Matthews et al. (2018b) argue that the more practically relevant limit is where we take all layers to infinity simultaneously. This raises considerable additional difficulties, because we must reason about convergence in the case where the previous layer is finite. Note that this section is not intended to stand independently: it is intended to be read alongside Matthews et al. (2018b), and we use several of their results without proof.

Mirroring Definition 3 in Matthews et al. (2018b), we begin by choosing a set of “width” functions, all of which approach infinity as their argument grows. In Matthews et al. (2018b), these functions described the number of hidden units in each layer, whereas here they describe the number of channels. Our goal is then to extend the proofs in Matthews et al. (2018b) (in particular, of Theorem 4) to show that the output of our convolutional networks converges in distribution to a Gaussian process in this limit, with mean zero and covariance given by the recursion in Eqs. (10–12).

The proof in Matthews et al. (2018b) has three main steps. First, they use the Cramér-Wold device to reduce the full problem to that of proving convergence of scalar random variables to a Gaussian with specified variance. Second, if the previous layers have finite numbers of channels, then the channels of the current layer are uncorrelated but no longer independent, so we cannot apply the CLT directly, as we did in the main text. Instead, they write the activations as a sum of exchangeable random variables, and derive an adapted CLT for exchangeable (rather than independent) random variables (Blum et al., 1958). Third, they show that the moment conditions required by their exchangeable CLT are satisfied.

To extend their proofs to the convolutional case, we begin by defining our networks in a form that is easier to manipulate and as close as possible to Eqs. (21–23) in Matthews et al. (2018b),

(19)
(20)
(21)
where,
(22)

The first step is to use the Cramér-Wold device (Lemma 6 in Matthews et al., 2018b), which indicates that convergence in distribution of a sequence of finite-dimensional vectors is equivalent to convergence of all possible linear projections onto the corresponding real-valued random variable. Mirroring Eq. 25 in Matthews et al. (2018b), we consider convergence of the projected random vectors,

(23)

where the projection runs over a finite set of tuples of data points, channel indices, and indices of elements within channels/feature maps. The suffix indicates width functions that are instantiated with a particular input.

Now, we must prove that these projections converge in distribution to a Gaussian. We begin by defining summands, as in Eq. 26 in Matthews et al. (2018b),

(24)

such that the projections can be written as a sum of the summands, exactly as in Eq. 27 in Matthews et al. (2018b),

(25)

Now we can apply the exchangeable CLT to prove that the projections converge to the limiting Gaussian implied by the recursions in the main text. To apply the exchangeable CLT, the first step is to mirror Lemma 8 in Matthews et al. (2018b) in showing that, for each fixed width and projection, the summands are exchangeable with respect to the channel index. In particular, we apply de Finetti’s theorem, which states that a sequence of random variables is exchangeable if and only if they are i.i.d. conditional on some set of random variables, so it is sufficient to exhibit such a set of random variables. Mirroring Eq. 29 in Matthews et al. (2018b), we apply the recursion,

(26)

As such, the summands are iid conditional on the finite set of random variables formed by the previous layer's activations at the input points under consideration.

The exchangeable CLT in Lemma 10 in Matthews et al. (2018b) indicates that the projections converge in distribution to the required Gaussian if the summands are exchangeable (which we showed above), and if three conditions on the summands hold; we refer to Matthews et al. (2018b) for their precise statement. Condition a) follows immediately, as the summands are uncorrelated and zero-mean. Conditions b) and c) are more involved, as convergence in distribution in the previous layers does not imply convergence in moments for our activation functions.

We begin by considering the extension of Lemma 20 in Matthews et al. (2018b), which allows us to show conditions b) and c) above, even in the case of unbounded but linearly enveloped nonlinearities (Definition 1 in Matthews et al., 2018b). Lemma 20 states that the eighth moments of the activations are bounded by a finite constant independent of the layer widths. We prove this by induction. The base case is trivial, as the first-layer activations are Gaussian. Following Matthews et al. (2018b), we assume the condition holds up to a given layer, and show that it then holds for the next layer. Using Eq. (21), we can bound the activations at that layer,

(27)

Following Eq. 48 in Matthews et al. (2018b), which uses Lemma 19 in Matthews et al. (2018b), we have,

(28)

where the bound involves the set of post-nonlinearities corresponding to the relevant channels and inputs. Following Matthews et al. (2018b), observe that,