1 Introduction
In recent years significant effort has gone into developing flexible probabilistic models for the supervised setting. These include, among others, deep Gaussian processes [5] as well as various approaches to Bayesian neural networks [21, 18, 10, 7, 3, 9]. While neural networks promise considerable flexibility, scalable learning algorithms for Bayesian neural networks that can deliver robust uncertainty estimates remain elusive. While some of the difficulty stems from the inadequate (weight-space) priors that are typically used, much of the challenge can be traced to the difficulty of the inference problem itself. In the variational inference setting, this manifests itself in at least two ways. First, the need to restrict the variational family to a tractable class limits the fidelity of the learned approximate posterior. Second, nested nonlinearities necessitate sampling methods during training, which can make for a challenging stochastic optimization problem, especially for wide, deep networks. In this work our goal is to make the stochastic optimization problem (somewhat) easier by integrating out some of the weights analytically. In the next section we focus on the regression case, leaving a discussion of other cases to the appendix.
^{1} Please refer to the appendix for a more detailed discussion of related work.
2 Regression Setup
Consider a dataset of size with inputs and outputs . To simplify the notation, we consider the case where is dimensional and is one-dimensional. We consider a neural network with a single hidden layer defined by the following computational flow:^{2}
^{2} We handle bias terms by augmenting the inputs to each neural network layer with an element equal to 1.
(1) 
Here is the nonlinearity, is of size and is of size , where is the number of hidden units. We choose a Normal likelihood with precision and standard Normal priors for the weights. Thus the marginal likelihood of the observed data is:
(2) 
3 Variational Bound
We consider a variational distribution of the form
(3) 
where each component distribution is Normal. Since we treat each row of independently, the activations are conditionally independent given an input . With these assumptions we can write down the following variational bound:
(4) 
The KL divergences are readily computed. We now show that we can compute closed form expressions for the first term in Eqn. 4 (i.e. the expected log likelihood) for certain nonlinearities . For concreteness we consider the ReLU activation function, i.e. . The expected log likelihood (ELL) for a single datapoint is given by
(5) 
The expectation in Eqn. 5 becomes
(6) 
Massaging terms, the expected log likelihood for the full dataset is given by
(7) 
Here we have introduced the mean function as well as the corresponding variance:
(8) 
Note that and can be used at test time to yield fast predictive means and variances. We have also defined the matrix and the dimensional vectors and :
(9) 
The key quantities are the expectations in Eqn. 9. As we show in the appendix, these can be computed in closed form for piecewise polynomial activation functions. The resulting expressions involve nothing more exotic than the error function.
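The KL terms in Eqn. 4 can be made concrete with a short sketch. It assumes a fully factorized (diagonal) Normal variational distribution, as in the mean field choice used in the experiments, together with the standard Normal prior stated in Sec. 2; the family in Eqn. 3 may be more general than this. The function names are hypothetical.

```python
import math


def kl_normal_to_standard(mean, std):
    """KL( N(mean, std^2) || N(0, 1) ) for a single scalar weight."""
    var = std * std
    return 0.5 * (var + mean * mean - 1.0 - math.log(var))


def kl_mean_field(means, stds):
    """Sum of per-weight KL terms for a fully factorized Normal (assumption)."""
    return sum(kl_normal_to_standard(m, s) for m, s in zip(means, stds))
```

Because the terms are available in closed form, only the expected log likelihood requires any further work.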
4 Experiments
We present a few experiments that demonstrate how our approach can be folded into larger probabilistic models. Note that our focus here is on how (partial) analytic control can help training (Secs. 4.1 and 4.2) and prediction (Sec. 4.3) and not on the suitability of Bayesian neural networks for particular tasks or datasets. Please refer to the appendix for details on experimental setups.
4.1 Variance Reduction
We train a Bayesian neural network with two hidden layers on a regression task and compute the gradient variance during training. As can be seen from Table 4.1, Rao-Blackwellizing the two weight matrices closest to the outputs reduces the variance, especially for the covariance parameters. As the weight matrices we integrate out get larger, this variance reduction becomes more pronounced.
| | Upon initialization: First Layer | Upon initialization: Second Layer | Upon initialization: Final Layer | Late in training: First Layer | Late in training: Second Layer | Late in training: Final Layer |
| Analytic | 8.6 / | 3.2 / | 227 / | 1.75 / | 0.2 / | 259 / |
| Sampling | 19.7 / | 10.1 / | 704 / 0.03 | 2.6 / | 0.4 / | 518 / |
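The mechanism behind this variance reduction can be illustrated on a toy problem. The sketch below is not the experiment reported above: it uses a single hidden unit with scalar weights, compares the variance of the loss value rather than of gradients, and all names and numbers are hypothetical. One estimator samples both weights; the other samples only the first-layer weight and integrates the output weight analytically.

```python
import random
import statistics


def relu(z):
    return max(0.0, z)


def sampled_loss(y, ma, sa, mb, sb, rng):
    """Squared error with both weights a and b sampled."""
    a = rng.gauss(ma, sa)
    b = rng.gauss(mb, sb)
    return (y - b * relu(a)) ** 2


def rao_blackwellized_loss(y, ma, sa, mb, sb, rng):
    """Squared error with a sampled and b ~ N(mb, sb^2) integrated out."""
    h = relu(rng.gauss(ma, sa))
    # E_b[(y - b h)^2] = (y - mb h)^2 + sb^2 h^2
    return (y - mb * h) ** 2 + (sb * h) ** 2


rng = random.Random(0)
args = (1.0, 0.5, 1.0, 0.8, 0.7)  # y, ma, sa, mb, sb (arbitrary toy values)
full = [sampled_loss(*args, rng) for _ in range(20000)]
partial = [rao_blackwellized_loss(*args, rng) for _ in range(20000)]
print(statistics.variance(full), statistics.variance(partial))
```

Both estimators target the same expectation; by the law of total variance the partially analytic one cannot have larger variance.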
4.2 VAE with a Bayesian Decoder
We train a VAE [15, 22] with a Normal likelihood on a continuous-valued dataset. For the decoder we use a Bayesian neural network with a single hidden layer.^{3} We train three model and inference variants and report test log likelihoods in Table 2. Apart from the first variant (V1), all variants make use of a Bayesian neural network as the decoder. Variants V2 and V3 differ in which weights are sampled during training.^{4} In V2 the weights before the nonlinearity are sampled, while in V3 no weights are sampled. While the test log likelihoods in Table 2 do not differ dramatically, we see evidence that: i) a Bayesian decoder can be useful in this setting; and ii) integrating out weights can help us train a better model.
^{3} Alternatively, we can think of this as the neural network analog of the deep latent variable model in ref. [4].
^{4} More precisely, we always use the ‘local reparameterization trick’ [14] and never sample weights directly.
4.3 Fast Prediction
We train a Bayesian neural network on ImageNet [24].^{5} Specifically, we place a prior on the two weight matrices closest to the softmax output. We then compute classification accuracies on the test set using two methods: i) Monte Carlo; and ii) a deterministic approximation using the analytic results described above (see Appendix for details). As can be seen from Fig. 1, a large number of samples must be drawn before the MC estimator reaches the performance of the deterministic approximation. Indeed even with samples the deterministic approximation outperforms MC on top-5 accuracy.
^{5} While our analytic results can be used to form approximate variational objectives (or, alternatively, control variates) in the classification setting (see Appendix), here we sample the weights during training.
5 Discussion
The approach developed here is expected to be most useful when integrated into larger Bayesian neural network setups. It would be of particular interest to combine this approach with the class of priors described in [12], the deterministic approximations in [26], or Normal variational distributions with flexible conditional dependence like those in [17]. Finally, our analytic results could be useful in the context of other classes of nonlinear probabilistic models.
References
[1] T. Bertin-Mahieux, D. P. Ellis, B. Whitman, and P. Lamere. The million song dataset. In Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011), 2011.
[2] E. Bingham, J. P. Chen, M. Jankowiak, F. Obermeyer, N. Pradhan, T. Karaletsos, R. Singh, P. Szerlip, P. Horsfall, and N. D. Goodman. Pyro: Deep universal probabilistic programming. Journal of Machine Learning Research, 2018.
[3] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
[4] Z. Dai, A. Damianou, J. González, and N. Lawrence. Variational auto-encoded deep Gaussian processes. arXiv preprint arXiv:1511.06455, 2015.
[5] A. Damianou and N. Lawrence. Deep Gaussian processes. In Artificial Intelligence and Statistics, pages 207–215, 2013.
[6] D. Dheeru and E. Karra Taniskidou. UCI machine learning repository, 2017.
[7] A. Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348–2356, 2011.
[8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[9] J. M. Hernández-Lobato and R. Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International Conference on Machine Learning, pages 1861–1869, 2015.
[10] G. E. Hinton and D. Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, pages 5–13. ACM, 1993.
[11] M. Kandemir, M. Haussmann, and F. A. Hamprecht. Sampling-free variational inference of Bayesian neural nets. arXiv preprint arXiv:1805.07654, 2018.
[12] T. Karaletsos, P. Dayan, and Z. Ghahramani. Probabilistic meta-representations of neural networks. arXiv preprint arXiv:1810.00555, 2018.
[13] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[14] D. P. Kingma, T. Salimans, and M. Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575–2583, 2015.
[15] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[16] W. V. Li, A. Wei, et al. Gaussian integrals involving absolute value functions. In High Dimensional Probability V: The Luminy Volume, pages 43–59. Institute of Mathematical Statistics, 2009.
[17] C. Louizos and M. Welling. Multiplicative normalizing flows for variational Bayesian neural networks. arXiv preprint arXiv:1703.01961, 2017.
[18] D. J. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.
[19] B. M. Marlin, M. E. Khan, and K. P. Murphy. Piecewise bounds for estimating Bernoulli-logistic latent Gaussian models. In ICML, pages 633–640, 2011.
[20] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. 2017.
[21] C. Peterson. A mean field theory learning algorithm for neural networks. Complex Systems, 1:995–1019, 1987.
[22] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
[23] S. M. Ross. Simulation. Academic Press, San Diego, 2006.
[24] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[25] Y. Wen, P. Vicol, J. Ba, D. Tran, and R. Grosse. Flipout: Efficient pseudo-independent weight perturbations on mini-batches. arXiv preprint arXiv:1803.04386, 2018.
[26] A. Wu, S. Nowozin, E. Meeds, R. E. Turner, J. M. Hernández-Lobato, and A. L. Gaunt. Fixing variational Bayes: Deterministic variational inference for Bayesian neural networks. arXiv preprint arXiv:1810.03958, 2018.
6 Appendix
The main goal of this appendix is to show how to compute the necessary expectations in Eqn. 9 for piecewise polynomial nonlinearities. Instead of presenting an (unwieldy) master formula for the general case, we proceed step by step and show how the computation is done in a few cases of increasing complexity. We begin with a basic ReLU integral.
6.1 ReLU Mean Function
We first consider the mean function for the ReLU activation function , i.e. we would like to compute the following expectation:
(10) 
where . The first expectation in Eqn. 10 is elementary. For the second expectation note that, since , the expectation can be transformed to a one-dimensional integral
(11) 
that can readily be computed in terms of the error function. We do not do so here, however, because in subsequent derivations we will find an alternative strategy—namely to make use of a particular integral representation for the absolute value function—to be more convenient. Thus before we give an explicit formula for Eqn. 10 we collect a few useful identities.
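For readers who want the one-dimensional result directly, the sketch below evaluates the ReLU mean for a Gaussian pre-activation using the standard closed form in terms of the error function, and checks it against brute-force quadrature. The scalar parameterization by `mu` and `sigma` (the mean and standard deviation of the pre-activation) is an assumption made for illustration and is not the notation of the original equations.

```python
import math


def normal_pdf(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def normal_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def relu_mean(mu, sigma):
    """E[max(0, z)] for z ~ N(mu, sigma^2) with sigma > 0."""
    t = mu / sigma
    return mu * normal_cdf(t) + sigma * normal_pdf(t)


def relu_mean_quadrature(mu, sigma, n=20001, width=10.0):
    """Trapezoid-rule check of the same expectation."""
    lo, hi = mu - width * sigma, mu + width * sigma
    step = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        z = lo + i * step
        weight = 0.5 if i in (0, n - 1) else 1.0
        total += weight * max(0.0, z) * normal_pdf((z - mu) / sigma) / sigma
    return total * step
```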
6.2 Useful Integrals
First we define the following (scalar) quantities, which we will make extensive use of throughout the appendix:
(12) 
We then have
(13) 
and
(14) 
as well as
(15) 
The integral identity we make use of is:^{6}
^{6} Note that this identity is also used in [16] to compute a related class of Gaussian integrals.
(16) 
For reference we note that this identity can easily be derived by integrating by parts and making use of the well-known sine integral:^{7}
^{7} The absolute value in Eqn. 16 is an immediate consequence of .
(17) 
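The displayed identities are not legible in this copy. A standard integral representation of the absolute value that matches the description above (integration by parts plus the sine integral) is written out below; that it coincides exactly with Eqns. 16 and 17 is an assumption.

```latex
\begin{align}
|x| &= \frac{2}{\pi} \int_0^\infty \frac{1 - \cos(x t)}{t^2}\, dt, \\
\int_0^\infty \frac{\sin(x t)}{t}\, dt &= \frac{\pi}{2}\, \mathrm{sign}(x).
\end{align}
```

Integrating the first line by parts gives a vanishing boundary term and leaves $x \int_0^\infty \sin(xt)/t \, dt = \tfrac{\pi}{2} |x|$. A representation of this kind is convenient because the Gaussian expectation of the cosine is elementary: for $z \sim \mathcal{N}(\mu, \sigma^2)$, $\mathbb{E}[\cos(t z)] = e^{-t^2 \sigma^2 / 2} \cos(t \mu)$.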
6.3 ReLU Part II
Combining the above identities we get:
(18) 
and
(19) 
Thus we have all the ingredients to compute the expectations in Eqn. 9:
(20) 
As stated in the main text, these expectations involve nothing more exotic than the error function. Note that as only a small portion of probability mass is propagated through the constant portion of the ReLU activation function. As such, in this limit we expect and . It is easy to verify that this is indeed the case. Similarly, as , we have .
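The limiting behaviour described above can be checked numerically in the scalar case. The sketch below uses the standard closed forms for the first two moments of a ReLU applied to a Gaussian pre-activation; the parameterization by a scalar mean `mu` and standard deviation `sigma` is an assumption for illustration and does not reproduce the matrix-valued quantities of Eqn. 20.

```python
import math


def normal_pdf(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def normal_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def relu_mean(mu, sigma):
    """E[max(0, z)] for z ~ N(mu, sigma^2)."""
    t = mu / sigma
    return mu * normal_cdf(t) + sigma * normal_pdf(t)


def relu_second_moment(mu, sigma):
    """E[max(0, z)^2] for z ~ N(mu, sigma^2)."""
    t = mu / sigma
    return (mu * mu + sigma * sigma) * normal_cdf(t) + mu * sigma * normal_pdf(t)


def relu_variance(mu, sigma):
    return relu_second_moment(mu, sigma) - relu_mean(mu, sigma) ** 2


# Far to the right of the kink the ReLU acts like the identity ...
print(relu_mean(10.0, 1.0), relu_variance(10.0, 1.0))
# ... and far to the left it acts like the zero function.
print(relu_mean(-10.0, 1.0), relu_variance(-10.0, 1.0))
```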
6.4 Other Nonlinearities
6.4.1 Leaky ReLU
We consider the ‘leaky’ ReLU, which we define to be given by
(21) 
for some . In this case one finds:
(22) 
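In the scalar case the leaky ReLU mean follows from the ReLU mean by linearity. The sketch below assumes the conventional definition with slope `alpha` on the negative half-line, i.e. max(0, z) + alpha * min(0, z); the exact form of Eqn. 21 is not legible in this copy, so this definition and the names used are assumptions.

```python
import math


def normal_pdf(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def normal_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def relu_mean(mu, sigma):
    """E[max(0, z)] for z ~ N(mu, sigma^2)."""
    t = mu / sigma
    return mu * normal_cdf(t) + sigma * normal_pdf(t)


def leaky_relu_mean(mu, sigma, alpha):
    """E[max(0, z) + alpha * min(0, z)] for z ~ N(mu, sigma^2).

    Uses min(0, z) = -max(0, -z) and the fact that -z ~ N(-mu, sigma^2).
    """
    return relu_mean(mu, sigma) - alpha * relu_mean(-mu, sigma)
```

Setting `alpha = 1` recovers the identity function (mean `mu`), and `alpha = 0` recovers the plain ReLU.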
6.4.2 Hard Sigmoid
We consider the ‘hard sigmoid’ nonlinearity, which we define to be given by
(23) 
for a given constant . The identity in Eqn. 16 can immediately be generalized to
(24) 
Proceeding as before we compute
(25) 
and (for )
(26) 
Using these identities we find:
(27) 
As , we have , i.e. the hard sigmoid nonlinearity approaches the identity function. It is easy to verify that in this limit the expectations in Eqn. 27 approach the correct limit.
6.4.3 ReLU Squared
We consider the ‘ReLU squared’ nonlinearity, which we define to be given by
(28) 
This is the first nonlinearity we have considered that contains a piecewise quadratic portion. Using previous results as well as the higher moments computed in the next section we find:
(29) 
6.5 Higher Moments
We can also compute higher moments of activation functions. We start with the integral identities
(30) 
and
(31) 
and
(32) 
We then have
(33) 
as well as
(34) 
6.6 Piecewise Polynomial Activation Functions
Piecewise polynomial functions in can be represented by composing polynomials in with the absolute value function. Thus in order to compute Eqn. 9 for general piecewise polynomial activation functions we need to be able to compute expectations of the form
(35) 
where the are polynomials. We have shown how this computation can be done in a number of cases. For any specific case the recipe we have used to do the computation remains applicable. In particular one can compute any needed ‘base’ integrals by doing the computation in one dimension as in Eqn. 11. One can then make use of the integral identity in Eqn. 16 and differentiation to compute higher order moments via purely algebraic operations (cf. the manipulations in Sec. 6.5).
6.7 Other likelihoods
In the main text we showed how to compute exact closed form expressions for the ELBO variational objective in the regression case. For other likelihoods, the required expectations are generally intractable. Nevertheless we can still compute closed form variational objectives at the price of some approximation. Alternatively, if we are worried about the bias introduced by our approximations, we can use our approximations as control variates in the Monte Carlo sampling setting. In this section we briefly describe how this goes in the case of softmax classification.
6.7.1 Softmax Categorical Likelihood
Using familiar bounds^{8} we have:
^{8} See e.g. http://www.columbia.edu/~jwp2128/Teaching/E6720/Fall2016/papers/twobounds.pdf
(36) 
We do a second-order Taylor expansion of around its expectation to obtain
(37) 
so that our approximate lower bound to the expected log likelihood becomes
(38) 
We can then use the closed form expressions for the mean function and variance given in the main text to form a deterministic approximation to the expected log likelihood.
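A minimal sketch of such a deterministic approximation is given below. Since Eqns. 36 to 38 are not legible in this copy, the specific form is an assumption consistent with the description: Jensen's inequality moves the expectation inside the log-sum-exp, and a second-order Taylor expansion of the exponential around the mean logit gives E[exp(y)] ≈ exp(m)(1 + v/2), where `m` and `v` denote the mean and variance of the logit. The function name and arguments are hypothetical.

```python
import math


def approx_expected_log_softmax(k, means, variances):
    """Approximate E[log softmax_k(y)] from per-logit means and variances.

    Assumed form: m_k - log sum_j exp(m_j) * (1 + v_j / 2).
    """
    m_max = max(means)  # subtract the max for numerical stability
    total = sum(
        math.exp(m - m_max) * (1.0 + 0.5 * v) for m, v in zip(means, variances)
    )
    return means[k] - m_max - math.log(total)
```

With all variances set to zero this reduces to the ordinary log softmax evaluated at the mean logits.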
6.7.2 Logistic Bernoulli Likelihood
For the case with two classes with and where we have a single logit the approximation in Eqn. 38 reduces to
(39) 
6.7.3 Control Variates
To reduce bias to zero one can construct a control variate [23] version of the variational objective in Eqn. 38:
(40) 
The expectations in Eqn. 40 are then estimated with Monte Carlo, while the rest is available in closed form. Alternatively, we can use the following estimator:
(41) 
That is, we use the analytic result for the numerator in the softmax likelihood and sample the troublesome denominator.
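The general control variate construction [23] can be illustrated on a one-dimensional toy problem. The sketch below is not Eqn. 40: it estimates E[exp(z)] for a Gaussian z and uses the second-order Taylor expansion of the exponential, whose expectation is known in closed form, as the control variate. All names and parameter values are hypothetical.

```python
import math
import random
import statistics


def taylor_exp(z, mu):
    """Second-order Taylor expansion of exp(z) around mu."""
    d = z - mu
    return math.exp(mu) * (1.0 + d + 0.5 * d * d)


def taylor_exp_mean(mu, sigma):
    """Exact expectation of taylor_exp(z, mu) for z ~ N(mu, sigma^2)."""
    return math.exp(mu) * (1.0 + 0.5 * sigma ** 2)


mu, sigma = 0.0, 0.5
rng = random.Random(0)
draws = [rng.gauss(mu, sigma) for _ in range(10000)]

# Plain Monte Carlo terms and control-variate-corrected terms.
plain_terms = [math.exp(z) for z in draws]
cv_terms = [
    math.exp(z) - taylor_exp(z, mu) + taylor_exp_mean(mu, sigma) for z in draws
]

exact = math.exp(mu + 0.5 * sigma ** 2)
plain_estimate = statistics.mean(plain_terms)
cv_estimate = statistics.mean(cv_terms)
print(exact, plain_estimate, cv_estimate)
```

The corrected estimator remains unbiased because the subtracted term and its analytic expectation cancel on average, while the per-sample variance drops because only the Taylor remainder is sampled.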
6.7.4 Fast Approximate Prediction
To make fast test time predictions we can simply use the analytic mean function , i.e. use
(42) 
effectively ignoring the normalizing term in the softmax likelihood. As can be seen in Fig. 1, this approximation can be quite effective in practice.
6.8 Experimental Details
All the experiments described in this work were implemented in the Pyro probabilistic programming language [2], which is built on top of PyTorch [20]. As noted in the main text, whenever sampling a weight matrix we make use of the ‘local reparameterization trick’ [14], i.e. we sample in pre-activation space and not in weight space. This can lead to substantial variance reduction as compared to sampling in weight space directly.
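The idea behind sampling in pre-activation space can be sketched for a single output unit with independent Normal weights. This is a plain Python illustration with hypothetical names, not the Pyro code used for the experiments: for a fixed input, the pre-activation is itself Normal, so it can be drawn directly from its mean and variance.

```python
import math
import random


def preactivation_moments(x, means, stds):
    """Mean and variance of sum_i w_i x_i for independent w_i ~ N(m_i, s_i^2)."""
    mean = sum(m * xi for m, xi in zip(means, x))
    var = sum((s * xi) ** 2 for s, xi in zip(stds, x))
    return mean, var


def sample_via_weights(x, means, stds, rng):
    """Weight-space sampling: draw every weight, then form the pre-activation."""
    weights = [rng.gauss(m, s) for m, s in zip(means, stds)]
    return sum(w * xi for w, xi in zip(weights, x))


def sample_via_preactivation(x, means, stds, rng):
    """Local reparameterization: draw the pre-activation directly."""
    mean, var = preactivation_moments(x, means, stds)
    return rng.gauss(mean, math.sqrt(var))
```

For a single input the two samplers produce draws from the same distribution; the second needs one random number per unit instead of one per weight.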
6.8.1 Variance Reduction
We use the 90-dimensional ‘YearPredictionMSD’ dataset from the UCI repository [6]. This dataset is a subset of the Million Song Dataset [1]. The architecture of our neural network is given by , where all layers are fully connected and both nonlinearities are ReLU. We use mean field (Normal) variational distributions for all weight matrices. To compute gradient variances of variational parameters with respect to the variational objective, we fix a random mini-batch of training data with 500 elements. We then compute samples and report empirical gradient variances averaged over the elements of each tensor. We report gradient variances computed before any training as well as after 50 epochs. For the (partially) analytic result, only the weight matrix closest to the inputs is sampled, while for the sampling result all weight matrices are sampled.
6.8.2 VAE with a Bayesian Decoder
We use the same dataset as for the variance reduction experiment above (with the difference that in this unsupervised setting we only use the input features). This dataset has data points. We split the data into training, test, and validation sets in the proportion 7 : 2 : 1. For the encoder we use a fully connected (non-Bayesian) neural network with 500 hidden units in each of the two hidden layers. For the decoder we use a neural network with a single hidden layer with 500 hidden units. All nonlinearities are ReLU. We use the Adam optimizer [13] and mini-batches of size 2000 during training. We use mean field (Normal) variational distributions for all weight matrices in the decoder. We do a grid search over the hyperparameters of the optimizer and use the validation set to choose the number of epochs to train. For all three model variants this procedure resulted in the following choices: default Adam hyperparameters and 1500 epochs of training. We use a latent dimension of 30. We report test log likelihoods that make use of an importance weighted estimator that draws 500 × 100 samples per data point (100 samples inside the log, averaged over 500 trials).
6.8.3 Fast Prediction
We take a ResNet50 [8] that is pretrained^{9} on ImageNet [24] and then lop off the final layer and replace it with the following neural network architecture: . Here the first two dashes represent ReLU nonlinearities and the final layer of outputs represents softmax logits. We learn the first weight matrix (2048 - 1000) using MLE and are Bayesian about the subsequent weight matrices (we use mean field variational distributions). We do not fine-tune the weight matrices inherited from the pretrained ResNet50. Our test set and validation set consist of 40k and 10k images, respectively. We train^{10} for up to 120 epochs and use the validation set to fix optimization hyperparameters and determine how many epochs to train. In contrast to the typical approach taken in deep learning, we use a fixed size/crop for training images, i.e. we do not do any data augmentation.
^{9} https://pytorch.org/docs/stable/torchvision/models.html
^{10} Note that we also trained our network using the approximations and control variates described in Sec. 6.7, but we do not report those results here.
Fig. 1 is generated as follows. For the MC estimate of the predicted class probabilities, we draw a total of 4096 samples per datapoint. These samples are then combined via the following allocation:
(43) 
Here , which represents the number of samples inside the log, is the quantity plotted on the horizontal axis of Fig. 1. To form the deterministic approximation we follow Sec. 6.7.4.
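One way to read this allocation is sketched below: a fixed budget of sampled class probabilities is split into equal groups, the mean is taken inside the logarithm within each group, and the resulting log values are averaged over groups. Eqn. 43 is not legible in this copy, so this reading, the function name, and the argument names are assumptions.

```python
import math


def grouped_log_mean(probs, n_inner):
    """Average over groups of log(mean of n_inner sampled probabilities).

    probs   : sampled probabilities of one class for one datapoint
    n_inner : number of samples inside the log (must divide len(probs))
    """
    if n_inner <= 0 or len(probs) % n_inner != 0:
        raise ValueError("n_inner must be positive and divide the sample count")
    groups = [probs[i:i + n_inner] for i in range(0, len(probs), n_inner)]
    return sum(math.log(sum(g) / len(g)) for g in groups) / len(groups)
```

By Jensen's inequality the estimate with one sample inside the log is never larger than the estimate that places all samples inside the log, which is why the number of inner samples matters for the comparison.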
6.9 Related Work
The approach most closely related to ours is probably the deterministic approximation scheme of ref. [26] (indeed they compute some of the same ReLU integrals that we do). While we focus on single-layer neural networks, the distinct advantage of their approximation scheme is that it can be applied to networks of arbitrary depth. Thus some of our results are potentially complementary to theirs. Reference [11] also constructs deterministic variational objectives for the specific case of the ReLU activation function. Reference [19] considers quadratic and piecewise linear bounds for the logistic-log-partition function in the context of Bernoulli-logistic latent Gaussian models. Finally, approaches for variance reduction in the stochastic variational inference setting include [14] and [25].
6.10 Assorted Remarks

The factorization assumption in Eqn. 3 can probably be weakened at the cost of dealing with special functions more exotic than the error function.

Note that the expression for the full (predictive) variance, which decomposes into three readily identified components, is given by:
(44) 
Although we do not do so here, it would probably be straightforward to compute an all orders formula for the moments of the ReLU activation function, for which the main computational ingredient is the expectation
(45)