1 Introduction and background
Artificial neural networks are versatile representations of functions that have led to groundbreaking results in a variety of learning problems such as function approximation (regression), classification and dimensionality reduction. Deep neural networks have been shown to learn effective approximations of difficult functions (high-dimensional, with discontinuities, etc.) such as digital images defined over the space spanned by pixels. While there are numerous methods that focus on learning these representations, estimating integrals of the learned function remains an open problem.
In this paper, we derive a closed-form formula to calculate the integral of a specific but well-known shallow feedforward network (SFFN) containing one hidden layer and a sigmoid activation function. This textbook case is an example of what are called universal approximator networks Cybenko (1989); Hornik (1991); Lu et al. (2017), since it can approximate any continuous function accurately Shaham et al. (2015). Although learning one from point estimates might be slow, or require a large number of neurons to achieve a sufficiently accurate approximation, its simplicity and universality make it an attractive case to study. We derive an exact formula for the integral of the learned approximation as a function of the parameters (weights) obtained using a standard training procedure. A formula for exact integration enhances its appeal as a proxy or surrogate function in applications where the original function needs to be numerically integrated. We test the utility of our formula by estimating multidimensional integrals and their marginals, where
$$I \;=\; \int_{\Omega} f(\mathbf{x})\,p(\mathbf{x})\,\mathrm{d}\mathbf{x}, \qquad M\!\left(\mathbf{x}^{k+1:n}\right) \;=\; \int_{\Omega_k} f(\mathbf{x})\,p_k\!\left(\mathbf{x}^{1:k}\right)\mathrm{d}\mathbf{x}^{1:k} \tag{1}$$
over hyperrectangular domains (the full n-dimensional domain and its k-dimensional counterpart) using point samples of the integrand drawn from the distributions appearing in each integral. The superscript notation denotes selected elements of a vector, so the marginal is a function of all but the first k components (without loss of generality) of the input. We focus on the case where there is no informative prior on the integrand and where the sampling density is constant. These problems are abstract representatives of the applications of numerical integration in machine learning, e.g. normalization of multidimensional probability densities (frequentist statistics) or likelihood functions (likelihood statistics), and computation of marginal likelihoods (model evidence), which forms the crux of Bayesian approaches. An advantage of the surrogate function over averaging point samples (à la Monte Carlo integration) is that it enables functional representations of marginal distributions. Another advantage is that the surrogate
might be used as a control variate (Owen, 2013, Sec. 8.9); Wan et al. (2018), as shown in section 3. Despite the plethora of computational approaches such as quadrature rules Brass and Petras; Keshavarzzadeh et al. (2018), Monte Carlo integration and Markov chain Monte Carlo methods, the general procedure of estimating integrals of functions using point samples remains frustratingly inefficient even in moderately many (about ten) dimensions. Monte Carlo methods operate by expressing integrals as expectations which can be estimated via random simulation Metropolis and Ulam (1949). Several variants of this method address its slow convergence, and strike different compromises between bias (accuracy) and variance (precision). We refer the reader to standard texts on the subject Owen (2013). Quasi-Monte Carlo (QMC) methods replace stochasticity with carefully designed, deterministic samples, which improves convergence dramatically when the integrands are smooth Niederreiter (1978) or of moderate dimensionality (below ten). Deep neural networks can mimic MC solutions to certain partial differential equations Grohs et al. (2019), sidestepping the curse of dimensionality.
Bayesian methods use probabilistic model-based surrogates to improve sample-efficiency for expensive integrands Ghahramani and Rasmussen (2003) or to guide active sampling (adaptive Bayesian quadrature Osborne et al. (2012); Kanagawa and Hennig (2019), Bayesian optimization Shahriari et al. (2016), etc.). This general approach can be used to estimate the marginal likelihood Briol et al. (2015); Gunter et al. (2014), to approximate the posterior Kandasamy et al. (2015); Wang and Li (2018) and to simultaneously infer both Acerbi (2018). These works impose a Gaussian Process (GP) prior on the integrand and focus on cases where analytical formulae can be derived for the expectation and uncertainty of the statistical surrogate. The nonparametric nature of the GP can affect scalability, and the need to explicitly evaluate the sampling probability sometimes precludes the use of these methods. The advantages of these approaches are that they inherently enable reasoning about uncertainty and that they serve as generative models. Hybrid methods, such as using GPs to model weights Karaletsos and Bui (2020), investigate calibrated reasoning about uncertainty using probabilistic neural networks. Integral representations Dereventsov et al. (2019) use an analysis of continuous distributions, for a particular target function, from which instances of shallow neural networks can be sampled.
We test the utility of a neural network as a surrogate representation for numerical integration. We identify that the integral of a sigmoidal universal approximator function can be expressed in closed form and derive its formula. Our result can be viewed as parametric in the sense that, for a fixed architecture, the number of parameters (weights and biases) learned is independent of the amount of training data. Once learned, a parametric formula can be used to evaluate the integral or its marginals. On the other hand, it does not encode distributional information or serve as a generative model.
Contributions The contributions of this paper are:


a formula to estimate multidimensional integrals of a neural network function representation;

an interpretation of the formula as a neural network with one hidden layer which we call a QNET;

and a parametric formula for marginalization along subsets of input dimensions.
We test the use of the formula for approximate numerical integration on smooth and discontinuous functions.
2 Numerical integration of a shallow feedforward network
2.1 Preliminaries
We denote row and column vectors with boldface characters and use capital letters to represent matrices. (We do not assume that all vectors are column vectors, so that the formulae in the paper match the output weights of neural network libraries with no extra transposes necessary for implementation.) Superscripts denote selected elements of a vector or matrix, so a single superscript selects an element of a vector or a row of a matrix. For convenience, we operate in a normalized domain, as is commonly the case with inputs of neural networks. General hyperrectangular domains may be handled by scaling the dimensions independently (see Sec. 4). We use g to represent an approximation of the function f obtained by training a shallow feedforward neural network with one hidden layer of m neurons and a linear output layer using N samples. For convenience, we collectively encode all learnable parameters of the approximator network: a matrix A of hidden-layer weights (one row per neuron), a row vector b of hidden biases, a column vector c of output weights and a real number d (output bias). The approximate function is
$$g(\mathbf{x}) \;=\; \sigma\!\left(\mathbf{x}A^{\top} + \mathbf{b}\right)\mathbf{c} \;+\; d \tag{2}$$
where we define σ to be the logistic function, σ(t) = 1/(1 + e^{−t}), that operates independently on each element of its input. In this case, we derive a closed-form formula that exactly computes the integral of g as an approximation to the integral in eq. 1, and show that the formula can be adapted to produce a closed-form representation of the marginals in eq. 1.
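For concreteness, the surrogate above can be sketched in a few lines of NumPy (a minimal illustration with our own variable names; the paper's MATLAB implementation is not shown):

```python
import numpy as np

def sffn(x, A, b, c, d):
    """Forward pass of a shallow feedforward surrogate with one hidden
    layer and a linear output (names are ours, not from the paper).
    x : (n,) input; A : (m, n) hidden weights, one row per neuron;
    b : (m,) hidden biases; c : (m,) output weights; d : scalar bias."""
    z = A @ x + b                  # pre-activations of the m hidden neurons
    h = 1.0 / (1.0 + np.exp(-z))   # elementwise logistic activation
    return h @ c + d               # linear output layer
```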
2.2 The formula
Our formula for the n-dimensional integral of g using m neurons is (see App. A.1)
$$\hat{I} \;=\; \int_{[0,1]^n} g(\mathbf{x})\,\mathrm{d}\mathbf{x} \;=\; \mathbf{c}^{\top}\mathbf{q} + d, \tag{3}$$
$$q^{(i)} \;=\; \frac{(-1)^{n+1}}{\prod_{j=1}^{n} A^{(i,j)}} \sum_{r=1}^{2^n} (-1)^{\left\|S^{(r)}\right\|_1}\, \mathrm{Li}_n\!\left(-\exp\!\left(S^{(r)}\!\cdot A^{(i)} + b^{(i)}\right)\right) \tag{4}$$
Here Li_n represents the polylogarithm Euler (1768); Lewin (1981) function of order n, and S is a 'sign' matrix with elements in {0, 1} whose 2^n rows represent all combinations of signs. The contribution of each term is positive (or negative) depending on the parity of the number of ones in the row S^{(r)}. q is a column vector whose elements correspond to the contributions of each of the m neurons in the hidden layer. The computational complexity of evaluating the formula is O(m 2^n) and the memory complexity is O(2^n), since each row of S can be binary encoded with length n. While the computation time is independent of the number N of samples of the integrand, increasing N improves g as an approximation of f.
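A direct implementation of the formula can be sketched as follows (assuming mpmath's `polylog` for the order-n polylogarithm and nonzero weights; the sign bookkeeping follows our reading of the derivation in App. A.1):

```python
import itertools
import numpy as np
from mpmath import polylog  # Li_n with analytic continuation for real arguments

def neuron_integral(w, b):
    """Closed-form integral of sigmoid(w . x + b) over [0,1]^n via
    order-n polylogarithms and the 2^n sign combinations.
    Assumes all entries of w are nonzero."""
    n = len(w)
    total = 0.0
    for s in itertools.product((0, 1), repeat=n):
        z = float(np.dot(s, w) + b)
        sign = (-1) ** (n - sum(s))   # + when the number of zeros in s is even
        total += sign * (-float(polylog(n, -np.exp(z))))
    return total / np.prod(w)

def integral_sffn(A, b, c, d):
    """Integral over [0,1]^n of the m-neuron surrogate (eq. 3)."""
    q = np.array([neuron_integral(A[i], b[i]) for i in range(len(c))])
    return q @ c + d
```

For n = 1 this reduces to the difference of softplus terms derived in the appendix; for larger n the 2^n terms make the exponential dependence on dimensionality explicit.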
2.3 QNET representation
Although the calculation can be performed explicitly as in equations 3 and 4, we observe that the expression within the summation term may itself be represented using a feedforward neural network, with one hidden layer containing 2^n neurons, which we call a QNET. The parameters (weights, biases and activation function) of the QNET are fixed (not learned) as a function of the dimensionality n and independent of the parameters of g. The input to the QNET is a concatenation of a row of the weight matrix and the corresponding bias term: [A^{(i)}, b^{(i)}]. The first layer is defined by the sign matrix S concatenated with a column of ones, [S 1], and zero bias. The activation function for the hidden layer is built from the polylogarithm of order n, and the output is a linear combination, with zero bias, whose weights are +1 or −1 according to the parity of the number of ones in the corresponding row of [S 1]. Thus, the output of the QNET is the summation term of eq. 4. The subscript n serves as a reminder that the QNET is fully specified by the dimensionality. Now eq. 4 simplifies to
$$q^{(i)} \;=\; \frac{\mathcal{Q}_n\!\left(\left[A^{(i)}\;\; b^{(i)}\right]\right)}{\prod_{j=1}^{n} A^{(i,j)}} \tag{5}$$
so that the i-th element of the column vector q is calculated using a single evaluation of the QNET. Figure 2 provides an illustrative summary of how QNETs may be constructed and used.
The output of the QNET (eq. 5) is identical to the summation in eq. 4. While it might be possible to construct a single network whose output is the integral itself, this would be complicated by the denominator in eq. 5. Although the QNET does not have any learnable parameters, it is a useful abstraction that enables an elegant representation of integrals along subsets of dimensions (marginalization).
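Our reading of this construction can be sketched as a fixed-weight network in NumPy (names and sign conventions are ours; mpmath's `polylog` supplies the activation):

```python
import itertools
import numpy as np
from mpmath import polylog

def qnet_weights(n):
    """Fixed parameters of an n-input QNET (a sketch of our reading):
    first-layer rows are sign combinations from {0,1}^n with a 1
    appended for the bias input; output weights are +/-1."""
    S = np.array(list(itertools.product((0, 1), repeat=n)), dtype=float)
    W1 = np.hstack([S, np.ones((2 ** n, 1))])  # (2^n, n+1), zero bias
    w_out = np.array([(-1.0) ** (n - int(s.sum())) for s in S])
    return W1, w_out

def qnet(wb, n):
    """One evaluation of the QNET on the concatenated input [w, b]."""
    W1, w_out = qnet_weights(n)
    z = W1 @ wb                                   # hidden pre-activations
    act = np.array([-float(polylog(n, -np.exp(t))) for t in z])
    return w_out @ act                            # +/-1 output weights
```

Dividing the QNET output by the product of the weights then recovers the per-neuron integral, as in eq. 5.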
2.4 Marginalizing using QNETs
We now explain how the first k dimensions of the input may be marginalized. Although we choose the first k (adjacent) dimensions for convenient exposition, the following procedure holds without loss of generality for any subset of dimensions. The result of marginalizing these dimensions is a function of the remaining n − k variables whose formula requires integration over k dimensions. Using QNETs, we obtain:
$$\tilde{g}\!\left(\mathbf{x}^{k+1:n}\right) \;=\; \sum_{i=1}^{m} c^{(i)}\, \frac{\mathcal{Q}_k\!\left(\left[A^{(i,1:k)}\;\; b^{(i)} + A^{(i,k+1:n)}\cdot \mathbf{x}^{k+1:n}\right]\right)}{\prod_{j=1}^{k} A^{(i,j)}} \;+\; d \tag{6}$$
where the only differences from equation 5 are: 1) that n has been replaced with k on the right-hand side; and 2) that the input to the QNET is the first k weights of each row along with a modified bias term. In other words, the weighted non-marginalized variables are added to the last element of the QNET input. Thus, by modifying the input and evaluating a QNET parameterized by the reduced dimensionality, we can approximate marginal distributions. Our definition of QNETs is motivated by this ability to elegantly obtain functional approximations to marginal distributions from point-sampled functions. See figure 3 for an illustrative summary and figure 1 for an example. The example shows the result of integrating a 3D function (a Gaussian mixture) using a single hidden layer trained on point samples. The figure drops subscripts for brevity. Also shown are the point-sampled marginal functions in 2D and 1D for different permutations of the components of the input.
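The bias-shifting trick above can be sketched directly (our own names; the polylogarithm sum plays the role of the k-input QNET, assuming nonzero weights):

```python
import itertools
import numpy as np
from mpmath import polylog

def marginal(x_rest, A, b, c, d, k):
    """Sketch of marginalizing the first k inputs of the surrogate over
    [0,1]^k: fold the weighted remaining inputs into each neuron's bias
    and evaluate the k-input closed form (names are ours)."""
    out = d
    for i in range(len(c)):
        w = A[i, :k]
        b_eff = b[i] + A[i, k:] @ x_rest   # shift bias by the fixed inputs
        total = 0.0
        for s in itertools.product((0, 1), repeat=k):
            z = float(np.dot(s, w) + b_eff)
            total += (-1) ** (k - sum(s)) * (-float(polylog(k, -np.exp(z))))
        out += c[i] * total / np.prod(w)
    return out
```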
3 Empirical evaluation
The empirical results in this section demonstrate that: 1) the error consistently diminishes with increased training samples for moderate dimensionality; 2) the approach is applicable to challenging classes of integrands in the absence of any prior; and 3) the surrogate can be used as a control variate, which improves on standard MC and QMC on discontinuous integrands for dimensionality as low as 4.
Figure 4: a) Gaussian Mixtures (GM); b) Discontinuous GM (GMD); c) Binary Hyperrectangles (HR).
We trained g using N samples from three families of integrands. The first, GM, is the probability density function (pdf) of a mixture of Gaussians parametrized by dimensionality and the number of components in the mixture. The covariances are generated randomly, in a way that the Gaussians become narrower (peakier) as the number of components is increased. Figure 4.a visualizes two examples in 2D with different numbers of components. The second class, GMD, is the same as GM but with discontinuities introduced: we choose a random point in the domain and zero out the pdf of GM on one side of that point, determined componentwise. Everywhere else it is identical to GM (see fig. 4.b). The third family of integrands, HR, is a binary function that equals one everywhere except within a set of randomly located and sized (but centered on a grid) hyperrectangles, where the function is zero. The parameters are the dimensionality, the number of grid cells per axis and the fraction of the total grid cells that are chosen; the chosen settings result in randomly sized boxes, whose centers are unique, on a regular grid. All computation was performed using an unoptimized MATLAB implementation executed (CPU only) on a standard desktop machine with a single-core processor. It takes about one second to obtain an estimate in 12D.
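A generator for the HR family might look as follows (a sketch of our reading of the description; parameter names and the range of box sizes are our own choices):

```python
import numpy as np

def make_hr(n, cells, frac, rng):
    """Sketch of a binary hyperrectangle integrand (our reading of the
    family HR): value 1 everywhere except inside boxes of random size
    whose centers lie on a regular grid with `cells` cells per axis."""
    total = cells ** n
    idx = rng.choice(total, size=int(frac * total), replace=False)
    centers = np.stack(np.unravel_index(idx, (cells,) * n), axis=1)
    centers = (centers + 0.5) / cells                        # unique grid centers
    half = rng.uniform(0.1, 0.5, size=centers.shape) / cells  # random half-widths

    def f(x):
        inside = np.all(np.abs(x - centers) <= half, axis=1)
        return 0.0 if inside.any() else 1.0
    return f
```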
Convergence tests We averaged the relative root mean-squared error (RRMSE) of our sample estimates across several randomly generated integrands from each family. We use RRMSE since each instance of the integrand has a different reference integral. We plotted the mean and standard deviation (error bars) of RRMSE against the number of training samples N on a log-log scale in 2D (fig. 5, top row) and 5D (fig. 6, top row). Error bars depict variability across integrands. We also plotted the relative variances of the estimator (sample variance divided by the square of the reference value) in 2D (fig. 5, bottom row) and 5D (fig. 6, bottom row). The graphs show the statistics of a standard MC estimator (grey) and a QMC (black) estimator using Halton samples alongside ours (red). Dashed trend lines are also shown for RRMSE (squares for variance).
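The metric can be written compactly (our phrasing of the definition above):

```python
import numpy as np

def rrmse(estimates, reference):
    """Relative root mean-squared error of repeated estimates of an
    integral against its reference value."""
    estimates = np.asarray(estimates, dtype=float)
    return np.sqrt(np.mean((estimates - reference) ** 2)) / abs(reference)
```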
Figure 5: (a) 2D GM; (b) 2D GMD; (c) 2D HR.
Figure 6: (a) 5D GM; (b) 5D GMD; (c) 5D HR.
The plots show that RRMSE consistently decreases as N is increased. Curiously, in 2D, QNET converges faster than MC despite training with random samples. For discontinuous integrands in higher dimensions, the error is higher, as predicted by theory (see sec. 4). For each sample estimator, the statistics of the MC and QMC estimators were calculated using fresh evaluations of the integrand for each trial. Ours, on the other hand, reuses the same evaluations, since different estimates are obtained through stochasticity in training and initialization. This could be relevant when sampling repeatedly is prohibitively costly, or even impossible, as could be the case in some real problems such as those involving large amounts of computation (e.g. astronomy), surveys or clinical trials.
Control variates We devised a family of estimators, CVQNET, parametrized by a fraction of the samples, that use g as a control variate (Owen, 2013, Sec. 8.9) to integrate f rather than as a direct proxy. Given N samples, CVQNET uses the reserved fraction to train g and the remaining samples to integrate the residual via standard MC or QMC. The final estimator is then the sum of the closed-form integral of g and the MC (or QMC) estimator of the residual. When the fraction reserved for integration is zero, CVQNET is equivalent to QNET, and as it tends to one the estimator approaches pure MC (or QMC). The results (fig. 7) show a dramatic reduction in variance for intermediate values of the fraction.
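The control-variate combination can be sketched as follows (a simplified uniform-sampling version with our own names; the closed-form integral of the surrogate is assumed available from eq. 3):

```python
import numpy as np

def cv_estimate(f, g, g_integral, xs):
    """Control-variate estimator sketch: exact integral of the surrogate g
    plus a Monte Carlo estimate of the residual f - g over uniform samples xs."""
    residual = np.array([f(x) - g(x) for x in xs])
    return g_integral + residual.mean()
```

When g tracks f closely the residual has small variance, which is where the reduction reported in fig. 7 comes from; in the extreme case g = f the estimator is exact regardless of the samples.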
Figure 7: (a) 4D HR; (b) 8D HR.
Dimensionality and nature of the integrand Since existing theoretical results (see sec. 4) for the error do not apply to non-bandlimited functions such as HR, we provide empirical results of relative error (fig. 8.a) and variance (fig. 8.b) across dimensions and for different volumes of training data (different curves). Although there is an increase in error at moderate dimensionality (consistent with the convergence plots), reassuringly, this effect diminishes for the higher dimensions even for this challenging class of integrand. We also investigated the effects (on error) of increasing variation of the integrand via input parameters in GMD (number of components) and HR (number of hyperrectangles). Figure 9 visualizes the relative error and variance of our estimator with different numbers of neurons (different lines). The results were obtained by averaging across randomly generated integrands from each family, and the variation across the integrands is visualized as error bars. For GMD, the error increases slightly with more components, but for HR it remains constant. Curiously, in both cases the variance is higher when using more neurons. We found the cause to be 'ringing' artifacts in g at the discontinuities when m is large.
4 Discussion
Error The approximation error of sigmoidal universal approximator networks Barron (1993); Shaham et al. (2015) is bounded by a quantity proportional to the first moment of the Fourier spectrum of the target function and inversely proportional to the square root of the number of neurons, provided the target is sufficiently differentiable. From this, the error of our estimator can be bounded as O(C_f m^{−1/2}), where C_f depends on the Fourier spectrum of f (derived in App. A.2). Compared to reconstruction, numerical integration requires fewer samples Subr and Kautz (2013); Belcour et al. (2013) (below the Nyquist rate).

Domain and range adjustment Our formula is derived assuming a normalized domain and range for the approximator network. However, our test integrands are defined over arbitrary hyperrectangles with unknown range. It is easy to show (using elementary variable substitution) that an estimate in the normalized space can be transformed to an estimate in an arbitrary domain by rescaling with the range of the function and the volume of the domain.
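The variable substitution can be written out explicitly; the following is one consistent reconstruction, where the bounds [l_j, u_j], the volume V and the range bounds f_min, f_max are the quantities assumed in the transformation described above:

```latex
% Domain: x_j = l_j + (u_j - l_j) t_j maps [0,1]^n onto the hyperrectangle
\int_{\Omega} f(\mathbf{x})\,\mathrm{d}\mathbf{x}
  = V \int_{[0,1]^n} f\big(\mathbf{x}(\mathbf{t})\big)\,\mathrm{d}\mathbf{t},
  \qquad V = \prod_{j=1}^{n} (u_j - l_j).
% Range: if the network fits the normalized values
% \tilde{f} = (f - f_{\min})/(f_{\max} - f_{\min}) with integral \hat{I}, then
\int_{\Omega} f(\mathbf{x})\,\mathrm{d}\mathbf{x}
  \;\approx\; V \left[ (f_{\max} - f_{\min})\,\hat{I} + f_{\min} \right].
```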
Loss functions and training We experimented with different loss functions such as mean-squared error, mean absolute error and cross-entropy to arrive at the surrogate, but did not notice significant differences in the convergence rate. We also experimented with standard training algorithms and found that optimization using the Levenberg-Marquardt method and Bayesian regularization performs better than conjugate-gradient-based methods, particularly for discontinuous integrands.
Figure 8: (a) relative RMSE (HR); (b) relative variance (HR).
Sampling Different estimates may be drawn from the same samples by retraining the approximator network with different initialization conditions or through the stochasticity in the optimization procedure. Although it is also possible to vary the samples across training runs (as with MC or QMC), we found that this did not impact estimation error, particularly for larger training data. However, different sampling arrangements (correlations) with the same distribution were useful, particularly when combined with control variates, e.g. Halton samples instead of random ones. Intuitively, this improves the surrogate, thereby also improving the estimation error. This relationship requires further analysis and is a promising direction for future research.
Figure 9: (a) Integrand: GMD; (b) Integrand: HR.
Limitations and future work In our experiments, the choice of the number of neurons m had little impact on error for low training volumes. For large N, its relevance depended on the nature of the integrand. For smooth integrands (GM), increasing m reduced error due to the improved approximation. For discontinuous integrands (GMD, HR), increasing m arbitrarily was counterproductive (see fig. 9) due to increased error introduced by 'ringing' artifacts at discontinuities. This is an inherent limitation of shallow sigmoidal approximator networks, which could be addressed by bounding width rather than depth Fan et al. (2018) and adding layers. We strike a general compromise for all experiments by choosing m as a function of the training volume, subject to a minimum number of neurons. Further investigation is required to extend our formula to multiple sigmoidal layers and to develop fast approximations of our exact formula.
5 Conclusion
We derived a closed-form formula for the integral of a function represented by a one-layer feedforward neural network whose activation function is the logistic function. Our formula can be evaluated directly, but also using a fixed-weight feedforward neural network with one hidden layer. We call this network a QNET, and derived its weights and activation function. QNETs may be used to marginalize along input dimensions to yield closed-form functions of the remaining dimensions. We presented experimental evidence showing that the error drops consistently (as the number of training samples for the approximator is increased) for challenging integrands, along with an investigation of the effect of various parameters on the approximation.
Appendix A Appendix
A.1 Derivation
To derive the formula for the multidimensional case, we start with the simple case when n = 1 and m = 1, then develop intuition for n = 2 and finally generalize the result to n-dimensional integrands with m neurons in the hidden layer.
When n = 1, the weight matrix in equation 2 reduces to a vector. Further, when m = 1 (single neuron), all learned parameters reduce to scalars, so that g(x) = c σ(wx + b) + d, which can be integrated analytically to yield a formula for the integral of the approximate function:
$$\hat{I} \;=\; \int_{0}^{1} c\,\sigma(wx+b)\,\mathrm{d}x \;+\; d \tag{7}$$
$$\;=\; \frac{c}{w}\left[\,\mathrm{sp}(w+b) - \mathrm{sp}(b)\,\right] \;+\; d \tag{8}$$
which consists of softplus terms, sp(z) = ln(1 + e^z). Since the output layer is a linear combination of the activations of the neurons in the hidden layer, when m > 1 the right-hand side becomes the weighted sum:
$$\hat{I} \;=\; \sum_{i=1}^{m} \frac{c^{(i)}}{w^{(i)}}\left[\,\mathrm{sp}\!\left(w^{(i)}+b^{(i)}\right) - \mathrm{sp}\!\left(b^{(i)}\right)\right] \;+\; d \tag{9}$$
where the superscript denotes the i-th element of a vector. Since eq. 9 is an exact integral of g, it follows that if the approximation is accurate in the limit of training data, then the estimator is unbiased. Since independent training runs of the approximator network result in different parameters, secondary estimates of the integral may be obtained by averaging over training runs with random initialization.
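The 1D formula translates directly into code (a sketch with a numerically stable softplus; names are ours):

```python
import numpy as np

def softplus(z):
    """Numerically stable sp(z) = ln(1 + e^z)."""
    return np.log1p(np.exp(-np.abs(z))) + np.maximum(z, 0.0)

def integral_1d(w, b, c, d):
    """Closed-form integral over [0,1] of sum_i c_i * sigmoid(w_i x + b_i) + d,
    matching the softplus form of the 1D derivation (nonzero w_i assumed)."""
    w, b, c = map(np.asarray, (w, b, c))
    return np.sum(c / w * (softplus(w + b) - softplus(b))) + d
```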
When n = 2 and m = 1, the approximator network can be written as g(x, y) = c σ(w₁x + w₂y + b) + d, where w₁, w₂, b, c and d are scalars. Proceeding similarly to the 1D case, we can first integrate over one variable (x or, equally, y); the intermediate formula, a function of the other variable, requires a second integration step:
$$\hat{I} \;=\; \frac{c}{w_1 w_2}\left[\,\mathrm{Li}_2\!\left(-e^{w_1+b}\right) + \mathrm{Li}_2\!\left(-e^{w_2+b}\right) - \mathrm{Li}_2\!\left(-e^{b}\right) - \mathrm{Li}_2\!\left(-e^{w_1+w_2+b}\right)\right] \;+\; d \tag{10}$$
Here Li_s represents the polylogarithm function of order s, and Li_s(−e^z) is real when z is real. An important property of this function is that the derivative of Li_s(−e^z) with respect to z is Li_{s−1}(−e^z), which enables the general formula for the n-dimensional case with m = 1 hidden neuron:

$$\hat{I} \;=\; \frac{c\,(-1)^{n+1}}{\prod_{j=1}^{n} w_j} \sum_{r=1}^{2^n} (-1)^{\left\|S^{(r)}\right\|_1}\, \mathrm{Li}_n\!\left(-e^{\,S^{(r)}\mathbf{w}^{\top} + b}\right) \;+\; d$$
where the rows of the matrix S, with elements in {0, 1}, represent all 2^n combinations of signs, one entry for each element of the weight vector. The contribution of each term is either positive or negative depending on whether there is an even (resp. odd) number of ones in the row S^{(r)}.
A.2 Error
When g is trained to approximate f, the approximation error of shallow feedforward networks Barron (1993); Shaham et al. (2015) is bounded in the L₂ sense by a constant C_f divided by the square root of the number of neurons m, where C_f is proportional to the first moment of the Fourier spectrum of f, assumed sufficiently differentiable. But
$$\left| I - \hat{I} \right| \;=\; \left| \int_{[0,1]^n} \left( f(\mathbf{x}) - g(\mathbf{x}) \right) \mathrm{d}\mathbf{x} \right| \tag{11}$$
$$\leq\; \int_{[0,1]^n} \left| f(\mathbf{x}) - g(\mathbf{x}) \right| \mathrm{d}\mathbf{x} \tag{12}$$
$$\leq\; \left( \int_{[0,1]^n} \left( f(\mathbf{x}) - g(\mathbf{x}) \right)^2 \mathrm{d}\mathbf{x} \right)^{1/2} \tag{15}$$
$$\in\; O\!\left( C_f\, m^{-1/2} \right) \tag{16}$$
The second inequality follows from the Cauchy-Schwarz inequality, since the domain has unit volume. This gives us the result in the paper: the integration error is O(C_f m^{−1/2}).
References
Acerbi (2018). Variational Bayesian Monte Carlo. In Advances in Neural Information Processing Systems 31, pp. 8223–8233.

Barron (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory 39(3), pp. 930–945.

Belcour et al. (2013). 5D covariance tracing for efficient defocus and motion blur. ACM Trans. Graph. 32(3).

Brass and Petras. Quadrature Theory: The Theory of Numerical Integration on a Compact Interval. Mathematical Surveys and Monographs, American Mathematical Society.

Briol et al. (2015). Frank-Wolfe Bayesian quadrature: probabilistic integration with theoretical guarantees. In Advances in Neural Information Processing Systems 28, pp. 1162–1170.

Cybenko (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 2(4), pp. 303–314.

Dereventsov et al. (2019). Neural network integral representations with the ReLU activation function. arXiv:1910.02743.

Euler (1768). Institutionum calculi integralis volumen primum, Vol. 1.

Fan et al. (2018). Slim, sparse, and shortcut networks. arXiv:1811.09003.

Ghahramani and Rasmussen (2003). Bayesian Monte Carlo. In Advances in Neural Information Processing Systems 15, pp. 505–512.

Grohs et al. (2019). Deep neural network approximations for Monte Carlo algorithms. arXiv:1908.10828.

Gunter et al. (2014). Sampling for inference in probabilistic models with fast Bayesian quadrature. In Proceedings of the 27th International Conference on Neural Information Processing Systems, NIPS'14, pp. 2789–2797.

Hornik (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks 4(2), pp. 251–257.

Kanagawa and Hennig (2019). Convergence guarantees for adaptive Bayesian quadrature methods. In Advances in Neural Information Processing Systems 32, pp. 6237–6248.

Kandasamy et al. (2015). Bayesian active learning for posterior estimation. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI'15, pp. 3605–3611.

Karaletsos and Bui (2020). Hierarchical Gaussian process priors for Bayesian neural network weights. arXiv:2002.04033.

Keshavarzzadeh et al. (2018). Numerical integration in multiple dimensions with designed quadrature. CoRR abs/1804.06501.

Lewin (1981). Polylogarithms and Associated Functions. North Holland.

Lu et al. (2017). The expressive power of neural networks: a view from the width. arXiv:1709.02540.

Metropolis and Ulam (1949). The Monte Carlo method. Journal of the American Statistical Association 44(247), pp. 335–341.

Niederreiter. Random Number Generation and Quasi-Monte Carlo Methods. SIAM.

Niederreiter (1978). Quasi-Monte Carlo methods and pseudo-random numbers. Bull. Amer. Math. Soc. 84(6), pp. 957–1041.

Osborne et al. (2012). Active learning of model evidence using Bayesian quadrature. In Advances in Neural Information Processing Systems 25, pp. 46–54.

Owen (2013). Monte Carlo Theory, Methods and Examples.

Shaham et al. (2015). Provable approximation properties for deep neural networks. arXiv:1509.07385.

Shahriari et al. (2016). Taking the human out of the loop: a review of Bayesian optimization. Proceedings of the IEEE 104(1), pp. 148–175.

Subr and Kautz (2013). Fourier analysis of stochastic sampling strategies for assessing bias and variance in integration. ACM Trans. Graph. 32(4).

Wan et al. (2018). Neural control variates for variance reduction. arXiv:1806.00159.

Wang and Li (2018). Adaptive Gaussian process approximation for Bayesian inference with expensive likelihood functions. Neural Computation 30(11), pp. 3072–3094.