1 Introduction and background
Artificial neural networks are versatile representations of functions that have led to groundbreaking results in a variety of learning problems such as function approximation (regression), classification and dimensionality reduction. Deep neural networks have been shown to learn effective approximations of difficult functions (high-dimensional, with discontinuities, etc.) such as digital images defined over the space spanned by pixels. While there are numerous methods that focus on learning these representations, the problem of estimating integrals of the learned function remains open.
In this paper, we derive a closed-form formula to calculate the integral of a specific but well-known shallow feed-forward network (SFFN) $g_\theta$ containing one hidden layer with a sigmoid activation function. This textbook case is an example of what are called universal approximator networks Cybenko (1989); Hornik (1991); Lu et al. (2017), since it can approximate any continuous function accurately Shaham et al. (2015). Although learning such an approximation of a function $f$ from point estimates $(\mathbf{x}_i, f(\mathbf{x}_i))$ might be slow, or require a large number of neurons to achieve sufficient accuracy, its simplicity and universality make it an attractive case to study. We derive an exact formula for the value of the integral of $g_\theta$ as a function of the parameters or weights learned using a standard training procedure.
A formula for integrating $g_\theta$ exactly enhances its appeal as a proxy or surrogate function for $f$ in applications where the latter needs to be numerically integrated. We test the utility of our formula by estimating multidimensional integrals

$$I \;=\; \int_{D} f(\mathbf{x})\, p(\mathbf{x})\,\mathrm{d}\mathbf{x} \qquad(1)$$

and marginals $\bar{f}(\mathbf{x}^{m+1:d}) = \int_{D_m} f(\mathbf{x})\, p_m(\mathbf{x}^{1:m})\,\mathrm{d}\mathbf{x}^{1:m}$, over hyper-rectangular domains $D$ ($d$-dimensional) and $D_m$ ($m$-dimensional), using $N$ samples $(\mathbf{x}_i, f(\mathbf{x}_i))$. Here $p$ and $p_m$ describe the distributions from which the samples are drawn in each case. The superscript in $\mathbf{x}^{1:m}$ denotes selected elements of $\mathbf{x}$, so $\mathbf{x}^{1:m}$ represents a vector with the first $m$ components (without loss of generality) of $\mathbf{x}$. We focus on the case when there is no informative prior on the integrand and with $p$ a constant. These problems are abstract representatives of the applications of numerical integration in machine learning, e.g. normalization of multidimensional probability densities (frequentist statistics) or likelihood functions, computation of marginal likelihoods (model evidence) which forms the crux of Bayesian approaches, etc. An advantage of the surrogate function over averaging point samples (à la Monte Carlo integration) is that it enables functional representations of marginal distributions. Another advantage is that $g_\theta$ might be used as a control variate (Owen, 2013, Sec. 8.9); Wan et al. (2018), as shown in section 3.
Despite the mature literature on numerical quadrature, Monte Carlo integration and Markov Chain Monte Carlo methods, the general procedure to estimate integrals of functions using point samples remains frustratingly inefficient even in moderately many (about ten) dimensions. Monte Carlo methods operate by expressing integrals as expectations, which can be estimated via random simulation Metropolis and Ulam (1949). Several variants of this method address its slow convergence and strike different compromises between bias (accuracy) and variance (precision). We refer the reader to standard texts on the subject Owen (2013). Quasi-Monte Carlo (QMC) methods replace stochasticity with carefully designed, deterministic samples, which improves convergence dramatically when the integrands are smooth Niederreiter (1978) or of moderate dimensionality (below ten). Deep neural networks can mimic MC solutions to certain partial differential equations Grohs et al. (2019), sidestepping the curse of dimensionality.
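As a point of reference, the basic Monte Carlo procedure described above fits in a few lines (our own minimal sketch, not code from this paper):

```python
import random

random.seed(1)

def mc_integrate(f, d, n):
    # I = E[f(U)] for U uniform on [0,1]^d, estimated by a sample mean;
    # the RMSE decays as O(n^{-1/2}) regardless of the dimension d.
    total = 0.0
    for _ in range(n):
        total += f([random.random() for _ in range(d)])
    return total / n

# Example: integrate f(x, y) = x * y over the unit square (true value 1/4).
estimate = mc_integrate(lambda x: x[0] * x[1], d=2, n=200_000)
```

The slow $O(n^{-1/2})$ convergence of this sample mean is precisely what the variants discussed above try to improve.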
Bayesian methods use probabilistic model-based surrogates to improve sample-efficiency for expensive integrands Ghahramani and Rasmussen (2003) or to guide active sampling (adaptive Bayesian quadrature Osborne et al. (2012); Kanagawa and Hennig (2019), Bayesian optimization Shahriari et al. (2016), etc.). This general approach can be used to estimate the marginal likelihood Briol et al. (2015); Gunter et al. (2014), to approximate the posterior Kandasamy et al. (2015); Wang and Li (2018), and to simultaneously infer both Acerbi (2018). These works impose a Gaussian Process (GP) prior on the integrand and focus on cases where analytical formulae can be derived for the expectation and uncertainty of the statistical surrogate. The non-parametric nature of the GP can affect scalability, and the need to explicitly evaluate the sampling probability sometimes precludes the use of these methods. The advantages of these approaches are that they inherently enable reasoning about uncertainty and that they serve as generative models. Hybrid methods, such as using GPs to model weights Karaletsos and Bui (2020), investigate calibrated reasoning about uncertainty using probabilistic neural networks. Integral representations Dereventsov et al. (2019) use an analysis of continuous distributions, for a particular target function, from which instances of shallow neural networks can be sampled.
We test the utility of a neural network as a surrogate representation for numerical integration. We identify that the integral of a sigmoidal universal approximator function can be expressed in closed form, and we derive its formula. Our result can be viewed as parametric in the sense that, for a fixed architecture, the number of parameters (weights and biases) learned is independent of the amount of training data. Once learned, a parametric formula can be used to evaluate the integral or its marginals. On the other hand, it does not encode distributional information or serve as a generative model.
Contributions The contributions of this paper are:
- a formula to estimate multidimensional integrals of $g_\theta$, a neural-network function representation;
- an interpretation of the formula as a neural network with one hidden layer, which we call a Q-NET;
- and a parametric formula for marginalization of $g_\theta$ along subsets of input dimensions.
We test the use of the formula for approximate numerical integration using smooth and discontinuous functions.
2 Numerical integration of a shallow feed-forward network
2.1 Notation

We denote row and column vectors with boldface characters (e.g. $\boldsymbol{\omega}$, $\mathbf{x}$, $\mathbf{b}$) and capital letters to represent matrices (e.g. $W$). (We do not assume that all vectors are column vectors, so that the formulae in the paper match the output weights of neural network libraries with no extra transposes necessary for implementation.) Superscripts denote selected elements of a vector or matrix, so $\mathbf{x}^{(j)}$ and $W^{(i)}$ represent the $j$th element of $\mathbf{x}$ and the $i$th row of $W$ respectively. For convenience, we will operate in a normalised domain $[0,1]^d$, as is commonly the case with inputs of neural networks. General hyper-rectangular domains may be handled by scaling the dimensions independently (see Sec. 4). We use $g_\theta$ to represent an approximation of the function $f$ obtained by training a shallow feedforward neural network with one hidden layer of $k$ neurons and a linear output layer using $N$ samples $(\mathbf{x}_i, f(\mathbf{x}_i))$. For convenience, we use $\theta$ to collectively encode all learnable parameters of the approximator network: a matrix $W \in \mathbb{R}^{k \times d}$, a $k$-dimensional row vector $\boldsymbol{\omega}$, a $k$-dimensional column vector $\mathbf{b}$ and a real number $b_0$. The approximate function is

$$g_\theta(\mathbf{x}) \;=\; \boldsymbol{\omega}\,\sigma\!\left(W \mathbf{x}^{\top} + \mathbf{b}\right) + b_0, \qquad(2)$$

where we define $\sigma$ to be the logistic function that operates independently on each of the elements of the input: $\sigma(\mathbf{z})^{(i)} = 1/(1 + e^{-\mathbf{z}^{(i)}})$. In this case, we derive a closed-form formula that exactly computes $Q = \int_{[0,1]^d} g_\theta(\mathbf{x})\,\mathrm{d}\mathbf{x}$ as an approximation to $I$ (eq. 1) and show that the formula can be adapted to produce a closed-form representation of its marginals (eq. 1).
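For concreteness, the forward pass of such a one-hidden-layer approximator can be sketched as follows (a minimal NumPy illustration of this architecture; the variable names are ours):

```python
import numpy as np

def sigmoid(z):
    # Logistic activation, applied elementwise.
    return 1.0 / (1.0 + np.exp(-z))

def sffn(x, W, b, w_out, b_out):
    """One-hidden-layer approximator: w_out . sigmoid(W x + b) + b_out.

    x: (d,) input; W: (k, d) hidden weights; b: (k,) hidden biases;
    w_out: (k,) linear output weights; b_out: scalar output bias.
    """
    return float(w_out @ sigmoid(W @ x + b) + b_out)

# Tiny example with k = 2 hidden neurons in d = 1 dimension.
W = np.array([[4.0], [-4.0]])
b = np.array([-2.0, 2.0])
value = sffn(np.array([0.5]), W, b, np.array([1.0, 1.0]), 0.0)
```

Training such a network (by any standard regression procedure) produces the parameters that the closed-form integration formula below consumes.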
2.2 The formula
Our formula for the $d$-D integral of $g_\theta$ using $k$ neurons is (see App. A.1)

$$Q \;=\; \int_{[0,1]^d} g_\theta(\mathbf{x})\,\mathrm{d}\mathbf{x} \;=\; b_0 + \boldsymbol{\omega}\,\mathbf{q}, \qquad(3)$$

$$\mathbf{q}^{(i)} \;=\; \frac{-1}{\prod_{j=1}^{d} W^{(ij)}}\;\sum_{r=1}^{2^d} (-1)^{\pi(r)}\;\mathrm{Li}_d\!\left(-e^{\,S^{(r)}\left(W^{(i)}\right)^{\top} + \mathbf{b}^{(i)}}\right). \qquad(4)$$

Here $\mathrm{Li}_s$ represents the polylogarithm Euler (1768); Lewin (1981) function of order $s$, and $S$ is a ‘sign’ matrix with elements in $\{0,1\}$ whose $2^d$ rows represent all combinations of integration limits. The contribution of each term is positive (or negative) depending on $\pi(r)$, which is even (resp. odd) when there is an even (resp. odd) number of $0$s in the row $S^{(r)}$. $\mathbf{q}$ is a column vector whose elements correspond to the outputs of each of the $k$ neurons in the hidden layer.
The computational complexity of evaluating the formula is $O(k\,2^d)$ and the memory complexity is $O(2^d)$, since the rows of $S$ can be binary encoded with length $d$. While the computation time is independent of the number of samples $N$ of the integrand $f$, increasing $N$ improves $g_\theta$ as an approximation of $f$.
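One way to enumerate the $2^d$ rows compactly, as the binary encoding above suggests, is to read each row off the binary expansion of its index (a sketch under our own conventions, with rows in $\{0,1\}^d$ and term signs given by the parity of zeros):

```python
import numpy as np

def limit_matrix(d):
    # Rows enumerate all 2**d combinations of integration limits in {0,1}^d.
    # Row r is the binary expansion of r, so storing the row index alone
    # suffices: O(2**d) rows of d bits each, generated on the fly.
    r = np.arange(2 ** d)
    return (r[:, None] >> np.arange(d)[None, :]) & 1

S = limit_matrix(2)
# The parity of 0s in each row fixes the sign of that row's term.
parity = (S == 0).sum(axis=1) % 2
```

For $d=2$ this yields the four corner combinations of the unit square, each carrying a $\pm 1$ sign.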
2.3 Q-NET representation
Although the calculation can be performed explicitly as in equations 3 and 4, we observe that the expression within the summation term may itself be represented using a feedforward neural network, with one hidden layer containing $2^d$ neurons, which we call a Q-NET. The parameters (weights, biases and activation function) of the Q-NET are fixed (not learned) as a function of $d$, and independent of the parameters $\theta$ of $g_\theta$. The input to the Q-NET is a concatenation of a row of the weight matrix and the corresponding bias term: $[W^{(i)}\; \mathbf{b}^{(i)}]$. The first layer is defined by the weight matrix $S$ concatenated with a column of $1$s, $[S\;\mathbf{1}]$, and zero bias. The activation function for the hidden layer is $\phi(t) = \mathrm{Li}_d(-e^{t})$, and the output is a linear combination, with zero bias, whose weights are $+1$ for those neurons whose row of $S$ contains an even number of $0$s and $-1$ otherwise. Thus, the output of the Q-NET is $\mathcal{Q}_d([W^{(i)}\;\mathbf{b}^{(i)}]) = \sum_{r=1}^{2^d} (-1)^{\pi(r)}\,\mathrm{Li}_d(-e^{\,S^{(r)}(W^{(i)})^{\top} + \mathbf{b}^{(i)}})$. The subscript $d$ serves as a reminder that the Q-NET is fully specified by the dimensionality $d$. Now eq. 4 simplifies to

$$\mathbf{q}^{(i)} \;=\; \frac{-1}{\prod_{j=1}^{d} W^{(ij)}}\;\mathcal{Q}_d\!\left(\left[W^{(i)}\;\mathbf{b}^{(i)}\right]\right), \qquad(5)$$

so that the $i$th element of the column vector $\mathbf{q}$ is calculated using a single evaluation of the Q-NET. Figure 2 provides an illustrative summary of how Q-NETs may be constructed and used.
The output of the Q-NET (eq. 5) is identical to the formula in eq. 4. While it might be possible to construct a single network whose output is $Q$, this would be complicated by the denominator $\prod_j W^{(ij)}$ in eq. 5. Although the Q-NET does not have any learnable parameters, it is a useful abstraction of eq. 4 that enables an elegant representation of integrals along sub-dimensions (marginalization).
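As a sanity check of this kind of closed form, the single-neuron case in 2D can be verified against brute-force quadrature. The sketch below is our own re-derivation over $[0,1]^2$ (integrating the sigmoid twice, $\sigma \to -\mathrm{Li}_1 \to -\mathrm{Li}_2$), with a truncated-series polylogarithm that is valid because every exponent used is negative; it is not code from the paper:

```python
import math

def polylog(s, z, terms=400):
    # Truncated series Li_s(z) = sum_{k>=1} z^k / k^s, valid for |z| < 1.
    return sum(z ** k / k ** s for k in range(1, terms + 1))

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def integral_2d(w1, w2, b):
    # Integral over [0,1]^2 of sigmoid(w1*x1 + w2*x2 + b): sum the corner
    # terms with signs given by the parity of 0s in each limit combination.
    total = 0.0
    for u1 in (0, 1):
        for u2 in (0, 1):
            sign = 1 if (u1, u2).count(0) % 2 == 0 else -1
            total += sign * polylog(2, -math.exp(u1 * w1 + u2 * w2 + b))
    return -total / (w1 * w2)

# Brute-force midpoint-rule reference on a 300 x 300 grid.
w1, w2, b, n = 1.0, 0.5, -3.0, 300
ref = sum(sigmoid(w1 * (i + 0.5) / n + w2 * (j + 0.5) / n + b)
          for i in range(n) for j in range(n)) / n ** 2
closed = integral_2d(w1, w2, b)
```

The closed form and the quadrature reference agree to several decimal places, while the closed form costs only $2^d$ polylogarithm evaluations.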
2.4 Marginalizing using Q-NETs
We now explain how the first $m$ dimensions of $\mathbf{x}$ may be marginalized. Although we choose the first $m$ adjacent dimensions for convenient exposition, the following procedure holds, without loss of generality, for any subset of the dimensions of $\mathbf{x}$. The result of marginalizing these dimensions is a function of the remaining $d-m$ variables whose formula requires integration over $m$ dimensions. Using Q-NETs, we obtain:

$$\bar{g}\!\left(\mathbf{x}^{m+1:d}\right) \;=\; b_0 + \sum_{i=1}^{k} \boldsymbol{\omega}^{(i)}\,\frac{-1}{\prod_{j=1}^{m} W^{(ij)}}\;\mathcal{Q}_m\!\left(\left[\,W^{(i,1:m)} \;\;\; \mathbf{b}^{(i)} + W^{(i,m+1:d)}\left(\mathbf{x}^{m+1:d}\right)^{\top}\right]\right), \qquad(6)$$
where the only differences from equation 5 are: 1) that the product runs over the $m$ marginalized dimensions on the rhs; and 2) that the input to the Q-NET is $[\,W^{(i,1:m)}\;\; \mathbf{b}^{(i)} + W^{(i,m+1:d)}(\mathbf{x}^{m+1:d})^{\top}]$ instead of $[W^{(i)}\;\mathbf{b}^{(i)}]$. In other words, the weighted non-marginalized variables are added to the last element of the input. Thus, by modifying the input and evaluating a Q-NET parameterized by the reduced dimensionality $m$, we can approximate marginal distributions. Our definition of Q-NETs is motivated by this ability to elegantly obtain functional approximations of marginal distributions from point-sampled functions. See figure 3 for an illustrative summary and figure 1 for an example. The example shows the result of integrating a 3D function (a Gaussian mixture) using a network trained on point samples. Also shown are the resulting marginals of $g_\theta$ in 2D and 1D for different permutations of the components of $\mathbf{x}$. The figure drops subscripts on $g_\theta$ for brevity.
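The "fold the weighted remaining variables into the bias" step can be illustrated for a single neuron in 2D (our own sketch over $[0,1]^2$; the closed-form slice is checked against brute-force integration):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def softplus(t):
    return math.log1p(math.exp(t))

def marginalize_x1(w1, w2, b):
    """h(x2) = integral over x1 in [0,1] of sigmoid(w1*x1 + w2*x2 + b).

    Marginalizing x1 amounts to folding the weighted remaining variable
    (w2 * x2) into the bias and applying the one-dimensional closed form.
    """
    def h(x2):
        c = b + w2 * x2                       # absorbed into the bias
        return (softplus(w1 + c) - softplus(c)) / w1
    return h

h = marginalize_x1(1.5, -0.7, 0.2)

# Midpoint-rule check of the slice at x2 = 0.3.
n = 20000
num = sum(sigmoid(1.5 * (i + 0.5) / n + -0.7 * 0.3 + 0.2) for i in range(n)) / n
```

The returned `h` is itself a closed-form function of the remaining variable, which is exactly what makes functional marginals possible.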
3 Empirical evaluation
The empirical results in this section demonstrate that: 1) error consistently diminishes with increased training samples for moderate dimensionality; 2) the approach is applicable to challenging classes of integrands in the absence of any prior; and 3) $g_\theta$ can be used as a control variate, which improves on standard MC and QMC on discontinuous integrands in as few as 4 dimensions.
[Figure 4: (a) Gaussian Mixtures (GM); (b) Discontinuous GM (GMD); (c) Binary Hyperrectangles (HR)]
We evaluate our estimator using $N$ samples from three families of integrands. The first, GM, is the probability density function (pdf) of a mixture of Gaussians parametrized by the dimensionality $d$ and the number of components in the mixture. The covariances are generated randomly, in a way that the Gaussians become narrower (peakier) as the number of components is increased. Figure 4a visualizes two 2D examples with different numbers of components. The second class, GMD, is the same as GM but with discontinuities introduced. We choose a random point $\mathbf{c}$ in the domain and zero out the pdf of GM at $\mathbf{x}$ if there exists a dimension $j$ such that $\mathbf{x}^{(j)} < \mathbf{c}^{(j)}$. Everywhere else it is identical to GM (see fig. 4b). The third family of integrands, HR, is a binary function that equals one everywhere except within a set of randomly located and sized (but centered on a grid) hyper-rectangles, where the function is zero. The parameters are the dimensionality, the number of grid cells per axis, and the fraction of the total grid cells that are chosen; together these determine the number of randomly sized boxes, whose centers are unique, on the grid. All computation was performed using an unoptimized MATLAB implementation and executed (CPU only) on a standard desktop machine with a single-core processor. It takes about one second to obtain an estimate in 12D.
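A hypothetical re-implementation of the HR family, following the description above (details such as the half-width distribution are our own assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_hr(d, cells, frac):
    """Binary HR-style integrand: 1 everywhere except inside frac * cells**d
    randomly sized boxes whose unique centers lie on a regular grid.
    Sketch from the paper's description, not the authors' code.
    """
    n_boxes = int(frac * cells ** d)
    idx = rng.choice(cells ** d, size=n_boxes, replace=False)        # unique cells
    centers = np.stack(np.unravel_index(idx, (cells,) * d), axis=1)  # grid coords
    centers = (centers + 0.5) / cells                                # into [0,1]^d
    half = rng.uniform(0, 0.5 / cells, size=(n_boxes, d))            # half-widths
    def f(x):
        inside = np.all(np.abs(x - centers) <= half, axis=1)
        return 0.0 if inside.any() else 1.0
    return f

f = make_hr(d=2, cells=4, frac=0.25)
```

Because the boxes never overlap (one per grid cell), the reference integral of such a function is available analytically, which is convenient for convergence studies.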
Convergence tests We averaged the relative root mean-squared error (RRMSE) of our $N$-sample estimates over repeated runs across randomly generated integrands from each family. We use RRMSE since each instance of the integrand has a different reference integral. We plotted the mean and standard deviation (error bars) of RRMSE against $N$ on a log-log scale for 2D integrands (fig. 5 top row) and 5D integrands (fig. 6 top row). Error-bars depict variability across integrands. We also plotted the relative variances of the estimator (sample variance divided by the square of the reference value) for 2D (fig. 5 bottom row) and 5D (fig. 6 bottom row). The graphs show the statistics of a standard MC estimator (grey) and a QMC (black) estimator using Halton samples alongside ours (red). Dashed trend lines are also shown for RRMSE (squares for variance).
[Figure 5: (a) 2D GM; (b) 2D GMD; (c) 2D HR]
[Figure 6: (a) 5D GM; (b) 5D GMD; (c) 5D HR]
The plots show that RRMSE consistently decreases as $N$ is increased. Curiously, in the 2D cases, Q-NET converges faster than MC despite training with random samples. For discontinuous integrands in higher dimensions, our estimator incurs higher error, as predicted by theory (see sec. 4). For each $N$-sample estimator, the statistics of the MC and QMC estimators were calculated using fresh evaluations of the integrand for every run. On the other hand, ours requires only $N$ evaluations in total, since different estimates are obtained through stochasticity in training and initialization. This could be relevant when sampling repeatedly is prohibitively costly, or even impossible, as could be the case in some real problems such as those involving large amounts of computation (e.g. astronomy), surveys or clinical trials.
Control variates We devised a family of estimators, CV-Q-NET, parametrized by $\alpha \in [0,1]$, that use $g_\theta$ as a control variate (Owen, 2013, Sec. 8.9) to integrate $f$, rather than as a direct proxy. Given $N$ samples, CV-Q-NET uses $(1-\alpha)N$ samples to train $g_\theta$ and the remaining $\alpha N$ samples to integrate the residual $f - g_\theta$ via standard MC or QMC. The final estimator is then the sum of the closed-form integral of $g_\theta$ and the MC (or QMC) estimate of the residual. When $\alpha = 0$, CV-Q-NET is equivalent to Q-NET, and as $\alpha$ tends to one it approaches pure MC (or QMC). The results (fig. 7) show a dramatic reduction in variance for intermediate values of $\alpha$.
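The control-variate construction can be sketched generically (a toy example with a hand-picked surrogate whose integral is known exactly; it is not the paper's CV-Q-NET, but it shows why the residual estimator has low variance):

```python
import math
import random

random.seed(0)

def f(x):                     # integrand on [0, 1]
    return x * x + 0.05 * math.sin(20 * x)

def g(x):                     # cheap surrogate whose integral is known
    return x * x

G = 1.0 / 3.0                 # exact integral of g over [0, 1]

xs = [random.random() for _ in range(10_000)]
mc = sum(f(x) for x in xs) / len(xs)                 # plain Monte Carlo
cv = G + sum(f(x) - g(x) for x in xs) / len(xs)      # control-variate estimator
true_val = 1.0 / 3.0 + 0.05 * (1.0 - math.cos(20.0)) / 20.0
```

The Monte Carlo part only has to estimate the small residual $f - g$, so its variance is much lower than that of the plain sample mean of $f$.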
[Figure 7: (a) 4D HR; (b) 8D HR]
Dimensionality and nature of the integrand Since existing theoretical results (see sec. 4) for error do not apply to non-bandlimited functions such as HR, we provide empirical results of relative error (fig. 8a) and variance (fig. 8b) across dimensions and for different volumes of training data (different curves). Although there is an increase in error at low dimensions (consistent with the convergence plots), reassuringly, this effect diminishes at higher dimensions even for this challenging class of integrand. We also investigated the effect on error of increasing the variation of $f$ within the domain via the input parameters of GMD (number of components) and HR (number of hyper-rectangles). Figure 9 visualizes the relative error and variance of our estimator using a fixed number of training samples and different numbers of neurons $k$ (different lines). The results were obtained by averaging across randomly generated integrands from each family, and the variation across integrands is visualized as error bars. For GM, the error increases slightly with more components, but for HR it remains constant. Curiously, in both cases the variance is higher when using more neurons. We found the cause to be ‘ringing’ artifacts in $g_\theta$ at the discontinuities when $k$ is large.
4 Discussion

Error bound The approximation error of the trained network is bounded (see App. A.2) by a quantity which is proportional to the first moment of the Fourier spectrum of $f$ and inversely proportional to $\sqrt{k}$ when $f$ is differentiable. From this, the error of our estimator can be bounded as well (derived in App. A.2). Compared to reconstruction, numerical integration requires fewer samples Subr and Kautz (2013); Belcour et al. (2013) (below the Nyquist rate).
Domain and range adjustment Our formula is derived assuming a normalized domain of $[0,1]^d$ and a normalized range of $[0,1]$ for the approximator network. However, our test integrands are defined over arbitrary hyper-rectangular domains with unknown range. It is easy to show (using elementary variable substitution) that an estimate $\hat{Q}$ in the normalized space can be transformed to an estimate in an arbitrary domain for a function with range $[m, M]$ using the formula $V\left(\hat{Q}(M-m) + m\right)$, where $V$ is the volume of the domain.
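The substitution can be sketched as follows (assuming normalization to the unit cube and unit range; the helper name is ours):

```python
def rescale_integral(q_norm, lo, hi, m, M):
    """Map an integral estimate computed on the unit cube, for a function
    normalized to range [0, 1], back to the hyper-rectangle [lo_i, hi_i]
    and range [m, M] via elementary change of variables.
    """
    volume = 1.0
    for l, h in zip(lo, hi):
        volume *= h - l
    return volume * (q_norm * (M - m) + m)

# f(x) = x on [0, 2]: its normalized copy is g(u) = u on [0, 1] with range
# [0, 2], whose integral is 0.5; the original integral is 2.0.
restored = rescale_integral(0.5, lo=[0.0], hi=[2.0], m=0.0, M=2.0)
```

In practice the range $[m, M]$ can be taken from the minimum and maximum of the training samples, at the cost of some clipping error.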
Loss functions and training We experimented with different loss functions, such as cross-entropy, to arrive at the surrogate, but did not notice significant differences in the convergence rate. We also experimented with standard training algorithms and found that optimizing using the Levenberg-Marquardt method and Bayesian regularization performs better than conjugate-gradient-based methods, particularly for discontinuous integrands.
[Figure 8: (a) relative RMSE (HR); (b) relative variance (HR)]
Sampling Different estimates may be drawn from the same samples by either retraining the approximator network with different initialization conditions or through the stochasticity in the optimization procedure. Although it is also possible to vary the samples across training runs (as with MC or QMC), we found that this did not impact estimation error, particularly for larger training data. However, different sampling arrangements (correlations) with the same distribution were useful, particularly when combined with control variates, e.g. Halton samples instead of random samples. Intuitively, this improves $g_\theta$, thereby also improving the estimation error. This relationship requires further analysis and is a promising direction for future research.
[Figure 9: (a) Integrand: GMD; (b) Integrand: HR]
Limitations and future work In our experiments, the choice of $k$ had little impact on error for low training volumes. For large $N$, its relevance depended on the nature of the integrand. For smooth integrands (GM), increasing $k$ reduced error due to the improved approximation. For discontinuous integrands (GMD, HR), increasing $k$ arbitrarily was counterproductive (see fig. 9) due to increased error introduced by ‘ringing’ artifacts at discontinuities. This is an inherent limitation of shallow sigmoidal approximator networks, which could be addressed by bounding width rather than depth Fan et al. (2018) and adding layers. We strike a general compromise for all experiments, choosing $k$ as a function of $N$ with a fixed minimum number of neurons. Further investigation is required to extend our formula to multiple sigmoidal layers and to develop fast approximations of our exact formula.
We derived a formula, in closed form, for the integral of a function represented by a feedforward neural network with one hidden layer whose activation function is the logistic function. Our formula can be evaluated directly, but also via a fixed-weight feedforward neural network with one hidden layer. We call this network a Q-NET, and derived its weights and activation function. Q-NETs may be used to marginalize along input dimensions, yielding closed-form functions of the remaining dimensions. We presented experimental evidence showing that the error drops consistently (as the number of training samples for the approximator is increased) for challenging integrands, along with an investigation of the effect of various parameters on the approximation.
Appendix A
A.1 The integration formula
To derive the formula for the multidimensional case, we start with the simple case when $d=1$ and $k=1$, then develop intuition for $d=2$ and finally generalize the result to $d$-dimensional integrands with $k$ neurons in the hidden layer.
When $d=1$, the matrix $W$ in equation 2 reduces to a $k \times 1$ vector. Further, when $k=1$ (single neuron) all learned parameters reduce to scalars, so that

$$g_\theta(x) \;=\; \omega\,\sigma(wx + b) + b_0,$$

which can be integrated analytically to yield a formula for the integral of the approximate function:

$$Q \;=\; \int_0^1 g_\theta(x)\,\mathrm{d}x \;=\; b_0 + \frac{\omega}{w}\left[\,\mathrm{sp}(w+b) - \mathrm{sp}(b)\,\right],$$

which consists of softplus terms $\mathrm{sp}(t) = \ln(1 + e^{t})$. Since the output layer is a linear combination of the activations of the neurons in the hidden layer, when $k>1$ the rhs is the weighted sum:

$$Q \;=\; b_0 + \sum_{i=1}^{k} \frac{\boldsymbol{\omega}^{(i)}}{W^{(i)}}\left[\,\mathrm{sp}\!\left(W^{(i)} + \mathbf{b}^{(i)}\right) - \mathrm{sp}\!\left(\mathbf{b}^{(i)}\right)\right],$$
where the superscript $(i)$ is used to denote the $i$th element of a vector. Since $Q$ is an exact integral of $g_\theta$, it follows that if the approximation is accurate in the limit then the estimator is unbiased, i.e. $g_\theta \to f$ implies $Q \to I$. Since independent training runs of the approximator network result in different $\theta$, secondary estimates of $I$ may be obtained by averaging across training runs obtained with random initialization.
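The 1D closed form above is easy to verify numerically (our own sketch, assuming the normalized domain $[0,1]$):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def softplus(t):
    return math.log1p(math.exp(t))     # sp(t) = log(1 + e^t)

def q_1d(omega, w, b, b0):
    # Closed-form integral over [0, 1] of  omega * sigmoid(w*x + b) + b0.
    return b0 + (omega / w) * (softplus(w + b) - softplus(b))

# Midpoint-rule reference with 100k subintervals.
omega, w, b, b0 = 1.3, 3.0, -1.0, 0.25
n = 100_000
num = sum(omega * sigmoid(w * (i + 0.5) / n + b) + b0 for i in range(n)) / n
closed = q_1d(omega, w, b, b0)
```

The two values match to high precision, since the closed form is exact and the midpoint rule converges quadratically for this smooth integrand.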
When $d=2$ and $k=1$, the approximator network can be written as $g_\theta(\mathbf{x}) = \omega\,\sigma(w_1 x_1 + w_2 x_2 + b) + b_0$, where $\mathbf{x} = (x_1, x_2)$ is a 2D vector. Proceeding similarly to the 1D case, we can first integrate over one variable $x_1$ (or equally $x_2$), and the intermediate formula, a function of $x_2$ (resp. $x_1$), requires a second integration step:

$$Q \;=\; b_0 - \frac{\omega}{w_1 w_2}\left[\mathrm{Li}_2\!\left(-e^{w_1+w_2+b}\right) - \mathrm{Li}_2\!\left(-e^{w_1+b}\right) - \mathrm{Li}_2\!\left(-e^{w_2+b}\right) + \mathrm{Li}_2\!\left(-e^{b}\right)\right].$$

Here $\mathrm{Li}_s$ represents the polylogarithm function of order $s$, and $\mathrm{Li}_s(-e^{t})$ is real when $s$ and $t$ are real. An important property of this function is that $\int \mathrm{Li}_s(-e^{wx+c})\,\mathrm{d}x = \frac{1}{w}\,\mathrm{Li}_{s+1}(-e^{wx+c})$, which enables the general formula for the $d$-dimensional case with $k=1$ hidden neuron:
$$Q \;=\; b_0 - \frac{\omega}{\prod_{j=1}^{d} w_j}\,\sum_{r=1}^{2^d} (-1)^{\pi(r)}\,\mathrm{Li}_d\!\left(-e^{\,S^{(r)}\mathbf{w}^{\top} + b}\right),$$

where the rows of the matrix $S \in \{0,1\}^{2^d \times d}$ represent all combinations of integration limits (signs), one for each element in the row vector $\mathbf{w}$. The contribution of each term is either positive ($\pi(r)$ is even) or negative ($\pi(r)$ is odd) depending on whether there is an even (resp. odd) number of $0$s in the row $S^{(r)}$.
A.2 Error bound
When $g_\theta$ is trained to approximate $f$, the approximation error of shallow feedforward networks Barron (1993); Shaham et al. (2015) is bounded as $\|f - g_\theta\| \in O(C_f/\sqrt{k})$, where $C_f$ is the first moment of the Fourier spectrum of the function $f$, which is assumed to be differentiable. But

$$|I - Q| \;=\; \left|\int_{[0,1]^d} \left(f(\mathbf{x}) - g_\theta(\mathbf{x})\right)\mathrm{d}\mathbf{x}\right| \;\le\; \int_{[0,1]^d} \left|f(\mathbf{x}) - g_\theta(\mathbf{x})\right|\mathrm{d}\mathbf{x},$$

which then gives us the result in the paper: $|I - Q| \in O(C_f/\sqrt{k})$.
References

- Acerbi, L. (2018). Variational Bayesian Monte Carlo. In Advances in Neural Information Processing Systems 31, pp. 8223–8233.
- Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory 39(3), pp. 930–945.
- Belcour, L., et al. (2013). 5D covariance tracing for efficient defocus and motion blur. ACM Transactions on Graphics 32(3).
- Brass, H. and Petras, K. Quadrature Theory: The Theory of Numerical Integration on a Compact Interval. Mathematical Surveys and Monographs, American Mathematical Society.
- Briol, F.-X., et al. (2015). Frank-Wolfe Bayesian quadrature: probabilistic integration with theoretical guarantees. In Advances in Neural Information Processing Systems 28, pp. 1162–1170.
- Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 2(4), pp. 303–314.
- Dereventsov, A., et al. (2019). Neural network integral representations with the ReLU activation function.
- Euler, L. (1768). Institutionum calculi integralis volumen primum. Vol. 1.
- Fan, F., et al. (2018). Slim, sparse, and shortcut networks.
- Ghahramani, Z. and Rasmussen, C. E. (2003). Bayesian Monte Carlo. In Advances in Neural Information Processing Systems 15, pp. 505–512.
- Grohs, P., et al. (2019). Deep neural network approximations for Monte Carlo algorithms.
- Gunter, T., et al. (2014). Sampling for inference in probabilistic models with fast Bayesian quadrature. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS'14), Cambridge, MA, USA, pp. 2789–2797.
- Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks 4(2), pp. 251–257.
- Kanagawa, M. and Hennig, P. (2019). Convergence guarantees for adaptive Bayesian quadrature methods. In Advances in Neural Information Processing Systems 32, pp. 6237–6248.
- Kandasamy, K., et al. (2015). Bayesian active learning for posterior estimation. In Proceedings of the 24th International Conference on Artificial Intelligence (IJCAI'15), pp. 3605–3611.
- Karaletsos, T. and Bui, T. D. (2020). Hierarchical Gaussian process priors for Bayesian neural network weights.
- Keshavarzzadeh, V., et al. (2018). Numerical integration in multiple dimensions with designed quadrature. CoRR abs/1804.06501.
- Lewin, L. (1981). Polylogarithms and Associated Functions. North Holland.
- Lu, Z., et al. (2017). The expressive power of neural networks: a view from the width.
- Metropolis, N. and Ulam, S. (1949). The Monte Carlo method. Journal of the American Statistical Association 44(247), pp. 335–341.
- Niederreiter, H. Random Number Generation and Quasi-Monte Carlo Methods.
- Niederreiter, H. (1978). Quasi-Monte Carlo methods and pseudo-random numbers. Bulletin of the American Mathematical Society 84(6), pp. 957–1041.
- Osborne, M. A., et al. (2012). Active learning of model evidence using Bayesian quadrature. In Advances in Neural Information Processing Systems 25, pp. 46–54.
- Owen, A. B. (2013). Monte Carlo Theory, Methods and Examples.
- Shaham, U., et al. (2015). Provable approximation properties for deep neural networks.
- Shahriari, B., et al. (2016). Taking the human out of the loop: a review of Bayesian optimization. Proceedings of the IEEE 104(1), pp. 148–175.
- Subr, K. and Kautz, J. (2013). Fourier analysis of stochastic sampling strategies for assessing bias and variance in integration. ACM Transactions on Graphics 32(4).
- Wan, R., et al. (2018). Neural control variates for variance reduction.
- Wang, H. and Li, J. (2018). Adaptive Gaussian process approximation for Bayesian inference with expensive likelihood functions. Neural Computation 30(11), pp. 3072–3094. PMID: 30216145.