Q-NET: A Formula for Numerical Integration of a Shallow Feed-forward Neural Network

06/25/2020 ∙ by Kartic Subr, et al.

Numerical integration is a computational procedure that is widely encountered across disciplines when reasoning about data. We derive a formula in closed form to calculate the multidimensional integral of functions fw that are representable using a shallow feed-forward neural network with weights w and a sigmoid activation function. We demonstrate its applicability in estimating numerical integration of arbitrary functions f over hyper-rectangular domains in the absence of a prior. To achieve this, we first train the network to learn fw ≈ f using point-samples of the integrand. We then use our formula to calculate the exact integral of the learned function fw. Our formula operates on the weights w of the trained approximator network. We show that this formula can itself be expressed as a shallow feed-forward network, which we call a Q-NET, with w as its inputs. Although the Q-NET does not have any learnable parameters, we use this abstraction to derive a family of elegant parametric formulae that represent the marginal distributions of the input function over arbitrary subsets of input dimensions in functional form. We perform empirical evaluations of Q-NETs for integrating smooth functions as well as functions with discontinuities.




1 Introduction and background

Artificial neural networks are versatile representations of functions that have led to groundbreaking results in a variety of learning problems such as function approximation (regression), classification and dimensionality reduction. Deep neural networks have been shown to learn effective approximations of difficult functions (high-dimensional, with discontinuities, etc.) such as digital images defined over the space spanned by pixels. While there are numerous methods that focus on learning these representations, the problem of estimating integrals of the learned function remains open.

In this paper, we derive a closed-form formula to calculate the integral of a specific but well known shallow feed-forward network (SFFN) containing one hidden layer and a sigmoid activation function. This textbook case is an example of what are called universal approximator networks Cybenko (1989); Hornik (1991); Lu et al. (2017) since it can approximate any continuous function accurately Shaham et al. (2015). Although learning one from point estimates of f might be slow, or require a large number of neurons to achieve a sufficiently accurate approximation fw, its simplicity and universality make it an attractive case to study. We derive an exact formula for the value of the integral of fw as a function of the parameters, or weights w, learned using a standard training procedure.

A formula for integrating fw exactly enhances its appeal as a proxy or surrogate function for f in applications where the latter needs to be numerically integrated. We test the utility of our formula by estimating multidimensional integrals and marginals,

I = ∫_D f(x) p(x) dx and M(x^{1:k}) = ∫_{D_m} f(x) p_m(x) dx^{k+1:d},   (1)

over hyper-rectangular domains D (d-dimensional) and D_m ((d − k)-dimensional) using point samples of f. Here p and p_m describe the distribution from which the samples are drawn in each case, and the superscript in x^{1:k} denotes select elements of x, so x^{1:k} represents a vector with the first k components (without loss of generality) of x. We focus on the case when there is no informative prior on the integrand and with p as a constant. These problems are abstract representatives of the applications of numerical integration in machine learning, e.g. normalization of multidimensional probability densities (frequentist statistics) or likelihood functions (likelihood statistics), computation of marginal likelihoods (model evidence) which forms the crux of Bayesian approaches, etc. An advantage of the surrogate function over averaging of point samples (à la Monte Carlo integration) is that it enables functional representations of marginal distributions. Another advantage is that fw might be used as a control variate (Owen, 2013, Sec. 8.9); Wan et al. (2018) as shown in section 3.

Despite the plethora of computational approaches such as quadrature rules Brass and Petras; Keshavarzzadeh et al. (2018), Monte Carlo integration and Markov Chain Monte Carlo methods, the general procedure to estimate integrals of functions using point samples remains frustratingly inefficient even in moderately many (about ten) dimensions. Monte Carlo methods operate by expressing integrals as expectations which can be estimated via random simulation Metropolis and Ulam (1949). Several variants of this method address its slow convergence, and strike different compromises between bias (accuracy) and variance (precision). We refer the reader to standard texts on the subject Owen (2013). Quasi-Monte Carlo (QMC) methods replace stochasticity with carefully designed, deterministic samples, which improves convergence dramatically when the integrands are smooth Niederreiter (1978) or of moderate dimensionality (below ten). Deep neural networks can mimic MC solutions to certain partial differential equations Grohs et al. (2019), sidestepping the curse of dimensionality.

Bayesian methods use probabilistic model-based surrogates to improve sample-efficiency for expensive integrands Ghahramani and Rasmussen (2003) or to guide active sampling (adaptive Bayesian quadrature Osborne et al. (2012); Kanagawa and Hennig (2019), Bayesian Optimization Shahriari et al. (2016), etc.). This general approach can be used to estimate the marginal likelihood Briol et al. (2015); Gunter et al. (2014), approximate the posterior Kandasamy et al. (2015); Wang and Li (2018) and to simultaneously infer both Acerbi (2018). These works impose a Gaussian Process (GP) prior on the integrand and focus on cases where analytical formulae can be derived for the expectation and uncertainty of the statistical surrogate. The non-parametric nature of the GP can affect scalability, and the need to explicitly evaluate the sampling probability sometimes precludes the use of these methods. The advantages of these approaches are that they inherently enable reasoning about uncertainty and that they serve as generative models. Hybrid methods such as using GPs to model weights Karaletsos and Bui (2020) investigate calibrated reasoning about uncertainty using probabilistic neural networks. Integral representations Dereventsov et al. (2019) use an analysis of continuous distributions, for a particular target function, from which instances of shallow neural networks can be sampled.

We test the utility of a neural network as a surrogate representation for numerical integration. We identify that the integral of a sigmoidal universal approximator function can be expressed in closed form and derive its formula. Our result can be viewed as parametric in the sense that for a fixed architecture, the set of parameters (weights and biases) learned is independent of the amount of training data. Once learned, a parametric formula can be used to evaluate the integral or its marginals. On the other hand, it does not encode distributional information or serve as a generative model.

Contributions    The contributions of this paper are:

  1. a formula to estimate multidimensional integrals of fw, a neural network function representation;

  2. an interpretation of the formula as a neural network with one hidden layer, which we call a Q-NET;

  3. and a parametric formula for marginalization of fw along subsets of input dimensions.

We test the use of fw for approximate numerical integration using smooth and discontinuous functions.

Figure 1: Overview: We derive a formula in closed form to integrate a function fw represented using a shallow, feed-forward sigmoidal network. Once fw is trained as a proxy to some function f using point samples, the formula can be used to approximately integrate or marginalize fw. Here, it is applied to marginalize the pdf of a Gaussian mixture in 3D (visualized) along different axes to obtain accurate parametric formulae for the 2D and 1D marginal distributions.

2 Numerical integration of a shallow feed-forward network

2.1 Preliminaries

We denote row and column vectors with boldface characters (e.g. x, u, b) and capital letters to represent matrices (e.g. W). (We do not assume that all vectors are column vectors, so that the formulae in the paper match the output weights of neural network libraries with no extra transposes necessary for implementation.) Superscripts denote selected elements of a vector or matrix, so W^{ij} and W^i represent the element (i, j) and the i-th row of W respectively. For convenience, we will operate in the normalised domain [0, 1]^d, as is commonly the case with inputs of neural networks. General hyper-rectangular domains may be handled by scaling the dimensions independently (see Sec. 4). We use fw to represent an approximation of the function f obtained by training a shallow feedforward neural network, with one hidden layer of n neurons and a linear output layer, using point samples of f. For convenience, we use w to collectively encode all learnable parameters of the approximator network: an n × d matrix W of hidden-layer weights, a bias vector b with one element per neuron, a vector u of output weights and a real number c. The approximate function fw is


fw(x) = u σ(W x + b) + c,   (2)

where we define σ to be the logistic function that operates independently on each of the elements of the input: σ(t) = 1/(1 + e^(−t)). In this case, we derive a closed-form formula that exactly computes the integral of fw as an approximation to I (eq. 1) and show that the formula can be adapted to produce a closed-form representation of its marginals (eq. 1).
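To make this parameterization concrete, the following is a minimal numpy sketch of the forward pass of eq. 2; the variable names are ours, not the paper's:

```python
import numpy as np

def sigmoid(t):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-t))

def f_w(x, W, b, u, c):
    """Shallow sigmoidal approximator: f_w(x) = u . sigmoid(W x + b) + c.

    W: (n, d) hidden-layer weights, b: (n,) biases,
    u: (n,) output weights,       c: scalar output bias.
    """
    h = sigmoid(W @ x + b)   # hidden activations, shape (n,)
    return c + u @ h

# A toy network with n = 3 hidden neurons on a d = 2 input in [0, 1]^2.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2))
b = rng.standard_normal(3)
u = rng.standard_normal(3)
c = 0.5
y = f_w(np.array([0.2, 0.7]), W, b, u, c)
```

Since each sigmoid output lies in (0, 1), the network output always lies within |y − c| ≤ Σ|u_i|, which is a quick sanity check on any implementation.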

2.2 The formula

Our formula for the d-dimensional integral of fw using n neurons is (see App. A.1)

∫_{[0,1]^d} fw(x) dx = u q + c,   (3)

where q is a column vector whose elements correspond to the outputs of each of the n neurons in the hidden layer. Each element q^i is a signed sum of 2^d polylogarithm terms (eq. 4). Here Li_d represents the polylogarithm Euler (1768); Lewin (1981) function of order d, and S is a 'sign' matrix whose rows represent all 2^d combinations of signs; the contribution of each term is positive or negative according to the parity of the corresponding row S^j.

The computational cost of evaluating the formula grows as n·2^d, while the memory footprint remains modest since the rows of S can be binary encoded with length d. While the computation time is independent of the number of samples of the integrand f, increasing the number of training samples improves fw as an approximation of f.
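To make the structure of the formula concrete, the sketch below evaluates the single-neuron case in 2D, where (following the derivation in App. A.1) the integral reduces to a signed sum of dilogarithms, one per corner sign combination of the unit square. The function names are ours, and a1, a2 are assumed nonzero:

```python
import itertools
import numpy as np
from scipy.special import spence
from scipy.integrate import dblquad

def Li2(z):
    """Dilogarithm Li_2(z) for real z <= 0, via scipy's Spence function:
    Li_2(z) = spence(1 - z)."""
    return spence(1.0 - z)

def sigmoid_integral_2d(a1, a2, b):
    """Closed-form integral of sigmoid(a1*x1 + a2*x2 + b) over [0,1]^2.

    One dilogarithm term per corner sign combination; the sign of each
    term alternates with how many coordinates are 'switched on'.
    Assumes a1 != 0 and a2 != 0.
    """
    total = 0.0
    for s in itertools.product((0, 1), repeat=2):   # rows of the sign matrix
        sign = -(-1) ** sum(s)                      # parity of the row
        total += sign * Li2(-np.exp(s[0] * a1 + s[1] * a2 + b))
    return total / (a1 * a2)

# Sanity check against brute-force numerical quadrature.
a1, a2, b = 1.3, -0.7, 0.2
exact = sigmoid_integral_2d(a1, a2, b)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
numeric, _ = dblquad(lambda y, x: sigmoid(a1 * x + a2 * y + b), 0, 1, 0, 1)
```

The 2^d loop over sign combinations is what drives the n·2^d evaluation cost noted above.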

2.3 Q-NET representation

Although the calculation can be performed explicitly as in equations 3 and 4, we observe that the expression within the summation term may itself be represented using a feedforward neural network, with one hidden layer containing 2^d neurons (one per row of S), which we call a Q-NET. The parameters (weights, biases and activation function) of the Q-NET are fixed (not learned) as a function of d and independent of the parameters of fw. The input to the Q-NET is a concatenation of a row of the weight matrix W and the corresponding bias term: [W^i, b^i]. The first layer is defined by a weight matrix that is S concatenated with a column of ones, and zero bias. The activation function for the hidden layer is the polylogarithm term of eq. 4, and the output is a linear combination, with zero bias, whose weights are ±1 according to the parity of the corresponding row of the first-layer weights. Thus, the output of the Q-NET is q^i. The subscript in Q_d serves as a reminder that the Q-NET is fully specified by d. Now eq. 4 simplifies to


so that the i-th element of the column vector q is calculated using a single evaluation of the Q-NET: q^i = Q_d([W^i, b^i]). Figure 2 provides an illustrative summary of how Q-NETs may be constructed and used.

Figure 2: An illustrative example of 2D integration using Q-NETs. In this example, the approximator network contains three neurons in the hidden layer. Once the approximator is trained, three Q-NETs are used to calculate the integral of fw from w.

The output of the Q-NET (eq. 5) is identical to the formula in eq. 4. While it might be possible to construct a single network whose output is the full integral, this would be complicated by the denominator in eq. 4. Although the Q-NET does not have any learnable parameters, it is a useful abstraction that enables an elegant representation of integrals along sub-dimensions (marginalization).

2.4 Marginalizing using Q-NETs

We now explain how k of the input dimensions of fw may be marginalized. Although we choose the first k (adjacent) dimensions for convenient exposition, the following procedure holds without loss of generality for any subset of the dimensions of x. The result of marginalizing these dimensions is a function of the remaining d − k variables whose formula requires integration over k dimensions. Using Q-NETs, we obtain:


where the only differences from equation 5 are: 1) that Q_d has been replaced with Q_k on the rhs; and 2) that the input to the Q-NET is modified so that the weighted non-marginalized variables are added to the bias element of [W^i, b^i]. Thus, by modifying the input and evaluating a Q-NET parameterized by the reduced dimensionality, we can approximate marginal distributions. Our definition of Q-NETs is motivated by this ability to elegantly obtain functional approximations to marginal distributions from point-sampled functions. See figure 3 for an illustrative summary and figure 1 for an example. The example shows the result of integrating a 3D function (a Gaussian mixture) using a shallow sigmoidal network trained on point samples. The figure drops subscripts on Q for brevity. Also shown are the point-sampled marginal functions for different permutations of the components of x, in other words, different marginals in 2D and 1D.
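As a concrete special case of this input modification, marginalizing x2 out of a single 2D sigmoid neuron folds the non-marginalized term w1*x1 into the bias, leaving a 1D softplus expression. A sketch under those assumptions (names ours):

```python
import numpy as np
from scipy.integrate import quad

def softplus(t):
    """Numerically stable log(1 + e^t)."""
    return np.log1p(np.exp(-abs(t))) + max(t, 0)

def marginal_x2(x1, w1, w2, b):
    """Marginalize x2 out of sigmoid(w1*x1 + w2*x2 + b) over [0, 1].

    The weighted non-marginalized variable w1*x1 is added to the bias,
    mirroring how the Q-NET input is modified. Assumes w2 != 0.
    """
    t = w1 * x1 + b
    return (softplus(t + w2) - softplus(t)) / w2

# Compare the closed-form marginal at x1 = 0.4 against 1D quadrature.
w1, w2, b = 0.8, -1.1, 0.3
g = marginal_x2(0.4, w1, w2, b)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
ref, _ = quad(lambda x2: sigmoid(w1 * 0.4 + w2 * x2 + b), 0, 1)
```

Note the result is a function of x1, i.e. a parametric 1D marginal rather than a point estimate.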

Figure 3: The figure shows how a Q-NET (left) may be applied to the weights of fw to obtain parametric functions (right) as the different marginal distributions of fw.

3 Empirical evaluation

The empirical results in this section demonstrate that: 1) error consistently diminishes with increased training samples for moderate dimensionality; 2) the approach is applicable for challenging classes of integrands in the absence of any prior; and 3) that fw can be used as a control variate, which improves on standard MC and QMC on discontinuous integrands for dimensionality as low as 4.

a) Gaussian Mixtures (GM) b) Discontinuous GM (GMD) c) Binary Hyperrectangles (HR)
Figure 4: 2D Examples of randomly generated integrands used: a) Gaussian mixture, b) Gaussian mixture with discontinuities at hyperplanes and c) a binary integrand that is one everywhere except inside randomly located hyperrectangles.

We trained fw using samples from three families of integrands. The first, GM, is the probability density function (pdf) of a mixture of Gaussians parametrized by dimensionality and the number of components in the mixture. The covariances are generated randomly, in such a way that the Gaussians become narrower (peakier) as the number of components is increased. Figure 4.a visualizes two examples in 2D with different numbers of components. The second class, GMD, is the same as GM but with discontinuities introduced: we choose a random point in the domain and zero out the pdf of GM on one side of the axis-aligned hyperplanes through that point. Everywhere else it is identical to GM (see fig. 4.b). The third family of integrands, HR, is a binary function that equals one everywhere except within a set of randomly located and sized (but centered on a grid) hyper-rectangles, where the function is zero. The parameters are the dimensionality, the number of grid cells per axis and the fraction of the total grid cells that are chosen; such settings result in randomly sized boxes, whose centers are unique, on a grid. All computation was performed using an unoptimized MATLAB implementation and executed (CPU only) on a single core of a standard desktop machine. It takes about one second to obtain an estimate in 12D.

Convergence tests    We averaged the relative root mean-squared error (RRMSE) of our N-sample estimates, repeated several times each, across randomly generated integrands. We use RRMSE since each instance of the integrand has a different reference integral. We plotted the mean and standard deviation (error bars) of RRMSE against N on a log-log scale for 2D (fig. 5 top row) and 5D (fig. 6 top row). Error bars depict variability across integrands. We plotted the relative variances of the estimator (sample variance divided by the square of the reference value) for 2D (fig. 5 bottom row) and 5D (fig. 6 bottom row). The graphs show the statistics of a standard MC estimator (grey) and a QMC (black) estimator using Halton samples alongside ours (red). Dashed trend lines are also shown for RRMSE (squares for variance).
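The two reported metrics can be stated compactly. A small sketch of how we interpret them, assuming a known nonzero reference integral per integrand (names ours):

```python
import numpy as np

def rrmse(estimates, reference):
    """Relative root mean-squared error of a batch of repeated
    estimates against the known reference integral."""
    e = np.asarray(estimates, dtype=float)
    return np.sqrt(np.mean((e - reference) ** 2)) / abs(reference)

def relative_variance(estimates, reference):
    """Sample variance of the estimates divided by the squared
    reference value."""
    e = np.asarray(estimates, dtype=float)
    return np.var(e, ddof=1) / reference ** 2
```

RRMSE folds both bias and variance into one number, which is why the variance plots are reported separately.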



(a) 2D GM (b) 2D GMD (c) 2D HR
Figure 5: Convergence plots for relative RMSE (top row) and relative variance (bottom) averaged across randomly generated 2D integrands from three families (columns). Ours (red) is shown alongside MC and QMC references (solid grey) and error trend lines (dashed grey).



(a) 5D GM (b) 5D GMD (c) 5D HR
Figure 6: Convergence plots for relative RMSE (top row) and relative variance (bottom) averaged across randomly generated 5D integrands from three families (columns). Ours (red) is shown alongside MC and QMC references (solid grey) and error trend lines (dashed grey).

The plots show that RRMSE consistently decreases as N is increased. Curiously, for the 2D integrands, Q-NET converges faster than MC despite training with random samples. For discontinuous f in higher dimensions, using fw results in higher error, as predicted by theory (see sec. 4). For each N-sample estimator, the statistics of the MC and QMC estimators were calculated using fresh evaluations of the integrand for each repeated trial. On the other hand, ours only requires one set of N evaluations, since different estimates are obtained via stochasticity in training and initialization. This could be relevant when sampling f repeatedly is prohibitively costly, or even impossible, as could be the case in some real problems such as those involving large amounts of computation (e.g. astronomy), surveys or clinical trials.

Control variates    We devised a family of estimators, CV-Q-NET, parametrized by α ∈ [0, 1], that use fw as a control variate (Owen, 2013, Sec. 8.9); Wan et al. (2018) to integrate f rather than as a direct proxy. Given N samples, CV-Q-NET uses (1 − α)N samples to train fw and the remaining αN samples to integrate the residual f − fw via standard MC or QMC. The final estimator is then the sum of the closed-form integral of fw and the MC (or QMC) estimator of the residual. When α = 0, CV-Q-NET is equivalent to Q-NET and as α tends to one it approaches pure MC (or QMC). The results (fig. 7) show a dramatic reduction in variance for intermediate values of α.
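A sketch of how such a control-variate estimator can be assembled; the sample split and names are our assumptions, not the paper's exact procedure:

```python
import numpy as np

def cv_qnet_estimate(f, surrogate_integral, surrogate, samples, alpha):
    """Control-variate estimator sketch (assumed form of CV-Q-NET).

    The first (1 - alpha) fraction of `samples` is assumed to have been
    used to train the surrogate; the remaining alpha fraction forms a
    Monte Carlo correction on the residual f - f_w.
    `surrogate_integral` is the closed-form integral of the surrogate.
    """
    n_mc = int(alpha * len(samples))
    if n_mc == 0:
        return surrogate_integral            # pure Q-NET estimate
    xs = samples[-n_mc:]
    residual = np.mean([f(x) - surrogate(x) for x in xs])
    return surrogate_integral + residual
```

The correction term is unbiased for the residual, so the overall estimate inherits MC's unbiasedness while the surrogate soaks up most of the variance.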

(a) 4D HR (b) 8D HR
Figure 7: Control variates. We used a fraction α (X-axes) of the samples to randomly sample the domain and the remaining samples to build the proxy fw, which we used as a control variate. We call this estimator CV-Q-NET; it is equivalent to Q-NET when α = 0 and equivalent to MC (or QMC) when α = 1. As α approaches one, RRMSE reduces to that of MC. The variance curves show dramatic reductions, particularly for intermediate values of α. We used random HR integrands to estimate the means, error bars and variances in 4D and 8D respectively.

Dimensionality and nature of the integrand    Since existing theoretical results (see sec. 4) for error do not apply to non-bandlimited functions such as HR, we provide empirical results of relative error (fig. 8.a) and variance (fig. 8.b) across dimensions and for different volumes of training data (different curves). Although there is an increase in error at low dimensionality (consistent with the convergence plots), reassuringly, this effect diminishes at higher dimensions even for this challenging class of integrand. We also investigated the effects (on error) of increasing the variation of f within the domain via the input parameters of GMD (number of components) and HR (number of hyper-rectangles). Figure 9 visualizes the relative error and variance of our estimator using a fixed number of training samples and different numbers of neurons (different lines). The results were obtained by averaging across randomly generated integrands from each family, and the variation across the integrands is visualized as error bars. For GMD, the error increases slightly with more components but for HR it remains constant. Curiously, in both cases the variance is higher when using more neurons. We found the cause to be 'ringing' artifacts in fw at the discontinuities when the number of neurons is large.

4 Discussion

Error    The approximation error of sigmoidal universal approximator networks Barron (1993); Shaham et al. (2015) is bounded by a quantity that is proportional to the first moment of the Fourier spectrum of f and inversely proportional to the square root of the number of neurons when f is sufficiently differentiable. From this, the error of our estimator can be bounded similarly (derived in App. A.2). Compared to reconstruction, numerical integration requires fewer samples Subr and Kautz (2013); Belcour et al. (2013) (below the Nyquist rate).

Domain and range adjustment    Our formula is derived assuming a normalized domain of [0, 1]^d and a normalized range for the approximator network. However, our test integrands are defined over general hyper-rectangular domains with unknown range. It is easy to show (using elementary variable substitution) that an estimate in the normalized space can be transformed into an estimate over an arbitrary hyper-rectangular domain, for a function with arbitrary range, by rescaling with the volume of the domain and the extent of the range.
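As an illustration of this substitution, the sketch below maps a normalized-space estimate back to an arbitrary hyper-rectangle, under our assumption that the surrogate was trained on inputs scaled to [0, 1]^d and outputs scaled from a range [r_lo, r_hi] down to [0, 1]:

```python
import numpy as np

def unnormalize_integral(I_hat, lo, hi, r_lo, r_hi):
    """Map an integral computed in normalized coordinates back to the
    original domain and range (by elementary variable substitution).

    I_hat : integral of the normalized surrogate over [0, 1]^d, whose
            outputs were scaled from [r_lo, r_hi] to [0, 1].
    lo, hi: per-dimension bounds of the original hyper-rectangle.
    """
    V = np.prod(np.asarray(hi, float) - np.asarray(lo, float))  # domain volume
    return V * ((r_hi - r_lo) * I_hat + r_lo)
```

For example, f(x) = 3 on [0, 2] normalized with r_lo = 2, r_hi = 4 gives a normalized integral of 0.5, which maps back to the true value 6.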

Loss functions and training    We experimented with different loss functions such as MSE, MAE and cross-entropy to arrive at the surrogate fw but did not notice significant differences in the convergence rate. We also experimented with standard training algorithms and found that optimizing using the Levenberg-Marquardt method and Bayesian regularization performs better than conjugate-gradient-based methods, particularly for discontinuous f.

(a) relative RMSE (HR) (b) relative variance (HR)
Figure 8: Dimensionality. Relative error and variance across randomly sampled HR integrands of increasing dimensionality (X-axes) for various amounts of training data (different curves). The dashed lines show the same data when the number of neurons is reduced.

Sampling    Different estimates may be drawn from the same samples by retraining the approximator network with different initialization conditions, or via the stochasticity of the optimization procedure. Although it is also possible to vary the samples across training runs (as with MC or QMC), we found that this did not impact estimation error, particularly for larger training data. However, different sampling arrangements (correlations) with the same distribution were useful, particularly when combined with control variates, e.g. Halton samples instead of random ones. Intuitively, this improves fw, thereby also improving the estimation error. This relationship requires further analysis and is a promising direction for future research.

(a) Integrand: GMD (b) Integrand: HR
Figure 9: RMSE and variance using our method are consistent across integrands with different parameters. The plots in this figure visualize estimation error and variance for various numbers of neurons and integrand parameters. Curiously, the variance is higher for larger numbers of neurons.

Limitations and future work    In our experiments, the choice of the number of neurons n had little impact on error for low training volumes. For large training volumes, its relevance depended on the nature of the integrand. For smooth integrands (GM), increasing n reduced error due to the improved approximation. For discontinuous integrands (GMD, HR), increasing n arbitrarily was counterproductive (see fig. 9) due to increased error introduced by 'ringing' artifacts at discontinuities. This is an inherent limitation of shallow sigmoidal approximator networks which could be addressed by bounding width rather than depth Fan et al. (2018) and adding layers. We strike a general compromise for all experiments, choosing n as a function of the training volume with a fixed minimum number of neurons. Further investigation is required to extend our formula to multiple sigmoidal layers and to develop fast approximations of our exact formula.

5 Conclusion

We derived a closed-form formula for the integral of a function represented by a feedforward neural network with one hidden layer whose activation function is the logistic function. Our formula can be evaluated directly, or via a fixed-weight feedforward neural network with one hidden layer. We call this network a Q-NET, and derived its weights and activation function. Q-NETs may be used to marginalize along input dimensions to yield closed-form functions of the remaining dimensions. We presented experimental evidence that the error drops consistently (as the number of training samples for the approximator is increased) for challenging integrands, along with an investigation of the effect of various parameters on the approximation.

Appendix A Appendix

a.1 Derivation

To derive the formula for the multidimensional case, we start with the simple case where d = 1 and n = 1, then develop intuition for d = 2, and finally generalize the result to d-dimensional integrands with n neurons in the hidden layer.

When d = 1, the matrix W in equation 2 reduces to a vector (one weight per neuron). Further, when n = 1 (single neuron) all learned parameters of fw reduce to scalars, so that

fw(x) = u σ(w x + b) + c,

which can be integrated analytically to yield a formula for the integral of the approximate function:

∫₀¹ fw(x) dx = u [sp(w + b) − sp(b)] / w + c,

which consists of softplus terms sp(t) = log(1 + e^t). Since the output layer is a linear combination of the activations of the neurons in the hidden layer, when n > 1 the rhs is the weighted sum

∫₀¹ fw(x) dx = Σᵢ u^i [sp(w^i + b^i) − sp(b^i)] / w^i + c,

where the superscript i is used to denote the i-th element of a vector. Since this is an exact integral of fw, it follows that if the approximation is accurate in the limit then the estimator is unbiased, i.e. fw → f implies that the estimate tends to I. Since independent training runs of the approximator network result in different w, secondary estimates of I may be obtained by averaging over training runs with random initialization.
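The 1D closed form above is easy to verify numerically; a sketch (names ours) that checks the softplus formula against quadrature for a two-neuron network:

```python
import numpy as np
from scipy.integrate import quad

def softplus(t):
    """Numerically stable sp(t) = log(1 + e^t)."""
    return np.log1p(np.exp(-abs(t))) + max(t, 0)

def integral_1d(w, b, u, c):
    """Closed-form integral over [0, 1] of c + sum_i u_i * sigmoid(w_i x + b_i),
    matching the 1D softplus formula (assumes w_i != 0)."""
    per_neuron = [(softplus(wi + bi) - softplus(bi)) / wi
                  for wi, bi in zip(w, b)]
    return c + float(np.dot(u, per_neuron))

w, b, u, c = [2.0, -1.5], [0.3, 0.1], [0.7, -0.4], 0.2
exact = integral_1d(w, b, u, c)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
f = lambda x: c + sum(ui * sigmoid(wi * x + bi)
                      for wi, bi, ui in zip(w, b, u))
numeric, _ = quad(f, 0, 1)
```

Agreement to quadrature tolerance confirms the per-neuron softplus terms combine linearly through the output layer.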

When d = 2 and n = 1, the approximator network can be written as fw(x) = u σ(w x + b) + c, where x = (x₁, x₂) and w is a two-element vector. Proceeding similarly to the 1D case, we can first integrate over one variable x₁ (or equally x₂) and the intermediate formula, a function of x₂ (resp. x₁), requires a second integration step:

∫₀¹∫₀¹ σ(w¹x₁ + w²x₂ + b) dx₁ dx₂ = [−Li₂(−e^(w¹+w²+b)) + Li₂(−e^(w¹+b)) + Li₂(−e^(w²+b)) − Li₂(−e^b)] / (w¹ w²).

Here Li₂ represents the polylogarithm function of order 2, which is real when w and b are real (its argument is then negative). An important property of this function is that d/dt Li_{s+1}(−e^t) = Li_s(−e^t), so each successive integration over a unit interval raises the order of the polylogarithm by one; this enables the general formula for the d-dimensional case with one hidden neuron, where the rows of the 2^d × d matrix S represent all 2^d combinations of signs, one for each element of the row vector w. The contribution of each term is positive or negative according to the parity of the corresponding row S^j.

The general case n > 1, where the approximation is given by equation 2, is a weighted sum of the outputs of the n neurons, with weights given by the corresponding elements of u (eq. 3).

a.2 Error

When fw is trained to approximate f, the approximation error of shallow feedforward networks Barron (1993); Shaham et al. (2015) is bounded by a quantity proportional to the first moment of the Fourier spectrum of f and inversely proportional to the square root of the number of neurons, provided f is sufficiently differentiable. But the error of the integral is bounded by the approximation error, since

|∫ fw(x) dx − ∫ f(x) dx| ≤ ∫ |fw(x) − f(x)| dx ≤ sup |fw − f|

over the normalized (unit-volume) domain, which then gives us the result in the paper.


  • L. Acerbi (2018) Variational bayesian monte carlo. In Advances in Neural Information Processing Systems 31, S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 8223–8233. Cited by: §1.
  • A. R. Barron (1993) Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory 39 (3), pp. 930–945. Cited by: §A.2, §4.
  • L. Belcour, C. Soler, K. Subr, N. Holzschuch, and F. Durand (2013) 5D covariance tracing for efficient defocus and motion blur. ACM Trans. Graph. 32 (3). External Links: ISSN 0730-0301, Link, Document Cited by: §4.
  • H. Brass and K. Petras. Quadrature theory: the theory of numerical integration on a compact interval. Mathematical Surveys and Monographs, American Mathematical Society. External Links: ISBN 9780821887011. Cited by: §1.
  • F. Briol, C. Oates, M. Girolami, and M. A. Osborne (2015) Frank-wolfe bayesian quadrature: probabilistic integration with theoretical guarantees. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 1162–1170. Cited by: §1.
  • G. Cybenko (1989) Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems 2 (4), pp. 303–314. Cited by: §1.
  • A. Dereventsov, A. Petrosyan, and C. Webster (2019) Neural network integral representations with the relu activation function. External Links: 1910.02743. Cited by: §1.
  • L. Euler (1768) Institutionum calculi integralis volumen primum. Vol. 1. Cited by: §2.2.
  • F. Fan, D. Wang, H. Guo, Q. Zhu, P. Yan, G. Wang, and H. Yu (2018) Slim, sparse, and shortcut networks. External Links: 1811.09003 Cited by: §4.
  • Z. Ghahramani and C. E. Rasmussen (2003) Bayesian monte carlo. In Advances in Neural Information Processing Systems 15, S. Becker, S. Thrun, and K. Obermayer (Eds.), pp. 505–512. Cited by: §1.
  • P. Grohs, A. Jentzen, and D. Salimova (2019) Deep neural network approximations for monte carlo algorithms. External Links: 1908.10828 Cited by: §1.
  • T. Gunter, M. A. Osborne, R. Garnett, P. Hennig, and S. J. Roberts (2014) Sampling for inference in probabilistic models with fast bayesian quadrature. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, Cambridge, MA, USA, pp. 2789–2797. Cited by: §1.
  • K. Hornik (1991) Approximation capabilities of multilayer feedforward networks. Neural networks 4 (2), pp. 251–257. Cited by: §1.
  • M. Kanagawa and P. Hennig (2019) Convergence guarantees for adaptive bayesian quadrature methods. In Advances in Neural Information Processing Systems 32, pp. 6237–6248. Cited by: §1.
  • K. Kandasamy, J. Schneider, and B. Póczos (2015) Bayesian active learning for posterior estimation. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI’15, pp. 3605–3611. External Links: ISBN 9781577357384. Cited by: §1.
  • T. Karaletsos and T. D. Bui (2020) Hierarchical gaussian process priors for bayesian neural network weights. External Links: 2002.04033 Cited by: §1.
  • V. Keshavarzzadeh, R. M. Kirby, and A. C. Narayan (2018) Numerical integration in multiple dimensions with designed quadrature. CoRR abs/1804.06501. External Links: Link, 1804.06501 Cited by: §1.
  • L. Lewin (1981) Polylogarithms and associated functions. North Holland. External Links: ISBN 9780444005502, LCCN lc80020945 Cited by: §2.2.
  • Z. Lu, H. Pu, F. Wang, Z. Hu, and L. Wang (2017) The expressive power of neural networks: a view from the width. External Links: 1709.02540 Cited by: §1.
  • N. Metropolis and S. Ulam (1949) The monte carlo method. Journal of the American Statistical Association 44 (247), pp. 335–341. External Links: ISSN 01621459 Cited by: §1.
  • H. Niederreiter. Random number generation and quasi-monte carlo methods. SIAM. External Links: Document. Cited by: §1.
  • H. Niederreiter (1978) Quasi-monte carlo methods and pseudo-random numbers. Bull. Amer. Math. Soc. 84 (6), pp. 957–1041. Cited by: §1.
  • M. Osborne, R. Garnett, Z. Ghahramani, D. K. Duvenaud, S. J. Roberts, and C. E. Rasmussen (2012) Active learning of model evidence using bayesian quadrature. In Advances in Neural Information Processing Systems 25, pp. 46–54. Cited by: §1.
  • A. B. Owen (2013) Monte carlo theory, methods and examples. External Links: Link Cited by: §1, §1, §3.
  • U. Shaham, A. Cloninger, and R. R. Coifman (2015) Provable approximation properties for deep neural networks. External Links: 1509.07385 Cited by: §A.2, §1, §4.
  • B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas (2016) Taking the human out of the loop: a review of bayesian optimization. Proceedings of the IEEE 104 (1), pp. 148–175. Cited by: §1.
  • K. Subr and J. Kautz (2013) Fourier analysis of stochastic sampling strategies for assessing bias and variance in integration. ACM Trans. Graph. 32 (4). External Links: ISSN 0730-0301, Link, Document Cited by: §4.
  • R. Wan, M. Zhong, H. Xiong, and Z. Zhu (2018) Neural control variates for variance reduction. External Links: 1806.00159 Cited by: §1.
  • H. Wang and J. Li (2018) Adaptive gaussian process approximation for bayesian inference with expensive likelihood functions. Neural Computation 30 (11), pp. 3072–3094. Note: PMID: 30216145. Cited by: §1.