Convergence Analysis of Deterministic Kernel-Based Quadrature Rules in Misspecified Settings

09/01/2017 · by Motonobu Kanagawa, et al.

This paper presents convergence analysis of kernel-based quadrature rules in misspecified settings, focusing on deterministic quadrature in Sobolev spaces. In particular, we deal with misspecified settings where a test integrand is less smooth than a Sobolev RKHS based on which a quadrature rule is constructed. We provide convergence guarantees based on two different assumptions on a quadrature rule: one on quadrature weights, and the other on design points. More precisely, we show that convergence rates can be derived (i) if the sum of absolute weights remains constant (or does not increase quickly), or (ii) if the minimum distance between design points does not decrease very quickly. As a consequence of the latter result, we derive a rate of convergence for Bayesian quadrature in misspecified settings. We reveal a condition on design points to make Bayesian quadrature robust to misspecification, and show that, under this condition, it may adaptively achieve the optimal rate of convergence in the Sobolev space of a lesser order (i.e., of the unknown smoothness of a test integrand), under a slightly stronger regularity condition on the integrand.


1 Introduction

This paper discusses the problem of numerical integration (or quadrature), which has been a fundamental task in numerical analysis, statistics, and computer science, including machine learning, and other areas. Let $P$ be a (known) Borel probability measure on the Euclidean space $\mathbb{R}^d$ with support contained in an open set $\Omega \subset \mathbb{R}^d$, and let $f: \Omega \to \mathbb{R}$ be an integrand on $\Omega$. Suppose that the integral $\int f(x) \, dP(x)$ has no closed form solution. We consider quadrature rules that provide an approximation of the integral in the form of a weighted sum of function values:

$$\sum_{i=1}^n w_i f(X_i) \approx \int f(x) \, dP(x),$$

where $X_1, \dots, X_n \in \Omega$ are design points and $w_1, \dots, w_n \in \mathbb{R}$ are quadrature weights. Throughout this paper, the integral of $f$ and its quadrature estimate are denoted by $Pf$ and $P_n f$, respectively; namely,

$$Pf := \int f(x) \, dP(x), \qquad P_n f := \sum_{i=1}^n w_i f(X_i).$$
Examples of such quadrature rules include Monte Carlo methods, which make use of an i.i.d. random sample from a suitable proposal distribution as the design points $X_1, \dots, X_n$, and importance weights as the quadrature weights $w_1, \dots, w_n$. A limitation of standard Monte Carlo methods is that a huge number of design points (i.e., large $n$) may be needed to provide an accurate approximation of the integral; this comes from the fact that the rate of convergence of Monte Carlo methods is typically of the order $n^{-1/2}$. The need for large $n$ is problematic when an evaluation of the function value $f(x)$ is expensive for each input $x$. Such situations appear in modern scientific and engineering problems where the mapping $x \mapsto f(x)$ involves complicated computer simulation. In applications to time-series forecasting, for instance, $x$ may be a parameter of an underlying system, $f(x)$ a certain quantity of interest in the future, and $P$ a prior distribution on the parameter. Then the target integral $Pf$ is the predictive value of the future quantity. The evaluation of $f(x)$ for each $x$ may require numerically solving an initial value problem for a differential equation, which results in time-consuming computation [7]. Similar examples can be seen in applications to statistics and machine learning, as mentioned below. In these situations, one can only use a limited number of design points, and thus it is desirable to have quadrature rules with a faster convergence rate, in order to obtain a reliable solution [42].
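To make the setup concrete, the following minimal sketch (ours, not from the paper; the integrand below is a hypothetical stand-in for an expensive simulation) computes a quadrature estimate of the form $P_n f = \sum_{i=1}^n w_i f(X_i)$, instantiated with plain Monte Carlo, i.e., a random sample and equal weights.

```python
import numpy as np

# Minimal sketch (not from the paper): a generic weighted quadrature estimate
# P_n f = sum_i w_i f(X_i), instantiated with plain Monte Carlo for
# P = N(0, 1) and a hypothetical integrand f.

def quadrature_estimate(f, X, w):
    """Return P_n f = sum_i w_i f(X_i)."""
    return float(np.dot(w, f(X)))

f = lambda x: np.sin(3.0 * x) ** 2       # hypothetical black-box integrand
rng = np.random.default_rng(0)

n = 10_000
X = rng.standard_normal(n)               # design points: an i.i.d. sample from P
w = np.full(n, 1.0 / n)                  # equal Monte Carlo weights

print(quadrature_estimate(f, X, w))      # error decays only at the rate n^{-1/2}
```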

1.1 Kernel-based quadrature rules

How can we obtain a quadrature rule whose convergence rate is faster than $n^{-1/2}$? In practice, one often has prior knowledge or belief about the integrand $f$, such as smoothness, periodicity, sparsity, and so on. Exploiting such knowledge or assumptions in constructing a quadrature rule may achieve faster rates of convergence, and such methods have been studied extensively in the literature for decades; see e.g. [16] and [9] for reviews.

This paper deals with quadrature rules that use reproducing kernel Hilbert spaces (RKHSs) explicitly or implicitly to achieve fast convergence rates; we will refer to such methods as kernel-based quadrature rules or simply kernel quadrature. As discussed in Section 2.4, notable examples include Quasi Monte Carlo methods [25, 39, 16, 17], Bayesian quadrature [44, 9], and kernel herding [10, 5]. These methods have been studied extensively in recent years [49, 8, 41, 42, 4] and have recently found applications in, for instance, machine learning and statistics [3, 30, 20, 9, 29].

In kernel quadrature, we make use of available knowledge on the properties of the integrand $f$ by assuming that $f$ belongs to a certain RKHS $\mathcal{H}_k$ that possesses those properties (where $k$ is the reproducing kernel), and then constructing weighted points $\{(w_i, X_i)\}_{i=1}^n$ such that the worst case error in the RKHS

$$e_n(\mathcal{H}_k) := \sup_{f \in \mathcal{H}_k: \, \|f\|_{\mathcal{H}_k} \le 1} |Pf - P_n f| \qquad (1)$$

is made small, where $\|\cdot\|_{\mathcal{H}_k}$ is the norm of $\mathcal{H}_k$. The use of an RKHS is beneficial when compared to other function spaces, as it leads to a closed form expression of the worst case error (1) in terms of the kernel, and thus one may explicitly use this expression for designing $\{(w_i, X_i)\}_{i=1}^n$ (see Section 2.3).

Note that, in a well-specified case, that is, when the integrand satisfies $f \in \mathcal{H}_k$, the quadrature error is bounded as

$$|Pf - P_n f| \le \|f\|_{\mathcal{H}_k} \, e_n(\mathcal{H}_k).$$

This guarantees that, if a quadrature rule satisfies $e_n(\mathcal{H}_k) = O(n^{-b})$ as $n \to \infty$ for some $b > 0$, then the quadrature error also satisfies $|Pf - P_n f| = O(n^{-b})$. Take the Sobolev space $W_2^r$ of order $r > d/2$ on $\Omega \subset \mathbb{R}^d$ as the RKHS $\mathcal{H}_k$, for example. It is known that optimal quadrature rules achieve $e_n(W_2^r) = O(n^{-r/d})$ [37], and thus $|Pf - P_n f| = O(n^{-r/d})$ holds for any $f \in W_2^r$. As we have $r/d > 1/2$, this rate is faster than Monte Carlo integration; this is the desideratum that has been discussed.

1.2 Misspecified settings

This paper focuses on situations where the assumption $f \in \mathcal{H}_k$ is violated, that is, misspecified settings. As explained above, convergence guarantees for kernel quadrature rules often assume that $f \in \mathcal{H}_k$. However, in practice one may lack full knowledge of the properties of the integrand, and therefore misspecification of the RKHS (via the choice of its reproducing kernel $k$) may occur, that is, $f \notin \mathcal{H}_k$.

Such misspecification is likely to happen when the integrand is a black box function. An illustrative example can be found in applications to computer graphics, such as the problem of illumination integration (see e.g. [9]), where the task is to compute the total amount of light arriving at a camera in a virtual environment. This problem is solved by quadrature, with the integrand $f(x)$ being the intensity of light arriving at the camera from a direction $x$ (an angle). However, the value of $f(x)$ is only given by a simulation of the environment for each $x$, so the integrand is a black box function. Similar situations can be found in applications to statistics and machine learning. A representative example is the computation of the marginal likelihood of a probabilistic model, which is an important but challenging task required for model selection (see e.g. [43]). In modern scientific applications dealing with complex phenomena (e.g. climate science), we often encounter situations where the evaluation of a likelihood function, which forms the integrand in marginal likelihood computation, involves an expensive simulation model, making the integrand complex and even black box.

If the integrand is a black box function, there is a trade-off between the risk of misspecification and the gain in the rate of convergence for kernel-based quadrature rules: for a faster convergence rate, one may want to use a quadrature rule for a narrower $\mathcal{H}_k$, such as an RKHS of higher order differentiability, while such a choice may cause misspecification of the function class. Therefore it is of great importance to elucidate the convergence properties of such quadrature rules in misspecified situations, in order to use them in a safe manner.

1.3 Contributions

This paper provides convergence rates of kernel-based quadrature rules in misspecified settings, focusing on deterministic rules (i.e., without randomization). The focus of misspecification is placed on the order of Sobolev spaces: the unknown order $s$ of the integrand is overestimated as $r$, that is, $s \le r$.

Let $\Omega \subset \mathbb{R}^d$ be a bounded domain with a Lipschitz boundary (see Section 3 for the definition). For $r > d/2$, consider a positive definite kernel $k$ on $\Omega$ that satisfies the following assumption:

Assumption 1.

The kernel $k$ on $\Omega$ satisfies $k(x, y) = \Phi(x - y)$ for $x, y \in \Omega$, where $\Phi \in L^1(\mathbb{R}^d)$ is a positive definite function such that

$$C_1 (1 + \|\omega\|^2)^{-r} \le \hat{\Phi}(\omega) \le C_2 (1 + \|\omega\|^2)^{-r}, \qquad \omega \in \mathbb{R}^d,$$

for some constants $0 < C_1 \le C_2$, where

$$\hat{\Phi}(\omega) := (2\pi)^{-d/2} \int \Phi(x) e^{-\sqrt{-1}\, x^T \omega} \, dx$$

is the Fourier transform of $\Phi$. The RKHS $\mathcal{H}_k(\Omega)$ of $k$ is the restriction of $\mathcal{H}_k(\mathbb{R}^d)$ to $\Omega$ (see Section 2).

The resulting RKHS $\mathcal{H}_k(\Omega)$ is norm-equivalent to the standard Sobolev space $W_2^r(\Omega)$. The Matérn and Wendland kernels satisfy Assumption 1 (see Section 2).

Consider a quadrature rule with the kernel $k$ whose worst case error satisfies

$$e_n(\mathcal{H}_k(\Omega)) = O(n^{-b}) \quad (n \to \infty) \qquad (2)$$

for some $b > 0$. We do not specify how the weighted points $\{(w_i, X_i)\}_{i=1}^n$ are generated, but only assume (2), aiming for wide applicability. Suppose that an integrand $f$ has partial derivatives up to order $s$ and that they are bounded and uniformly continuous. If $s < r$, the integrand may not belong to the assumed RKHS $\mathcal{H}_k(\Omega)$, in which case a misspecification occurs.

Under this misspecified setting, two types of assumptions on the quadrature rule will be considered: one on the quadrature weights (Section 4.1), and the other on the design points (Section 4.2). In both cases, a rate of convergence of the form

$$|Pf - P_n f| = O(n^{-bs/r}) \quad (n \to \infty) \qquad (3)$$

will be derived under some additional conditions. The results guarantee convergence in the misspecified setting, and the rate is determined by the ratio $s/r$ between the true smoothness $s$ and the assumed smoothness $r$. As discussed in Section 2, the optimal rate of deterministic quadrature rules for the Sobolev space $W_2^r$ is $n^{-r/d}$ [37]. If a quadrature rule satisfies this optimal rate (i.e., $b = r/d$), then the rate (3) becomes $O(n^{-s/d})$ for an integrand of smoothness $s$ ($s \le r$), which matches the optimal rate for $W_2^s$.

The specific results are summarized as follows:

  • In Section 4.1, it is assumed that $\sum_{i=1}^n |w_i| = O(n^c)$ as $n \to \infty$ for some constant $c \ge 0$. Note that $c = 0$ is taken if the weights satisfy $\sum_{i=1}^n |w_i| = O(1)$, an example of which is the equal weights $w_i = 1/n$. Under this assumption and other suitable conditions, Corollary 7 shows

    $$|Pf - P_n f| = O(n^{-bs/r + c(1 - s/r)}) \quad (n \to \infty).$$

    The rate in (3) holds if $c = 0$. Therefore this result provides convergence guarantees in particular for equal-weight quadrature rules, such as quasi Monte Carlo methods and kernel herding, in the misspecified setting.

  • Section 4.2 uses an assumption on design points in terms of the separation radius $q_n$, which is defined by

    $$q_n := \frac{1}{2} \min_{i \ne j} \|X_i - X_j\|. \qquad (4)$$

    Corollary 9 shows that, if $q_n \ge C n^{-b/r}$ holds for some constant $C > 0$ as $n \to \infty$, then, under other regularity conditions,

    $$|Pf - P_n f| = O(n^{-bs/r}) \quad (n \to \infty).$$

    The best possible rate is $n^{-s/d}$, which is attained when $b = r/d$. This result provides a convergence guarantee for quadrature rules that obtain the weights $w_1, \dots, w_n$ so as to give $e_n(\mathcal{H}_k(\Omega)) = O(n^{-b})$ for the worst case error, with the design points $X_1, \dots, X_n$ fixed beforehand. We demonstrate this result by applying it to Bayesian quadrature, as explained below. Our result may also provide the following guideline for practitioners: in order to make a kernel quadrature rule robust to misspecification, one should specify the design points so that their spacing is not too small (a sketch computing $q_n$ and the fill distance (5) below follows this list).

  • Section 5 discusses a convergence rate for Bayesian quadrature under the misspecified setting, demonstrating the results of Section 4.2. Given design points $X_1, \dots, X_n$, Bayesian quadrature defines the weights $w_1, \dots, w_n$ as the minimizer of the worst case error (1), which can be obtained by solving a linear equation (see Section 2.4 for more detail). For points $X_1, \dots, X_n$ in $\Omega$, the fill distance is defined by

    $$h_n := \sup_{x \in \Omega} \min_{i = 1, \dots, n} \|x - X_i\|. \qquad (5)$$

    Assume that there exists a constant $c_q > 0$ independent of $n$ such that

    $$h_n \le c_q \, q_n,$$

    and that $h_n = O(n^{-1/d})$ as $n \to \infty$. Then Corollary 11 shows that, with Bayesian quadrature weights based on the kernel $k$, we have

    $$|Pf - P_n f| = O(n^{-s/d}) \quad (n \to \infty).$$

    Note that the rate $n^{-s/d}$ matches the minimax optimal rate for deterministic quadrature rules in the Sobolev space of order $s$ [37], which implies that Bayesian quadrature can be adaptive to the unknown smoothness $s$ of the integrand $f$. The adaptivity means that it can achieve the rate without knowledge of $s$; it only requires knowledge of the upper bound $r$ of the true smoothness $s$.

  • Section 3 establishes a rate of convergence for Bayesian quadrature in the well-specified case, which serves as a basis for the results in the misspecified case (Section 5). Corollary 5 asserts that if the design points satisfy $h_n = O(n^{-1/d})$ as $n \to \infty$, then

    $$e_n(\mathcal{H}_k(\Omega)) = O(n^{-r/d}) \quad (n \to \infty).$$

    This rate is minimax optimal for deterministic quadrature rules in Sobolev spaces. To the best of our knowledge, this optimality of Bayesian quadrature has not been established before, while recently there has been extensive theoretical analysis of Bayesian quadrature [8, 9, 40, 4].
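As referenced in the second bullet above, the separation radius (4) and the fill distance (5) are straightforward to compute for a concrete point set. The following sketch (ours, not from the paper) evaluates both for an equally-spaced grid on $[0,1]^2$, a standard example of quasi-uniform points for which $q_n \asymp h_n \asymp n^{-1/d}$; the supremum in (5) is approximated by a maximum over a dense random reference set.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Sketch (not from the paper): separation radius (4) and fill distance (5)
# for a finite design X in Omega = [0, 1]^d.

def separation_radius(X):
    D = cdist(X, X)
    np.fill_diagonal(D, np.inf)
    return 0.5 * D.min()                 # q_n = (1/2) min_{i != j} ||X_i - X_j||

def fill_distance(X, reference):
    # h_n = sup_{x in Omega} min_i ||x - X_i||, approximated on a reference set
    return cdist(reference, X).min(axis=1).max()

d, m = 2, 10
grid = np.linspace(0.0, 1.0, m)
X = np.stack(np.meshgrid(grid, grid), axis=-1).reshape(-1, d)   # n = m^d points
ref = np.random.default_rng(0).random((20_000, d))              # proxy for Omega

print(separation_radius(X), fill_distance(X, ref))   # both scale as n^{-1/d}
```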

This paper is organized as follows. Section 2 provides various definitions, notation, and preliminaries, including reviews of kernel-based quadrature rules. Section 3 then establishes a rate of convergence for the worst case error of Bayesian quadrature in a Sobolev space. Section 4 presents the main contributions on the convergence analysis in misspecified settings, and Section 5 demonstrates these results by applying them to Bayesian quadrature. Finally, Section 6 concludes the paper with possible future directions.
Preliminary results. This paper expands on preliminary results reported in a conference paper by the authors [28]. Specifically, this paper is a complete version of the results presented in Section 5 of [28]. The current paper contains significant new material, mainly on the following points: (i) we establish the rate of convergence for Bayesian quadrature with deterministic design points, and show that it can achieve minimax optimal rates in Sobolev spaces (Section 3); (ii) we apply our general convergence guarantees in misspecified settings to the specific case of Bayesian quadrature, and reveal the conditions required for Bayesian quadrature to be robust to misspecification (Section 5). To make the contribution (ii) possible, we derive finite sample bounds on the quadrature error in misspecified settings (Section 4). These results are not included in the conference paper.

We also mention that this paper does not contain the results presented in Section 4 of the conference paper [28], which deal with randomized design points. For randomized design points, theoretical analysis can be done based on an approximation theory developed in the statistical learning theory literature [11]. On the other hand, the analysis in the deterministic case makes use of the approximation theory developed by [35], which is based on Calderón's decomposition formula in harmonic analysis [18]. This paper focuses on the deterministic case, and we will report a complete version of the randomized case in a forthcoming paper.
Related work. The setting of this paper is complementary to that of [41], in which the integrand is smoother than assumed. That paper proposes to apply the control functional method by [42] to Quasi Monte Carlo integration, in order to make it adaptable to the (unknown) greater smoothness of the integrand.

Another related line of research is the proposal of quadrature rules that are adaptive to less smooth integrands [13, 14, 15, 19, 22]. For instance, [19] proposed a kernel-based quadrature rule on a finite-dimensional sphere. Their method is essentially a Bayesian quadrature rule using a specific kernel designed for spheres. They derive convergence rates for this method in both well-specified and misspecified settings, and obtain results similar to ours. The current work differs from [19] mainly in two aspects: (i) we consider quadrature problems in standard Euclidean spaces, as opposed to spheres; (ii) we present a generic framework, as opposed to analyzing a specific quadrature rule.

Quasi Monte Carlo rules based on a certain digit interlacing algorithm [13, 14, 15, 22] have also been shown to be adaptive to the (unknown) lower smoothness of an integrand. These papers assume that an integrand lies in an anisotropic function class in which every function possesses (square-integrable) mixed partial derivatives of order $s$ in each variable. Examples of such spaces include Korobov spaces, Walsh spaces, and Sobolev spaces of dominating mixed smoothness (see e.g. [39, 16]). In their notation, an integer parameter called the interlacing factor can be regarded as the assumed smoothness. Then, if an integrand belongs to an anisotropic function class whose smoothness $s$ does not exceed the interlacing factor, a rate of the form $n^{-s+\varepsilon}$ (or $n^{-s-1/2+\varepsilon}$ in a randomized setting) is guaranteed for the quadrature error for arbitrary $\varepsilon > 0$. The present work differs from these works in that (i) isotropic Sobolev spaces are discussed, where the order of differentiability is identical in all directions of variables, and (ii) theoretical guarantees are provided for generic quadrature rules, as opposed to analyses of specific quadrature methods.

2 Preliminaries

2.1 Basic definitions and notation

We will use the following notation throughout the paper. The set of positive integers is denoted by $\mathbb{N}$, and $\mathbb{N}_0 := \mathbb{N} \cup \{0\}$. For $n \in \mathbb{N}$, we write $[n] := \{1, \dots, n\}$. The $d$-dimensional Euclidean space is denoted by $\mathbb{R}^d$, and the closed ball of radius $\delta > 0$ centered at $x \in \mathbb{R}^d$ by $B(x, \delta) := \{y \in \mathbb{R}^d : \|x - y\| \le \delta\}$. For $x \in \mathbb{R}$, $\lfloor x \rfloor$ is the greatest integer that is less than $x$. For a set $S \subset \mathbb{R}^d$, $\operatorname{diam}(S) := \sup_{x, y \in S} \|x - y\|$ is the diameter of $S$.

Let $p \in [1, \infty)$ and $\mu$ be a Borel measure on a Borel set $\Omega$ in $\mathbb{R}^d$. The Banach space of $p$-integrable functions is defined in the standard way with norm $\|f\|_{L^p(\mu)} := (\int_\Omega |f|^p \, d\mu)^{1/p}$, and $L^\infty(\mu)$ is the class of essentially bounded measurable functions on $\Omega$ with norm $\|f\|_{L^\infty(\mu)} := \operatorname{ess\,sup} |f|$. If $\mu$ is the Lebesgue measure on $\Omega$, we write $L^p(\Omega) := L^p(\mu)$, and further $L^p := L^p(\mathbb{R}^d)$ for $\Omega = \mathbb{R}^d$. For $f \in L^1(\mathbb{R}^d)$, its Fourier transform is defined by

$$\hat{f}(\omega) := (2\pi)^{-d/2} \int f(x) e^{-\sqrt{-1}\, x^T \omega} \, dx,$$

where $\sqrt{-1}$ denotes the imaginary unit.

For $m \in \mathbb{N}_0$ and an open set $\Omega$ in $\mathbb{R}^d$, $C^m(\Omega)$ denotes the vector space of all functions on $\Omega$ that are continuously differentiable up to order $m$, and $C_B^m(\Omega)$ the Banach space of all functions whose partial derivatives up to order $m$ are bounded and uniformly continuous. The norm of $C_B^m(\Omega)$ is given by $\|f\|_{C_B^m(\Omega)} := \max_{|\alpha| \le m} \sup_{x \in \Omega} |D^\alpha f(x)|$, where $D^\alpha$ is the partial derivative with multi-index $\alpha \in \mathbb{N}_0^d$. The Banach space of continuous functions on $\mathbb{R}^d$ that vanish at infinity is denoted by $C_0(\mathbb{R}^d)$, equipped with the sup norm. For a Banach space $B$, its norm is denoted by $\|\cdot\|_B$.

For a function $f$ and a measure $\mu$ on $\mathbb{R}^d$, the supports of $f$ and $\mu$ are denoted by $\operatorname{supp}(f)$ and $\operatorname{supp}(\mu)$, respectively. The restriction of $f$ to a subset $S \subset \mathbb{R}^d$ is denoted by $f|_S$.

Let $F_1$ and $F_2$ be normed vector spaces with norms $\|\cdot\|_{F_1}$ and $\|\cdot\|_{F_2}$, respectively. Then $F_1$ and $F_2$ are said to be norm-equivalent if $F_1 = F_2$ as a set and there exist constants $c_1, c_2 > 0$ such that $c_1 \|f\|_{F_2} \le \|f\|_{F_1} \le c_2 \|f\|_{F_2}$ for all $f \in F_1$. For a Hilbert space $H$ with inner product $\langle \cdot, \cdot \rangle_H$, the norm of $H$ is denoted by $\|\cdot\|_H := \sqrt{\langle \cdot, \cdot \rangle_H}$.

2.2 Sobolev spaces and reproducing kernel Hilbert spaces

Here we briefly review key facts regarding Sobolev spaces that are necessary for stating and proving our contributions; for details we refer to [1, 53, 6]. We first introduce reproducing kernel Hilbert spaces; for details, see, e.g., [52, Section 4] and [55, Section 10].

Let $\Omega$ be a set. A Hilbert space $\mathcal{H}$ of real-valued functions on $\Omega$ is a reproducing kernel Hilbert space (RKHS) if the evaluation functional $f \mapsto f(x)$ is continuous for any $x \in \Omega$. Let $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ be the inner product of $\mathcal{H}$. Then, for each $x \in \Omega$, there is a unique function $k_x \in \mathcal{H}$ such that $f(x) = \langle f, k_x \rangle_{\mathcal{H}}$ for all $f \in \mathcal{H}$. The kernel defined by $k(x, y) := k_x(y)$ is positive definite and is called the reproducing kernel of $\mathcal{H}$. It is known (Moore–Aronszajn theorem [2]) that for every positive definite kernel $k$ there exists a unique RKHS with $k$ as the reproducing kernel. Therefore, the notation $\mathcal{H}_k$ is used for the RKHS associated with $k$.

In the following, we will introduce two definitions of Sobolev spaces, i.e., (6) and (7), as both will be used throughout our analysis. For a measurable set $\Omega \subset \mathbb{R}^d$ and $m \in \mathbb{N}$, a Sobolev space of order $m$ on $\Omega$ is defined by

$$W_2^m(\Omega) := \{ f \in L^2(\Omega) : D^\alpha f \in L^2(\Omega) \text{ exists for all } |\alpha| \le m \}, \qquad (6)$$

where $D^\alpha f$ denotes the $\alpha$-th weak derivative of $f$ for a multi-index $\alpha \in \mathbb{N}_0^d$. This is a Hilbert space with inner product

$$\langle f, g \rangle_{W_2^m(\Omega)} := \sum_{|\alpha| \le m} \langle D^\alpha f, D^\alpha g \rangle_{L^2(\Omega)}.$$

For a positive real $r > 0$, another definition of the Sobolev space of order $r$ on $\mathbb{R}^d$ is given by

$$W_2^r(\mathbb{R}^d) := \{ f \in L^2(\mathbb{R}^d) : \phi_r \hat{f} \in L^2(\mathbb{R}^d) \}, \qquad (7)$$

where the function $\phi_r$ is defined by

$$\phi_r(\omega) := (1 + \|\omega\|^2)^{r/2}, \qquad \omega \in \mathbb{R}^d.$$

The inner product of $W_2^r(\mathbb{R}^d)$ is defined by

$$\langle f, g \rangle_{W_2^r(\mathbb{R}^d)} := \int \phi_r^2(\omega) \, \hat{f}(\omega) \overline{\hat{g}(\omega)} \, d\omega,$$

where $\overline{\hat{g}(\omega)}$ denotes the complex conjugate of $\hat{g}(\omega)$.

For a measurable set $\Omega$ in $\mathbb{R}^d$, the (fractional order) Sobolev space $W_2^r(\Omega)$ is defined by the restriction of $W_2^r(\mathbb{R}^d)$; namely (see, e.g., [53, Eq. (1.8) and Definition 4.10])

$$W_2^r(\Omega) := \{ f|_\Omega : f \in W_2^r(\mathbb{R}^d) \},$$

with its norm defined by

$$\|g\|_{W_2^r(\Omega)} := \inf \{ \|f\|_{W_2^r(\mathbb{R}^d)} : f \in W_2^r(\mathbb{R}^d), \ f|_\Omega = g \}.$$

If $r = m \in \mathbb{N}$ and $\Omega$ is an open set with Lipschitz boundary (see Definition 3), then $W_2^r(\Omega)$ defined in this way is norm-equivalent to the space (6) (see, e.g., [53, Eqs. (1.8), (4.20)]).

If $r > d/2$, the Sobolev space $W_2^r(\mathbb{R}^d)$ is an RKHS [55, Section 10]. In fact, the condition $r > d/2$ guarantees that the function $\omega \mapsto (1 + \|\omega\|^2)^{-r}$ is integrable, so that it has an (inverse) Fourier transform

$$\Phi(x) := \frac{2^{1-r}}{\Gamma(r)} \|x\|^{r - d/2} K_{d/2 - r}(\|x\|),$$

where $\Gamma$ denotes the Gamma function and $K_\alpha$ is the modified Bessel function of the third kind of order $\alpha$. The function $\Phi$ is positive definite, and the kernel $k(x, y) := \Phi(x - y)$ gives $W_2^r(\mathbb{R}^d)$ as an RKHS. This kernel is essentially a Matérn kernel [31, 32] with specific parameters. A Wendland kernel [54] also defines an RKHS that is norm-equivalent to $W_2^r(\mathbb{R}^d)$.
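For reference, the following sketch (ours, not from the paper) evaluates this Sobolev/Matérn kernel numerically; the constants follow the displayed formula, and the value at $x = y$ is the analytic limit, which requires $r > d/2$.

```python
import numpy as np
from scipy.special import gamma, kv   # kv = modified Bessel function of the third kind

# Sketch (not from the paper): the kernel Phi(x - y) whose Fourier transform
# decays as (1 + ||w||^2)^{-r}, so that its RKHS is norm-equivalent to W_2^r(R^d).

def sobolev_matern(x, y, r, d):
    """k(x, y) = 2^{1-r}/Gamma(r) * t^{r-d/2} * K_{d/2-r}(t), with t = ||x - y||."""
    t = np.linalg.norm(np.atleast_1d(x) - np.atleast_1d(y))
    if t == 0.0:
        return 2.0 ** (-d / 2) * gamma(r - d / 2) / gamma(r)   # limit as t -> 0
    return 2.0 ** (1.0 - r) / gamma(r) * t ** (r - d / 2) * kv(d / 2 - r, t)

print(sobolev_matern(0.3, 0.0, r=2.0, d=1))   # smoothness r = 2 on the real line
```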

2.3 Kernel-based quadrature rules

We briefly review basic facts regarding kernel-based quadrature rules necessary to describe our results. For details we refer to [9, 16].

Let $\Omega \subset \mathbb{R}^d$ be an open set, $k$ be a measurable kernel on $\Omega$, and $\mathcal{H}_k$ be the RKHS of $k$ with inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}_k}$. Suppose $P$ is a Borel probability measure on $\mathbb{R}^d$ with its support contained in $\Omega$, and $\{(w_i, X_i)\}_{i=1}^n \subset \mathbb{R} \times \Omega$ are weighted points, which serve for quadrature. For an integrand $f$, define $Pf$ and $P_n f$ as the integral and a quadrature estimate, respectively; namely,

$$Pf := \int f(x) \, dP(x), \qquad P_n f := \sum_{i=1}^n w_i f(X_i).$$

As mentioned in Section 1, a kernel quadrature rule aims at minimizing the worst case error

$$e_n(\mathcal{H}_k) := \sup_{f \in \mathcal{H}_k: \, \|f\|_{\mathcal{H}_k} \le 1} |Pf - P_n f|. \qquad (8)$$

Assume $\int \sqrt{k(x, x)} \, dP(x) < \infty$, and define $\mu_P \in \mathcal{H}_k$ (in the machine learning literature, the function $\mu_P$ is known as the kernel mean embedding, and the worst case error is called the maximum mean discrepancy; these have been used in a variety of problems including two-sample testing [50, 23, 34]) by

$$\mu_P := \int k(\cdot, x) \, dP(x), \qquad (9)$$

where the integral of $k(\cdot, x) \in \mathcal{H}_k$ is understood as the Bochner integral.

where the integral for is understood as the Bochner integral. It is easy to see that, for all ,

The worst case error (8) can then be written as

(10)

and for any

It follows from (10) that

(11)

The integrals in (11) are known in closed form for many pairs of $k$ and $P$ (see e.g. Table 1 of [9]); for instance, this is the case if $k$ is a Wendland kernel and $P$ is the uniform distribution on a ball in $\mathbb{R}^d$. One can then explicitly use the formula (11) in order to obtain weighted points $\{(w_i, X_i)\}_{i=1}^n$ that minimize the worst case error (8).
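As an illustration (ours, not from the paper), the following sketch evaluates (11) for one pair where all terms are available in closed form: the Brownian-motion kernel $k(x, y) = \min(x, y)$ on $[0, 1]$ with $P$ the uniform distribution, for which $\mu_P(x) = x - x^2/2$ and $\iint k \, dP \, dP = 1/3$.

```python
import numpy as np

# Sketch (not from the paper): squared worst case error (11),
#   e_n^2 = double integral - 2 sum_i w_i mu_P(X_i) + sum_{ij} w_i w_j k(X_i, X_j),
# for k(x, y) = min(x, y) and P = Uniform[0, 1].

def wce_squared(X, w, kern, mu_P, double_integral):
    K = kern(X[:, None], X[None, :])
    return double_integral - 2.0 * w @ mu_P(X) + w @ K @ w

X = np.linspace(0.05, 0.95, 10)                       # design points
w = np.full(X.size, 1.0 / X.size)                     # equal weights
e2 = wce_squared(X, w, np.minimum, lambda x: x - x**2 / 2, 1.0 / 3.0)
print(np.sqrt(e2))                                    # the worst case error e_n
```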

2.4 Examples of kernel-based quadrature rules

Bayesian quadrature. This is a class of kernel-based quadrature rules that has been studied extensively in the literature on statistics and machine learning [12, 44, 33, 21, 45, 26, 24, 8, 9, 7, 46, 4, 42]. In Bayesian quadrature, design points may be obtained jointly in a deterministic manner [12, 44, 33, 9, 46], sequentially (adaptively) [45, 26, 24, 8], or randomly [21, 9, 7, 4, 42]. For instance, [9] proposes to generate design points randomly as a Markov chain Monte Carlo sample, or deterministically by a Quasi Monte Carlo rule, specifically as a higher-order digital net [14].

Given that the design points are fixed, quadrature weights are then obtained by minimization of the worst case error (11), which can be done analytically by solving a linear system of size $n$. To describe this, let $X_1, \dots, X_n$ be design points such that the kernel matrix $K := (k(X_i, X_j))_{i,j=1}^n \in \mathbb{R}^{n \times n}$ is invertible. The weights $w := (w_1, \dots, w_n)^T$ are then given by

$$w = K^{-1} z, \qquad (12)$$

where $z := (\mu_P(X_1), \dots, \mu_P(X_n))^T \in \mathbb{R}^n$, with $\mu_P$ defined in (9).

This way of constructing the estimate $P_n f$ is called Bayesian quadrature, since $P_n f$ can be seen as a posterior estimate in a certain Bayesian inference problem with $f$ generated as a sample of a Gaussian process (see, e.g., [26] and [9]).
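A minimal sketch of (12) (ours, not from the paper), reusing the Brownian-motion kernel and uniform $P$ from the previous example so that $z$ is available in closed form; a small jitter term is added as a standard safeguard against ill-conditioning of $K$:

```python
import numpy as np

# Sketch (not from the paper): Bayesian quadrature weights w = K^{-1} z (12),
# with k(x, y) = min(x, y) on [0, 1], P = Uniform[0, 1], mu_P(x) = x - x^2/2.

def bq_weights(X, kern, mu_P, jitter=1e-10):
    K = kern(X[:, None], X[None, :])
    z = mu_P(X)                                        # z_i = mu_P(X_i)
    return np.linalg.solve(K + jitter * np.eye(X.size), z)

X = np.linspace(0.05, 0.95, 10)
w = bq_weights(X, np.minimum, lambda x: x - x**2 / 2)

f = lambda x: np.sin(3.0 * x) ** 2                     # hypothetical integrand
print(w @ f(X))                                        # estimate P_n f
```

By construction, these weights minimize the closed-form expression (11) over $w \in \mathbb{R}^n$ for the fixed design points.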
Quasi Monte Carlo. Quasi Monte Carlo (QMC) methods are equal-weight quadrature rules designed for the uniform distribution on a hyper-cube [16]. Modern QMC methods make use of RKHSs and the associated kernels to define and calculate the worst case error in order to obtain good design points (e.g. [25, 48, 13, 17]). Therefore, such QMC methods are instances of kernel-based quadrature rules; see [39] and [16] for a review.
Kernel herding. In the machine learning literature, an equal-weight quadrature rule called kernel herding [10] has been studied extensively [26, 5, 30, 27]. It is an algorithm that greedily searches for design points so as to minimize the worst case error in an RKHS; a sketch of one common form of the greedy update is given below. In contrast to QMC methods, kernel herding may be used with an arbitrary distribution $P$ on a generic measurable space, given that the integral $\int k(\cdot, x) \, dP(x)$ admits a closed form solution for the reproducing kernel $k$. It has been shown that a fast rate $O(n^{-1})$ is achievable for the worst case error when the RKHS is finite dimensional [10]. While empirical studies indicate that the fast rate would also hold in the case of an infinite dimensional RKHS, its theoretical proof remains an open problem [5].
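The sketch below (ours, not from the paper) implements one common form of the herding update, $x_{t+1} = \operatorname{argmax}_x \, [\mu_P(x) - \frac{1}{t+1} \sum_{i \le t} k(x, X_i)]$, with the argmax approximated by a grid search and the same closed-form $\mu_P$ as above.

```python
import numpy as np

# Sketch (not from the paper): kernel herding on [0, 1] with
# k(x, y) = min(x, y), P = Uniform[0, 1], mu_P(x) = x - x^2/2.
# Each greedy step picks the point that most reduces the worst case error (10).

def kernel_herding(n, kern, mu_P, grid):
    X = []
    for t in range(n):
        scores = mu_P(grid)
        if X:
            scores = scores - kern(grid[:, None], np.array(X)[None, :]).sum(axis=1) / (t + 1)
        X.append(grid[np.argmax(scores)])
    return np.array(X)

grid = np.linspace(0.0, 1.0, 1001)
X = kernel_herding(10, np.minimum, lambda x: x - x**2 / 2, grid)
print(X)    # use with equal weights w_i = 1/n
```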

3 Convergence rates of Bayesian quadrature

This section discusses the convergence rates of Bayesian quadrature in well-specified settings. It is shown that Bayesian quadrature can achieve the minimax optimal rates for deterministic quadrature rules in Sobolev spaces. The result also serves as a preliminary to Section 5, where misspecified cases are considered.

Let $\Omega$ be an open set in $\mathbb{R}^d$ and $r > d/2$. The main notion used to express the convergence rate is the fill distance (5), which plays a central role in the literature on scattered data approximation [55], and has been used in the theoretical analysis of Bayesian quadrature in [9, 40]. However, it is necessary to introduce some conditions on $\Omega$. The first one is the interior cone condition [55, Definition 3.6], which is a regularity condition on the boundary of $\Omega$. A cone with vertex $x \in \mathbb{R}^d$, direction $\xi \in \mathbb{R}^d$ (with $\|\xi\| = 1$), angle $\theta \in (0, \pi/2)$, and radius $\delta > 0$ is defined by

$$C(x, \xi, \theta, \delta) := \{ x + \lambda y : y \in \mathbb{R}^d, \ \|y\| = 1, \ y^T \xi \ge \cos\theta, \ \lambda \in [0, \delta] \}.$$

Definition 1 (Interior cone condition).

A set $\Omega \subset \mathbb{R}^d$ is said to satisfy an interior cone condition if there exist an angle $\theta \in (0, \pi/2)$ and a radius $\delta > 0$ such that every $x \in \Omega$ is associated with a unit vector $\xi(x)$ so that the cone $C(x, \xi(x), \theta, \delta)$ is contained in $\Omega$.

The interior cone condition requires that there is no 'pinch point' on the boundary of $\Omega$; see also [40]. Next, the notions of a special Lipschitz domain [51, p.181] and a Lipschitz boundary (the definition of the Lipschitz boundary in [6] is identical to that of the minimally smooth boundary in [51, p.189]; this boundary condition was introduced by Elias M. Stein to prove the so-called Stein's extension theorem for Sobolev spaces [51, p.181]) are defined as follows (see [51, p.189]; [6, Definition 1.4.4]).

Definition 2 (Special Lipschitz domain).

For $d \ge 2$, an open set $\Omega \subset \mathbb{R}^d$ is called a special Lipschitz domain if there exist a rotation of $\Omega$, denoted by $\tilde{\Omega}$, and a function $\varphi: \mathbb{R}^{d-1} \to \mathbb{R}$ that satisfy the following:

  1. $\tilde{\Omega} = \{ (x, y) \in \mathbb{R}^{d-1} \times \mathbb{R} : y > \varphi(x) \}$;

  2. $\varphi$ is a Lipschitz function such that $|\varphi(x) - \varphi(x')| \le M \|x - x'\|$ for all $x, x' \in \mathbb{R}^{d-1}$, where $M > 0$.

The smallest constant $M$ for $\varphi$ is called the Lipschitz bound of $\tilde{\Omega}$.

Definition 3 (Lipschitz boundary).

Let $\Omega \subset \mathbb{R}^d$ be an open set and $\partial\Omega$ be its boundary. Then $\partial\Omega$ is called a Lipschitz boundary if there exist constants $\varepsilon > 0$, $N \in \mathbb{N}$, $M > 0$, and open sets $\{U_i\}_{i \in I}$, where $I \subset \mathbb{N}$ is a (possibly infinite) index set, such that the following conditions are satisfied:

  1. For any $x \in \partial\Omega$, there exists an index $i$ such that $B(x, \varepsilon) \subset U_i$, where $B(x, \varepsilon)$ is the ball centered at $x$ with radius $\varepsilon$;

  2. $U_{i_1} \cap \cdots \cap U_{i_{N+1}} = \emptyset$ for any distinct indices $i_1, \dots, i_{N+1}$;

  3. For each index $i$, there exists a special Lipschitz domain $\Omega_i \subset \mathbb{R}^d$ with Lipschitz bound not exceeding $M$ such that $U_i \cap \Omega = U_i \cap \Omega_i$.

Examples of a set $\Omega$ having a Lipschitz boundary include: (i) $\Omega$ is an open bounded set whose boundary $\partial\Omega$ is an embedded $C^1$ hypersurface in $\mathbb{R}^d$; (ii) $\Omega$ is an open bounded convex set [51, p.189].

Proposition 4.

Let $\Omega \subset \mathbb{R}^d$ be a bounded open set such that an interior cone condition is satisfied and the boundary $\partial\Omega$ is Lipschitz, and let $P$ be a probability distribution on $\Omega$ with a bounded density function $p$ such that $\operatorname{supp}(P) \subset \Omega$. For $r \in \mathbb{R}$ with $r > d/2$, suppose that $k$ is a kernel on $\mathbb{R}^d$ that satisfies Assumption 1 and that $\mathcal{H}_k(\Omega)$ is the RKHS of $k$ restricted to $\Omega$. Suppose that $X_1, \dots, X_n \in \Omega$ are finite points such that the kernel matrix $K = (k(X_i, X_j))_{i,j=1}^n$ is invertible, and that $w_1, \dots, w_n$ are the quadrature weights given by (12). Then there exist constants $C > 0$ and $h_0 > 0$ independent of $X_1, \dots, X_n$ such that

$$e_n(\mathcal{H}_k(\Omega)) \le C h_n^r,$$

provided that $h_n \le h_0$, where $e_n(\mathcal{H}_k(\Omega))$ is the worst case error (8) for the quadrature rule $\{(w_i, X_i)\}_{i=1}^n$.

Proof.

The proof idea is borrowed from [9, Theorem 1]. Let $f \in \mathcal{H}_k(\Omega)$ be arbitrary and fixed. Define a function $s_f$ by

$$s_f := \sum_{i=1}^n a_i k(\cdot, X_i), \quad \text{where } (a_1, \dots, a_n)^T := K^{-1} f_X \text{ and } f_X := (f(X_1), \dots, f(X_n))^T.$$

This function is an interpolant of $f$ on $X_1, \dots, X_n$ such that

$$s_f(X_i) = f(X_i) \quad \text{for all } i = 1, \dots, n.$$

Moreover, $s_f$ is the minimum-norm interpolant, so that $\|s_f\|_{\mathcal{H}_k(\Omega)} \le \|f\|_{\mathcal{H}_k(\Omega)}$. It follows from the norm-equivalence between $\mathcal{H}_k(\Omega)$ and $W_2^r(\Omega)$ that $f - s_f \in W_2^r(\Omega)$ and

$$\|f - s_f\|_{W_2^r(\Omega)} \le C_0 \|f - s_f\|_{\mathcal{H}_k(\Omega)} \le 2 C_0 \|f\|_{\mathcal{H}_k(\Omega)}, \qquad (13)$$

where $C_0 > 0$ is a constant.

We see that $P_n f = P s_f$. In fact, recalling that the weights are defined as $w = K^{-1} z$, where $z = (\mu_P(X_1), \dots, \mu_P(X_n))^T$ with $\mu_P$ defined in (9), it follows that

$$P_n f = \sum_{i=1}^n w_i f(X_i) = z^T K^{-1} f_X = \sum_{i=1}^n a_i \mu_P(X_i) = P s_f.$$

Using this identity, we have

$$|Pf - P_n f| = |P(f - s_f)| \le \|p\|_{L^\infty(\Omega)} \|f - s_f\|_{L^1(\Omega)} \le C_1 h_n^r \|f - s_f\|_{W_2^r(\Omega)} \qquad (14)$$

$$\le 2 C_0 C_1 h_n^r \|f\|_{\mathcal{H}_k(\Omega)}, \qquad (15)$$

where (14) follows from Theorem 11.32 and Corollary 11.33 in [55] (applied to the function $f - s_f$, which vanishes on the design points, with the $L^1(\Omega)$-norm on the left-hand side and the $W_2^r(\Omega)$-norm on the right-hand side), and (15) from (13). Note that the constant $C_1$ depends only on $d$, $r$, and the constants in the interior cone condition (which follows from the fact that Theorem 11.32 in [55] is derived from Proposition 11.30 in [55]). Setting $C := 2 C_0 C_1$ and taking the supremum over $\|f\|_{\mathcal{H}_k(\Omega)} \le 1$ completes the proof. ∎

Remark 1.
  • Typically the fill distance $h_n$ decreases to $0$ as the number of design points $n$ increases. Therefore the upper bound $C h_n^r$ provides a faster rate of convergence for a larger value $r$ of the degree of smoothness.

  • The condition $h_n \le h_0$ requires that the design points cover the set $\Omega$ to a certain extent in order to guarantee that the error bound holds. This requirement arises because we used a result from the scattered data approximation literature [55, Corollary 11.33] to derive the inequality (14) in our proof. In that literature such a condition is necessary, and we refer the interested reader to Section 11 of [55] and the references therein.

  • The constant $h_0$ depends only on the constants $\theta$ and $\delta$ in the interior cone condition (Definition 1); its explicit form is given in [55, p.199].

The following is an immediate corollary to Proposition 4.

Corollary 5.

Assume that $\Omega$, $P$, and $k$ satisfy the conditions in Proposition 4. Suppose that $X_1, \dots, X_n$ are finite points such that $K = (k(X_i, X_j))_{i,j=1}^n$ is invertible and $h_n = O(n^{-a})$ for some $a > 0$ as $n \to \infty$, and that $w_1, \dots, w_n$ are the quadrature weights given by (12) based on $X_1, \dots, X_n$. Then we have

$$e_n(\mathcal{H}_k(\Omega)) = O(n^{-ar}) \quad (n \to \infty), \qquad (16)$$

where $e_n(\mathcal{H}_k(\Omega))$ is the worst case error of the quadrature rule $\{(w_i, X_i)\}_{i=1}^n$.

Remark 2.
  • The result (16) implies that the same rate is attainable for the Sobolev space $W_2^r(\Omega)$ (instead of $\mathcal{H}_k(\Omega)$):

    $$\sup_{f \in W_2^r(\Omega): \, \|f\|_{W_2^r(\Omega)} \le 1} |Pf - P_n f| = O(n^{-ar}) \quad (n \to \infty),$$

    with (the sequence of) the same weighted points $\{(w_i, X_i)\}_{i=1}^n$. This follows from the norm-equivalence between $\mathcal{H}_k(\Omega)$ and $W_2^r(\Omega)$.

  • If the fill distance satisfies $h_n = O(n^{-1/d})$ as $n \to \infty$, then $e_n(\mathcal{H}_k(\Omega)) = O(n^{-r/d})$. This rate is minimax optimal for deterministic quadrature rules for the Sobolev space $W_2^r$ on a hyper-cube [37, Proposition 1 in Section 1.3.12]. Corollary 5 thus shows that Bayesian quadrature achieves the minimax optimal rate in this setting.

  • The decay rate $h_n = O(n^{-1/d})$ for the fill distance holds when, for example, the design points are equally-spaced grid points in $\Omega$ (see the sketch following this remark). Note that this rate cannot be improved: if the fill distance decreased at a rate faster than $n^{-1/d}$, then $e_n(\mathcal{H}_k(\Omega))$ would decrease more quickly than the minimax optimal rate $n^{-r/d}$, which is a contradiction.
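The scaling in Corollary 5 can be checked empirically in the simplest case (a sketch of ours, not from the paper): with the Brownian-motion kernel $k(x, y) = \min(x, y)$ on $[0, 1]$, whose RKHS behaves like a Sobolev space with $r = 1$, $d = 1$, equally-spaced points give $h_n = O(n^{-1})$, so the Bayesian quadrature worst case error should decay roughly like $n^{-1}$.

```python
import numpy as np

# Sketch (not from the paper): empirical rate check for Bayesian quadrature
# with k(x, y) = min(x, y) (smoothness r = 1, d = 1) and P = Uniform[0, 1].

def bq_worst_case_error(n):
    X = (np.arange(n) + 0.5) / n                 # equally spaced design points
    K = np.minimum(X[:, None], X[None, :])
    z = X - X**2 / 2                             # z_i = mu_P(X_i)
    w = np.linalg.solve(K, z)                    # Bayesian quadrature weights (12)
    e2 = 1.0 / 3.0 - 2.0 * w @ z + w @ K @ w     # squared worst case error (11)
    return np.sqrt(max(e2, 0.0))

for n in (10, 20, 40, 80):
    print(n, bq_worst_case_error(n))             # doubling n roughly halves e_n
```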

4 Main results

This section presents the main results on misspecified settings. Two results based on different assumptions are discussed: one on the quadrature weights in Section 4.1, and the other on the design points in Section 4.2. The approximation theory for Sobolev spaces developed by [35] is employed in the results.

4.1 Convergence rates under an assumption on quadrature weights

Theorem 6.

Let $\Omega \subset \mathbb{R}^d$ be an open set whose boundary is Lipschitz, $P$ be a probability distribution on $\Omega$ with $\operatorname{supp}(P) \subset \Omega$, $r$ be a real number with $r > d/2$, and $s$ be a natural number with $s \le r$. Let $k$ denote a kernel on $\mathbb{R}^d$ satisfying Assumption 1, and $\mathcal{H}_k(\Omega)$ the RKHS of $k$ restricted to $\Omega$. Then, for any $f \in C_B^s(\Omega)$,