# A minimax near-optimal algorithm for adaptive rejection sampling

Rejection Sampling is a fundamental Monte-Carlo method. It is used to sample from distributions admitting a probability density function which can be evaluated exactly at any given point, albeit at a high computational cost. Without proper tuning, however, the technique suffers from a high rejection rate. Several methods have been explored to cope with this problem, based on the principle of adaptively estimating the density by a simpler function, using the information of the previous samples. Most of them either rely on strong assumptions on the form of the density, or do not offer any theoretical performance guarantee. We give the first theoretical lower bound for the problem of adaptive rejection sampling and introduce a new algorithm which guarantees a near-optimal rejection rate in a minimax sense.


## Code Repositories

### NNARS

Code for the paper: A minimax near-optimal algorithm for adaptive rejection sampling.


## 1 Introduction

The breadth of applications requiring independent sampling from a probability distribution is sizable. Numerous classical statistical results, in particular those involved in machine learning, rely on the independence assumption. For some densities, direct sampling may not be tractable, and the evaluation of the density at a given point may be costly. Rejection sampling (RS) is a well-known Monte-Carlo method for sampling from a density $f$ when direct sampling is not tractable (see Von Neumann, 1951; Devroye, 1986). It assumes access to a density $g$, called the proposal density, and a positive constant $M$, called the rejection constant, such that $f$ is upper-bounded by $Mg$, which is called the envelope. Sampling from $g$ is assumed to be easy. At every step, the algorithm draws a proposal sample $X$ from the density $g$ and a point $U$ from the uniform distribution on $[0,1]$, and accepts $X$ if $U$ is smaller than the ratio of $f(X)$ and $Mg(X)$; otherwise it rejects $X$. The algorithm outputs all accepted samples, which can be proven to be independent and identically distributed samples from the density $f$. This is to be contrasted with Markov Chain Monte Carlo (MCMC) methods, which produce a sequence of dependent samples and therefore fulfill a different objective. Besides, the applications of rejection sampling include variational inference:

Naesseth et al. (2016, 2017) generalize the reparametrization trick to distributions which can be generated by rejection sampling.

Rejection sampling has a very intuitive geometrical interpretation. Consider the pair $(X, U M g(X))$, where $X$, $U$, $g$ and $M$ are defined as above. As shown in Figure 4 in the Supplementary Material, this pair has a uniform distribution on the region under the graph of $Mg$, and the sample is accepted if it falls into the region under the graph of $f$. Conditionally on acceptance, $(X, U M g(X))$ is then drawn uniformly from the area under the graph of $f$, so $X$ is drawn from the distribution with density $f$. The acceptance probability is the ratio of the two areas, $1/M$. This means that the closer $Mg$ is to $f$ (and $M$ to $1$), the more samples are accepted. The goal is hence to find a good envelope of $f$ in order to reject as few samples as possible. In the absence of prior knowledge on the target density $f$, the proposal is typically the uniform density on a set including the support of $f$ (here assumed to be compact), and the rejection constant $M$ is set as an upper bound on $f$. Consequently, this method leads to rejecting many samples in most cases, and $f$ is evaluated many times uselessly.
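As an illustration, the accept/reject step just described can be sketched in a few lines of Python. This is a minimal sketch with an illustrative target $f(x)=2x$ on $[0,1]$ and a uniform proposal; the function names are ours, not the paper's.

```python
import random

def rejection_sampling(f, g_sample, g_pdf, M, n_evals):
    """Plain rejection sampling: accept X ~ g when U <= f(X) / (M * g(X)).

    `f` is the target density, `g_sample`/`g_pdf` the proposal sampler and
    density, and `M` a constant with f <= M * g everywhere (the envelope).
    Uses exactly `n_evals` evaluations of `f` and returns the accepted
    points, which are i.i.d. draws from f.
    """
    accepted = []
    for _ in range(n_evals):
        x = g_sample()
        u = random.random()
        if u <= f(x) / (M * g_pdf(x)):
            accepted.append(x)
    return accepted

# Example: sample from f(x) = 2x on [0, 1] with a uniform proposal, M = 2.
samples = rejection_sampling(lambda x: 2 * x,
                             random.random,   # uniform proposal sampler
                             lambda x: 1.0,   # uniform proposal density
                             M=2.0, n_evals=10000)
```

With $M = 2$ and a uniform proposal, roughly half of the evaluations are accepted, matching the $1/M$ acceptance probability discussed above.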

Adaptive rejection sampling is a variant motivated by the high number of rejected samples mentioned above. Given a budget $n$ of evaluations of $f$, the goal is to maximize the number of output samples, which have to be drawn independently from $f$. In other words, the proportion of rejected samples among the $n$ evaluations, also called the rejection rate, is to be made as small as possible, as in standard rejection sampling. To achieve this maximal number of output samples, adaptive rejection sampling methods gradually improve the proposal function and the rejection constant by using the information given by the evaluations of $f$ at the previous proposal samples. These samples are used to estimate $f$, and to bound it tightly from above.

### 1.2 Literature review

#### Closely related works.

A recent approach in Erraqabi et al. (2016), pliable rejection sampling (PRS), allows sampling from multivariate densities satisfying mild regularity assumptions; in particular, the function $f$ is of a given $s$-Hölder regularity. PRS is a two-step adaptive algorithm, based on the use of non-parametric kernel methods for estimating the target density. Assume that PRS is given a budget of $n$ evaluations of the function $f$. For a density defined on a compact domain, PRS first evaluates $f$ on a number of points uniformly drawn in the domain of $f$. It uses these evaluations to produce an estimate of the density $f$ using kernel regression. Then it builds a proposal density using a high-probability confidence bound on the estimate of $f$. The associated rejection constant is then the renormalization constant. The proposal density multiplied by the rejection constant is proven to be, with high probability, a correct envelope, i.e., an upper bound for $f$. PRS then applies rejection sampling with such an envelope for the remaining budget. This method provides with high probability a perfect sampler, i.e., a sampler which outputs i.i.d. samples from the density $f$. It also comes with efficiency guarantees: in dimension $d$, for $n$ large enough, PRS reaches an average rejection rate that vanishes as $n$ grows. This means that it asymptotically accepts almost all the samples. However, there is no guarantee that this rate might not be improved using another algorithm: no lower bound on the rejection rate over all algorithms is presented.
Another recent related sampling method is A* sampling (Maddison et al., 2014). It is close to the OS* algorithm from Dymetman et al. (2012) and relies on an extension of the Gumbel-max trick. The trick enables sampling from a categorical distribution over classes $i$ with probability proportional to $\phi(i)$, where $\phi$ is an unnormalized mass. It uses the following property of the Gumbel distribution: adding Gumbel noise to each of the $\log \phi(i)$'s and taking the argmax of the resulting variables returns $i$ with a probability proportional to $\phi(i)$. The authors then generalize the Gumbel-max trick to continuous distributions. This method shows good empirical efficiency in the number of evaluations of the target density. However, the assumption that the density can be decomposed into a bounded function and a function that is easy to integrate and sample from is rarely true in practice.

#### Other related works.

Gilks and Wild (1992) introduced ARS: a technique of adaptive rejection sampling for one-dimensional log-concave and differentiable densities whose derivative can be evaluated. ARS sequentially builds a tight envelope of the density by exploiting the concavity of $\log f$ in order to bound it from above. At each step, it samples a point from a proposal density, evaluates $f$ at this point, and updates the current envelope to a new one which is closer to $f$. The proposal density and the envelope thus converge towards $f$, while the rejection constant converges towards $1$. The rejection rate is thereby improved. Gilks (1992) also developed an alternative to this ARS algorithm for the case where the density is not differentiable or the derivative cannot be evaluated. The main difference with the former method is that the computation of the new proposal does not require any evaluation of the derivative. For this algorithm, as for the one presented in Gilks et al. (1995), the assumption that the density is log-concave represents a substantial constraint in practice. In particular, it restricts the use of ARS to unimodal densities.
An extension of ARS from Hörmann (1995) adapts it to $T$-concave densities, with $T$ a monotonically increasing transformation. However, this method still cannot be used with multimodal densities. In 1998, Evans and Swartz proposed a method applicable to multimodal densities, presented in Evans and Swartz (1998), which extends the former one. It deals with $T$-transformed densities and spots the intervals where the transformed density is concave or convex. Then it applies an ARS-like method separately on each of these intervals. However, it needs access to the inflection points, which is a strong requirement. A more general method in Görür and Teh (2011) consists of decomposing the log of the target density into a sum of a concave and a convex function. It deals with these two components separately. An obvious drawback of this technique is the necessity of the decomposition itself, which may be a difficult task. Similarly, Martino and Míguez (2011) deal with cases where the log-density can be expressed as a sum of compositions of convex functions and of functions that are either convex or concave. This represents a relatively broad class of functions; other variants focusing on the computational cost of ARS have been explored in Martino (2017); Martino and Louzada (2017).
For all the methods previously introduced, no theoretical efficiency guarantees are available.

A further attempt at improving simple rejection sampling resulted in Adaptive Rejection Metropolis Sampling (ARMS) (Gilks et al., 1995). ARMS extends ARS to cases where densities are no longer assumed to be log-concave. It builds a proposal function whose formula is close to the one in Gilks (1992). This time, however, the proposal might not be an envelope, which would normally lead to oversampling in the regions where the proposal is smaller than the density. In ARMS, this is compensated by a Metropolis-Hastings control step. One drawback of this method is that it outputs a Markov chain, in which the samples are correlated. Moreover, the chain may be trapped in a single mode. Improved Adaptive Rejection Metropolis Sampling (Martino et al., 2012) modifies ARMS in order to ensure that the proposal density tends to the target density. In Meyer et al. (2008), an alternative is presented that uses polynomial interpolations as proposal functions. However, this method still yields correlated samples.

Markov Chain Monte Carlo (MCMC) methods (Metropolis and Ulam, 1949; Andrieu et al., 2003) represent a very popular set of generic approaches to sample from a distribution. Although they scale better with dimension than rejection sampling, they are not perfect samplers, as they do not produce i.i.d. samples, and can therefore not be applied to achieve our goals. Variants producing independent samples were proposed in Fill (1997); Propp and Wilson (1998). However, to the best of our knowledge, no theoretical studies on the rejection rate of these variants are available in the literature.

Importance sampling is a problem close to rejection sampling, and adaptive importance sampling algorithms are also available (see e.g., Oh and Berger, 1992; Cappé et al., 2008; Ryu and Boyd, 2014). Among them, the algorithm in Zhang (1996) sequentially estimates the target function, whose integral has to be computed, using kernel regression, similarly to the approach of Erraqabi et al. (2016). A recent notable method regarding discrete importance sampling was introduced in Canévet et al. (2016). In Delyon and Portier (2018), adaptive importance sampling is shown to be efficient in terms of asymptotic variance.

### 1.3 Our contributions

The above-mentioned sampling methods either do not provide i.i.d. samples, or do not come with theoretical efficiency guarantees, apart from Erraqabi et al. (2016), and from Zhang (1996); Delyon and Portier (2018) in importance sampling. In the present work, we propose the Nearest Neighbour Adaptive Rejection Sampling algorithm (NNARS), an adaptive rejection sampling technique which requires $f$ to have $s$-Hölder regularity (see Assumption 2.2). Our contributions are threefold, since NNARS:

• is a perfect sampler for sampling from the density $f$.

• offers an average rejection rate of order $\log^2(n)\, n^{-s/d}$ for $n$ large enough. This significantly improves the state-of-the-art average rejection rate from Erraqabi et al. (2016) over $s$-Hölder densities.

• matches a lower bound on the rejection rate over the class of all adaptive rejection sampling algorithms and all $s$-Hölder densities. This answers the theoretical problem of quantifying the difficulty of adaptive rejection sampling in the minimax sense: NNARS offers a near-optimal average rejection rate, in the minimax sense, over the class of Hölder densities.

NNARS follows a common approach to that of most adaptive rejection sampling methods. It relies on non-parametric estimation of $f$. It improves this estimation iteratively, and as the estimate gets closer to $f$, the envelope also approaches $f$. Our improvements consist of designing an optimal envelope, and of updating the envelope as we get more information, at carefully chosen times. This leads to an average rejection rate for NNARS which is minimax near-optimal (up to a logarithmic term) over the class of Hölder densities. No adaptive rejection algorithm can perform significantly better on this class. The proof of the minimax lower bound is also new, to the best of our knowledge.

The optimal envelope we construct is a very simple one. For every point where the target density $f$ has been evaluated, we use the regularity assumptions on $f$ in order to construct an envelope which is piecewise constant: it stays constant in the neighborhood of every known point of $f$. Figure 1 depicts NNARS' first steps on a mixture of Gaussians in dimension 1.

In the second section of this paper, we set the problem formally and discuss the assumptions that we make. In the third section, we introduce the NNARS algorithm and provide a theoretical upper bound on its rejection rate. In the fourth section, we present a minimax lower bound for the problem of adaptive rejection sampling. In the fifth section, we discuss our method and detail the open questions regarding NNARS. In the sixth section, we present experimental results on both simulated and real data that compare our strategy with state of the art algorithms for adaptive rejection sampling. Finally, the Supplementary Material contains the proofs of all the results presented in this paper.

## 2 Setting

Let $f$ be a bounded density defined on $[0,1]^d$. The objective is to provide an algorithm which outputs as many i.i.d. samples drawn according to $f$ as possible, with a fixed number $n$ of evaluations of $f$. We call $n$ the budget.

### 2.1 Description of the problem

The framework that we consider is sequential and adaptive rejection sampling.

Set $t = 1$ and let $n$ be the budget. An ARS method sequentially performs $n$ steps. At each step $t$, the samples collected until time $t-1$, each in $[0,1]^d$, are known to the learner, as well as their images by $f$. The learner chooses a positive constant $M_t$ and a density $g_t$ defined on $[0,1]^d$, both depending on the previous samples and on the evaluations of $f$ at these points. Then the learner performs a rejection sampling step with the proposal $g_t$ and rejection constant $M_t$, as depicted in Algorithm 1. It generates a point $X_t$ from $g_t$ and a variable $U_t$ that is independent from every other variable and drawn uniformly from $[0,1]$. $X_t$ is accepted as a sample from $f$ if $U_t \le f(X_t)/(M_t g_t(X_t))$ and rejected otherwise. If it is accepted, the output of the step is $X_t$; otherwise the output is empty. Once the rejection sampling step is complete, the learner adds the output of this step to the set of accepted samples. The learner iterates until the budget of $n$ evaluations of $f$ has been spent.
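The sequential protocol just described can be sketched as follows; the `propose` callback, which maps the history of evaluated points to a proposal and rejection constant, is the component that each ARS method specializes. This is a hypothetical skeleton of the protocol, not the paper's Algorithm 1.

```python
import random

def adaptive_rejection_sampling(f, propose, n):
    """Generic ARS protocol: at each step, choose (g_t, M_t) from the
    history of evaluated points, then perform one rejection sampling step.

    `propose(history)` returns (g_sample, g_pdf, M); the envelope condition
    f <= M * g_pdf at every step is what makes the output i.i.d. from f.
    Exactly `n` evaluations of `f` are used.
    """
    history = []            # pairs (X_t, f(X_t)), accepted or not
    accepted = []
    for _ in range(n):
        g_sample, g_pdf, M = propose(history)
        x = g_sample()
        fx = f(x)                                   # one evaluation of f
        if random.random() <= fx / (M * g_pdf(x)):  # accept/reject step
            accepted.append(x)
        history.append((x, fx))
    return accepted

# Trivial non-adaptive rule: uniform proposal on [0, 1], constant M = 2.
uniform_rule = lambda history: (random.random, lambda x: 1.0, 2.0)
```

An adaptive method replaces `uniform_rule` by an update that shrinks $M_t g_t$ towards $f$ as the history grows.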

(Class of Adaptive Rejection Sampling (ARS) Algorithms) An algorithm $\mathcal A$ is an ARS algorithm if, given the budget $n$, at each step $t \le n$ it: chooses a density $g_t$ and a positive constant $M_t$, depending on the samples collected before step $t$ and on the evaluations of $f$ at these points; then performs a Rejection Sampling Step with $(g_t, M_t)$. The objective of an ARS algorithm is to sample as many i.i.d. points according to $f$ as possible.

###### Theorem 2.1

Given access to a positive, bounded density $f$ defined on $[0,1]^d$, any Adaptive Rejection Sampling algorithm (as described above) satisfies:
the output contains i.i.d. samples drawn according to $f$ if and only if $M_t g_t \ge f$ at every step $t \le n$.

#### Definition of the loss.

Theorem 2.1 gives a necessary and sufficient condition under which an adaptive rejection sampling algorithm is a perfect sampler—that is, it outputs i.i.d. samples. Its proof is given in the Supplementary Material, see Appendix B.

Based on Theorem 2.1, let us define $\tilde n$ as the number of samples which are known to be independent and sampled according to $f$. We define the loss of the learner as $L_n = n - \tilde n$. This is justified by considering two complementary events. In the first, the rejection sampling procedure is correct at all steps, that is to say, all proposed envelopes bound $f$ from above; in the second, there exists a step where the procedure is not correct. In the first case, the sampler outputs $\tilde n$ i.i.d. samples drawn from the density $f$, so the loss of the learner is the number of samples rejected by the sampler. In the second case, the sampler is not trusted to produce correct samples, so the loss becomes $n$. Finally, we note that the rejection rate is $L_n/n$.

#### Remark on the loss.

Let $\mathbb A$ be the set of ARS algorithms defined in Definition 2.1. Note that for any algorithm $\mathcal A \in \mathbb A$, the loss $L_n$ can be interpreted as a regret. Indeed, a learner that could sample directly from $f$ would not reject a single sample, and would hence achieve $\tilde n = n$. So $L_n$ is equal to the difference between $n$ and $\tilde n$: it is the cost of not knowing how to sample directly from $f$.

### 2.2 Assumptions

We make the following assumptions on $f$. They will be used by the algorithm and for the theoretical results.

###### Assumption 2.2

• The function $f$ is $(s,H)$-Hölder for some $s \in (0,1]$ and $H > 0$,
i.e., $|f(x) - f(y)| \le H \|x - y\|_\infty^s$ for all $x, y \in [0,1]^d$;

• There exists $c_f > 0$ such that: $f(x) \ge c_f$ for all $x \in [0,1]^d$.

Let $F_0(s, H, c_f, d)$ denote the set of functions satisfying Assumption 2.2 for given $s$, $H$ and $c_f$.

#### Remarks.

Here the domain of $f$ is assumed to be $[0,1]^d$, but it could without loss of generality be relaxed to any hyperrectangle of $\mathbb R^d$. Besides, for any distribution with sub-Gaussian tails, this assumption is almost true: in practice, the diameter of the support is bounded by a term of order $\sqrt{\log n}$, where $n$ is the number of evaluations, because of the vanishing-tail property. The assumption of Hölder regularity is a usual one in order to control rates of convergence. It is also a mild one, considering that $s$ can be chosen arbitrarily close to $0$. Note however that we assume the knowledge of $s$ and $H$ for the NNARS algorithm. Since $f$ is a Hölder regular density defined on the unit cube, we can obtain the following upper bound: $\sup f \le 1 + H$. As for the assumption involving the constant $c_f$, it is widespread in non-parametric density estimation. Besides, the algorithm will still produce exact independent samples from the target distribution without the latter assumption. It is important to note that $f$ is chosen as a probability density for clarity, but this is not a required hypothesis. In the proofs, we study the general case where $f$ is not assumed to be a probability density.

## 3 The NNARS Algorithm

The NNARS algorithm proceeds by constructing successive proposal functions $g_k$ and rejection constants $M_k$ that gradually approach $f$ and $1$, respectively. In turn, an increased number of evaluations of $f$ should result in a more accurate estimate, and thus in a better upper bound.

### 3.1 Description of the algorithm

The algorithm outlined in Algorithm 2 takes as inputs the budget $n$ and the constants $s$, $H$ and $c_f$ defined in Assumption 2.2. Let $S$ denote its output. At each round $k$, the algorithm performs a number $n_k$ of RSS steps with specifically chosen $g_k$ and $M_k$. We call $\chi_k$ the set of points generated at round $k$, together with their images by $f$, whether they get accepted or rejected.

#### Initialization.

The sets $S$ and $\chi$ are initialized to $\emptyset$. $g_1$ is the uniform proposal on $[0,1]^d$, and $M_1$ is an upper bound on $\sup f$, which can be taken equal to $1+H$ by the remark in Section 2.2. For any function $h$ defined on $[0,1]^d$, we set $I_h = \int_{[0,1]^d} h(x)\,dx$.

#### Loop.

The algorithm proceeds in $N = \lceil \log_2 \log(n) \rceil$ rounds, where $\lceil \cdot \rceil$ is the ceiling function and $\log_2$ is the logarithm in base $2$.
Each round consists of the following steps.

1. Perform a Rejection Sampling Step RSS($g_k, M_k$) $n_k$ times. Add the accepted samples to $S$. All proposal samples, as well as their images by $f$ produced in the Rejection Sampling Step, are stored in $\chi_k$, whether they are rejected or not.

2. Build an estimate $\hat f$ of $f$ based on the evaluations of $f$ at all points stored in $\cup_{i \le k}\chi_i$, thanks to the Approximate Nearest Neighbor Estimator, referred to in Definition 3.1, applied to the set $\cup_{i \le k}\chi_i$.

3. Compute the proposal with the formula:

$$g_{k+1}(x)=\frac{\hat f_{\cup_{i\le k}\chi_i}(x)+\hat r_{\cup_{i\le k}\chi_i}}{I_{\hat f_{\cup_{i\le k}\chi_i}}+\hat r_{\cup_{i\le k}\chi_i}},\qquad\text{(1)}$$

and the rejection constant with the formula:

$$M_{k+1}=I_{\hat f_{\cup_{i\le k}\chi_i}}+\hat r_{\cup_{i\le k}\chi_i},\qquad\text{(2)}$$

where $\hat r$ is defined in Equation (3) below, in Definition 3.1. Note that $g_{k+1}$ and $M_{k+1}$ are indexed here by the number of the round, unlike in the last section where the index was the current time.

4. If $k+1 < N$, set $n_{k+1}$ according to the schedule of Algorithm 2; otherwise set $n_{k+1} = n - \sum_{i \le k} n_i$, the remaining budget.

Finally, the algorithm outputs $S$, the set of accepted samples that have been collected.
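In dimension 1, the proposal and rejection-constant update of Equations (1) and (2) can be sketched as below, with `cell_values` standing for the piecewise-constant estimator and `r_hat` for the confidence term of Equation (3). This is a simplified sketch under those assumptions, not the full NNARS implementation.

```python
def nnars_update(cell_values, r_hat):
    """One proposal/constant update in the spirit of Eqs. (1)-(2), 1-D sketch.

    `cell_values[j]` is the estimate of f on cell j of an m-cell uniform
    grid on [0, 1] (the piecewise-constant estimate), and `r_hat` the
    confidence term of Eq. (3).  Returns the new proposal density g (as a
    function) and the rejection constant M; by construction M * g equals
    the estimate plus r_hat, which upper-bounds f on the favorable event.
    """
    m = len(cell_values)
    integral_f_hat = sum(cell_values) / m   # I_{f_hat}: cells have length 1/m
    M = integral_f_hat + r_hat              # Eq. (2)

    def g(x):                               # Eq. (1): (f_hat(x) + r_hat) / M
        j = min(int(x * m), m - 1)          # index of the cell containing x
        return (cell_values[j] + r_hat) / M

    return g, M
```

Note that `g` integrates to 1 by construction, since the normalization in Equation (1) is exactly the constant of Equation (2).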

###### Definition 3.1 (Approximate Nearest Neighbor Estimator applied to $\chi$)

Let $f$ be a positive density satisfying Assumption 2.2. We consider a set $\chi$ of $\tilde N$ points $X_1, \ldots, X_{\tilde N}$ and their images by $f$. Let us define a set of centers of cells constituting a uniform grid of $[0,1]^d$, namely

$$\tilde C_{\tilde N}=\Big\{\frac{u}{2\big(\lfloor \tilde N^{1/d}\rfloor+1\big)},\ u\in\big\{1,3,\ldots,2\big(\lfloor \tilde N^{1/d}\rfloor+1\big)-1\big\}^{d}\Big\}.$$

The cells are of side-length $\big(\lfloor \tilde N^{1/d}\rfloor+1\big)^{-1}$. For $u \in \tilde C_{\tilde N}$, write $X(u) \in \arg\min_{i \le \tilde N}\|u - X_i\|_\infty$ for an evaluated point closest to $u$.

We define the approximate nearest neighbor estimator as the piecewise-constant estimator of $f$ given by $\hat f_\chi(x) = f(X(u(x)))$, where $u(x)$ is the center of the cell of the grid containing $x$.

We also write

$$\hat r_{\chi}=H\Big(\max_{u\in\tilde C_{\tilde N}}\ \min_{i\le\tilde N}\|u-X_i\|_{\infty}+\frac{1}{2\big(\lfloor \tilde N^{1/d}\rfloor+1\big)}\Big)^{s}.\qquad\text{(3)}$$
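A sketch of this estimator in dimension $d = 1$, with the confidence term of Equation (3), could look as follows; function and variable names are ours, not the paper's.

```python
def approx_nn_estimate(points, values, s, H):
    """Approximate nearest-neighbor estimator in the spirit of
    Definition 3.1, sketched in dimension d = 1.

    `points` are the evaluated locations X_i in [0, 1] and `values` their
    images f(X_i).  A uniform grid with m = floor(N^(1/d)) + 1 cells is
    built; each cell takes the value of f at the evaluated point nearest
    to its center, and r_hat is the Hölder confidence term of Eq. (3), so
    that (estimate + r_hat) upper-bounds f on [0, 1].
    """
    N = len(points)
    m = N + 1                                       # floor(N^(1/1)) + 1 cells
    centers = [(2 * j + 1) / (2 * m) for j in range(m)]
    cell_values, worst = [], 0.0
    for u in centers:
        dists = [abs(u - x) for x in points]
        i = min(range(N), key=dists.__getitem__)    # nearest evaluated point
        cell_values.append(values[i])
        worst = max(worst, dists[i])
    r_hat = H * (worst + 1 / (2 * m)) ** s          # Eq. (3)
    return cell_values, r_hat
```

The Hölder bound guarantees the envelope property: for any $x$ in a cell with center $u$, $|f(x) - f(X(u))| \le H(\|x - u\|_\infty + \|u - X(u)\|_\infty)^s \le \hat r_\chi$.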

#### Remarks on the proposal densities and rejection constants

At each step, the envelope is made up of evaluations of $f$ summed with a positive constant $\hat r$ which stands for a confidence term of the estimation. It provides an upper bound on $f$. Furthermore, the use of nearest neighbour estimation in a noiseless setting implies that this bound is optimal. Besides, the approximate construction of the estimator yields proposal densities which are simple to sample from.
As explained in Lemma C.2 in the Supplementary Material, an important remark is that the proposal density from Equation (1) multiplied by the rejection constant from Equation (2) is an envelope of $f$: $M_{k+1} g_{k+1}(x) \ge f(x)$ for all $x$. So by Theorem 2.1, NNARS is a perfect sampler.

The algorithm is illustrated in Figure 1.

Remark on sampling from the proposal densities in NNARS. The number of rounds is of order $\log\log n$. The construction of the proposal in NNARS involves, at each round, the storage of at most $n$ values. So the total number of values stored is upper bounded by the budget. At each round, each value corresponds to a hypercube of the grid of Definition 3.1, splitting $[0,1]^d$ equally. Partitioning the space in this way allows us to efficiently assign a value $g_{k+1}(x)$ to every $x \in [0,1]^d$, depending on which cell of the grid $x$ belongs to. Besides, sampling from the proposal amounts to sampling from a multinomial distribution convolved with a uniform distribution on a hypercube. In other words, a cell is chosen multinomially, then a point is sampled uniformly inside that cell, because the proposal is piecewise constant.

The process to sample according to $g_{k+1}$ is the following:

1. Each center $u$ of the cells from the grid is mapped to the value $g_{k+1}(u)$.

2. One of the centers $u$ is sampled with probability proportional to $g_{k+1}(u)$.

3. The sample point is drawn according to the uniform distribution on the hypercube of center $u$ and side-length $(\lfloor \tilde N^{1/d}\rfloor + 1)^{-1}$.
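These three steps can be sketched as follows in dimension 1, with `cell_probs` the values of the proposal at the cell centers. This is an illustrative sketch, not the paper's code.

```python
import random

def sample_piecewise_constant(cell_probs):
    """Sample from a piecewise-constant density on [0, 1]: pick a cell
    multinomially with probability proportional to its value (steps 1-2),
    then draw uniformly inside that cell (step 3)."""
    m = len(cell_probs)
    total = sum(cell_probs)
    u, acc, j = random.random() * total, 0.0, 0
    for j, p in enumerate(cell_probs):   # inverse-CDF search over the cells
        acc += p
        if u <= acc:
            break
    return (j + random.random()) / m     # uniform draw inside cell j
```

Because the proposal is piecewise constant, this two-stage draw is exact: the cell index has the right multinomial law, and the conditional law within a cell is uniform.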

### 3.2 Upper bound on the loss of NNARS

In this section, we present an upper bound on the expectation of the loss of the NNARS sampler. This bound holds under Assumption 2.2 together with an assumption on the sample size, which only requires $n$ to be large enough in comparison with constants depending on $s$, $H$, $c_f$ and $d$. Related conditions on the sample size appear in most theoretical works on rejection sampling (Gilks and Wild, 1992; Meyer et al., 2008; Görür and Teh, 2011; Erraqabi et al., 2016).

###### Assumption 3.2 (Assumption on $n$)

Assume that $n$ satisfies:

$$n\ \ge\ \Big\lceil 2\,(10H)^{d/s}\log(n)\,c_f^{-1-d/s}\Big\rceil\ 4\log(n)^{2}\,\log\!\Big(\Big\lceil \frac{32}{c_f}\Big\rceil\Big)^{2}=O\big(\log(n)^{3}\big).$$

###### Theorem 3.2

Let $n$ be the budget, and let $s$, $H$ and $c_f$ be as in Assumption 2.2. If $f$ satisfies Assumption 2.2 with these constants, then NNARS is a perfect sampler according to $f$.

Besides, if $n$ satisfies Assumption 3.2, then

$$\mathbb E_f L_n(\mathrm{NNARS})\ \le\ 40 H c_f^{-1}\big(1+\sqrt{2\log(3n)}\big)\big(\log(2n)\big)^{s/d}\,n^{1-s/d}\ +\ \big(25+80\,c_f^{-1}+2\,(10H)^{d/s}c_f^{-1-d/s}\big)\log^{2}(n)\ =\ O\big(\log^{2}(n)\,n^{1-s/d}\big),$$

where $\mathbb E_f L_n(\mathrm{NNARS})$ is the expected loss of NNARS on the problem defined by $f$. The expectation is taken over the randomness of the algorithm. This result is uniform over $F_0(s,H,c_f,d)$. The proof of this theorem is in the Supplementary Material, see Section C. The loss presented here, divided by $n$, is to be interpreted as an upper bound on the expected rejection rate obtained by the NNARS algorithm.

#### Sketch of the proof.

The average number of rejected samples is $\sum_k n_k\,\mathbb E[1 - 1/M_k]$, since a sample is accepted at round $k$ with probability $I_f/M_k = 1/M_k$. In order to bound the average number of rejected samples, we bound $M_k$ at each round with high probability.

By Hölder regularity and the definition of $\hat r$ in Equation (3) (in Definition 3.1), we always have $\hat f + \hat r \ge f$, as shown in the proof of Proposition C.1. So $M_k g_k \ge f$, with $M_k = I_{\hat f} + \hat r$. Then, we consider the event $\xi$ on which, at every round, every cell of the current grid contains at least one evaluated point. Now, for each round $k$, on $\xi$, $M_k$ is bounded from above, with $M_k - 1$ of the order of $\hat r$. So, on $\xi$, the average number of rejected samples has an upper bound of the order of $\log^2(n)\,n^{1-s/d}$, as presented in Theorem 3.2.

Now, we prove by induction that the event $\xi$ has high probability, as in the proof of Lemma C.2: $\xi$ fails with probability at most of order $\log^2(n)/n$, consistently with the additive $\log^2(n)$ term of Theorem 3.2. At every round, we verify that the proposal $g_k$ is positively lower bounded conditionally on $\xi$. Hence, the probability of having drawn at least one point in each hypercube of the grid with centers $\tilde C$ is high, as shown in the proof of Proposition C.1. So the distance from any center to its closest drawn point is upper bounded with high probability. And this implies that $\xi$ keeps high probability from one round to the next, which gets the induction going. On the other hand, the number of rejected samples is always bounded by $n$ on the small-probability event where $\xi$ does not hold. This concludes the proof.
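Schematically, the two parts of the argument above combine into the following decomposition of the expected loss (our condensed restatement, with $\xi$ the favorable event, $n_k$ the number of proposals at round $k$, and $I_f = 1$):

```latex
\mathbb{E}_f L_n \;\le\;
\underbrace{\sum_{k} n_k\,
  \mathbb{E}\!\left[\Big(1-\tfrac{1}{M_k}\Big)\mathbf{1}_{\xi}\right]}_{\text{rejections on }\xi}
\;+\;
\underbrace{n\,\mathbb{P}(\xi^{c})}_{\text{trivial bound off }\xi}.
```

The first term is controlled through the high-probability bound on $M_k$, and the second through the bound on $\mathbb P(\xi^c)$.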

## 4 Minimax Lower Bound on the Rejection Rate

It is now essential to get an idea of how much it is possible to reduce the loss obtained in Theorem 3.2. That is why we apply the framework of minimax optimality and complement the upper bound with a lower bound. The minimax lower bound on this problem is the infimum, over all adaptive rejection sampling algorithms $\mathcal A$, of the supremum, over all densities $f$ satisfying Assumption 2.2, of the loss of $\mathcal A$ on the problem defined by $f$. It characterizes the difficulty of the rejection sampling problem, and it provides the best rejection rate that can possibly be achieved by such an algorithm in a worst-case sense over the class $F_0$.

###### Theorem 4

For $n$ large enough, we have:

$$\inf_{\mathcal A\in\mathbb A}\ \sup_{f\in F_0(s,1,1,d)\cap\{f:\,I_f=1\}}\ \mathbb E_f\big(L_n(\mathcal A)\big)\ \ge\ 3^{-1}\,2^{-1-3s-2d}\,5^{-s/d}\,n^{1-s/d}=O\big(n^{1-s/d}\big),$$

where $\mathbb E_f(L_n(\mathcal A))$ is the expectation of the loss of $\mathcal A$ on the problem defined by $f$. It is taken over the randomness of the algorithm $\mathcal A$.

The proof of this theorem is in Section D, but the following discussion outlines its main arguments.

#### Sketch of the proof in dimension 1.

Consider the setup where, firstly, $n$ points of $[0,1]$ are chosen and $f$ is evaluated at these points; secondly, other points are sampled using rejection sampling with a proposal based only on the first points. This is related to Definition D.1. Here $F_0(s,1,1,1) \cap \{f : I_f = 1\}$ corresponds to one-dimensional $s$-Hölder densities with Hölder constant $1$. We consider a subset of it whose elements satisfy Assumption 2.2, indexed by sign vectors $\nu = (\nu_0, \ldots, \nu_{4n-1}) \in \{-1,+1\}^{4n}$ containing as many positive as negative signs.

Let us define the bump function $b$ such that for any $x \in [0, 1/(4n)]$:

$$b(x)=\begin{cases}x,&\text{for }x\le 1/(8n),\\ 1/(4n)-x,&\text{otherwise.}\end{cases}$$

We will consider the functions $f_\nu$ such that for any $x \in [0,1]$:

$$f_\nu(x)=1+\nu_i\,b\big(x-i/(4n)\big),\qquad\text{if }\ i/(4n)\le x\le (i+1)/(4n),$$

for $i \in \{0, \ldots, 4n-1\}$.

We note that $f_\nu$ satisfies the required assumptions for $n$ large enough, which ensures in particular that $f_\nu$ remains bounded away from $0$.

An upward bump at position $i$ corresponds to $\nu_i = +1$ and a downward bump to $\nu_i = -1$. The construction presented here is analogous to the one in the proof of Lemma D.3. The function $f_\nu$ is entirely determined by the knowledge of $\nu$. It is only possible to determine a sign $\nu_i$ by evaluating $f_\nu$ at some point of the $i$-th interval. So with a budget of $n$ evaluations, we observe at most $n$ of the signs in $\nu$. Among the unobserved $\nu_i$, at least $n$ are positive and $n$ are negative, because $\nu$ is balanced. Now, we compute the loss. In the case when $Mg$ is not an envelope, the loss simply is $n$. Now let us consider the case where $Mg$ is an envelope. The loss is the number of rejected samples. $Mg$ has to account for at least $n$ upward bumps at unknown positions; and the available information is insufficient to distinguish between upward and downward bumps. This results in an envelope that is not tight for the negative $\nu_i$ with unknown positions. So a necessary loss is incurred at the downward bumps corresponding to those negative $\nu_i$. This translates as $\mathbb E_f L_n \ge c\, n^{1-s/d}$, where $c$ is a constant only dependent on $s$ and $d$, with $d = 1$ in our case. Finally, we obtain a risk which is of order $n^{1-s/d}$, as seen in Lemma D.3.

In a nutshell, we first considered a setup with more available information than in the problem of adaptive rejection sampling from Definition 2.1. Then we restricted the setting to some subspace of $F_0$. This led to a lower bound on the risk for an easier setting, which implies a lower bound for the problem of adaptive rejection sampling over $F_0$ too.

This theorem gives a lower bound on the minimax risk of all possible adaptive rejection sampling algorithms. Up to a $\log^2(n)$ factor, NNARS is minimax-optimal, and the rate in the lower bound is then the minimax rate of the problem. It is remarkable that this problem admits such a fast minimax rate: it is the same rate as that of a standard rejection sampling scheme with an envelope built using the knowledge of $n$ evaluations of the target density (see Setting D.1 in the Appendix).

## 5 Discussion

Theorem 4 asserts that NNARS provides a minimax near-optimal solution in expectation, up to a multiplicative factor of the order of $\log^2(n)$. This result holds over all adaptive rejection sampling algorithms and densities in $F_0(s,H,c_f,d)$. To the best of our knowledge, this is the first time a lower bound is proved for adaptive rejection samplers, or that an adaptive rejection sampling algorithm achieving near-optimal performance is presented. In order to ensure the theoretical rates mentioned in this work, the algorithm requires the knowledge of $c_f$, a positive lower bound on $f$, and of the regularity constants of $f$: $s$ and $H$. Note that to achieve a near-optimal rejection rate, the precise knowledge of $s$ is required. Indeed, replacing the exponent $s$ by a smaller number results in adding a confidence term to the estimator which is too large; this in turn results in a higher rejection rate than if one had set $s$ to the exact Hölder exponent of $f$. The assumption on $c_f$ implies in particular that $f$ does not vanish. However, as long as $f$ remains positive, $c_f$ can be chosen arbitrarily small, and $n$ then has to be taken large enough with respect to the constants of Assumption 3.2. When $c_f$ is not available, asymptotically taking $c_f$ of order $1/\log(n)$ offers a valid algorithm, which outputs independent samples drawn according to $f$; and it still results in a minimax near-optimal rejection rate, since it approximately boils down to multiplying the rejection rate by a logarithmic factor. Similarly, $H$ can be taken of order $\log(n)$ without hindering the minimax near-optimality. Extending NNARS to non-lower-bounded densities is still an open question.

The algorithm NNARS is a perfect sampler. Since our objective is to maximize the number of i.i.d. samples generated according to f, we cannot compare the algorithm with MCMC methods, which provide non-i.i.d. samples; in our setting, their loss is maximal. The same argument holds for other adaptive rejection samplers that produce correlated samples, e.g., Gilks et al. (1995); Martino et al. (2012); Meyer et al. (2008).
Considering other perfect adaptive rejection samplers, such as those in Gilks (1992); Martino and Míguez (2011); Hörmann (1995); Görür and Teh (2011), their assumptions differ in nature from ours. Instead of shape-constraint assumptions, such as log-concavity, which are often made in the quoted literature, we only assume Hölder regularity; note that log-concavity implies Hölder regularity of order two almost everywhere. Moreover, no theoretical results on the proportion of rejected samples are available for most of these samplers, except possibly asymptotic convergence of this proportion to 0, which is implied by our result.

PRS (Erraqabi et al., 2016) is the only algorithm with a theoretical guarantee on the rate at which the proportion of rejected samples decreases to 0. But it is not optimal, as explained in Section 1, so the near-optimal rejection rate is a major asset of NNARS compared to PRS. Besides, PRS only provides an envelope with high probability, whereas NNARS provides one with probability 1 at any time. The improved performance of NNARS over PRS may be attributed to the use of an estimator better adapted to noiseless evaluations of f, and to the multiple updates of the proposal.

## 6 Experiments

Let us compare NNARS numerically with Simple Rejection Sampling (SRS), PRS (Erraqabi et al., 2016), OS* (Dymetman et al., 2012) and A* sampling (Maddison et al., 2014). The value of interest is the sampling rate, i.e., the proportion of samples produced with respect to the number of evaluations of f. This is equivalent to the acceptance rate in rejection sampling. Every result is an average over 10 runs, reported with its standard deviation.
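The sampling rate can be made concrete with a minimal sketch of simple rejection sampling (an illustration, not the paper's implementation), assuming a uniform proposal on [0, 1] and a known constant envelope `f_max`:

```python
import random

def rejection_sample(f, f_max, budget, seed=0):
    """Simple rejection sampling on [0, 1] with a uniform proposal.

    `budget` counts evaluations of f; the sampling rate is the
    fraction of evaluations that yield an accepted sample.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(budget):
        x = rng.random()
        # Accept x with probability f(x) / f_max.
        if rng.random() * f_max <= f(x):
            samples.append(x)
    return samples, len(samples) / budget

# Example: Beta(2, 2) density f(x) = 6x(1-x), whose maximum is 1.5,
# so the expected sampling rate is 1 / 1.5 = 2/3.
samples, rate = rejection_sample(lambda x: 6 * x * (1 - x), 1.5, 20_000)
```

With a constant envelope, the expected sampling rate is 1/f_max when f is a density; tightening the envelope adaptively is exactly what improves this rate.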

### 6.1 Presentation of the experiments

#### Exp1.

We first consider a target density from Maddison et al. (2014), parameterized by a peakiness parameter; increasing the peakiness also increases the sampling difficulty. In Figure 1(a), PRS and NNARS both give good results for low peakiness values, but their sampling rates fall drastically as the peakiness increases, so that their results become similar to SRS beyond a certain peakiness. On the other hand, the rates of A* and OS* sampling decrease more smoothly.
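The exact peaky density of Maddison et al. (2014) is not reproduced here; as a hypothetical stand-in, the family f_lambda(x) = exp(-lambda * x) on [0, 1], with lambda playing the role of the peakiness, exhibits the same phenomenon for a fixed-envelope sampler:

```python
import math
import random

def srs_acceptance_rate(f, f_max, budget, seed=1):
    """Acceptance rate of simple rejection sampling, uniform proposal on [0, 1]."""
    rng = random.Random(seed)
    accepted = 0
    for _ in range(budget):
        x = rng.random()
        if rng.random() * f_max <= f(x):  # accept with prob. f(x)/f_max
            accepted += 1
    return accepted / budget

# The acceptance rate falls roughly like 1/lambda as the peak sharpens;
# the exact value is (1 - exp(-lambda)) / lambda.
rates = {lam: srs_acceptance_rate(lambda x, l=lam: math.exp(-l * x), 1.0, 20_000)
         for lam in (1, 10, 100)}
```

This is the degradation that the experiment measures: non-adaptive envelopes waste almost every evaluation on a very peaky target.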

#### Exp2.

For the next experiment, we are interested in how the methods scale when the dimension increases, and consider a distribution related to the one in Erraqabi et al. (2016). In Figure 1(b), we present the results for dimensions between 1 and 7. NNARS scales best with the dimension. A* and OS* have the same behaviour, while PRS and SRS share very similar results. A* and OS* start with good sampling rates, which however decrease radically as the dimension increases.
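The density of Erraqabi et al. (2016) is not reproduced here; a hypothetical product density f(x) = prod_i 2*x_i, with the crude constant envelope 2^d, illustrates why fixed-envelope rejection degrades with dimension (its acceptance rate is exactly 2^-d):

```python
import random

def srs_rate_in_dim(d, budget, seed=2):
    """Simple rejection sampling for f(x) = prod_i 2*x_i on [0,1]^d
    with the constant envelope 2**d; the acceptance rate is 2**-d."""
    rng = random.Random(seed)
    accepted = 0
    for _ in range(budget):
        fx = 1.0
        for _ in range(d):
            fx *= 2.0 * rng.random()  # evaluate f at a uniform draw
        if rng.random() * (2.0 ** d) <= fx:
            accepted += 1
    return accepted / budget
```

An adaptive envelope that hugs f, as NNARS builds, is what avoids this geometric decay for smooth targets.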

#### Exp3.

Then, we focus on how the efficiency scales with respect to the budget, for a fixed target distribution. In Figure 2(a), NNARS, A* and OS* give the best performance, reaching the asymptotic regime after 20,000 function evaluations; so NNARS is applicable within a reasonable number of evaluations. Coupled with the study of the evolution of the standard deviations in Figure 2(b), we conclude that the results in the transition regime may vary, but the time to reach the asymptotic region is not initialization-sensitive.

#### Exp4.

Finally, we show the efficiency of NNARS on non-synthetic data from the set in Cortez and Morais (2007). It consists of 517 observations of meteorological data used to predict forest fires in the north-eastern part of Portugal. The goal is to enlarge the data set, so we would like to sample artificial data points from a distribution close to the one which generated the data. This target distribution is obtained in a non-parametric way, using the Epanechnikov kernel, which creates a non-smooth f. We then apply only the samplers which do not use the decomposition of f described in Maddison et al. (2014); that is why A* and OS* sampling are not applied. From the 13 dimensions of the dataset, we work with those corresponding to the Duff Moisture Code (DMC) and the Drought Code (DC), and we obtain the sampling rates in Table 1. NNARS clearly offers the best performance.
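A generic sketch of such a kernel target (with illustrative data and bandwidth, not the paper's choices) can be written as:

```python
def epanechnikov_kde(data, h):
    """Kernel density estimate with the Epanechnikov kernel
    K(u) = 0.75 * (1 - u^2) for |u| <= 1, and 0 otherwise.

    The compact support of K makes the estimate non-smooth at the
    edges of each kernel's support.
    """
    def density(x):
        total = 0.0
        for xi in data:
            u = (x - xi) / h
            if abs(u) <= 1.0:
                total += 0.75 * (1.0 - u * u)
        return total / (len(data) * h)
    return density

# Usage: estimate a density from observed values, then use it as the
# (possibly unnormalized) target of a rejection sampler.
f_hat = epanechnikov_kde([0.2, 0.5, 0.8], h=0.3)
```

Since the resulting f_hat is only Hölder smooth rather than differentiable everywhere, it fits the regularity assumption of NNARS but not the shape constraints of many competitors.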

### 6.2 Synthesis on the numerical experiments

The essential features of NNARS have been brought to light in the experiments presented in Figures 2, 3 and on the non-synthetic data from Cortez and Morais (2007). In particular, Figure 2(a) gives evidence that the algorithm reaches good sampling rates within a relatively small number of evaluations of the target distribution. Furthermore, Figure 1(b) illustrates the possibility of applying the algorithm in a multidimensional setting. In Figure 1(a), we observe that A* and OS* sampling benefit from the knowledge of the specific decomposition of f needed in Maddison et al. (2014). We highlight the fact that this assumption does not hold in general. Besides, A* sampling requires relevant bounding and splitting strategies. In contrast, tuning NNARS only requires the choice of a few numerical hyperparameters, which might be chosen thanks to generic strategies like grid search. Finally, the application to forest fire data generation illustrates the potential of NNARS for applications reaching beyond the scope of synthetic experiments.

## 7 Conclusion

In this work, we introduced an adaptive rejection sampling algorithm which is a perfect sampler according to f, and whose rejection rate is near-optimal in the minimax sense over the class of Hölder smooth densities. Indeed, we provide the first lower bound for the adaptive rejection sampling problem, which gives a measure of the difficulty of the problem, and our algorithm matches this bound up to logarithmic terms.

In the experiments, we test our algorithm on synthetic target densities and on a non-synthetic dataset. A first set of experiments shows that the behavior of the sampling rate of our algorithm is similar to that of state-of-the-art methods as the dimension and the budget increase. Two of the methods used in this set of experiments require the target density to admit a specific decomposition; these methods are therefore left out of the experiment which aims at generating forest fire data. In this experiment, NNARS clearly performs better than its competitors.

The extension of the NNARS algorithm to non-lower-bounded densities is still an open question, as is the development of an optimal adaptive rejection sampler when the density's derivative is Hölder regular instead. We leave these interesting open questions for future work.

#### Acknowledgements.

The work of A. Carpentier is partially supported by the Deutsche Forschungsgemeinschaft (DFG) Emmy Noether grant MuSyAD (CA 1488/1-1), by the DFG - 314838170, GRK 2297 MathCoRe, by the DFG GRK 2433 DAEDALUS, by the DFG CRC 1294 ’Data Assimilation’, Project A03, and by the UFA-DFH through the French-German Doktorandenkolleg CDFA 01-18.

## References

• Andrieu et al. (2003) Christophe Andrieu, Nando De Freitas, Arnaud Doucet, and Michael I Jordan. An introduction to MCMC for machine learning. Machine Learning, 50(1-2):5–43, 2003.
• Canévet et al. (2016) Olivier Canévet, Cijo Jose, and François Fleuret. Importance sampling tree for large-scale empirical expectation. In Proceedings of the International Conference on Machine Learning (ICML), number EPFL-CONF-218848, 2016.
• Cappé et al. (2008) Olivier Cappé, Randal Douc, Arnaud Guillin, Jean-Michel Marin, and Christian P Robert. Adaptive importance sampling in general mixture classes. Statistics and Computing, 18(4):447–459, 2008.
• Cortez and Morais (2007) Paulo Cortez and Aníbal de Jesus Raimundo Morais. A data mining approach to predict forest fires using meteorological data. 2007.
• Delyon and Portier (2018) Bernard Delyon and François Portier. Efficiency of adaptive importance sampling. arXiv preprint arXiv:1806.00989, 2018.
• Devroye (1986) Luc Devroye. Sample-based non-uniform random variate generation. In Proceedings of the 18th conference on Winter simulation, pages 260–265. ACM, 1986.
• Dymetman et al. (2012) Marc Dymetman, Guillaume Bouchard, and Simon Carter. The OS* algorithm: a joint approach to exact optimization and sampling. arXiv preprint arXiv:1207.0742, 2012.
• Erraqabi et al. (2016) Akram Erraqabi, Michal Valko, Alexandra Carpentier, and Odalric Maillard. Pliable rejection sampling. In International Conference on Machine Learning, pages 2121–2129, 2016.
• Evans and Swartz (1998) Michael Evans and Timothy Swartz. Random variable generation using concavity properties of transformed densities. Journal of Computational and Graphical Statistics, 7(4):514–528, 1998.
• Fill (1997) James Allen Fill. An interruptible algorithm for perfect sampling via Markov chains. In Proceedings of the twenty-ninth annual ACM Symposium on Theory of Computing, pages 688–695. ACM, 1997.
• Gilks (1992) Walter R Gilks. Derivative-free adaptive rejection sampling for Gibbs sampling. Bayesian Statistics 4, 1992.
• Gilks and Wild (1992) Walter R Gilks and Pascal Wild. Adaptive rejection sampling for Gibbs sampling. Applied Statistics, pages 337–348, 1992.
• Gilks et al. (1995) Walter R Gilks, NG Best, and KKC Tan. Adaptive rejection Metropolis sampling within Gibbs sampling. Applied Statistics, pages 455–472, 1995.
• Görür and Teh (2011) Dilan Görür and Yee Whye Teh. Concave-convex adaptive rejection sampling. Journal of Computational and Graphical Statistics, 20(3):670–691, 2011.
• Hörmann (1995) Wolfgang Hörmann. A rejection technique for sampling from t-concave distributions. ACM Transactions on Mathematical Software (TOMS), 21(2):182–193, 1995.
• Maddison et al. (2014) Chris J Maddison, Daniel Tarlow, and Tom Minka. A* sampling. In Advances in Neural Information Processing Systems, pages 3086–3094, 2014.
• Martino (2017) Luca Martino. Parsimonious adaptive rejection sampling. Electronics Letters, 53(16):1115–1117, 2017.
• Martino and Louzada (2017) Luca Martino and Francisco Louzada. Adaptive rejection sampling with fixed number of nodes. Communications in Statistics-Simulation and Computation, pages 1–11, 2017.
• Martino and Míguez (2011) Luca Martino and Joaquín Míguez. A generalization of the adaptive rejection sampling algorithm. Statistics and Computing, 21(4):633–647, 2011.
• Martino et al. (2012) Luca Martino, Jesse Read, and David Luengo. Improved adaptive rejection Metropolis sampling algorithms. arXiv preprint arXiv:1205.5494, 2012.
• Metropolis and Ulam (1949) Nicholas Metropolis and Stanislaw Ulam. The Monte Carlo method. Journal of the American Statistical Association, 44(247):335–341, 1949.
• Meyer et al. (2008) Renate Meyer, Bo Cai, and François Perron. Adaptive rejection Metropolis sampling using Lagrange interpolation polynomials of degree 2. Computational Statistics & Data Analysis, 52(7):3408–3423, 2008.
• Naesseth et al. (2017) Christian Naesseth, Francisco Ruiz, Scott Linderman, and David Blei. Reparameterization gradients through acceptance-rejection sampling algorithms. In Artificial Intelligence and Statistics, pages 489–498, 2017.
• Naesseth et al. (2016) Christian A Naesseth, Francisco JR Ruiz, Scott W Linderman, and David M Blei. Rejection sampling variational inference. arXiv preprint arXiv:1610.05683, 2016.
• Oh and Berger (1992) Man-Suk Oh and James O Berger. Adaptive importance sampling in Monte Carlo integration. Journal of Statistical Computation and Simulation, 41(3-4):143–168, 1992.
• Propp and Wilson (1998) James Propp and David Wilson. Coupling from the past: a user’s guide. Microsurveys in Discrete Probability, 41:181–192, 1998.
• Ryu and Boyd (2014) Ernest K Ryu and Stephen P Boyd. Adaptive importance sampling via stochastic convex programming. arXiv preprint arXiv:1412.4845, 2014.
• Von Neumann (1951) John Von Neumann. Various techniques used in connection with random digits. Appl. Math Ser, 12:36–38, 1951.
• Zhang (1996) Ping Zhang. Nonparametric importance sampling. Journal of the American Statistical Association, 91(435):1245–1253, 1996.

## Appendix A Illustration of Rejection Sampling

In the following, we do not assume that f is a density; in fact, ARS samplers could be given evaluations of the density multiplied by a positive constant. We prove in the sequel that, as long as the resulting function satisfies Assumption 2.2, the upper bound presented in Theorem 3.2 holds in this case as well as in the case where f is a density. The lower bound is also proved without the assumption that f is a density.
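As a quick illustration of this invariance to scaling (the factor 7 and the Beta(2, 2) shape below are purely illustrative):

```python
import random

def sample_unnormalized(f, f_max, n_accept, seed=3):
    """Rejection sampling needs f only up to a positive constant:
    accepted draws follow f / integral(f) whatever the scaling."""
    rng = random.Random(seed)
    out = []
    while len(out) < n_accept:
        x = rng.random()
        if rng.random() * f_max <= f(x):
            out.append(x)
    return out

# f is 7 times the Beta(2, 2) density 6x(1-x); its maximum is 42/4 = 10.5.
f = lambda x: 42.0 * x * (1.0 - x)
samples = sample_unnormalized(f, 10.5, 5_000)
mean = sum(samples) / len(samples)  # symmetric target, so mean is near 0.5
```

The acceptance test only compares f(x) against the envelope, so multiplying both by the same constant leaves the accepted distribution unchanged.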

## Appendix B Proof of Theorem 2.1

Let $t \le n$. Suppose $X_t$ has been drawn at time $t$ from the proposal density $g_t$, let $E_t$ denote the event in which $X_t$ is accepted, and let $S_t$ denote the set of the proposal samples drawn at time $t$ together with their images by $f$. Write $\bar f_t$ for the envelope used at time $t$, so that $g_t = \bar f_t / \int_{[0,1]^d} \bar f_t$ and $X_t$ is accepted when $U \le f(X_t)/\bar f_t(X_t)$, with $U \sim \mathcal{U}[0,1]$. Then, for every Lebesgue-measurable $\Omega \subseteq [0,1]^d$, it holds:

$$\mathbb{P}_{X_t \sim g_t,\, U \sim \mathcal{U}[0,1]}\Big(\{X_t \in \Omega\} \cap E_t \;\Big|\; \bigcup_{j<t} S_j\Big) = \int_\Omega \frac{f(x)}{\bar f_t(x)}\, g_t(x)\, \mathrm{d}x = \frac{\int_\Omega f(x)\,\mathrm{d}x}{\int_{[0,1]^d} \bar f_t(x)\,\mathrm{d}x},$$

because $U$ is independent from $X_t$ conditionally on $\bigcup_{j<t} S_j$.
Hence, since $\mathbb{P}\big(E_t \mid \bigcup_{j<t} S_j\big) = \int_{[0,1]^d} f(x)\,\mathrm{d}x \,\big/ \int_{[0,1]^d} \bar f_t(x)\,\mathrm{d}x$, we have:

$$\mathbb{P}_{X_t \sim g_t,\, U \sim \mathcal{U}[0,1]}\Big(X_t \in \Omega \;\Big|\; E_t;\, \bigcup_{j<t} S_j\Big) = \frac{\int_\Omega f(x)\,\mathrm{d}x}{\int_{[0,1]^d} f(x)\,\mathrm{d}x} = \int_\Omega f(x)\,\mathrm{d}x,$$

where the last equality uses that $f$ is a density. Thus, conditionally on acceptance, $X_t$ is distributed according to $f$, and it is independent from the samples accepted before step $t$, since $X_t$ is independent from $\bigcup_{j<t} S_j$.

We have proved that the algorithm provides independent samples drawn according to the density $f$.

## Appendix C Proof of Theorem 3.2

### C.1 Approximate Nearest Neighbor Estimator

In this subsection, we study the characteristics of the Approximate Nearest Neighbor Estimator. First, we prove a bound on the distance between the image of a point $x$ by the Approximate Nearest Neighbor Estimator of $f$ and $f(x)$, under the condition that $f$ satisfies Assumption 2.2. More precisely, we prove that $\hat f_\chi(x)$ lies within a radius of $\hat r_\chi$ away from $f(x)$. Then we prove a high-probability bound on this radius under the same assumptions. This bound only depends on the probability, the number of samples, and constants of the problem. These propositions will be of use in the proof of Theorem 3.2.
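The construction can be sketched in one dimension as follows (with illustrative constants; the paper's estimator queries the nearest evaluation point through the grid of Definition 3.1): a nearest-neighbor estimate, shifted up by a Hölder confidence radius, dominates $f$ pointwise.

```python
def nn_envelope(xs, fvals, H, s):
    """Given evaluations fvals[i] = f(xs[i]) of an (s, H)-Hölder function f,
    return x -> f(nearest xs) + H * dist**s, which upper-bounds f, since
    f(x) <= f(x_nn) + H * |x - x_nn|**s by the Hölder property."""
    def envelope(x):
        i = min(range(len(xs)), key=lambda j: abs(x - xs[j]))
        return fvals[i] + H * abs(x - xs[i]) ** s
    return envelope

# Lipschitz example (s = 1): f below has Hölder constant 1 <= H = 2.
f = lambda x: 1.0 + abs(x - 0.5)
xs = [k / 10 for k in range(11)]
env = nn_envelope(xs, [f(x) for x in xs], H=2.0, s=1.0)
# env dominates f everywhere, and shrinks toward f as the grid refines.
```

The confidence radius plays the role of $\hat r_\chi$: it is what makes the estimate a valid rejection sampling envelope.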

In what follows, we write $\hat f_\chi$ for the Approximate Nearest Neighbor Estimate (as in Definition 3.1) for simplicity.

###### Proposition C.1

Let $f$ be a positive function satisfying Assumption 2.2. Consider points $X_1, \ldots, X_{\tilde N} \in [0,1]^d$.
If $\hat f_\chi$ is the Approximate Nearest Neighbor Estimate of $f$, as defined in Definition 3.1, then:

$$\forall x \in [0,1]^d, \quad |\hat f_\chi(x) - f(x)| \le \hat r_\chi,$$

where $\hat r_\chi$ is defined in Equation (3) (in Definition 3.1).

#### Proof of Proposition C.1.

We have that

$$\|x - X_{i(C(x))}\|_\infty \le \|x - C(x)\|_\infty + \|C(x) - X_{i(C(x))}\|_\infty,$$

where the set $C_{\tilde N}$ and the function $C(\cdot)$ are defined in Definition 3.1.
Now, $C(x) \in C_{\tilde N}$, $\|x - C(x)\|_\infty \le \tfrac{1}{2} \tilde N^{-1/d}$, and $\|C(x) - X_{i(C(x))}\|_\infty \le \max_{u \in C_{\tilde N}} \|u - X_{i(u)}\|_\infty$, where $i(\cdot)$ is defined in Definition 3.1.
Thus

$$\|x - X_{i(C(x))}\|_\infty \le \frac{1}{2 \tilde N^{1/d}} + \max_{u \in C_{\tilde N}} \|u - X_{i(u)}\|_\infty, \qquad (4)$$

and from Assumption 2.2, since $\hat f_\chi(x) = f\big(X_{i(C(x))}\big)$,

$$\forall x \in [0,1]^d, \quad |\hat f_\chi(x) - f(x)| \le \hat r_\chi.$$
###### Proposition C.2

Consider the same notations and assumptions as in Proposition C.1. Let $g$ be a density on $[0,1]^d$ such that:

$$\exists\, 1 \ge c > 0 \ \text{such that} \ \forall x \in [0,1]^d, \quad c \le g(x),$$

and assume that the points $X_1, \ldots, X_{\tilde N}$ are sampled in an i.i.d. fashion according to $g$.
It holds for any $\delta > 0$ that, with probability larger than $1 - \delta$:

$$\hat r_\chi \le 2^s\, r_{\tilde N, \delta, c},$$

where we write $r_{\tilde N, \delta, c} = H \Big( \frac{\log(\tilde N/\delta)}{c \tilde N} \Big)^{s/d}$.

#### Proof of Proposition C.2.

Let $\epsilon$ be a positive number smaller than $1$ such that $1/\epsilon$ is an integer. We split $[0,1]^d$ into $\epsilon^{-d}$ hypercubes of side-length $\epsilon$ and of centers in $C_{\epsilon^{-d}}$. Let $B$ be one of these hypercubes; we have $\mathbb{P}(\forall i \le \tilde N,\ X_i \notin B) \le (1 - c\,\epsilon^d)^{\tilde N} \le \exp(-c \tilde N \epsilon^d)$.
So with probability larger than $1 - \exp(-c \tilde N \epsilon^d)$, at least one point has been drawn in $B$.
Thus for all $x \in B$, it holds:

$$\|x - X_{i(x)}\|_\infty \le \epsilon,$$

where $X_{i(x)}$ denotes such a point drawn in $B$.
Thus, for all $x \in B$,

$$\|x - X_{i(x)}\|_\infty \le \Big( \frac{\log(1/\delta')}{c \tilde N} \Big)^{1/d},$$

where $\delta' = \exp(-c \tilde N \epsilon^d)$ (observe that then $\epsilon = \big(\log(1/\delta')/(c \tilde N)\big)^{1/d}$).
Thus, for a single hypercube, with probability larger than $1 - \delta'$, it holds

$$\forall x \in B, \quad \|x - X_{i(x)}\|_\infty \le \Big( \frac{\log(1/\delta')}{c \tilde N} \Big)^{1/d}.$$

By a union bound over the $\epsilon^{-d}$ hypercubes, with probability larger than $1 - \epsilon^{-d} \delta'$, it holds

$$\forall x \in [0,1]^d, \quad \|x - X_{i(x)}\|_\infty \le \Big( \frac{\log(1/\delta')}{c \tilde N} \Big)^{1/d}.$$

Hence, by letting $\delta = \epsilon^{-d} \delta'$, with probability larger than $1 - \delta$,

$$\forall x \in [0,1]^d, \quad \|x - X_{i(x)}\|_\infty \le \Big( \frac{\log(c \tilde N/\delta)}{c \tilde N} \Big)^{1/d} \le \Big( \frac{\log(\tilde N/\delta)}{c \tilde N} \Big)^{1/d}.$$

Thus with probability larger than ,

 ∀x∈[0,1]d,∥x−Xi(x)∥∞≤(log(