Poisson point processes are extensively used to model event occurrences (Diggle2013b). They have been successfully applied in crime rate modeling (Shirota2017), genetic biodiversity (Diggle2013) as well as environmental modeling (Heikkinen1999). The intensity function
of a point process is a function whose integral over a region is proportional to the probability of an event occurring in that region. A particularly widespread example of a Poisson point process is theCox process, where the intensity function is itself assumed to be random, often modeled as a sample from a Gaussian process (Cox1955).
In this work, we address the problem of adaptive sensing of the spatio-temporal Cox processes. Our goal is to adaptively locate and capture events in particular regions of the space – sensing regions – in order to infer the intensity function, and consequently achieve certain objectives. Depending on the objective, the choice of sensing regions differs and leads to different algorithms. Examples of objectives include identification of the region with the highest incidence of events, identifying level sets of , and capturing the most events subject to the costs of sensing. Specifically, we assume that the intensity varies smoothly over the sensed domain, which we incorporate into the prior for by assuming it is an unknown sample from a truncated Gaussian Process with a known covariance kernel .
To give a concrete example, consider the task of estimating the size of an animal population. This is a recurring task faced by ecologists, especially in the context of understanding the impact of climate change on the habitat of species. Classical approaches to this problem rely on the capture and release principle, where animals are captured, tagged, released and recaptured or tracked. These approaches are costly and time consuming, hence remote sensing from satellites or aircraft combined with recognition software are currently being investigated as promising candidates to replace some classical approaches.Guirado2019 investigate whale identification from satellite data, and Goncalves2020 count Antarctic seal population from satellite data. With the advent of high resolution satellite imagery, seals in Antarctic can be recognized as the image in Fig. 0(a). In the same figure, one can see that the occurrence of seals can be fitted well by a smoothly varying Cox process. Suppose we seek to identify the level sets of the intensity function, i.e., locations where the intensity is above a certain threshold – defined to be the habitat. In order to use monitoring resources or annotators time on the images efficiently, we want to minimize the number of sensing sessions needed to estimate the quantity of interest. Satellites or aircraft can cover only a limited region at once, and need to zoom in on a specific region to facilitate identification, hence finding an efficient sequence of sensing regions is a practical problem that can be addressed by methods developed in this work. Sensing regions can be chosen, for example, as rectangular blocks that stem from the hierarchical splitting of the domain as in Figure 0(b).
1) We formulate three adaptive sensing problems for Cox processes: maximum identification, level set estimation and maximum event capture. We propose simple and efficient algorithms for each of them based on posterior sampling and top-two posterior sampling. 2) We address the inherent positivity constraint of the intensity function by employing representations that naturally allow to enforce the positivity constraint. To this end, we introduce a novel minimal description positive basis problem that, suitably relaxed, reduces to non-negative matrix factorization. We further relate the minimal description positive basis to other positive bases from prior work, and demonstrate their application to non-stationary kernels. 3) Our algorithms are easy to implement and rely solely on convex optimization methods. We show that they outperform algorithms based on optimism, Laplace approximation and Bayesian V-optimal experiment design, and demonstrate their applicability to real-life problems.
2 Problem Statement
Let be a compact domain. A Poisson point process is a random process such that for any subset ,
– the random variable denoting the number of events in– is distributed as , where is a positive intensity function. If and then and are independent given . If, in addition is modeled as a random function such as a truncated Gaussian process, the combined structure is referred to as doubly stochastic or Cox process (Kingmann2002; Snyder2012).
When we observe events in a sensing region for duration , their likelihood given the intensity function can be calculated as
. This likelihood, together with a truncated Gaussian process prior allows, via Bayes’ theorem, to obtain a posterior distribution of. Compressing the notation of acquired data to , , where the denominator contains a double integral – sometimes referred to as double-intractability (John2018). Two core issues with this posterior inference are that the prior needs to be restricted to positive Gaussian processes - itself a non-trivial task, and the double intractability needs to be efficiently handled. In the Section 3, we show how representation in a specific basis solves these problems, but first we state our sensing problems.
2.1 Sensing Problems and Protocol
We consider the problem of adaptive sensing of Poisson processes. We sequentially pick sensing regions, observe events inside them at some cost, with the goal to achieve one of the following three objectives:
Event capture maximization: Capture as many events of as possible;
Estimation of level sets: Given a threshold , find the largest s.t. ;
Maximum identification: Identify (approximately) .
In all these tasks, we aim to minimize the cost of sensing. While sensing the whole domain might be optimal in terms of information provided, it is often practically infeasible due to high cost. Hence either restricting the allowed sensing regions to small regions of the domain or introducing the notion of costs is necessary.
We acquire information about a subset of the domain for a certain duration , for example by zooming in with a satellite. We refer to such a sensing event as an action. The set of actions is parametrized by a collection of sets , where . In each iteration, we pick a sensing action that we sense for a duration , and then receive locations of events in only.
Each action has an associated cost calculated as , where the cost function is known. In this work, we assume the cost is time invariant and proportional to the volume (area) of the sensed set , where is a fixed minimum sensing time duration. Longer sensing is possible by repeating this action sufficiently often. The formalism we develop, however, is more general and can be applied with other cost models. In particular, we focus on two cost functions: uniform costs and fixed costs, where a for which introduces the preference to sense larger sets, since with uniform costs, sensing smaller parts of the set is always preferred over sensing a larger set containing them, which might be sometimes undesired.
The sensing algorithms that we develop follow a simple meta-protocol consisting of three steps. In the first, we collect all information gathered so far, then pick a new sensing region as a solution to , where we refer to as the acquisition function, which represents the utility of the next action and is objective dependent. In the last step, we sense to observe events occurring inside it. We repeat the procedure until our budget is exhausted or we are sufficiently satisfied in the progress. We can see this protocol in Algorithm 1 in steps 2, 3 and 4.
2.2 Modeling Assumptions
Truncated Gaussian Process
We put a prior on the intensity function and assume that it can be modeled as a sample from a truncated Gaussian process. By truncated Gaussian process, we understand a zero-mean Gaussian process with kernel restricted such that its samples are functions bounded away from zero. Namely, we assume that for all and known. Generating samples can be implemented using rejection sampling. The prior is proportional to GP prior with the truncation entering via multiplication by a functional indicating whether everywhere in the domain (see Rasmussen2006[p.136] for specifying GP priors with functionals).
Poisson Process Likelihood
Suppose that there are sensing sessions during which we sensed regions for , and we observed locations of the events . We use the Poisson point process likelihood, Bayes’ rule and take the logarithm at the end to arrive at the following maximum a posteriori (MAP) estimation , where
The inequality denotes the positivity constraint, is the reproducing kernel Hilbert space (RKHS) associated with kernel , and is the associated norm. This expression is possible due the well-known connection between RKHS functions and MAP of Gaussian Processes (for detail see Kanagawa2018). The functional in the minimization problem is commonly referred to as energy, and the posterior distribution is proportional to .
There are two main challenges that we need to address to formulate sensing algorithms. First we need to incorporate the positivity constraint in the inference procedures as well as address the double intractability. Secondly, since our algorithms rely on posterior sampling, we need to efficiently sample from a seemingly intractable posterior. We address these two issues in Section 3 and 4, respectively. Having this, we can design acquisition functions in our sensing protocol tailored to each objective separately.
3 Positive Bases
One of the core challenges for inference in our model compared to classical Gaussian process (GP) regression is the positivity of the intensity function (). The global nature of the constraint hinders the application of representer theorem to make optimization tractable. Practical approaches addressing this positivity constraint are threefold: one could either employ a link function that transforms the output space to positive values; discretize the space; or choose a so called positive basis. All approaches have their pros and cons that we discuss at the end of this section and provide references. In this work, we use a technique that falls into the positive basis approach.
By a positive basis we mean that the intensity can be approximately represented as via nonnegative basis functions . This way, the positivity constraint can be implemented simply by enforcing that the inferred . Additionally, the integral in the likelihood can be succinctly represented as and pre-computed, eliminating the double intractability.
We use this representation for the prior, and specify the necessary distribution over such that it approximates the truncated GP prior. As a GP is fully specified by its mean and covariance, we need to make sure the approximated GP matches them. The core challenge is to match the covariance of the approximating GP, where are the samples, i.e., ensure ()
Unfortunately, this constraints cannot be satisfied globally due to the finite number of basis functions, which have limited representational power. However, it can enforced on a set of representative nodes , which ensure it is approximately satisfied in the rest of . Evaluating at nodes, the covariance of the approximate Gaussian prior needs to be equal to , where is the change of basis matrix, and the kernel matrix evaluated at the nodes. The approximation at the nodes is exact, while in between the nodes
, the value of the approximate covariance is interpolated. To simplify notation we use the basis, where the positivity constraint becomes
and the Gaussian distribution in the prior is. The nodes are selected such that the matrix is invertible, and the approximation error is low. In the next section we provide a concrete example of such a basis. However, generally, nodes are selected as equally spaced grids (Papp2014) (generalized to arbitrary dimensions via Cartesian products). Note that the approximation error decreases with the size of the basis .
Using the above representation, the energy in (1) and MAP estimate become a convex function and a simple optimization problem, respectively:
which can be efficiently solved to near-optimality using interior point methods (Nemirovski2008). Now we study how to choose the basis such that the above construction has low approximation error.
3.1 Minimal Description Positive Basis
We want to select as an minimal description positive basis. By that we mean a basis, which has the least average squared error for a fixed size
. This is desirable, since the computational burden of approximate inference is proportional to complexity of matrix-vector multiplication of sizesi.e. basic linear algebra operations.
Let us represent the evaluation map as an operator from . The minimal description positive basis also called optimal basis with size can is defined as the solution to the following optimization:
Let us unpack the meaning of this optimization problem. Moving from inner to outer problems, we first look for a positive combination of basis functions, which has lowest under the expectation over the prior of . Then in the outer minimization, we search for a basis representation that minimizes the expected error, and is positive (we add normalization to make the solution well-defined). This reconstruction error minimization has the same goal as the Karhunen-Loéve decomposition (stochastic version of Mercer decomposition) where the basis functions are orthogonal. It is known that Karhunen-Loéve decomposition has the lowest average error among all orthonormal bases (Adler2009). Hence, our construction can be considered a positive counterpart to Karhunen-Loéve basis.
This functional optimization problem is intractable due to, among other things, the difficulty of integrating. Nevertheless, we can perform two approximations: assume a finite domain , and approximate the integral with samples to get
After these relaxations, the problem remarkably becomes equivalent to non-negative matrix factorization (Gillis2014)
– a challenging but heuristically solvable problem for which efficient solvers are readily available. The rows ofare the sample paths sampled from the truncated GP, and incorporates the weights . The evaluation operator becomes a matrix on a finite domain.
The resulting basis can be seen in Fig. 0(c) (1st row) for the squared exponential kernel. The effect of using a kernel that generates rougher sample paths such as Matérn or Laplace kernel causes the basis to be more peaked as in Fig. 0(c), 2nd row. As for choice of the nodes for the optimal basis, we pick the peaks of the bumps in Fig. 0(c). With this choice, we see that
will become close to the identity matrix, as each separate basis function (single color) is zero at the peaks of other basis functions.
Of special interest are optimal positive bases for non-stationary kernels e.g. Gibbs kernel with varying lenghtscale as in Fig. 0(c) 3rd row. In this example, the regions of the domain with low spatial variations do not need to be approximated by large numbers of basis functions and the optimal basis adapts accordingly. This is of special importance for us, since in spatial modeling, the ultimate goal of this work, the problems often exhibit non-stationarity due to varying geographical features such as effects of water, mountains etc. This can be modelled by what we call indicator modification, e.g., , where summarizes geographical features and is a non-stationary kernel. We will see concrete examples in the experiments.
3.2 Other positive bases
Other positive bases used in prior works are the famed Bernstein polynomials (Fig. 0(c), 5th row) (Papp2014), triangle basis (Fig. 0(c), 4th row) or positive Hermite splines (Alizadeh2008). The triangle basis of Maatouk2016 and Cressie2008 stands out. Qualitatively, it is very similar to the optimal basis (in case of stationary kernels), is well-conditioned, has good approximation properties and the integrals can be calculated in closed form. It has also been used in prior works for positive constraints (Lopez-Lopera2019). Its choice was motivated by convenience, but we believe this work justifies its good empirical performance by its closeness to the optimal basis representation for stationary kernels. We introduce it in more detail in the next paragraph.
Given a domain , let us define the individual triangle basis functions for : . As the basis functions are bounded and do not overlap at node points , the constraints , can be represented as linear constraints on : (for details refer to Lopez-Lopera2018). Now in order to use this basis with kernel , we transform it by as above, where we note that is the identity when we use the same nodes as in the definition.
3.3 Why not link functions?
Many prior works model the intensity function with a link function to ensure positivity. These include: sigmoid (Adams2009), exponential (Moller1998) and square (Flaxman2017). The problems with link functions are twofold: the energy is either non-convex or the integral cannot be evaluated in closed form nor precomputed – the so called double-intractability – both of which are avoided by positive bases. The square link function seems to be the most promising, as the double intractability can be alleviated with a fixed basis, but it results in non-convexity of the energy due to the equivalence of and as solutions. It also leads to artefacts called nodal lines (John2018), where, as the range of changes sign, the approximation of is poor (see Appendix A.4). The sigmoid link function leads to a convex energy, but due to the double-intractability it requires numerical integration, which makes the computation costly as we will show.111The sensing algorithms we develop can be used with sigmoid as well; they are just not as efficient.
4 Posterior Sampling and Acquisition Functions
Sampling from the posterior over the intensity function is a cornerstone for our algorithms, as introduced below. To sample from the posterior using a positive basis, we need to sample from , where . Fortunately, in this representation, is convex on and hence the posterior distribution is log-concave. Log-concavity of a distribution is a desirable property that leads to efficient and even provably consistent approximate inference (Dwivedi2018), and can be solved, e.g., via Langevin dynamics. Langevin dynamics is a stochastic process that follows a discretized stochastic differential equation defined via a gradient flow on the energy (see (2) below for an example). In our case, the only difficulty arises from the fact that is not smooth, as outside of it is zero. Recent advances extend Langevin dynamics to non-smooth energies as we show below.
Proximal Langevin Dynamics
Let us define an indicator function , taking values , and let our modified energy defined on be . Durmus2016 and Brosse2017 propose a proximal version of Langevin dynamics utilizing the Moreau-Yosida envelope of a function (Borwein2005). While, the details of the derivation of this algorithm are out of scope of this paper, the final algorithm is remarkably simple and relies on the proximal operator , which for our simplifies to projection on a polytope , which can be very efficiently solved using a quadratic programming algorithm of Goldfarb1983.
The Moreau-Yosida Unadjusted Langevin Algorithm (MYULA) of Brosse2017 is defined via the stochastic update rule,
which requires step size , where is the Lipschitz constant of and . For any , the Lipschitz constant is bounded and can be found by approximate power method applied on the Hessian of . The initial point is set to coincide with the MAP. As , the continuous flow of becomes provably close for the to a sample from in TV distance (Brosse2017).
Other Langevin dynamics algorithms (Hsieh2018; Zhang2020a) are applicable in this context as well as we show in Appendix A.5. However, MYULA seems to be suited for this problem as it can deal with ill-conditioning of , which is often the case for smooth kernels. With proximal sampling, the ill-conditioning appears only within a quadratic program, which can be easily addressed via second-order methods. Other methods rely on first-order approaches, thus ill-conditioning forces step sizes to be small, leading to longer compute time.
4.1 Maximizing Event Capture
Using the above, we can formulate an algorithm for the first objective: maximizing event capture. The goal here is to capture as many events subject to the cost of sensing. Each has an associated cost, . We want to spend the budget in the most effective way by minimizing the following regret,
where maximizes over and expectation is over the sampling from the Cox process. The regret represents the average missed events incurred by investing the budget to potentially suboptimal actions in terms of number of observed events count to cost ratio. We propose to use posterior sampling to solve this task (Russo2017a). In connection to Poisson process sensing, we refer to this algorithm as Cox-Thompson, where we use (2) with positive bases to sample from the posterior distribution. Afterwards, we sense which leads to the best count to cost ratio under the sampled parameter as in Alg. 1.
4.2 Learning Level Sets and Maxima
To identify the maximum of and level sets, we rely on the seminal re-sampling heuristic of Chernoff1959, which was recently analyzed, and popularized by Russo2016a. Namely, the heuristic relies on resampling from the posterior until another sample lead to a different recommendation of what is the maximizer or what are the level sets of . Having these two recommendations, we sense a region that leads to the decrease in uncertainty of this recommendation. In particular, the recommendation rule for maximum is , where is the sample from the posterior. Then, we re-sample from posterior until we find a different recommendation . After that, we randomly choose to sense the lowest cost sets and containing and , respectively with equal probability. We refer to this algorithm as Top2. This algorithm is known to be consistent for any uncorrelated exponential family random responses, which includes Poisson responses, as shown by Russo2016a. In our case, we exploit correlation structure of the posterior, which we believe can only improve the efficiency of this scheme.
In order to implement an algorithm that identifies level sets of , in other words, the largest s.t. for all , we resample until the prediction of level sets are different. The difference in recommendation is perhaps best summarized by an exclusive OR (XOR) operation on the two sets , and then we choose an action that shares the most overlap with the resultant set . We weight the overlap by the respective difference in magnitude of the two random samples over , namely, using the positive bases where two samples are and ,
5 Discussion & Experiments
Before we proceed to compare the sequential algorithms, let us first investigate how positive bases compare with link functions in two metrics: validity to solve the task and execution time. We study event capture maximization on a toy example function on in Fig. 3a well captured by squared exponential kernel. Our experiments with link function are done with a finite basis as well, because performing inference for each for each is computationally infeasible.
We see that the square link function fails to solve the task due to nodal lines (Appendix A.4). Sigmoid link functions with Quadrature Fourier Features (QFF) (Mutny2018b), an minimal description basis (non-positive), fails because the features cause severe ill-conditioning. On the other hand, using the sigmoid link with the triangle basis performs just as good as the constrained version – however it is orders of magnitude slower (cf., 4 in Appendix B). Lastly, using the optimal basis performs essentially the same as the triangle basis, because they are indeed very similar on this stationary problem. However, the difference appears when using non-stationary kernels as we will see in other applications.
Baseline Sensing Algorithms
For the count-regret minimization task, we compare our approach with the UCB algorithm (Srinivas2009) based on credible sets from the Laplace approximation (UCB-Laplace), the UCB algorithm of Mutny2021 (UCB-MK), which uses a regression estimate instead of MAP with stylized confidence parameter , and -greedy with decreasing exploration probability. For the level-set and maxima identification, we compare against random search and classical approximate sequential Bayesian V-optimal design (Chaloner1995). We refer to the latter as V-optimal. It uses Laplace approximation for posterior inference (as other inference is infeasible). This is, to our knowledge, a novel instantiation of Bayesian experimental design, and we explain it in detail in Appendix A.2.
5.1 Discussion: Algorithm comparison
We compare our proposed algorithms on four problems including an additional experiment in Appendix B with fixed costs (concave). For the experiments presented below, we use a uniform cost function , hence only the smallest sets in hierarchically generated
are sensed. More details and hyperparameters along with inferred intensities can be found in the AppendixB.
The main overall message from the benchmarks is that algorithms that do not exploit correlation stemming from the kernel , such as random sampling or -greedy perform worse than other algorithms. Secondly, algorithms based on Laplace approximation; UCB-Laplace and V-optimal fail when the rate function is small in some regions () due to the reason explained in Appendix A.1. Lastly, the UCB algorithm of Mutny2021
, which involves intricate variance rescaling, performs similarly to our very simpleCox-Thompson algorithm for count-regret minimization.
Our first problem in Fig. 3b has the same rate function as in 3a, but we vary sensing algorithms instead of representations. We see that Thomspon sampling performs slightly better to the optimally tuned UCB algorithm of Mutny2021. Cox-Thompson required no tuning, and no confidence parameter. Next, we discuss the four real-world benchmarks individually.
In this problem, we estimate the intensity of burglary occurrences as a function over space in the city of San Francisco, as illustrated in Fig. 0(b). The sensing duration is days corresponding to, e.g., deploying mobile cameras in rectangular city blocks. We compare the regret (as in (3)) of the baselines and see that Cox-Thompson outperforms other algorithms in Fig. 3c except for UCB (with optimal basis) which uses optimized confidence value . If it was run according to the theoretical specifications, it would not be competitive. The optimal basis performs much better than the triangle basis (blue), since the covariance of the problem is non-stationary. Namely, in parks, water areas, the rate is zero, and optimal basis does not cover these areas with basis functions. With the same basis size, the triangle basis cannot capture the problem sufficiently accurately.
In this experiment, we simulate a 1854 London outbreak of cholera using John Snow’s famous dataset. The goal here is to identify pollutant source (polluted well). We assume, as commonly done, that the rate function smoothly decays away from potential polluting sources (wells). Hence, the goal is to identify the maximum of . To measure progress, in Fig. 3c, we report inference regret on the -axis, which measures the suboptimality of the current belief about optimum, defined as , where is the current belief for the value at optimality. The simple Top2 performs very well. The sensing duration is set to of length of the epidemic.
Suppose we want to perform surveillance of a region in order to estimate the rate of occurrence of a particular species. In order to model this application, we use occurrence data of Beilschmiedia, a tree genus native to Africa, from Baddeley2015 and seal location in Antarctica from Goncalves2020, specifically a section of Antarctic territory around the Ross sea. In both cases, we assume the generating process is Poisson point process and are interested in determining the region where the occurrence rate is above a given threshold value , which we take as the definition of a habitat. For Antartic seals we use an indicator kernel, which indicates whether the region is sufficiently close to the shore, and for Beilschmiedia we use squared exponential kernel that takes slope and height of a point as inputs, making it non-stationary in .
As the task essentially corresponds to a classification of regions according to whether the rate is above and below , we report the score over the sensing rounds. Hereby, below the level is associated with label and above , , in Figs. 3e) and 3f). We see that Top2 outperforms random sampling and V-optimal in both instances by identifying the level sets more efficiently. It seems that V-optimal gets stuck in regions with small values , which are present in the seal problem. Also, notice the dashed lines in Fig. 3e) representing a triangle basis approximation with the same basis size. With the same size as optimal, it fails at approximating the function to the extent to solve this task.
6 Related Work
Previous works on Poisson point process intensity estimation can be roughly categorized into those based on smoothing kernels (Berman1989), and those based on positive definite kernels. Unlike the methods based on smoothing kernels, kernelized estimators need to address the positivity constraint of the intensity function. Positivity, as one of the simplest shape constraints, can be enforced by representing our function in a specially constructed positive basis such as positive polynomials or inherently positive bases (Papp2014; Maatouk2016; Lopez-Lopera2018). Lopez-Lopera2019 and Alizadeh2008 use positive bases for inhomogenous point process inference with passively collected data, similarly as done in this work. Link functions are another way to enforce positivity, investigated by Adams2009, Moller1998 and John2018.
Confidence or credibility estimates are considered by Moller2003 who propose to bootstrap confidence bounds from observational data. Walder2017 use a Laplace approximation, where the intensity is modeled as square of a Gaussian process, which we exploited in our baseline algorithms. Adaptive sensing of Poisson point processes with sensing costs is introduced by Grant2020 and the kernelized version by Mutny2021. Grant2020 partition the domain and sequentially refine histogram estimators with truncated gamma priors. Their adaptive discretization is done at a specific rate in order to identify the maximum. This approach is suboptimal to approaches utilizing the correlation between bins as here. Mutny2021 introduce adaptive sensing of Poisson point process where the rate belong to reproducing kernel Hilbert space. They study the same regret as here from a Frequentist perspective, however their algorithm cannot be applied to level-set nor maximum identification and is more complicated.
We studied adaptive sensing of Cox processes. We proposed three sensing objectives: learning level sets, maximum identification and event capture, and provided adaptive algorithms based on posterior sampling that optimize these objectives. We introduced the concept of minimal description positive basis to approximate the rate function for accurate inference.
Supplementary: Sensing Cox Processes via Posterior Sampling and Positive Bases
.tocmtappendix mtchapternone mtappendixsubsection
- 1 INTRODUCTION
- 2 PROBLEM STATEMENT
- 3 POSITIVE BASES
- 4 POSTERIOR SAMPLING AND ACQUISITION FUNCTIONS
- 5 DISCUSSION & EXPERIMENTS
- 6 RELATED WORK
- 7 CONCLUSION
A Extra material
- A.1 Upper Confidence Bound Algorithm and Laplace Approximation
- A.2 V-optimal experiment design
- A.3 Derivation of Positive Basis
- A.4 The square link function: Nodal Lines
- A.5 Other Langevin Dynamics Algorithms
- A.6 Sigmoid and Hermite basis: why they fail
- B Numerical Experiments
Appendix A Extra material
a.1 Upper Confidence Bound Algorithm and Laplace Approximation
A successful class of algorithms for regret minimization are based on the principle of optimism (Auer2002; Srinivas2009). This principle relies on the access to a confidence set/credibility set, and dictates to sense the region with the largest upper confidence bound (in plots this algorithm has name UCB).
In order to reason about the confidence bounds without closed form posterior, Laplace approximation offers itself as the first natural candidate. A short calculation reveals that the covariance for the MAP estimate with positive basis is . The upper confidence bound can be then solved as simple convex program (see below). The width of confidence set is inversely proportional to magnitude of . Already before executing this algorithm, we can see that it is destined to fail when , since sensing no events does not increase at all, and hence does not decrease the confidence region. Thus, we expect the algorithm to get stuck on sensing a region with small counts since it is unable to shrink the confidence estimate for it.
avoids the above described pathology by using a regression estimate instead of maximum likelihood estimate at the cost of blowup to the confidence set by a specifically designed scaling factor that takes into account the tail properties of Poisson distribution.
a.2 V-optimal experiment design
We design an algorithm for approximately learning level sets, and identifying maxima of . We take an inspiration from the literature on statistical experimental design (Chaloner1995; Gotovos2013), and design an acquisition function based on Bayesian V-optimal design. The general idea is to sequentially reduce the uncertainty in the region of interest subject to the cost of sensing.
The region of interest depends on our objective. If learning levels sets above a threshold , then , while for identifying maxima, . As we proceed with iterations, the region of interest is shrinking to the smallest such set.
The ucb and lcb can be calculated as conic convex optimization problem:
Following ideas of Bayesian experimental design, we greedily select the next sensing region to maximally reduce the posterior squared prediction error in in expectation over the realization of and in expectation over the posterior over in Equation (6).
The expectation in (6) is over the realization of the Poisson process in region for a constant duration , and is the posterior distribution of . In general, it is difficult to integrate over the exact posterior, so we again make use of the Laplace approximation of posterior to arrive at the simplified acquisition function in Equation (7). We refer to this algorithm as V-optimal.
The in (7) is a function of that we take the expectation over, and denotes the new MAP estimate with the new data appended due to sensing . The Poisson process is used to sample and locations of the points simulating the sensing with the rate function defined by our current optimistic estimate for action . The estimator is defined via s.t. it belongs to the same constraint set as in Equation (5).
Notice that, as we are interested in identifying maxima or level sets, the optimistic estimate is justified as we hope to discover regions with high values. For example, starting with no information, would be close to the lower bound, then sensing under the current MAP estimate leads to no event occurrences anywhere except for where it has already seen some samples. This effect is eliminated by instead using an optimistic estimate while simulating sensing in a new region .
In our experiments, we often skip the last steps and set as the new data does not lead to a significant change in the MAP estimate, leading to (8). The evaluation of the acquisition function in Equation (8) requires: calculation of optimistic (convex program), a sampling step from a Poisson process (sampling), and a MAP calculation with the new sample added (convex program). Sampling from the Poisson process can be done via Monte Carlo simulation as in (Snyder2012).
a.3 Derivation of Positive Basis
In this section we give more information and an alternative argument as to why the basis needs to be scaled by and what is the origin of this scaling. We always start with the basis functions , which have the desired positive property but are very simple and do not generate the appropriate function class. There are two types of arguments to explain why is necessary: a) Bayesian covariance matching (mentioned in the text), and b) Change of basis in Hilbert spaces.
The Bayesian derivation demands that the prior covariance kernel is maintained under the basis transformation. In other words . To achieve this on set of nodes , we use again , then . Picking to be achieves the desired goal. In between the points , the values of the kernel are interpolated.
Nodes needs to be selected such that is invertible (change of basis exists), and this is indeed the case for the bases we consider in our work. In fact, nodes are selected such that becomes close to identity. Similar issues arise also when working Bernstein polynomials and Hermite splines (see (Papp2014)), and not much attention is given to these concepts, since there is always a reasonable choice, however a bad choice of nodes can indeed render the approximation useless. Papp2014 propose to optimize nodes for Bernstein polynomials but refrain from practically implementing it due to difficulty of the optimization problem.
For the more Hilbert space inclined reader, the same needs to happen with the RKHS perspective. The optimization in Hilbert spaces proceeds via regularized ERM. The regularizer is the norm of the original RKHS space. If we use a new basis for the RKHS, we need to make sure the same regularization strength is applied in the new basis. Assuming the domain is finite dimensional only supported on the nodes for we calculate the value of in the new RKHS. We will see it has form , where is the basis change matrix. The rest follows by using .
a.4 The square link function: Nodal Lines
John2018 pioneered the use of square link function to estimate Poisson process intensity function. The combination of square link function and Gaussian process was later referred to as a Permanental process in literature (Walder2017). Already John2018 show that the objective function is non-convex due to equivalence of and where denotes the latent function passed through the link function. The non-convexity of the estimation problem leads to a phenomenon called nodal lines. The nodal lines are direct consequence of non-uniqueness of stationary points in optimization, where a stationary solution to the MAP problem does not fit the solution well in the regions where the latent function crosses zero. In Figure 3, we plot in the first row estimates of (in red) with squared link function (blue and orange) as well as our constrained variant (green). We observe that in the second row that once the latent function crosses zero, significant error artefact occurs in estimate of (in the first plot) which is not present in the case of triangle basis with a constraint. This mainly occurs in the region where is small but positive as already suggested by John2018. Naturally, this effect can be somewhat mitigated by multiple restarts as well as using fixed off-set in the link function, however it turns out to be complicate inference especially when we need to perform multitude of repeated inferences in sequential decision problems.
a.5 Other Langevin Dynamics Algorithms
a.5.1 Mirrored Langevin Dynamics
Instead of using MYULA - a proximal based Langevin dynamics of Brosse2017 one could also apply Mirrored Langevin dynamics of (Hsieh2018) due to the simplicity of the constraint.
The Mirrored Langevin dynamics of Hsieh2018 constructs so called dual distribution which incorporates the constraint in its functional form and is unconstrained such that the standard Langevin dynamics applies. The dual distribution is defined as push-forward via the gradient of the mirror map . It can be derived by solving Monge-Ampere equations:
The mirror map is chosen such that the the dual distribution becomes tractable and convex, and needs to be twice differentiable and strictly convex inside the constraint set.
The advantage of this algorithm is two-fold it never leaves the constraint set and has better convergence guarantee - it requires only to reach accuracy under total variation. The disadvantage we found is that its dependence on condition number is worse than MYULA in practice. The condition number becomes proportional to condition number of , and as we use only gradient based optimization (classical Langevin dynamics) it has bad convergence properties.
We explain the derivation of the mirror map in the steps below. It turns out that our constraint is related to a simple box constraint already analyzed in Hsieh2018 for which a simple mirror map exists. We reproduce the derivation here for completeness. In order to derive for our application, we transform the variables to domain by applying an affine transformation. The potential becomes:
It turns out that the appropriate mirror map for the box constraint is of the form:
The following mirror map has derivative and consequently the Fenchel dual has derivative , where are the dual variables. Using the equation (9) and , we can derive the dual distribution up to a constant:
Notice that the is a monotone function and since composition of convex and monotone remains convex (can be found in (Boyd2004) in Chapter 2), this is a valid objective for Langevin dynamics:
The transformation back to primal coordinate is via the dual map .
a.5.2 Wasserstein Langevin Dynamics
One could also apply a version of Mirrored Langevin Dynamics due to Zhang2020a. Akin to the work of Hsieh2018 it relies on a mirror map to define a constraint but uses different discretization. Let us here describe the problem for a positive constraint with the following classical barrier instead:
The Langevin flow is then defined as follows,
The inverse of gradient obeys . With the barrier function above the does not have a closed form and needs to be solved using an iterative process in our case Newton’s method. Additionally, using leads to convergence according to Zhang2020a. For further details please refer to the original paper. While this algorithm can avoid bad conditioning; it requires rather costly mirror map inversion procedure.
a.6 Sigmoid and Hermite basis: why they fail
The energy function with the log-link function is equal to:
where denotes the maximum value the rate function can attain. Notice that when calculating gradients we need to do approximate quadrature of each dimension of the :
From (Mutny2018b) we know that, while Hermite basis, is optimal in its description size for squared exponential kernel, it has rapidly increasing frequencies , as increases in its Fourier spectrum. This means that is a oscillatory function with high frequency, which are notoriously difficult to integrate (Davis2007). Hence, the approximate inference scheme which we employ: Legendre quadrature with fixed nodes, fails. Naturally, an adaptive quadrature scheme for each frequency could improve the convergence of this method, however already with our fixed quadrature rule it is significantly more costly, and not competitive in terms of computational time. This does not happen with triangle basis since each basis function is a simple piece-wise linear function, hence the performance is unhampered with Legendre quadrature.
This demonstrates a very important point. While the basis representation , has no effect on the function values, it does have an important effect on inference scheme . Namely, in this case it influences the regularity conditions of the gradient of the loss function with respect to the parametrization.
. Namely, in this case it influences the regularity conditions of the gradient of the loss function with respect to the parametrization.
Appendix B Numerical Experiments
We run our experiments on a local 28 core machine with 128GB memory for ca. 5 days total runtime. Individual operations are not extremely costly but need to repeated many times due to sequential nature of the algorithms. This significantly blows up the computational time.
We perform sampling with the proximal algorithm of Brosse2017 for regret minimization with steps, and Lipschitz constant calculated at each step approximately. We always initialize with the MAP estimate. The number of iterations of the iterative procedure is set such that the samples visually appear to capture the posterior variation of the rate function adequately. All our examples are either one or two dimensional, and samples from the posterior can be easily visually inspected to determine if the approximate sampling produces viable samples. Naturally increasing this number may improve the validity of the samples, but seems to be not required for our application. Also, notice an approach with link function requires the same choice to be made.
We create - set of sensing regions - by hierarchical splitting the domain with the size for specific problem given below.
Both UCB algorithms and V-optimal, we use . V-optimal resamples from the Poisson process times to calculate the acquisition function. We use the algorithm of Snyder2012 to sample on a discretized domain.
For approximate inference we use Legendre quadrature of scipy (scipy)
In the table below one can find specifics for each dataset.
b.1 Fit and test approach
We want to evaluate the applicability of our approach without the model misspecification interfering in understanding of the performance, and at the same time be able to include randomness of the process into our algorithmic comparison. We execute the experiments in fit and test approach. We first fit the model to the whole data corresponding to our application that results in a ground truth rate function . In all cases, this is visually satisfactory fit as can be seen in the Figures below. We then perform the experiments, where points are sampled from the fitted and compared with respect to it as the ground truth. This way we will avoid overfitting to a specific data realization in the data, which can lead to missrepresentation of performance due to specifics of the instance.
In the table below SE denotes squared exponential kernel.
|San Francisco dataset||custom||lengthscale , (see below)||days||64||200|
|Cholera dataset||custom||lengthscale (see below)||of d.d.||256||50|
|Beilschmiedia dataset||custom||lengthscale (see below)||64||100|
|Antartic seal dataset||custom||lengthscale (see below)||64||60|
By d.d, we mean dataset duration which means the time span between first and last event. By the ratio of dataset we mean that if the time duration of the dataset was lets say days and the ratio was , this means that one sensing session was days. Further details are provided below.
Before we describe the benchmarks, we stress that while the experiments stem from real problems, they are simplified versions. Each application dictates a lot of additional technological, ethical and economical specifications that we do not consider. Therefore, one should see the following as show-cases of what is possible, rather than the complete solutions. More crucially, some of the benchmarks can be improved by expert designed kernels . For example, ecologists might have prior knowledge regarding the seals’ habitat based on geographical features, and this should be incorporated to improve efficiency. With the lack of such expert knowledge, we rely only on naive geographical features and spatial correlation. However, the strength of this formalism lies in its flexibility.
b.2.1 The San Francisco benchmark
The data comes for this experiment comes from https://data.sfgov.org/Public-Safety/Police-Department-Incident-Reports-Historical-2003/tmnf-yvry. The dataset contains much more information that we do not use. We just use data for days of bulglary events in San Francisco. We model the problem using a kernel , where indicates whether the given coordinate is urban on not with if not. This way we disregard the portions of the map which correspond to the forests or water areas. The lenghtscale was determined by visually inspecting the data and picking an appropriate lenghtscale such that the fit corresponded to meaningful rate. Alternative approaches include is to integrate the uncertainty over the posterior and maximize the Bayesian evidence, which in this case is difficult due to absence of closed form of the posterior. Note that the image in the banner includes a cut-off on the rate function to avoid clutter.
b.2.2 Cholera benchmark
The next dataset is a classic example of pollutant modeling, and relies on historical dataset. During the London cholera outbreak of 1854, John Snow used records of cholera deaths to identify a contaminated water well – which is nowadays regarded as the cradle of spatial epidemiology. Assuming that contamination propagates locally from the source, infections can be modeled as Poisson point process as in Fig. below with water wells (blue) as inducing points. The goal in this example is to stop an epidemic as soon as possible and identify the contaminant by surveying the neighbourhoods – sensing – to identify sick individuals as the epidemic evolves. The problem is modelled with SE (squared exponential kernel) with inducing point in the water wells according to the strategy in Williams2001.
b.2.3 Beilschmiedia benchmark
This dataset comes from the work of Baddeley2015, and we run level set identification , which covers approximately of the domain. In modelling we use kernel , where corresponds to the slope magnitude at and to the height at for which we have look-up tables. While this representation is not perfect for the observed data it matches them quite closely for sufficiently good fit. In this example, we did not do any elaborate hyperparameter fitting.
b.2.4 Antartic seal benchmark
This dataset comes from Goncalves2020. The dataset contains images and locations of identified seals. We do not use the images in this work. We use only locations in the training set to fit a Cox process on a section of antarctic around Ross sea. We use kernel where indicates whether the point x is at most km away from the Antarctic continent. Notice from the image that Artactic continent and Antartic coast are not the same. This modelling is purely arbitrary and could significantly benefit from expert knowledge. However, already such simplified setting can provide interesting insights to efficiency of the tested adaptive algorithms. We use as which is approximately half of the maximum rate function.
b.3 Further experiments: concave cost and runtime
Not all our experiments could fit to the main body of the paper. We report two more experiment in Figure 4 with fixed costs (simplest version of concave cost), as well as time benchmark for the different link functions.
b.4 How to pick ?
The basis approximation we discuss in our work requires the specification of parameter – basis size. With increasing basis size, the approximation has higher fidelity, however the computational cost grows as (matrix-vector multiplication). Hence, we want smallest basis which still captures the complexity of the problem. One way to determine if is sufficiently big is to do it by visual inspection, i.e. looking at the samples of the Bayesian model with fixed and seeing if the samples have sufficient variation we believe the studied problem should have.
Secondly, given a certain cut-off
on the eigenvalue spectrum, we can calculate the eigenvalues of the Gram matrixon the selected nodes . We can adopt the following procedure: Consider which are on a regular grind in (one or two) dimensions, which are generated by Cartesian products. We gradually increase and hence the granularity of the grid. For each value of , we calculate the eigenvalues of , namely the minimum eigenvalue. If the minimum eigenvalue is below the cut-off value, we can declare the the approximation to be of sufficient fidelity. The exact relationship between e.g. and cut-off is not straightforward but follows an inverse relationship, and can be empirically established.
b.5 Influence of for UCB
In Fig. 2c) it seems like UCB of Mutny2021 outperforms Thompson sampling. While it is possible that on this specific problem UCB performs better, it is more the cause of optimized value of confidence parameter . In Fig. 5 we report the same experiment with multiple values of and we see that the performance significantly deteriorates with different choice of . Also, note that the values of as specified by the theory are much larger than used in the comparison here. Naturally, in this problem it seems that low beta due to unimodality of the objective is optimal, since by detecting the first peak the best strategy is to focus on this peak solely and stop exploring. However, this is a special property of this benchmark only.
The value of in this experiment.