1 Introduction
In this paper we consider the problem of adaptively placing sensors to detect events occurring stochastically according to an inhomogeneous Poisson process. This problem arises in numerous applications, including ecology (Heikkinen & Arjas, 1999) and astronomy (Gregory & Loredo, 1992). Adaptive sequential decision-making that learns an optimal placement of sensors in response to observations can lead to detecting many more events than fixed policies based on an assumed Poisson process rate function. We study the problem under a simple abstract framework which encompasses many possible practical scenarios, including choosing which hours to operate to maximise customer engagement, or choosing placement of mobile base stations to service as many requests as possible, as well as the classical sensing applications.
Suppose that a decision-maker is tasked with placing a finite number of sensors along an interval. The decision-maker’s objective is to maximise, through time, a reward function which trades off the number of events detected with the cost of sensing. At each step, each sensor is tasked with sensing a subinterval, with the cost of sensing depending on the length of the subinterval. Only the events that occur in a sensed subinterval are detected. The decision-maker may update the placement of sensors at regular intervals, creating a sequential problem where the decision-maker iteratively places sensors and receives feedback on where events occurred.
The decision-maker therefore faces a classic exploration-exploitation dilemma. In each round they will gather information on what was detected in the sensed regions, and will receive a reward. The most informative action is to sense the entire interval, but this may not be the reward-maximising action due to the cost of sensing. Hence the decision-maker must choose sensor placements to trade off learning about regions where information is insufficient, while also capitalising on information they already have to generate large rewards. This paper develops an algorithm to tackle this problem with the aim of minimising Bayesian regret, the difference between the expected reward achieved by constantly selecting an optimal action and the expected reward of actions actually taken, where the expectation is taken with respect to the prior over the reward-generating parameters.
Multi-armed bandits provide models for sequential decision problems, and our problem most closely resembles the continuum-armed or $\mathcal{X}$-armed bandit problem (Agrawal, 1995). In a continuum-armed bandit (CAB) problem a decision-maker sequentially selects points in some $d$-dimensional continuous space and receives reward in the form of a noisy realisation of some unknown (usually Lipschitz smooth) function on the space. Our sensor placement problem can be mapped to this framework by representing a placement of sensors by the set of endpoints of the sensors’ subintervals. Note, however, that the noise and feedback models in the sensor placement problem are more complex than in previous treatments of CAB models, which have focused on simple numerical reward observations with bounded or sub-Gaussian noise (e.g. Bubeck et al., 2011). In this paper, we handle the added complexities of observing event locations and the heavier-tailed noise of the Poisson distribution.
Our proposed method performs fast Bayesian inference on the rate function by means of a Bayesian histogram approach (Gugushvili et al., 2018), and makes decisions to trade off exploration and exploitation using Thompson sampling (TS) (e.g. Russo et al., 2018). Gugushvili et al.’s approach to nonparametric inference on the continuous action space imposes a mesh structure over the interval, splitting it into a finite number of bins, with the mesh becoming finer as time increases. Inference is then performed over the rate of event occurrence in each bin. TS methods select an action in a given round according to the posterior probability that it is optimal. In our approach, this is implemented by sampling bin rates from the simple posterior distributions of Gugushvili et al.’s model and selecting an optimal action for these sampled rates via an efficient optimisation algorithm described in Section 3.4.
We analyse the Bayesian regret of the TS algorithm in this setting using similar techniques to those of Russo & Van Roy (2014). This allows us to derive an upper bound on the Bayesian regret that holds across all possible rate functions with a bounded maximum, and has minimal dependency on the prior used by the TS algorithm. The CAB problem with Poisson noise and event data as feedback is, to the best of our knowledge, unstudied; however, our regret upper bound is encouragingly close to the lower bound on simpler CAB models of Kleinberg (2005).
The remainder of the paper is structured as follows. We review related work in Section 2, formalise our model and algorithm in Section 3, present the regret analysis in Section 4, and conclude with simulation experiments in Section 5.
1.1 Principal Contributions
The principal contributions of this work are: (i) formulation of a new, widely applicable model of sequential sensor placement as a CAB; (ii) the first study of CABs with Poisson process feedback, and use of a new progressive discretisation technique as an approximation to the continuous action space; (iii) an efficient optimisation routine for sensor placement given a known event rate; (iv) analysis of the Bayesian regret of a TS approach, resulting in an upper bound; (v) numerical validation of the efficacy of the TS method, and its favourable performance relative to upper confidence bound and greedy approaches.
2 Related Work
The problem of allocating searchers in a continuous space has been studied by Carlsson et al. (2016) under the assumption that the rate of arrivals is known. A first attempt to solve a version of the problem in which the rate must be learned is presented in Grant et al. (2018), in which the space is discretised to a fixed grid for all time. The objective of our paper is to present the first learning version of the problem for the fully continuous space.
The fixed discretisation version of the problem maps directly to Combinatorial Multi-Armed Bandits (CMAB) (Cesa-Bianchi & Lugosi, 2012; Chen et al., 2016). This is a class of problems wherein the decision-maker may pull multiple arms among a discrete set and receives a reward which is a function of observations from individual arms. In the discretised sensor-placement problem, the individual arms correspond to cells of the grid. The model remains relevant for the continuous version of the problem, as by using an increasingly fine mesh, we approximate the problem with a series of increasingly-many-armed CMABs.
The continuum-armed bandit (CAB) model (Agrawal, 1995) is an infinitely-many-armed extension of the classic multi-armed bandit (MAB) problem. There are two main classes of algorithm for CAB problems: discretisation-based approaches, which select from a discrete subset of the continuous action space at each iteration, and approaches which make decisions directly on the whole action space. Our proposed method belongs to the former class. Early discretisation-based approaches focused on fixed discretisation (Kleinberg, 2005; Auer et al., 2007), with more recent approaches typically using adaptive discretisations such as a “zooming” approach (Kleinberg et al., 2008) or a tree-based structure (Bubeck et al., 2011; Bull, 2015; Grill et al., 2015) to manage the exploration. Authors who handle the full continuous action space typically use Gaussian process models to capture uncertainty in the unknown continuous function and balance exploration-exploitation in light of this (Srinivas et al., 2009; Chowdhury & Gopalan, 2017; Basu & Ghosh, 2017). As mentioned in Section 1, our problem can be mapped into a CAB, but since our information structure is more complex, our action space has dimension greater than 1, and the stochastic components have heavier tails than usual, standard algorithms and results do not apply.
Thompson sampling (TS) is a particularly convenient, and generally effective, method for trading off exploration and exploitation. The critical ideas can be traced as far back as Thompson (1933), although the first proofs of its asymptotic optimality came much later (May et al., 2012; Agrawal & Goyal, 2012; Kaufmann et al., 2012). Later, similar results were derived for MABs with rewards from univariate exponential families (Korda et al., 2013) and in multiple play bandits (Komiyama et al., 2015; Luedtke et al., 2016). More recently, TS has been studied in the CMAB framework by Wang & Chen (2018) and Huyuk & Tekin (2018) under slightly differing models, but both with bounded reward noise. Both papers demonstrate the asymptotic optimality of TS with respect to the frequentist regret, and we anticipate that these results could be extended to univariate exponential families. However, in both of these works, the leading order coefficients can be highly suboptimal. Therefore, rather than attempt to extend these ideas to CABs, we favour an alternative analysis of the Bayesian regret to get bounds that are of slightly suboptimal order but are more meaningful because of their (relatively) small coefficients. The Bayesian regret is less extensively studied than the frequentist regret. However the bounds that have been derived for the Bayesian regret of TS (Russo & Van Roy, 2014; Bubeck & Liu, 2013) are powerful as they do not depend on a specific parameterisation of the reward functions.
3 Model and solution
We now formally present our model and solution method.
3.1 Reward and regret
In each of a series of rounds $t = 1, 2, \dots$, events of interest arise at locations in $[0,1]$ according to a nonhomogeneous Poisson process with rate $\lambda$. $U$ sensors are deployed in each round, with each sensor observing a distinct subinterval of $[0,1]$; the action space $\mathcal{A}$ consists of the sets of at most $U$ disjoint intervals of $[0,1]$. Let $A_t \in \mathcal{A}$ be the union of the subintervals covered by the sensors in round $t$. An event is detected if it lies in $A_t$. The system objective is to maximise the number of detected events while being penalised by the cost of operating the sensors. The expected reward for playing action $A \in \mathcal{A}$ is therefore
$$r(A) = \int_A \lambda(x)\,dx - C\,|A|,$$
where $C$ is the cost per unit length of sensing and $|A|$ is the total length of $A$. We define the Bayesian regret of an algorithm to be the expected difference (with respect to the prior on $\lambda$) between the reward achieved when playing the optimal action in each of $T$ rounds and the actions taken by the algorithm:
$$BR(T) = \mathbb{E}\left[\sum_{t=1}^{T} \big(r(A^*) - r(A_t)\big)\right],$$
where $A^* \in \arg\max_{A \in \mathcal{A}} r(A)$ is the optimal action on the continuous interval.
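As a concrete illustration of the reward and regret definitions, the following minimal sketch evaluates the expected reward of an action when the rate is piecewise constant on equal-width bins, and accumulates the regret of a sequence of actions for one fixed rate (the Bayesian regret would additionally average over the prior). The domain $[0,1]$, the representation of an action as a list of half-open bin ranges, and the names `expected_reward` and `cumulative_regret` are assumptions made for this example only.

```python
def expected_reward(bin_rates, action, cost):
    """Expected events detected minus sensing cost.

    bin_rates: rate in each of len(bin_rates) equal-width bins on [0, 1]
               (the unit interval is an assumption of this sketch).
    action:    list of half-open bin ranges (start, end), assumed disjoint.
    cost:      cost per unit length of sensing.
    """
    width = 1.0 / len(bin_rates)
    return sum(
        (bin_rates[k] - cost) * width
        for start, end in action
        for k in range(start, end)
    )


def cumulative_regret(bin_rates, actions, optimal_action, cost):
    """Running sum of per-round regret for one fixed rate function."""
    best = expected_reward(bin_rates, optimal_action, cost)
    total, curve = 0.0, []
    for action in actions:
        total += best - expected_reward(bin_rates, action, cost)
        curve.append(total)
    return curve


rates = [2.0, 6.0, 6.0, 2.0]
# Sensing everything, sensing only the profitable bins, sensing nothing.
played = [[(0, 4)], [(1, 3)], []]
print(cumulative_regret(rates, played, [(1, 3)], cost=4.0))
```

Sensing the whole interval and sensing nothing both earn zero expected reward here, so each incurs the full per-round regret, which is the trade-off the cost term is designed to create.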
3.2 Inference
With the Poisson process rate $\lambda$ being defined on the continuum $[0,1]$, nonparametric estimation is preferable to a parametric form. We use the increasingly granular histogram approach of Gugushvili et al. (2018), since it provides us with fast inference and a concentration rate. At the beginning of each round $t$, a piecewise-constant estimate of $\lambda$ is considered by counting the number of events that have been observed in each of $K_t$ bins. The number of bins will be gradually increased as rounds proceed. To maintain simplicity in the inference and analysis we choose all bins to be of a constant width $\Delta_t$.
We introduce the notation $B_{k,t}$ to refer to the $k$th histogram bin at iteration $t$ (the index $t$ is needed to uniquely index a bin since the number of bins changes as $t$ increases). The number of events in bin $B_{k,t}$ in a single observation of the Poisson process is a Poisson random variable with parameter $\int_{B_{k,t}} \lambda(x)\,dx$. Since this depends on the width of the bin, we instead estimate the average rate function in a bin, defined as
$$\psi_{k,t} = \frac{1}{\Delta_t} \int_{B_{k,t}} \lambda(x)\,dx.$$
We place independent truncated Gamma (TG) priors on each of the $\psi_{k,t}$ parameters, with shape and scale parameters $\alpha$ and $\beta$ and support on $[0, \lambda_{\max}]$, where $\lambda_{\max}$ is some known upper bound on the maximum of rate functions. (The TG distribution has a density proportional to a Gamma distribution, but with truncated support $[0, \lambda_{\max}]$.) In practice the parameter $\lambda_{\max}$ may be chosen very conservatively; setting $\lambda_{\max}$ to be too large does not affect the action selection. However, it is important to include an upper limit on the prior support to permit tractable regret analysis, and the chosen $\lambda_{\max}$ appears in the regret bound in Theorem 2.
The consequence of this formulation is that, conditional on actions and observations in the first rounds, we have a posterior distribution over at time which is piecewise constant. A sampled from this posterior takes the form
(1) 
where gives the number of events observed up to iteration in bin , and gives the number of times to iteration that bin has been sensed (see Section 3.3).
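To make the per-bin update concrete, the sketch below uses the standard Gamma-Poisson conjugacy: if a bin of a given width has been sensed a number of times and a total count of events has been seen in it, a Gamma prior on the average rate is updated by adding the count to the shape and the total sensed length to the rate parameter. This is a sketch under assumptions rather than a transcription of the posterior in (1): the second prior parameter is treated here as a rate (inverse scale), the truncation is imposed by simple rejection, and the names `posterior_params` and `sample_truncated_gamma` are hypothetical.

```python
import random


def posterior_params(count, exposure, bin_width, alpha, beta):
    """Gamma posterior (shape, rate) for the average rate in one bin.

    count:    events observed in the bin so far.
    exposure: number of rounds in which the bin was sensed.
    Assumes beta is a rate parameter; each sensing of the bin contributes
    a Poisson count with mean bin_width * (average rate).
    """
    return alpha + count, beta + exposure * bin_width


def sample_truncated_gamma(shape, rate, upper, rng):
    """Draw from Gamma(shape, rate) restricted to [0, upper] by rejection.

    Rejection is cheap when `upper` is a conservative bound, since almost
    all of the Gamma mass then lies below it.
    """
    while True:
        # random.gammavariate takes a scale parameter, hence 1 / rate.
        draw = rng.gammavariate(shape, 1.0 / rate)
        if draw <= upper:
            return draw


rng = random.Random(0)
shape, rate = posterior_params(count=50, exposure=100, bin_width=0.25,
                               alpha=1.0, beta=1.0)
print(shape, rate, sample_truncated_gamma(shape, rate, upper=100.0, rng=rng))
```

A bin that has never been sensed keeps its prior, which is why a fast-growing mesh leaves most bins dominated by the prior.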
Gugushvili et al. (2018) demonstrate that, with a full observation at each iteration, this posterior contracts to the truth at the optimal rate for any Hölder continuous rate function . In particular,
if for all and . We describe in the next subsection how the same choice of gives favourable performance in our sequential decision problem, even when we only observe subintervals of .
3.3 Thompson sampling
In order to make action selection feasible, and to facilitate the inference using histograms, we constrain the action set of the TS approach using the same (increasingly fine-meshed) grid that the inference is performed over. In particular, in round $t$, the action is constrained to lie in the set of available actions $\mathcal{A}_t$, consisting of those intervals and unions of intervals where only entire bins (no fractions of bins) are covered and the action consists of at most $U$ subintervals. Recall that $U$ is the number of sensors, and the restriction to at most $U$ intervals ensures that each sensor can be allocated a single contiguous subinterval. We allow the number of bins to increase at rate $t^{1/3}$ by doubling the number of bins in line with the growth of $t^{1/3}$.
Our TS approach is described in Algorithm 1. In each round, for each bin, a rate is sampled according to (1), and then an action is selected that would be optimal if the true rate function were the piecewise-constant combination of these rates. As each bin rate is sampled from the current posterior and the action selected is the optimal action for this set of sampled rates, the selected action is chosen according to the posterior probability that it is the optimal one available. The optimal action conditional on a given sampled rate can be determined efficiently and exactly using the approach described in Section 3.4.
Algorithm 1 Thompson sampling
Inputs: Gamma prior parameters $\alpha, \beta$, upper truncation point $\lambda_{\max}$
Iterative Phase: For

For each , evaluate and and sample an index

Choose an action that maximises conditional on the true rate being given by the sampled values, and observe the events in
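The loop of Algorithm 1 can be sketched in a deliberately simplified setting: a single sensor, a fixed mesh on $[0,1]$, and a simulated piecewise-constant true rate. These simplifications, the rate-parameterised Gamma posterior, and every function name below are assumptions made for illustration. With one sensor the best action for a set of sampled rates is the contiguous run of bins with the largest total weight, found here with Kadane's maximum-subarray scan rather than the routine of Section 3.4.

```python
import math
import random


def sample_tg(shape, rate, upper, rng):
    """Rejection-sample a Gamma(shape, rate) variate restricted to [0, upper]."""
    while True:
        x = rng.gammavariate(shape, 1.0 / rate)
        if x <= upper:
            return x


def poisson(mean, rng):
    """Knuth's multiplication method; adequate for the small means used here."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def best_single_interval(weights):
    """Best contiguous run of bins (half-open), or None if none is profitable."""
    best, best_span = 0.0, None
    cur, cur_start = 0.0, 0
    for k, w in enumerate(weights):
        if cur <= 0.0:  # a non-positive prefix never helps; restart here
            cur, cur_start = 0.0, k
        cur += w
        if cur > best:
            best, best_span = cur, (cur_start, k + 1)
    return best_span


def thompson_round(counts, exposures, true_rates, cost, alpha, beta, lam_max, rng):
    """One round: sample bin rates, act optimally for them, observe, update."""
    width = 1.0 / len(counts)
    sampled = [sample_tg(alpha + h, beta + n * width, lam_max, rng)
               for h, n in zip(counts, exposures)]
    span = best_single_interval([(psi - cost) * width for psi in sampled])
    if span is not None:
        for k in range(*span):  # only sensed bins yield observations
            counts[k] += poisson(true_rates[k] * width, rng)
            exposures[k] += 1
    return span


rng = random.Random(1)
counts, exposures = [0] * 4, [0] * 4
for _ in range(200):
    thompson_round(counts, exposures, [1.0, 8.0, 8.0, 1.0], cost=4.0,
                   alpha=4.0, beta=1.0, lam_max=80.0, rng=rng)
print(exposures)
```

The prior shape is set equal to the cost in this usage example so that the prior mean sits at the cost threshold; a prior concentrated far below the cost would rarely sample a profitable bin and so would rarely gather data.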
3.4 Action selection by iterative merging (ASIM)
In this section we describe a routine, called action selection by iterative merging (ASIM), for efficiently determining the optimal action conditional on a given sampled rate function. For the piecewise constant functions sampled by the TS approach, the above optimization problem can be formulated as an integer program in which each bin is either searched or not. Grant et al. (2018) solve this program (albeit for more general cost functions and fixed discretisation) using traditional integer programming methods, with exponentially high computation complexities in and . We instead introduce an efficient optimal action selection policy with polynomial sample complexity.
Firstly, we introduce additional notation that will be useful for explaining the algorithm. Throughout this section we take as fixed and piecewise constant on bins , and provide a method to find for this . An action can be written as the union of disjoint intervals: and for all . Define the weight of an interval as . Thus, we may write the optimal action as
ASIM creates an initial set of candidate intervals such that each is the union of a number of adjacent , and for , and belong to the same if and only if and have the same sign. Notice that, by construction, the weights of adjacent intervals have opposite signs. If the number of intervals in with positive weight is not bigger than , ASIM returns all such intervals as the optimal action. Otherwise, ASIM proceeds to the next step.
ASIM iteratively reduces the number of intervals with positive weights by merging the intervals. Specifically, let be the set of intervals that should be considered for merging. If is empty, no further merging should take place. If is nonempty let be the label in with the smallest absolute weight; ASIM merges with its two neighbour intervals and into one interval and updates the set of intervals . The merging procedure repeats until either is empty or the number of intervals with positive weight equals . At this point ASIM returns the intervals with the largest weights as .
We have the following result on ASIM guaranteeing its optimality and efficiency. The proof is given in the supplementary material via an induction argument.
Theorem 1.
The ASIM policy returns the optimal action and its sample complexity is not bigger than .
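The merging idea can be illustrated with a short sketch. It is not a transcription of ASIM, whose exact candidate set for merging is not reproduced here; instead it uses the standard greedy for choosing at most a given number of disjoint runs of bins with maximal total weight: group bins into runs of alternating sign, discard non-positive runs at either end, and then repeatedly take the run of smallest absolute weight and absorb it into its neighbours. The function name `merge_select`, the half-open bin-range output, and the handling of runs at the boundary are assumptions of this example, and the simple list scan is quadratic rather than tuned for speed.

```python
def merge_select(weights, max_intervals):
    """Choose at most max_intervals disjoint runs of bins maximising total weight.

    weights[k] is the weight of bin k, e.g. bin width * (sampled rate - cost).
    Returns a list of half-open bin ranges (start, end).
    """
    # Step 1: group adjacent bins of the same sign into runs [start, end, weight].
    runs = []
    for k, w in enumerate(weights):
        if runs and (runs[-1][2] > 0) == (w > 0):
            runs[-1][1] = k + 1
            runs[-1][2] += w
        else:
            runs.append([k, k + 1, w])

    # Non-positive runs at either end can never be worth sensing.
    while runs and runs[0][2] <= 0:
        runs.pop(0)
    while runs and runs[-1][2] <= 0:
        runs.pop()

    # Step 2: each pass removes exactly one positive run.
    while sum(1 for r in runs if r[2] > 0) > max_intervals:
        i = min(range(len(runs)), key=lambda j: abs(runs[j][2]))
        if i == 0:
            # Cheapest run is a positive run at the left end: drop it together
            # with the non-positive run that separated it from the rest.
            del runs[0:2]
        elif i == len(runs) - 1:
            del runs[-2:]
        else:
            # Absorb run i into both neighbours. If run i is non-positive this
            # bridges two positive runs; if it is positive this abandons it.
            merged = [runs[i - 1][0], runs[i + 1][1],
                      runs[i - 1][2] + runs[i][2] + runs[i + 1][2]]
            runs[i - 1:i + 2] = [merged]

    return [(start, end) for start, end, w in runs if w > 0]


print(merge_select([3, -2, 2, -2, 3], max_intervals=2))
print(merge_select([1, -5, 10], max_intervals=1))
```

Because the merged run keeps the combined weight of its three parts, a later pass can undo an earlier choice (for example, re-including an abandoned positive run by bridging across it), which is what allows a greedy rule to reach the optimum.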
4 Regret Bound
In this section, we present our main theoretical contribution: an upper bound on the Bayesian regret of the TS approach. There is an inevitable minimum contribution to regret because the optimal action is unlikely to be in our discretised action set. However, by allowing the mesh to become finer as more observations are made, we gradually reduce this discretisation regret and permit a closer approximation to the true underlying rate function.
For the analysis that follows it will be useful to define $A_t^*$ as the optimal action available in round $t$. Writing $r(A)$ for the expected reward of an action $A$ and $A^*$ for the optimal action on the continuous interval, we then define for any round $t$ and any action $A$ available in that round:
$$\delta(A) = r(A^*) - r(A), \qquad \delta_t(A) = r(A_t^*) - r(A),$$
as the single-round regret of the action $A$ with respect to the optimal continuous action and the optimal action available to the algorithm in round $t$ respectively. The difference between $\delta$ and $\delta_t$ is that the “discretisation regret” incurred by choosing actions only from the discretised action set is present only in $\delta$. Minimising the true regret requires balancing estimation accuracy (requiring a coarse grid) against discretisation regret (requiring a finer grid). We find below that choosing the number of bins to be of order $t^{1/3}$ provides the best theoretical performance guarantees. This coincides with the optimal posterior contraction rate findings in Gugushvili et al. (2018). We verify this numerically in Section 5 and find that this rebinning rate is superior to a faster linear rate of rebinning.
Theorem 2.
Our main result is thus an upper bound on the Bayesian regret. A lower bound for the problem is not currently available. The closest result available is that of Kleinberg (2005) for CABs with a bounded Lipschitz smooth reward function and bounded noise. That lower bound holds only for a one-dimensional action space and is of order $T^{2/3}$. The material differences in our setting are that the observation noise is unbounded (with Poisson tails), our reward function is defined on a higher-dimensional space (the unrestricted action space of the underlying CAB has dimension equal to twice the number of sensors), and that we observe additional information in the form of event locations. In the context of the nearest related results, therefore, Theorem 2 suggests that the TS approach is a strongly performing policy.
Proof of Theorem 2.
The Bayesian regret can be decomposed as the sum of the regret due to discretisation and the regret due to selecting suboptimal actions in , as follows
The expectation in the first term only averages over functions, not over action selection, and the sum can be upper bounded uniformly over all ’s by considering the rate of rebinning. In particular we have the following lemma, proved in the supplementary material.
Lemma 1.
The regret due to discretisation is bounded by
uniformly over all rates .
To handle the stochastic part of the regret we use a decomposition from Proposition 1 of Russo & Van Roy (2014). For all , for all and for all , let and satisfy (see below for a judicious choice of these variables). Then, for any ,
The key step here is the second equality, which holds for TS because the distribution of is precisely the distribution of due to the method of selecting . The final step follows by noting that, for any ,
and similarly for . The term arises from and for all .
We will choose and so that each sum converges. In particular, the confidence bounds derived in Grant et al. (2018) for Poisson random variables inspire the definition of
for all , with upper and lower confidence bounds on the reward of an action at time as follows:
where gives the empirical mean in bin after rounds. It is in the definition of and that we see the need for a dependence in our choice of upper and lower confidence bounds—we need to count the number times actions for selected the bin defined for time .
In the supplementary material we prove the following lemmas, which when combined are sufficient to complete the proof of Theorem 2.
Lemma 2.
For and as defined above, we have
Lemma 3.
The deviation probabilities can be bounded
Combining these results we have:
which gives the required result as . ∎
5 Simulations
In this section, we provide simulation examples on the performance of the Thompson sampling approach presented in Section 3.3. We first examine the effect of the rebinning rate on the regret and then investigate the performance of the Thompson sampling approach in relation to other algorithms.
5.1 Effect of rebinning rate
Firstly we examine the effect of different rebinning rates in a simple unimodal setting with , , and sensor. This setting is chosen such that the optimal action can be calculated as . Here, and throughout our experiments, we set the prior parameters for Thompson sampling to be and , where scaling by cost makes the prior relevant to the expected scale of costs in the problem. We also set the truncation to be ten times the true maximal value of ; is an inconvenient parameter that is only needed for the theory, so we set it to a conservative large value that should have no influence on the real behaviour of the algorithm. The experiment is run 10 times for timesteps starting with bins.
We compare linear, square root and cube root rebinning rates: the number of bins is doubled in rounds where $t$ (in the linear case), $\sqrt{t}$ (square root case) or $t^{1/3}$ (cube root case) is twice its value at the last rebinning time. Actions are selected using the TS method of Algorithm 1, and Fig. 1 shows that the cumulative regret is consistently lower under the cube root rate. Under the linear rebinning rate, actions with reward close to that of the optimal action become available more quickly, reducing the discretisation regret, but the majority of bins contain very little data and the posterior inference is heavily dependent on the prior. Under the cube root (and indeed square root) rebinning rate the action set grows more slowly, but the unavoidable discretisation regret is balanced by better action selection. The square root case is surprisingly similar to the cube root case despite its weaker theoretical rate. We demonstrate the shrinking of the discretisation regret in the supplementary material.
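The doubling rule can be sketched in a few lines. The test "the $p$th root of $t$ is at least twice its value at the last rebinning time" is equivalent to "$t$ is at least $2^p$ times the last rebinning time", which avoids floating-point roots. The function name `bins_after`, the convention that the last rebinning time starts at round 1, and the horizon of 1000 rounds in the usage example are assumptions made for illustration.

```python
def bins_after(total_rounds, initial_bins, power):
    """Number of bins after total_rounds under a doubling schedule.

    power = 1 gives the linear schedule, 2 the square root schedule and
    3 the cube root schedule: bins double whenever t ** (1 / power) has
    doubled since the last rebinning, i.e. whenever t >= 2 ** power * t_last.
    """
    bins, last_rebin = initial_bins, 1  # assumed starting convention
    for t in range(1, total_rounds + 1):
        if t >= (2 ** power) * last_rebin:
            bins *= 2
            last_rebin = t
    return bins


for name, power in [("linear", 1), ("square root", 2), ("cube root", 3)]:
    print(name, bins_after(1000, 4, power))
```

Starting from 4 bins, an assumed horizon of 1000 rounds gives 2048 bins under the linear schedule and 32 under the cube root schedule, so the linear mesh ends with 64 times as many bins sharing the same data.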
We also show, in Fig. 2, the posterior inference under the linear and cube root settings at the last time step of one run of the experiment. The posterior under the linear rebinning is highly unconcentrated with simply insufficient numbers of observations in almost all bins. The cube root rate on the other hand results in a posterior which is much more concentrated about the truth in the region where it matters.
Figure 2. We show the true rate function (blue) and cost (pink), the posterior credible interval (light green) and mean (dark green) per bin. Thompson samples are shown in black, and the selected interval is the (red) vertical bar. The initial number of bins is 4 in both cases and the final number of bins is 2048 for the linear rebinning schedule and 32 for the cube root schedule.
5.2 Comparison to Baselines
We now compare different baseline policies solely using the cube root rebinning schedule. Experiments with the unimodal rate of Section 5.1 were not informative since the problem is an easy one. We instead use a bimodal rate with and sensors. Each experiment was run 10 times for time steps, starting with bins and terminating with bins. In addition to the Thompson sampling approach described in Section 3.3, we consider three other algorithms, which are summarised here and described precisely in the supplementary material. (i) An upper confidence bound (UCB) approach, in which the decision-maker chooses what would be an optimal action if the true rates were the upper confidence bounds (as defined in the proof of Theorem 2); this is essentially the FP-CUCB algorithm of Grant et al. (2018), albeit with a changing mesh, and requires the specification of an upper bound $\lambda_{\max}$ on the rate in order to define the action selection. In our experiments we fix this to the correct value; in practice a conservative estimate is usually available, but for this algorithm the choice of $\lambda_{\max}$ strongly affects the actions selected, in contrast with the TS algorithm, and we choose the most favourable value for this algorithm. (ii) A modified-UCB approach (mUCB) where the empirical mean for each histogram bin is used in place of the overall maximum rate $\lambda_{\max}$. Note that this modification invalidates the concentration results used in Grant et al. (2018), but appears to improve performance in practice. (iii) An $\epsilon$-Greedy approach where the intervals are selected according to the empirical mean for each bin, but occasionally an explorative randomisation step occurs in which the algorithm samples, for each bin, a draw from the prior. The randomisation step is taken with probability $\epsilon$.
The cumulative regret for each policy is shown in Figure 3. The worst performing policy is the UCB approach, despite its theoretical properties. The poor performance of the UCB policy is due to the overestimation of the true rate, as can be seen in the illustrative example shown in Figure 4(d). Even after 900 iterations, the UCB values (in black) are close to the cost threshold even in the regions where the true rate is low and there is little uncertainty. In contrast, the modified-UCB values, which do not depend on $\lambda_{\max}$, are less inflated where the uncertainty is low (Figure 4(c)), resulting in a better action being chosen more often. In Fig. 3 the $\epsilon$-Greedy approach achieves similar mean regret to modified-UCB but with a higher variance. The $\epsilon$-Greedy approach has the highest variance due to the greediness of the algorithm. A higher value of $\epsilon$ would reduce variance but would increase the exploration cost. The TS approach consistently outperforms all other policies.
Further intuition can also be gained from the posterior examples shown in Figure 4. These were selected at time step from one of the experimental runs. The TS approach has selected an action close to optimal. Further, the posterior variance outside the optimal interval is significantly higher than in the selected regions, as only a small number of observations were taken in those regions, demonstrating the high efficiency of the method. In contrast, both UCB approaches have uniformly low posterior variance across the entire domain, reflecting the large number of observations taken and the high exploration cost incurred. The $\epsilon$-Greedy approach, on the other hand, selects smaller than optimal intervals, with high posterior variance outside these regions. This reflects an under-exploration by the greedy approach, which is only able to escape poor local optima when the randomisation step is used.
In summary, the TS approach outperforms all the other approaches we have considered and is able to efficiently trade off exploration penalty and exploitation reward.
6 Conclusion
We have presented a continuum-armed bandit model of sequential sensor placement. This model introduces the complexities of point process data and heavy-tailed reward distributions to continuum-armed bandits for the first time through its Poisson process observations. We proposed a Thompson sampling approach to make decisions based on fast nonparametric Bayesian inference and an increasingly granular action set, and derived an upper bound on the Bayesian regret of the policy which is independent of the choice of prior distribution.
In our simulation study we have studied two aspects of our approach. Firstly we examined the effect of the rebinning rate on posterior inference and regret. The theoretically-optimal cube root rate resulted in more accurate posterior inference than a linear or square root rebinning rate. This effect was also evident in a lower regret for the cube root rate.
Our empirical study also contrasted our Thompson sampling approach with alternative approaches such as UCB and greedy policies. In both the cases we examined, we found the other methods either over-explored (e.g. UCB) or over-exploited (e.g. greedy). The TS approach achieved the best trade-off between the two and consistently achieved the lowest regret.
The observation model and rebinning strategies we have presented here are straightforward; it would be interesting to extend the algorithm and analysis to account for imperfect observations and to allow for heterogeneous bin widths, letting us capture more detail of the rate function in areas where we have made many observations and adopt a smoother estimate in others.
An alternative to the discretisation approach we have followed is to employ a continuous model such as a Cox process for which efficient approximate inference methods exist (John & Hensman, 2018). Action selection under the additive cost model would still be possible via a continuous action space extension of the ASIM routine. The regret analysis in this setting would be more involved although recent concentration results (e.g. Kirichenko & Van Zanten, 2015) suggest possible approaches.
References
 Agrawal (1995) Agrawal, R. The continuumarmed bandit problem. SIAM Journal on Control and Optimization, 33:1926–1951, 1995.
 Agrawal & Goyal (2012) Agrawal, S. and Goyal, N. Analysis of Thompson Sampling for the multiarmed bandit problem. In COLT, 2012.
 Auer et al. (2007) Auer, P., Ortner, R., and Szepesvári, C. Improved rates for the stochastic continuumarmed bandit problem. In COLT, pp. 454–468, 2007.
 Basu & Ghosh (2017) Basu, K. and Ghosh, S. Analysis of Thompson sampling for Gaussian process optimization in the bandit setting, 2017. arXiv:1705.06808.
 Bubeck & Liu (2013) Bubeck, S. and Liu, C. Priorfree and priordependent regret bounds for thompson sampling. In NeurIPS, pp. 638–646, 2013.
 Bubeck et al. (2011) Bubeck, S., Munos, R., Stoltz, G., and Szepesvári, C. armed bandits. J. Mach. Learn. Res., 12:1655–1695, 2011.
 Bull (2015) Bull, A. Adaptivetreed bandits. Bernoulli, 21:2289–2307, 2015.
 Carlsson et al. (2016) Carlsson, J., Carlsson, E., and Devulapalli, R. Shadow prices in territory division. Netw. Spat. Econ., 16:893–931, 2016.
 CesaBianchi & Lugosi (2012) CesaBianchi, N. and Lugosi, G. Combinatorial bandits. J. Comput. Syst. Sci., 78:1404–1422, 2012.
 Chen et al. (2016) Chen, W., Wang, Y., Yuan, Y., and Wang, Q. Combinatorial multiarmed bandit and its extension to probabilistically triggered arms. J. Mach. Learn. Res., 17:1746–1778, 2016.
 Chowdhury & Gopalan (2017) Chowdhury, S. and Gopalan, A. On kernelized multiarmed bandits, 2017. arXiv:1704.00445.
 Grant et al. (2018) Grant, J., Leslie, D., Glazebrook, K., Szechtman, R., and Letchford, A. Adaptive policies for perimeter surveillance problems, 2018. arXiv:1810.02176.
 Gregory & Loredo (1992) Gregory, P. and Loredo, T. A new method for the detection of a periodic signal of unknown shape and period. Astrophys. J., 398:146–168, 1992.
 Grill et al. (2015) Grill, J., Valko, M., and Munos, R. Black-box optimization of noisy functions with unknown smoothness. In NeurIPS, pp. 667–675, 2015.
 Gugushvili et al. (2018) Gugushvili, S., van der Meulen, F., Schauer, M., and Spreij, P. Fast and scalable nonparametric Bayesian inference for Poisson point processes, 2018. arXiv:1804.03616.
 Heikkinen & Arjas (1999) Heikkinen, J. and Arjas, E. Modeling a Poisson forest in variable elevations: a nonparametric Bayesian approach. Biometrics, 55:738–745, 1999.
 Huyuk & Tekin (2018) Huyuk, A. and Tekin, C. Thompson sampling for combinatorial multi-armed bandit with probabilistically triggered arms, 2018. arXiv:1809.02707.
 John & Hensman (2018) John, S. T. and Hensman, J. Large-scale Cox process inference using variational Fourier features, 2018. arXiv:1804.01016.
 Kaufmann et al. (2012) Kaufmann, E., Korda, N., and Munos, R. Thompson sampling: An asymptotically optimal finite-time analysis. In ALT, pp. 199–213, 2012.
 Kirichenko & Van Zanten (2015) Kirichenko, A. and Van Zanten, J. Optimality of Poisson process intensity learning with Gaussian processes. J. Mach. Learn. Res., 16:2909–2919, 2015.
 Kleinberg (2005) Kleinberg, R. Nearly tight bounds for the continuum-armed bandit problem. In NeurIPS, pp. 697–704, 2005.
 Kleinberg et al. (2008) Kleinberg, R., Slivkins, A., and Upfal, E. Multi-armed bandits in metric spaces. In Proc. 40th Annu. ACM Symp. on Theory of Computing, pp. 681–690, 2008.
 Komiyama et al. (2015) Komiyama, J., Honda, J., and Nakagawa, H. Optimal regret analysis of Thompson sampling in stochastic multi-armed bandit problem with multiple plays, 2015. arXiv:1506.00779.
 Korda et al. (2013) Korda, N., Kaufmann, E., and Munos, R. Thompson sampling for 1-dimensional exponential family bandits. In NeurIPS, pp. 1448–1456, 2013.
 Luedtke et al. (2016) Luedtke, A., Kaufmann, E., and Chambaz, A. Asymptotically optimal algorithms for multiple play bandits with partial feedback, 2016. arXiv:1606.09388v1.
 May et al. (2012) May, B., Korda, N., Lee, A., and Leslie, D. Optimistic Bayesian sampling in contextual-bandit problems. J. Mach. Learn. Res., 13:2069–2106, 2012.
 Russo & Van Roy (2014) Russo, D. and Van Roy, B. Learning to optimize via posterior sampling. Math. Oper. Res., 39:1221–1243, 2014.
 Russo et al. (2018) Russo, D. J., Van Roy, B., Kazerouni, A., Osband, I., and Wen, Z. A tutorial on Thompson sampling. Found. Trends Mach. Learn., 11:1–96, 2018.
 Srinivas et al. (2009) Srinivas, N., Krause, A., Kakade, S. M., and Seeger, M. Gaussian process optimization in the bandit setting: No regret and experimental design, 2009. arXiv:0912.3995.
 Thompson (1933) Thompson, W. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25:285–294, 1933.
 Wang & Chen (2018) Wang, S. and Chen, W. Thompson sampling for combinatorial semi-bandits, 2018. arXiv:1803.04623.
Appendix A Regret bound proofs
Proof of Lemma 1
Define as the smallest interval (or union of intervals) in containing the optimal interval (or union of intervals). It will be easier to bound the regret of than with respect to . We have, for ,
Here, the final inequality holds since bounds the difference between the lengths of subintervals of and , and there are such subintervals. Since the result follows immediately.
Proof of Lemma 2
Consider the term inside the expectation
where the penultimate line is due to , and the final inequality is because .
Proof of Lemma 3
We have the following, which holds for any round
The final inequality is a direct application of Lemma 1 of Grant et al. (2018), which in turn exploits Bernstein's inequality for independent Poisson random variables.
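For orientation, one standard Bernstein-type tail bound for a single Poisson random variable is recalled below. This is a textbook form, given here only as a sketch of the kind of concentration result being invoked; the exact statement and constants of Lemma 1 of Grant et al. (2018) may differ. The symbols $X$, $\lambda$ and $x$ are introduced for this illustration only.

```latex
% X ~ Poisson(\lambda), x > 0; illustrative notation, not the paper's.
\begin{align*}
  \mathbb{P}\left(X - \lambda \ge x\right)
    &\le \exp\left(-\frac{x^{2}}{2\left(\lambda + x/3\right)}\right), \\
  \mathbb{P}\left(X - \lambda \le -x\right)
    &\le \exp\left(-\frac{x^{2}}{2\lambda}\right).
\end{align*}
```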
Appendix B Proof of optimality and efficiency of ASIM
Proof of Theorem 1
Recall that the reward of an action is the sum of the weights of the intervals that comprise that action.
We prove the theorem by induction. Assume at least one initial interval has a positive weight (otherwise the optimal action is to do no sensing). For initial interval, which therefore has a positive weight, ASIM simply returns this interval, which is optimal. For initial intervals, with one positive weight, ASIM returns the positively-weighted interval, which is the optimal action. Now, assuming ASIM returns the optimal action for , we prove that ASIM returns the optimal action for initial intervals. The result follows by induction.
Given , if the number of intervals in with positive weight is no greater than , ASIM returns all such intervals. This is the optimal action since all bins with positive reward can be covered without incurring the cost of any bins with negative reward; any other action either omits a positive-reward bin, or includes a negative-reward bin.
Similarly, consider the situation in which no interval satisfies the merging condition. Suppose that the optimal action places a sensor on a sequence of intervals with . Clearly we must have and since otherwise the total weight could be increased by omitting the negatively-weighted end interval. But the fact that no interval can be merged implies that either or . Hence removing either or from the sensor will improve the total weight. It follows that, under , each sensor is allocated to a single interval, and allocating to the highest-weight intervals, as specified by ASIM, maximises the reward.
Now, assume that at least one interval is merged in ASIM. Let be the interval which minimises and so is the first interval which is merged with its neighbours in ASIM into a single interval . Let be ASIM’s solution for the set of intervals . By induction, is optimal for . We prove that , the optimal solution for , is equal to . To prove this, we consider different cases based on the sign of .
Case 1: .
First note that the optimal solution cannot include only one neighbour of . If were included but were not, we could add both and and increase the overall weight (since has the smallest absolute weight). Similarly, cannot include both and but not ; if so then could be improved by (i) using a single sensor in place of the two that cover and , adding to , and (ii) redeploying the sensor we have saved to either split one existing sensor by removing a negative-weight with , or adding a new positive-weight with . The net outcome is an improved total weight. We have shown that includes either all or none of . Since is optimal for , and the restriction to does not prevent ASIM from finding this optimal , it follows that .
Case 2: .
Under the optimal solution , a sensor cannot have a negative-weighted interval as an end interval, since dropping the negative-weight interval only increases the total weight. Furthermore, a sensor cannot include as an end interval of a series of intervals, since then the total weight could be improved by stopping sensing both and its sensed neighbour. Thus if is included in then either a sensor is observing only , or a single sensor observes all of . As in Case 1, if a sensor is observing only we can improve on by redeploying this sensor to either sense a better interval, or stop sensing an interval which has a higher negative weight than is lost by stopping sensing . So again, under , is either sensed with all its neighbours, or none of them are sensed. The same logic as in Case 1 ensures .
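The optimality claim can be checked numerically on small instances against an independent solver. The following is a minimal sketch of such a reference, assuming (as stated at the start of the proof) that the reward of an action is the sum of the weights of the intervals it covers, and that each sensor covers one contiguous run of intervals. It is a plain dynamic programme, not ASIM itself, and the function name and arguments are hypothetical.

```python
from typing import Sequence


def best_total_weight(weights: Sequence[float], num_sensors: int) -> float:
    """Largest total weight of at most `num_sensors` disjoint contiguous runs.

    Hypothetical reference solver (a dynamic programme), not ASIM.
    Sensing nothing is allowed and scores 0.
    """
    neg_inf = float("-inf")
    # best[j]: best total over the intervals seen so far, using at most j runs.
    best = [0.0] * (num_sensors + 1)
    # ending[j]: best total using at most j runs, the last of which ends
    # exactly at the most recently processed interval.
    ending = [neg_inf] * (num_sensors + 1)
    for w in weights:
        # Descend in j so that best[j - 1] still excludes the current interval.
        for j in range(num_sensors, 0, -1):
            # Either extend the run ending at the previous interval,
            # or start a new run at the current interval.
            ending[j] = max(ending[j], best[j - 1]) + w
            best[j] = max(best[j], ending[j])
    return best[num_sensors]
```

For example, with weights (3, -1, 4, -5, 2), one sensor attains 6 by covering the first three intervals, and two sensors attain 8 by additionally covering the last interval. An implementation of ASIM should return actions with the same total weight on such inputs.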
Complexity:
ASIM requires sorting the initial intervals. Noticing that there are at most mergings, and assuming constant complexity for each merging, ASIM offers an sample complexity. Since , ASIM has a sample complexity not bigger than .
Appendix C Discretisation error under linear and cubic root rates
The effect of the different rates on the unavoidable discretisation error is depicted in Figure 5. The regret falls faster under the linear rate than under the cubic root rate, because the number of bins grows much faster. However, as we show in the main paper (Section 5.1), the other part of the regret, due to error in action selection from the model forecast, is much higher under the linear rate.
Appendix D Baselines used in the empirical study
In the paper we have compared the TS approach with other approaches, which we now describe in more detail.

A UCB approach, which is based on the FPCUCB algorithm of Grant et al. (2018) and requires the specification of an upper bound on the rate, which we fix to the correct value in our experiments; in practice a conservative estimate is usually available. This is described in Algorithm 2.
Algorithm 2 UCB
Inputs: Upper bound
Initialisation Phase: For
  Select
Iterative Phase: For
  For each , evaluate and and calculate an index
  Choose an action that maximises conditional on the true rate being given by the values
  Observe the events in

A modified-UCB approach (mUCB), which has the same form as Algorithm 1 except is replaced with the empirical mean. Note that this modification breaks the upper-bound regret guarantee. The indices are:
where .

An $\epsilon$-Greedy approach where with probability $1-\epsilon$ an action is selected that maximises conditional on the rate being given by the empirical mean values . With probability $\epsilon$, the action is instead chosen by sampling rates from independent priors. In our experiments we fix .
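The rate-selection step of the greedy baseline can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the independent priors are taken to be Gamma distributions (a common conjugate choice for Poisson rates, assumed here), the empirical mean of a bin is taken to be its event count divided by its total sensed exposure, and the names `counts`, `exposure`, `prior_shape` and `prior_rate` are hypothetical. The returned rate vector would then be passed to whatever routine chooses the reward-maximising action.

```python
import random
from typing import List, Sequence


def epsilon_greedy_rates(
    counts: Sequence[int],
    exposure: Sequence[float],
    epsilon: float,
    rng: random.Random,
    prior_shape: float = 1.0,  # assumed Gamma prior shape (hypothetical)
    prior_rate: float = 1.0,   # assumed Gamma prior rate (hypothetical)
) -> List[float]:
    """Return the per-bin rates used for action selection in one round."""
    if rng.random() < epsilon:
        # Exploration: draw each bin's rate from its independent prior.
        # random.gammavariate takes (shape, scale), so scale = 1 / rate.
        return [rng.gammavariate(prior_shape, 1.0 / prior_rate) for _ in counts]
    # Exploitation: empirical mean rate per bin; bins never sensed get 0.0.
    return [c / e if e > 0 else 0.0 for c, e in zip(counts, exposure)]
```

Setting `epsilon` to 0 recovers a purely greedy policy on the empirical means, while setting it to 1 ignores the data and always draws from the priors.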