# Estimation of Poisson arrival processes under linear models

In this paper we consider the problem of estimating the parameters of a Poisson arrival process where the rate function is assumed to lie in the span of a known basis. Our goal is to estimate the basis expansions coefficients given a realization of this process. We establish novel guarantees concerning the accuracy achieved by the maximum likelihood estimate. Our initial result is near optimal, with the exception of an undesirable dependence on the dynamic range of the rate function. We then show how to remove this dependence through a process of "noise regularization", which results in an improved bound. We conjecture that a similar guarantee should be possible when using a more direct (deterministic) regularization scheme. We conclude with a discussion of practical applications and an empirical examination of the proposed regularization schemes.

## Authors

• 1 publication
• 13 publications
• ### Classification and clustering for samples of event time data using non-homogeneous Poisson process models

Data of the form of event times arise in various applications. A simple ...
03/06/2017 ∙ by Duncan Barrack, et al. ∙ 0

• ### A new integer valued AR(1) process with Poisson-Lindley innovations

In this paper, we introduce the first-order integer-valued autoregressiv...
02/03/2018 ∙ by Ameneh Rostami, et al. ∙ 0

• ### Likelihood-Free Parameter Estimation for Dynamic Queueing Networks

Many complex real-world systems such as airport terminals, manufacturing...
04/07/2018 ∙ by Anthony Ebert, et al. ∙ 0

• ### Linear-Time Poisson-Disk Patterns

We present an algorithm for generating Poisson-disc patterns taking O(N)...
07/15/2011 ∙ by Thouis R. Jones, et al. ∙ 0

• ### Fractional Risk Process in Insurance

Important models in insurance, for example the Carmér--Lundberg theory a...
08/23/2018 ∙ by Arun Kumar, et al. ∙ 0

• ### Estimating customer impatience in a service system with balking

This paper studies a service system in which arriving customers are prov...
05/07/2020 ∙ by Yoshiaki Inoue, et al. ∙ 0

• ### Transfer Learning via ℓ_1 Regularization

Machine learning algorithms typically require abundant data under a stat...
06/26/2020 ∙ by Masaaki Takada, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

In a wide range of practical problems, we wish to understand an underlying process given observations consisting only of the times (or locations) of certain events of interest. Such problems arise in as diverse applications as fluoroscopy [1] and mass spectrometry [2], or when modeling earthquakes [3]

[4], cell tissue [5], social networks [6], or terrorism incidents [7], just to name a few. In this paper we consider the case where the events are generated according to a linear parametric Poisson observation model. Specifically, we assume the events are generated according to some unknown intensity function and aim to estimate the parameterization of the intensity in a known basis. Our goal is to understand how various characteristics of the generative model relate to the quality of the estimated parameters.

To be more precise, we will assume that we observe events drawn from a Poisson arrival process over a Lebesgue-measurable space 111We will typically think of as simply representing an interval of , but leave this general to allow for more complex (e.g., spatio-temporal) Poisson processes, to which our analysis also applies. with measure corresponding to the volume of this space. We will consider a set of known basis functions and and unknown coefficients for that define a Poisson intensity function of the form

 R¯x(t)=g(t)+N∑n=1¯xnγn(t)=g(t)+¯xTγ(t), (1)

where and . Our observations consist of event coordinates (or for short) drawn according to this process, by which we mean that the event coordinates satisfy

 |τ∩T|∼Poisson(∫TR¯x(t)dt) (2)

for any , where here denotes the cardinality of a set. We note that the total number of events we observe is

, which is a Poisson-distributed random variable with parameter

. We also note that, conditioned on , each

is independent and identically distributed with probability density function

. Given the observed events

, our goal is to estimate the vector

.

A primary motivation for this work is to obtain a result that applies to arbitrary sets of basis functions . Existing work often focuses on bases with desirable properties. While it is sometimes possible for the user to carefully construct a basis with desirable properties that result in a simpler analysis, in many important cases the basis is determined by the application. For example, consider the following applications:

• Low-light imaging: Suppose that we desire to resolve a sparse image from blurry photon counts [8]. In this case the natural basis is composed of shifted versions of the point spread function inherent to the system, and the coefficients in this basis represent the resolved pixel values we wish to recover.

• Oversampled mass spectrometry: The mass spectrum of an ionized substance (the mass-to-charge ratios of the constituent ions) can be measured by applying an electrical impulse to the ions and measuring their corresponding velocity. This is accomplished by sensing when they hit a detector after traveling across a distance. If the impulse is applied rapidly to the substance [9], the observations at the detector is the mass spectrum convolved with the electrical impulse pattern. In this setting the basis elements are shifted (and sometimes blurred) versions of the impulse pattern, one for each mass-to-charge ratio under consideration (which determines the size of the shift). The coefficients estimated using this basis then determine the mass spectrum of the sample.

• Autoregressive Poisson processes: Linear autoregressive Poisson processes [10, 11] or Hawkes processes [12, 7, 13] correspond to networks of mutually-exciting temporal point processes. When attempting to estimate the network parameters of these processes, we consider a Poisson-like (though, in general, not Poisson) estimation problem where the basis is causally constructed as a function of previous events. Specifically, the basis is constructed by convolving a known causal kernel with the preceding event times. Under certain assumptions on the structure of a network (e.g., that the network forms a directed acyclic graph), these observations correspond to observations of an ensemble of Poisson processes with (random) bases. The coefficients in this basis then determine the dictate the degree of influence of the other processes in the network on the process under consideration.

In each of these examples, we have almost no influence over the basis selection process but are interested in the basis representation of the process. We must be capable of analyzing the performance of our processing for a wide variety of bases. Towards this end, the results we present in this paper are based on simple basis properties (e.g., norms on and properties related to the conditioning of the basis), which make them particularly suited to these situations where we do exercise direct control over the basis.

### 1.1 Main contribution

In Theorem 1 we provide an error bound for the maximum-likelihood estimate of in the linear model of (1). To the best of our knowledge, this is the first such guarantee for this setting (with existing results applying principally to the case where is a discrete set, as discussed in Section 1.2). The result roughly states that if where is sufficiently large, then with high probability the maximum likelihood estimate will satisfy

 ∥ˆx−¯x∥2≤CγRmax√R%min,

where is a constant that depends on the conditioning of the basis . In Theorem 2, we extend this result to the special case where

and the estimate are known to be sparse, although in practice this may require that we solve a combinatorial optimization.

We observe that our performance bound exhibits no dependence on the term in (1). Thus, it is possible to artificially reduce the dynamic range to a constant by augmenting the observations with additional events generated by a homogenous Poisson process (thereby uniformly increasing and reducing the dynamic range). This “noise augmentation” procedure can be viewed as a form of regularization, which allows us to establish an improved bound (Corollary 2) of the form

 ∥ˆx−¯x∥2≤C′γ√Rmax

with no requirement on . Our result is, to the best of our knowledge, the first to remove this dependence on the dynamic range. Previously, this dependence limited the situations to-which theoretical results were applicable.

We conjecture that a bound of this form should also apply to an alternative (deterministic) regularization scheme, described in Section 2.5. Note, however, that in Section 4 we empirically study the impact of these regularization schemes and find little evidence for improvements in practice. This suggests that the dependence on the dynamic range in Theorem 1 may simply be an artifact of our proof and the proofs of related results.

Finally, we also compare our results to estimates of the Cramér-Rao lower bound, as well as to the error we expect in a simple concrete choice of basis . Together, these comparisons suggest that when the basis is relatively well-conditioned, there is little potential room for improvement in our bounds (up to a constant and the dynamic range term in Theorem 1). On the other hand, there may still be room for improvement when the basis is poorly conditioned.

### 1.2 Related work

Our focus in this paper is on the general Poisson arrival model of (2). This model has received relatively little attention up to this point. It is considered in [14], which studies a LASSO-based estimator, but which provides bounds only on the accuracy with which is recovered, rather than as is our focus here. Far more common in the literature is a related counting-based model which is used in, e.g., [15, 9, 16, 17, 18, 19]. In this model we instead observe Poisson bin counts in the form

 ym∼Poisson(gm+⟨γm,¯x⟩) (3)

for (we retain as the number of events). This can be viewed as a discrete approximation to the arrival model, realized by partitioning into non-overlapping intervals and setting , , and . In fact, the arrival model in (2) is identical to the counting model in (3) in the special case that and are piecewise-constant on each interval . Thus, any result for the arrival model can also be applied to this counting model. Note that, in contrast, it is usually impossible to apply counting results to the general arrival model because driving the interval widths to zero results in vanishing counts within each interval [18].

In [17], a variation of the LASSO estimator (similar to that used in [14]) is proposed that is suitable for solving the Poisson counting inference problem. Performance bounds for constrained/penalized maximum-likelihood estimators are provided in [18, 16, 19]. Of course, these bounds apply only to Poisson counting problems – nevertheless, since our results also imply bounds for the counting model, it is instructive to compare our bounds with the existing work in this domain. Among the prior work which is most relevant is that of [16], from which we specifically draw inspiration in portions of our analysis. We provide a detailed comparison of our results with those in [16] in Section 2.7.

### 1.3 Organization of the paper

We begin in Section 2 with the statement of our main recovery guarantee for the estimation of arrival process parameters and several corollaries. In particular, we consider the special case of Poisson counting processes and contrast with the guarantee provided by [16]. Section 3 compares our result to estimates of the Cramér-Rao lower bound, as well as to the error we expect in a simple concrete choice of basis . Section LABEL:sec:app briefly presents an example application of these results. In Section 4 we explore the practical impact of different forms of regularization via simulations. Finally, we close in Section 5 with a discussion of our results.

## 2 Recovery guarantees

### 2.1 Maximum likelihood estimation

The negative log-likelihood of observing a set of events at coordinates under (2) is given by [20] to be

 L(τ|x)=∫TRx(t)dt−M∑m=1logRx(τm). (4)

for any intensity parameterized by (e.g., the intensity in (1)). Note that an intensity function must be nonnegative across its entire domain. This can be enforced by defining . Accordingly, we constrain to live in the set , which is a convex set under our model (1).

The maximum likelihood estimate of the true parameters, constrained to lie within intersected with an arbitrary set , is simply

 ˆx=argminx∈X∩RL(τ|x). (5)

Given the linear model for presented in (1), we can simplify (4) to the equivalent form (omitting the constant term) of

 L(τ|x)=bTx−1Tlog(g+Ax), (6)

where , , for and , and the logarithm is taken element-wise over the vector argument. Note that and are both random, as they depend on the random observations . In the Poisson counting model of (3), we instead define . In this case (4) can be simplified to

 L(y|x)=1TAx−yTlog(g+Ax). (7)

We substitute this likelihood function, depending on the bin counts instead of arrival coordinates , into (5) to form the corresponding maximum likelihood estimation program.

For convex , the estimation program (5) is convex (for both the likelihood functions in (6) and in (7)). This optimization problem can be efficiently solved by algorithms such as those presented in [21, 22].

### 2.2 Parameter estimation error bound

In order to state our main result, we will make use of the definition , i.e., the matrix with entries . Let and

represent the trace and minimum eigenvalue of

, respectively, and let .

With these definitions, we can state our main result as follows:

###### Theorem 1

If events are produced by a Poisson arrival process with intensity and , any vector satisfying and will also satisfy

 ∥ˆx−¯x∥2

for an absolute constant with probability at least .

The core idea behind the proof (which we borrow from [16]) is to establish the strong convexity of the negative log-likelihood over the domain of our estimator (specifically, such that ). With strong convexity, we can show that any large deviation from the true solution (i.e., far from ) must necessarily violate , disqualifying it from being the maximizer of the likelihood. Because the likelihood function depends on a random set , this requires the use of concentration inequalities. The complete proof is detailed in the Appendix.

The parameter primarily serves as a tradeoff parameter between the scaling of the error and the probability that our bound does not hold, although it also has a role in regulating the minimum that we can tolerate. Note that the probability bound is only nontrivial when , thus the theorem contains a dependence on in both the error bound and . In general, however, our primary interest is less on the asymptotic dependence on and more on the dependence of the error on the structural properties of the underlying dictionary, so we leave this dependence implicit.

### 2.3 Estimation under sparsity assumptions

In many practical settings, especially when the dimension of the basis is large, one might expect to be sparse (i.e., to have relatively few nonzeros). There is a natural extension of Theorem 1 that can result in an improved bound when this occurs. To state this result, we will define to be the -dimensional basis obtained by taking the basis elements indexed by and to be the submatrix composed of the rows and columns of indexed by (i.e., the Gram matrix of ). Let and represent the trace and minimum eigenvalue of , respectively, and let .

###### Theorem 2

If events are produced by a Poisson arrival process with intensity and , any vector satisfying , , and will also satisfy

 ∥ˆx−¯x∥2

for an absolute constant with probability at least .

The proof is detailed in the Appendix.

Note that the bound in Theorem 2 depends on the specific choice of index set . A reader familiar with the sparse approximation literature may wonder how this compares with more traditional results, which hold for arbitrary sparsity patterns, and how the quantities and compare to more familiar quantities. In fact, it is straightforward to restate this result in terms of an appropriate analogue of the restricted isometry property (RIP) [23]. Specifically, let denote the operator defined by . Observe that we can write . We say that the operator (or equivalently, the basis ) satisfies the RIP if there exists a constant such that

 (1−δs)∥x∥22≤xTΓx≤(1+δs)∥x∥22

for all with at most nonzeros.

Using this definition, if satisfies the RIP then we have the bounds and . These bounds let us ignore the specifics of if we can instead bound the size of the support. Accordingly, the bound (9) from Theorem 2 can be replaced with

 ∥ˆx−¯x∥2

Observe that the results of Theorem 2 (as stated above or combined with bounds based on the RIP) are really only of interest when we can ensure that the set is sufficiently small. By the triangle inequality, . Thus, if and are both small, we can use the RIP of to obtain an improved error bound. Unfortunately, enforcing a constraint leads to a challenging nonconvex optimization problem. An open question concerning (5) is whether we can guarantee that is sparse when using a sparsity-inducing convex constraint such as . We leave such analysis for future work.

### 2.4 Guarantees for counting processes

When using the Poisson counting model (3), the special case of Theorem 1 follows immediately from the substitutions discussed in Section 3. Specifically, if we define , to be the Frobenius norm of , and

to be the smallest singular value of

(the minimum eigenvalue of ), then we have the following result.

###### Corollary 1

Let with . If , any vector satisfying and will also satisfy

 ∥ˆx−¯x∥2

for an absolute constant with probability at least .

We note that this result can easily be extended to the sparse setting in a manner analogous to Theorem 2 by replacing with , with , with , with , and adding the constraint .

### 2.5 Improving the guarantee via regularization

We note that has no impact on the result of Theorem 1. This opens the door to what seems like a rather unorthodox strategy for improving this bound. Specifically, the dependence on the dynamic range in Theorem 1 can be removed by simply augmenting our observations with additional events generated by a homogeneous Poisson process with intensity . These additional events will increase the intensity of our process, but in a manner completely known to us. Specifically, in this case the intensity of is

 ˜R¯x(t)=R¯x(t)+Rmax=(g(t)+Rmax)+¯xTγ(t).

Note that if then and . Thus, by simply replacing with in (6) and applying Theorem 1, we obtain the following corollary:

###### Corollary 2

If events are produced by a Poisson arrival process with intensity , events are produced by a homogeneous Poisson process with intensity , and , then any vector satisfying and will also satisfy

 ∥ˆx−¯x∥2<2c√ζTr(Γ)σ(Γ)√R\emph{max} (12)

for an absolute constant with probability at least .

To be clear, when we augment our observations with the set of events (with known intensity ), the negative log-likelihood function that results (ignoring constant terms) is

 L(τ∪ρ|x)=∫TRx(t)dt−∑mlog(β+Rx(τm))−∑ℓlog(β+Rx(ρℓ)).

Comparing the bound in (12) with that in (8), we see that in exchange for an additional factor of two, we have completely removed the dependence on the dynamic range. Moreover, we have replaced the constraint on in Theorem 1 with a more relaxed constraint on . One might be justifiably skeptical that this will lead to improved performance in practice – we are essentially adding a large amount of noise to the signal we wish to estimate. We explore this empirically in Section 4. Here, we note that a potentially more palatable strategy might be to perform a deterministic form of regularization that is similar in spirit. Specifically, define the regularized negative log-likelihood

 L′β(τ|x)=bTx−∑mlog(β+Rx(τm))−β∫Tlog(β+Rx(t))dt, (13)

noting that this is equivalent to (6) when . is the result of replacing from with its expectation.222The difficulty of computing the integral in (13) in closed-form means that, typically, numerical integration would be necessary when performing estimation using the regularized likelihood in (13). The use of Riemann integration reduces this result to a sampled version of an arrival model, i.e., a counting model. Recovery as in Corollary 2 does not suffer this drawback, remaining comparable to standard maximum likelihood estimation as featured in Theorem 1.

Unfortunately, there does not appear to be an easy extension of our current analysis techniques that would allow us to provide an error bound for the estimate obtained by minimizing (13). We conjecture that a result similar to Corollary 2 should also hold when optimizing the regularized negative log-likelihood in (13) with .

The same regularization scheme can be applied to Theorem 2 and Corollary 1, yielding the same consequence of exchanging the dynamic range for a factor of two.

### 2.6 Practical considerations

#### 2.6.1 Choosing X

If then the program in (5) will result in an estimate that, by construction, satisfies the condition of our theorems. Possible sets can include unconstrained vectors, sparse vectors, vectors with limited norms, or nonnegative vectors. Any convex will maintain the convexity of (5) – although convexity is not necessary if we are not concerned with computational efficiency or can still ensure that through some alternative argument that does not rely on solving (5) to global optimality. This results in a tradeoff between constraint sets that enable simple efficient algorithms and that more tightly enforce desired properties in our recovery . For example, when sparsity is exploited (as in Theorem 2) the set (where we define to count the number of nonzeros in ) is a natural choice to ensure a fixed sparsity level in , but results in a nonconvex (and thus challenging) optimization problem. In contrast, is convex and encourages sparse solutions, but does not necessarily guarantee a specific sparsity level.

We also note that our Theorems depend on the condition that . This can be satisfied by choosing . While the exact value of will likely be unknown, it can be estimated or bounded to sufficient accuracy from . Alternatively, it can be bounded as when we have knowledge of the norm of , although this bound is typically poor. As a matter of practice, we believe that there is often minimal risk in ignoring this constraint completely. Its role in the proof is solely to limit the pointwise difference , but we expect that a suitable will accomplish this (at least up to a constant factor) even in the absence of an explicit constraint.

#### 2.6.2 Sample complexity

A common lens through which we can evaluate a statistical estimation problem concerns the number of observations required to obtain an accurate estimate of the quantities of interest. Here our situation is somewhat outside the standard framework as the parameters which we wish to estimate actually determine the number of observations (events) we will obtain. Nevertheless, we can gain some insight by approaching the problem from this perspective.

For the purpose of generality, we will analyze the number of observations with respect to Theorem 2. In the dense case of Theorem 1, simply set . It is necessary that for the bound to hold with nonzero probability. Because , we have that

 ∥γS∥22,∞σ(ΓS)≥|S||T|.

Thus, the requirement that implies that it is necessary that for the bound to hold with nonzero probability. Alternatively, in the regularized case of Corollary 2 we replace the requirement on with a requirement on , resulting in the condition that .

On the surface, neither of these conditions directly addresses the issue of how many observations are required. One may have noticed that the number of events observed, (or even its expectation ) was conspicuously absent from our theorems. While not explicitly included, this quantity is implicitly present through the inequalities . Thus when the dynamic range is modest, these conditions essentially reduce to . This mirrors typical results in the standard setting of sparse recovery with Gaussian statistics [23]. In non-sparse recovery, we are left with . While it is possible that the factor here is only an artifact of our proof (the factor is shared by Gaussian estimation), it may also be the case that it is necessary to account for the variability of .

### 2.7 Comparison to existing results

Here we summarize the result of [16] in terms of the quantities introduced here. The result of [16] holds when , , and are constrained to be nonnegative and holds for some choice . To make the comparison, we will need to provide several additional definitions from [16] (relabeled to better-reflect our notation):

 σ2∗(A) =minS:|S|≤s,x:∥xS∥1≥∥x−xS∥1∥Ax∥22∥x∥22 R∗min =minx:|x∥0=s,∥x∥1=∥¯x∥1H(g+Ax) R∗max =∥g∥∞+∥A∥max∥¯x∥1

where

denotes the harmonic mean of the input vector and

denotes the maximum-magnitude entry of a matrix. The quantity denotes the restricted eigenvalue of of order and is roughly comparable to . Additionally, is roughly on the order of . Finally, note that and (for any ) and that these inequalities are often quite loose.

With these definitions, the main result of [16] can be written as

 ∥ˆx−¯x∥2≤54(3+logR∗maxRmin)√ζsM0∥A∥maxσ2∗(A)R∗max√R∗min

with probability at least when . We remark that this is extremely similar to the findings of the sparse variant of Corollary 1. However, Corollary 1 removes the restriction that , , and be nonnegative. Further, Theorem 2 generalizes this result to the case of Poisson arrival processes. We have also presented Corollary 2, which removes the dependence on or entirely. Although it was not the primary aim of this work, we have also considerably improved the dependence on the terms and . However, we also stress that the results in [16] do have an advantage when dealing with sparse in that their guarantees hold for the (convex) constrained estimate (as opposed to requiring the nonconvex sparsity constraint required to obtain a similar guarantee via Theorem 2).

## 3 Optimality

Here we briefly discuss the optimality of our results. In particular, we will first consider a concrete choice of for which the expected performance is easy to directly calculate and compare this performance to the bounds established above. Next, we will characterize the Cramér-Rao lower bound for this problem and compare our results with this bound.

### 3.1 Example: A simple orthobasis

Consider the simple orthobasis of

 γn(t)={1n−1≤t

In this case we have and . Since this basis is disjoint (no two elements are supported on intersecting intervals), this is really just a collection of independent Poisson estimation problems with . The maximum likelihood estimate is trivially given by

 ˆxn=∑mγn(τm),

i.e., the number of events falling in that interval. Hence, and the expected squared error is , from which it follows that

 E∥ˆx−¯x∥22=N∑n=1¯xn.

Applying Theorem 1 to this problem yields the bound

 ∥ˆx−¯x∥22≤c2ζNRmaxR% maxRmin

with probability at least when . Our high-probability bound differs from the average-case error of this system (up to a constant) only by the dynamic range. Alternatively, in Corollary 2, we showed how the dynamic range could be removed from our bound by using a modified recovery program. Specifically, we obtain a guarantee of the form

 ∥ˆx−¯x∥22≤4c2ζNRmax

with probability at least when . In this case, the bound differs from the expectation by only constants and the ratio between the average and maximum intensity .

### 3.2 Cramér-Rao lower bound

The example described above suggests that for a well-conditioned basis , our analysis appears to be relatively tight. Indeed, by comparing our bound to the Cramér-Rao lower bound for this problem, we will obtain additional evidence that this is, in fact, the case.

Towards this end, we note that the Fisher information matrix for (6) is

The precise value of will depend on the problem (on both and on ). However, one can show that (see Appendix A.2 for details), which provides the semidefinite ordering

 R−1maxΓ⪯I¯x⪯R−1minΓ.

From this we obtain

 Tr(Γ−1)Rmin≤Tr(I−1¯x)≤Tr(Γ−1)Rmax. (14)

The Cramér-Rao lower bound states that any unbiased estimate

of must satisfy

 E∥ˆx−¯x∥22≥Tr(I−1¯x).

Because the maximum-likelihood estimator asymptotically achieves the Cramér-Rao bound under this model [24], (14) implies that, asymptotically, the MLE will satisfy a bound of the form

 E∥ˆx−¯x∥22≤Tr(Γ−1)Rmax. (15)

There is limited room for improvement here because another asymptotic bound implied by (14) is that

 E∥ˆx−¯x∥22≥Tr(Γ−1)Rmin,

as any opposing result would violate the Cramér-Rao bound.

Compare these bounds to the result of Corollary 2, which states that (with high probability) the MLE can achieve

 ∥ˆx−¯x∥22≲Tr(Γ)σ(Γ)2Rmax. (16)

Contrasting (15) with (16), we observe that (15) is a slightly stronger guarantee (ignoring the fact that (16) holds with high probability while (15) holds only in expectation). In particular, it is a straightforward consequence of the eigendecomposition of that

 Tr(Γ−1)≤Tr(Γ)σ(Γ)2,

with equality if and only if the spectrum of is flat (i.e., all eigenvalues are equal to each other). In the case where the basis is ill-conditioned and has one or more eigenvalues which are very small compared to the remainder, the bound in (15) can be somewhat tighter than that in (16). However, we note that unless is quite large, both and can both be dominated by , and so when the basis is very poorly conditioned both bounds are rather poor. In total, these results suggest that (up to a constant), there is relatively little room for improvement over the bound in Corollary 2.

## 4 Regularization in practice

Here we take a practical look into the claims of Corollary 2 and our conjecture concerning our deterministically regularized alternative. Specifically, Corollary 2 provides improved theory (at least when is small) when we augment our observations with additional random events (i.e., noise). We would like to understand if such gains are actually to be expected in practice.

Towards this end, we choose a basis of shifted-Gaussian basis functions, sampled as to produce a counting process with observations (). We choose coefficient vectors

with entries that are independent and identically exponentially distributed such that the expected number of events is

. We then simulate the Poisson process with the intensity given by this basis and these parameters and compute the -norm error between the true and maximum likelihood estimate . We compare this error to the error of computing the regularized estimator using random (Corollary 2) or deterministic (using (13)) regularization. Figure 1 plots this result for different choices of regularization parameter relative to over many trials. Note that Corollary 2 uses , and larger denotes stronger regularization (and denotes no regularization). Relative error larger than 1 denotes degraded performance with regularization and less than 1 denotes improved performance.

While we only include simulations for a single choice of basis and parameters, these results are representative of the behavior we observed for every choice of basis/parameters we have attempted. Examining Figure 1, we note that randomized regularization (Corollary 2) frequently results in significantly-degraded performance. Larger regularization results in larger error as the noise overwhelms the signal. Deterministic regularization does not suffer the same increasing error with increasing regularization, but instead saturates at error that is, in the median, slightly higher than the unregularized error. We infer that the effect of regularization is simply to perturb the solution in a way that will sometimes improve the error and will sometimes degrade it, without biasing our solution in some way that results in consistent improvement.

In summary, despite our best efforts, we have not uncovered any regime in which the regularization suggested by Corollary 2 (or our conjecture) systematically improved recovery over the unregularized case. This leaves open the possibility that the bound of Corollary 2 may be more representative of actual performance than that of Theorem 1 (differing by a dependence on the dynamic range of the underlying intensity) even when the prescribed regularization is completely omitted. In other words, it is possible that the results of Corollary 2 actually hold without any regularization at all. However, our analysis does not provide an obvious means to establish this result.

## 5 Conclusions

In this paper we have provided novel recovery guarantees for parameter estimation in the context of Poisson arrival processes. These guarantees are near-optimal. However, we note that this optimal theoretical performance is established for a version of the MLE which has been augmented with noise – a procedure that does not appear to actually result in improved performance in practice.

There are several remaining improvements that may yet be possible. First, the results presented here require that (the unknown part of) the Poisson intensity function lie within the span of the basis. It may also be possible to address recovery when the true intensity is only well-approximated by this basis. Second, the existence of Corollary 2, which adds noise to the observations to improve theoretical performance, combined with the findings of Sections 3 and 4 suggests that the dynamic range dependence in Theorem 1 may be an artifact of the proof and ultimately unnecessary. We continue to conjecture that an improved bound should be possible for our deterministically-regularized MLE (or a similar alternative), but leave further exploration of this issue for future work. Finally, many important open questions remain concerning the use of more sophisticated convex regularizers (e.g., an -norm constraint when is sparse).

## Acknowledgments

This work was supported by grants NRL N00173-14-2-C001, AFOSR FA9550-14-1-0342, and NSF CCF-1350616 as well as support from the Alfred P. Sloan Foundation.

## References

• [1] A. Singh, D. Wilson, and R. Aufrichtig, “Enhancement of X-ray fluoroscopy images,” in Proc. SPIE, vol. 1898, 1993, pp. 304–310. [Online]. Available: http://dx.doi.org/10.1117/12.154516
• [2] R. Cotter, Time-of-flight mass spectrometry.   ACS Publications, 1993.
• [3] Y. Ogata, “Space-time point-process models for earthquake occurrences,” Ann. Inst. Stat. Math., vol. 50, no. 2, pp. 379–402, 1998.
• [4] D. Johnson, “Point process models of single-neuron discharges,” J. Comput. Neurosci., vol. 3, no. 4, pp. 275–299, 1996.
• [5] T. LaGrow, M. Moore, J. Prasad, M. Davenport, and E. Dyer, “Approximating cellular densities from high-resolution neuroanatomical imaging data,” in Proc. IEEE Eng. in Med. and Bio. Conf. (EMBC), Honolulu, HI, USA, Jul. 2018.
• [6] K. Zhou, H. Zha, and L. Song, “Learning social infectivity in sparse low-rank networks using multi-dimensional Hawkes processes,” in Proc. Int. Conf. Art. Intell. Stat. (AISTATS), Scottsdale, AZ, USA, Apr. 2013.
• [7] S. Tench, H. Fry, and P. Gill, “Spatio-temporal patterns of IED usage by the Provisional Irish Republican Army,” European J. Appl. Math., vol. 27, no. 3, pp. 377–402, 2016.
• [8] C. Lo and A. Sawchuk, “Nonlinear restoration of filtered images with Poisson noise,” vol. 0207, 1979, pp. 84–95.
• [9] M. Moore, A. Massimino, and M. Davenport, “Randomized multi-pulse time-of-flight mass spectrometry,” in Proc. IEEE Work. Comput. Advances in Multi-Sensor Adaptive Processing (CAMSAP), Cancun, Mexico, Dec. 2015.
• [10] K. Fokianos, A. Rahbek, and D. Tjøstheim, “Poisson autoregression,” J. Amer. Statist. Assoc., vol. 104, no. 488, pp. 1430–1439, 2009.
• [11]

E. Hall, G. Raskutti, and R. Willett, “Inferring high-dimensional Poisson autoregressive models,” in

Proc. IEEE Work. Stat. Signal Processing, Palma de Mallorca, Spain, Jun. 2016.
• [12] A. Hawkes, “Point spectra of some mutually exciting point processes,” J. Royal Stat. Soc. B, vol. 33, no. 3, pp. 438–443, 1971.
• [13] M. Moore and M. Davenport, “Analysis of wireless networks using Hawkes processes,” in Proc. IEEE Work. Signal Proc. Advances in Wireless Comm. (SPAWC), Edinburgh, UK, Jul. 2016.
• [14] N. Hansen, P. Reynaud-Bouret, and V. Rivoirard, “Lasso and probabilistic inequalities for multivariate point processes,” Bernoulli, vol. 21, no. 1, pp. 83–143, 2015.
• [15] Y. Li and G. Raskutti, “Minimax optimal convex methods for Poisson inverse problems under -ball sparsity,” arXiv:1604.08943, Apr. 2016.
• [16] M. Rohban, V. Saligrama, and D. Vaziri, “Minimax optimal sparse signal recovery with Poisson statistics,” IEEE Trans. Signal Processing, vol. 64, no. 13, pp. 3495–3508, 2016.
• [17] X. Jiang, P. Reynaud-Bouret, V. Rivoirard, L. Sansonnet, and R. Willett, “A data-dependent weighted LASSO under Poisson noise,” arXiv:1509.08892, 2015. [Online]. Available: http://arxiv.org/abs/1509.08892
• [18] M. Raginsky, R. Willett, Z. Harmany, and R. Marcia, “Compressed sensing performance bounds under Poisson noise,” IEEE Trans. Signal Processing, vol. 58, no. 8, pp. 3990–4002, 2010.
• [19] A. Soni and J. Haupt, “Estimation error guarantees for Poisson denoising with sparse and structured dictionary models,” in Proc. IEEE Int. Symp. Inform. Theory (ISIT), Jun. 2014, pp. 2002–2006.
• [20] D. Daley and D. Vere-Jones, An introduction to the theory of point processes.   New York: Springer-Verlag, 1988.
• [21] Z. Harmany, R. Marcia, and R. Willett, “This is SPIRAL-TAP: Sparse Poisson intensity reconstruction algorithms – theory and practice,” IEEE Trans. Image Processing, vol. 21, no. 3, pp. 1084–1096, 2012.
• [22] Q. Tran-Dinh, A. Kyrillidis, and V. Cevher, “Composite self-concordant minimization,”

J. Machine Learning Research

, vol. 16, pp. 371–416, 2015.
• [23] M. Davenport, M. Duarte, Y. Eldar, and G. Kutyniok, “Introduction to compressed sensing,” in Compressed Sensing: Theory and Applications.   Cambridge University Press, 2012.
• [24] Y. Ogata, “The asymptotic behaviour of maximum likelihood estimators for stationary point processes,” Ann. Inst. Stat. Math., vol. 30, no. 1, pp. 243–261, Dec. 1978.
• [25] J. Tropp, “An introduction to matrix concentration inequalities,” Foundations and Trends in Machine Learning, vol. 8, no. 1-2, 2015.

## Appendix A Proof of Theorems 1 and 2

The overall outline of our proof is inspired by the proof of the main result of [16]. We will provide a proof of Theorem 2 – Theorem 1 follows as a special case of the same argument.

Define . Our goal is to show that

 ∥Δ′∥2<ϵ∀Δ′ s.t. L(τ|¯x+Δ′)

A sufficient condition for (17) is that

 L(τ|¯x+Δ′)−L(τ|¯x)≥0∀Δ′ s.t. ∥Δ′∥2≥ϵ. (18)

In other words, any large deviation from implies an inferior likelihood to that of , and hence any with a smaller negative log-likelihood than must lie close to .

Suppose that has nonzero elements. We can define a and a selector matrix (a submatrix of the identity matrix) such that . We make the definitions , , , and . As a shorthand, we will define . The negative log-likelihood gap between the perturbed and true solutions reduces to

 L(τ|¯x+SΔ)−L(τ|¯x)=bTSΔ−1Tlog(1+DASΔ).

We use the bound , along with some simple algebraic manipulation, to obtain

 L(τ|¯x+SΔ)−L(τ|¯x)≥bTSΔ−1TDASΔ+3ΔTATSDdiag(6(g+A¯x)+4ASΔ)−1ASΔ. (19)

Constraining and to the interval ensures that and . Employing these bounds allows us state that

 diag(6(g+A¯x)+4ASΔ)−1⪰110R%maxI

and thus we can lower-bound the right-hand side of (19) by

 (bS−ATSD1)TΔ+310RmaxΔTATSDASΔ.

After applying the Cauchy-Schwarz inequality and basic properties of singular values, we obtain

 L(τ|¯x+SΔ)−L(τ|¯x)≥−∥∥bS−ATSD1∥∥2∥Δ∥2+3σ(ATSDAS)10Rmax∥Δ∥22

where denotes the minimum singular value of matrix. Accordingly,

 ∥Δ∥2≥10Rmax∥∥bS−ATSD1∥∥23σ(ATSDAS)

implies that . Thus, the choice is sufficient to satisfy (18). This expression involves two random quantities, and . These are random because and depend on the (random) observations of our process. We address these quantities using the following lemmas:

###### Lemma 1

Using the preceding definitions, the inequality

 ∥∥bS−ATSD1∥∥2≤23ζ∥γS∥2,∞R\emph{min}+√2ζM¯¯¯¯MTr(ΓS)R\emph{min}

holds with probability at least .

Proof: See Appendix A.1.

###### Lemma 2

Using the preceding definitions, the inequality

 σ(ATSDAS)≥σ(ΓS)⎛⎜⎝1− ⎷2ζ∥γS∥22,∞R\emph{min}σ(ΓS)⎞⎟⎠

holds with probability at least .

Proof: See Appendix A.2.

By applying Lemmas 1 and 2 to our expression for and applying a union bound, with probability at least the error is bounded

 ϵ≤10Rmax323ζ∥γS∥2,∞Rmin+√2ζTr(ΓS)Rminσ(ΓS)⎛⎝1−√2ζ∥γS∥22,∞Rminσ(ΓS)⎞⎠.

If for some then, using the inequalities , we bound

 ζ≤Rminσ(ΓS)α∥γS∥22,∞≤RminTr(ΓS)αs∥γS∥22,∞.

By defining , it follows that

 ϵ≤cα,s√ζTr(ΓS)σ(ΓS)Rmax√Rmin

with probability at least .

The constant shrinks as or grow (though other terms in the bound grow with ). For any , is well-defined and bounded by an absolute constant for any . For example, with the trivial choice we can bound , , and . This completes the proof of Theorem 2. Theorem 1 is a result of the choice (i.e., ).

### a.1 Proof of Lemma 1

The elements of the vector are given by the expression

 [b−A