1 Introduction
In various imaging modalities, recovering the image from acquired data can be recast as solving an inverse problem of the form $Ax = y$, where $A$ is a linear operator, $y$ represents noisy measurements and $x$ the image, with $x \geq 0$ usually a desirable property. The problem thus becomes $\min_{x \geq 0} D(y, Ax)$, where $D$ is some given distance or divergence.
When the model is finite-dimensional, the operator is simply a matrix $A \in \mathbb{R}^{m \times n}$. If we assume a Poisson noise model, i.e., $y_i \sim \mathrm{Poisson}((Ax)_i)$ with independent draws, the corresponding (negative log-)likelihood problem is equivalent to
(1) $\min_{x \in \mathbb{R}^n_{\geq 0}} \ \mathrm{KL}(y, Ax),$
where
$\mathrm{KL}(y, z) := \sum_{i=1}^m \Big( y_i \log \frac{y_i}{z_i} - y_i + z_i \Big)$
is the Kullback–Leibler divergence. As it turns out, this statistical model is similar to the familiar nonnegative least-squares regression corresponding to Gaussian noise, but for a different distance functional: if $y$ does not belong to the image set $A(\mathbb{R}^n_{\geq 0})$, it is projected onto it in the sense of $\mathrm{KL}$, whereas if it belongs to this image set, any $x \geq 0$ such that $Ax = y$ will be optimal. The celebrated Maximum Likelihood Expectation Maximisation algorithm (MLEM) precisely aims at solving (1) and was introduced by Shepp and Vardi [Shepp1982, Vardi1985], in the particular context of the imaging modality called Positron Emission Tomography (PET). MLEM is iterative and writes
(2) $x_j^{k+1} = \frac{x_j^k}{\sum_{i=1}^m a_{ij}} \sum_{i=1}^m a_{ij} \, \frac{y_i}{(Ax^k)_i}, \quad j = 1, \dots, n,$
starting from some initial guess $x^0 > 0$, usually a constant image.
This algorithm is an expectation-maximisation (EM) algorithm, and as such it has many desirable properties: it preserves nonnegativity, and the negative log-likelihood decreases along iterates [Dempster1977]. It can also be interpreted in several other ways [Vardi1985, Csiszar1984, Benvenuto2014]; see [Natterer2001] for an overview. The expectation-maximisation point of view has also led to alternative algorithms [Fessler1993], but in spite of various competing approaches, MLEM (actually, its more numerically efficient variant OSEM [Hudson1994]) has remained the algorithm used in practice in many PET scanners.
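For concreteness, here is a minimal Python sketch of iteration (2) in the discrete setting. The function name and the toy data are our own choices; this is an illustrative sketch, not the implementation used in any scanner.

```python
import numpy as np

def mlem(A, y, n_iter=100, x0=None):
    """Plain MLEM for min_{x >= 0} KL(y, Ax), iteration (2):
    x_j <- x_j / (sum_i a_ij) * sum_i a_ij * y_i / (Ax)_i."""
    m, n = A.shape
    x = np.ones(n) if x0 is None else np.asarray(x0, dtype=float).copy()
    col_sums = A.sum(axis=0)                      # sum_i a_ij, assumed positive
    for _ in range(n_iter):
        x *= (A.T @ (y / (A @ x))) / col_sums     # multiplicative: keeps x >= 0
    return x

# Consistent toy data: y = A x_true, so MLEM should drive KL(y, Ax) to zero
rng = np.random.default_rng(0)
A = rng.random((5, 3)) + 0.1
x_true = np.array([1.0, 2.0, 0.5])
y = A @ x_true
x = mlem(A, y, n_iter=20000)
```

Note that the update is multiplicative, so nonnegativity is automatic, and after the first iteration the total predicted count $\sum_i (Ax^k)_i$ equals $\sum_i y_i$.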
Despite its success, the MLEM algorithm (2) is known to produce undesirable spikes along the iterates, with some pixels taking increasingly high values. The toy example presented in Figure 1 illustrates such artefacts in the case of PET: after many MLEM iterations, the reconstruction of a torus exhibits isolated pixels whose values far exceed the true maximum.
This phenomenon has long been noticed in the literature, where such images are referred to as “spiky” or “speckled” [Silverman1990], while others mention “the chequerboard effect” [Vardi1985]. In the discrete case, the works [Byrne1993, Byrne1995, Byrne1996] have provided a partial explanation for this behaviour: the author proves that, under general conditions, minimisers of (1) have only a small number of nonzero entries whenever the data lies outside the image of the nonnegative orthant.
To the best of our knowledge, however, a theoretical justification for the persistence of only a few nonzero entries has remained elusive.
The aim of the present paper is to better understand that phenomenon via the analysis in a continuous setting of the minimisation problem (1) and the corresponding MLEM algorithm (2). The continuous setting here refers to the image not being discretised on a grid. Note however that we keep the data space discrete.
Informally, considering the image as a nonnegative measure $\mu$ on a compact set $K$, we consider forward operators of the form
$(A\mu)_i = \int_K a_i(x) \, \mathrm{d}\mu(x), \quad i = 1, \dots, m,$
where $K$ is the compact set on which one aims at reconstructing the image, and each $a_i$ is some nonnegative continuous function on $K$. This covers a wide range of applications, including PET.
One of our motivations is to derive algorithms for Poisson inverse problems with movement, for example for PET acquisition of a moving organ [Jacobson2003]. In that case, movement can be modelled by deformations of images which do not easily carry over to discretised images (simply because interesting deformations do not preserve the grid). It is then desirable to express the problem in a continuous setting, in order to both analyse the algorithms proposed in the literature, and to derive new ones [Hinkle2012, Oktem2019].
The literature on inverse problems for imaging with a continuum description of the unknown image is abundant [Bertero1998, Baumeister2005]. Most often, the image is taken to be a function in some appropriate Sobolev space. To the best of our knowledge, however, there are relatively few results concerning the continuum description of the Poisson likelihood and the MLEM algorithm for solving it.
In [Multhei1987, Multhei1989, Multhei1992] and [Resmerita2007], both the image and the data are considered in continuum, with a deterministic description of the noise, the detector functions and the image being assumed to lie in appropriate Lebesgue spaces. These works study the convergence properties of the corresponding MLEM algorithm in detail; in the first series of three papers, the compact set $K$ is moreover restricted to a special case.
Our paper differs from these works in that we do not make the following two restrictive assumptions, common to [Multhei1987, Multhei1989, Multhei1992, Resmerita2007]. The first restriction is to assume the existence of a nonnegative solution to the equation $A\mu = y$. The second restriction is to assume that the functions $a_i$ are bounded away from zero. This last assumption is unrealistic for some applications such as PET [Resmerita2007, Remark 6.1].
The seminal paper [Mair1996] considers the optimisation problem over the set of nonnegative Borel measures as we do. They obtain the corresponding likelihood function informally as the limit of the discrete one, but do not prove that it is an actual maximum likelihood problem for the PET statistical model. They then proceed to study the problem of whether minimisers can be identified with bounded functions, and not merely measures which might have a singular part. They indeed note that in some very specific cases (see also [Multhei1992]), one can prove that the minimiser should be a Dirac mass. They speculate that there might be a link with the usual spiky results from MLEM. They, however, do not provide any general conditions for sparsity.
Working in the space of nonnegative measures $\mathcal{M}_+(K)$, our main contributions are as follows:
 Continuous Framework.

We prove that the continuous setting of measures yields precisely the maximum likelihood problem associated with a Poisson point process model, and that the natural generalisation (25) of the MLEM iterates corresponds to the expectation-maximisation method for that continuous statistical model (see subsection 2.2).
 Sparsity.

We give a precise sparsity criterion (sparsity meaning that any optimal solution has singular support): if the data is outside the cone $\mathcal{C} := A(\mathcal{M}_+)$, then all optimal solutions are necessarily sparse (subsection 3.2); if the data is inside the cone, then there exist absolutely continuous solutions (subsection 3.3).
 Properties of MLEM iterates.

We show the expected properties of the MLEM iterates, namely monotonicity (Equation 36) and optimality of cluster points (subsection 4.1).
 Properties of MLEM solutions.

We show that in the nonsparse case, i.e., when an absolutely continuous solution exists as just mentioned, the solution picked up by MLEM is absolutely continuous (subsection 4.3), and we give an explicit example of MLEM converging to a sum of point masses in the sparse case (subsection 4.2).
 Effect of Noise.
 "Spiky" artefacts.
Outline of the paper
The paper is organised as follows. In section 2, we introduce the functional and MLEM in continuum in detail, with all the necessary notation, normalisations, definitions and useful properties of Kullback–Leibler divergences. Section 3 contains all results on the minimisers of the functional, from the optimality conditions to the contrasting cases of the data being inside or outside the cone $\mathcal{C}$. Section 4 is devoted to the MLEM algorithm itself, with the proof of its usual properties in continuum together with their implications in the case where the data is in the cone. In section 5, we estimate the probability that the data ends up outside the image cone. In section 6, we present simulations which confirm our theoretical predictions. Finally, in section 7 we conclude with open questions and perspectives.
2 Maximum likelihood and MLEM in continuum
2.1 Mathematical background
Space of Radon measures.
As stated in the introduction, we model the image to reconstruct as a nonnegative measure defined on a compact set $K$. Some of our results require further regularity of $K$ (typically, that it be the closure of a bounded domain in $\mathbb{R}^d$).
More precisely, we will consider the set of Radon measures, denoted $\mathcal{M}(K)$ and defined as the topological dual of the space of continuous functions over $K$, denoted $C(K)$. The space of nonnegative measures will be denoted by $\mathcal{M}_+(K)$. For brevity, we will often write $\mathcal{M}$ for $\mathcal{M}(K)$ and $\mathcal{M}_+$ for $\mathcal{M}_+(K)$ when there is no ambiguity as to the underlying compact set $K$.
We identify a linear functional with its corresponding Borel measure (as per the Riesz–Markov representation theorem), using $\mu(B)$ to denote the measure of a Borel subset $B$ of $K$. We will also sometimes write the dual pairing between a measure $\mu$ and a function $f \in C(K)$ as
$\langle \mu, f \rangle = \int_K f \, \mathrm{d}\mu.$
The support of a measure $\mu \in \mathcal{M}_+$ is defined as the closed set
$\mathrm{supp}(\mu) := \{ x \in K : \mu(V) > 0 \ \text{for all } V \in \mathcal{V}_x \},$
where $\mathcal{V}_x$ is the set of all open neighbourhoods of $x$.
Finally, recall that, by the Banach–Alaoglu theorem, bounded sets in $\mathcal{M}(K)$ are weak-* compact [Rudin1991].
Kullback–Leibler divergence.
We here recall the definition of the Kullback–Leibler divergence. Instead of giving the general definition, we directly state the two instances that will actually be needed in this paper: for (non-normalised) nonnegative vectors in $\mathbb{R}^m_{\geq 0}$, and for probability measures on $K$.
For vectors in $\mathbb{R}^m_{\geq 0}$. For any two nonnegative vectors $y$ and $z$ in $\mathbb{R}^m_{\geq 0}$, we define the Kullback–Leibler divergence between $y$ and $z$ as
$\mathrm{KL}(y, z) := \sum_{i=1}^m \Big( y_i \log \frac{y_i}{z_i} - y_i + z_i \Big),$
with the convention $0 \log 0 = 0$, and $\mathrm{KL}(y, z) = +\infty$ if there exists $i$ such that $z_i = 0$ and $y_i > 0$.
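As a quick sanity check of these conventions, here is a direct Python transcription of the vector case (the function name is our own):

```python
import math

def kl_div(y, z):
    """Kullback-Leibler divergence between nonnegative vectors y and z:
    KL(y, z) = sum_i (y_i log(y_i / z_i) - y_i + z_i),
    with the conventions 0 log 0 = 0 and KL = +inf when z_i = 0 < y_i."""
    total = 0.0
    for yi, zi in zip(y, z):
        if yi == 0.0:
            total += zi               # 0 log 0 = 0 convention
        elif zi == 0.0:
            return math.inf           # y_i > 0 but z_i = 0
        else:
            total += yi * math.log(yi / zi) - yi + zi
    return total
```

The divergence vanishes if and only if $y = z$, and is nonnegative otherwise, which is what makes it usable as a data-fidelity term.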

For probability measures on $K$. For any two probability measures $\mu$ and $\nu$ on $K$, we define the Kullback–Leibler divergence between $\mu$ and $\nu$ as
$\mathrm{KL}(\mu, \nu) := \int_K \log\Big( \frac{\mathrm{d}\mu}{\mathrm{d}\nu} \Big) \mathrm{d}\mu$
if $\mu$ is absolutely continuous with respect to $\nu$ and $\log\big( \frac{\mathrm{d}\mu}{\mathrm{d}\nu} \big)$ is integrable with respect to $\mu$. Here $\frac{\mathrm{d}\mu}{\mathrm{d}\nu}$ stands for the Radon–Nikodym derivative of $\mu$ with respect to $\nu$. Otherwise, we define $\mathrm{KL}(\mu, \nu) := +\infty$.
When a measure is absolutely continuous with respect to the Lebesgue measure on $K$, we simply say that it is absolutely continuous. Any reference to the Lebesgue measure implicitly assumes that $K$ stands for the closure of some bounded domain in $\mathbb{R}^d$ (i.e., of a bounded, connected and open subset of $\mathbb{R}^d$).
2.2 Statistical model
Positron Emission Tomography [Ollinger1997].
In PET, a radiotracer is injected into the patient and, once concentrated in the tissues, disintegrates by emitting positrons. This emission process is well known to be accurately modelled by a Poisson point process, itself defined by a nonnegative measure. After a short travel distance called the positron range, a positron interacts with an electron, resulting in the emission of two photons in random opposite directions. Pairs of detectors around the body then detect simultaneous photons, and the data is given by the number of counts per pair of detectors.
In the case of PET, $m$ is the number of pairs of single detectors. For a given point $x \in K$ and detector (pair) $i$, $a_i(x)$ then denotes the probability that a positron emitted at $x$ will be detected by detector $i$.
Finally, we will throughout the paper assume
(5) $\sum_{i=1}^m a_i(x) > 0 \quad \text{for all } x \in K.$
For PET, this amounts to assuming that the points of $K$ are in the so-called field of view, namely that any emission has a nonzero probability of being detected.
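To make this count model concrete, here is a small self-contained Python simulation. All names, the uniform intensity on $[0, 1]$, and the two toy detection profiles are our own choices, not from the paper: a Poisson number of emission points is drawn, and each point is detected by at most one of two "detectors" with position-dependent probabilities, mimicking the thinning mechanism recalled below.

```python
import math
import random

random.seed(0)

def poisson_draw(lam):
    """Knuth's method for a Poisson random draw (fine for small lam)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

def simulate_counts(total_mass, a_funcs, n_runs=2000):
    """Simulate a Poisson point process on [0, 1] with uniform intensity of
    total mass `total_mass`, thin each point into detectors according to the
    detection probabilities a_funcs (pointwise sum <= 1), and return the
    average count per detector over n_runs realisations."""
    sums = [0] * len(a_funcs)
    for _ in range(n_runs):
        for _ in range(poisson_draw(total_mass)):
            pos = random.random()            # uniform point location on [0, 1]
            u, acc = random.random(), 0.0
            for i, a in enumerate(a_funcs):
                acc += a(pos)                # detector i catches point w.p. a_i(pos)
                if u < acc:
                    sums[i] += 1
                    break                    # otherwise the point goes undetected
    return [s / n_runs for s in sums]

# Two toy detectors; each count should be Poisson with mean
# total_mass * integral of a_i over [0, 1] = 10 * 0.25 = 2.5
means = simulate_counts(10.0, [lambda x: 0.5 * x, lambda x: 0.5 * (1 - x)])
```

The empirical per-detector averages match the theoretical means $\int_K a_i \, \mathrm{d}\mu$ up to Monte Carlo error.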
Derivation of the statistical model (3) for PET.
We proceed to give a proof that the statistical model (3) (and thus the corresponding likelihood function) applies to PET. Here, we assume that the emission process of PET is modelled by a Poisson point process, defined by a measure $\mu$, and that each point of the process at position $x$ has a probability $a_i(x)$ of being detected by detector $i$.
The statistical model (3) applies to PET.
Proof.
The proof relies on the following properties of Poisson point processes [Last2017]:

law of numbers: the number of points emitted by a Poisson point process of intensity measure $\mu$ follows the Poisson law with parameter $\mu(K)$.

thinning property: the points that are kept, each with (measurable) probability $p(x)$, still form a Poisson point process, with intensity measure $p \, \mathrm{d}\mu$, and it is independent from the process of points that are not kept. This property generalises to finitely many thinnings $p_1, \dots, p_m$, with $\sum_{i=1}^m p_i(x) \leq 1$ for all $x$.
By the thinning property, the families of points leading to an emission detected in detector $i$, for $i = 1, \dots, m$, are all independent Poisson point processes with associated measure $a_i \, \mathrm{d}\mu$. Thus, the random variables $Y_i$ representing the number of points detected in detector $i$ are independent and of law $\mathrm{Poisson}\big( \int_K a_i \, \mathrm{d}\mu \big)$, which proves the claim. ∎
Maximum likelihood problem.
Dropping the factorial terms (they do not depend on $\mu$ and will thus play no role when maximising the likelihood), we get
(6) $L(\mu) = \prod_{i=1}^m e^{-\int_K a_i \, \mathrm{d}\mu} \Big( \int_K a_i \, \mathrm{d}\mu \Big)^{y_i}.$
The corresponding maximum likelihood problem, written for a realisation $y = (y_1, \dots, y_m)$ of the random variable $Y = (Y_1, \dots, Y_m)$, is given by
(7) $\max_{\mu \in \mathcal{M}_+(K)} \ \sum_{i=1}^m \Big( y_i \log \int_K a_i \, \mathrm{d}\mu - \int_K a_i \, \mathrm{d}\mu \Big).$
Defining the operator
(8) $A : \mathcal{M}(K) \to \mathbb{R}^m,$
(9) $(A\mu)_i := \int_K a_i \, \mathrm{d}\mu, \quad i = 1, \dots, m,$
the optimisation problem conveniently rewrites in terms of the Kullback–Leibler divergence: upon adding constants and passing to the negative log-likelihood, it reads
(10) $\min_{\mu \in \mathcal{M}_+(K)} \ \mathrm{KL}(y, A\mu).$
MLEM iterates.
We now define the MLEM algorithm, which aims at solving the optimisation problem (7). It is given by the iterates
(11) $\mathrm{d}\mu_{k+1}(x) = \frac{1}{\sum_{i=1}^m a_i(x)} \sum_{i=1}^m a_i(x) \, \frac{y_i}{(A\mu_k)_i} \, \mathrm{d}\mu_k(x),$
starting from an initial guess $\mu_0 \in \mathcal{M}_+(K)$.
Note that this algorithm can be shown to be an EM algorithm for the continuous problem. The proof is beyond the scope of this paper, so we omit it; we just mention that the corresponding so-called complete data would be given by the positions of the emitted points together with the detector that has detected each of them.
2.3 Normalisations
Due to assumption (5), we may without loss of generality assume that
(12) $\sum_{i=1}^m a_i(x) = 1 \quad \text{for all } x \in K.$
Otherwise we could just define $\tilde{a}_i := a_i / \sum_j a_j$ and $\mathrm{d}\tilde{\mu} := \big( \sum_j a_j \big) \mathrm{d}\mu$, which leaves $A\mu$ unchanged. This normalisation now implies $0 \leq a_i \leq 1$ for all $i$.
We further normalise by dividing the functional by the total number of counts, replacing $y$ by $y / \sum_i y_i$, so that the data becomes a probability vector. We then define
From now on, we consider the optimisation problem (minimisation of the negative log-likelihood):
(13) $\min_{\mu \in \mathcal{M}_+(K)} F(\mu),$
where
(14) $F(\mu) := \mathrm{KL}(y, A\mu)$
(15) $\phantom{F(\mu)} = \sum_{i=1}^m \Big( y_i \log \frac{y_i}{(A\mu)_i} - y_i + (A\mu)_i \Big),$
defined to be $+\infty$ for any measure $\mu$ such that $(A\mu)_i = 0$ for some $i$ with $y_i > 0$.
After normalisation, the MLEM iterates are given by
(16) $\mathrm{d}\mu_{k+1}(x) = \Big( \sum_{i=1}^m a_i(x) \, \frac{y_i}{(A\mu_k)_i} \Big) \mathrm{d}\mu_k(x),$
We recall the property that MLEM preserves the total number of counts: $\mu_k(K) = 1$ for all $k \geq 1$, which corresponds to $\mu_k(K) = \sum_i y_i$ before normalisation. We also emphasise the important property that iterations cannot increase the support of the measure, namely $\mathrm{supp}(\mu_{k+1}) \subseteq \mathrm{supp}(\mu_k)$ for all $k$.
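Both properties are easy to observe numerically. Below is a small Python sketch of the normalised iteration (16) on a measure discretised as point masses on a finite grid (the sizes and data are our own toy choices): the total mass stays equal to 1, and atoms with zero initial weight stay at zero.

```python
import numpy as np

# a[i, x] plays the role of a_i(x) on a 6-point grid; w(x) are the weights
# of a discrete measure.  Toy sizes and data are illustrative only.
rng = np.random.default_rng(1)
a = rng.random((4, 6))
a /= a.sum(axis=0)                       # normalisation (12): sum_i a_i(x) = 1
y = np.array([0.1, 0.2, 0.3, 0.4])       # normalised data, sum_i y_i = 1

w = np.array([0.3, 0.0, 0.2, 0.0, 0.25, 0.25])   # mass 1, two zero atoms
for _ in range(50):
    w *= a.T @ (y / (a @ w))   # (16): w(x) <- w(x) * sum_i a_i(x) y_i/(A w)_i
```

Mass preservation follows from exchanging the two sums, and the zero atoms persist because the update is multiplicative.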
MLEM iterates are well-defined.
We assume throughout that the initial measure fulfils
(17) $(A\mu_0)_i > 0 \quad \text{for all } i \text{ such that } y_i > 0.$
Note that the usual practice is to take $\mu_0$ to be absolutely continuous with respect to the Lebesgue measure, typically the normalised Lebesgue measure itself, in which case (17) is satisfied.
The following simple lemma shows that assumption (17) ensures that the iterates are well-defined.
The MLEM iterates (16) satisfy $(A\mu_k)_i > 0$ for all $k \geq 0$ and all $i$ such that $y_i > 0$.
Proof.
From the Cauchy–Schwarz inequality, $\int_K a_i^2 \, \mathrm{d}\mu_k \geq (A\mu_k)_i^2 / \mu_k(K)$. Combined with the definition of the MLEM iterates, this entails for any $i$ such that $y_i > 0$,
$(A\mu_{k+1})_i \geq \frac{y_i}{(A\mu_k)_i} \int_K a_i^2 \, \mathrm{d}\mu_k \geq \frac{y_i \, (A\mu_k)_i}{\mu_k(K)} > 0.$
∎
2.4 Adjoint and cone
Since $A : \mathcal{M}(K) \to \mathbb{R}^m$ is bounded, we can define its adjoint $A^* : \mathbb{R}^m \to C(K)$ (identifying $\mathbb{R}^m$ as a Euclidean space with its dual), which is given by
(18) $(A^* \lambda)(x) = \sum_{i=1}^m \lambda_i \, a_i(x).$
The set $\mathcal{C} := A(\mathcal{M}_+(K))$ is a closed and convex cone and, as proved in [Georgiou2005], its dual cone $\mathcal{C}^*$ can be characterised as the set of vectors $\lambda$ such that $A^*\lambda \geq 0$ on $K$, i.e.,
(19) $\mathcal{C}^* = \Big\{ \lambda \in \mathbb{R}^m : \sum_{i=1}^m \lambda_i a_i \geq 0 \ \text{on } K \Big\}.$
As a result, the interior of the dual cone is given by the vectors $\lambda$ such that $A^*\lambda > 0$ on $K$.
The normalisation condition (12) can now be rewritten
(20) $A^* \mathbf{1} = 1 \quad \text{on } K,$
where $\mathbf{1}$ is the vector of which all components are one: $\mathbf{1} = (1, \dots, 1)$. Moreover, we can rewrite the MLEM iteration (16) as
$\mathrm{d}\mu_{k+1} = A^*\Big( \frac{y}{A\mu_k} \Big) \, \mathrm{d}\mu_k,$
which is the continuous analogue of the discrete case (2), taking into account the normalisation (20).
Minimisation over the cone.
The problem (13) is equivalent to the following minimisation problem over the cone $\mathcal{C}$:
(21) $\min_{z \in \mathcal{C}} \ \mathrm{KL}(y, z).$
Indeed, if $z^\star$ is optimal for problem (21), any $\mu \in \mathcal{M}_+$ such that $A\mu = z^\star$ is optimal for the original problem. From the property $\mathrm{KL}(y, z) = 0 \iff y = z$, we also infer that when $y \in \mathcal{C}$, $\mu$ is optimal if and only if $A\mu = y$.
3 Properties of minimisers
In this section, we gather results concerning the functional $F$ and its minimisers, proving that they are sparse when the data is not in the image cone $\mathcal{C}$. First, we note that the functional $F$ defined by (14) is a convex and proper function.
3.1 Characterisation of optimality
We now derive necessary and sufficient optimality conditions. The existence of minimisers turns out to be a byproduct of the analysis of the MLEM algorithm, whose cluster points will be shown to satisfy the optimality conditions, see subsection 4.1.
We first prove that any optimum must have unit mass.
If $\mu$ is optimal for (13), then $\mu(K) = 1$.
Proof.
For any $\mu \in \mathcal{M}_+$, we have
(22) $F(\mu) = \mathrm{KL}\Big( y, A \frac{\mu}{\mu(K)} \Big) + \mu(K) - 1 - \log \mu(K).$
Observe that the second term depends only on the mass $\mu(K)$, whereas the first term is scale-invariant. As a result, an optimal $\mu$ has to minimise the second term, which turns out to admit the unique minimiser $\mu(K) = 1$. ∎
This result also follows from the optimality conditions derived later in subsection 3.1, but the proof above is simple and highlights that the maximum likelihood estimator for $\mu$ is consistent with that for its total mass: the second term in (22) is none other than the negative log-likelihood of the total mass.
We now give the full optimality conditions. The convex function $F$ defined in (14) has the following open domain:
$\mathrm{dom}(F) = \{ \mu \in \mathcal{M}_+ : (A\mu)_i > 0 \ \text{for all } i \text{ such that } y_i > 0 \}.$
Notice further that for any $\mu \in \mathrm{dom}(F)$, the function $F$ is Fréchet-differentiable (in the sense of the strong topology). Its gradient at $\mu$ is then the element in the dual of $\mathcal{M}$ given by
(23) $\nabla F(\mu) = A^*\Big( \mathbf{1} - \frac{y}{A\mu} \Big),$
which we identify with an element of $C(K)$.
For any vector $z \in \mathbb{R}^m_{\geq 0}$, we define $\lambda(z) \in \mathbb{R}^m$ by
(24) $\lambda(z)_i := \frac{y_i}{z_i}$
(with the convention $\lambda(z)_i = 0$ if $y_i = 0$, that is, even if $z_i = 0$). Using $\lambda^\mu := \lambda(A\mu)$, we can rewrite (23) as
(25) $\nabla F(\mu) = \mathbf{1} - A^* \lambda^\mu.$
The measure $\mu \in \mathcal{M}_+$ is optimal for problem (13) if and only if the following optimality conditions hold:
(26) $A^* \lambda^\mu \leq 1 \ \text{on } K, \qquad A^* \lambda^\mu = 1 \ \text{on } \mathrm{supp}(\mu).$
These conditions can be equivalently written as
(27) $\nabla F(\mu) \geq 0 \ \text{on } K, \qquad \langle \mu, \nabla F(\mu) \rangle = 0.$
Proof.
Since $F$ is differentiable on its domain and convex on the convex set $\mathcal{M}_+$, a point $\mu$ is optimal if and only if
$-\nabla F(\mu) \in N(\mu),$
where $N(\mu)$
is the normal cone to $\mathcal{M}_+$ at $\mu$.
From the characterisation of $N(\mu)$ given below in subsection 3.1, and the fact that $A^* \mathbf{1} = 1$, this optimality condition exactly amounts to conditions (26). ∎
The normal cone to $\mathcal{M}_+$ at a given $\mu \in \mathcal{M}_+$ is given by
$N(\mu) = \{ g \in C(K) : g \leq 0 \ \text{on } K, \ g = 0 \ \text{on } \mathrm{supp}(\mu) \}.$
Proof.
Let $g$ be in $N(\mu)$, i.e., it satisfies $\langle \nu - \mu, g \rangle \leq 0$ for all $\nu$ in $\mathcal{M}_+$. First, we choose $\nu = \mu + \delta_x$ for $x \in K$, which yields $g(x) \leq 0$, so we must have $g \leq 0$ on $K$. Then, with $\nu = 2\mu$, we find $\langle \mu, g \rangle \leq 0$. Since with $\nu = 0$ we also have $\langle \mu, g \rangle \geq 0$, we get $\langle \mu, g \rangle = 0$, leading to $g = 0$ on $\mathrm{supp}(\mu)$. The reverse also holds true: if $g \leq 0$ on $K$ and $g = 0$ on $\mathrm{supp}(\mu)$, then $\langle \nu - \mu, g \rangle = \langle \nu, g \rangle \leq 0$ for all $\nu$ in $\mathcal{M}_+$, and consequently $g$ is in $N(\mu)$. ∎
An alternative proof of these optimality conditions can be obtained by considering instead the equivalent problem of minimising $\mathrm{KL}(y, z)$ with $z$ ranging over the cone $\mathcal{C}$. The cone has a nonempty relative interior, which proves that Slater's condition is fulfilled. Since the problem is convex, the KKT conditions are equivalent to optimality [Boyd2004].
The Lagrange dual is given by for . A straightforward computation leads to
(28) 
for , with value if there exists such that .
The KKT conditions for a primal optimal $z^\star$ and a dual optimal $\lambda^\star$ write:
(a) primal feasibility, $z^\star \in \mathcal{C}$;
(b) dual feasibility;
(c) complementary slackness (together equivalent to the optimality conditions (26)).
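In the discrete analogue, the optimality conditions read $A^\top \lambda \leq A^\top \mathbf{1}$ componentwise, with equality on the support of $x$, where $\lambda_i = y_i / (Ax)_i$. A minimal Python check (the toy matrix and data are our own, chosen so that $y$ lies outside the cone, and we assume MLEM has essentially converged after a few hundred iterations):

```python
import numpy as np

# Toy 2x2 problem: the cone A(R^2_+) is {(z1, z2) : z1 >= z2 >= 0},
# and y = (0.2, 0.8) lies outside it, so the limit should be sparse.
A = np.array([[1.0, 0.5],
              [0.0, 0.5]])
y = np.array([0.2, 0.8])

x = np.ones(2)
col_sums = A.sum(axis=0)
for _ in range(500):
    x *= (A.T @ (y / (A @ x))) / col_sums     # discrete MLEM step (2)

lam = y / (A @ x)
grad_like = A.T @ lam          # should be <= col_sums, = on the support of x
```

Here the first coordinate is driven to zero (a sparse limit), and $A^\top \lambda$ attains the bound exactly on the surviving coordinate.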
3.2 Case $y \notin \mathcal{C}$
When the data $y$ is not in the cone $\mathcal{C}$, the optimality conditions imply sparsity of any optimal measure.
Proof.
Define . Note that . Moreover, , so we conclude that for some , . ∎
Why does condition (29) imply sparsity? Let $\lambda := \lambda^\mu$ be defined as in (24) for an optimal measure $\mu$, and define the function $\varphi := A^* \lambda$. We know from subsection 3.1 that both $\varphi \leq 1$ on $K$ and $\varphi = 1$ on $\mathrm{supp}(\mu)$.
Assuming that the $a_i$'s are linearly independent in $C(K)$, the function $1 - \varphi$ cannot vanish identically since $y \notin \mathcal{C}$. Supposing further that all points of $\mathrm{supp}(\mu)$ lie in the interior of $K$, every $x \in \mathrm{supp}(\mu)$ is a minimum point of $1 - \varphi$, so that $\nabla \varphi(x) = 0$ there.
We make the final assumption that the Hessian of $1 - \varphi$ is invertible at the points of $\mathrm{supp}(\mu)$, which is equivalent to its positive definiteness since these are minimum points of $1 - \varphi$. This implies that $\mathrm{supp}(\mu)$ consists of isolated points. Consequently, the restriction to the interior of $K$ of any optimal solution is a sum of Dirac masses.
Note that all the above regularity assumptions hold true for generic functions $a_i$. One case where all of them are readily satisfied is when the functions $a_i$ are analytic and $K$ is connected.
We can exhibit a case where only Dirac masses are optimal. Suppose that $m = 1$. Then, for measures such that $\mu(K) = 1$, the function $F$ is simply $F(\mu) = \log \frac{1}{(A\mu)_1} - 1 + (A\mu)_1$. One can directly check that a minimiser necessarily satisfies $\mathrm{supp}(\mu) \subseteq \arg\max_K a_1$, in agreement with condition (29). If this set is discrete, then $\mu$ is a sum of Dirac masses located at these points. Note that such a data point is outside the cone if and only if $\max_K a_1 < 1$. If not, it lies on the boundary of the cone, showing that some boundary points might lead to sparse minimisers as well.
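This single-measurement case can be illustrated numerically. The sketch below discretises the normalised iteration (16) with $m = 1$ and $y = 1$, i.e. $\mathrm{d}\mu_{k+1} = (a_1 / A\mu_k) \, \mathrm{d}\mu_k$, on a grid; the grid, the Gaussian profile for $a_1$, and all names are our own illustrative choices. The mass of the iterates concentrates on the grid maximiser of $a_1$, so the limit is numerically a Dirac mass.

```python
import numpy as np

# Grid discretisation of (16) with a single detector and y = 1:
# w_{k+1}(x) = w_k(x) * a(x) / (sum_x a(x) w_k(x)).
x = np.linspace(0.0, 1.0, 21)
a = np.exp(-(x - 0.3) ** 2)           # a_1, with a unique maximiser at x = 0.3
w = np.full_like(x, 1.0 / len(x))     # uniform initial probability weights

for _ in range(5000):
    w *= a / (a @ w)                  # total mass is preserved at each step

# the weights concentrate on the grid point closest to argmax a_1
```

Since each step multiplies $w$ by $a$ up to a scalar, the iterates are proportional to $a^k w_0$, which makes the concentration at the maximiser transparent.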
3.3 Moment matching problem and case $y \in \mathrm{int}(\mathcal{C})$
When the data $y$ is in the cone $\mathcal{C}$, searching for minimisers of (13) is equivalent to solving $A\mu = y$ for $\mu \in \mathcal{M}_+$. For the applications, we are particularly interested in the existence of absolutely continuous solutions. We make use of the results of [Georgiou2005], which address this problem.
We now recall a part of Theorem 3 of [Georgiou2005], which will be sufficient for our purpose. [Georgiou2005] Assume that the cone $\mathcal{C}$ and its dual cone $\mathcal{C}^*$ have nonempty interior. Then for any $y \in \mathrm{int}(\mathcal{C})$, there exists $\mu \in \mathcal{M}_+$ which is absolutely continuous, with positive and continuous density, such that $A\mu = y$.
Under hypothesis (30), the dual cone $\mathcal{C}^*$ has nonempty interior, and if $y \in \mathrm{int}(\mathcal{C})$, there exists an optimal measure which is absolutely continuous with positive and continuous density.
Proof.
This is a direct consequence of subsection 3.3. We just need to check that $\mathcal{C}^*$ has nonempty interior. Using the characterisation (19) of the dual cone, this is straightforward since $A^* \mathbf{1} = 1 > 0$ on $K$, so that $\mathbf{1} \in \mathrm{int}(\mathcal{C}^*)$. ∎
3.4 Case $y \in \partial \mathcal{C}$
The previous approach settles the case where the data is in the interior of the image cone, which raises the natural question of its boundary $\partial \mathcal{C}$. It routinely happens in practice that some components of the data are zero, which means that the vector $y$ lies on the boundary of the cone, $y \in \partial \mathcal{C}$. Upon changing the compact set, a further use of the results of [Georgiou2005] shows that if the support of the data is not too small (see the precise condition (34) below), the situation is the same as for $y \in \mathrm{int}(\mathcal{C})$.
The idea is to remove all the zero components of the data vector $y$, to consider only the positive ones, and to try to solve the corresponding reduced equation, making sure that the measure has a support such that $(A\mu)_i = 0$ for every $i$ with $y_i = 0$.
We denote $I := \{ i : y_i > 0 \}$, $\tilde{y} := (y_i)_{i \in I}$, $\tilde{K} := \{ x \in K : a_i(x) = 0 \ \text{for all } i \notin I \}$, and finally the reduced operator
(31) $\tilde{A} : \mathcal{M}(\tilde{K}) \to \mathbb{R}^{|I|},$
(32) $(\tilde{A}\mu)_i := \int_{\tilde{K}} a_i \, \mathrm{d}\mu, \quad i \in I,$
which has an associated cone $\tilde{\mathcal{C}} := \tilde{A}(\mathcal{M}_+(\tilde{K}))$.
We will need the assumptions
(33) 
and
(34) 
Assume (33) and (34). Then $\tilde{\mathcal{C}}$ has nonempty interior, and we further assume $\tilde{y} \in \mathrm{int}(\tilde{\mathcal{C}})$.
Then there exists an absolutely continuous solution $\mu$ of $A\mu = y$.
Proof.
We make use of subsection 3.3. In order to do so, we need the dual cone of $\tilde{\mathcal{C}}$ not to have empty interior, which holds true by (34). Then we may build $\tilde{\mu} \in \mathcal{M}_+(\tilde{K})$, absolutely continuous with positive and continuous density, such that $\tilde{A}\tilde{\mu} = \tilde{y}$. We then extend $\tilde{\mu}$ to a measure $\mu$ on the whole of $K$ by defining $\mu$ to equal $\tilde{\mu}$ on $\tilde{K}$ with support contained in $\tilde{K}$, namely $\mu(B) := \tilde{\mu}(B \cap \tilde{K})$ for any Borel subset $B$ of $K$. Then $\mu$ clearly solves $A\mu = y$ and thus minimises $F$. ∎
Note that subsection 3.3 is a particular case of subsection 3.4, but we believe this presentation makes the construction clearer.
Let us now finish this section by proving that not every point of the boundary may be associated to absolutely continuous measures. We denote by $\Delta_m$ the simplex in $\mathbb{R}^m$, i.e.,
(35) $\Delta_m := \Big\{ y \in \mathbb{R}^m_{\geq 0} : \sum_{i=1}^m y_i = 1 \Big\}.$
Assume that $y$ is an extremal point of $\mathcal{C} \cap \Delta_m$. Then any measure $\mu \in \mathcal{M}_+$ satisfying $A\mu = y$ is a Dirac mass.
We omit the proof, which is straightforward and relies on the linearity of the operator $A$ and the fact that the only extremal points among probability measures are the Dirac masses [Rudin1991].
4 Properties of MLEM
We now turn our attention to the MLEM algorithm (16) for the minimisation of the functional $F$ (problem (13)).
4.1 Monotonicity and asymptotics
We first proceed to prove that the algorithm is monotone, a property stemming from it being an expectation-maximisation algorithm.
For any $\mu_0$ satisfying (17), MLEM is monotone, i.e.,
(36) $F(\mu_{k+1}) \leq F(\mu_k) \quad \text{for all } k \geq 0.$
Proof.
We will build a so-called surrogate function, i.e., a function $G(\cdot, \nu)$ such that $G(\mu, \nu) \geq F(\mu)$ for all $\mu$ in $\mathcal{M}_+$, with equality for $\mu = \nu$, and such that $G(\cdot, \mu_k)$ is minimised at $\mu_{k+1}$ over $\mathcal{M}_+$. Note that $G(\mu_k, \mu_k) = F(\mu_k)$. The claim then follows since
$F(\mu_{k+1}) \leq G(\mu_{k+1}, \mu_k) \leq G(\mu_k, \mu_k) = F(\mu_k).$
For a measure , we define . One can check that , i.e., that is always a probability measure. From the nonnegativity of the divergence, i.e., , we obtain
(37) 
with equality if .
Notice now that , where we use the notation
(38) 
This motivates the following definition for any :
which is a surrogate function of .
Making use once more of the concavity of the logarithm through for any we may write for
the right-hand side vanishing due to the definition of $\mu_{k+1}$. ∎
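The monotonicity property is easy to verify numerically in the discrete setting. The following Python sketch (our own toy data; a sanity check, not a proof) runs MLEM and records the negative log-likelihood at each step.

```python
import numpy as np

def neg_log_lik(A, y, x):
    """KL(y, Ax) up to the constant sum_i (y_i log y_i - y_i)."""
    Ax = A @ x
    return float(np.sum(Ax - y * np.log(Ax)))

rng = np.random.default_rng(2)
A = rng.random((6, 4)) + 0.05
y = rng.poisson(A @ np.array([1.0, 0.5, 2.0, 1.5])).astype(float)
y = np.maximum(y, 1e-12)              # avoid log(0); keeps the check simple

x = np.ones(4)
values = [neg_log_lik(A, y, x)]
for _ in range(200):
    x *= (A.T @ (y / (A @ x))) / A.sum(axis=0)   # MLEM step (2)
    values.append(neg_log_lik(A, y, x))

diffs = np.diff(values)               # should all be nonpositive
```

Every recorded difference is nonpositive (up to floating-point noise), as predicted by the surrogate argument above.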
If the initial measure has its mass sufficiently spread over , we now prove that all cluster points of MLEM are optimal solutions to the minimisation problem. This also proves that minimisers exist in full generality.
For any $\mu_0$ satisfying (17) and
(39) $\mu_0(B(x, r)) > 0 \quad \text{for all } x \in K \text{ and } r > 0,$
where $B(x, r)$ is the closed ball of centre $x$ and radius $r$, the weak-* cluster points of MLEM minimise the function $F$. In particular, $\lim_{k \to \infty} F(\mu_k) = \min_{\mathcal{M}_+} F$.
Proof.
Let $\mu^\star$ be a weak-* cluster point of MLEM. Such a point exists since we have $\mu_k(K) = 1$ for all $k \geq 1$. Thus, $(\mu_k)$ is a bounded sequence in $\mathcal{M}(K)$. By the Banach–Alaoglu theorem, it is thus weak-* compact.
Also note that such a limit must satisfy $(A\mu^\star)_i > 0$ for any $i$ such that $y_i > 0$. Otherwise, $F(\mu_k)$ would go to $+\infty$ along the subsequence, since $A$ is weak-* continuous, a contradiction with the fact that $F$ decreases along iterates and $F(\mu_0) < +\infty$. We may then pass to the limit in the definition of MLEM and we find
(40) $\mathrm{d}\mu^\star = A^*\Big( \frac{y}{A\mu^\star} \Big) \, \mathrm{d}\mu^\star.$
In particular, we have $A^* \lambda^{\mu^\star} = 1$ on $\mathrm{supp}(\mu^\star)$. Now, assume by contradiction that $A^* \lambda^{\mu^\star} > 1$ at a point. Defining , by assumption