The ML-EM algorithm in continuum: sparse measure solutions

by   Camille Pouchol, et al.

Linear inverse problems A μ = δ with Poisson noise and non-negative unknown μ≥ 0 are ubiquitous in applications, for instance in Positron Emission Tomography (PET) in medical imaging. The associated maximum likelihood problem is routinely solved using an expectation-maximisation algorithm (ML-EM). This typically results in images which look spiky, even with early stopping. We give an explanation for this phenomenon. We first regard the image μ as a measure. We prove that if the measurements δ are not in the cone {A μ, μ≥ 0}, which is typical of short exposure times, likelihood maximisers as well as ML-EM cluster points must be sparse, i.e., typically a sum of point masses. On the other hand, in the long exposure regime, we prove that cluster points of ML-EM will be measures without singular part. Finally, we provide concentration bounds for the probability to be in the sparse case.


page 3

page 24

page 25


ML-EM algorithm with known continuous movement model

In Positron Emission Tomography, movement leads to blurry reconstruction...

Maximum Likelihood Estimation of Toric Fano Varieties

We study the maximum likelihood estimation problem for several classes o...

Minimax Optimal Sparse Signal Recovery with Poisson Statistics

We are motivated by problems that arise in a number of applications such...

Physics Embedded Machine Learning for Electromagnetic Data Imaging

Electromagnetic (EM) imaging is widely applied in sensing for security, ...

Spatiotemporal PET reconstruction using ML-EM with learned diffeomorphic deformation

Patient movement in emission tomography deteriorates reconstruction qual...

Statistical Estimation and Clustering of Group-invariant Orientation Parameters

We treat the problem of estimation of orientation parameters whose value...

Key point selection and clustering of swimmer coordination through Sparse Fisher-EM

To answer the existence of optimal swimmer learning/teaching strategies,...

1 Introduction

In various imaging modalities, recovering the image from acquired data can be recast as solving an inverse problem of the form , where is a linear operator, represents noisy measurements and the image, with usually a desirable property. The problem thus becomes where is some given distance or divergence.

When the model is finite-dimensional, the operator is simply a matrix . If we assume a Poisson noise model, i.e., with independent draws, the corresponding (negative log) likelihood problem is equivalent to



is the Kullback–Leibler divergence. As it turns out, this statistical model is similar to the familiar non-negative least-squares regression corresponding to Gaussian noise, but for a different distance functional: if

, it is projected onto it in the sense of , whereas if it belongs to this image set, any such that will be optimal.

The celebrated Maximum Likelihood Expectation Maximisation algorithm (ML-EM) precisely aims at solving (1) and was introduced by Shepp and Vardi [Shepp1982, Vardi1985], in the particular context of the imaging modality called Positron Emission Tomography (PET). ML-EM is iterative and writes


starting from , usually .

This algorithm is an expectation-maximisation (EM) algorithm, and as such it has many desirable properties: it preserves non-negativity and the negative log-likelihood decreases along iterates [Dempster1977]. It can also be interpreted in several other ways [Vardi1985, Csiszar1984, Benvenuto2014], see [Natterer2001] for an overview. The expectation-maximisation point of view has also led to alternative algorithms [Fessler1993], but in spite of various competing approaches, ML-EM (actually, its more numerically efficient variation OSEM [Hudson1994]) has remained the algorithm used in practice in many PET scanners.

Despite its success, the ML-EM algorithm (2) is known to produce undesirable spikes along the iterates, where some pixels take increasingly high values. The toy example presented in Figure 1 is an example of such artefacts in the case of PET. The reconstruction of a torus of maximum with iterations of ML-EM indeed exhibits some pixels with values as high as about .

This phenomenon has long been noticed in the literature, where images are referred to as “spiky” or “speckled” [Silverman1990], others talking about “the chequerboard effect” [Vardi1985]. In the discrete case, the works [Byrne1993, Byrne1995, Byrne1996] have provided a partial explanation for this result. The author proves that, under general conditions (which include ), the minimum of (1) is such that it has at most non-zero entries whenever .

To the best of our knowledge, a theoretical justification for the subsistence of only a few non-zero entries has however remained elusive.

Figure 1: Phantom and reconstruction after iterations of ML-EM, with a zoom on the region containing the pixel of highest value.

The aim of the present paper is to better understand that phenomenon via the analysis in a continuous setting of the minimisation problem (1) and the corresponding ML-EM algorithm (2). The continuous setting here refers to the image not being discretised on a grid. Note however that we keep the data space discrete.

Informally, considering as an element in some function space, we consider forward operators of the form

where is the compact on which one aims at reconstructing the image, and is some non-negative function on . This covers a wide range of applications, including PET.

One of our motivations is to derive algorithms for Poisson inverse problems with movement, for example for PET acquisition of a moving organ [Jacobson2003]. In that case, movement can be modelled by deformations of images which do not easily carry over to discretised images (simply because interesting deformations do not preserve the grid). It is then desirable to express the problem in a continuous setting, in order to both analyse the algorithms proposed in the literature, and to derive new ones [Hinkle2012, Oktem2019].

The field of inverse problems for imaging, with a continuum description of the unknown image, is abundant [Bertero1998, Baumeister2005]. Most often, the image is taken to be a function in some appropriate Sobolev space. To the best of our knowledge, however, there are relatively few results concerning the continuum description of the Poisson likelihood and the ML-EM algorithm for solving it.

In [Multhei1987, Multhei1989, Multhei1992] and [Resmerita2007], both the image and data are considered in continuum, with a deterministic description of noise. These authors assume that detectors lie in and correspondingly assume that the image lies in . They study the convergence properties of the corresponding ML-EM algorithm in detail. In the first series of three papers, the compact is restricted to .

Our paper differs from these works in that we do not make the two following restrictive assumptions, common to [Multhei1987, Multhei1989, Multhei1992, Resmerita2007]. The first restriction is to assume the existence of a non-negative solution to the equation , assumed to lie in . The second restriction is to assume that the functions are bounded away from zero. This last assumption is unrealistic for some applications such as PET [Resmerita2007, Remark 6.1].

The seminal paper [Mair1996] considers the optimisation problem over the set of non-negative Borel measures as we do. They obtain the corresponding likelihood function informally as the limit of the discrete one, but do not prove that it is an actual maximum likelihood problem for the PET statistical model. They then proceed to study the problem of whether minimisers can be identified with bounded functions, and not merely measures which might have a singular part. They indeed note that in some very specific cases (see also [Multhei1992]), one can prove that the minimiser should be a Dirac mass. They speculate that there might be a link with the usual spiky results from ML-EM. They, however, do not provide any general conditions for sparsity.

Working in the space of non-negative measures , our main contributions are as follows:

Continuous Framework.

We prove that the continuous setting of measures is precisely the maximum likelihood problem with a Poisson point process model (subsection 2.2), and that the natural generalisation of the ML-EM iterates (25) indeed corresponds to the expectation-maximisation method associated to that continuous statistical model (see subsection 2.2).


We give a precise sparsity criterion (sparsity means that any optimal solution has singular support): if the data is outside the cone , then all optimal solutions are necessarily sparse (subsection 3.2); if the data is inside the cone , then there exist absolutely continuous solutions (subsection 3.3)

Properties of ML-EM iterates.

We show the expected properties of the ML-EM iterates, namely monotonicity (Equation 36) and optimality of cluster points (subsection 4.1).

Properties of ML-EM solutions.

We show that in the non-sparse case, i.e., when an absolutely continuous solution exists as just mentioned, the solution picked up by ML-EM is absolutely continuous (subsection 4.3), and we give an explicit example of ML-EM converging to a sum of point masses in the sparse case (subsection 4.2).

Effect of Noise.

We derive estimates on the probability to be in the sparse case, depending on the noise level (

section 5, section 5).

"Spiky" artefacts.

With these results, we provide an explanation for the artefacts of Figure 1: they are related to the sparsity result. By weak duality, we can indeed certify that the iterates should converge to a sum of point masses in that case, as detailed in section 6 dedicated to simulations.

Outline of the paper

The paper is organised as follows. In section 2, we introduce the functional and ML-EM in continuum in detail, with all the necessary notations, normalisations, definitions and useful properties about Kullback–Leibler divergences. section 3 contains all results on the functional minimisers, starting from the optimality conditions to the diverging cases of the data being inside or outside the cone . section 4 is devoted to the algorithm ML-EM itself, with the proof of its usual properties in continuum together with the implications they have on the case where the data is in the cone . In section 5, we estimate the probability that the data ends up outside the image cone . In section 6, we present simulations which confirm our theoretical predictions. Finally, in section 7 we conclude with open questions and perspectives.

2 Maximum likelihood and ML-EM in continuum

2.1 Mathematical background

Space of Radon measures.

As stated in the introduction, we model the image to reconstruct as a non-negative measure defined on a compact set . Some of our results require (typically, or ).

More precisely, we will consider the set of Radon measures, denoted and defined as the topological dual of the set of continuous functions over , denoted . The space of non-negative measures will be denoted by . For brevity, we will often write for and for when there is no ambiguity as to the underlying compact .

We identify a linear functional with its corresponding Borel measure (as per the Riesz–Markov representation Theorem), using to denote the measure of a Borel subset of . We will also sometimes write the dual pairing between a measure and a function as

The support of a measure is defined as the closed set

where is the set of all open neighbourhoods of .

Finally, recall that, by the Banach–Alaoglu Theorem, bounded sets in are weak- compact [Rudin1991].

Kullback–Leibler divergence.

We here recall the definition of the Kullback-Leibler divergence. Instead of giving the general definition, we directly explicit the two instances that will actually be needed in this paper, for (non-normalised) non-negative vectors in

, and for probability measures on .

  • For vectors in . For any two non-negative vectors and in , we define the Kullback–Leibler divergence between and as

    with the convention and if there exists such that and .

  • For probability measures on . For any two probability measures and on , we define the Kullback–Leibler divergence between and 

    if is absolutely continuous with respect to and is integrable with respect to . Here stands for the Radon–Nikodym derivative of with respect to . Otherwise, we define .

When a measure is absolutely continuous with respect to the Lebesgue measure on , we simply say that it is absolutely continuous. Any reference to the Lebesgue measure implicitly assumes that stands for the closure of some bounded domain in (i.e., a bounded, connected and open subset of ).

2.2 Statistical model

We want to recover a measure

from independent Poisson distributed measurements




Positron Emission Tomography [Ollinger1997].

In PET, a radiotracer injected into the patient and, once concentrated into tissues, disintegrates by emitting a positron. This process is well known to be accurately modelled by a Poisson point process, itself defined by a non-negative measure. After a short travel distance called positron range, this positron interacts with an electron. The result is the emission of two photons in random opposite directions. Pairs of detectors around the body then detect simultaneous photons, and the data is given by the number of counts per pair of detectors.

In the case of PET, is the number of of detectors (i.e., pairs of single detectors). For a given point  and detector , then denotes the probability that a positron emitted in  will be detected by detector .

Finally, we will throughout the paper assume


For PET, this amounts to assuming that that the points in are in the so-called field of view, namely that any emission has a non-zero probability to be detected.

Derivation of the statistical model (3) for PET.

We proceed to give a proof that the statistical model (3) (and thus, the corresponding likelihood function) applies to PET. Here, we assume that the emission process of PET is modelled by a Poisson point process, defined by a measure , and that each point drawn from the Poisson process has a probability to be detected by detector .

The statistical model (3) applies to PET.


The proof relies on the following properties of Poisson point processes [Last2017]:

  • law of numbers: the number of points emitted by a Poisson process of intensity follows the Poisson law with parameter .

  • thinning property: the points that are kept with (measurable) probability still form a Poisson point process, with intensity , and it is independent from that of points that are not kept. This property generalises to , with for all .

By the thinning property, the families of points which lead to an emission detected in detector , , are all independent Poisson processes with associated measure , for

. Thus, the random variables

representing the number of points detected in detector are independent and of law , which proves the claim. ∎

Maximum likelihood problem.

The likelihood corresponding to the statistical model (3) reads

since .

Dropping the factorial terms (they do not depend on and will thus play no role when maximising the likelihood), we get


The corresponding maximum likelihood problem, written for a realisation of the random variable , , is given by


Defining the operator


the optimisation problem conveniently rewrites in terms of the Kullback-Leibler divergence: upon adding constants and taking the negative log-likelihood problem, it reads


ML-EM iterates.

We now define the ML-EM algorithm, which aims at solving the optimisation problem (7). It is given by the iterates


starting from an initial guess .

Note that this algorithm can be shown to be an EM algorithm for the continuous problem. The proof is beyond the scope of this paper, so we decide to omit it, but we just mention that the corresponding so-called complete data would be given by the positions of points together with the detector that has detected each of them.

2.3 Normalisations

Due to the assumption (5), we may without loss of generality assume that


on . Otherwise we could just define and . This normalisation now implies for all .

We further normalise the measures by dividing the functional by , considering to remove the factor. We then define

From now on, we consider the optimisation problem (minimisation of the negative log-likelihood):




defined to be for any measure such that for some , where

After normalisation, the ML-EM iterates are given by


We recall the property that ML-EM preserves the total number of counts: for all , which corresponds to before normalisation. We also emphasise the important property that iterations cannot increase the support of the measure, namely

ML-EM iterates are well-defined.

We assume throughout that the initial measure fulfils


Note that usual practice is to take to be absolutely continuous with respect to the Lebesgue measure, typically , in which case (17) is satisfied.

The following simple Lemma shows that assumption (17) ensures that the iterates are well-defined.

The ML-EM iterates (16) satisfy


From the Cauchy–Schwarz inequality, . Combined with the definition of ML-EM iterates, this entails for any ,

2.4 Adjoint and cone

Since , we can define its adjoint (identifying as a Euclidean space with its dual), which is given by


The set is a closed and convex cone and, as proved in [Georgiou2005], its dual cone can be characterised as being given by the set of vectors such that on , i.e.,


As a result, the interior of the dual cone is given by the vectors such that on .

The normalisation condition (12) can now be rewritten


where is the vector of which all components are one: . Moreover, we can rewrite the ML-EM iteration (16) as

which is the continuous analogue to the discrete case (2), taking into account the normalisation (20).

Minimisation over the cone.

The problem , is equivalent to the following minimisation problem over the cone :


Indeed, if is optimal for the problem (21), any such that is optimal for the original problem. From the property , we also infer that when , is optimal if and only if .

3 Properties of minimisers

In this section, we gather results concerning the functional and its minimisers, proving that they are sparse when the data is not in the image cone . First, we note that the functional defined by (14) is a convex and proper function.

3.1 Characterisation of optimality

We now derive necessary and sufficient optimality conditions. The existence of minimisers turns out to be a byproduct of the analysis of the ML-EM algorithm, whose cluster points will be shown to satisfy the optimality conditions, see subsection 4.1.

We first prove that any optimum must have a fixed unit mass.

If is optimal for (13), then .


For any , we have


Observe that the second term depends only on the mass , whereas the first term is scale-invariant. As a result, an optimal has to minimise the second term, which turns out to admit the unique minimiser . ∎

This result follows from the optimality conditions derived later in subsection 3.1, but the proof above is simple and also highlights that the maximum likelihood estimator for is consistent with that for : the second term in (22) is none other than the negative log-likelihood of the total mass.

We now give the full optimality conditions. The convex function defined in (14) has the following open domain:

Notice further that for any , the function is Fréchet-differentiable (in the sense of the strong topology). Its gradient is given for is then the element in the dual of given by


which we identify with an element of .

For any vector , we define by


(with the convention if , that is, if ). Using , we can rewrite (23) as


The measure is optimal for the problem (13) if and only if the following optimality conditions hold


These conditions can be equivalently written as


Since is differentiable on and convex on the convex set , a point is optimal if and only if


is the normal cone to at .

From the characterisation of given below in subsection 3.1, and the fact that , the optimality condition exactly amounts to the conditions (26). ∎

The normal cone at a given is given by


Let be in , i.e., it satisfies for all in . First, we choose which yields , so we must have on . Then with , we find . Since we also have , leading to on . The reverse also holds true: if on and on , then for all in , and consequently is in . ∎

An alternative proof of these optimality conditions can be obtained by considering instead the equivalent problem of minimising with ranging over the cone . The cone has a non-empty relative interior which proves that Slater’s condition is fulfilled. Since the problem is convex, KKT conditions are equivalent to optimality for  [Boyd2004].

The Lagrange dual is given by for . A straightforward computation leads to


for , with value if there exists such that .

The KKT conditions for a primal optimal and dual optimal write

  1. [label=()]

  2. ,

  3. ,

  4. (equivalent to )

A measure is then optimal if and only if . Since (by definition of ), the condition 2 thus becomes

Since , over . Thus, we must have on for the above integral to vanish. All in all, we exactly recover the conditions (27), with the additional interpretation that is a dual optimal variable.

3.2 Case

When the data is not in the cone , optimality conditions imply sparsity of any optimal measure.

Assume that . Then any minimiser of (13) is sparse, in the following sense


and .


Define . Note that . Moreover, , so we conclude that for some , . ∎

Why does condition (29) imply sparsity? Let be defined as (24) for an optimal measure , and define the function . We know from subsection 3.1 that both and .

Assuming that the ’s are linearly independent in , cannot vanish identically since . Supposing further that that for all , we have

We make the final assumption that the Hessian of is invertible at the points , which is equivalent to its positive definiteness since these are minimum points of . This implies that consists of isolated points. Consequently, the restriction to of any optimal solution is a sum of Dirac masses.

Note that all the above regularity assumptions hold true for generic functions . One case where all of them are readily satisfied is when the functions are analytic with connected.

We can exhibit a case where only Dirac masses are optimal. Suppose that only . Then the function for measures such that is simply . One can directly check that a minimiser necessarily satisfies , in agreement with condition (29). If this set is discrete, then is a sum of Dirac masses located at these points. Note that such a data point is outside the cone if and only if . If not, it lies on the boundary of the cone, showing that some boundary points might lead to sparse minimisers as well.

3.3 Moment matching problem and case

When the data is in the cone , searching for minimisers of (13) is equivalent to solving for . For the applications, we are particularly interested in the existence of absolutely continuous solutions. We make use of the results of [Georgiou2005], which addresses this problem.

We shall use the assumption:


Under (30), has non-empty interior.

We now recall a part of Theorem 3 of [Georgiou2005] which will be sufficient of our purpose. [[Georgiou2005]] Assume that and its dual cone have non-empty interior. Then for any , there exists which is absolutely continuous, with positive and continuous density, such that .

Under hypothesis (30), has non-empty interior, and if , there exists an optimal measure which is absolutely continuous with positive and continuous density.


This is a direct consequence of subsection 3.3. We just need to check that has non-empty interior. Using the characterisation of the dual cone (19), this is straightforward since . ∎

3.4 Case

The previous approach settles the case where the data is in the interior of the image cone, which poses the natural question of its boundary . It routinely happens in practice that some components of the data are zero, which means that the vector lies at the border of the cone, . Upon changing the compact, a further use of the results of [Georgiou2005] shows that if the support of the data is not too small (see the precise condition (34) below), the situation is the same as for .

The idea is to remove all the zero components of the data vector , and to consider only the positive ones and try to solve for but making sure that the measure has a support such that for .

We denote , , , and finally the reduced operator,


which has an associated cone .

We will need the assumptions




We assume (33) and (34). has non-empty interior and we assume

Then there exists an absolutely continuous solution of .


We make use of subsection 3.3. In order to do so, we need the dual cone of to not have empty interior, which holds true by (34). Then we may build an absolutely continuous with positive and continuous density, such that . We then extend to a measure on the whole of by defining to equal on with support contained in , namely for any Borel subset of . Then clearly solves and thus minimises . ∎

Note that subsection 3.3 is a particular case of subsection 3.4, but we believe this presentation makes the role of and clearer.

Let us now finish this section by proving that not any point of the boundary may be associated to absolutely continuous measures. We denote the simplex in , i.e.,


Assume that is an extremal point of . Then any measure satisfying is a Dirac mass.

We omit the proof, which is straightforward and relies on the linearity of the operator and the fact that the only extremal points among probability measures are the Dirac masses [Rudin1991].

4 Properties of ML-EM

We now turn our attention to the ML-EM algorithm (16) for the minimisation of the functional (problem (13)).

4.1 Monotonicity and asymptotics

We first proceed to prove that the algorithm is monotonous, a property stemming from it being an expectation-maximisation algorithm.

For any satisfying (17), ML-EM is monotone, i.e.,


We will build a so-called surrogate function, i.e., a function such that for all in , with equality for , where is minimised at over . Note that . The claim then follows since

For a measure , we define . One can check that , i.e., that is always a probability measure. From the non-negativity of the divergence, i.e., , we obtain


with equality if .

Notice now that , where we use the notation


This motivates the following definition for any :

which is a surrogate function of .

Making use once more of the concavity of the logarithm through for any we may write for

the right-hand side vanishing due to the definition of . ∎

If the initial measure has its mass sufficiently spread over , we now prove that all cluster points of ML-EM are optimal solutions to the minimisation problem. This also proves that minimisers exist in full generality.

For any satisfying (17) and


where is the closed ball of centre and radius , then the weak cluster points of ML-EM minimise the function . In particular,


Let be a weak cluster point of ML-EM. Such a point exists since we have for all . Thus, is a bounded sequence in . By the Banach–Alaoglu Theorem, it is thus weak- compact in .

Also note that such a limit must satisfy for any (i.e., whenever ). Otherwise, would go to since it is weak- continuous, a contradiction with the fact that decreases along iterates and . We may then pass to the limit in the definition of ML-EM and we find


In particular, we have on . Now, assume by contradiction that at a point . Defining , by assumption