## 1 Introduction

In the original treatment of classical statistical inverse problems such as the current status model, it was assumed that the nonparametric maximum likelihood estimator (MLE) would converge as a process at rate $\sqrt{n}$ and in particular would be “tight”. It was also conjectured that the pointwise limit distribution would be normal ([15], [17], [13]). But it was proved in [6] that the process is not tight, does not converge pointwise at rate $\sqrt{n}$, and that the actual pointwise limit distribution is not normal, but is in fact given by Chernoff’s distribution (see [4] and [12]). This fact was noticed, for example, in [20], who refer for the result to [11], where it is also given.

On the other hand, if we consider differentiable functionals of the model, we are back in $\sqrt{n}$ asymptotics, with normal limit distributions. Theorem 3.1 on p. 183 of [19] gives necessary and sufficient conditions for a functional to be differentiable in a very general setting. We give a short account of the (for us) relevant facts here, also summarized in [5] and [7].

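We need the concept of Hellinger differentiability. Let the unknown distribution $P$ on $(\mathcal{Y},\mathcal{B})$ be contained in some class of probability measures $\mathcal{P}$, which is dominated by a $\sigma$-finite measure $\mu$. Let $P$ have density $p$ with respect to $\mu$. We are interested in estimating some real-valued function $\Theta(P)$ of $P$. Let, for some $\delta>0$, the collection $\{P_t\}$ with $t\in(0,\delta)$ be a 1-dimensional parametric submodel, with densities $p_t$ with respect to $\mu$, which is smooth in the following sense:

$$\int\Bigl[t^{-1}\bigl(\sqrt{p_t}-\sqrt{p}\bigr)-\tfrac12\,a\sqrt{p}\Bigr]^2\,d\mu\to 0\quad\text{as } t\downarrow 0,\qquad\text{for some } a\in L_2(P).$$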

Such a submodel is called Hellinger differentiable. This property can be seen as an version of the pointwise differentiability of at (with ), with the function playing the role of the so-called score-function in classical statistics. For we have,

Therefore, is also called the score function or score. The collection of scores obtained by considering all possible one-dimensional Hellinger differentiable parametric submodels, is a linear space, the tangent space at , denoted by .

In the models for inverse problems, to be considered in this paper, we work with a so-called hidden space and an observation space. All Hellinger differentiable submodels that can be formed in the observation space, together with the corresponding score functions, are induced by the Hellinger differentiable paths of densities on the hidden space, according to the following theorem:

###### Theorem 1.

Let be a class of probability measures on the hidden space .

is induced by the random vector

. Suppose that the path to satisfies for some , where the superscript means that .

Let be a measurable
mapping. Suppose that the induced measures and on
are absolutely continuous with respect to , with densities
and . Then the path is also Hellinger differentiable, satisfying

with .

For a proof, see [2]. Note that . The relation between the scores in the hidden tangent space and the induced scores is expressed by the mapping

(1.1) |

This mapping is called the score operator. It is continuous and linear. Its range is the induced tangent space, which is contained in .

Now is pathwise differentiable at if for each Hellinger differentiable path , with corresponding score , we have

where

is continuous and linear.

can be written in an inner product form. Since the tangent space is a subspace of the Hilbert-space , the continuous linear functional can be extended to a continuous linear functional on . By the Riesz representation theorem, to belongs a unique , called the gradient, satisfying

One gradient is playing a special role, which is obtained by extending to the Hilbert space . Then, the extension of is unique, yielding the canonical gradient or efficient influence function . This canonical gradient is also obtained by taking the orthogonal projection of any gradient , obtained after extension of , into . Hence is the gradient with minimal norm among all gradients and we have (Pythagoras):

In our censoring model, differentiability of a functional along the induced Hellinger differentiable paths in the observation space can be proved by looking at the structure of the adjoint of the score operator according to Theorem 2 below, which was first proved in [19] in a more general setting, allowing for Banach space valued functions as estimand. In that setting the proof is slightly more elaborate.

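Recall that the adjoint of a continuous linear mapping $A:G\to H$, with $G$ and $H$ Hilbert spaces, is the unique continuous linear mapping $A^*:H\to G$ satisfying

$$\langle Ag,h\rangle_H=\langle g,A^*h\rangle_G\qquad\text{for all } g\in G,\ h\in H.$$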

The score operator from (1.1) is playing the role of . Its adjoint can be written as a conditional expectation as well. If , then:

###### Theorem 2.

Let be a class of probability measures on the image
space of the measurable transformation S. Suppose the functional can be written as
with
pathwise differentiable at in the hidden space, having canonical gradient
.

Then is differentiable at along the collection of
induced paths in the observation space obtained via Theorem 1 if and only if

(1.2) |

where is the score operator. If (1.2) holds, then the canonical gradient of and of are related by

We consider the following model, used for estimating the distribution of the incubation time of a disease. In this model there is an infection time

, uniformly distributed on an interval

, where (“exposure time”) has an absolutely continuous distribution function on an interval , and where is uniform on , conditionally on . Moreover there is an incubation time with an absolutely continuous distribution on an interval and a time for getting symptomatic , where . We assume that and are independent, conditionally on . Our observations consist of the pairs

The model is considered, for example, in [16], [3], [1] and [9].
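To make the sampling scheme concrete, here is a minimal simulation sketch in Python. It assumes that the symptom onset time is the sum of the infection time and the incubation time, which is what the convolution structure of the model suggests. The function names, the uniform exposure law, and all numerical parameter values are hypothetical choices for illustration only.

```python
import numpy as np

def sample_truncated_weibull(rng, size, shape, scale, upper):
    """Draw from a Weibull(shape, scale) law conditioned on [0, upper] by CDF inversion."""
    cdf_at_upper = 1.0 - np.exp(-((upper / scale) ** shape))
    u = rng.uniform(0.0, cdf_at_upper, size)
    return scale * (-np.log1p(-u)) ** (1.0 / shape)

def simulate_pairs(rng, n, max_exposure=30.0, max_incubation=20.0,
                   shape=3.0, scale=7.0):
    """Return n simulated (exposure time, symptom onset time) pairs.

    All default parameter values are hypothetical, not taken from the paper.
    """
    exposure = rng.uniform(0.0, max_exposure, n)    # hypothetical exposure law
    infection = rng.uniform(0.0, exposure)          # uniform on [0, E], given E
    incubation = sample_truncated_weibull(rng, n, shape, scale, max_incubation)
    symptom = infection + incubation                # assumed relation S = I + U
    return exposure, symptom

rng = np.random.default_rng(2024)
exposure, symptom = simulate_pairs(rng, 5)
print(np.column_stack((exposure, symptom)))
```

Only the pairs returned by `simulate_pairs` would be available to the statistician; the infection and incubation times stay hidden.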

We define the (convolution) density by

(1.3) |

w.r.t. , which is the product of the measure of the exposure time and Lebesgue measure on , where is the upper bound for the incubation time and is the upper bound for the exposure time.

For estimating the distribution function of the incubation time, parametric distributions are usually used, such as the Weibull, log-normal or gamma distribution. However, in [9] the nonparametric maximum likelihood estimator is used. The maximum likelihood estimator maximizes the function

(1.4) |

over all distribution functions on which satisfy , , see [9]. Here is the empirical distribution function of the pairs , . We’ll prove that, under some conditions on the underlying distributions, the MLE of the incubation time has cube root convergence and converges in distribution, after standardization, to Chernoff’s distribution, see Theorem 4.

So this is a clear example of a situation where the MLE does not converge pointwise at rate $\sqrt{n}$ and has asymptotic properties similar to those of the MLE in the interval censoring models. But deriving this is considerably more difficult than it is for the current status model and, in contrast with the proof for the current status model, relies heavily on smooth functional theory, as will be clear from the proof of Theorem 4. Apart from this, there also exist differentiable functionals of the model, of which we now give two examples.

Example 1 We can consider the “mean functional”:

The score operator is of the form

(1.5) |

The adjoint is given by

(1.6) |

Defining

we get the following equation for :

(1.7) |

By differentiating w.r.t. , we find that is also the solution of the following equation in :

(1.8) |

The canonical gradient is in this case given by:

The solution is shown in Figure 1 for , where we chose to be a Weibull distribution function , with and , truncated on . The distribution function of the exposure time was chosen to be the uniform distribution function on (these distributions were also used in the simulations in [9]).

This leads us to expect the following asymptotic normality result:

(1.9) |

where

is a normal distribution with mean zero and variance

In fact, using , we get:

Example 2

We can also apply the theory to estimators which converge at a lower speed. For example, if we want to estimate the density w.r.t. Lebesgue measure at a point by a kernel estimator in the model discussed in Example 1, equation (1.7) is replaced by:

(1.10) |

which becomes after differentiation w.r.t. :

(1.11) |

where $K_h(u)=h^{-1}K(u/h)$ and $K$ is a symmetric kernel with support $[-1,1]$, for example the triweight kernel

$$K(u)=\tfrac{35}{32}\,\bigl(1-u^2\bigr)^3\,1_{[-1,1]}(u). \tag{1.12}$$
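For reference, the triweight kernel is $K(u)=\tfrac{35}{32}(1-u^2)^3$ on $[-1,1]$ and zero elsewhere. The short Python sketch below evaluates it together with a bandwidth-rescaled version and checks numerically that it integrates to one. The function names and the rescaling convention $K_h(x)=K(x/h)/h$ are assumptions made for this illustration.

```python
import numpy as np

def triweight(u):
    """Triweight kernel: (35/32) * (1 - u^2)^3 on [-1, 1], zero elsewhere."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, (35.0 / 32.0) * (1.0 - u**2) ** 3, 0.0)

def scaled_kernel(x, h):
    """Rescaled kernel K_h(x) = K(x / h) / h for a bandwidth h > 0 (assumed convention)."""
    return triweight(np.asarray(x, dtype=float) / h) / h

# Numerical check with the trapezoidal rule that the kernel has total mass one.
grid = np.linspace(-1.0, 1.0, 20001)
values = triweight(grid)
mass = np.sum(0.5 * (values[1:] + values[:-1]) * np.diff(grid))
print(mass)
```

Because the kernel and its first two derivatives vanish at the endpoints of the support, kernel estimates built from it are smooth at the boundary of the smoothing window.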

This time, the solution is shown in Figure 2.

If one kept the bandwidth fixed, one would indeed get $\sqrt{n}$ convergence again, but usually one lets the bandwidth tend to zero in such a way that the squared bias and the variance are of the same order. In this case this would mean that one takes of order , if is the sample size, which would give a rate of convergence of order for the estimator itself, see [9]. Note that in [9] the right-hand side of the equation is instead of which leads to a picture which is flipped around w.r.t. the -axis. But this makes no difference for the estimate of the variance or the asymptotic distribution result.

This leads us to expect the following asymptotic normality result:

where

and is the canonical gradient in the observation space if we evaluate the density estimate at point and use bandwidth , see [9]. As in Example 1, also has the representation:

where solves (1.11) for , for some .

The organization of the paper is as follows. In Section 2 we give necessary and sufficient conditions for to be the MLE. Under some extra conditions, we derive consistency of the MLE in Section 3.

In Section 4 we discuss the limit distribution of the MLE. To our knowledge, this result has not been derived before. The derivation is somewhat analogous to the methods used in Section 4.2 of [7] for deriving the asymptotic distribution of the MLE for interval censoring, case 2. Although about 25 years have passed since the publication of that proof, and although it would be nice to have a simpler proof of this result, no other proofs are known to me. So we have to go through similar but still somewhat different steps again.

In Section 5 we discuss the behavior of smooth estimates of the distribution function and density, based on the nonparametric MLE. We end with some concluding remarks in Section 6. The Appendix, Section 7, contains technical details of the proof of the convergence of the MLE to Chernoff’s distribution.

## 2 Characterization of the nonparametric maximum likelihood estimator (MLE)

Let be the empirical distribution function of the pairs . Then, for a distribution function on , which is zero on , we define the process

(2.1) |

defining , where is the empirical distribution of . The following lemma characterizes the MLE.

###### Lemma 1.

Let be the set of discrete distribution functions with mass concentrated on a set of points , , where and . Then maximizes

(2.2) |

over if and only if

###### Proof.

First suppose satisfies (i) and (ii) and let , , where . Then we have, using the concavity of the log function and Jensen’s inequality:

using (ii) and next (i) on the last line.

Conversely, suppose maximizes over . If we must have , since otherwise . We have:

If has mass at , we also have:

and hence:

∎

The lemma shows that the point process where runs through the ordered points and , excluding , has second coordinates equal to

at points where the probability distribution, corresponding to

, has positive mass. A picture of this point process is given in Figure 3 for sample size .

## 3 Consistency of the MLE

We have the following result.

###### Theorem 3.

Let have a strictly positive density on , for some . Furthermore, let be zero on an interval , where and have a strictly positive continuous density on the interval . Let be the MLE, where is the set of distribution functions with mass at the set of points , where the run through the ordered set of points and , excluding the points . Then the MLE converges almost surely to on .

There are many different ways to prove consistency, but we prefer the elegant method in [14], which is used in the proof below.

###### Proof.

We start by observing that, by the fact that is the MLE, we must have:

On a set of probability one, the empirical probability measure converges weakly to the underlying measure on a set of elements which has probability one. Fixing and we get by the Helly compactness theorem a subsequence converging vaguely to a subdistribution function , for which we get the inequality:

(3.1) |

The minimum of

(3.2) |

over subdistribution functions is attained by a nondegenerate distribution function , since otherwise (3.2) could be made smaller by multiplying by a constant bigger than 1. This means that we may assume that the minimizer of (3.2) satisfies

(3.3) |

Minimizing (3.2) under the condition (3.3) is the same as minimizing

without this condition, using a Lagrange multiplier argument (with Lagrange multiplier ).

For we have for and the minimum of

is attained by taking . If , we find that

is minimized by taking , but since the minimizing values on the interval are equal , we must have and for the minimizing function .

## 4 Asymptotic distribution of the MLE

In this section we discuss the proof of the theorem below.

###### Theorem 4.

Let have a continuous density , staying away from zero on its support , , and let the exposure time have a continuous density on its support , for some , with a bounded derivative on the interval . Let be the MLE, where the set of distribution functions has the same meaning as in Theorem 3. Then we have at a point :

(4.1) |

where is two-sided Brownian motion on , originating from zero and where the constant is given by:

(4.2) |

The result shows that the limit distribution is given by Chernoff’s distribution. This distribution also occurs as limit distribution in the current status model and more generally in the interval censoring model in the so-called separated case. The jump in difficulty of the proof in going from the result for the current status model to the interval censoring, case 2, model is considerable. Similarly, the present proof is not simple. The proofs of the lemmas will be given in the Appendix.
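A small simulation may help to build intuition for this limit law. The sketch below takes Chernoff’s distribution to be the law of the location of the minimum of $W(t)+t^2$ over $t\in\mathbb{R}$, with $W$ two-sided Brownian motion started at zero, and approximates it on a finite grid. The function name, the truncation window and the grid step are illustrative assumptions, and the scaling constant of the theorem is not included.

```python
import numpy as np

def simulate_chernoff_argmin(rng, half_width=3.0, step=1e-3):
    """Approximate one draw of argmin_t {W(t) + t^2} on the grid [-half_width, half_width]."""
    n = int(round(half_width / step))
    t_pos = step * np.arange(1, n + 1)
    # Two independent Brownian paths, one for each half-line, both started at zero.
    w_right = np.cumsum(rng.normal(0.0, np.sqrt(step), n))
    w_left = np.cumsum(rng.normal(0.0, np.sqrt(step), n))
    t = np.concatenate((-t_pos[::-1], [0.0], t_pos))
    w = np.concatenate((w_left[::-1], [0.0], w_right))
    return t[np.argmin(w + t**2)]

rng = np.random.default_rng(0)
draws = np.array([simulate_chernoff_argmin(rng) for _ in range(200)])
print(draws.mean(), draws.std())
```

The parabolic drift makes minima far from zero very unlikely, so truncating the time axis at a moderate `half_width` should have little effect on the simulated draws.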

We have the following lemma.

###### Lemma 2.

We also have the following property of the MLE.

###### Lemma 3.

Let the set be as in Lemma 1 and let . We define the weight process by:

(4.3) |

where, as usual, . If maximizes in (1.4) then is for all the left-continuous slope of the greatest convex minorant of the “self-induced” cusum diagram, consisting of the point and the points

(4.4) |

where is defined by (2.1).

As explained in [9], after a preliminary reduction, removing points where we know that the MLE can only be equal to or , one can compute the MLE by the iterative convex minorant algorithm, where one computes iteratively the greatest convex minorant of the cusum diagram with points and points

(4.5) |

where is defined as in (4.3), but with replaced by , where is the temporary estimate of the distribution function at an iteration, and where runs through the order statistics of the observations and (excluding zero). The MLE corresponds to a stationary point of this algorithm and is given by the left-continuous slope of the greatest convex minorant of the cusum diagram, see Figure 4. See [9] for further remarks on this algorithm.
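The basic step used here, taking the left-hand slopes of the greatest convex minorant of a cusum diagram, can be sketched in Python with the pool adjacent violators idea. This is a minimal illustration and not the implementation of [9]: the function name `gcm_left_slopes` and the toy data are assumptions, and the weights and self-induced updates of the actual iterative algorithm are omitted.

```python
import numpy as np

def gcm_left_slopes(x, y):
    """Left-hand slopes of the greatest convex minorant of (0, 0), (x_1, y_1), ..., (x_m, y_m).

    The x-values must be strictly increasing and positive. One slope is
    returned for each of the m points.
    """
    x = np.concatenate(([0.0], np.asarray(x, dtype=float)))
    y = np.concatenate(([0.0], np.asarray(y, dtype=float)))
    dx, dy = np.diff(x), np.diff(y)
    blocks = []  # each block holds [sum of dy, sum of dx, number of segments]
    for w, v in zip(dx, dy):
        blocks.append([v, w, 1])
        # Merge neighbouring blocks while the slopes fail to increase.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] >= blocks[-1][0] / blocks[-1][1]
        ):
            v2, w2, c2 = blocks.pop()
            blocks[-1][0] += v2
            blocks[-1][1] += w2
            blocks[-1][2] += c2
    return np.concatenate([np.full(c, v / w) for v, w, c in blocks])

# Toy cusum diagram: (0,0), (1,1), (2,0), (3,2), (4,5).
print(gcm_left_slopes([1, 2, 3, 4], [1, 0, 2, 5]))  # slopes 0, 0, 2, 3
```

In an iterative scheme of the kind described above, one would recompute the cusum diagram from the current estimate, apply this step, and repeat until the estimate no longer changes.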

A fundamental tool in our proof is the so-called “switch relation”, see, e.g., Section 3.8 in [10]. Let the process be defined by

(4.6) |

where is the function, defined by (4.3) at the points and extended to a right-continuous piecewise constant function elsewhere. We define, for

Then we have the switch relation:

see, e.g., (3.35) and Figure 3.7 in Section 3.8 of [10].

We have:

where . Using the property that the argmin function does not change if we add constants to the objective function, we get:

where

We have: