Kernel Density Estimation with Linked Boundary Conditions

09/20/2018 · Matthew J. Colbrook et al. · University of Stuttgart, University of Cambridge, University of Technology Sydney, UNSW

Kernel density estimation on a finite interval poses an outstanding challenge because of the well-recognized boundary bias issues at the end-points. Motivated by an application of density estimation in biology, we consider a new type of boundary constraint, in which the values of the estimator at the two boundary points are linked. We provide a kernel density estimator that successfully incorporates this linked boundary condition. This is studied via a nonsymmetric heat kernel which generates a series expansion in nonseparable generalised eigenfunctions of the spatial derivative operator. Despite the increased technical challenges, the model is proven to outperform the more familiar Gaussian kernel density estimator near the boundary, while inheriting many of its desirable analytical properties. We apply this to our motivating example in biology, as well as providing numerical experiments with synthetic data. The method is fast and easy to use, and we also compare against existing methods, which do not incorporate this constraint and are therefore inaccurate near the boundary.

1 Introduction

Suppose we are given an independent and identically distributed statistical sample X_1, …, X_N from some unknown continuous density f_X. (The subscript X in f_X does not indicate a partial derivative; it merely indicates that f_X is associated with the random variable X.) Estimating the density f_X is one of the most common methods for discovering patterns in statistical data [44, 5, 43].

When the support of f_X is the whole real line, a simple and popular non-parametric method for estimating f_X is the kernel density estimator

f̂(x; t) = (1/N) ∑_{i=1}^{N} φ(x, X_i; t),    x ∈ ℝ,

with a Gaussian kernel φ(x, X_i; t) = (2πt)^(−1/2) exp(−(x − X_i)²/(2t)). Here t > 0 is the so-called bandwidth parameter that controls the smoothness of the estimator (see, for example, [43, 44] and references therein).
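For concreteness, the following is a minimal MATLAB sketch of such a whole-line Gaussian estimator (our illustration, not the authors' released code); the toy sample, the evaluation grid and the value of t are arbitrary.

    % Whole-line Gaussian kernel density estimator with bandwidth sqrt(t).
    % (Requires MATLAB R2016b or later for implicit expansion.)
    data = randn(500, 1);                     % toy i.i.d. sample X_1,...,X_N
    t    = 0.05;                              % squared bandwidth (smoothing parameter)
    x    = linspace(-4, 4, 401);              % evaluation grid (row vector)
    N    = numel(data);
    fhat = sum(exp(-(x - data).^2 ./ (2*t)), 1) / (N * sqrt(2*pi*t));
    plot(x, fhat)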

It is well known that this estimator is not appropriate when f_X has compact support [18], which (without loss of generality) we may assume to be the unit interval [0, 1]. The main reason is that the estimator exhibits large boundary bias at the end-points of the interval. For example, no matter how small the bandwidth parameter, the estimator will place non-zero probability mass outside the interval [0, 1]. Various solutions have been offered to cope with this boundary bias issue, which we may classify into three main types:

  1. Using special (non-Gaussian) kernels with support on [0, 1] or on [0, ∞), as in [4, 28, 39];

  2. Adding bias-correction terms to the kernel estimator, as in [8, 23];

  3. Employing domain transformations [18, 30], which work by mapping the data to the whole real line, constructing a kernel density estimator there, and finally mapping the estimate back to [0, 1] (a minimal sketch of this idea is given below).
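To illustrate the transformation idea in item 3 (our own sketch, not code from the cited works): push the data through a logit map, smooth on the real line with a Gaussian kernel, then pull the estimate back to (0, 1) with the Jacobian of the map. The bandwidth value below is arbitrary.

    % Transformation KDE for data supported on (0,1): logit, smooth, map back.
    data = rand(500, 1);                          % toy sample in (0,1)
    y    = log(data ./ (1 - data));               % logit-transformed data on the real line
    t    = 0.1;                                   % squared bandwidth on the transformed scale
    x    = linspace(0.001, 0.999, 400);           % evaluation grid inside (0,1)
    z    = log(x ./ (1 - x));                     % transformed evaluation points
    g    = sum(exp(-(z - y).^2 ./ (2*t)), 1) / (numel(y) * sqrt(2*pi*t));  % KDE on the line
    fhat = g ./ (x .* (1 - x));                   % multiply by |dz/dx| = 1/(x(1-x))
    plot(x, fhat)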

Sometimes, we not only know that f_X has support on [0, 1], but we also have extra information about the values of f_X at the boundaries. One example of this situation is what we will refer to as a linked boundary condition, where we know a priori that

f_X(0) = β f_X(1)

for some known, given parameter β ≥ 0.

This situation arises in the field of Systems Biology [25], where inferences from many snapshots of cell data at unknown time points are made via pseudo-time algorithms, and where the data are known to be cyclical because of, for example, the cell cycle. One then wishes to estimate the distribution of cells on the pseudo-time scale. The value at the left boundary of this distribution must, as a consequence of cell division, be double the value at the right boundary. In other words, we have linked boundary conditions with the constant β = 2, but otherwise we do not know the value of the density at the boundaries of the domain. See [25], for example, for further motivation and discussion.

Unfortunately, to the best of our knowledge, all of the currently existing kernel density estimation methods, bias-correcting or not, cannot satisfactorily handle the linked boundary condition.

Figure 4 in Section 5.2 shows a typical example of what can go wrong when currently available density estimators are applied to real biological data. The result is a smooth density with two unacceptable features:

  • The domain is not respected, and instead the solution has positive density for negative values of x, and also for x > 1, which is physically unreasonable.

  • The density does not respect the important biological constraint of the linked boundary condition (that the left value should be double the right, in this particular application), and instead the density decays to zero as x becomes large.

The purpose of this paper is to describe a new kernel density estimator that can handle the more general problem of linked boundary conditions with an arbitrary value of β ≥ 0; the situation of interest in the biological application, where β = 2, is then solved as an important special case. Figure 5 (C) shows a successful application of our proposed method.

Our proposed estimator is of type 1 above; that is, we construct a special kernel with support on [0, 1] such that the linked boundary condition is incorporated into the resulting estimator. Our kernel is inspired by the solution of a diffusion-type PDE [1, 3, 31, 38, 50]. In particular, we modify the diffusion model in [3] so that it satisfies the linked boundary conditions.

We also consider the discrete counterpart of the continuous model, for two reasons. First, it is a numerical approximation to the continuous model and a useful way to compute the solution of the PDE. Second, the discrete model is relevant when we deal with data that are already pre-binned into a large number of bins.

The rest of the article is organized as follows. In the next section we describe the continuous model for the application at hand. Our results provide the necessary assurances for this to be a useful model in statistics for the purposes of density estimation. Next, we study a discrete model of the same situation, and establish convergence to the continuous model. We then briefly discuss the issue of choosing a stopping time, including pointwise bias, asymptotic properties and the AMISE. These results are applied to an example with biological data in §5.2 and to some simulated examples. We then finish the main part of the paper with a short conclusion. After that, we present a derivation of the solution in §7, and explanations of its properties in §8. We formally solve the non-self-adjoint initial-boundary value problem as a contour integral using the unified transform, a novel transform method for analysing boundary value problems for linear (and integrable nonlinear) partial differential equations. This formal solution is then rigorously verified and studied via a nonsymmetric heat kernel, which generates a series expansion in nonseparable generalised eigenfunctions of the spatial derivative operator.

2 The Continuous Linked–Boundaries Model

First, recall the diffusion PDE approach to density estimation [3, 24], where a diffusion partial differential equation in one space dimension is employed for kernel density estimation purposes [43, 44]. We now present the continuous diffusion model that satisfies the linked boundary condition, and derive its analytical solution. Our proposed diffusion model for a linked-boundary KDE is the continuous solution of the PDE:

(1)    ∂u/∂t (x, t) = ½ ∂²u/∂x² (x, t),    x ∈ [0, 1], t > 0,
       u(x, 0) = u₀(x),                     x ∈ [0, 1],
       u(0, t) = β u(1, t),                 t > 0,
       ∂u/∂x (0, t) = ∂u/∂x (1, t),         t > 0.

Here the initial condition (IC),

(2)    u₀(x) = (1/N) ∑_{i=1}^{N} δ(x − X_i),

is simply the empirical measure of the given data X_1, …, X_N. In other words, u₀ is a sum of Dirac delta distributions. Our results hold for the common cases of practical importance, such as for initial data of the form in (2), and also for initial functions u₀ that are probability densities (or even general distributions).

The first boundary condition (BC) involves a constant β. Notice that the ratio of the left boundary value to the right boundary value is

u(0, t) / u(1, t) = β.

The constant ratio β links the boundaries and is assumed to be non-negative: β ≥ 0. (Results that we present later, such as Theorem 1, can in fact be generalised to any complex β; the PDE problem is degenerate irregular and ill-posed when β = −1, and due to the application we have in mind we stay with the case β ≥ 0 throughout this article.)

This PDE (1) is a natural candidate for density estimation with such a linked boundary condition. However, it is not immediately obvious what properties the PDE has. For example, one question is whether or not the solution is a probability density. More fundamental questions concern existence and uniqueness, which are well known in the special self-adjoint case of periodic boundary conditions, when β = 1.

2.1 Well-posedness of the continuous model

We begin by establishing existence and uniqueness in Theorem 1. It turns out that results analogous to the well-known periodic case β = 1 hold. First we require the following definitions.

Definition 1.

Consider the class of finite Borel measures on [0, 1], equipped with the vague topology (i.e. the weak topology). We then consider the space of measure-valued functions of time that are continuous in this topology. (This is the condition that the integral of any fixed function, continuous on the interval [0, 1], against the measure is continuous as a function of time.)

Notably, this is weaker than continuity with respect to the usual Banach space structure on measures given by the total variation norm, in which case the map from time to the measure would need to be continuous in that norm.

The usual story in something like the finite element method, for example, is to first multiply the original differential equation (1) by a smooth test function, then integrate, and then apply integration by parts. This procedure leads to the weak formulation of the problem, and similar steps apply here too, as follows, where we use φ to denote a smooth test function. In terms of notation, φ′ denotes the derivative with respect to x, whereas the subscript in u_t does not denote a derivative; instead it denotes the dependence on time. The notation ⟨φ, u_t⟩ denotes integration of the smooth test function φ against the measure u_t.

The following adjoint boundary conditions are exactly those that arise from integration by parts (adjoint with respect to the standard inner product). That is, the set of test functions defined below in (3) lies in the domain of the adjoint of the operator defined in (7); the adjoint operator itself is defined in (47).
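To make the integration by parts explicit (a short verification in the notation above, written out by us rather than quoted from the paper): for smooth u satisfying the linked boundary conditions and a smooth test function φ,
\[
\int_0^1 u''\varphi\,dx - \int_0^1 u\,\varphi''\,dx
= \bigl[\,u'\varphi - u\varphi'\,\bigr]_0^1
= u'(1)\bigl(\varphi(1)-\varphi(0)\bigr) + u(1)\bigl(\beta\varphi'(0)-\varphi'(1)\bigr),
\]
using u(0) = βu(1) and u′(0) = u′(1). The right-hand side vanishes for every such u precisely when φ(0) = φ(1) and φ′(1) = βφ′(0), which is the content of the adjoint conditions in (3).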

Definition 2.

Let φ be a smooth function on [0, 1]. We say that φ satisfies the adjoint linked boundary conditions if

(3)    φ(0) = φ(1)    and    φ′(1) = β φ′(0).
Definition 3 (Weak Solution).

Let the initial condition u₀ be a finite Borel measure on [0, 1]. We say that u_t is a weak solution to (1) for t ∈ (0, T] if t ↦ u_t is continuous in the sense of Definition 1 (with u_t converging to u₀ vaguely as t ↓ 0) and, for every smooth test function φ satisfying the adjoint boundary conditions (3), the map t ↦ ⟨φ, u_t⟩ is differentiable for t ∈ (0, T] with

(4)    d/dt ⟨φ, u_t⟩ = ½ ⟨φ″, u_t⟩.
Theorem 1 (Well Posedness).

Assume our initial condition u₀ is a finite Borel measure on [0, 1] and satisfies the linked boundary conditions in a suitable very weak sense. Then there exists a unique weak solution to (1) for t ∈ (0, T], for any T > 0, which we denote by u_t.

For t > 0, this weak solution is smooth in t and real analytic in x. Furthermore, in the more general setting of u₀ not necessarily being of the form (2), the solution has the following properties, which generalise the classical case β = 1:

  1. If u₀ is continuous, then for any x in the open interval (0, 1), u(x, t) converges to u₀(x) as t ↓ 0. If in addition u₀ satisfies the linked boundary conditions, then u(·, t) converges to u₀ as t ↓ 0 uniformly over the closed interval [0, 1].

  2. If and , then is the unique weak solution in and converges to as in .

2.2 Solution of the continuous model

If we ignore the constant β in the boundary conditions of (1) (replacing it by the special case β = 1), then we would have the simple diffusion equation with periodic boundary conditions. One can successfully apply Fourier methods, separation of variables or Sturm–Liouville theory to solve the periodic version of this PDE [13, 19]. However, when β ≠ 1, making the ansatz that a solution is of the 'rank one', separable form X(x)T(t) leads to no interesting solutions, and separation of variables fails. It is nevertheless possible to construct a somewhat non-obvious series solution. Surprisingly, this is not possible in general for well-posed constant coefficient evolution equations on a finite interval [34]. The differential operator associated with the evolution equation in (1) is regular in the sense of Birkhoff [2] but, when β ≠ 1, not self-adjoint due to the boundary conditions. It is possible to generalise the notion of eigenfunctions of the differential operator [6] and these generalised eigenfunctions form a complete system in L²(0, 1) [33, 26] (and in fact form a Riesz basis). However, the expansions become complicated due to the fact that the characteristic values (which determine the spectrum) are not simple and the Green's function does not have simple poles. In particular, this complicates the time dependence of the following series solution.
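The failure of separation of variables can be seen from the following short calculation (ours, using the boundary conditions of (1)). Seeking u(x, t) = e^{−k²t/2} φ(x) with φ(x) = A cos(kx) + B sin(kx), the two boundary conditions give
\[
A(1-\beta\cos k) - B\beta\sin k = 0,
\qquad
A\sin k + B(1-\cos k) = 0,
\]
with characteristic determinant
\[
\Delta(k) = (1-\beta\cos k)(1-\cos k) + \beta\sin^2 k = (1+\beta)(1-\cos k).
\]
Its zeros k = 2πn are all double zeros, yet for β ≠ 1 each carries only the single eigenfunction sin(2πnx); the 'missing' modes are supplied by generalised eigenfunctions, which is why non-separable terms appear in the expansion. (For β = −1 the determinant vanishes identically, the degenerate case excluded above.)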

Theorem 2 (Representation).

The unique solution described in Theorem 1 has the following representations.

Fourier integral representation:

(5)

Here the contours are shown in Figure 6 and are deformations of the boundaries of . The determinant function is given by and

Series representation:

(6)

where and

In the more general case where β ≠ 1, in addition to the usual separable solutions, the series expansion also includes non-separable solutions. We can understand these as being generalised eigenfunctions in the following sense (see the early papers [27, 48]). Define the operator, where the prime denotes a derivative with respect to x,

(7)

Then it is easily checked that these are indeed generalised eigenfunctions. (We use the null space, sometimes termed the kernel, of an operator to mean the space of all vectors that the operator maps to zero.) In particular, both the eigenfunctions and the generalised eigenfunctions satisfy the required boundary conditions. Evaluating the series at t = 0 gives a realisation of Theorems 3.1–3.5 of Chapter 5 of [26], which state that these generalised eigenfunctions form a complete system in L²(0, 1). In particular, these functions block diagonalise the operator in a form analogous to the Jordan normal form for finite matrices. If we consider a generalised eigenspace corresponding to a non-zero eigenvalue λ, with basis consisting of the eigenfunction and the generalised eigenfunction, the operator acts on this subspace as the 2 × 2 Jordan block

    ( λ   1 )
    ( 0   λ ),

which cannot be diagonalised.
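To see how such a Jordan block produces the non-separable time dependence (a standard observation, written in our own notation, with L standing for the spatial operator): if Lφ = λφ and Lψ = λψ + φ, then
\[
u(x,t) = e^{\lambda t}\bigl(\psi(x) + t\,\phi(x)\bigr)
\quad\text{satisfies}\quad
\partial_t u = e^{\lambda t}\bigl(\lambda\psi + \lambda t\,\phi + \phi\bigr) = L u ,
\]
so each block contributes one separable solution e^{λt}φ and one genuinely non-separable solution e^{λt}(ψ + tφ).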

For our context of kernel density estimation, we define an integral kernel K(x, y; t) so that we can write the solution as

u(x, t) = ∫₀¹ K(x, y; t) u₀(y) dy.

Owing to (39) and the residue calculus, this kernel is given by the somewhat complicated expression

(8)

which can be expressed in terms of the more common kernel and its derivative (see (44)). For the initial data (2) this gives the density estimate

f̂(x; t) = (1/N) ∑_{i=1}^{N} K(x, X_i; t).

Remark 1.

Theorem 1 also shows that the pointwise bias vanishes as t ↓ 0 if f_X is continuous with f_X(0) = β f_X(1). Namely,

(9)    E[ f̂(x; t) ] → f_X(x) as t ↓ 0,

uniformly in x ∈ [0, 1]. This is in contrast to the standard Gaussian kernel density estimator [49] or its periodic version (when β = 1) [3], where this is not the case at the boundary. In fact, a more careful analysis in §8.1 shows that, under additional smoothness and boundary compatibility assumptions on f_X, the pointwise bias of our estimator is O(t), uniformly in x.

2.3 Properties of the continuous model

The Maximum Principle [13, 19] for parabolic PDEs states that solutions of the diffusion equation attain their maximum on the parabolic 'boundary' of the two-dimensional region [0, 1] × [0, T] in space and time. If our initial distribution is given by a continuous function, then the maximum principle gives the following.

Proposition 1 (Bounds on Solution).

Suppose that the initial distribution u₀ is continuous with u₀(0) = β u₀(1), and non-negative, with u₀(x) ≥ 0 for all x ∈ [0, 1]. Then for any x ∈ [0, 1] and t > 0 we have

(10)

In particular, the solution remains bounded away from 0 if u₀ is strictly positive and β > 0.

Similar bounds can be found for negative initial data, and even for negative or complex β, but we exclude those cases in this article.

The second boundary condition, in words, says 'left slope equals right slope'. A consequence of this BC is formal conservation of mass:

d/dt ∫₀¹ u(x, t) dx = ½ ( ∂u/∂x (1, t) − ∂u/∂x (0, t) ) = 0.

If we start with an initial condition that integrates to one, then the solution at all later times also integrates to one. In the context of density estimation, this essential property corresponds to conservation of probability. Seeing both that the solution is non-negative (by the Maximum Principle above), and that the solution always integrates to one if the initial density integrates to one, confirms that the solution is indeed a probability density. This is made precise in the following theorem (a more careful proof could proceed via integral kernels, but we omit that here), which does not require continuous initial data.

Theorem 3 (Conservation of Probability).

Suppose that the assumptions in Theorem 1 hold and u_t denotes the unique weak solution (which is well behaved for t > 0 by Theorem 2). Suppose also that u₀ is a probability measure. Then

  1. ∫₀¹ u(x, t) dx = 1 for all t > 0,

  2. u(x, t) ≥ 0 for all x ∈ [0, 1] and t > 0.

We also have the following, which immediately follows from (6), and describes the behaviour of the solution for large time.

Proposition 2 (Large Time Behaviour).

Suppose that the assumptions in Theorem 1 hold and u_t denotes the unique weak solution. Then, as t → ∞, u(·, t) converges uniformly on [0, 1] to the linear function

(11)

This linear function is the unique stationary distribution that obeys the boundary conditions and has the same integral over [0, 1] as u₀.
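For orientation, here is a short derivation of the affine limit (our own calculation, assuming the initial condition has unit mass). Writing the stationary profile as a + bx, the linked boundary condition and the normalisation give
\[
a = \beta(a+b), \qquad \int_0^1 (a+bx)\,dx = a + \tfrac{b}{2} = 1
\quad\Longrightarrow\quad
a = \frac{2\beta}{1+\beta},\qquad b = \frac{2(1-\beta)}{1+\beta},
\]
so that
\[
u_\infty(x) = \frac{2\bigl(\beta + (1-\beta)x\bigr)}{1+\beta};
\]
the equal-slopes condition holds automatically for any affine function, and for a general initial mass one multiplies by the integral of u₀ over [0, 1].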

Two simple examples

Choose β = 2, so that the linked boundary condition becomes, in words, 'the left boundary value is double the right boundary value.' A simple and explicit solution, which is a probability density, is

(12)

The initial condition must be a probability density, and must be compatible with the above form in (12). In this example, the long time behaviour of (11) is the linear function (4 − 2x)/3. The solution in (12) is separable, and it has constant boundary values (4/3 on the left, and 2/3 on the right) that do not change in time. However, the solution in (12) is too simplistic. To be useful in applications we need solutions of (1), for any choice of β, whose boundary values automatically evolve in time from the given initial data, and where we do not make assumptions about the form of that initial data. For instance, Figure 1 shows a second example that, unlike (12), illustrates boundary values that evolve with time. This is one reason why we need results such as Theorem 2, which allows us to handle non-separable solutions.

Figure 1: An example of the solution of the continuous PDE (1) at three time points. The values at the boundaries change with time, but their ratio remains the constant β. The initial condition does not satisfy the 'equal slopes' boundary condition. Nonetheless, our theory shows that for all t > 0 the boundary conditions are satisfied, and moreover the methods that we propose still converge to the correct density, albeit at a slower rate.

3 The Discrete Linked–Boundaries Model

The exact, continuous PDE (1) is replaced by a standard finite-difference approximation that remains continuous in time but is discrete in space. This is sometimes known as the 'method of lines' [20], and it yields the first-order ODE system:

(13)

The structure of the matrix is described below. The solution of (13) is a vector that approximates the exact solution of (1); its jth component approximates u(x_j, t), where x_j is the jth point of a grid of equally spaced points in the domain [0, 1]. The spacing between two consecutive grid points is denoted by h. The two boundary conditions give two equations involving the values at the two boundary nodes, i.e. at the first and the last grid point. That is,

(14)
(15)

This motivates us to make the following definitions for the boundary nodes:

(16)

(We have eliminated the two boundary values from our equations in favour of the values at the neighbouring interior nodes.) We are left with a set of equations involving only the unknown values at the interior nodes, and a correspondingly smaller matrix.

3.1 The Four Corners Matrix

All this leads to a matrix with the following structure:

(17)

The first row and the last row come from the boundary conditions. All interior rows come from the standard second-order finite difference approximation of the (spatial) second derivative.

Various linked boundary conditions can be modelled by choosing different values for the four corner entries. We will only consider choices that maintain the key properties of the matrix (17), namely that it has zero column sums, negative off-diagonal entries, and positive main diagonal entries. An attraction of this discrete model (13) is that it can then be interpreted as a continuous-time Markov process. This ensures that the solution vector is a non-negative probability vector. For example, our linked boundary conditions in (1), with β ≥ 0, correspond to a choice of corner entries that does allow the interpretation as a Markov process.

The solution of the discrete model problem (13), with a constant coefficient matrix, comes as a matrix exponential applied to the initial vector. One way to compute that exponential is via diagonalization

(18)

in which the middle factor is a diagonal matrix whose kth diagonal entry is the scalar exponential of t times the kth eigenvalue, and the kth column of the eigenvector matrix is the eigenvector corresponding to that eigenvalue.

The ‘Four Corners Matrix’ (17) is a non-symmetric example of a ‘tridiagonal Toeplitz matrix with four perturbed corners’ [51, 47]. Exact and explicit formulas for the eigenvalues and eigenvectors are available. Thus we have exact expressions for the solutions to this discrete model, via the above diagonalisation, if desired.
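The following MATLAB sketch illustrates the discrete model. It is our own illustration, not the authors' released code: the particular corner entries come from eliminating the two boundary nodes using u(0,t) = βu(1,t) and equal one-sided slopes, and the paper's exact normalisation of (17) may differ.

    % Discrete linked-boundary model: assemble a four-corners matrix for the
    % interior nodes and evolve binned data by a matrix exponential.
    beta = 2;                                  % linked boundary ratio
    n    = 200;                                % grid points on [0,1]
    h    = 1/(n-1);                            % grid spacing
    m    = n - 2;                              % number of interior unknowns
    A = spdiags(repmat([-1 2 -1], m, 1), -1:1, m, m);        % second-difference part
    A(1,1) = 2 - beta/(1+beta);   A(1,m) = -beta/(1+beta);   % perturbed corners from the BCs
    A(m,m) = 2 - 1/(1+beta);      A(m,1) = -1/(1+beta);
    % Columns of A sum to zero, so sum(w) is conserved and w stays non-negative.
    w0 = rand(m,1); w0 = w0/(h*sum(w0));       % toy binned data, normalised as a density
    t  = 0.01;                                 % stopping time (squared bandwidth)
    w  = expm(-(t/(2*h^2))*full(A))*w0;        % solve dw/dt = -(1/(2h^2)) A w
    vR = (w(1) + w(m))/(1+beta);               % recover the eliminated right boundary value
    v  = [beta*vR; w; vR];                     % full vector; note v(1) = beta*v(end)
    plot(linspace(0,1,n), v)

In this sketch, setting beta = 1 makes the perturbed corners symmetric, consistent with the special symmetric case discussed below.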

There is a unique zero eigenvalue, corresponding to the stationary density as t → ∞. The stationary density is an affine function in the continuous PDE setting. In the discrete setting, the components of the stationary eigenvector are equally spaced, in the sense that the difference between consecutive components is constant. It is then easy to verify that the corresponding rank-one matrix is Toeplitz-plus-Hankel, for example by quickly checking the tests described in [47]. The same matrix (17), in the special symmetric case β = 1, appears in the Four Corners Theorem of [47].

Here we observe that the Four Corners Theorem of [47] can be generalized to the non-symmetric case. That is, the Four Corners Theorem still holds for the Four Corners Matrix (17) in our non-symmetric case: all functions of our non-symmetric Four Corners Matrix (17), with the parameters as in (19) below, are the sum of a Toeplitz matrix and a Hankel matrix. To see that this is still true, we could check that the projections onto all eigenspaces are Toeplitz-plus-Hankel. To do that, we could use the explicit expressions for the eigenvectors given below in (20). The entries of the corresponding rank-one matrices are then sums of products of trigonometric functions of the two indices. The elementary product-to-sum identities (for example 2 sin a sin b = cos(a − b) − cos(a + b)) now allow us to write each such product as the sum of a Toeplitz part (a function of the difference of the indices only) and a Hankel part (a function of the sum of the indices only). Since the projection onto every eigenspace is Toeplitz-plus-Hankel, the solution is likewise the sum of two parts: (i) a Toeplitz part, which can be thought of as the solution without boundary conditions; and (ii) a Hankel part, which is precisely the correction due to the boundary conditions.

All other eigenvalues of the matrix are negative. In the notation of [51] we have:

(19)

and

so we are in the setting of [51, Theorem 3.2 (i)]. In the case that , we can group the spectral data into two classes with eigenvalues:

where

The zero eigenvalue, when , has already been discussed. Other eigenvalues correspond to eigenvectors with components (listed via subscripts)

(20)

Some properties of the discrete model and its links to the continuous model are:

  • All eigenvalues of the Four Corners Matrix are purely real. Also, the eigenvalues of the operator in the continuous model are likewise purely real. (Recall that in this article we insist on the PDE in (1) with β ≥ 0 real, and with the equal-slopes boundary condition. Generalisations to other values of β and other boundary constraints are certainly possible, but we do not consider those generalisations, nor their spectra, in this article.) This is perhaps surprising since the matrix is not symmetric, and the operator is not self-adjoint.

  • The Four Corners Matrix is diagonalizable, and in the sense of the right-hand side of (18), the solutions to the discrete model are 'separable.' In contrast, the operator for the continuous PDE is not diagonalizable; instead the analogue of the Jordan normal form from linear algebra applies to the operator. Despite this, it is straightforward to show that:

    1. The eigenvalues of the discrete model converge to those of the continuous model (including algebraic multiplicities). (This can be made precise in, say, the Attouch–Wets topology; the convergence is locally uniform.)

    2. The eigenvectors converge to the generalised eigenfunctions of the continuous operator as the grid spacing tends to zero.

  • Finally, the discrete model is stable in the maximum norm. This, together with consistency and the fact that the continuous model is well posed, implies via the Lax Equivalence Theorem [36] that the discrete model converges to the solutions of the continuous model. In particular, the bounds proven in Proposition 1 for the continuous model carry over to the discrete model with

    (21)

    (where a subscript denotes the corresponding matrix norm). From the interpretation of the discrete model as a Markov process, it follows that the discrete evolution is also a contraction in the 1-norm, and hence we also gain stability in any p-norm via interpolation.

In summary, we offer two different practical computational methods to perform density estimation with linked boundary conditions: (i) the discrete model, via the Four Corners Matrix; and (ii) the continuous model, via the solutions that we find here as a series or as a contour integral. (We provide MATLAB codes for both methods at https://github.com/MColbrook/Kernel-Density-Estimation-with-Linked-Boundary-Conditions.) The two methods have relative advantages and disadvantages. The Four Corners Matrix is a finite difference method of modest accuracy, but it is simple and easy to use. This might be attractive, especially if the initial data are already discretely binned, as is often the case. It also essentially guarantees positivity because of the attractive interpretation as a Markov process. In contrast, the series solution we provide for the continuous model is typically highly accurate.

4 Choosing the stopping time

An important issue in density estimation is how to choose the bandwidth parameter (for kernel density estimation) or, equivalently, the final time t at which we compute the solution of the PDE. This issue has already received extensive attention in the literature. We now give a brief summary, and we also point to two known methods that are already available. After that, we address the issue specifically in the context of our linked boundaries model.

At one extreme, if we choose t = 0, then we recover the initial condition, which is precisely the raw data, with no smoothing. At the other extreme, if we let t → ∞, then we obtain the stationary density, a straight line, which contains no information whatsoever about the raw data. In between these extremes we have some smoothing effect while also retaining some information from the original data, and often this is interpreted as an optimal balance between the data and the prior.

One would like such a choice of an intermediate value of t to satisfy certain statistical properties. For example, as more and more data are included, we would like the estimator to converge to the 'true' density. Various proposals and their properties are available. Perhaps the most common choice is 'Silverman's rule of thumb', which works very well when the data are close to normally distributed. We expect that this choice is adequate for the simpler datasets and examples that we consider in this article. However, the methods that we present are more general, and for other applications it may well be necessary to adopt more sophisticated approaches.
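As a concrete illustration of the rule of thumb (ours, not from the paper): Silverman's formula gives a bandwidth h, and since the bandwidth here plays the role of sqrt(t), squaring it yields a stopping time.

    % Silverman's rule-of-thumb bandwidth, converted to a stopping time t = h^2.
    data = rand(1000, 1);                 % toy sample on [0,1]
    N    = numel(data);
    h    = 1.06 * std(data) * N^(-1/5);   % rule-of-thumb bandwidth for near-Gaussian data
    t    = h^2;                           % stopping time for the continuous or discrete model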

In that regard, another possible approach is to use the output from the freely available software of one of the authors (https://au.mathworks.com/matlabcentral/fileexchange/14034-kernel-density-estimator). In particular, the first output from that software is bandwidth = sqrt(t), when the software is called via

[bandwidth,density,xmesh,cdf]=kde(data,n,MIN,MAX).

This is expected to be a better choice than Silverman’s rule in situations where there are many widely separated peaks in the data.

4.1 Asymptotic properties

We now give a more precise treatment of the choice of stopping time for the linked boundaries model, as well as discussing the Mean Integrated Squared Error (MISE) defined by

(22)    MISE(t) = E ∫₀¹ ( f̂(x; t) − f_X(x) )² dx
(23)             = ∫₀¹ ( E[ f̂(x; t) ] − f_X(x) )² dx + ∫₀¹ Var( f̂(x; t) ) dx.

Often one is interested in the asymptotic approximation to the MISE, denoted AMISE, under the requirements that t → 0 and N√t → ∞ as N → ∞. The asymptotically optimal bandwidth is then the minimizer of the AMISE. For our continuous model of kernel density estimation we have the following result, which gives the same rate of convergence as the Gaussian kernel density estimator on the whole real line.
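For orientation, recall the classical whole-line result for the Gaussian kernel estimator with squared bandwidth t (this is the textbook formula, stated here for comparison, not equations (24)–(28) of the paper):
\[
\mathrm{AMISE}(t) = \frac{1}{2N\sqrt{\pi t}} + \frac{t^{2}}{4}\,\|f_X''\|_{L^2}^{2},
\qquad
t_* = \Bigl(2N\sqrt{\pi}\,\|f_X''\|_{L^2}^{2}\Bigr)^{-2/5},
\]
so the optimal squared bandwidth scales like N^{−2/5} and the minimum AMISE like N^{−4/5}.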

Theorem 4.

Let be such that and and suppose that with . Then the following hold:

  1. The integrated variance has the asymptotic behaviour

    (24)
  2. If then

    (25)
  3. If then

    (26)
Corollary 1.

Combining the leading order bias and variance terms gives the asymptotic approximation to the MISE:

  1. If then

    (27)

    Hence, the square of the asymptotically optimal bandwidth is

    with the minimum value

  2. If then

    (28)

    Hence, the square of the asymptotically optimal bandwidth is

    with the minimum value

A few remarks are in order. First, it is interesting to note that in the first case of Corollary 1, the optimum choice and the minimum AMISE do not depend on β, and are the same as in the more familiar 'whole line' situation; in other words, we can confidently use existing methods in the literature (such as those recommended above) to choose a stopping time. Second, it seems plausible that we could estimate β (or the relevant boundary values) adaptively and change the boundary conditions in the model accordingly. A full discussion of solving the heat equation with linked boundary conditions for the first spatial derivative is beyond the scope of this paper, but it can be done using the same methods we present here. Future work will aim to incorporate an adaptive estimate of the true boundary conditions (both for the density function and its first derivative) and the resulting adaptive boundary conditions. We mention a result in this direction, which will appear when we compare our model to that of [3], whose code is based around the discrete cosine transform; the continuous version of this solves the heat equation subject to the boundary conditions

(29)    ∂u/∂x (0, t) = ∂u/∂x (1, t) = 0.

We have used a different subscript for this solution to avoid confusion with our solution to (1). The analogue of Theorem 4 is the following theorem, which can be proven using the same techniques; hence we omit the proof. Similarly, one can then derive the optimum choice of t and the minimum AMISE.

Theorem 5.

Let be such that and and suppose that . Then the following hold:

  1. The integrated variance has the asymptotic behaviour

    (30)
  2. If then

    (31)
  3. If or then

    (32)

5 Numerical Examples

5.1 Numerical Examples with Synthetic Data

Here we test the method on examples where the true density is known. We begin with the trimodal distribution shown in Figure 2. We have also shown a typical approximation of the distribution using our proposed method and the density estimator proposed in [3], for a fixed sample size. The proposed method is more accurate near the boundaries of the domain (see the zoomed-in sections of the plot) and behaves similarly in the middle of the domain. For the stopping times we have used the choice bandwidth = sqrt(t) discussed in §4. We have also measured the error in two ways, taking the mean over 100 simulations of both the L²-error and the uniform-norm error of the given estimator. Figure 2 shows these measures of error as we increase the sample size, as well as the minimum AMISE as a function of the sample size for each method. We can clearly see agreement with the analysis in §4.1. The difference between the two methods is more striking when measured in the uniform norm. We found this to be the case for a range of other tested distributions. Even in cases where both methods obtain the same minimum AMISE scaling, there was a significant difference in the uniform norm.

Figure 2: Left: Example of the different methods for a fixed sample size. The proposed diffusion model ("Linked") is much more accurate near the boundaries than the cosine model ("Cosine"), as highlighted by the zoomed-in sections. Right: Errors of both methods as we increase the sample size (averaged over 100 simulations). The errors agree well with the minimum AMISE from §4.1, and the increased accuracy gained near the boundary by the linked boundary model shows up clearly in the uniform-norm errors.

Next we consider the case where the true density is log-concave and not necessarily smooth; it is built from beta-distribution PDFs, with a parameter that controls the smoothness of the density near the boundary. We have compared our method to a method that computes log-concave maximum likelihood estimators [11, 10]. This seeks to compute the log-concave projection of the empirical distribution through an active set approach (code for this can be found at https://CRAN.R-project.org/package=logcondens). Details on this method and other similar methods can be found in [37], with a study of the more involved case of censored data in [12]. Tables 1 and 2 show the mean squared errors, in the L² and uniform norms respectively, over 10 simulations, as we vary the smoothness parameter, for the linked boundary diffusion estimator and the log-concave projection method (abbreviated to LC), as well as its smoothed version (LCS). In each case we have shaded the most accurate estimator. The linked boundary diffusion estimator performs much better when measured in the uniform norm, but is slightly worse in the L² sense when the distribution becomes less smooth. This is demonstrated in Figure 3 for a typical estimation. The experiments in Tables 1 and 2 measured the errors over a grid of equally spaced points and were performed on a four-year-old laptop. The linked boundary diffusion estimator took about 0.1s on average per simulation (the cosine method based on the discrete cosine transform took about 7s; the linked estimator is quicker because of the exponential decay in time of the series terms, so only a small number of terms are needed to obtain very accurate results), the log-concave projection took about 5s, and its smoothed version was much slower, taking about 73s.

Linked 2.98E-3 1.33E-3 6.82E-4 3.22E-4 2.38E-4
LC 1.05E-3 1.14E-3 1.26E-3 1.38E-3 1.52E-3
LCS 1.23E-3 1.03E-3 9.42E-4 1.04E-3 1.19E-3


Linked 1.58E-4 1.13E-4 8.01E-5 5.96E-5 5.05E-5
LC 1.65E-3 1.80E-3 1.94E-3 2.09E-3 2.27E-3
LCS 1.30E-3 1.40E-3 1.66E-3 1.74E-3 2.16E-3
Table 1: Mean squared error over 10 simulations for different values of the smoothness parameter.
Linked 7.32E-2 4.19E-2 2.52E-2 1.31E-2 7.97E-3
LC 5.34E-1 6.40E-1 7.51E-1 8.71E-1 1.00E0
LCS 1.84E-1 1.26E-1 1.42E-1 1.72E-1 1.95E-1


Linked 4.42E-3 2.85E-3 1.18E-3 4.78E-4 2.39E-4
LC 1.14E0 1.28E0 1.44E0 1.60E0 1.78E0
LCS 2.27E-1 2.47E-1 2.82E-1 3.22E-1 3.70E-1
Table 2: Mean squared error over 10 simulations for different values of the smoothness parameter.
Figure 3: Typical estimates for one choice of the parameters and sample size. We used the R package logcondens for the log-concave projection method.

5.2 Numerical Example with Cell Data

This short section demonstrates the application of our proposed methods to a problem in biology with real data, taken from [25]. As mentioned in the introduction, Figure 4 shows an example of what goes wrong when current methods are applied. Figure 5 (C) demonstrates our proposed method, which performs satisfactorily.

Figure 4: A typical example of output from a kernel density estimator (ksdensity from MATLAB) applied to our real biological data. This does not respect the domain, and it also does not respect the important linked boundary conditions. The novel methods that we propose in this article address those issues.

Our example originates from the study of biological processes, in particular the cell cycle, with single-cell data. The cell cycle can be split into different stages (Fig. 5 A): it starts with a growth phase (G1), followed by the S-phase in which DNA is replicated. The subsequent shorter growth phase (G2) then leads into M-phase (mitosis). A cell then divides into two daughter cells at the end of M-phase. The example dataset is obtained from an engineered cell line which has a fluorescent reporter protein named "geminin", which starts to accumulate once a cell transitions from G1- to S-phase and which is degraded during cell division (Fig. 5 A). This dataset comprises levels of DNA and geminin in single cells measured by flow cytometry (Fig. 5 B). The red curve in Fig. 5 B indicates the path that the average cell takes when it goes through the cell cycle. Pseudotime algorithms perform a dimensionality reduction by assigning a pseudotime value to each cell, which can be interpreted as its position on the average curve. A recently developed theory based on ergodic principles makes it possible to derive a transformation to a real-time scale from the cell ordering in pseudotime, reflecting cell cycle progression ([21, 25]). The method relies on the distribution of cells on the pseudotime scale. As mentioned in the introduction, the values of this distribution at the beginning and the end of the cell cycle are linked, due to cell division, by

(33)    f(0) = 2 f(1),

where f denotes the density of cells on the pseudotime scale (rescaled to [0, 1]).

The example dataset (Figure 5) comprises levels of DNA and geminin in single cells in an exponentially growing population of NCI-H460/geminin cells, measured by flow cytometry ([25]). A pseudotime algorithm provides a dataset of pseudotime values, which represent a quantitative measure of the progression of every single cell through the cell cycle. The kernel density estimator with linked boundary condition (β = 2) thus produces a distribution that satisfies the condition (33) on the density due to cell division (Figure 5 C).

Figure 5: (A) Schematic cell cycle with geminin expression starting at the end of G1. (B) DNA and geminin signals from individual cells can be used to obtain a pseudo-temporal ordering of the population. An average cell follows the indicated path (red) through the dataset. (C) Pseudotime values (gray), binned data (blue) and kernel density estimate (red). The kernel density estimate was obtained by solving our continuous PDE (1) by our discrete numerical method in (13), with the 'four corners matrix' in (17), and this solution looks satisfactory. The stopping time came from the kernel density estimator in [3].

6 Conclusion

Our study was motivated by a question from biology. This biological application required a method of density estimation that can handle the situation of linked boundaries. Boundary conditions are known to be a difficult problem in the context of kernel density estimation. To our knowledge, the linked boundary conditions that we handle here have not been previously addressed in the literature. We have provided two approaches: 1) a continuous PDE model, with theoretical guarantees that its solution is well behaved and suitable for the proposed statistical applications; 2) a discrete model, which we prove converges to the continuous model. As a numerical example, we have applied the discrete model to data that arise in biology. The models and the analysis we provide naturally lead us to provide algorithms and software suitable for this application. We believe this can be useful to a wider community interested in this class of problems with linked boundary conditions. We found the method to compete well with existing methods, both in terms of speed and accuracy. In particular, the method is more accurate close to the boundary.

There remain some open questions regarding the provided models. First, it seems likely that a good strategy may be to adaptively approximate suitable boundary conditions satisfied by the probability distribution (such as the linked boundary condition itself, or conditions more generally linking the derivatives) and use this estimate to choose the appropriate model. Second, it is possible to adapt the methods in this paper to multivariate distributions. We leave both of these topics for further study.

Arguments to explain the key results

7 Derivation of the Solution

We begin with a description of how to obtain the solution in Theorem 2. It is straightforward to verify that the solution works, but this does not indicate how it is constructed. We solve the problem via a powerful transform technique called the unified transform [14, 15, 17, 16, 7]. The unified transform applied to linear initial-boundary value problems can be considered as constructing a transform tailor-made to the given PDE and boundary conditions. It generalises many of the classical methods and computes solutions even in some cases where classical methods apparently fail [46, 22]. A review of this method can be found in [9]. So far the only case of our problem considered in the literature on this method has been the trivially periodic case. For the heat equation with oblique Robin boundary conditions or non-local boundary conditions we refer the reader to [29], and for interface problems we refer the reader to [41, 42, 40].

We then construct the series solution by deforming the contours in the integral representation and applying Cauchy’s residue theorem. We remark that the unified method is similar in spirit to the contour method of Birkhoff in that it makes considerable use of the tools of complex analysis (see [32] for more details and history on this method). Birkhoff’s method tackles the problem of generalised eigenfunction expansions of the spatial derivative operator by using asymptotic estimates of the Green’s function associated with the problem for large eigenvalue parameter. Further discussion of the link between the unified transform and the spectral theory of two-point linear differential operators is beyond the scope of this paper but we refer the reader to [35, 45]. We found that the unified transform considerably simplifies the proof of Theorem 1 as opposed to the use of series solutions. To the authors’ knowledge, our problem has not been considered (other than the obvious periodic case) in the literature concerning either classical methods or the unified transform.

7.1 Obtaining the solution

The first step is to write the PDE in divergence form:

We will employ Green’s theorem,

(34)

over the domain [0, 1] × [0, t]. Here one must assume a priori estimates on the smoothness of the solution, which will be verified later using the candidate solution. Define the transforms:

where again we assume these are well defined. Green's theorem and the boundary conditions imply (after a small amount of algebra) the so-called 'global relation', coupling the transforms of the solution and initial data:
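The computation can be sketched as follows (our own rendering, using standard unified-transform conventions; the paper's normalisation may differ). For ∂u/∂t = ½ ∂²u/∂x², one checks the divergence form
\[
\partial_t\!\left(e^{-ikx + k^{2}t/2}\,u\right)
= \partial_x\!\left(\tfrac12\,e^{-ikx + k^{2}t/2}\bigl(\partial_x u + iku\bigr)\right),
\qquad k \in \mathbb{C}.
\]
Integrating this identity over [0, 1] × [0, t] and writing \(\hat u(k,t) = \int_0^1 e^{-ikx} u(x,t)\,dx\) gives a global relation of the form
\[
e^{k^{2}t/2}\,\hat u(k,t) - \hat u_0(k)
= \tfrac12\int_0^{t} e^{k^{2}s/2}\Bigl[\,e^{-ik}\bigl(u_x(1,s) + ik\,u(1,s)\bigr) - \bigl(u_x(0,s) + ik\,u(0,s)\bigr)\Bigr]\,ds,
\]
after which the linked boundary conditions are used to eliminate the unknown boundary terms.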