Shape-Constrained Univariate Density Estimation

04/04/2018
by   Sutanoy Dasgupta, et al.
Florida State University
0

While the problem of estimating a probability density function (pdf) from its observations is classical, the estimation under additional shape constraints is both important and challenging. We introduce an efficient, geometric approach for estimating pdfs given the number of its modes. This approach explores the space of constrained pdf's using an action of the diffeomorphism group that preserves their shapes. It starts with an initial template, with the desired number of modes and arbitrarily chosen heights at the critical points, and transforms it via: (1) composition by diffeomorphisms and (2) normalization to obtain the final density estimate. The search for optimal diffeomorphism is performed under the maximum-likelihood criterion and is accomplished by mapping diffeomorphisms to the tangent space of a Hilbert sphere, a vector space whose elements can be expressed using an orthogonal basis. This framework is first applied to shape-constrained univariate, unconditional pdf estimation and then extended to conditional pdf estimation. We derive asymptotic convergence rates of the estimator and demonstrate this approach using a synthetic dataset involving speed distribution for different traffic flow on Californian driveways.

READ FULL TEXT VIEW PDF

Authors

page 1

page 2

page 3

page 4

05/08/2018

Fused Density Estimation: Theory and Methods

In this paper we introduce a method for nonparametric density estimation...
11/18/2019

A projection approach for multiple monotone regression

Shape-constrained inference has wide applicability in bioassay, medicine...
08/28/2018

Active set algorithms for estimating shape-constrained density ratios

We review and modify the active set algorithm by Duembgen et al. (2011) ...
05/21/2022

Robust density estimation with the 𝕃_1-loss. Applications to the estimation of a density on the line satisfying a shape constraint

We solve the problem of estimating the distribution of presumed i.i.d. o...
06/20/2012

Fast Nonparametric Conditional Density Estimation

Conditional density estimation generalizes regression by modeling a full...
09/18/2013

A constructive approach to triangular trigonometric patches

We construct a constrained trivariate extension of the univariate normal...
05/23/2017

Efficient and principled score estimation with Nyström kernel exponential families

We propose a fast method with statistical guarantees for learning an exp...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Estimation of a probability density function (pdf) from a number of samples is an important and well-studied problem in statistics. It is useful for any number of statistical analyses, for example, quantile regression, or for indicating data features like skewness or multimodality. The problem becomes more challenging, however, when additional constraints are imposed on the estimate, especially constraints on the shape of the densities allowed. The imposition of such constraints is motivated by the fact that if the true density is known to lie in a certain shape class, then one should be able to use that knowledge to improve estimation accuracy.

The most commonly studied shape constraints include log-concavity, monotonicity, and unimodality. The obvious extension to multimodality has been studied in the case of function estimation (see the very recent article by wheeler2017bayesian and references therein), but there is considerably less work on density estimation under this type of constraint. It is this type of constraint that we focus on in this paper.

The earliest estimate for a unimodal density was given by Grenander1956-nt, who showed that a particular, natural class of estimators for unimodal densities is not consistent, and presented a modification that is consistent. Over the last several decades, a large number of papers have been written analyzing the properties of the Grenander estimator, e.g. (Rao1969-fc; Izenman1991-nk) and its modifications (Birge1997-gh). An estimator using a maximum likelihood approach was developed by Wegman1970-ux.

The earlier papers assumed knowledge of the position and value of the mode, and applied monotonic estimators over subintervals on either side of it. Later papers, for example Meyer2001-zh; Bickel1996-oi, include an additional mode-estimation step. Other papers developed Bayesian methods, for example Brunner1989-rj. Hall2002-lw uses a tilting approach to transform the estimated pdf into the correct shape. Turnbull2014-ag, in addition to describing an estimator that uses Bernstein polynomials with the weights chosen to satisfy the unimodality constraint, also provide a useful summary of recent results on unimodal density estimation.

Closer to our work in this paper, Cheng1999-up use a template function to estimate unimodal densities. Given an unconstrained estimator, they start from a template unimodal density and provide a sequence of transformations that when applied to the template both keep the result unimodal, and “improve” the estimate in some sense. However, the method is ad hoc, and convergence, although seen empirically, is by no means guaranteed.

In contrast, we take a principled geometric approach to the problem. The advantages of our method are as follows. First, while estimation is still based on transformation of an initial template, we apply only a single transformation rather than a (possibly non-convergent) sequence. Coupled with a small number of other parameters, this transformation constitutes a parametrization of the whole of the shape class of interest; there are no hidden constraints. Second, we use a broader notion of shape than previous work: in its simplest form we constrain the pdf to possess a fixed, but arbitrary, number of modes; we consider more general cases in section 6. Third, we use maximum likelihood estimation, guaranteeing optimality in principle, and allowing the derivation of asymptotic rates of convergence to the true density.

1.1 Summary of method

Our problem can be stated as follows: given independent samples , from a pdf , with a known number of well-defined modes, estimate this density ensuring the presence of modes in the solution. In order to do this, we construct a parameterization of the set of densities with modes, , as follows. Let the critical points of a pdf with modes be , with and . We can define the height ratio vector of as the set of ratios of the height of the interior critical point to the height of the first (from the left) mode: , where . Let the subspace of with height ratio vector be denoted . We then parameterize an arbitrary member of by the following elements:

  • A height rato vector ;

  • a diffeomorphism , where is the group of diffeomorphisms of .

Together these generate the pdf , where is a a priori fixed template function in , and denotes a group action of on , with the crucial property that it preserves .

Using this parameterization, we can construct the log likelihood function

(1)

and we can use maximum likelihood to estimate and .

The optimization involved is made challenging by the fact that is an infinite-dimensional, nonlinear manifold. To address this issue, we define a bijective map from into a unit Hilbert sphere (set of square-integrable functions with unit norm) and then flatten this sphere around a pivot point to reach a proper Hilbert space. Using a truncated orthonormal basis, we can then represent elements of by a finite set of coefficients. The joint optimization over and can then be performed using a standard optimization package since these representations now lie in a finite-dimensional Euclidean space.

We can generalize this method to a larger set of shape classes by defining a shape as a sequence of monotonically increasing, monotonically decreasing, and flat intervals that together constitute the entire density function. For example, the shape of an “N-shaped” density function is given by the sequence (increasing, decreasing, increasing). For any such sequence, we can construct a template density in the appropriate shape class, and proceed with estimation as before.

1.2 Overview

The rest of the paper is organized as follows. Section 2 describes the parameterization of in detail. Section 3 describes the implementation of the maximum likelihood optimization procedure, and in particular, the parameterization of . Section 4 presents the asymptotic convergence rates associated with the proposed estimator, while Section 5 presents some experimental results on simulated datasets. Section 6 extends the framework to the more general shape classes just mentioned, while Section 7 extends the framework to shape constrained conditional density estimation. Section 8 presents a application case study. Section 9 summarizes the contributions of the paper and discusses some associated problems, limitations and further possible extensions. The Appendix contains the derivations of the asymptotic convergence rate presented in Section 4.

2 A Geometric Exploration of Densities

In this section, describe the parameterization we use for the set of densities with modes. We start by introducing some notation and some assumptions about the underlying space of densities .

In this framework we are primarily going to focus on pdf’s that satisfy the following conditions: It is strictly positive and continuous with an interval support and zero boundaries. (For simplicity of presentation, we will assume that the support is .) Furthermore, we assume that the pdf has well defined modes that lie in . Let be such a pdf and suppose that the critical points of are located at , for , with and . Define the height ratio vector of to be , where is the ratio of the height of the interior critical point to the height of the first (from the left) mode. Please look at the top left panel of Figure 2 for an illustration. We define to be the set of all continuous densities on with zero boundaries. Let be the subset with modes and let be a further subset of pdf’s with height ratio vector equal to . Define the set of all time warping functions to be . This set is a group with composition being the group operation. The identity element of is and for every there exists a such that .

Theorem 1.

The group acts on the set by the mapping , given by . Furthermore, this action is transitive. That is, for any , there exists a unique such that .

Proof.

The new function is called the time-warped density or just warped density. To prove this theorem, we first have to establish that the warped density is indeed in the set . Note that time warping by and the subsequent global scaling do not change the number of modes of since is strictly positive (by definition). The modes simply get moved to their new locations . Secondly, the height ratio vector of remains the same as that of . This is due to the fact that and . Next, we prove the compatibility property that for every and , we have . Since,

this property holds.

Finally, we prove the transitivity property: given , there exists a unique such that . Let be the height of the first mode of and let be the height of the first mode of . Then, define two nonnegative functions according to and . Note that the height of both their first modes is and the height vector for the interior critical points is . Also, let the critical points of and (and hence and , respectively) be located at and respectively, for . Since the modes are well defined, the function is piecewise strictly-monotonous and continuous in the intervals , for . Hence, within each interval there exists a continuous inverse of , termed . Then, set is such that and hence . Note that the is uniquely defined, continuous, increasing, but not differentiable at the finitely many critical points in general. Hence does not exist at those points. But can be replaced by a weak derivative of . Let be a weak derivative of that is equal to wherever exists, and otherwise. Define . Then and are equal and exists everywhere, and . ∎

Now note that . Thus for the estimation procedure entails estimating the (unique) height ratio vector such that constructing an element , and estimating the time warping function such that . Figure 1 illustrates the height preserving effect of the composition of warping functions before normalization, and the height ratio vector preserving effect of the group action.

Figure 1: The top left panel shows the initial density shape. The top right panel shows different warping functions considered for transforming the shape. The bottom left panel shows the resultant warped shapes which preserves the heights of the critical points. The bottom right panel shows the resultant warped densities after normalization which does not preserve the heights but preserves the height ratio vector.

Assume, for the moment, that

corresponding to is known. The estimation procedure is initialized with an arbitrary modal template function constructed as follows:

Set where is a very small positive number. Let the interval be divided into equal intervals corresponding to the modes and interior antimodes. Let the location of the th critical point be , with , and . Set the value of for the location of the left most mode to be . Let the heights for the other interior critical points be for which are the height ratio vector for the true density, assumed known for now. Represent this as . The values of

for the other points is obtained by linear interpolation. Then

. The final step of the procedure involves estimating the time warping function such that . The key feature of this step of the estimation procedure is the geometry of the set , which is crucial in developing a maximum likelihood approach for estimating the to be used to transform the original shape , since is a nonlinear space. Note that . Thus are elements of the Hilbert sphere, with a known simple geometry, and associated linear tangent spaces which facilitate truncated orthogonal expansion to represent the elements of . The entire procedure of estimating by exploiting the geometry of is explained in detail in Section 3. Also the height ratio vector , assumed known till now, can be estimated jointly with from the observations via maximum likelihood estimation, discussed in Section 3.2.

Algorithm provides the steps on how to construct the estimate of given in practice. Note that in practice one can start with a template function rather than a template density because it results in the same estimate of . Figure 2 shows a simple example to illustrate the estimation procedure. The top left panel is the true density with modes with critical points located at with height . The top right panel shows the initial template function with modes and critical points located at and heights . The bottom left panel shows the warping function constructed according to Algorithm and the bottom right shows that using the warping function, we get back the exact true density shape. Thus, given any initial template the procedure entails estimating the correct height ratios and the warping function.

When the bounds of a density function is not known, they are estimated from the data using the formula and where A and B are the lower and upper bounds respectively,

is the standard deviation of the observations and

is the number of observations, as used in Turnbull2014-ag. For a general and , the data are scaled to the unit interval according to for the estimation process. Note that theoretically the assumption can be relaxed by considering the height ratios of the two boundaries as two extra parameters and . This allows the proposed framework to encompass a much broader notion of shapes. Specifically, “shapes” can refer to an ordered sequence of monotonic pieces which when pieced together constitute the entire function. For example, a V shaped function can be written as a decreasing-increasing shape. Knowing this “shape” allows us too incorporate the same shape in the template function and hence obtain a maximum likelihood density estimate in that specified shape class. In fact, this notion even allows one to model know flat modal or antimodal regions in the true density. However, for experiments, estimating the boundary values to a satisfactory degree requires many observations (using the inbuilt optimization function fmincon). Hence in this paper we focus on developing the theory for densities which satisfy . The theory for densities without this assumption is almost identical and results in the same convergence rate, and is not presented. However, we have discussed the idea in more detail in Section and have also presented some simulated examples. For illustration we focus on densities that decay at the boundaries. Then we estimate the effective support from the data and set the estimate to be zero at the estiated boundaries of the support. In this regard, note that and can be any real number and hence the above methodology can be used to estimate densities with entire reals as support. Here and play the role of effective support on which the numerical estimation is performed.

i. Start with an M modal template function . Construct by setting . Divide the interval into intervals corresponding to the modes and antimodes. Let the location of the th critical point be , with , and . Set the value of for the first mode to be , that is, . Let the heights for the other critical points be the correct height ratios for for the true density . Represent this as . Obtain the values for the other points by interpolation.
ii. Let be the function . Then , which implies that . Then and for . Now,let

(2)

Then there exists a unique continuous function such that where with . The can be constructed as follows. Since is piecewise monotonous in the intervals , for , there exists an inverse in the interval . Then can be constructed piecewise by

(3)

iii. Then

Algorithm 1 Construction of the warping function given a true density and the correct critical point height ratios and the critical point locations
Figure 2: The true density(top left) and the constructed template(top right) is shown. The constructed is shown in bottom left panel and the (exact) reconstructed density using the is shown in the bottom right panel.

3 Estimation of the parameters

In practice, we have to estimate the critical point height ratios ’s and the warping function . We exploit the geometry of the set to estimate the desired element .

3.1 Finite-Dimensional Representation of Warping Functions

Solving an optimization problem, say maximum-likelihood estimation, over faces two main challenges. First, is a nonlinear manifold, and second, it is infinite-dimensional. We handle the nonlinearity by forming a map from to a tangent space of the unit Hilbert sphere (the tangent space is a vector space), and infinite dimensionality by selecting a finite-dimensional subspace of this tangent space. Together, these two steps are equivalent to finding a family of finite-dimensional submanifolds that can be flattened into vector spaces. This allows for a representation of using orthogonal basis. Once we have a finite-dimensional representation of , we can optimize over this representation of using the maximum-likelihood criterion.

Define a function , , as the square-root slope function (SRSF) of a . (For a discussion on SRSFs of general functions, please refer to Chapter 4 of srivastava2016functional). For any , its SRSF is an element of the nonnegative orthant of the unit Hilbert sphere, , denoted by . This is because . We have the nonnegative orthant because by definition, is a nonnegative function. The mapping between and is a bijection, with its inverse given by . The unit Hilbert sphere is a smooth manifold with known geometry under the Riemannian metric lang2012fundamentals. It is not a vector space but a manifold with a constant curvature, and can be easily flattened into a vector space locally. The chosen vector space is a tangent space of . A natural choice for reference, to select the tangent space, is the point , a constant function with value , which is the SRSF corresponding to The tangent space of at is an infinite-dimensional vector space given by: .

Next, we define a mapping that takes an arbitrary element of to this tangent space. For this retraction, we will use the inverse exponential map that takes to according to:

(4)

where is the arc-length from to .

We impose a natural Hilbert structure on using the standard inner product: . Further, we can select any orthogonal basis of the set to express its elements by their corresponding coefficients; that is, , where . The only restriction on the basis elements ’s is that they must be orthogonal to 1, that is, . In order to map points back from the tangent space to the Hilbert sphere, we use the exponential map, given by:

(5)

We define a composite map , as

(6)

Now, we define , as

(7)

This map allows us to express an element in terms of the coefficient vector . Note that is not exactly since the range of the exponential map is the entire Hilbert sphere, and not restricted to the nonnegative orthant. We can restrict the domain of to . Figure 3 illustrates the map pictorially.

For any , let denote the diffeomorphism . For any fixed , the set is a finite-dimensional submanifold of , on which we pose the estimation problem. As goes to infinity, converges to the set .

Figure 3: A graphic illustration of the finite dimensional representation of elements of in terms of the elements of the tangent space of the Hilbert sphere through the coefficients of the orthogonal basis representation of .

3.2 Estimation of the s and Implementation

We use a joint maximum likelihood method to estimate the height ratios s along with the optimal coefficients corresponding to the estimate of . Note that when , there is no parameter. For , there are

parameters. Among them, the odd indices

correspond to the antimodes, and the rest correspond to the modes. Let . In the setting described above, the maximum likelihood estimate of the underlying density, given the initial template function , is

, where and

(8)

4 Asymptotic Convergence Results

In this section, we derive the asymptotic convergence rate of the (maximum likelihood) density estimate described according to (8) in Section 3.2 to the true underlying density by using the theory of sieve maximum likelihood estimation as in wong1995probability. Let denote the set of -modal continuous densities on strictly positive in and zero at the boundaries.

  • Assumption 1: is continuous, strictly positive on , and .

  • Assumption 2: has modes which lie in .

  • Assumption 3: either belongs to Hölder or Sobolev space of order .

Let be the number of available observations. Let be a sequence of positive numbers converging to 0. Let be the of observed data points scaled to the unit interval. We call an estimator an sieve MLE if

In the proposed method, is defined such that is exactly . Therefore, is a sieve MLE with . Let denote norm between functions. The following theorem states the asymptotic convergence rate for the sieve MLE .

Theorem 2.

Let for some constant . If satisfies Assumptions 1, 2 and 3; and is the sieve MLE described according to (8) in Section LABEL:p0, then there exists constants and such that

(9)

We present the proof of Theorem 2 in Appendix A. The essential idea hinges on proving the equivalence of the density space obtained with the parameter space. That is, we show that if the estimated parameter is “close” to the true parameter corresponding to the true density in some sense, then the corresponding estimated density is also “close” to the true density. The statement is formally stated and proved in Lemma 1 in Appendix A. The general theory is then inspired by the convergence of sieve MLE estimators in wong1995probability.

5 Simulation study

For numerical implementation, we use Fourier basis for the tangent space representation and the MATLAB function fmincon for optimization. The objective function as described in (8) is not convex, and hence the inbuilt function fmincon is used. However fmincon can often get stuck in local suboptimal solutions and so we use the GlobalSearch toolbox along with fmincon to obtain better results. We start with just basis points for the tangent space representation and we gradually move towards more number of basis elements upto a predecided limit and choose the estimate based on the best AIC value. AIC was chosen as the penalty on the number of basis elements because experiments suggests that BIC overpenalizes the number of parameter terms which often caused the estimate to miss the sharper features of the true density.

For illustration, we consider sample sizes , and . To evaluate the average performances we generate samples (of sample size , and respectively) and evaluate the mean error and the standard deviation of the errors. For error function we have considered , and . As a first part of the experiment, we generate from three examples with the constraint that the number of modes is one. For comparison, we use the umd packge developed by Turnbull2014-ag. In Figure 4 we illustrate the best, median and worst performance out of the samples based on the loss function for sample size for the warped method(top row) and the umd package(bottom row). The examples are as below:

  1. - a symmetric unimodal example.

  2. - a skewed unimodal density with , assumed known as well.

  3. -An example of unimodal contaminated data.

For the symmetric unimodal example, the warped method captures the sharp peak better than the umd method. In the contaminated data example the umd solver gets stuck in a suboptimal solution in one isolated case. However,in the Beta density example the umd method performs better. The quantitative analysis is presented in Table 1. As a second part of the simulation study we provide two examples with number of modes constrained to be and respectively, -an asymmetric bimodal example, and -an asymmetric trimodal example with one mode well separated from the other two modes. In Figure 5 top row we illustrate the median, best and worst performance out of samples of size for the two examples. In Table 2 we present the quantitative performance analysis.

One important observation is that the proposed method has much higher computation cost compared to the competitors because of the GlobalSearch toolbox used. For the symmetric unimodal example, the numerical performance with or without using the GlobalSearch toolbox is very similar, and hence the performance is presented without using the GlobalSearch toolbox to illustrate the difference in computation cost. For all other examples, GlobalSearch toolbox is used.

Figure 4: The figure illustrates the true density(solid line) ;the estimated density with best performance(dashed line); the median performance(dotted line) and the worst performance(dashed-dotted line) according to norm. The panels correspond to the four simulated examples in order from top left to bottom right.
Example Method: Warped Estimate umd Estimate
Norm Mean std.dev Time Mean std.dev Time
Symmetric Unimodal 100 1.1933 0.3038 1.5753 0.2202
0.1898 0.0568 sec 0.2791 0.0138 sec
0.0755 0.0299 0.1243 0.0099
500 0.5746 0.1131 1.1948 0.1109
0.0953 0.0248 sec 0.2289 0.0050 sec
0.0409 0.0149 0.1109 0.0063
1000 0.4786 0.2905 1.1325 0.0629
0.0834 0.0642 sec 0.2238 0.0036 sec
0.0371 0.3376 0.1117 0.0052
Skewed Unimodal 100 19.4054 4.3991 14.0244 4.7563
2.9589 0.7715 sec 2.1081 0.7414 sec
0.8517 0.2914 0.5668 0.2074
500 12.9066 3.0470 7.6131 2.3679
1.9930 0.5267 sec 1.1735 0.3838 sec
0.5866 0.1832 0.3294 0.1141
1000 12.0474 2.5418 5.6584 1.4165
1.8592 0.4427 sec 0.8779 0.2370 sec
0.5485 0.1765 0.2485 0.0732
Contaminated Unimodal 100 3.0600 1.5574 6.6567 1.4372
0.4385 0.2258 sec 0.9532 0.2374 sec
0.1136 0.0628 0.1455 0.0538
500 1.2348 0.5206 3.4151 0.8655
0.1893 0.0879 sec 0.5106 0.1568 sec
0.0510 0.0268 0.1455 0.0538
1000 0.8319 0.3172 3.1453 0.8934
0.1247 0.0563 sec 0.4616 0.0879 sec
0.0363 0.1277 0.0502 0.0538
Table 1: A comparison of the performances of umd package and Warped estimate for simulated unimodal examples.
Example: Bimodal density Trimodal density
Norm Mean std.dev Time Mean std.dev Time
100 4.3429 1.2332 6.7299 1.6367
0.6575 0.2049 sec 0.9075 0.2344 sec
0.2089 0.0850 0.2419 0.0867
500 2.4727 0.5755 3.4841 1.2778
0.3839 0.1103 sec 0.4816 0.1737 sec
0.1337 0.0502 0.1351 0.0538
1000 1.9942 0.5042 3.0489 1.7033
0.3100 0.0999 sec 0.4330 0.2353 sec
0.1095 0.0444 0.1246 0.0648
Table 2: A quantitative analysis of the performance of Warped Estimate for simulated bimodal and trimodal dataset.
EpisplineDensity estimate Warped Estimate
Norm Mean std.dev Time Mean std.dev Time
500 8.7334 2.1415 5.9202 2.8516
1.3269 0.5033 sec 0.7079 0.3346 sec
0.6167 0.4461 0.1538 0.0758
Table 3: A quantitative analysis of the performance of Warped Estimate for simulated bimodal and trimodal dataset.
Figure 5: The figure illustrates the true density(solid line) ;the estimated density with best performance(dashed line); the median performance(dotted line) and the worst performance(dashed-dotted line) according to norm. The panels correspond to the four simulated examples in order from top left to bottom right.

6 Extension to more general shapes

Upto this point we have restricted ourselves to density estimates which are zero at the boundary even though the true density might not be exactly zero. Also the estimation has inherently assumed that the modes lie in the interior of the support and not on the boundary. As indicated in the simulation studies, the method has very good numerical performance for densities which decay at the boundaries. However, the proposed framework allows a easy extention to densities which may have (1) modes located at the boundaries, (2) compact support with significantly large value at the boundaries, by simply considering the height ratios at the boundaries as extra parameters. Essentially this extension requires knowledge of the exact sequence of the modes and antimodes in order to construct the correct function template and the correct constraints for the parameters . (Note that previously we had indexed the template by and we fixed the boundary values of to be ). For example, for an -shaped density, we need the knowledge that the function is initially increasing,then decreasing and finally increasing, and hence we can create an shaped template. Once the template is constructed the rest of the procedure remains the same. Another special example are monotone densities, where the mode is at one of the boundaries. In such a scenario, one can construct the template by setting the modal value of to be and estimate the other boundary value with appropriate constraint. The bottom row of Figure 5 considers an example of a monotonically decreasing density, a truncated to . As a comparison we have used the episplineDensity package and have considered samples of sample size . The bottom left panel of Figure 5 shows the best, median and worst performance out of the samples for the warped estimate. The right panel shows the same for the episplinedensity estimate. The performance of the warped estimate is better overall, and especially at the left boundary. Table 3 presents the quantitative comparison of the performances.

Finally suppose a density has a flat spot at a modal (or antimodal) location. This indicates that the modes are not well defined but is actually an interval. The framework theoretically accomodates such an information by simply adding a flat spot in the template function at the desired location. Thus, we can extend the idea of “shape” of a continuous density function to be identified as an ordered sequence of increasing, decreasing or flat pieces that form the entire density function. For example, a simple bimodal density function can be identified by the sequence increasing-decreasing-increasing-decreasing. A function with a unique modal interval can be described as increasing-flat-decreasing. If this sequence is known, then simply constructing a template with the same sequence allows us to provide a maximum likelihood density estimate within the class of densities satisfying that shape sequence.

Figure 6: The figure illustrates the true density(solid line) and the estimated density with median performance(dotted line); the best performance (dashed line) and the worst performance(dashed dotted line) for two examples with estimates unconstrained at the boundaries.

In practice, we have used MATLAB function fmincon for optimization purposes. However, estimating the correct height at the boundaries takes a large sample size using the fmincon implementation to achieve a satisfactory and stable performance. Figure 6 shows two examples,

  1. and zero otherwise - A density function with a flat modal region.

  2. - A bimodal density function truncated to .

The left panel of Figure 6 shows the best, median and worst performance out of samples of size from the density with flat spot. The right panel shows the same from sample size for the truncated bimodal density.

7 Extension to conditional density estimation

The proposed framework for modality constrained density estimation extends naturally to modality constrained conditional density estimation setups. Consider the following setup: Let

be a fixed one-dimensional random variable with a positive density on its support. Let

, where is the unknown conditional density that changes smoothly with ; is the unknown mean function, assumed to be smooth; and,

is the unknown variance, which may or may not depend on

. Conditioned on , is assumed to have a univariate, continuous distribution with support on interval , has a known modes in the interior of , and . We observe the pairs , and are interested in recovering the conditional density at a particular location of , henceforth referred to as . The estimation is again initialized with an modal template function . However, since varies smoothly with , we assign more importance to observations closer to the location than observations that are further away, and hence, we perform weighted maximum likelihood function to estimate the necessary parameters.

(10)

where is the localized weight associated with the th observation, calculated according to:

where is the standard normal pdf and is the parameter that controls the relative weights associated with the observations. However, weights defined in this way results in higher bias because information is being borrowed from all observations. As discussed in an example in bashtannyk2001bandwidth, we allow only a specified fraction of the observations to have a positive weight. However, using too small a fraction will result in unstable estimates and poor practical performance because the effective sample size will be too small. Hence we advocate using the nearest of the observations (nearest to the target location) for borrowing information and then calculating the weights for this smaller sample as defined before. The parameter is akin to the bandwidth parameter associated with traditional kernel methods for density estimation, for the predictors . A very large value of distributes approximately equal weight to all the observations, whereas a very small value considers only the observations in a neighborhood around . The parameter can be chosen via any standard cross validation based bandwidth selection method, for practical purposes. For our purposes we use an adaptive bandwidth selection method to save computation time, when the predictors are independent of each other:

The parameter is chosen according to the location using a two-step procedure:

  1. Compute a standard kernel density estimate

    of the predictor space using a fixed bandwidth chosen according to any standard criterion. For our purposes, we simply used the ksdensity estimate inbuilt in MATLAB which chooses the bandwidth optimal for normal densities. Let be the fixed bandwidth used.

  2. Then, set the bandwidth parameter at location to be .

The intuition is that controls the overall smoothing of the predictor space based on the sample points, and the stretches or shrinks the bandwidth at the particular location. At a sparse region, increased borrowing of information from the other data points is desirable in order to reduce the variance of the estimate, whereas in dense regions a reduced borrowing of information from far away points reduces the bias of the density estimates. A location from a sparse region is expected to have a low density estimate, and a location from a dense region is expected to have a high density estimate. Hence, varying the bandwidth parameter inversely with the density estimate helps adapt to the sparsity around the point of interest. The choice of the adaptive bandwidth parameter is motivated from the variable bandwidth kernel density estimators discussed in terrell1992variable, van2003adaptive and abramson1982bandwidth, among others.

As illustrative examples we consider two setups: (1), a unimodal conditional density and (2) , a bimodal conditional density. In both cases we study samples of size and and compute the conditional density at the th, th and th quantile of the predictor support. Figure 7 illustrates the best, worst and the median performance among the samples in each scenario. For sample size (first and third row), the performance is slightly unstable and the worst performances often has a bias and is wiggly in nature. Naturally for larger sample size (second and fourth row), the results are much more stable. Also noteworthy is the more pronounced bias for the conditional densities evaluated at the th and th quantiles because of borrowing of information via weighted likelihood estimation. However, the bias is almost absent for sample size . The quantitative performance based on average loss functions is presented in Table 4.

Figure 7: The figure illustrates the true density(solid line) and the estimated density with median performance(dotted line); the best performance (dashed line) and the worst performance(dashed dotted line) at three different locations in the support of the predictors.
Example Location: th quantile th quantile th quantile
Norm Mean std.dev Mean std.dev Mean std.dev
Unimodal cde 100 7.9623 2.0550 6.9829 1.9716 8.2243 2.4475
1.1658 0.2539 1.0132 0.2388 1.1884 0.2876
0.4056 0.0570 0.3586 0.0589 0.4094 0.0595
1000 5.1280 0.7392 4.1239 0.6308 5.2136 0.7194
0.9271 0.0929 0.7537 0.0812 0.9297 0.0846
0.3977 0.0275 0.3494 0.0256 0.3966 0.0239
Bimodal cde 100 8.3386 1.5436 7.0026 1.2024 7.8851 1.4847
0.9983 0.1802 0.8374 0.1307 0.9478 0.1695
0.2044 0.0384 0.1773 0.0349 0.2015 0.0430
1000 5.8890 0.6466 4.9654 0.9002 5.9918 0.6902
0.7201 0.0756 0.6111 0.1001 0.7285 0.0766
0.1574 0.0205 0.1406 0.0255 0.1561 0.0180
Table 4: A quantitative evaluation of the performance of Warped estimate for two simulated conditional density examples.

8 Application to speedflow data

As an application of modality constrained conditional density estimation, we use the speed flow data for Californian driveways from the package hdrcde in R. The scatterplot shown in Figure 8 shows the distinct bimodal nature of the speed distribution for traffic flow between and vehicles per lane per hour, corresponding to uncongested and congested traffic. This range of traffic flow where a bimodal nature is apparent is already studied in einbeck2006modelling. They study that beyond traffic flow of the regression curves corresponding to uncongested and congested traffic are no longer distinguishable. So, we consider the speed flow in that range ( observations) and compute the conditional density of the speed with bimodality constraint on the shape, given flow using our prescribed of the observations. The middle panel of Figure 8 (solid line) shows the conditional density estimate for flow using the proposed approach. The left mode is mph and the right mode is . einbeck2006modelling also obtains a very similar conditional density estimate. The left mode in their case is at mph and the right mode is at . On the other hand if we carry out a traditional conditional density estimation using NP package, we see several spurious bumps as shown in the middle panel of Figure 8 (dotted line) and zoomed in on the right panel. The bumpy nature is present in the NP estimate constructed using observations (not presented) as well as only using of the observations as in our approach. This results in over-interpreting the tail and consequently a lack of interpretability for the modes themselves. Thus constraining the number of modes clearly helps lending interpretability to the resultant density shape.

Figure 8: The figure illustrates the scatterplot(top left) and the estimated density of traffic speed at traffic flow (top right)for warped method (solid) and NP package(dotted).

9 Discussion

Density estimation and shape constrained density estimation are very rich topics of research in Statistics. The current paper focuses on introducing a novel framework using geometric tools which enables one to perform shape constrained density estimation with a broader notion of shapes than before. Specifically, exploiting the geometry of the group of diffeomorphisms, one can shift the problem of finding a density with the appropriate shape constraints to finding an appropriate diffeomorphism given an initial shape, based on available data. In recent years, most datasets on a variable of interest have associated covariates which make the problem of conditional density estimation very useful and practically relevant. An advantage of the proposed framework is the easy extendibility to the conditional density estimation problem via a weighted maximum likelihood objective function. Theoretically, the framework introduced is the first that can perform any modality constrained density estimation. However practically the performance suffers when the constrained shape is too complicated or if the number of modes is very high (greater than ) because the inbuilt solver fmincon gets stuck in local suboptimal solutions resulting in unsatisfactory density estimates.

Since the paper primarily focuses on introducing the framework and the group action that enables shape constrained estimation, it has only lightly touched upon or not explored many associated problems of density estimation. For example, the choice of the number of basis elements for tangent space representation, the choice of the basis set itself, or the choice of penalty for penalized estimation and boundary estimation are very rich and important problems themselves in their own right. This paper simply uses AIC as the penalty to select the number of basis elements because in comparison, BIC tends to choose insufficient number of parameters. Also, experiments using a Meyer basis set for the tangent space representation of the diffeomorphisms yielded similar results, though the Meyer wavelets seemed to require more observations than Fourier basis set to obtain satisfactory results. Keeping in mind that the basis set representation is for approximating the warping functions and not the density functions directly, one can choose different basis sets for a comparative study of performances. The paper follows Turnbull2014-ag for choosing the boundaries.

For conditional density estimation, the weights defined as gaussian kernel can also be defined using any other kernel. The choice of gaussian kernel (and the loss function) was as an illustration. A possible extension not explored in the paper is to develop the framework in situations where multiple or very high number of covariates are present. Currently the bandwidth parameter is chosen adaptively based on a kernel density estimate at the location of the (scalar) covariate. It can be directly extended to covariate scenario using a

variate kernel density estimate at the location of the predictors. However, such an estimate suffers from the curse of dimensionality. In applications where only a few of the covariates are relevant to the response variable,

wasserman2006rodeo developed a technique to identify the relevant variables and also obtain the corresponding bandwidth parameters. Using the obtained bandwidth parameters, one can redefine the weights and perform weighted likelihood maximization to produce a conditional density estimate.

Appendix A Proof of Theorem 2

First we set some notations and some preliminary definitions. is always used to represent the number of modes. Let denote the -modal template defined earlier as a function of . Here denotes the vector , corresponding to the height ratios of the last critical points with respect to the first critical point. Let be the number of basis elements used for approximating the warping function . Let be the corresponding coefficient vector. Now, define . In what follows, is used to represent the coefficient vectors. denotes the th basis element for the tangent space representation of warping functions. is used to represent the warping function corresponding to the coefficient vector . represent specific constants. represent generic constants that can change values from step to step but are otherwise independent of other terms.

Let be the height ratio vector for , as defined in Section 2. Then from Theorem 1 there exists an infinite dimensional such that can be represented as

Note that for each , . This corresponds to