Nonparametric modal regression

by   Yen-Chi Chen, et al.
Carnegie Mellon University

Modal regression estimates the local modes of the distribution of Y given X=x, instead of the mean, as in the usual regression sense, and can hence reveal important structure missed by usual regression methods. We study a simple nonparametric method for modal regression, based on a kernel density estimate (KDE) of the joint distribution of Y and X. We derive asymptotic error bounds for this method, and propose techniques for constructing confidence sets and prediction sets. The latter is used to select the smoothing bandwidth of the underlying KDE. The idea behind modal regression is connected to many others, such as mixture regression and density ridge estimation, and we discuss these ties as well.


Modal clustering asymptotics with applications to bandwidth selection

Density-based clustering relies on the idea of linking groups to some sp...

Modal-set estimation with an application to clustering

We present a first procedure that can estimate -- with statistical consi...

An implicit function learning approach for parametric modal regression

For multi-valued functions—such as when the conditional distribution on ...

Modal clustering of matrix-variate data

The nonparametric formulation of density-based clustering, known as moda...

Asymptotic Confidence Regions for Density Ridges

We develop large sample theory including nonparametric confidence region...

Optimal Sampling Density for Nonparametric Regression

We propose a novel active learning strategy for regression, which is mod...

Kernel Selection for Modal Linear Regression: Optimal Kernel and IRLS Algorithm

Modal linear regression (MLR) is a method for obtaining a conditional mo...

Code Repositories

1 Introduction

Modal regression (Sager and Thisted, 1982; Lee, 1989; Yao et al., 2012; Yao and Li, 2014)

is an alternate approach to the usual regression methods for exploring the relationship between a response variable

and a predictor variable

. Unlike conventional regression, which is based on the conditional mean of given , modal regression estimates conditional modes of given .

Why would we ever use modal regression in favor a conventional regression method? The answer, at a high-level, is that conditional modes can reveal structure that is missed by the conditional mean. Figure 1 gives a definitive illustration of this point: we can see that, for the data examples in question, the conditional mean both fails to capture the major trends present in the response, and produces unecessarily wide prediction bands. Modal regression is an improvement in both of these regards (better trend estimation, and narrower prediction bands). In this paper, we rigorously study and develop its properties.

Figure 1:

Examples of modal regression versus a common nonparametric regression estimator, local linear regression. In the top row, we show local regression estimate and its associated 95% prediction bands alongside the modal regression and its 95% prediction bands. The bottom row does the same for a different data example. The local regression method fails to capture the structure, and produces prediction bands that are too wide.

Modal regression has been used in transportation (Einbeck and Tutz, 2006), astronomy (Rojas, 2005), meteorology (Hyndman et al., 1996) and economics (Huang and Yao, 2012; Huang et al., 2013). Formally, the conditional (or local) mode set at is defined as


where is the conditional density of given . As a simplification, the set can be expressed in terms of the joint density alone:


At each , the local mode set may consist of several points, and so is in general a multi-valued function. Under appropriate conditions, as we will show, these modes change smoothly as changes. Thus, local modes behave like a collection of surfaces which we call modal manifolds.

We focus on a nonparametric estimate of the conditional mode set, derived from a kernel density estimator (KDE):


where is the joint KDE of . Scott (1992) proposed this plug-in estimator for modal regression, and Einbeck and Tutz (2006) proposed a fast algorithm. We extend the work of these authors by giving a thorough treatment and analysis of nonparametric modal regression. In particular, our contributions are as follows.

  1. We study the geometric properties of modal regression.

  2. We prove consistency of the nonparametric modal regression estimator, and furthermore derive explicit convergence rates, with respect to various error metrics.

  3. We propose a method for constructing confidence sets, using the bootstrap, and prove that it has proper asymptotic coverage.

  4. We propose a method for constructing prediction sets, based on plug-in methods, and prove that the population prediction sets from this method can be smaller than those based on the regression function.

  5. We propose a rule for selecting the smoothing bandwidth of the KDE based on minimizing the size of prediction sets.

  6. We draw enlightening comparisons to mixture regression (which suggests a clustering method using modal regression) and to density ridge estimation.

We begin by reviewing basic properties of modal regression and recalling previous work, in Section 2. Sections 3 through 8 then follow roughly the topics described in items 1–6 above. In Section 9 we end with some discussion. Simple R code for the modal regression and some simulation data sets used in this paper can be found at

2 Review of modal regression

Consider a response variable and covariate or predictor variable , where is a compact set. A classic take on modal regression (Sager and Thisted, 1982; Lee, 1989; Yao and Li, 2014) is to assume a linear model

where , are unknown coefficients, and denotes the (global) mode of given . Nonparametric modal regression is more flexible, because it allows to be multi-valued, and also it models the components of as smooth functions of (not necessarily linear). As another nonlinear generalization of the above model, Yao et al. (2012) propose an interesting local polynomial smoothing method for mode estimation; however, they focus on the global mode rather than , the collection of all conditional modes.

The estimated local mode set in (3) from Scott (1992) relies on an estimated joint density function , most commonly computed using a KDE. Let be the observed data samples. Then the KDE of the joint density is


Here is a symmetric, smooth kernel function, such as the Gaussian kernel (i.e., ), and is the smoothing bandwidth. For simplicity, we have used the same kernel function and bandwidth for the covariates and the response, but this is not necessary. For brevity, we will write the estimated modal set as


where the subscript notation denotes partial derivatives, as in and .

In general, calculating can be challenging, but for special kernels, Einbeck and Tutz (2006) proposed a simple and efficient algorithm for computing local mode estimates, based on the mean-shift algorithm (Cheng, 1995; Comaniciu and Meer, 2002). A related approach can be found in Yao (2013), where the authors consider a mode hunting algorithm based on the EM algorithm. For a discussion of how the mean-shift and EM algorithms relate, see Carreira-Perpiñán (2007). In general, mean-shift algorithms can be derived for any KDEs with radially symmetric kernels (Comaniciu and Meer, 2002), but for simplicity we assume here that is Gaussian. The partial mean-shift algorithm of Einbeck and Tutz (2006), to estimate conditional modes, is described in Algorithm 1.

Input: Data samples , bandwidth . (The kernel is chosen to be Gaussian.)
  1. Initialize mesh points (a common choice is ).

  2. For each , fix , and update using the following iterations until convergence:

Output: The set , containing the points , where is a predictor value as fixed in , and is the corresponding limit of the mean-shift iterations (6).
Algorithm 1 Partial mean-shift

A straightforward calculation shows that the mean-shift update (6) is indeed a gradient ascent update on the function (for fixed ), with an implicit choice of step size. Because this function is generically nonconcave, we are not guaranteed that gradient ascent will actually attain a (global) maximum, but it will converge to critical points under small enough step sizes (Arias-Castro et al., 2013).

3 Geometric properties

In this section, we study the geometric properties of modal regression. Recall that is a set of points at each input . We define the modal manifold collection as the union of these sets over all inputs ,


By the implicit function theorem, the set has dimension ; see Figure 2 for an illustration with (univariate ).

Figure 2: Examples of modal manifolds.

We will assume that the modal manifold collection can be factorized as


where each is a connected manifold that admits a parametrization


for some function and open set . For instance, in Figure 2, each is a connected curve. Note that form an open cover for the support of . We call the th modal manifold, and the th modal function. By convention we let if , and therefore we may write


That is, at any , the values among that are nonempty give local modes.

Under weak assumptions, each is differentiable, and so is the modal set , in a sense. We discuss this next.

Lemma 1 (Derivative of modal functions).

Assume that is twice differentiable, and let be the modal manifold collection. Assume that factorizes according to (7), (8). Then, when ,


where is the gradient over of .

Proof. Since we assume that , we have by definition. Taking a gradient over yields

After rearrangement,

Lemma 1 links the geometry of the joint density function to the smoothness of the modal functions (and modal manifolds). The formula (11) is well-defined as long as is nonzero, which is guaranteed by the definition of local modes. Thus, when is smooth, each modal manifold is also smooth.

To characterize smoothness of itself, we require a notion for smoothness over sets. For this, we recall the Hausdorff distance between two sets , defined as

where with .

Theorem 2 (Smoothness of the modal manifold collection).

Assume the conditions of Lemma 1. Assume furthermore all partial derivatives of are bounded by , and there exists such that for all and . Then


The proof of this result follows directly from Lemma 1 and the definition of Hausdorff distance, so we omit it. Since is a multi-valued function, classic notions of smoothness cannot be applied, and Theorem 2 describes its smoothness in terms of Hausdorff distance. This distance can be thought of as a generalized distance for sets, and Theorem 2 can be interpreted as a statement about Lipschitz continuity with respect to Hausdorff distance.

Modal manifolds can merge or bifurcate as varies. Interestingly, though, the merges or bifurcations do not necessarily occur at points of contact between two modal manifolds. See Figure 3 for an example with . Shown is a modal curve (manifold with ), starting at and stopping at about , which leaves a gap between itself and the neighboring modal curve. We take a closer look at the joint density contours, in panel (c), and inspect the conditional density along four slices , in panel (d). We see that after , the conditional density becomes unimodal and the first (left) mode disappears, as it slides into a saddle point.

A remark about the uniqueness of the modal manifold collection in (8): this factorization is unique if the second derivative is uniformly bounded away from zero. This will later become one of our assumptions (assumption (A3)) in the theoretical analysis of Section 4. Note that in the left panel of Figure 2, the collection is uniquely defined, while this is not the case in the right panel (at the points of intersection between curves, the density has vanishing second derivatives with respect to ).

(a) Modal regression
(b) Joint density contour
(c) Zoomed-in density contour
(d) Conditional density given
Figure 3: A look at bifurcation. As moves to , we can see that a local mode disappears after .

Lastly, the population quantities defined above all have sample analogs. For the estimate , we define


where each is a connected manifold, and is the total number. We also define in a similar way for . Thus, we can write


In practice, determining the manifold memberships and the total number of manifolds is not trivial. In principle, the sample manifolds are well-defined in terms of the sample estimate ; but even with a perfectly convergent mean-shift algorithm, we would need to run mean-shift iterations at every input in the domain to determine these manifold components. Clearly this is not an implementable strategy. Thus from the output of the mean-shift algorithm over a finite mesh, we usually employ some type of simple post-processing technique to determine connectivity of the outputs and hence the sample manifolds. This is discussed further in Section 7.

4 Asymptotic error analysis

In this section we present asymptotic results about the convergence of the estimated modal regression set to the underlying modal set . Let denote the collection of times continuously differentiable functions with all partial derivatives bounded in absolute value by . (The domain of these functions should be clear from the context.) Given a kernel function , denote the collection of functions

where denotes the -th order derivative of .

Our assumptions are as follows.

  • The joint density for some .

  • The collection of modal manifolds can be factorized into where each is a connected curve that admits a parametrization for some , and form an open cover for the support of .

  • There exists such that for any with , .

  • The kernel function and satifies

    for .

  • The collection is a VC-type class, i.e. there exists such that for ,

    where is the -covering number for a semi-metric space and

    is any probability measure.

Assumption (A1) is an ordinary smoothness condition; we need fourth derivatives since we need to bound the bias of second derivatives. The assumption (A2) is to make sure the collection of all local modes can be represented as finite collection of manifolds. (A3) is a sharpness requirement for all critical points (local modes and minimums); and it excludes the case that the modal manifolds bifurcate or merge, i.e., it excludes cases such as the right panel of Figure 2. Similar conditions appear in Romano (1988); Chen et al. (2014b)

for estimating density modes. Assumption (K1) is assumed for the kernel density estimator to have the usual rates for its bias and variance. (K2) is for the uniform bounds on the kernel density estimator; this condition can be found in

Giné and Guillou (2002); Einmahl et al. (2005); Chen et al. (2014c). We study three types of error metrics for regression modes: pointwise, uniform, and mean integrated squared errors. We defer all proofs to LABEL:suppA.

First we study the pointwise case. Recall that is the KDE in (4) of the joint density based on samples, under the kernel , and is the estimated modal regression set in (5) at a point . Our pointwise analysis considers

the Hausdorff distance between and , at a point . For our first result, we define the quantities:

Theorem 3 (Pointwise error rate).

Assume (A1-3) and (K1-2). Define a stochastic process by

Then, when

is sufficiently small, we have

Moreover, at any fixed , when and ,

The proof is in the supplementary material (Chen et al., 2015). This shows that if the curvature of the joint density function along is bounded away from , then the error can be approximated by the error of after scaling. The rate of convergence follows from the fact that is converging to at the same rate. Note that as is a conditional mode, the partial derivative of the true density is . We defined as above since implies , so that the ratio would be ill-defined if . Also, the constraints on in the second assertion ( and ) are to ensure .

For our next result, we define the uniform error

This is an type error for estimating regression modes (and is also closely linked to confidence sets; see Section 5).

Theorem 4 (Uniform error rate).

Assume (A1-3) and (K1-2). Then as and ,

The proof is in the supplementary material (Chen et al., 2015). Compared to the pointwise error rate in Theorem 3, we have an additional factor in the second term. One can view this as the price we need to pay for getting an uniform bound over all points. See Giné and Guillou (2002); Einmahl et al. (2005) for similar findings in density estimation.

The last error metric we consider is the mean integrated squared error (MISE), defined as

Note that the MISE is a nonrandom quantity, unlike first two error metrics considered.

Theorem 5 (MISE rate).

Assume (A1-3) and (K1-2). Then as and ,

The proof is in the supplementary material (Chen et al., 2015). If we instead focus on estimating the regression modes of the smoothed joint density , then we obtain much faster convergence rates. Let be the smoothed regression modes at . Analogously define

Corollary 6 (Error rates for smoothed conditional modes).

Assume (A1-3) and (K1-2). Then as and ,

where .

5 Confidence sets

In an idealized setting, we could define a confidence set at by


By construction, we have . Of course, the distribution of is unknown, but we can use the bootstrap (Efron, 1979) to estimate .

Given the observed data samples , we denote a bootstrap sample as . Let be the estimated regression modes based on this bootstrap sample, and

We repeat the bootstrap sampling times to get . Define by

Our confidence set for is then given by

Note that this is a pointwise confidence set, at .

Alternatively, we can use to build a uniform confidence set. Define by

As above, we can use bootstrap sampling to form an estimate

, based on the quantiles of the bootstrapped uniform error metric

Our uniform confidence set is then

In practice, there are many possible flavors of the bootstrap that are applicable here. This includes the ordinary (nonparametric) bootstrap, the smoothed bootstrap and the residual bootstrap. See Figure 4 for an example with the ordinary bootstrap.

We focus on the asymptotic coverage of uniform confidence sets built with the ordinary bootstrap. We consider coverage of the smoothed regression mode set (to avoid issues of bias), and we employ tools developed in Chernozhukov et al. (2014a, b); Chen et al. (2014c).

Figure 4: An example with pointwise (left) and uniform (right) confidence sets. The significance level is 90%.

Consider a function space defined as


and let be a Gaussian process defined on such that


for all .

Theorem 7.

Assume (A1-3) and (K1-2). Define the random variable

. Then as , ,

The proof is in the supplementary material (Chen et al., 2015). This theorem shows that the smoothed uniform discrepancy is distributed asymptotically as the supremum of a Gaussian process. In fact, it can be shown that the two random variables and are coupled by

Now we turn to the limiting behavior for the bootstrap estimate. Let be the observed data and denote the bootstrap estimate by

where is the the bootstrap regression mode set at .

Theorem 8 (Bootstrap consistency).

Assume conditions (A1-3) and (K1-2). Also assume that . Define . There exists such that and, for all ,

The proof is in the supplementary material (Chen et al., 2015). Theorem 8 shows that the limiting distribution for the bootstrap estimate is the same as the limiting distribution of (recall Theorem 7) with high probability. Using Theorems 7 and 8, we conclude the following.

Corollary 9 (Uniform confidence sets).

Assume conditions (A1-3) and (K1-2). Then as and ,

6 Prediction sets

Modal regression can be also used to construct prediction sets. Define

Recall that for a point and a set . Then

are pointwise and uniform prediction sets, respectively, at the population level, because

At the sample level, we use a KDE of the conditional density , and estimate via

An estimated pointwise prediction set is then

This has the proper pointwise coverage with respect to samples drawn according to , so in an asymptotic regime in which , it will have the correct coverage with respect to the population distribution, as well.

Similarly, we can define


the quantile of , , and then the estimated uniform prediction set is


The estimated uniform prediction set has proper coverage with respect to the empirical distribution, and so certain conditions, it will have valid limiting population coverage.

6.1 Bandwidth selection

Prediction sets can be used to select the smoothing bandwidth of the underlying KDE, as we describe here. We focus on uniform prediction sets, and we will use a subscript throughout to denote the dependence on the smoothing bandwidth. From its definition in (18), we can see that the volume (Lebesgue measure) of the estimated uniform prediction set is

where is the number of estimated local modes at , and is as defined in (17). Roughly speaking, when is small, is also small, but the number of estimated manifolds is large; on the other hand, when is large, is large, but the number of estimated manifolds is small. This is like the bias-variance trade-off: small corresponds to less bias () but higher variance (number of estimated manifolds).

Our proposal is to select by

Figure 5 gives an example this rule when , i.e., when minimizing the size of the estimated 95% uniform prediction set. Here we actually use cross-validation to obtain the size of the prediction set; namely, we use the training set to estimate the modal manifolds and then use the validation set to estimate the width of prediction set. This helps us to avoid overfitting. As can be seen, there is a clear trade-off in the size of the prediction set versus in the left plot. The optimal value is marked by a vertical line, and the right plot displays the corresponding modal regression estimate and uniform prediction set on the data samples.

In the same plot, we also display a local regression estimate and its corresponding 95% uniform prediction set. We can see that the prediction set from the local regression method is much larger than that from modal regression. (To even the comparison, the bandwidth for the local linear smoother was also chosen to minimize the size of the prediction set.) This illustrates a major strength of the modal regression method: because it is not constrained to modeling conditional mean structure, it can produce smaller prediction sets than the usual regression methods when the conditional mean fails to capture the main structure in the data. We investigate this claim theoretically, next.

Figure 5: An example of bandwidth selection based on the size of the prediction sets.

6.2 Theory on the size of prediction sets

We will show that, at the population level, prediction sets from modal regression can be smaller than those based on the underlying regression function . Defining

pointwise and uniform prediction sets based on the regression function are


For a pointwise prediction set , we write for its Lebesgue measure on ; note that in the case of modal regression, this is somewhat of an abuse of notation because the Lebesgue measure of can be a sum of interval lengths. For a uniform prediction set , we write for its Lebesgue measure on .

We consider the following assumption.

  • The conditional density satisfies

    with by convention, and denoting the Gaussian density function with mean and variance .

The assumption that the conditional density can be written as a mixture of Gaussians is only used for the next result. It is important to note that this is an assumption made about the population density, and does not reflect modeling choices made in the sample. Indeed, recall, we are comparing prediction sets based on the modal set and the regression function , both of which use true population information.

Before stating the result, we must define several quantities. Define the minimal separation between mixture centers


Also define