Modal regression estimates the local modes of the distribution of Y given X=x, instead of the mean, as in the usual regression sense, and can hence reveal important structure missed by usual regression methods. We study a simple nonparametric method for modal regression, based on a kernel density estimate (KDE) of the joint distribution of Y and X. We derive asymptotic error bounds for this method, and propose techniques for constructing confidence sets and prediction sets. The latter is used to select the smoothing bandwidth of the underlying KDE. The idea behind modal regression is connected to many others, such as mixture regression and density ridge estimation, and we discuss these ties as well.READ FULL TEXT VIEW PDF
is an alternate approach to the usual regression methods for exploring the relationship between a response variable
and a predictor variable. Unlike conventional regression, which is based on the conditional mean of given , modal regression estimates conditional modes of given .
Why would we ever use modal regression in favor a conventional regression method? The answer, at a high-level, is that conditional modes can reveal structure that is missed by the conditional mean. Figure 1 gives a definitive illustration of this point: we can see that, for the data examples in question, the conditional mean both fails to capture the major trends present in the response, and produces unecessarily wide prediction bands. Modal regression is an improvement in both of these regards (better trend estimation, and narrower prediction bands). In this paper, we rigorously study and develop its properties.
Examples of modal regression versus a common nonparametric regression estimator, local linear regression. In the top row, we show local regression estimate and its associated 95% prediction bands alongside the modal regression and its 95% prediction bands. The bottom row does the same for a different data example. The local regression method fails to capture the structure, and produces prediction bands that are too wide.
Modal regression has been used in transportation (Einbeck and Tutz, 2006), astronomy (Rojas, 2005), meteorology (Hyndman et al., 1996) and economics (Huang and Yao, 2012; Huang et al., 2013). Formally, the conditional (or local) mode set at is defined as
where is the conditional density of given . As a simplification, the set can be expressed in terms of the joint density alone:
At each , the local mode set may consist of several points, and so is in general a multi-valued function. Under appropriate conditions, as we will show, these modes change smoothly as changes. Thus, local modes behave like a collection of surfaces which we call modal manifolds.
We focus on a nonparametric estimate of the conditional mode set, derived from a kernel density estimator (KDE):
where is the joint KDE of . Scott (1992) proposed this plug-in estimator for modal regression, and Einbeck and Tutz (2006) proposed a fast algorithm. We extend the work of these authors by giving a thorough treatment and analysis of nonparametric modal regression. In particular, our contributions are as follows.
We study the geometric properties of modal regression.
We prove consistency of the nonparametric modal regression estimator, and furthermore derive explicit convergence rates, with respect to various error metrics.
We propose a method for constructing confidence sets, using the bootstrap, and prove that it has proper asymptotic coverage.
We propose a method for constructing prediction sets, based on plug-in methods, and prove that the population prediction sets from this method can be smaller than those based on the regression function.
We propose a rule for selecting the smoothing bandwidth of the KDE based on minimizing the size of prediction sets.
We draw enlightening comparisons to mixture regression (which suggests a clustering method using modal regression) and to density ridge estimation.
We begin by reviewing basic properties of modal regression and recalling previous work, in Section 2. Sections 3 through 8 then follow roughly the topics described in items 1–6 above. In Section 9 we end with some discussion. Simple R code for the modal regression and some simulation data sets used in this paper can be found at http://www.stat.cmu.edu/~yenchic/ModalRegression.zip.
Consider a response variable and covariate or predictor variable , where is a compact set. A classic take on modal regression (Sager and Thisted, 1982; Lee, 1989; Yao and Li, 2014) is to assume a linear model
where , are unknown coefficients, and denotes the (global) mode of given . Nonparametric modal regression is more flexible, because it allows to be multi-valued, and also it models the components of as smooth functions of (not necessarily linear). As another nonlinear generalization of the above model, Yao et al. (2012) propose an interesting local polynomial smoothing method for mode estimation; however, they focus on the global mode rather than , the collection of all conditional modes.
The estimated local mode set in (3) from Scott (1992) relies on an estimated joint density function , most commonly computed using a KDE. Let be the observed data samples. Then the KDE of the joint density is
Here is a symmetric, smooth kernel function, such as the Gaussian kernel (i.e., ), and is the smoothing bandwidth. For simplicity, we have used the same kernel function and bandwidth for the covariates and the response, but this is not necessary. For brevity, we will write the estimated modal set as
where the subscript notation denotes partial derivatives, as in and .
In general, calculating can be challenging, but for special kernels, Einbeck and Tutz (2006) proposed a simple and efficient algorithm for computing local mode estimates, based on the mean-shift algorithm (Cheng, 1995; Comaniciu and Meer, 2002). A related approach can be found in Yao (2013), where the authors consider a mode hunting algorithm based on the EM algorithm. For a discussion of how the mean-shift and EM algorithms relate, see Carreira-Perpiñán (2007). In general, mean-shift algorithms can be derived for any KDEs with radially symmetric kernels (Comaniciu and Meer, 2002), but for simplicity we assume here that is Gaussian. The partial mean-shift algorithm of Einbeck and Tutz (2006), to estimate conditional modes, is described in Algorithm 1.
Initialize mesh points (a common choice is ).
For each , fix , and update using the following iterations until convergence:
A straightforward calculation shows that the mean-shift update (6) is indeed a gradient ascent update on the function (for fixed ), with an implicit choice of step size. Because this function is generically nonconcave, we are not guaranteed that gradient ascent will actually attain a (global) maximum, but it will converge to critical points under small enough step sizes (Arias-Castro et al., 2013).
In this section, we study the geometric properties of modal regression. Recall that is a set of points at each input . We define the modal manifold collection as the union of these sets over all inputs ,
By the implicit function theorem, the set has dimension ; see Figure 2 for an illustration with (univariate ).
We will assume that the modal manifold collection can be factorized as
where each is a connected manifold that admits a parametrization
for some function and open set . For instance, in Figure 2, each is a connected curve. Note that form an open cover for the support of . We call the th modal manifold, and the th modal function. By convention we let if , and therefore we may write
That is, at any , the values among that are nonempty give local modes.
Under weak assumptions, each is differentiable, and so is the modal set , in a sense. We discuss this next.
Proof. Since we assume that , we have by definition. Taking a gradient over yields
Lemma 1 links the geometry of the joint density function to the smoothness of the modal functions (and modal manifolds). The formula (11) is well-defined as long as is nonzero, which is guaranteed by the definition of local modes. Thus, when is smooth, each modal manifold is also smooth.
To characterize smoothness of itself, we require a notion for smoothness over sets. For this, we recall the Hausdorff distance between two sets , defined as
where with .
Assume the conditions of Lemma 1. Assume furthermore all partial derivatives of are bounded by , and there exists such that for all and . Then
The proof of this result follows directly from Lemma 1 and the definition of Hausdorff distance, so we omit it. Since is a multi-valued function, classic notions of smoothness cannot be applied, and Theorem 2 describes its smoothness in terms of Hausdorff distance. This distance can be thought of as a generalized distance for sets, and Theorem 2 can be interpreted as a statement about Lipschitz continuity with respect to Hausdorff distance.
Modal manifolds can merge or bifurcate as varies. Interestingly, though, the merges or bifurcations do not necessarily occur at points of contact between two modal manifolds. See Figure 3 for an example with . Shown is a modal curve (manifold with ), starting at and stopping at about , which leaves a gap between itself and the neighboring modal curve. We take a closer look at the joint density contours, in panel (c), and inspect the conditional density along four slices , in panel (d). We see that after , the conditional density becomes unimodal and the first (left) mode disappears, as it slides into a saddle point.
A remark about the uniqueness of the modal manifold collection in (8): this factorization is unique if the second derivative is uniformly bounded away from zero. This will later become one of our assumptions (assumption (A3)) in the theoretical analysis of Section 4. Note that in the left panel of Figure 2, the collection is uniquely defined, while this is not the case in the right panel (at the points of intersection between curves, the density has vanishing second derivatives with respect to ).
Lastly, the population quantities defined above all have sample analogs. For the estimate , we define
where each is a connected manifold, and is the total number. We also define in a similar way for . Thus, we can write
In practice, determining the manifold memberships and the total number of manifolds is not trivial. In principle, the sample manifolds are well-defined in terms of the sample estimate ; but even with a perfectly convergent mean-shift algorithm, we would need to run mean-shift iterations at every input in the domain to determine these manifold components. Clearly this is not an implementable strategy. Thus from the output of the mean-shift algorithm over a finite mesh, we usually employ some type of simple post-processing technique to determine connectivity of the outputs and hence the sample manifolds. This is discussed further in Section 7.
In this section we present asymptotic results about the convergence of the estimated modal regression set to the underlying modal set . Let denote the collection of times continuously differentiable functions with all partial derivatives bounded in absolute value by . (The domain of these functions should be clear from the context.) Given a kernel function , denote the collection of functions
where denotes the -th order derivative of .
Our assumptions are as follows.
The joint density for some .
The collection of modal manifolds can be factorized into where each is a connected curve that admits a parametrization for some , and form an open cover for the support of .
There exists such that for any with , .
The kernel function and satifies
The collection is a VC-type class, i.e. there exists such that for ,
where is the -covering number for a semi-metric space and
is any probability measure.
Assumption (A1) is an ordinary smoothness condition; we need fourth derivatives since we need to bound the bias of second derivatives. The assumption (A2) is to make sure the collection of all local modes can be represented as finite collection of manifolds. (A3) is a sharpness requirement for all critical points (local modes and minimums); and it excludes the case that the modal manifolds bifurcate or merge, i.e., it excludes cases such as the right panel of Figure 2. Similar conditions appear in Romano (1988); Chen et al. (2014b)
for estimating density modes. Assumption (K1) is assumed for the kernel density estimator to have the usual rates for its bias and variance. (K2) is for the uniform bounds on the kernel density estimator; this condition can be found inGiné and Guillou (2002); Einmahl et al. (2005); Chen et al. (2014c). We study three types of error metrics for regression modes: pointwise, uniform, and mean integrated squared errors. We defer all proofs to LABEL:suppA.
First we study the pointwise case. Recall that is the KDE in (4) of the joint density based on samples, under the kernel , and is the estimated modal regression set in (5) at a point . Our pointwise analysis considers
the Hausdorff distance between and , at a point . For our first result, we define the quantities:
Assume (A1-3) and (K1-2). Define a stochastic process by
is sufficiently small, we have
Moreover, at any fixed , when and ,
The proof is in the supplementary material (Chen et al., 2015). This shows that if the curvature of the joint density function along is bounded away from , then the error can be approximated by the error of after scaling. The rate of convergence follows from the fact that is converging to at the same rate. Note that as is a conditional mode, the partial derivative of the true density is . We defined as above since implies , so that the ratio would be ill-defined if . Also, the constraints on in the second assertion ( and ) are to ensure .
For our next result, we define the uniform error
This is an type error for estimating regression modes (and is also closely linked to confidence sets; see Section 5).
Assume (A1-3) and (K1-2). Then as and ,
The proof is in the supplementary material (Chen et al., 2015). Compared to the pointwise error rate in Theorem 3, we have an additional factor in the second term. One can view this as the price we need to pay for getting an uniform bound over all points. See Giné and Guillou (2002); Einmahl et al. (2005) for similar findings in density estimation.
The last error metric we consider is the mean integrated squared error (MISE), defined as
Note that the MISE is a nonrandom quantity, unlike first two error metrics considered.
Assume (A1-3) and (K1-2). Then as and ,
The proof is in the supplementary material (Chen et al., 2015). If we instead focus on estimating the regression modes of the smoothed joint density , then we obtain much faster convergence rates. Let be the smoothed regression modes at . Analogously define
Assume (A1-3) and (K1-2). Then as and ,
In an idealized setting, we could define a confidence set at by
By construction, we have . Of course, the distribution of is unknown, but we can use the bootstrap (Efron, 1979) to estimate .
Given the observed data samples , we denote a bootstrap sample as . Let be the estimated regression modes based on this bootstrap sample, and
We repeat the bootstrap sampling times to get . Define by
Our confidence set for is then given by
Note that this is a pointwise confidence set, at .
Alternatively, we can use to build a uniform confidence set. Define by
As above, we can use bootstrap sampling to form an estimate
, based on the quantiles of the bootstrapped uniform error metric
Our uniform confidence set is then
In practice, there are many possible flavors of the bootstrap that are applicable here. This includes the ordinary (nonparametric) bootstrap, the smoothed bootstrap and the residual bootstrap. See Figure 4 for an example with the ordinary bootstrap.
We focus on the asymptotic coverage of uniform confidence sets built with the ordinary bootstrap. We consider coverage of the smoothed regression mode set (to avoid issues of bias), and we employ tools developed in Chernozhukov et al. (2014a, b); Chen et al. (2014c).
Consider a function space defined as
and let be a Gaussian process defined on such that
for all .
Assume (A1-3) and (K1-2).
Define the random variable
Assume (A1-3) and (K1-2). Define the random variable. Then as , ,
The proof is in the supplementary material (Chen et al., 2015). This theorem shows that the smoothed uniform discrepancy is distributed asymptotically as the supremum of a Gaussian process. In fact, it can be shown that the two random variables and are coupled by
Now we turn to the limiting behavior for the bootstrap estimate. Let be the observed data and denote the bootstrap estimate by
where is the the bootstrap regression mode set at .
Assume conditions (A1-3) and (K1-2). Also assume that . Define . There exists such that and, for all ,
The proof is in the supplementary material (Chen et al., 2015). Theorem 8 shows that the limiting distribution for the bootstrap estimate is the same as the limiting distribution of (recall Theorem 7) with high probability. Using Theorems 7 and 8, we conclude the following.
Assume conditions (A1-3) and (K1-2). Then as and ,
Modal regression can be also used to construct prediction sets. Define
Recall that for a point and a set . Then
are pointwise and uniform prediction sets, respectively, at the population level, because
At the sample level, we use a KDE of the conditional density , and estimate via
An estimated pointwise prediction set is then
This has the proper pointwise coverage with respect to samples drawn according to , so in an asymptotic regime in which , it will have the correct coverage with respect to the population distribution, as well.
Similarly, we can define
the quantile of , , and then the estimated uniform prediction set is
The estimated uniform prediction set has proper coverage with respect to the empirical distribution, and so certain conditions, it will have valid limiting population coverage.
Prediction sets can be used to select the smoothing bandwidth of the underlying KDE, as we describe here. We focus on uniform prediction sets, and we will use a subscript throughout to denote the dependence on the smoothing bandwidth. From its definition in (18), we can see that the volume (Lebesgue measure) of the estimated uniform prediction set is
where is the number of estimated local modes at , and is as defined in (17). Roughly speaking, when is small, is also small, but the number of estimated manifolds is large; on the other hand, when is large, is large, but the number of estimated manifolds is small. This is like the bias-variance trade-off: small corresponds to less bias () but higher variance (number of estimated manifolds).
Our proposal is to select by
Figure 5 gives an example this rule when , i.e., when minimizing the size of the estimated 95% uniform prediction set. Here we actually use cross-validation to obtain the size of the prediction set; namely, we use the training set to estimate the modal manifolds and then use the validation set to estimate the width of prediction set. This helps us to avoid overfitting. As can be seen, there is a clear trade-off in the size of the prediction set versus in the left plot. The optimal value is marked by a vertical line, and the right plot displays the corresponding modal regression estimate and uniform prediction set on the data samples.
In the same plot, we also display a local regression estimate and its corresponding 95% uniform prediction set. We can see that the prediction set from the local regression method is much larger than that from modal regression. (To even the comparison, the bandwidth for the local linear smoother was also chosen to minimize the size of the prediction set.) This illustrates a major strength of the modal regression method: because it is not constrained to modeling conditional mean structure, it can produce smaller prediction sets than the usual regression methods when the conditional mean fails to capture the main structure in the data. We investigate this claim theoretically, next.
We will show that, at the population level, prediction sets from modal regression can be smaller than those based on the underlying regression function . Defining
pointwise and uniform prediction sets based on the regression function are
For a pointwise prediction set , we write for its Lebesgue measure on ; note that in the case of modal regression, this is somewhat of an abuse of notation because the Lebesgue measure of can be a sum of interval lengths. For a uniform prediction set , we write for its Lebesgue measure on .
We consider the following assumption.
The conditional density satisfies
with by convention, and denoting the Gaussian density function with mean and variance .
The assumption that the conditional density can be written as a mixture of Gaussians is only used for the next result. It is important to note that this is an assumption made about the population density, and does not reflect modeling choices made in the sample. Indeed, recall, we are comparing prediction sets based on the modal set and the regression function , both of which use true population information.
Before stating the result, we must define several quantities. Define the minimal separation between mixture centers