aims to obtain a conditional mode predictor consisting of a global maximizer of a conditional probability density function (PDF) of dependent variables conditioned on independent variables by using a linear model. Besides a nice interpretability of a conditional mode, the MLR has an advantage that resulting parameter and curve estimators are consistent even for heteroscedastic or asymmetric conditional PDFs, as compared with the robust M-type estimators which are not consistent
. The consistency of the MLR even for data with skewed conditional distributions makes the MLR a promising approach to analyzing them; refer, for example, to an application to cognitive impairment prediction and analysis of economic data [5, 20]. Thus studies of the MLR and related areas are currently ongoing from various viewpoints [20, 19, 17].
The MLR is a kernel-based semiparametric regression method, and the user has a freedom in selecting a kernel used in the MLR. However, many researches [25, 23, 10, 19, 15] of the MLR use a Gaussian kernel, presumably because the modal EM algorithm (MEM) [14, 23], a standard parameter estimation method for the MLR, does not provide an explicit parameter update formula if a non-Gaussian kernel is used (see Section II). In this paper, we study the problem of kernel selection in the MLR from two perspectives: “which kernel achieves smaller error?” and “which kernel is computationally efficient?” in order to pave the way for use of a non-Gaussian kernel in the MLR.
As far as the authors’ knowledge, relationship between the kernel used in the MLR and the estimation error of the MLR has not been discussed at all. In Section III of this paper, we refine an existing theorem of asymptotic normality of the MLR parameter and derive the asymptotic mean squared error (AMSE) of it. Then, adopting the AMSE as a criterion of an estimation error, we investigate the question “which kernel achieves smaller error?”. As a consequence, we find that a Biweight kernel is optimal among non-negative kernels for the MLR in the sense of minimizing the AMSE.
An objective function appearing in the MLR may have multiple peaks and/or broad plateaux, which would make the optimization involved in the parameter estimation difficult. Accordingly, the parameter estimation in the MLR often needs a multi-start technique, i.e., repeating many times from different initial estimates. So, it is plausible to have an efficient definite algorithm with convergence guarantees. In order to use non-Gaussian kernels in the MLR, parameter estimation methods other than the MEM are preferable, because the non-Gaussian MEM needs to use an additional optimization algorithm in its inner loop and may be inefficient. In Section IV of this paper, we consider using the iteratively reweighted least-squares algorithm (IRLS) for the MLR and show that the IRLS has a convergence guarantee for a broad class of kernels. Also, as a partial answer to the question “which kernel is computationally efficient?”, we prove a theorem stating that the IRLS exactly terminates in a finite number of iterations if an Epanechnikov kernel is used.
In Section V, we show results of numerical experiments of the MLR with several kernels on simulation data. There, we confirmed that the mean squared error (MSE) when using a Biweight kernel became the smallest as the sample size went larger, and that the calculation efficiency of the IRLS with an Epanechnikov kernel was much higher than that using other kernels. Finally, conclusions are summarized in Section VI. The proofs of the theoretical results are in Appendices.
Ii Existing researches on MLR and MEM
Let be the input space and be the output space. Let
be a pair of random variables taking values in
following a certain joint distribution. We assume that the conditional distribution with the PDFof conditioned on is such that for a certain function the residual is distributed according to a distribution whose mode is at the origin. It then follows that holds for any . In the MLR, it is further assumed that is a linear function,
where is an underlying MLR parameter.
Suppose that a sample set consists of independent and identically-distributed (iid) samples from the joint distribution of . To estimate the underlying parameter in (1) on the basis of the sample set, the MLR maximizes the kernel-based objective function
where is the kernel function such that for some function , which is also called a kernel function, and where is a scale parameter (bandwidth) of . Also, the estimator of the MLR parameter, defined as the maximizer of (2), is denoted as .  and  have provided fundamental analysis of asymptotic statistical behaviors of the MLR, such as consistency and asymptotic normality of the estimator .
When is a Gaussian kernel, the M-step (4) admits an analytic expression,
where , , and is an diagonal matrix with diagonal elements . The MEM [14, 23], as shown above, is like the well-known EM algorithm. It has been shown that, irrespective of the kernel used, the objective function sequence is non-decreasing and converges, as long as the M-step (4) is executed exactly.
Also, several studies have combined the MLR with regularization, variable selection, and other techniques [24, 5, 25, 10, 15]. The MEM has been used in most of them, and Gaussian kernels have been used almost exclusively. For instance,  has studied relationships between the MLR using a Gaussian kernel and a correntropy-based regression [16, 6]. Also, when the bandwidth is quite large, it is known that a Gaussian kernel is approximated as . Thus, 
has discussed that the MLR with a Gaussian kernel approaches the ordinary least-squares (OLS) as.
Iii MSE-based Asymptotics and Optimal Kernel
Iii-a Asymptotic Normality of MLR Parameter
We discuss the optimal kernel minimizing an estimation error for the MLR in the next subsection, on the basis of the asymptotic statistical theory. For this purpose, we review in this section the analysis on asymptotic behaviors of the MLR. In the standard asymptotic analysis, in order to have consistency of the estimator, the bandwidth of the kernel should be made smaller as the sample size tends to be larger, but not too fast in order to prevent from becoming rough. A bandwidth sequence specifies how one sets the bandwidth according to the sample size . It, together with the kernel used, determines the asymptotic behaviors of the MLR.
Among the existing studies which discuss the asymptotics the most strict is that by . The following theorem is a refinement of their result.
Let be a bounded function with
the bounded second-order moment and
the first three bounded derivatives
be a bounded function with the bounded second-order moment and the first three bounded derivatives, , of which , are square-integrable, satisfying
and let be a sequence of positive constants satisfying
where , , and .
The outline of the proof is in Appendix A.
Iii-B Optimal Kernel for MLR
|Kernel||QM at ?|
From Theorem 1, keeping only the leading-order terms under
, the bias and the variance ofare given by and , respectively, where denotes the trace of a matrix . Accordingly, the AMSE of is given as a sum of the squared asymptotic bias and the asymptotic variance, as
The ratio of the squared-bias to the variance is . Which of the squared-bias and the variance is dominant in the AMSE depends on the rate of decay of the bandwidth sequence as : If decays slowly, then it will be dominated by the squared-bias, whereas if decays fast, then it will be dominated by the variance. The fastest decay of the AMSE is achieved at the boundary between the bias-dominant and variance-dominant regimes, i.e., when .
Moreover, on the basis of the stationary condition of the AMSE with respect to , the asymptotic optimal bandwidth minimizing the AMSE becomes
and under this optimal bandwidth the leading-order term of the AMSE becomes proportional to
This relationship implies that the optimal convergence rate of the AMSE is , but we would like to focus here on the dependence not on but rather on the kernel-dependent quantities and : indeed, quantities with tilde in (11) are determined from the underlying data distribution, whereas and are completely determined by the kernel . Thus, the optimal kernel should be obtained by minimizing (called the AMSE criterion in this paper). It is a variational problem, whose solution is given by:
A Biweight kernel minimizes the AMSE criterion among non-negative kernels.
In order to check the degree of goodness of a Biweight kernel in comparison with other kernels, we have calculated the criterion for various commonly-used kernel functions, ignoring their differentiability. The results are summarized in Table I. The AMSE is about 12 percent larger for a Gaussian kernel than for a Biweight kernel. An empirical observation is that truncated kernels such as a Triweight and an Epanechnikov in addition to a Biweight are better in terms of the AMSE criterion than a Gaussian kernel. Also, the AMSE criterion becomes larger for heavy-tailed kernels including a Laplace and a Sech, and we may conclude that heavy-tailed kernels are not good for use in the MLR.
Iv IRLS and its Convergence
Iv-a Construction of IRLS for MLR from MM Algorithm
As mentioned in Section II, a plausible property of the MEM is that the objective function sequence given by the MEM converges no matter what kernel is used. It would presumably be the main reason as to why the MEM has frequently been used. However, the maximization on the right-hand side of (4) can explicitly be solved only when a Gaussian kernel is used. If using a non-Gaussian kernel, the M-step requires use of an additional optimization as an inner loop, such as the conventional gradient method. This may reduce computational efficiency and even suffer from convergence problem of the inner loop. In contrast, we consider in this section an IRLS-based parameter estimation method, which provides an explicit parameter update formula even when a non-Gaussian kernel is used. Although several existing works [9, 5, 21] using the IRLS for the MLR have not discussed its convergence properties, we clarify in this paper its convergence properties on the basis of reformulation of the IRLS based on the minorize-maximize (MM) algorithm .
We first assume that the kernel is symmetric about the origin. Hence, it is represented as with a function , called a profile of , where the term ‘profile’ is defined as follows.
Definition 1 (profile).
A function defined on is called a profile (with coefficient ) of a function on , if can be represented as with a constant .
We further assume that the kernel function is quadratic minorizable as defined below. To that end, we first provide the definition of the minorizer.
Definition 2 (minorizer).
We say that a function is a minorizer of a function at , if it satisfies
Following this definition, notions concerning the quadratic minorizability are given as follows:
Definition 3 (quadratic minorizability).
A quadratic minorizer of at is defined as a minorizer of at that is quadratic or constant. We call quadratically minorizable (QM) at if it has a quadratic minorizer at . Furthermore, when is QM at , among quadratic minorizers of at , the one whose vertex has the largest value is called the best quadratic minorizer of at .
Then, one sufficient condition that a function has a quadratic minorizer is provided as follows (see  for the details).
If a function has a continuous, convex, and non-increasing profile with coefficient , is quadratically minorizable at any non-zero point. Also, the best quadratic minorizer of at is
where and where is the minimum value among the subderivatives111A subderivative of a convex function at a point is a real number such that , and thus it takes values in an interval between and . of at .
From the viewpoints mentioned thus far, we introduce the following class of kernels for the MLR and IRLS.
Definition 4 (QM kernel).
A kernel function is called a QM kernel if is non-negative, continuous, and normalized and its profile is convex and non-increasing.
Although the QM kernel is more restrictive than the ‘modal regression kernel’ defined in , it includes a wide variety of practically-used kernel functions (see Table I and Figure 1). However, the QM kernel does not generally satisfy the conditions of Theorem 1, mainly with regard to its differentiability.
For the QM kernel , its best quadratic minorizer is represented as
where when , where is the minimal value of the subderivative of the profile . However, for a kernel with a profile satisfying , is ill-defined at and one cannot construct a quadratic minorizer when is exactly equal to 0. For such a kernel, we assume that , , does not become exactly zero for all . We also note that is non-positive, since it is assumed that the profile of the QM kernel is non-increasing. Then, given , we can construct a quadratic minorizer of the objective function (2) as
Moreover, on the basis of the stationary condition of the quadratic minorizer,
where , we construct the IRLS-based parameter update formula
or equivalently , where is an diagonal matrix with diagonal elements . When a Gaussian kernel is used, the IRLS (17) gives the same update formula as the MEM because of the relationship .
Iv-B Convergence of IRLS for MLR
MSE and standard deviation of MSE (both are multiplied by)
Since the IRLS is based on the MM algorithm, its ascent property and convergence can be proved from the general theory of the MM algorithm. The following convergence result of the IRLS is a direct consequence of its construction.
Let be a QM kernel. Then, for the parameter estimate sequence obtained via the IRLS, the objective function sequence is non-decreasing and converges.
This theorem implies that the IRLS is superior to the MEM in that the former allows for a variety of kernels to be practically used in the MLR with convergence guarantee.
We furthermore provide a stronger result stating that the IRLS with an Epanechnikov kernel converges in a finite number of iterations. The proof, detailed in Appendix C, is on the basis of the fact that the subderivative takes only a finite number of values when an Epanechnikov kernel is used.
Consider the parameter estimate sequence obtained via the IRLS using an Epanechnikov kernel . Then, there exists a finite such that holds for all . Moreover, if the Hessian of at is negative definite, holds for all .
V Simulation Studies
The objective of the numerical experiments described in this section is neither to compare the MLR with other regression methods including OLS, least absolute deviation, Huber’s robust regression, and so on, as in [9, 23], nor to try applying the MLR to real-world data. Rather, the objective is to verify our theoretical results on effects of the kernel function on the performance of the MLR.
In reference to the experiment design in , we generated datasets consisting of iid samples from the heteroscedastic distribution with (let ) and being independent, where
is a uniform distribution with support. Thus, the underlying conditional mode becomes since , and hence the underlying parameter is given by .
We used the linear model , where , and the IRLS (17) as a parameter estimation method222We also experimented the MEM using a conventional gradient method in its inner loop (M-step), but it often stopped in a plateau when using non-Gaussian kernels.. We adopted the multi-start method with 10 points randomly drawn from being used for the initial parameter , in order to mitigate issues arising from the multimodality of an objective function. We judged convergence by . We compared four different kernels: Epanechnikov , Biweight , Gaussian , and Laplace . It should be noted that the Laplace kernel is not QM at , but in our experiments, we did not encounter a serious trouble caused by the update formula (17) becoming ill-defined as described in Section IV. For the bandwidth of the kernel function, we used the empirical counterpart of the optimal bandwidth in (10), on the basis of the datasets and the underlying data distribution. As an evaluation criterion of the MLR in our experiments, we adopted the MSE between and , which was calculated on the basis of 1000 trials. We report the MSE and the standard deviation of the MSE in Table II.
When the sample size is the smallest, , the best result was obtained when the Gaussian kernel was used. However, increasing the sample size resulted in the smallest MSE when the Biweight kernel was used. Also, the MSE with the Laplace kernel was the largest for every . These results are in good agreement with the results of the asymptotic analysis333Our analysis in Section III is based on the asymptotics concerning when the sample size is large enough. Readers may want to know the theory for the small sample case, but there is no valid result for kernel selection in such a case, as far as we are aware of. given in Theorem 2 and Table I.
The ranking of the calculation time of the IRLS for each kernel did not change even if the sample size was varied. Therefore, we report the calculation times with : the IRLS converged in , , , and seconds from an initial point on average for the Epanechnikov, Biweight, Gaussian, and Laplace kernels, respectively. Thus, the computational efficiency with the Epanechnikov kernel, as suggested by Theorem 4, was confirmed.
First, we have found that a Biweight kernel is optimal in the sense of minimizing the AMSE of the MLR parameter estimator (Theorem 2), via refining the theorem on the asymptotic normality of it (Theorem 1). Truncated kernels are generally more suitable for the MLR than heavy-tailed kernels (Table I).
Secondly, we have analyzed the IRLS, which is applicable to a wider kernel class opposed to the MEM which gives an analytic parameter update only for a Gaussian kernel. We have provided a sufficient condition under which the objective function sequence given via the IRLS converges (Theorem 3). Moreover, we have shown that both of the objective function and parameter estimate sequences obtained by the IRLS converge exactly in a finite number of iterations when an Epanechnikov kernel is used (Theorem 4).
In the simulation, among those kernels investigated, the MSE under the large sample size became the smallest for a Biweight kernel and the computational efficiency of the IRLS was best for an Epanechnikov kernel, as suggested in our theorems.
From the above results, one would be able to recommend a Biweight kernel or an Epanechnikov kernel for use in the MLR; practitioners can select them properly according to their needs, on the basis of the findings in this paper. Additionally in order to use them, it will be reasonable to proceed with discussion in the framework of the IRLS-based parameter estimation.
Vi-a Proof Outline of Theorem 1
The Taylor expansion of at around is
where , , and where an appropriately determined satisfying . The equation (18) implies if is regular.
 has proved that is a consistent estimator of . Although the asymptotic normality of has been also proved, its mean reduces to due to their strong assumption . Thus, it is enough to clarify the mean of and requirements for the proof, under the weaker assumptions. These can be proved on the basis of the Taylor expansion-based analysis framework. The regularity conditions of Theorem 1, which are same as ones given in , are
is an iid sequence.
for some .
is continuous in and for all and . In addition, there exists a set such that and for all and .
, are uniformly bounded.
is negative definite.
The parameter space is a compact space that includes in its interior.
Vi-B Proof Outline of Theorem 2
Vi-C Proof of Theorem 4
Let , and be the set of indices for which the argument of is in the non-flat region of the kernel :
Thus, the objective function (2) is written as
and the corresponding update of the IRLS (17) becomes
where and we let to denote . For any , is in , the power set of . We note that, given a sample set, depends on only through . Thus, at most different values appear in the parameter estimate sequence as well as the sequence . Convergence of is guaranteed by Theorem 3. Since a finite number of values appear in , there exists such that for all , is a constant. In the following, we prove that also converges at th iteration. Introduce the best quadratic minorizer of at as
The inequality (12) implies
and combining them leads
Since is constant for , should be equal to 0. Alternatively, from (21) one has
Since the left-hand side equals to 0, one has and consequently is constant for , under the additional condition that the Hessian of at is negative definite. This completes the proof.
The Hessian of is written as in the Appendix A. Under the asymptotic situation satisfying the conditions of Theorem 1, the Hessian gets negative definite, and hence the latter half of Theorem 4 holds. Also, this convergence result still holds even if using the update with
and an identity matrix. This form corresponds to a -regularized version of the MLR or a technique for stabilization of inverse matrix calculation. In this case, the additional condition is not needed.
Finally, we would like to note that this way to prove has been similarly conducted for the analysis [2, 8] of the mean shift algorithm (MS), which is used for mode estimation and so on. The MS also can be viewed as a MM-based optimization of a kernel-based objective function. Additionally, the relationship between the MS and the MEM for mode estimation is similar to that of the IRLS and the MEM for the MLR; the MS is more generic . The MEM transforms problems that are easier to directly optimize into harder problems.
-  (2012) On the use of robust regression in econometrics. Economics Letters 114 (1), pp. 124–127. Cited by: §I.
Mean shift analysis and applications.
IEEE International Conference on Computer Vision, Vol. 2, pp. 1197–1203. Cited by: §VI-C.
-  (2009) Sharp quadratic majorization in one dimension. Computational Statistics & Data Analysis 53 (7), pp. 2471–2484. Cited by: §IV-A.
-  (1980) Optimum kernel estimators of the mode. The Annals of Statistics 8 (4), pp. 870–882. Cited by: §VI-A.
-  (2017) A statistical learning approach to modal regression. arXiv preprint arXiv:1702.05960. Cited by: §I, §II, §IV-A, §IV-A.
-  (2015) Learning with the maximum correntropy criterion induced losses for regression.. Journal of Machine Learning Research 16, pp. 993–1034. Cited by: §II.
-  (1991) Optimizing kernel methods: a unifying variational principle. International Statistical Review/Revue Internationale de Statistique 59 (3), pp. 373–388. Cited by: §III-B, §VI-B.
On convergence of Epanechnikov mean shift.
AAAI Conference on Artificial Intelligence, Cited by: §VI-C.
-  (2012) Regression towards the mode. Journal of Econometrics 170 (1), pp. 92–101. Cited by: §I, §II, §II, §III-A, §III-A, §IV-A, §V, §VI-A.
-  (2017) Non linear parametric mode regression. Communications in Statistics — Theory and Methods 46 (6), pp. 3006–3024. Cited by: §I, §II.
-  (2016) MM optimization algorithms. SIAM. Cited by: §IV-A.
-  (1989) Mode regression. Journal of Econometrics 42 (3), pp. 337–349. Cited by: §I.
-  (1993) Quadratic mode regression. Journal of Econometrics 57 (1–3), pp. 1–19. Cited by: §I.
-  (2007) A nonparametric statistical approach to clustering via mode identification. Journal of Machine Learning Research 8, pp. 1687–1723. Cited by: §I, §II.
Robust feature selection via-norm in finite mixture of regression. Pattern Recognition Letters 108 (1), pp. 15–22. Cited by: §I, §II.
-  (2007) Correntropy: properties and applications in non-gaussian signal processing. IEEE Transactions on Signal Processing 55 (11), pp. 5286–5298. Cited by: §II.
-  (2018) Quantile regression approach to conditional mode estimation. arXiv preprint arXiv:1811.05379. Cited by: §I.
On weak convergence and optimality of kernel density estimates of the mode. The Annals of Statistics 16 (2), pp. 629–647. Cited by: §VI-A.
-  (2018) Information geometric perspective of modal linear regression. In International Conference on Neural Information Processing, pp. 535–545. Cited by: §I, §I.
-  (2017) Fitting truncated mode regression model by simulated annealing. In Computational Optimization in Engineering-Paradigms and Applications, Cited by: §I.
-  (2017) Regularized modal regression with applications in cognitive impairment prediction. In Advances in Neural Information Processing Systems, pp. 1448–1458. Cited by: §I, §IV-A.
-  (2019) Properties of mean shift. IEEE Transactions on Pattern Analysis and Machine Intelligence. Note: Early access Cited by: §VI-C.
-  (2014) A new regression model: modal linear regression. Scandinavian Journal of Statistics 41 (3), pp. 656–671. Cited by: §I, §I, §II, §II, §V, §V, §VI-A.
-  (2012) Bayesian mode regression. arXiv preprint arXiv:1208.0579. Cited by: §II.
-  (2014) Robust and efficient variable selection for semiparametric partially linear varying coefficient model based on modal regression. Annals of the Institute of Statistical Mathematics 66 (1), pp. 165–191. Cited by: §I, §II.