We consider nonparametric regression models with one-sided errors that take the general form
is the response variable,is the covariate, is the unknown regression function corresponding to the upper boundary curve and is a nonpositive random error term. The statistical issue of such a model lies hence on the frontier estimation, in other words, on the estimation of based on the observations , where is the sample size of the available data.
Models described in Equation (1.1) are also called boundary regression models (BRM) and have received an increasing attention in the last past years. BRM are closely related to production frontier models (PFM). Both share the objective of the estimation of the frontier - the boundary curve - and contribute to the same applications. PFM appeared first in the field of data envelopment analysis (DEA). Introduced in 1950’s with the seminal contribution of Farrell (1957), DEA answers the need of developing nonparametric methods to evaluate productivity and to assess efficiency of a system. A production unit is technically efficient if it produces the maximum output which is technically feasible for given inputs, or uses minimal inputs for the production of a given level of output. Relaxing the fundamental hypothesis of convex hull body of the data in DEA models, Deprins, Simar and Tulkens (1984) later introduced free disposal hull (FDH) models; see also Lovell et al. (1994) for further developments. DEA and its extensions are now recognized as powerful quantitative, analytical tools for measuring and evaluating the performance of a system and have countless applications, for instance in social sciences, health-care evaluation systems, banking sectors, supply chain management, energy system assessment or when analysing the performance of public services like hospitals and education. We refer to the books Cooper, Seiford and Zhu (2011) and Ramanathan (2003) for an comprehensive treatment of such methods and an exhaustive development of the applications. On the other hand, stochastic frontier analysis (SFA) originally formulated independently by Aigner, Lovell and Schmidt (1977) and Meeusen and van Den Broeck (1977) offers an interesting alternative with parametric estimations of the frontier; see also the books of Kumbhakar and Lovell (2003) and Cornwell and Schmidt (2008) for more recent references.
There is vast literature in the theory of production frontier models dealing with the question of estimating the boundary curve. Numerous parametric and non parametric techniques have been proposed, for instance using extreme-value based estimators (see for instance Girard and Jacob (2003), de Haan and Resnick (1994), Hall, Nussbaum and Stern (1997), Menneteau (2008), Girard and Jacob (2004), Gardes (2002) and Gijbels and Peng (2000)), projections techniques (Jacob and Suquet (1995)), kernel based estimators (Girard, Guillou and Stupfler (2013), Girard and Jacob (2008)), maximum likelihood based estimators (see Kumbhakar et al. (2007) and Simar and Zelenyuk (2011)). Some estimators need the boundary curve to be monotone (see e.g. Daouia and Simar (2005), Daouia, Noh and Park (2016) and Gijbels et al. (1999)).
In contrast to the aforementioned methods, we concentrate in this paper on an alternative approach that consists in approximating the regression function locally by a polynomial lying above the data points. Polynomial estimators in frontier estimation have been widely studied in the literature and are employed in several works, for instance in Hall, Park and Stern (1998), Hall and Park (2004), Girard, Iouditski and Nazin (2005), Knight (2001) and in Hall and Van Keilegom (2009) (local-linear estimator); see also the literature using piecewise polynomials (e.g. Korostelëv and Tsybakov (1993), Korostelëv, Silmar and Tsybakov (1995) and Härdle, Park and Tsybakov (1995)).
While consistency of estimators based on local constant approximations (local polynomial estimator of order ) of the boundary curve may be attained under mild assumptions of the boundary curve - such as continuity of in Neumeyer, Selk and Tillier (2019) - rates of convergence need stronger regularity assumption on . A classical framework is to assume that belongs to some Hölder class of order . This assumption essentially means that is -times differentiable and moreover a Lipschitz condition holds for the derivative. Further assumptions are also needed on the error distribution such as regular variation, meaning that the distribution of the errors has polynomial tails. These two assumptions are common in the context of boundary models and are met in several papers, see for instance Meister and Reiß (2013), Jirak, Meister and Reiß (2014), Müller and Wefelmeyer (2010), Drees, Neumeyer and Selk (2019), Girard et al. (2013), Härdle et al. (1995) and Hall and Van Dellegom (2009); see also the book of de Haan and Ferriera (2006) for the applications and the motivation of regularly varying errors.
In the context of nonparametric regression models with an Hölder boundary curve and regularly varying nonpositive errors, Jirak et al. (2014) suggested an adaptive estimator for the boundary curve using a local polynomial estimation based on local extreme value statistics. An adaptative procedure - a fully data-driven estimation procedure - is constructed by applying a nested version of Lepski method which shows no loss in the convergence rates with respect to the general
-risk. Drees et al. (2019) estimated the regression function similarly to Jirak et al. (2014) via minimization of the local integral of a polynomial approximation in the context of equidistant design points. By showing uniform rates of convergence for the regression estimator they proposed distribution-free tests of error distributions (goodness-of-fit) where the test statistics are based on empirical processes of residuals. They also discussed asymptotically distribution-free hypotheses tests for independence of the error distribution from the points of measurement and for monotonicity of the boundary function as well.
More often than not, fixed (deterministic) covariates and more specifically equidistant fixed design points are considered. In contrast in the paper at hand, we investigate the two cases of random and deterministic covariates which are both of particular interest. Deterministic covariates are often used in real-life applications when time is involved in the data set. For instance Jirak et al. (2014) studied the monthly sunspot observations and the annual best running times of 1500 meters; see also the plentiful applications in energy and environmental performance analysis provided in the review of Mardani et al. (2018). Besides, deterministic design is met accross a number of papers in regression models, see for instance Brown and Low (1996), Meister and Reiß (2013) and the references within. The case of random covariates is obviously the most relevant and appears in essence in many applications in boundary models, among other, in non-life insurance mathematics and financial risk modelling when analyzing optimality of portfolios and efficiency strategies; see also the extensive literature on modern portoflio theory (e.g. the recent books of Francis and Kim (2013) and Goetzmann et al. (2014)).
In opposition to kernel estimation methods, all the literature cited above concerning the asymptotic study of local polynomial approximations of the boundary curve deal with univariate covariates and as far as we are concerned the topic of providing rates of convergence of such estimators based on a multivariate sample is new and challenging. Beyond the theoretical interest this generalisation raises, we think that extending to the multivariate setup is also interesting from an application point of view. In light of this motivation, the main aim of this paper is to extend the results of Drees et al. (2019) to the multivariate case with arbitrary covariate. That is, under the main assumptions of regular variation of the nonpositive errors and -Hölder class of the boundary curve, we aim at showing uniform consistency and providing rates of convergence of an estimator based on the minimisation of the local integral of a polynomial lying above the data points for both multivariate random covariates and multivariate deterministic design points.
The remaining part of the manuscript is organized as follows. In section 2 the model is explained, while in section 3 the estimation procedure is described. In section 4 we show uniform consistency and provide rates of convergence of the estimator of the regression function for both random and deterministic multivariate covariates. Section 5 is dedicated to a simulation study to investigate the small sample behavior in dimension two and three. Proofs are postponed to the appendix.
Notation and shortcuts
Notation: stands for Borelian sets; and are the floor and ceiling functions respectively; means the largest natural number that is strictly smaller than ; denotes the survival function associated to a cdf ; denotes the supremum norm;
means that two random variablesshare the same distribution; holds if for two sequences and
of nonnegative numbers. Generally, vectors are highlighted in bold writing. For vectorslet denote the -th component of for . By we mean the maximum norm that is . For multivariate polynomials we use the multiindex notation where for a vector and a multiindex we define , and .
stands for cumulative distribution function andiid for independent and identically distributed.
2 The model
We focus on nonparametric boundary regression models with univariate observations and multivariate covariates of the form
where the errors are iid non-positive univariate random variables that are independent of the and stands for the sample size. The unknown regression function thus corresponds to the upper boundary curve.
Independence of errors and covariates is a typical assumption in regression models and is met among others in Müller and Wefelmeyer (2010), Meister and Reiß (2013), Reiß and Selk (2017) and Drees et al. (2019). Such assumption is crucial and often needed in the framework of frontier models when using statistical methods such as bootstrap procedures; see Wilson (2003) and Simar and Wilson (1998). In the case this hypothesis does not hold, one may consider parametric transformations e.g. exponential, Box-Cox (Box and Cox (1964)), or sinh-arcsinh transformations (Jones and Pewsey (2009)) of the response variable in order to retreive the framework of independence between errors and covariates; see Neumeyer et al. (2019) for the univariate case and also Linton, Sperlich and Van Keilegom (2008).
In the paper at hand, we investigate both cases of random and fixed design points. The former means that the covariates are -dimensional random variables while the latter assumes that are deterministically spread over . Without loss of generality, we work for ease of reading with design points lying on but results extend effortless to any cartesian product of one-dimensional closed intervals.
2.1 The random design case
In the random design case we consider the nonparametric boundary regression model with independent and identically distributed observations , defined by
corresponding to model (2.1), where the design points are multivariate random covariates distributed on that fulfill assumption (K4). The errors are assumed to be iid non-positive random variables that satisfy (K2). The precise statements of assumptions (K2) and (K4) are given in Section 3.1.
2.2 The fixed design case
In the fixed design case we consider a triangular array of independent observations and deterministic design points in for . Thus we conduct the nonparametric boundary regression model
corresponding to model (2.1) with errors that are univariate non-positive independent and identically distributed random variables that satisfy assumption (K2).
We allow for fixed equidistant as well as fixed nonequidistant design. In the first case we consider that form a grid
where we assume that is an integer. Note that when the univariate equidistant design simplifies to for . In the second case the points are not necessarily equidistant, but we assume that they are even enough distributed on , see Assumption (K4’) below.
3 Estimating the regression function
To estimate the boundary curve, we use an estimator that locally approximates the regression function by a polynomial lying above the data points; see Theorem 2.2 in Drees, Neumeyer and Selk (2019) for further details in the univariate case.
3.1 The random design case
For , we consider the regression function estimator defined as
where is a multivariate polynomial of total degree and minimizes the local integral
under the constraints for .
The polynomial is the solution to the linear optimization problem
where is represented by its vector of coefficients, is the vector representing the linear functional , the matrix is the multivariate Vandermonde matrix whose -th row has as its entries all the monomials of degree at most in the entries of , and is the vector with . For the estimator to be well-defined, it is necessary that this problem is bounded from below, that is, that the objective is bounded from below on the polytope defined by the constraints. This need not always be the case, as the example in Jirak et al. (2014) with and two support points demonstrates. However, the alternate optimization problem proposed in Jirak et al. (2014) has the same problem of unboundedness. When , Problem (3.3) is bounded whenever we have at least points. This follows from the fact that the univariate Vandermonde matrix is totally positive when the support points are positive. However, for higher dimensions it is not as simple. There, the Vandermonde matrix with
rows needs not be invertible. As of now, we believe that for the boundedness of (3.3) needs to be checked on a case-by-case basis using linear optimization algorithms. By duality theory we know that (3.3) is bounded if and only if there exists a vector with . This linear program can become very large as
. This linear program can become very large asand grow. For instance, if and has rows, then the interior point algorithm from Vaidya (1989) runs in time in the worst case, which in terms of grows faster than . Nevertheless, this is a theoretical worst case and the average case might be better, possibly using another algorithm. For implementations, the authors suggest using the Python module scipy.optimize.linprog, which by default uses an interior point algorithm based on the MOSEK interior point optimizer by Andersen and Andersen (2000).
To illustrate the estimation procedure we take a look at the simplest case of which results in a local constant approximation. The estimator defined in (3.1)-(3.2) then simplifies to . In Figure 1 an example for is shown. For each -value a constant function is fitted to the data in the neighborhood.
We work under the following four assumptions (K1)-(K4).
Regression function: belongs to some Hölder class of order that is is -times differentiable on and all partial derivatives of order satisfy
for some where
Errors distribution: The errors are independent and identically distributed on with common cdf that satisfies
with and when .
Bandwidths: is a sequence of positive bandwidths that satisfies and .
Design points: The covariates
are iid random variables defined on a probability spaceand valued on with cdf and density that is bounded and bounded away from zero. Besides, they are independent of the errors .
3.2 The fixed design case
Similarly to the random design case, for , we consider the regression function estimator defined as
where is a polynomial of total degree and minimizes the local integral
under the constraints for .
We work for the fixed design case under (K1)-(K2) and the following modified assumptions (K3’) and (K4’).
Bandwidths: Let be a sequence of positive bandwidths that satisfies and with from (K4’).
Design points: Let be a -dimensional interval which is the cartesian product of one-dimensional closed intervals of length with . We assume that at least and at most design points lie in all such .
Assumptions (K3) and (K3’) are common when analyzing the asymptotic behavior of such estimators of the form (3.1) or (3.5). In the equidistant fixed design framework we have , and (K3’) equals (K3). Then for the univariate case , we have and assumption (K3’) turns to be assumption (H1) in Drees et al. (2019); see also assumption (A4’) in Neumeyer et al. (2019).
It is clear for assumption (K2) that the errors do not depend on the covariates in both cases of random and fixed design setups. Still, it has to be noticed that when dealing with the triangular scheme defined in (2.3), the errors depend on too. This justifies the addition of the second index in . Indeed, the design point may vary with the sample size; see the construction of the fixed multivariate equidistant design in Section 2.2 to be convinced.
The size of in assumption (K2) is an important factor for the performance of the estimator defined in (3.1)-(3.2) and (3.5)-(3.6) respectively. Simply speaking the smaller , the better the estimator. For the error distribution is irregular and in this case the rate of convergence for the estimator is faster than the typical nonparametric rate, see the considerations below Theorem 4.1. In Figure 2 we show some examples for different error distributions to highlight the effect of the size of . To simplify the presentation we restrict the display to .
4 Main results
In this section we give the uniform consistency as well as the convergence rates of our estimator separated for the two considered cases.
4.1 The random design case
Note that the deterministic part stems from the approximation of the regression function by a polynomial whereas the random part results from the observational error. Balancing the two error rates by setting gives . For the case of an irregular error distribution, i. e. , this rate improves upon the typical optimal rate for the nonparametric estimation of mean regression functions in models with regular errors.
Theorem 4.1 extends the result of Drees et al. (2019), Theorem 2.2, to the multivariate random setting. Even in the univariate deterministic setting our result (Theorem 4.3 established below) is an extension of the aforementioned Theorem since the convergence rate holds on the whole unit interval whereas in Drees et al. (2019) the result is restricted to . The proof of the error rate that stems from the observational error follows along similar lines as the proof in Drees et al. (2019) but major adaptions are needed to deal with the multivariate and random case. The proof of the deterministic error rate that is due to approximating the boundary curve by a polynomial is based on the proof of Theorem 3.1 in Jirak et al. (2014). In the univariate equidistant fixed case that is treated in Drees et al. (2019) this Theorem can be applied directly whereas in the multivariate, possibly random case that is treated in the paper at hand the proof has to be intensely modified. Indeed, the original proof in Jirak et al. (2014) fully relies on the fundamental theorem of algebra, which states that every polynomial equation in one variable with complex coefficients has at least one complex solution. As far as we know there is no possible extension of such a result for higher dimension hence moving to multidimensional covariates requires completely different arguments. See the proof of Proposition A.1 and especially the proof of Lemma A.2 for this modification. This also gives an alternative proof of Theorem 3.1 in Jirak et al. (2014) and extends it to the multivariate and possibly random case.
4.2 The fixed design case
When the Hölder class defined in Assumption (K1) reduces to the so-called class of -Hölder uniformly continuous functions with . In this framework, the boundary curve may be estimated by a local constant approximation
This local constant approximation based estimator has been studied in Neumeyer et al. (2019) in the univariate setup for both random and fixed design points. Under the weaker assumption of continuity of the boundary curve , they showed the uniform consistency of the estimator defined in (4.1) on the whole unite interval . Strengthening with the -Hölder uniformly continuity assumption and iid regularly varying innovations, we obtain from Theorem 4.3 for the multivariate case, the uniform consistency as well as the rate of convergence for the deterministic design case on the whole unit interval . We sum up in the following corollary.
Assume (K2), (K3’) and (K4’) hold for model (2.3) with -Hölder uniformly continuous function . Then, the local constant approximation of the boundary curve defined in (4.1) is uniformly consistent on and we have
Corollary 4.4 extends Remark 2.5 in Drees et al. (2019) to the multivariate setup and for non necessarily equidistant design points.
To study the small sample behavior, we generate data according to the model
with iid Unif and iid Exp for and sample sizes . Thus in the simulated model the error distribution fulfills and we consider different values of which means that we investigate local constant, local linear, local quadratic and local cubic approximation.
Since our estimator is computed by minimizing an integral over a polynomial the estimation procedure consists of solving a linear programming problem (compare to Remark 3.1). The bandwidth is chosen as which corresponds (up to a log term) to the theoretically optimal bandwidth for that balances the two error rates, see the considerations below Theorem 4.1.
We run the same simulations with fixed equidistant design points as described in section 2.2. The results are very similar and thus we only present the results for the random design case.
In Table 1 the results for 1000 replications are shown where we display the estimated mean squared error of our estimator ( respectively). It can be seen that the estimator works reasonably well and the results improve as grows. The results improve as well as grows which are both expected effects that correspond to the theoretical result in Theorem 4.1.
In Figure 3 we show the true boundary curve and the estimated curve in comparison. Since in dimension one the presentation of two curves in one plot is clearer than in higher dimensions we display a cut through the two respectively three dimensional surface of the functions. To be precise we plot the functions and ( and respectively). It can be seen that the approximation gets better as grows both for and . For the approximation is very good, also in both considered cases of two and three dimensional covariates. Exemplarily we show the results for since the effect is very similar for the other cases.
To evaluate the performance of the estimator on the whole interval we display in Table 2 the arithmetic mean of the estimated mean squared error of where form a grid on with . It can be seen that the performance is surprisingly poor for and . In Figure 4 we show plots of the estimated mean squared error of on for these cases. From the picture it can be deduced that the problem lies on the boundaries. The reason could be the smaller number of observations in this area which are of higher significance for larger .
Appendix A Appendix: Proofs of theorems
a.1 Proof of Theorem 4.1
Proof: We first highlight that throughout the proof the design points are random elements. To make the reading easier, we omit the script but the proof has to be understood -wise that is for any realisation . Besides, unless it is specified otherwise, is arbitrary.
Let be fixed and for set . Note that assumption (K3) here is not required. Nevertheless, we assume that are positive bandwidths such that . The idea of proof is based on Theorem 3.1 in Jirak, Meister and Reiß (2014) but comprehensive adaptions are needed to deal with the multivariate case. We consider random design points satisfying assumption (K4) where the Riemann approximation in the aforementioned paper is replaced by the integral defined in (3.2).
This means that we consider the coefficients for all multiindices with which minimize the objective function
under the constraint for all with .
Now, a Taylor-Lagrange development up to the order of around yields
with . Then we have
where is the remainder term defined for by
By assumption (K1) we can make use of the Hölder property (3.4) and get
with some constants .
Consider now that are the Taylor coefficients such that
With this we define the following two quantities
Then, one may rewrite the data points in the model (2.2) as
and from what precedes, under the constraint , it follows that
Since the errors are non-positive, we have for any
Thus, since the coefficients minimize the local integral (A.1), we have
for all . Define now the polynomial
as the difference of the integrands of the last two quantities. From (A.3), it follows that
for any and for any boundary curve .
Define now the three sets , and as
First note that
For we have
Note that by definition