    # Robust estimators in a generalized partly linear regression model under monotony constraints

In this paper, we consider the situation in which the observations follow an isotonic generalized partly linear model. Under this model, the mean of the responses is modelled, through a link function, linearly on some covariates and nonparametrically on an univariate regressor in such a way that the nonparametric component is assumed to be a monotone function. A class of robust estimates for the monotone nonparametric component and for the regression parameter, related to the linear one, is defined. The robust estimators are based on a spline approach combined with a score function which bounds large values of the deviance. As an application, we consider the isotonic partly linear log--Gamma regression model. Through a Monte Carlo study, we investigate the performance of the proposed estimators under a partly linear log--Gamma regression model with increasing nonparametric component.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

As is well known, semiparametric models may be introduced when the linear model is insufficient to explain the relationship between the response variable and its associated covariates. This approach has been used to extend generalized linear models to allow most predictors to be modelled linearly while one or a small number of them enter the model nonparametrically. In this paper, we deal with observations

satisfying a semiparametric generalized partially linear model, denoted gplm. To be more precise, we assume that where , with and known functions and is such that

 μ(x,t)=H(x\footnotesize\sc t\boldmathβ0+η0(t)), (1)

where is a known link function, is an unknown parameter and is an unknown continuous function with support on a compact interval , which we will assume equal to , without loss of generality. The parameter which is usually a nuisance parameter, generally lies on a subset of , for that reason we will assume that , where stands for an open set.

When , the generalized partially linear model is simply the well known partly linear regression model, that has been considerably studied, and, in this case, is the scale parameter. We refer for instance to Härdle et al. (2000). Robust estimators for gplm have been considered for instance by Boente et al. (2006) and by Boente and Rodríguez (2010). However, in this paper, we deal with the situation in which there are constraints on the nonparametric component . More precisely, we will assume that , in model (1), is monotone and for simplicity and without loss of generality non–decreasing. Most studies on generalized partly linear models assume that is an unspecified smooth function. However, in many applications, monotonicity is a property of the function to be fitted. Some examples when can be found for instance in Ramsay (1988) who studied the relation between the incidence of Down’s syndrome and the mother’s age; see also He and Shi (1998). In Section 6, we analyse a data set considered in Marazzi and Yohai (2004) which aims to study the relationship between the hospital cost of stay and several explanatory variables, including the length of stay in days which we model non–parametrically. The monotone assumption on is natural in this data set, since the hospital cost increases the longer the stay.

Most estimation developments under monotone constraints were given under a partly linear regression model and we can mention among others, Huang (2002), Sun et al. (2012) who considered estimation under constraints and also Lu (2010) who proposed a sieve maximum likelihood estimator based on

splines. Recently, Lu (2015) considered a spline approach to generalized monotone partial linear models. All these methods are sensitive to outliers and some developments were given under a regression model, that is, when

to provide robust estimators. For nonparametric isotonic regression models, He and Shi (1998) and Wang and Huang (2002) proposed a robust isotonic estimate procedure based on the median regression, while, to improve the efficiency, Álvarez and Yohai (2012) considered estimators for isotonic regression. On the other hand, under a partly linear regression model and following the approach given by Lu (2010), Du et al. (2013) consider estimators based on monotone splines when is assumed to be a monotone function, the scale parameter is known and the errors have a symmetric distribution. However, in the hospital data set to be considered in Section 6

, the errors follow an asymmetric log–Gamma distribution and the proposal considered in Du

et al. (2013) is not appropriate. Furthermore, the shape parameter is unknown and needs to be estimated in order to calibrate the robust estimators and to downweight large residuals.

In this paper, we provide a general setting to provide a family of estimators for the regression parameter and the monotone regression function under the gplm model (1) when the nuisance parameter is unknown. This model includes a partly linear isotonic regression model with unknown scale and a partly linear isotonic log–Gamma regression model with unknown shape parameter, as particular cases. In this sense, we generalize the proposal given in Du et al. (2013) by considering a preliminary scale estimator. The paper is organized as follows. Section 2 described the proposed robust estimators. In particular, since our approach is based on splines, a data–driven robust selection method for the knots is described. Consistency and rates of convergence for the proposed estimators are given in Section 3. The particular case of the log–Gamma model is considered in Section 4, while in Section 5, a numerical study is carried out to examine the small sample properties of the proposed procedures. An application to a real data set is provided in Section 6, while concluding remarks are given in Section 7. Some comments regarding the Fisher–consistency of the proposed estimators are given in Appendix A, while the proofs of the main results are relegated to Appendix B.

## 2 The robust estimators

Let be a weight function to control leverage points on the carriers and

a loss function. Define the functions

 Ln(\boldmathβ,g,a) = 1nn∑i=1ρ(yi,x% \footnotesize\sc ti\boldmathβ+g(ti),a)w(xi) (2) L(\boldmathβ,g,a) = Eρ(y1,x\footnotesize\sc t% 1\boldmathβ+g(t1),a)w(x1). (3)

As in Lu (2010, 2015) and Du et al. (2013), consider where is a sequence of knots that partition the closed interval into subintervals , for and .

Denote as the class of splines of order with knots . According to Corollary 4.10 of Schumaker (1981), for any , there exist a class of spline basis functions , with , such that . Furthermore, according to Theorem 5.9 of Schumaker (1981), the spline is monotonically nondecreasing on if nondecreasing constraints are imposed on the coefficients , i.e., when .

Therefore, we can define a collection of monotone non-decreasing splines on , , which is a subclass , through

 Mn(Tn,ℓ)={kn∑i=jλjBj:λ1≤⋯≤λkn},

where the non-decreasing constraints are imposed on the coefficients to guarantee monotonicity. Hence, the function can be approximated as with

the vector of

spline basis functions, the spline coefficient vector such that .

This suggests that estimators of may be obtained minimizing over and , where is a robust consistent estimator of , for instance, previously computed without the monotonicity constraint. More precisely, the estimators are defined through the values such that

 (ˆ\boldmathβ,ˆ\boldmathλ)=argmin\scriptsize\boldmathβ∈Rp,\scriptsize\boldmathλ∈LknLn(\boldmathβ,kn∑j=1λjBj,ˆκ), (4)

where . If we denote , we have that

 (ˆ\boldmathβ,ˆ\boldmathλ)=argmin\scriptsize\boldmathβ∈Rp,\scriptsize\boldmathλ∈Lkn1nn∑i=1ρ(yi,x\footnotesize% \sc ti\boldmathβ+B\footnotesize\sc ti%\boldmath$λ$,ˆκ)w(xi). (5)

Let . Throughout the paper, we will assume Fisher–consistency, i.e.,

 L(\boldmathβ0,η0,κ0)=min% \scriptsize\boldmathβ∈Rp,g∈GL(\boldmathβ,g,κ0), (6)

with being the unique minimum, that is, for any , . This is a usual condition in robustness and it states that our target are indeed the true parameters of the model. A similar condition for generalized linear models was required in Bianco et al. (2013a) and for generalized partial linear models in Boente et al. (2006) and Boente and Rodríguez (2010) who provide conditions ensuring that .

###### Remark 2.1.

As mentioned in Lu (2015), if , the function is non–decreasing, but the linear inequality constraint on the coefficients is not a necessary condition. However, for quadratic splines, the coefficients condition is sufficient and necessary for monotonicity.

### 2.1 The loss function

Under a fully parametric generalized linear model, the selected loss function aims to bound either large values of the deviance or of the Pearson residuals. We refer to Bianco and Yohai (1996), Croux and Haesbroeck (2003), Bianco et al. (2005) and Cantoni and Ronchetti (2001), where different choices for the loss function are given. On the other hand, optimally bounded score functions have been studied in Stefanski et al. (1986). We briefly remind the definition of the family which bounds the deviance which is the function used in our simulation study, for more details see, for instance, Boente et al. (2006) who considered this family of loss functions to estimate the parameters of a generalized partial linear model using a profile–kernel approach.

Let be a bounded non–decreasing function with continuous derivative , being the tuning constant. Typically, is a function performing like the identity function in a neighbourhood of 0 but bounding large values of the deviance. Denote as the density of the distribution function with . In this setting, the robust deviance–based estimator are related to the following choice for the function

 ρ(y,u,a)=φa[−logf(y,H(u))+logf(y,y)]+Ga(H(u)). (7)

The correction term is given by

 G′a(s)=Es(φ′a[−logf(y,s)+logf(y,y)]f′(y,s)f(y,s)),

where indicates expectation taken under and is a shorthand for . It is worth noticing that , and when considering the maximum likelihood estimator, under a generalized linear model. For a general function , the correction factor is included to guarantee Fisher–consistency under the true model, as for generalized linear models. If the correction factor is taken equal to , the results stated in Section 3 only ensure that the estimators will be consistent to the minimizer of , where is defined in (3). However, as discussed in Bianco et al. (2005), when considering a continuous family of distributions with strongly unimodal density function, the correction term can be avoided. In this case, may play the role of the tuning constant. For instance, for the Gamma distribution, the tuning constant depends on the shape parameter so, if the shape is unknown, initial estimators need to be considered. Further details are given in Section 4.

Note that for the Poisson and logistic regression models, we have

, so does not need to be estimated, hence . Furthermore, as noted by Croux and Haesbroeck (2003) for the logistic model, in order to guarantee existence of solution, beyond the overlapping condition required for the maximum likelihood estimator, the derivative of the function must satisfy additional constraints. More precisely, needs to be increasing on and decreasing on for some or increasing on and also to fulfil that for any . An example of function satisfying these conditions is also given therein.

On the other hand, when , the usual square loss function is replaced by a function after scaling the residuals to control the effect of large responses. More precisely, let stands for function as defined in Maronna et al. (2006), i.e., an even continuous, non-decreasing function with and such that when with . Then, when the link function equals to identity function and, as mentioned in the Introduction, plays the role of the scale parameter.

###### Remark 2.2.
1. As noted in Boente et al. (2006), under a logistic partially linear regression model, Fisher–consistency can easily be derived for the loss function given by (7), when satisfies the regularity conditions stated in Bianco and Yohai (1996), , for all , and

 P(x\footnotesize\sc t\boldmathβ=a0|t=t0)<1,∀(\boldmathβ,a0)≠0%andforalmostall$t0$. (8)

Moreover, taking conditional expectations with respect to , it is easy to verify that is the unique minimizer of in this case. Condition (8) does not allow to include an intercept, so that the model will be identifiable.

2. Under a generalized partially linear model with responses having a gamma distribution, Theorem 1 of Bianco et al. (2005) allows us to derive Fisher–consistency for the nonparametric and parametric components, if the score function is bounded and strictly increasing on the set where it is not constant and if (8) holds (see Section 4).

3. Finally, consider the partially linear model where are independent of , that is, the link function equals . In this case, Fisher–consistency holds if, for instance, the errors have a symmetric distribution with density strictly unimodal, the loss function equals with a function as defined in Maronna et al. (2006), i.e., an even continuous, non-decreasing function with and such that when with . Furthermore, we also have that , for any , see Appendix A for a proof.

### 2.2 Selection of kn

A remaining question is the choice of the number knots and their location for the space of splines. Knot selection is more important for the estimate of than for the estimate of . One approach is to use uniform knots which is the approach followed in our simulation study. Uniform knots are usually sufficient when the function

does not exhibit dramatic changes in its derivatives. On the other hand, non–uniform knots are desirable when the function has very different local behaviours in different regions. Another commonly used approach is to consider as knots quantiles of the observed

with uniform percentile ranks.

The number of knots or equivalently the number of elements of the basis (recall that ) may be determined by a model selection criterion. Suppose that is the estimator solution of (4) with a dimensional spline space. As in He and Shi (1996) and He et al. (2002), for each define a criterion analogous to Schwartz (1978) information criterion

 BIC(k)=1nn∑i=1ρ(yi,x% \footnotesize\sc tiˆ\boldmathβ(k)+k∑j=1ˆλ(k)jBj(ti),ˆκ)w(xi)+logn2n(k+p).

Large values of indicate poor fits. A robust version of the Akaike criterion considered in Lu (2015) can also be considered. As is usual in spline–based procedures the number of knots should increase slowly with the sample size to attain an optimal rate of convergence. When it is assumed that is twice continuously differentiable and cubic splines () are considered, as in our simulation study, according to the convergence rate derived in Theorem 3.2, a possible criterion is to search for the first (i.e. smallest ) local minimum of in the range of . Within this range, there is usually only one local minimum. The reason for being larger than is that for cubic splines the smallest possible choice is . Also note that the global minimum of actually occurs at a saturated model in which , so is a valid criterion only for a limited range of .

## 3 Consistency

In this section, we will derive, under some regularity conditions, consistency and rates of convergence for the estimators defined in the previous Section. We will begin by fixing some notation. Let the Euclidean norm of and . For any continuous function denote and . From now on, stands for a neighbourhood of with closure strictly included in and will denote the family of functions

 Fn={f(y,x,t)=ρ(y,x% \footnotesize\sc t\boldmathβ+\boldmathλ% \footnotesize\sc tB(t),a)w(x),\boldmathβ% ∈Rp,\boldmathλ∈Lkn,a∈V}.

Furthermore, for any measure , and stand for the covering and bracketing numbers of the class with respect to the distance in , as defined, for instance, in van der Vaart and Wellner (1996).

### 3.1 Consistency results

To derive the consistency of our proposal in the general framework we are considering, we will need the following set of assumptions whose validity is discussed in Remark 3.1.

• The estimators of are strongly consistent.

• and are non–negative and bounded functions and is a continuous function. Moreover, satisfies the following equicontinuity condition: for any there exists such that for any ,

 |a1−a2|<δ⇒sup\scriptsize\boldmathβ∈Rk,\scriptsize\boldmathλ∈Lkn|L⋆(\boldmathβ,\boldmathλ,a1)−L⋆(\boldmathβ,\boldmathλ,a2)|<ϵ.
• The true function is nondecreasing and its th derivative satisfies a Lipschitz condition on , with , that is,

 η0∈Hr={g∈Cr[0,1]:∥g(j)∥∞≤C1,0≤j≤r and |g(r)(z1)−g(r)(z2)|≤C2|z1−z2|}.
• The maximum spacing of the knots is assumed to be of order , . Moreover, the ratio of maximum and minimum spacings of knots is uniformly bounded.

• The class of functions is such that, for any , , for some constant independent of and .

For simplicity, denote as , where and the estimators defined through (4) with . To measure the closeness between the estimators and the parameters, consider the metric where stands for a norm in the space of functions , such as or . Let .

###### Theorem 3.1.

Let be i.i.d. observations satisfying (1). Assume that C0 to C4 hold and that for any , and that for . Then, we have that .

###### Remark 3.1.

As mentioned above, for the logistic and Poisson model, is known and does not need to be estimated, hence C0 may be omitted. On the other hand, when the scale parameter may be estimated using any robust scale estimator computed without using the monotone constraint. To be more precise, let be the robust estimators of defined in Bianco and Boente (2004) and define the residuals as . The scale estimator can be taken as . Another possibility is to consider a scale estimator based on a function as follows. As in Maronna et al. (2006), let be a function, that is, an even function, non–decreasing on , increasing for when and such that . The estimator of the scale is the solution

 1nn∑i=1χc(ris) =b, (9)

where , is a user–chosen tuning constant and is related to the breakdown point of the scale estimator. If is bounded, it is usually assumed that in which case . For instance, when is the Tukey’s biweight function, the choice and

leads to an scale estimator Fisher–consistent at the normal distribution with breakdown point

. On the other hand, the choice , and leads to . Similarly, when the responses have a Gamma distribution the parameter corresponds to the tuning constant and is related to the shape parameter. It can be estimated using a preliminary estimator computed without making use of the monotone restriction, as described in Section 4. Straightforward calculations allow to show that in both situations C0 holds.

Assumption C1 is a standard requirement since it states that the weight function controls large values of the covariates and that the score function bounds large residuals, respectively. Moreover, the equicontinuity requirement allows to deal with the nuisance parameter in a general setting and a similar condition appears in Bianco et al. (2013a). For the particular case of a partly linear regression model, i.e., when , is the scale parameter and the function is usually chosen as where the function is an even, bounded function, non–decreasing on . In this case, the equicontinuity condition is satisfied, for instance, if is continuously differentiable with first derivative such that is bounded.

C2 and C3 are conditions regarding the smoothness of the nonparametric component and the knots spacing. They are analogous to those considered, for instance, in Lu (2010, 2015). On the other hand, the requirement ensures that does not attain a minimum value at infinite. It was also a requirement in Boente et al. (2006) and Boente and Rodríguez (2010) to guarantee strong consistency. It can be replaced by the condition that lie ultimately in a compact set since is the unique minimizer of as stated in (6).

Assumption C4 is satisfied for most loss functions . Effectively, assume that is known and that the densities are such that the covering number of the class

grows at a polynomial rate, i.e., it is bounded by . Then, if the functions and are of bounded variation, we obtain the result using that . A similar bound can be obtained for the bracketing numbers. For the score functions usually considered in robustness, such as the Tukey’s biweight function or the score function introduced in Croux and Haesbroeck (2002) for the logistic model, and have bounded variation and the required condition is easily verified using the permanence properties of classes of functions since the class is a finite–dimensional class and so a class. Furthermore, if plays the role of the tuning constant or the scale parameter, as in the Gamma model or when and the errors have a symmetric distribution, the same conclusions hold.

### 3.2 Convergence rates

In order to derive rates of convergence for the estimators, we choose as norm in the space of functions , the norm, with , where . Hence, we include as possible norms or , in which case with or , respectively. Furthermore, in this setting we define the distance

 π2P(\boldmathθ1,\boldmathθ2)=E(w(x)[x\footnotesize\sc t(\boldmathβ1−\boldmathβ2)+η1(t)−η2(t)]2),

where for , .

We consider the following additional assumptions. Two possible conditions on the bracketing entropy are stated below and according to them weaker or stronger convergence rates are attained. Conditions under which they hold for some particular models are given in Remark 3.2.

To avoid requiring an order of consistency to the estimator of , from now on we will assume that for any and , such that . This condition clearly entails Fisher–consistency and holds, for instance, for the log–partly linear regression model and when if the errors have a symmetric distribution.

From now on, for , stands for the spline function .

• Let . For some constant independent of , and , we have that .

• For , the family of functions is such that for any , , for some constant independent of and .

• The function is twice continuously differentiable with respect to its second argument with derivatives and such that

• , almost surely, for any .

• .

• There exists and a positive constant , such that for any with and any , .

###### Theorem 3.2.

Let be i.i.d. observations satisfying (1) and for . Assume that C1 to C3 and C6 to C8 hold and that . Then, we have that

• if C5 holds, , where , so if , the estimators converge at the optimal rate .

• if C5 holds, , for any , such that and .

###### Remark 3.2.

Note that condition C6b) is analogous to the conditional Fisher–consistency stated in Kunsch et al. (1989), while condition C5 is analogous to assumption C3 in Shen and Wong (1994). Similar arguments to those considered in Shen and Wong (1994) when analysing the Case 3 in page 596, allow to show that C5 holds, for instance, when when is continuously differentiable with first derivative such that is bounded. It also holds for the logistic model and for the gamma model when is bounded using C6a).

## 4 The log–Gamma regression model

Among generalized linear models, the Gamma distribution with a log–link, usually denoted log–Gamma regression, plays an important role, see Chapter 8 of McCullagh and Nelder (1989). For any and , denote as the parametrization of the Gamma distribution given by the density

 f(y,α,μ)=ααyα−1exp(−(α/μ)y){μαΓ(α)}−1Iy≥0.

Under a log–Gamma model, , where with link function . As it is well known, in this case, the responses can be transformed so that they are modelled through a linear regression model with asymmetric errors (see for instance Cantoni and Ronchetti, 2006). Let be the transformed responses, then

 zi=x\footnotesize\sc ti\boldmathβ0+η0(ti)+ui, (10)

where and are independent. Moreover, with density

 g(u,α)=ααΓ(α)exp[α(u−exp(u))]. (11)

This density is asymmetric and unimodal with maximum at . For fully parametric linear models. i.e., when , a description on robust estimators based on deviances was given in Bianco et al. (2005), while Heritier et al. (2009) considered type estimators based on Pearson residuals. For the sake of completeness, we will describe how to adapt the estimators based on deviances to the present situation.

We will consider the transformed model (10) and denote by the deviance component of the -th observation, i.e.,

where .

In this setting, the classical estimators to be considered below are not based on the quasi–likelihood but on the deviance and they correspond to the choice in (7), since no tuning constant is needed. Thus, the loss function equals , while , . Hence, if , the classical estimators of without any restriction are obtained as where with

 (ˆ\boldmathβ,ˆ\boldmathλ)=argmin\scriptsize\boldmathβ,% \scriptsize\boldmathλn∑i=1d(zi−[x\footnotesize\sc ti\boldmathβ+\boldmathλ\footnotesize\sc tBi]),

where, for the sake of simplicity, we have denoted as , so