 # Nonlinear Simplex Regression Models

In this paper, we propose a simplex regression model in which both the mean and the dispersion parameters are related to covariates by nonlinear predictors. We provide closed-form expressions for the score function, for Fisher's information matrix and its inverse. Some diagnostic measures are introduced. We propose a residual, obtained using Fisher's scoring iterative scheme for the estimation of the parameters that index the regression nonlinear predictor to the mean response and numerically evaluate its behaviour. We also derive the appropriate matrices for assessing local influence on the parameter estimates under different perturbation schemes. We also proposed a scheme for the choice of starting values for the Fisher's iterative scheme for nonlinear simplex models. The diagnostic techniques were applied on actual data. The local influence analyses reveal that the simplex models can be a modeling alternative more robust to influential cases than the beta regression models, both to linear and nonlinear models.


## 1 Introduction

In several practical situations, whether experimental or observational, there is interest in investigating how a set of variables relates to percentages or rates. Among the most suitable models for this type of data are the simplex and beta regression models (Kieschnick and McCullough, 2003). The beta regression model was proposed by Ferrari and Cribari-Neto (2004) and extended to nonlinear models by Simas et al. (2010). Rocha and Simas (2011) also developed residuals and local influence measures for the class of nonlinear beta regression models.

A competitive alternative to the beta regression model is the simplex regression model proposed by Barndorff-Nielsen and Jørgensen (1991). The simplex distribution belongs to the dispersion models defined by Jørgensen (1997), which extend the generalized linear models. The simplex distribution has been widely used to model data constrained to the interval $(0,1)$. Song and Tan (2000), for example, proposed a simplex regression model with constant dispersion to model longitudinal continuous proportion data under the generalized estimating equation (GEE) approach. This approach was modified by Song and Tan (2004) to allow the dispersion parameter to vary across observations. After that, Qiu et al. (2008) introduced the simplex mixed-effects model and, more recently, Zhang et al. (2016) implemented the simplexreg package in R, available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=simplexreg.

In this paper, as our most important contribution, we propose the general nonlinear simplex regression model, which considers nonlinear structures in the parameters for both the mean and dispersion submodels. We present analytical expressions for the score vector and for Fisher's information matrix and its inverse, and the estimation of the model parameters is performed by maximum likelihood. To that end arises the second important contribution of this paper: a scheme for choosing initial guesses for the maximum likelihood estimation of the parameters of nonlinear simplex regression models. Our proposal is based on Espinheira et al. (2017), which has been shown to be extremely relevant in ensuring the convergence of the log-likelihood maximization procedure based on the BFGS method with analytical first derivatives for nonlinear beta regression models.

We also propose diagnostic tools for the nonlinear simplex regression model, our third important contribution. We have developed a residual based on Fisher's scoring iterative algorithm for the estimation of the mean submodel coefficients. Since the distribution of this residual is not always well approximated by the standard normal distribution, we adopt the strategy of using new thresholds for residual plots, based on the simulated envelope algorithm associated with normal probability plots proposed by Espinheira et al. (2017). The authors show that by using these thresholds one is more likely to identify atypical points than when using the usual limits based on the normal distribution assumption for the residuals.

Finally, we develop local influence measures (Cook, 1986) based on two traditional perturbation schemes: case weighting and covariate perturbation. Furthermore, we propose a new perturbation approach for the response variable, which takes into account the peculiar aspects of the simplex distribution. These local influence schemes are appropriate both for linear and nonlinear simplex regression models.

We present two applications to real data in order to evaluate the behaviour of simplex regression models relative to beta regression models. We show that both models are important and that each can be an alternative to the other depending on the features of the response variable. In particular, in the two examples presented, the modeling based on the simplex distribution yielded a maximum likelihood estimation process more robust to influential points than the one based on the beta distribution.

## 2 The simplex distribution

If a random variable $y$ follows the simplex distribution with parameters $\mu \in (0,1)$ and $\sigma^2 > 0$, denoted by $y \sim S(\mu,\sigma^2)$, its density is given by

$$p(y;\mu,\sigma^2)=\left[2\pi\sigma^2\{y(1-y)\}^{3}\right]^{-1/2}\exp\left\{-\frac{1}{2\sigma^{2}}\,d(y;\mu)\right\},\quad y\in(0,1), \tag{1}$$

in which the deviance component is given by

$$d(y;\mu)=\frac{(y-\mu)^{2}}{y(1-y)\mu^{2}(1-\mu)^{2}}. \tag{2}$$

The variance function for the simplex distribution is given by $V(\mu)=\mu^{3}(1-\mu)^{3}$. By Jørgensen (1997), it follows that $\mathrm{E}(y)=\mu$ and

$$\operatorname{Var}(y)=\mu(1-\mu)-\sqrt{\frac{1}{2\sigma^{2}}}\exp\left\{\frac{1}{2\sigma^{2}\mu^{2}(1-\mu)^{2}}\right\}\Gamma\left\{\frac{1}{2},\frac{1}{2\sigma^{2}\mu^{2}(1-\mu)^{2}}\right\}. \tag{3}$$

Here, $\Gamma(a,x)$ is the upper incomplete gamma function, defined by $\Gamma(a,x)=\int_{x}^{\infty}t^{a-1}e^{-t}\,dt$.

The simplex distribution is very flexible for modeling data in the continuous interval $(0,1)$, presenting different shapes according to the values of the parameters that index the distribution. In addition to the left-asymmetric, right-asymmetric and inverted shapes known from the beta distribution, the simplex model is very useful for accommodating data with bimodal distributions.
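To make the distribution concrete, the density (1), the deviance (2), and the variance expression (3) can be checked numerically. The Python sketch below, with illustrative parameter values of our own choosing, verifies that the density integrates to one, that $\mathrm{E}(y)=\mu$, and that the numerical variance matches the closed-form expression.

```python
import numpy as np
from scipy import integrate, special

def simplex_deviance(y, mu):
    """Unit deviance d(y; mu) of the simplex distribution, Eq. (2)."""
    return (y - mu) ** 2 / (y * (1 - y) * mu ** 2 * (1 - mu) ** 2)

def simplex_pdf(y, mu, sigma2):
    """Simplex density p(y; mu, sigma^2), Eq. (1), for y in (0, 1)."""
    norm = (2 * np.pi * sigma2 * (y * (1 - y)) ** 3) ** (-0.5)
    return norm * np.exp(-simplex_deviance(y, mu) / (2 * sigma2))

def simplex_var(mu, sigma2):
    """Closed-form variance, Eq. (3), via the upper incomplete gamma function."""
    a = 1.0 / (2 * sigma2 * mu ** 2 * (1 - mu) ** 2)
    upper_gamma = special.gammaincc(0.5, a) * special.gamma(0.5)
    return mu * (1 - mu) - np.sqrt(1 / (2 * sigma2)) * np.exp(a) * upper_gamma

# Numerical sanity checks for the illustrative values mu = 0.3, sigma^2 = 2
mu, sigma2 = 0.3, 2.0
total, _ = integrate.quad(simplex_pdf, 0.0, 1.0, args=(mu, sigma2))
mean, _ = integrate.quad(lambda y: y * simplex_pdf(y, mu, sigma2), 0.0, 1.0)
var, _ = integrate.quad(lambda y: (y - mu) ** 2 * simplex_pdf(y, mu, sigma2), 0.0, 1.0)
```

The incomplete gamma term is computed with `scipy.special.gammaincc`, which returns the regularized upper incomplete gamma function, so it is rescaled by $\Gamma(1/2)$.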

## 3 The class of nonlinear simplex regression models

Let $y_1,\dots,y_n$ be independent random variables such that each $y_t$, $t=1,\dots,n$, is simplex-distributed, i.e., each $y_t$ has density (1) with mean $\mu_t$ and dispersion $\sigma^2_t$. The nonlinear simplex regression models are defined by (1) and by the systematic components:

$$g(\mu_t)=f_1(x_t^{\top};\beta)=\eta_t\quad\text{and}\quad h(\sigma^2_t)=f_2(z_t^{\top};\gamma)=\zeta_t, \tag{4}$$

in which $\beta=(\beta_1,\dots,\beta_k)^{\top}$ and $\gamma=(\gamma_1,\dots,\gamma_q)^{\top}$ are, respectively, a $k$-vector and a $q$-vector of unknown parameters, $\eta_t$ and $\zeta_t$ are nonlinear predictors, and $f_1$ and $f_2$ are linear or nonlinear functions, continuous and differentiable in the second argument, so that the matrices of derivatives $\tilde X=\partial\eta/\partial\beta$ and $\tilde Z=\partial\zeta/\partial\gamma$ have ranks $k$ and $q$, respectively. Here, for $t=1,\dots,n$, $x_t$ and $z_t$ are observations of known covariates, which may coincide in whole or in part. We assume that the link functions $g:(0,1)\to\mathbb{R}$ and $h:(0,\infty)\to\mathbb{R}$ are strictly monotonic and twice differentiable. Different link functions can be chosen for $g$ and $h$. For example, for $g$ we can use the logit specification $g(\mu)=\log\{\mu/(1-\mu)\}$; the probit function $g(\mu)=\Phi^{-1}(\mu)$, where $\Phi$ denotes the standard normal distribution function; the complementary log-log function $g(\mu)=\log\{-\log(1-\mu)\}$; and the log-log function $g(\mu)=-\log\{-\log(\mu)\}$, among others. Since $\sigma^2_t>0$, one can use $h(\sigma^2)=\log(\sigma^2)$ or the square root function $h(\sigma^2)=\sqrt{\sigma^2}$. Even the identity function can be used, provided one takes into account the positiveness of the estimates.

Based on (1), it follows that the log-likelihood function has the form $\ell(\beta,\gamma)=\sum_{t=1}^{n}\ell_t(\mu_t,\sigma^2_t)$, in which

$$\ell_t(\mu_t,\sigma^2_t)=-\frac{1}{2}\log 2\pi-\frac{1}{2}\log\sigma^2_t-\frac{3}{2}\log\{y_t(1-y_t)\}-\frac{1}{2\sigma^2_t}\,d(y_t;\mu_t). \tag{5}$$
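As a quick consistency check, the per-observation log-likelihood (5) is simply the logarithm of the density (1). A minimal Python sketch:

```python
import numpy as np

def simplex_deviance(y, mu):
    # Unit deviance d(y; mu), Eq. (2)
    return (y - mu) ** 2 / (y * (1 - y) * mu ** 2 * (1 - mu) ** 2)

def loglik_t(y, mu, sigma2):
    """Per-observation log-likelihood ell_t(mu_t, sigma2_t), Eq. (5)."""
    return (-0.5 * np.log(2 * np.pi) - 0.5 * np.log(sigma2)
            - 1.5 * np.log(y * (1 - y))
            - simplex_deviance(y, mu) / (2 * sigma2))
```

For a sample, the log-likelihood is the sum of `loglik_t` over the observations, with $\mu_t$ and $\sigma^2_t$ obtained by applying the inverse links to the predictors in (4).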

Thus, it follows that the score vector is defined by

$$U_\beta(\beta,\gamma)=\tilde X^{\top}SUT(y-\mu)\quad\text{and}\quad U_\gamma(\beta,\gamma)=\tilde Z^{\top}Ha. \tag{6}$$

In (6), $y=(y_1,\dots,y_n)^{\top}$ and $\mu=(\mu_1,\dots,\mu_n)^{\top}$ are $n$-vectors and $a=(a_1,\dots,a_n)^{\top}$, where

$$a_t=\frac{d(y_t;\mu_t)}{2(\sigma^2_t)^{2}}-\frac{1}{2\sigma^2_t}\quad\text{and}\quad u_t=\frac{1}{\mu_t(1-\mu_t)}\left\{d(y_t;\mu_t)+\frac{1}{\mu_t^{2}(1-\mu_t)^{2}}\right\}, \tag{7}$$

with $d(y_t;\mu_t)$ given by (2). Furthermore,

$$S=\operatorname{diag}\!\left\{\frac{1}{\sigma^2_1},\dots,\frac{1}{\sigma^2_n}\right\},\quad T=\operatorname{diag}\!\left\{\frac{1}{g'(\mu_1)},\dots,\frac{1}{g'(\mu_n)}\right\},\quad U=\operatorname{diag}\{u_1,\dots,u_n\}\quad\text{and}\quad H=\operatorname{diag}\!\left\{\frac{1}{h'(\sigma^2_1)},\dots,\frac{1}{h'(\sigma^2_n)}\right\}. \tag{8}$$

Fisher's information for $\beta$ and $\gamma$ is a block-diagonal matrix $K(\beta,\gamma)=\operatorname{diag}\{K_{\beta\beta},K_{\gamma\gamma}\}$, in which $K_{\beta\beta}=\tilde X^{\top}SW\tilde X$ and $K_{\gamma\gamma}=\tilde Z^{\top}D\tilde Z$. Here, $W=\operatorname{diag}\{w_1,\dots,w_n\}$ and $D=\operatorname{diag}\{d_1,\dots,d_n\}$, with

$$w_t=\frac{v_t}{\sigma^2_t\,g'(\mu_t)^{2}},\quad v_t=\sigma^2_t\left\{\frac{3\sigma^2_t}{\mu_t(1-\mu_t)}+\frac{1}{\mu_t^{3}(1-\mu_t)^{3}}\right\}\quad\text{and}\quad d_t=\frac{1}{2(\sigma^2_t)^{2}h'(\sigma^2_t)^{2}}. \tag{9}$$

Since $K(\beta,\gamma)$ is a block-diagonal matrix, the parameter vectors $\beta$ and $\gamma$ are globally orthogonal (Cox and Reid, 1987), so that their maximum likelihood estimators $\hat\beta$ and $\hat\gamma$, respectively, are asymptotically independent. For large samples and under regularity conditions, we have, approximately,

$$\begin{pmatrix}\hat\beta\\ \hat\gamma\end{pmatrix}\sim N_{k+q}\left(\begin{pmatrix}\beta\\ \gamma\end{pmatrix},K^{-1}\right),\quad\text{with}\quad K^{-1}=K(\beta,\gamma)^{-1}=\begin{pmatrix}K^{\beta\beta}&0\\0&K^{\gamma\gamma}\end{pmatrix}.$$

Here,

$$K^{\beta\beta}=(\tilde X^{\top}SW\tilde X)^{-1}\quad\text{and}\quad K^{\gamma\gamma}=(\tilde Z^{\top}D\tilde Z)^{-1}. \tag{10}$$

The maximum likelihood estimators of $\beta$ and $\gamma$ are obtained as the solutions of $U_\beta(\beta,\gamma)=0$ and $U_\gamma(\beta,\gamma)=0$. However, we emphasize that the maximum likelihood estimators in nonlinear models do not usually have closed-form expressions, making it necessary to use iterative methods, such as quasi-Newton algorithms (e.g., BFGS); see, for example, Nocedal and Wright (1999). These optimization algorithms require the specification of a starting value to initialize the iterative process.
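As an illustration of such an iterative fit, the sketch below maximizes the simplex log-likelihood (5) with `scipy.optimize.minimize` (BFGS with numerical gradients, for brevity) for a linear specification with logit mean link and log dispersion link. The data are synthetic placeholders of our own, not from the paper.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # inverse logit

def deviance(y, mu):
    # Unit deviance d(y; mu), Eq. (2)
    return (y - mu) ** 2 / (y * (1 - y) * mu ** 2 * (1 - mu) ** 2)

def negloglik(theta, y, X, Z):
    """Negative simplex log-likelihood, Eq. (5), logit/log links."""
    k = X.shape[1]
    beta, gamma = theta[:k], theta[k:]
    mu = np.clip(expit(X @ beta), 1e-12, 1 - 1e-12)   # logit mean link
    sigma2 = np.exp(np.clip(Z @ gamma, -30.0, 30.0))  # log dispersion link
    ll = (-0.5 * np.log(2 * np.pi * sigma2 * (y * (1 - y)) ** 3)
          - deviance(y, mu) / (2 * sigma2))
    return -ll.sum()

# Synthetic placeholder data (hypothetical, for illustration only)
rng = np.random.default_rng(42)
n = 50
X = np.column_stack([np.ones(n), rng.uniform(size=n)])
Z = X.copy()
y = np.clip(rng.beta(2, 5, size=n), 1e-3, 1 - 1e-3)

theta0 = np.zeros(4)  # starting value for (beta, gamma)
fit = minimize(negloglik, theta0, args=(y, X, Z), method="BFGS")
```

In practice the analytical score (6) would be supplied as the gradient, as the paper does; the numerical-gradient call above is only a shortcut for the sketch.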

## 4 New starting values for log-likelihood maximization

The estimation of parameters in nonlinear models can be an arduous task when Fisher's scoring method is used. However, a well-structured starting-values procedure, which takes into account the features of the response variable distribution, can guarantee both the convergence of the process and the feasibility of the maximum likelihood estimates. Our proposal is based on Espinheira et al. (2017), who proposed a starting-values procedure for the nonlinear beta regression model. We take the first-order Taylor expansion of $f(x_t,\beta)=f_1(x_t^{\top};\beta)$, given by

$$f(x_t,\beta)\approx f(x_t,\beta^{(0)})+\sum_{i=1}^{k}\left[\frac{\partial f(x_t,\beta)}{\partial\beta_i}\right]_{\beta=\beta^{(0)}}\left(\beta_i-\beta_i^{(0)}\right), \tag{11}$$

where $\beta^{(0)}$ is an initial guess. Viewing (11) as a linear model, the least squares estimator of $\beta$ is obtained by regressing $(g(y_1),\dots,g(y_n))^{\top}$ on the matrix of derivatives of the predictor evaluated at $\beta^{(0)}$. As can be noticed, the procedure we propose for the starting points involves two steps. The first step is to obtain a least squares estimate based on a linearized predictor; then, we compute the nonlinear initial guess from it. For the precision/dispersion submodel, we proceed analogously, replacing the response by initial estimates of $h(\sigma^2_t)$ and the covariate matrix by the one associated with the dispersion predictor.

Typically, in one or both submodels there are more parameters than covariates in the nonlinear predictors. In those cases, it is necessary to assign numerical values to some parameters in order to obtain a predictor formed by covariates that no longer involves unknown parameters. This step is very relevant, since these initial numerical guesses must respect the features of both the covariates and their relation to the mathematical functions that describe the nonlinear predictors. After this step we construct a matrix based on a linear predictor and compute the corresponding least squares estimate. Based on this two-step procedure it is possible to obtain starting values even when the number of parameters exceeds the number of covariates.
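A minimal sketch of the two-step idea (our own illustration, with a hypothetical nonlinear predictor $\eta_t=\beta_1+x_t^{\beta_2}$ and logit link): step 1 obtains a guess from a least squares fit of $g(y)$ on a linear predictor; step 2 refines it through one least squares pass on the linearized model (11).

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

# Hypothetical data for a mean submodel with eta_t = beta1 + x_t ** beta2
rng = np.random.default_rng(0)
x = rng.uniform(0.5, 2.0, size=40)
y = rng.uniform(0.05, 0.95, size=40)  # placeholder responses in (0, 1)

# Step 1: linear-model guess -- least squares of g(y) on (1, x)
X_lin = np.column_stack([np.ones_like(x), x])
b_lin, *_ = np.linalg.lstsq(X_lin, logit(y), rcond=None)

# Step 2: one least squares pass on the linearized model, Eq. (11)
beta0 = np.array([b_lin[0], 1.0])  # beta2 = 1 makes the predictor linear
f0 = beta0[0] + x ** beta0[1]
J = np.column_stack([np.ones_like(x), x ** beta0[1] * np.log(x)])  # d f / d beta
step, *_ = np.linalg.lstsq(J, logit(y) - f0, rcond=None)
beta_start = beta0 + step  # starting value for the scoring/BFGS iterations
```

Setting $\beta_2^{(0)}=1$ in step 2 is an instance of assigning a numerical value to the parameter that has no dedicated covariate, as described above.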

## 5 Weighted residual

We can define a residual as a measure that aims to identify discrepancies between the estimated model and the data. Thus, it is relevant to take into account the features of the probability distribution of the response variable, as well as the features of the estimation process of the model. From this perspective, Espinheira et al. (2008b) suggested using standardized residuals obtained from the convergence of Fisher's scoring iterative process for the estimation of the vector of regression parameters.

Based on the nonlinear simplex regression model in (4), on the score function for $\beta$ in (6) and on the inverse of Fisher's information for $\beta$ in (10), it follows that Fisher's scoring iterative algorithm for estimating $\beta$ is given by

$$\beta^{(m+1)}=\beta^{(m)}+\left(\tilde X^{\top}_{(m)}S_{(m)}W_{(m)}\tilde X_{(m)}\right)^{-1}\tilde X^{\top}_{(m)}S_{(m)}U_{(m)}T_{(m)}\left(y-\mu^{(m)}\right), \tag{12}$$

where $m$ indexes the iterations, which take place until the distance between $\beta^{(m+1)}$ and $\beta^{(m)}$ becomes smaller than a given small constant. It is possible to write the iterative scheme in (12) in terms of weighted least squares regressions, such that $\beta^{(m+1)}=(\tilde X^{\top}_{(m)}S_{(m)}W_{(m)}\tilde X_{(m)})^{-1}\tilde X^{\top}_{(m)}S_{(m)}W_{(m)}z^{(m)}$, where $z^{(m)}=\tilde X_{(m)}\beta^{(m)}+W^{-1}_{(m)}U_{(m)}T_{(m)}(y-\mu^{(m)})$. Upon convergence,

$$\hat\beta=\left(\hat{\tilde X}^{\top}\hat S\hat W\hat{\tilde X}\right)^{-1}\hat{\tilde X}^{\top}\hat S\hat W z,\quad\text{with}\quad z=\hat{\tilde X}\hat\beta+\hat W^{-1}\hat U\hat T(y-\hat\mu). \tag{13}$$

We have that $\hat\beta$ in (13) can be viewed as the least squares estimate of $\beta$ obtained by regressing $z$ on $\hat{\tilde X}$ with weighting matrix $\hat S\hat W$. In this sense, the ordinary residual is

$$r^{*}=(\hat S\hat W)^{1/2}\left(z-\hat{\tilde X}\hat\beta\right)=\left[(\hat S\hat W)^{1/2}-(\hat S\hat W)^{1/2}\hat{\tilde X}\left(\hat{\tilde X}^{\top}\hat S\hat W\hat{\tilde X}\right)^{-1}\hat{\tilde X}^{\top}\hat S\hat W\right]z=\hat S^{1/2}\hat W^{-1/2}\hat U\hat T(y-\hat\mu). \tag{14}$$

One way to standardize the ordinary residual is to use an approximation to the variance of $z$. Write (13) as $\hat\beta=Cz$, with $C=(\tilde X^{\top}SW\tilde X)^{-1}\tilde X^{\top}SW$, and consider $\tilde X$, $S$ and $W$ evaluated at the true parameter values. Noting that the bracketed matrix in (14) equals $(I_n-H^{*})(SW)^{1/2}$ and using the fact that $\operatorname{Cov}(z)\approx(SW)^{-1}$, it follows that

$$\operatorname{Cov}(r^{*})\approx(I_n-H^{*})(SW)^{1/2}\operatorname{Cov}(z)(SW)^{1/2}(I_n-H^{*})^{\top}\approx I_n-H^{*},$$

since $H^{*}$ is symmetric and idempotent, with $H^{*}=S^{1/2}W^{1/2}\tilde X(\tilde X^{\top}SW\tilde X)^{-1}\tilde X^{\top}W^{1/2}S^{1/2}$.

Then, we propose the weighted residual for nonlinear simplex regression models, which includes the class of linear simplex models. Considering the definitions of $u_t$ and $v_t$ given in (7) and (9), it is defined as

$$r^{\beta}_t=\frac{\hat u_t(y_t-\hat\mu_t)}{\sqrt{\hat v_t\left(1-h^{*}_{tt}\right)}}, \tag{15}$$

where $h^{*}_{tt}$ denotes the $t$th diagonal element of $H^{*}$.
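The residual (15) is straightforward to compute once a fit is available. The sketch below (our own, assuming a logit mean link so that $g'(\mu)=1/\{\mu(1-\mu)\}$) assembles $u_t$, $v_t$, $w_t$, the matrix $H^{*}$ and then $r^{\beta}_t$ from hypothetical fitted quantities.

```python
import numpy as np

def weighted_residual(y, mu, sigma2, Xt, g_prime):
    """Weighted residual r^beta_t of Eq. (15).
    y, mu, sigma2: per-observation fitted quantities;
    Xt: matrix of derivatives of eta w.r.t. beta; g_prime: g'(mu_t)."""
    d = (y - mu) ** 2 / (y * (1 - y) * mu ** 2 * (1 - mu) ** 2)   # Eq. (2)
    u = (d + 1.0 / (mu ** 2 * (1 - mu) ** 2)) / (mu * (1 - mu))   # Eq. (7)
    v = sigma2 * (3 * sigma2 / (mu * (1 - mu))
                  + 1.0 / (mu ** 3 * (1 - mu) ** 3))              # Eq. (9)
    w = v / (sigma2 * g_prime ** 2)                               # Eq. (9)
    A = np.sqrt(w / sigma2)[:, None] * Xt                         # S^{1/2} W^{1/2} X~
    H = A @ np.linalg.inv(A.T @ A) @ A.T                          # H* of Section 5
    return u * (y - mu) / np.sqrt(v * (1 - np.diag(H)))

# Hypothetical fitted quantities (placeholders, logit link)
rng = np.random.default_rng(1)
n = 10
Xt = np.column_stack([np.ones(n), rng.normal(size=n)])
mu = rng.uniform(0.2, 0.8, size=n)
y = rng.uniform(0.05, 0.95, size=n)
sigma2 = np.full(n, 1.5)
r = weighted_residual(y, mu, sigma2, Xt, 1.0 / (mu * (1 - mu)))
```

In a real analysis `mu`, `sigma2` and `Xt` would come from the converged scoring iterations, not from random placeholders.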

## 6 Local influence

Let $\theta$ be a vector of unknown parameters and $\ell(\theta)$ its log-likelihood function. Now, we introduce a perturbation into the assumed model through a vector $\omega\in\Omega\subset\mathbb{R}^{n}$. Let $\ell(\theta\,|\,\omega)$ denote the log-likelihood function of the perturbed model for a given $\omega$. The maximum likelihood estimators of $\theta$ under the assumed and perturbed models are denoted by $\hat\theta$ and $\hat\theta_\omega$, respectively. The likelihood displacement $LD(\omega)=2\{\ell(\hat\theta)-\ell(\hat\theta_\omega)\}$ can be used to assess the influence of the perturbation on the maximum likelihood estimates. If minor modifications of the postulated model lead to considerable changes in inferential results, further investigation must be carried out to find a better model to fit the data.

Cook (1986) proposed analyzing the local behaviour of $LD(\omega)$ around $\omega_0$, the vector representing no perturbation, such that $LD(\omega_0)=0$. The author suggested evaluating the curvature of the plot of $LD(\omega_0+ad)$ against $a$, where $a\in\mathbb{R}$ and $d$ is a unit-norm direction. The interest lies in finding the direction $d_{\max}$ corresponding to the largest curvature $C_{d_{\max}}(\theta)$. Observations that are jointly influential can be singled out by the index plot of $d_{\max}$. Cook (1986) showed that the normal curvature in the direction $d$ is $C_d(\theta)=2|d^{\top}\Delta^{\top}\{\ddot\ell(\theta)\}^{-1}\Delta d|$, where $\ddot\ell(\theta)=\partial^{2}\ell(\theta)/\partial\theta\partial\theta^{\top}$ and $\Delta$ is the matrix given by $\Delta=\partial^{2}\ell(\theta\,|\,\omega)/\partial\theta\partial\omega^{\top}$, evaluated at $\theta=\hat\theta$ and $\omega=\omega_0$. Thus, $C_{d_{\max}}$ is the largest eigenvalue of $B=-\Delta^{\top}\{\ddot\ell(\theta)\}^{-1}\Delta$ and $d_{\max}$ is the corresponding eigenvector.

On the other hand, Lesaffre and Verbeke (1998) proposed a measure for identifying individually influential observations, known as the total local influence of observation $t$ and defined as $C_t=2|\Delta_t^{\top}\{\ddot\ell(\theta)\}^{-1}\Delta_t|$, where $\Delta_t$ is the $t$th column of $\Delta$. Here, $C_t$ is the normal curvature in the direction of the vector whose $t$th component equals one and all other elements are zero. Observations such that $C_t>2\bar C$, with $\bar C=\sum_{t=1}^{n}C_t/n$, can be taken to be individually influential. If we partition the parameter vector as $\theta=(\theta_1^{\top},\theta_2^{\top})^{\top}$, we can assess the local influence relative to $\theta_1$ through the curvature $C_{d;\theta_1}(\theta)=2|d^{\top}\Delta^{\top}\{(\ddot\ell(\theta))^{-1}-\ddot\ell_{22}\}\Delta d|$, where

$$\ddot\ell_{\theta_2\theta_2}=\frac{\partial^{2}\ell(\theta)}{\partial\theta_2\partial\theta_2^{\top}}\quad\text{and}\quad\ddot\ell_{22}=\begin{pmatrix}0&0\\0&\ddot\ell_{\theta_2\theta_2}^{-1}\end{pmatrix}.$$

In this case, $d_{\max;\theta_1}$ is the unit-norm eigenvector corresponding to the largest eigenvalue of $B_1=-\Delta^{\top}\{(\ddot\ell(\theta))^{-1}-\ddot\ell_{22}\}\Delta$.
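Numerically, once $\Delta$ and the Hessian $\ddot\ell(\theta)$ are available, $d_{\max}$ and the total local influences $C_t$ come from a single eigendecomposition. A sketch of our own follows; the inputs below are random placeholders with a negative definite Hessian, standing in for the model-based quantities.

```python
import numpy as np

def local_influence(Delta, hess):
    """Normal-curvature diagnostics (Cook, 1986; Lesaffre and Verbeke, 1998).
    Delta: p x n perturbation matrix; hess: p x p Hessian of the log-likelihood."""
    B = -Delta.T @ np.linalg.solve(hess, Delta)     # n x n curvature matrix
    eigval, eigvec = np.linalg.eigh((B + B.T) / 2)  # symmetrize for stability
    dmax = eigvec[:, -1]                            # direction of largest curvature
    Ct = 2 * np.abs(np.diag(B))                     # total local influence C_t
    return dmax, 2 * eigval[-1], Ct

# Random placeholders: p = 3 parameters, n = 8 observations
rng = np.random.default_rng(2)
Delta = rng.normal(size=(3, 8))
A = rng.normal(size=(3, 3))
hess = -(A @ A.T + np.eye(3))                       # negative definite Hessian
dmax, Cmax, Ct = local_influence(Delta, hess)
```

Observations with `Ct` larger than twice its mean would be flagged as individually influential, and the index plot of `dmax` highlights jointly influential cases.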

In what follows, we shall develop local influence measures for the class of nonlinear simplex regression models under three different perturbation schemes, namely: case weighting, response perturbation and covariate perturbation. To that end we need to obtain $\ddot\ell(\theta)$ and $\Delta$. Based on (5), the blocks $\ddot\ell_{\beta\beta}$, $\ddot\ell_{\beta\gamma}$ and $\ddot\ell_{\gamma\gamma}$ of $\ddot\ell(\theta)$ follow by differentiating the log-likelihood twice with respect to $(\beta,\gamma)$. Here, $Q=\operatorname{diag}\{q_1,\dots,q_n\}$,

$$q_t=\left\{u_t-(y_t-\mu_t)u'_t+(y_t-\mu_t)u_t\frac{g''(\mu_t)}{g'(\mu_t)}\right\}\frac{1}{\{g'(\mu_t)\}^{2}}, \tag{16}$$

with

$$u'_t=-\left\{\frac{2(y_t-\mu_t)u_t}{\mu_t(1-\mu_t)}+\frac{3(1-2\mu_t)}{\mu_t^{4}(1-\mu_t)^{4}}+\frac{(1-2\mu_t)\,d(y_t;\mu_t)}{\mu_t^{2}(1-\mu_t)^{2}}\right\},$$

and $V=\operatorname{diag}\{\nu_1,\dots,\nu_n\}$, with

$$\nu_t=d_t+a_t\frac{h''(\sigma^2_t)}{\{h'(\sigma^2_t)\}^{3}}\quad\text{and}\quad E=\operatorname{diag}\{(y_1-\mu_1),\dots,(y_n-\mu_n)\}. \tag{17}$$

Besides that, $\tilde X_\beta=\partial^{2}\eta/\partial\beta\partial\beta^{\top}$ is an array of dimension $n\times k\times k$, its $t$th face being a $k\times k$ matrix with elements $\partial^{2}f_1(x_t^{\top};\beta)/\partial\beta_i\partial\beta_j$, and $\tilde Z_\gamma=\partial^{2}\zeta/\partial\gamma\partial\gamma^{\top}$ is an array of dimension $n\times q\times q$, its $t$th face being a $q\times q$ matrix with elements $\partial^{2}f_2(z_t^{\top};\gamma)/\partial\gamma_i\partial\gamma_j$. Finally, $[\,\cdot\,][\,\cdot\,]$ represents the bracket product of a matrix by an array as defined by Wei (1998), p. 188.

The structure of $\Delta$ for each perturbation scheme is given below. For the case-weighting perturbation we have $\ell(\theta\,|\,\omega)=\sum_{t=1}^{n}\omega_t\ell_t(\mu_t,\sigma^2_t)$, with $\omega_0=(1,\dots,1)^{\top}$. In this case,

$$\Delta=\begin{pmatrix}\hat{\tilde X}^{\top}\hat S\hat U\hat T\hat E\\ \hat{\tilde Z}^{\top}\hat H\hat A\end{pmatrix},$$

with $A=\operatorname{diag}\{a_1,\dots,a_n\}$, where the quantity $a_t$ and the components $u_t$ of the diagonal matrix $U$ were defined in (7). Additionally, the diagonal matrices $S$, $T$, $H$ and $E$ were defined, respectively, in (8) and (17).

We now move to the response perturbation scheme, where $y_t(\omega)=y_t+\omega_t s_t$, with $s_t=\sqrt{V(\hat\mu_t)}$ and $V(\mu)=\mu^{3}(1-\mu)^{3}$ the variance function. This is a new proposal for the standardization of the scale factor $s_t$, given that, in general, the standard deviation of $y_t$ is used. It is worth pointing out that the new proposal has a smaller computational cost, because the variance of a random variable with simplex distribution involves computations related to the incomplete gamma function; see (3). In this scheme, $\Delta$ involves the diagonal matrices $M=\operatorname{diag}\{m_1,\dots,m_n\}$ and $B=\operatorname{diag}\{b_1,\dots,b_n\}$, with

$$m_t=\frac{1}{y_t(1-y_t)}\left\{\frac{2}{y_t(1-\mu_t)^{3}}+\frac{1-3\mu_t}{\mu_t^{2}(1-\mu_t)^{3}}-\frac{1}{2}\frac{\partial d(y_t;\mu_t)}{\partial\mu_t}\right\}\quad\text{and}$$
$$b_t=\frac{1}{2\sigma^2_t\,y_t(1-y_t)}\left\{d(y_t;\mu_t)+\frac{2(y_t-\mu_t)}{y_t\mu_t(1-\mu_t)^{2}}\right\}.$$

The third and final perturbation scheme involves the simultaneous perturbation of a continuous covariate in each submodel. We replace the covariate values by their original values plus $\omega_t s_x$ and $\omega_t s_z$, respectively, where $\omega$ is a vector of small perturbations and $s_x$ and $s_z$ are the standard deviations of the perturbed covariates (Thomas and Cook, 1990). In this scheme,

$$\Delta=\begin{pmatrix}-\hat{\tilde X}^{\top}\hat S^{2}\hat T\hat H\hat U\hat E\hat{\tilde Z}_\delta-\hat{\tilde X}^{\top}\hat S\hat Q\hat{\tilde X}_\delta+[\hat b_\beta^{\top}][\hat{\tilde X}_{\beta\delta}]\\ -\hat{\tilde Z}^{\top}\hat S^{2}\hat T\hat H\hat U\hat E\hat{\tilde X}_\delta-\hat{\tilde Z}^{\top}\hat V\hat{\tilde Z}_\delta+[\hat b_\gamma^{\top}][\hat{\tilde Z}_{\gamma\delta}]\end{pmatrix}, \tag{18}$$

where $\tilde X_\delta$ and $\tilde Z_\delta$ denote the derivatives of the predictors $\eta$ and $\zeta$ with respect to the perturbed covariates, and $\tilde X_{\beta\delta}$ and $\tilde Z_{\gamma\delta}$ are the arrays of cross derivatives of the predictors with respect to the parameters and the perturbed covariates. In (18), the elements of the matrices $\hat Q$ and $\hat V$ are given by (16) and (17), respectively. The other elements in (18) are defined in the paragraph below expression (17).

## 7 Simulations

Our aim here is to investigate the distribution of the weighted residual proposed in (15) using Monte Carlo experiments. The experiments were carried out using a nonlinear simplex regression model in which

$$\log\frac{\mu_t}{1-\mu_t}=\beta_1+x_{t2}^{\beta_2}+\beta_3 x_{t3}+\beta_4 x_{t4};\qquad\log(\sigma^2_t)=\gamma_1+z_{t2}^{\gamma_2},\quad t=1,\dots,n. \tag{19}$$

The covariate values were obtained as random draws and remained constant throughout the simulations. We consider scenarios in which the response values are close to one, scattered on the standard unit interval, and close to zero, obtained through suitable choices of the true parameter values.

We measure the intensity of nonconstant dispersion by the ratio $\lambda=\sigma^{2}_{\max}/\sigma^{2}_{\min}$ between the largest and the smallest dispersion values, and we consider several degrees of nonconstant dispersion and different sample sizes. Only the covariate values for the smallest sample size were generated; they were replicated two and three times, respectively, to obtain the other sample sizes. With this, we can guarantee that the intensity of the nonconstant dispersion remains the same for the different sample sizes. The Monte Carlo experiments were carried out based on 10,000 replications.

Figures 1, 2 and 3 contain normal probability plots of the mean order statistics of the weighted residual considering the fitted model in (19) under the different scenarios for $\mu$. These figures reveal that the distribution of the weighted residual is typically well approximated by the standard normal distribution when the response values are scattered on the unit interval, even under nonconstant dispersion (Figure 1). However, when the response values are close to one or close to zero, the distribution of the weighted residual presents some asymmetry (Figures 2 and 3, respectively). Such asymmetry becomes more evident as the intensity of the nonconstant dispersion increases. In addition, for these ranges of $\mu$, the approximation of the distribution of the weighted residual by the standard normal distribution does not seem to improve as the sample size increases.