Inference for the proportional odds cumulative logit model with monotonicity constraints for ordinal predictors and ordinal response

The proportional odds cumulative logit model (POCLM) is a standard regression model for an ordinal response. Ordinality of predictors can be incorporated by monotonicity constraints for the corresponding parameters. It is shown that estimators defined by optimization, such as maximum likelihood estimators, for an unconstrained model and for parameters in the interior set of the parameter space of a constrained model are asymptotically equivalent. This is used in order to derive asymptotic confidence regions and tests for the constrained model, involving simple modifications for finite samples. The finite sample coverage probability of the confidence regions is investigated by simulation. Tests concern the effect of individual variables, monotonicity, and a specified monotonicity direction. The methodology is applied on real data related to the assessment of school performance.

Authors

• 2 publications
• 16 publications
04/23/2018

A constrained regression model for an ordinal response with ordinal predictors

A regression model is proposed for the analysis of an ordinal response v...
05/25/2018

Penalized polytomous ordinal logistic regression using cumulative logits. Application to network inference of zero-inflated variables

We consider the problem of variable selection when the response is ordin...
01/30/2021

Bayesian Cumulative Probability Models for Continuous and Mixed Outcomes

Ordinal cumulative probability models (CPMs) – also known as cumulative ...
02/03/2021

Statistical Inference for Ordinal Predictors in Generalized Linear and Additive Models with Application to Bronchopulmonary Dysplasia

Discrete but ordered covariates are quite common in applied statistics, ...
07/03/2021

Proportional mean model for panel count data with multiple modes of recurrence

Panel count data is common when the study subjects are exposed to recurr...
08/13/2021

Non-parametric estimation of cumulative (residual) extropy

Extropy and its properties are explored to quantify the uncertainty. In ...
05/18/2020

Distributed lag models to identify the cumulative effects of training and recovery in athletes using multivariate ordinal wellness data

Subjective wellness data can provide important information on the well-b...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Regression problems are treated in which the response variable is ordinal and one or more of the explanatory variables are ordinal as well. When considering regression relations between variables, monotonicity has a special status regarding ordinal variables. The ordinal scale type implies that the ordering of the possible values is meaningful, whereas there is no meaningful quantitative distance between them. Changing values of the explanatory variable in a meaningful manner means to increase or decrease them. Assuming a general monotonic relationship between explanatory variable and response means that such a meaningful change is connected to the response in the same way (meaning either decreasing or increasing) everywhere on the measurement scale of the explanatory variable. This gives the monotonicity assumption a special status between ordinal variables, although of course not all relationships between ordinal variables are monotonic. Note that monotonicity for ordinal variables can be seen as analogous to linearity for interval scaled measurements, which implies that the same quantitative difference between values of the explanatory variable (defined as meaningful at interval scale level) has the same quantitative effect on the response everywhere on the measurement scale.

In this sense the monotonic relationship between ordinal variables in a regression is a key reference point, as is the linearity assumption between interval scaled variables. A monotonicity assumption can be seen as default, unless contradicted by the data or specific background knowledge.

Here we derive asymptotic theory for a maximum likelihood estimator (MLE) in a proportional odds cumulative logit model (POCLM, McCullagh (1980)), in which relationships between ordinal variables are constrained to be monotonic. Such constraints have been proposed by Espinosa and Hennig (2019). We pay special attention to the relation between the constrained and the unconstrained MLE, and to confidence regions (CRs), which based on the unconstrained MLE may be used to reject the monotonicity assumption where contradicted by the data, and for monotonicity direction detection. Espinosa and Hennig (2019)

proposed to decide about the appropriateness of the monotonicity assumption and monotonicity directions based on one-dimensional confidence intervals (CIs) for parameters, which can lose a lot of power compared to using multivariate CRs taking into account dependence between parameters. We will also give conditions that make the CIs in

Espinosa and Hennig (2019) valid, which originally were based on statements made in Agresti (2010) without a proof.

An alternative approach for regression with ordinal predictors would be to represent them by latent variables (Moustaki (2000)) or optimal scaling (De Leeuw et al. (2009)). Work on isotonic regression (Stout (2015)) is also related to the present approach.

Section 2 introduces the unconstrained POCLM and reviews its inference and asymptotic theory. Not only are CRs for the unconstrained POCLM a basis for the CRs for the constrained POCLM, they are of interest in their own right when making statements about whether the data are compatible with none, one, or both possible monotonicity directions for the ordinal predictors. Section 3 introduces the monotonicity constraints for ordinal predictors. It is possible to leave the monotonicity direction unspecified, or to constrain the parameters belonging to a variable to be isotonic, or antitonic. In order to use the unconstrained asymptotic theory for the constrained model, a general theorem is shown that constrained estimators defined by optimization are asymptotically equivalent to their unconstrained counterparts on the interior of the constrained parameter space. This result is used in Section 4 in order to define confidence regions and tests for the constrained POCLM. Confidence regions based on unconstrained asymptotics may in finite samples contain parameters that do not fulfill monotonicity constraints, and adaptations for this case are proposed, and their coverage probabilities compared in a simulation study. The methodology is applied to a data set on school performance in Chile in Section 5, and Section 6 concludes the paper.

2 The unconstrained POCLM

Consider a regression with an ordinal response being the number of independent observations, with categories. There are ordinal predictors (OPs;

refers to a dummy variable for ordinal predictor

and category ) and non-ordinal quantitative predictors . According to the POCLM (McCullagh (1980)),

 logit[P(zi≤j|xi)]=αj+t∑s=1ps∑hs=2βs,hsxi,s,hs+v∑u=1βuxi,u,j=1,…,k−1, (1)

with

 −∞=α0<α1<⋯<αk−1<αk=∞, (2)

and where the first category for each predictor is the baseline category, i.e., .

The dimensionality of the parameter space is , defining the

-dimensional parameter vector as

 γ′ =(α′,β′) =(α′,β′(ord),β′(nonord)) =(α′,β′1,…,β′t,β1,…,βv),

where is the parameter vector associated with the ordinal predictors and is the one associated with the non-ordinal predictors.

With , the log-likelihood function is

 ℓ(γ)=n∑i=1k∑j=1yijlogπj(xi), (3)

where , and denotes the indicator function, and , i.e.,

 πj(xi)=eαj+∑ts=1∑pshs=2βs,hsxi,s,hs+∑vu=1βuxi,u1+eαj+∑ts=1∑pshs=2βs,hsxi,s,hs+∑vu=1βuxi,u −eαj−1+∑ts=1∑pshs=2βs,hsxi,s,hs+∑vu=1βuxi,u1+eαj−1+∑ts=1∑pshs=2βs,hsxi,s,hs+∑vu=1βuxi,u. (4)

The parameter space for model (1) is

 UUM={ γ′=(α′,β′1,…,β′t,β′(nonord))∈Rp:−∞<α1<⋯<αk−1<∞}, (5)

where . We refer to this as the “unconstrained model” (thus “”) despite the constraints on the -parameters, because in Section 3 constraints will be introduced for the -vectors of ordinal predictors.

The unconstrained maximum likelihood estimator (UMLE) is

 ^γ=argmaxγ∈UUMℓ(γ), (6)

then is the vector of UMLEs belonging to the parameter space .

The score function and the Fisher information matrix of the first observations are

 sn(γ) =∂logL(γ|zn,Xn)/∂γ, (7) Fn(γ) =covγsn(γ). (8)

2.1 Asymptotic results for the unconstrained POCLM

Asymptotic theory for the unconstrained POCLM is given in Fahrmeir and Kaufmann (1986), applying results for generalized linear models (GLMs) proved in Fahrmeir and Kaufmann (1985); Fahrmeir (1987). This theory is summarized here. The unconstrained POCLM is treated as a multivariate generalized linear model (GLM), for which the following structure is assumed:

1. [label=()]

2. The are

-dimensional independent random variables with densities

 f(yi|θi)=c(yi)exp(θ′iyi−b(θi)),i=1,2,…, (9)

with parameter vector , which is the interior of the convex set of all parameters for which (9) defines a valid density. For .

3. The matrix influences in form of a linear combination where is a -dimensional parameter.

4. The linear combination is related to the mean of by the injective link function , , where is the image of . These functions will be referred to as general link functions.

Model (1) can be connected to (9) by setting . (1) applies the logit (which is the natural link function for a multinomial model for ) to instead, and the link function also needs to involve , which makes it non-natural for the POCLM.

Making reference to the results for non-natural link functions in Fahrmeir and Kaufmann (1985), the following conditions, of which only one has to be fulfilled, for asymptotic results for the POCLM are given in Fahrmeir and Kaufmann (1986):

1. There is a such that for , and for

, the smallest eigenvalue of

goes to .

2. and for some for all .

Although somewhat restrictive, these conditions are conditions on the observed predictors only, not on unobservable model parameters. They are stronger than what is required for standard linear regression, see

Fahrmeir and Kaufmann (1986), but can be argued to be realistic in many cases.

” denotes convergence in distribution, is the

-variate Gaussian distribution with mean vector

and unit matrix as covariance matrix, and is the Cholesky square root of . The following results then hold:

Theorem 2.1.

Assuming (B) or (G), the probability that a unique MLE exists converges to one. Any sequence of MLEs is weakly consistent and asymptotically normal,

 (Fn(γ)−1/2)′(^γn−γ)L→Np(0,Ip). (10)

Fahrmeir and Kaufmann (1986) also state an optimality result for the MLE. See Fahrmeir and Kaufmann (1985) for an additional condition that allows for strong consistency.

Now consider a standard loglikelihood ratio test for the linear hypothesis

 H0: Cγ=ξ % against H1: Cγ≠ξ, (11)

where is an -matrix of full row rank , , and has a nonempty intersection with the parameter space. Let

 Rn=2[ℓ(^γ)−ℓ(~γ)] (12)

be the standard loglikelihood ratio statistic, where is the MLE within , fulfilling , and

 Wn=(C^γ−ξ)′[CF−1n(^γ)C′]−1(C^γ−ξ) (13)

be the Wald statistics.

Theorem 2.2.

Assuming (B) or (G), if in (11) is true, then and are asymptotically equivalent, and

 Rn, WnL→χ2r.

Under additional conditions, Fahrmeir (1987) derives an asymptotic distribution with noncentrality parameter for under certain sequences of parameters from .

We will mainly consider asymptotic CRs based on here. One-dimensional CIs for a parameter of level (as used in Espinosa and Hennig (2019)) are usually constructed as

 ^β±z1−ν/2(SE^β), (14)

where is the

-quantile of the Gaussian distribution, and the estimated standard error

can be obtained from the asymptotic covariance matrix. In Theorem 2.1, depends on the true parameter and replacing it by requires additional conditions, see Fahrmeir and Kaufmann (1985). However, the resulting CI is equivalent to the one that can be obtained from the Wald statistic with , and therefore it is valid according to Theorem 2.2. Otherwise, we normally prefer the loglikelihood ratio statistic for inference, see Harrell (2001) for a discussion.

3 Monotonicity constraints for the POCLM

3.1 The constrained parameter space

When ordinal predictors are treated as of nominal scale type, no monotonicity constraints are imposed on any of the vectors , , and the parameter space is .

In order to take their ordinality into account, Espinosa and Hennig (2019) proposed monotonic constraints on the parameters associated with the categories of an ordinal predictor . Monotonicity is allowed to be either isotonic or antitonic. The isotonic constraints are

 0≤βs,2≤⋯≤βs,ps,∀s∈I, (15)

where , with , and the antitonic constraints are

 0≥βs,2≥⋯≥βs,ps,∀s∈A, (16)

where .

The model under monotonicity constraints, i.e., requiring for all ordinal variables in (1) will be referred to as the constrained model. Even if a variable is in fact ordinal, a researcher may have reasons to not enforce constraints, in which case it can be treated as non-ordinal in the model.

is the parameter set that constrains all ordinal variables to fulfill monotonicity constraints but allows both monotonicity directions:

 ~UCM={γ′=(α′, β′1,…,β′t,β′(nonord))∈Rp:−∞<α1<⋯<αk−1<∞, [(βs,2≥0,βs,hs≥βs,hs−1) % or (βs,2≤0,βs,hs≤βs,hs−1)] (17) ∀(s,hs)∈S×{3,…,ps}}.

is the parameter set that fixes the monotonicity directions for all ordinal variables (normally assuming that these are correct):

 UCM={γ′=(α′, β′1,…,β′t,β′(nonord))∈Rp:−∞<α1<⋯<αk−1<∞, (βs,2≥0,βs,hs≥βs,hs−1)∀(s,hs)∈I×{3,…,ps}, (18) (βs,2≤0,βs,hs≤βs,hs−1)∀(s,hs)∈A×{3,…,ps}}.

Note that in practice equality for the -parameters should be allowed, because in case the unconstrained optimizer of the likelihood violates monotonicity constraints, the constrained optimizer will only exist if equality is allowed. This is different from equality for the -parameters, because equality of subsequent parameters would imply that there is a category with , in which case this category can be dropped from the model.

The constrained maximum likelihood estimator (CMLE) is

 ^γc=argmaxγ∈~UCMℓ(γ). (19)

One may also be interested in direction constrained MLEs (DMLEs) with fully prespecified monotonicity directions with and chosen suitably,

 ^γd=argmaxγ∈UCMℓ(γ), (20)

or even partially direction constrained MLEs (PMLEs) that optimize over a parameter space with some monotonicity directions fixed and others left open. These estimators can be computed using the R-function maxLik from the package of the same name (Henningsen and Toomet (2011)).

3.2 Asymptotics for constrained estimators

The asymptotic theory for the UMLE presented above makes sure that for large enough it is arbitrarily close to the true parameter. In case that monotonicity constraints indeed hold, the UMLE can therefore be expected to be in with arbitrarily large probability as long as the true parameter vector is in the set , where for a set , denotes the interior set, i.e.,

 So={x: ∃ϵ>0:∥x−y∥<ϵ⇒y∈S}.

is with all “” and “” replaced by “” and “”, respectively. In this case, the UMLE and the CMLE will be equal with probability approaching one, and asymptotically equivalent.

We will state this rather obvious fact as a general theorem for constrained estimators, which we have not been able to find in the literature (Liew (1976) showed a corresponding result for the least squares estimator; otherwise work on constrained estimators seems to focus on the general case in which constraints are not necessarily fulfilled, requiring more sophisticated theory, e.g., Dupacova and Wets (1988)).

For observations in some space , consider estimators of parameters from some parameter set that are defined by optimization of a function :

 Tn(x1,…,xn)=argmaxθ∈Θhn(x1,…,xn;θ). (21)

In case the argmax does not exist or is not unique, can be defined as taking any value. The issue of interest here is whether the following properties hold for a constrained estimator assuming that they hold for the unconstrained :

1. The probability that the argmax in the definition of exists and is unique converges to 1 for .

2. is weakly consistent for (the results below also hold if “weakly” is replaced by “strongly”).

3. For given functions and distribution , assuming as the true parameter,

 Gn(Tn;θ0)L→Q.

Constrained estimation is defined by assuming

 θ∈~Θ⊂Θ, ~Tn(x1,…,xn)=argmaxθ∈~Θhn(x1,…,xn;θ).

If the argmax in the definition of exists uniquely, and , then .

Theorem 3.1.

With the notation introduced above, assuming that the true ,

1. If fulfills (C1) and (C2), then fulfills (C1) and (C2), and

 limn→∞P{Tn=~Tn}=1.
2. If fulfills (C1), (C2), and (C3), then also fulfills (C3).

Proof.

(C2) for means that for all . Consider and so that

 Uϵ={θ∈Θ: ∥θ−θ0∥<ϵ}⊆~Θ.

If the argmax exists and is unique (which happens with because of (C1)), if , then , therefore

 P{Tn=~Tn}→1 and P{∥~Tn−θ0∥<ϵ}→1.

This holds for all , as for arbitrarily small , and if is so large that , then and

 P{∥~Tn−θ0∥<ϵ∗}≥P{∥~Tn−θ0∥<ϵ}→1.

Therefore (C2) holds for , as does (C1), because

 P({argmax is unique and exists }∩{Tn∈Uϵ⊆~Θ})→1.

This proves (a).

(C3) for means that for all continuity points of the cdf of ,

 P{Gn(Tn;θ0)≤x}→Q(−∞,x].

Consider again . Ignoring the possibility that the argmax might not exist or be unique (the probability of which vanishes due to (C1)), let

 U1n={Tn∈Uϵ, Gn(~Tn;θ0)≤x}={Tn∈Uϵ, Gn(Tn;θ0)≤x}, (22) U2n={Tn∈Ucϵ, Gn(~Tn;θ0)≤x}. (23)

Then . Because of (C2), , and therefore

 |P(U1n)−P{Gn(Tn;θ0)≤x}|→0, P(U2n)→0.

Therefore,

 limn→∞P{Gn(~Tn;θ0)≤x}=limn→∞P(U1n)=limn→∞P(Gn(Tn;θ0)≤x}=Q(−∞,x], (24)

proving (C3) for and therefore (b). ∎

Theorem 3.1 implies that Theorems 2.1 and 2.2 do not only hold for the UMLE but also for the CMLE, and in fact the DMLE and PMLEs, meaning that asymptotic inference for them is the same, unless is on the border of , in which case the probability for the CMLE to equal the UMLE may not go to zero. Weak consistency also implies that for and all monotonicity directions will be estimated correctly with probability converging to 1, i.e., UMLE and CMLE will be in .

4 Inference based on asymptotics

4.1 Confidence regions

Due to the asymptotic equivalence of UMLE and CMLE, inference about the parameters of the constrained POCLM can be based on the theory for the unconstrained POCLM, particularly (12) and Theorem 2.2. As opposed to Espinosa and Hennig (2019), the results here allow for inference regarding multiple parameters. Simultaneous inference may be of particular interest regarding all parameters belonging to a variable. Unconstrained inference can be relevant for testing monotonicity. If the monotonicity assumption for ordinal variables is indeed true, one should expect higher efficiency for inference that assumes the monotonicity constraints into account. Regarding constrained inference, CRs are probably most interesting; they may be interpreted as quantifying some kind of “effect size”.

A major issue regarding Theorem 3.1 is that its proof is based on the fact that asymptotically UMLE and CMLE are the same. CRs based on Theorem 2.2 will with probability converging to one be subsets of arbitrarily small neighborhoods of the MLE and therefore also of the true parameters, meaning that if the monotonicity constraints are indeed fulfilled with strict inequalities, the whole CR will only contain parameters indicating the correct monotonicity direction.

In practice, for finite , it is neither guaranteed that UMLE and CMLE are the same, nor that all parameters in such a CR indicate the same monotonicity direction, or at least fulfill any monotonicity constraint. The validity of asymptotic CRs may then be doubted.

For the sake of simplicity, in the following discussion, we consider CRs for a vector of parameters of interest, usually those belonging to a single ordinal variable, although ideas also apply to more general vectors fulfilling linear constraints as in (11). For a vector with parameters of interest, , the overall parameter vector is partitioned as , where is a vector with the remaining parameters. The unconstrained MLE is now denoted as accordingly, and the constrained MLE is . Unconstrained CRs have the form

 UCR={ β0r:2[ℓ(^βr,^ϕ)−ℓ(β0r,~ϕ)]≤χ2r;1−α,β0r∈UUM}, (25)

where is defined by

 ℓ(β0r,~ϕ)=maxϕ∈UUMℓ(β0r,ϕ). (26)

Using the UCR, considering a single ordinal variable, the following cases can occur:

1. UMLE and CMLE are the same, and all parameter vectors in the CR indicate the same monotonicity direction.

2. UMLE and CMLE are the same; the CR contains parameter vectors indicating the same monotonicity direction as the MLE, but also parameter vectors violating monotonicity.

3. UMLE and CMLE are the same; the CR contains parameter vectors indicating the same monotonicity direction as the MLE, but also parameter vectors indicating the opposite monotonicity direction, in which case there will normally also be parameters violating monotonicity in the CR.

4. UMLE and CMLE are not the same; the CR contains parameter vectors indicating the same monotonicity direction as the CMLE on top of parameters vectors that are non-monotonic, or that potentially also indicate another monotonicity direction.

5. UMLE and CMLE are not the same, and the CR does not contain any parameter vector fulfilling the monotonicity constraints.

6. In case that there are further variables in the model, there is a further possibility, namely that the UMLE fulfills monotonicity constraints, but due to enforcement of constraints for another variable that is potentially dependent on the variable in question, UMLE and CMLE differ on that variable as well. In that case it is possible that all parameter vectors in the CR indicate the same monotonicity direction, but the other cases listed above can in principle also occur.

Only the first case indicates a behavior in line with the asymptotic theory. In the other cases the validity of the asymptotic CR for the given finite sample can be doubted. Assuming that monotonicity in fact holds, the other cases indicate that the true parameter is so close to the boundary of the constrained parameter space that for the given sample size one can expect an asymmetric distribution of the CMLE, because even if UMLE and CMLE are actually the same, parameters that do not satisfy the constraints are also compatible with the data. In the third and potentially also the fourth case, the distribution of the CMLE may be bimodal, as the true parameter may be so close to the boundary between monotonicity directions that they may “switch”; for datasets generated by such parameters, the UMLE may fall within both monotonicity direction domains and also may violate monotonicity, in which case the CMLE may be in either possible domain.

One could use techniques such as bootstrap to find constrained finite sample CRs, but here we explore what can be done based on the asymptotic CRs. Section 4.3 will then explore these CRs empirically.

The easiest way to obtain a constrained CR from an unconstrained one is to just remove all parameter vectors that do not satisfy monotonicity constraints:

 UCCR={ β0r:2[ℓ(^βr,^ϕ)−ℓ(β0r,~ϕc)]≤χ2r;1−α,β0r∈~UCM}. (27)

is the vector of constrained maximum likelihood estimators as a function of the value of , i.e., defined according to (26), but requiring rather than .

This may be seen as legitimate, because if the confidence level is (approximately) valid in the unconstrained case, this should also hold for the constrained case, as the constrained parameter set is a subset of the unconstrained one. This may however result in very small CRs that suggest more precision than the given sample size can provide. In case five above, the CR will even be empty. This is not incompatible with the general theory of CRs, but can be seen as undesirable (other examples of this kind exist and are occasionally used by Bayesians to argue against frequentist inference).

An alternative is to center the CR at the CMLE rather than the UMLE (using the unconstrained theory otherwise), which guarantees it to be nonempty. Asymptotically UMLE and CMLE are equal, so the UCR, UCCR, and CCR are asymptotically equivalent.

 CCR={ β0r:2[ℓ(^βc,r,^ϕc)−ℓ(β0r,~ϕc)]≤χ2(r);1−α,β0r∈~UCM}. (28)

A third option, the most conservative one, would be to use the aggregated CR:

 ACR=UCCR∪CCR.

4.2 Tests

Three kinds of null hypotheses may be of special interest for an ordinal variable :

1. ; no impact of variable .

2. either or ; variable fulfills monotonicity constraints.

3. ; the impact of variable is isotonic (or, analogously, antitonic); a specific monotonicity direction obtains.

Only the first is a linear hypothesis of the type (11). This can be tested by a standard test using , but there is a subtlety. in the definition of may be taken as the UMLE, or the CMLE (or even DMLE or PMLE), respectively, in case these are different. Asymptotically this is equivalent, in practice it may make a difference. The choice between these two relies on how confident the user is to impose a monotonicity assumption, even in case that the UMLE indicates against it.

The second implies that monotonicity is not assumed for the alternative. An asymptotically valid test can be defined by rejecting in case that no parameter vector in the UCR fulfills monotonicity constraints. In fact, it only needs to be checked whether the CMLE is in the UCR (which is equivalent to the UCCR being empty), because the CMLE maximizes the likelihood among the parameter vectors that fulfill the constraints, so if the CMLE does not fulfill with , no element can. The correspondence between CRs and tests is used here to define these tests, but fulfilling

 2[ℓ(^βr,^ϕ)−ℓ(β0r,~ϕc)]=χ2r;1−p (29)

for does not define a valid p-value, because asymptotically, under , UMLE and CMLE are the same, so that a p-value defined in this way will converge to 1 under

, rather than to a uniform distribution. In order to define a valid p-value, it would be necessary to find a borderline monotonic parameter vector

in the UCR minimizing in (29), which is computationally hard. The opposite , namely whether variable does not fulfill monotonicity constraints, can be tested in an analogous way by checking whether non-monotonic parameters are in the UCR.

The third can be tested by asking whether the corresponding PMLE (or DMLE) constraining the monotonicity direction of interest accordingly is in a suitable CR (if not, the can be rejected, according to the logic explained for the second ). Once more, it depends on what assumptions the user is willing to make otherwise, whether the UCCR (or, here equivalently, UCR), CCR, or ACR should be used. Asymptotically these are all the same, as is the -distribution used to define the CR.

4.3 Finite sample coverage probabilities of confidence regions

Coverage probabilities (CPs) of the UCCR, CCR, and ACR, will be simulated and compared under different scenarios in order to see whether these asymptotically motivated CRs perform well on finite samples; see Morris et al. (2019) for some general considerations regarding such experiments.

Consider model (1) with four ordinal predictors with 3, 4, 5, and 6 ordered categories each, one categorical (non-ordinal) predictor with 5 categories and one interval-scaled predictor. For every th observation, each of the four ordinal predictors () is represented in the model by dummy variables denoted as , with and where , , , and ; the nominal predictor is denoted as with ; and the interval-scaled predictor as

. The first category of the categorical variables is considered as the baseline so they are omitted. Thus, the simulated model is

 logit[P(zi≤j|xi)]= αj+3∑h1=2β1,h1xi,1,h1+4∑h2=2β2,h2xi,2,h2+5∑h3=2β3,h3xi,3,h3 +6∑h4=2β4,h4xi,4,h4+5∑h5=2β5,h5xi,5,h5+β1xi,1, (30)

where the number of categories of the ordinal response is , i.e., . This model was fitted for 500 data sets that were simulated using the following true parameters: for the intercepts , , and ; for the non-ordinal categorical predictor ; and for the interval-scaled predictor . The values of the ordinal predictors were drawn from the population distributions illustrated in Figure 1.

The simulated values for the non-ordinal categorical predictor were drawn from the following population distribution: 0.2, 0.2, 0.3, 0.1, 0.2 for categories 1, 2, 3, 4, and 5, respectively. The interval-scaled covariate was randomly generated from .

The current simulation design offers 12 different scenarios defined by two factors: (i) distances between parameter values of adjacent ordinal categories and (ii) sample sizes. The true parameter vectors of the ordinal predictors where chosen to represent three different levels (large, medium, and small) of distances between the parameter values of their adjacent ordinal categories as shown in Figure 2. Four different sample sizes were considered: , 100, 500, and 1,000.

Table 1-3 show the results in terms of frequencies and coverage probabilities (CPs) of the three CRs defined in the previous section. Different scenarios were considered according to two factors: (i) the distance between parameter values of adjacent ordinal categories representing three different degrees of monotonicity and (ii) four sample sizes, resulting in 12 scenarios. Factor (i) separates the Tables in three sections, each one corresponding to a different level: “small”, “medium”, or “large”. For each one of the 12 scenarios, the three definitions of CRs explained in the previous section were considered: UCCR, CCR and their union (ACR).

In order to include the comparisons of cases where the unconstrained and constrained MLE are equal or not, two categories were defined: “Same MLE” and “Different MLE”. The former corresponds to the group of cases 1, 2 and 3 of those discussed in Section 4 whereas the latter is equivalent to Case 4.

Within each one of the 12 scenarios and for each one of the three definitions of CRs, the frequency of cases was recorded separately depending on whether the true parameter was in or out of each CR. The CRs were computed using a confidence level of , and all parameters were assessed simultaneously ().

The factor “the distance between parameter values of adjacent ordinal categories” with levels small, medium, and large will be referred to as “monotonicity degree”.

In order to assess whether the coverage probabilities that are smaller than the confidence level of 95% are significantly low or not, coverage frequencies were compared with the 0.95 quantile of the binomial distribution with

trials (simulation replicates), which is 467. This means that the coverage probabilities under 93.4% (467/500) are significantly smaller than the confidence level of 95%.

By construction, the CP resulting from the ACR is always higher than or equal to the one of the UCCR or CCR. Tables 1-3 show that for , , CPs are much higher for the ACR than for the UCCR or CCR; in fact, they are higher and often significantly higher than the confidence level of 0.95 regardless of the monotonicity degree. This indicates a certain waste of power, although consistently respecting the confidence level is certainly positive.

For the UCCR and small sample sizes, the CPs are not significantly smaller than 95% when the monotonicity degree is medium or large, and significantly smaller (89.4% only) when the monotonicity degree is small. For the CCR the CPs are all significantly smaller than 95% for =50, ranging from 84.0% to 89.6%. In particular, when the sample size is 50, the CPs of the CCR are worse than those of the UCCR. This is because for some data sets the CR is centered around parameter estimates that are in the wrong monotonicity direction, and therefore the latter is more likely to be out of the CCR than of the UCCR.

For the larger sample sizes, , 1000, the CPs for the totals range between 93.6% and 95.6% for the ACR; none of them is significantly smaller than the nominal confidence level. A CP higher than 95% occurs only twice. CPs for the ACR are smaller for larger because UMLECMLE happens more often, in which case the ACR is no longer bigger that the UCCR and the CCR, and therefore less conservative.

In general, the results of the simulations are consistent with the asymptotic theory insofar as for no CP is significantly too small or too large. Some caution is required with the UCCR and the CCR for , 100.

Figure 3 shows, for each monotonicity degree, the overall CPs of the three CRs, and also those for the case when the UMLE and CMLE are different. When comparing the overall CPs against those where UMLE and CMLE are different for =50, they all start almost at the same point; here the proportion of different MLEs for any monotonicity degree is 97.8% or larger. For larger sample sizes, the CPs for “UMLECMLE” are much lower that the overall ones. Furthermore, this effect increases as increases. As the behavior in larger sample sizes is closer to the asymptotics, “UMLECMLE” happens less often, but if it happens, it is an indication that the situation is somewhat ambiguous, or not very well compatible with the monotonicity constraints, and also the CRs are less reliable.

In terms of monotonicity degrees, a smaller distance between parameter values of adjacent categories of the ordinal predictors makes the estimation problem harder, and the UCCR and the CCR tend to produce too low CPs for smaller samples, whereas the CPs of the ACR are not that strongly affected.

In line with the results given here, it can generally be expected that as increases, the CPs of the UCCR and the CCR get closer, and consequently the CP of their union (ACR) will get closer too. This is because as increases, the UMLE and CMLE get closer, and ultimately the UMLE will be in the constrained space. Furthermore, the larger the monotonicity degree of the parameters, the faster the CPs get closer to each other as increases. This is because when the monotonicity degree is large enough, the UMLE requires a relatively small sample size in order to fall into the correct constrained space, and therefore the UCCR and CCR get closer.

5 Application to school performance data

We apply the introduced methodology to data regarding the performance assessment of educational institutions in Chile. The observations are schools that provide educational services at primary and/or secondary level. The database was built using freely accessible information, which is published by the Education Quality Agency, an autonomous public service that interacts with the presidency of Chile through its Ministry of Education.

In Chile, the educational system considers 12 years of schooling, eight at the primary level and four at the secondary level. The performance of each school () is recorded as an ordinal variable of four ordered categories: “Insufficient”, “Medium-Low”, “Medium”, and “High”. There are two extra categories representing lack of information (for new schools mainly) or low number of registered students, which together concentrate 28.3% of the schools in 2019, but these are excluded from the analysis, given that they are associated with schools that were not evaluated. The performance assessment results that are analysed correspond to the ones obtained from 5,333 schools in both 2019 and 2016, specifically from their fourth year students of primary level, with national coverage. The ordinal predictors are:

• perf2016: school performance results in 2016, measured in the same way as the one of 2019,

• funding: the way the students pay for their studies with three categories: “Public”, “Mixed” and “Private”. This is an ordinal predictor in the sense that “Private” is associated to schools where the students need to pay by themselves and the fees are higher than in schools belonging to the other classes. This also indicates the level of financial resources that each school is able to reach, being the class “Private” the highest and “Public” the lowest, and

• regisRat: the ratio of number of registered students in 2019 over the one of 2016.

The parameter estimators from the unconstrained POCLM are given in Table 4. The variable perf2016 is estimated as antitonically related to ; in fact, it has negative coefficients, meaning that a better performance in 2016 is related to a better performance in 2019. This is because (1) models probabilities to be smaller than certain values as dependent on . The coefficients associated with the categories of funding are not estimated to be monotonic, although monotonicity is only very slightly violated. Taking the meaning of the variables into account, it can be considered highly likely that a monotonic relationship holds; “Mixed” funding is properly between “Public” and “Private”, and reasons for potential non-monotonicity are hard to imagine. In order to improve interpretability, these parameter estimates were therefore constrained to be monotonic, and the CMLE was computed. The UMLE of funding is so close to being antitonic that not only is the CMLE for its parameters very close to the UMLE, also parameters of the other variables do not change by constraining funding to be monotonic, although theoretically this may happen.

Figure 4 visualizes the parameter estimators with 95% confidence intervals. The difference between the UMLE and CMLE is hardly visible. Figure 5 shows a two-dimensional 95% CR for the two free parameters of funding. Here the UMLE and the CMLE are visibly slightly different. CRs are visualized by checking whether two-dimensional projections of (for funding with this is just ) are in the CR, for a dense grid of values. The UCR comprises of monotonic and non-monotonic parameters. The latter (colored grey) belong to the UCR, but not to the UCCR and CCR. The difference between the UCCR and CCR is that the UCCR is centered at the UMLE (even though all the non-monotonic parameters are removed), whereas the CCR is centered at the CMLE. The difference is not large; on the grid of parameter values for which we checked whether they are in the CR, only two (red) points belong to the CCR but not the UCCR.

Regarding the tests in Section 4.2, the CR shows that the can be rejected, as this pair of parameter values is not in the CR; equivalently the -test also rejects the corresponding linear hypothesis with . There are monotonic parameters in the UCR, therefore the of monotonicity of funding cannot be rejected, neither can non-monotonicity be rejected. However, it can be rejected that funding is isotonic. From a subject matter perspective, antitoncity seems likely for funding, and it is also well compatible with the data, therefore it makes sense to base further inference on the CMLE.

For perf2016 there are three free parameters, so not everything can be shown simultaneously in two dimensions. Figure 6 shows two selected views. On the left side, joint CRs for and are shown. UMLE and CMLE here are the same, as are UCR, UCCR, and CCR. This plot is of interest because the difference between and is the largest possible for this variable. This difference would need to change sign in order to revert the monotonicity direction of perf2016. No such parameter pair is in the CR, which suggests that the of isotonicity can be rejected, which can be confirmed by checking that , the three-dimensional MLE assuming isotonicity, is not in the three-dimensional CR. An important consideration here is the number of degrees of freedom for the -distribution used in the computation of the CR. If the researcher were only interested in and , there should be two degrees of freedom. However, taking into account that this plot was picked because and are the most distant parameters for this variable, choosing the degrees of freedom as 3 (number of free parameters for perf2016) takes into account possible variation on all three parameters for the resulting parameter pairs of and . Figure 6

also shows the CR based on 9 degrees of freedom, which takes into account variation of all free parameters in the model. Note that CRs are not perfectly elliptical, as the loglikelihood ratio can have irregular variations in line with asymmetries in the data, as opposed to the Wald test statistic. This is actually an advantage of the loglikelihood ratio statistic, as it reflects the information in the data more precisely.

The CR on the right side of Figure 6 concerns the parameter differences vs. . These differences are smaller in absolute value than , and can therefore more easily indicate potential non-monotonicity. In fact, no non-monotonic pairs of differences (let alone isotonic ones) are in the CR, so that non-monotonicity can be rejected, as can be the that perf2016 does not have any impact (). On top of these tests researchers of course may be interested in “effect sizes”, i.e., how far the parameter values are away from non-monotonicity or isotonicity relative to the variation of the parameter estimators, which the CRs visualize as well.

In general, for variables with more than three categories, and consequently free parameters, the most interesting two-dimensional views may be the one that shows the most extreme parameter pair (left side of Figure 6), which helps to assess how far the data are from the inverse monotonicity direction, and one or more plots of pairs of differences that either in terms of the UMLE run counter to the dominating monotonicity direction, or are the closest to violating it (right side of Figure 6; even though for this variable all plots comfortably confirm antitonicity).

6 Conclusion

We believe that the monotonicity assumption plays a key role in regression problems with ordinal predictors as well as an ordinal response. Obviously, monotonicity cannot be taken for granted in every application, but as is the case for the linear assumption for interval scaled variables, it means that the impact of a predictor on the response works in the same way over the whole range of the scale.

Espinosa and Hennig (2019) already proposed an approach to decide about whether monotonicity is appropriate at all, and what monotonicity direction to choose for which variable. This was based on one-dimensional confidence intervals, not taking the potential dependence between parameters appropriately into account, resulting in weaker power than possible. Here we show that looking at higher dimensional CRs the asymptotic theory of the unconstrained POCLM applies to the model with monotonicity constraints as well, but that finite samples can highlight issues with the asymptotic approximation, namely where UMLE and CMLE are different, and where non-monotonic parameter vectors are in the UCR. For testing monotonicity, the unconstrained theory can be used, but assuming monotonicity, CRs need to be adapted. We propose some simple ways of doing that, and some tests of standard hypotheses of interest that are easy to evaluate. Finite sample theory is not easily available, therefore the CRs can only be validated empirically. Alternatives worthy to explore are bootstrap tests and confidence regions (Hall (1987); Olive (2018)), although multivariate bootstrap CRs are often computationally expensive and rarely applied in the literature.

References

• Agresti (2010) Agresti, A. (2010). Analysis of ordinal categorical data, volume 656. John Wiley & Sons.
• De Leeuw et al. (2009) De Leeuw, J., Mair, P., et al. (2009). Gifi methods for optimal scaling in r: The package homals. Journal of Statistical Software, 31(4):1–20.
• Dupacova and Wets (1988) Dupacova J. and Wets R. (1988). Asymptotic Behavior of Statistical Estimators and of Optimal Solutions of Stochastic Optimization Problems. The Annals of Statistics, 16(4):1517–1549.
• Espinosa and Hennig (2019) Espinosa J. and Hennig C. (2019). A constrained regression model for an ordinal response with ordinal predictors. Statistics and Computing, 29(5):869–890.
• Fahrmeir (1987) Fahrmeir, L. (1987). Asymptotic testing theory for generalized linear models. Statistics, 18(1):65–76.
• Fahrmeir and Kaufmann (1985) Fahrmeir, L. and H. Kaufmann (1985). Consistency and asymptotic normality of the maximum likelihood estimator in generalized linear models. The Annals of Statistics, 342–368.
• Fahrmeir and Kaufmann (1986) Fahrmeir, L. and H. Kaufmann (1986). Asymptotic inference in discrete response models. Statistische Hefte, 27(1):179–205.
• Hall (1987) Hall, P. (1987). On the Bootstrap and Likelihood-Based Confidence Regions. Biometrika, 74:481–493.
• Harrell (2001) Harrell, F. E. J. (2001). Regression Modeling Strategies. Springer.
• Henningsen and Toomet (2011) Henningsen, A. and O. Toomet (2011). maxLik: A package for maximum likelihood estimation in R. Computational Statistics, 26(3):443–458.
• Liew (1976) Liew, C. K. (1976). Inequality Constrained Least-Squares Estimation. Journal of the American Statistical Association, 71(355):746–751.
• McCullagh (1980) McCullagh, P. (1980). Regression models for ordinal data. Journal of the royal statistical society. Series B (Methodological), pages 109–142.
• Morris et al. (2019) Morris, T. P., I. R. White, and M. J. Crowther (2019). Using simulation studies to evaluate statistical methods. Statistics in medicine, 38(11):2074–2102.
• Moustaki (2000) Moustaki, I. (2000). A latent variable model for ordinal variables. Applied psychological measurement, 24(3):211–223.
• Olive (2018) Olive, D. J. (2018). Applications of hyperellipsoidal prediction regions. Statistical Papers, 59:913–931.
• Stout (2015) Stout, Q. F. (2015). Isotonic regression for multiple independent variables. Algorithmica, 71(2):450–470.