The literature on regression methods for discrete response variables has been traditionally concerned with parametric distributions. Generalized linear models (GLMs)(McCullagh and Nelder, 1989)
, which assume that the conditional distribution of the response belongs to an exponential family, have long dominated the scene in methodological and applied research. The reasons are manifold and include convenient interpretation of the regression coefficients, e.g., as (log) odds ratios in logistic regression or (log) rate ratios in Poisson regression, universal availability in statistical software, and the benefits of a well-developed, unifying maximum likelihood theory.
Research has been increasingly directed toward the development of nonparametric (distribution-free) methods to overcome situations in which traditional approaches are unsatisfactory or, more in general, when the goal of the inference transcends the conditional mean of the response. In its classical formulation (Koenker and Bassett, 1978), quantile regression (QR) provides a distribution-free approach to the modeling and estimation of the effects of covariates on different quantiles of the conditional distribution of a continuous response variable. Median regression is a special case of QR and represents a robust alternative to mean regression.
While most of the progress in QR methods has revolved around continuous responses, relatively less contributions have been made in the discrete case so far. Major hindrances include lack of a general theory for handling different types of discreteness, practical estimation challenges, and the troublesome asymptotic behavior of sample quantiles in the presence of ties. Thus, it is not suprising that existing approaches to discrete QR rely on some notion of continuity, either postulated or artificially induced. Early works in the former category date back to the 1950’s (Rosenblatt, 1958).
Prominent in the econometric literature, Maximum Score Estimation (MSE) deals with conditional median models of binary (Manski, 1975, 1985; Horowitz, 1992) and ordered discrete (Lee, 1992) response variables. More recently, Kordas (2006) extended Horowitz’s (1992) estimator for binary outcomes to quantiles other than the median. In the MSE approach, the key assumption is the existence of a continuous latent variable, say , which undergoes the working of a threshold mechanism resulting in the observable binary outcome
. The conditional quantiles of the observable outcome are obtained as transformed quantiles of the latent outcome. However, MSE is computationally expensive as it involves nonconvex loss functions. Jittering is another strategy used for quantile estimation with discrete responses. A continuous variable, say, is obtained by adding random noise, say , to the observable discrete outcome, i.e.
. Estimation then proceeds by applying standard algorithms for convex quantile loss functions (e.g., linear programming) and, successively, by averaging the noise out. This approach, which has been adopted for modeling count(Machado and Santos Silva, 2005) and ordinal (Hong and He, 2010) response variables, may lack generality as it requires that adjacent values in the support of are equally spaced.
In this paper, we build on mid-quantiles (Parzen, 1993) to introduce an alternative estimation approach for conditional quantiles of discrete responses. Sample mid-quantiles, which are based on essentially the same idea of the mid--value (Lancaster, 1961), offer a unifying theory for quantile estimation with continuous or discrete variables and are well-behaved asymptotically (Ma, Genton and Parzen, 2011). In our approach, we develop a two-step estimator that can be applied to a large variety of discrete responses, including binary, ordinal, and count variables, and is shown to have good theoretical properties. In a simulation study and in real data analyses, we gather empirical evidence that conditional mid-quantile estimation is more stable and efficient than jittering. However, this evidence is contextual and may not be generalizable.
The rest of the paper is organized as follows. In the next section we discuss modeling, estimation, and theoretical properties of conditional mid-quantile estimators. In Section 3
, we report the results of a simulation study to assess bias and efficiency of the proposed estimator, as well as confidence interval coverage. In Section4, we illustrate two real data applications, one on global terrorism and the other on prescription drugs in the United States. We conclude with final remarks in Section 5.
2.1 Marginal mid-quantiles
be a discrete random variable with probability mass function
and cumulative distribution function (CDF). The th quantile of , denoted by , is defined as for any . We may define the quantile function (QF) of , , as the generalized inverse of the CDF of Y, that is . In the discrete case, the CDF is not injective, thus a discrete QF is not the ordinary inverse of the CDF. Now let be an independent sample of size from the population . The sample CDF is defined as , , while the (classical) sample QF, defined as the inverse of the sample CDF, is such that , for . This function, too, is discrete.
In general, sample quantiles as defined above may not be consistent for the population quantiles when the underlying distribution is discrete (Jentsch and Leucht, 2016). For example, not only does the sample median lack asymptotic normality if is discrete, but its limiting distribution appears to be discrete in the presence of ties, even if is continuous (Genton, Ma and Parzen, 2006).
We now introduce the mid-cumulative distribution function (mid-CDF) (Parzen, 1993, 2004), a modification of the standard CDF that plays an important role in discrete modeling and in samples with ties. For a random variable with CDF , either continuous or discrete, the function
is called mid-distribution function (mid-CDF). If is continuous, then since .
Further, let be the set of distinct values that the discrete random variable can take on, with corresponding probabilities . We also define the mid-probabilities and , for . The following function
is called mid-quantile function (mid-QF) (Ma, Genton and Parzen, 2011). If , then the last category is suppressed. One can verify that . Examples of when is discrete uniform, count, or binary are given in Figure 1. The continuous version of the mid-QF is piecewise linear and it connects the points (dashed lines in Figure 1).
The sample mid-CDF corresponding to (1) is , where is the sample relative frequency of . To define the sample mid-quantiles, we need to introduce some more notation. Let , , be distinct values that occur in the sample and let , , be the corresponding relative frequencies. Then , where is such that , , , , and . In samples with ties, is the piecewise linear function connecting the values . Ma, Genton and Parzen (2011) showed that if the underlying distribution is absolutely continuous, then the sample mid-quantiles have the same asymptotic properties as the classical sample quantiles. More importantly, if is discrete, then the sample mid-quantiles are consistent estimators of the population mid-quantiles and their sampling distribution is normal (Ma, Genton and Parzen, 2011).
2.2 Conditional mid-quantiles
Analogously to (1), we define the conditional mid-CDF as
where is a
-dimensional vector of covariates (these may include a vector of ones). Theconditional mid-QF is given by .
We assume a model that is linear on the scale of , i.e.,
where is a known monotone and differentiable transformation function, and is a vector of unknown regression coefficients. In our approach,
may simply be the identity or a linear transformation, the logarithmic function—which is typically used in the modeling of counts(e.g., see Machado and Santos Silva, 2005), the logistic function, or belong to a family of flexible transformation models (Chamberlain, 1994; Mu and He, 2007; Yin, Zeng and Li, 2008; Geraci and Jones, 2015). These often involve the Box-Cox (Box and Cox, 1964) or Aranda-Ordaz (Aranda-Ordaz, 1981) families.
Our definition of conditional mid-quantiles is general as it applies to any type of discrete response variable that can be ordered. An interesting special case is when is binary. According to (3), the mid-CDF is then given by , where . This leads to the following conditional mid-quantile function
where and . Therefore the conditional mid-median is exactly equal to , while all the other conditional mid-quantiles are shifted by .
We define the sample equivalent of (3) as
where, again, , , is the th distinct observation of that occurs in the sample. Also, define as an interpolation of the sample estimates , for a given .
Inference for model (4) proceeds in two steps. First, we estimate the mid-CDF (this is discussed in more detail at the end of this section). Secondly, we estimate by solving the implicit equation , where . Our objective function and estimator are thus given by
We can make (7) explicit by using the following linear interpolating function
where and . The index identifies, for a given , the value , , such that . The indices and identify, respectively, the largest and smallest value such that and . If we restrict , where , then we find that our estimator, conditionally on , has the form
where is a matrix with th row and is a vector with th element . It is straightforward to verify that (8) is a minimizer by plugging it into (6). The closed-form of (8) is, clearly, computationally convenient.
We can derive the variance-covariance ofvia the total variance law as follows:
We estimate the first term in the right-hand side of (9) using a Huber-White variance-covariance estimator, which is given by , where and , . To obtain an estimate of the second term, we note that, by the delta method,
The expression for depends on the variance of the estimator , which we discuss further below. We omit the tedious algebra for the Jacobian , which can be easily obtained via numerical differentiation. Also, note that the latter is carried out efficiently since the Jacobian is sparse, with sparsity no less than . This follows from the fact that the th partial derivatives of with respect to elements of with indices other than and are null (hence, there are at most non-zero partial derivatives).
Of course, one can still apply numerical optimization to obtain (7), regardless of whether is within the interval . In this case, the variance of can be obtained using nonparametric bootstrap (Jentsch and Leucht, 2016), an estimator of the asymptotic variance as derived in Theorem 2 below (more details are given at the end of Appendix A.2), or a numerical Hessian approximation within the optimization procedure.
The estimation of plays a key role in our approach. In our formulation, we require an estimator that can be applied to discrete responses and that admits continuous and discrete covariates (or a mix thereof). In line with the nonparametric flavor of our modeling strategy, we considered the conditional CDF estimator proposed by Li and Racine (2008). This takes the form
where is the (product) kernel with bandwidth vector and is the kernel estimator of the marginal density of . We refer the reader to Li and Racine (2008), Li, Lin and Racine (2013), and Hayfield and Racine (2008) for technical and implementation details. By applying (11) to the sample observations, we obtain , , and , which we plug into (5). Here, we set , hence . It follows that the diagonal elements of the matrix are given by
, , and . In the expression above, we can neglect the covariance between and , , as this is asymptotically zero as shown in the proof of Theorem 2 (Section 2.4). This means that the off-diagonal elements of are also asymptotically negligible. Finally, the expression for the variance of is given in Li and Racine (2008).
Unfortunately, nonparametric estimation of entails a loss of performance when the dimension of is large, the design is sparse, or both. In these cases a semiparametric approach may be preferred. A natural choice is the binomial estimator , where is the predictor of a binomial model on the response , , with -dimensional parameter vector and link function . An estimator of this sort is discussed by Peracchi (2002).
2.4 Theoretical results
We first show consistency of under general assumptions. We then provide its asymptotic distribution. Here, we assume that the conditional CDF estimator of Li and Racine (2008) is used to obtain
. The identity matrix of orderwill be denoted by . The proofs of the theorems in this section are given in Appendix A.
Generate independent observations from a discrete distribution with parameters satisfying (4). Assume that the marginal density of is strictly positive and that . Assume also that the kernel in (11) is symmetric, bounded, and compactly supported, and that , while for all . The interpolation operator is assumed to be deterministic. Let denote the solution in (7) and let be its population counterpart.
Then, for all , . Additionally, and .
In addition to the assumptions in Theorem 1, assume that the interpolation operator is differentiable. Assume also that the design matrix is full rank, that exists and is a positive-definite matrix. Finally, assume that . Then,
where , , and .
3 Simulation study
Data were generated according to four models with homoscedastic discrete uniform, heteroscedastic discrete uniform, Poisson, and Bernoulli errors. Each model was considered with either one discrete covariate (scenario a), or with two continuous covariates (scenario b) as follows:
, where and ;
, where , , and ;
, where and ;
, where , , and ;
, where , , and ;
as in scenario (3a) with , , and ;
, where , , and ;
as in scenario (4a) with , , and ;
denote random variables with, respectively, discrete and continuous uniform distribution on. Samples of size were independently drawn from each model for replications. We then fitted the linear mid-quantile model with data generated under models 1 and 2; the log-linear mid-quantile model with data generated under model 3; and the logistic mid-quantile model
with data generated under model 4. All models were estimated for 7 deciles,, except the logistic model, which was estimated for the median only.
Let denote the true mid-quantile at level for a given under any of the data-generating models defined above and be the corresponding estimate for replication . We assessed the performance of the proposed methods in terms of average bias and root mean squared error (RMSE) of the mid-QF, i.e.
respectively. We also report the average true mid-quantiles at
as a term of comparison for assessing the relative magnitude of the bias. Finally, we calculated confidence intervals to assess coverage of the slope parameter in mid-quantile models for
when data were generated under scenarios 1a, 2a, and 3a. The corresponding standard errors were computed based on expression (9).
Estimated bias and RMSE of the proposed estimator are shown in Tables 1-3 for scenarios 1a, 2a, and 3a, and in Tables B1-B3 (Appendix B) for scenarios 1b, 2b, and 3b. The bias was, in general, small, never exceeding of the average mid-quantile for the homoscedastic discrete uniform model (1a and 1b), for the heteroscedastic discrete uniform model (2a and 2b), and for the Poisson model with a discrete covariate (3a). In contrast the bias was relatively higher (up to of the average mid-quantile) in the Poisson scenario with continuous covariates (3b), although this issue was limited to the tail quantiles at smaller sample sizes. Both bias and RMSE decreased with at approximately the expected rate for all three models. The estimated bias and RMSE of the proposed estimator for the Bernoulli model (4a and 4b) were extremely small at all sample sizes (results not shown). The estimated conditional mid-quantiles from all replications and the average estimated conditional mid-quantiles () are shown in Figure B1 (Appendix B) for scenarios 1b, 2b, and 3b, and in Figure B2 (Appendix B) for scenario 4b. All mid-quantiles are plotted as functions of , with set equal to the median of . The observed coverage at the nominal confidence level for the slope in selected scenarios is given in Table 4. The results are in general accurate, although frequencies are occasionally slightly away from the nominal level. This is not surprising since the sample estimator of (9) relies on the Huber-White estimator and on several approximations.
It would be remiss of us not to make a contrast between our proposed estimator and existing alternatives. The estimator developed by Machado and Santos Silva (2005)
(hereinafter referred to as MSS) is a natural candidate. However, comparison is inevitably restricted: first of all, the response must be a count or ordinal variable; more importantly, neither the ‘true’ coefficients nor the population quantiles underlying mid-quantile and jittering-based estimation are, in general, the same quantities. Indeed, the quantiles modeled by MSS are defined as, where and is uniformly distributed on . The jittered quantiles dominate the true quantiles since , uniformly over . In contrast, mid-quantiles interpolate the true quantiles. For these reasons, we consider only Poisson data as in scenario 3a. The ratio of the variance of the slope estimates for the jittering-based estimator relative to ours (VR) is given in Table 5. There is clear evidence that our estimator is generally more efficient in this particular instance, but clearly results are not generalizable.
Two real data applications are discussed in this section. The first application concerns deaths from terrorist attacks around the world, while the second deals with prescription drugs use in the United States (US).
4.1 Global terrorism
The Global Terrorism Database (GTD), accessible at https://www.start.umd.edu/gtd, is a comprehensive, open-source database that provides information on terrorist events around the world from 1970 through 2017 (National Consortium for the Study of Terrorism and Responses to Terrorism (START), 2018). For each GTD incident, several variables are available including date and location of the incident, method of attack, and the number of fatalities. Due to an issue of consistency with data collection up to 2007, we restricted our analysis to the period 2008-2017. We analyzed attacks in selected regions of the world (East Asia, North America, Western Europe). Moreover, we included observations of selected types of attacks (armed assaults, bombings, hijacking, facility/infrastructure attacks) and excluded assassinations, hostage takings, kidnappings, unarmed assaults and attacks of unknown type. Figure 2
shows the marginal distribution and mid-quantile function of fatalities estimated from a total of 2,491 incidents, by region. The maximum number of casualties in a single attack was 184 as a result of riots (armed assaults) in Urumqi, China. The extreme skewness of the distribution is apparent, as is the sparsity of the observations. The latter increases with the size of the counts, particularly in East Asia where the count distribution had the largest gap (going from 50 to 184 deaths, the two most extreme counts). This makes the application of the MSS estimator somewhat troublesome, as argued further below.
We fitted log-linear mid-quantile models, separately for each region, to estimate temporal trends in number of deaths. Here, the quantile level is an index of the lethality of the attacks in a given year. In Figure 3 we show the estimated conditional mid-quantiles by year, for , . During 2008-2017, the estimated number of fatalities in terrorist attacks increased in East Asia () and Western Europe (), but remained constant in North America (). In Western Europe, the yearly rates of increase differed by quantile level, with an estimated rate of at the median (90% confidence interval: to ), but at (90% confidence interval: to ). In other words, deadlier attacks in Europe are becoming more deadly over time. Of course, these results are to be interpreted with caution since the attacks included in this analysis originate, in general, from heterogeneous motives.
Predictions obtained with the MSS approach (based on jittered samples) showed some instability, resulting in quantile-crossing between the 7th and the 8th predicted deciles of casualties in East Asia. These results were insensitive to increasing the number of jittered samples (up to ). The purpose of this comparison is not to claim a superiority of our method in general but simply to remark that, in the presence of large gaps between responses, our interpolation based estimator might have an advantage over jittering.
4.2 Prescription drugs
In this section we illustrate another application of mid-quantile regression using data on prescription medications from the National Health and Nutrition Examination Survey (NHANES) (National Center for Health Statistics (NCHS), 2019). The US is the worldwide leader in per capita prescription drug spending (The Kaiser Family Foundation, 2015)
, and its pharmaceutical market represents a major economic sector worth hundreds of billions of dollars. In a recent quantile regression analysis of NHANES data,Hong et al. (2019) found a higher opioid use (morphine milligram equivalent) in adults with longstanding physical disability and those with inflammatory conditions as compared to individuals with other conditions. Differences were markedly larger at the 75th and 95th percentiles than those at lower percentiles. In the context of medications use, a higher percentile can be interpreted as an index of diminished health, lower quality of life, and higher financial burden. A quantile regression analysis of prescription medications use is therefore of both public health and health economics interest.
We abstracted data () on number of prescription medicines taken and the main reason for use from the 2015-2016 Dietary Supplement and Prescription Medication section of the Sample Person Questionnaire. We also obtained information on age (years), sex, and race. Before carrying out the analysis, we removed the effect of NHANES oversampling by first restricting the dataset to all observations for White persons (about of the overall sample), and subsequently adding observations for persons of other races that were subsampled with probabilities proportional to their NHANES weights. This resulted in a sample () composed of about of White persons and females.
For the purpose of this analysis, the main reason for medication use, originally converted by the NHANES to an ICD-10 (International Statistical Classification of Diseases, 10th revision) code, was grouped into: diseases of the circulatory system (“I codes”), metabolic diseases (“E codes”), mental disorders (“F codes”), diseases of the respiratory system (“J codes”), and all other codes (“other codes”). Race was categorized as White, Black, and other races. We restricted the dataset to adults aged 18-65 years. The final sample size for analysis was. Figure 4 shows the marginal distribution and mid-quantile function of number of prescription medicines.
Preliminarily, we investigated a log-linear model with age (centered at 40 years and scaled by 10), sex (baseline: female), and the interaction between age and sex. The results (not shown) indicated that prescription medications count increases with age (as expected) and that the rate of increase is quantile-dependent, but not sex-specific. We subsequently fitted a log-linear model with reason for medication use (baseline: other codes) and race (baseline: White), in addition to sex and age (but no interaction between age and sex). Due to the high proportion of zeros, the admissible range for the application of (8) was . In Figure 5, we report the estimated coefficients and their confidence intervals for . Males tend to use less medications as compared to women, and the inequality is more marked among high users. Age is confirmed to be an important predictor, with an estimated difference of about 4 prescriptions between a person aged 65 years and an 18 year old at . However, the difference climbs to 25 prescriptions at . The effect of race is not statistically significant for any of the quantiles. In contrast, a rather strong effect is due to reason for medication. As compared to persons in the group of ‘other’ ICD-10 codes, those assigned to the circulatory diseases group have the highest use of prescription drugs, followed by persons assigned to mental disorders, metabolic diseases, and respiratory diseases groups. Moreover, effects for all groups decrease with the quantile index . The decreasing magnitude of the estimated coefficients indicate that, with larger , predictions of one group relative to the reference group become proportionally smaller. Yet, the absolute differences between predictions increase with . For instance, as compared to persons in the reference group, those in the circulatory diseases group received 44 more prescription drugs at , but 98 at . By comparison, an analogous GLM for Poisson data predicts a difference of 34 and 37 prescriptions at and , respectively, thus grossly underestimating the effect of the reason for medication.
Finally, we compared our estimates to those obtained using the MSS estimator. The results, reported in Table 6, confirm the findings from the simulation study according to which our proposed estimator might have an advantage in terms of efficiency.
5 Concluding remarks
We developed an approach to conditional quantile estimation with discrete responses. We established the theoretical properties of our conditional mid-quantile estimator under general conditions and showed its good performance in a simulation study with data generated from different discrete response models. Our two-step estimator is easy to implement. When constraining the quantile index to a data-driven admissible range, the second-step estimating equation has a least-squares type, closed-form solution, which is computationally efficient. In real data analyses, conditional mid-quantiles provided results that reveal interesting aspects of important matters like terrorism and prescription drug use.
We believe that mid-quantile regression is amenable to several possible extensions, including estimation in the presence of censoring and modeling of clustered data.
Appendix A - Proofs of Theorems
a.1 Proof of Theorem 1
Under the conditions stated (Li and Racine, 2008) Since the interpolation operator is deterministic by assumption, then it follows that Consequently,
Consider now . It is straightforward to verify that
In fact, if for some value of and , then ; while all other values are obtained through interpolation. A consequence is that is, eventually, a solution of the minimization problem in (7). Additionally, there is only one such solution, since, by assumption, for all , and is monotonic for , where and are the mid-probabilities corresponding to, respectively, the smallest and largest discrete value (if , then ). This implies consistency of , the minimizer in (7). Consistency of the predicted mid-quantiles follows directly. ∎
a.2 Proof of Theorem 2
Since the differentiability of follows from the assumptions, we can apply a first-order Taylor expansion to obtain
where is a point in the interior of the hypercube delimited by and . Expressions for and are given in the next section. Note that since is the minimizer in (7). The assumption on the design matrix guarantees that the Hessian is positive definite. Hence, we can rewrite (A.1) as
To derive the asymptotic distribution of , it suffices to study the asymptotic distribution of the right-hand side of (A.2). First, let . By using the consistency results in Theorem 1 and the triangle inequality, it is immediate to show that weakly converges element-wise to . Using the results in the web supplement, we then can write
We need to demonstrate that the expression above converges in distribution, thus we expand the quantities on the right-hand side as follows:
where . First of all, as shown in Li and Racine (2008), converges in distribution to a Gaussian random variable for all . Additionally, the assumptions on the bandwidths guarantee asymptotic independence of and for and all . To see this, note that for all . According to the dominated convergence theorem, the asymptotic covariance of