A list of resources on how/why to do a PhD
Galarza, Lachos and Bandyopadhyay (2017) have recently proposed a method of estimating linear quantile mixed models (Geraci and Bottai, 2014) based on a Monte Carlo EM algorithm. They assert that their procedure represents an improvement over the numerical quadrature and non-smooth optimization approach implemented by Geraci (2014). The objective of this note is to demonstrate that this claim is incorrect. We also point out several inaccuracies and shortcomings in their paper which affect other results and conclusions that can be drawn.READ FULL TEXT VIEW PDF
A list of resources on how/why to do a PhD
Study Minimally Important Effect Size for CASI - Simulation methods and supplemental results
Linear quantile mixed models (LQMMs) were developed by Geraci and Bottai (2014) as an extension of the quantile regression model with random intercepts of Geraci and Bottai (2007). We consider data from two-level nested designs in the form , for and , , where is the th row of a known matrix , is the th row of a known matrix and is the
th observation of the response vectorfor the th cluster. The vector of responses is denoted by . This kind of data arise from longitudinal or panel studies and other cluster sampling designs.
The th LQMM is defined as
where is the given quantile level, is a vector of -specific coefficients that are common to all clusters, while the vector may vary with cluster. For estimation purposes only, Geraci and Bottai (2014) introduced the convenient assumption that the responses , , , conditionally on a vector of random effects , independently follow the asymmetric Laplace (AL) density
where is the ‘check’ function and denotes the indicator function, with location and scale parameters given by and , respectively, which we write as
. (The third parameter of the AL is the skew parameterwhich, in this model, is fixed and defines the quantile level of interest.) Also, they assumed that , for , is a random vector independent from the model’s error term with mean zero and
-specific variance-covariance matrixof dimensions . The latter is reparameterized in terms of an -dimensional vector, , of non-redundant parameters , i.e. .
The algorithm to estimate is described in detail by Geraci (2014); Geraci and Bottai (2014). First, the (quasi) log-likelihood is integrated numerically over the distribution of the random effects, i.e.
with , where and , , , denote the abscissas and the weights of the (one-dimensional) Gaussian quadrature. Second, the integrated log-likelihood (3
) is maximized via a non-smooth optimizer. In principle, one can consider different distributions for the random effects, which may be naturally linked to different quadrature rules (or penalties). For example, it is immediate to verify that the normal distribution is akin to a Gauss-Hermite quadrature, a special case of LQMM discussed by Geraci and Bottai.
In their paper, Galarza et al write [Geraci and Bottai (2014)] extended [Geraci and Bottai’s (2007)] setup to accommodate multiple random effects […]. Here, we consider a more general correlated random effects framework with general dispersion matrix . This statement is not justified since their model is precisely the LQMM defined in (1)-(2) with normal random effects and general variance-covariance matrix .
), Galarza et al proposed to use a stochastic approximation of the expectation-maximization (SAEM) algorithm. They compared their estimation approach, as implemented in theqrLMM package, to the quadrature-based algorithm implemented in the lqmm package (Geraci, 2014). The description of their simulation study is not clear as it lacks several details. First of all, it does not specify which versions of lqmm, qrLMM or R were used in their study. Most importantly, there is no indication about lqmm optimization settings and the syntax used for modeling. We speculate that the default options were used. We tried to replicate their design and we believe that the setting described below is quite similar to theirs.
We generated data according to the model
where , , , , , , and . Moreover, , where
The number of clusters varied (50, 100, 200, 300) while the size of the clusters was fixed to , , throughout the simulation. We considered five LQMMs (1) for . (Note that the error in (4) is sampled from an AL with skewness determined by the same that defines the quantile to be estimated, therefore for all .) Data were replicated 100 times for each combination of sample size and quantile level.
For this simulation, we used lqmm 1.5.3, which is, at the time of writing, the latest version available on the Comprehensive R Archive Network, and qrLMM 1.3 for R version 3.4.2 (R Core Team, 2017). By default, the function QRLMM starts the SAEM algorithm with estimates of and
obtained from linear programming (packagequantreg). In contrast, lqmm
starts by default from ordinary least squares estimates. Therefore, we changed the optionlqmmControl(startQR = TRUE) for the sake of comparability. Moreover, we used quadrature knots instead of the default to improve accuracy since (see Geraci and Bottai, 2014, for details). For the SAEM algorithm we used the same settings as in Galarza et al, namely 20 Monte Carlo simulations, 500 maximum iterations and for the cut-point that determines the proportion of initial iterations with no memory. The variance-covariance matrix was specified as a general positive-definite matrix in both estimation procedures. All the other estimation settings in lqmm and QRLMM were left unchanged to their default values. In a preliminary analysis, we assessed the computational time needed to run the full simulation and we estimated it would take approximately two months for QRLMM, but less than half an hour for lqmm. Given the excessive computational time needed for QRLMM, we ran the latter only for selected scenarios, namely and .
|Algorithm||Average bias||Average root mean||Total elapsed time||Average elapsed time||Percentage of convergence|
|lqmm||0.0019||0.0852||7.5 (min)||0.7 (s)|
|qrLMM||0.0017||0.0910||24066.0 (min)||2551.2 (s)|
Table 1 shows a summary of the actual performance of the two algorithms, while Figures 1 and 2 show, respectively, the absolute bias and root mean squared error (RMSE) of the two estimators. For comparison, Figure 2 also shows the RMSE values reported for SAEM by Galarza et al in Table 2 of their paper.
The average bias and RMSE calculated for and in selected scenarios were small for both estimators (Table 1), with average RMSE slightly lower for lqmm. Most notably, the time needed by qrLMM to run one model was on average 42.52 minutes (min 5.82, max 132.30 minutes) with a convergence failure rate. In contrast, lqmm took less than 1 second for one replication (min 0.09, max 9.27 seconds) with no convergence failures in any of the selected scenarios (and no convergence failures in any of the other scenarios either). The bias and RMSE for specific sample sizes and quantiles were similar in most cases except for the intercept, in which case lqmm seemed to have an advantage over qrLMM at more extreme quantiles (Figures 1 and 2). Note also that the RMSE results reported by Galarza et al in their paper are close to those we obtained in our simulation for selected scenarios (Figure 2), thus it is reasonable to conclude that our simulation design is a faithful reproduction of theirs.
Table 2 shows that the (scaled) average log-likelihood resulting from the two fitting algorithms was comparable in all selected scenarios.
As a side note, we found Galarza el al’s simulation setting rather unusual since the vector is typically a subset of
, for the random effects are constrained to have zero-mean (see also the discussion in the next section). Moreover, the AL distribution in LQMM provides only a quasi-likelihood for point estimation, i.e. it is not assumed to be the true distribution. Galarza el al’s simulation is very limited and is more of a sanity check than a validation study. It would be more realistic to simulate errors from a variety of distributions with different shapes, along with heteroscedastic variants of these models. An extensive simulation of this kind is provided byGeraci and Bottai (2014).
Galarza et al also provided a comparison between the two algorithms using a subset of the cholesterol data from the Framingham study (Zhang and Davidian, 2001). Once again, the description of their analysis is incomplete. First of all, the authors state Interestingly, for the extremes quantiles, some warnings messages on convergence were displayed while fitting Geraci’s method, even after increasing the number of iterations and reducing [sic] the tolerance, as suggested in the lqmm manual. The authors do not say which estimation settings were used initially and how these were changed afterwards. Most importantly, they do not say whether the warning messages were obtained during model fitting or bootstrapping, and how many warnings were produced. These warnings may be of little concern (see Geraci, 2014, for a discussion on this point) and, as shown further below, they can be addressed with an appropriate tweaking of optimization parameters (Geraci, 2014), along with a thoughtful examination of the data and model.
The authors are also silent on the model for . This is irrelevant for the qrLMM package since it provides only one model (i.e., the general positive-definite matrix). However, the lqmm package provides four different models, including the diagonal variance-covariance structure as the default.
We then decided to replicate the analysis of Galarza et al who considered the model
where is the th measurement of cholesterol (divided by 100) and , with denoting years since the beginning of the study.
There is clearly something awkward about the specification of model (5) since it does not include a fixed coefficient for . We can only surmise that Galarza et al misinterpreted equation (9) in Zhang and Davidian (2001), where the random slope is actually centered about the fixed slope (not about zero). Whatever the reason, suffice it to say that this oversight not only may introduce bias in the estimates, but it might also render estimation more difficult and prone to failure if the fixed slope is effectively different from zero. Therefore, we proceeded with the following corrected model
is a general positive-definite matrix. Model fitting and bootstrap standard error estimation were carried out for 19 vigintiles as detailed in Appendix A. The tolerance parameters set forlqmm estimation were less restrictive than the default values but still within reasonable bounds. We obtained only two warning messages of failed convergence during bootstrapping, but not during model estimation. Considering that models were fitted for the bootstrap, this issue is hardly worthy of note in this case.
Finally, Galarza et al’s statement We observe that our SAEM method leads to mostly smaller SEs and AIC compared to the Geraci method. […] Hence […] the substantial gain in the AIC criterion and the SEs establish that our SAEM approach provides a much better fit to the dataset is a hodgepodge of claims that are not substantiated anywhere in their paper. First, the authors found in their simulation study that the asymptotic approximations provide valid standard errors (Galarza, Lachos and Bandyopadhyay, 2017, Table 1). This should be expected since the data were generated under the AL distribution. Theoretical results actually show that outside the AL case it is not appropriate to quantify uncertainty using this distribution as it leads to underestimation of the true variability (Yang, Wang and He, 2016). This is why lqmm makes use of bootstrap. Moreover, the authors never provided a simulation study to compare SAEM’s standard errors with those obtained with lqmm; thus, it is not possible to understand the nature of the differences found for one particular dataset (even if we grant for the sake of argument that the lqmm’s estimation settings and modeling syntax were appropriately specified in the Framingham data analysis). Secondly, the comparison between average log-likelihoods in our simulation (Table 2) does not support the superiority of SAEM in terms of goodness of fit (GOF). In summary, the issue of standard error estimation remains to be investigated, while empirical evidence, although limited, contradicts Galarza et al’s claim that SAEM gives a better GOF performance.
Linear quantile mixed models (Geraci and Bottai, 2014) represent a valuable tool available to the scientific community. Computational issues are still an open problem and different approaches have been investigated by several researchers (see Marino and Farcomeni, 2015, for an overview). Galarza et al’s claim of SAEM’s superior performance fails to stand up to closer examination. Our simulation shows that while SAEM produces finite-sample bias and RMSE comparable to those obtained from the quadrature-based algorithm in lqmm, its sluggish convergence and high proportion of convergence failures put Galarza et al’s proposal at enormous disadvantage.
In this Appendix, we provide the R code for the analysis of the Framingham cholesterol data using the lqmm package.
library(lqmm) data(Cholesterol, package = "qrLMM") Cholesterol$year.c <- (Cholesterol$year - 5)/10 Cholesterol$sex <- as.factor(Cholesterol$sex) # Set optimization parameters ctrl <- lqmmControl(method = "df", LP_tol_ll = 1e-3, LP_max_iter = 2000, startQR = TRUE) # Fit model for tau = 0.05, 0.1, ..., 0.95 fit <- lqmm(I(cholst/100) ~ year.c + sex + age, random = ~ year.c, group = ID, data = Cholesterol, tau = 1:19/20, covariance = "pdSymm", control = ctrl) # Bootstrap (50 replicates) fit.s <- summary(fit, R = 50, seed = 178)
This research was partially supported by the National Institutes of Health – National Institute of Child Health and Human Development (Grant Number: 1R03HD084807-01A1).
[author] Galarza, C. E.C. E., Lachos, V. H.V. H. Bandyopadhyay, D.D. (2017). Quantile regression in linear mixed models: A stochastic approximation EM approach. Statistics and Its Interface 10 471-482.