The normal (or Gaussian) distribution historically has played a prominent role not only as limiting distribution of a number of sample statistics, but also for modelling data obtained in empirical studies. Its probability density is given by
. The Laplace (or double exponential) distribution, like the normal, has a long history in Statistics. However, despite being of potentially great value in applied research, it has never received the same attention. Its density is given by
Throughout this paper, these distributions will be denoted by and , respectively.
In (1) and (2), and , where and , represent a location and a scale parameters, respectively. These two densities are shown in the left-hand side plots of Figure 1. The normal and Laplace distributions are both symmetric about
and have variance equal to. As compared to the normal one, the Laplace density has a more pronounced peak (a characteristic technically defined leptokurtosis
) and fatter tails. Interestingly, the Laplace distribution can be represented as a scale mixture of normal distributions. Let, then (Kotz, Kozubowski and Podgórski, 2001)
where and are independent standard exponential and normal variables, respectively. That is, the Laplace distribution emerges from heterogeneous normal sub-populations.
). At Laplace’s time, the problem to be solved was that of estimatingaccording to the linear model
where denotes the error term. This problem was encountered, for example, in astronomy and geodesy, where represented the ‘true’ value of a physical quantity to be estimated from experimental observations. It is well known that, under the Gaussian error law (1), the maximum likelihood estimate of is the sample mean but, under the double exponential error law (2), it is the sample median. The former is the minimiser of the least squares (LS) estimator, while the latter is the minimiser of the least absolute deviations (LAD) estimator.
The robustness of the LAD estimator in presence of large errors was known to Laplace himself. However, given the superior analytical tractability of the LS estimator (and therefore of the normal distribution), the mean regression model
quickly became the ‘standard’ tool to study the association between the location parameter of
(the response variable) and other variables of interest,(the covariates).
In the past few years, theoretical developments related to least absolute error regression (Bassett and Koenker, 1978; Koenker and Bassett, 1978) have led to a renewed interest in the Laplace distribution and its asymmetric extension (Yu and Zhang, 2005) as pseudo-likelihood for quantile regression models of which median regression is a special case (see, among others, Yu and Moyeed, 2001; Yu, Lu and Stander, 2003; Geraci and Bottai, 2007). In parallel, computational advances based on interior point algorithms have made LAD estimation a serious competitor of LS methods (Portnoy and Koenker, 1997; Koenker and Ng, 2005). Another reason for the ‘comeback’ of the double exponential is related to its robustness properties which makes this distribution and distributions alike desirable in many applied research areas (Kozubowski and Nadarajah, 2010).
In statistical applications, the interest is often in processes where the source of randomness can be attributed to more than one ‘error’ (a hierarchy of errors is also established). For instance, this is the case of longitudinal studies where part of the variation is attributed to an individual source of heterogeneity (often called ‘random effect’), say, independently from the noise, , i.e.
where the distributions of and are often assumed to be symmetric about zero. It will be shown later that this model can be extended to include covariates associated with the parameter and the random effect . For now, it suffices to notice that the linear combination of random errors leads to the study of convolutions. So let us define a convolution (Mood, Graybill and Boes, 1974).
If and are two independent, absolutely continuous random variables with density functions
are two independent, absolutely continuous random variables with density functionsand , respectively, and , then
is the convolution of and .
In Section 2, I consider convolutions based on the normal and the Laplace distributions within a simple and practical scheme. In Section 3, I discuss inference when data are clustered, along with the implementation of estimation procedures using existing R (R Core Team, 2014) software (further technical details are provided in Appendix, along with a simulation study). In Section 4, I show some applications and, in Section 5, conclude with final remarks.
Let be a real-valued random variable with absolutely continuous distribution function and density . The variable is observable and represents the focus of the analysis in specific applications (e.g., as the response variable in regression models). I consider four cases in which results from one of the four convolutions reported in Table 1. The letters and are used to denote normal and Laplace variates with densities (1) and (2
), respectively. The subscripts 1 and 2 indicate, respectively, which of the two random variables plays the role of a random effect and which one is considered to be the noise. Here, the former may in general be associated with a vector of covariates and may represent an inferential quantity of interest; the latter is treated as a nuisance. Moreover, I assume independence between the components of the convolution throughout the paper.
A few remarks about notation are needed. The shorthand or , where is a vector, is used to denote the diagonal matrix whose diagonal elements are the corresponding elements of
. The standard normal density and cumulative distribution functions will be denoted byand , respectively.
2.1 Normal-normal (NN) convolution
The first convolution
where and , represents, in some respects, the simplest case among the four combinations defined in Table 1. Standard theory of normal distributions leads to
Model (3) can be generalised to the regression model
where and are, respectively, and vectors of covariates, and is a dimensional vector of regression coefficients. If , then I assume , that is, a multivariate normal distribution with variance-covariance matrix . It follows that
2.2 Normal-Laplace (NL) convolution
The second convolution consists of a normal and a Laplace components, that is
where and . The resulting density is given by (Reed, 2006)
where and is the Mills ratio
The above distribution arises from a Brownian motion whose starting value is normally distributed and whose stopping hazard rate is constant. An extension of (8
) to skewed forms can be obtained by lettingfollow an asymmetric Laplace distribution (Reed, 2006). Applications of the NL convolution can be found in finance (Reed, 2007; Meintanis and Tsionas, 2010). See also the double Pareto-lognormal distribution, associated with , which has applications in modeling size distributions (Reed and Jorgensen, 2004).
As in the previous section, I consider a generalisation of model (7) to the regression case
If , then I assume . It follows that is normal with mean zero and variance . This leads to
where and, as defined above, . It is easy to verify that .
Model (9) is a median regression model with normal random effects, a special case of the linear quantile mixed models (LQMMs) discussed by Geraci and Bottai (2007, 2014). LQMMs have been used in a wide range of research areas, including marine biology (Muir et al., 2015; Duffy et al., 2015; Barneche et al., 2016), environmental science (Fornaroli et al., 2015), cardiovascular disease (Degerud et al., 2014; Blankenberg et al., 2016), physical activity (Ng et al., 2014; Beets et al., 2016), and ophthalmology (Patel et al., 2015; Patel, Geraci and Cortina-Borja, 2016).
2.3 Laplace-normal (LN) convolution
The Laplace-normal convolution is given by
where and . The LN appears in robust meta-analysis (Demidenko, 2013, p.266).
The LN convolution in (11), clearly, is the same as the NL convolution in (7) (so I omit writing its density). However, note that now the Laplace component is associated with the random effect, not with the error term; therefore, the two scale parameters and will appear swapped. The distinction becomes clear when considering the regression model
By analogy with the NL convolution, I assume that, for , has a -dimensional multivariate Laplace distribution (Kotz, Kozubowski and Podgórski, 2001, p.235).
An -dimensional random variable is said to follow a zero-centred multivariate Laplace distribution with parameter , , if its density is given by
where is an nonnegative definite symmetric matrix, and is the modified Bessel function of the third kind.
If , then (Kotz, Kozubowski and
Podgórski, 2001, p.249). For a diagonal matrix , the coordinates of the multivariate Laplace are uncorrelated, but not independent. Therefore, the joint distribution of
, the coordinates of the multivariate Laplace are uncorrelated, but not independent. Therefore, the joint distribution ofindependent univariate Laplace variates does not have the properties of the multivariate Laplace with diagonal variance-covariance matrix.
For , the multivariate density defined above reduces to the univariate density (2) with . Moreover, a linear combination of the coordinates of the multivariate Laplace is still a Laplace (Kotz, Kozubowski and Podgórski, 2001, p.255). Indeed, if we assume , then , where . Thus, the density of in Equation (12) is given by
where . Again, it is easy to verify that .
2.4 Laplace-Laplace (LL) convolution
The fourth and last convolution consists of two Laplace variates, i.e.
where and . It can be shown (Kotz, Kozubowski and Podgórski, 2001, p.35) that the density of is
with and .
For the regression model
with , I obtain
with and . The variance is given by .
2.5 Some properties
All the convolutions are symmetric, unimodal, twice differentiable and have continuous first and second derivatives (the NN and NL are also smooth). Also, they are log-concave since both the normal (1) and Laplace (2) densities are log-concave (Prékopa, 1973). The right-hand side plots of Figure 1 shows that, as compared to the NN density, the NL (LN) and LL densities are leptokurtic and have more weight in the tails, with the NL density sitting between the NN and LL distributions. Thus, the presence of the Laplace term in the convolution confers different degrees of robustness to the model depending on whether one or both random terms are assumed to be Laplacian. Also, notice that the marginal regression models are location–scale-shift models, since both the location and the scale of are functions of the covariates.
In this section, I briefly discuss inferential issues, with detailed mathematical derivations provided in Appendix.
Let be a multivariate random response vector, and and , be vectors of covariates for the th observation, , in cluster , . Each component of can be modelled using any of the convolutions discussed in Section 2 by assuming
where the random effect and the error term are either Gaussian or Laplacian according to the scheme in Table 1. The marginal models implied by these four convolutions have been defined in expressions (6), (10), (13), and (17). At the cluster level, I use the notation
where and are, respectively, and design matrices. I assume that the vector of random effects has variance-covariance matrix for all and that the ’s are independent of one another. The structure of
is, for the moment, left unspecified. Also, I assume that
is a multiple of the identity matrix, although this assumption can be easily relaxed (see Section3.4).
There are several approaches to mixed effects model estimation (see, for example, Pinheiro and Bates, 2000; Demidenko, 2013), each approach having its own advantages and disadvantages. One approach is to work with the marginal likelihood of . Although independence between clusters can still be assumed, in general the ’s will be correlated within the same cluster. Therefore, parameter estimation based on the marginal likelihood requires knowing the joint distribution of . Under the NN convolution, is known to be multivariate normal. It is beyond the scope of this paper to derive the multivariate distribution of for the NL, LN and LL convolutions.
Likelihood-based estimation of location and scale parameters using the NN model has been largely studied. Therefore, I will focus on the NL, LN, and LL models. Since an important aspect in applied research is the availability of software to perform data analysis, here I consider two methods which can be applied using existing software. The first method is based on numerical integration and applies to specific NL and LL models, while the other method is based on a Monte Carlo Expectation-Maximisation (MCEM) algorithm and applies to NL, LN and LL models.
3.1 Numerical integration
Let the th contribution to the marginal log-likelihood be
where denotes the density of the error term conditional on the random effect and denotes the density of the random effect. One can work with the numerically integrated likelihood
with nodes and weights , , , as an approximation to the marginal log-likelihood.
The maximisation of the approximate log-likelihood (19) can be time-consuming depending on the dimension of the quadrature , the required accuracy of the approximation controlled by the number of nodes , and, of course, the distribution . If is a diagonal matrix, then . This greatly simplifies calculations since the -dimensional integral can be carried out with successive applications of one-dimensional quadrature rules. In the multivariate normal case, a non-diagonal covariance matrix can be rescaled to a diagonal one and the joint density factorises into normal variates. However, this is not the case for the multivariate Laplace, at least not for the one defined in Section 2.3. Geraci and Bottai (2014)
considered a steepest-descent approach combined with Gauss-Hermite and Gauss-Laguerre quadrature for, respectively, the NL and LL likelihoods. Standard errors were obtained by bootstrapping the clusters (block bootstrap).
Since Geraci’s (2014) algorithms, which are implemented in the R package lqmm, can be applied to selected models only (namely, NL models with correlated or uncorrelated random effects and LL models with uncorrelated random effects), in the next section I develop an alternative, more general approach based on the EM algorithm.
3.2 EM estimation
Rather than working with the Laplace distribution directly, I consider its representation as a scale mixture of normal distributions. As noted before, if , then , where and are, respectively, independent standard exponential and normal variates. This equivalence has been used in EM estimation of regression quantiles (see, for example, Lum and Gelfand, 2012; Tian, Tian and Zhu, 2013). Similarly, in the multivariate case, if , then , where is, again, standard exponential and . As shown in Appendix, the normal components in the scale mixture representation of the NL, LN, and LL models can be easily convolved (conditionally on ) and the resulting log-likelihood for the th cluster becomes
where is multivariate normal and is standard exponential.
The proposed EM algorithm starts from the likelihood of the complete data , where represents the unobservable data. In the E-step, the expected value of the complete log-likelihood is approximated using a Monte Carlo expectation. As shown in expression (25
) in Appendix, the M-step reduces to the maximum likelihood estimation of a linear mixed model with prior weights which can be carried out using fitting routines from existing software (e.g.,nlme or lme4 in R).
3.3 Modelling and estimation of
There are different possible structures for . The simplest is a multiple of the identity matrix, with constant diagonal elements and zero off-diagonal elements. Other structures include, for example, diagonal (variance components), compound symmetric (constant diagonal and constant off-diagonal elements), and the more general symmetric positive-definite matrix. These are all available in the nlme (Pinheiro et al., 2014), lme4 (Douglas et al., 2015) and lqmm (Geraci, 2014) packages, as well as in SAS procedures for mixed effects models.
The variance-covariance matrix of the random effects, whether normal or Laplace, must be nonnegative definite. However, it is possible that, during MLE, the estimate may be singular or veer off into the space of negative definite matrices. This problem does not occur in EM estimation if the starting matrix is nonnegative definite. However, the monotonicity property is lost when a Monte Carlo error is introduced at the E-step (McLachlan and Krishnan, 2008). There are at least three approaches one can consider (Demidenko, 2013, p.88): (i) allow to be negative definite during estimation and, if negative definite at convergence, replace it with a nonnegative definite matrix after the algorithm has converged; (ii) constrained optimisation; (iii) matrix reparameterisation (Pinheiro and Bates, 2000). As discussed in Appendix, I follow the latter approach.
3.4 Residual heteroscedasticity and correlation
The development of the EM algorithm discussed above is based on the assumption that the within-group errors are independent with common scale parameter . As briefly outlined in Section .6
in Appendix, it is easy to extend the NL, LN, and LL models to the case of heteroscedastic and correlated errors. Commonly available mixed effects software provide capabilities for estimating residual variance and correlation parameters. For the sake of simplicity, I do not consider this extension any further in this paper.
Here, I discuss an application in meta-analysis. The data consist of mean standard deviation scores of height at diagnosis in osteosarcoma patients which had been reported in five different studies (Figure2) and were successively meta-analysed by Arora et al. (2011). Let denote the study-specific effect. For these data, I considered the NN model (DerSimonian and Laird, 1986)
where and , and the LN model (Demidenko, 2013)
where and .
In meta-analysis, the goal is to estimate an ‘overall’ or ‘pooled’ effect () and the between-study variance or heterogeneity among study-specific effects (). The sampling variances are assumed to be known.
Estimation for the osteosarcoma data was carried out using R software developed for standard (Viechtbauer, 2010) and robust (Demidenko, 2013) meta-analysis. The estimates (standard errors) of and were, respectively, (0.087) and (0.027) for the NN model, and (0.073) and (0.033) for the LN model. The larger estimated overall effect and heterogeneity for the NN model are a consequence of the outlying effect size of study 5 (Figure 2) which skews the location
and inflates the scale of the normal distribution. In contrast, the Laplace distribution is more robust to outliers and heavy tails. Indeed, the estimate of
from the LN model was more precise as demonstrated by the lower standard error (as a consequence, the related test statistic has smaller-value). A similar example is described by Demidenko (2013).
4.2 Repeated measurements in clinical trials
Ten Crohn’s disease patients with endoscopic recurrence were followed over time (Sorrentino et al., 2010). Colonoscopy was performed and surrogate markers of disease activity were collected on four occasions. One of the goals of this trial was to assess the association between fecal calprotectin (FC – mg/kg) and endoscopic score (ES – Rutgeerts). The data were analysed using a log-linear median regression model under the assumption of independence between measurements (Sorrentino et al., 2010). Here, I take the within-patient correlation into account and analyse the data using three of the four regression models discussed in Section 2: the NN model
the NL model
and the LL model
where and denote, respectively, FC and ES measurements on patient at occasion , , , , and . Therefore, the variance of the random effects is , while the variance of the error term is .
In this case, the parameters of interest are the slope and the intra-class correlation , which measures how much of the total variance is due to between-individual variability. Estimation was carried out using the nlme (Pinheiro et al., 2014) and lqmm (Geraci, 2014) packages. The results are shown in Table 2. The estimates of the regression coefficients tallied across models. However, the estimates of and differed substantially, with values from the NN model much lower than those from the NL and LL models. First-level residuals (i.e., predictions of the random effects plus the error term) and second-level residuals (i.e., predictions of the error term only) from the NN model are shown in Figure 3. It is apparent that may be inflated by an unusual second-level residual, to the detriment of . As a consequence, the intra-class correlation appeared to be heavily underestimated by the NN model. The NL model improved upon the estimation of the scale parameters as it is more robust to outliers in the error term. However, the LL model gave the largest value of the log-likelihood, suggesting that the goodness of the fit is further improved by using a robust distribution for the random effects as well. Note also that the standard error of the slope was smallest for the LL model.
4.3 Growth curves
In a weight gain experiment, 30 rats were randomly assigned to three treatment groups: treatment 1, a control (no additive); treatments 2 and 3, which consisted of two different additives (thiouracil and thyroxin respectively) to the rats drinking water (Box, 1950). Weight (grams) of the rats was measured at baseline (week 0) and at weeks 1, 2, 3, and 4. Data on three of the 10 rats from the thyroxin group were subsequently removed due to an accident at the beginning of the study. Figure 4 shows estimated intercepts and slopes obtained from rat-specific LS regressions of the type
where the response is weight measurement taken on rat on occasion conditional on treatment group , and . (Note that and .) It is evident that the weight of rats treated with thiouracil grew slower than the controls’, though at baseline the former tended to be heavier than the latter. In contrast, rats in the control and thyroxin groups had, on average, similar intercepts and slopes.
The Pearson’s correlation coefficients of the estimated intercept-slope pairs gave (), (), and (), suggesting a negative association between baseline weight and growth rate in all treatment groups. However, the direction of the association in treatment group 3 is unclear. Interestingly, the Kendall rank correlation coefficient in the thyroxin group indicated a weak positive association (), while the Pearson’s coefficient became strongly positive () after removing the two pairs with the largest slopes. Moreover, the distributions of intercepts and slopes showed the presence of skewness and bimodality. Therefore, some degree of robustness against departures from normality might be needed.
To model the heterogeneity within each treatment group, subject-specific random intercepts and slopes were included in the following four models: the NN model
the NL model
the LN model
and the LL model
where I assumed , , , and , and the ’s, , are symmetric matrices,
Further, I assumed that the random effects are uncorrelated between treatment groups.
|Treatment 1||Treatment 2||Treatment 3|
|Treatment 1||Treatment 2||Treatment 3|
|Treatment 1||Treatment 2||Treatment 3|
|Treatment 1||Treatment 2||Treatment 3|
The NL, LN, and LL models were estimated using the EM algorithm as detailed in Appendix with a Monte Carlo size equal to , fixed at each EM iteration, and a convergence tolerance of . The four models gave similar estimates of the fixed effects (Table 3), although the trajectory in the thiouracil group resulting from the LN model tended to be less steep than the corresponding trajectory resulting from the other three models. However, this difference might be of little practical importance. In contrast, more substantial seemed to be the differences between the estimates of the correlation matrices , where , (Table 4). It is interesting to note that there is disagreement on the magnitude and even direction of some of the estimates. Notably, was smallest for the NN model but it was substantially larger for the NL and LL models. The best fit in terms of the log-likelihood was for the NN model, followed closely by the NL model. The LL model and, especially, the LN model gave smaller log-likelihoods.
5 Final remarks
In the words of Wilson (1923, p.842) “No phenomenon is better known perhaps, as a plain matter of fact, than that the frequencies which I actually meet in everyday work in economics, in biometrics, or in vital statistics, very frequently fail to conform at all closely to the so-called normal distribution”. Kotz and colleagues (2001) echo Wilson’s observations on the inadequacy of the normal distribution in many practical applications and give a systematic exposition of the Laplace distribution, an unjustifiably neglected error law which can be “a natural and sometimes superior alternative to the normal law” (Kotz, Kozubowski and Podgórski, 2001, p.13).
My proposed convolution scheme brings together the normal and Laplace distributions showing that these models represent a family of sensible alternatives as they introduce a varying degree of robustness in the modelling process. Estimation can be approached in different ways. The EM algorithm discussed in this paper takes advantage of the scale mixture representation of the Laplace distribution which provides the opportunity for computational simplification. In a simulation study with a moderate sample size (see Section .7 in Appendix), this algorithm provided satisfactory results in terms of mean squared error for the NL and LL models. The estimation of the LN model needed a relatively larger number of Monte Carlo samples to achieve reasonable bias, though the results were never fully satisfactory in terms of efficiency. Finally, model selection has been left out of consideration, but further research on this topic is needed, especially at smaller sample sizes. An interesting starting point is offered by Kundu (2005).
To reiterate the main point of this study, these convolutions have a large number of potential applications and, as demonstrated using several examples, may provide valuable insight into different aspects of the analysis.
This research has been supported by the National Institutes of Health – National Institute of Child Health and Human Development (Grant Number: 1R03HD084807-01A1).
.1 EM estimation
Here, I discuss maximum likelihood inference for , , and in normal-Laplace (NL), Laplace-normal (LN), and Laplace-Laplace (LL) models. In particular, I develop an estimation approach based on the scale mixture representation of the Laplace distribution. If then , where and are, respectively, independent standard exponential and normal variates. Similarly, if , then , where is, again, standard exponential and (Kotz, Kozubowski and Podgórski, 2001).
Let be a multivariate random vector, and and be, respectively, and vectors of covariates for the th observation, , in cluster , . Also, let and be, respectively, and design matrices for cluster . I assume that the random effects have variance-covariance matrix for all and that the ’s are independent of one another. The structure of is purposely left unspecified. The relative precision matrix is parameterised in terms of an unrestricted -dimensional vector, , of non-redundant parameters (Pinheiro and Bates, 2000). The parameter to be estimated is then of dimension . The identity matrix will be denoted by .
.2 Normal-Laplace convolution
Let be a vector of independent standard exponential variates and let . The NL model can be written as
where and . The model can be simplified by convolving and conditional on , i.e., by integrating out the random effects
where , , and .
.3 Laplace-Normal convolution
The LN model can be written as
where is a standard exponential variate, , and . The normal component of the random effects can be integrated out as follows
where , , and .
.4 Laplace-Laplace convolution
Let be a vector of independent standard exponential variates, where , and let . The LL model can be written as
where and . As before, the joint density can be simplified to
where , , and .
.5 The algorithm
The joint density could be further integrated to obtain . Except for the normal-normal (NN) model, the form of the marginal likelihood of does not seem to have an immediate known form. A numerical integration could have some appeal since this integral would reduce to a Gauss-Laguerre quadrature (h() is standard exponential). However, since quadrature methods are notoriously inefficient if the dimension of the integral is large, I consider an alternative approach based on Monte Carlo EM (MCEM) estimation. In this case, the unobservable variable is sampled from the conditional density . While the Monte Carlo sample size does not depend as much on dimensionality as quadrature methods do, convergence can be slower for MCEM than for quadrature-based methods (McLachlan and Krishnan, 2008).
Note that does not depend on . The EM approach alternates between an
expectation step (E-step) , ; and a
maximisation step (M-step) ,
where is the estimate of the parameter after cycles. The expectation in step (i) is taken with respect to , that is, the distribution of the unobservable data conditional on the observed data and the current estimate of . Given that the latter density does not have an immediate known form, I consider a Monte Carlo approach and use the following numerical approximation
where is a vector of appropriate dimensions sampled from at iteration . The number of samples can be fixed at the same value for all iterations or may vary with . The approximate complete data log-likelihood for all clusters (Q-function), averaged over , is given by
where , ,
and is the scaled variance-covariance matrix of the random effects. Note that all the information given by is contained in which depends on (the superscript has been dropped from , , and to ease notation). Furthermore, the parameter is defined to be the vector of non-zero elements of the upper triangle of the matrix logarithm of , where is the matrix obtained from the Cholesky decomposition