1 Introduction
Linear mixed models (LMMs) are commonly used to model data that present a natural hierarchical structure because they flexibly model the within-subject correlation (Pinheiro and Bates, 2000). This kind of structure appears when a variable of interest is measured repeatedly for several subjects (or cluster units, in general), which is frequently the case in clinical trials.
For mathematical convenience, it is usually assumed that both the random effects and the errors follow Gaussian distributions. Nevertheless, these restrictive assumptions may result in a lack of robustness against departures from normality and in invalid statistical inferences, especially when the data show heavy tails and skewness (see, e.g., Drikvandi et al., 2017; Drikvandi, 2019). Several approaches have been considered in the literature to replace the normal assumptions of the LMM with more flexible distributions. For example, Pinheiro et al. (2001) proposed a multivariate t linear mixed model (T-LMM), and Rosa et al. (2003) considered the thick-tailed class of normal/independent (NI) distributions from a Bayesian framework. Accounting for skewness, Arellano-Valle et al. (2005) proposed a skew-normal linear mixed model (SN-LMM) based on the classic skew-normal (SN) distribution introduced by Azzalini and Dalla Valle (1996), and Ho and Lin (2010) proposed a skew-t linear mixed model (ST-LMM) based on the classic skew-t (ST) distribution introduced by Azzalini and Capitanio (2003). More generally, Lachos et al. (2010a) proposed robust parametric modeling of the LMM based on skew-normal/independent (SNI) distributions, in which the random effects follow an SNI distribution and the within-subject errors follow a normal/independent (NI) distribution, so that the observed responses also follow an SNI distribution; they called the resulting model the skew-normal/independent linear mixed model (SNI-LMM). Schumacher et al. (2021) extended the SNI-LMM by considering within-subject serial dependence and developing additional tools for model evaluation.
In general, SN and ST distributions extend the normal and Student-t distributions by introducing additional parameters regulating skewness. Some extensions and unifications of these distributions are carefully surveyed in works such as Azzalini (2005) and Arellano-Valle and Azzalini (2006). For further information, we refer to the book edited by Genton (2004) and the more recent one by Azzalini and Capitanio (2014). The classic formulation of the SN and ST distributions was used successfully in many other works, such as Pyne et al. (2009), Lachos et al. (2010a), Lachos et al. (2010b), Cabral et al. (2012a), Cabral et al. (2012b), and Cabral et al. (2014). Furthermore, another popular version of these two skewed distributions was defined by Sahu et al. (2003), which will be called the SDB-SN and SDB-ST in this work. This proposal was applied in works such as Lin (2009), Lin (2010), and Jara et al. (2008).
Recently, Lee and McLachlan (2016) proposed finite mixtures of a generalization of the classic and SDB-ST distributions, called canonical fundamental skew-t (CFUST) distributions, which are special cases of the fundamental skew distributions defined by Arellano-Valle and Genton (2005). Using simulation studies, Lee and McLachlan (2016) showed that the CFUST distribution outperforms the classic and SDB-ST distributions in mixture models. In this regard, this work aims to extend the classic ST-LMM by considering the CFUST distribution used in Lee and McLachlan (2016), developing ML estimation based on an EM-type algorithm. This formulation, which will hereafter simply be called the ST distribution, enables a more flexible skewness structure at the cost of a larger number of parameters to be estimated.
The rest of this manuscript is organized as follows. Section 2 introduces the formulation of the skewed distributions that are considered in this work. Section 3 defines the ST-LMM and discusses its ML estimation via an EM-type algorithm. Section 4 presents some simulation studies conducted to evaluate the empirical performance of the ST-LMM and the effect of initial values, as well as an illustrative study to exemplify the flexibility of the model. In Section 5, the methodology is applied to a schizophrenia data set. Finally, Section 6 discusses some final remarks.
2 Skewed distributions
Let $Y$ denote a $p$-dimensional random vector following a normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$, denoted by $Y \sim N_p(\mu, \Sigma)$, and let $\phi_p(\cdot\,; \mu, \Sigma)$ and $\Phi_p(\cdot\,; \mu, \Sigma)$ be the respective density and cumulative distribution function. When $\mu = 0$ and $\Sigma = I_p$ (the null $p$-dimensional vector and the $p \times p$ identity matrix, respectively), we simplify the notation to $\phi_p(\cdot)$ and $\Phi_p(\cdot)$, and when $p = 1$, we use the notation $\phi(\cdot)$ and $\Phi(\cdot)$.
Suppose $X \sim N_p(\mu, \Sigma)$. Then, for a given Borel set $A \subseteq \mathbb{R}^p$, we say that the distribution of $Y$ is a truncated normal distribution on $A$, denoted by $Y \sim TN_p(\mu, \Sigma; A)$, whose density is given by
$$f(y) = \frac{\phi_p(y; \mu, \Sigma)}{P(X \in A)}\, \mathbb{I}_A(y),$$
where $\mathbb{I}_A(\cdot)$ is the indicator function of $A$. As a particular case, consider $\Sigma = I_p$ and $A = \{(x_1, \ldots, x_p)^\top \in \mathbb{R}^p : x_1 > 0, \ldots, x_p > 0\}$; then all elements of the vector $Y$ are independent random variables. Now, considering $X \sim N_p(0, \Sigma)$, we say that the distribution of $Y = |X| = (|X_1|, \ldots, |X_p|)^\top$ is a $p$-dimensional half-normal distribution with scale matrix $\Sigma$, denoted by $Y \sim HN_p(0, \Sigma)$.
First, we will define the version of the skew-normal distribution that will be used in this work. It is a special case of the fundamental skew-normal distribution defined by Arellano-Valle and Genton (2005). The presentation below is based on that work, where all the proofs can be found; some of them are reproduced below, and some others are skipped.
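As a quick numerical check of these building blocks, the following Python sketch (assuming NumPy and SciPy are available; all parameter values are illustrative) verifies that each component of a half-normal vector with scale matrix $I_p$ has mean $\sqrt{2/\pi}$, and that the same value is recovered from the truncated-normal view via scipy.stats.truncnorm.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Half-normal vector: Y = |X| with X ~ N_p(0, I_p); each component is a
# univariate half-normal with scale 1, so E[Y_j] = sqrt(2/pi) ~ 0.7979.
p = 3
Y = np.abs(rng.standard_normal((100_000, p)))
print(Y.mean(axis=0))

# The same first moment from the truncated-normal view: a N(0, 1) variable
# truncated to (0, inf) has mean sqrt(2/pi).
print(stats.truncnorm.mean(0.0, np.inf))
```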
Definition 1.
Let $X_0 \sim N_q(0, I_q)$ and $X_1 \sim N_p(0, \Sigma)$ be independent, where $\Sigma$ is positive definite, and let $\Delta$ be a $p \times q$ matrix. We say that the distribution of
$$Y = \mu + \Delta |X_0| + X_1$$
is skew-normal with location vector $\mu$, shape matrix $\Delta$, and scale matrix $\Sigma$. We use the notation $Y \sim SN_{p,q}(\mu, \Sigma, \Delta)$.
Observe that when $\Delta = 0$, then $Y \sim N_p(\mu, \Sigma)$. A trivial but relevant consequence of this definition is that affine transformations of SN distributions are still SN, as stated in Proposition 1.
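Definition 1 doubles as a sampler: drawing $|X_0|$ and $X_1$ and combining them produces SN variates directly. The sketch below (NumPy/SciPy; dimensions and parameter values are illustrative) does exactly that and confirms that skewness appears only in the coordinates touched by the shape matrix.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def rsn(n, mu, Sigma, Delta, rng):
    """Sample Y = mu + Delta |X0| + X1 (Definition 1), with
    X0 ~ N_q(0, I_q) and X1 ~ N_p(0, Sigma) independent."""
    p, q = Delta.shape
    absX0 = np.abs(rng.standard_normal((n, q)))            # |X0|: half-normal
    X1 = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    return mu + absX0 @ Delta.T + X1

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
Delta = np.array([[2.0], [0.0]])   # shape matrix acts on the first coordinate only

Y = rsn(100_000, mu, Sigma, Delta, rng)
print(stats.skew(Y[:, 0]))   # clearly positive
print(stats.skew(Y[:, 1]))   # close to zero: second coordinate stays symmetric
```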
Proposition 1.
Let $A$ be an $m \times p$ matrix with rank $m \leq p$, $b$ be a vector of length $m$, and $Y \sim SN_{p,q}(\mu, \Sigma, \Delta)$. Then, $AY + b \sim SN_{m,q}(A\mu + b, A\Sigma A^\top, A\Delta)$.
In particular, marginal distributions are also SN. Thus, if $Y \sim SN_{p,q}(\mu, \Sigma, \Delta)$, consider the partition
$$Y = (Y_1^\top, Y_2^\top)^\top, \qquad (1)$$
where $Y_i$ has dimension $p_i$, $i = 1, 2$, with $p_1 + p_2 = p$. This partition induces similar partitions on $\mu$, $\Sigma$, and $\Delta$, given by $\mu = (\mu_1^\top, \mu_2^\top)^\top$, $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$, and $\Delta = (\Delta_1^\top, \Delta_2^\top)^\top$, where $\Sigma_{ij}$ is $p_i \times p_j$ and $\Delta_i$ is $p_i \times q$. By Proposition 1, we have $Y_1 \sim SN_{p_1,q}(\mu_1, \Sigma_{11}, \Delta_1)$. An analogous result is true for $Y_2$.
Proposition 2.
Let $Y \sim SN_{p,q}(\mu, \Sigma, \Delta)$. Then, the density of $Y$ is given by
$$f(y) = 2^q\, \phi_p(y; \mu, \Omega)\, \Phi_q\!\left(\Delta^\top \Omega^{-1}(y - \mu); 0, \Lambda\right), \qquad (2)$$
where
$$\Omega = \Sigma + \Delta \Delta^\top \quad \text{and} \quad \Lambda = I_q - \Delta^\top \Omega^{-1} \Delta. \qquad (3)$$
Proof.
Define the random vector
$$W = \begin{pmatrix} X_0 \\ Z \end{pmatrix}, \quad \text{with } Z = \mu + \Delta X_0 + X_1, \qquad (4)$$
where $X_0$ and $X_1$ are given in Definition 1. Then, $Y$ has the same distribution as $Z \mid X_0 > 0$. Thus, it is enough to find the distribution of $Z \mid X_0 > 0$, which is the distribution of $Y$. It can be shown that
$$W \sim N_{q+p}\!\left( \begin{pmatrix} 0 \\ \mu \end{pmatrix}, \begin{pmatrix} I_q & \Delta^\top \\ \Delta & \Omega \end{pmatrix} \right), \qquad (5)$$
which implies $X_0 \mid Z = y \sim N_q\!\left(\Delta^\top \Omega^{-1}(y - \mu), \Lambda\right)$. The result follows immediately. ∎
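The density (2)–(3) can be implemented directly from a multivariate normal pdf and cdf. The following sketch (SciPy's multivariate_normal; parameter values are illustrative) codes the formula for general $p$ and $q$ and checks, in the scalar case, that it integrates to one.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def dsn(y, mu, Sigma, Delta):
    """SN density (2): 2^q phi_p(y; mu, Omega) Phi_q(Delta' Omega^{-1}(y - mu); 0, Lambda),
    with Omega = Sigma + Delta Delta' and Lambda = I_q - Delta' Omega^{-1} Delta as in (3)."""
    p, q = Delta.shape
    Omega = Sigma + Delta @ Delta.T
    Oinv = np.linalg.inv(Omega)
    Lam = np.eye(q) - Delta.T @ Oinv @ Delta
    arg = Delta.T @ Oinv @ (y - mu)
    return 2.0**q * mvn.pdf(y, mean=mu, cov=Omega) * mvn.cdf(arg, mean=np.zeros(q), cov=Lam)

# Scalar sanity check (p = q = 1): the density should integrate to one.
grid = np.linspace(-10.0, 10.0, 2001)
dx = grid[1] - grid[0]
vals = np.array([dsn(np.array([g]), np.array([0.0]), np.array([[1.0]]),
                     np.array([[1.5]])) for g in grid])
print(vals.sum() * dx)   # ~ 1.0
```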
Remarks

Observe that the relation $\Omega = \Sigma + \Delta\Delta^\top$ induces a one-to-one reparameterization; that is, we could parameterize the distribution in terms of $(\mu, \Omega, \Delta)$, with density given in (2). Besides this, the distribution can also be parameterized through a square root of $\Omega$; this parameterization is used in Arellano-Valle and Genton (2005) – see their equation (2.11) and page 109 for the pdf and moment generating function. Unless stated explicitly, we use the parameterization $(\mu, \Sigma, \Delta)$.
The mean vector and covariance matrix of a random vector with SN distribution are given in Proposition 3. The proof is a direct consequence of Definition 1 and the fact that if a random variable has a univariate half-normal distribution with scale parameter 1, then its mean value is $\sqrt{2/\pi}$. We denote the $q$-dimensional vector of ones by $1_q$.
Proposition 3.
Let $Y \sim SN_{p,q}(\mu, \Sigma, \Delta)$. Then,
$$E[Y] = \mu + \sqrt{2/\pi}\, \Delta 1_q \quad \text{and} \quad \mathrm{Var}[Y] = \Sigma + \left(1 - \frac{2}{\pi}\right) \Delta \Delta^\top.$$
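A Monte Carlo check of these moment formulas, using the stochastic representation of Definition 1 (NumPy; all parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

p, q = 2, 2
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.2], [0.2, 0.5]])
Delta = np.array([[1.0, 0.5], [0.0, 1.0]])

n = 400_000
absX0 = np.abs(rng.standard_normal((n, q)))
X1 = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
Y = mu + absX0 @ Delta.T + X1

# Proposition 3: E[Y] = mu + sqrt(2/pi) Delta 1_q and
# Var[Y] = Sigma + (1 - 2/pi) Delta Delta', since the components of |X0|
# are i.i.d. half-normal with mean sqrt(2/pi) and variance 1 - 2/pi.
mean_theory = mu + np.sqrt(2 / np.pi) * Delta @ np.ones(q)
cov_theory = Sigma + (1 - 2 / np.pi) * Delta @ Delta.T
print(np.abs(Y.mean(axis=0) - mean_theory).max())   # small
print(np.abs(np.cov(Y.T) - cov_theory).max())       # small
```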
Before defining the skew-t distribution, we state a result regarding the marginal and conditional distributions of a random vector with Student-t distribution, which will be helpful to obtain results for the skew-t distribution similar to the ones presented for the SN distribution. The proof can be found in Arellano-Valle and Bolfarine (1995) (see also Fang et al. (1990, Theorem 3.8)).
Lemma 1.
Let $Y \sim t_p(\mu, \Sigma, \nu)$ and consider the partition given in (1) and the induced partitions of $\mu$ and $\Sigma$. It can be shown that:

(i) $Y_1 \sim t_{p_1}(\mu_1, \Sigma_{11}, \nu)$;

(ii) $Y_2 \mid Y_1 = y_1 \sim t_{p_2}\!\left(\mu_{2.1},\, \dfrac{\nu + d_1}{\nu + p_1}\, \Sigma_{22.1},\, \nu + p_1\right)$,

where $\mu_{2.1} = \mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(y_1 - \mu_1)$ and $d_1 = (y_1 - \mu_1)^\top \Sigma_{11}^{-1}(y_1 - \mu_1)$, with $\Sigma_{22.1} = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}$.
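Lemma 1 (i) is easy to probe numerically: sampling the multivariate t through its normal/chi-square mixture representation, the first coordinate should pass a goodness-of-fit test against the corresponding univariate Student-t. A sketch with illustrative parameter values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Multivariate t via the scale mixture Y = mu + Z / sqrt(W / nu),
# with Z ~ N_p(0, Sigma) and W ~ chi^2_nu independent.
nu = 5.0
mu = np.array([1.0, 0.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])

n = 50_000
Z = rng.multivariate_normal(np.zeros(2), Sigma, size=n)
W = rng.chisquare(nu, size=n)
Y = mu + Z / np.sqrt(W / nu)[:, None]

# Lemma 1 (i): the first coordinate is Student-t with location mu_1,
# scale sqrt(Sigma_11), and the same nu degrees of freedom.
marginal = stats.t(df=nu, loc=mu[0], scale=np.sqrt(Sigma[0, 0]))
res = stats.kstest(Y[:, 0], marginal.cdf)
print(res.pvalue)   # large p-value: no evidence against the t marginal
```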
Definition 2.
Let $Z \sim SN_{p,q}(0, \Sigma, \Delta)$ and $U \sim \mathrm{Gamma}(\nu/2, \nu/2)$ be independent, where $\mathrm{Gamma}(a, b)$ denotes a Gamma distribution with mean $a/b$ and variance $a/b^2$, with $\nu > 0$. Let $\mu$ be a vector of constants of length $p$. We say that the distribution of $Y = \mu + U^{-1/2} Z$ is skew-t (ST) with location vector $\mu$, scale matrix $\Sigma$, shape matrix $\Delta$, and $\nu$ degrees of freedom. We use the notation $Y \sim ST_{p,q}(\mu, \Sigma, \Delta, \nu)$.
In what follows, $t_p(\cdot\,; \mu, \Sigma, \nu)$ and $T_p(\cdot\,; \mu, \Sigma, \nu)$ respectively denote the density and the cumulative distribution function of the $p$-variate Student-t distribution with location vector $\mu$, scale matrix $\Sigma$, and $\nu$ degrees of freedom.
Proposition 4.
Let $Y \sim ST_{p,q}(\mu, \Sigma, \Delta, \nu)$. Then, the density of $Y$ is given by
$$f(y) = 2^q\, t_p(y; \mu, \Omega, \nu)\, T_q\!\left(\sqrt{\frac{\nu + p}{\nu + d(y)}}\; \Delta^\top \Omega^{-1}(y - \mu);\, 0, \Lambda, \nu + p\right), \qquad (6)$$
where $d(y) = (y - \mu)^\top \Omega^{-1}(y - \mu)$, and $\Omega$ and $\Lambda$ are as given in (3).
Proof.
By Definition 2, $Y \mid U = u \sim SN_{p,q}(\mu, u^{-1}\Sigma, u^{-1/2}\Delta)$, whose density is given by (2). Integrating this conditional density with respect to the $\mathrm{Gamma}(\nu/2, \nu/2)$ distribution of $U$ and applying Lemma 1 yields the result. ∎
Similarly to the SN case, by Definition 2 and Proposition 1, it can be shown that affine transformations of ST distributions are still ST-distributed, as stated in the following proposition.
Proposition 5.
Let $A$ be an $m \times p$ matrix with rank $m \leq p$, $b$ be a vector of length $m$, and $Y \sim ST_{p,q}(\mu, \Sigma, \Delta, \nu)$. Then, $AY + b \sim ST_{m,q}(A\mu + b, A\Sigma A^\top, A\Delta, \nu)$.
If $Y \sim ST_{p,q}(\mu, \Sigma, \Delta, \nu)$ is partitioned as in (1), we have $Y_1 \sim ST_{p_1,q}(\mu_1, \Sigma_{11}, \Delta_1, \nu)$. An analogous result can be shown for $Y_2$.
Equation (6) also implies the following result:
Proposition 6.
Let $Y \sim ST_{p,q}(\mu, \Sigma, \Delta, \nu)$. Then, $Y$ admits the following hierarchical representation:
$$Y \mid T = t,\, U = u \sim N_p(\mu + \Delta t,\, u^{-1}\Sigma),$$
$$T \mid U = u \sim HN_q(0,\, u^{-1} I_q), \qquad U \sim \mathrm{Gamma}(\nu/2, \nu/2),$$
where $HN_q(0, u^{-1} I_q)$ denotes the $q$-dimensional half-normal distribution with location parameter $0$ and scale matrix $u^{-1} I_q$.
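The hierarchical representation gives a second, equivalent way to simulate from the ST distribution, which is exactly what the EM construction later exploits. The sketch below (NumPy; illustrative parameter values) draws one sample via Definition 2 and another via Proposition 6 and checks that their first two empirical moments agree.

```python
import numpy as np

rng = np.random.default_rng(4)

p, q, nu = 2, 1, 6.0
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])
Delta = np.array([[1.5], [0.5]])
n = 300_000

def rst_direct(n):
    # Definition 2: Y = mu + U^{-1/2} Z with Z ~ SN_{p,q}(0, Sigma, Delta)
    # and U ~ Gamma(nu/2, nu/2) (rate parameterization, so scale = 2/nu).
    U = rng.gamma(nu / 2, 2 / nu, size=n)
    Z = np.abs(rng.standard_normal((n, q))) @ Delta.T \
        + rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    return mu + Z / np.sqrt(U)[:, None]

def rst_hier(n):
    # Proposition 6: U ~ Gamma(nu/2, nu/2); T | U = u ~ HN_q(0, u^{-1} I_q);
    # Y | T = t, U = u ~ N_p(mu + Delta t, u^{-1} Sigma).
    U = rng.gamma(nu / 2, 2 / nu, size=n)
    T = np.abs(rng.standard_normal((n, q))) / np.sqrt(U)[:, None]
    eps = rng.multivariate_normal(np.zeros(p), Sigma, size=n) / np.sqrt(U)[:, None]
    return mu + T @ Delta.T + eps

Y1, Y2 = rst_direct(n), rst_hier(n)
print(np.abs(Y1.mean(axis=0) - Y2.mean(axis=0)).max())   # close to 0
print(np.abs(np.cov(Y1.T) - np.cov(Y2.T)).max())         # close to 0
```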
It is straightforward to show that the density of $Y$ implied by the hierarchical representation in Proposition 6 is the one given in (6).
Proposition 7.
Let $Y \sim ST_{p,q}(\mu, \Sigma, \Delta, \nu)$. Then,
$$E[Y] = \mu + k_1 \Delta 1_q, \quad \nu > 1, \qquad \mathrm{Var}[Y] = \frac{\nu}{\nu - 2}\left[\Omega + \frac{2}{\pi}\,\Delta (J_q - I_q) \Delta^\top\right] - k_1^2\, \Delta J_q \Delta^\top, \quad \nu > 2, \qquad (7)$$
where $k_1 = \sqrt{\nu/\pi}\;\Gamma\!\left((\nu - 1)/2\right)/\Gamma(\nu/2)$, and $J_q = 1_q 1_q^\top$ is a $q \times q$ matrix of ones.
3 The model
3.1 Definition
The standard linear mixed model introduced by Laird and Ware (1982) has been a widely used tool to model the within-subject correlation often present in longitudinal data. In this work, we present a flexible extension of this model.
Suppose that there are $n$ subjects, with the $i$th subject having $n_i$ observations; then a linear mixed model can be written as
$$Y_i = X_i \beta + Z_i b_i + \epsilon_i, \quad i = 1, \ldots, n, \qquad (8)$$
where $Y_i$ is the $n_i \times 1$ vector of observed responses, $X_i$ and $Z_i$ are known design matrices, $\beta$ is the vector of fixed effects, $b_i$ is the vector of random effects, and $\epsilon_i$ is the vector of within-subject errors.
We consider that
(9) 
where , and denotes independent random vectors. This setup implies
where denotes independent and identically distributed random vectors. Thus, we have that , according to (7). Moreover, from Proposition 6, we have the following stochastic representation of the complete data model
(10)  
(11)  
(12)  
(13) 
Observe that $Y_i$ is an affine transformation of $(b_i^\top, \epsilon_i^\top)^\top$, by (8). Using Proposition 5, we have
(14)
Hence, the marginal pdf of $Y_i$ is
(15)  
where the quantities in (15) are obtained from (14). Therefore, assuming that the scale and shape matrices depend on unknown and reduced parameter vectors, the log-likelihood function for the full parameter vector $\theta$ based on the observed sample is given by
(16) 
where $\theta$ denotes the vector of all model parameters. Since the observed log-likelihood function involves complex expressions, it is computationally very expensive to work with it directly to find the ML estimates of $\theta$. Hence, in the following subsection, we discuss the development of an EM-type algorithm (Dempster et al., 1977) for ML estimation.
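To make the computational point concrete, the toy sketch below fits the scalar ($p = q = 1$) ST distribution by maximizing the observed log-likelihood directly with a generic optimizer, holding $\nu$ fixed for simplicity (NumPy/SciPy; the density follows the ST form of Section 2, and all parameter values are illustrative). Even here, every objective evaluation requires a Student-t cdf per observation, which is what makes direct maximization unattractive for the full mixed model and motivates the EM-type algorithm of the next subsection.

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(5)
nu = 5.0   # degrees of freedom held fixed in this toy example

def dst1(y, mu, sigma2, delta):
    """Scalar ST density (p = q = 1):
    2 t(y; mu, Omega, nu) T(sqrt((nu+1)/(nu+d)) delta (y-mu)/Omega; 0, Lambda, nu+1),
    with Omega = sigma2 + delta^2, Lambda = 1 - delta^2/Omega, d = (y-mu)^2/Omega."""
    Omega = sigma2 + delta**2
    Lam = 1.0 - delta**2 / Omega
    d = (y - mu) ** 2 / Omega
    core = stats.t.pdf((y - mu) / np.sqrt(Omega), nu) / np.sqrt(Omega)
    arg = np.sqrt((nu + 1.0) / (nu + d)) * delta * (y - mu) / Omega
    return 2.0 * core * stats.t.cdf(arg / np.sqrt(Lam), nu + 1.0)

# Simulate from the model (Definition 2 with p = q = 1).
n, mu0, sigma0, delta0 = 2_000, 1.0, 1.0, 2.0
U = rng.gamma(nu / 2, 2 / nu, size=n)
y = mu0 + (delta0 * np.abs(rng.standard_normal(n))
           + sigma0 * rng.standard_normal(n)) / np.sqrt(U)

def negloglik(theta):
    m, log_s2, dl = theta   # log-parameterize the scale to keep it positive
    return -np.sum(np.log(dst1(y, m, np.exp(log_s2), dl)))

x0 = np.array([0.0, 0.0, 1.0])
res = optimize.minimize(negloglik, x0, method="Nelder-Mead",
                        options={"maxiter": 2000})
print(res.x)   # location, log-scale, and shape estimates
```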
3.2 Maximum likelihood estimation
From the hierarchical representation given in (10)–(13), and treating the latent quantities as hypothetical missing data, we propose to use the ECME algorithm (Liu and Rubin, 1994) for parameter estimation. The complete-data log-likelihood function is then given by
where is a function that depends on the parameter vector only through .
Given the current parameter estimate, the E-step of an EM-type algorithm evaluates the conditional expectation of the complete-data log-likelihood, where the expectation is taken with respect to the joint conditional distribution of the latent variables, given the observed data and the current estimate. Therefore, we can write
where the required conditional expectations are defined below, with $\mathrm{tr}(A)$ indicating the trace of a matrix $A$.
From the representation given in (10)–(13), using properties of conditional expectation and after some algebra, and omitting the superscript indicating the current iteration, the expressions above can be written as:
where the required quantities are as given in (15). Moreover,
(17) 
with the conditional distribution in (17) being a Student-t distribution truncated on $\mathbb{R}_+^q$, whose moments can be computed using the R package MomTrunc (Galarza et al., 2021).
To maximize the expected complete-data log-likelihood with respect to $\theta$, the ECME algorithm performs a conditional maximization (CM) step, obtaining a new estimate, as follows:
1. The parameter estimates are updated using the following expressions: