Statistical inference for continuous multivariate data is commonly focused on the multivariate normal (MN) distribution, with mean vector μ and covariance matrix Σ, due to its computational and theoretical convenience. However, for many real phenomena, the tails of this distribution are lighter than required, often because of the presence of mild outliers (Ritter, 2015, p. 4). Outliers are defined as "mild", with respect to the MN distribution (which is the reference distribution), when they do not strongly deviate from the MN, but rather produce an overall distribution that is too heavy-tailed to be modeled by the MN (Mazza and Punzo, 2020, Morris et al., 2019, and Punzo and Tortora, 2019); for a discussion of the concept of reference (or target) distribution, see Davies and Gather (1993). Therefore, mild outliers can be modeled by means of a heavy-tailed elliptically symmetric (or elliptically contoured or, simply, elliptical) distribution embedding the MN distribution as a special case (Ritter, 2015, p. 79).
A classical way to define such a larger model is by means of the MN scale mixture (also known as the MN variance mixture): a finite/continuous mixture of MN distributions, over the scale of the covariance matrix, obtained via a convenient discrete/continuous mixing distribution with positive support. The MN scale mixture simply alters the tail behavior of the MN distribution while leaving the resulting distribution elliptically contoured. This is made possible by the additional parameters of the mixing distribution which, in the scale mixture, govern the deviation from normality in terms of tail weight. Regardless of the mixing distribution considered, expressing a multivariate elliptical distribution as a MN scale mixture enables (among other things): natural/efficient computation of the maximum likelihood (ML) estimates via expectation-maximization (EM) based algorithms (Lange and Sinsheimer, 1993 and Balakrishnan et al., 2009), robust ML estimation of μ and Σ based on down-weighting of mild outliers (Peel and McLachlan, 2000, Punzo and McNicholas, 2016, and Farcomeni and Punzo, 2019), and efficient Monte Carlo calculations (Relles, 1970, Andrews and Mallows, 1974, Efron and Olshen, 1978 and Choy and Chan, 2008). Moreover, robustness studies have often used MN scale mixtures for simulation and in the analysis of outlier models; see West (1984, 1987) and the references therein.
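The scale-mixture mechanism is easy to sketch in code. The snippet below is an illustrative Python sketch (not part of the paper; the function name `rmnsm` is ours): it draws from a generic MN scale mixture via the hierarchy X | W = w ~ N(μ, Σ/w), and shows that a gamma mixing law recovers the multivariate t.

```python
import numpy as np

rng = np.random.default_rng(0)

def rmnsm(n, mu, Sigma, rmix):
    # X | W = w ~ N(mu, Sigma / w); rmix draws the mixing variable W > 0
    d = len(mu)
    w = rmix(n)
    z = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    return mu + z / np.sqrt(w)[:, None]

# gamma mixing with shape nu/2 and rate nu/2 gives the multivariate t
nu = 10.0
x = rmnsm(10_000, np.zeros(2), np.eye(2),
          lambda n: rng.gamma(nu / 2, 2 / nu, size=n))
```

Only the mixing law changes across the family members listed below; the conditional normal step stays the same.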
The MN scale mixture family includes many classical distributions, such as the multivariate t (Mt; Lange et al., 1989 and Kotz and Nadarajah, 2004), obtained by using a convenient gamma as mixing distribution, the multivariate contaminated normal (MCN) of Tukey (1960), defined under a convenient Bernoulli mixing distribution (see, e.g., Yamaguchi, 2004 and Punzo et al., 2020), and the multivariate symmetric generalized hyperbolic (MSGH; McNeil et al., 2005), obtained if a generalized inverse Gaussian is considered as mixing distribution.
In the present paper, we add a new member to the MN scale mixture family: the multivariate tail-inflated normal (MTIN) distribution. It is obtained by using a convenient continuous uniform as mixing distribution. In Section 2.1 we show that the proposed distribution has a closed-form probability density function (pdf) which is characterized by a single additional parameter, with respect to μ and Σ of the nested MN, governing the tail weight. We also make explicit the representations of the MTIN distribution as a MN scale mixture (Section 2.2.1) and as an elliptical distribution (Section 2.2.2). In Section 2.3 we provide the first four moments of the MTIN distribution, which are those of practical interest (see also Appendix B). We estimate the parameters by the method of moments (Appendix C) and maximum likelihood (ML; Section 3); for the latter, we illustrate two alternative algorithms/methods to obtain these estimates (a direct approach in Section 3.1 and an ECME algorithm in Section 3.2). Still regarding the ML method, in Section 4 we discuss the existence of the estimates. The ECME algorithm allows us to highlight how the ML estimates of μ and Σ are robust in the sense that mild outliers are down-weighted in the computation of these parameters (see Section 5 for details). Two simulated data analyses are designed to compare the estimation methods discussed above in terms of parameter recovery and computational time required (Section 6.1), and to evaluate the appropriateness of AIC and BIC in selecting among a set of candidate elliptical models that can be considered as natural competitors of our MTIN distribution (Section 6.2). Finally, we illustrate the proposed model by analyzing financial multivariate data related to four of the publicly owned companies considered by the Dow Jones index (Section 7); here, we also compare the performance of the MTIN distribution with other heavy-tailed elliptical distributions which are well-known in the financial literature.
We conclude the paper with a brief discussion in Section 8.
2 Multivariate tail-inflated normal distribution
2.1 Probability density function
Definition 2.1 (Probability density function).
A d-dimensional random vector X is said to have a d-variate tail-inflated normal distribution with mean vector μ, scale matrix Σ, and inflation parameter θ ∈ (0, 1), in symbols X ∼ MTIN_d(μ, Σ, θ), if its pdf is given by
where Γ(·, ·) is the upper incomplete gamma function, |Σ| is the determinant of Σ, and δ(x; μ, Σ) = (x − μ)′Σ⁻¹(x − μ) denotes the squared Mahalanobis distance between x and μ (with Σ as the covariance matrix).
Theorem 2.1 (Limiting normal form).
The pdf of the MN distribution can be obtained as a limiting special case of (1) when θ → 0⁺; in formula
In this section we show that, if X ∼ MTIN_d(μ, Σ, θ), then its distribution has a twofold representation: as a MN scale mixture (Section 2.2.1) and as a multivariate elliptically symmetric distribution (Section 2.2.2). The properties of these families are thus implicitly inherited by our model.
2.2.1 Multivariate normal scale mixture representation
For the sake of robustness, one of the most common ways to generalize the MN distribution is the MN scale mixture (MNSM), also called the MN variance mixture (MNVM), with pdf
where h(w; θ) is the mixing probability density (or mass) function, with support on the positive real line, depending on the parameter(s) θ. The pdf in (5) is unimodal, elliptically symmetric, and guarantees tails heavier than those of the MN distribution (see, e.g., Barndorff-Nielsen et al., 1982, Fang et al., 2013, Section 2.6, Yamaguchi, 2004 and McLachlan and Peel, 2000, Section 7.4). The tail weight of the resulting distribution is governed by θ.
Examples of distributions belonging to the MNSM family are: the multivariate contaminated normal (MCN), the multivariate t (Mt), the multivariate symmetric generalized hyperbolic (MSGH), the multivariate symmetric hyperbolic (MSH), the multivariate symmetric variance-gamma (MSVG), and the multivariate symmetric normal inverse Gaussian (SNIG); for details about these special cases, as well as about the properties of the MNSM family, see McNeil et al. (2005, Section 3.2.1).
In Theorem 2.2 we show that the MTIN distribution can be represented as a MNSM by considering a convenient (continuous) uniform as mixing distribution.
Theorem 2.2 (Scale mixture representation).
The pdf in (6) can be written as
By noting that
the proof of the theorem is straightforward. ∎
A more direct interpretation of Theorem 2.2 can be given by the hierarchical representation of X ∼ MTIN_d(μ, Σ, θ) as
where W ∼ U(1 − θ, 1) denotes a uniform distribution on the interval (1 − θ, 1). This alternative way to look at the MTIN distribution is useful for random generation and for the implementation of EM-based algorithms (as we will better appreciate in Section 3).
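The hierarchical representation lends itself directly to random generation. The following Python sketch is ours (it is not the paper's mtin package, and the function name `rmtin` is an assumption): draw the mixing variable W ~ U(1 − θ, 1) first, then the conditional normal.

```python
import numpy as np

rng = np.random.default_rng(1)

def rmtin(n, mu, Sigma, theta):
    # X | W = w ~ N(mu, Sigma / w),  W ~ Uniform(1 - theta, 1)
    d = len(mu)
    w = rng.uniform(1 - theta, 1, size=n)
    z = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    return mu + z / np.sqrt(w)[:, None]
```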
2.2.2 Elliptical representation
As well-documented in the literature, the class of elliptical distributions has several good properties (Cambanis et al., 1981, Fang et al., 2013, Chapter 2.5 and Rachev et al., 2010, p. 357) and the class of MNSMs is contained therein (Bingham et al., 2002, p. 247, McNeil et al., 2005, p. 61 and Fang et al., 2013, p. 48). Therefore, the MTIN distribution is also elliptical and Theorem 2.3 presents its elliptical representation.
Theorem 2.3 (Elliptical representation).
If X ∼ MTIN_d(μ, Σ, θ), then X is elliptically distributed with the following stochastic representation
where Λ is a d × d matrix such that Λ′Λ = Σ, U is a d-variate random vector uniformly distributed on the unit hypersphere with topological dimension d − 1, Q is a random variable having a chi-squared distribution with d degrees of freedom, and W ∼ U(1 − θ, 1). Note that U, Q, and W are independent.
To find the distribution of X as defined by (10), we need to derive the pdf of the random variable
and then use the relationship (see Bagnato et al., 2017)
The random variable W has density
where I_A(·) denotes the indicator function of the set A. By definition, the density of the product of the independent random variables Q and 1/W is given by
In this section we provide some of the moments of the MTIN distribution. In addition to the mean, we give the first two moments of higher even order, i.e. the covariance matrix and kurtosis, which are helpful in assessing the influence of mild outliers on the distribution (Rachev et al., 2010, p. 307); indeed, these are the moments of practical interest for practitioners using multivariate heavy-tailed elliptical distributions. As a measure of kurtosis, as usual, we consider
where X is a random vector having mean vector μ and covariance matrix Σ_X (Mardia, 1970).
Theorem 2.4 (MTIN: mean vector, covariance matrix and kurtosis).
If X ∼ MTIN_d(μ, Σ, θ), then
See Appendix B. ∎
Based on (15), the covariance matrix of the MTIN distribution is proportional to the covariance matrix of the nested MN distribution, with proportionality factor depending on θ. In particular, the function given in (17), graphically represented via a solid line in Figure 1, is increasing in θ.
Because of the values the proportionality factor can assume, the covariance matrix of the MTIN distribution is an inflated version of Σ, with no upper bound on the degree of inflation, which tends to infinity when θ → 1⁻. However, if θ → 0⁺, then the factor tends to 1; so, in this case, the covariance matrix tends to Σ.
According to (18), the kurtosis of the MTIN distribution is proportional to the kurtosis of the (nested) MN distribution, with proportionality factor depending on θ. In particular, the function given in (18), graphically represented by a dashed line in Figure 1, is increasing in θ. Therefore, the kurtosis is greater than that of the MN, meaning that the MTIN distribution allows for leptokurtosis. An upper bound for the kurtosis does not exist since the factor tends to infinity when θ → 1⁻. However, if θ → 0⁺, then the factor tends to 1; so, in this case, the kurtosis tends to that of the MN. There is another aspect to be noted from Figure 1. While the covariance factor is a smoothly increasing function of θ, the kurtosis factor is approximately flat on the left, increasing suddenly only when θ is close to 1. From a practical point of view, this means that we need a high θ-value to obtain a relevant excess kurtosis.
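Under the mixture representation of Section 2.2.1, both proportionality factors can be computed as moments of W ~ U(1 − θ, 1): the covariance factor is E(1/W) = −log(1 − θ)/θ and the kurtosis factor is E(1/W²)/E(1/W)². These closed forms are our own derivation (and the names `c` and `k` are ours); they should agree with the exact expressions in (17) and (18). A quick numerical check of the behavior described above:

```python
import numpy as np

def c(theta):
    # covariance inflation factor: E(1/W) with W ~ U(1 - theta, 1)
    return -np.log1p(-theta) / theta

def k(theta):
    # kurtosis inflation factor: E(1/W^2) / E(1/W)^2
    return (1 / (1 - theta)) * theta**2 / np.log1p(-theta)**2

th = np.array([0.1, 0.5, 0.9, 0.99])
# both factors are increasing in theta; c grows smoothly while k stays
# near 1 until theta approaches 1 (e.g. k(0.5) is only about 1.04)
```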
3 Maximum likelihood estimation
Several estimators of the parameters of the MTIN distribution may be considered. Among them, the maximum likelihood (ML) estimator is the most attractive because of its asymptotic properties (see, e.g., Pfanzagl and Hamböker, 1994, p. 206). Naturally, other estimation methods can be used; Appendix C gives details about the method of moments (MM). Below we discuss two possible ways to obtain the ML estimates of μ, Σ, and θ; for their comparison, see Section 6.1.
3.1 Direct approach
Given a random sample x₁, …, x_n (observed data) from X ∼ MTIN_d(μ, Σ, θ), the ML estimation method is based on the maximization of the (observed-data) log-likelihood function
which does not admit a closed-form solution. Furthermore, maximizing (19) is a constrained problem, due to the positive definiteness of Σ and to θ ∈ (0, 1). To make the maximization problem unconstrained, we apply the following reparameterizations for Σ and θ. According to the Cholesky decomposition, we write
where V is an upper triangular d × d matrix with real-valued entries. As concerns θ, we write
where θ* ∈ ℝ. According to the above parametrization, the log-likelihood function in (19) can be re-written as
Details about the first-order partial derivatives of the log-likelihood in (20), with respect to μ, V, and θ*, are given in Appendix F. Operationally, we perform unconstrained maximization of the log-likelihood via the general-purpose optimizer optim() for R (R Core Team, 2018), included in the stats package. Different algorithms, such as Nelder-Mead or BFGS, can be used for maximization; they can be passed to optim() via the argument method. Once the maximum is determined, the estimates for Σ and θ are simply obtained by back-transformation. For a comparison, in terms of parameter recovery and computational times, between the Nelder-Mead and BFGS algorithms applied to the direct ML estimation, see Section 6.1.
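The same unconstrained reparameterization can be sketched outside R. The Python sketch below is illustrative (the names `mtin_logpdf` and `negloglik` are ours, and the pdf is evaluated by numerically integrating the mixture rather than through the closed form in (1)): it builds the negative log-likelihood in terms of μ, the Cholesky factor V with log-transformed diagonal, and a logit-transformed θ.

```python
import numpy as np
from scipy.integrate import quad

def mtin_logpdf(x, mu, Sigma, theta):
    # log f(x) = log of (1/theta) * int_{1-theta}^{1} N(x; mu, Sigma/w) dw
    d = len(mu)
    Sinv = np.linalg.inv(Sigma)
    _, logdet = np.linalg.slogdet(Sigma)
    delta = (x - mu) @ Sinv @ (x - mu)          # squared Mahalanobis distance
    val, _ = quad(lambda w: w ** (d / 2) * np.exp(-w * delta / 2),
                  1 - theta, 1)
    return (-d / 2 * np.log(2 * np.pi) - logdet / 2
            - np.log(theta) + np.log(val))

def negloglik(par, X):
    # par = (mu, upper triangle of V row by row, theta_star)
    n, d = X.shape
    mu = par[:d]
    V = np.zeros((d, d))
    idx = d
    for i in range(d):
        for j in range(i, d):
            V[i, j] = par[idx]
            idx += 1
    V[np.diag_indices(d)] = np.exp(np.diag(V))  # positive diagonal
    Sigma = V.T @ V                             # positive-definite by construction
    theta = 1 / (1 + np.exp(-par[idx]))         # logistic map to (0, 1)
    return -sum(mtin_logpdf(x, mu, Sigma, theta) for x in X)
```

The resulting function can be handed to a general-purpose optimizer, e.g. scipy.optimize.minimize with method="BFGS", mirroring the optim() call described above; estimates of Σ and θ follow by back-transformation.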
3.2 ECME algorithm
To circumvent the complexity of the direct ML approach discussed in Section 3.1, we could consider the application of the expectation-maximization (EM) algorithm (Dempster et al., 1977), which is the classical approach to find ML estimates for distributions belonging to the MNSM family. However, for the MTIN, the M-step is not well-defined (for the motivations we give in Appendix D), and the EM algorithm fails to converge. This is not a serious problem because the ML estimates can still be found, much more efficiently, by using the expectation-conditional maximization either (ECME) algorithm (Liu and Rubin, 1994). The higher efficiency of the ECME algorithm, with respect to the EM algorithm, has also been discussed with reference to another member of the MNSM family, the Mt distribution (see Liu and Rubin, 1994, 1995 and McLachlan and Krishnan, 2007, Section 5.8).
The ECME algorithm is an extension of the expectation-conditional maximization (ECM) algorithm (Meng and Rubin, 1993) which, in turn, is an extension of the EM algorithm (see McLachlan and Krishnan, 2007, Chapter 5, for details). The ECM algorithm replaces the M-step of the EM algorithm by a number of computationally simpler conditional maximization (CM) steps. The ECME algorithm generalizes the ECM algorithm by maximizing, in some or all of the CM-steps, the incomplete-data log-likelihood directly. As for the EM and ECM algorithms, the ECME algorithm monotonically increases the likelihood and reliably converges to a stationary point of the likelihood function (see Meng and van Dyk, 1997 and McLachlan and Krishnan, 2007, Chapter 5). Moreover, Liu and Rubin (1994) found the ECME algorithm to be nearly always faster than both the EM and ECM algorithms in terms of number of iterations, and that it can be faster in total computer time by orders of magnitude (see also McLachlan and Krishnan, 2007, Chapter 5).
For the application of any variant of the EM algorithm, in light of the hierarchical representation of the MTIN distribution given in (9), it is convenient to view the observed data x₁, …, x_n as incomplete. The complete data are (x₁, w₁), …, (x_n, w_n), where the missing variables w₁, …, w_n are defined so that
independently for i = 1, …, n, and
Because of this conditional structure, the complete-data likelihood function can be factored into the product of the conditional densities of the X_i given the w_i and the marginal densities of the W_i, i.e.
Accordingly, the complete-data log-likelihood function can be written as
The ECME algorithm iterates between three steps, one E-step and two CM-steps, until convergence. The two CM-steps arise from the partition of the parameter vector into two blocks, the first containing μ and Σ, and the second containing θ. These steps, for the generic iteration of the algorithm, are detailed below.
3.2.1 E-step
The E-step requires the calculation of
the conditional expectation of the complete-data log-likelihood given the observed data, using the current fit for the parameters. In (27) the two terms on the right-hand side are ordered as the two terms on the right-hand side of (24). As well explained in McNeil et al. (2005, p. 82), in order to compute this expectation we need to replace any function of the latent mixing variables arising in (25) and (26) by its conditional expectation, where the expectation, as noted by the subscript, is taken using the current fit for the parameters. To calculate these expectations we can observe that the conditional pdf of W_i given x_i satisfies h(w_i | x_i) ∝ f(x_i | w_i) h(w_i; θ), up to some constant of proportionality. In detail,
denotes the pdf of a gamma distribution with parameters d/2 + 1 and δ_i/2, and
Now, consider the functions of the latent variables arising in (25) and (26). There is no need to compute the expectation of the term that does not involve the parameters; moreover, there is no need to compute the expectation arising from (26), because the ECME algorithm does not use it to update θ. Thanks to (29) we have that
Then, by substituting each w_i with its conditional expectation in the last term on the right-hand side of (25), we obtain
where the terms constant with respect to μ and Σ are dropped.
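In practice, the conditional expectations above reduce to one weight per observation, E(W | x_i). A Python sketch of this weight is given below (the function name is ours; the conditional density is proportional to w^(d/2) exp(−w δ_i/2) on (1 − θ, 1), i.e. a gamma density with shape d/2 + 1 and rate δ_i/2 truncated to that interval, and we evaluate the expectation by numerical integration rather than via the incomplete gamma functions of the paper).

```python
import numpy as np
from scipy.integrate import quad

def e_step_weight(delta, d, theta):
    # E(W | x): mean of the truncated-gamma conditional mixing density
    dens = lambda w: w ** (d / 2) * np.exp(-w * delta / 2)
    num, _ = quad(lambda w: w * dens(w), 1 - theta, 1, epsabs=0)
    den, _ = quad(dens, 1 - theta, 1, epsabs=0)
    return num / den
```

Observations with a large squared Mahalanobis distance delta receive a weight close to the lower endpoint 1 − θ, which is the down-weighting mechanism discussed in Section 5.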
3.2.2 CM-step 1
The first CM-step, at the same iteration, requires the calculation of the updates of μ and Σ as the values maximizing the expected complete-data log-likelihood, with θ fixed at its current value. This maximization is easily implemented if we note that the function being maximized is a weighted log-likelihood, with the weights computed in the E-step, of n independent normal observations. So, the updates for μ and Σ are
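The CM-step 1 updates are the familiar weighted sample mean and weighted sample covariance. A minimal Python sketch (function name ours):

```python
import numpy as np

def cm_step1(X, w):
    # weighted mean and covariance; w holds the E-step weights
    mu = (w[:, None] * X).sum(axis=0) / w.sum()
    R = X - mu
    Sigma = (w[:, None] * R).T @ R / len(X)
    return mu, Sigma
```

With all weights equal to one, these reduce to the ordinary ML estimates under the MN model, which is the sense in which the weights robustify the updates.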
3.2.3 CM-step 2
The second CM-step of the ECME algorithm then chooses the update of θ to maximize the observed-data log-likelihood (34), with μ and Σ fixed at their just-updated values. This implies that the update of θ is a solution of the equation
As a closed-form solution for the root of (35) is not analytically available, we use the uniroot() function in the stats package of R to perform the numerical one-dimensional search of the update of θ.
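As an alternative to root-finding with uniroot(), the same update can be obtained by a bounded one-dimensional maximization of the observed-data log-likelihood in θ. A Python sketch (ours; the pdf is again evaluated by numerical integration of the mixture):

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

def cm_step2(deltas, d, logdet):
    # deltas: squared Mahalanobis distances at the current mu and Sigma;
    # logdet: log|Sigma|; maximize the observed-data log-likelihood in theta
    const = len(deltas) * (d / 2 * np.log(2 * np.pi) + logdet / 2)

    def negll(theta):
        ll = 0.0
        for delta in deltas:
            val, _ = quad(lambda w: w ** (d / 2) * np.exp(-w * delta / 2),
                          1 - theta, 1, epsabs=0)
            ll += -np.log(theta) + np.log(val)
        return -(ll - const)

    res = minimize_scalar(negll, bounds=(1e-6, 1 - 1e-6), method="bounded")
    return res.x
```

For near-normal data the search ends close to the lower bound, while a few very distant observations push the update of θ upward, consistently with the role of θ as a tail-weight parameter.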
4 Existence of the ML estimates
In this section we demonstrate the existence of the ML estimates for μ, Σ, and θ. With this aim, we consider the solutions from the ECME algorithm of Section 3.2. In line with the CM-steps of the algorithm, we first demonstrate the existence of the estimates of μ and Σ, given θ (Theorem 4.1), and then the existence of the estimate of θ, given μ and Σ (Theorem 4.2).
Given θ ∈ (0, 1) and a sample x₁, …, x_n in general position, there exists (μ̂, Σ̂), with Σ̂ belonging to the set of positive-definite symmetric matrices, such that
The pdf of X ∼ MTIN_d(μ, Σ, θ) can be written as
To prove the theorem we use Proposition 2.4 in Cuesta-Albertos et al. (2008). In particular, it is sufficient to verify that the following assumptions hold:
(a) there exists a strictly decreasing sequence which converges to zero, such that the required inequality holds for every element of the sequence;
(b) if the argument diverges, then a suitable bound exists;
(c) the relevant function is continuous on its domain.
It is straightforward to verify that the inequality in (a) holds; so, assumption (a) is satisfied for any strictly decreasing sequence which converges to zero.
Given μ, Σ, and a sample x₁, …, x_n, there exists θ̂ ∈ (0, 1) such that
According to (7), and given μ and Σ, the log-likelihood function, as a function of θ, is
It is straightforward to realize that the log-likelihood is continuous in θ. If the limits of the log-likelihood, when θ tends to the boundaries 0 and 1 of its support, are both finite, then the log-likelihood function has a maximum and the theorem is proved. When θ → 0⁺ we have
The limit in (40) is constant and corresponds to the log-likelihood function of a d-variate normal distribution computed in μ and Σ. ∎
5 Some notes on robustness
In the presence of mild outliers, the MTIN model allows us to obtain improved (in terms of robustness) ML estimates of μ and Σ, with respect to those provided by the nested (reference) MN model. In detail, the influence of the observations is reduced (down-weighted) as the squared Mahalanobis distance increases. This is in line with M-estimation (Maronna, 1976), which uses a decreasing weighting function to down-weight the observations with large squared Mahalanobis distances. To be more precise, according to (32) and (33), and since θ is estimated from the data by ML, the ML estimates of μ and Σ can be respectively viewed as an adaptively weighted sample mean and sample covariance matrix, in the sense used by Hogg (1974), with weights given in (31). This approach, in addition to being a type of M-estimation, follows Box (1980) and Box and Tiao (2011) in embedding the reference MN model in a larger model with one or more parameters (here θ) that afford protection against non-normality.
For each possible value of the dimension d, consider the weights in (31) as a bivariate (weighting) function of the squared Mahalanobis distance δ and of the inflation parameter θ, i.e.
As concerns the limiting behavior of the weight as δ → ∞, by using the fundamental theorem of calculus and l'Hôpital's rule we obtain
From (42) we note that, when observations are very far from the bulk of the data (δ → ∞), their weight in the computation of the estimates of μ and Σ depends on θ. If θ is close to zero (i.e. if the MTIN approaches the MN distribution), then the weight is close to one; indeed, this is what we expect under the normal case. Otherwise, the weight linearly decreases as θ increases, i.e. as the MTIN departs from the MN distribution. As concerns the limiting behavior of the weight as θ → 1⁻, by using the same mathematics considered for the limit in (42), we have
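The limiting behavior in (42) can also be checked numerically. In the sketch below (ours; the weight is the E-step quantity E(W | x), evaluated by numerical integration), the weight of a far-out observation approaches 1 − θ, which is our reading of the limit: close to one when θ ≈ 0 and linearly decreasing in θ.

```python
import numpy as np
from scipy.integrate import quad

def weight(delta, d, theta):
    # E(W | x): conditional density proportional to w^(d/2) exp(-w*delta/2)
    # on (1 - theta, 1); mass piles up at 1 - theta as delta grows
    dens = lambda w: w ** (d / 2) * np.exp(-w * delta / 2)
    num, _ = quad(lambda w: w * dens(w), 1 - theta, 1, epsabs=0)
    den, _ = quad(dens, 1 - theta, 1, epsabs=0)
    return num / den
```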
6 Simulation studies
In this section, we investigate various aspects of our model through simulation studies. While the simulation study of Section 6.1 is designed to evaluate parameter recovery of the proposed estimation procedures, in addition to the computational times required by these procedures, the simulation study of Section 6.2 aims to compare classical model selection criteria in selecting among a set of well-established elliptical distributions, with differing numbers of parameters, including the MTIN. The whole analysis is conducted in R, and the code for density evaluation, random number generation, and fitting of the MTIN distribution is available, in the form of an R package named mtin, from http://docenti.unict.it/punzo/Rpackages.htm.
6.1 Parameter recovery and computational times
In this section we compare two methods to estimate the parameters of the MTIN distribution, namely the method of moments (MM), discussed in Appendix C, and ML, discussed in Section 3. These methods are compared with respect to parameter recovery and to the computational time required. We use three approaches to compute ML estimates: direct approach with the Nelder-Mead algorithm, direct approach with the BFGS algorithm, and ECME algorithm (cf. Section 3). All the algorithms used to obtain ML estimates are initialized by the solution provided by the method of moments.
In this study we consider three experimental factors: the dimension d, the sample size n, and the inflation parameter θ. The values of θ are chosen unbalanced to the right simply to have scenarios with more excess kurtosis (refer to the considerations made at the end of Section 2.3).
For each combination of d, n, and θ, we sample one hundred datasets from an MTIN distribution with a zero mean vector (μ = 0) and an identity scale matrix (Σ = I). On each generated dataset, we fit the MTIN distribution with the four approaches cited above. Computation is performed on a Windows 10 PC, with an Intel i7-8550U CPU and 16.0 GB RAM, using R 64-bit, and the elapsed time (in seconds) is computed via the system.time() function of the base package. Parallel computing, using 4 cores, is considered.
We start by evaluating parameter recovery, focusing on θ; we limit the investigation to this parameter because, in analogy with other normal scale mixtures, the parameter(s) governing the tail weight is (are) the most difficult to estimate. However, although not reported here, the rankings of the estimation methods we obtain with respect to θ are roughly preserved when we move to μ and Σ.
In Figure 3 we report the box-plots of the differences between the estimated and true θ, for bias evaluation, while in Figure 4 we report the box-plots of the corresponding squared differences, for mean square error (MSE) evaluation. Each of the nine plots in these figures refers to a particular (d, n) pair and shows box-plots, each summarizing (for every θ and estimation method used) the behavior of the considered differences over the available 100 replications. As expected, the differences under evaluation improve as n increases. Interestingly enough, the differences also improve when either d or θ increases. Regardless of the scenario and difference considered, the BFGS and ECME algorithms for ML estimation work comparably and represent the best approaches. The worst approach is MM. The very similar behavior of the BFGS and ECME algorithms can be further corroborated by looking at Table 1, where log-likelihood values are averaged over the 100 replications of each (d, n, θ) triplet. Here we can note how these algorithms return exactly the same average results, which are always slightly better than those provided by the Nelder-Mead algorithm, and this is true regardless of the triplet considered. To be more precise, looking at the whole set of obtained results (not reported here for the sake of space), the BFGS and ECME algorithms provide, on each fitted model, the same maximized log-likelihood value.
As concerns the computational times, for each combination of θ and the estimation method considered, Figure 5 reports the box-plots of the elapsed times over the replications, obtained by marginalizing with respect to d and n.
The MM provides the best (lowest) elapsed times, regardless of the true underlying θ. Among the algorithms considered to obtain ML estimates, the BFGS is the best one, followed by the Nelder-Mead and ECME algorithms. The elapsed times of the ECME algorithm seem to be a decreasing function of θ, tying those of the BFGS algorithm for the largest value of θ considered. Another interesting result can be noted by looking at Figure 6 where we report, analogously to Figure 5, the box plots of the elapsed times for a fixed choice of d and θ, when varying the sample size.