## 0.1 Introduction

The normal distribution plays an important role in statistics as much collected data are, or are assumed to be, normally distributed. Whilst the central limit theorem tells us that for any random sample

of independent and identically distributed random variables with expected value

and variance

, converges in distribution to as (Patel1996). Moreover, the t distribution, and its use in cases of normally distributed data of small sample size and of unknown variance, is also of considerable importance (Ireland2010). Consequently, functions such as normal() and t() are available for working with these distributions in Stata.However, a considerable amount of statistical analysis performed is not univariate, but multivariate. As a result, the generalisation of the normal and t distributions to higher dimensions; the multivariate normal (MVN) and multivariate t distributions (MVT), are of increasing significance. Their broad use can be seen through introductions created for statisticians (Tong2012; Kotz2004a), natural scientists (Bailer1997; Blaesild2002; Mazza2014; Feigelson2012) and social scientists (Stevens2012; Howell2012).

The complexity of these distributions makes computational analysis almost certainly necessary, and thus much research into their numerical analysis has been conducted (Genz2009). Many programming languages now have efficient code available for working with densities, evaluating distribution functions, finding quantiles, and generating pseudo-random variables (see, for example, Genz2014). In Stata though, no code currently exists for working with the MVT distribution, and there are limitations to the support for the MVN distribution. Specifically, the drawnorm command allows users to generate pseudo-random samples from a MVN distribution, and lnmvnormalden() is available to evaluate its density. Cappallari2006 provided mvnp

for the evaluation of cumulative MVN probabilities of any dimension through use of the Geweke-Hajivassiliou-Keane simulator

(Geweke1989; Hajivassiliou1998; Keane1994). This simulator is also utilised in the Mata function ghk() (Gates2006). Whilst in Stata 15, mvnormal() and several similar mata commands were released to facilitate the computation of the MVN distribution function over any range of integration. However, to the best of our knowledge, no command is presently obtainable for easily determining equicoordinate quantiles of the MVN distribution.Furthermore, no commands are currently available to perform calculations involving the MVN or MVT distribution, or their univariate counterparts, in the presence of variable truncation. In certain scenarios, it is not uncommon for one or more variables in an MVN or MVT setting to be bound to some particular finite interval. For example, the truncated normal distribution is utilised in modelling censored data through the Tobit model (Tobin1958).

In this work, we present several Stata commands and Mata functions for performing key calculations for any non-degenerate case of the MVN or MVT distribution, and for a particular type of non-central multivariate t distribution (NCMVT), either in the presence or absence of variable truncation. We proceed by briefly summarising the algorithms underpinning these commands, before detailing their syntax, and finally demonstrating their use through several examples.

## 0.2 Methods

### 0.2.1 Multivariate normal distribution

We begin by considering a -dimensional random variable with a (non-degenerate) MVN distribution. We denote this by , where is a location parameter, and is a positive-definite covariance matrix. Many of the steps below revolve around the evaluation of , for and with for . Specifically

for , and where .

Several different methods are available for evaluating such integrals. For example, the mata function mvnormalcv() employs numerical quadrature for this task. In our code, we utilise a quasi-Monte Carlo integration algorithm over a randomised lattice after separation-of-variables has been performed (see Genz1992; Genz2002; Genz2009 for further details). Moreover, we employ variable re-ordering in order to improve efficiency as proposed by Gibson1994. This is the approach recommended for such calculations by Genz2009. The procedure is implemented in our Stata command pmvnormal. The algorithm employed requires specification of the number of shifts and the number of samples to use in the Monte Carlo integration. In addition, a value for the Monte Carlo confidence factor is necessitated. However, default values that have been shown to perform consistently well are provided, and thus little user alteration is required in practice. Increasing the number of shifts or samples theoretically brings about a reduction in the Monte Carlo error of the integration, but this comes at a cost to the run time of the commands.

In the above,

is the probability density function for

. We later present a Stata command mvnormalden to directly compute the above density for any user specified choices of , and .Note that those with access to Stata 15 should utilise mvnormalcv() for evaluating , as its execution time will be smaller. Additionally, as noted earlier, lnmvnormalden() could also be used for density calculations. The purpose of our commands is to provide a framework for these tasks with a consistent syntax to our commands performing previously unsupported derivations, and as a means of evaluating the MVN distribution function with earlier versions of Stata.

Also routinely of interest in multivariate analysis is the determination of equicoordinate quantiles of the MVN distribution. Determining the value,

, of the th equicoordinate quantile of , requires us to solve the following equationHere, we achieve this in the Stata command invmvnormal, allowing either mvnormalcv() or the algorithm implemented in pmvnormal to be used to evaluate the right hand side of the above equation. Modified interval bisection is then utilised to search for the correct value of .

Finally, in order to draw pseudo-random vectors with distribution , we utilise the fact that for with , has the desired distribution when with . The can be generated using rnormal(). Several methods for generating such an are available, as discussed for example by Johnson1987

. Here we allow for eigen, singular value, and cholesky decomposition of

. Explicitly, our Stata command for this pseudo-random generation is rmvnormal. As for the density and distribution function evaluations discussed above, drawnorm provides similar functionality. In this case however, our command provides additional options as well as a consistent syntax.### 0.2.2 Multivariate distribution

We now consider a -dimensional random variable with a (non-degenerate) NCMVT distribution. We denote this by , where is now a vector of non-centrality parameters, is a positive-definite scale matrix, and

is a degrees of freedom parameter. There are in-fact several ways by which one can define an associated NCMVT integral. Here we consider what has been referred to as the location shifted NCMVT

(Genz2009). See also Kotz2004bfor further details. This NCMVT distribution arises for example as the Bayesian posterior distribution for the regression coefficients in a linear regression

(Genz2014). Note that in the case when the central MVT distribution is obtained.Our integral of interest is as follows

and can be evaluated using mvt. In addition, mvtden is available to evaluate the density .

We would again like to determine equicoordinate quantiles, this time solving the following equation

Later, the stata command invmvt is described that performs this calculation. In both mvt and invmvt, the algorithm employed for numerical integration is a direct modification of that described above for the MVN distribution.

Finally, in order to draw pseudo-random vectors following a location shifted distribution, we utilise the fact that if and , then

has the desired distribution. We draw pseudo-random vectors for using the algorithm described earlier for the MVN distribution, with pseudo-random numbers acquired through rchi2(). The Stata command for this is rmvt.

### 0.2.3 Truncated multivariate normal and multivariate distributions

We additionally consider truncated versions of the MVN and NCMVT distributions discussed above. We denote these by and respectively. Here, and , with for , are the lower and upper truncation points for . That is, we restrict such that for . The distribution functions for these variables are

with and their respective densities. Our Stata commands to evaluate these quantities are tmvnormal, tmvt, tmvnormalden, and tmvtden.

We further provide commands invtmvnormal and invtmvt to determine the values of such that

or

Finally, we use rejection sampling in the commands rtmvnormal and rtmvt to generate random variables with a or distribution. For example, in rtmvnormal, a random draw from the is made. If for then this draw is retained, otherwise it is rejected and the process is repeated.

## 0.3 Syntax

Note that all of the Stata commands presented here are written as .ado files, utilising Mata for computational efficiency wherever possible. This allows all desired calculations to be performed easily from within Stata. Corresponding Mata functions for each of the Stata commands are provided for convenience given the matrix based nature of this work.

### 0.3.1 Stata commands

mvnormalden, x(numlist) mean(numlist) sigma(string) logdensity

pmvnormal, lower(numlist miss) upper(numlist miss) mean(numlist) sigma(string) shifts(integer 12) samples(integer 1000) alpha(real 3)

invmvnormal, p(real) mean(numlist) sigma(string) tail(string) itermax(integer 1000000) tolerance(real 0.000001) integrator(string) shifts(integer 12) samples(integer 1000)

rmvnormal, mean(numlist) sigma(string) n(integer 1) method(string)

mvtden, x(numlist) delta(numlist) sigma(string) df(real 1) logdensity

mvt, lower(numlist miss) upper(numlist miss) delta(numlist) sigma(string) df(real 1) shifts(integer 12) samples(integer 1000) alpha(real 3)

invmvt, p(real) delta(numlist) sigma(string) df(real 1) tail(string) itermax(integer 1000000) tolerance(real 0.000001) shifts(integer 12) samples(integer 1000)

rmvt, delta(numlist) sigma(string) df(real 1) n(integer 1) method(string)

tmvnormalden, x(numlist) mean(numlist) sigma(string) lowertruncation(numlist miss) uppertruncation(numlist miss) logdensity integrator(string) shifts(integer 12) samples(integer 1000)

tmvnormal, lower(numlist miss) upper(numlist miss) mean(numlist) sigma(string) lowertruncation(numlist miss) uppertruncation(numlist miss) integrator(string) shifts(integer 12) samples(integer 1000)

invtmvnormal, p(real) mean(numlist) sigma(string) lowertruncation(numlist miss) uppertruncation(numlist miss) tail(string) itermax(integer 1000000) tolerance(real 0.000001) integrator(string) shifts(integer 12) samples(integer 1000)

rtmvnormal, mean(numlist) sigma(string) lowertruncation(numlist miss) uppertruncation(numlist miss) n(integer 1) method(string)

tmvtden, x(numlist) delta(numlist) sigma(string) lowertruncation(numlist miss) uppertruncation(numlist miss) df(real 1) logdensity shifts(integer 12) samples(integer 1000)

tmvt, lower(numlist miss) upper(numlist miss) delta(numlist) sigma(string) lowertruncation(numlist miss) uppertruncation(numlist miss) df(real 1) shifts(integer 12) samples(integer 1000)

invtmvt, p(real) delta(numlist) sigma(string) lowertruncation(numlist miss) uppertruncation(numlist miss) df(real 1) tail(string) itermax(integer 1000000) tolerance(real 0.000001) shifts(integer 12) samples(integer 1000)

rtmvt, delta(numlist) sigma(string) lowertruncation(numlist miss) uppertruncation(numlist miss) df(real 1) n(integer 1) method(string)

Here, the above options are as follows

x is the vector of quantiles at which a density is sought (see above).

mean is the location parameter of a MVN or truncated MVN distribution. In the case of a MVN distribution, this is the expected value of the distribution.

sigma is the matrix described above. This must be symmetric positive-definite. In the case of a MVN distribution this is the variables covariance matrix. This is not the case for the NCMVT distribution, or for truncated distributions.

logdensity specifies that the log of the evaluated density should be returned.

lower is the vector of lower limits, denoted by above. Use . to indicate a value is .

upper is the vector of upper limits, denoted by above. Use . to indicate a value is .

shifts is the number of shifts of the Quasi-Monte Carlo integration algorithm to use. It must be a strictly positive integer. In invmvnormal and invtmvnormal this will only have an effect if integrator is set to pmvnormal.

samples is the number of samples in each shift of the Quasi-Monte Carlo integration algorithm to use. It must be a strictly positive integer. As above, in invmvnormal and invtmvnormal this will only have an effect if integrator is set to pmvnormal.

alpha

is the value of the Monte Carlo confidence factor to use in estimating the error in the returned value of the integral when using

pmvnormal or mvt. It must be strictly positive.p is a desired probability, denoted by above. It must be strictly between 0 and 1.

tail specifies which quantile should be computed. For invmvnormal "lower" gives the such that

"upper" such that

and "both" such that

Analogous statements hold for invmvt, invtmvnormal, and invtmvt.

itermax specifies the maximum allowed number of iterations in the implemented modified interval bisection algorithm.

tolerance specifies the desired tolerance for termination of the implemented modified interval bisection algorithm.

n is the number of random vectors to generate from a chosen distribution. It must be a strictly positive integer.

delta is the vector of non-centrality parameters of a NCMVT or truncated NCMVT distribution. Note that this may not be these distributions expected value.

lowertruncation is the vector of lower truncation limits, denoted by above. Use . to indicate a value is .

uppertruncation is the vector of upper truncation limits, denoted above. Use . to indicate a value is .

df is the degrees of freedom parameter, , of a NCMVT or truncated NCMVT distribution.

### 0.3.2 Mata functions

Mata functions are provided with a naming convention that adds the suffix ‘_mata’ on to the command names above. All inputs are listed in the same order as the Stata commands, and have the exact same name and interpretation. Input types are specified in the obvious manner. That is, delta and lower must be real vectors, df a real scalar, method a string, and similarly for other inputs. For this reason, we omit their specification here. However, as an example

mvtden_mata(real vector x, real vector delta, real matrix Sigma, real scalar df, string logdensity)

## 0.4 Examples

### 0.4.1 Density comparison

In order to demonstrate the use of mvnormalden, mvtden, tmvnormalden, and tmvtden, we plot and compare densities in the case with

We first loop over a grid of values in the region . At each point we compute the density of a MVN, MVT, truncated MVN, and truncated MVT distribution with the above parameters

. mat Sigma = J(2, 2, 0.5) + 0.5*I(2) . mat mvnormalden = J(101, 101, 0) . mat mvtden = J(101, 101, 0) . mat tmvnormalden = J(101, 101, 0) . mat tmvtden = J(101, 101, 0) . local i = 1 . foreach x1 of numlist -3(0.06)3 2. local j = 1 3. foreach x2 of numlist -3(0.06)3 4. qui mvnormalden, x(‘x1’, ‘x2’) mean(0, 0) sigma(Sigma) 5. mat mvnormalden[‘i’, ‘j’] = r(density) 6. qui mvtden, x(‘x1’, ‘x2’) delta(0, 0) sigma(Sigma) df(‘nu’) 7. mat mvtden[‘i’, ‘j’] = r(density) 8. qui tmvnormalden, x(‘x1’, ‘x2’) mean(0, 0) sigma(Sigma) lowert(-1.5, -1.5) ¿ uppert(1.5, 1.5) 9. mat tmvnormalden[‘i’, ‘j’] = r(density) 10. qui tmvtden, x(‘x1’, ‘x2’) delta(0, 0) sigma(Sigma) lowert(-1.5, -1.5) ¿ uppert(1.5, 1.5) df(1) 11. mat tmvtden[‘i’, ‘j’] = r(density) 12. local ‘j++’ 13. 14. local ‘i++’ 15.

We then pass our results to twoway contour to produce Figure 1

. qui svmat mvnormalden . qui svmat mvtden . qui svmat tmvnormalden . qui svmat tmvtden . gen row = _n . qui reshape long mvnormalden mvtden tmvnormalden tmvtden, i(row) j(col) . local opt ”aspect(2) ccuts(0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4) ¿ xtitle(”xsubscript:1”) ytitle(”xsubscript:2”)” . twoway contour mvnormalden row col, nodraw saving(g1, replace) ‘opt’ (file g1.gph saved) . twoway contour mvtden row col, nodraw saving(g2, replace) ‘opt’ (file g2.gph saved) . twoway contour tmvnormalden row col, nodraw saving(g3, replace) ‘opt’ (file g3.gph saved) . twoway contour tmvtden row col, nodraw saving(g4, replace) ‘opt’ (file g4.gph saved) . graph combine g1.gph g2.gph g3.gph g4.gph, common imargin(0 0 0 0) rows(2) ¿ cols(2) scheme(sj)

As would be expected, we observe that the density of the MVT distribution with degree of freedom is more homogeneous than that of the MVN distribution. This explains why tail probabilities are larger for MVT distributions with low degrees of freedom. Truncation increases the density at each point in the region compared to the corresponding non-truncated variable.

### 0.4.2 Goodness of fit tests for multivariate normality

Stata provides, through mvtest normality, a set of statistical tests of multivariate normality. Here, we explore the results of applying these tests to random variables drawn using rmvt and rtmvt. Specifically, we consider a bivariate MVT distribution with

We also examine results for a variable that is a mixture of two truncated MVT distributions, each with the above value for , but one with

and the other

We additionally consider several possible values for . We draw samples from the MVT distribution, for multiple values of , 1000 times in order to compute an average -value from the Doornik-Hansen omnibus test. This process is also performed for the truncated MVT distributions, except samples are drawn from each and then combined in to a single dataset, so that the overall sample size is retained at . We achieve this with the following code

. set seed 2
. mat Sigma = J(2, 2, 0.5) + 0.5*I(2)
. mat list Sigma
. mat gof_pvaluesum_mvn = J(4, 5, 0)
. mat gof_pvaluesum_mix = J(4, 5, 0)
. local i = 1
. foreach n of numlist 10 25 50 100
2. local j = 1
3. foreach df of numlist 2 5 10 50 100
4. foreach rep of numlist 1/1000
5. rmvt, delta(0, 0) sigma(Sigma) df(‘df’) n(‘n’) method(chol)
6. mat randsamp = r(rmvt)
7. qui svmat randsamp
8. qui mvtest normality randsamp1 randsamp2
9. mat gof_pvaluesum_mvn[‘i’, ‘j’] = gof_pvaluesum_mvn[‘i’, ‘j’] + r(p_dh)
10. drop randsamp1 randsamp2
11. rtmvt, delta(-2, -2) sigma(Sigma) df(‘df’) n(‘n’) method(chol)
¿ lowert(-2.5, -2.5) uppert(-1.5, -1.5)
12. mat randsamp = r(rtmvt)
13. rtmvt, delta(2, 2) sigma(Sigma) df(‘df’) n(‘n’) method(chol)
¿ lowert(1.5, 1.5) uppert(2.5, 2.5)
14. mat randsamp = (randsamp

r(rtmvt))
15. qui svmat randsamp
16. qui mvtest normality randsamp1 randsamp2
17. mat gof_pvaluesum_mix[‘i’, ‘j’] = gof_pvaluesum_mix[‘i’, ‘j’] + r(p_dh)
18. drop randsamp1 randsamp2
19.
20. local ‘j++’
21.
22. local ‘i++’
23.
. mat gof_summary_mvn = gof_pvaluesum_mvn/1000
. mat gof_summary_mix = gof_pvaluesum_mix/1000

Observe that we have utilised cholesky decomposition of in generating these random variables. For most distributions the choice of decomposition will have little effect on the speed and quality of the generated random variables. However, when the dimension is large, this choice should be made more carefully.

Now, the results of the Doornik-Hansen tests are summarised in Table 1. We can see that, as would be expected, increasing the value of for the MVT distribution typically increases the average -value for each value of , as the resultant distribution is closer to that of a corresponding MVN distribution. Whilst the pattern is slightly less clear as we increase , in general this reduces the average -value, as more data allows the test to more precisely categorise the distribution of the generated variables.

For the datasets which are a mixture of the truncated NCMVT variables, perhaps surprisingly the average

-values are often larger than the corresponding value for the MVT distribution. In no instance is the Doornik-Hansen omnibus test significant on average at the 5% level. That is, in no case does it on average correctly reject the null hypothesis of multivariate normality. This is likely a consequence of the chosen values for

, , and in the two truncated MVT distributions. When the two simulated datasets are combined, they likely resemble a MVN distribution with .Mixture of multivariate distributions | |||||
---|---|---|---|---|---|

0.2948 | 0.4069 | 0.4540 | 0.4851 | 0.4882 | |

0.1040 | 0.3200 | 0.4217 | 0.5066 | 0.5109 | |

0.0063 | 0.1502 | 0.3281 | 0.5065 | 0.5090 | |

0.0003 | 0.0574 | 0.2326 | 0.4656 | 0.5046 | |

Mixture of truncated non-central multivariate t distributions | |||||

0.6612 | 0.6502 | 0.6398 | 0.6402 | 0.6483 | |

0.6634 | 0.6466 | 0.6377 | 0.6138 | 0.6053 | |

0.4281 | 0.4581 | 0.4720 | 0.4712 | 0.4446 | |

0.1439 | 0.1942 | 0.2300 | 0.2443 | 0.2406 |

### 0.4.3 Familywise error-rate using Bonferroni and Dunnett corrections

As a brief example of the usage of pmvnormal and invmvnormal, consider a statistical test of the following hypotheses

Suppose our tests are based on test statistics

, , and it is known thatThe familywise error-rate, the probability of incorrectly rejecting one or more of the null hypotheses, can be controlled to a level using the Bonferroni correction by rejecting if . The Dunnett correction in comparison rejects if , where is the solution to

We can use invmvnormal to determine this value of , with the following code

. set seed 3 . mat Sigma = J(3, 3, 0.5) + 0.5*I(3) . invmvnormal, p(0.95) mean(0, 0, 0) sigma(Sigma) tail(”lower”) quantile = 2.0620839 error = 4.663e-15 flag = 0 fquantile = 0 iterations = 21 . local r = r(quantile)

Returned along with the quantile, which is determined to be approximately 2.06, is an estimate of the error in this value. Also provided is a flag variable, which takes the value 0 if the interval bisection algorithm converged with no issues. A value of the objective function (fquantile) we attempt to find the root of, and the number of iterations required for convergence, are additionally listed.

We can then verify that this value of will control the familywise error-rate to using pmvnormal as follows, additionally evaluating the true familywise error-rate when using the Bonferroni correction

. pmvnormal, lower(., ., .) upper(‘r’, ‘r’, ‘r’) mean(0, 0, 0) sigma(Sigma) integral = .94999707 error = .00005633 . pmvnormal, lower(., ., .) upper(‘= invnormal(1 - 0.05/3)’, ¿ ‘= invnormal(1 - 0.05/3)’, ‘= invnormal(1 - 0.05/3)’) mean(0, 0, 0) sigma(Sigma) integral = .95705901 error = .00003466

We can now see the conservatism of the Bonferroni correction: the true familywise error-rate is approximately 4.3%.

### 0.4.4 Orthont probabilities in the presence of truncation

The value of is often referred to as an orthont probability. Here, as a final example, we examine how changes in , in order to demonstrate the usage of tmvnormal.

We consider again the case with

Using pmvnormal, we can establish that

. set seed 4 . mat Sigma = J(3, 3, 0.5) + 0.5*I(3) . pmvnormal, lower(0, 0, 0) upper(., ., .) mean(0, 0, 0) sigma(Sigma) integral = .2500246 error = .00005905

The second value returned in the above is, as in Section 4.3, an estimate of the error in the integral. As we can see, pmvnormal is again easily able to control this error to a small level.

Then, as stated, we can evaluate how this probability changes in for a truncated MVN distribution as follows

. mat lowertrunc = J(110, 1, 0) . mat probabilities = J(110, 1, 0) . foreach i of numlist 1/110 2. mat lowertrunc[‘i’, 1] = -10 + 0.1*(‘i’ - 1) 3. qui tmvnormal, lower(0, 0, 0) upper(., ., .) mean(0, 0, 0) sigma(Sigma) ¿ lowert(‘=lowertrunc[‘i’, 1]’, ‘=lowertrunc[‘i’, 1]’, ‘=lowertrunc[‘i’, 1]’) ¿ uppert(., ., .) 4. mat probabilities[‘i’, 1] = r(integral) 5. . mat data = (lowertrunc, probabilities) . qui svmat data . twoway (line data2 data1, yaxis(1)), xtitle(t) ytitle(P(X ¿= (0,0,0)—(t,t,t))) ¿ scheme(sj)

Here, twoway line is used to produce Figure 2. As expected, when is highly negative, the value of is approximately equal to that of the corresponding non-truncated distribution. This value increases up to a value of one when , since at this point truncation implies we must have that for .

## 0.5 Conclusion

The MVN and MVT distributions are extremely important in a wide range of statistical problems faced by researchers in many fields. Here we have extended Stata users ability to work with these key distributions; presenting several commands to allow densities, pseudo-random draws, distribution functions, and equicoordinate quantiles to be computed quickly and efficiently, both with and without variable truncation. In particular, using algorithms developed by Genz2009, our programs require little input from the user to determine probabilities the distributions fall in any range of integration. Whilst this is possible from the MVN distribution using mvnormalcv(), our commands are the first for the NCMVT distribution, as well as for truncated MVN and NCMVT distributions.

However, whilst the algorithms utilised here for the computation of probabilities over arbitrary ranges of integration are efficient, there are many alternatives available. Moreover, we have considered only one possible definition of a NCMVT distribution. Another important definition, used in the computation of the power of multiple contrast tests under a normality assumption (Genz2014), was proposed by Kshirsagar Kotz2004b. Thus, further development of the functions presented here will seek to make additional algorithms available as an option in the distribution function commands, as well as to include an option to change between the shifted NCMVT considered here and that of Kshirsagar available in the NCMVT and truncated NCMVT commands.

## 0.6 Acknowledgements

Michael J. Grayling is supported by the Wellcome Trust (Grant Number 099770/Z/12/Z). Adrian P. Mander is supported by the Medical Research Council (Grant Number MC_UP_1302/2).

Comments

There are no comments yet.