A New Wald Test for Hypothesis Testing Based on MCMC outputs

01/03/2018 ∙ by Yong Li, et al. ∙ 0

In this paper, a new and convenient χ^2 wald test based on MCMC outputs is proposed for hypothesis testing. The new statistic can be explained as MCMC version of Wald test and has several important advantages that make it very convenient in practical applications. First, it is well-defined under improper prior distributions and avoids Jeffrey-Lindley's paradox. Second, it's asymptotic distribution can be proved to follow the χ^2 distribution so that the threshold values can be easily calibrated from this distribution. Third, it's statistical error can be derived using the Markov chain Monte Carlo (MCMC) approach. Fourth, most importantly, it is only based on the posterior MCMC random samples drawn from the posterior distribution. Hence, it is only the by-product of the posterior outputs and very easy to compute. In addition, when the prior information is available, the finite sample theory is derived for the proposed test statistic. At last, the usefulness of the test is illustrated with several applications to latent variable models widely used in economics and finance.

READ FULL TEXT VIEW PDF

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Latent variable models have been widely used in economics, finance, and many other disciplines. Two typical models are the dynamic stochastic general equilibrium models in macroeconomics and stochastic volatility models in finance. The latent variable models are generally indexed by the latent variable and the parameter. In many latent variable models, the latent variable is generally high-dimensional so that the observed likelihood function which is a marginal integral on the latent variable is often intractable and becomes difficult to evaluate accurately. Consequently, the statistical inference for latent variable models is nontrivial in practice. In the recent years, Bayesian MCMC methods have been applied in more and more applications in economics and finance due to that they make it possible to fit increasingly complex models, especially latent variable models, see Geweke, et al (2011) and reference therein.

In economic research, the point null hypothesis test is a fundamental topic in statistical inference. Under the Bayesian paradigm, the Bayes factors (BFs) are the corner-stone of Bayesian hypothesis testing (e.g. Jeffreys,1961; Kass and Raftery 1995; Geweke, 2007). Unfortunately, the BFs are not problem-free. First, the BFs are sensitive to the prior distribution and subjects to the notorious Jeffreys-Lindley’s paradox; see for example, Kass and Raftery (1995), Poirier (1995), Robert (1993, 2001). Second, the calculation of BFs generally involves the evaluation of marginal likelihood. In many cases, the evaluation of marginal likelihood is often difficult.

Not surprisingly, some alternative strategies have been proposed to test a point null hypothesis in the Bayesian literature. In recent years, on the basis of the statistical decision theory, several interesting Bayesian approaches to replace BFs have been developed for hypothesis testing. For example, Bernardo and Rueda (2002, BR hereafter) demonstrated that BFs for the Bayesian hypothesis testing can be regarded as a decision problem with a simple zero-one discrete loss function. However, the zero-one discrete function requires the use of non-regular (not absolutely continuous) prior and this is why BF leads to Jeffreys-Lindley’s paradox. BR further suggested using a continuous loss function, based on the well-known continuous Kullback-Leibler (KL) divergence function. As a result, it was shown in BR that their Bayesian test statistic does not depend on any arbitrary constant in the prior. However, BR’s approach has some disadvantages. First, the analytical expression of the KL loss function required by BR is not always available, especially for latent variable models. Second, the test statistic is not a pivotal quantity. Consequently, BR had to use subjective threshold values to test the hypothesis.

To deal with the computational problem in BR in latent variable models, Li and Yu (2012, LY hereafter) developed a new test statistic based on the

function in the Expectation-Maximization (EM) algorithm. LY showed that the new statistic is well-defined under improper priors and easy to compute for latent variable models. Following the idea of McCulloch (1989), LY proposed to choose the threshold values based on the Bernoulli distribution. However, like the test statistic proposed by BR, the test statistic proposed by LY is not pivotal. Moreover, it is not clear if the test statistic of LY can resolve Jeffreys-Lindley’s paradox.

Based on the difference between the deviances, Li, Zeng and Yu (2014, LZY hereafter) developed another Bayesian test statistic for hypothesis testing. This test statistic is well-defined under improper priors, free of Jeffreys-Lindley’s paradox, and not difficult to compute. Moreover, its asymptotic distribution can be derived and one may obtain the threshold values from the asymptotic distribution. Unfortunately, in general the asymptotic distribution depends on some unknown population parameters and hence the test is not pivotal. With sharing the nice properties with Li, Zeng and Yu (2014, LZY hereafter), Li, Liu and Yu (2015)(2015, LLY hereafter) further proposed a pivotal Bayesian test statistic, based on a quadratic loss function, to test a point null hypothesis within the decision-theoretic framework. However, LLY required to evaluate the first derivative of the observed log-likelihood. As to the latent variable models, because the observed log-likelihood is often intractable, this still posed some tedious computational efforts although there have been several interesting methods for evaluating the first derivative, such as EM algorithm, Kalman filter or Particle filter.

In the paper, we want to propose another novel, easy-to-implement Bayesian statistic for hypothesis testing in the framework of latent variable models. The new statistic can share the important advantages with LLY. First, it is well-defined under improper prior distributions and avoids Jeffrey-Lindley’s paradox. Second, under some mild regularity conditions, the statistic is asymptotically equivalent to the Wald test. Hence, from the large sample theory, it’s asymptotic distribution can be derived to follow the

distribution so that the threshold values can be easily calibrated from this distribution. Third, it’s statistical error can be derived using the Markov chain Monte Carlo (MCMC) approach. In addition, most importantly, compared with the previous test statistics, it is extremely convenient for the latent variable models. We don’t need to evaluate the first-order derivative of the observed log-likelihood function, which is time consuming and difficult for the latent variable models. We just need the MCMC output of posterior simulation. The only effort we should make is the inverse of the posterior variance matrix of the interest parameter in hypothesis testing. Fortunately, in most applications, the dimension of the interest parameter is often not so high that our method can be easily applied. In addition, when the prior information is available, we establish the finite sample theoretical properties for the proposed test statistic.

The paper is organized as follows. Section 2 presents the Bayesian analysis for latent variable models. Section 3 develops the new Bayesian test statistic from the decisional viewpoint and establishes its finite and large sample theoretical properties. Section 4 illustrates the new method by using three real examples in economics and finance. Section 5 concludes the paper. Appendix collects the proof of all the theoretical results.

2 Bayesian analysis of latent variable models

Without loss of generality, let denote observed variables and the latent variables. The latent variable model is indexed by the parameter, . Let be the likelihood function of the observed data, and the complete likelihood function. The relationship between these two likelihood functions is:

(1)

In many latent variable modes, especially dynamic latent variable models, the latent variable

is often dependent on the sample size. Hence, the integral is high-dimensional and often does not have an analytical expression so that it is generally very difficult to evaluate. Consequently, the statistical inferences, such as estimation and hypothesis testing, are difficult to implement if they are based on the popular maximum likelihood approach.

In recent years, it has been documented that the latent variables models can be simply and efficiently analyzed using MCMC techniques under the Bayesian framework. For details about Bayesian analysis of latent variable models via MCMC such as algorithms, examples and references, see Geweke, et al. (2011). Let be prior distribution of unknown parameter . Owing to the complexity induced by latent variables, the observed likelihood is often intractable, hence it is almost impossible to evaluate the expectation of the posterior density directly. To alleviate this difficulty, in the posterior analysis, the popular data-augmentation strategy(Tanner and Wong, 1987) is applied to augment the observed variable with the latent variable . Then, the well-known Gibbs sampler can be used to generate random samples from the joint posterior distribution . More concretely, we start with an initial value , and then simulates one by one; at the th iteration, with current values

(a) Generate from ;

(b) Generate from .

After the burning-in phase, that is, sufficiently many iterations of this iteration procedure, the simulated random samples can be regarded as efficient random observations from the joint posterior distribution .

The statistical inference can be established on the efficient random observations drawn from the posterior distribution. Bayesian estimates of and latent variables

as well as their standard errors can be easily obtained via the corresponding sampling mean and sample covariance matrix of the generated random observations. Specifically, let

be effective random observations generated form the joint posterior distribution . Then the joint Bayesian estimates of , as well as the estimates of their covariance matrix can be obtained as follows:

3 Bayesian Hypothesis Testing from the Decision Theory

3.1 Testing a point null hypothesis

It is assumed that a probability model

is used to fit the data. We are concerned with a point null hypothesis testing problem which may arise from the prediction of a particular theory. Let

denote a vector of

-dimensional parameters of interest and a vector of -dimensional nuisance parameters. The problem of testing a point null hypothesis is given by

(2)

The hypothesis testing may be formulated as a decision problem. It is obvious that the decision space has two statistical decisions, to accept (name it ) or to reject (name it ). Let be the loss function of statistical decision. Hence, a natural statistical decision to reject can be made when the expected posterior loss of accepting is sufficiently larger than the expected posterior loss of rejecting , i.e.,

where is a Bayesian test statistic; the posterior distribution with some given prior ; a threshold value. Let be the net loss difference function which can generally be used to measure the evidence against as a function of . Hence, the Bayesian test statistic can be rewritten as

Remark 3.1.

When the equal prior and the net loss function is taken as

following BR (2002) and Li and Yu (2012), the Bayesian test statistic can be given by

which is equivalent to the well known BFs (Kass and Raftery, 1995) as

when rejecting the null hypothesis. In practice, the BFs are often served as the gold statistics for hypothesis testing and the benchmark for the other test statistics. However, the BFs have some theoretical and computational difficulties. First, in the literature, it is well documented that it can not be well defined when using improper priors and suffers from the notorious Jeffreys-Lindley’s paradox, see Poirier (1995), Robert (2001), Li and Yu (2012), Li, Zeng and Yu (2014), etc. Second, the computation of requires to evaluate the marginal likelihood . Clearly, for latent variable models, this often involves a marginalization over the unknown latent variables and the parameter . Furthermore, it is often a high-dimensional integration and generally hard to do in practice although there have been several interesting methods proposed in the literature for computing BFs from the MCMC output; see, for example, Chib (1995), and Chib and Jeliazkov (2001).

Remark 3.2.

Under decision theory framework, several papers have explored some effective approaches to replace the BFs for point-null hypothesis testing. Poirier (1997) developed a loss function approach for hypothesis testing for models without latent variables. Bernardo and Rueda (2002) proposed an intrinsic statistic for Bayesian hypothesis test based on the Kullback-Leibler (KL) loss function. However, the analytical expression of the KL loss function required by BR is not always available, especially for latent variable models. Furthermore, the test statistic is not a pivotal quantity so that BR had to use subjective threshold values for hypothesis testing. To deal with latent variable models, Li and Yu (2012) proposed a Bayesian test statistic based on the Q-function loss function within EM algorithm. LY showed that the test statistic is well-defined under improper priors and easy to compute for latent variable models. However, like the test statistic proposed by BR, the test statistic proposed by LY is not pivotal. Moreover, it is not clear if the test statistic of LY can resolve Jeffreys-Lindley’s paradox. Li, Zeng and Yu (2014) proposed another test statistic, which is a Bayesian version of likelihood ratio test statistic. This test statistic is well-defined under improper priors, free of Jeffreys-Lindley’s paradox, and not difficult to compute. Moreover, its asymptotic distribution can be derived and one may obtain the threshold values from the asymptotic distribution. Unfortunately, in general the asymptotic distribution depends on some unknown population parameters and hence the test is not pivotal.

Remark 3.3.

In a recent paper, Li, Liu and Yu (2015) proposed a new Bayesian test statistic with the following quadratic loss function

where is the posterior mean under the null and is the submatrix of with respect to parameters . With this loss function, they showed that under some mild regularity conditions, the proposed Bayesian test statistics followed a pivotal asymptotically, hence, it is very easy to calibrate threshold values. Furthermore, this proposed test statistic shared some nice properties with Li and Yu (2012), Li,Zeng and Yu (2014), that is, this test statistic is well-defined under improper prior and immune to Jefferys-Lindley’s paradox. As to latent variable models, obviously, the test statistic by Li, Liu and Yu (2015) needs to evaluate the first-derivative of the observed likelihood function. As noted in section 2, the observed likelihood function often generally doesn’t have analytical form so that it is not easy to do. Li, Liu and Yu (2015) showed that some complex simulation algorithms such as EM algorithm, Kalman filter, Particle filter have to be applied for evaluating the first derivative. Further, the standard error of the new statistic will be smaller than the one in LLY.

3.2 A new Bayesian test from decision theory

In this subsection, as to latent variable models, based on the decision theory, we develop a new Bayesian test statistic for hypothesis testing. The new test statistic can share the nice advantages with Li, Liu and Yu (2015). For example, it can be well-defined under improper prior distributions and avoids Jeffrey-Lindley’s paradox. Furthermore, the threshold values can be easily calibrated from the pivotal asymptotic distribution and it’s statistical error can be derived using MCMC approach. Most importantly, the new test statistic can achieve other important advantages over the existing approaches, such as, Li, et al (2015). Our new contributions are twofold. As to latent variable models, it can be shown that the new test statistic is only the by-product of the posterior outputs, hence, very easy to compute. In addition, when the prior information is available, we establish the finite sample theory.

As to any in support space of , let

In this paper, under the statistical decision theory, we propose the following net loss function for hypothesis testing

where is the submatrix of corresponding to , is the inverse matrix of and is the posterior mean of under the alternative hypothesis . Then, we can define a Bayesian test statistic as follows:

(3)
Remark 3.4.

When informative priors are not available, an objective prior or default prior may be used. Often, is taken as uninformative priors, such as Jeffreys or the reference prior (Jeffreys, 1961; Berger and Bernardo, 1992). These priors are generally improper, and it follows that where is a nonintegrable function, and is an arbitrary positive constant. Since the posterior distribution is independent of an arbitrary constant in the prior distributions, and is the posterior covariance matrix of the interest parameter , hence, the statistic is independent of an arbitrary constant. Consequently, our proposed test statistic is independent on this arbitrary positive constant and can be well-defined under improper priors.

Remark 3.5.

To see how the new statistic can avoid Jeffreys-Lindley’s paradox, consider the example discussed in Li, et al (2015). Let with a known and we test the null hypothesis . Let the prior distribution of be . The prior distribution of can be set as with . Suppose . We want to test the simple point null hypothesis . The posterior distribution of is with

It can be shown that

Clearly, when the prior information is very uninformative, as , we can get that which means that the BFs always support the null hypothesis. This is well-known as Jeffreys-Lindley’s paradox in the Bayesian literature. However, we can find that as . Hence, is distributed as when is true. Consequently, our proposed test statistic is immune to Jeffreys-Lindley’s paradox.

Remark 3.6.

The implementation of the Bayesian test statistic by Li,et al (2015) requires the evaluation of the first derivative of the observed log-likelihood function. As described in section 2, for latent variable models, the observed likelihood function generally doesn’t have analytical form so that it is generally hard to get the fist derivative. Compared with Li,et al (2015), the main advantage of the proposed test statistic in this paper is that it is not highly computational intensive. From the equation (3), we can easily observe that it doesn’t require to evaluate the first derivatives. From the computational perspective, our test statistic is only involved of the posterior random samples and the inverse of the posterior covariance matrix. In practice, through the latent variable or parameter may be high-dimensional, in manly latent variable models, the interest parameter is often low-dimensional. Hence, the proposed Bayesian test statistic is only by-product of Bayesian posterior output, not requires additional computational efforts. This is especially advantageous for latent variable models.

3.3 Large sample theory for the Bayesian test statistic

In this subsection, we establish the Bayesian large sample theory for the proposed test statistic. Let be a sequence of random vectors defined on the probability space and be the collection of . Let denote an element of and write as , then we can write the conditional likelihood function for as , where include some elements of and . Define to be the conditional likelihood for observation and as the derivative of , we suppress the subscript when . The logarithm of posterior likelihood function is

Furthermore, let , and the negative Hessian matrix as

Let the prior density to be , and . In order to derive the asymptotic distribution of the proposed test statistic, following LZY (2014) and LLY(2015), a set of regularity conditions are imposed in the following.

Assumption 1.

There exists a finite sample size , so that, for , there is a local maximum at (i.e., posterior mode) such that and is negative definite.

Assumption 2.

The largest eigenvalue

of goes to zero in probability as .

Assumption 3.

For any , there exists a positive number , such that

(4)
Assumption 4.

For any ,

in probability as , where is the support space of .

Assumption 5.

For any ,

as , where is the support space of .

Assumption 6.

Let to be true value, where is a compact, separable metric space.

Assumption 7.

is an mixing sequence that satisfies, for and , the mixing coefficient for some and .

Assumption 8.

Let for , and , (i) and are measurable to and strictly stationary; (ii) and ; (iii) .

Assumption 9.

There exists a function such that for , all where is an open, convex set containing , exists, for some .

Assumption 10.

The prior density is continuous and for all .

Assumption 11.

For , .

Remark 3.7.

Regularity Assumptions 1-4 have been used to develop the Bayesian large sample theory. This theory is proved by chen(1985), which states that the posterior distribution is degenerate about the posterior mode and asymptotically normal after suitable scaling, that is,

More details, one can refer to Chen (1985). Bickel and Doksum (2006), and Le Cam and Yang (2000), Ghosh (2003) presented another version of this theorem on the basis of other similar regularity conditions. The main difference is that the value at which the asymptotic posterior variance matrix is evaluated. It is the posterior mode in Chen (1985), the true value in Bickel and Doksum (2006), and Le Cam and Yang (2000), Ghosh (2003) and the MLE estimator in Kim (1994) depending on different assumptions.

Remark 3.8.

Under Assumptions 1-4, conditional on the observed data , it can be shown that

where is the posterior mean. These conclusions have been given by Li,Zeng and Yu (2014).

Remark 3.9.

Following Rilstone, Srivatsava and Ullah(1996), Bester and Hansen (2006), the assumptions 5-10 are used to justify the validity of high order Laplace expansion. The assumption 5 is analogous to the analytical assumptions for Laplace’s method (kass et al., 1990), but we impose the higher order constraints other than , see also Miyata (2004, 2010). With these assumptions, we can get the standard form higher order Laplace expansion of the order in kass et al. (1990) to , similar to the fully exponential form in Miyata (2004, 2010).

Let to be the maximum likelihood estimator of and is the subvector of corresponding to , under Assumptions 5-8 with and , the Wald statistic be

where is the submatrix of corresponding to .

Theorem 3.1.

Under Assumptions 6-11 with , when the likelihood dominates the prior such as , under the null hypothesis, we can show that

(5)
Remark 3.10.

In Theorem 3.1, we can see that under the null hypothesis, the asymptotic distribution of always follows the distribution, hence, is pivotal. As to the proposed test statistic, we still need to specify some threshold values, for implementing the test, that is,

Hence, this asymptotic distribution can be utilized conveniently to calibrate threshold values.

Remark 3.11.

From this theorem, may be regarded as the Bayesian version of the statistic. However, the test statistic is a frequentist test which is based on the maximum likelihood estimation of the model in the alternative hypothesis whereas our test is a Bayesian test which is based on the posterior quantities of the models under the alternative hypothesis.

Remark 3.12.

The implementation of the test requires the ML estimation of the model and evaluation of the second derivative of the observed likelihood function under the alternative hypothesis. As described in section 2, for latent variable models, the observed likelihood function generally doesn’t have analytical form so that it is generally hard to get the maximum likelihood estimator and its corresponding second derivative. Hence, it is difficult to apply the test statistic for hypothesis testing. However, our proposed Bayesian test statistic is only by-product of posterior outputs. As long as the Bayesian MCMC methods are applicable, our test can be implemented for latent variable models. In addition, from equation (3), can incorporate the prior information through the posterior distribution directly, but can not incorporate the useful prior information.

Remark 3.13.

We use a simple example to illustrate the influence of the prior distributions. Let with a known variance . The true value of is set at . The prior distribution of is set as . The simple point null hypothesis is . It can be shown that

where . When , and the asymptotic distribution for both and is . Let us consider the case that corresponds to an informative prior and compare it to the case that corresponds to a non-informative prior . Table 1 reports , , and when under these two priors. It can be seen that both the BF and the new test depend on the prior (although the BFs tend to choose the wrong model under the vague prior even when the sample size is very large) while the test is independent of the prior. When , correctly rejects the null hypothesis when the prior is informative but fails to reject it when the prior is vague under the significant level 5%. In this case, the test fails to reject the null hypothesis under both priors.111To implement the test,we use the following Fisher’s scale. Let be the critical level and . If is between 95% and 97.5%, the evidence for the alternative is “moderate”; between 97.5% and 99%, “substantial”; between 99% and 99.5%, “strong”; between 99.5% and 99.9%, “very strong”; larger than 99.9%, “overwhelming”. To implement the BF we use Jeffreys’ scale instead. If is less than 0, there is “negative” evidence for the alternative; between 0 and 1, “not worth more than a bare mention”; between 1 and 3, “positive”; between 3 and 5, “strong”; larger than 5, “very strong”.

Prior
10 100 1000 10000 10 100 1000 10000
9.96 11.12 20.60 93.58 -117.42 -118.50 -110.72 -38.00
10.96 12.22 22.30 96.98 1.01 2.23 12.32 87.03
0.01 1.23 11.32 86.03 0.01 1.23 11.32 86.03
Table 1: Comparison of , , and
Remark 3.14.

Under the null hypothesis, our statistic can be written as

where is the order of , and has the order .

Since is calculated by using the MCMC output, it is important to assess the numerical standard error for measuring the magnitude of simulation error.

Corollary 3.2.

Given the posterior draws , the numerical standard error (NSE) of the statistic is,

where , , , , , is the NSE of .

The Corollary 3.2 shows us how to compute the numerical standard error of the proposed statistic. For the NSE of , , following Newey and West (1987), a consistent estimator can be given by

where

and the value of is always equal to 10.

4 The Extension of the Test

In this section, we extend the point-null hypothesis aforementioned into the following problem,

where is a matrix, . This hypothesis problem is much more general than the previous one. On the other hand, it can help us to study the relationship among parameters. Further, for such problems, it is hard to use the Bayes factor. Hence, the extension here is meaningful.

For such problem, the frequentist Wald statistic is

where is the MLE estimator of .

According to the decision theory, we define the net loss function for such problem as

where is the posterior mean of , . Then the statistic is defined as

Theorem 4.1.

Under Assumptions ~~~~, when the likelihood information dominates the prior information, under the null hypothesis,

Similarly, for the statistic, the numerical standard error can be computed in the following corollary.

Corollary 4.2.

Given the posterior draws , the numerical standard error (NSE) of the statistic is,

where , , , , , is the NSE of .

For the NSE of , , we can still follow the way proposed by Newey and West (1987) to evaluated.

5 Simulation Studies

In this section, we do two simulation studies to check the empirical size and power of the proposed test statistic. The first example is a simple simulation examination based on linear regression model where the our proposed test statistic has analytical expression. We compare the size and power of the new statistics with the Wald statistic. In the second example, we use the stochastic volatility model with leverage effect, where Wald statistic can not be used, to study the size and power of our statistic.

5.1 The empirical power and size of and Wald statistic for linear regression model

In this subsection, we use the simple linear regression model to examine the empirical power and size of the proposed test statistic. The model we use is

with . Let , then we can rewrite the model in matrix form,

where , .

We are interested in the subvector of , , then . Here we want to test against and against . Assume that the prior distribution for and are normal and inverse gamma, respectively,

where , and ,

are hyperparameters.

The proposed statistic for the first problem is