Efficient functional ANOVA through wavelet-domain Markov groves

02/12/2016 ∙ by Li Ma, et al. ∙ 0

We introduce a wavelet-domain functional analysis of variance (fANOVA) method based on a Bayesian hierarchical model. The factor effects are modeled through a spike-and-slab mixture at each location-scale combination along with a normal-inverse-Gamma (NIG) conjugate setup for the coefficients and errors. A graphical model called the Markov grove (MG) is designed to jointly model the spike-and-slab statuses at all location-scale combinations, which incorporates the clustering of each factor effect in the wavelet-domain thereby allowing borrowing of strength across location and scale. The posterior of this NIG-MG model is analytically available through a pyramid algorithm of the same computational complexity as Mallat's pyramid algorithm for discrete wavelet transform, i.e., linear in both the number of observations and the number of locations. Posterior probabilities of factor contributions can also be computed through pyramid recursion, and exact samples from the posterior can be drawn without MCMC. We investigate the performance of our method through extensive simulation and show that it outperforms existing wavelet-domain fANOVA methods in a variety of common settings. We apply the method to analyzing the orthosis data.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

This work concerns a common inference task—identifying the contributions from various sources to the variation in functional data, or functional analysis of variance (fANOVA) [18, 20]. Suppose functional observations are measured at a number of given locations. (Throughout this work we use “locations” in a general sense to refer to the points in the index or coordinate space on which the function is observed. For example, in time-series applications, these “locations” are time points.) A simple approach to fANOVA is to carry out ANOVA—e.g., through an -test—at each location. Doing fANOVA in this location-by-location manner often results in poor performance due to the limited amount of data available at each location as well as the necessary multiple testing correction incurred.

An alternative, often more effective approach, is to first apply a basis transformation to the original observations and then carry out ANOVA under the new basis. Many common bases for functional data analysis can be adopted, including splines, polynomials, and Fourier basis, etc. [18]. With a properly chosen basis, this can allow more effective pooling of information across multiple locations, as well as more sparse representations of the underlying structures (here the functional variations), thereby enhancing the ability to identify such structures. The effectiveness of the different bases depends on the nature of the data; no basis is universally the best for all problems. Here our attention is focused on the wavelet basis transform, and in particular the discrete wavelet transform (DWT) [13].

The DWT is a basis transform for functional data observed on an equi-spaced grid of locations. It has been extensively applied in applications such as signal processing and time-series analysis [17]

. The transformed data, in the form of the so-called wavelet coefficients, characterize functional features of different scales (also referred to as frequencies) at different locations. Each wavelet coefficient is associated with one location-scale (also referred to as time-frequency) combination. This wavelet-domain representation of functional data enjoys several desirable properties. First, the wavelet transform has a “whitening effect” that reduces the correlation in the noise, making the common assumption of independent errors more reasonable than in the original space. A second benefit is that wavelet transforms concentrate “energy”—in an information theoretic sense as measured by entropy or Kullback-Leibler divergence—into a small number of location-scale combinations, and thus “sparsify” the underlying signal thereby making detecting such structures much easier. These properties have motivated the development of numerous wavelet-based regression methods in function estimation. Such methods are particularly effective in comparison to other functional methods when the underlying functions contain local structures 

[15].

More recently, several authors have proposed methods for fANOVA under the DWT [19, 2, 1, 4, 14]. In [19, 14], the authors treat each location-scale combination individually and carry out an ANOVA test for each, and then identify significant -values after correcting for multiple testing. In a similar vein but instead of taking the coefficient-by-coefficient testing approach, [1] and [2]

consider testing of the joint null hypothesis of nil factor effects over all location-scale combinations together, and constructed tests for this purpose that are minimax optimal. On the other hand, while not directly addressing fANOVA,

[15]

proposes a Bayesian functional mixed-effects model in the wavelet domain that can be applied to this problem. In particular, a regression model is adopted for the wavelet coefficient at each location-scale combination. ANOVA hypothesis testing can then be naturally handled as a model selection problem using the spike-and-slab prior on the inclusion of the fixed/random effects into the regression model. Inference under the model incurs a heavy computational cost requiring Markov Chain Monte Carlo (MCMC) on a large number of regression models, one for each location-scale combination. Instead,

[4] introduces a frequentist approach for the wavelet-domain mixed effects models addressing testing in both random and fixed effects.

A key motivation for the current work is an important common phenomenon in wavelet-domain analysis that has not been amply exploited in the existing wavelet-based fANOVA methods—the location-scale clustering of functional features in the wavelet coefficients. That is, interesting functional features tend to appear in clusters (like a string of grapes) in the wavelet domain. Such correlation structure has been noted as early as [11], and has been fruitfully exploited by [10] in function estimation. It is easy to imagine, and will be confirmed herein, that the location-scale dependency is prevalent in fANOVA problems as well: when a factor impacts the variance of a wavelet coefficient at one location-scale combination, it typically contributes to the variance at nearby locations and in adjacent scales as well. Thus inference should benefit, in fact substantially, from “borrowing strength” among nearby and/or nested location-scale combinations in identifying the variance components.

The Bayesian wavelet regression framework for fANOVA affords a natural, principled way to incorporating such dependency through designing model space priors—that is, priors on the spike-and-slab indicators that encode whether a factor contributes to the variance at different location-scale combinations. In particular, we present a model space prior in the form of a graphical model consisting of a collection of Markov trees [10]

, one for each factor, and hence called the Markov grove (MG). The MG prior is highly parsimonious—specified by a small number of hyperparameters, and yet flexible enough to characterize the key dependency pattern in factor effects across adjacent/nested location-scale combinations.

Our new Bayesian hierarchical fANOVA model enjoys several important properties. First, due to the tree structure of the MG prior, when coupled with a normal-inverse-Gamma (NIG) conjugate prior specification on the regression coefficients and error variance, exact Bayesian inference for fANOVA can be achieved efficiently. In particular, we show that the joint posterior of our model has a closed form representation computable using a pyramid algorithm that operationally imitates Mallat’s pyramid algorithm for the DWT

[13] and achieves the same computational complexity (or simplicity rather), being linear in both the number of functional observations and the number of locations. The closed form posterior allows direct sampling from the posterior using standard Monte Carlo as opposed to MCMC. Furthermore, when testing fANOVA hypotheses, the posterior marginal probability for the alternative hypotheses (i.e., the presence of factor effects) can also be computed analytically using pyramid recursion without Monte Carlo. This makes our model particularly favorable in large-scale problems such as genomics where fANOVA needs to be completed many times.

The rest of the paper is organized as follows. In 2 we present our methodology. First, we provide a brief background on Bayesian wavelet regression in Section 2.1. There we review the NIG conjugate prior and show how to use it in conjunction with the MT model to achieve adaptive shrinkage in the wavelet coefficients in a way that takes into account location-scale dependency. In Section 2.2 we introduce the MG model as a generalization to the MT model, and show how to use it with the NIG specification to form a hierarchical model for wavelet-based fANOVA. We construct a full inference framework for fANOVA under this model consisting of (i) a closed form of the joint posterior computable through a pyramid algorithm, (ii) a recipe for evaluating the posterior marginal alternative probability of each factor effect at each location-scale combination based on another pyramid algorithm, and (iii) a decision rule for calling significant factor effects that properly adjusts for multiple testing. In 3 we carry out simulations to evaluate the performance of our method and compare it to a number of wavelet-domain fANOVA methods. We also apply our method to the analysis of the orthosis data. We conclude in 4 with brief remarks.

2 Method

2.1 Wavelet regression with normal-inverse-Gamma Markov tree

We start from considering Bayesian modeling of a single functional observation in the wavelet domain. We shall use this simpler problem as a medium to introduce a number of building blocks of our more general wavelet-based fANOVA method—namely, (i) Bayesian adaptive wavelet shrinkage with the spike-and-slab prior, (ii) the normal-inverse-Gamma (NIG) conjugate specification, and (iii) the Markov tree (MT) model. We will show how one can use these three tools in conjunction to carry out adaptive shrinkage in the wavelet domain. Our approach arises from a recombination of the ideas from [7], [9], [10], [5], and [15].

Suppose we have a single functional observation whose values are attained at equidistant locations , and

(1)

where . In words, the errors are assumed to be independent across the locations but can be heterogeneous. We wish to recover the unknown function (or some features of it) from the noisy observation . For simplicity, we assume that for some integer . After applying the DWT to we obtain:

where , and with being the orthonormal matrix corresponding to the corresponding wavelet basis. Due to properties of multivariate Gaussians, is also a multivariate Gaussian with a diagonal covariance matrix.

The elements of are referred to as the (empirical) wavelet coefficients, and those in are the wavelet coefficients of its mean function . In particular, one of the elements in each of , , and is called the father (or scaling) coefficient, while the other are the (mother) coefficients. Without loss of generality, in this work we assume that the scaling coefficients are computed at the coarsest level. The elements of and

can be organized into a bifurcating tree structure, with each element in the corresponding vector associated to a node in the tree. We use the pair of indices

, where and , to represent the th node in the th level of this tree. The two children nodes of node are indexed by and . Correspondingly, for , the parent of is indexed by . From now on we shall use node and location-scale combination interchangeably, and use to denote the collection of indices corresponding to all nodes in the bifurcating tree. We shall use , , and to denote the corresponding mother wavelet coefficients.

The model can be written in a node-specific manner (for notational simplicity, we express the model in terms of the mother coefficients, but the same holds for the father coefficients):

Bayesian inference on this regression model proceeds by placing priors on as well as on the hyperparameter .

It is well-known that effective inference in the wavelet regression should exploit the underlying sparsity of the wavelet coefficients—that is, many of the coefficients are (or very close to) zero—due to energy concentration. Hence, data-adaptive shrinkage toward zero is critical for effectively inferring . A very popular Bayesian strategy to achieving this, which has been adopted by several authors, is to place a two-group, or so-called spike-and-slab, mixture prior on [7, 9, 8, 5, 15]:

In words, with prior probability

, the wavelet coefficient

is non-zero and its value is generated from a Gaussian distribution. (More generally, the spike does not have to be exactly at 0 but can be a Gaussian with a much smaller variance, to which our method will also apply.) The hyperparameter

is a level-specific dispersion parameter that characterizes the overall level of variability in the wavelet coefficients at level . Specifically, we consider the following parametric structure as proposed in [3]:

for some . This implies that the wavelet coefficients tend to be smaller as the level increases. The parameter controls the smoothness of the functional observations. Larger corresponds to smoother functions. In particular, [3] discusses guiding principles for selecting to generate functions of various regularities. A moderate choice that works well for a variety of common functions, as recommended in [3], is . Alternatively, and can both be chosen through an empirical Bayes approach by maximizing the marginal likelihood (see Section 2.3 for details), which as we will illustrate in the numerical examples can often lead to better performance through incorporating additional adaptivity.

The spike-and-slab prior can be written hierarchically with the introduction of a hidden state [8], with indicating that the empirical wavelet coefficient contains a signal , and when otherwise. Formally,

(2)

The error variance is typically unknown, and it can be inferred from the data. In many applications of wavelet regression, the error variance is assumed to be homogeneous, i.e., for all and . It has been noted that homogeneous error variance is often unrealistic [15]. Thus we allow

to be heterogeneous and adopt a hyperprior on them:

(3)

The inverse-Gamma prior maintains conjugacy to the Gaussian model, and consequently the marginal likelihood can be evaluated analytically. This hierarchical specification includes the homogeneous variance as a special case because as , .

Donoho and Johnstone [11] noted a prevalent phenomenon in many applications of inference in wavelet spaces: the “signals”—the wavelet coefficients that are large in magnitude—often show up in clusters in the location-scale tree. This phenomenon, for instance, can be clearly seen in the four test functions presented in [11] (see 1). In particular, when the coefficient deviates far away from zero, the coefficients of the two children in the bifurcating location-scale tree, namely and tend to be away from zero as well. Such location-scale clustering is particularly strong for functions with sharp boundaries and abrupt changes such as blocks, bumps, and doppler. Crouse et al [10] pointed out that the clustering pattern in the wavelet coefficients can be directly exploited to improve adaptive shrinkage, and proposed a graphical modeling strategy to induce such spatial-scale dependency by jointly modeling the latent states

using a Markov process, resulting in a hidden Markov model evolving on the location-scale tree, called the

Markov tree (MT).

Figure 1: The four test functions from [11] and the associated mother wavelet coefficients. The coefficients for the test functions blocks, bumps and doppler display strong patterns of location-scale clustering.

Under the MT for , the shrinkage state of node depends on that of its parent through a Markov transition:

(4)

where for are called the state transition probabilities, and they can be organized into a transition matrix, for each node .

A simple and flexible two-hyperparameter specification of these transition matrices is:

(5)

where and . The parameter induces the spatial-scale dependency of the wavelet signal. Larger values correspond to stronger correlation or clustering in the large wavelet coefficients. On the other hand, the parameter controls how likely it is to have a “signal”, i.e., non-zero wavelet coefficient in each level. The exponential decaying factor counters exactly the exponential increase in the expected number of wavelet coefficients in higher resolution, and keeps the prior expected number of de novo signals (in the sense that a node contains a signal but its parent does not) in each resolution fixed at .

There are two strategies to choosing the hyperparameters . One is to elicit them based on some criteria for multiplicity adjustment and the other is to choose them using an empirical Bayes approach by maximizing the marginal likelihood. We shall discuss both strategies in Section 2.3.

Because the root node does not have a parent, the initial state of the process, , is specified by a set of initial state probabilities such that:

(6)

Combining the MT model [10] on and the NIG hierarchical setup [9, 8], Eqs. (2), (3), (4), and (6) together give a new hierarchical model for the wavelet coefficients, which we shall refer to as the normal inverse-Gamma Markov tree, or NIG-MT.

We next show how to do inference under the NIG-MT model. In particular, we show that the tree nature of the MT combined with the normal inverse-Gamma setup results in full conjugacy of the NIG-MT: the joint posterior on is still an NIG-MT whose parameters can be computed analytically, and can be sampled from directly.

To this end, let us consider a more general case with i.i.d. functional observations from model (1). From now on, we shall use the superscript “” to indicate the terms corresponding to the th observation. The node-specific model after DWT becomes:

Our interest lies in finding the posterior distribution on given the observed data. Let be the marginal likelihood for the node-specific model on given that :

From the normal-inverse-Gamma conjugacy, the marginal likelihood is in closed form:

where .

The following theorem shows that the NIG-MT model is completely conjugate in the sense that the joint posterior is still an NIG-MT. Moreover, the posterior hyperparamters are available analytically through a recursive algorithm operationally similar to the pyramid algorithm for DWT [13]. From now on, we shall use to represent the totality of data.

Theorem 1.

The joint posterior on is still an NIG-MT as follows:

  • The posterior of the hidden states is still a MT:

    1. State transition probabilities:

      for and ;

    2. Initial state probabilities:

  • The posterior of the variances given is:

  • The posterior of given and is:

The mappings and are defined recursively in and can be computed through a bottom-up pyramid algorithm as follows:

Remark I: The recursive computation of the mappings and is operationally analogous to Mallat’s pyramid algorithm [13] for carrying out the DWT. In the order , it computes the mapping at a node based on the mapping values on its children in the next resolution. The algorithm achieves the same computational complexity for evaluating the posterior exactly as Mallat’s algorithm, that is, linear both in and in .

Remark II: The term is the overall marginal likelihood (integrating out all the latent variables) given the hyperparameters, which we can use to set the hyperparameters through a common empirical Bayes strategy—maximum marginal likelihood estimation (MMLE).

Given the analytical form of the joint posterior, the posterior mean of can also be computed exactly. To this end, we first use a top-down pyramid algorithm to compute the posterior marginal probability of the hidden states as follows. In the order , the posterior marginal probability of for each is available as

Then the posterior mean of is given by

(7)

which has an intuitive explanation in terms of shrinkage. The average of the observed wavelet coefficients is shrunk toward the prior mean 0 with the amount of shrinkage being averaged over the different shrinkage states. By applying an inverse DWT to we can get the posterior mean of , .

In addition to the posterior mean, one can construct credible intervals for

and for by sampling from the joint posterior of according to Theorem 1. Because the exact posterior is available, no MCMC is needed and the sampling is standard Monte Carlo. Given a posterior samples Bayesian inference can proceed as usual. For example, a credible band for is available from a posterior sample , attained through applying an inverse DWT to the posterior sample on . We illustrate the work of this model in function denoising through simulations in Section 3.1.

2.2 Wavelet fANOVA with normal-inverse-Gamma Markov grove

Next we present our main methodology for fANOVA. We first introduce our framework in one-way fANOVA as the notation is much simpler with all the essential components of the framework present, and then generalize the formulation to the general multi-way case.

One-way fANOVA. Suppose we have groups of independent functional observations whose values are attained at equidistant points . Suppose

(8)

where is the group index, is the index for the replicates in the th group, and . The (one-way) fANOVA problem concerns identifying the variation among , if any, from that in the noise.

Treating the first group as the baseline , and letting be the contrast of each group to the baseline, we can write

(9)

By design , and the ANOVA problem boils down to inference on the contrast functions for . After applying the DWT we obtain

where , , and for , and . We can again write the model in a node-specific manner:

As before, we introduce a latent indicator for each and adopt the same NIG-MT setup for as given in Eqs. (2), (3), (4), and (6). Under this formulation, ANOVA can be accomplished through inference on the contrast coefficients . To this end, we introduce another latent indicator such that

(10)

where similar to , is a scaling parameter that characterizes the size of the differences across the factor levels. Our motivation to defining a different scaling parameter than for the ’s is that the ’s characterize the difference across the factor levels while the that for the baseline function mean. It is often the case that the scale of the ’s are substantially different than that of the ’s.

Just as are modeled in a correlated manner to capture the spatial-scale dependency in , we do that for the as well. Intuitively, if a factor contributes to the variation at one location-scale combination, then it typically contributes to the variation at the children/neighbor nodes as well. Again, an MT is a convenient choice for jointly modeling the latent indicators . We let denote the corresponding state transition matrix (or the initial probability vector when ), which can be specified in the same way as given in (5) for . That is for ,

(11)

Now we arrive at a fully specified joint model on given by Eqs. (2), (3), (4), (6), (10), and (11). It is specified by the NIG conjugate priors on given the latent indicators, and two MTs on the latent indicators. For this reason, we shall refer to this model as a normal-inverse-Gamma Markov grove (NIG-MG).

Next we show how Bayesian inference can be carried out for the NIG-MG. It turns out that the joint posterior can again be computed analytically through a pyramid algorithm whose complexity is linear in both and . Accordingly, posterior marginal and joint null/alternative probabilities can also be evaluated exactly, and one can sample from the exact posterior using standard Monte Carlo.

We write the node-specific model in matrix notation:

where is the vector of the wavelet coefficients for all the observations at node , is the vector of the wavelet coefficients for the mean functions, is the vector of the residual errors, and is the design matrix. The design matrix can be written as

where , is a vector of ones, and is a binary vector where the th element is equal to one if the th observation belongs to group , and equal to zero otherwise. We also define the following matrices for :

and

The marginal likelihood for the node-specific model on given and is

(12)

where

Theorem 2.

The joint posterior on is as follows.

  • The marginal posterior of the hidden states is an MT defined on the product state-space with

    1. State transition probabilities:

      for .

    2. Initial state probabilities:

  • The conditional posterior of given and is:

  • The posterior of given , and is given as follows

where represents the Hadamard product, and for any matrix . The mappings and can be computed recursively through a pyramid algorithm as follows:

Once the joint posterior is computed following the theorem, Bayesian inference can proceed in the usual manner. In particular, for testing the presence of a variance contribution from the factor, is an indicator for whether the null hypothesis at location-scale :

is false. Thus represents the posterior probability for the factor to contribute to the variation at location-scale . For this reason, we shall refer to as the posterior marginal alternative probability (PMAP) for location-scale . Next we show how to compute PMAPs through a top-down pyramid algorithm.

Corollary 1.

For , the posterior marginal distribution of can be computed recursively as

using the initial and transition probabilities given in Theorem 2. Then the PMAPs are given by .

In the next corollary, we show how to compute the posterior probability for the presence of factor effects at any (i.e., at least one) location-scale combination. This probability, which we refer to as the posterior joint alternative probability (PJAP) can be used for testing the “global” null hypothesis that the factor does not contribute to the variation at all.

Corollary 2.

For all and , let

where denotes the subtree in with as the root, i.e., includes and all of its descendants in . Then we can compute by the following pyramid algorithm

The posterior joint null probability (PJNP) is given by

Accordingly, the posterior joint alternative probability (PJAP) is .

Remark: Corollary 1 and 2 can also be applied to the prior model to get the prior marginal alternative probabilities, which can be used to elicit the prior specification on the hyperparameters. See Section 2.3 for more details.

Theorem 2 allows us to draw posterior samples of using standard Monte Carlo (not MCMC). Based on this posterior sample, we can also complete other inference tasks such as computing the posterior mean of and constructing credible bands for which quantifies the posterior uncertainty of the factor contribution to the functional variation. This can be achieved by applying inverse DWT to the posterior draws of ’s. We will illustrate this in the numerical examples.

Multi-way fANOVA. The NIG-MG model for one-way fANOVA can be naturally extended to the case with multiple factors by specifying one MT for each factor to capture the location-scale clustering of each factor effect. The complication is mainly in the notation.

Suppose now we have factors, and the th factor has levels. Now suppose for each factor combination , we have independent functional observations for , and suppose each observation arise from the following model

(13)

where is the group index, is the index for the replicates in the th group, and .

Now let and for . Then

After applying the DWT we obtain: