Bayesian nonparametric regression using complex wavelets

03/07/2018 ∙ by Norbert Reményi, et al. ∙ Georgia Institute of Technology

In this paper we propose a new adaptive wavelet denoising methodology using complex wavelets. The method is based on a fully Bayesian hierarchical model in the complex wavelet domain that uses a bivariate mixture prior on the wavelet coefficients. The heart of the procedure is computational, where the posterior mean is computed through Markov chain Monte Carlo (MCMC) simulations. We show that the method has good performance, as demonstrated by simulations on well-known test functions and by comparison to a well-established complex wavelet-based denoising procedure. An application to a real-life data set is also considered.


1 Introduction

In the present paper we consider a novel Bayesian model as a solution to the classical nonparametric regression problem

(1)   $y_i = f(x_i) + \varepsilon_i, \quad i = 1, \ldots, n,$

where $x_i$, $i = 1, \ldots, n$, are equispaced sampling points, and the errors $\varepsilon_i$ are i.i.d. normal random variables with zero mean and variance $\sigma^2$. The interest is to estimate the function $f$ using the observations $y_i$. After applying a linear and orthogonal wavelet transform, the equation in (1) becomes

(2)   $d_{jk} = \theta_{jk} + \varepsilon_{jk},$

where $d_{jk}$, $\theta_{jk}$, and $\varepsilon_{jk}$ are the wavelet coefficients (at resolution $j$ and position $k$) corresponding to $y$, $f$, and $\varepsilon$, respectively. Note that $\varepsilon_{jk}$ and $\varepsilon_i$ are equal in distribution due to the orthogonality of wavelet transforms. Due to the whitening property of wavelet transforms (Flandrin, 1992), many existing methods assume independence of the coefficients and omit the double indices to model a generic wavelet coefficient as

(3)   $d = \theta + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2).$

When indices are needed for the clarity of exposition they will be used.

To estimate $\theta$ in model (3), Bayesian shrinkage rules have been proposed in the literature by many authors. By a shrinkage rule, the observed wavelet coefficients $d$ are replaced with their shrunk version $\hat{\theta} = \delta(d)$. Then $f$ is estimated as the inverse wavelet transform of the shrunk coefficients. Empirical distributions of detail wavelet coefficients for signals encountered in practical applications are (at each resolution level) centered around and peaked at zero (Mallat, 1989). A range of models, for which the unconditional distribution of the wavelet coefficients mimics these properties, has been considered in the literature. The traditional Bayesian models consider a prior distribution on the wavelet coefficient of the form

(4)   $\pi(\theta) = p_j \, \delta_0 + (1 - p_j) \, \xi(\theta),$

where $\delta_0$ is a point mass at zero, $\xi$ is a symmetric and unimodal distribution, and $p_j$ is a fixed parameter in $[0,1]$, usually level dependent, that controls the extent of shrinkage toward zero. This type of model was considered by Abramovich et al. (1998), Vidakovic (1998), Vidakovic and Ruggeri (2001) and Johnstone and Silverman (2005), among others. A recent overview of Bayesian strategies in wavelet shrinkage can be found in Reményi and Vidakovic (2012).
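To make the effect of such a mixture prior concrete, the following minimal Python sketch (ours, not the authors' code; it uses a normal spread distribution $\xi$ purely for illustration, since that case has a closed-form posterior mean) computes the shrinkage rule under model (3)-(4):

    import numpy as np
    from scipy.stats import norm

    def posterior_mean_shrinkage(d, sigma=1.0, tau=3.0, p=0.7):
        """Posterior mean of theta under d|theta ~ N(theta, sigma^2),
        theta ~ p*delta_0 + (1-p)*N(0, tau^2)  (normal spread, for illustration)."""
        m0 = norm.pdf(d, scale=sigma)                       # marginal under the point mass
        m1 = norm.pdf(d, scale=np.sqrt(sigma**2 + tau**2))  # marginal under the spread part
        post_null = p * m0 / (p * m0 + (1 - p) * m1)        # posterior prob. that theta = 0
        return (1 - post_null) * tau**2 / (tau**2 + sigma**2) * d

    for d in [0.5, 2.0, 5.0]:
        print(d, posterior_mean_shrinkage(d))

Small coefficients are shrunk almost entirely to zero while large ones are nearly preserved, which is the qualitative behavior all priors of type (4) aim for.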

Wavelet shrinkage methods using complex-valued wavelets provide additional insight into the shrinkage process due to the information contained in the phase. After taking a complex wavelet transform of a real-valued signal, the model remains as in (2); however, the observed wavelet coefficients $d_{jk}$ become complex numbers at resolution $j$ and location $k$. Several papers considering Bayesian wavelet shrinkage with complex wavelets are available. Lina and Mayrand (1995) describe the complex-valued Daubechies wavelets in detail. Lina and Macgibbon (1997), Lina (1997), and Lina et al. (1999) focus on image denoising, in which the phase of the observed wavelet coefficients is preserved, but the modulus of the coefficients is shrunk by the Bayes rule. Barber and Nason (2004) develop the complex empirical Bayes (CEB) procedure, which modifies both the phase and modulus of wavelet coefficients by a bivariate shrinkage rule. The wavelet coefficients are represented as bivariate real-valued random variables, and the authors formulate a bivariate model in the complex wavelet domain. The resulting estimators are in closed form, and the hyperparameters of the model are estimated by an empirical Bayes procedure.

We build on the results above and formulate a fully Bayesian hierarchical model which accounts for the uncertainty of the prior parameters by placing hyperpriors on them. Since a closed-form solution for the Bayes estimator does not exist, Markov chain Monte Carlo (MCMC) methodology is applied and an approximate estimator (the posterior mean) is computed from the output of the simulation runs. Although the simplicity of a closed-form solution is lost, the procedure is fully Bayesian, adaptive to the underlying signal, and the estimation of the hyperparameters is automatic via the MCMC sampling algorithm. The estimation is governed by the data and the hyperprior distributions on the parameters, and this adaptivity ensures good denoising performance.

The paper is organized as follows. Section 2 formalizes the model and presents some results related to it. Section 3 details the MCMC sampling scheme. Section 4 discusses the selection of hyperparameters and presents simulation results and comparisons to existing methods. In Section 5 we apply the proposed shrinkage to a real data set, and in Section 6 conclusions are provided.

2 Fully Bayesian Model

In this section we describe a novel, fully Bayesian hierarchical model in the complex wavelet domain. After applying the complex wavelet transform to a real-valued signal, the observed wavelet coefficients $d_{jk}$ at resolution $j$ and location $k$ become complex numbers. Building on the approach taken by Barber and Nason (2004), we represent the complex-valued wavelet coefficients as bivariate real-valued random variables.

We consider the following Bayesian model:

(5)   $d_{jk} \mid \theta_{jk}, \sigma^2 \sim N_2(\theta_{jk}, \sigma^2 V_j),$
      $\theta_{jk} \mid p_j, \Sigma_j \sim p_j \, \delta_0 + (1 - p_j) \, \mathrm{EP}_2(\mu, \Sigma_j, \beta),$

where $\mathrm{EP}_2(\mu, \Sigma, \beta)$ stands for the bivariate exponential power distribution. The multivariate exponential power distribution is an extension of the class of normal distributions in which the heaviness of the tails can be controlled. Its definition and properties can be found in Gomez et al. (1998). The prior on the location $\theta_{jk}$ is a bivariate extension of the standard mixture prior in the Bayesian wavelet shrinkage literature, consisting of a point mass $\delta_0$ at zero and a heavy-tailed distribution. As a prior, Barber and Nason (2004) considered a mixture of a point mass and a bivariate normal distribution. The heavy-tailed mixture prior in our proposal better captures the sparsity of wavelet coefficients common in most applications; however, a closed-form solution is lost, and we need to rely on MCMC simulations to compute estimators.

To calibrate the exponential power prior in (5) for wavelet shrinkage, we use $\mu = 0$, because the detail wavelet coefficients are centered around zero by their definition. We also fix $\beta = 1/2$, which gives the prior on $\theta_{jk}$ the following form:

(6)   $\pi(\theta_{jk} \mid \Sigma_j) = \frac{1}{8\pi} \, |\Sigma_j|^{-1/2} \exp\left\{ -\tfrac{1}{2} \left( \theta_{jk}^T \Sigma_j^{-1} \theta_{jk} \right)^{1/2} \right\}.$

The prior specified above is equivalent to the bivariate double exponential distribution. The univariate double exponential prior was found to have desirable properties in the real-valued wavelet denoising context (Vidakovic and Ruggeri, 2001; Johnstone and Silverman, 2005); hence it is natural to extend it to the bivariate case.

From model (5) it is apparent that the mixture prior on $\theta_{jk}$ is set depending on the dyadic level $j$, which ensures scale-adaptivity of the method. The quantity $\sigma^2 V_j$ represents the scaled covariance matrix of the noise at each decomposition level, and $\Sigma_j$ represents the levelwise scale matrix in the exponential power prior. An explicit expression for the covariance $V_j$ induced by white noise in complex wavelet shrinkage was derived in Barber and Nason (2004). We adopt the approach described in their paper to model the covariance structure of the noise and compute $V_j$ for each dyadic level from the expression

(7)   $V_j = \frac{1}{2} \begin{pmatrix} 1 + \mathrm{Re}(c_j) & \mathrm{Im}(c_j) \\ \mathrm{Im}(c_j) & 1 - \mathrm{Re}(c_j) \end{pmatrix},$

where $c_j$ denotes the diagonal element of $\Omega = W W^T$ corresponding to coefficients at level $j$ (constant within a level). We assume a common i.i.d. normal noise model $\varepsilon_i \sim N(0, \sigma^2)$, as in (1). After taking the complex wavelet transform, the real and imaginary parts of the transformed noise become correlated, which is expressed by the off-diagonal elements of $V_j$ in (7). Note that $W$ is a complex-valued unitary matrix representing the wavelet transform, therefore $W \bar{W}^T = I$, where $\bar{W}$ denotes the complex conjugate of $W$; the non-conjugated product $\Omega = W W^T$, in contrast, need not equal the identity, and its diagonal drives the level-dependent covariance in (7).
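As a small illustration, the following sketch (ours; it assumes the complex wavelet transform is available as an explicit unitary matrix `W`, and that `level_rows` maps each dyadic level to the row indices of its coefficients, both hypothetical inputs) computes $V_j$ of (7) from the diagonal of $\Omega = W W^T$:

    import numpy as np

    def levelwise_noise_cov(W, level_rows):
        """V_j of (7) from the non-conjugated product Omega = W @ W.T.

        W          : complex (n, n) unitary wavelet-transform matrix (assumed given)
        level_rows : dict mapping level j to the row indices of its coefficients
        """
        omega_diag = np.einsum('ij,ij->i', W, W)   # diagonal of W @ W.T, computed cheaply
        V = {}
        for j, rows in level_rows.items():
            c = omega_diag[rows].mean()            # constant within a level, in theory
            V[j] = 0.5 * np.array([[1 + c.real, c.imag],
                                   [c.imag, 1 - c.real]])
        return V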

Instead of estimating the hyperparameters $\sigma^2$, $\Sigma_j$, and $p_j$ in (5) in an empirical Bayes fashion, we elicit hyperprior distributions on them in a fully Bayesian manner. We specify a conjugate inverse gamma $\mathrm{IG}(a, b)$ prior on the noise variance $\sigma^2$, and an inverse Wishart $\mathrm{IW}(m, \Lambda_j)$ prior on the matrix $\Sigma_j$ describing the covariance structure of the spread prior of $\theta_{jk}$. The mixing weight $p_j$ regulates the strength of shrinkage of a wavelet coefficient to zero. We specify a “noninformative” uniform prior on this parameter, allowing the estimation to be governed mostly by the data.

For computational purposes, we represent the exponential power prior in (6) as a scale mixture of multivariate normal distributions, which is an essential step for efficient Monte Carlo simulation. From Gomez et al. (2008), the bivariate exponential power distribution with $\beta = 1/2$ and $\mu = 0$ can be represented as

$\pi(\theta \mid \Sigma) = \int_0^\infty N_2(\theta; 0, \lambda \Sigma) \, \mathrm{Gamma}(\lambda; 3/2, 1/8) \, d\lambda,$

which is a scale mixture of bivariate normal distributions with a gamma mixing distribution (shape $3/2$ and rate $1/8$). Using the specified hyperpriors and the mixture representation, the basic model in (5) extends to

(8)   $d_{jk} \mid \theta_{jk}, \sigma^2 \sim N_2(\theta_{jk}, \sigma^2 V_j),$
      $\theta_{jk} \mid z_{jk}, \lambda_{jk}, \Sigma_j \sim z_{jk} \, N_2(0, \lambda_{jk} \Sigma_j) + (1 - z_{jk}) \, \delta_0,$
      $z_{jk} \mid p_j \sim \mathrm{Bernoulli}(1 - p_j),$
      $\lambda_{jk} \sim \mathrm{Gamma}(3/2, 1/8),$
      $p_j \sim U(0, 1),$
      $\sigma^2 \sim \mathrm{IG}(a, b),$
      $\Sigma_j \sim \mathrm{IW}(m, \Lambda_j).$
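Assuming the shape-$3/2$, rate-$1/8$ gamma mixing stated above (our reading of the representation), a quick Monte Carlo sanity check in Python: since $E[\lambda] = 12$, the covariance of the mixture should approach $12\,\Sigma$, which matches the covariance of the bivariate double exponential distribution.

    import numpy as np

    rng = np.random.default_rng(0)
    Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])
    L = np.linalg.cholesky(Sigma)

    # Draw from EP_2(0, Sigma, 1/2) via the scale mixture:
    # lambda ~ Gamma(shape 3/2, rate 1/8), theta | lambda ~ N_2(0, lambda * Sigma)
    lam = rng.gamma(shape=1.5, scale=8.0, size=100_000)   # scale = 1/rate
    z = rng.standard_normal((100_000, 2))
    theta = np.sqrt(lam)[:, None] * (z @ L.T)

    print(np.cov(theta.T))   # should approach E[lambda] * Sigma = 12 * Sigma
    print(12 * Sigma)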

For computational purposes, we introduce a latent variable $z_{jk}$. Variable $z_{jk}$ is a Bernoulli variable indicating whether the parameter $\theta_{jk}$ comes from a point mass at zero ($z_{jk} = 0$) or from a bivariate normal distribution ($z_{jk} = 1$), with prior probability $p_j$ or $1 - p_j$, respectively. The uniform prior on $p_j$ is equivalent to a $\mathrm{Beta}(1, 1)$ distribution, which is a conjugate prior for the Bernoulli distribution. Integrating out $z_{jk}$ and $\lambda_{jk}$ from model (8) gives back the original mixture expression in model (5).

By representing the exponential power prior as a scale mixture of normals, the hierarchical model in (8) becomes tractable, because the full conditional distributions of all the parameters become explicit. Therefore, we can develop a fast Gibbs sampling algorithm to update all the necessary parameters $\theta_{jk}$, $z_{jk}$, $\lambda_{jk}$, $p_j$, $\sigma^2$, and $\Sigma_j$. Note that in order to run the Gibbs sampling algorithm we only have to specify the hyperparameters $a$, $b$, $\Lambda_j$, and $m$. The details of the Gibbs sampling scheme will be discussed in the next section.

3 Gibbs sampling scheme

To conduct posterior inference on the wavelet coefficients $\theta_{jk}$, a standard Gibbs sampling procedure is adopted. In this section we provide details of how to develop a Gibbs sampler for the model in (8). Gibbs sampling is an iterative algorithm that simulates from a joint posterior distribution through iterative simulation over the list of full conditional distributions. For more details on Gibbs sampling, see Casella and George (1992) or Robert and Casella (1999). For the model in (8), the full conditionals for the parameters $\sigma^2$, $z_{jk}$, $p_j$, $\theta_{jk}$, $\lambda_{jk}$, and $\Sigma_j$ can be specified exactly. Specification of the hyperparameters $a$, $b$, $\Lambda_j$, and $m$ will be discussed in Section 4. Technical details of this section are deferred to the Appendix.

3.1 Updating $\sigma^2$

Using a conjugate $\mathrm{IG}(a, b)$ prior on $\sigma^2$ results in a full conditional which is again inverse gamma. Therefore, update $\sigma^2$ as

(9)   $(\sigma^2)^{(t+1)} \sim \mathrm{IG}\left( a + n, \; b + \tfrac{1}{2} \sum_{j,k} (d_{jk} - \theta_{jk}^{(t)})^T V_j^{-1} (d_{jk} - \theta_{jk}^{(t)}) \right),$

where $n$ denotes the sample size, here the total number of bivariate wavelet coefficients, and $(t)$ denotes the simulation run.
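A sketch of this update in Python (a hypothetical helper of ours; `d` and `theta` are dicts of bivariate coefficient arrays grouped by level, `Vinv` the per-level inverses of $V_j$):

    import numpy as np

    def update_sigma2(d, theta, Vinv, a, b, rng):
        """Draw sigma^2 from its inverse gamma full conditional (9).

        d, theta : dicts mapping level j -> (n_j, 2) arrays
        Vinv     : dict mapping level j -> inverse of the 2x2 matrix V_j
        """
        shape, rate = a, b
        for j in d:
            r = d[j] - theta[j]                       # residuals at level j
            shape += r.shape[0]                       # one sigma^-2 per bivariate coeff.
            rate += 0.5 * np.einsum('ni,ij,nj->', r, Vinv[j], r)
        return 1.0 / rng.gamma(shape, 1.0 / rate)     # IG draw via reciprocal gamma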

3.2 Updating $z_{jk}$ and $p_j$

In model (8) the latent variable $z_{jk}$ has a Bernoulli prior with parameter $1 - p_j$. Its full conditional remains Bernoulli and is updated as follows:

(10)   $z_{jk}^{(t+1)} \sim \mathrm{Bernoulli}(q_{jk}),$

where

$q_{jk} = \dfrac{(1 - p_j) \, m_1(d_{jk})}{p_j \, m_0(d_{jk}) + (1 - p_j) \, m_1(d_{jk})},$

with $m_0(d_{jk}) = N_2(d_{jk}; 0, \sigma^2 V_j)$ and $m_1(d_{jk}) = N_2(d_{jk}; 0, \sigma^2 V_j + \lambda_{jk} \Sigma_j)$.

Parameter $p_j$ is given a conjugate $\mathrm{Beta}(\alpha, \beta)$ prior. This results in a full conditional distributed as beta. Therefore we update $p_j$ as

(11)   $p_j^{(t+1)} \sim \mathrm{Beta}\left( \alpha + \sum_k (1 - z_{jk}^{(t+1)}), \; \beta + \sum_k z_{jk}^{(t+1)} \right).$

Note that other choices from the beta family are possible as the prior for $p_j$. However, we used the noninformative choice of $\alpha = 1$ and $\beta = 1$ to facilitate data-driven estimation of $p_j$.
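These two updates can be sketched together in Python (our hypothetical helper for a single level $j$):

    import numpy as np
    from scipy.stats import multivariate_normal as mvn

    def update_z_and_p(d, lam, p, sigma2, V, Sigma, alpha=1.0, beta=1.0, rng=None):
        """Draw z_{jk} from (10), then p_j from (11), at one level j (sketch).

        d : (n_j, 2) coefficients at level j;  lam : (n_j,) mixing scales;
        p : current value of p_j;  V, Sigma : 2x2 matrices for this level.
        """
        rng = rng or np.random.default_rng()
        n_j = d.shape[0]
        z = np.empty(n_j, dtype=int)
        for k in range(n_j):
            m0 = mvn.pdf(d[k], mean=np.zeros(2), cov=sigma2 * V)
            m1 = mvn.pdf(d[k], mean=np.zeros(2), cov=sigma2 * V + lam[k] * Sigma)
            q = (1 - p) * m1 / (p * m0 + (1 - p) * m1)   # posterior P(z_jk = 1)
            z[k] = rng.random() < q
        p_new = rng.beta(alpha + (n_j - z.sum()), beta + z.sum())
        return z, p_new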

3.3 Updating $\theta_{jk}$

From the conjugate setup of model (8) and using the latent variable $z_{jk}$, it follows that the full conditional distribution of $\theta_{jk}$ is either a point mass at zero (when $z_{jk} = 0$) or a bivariate normal distribution (when $z_{jk} = 1$). Therefore we update $\theta_{jk}$ as follows:

(12)   $\theta_{jk}^{(t+1)} \sim \begin{cases} \delta_0, & z_{jk} = 0, \\ N_2(\mu_{jk}, B_{jk}), & z_{jk} = 1, \end{cases}$

where

$B_{jk} = \left( \sigma^{-2} V_j^{-1} + (\lambda_{jk} \Sigma_j)^{-1} \right)^{-1}, \qquad \mu_{jk} = B_{jk} \, \sigma^{-2} V_j^{-1} d_{jk}.$
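In Python, one coefficient update might look as follows (a sketch of ours, using the standard normal-normal conjugacy):

    import numpy as np

    def update_theta(d_k, z_k, lam_k, sigma2, Vinv, Sigma_inv, rng):
        """Draw theta_{jk} from (12): a point mass at 0 or a bivariate normal."""
        if z_k == 0:
            return np.zeros(2)
        B = np.linalg.inv(Vinv / sigma2 + Sigma_inv / lam_k)  # posterior covariance
        mu = B @ (Vinv / sigma2) @ d_k                        # posterior mean
        return rng.multivariate_normal(mu, B)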

3.4 Updating $\lambda_{jk}$

In model (8), for the scale mixture of normals representation, we placed a $\mathrm{Gamma}(3/2, 1/8)$ prior on $\lambda_{jk}$. The full conditional distribution of $\lambda_{jk}$ depends on the value of $z_{jk}$, and the updating scheme is:

(13)   $\lambda_{jk}^{(t+1)} \sim \begin{cases} \mathrm{Gamma}(3/2, 1/8), & z_{jk} = 0, \\ \mathrm{GIG}\left( 1/2, \; 1/4, \; \theta_{jk}^T \Sigma_j^{-1} \theta_{jk} \right), & z_{jk} = 1. \end{cases}$

Here $\mathrm{GIG}(p, a, b)$ denotes the generalized inverse Gaussian distribution (Johnson et al., 1994, p. 284) with probability density function

$f(x) = \dfrac{(a/b)^{p/2}}{2 K_p(\sqrt{ab})} \, x^{p-1} \exp\left\{ -\tfrac{1}{2} \left( a x + b/x \right) \right\}, \quad x > 0,$

where $K_p$ denotes the modified Bessel function of the third kind. Simulation of GIG random variates is available through a MATLAB implementation based on Dagpunar (1989).
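In Python, a corresponding sketch can use scipy.stats.geninvgauss, whose standard form has density proportional to $x^{p-1} \exp(-b(x + 1/x)/2)$; rescaling by $\sqrt{b/a}$ maps it to the $\mathrm{GIG}(p, a, b)$ parameterization above (our mapping, worth double-checking against the SciPy docs):

    import numpy as np
    from scipy.stats import geninvgauss

    def update_lambda(z_k, theta_k, Sigma_inv, rng):
        """Draw lambda_{jk} from (13)."""
        if z_k == 0:
            return rng.gamma(1.5, 8.0)          # Gamma(shape 3/2, rate 1/8) prior
        a, b = 0.25, theta_k @ Sigma_inv @ theta_k
        # GIG(p=1/2, a, b): scale the standard geninvgauss(p, sqrt(a*b)) by sqrt(b/a)
        return geninvgauss.rvs(0.5, np.sqrt(a * b), scale=np.sqrt(b / a),
                               random_state=rng)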

3.5 Updating $\Sigma_j$

Placing a conjugate inverse Wishart $\mathrm{IW}(m, \Lambda_j)$ prior on the covariance matrix $\Sigma_j$ results in a full conditional distribution which remains inverse Wishart. Therefore, $\Sigma_j$ is updated as:

(14)   $\Sigma_j^{(t+1)} \sim \mathrm{IW}\left( m + \sum_k z_{jk}, \; \Lambda_j + \sum_k z_{jk} \, \theta_{jk} \theta_{jk}^T / \lambda_{jk} \right).$
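A sketch of this update (ours), using scipy.stats.invwishart:

    import numpy as np
    from scipy.stats import invwishart

    def update_Sigma(theta, z, lam, Lambda_j, m, rng):
        """Draw Sigma_j from its inverse Wishart full conditional (14).

        theta : (n_j, 2) current coefficients;  z, lam : (n_j,) latent variables
        """
        active = z == 1
        S = Lambda_j + np.einsum('n,ni,nj->ij',
                                 1.0 / lam[active], theta[active], theta[active])
        df = m + active.sum()
        return invwishart.rvs(df=df, scale=S, random_state=rng)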

The implementation of the described Gibbs sampler requires simulation routines for standard distributions such as the inverse gamma, Bernoulli, beta, and normal, and also a specialized routine to simulate from the generalized inverse Gaussian. The procedure was implemented in MATLAB and is available from the authors.

The Gibbs sampling procedure can be summarized as follows (a sketch assembling these steps appears after the list):

  1. Choose initial values for the parameters $\sigma^2$, $z_{jk}$, $p_j$, $\theta_{jk}$, $\lambda_{jk}$, and $\Sigma_j$.

  2. Repeat steps 3-8 for $t = 0, 1, \ldots, T - 1$:

  3. Update $\sigma^2$.

  4. Update $z_{jk}$ for all $j, k$.

  5. Update $p_j$ for all $j$.

  6. Update $\theta_{jk}$ for all $j, k$.

  7. Update $\lambda_{jk}$ for all $j, k$.

  8. Update $\Sigma_j$ for all $j$.
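As an illustration, a minimal driver loop assembling the update sketches from Sections 3.1-3.5 (a sketch of ours, not the authors' MATLAB code; it assumes the hypothetical helpers `update_sigma2`, `update_z_and_p`, `update_theta`, `update_lambda`, and `update_Sigma` defined above are in scope, and that `d` and `V` hold the levelwise coefficients and noise covariances):

    import numpy as np

    def cgsws_gibbs(d, V, a, b, Lambda, m, T=10_000, seed=0):
        """One possible skeleton of the CGSWS sampler."""
        rng = np.random.default_rng(seed)
        levels = sorted(d)
        # crude initial values
        sigma2 = 1.0
        theta = {j: d[j].copy() for j in levels}
        z = {j: np.ones(d[j].shape[0], dtype=int) for j in levels}
        lam = {j: np.full(d[j].shape[0], 12.0) for j in levels}   # prior mean of lambda
        p = {j: 0.5 for j in levels}
        Sigma = {j: Lambda[j] / (m - 3) for j in levels}          # prior mean of Sigma_j
        Vinv = {j: np.linalg.inv(V[j]) for j in levels}
        draws = []
        for t in range(T):
            sigma2 = update_sigma2(d, theta, Vinv, a, b, rng)
            for j in levels:
                z[j], p[j] = update_z_and_p(d[j], lam[j], p[j], sigma2,
                                            V[j], Sigma[j], rng=rng)
                Sig_inv = np.linalg.inv(Sigma[j])
                for k in range(d[j].shape[0]):
                    theta[j][k] = update_theta(d[j][k], z[j][k], lam[j][k],
                                               sigma2, Vinv[j], Sig_inv, rng)
                    lam[j][k] = update_lambda(z[j][k], theta[j][k], Sig_inv, rng)
                Sigma[j] = update_Sigma(theta[j], z[j], lam[j], Lambda[j], m, rng)
            if t >= T // 2:                       # keep post burn-in draws
                draws.append({j: theta[j].copy() for j in levels})
        # posterior mean estimate of the coefficients
        return {j: np.mean([s[j] for s in draws], axis=0) for j in levels}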

The denoising method based on the Gibbs sampling algorithm above will be called the Complex Gibbs Sampling Wavelet Smoother (CGSWS). In the following section we explain the specification of the hyperparameters $a$, $b$, $\Lambda_j$, and $m$, and apply the CGSWS algorithm to denoise simulated test functions.

4 Simulations and Comparisons

In this section we discuss the performance of the proposed CGSWS estimator and compare it to an established method from the literature on complex Bayesian wavelet denoising. Within each replication of the simulations we performed 10,000 Gibbs sampling iterations, of which the first 5,000 were burn-in. We used the sample average of the post burn-in draws as the usual estimator of the posterior mean; in our set-up, $\hat{\theta}_{jk} = \frac{1}{5000} \sum_{t=5001}^{10000} \theta_{jk}^{(t)}$. First we discuss the selection of hyperparameters, then explain the simulation setup and results.

4.1 Selection of Hyperparameters

In any Bayesian modeling task the selection of hyperparameters is critical for good performance of the model. It is also desirable to have a default way of selecting the hyperparameters which makes the shrinkage procedure automatic.

In order to apply the CGSWS method we need to specify the hyperparameters $a$, $b$, $\Lambda_j$, and $m$ in the hyperprior distributions. This selection is governed by the data and the hyperprior distributions. The advantage of the fully Bayesian approach is that once the hyperpriors are set, the estimation of the parameters $\theta_{jk}$, $z_{jk}$, $\lambda_{jk}$, $p_j$, $\sigma^2$, and $\Sigma_j$ is automatic via the Gibbs sampling scheme. Another advantage is that the method is relatively robust to the choice of hyperparameters, since they enter the model at a higher level of the hierarchy.

Parameters $a$ and $b$. For simplicity we set $a$ and $b$ so that the prior mean $b/(a-1)$ equals $\hat{\sigma}^2$, where

$\hat{\sigma} = \mathrm{MAD}/0.6745.$

This ensures that the mean of the inverse gamma prior on $\sigma^2$ is the standard robust estimator of the noise variation (Donoho and Johnstone, 1994). Here MAD stands for the median absolute deviation of the wavelet coefficients, which we calculate at the finest level of detail from both the real and imaginary parts of the wavelet coefficients (Barber and Nason, 2004).
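A small sketch of this estimator (ours; it assumes the finest-level complex coefficients are available in a NumPy array):

    import numpy as np

    def sigma_hat(finest_coeffs):
        """Robust noise-scale estimate from the finest-level complex coefficients."""
        parts = np.concatenate([finest_coeffs.real, finest_coeffs.imag])
        return np.median(np.abs(parts - np.median(parts))) / 0.6745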

Parameters $\Lambda_j$ and $m$. Hyperparameters $\Lambda_j$ and $m$ play an important role in the prior on the covariance matrix $\Sigma_j$. Since the indicators $z_{jk}$ update in the Gibbs sampler, the number of nonzero coefficients at a given level can possibly be zero, so a noninformative Jeffreys prior on $\Sigma_j$ is not computationally feasible. Also note that the mean of the inverse Wishart prior is $\Lambda_j/(m - q - 1)$, where $q$ is the dimension of $\Sigma_j$, which is equal to 2 in our case. Therefore we set

$\Lambda_j = (m - 3) \, \hat{\Sigma}_j,$

which forces the mean of the prior to be a pre-specified estimate $\hat{\Sigma}_j$ of $\Sigma_j$. In the case of the mixture bivariate double exponential prior, the covariance of the signal part is $(1 - p_j) \, c \, \Sigma_j$, where $c \, \Sigma_j$ with $c = 12$ is the covariance of a bivariate double exponential random variable (Gomez et al., 1998). Since the model assumes independence of the signal and error parts, we have that $S_j = (1 - p_j) \, c \, \Sigma_j + \sigma^2 V_j$, where $S_j$ is the covariance of the observations at dyadic level $j$. We choose $p_j = 1/2$ as a reasonable estimate, which additionally simplifies the equation at hand. Therefore a reasonable estimator for $\Sigma_j$ is

(15)   $\hat{\Sigma}_j = \dfrac{\hat{S}_j - \hat{\sigma}^2 V_j}{6},$

where $\hat{S}_j$ is the sample covariance estimator using the observations at dyadic level $j$. Note that $V_j$ is known, and $\hat{\sigma}^2$ is the usual robust estimator of the variance of the wavelet coefficients introduced before. Also note that when $\hat{\Sigma}_j$ is not positive definite, we regularize it by adding a multiple of the identity matrix.

Finally, $m$ has to be specified. Note that $m = 4$ is the least informative choice in our case for which the prior mean exists; however, we found that slightly higher values of $m$ worked better in practice.
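A sketch of this moment-based estimate (a hypothetical helper of ours; `coeffs_j` is the $(n_j, 2)$ array of real/imaginary coefficient pairs at level $j$):

    import numpy as np

    def Sigma_hat(coeffs_j, sigma2_hat, V_j, eps=1e-6):
        """Moment estimate (15) of Sigma_j, regularized to be positive definite."""
        S_j = np.cov(coeffs_j.T)                       # sample covariance at level j
        est = (S_j - sigma2_hat * V_j) / 6.0
        lam_min = np.linalg.eigvalsh(est).min()
        if lam_min <= 0:                               # add a multiple of the identity
            est += (eps - lam_min) * np.eye(2)
        return est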

4.2 Simulation Results

In the following we present the results of a simulation study in which we compare the denoising performance of the CGSWS method to two complex wavelet-based denoising methods introduced by Barber and Nason (2004). The first one (CMWS-Hard) is a phase-preserving estimator based on hard thresholding of a “thresholding statistic”. The second one (CEB-Posterior mean) is a bivariate posterior mean estimator based on an empirical Bayes procedure.

Four standard test functions (Blocks, Bumps, Doppler, Heavisine) were considered in the simulations (Donoho and Johnstone, 1994). The functions were rescaled so that the added noise produced a preassigned signal-to-noise ratio (SNR), as standardly done. The test functions were simulated at $n = 256$, $512$, and $1024$ equally spaced points in the unit interval. Four commonly considered SNRs were selected: SNR = 3, 5, 7, and 10. We used the symmetric complex-valued Daubechies wavelet basis with 3 vanishing moments for all the test functions. The coarsest decomposition level was $j_0 = \lfloor \log_2(\log n) \rfloor + 1$, matching the level suggested by Antoniadis et al. (2001).

Reconstruction of the theoretical signal was measured by the average mean squared error (AMSE), calculated as

$\mathrm{AMSE} = \dfrac{1}{R n} \sum_{r=1}^{R} \sum_{i=1}^{n} \left( \hat{f}^{(r)}(x_i) - f(x_i) \right)^2,$

where $R$ is the number of simulation runs and $f(x_i)$ are the known values of the test functions considered. We denote by $\hat{f}^{(r)}$ the estimator from the $r$-th simulation run. Note that in each of these simulation runs we perform 10,000 Gibbs iterations to obtain the estimators $\hat{f}^{(r)}$.
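In code, this criterion is a one-liner (a sketch of ours):

    import numpy as np

    def amse(f_hat_runs, f_true):
        """Average MSE over runs: f_hat_runs is (R, n), f_true is (n,)."""
        return np.mean((f_hat_runs - f_true[None, :]) ** 2)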

The results are summarized in Table 1. They convey that the proposed CGSWS method outperforms both estimators for the majority of the test scenarios, and in most other cases it is very close in performance to the superior method. The improvement is most pronounced at the small sample size ($n = 256$) and for the test functions Bumps and Heavisine. This result confirms the adaptiveness of the method and the advantage of using a heavy-tailed prior as the prior distribution for the location of the wavelet coefficients. Note, however, that the computational cost of the algorithm is higher than that of the competitors. The CEB-Posterior mean method can be a good compromise in terms of performance and computational efficiency.

Signal     N     Method   SNR=3   SNR=5   SNR=7   SNR=10
Blocks     256   CGSWS    0.4293  0.4533  0.4610  0.4499
                 CMWS-H   0.4929  0.5476  0.5490  0.5021
                 CEB-PM   0.4343  0.4675  0.4715  0.4547
           512   CGSWS    0.2954  0.3180  0.3138  0.3051
                 CMWS-H   0.3481  0.3627  0.3457  0.3166
                 CEB-PM   0.3028  0.3202  0.3126  0.2995
           1024  CGSWS    0.1991  0.2013  0.1991  0.1924
                 CMWS-H   0.2372  0.2230  0.2098  0.1944
                 CEB-PM   0.1980  0.1988  0.1947  0.1879
Bumps      256   CGSWS    0.4631  0.4825  0.4946  0.5181
                 CMWS-H   0.5972  0.5946  0.5853  0.5809
                 CEB-PM   0.4855  0.4996  0.5120  0.5390
           512   CGSWS    0.3273  0.3274  0.3235  0.3203
                 CMWS-H   0.3983  0.3760  0.3538  0.3317
                 CEB-PM   0.3295  0.3315  0.3287  0.3228
           1024  CGSWS    0.1965  0.1970  0.2009  0.2090
                 CMWS-H   0.2137  0.2151  0.2134  0.2223
                 CEB-PM   0.1919  0.1986  0.2034  0.2122
Doppler    256   CGSWS    0.3093  0.3119  0.3251  0.3619
                 CMWS-H   0.3332  0.3351  0.3644  0.4000
                 CEB-PM   0.3137  0.3158  0.3351  0.3723
           512   CGSWS    0.1854  0.2073  0.2052  0.2095
                 CMWS-H   0.2048  0.2217  0.2192  0.2289
                 CEB-PM   0.1845  0.2007  0.2035  0.2132
           1024  CGSWS    0.1034  0.1209  0.1310  0.1467
                 CMWS-H   0.1160  0.1329  0.1432  0.1601
                 CEB-PM   0.1087  0.1225  0.1302  0.1419
Heavisine  256   CGSWS    0.1198  0.1640  0.1900  0.2030
                 CMWS-H   0.1547  0.2075  0.2144  0.2198
                 CEB-PM   0.1338  0.1838  0.2098  0.2188
           512   CGSWS    0.0799  0.1050  0.1258  0.1429
                 CMWS-H   0.0959  0.1202  0.1357  0.1371
                 CEB-PM   0.0881  0.1167  0.1340  0.1427
           1024  CGSWS    0.0487  0.0650  0.0747  0.0843
                 CMWS-H   0.0557  0.0746  0.0793  0.0791
                 CEB-PM   0.0564  0.0730  0.0794  0.0835

Table 1: AMSE of the CGSWS method compared to the CMWS-Hard and CEB-Posterior mean estimators.

5 Application to Inductance Plethysmography Data

For illustration we apply the described CGSWS method to a real-life data set from anesthesiology collected by inductance plethysmography. The recordings were made by the Department of Anaesthesia at the Bristol Royal Infirmary and represent a measure of the flow of air during breathing. The data set was analyzed by several authors, for example Nason (1996) and Abramovich et al. (1998, 2002), where more information about the data can be found.

Figure 1 shows a section of the plethysmograph recording lasting approximately 80 s ($n = 4096$ observations), while Figure 2 shows the reconstruction of the signal with the CGSWS method. In the reconstruction process we applied 10,000 iterations of the Gibbs sampler, of which the first 5,000 were burn-in. The aim of smoothing was to preserve features such as peak heights while eliminating spurious high-frequency variation. The reconstruction provided by the proposed method satisfies these requirements, yielding a very smooth estimate. Abramovich et al. (2002) report the height of the first peak while analyzing this data set. In our case the height is 0.8342, which is quite close to the result of 0.8433 obtained by Abramovich et al. (2002), and compares favorably with the results obtained by the other established methods analyzed in their paper.

Figure 1: A section of the inductance plethysmography data with $n = 4096$.
Figure 2: Reconstruction of the inductance plethysmography data by CGSWS.

6 Conclusions

In this paper we proposed the Complex Gibbs Sampling Wavelet Smoother (CGSWS), a complex wavelet-based method for nonparametric regression. A fully Bayesian approach was taken, in which a hierarchical model was formulated that accounts for the uncertainty of the prior parameters by placing hyperpriors on them. A mixture prior was specified on the complex wavelet coefficients with a bivariate double exponential spread distribution to account for the large wavelet coefficients. Since all the full conditional distributions were available in an explicit distributional form, an efficient Gibbs sampling estimation procedure was proposed. The CGSWS method provided excellent denoising performance, which was demonstrated by simulations on well-known test functions and by comparison to a well-established wavelet denoising method that uses complex wavelets. The methodology was also illustrated on a real-life data set from inductance plethysmography. There the proposed method performed well in both smoothing and preserving the important features of the phenomenon.

7 Appendix

In this Appendix we provide some results used in setting up the Gibbs sampling algorithm for model (8). To derive the full conditional distribution of a parameter of interest, we start with the joint distribution of all the parameters and collect the terms which contain the desired parameter. Denote by $d = \{d_{jk}\}$ and $\theta = \{\theta_{jk}\}$ the collections of observed and unobserved wavelet coefficients, where $d_{jk}$ and $\theta_{jk}$ are bivariate components. Similarly, denote by $z = \{z_{jk}\}$, $\lambda = \{\lambda_{jk}\}$, and $p = \{p_j\}$ the collections of latent indicators, mixing variables, and mixing weights, and by $\Sigma$ the vector containing the matrices $\Sigma_j$ for resolution levels $j = j_0, \ldots, J - 1$. The joint distribution of the data and parameters is

$f(d, \theta, z, \lambda, p, \sigma^2, \Sigma) = \prod_{j,k} N_2(d_{jk}; \theta_{jk}, \sigma^2 V_j) \left[ z_{jk} N_2(\theta_{jk}; 0, \lambda_{jk} \Sigma_j) + (1 - z_{jk}) \delta_0(\theta_{jk}) \right] p_j^{1 - z_{jk}} (1 - p_j)^{z_{jk}} \, \mathrm{Gamma}(\lambda_{jk}; 3/2, 1/8) \times \prod_j \mathrm{IW}(\Sigma_j; m, \Lambda_j) \, U(p_j; 0, 1) \times \mathrm{IG}(\sigma^2; a, b).$

From the joint distribution, the full conditional distribution of $\sigma^2$ is

$\sigma^2 \mid \cdot \sim \mathrm{IG}\left( a + n, \; b + \tfrac{1}{2} \sum_{j,k} (d_{jk} - \theta_{jk})^T V_j^{-1} (d_{jk} - \theta_{jk}) \right).$

The conditional distribution of $z_{jk}$ remains Bernoulli with posterior success probability

$P(z_{jk} = 1 \mid \cdot) = \dfrac{(1 - p_j) \, m_1(d_{jk})}{p_j \, m_0(d_{jk}) + (1 - p_j) \, m_1(d_{jk})},$

where $m_0(d_{jk}) = N_2(d_{jk}; 0, \sigma^2 V_j)$ and $m_1(d_{jk}) = N_2(d_{jk}; 0, \sigma^2 V_j + \lambda_{jk} \Sigma_j)$. The marginal distribution $m_1(d_{jk})$ is a bivariate normal distribution with zero mean and covariance matrix $\sigma^2 V_j + \lambda_{jk} \Sigma_j$. This follows from the conjugate multivariate normal-normal structure (Lindley and Smith, 1972).

The full conditional distribution of $p_j$ is

$p_j \mid \cdot \sim \mathrm{Beta}\left( 1 + \sum_k (1 - z_{jk}), \; 1 + \sum_k z_{jk} \right).$

The full conditional distribution of $\theta_{jk}$ is

$\theta_{jk} \mid \cdot \sim z_{jk} \, N_2(\mu_{jk}, B_{jk}) + (1 - z_{jk}) \, \delta_0,$

where

$B_{jk} = \left( \sigma^{-2} V_j^{-1} + (\lambda_{jk} \Sigma_j)^{-1} \right)^{-1}, \qquad \mu_{jk} = B_{jk} \, \sigma^{-2} V_j^{-1} d_{jk}.$

The derivation of $B_{jk}$ and $\mu_{jk}$ is also a standard result, contained for example in Lindley and Smith (1972), and was used in the wavelet shrinkage context by Barber and Nason (2004).

The full conditional distribution of $\lambda_{jk}$ is proportional to

$\pi(\lambda_{jk} \mid \cdot) \propto \lambda_{jk}^{1/2} e^{-\lambda_{jk}/8} \left[ N_2(\theta_{jk}; 0, \lambda_{jk} \Sigma_j) \right]^{z_{jk}}.$

For $z_{jk} = 0$, this becomes

$\lambda_{jk} \mid \cdot \sim \mathrm{Gamma}(3/2, 1/8),$

and when $z_{jk} = 1$, it becomes

$\lambda_{jk} \mid \cdot \sim \mathrm{GIG}\left( 1/2, \; 1/4, \; \theta_{jk}^T \Sigma_j^{-1} \theta_{jk} \right).$

Here $\mathrm{GIG}(p, a, b)$ denotes the generalized inverse Gaussian distribution (Johnson et al., 1994, p. 284) with probability density function

$f(x) = \dfrac{(a/b)^{p/2}}{2 K_p(\sqrt{ab})} \, x^{p-1} \exp\left\{ -\tfrac{1}{2} \left( a x + b/x \right) \right\}, \quad x > 0,$

where $K_p$ denotes the modified Bessel function of the third kind.

Finally, the full conditional distribution of $\Sigma_j$ is given as

$\Sigma_j \mid \cdot \sim \mathrm{IW}\left( m + \sum_k z_{jk}, \; \Lambda_j + \sum_k z_{jk} \, \theta_{jk} \theta_{jk}^T / \lambda_{jk} \right).$