# Bayesian Wavelet Shrinkage with Beta Priors

We present a Bayesian approach for wavelet shrinkage in the context of non-parametric curve estimation with the use of the beta distribution with symmetric support around zero as the prior distribution for the location parameter in the wavelet domain in models with additive Gaussian errors. Explicit formulas of shrinkage rules for particular cases are obtained, statistical properties such as bias, classical and Bayesian risk of the rules are analyzed and performance of the proposed rules is assessed in simulations studies involving standard test functions. Application to Spike Sorting real data set is provided.

## Authors

• 4 publications
• 5 publications
• 1 publication
• ### Asymmetric prior in wavelet shrinkage

In bayesian wavelet shrinkage, the already proposed priors to wavelet co...
10/09/2020 ∙ by Alex Rodrigo dos Santos Sousa, et al. ∙ 0

• ### Wavelet Shrinkage in Nonparametric Regression Models with Positive Noise

Wavelet shrinkage estimators are widely applied in several fields of sci...
09/13/2021 ∙ by Alex Rodrigo dos Santos Sousa, et al. ∙ 0

• ### Subspace Shrinkage in Conjugate Bayesian Vector Autoregressions

Macroeconomists using large datasets often face the choice of working wi...
07/16/2021 ∙ by Florian Huber, et al. ∙ 0

• ### Kurtosis control in wavelet shrinkage with generalized secant hyperbolic prior

The present paper proposes a bayesian approach for wavelet shrinkage wit...
08/11/2021 ∙ by Alex Rodrigo dos Santos Sousa, et al. ∙ 0

• ### Multiscale Analysis of Bayesian CART

This paper affords new insights about Bayesian CART in the context of st...
10/16/2019 ∙ by Ismaël Castillo, et al. ∙ 0

• ### Bayesian cumulative shrinkage for infinite factorizations

There are a variety of Bayesian models relying on representations in whi...
02/12/2019 ∙ by Sirio Legramanti, et al. ∙ 0

• ### Large-Scale Shrinkage Estimation under Markovian Dependence

We consider the problem of simultaneous estimation of a sequence of depe...
03/04/2020 ∙ by Bowen Gang, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Wavelet-based methods are widely applied in a range of fields, such as mathematics, signal and image processing, geophysics, and many others. In statistics, applications of wavelets arise mainly in the areas of non-parametric regression, density estimation, functional data analysis and stochastic processes. These methods basically utilize the possibility of representing functions that belong to certain functional spaces as expansions in wavelet bases, similar to others expansions such as splines or Fourier (among others) for example. However, wavelet expansions have characteristics that make them quite useful from the point of view of function representation: they are localized in both time and scale in an adaptive way. Their coefficients are typically sparse, they can be obtained by fast computational algorithms, and the magnitudes of coefficients are linked with the smoothness properties of the functions they represent. These properties enable time/frequency data analysis, bring computational advantages, and allow for statistical data modeling at different resolution scales.

Grossmann and Morlet (1984), Daubechies (1988, 1992) and Mallat (1998) are standard early references about wavelets in general and Ogden (1997), Vidakovic (1999), Percival and Walden (2000), Nason (2008), Antoniadis (1997), Morettin (1999) and Morettin et al (2017) are well-cited references about wavelet techniques in statistics.

Wavelet shrinkage methods are used to estimate the coefficients associated with the representation of the function in the wavelet domain by reducing the magnitude of the observed (empirical) coefficients that are obtained by the wavelet transform of the the original data. There are in fact several shrinkage techniques available in the literature. The main works in this area are of Donoho and Johnstone (1994, 1995, 1998), but also Donoho et al. (1995, 1996a, 1996b), Johnstone e Silverman (1997), Vidakovic (1998, 1999) and Antoniadis et al. (2002) can be cited. For more details of shrinkage methods, see Vidakovic (1999) and Jansen (2012).

It is well known that optimal rules in Bayesian models are shrinkage rules, except for pathological cases. Bayesian approach offers theoretical paradigm for statistical formalization of wavelet shrinkage. In addition to their controllable shrinkage properties, Bayesian methods have also been extensively studied for the possibility of incorporating, by means of a prior probabilistic distributions, prior information about the regression smoothnes, neighboring coefficients and other parameters to be estimated. Bayesian models in the wavelet domain have showed to be capable of incorporating prior information about the unknown regression function such as smoothness, periodicity, self-similarity, degree of “noisiness” (SNR), and, for some particular bases (e.g., Haar), monotonicity.

In this sense, the choice of the prior distribution in the statistical model describing wavelet coefficients is extremely important to achieve meaningful results.

Several priors models in the wavelet domain were proposed since 1990s, see Chipman et all (1997), Abramovich, Sapatinas and Silverman (1998), Abramovich and Sapatinas (1999), Vidakovic (1998), Vidakovic and Ruggeri (2001), Angelini and Vidakovic (2004), Johnstone and Silverman (2004, 2005), among others.

The standard statistical models in which shrinkage techniques are applied assume Gaussian additive errors. These models are important not only from their applicability to a range of different problems, but also from the mathematical point of view since the Gaussian additive errors remain both Gaussian and additive after the wavelet transformation.

In this paper we propose and explore a beta distribution symmetric around zero as a prior distribution for the location parameter in a Gaussian model on wavelet coefficients. As traditionally done in this kind of analysis the prior is in fact a distribution contaminated at 0. This added point mass at zero to the spread part of the prior facilitates thresholding. The flexibility of the beta distribution, as a spread part of the prior, is readily controlled by convenient choice of its parameters. Moreover, we show that there is an interesting relationship between the prior (hyper) parameters and the degree of wavelet shrinkage, which is useful in practical applications.

In this paper we aim to incorporate prior belief on the boundedness of the energy of the signal (the

-norm of the regression function). The prior information on the energy bound often exists in real life problems and it can be modeled by the assumption that the location parameter space is bounded, which in our context relates to estimating a bounded normal mean. Estimation of a bounded normal mean has been considered in Miyasawa (1953), Bickel (1981), Casella and Strawderman (1981), and Vidakovic and DasGupta (1996). In the context of Bayesian modeling, if the structure of the prior can be supported by the analysis of the empirical distribution of the wavelet coefficients, the precise elicitation of the prior hyperparameters cannot be done without some kind of guidance, especially in the Empirical Bayes spirit. Of course, when prior knowledge on the signal-to-noise ratio

is available, then any 0-contaminated symmetric and unimodal distribution supported on the bounded set, say

, could be a possible candidate for the prior. If the problem is rescaled so that the size of the noise (its variance) is 1, then

can be taken as

This paper is organized as follows: Section 2 defines the considered model in time and wavelet domain and the proposed prior distribution, Section 3 presents the shrinkage rule, shows statistical properties such as variance, bias and risks and explicit formulas of two particular cases of the shrinkage rule under beta prior and of an extension the shrinkage rule under triangular prior. Section 4 is dedicated to prior elicitation. To verify the strength of the proposed approach simulation studies are performed in Section 5, and the shrinkage rule is applied in a Spike Sorting real data set in Section 6. Section 7 provides conclusions.

## 2 The Model

### 2.1 The Symmetric Around Zero Beta Distribution

In Statistics, the beta distribution is extensively used to model variables in the domain. This distribution is extremely flexible in shape controlled by a convenient choices of its parameters. In our framework, it is convenient to use its version shifted and rescaled to the interval as well as choosing its parameters to keep it symmetric about 0. Therefore, we propose the use of beta distribution with symmetric support around zero as a prior distribution for the wavelet coefficients. Its density function is

 g(x;a,m)=(m2−x2)(a−1)(2m)(2a−1)B(a,a)I[−m,m](x), (2.1)

where is the standard beta function, and are the parameters of the distribution, and is an indicator function equal to 1 for its argument in the interval and 0 else.

For , the density function (2.1) is unimodal around zero and as increases, the density becomes more concentrated around zero. This is an important feature for wavelet shrinkage methods, since high values of imply higher levels of shrinkage of the empirical coefficients, which results in sparse estimated coefficients. Figure 2.1 shows the beta density function (2.1) for some selected values of and .

### 2.2 Beta Distribution as Prior for the Wavelet Coefficients

 yi=f(xi)+ei,i=1,...,n=2J,J∈N, (2.2)

where and ,

, are zero mean independent normal random variables with unknown variance

. In vector notation, we have

 y=f+e, (2.3)

where , and . The goal is to estimate the unknown function

. After applying a discrete wavelet transform (DWT) on (2.3), given by the orthogonal matrix

, we obtain the following model, in the wavelet domain,

 d=θ+ϵ, (2.4)

where , and .

Due to the independence of the random errors and the orthogonality of the transform, the model in the wavelet domain remains additive and the errors are i.i.d. normal.

Because the strong decorrelating property of wavelets we can study one coefficient at a time. Of course, wavelet coefficients are “almost decorrelated,” but never fully decorrelated, except for special cases, and strategies for involvement of neighboring coefficients in the shrinkage policy are common in the literature. However, these strategies increase the complexity of the shrinkage for very modest improvements, and still favored approach, especially among the practitioners, is to consider shrinkage coefficient-by-coefficient.

For the th component of the vector , we have a simple model

 di=θi+ϵi, (2.5)

where is the empirical wavelet coefficient, is the coefficient to be estimated and is the normal random error with unknown variance . For simplicity of notation, we suppress the subindices of , and . Note that, according to the model (2.5), and then, the problem of estimating a function becomes a normal mean estimation problem in the wavelet domain for each coefficient.

To complete the Bayesian model, we propose the following prior distribution for ,

 π(θ;α,a,m)=αδ0(θ)+(1−α)g(θ;a,m), (2.6)

where , is the point mass function at zero and is the beta distribution (2.1) in . The proposed prior distribution has , and as hyperparameters and their choices are directly related to the degree of shrinkage of the empirical coefficients. It will be shown that as or (or both of them) increase, the degree of shinkage increases as well.

## 3 The Shrinkage Rule

The shrinkage rule for Bayesian estimation of the wavelet coefficient of model (2.5

) depends on the choice of location of the posterior (mean, mode, or median) and the loss function. Under square error loss function

, it is well known that the Bayes rule is the posterior expected value of , i.e, minimized Bayes risk. The Proposition 3.1 gives an expression of the shrinkage rule under a mixture prior consisting of a point mass at zero and a density function with support in .

###### Proposition 3.1.

If the prior distribution of is of the form , where is a density function with support in , then the shrinkage rule under the quadratic loss function is given by

 δ(d)=(1−α)∫m−dσ−m−dσ(σu+d)g(σu+d)ϕ(u)duα1σϕ(dσ)+(1−α)∫m−dσ−m−dσg(σu+d)ϕ(u)du (3.1)

where is the standard normal density function.

###### Proof.

If is the likelihood function, we have that

 δ(d) =Eπ(θ∣d) =∫Θθ[αδ0(θ)+(1−α)g(θ)]L(d∣θ)dθ∫Θ[αδ0(θ)+(1−α)g(θ)]L(d∣θ)dθ =(1−α)∫m−mθg(θ)1√2πexp{−12(d−θσ)2}dθσα1σ√2πexp{−12(dσ)2}+(1−α)∫m−mg(θ)1√2πexp{−12(d−θσ)2}dθσ =(1−α)∫m−dσ−m−dσ(σu+d)g(σu+d)ϕ(u)duα1σϕ(dσ)+(1−α)∫m−dσ−m−dσg(σu+d)ϕ(u)du.

Figure 3.1 presents some shrinkage rules for as beta distribution (2.1) with hyperparameters , and as well as their variances. It can be seen that the length of the interval in which the rule shrinks to zero increases as the hyperparameter increases, since high values of result in more concentrated beta distributions around zero. A typical feature of these rules is that as increases, gets closer to and as decreases, gets closer to , i.e, and . These asymptotic characteristics are reasonable since there is the assumption that the coefficients to be estimated belong to the range , so that empirical coefficients outside this range occur due to the presence of noise.

Figure 3.2 shows the bias and classical risks respectively for the shrinkage rules under beta prior distribution with hyperparameters and showed in Figure 2.1. Observe that, as expected, the rules have smaller variances and biases for values of near zero, reaching minimum values in both graphs when the wavelet coefficient is zero. It is also noted that as hyperparameter increases, the bias of the estimator increases and the variance decreases. The classical risk decreases as tends to zero and that for high values of , the risk is larger for rules with large values of . These features are justified by the fact that the degree of shrinkage increases as the hyperparameter increases, so if the value of the wavelet coefficient is far from zero, such rules with larger values of tend to underestimate than rules with small values of .

Table 3.1 shows Bayes risks in terms of the hyperparameter of the shrinkage rules considered. As expected, Bayes risk decreases as increases.

### 3.1 Particular Cases

In this section we show two particular cases of the shrinkage rule under beta prior and an extension of the beta prior, the triangular prior. Angelini and Vidakovic (2004) studied in details the shrinkage rule under uniform prior, an interesting particular case of beta distribution with the hyperparameter . The proofs of Propositions 3.2, 3.3 and 3.4 use the Lemma 3.1, whose proofs are in the Appendix.

###### Lemma 3.1.

Let and

be the standard normal density and cumulative distribution functions respectively and

. Then,

#### 3.1.1 Shrinkage Rule for a=2

###### Proposition 3.2.

The shrinkage rule under prior distribution of the form , where is the beta distribution in with the hyperparameter is

 δ(d)=(1−α){N1(d)[Φ(q2)−Φ(q1)]+N2(d)ϕ(q1)+N3(d)ϕ(q2)}4m3α3σϕ(dσ)+(1−α){D1(d)[Φ(q2)−Φ(q1)]+D2(d)ϕ(q1)+D3(d)ϕ(q2)}, (3.2)

where

, , , , , , and .

###### Proof.

In the Proposition 3.1, we do as the beta distribution (2.1) with . Then,

 δ(d) =(1−α)∫m−dσ−m−dσ(σu+d)[m2−(σu+d)2]2−1(2m)3B(2,2)ϕ(u)duα1σϕ(dσ)+(1−α)∫m−dσ−m−dσ[m2−(σu+d)2]2−1(2m)3B(2,2)ϕ(u)du =(1−α)I1Cϕ(dσ)+(1−α)I2.

where , , , and are the integrals of the numerator and denominator, respectively.

 I1 =∫q2q1(σu+d)[m2−(σu+d)2]ϕ(u)du =N1(d)[Φ(q2)−Φ(q1)]+N2(d)ϕ(q1)+N3(d)ϕ(q2),

where , , . Similarly,

 I2 =∫q2q1[m2−(σu+d)2]ϕ(u)du =D1(d)[Φ(q2)−Φ(q1)]+D2(d)ϕ(q1)+D3(d)ϕ(q2),

where , e .

Figure 3.3 shows the shrinkage rules (3.2), bias, variance and classical risk for the hyperparameters and for .

As expected, the rule presents a higher degree of shrinkage as the hyperparameter increases. Similarly to the previous section results, the bias increases and the classical risk increases and the variance decreases as increases. Observe that both bias, variance and classical risk increase as moves away from zero and approaches .

Table 3.2 presents the Bayes risks of the rules (3.2) for the hyperparameters and for .

The Bayes risk decreases as increases, since increasing implies a greater degree of shrinkage of empirical coefficients.

#### 3.1.2 Shrinkage Rule for a=12

###### Proposition 3.3.

The shrinkage rule under prior distribution of the form , where is the beta distribution on with the hyperparameter can be approximated by

 δ(d)≈(1−α)[P1(d)ϕ(m−1−dσ)−P2(d)ϕ(1−m−dσ)]Cϕ(dσ)+(1−α)[P3(d)ϕ(m−1−dσ)−P4(d)ϕ(1−m−dσ)], (3.3)

where

,

,

,

###### Proof.

In Proposition 3.1, we take as the beta distribution (2.1) with . Then,

 δ(d) =(1−α)∫m−mθ(m2−θ2)−12B(12,12)1σ√2πexp{−(θ−d)22σ2}dθα1σϕ(dσ)+(1−α)∫m−mθ(m2−θ2)−12B(12,12)1σ√2πexp{−(θ−d)22σ2}dθ =(1−α)I1Cϕ(dσ)+(1−α)I2,

where , e are the integrals of the numerator and denominator respectively. In fact, we approximate these integrals using second order Taylor series.

The following approximation of the integral in the numerator can be obtained around a value :

 T1(c,θ)=exp{−(c−d)22σ2}2(m2−c2)−12σ2{2cσ2(θ−c)+[cd+σ2−c2+c2σ2(m2−c2)−1](θ−c)2}.

Therefore,

 I1≈T1(m−1,m)−T1(−m+1,−m)=exp{−(1−m−d)22σ2}P1(d)−exp{−(1−m−d)22σ2}P2(d)2σ2(2m−1)−12.

In the same way, the approximation of the integral in the denominator around is

 T2(c,θ)=exp{−(c−d)22σ2}2(m2−c2)−12σ2{2σ2(θ−c)+[−c+d+cσ2(m2−c2)−1](θ−c)2}.

Then,

 I2≈T2(m−1,m)−T2(−m+1,−m)=exp{−(1−m−d)22σ2}P3(d)−exp{−(1−m−d)22σ2}P4(d)2σ2(2m−1)−12.

Therefore,

 δ(d) ≈(1−α)1√2πexp{−(1−m−d)22σ2}P1(d)−1√2πexp{−(1−m−d)22σ2}P2(d)2απσ2(2m−1)−12ϕ(dσ)+(1−α)1√2πexp{−(1−m−d)22σ2}P3(d)−1√2πexp{−(1−m−d)22σ2}P4(d).

Figure 3.4 shows the Bayes rule and the approximation given in Proposition 3.3 for the case and .

#### 3.1.3 Shrinkage Rule for Triangular Prior

We present the triangular prior distribution for the wavelet coefficients as an extension of the beta distribution. In fact, the triangular distribution in

is the convolution of two uniform distributions in

.

The triangular distribution in the interval symmetric around zero has density function given by

 gT(x;m)=⎧⎪ ⎪ ⎪⎨⎪ ⎪ ⎪⎩x+mm2,se−m

The following proposition provides an explicit formula for the shrinkage rule under triangular prior.

###### Proposition 3.4.

The shrinkage rule under prior distribution of the form , where is the triangular distribution over , is

 δT(d)=(1−α)S1(d)αm2σϕ(dσ)+(1−α)S2(d), (3.5)

where

and

###### Proof.

Applying Proposition 3.1 we get

 δT(d) =(1−α)∫m−dσ−m−dσ(σu+d)gT(σu+d)ϕ(u)duα1σϕ(dσ)+(1−α)∫m−dσ−m−dσgT(σu+d)ϕ(u)du =(1−α)I2α1σϕ(dσ)+(1−α)I1.

For , the integral in the denominator, we have that

 I1 =∫m−dσ−m−dσgT(σu+d)ϕ(u)du =∫−dσ−m−dσ[(σu+d)+m]m2ϕ(u)du+∫m−dσ−dσ[m−(σu+d)]m2ϕ(u)du =I∗1+I∗∗1.

We calculate e separately, i.e,

 I∗1 =∫−dσ−m−dσ[(σu+d)+m]m2ϕ(u)du =1m2{σ[ϕ(m+dσ)−ϕ(dσ)]+(d+m)[Φ(−dσ)−Φ(−m−dσ)]}.
 I∗∗1 =∫m−dσ−dσ[m−(σu+d)]m2ϕ(u)du =1m2{(m−d)[Φ(m−dσ)−Φ(−dσ)]−σ[ϕ(dσ)−ϕ(m−dσ)]}.

For , the integral in the numerator,

 I2=σ∫m−dσ−m−dσugT(σu+d)ϕ(u)du+d∫m−dσ−m−dσgT(σu+d)ϕ(u)du=σI∗2+dI1.
 I∗2 =∫−dσ−m−dσu[(σu+d)+m]m2ϕ(u)du+∫m−dσ−dσu[m−(σu+d)]m2ϕ(u)du =σm2[2Φ(−dσ)−Φ(−m−dσ)−Φ(m−dσ)].

In this way, we obtain as

 I2 =1m2{dσ[ϕ(m+dσ)+ϕ(mdσ)−2ϕ(dσ)]+(σ2+dm+d2)Φ(m+dσ)+ +(σ2−dm+d2)Φ(d−mσ)−2(σ2+d2)Φ(dσ)}.

Finally, substituting the integrals and into (???), we have

 δT(d)=(1−α)S1(d)αm2σϕ(dσ)+(1−α)S2(d).

## 4 Default Prior Hyperparameters

Methods and criteria for determination of the involved parameters and hyperparameters to estimate the coefficients are important in Bayesian procedures. In the framework of Bayesian shrinkage with beta prior, the choices of the parameter of the random error distribution and the hyperparameters , and of the beta prior distribution of the wavelet coefficient are required. We present the methods and criteria already available in the literature for such choices and used in simulation and application studies.

Based on the fact that much of the noise information present in the data can be obtained on the finest resolution scale, for the robust estimation, Donoho and Johnstone (1994) suggested robust estimator

 ^σ=median{|dJ−1,k|:k=0,...,2J−1}0.6745. (4.1)

Angelini et al. (2004) suggested the hyperparameters and be dependent on the level of resolution according to the expressions

 α=α(j)=1−1(j−J0+1)γ (4.2)

and

 m=m(j)=maxk{|djk|}, (4.3)

where , is the primary resolution level and . They also suggested that in the absence of additional information, can be adopted.

Chaloner et al. (1983) proposed a method for choosing the hyperparameters of the beta distribution as a priori distribution for the probability of success in each Bernoulli assay of the binomial distribution. Duran and Booker (1988) propose the percentile method. For

and fixed, is chosen so that

 P(θ≤k)=p (4.4)

i.e,

 ∫k−m(m2−θ2)(a−1)(2m)(2a−1)B(a,a)dθ=p. (4.5)

Thus, the choice of is made by determining the probability of occurrence of a particular event . This procedure is interesting because of the greater facility of subjective determination of a

probability, that is, it is simpler to cognitively assign a probability to a certain event than to directly assign a value to the parameter of a probability distribution.

## 5 Simulation Studies

Simulation studies were done to evaluate the performance of the shrinkage rules under beta prior distribution for the particular cases in which the hyperparameter assumes the fixed values and triangular distribution and to compare them with the performances of some available shrinkage methods in the literature, namely, Universal Soft Thresholding (UNIV), False Discovery Rate (FDR), Cross Validation (CV), and Bayesian shrinkage rule under normal prior (BNormal). We also considered the shrinkage rule under Bickel prior, seggested by Angelini and Vidakovic (2004). They proved that the shrinkage rule under this prior is approximately -minimax for the class of all symmetric unimodal priors bounded on , . When increases, the weak limit of the least Bickel (1981) proved that the least favorable prior in is approximately (in sense of weak distributional limits) . Applying this result in our context, we have that the Bickel shrinkage is induced by prior

 π(θ)=αδ0+(1−α)1mcos2(πθ2m)I[−m,m](θ).

The corresponding Bayes rule does not have a simple analytical form and needs to be numerically computed. The hyperparameters and were selected according to the proposals described in Section 4.

To perform the simulation, the rules were applied in the Donoho-Johnstone (DJ) test functions. These functions, shown in Figure 5.1, are widely used in the literature for comparison of wavelet-based methods. These are four functions, called Bumps, Blocks, Doppler and Heavisine, which represent some characteristics of curves in real problems.

For each test function, three sample sizes were selected, and , and for each point, a normal error with zero mean and variance was added, where was selected according to three signal to noise ratio (SNR), 3, 5 and 7.

We used the mean squared error (MSE), as performance measure of the shrinkage rules. For each function, the process was repeated times and a comparison measure, the average of the obtained MSEs, , was calculated for each rule as shown in Tables 5.1 and 5.2. Boxplots of the MSEs of the shrinkage rules for and SNR=5 ares shown in Figure 5.2.