I Introduction
In signal processing, machine learning and statistics, a key aspect of modeling concerns the compact representation of data. Consider an $n$-dimensional signal $x=\Psi\theta$, where $\Psi\in\mathbb{R}^{n\times n}$ and $\theta\in\mathbb{R}^n$; the columns of $\Psi$ define a basis for representation of the signal, and $\theta$ are the basis coefficients. Two important means of constituting the columns of $\Psi$ are via orthonormal wavelets or the block discrete cosine transform (DCT) [1]. The former is used in the JPEG2000 compression standard, and the latter in the JPEG standard; compression is manifested because many of the components of $\theta$ may be discarded without significant impact on the accuracy of the reconstructed signal [2, 3]. Wavelet and block-DCT-based compression are applicable to a wide class of natural data. For such data, the original $\theta$ typically has no values that are exactly zero, but many that are relatively small, and hence in this case $\theta$ is termed "compressible."

In compression, one typically acquires the uncompressed data, sends it through filters defined by $\Psi$, and then performs compression on the resulting $\theta$. In inverse problems, one is not given $x$, and the goal is to recover it from given data. For example, in compressive sensing (CS) [4, 5, 6] one is typically given linear projections of the data, $y=\Phi x+\epsilon$, where $\Phi\in\mathbb{R}^{m\times n}$ with $m<n$, and $\epsilon$ is measurement noise (ideally $\epsilon=0$). The goal is to recover $x$ from $y$. In denoising applications [7] $\Phi$ may be the identity matrix, and the goal is to recover $x$ in the presence of noise $\epsilon$, where the noise may be non-Gaussian. If one is performing recovery of missing pixels [8] (the "inpainting" problem), then $\Phi$ is defined by a subset of the rows of the identity matrix. Other related problems include deblurring, where $\Phi$ may be a blur kernel [9].
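To make the notion of compressibility concrete, the following numpy sketch (our own illustration, not the paper's code; the basis construction and the smooth test signal are assumed choices) builds an orthonormal DCT basis, expands a signal in it, and reconstructs the signal from only its ten largest-magnitude coefficients:

```python
import numpy as np

def dct_basis(n):
    # Orthonormal DCT-II matrix; returned columns are cosine basis vectors.
    k = np.arange(n)[:, None]          # frequency index
    i = np.arange(n)[None, :]          # sample index
    B = np.cos(np.pi * k * (2 * i + 1) / (2 * n)) * np.sqrt(2.0 / n)
    B[0] /= np.sqrt(2.0)               # DC row gets the 1/sqrt(n) scaling
    return B.T

n = 256
Psi = dct_basis(n)
t = np.linspace(0.0, 1.0, n)
x = np.exp(-((t - 0.5) / 0.1) ** 2)    # smooth test signal (a Gaussian bump)
theta = Psi.T @ x                       # basis coefficients (Psi is orthonormal)

# Keep only the 10 largest-magnitude coefficients ("compression")
keep = np.argsort(-np.abs(theta))[:10]
theta_c = np.zeros(n)
theta_c[keep] = theta[keep]
x_hat = Psi @ theta_c
rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
```

Although none of the coefficients are exactly zero, the relative reconstruction error is small: the signal is compressible rather than exactly sparse.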
To solve these types of problems, it is important to impart prior knowledge about the coefficients, where compressibility is widely employed. As a surrogate for compressibility, there is much work on assuming that the coefficient vector is exactly sparse [4, 5, 6]; the residuals associated with ignoring the small components are absorbed in the noise. This problem has been considered from a large range of perspectives, using methods from optimization [10, 11, 12] as well as Bayesian approaches [13, 14, 15]. In this paper we take a Bayesian perspective. However, it is demonstrated that many of the shrinkage priors we develop from that perspective have a corresponding regularizer from an optimization standpoint, particularly when considering a maximum a posteriori (MAP) solution.
In Bayesian analysis researchers have developed methods like the relevance vector machine (RVM) [16], which has been applied successfully in CS [13]. Spike-slab priors have also been considered, in which exact sparsity is imposed [14]. Methods have also been developed to leverage covariates (e.g., the spatial locations of pixels in an image [8]), using methods based on kernel extensions. Inference has been performed efficiently using techniques like Markov chain Monte Carlo (MCMC)
[8], variational Bayes [17], belief propagation [15], and expectation propagation [18]. Expectation-maximization (EM) has been considered [19] from the MAP perspective.

Recently there has been significant interest in placing "global-local shrinkage" priors [20, 21] on the coefficients. Each such prior represents a coefficient as a zero-mean Gaussian whose variance is the product of a "local" and a "global" parameter, with a prior on the local parameters highly concentrated at the origin but with heavy tails. The local parameters, via the heavy-tailed prior, dictate which coefficients have significant amplitude. The global parameter scales the coefficients such that they are of the appropriate amplitude to represent the observed data. Recent examples of such priors include the horseshoe [22] and three-parameter-beta priors [23].
In this line of work the key to the imposition of compressibility is the heavy-tailed prior on the local parameters. An interesting connection has been made recently between such priors and increments of general Lévy processes [21]. It has been demonstrated that nearly all of the above shrinkage priors may be viewed in terms of increments of a Lévy process, with appropriate Lévy measure. Further, this same Lévy perspective may be used to manifest the widely employed spike-slab prior. Hence, the Lévy perspective is a valuable framework from which to view almost all priors placed on the coefficients. Moreover, from the perspective of optimization, many regularizers have close linkages to the Lévy process [24].
All of the shrinkage priors discussed above assume that the elements of the coefficient vector are exchangeable; i.e., the elements are drawn i.i.d. from a heavy-tailed distribution. However, the coefficients are often characterized by a tree structure, particularly in the context of a wavelet [2, 14] or block-DCT [3, 17] representation. It has been recognized that when solving an inverse problem, leveraging this known tree structure may often manifest significantly improved results [25].
The principal contribution of this paper concerns the extension of the aforementioned shrinkage priors into a new framework that exploits the known tree structure. We primarily focus on one class of such shrinkage priors, which has close connections from an optimization perspective to adaptive Lasso [26]. However, we also demonstrate that this specific framework is readily generalized to all of the shrinkage priors that may be represented from the Lévy perspective. This paper therefore extends the very broad class of Lévy-based shrinkage priors, leveraging the known tree structure associated with the coefficients.
We principally take a Bayesian perspective, and perform computations using MCMC and variational Bayesian analysis. However, we also perform MAP-based point estimation, which demonstrates that this general tree-based structure may also be extended to an optimization perspective. Specific applications considered are estimation of the coefficients in the context of CS and in denoising. For the latter we consider non-Gaussian noise, specifically noise characterized by the sum of Gaussian and spiky components. This is related to the recent interest in robust principal component analysis [27]. We demonstrate that this noise may also be viewed from the perspective of a Lévy jump process. State-of-the-art results are realized for these problems.

The remainder of the paper is organized as follows. The basic setup of the inverse problems considered is detailed in Section II. In that section we also introduce the specific class of global-local priors for which example results are computed. In Section III we discuss the method by which we extend such a shrinkage prior to respect the tree structure of the coefficients, when associated with expansions tied to wavelet, block-DCT and other related bases. It is demonstrated in Section IV how the above model is a special case of a general class of tree-based models, based on the Lévy process. As indicated above, this is an active area of research, and in Section V we provide a detailed discussion of how the current work is related to the literature. In Section VI we discuss how spiky noise is modeled, and its connection to the Lévy process. Three different means of performing inference are discussed in Section VII, with further details in the Appendix. An extensive set of numerical results and comparisons to other methods are presented in Section VIII, with conclusions discussed in Section IX.
II Tree-Based Representation and Coefficient Model
II-A Measurement model
Consider the wavelet transform of a two-dimensional (2D) image; the image pixels are vectorized to $x\in\mathbb{R}^n$. Under a wavelet basis $\Psi$, $x$ can be expressed as $x=\Psi\theta$, where the columns of $\Psi$ represent the orthonormal wavelet basis vectors, and $\theta$ represents the wavelet coefficients of $x$. We here focus the discussion on a wavelet transform, but the same type of tree-based model may be applied to a block discrete cosine transform (DCT) representation [3, 17]. In Section VIII experimental results are shown based on both wavelet and block-DCT tree-based decompositions. However, for clarity of exposition, when presenting the model, we provide details in the context of the wavelet transform.
The measurement model for $x$ is assumed manifested in terms of linear projections, defined by the rows of a matrix $\Phi$. Assuming measurement noise $\epsilon$, we have:

$$y = \Phi x + \epsilon = \Phi\Psi\theta + \epsilon \qquad (1)$$

where $y\in\mathbb{R}^m$. In compressive sensing (CS) [4] $m<n$, and it is desirable that the rows of $\Phi$ are incoherent with the columns of $\Psi$, such that $\Phi\Psi$ is a dense matrix [4, 5, 28]. If one directly observes the image pixels, which we will consider in the context of image-denoising applications, $\Phi=\mathbf{I}$, with $\mathbf{I}$ symbolizing the identity matrix.
Assuming i.i.d. Gaussian noise with precision $\alpha$, we have

$$y \sim \mathcal{N}\!\big(\Phi\Psi\theta,\ \alpha^{-1}\mathbf{I}\big) \qquad (2)$$
The assumption of i.i.d. Gaussian noise may be inappropriate in some settings, and in Section VI we revisit the noise model.
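As a small numerical illustration of the measurement model (an assumed sketch with arbitrary dimensions; the identity basis stands in for the wavelet basis), compressive measurements of a sparse coefficient vector can be generated as:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 128, 48                                  # signal and measurement dimensions, m < n
Psi = np.eye(n)                                 # stand-in orthonormal basis
theta = np.zeros(n)
support = rng.choice(n, size=5, replace=False)
theta[support] = rng.normal(0.0, 1.0, size=5)   # a few significant coefficients
x = Psi @ theta

Phi = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))  # dense random projection
alpha = 1e4                                     # noise precision
eps = rng.normal(0.0, 1.0 / np.sqrt(alpha), size=m)
y = Phi @ x + eps                               # compressive measurements
```

A dense i.i.d. Gaussian $\Phi$ is incoherent with any fixed orthonormal basis with high probability; for the denoising setting one would instead take $\Phi=\mathbf{I}$.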
II-B Modeling the wavelet scaling coefficients
In a wavelet decomposition of a 2D image, there are three sets of trees: one associated with high-high (HH) wavelet coefficients, one for high-low (HL) wavelet coefficients, and one for low-high (LH) coefficients [1]. In addition to the HH, HL and LH wavelet coefficients, there is a low-low (LL) scaling-coefficient subimage, which does not have an associated tree; the LL scaling coefficients constitute a low-frequency representation of the original image.
The scaling coefficients are in general not sparse (since the original image is typically not sparse in the native pixel basis), and hence the scaling coefficients are modeled as

$$\theta_{0s} \sim \mathcal{N}\!\big(0,\ (\gamma_0\alpha)^{-1}\big) \qquad (3)$$

where $\gamma_0$ controls the "global" scale of the scaling coefficients, and $\alpha$ is the noise precision as above ($\gamma_0$ represents the relative precision of the scaling coefficients, with respect to the noise precision $\alpha$). Non-informative gamma priors are placed on $\gamma_0$ and $\alpha$.
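Gamma priors on precisions are convenient because they are conjugate to the Gaussian likelihood. The following sketch (our own illustration, with assumed hyperparameters) shows the resulting closed-form posterior update for a precision given zero-mean Gaussian draws:

```python
import numpy as np

rng = np.random.default_rng(1)
a0, b0 = 1e-6, 1e-6                  # non-informative Gamma(a0, b0) prior (assumed values)
true_prec = 4.0
w = rng.normal(0.0, 1.0 / np.sqrt(true_prec), size=5000)  # "scaling coefficients"

# Conjugate posterior over the precision: Gamma(a0 + N/2, b0 + sum(w^2)/2)
a_post = a0 + 0.5 * len(w)
b_post = b0 + 0.5 * np.sum(w ** 2)
post_mean = a_post / b_post          # posterior-mean estimate of the precision
```

With a vague prior the posterior mean is driven almost entirely by the data, recovering the true precision here.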
II-C Compressibility of wavelet coefficients
Let $\theta_{ls}$ represent coefficient $s$ at level $l$ in the wavelet tree (for either an HH, HL, or LH tree). To make connections to existing shrinkage priors and regularizers (e.g., adaptive Lasso [26]), we consider a generalization of the double-exponential (Laplace) prior [29, 30]. In Section IV we discuss that this model corresponds to one example of a broad class of shrinkage priors based on Lévy processes, to which the model may be generalized.
Coefficient $\theta_{ls}$ is assumed drawn from the distribution

$$p(\theta_{ls}\,|\,\gamma_l,\lambda_{ls}) = \frac{1}{2\gamma_l\lambda_{ls}}\exp\!\left(-\frac{|\theta_{ls}|}{\gamma_l\lambda_{ls}}\right) \qquad (4)$$
Parameter $\gamma_l$ is a "global" scaling for all wavelet coefficients at layer $l$, and $\lambda_{ls}$ is a "local" weight for wavelet coefficient $s$ at that level. We place a gamma prior on the $\lambda_{ls}$, and as discussed below one may set the hyperparameters to impose that most $\lambda_{ls}$ are small, which encourages that most $|\theta_{ls}|$ are small. When inferring a point estimate of the model parameters by maximizing the log posterior, one observes that the log of the prior in (4) corresponds to adaptive-Lasso regularization [26].

The model (4) for wavelet coefficient $\theta_{ls}$ may be represented in the hierarchical form
$$\theta_{ls}\sim\mathcal{N}(0,\tau_{ls}),\qquad \tau_{ls}\sim\mathrm{Exp}\!\left(\frac{1}{2(\gamma_l\lambda_{ls})^2}\right),\qquad \lambda_{ls}\sim\mathrm{Gamma}(a_l,b_l) \qquad (5)$$

introducing latent variables $\tau_{ls}$, rather than marginalizing them out as in (4); a vague/diffuse gamma prior is again placed on the scaling parameters $\gamma_l$. While we have introduced many new latent variables $\tau_{ls}$, the form in (5) is convenient for computation, and with appropriate choice of $(a_l,b_l)$ most precisions $\tau_{ls}^{-1}$ are encouraged to be large. The large $\tau_{ls}^{-1}$ correspond to small $|\theta_{ls}|$, and therefore this model imposes that most $|\theta_{ls}|$ are small, i.e., it imposes compressibility. Below we concentrate on the model for the $\lambda_{ls}$.
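The equivalence between the marginal form (4) and the hierarchical form (5) is the classical result that a Laplace distribution is a Gaussian whose variance is exponentially distributed. A numerical sketch (illustrative; the rate parameter is fixed to 1):

```python
import numpy as np

rng = np.random.default_rng(2)
lam = 1.0                                   # Laplace rate parameter
N = 200_000
# variance ~ Exp(rate = lam^2 / 2), i.e., numpy scale = 2 / lam^2
tau = rng.exponential(scale=2.0 / lam**2, size=N)
theta = rng.normal(0.0, np.sqrt(tau))       # Gaussian given its latent variance
# Marginally, theta ~ Laplace(rate lam), so E|theta| = 1/lam
emp_abs_mean = np.mean(np.abs(theta))
```

Working with the conditionally Gaussian form is what makes blockwise Gibbs and variational updates tractable.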
II-D Shrinkage priors versus spike-slab priors
There are models of coefficients that impose exact sparsity [31, 32, 14], while the model in (5) imposes compressibility (many coefficients that are small, but not exactly zero). Note that if a component of the coefficient vector is small, but is set exactly to zero via a sparsity prior/regularizer, then the residual will be attributed to the additive noise. This implies that sparsity-promoting priors will tend to over-shrink, underestimating components of the coefficient vector and overestimating the noise variance. This partially explains why we have consistently found that compressibility-promoting priors like (5) are often more effective in practice on real (natural) data than exact-sparsity-promoting models, like the spike-slab setup in [14]. These comparisons are performed in detail in Section VIII. In Section V-A we make further connections to shrinkage priors developed in the literature.

III Hierarchical Tree-Based Shrinkage
III-A Hierarchical gamma shrinkage
If we simply impose the gamma prior on the $\lambda_{ls}$, with hyperparameters set to encourage compressibility, we do not exploit the known tree structure of wavelets and what that tree imposes on the characteristics of the multilayer coefficients [1, 33]. A wavelet decomposition of a 2D image yields a quadtree hierarchy (four children coefficients under each parent coefficient), as shown in Figure 1. A key contribution of this paper is the development of a general shrinkage-based decomposition that respects the tree structure.
Let $\boldsymbol{\theta}$ represent all the coefficients in a wavelet decomposition of a given image, where $\boldsymbol{\theta}_l$ represents all coefficients across all trees at level $l$ in the wavelet hierarchy. We here consider all the wavelet trees for one particular class of wavelet coefficients (i.e., HH, HL or LH), and the model below is applied independently as a prior for each. Level $l=0$ is associated with the scaling coefficients, and $l=1,\ldots,L$ correspond to levels of wavelet coefficients, with $l=1$ the root-node level. We develop a prior that imposes shrinkage within each layer $l$, while also imposing structure between consecutive layers. Specifically, we impose the persistence property [33] of a wavelet decomposition of natural imagery: if wavelet coefficient $s$ at level $l$, denoted $\theta_{ls}$, is large/small, then its children coefficients are also likely to be large/small.
A parameter $\lambda_{ls}$ is associated with each node at each level in the tree (see Figure 1). Let $\boldsymbol{\lambda}_l$ represent the set of parameters for all coefficients at layer $l$. The parameters $\{\lambda_{l+1,s'}\}$ represent the children parameters under parent parameter $\lambda_{ls}$.
If there are $K_1$ wavelet trees associated with a given 2D image, there are $K_1$ root nodes at the top layer $l=1$. For each of the $K_1$ coefficients associated with the root nodes, corresponding shrinkage parameters are drawn

$$\lambda_{1s}\sim\mathrm{Gamma}(a/K_1,\ b) \qquad (6)$$

which for large $K_1$ encourages that most $\lambda_{1s}$ will be small, with a few outliers; this is a "heavy-tailed" distribution for appropriate selection of the hyperparameters (the promotion of compressibility is made more explicit in Section IV).

Assuming $K_l$ nodes at layer $l$, we define an "increment length" for node $s$ at level $l$:
$$\Delta_{ls} = \frac{\lambda_{ls}}{\sum_{s'=1}^{K_l}\lambda_{ls'}} \qquad (7)$$
To constitute each child parameter $\lambda_{l+1,s'}$, for each of the four children of node $s$, we independently draw

$$\lambda_{l+1,s'}\sim\mathrm{Gamma}(a\,\Delta_{ls}/4,\ b) \qquad (8)$$
Because of the properties of the gamma distribution, if $\Delta_{ls}$ is relatively large (i.e., $\lambda_{ls}$ is relatively large), the children parameters are also encouraged to be relatively large, since the integration window is relatively large; the converse is true if $\Delta_{ls}$ is small.

The use of the term "increment" is tied to the generalization of this model within the context of Lévy processes, discussed in Section IV, where further justification is provided for associating increment lengths with shrinkage parameters.
III-B Dirichlet simplification
In (5), the parameter $\gamma_l$ provides a scaling of all precisions at layer $l$ of the wavelet tree. Of most interest in defining which wavelet coefficients have large values are the relative weights of the $\lambda_{ls}$ at that layer. Therefore, for computational convenience, we consider scaled (normalized) versions of the $\lambda_{ls}$. Specifically, consider the normalized vector
$$\tilde{\lambda}_{ls} = \frac{\lambda_{ls}}{\sum_{s'=1}^{K_l}\lambda_{ls'}} \qquad (9)$$
which corresponds to a draw from a Dirichlet distribution:
$$\big(\tilde{\lambda}_{l1},\ldots,\tilde{\lambda}_{lK_l}\big)\sim\mathrm{Dir}\!\big(a\,\Delta_{l-1,\mathrm{pa}(1)}/4,\ \ldots,\ a\,\Delta_{l-1,\mathrm{pa}(K_l)}/4\big) \qquad (10)$$

where $\Delta_{l-1,\mathrm{pa}(s)}$ is the increment length of the parent node, and $\mathrm{pa}(s)$ is the parent of the $s$th coefficient at layer $l$. The last equation in (5) is replaced by (10), and all other aspects of the model remain unchanged; the $\tilde{\lambda}_{ls}$ are normalized parameters, with the scaling of the parameters handled by $\gamma_l$ at layer $l$. One computational advantage of this normalization is that all elements of $\tilde{\boldsymbol{\lambda}}_l$ are updated as a block, rather than one at a time, as implied by the last equation in (5).
The model for parameters at each level of the tree may be summarized as follows. At the root layer

$$\tilde{\boldsymbol{\lambda}}_1\sim\mathrm{Dir}(a/K_1,\ \ldots,\ a/K_1) \qquad (11)$$

If $\{\Delta_{ls}\}$ are the increment lengths at level $l$, then the child parameters at level $l+1$ follow (10), with shape parameters defined by the parental increment lengths.
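The Dirichlet simplification rests on the standard fact that independent gamma draws with a shared rate, normalized by their sum, yield a Dirichlet vector with the same shape parameters. A quick numerical check (arbitrary shapes):

```python
import numpy as np

rng = np.random.default_rng(4)
a = np.array([0.5, 1.0, 3.0])                    # per-component shape parameters
g = rng.gamma(shape=a, scale=1.0, size=(200_000, 3))
lam_tilde = g / g.sum(axis=1, keepdims=True)     # normalized gamma draws
# These are Dirichlet(a) samples: E[lam_tilde_i] = a_i / sum(a)
```

The shared rate cancels in the normalization, which is why only the shape parameters (the increment lengths) appear in (10).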
IV Multiscale Lévy Process Perspective
IV-A Lévy process and infinite divisibility
A Lévy process [34] is a continuous-time stochastic process $X_t$ with $X_0=0$, and with increments that are independent and stationary. Specifically, for any $t,\Delta>0$, an increment of width $\Delta$ is defined by $X_{t+\Delta}-X_t$, and the distribution of the increment depends only on $\Delta$. Further, the random variables associated with non-overlapping increments are independent. While this stochastic process can take values in a multidimensional space, we consider $X_t$ on the real line. The characteristic function of the random variable $X_t$ is

$$\mathbb{E}\big[e^{iuX_t}\big] = \exp\!\Big(t\Big[i\delta u - \tfrac{1}{2}\sigma^2 u^2 + \int_{\mathbb{R}\setminus\{0\}}\big(e^{iux}-1-iux\,\mathbf{1}_{|x|\leq 1}\big)\,\nu(dx)\Big]\Big) \qquad (12)$$

where $\nu$ is a Lévy measure satisfying $\int_{\mathbb{R}\setminus\{0\}}\min(1,x^2)\,\nu(dx)<\infty$. The terms $\delta$ and $\sigma^2$ correspond to, respectively, the drift and variance per unit time of Brownian motion; over time $t$, the Brownian contribution to $X_t$ is drawn from a normal distribution with mean $\delta t$ and variance $\sigma^2 t$. The term associated with Lévy measure $\nu$ corresponds to a Poisson process (PP) [35] over the time period, with PP rate defined by $\nu$; the PP contribution to $X_t$ is the accumulation of the jumps over $[0,t]$. The latter corresponds to a pure-jump process. Importantly for our purposes, the parameter $t$ controls the mean and variance of $X_t$, with these terms increasing with $t$.
The infinitely divisible nature of a Lévy process [34] allows us to express $X_T=\sum_{i=1}^{n}d_i$ for any $n$, where each $d_i$ is an independent draw from the same Lévy process as above, but now with time increment $T/n$. Each of the $d_i$ is the sum of a Brownian-motion term, with drift $\delta T/n$ and variance $\sigma^2 T/n$, and a jump term with Lévy measure scaled by $T/n$. As first discussed in [20], by choosing particular settings of $\delta$, $\sigma^2$ and $\nu$, the vector of increments $(d_1,\ldots,d_n)$ may be used to impose sparsity or compressibility.
Specifically, assume that we wish to impose that the components of $\theta$ are compressible. Then the $i$th component may be assumed drawn $\theta_i\sim\mathcal{N}(0,\gamma d_i)$, where $\gamma$ is a "global" scaling, independent of $i$. Since only a small subset of the $d_i$ are relatively large for an appropriate Lévy process (with $\delta=0$, $\sigma^2=0$, and a Lévy measure characterized by a small number of jumps [20]), most components are encouraged to be small. This global scaling $\gamma$ and the local increments $d_i$ look very much like the class of models in (5), which we now make more precise.
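Infinite divisibility can be checked directly for the gamma process used later: a $\mathrm{Gamma}(aT,b)$ increment has the same distribution as the sum of $n$ independent $\mathrm{Gamma}(aT/n,b)$ sub-increments. A sketch comparing first and second moments (illustrative parameters):

```python
import numpy as np

rng = np.random.default_rng(5)
a, T, n_sub, N = 2.0, 1.0, 8, 300_000
whole = rng.gamma(a * T, 1.0, size=N)                        # one increment of width T
parts = rng.gamma(a * T / n_sub, 1.0, size=(N, n_sub)).sum(axis=1)  # sum of sub-increments
```

Both empirical means and variances agree (here $aT/b = 2$), consistent with the two constructions being equal in distribution.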
IV-B Tree-based Lévy process
We develop a hierarchical Lévy process, linked to the properties of tree-based wavelet coefficients; from that, we will subsequently make a linkage to the hierarchical prior developed in Section III. At each layer $l$ in the tree, we impose a total time window of length $T_l$. An underlying continuous-time Lévy process $X^{(l)}$ is assumed for each layer $l$, for $l=1,\ldots,L$. Assuming $K_l$ wavelet coefficients at layer $l$, we consider positive increment lengths $\Delta_{l1},\ldots,\Delta_{lK_l}$, with $\sum_s\Delta_{ls}=T_l$. The $s$th increment of the Lévy process at layer $l$, denoted $d_{ls}$, is defined as the change in $X^{(l)}$ across the $s$th window.
The Lévy-process-based tree structure, analogous to that considered in Section III, is defined as follows. At the root layer $l=1$, with $K_1$ wavelet coefficients, we define $\Delta_{1s}=T_1/K_1$ for all $s$. From Lévy process $X^{(1)}$ we define increments $d_{1s}$, $s=1,\ldots,K_1$.
Assume increments $d_{ls}$, $s=1,\ldots,K_l$, at layer $l$. The lengths of the increments at layer $l+1$ are defined analogous to (7); specifically

$$\Delta_{l+1,s'} = \frac{T_{l+1}}{4}\cdot\frac{d_{l,\mathrm{pa}(s')}}{\sum_{s''=1}^{K_l} d_{ls''}} \qquad (13)$$

The children of coefficient $s$ at layer $l$ therefore have Lévy increment widths proportional to $d_{ls}$. Hence, if an increment $d_{ls}$ is relatively large/small compared to the other increments at layer $l$, then the window lengths of its children coefficients are also relatively large/small. Since the expected relative strength of a Lévy increment is tied to the width of the associated increment support, this construction has the property of imposing persistence in the strength of the wavelet coefficients across levels (relatively large/small parent coefficients encourage relatively large/small children coefficients).
IV-C Shrinkage and sparsity imposed by different Lévy families
Each choice of the Lévy parameters $(\delta,\sigma^2,\nu)$ yields a particular hierarchy of increments. In the context of imposing compressibility or sparsity, we typically are interested in $\delta=0$ and $\sigma^2=0$, and different choices of $\nu$ yield particular forms of sparsity or shrinkage. For example, consider the gamma process, with

$$\nu(dx) = a\,x^{-1}e^{-bx}\,dx \qquad (14)$$

defined for $x>0$. Using notation from above, for a Lévy process drawn from a gamma process, an increment of width $\Delta$ is a random variable distributed $\mathrm{Gamma}(a\Delta,b)$. Therefore, the model considered in Section III is a special case of the hierarchical tree-based Lévy process, with a gamma-process Lévy measure. The increments correspond to the $\lambda_{ls}$ in (5).
Note the singularity in (14) at $x=0$, and also the heavy tails for large $x$. Hence, the gamma process imposes that most increments will be very small, since the gamma-process measure is concentrated around the origin, but the heavy tails will yield some large increments; this is why the increments of appropriate Lévy processes have close connections to compressibility-inducing priors [21].
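The effect of the $x^{-1}$ singularity and the exponential tail in (14) is easy to see by sampling a gamma-process increment over a short window, i.e., a gamma random variable with a very small shape parameter (the values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
# Increment over a narrow window: shape parameter a * Delta << 1
lam = rng.gamma(shape=0.05, scale=1.0, size=100_000)
frac_tiny = np.mean(lam < 1e-3)   # overwhelming mass essentially at zero
frac_big = np.mean(lam > 1.0)     # yet a small fraction of large "outlier" draws
```

Most draws are negligibly small while a few percent are large, which is precisely the compressibility-inducing behavior described above.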
There are many choices of Lévy measures one may consider; the reader is referred to [21, 20, 36] for a discussion of these different Lévy measures and how they impose different forms of sparsity/compressibility. We have found that the gamma-process construction for the shrinkage parameters, which corresponds to Section III, works well in practice for a wide range of problems, and therefore it is the principal focus of this paper.
We briefly note another class of Lévy measures that is simple, and has close connections to related models, discussed in Section V. Specifically, consider $\nu(dx)=c\,g(x)\,dx$, where $c$ is a positive constant and $g(x)$ is a probability density function. This corresponds to a compound Poisson process, for which atoms are situated uniformly at random within the window, and the amplitude of each atom is drawn i.i.d. from $g$. In this case the process is a jump process that can be positive and negative, if $g$ admits random variables on the real line (e.g., if $g$ corresponds to a Gaussian distribution). Rather than using this Lévy process to model the shrinkage parameters, which must be positive, the increments are now used to directly model the wavelet coefficients. We demonstrate in Section V-B that this class of hierarchical Lévy constructions has close connections to existing models in the literature for modeling wavelet coefficients.

V Connections to Previous Work
V-A Other forms of shrinkage
Continuous shrinkage priors, like (5) and its hierarchical extensions developed here, enjoy advantages over conventional discrete mixture [37, 38] (i.e., spike-slab) priors. First, shrinkage may be more appropriate than selection in many scenarios (selection is manifested when exact sparsity is imposed). For example, wavelet coefficients of natural images are in general compressible [39], with many small but nonzero coefficients. Second, continuous shrinkage avoids several computational issues that may arise with discrete mixture priors (which impose explicit sparsity) [40], such as a combinatorial search over a large model space, and high-dimensional marginal-likelihood calculation. Shrinkage priors often can be characterized as Gaussian scale mixtures (GSM), which naturally lead to simple blockwise Gibbs sampling updates with better mixing and convergence rates in MCMC [20, 23]. Third, continuous shrinkage reveals close connections between Bayesian methodologies and frequentist regularization procedures. For example, in Section II-C we noted the connection of the proposed continuous-shrinkage model to adaptive Lasso. Additionally, the iteratively reweighted minimization schemes of [41] and [42] may also be derived from the EM updating rules in GSM and Laplace scale mixture (LSM) models [43, 19], respectively.
Recently, aiming to better mimic the marginal behavior of discrete mixture priors, [20, 21] present a new global-local (GL) family of GSM priors,

$$\theta_j \sim \mathcal{N}\!\big(0,\ \lambda_j^2\tau^2\big),\qquad \lambda_j^2\sim\pi_L(\lambda_j^2),\qquad \tau^2\sim\pi_G(\tau^2) \qquad (15)$$
where $\pi_L$ and $\pi_G$ are the prior distributions of the local and global variance parameters; as noted above, such priors have been an inspiration for our model. The local shrinkage parameters $\lambda_j$ and global shrinkage parameter $\tau$ are able to offer sufficient flexibility in high-dimensional settings. There exist many options for the priors on the Gaussian scales; for example, in the three-parameter beta normal case [23],

$$f(\kappa_j;\,a,b,\phi) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\,\phi^{b}\,\kappa_j^{b-1}(1-\kappa_j)^{a-1}\big[1+(\phi-1)\kappa_j\big]^{-(a+b)} \qquad (16)$$

where $\kappa_j=1/(1+\lambda_j^2)$ is the shrinkage factor. The horseshoe prior [22] is a special case of (16) with $a=1/2$, $b=1/2$, $\phi=1$.
GSM priors further extend to the LSM [43] by adding one additional level of hierarchy in (15), which captures higher-order dependences. As a special case of the LSM, the normal-exponential-gamma (NEG) prior [44] provides a Bayesian analog of adaptive Lasso [26]. The Dirichlet-Laplace (DL) prior employed here can be interpreted as a non-factorial LSM prior which resembles the point-mass mixture prior in a joint sense, with frequentist optimality properties well studied in [45].
None of the above shrinkage priors take into account the multiscale nature of many compressible coefficients, like those associated with wavelets and the block DCT. This is a key contribution of this paper.
V-B Other forms of imposing tree structure
There has been recent work exploiting the wavelet tree structure within the context of compressive sensing (CS) [14, 17, 46]. In these models a two-state Markov tree is constituted, with one state corresponding to large-amplitude wavelet coefficients and the other to small coefficients. If a parent node is in a large-amplitude state, the Markov process is designed to encourage the children coefficients to also be in this state; a similar state-transition property is imposed for the small-amplitude state.
The tree-based model in [14] is connected to the construction in Section IV, with a compound Poisson Lévy measure. Specifically, consider the case for which $\delta=0$ and $\sigma^2=0$, and the jump process is a compound Poisson process with a zero-mean Gaussian amplitude density, with a precision that depends on the wavelet layer. Priors may be placed on the jump rate and the precision. At layer $l$, a Poisson-distributed number of "jumps" are drawn, and for each a random amplitude is drawn from the Gaussian; the jumps are placed uniformly at random between [0,1], defining the jump process. Along the lines discussed in Section IV, at the root layer of the Lévy tree, each of the wavelet coefficients is assigned a distinct increment of equal length. As discussed at the end of Section IV-C, here the increments are used to model the wavelet coefficients directly. If one or more of the aforementioned jumps falls within the window of coefficient $s$ at layer $l$, then the coefficient is nonzero, and is Gaussian distributed; if no jump falls within the window, then the associated coefficient is exactly zero. If a coefficient is zero, then all of its descendant wavelet coefficients are zero, using the construction in Section IV. This may be viewed as a two-state model for the wavelet coefficients: each coefficient is either exactly zero, or has a value drawn from a Gaussian distribution. This is precisely the type of two-state model developed in [14], but viewed here from the perspective of a tree-based Lévy process, and particularly a compound Poisson process. In [14] a compound Poisson model was not used, and rather a Markov setup was employed, but the basic concept is the same as above.
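A minimal sketch of this compound-Poisson reading of the two-state model (the jump rate, layer width and amplitude variance are our own illustrative choices): jumps are scattered uniformly on [0, 1], each coefficient owns an equal-width increment, and a coefficient is Gaussian if its window is hit and exactly zero otherwise:

```python
import numpy as np

rng = np.random.default_rng(7)
K = 64                                # coefficients at the root layer
c = 8.0                               # Poisson jump rate over [0, 1] (assumed)
J = rng.poisson(c)                    # number of jumps
locs = rng.uniform(0.0, 1.0, size=J)  # jump locations, uniform on the window
counts, _ = np.histogram(locs, bins=np.linspace(0.0, 1.0, K + 1))

# Two-state outcome: Gaussian value if the window contains a jump, else exactly 0
theta = np.where(counts > 0, rng.normal(0.0, 1.0, size=K), 0.0)
```

With a modest jump rate, the vast majority of coefficients are exactly zero, reproducing the spike-slab behavior within the Lévy-tree construction.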
In [46] the authors developed a method similar to [14], but in [46] the negligible wavelet coefficients were not shrunk exactly to zero. Such a model can also be viewed from the perspective of the Lévy tree, again corresponding to a compound Poisson process. However, now the Brownian variance is nonzero. For an increment in which there are no jumps, the wavelet coefficient is characterized only by the Brownian-motion term, which is associated with small but nonzero wavelet coefficients.
This underscores that the proposed tree-based shrinkage prior, which employs the gamma process, may be placed within the context of the Lévy framework of Section IV, as can [14, 46], which use state-based models. The placement of all of these models in the Lévy framework provides a unification, and also suggests other types of priors that respect the tree-based structure, via different choices of the Lévy measure. This provides a significant generalization of [20, 21], which first demonstrated that many shrinkage and sparsity priors may be placed within the context of the increments of a Lévy process, but did not account for the tree-based structure of many representations, such as wavelets and the block DCT.
V-C Connection to dictionary learning
In Section VIII we present experimental results on denoising images, and we make comparisons to an advanced dictionary-learning approach [8]. It is of interest to note the close connection of the proposed wavelet representation to a dictionary-learning approach. Within the wavelet decomposition, with $K_1$ root nodes, we have $K_1$ trees, and therefore the overall construction may be viewed as a "forest." Each tree has $L$ layers, for an $L$-layer wavelet decomposition, and each tree characterizes a subregion of the image/data. Therefore, our proposed model can be viewed as dictionary learning, in which the dictionary building blocks are wavelets, and the tree-based shrinkage imposes form on the weights of the dictionary elements within one tree (corresponding to an image patch or subregion). Within the proposed gamma-process construction, the "local" parameters account for statistical relationships between coefficients at a given layer, the parent-child encoding accounts for relationships between consecutive layers, and the "global" parameters also account for cross-tree statistical relationships.
VI Generalized Noise Model
In (2) the additive noise was assumed to be i.i.d. zero-mean Gaussian with precision $\alpha$. In many realistic scenarios, the noise may be represented more accurately as the sum of a Gaussian and a spiky term, and in this case we modify the model as

$$y = \Phi\Psi\theta + \epsilon_g + \epsilon_s \qquad (17)$$

where the spiky noise $\epsilon_s$ is also modeled by a shrinkage prior:

$$\epsilon_{s,i} \sim \mathcal{N}\!\big(0,\ \rho_i^{-1}\big),\qquad \rho_i\sim\mathrm{Gamma}(c_0,d_0) \qquad (18)$$

The Gaussian term is still modeled as $\epsilon_g\sim\mathcal{N}(0,\ \alpha^{-1}\mathbf{I})$.
In this case the noise is a Lévy process, with increments defined by the size of the pixels: the Gaussian term corresponds to the Brownian term, with the spiky term manifested by the jump component. Interestingly, the Lévy process is now playing two roles: (i) to model in a novel hierarchical manner the tree structure in the coefficients, and (ii) to separately model the Gaussian plus spiky noise; the particular Lévy triplet need not be the same for these two components of the model.
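A sketch of the Gaussian-plus-spiky noise decomposition (the spike probability and amplitudes below are illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(8)
m = 1000
alpha = 100.0                                      # precision of the Gaussian component
eps_g = rng.normal(0.0, 1.0 / np.sqrt(alpha), size=m)

hit = rng.random(m) < 0.02                         # a small fraction of entries are spikes
eps_s = np.zeros(m)
eps_s[hit] = rng.normal(0.0, 5.0, size=hit.sum())  # large-amplitude jump component
noise = eps_g + eps_s                              # total additive corruption
```

The Gaussian component plays the role of the Brownian term, while the sparse large-amplitude entries play the role of the jumps.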
VII Inference
We developed an MCMC algorithm for posterior inference, as well as a deterministic variational algorithm to approximate the posterior. The latter is easily adapted to develop an Expectation-Maximization (EM) algorithm to obtain MAP point estimates of the wavelet coefficients, and recalls frequentist algorithms like adaptive Lasso [26] and reweighted minimization [42]. Details on the inference update equations are provided in the Appendix, and below we summarize unique aspects of the inference associated with the particular model considered.
VII-A MCMC inference
The MCMC algorithm involves a sequence of Gibbs updates, where each latent variable is resampled conditioned on the rest. Most conditional updates are conjugate, the important exception being the normalized shrinkage parameters at each level; these have conditionals
(19)  
(20)  
Recall that $\tilde{\boldsymbol{\lambda}}_l$ is just $\boldsymbol{\lambda}_l$ normalized. We performed this update using a Metropolis step [47], proposing from the generalized-inverse-Gaussian (GIG) distribution specified by the first term, with the last two terms determining the acceptance probability. In our experiments, we observed consistently high acceptance rates.
VII-B Heuristic variational posterior approximation
Here we deterministically approximate the posterior distribution by a parametric distribution, which we then optimize to match the true posterior distribution. We use a mean-field approximation, assuming a complete factorization across the latent variables. We approximate the distribution of each wavelet coefficient as a Gaussian distribution with its own mean and variance. The marginal distributions over the local scale variables were set to exponential distributions, and the normalized vectors $\tilde{\boldsymbol{\lambda}}_l$ were set as Dirichlet distributions. To optimize these parameters, we used a heuristic very closely related to expectation propagation [48]. For any variable, the corresponding Gibbs update rule from the previous section gives its distribution conditioned on its Markov blanket. We look at this conditional distribution, plugging in the average configuration of the Markov blanket (as specified by the current settings of the posterior approximations). We then calculate the relevant moments of the resulting conditional distribution, and update the variational distribution to match these moments. We repeat this procedure, cycling over all variables, and seeking a compatible configuration of all parameters (corresponding to a fixed point of our algorithm). As mentioned earlier, the resulting update equations recall various frequentist algorithms, and we list them in the Appendix. Two important ones are:

(21)

(22)
where is the index of corresponding to , is the th column of , and denotes the expectation of the entry in . It is worth noting that updating the ’s requires calculating the mean of the distribution specified in equation (20) for the current averages of the posterior approximation. Unfortunately, this has no closed form. One approach is to calculate an importance-sampling-based Monte Carlo estimate of this quantity, using the proposal distribution specified in the previous section. If samples were used to produce this estimate, we call the resulting algorithm VB(). The high acceptance rate of samples from the GIG suggests that it is a reasonably accurate approximation to the intractable distribution. This suggests a simpler approach in which all terms are ignored except for the GIG, whose mean is used to update the ’s. In this case, the update rules become:
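The Monte Carlo estimate underlying VB(num) is a self-normalized importance sampler. A minimal sketch follows; the target and proposal densities are placeholders, since the exact conditional of equation (20) is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)

def snis_mean(samples, log_weight):
    """Self-normalized importance-sampling estimate of E[x] under an
    unnormalized target, given samples from a proposal and
    log_weight(x) = log target(x) - log proposal(x) (up to a constant)."""
    lw = log_weight(samples)
    lw = lw - lw.max()                # stabilize before exponentiating
    w = np.exp(lw)
    return float(np.sum(w * samples) / np.sum(w))

xs = rng.gamma(2.0, 1.0, size=20000)  # proposal: Gamma(2, 1)

# If target == proposal, weights are constant: plain sample mean.
est_flat = snis_mean(xs, lambda x: np.zeros_like(x))

# Target proportional to x times the Gamma(2,1) density, i.e. Gamma(3,1),
# whose mean is 3.
est_gamma3 = snis_mean(xs, np.log)
```

Normalizing by the sum of the weights removes the unknown normalization constant of the target, which is exactly what is needed here since only the unnormalized conditional is available.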
where is the modified Bessel function of the second kind, and . We call this approximation aVB. This approximation can also be interpreted from the perspective of a wavelet hidden Markov (HM) tree [14] in which the dependence always runs from the parent node rather than the child nodes; i.e., the dependence inside the HM tree is unidirectional, from parent to children.
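For reference, the GIG mean used by aVB has a closed form in terms of modified Bessel functions of the second kind. A sketch, assuming the parameterization with density proportional to x^(p-1) exp(-(ax + b/x)/2) (which may differ from the paper's convention), verified against direct numerical integration:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import kv  # modified Bessel function of the second kind

def gig_mean(p, a, b):
    """Mean of the GIG distribution with (assumed) density proportional
    to x**(p - 1) * exp(-(a * x + b / x) / 2) on x > 0."""
    s = np.sqrt(a * b)
    return np.sqrt(b / a) * kv(p + 1, s) / kv(p, s)

# Cross-check the Bessel-ratio formula against numerical integration
# of the unnormalized density (parameter values are illustrative).
p, a, b = 0.5, 2.0, 3.0
density = lambda x: x**(p - 1) * np.exp(-(a * x + b / x) / 2)
Z, _ = quad(density, 0, np.inf)
m1, _ = quad(lambda x: x * density(x), 0, np.inf)
```

The ratio of Bessel functions is the source of the computational bottleneck discussed in Section VIII-B3.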
VII-C MAP estimates via Expectation-Maximization
Closely related to the previous approach is an algorithm that returns MAP estimates of the wavelet coefficients, while (approximately) averaging out all other variables. This maximization over must also be done approximately, but is a straightforward modification of the variational update of . From the properties of the GIG, the Mstep becomes
(23) 
where . The approximate E-step remains unaffected.
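Since equation (23) is not recoverable from the extraction, here is a sketch of the generic property it relies on: the M-step replaces the GIG mean with the GIG mode, which is the positive root of a quadratic (again assuming the density parameterization x^(p-1) exp(-(ax + b/x)/2); the parameter values are illustrative).

```python
import numpy as np

def gig_mode(p, a, b):
    """Mode of the (assumed) GIG density proportional to
    x**(p - 1) * exp(-(a * x + b / x) / 2): the positive root of
    a * x**2 - 2 * (p - 1) * x - b = 0."""
    return ((p - 1) + np.sqrt((p - 1)**2 + a * b)) / a

# The derivative of the log-density should vanish at the mode.
p, a, b = 0.5, 2.0, 3.0
m = gig_mode(p, a, b)
slope = (p - 1) / m - a / 2 + b / (2 * m**2)
```

Unlike the mean, the mode requires no Bessel-function evaluation, which is why the M-step is a straightforward modification of the variational update.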
VIII Numerical Results
Our code is implemented in non-optimized MATLAB, with all results generated on a laptop with a 2.7 GHz CPU and 8 GB RAM. The parameters , and are all drawn from a broad gamma prior, . When performing inference, all parameters are initialized at random, and no parameter tuning has been performed. We run VB and EM for iterations, while our MCMC inferences were based on runs of samples with a discarded burn-in of samples. These iteration counts were not optimized, and typically excellent results are obtained with far fewer.
VIII-A Details on CS implementation and comparisons
We apply the proposed approach, which we refer to as a shrinkage hierarchical model (denoted “sHM” in the following figures), to compressive sensing (CS). In these experiments the elements of the projection matrix are drawn i.i.d. from , and the “CS ratio,” denoted CSR, is .
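The measurement setup above can be sketched as follows; the 1/sqrt(m) column scaling is an illustrative choice, since the paper's exact normalization was lost in extraction.

```python
import numpy as np

rng = np.random.default_rng(3)

def cs_measure(x, csr, noise_std=0.0):
    """Form y = Phi @ x + eps with an i.i.d. Gaussian projection matrix
    Phi of shape (m, n), where m = round(csr * n).  The 1/sqrt(m)
    scaling is an illustrative assumption, not the paper's stated choice."""
    n = x.size
    m = int(round(csr * n))
    Phi = rng.standard_normal((m, n)) / np.sqrt(m)
    y = Phi @ x + noise_std * rng.standard_normal(m)
    return y, Phi

x = rng.standard_normal(256)          # stand-in for basis coefficients
y, Phi = cs_measure(x, csr=0.4, noise_std=0.01)
```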
In addition to considering sHM, we consider a “flat” version of the shrinkage prior, ignoring the tree structure (analogous to models in [21]); for this case the prior on the wavelet coefficients is as in (5). Additionally, we make comparisons to the following algorithms (in all cases using publicly available code, with default settings):

- TSWCS [14, 17], using code from http://people.ee.duke.edu/~lcarin/BCS.html
- OMP [11], using code from http://www.cs.technion.ac.il/~elad/software/
- ℓ1-magic with the TV norm (TV) [4], using code from http://users.ece.gatech.edu/~justin/l1magic/
- Bayesian compressive sensing (BCS) [13], using code from http://people.ee.duke.edu/~lcarin/BCS.html
- linearized Bregman [49], using code from http://www.caam.rice.edu/~optimization/linearized_bregman/
VIII-B CS and wavelet-based representation
We consider images of size , and hence . We use “Daubechies-4” wavelets [1], with scaling coefficients (LL channel) of size . In all Bayesian models, the posterior-mean results are used for the reconstructed image.
Figure 2 plots the reconstruction peak signal-to-noise ratio (PSNR) versus CS ratio for “Barbara”. We see that (a) the proposed sHM tree model provides the best results, and (b) the models with tree structure (sHM and TSWCS) are better than those without, consistent with the theory in [25].
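For completeness, the PSNR metric used throughout these comparisons is straightforward to compute for images normalized to [0, 1]:

```python
import numpy as np

def psnr(x, x_hat, peak=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, peak]."""
    mse = np.mean((np.asarray(x, float) - np.asarray(x_hat, float))**2)
    return 10.0 * np.log10(peak**2 / mse)

# A uniform error of 0.1 on a [0, 1] image gives MSE = 0.01, i.e. 20 dB.
img = np.zeros((8, 8))
rec = img + 0.1
val = psnr(img, rec)
```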
Similar observations hold for other widely considered images, viz. “Lena” and “Cameraman” (omitted for brevity). The PSNR of the images reconstructed by the proposed sHM tree model is about 1 dB higher than for TSWCS. It is worth noting that even without the tree structure, the proposed local-global shrinkage prior performs well. The other tree-based model, TSWCS, has a prior that, via a spike-slab construction, encourages explicit sparsity; this appears to undermine its results relative to the proposed shrinkage-based tree model, as we examine next.
VIII-B1 Estimated Wavelet Coefficients
We compare the posterior distribution (estimated by MCMC) of the estimated wavelet coefficients from the proposed model versus TSWCS [14], in which the tree-based Markovian spike-slab model is used. We consider “Barbara” with a CSR of . An example of the posterior distribution for a typical wavelet coefficient is shown in Figure 3, comparing the proposed sHM with TSWCS. Note that sHM puts much more posterior mass around the true answer than TSWCS.
Because of the spike-slab construction in TSWCS, each draw from its prior has explicit sparsity, while each draw from the prior for the shrinkage-based sHM always has all coefficients nonzero. While TSWCS and sHM both impose and exploit the wavelet tree structure, our experiments demonstrate the advantage of sHM: it imposes compressibility rather than explicit sparsity. Specifically, note from Figure 3 that the coefficient inferred via TSWCS is shrunk toward zero more than with sHM, and TSWCS has little posterior mass around the true coefficient value. This phenomenon has been observed in numerous comparisons. The over-shrinkage of TSWCS implies an increased estimate of the noise level for that algorithm (small wavelet coefficients are shrunk to exactly zero, and the resulting residual is incorrectly attributed to ). We attribute this as the key reason the sHM algorithm consistently performs better than TSWCS for CS inversion.
VIII-B2 Robustness to Noise
Based upon the above discussion, it is expected that sHM will more accurately estimate the variance of the additive noise , with TSWCS overestimating the noise variance. We study the robustness of CS reconstruction under different levels of measurement noise, by adding zero-mean Gaussian measurement noise when considering “Barbara”; the standard deviation here takes values in [0, 0.31] (the pixel values of the original image were normalized to lie in [0, 1]). Figure 4 (right) plots the PSNR of the reconstruction results of the three best algorithms from Figure 2 (our tree and flat shrinkage models, and TSWCS). We see that the proposed sHM tree model is most robust to noise, with its difference from TSWCS growing with the noise level. Figure 4 (left) plots the noise standard deviation inferred by the proposed models and TSWCS (mean results shown). The tree-structured sHM model provides the best estimates of the noise variance, while TSWCS overestimates it. The flat shrinkage model underestimates the noise variance (apparently the absence of the wavelet tree structure causes the flat model to attribute some of the noise incorrectly to wavelet coefficients).
CS ratio  MCMC   VB(500)  VB(100)  VB(50)  aVB    EM     VB(TSWCS)
0.4       26.83  26.32    26.26    26.07   26.50  26.08  25.26
0.5       28.51  28.08    27.98    27.96   28.50  27.72  27.25
VIII-B3 VB and EM Inference
The above results were based on an MCMC implementation. We next evaluate the accuracy and computation time of the different approximate algorithms. The computation time of our approximate VB and EM (one iteration 2 sec at CSR ) is very similar to that of TSWCS (one iteration 1.62 sec). The bottleneck in our methods is the evaluation of the Bessel function needed for inference. This can be improved (e.g., by precomputing a table of Bessel function evaluations). Note, however, that our methods provide better results (Figure 2 and Table I) than TSWCS.
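The table-based speedup mentioned above can be sketched as follows; the order p and grid range are illustrative choices, not values from the paper.

```python
import numpy as np
from scipy.special import kv  # modified Bessel function of the second kind

# Precompute log K_p(s) on a grid once, then interpolate, instead of
# calling kv() for every coefficient at every iteration.
p = 0.5
grid = np.linspace(0.01, 50.0, 5000)
log_kv_table = np.log(kv(p, grid))

def log_kv_fast(s):
    """Linearly interpolated log K_p(s) from the precomputed table."""
    return np.interp(s, grid, log_kv_table)

err = abs(log_kv_fast(7.3) - np.log(kv(p, 7.3)))
```

Working in the log domain is convenient here because K_p decays exponentially, and the Bessel-function ratios in the updates become simple differences of table lookups.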
Table I also compares the different variational algorithms. Recall that these involve solving an integral via Monte Carlo (MC); we indicate the number of samples used by ‘num’, giving “VB(num)”. As the number of samples increases, the PSNR increases, but at the cost of computation time (the algorithm with MC samples takes 9.34 sec). It is interesting to note that with the approximate VB and approximate EM, the PSNR of the reconstructed image is higher than for TSWCS (VB) and even VB(500). Since the approximate VB and EM provide similar results (VB slightly better) at almost the same computational time, we show only the results of aVB.
VIII-C CS and block-DCT representation
The principal focus of this paper has been on wavelet representations of images. However, we briefly demonstrate that the same model may be applied to other classes of tree structure. In particular, we consider a block-DCT image representation, which is consistent with the JPEG image-compression standard. Details on how this tree structure is constituted are discussed in [3], and implementation details for CS are provided in [17]. Figure 5 shows CS results for “Barbara” using each of the methods discussed above, but now using a block-DCT representation. We again note that the proposed hierarchical shrinkage method performs best in this example, particularly for a small number of measurements (e.g., a CSR of 0.1).
VIII-D Denoising with spiky-plus-Gaussian noise
We now apply the proposed model to denoising, with measurement noise that is a superposition of a Gaussian component and a spiky component. As in Section VI, we view this as a Lévy process, with the Brownian motion contributing the Gaussian part and the Poisson jumps the spikes. Inference is similar to Section VII. Figure 6 compares the performance of our model with the state-of-the-art dHBP (dependent hierarchical beta process) model in [8], considering the same type of Gaussian-plus-spiky noise considered there. Specifically, of the pixels are corrupted by spiky noise with amplitudes distributed uniformly at random over [0, 1] (images scaled as before), and zero-mean Gaussian noise is added with standard deviation 0.02. While our results are similar to dHBP in terms of the (imperfect) measure of PSNR, visually, dHBP returns an image that appears more blurred than ours. For example, consider the detail in the eyes and hair in sHM versus dHBP, Figures 6(c) and 6(e), respectively. Such detailed differences in local regions of the image are lost in the global PSNR measure. Other commonly used images, including “Barbara”, “Cameraman”, “Pepper” and “House”, were also tested, with similar observations.
In the dictionary-learning method of dHBP in [8], neighboring patches are encouraged to share dictionary elements. This has the advantage of removing spiky noise, but it tends to remove high-frequency details (which often are not well shared between neighboring patches). In our work the spiky noise is removed because it is generally inconsistent with the wavelet-tree model developed here, which is characteristic of the wavelet coefficients of natural images but not of noise (Gaussian or spiky). Hence, our wavelet-tree model is encouraged to model the underlying image, and the noise is attributed to the noise terms and in our model.
IX Conclusions
We have developed a multiscale Bayesian shrinkage framework, accounting for the tree structure underlying a wavelet or block-DCT representation of a signal. The model that has been our focus is based on a gamma-process hierarchical construction, with close connections to the adaptive lasso [26]. Specifically, we have proposed a tree-structured adaptive lasso. This is but one class of Lévy process, and in [21] it was demonstrated that many shrinkage priors may be placed within the context of Lévy processes. We have articulated how any of these shrinkage priors may be extended in a hierarchical construction to account for the tree-based representation of many bases. In that context, we have discussed how the Markov-based spike-slab hierarchical priors developed in [14] may be viewed from the perspective of a Lévy process, with a compound-Poisson Lévy measure. The proposed approach yields state-of-the-art results, using both wavelet and block-DCT image representations, for the problems of CS image recovery and denoising with Gaussian-plus-spiky noise. For the latter, we have made connections to dictionary learning, which is widely employed for such applications.
The proposed model accounts for the fact that wavelet and block-DCT coefficients are compressible, but not exactly sparse. By contrast, [14] imposes exact sparsity, which tends to overestimate the noise variance. This characteristic is why the proposed hierarchical shrinkage prior has consistently yielded better CS signal-recovery results than the model in [14].
Concerning future research, in this paper we have considered only one class of Lévy-process constructions, although the tree-based structure may be applied to any. It is of interest to examine other classes of Lévy processes, and the applications for which they are particularly relevant. For example, in applications with only a few large coefficients, the symmetric alpha-stable Lévy process [36] may be most appropriate.
In this paper it has been assumed that the tree-based basis is known. In some applications one is interested in learning a tree-based dictionary matched to the data [50]. It is of interest to examine the utility of the proposed approach for this application, particularly because of its noted connection to dictionary learning. When viewed from the standpoint of the log posterior, the proposed shrinkage priors have the advantage of being closely connected to many of the methods used in the optimization literature [50].
References
 [1] S. Mallat, A wavelet tour of signal processing: The sparse way. Academic Press, 2008.
 [2] J. M. Shapiro, “Embedded image coding using zerotrees of wavelet coefficients,” IEEE Transactions on Signal Processing, vol. 41, no. 12, pp. 3445–3462, December 1993.
 [3] Z. Xiong, O. G. Guleryuz, and M. T. Orchard, “A DCT-based embedded image coder,” IEEE Signal Processing Letters, vol. 3, no. 11, pp. 289–290, November 1996.
 [4] E. J. Candès and T. Tao, “Decoding by linear programming,” IEEE Transactions on Information Theory, vol. 51, no. 12, pp. 4203–4215, December 2005.
 [5] D. L. Donoho, “Compressed sensing,” IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306, April 2006.
 [6] E. J. Candès, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information,” IEEE Transactions on Information Theory, vol. 52, no. 2, pp. 489–509, February 2006.
 [7] J. Portilla, V. Strela, M. J. Wainwright, and E. P. Simoncelli, “Image denoising using scale mixtures of Gaussians in the wavelet domain,” IEEE Transactions on Image Processing, vol. 12, no. 11, pp. 1338–1351, November 2003.
 [8] M. Zhou, H. Yang, G. Sapiro, D. Dunson, and L. Carin, “Dependent hierarchical beta process for image interpolation and denoising,” in International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.
 [9] F. Couzinie-Devy, J. Sun, K. Alahari, and J. Ponce, “Learning to estimate and remove non-uniform image blur,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
 [10] J. A. Tropp, “Greed is good: Algorithmic results for sparse approximation,” IEEE Transactions on Information Theory, vol. 50, no. 10, pp. 2231–2242, October 2004.
 [11] J. A. Tropp and A. C. Gilbert, “Signal recovery from random measurements via orthogonal matching pursuit,” IEEE Transactions on Information Theory, vol. 53, no. 12, pp. 4655–4666, December 2007.
 [12] R. G. Baraniuk, V. Cevher, M. F. Duarte, and C. Hegde, “Modelbased compressive sensing,” IEEE Transactions on Information Theory, vol. 56, no. 4, pp. 1982–2001, April 2010.
 [13] S. Ji, Y. Xue, and L. Carin, “Bayesian compressive sensing,” IEEE Transactions on Signal Processing, vol. 56, no. 6, pp. 2346–2356, June 2008.
 [14] L. He and L. Carin, “Exploiting structure in waveletbased bayesian compressive sensing,” IEEE Transactions on Signal Processing, vol. 57, no. 9, pp. 3488–3497, September 2009.
 [15] D. Baron, S. Sarvotham, and R. G. Baraniuk, “Bayesian compressive sensing via belief propagation,” IEEE Transactions on Signal Processing, vol. 58, no. 1, pp. 269–280, January 2010.
 [16] M. E. Tipping, “Sparse bayesian learning and the relevance vector machine,” Journal of Machine Learning Research, vol. 1, pp. 211–244, September 2001.
 [17] L. He, H. Chen, and L. Carin, “Treestructured compressive sensing with variational bayesian analysis,” IEEE Signal Processing Letters, vol. 17, no. 3, pp. 233–236, 2010.
 [18] M. Seeger, “Bayesian inference and optimal design for the sparse linear model,” Journal of Machine Learning Research, vol. 9, June 2008.
 [19] Z. Zhang, S. Wang, D. Liu, and M. I. Jordan, “EPGIG priors and applications in bayesian sparse learning,” The Journal of Machine Learning Research, vol. 13, pp. 2031–2061, 2012.
 [20] N. G. Polson and J. G. Scott, “Shrink globally, act locally: Sparse bayesian regularization and prediction,” Bayesian Statistics, 2010.
 [21] ——, “Local shrinkage rules, lévy processes and regularized regression,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 74, no. 2, pp. 287–311, 2012.
 [22] C. M. Carvalho, N. G. Polson, and J. G. Scott, “The horseshoe estimator for sparse signals,” Biometrika, vol. 97, no. 2, pp. 465–480, 2010.
 [23] A. Armagan, M. Clyde, and D. B. Dunson, “Generalized beta mixtures of gaussians,” Advances in Neural Information Processing Systems (NIPS), pp. 523–531, 2011.
 [24] Z. Zhang and B. Tu, “Nonconvex penalization, Lévy processes and concave conjugates,” Advances in Neural Information Processing Systems (NIPS), 2012.
 [25] R. G. Baraniuk, V. Cevher, M. F. Duarte, and C. Hegde, “Modelbased compressive sensing,” IEEE Transactions on Information Theory, vol. 56, no. 4, pp. 1982–2001, April 2010.
 [26] H. Zou, “The adaptive lasso and its oracle properties,” Journal of the American Statistical Association, vol. 101, no. 476, pp. 1418–1429, December 2006.
 [27] E. J. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?” Journal of the ACM, vol. 58, no. 3, pp. 1–37, May 2011.
 [28] R. Baraniuk, “Compressive sensing,” IEEE Signal Processing Magazine, vol. 24, no. 4, pp. 118–121, July 2007.
 [29] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society, Series B, vol. 58, pp. 267–288, 1996.
 [30] T. Park and G. Casella, “The bayesian lasso,” Journal of the American Statistical Association, vol. 103, pp. 681–686, 2008.
 [31] H. Ishwaran and J. S. Rao, “Spike and slab variable selection: Frequentist and bayesian strategies,” Annals of Statistics, vol. 33, no. 2, pp. 730–773, 2005.
 [32] P. Schniter, L. Potter, and J. Ziniel, “Fast bayesian matching pursuit: Model uncertainty and parameter estimation for sparse linear models,” in Proc. Information Theory and Applications Workshop, 2008, pp. 326–333.
 [33] M. Crouse, R. Nowak, and R. Baraniuk, “Wavelet-based statistical signal processing using hidden markov models,” IEEE Transactions on Signal Processing, vol. 46, no. 4, pp. 886–902, April 1998.
 [34] D. Applebaum, Lévy Processes and Stochastic Calculus. Cambridge University Press, 2004.
 [35] J. Kingman, Poisson Processes. Oxford University Press, 2002.
 [36] R. L. Wolpert, M. A. Clyde, and C. Tu, “Stochastic expansions using continuous dictionaries: Lévy Adaptive Regression Kernels,” Annals of Statistics, 2011.
 [37] T. J. Mitchell and J. J. Beauchamp, “Bayesian variable selection in linear regression,” Journal of the American Statistical Association, vol. 83, no. 404, pp. 1023–1032, 1988.
 [38] E. I. George and R. E. McCulloch, “Variable selection via gibbs sampling,” Journal of the American Statistical Association, vol. 88, no. 423, pp. 881–889, 1993.
 [39] V. Cevher, “Learning with compressible priors,” Advances in Neural Information Processing Systems (NIPS), pp. 261–269, 2009.
 [40] C. M. Carvalho, N. G. Polson, and J. G. Scott, “Handling sparsity via the horseshoe,” in International Conference on Artificial Intelligence and Statistics (AISTATS), 2009, pp. 73–80.
 [41] I. Daubechies, R. DeVore, M. Fornasier, and C. S. Güntürk, “Iteratively reweighted least squares minimization for sparse recovery,” Communications on Pure and Applied Mathematics, vol. 63, no. 1, pp. 1–38, 2010.
 [42] E. J. Candès, M. B. Wakin, and S. P. Boyd, “Enhancing sparsity by reweighted ℓ1 minimization,” Journal of Fourier Analysis and Applications, vol. 14, no. 5–6, pp. 877–905, 2008.
 [43] P. Garrigues and B. A. Olshausen, “Group sparse coding with a laplacian scale mixture prior,” Advances in Neural Information Processing Systems (NIPS), pp. 676–684, 2010.
 [44] J. E. Griffin and P. J. Brown, “Bayesian hyperlassos with nonconvex penalization,” Australian New Zealand Journal of Statistics, vol. 53, no. 4, pp. 423–442, 2011.
 [45] A. Bhattacharya, D. Pati, N. S. Pillai, and D. B. Dunson, “Bayesian shrinkage,” arXiv:1212.6088, 2012.
 [46] S. Som and P. Schniter, “Compressive imaging using approximate message passing and a markovtree prior,” IEEE Transactions on Signal Processing, pp. 3439–3448, 2012.
 [47] W. K. Hastings, “Monte carlo sampling methods using markov chains and their applications,” Biometrika, pp. 97–109, 1970.
 [48] T. P. Minka, “A family of algorithms for approximate Bayesian inference,” Ph.D. dissertation, MIT, 2001.
 [49] W. Yin, S. Osher, D. Goldfarb, and J. Darbon, “Bregman iterative algorithms for ℓ1-minimization with applications to compressed sensing,” SIAM Journal on Imaging Sciences, vol. 1, no. 1, pp. 143–168, 2008.
 [50] R. Jenatton, J. Mairal, G. Obozinski, and F. Bach, “Proximal methods for sparse hierarchical dictionary learning,” in International Conference on Machine Learning (ICML), 2010.