DeepAI

# Horseshoe priors for edge-preserving linear Bayesian inversion

In many large-scale inverse problems, such as computed tomography and image deblurring, characterization of sharp edges in the solution is desired. Within the Bayesian approach to inverse problems, edge-preservation is often achieved using Markov random field priors based on heavy-tailed distributions. Another strategy, popular in statistics, is the application of hierarchical shrinkage priors. An advantage of this formulation lies in expressing the prior as a conditionally Gaussian distribution depending of global and local hyperparameters which are endowed with heavy-tailed hyperpriors. In this work, we revisit the shrinkage horseshoe prior and introduce its formulation for edge-preserving settings. We discuss a sampling framework based on the Gibbs sampler to solve the resulting hierarchical formulation of the Bayesian inverse problem. In particular, one of the conditional distributions is high-dimensional Gaussian, and the rest are derived in closed form by using a scale mixture representation of the heavy-tailed hyperpriors. Applications from imaging science show that our computational procedure is able to compute sharp edge-preserving posterior point estimates with reduced uncertainty.

• 4 publications
• 6 publications
• 8 publications
12/20/2021

### Bayesian neural network priors for edge-preserving inversion

We consider Bayesian inverse problems wherein the unknown state is assum...
11/06/2019

### Regularization of Bayesian shrinkage priors and inference via geometrically / uniformly ergodic Gibbs sampler

Use of continuous shrinkage priors — with a "spike" near zero and heavy-...
04/14/2021

### A hybrid Gibbs sampler for edge-preserving tomographic reconstruction with uncertain view angles

In computed tomography, data consist of measurements of the attenuation ...
03/03/2019

### Heavy Tailed Horseshoe Priors

Locally adaptive shrinkage in the Bayesian framework is achieved through...
05/26/2021

### Cauchy Markov Random Field Priors for Bayesian Inversion

The use of Cauchy Markov random field priors in statistical inverse prob...
06/28/2020

10/29/2018

### Prior-preconditioned conjugate gradient method for accelerated Gibbs sampling in "large n & large p" sparse Bayesian regression

In a modern observational study based on healthcare databases, the numbe...

## 1 Introduction

The integration of data into computational models is fundamental when one is interested in recovering unknown parameters defining the model. This specifies an inverse problem where the goal is to discover the parameters that make the model closely match the observed data. Inverse problems are typically ill-posed and the stability mainly depends on the structure of the mathematical model, the dimension of the parameter space, and the scarcity and noisiness of the data. Regularization methods are commonly used to find the solution of inverse problems (see, e.g., hansen_2010

). These methods are deterministic and incorporate penalty functions as stabilization procedure. Another approach relies on Bayesian statistics (see, e.g.,

kaipio_and_somersalo_2005

). These techniques incorporate a probabilistic description of the model parameters that combines prior information with a likelihood function that accounts for the model and data. In this approach, the objective is to estimate a so-called posterior distribution of the model parameters. Closed-form expressions of the posterior can only be derived in some particular cases, and in general the posterior has to be estimated using sampling-based methods such as Markov chain Monte Carlo (MCMC)

gamerman_and_lopes_2006; owen_2019, or approximation methods such as variational inference ranganath_et_al_2014, transport maps elmoselhy_and_marzouk_2012 and Laplace approximation schillings_et_al_2020; uribe_et_al_2021.

The main advantage of the Bayesian formulation lies in the possibility of quantifying parameter uncertainty due to noise and model errors. However, the approach is generally limited to problems where the dimension of the parameter space is not prohibitively large. In large-scale scenarios, the unknown parameters are modeled as random fields discretized pointwise on a fine grid. This complicates the application of sampling methods for Bayesian inference since one requires the exploration of a high-dimensional parameter space — even in the case of the linear problems considered here. Another level of complexity arises when the solution of the inverse problem is non-smooth and the preservation of the edges or sharp features in the solution is necessary

bruckstein_et_al_2009. This especially occurs in imaging science such as X-ray computed tomography, image deblurring, segmentation and denoising

. In these cases edge-preservation is generally imposed via the prior probability distribution. These prior models can be grouped into four categories:

• Heavy-tailed Markov random fields

defined on pairwise parameter increments or differences. The idea is to increase the probability of large jump events by imposing heavy-tailed distributions on the increments. Some examples include total variation prior

lassas_and_siltanen_2004, Laplace Markov random fields bardsley_2012; uribe_et_al_2021 and Cauchy Markov random fields suuronen_et_al_2022.

• Random fields with jump discontinuities

, including Besov space priors based on Haar wavelet expansions dashti_et_al_2012; lassas_et_al_2009, Gaussian and compound Poisson process priors hosseini_2017, and level-set priors that employ Gaussian random fields with different thresholds via predefined level sets dunlop_et_al_2017.

• Machine learning-based models, including implicit models such as plug-and-play priors kamilov_et_al_2017

and Bayesian neural networks with Cauchy weights that promote edge-preservation

li_et_al_2022.

• Shrinkage priors, also known as global-local or component-wise priors, that aim at shrinking small values towards zero while leaving the larger ones unaffected. These prior models are hierarchical by nature and include the methods proposed in calvetti_et_al_2019; calvetti_et_al_2020b, as well as multiple models used in the statistics community such as elastic net, spike-slab, horseshoe, discrete Gaussian mixtures, and others (see, e.g., polson_and_sokolov_2019; vanerp_et_al_2019 for reviews).

One of the advantages of shrinkage priors is that they are essentially defined by conditionally Gaussian distributions. Given one or more hyperparameters such as the variance, a target uncertain parameter endowed with a Gaussian distribution can have many nice properties vono_et_al_2022

. Different shrinkage priors differ in the definition of the hierarchical structure and probabilistic models used to represent the hyperparameters. For example, the Lasso prior is based on a scale mixture representation of the Laplace distribution, and thus it defines an exponential distribution on the variance with a half-Cauchy distributed rate parameter

park_and_casella_2008

. The elastic net prior uses a truncated-gamma distribution on the variance with scale defined by two additional parameters that are half-Cauchy distributed

li_et_al_2010. Another shrinkage model is the horseshoe prior, in which the variance is defined by two hyperparameters that are half-Cauchy distributed carvalho_et_al_2010; one hyperparameter is a scalar that controls the global variability, while the other is spatially dependent and captures local variations.

Despite being common in statistics, shrinkage priors are not yet fully applied in large-scale linear inverse problems that require edge-preserving solutions. In this paper, we formulate a Bayesian inference approach that targets this requirement. We focus on the horseshoe prior, where the target parameter is Gaussian-distributed conditioned on half-Cauchy-distributed global and local hyperparameters. Our idea is that when this type of prior is formulated for the parameter increments or differences, the local hyperparameter assists in the identification of large discontinuities. The Bayesian inverse problem becomes hierarchical, since hyperparameters associated with the prior and likelihood are also part of the inference process.

The main difficulty in this formulation is that the resulting hierarchical posterior is challenging to handle computationally due to the large-scale nature of the target parameter and the heavy-tailed distributions imposed on the hyperparameters. To alleviate the complexity of the sampling process, we use the scale mixture representation of the half-Student-t distribution to express the half-Cauchy distributions of the hyperparameters in terms of inverse gamma distributions. Despite this requires the addition of extra hyperparameters, the hierarchical posterior is given essentially by products of Gaussian and inverse gamma densities. Therefore, we can exploit the conjugacy in the hierarchies to define a Gibbs sampler for which the associated full conditional densities are derived in closed form, and hence they can be sampled directly without an MCMC algorithm. However, within the Gibbs sampler, the conditional distribution of the target parameter defines a linear Gaussian Bayesian inverse problem. Although this full conditional is Gaussian and has an analytical expression, it is prohibitively slow to simulate in large-scale applications. Hence, the task of sampling from this distribution is formulated as a least-squares problem that is solved efficiently via standard iterative methods such as CGLS bjorck_1996 with a preconditioner based on the prior precision matrix. The computational framework is tested on a one-dimensional deconvolution problem, as well as, two-dimensional applications of image deblurring and computed tomography (CT).

##### Related work

Existing applications of shrinkage priors in the context of inverse problems include the following. Bayesian inverse problems in sparse signal and image processing are surveyed in mohammaddjafari_2012 where a variety of shrinkage priors are discussed and applied to multiple problems in imaging science. MCMC and approximation methods are also suggested in mohammaddjafari_2012 to solve the Bayesian inverse problem depending on the choice of the prior. Spike-and-slab priors are used in riis_andersen_et_al_2017 for linear Bayesian inverse problems subject to sparsity constraints. The posterior is simulated with an expectation propagation algorithm using different approximations of the precision matrix of the underlying Gaussian parameter to improve the efficiency in high-dimensional applications. So-called Gaussian hypermodels are proposed in calvetti_et_al_2019, which are defined as conditionally Gaussian priors with a single local variance hyperparameter endowed with a gamma distribution. An iterative alternating sequential algorithm is proposed that converges to the maximum a posteriori probability estimator of the parameters. Following this idea, in calvetti_et_al_2020b a generalized gamma distribution is imposed on the local variances and propose two modifications to their iterative algorithm to exploit the global convexity ensured by gamma hyperpriors and the stronger sparsity promotion of the generalized gamma hyperpriors.

##### Our contributions

The main contributions of the paper are highlighted as follows:

• We review the horseshoe prior for the representation of parameters that are sparse.

• We introduce the horseshoe prior for the solution of linear Bayesian inverse problems that require edge-preservation and perform uncertainty quantification.

• We propose a Gibbs sampler to solve the resulting hierarchical formulation of the Bayesian inverse problem; the full conditional distributions are derived in closed form by exploiting conjugacy.

• We demonstrate the performance of our method through numerical experiments on two inverse problems.

##### Structure of the paper

This paper is organized as follows. In section 2, we briefly introduce the mathematical framework for linear inverse problems and the Bayesian approach, and we present the standard horseshoe prior. We then derive its generalization for edge-preserving linear Bayesian inversion in section 3. We show that a scale mixture representation can be applied to express a heavy-tailed half-Cauchy hyperprior in terms of two inverse gamma distributions. The computational framework is defined in section 4, where we develop a Gibbs sampling approach to solve the resulting hierarchical Bayesian inverse problem. Here, we use CGLS to sample the full Gaussian conditional and we derive the remaining hyperparameter conditionals in closed form. In section 5 we illustrate our computational approach with imaging science applications, before concluding in section 6.

## 2 Preliminaries

We commence with the mathematical framework and introduce linear inverse problems and the Bayesian approach to find solutions to those, as well as the standard horseshoe prior.

### 2.1 Linear inverse problems

We consider the discrete inverse problem of estimating an unknown model parameter using noisy observed data , given a linear forward operator acting as a link between the parameters and data. Here, is the number of parameters and denotes the number of data points. The parameter is spatially varying and exists in connection with a physical domain , e.g., one-dimensional signals, two-dimensional images and three-dimensional objects. Using a suitable finite-dimensional representation, the dimension will depend on a discretization size imposed on the physical domain. For instance, in one- and two-dimensional settings and (assuming equal discretization in both directions), respectively.

In particular, a linear inverse problem estimates an that approximates the ground truth in

 y=Axtrue+ewithe∼N(0,σ2obsIm), (1)

where the additive measurement noise is modeled as a realization of an independent Gaussian random vector with variance

,

is an identity matrix of size

, and denotes the multivariate Gaussian distribution depending on mean and covariance parameters. The linear inverse problem associated to (1) is typically ill-posed and we employ the Bayesian statistical framework to study the influence of noise in the observed data.

### 2.2 Bayesian inverse problems

In the Bayesian approach to inverse problems kaipio_and_somersalo_2005, we treat and

as random variables and the solution is expressed as the probability distribution of

given an instance of the observed data . This allows both modeling the noise via its statistical properties and specifying prior information on the parameter, i.e., the form of solutions that are believed more likely.

The unknown parameter is modeled as a discretized random field represented as a random vector taking values . We assume the distribution of has a so-called prior probability density . The likelihood function is defined from a density with fixed argument equal to the observed data ; then the likelihood becomes a function of only. The Gaussian statistical model for the measurement noise assumed in (1) gives

 πdata(y|x)=1(2π)\nicefracm2σmobsexp(−12σ2obs∥y−Ax∥22). (2)

Assuming that the measurement noise and model parameters are independent, the solution of the Bayesian inverse problem combines the prior and likelihood probability models into a so-called posterior

density, given by Bayes’ Theorem as

 πpos(x|y)=1Zπdata(y|x)πpr(x), (3)

where is the normalizing constant of the posterior.

In practice, point estimates of the posterior (3) are used to represent the solution of the inverse problem. The two classical choices are the maximum a posteriori (MAP) and the posterior mean (PM) estimators:

 xMAP=argmaxx∈Rdπpos(x|y) and xPM=∫Rdxπpos(x|y)dx. (4)

The MAP estimator is computed via optimization techniques which is often an advantage, compared to the calculation of the PM estimator that requires a more involved high-dimensional integration. The estimation of the posterior mean and related summary statistics is often performed via Monte Carlo methods (see, e.g., gamerman_and_lopes_2006; owen_2019

). We point out that when the first two statistical moments of a random vector do not exist (as it is common for some heavy-tailed distribution models), location and scale characteristics of the distribution can still be summarized using for instance the

posterior median and median absolute deviation, respectively tenorio_2017

. These are also a sensible option to describe random variables whose distribution is heavily skewed.

To find the posterior (3

), we will focus on the definition of a prior probability model that allows both, the preservation of potential sharp features in the unknown parameter and a tractable computation of the high-dimensional posterior distribution. Our model is based on the horseshoe prior

carvalho_et_al_2009, which we revisit next.

### 2.3 Horseshoe prior

When our a priori knowledge is that is sparse, it may be convenient to apply a sparsity-inducing shrinkage prior (see, e.g., vanerp_et_al_2019). A well-known continuous shrinkage model is the standard horseshoe carvalho_et_al_2009, which is defined as the hierarchical prior:

 πpr(x,τ,σ)=πpr(x|τ,σ)πhpr(τ)πhpr(σ), (5)

where a zero-mean conditionally Gaussian prior is imposed on :

 (6)

Here, the prior covariance matrix depends on the hyperparameters controlling global shrinkage and defining local shrinkage. The global-local scheme is defined with heavy-tailed distributions. In particular, the horseshoe model uses a half-Cauchy distribution for and an independent standard half-Cauchy distribution for :

 πhpr(τ)∝2τ0(1+τ2τ20)andπhpr(σ)∝d∏i=121+σ2iwith τ,σi>0, (7)

where is a scale parameter. According to existing work, we can set (i) carvalho_et_al_2009; (ii) , where is the number of nonzero elements piironen_and_vehtari_2017; (iii) as a part of the inference, e.g., using a Jeffreys’ hyperprior carvalho_et_al_2010.

Another expression for the horseshoe prior is given by re-writing the density of as for , where is called a shrinkage parameter. The relation between and is bijective when , then its inverse is continuously differentiable and non-zero. Therefore, the hyperprior of each parameter is found from the half-Cauchy hyperprior of using a standard probabilistic transformation and we obtain

 πhpr(κi)∝1√κi√1−κi, (8)

which is proportional to a horseshoe-shaped beta probability density with shape parameters equal to , hence the name of the prior. Small values of correspond to and produce almost no shrinkage, while values corresponding to provide essentially full shrinkage, and hence can be used to describe how many active or inactive variables are present in the model carvalho_et_al_2010.

The horseshoe prior is hierarchical by definition and lacks of a proper analytical expression after marginalizing out the hyperparameters. However, lower and upper bounds for the one-dimensional horseshoe probability density are available carvalho_et_al_2010:

 12√2π3log(1+4x2)≤πpr(x)≤1√2π3log(1+2x2)for\leavevmode\nobreak\ x≠0. (9)

These bounds are shown in Figure 1 together with other common probability distributions (standardized). Note that the horseshoe prior has heavy tails that are comparable to the Cauchy distribution, and it has a pole at . These characteristics typically enable the prior to perform well when handling sparsity. We exploit this behavior to develop a horseshoe-based model that can be applied to a class of linear Bayesian inverse problems where edge-preservation is fundamental.

## 3 Horseshoe prior for edge-preservation

In some applications the solution of inverse problems is sparse; in other applications sharp edges are present in the solution meaning that pairwise differences or increments of the solution elements exhibit sparsity. We exploit this fact to define a horseshoe prior on the differences that promote solutions preserving their sharp features. The resulting prior can be interpreted as an anisotropic conditionally Gaussian Markov random field prior.

### 3.1 Horseshoe prior and conditionally Gaussian Markov random fields

Gaussian Markov random fields (GMRF) are typically specified by their conditional independence structure. Therefore, a natural way to describe a GMRF is by its precision matrix. This is because the sparsity structure of this matrix determines the so-called neighborhood system explaining conditional relations in rue_and_held_2005. Since we are interested in preserving the edges, we define the precision matrix of the GMRF using increments between elements of the parameter vector. In one-dimensional cases, the increments are computed by application of a finite difference matrix , while in two-dimensional cases we consider the horizontal and vertical first-order finite difference matrices . These are given by

 D=⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣1−11−11⋱⋱−11⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦N×NandD(1)=IN⊗D,D(2)=D⊗IN, (10)

where is domain discretization size, is the identity matrix of size , and denotes the Kronecker product. Note that we use, without loss of generality, zero boundary conditions on at the left part of the domain.

Due to the structure defined in eq. 10 the increments of (i.e., the vectors , and ) are independent and we impose a horseshoe prior on them. This allows us to rewrite the prior in eq. 6 as a conditionally GMRF that incorporates information about the increments. The associated probability density is

 πpr(x|τ,w)=(detΛ(τ,w))\nicefrac12(2π)\nicefracd2exp(−12xTΛ(τ,w)x), (11)

where

is a prior precision matrix (see below) that depends on the global standard deviation

and local weights . These hyperparameters are defined as in the standard horseshoe prior and they are half-Cauchy-distributed following eq. 7. In this case, the weights in are analogous to local standard deviations in , but they will correspond to the increments and not to the elements of the parameter vector, hence the different notation.

For the conditionally Gaussian prior in eq. 11, the precision matrix is defined such that it takes into account the increments via eq. 10 (see, e.g., (bardsley_2019, p.68) for the details). Therefore, we can generally write the precision matrix as

 Λ(τ,w)=LTW(τ,w)L, (12)

and for one-dimensional and two-dimensional domains we have

 L=D,\definecolor[named]pgfstrokecolorrgb0,0,0\pgfsys@color@gray@stroke0\pgfsys@color@gray@fill0W(τ,w)=⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣1τ2w21⋱1τ2w2d⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦ (13)

and

 L=[D(1)D(2)],W(τ,w)=[W(1)0d0dW(2)],\definecolor[named]pgfstrokecolorrgb0,0,0\pgfsys@color@gray@stroke0\pgfsys@color@gray@fill0W(i)=⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣1τ2(w(i)1)2⋱1τ2(w(i)d)2⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦,i=1,2, (14)

respectively. Here, is called the structure matrix, and is a matrix containing zeros. Notice that in the two-dimensional case we have to consider local weights

 w=[w(1)w(2)]∈R2d>0

for each coordinate direction. These are assumed independent and identically distributed according to the half-Cauchy prior exactly as in the one-dimensional case.

The conditionally Gaussian prior eq. 11 (based on precision matrices of the form eq. 12 with eq. 13eq. 14) and the hyperpriors eq. 7 define a horseshoe prior promoting edge-preservation. Again, this is because we are now regularizing increments of the unknown parameter using a heavy-tailed probabilistic model. A main motivation to utilize the horseshoe prior is that the unknown parameter is conditionally Gaussian, which facilitates the analytical and numerical treatment of the resulting posterior distribution. However, the main difficulties in applying this prior are: (i) the dimension of the parameter space is increased since the local weights have the same or even larger dimension than the model parameter, and (ii) the hyperparameters are endowed with heavy-tailed distributions that cause challenges when computing the posterior via sampling methods. For the latter point, a regularized horseshoe prior is proposed in piironen_and_vehtari_2017

; this model employs half-Student’s t-distributions with larger degrees of freedom on the local parameters to overcome the sampling issues. However, increasing the degrees of freedom makes the tail of the distribution less heavy, which is not desirable when looking for methods that promote sharp features.

### 3.2 Extended horseshoe prior

Our objective is to present an equivalent model that extends the hierarchical structure of the horseshoe prior by adding auxiliary parameters. This allows us to write the resulting posterior such that it can be sampled in a tractable manner without loosing the connection to the original half-Cauchy hyperpriors in the standard horseshoe. We point out that this idea is also used in makalic_and_schmidt_2016

for the standard horseshoe prior applied to logistic regression problems.

We consider a scale mixture decomposition of a Student’s t-distributed random variable (see, e.g., wand_et_al_2011). If and are random variables such that

 (A2|B)∼IG(ν2,νB)% andB∼IG(12,1c2), (15)

where denotes the inverse gamma distribution depending on shape and scale parameters, then

 A∼t+(ν,0,c), (16)

where is the half-Student’s t-distribution depending on degrees of freedom, location and scale parameters. These distributions have densities defined as equationparentequation

 IG(x;α,β) =βαΓ(α)x−α−1exp(−βx), (17a) t+(x;ν,0,c) =⎧⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪⎩2Γ(ν+12)Γ(ν2)√πνc(1+1νx2c2)−ν+12,if x≥00,otherwise. (17b)

The decomposition eq. 15 can be derived by marginalization of the associated hierarchical prior: equationparentequation

 π(a2) =∫∞0π(a2|b)π(b)db∝(a2)−ν2−1∫∞0b−ν+12−1exp(−1b(νa2+1c2))db (18a) ∝(a2)−ν2−1(νa2+1c2)−ν+12∝(a2)−12(1+1νa2c2)−ν+12, (18b)

where we use the fact that the integrand in eq. 18a is an unnormalized inverse gamma density with shape and scale parameters and , respectively. This allows analytical computation of the marginalization procedure. Finally, to obtain the density for , we can use a probabilistic transformation (as performed for eq. 8). This results in the half-Student’s t-distribution defined in eq. 15 with degrees of freedom and scale parameter . We consider the standard horseshoe prior and therefore we use degrees of freedom such that half-Student’s t-distribution reduces to a half-Cauchy distribution. In addition, we note that the regularized horseshoe prior piironen_and_vehtari_2017, which typically uses , can also be included in the formulation.

The scale mixture representation eq. 15 is utilized to write the standard horseshoe prior as an extended hierarchical prior. The motivation behind this choice is that the additional auxiliary variables allow the derivation of full conditional distributions in closed form by exploiting conjugacy. Thereafter, these can be sampled within a Gibbs sampler scheme (robert_and_casella_2004, Ch.10). The resulting extended horseshoe prior is defined as:

 πpr(x,τ2,w2,γ,ξ)=πpr(x|τ2,w2)πhpr(τ2|γ)πhpr(γ)πhpr(w2|ξ)πhpr(ξ), (19)

where equationparentequation

 πpr(x|τ2,w2) =N(0,Λ−1(τ,w)), (20a) πhpr(τ2|γ) =IG(ν2,νγ), πhpr(γ) =IG(12,1τ20), (20b) πhpr(w2i|ξi) =IG(ν2,νξi), πhpr(ξi) =IG(12,1). (20c)

## 4 Proposed approach

The extended hierarchical horseshoe prior eq. 19 is used to define the posterior in eq. 3, thereby defining a hierarchical formulation of the Bayesian inverse problem. In addition to the hyperparameters arising from the horseshoe prior, we also model the noise variance in the likelihood function as a random variable. Following the hyperprior distributions imposed on the prior parameters, we define an inverse gamma hyperprior for with shape and scale parameters and , respectively. This choice makes the hyperprior relatively uninformative higdon_2007.

As a result, the Bayesian inverse problem of estimating the posterior eq. 3 is written as the hierarchical Bayesian inverse problem of determining the new posterior density:

 πpos(x,σ2obs,τ2,w2,γ,ξ)∝πdata(y|x,σ2obs)πpr(x|τ2,w2)πhpr(τ2|γ)πhpr(w2|ξ)πhpr(σ2obs)πhpr(γ)πhpr(ξ); (21)

whose dependencies are shown in Figure 2.

Performing statistical inference with the posterior eq. 21 generally requires application of sampling methods based on Markov chain Monte Carlo (MCMC) methods. The goal in MCMC is to compute samples or realizations of a Markov chain that is stationary with respect to the posterior distribution owen_2019. A common difficulty when solving hierarchical Bayesian inverse problems is that MCMC algorithms have to handle drastic changes in the shape of the posterior. This is because a small modification at the bottom of the hierarchy induces large changes in the joint posterior (see, e.g., betancourt_and_girolami_2015 for a detailed discussion). To alleviate this problem one typically relies on (i) re-parametrization of the hierarchical structure in order to break potential correlations between parameters (cf. Figure 2), or (ii) application of specialized MCMC algorithms that capture the geometry of the joint posterior. For instance, a Riemannian Hamiltonian Monte Carlo method that uses local curvature information is advocated in betancourt_and_girolami_2015; and for problems involving the horseshoe prior, a Gibbs sampler is proposed in makalic_and_schmidt_2016 and two further MCMC methods are developed in johndrow_et_al_2020. Recently, another Gibbs sampler is discussed in nishimura_and_suchard_2022 for Bayesian inverse problems under the regularized horseshoe prior.

Our idea is to exploit the structure presented in section 3.2 to derive full conditional distributions for each uncertain parameter. The advantage of this approach is that it allows a direct application of the Gibbs sampler geman_and_geman_1984, since the conditional densities for each parameter can be derived in closed form. Therefore, we avoid sampling in the high-dimensional joint space of the posterior. In particular, the full conditional densities associated with eq. 21 are found to be: equationparentequation

 π1(x|σ2obs,τ2,w2) ∝πdata(y|x,σ2obs)πpr(x|τ2,w2), (22a) π2(σ2obs|x) ∝πdata(y|x,σ2obs)πhpr(σ2obs), (22b) π3(τ2|x,w2,γ) ∝πpr(x|τ2,w2)πhpr(τ2|γ), (22c) π4(w2|x,τ2,ξ) ∝πpr(x|τ2,w2)πhpr(w2|ξ), (22d) π5(γ|τ2) ∝πhpr(τ2|γ)πhpr(γ), (22e) π6(ξ|w2) ∝πhpr(w2|ξ)πhpr(ξ). (22f)

In the remainder of this section, we determine the full conditional densities for each uncertain parameter and discuss sampling techniques to obtain draws from them. Due to the extended horseshoe model, we anticipate that most of the densities in eq. 22 can be sampled directly and a classic Gibbs sampler algorithm can be used to characterize the posterior.

### 4.1 Sampling of π1

The conditional for the uncertain parameter in eq. 22a defines a Bayesian inverse problem with Gaussian likelihood and prior. After inserting the corresponding probabilistic models we obtain:

 π1(x|σ2obs,τ2,w2)∝exp(−12(1σ2obs∥y−Ax∥22+∥∥Λ\nicefrac12(τ,w)x∥∥22)). (23)

This conditional density is also Gaussian with precision matrix and mean vector given by (see, e.g., (kaipio_and_somersalo_2005, p.78))

 ˜Λ(τ,w)=1σ2obsATA+Λ(τ,w),˜μ(τ,w)=˜Λ−1(τ,w)(1σ2obsATy). (24)

where is given in (12).

Given the hyperparameters , and , the most common sampling algorithm for a Gaussian distribution is based on the Cholesky factorization. In this case, a sample from is obtained as , where is a standard Gaussian random vector, and is a lower triangular matrix with real and positive diagonal entries (Cholesky factor). Note that we drop the dependence on , and for the sake of notation. This simple strategy is computationally prohibitive due to the need for computing a factorization of the matrix . Assuming that matrix-vector multiplications with can be done efficiently, we rely on Krylov subspace methods to sample from the Gaussian conditional in eq. 23. We refer to vono_et_al_2022 for a review of methods for sampling high-dimensional Gaussian distributions. In particular, the task of sampling a Gaussian random vector can be written as a least squares problem, and thus, we draw a sample from by solving:

 x⋆=argminx∈Rd∥Mx−z∥22withM=[(\nicefrac1σobs)AΛ\nicefrac12],z=[(\nicefrac1σobs)y0d]+˜u, (25)

where . Recall that the solution of eq. 25 is required at every Gibbs iteration given new values of the hyperparameters. We use the CGLS method, which requires one pair of forward and backward model computations (i.e., multiplications with and ) per iteration. A tolerance on the relative residual norm of the normal equations or a maximum number of iterations can be used to control the quality of the computed samples.

One way to accelerate the performance of the CGLS method is to apply a standard-form transformation to (25) (hansen_2010, Sec. 8.4); this is also referred to as priorconditioning calvetti_et_al_2018. The idea is to compute the Cholesky factorization of the prior precision matrix and introduce the change of variables , such that we can transform eq. 25 into

 ~x⋆=argmin~x∈Rd∥M˜x−z∥22withM=[(\nicefrac1σobs)AC−1Id], (26)

where is defined as in eq. 25. Note that the change of variables corresponds to “whitening” the prior, that is, . In the standard-form least squares problem (26), we are now working with the Krylov subspace , which is in most cases a better subspace than of the general-form least-squares problem eq. 25 (see, e.g., (hansen_2010, Sec. 8.4) for the details). Hence, the application of the standard-form transformation tends to reduce the number of iterations required for convergence and it can be incorporated in a preconditioned version of the CGLS algorithm ((bjorck_1996, Sec. 7.4)). We denote this variant as pCGLS.

The Cholesky factor of the prior precision matrix, which is instrumental for the standard-form transformation, can be obtained as follows. In the one-dimensional case, from the definition of the prior precision in (13) we have that

 Λ=LTWL=CTCwithL=D, (27)

where is diagonal and is either upper or lower triangular. Hence, by simple observation, we can write , which is indeed triangular and constitutes an inexpensive representation of the Cholesky factor.

In the two-dimensional case, the Cholesky factor is no longer immediately available but still can be computed in an economical way. From eq. 14 we have

 Λ=LTWLwithW=[W(1)0d0dW(2)]andL=[D(1)D(2)], (28)

where the difference matrices and are sparse and so is . Now, we can compute the thin QR factorization of ,

 W\nicefrac12L=[(W(1))\nicefrac12D(1)(W(2))\nicefrac12D(2)]=QR; (29)

notice that is not required and . Therefore, is the desired Cholesky factor.

### 4.2 Sampling of π2 to π6

The conditional densities for each hyperparameter are all obtained in closed form. This is due to conjugate relations that arise from the extended horseshoe prior formulation.

The conditional density for the noise variance in eq. 22b is equationparentequation

 π2(σ2obs|x) ∝πdata(y|x,σ2obs)πhpr(σ2obs) (30a) ∝(σ2obs)−\nicefracm2exp(−12σ2obs∥y−Ax∥22)[(σ2obs)−αobs−1exp(−1βobsσ2obs)] (30b) ∝(σ2obs)−\nicefracm2−αobs−1exp(−1σ2obs[12∥y−Ax∥22+1βobs]) (30c) ∝IG(m2+αobs,12∥y−Ax∥22+1βobs). (30d)

The conditional density for the squared global shrinkage parameter in eq. 22c is equationparentequation

 π3(τ2|x,w2,γ) ∝πpr(x|τ2,w2)πhpr(τ2|γ) (31a) ∝(τ2w2)−\nicefrack2exp(−12k∑i=1[Lx]2iτ2w2i)[(τ2)−\nicefracν2−1exp(−νγτ2)] (31b) ∝(τ2)−\nicefrack2−\nicefracν2−1exp(−1τ2[12k∑i=1[Lx]2iw2i+νγ]) (31c) ∝IG(k+ν2,12k∑i=1[Lx]2iw2i+νγ), (31d)

where in one- and two-dimensional problems, respectively.

Moreover, since the local shrinkage parameters are assumed independent, one can derive the conditional density for their squared version and at each component eq. 22d, as follows equationparentequation

 π4(w2i|x,τ2,ξi) ∝πpr(x|τ2,w2i)πhpr(w2i|ξi) (32a) ∝(τ2w2i)−\nicefrac12exp(−12[Lx]2iτ2w2i)[(w2i)−\nicefracν2−1exp(−νξiw2i)] (32b) ∝(w2i)−\nicefrac12−\nicefracν2−1exp(−1w2i[[Lx]2i2τ2+νξi]) (32c) ∝IG(ν+12,[Lx]2i2τ2+νξi). (32d)

The conditionals for the auxiliary parameters and in eq. 22e and eq. 22f respectively, are derived in a similar manner. These are inverse gamma distributions defined as: equationparentequation

 π5(γ|τ2) ∝IG(ν+12,1τ20+ντ2), (33a) π6(ξi|w2i) ∝IG(ν+12,1+νw2i). (33b)

### 4.3 The computational procedure

Based on the sampling approaches for the full conditional densities discussed above, we define a Gibbs sampler generating states of a Markov chain with stationary distribution eq. 21, .

The Gibbs sampler draws a single sample from each conditional density at each iteration. This uses the fact that, under mild conditions, the set of full conditional distributions determine the joint distribution

besag_1974. The Markov chain approaches its equilibrium condition as the number of iterations increases. This means that after convergence all samples from the chain will be distributed according to the target posterior distribution. Convergence conditions for the Gibbs sampler are defined in roberts_and_smith_1994; tierney_1994. From a practical viewpoint, convergence is guaranteed approximately and explored from a statistical perspective by analyzing the observed output of the chain and exploring ergodic results. For this, we require application of burn-in and lagging steps: (i) the successive values of the chain are only selected after discarding samples during the warm up phase of the algorithm (burn-in period), and (ii) the sample chain values are only stored every iterations since for large enough the samples are virtually independent (lagging steps). As a result, to obtain quasi-independent samples from the posterior, we require Markov chain steps.

Different scanning strategies exist for the Gibbs sampler (see, e.g., gamerman_and_lopes_2006; owen_2019). We follow a deterministic or systematic scan in which all iterations consist of sampling the conditional densities of each component in the same order. This version of the Gibbs sampler applied to the posterior eq. 21 is summarized in Algorithm 1.

The efficiency of Algorithm 1 can be measured in terms of the autocorrelation of the posterior samples. The chain autocorrelation allows the definition of the effective sample size owen_2019

 neff:=ns1+2∑nsj=1ρ(j)ρ(0)≈⌈nsτint⌉, (34)

where denotes the autocorrelation at the th lag and is the integrated autocorrelation time (IACT). Essentially, the effective sample size is used to compare the variance estimated via correlated MCMC samples and the ideal case of a variance computed from independent draws. Thus, the aim is to obtain a value of as close as possible to .

The computational cost of the Gibbs sampler at each iteration can be given in terms of the number of model calls (operations with and ). Hence, the cost corresponding to the simulation of each conditional density are: maximum calls are needed for and evaluation is required to sample the density . Hence, the total cost of a single iteration of the Gibbs sampler is model calls maximum.

## 5 Numerical experiments

In the following, we illustrate the use of the horseshoe prior for edge-preserving inversion and the Gibbs sampler in Algorithm 1 designed for the computations. We consider linear inverse problems arising in imaging science. The first example consists of a one-dimensional deconvolution problem that allows us to test multiple parameter settings in our computational framework. The second example is a two-dimensional computed tomography problem that highlights the potential of the horseshoe prior in more realistic applications.

In our test problems, the point estimates (posterior mean or median) of the target parameter are evaluated using the relative reconstruction error defined as

 relerr:=∥∥¯x−xtrue∥∥2/∥∥xtrue∥∥2, (35)

where denotes the underlying true solution. To further assess the effectiveness of our approach, we compare our solutions with those from the method proposed in uribe_et_al_2021 which uses a Laplace Markov random field prior to achieve edge-preservation. This method, however, relies on a Laplace approximation of the posterior, while Algorithm 1 does not introduce any approximation.

### 5.1 One-dimensional deconvolution

We consider the inverse problem of identifying an unknown piecewise constant signal from noisy convolved data. The mathematical model for convolution can be written as a Fredholm integral equation of the first kind:

 y(t)=∫10A(t−u)x(u)duwithA(t)=exp(−12s2(t−u)2),0≤t≤1, (36)

where denotes the convolved signal and we employ a Gaussian convolution kernel with fixed parameter .

In practice, a finite-dimensional representation of eq. 36 is employed. After discretizing the signal domain into intervals, the convolution model can be expressed as a system of linear algebraic equations . We consider two sets of synthetic observed data with equally-spaced elements. The first set is generated with noise standard deviation and the second one with (corresponding to and errors). Figure 3 shows the true signal together with the two data sets.

The inputs to the Gibbs sampler are studied/selected as follows: (i) The number of samples is , in addition to a burn-in period of . (ii) The scale parameter in the half-Cauchy hyperprior for is studied. (iii) The lag or thinning number is also analyzed. (iv) The influence of the number of standard and preconditioned CGLS iterations is studied as well.

We remark that for this example we can afford the factorization of the precision matrix in eq. 24. Hence, we can directly sample the Gaussian conditional in eq. 23 without using CGLS. This allows us to obtain an exact sample from at each Gibbs iteration. This method is referred to as “direct” in the following studies and will be used to compare to the solution obtained by sampling with standard and preconditioned CGLS.

##### Influence of the scale parameter τ0

In this case, we use the low-noise data set, fix the lagging steps to , and we directly sample the conditional . Moreover, we fix the noise standard deviation to its true value. We perform a parameter study on using the values