1 Introduction
We are interested in the typical normal means estimation problem under an assumption of sparsity of the signal. Suppose that the data $y = (y_1, \dots, y_n)$ is a vector where each data point $y_i$ is normally distributed with mean $\mu + \theta_i$ and variance $\sigma^2$. We propose the following model:

$$y_i \mid \mu, \theta_i, \sigma^2 \sim \mathrm{N}(\mu + \theta_i, \sigma^2), \qquad \theta_i \mid \lambda_i, \tau \sim \mathrm{N}(0, \lambda_i^2 \tau^2), \quad (1)$$

where $\lambda_i$ is the local, individual-level adaptive parameter and $\tau$ is the global adaptive parameter. The assumption of sparsity is that most of the $\theta_i$ are exactly $0$.
The prior distributions of the global parameters ($\mu$, $\sigma^2$, and $\tau$) are not our primary interest in this paper. For $\mu$ and $\sigma^2$, a typical Normal-Inverse-Gamma family with weak prior information has proved to be flexible and is the default prior for these parameters. As for $\tau$, its prior should provide good multiplicity control and serve as an indication of the overall sparsity. The choice of the prior distribution of $\tau$ is based mostly on the data set at hand and the corresponding sampling scheme; common default choices are a gamma distribution or a beta distribution of the second kind. Our main focus in this paper is the prior distribution on the $\lambda_i$ and its effects on sparsity.

1.1 The Horseshoe and Horseshoe+ Priors
At face value, the ability to locally adapt comes directly from the individual-level random effect parameter $\theta_i$. Ideally, one would expect $\theta_i$ to be estimated close to zero if the data point is close to the global mean, and to provide a good estimate of the distance between $y_i$ and $\mu$ otherwise. Define the local shrinkage profile parameter $\kappa_i = 1/(1 + \lambda_i^2)$ and assume $\mu = 0$ and $\sigma = \tau = 1$; then the posterior mean of $\theta_i$ can be represented by the following equation:

$$\mathrm{E}[\theta_i \mid y_i] = \big(1 - \mathrm{E}[\kappa_i \mid y_i]\big)\, y_i. \quad (2)$$

This shows that the amount of posterior shrinkage is controlled by $\kappa_i$, where $\kappa_i = 1$ means total shrinkage and $\kappa_i = 0$ means no shrinkage. Hence, for this type of model, the focus is on the prior placed on the $\kappa_i$, or equivalently on the $\lambda_i$.
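For illustration, the shrinkage weight $\mathrm{E}[\kappa_i \mid y_i]$ in (2) can be computed by one-dimensional quadrature once a prior for $\lambda_i$ is fixed. The following minimal numerical sketch assumes the half-Cauchy prior that defines the Horseshoe, introduced in the next paragraph:

```python
import numpy as np
from scipy import integrate, stats

def post_shrinkage(y, lam_pdf=lambda l: 2.0 / (np.pi * (1.0 + l**2))):
    """E[kappa | y] for y ~ N(theta, 1), theta ~ N(0, lam^2), kappa = 1/(1 + lam^2).
    lam_pdf defaults to the half-Cauchy density of the Horseshoe prior."""
    # Marginally y | lam ~ N(0, 1 + lam^2); average kappa over p(lam | y).
    lik = lambda l: stats.norm.pdf(y, scale=np.sqrt(1.0 + l**2)) * lam_pdf(l)
    num, _ = integrate.quad(lambda l: lik(l) / (1.0 + l**2), 0.0, np.inf)
    den, _ = integrate.quad(lik, 0.0, np.inf)
    return num / den

for y in [0.0, 1.0, 3.0, 6.0]:
    k = post_shrinkage(y)
    print(f"y = {y:3.1f}:  E[kappa|y] = {k:.3f},  posterior mean = {(1 - k) * y:.3f}")
```

Noise-sized observations are shrunk almost entirely back to the global mean, while large observations escape nearly untouched, which is exactly the behavior that (2) formalizes.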
In [Carvalho et al., 2010], the authors give a thorough review of earlier priors for $\kappa_i$ and propose the famous horseshoe-shaped $\mathrm{Beta}(1/2, 1/2)$ prior for $\kappa_i$. The Horseshoe prior immediately drew a significant amount of attention due to its bounded influence on the estimation of $\theta_i$ and its noise-control property. A further reason for the popularity of the Horseshoe prior is the relatively easy sampling scheme through the following hierarchy (shape-rate parametrization), under which $\lambda_i$ is marginally standard half-Cauchy:

$$\lambda_i^2 \mid \nu_i \sim \mathrm{Gamma}(1/2, \nu_i), \qquad \nu_i \sim \mathrm{Gamma}(1/2, 1). \quad (3)$$

It is easy to observe that this hierarchy results in a conjugate sampling procedure for most of the parameters of the Horseshoe model; hence, implementing the Horseshoe prior is very straightforward through Gibbs sampling.
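As a concrete sketch, the updates below use the well-known fully conjugate inverse-gamma augmentation of the same half-Cauchy prior (an equivalent rewriting of (3)), fixing $\mu = 0$ and $\sigma = \tau = 1$ for brevity; a full implementation would add the conjugate steps for the global parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def inv_gamma(shape, scale):
    # If G ~ Gamma(shape, 1/scale), then 1/G ~ Inverse-Gamma(shape, scale).
    return 1.0 / rng.gamma(shape, 1.0 / scale)

def gibbs_step(y, lam2, nu):
    # theta_i | rest ~ N(w_i y_i, w_i), with w_i = lam2_i / (1 + lam2_i)
    w = lam2 / (1.0 + lam2)
    theta = rng.normal(w * y, np.sqrt(w))
    # lam2_i | rest ~ Inv-Gamma(1, 1/nu_i + theta_i^2 / 2)
    lam2 = inv_gamma(1.0, 1.0 / nu + 0.5 * theta**2)
    # nu_i | rest ~ Inv-Gamma(1, 1 + 1/lam2_i)
    nu = inv_gamma(1.0, 1.0 + 1.0 / lam2)
    return theta, lam2, nu

y = np.array([0.1, -0.3, 5.0, 0.2, -7.0])   # toy data
lam2, nu = np.ones_like(y), np.ones_like(y)
draws = []
for _ in range(5000):
    theta, lam2, nu = gibbs_step(y, lam2, nu)
    draws.append(theta)
print(np.median(draws, axis=0))  # heavy shrinkage of the small y_i only
```

Every conditional is a standard distribution, which is what makes the Horseshoe so convenient in practice.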
The Horseshoe prior is more of a good inspiration for the global-local adaptive mechanism than it is a practical estimation and prediction tool. Oftentimes, it fails to provide enough shrinkage to perform well in ultra-sparse situations. In [Bhadra et al., 2017], the authors propose adding two more layers of the Gamma distribution to the hierarchy in (3) in order to attain a better result than the original Horseshoe prior. They refer to this prior as the Horseshoe+ (HS+). The marginal densities of $\theta$ under the HS(+) are
(4) 
Notice that in the tail area both marginals are dominated by a $\theta^{-2}$ term, and that near the origin both marginals are dominated by a $\log(1/|\theta|)$ term. The marginal prior from the HS+ differs from the marginal prior from the HS only up to a slowly varying function of $\theta$. This suggests a minimal improvement of the HS+ over the HS. There is clearly room for improvement by creating priors for the $\lambda_i$ that are dramatically different from those of the HS.
1.2 The Heavy Tail Horseshoe Prior
We propose to include another individual-level parameter $a_i$, which can be referred to as the local decision parameter. This parameter serves as the shape parameter in the Gamma hierarchy in (3), which is modified to

$$\lambda_i^2 \mid a_i, \nu_i \sim \mathrm{Gamma}(a_i, \nu_i), \qquad \nu_i \mid a_i \sim \mathrm{Gamma}(a_i, 1). \quad (5)$$

A first, and simplest, model assumes $a_i \sim \mathrm{Uniform}(0, 1)$. This provides a closed form for the marginal density of $\lambda_i$, which will be discussed in this section. We further investigate other possible choices of the prior on $a_i$ later in the paper. The relationship between $a_i$, $\nu_i$, and $\lambda_i$ can be revealed from (5). Note that a small $a_i$ causes $\nu_i$ to be smaller and $\lambda_i$ to be larger; at the same time, the smaller $\nu_i$ reinforces a larger $\lambda_i$. Similar logic holds for $a_i$ close to one. This decision-reinforcing mechanism characterizes the advantage of our model over the original HS prior and the HS+ prior. The effects of this decision reinforcement will be shown through comparisons of the prior marginal densities of $\lambda_i$ and $\kappa_i$, as well as through posterior effects.
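The reinforcement is easy to see by forward-sampling the hierarchy; a quick sketch, assuming the Gamma form of (5) with the uniform prior on $a_i$:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10**6

a = rng.random(n)              # a_i ~ Uniform(0, 1)
nu = rng.gamma(a, 1.0)         # nu_i | a_i ~ Gamma(a_i, 1)
nu = np.maximum(nu, 1e-300)    # guard against floating-point underflow
lam2 = rng.gamma(a, 1.0 / nu)  # lam_i^2 | a_i, nu_i ~ Gamma(a_i, rate nu_i)

# Small a_i forces nu_i toward zero, which in turn lets lam_i^2 grow large:
for lo, hi in [(0.0, 0.1), (0.45, 0.55), (0.9, 1.0)]:
    m = (a >= lo) & (a < hi)
    print(f"a in [{lo:.2f}, {hi:.2f}):  median nu = {np.median(nu[m]):.2e},"
          f"  P(lam^2 > 100) = {np.mean(lam2[m] > 100.0):.3f}")
```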
In the hierarchy (5), it is possible to integrate $a_i$ and $\nu_i$ out analytically, providing a closed form for the marginal prior of $\lambda_i$ and $\theta_i$. The marginal density of $\lambda_i$ is

$$\pi(\lambda_i) = \frac{1}{\pi \lambda_i \left(1 + \log^2 \lambda_i\right)}. \quad (6)$$

It turns out that the marginal density of $\lambda_i$ is a log-Cauchy density function, and the $\left(1 + \log^2 \lambda_i\right)^{-1}$ part barely makes the density function integrable on the positive real line. Hence, it is an extremely heavy-tailed prior density. We call it the heavy-tailed horseshoe prior, referred to as 'HTHS' in this article. Similar to the HS+ hierarchical structure, an even heavier-tailed prior density can be achieved by introducing another two layers of Gamma distribution on top of the HTHS prior hierarchy and then integrating the hierarchy out to acquire the marginal. Strikingly, the resulting density form is
(7) 
which is also a log-Cauchy distribution, with only a different scaling for $\log \lambda_i$. It will be referred to as 'HTHS+' in this article. Compared to the HS(+) prior densities, the marginal densities of the HTHS(+) are both proportional to $|\theta|^{-1}$ at the origin and at the tail, up to a slowly varying function. This indicates that the HTHS pushes more probability mass toward the origin and the tail areas so as to make a more decisive judgement on the nature of $\theta_i$, coinciding with the initial idea of adding the local decision parameter $a_i$. We can also observe the improvement from the marginals of the shrinkage profile parameter $\kappa_i$:
(8) 
Similar to the arguments made for $\theta_i$, the ratio between the prior densities of the HTHS(+) and the HS(+) diverges both at the origin and at one (up to a slowly varying function), pushing more probability mass away from the central area and avoiding ambiguous decisions. Figures 1 and 2 show a graphical comparison of these priors.
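Both effects are easy to verify numerically under the log-Cauchy form given in (6): the HTHS/HS density ratio in $\lambda$ diverges at both extremes, and the induced $\kappa = 1/(1 + \lambda^2)$ piles more mass near total shrinkage ($\kappa \approx 1$) and near no shrinkage ($\kappa \approx 0$):

```python
import numpy as np

half_cauchy = lambda l: 2.0 / (np.pi * (1.0 + l**2))              # HS
log_cauchy  = lambda l: 1.0 / (np.pi * l * (1.0 + np.log(l)**2))  # HTHS

# The HTHS/HS density ratio in lambda diverges at both extremes:
for l in [1e-4, 1.0, 1e4]:
    print(f"lam = {l:8.0e}:  HTHS/HS ratio = {log_cauchy(l) / half_cauchy(l):8.2f}")

# Monte Carlo mass of kappa = 1/(1 + lam^2) near 0 and near 1.
# exp of a Cauchy draw is exactly log-Cauchy; clip to avoid overflow in lam**2.
rng = np.random.default_rng(1)
n = 10**6
draws = {"HS": np.abs(rng.standard_cauchy(n)),
         "HTHS": np.exp(np.clip(rng.standard_cauchy(n), -150.0, 150.0))}
for name, lam in draws.items():
    kappa = 1.0 / (1.0 + lam**2)
    print(f"{name:4s}: P(kappa < 0.01) = {np.mean(kappa < 0.01):.3f},"
          f"  P(kappa > 0.99) = {np.mean(kappa > 0.99):.3f}")
```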
2 Theoretical Properties
This section provides statements of the theoretical advantages of adopting the HTHS over the HS prior, especially under sparse conditions.
2.1 Marginal Distribution for $y$
The following theorem characterizes the tail behavior of the marginal likelihood under the HTHS(+).
Theorem 1.
Suppose $y \mid \theta \sim \mathrm{N}(\theta, 1)$ and that $\theta$ follows the HTHS(+) prior with $\mu = 0$ and $\sigma = \tau = 1$. Let $m(y)$ denote the marginal density for $y$. Then as $|y| \to \infty$, the marginal density satisfies

$$m(y) \sim \frac{\ell(|y|)}{|y|}, \quad (9)$$

where $\ell$ is a slowly varying function (for every $t > 0$, $\ell(ty)/\ell(y) \to 1$ as $y \to \infty$). As a consequence, we have

$$\frac{d}{dy} \log m(y) \sim -\frac{1}{y}, \quad (10)$$

up to the (more quickly vanishing) score of the slowly varying function $\ell$.
It is striking to observe that both likelihood marginals of the HS(+) are proportional to $y^{-2}$ up to slowly varying functions. Compared to the HS(+), the marginal distribution of the HTHS(+) places more probability mass on large values, as its marginal density is asymptotically of order $|y|^{-1}$. This theorem completely captures the advantage of the HTHS(+) for preserving large signals. In [Carvalho et al., 2010], the authors point out that this kind of robustness to large signals comes from the heavy tail of the prior density of $\theta$. Hence, it is not surprising that the tails of the HTHS(+) marginals of $\theta$ are heavier than those of the HS(+), which will be proved in the next subsection.
2.2 Marginal of $\theta$
The impact of the prior for $\lambda_i$ on the marginal density of the individual random effect parameter $\theta_i$ is one of the key points of all locally adaptive models. However, none of these prior densities of $\theta$ has an analytically closed form, due to the complexity of the marginals of $\lambda_i$. It is true that all four prior densities share basic features: they all have an asymptote at the origin and are in the domain of polynomial tails. Assuming $\mu = 0$, $\sigma = 1$, and $\tau = 1$, the following theorem gives asymptotic upper and lower bounds on the marginal density of $\theta$ from the HTHS(+).
Theorem 2.
Let $\theta \neq 0$. The univariate marginal densities of $\theta$ from the HS+ and the HTHS+ satisfy the inequalities
(11) 
where $\gamma(\cdot, \cdot)$ and $\Gamma(\cdot, \cdot)$ are the lower and upper incomplete gamma functions, respectively, and $C_1$ and $C_2$ are positive constants.
By the nature of the incomplete gamma functions, Theorem 2 indeed establishes the spikiness and slabbiness features of the HTHS(+).
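For readers less familiar with these special functions, the regularized lower and upper incomplete gamma functions are available in SciPy, and their limiting behavior is what drives the spike-and-slab shape (a small sanity check, with an arbitrary illustrative shape value):

```python
from scipy import special

s = 0.5                      # an arbitrary shape parameter for illustration
for x in [0.01, 1.0, 100.0]:
    print(f"x = {x:6.2f}:  lower P(s,x) = {special.gammainc(s, x):.4f},"
          f"  upper Q(s,x) = {special.gammaincc(s, x):.4f}")
```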
The lack of tractable analytical forms for the HS(+) marginals still prevents direct insight into the comparison of the tail behaviors of $\theta$ between the HS(+) and the HTHS(+), which is crucial for estimation robustness. It turns out that a simple integration trick can shed light on this matter if we keep the intermediate hierarchical parameter $\nu_i$ out of the whole integration procedure. We have
(12) 
Equation (12) directly presents the fact that the marginal of $\theta$ from the HTHS has a heavier tail than that of the HS(+), further supporting the marginal likelihood robustness result of Section 2.1.
Figure 3 shows the prior marginal densities of $\theta$ from the HS(+) and the HTHS(+) based on numerical approximation. Notice that the HTHS(+) is spikier at the origin than the HS(+); this spikiness is critical for pursuing a better risk bound when the true signal is indeed sparse, as it provides more shrinkage of the noise. The theorem in the next section makes this risk consideration precise.
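The numerical approximation behind Figure 3 amounts to quadrature of a normal scale mixture; a sketch (again taking the log-Cauchy form in (6) for the HTHS):

```python
import numpy as np
from scipy import integrate, stats

half_cauchy = lambda l: 2.0 / (np.pi * (1.0 + l**2))
log_cauchy  = lambda l: 1.0 / (np.pi * l * (1.0 + np.log(l)**2))

def theta_marginal(theta, lam_pdf):
    # pi(theta) = int_0^inf N(theta; 0, lam^2) pi(lam) dlam
    f = lambda l: stats.norm.pdf(theta, scale=l) * lam_pdf(l)
    cut = 2.0 * abs(theta) + 1.0   # split the range to help the quadrature
    a, _ = integrate.quad(f, 0.0, cut, limit=200)
    b, _ = integrate.quad(f, cut, np.inf, limit=200)
    return a + b

for t in [0.05, 0.5, 2.0, 10.0]:
    print(f"theta = {t:5.2f}:  HS {theta_marginal(t, half_cauchy):.4f},"
          f"  HTHS {theta_marginal(t, log_cauchy):.4f}")
```

The HTHS marginal dominates at very small and very large $|\theta|$, in line with the spikiness and tail claims above.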
2.3 KL Risk Bounds
When the true value of each $\theta_i$ is zero, all of the locally adaptive shrinkage models are superefficient (they converge to the true mean at a rate faster than that of the MLE). This is due to the prior asymptote at the origin. One way to measure their relative efficiencies is through the Kullback-Leibler risk bound, which characterizes the divergence between the true sampling density and the Bayes predictive density. Letting $D(p_{\theta_0} \| \hat{p}_n)$ denote the KL divergence of the Bayes predictive density $\hat{p}_n$ from the true sampling density $p_{\theta_0}$, the Cesàro-average Bayes predictive risk upper bound for true value $\theta_0$ is
(13) 
where $\pi$ is the prior for $\theta$ and $n$ is the number of observations. The following theorem gives the Kullback-Leibler risk bound of the HTHS(+) prior at the origin.
Theorem 3.
Let $\theta_0 = 0$ and assume that $n \to \infty$ and that the posterior predictive density comes from the HTHS(+) model. Then the optimal convergence rate is bounded above by

(14)

for any $\epsilon > 0$, where $C$ is a constant.
In [Bhadra et al., 2017], the authors give the optimal convergence rates of the HS(+) at the true value $\theta_0 = 0$ as
(15) 
The second term on the right-hand side of (15) shows where the improvement of the HS+ over the HS is made. Because there is no change to the leading term, one could argue that the HS+ prior does not make a significant change in risk. However, this upper bound is somewhat crude, and we show the direct computation of the integral in Figure 4. The HTHS(+) provides a dramatic improvement in the leading term: the shrinking of the first term on the right-hand side of (14), for any $\epsilon > 0$, shows that the HTHS(+) attains a better KL risk bound and is more efficient at suppressing noise. In Figure 4, we also show the relative risks for nonzero $\theta_0$. The differences in efficiency for large signals are not really seen until the signal size is quite large. This shows that the gains in efficiency are really due to better modeling of the true zero signals, especially in ultra-sparse settings. This efficiency feature will be supported by the simulation study in the next section, especially under the ultra-sparse condition.
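The direct computation referred to above reduces, for a single observation with $\mu = 0$ and $\sigma = \tau = 1$, to a double quadrature of the divergence between the true sampling density $\mathrm{N}(\theta_0, 1)$ and the prior predictive density $m$. A sketch for the HS case (the HTHS case only swaps in the log-Cauchy density for $\lambda$):

```python
import numpy as np
from scipy import integrate, stats

half_cauchy = lambda l: 2.0 / (np.pi * (1.0 + l**2))

def predictive(y, lam_pdf):
    # m(y) = int N(y; 0, 1 + lam^2) pi(lam) dlam, with theta integrated out
    f = lambda l: stats.norm.pdf(y, scale=np.sqrt(1.0 + l**2)) * lam_pdf(l)
    return integrate.quad(f, 0.0, np.inf)[0]

def kl_risk(theta0, lam_pdf):
    # KL( N(theta0, 1) || m ) by an outer quadrature over y
    f = lambda y: stats.norm.pdf(y, loc=theta0) * (
        stats.norm.logpdf(y, loc=theta0) - np.log(predictive(y, lam_pdf)))
    return integrate.quad(f, theta0 - 10.0, theta0 + 10.0, limit=100)[0]

print("KL risk at theta0 = 0 under the HS:", kl_risk(0.0, half_cauchy))
```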
3 Extension
Originally, the local decision parameter $a_i$ is assigned a uniform Beta(1, 1) prior distribution, indicating no particular favor toward signals or outliers. However, if prior knowledge suggests a possibly ultra-sparse situation, a variant of the HTHS(+) with even stronger suppression of the noise is desirable. This motivation is reflected in an asymmetric prior for $a_i$, which in turn results in an asymmetric prior marginal density for the shrinkage profile parameter $\kappa_i$. We propose the following prior hierarchy on $a_i$:

(16)

and denote this variant of the HTHS as the asymmetric HTHS.
Depending on its hyperparameters, the prior density of $a_i$ can either have an asymptote at the origin or stack more probability mass near one. Unlike the prior on $a_i$ associated with the HTHS, which is a straight line on the unit interval, the asymmetry of the prior associated with the asymmetric HTHS further reinforces the decision to treat a data point as noise or signal, in particular offering extra shrinkage. A keen reader will notice the similarity between (16) and the HS, or even the HTHS, hierarchical structure, which is also shown in Figure 6. Indeed, (16) can easily be turned into the HS or the HTHS prior, depending on the situation, but we fear that including more parameters is more than the model currently needs. Notice that $a_i$ is a constant equal to $1/2$ in the HS prior. All the marginal densities of $\kappa_i$ are shown in Figure 6.
Usually, when the signals are sparse, the estimation risk comes from two sources: one from under-shrinking the noise, and the other from over-shrinking the true signals. If the situation is ultra-sparse, this asymmetric structure should provide extra shrinkage, reducing the overall estimation risk. This is exactly what Figure 6 shows, presenting the log-predictive densities of the HS, the HTHS, and the asymmetric HTHS. First, notice that the log-predictive density of the asymmetric HTHS has almost identical tails to that of the HTHS, outperforming the HS on signal robustness. Furthermore, the introduced asymmetry gives the asymmetric HTHS a smaller risk than the HTHS near the origin, making it more suitable for coping with the ultra-sparse situation.
4 Simulation
In this section, we show a simple simulation comparing the posterior estimation of the HTHS(+) to that of the HS(+), to demonstrate our theoretical claims in Section 2. While exhaustive simulations could be done, this simulation is indicative of the results obtained for any data analysis using these models. The data are simulated from the following true model:

$$y_i \mid \theta_i \sim \mathrm{N}(\theta_i, 1), \qquad \theta_i \sim w\, g + (1 - w)\, \delta_0,$$

where $w$ is an overall sparsity parameter, $g$ is the distribution of the nonzero signals, and $\delta_0$ is a point mass at zero. All five Bayesian locally adaptive models, the HS(+), the HTHS(+), and the asymmetric HTHS, are included in this simulation for a direct comparison. They are assigned the same prior distributions for the global parameters to control for discrepancies from any unwanted source. A typical Gibbs sampler with MH steps is implemented for the posterior sampling, with burn-in and thinning. We consider two measurements of the performance of each model: the mean absolute error and the distance to the oracle M.L.E. estimator. The oracle M.L.E. of $\theta_i$ is defined as $y_i$ for those $\theta_i$ that are nonzero and as $0$ if $\theta_i = 0$. The marginal posterior median is adopted as the point estimator under each model due to the multimodality of each marginal posterior.
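A sketch of the data generation and the two error measurements follows; the Gaussian signal distribution $g = \mathrm{N}(0, \psi^2)$ is an illustrative assumption, and `theta_hat` stands for the vector of posterior medians produced by whichever model is being scored:

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate(n=500, w=0.05, psi=3.0):
    # theta_i = 0 with probability 1 - w, otherwise a N(0, psi^2) signal
    is_signal = rng.random(n) < w
    theta = np.where(is_signal, rng.normal(0.0, psi, n), 0.0)
    y = rng.normal(theta, 1.0)
    return y, theta, is_signal

def scores(theta_hat, y, theta, is_signal):
    mae = np.mean(np.abs(theta_hat - theta))
    oracle = np.where(is_signal, y, 0.0)   # oracle M.L.E.: y_i if signal, else 0
    return mae, np.mean(np.abs(theta_hat - oracle))

y, theta, is_signal = simulate(w=0.01)     # the ultra-sparse scenario
```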
              w = 0.2          w = 0.05         w = 0.01
            M.A.E.   Ora.    M.A.E.   Ora.    M.A.E.   Ora.
M.L.E.        321     262      323     307      318     315
HS            198     146       79      67       53      51
HS+           135      85       65      53       51      49
HTHS          134      84       54      41       40      38
HTHS+          95      46       47      34       36      34
asym. HTHS    104      58       33      22       13      11

w, the degree of sparsity; M.A.E., mean absolute error; Ora., distance to the oracle M.L.E. estimator; M.L.E., the maximum likelihood estimator. The numbers are averages over 20 replicates.
Table 1 offers strong evidence for the superiority of the HTHS(+) over the HS(+) when dealing with sparse situations. Notice that the HS+ indeed improves over the original HS when the sparsity is moderate, but the two models are quite close to each other when the situation is ultra-sparse. In other words, there is no clear advantage to adopting the HS+ over the HS under severe sparsity. As the severity of the sparsity increases, the HTHS(+) makes significant gains over the HS+ in suppressing noise. It is very interesting to observe that the asymmetric HTHS has a decisive advantage under the ultra-sparse situation compared to every other model, even though it does worse than the HTHS+ under moderate sparsity. This also shows that the strong asymptote at the origin of the prior in (16) helps large signals escape the extra shrinkage introduced by the asymmetry, preserving robustness.
5 Conclusion
In this article, we propose an innovative Bayesian locally adaptive model that extends the hierarchical structures of the HS and HS+ priors by adding a local decision parameter. Integrating out the hierarchical parameters yields a log-Cauchy prior distribution for the local adaptive parameter under the HTHS(+) model. We prove several theorems to theoretically justify the better shrinkage properties of the HTHS(+), and these properties are exhibited in a simulation study. The next step in this research is to introduce asymmetry into the prior for the local shrinkage profile, which can provide even more dramatic risk gains for ultra-sparse signals.
References
[Bhadra et al., 2017] Bhadra, A., Datta, J., Polson, N. G., and Willard, B. (2017). The horseshoe+ estimator of ultra-sparse signals. Bayesian Analysis, 12(4):1105-1131.
 [Carvalho et al., 2010] Carvalho, C. M., Polson, N. G., and Scott, J. G. (2010). The horseshoe estimator for sparse signals. Biometrika, 97(2):465–480.