# On Global-local Shrinkage Priors for Count Data

The global-local shrinkage prior has been recognized as a useful class of priors that can strongly shrink small signals toward the prior mean while leaving large signals unshrunk. Although such priors have been extensively discussed under Gaussian responses, count responses are frequently encountered in practice, and the existing knowledge of global-local shrinkage priors cannot be directly imported to them. In this paper, we discuss global-local shrinkage priors for analyzing sequences of counts. We provide sufficient conditions under which the posterior mean keeps the observation as it is for very large signals, known as the tail robustness property. We then propose tractable priors that meet the derived conditions approximately or exactly, and develop an efficient posterior computation algorithm for Bayesian inference. The proposed methods are free from tuning parameters; that is, all the hyperparameters are automatically estimated from the data. We demonstrate the proposed methods through simulation and an application to a real dataset.


## Code Repositories

### GLSP-count

Global-local shrinkage prior for count data https://arxiv.org/abs/1907.01333


## 1 Introduction

High-dimensional count data appear in a variety of scientific fields, including genetics, epidemiology, and social science. In such data, it is frequently observed that many of the counts are very small or nearly zero, except for some outliers. For example, in crime statistics where the area of interest is divided into small sub-regions, the number of occurrences of a specific crime is likely to be small or zero in many sub-regions, while it remains important to detect "hotspots", i.e., regions with unexplainedly high crime rates. In this context, the Poisson-gamma model is clearly inappropriate, for the gamma prior shrinks all the observations uniformly, including the large signals that should be kept unshrunk. A desirable prior should account for both small and large signals and realize flexible shrinkage effects on the Poisson rates.

Priors of this type have been studied as global-local shrinkage priors for Gaussian observations. The sparse signals of high-dimensional continuous observations are detected by the horseshoe prior, which exhibits the aforementioned shrinkage property and is comparable to variable selection (Carvalho et al., 2010). It has been extended to the three-parameter beta distribution for more flexible modeling of sparseness (Armagan et al., 2011, 2013). In hierarchical models, such priors have been adopted for random effect distributions in small area estimation (Tang et al., 2018) and default Bayesian analysis (Bhadra et al., 2016). For recent developments, see, for example, Bhadra et al. (2019) and the references therein.

While extensively studied for Gaussian data, global-local shrinkage priors have not been fully developed for count data, although Poisson likelihood models with hierarchical structure are widely used in applications such as disease mapping (see, for example, Wakefield 2006 and Lawson 2013). The theory related to Poisson likelihoods has been well developed, but not necessarily from the viewpoint of global-local shrinkage (e.g., Brown et al. 2013 and Yano et al. 2019). The standard Bayesian model for count data is of Poisson-gamma type; the gamma prior for the Poisson rate resembles a global-local shrinkage prior if one further assumes a hierarchical prior on the gamma scale parameters. In this context, the use of heavy-tailed hierarchical priors has already been practiced (e.g., Zhu et al. 2019), but research on the general statistical properties of such priors has been limited. The theoretical properties of the Bayes estimators of those models have been partially investigated by Datta and Dunson (2016), with a focus on (a generalized version of) the three-parameter beta prior for the analysis of zero-inflated count data. Our research further develops the global-local shrinkage from the viewpoint of the heavy-tail property, which ensures that large signals are less affected, or not affected at all, by the shrinkage.

Consider a sequence of counts, $y_1,\dots,y_m$, modeled by $y_i\mid\lambda_i\sim\mathrm{Po}(\eta_i\lambda_i)$ and $\lambda_i\mid u_i\sim\mathrm{Ga}(\alpha,\beta/u_i)$ with shape $\alpha$ and global precision $\beta$, where $\lambda_i$ represents a random effect and $\eta_i$ is a structured part characterizing dependence on covariates or additional hierarchical structures. The count-specific scale parameters, $u_i$, follow a common prior denoted by $\pi(u)$ and are expected to induce the local shrinkage. Our objective is to consider the effect of the choice of prior $\pi$ on the Bayes estimator of the Poisson rate, $\tilde{\lambda}_i$. We start by revisiting the definition of tail-robustness for the Bayes estimators, stated as $|\tilde{\lambda}_i-y_i|\to0$ as $y_i\to\infty$, which is a stronger requirement than that of Datta and Dunson (2016), as discussed in Proposition 1. Requiring tail-robustness of the Bayes estimators helps us restrict attention to the class of priors with this desirable property. Theorem 1 and Corollary 1 give sufficient conditions for tail-robustness and lead to the two classes of prior distributions for $u_i$ that we propose: the inverse-gamma prior and the newly-introduced extremely heavy-tailed prior. The latter satisfies the condition for tail-robustness exactly and is tail-robust, while the former has negligible asymptotic bias for large signals and hence is approximately tail-robust.

Both priors are conditionally conjugate for most of the parameters in the model, which allows fast and efficient posterior analysis by the Gibbs sampler. In the numerical study, we observe the tail-robustness properties theoretically guaranteed for these priors, while the Poisson-gamma prior suffers from overly-shrunk Bayes estimators for outliers. The difference between the two proposed priors is also empirically confirmed: the inverse-gamma prior is better in point estimation for small signals, having a stronger shrinkage effect toward the prior mean, while the extremely heavy-tailed prior is more successful in quantifying the uncertainty of large counts, as shown by the coverage rates of posterior credible intervals. Despite this difference, both priors perform almost equally in the analysis of actual crime data in Japan, detecting hotspots of crimes that are overlooked by the Poisson-gamma model.

The rest of the paper is organized as follows. In Section 2, we provide conditions for tail robustness and propose two classes of local priors that meet the conditions. The implied marginal priors of the Poisson rate are given in Section 3, together with the specific Gibbs sampler derived for posterior computation under the proposed priors. Section 4 is devoted to numerical experiments for an extensive comparison of the proposed priors with other commonly-used priors/estimators under various settings. An application to real crime data in the Tokyo metropolitan area, Japan, is discussed in Section 5.

## 2 Global-local shrinkage prior for count data

### 2.1 Settings and definitions

Our model has the following hierarchical representation; the observations are conditionally independent and follow

$$y_i\mid\lambda_i\sim\mathrm{Po}(\eta_i\lambda_i),\qquad \lambda_i\mid u_i\sim\mathrm{Ga}(\alpha,\beta/u_i),\qquad u_i\sim\pi(u_i), \tag{1}$$

for $i=1,\dots,m$, where $\eta_i$ is a known offset and $(\alpha,\beta)$ are the hyperparameters. The offset term, $\eta_i$, can be any known constant in general; in practice, it is flexibly modeled by regression with the log link function, as we examine in Section 5. In what follows, we assume $\eta_i=1$ for simplicity. The two scale parameters of the gamma prior, $\beta$ and $u_i$, control the global and local shrinkage effects, respectively. Under this model, the Bayes estimator of the Poisson rate is

$$\tilde{\lambda}_i=E\left[\frac{(\alpha+y_i)u_i}{\beta+u_i}\,\middle|\,y_i\right],$$

where the expectation is taken with respect to the marginal posterior of $u_i$; the conditional posterior mean $(\alpha+y_i)u_i/(\beta+u_i)$ shrinks $y_i$ toward the prior mean $\alpha u_i/\beta$.

We consider the appropriate choice of prior $\pi(u)$ in terms of the shrinkage effect realized in the Bayes estimator $\tilde{\lambda}_i$. As discussed in the introduction, the estimator should not be shrunk toward the prior mean when a large signal is observed. This property is called tail-robustness (e.g., Carvalho et al., 2010) and is mathematically defined as

$$\lim_{y_i\to\infty}|\tilde{\lambda}_i-y_i|=0. \tag{2}$$

This is a strong requirement for the estimators and, in comparison with other definitions of tail-robustness, we refer to this property as strong tail-robustness whenever necessary.

Tail-robustness can be defined in several ways; one variant is based on the ratio of the estimator to the observation, $\tilde{\lambda}_i/y_i\to1$ as $y_i\to\infty$, which we name weak tail-robustness. It is obvious that strong tail-robustness implies the weak one. Weak tail-robustness is related to the mean absolute percentage error (MAPE) loss function, which is frequently used in practice to evaluate the inferential/predictive performance of models for count data. In this sense, a weakly tail-robust estimator is asymptotically equivalent to the estimator optimal under MAPE (Berry et al., 2019, Section 3.3.2).

For fixed $u_i$, the Bayes estimator is $(\alpha+y_i)u_i/(\beta+u_i)$ and clearly loses the tail-robustness of our definition. In other literature, the concept of tail-robustness is defined through the deviation of the estimator not from the observation $y_i$, but from the conditional Bayes estimator with fixed $u_i$, and is expressed as

$$\lim_{y_i\to\infty}\frac{\tilde{\lambda}_i}{\alpha+y_i}=1,$$

as used in Datta and Dunson (2016). This property is implied by the weak (hence also the strong) tail-robustness.

We allow the class of prior densities to include improper priors. For each value of $\alpha$, the class of priors that yield a proper posterior distribution for any observation is expressed by

$$\Pi_\alpha=\left\{\ \pi(\cdot)\ \text{non-negative}\ \middle|\ \int_0^\infty\frac{\pi(u)}{(1+u)^{\alpha}}\,du<\infty\ \right\}. \tag{3}$$

This expression implies that posterior propriety is independent of the value of $\beta$. Furthermore, we define the class of priors that have posterior propriety for any choice of hyperparameters by the intersection of these classes over $\alpha$. Any proper prior has posterior propriety and hence belongs to this class. The sequence of classes of priors is decreasing as $\alpha$ decreases; for $\alpha\le\alpha'$, we have $\Pi_\alpha\subset\Pi_{\alpha'}$. In Section 2.3, we give a specific improper prior that belongs to every $\Pi_\alpha$; see equation (7).

### 2.2 General conditions for tail-robustness

In this subsection, we provide conditions for tail-robustness. First, we prove the weak tail-robustness for a large class of priors.

###### Proposition 1.

Suppose that $\pi$ is strictly positive and that $\pi\in\Pi_\alpha$. Then, under model (1) with the prior $\pi$, we have

$$\lim_{y_i\to\infty}\frac{|\tilde{\lambda}_i-y_i|}{y_i}=0, \tag{4}$$

i.e., the Bayes estimator is weakly tail-robust.

The detailed proofs of the proposition above and all the following statements are given in the supplemental material. The key implication of this proposition is that, for fixed hyperparameters, the weak tail-robustness (and its variant discussed in Section 2.1) holds for almost all priors. For this reason, weak tail-robustness is not useful in characterizing the desirable property that priors should have in this context. This observation motivates us to pursue the stronger requirement, satisfied by a more limited class of priors, which is the tail-robustness of our definition.

To consider the strong tail-robustness, the next theorem is useful in evaluating the asymptotic bias for a variety of priors. In what follows, we assume that $\pi$ is strictly positive and continuously differentiable.

###### Theorem 1.

Suppose that $\pi$ satisfies the following two conditions:

$$\int_0^1\pi(u)\,du+\int_0^1|u\pi'(u)|\,du<\infty, \tag{A1}$$

$$\lim_{u\to\infty}\frac{u\pi'(u)}{\pi(u)}=\xi\in[-\infty,\infty]. \tag{A2}$$

(i) If $\xi=-\infty$, then $\tilde{\lambda}_i-y_i\to-\infty$ as $y_i\to\infty$; if $\xi=+\infty$, then $\tilde{\lambda}_i-y_i\to+\infty$.

(ii) If $\xi\in(-\infty,\infty)$, then

$$\lim_{y_i\to\infty}(\tilde{\lambda}_i-y_i)=1+\xi.$$

The interplay of the shape parameter $\alpha$, posterior propriety, and asymptotic bias is clear in this theorem. The asymptotic bias of $\tilde{\lambda}_i$ can be characterized by the tail behavior of the mixing distribution $\pi$. The condition is similar to, but significantly different from, that for Gaussian responses (e.g., Tang et al., 2018). It is immediate from Theorem 1 that $\xi=-1$ is a sufficient condition for the estimator to be strongly tail-robust, which is summarized in the following corollary.

###### Corollary 1.

Under the conditions (A1) and

$$\lim_{u\to\infty}\frac{u\pi'(u)}{\pi(u)}=-1, \tag{A3}$$

the Bayes estimator is strongly tail-robust, that is, $|\tilde{\lambda}_i-y_i|\to0$ as $y_i\to\infty$.

The crucial assumption in the above corollary is (A3), which describes the desirable tail behavior of the prior distribution of $u_i$. In fact, (A3), and more generally (A2), are sufficient conditions for $u\pi(u)$ (more generally, $u^{-\xi}\pi(u)$) to be slowly varying as $u\to\infty$, i.e., to satisfy $L(cu)/L(u)\to1$ for all $c>0$ (e.g., see Seneta 1976, equation (1.11)). It further implies that the marginal prior of $\lambda_i$ satisfies $p(\lambda_i)/\pi(\lambda_i)\to1$ as $\lambda_i\to\infty$, under a regularity condition justifying the interchange of limit and integral. In other words, under this assumption, the marginal densities of $\lambda_i$ and $u_i$ are asymptotically equivalent in the tail as density functions.

An example of priors satisfying assumption (A3) is $\pi(u)\propto1/u$. In many cases, (A3) requires priors to be of this form in the tail; see Section 2.3. This prior, however, does not satisfy the other assumption (A1), being non-integrable around zero. This problem can be circumvented by a slight modification of the prior: keep $\pi(u)\propto1/u$ for large $u$ and modify the density near the origin so that it becomes integrable there, in which case both assumptions are satisfied. From this viewpoint, (A1) is merely a technical requirement for the proof.

One notable feature of Corollary 1 is that the sufficient conditions for tail-robustness, (A1) and (A3), are independent of the values of the hyperparameters $\alpha$ and $\beta$. For this reason, the corollary can guarantee tail-robustness even when we estimate those hyperparameters or consider priors for them. This stands in sharp contrast to other approaches, e.g., Proposition 1 of Datta and Dunson (2016), where tail-robustness is discussed only for limiting values of the hyperparameters.

### 2.3 Proper priors for ui

In this subsection, we introduce two classes of proper priors for $u_i$ that approximately or exactly satisfy the conditions for tail-robustness given in Corollary 1. We also discuss the three-parameter beta prior from the viewpoint of tail-robustness. The prior of $u_i$ may depend on another hyperparameter $\gamma$; we explicitly denote it by $\pi(u_i;\gamma)$.

The inverse gamma (IG) prior is given by

$$\pi(u_i;\gamma)=\pi_{\mathrm{IG}}(u_i;\gamma)=\frac{\gamma^{\gamma}}{\Gamma(\gamma)}\left(\frac{1}{u_i}\right)^{1+\gamma}e^{-\gamma/u_i}, \tag{5}$$

i.e., the density of $\mathrm{IG}(\gamma,\gamma)$ for $u_i>0$. It is clearly proper and conditionally conjugate, which simplifies the posterior computation in practice. The conditions in Corollary 1 are not satisfied by this prior. Instead, by applying Theorem 1 with $\xi=-(1+\gamma)$, we evaluate the asymptotic bias as $-\gamma$, which decreases to zero as $\gamma\to0$. Since the bias is negligible for small $\gamma$, this result shows the "approximate" tail-robustness of this prior. The choice of the inverse-gamma distribution with common shape and rate parameters is crucial for the approximate tail-robustness. For example, for the exponential prior $\pi(u_i)=e^{-u_i}$, the asymptotic bias can never be ignored because $\xi=-\infty$ by Theorem 1.
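The tail index $\xi$ in (A2) can be checked symbolically for the two priors just discussed; a minimal sketch in Python with sympy, assuming the $\mathrm{IG}(\gamma,\gamma)$ density in (5) and a unit-rate exponential prior (illustrative code, not part of the paper).

```python
# Symbolic computation of xi = lim_{u -> oo} u * pi'(u) / pi(u) from (A2).
import sympy as sp

u, g = sp.symbols("u gamma", positive=True)

# IG(gamma, gamma) prior (5): xi = -(1 + gamma), so the asymptotic
# bias 1 + xi = -gamma vanishes as gamma -> 0 (approximate tail-robustness).
pi_ig = g**g / sp.gamma(g) * u**(-1 - g) * sp.exp(-g / u)
xi_ig = sp.limit(sp.simplify(u * sp.diff(pi_ig, u) / pi_ig), u, sp.oo)
print(xi_ig)  # -gamma - 1

# Exponential prior: xi = -infinity, so the bias can never be ignored.
pi_exp = sp.exp(-u)
xi_exp = sp.limit(u * sp.diff(pi_exp, u) / pi_exp, u, sp.oo)
print(xi_exp)  # -oo
```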

The other class of priors we propose originates from the three-parameter beta (TPB) distributions (Armagan et al., 2011), whose density is given by

$$\pi(u_i)\propto\frac{u_i^{a_0-1}}{(1+\phi_0u_i/\beta)^{\gamma_0}(1+u_i/\beta)^{a_0+b_0-\gamma_0}}, \tag{6}$$

where $a_0$, $b_0$, $\gamma_0$, and $\phi_0$ are all positive constants. For zero-inflated count data with the Poisson likelihood, Datta and Dunson (2016) considered a version of this prior. Although the TPB prior is well-known and flexible, its density in (6) does not satisfy assumption (A3) of Corollary 1 and is not tail-robust for any choice of hyperparameters. Under the prior in (6), by Theorem 1, the asymptotic bias is $-b_0$: negative and dependent on the hyperparameter $b_0$. Similar to the inverse-gamma prior, approximate tail-robustness for the TPB prior is justified only in the limiting case $b_0\to0$;

$$\pi(u_i)\propto\frac{u_i^{\gamma}}{(1+u_i)^{1+\gamma}},\qquad\gamma>\underline{\gamma}\ge-1, \tag{7}$$

which is shown to be strongly tail-robust by Corollary 1. In return for the tail-robustness in the limit, however, the prior in (7) is inevitably improper. Under an improper prior for $u_i$, the resulting marginal prior for $\lambda_i$ is also improper and too non-informative, in which case the posterior distribution of $\lambda_i$ would not successfully reflect the prior information and be shrunk toward the prior mean.

Motivated by the improper prior in (7), we newly introduce a modified version of (7), named the extremely heavy-tailed (EH) prior, defined by the density

$$\pi(u_i;\gamma)=\pi_{\mathrm{EH}}(u_i;\gamma)=\frac{\gamma}{(1+u_i)\{1+\log(1+u_i)\}^{1+\gamma}} \tag{8}$$

for $\gamma>0$. The additional logarithmic factor in the denominator ensures the integrability of the density function and is often seen in the literature on decision-theoretic statistical theory (for example, see Maruyama and Strawderman 2019, Remark 4.1). This prior is proper because

$$\int_0^\infty\pi_{\mathrm{EH}}(u;\gamma)\,du=\Big[-\{1+\log(1+u)\}^{-\gamma}\Big]_0^\infty=1.$$

It satisfies the conditions of Corollary 1: $u\pi'(u)/\pi(u)\to-1$ as $u\to\infty$, hence the EH prior is strongly tail-robust. Thus, the prior remains proper while maintaining the tail property of the original improper prior in (7). In addition, the EH prior can be augmented by latent variables, which enables simple and efficient posterior computation by the Gibbs sampler; details are given in Section 3.3.
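Both claims (propriety and condition (A3)) are easy to confirm numerically; a sketch assuming $\gamma=1.5$, using scipy (illustrative code, not from the paper).

```python
# Numerical sanity check of the EH density (8).
import numpy as np
from scipy.integrate import quad

gam = 1.5

def pi_eh(u):
    return gam / ((1.0 + u) * (1.0 + np.log1p(u)) ** (1.0 + gam))

# Propriety: quadrature over (0, U) matches the closed-form antiderivative
# value 1 - {1 + log(1 + U)}^{-gamma}, which tends to 1 as U grows.
U = 1e3
num_mass, _ = quad(pi_eh, 0, U)
cdf_U = 1.0 - (1.0 + np.log1p(U)) ** (-gam)
print(num_mass, cdf_U)

# Condition (A3): u pi'(u)/pi(u) = d log pi / d log u -> -1, at a slow
# (logarithmic) rate; checked by a central difference on the log scale.
def xi(u, eps=1e-6):
    return (np.log(pi_eh(u * (1 + eps))) - np.log(pi_eh(u * (1 - eps)))) / (2 * eps)

print(xi(1e8))  # about -1.13, drifting toward -1
```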

In theory, the class of priors of this type can be extended to

$$\pi(u_i)\propto\frac{u_i^{\gamma_1-1}}{(1+\gamma_2u_i)^{\gamma_1}\{1+\gamma_3\log(1+\gamma_4+u_i)\}^{\gamma_5}},$$

which is also proper and tail-robust. However, such a general prior would be over-parametrized, and posterior analysis is not feasible in practice because of the unknown normalizing constant involving the hyperparameters to be estimated. We do not consider this extension in this research, but focus on the extensive study of its special case, the EH prior in (8).

## 3 Properties of the proposed priors

### 3.1 Marginal prior distributions for λi

In this section, the marginal density of $\lambda_i$ is computed to study its behavior in the limits $\lambda_i\to0$ and $\lambda_i\to\infty$. The tail property of the marginal density shows the same approximate/exact tail-robustness confirmed in the previous section. The behavior of the density around zero provides important information for understanding the amount of shrinkage toward zero, which has not been discussed so far. In general, the marginal prior distribution of $\lambda_i$ is given by

$$p(\lambda_i;\alpha,\beta,\gamma)=\int_0^\infty\frac{(\beta/u_i)^{\alpha}}{\Gamma(\alpha)}\lambda_i^{\alpha-1}e^{-(\beta/u_i)\lambda_i}\pi(u_i;\gamma)\,du_i=\frac{\beta^{\alpha}}{\Gamma(\alpha)}\int_0^\infty\frac{1}{x^{\alpha}}e^{-\beta/x}\pi(\lambda_ix;\gamma)\,dx.$$

We continue the computation of this density for the two classes of priors, IG and EH.

For the IG prior $\pi_{\mathrm{IG}}(u_i;\gamma)$, we have

$$p(\lambda_i;\alpha,\beta,\gamma)=\frac{(\beta/\gamma)^{\alpha}}{B(\alpha,\gamma)}\frac{\lambda_i^{\alpha-1}}{\{1+(\beta/\gamma)\lambda_i\}^{\alpha+\gamma}}, \tag{9}$$

which implies a scaled beta distribution, i.e.,

$$\frac{(\beta/\gamma)\lambda_i}{1+(\beta/\gamma)\lambda_i}\sim\mathrm{Beta}(\alpha,\gamma).$$

From (9), we have $p(\lambda_i;\alpha,\beta,\gamma)\propto\lambda_i^{-1-\gamma}$ as $\lambda_i\to\infty$. For sufficiently small $\gamma$, the marginal prior of $\lambda_i$ is thus heavy-tailed, behaving almost like $\lambda_i^{-1}$ in the tail. This observation is coherent with the $\gamma$-dependent asymptotic bias, $-\gamma$, of the Bayes estimator. It should also be noted that, due to the heavy tail of this density, the prior mean of $\lambda_i$ does not exist if $\gamma\le1$. This can be confirmed by direct computation, or by the fact that the prior mean of $u_i$ is not finite under this condition. In this situation, it is difficult to interpret the prior from the viewpoint that the estimator is shrunk toward the prior mean. For those who prefer a prior with finite mean, we recommend a modification of the IG prior that guarantees a finite mean at the cost of a slightly increased asymptotic bias.

In contrast, the density at the origin depends on the value of $\alpha$: by (9), $p(\lambda_i;\alpha,\beta,\gamma)$ diverges as $\lambda_i\to0$ for $\alpha<1$, while the limit is positive and finite for $\alpha=1$ and zero for $\alpha>1$. This fact gives a clue to the interpretation of the choice of, or the posterior inference for, the hyperparameter $\alpha$.
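The scaled-beta representation implied by (9) can be verified by simulation; a sketch with illustrative values $\alpha=2$, $\beta=3$, $\gamma=1.5$ (not taken from the paper).

```python
# Monte Carlo check: with u ~ IG(gamma, gamma) and lambda | u ~ Ga(alpha, beta/u)
# (rate parametrization), (beta/gamma)*lambda / (1 + (beta/gamma)*lambda)
# should follow Beta(alpha, gamma).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, beta, gam = 2.0, 3.0, 1.5
n = 200_000

u = gam / rng.gamma(gam, 1.0, size=n)      # u ~ IG(gamma, gamma)
lam = rng.gamma(alpha, u / beta, size=n)   # rate beta/u, i.e., scale u/beta
t = (beta / gam) * lam / (1.0 + (beta / gam) * lam)

ks = stats.kstest(t, "beta", args=(alpha, gam))
print(ks.statistic)  # small: consistent with Beta(alpha, gamma)
```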

For the EH prior, the marginal density is evaluated around zero as follows:

$$p(\lambda_i;\alpha,\beta,\gamma)=\frac{\beta^{\alpha}\gamma}{\Gamma(\alpha)}\int_0^\infty\frac{e^{-\beta/x}}{x^{\alpha}}\,\frac{1}{1+\lambda_ix}\,\frac{1}{\{1+\log(1+\lambda_ix)\}^{1+\gamma}}\,dx\ \to\ \frac{\beta^{\alpha}\gamma}{\Gamma(\alpha)}\int_0^\infty\frac{e^{-\beta/x}}{x^{\alpha}}\,dx=\begin{cases}\gamma\beta/(\alpha-1)&\text{if }\alpha>1,\\ \infty&\text{if }\alpha\le1,\end{cases}$$

as $\lambda_i\to0$, by the monotone convergence theorem. Thus, the marginal density is bounded at the origin when $\alpha>1$, and its value there increases with $\gamma$, implying the shrinkage of small signals toward the global prior mean. For the tail property, we have

$$\lim_{\lambda_i\to\infty}\frac{p(\lambda_i;\alpha,\beta,\gamma)}{\pi_{\mathrm{EH}}(\lambda_i;\gamma)}=\frac{\beta^{\alpha}}{\Gamma(\alpha)}\int_0^\infty\frac{e^{-\beta/x}}{x^{\alpha}}\left[\lim_{\lambda_i\to\infty}\frac{1+\lambda_i}{1+\lambda_ix}\left\{\frac{1+\log(1+\lambda_i)}{1+\log(1+\lambda_ix)}\right\}^{1+\gamma}\right]dx=\frac{\beta^{\alpha}}{\Gamma(\alpha)}\int_0^\infty\frac{e^{-\beta/x}}{x^{\alpha+1}}\,dx=1.$$

Therefore, $p(\lambda_i;\alpha,\beta,\gamma)\sim\pi_{\mathrm{EH}}(\lambda_i;\gamma)$ as $\lambda_i\to\infty$, which means that the marginal prior is proper but has a sufficiently heavy tail, so that the model can accommodate large signals. For the computation verifying the result above, see the supplemental material.
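Both limits can be checked numerically from the mixture formula at the beginning of this section; a sketch with $\alpha=2$, $\beta=\gamma=1$. The constant $\gamma\beta/(\alpha-1)$ at the origin is our reading of the display above, and the tail ratio approaches one only at a logarithmic rate.

```python
# Numerical check of the EH marginal near zero and in the tail.
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma as Gamma

alpha, beta, gam = 2.0, 1.0, 1.0

def pi_eh(x):
    return gam / ((1.0 + x) * (1.0 + np.log1p(x)) ** (1.0 + gam))

def p_marg(lam):
    # p(lam) = beta^alpha / Gamma(alpha) * int_0^inf x^{-alpha} e^{-beta/x} pi_EH(lam*x) dx
    f = lambda x: np.exp(-beta / x - alpha * np.log(x)) * pi_eh(lam * x)
    val, _ = quad(f, 0, np.inf, limit=500, epsabs=1e-14)
    return beta ** alpha / Gamma(alpha) * val

p0 = p_marg(1e-8)
print(p0, gam * beta / (alpha - 1.0))  # both about 1.0

tail_ratio = p_marg(1e6) / pi_eh(1e6)
print(tail_ratio)  # near 1 (slow, logarithmic convergence)
```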

### 3.2 Marginal posterior distributions for λi

We briefly describe the flexibility of the proposed prior distributions in comparison with the common gamma prior for $\lambda_i$. Since the conditional posterior distribution of $\lambda_i$ given $u_i$ is $\mathrm{Ga}(\alpha+y_i,1+\beta/u_i)$ under model (1), the marginal posterior distribution of $\lambda_i$ is obtained as a mixture of gamma distributions with respect to the marginal posterior distribution of $u_i$. Note that the use of the gamma prior $\mathrm{Ga}(\alpha,\beta)$ for $\lambda_i$ leads to the posterior distribution $\mathrm{Ga}(\alpha+y_i,1+\beta)$. We show the marginal posterior density of $\lambda_i$ for several values of $y_i$ in Figure 1. It is observed that, under a moderate signal, the posterior distributions of $\lambda_i$ are almost the same for the conventional gamma prior and the proposed global-local shrinkage priors. On the other hand, for large values of $y_i$, the posterior densities of the proposed methods are significantly different from the one based on the gamma prior, which shows the flexibility of the proposed priors against large signals and is consistent with the (strong) tail-robustness property given in Theorem 1. In contrast, the posterior density with the conventional gamma prior is not sensitive to large signals, which leads to over-shrinkage. As noted in the previous section, the hyperparameter $\gamma$ of the inverse gamma (IG) prior is directly related to the asymptotic bias, and Figure 1 shows that the IG prior with smaller $\gamma$ produces heavier-tailed posterior density functions than that with larger $\gamma$.
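The over-shrinkage can also be seen directly by one-dimensional quadrature: the posterior mean $\tilde{\lambda}_i=(\alpha+y_i)E[u_i/(\beta+u_i)\mid y_i]$ averages the conditional shrinkage factor against the negative-binomial marginal likelihood of $u_i$. A sketch with illustrative values $\alpha=\beta=\gamma=1$ (not the paper's MCMC, just a numerical illustration of tail-robustness).

```python
# Posterior mean under the EH prior vs. the plain gamma prior (u fixed at 1).
import numpy as np
from scipy.integrate import quad

alpha, beta, gam = 1.0, 1.0, 1.0

def pi_eh(u):
    return gam / ((1.0 + u) * (1.0 + np.log1p(u)) ** (1.0 + gam))

def post_mean_eh(y):
    # weight proportional to the negative-binomial marginal likelihood of y given u
    w = lambda u: (beta / (beta + u)) ** alpha * (u / (beta + u)) ** y * pi_eh(u)
    num, _ = quad(lambda u: (alpha + y) * u / (beta + u) * w(u), 0, np.inf, limit=500)
    den, _ = quad(w, 0, np.inf, limit=500)
    return num / den

y = 100
pg = (alpha + y) / (1.0 + beta)  # gamma prior with u = 1: heavy shrinkage to 50.5
eh = post_mean_eh(y)
print(pg, eh)  # the EH posterior mean stays close to y = 100
```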

### 3.3 Posterior computation

The computation of the Bayes estimator is based on Markov chain Monte Carlo. Since the proposed priors are mostly conditionally conjugate, sampling from most of the conditional posterior distributions is straightforward. We summarize the conditional posteriors and sampling methods in this subsection.

We first discuss the parameters that are common to, and always included in, the model regardless of the choice of prior for $u_i$. Note that we assign prior distributions for $\alpha$ and $\beta$ in practice; in this research, we consider gamma priors for both.

Sampling of the common parameters.

• The full conditional of $\lambda_i$ is $\mathrm{Ga}(\alpha+y_i,\ 1+\beta/u_i)$, and $\lambda_1,\dots,\lambda_m$ are mutually independent.

• The full conditional of $\beta$ is $\mathrm{Ga}\big(a_\beta+m\alpha,\ b_\beta+\sum_{i=1}^m\lambda_i/u_i\big)$ under the $\mathrm{Ga}(a_\beta,b_\beta)$ prior for $\beta$.

• The sampling of the shape parameter $\alpha$ can be done in multiple ways. We take the strategy of Zhou and Carin (2013), working with the conditional negative-binomial likelihood of $\alpha$ obtained by marginalizing $\lambda_i$ out. The conditional posterior density of $\alpha$ is proportional to

$$\psi_\alpha(\alpha)\prod_{i=1}^m\frac{\Gamma(y_i+\alpha)}{\Gamma(\alpha)}\left(\frac{\beta}{\beta+u_i}\right)^{\alpha}=\psi_\alpha(\alpha)\prod_{i=1}^m\sum_{\nu_i=1}^{y_i}|s(y_i,\nu_i)|\,\alpha^{\nu_i}\left(\frac{\beta}{\beta+u_i}\right)^{\alpha},$$

where $\psi_\alpha(\alpha)$ is the prior density of $\alpha$ and $|s(y_i,\nu_i)|$ is the unsigned Stirling number of the first kind; the summation collapses to one if $y_i=0$. The integer-valued variable $\nu_i$ is considered a latent parameter that augments the model and allows the Gibbs sampler. Thus, we need to sample from the full conditionals of $\alpha$ and $\nu_i$.

• In the augmented model, the full conditional of $\alpha$ is $\mathrm{Ga}\big(a_\alpha+\sum_{i=1}^m\nu_i,\ b_\alpha+\sum_{i=1}^m\log(1+u_i/\beta)\big)$ under the $\mathrm{Ga}(a_\alpha,b_\alpha)$ prior for $\alpha$.

• If $y_i=0$, then $\nu_i=0$ with probability one. Otherwise, the conditional posterior probability function of $\nu_i$ is proportional to $|s(y_i,\nu_i)|\alpha^{\nu_i}$, from which we can sample based on the distributional equation

$$\nu_i=\sum_{j=1}^{y_i}d_j,\qquad d_j\overset{\mathrm{ind}}{\sim}\mathrm{Ber}\left(\frac{\alpha}{j-1+\alpha}\right).$$
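The distributional equation is the Chinese-restaurant-table construction used by Zhou and Carin (2013): $\nu_i$ is a sum of independent Bernoulli draws, so its mean is $\sum_{j=1}^{y_i}\alpha/(j-1+\alpha)$, which a quick Monte Carlo confirms (illustrative values $y=12$, $\alpha=0.7$).

```python
# Sampling nu = sum_j d_j with d_j ~ Ber(alpha / (j - 1 + alpha)).
import numpy as np

rng = np.random.default_rng(1)

def sample_nu(y, alpha, size):
    j = np.arange(1, y + 1)
    p = alpha / (j - 1 + alpha)        # success probability of d_j
    d = rng.random((size, y)) < p      # size x y Bernoulli draws
    return d.sum(axis=1)

y, alpha = 12, 0.7
nu = sample_nu(y, alpha, size=100_000)
exact_mean = (alpha / (np.arange(1, y + 1) - 1 + alpha)).sum()
print(nu.mean(), exact_mean)  # agree closely
```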

When the model is of Poisson-gamma type and the local parameters $u_i$ are fixed, the posterior analysis can be done by following the above procedure. The Poisson-gamma model is examined in the next section for comparison with the proposed priors.

For the model with the IG prior, the scale parameter $u_i$ has a known conditional posterior, while the conditional of the hyperparameter $\gamma$ is difficult to sample from directly. Although several computationally sophisticated options are available for sampling $\gamma$, we simply use a random-walk Metropolis-Hastings step.

Sampling of parameters related to the IG prior.

• The full conditional of $u_i$ is $\mathrm{IG}(\gamma+\alpha,\ \gamma+\beta\lambda_i)$, and $u_1,\dots,u_m$ are mutually independent.

• The full conditional of $\gamma$ is proportional to

$$f_\gamma(\gamma)=\psi_\gamma(\gamma)\frac{\gamma^{m\gamma}}{\{\Gamma(\gamma)\}^m}\left(\prod_{i=1}^m\frac{1}{u_i}\right)^{\gamma}e^{-\gamma\sum_{i=1}^m1/u_i},$$

where $\psi_\gamma(\gamma)$ is a prior distribution for $\gamma$. We sample a candidate of $\gamma$ from a proposal distribution with two tuning parameters and accept it with an acceptance probability based on the target density $f_\gamma(\gamma)$. Details are given in the supplementary material.

The EH prior is augmented with latent variables $(v_i,w_i,t_i)$ as

$$\pi_{\mathrm{EH}}(u_i;\gamma)=\iiint_{(0,\infty)^3}\pi_{\mathrm{EH}}(u_i,v_i,w_i,t_i;\gamma)\,dv_i\,dw_i\,dt_i,$$

where

$$\pi_{\mathrm{EH}}(u_i,v_i,w_i,t_i;\gamma)=\gamma e^{-(1+u_i)v_i}\,\frac{w_i^{\gamma}e^{-w_i}}{\Gamma(1+\gamma)}\,\frac{t_i^{w_i-1}}{\Gamma(w_i)}e^{-(1+u_i)t_i}.$$

This parameter expansion enables the Gibbs sampler for all the parameters related to the EH prior. The full conditionals include the generalized inverse Gaussian (GIG) distribution, $\mathrm{GIG}(\lambda,\chi,\psi)$, whose density is proportional to $x^{\lambda-1}e^{-(\psi x+\chi/x)/2}$.
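The augmentation can be verified by integrating out $v_i$ (an exponential integral) and $t_i$ (a gamma integral) in closed form, leaving a one-dimensional integral over $w_i$ that we evaluate numerically; a sketch at $\gamma=2$, $u=3$ (illustrative values).

```python
# Check that the joint density integrates back to the EH density (8).
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma as Gamma

gam, u0 = 2.0, 3.0

def pi_eh(x):
    return gam / ((1.0 + x) * (1.0 + np.log1p(x)) ** (1.0 + gam))

# int over v: gamma * e^{-(1+u)v} dv        -> gamma / (1 + u)
# int over t: t^{w-1}/Gamma(w) e^{-(1+u)t}  -> (1 + u)^{-w}
integrand = lambda w: w ** gam * np.exp(-w) * (1.0 + u0) ** (-w) / Gamma(1.0 + gam)
val, _ = quad(integrand, 0, np.inf)

approx = gam / (1.0 + u0) * val
target = pi_eh(u0)
print(approx, target)  # equal up to quadrature error
```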

Sampling of parameters related to the EH prior.

• The full conditional of $u_i$ is $\mathrm{GIG}(1-\alpha,\ 2\beta\lambda_i,\ 2(v_i+t_i))$.

• The full conditional of $v_i$ is $\mathrm{Exp}(1+u_i)$.

• The full conditional of $t_i$ is $\mathrm{Ga}(w_i,\ 1+u_i)$.

• The sampling of $w_i$ is simplified by marginalizing $t_i$ out of the model to compute its full conditional. The conditional posterior of $w_i$ with $t_i$ marginalized out is $\mathrm{Ga}(1+\gamma,\ 1+\log(1+u_i))$.

• The sampling of $\gamma$ is also simplified by the marginalization of $w_i$ and $t_i$. We sample $\gamma$ from $\mathrm{Ga}\big(a_\gamma+m,\ b_\gamma+\sum_{i=1}^m\log\{1+\log(1+u_i)\}\big)$ under the $\mathrm{Ga}(a_\gamma,b_\gamma)$ prior for $\gamma$.

## 4 Simulation study

We here investigate the finite-sample performance of the proposed methods together with some existing ones. We generated an independent sequence of counts $y_i\sim\mathrm{Po}(\lambda_i)$ for $i=1,\dots,m$. For the generating process of $\lambda_i$, we first considered the mixture $(1-\omega)F_0+\omega F_1$, where $F_0$ and $F_1$ denote the distributions of moderate and large signals, respectively, and the mixing weight $\omega$ is chosen so that a small proportion of the $\lambda_i$'s are large signals (outliers); this setting is denoted by scenario 1. In scenarios 2 and 3, the mixture components are replaced by degenerate distributions, so that the moderate signals are more concentrated around 1 and have less variation than under the continuous distribution in scenario 1. Finally, as scenario 4, we employed the setting in which the signal follows the folded $t$-distribution with three degrees of freedom; large signals are still expected in this scenario due to the heavy tail of the $t$-distribution.

To obtain the point estimator of $\lambda_i$, we considered the following five priors/methods:

• IG: The proposed method with inverse gamma prior.

• EH: The proposed method with the extremely heavy-tailed prior.

• PG: Using the gamma distribution for $\lambda_i$, known as the Poisson-gamma model.

• KW: Nonparametric empirical Bayes method (Kiefer and Wolfowitz, 1956; Koenker and Mizera, 2014).

• ML: Maximum likelihood (non-shrinkage) estimator, i.e., $\hat{\lambda}_i=y_i$.

For each dataset, in the first three methods, we generated 3,000 posterior samples after discarding 500 samples as burn-in. In applying the PG method, we assigned gamma priors to the hyperparameters to estimate them. Based on 1,000 replications, we computed mean squared errors (MSE) for all the models and empirical coverage probabilities (CP) of credible intervals for the three Bayesian methods. Values of $\lambda_i$ generated from the large-signal component were defined as outliers in scenarios 1, 2 and 3, with an analogous threshold-based definition in scenario 4, and the remaining values were defined as non-outliers. The values of MSE and CP were computed separately for outlying and non-outlying values. To compare the resulting MSE values, we calculated the MSE ratios of the four methods (IG, EH, PG and KW) to the ML method; an MSE ratio smaller than 1 means that the corresponding method provides more accurate estimates than ML.

Figure 2 shows the MSE ratios for non-outlying and outlying values. For non-outlying values, the performance of all four methods is comparable and always better than ML. On the other hand, for outlying values, the point estimates of both the PG and KW methods tend to be worse than ML; theoretically, the PG and KW methods are not tail-robust in general and are expected to produce over-shrunk estimates. In contrast, the proposed IG and EH methods provide reasonable point estimates and, in particular, the EH method is even competitive with the ML method. The result that the EH method slightly outperforms the IG method for outliers is consistent with the theory: the EH method is exactly tail-robust, whereas the IG method holds the property only approximately.

Figure 3 reports the CPs of credible intervals for the three Bayesian methods. The CPs for non-outlying values are comparable among the methods. However, the CPs of PG for outlying values are seriously below the nominal level, which would be caused by the over-shrinkage problem. The best performer in terms of CPs for outliers is the proposed EH method, whose CPs are close to the nominal level owing to its tail-robustness. The IG method provides reasonable CPs compared with the PG method, although they are smaller than those of EH, partly because tail-robustness holds only approximately under the IG prior.

## 5 Data analysis

Here we apply the proposed methods to a dataset of the number of police-recorded crimes in the Tokyo metropolitan area, provided by the University of Tsukuba and publicly available online ("GIS database of number of police-recorded crime at O-aza, chome in Tokyo, 2009-2017", available at https://commons.sk.tsukuba.ac.jp/data_en). In this study, we focus on the number of violent crimes in local towns in the Tokyo metropolitan area in 2015. As auxiliary information about each town, we adopted the area (km²), the population densities at noon and night, the density of foreign residents, the percentage of single-person households, and the average duration of residence, all of which help adjust the crime risk. Let $y_i$ be the observed count of violent crimes, $a_i$ the area, and $x_i$ the vector of standardized auxiliary information in the $i$-th local town. We are interested in the crime risk adjusted by the auxiliary information and, to this end, employ the following Poisson regression model:

$$y_i\mid\lambda_i\sim\mathrm{Po}(\lambda_i\eta_i),\qquad\eta_i=\exp(\log a_i+x_i^{t}\delta), \tag{10}$$

independently for $i=1,\dots,m$, where $\delta$ is a vector of unknown regression coefficients. Under model (10), $\lambda_i$ can be interpreted as an adjusted risk factor that cannot be explained by the auxiliary information. In most areas, the offset term $\eta_i$ explains the variation of crime rates, hence the adjusted risk factor is expected to be small; yet, the adjusted risk might be extremely high in some local towns, and we want to detect such areas. This is precisely where the global-local shrinkage priors fit, and we employed the proposed IG and EH priors for $u_i$. We set a diffuse prior on $\delta$ and generated posterior samples of $\delta$ as well as the $\lambda_i$'s. Conditional on $\delta$, the posterior computation algorithm provided in Section 3.3 is still valid, so that the posterior computation for generating both the $\lambda_i$'s and $\delta$ can readily be carried out via Gibbs sampling. We adopted an independent MH algorithm for generating a random sample of $\delta$ from its conditional posterior distribution given the $\lambda_i$'s. For comparison, we also applied the common gamma distribution for $\lambda_i$ as considered in Section 4, again denoted by PG in what follows. In each Gibbs sampler, we generated 20,000 posterior samples after discarding 3,000 samples as burn-in.

We first computed posterior means of $\lambda_i$ based on the three methods, which are shown in Figure 4. It is observed that the two proposed methods, IG and EH, provided similar estimates of $\lambda_i$ and successfully detected several local towns whose risk factors are extremely high. However, such extreme towns are less emphasized, or not detected at all, by the PG method owing to its over-shrinkage problem. We conclude that the PG method seriously underestimates the true risk factors.

We then selected the 10 local towns with the largest posterior means of $\lambda_i$. In these towns, we computed credible intervals of $\lambda_i$ based on the three methods, which are presented in the left panel of Figure 5. This panel clearly shows the over-shrinkage problem of the PG method in terms of both point estimation (posterior means) and interval estimation (posterior credible intervals). We also randomly selected 10 local towns with moderate estimates of $\lambda_i$ and give their credible intervals in the right panel of the same figure. It is confirmed that the proposed methods perform almost the same as the PG method in these towns. These results exemplify that the proposed methods can prevent the over-shrinkage problem for large signals while their performance in the other towns is almost the same as that of the standard PG method.
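The posterior means and equal-tailed credible intervals reported above are simple summaries of the Gibbs output. A minimal sketch, assuming the draws for the $\lambda_i$'s are stored as a matrix (function and variable names are ours):

```python
import numpy as np

def posterior_summary(draws, level=0.95):
    """Posterior means and equal-tailed credible intervals from MCMC output.
    `draws` has one row per posterior sample and one column per parameter
    (here, per local town's risk factor lambda_i)."""
    alpha = 1.0 - level
    mean = draws.mean(axis=0)
    lo, hi = np.quantile(draws, [alpha / 2, 1 - alpha / 2], axis=0)
    return mean, lo, hi

# Toy example: 20,000 hypothetical draws for 5 towns.
rng = np.random.default_rng(0)
draws = rng.gamma(shape=2.0, scale=0.5, size=(20000, 5))
mean, lo, hi = posterior_summary(draws)
```

The same summary applied to the IG, EH, and PG outputs yields the points and intervals compared in Figure 5.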

## 6 Discussion

It should be emphasized again that the global-local shrinkage priors for sequences of counts developed in this article are based on the new concept of tail robustness, which is clearly different from other definitions and non-trivial for many prior densities. We provided sufficient conditions for this desirable tail robustness and, specifically, proposed two tractable global-local shrinkage priors. As illustrated by the simulated and real data examples, the models with these priors indeed exhibit the tail robustness predicted by the theory, and we expect them to be applied in future studies of high-dimensional counts.

The settings of our study depend critically on the Poisson likelihood, whose mean and variance are equal. Conditionally, both the prior mean and variance of $\lambda_i$ are controlled by the common local parameter of its prior, which affects both the baseline of shrinkage and the amount of shrinkage. This property is not seen in the Gaussian case, where the local parameter appears in the prior variance only and controls the amount of local shrinkage, which makes the role of the local parameters clear and interpretable. In this sense, the local parameter in our framework (1) might be less interpretable, although the setting (1) enables us to carry out posterior computation easily and is widely adopted (e.g., Datta and Dunson, 2016). It would be an interesting direction of future research to pursue an alternative setting for hierarchical modeling of sequences of counts under which the role of the local parameters is properly restricted and interpretable.

From the viewpoint of methodological research, this paper focuses primarily on the point and interval estimation of the Poisson rates. High-dimensional counts can also be cast as other statistical problems such as multiple testing. However, a detailed investigation of such directions would extend beyond the scope of this paper, so we leave it as a valuable topic for future study.

## References

• Armagan et al. (2011) Armagan, A., Clyde, M., and Dunson, D. B. (2011). “Generalized beta mixtures of Gaussians.” In Advances in neural information processing systems, 523–531.
• Armagan et al. (2013) Armagan, A., Dunson, D. B., and Lee, J. (2013). “Generalized double Pareto shrinkage.” Statistica Sinica, 23(1): 119–143.
• Berry et al. (2019) Berry, L. R., Helman, P., and West, M. (2019). “Probabilistic forecasting of heterogeneous consumer transaction-sales time series.” ArXiv:1808.04698.
• Bhadra et al. (2016) Bhadra, A., Datta, J., Polson, N. G., and Willard, B. (2016). “Default Bayesian analysis with global-local shrinkage priors.” Biometrika, 103(4): 955–969.
• Bhadra et al. (2019) Bhadra, A., Datta, J., Polson, N. G., and Willard, B. T. (2019). “Lasso Meets Horseshoe: A Survey.” Statistical Science, to appear.
• Brown et al. (2013) Brown, L. D., Greenshtein, E., and Ritov, Y. (2013). “The Poisson compound decision problem revisited.” Journal of the American Statistical Association, 108(502): 741–749.
• Carvalho et al. (2010) Carvalho, C. M., Polson, N. G., and Scott, J. G. (2010). “The horseshoe estimator for sparse signals.” Biometrika, 97(2): 465–480.
• Datta and Dunson (2016) Datta, J. and Dunson, D. B. (2016). “Bayesian inference on quasi-sparse count data.” Biometrika, 103(4): 971–983.
• Kiefer and Wolfowitz (1956) Kiefer, J. and Wolfowitz, J. (1956). “Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters.” Annals of Mathematical Statistics, 27: 887–906.
• Koenker and Mizera (2014) Koenker, R. and Mizera, I. (2014). “Convex optimization, shape constraints, compound decisions, and empirical Bayes rules.” Journal of the American Statistical Association, 109(506): 674–685.
• Lawson (2013) Lawson, A. B. (2013). Bayesian disease mapping: hierarchical modeling in spatial epidemiology. Chapman and Hall/CRC.
• Maruyama and Strawderman (2019) Maruyama, Y. and Strawderman, W. E. (2019). “Admissible Bayes equivariant estimation of location vectors for spherically symmetric distributions with unknown scale.” Annals of Statistics, to appear.
• Seneta (1976) Seneta, E. (1976). Regularly varying functions, volume 508. Springer-Verlag Berlin Heidelberg.
• Tang et al. (2018) Tang, X., Ghosh, M., Ha, N. S., and Sedransk, J. (2018). “Modeling Random Effects Using Global–Local Shrinkage Priors in Small Area Estimation.” Journal of the American Statistical Association, 113(524): 1476–1489.
• Wakefield (2006) Wakefield, J. (2006). “Disease mapping and spatial regression with count data.” Biostatistics, 8(2): 158–183.
• Yano et al. (2019) Yano, K., Kaneko, R., and Komaki, F. (2019). “Exact Minimax Predictive Density for Sparse Count Data.” ArXiv:1812.06037.
• Zhou and Carin (2013) Zhou, M. and Carin, L. (2013). “Negative binomial process count and mixture modeling.” IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2): 307–320.
• Zhu et al. (2019) Zhu, A., Ibrahim, J. G., and Love, M. I. (2019). “Heavy-tailed prior distributions for sequence count data: removing the noise and preserving large differences.” Bioinformatics, 35(12): 2084–2092.

## S1 Proof of Theorem 1

We first provide two useful lemmas.

###### Lemma S1.

Let $\eta \in \mathbb{R}$ and let $h$ be a strictly positive and continuously differentiable function defined on $(0, \infty)$. Suppose that

$$\lim_{u \to \infty} \frac{u h'(u)}{h(u)} = \zeta \in [-\infty, \infty].$$

Then if $\zeta > \eta$, there exists $N > 0$ such that for all $u \geq N$,

$$\frac{h(u)}{h(N)} \geq \left( \frac{u}{N} \right)^{\eta}.$$

Similarly, if $\zeta < \eta$, there exists $N > 0$ such that for all $u \geq N$,

$$\frac{h(u)}{h(N)} \leq \left( \frac{u}{N} \right)^{\eta}.$$

Proof.   Suppose first that $\zeta > \eta$. Select $N$ large enough so that for all $z \geq N$ we have $z h'(z)/h(z) \geq \eta$. Then for any $u \geq N$ we have that

$$\int_N^{u} \frac{h'(z)}{h(z)}\, dz \geq \eta \int_N^{u} \frac{1}{z}\, dz$$

and hence that $h(u)/h(N) \geq (u/N)^{\eta}$. Next, suppose that $\zeta < \eta$. Let $g(u) = 1/h(u)$ for $u > 0$. Then, since $\lim_{u \to \infty} u g'(u)/g(u) = -\zeta > -\eta$, there exists $N$ such that $g(u)/g(N) \geq (u/N)^{-\eta}$ for all $u \geq N$ by the first part. Thus, $h(u)/h(N) \leq (u/N)^{\eta}$ for all $u \geq N$.

###### Lemma S2.

Let $0 < M_1 < M_0$. Let $h_0$ and $h_1$ be nonnegative integrable functions defined on $(M_0, \infty)$ and $(0, M_1)$, respectively, and let $\varphi$ be a positive and strictly increasing function defined on $(0, \infty)$. Suppose that $\int_{M_0}^{\infty} h_0(u)\, du > 0$. Then

$$\lim_{y \to \infty} \frac{\int_0^{M_1} \{\varphi(u)\}^{y} h_1(u)\, du}{\int_{M_0}^{\infty} \{\varphi(u)\}^{y} h_0(u)\, du} = 0.$$

Proof.   We have

$$\limsup_{y \to \infty} \frac{\int_0^{M_1} \{\varphi(u)\}^{y} h_1(u)\, du}{\int_{M_0}^{\infty} \{\varphi(u)\}^{y} h_0(u)\, du} \leq \limsup_{y \to \infty} \left\{ \frac{\varphi(M_1)}{\varphi(M_0)} \right\}^{y} \frac{\int_0^{M_1} h_1(u)\, du}{\int_{M_0}^{\infty} h_0(u)\, du} = 0$$

by assumption, since $\varphi(M_1) < \varphi(M_0)$.
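Lemma S2 can likewise be checked numerically; the choices of $\varphi$, $h_0$, $h_1$, $M_0$, $M_1$ below are our own illustrative assumptions.

```python
import numpy as np

# Illustrative choices (ours): phi(u) = u, h0(u) = h1(u) = exp(-u),
# M1 = 1 < M0 = 2, so Lemma S2 predicts the ratio below tends to 0 as y grows.
def ratio(y, grid=200_000):
    u1 = np.linspace(1e-9, 1.0, grid)        # (0, M1)
    u0 = np.linspace(2.0, 400.0, grid)       # (M0, "infinity"), truncated
    num = np.sum(u1 ** y * np.exp(-u1)) * (u1[1] - u1[0])   # Riemann sums
    den = np.sum(u0 ** y * np.exp(-u0)) * (u0[1] - u0[0])
    return num / den

rs = [ratio(y) for y in (5, 20, 50)]
assert rs[0] > rs[1] > rs[2] > 0.0           # ratio decreases toward 0
```

The denominator grows like $\Gamma(y+1)$ while the numerator stays bounded by $1/(y+1)$, so the decay is extremely fast.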

We now begin the proof of Theorem 1. For part (i), suppose first that the posterior distribution is proper. Then $\xi < \infty$ since otherwise

$$\infty > \int_N^{\infty} \frac{\pi(u)/\pi(N)}{(1+u)^{\alpha}}\, du \geq \int_N^{\infty} \frac{(u/N)^{\alpha}}{(1+u)^{\alpha}}\, du = \infty$$

for some $N > 0$ by Lemma S1. If $\xi = -\infty$, then clearly $\xi \leq \alpha - 1$. On the other hand, if $\xi > -\infty$, then for any $\varepsilon > 0$, there exists $N$ such that

$$\int_N^{\infty} \frac{\pi(u)/\pi(N)}{(1+u)^{\alpha}}\, du \geq \int_N^{\infty} \frac{(u/N)^{\xi - \varepsilon}}{(1+u)^{\alpha}}\, du$$

by Lemma S1. Thus, in order for the posterior to be proper, the right-hand side of the above inequality must be finite, which implies that $\xi < \alpha - 1 + \varepsilon$. Since $\varepsilon > 0$ is arbitrary, it follows that $\xi \leq \alpha - 1$.

Next, suppose that $\xi < \alpha - 1$. Let $\eta \in (\xi, \alpha - 1)$. Then by Lemma S1 there exists $N$ such that for all $u \geq N$ we have $\pi(u)/\pi(N) \leq (u/N)^{\eta}$. Therefore,

$$\lim_{u \to \infty} \frac{u \pi(u)}{(1+u)^{\alpha}} = 0, \qquad \int_1^{\infty} \frac{\pi(u)}{(1+u)^{\alpha}}\, du < \infty, \tag{S1}$$

and, hence, the posterior is proper. This completes the proof of part (i).

For part (ii), suppose that $\xi < \alpha - 1$. We first show that the assumptions imply the following:

$$\int_0^{\infty} \frac{\pi(u)}{(\beta+u)^{\alpha}}\, du + \int_0^{\infty} \frac{|u \pi'(u)|}{(\beta+u)^{\alpha}}\, du < \infty, \tag{S2}$$

$$\lim_{u \to \infty} \frac{u \pi(u)}{(\beta+u)^{\alpha}} = \lim_{u \to 0} \frac{u \pi(u)}{(\beta+u)^{\alpha}} = 0. \tag{S3}$$

By integration by parts, it follows from (A1) that

$$\mathbb{R} \ni \lim_{\delta \to 0} \int_{\delta}^{1} u \pi'(u)\, du = \lim_{\delta \to 0} \left\{ \pi(1) - \delta \pi(\delta) - \int_{\delta}^{1} \pi(u)\, du \right\}.$$

Therefore, $\lim_{\delta \to 0} \delta \pi(\delta) = c_0$ for some $c_0 \geq 0$. Furthermore, we have $c_0 = 0$ since otherwise

$$\int_0^1 \pi(u)\, du \geq \int_0^{\varepsilon} \frac{u \pi(u)}{u}\, du \geq \frac{c_0}{2} \int_0^{\varepsilon} \frac{du}{u} = \infty$$

for some $\varepsilon > 0$, in contradiction to (A1). Thus, $\lim_{u \to 0} u \pi(u) = 0$, which gives the second limit in (S3). By part (i), we have (S1), and hence $\lim_{u \to \infty} u \pi(u)/(\beta+u)^{\alpha} = 0$, which gives the first limit in (S3). Also, by integration by parts,

$$\infty > \int_1^{\infty} \frac{\pi(u)}{(\beta+u)^{\alpha}}\, du = \lim_{M \to \infty} \int_1^{M} \frac{\pi(u)}{(\beta+u)^{\alpha}}\, du = \lim_{M \to \infty} \left\{ \frac{M \pi(M)}{(\beta+M)^{\alpha}} - \frac{\pi(1)}{(\beta+1)^{\alpha}} - \int_1^{M} \frac{u \pi'(u)}{\pi(u)} \frac{\pi(u)}{(\beta+u)^{\alpha}}\, du + \alpha \int_1^{M} \frac{u}{\beta+u} \frac{\pi(u)}{(\beta+u)^{\alpha}}\, du \right\}.$$

Therefore,