# Objective Bayesian Analysis for the Differential Entropy of the Gamma Distribution

The use of entropy-related concepts ranges from physics, such as statistical mechanics, to evolutionary biology. The Shannon entropy is a measure used to quantify the amount of information in a system, and its estimation is usually carried out under the frequentist approach. In the present paper, we introduce a fully objective Bayesian analysis to obtain the posterior distribution of this measure. Notably, we consider the Gamma distribution, which describes many natural phenomena in physics, engineering, and biology. We reparametrize the model in terms of the entropy, and different objective priors are derived, such as the Jeffreys prior, reference priors, and matching priors. Since the obtained priors are improper, we prove that the resulting posterior distributions are proper and that their respective posterior means are finite. An intensive simulation study is conducted to select the prior that returns the best results in terms of bias, mean squared error, and coverage probability. The proposed approach is illustrated on two datasets: the first is related to the reign periods of the Achaemenid dynasty, and the second describes the time to failure of an electronic component in a sugarcane harvester.


## 1 Introduction

In recent years, there has been growing interest in estimating different information-theoretic metrics related to parametric distributions. The Shannon entropy, also known as the differential entropy, introduced by Claude Shannon [30], is an essential quantity that measures the amount of available information, or the uncertainty in the outcome, of a random process. Given a density function $f(x\mid\alpha,\beta)$, the differential entropy is given by

$$H(\alpha,\beta)=\mathbb{E}\left[-\log f(x\mid\alpha,\beta)\right]. \tag{2}$$

The differential entropy depends on the distribution parameters and, given a sample, it must be estimated. The most commonly used method to estimate the parameters is the maximum likelihood approach, due to its one-to-one invariance property: we need only estimate the parameters of the original model and plug them into the entropy function. Under this approach, many authors have derived entropy estimators for different distributions, such as the Weibull [9], inverse Weibull [33], and log-logistic [13] distributions, and the exponential distribution with different shift origins [20], to list a few.

A major drawback of maximum likelihood inference is that the obtained estimates are usually biased for small samples [11]. Another concern under small samples arises when constructing confidence intervals for the parameters, since such intervals are not precise and may not return good coverage probabilities. In this case, studying the skewness of the maximum likelihood estimator (MLE) is essential to assess the quality of the interval [12]. To overcome these limitations, we can use objective Bayesian methods. In this context, inference for the parameters of the gamma distribution has been discussed earlier under this approach by Miller [23], Sun and Ye [31], Berger et al. [3], and Louzada and Ramos [21]. Moreover, Ramos et al. [27] revised the most common objective priors and provided necessary and sufficient conditions for the obtained posteriors and their higher moments to be proper.

Although these authors obtained different joint posterior distributions for the parameters of interest, the resulting posterior means cannot be directly plugged into the Shannon entropy. Under the Bayesian approach, it is necessary to obtain the posterior distribution of the entropy measure itself. In this context, Shakhatreh [29] recently derived different posterior distributions using objective priors for the entropy of the Weibull distribution. On the other hand, the entropy expression of that distribution is not as complicated as that of the gamma distribution. With this in mind, in this paper, focusing on the gamma distribution, we derive the posterior distributions using objective priors, such as the Jeffreys prior [18], reference priors [7, 2, 3], and matching priors [32], and prove that the obtained posteriors are proper and can be used to construct the posterior distribution of the Shannon entropy. Moreover, even if the posterior distribution is proper, the posterior mean can be infinite, which is undesirable, and thus we also prove that the obtained posterior means for the entropy measure are finite. Finally, credibility intervals are obtained to construct accurate interval estimates.

The gamma distribution considered here is a two-parameter family that is among the most well-known distributions used to model different stochastic processes and to make statistical inferences, and it has received attention from different fields. It surfaces in many areas of application, including financial analysis [10], climate analysis [17], reliability analysis [16, 19], and physics [14]. In particular, the gamma distribution includes the exponential, Erlang, and chi-square distributions as special cases.

A random variable $X$ follows a gamma distribution if its probability density function, parametrized by a shape parameter $\alpha>0$ and a rate parameter $\beta>0$, is given by

$$f(x\mid\alpha,\beta)=\frac{\beta^{\alpha}}{\Gamma(\alpha)}\,x^{\alpha-1}e^{-\beta x},\qquad x>0, \tag{3}$$

where $\Gamma(\alpha)=\int_0^{\infty}t^{\alpha-1}e^{-t}\,dt$ is the gamma function.
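As a quick sanity check (not part of the paper), the density (3) is easy to evaluate numerically; note that SciPy parametrizes the gamma distribution by shape and scale, so `scale = 1/beta` for the rate $\beta$ used here.

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import gamma

def gamma_pdf(x, alpha, beta):
    """Density of Eq. (3): beta^alpha / Gamma(alpha) * x^(alpha-1) * exp(-beta*x),
    evaluated on the log scale for numerical stability."""
    return np.exp(alpha * np.log(beta) - gammaln(alpha)
                  + (alpha - 1.0) * np.log(x) - beta * x)

# Agreement with SciPy's shape/scale parametrization (scale = 1/beta).
x = np.linspace(0.1, 10.0, 50)
assert np.allclose(gamma_pdf(x, 2.5, 1.5), gamma.pdf(x, a=2.5, scale=1.0 / 1.5))
```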

The paper is organized as follows. Section 2 presents the maximum likelihood estimators for the gamma distribution parameters and the computation of the Shannon entropy. Section 3 presents the objective Bayesian analysis, using objective priors for the posterior distribution of the model reparametrized in terms of the Shannon entropy. Section 4 provides a simulation study to select the best objective prior. In Section 5, the methodology is illustrated on two real datasets. Some final comments are given in Section 6.

## 2 Frequentist approach

Classical (frequentist) inference is a commonly used approach to parameter estimation for a particular distribution. In this case, the parameter is treated as fixed, and the MLE is commonly used to obtain the estimates. The MLE has good asymptotic properties, such as invariance, consistency, and efficiency. This procedure searches the parameter space for the point at which the likelihood is maximized. Here, our main aim is to estimate a function of the parameters. Hence, we first need the entropy measure, mathematically defined as $H(\alpha,\beta)=\mathbb{E}\left[-\log f(x\mid\alpha,\beta)\right]$, which quantifies the amount of uncertainty in the data. Besides, it should be noted that a higher value of $H$ indicates more uncertainty.

The entropy of the gamma density is given by

$$\begin{aligned} H(\alpha,\beta)&=-\int_0^{\infty}\log\!\left(\frac{\beta^{\alpha}}{\Gamma(\alpha)}\,x^{\alpha-1}e^{-\beta x}\right)f(x\mid\alpha,\beta)\,dx\\ &=\alpha-\log(\beta)+\log\Gamma(\alpha)+(1-\alpha)\psi(\alpha), \end{aligned} \tag{4}$$

where $\psi(\alpha)=\frac{d}{d\alpha}\log\Gamma(\alpha)$ is the digamma function.
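The closed-form expression (4) can be implemented in a few lines; the sketch below (ours, not the authors' code) cross-checks it against SciPy's built-in differential entropy, again with `scale = 1/beta`.

```python
import numpy as np
from scipy.special import gammaln, digamma
from scipy.stats import gamma

def gamma_entropy(alpha, beta):
    """Eq. (4): H = alpha - log(beta) + log Gamma(alpha) + (1 - alpha) psi(alpha)."""
    return alpha - np.log(beta) + gammaln(alpha) + (1.0 - alpha) * digamma(alpha)

assert np.isclose(gamma_entropy(3.0, 2.0), gamma(a=3.0, scale=0.5).entropy())
assert np.isclose(gamma_entropy(1.0, 1.0), 1.0)  # Exp(1) has differential entropy 1
```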

Now, consider the change of variables $W=\alpha$ and $H=H(\alpha,\beta)$, which implies $\beta=\delta(W,H)$. The aim of the transformation is to obtain a likelihood in terms of $W$ and $H$ instead of $\alpha$ and $\beta$. Therefore, if $x_1,\ldots,x_n$ is a complete sample from (3), then the likelihood function of $W$ and $H$ is given by

$$L(W,H\mid\boldsymbol{x})=\frac{\delta(W,H)^{nW}}{\Gamma(W)^{n}}\left\{\prod_{i=1}^{n}x_i^{W-1}\right\}\exp\left\{-\delta(W,H)\sum_{i=1}^{n}x_i\right\}, \tag{5}$$

where $\delta(W,H)=\exp\left\{W+\log\Gamma(W)+(1-W)\psi(W)-H\right\}$.

The log-likelihood function is given by

$$\ell(W,H\mid\boldsymbol{x})=nW\log\delta(W,H)-n\log\Gamma(W)+(W-1)\sum_{i=1}^{n}\log(x_i)-\delta(W,H)\sum_{i=1}^{n}x_i. \tag{6}$$

The MLEs of the parameters are obtained by directly maximizing the log-likelihood function (6). Hence, after some algebraic manipulations, the MLEs $\hat{W}$ and $\hat{H}$ are obtained from the solution of

$$\frac{\partial \ell(W,H\mid\boldsymbol{x})}{\partial W}=n\log\delta(W,H)-n\psi(W)+\sum_{i=1}^{n}\log(x_i)+\sigma(W)\left(nW-\delta(W,H)\sum_{i=1}^{n}x_i\right)=0,$$

$$\frac{\partial \ell(W,H\mid\boldsymbol{x})}{\partial H}=-nW+\delta(W,H)\sum_{i=1}^{n}x_i=0,$$

where $\sigma(W)=1+(1-W)\psi'(W)$. The solutions of these equations provide the maximum likelihood estimators $\hat{W}$ and $\hat{H}$ of the entropy of the gamma distribution. Since the equations cannot be solved in closed form, numerical techniques must be used to estimate the true parameters.
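A minimal numerical illustration (our own sketch, not the authors' code): by the invariance of the MLE, solving the $(W,H)$ score equations is equivalent to solving the standard profile score equation for $\alpha$, with $\hat{\beta}=\hat{\alpha}/\bar{x}$, and then plugging the estimates into (4).

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import digamma, gammaln

rng = np.random.default_rng(1)
x = rng.gamma(shape=2.0, scale=1.0 / 3.0, size=200)  # simulated data, alpha=2, beta=3

# Profile score: log(alpha) - psi(alpha) = log(x_bar) - mean(log x); beta = alpha / x_bar.
c = np.log(x.mean()) - np.log(x).mean()  # > 0 by the AM-GM inequality
alpha_hat = brentq(lambda a: np.log(a) - digamma(a) - c, 1e-3, 1e3)
beta_hat = alpha_hat / x.mean()

W_hat = alpha_hat
H_hat = (alpha_hat - np.log(beta_hat) + gammaln(alpha_hat)
         + (1.0 - alpha_hat) * digamma(alpha_hat))  # plug-in MLE of the entropy
```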

Following [22], the MLEs are asymptotically normally distributed with a joint bivariate normal distribution given by

$$(\hat{W}_{MLE},\hat{H}_{MLE})\sim N_2\left[(W,H),\,I^{-1}(W,H)\right]\quad\text{as }n\to\infty,$$

where $I(W,H)$ is the Fisher information matrix for the reparametrized model, given by

$$I(W,H)=n\begin{bmatrix}\psi'(W)-2\sigma(W)+W\sigma(W)^{2} & (1-W)\left(1-W\psi'(W)\right)\\ (1-W)\left(1-W\psi'(W)\right) & W\end{bmatrix}, \tag{7}$$

where $\sigma(W)=1+(1-W)\psi'(W)$, and $\psi'(\cdot)$ is the derivative of $\psi(\cdot)$, called the trigamma function.

In the present paper, we are only interested in $H$, and thus, given $\hat{W}$ and using the element $\left[I^{-1}(W,H)\right]_{22}=\left((1-W)^{2}\psi'(W)+2-W\right)/n$, we can conclude that the confidence interval for the entropy measure with confidence level $1-a$ is given by

$$\hat{H}\pm z_{a/2}\sqrt{\frac{(1-\hat{W})^{2}\psi'(\hat{W})+2-\hat{W}}{n}},$$

where $a$ is the significance level and $z_{a/2}$ is the $(1-a/2)$-th quantile of the standard normal distribution.
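For completeness, the interval above can be computed as follows (a sketch under our reconstruction of the variance term; the function name is ours).

```python
import numpy as np
from scipy.special import polygamma
from scipy.stats import norm

def entropy_ci(H_hat, W_hat, n, a=0.05):
    """Asymptotic (1 - a) confidence interval for H:
    Var(H_hat) ~ ((1 - W)^2 psi'(W) + 2 - W) / n, the (H, H) element of I^{-1}."""
    var = ((1.0 - W_hat) ** 2 * polygamma(1, W_hat) + 2.0 - W_hat) / n
    half = norm.ppf(1.0 - a / 2.0) * np.sqrt(var)
    return H_hat - half, H_hat + half

lo, hi = entropy_ci(H_hat=1.0, W_hat=2.0, n=100)  # symmetric interval around H_hat
```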

## 3 Bayesian Inference

Here, the parameter vector $\theta=(W,H)$ is considered a random variable, and the distribution that represents the knowledge about $\theta$, denoted $\pi(\theta)$, is referred to as the prior distribution. The prior provides the knowledge or uncertainty about $\theta$ before the sample data $\boldsymbol{x}$ are obtained. After the data $\boldsymbol{x}$ are observed, the information from the prior distribution and the likelihood function are combined through Bayes' theorem, resulting in the posterior distribution of $\theta$ given $\boldsymbol{x}$. In a Bayesian framework, Ramos et al. [27] analyzed the properties of the posterior distribution of the gamma distribution parameters and stated the conditions for this distribution to have a proper posterior and finite moments.

To obtain the posterior distributions for the reparametrized model, we can exploit the one-to-one invariance property of the Jeffreys prior, reference priors, and matching priors, and thus we only need the Jacobian matrix of the reparametrization from $(\alpha,\beta)$ to $(W,H)$. After some algebraic manipulations, we can conclude that the parameters $\alpha$ and $\beta$ can be written as

$$\beta=\exp\left\{W+\log\Gamma(W)+(1-W)\psi(W)-H\right\}\quad\text{and}\quad\alpha=W,$$

and thus, from the relations

$$\frac{\partial\alpha}{\partial H}=0,\quad\frac{\partial\alpha}{\partial W}=1,\quad\frac{\partial\beta}{\partial H}=-\beta\quad\text{and}\quad\frac{\partial\beta}{\partial W}=\left(1+(1-W)\psi^{(1)}(W)\right)\beta,$$

it follows that the Jacobian matrix $J$ of the change of variables is given by

$$J=\begin{bmatrix}\dfrac{\partial\alpha}{\partial H}&\dfrac{\partial\alpha}{\partial W}\\[6pt]\dfrac{\partial\beta}{\partial H}&\dfrac{\partial\beta}{\partial W}\end{bmatrix}=\begin{bmatrix}0&1\\-\beta&\sigma\beta\end{bmatrix}, \tag{9}$$

where $\sigma=1+(1-W)\psi^{(1)}(W)$.

The use of objective priors plays an essential role in Bayesian analysis when the data should provide the dominant information, so that the posterior distribution is not overshadowed by the prior. Such priors allow us to conduct an objective Bayesian inference. On the other hand, in most situations they are not proper prior distributions and may lead to an improper posterior, invalidating the analysis, since we cannot compute the normalizing constant. Therefore, we need to check whether the obtained posterior (and posterior mean) is proper (or finite). The priors for the entropy and their related posterior distributions are discussed in the next subsections.

Before we derive the priors and posterior distributions, hereafter we shall always assume that the sample contains at least two distinct observations, that is, there exist $i$ and $j$ such that $x_i\neq x_j$. Additionally, before we proceed, we present below a definition and a proposition that will be used to prove that the obtained posteriors are proper. In the following, let $\bar{\mathbb{R}}=\mathbb{R}\cup\{-\infty,\infty\}$ denote the extended real number line and $\mathbb{R}_{>0}$ denote the strictly positive real numbers. The following definition is a special case of the one presented in [25] and will play an important role in proving that the analyzed posterior distributions and posterior means are proper.

###### Definition 3.1.

Let $g,h:(a,b)\to\bar{\mathbb{R}}$ and $c\in\{a,b\}$, where $a,b\in\bar{\mathbb{R}}$, and suppose that $h(x)\neq 0$ near $c$. Then, if $\lim_{x\to c}g(x)/h(x)=k\in\mathbb{R}_{>0}$, we say that $g(x)\propto h(x)$ as $x\to c$, denoted by $g(x)\underset{x\to c}{\propto}h(x)$.

Regarding the above definition, we have the following proposition from [25].

###### Proposition 3.2.

Let $g$ and $h$ be continuous nonnegative functions in $(a,b)$, where $a,b\in\bar{\mathbb{R}}$ and $a<b$, and let $c\in(a,b)$. Then $g(x)\underset{x\to a}{\propto}h(x)$ implies that $\int_a^c g(x)\,dx<\infty$ if and only if $\int_a^c h(x)\,dx<\infty$, and $g(x)\underset{x\to b}{\propto}h(x)$ implies that $\int_c^b g(x)\,dx<\infty$ if and only if $\int_c^b h(x)\,dx<\infty$.

### 3.1 Jeffreys prior

Jeffreys [18] described a procedure to derive an objective prior that is invariant under one-to-one monotone transformations. This invariance property of the Jeffreys prior has been widely exploited to make statistical inferences from its posterior distribution. The prior is constructed as the square root of the determinant of the Fisher information matrix. Thus, the Jeffreys prior for the gamma distribution is given by

$$\pi_1(\alpha,\beta)\propto\frac{\sqrt{\alpha\psi'(\alpha)-1}}{\beta}. \tag{10}$$

Additionally, from the determinant of the Fisher information of the reparametrized model, or by applying the change of variables to the Jeffreys prior above, we have

$$\pi_1(H,W)\propto\sqrt{W\psi'(W)-1}. \tag{11}$$

Finally, the joint posterior distribution for $H$ and $W$ produced by the Jeffreys prior is

$$\pi_1(H,W\mid\boldsymbol{x})\propto\frac{\delta(W,H)^{nW}\sqrt{W\psi'(W)-1}}{\Gamma(W)^{n}}\left\{\prod_{i=1}^{n}x_i^{W}\right\}\exp\left\{-\delta(W,H)\sum_{i=1}^{n}x_i\right\}. \tag{12}$$
###### Theorem 3.3.

The posterior density (12) is proper for all $n\geq 2$.

###### Proof.

Using the change of variables $u=e^{-H}$ and denoting $\delta_1(W)=\exp\left\{W+\log\Gamma(W)+(1-W)\psi(W)\right\}$, so that $\delta(W,H)=\delta_1(W)u$, it follows that

$$\begin{aligned}d_1(\boldsymbol{x})&\propto\int_0^{\infty}\int_{-\infty}^{\infty}\pi_1(H,W\mid\boldsymbol{x})\,dH\,dW\\&\propto\int_0^{\infty}\int_0^{\infty}\frac{\delta_1(W)^{nW}u^{nW-1}\sqrt{W\psi'(W)-1}}{\Gamma(W)^{n}}\left\{\prod_{i=1}^{n}x_i^{W}\right\}\exp\left\{-\delta_1(W)u\sum_{i=1}^{n}x_i\right\}du\,dW\\&=\int_0^{\infty}\frac{\delta_1(W)^{nW}\sqrt{W\psi'(W)-1}}{\Gamma(W)^{n}}\left\{\prod_{i=1}^{n}x_i^{W}\right\}\int_0^{\infty}u^{nW-1}\exp\left\{-\delta_1(W)\left(\sum_{i=1}^{n}x_i\right)u\right\}du\,dW\\&=\int_0^{\infty}\sqrt{W\psi'(W)-1}\,\frac{\prod_{i=1}^{n}x_i^{W}}{\left(\sum_{i=1}^{n}x_i\right)^{nW}}\,\frac{\Gamma(nW)}{\Gamma(W)^{n}}\,dW=\int_0^{1}g_1(W)\,dW+\int_1^{\infty}g_1(W)\,dW,\end{aligned}$$

where $g_1(W)=\sqrt{W\psi'(W)-1}\,\dfrac{\prod_{i=1}^{n}x_i^{W}}{\left(\sum_{i=1}^{n}x_i\right)^{nW}}\,\dfrac{\Gamma(nW)}{\Gamma(W)^{n}}$ for all $W>0$. Now, according to [25, 28], we have $\sqrt{W\psi'(W)-1}\underset{W\to 0^{+}}{\propto}W^{-1/2}$ and $\dfrac{\Gamma(nW)}{\Gamma(W)^{n}}\underset{W\to 0^{+}}{\propto}W^{n-1}$, and since

$$\lim_{W\to 0^{+}}\frac{\prod_{i=1}^{n}x_i^{W}}{\left(\sum_{i=1}^{n}x_i\right)^{nW}}=1\;\Rightarrow\;\frac{\prod_{i=1}^{n}x_i^{W}}{\left(\sum_{i=1}^{n}x_i\right)^{nW}}\underset{W\to 0^{+}}{\propto}1,$$

it follows by Proposition 3.2 that

$$\int_0^{1}g_1(W)\,dW\propto\int_0^{1}W^{-1/2}\times 1\times W^{n-1}\,dW<\infty.$$

Moreover, due to [25, 28] we have $\sqrt{W\psi'(W)-1}\underset{W\to\infty}{\propto}W^{-1/2}$ and $\dfrac{\Gamma(nW)}{\Gamma(W)^{n}}\underset{W\to\infty}{\propto}n^{nW}W^{(n-1)/2}$, and since $x_1,\ldots,x_n$ are not all equal, due to the inequality of the arithmetic and geometric means we have $q=\log\!\left(\dfrac{\frac{1}{n}\sum_{i=1}^{n}x_i}{\sqrt[n]{\prod_{i=1}^{n}x_i}}\right)>0$, and thus it follows that

$$\frac{\prod_{i=1}^{n}x_i^{W}}{\left(\sum_{i=1}^{n}x_i\right)^{nW}}=\left(\frac{\frac{1}{n}\sum_{i=1}^{n}x_i}{\sqrt[n]{\prod_{i=1}^{n}x_i}}\right)^{-nW}n^{-nW}=\exp(-nqW)\,n^{-nW}.$$

Therefore, from Proposition 3.2 it follows that

$$\begin{aligned}\int_1^{\infty}g_1(W)\,dW&\propto\int_1^{\infty}W^{-1/2}\times\exp(-nqW)\,n^{-nW}\times n^{nW}W^{(n-1)/2}\,dW\\&=\int_1^{\infty}W^{n/2-1}\exp(-nqW)\,dW\le\frac{\Gamma(n/2)}{(nq)^{n/2}}<\infty,\end{aligned}$$

which concludes the proof. ∎

###### Theorem 3.4.

The posterior mean of $H$ relative to (12) is finite for any $n\geq 2$.

###### Proof.

Doing the change of variables $u=e^{-H}$ and denoting $\delta_1(W)=\exp\left\{W+\log\Gamma(W)+(1-W)\psi(W)\right\}$, so that $H=-\log(u)$, it follows that

$$\begin{aligned}E_1[H\mid\boldsymbol{x}]&\propto\int_0^{\infty}\int_{-\infty}^{\infty}H\,\pi_1(H,W\mid\boldsymbol{x})\,dH\,dW\\&=\int_0^{\infty}\int_0^{\infty}(-\log(u))\,\frac{\delta_1(W)^{nW}u^{nW-1}\sqrt{W\psi'(W)-1}}{\Gamma(W)^{n}}\left\{\prod_{i=1}^{n}x_i^{W}\right\}\exp\left\{-\delta_1(W)u\sum_{i=1}^{n}x_i\right\}du\,dW\\&=\int_0^{\infty}\frac{\delta_1(W)^{nW}\sqrt{W\psi'(W)-1}}{\Gamma(W)^{n}}\left\{\prod_{i=1}^{n}x_i^{W}\right\}\int_0^{\infty}(-\log(u))\,u^{nW-1}\exp\left\{-\delta_1(W)\left(\sum_{i=1}^{n}x_i\right)u\right\}du\,dW.\end{aligned}$$

Moreover, from the identity $\int_0^{\infty}\log(t)\,t^{z-1}e^{-t}\,dt=\psi(z)\Gamma(z)$ one obtains that

$$\int_0^{\infty}\log(s)\,s^{z-1}e^{-as}\,ds=\frac{1}{a^{z}}\int_0^{\infty}\log(t/a)\,t^{z-1}e^{-t}\,dt=\frac{1}{a^{z}}\left(\psi(z)\Gamma(z)-\log(a)\Gamma(z)\right),$$

and thus, letting $|\cdot|$ denote the absolute value and letting $\delta_2(W)=|\psi(nW)|+W+|\log\Gamma(W)|+(1+W)|\psi(W)|+\left|\log\left(\sum_{i=1}^{n}x_i\right)\right|$ for all $W>0$, and using the triangle inequality, we have

$$\begin{aligned}|E_1[H\mid\boldsymbol{x}]|&\propto\left|\int_0^{\infty}\left(\psi(nW)-\log\left(\delta_1(W)\sum_{i=1}^{n}x_i\right)\right)\sqrt{W\psi'(W)-1}\,\frac{\prod_{i=1}^{n}x_i^{W}}{\left(\sum_{i=1}^{n}x_i\right)^{nW}}\,\frac{\Gamma(nW)}{\Gamma(W)^{n}}\,dW\right|\\&\le\int_0^{\infty}\left|\psi(nW)-\log\left(\delta_1(W)\sum_{i=1}^{n}x_i\right)\right|\sqrt{W\psi'(W)-1}\,\frac{\prod_{i=1}^{n}x_i^{W}}{\left(\sum_{i=1}^{n}x_i\right)^{nW}}\,\frac{\Gamma(nW)}{\Gamma(W)^{n}}\,dW\\&\le\int_0^{\infty}\delta_2(W)\sqrt{W\psi'(W)-1}\,\frac{\prod_{i=1}^{n}x_i^{W}}{\left(\sum_{i=1}^{n}x_i\right)^{nW}}\,\frac{\Gamma(nW)}{\Gamma(W)^{n}}\,dW=\int_0^{1}h_1(W)\,dW+\int_1^{\infty}h_1(W)\,dW,\end{aligned}$$

where $h_1(W)=\delta_2(W)\sqrt{W\psi'(W)-1}\,\dfrac{\prod_{i=1}^{n}x_i^{W}}{\left(\sum_{i=1}^{n}x_i\right)^{nW}}\,\dfrac{\Gamma(nW)}{\Gamma(W)^{n}}$ for all $W>0$.

We shall now prove that $\delta_2(W)\underset{W\to 0^{+}}{\propto}W^{-1}$ and $\delta_2(W)\underset{W\to\infty}{\propto}W\log(W)$. Indeed, since due to Abramowitz and Stegun [1] we have $\lim_{z\to 0^{+}}z\psi(z)=-1$ and $\lim_{z\to 0^{+}}z\Gamma(z)=1$, it follows that

$$\begin{aligned}&\lim_{W\to 0^{+}}\frac{|\psi(nW)|}{W^{-1}}=\lim_{W\to 0^{+}}\frac{1}{n}\left|(nW)\psi(nW)\right|=\frac{1}{n},\\&\lim_{W\to 0^{+}}\frac{|\log\Gamma(W)|}{W^{-1}}=\lim_{W\to 0^{+}}\left|W\log(W\Gamma(W))-W\log(W)\right|=|0\cdot\log(1)-0|=0,\\&\lim_{W\to 0^{+}}\frac{(1+W)|\psi(W)|}{W^{-1}}=\lim_{W\to 0^{+}}(1+W)\left|W\psi(W)\right|=1,\quad\text{and}\\&\lim_{W\to 0^{+}}\frac{W+\left|\log\left(\sum_{i=1}^{n}x_i\right)\right|}{W^{-1}}=\lim_{W\to 0^{+}}\left(W^{2}+W\left|\log\left(\sum_{i=1}^{n}x_i\right)\right|\right)=0,\end{aligned}$$

and thus

$$\lim_{W\to 0^{+}}\frac{\delta_2(W)}{W^{-1}}=\frac{1}{n}+1\;\Rightarrow\;\delta_2(W)\underset{W\to 0^{+}}{\propto}\frac{1}{W}.$$

On the other hand, since due to Abramowitz and Stegun [1] we have $\psi(W)\underset{W\to\infty}{\propto}\log(W)$, it follows from L'Hôpital's rule that

$$\lim_{W\to\infty}\frac{\log\Gamma(W)}{W(\log(W)+1)}=\lim_{W\to\infty}\frac{\left(\log\Gamma(W)\right)'}{\left(W(\log(W)+1)\right)'}=\lim_{W\to\infty}\frac{\psi(W)}{\log(W)+2}=1,$$

and therefore we have

$$\begin{aligned}&\lim_{W\to\infty}\frac{|\psi(nW)|}{W(\log(W)+1)}=0,\qquad\lim_{W\to\infty}\frac{|\log\Gamma(W)|}{W(\log(W)+1)}=1,\\&\lim_{W\to\infty}\frac{(1+W)|\psi(W)|}{W(\log(W)+1)}=\lim_{W\to\infty}\left(1+W^{-1}\right)\frac{1}{1+\log(W)^{-1}}\,\frac{|\psi(W)|}{\log(W)}=1,\quad\text{and}\\&\lim_{W\to\infty}\frac{W+\left|\log\left(\sum_{i=1}^{n}x_i\right)\right|}{W(\log(W)+1)}=\lim_{W\to\infty}\left(\frac{1}{\log(W)+1}+\frac{\left|\log\left(\sum_{i=1}^{n}x_i\right)\right|}{W(\log(W)+1)}\right)=0,\end{aligned}$$

and thus

$$\lim_{W\to\infty}\frac{\delta_2(W)}{W(\log(W)+1)}=2\;\Rightarrow\;\delta_2(W)\underset{W\to\infty}{\propto}W\log(W).$$

Therefore, combining the obtained proportionality with the proportionalities proved in Theorem 3.3 and using Proposition 3.2 we have

$$\int_0^{1}h_1(W)\,dW\propto\int_0^{1}W^{-1}\times W^{-1/2}\times 1\times W^{n-1}\,dW<\infty.$$

Finally, using the proportionality $\delta_2(W)\underset{W\to\infty}{\propto}W\log(W)$, letting $q$ be as in the proof of Theorem 3.3 and using that $\log(W)+1\le 2W$ for $W\ge 1$, it follows from the proportionalities proved in Theorem 3.3 and from Proposition 3.2 that

$$\begin{aligned}\int_1^{\infty}h_1(W)\,dW&\propto\int_1^{\infty}W(\log(W)+1)\times W^{-1/2}\times\exp(-nqW)\,n^{-nW}\times n^{nW}W^{(n-1)/2}\,dW\\&\le 2\int_1^{\infty}W^{(n/2+2)-1}\exp(-nqW)\,dW\le\frac{2\,\Gamma(n/2+2)}{(nq)^{n/2+2}}<\infty,\end{aligned}$$

which concludes the proof. ∎
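As an aside (not part of the paper), the log-gamma integral identity used in the proof above is easy to verify numerically:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma as Gamma, digamma

z, a = 2.7, 1.3  # arbitrary test values with z > 0, a > 0
lhs, _ = quad(lambda s: np.log(s) * s ** (z - 1.0) * np.exp(-a * s), 0.0, np.inf)
rhs = Gamma(z) * a ** (-z) * (digamma(z) - np.log(a))
assert np.isclose(lhs, rhs)  # numerical quadrature matches the closed form
```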

In order to sample from the posterior distribution, we obtain the marginal posterior distribution of $W$,

$$\pi_1(W\mid\boldsymbol{x})\propto\sqrt{W\psi'(W)-1}\,\frac{\Gamma(nW)}{\Gamma(W)^{n}}\left(\frac{\sqrt[n]{\prod_{i=1}^{n}x_i}}{\sum_{i=1}^{n}x_i}\right)^{nW},$$

and the conditional posterior distribution of $H$ given $W$,

$$\pi_1(H\mid W,\boldsymbol{x})\propto\exp\left\{-nWH-\delta(W,H)\sum_{i=1}^{n}x_i\right\}.$$
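A sketch of posterior simulation under the Jeffreys prior (our own implementation, not the authors' code): $W$ can be drawn from its marginal via a grid approximation, and $H\mid W$ can then be drawn exactly, since the conditional above implies that $\delta(W,H)\mid W,\boldsymbol{x}$ follows a Gamma$(nW,\sum_i x_i)$ distribution.

```python
import numpy as np
from scipy.special import gammaln, digamma, polygamma

rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=0.5, size=50)  # simulated data (alpha=2, beta=2)
n, sx, slx = x.size, x.sum(), np.log(x).sum()

# Marginal posterior of W on a grid (computed on the log scale, then normalized).
W = np.linspace(1e-3, 20.0, 4000)
log_post = (0.5 * np.log(W * polygamma(1, W) - 1.0)     # Jeffreys factor
            + gammaln(n * W) - n * gammaln(W)
            + W * slx - n * W * np.log(sx))
p = np.exp(log_post - log_post.max())
p /= p.sum()

Ws = rng.choice(W, size=2000, p=p)                      # draws of W
log_d1 = Ws + gammaln(Ws) + (1.0 - Ws) * digamma(Ws)    # log delta_1(W)
T = rng.gamma(shape=n * Ws, scale=1.0 / sx)             # delta(W,H) | W ~ Gamma(nW, sum x)
Hs = log_d1 - np.log(T)                                 # posterior draws of the entropy H
```

Posterior summaries (mean, credible intervals) then follow from the draws `Hs`; the grid step is only illustrative, and an MCMC update for $W$ would be the more careful choice in practice.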

### 3.2 Reference prior

Bernardo [7] discussed a different approach to obtaining objective priors, known as reference priors. Subsequently, many studies developed formal and rigorous definitions for deriving this class of prior distributions in different contexts [4, 5, 6, 2, 3]. The reference prior is obtained by maximizing the expected Kullback–Leibler (KL) divergence between the posterior and the prior, assuming some regularity conditions. Maximizing the expected information the data bring to the posterior allows the data to have the maximum influence on the posterior distribution. Reference priors have essential properties such as consistent sampling, consistent marginalization, and one-to-one transformation invariance [8]. The reference prior may depend on the ordering of the parameters of interest; hence, for the gamma distribution, we have two distinct priors, which are presented below.

#### 3.2.1 Reference prior when β is the parameter of interest

The reference prior when $\beta$ is the parameter of interest and $\alpha$ is the nuisance parameter is given by

$$\pi_2(\alpha,\beta)\propto\frac{\sqrt{\psi'(\alpha)}}{\beta}. \tag{13}$$

Thus, using the Jacobian transformation, it follows that the related reference prior is given by

$$\pi_2(W,H)\propto\sqrt{\psi'(W)}. \tag{14}$$

Finally, the joint posterior distribution for $W$ and $H$ produced by the reference prior (14) is given by

$$\pi_2(W,H\mid\boldsymbol{x})\propto\frac{\delta(W,H)^{nW}\sqrt{\psi'(W)}}{\Gamma(W)^{n}}\left\{\prod_{i=1}^{n}x_i^{W}\right\}\exp\left\{-\delta(W,H)\sum_{i=1}^{n}x_i\right\}. \tag{15}$$
###### Theorem 3.5.

The posterior density (15) is proper for all $n\geq 2$.

###### Proof.

Doing the change of variables $u=e^{-H}$, denoting $\delta_1(W)$ as in the proof of Theorem 3.3 and proceeding analogously, we have

$$d_2(\boldsymbol{x})\propto\int_0^{\infty}\int_{-\infty}^{\infty}\pi_2(W,H\mid\boldsymbol{x})\,dH\,dW\propto\int_0^{1}g_2(W)\,dW+\int_1^{\infty}g_2(W)\,dW,$$

where $g_2(W)=\sqrt{\psi'(W)}\,\dfrac{\prod_{i=1}^{n}x_i^{W}}{\left(\sum_{i=1}^{n}x_i\right)^{nW}}\,\dfrac{\Gamma(nW)}{\Gamma(W)^{n}}$ for all $W>0$. Now, according to [25, 28], we have $\sqrt{\psi'(W)}\underset{W\to 0^{+}}{\propto}W^{-1}$ and $\dfrac{\Gamma(nW)}{\Gamma(W)^{n}}\underset{W\to 0^{+}}{\propto}W^{n-1}$, and since we proved in Theorem 3.3 that $\dfrac{\prod_{i=1}^{n}x_i^{W}}{\left(\sum_{i=1}^{n}x_i\right)^{nW}}\underset{W\to 0^{+}}{\propto}1$, it follows from Proposition 3.2 that

$$\int_0^{1}g_2(W)\,dW\propto\int_0^{1}W^{-1}\times 1\times W^{n-1}\,dW<\infty.$$

Moreover, from Abramowitz and Stegun [1] we have $\psi'(W)\underset{W\to\infty}{\propto}W^{-1}$, which combined with $W\psi'(W)-1\underset{W\to\infty}{\propto}(2W)^{-1}$ implies that $\sqrt{\psi'(W)}\underset{W\to\infty}{\propto}\sqrt{W\psi'(W)-1}$. Therefore, it follows that $g_2(W)\underset{W\to\infty}{\propto}g_1(W)$, and by Proposition 3.2 it follows that

$$\int_1^{\infty}g_2(W)\,dW\propto\int_1^{\infty}g_1(W)\,dW<\infty,$$

which concludes the proof. ∎

###### Theorem 3.6.

The posterior mean of $H$ relative to (15) is finite for all $n\geq 3$.

###### Proof.

Proceeding analogously as in the proof of Theorem 3.4, it follows that

$$\begin{aligned}|E_2[H\mid\boldsymbol{x}]|&\propto\left|\int_0^{\infty}\int_{-\infty}^{\infty}H\,\pi_2(H,W\mid\boldsymbol{x})\,dH\,dW\right|\\&\le\int_0^{\infty}\delta_2(W)\sqrt{\psi'(W)}\,\frac{\prod_{i=1}^{n}x_i^{W}}{\left(\sum_{i=1}^{n}x_i\right)^{nW}}\,\frac{\Gamma(nW)}{\Gamma(W)^{n}}\,dW=\int_0^{1}h_2(W)\,dW+\int_1^{\infty}h_2(W)\,dW,\end{aligned}$$

where $\delta_2(W)$ is the same as defined in the proof of Theorem 3.4 and

$$h_2(W)=\delta_2(W)\sqrt{\psi'(W)}\,\frac{\prod_{i=1}^{n}x_i^{W}}{\left(\sum_{i=1}^{n}x_i\right)^{nW}}\,\frac{\Gamma(nW)}{\Gamma(W)^{n}}.$$

Since in the proof of Theorem 3.4 we showed that $\delta_2(W)\underset{W\to 0^{+}}{\propto}W^{-1}$, together with the proportionalities proved in Theorem 3.3 and Proposition 3.2 we have

$$\int_0^{1}h_2(W)\,dW\propto\int_0^{1}W^{-1}\times W^{-1}\times 1\times W^{n-1}\,dW<\infty,$$

since $n\geq 3$.

Finally, from the proof of Theorem 3.5 we know that $\sqrt{\psi'(W)}\underset{W\to\infty}{\propto}\sqrt{W\psi'(W)-1}$, which implies directly that $h_2(W)\underset{W\to\infty}{\propto}h_1(W)$, and thus from Proposition 3.2 it follows that

$$\int_1^{\infty}h_2(W)\,dW\propto\int_1^{\infty}h_1(W)\,dW<\infty,$$

which concludes the proof. ∎

The marginal posterior distribution of $W$ is given by

$$\pi_2(W\mid\boldsymbol{x})\propto\sqrt{\psi'(W)}\,\frac{\Gamma(nW)}{\Gamma(W)^{n}}\left(\frac{\sqrt[n]{\prod_{i=1}^{n}x_i}}{\sum_{i=1}^{n}x_i}\right)^{nW}.$$

Moreover, the conditional posterior distribution of $H$ given $W$ is

$$\pi_2(H\mid W,\boldsymbol{x})\propto\exp\left\{-nWH-\delta(W,H)\sum_{i=1}^{n}x_i\right\}.$$

#### 3.2.2 Reference prior when α is the parameter of interest

The reference prior when $\alpha$ is the parameter of interest and $\beta$ is the nuisance parameter is given by

$$\pi_3(\alpha,\beta)\propto\frac{1}{\beta}\sqrt{\frac{\alpha\psi'(\alpha)-1}{\alpha}}. \tag{16}$$

Therefore, in terms of the reparametrized model, the reference prior when $W$ is the parameter of interest and $H$ is the nuisance parameter is given by

$$\pi_3(W,H)\propto\sqrt{\frac{W\psi'(W)-1}{W}}. \tag{17}$$

Finally, the joint posterior distribution for $W$ and $H$ produced by the reference prior (17) is given by

$$\pi_3(W,H\mid\boldsymbol{x})\propto\sqrt{\frac{W\psi'(W)-1}{W}}\,\frac{\delta(W,H)^{nW}}{\Gamma(W)^{n}}\left\{\prod_{i=1}^{n}x_i^{W}\right\}\exp\left\{-\delta(W,H)\sum_{i=1}^{n}x_i\right\}. \tag{18}$$
###### Theorem 3.7.

The posterior density (18) is proper for all $n\geq 2$.

###### Proof.

Since $\dfrac{W\psi'(W)-1}{W}=\psi'(W)-\dfrac{1}{W}\le\psi'(W)$, it follows that $\pi_3(W,H)\le\pi_2(W,H)$ for all $W>0$ and