
A Closed-Form Approximation to the Conjugate Prior of the Dirichlet and Beta Distributions

by   Kaspar Thommen, et al.

We derive the conjugate prior of the Dirichlet and beta distributions and explore it with numerical examples to gain an intuitive understanding of the distribution itself, its hyperparameters, and conditions concerning its convergence. Due to the prior's intractability, we proceed to define and analyze a closed-form approximation. Finally, we provide an algorithm implementing this approximation that enables fully tractable Bayesian conjugate treatment of Dirichlet and beta likelihoods without the need for Monte Carlo simulations.



1 Introduction and prior work

The probabilistic modeling of proportional or compositional data (i.e., vectors containing fractions that sum to 1) arises in many fields: in finance, we might be interested in modeling the composition of client portfolios with respect to the asset classes they contain, and in political science or A/B testing, the probabilistic analysis of voting fractions may be of interest.

Fractional observations can often be modeled using Dirichlet (or beta) likelihoods. To enable computationally efficient Bayesian inference (e.g., for changepoint detection, see Adams and MacKay, 2007), a conjugate prior is desired. Recently, Andreoli (2018) derived the conjugate prior of the Dirichlet distribution and named it the Boojum distribution. However, the Boojum distribution is intractable, and the resource-intensive Monte Carlo simulations it requires rule out many practical applications. We overcome this problem by providing a closed-form approximation that renders the Boojum-Dirichlet conjugate pair tractable.

2 The conjugate prior of the Dirichlet distribution

First, we recall the systematic construction procedure of conjugate priors for exponential family distributions. Then, we apply this general solution to the Dirichlet likelihood in order to find its conjugate prior, the Boojum distribution. Note that this derivation also holds for beta likelihoods by setting the dimensionality of the problem to $K = 2$.

2.1 Conjugate priors of exponential family distributions

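Given a likelihood from the exponential family defined by

$$p(\mathbf{x} \mid \boldsymbol{\eta}) = h(\mathbf{x}) \exp\left(\boldsymbol{\eta}^\top \mathbf{T}(\mathbf{x}) - A(\boldsymbol{\eta})\right) \tag{2.1}$$

for the vector random variable $\mathbf{x}$, with parameter vector $\boldsymbol{\eta}$ and known functions $h$, $\mathbf{T}$ and $A$, there is a conjugate prior of the form

$$p(\boldsymbol{\eta} \mid n_0, \mathbf{v}_0) = f(n_0, \mathbf{v}_0) \exp\left(\boldsymbol{\eta}^\top \mathbf{v}_0 - n_0 A(\boldsymbol{\eta})\right) \tag{2.2}$$

with $n_0$ and $\mathbf{v}_0$ representing prior hyperparameters (Keener, 2010). The normalizing constant is given by

$$f(n_0, \mathbf{v}_0) = \left[\int \exp\left(\boldsymbol{\eta}^\top \mathbf{v}_0 - n_0 A(\boldsymbol{\eta})\right) d\boldsymbol{\eta}\right]^{-1}$$

The posterior after having collected $N$ observations $\mathbf{x}_1, \dots, \mathbf{x}_N$ is, thanks to conjugacy,

$$p(\boldsymbol{\eta} \mid \mathbf{x}_1, \dots, \mathbf{x}_N, n_0, \mathbf{v}_0) = f(n, \mathbf{v}) \exp\left(\boldsymbol{\eta}^\top \mathbf{v} - n A(\boldsymbol{\eta})\right)$$

with updated hyperparameters

$$n = n_0 + N, \qquad \mathbf{v} = \mathbf{v}_0 + \sum_{i=1}^{N} \mathbf{T}(\mathbf{x}_i)$$

The prior predictive distribution (or the posterior predictive distribution if appropriately updated hyperparameters $n$ and $\mathbf{v}$ are employed instead) is defined as follows:

$$p(\mathbf{x} \mid n_0, \mathbf{v}_0) = \int p(\mathbf{x} \mid \boldsymbol{\eta})\, p(\boldsymbol{\eta} \mid n_0, \mathbf{v}_0)\, d\boldsymbol{\eta} \tag{2.7}$$
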
2.2 Application to the Dirichlet distribution

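The likelihood function of the Dirichlet distribution parameterized with $\boldsymbol{\alpha} \in \mathbb{R}_{>0}^K$ for a random variable $\mathbf{x}$ belonging to the simplex (i.e., $x_k > 0$ and $\sum_{k=1}^K x_k = 1$, with $x_k$ representing the $k$-th vector entry) is given by

$$\mathrm{Dir}(\mathbf{x} \mid \boldsymbol{\alpha}) = \frac{1}{B(\boldsymbol{\alpha})} \prod_{k=1}^K x_k^{\alpha_k - 1} = \left(\prod_{k=1}^K x_k^{-1}\right) \exp\left(\boldsymbol{\alpha}^\top \log\mathbf{x} - \log B(\boldsymbol{\alpha})\right) \tag{2.10}$$

with $B(\boldsymbol{\alpha}) = \prod_{k=1}^K \Gamma(\alpha_k) \,/\, \Gamma\left(\sum_{k=1}^K \alpha_k\right)$ representing the multivariate beta function and where we have introduced the notation

$$\log\mathbf{x} = (\log x_1, \dots, \log x_K)^\top \tag{2.11}$$

Matching terms of (2.10) with (2.1) yields the following identities:

$$\boldsymbol{\eta} = \boldsymbol{\alpha}, \qquad h(\mathbf{x}) = \prod_{k=1}^K x_k^{-1}, \qquad \mathbf{T}(\mathbf{x}) = \log\mathbf{x}, \qquad A(\boldsymbol{\eta}) = \log B(\boldsymbol{\alpha})$$

This allows us to rewrite the generic prior (2.2) for the Dirichlet distribution, which yields the definition of the Boojum distribution with prior hyperparameters $n_0$ and $\mathbf{v}_0$:

$$p(\boldsymbol{\alpha} \mid n_0, \mathbf{v}_0) = f(n_0, \mathbf{v}_0)\, \frac{\exp\left(\boldsymbol{\alpha}^\top \mathbf{v}_0\right)}{B(\boldsymbol{\alpha})^{n_0}} \tag{2.17}$$

$$f(n_0, \mathbf{v}_0) = \left[\int_{\mathbb{R}_{>0}^K} \frac{\exp\left(\boldsymbol{\alpha}^\top \mathbf{v}_0\right)}{B(\boldsymbol{\alpha})^{n_0}}\, d\boldsymbol{\alpha}\right]^{-1} \tag{2.18}$$

The posterior has the same functional form as (2.17) due to conjugacy, but operates on updated hyperparameters:

$$n = n_0 + N \tag{2.19a}$$

$$\mathbf{v} = \mathbf{v}_0 + \sum_{i=1}^{N} \log\mathbf{x}_i \tag{2.19b}$$

Finally, the predictive distribution (2.7) translates to

$$p(\mathbf{x} \mid n_0, \mathbf{v}_0) = \int \mathrm{Dir}(\mathbf{x} \mid \boldsymbol{\alpha})\, p(\boldsymbol{\alpha} \mid n_0, \mathbf{v}_0)\, d\boldsymbol{\alpha} \tag{2.20}$$
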
2.3 Hyperparameter interpretation

We can interpret the prior hyperparameters $n_0$ and $\mathbf{v}_0$ as summary statistics of a set $\mathcal{X}_0$ of prior pseudo-observations, computed using the posterior update equations (2.19a) and (2.19b); see Table 1.

| Hyperparameter | Interpretation / summary statistic | Value |
| --- | --- | --- |
| $n_0$ | Number of prior pseudo-observations | $\lvert\mathcal{X}_0\rvert$ |
| $\mathbf{v}_0$ | Sum of vector-logs (see (2.11)) of prior pseudo-observations | $\sum_{\mathbf{x} \in \mathcal{X}_0} \log\mathbf{x}$ |

Table 1: Boojum hyperparameters

Generally, given an arbitrary set of observations

$$\mathcal{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_n\} \tag{2.21}$$

and after dropping the “prior” qualifier and the subscript 0, we can compute both $n$ and $\mathbf{v}$ according to Table 1. In other words, an observation set $\mathcal{X}$ fully defines a Boojum distribution. In the case of a prior, $\mathcal{X}$ is the set of prior pseudo-observations, and for Boojum posteriors the set comprises both prior pseudo-observations and actual observations.
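As a concrete illustration of Table 1, the following minimal Python sketch computes the two hyperparameters (the observation count $n$ and the sum of vector-logs $\mathbf{v}$) from an observation set. The function name `boojum_hyperparameters` and the example observations are illustrative assumptions and are not taken from the reference implementation.

```python
import numpy as np


def boojum_hyperparameters(observations):
    """Summarize simplex observations as (n, v), following Table 1.

    `observations` is an (n, K) array-like whose rows are strictly positive
    and sum to 1. The helper name is hypothetical.
    """
    x = np.asarray(observations, dtype=float)
    if x.ndim != 2:
        raise ValueError("expected an (n, K) array of observations")
    if np.any(x <= 0.0) or not np.allclose(x.sum(axis=1), 1.0):
        raise ValueError("every observation must lie in the open simplex")
    n = x.shape[0]              # number of (pseudo-)observations
    v = np.log(x).sum(axis=0)   # sum of vector-logs, see (2.11)
    return n, v


# Two illustrative pseudo-observations with K = 2 (values are assumptions).
n, v = boojum_hyperparameters([[0.2, 0.8], [0.4, 0.6]])
print(n, v)
```

Because every entry of an observation lies strictly between 0 and 1, all entries of `v` are negative.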

2.4 Convergence analysis

The integral defining the Boojum distribution’s normalizing constant (2.18) does not converge for all hyperparameter values $n$ and $\mathbf{v}$, as demonstrated by Andreoli (2018). He finds that all of the following conditions must hold:

  • (a)

  • (b)

  • (c) or

If we define the Boojum hyperparameters by means of prior pseudo-observations as proposed in Section 2.3, we find that:

  • (a) is satisfied:

  • (b) is satisfied:

Evaluating condition (c) is more involved. The left-hand side sub-condition is clearly violated by hyperparameters defined through pseudo-observations, so we must evaluate the right-hand side:

$$1 > \sum_{k=1}^K \exp\left(\frac{v_k}{n}\right) = \sum_{k=1}^K \left(\prod_{i=1}^n x_{i,k}\right)^{1/n} \tag{2.25}$$

where $x_{i,k}$ denotes the $k$-th entry of the $i$-th observation. Note that the summands are the geometric means of the $k$-th entries across the $n$ observation vectors used to define the Boojum distribution. In other words, the right-hand side of (2.25) equals the sum over the components of the “geometric mean vector” of the prior pseudo-observations.

For any set of positive numbers, the geometric mean is less than or equal to the arithmetic mean, hence we can bound the sum in (2.25) as

$$\sum_{k=1}^K \left(\prod_{i=1}^n x_{i,k}\right)^{1/n} \leq \sum_{k=1}^K \frac{1}{n} \sum_{i=1}^n x_{i,k} = \frac{1}{n} \sum_{i=1}^n \sum_{k=1}^K x_{i,k} = 1 \tag{2.27}$$

Because the geometric mean of a set of positive numbers equals the arithmetic mean only if the set is composed of identical numbers, we can conclude the following:

  • If $n = 1$, or if $n > 1$ and all $\mathbf{x}_i$ are identical, the geometric and arithmetic means are equal, thus rendering (2.27) an equality. Hence, in this case, condition (c) is not satisfied.

  • Otherwise, i.e., if $n > 1$ and not all observations are identical (i.e., when $\mathbf{x}_i \neq \mathbf{x}_j$ for at least one pair $i \neq j$), the arithmetic mean dominates its geometric counterpart in at least one component. Hence, in this case, the left-hand side of (2.27) is strictly less than the right-hand side, and it follows that condition (c) is satisfied.

To summarize, the Boojum distribution only converges if we define it using two or more prior pseudo-observations that are not all identical. Section 3 will present visualizations of the Boojum distribution that demonstrate this finding graphically.
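The geometric-mean criterion derived above can be checked numerically. The sketch below is a minimal illustration covering only the right-hand side sub-condition of (c); the function names, the tolerance `tol` (introduced to absorb floating-point rounding when all observations are identical) and the example observations are assumptions made for this illustration.

```python
import numpy as np


def geometric_mean_sum(observations):
    """Sum over components of the geometric mean vector of the observations."""
    x = np.asarray(observations, dtype=float)
    # mean of logs per component equals v_k / n, so exp(...) is the geometric mean
    return float(np.exp(np.log(x).mean(axis=0)).sum())


def satisfies_condition_c(observations, tol=1e-12):
    """True if the geometric means sum to strictly less than 1 (up to tol)."""
    return geometric_mean_sum(observations) < 1.0 - tol


distinct = [[0.2, 0.8], [0.4, 0.6]]   # two different observations
single = [[0.3, 0.7]]                 # one observation only
repeated = [[0.3, 0.7]] * 10          # the same observation ten times

for name, obs in [("distinct", distinct), ("single", single), ("repeated", repeated)]:
    print(name, geometric_mean_sum(obs), satisfies_condition_c(obs))
```

Only the set with two distinct observations passes the check; a single observation and a repeated observation both produce a geometric-mean sum of 1 and therefore fail it.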

2.5 MAP approximation

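The Boojum’s normalizing constant (2.18) cannot be computed analytically. This leads to the intractability of the distribution itself and of derived distributions such as the predictive distribution (2.20). In order to avoid inefficient Monte Carlo simulations, we seek a closed-form approximation to the Boojum distribution. We propose the maximum a posteriori (MAP) method (Murphy, 2012), which approximates the posterior probability density function (PDF) by a Dirac delta function located at its mode. After dropping the normalizing constant in (2.17), we can write the Boojum’s mode as

$$\boldsymbol{\alpha}^* = \arg\max_{\boldsymbol{\alpha}} \left[\boldsymbol{\alpha}^\top \mathbf{v} - n \log B(\boldsymbol{\alpha})\right] = \arg\max_{\boldsymbol{\alpha}} \left[\boldsymbol{\alpha}^\top \frac{\mathbf{v}}{n} - \log B(\boldsymbol{\alpha})\right] \tag{2.34}$$

which shows that $\boldsymbol{\alpha}^*$ is effectively a function of a single (vector) variable only, namely $\mathbf{v}/n$. Setting the derivative with respect to $\alpha_k$ of the argument in (2.34) to zero yields

$$\frac{v_k}{n} - \psi(\alpha_k) + \psi\left(\sum_{j=1}^K \alpha_j\right) = 0, \qquad k = 1, \dots, K \tag{2.38}$$

where $\psi(x) = \frac{d}{dx} \ln\Gamma(x)$ is the digamma function and $\Gamma$ is the gamma function. This set of coupled equations lacks an analytic solution for $\boldsymbol{\alpha}^*$, so we must resort to numerical methods. To this end, either gradient ascent methods operating on the derivative (2.38) (e.g., the Adam optimizer (Kingma and Ba, 2017), which is popular for its good performance in the neural network realm) or direct optimization methods (e.g., the Nelder-Mead method (Nelder and Mead, 1965)) can be employed.

Once $\boldsymbol{\alpha}^*$ is determined, the MAP approximation of the Boojum posterior becomes, by definition,

$$p(\boldsymbol{\alpha} \mid n, \mathbf{v}) \approx \delta(\boldsymbol{\alpha} - \boldsymbol{\alpha}^*) \tag{2.39}$$

where $\delta(\boldsymbol{\alpha} - \boldsymbol{\alpha}^*)$ is the Dirac delta function located at $\boldsymbol{\alpha}^*$. Consequently, the posterior predictive distribution (2.20) simplifies to the likelihood evaluated at the mode $\boldsymbol{\alpha}^*$, thus rendering it tractable (a welcome side effect of the MAP approximation):

$$p(\mathbf{x} \mid n, \mathbf{v}) \approx \int \mathrm{Dir}(\mathbf{x} \mid \boldsymbol{\alpha})\, \delta(\boldsymbol{\alpha} - \boldsymbol{\alpha}^*)\, d\boldsymbol{\alpha} = \mathrm{Dir}(\mathbf{x} \mid \boldsymbol{\alpha}^*) \tag{2.40}$$

This approximation allows us to perform Bayesian inference of a Dirichlet or beta likelihood in closed form (except for the optimization step in (2.34), which has to be carried out numerically). The next section analyzes the accuracy of the proposed MAP approximation.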
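The numerical optimization step in (2.34) could be sketched as follows, using the Nelder-Mead method via `scipy.optimize.minimize`. This is a minimal sketch and not the reference implementation: the function names, the optimization over $\log\boldsymbol{\alpha}$ (chosen here to keep the parameters positive), the starting point, the tolerances and the example observations are all assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import digamma, gammaln


def boojum_mode(n, v):
    """Numerically locate the Boojum mode alpha* of (2.34) (hypothetical helper)."""
    ratio = np.asarray(v, dtype=float) / n
    # Geometric-mean criterion of Section 2.4: without it no finite mode exists.
    if np.exp(ratio).sum() >= 1.0 - 1e-12:
        raise ValueError("geometric means sum to 1: the Boojum is improper")

    def negative_objective(log_alpha):
        alpha = np.exp(log_alpha)  # optimizing log(alpha) keeps alpha positive
        log_beta = gammaln(alpha).sum() - gammaln(alpha.sum())
        return -(alpha @ ratio - log_beta)

    result = minimize(
        negative_objective,
        x0=np.zeros_like(ratio),  # assumed start: alpha = (1, ..., 1)
        method="Nelder-Mead",
        options={"xatol": 1e-10, "fatol": 1e-12, "maxiter": 5000, "maxfev": 10000},
    )
    return np.exp(result.x)


def stationarity_residual(alpha, n, v):
    """Left-hand side of (2.38); close to zero at the mode."""
    return np.asarray(v, dtype=float) / n - digamma(alpha) + digamma(alpha.sum())


# Illustrative observation set with K = 2 (values are assumptions).
obs = np.array([[0.2, 0.8], [0.4, 0.6]])
n, v = len(obs), np.log(obs).sum(axis=0)
alpha_star = boojum_mode(n, v)
print(alpha_star, stationarity_residual(alpha_star, n, v))
```

Since the objective depends on the observations only through $\mathbf{v}/n$, repeating every observation the same number of times should leave the returned mode unchanged up to optimizer tolerance.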

3 Analysis of the Boojum and the MAP approximation

3.1 Example scenarios

We visualize the Boojum distribution as well as the corresponding predictive distribution with numerical examples. This helps to build an intuitive understanding of the Boojum distribution and of the conditions affecting its convergence (see Section 2.4). In order to be able to display the Boojum and all derived distributions graphically, we choose $K = 2$ dimensions for all examples, which simplifies the Dirichlet to a beta distribution.

Due to conjugacy, the Boojum PDF can be interpreted either as a prior distribution (with the observation set $\mathcal{X}$ representing prior pseudo-observations) or as a posterior distribution (where $\mathcal{X}$ is a combination of both prior pseudo-observations and actual observations). However, the MAP approximation to both the Boojum and the corresponding predictive distribution necessitates the posterior interpretation by definition.

Figure 1: Boojum scenarios 1 through 5, configured with different observation sets $\mathcal{X}$ (see Section 2.3), and the resulting Boojum PDFs and posterior predictive PDFs. Notes: Superscripts on the vectors in the definitions of $\mathcal{X}$ represent the multiplicity of the respective vector in the set. All scenarios have the same axis scaling in order to facilitate comparisons. The Boojum PDF has been normalized numerically. Darker areas indicate higher probability density. The maximum (mode) is calibrated to be black for scenarios 1, 2 and 3, which implies that the gray scales differ across these scenarios and thus cannot be directly compared. The gray scales for the improper scenarios 4 and 5 are arbitrary because the PDFs diverge. A white marker indicates the mode of the Boojum PDF for the proper scenarios 1, 2 and 3. The posterior predictive PDFs marked “Exact” have been obtained by numerical marginalization according to (2.20). We express the mismatch between the exact posterior predictive and the MAP approximation using the Kullback-Leibler divergence, as indicated in the plots. Note that the Kullback-Leibler divergence is not computable for the diverging scenarios 4 and 5.

Figure 1 shows five Boojum distributions, scenarios 1 through 5, configured with different sets of observations $\mathcal{X}$ that define the hyperparameters $n$ and $\mathbf{v}$ as per (2.21). We can make the following observations:

  • Scenario 1: The first scenario shows a Boojum parameterization with two distinct observations. The resulting Boojum PDF has the bulk of its probability density located approximately in the average direction of the two observation vectors. This somewhat surprising observation can be explained as follows: the parameterization of a Boojum prior (or any prior, for that matter) with a set of prior pseudo-observations naturally assigns these observations relatively high probabilities, simply because it has been defined by them. For sufficiently large concentration parameters, Dirichlet distributions have their probability density concentrated in the vicinity of the distribution’s mean, $\boldsymbol{\alpha} / \sum_k \alpha_k$, hence $\boldsymbol{\alpha}$ must point in similar directions as the prior pseudo-observations.

    Given that the probability density is quite spread out in the $\boldsymbol{\alpha}$ plane, the MAP approximation to the exact posterior predictive is rather coarse. This is clearly visible in the plot and can be quantified using the Kullback-Leibler divergence (Kullback and Leibler, 1951), which is also shown in the chart. The wide PDF implies thick tails in the exactly marginalized posterior predictive distribution, a feature that the MAP approximation does not exhibit. (Unfortunately, this fact is not very well visible in the plots due to their limited size.)

  • Scenario 2: This scenario is based on the same observations that define scenario 1, but contains each of them five times. As before, the resulting Boojum probability density is concentrated in the same average direction as the observations it is based on, but the higher number of observations leads to a more concentrated distribution. Note that the mode of scenario 2 is identical to the mode of scenario 1 because a simple increase in multiplicity does not affect the ratio $\mathbf{v}/n$ as per Table 1, which is the only quantity that the mode $\boldsymbol{\alpha}^*$ depends on in (2.34).

    The higher probability density concentration in scenario 2 compared to scenario 1 naturally puts the MAP approximation closer to the true posterior predictive, as shown in the chart and by the smaller Kullback-Leibler divergence.

  • Scenario 3: Here we return to employing only two observations to define the Boojum distribution, but, unlike in scenario 1, the observations are spread further apart. Unsurprisingly, this wider variation leads to the probability density being concentrated around shorter $\boldsymbol{\alpha}$ vectors, which imply less peaked Dirichlet PDFs and therefore assign more posterior predictive probability to a wider range of $\mathbf{x}$ vectors, thus replicating the high variability of the observations used for configuring the Boojum in the first place.

    Similar to scenario 1, we only see a moderately accurate fit between the exact posterior predictive distribution and the MAP approximation, again a consequence of the small number of observations used to define the Boojum, which has led to a low concentration of probability density.

  • Scenario 4: This case demonstrates a violation of convergence condition (c) in Section 2.4 by defining a Boojum distribution using only a single observation. Intuitively, this setup fails to “teach” the Boojum an appropriate measure of variability (or rather, we have indicated a desire for zero variability around the supplied observation). Indeed, the resulting Boojum PDF is improper: the probability density’s peak diverges towards infinity (in the direction of the supplied observation vector), implying a preference for infinitely long $\boldsymbol{\alpha}$ vectors.

    Consequently, both the posterior predictive and its MAP approximation converge to a Dirac delta function located at the observation vector used in the definition of the Boojum distribution. This reflects our failure to configure the Boojum with a non-zero expected variability of observations.

  • Scenario 5: The last scenario defines the Boojum using the same observation as scenario 4 but repeats it ten times. As before, convergence condition (c) is violated, leading to similar conclusions as for scenario 4, but with faster divergence towards infinity caused by the greater number of observations.

3.2 Summary of findings

The numerical case studies analyzed in the previous section lead to the following conclusions for the construction of Boojum distributions using pseudo-observations:

  • In order to satisfy all convergence criteria of Section 2.4, Boojum priors must be constructed with at least two distinct prior pseudo-observations.

  • Encoding prior information about observation variability can be done by choosing prototypical prior pseudo-observation vectors with the desired variability. Note that this conclusion is common to all conjugate priors irrespective of the likelihood, but recalling it helps to build an intuition around the rather novel Boojum distribution.

  • The proposed MAP approximation (2.39) becomes more accurate the larger the number of observations encoded in the distribution (either through prior pseudo-observations or through actual observations).

4 Algorithm

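We present an algorithm that performs closed-form (with the exception of the numerical computation of $\boldsymbol{\alpha}^*$ on line 7), approximate Bayesian inference of Dirichlet and beta likelihoods using the MAP approximation of the Boojum prior.

Algorithm 1 Approximate Bayesian conjugate inference for Dirichlet and beta likelihoods

1: Input: a set $\mathcal{X}_0$ of prior pseudo-observations
2: $n \leftarrow \lvert\mathcal{X}_0\rvert$
3: $\mathbf{v} \leftarrow \sum_{\mathbf{x} \in \mathcal{X}_0} \log\mathbf{x}$    ▷ See (2.11)
4: while new observations $\mathbf{x}$ arrive do
5:     $n \leftarrow n + 1$    ▷ See (2.19a) for $N = 1$
6:     $\mathbf{v} \leftarrow \mathbf{v} + \log\mathbf{x}$    ▷ See (2.19b) for $N = 1$ and (2.11)
7:     $\boldsymbol{\alpha}^* \leftarrow \arg\max_{\boldsymbol{\alpha}} \left[\boldsymbol{\alpha}^\top \mathbf{v}/n - \log B(\boldsymbol{\alpha})\right]$    ▷ See Section 2.5
8:     emit $\boldsymbol{\alpha}^*$    ▷ Emit posterior MAP estimate if required
9:     emit $\mathrm{Dir}(\mathbf{x} \mid \boldsymbol{\alpha}^*)$    ▷ Emit posterior predictive if required, see (2.40)
10: end while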

We supply a reference implementation of the above algorithm in the GitHub repository (Thommen, 2021).
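For orientation, the loop of Algorithm 1 could be sketched in Python as shown below. This is a minimal illustration under stated assumptions and not the reference implementation from the repository: the class name `BoojumMAP`, its method names, the Nelder-Mead settings, the warm start from the previous mode and the example observations are all hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln


class BoojumMAP:
    """Streaming MAP inference in the spirit of Algorithm 1 (hypothetical API)."""

    def __init__(self, prior_pseudo_observations):
        x0 = np.asarray(prior_pseudo_observations, dtype=float)
        self.n = x0.shape[0]                  # line 2: number of pseudo-observations
        self.v = np.log(x0).sum(axis=0)       # line 3: sum of vector-logs
        self.alpha = np.ones(x0.shape[1])     # assumed starting guess for the mode

    def _mode(self):
        ratio = self.v / self.n
        if np.exp(ratio).sum() >= 1.0 - 1e-12:
            raise ValueError("observations are all identical: the Boojum is improper")

        def negative_objective(log_alpha):
            alpha = np.exp(log_alpha)
            log_beta = gammaln(alpha).sum() - gammaln(alpha.sum())
            return -(alpha @ ratio - log_beta)

        result = minimize(
            negative_objective,
            x0=np.log(self.alpha),            # warm start from the previous mode
            method="Nelder-Mead",
            options={"xatol": 1e-10, "fatol": 1e-12, "maxiter": 5000, "maxfev": 10000},
        )
        return np.exp(result.x)

    def update(self, x):
        """Process one new observation and return the posterior MAP estimate."""
        self.n += 1                                             # line 5
        self.v = self.v + np.log(np.asarray(x, dtype=float))    # line 6
        self.alpha = self._mode()                               # line 7
        return self.alpha                                       # line 8

    def predictive_logpdf(self, x):
        """Log of the approximate posterior predictive (2.40) at x (line 9)."""
        x = np.asarray(x, dtype=float)
        log_beta = gammaln(self.alpha).sum() - gammaln(self.alpha.sum())
        return float(((self.alpha - 1.0) * np.log(x)).sum() - log_beta)


# Illustrative run with K = 2; all observation values are assumptions.
model = BoojumMAP([[0.2, 0.8], [0.4, 0.6]])
for x in [[0.3, 0.7], [0.25, 0.75]]:
    alpha = model.update(x)
    print(model.n, alpha, model.predictive_logpdf([0.3, 0.7]))
```

Each update costs one low-dimensional numerical optimization, and the warm start is a design choice intended to keep that step cheap when consecutive modes are close.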

5 Conclusion and outlook

We have derived a closed-form approximation to the Boojum distribution (i.e., to the conjugate prior of the Dirichlet and beta likelihoods), including an exploratory analysis of the distribution and an algorithm to implement the procedure.

Further research should be directed at improving the MAP approximation, e.g., through variational inference, in an attempt to find more accurate approximations that keep the posterior (and, ideally, the posterior predictive distribution) tractable.

6 Data availability statement


Acknowledgments

I want to thank my line manager Giuseppe Nuti for having given me the opportunity to work on this problem and for feedback on drafts of this paper. I also want to thank my colleagues Peter Larkin, Lluís Jiménez-Rugama and Mathias Brucherseifer for valuable feedback.