1 Introduction and prior work
The probabilistic modeling of proportional or compositional data (i.e., vectors of fractions that sum to 1) arises in many fields: in finance, we might be interested in modeling the composition of client portfolios with respect to the asset classes contained therein, and in political science or A/B testing, the probabilistic analysis of voting fractions may be of interest.
Fractional observations can often be modeled using Dirichlet (or beta) likelihoods. To enable computationally efficient Bayesian inference (e.g., for changepoint detection, see Adams and MacKay (2007)), a conjugate prior is desired. Recently, Andreoli (2018) derived the conjugate prior of the Dirichlet distribution, which he named the Boojum distribution. However, the Boojum distribution is intractable, which bars it from many practical applications because resource-intensive Monte Carlo simulations become necessary. We overcome this problem by providing a closed-form approximation that renders the Boojum-Dirichlet conjugate pair tractable.
2 The conjugate prior of the Dirichlet distribution
First, we recall the systematic construction procedure of conjugate priors for exponential family distributions. Then, we apply this general solution to the Dirichlet likelihood in order to find its conjugate prior, the Boojum distribution. Note that this derivation also holds for beta likelihoods by setting the dimensionality of the problem to $K = 2$.
2.1 Conjugate priors of exponential family distributions
Given a likelihood from the exponential family defined by

$$p(\mathbf{x} \mid \boldsymbol{\theta}) = h(\mathbf{x}) \exp\!\left(\boldsymbol{\eta}(\boldsymbol{\theta})^\top \mathbf{T}(\mathbf{x}) - A(\boldsymbol{\theta})\right)$$

for the vector random variable $\mathbf{x}$, with parameter vector $\boldsymbol{\theta}$ and known functions $h$, $\boldsymbol{\eta}$, $\mathbf{T}$ and $A$, there is a conjugate prior of the form

$$p(\boldsymbol{\theta} \mid \boldsymbol{\chi}, \nu) = \frac{1}{Z(\boldsymbol{\chi}, \nu)} \exp\!\left(\boldsymbol{\eta}(\boldsymbol{\theta})^\top \boldsymbol{\chi} - \nu A(\boldsymbol{\theta})\right) \tag{2.2}$$

with $\boldsymbol{\chi}$ and $\nu$ representing prior hyperparameters (Keener, 2010). The normalizing constant $Z(\boldsymbol{\chi}, \nu)$ is given by

$$Z(\boldsymbol{\chi}, \nu) = \int \exp\!\left(\boldsymbol{\eta}(\boldsymbol{\theta})^\top \boldsymbol{\chi} - \nu A(\boldsymbol{\theta})\right) \mathrm{d}\boldsymbol{\theta}.$$
The posterior after having collected $N$ observations $\mathbf{x}_1, \ldots, \mathbf{x}_N$ is, thanks to conjugacy,

$$p(\boldsymbol{\theta} \mid \mathbf{x}_{1:N}, \boldsymbol{\chi}, \nu) = p(\boldsymbol{\theta} \mid \boldsymbol{\chi}', \nu')$$

with updated hyperparameters

$$\boldsymbol{\chi}' = \boldsymbol{\chi} + \sum_{i=1}^{N} \mathbf{T}(\mathbf{x}_i), \qquad \nu' = \nu + N.$$

The prior predictive distribution (or the posterior predictive distribution if the appropriately updated hyperparameters $\boldsymbol{\chi}'$ and $\nu'$ are employed instead) is defined as follows:

$$p(\mathbf{x} \mid \boldsymbol{\chi}, \nu) = \int p(\mathbf{x} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \boldsymbol{\chi}, \nu)\, \mathrm{d}\boldsymbol{\theta} = h(\mathbf{x})\, \frac{Z(\boldsymbol{\chi} + \mathbf{T}(\mathbf{x}), \nu + 1)}{Z(\boldsymbol{\chi}, \nu)}. \tag{2.7}$$
2.2 Application to the Dirichlet distribution
The likelihood function of the Dirichlet distribution parameterized with $\boldsymbol{\alpha} \in \mathbb{R}_{+}^{K}$ for a random variable $\mathbf{x}$ belonging to the $K$-dimensional simplex (i.e., $\sum_{k=1}^{K} x_k = 1$ with $x_k$ representing the $k$-th vector entry) is given by

$$p(\mathbf{x} \mid \boldsymbol{\alpha}) = \frac{1}{B(\boldsymbol{\alpha})} \prod_{k=1}^{K} x_k^{\alpha_k - 1} = \left(\prod_{k=1}^{K} x_k^{-1}\right) \exp\!\left(\boldsymbol{\alpha}^\top \log \mathbf{x} - \log B(\boldsymbol{\alpha})\right)$$

with $B(\boldsymbol{\alpha}) = \prod_{k=1}^{K} \Gamma(\alpha_k) \,/\, \Gamma\!\left(\sum_{k=1}^{K} \alpha_k\right)$ representing the multivariate beta function and where we have introduced the vector-log notation

$$\log \mathbf{x} := (\log x_1, \ldots, \log x_K)^\top. \tag{2.11}$$
This allows us to rewrite the generic prior (2.2) for the Dirichlet distribution, which yields the definition of the Boojum distribution:

$$p(\boldsymbol{\alpha} \mid \boldsymbol{\chi}, \nu) = \frac{1}{Z(\boldsymbol{\chi}, \nu)} \exp\!\left(\boldsymbol{\alpha}^\top \boldsymbol{\chi}\right) B(\boldsymbol{\alpha})^{-\nu} \tag{2.17}$$

with normalizing constant

$$Z(\boldsymbol{\chi}, \nu) = \int_{\mathbb{R}_{+}^{K}} \exp\!\left(\boldsymbol{\alpha}^\top \boldsymbol{\chi}\right) B(\boldsymbol{\alpha})^{-\nu}\, \mathrm{d}\boldsymbol{\alpha}. \tag{2.18}$$
The posterior has the same functional form as (2.17) due to conjugacy, but operates on updated hyperparameters:

$$\boldsymbol{\chi}' = \boldsymbol{\chi} + \sum_{i=1}^{N} \log \mathbf{x}_i, \qquad \nu' = \nu + N.$$
Finally, the predictive distribution (2.7) translates to

$$p(\mathbf{x} \mid \boldsymbol{\chi}, \nu) = \left(\prod_{k=1}^{K} x_k^{-1}\right) \frac{Z(\boldsymbol{\chi} + \log \mathbf{x}, \nu + 1)}{Z(\boldsymbol{\chi}, \nu)}. \tag{2.20}$$
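Although the normalizing constant $Z(\boldsymbol{\chi}, \nu)$ is intractable, the unnormalized Boojum density of (2.17) is straightforward to evaluate numerically. A minimal sketch, assuming SciPy is available (the function names are illustrative, not from the reference implementation):

```python
import numpy as np
from scipy.special import gammaln

def log_multivariate_beta(alpha):
    # log B(alpha) = sum_k log Gamma(alpha_k) - log Gamma(sum_k alpha_k)
    alpha = np.asarray(alpha, dtype=float)
    return gammaln(alpha).sum() - gammaln(alpha.sum())

def boojum_unnormalized_logpdf(alpha, chi, nu):
    # log[ exp(alpha^T chi) * B(alpha)^(-nu) ], i.e. (2.17) without the log Z(chi, nu) term
    alpha = np.asarray(alpha, dtype=float)
    return float(alpha @ np.asarray(chi, dtype=float) - nu * log_multivariate_beta(alpha))
```

Working in log-space avoids overflow of the gamma functions for large $\boldsymbol{\alpha}$.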
2.3 Hyperparameter interpretation
|Hyperparameter|Interpretation / summary statistic|Value|
|$\nu$|Number of prior pseudo-observations|$\nu = N$|
|$\boldsymbol{\chi}$|Sum of vector-logs (see (2.11)) of prior pseudo-observations|$\boldsymbol{\chi} = \sum_{i=1}^{N} \log \mathbf{x}_i$|
Generally, given an arbitrary set of observations

$$\mathcal{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$$

and after dropping the “prior” qualifier and the subscript, we can compute both $\nu$ and $\boldsymbol{\chi}$ according to Table 1. In other words, an observation set $\mathcal{X}$ fully defines a Boojum distribution. In the case of a prior, $\mathcal{X}$ is the set of prior pseudo-observations, and for Boojum posteriors the set comprises both prior pseudo-observations and actual observations.
2.4 Convergence analysis
The integral defining the Boojum distribution’s normalizing constant (2.18) does not converge for all hyperparameter values $\boldsymbol{\chi}$ and $\nu$, as demonstrated by Andreoli (2018). He finds that all of the following conditions must hold:

(a) $\nu > 0$,
(b) $\chi_k < 0$ for all $k = 1, \ldots, K$,
(c) $K = 1 \;\vee\; \sum_{k=1}^{K} \exp(\chi_k / \nu) < 1$.
If we define the Boojum hyperparameters by means of prior pseudo-observations as proposed in Section 2.3, we find that:
(a) is satisfied: $\nu = N \geq 1 > 0$.
(b) is satisfied: $\chi_k = \sum_{i=1}^{N} \log x_{i,k} < 0$ for all $k$, because every simplex entry obeys $0 < x_{i,k} < 1$.
Evaluating condition (c) is more involved. The left-hand side sub-condition, $K = 1$, is clearly violated given that $K \geq 2$, so we must evaluate the right-hand side:

$$\sum_{k=1}^{K} \exp\!\left(\frac{\chi_k}{\nu}\right) = \sum_{k=1}^{K} \exp\!\left(\frac{1}{N} \sum_{i=1}^{N} \log x_{i,k}\right) = \sum_{k=1}^{K} \left(\prod_{i=1}^{N} x_{i,k}\right)^{1/N} < 1. \tag{2.25}$$

Note that the summands $\left(\prod_{i=1}^{N} x_{i,k}\right)^{1/N}$ are the geometric means of the $k$-th entries across the $N$ observation vectors used to define the Boojum distribution. In other words, the right-hand side of (2.25) equals the sum over the components of the “geometric mean vector” of the prior pseudo-observations.
For any set of positive numbers, the geometric mean is less than or equal to the arithmetic mean, hence we can write (2.25) as

$$\sum_{k=1}^{K} \left(\prod_{i=1}^{N} x_{i,k}\right)^{1/N} \leq \sum_{k=1}^{K} \frac{1}{N} \sum_{i=1}^{N} x_{i,k} = \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} x_{i,k} = \frac{1}{N} \sum_{i=1}^{N} 1 = 1. \tag{2.27}$$
Because the geometric mean of a set of positive numbers only equals the arithmetic mean if the set is composed of identical numbers, we can conclude the following:
If $N = 1$, or if $N \geq 2$ and all $\mathbf{x}_i$ are identical, the geometric and arithmetic means are equal, thus rendering (2.27) an equality. Hence, in this case, condition (c) is not satisfied.
Otherwise, i.e., if $N \geq 2$ and if not all observations are identical (i.e., when $\mathbf{x}_i \neq \mathbf{x}_j$ for some $i \neq j$), then the arithmetic mean dominates its geometric counterpart. Hence, in this case, the left-hand side of (2.27) is strictly less than the right-hand side, and it follows that condition (c) is satisfied.
To summarize, the Boojum distribution is proper only if we define it using two or more prior pseudo-observations that are not all identical. Section 3 will present visualizations of the Boojum distribution that demonstrate this finding graphically.
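This propriety check is easy to automate. A sketch under the Table 1 parameterization (the function name and tolerance are illustrative):

```python
import numpy as np

def boojum_is_proper(observations, tol=1e-12):
    """Check convergence condition (c): the component-wise geometric means
    of the observations must sum to strictly less than 1, which requires
    at least two observations that are not all identical."""
    X = np.asarray(observations, dtype=float)
    geo_means = np.exp(np.log(X).mean(axis=0))   # geometric mean of each component
    return bool(X.shape[0] >= 2 and geo_means.sum() < 1.0 - tol)
```

The tolerance guards against floating-point noise when the geometric-mean sum is exactly 1 in exact arithmetic (the identical-observation case).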
2.5 MAP approximation
The Boojum’s normalizing constant (2.18) cannot be computed analytically. This renders the distribution itself intractable, as well as derived distributions such as the predictive distribution (2.20). In order to avoid inefficient Monte Carlo simulations, we seek a closed-form approximation to the Boojum distribution. We propose the maximum a posteriori (MAP) method (Murphy, 2012), which approximates the posterior probability density function (PDF) by a Dirac delta function located at its mode. After dropping the normalizing constant in (2.17), we can write the Boojum’s mode as

$$\hat{\boldsymbol{\alpha}} = \arg\max_{\boldsymbol{\alpha}} \left[\boldsymbol{\alpha}^\top \boldsymbol{\chi} - \nu \log B(\boldsymbol{\alpha})\right] = \arg\max_{\boldsymbol{\alpha}} \left[\boldsymbol{\alpha}^\top \frac{\boldsymbol{\chi}}{\nu} - \log B(\boldsymbol{\alpha})\right], \tag{2.34}$$
which shows that $\hat{\boldsymbol{\alpha}}$ is effectively a function of a single (vector) variable only, namely $\boldsymbol{\chi}/\nu$. Setting the derivative with respect to $\boldsymbol{\alpha}$ of the argument in (2.34) to zero yields

$$\psi(\alpha_k) - \psi\!\left(\sum_{j=1}^{K} \alpha_j\right) = \frac{\chi_k}{\nu}, \qquad k = 1, \ldots, K, \tag{2.38}$$

where $\psi$ is the digamma function, $\psi(z) = \frac{\mathrm{d}}{\mathrm{d}z} \log \Gamma(z)$, with $\Gamma$ the gamma function. This set of $K$ mutually dependent equations lacks an analytic solution for $\boldsymbol{\alpha}$, so we must resort to numerical methods. To this end, either gradient ascent methods operating on the derivative (2.38) (e.g., the Adam optimizer (Kingma and Ba, 2017), which is popular for its good performance in the neural network realm) or direct optimization methods (e.g., the Nelder-Mead method (Nelder and Mead, 1965)) can be employed.
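The direct route can be prototyped in a few lines. The sketch below uses SciPy’s Nelder-Mead implementation; the log-space reparameterization (which keeps $\boldsymbol{\alpha}$ positive without constraints) and the function names are our own illustrative choices, not the paper’s reference implementation:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln, digamma

def boojum_mode(chi, nu):
    """Numerically maximize alpha^T chi - nu * log B(alpha) over alpha > 0, cf. (2.34)."""
    chi = np.asarray(chi, dtype=float)
    def neg_objective(log_alpha):           # optimize in log-space so that alpha > 0
        alpha = np.exp(log_alpha)
        log_B = gammaln(alpha).sum() - gammaln(alpha.sum())
        return -(alpha @ chi - nu * log_B)
    res = minimize(neg_objective, np.zeros(chi.size), method="Nelder-Mead",
                   options={"xatol": 1e-10, "fatol": 1e-10, "maxiter": 10_000})
    return np.exp(res.x)

def stationarity_residual(alpha_hat, chi, nu):
    # (2.38): psi(alpha_k) - psi(sum_j alpha_j) should equal chi_k / nu at the mode
    return digamma(alpha_hat) - digamma(np.sum(alpha_hat)) - np.asarray(chi) / nu
```

The residual helper verifies that a candidate mode satisfies the stationarity condition (2.38) up to numerical precision.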
Once $\hat{\boldsymbol{\alpha}}$ is determined, the MAP approximation of the Boojum posterior becomes, by definition,

$$p(\boldsymbol{\alpha} \mid \boldsymbol{\chi}, \nu) \approx \delta(\boldsymbol{\alpha} - \hat{\boldsymbol{\alpha}}), \tag{2.39}$$

where $\delta$ is the Dirac delta function located at its argument. Consequently, the posterior predictive distribution (2.20) simplifies to the likelihood evaluated at the mode $\hat{\boldsymbol{\alpha}}$, thus rendering it tractable (a welcome side-effect of the MAP approximation):

$$p(\mathbf{x} \mid \boldsymbol{\chi}, \nu) \approx \int p(\mathbf{x} \mid \boldsymbol{\alpha})\, \delta(\boldsymbol{\alpha} - \hat{\boldsymbol{\alpha}})\, \mathrm{d}\boldsymbol{\alpha} = p(\mathbf{x} \mid \hat{\boldsymbol{\alpha}}).$$
This approximation will allow us to perform Bayesian inference of a Dirichlet or beta likelihood in closed form (except for the optimization step in (2.34), which has to be carried out numerically). The next section will analyze the accuracy of the proposed MAP approximation.
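As a concrete illustration of this tractable predictive, a minimal sketch assuming SciPy (the function name is illustrative, and `alpha_hat` stands for a mode obtained numerically as described above):

```python
import numpy as np
from scipy.stats import dirichlet

def map_posterior_predictive(x, alpha_hat):
    # Under the MAP approximation, the posterior predictive density is simply the
    # Dirichlet likelihood evaluated at the posterior mode alpha_hat.
    return dirichlet.pdf(np.asarray(x), np.asarray(alpha_hat))
```

For $K = 2$ this is the beta density in disguise, e.g. $\mathrm{Dir}(\mathbf{x} \mid (2, 3))$ at $\mathbf{x} = (0.3, 0.7)$ evaluates to $12 \cdot 0.3 \cdot 0.7^2 = 1.764$.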
3 Analysis of the Boojum and the MAP approximation
3.1 Example scenarios
We visualize the Boojum distribution as well as the corresponding predictive distribution with numerical examples. This will help in gaining an intuitive understanding of the Boojum distribution and the conditions affecting its convergence properties (see Section 2.4). In order to be able to display the Boojum and all derived distributions graphically, we choose $K = 2$ dimensions for all examples, which simplifies the Dirichlet to a beta distribution.
Due to conjugacy, the Boojum PDF can be interpreted either as a prior distribution (with the observation set $\mathcal{X}$ representing prior pseudo-observations) or as a posterior distribution (where $\mathcal{X}$ is a combination of both prior pseudo-observations and actual observations). However, the MAP approximation to both the Boojum and the corresponding predictive distribution necessitates the posterior interpretation by definition.
Scenario 1: The first scenario shows a Boojum parameterization with two distinct observations. The resulting Boojum PDF has the bulk of its probability density located approximately in the average direction of the two observation vectors. This somewhat surprising observation can be explained as follows: the parameterization of a Boojum prior (or any prior, for that matter) with a set of prior pseudo-observations naturally assigns these observations relatively high probabilities, simply because it has been defined by them. For large $\sum_k \alpha_k$, Dirichlet distributions have their probability density concentrated in the vicinity of the distribution’s mean, $\boldsymbol{\alpha} / \sum_k \alpha_k$, hence $\boldsymbol{\alpha}$ must point in similar directions as the prior pseudo-observations.
Given that the probability density is quite spread out in the plane, the MAP approximation to the exact posterior predictive is rather coarse. This is clearly visible in the plot and can be quantified using the Kullback-Leibler divergence (Kullback and Leibler, 1951), which is also shown in the chart. The wide PDF implies thick tails in the exactly marginalized posterior predictive distribution, a feature that the MAP approximation does not exhibit. (Unfortunately, this fact is not very well visible in the plots due to their limited size.)
Scenario 2: This scenario is based on the same observations that define Scenario 1, but contains each of them five times. As before, the resulting Boojum probability density is concentrated in the same average direction as the observations it is based on, but the higher number of observations leads to a more concentrated distribution. Note that the mode of Scenario 2 is identical to the mode of Scenario 1 because a simple increase in multiplicity does not affect the ratio $\boldsymbol{\chi}/\nu$ as per Table 1, which is effectively the only variable in the mode definition (2.34).
The higher concentration of probability density in Scenario 2 compared to Scenario 1 naturally puts the MAP approximation closer to the true posterior predictive, as shown in the chart and by the smaller Kullback-Leibler divergence.
Scenario 3: Here we return to employing only two observations to define the Boojum distribution, but, unlike in Scenario 1, the observations are spread further apart. Unsurprisingly, this wider variation leads to the probability density being concentrated around shorter vectors $\boldsymbol{\alpha}$, which imply less peaked Dirichlet PDFs and therefore assign posterior predictive probability to a wider range of vectors, thus replicating the high variability of the observations used for configuring the Boojum in the first place.
Similar to Scenario 1, we only see a moderately accurate fit between the exact posterior predictive distribution and the MAP approximation, again a consequence of the small number of observations used to define the Boojum, which has led to a low concentration of probability density.
Scenario 4: This case demonstrates a violation of convergence condition (c) in Section 2.4 by defining a Boojum distribution using only a single observation. Intuitively, this setup fails to “teach” the Boojum an appropriate measure of variability (or rather, we have indicated a desire for zero variability around the supplied observation). Indeed, the resulting Boojum PDF is improper: the probability density’s peak diverges towards infinity (in the direction of the supplied observation vector), implying a preference for infinitely large $\boldsymbol{\alpha}$.
Consequently, both the posterior predictive and its MAP approximation converge to a Dirac delta function located at the observation vector used in the definition of the Boojum distribution. This reflects our failure to configure the Boojum with a non-zero expected variability of observations.
Scenario 5: The last scenario defines the Boojum using the same observation as Scenario 4 but repeats it ten times. As before, convergence condition (c) is violated, leading to similar conclusions as for Scenario 4, but with faster divergence towards infinity caused by the greater number of observations.
3.2 Summary of findings
The numerical case studies analyzed in the previous section lead to the following conclusions for the construction of Boojum distributions using pseudo-observations:
- In order to satisfy all convergence criteria of Section 2.4, Boojum priors must be constructed with at least two distinct prior pseudo-observations.
- Encoding prior information about observation variability can be done by choosing prototypical prior pseudo-observation vectors with the desired variability. Note that this conclusion is common to all conjugate priors irrespective of the likelihood, but recalling it helps to build an intuition around the rather novel Boojum distribution.
- The proposed MAP approximation (2.39) becomes more accurate the larger the number of observations encoded in the distribution (either through prior pseudo-observations or through actual observations).
4 Algorithm

We present an algorithm that performs closed-form³ approximate Bayesian inference of Dirichlet and beta likelihoods using the MAP approximation of the Boojum prior. (³ With the exception of the numerical computation of $\hat{\boldsymbol{\alpha}}$ on line 7.)
We supply a reference implementation of the above algorithm in the GitHub repository (Thommen, 2021).
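For concreteness, the steps above can be sketched end-to-end as follows. This is an illustrative sketch with made-up observations, not the reference implementation from the repository:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln
from scipy.stats import dirichlet

# 1) Define the Boojum via (pseudo-)observations (hypothetical data on the 2-simplex).
observations = np.array([[0.2, 0.8], [0.5, 0.5], [0.4, 0.6]])

# 2) Hyperparameters as per Table 1.
nu = observations.shape[0]                 # number of observations
chi = np.log(observations).sum(axis=0)     # sum of vector-logs

# 3) Numerically locate the mode (2.34), optimizing in log-space for positivity.
def neg_log_boojum(log_alpha):
    alpha = np.exp(log_alpha)
    log_B = gammaln(alpha).sum() - gammaln(alpha.sum())
    return -(alpha @ chi - nu * log_B)

alpha_hat = np.exp(minimize(neg_log_boojum, np.zeros(2), method="Nelder-Mead").x)

# 4) MAP posterior predictive: the Dirichlet likelihood evaluated at the mode.
density = dirichlet.pdf([0.35, 0.65], alpha_hat)
```

Note that the three observations are distinct, so the propriety conditions of Section 2.4 hold and the mode is finite.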
5 Conclusion and outlook
We have derived a closed-form approximation to the Boojum distribution (i.e., to the conjugate prior of the Dirichlet and beta likelihoods), including an exploratory analysis of the distribution and an algorithm to implement the procedure.
Further research should be directed at improving the MAP approximation, e.g. through variational inference, in an attempt to find more accurate posterior approximations that render it (and, ideally, the posterior predictive distribution) tractable.
6 Data availability statement
Acknowledgements. I want to thank my line manager Giuseppe Nuti for having given me the opportunity to work on this problem and for feedback on drafts of this paper. I also want to thank my colleagues Peter Larkin, Lluís Jiménez-Rugama and Mathias Brucherseifer for valuable feedback.
- Adams and MacKay (2007): Adams, Ryan P.; MacKay, David J. C.: Bayesian Online Changepoint Detection. 2007.
- Andreoli (2018): Andreoli, Jean-Marc: A conjugate prior for the Dirichlet distribution. In: CoRR abs/1811.05266 (2018). URL: http://arxiv.org/abs/1811.05266.
- Keener (2010): Keener, R. W.: Theoretical Statistics: Topics for a Core Course. Springer New York, 2010 (Springer Texts in Statistics). URL: https://books.google.co.in/books?id=aVJmcega44cC. ISBN 9780387938394.
- Kingma and Ba (2017): Kingma, Diederik P.; Ba, Jimmy: Adam: A Method for Stochastic Optimization. 2017.
- Kullback and Leibler (1951): Kullback, S.; Leibler, R. A.: On Information and Sufficiency. In: Ann. Math. Statist. 22 (1951), No. 1, pp. 79–86.
- Murphy (2012): Murphy, Kevin P.: Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
- Nelder and Mead (1965): Nelder, J. A.; Mead, R.: A Simplex Method for Function Minimization. In: Computer Journal 7 (1965), pp. 308–313.
- Thommen (2021): Thommen, Kaspar: Conjugate prior of Dirichlet and beta. https://github.com/UBS-IB/conjugate-prior-of-dirichlet-and-beta. 2021.