This work is motivated by the goal to conduct Bayesian inference for multiple proportions , , which are possibly correlated. In particular, the goal is to derive a multidimensional credible region for taking into account the dependency structure. The author’s main application of interest are model evaluation and comparison studies where the accuracy of
binary classifiers is assessed on the same dataset. Previous work in the machine learning context has shown that evaluating multiple promising models on the final test data (instead of a single prespecified model) can improve the final model performance and statistical power in the evaluation study. Hereby, it is beneficial to take into account model similarity when adjusting for multiplicity as the according adjustment needs to be less strict when different models give similar predictions(EOMPM1; IMS).
In these previous works, the focus was on frequentist methods, in particular a multivariate normal approximation for the vector of test statistics in conjunction with the so called maxT-approach (projection method)(hothorn2008). From a Bayesian viewpoint, this can (at least numerically) be seen as a multivariate normal-normal model under the assumption of a flat (improper) prior distribution. This approach showed good performance in terms of family-wise error rate control in extensive simulation studies. It has however three drawbacks. Firstly, the resulting confidence region is not guaranteed to lie entirely in the parameter space
. Secondly, the needed estimate of the covariance matrix may be singular, which is e.g. the case if any observed proportionis zero or one. This scenario becomes increasingly likely when . Finally, for (close to) least favourable parameter configurations, this approach leads to an increased type 1 error rate in the frequentist sense. While many ad hoc remedies exist for these problems (e.g. parameter transformation, shrinkage estimation), the main goal of this work to derive a more self-contained model. Moreover, it may be desirable to include prior knowledge to the inference task which calls for a Bayesian treatment of the problem.
For a single proportion , a common Bayesian approach is the Beta-binomial model where each observation is a Bernoulli variable such that
Assuming a Beta prior distribution with shape parameters for , i.e.
leads to the posterior distribution
given successes have been observed (ASI, pp. 172-173).
The goal of this work is twofold. Firstly, the Beta distribution shall be generalized to higher dimensions allowing general correlation structures. A multivariate generalization of the Beta distribution has been studied in several works (jones2002; olkin2003; nadarajah2005; arnold2011), mostly however limited to the bivariate case. olkin2015 discuss shortcomings of some of these approaches such as a restricted range of possible correlations or a complicated extension to higher dimensions. Another general access to this problem the separation of marginal distributions and dependency structure via copula models (CMD; CBD; nyaga2017). The second aim is to then derive a multivariate Beta-binomial model which allows to conduct Bayesian inference regarding or transformations thereof.
To this end, the bivariate Beta distribution by olkin2015 based on an underlying Dirichlet distribution is extended to higher dimensions. This was already adumbrated in section 3 of the original article. While this construction works in theory for any dimension , it suffers from the fact that it depends on parameters. This may serve problems, in particular when posterior samples consisting of a large number of observations of (initially) variables need to be drawn. In practice, it is thus only feasible for dimensions not much larger than (depending on computational resources). However, it will be shown that a reduced parametrisation with parameters allows to handle much higher dimensions with reasonable computational effort by employing a copula model.
The construction shown in this article has methodological similarities to existing work in the context of multivariate Bernoulli or Binomial distributions with general correlation structures(madsen1993; kadane2016; fontana2018). In particular, the question which correlation structures are admissible for an -dimensional Beta distribution can directly be transferred to the same question regarding an
-dimensional Bernoulli distribution(hailperin1965; chaganty2006). Similar considerations have been made regarding the question how to generate correlated binary data (leisch1998; xue2010; preisser2014; shults2017). The distinctive feature of this work is thus the different (Bayesian) setting and the focus on statistical inference. Moreover, to the best of the author’s knowledge, the necessary and sufficient moment conditions that are provided in section 2.1
have not been mentioned in this form in the literature despite the fact that many authors have recognized the connection to linear programming(madsen1993; fontana2018; shults2017).
The remainder of this work is structured as follows: In the section 2, the construction and several properties of the multivariate Beta distribution as well as restrictions on the admissible correlation structures are described. Moreover, a multivariate analogue to the common Beta-binomial model is introduced. In section 3, different methods for the derivation of multivariate credible regions are described. Details regarding the numerical implementation are given in section 4. Section 5 covers numerical simulations in the context of model evaluation and comparison studies to access the properties of different credible regions. Finally, section 6 contains a summary of the present work and a discussion on the connection between multivariate credible regions and Bayesian hypothesis testing.
2 Statistical Model and Theoretical Results
2.1 Multivariate Beta Distribution
Our goal is a joint model for the success probabilities of Binomial variables , , with arbitrary correlation structure between the variables . Conditional on the observed data , we assume that marginally
as introduced in equation (3). The variables are the sum of Bernoulli variables , , which are also observed.
The subsequent construction of an -dimensional Beta distribution is based on a -dimensional Dirichlet distribution such as proposed by olkin2015
for the bivariate case. The Dirichlet distribution is frequently employed in the so called Dirichlet-multinomial model for the success probabilities of multinomial data. A multinomial random variable is the generalization of a Binomial random variable, i.e. each observation is one ofdistinct events where each event has a probability of to occur, such that . A Dirichlet random variable has support and is fully characterized by the concentration parameter (vector) . A comprehensive overview of the Dirichlet distribution is given by DIR. In the following, an -dimensional random variable with multivariate Beta distribution will be constructed from a
-dimensional Dirichlet random variable. We will see that this can be achieved by a convenient parametrisation and a simple linear transformation. Although the casecan easily be recovered, is assumed in the following to avoid laborious case distinctions.
A single binary observation is assumed to be a realization of an -dimensional random variable . The complete experimental data, i.i.d. observations of , is collected in the rows of the binary matrix . We define a categorical random variable , , which is linked to via the mapping which is defined in the following.
Definition 1 (Transformation matrix).
Define the linear mapping , , whereby the -th column of the transformation matrix corresponds to the binary representation (of length ) of the integer , .
This definition uniquely defines for any dimension . For instance, for ,
It is easy to see that and that . In effect, defines a bijection between and which is illustrated below for :
Hence, the function defines a one-to-one correspondence between observing (a) correlated Bernoulli variables
and (b) a single categorical variablewith possible events. The same link may be used to relate (a) correlated Binomial variables and (b) a single multinomial variable . A realization of is referred to as the cell count version of the experimental data. It can easily be computed as the sum of all rows of the matrix . Clearly, the probabilities for the distinct events , , can be modelled via the Dirichlet distribution as . This allows us to define the random variable and investigate it’s properties.
Definition 2 (Multivariate Beta (mBeta) distribution).
Let , and . Let follow the Dirichlet distribution with concentration parameter . Define the linear transform of whereby the transformation matrix is defined in definition 1. In this case is said to follow a multivariate Beta distribution with concentration parameter or for short.
Proposition 1 (Properties of the mBeta distribution).
Let as defined in definition 2. Then the following assertions hold:
is a -dimensional random variable with support .
For any , marginally has a distribution with parameters and whereby .
The mean vector of is given by .
Define and . Then and the covariance of is given by
Probabilities of products with have a distribution with
Hereby, is the Hadamard product of associated rows of .
Most of the above claims follow immediately from the definition of . Further details are provided in appendix A. The symmetric matrix contains the (scaled) first-order moments and mixed second-order moments as off-diagonal elements and will prove to be useful later. olkin2015 state that the density function of does not have a closed form expression and show several representations for different subregions of the unit square in the bivariate setting. The next definition will allow a characterization of the correlation structures that are admissible for a multivariate Beta distribution.
Definition 3 (Moment conditions).
Let and be a symmetric matrix. Define ,
with . Hereby is the -th row of and the Hadamard (entrywise) product of vectors. In addition, let whereby contains the upper off-diagonal elements of . Then the pair is said to satisfy the moment conditions if
Note that the binary matrix only depends on the dimension and not on
. By imputing suitable vectorsin (MC) it is easy to see that the moment bounds
whereby both inequalities and min and max operations are meant component-wise. Hereby is the matrix with all rows identical and equal to and . The derived conditions (MB) and (FB) or variations thereof have appeared several times in the relevant literature (leisch1998; shults2017).
Proposition 2 (mBeta parametrisation).
(Existence) Let , and a valid (symmetric, positive-definite) correlation matrix with . Then, there exists a vector , , with and a random variable such that
if and only if and the derived moment matrix
satisfy (MC). Hereby, .
(Uniqueness) Given parameters and fulfilling as in (1), the parameter can be uniquely determined if and only if . For , uniqueness can be achieved by imposing additional constraints, e.g. by minimization of .
The first result can be proven by applying Farkas’ Lemma, a standard result from linear programming, to the linear program
which needs to be solved for . Details are given in appendix A.
The moment conditions give some intuition on the admissible correlation structures, in particular by means of the weaker but more interpretable necessary conditions (MB) and (FB). However, a direct verification of (MC) is usually not feasible in practice, at least not more efficiently than attempting to solve (LP). A more practical approach to translate a correlation into a moment description is described in section 4.
The bounds on the derived moment matrix induced by (MC) imply bounds on the correlation matrix because the elements of are monotone in the according elements of . The construction and the according conditions translate to a multivariate Bernoulli distribution with minor modifications. In this context, several works have illustrated the bounds on the correlation coefficients for low dimensions (prentice1988; preisser2014; shults2017). For instance, for , (FB) implies
with , . These necessary correlation bounds also apply to the case for all but are only sufficient for being admissible for . It should be noted, that the overall concentration parameter
only drives the variances of the. That is to say, two mBeta distributions induced by the parameters and with have the same correlation structure. Below, several simple results concerning (MC) are described, some of which will be utilized in the next section.
2.2 Multivariate Beta-Binomial Model
The next result formalizes that the update rule from the well-studied Dirichlet-multinomial model can be adopted for the multivariate generalization of the Beta-binomial model. Hereby, observed cell counts are added to corresponding prior parameters .
Proposition 4 (Multivariate Beta-binomial model).
Let be the prior distribution for and let be the observed data matrix with cell count representation . Then:
The posterior distribution of is given by
Let and be the parameter of prior and posterior distribution of , respectively. Let and . Then
whereby the update matrix depends on the data due to .
The second result is useful as it allows to work with a reduced parametrisation from which the mean vector and covariance matrix of the distribution can still be derived, see proposition 1. It depends on and the symmetric matrix and thus requires parameters. Hereby, neither the prior parameter nor the posterior parameter are needed to derive the posterior matrix from the prior matrix and the data . The term ’reduced parametrisation’ will be used when and are known but is unknown.
When the reduced parametrisation is employed, as a consequence of proposition 3, the only worry is to correctly specify a prior parameter , either directly or implicitly (via ). As previously stated, checking (MC) for the prior distribution may not always be feasible, especially in high dimensions. For certain priors with simple structure, this is however easily possible. In particular, mBeta distributions with , and , i.e. independent distributions are always admissible. One can simply check that one possible parameter vector to obtain these properties is . The case
corresponds to independent uniform distributions overwhich will be employed as a vague prior in the simulation study in section 5. Another simple and practically relevant way to ensure the validity of is to construct it based on previous experimental data via , see proposition 3. Figure 1 illustrates the update rule by visualizing a prior and posterior distribution in the three dimensional case. The underlying numerical example is provided in appendix B.
3 Bayesian Inference
3.1 Construction of Credible Regions
Assume that the prior distribution has been updated to the posterior distribution with (proposition 4). Deriving a simultaneous credible region for all proportions of interest is a typical data analysis goal. A % credible region is a set with the property (SDT). For simplicity, only equi-tailed, two-sided credible regions are considered in this work. In the normal posterior case, this corresponds to highest density regions (HDR) while for the mBeta distribution this is in general not true.
There are several ways to construct credible regions for . One possibility is to approximate the posterior distribution by a multivariate normal by matching the first two moments and use established methods for this case. The normal approximation of this posterior distribution is then given by
whereby and are derived according to proposition 1 from . From this, a simultaneous credible region
can be derived. Hereby, the posterior variance vector contains the diagonal elements of . The ’critical value’
can be computed numerically as a suitable equi-tailed quantile of the standard multivariate normal with correlation matrix(hothorn2008). Note that such approximate credible regions are not guaranteed to lie in . Moreover, this approach does not benefit from the fact that the type of the marginal posterior distributions is known and non-Gaussian.
This can be alleviated by employing a copula approach (nadarajah2018). A copula model allows to disentangle marginal distributions and dependency structure of a multivariate random variable . More specifically, Sklar’s theorem states that for every random vector
with joint cumulative distribution function (CDF)there exists a copula function such that
Furthermore, the copula is unique if all marginal distributions are continuous (sklar1959; nadarajah2018). For instance, a Gaussian copula may be utilized which is parametrized via a correlation matrix and given by
Hereby is the univariate standard normal CDF and is the -dimensional normal CDF with mean and covariance matrix . When modelling the posterior distribution , an obvious choice for is , the posterior correlation matrix which can be obtained by standardizing the posterior covariance (see proposition 1). Following similar arguments as given by dickhaus2012, this can be used to construct a simultaneous credible region. For this, the same critical value as for the normal approximation is used and translated to adjusted local tail probabilities . The credible region is then based on the and quantiles of the marginal Beta distributions of the the joint mBeta posterior.
Lastly, we may base our inference regarding on , the underlying Dirichlet-multinomial model (section 2). To pursue this route, a posterior sample of size can be drawn from the underlying Distribution which is then transformed to a posterior sample for , compare section 2.1. Denote by
the credible region such that is satisfied for all margins , meaning that , are suitable marginal Beta quantiles. Subsequently, can be tuned such that data points of are contained in . This can be achieved by a simple numerical root finding. While normal approximation and copula model only require knowledge of the reduced representation , this extensive posterior sampling approach is only feasible when the complete parameter vector is known. It is thus the only of the three approaches which employs all available information - if it is indeed available. In high dimensions it is however computationally expensive or even infeasible as the original Dirichlet sample is of size .
Note that all three approaches to construct credible regions are Bayes actions in the sense that they are based on (different approximations of) the posterior expected coverage probability. In section 5, the influence of these approximations on the Bayes coverage probability, i.e. the expected coverage under different generative prior distributions, will be assessed in a simulation study.
3.2 Inference for Transformed Parameters
Besides inference for the proportions themselves, transformations of them might also be of interest. The three approaches described in the last section (normal approximation, copula model, extensive sampling approach) can be modified for this purpose. A commonly investigated case are linear contrasts defined by a contrast matrix where is the dimension of the target space. A popular example are all-vs-one comparisons (w.l.o.g.) defined by . Hereby is the
-dimensional identity matrix and. In the model evaluation context, this would relate to comparing the accuracy of all models against , the accuracy of the -th model.
For the normal approximation and the copula method, the fact that implies
can be utilized. For the copula approach, it is important to note, that the difference of two Beta random variables no longer follows a Beta distribution (gupta2004). One solution to obtain correct marginal quantiles is to rely on posterior sampling for this case as well. Non-linear transformations will not be investigated in this work, could however be tackled by employing the multivariate delta method. For the extensive sampling approach, any transformation can be applied to the posterior sample. The transformed sample can then be processed by the same means as before.
4 Numerical Implementation
Functions for prior definition, update rules and calculation of credible regions have been implemented in the new R package SIMPle111A development version of the SIMPle package is available at https://github.com/maxwestphal/SIMPle (accessed November 6, 2019).. It’s main goal is to conduct simultaneous inference for multiple proportions by means of the proposed multivariate Beta-binomial model. It allows the definition of a prior distribution based on the mean vector and correlation matrix. Instead of solving the obvious linear system (LP), the least squares problem with equality and inequality constraints
is solved for a given input with help of the lsei package (SLSP; lsei). That is to say, we require the first-order moments to be matched exactly and the mixed second-order moments should be fitted as closely as possible. The moment matrix can be specified explicitly or implicitly via mean vector and correlation matrix , compare proposition 2. If a solution of (LS) is found, it defines a valid mBeta distribution. If the solution is exact, i.e. , it defines an mBeta distribution with exactly the targeted mean and correlation structure. If , the solution defines a valid mBeta distribution with targeted mean but only approximated correlation structure.
For dimensions the reduced parametrisation in conjunction with the copula approach described in section 3 is employed by default because solving (LS) or (LP) becomes numerically expensive. In this case only the necessary bounds (FB) are checked. As a result, the prior matrix is in general not guaranteed to satisfy (MC), unless simplifying structural assumptions are made.
5 Simulation Study: Comparison of Multiple Classifiers
5.1 Method Comparison
This section covers the results of a simulation study in the context of classifier evaluation. In machine learning, prediction models should be trained and evaluated on independent data sets to avoid an overoptimistic performance assessment (ESL; ELA; APM). Earlier work has shown that a simultaneous evaluation of multiple promising classifiers is beneficial as the test data can then be employed for the final model selection. (EOMPM1; IMS). Hereby, an adequate adjustment for the introduced selection-induced bias needs to be employed.
The goal of this simulation study is to compare the properties of different credible regions which have been outlined in section 3:
approximate: normal approximation of posterior distribution
copula: exact posterior marginals, copula model for dependency structure
extensive: based on a posterior sample of size , drawn from the underlying Dirichlet distribution
Our primary interest is to assess the Bayes coverage probability of these credible regions, i.e. the expected coverage probability
when parameters arise from different generative prior distributions . The BCP definition is inspired by the standard Bayes risk definition (SDT, p. 11). Each credible region depends on the employed approach and on the analysis prior . We investigate two cases here: (a) , i.e. the true generative prior is known, and (b) a vague prior is used. The latter case will implemented as independent uniform variables, corresponding to the parameter which was discussed at the end of section 2.2. Because all approaches are constructed as (approximate) Bayes actions, we expect a close to when (a) or (b) the sample size is large.
The employed generative prior distributions are characterized by dimension , concentration parameter , mean vector and correlation matrix . For a given scenario the according parameter vector is obtained by solving (LS). The following cases are investigated: for , we consider means of for all models with a concentration parameter of or and a equicorrelation of or . For , we define two blocks of five models as above. The prior mean is given as
. This is supposed to mimic the case that two learning algorithms with different hyperparameters are investigated whereby one of them (averaged over the hyperparameters) yields classifiers with higher accuracy. The correlation between models of different algorithms is defined to besuch that the overall correlation matrix is a block matrix consisting of blocks.
For each simulation run, the underlying parameter vector is drawn from a distribution. The experimental data is then drawn from a dimensional multinomial distribution with parameters and then transformed to a multivariate Binomial distribution with the same parameters via , compare section 2.1. We have investigated the sample sizes .
As the BCP is a proportion, the standard error of its simulated estimate is bounded from above bywhich is approximately 0.001 in the overall analysis (figures 2 and 3; ) and 0.002 in the stratified analysis (appendix C; ). The three investigated methods are applied to the same simulated datasets. The target coverage probability is set to for all simulations. The numerical experiments were conducted in R with help of the batchtools package (batchtools). Software and custom functions that were used to conduct the simulation study are partially publicly available.222See https://github.com/maxwestphal/SEPM.PUB (accessed July 07, 2019).
Figure 2 shows the Bayes coverage probability of the different credible regions for the raw proportions . The results are only stratified for the number of proportions and whether the correct or a vague prior is used for the analysis. In effect, each simulated BCP in figure 2 is the average over all four scenarios (, ) and is thus comprised of simulation runs.
If the analysis prior corresponds to the true generative prior distribution, all methods have close to target coverage level for . As the dimension increases to , the BCP deviates more from the target level . If the vague analysis prior is employed, the normal approximation clearly performs worse compared to the copula and the extensive approach, in particular for low sample sizes. More detailed results in appendix C suggest that the normal approximation becomes worse not only as increases but also as the concentration or the correlation decrease. This is plausible as in both cases parameter values near the boundaries of the unit interval become more likely which negatively inflects the quality of the normal approximation.
Figure 3 shows similar investigations for credible regions for differences of parameters , . Overall, the picture is similar to the previous analysis. The deviations from the target coverage probability are larger for small (compared to large ) and when the normal approximation is used (compared to the other two approaches). The main difference is that the actual BCP is larger than the target for small when the vague prior is employed. We attribute this observation to the fact that the (induced) prior for the difference is no longer a uniform but rather a triangular distribution.
Besides the coverage probabilities we also investigated the frequency to obtain a credible region not entirely in the support of the distribution. While for the copula and extensive approach this probability is zero for all sample sizes by construction, for the approximate method it is nonzero. For the raw proportion analysis (figure 2) this probability is as low as (vague prior) or (correct prior) for and and does stabilize to at least for all scenarios where .
In this work, a simple construction of a bivariate Beta distribution from a four-dimensional Dirichlet distribution due to olkin2015 was generalized to higher dimensions. As parameters are needed to describe the -dimensional Beta distribution, it is of limited to no use in high dimensions. To counter this problem, a reduced parametrisation only requiring parameters was proposed which can be derived from an overall concentration parameter a mean vector and a correlation matrix . Necessary and sufficient conditions have been provided that need to be satisfied such that an mBeta distribution for a given triple can exist.
These moment conditions (MC) provide some intuition on which correlation structures are admissible. However, they are also of limited use in practice as checking the conditions for general and is usually not feasible, at least not in a numerically efficient manner. A more concrete descriptions of these conditions may be obtained by calculating the extreme rays of the polyhedral cone . That is, one would need to compute the so-called V-representation which implies a finite but potentially large number of conditions , imposed by the columns of the matrix . In R, algorithms for that matter are for instance implemented in the package rcdd (rcdd). As the number of generating rays is rapidly growing in the dimension , this method is again only helpful for small dimensions. A similar approach was recently pursued by fontana2018 in a related context. Altogether, it appears that the verification of the validity of a multivariate Beta distribution in terms of its mean and correlation structure is only feasible in higher dimensions when making simplifying structural assumptions, see section 2.2. This is only a concern for the prior distribution, as we are guaranteed to end up with a valid posterior when we start with a valid prior (proposition 3).
An example of a valid prior was the vague prior employed in the simulation study which corresponds to independent uniform prior distributions. This case can be connected to the so-called Bayes prior which is frequently employed for the Bayesian analysis of a single proportion (ASI, p. 173). Marginally the two approaches do the same, namely adding two pseudo-observations (one success, one failure) to the dataset leading to shift of posterior mass towards . The proposed mBeta model additionally includes a prior on and update of the mixed second-order moments. From a frequentist viewpoint, this approach can thus be used for a joint shrinkage estimation of sample mean and covariance matrix as they both depend only on the posterior parameters (proposition 1).
The idea to connect marginal Beta-binomial models via a copula is not new. In previous work following this direction, marginal distribution and copula parameters were usually jointly estimated (nyaga2017; yamaguchi2019). In contrast, in this work the copula model is only fitted to the ’correct’ joint posterior distribution to construct simultaneous credible regions.
The simulation study indicates that the copula and extensive sampling approach result in credible regions with close to the desired Bayes coverage probability. The loss in accuracy when employing the copula approach seems to be negligible in the situations that were assessed. In contrast, the normal approximation only produces satisfactory results when the sample size is sufficiently high, as expected. As it provides no benefits compared to the copula approach besides simplicity, the latter seems a reasonable default choice for the considered problem.
A methodological limitation of this work is that the adequateness of the reduced relative to the full parametrisation cannot be assessed via numerical simulation in higher dimensions. This is due to the fact that data generation from the full underlying Dirichlet distribution becomes also numerically infeasible for much larger than 10.
The present work focused on the construction of multivariate credible regions. Decisions regarding prespecified hypotheses based the posterior distribution of the parameters may of course also be of interest in practice. madruga2001 and thulin2014
connect credible regions to hypothesis testing in a decision theoretic framework. They show that for specific loss functions, the standard Bayes test, i.e. deciding for the hypothesis with lowest posterior expected loss, corresponds to comparison of parameter values with credible bounds which may be denoted as.
An advantage of the approach of thulin2014 is that the employed loss function does not depend on the observed data. This was a non-standard feature of the proposal by madruga2001. It appears that the approach of thulin2014 can however not easily be transferred to the multivariate setting which would require to specify a loss function such that
Hereby, the action space consists of all possible test decisions, e.g. for the one-sided hypothesis system
A possible generalization of the loss function from thulin2014 is
whereby indicates for all whether is contained in the alternative . thulin2014 showed that is a Bayes test under the loss function (30) in the case . In the multivariate setting (), this no longer holds true which can be confirmed via numerical examples. However, can be seen as a constrained Bayes test. That is to say, from all tests with posterior false positive probability bounded by , it minimizes the posterior false negative probability .
These considerations are somewhat opposing the usual (empirical-) Bayes approach to multiplicity adjustment which is usually based on modifying the prior distribution rather than the loss function (scott2009; guo2010; scott2010). This established strategy could also be employed for the multinomial Beta-binomial model considered in this work. For instance the prior distribution could be modified such that the tail probability is controlled, e.g. by increasing the concentration parameter (assuming ). Under the vague (independent uniform) prior employed in chapter 5, such a control is not given as for . The above sketched approaches (adaptation of the loss function, e.g. (30)) to multiple hypothesis testing in the Bayesian framework should be contrasted thoroughly with the established methods (adaptation of the prior, hierarchical models) in the future.
Appendix A Technical Details
At first, some established results concerning the Dirichlet distribution are stated. Let with support whereby and . An essential property of the Dirichlet distribution is the so-called aggregation property (DIR, Theorem 2.5 (i)). It concerns the vector where two components and from the original are replaced by their sum which has the following distribution:
Repeated application of this result allows to aggregation of arbitrary subvectors of . In particular, the marginal distribution of the component turns out to be
Additionally, we will use the fact that and for (DIR, p. 39). When setting , this amounts to
Proof of proposition 1.
Follows from the definition .
We have and for all . Thus, .
Is a consequence of the aggregation property (A.1).
Follows from and (3).
Follows from (A.2) and the the fact is a linear transformation of .
A generalization of (3) and again a consequence of (A.1). The only thing left to check is that the correct parameters are added up.
For the proof of proposition 2, Farkas’ lemma will be employed which is stated below (CO, p. 263). As usual, inequalities between vectors should be interpreted component-wise.
Lemma 1 (Farkas’ lemma).
For any matrix and vector , the following two statements are equivalent:
The linear system of equations is feasible, i.e. has a solution , with .
For all , implies .
Proof of proposition 2.
The linear system (LP) consists of equations and unknowns. Thus, it has a unique solution only for as then . On the other hand implies and thus (LP) is underdetermined and has no unique solution in this case. To enforce uniqueness, the minimization of (e.g.) under the side condition (LP) can be reformulated as a convex linearly constrained quadratic program and thus has has a unique solution.
Proof of proposition 3.
Follows immediately from (1), as .
Let with be given. Let and be the derived moment vectors from and , respectively, similar as in definition 3. Then
For , the parameters and together with the vector provide a counterexample as but