A Bayesian network is a probabilistic model constituted by a directed acyclic graph (DAG) and a set of conditional probability tables (CPTs), one for each node. The CPT of node
contains the conditional probability distributions of
given each possible configuration of its parents. Usually all variables are discrete and the conditional distributions are estimated adopting a Multinomial-Dirichlet model, where the Dirichlet prior is characterised by the vector of hyper-parameters. Yet, Bayesian estimation of multinomials is sensitive to the choice of and inappropriate values cause the estimator to perform poorly [hausser2009entropy]. Mixtures of Dirichlet distributions have been recommended both in statistics [casella2009assessing, good1987robustness]
and in machine learning[nemenman2002entropy] in order to obtain more robust estimates. Yet, mixtures of Dirichlet distributions are computationally expensive; this prevents them from being widely adopted. Another difficulty encountered in CPT estimation is the presence of rare events. Assuming that all variables have cardinality and that the number of parents is , we need to estimate conditional distributions, one for each joint configuration of the parent variables. Frequently one or more of such configurations are rarely observed in the data, making their estimation challenging.
We propose to estimate the conditional distributions by adopting a novel approach, based on a hierarchical Multinomial-Dirichlet model. This model has two main characteristics. First, the prior of each conditional distribution is constituted by a mixture of Dirichlet distributions with parameter ; the mixing is attained by treating
as a random variable with its own prior and posterior distribution. By estimating from data the posterior distribution of, we need not to fix its value a priori. Instead we give more weight to the values of that are more likely given the data. Secondly, the model is hierarchical since it assumes that the conditional distributions within the same CPT (but referring to different joint configurations of the parents) are drawn from the same mixture. The hierarchical model jointly estimates all the conditional distributions of the same CPT, called columns of the CPT. The joint estimates generate information flow between different columns of the CPT; thus the hierarchical model exploits the parameters learned for data-rich columns to improve the estimates of the parameters of data-poor columns. This is called borrowing statistical strength [murphy2012machine, Sec 18.104.22.168] and it is well-known within the literature of hierarchical models [gelman2013bayesian]. Also the literature of Bayesian networks acknowledges [koller2009probabilistic, Sec.17.5.4] that introducing dependencies between columns of the same CPT could improve the estimates, especially when dealing with sparse data. However, as far as we known, this work is the first practical application of joint estimation of the columns of the same CPT.
To tackle the problem of computational complexity we adopt a variational inference approach. Namely, we compute a factorised approximation of the posterior distribution that is highly efficient. Variational inference appears particularly well suited for hierarchical models; for instance the inference of Latent Dirichlet Allocation [LDA] is based on variational Bayes. By extensive experiments, we show that our novel approach considerably improves parameter estimation compared to the traditional approaches based on Multinomial-Dirichlet model. The experiments show large gains especially when dealing with small samples, while with large samples the effect of the prior vanishes as expected.
The paper is organised as follows. Section II introduces the novel hierarchical model. Section III provides an analytical study of the resulting estimator, proving that the novel hierarchical approach provides lower estimation error than the traditional approaches, under some mild assumptions on the generative model. Section IV presents some simulation studies showing that, given the same network structure, hierarchical estimation yields both a better fit of the joint distribution and a consistent improvement in classification performance, with respect to the traditional estimation under parameter independence. Section V reports some concluding remarks.
Ii Estimation Under Multinomial-Dirichlet Model
We want to induce a Bayesian network over the set of random variables .
We assume that each variable is discrete and has possible values in the set
. The parents of are denoted by and they have possible joint states collected in the set
We denote by the probability of being in state when its parent set is in state , i.e., . We denote by the parameters of the conditional distribution of given . A common assumption [koller2009probabilistic, Sec.17] is that is generated from a Dirichlet distribution with known parameters. The collection of the conditional probability distributions constitutes the conditional probability table (CPT) of . Each vector of type , with , is a column of the CPT.
The assumption of local parameter independence [koller2009probabilistic, Sec. 17] allows to estimate each parameter vector independently of the other parameter vectors. The assumed generative model, , is:
where denotes the prior strength, also called equivalent sample size, and is a parameter vector such that . The most common choice is to set and , which is called BDeu prior [koller2009probabilistic, Sec.17].
If there are no missing values in the data set , the posterior expected value of is [gelman2013bayesian]:
where is the number of observations in characterised by and , while .
Iii Hierarchical Model
The proposed hierarchical model estimates the conditional probability tables by removing the local independence assumption. In order to simplify the notation we present the model on a node with a single parent . has states in the set , while has states in the set . Lastly, we denote by the number of observations with and and by the number of observations with , where and .
As described in Section II, , for and , are usually assumed to be independently drawn from a Dirichlet distribution, with known parameter . On the contrary, the hierarchical model treats as a hidden random vector, thus making different columns of the CPT dependent. Specifically, we assume to be drawn from a higher-level Dirichlet distribution with hyper-parameter .
We assume that for are observations from the hierarchical Multinomial-Dirichlet model:
where is the equivalent sample size, and is a vector of hyper-parameters.
Iii-a Posterior moments for
We now study the hierarchical model, deriving an analytical expression for the posterior average of , which is the element of vector , and for the posterior covariance between and . To keep notation simple, in the following we will not write explicitly the conditioning with respect to the fixed parameters and . We introduce the notation to represent the posterior average and to represent the posterior covariance.
We define the pointwise estimator for the parameter as its posterior average, i.e., , and the pointwise estimator for the element of the parameter vector as its posterior average, i.e.,
Under model (1), the posterior average of is
while the posterior covariance between and is
where is defined as
The posterior average and posterior covariance of cannot be computed analytically. Some results concerning their expression and numerical computation, together with the complete proof of Theorem 1, are detailed in Appendix.
Notice that the pointwise estimator is a mixture of traditional Bayesian estimators obtained under (non-hierarchical) Multinomial-Dirichlet models with fixed, i.e, . Indeed, thanks to the linearity in , we obtain
This mixture gives more weight to the values of that are more likely given the observations.
Iii-B Properties of the estimator
We study now the mean-squared error (MSE) of and we compare it to the MSE of other traditional estimators.
In order to study the MSE of we need to assume the generative model
where and are the true underlying parameters. Moreover, since is a random vector, we define the MSE for an estimator of the single component as
where and represent respectively the expected value with respect to and , and the MSE for the estimator of the vector as
Notice that the generative model (3) is the traditional, thus non-hierarchical, Multinomial-Dirichlet model, which implies parameter independence. Hence, the traditional Bayesian estimator satisfies exactly the assumptions of this model. The Bayesian estimator is usually adopted by assuming
to be fixed to the values of a uniform distribution on, i.e., , see e.g. [gelman2013bayesian]. However, since in general , the traditional Bayesian approach generates biased estimates in small samples. On the contrary, the novel hierarchical approach estimates the unknown parameter vector basing on its posterior distribution. For this reason the proposed approach can provide estimates that are closer to the true underlying parameters, with a particular advantage in small samples, with respect to other traditional approaches.
In order to study the MSE of different estimators, we first consider an ideal shrinkage estimator
where and is the maximum-likelihood (ML) estimator, obtained estimating from the observations each vector independently of other vectors. This convex combination shrinks the ML estimator towards the true underlying parameter . Setting , the ideal estimator corresponds to a Bayesian estimator with known parameter , i.e., . However, since represents the true underlying parameter that is usually unknown, the ideal estimator (5) is unfeasible. Yet it is useful as it allows to study the MSE.
The main result concerns the comparison, in terms of MSE, of the ideal estimator with respect to the traditional (non hierarchical) Bayesian estimator, which estimates by means of the uniform distribution , i.e., , see e.g. [gelman2013bayesian]. The traditional Bayesian estimator can be written as
Under the assumption that the true generative model is (3), the MSE for the ideal estimator is
while the MSE for the traditional Bayesian estimator is
If , .
The proof is reported in Appendix.
Since in general , the second term in (2) is positive and the ideal estimator achieves smaller MSE with respect to traditional Bayesian estimator. To improve the estimates of the traditional Bayesian model in terms of MSE, we propose to act exactly on the second term of (2). Specifically, we can achieve this purpose by estimating the parameter vector from data, instead of considering it fixed.
where . This convex combination of and shrinks the ML estimator towards the posterior average of , with a strength that is inversely proportional to .
Contrary to the traditional Bayesian estimator (6), the hierarchical estimator provides an estimate of that converges to the true underlying parameter as increases. Indeed, it is well known that the posterior average converges to the true underlying parameter as goes to infinity, i.e., as , [gelman2013bayesian]. As a consequence, converges to and the MSE of converges to the MSE of , i.e., as .
In the finite sample assumption differs from , since includes an estimator of . Since in the hierarchical model we cannot compute this quantity analytically, we verify by simulation that the hierarchical estimator provides good performances in terms of MSE with respect to the traditional Bayesian estimators.
In conclusion, as we will show in the numerical experiments, the hierarchical model can achieve a smaller MSE than the traditional Bayesian estimators, in spite of the unfavourable conditions of the generative model (3). This gain is obtained thanks to the estimation of the parameter , rather than considering it fixed as in the traditional approaches. In more general conditions with respect to (3), the true generating model could not satisfy parameter independence and the MSE gain of the hierarchical approach would further increase.
In the experiments we compute the proposed hierarchical estimator using variational Bayes inference in R by means of the rstan package [stan2]
. The variational Bayes estimates are practically equivalent to those yielded by Markov Chain Monte Carlo (MCMC), though being the variational inference much more efficient than MCMC (less than a second for estimating a CPT compared to a couple of minutes for the MCMC method). In the following we report the results obtained via variational inference. The code is available athttp://ipg.idsia.ch/software.php?id=139.
Iv-a MSE analysis
In the first study we assess the performances of the hierarchical estimator in terms of MSE. We consider two different settings, in which we generate observations from model (3), where is fixed. In the first setting (test 1) we sample from a Dirichlet distribution with parameter , while in the second setting (test 2) we sample it from a Dirichlet distribution with parameter . Under test 2 the parameters of the sampling distribution for are very large and equal to each other, implying , . For this reason, test 2 is the ideal setting for the traditional Bayesian estimator, while test 1 is the ideal setting for the hierarchical estimator.
In both test 1 and test 2 we consider all the possible combinations of (the number of states of ) and (the number of conditioning states), with and . For each combination of and , for both test 1 and test 2, we generate data sets with size . We repeat the data sampling and the estimation procedure 10 times for each combination of , and . Then we compare the estimates yielded by the hierarchical estimator, with and , and by the traditional Bayesian estimator assuming parameter independence, with . We compare the performance of different estimators by computing the difference in terms of average MSE, defined as for an estimator . In every repetition of the experiment , we then compute and we represent it in Figure 1.
The results show that the hierarchical estimator mostly provides better or equivalent results in comparison to , especially for small and/or large . In a Bayesian network it is usual to have large values of , since represents the cardinality of the parents’ joint states set. In test 1 (light blue boxplots) the advantage of the hierarchical estimator over the Bayesian one is generally large, as expected. The advantage of the hierarchical model steadily increases as increases, becoming relevant for or . For large the gap between the two estimators vanishes, although it is more persistent when dealing with large . Interestingly, in test 2 (green boxplots), the traditional Bayesian estimator is just slightly better than the hierarchical one, even though the former is derived exactly from the true generative model. The traditional estimator has a small advantage only for and small values of , and this advantage quickly decreases if either or increase.
Iv-B Joint distribution fitting
In the second study we assess the performance of the hierarchical estimator in the recovery of the joint distribution of a given Bayesian network.
We consider 5 data sets from UCI Machine Learning Repository: Adult, Letter Recognition, Nursery, Pen-Based Recognition of Handwritten Digits and Spambase. We discretise all numerical variables into five equal-frequency bins and we consider only instances without missing values. For each dataset we first learn, from all the available data, the associated directed acyclic graph (DAG) by means of a hill-climbing greedy search, as implemented in the bnlearn package [scutari2010learning]. We then keep such structure as fixed for all the experiments referring to the same data set, since our focus is not structural learning. Then, for each data set and for each we repeat 10 times the procedure of 1) sampling observations from the data set and 2) estimating the CPTs. We perform estimation using the proposed hierarchical approach, with and , and the traditional BDeu prior (Bayesian estimation under parameter independence) with and . The choice of is the most commonly adopted in practice, while is the default value proposed by the bnlearn package. Conversely, we did not offer any choice to the hierarchical model. Indeed, we set the smoothing factor in the proposed model to the number of states of the child variable, which has the same order of magnitude of the smoothing factors used in the traditional Bayesian approach. In spite of the more limited choice for the parameter , the hierarchical estimator consistently outperforms the traditional Bayesian estimator, regardless whether the latter adopts a smoothing factor 1 or 10.
We then measure the log-likelihood of all the instances included in the test set, where the test contains all the instances not present in training set. We report in the top panels of Figure 2 the difference between the log-likelihood of the hierarchical approach and the log-likelihood of Bayesian estimation under local parameter independence, i.e., the log-likelihood ratio. The log-likelihood ratio approximates the ratio between the Kullback-Leibler (KL) divergences of the two models. The KL of a given model measures the distance between the estimated and the true underlying joint distribution.
The log-likelihood ratios obtained in the experiments are extremely large on small sample sizes, being larger than one thousand on all data sets (Figure 2, top panels). This shows the huge gain delivered by the hierarchical approach when dealing with small data sets. This happens regardless of the equivalent sample size used to perform Bayesian estimation under parameter independence: we note however that in general yields better results than . The lowest gains are obtained on the data set Nursery; the reason is that the DAG of Nursery has the lowest number of parents per node (0.9 on average, compared to about twice as much for the other DAGs). Thus, this data set is the less challenging from the parameter estimation viewpoint, with respect to the others. We point out that significant likelihood ratios are obtained in general even for large samples, even though they are not apparent from the figure due to the scale. For instance, for the log-likelihood ratios range from 50 (Nursery) to 85000 (Letter).
In the third study we assess the performance of the hierarchical estimator in terms of classification. We consider the same datasets of the previous experiment, discretised in the same way.
For each dataset we first learn the Tree-Augmented Naive Bayes (TAN) structure by means of thebnlearn package. The networks are estimated on the basis of all the available samples and are kept fixed for all the experiments referring to the same data set, since our focus is not structural learning. Then, for each dataset and for each , we sample observations. We then estimate the CPTs of the Bayesian network from the sampled data by means of the hierarchical estimator ( and ) and the traditional Bayesian estimators obtained under a BDeu prior ( and ). We repeat the sampling and the estimating steps 10 times for each value of
and each data set. We then classify each instance of the test set, which contains 1000 instances sampled uniformly from all the instances not included in the training set. We assess the classification performance by measuring accuracy and area under the ROC (ROC AUC) of the classifiers. In the central panels of Figure2 we report the difference in accuracy between the hierarchical estimator and the traditional Bayesian ones, while in the bottom panels of the same figure we report the difference in ROC AUC between the same classifiers.
The area under the ROC is a more sensitive indicator for the correctness of the estimated posterior probabilities with respect to accuracy. According to Figure2 (bottom panels), the hierarchical approach yields consistently higher ROC AUC than both the BDeu classifiers. The increase of ROC AUC in small samples (, ) ranges between 2 and 20 points compared to both the BDeu priors. As increases this gain in ROC AUC tends to vanish. However, for the gain in ROC AUC for the datasets Adult and Letter ranges between 1 and 5 points.
Figure 2 (central panels) shows also improvements in accuracy, even if this indicator is less sensitive to the estimated posterior probability than the area under ROC. Indeed, in computing the accuracy, the most probable class is compared to the actual class, without paying further attention to its posterior probability. In small samples (, ) there is an average increase of accuracy of about 5 points compared to the BDeu prior with and of about 10 points compared to the BDeu prior with . The accuracy improvements tends to decrease as increases; yet on both Adult and Letter data sets an accuracy improvement of about 1-2 points is shown also for .
We have presented a novel approach for estimating the conditional probability tables by relaxing the local independence assumption. Given the same network structure, the novel approach yields a consistently better fit to the joint distribution than the traditional Bayesian estimation under parameter independence; it also improves classification performance. Moreover, the introduction of variational inference makes the proposed method competitive in terms of computational cost with respect to the traditional Bayesian estimation.
In order to prove Theorem 1
, we first need to derive some results concerning the posterior moments for the vector, whose general element is associated to state for the variable . Given a dataset , the -th posterior moment for the element is
The following proposition states a general result for computing any posterior moment of , whose general expression is , where represents the power of element .
Under the assumptions of model (2), the posterior average of the quantity , with , is
where and is a proportionality constant such that
The element of the posterior average vector is
where is a Kronecker delta.
The element of the posterior covariance matrix is
Proof of Lemma 1.
Under the assumptions of model (2), the joint posterior density of is
Marginalising with respect to , we obtain the marginal posterior density for , i.e.,
Thanks to the well-known property of the Gamma function
we can write the posterior marginal density (11) as
The proportionality constant of the posterior marginal density is obtained by integrating the right term in (12) with respect to the elements of , such that . The resulting proportionality constant is thus
The posterior average for the quantity can be derived directly from the posterior marginal density of as
where . In the special case of , , we have .
Proof of Theorem 1.
Given , the marginal posterior density for is a Dirichlet distribution with parameters , where . It is thus easy to compute
where , , and .
The posterior average of can thus be computed by means of the law of total expectation as
The posterior average of is obtained directly from Lemma 1.
In order to compute the posterior covariance between and we can use the law of total covariance, i.e.,
The first quantity is:
If , the second quantity is:
Otherwise, if , , since given .
Exploiting the law of total covariance, we obtain
The posterior covariance between and is obtained directly from Lemma 1. ∎
Proof of Theorem 2.
Exploiting the linearity of the ideal estimator we obtain that
The MSE of is thus
This quantity corresponds to the first term of . The second term is obtained as
since is independent of and .
The last term , since is independent of and .
The MSE for the ideal estimator is thus
If and , ,