A Bayesian network is a graphical model for probabilistic relationships among a set of variables. Over the last decade, the Bayesian network has become a popular representation for encoding uncertain expert knowledge in expert systems (Heckerman et al., 1995a). More recently, researchers have developed methods for learning Bayesian networks from data. The techniques that have been developed are new and still evolving, but they have been shown to be remarkably effective for some data-analysis problems.
In this paper, we provide a tutorial on Bayesian networks and associated Bayesian techniques for extracting and encoding knowledge from data. There are numerous representations available for data analysis, including rule bases, decision trees, and artificial neural networks; and there are many techniques for data analysis such as density estimation, classification, regression, and clustering. So what do Bayesian networks and Bayesian methods have to offer? There are at least four answers.
One, Bayesian networks can readily handle incomplete data sets. For example, consider a classification or regression problem where two of the explanatory or input variables are strongly anti-correlated. This correlation is not a problem for standard supervised learning techniques, provided all inputs are measured in every case. When one of the inputs is not observed, however, most models will produce an inaccurate prediction, because they do not encode the correlation between the input variables. Bayesian networks offer a natural way to encode such dependencies.
Two, Bayesian networks allow one to learn about causal relationships. Learning about causal relationships is important for at least two reasons. The process is useful when we are trying to gain understanding about a problem domain, for example, during exploratory data analysis. In addition, knowledge of causal relationships allows us to make predictions in the presence of interventions. For example, a marketing analyst may want to know whether or not it is worthwhile to increase exposure of a particular advertisement in order to increase the sales of a product. To answer this question, the analyst can determine whether or not the advertisement is a cause for increased sales, and to what degree. The use of Bayesian networks helps to answer such questions even when no experiment about the effects of increased exposure is available.
Three, Bayesian networks in conjunction with Bayesian statistical techniques facilitate the combination of domain knowledge and data. Anyone who has performed a real-world analysis knows the importance of prior or domain knowledge, especially when data is scarce or expensive. The fact that some commercial systems (i.e., expert systems) can be built from prior knowledge alone is a testament to the power of prior knowledge. Bayesian networks have a causal semantics that makes the encoding of causal prior knowledge particularly straightforward. In addition, Bayesian networks encode the strength of causal relationships with probabilities. Consequently, prior knowledge and data can be combined with well-studied techniques from Bayesian statistics.
Four, Bayesian methods in conjunction with Bayesian networks and other types of models offer an efficient and principled approach for avoiding the overfitting of data. As we shall see, there is no need to hold out some of the available data for testing. Using the Bayesian approach, models can be “smoothed” in such a way that all available data can be used for training.
This tutorial is organized as follows. In Section 2, we discuss the Bayesian interpretation of probability and review methods from Bayesian statistics for combining prior knowledge with data. In Section 3, we describe Bayesian networks and discuss how they can be constructed from prior knowledge alone. In Section 4, we discuss algorithms for probabilistic inference in a Bayesian network. In Sections 5 and 6, we show how to learn the probabilities in a fixed Bayesian-network structure, and describe techniques for handling incomplete data including Monte-Carlo methods and the Gaussian approximation. In Sections 7 through 12, we show how to learn both the probabilities and structure of a Bayesian network. Topics discussed include methods for assessing priors for Bayesian-network structure and parameters, and methods for avoiding the overfitting of data including Monte-Carlo, Laplace, BIC, and MDL approximations. In Sections 13 and 14, we describe the relationships between Bayesian-network techniques and methods for supervised and unsupervised learning. In Section 15, we show how Bayesian networks facilitate the learning of causal relationships. In Section 16, we illustrate techniques discussed in the tutorial using a real-world case study. In Section 17, we give pointers to software and additional literature.
2 The Bayesian Approach to Probability and Statistics
To understand Bayesian networks and associated learning techniques, it is important to understand the Bayesian approach to probability and statistics. In this section, we provide an introduction to the Bayesian approach for those readers familiar only with the classical view.
In a nutshell, the Bayesian probability of an event is a person’s degree of belief in that event. Whereas a classical probability is a physical property of the world (e.g., the probability that a coin will land heads), a Bayesian probability is a property of the person who assigns the probability (e.g., your degree of belief that the coin will land heads). To keep these two concepts of probability distinct, we refer to the classical probability of an event as the true or physical probability of that event, and refer to a degree of belief in an event as a Bayesian or personal probability. Alternatively, when the meaning is clear, we refer to a Bayesian probability simply as a probability.
One important difference between physical probability and personal probability is that, to measure the latter, we do not need repeated trials. For example, imagine the repeated tosses of a sugar cube onto a wet surface. Every time the cube is tossed, its dimensions will change slightly. Thus, although the classical statistician has a hard time measuring the probability that the cube will land with a particular face up, the Bayesian simply restricts his or her attention to the next toss, and assigns a probability. As another example, consider the question: What is the probability that the Chicago Bulls will win the championship in 2001? Here, the classical statistician must remain silent, whereas the Bayesian can assign a probability (and perhaps make a bit of money in the process).
One common criticism of the Bayesian definition of probability is that probabilities seem arbitrary. Why should degrees of belief satisfy the rules of probability? On what scale should probabilities be measured? In particular, it makes sense to assign a probability of one (zero) to an event that will (not) occur, but what probabilities do we assign to beliefs that are not at the extremes? Not surprisingly, these questions have been studied intensely.
With regards to the first question, many researchers have suggested different sets of properties that should be satisfied by degrees of belief (e.g., Ramsey 1931, Cox 1946, Good 1950, Savage 1954, DeFinetti 1970). It turns out that each set of properties leads to the same rules: the rules of probability. Although each set of properties is in itself compelling, the fact that different sets all lead to the rules of probability provides a particularly strong argument for using probability to measure beliefs.
The answer to the question of scale follows from a simple observation: people find it fairly easy to say that two events are equally likely. For example, imagine a simplified wheel of fortune having only two regions (shaded and not shaded), such as the one illustrated in Figure 1. Assuming everything about the wheel is symmetric (except for shading), you should conclude that it is equally likely for the wheel to stop in any one position. From this judgment and the sum rule of probability (probabilities of mutually exclusive and collectively exhaustive events sum to one), it follows that your probability that the wheel will stop in the shaded region is the percent area of the wheel that is shaded (in this case, 0.3).
This probability wheel now provides a reference for measuring your probabilities of other events. For example, what is your probability that Al Gore will run on the Democratic ticket in 2000? First, ask yourself the question: Is it more likely that Gore will run or that the wheel when spun will stop in the shaded region? If you think that it is more likely that Gore will run, then imagine another wheel where the shaded region is larger. If you think that it is more likely that the wheel will stop in the shaded region, then imagine another wheel where the shaded region is smaller. Now, repeat this process until you think that Gore running and the wheel stopping in the shaded region are equally likely. At this point, your probability that Gore will run is just the percent surface area of the shaded area on the wheel.
In general, the process of measuring a degree of belief is commonly referred to as a probability assessment. The technique for assessment that we have just described is one of many available techniques discussed in the Management Science, Operations Research, and Psychology literature. One problem with probability assessment that is addressed in this literature is that of precision. Can one really say that his or her probability for an event is 0.601 and not 0.599? In most cases, no. Nonetheless, in most cases, probabilities are used to make decisions, and these decisions are not sensitive to small variations in probabilities. Well-established practices of sensitivity analysis help one to know when additional precision is unnecessary (e.g., Howard and Matheson, 1983). Another problem with probability assessment is that of accuracy. For example, recent experiences or the way a question is phrased can lead to assessments that do not reflect a person’s true beliefs (Tversky and Kahneman, 1974). Methods for improving accuracy can be found in the decision-analysis literature (e.g., Spetzler et al., 1975).
Now let us turn to the issue of learning with data. To illustrate the Bayesian approach, consider a common thumbtack—one with a round, flat head that can be found in most supermarkets. If we throw the thumbtack up in the air, it will come to rest either on its point (heads) or on its head (tails). (This example is taken from Howard, 1970.) Suppose we flip the thumbtack $N+1$ times, making sure that the physical properties of the thumbtack and the conditions under which it is flipped remain stable over time. From the first $N$ observations, we want to determine the probability of heads on the $(N+1)$th toss.
In the classical analysis of this problem, we assert that there is some physical probability of heads, which is unknown. We estimate this physical probability from the $N$ observations using criteria such as low bias and low variance. We then use this estimate as our probability for heads on the $(N+1)$th toss. In the Bayesian approach, we also assert that there is some physical probability of heads, but we encode our uncertainty about this physical probability using (Bayesian) probabilities, and use the rules of probability to compute our probability of heads on the $(N+1)$th toss. (Strictly speaking, a probability belongs to a single person, not a collection of people. Nonetheless, in parts of this discussion, we refer to “our” probability to avoid awkward English.)
To examine the Bayesian analysis of this problem, we need some notation. We denote a variable by an upper-case letter (e.g., $X$), and the state or value of a corresponding variable by that same letter in lower case (e.g., $x$). We denote a set of variables by a bold-face upper-case letter (e.g., $\mathbf{X}$). We use a corresponding bold-face lower-case letter (e.g., $\mathbf{x}$) to denote an assignment of state or value to each variable in a given set. We say that variable set $\mathbf{X}$ is in configuration $\mathbf{x}$. We use $p(X=x|\xi)$ (or $p(x|\xi)$ as a shorthand) to denote the probability that $X=x$ of a person with state of information $\xi$. We also use $p(X|\xi)$ to denote the probability distribution for $X$ (both mass functions and density functions). Whether $p(X|\xi)$ refers to a probability, a probability density, or a probability distribution will be clear from context. We use this notation for probability throughout the paper. A summary of all notation is given at the end of the chapter.
Returning to the thumbtack problem, we define $\Theta$ to be a variable whose values $\theta$ correspond to the possible true values of the physical probability. (Bayesians typically refer to $\Theta$ as an uncertain variable, because its value is uncertain; in contrast, classical statisticians often refer to $\Theta$ as a random variable. In this text, we refer to $\Theta$ and all uncertain/random variables simply as variables.) We sometimes refer to $\theta$ as a parameter. We express the uncertainty about $\Theta$ using the probability density function $p(\theta|\xi)$. In addition, we use $X_l$ to denote the variable representing the outcome of the $l$th flip, $l=1,\ldots,N+1$, and $D=\{X_1=x_1,\ldots,X_N=x_N\}$ to denote the set of our observations. Thus, in Bayesian terms, the thumbtack problem reduces to computing $p(x_{N+1}|D,\xi)$ from $p(\theta|\xi)$.
To do so, we first use Bayes’ rule to obtain the probability distribution for $\Theta$ given $D$ and background knowledge $\xi$:
$$p(\theta|D,\xi) = \frac{p(\theta|\xi)\,p(D|\theta,\xi)}{p(D|\xi)} \qquad (1)$$
where
$$p(D|\xi) = \int p(D|\theta,\xi)\,p(\theta|\xi)\,d\theta \qquad (2)$$
Next, we expand the term $p(D|\theta,\xi)$. Both Bayesians and classical statisticians agree on this term: it is the likelihood function for binomial sampling. In particular, given the value of $\theta$, the observations in $D$ are mutually independent, and the probability of heads (tails) on any one observation is $\theta$ ($1-\theta$). Consequently, Equation 1 becomes
$$p(\theta|D,\xi) = \frac{p(\theta|\xi)\,\theta^h\,(1-\theta)^t}{p(D|\xi)} \qquad (3)$$
where $h$ and $t$ are the number of heads and tails observed in $D$, respectively. The probability distributions $p(\theta|\xi)$ and $p(\theta|D,\xi)$ are commonly referred to as the prior and posterior for $\Theta$, respectively. The quantities $h$ and $t$ are said to be sufficient statistics for binomial sampling, because they provide a summarization of the data that is sufficient to compute the posterior from the prior. Finally, we average over the possible values of $\Theta$ (using the expansion rule of probability) to determine the probability that the $(N+1)$th toss of the thumbtack will come up heads:
$$p(X_{N+1}=\mathrm{heads}|D,\xi) = \int \theta\,p(\theta|D,\xi)\,d\theta \equiv E_{p(\theta|D,\xi)}(\theta) \qquad (4)$$
where $E_{p(\theta|D,\xi)}(\theta)$ denotes the expectation of $\theta$ with respect to the distribution $p(\theta|D,\xi)$.
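The update in Equation 1 can also be carried out numerically for an arbitrary prior by evaluating the unnormalized posterior on a grid of $\theta$ values. The following sketch is illustrative code (the grid size, uniform prior, and counts are our own choices, not values from the text):

```python
# Numerical sketch of the binomial Bayes update: the posterior is
# proportional to prior(theta) * theta^h * (1 - theta)^t.

def grid_posterior(prior, thetas, h, t):
    """Return the normalized posterior over a grid of theta values."""
    unnorm = [p * th**h * (1.0 - th)**t for p, th in zip(prior, thetas)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

n_grid = 101
thetas = [i / (n_grid - 1) for i in range(n_grid)]
prior = [1.0 / n_grid] * n_grid          # uniform prior
post = grid_posterior(prior, thetas, h=7, t=3)
mode = thetas[max(range(n_grid), key=post.__getitem__)]  # near 7/10
```

With a uniform prior the posterior mode coincides with the maximum-likelihood value $h/(h+t)$, which makes the grid a handy sanity check.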
To complete the Bayesian story for this example, we need a method to assess the prior distribution for $\Theta$. A common approach, usually adopted for convenience, is to assume that this distribution is a beta distribution:
$$p(\theta|\xi) = \mathrm{Beta}(\theta|\alpha_h,\alpha_t) \equiv \frac{\Gamma(\alpha)}{\Gamma(\alpha_h)\,\Gamma(\alpha_t)}\,\theta^{\alpha_h-1}(1-\theta)^{\alpha_t-1} \qquad (5)$$
where $\alpha_h$ and $\alpha_t$ are the parameters of the beta distribution, $\alpha = \alpha_h + \alpha_t$, and $\Gamma(\cdot)$ is the Gamma function, which satisfies $\Gamma(x+1)=x\,\Gamma(x)$ and $\Gamma(1)=1$. The quantities $\alpha_h$ and $\alpha_t$ are often referred to as hyperparameters to distinguish them from the parameter $\theta$. The hyperparameters $\alpha_h$ and $\alpha_t$ must be greater than zero so that the distribution can be normalized. Examples of beta distributions are shown in Figure 2.
The beta prior is convenient for several reasons. By Equation 3, the posterior distribution will also be a beta distribution:
$$p(\theta|D,\xi) = \mathrm{Beta}(\theta|\alpha_h+h,\,\alpha_t+t) \qquad (6)$$
We say that the set of beta distributions is a conjugate family of distributions for binomial sampling. Also, the expectation of $\theta$ with respect to this distribution has a simple form:
$$\int \theta\,\mathrm{Beta}(\theta|\alpha_h,\alpha_t)\,d\theta = \frac{\alpha_h}{\alpha} \qquad (7)$$
Hence, given a beta prior, we have a simple expression for the probability of heads in the $(N+1)$th toss:
$$p(X_{N+1}=\mathrm{heads}|D,\xi) = \frac{\alpha_h + h}{\alpha + N} \qquad (8)$$
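To make the conjugate update concrete, here is a small sketch in Python; the hyperparameters and counts are illustrative choices, not values from the text:

```python
# Conjugate beta-binomial update: a Beta(a_h, a_t) prior plus h heads
# and t tails yields a Beta(a_h + h, a_t + t) posterior, and the
# predictive probability of heads is (a_h + h) / (a_h + a_t + h + t).

def beta_update(a_h, a_t, h, t):
    """Posterior hyperparameters after observing h heads, t tails."""
    return a_h + h, a_t + t

def prob_next_heads(a_h, a_t, h, t):
    """Predictive probability of heads on the next toss (Equation 8)."""
    return (a_h + h) / (a_h + a_t + h + t)

posterior = beta_update(1.0, 1.0, 7, 3)      # Beta(8, 4)
p_heads = prob_next_heads(1.0, 1.0, 7, 3)    # 8/12
```

Note that no integration is needed at prediction time: the sufficient statistics $h$ and $t$ and the hyperparameters carry everything.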
Assuming $p(\theta|\xi)$ is a beta distribution, it can be assessed in a number of ways. For example, we can assess our probability for heads in the first toss of the thumbtack (e.g., using a probability wheel). Next, we can imagine having seen the outcomes of $k$ flips, all heads, and reassess our probability for heads in the next toss. From Equation 8, we have (for $k=1$)
$$p(X_1=\mathrm{heads}|\xi) = \frac{\alpha_h}{\alpha_h+\alpha_t} \qquad\quad p(X_2=\mathrm{heads}|X_1=\mathrm{heads},\xi) = \frac{\alpha_h+1}{\alpha_h+\alpha_t+1}$$
Given these probabilities, we can solve for $\alpha_h$ and $\alpha_t$. This assessment technique is known as the method of imagined future data.
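The two assessed probabilities determine the hyperparameters uniquely. A sketch of the algebra (the assessed values below are hypothetical):

```python
def beta_from_imagined_data(p1, p2):
    """Recover (a_h, a_t) from p1 = p(heads on toss 1) and
    p2 = p(heads on toss 2 | heads on toss 1).
    Solving p1 = a_h / alpha and p2 = (a_h + 1) / (alpha + 1)
    gives alpha = (1 - p2) / (p2 - p1); requires p2 > p1."""
    alpha = (1.0 - p2) / (p2 - p1)
    a_h = p1 * alpha
    return a_h, alpha - a_h

# Hypothetical assessments: 0.5 initially, 0.55 after one imagined head.
a_h, a_t = beta_from_imagined_data(0.5, 0.55)   # equivalent sample size 9
```

The requirement $p_2 > p_1$ reflects that seeing a head should raise (not lower) one's probability for the next head under a beta prior.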
Another assessment method is based on Equation 6. This equation says that, if we start with a Beta(0,0) prior (technically, the hyperparameters of this prior should be small positive numbers so that $p(\theta|\xi)$ can be normalized) and observe $\alpha_h$ heads and $\alpha_t$ tails, then our posterior (i.e., new prior) will be a Beta($\alpha_h$,$\alpha_t$) distribution. Recognizing that a Beta(0,0) prior encodes a state of minimum information, we can assess $\alpha_h$ and $\alpha_t$ by determining the (possibly fractional) number of observations of heads and tails that is equivalent to our actual knowledge about flipping thumbtacks. Alternatively, we can assess $p(X_1=\mathrm{heads}|\xi)$ and $\alpha$, which can be regarded as an equivalent sample size for our current knowledge. This technique is known as the method of equivalent samples. Other techniques for assessing beta distributions are discussed by Winkler (1967) and Chaloner and Duncan (1983).
Although the beta prior is convenient, it is not accurate for some problems. For example, suppose we think that the thumbtack may have been purchased at a magic shop. In this case, a more appropriate prior may be a mixture of beta distributions—for example,
$$p(\theta|\xi) = 0.4\,\mathrm{Beta}(\theta|20,1) + 0.4\,\mathrm{Beta}(\theta|1,20) + 0.2\,\mathrm{Beta}(\theta|2,2)$$
where 0.4 is our probability that the thumbtack is heavily weighted toward heads (tails). In effect, we have introduced an additional hidden or unobserved variable $H$, whose states correspond to the three possibilities: (1) thumbtack is biased toward heads, (2) thumbtack is biased toward tails, and (3) thumbtack is normal; and we have asserted that $\theta$ conditioned on each state of $H$ is a beta distribution. In general, there are simple methods (e.g., the method of imagined future data) for determining whether or not a beta prior is an accurate reflection of one’s beliefs. In those cases where the beta prior is inaccurate, an accurate prior can often be assessed by introducing additional hidden variables, as in this example.
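A mixture-of-betas prior remains tractable: the posterior is again a mixture of betas, with each component updated conjugately and the mixing weights rescaled by each component's marginal likelihood. A sketch, in which the weights and hyperparameters are illustrative choices rather than the ones in the text:

```python
import math

def log_beta_fn(a, b):
    """Log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def update_beta_mixture(components, h, t):
    """components: list of (weight, a_h, a_t) triples.
    Returns the posterior mixture after h heads and t tails; each
    weight is rescaled by B(a_h + h, a_t + t) / B(a_h, a_t)."""
    log_w = [math.log(w) + log_beta_fn(a + h, b + t) - log_beta_fn(a, b)
             for w, a, b in components]
    m = max(log_w)                      # stabilize before exponentiating
    z = sum(math.exp(lw - m) for lw in log_w)
    return [(math.exp(lw - m) / z, a + h, b + t)
            for lw, (_, a, b) in zip(log_w, components)]

# Hypothetical magic-shop prior: biased-heads, biased-tails, normal.
prior = [(0.4, 20.0, 1.0), (0.4, 1.0, 20.0), (0.2, 2.0, 2.0)]
posterior = update_beta_mixture(prior, h=10, t=0)
```

After ten straight heads, nearly all posterior weight shifts to the heads-biased component, which is exactly the behavior a single beta prior cannot express.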
So far, we have only considered observations drawn from a binomial distribution. In general, observations may be drawn from any physical probability distribution:
$$p(x|\boldsymbol{\theta},\xi) = f(x,\boldsymbol{\theta})$$
where $f(x,\boldsymbol{\theta})$ is the likelihood function with parameters $\boldsymbol{\theta}$. For purposes of this discussion, we assume that the number of parameters is finite. As an example, $X$ may be a continuous variable and have a Gaussian physical probability distribution with mean $\mu$ and variance $v$:
$$p(x|\boldsymbol{\theta},\xi) = (2\pi v)^{-1/2}\,e^{-(x-\mu)^2/2v}$$
where $\boldsymbol{\theta} = \{\mu, v\}$.
Regardless of the functional form, we can learn about the parameters given data using the Bayesian approach. As we have done in the binomial case, we define variables corresponding to the unknown parameters, assign priors to these variables, and use Bayes’ rule to update our beliefs about these parameters given data:
$$p(\boldsymbol{\theta}|D,\xi) = \frac{p(D|\boldsymbol{\theta},\xi)\,p(\boldsymbol{\theta}|\xi)}{p(D|\xi)}$$
We then average over the possible values of $\boldsymbol{\Theta}$ to make predictions. For example,
$$p(x_{N+1}|D,\xi) = \int p(x_{N+1}|\boldsymbol{\theta},\xi)\,p(\boldsymbol{\theta}|D,\xi)\,d\boldsymbol{\theta}$$
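For instance, with a Gaussian likelihood of known variance and a Gaussian prior on the mean, this update has a well-known closed form. A sketch of that standard conjugate-normal result (the prior values and data below are hypothetical):

```python
def gaussian_mean_update(mu0, tau2, v, xs):
    """Posterior over the mean of a Gaussian with known variance v,
    given a Normal(mu0, tau2) prior and observations xs.
    Precisions add, and the posterior mean is a precision-weighted
    average of the prior mean and the data."""
    n = len(xs)
    post_prec = 1.0 / tau2 + n / v
    post_mean = (mu0 / tau2 + sum(xs) / v) / post_prec
    return post_mean, 1.0 / post_prec

# Hypothetical prior N(0, 1), unit observation variance, four data points.
mean, var = gaussian_mean_update(0.0, 1.0, 1.0, [1.0, 1.0, 1.0, 1.0])
# posterior precision 5, mean 4/5, variance 1/5
```

As more observations arrive, the posterior mean is pulled toward the sample mean and the posterior variance shrinks, mirroring the beta-binomial behavior above.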
For a class of distributions known as the exponential family, these computations can be done efficiently and in closed form. (Recent advances in Monte-Carlo methods have made it possible to work efficiently with many distributions outside the exponential family; see, for example, Gilks et al., 1996.)
Members of this class include the binomial, multinomial, normal, Gamma, Poisson, and multivariate-normal distributions. Each member of this family has sufficient statistics that are of fixed dimension for any random sample, and a simple conjugate prior. (In fact, except for a few well-characterized exceptions, the exponential family is the only class of distributions that have sufficient statistics of fixed dimension; Koopman, 1936; Pitman, 1936.) Bernardo and Smith (pp. 436–442, 1994) have compiled the important quantities and Bayesian computations for commonly used members of the exponential family. Here, we summarize these items for multinomial sampling, which we use to illustrate many of the ideas in this paper.
In multinomial sampling, the observed variable $X$ is discrete, having $r$ possible states $x^1,\ldots,x^r$. The likelihood function is given by
$$p(X=x^k|\boldsymbol{\theta},\xi) = \theta_k, \qquad k=1,\ldots,r$$
where $\boldsymbol{\theta} = \{\theta_2,\ldots,\theta_r\}$ are the parameters. (The parameter $\theta_1$ is given by $1-\sum_{k=2}^r \theta_k$.) In this case, as in the case of binomial sampling, the parameters correspond to physical probabilities. The sufficient statistics for data set $D$ are $\{N_1,\ldots,N_r\}$, where $N_k$ is the number of times $X=x^k$ in $D$. The simple conjugate prior used with multinomial sampling is the Dirichlet distribution:
$$p(\boldsymbol{\theta}|\xi) = \mathrm{Dir}(\boldsymbol{\theta}|\alpha_1,\ldots,\alpha_r) \equiv \frac{\Gamma(\alpha)}{\prod_{k=1}^r \Gamma(\alpha_k)}\,\prod_{k=1}^r \theta_k^{\alpha_k-1}$$
where $\alpha = \sum_{k=1}^r \alpha_k$ and $\alpha_k > 0$, $k=1,\ldots,r$. The posterior distribution is $p(\boldsymbol{\theta}|D,\xi) = \mathrm{Dir}(\boldsymbol{\theta}|\alpha_1+N_1,\ldots,\alpha_r+N_r)$. Techniques for assessing the beta distribution, including the methods of imagined future data and equivalent samples, can also be used to assess Dirichlet distributions. Given this conjugate prior and data set $D$, the probability distribution for the next observation is given by
$$p(X_{N+1}=x^k|D,\xi) = \int \theta_k\,\mathrm{Dir}(\boldsymbol{\theta}|\alpha_1+N_1,\ldots,\alpha_r+N_r)\,d\boldsymbol{\theta} = \frac{\alpha_k+N_k}{\alpha+N}$$
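In code, the Dirichlet update and predictive distribution mirror the beta case component by component (hyperparameters and counts below are illustrative):

```python
def dirichlet_predictive(alphas, counts):
    """Predictive distribution for the next multinomial observation:
    p(X_{N+1} = x^k | D) = (alpha_k + N_k) / (alpha + N)."""
    alpha = sum(alphas)
    n = sum(counts)
    return [(a + c) / (alpha + n) for a, c in zip(alphas, counts)]

# Symmetric Dirichlet(1, 1, 1) prior, counts N = (2, 3, 5).
pred = dirichlet_predictive([1.0, 1.0, 1.0], [2, 3, 5])
# [3/13, 4/13, 6/13]
```

The beta case is recovered by taking $r = 2$, with $(\alpha_h, \alpha_t)$ as the two hyperparameters.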
As we shall see, another important quantity in Bayesian analysis is the marginal likelihood or evidence $p(D|\xi)$. In this case, we have
$$p(D|\xi) = \frac{\Gamma(\alpha)}{\Gamma(\alpha+N)} \cdot \prod_{k=1}^r \frac{\Gamma(\alpha_k+N_k)}{\Gamma(\alpha_k)}$$
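Because the Gamma factors grow quickly, the marginal likelihood is best computed in log space with the log-Gamma function. A sketch:

```python
import math

def log_marginal_likelihood(alphas, counts):
    """log p(D) for Dirichlet-multinomial sampling:
    log G(alpha) - log G(alpha + N)
    + sum_k [log G(alpha_k + N_k) - log G(alpha_k)]."""
    alpha, n = sum(alphas), sum(counts)
    out = math.lgamma(alpha) - math.lgamma(alpha + n)
    for a, c in zip(alphas, counts):
        out += math.lgamma(a + c) - math.lgamma(a)
    return out

# Sanity check: with a uniform Beta(1,1) prior (r = 2), a single
# observation has marginal probability 1/2.
lml = log_marginal_likelihood([1.0, 1.0], [1, 0])
```

This quantity reappears later in the tutorial as the scoring criterion used when comparing network structures.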
We note that the explicit mention of the state of knowledge $\xi$ is useful, because it reinforces the notion that probabilities are subjective. Nonetheless, once this concept is firmly in place, the notation simply adds clutter. In the remainder of this tutorial, we shall not mention $\xi$ explicitly.
In closing this section, we emphasize that, although the Bayesian and classical approaches may sometimes yield the same prediction, they are fundamentally different methods for learning from data. As an illustration, let us revisit the thumbtack problem. Here, the Bayesian “estimate” for the physical probability of heads is obtained in a manner that is essentially the opposite of the classical approach.
Namely, in the classical approach, $\theta$ is fixed (albeit unknown), and we imagine all data sets of size $N$ that may be generated by sampling from the binomial distribution determined by $\theta$. Each data set $D$ will occur with some probability $p(D|\theta)$ and will produce an estimate $\hat{\theta}(D)$. To evaluate an estimator, we compute the expectation and variance of the estimate with respect to all such data sets:
$$E_{p(D|\theta)}(\hat{\theta}) = \sum_D p(D|\theta)\,\hat{\theta}(D) \qquad\quad \mathrm{Var}_{p(D|\theta)}(\hat{\theta}) = \sum_D p(D|\theta)\,\bigl(\hat{\theta}(D) - E_{p(D|\theta)}(\hat{\theta})\bigr)^2$$
We then choose an estimator that somehow balances the bias ($\theta - E_{p(D|\theta)}(\hat{\theta})$) and variance of these estimates over the possible values for $\theta$. (Low bias and variance are not the only desirable properties of an estimator. Other desirable properties include consistency and robustness.) Finally, we apply this estimator to the data set that we actually observe. A commonly-used estimator is the maximum-likelihood (ML) estimator, which selects the value of $\theta$ that maximizes the likelihood $p(D|\theta)$. For binomial sampling, we have
$$\hat{\theta}(D) = \frac{h}{h+t}$$
For this (and other types) of sampling, the ML estimator is unbiased. That is, for all values of $\theta$, the ML estimator has expectation $\theta$ (i.e., zero bias). In addition, for all values of $\theta$, the variance of the ML estimator is no greater than that of any other unbiased estimator (see, e.g., Schervish, 1995).
In contrast, in the Bayesian approach, $D$ is fixed, and we imagine all possible values of $\theta$ from which this data set could have been generated. Given $\theta$, the “estimate” of the physical probability of heads is just $\theta$ itself. Nonetheless, we are uncertain about $\theta$, and so our final estimate is the expectation of $\theta$ with respect to our posterior beliefs about its value:
$$E_{p(\theta|D,\xi)}(\theta) = \int \theta\,p(\theta|D,\xi)\,d\theta$$
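The contrast is easy to see in code: under a Beta prior, the Bayesian point estimate shrinks the ML estimate $h/(h+t)$ toward the prior mean, and the two converge as data accumulate (the counts and hyperparameters below are illustrative):

```python
def ml_estimate(h, t):
    """Classical maximum-likelihood estimate of theta."""
    return h / (h + t)

def bayes_estimate(a_h, a_t, h, t):
    """Posterior expectation of theta under a Beta(a_h, a_t) prior."""
    return (a_h + h) / (a_h + a_t + h + t)

# With little data the two disagree; with much data they converge.
ml_small = ml_estimate(3, 1)                    # 0.75
bayes_small = bayes_estimate(1.0, 1.0, 3, 1)    # 4/6
ml_big = ml_estimate(3000, 1000)
bayes_big = bayes_estimate(1.0, 1.0, 3000, 1000)
```

The shrinkage for small samples is one practical face of the “smoothing” mentioned in the introduction.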
These two expectations—the classical expectation over data sets and the Bayesian expectation over parameter values—are different and, in many cases, lead to different “estimates”. One way to frame this difference is to say that the classical and Bayesian approaches have different definitions for what it means to be a good estimator. Both solutions are “correct” in that they are self-consistent. Unfortunately, both methods have their drawbacks, which has led to endless debates about the merit of each approach. For example, Bayesians argue that it does not make sense to consider the classical expectation over data sets, because we only see a single data set. If we saw more than one data set, we should combine them into one larger data set. In contrast, classical statisticians argue that sufficiently accurate priors cannot be assessed in many situations. The common view that seems to be emerging is that one should use whichever method is most sensible for the task at hand. We share this view, although we also believe that the Bayesian approach has been underused, especially in light of its advantages mentioned in the introduction (points three and four). Consequently, in this paper, we concentrate on the Bayesian approach.
3 Bayesian Networks
So far, we have considered only simple problems with one or a few variables. In real learning problems, however, we are typically interested in looking for relationships among a large number of variables. The Bayesian network is a representation suited to this task. It is a graphical model that efficiently encodes the joint probability distribution (physical or Bayesian) for a large set of variables. In this section, we define a Bayesian network and show how one can be constructed from prior knowledge.
A Bayesian network for a set of variables $\mathbf{X} = \{X_1,\ldots,X_n\}$ consists of (1) a network structure $S$ that encodes a set of conditional independence assertions about variables in $\mathbf{X}$, and (2) a set $P$ of local probability distributions associated with each variable. Together, these components define the joint probability distribution for $\mathbf{X}$. The network structure $S$ is a directed acyclic graph. The nodes in $S$ are in one-to-one correspondence with the variables $\mathbf{X}$. We use $X_i$ to denote both the variable and its corresponding node, and $\mathbf{Pa}_i$ to denote the parents of node $X_i$ in $S$ as well as the variables corresponding to those parents. The lack of possible arcs in $S$ encodes conditional independencies. In particular, given structure $S$, the joint probability distribution for $\mathbf{X}$ is given by
$$p(\mathbf{x}) = \prod_{i=1}^n p(x_i|\mathbf{pa}_i) \qquad (16)$$
The local probability distributions $P$ are the distributions corresponding to the terms in the product of Equation 16. Consequently, the pair $(S,P)$ encodes the joint distribution $p(\mathbf{x})$.
The probabilities encoded by a Bayesian network may be Bayesian or physical. When building Bayesian networks from prior knowledge alone, the probabilities will be Bayesian. When learning these networks from data, the probabilities will be physical (and their values may be uncertain). In subsequent sections, we describe how we can learn the structure and probabilities of a Bayesian network from data. In the remainder of this section, we explore the construction of Bayesian networks from prior knowledge. As we shall see in Section 10, this procedure can be useful in learning Bayesian networks as well.
To illustrate the process of building a Bayesian network, consider the problem of detecting credit-card fraud. We begin by determining the variables to model. One possible choice of variables for our problem is Fraud ($F$), Gas ($G$), Jewelry ($J$), Age ($A$), and Sex ($S$), representing whether or not the current purchase is fraudulent, whether or not there was a gas purchase in the last 24 hours, whether or not there was a jewelry purchase in the last 24 hours, and the age and sex of the card holder, respectively. The states of these variables are shown in Figure 3. Of course, in a realistic problem, we would include many more variables. Also, we could model the states of one or more of these variables at a finer level of detail. For example, we could let Age be a continuous variable.
This initial task is not always straightforward. As part of this task we must (1) correctly identify the goals of modeling (e.g., prediction versus explanation versus exploration), (2) identify many possible observations that may be relevant to the problem, (3) determine what subset of those observations is worthwhile to model, and (4) organize the observations into variables having mutually exclusive and collectively exhaustive states. Difficulties here are not unique to modeling with Bayesian networks, but rather are common to most approaches. Although there are no clean solutions, some guidance is offered by decision analysts (e.g., Howard and Matheson, 1983) and (when data are available) statisticians (e.g., Tukey, 1977).
In the next phase of Bayesian-network construction, we build a directed acyclic graph that encodes assertions of conditional independence. One approach for doing so is based on the following observations. From the chain rule of probability, we have
$$p(\mathbf{x}) = \prod_{i=1}^n p(x_i|x_1,\ldots,x_{i-1}) \qquad (17)$$
Now, for every $X_i$, there will be some subset $\Pi_i \subseteq \{X_1,\ldots,X_{i-1}\}$ such that $X_i$ and $\{X_1,\ldots,X_{i-1}\}\setminus\Pi_i$ are conditionally independent given $\Pi_i$. That is, for any $\mathbf{x}$,
$$p(x_i|x_1,\ldots,x_{i-1}) = p(x_i|\boldsymbol{\pi}_i) \qquad (18)$$
Consequently, to determine the structure of a Bayesian network we (1) order the variables somehow, and (2) determine the variable sets $\Pi_i$ that satisfy Equation 18 for $i=1,\ldots,n$. In our example, using the ordering $(F,A,S,G,J)$, we have the conditional independencies
$$p(a|f) = p(a) \qquad p(s|f,a) = p(s) \qquad p(g|f,a,s) = p(g|f) \qquad p(j|f,a,s,g) = p(j|f,a,s)$$
Thus, we obtain the structure shown in Figure 3.
This approach has a serious drawback. If we choose the variable order carelessly, the resulting network structure may fail to reveal many conditional independencies among the variables. For example, if we construct a Bayesian network for the fraud problem using the ordering $(J,G,S,A,F)$, we obtain a fully connected network structure. Thus, in the worst case, we have to explore $n!$ variable orderings to find the best one. Fortunately, there is another technique for constructing Bayesian networks that does not require an ordering. The approach is based on two observations: (1) people can often readily assert causal relationships among variables, and (2) causal relationships typically correspond to assertions of conditional dependence. In particular, to construct a Bayesian network for a given set of variables, we simply draw arcs from cause variables to their immediate effects. In almost all cases, doing so results in a network structure that satisfies the definition of a Bayesian network. For example, given the assertions that Fraud is a direct cause of Gas, and Fraud, Age, and Sex are direct causes of Jewelry, we obtain the network structure in Figure 3. The causal semantics of Bayesian networks are in large part responsible for the success of Bayesian networks as a representation for expert systems (Heckerman et al., 1995a). In Section 15, we will see how to learn causal relationships from data using these causal semantics.
In the final step of constructing a Bayesian network, we assess the local probability distribution(s) $p(x_i|\mathbf{pa}_i)$. In our fraud example, where all variables are discrete, we assess one distribution for $X_i$ for every configuration of $\mathbf{Pa}_i$. Example distributions are shown in Figure 3.
Note that, although we have described these construction steps as a simple sequence, they are often intermingled in practice. For example, judgments of conditional independence and/or cause and effect can influence problem formulation. Also, assessments of probability can lead to changes in the network structure. Exercises that help one gain familiarity with the practice of building Bayesian networks can be found in Jensen (1996).
4 Inference in a Bayesian Network
Once we have constructed a Bayesian network (from prior knowledge, data, or a combination), we usually need to determine various probabilities of interest from the model. For example, in our problem concerning fraud detection, we want to know the probability of fraud given observations of the other variables. This probability is not stored directly in the model, and hence needs to be computed. In general, the computation of a probability of interest given a model is known as probabilistic inference. In this section we describe probabilistic inference in Bayesian networks.
Because a Bayesian network for $\mathbf{X}$ determines a joint probability distribution for $\mathbf{X}$, we can—in principle—use the Bayesian network to compute any probability of interest. For example, from the Bayesian network in Figure 3, the probability of fraud given observations of the other variables can be computed as follows:
$$p(f|a,s,g,j) = \frac{p(f,a,s,g,j)}{p(a,s,g,j)} = \frac{p(f,a,s,g,j)}{\sum_{f'} p(f',a,s,g,j)} \qquad (21)$$
For problems with many variables, however, this direct approach is not practical. Fortunately, at least when all variables are discrete, we can exploit the conditional independencies encoded in a Bayesian network to make this computation more efficient. In our example, given the conditional independencies encoded in the network structure, Equation 21 becomes
$$p(f|a,s,g,j) = \frac{p(f)\,p(a)\,p(s)\,p(g|f)\,p(j|f,a,s)}{\sum_{f'} p(f')\,p(a)\,p(s)\,p(g|f')\,p(j|f',a,s)} = \frac{p(f)\,p(g|f)\,p(j|f,a,s)}{\sum_{f'} p(f')\,p(g|f')\,p(j|f',a,s)} \qquad (22)$$
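This factorized computation is easy to implement. The sketch below uses hypothetical local distributions for binary Fraud, Gas, and Jewelry (the actual numbers in Figure 3 are not reproduced here) and computes the fraud posterior by summing over Fraud alone, since the terms not involving Fraud cancel between numerator and denominator:

```python
# Inference in the fraud network, exploiting the factorization
# p(f, a, s, g, j) = p(f) p(a) p(s) p(g|f) p(j|f, a, s).
# All local probabilities below are hypothetical, for illustration.

p_f = {True: 0.0001, False: 0.9999}        # p(Fraud)
p_g_given_f = {True: 0.2, False: 0.01}     # p(Gas = yes | f)

def p_j_given_fas(f, a, s):
    """Hypothetical p(Jewelry = yes | f, a, s); here it depends on f only."""
    return 0.05 if f else 0.0001

def prob_fraud(a, s, g, j):
    """p(Fraud = yes | a, s, g, j); p(a) and p(s) cancel, so we sum
    only the Fraud-dependent factors over the two states of Fraud."""
    def weight(f):
        pg = p_g_given_f[f] if g else 1.0 - p_g_given_f[f]
        pj = p_j_given_fas(f, a, s) if j else 1.0 - p_j_given_fas(f, a, s)
        return p_f[f] * pg * pj
    w_true, w_false = weight(True), weight(False)
    return w_true / (w_true + w_false)

posterior = prob_fraud(a=30, s="male", g=True, j=True)
```

Even this toy version shows the point of the simplification: the sum ranges over the states of one variable, not over all joint configurations.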
Several researchers have developed probabilistic inference algorithms for Bayesian networks with discrete variables that exploit conditional independence roughly as we have described, although with different twists. For example, Howard and Matheson (1981), Olmsted (1983), and Shachter (1988) developed an algorithm that reverses arcs in the network structure until the answer to the given probabilistic query can be read directly from the graph. In this algorithm, each arc reversal corresponds to an application of Bayes’ theorem. Pearl (1986) developed a message-passing scheme that updates the probability distributions for each node in a Bayesian network in response to observations of one or more variables. Lauritzen and Spiegelhalter (1988), Jensen et al. (1990), and Dawid (1992) created an algorithm that first transforms the Bayesian network into a tree where each node in the tree corresponds to a subset of variables in $\mathbf{X}$. The algorithm then exploits several mathematical properties of this tree to perform probabilistic inference. Most recently, D’Ambrosio (1991) developed an inference algorithm that simplifies sums and products symbolically, as in the transformation from Equation 21 to 22. The most commonly used algorithm for discrete variables is that of Lauritzen and Spiegelhalter (1988), Jensen et al. (1990), and Dawid (1992).
Methods for exact inference in Bayesian networks that encode multivariate-Gaussian or Gaussian-mixture distributions have been developed by Shachter and Kenley (1989) and Lauritzen (1992), respectively. These methods also use assertions of conditional independence to simplify inference. Approximate methods for inference in Bayesian networks with other distributions, such as the generalized linear-regression model, have also been developed (Saul et al., 1996; Jaakkola and Jordan, 1996).
Although we use conditional independence to simplify probabilistic inference, exact inference in an arbitrary Bayesian network for discrete variables is NP-hard (Cooper, 1990). Even approximate inference (for example, Monte-Carlo methods) is NP-hard (Dagum and Luby, 1993). The source of the difficulty lies in undirected cycles in the Bayesian-network structure—cycles in the structure where we ignore the directionality of the arcs. (If we add an arc from Age to Gas in the network structure of Figure 3, then we obtain a structure with one undirected cycle: Fraud–Gas–Age–Fraud.) When a Bayesian-network structure contains many undirected cycles, inference is intractable. For many applications, however, structures are simple enough (or can be simplified sufficiently without sacrificing much accuracy) so that inference is efficient. For those applications where generic inference methods are impractical, researchers are developing techniques that are custom tailored to particular network topologies (Heckerman, 1989; Suermondt and Cooper, 1991; Saul et al., 1996; Jaakkola and Jordan, 1996) or to particular inference queries (Ramamurthi and Agogino, 1988; Shachter et al., 1990; Jensen and Andersen, 1990; Darwiche and Provan, 1996).
5 Learning Probabilities in a Bayesian Network
In the next several sections, we show how to refine the structure and local probability distributions of a Bayesian network given data. The result is a set of techniques for data analysis that combines prior knowledge with data to produce improved knowledge. In this section, we consider the simplest version of this problem: using data to update the probabilities of a given Bayesian network structure.
Recall that, in the thumbtack problem, we do not learn the probability of heads. Instead, we update our posterior distribution for the variable that represents the physical probability of heads. We follow the same approach for probabilities in a Bayesian network. In particular, we assume—perhaps from causal knowledge about the problem—that the physical joint probability distribution for X can be encoded in some network structure S. We write

p(x | θ_s, S^h) = ∏_{i=1}^n p(x_i | pa_i, θ_i, S^h)
where θ_i is the vector of parameters for the distribution p(x_i | pa_i, θ_i, S^h), θ_s is the vector of parameters (θ_1, …, θ_n), and S^h denotes the event (or "hypothesis" in statistics nomenclature) that the physical joint probability distribution can be factored according to S. [Footnote 8: As defined here, network-structure hypotheses overlap. For example, given X = {X_1, X_2}, any joint distribution for X that can be factored according to the network structure containing no arc can also be factored according to the network structure X_1 → X_2. Such overlap presents problems for model averaging, described in Section 7. Therefore, we should add conditions to the definition to ensure no overlap. Heckerman and Geiger (1996) describe one such set of conditions.] In addition, we assume that we have a random sample D = {x_1, …, x_N} from the physical joint probability distribution of X. We refer to an element x_l of D as a case. As in Section 2, we encode our uncertainty about the parameters θ_s by defining a (vector-valued) variable Θ_s
, and assessing a prior probability density function p(θ_s | S^h). The problem of learning probabilities in a Bayesian network can now be stated simply: Given a random sample D, compute the posterior distribution p(θ_s | D, S^h).
We refer to the distribution p(x_i | pa_i, θ_i, S^h), viewed as a function of θ_i, as a local distribution function. Readers familiar with methods for supervised learning will recognize that a local distribution function is nothing more than a probabilistic classification or regression function. Thus, a Bayesian network can be viewed as a collection of probabilistic classification/regression models, organized by conditional-independence relationships. Examples of classification/regression models that produce probabilistic outputs include linear regression, generalized linear regression, probabilistic neural networks (e.g., MacKay, 1992a, 1992b), probabilistic decision trees (e.g., Buntine, 1993; Friedman and Goldszmidt, 1996), kernel density estimation methods (Book, 1994), and dictionary methods (Friedman, 1995). In principle, any of these forms can be used to learn probabilities in a Bayesian network; and, in most cases, Bayesian techniques for learning are available. Nonetheless, the most studied models include the unrestricted multinomial distribution (e.g., Cooper and Herskovits, 1992), linear regression with Gaussian noise (e.g., Buntine, 1994; Heckerman and Geiger, 1996), and generalized linear regression (e.g., MacKay, 1992a and 1992b; Neal, 1993; and Saul et al., 1996).
In this tutorial, we illustrate the basic ideas for learning probabilities (and structure) using the unrestricted multinomial distribution. In this case, each variable X_i is discrete, having r_i possible values x_i^1, …, x_i^{r_i}, and each local distribution function is a collection of multinomial distributions, one distribution for each configuration of Pa_i. Namely, we assume

p(x_i^k | pa_i^j, θ_i, S^h) = θ_ijk > 0
where pa_i^1, …, pa_i^{q_i} denote the q_i configurations of Pa_i, and the θ_ijk are the parameters. (One parameter in each multinomial distribution is determined by the others, because the θ_ijk must sum to one over k.) For convenience, we define the vector of parameters
θ_ij = (θ_ij2, …, θ_ijr_i) for all i and j. We use the term "unrestricted" to contrast this distribution with multinomial distributions that are low-dimensional functions of Pa_i—for example, the generalized linear-regression model.
Given this class of local distribution functions, we can compute the posterior distribution p(θ_s | D, S^h) efficiently and in closed form under two assumptions. The first assumption is that there are no missing data in the random sample D. We say that the random sample D is complete. The second assumption is that the parameter vectors θ_ij are mutually independent. [Footnote 9: The computation is also straightforward if two or more parameters are equal. For details, see Thiesson (1995).] That is,

p(θ_s | S^h) = ∏_{i=1}^n ∏_{j=1}^{q_i} p(θ_ij | S^h)
We refer to this assumption, which was introduced by Spiegelhalter and Lauritzen (1990), as parameter independence.
Given that the joint physical probability distribution factors according to some network structure S, the assumption of parameter independence can itself be represented by a larger Bayesian-network structure. For example, the network structure in Figure 4 represents the assumption of parameter independence for X = {X, Y} (X and Y binary) and the hypothesis that the network structure X → Y encodes the physical joint probability distribution for X.
Under the assumptions of complete data and parameter independence, the parameters remain independent given a random sample:

p(θ_s | D, S^h) = ∏_{i=1}^n ∏_{j=1}^{q_i} p(θ_ij | D, S^h)
Thus, we can update each vector of parameters θ_ij independently, just as in the one-variable case. Assuming each vector θ_ij has the prior distribution Dir(θ_ij | α_ij1, …, α_ijr_i), we obtain the posterior distribution

p(θ_ij | D, S^h) = Dir(θ_ij | α_ij1 + N_ij1, …, α_ijr_i + N_ijr_i)
where N_ijk is the number of cases in D in which X_i = x_i^k and Pa_i = pa_i^j.
As in the thumbtack example, we can average over the possible configurations of θ_s to obtain predictions of interest. For example, let us compute p(x_{N+1} | D, S^h), where x_{N+1} is the next case to be seen after D. Suppose that, in case x_{N+1}, X_i = x_i^{k_i} and Pa_i = pa_i^{j_i}, where k_i and j_i depend on i. Thus,

p(x_{N+1} | D, S^h) = E_{p(θ_s | D, S^h)} ( ∏_{i=1}^n θ_{i j_i k_i} )
To compute this expectation, we first use the fact that the parameters remain independent given D:

p(x_{N+1} | D, S^h) = ∏_{i=1}^n ∫ θ_{i j_i k_i} p(θ_{i j_i} | D, S^h) dθ_{i j_i}
Then, we use Equation 12 to obtain

p(x_{N+1} | D, S^h) = ∏_{i=1}^n (α_{i j_i k_i} + N_{i j_i k_i}) / (α_{i j_i} + N_{i j_i})
where α_{i j_i} = Σ_{k=1}^{r_i} α_{i j_i k} and N_{i j_i} = Σ_{k=1}^{r_i} N_{i j_i k}.
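The closed-form update and prediction for a single parameter vector can be sketched concretely. In this minimal example, the Dirichlet exponents and the data counts are invented for illustration; the posterior simply adds counts to exponents, and the predictive probabilities are the normalized posterior exponents.

```python
# Closed-form Dirichlet update for one parameter vector theta_ij,
# with invented prior exponents (alpha) and data counts (N).

alpha = [1, 1, 1]      # prior exponents alpha_ij1, ..., alpha_ijr
counts = [10, 4, 6]    # N_ijk: cases with X_i = x_i^k and Pa_i = pa_i^j

# Posterior is Dir(alpha_1 + N_1, ..., alpha_r + N_r): just add counts.
posterior = [a + n for a, n in zip(alpha, counts)]   # [11, 5, 7]

# Predictive probability of each state x_i^k given Pa_i = pa_i^j:
# the normalized posterior exponents (alpha_ijk + N_ijk) / (alpha_ij + N_ij).
total = sum(posterior)
predictive = [p / total for p in posterior]          # 11/23, 5/23, 7/23

print(posterior)
print(predictive)
```

Because the parameter vectors are independent given complete data, the full network is learned by repeating this update once per variable-parent-configuration pair.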
These computations are simple because the unrestricted multinomial distributions are in the exponential family. Computations for linear regression with Gaussian noise are equally straightforward (Buntine, 1994; Heckerman and Geiger, 1996).
6 Methods for Incomplete Data
Let us now discuss methods for learning about parameters when the random sample is incomplete (i.e., some variables in some cases are not observed). An important distinction concerning missing data is whether or not the absence of an observation is dependent on the actual states of the variables. For example, a missing datum in a drug study may indicate that a patient became too sick—perhaps due to the side effects of the drug—to continue in the study. In contrast, if a variable is hidden (i.e., never observed in any case), then the absence of this data is independent of state. Although Bayesian methods and graphical models are suited to the analysis of both situations, methods for handling missing data where absence is independent of state are simpler than those where absence and state are dependent. In this tutorial, we concentrate on the simpler situation only. Readers interested in the more complicated case should see Rubin (1978), Robins (1986), and Pearl (1995).
Continuing with our example using unrestricted multinomial distributions, suppose we observe a single incomplete case. Let Y and Z denote the observed and unobserved variables in the case, respectively. Under the assumption of parameter independence, we can compute the posterior distribution of θ_ij for network structure S as follows:
(See Spiegelhalter and Lauritzen (1990) for a derivation.) Each term in curly brackets in Equation 6 is a Dirichlet distribution. Thus, unless both X_i and all the variables in Pa_i are observed in the case, the posterior distribution of θ_ij will be a linear combination of Dirichlet distributions—that is, a Dirichlet mixture whose mixing coefficients are the posterior probabilities of the possible configurations of the unobserved variables.
When we observe a second incomplete case, some or all of the Dirichlet components in Equation 6 will again split into Dirichlet mixtures. That is, the posterior distribution for θ_ij will become a mixture of Dirichlet mixtures. As we continue to observe incomplete cases, each with missing values for Z, the posterior distribution for θ_s will contain a number of components that is exponential in the number of cases. In general, for any interesting set of local likelihoods and priors, the exact computation of the posterior distribution for θ_s will be intractable. Thus, we require an approximation for incomplete data.
6.1 Monte-Carlo Methods
One class of approximations is based on Monte-Carlo or sampling methods. These approximations can be extremely accurate, provided one is willing to wait long enough for the computations to converge.
In this section, we discuss one of many Monte-Carlo methods known as Gibbs sampling, introduced by Geman and Geman (1984). Given variables X = {X_1, …, X_n} with some joint distribution p(x), we can use a Gibbs sampler to approximate the expectation of a function f(x) with respect to p(x) as follows. First, we choose an initial state for each of the variables in X somehow (e.g., at random). Next, we pick some variable X_i, unassign its current state, and compute its probability distribution given the states of the other n − 1 variables. Then, we sample a state for X_i based on this probability distribution, and compute f(x). Finally, we iterate the previous two steps, keeping track of the average value of f(x). In the limit, as the number of samples approaches infinity, this average is equal to E_{p(x)}(f(x)), provided two conditions are met. First, the Gibbs sampler must be irreducible: The probability distribution p(x) must be such that we can eventually sample any possible configuration of X given any possible initial configuration of X. For example, if p(x) contains no zero probabilities, then the Gibbs sampler will be irreducible. Second, each X_i must be chosen infinitely often. In practice, an algorithm for deterministically rotating through the variables is typically used. Introductions to Gibbs sampling and other Monte-Carlo methods—including methods for initialization and a discussion of convergence—are given by Neal (1993) and Madigan and York (1995).
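The generic procedure above can be sketched for the smallest interesting case: two binary variables with a known joint distribution (the values below are invented for illustration). The sampler rotates deterministically through the variables, resamples each from its conditional given the other, and averages f(x) = x_1, so the running average approximates the marginal p(X_1 = 1).

```python
import random

# A minimal Gibbs sampler for two binary variables with an invented
# joint distribution p(x1, x2). We approximate E[f] for f(x) = x1,
# i.e., the marginal p(X1 = 1), which here is 0.1 + 0.4 = 0.5.

joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def conditional(i, other_value):
    """p(x_i = 1 | the other variable = other_value), from the joint."""
    if i == 0:
        num = joint[(1, other_value)]
        den = joint[(0, other_value)] + joint[(1, other_value)]
    else:
        num = joint[(other_value, 1)]
        den = joint[(other_value, 0)] + joint[(other_value, 1)]
    return num / den

def gibbs_estimate(sweeps=50000, seed=0):
    rng = random.Random(seed)
    x = [0, 0]                       # arbitrary initial state
    total = 0.0
    for _ in range(sweeps):
        for i in (0, 1):             # deterministically rotate through variables
            p1 = conditional(i, x[1 - i])
            x[i] = 1 if rng.random() < p1 else 0
        total += x[0]                # accumulate f(x) = x1
    return total / sweeps

print(gibbs_estimate())  # close to the true marginal p(X1 = 1) = 0.5
```

Note that the joint contains no zero probabilities, so the sampler is irreducible, as the irreducibility condition in the text requires.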
To illustrate Gibbs sampling, let us approximate the probability density p(θ_s | D, S^h) for some particular configuration of θ_s, given an incomplete data set D and a Bayesian network for discrete variables with independent Dirichlet priors. To approximate p(θ_s | D, S^h), we first initialize the states of the unobserved variables in each case somehow. As a result, we have a complete random sample D_c. Second, we choose some variable X_il (variable X_i in case l) that is not observed in the original random sample D, and reassign its state according to the probability distribution
where D_c − {x_il} denotes the data set D_c with observation x_il removed, and the sum in the denominator runs over all states of variable X_il. As we shall see in Section 7, the terms in the numerator and denominator can be computed efficiently (see Equation 35). Third, we repeat this reassignment for all unobserved variables in D, producing a new complete random sample D_c'. Fourth, we compute the posterior density p(θ_s | D_c', S^h) as described in Equations 25 and 26. Finally, we iterate the previous three steps, and use the average of p(θ_s | D_c', S^h) as our approximation.
6.2 The Gaussian Approximation
Monte-Carlo methods yield accurate results, but they are often intractable—for example, when the sample size is large. Another approximation that is more efficient than Monte-Carlo methods and often accurate for relatively large samples is the Gaussian approximation (e.g., Kass et al., 1988; Kass and Raftery, 1995).
The idea behind this approximation is that, for large amounts of data, p(θ_s | D, S^h) ∝ p(D | θ_s, S^h) · p(θ_s | S^h) can often be approximated as a multivariate-Gaussian distribution. In particular, let

g(θ_s) ≡ log( p(D | θ_s, S^h) · p(θ_s | S^h) )

Also, define θ̃_s to be the configuration of θ_s that maximizes g(θ_s). This configuration also maximizes p(θ_s | D, S^h), and is known as the maximum a posteriori (MAP) configuration of θ_s. Using a second-degree Taylor polynomial of g(θ_s) about θ̃_s to approximate g(θ_s), we obtain

g(θ_s) ≈ g(θ̃_s) − (1/2) (θ_s − θ̃_s) A (θ_s − θ̃_s)^t
where (θ_s − θ̃_s)^t is the transpose of the row vector (θ_s − θ̃_s), and A is the negative Hessian of g(θ_s) evaluated at θ̃_s. Exponentiating g(θ_s) and using Equation 29, we obtain

p(θ_s | D, S^h) ∝ p(D | θ_s, S^h) p(θ_s | S^h) ≈ p(D | θ̃_s, S^h) p(θ̃_s | S^h) exp( −(1/2) (θ_s − θ̃_s) A (θ_s − θ̃_s)^t )
Hence, p(θ_s | D, S^h) is approximately Gaussian.
To compute the Gaussian approximation, we must compute θ̃_s as well as the negative Hessian of g(θ_s) evaluated at θ̃_s. In the following section, we discuss methods for finding θ̃_s. Meng and Rubin (1991) describe a numerical technique for computing the second derivatives. Raftery (1995) shows how to approximate the Hessian using likelihood-ratio tests that are available in many statistical packages. Thiesson (1995) demonstrates that, for unrestricted multinomial distributions, the second derivatives can be computed using Bayesian-network inference.
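For a one-parameter posterior, the Gaussian approximation can be computed in a few lines. The following sketch uses a thumbtack-style log posterior proportional to h·log θ + t·log(1 − θ), with invented counts h and t; the MAP is available in closed form, and the one-by-one "Hessian" is estimated by a central finite difference, as in the numerical approach mentioned above.

```python
import math

# Gaussian (Laplace) approximation sketch for a one-parameter posterior:
# log p(theta | D) = h*log(theta) + t*log(1 - theta) + const,
# with invented counts (h heads, t tails).

h, t = 30, 10

def log_post(theta):
    return h * math.log(theta) + t * math.log(1 - theta)

# MAP configuration (here available in closed form): h / (h + t).
theta_map = h / (h + t)

# Negative second derivative of log_post at the MAP, estimated by a
# central finite difference (the 1x1 analogue of the negative Hessian A).
eps = 1e-5
hessian = -(log_post(theta_map + eps) - 2 * log_post(theta_map)
            + log_post(theta_map - eps)) / eps**2

# The approximating Gaussian has mean theta_map and variance 1/A.
variance = 1 / hessian
print(theta_map, variance)
```

The analytic negative second derivative is h/θ² + t/(1 − θ)², so the finite-difference estimate can be checked directly; for many parameters one computes a matrix of such second differences instead.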
6.3 The MAP and ML Approximations and the EM Algorithm
As the sample size of the data increases, the Gaussian peak will become sharper, tending to a delta function at the MAP configuration θ̃_s. In this limit, we do not need to compute averages or expectations. Instead, we simply make predictions based on the MAP configuration.
A further approximation is based on the observation that, as the sample size increases, the effect of the prior p(θ_s | S^h) diminishes. Thus, we can approximate θ̃_s by the maximum likelihood (ML) configuration of θ_s:

θ̂_s = arg max_{θ_s} p(D | θ_s, S^h)
One class of techniques for finding an ML or MAP configuration is gradient-based optimization. For example, we can use gradient ascent, where we follow the derivatives of g(θ_s) or the likelihood p(D | θ_s, S^h) to a local maximum. Russell et al. (1995) and Thiesson (1995) show how to compute the derivatives of the likelihood for a Bayesian network with unrestricted multinomial distributions. Buntine (1994) discusses the more general case where the likelihood function comes from the exponential family. Of course, these gradient-based methods find only local maxima.
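Gradient ascent on the log likelihood can be sketched for the simplest possible case: a single binary variable. The example below (with invented counts) reparameterizes θ through a logit so the ascent is unconstrained, which parallels the canonical-coordinate idea used later for MAP estimation; the ML configuration is just the observed frequency.

```python
import math

# Gradient ascent on the log likelihood for a single binary variable,
# using the logit parameterization theta = sigmoid(phi) so the search
# is unconstrained. Counts are invented; the ML value is heads/(heads+tails).

heads, tails = 7, 3

def grad(phi):
    """d/dphi of log L, where log L = heads*log(theta) + tails*log(1-theta)."""
    theta = 1 / (1 + math.exp(-phi))
    # Chain rule: dlogL/dphi = heads*(1 - theta) - tails*theta
    return heads * (1 - theta) - tails * theta

phi = 0.0                      # arbitrary starting point
for _ in range(2000):
    phi += 0.05 * grad(phi)    # fixed step size, small enough to converge here

theta_ml = 1 / (1 + math.exp(-phi))
print(theta_ml)  # converges to heads/(heads+tails) = 0.7
```

With many parameters one would follow the full gradient vector instead, and, as the text notes, such methods only guarantee a local maximum.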
Another technique for finding a local ML or MAP configuration is the expectation-maximization (EM) algorithm (Dempster et al., 1977). To find a local MAP or ML configuration, we begin by assigning a configuration to θ_s somehow (e.g., at random). Next, we compute the expected sufficient statistics for a complete data set, where expectation is taken with respect to the joint distribution for X conditioned on the assigned configuration of θ_s and the known data D. In our discrete example, we compute

E(N_ijk | D, θ_s, S^h) = Σ_{l=1}^N p(x_i^k, pa_i^j | y_l, θ_s, S^h)
where y_l is the possibly incomplete lth case in D. When X_i and all the variables in Pa_i are observed in case y_l, the term for this case requires a trivial computation: it is either zero or one. Otherwise, we can use any Bayesian network inference algorithm to evaluate the term. This computation is called the expectation step of the EM algorithm.
Next, we use the expected sufficient statistics as if they were actual sufficient statistics from a complete random sample D_c. If we are doing an ML calculation, then we determine the configuration of θ_s that maximizes p(D_c | θ_s, S^h). In our discrete example, we have

θ_ijk = E(N_ijk | D, θ_s, S^h) / Σ_{k'=1}^{r_i} E(N_ijk' | D, θ_s, S^h)
If we are doing a MAP calculation, then we determine the configuration of θ_s that maximizes p(θ_s | D_c, S^h). [Footnote 10: The MAP configuration depends on the coordinate system in which the parameter variables are expressed. The expression for the MAP configuration given here is obtained by the following procedure. First, we transform each variable set θ_ij to a new coordinate system φ_ij. This coordinate system, which we denote by φ_s, is sometimes referred to as the canonical coordinate system for the multinomial distribution (see, e.g., Bernardo and Smith, 1994, pp. 199–202). Next, we determine the configuration of φ_s that maximizes p(φ_s | D_c, S^h). Finally, we transform this MAP configuration to the original coordinate system. Using the MAP configuration corresponding to the coordinate system φ_s has several advantages, which are discussed in Thiesson (1995b) and MacKay (1996).]
This assignment is called the maximization step of the EM algorithm. Dempster et al. (1977) showed that, under certain regularity conditions, iteration of the expectation and maximization steps will converge to a local maximum. The EM algorithm is typically applied when sufficient statistics exist (i.e., when local distribution functions are in the exponential family), although generalizations of the EM algorithm have been used for more complicated local distributions (see, e.g., Saul et al. 1996).
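The two steps can be sketched end to end on a tiny network X → Y with binary variables, where X is missing in some cases. The data, starting values, and number of iterations below are all invented for illustration; the E step fills in expected counts using Bayes' theorem (the "trivial" zero-or-one computation when X is observed), and the M step re-estimates the ML parameters from those counts.

```python
# A minimal EM sketch for a two-variable network X -> Y (binary),
# with X missing in some cases. Data and starting values are invented.

data = [(1, 1), (0, 0), (None, 1), (None, 1), (0, 1), (None, 0), (1, 1), (None, 1)]

theta_x = 0.5                  # p(X=1), initial guess
theta_y = {0: 0.5, 1: 0.5}     # p(Y=1 | X=x), initial guess

for _ in range(200):
    # E step: expected sufficient statistics. When X is missing, its
    # expected indicator is p(X=1 | y) by Bayes' theorem; when X is
    # observed, the weight is exactly zero or one.
    n_x1 = 0.0
    n_xy = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.0}
    for x, y in data:
        if x is None:
            p1 = theta_x * (theta_y[1] if y else 1 - theta_y[1])
            p0 = (1 - theta_x) * (theta_y[0] if y else 1 - theta_y[0])
            w1 = p1 / (p0 + p1)          # p(X=1 | y, current parameters)
        else:
            w1 = float(x)
        n_x1 += w1
        n_xy[(1, y)] += w1
        n_xy[(0, y)] += 1 - w1
    # M step: maximize the expected complete-data log likelihood
    # (expected counts normalized, as in the ML expression above).
    theta_x = n_x1 / len(data)
    theta_y = {x: n_xy[(x, 1)] / (n_xy[(x, 0)] + n_xy[(x, 1)]) for x in (0, 1)}

print(theta_x, theta_y)
```

Each iteration provably does not decrease the likelihood; for a MAP calculation one would add the Dirichlet exponents to the expected counts before normalizing.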
7 Learning Parameters and Structure
Now we consider the problem of learning about both the structure and probabilities of a Bayesian network given data.
Assuming we think structure can be improved, we must be uncertain about the network structure that encodes the physical joint probability distribution for X. Following the Bayesian approach, we encode this uncertainty by defining a (discrete) variable whose states correspond to the possible network-structure hypotheses S^h, and assessing the probabilities p(S^h). Then, given a random sample D from the physical probability distribution for X, we compute the posterior distribution p(S^h | D) and the posterior distributions p(θ_s | D, S^h), and use these distributions in turn to compute expectations of interest. For example, to predict the next case after seeing D, we compute

p(x_{N+1} | D) = Σ_{S^h} p(S^h | D) ∫ p(x_{N+1} | θ_s, S^h) p(θ_s | D, S^h) dθ_s
In performing the sum, we assume that the network-structure hypotheses are mutually exclusive. We return to this point in Section 9.
The computation of p(θ_s | D, S^h) is as we have described in the previous two sections. The computation of p(S^h | D) is also straightforward, at least in principle. From Bayes' theorem, we have

p(S^h | D) = p(S^h) p(D | S^h) / p(D)
where p(D) is a normalization constant that does not depend upon structure. Thus, to determine the posterior distribution for network structures, we need to compute the marginal likelihood of the data p(D | S^h) for each possible structure.
We discuss the computation of marginal likelihoods in detail in Section 9. As an introduction, consider our example with unrestricted multinomial distributions, parameter independence, Dirichlet priors, and complete data. As we have discussed, when there are no missing data, each parameter vector θ_ij is updated independently. In effect, we have a separate multi-sided thumbtack problem for every i and j. Consequently, the marginal likelihood of the data is just the product of the marginal likelihoods for each i–j pair (given by Equation 13):

p(D | S^h) = ∏_{i=1}^n ∏_{j=1}^{q_i} [ Γ(α_ij) / Γ(α_ij + N_ij) ] ∏_{k=1}^{r_i} [ Γ(α_ijk + N_ijk) / Γ(α_ijk) ]
This formula was first derived by Cooper and Herskovits (1992).
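One factor of this closed-form score is easy to compute in log space, which is how it is done in practice to avoid overflow in the Gamma functions. The sketch below evaluates a single i–j term of the product with invented Dirichlet exponents and counts; the full marginal likelihood multiplies one such term per variable-parent-configuration pair.

```python
from math import lgamma, exp

# One i-j factor of the closed-form marginal likelihood:
# Gamma(a_ij)/Gamma(a_ij + N_ij) * prod_k Gamma(a_ijk + N_ijk)/Gamma(a_ijk),
# computed in log space. Exponents and counts below are invented.

def log_marginal_term(alpha, counts):
    """Log of one i-j factor, given exponents alpha_ijk and counts N_ijk."""
    a, n = sum(alpha), sum(counts)
    value = lgamma(a) - lgamma(a + n)
    for a_k, n_k in zip(alpha, counts):
        value += lgamma(a_k + n_k) - lgamma(a_k)
    return value

# With uniform exponents alpha = (1, 1) and counts (2, 1), the factor is
# Gamma(2)/Gamma(5) * Gamma(3)*Gamma(2)/(Gamma(1)*Gamma(1)) = (1/24)*2 = 1/12.
term = exp(log_marginal_term([1, 1], [2, 1]))
print(term)  # 1/12, approximately 0.0833
```

Summing the log terms over all i and j gives log p(D | S^h), which can then be compared across candidate structures for model selection.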
Unfortunately, the full Bayesian approach that we have described is often impractical. One important computational bottleneck is produced by the average over models in Equation 33. If we consider Bayesian-network models with n variables, the number of possible structure hypotheses is more than exponential in n. Consequently, in situations where the user cannot exclude almost all of these hypotheses, the approach is intractable.
Statisticians, who have been confronted by this problem for decades in the context of other types of models, use two approaches to address this problem: model selection and selective model averaging. The former approach is to select a “good” model (i.e., structure hypothesis) from among all possible models, and use it as if it were the correct model. The latter approach is to select a manageable number of good models from among all possible models and pretend that these models are exhaustive. These related approaches raise several important questions. In particular, do these approaches yield accurate results when applied to Bayesian-network structures? If so, how do we search for good models? And how do we decide whether or not a model is “good”?
The question of accuracy is difficult to answer in theory. Nonetheless, several researchers have shown experimentally that the selection of a single good hypothesis often yields accurate predictions (Cooper and Herskovits 1992; Aliferis and Cooper 1994; Heckerman et al., 1995b) and that model averaging using Monte-Carlo methods can sometimes be efficient and yield even better predictions (Madigan et al., 1996). These results are somewhat surprising, and are largely responsible for the great deal of recent interest in learning with Bayesian networks. In Sections 8 through 10, we consider different definitions of what it means for a model to be "good", and discuss the computations entailed by some of these definitions. In Section 11, we discuss model search.
We note that model averaging and model selection lead to models that generalize well to new data. That is, these techniques help us to avoid the overfitting of data. As is suggested by Equation 33, Bayesian methods for model averaging and model selection are efficient in the sense that all cases in can be used to both smooth and train the model. As we shall see in the following two sections, this advantage holds true for the Bayesian approach in general.
8 Criteria for Model Selection
Most of the literature on learning with Bayesian networks is concerned with model selection. In these approaches, some criterion is used to measure the degree to which a network structure (equivalence class) fits the prior knowledge and data. A search algorithm is then used to find an equivalence class that receives a high score by this criterion. Selective model averaging is more complex, because it is often advantageous to identify network structures that are significantly different. In many cases, a single criterion is unlikely to identify such complementary network structures. In this section, we discuss criteria for the simpler problem of model selection. For a discussion of selective model averaging, see Madigan and Raftery (1994).
8.1 Relative Posterior Probability
A criterion that is often used for model selection is the log of the relative posterior probability.111111An equivalent criterion that is often used is