The Computational Power of Dynamic Bayesian Networks

March 19, 2016 · Joshua Brulé · University of Maryland

This paper considers the computational power of constant size, dynamic Bayesian networks. Although discrete dynamic Bayesian networks are no more powerful than hidden Markov models, dynamic Bayesian networks with continuous random variables and discrete children of continuous parents are capable of performing Turing-complete computation. With modified versions of existing algorithms for belief propagation, such a simulation can be carried out in real time. This result suggests that dynamic Bayesian networks may be more powerful than previously considered. Relationships to causal models and recurrent neural networks are also discussed.


1 Introduction

Bayesian networks are probabilistic graphical models that represent a set of random variables and their conditional dependencies via a directed acyclic graph. Explicitly modeling the conditional dependencies between random variables permits efficient algorithms to perform inference and learning in the network. Causal Bayesian networks have the additional requirement that all edges in the network model a causal relationship.

Dynamic Bayesian networks are the time-generalization of Bayesian networks and relate variables to each other over adjacent time steps. Dynamic Bayesian networks unify and extend a number of state-space models including hidden Markov models, hierarchical hidden Markov models and Kalman filters. Dynamic Bayesian networks can also be seen as the natural extension of acyclic causal models to models that permit cyclic causal relationships, while avoiding problems with causal models that try to model temporal relationships with an atemporal description [1].

A natural question is: what is the expressive power of such networks? The result in this paper shows that although discrete dynamic Bayesian networks are sub-Turing in computational power, introducing continuous random variables with discrete children is sufficient to model Turing-complete computation. In addition, the distributions used in the construction are such that the marginal posterior probabilities of random variables in the network can be effectively computed with modified versions of existing algorithms. Ignoring the overhead from arbitrary precision arithmetic, the simulation can be conducted with only a constant time penalty.

2 The Model and Main Results

A Bayesian network [2] consists of a directed acyclic graph $G$ over a set $V$ of vertices and a probability distribution $P$ over the set of variables $X = \{X_1, \ldots, X_n\}$ that correspond to the vertices in $G$. A Bayesian network "factorizes" the probability distribution over its variables by requiring that each variable, $X_i$, is conditionally independent of its non-descendants, given its parents (denoted $Pa(X_i)$). This is the Markov condition [3]:

$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid Pa(X_i))$$
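As an illustration of this factorization (not part of the original paper), the following sketch computes the joint probability of a small, hypothetical three-node network A → B, A → C directly from its conditional probability tables; all numbers are made up.

```python
# Hypothetical example (not from the paper): joint probability of a
# three-node Bayesian network A -> B, A -> C, factored via the Markov
# condition as P(A, B, C) = P(A) * P(B | A) * P(C | A).

p_a = {0: 0.7, 1: 0.3}                       # P(A)
p_b_given_a = {0: {0: 0.9, 1: 0.1},          # P(B | A = 0)
               1: {0: 0.4, 1: 0.6}}          # P(B | A = 1)
p_c_given_a = {0: {0: 0.8, 1: 0.2},          # P(C | A = 0)
               1: {0: 0.5, 1: 0.5}}          # P(C | A = 1)

def joint(a, b, c):
    """P(A=a, B=b, C=c) as the product of each node's CPT entry."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_a[a][c]

# The factored joint sums to 1 over all assignments.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(joint(1, 0, 1), total)  # 0.06, 1.0
```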

Dynamic Bayesian networks (DBNs) extend Bayesian networks to model a probability distribution over a semi-infinite collection of random variables, with each collection of random variables modeling the system at a point in time [4]. Following the conventions in [5], the collections are denoted $Z_t = (U_t, X_t, Y_t)$ and the variables are partitioned to represent the input, hidden and output variables of a state-space model. Such a network is "dynamic" in the sense that it can model a dynamic system, not that the network topology changes over time.

A DBN is defined as a pair $(B_1, B_{\to})$, where $B_1$ is a Bayesian network that defines the prior $P(Z_1)$ and $B_{\to}$ is a two-slice temporal Bayes net (2TBN) that defines $P(Z_t \mid Z_{t-1})$ via a directed acyclic graph:

$$P(Z_t \mid Z_{t-1}) = \prod_{i=1}^{N} P(Z_t^i \mid Pa(Z_t^i))$$

where $Z_t^i$ is the $i$-th node at time $t$, and $Pa(Z_t^i)$ are the parents of $Z_t^i$ in the graph. The parents of a node can either be in the same time slice or in the previous time slice (i.e. the model is first-order Markov).

The semantics of a DBN can be defined by "unrolling" the 2TBN until there are $T$ time slices; the joint distribution is then given by:

$$P(Z_{1:T}) = \prod_{t=1}^{T} \prod_{i=1}^{N} P(Z_t^i \mid Pa(Z_t^i))$$

Analyzing the computational power of a DBN requires defining what it means for a DBN to accept (and halt) or reject an input. Define an input sequence $U_{1:T}$ of Bernoulli random variables to model the (binary) input. Similarly, define an output sequence $Y_{1:T}$ to represent whether the machine has halted and the answer that it gives. Given an input $u$ to a decision problem, the machine modeled by the DBN has halted and accepted at time $t$ if and only if $P(Y_t = \mathrm{accept} \mid U_{1:t} = u_{1:t}) = 1$, and has halted and rejected if and only if $P(Y_t = \mathrm{reject} \mid U_{1:t} = u_{1:t}) = 1$.

2.1 Discrete dynamic Bayesian networks are not Turing-complete

“Discrete” Bayesian networks are Bayesian networks where all random variables have some finite number of outcomes, i.e. Bernoulli or categorical random variables. If dynamic Bayesian networks are permitted to increase the number of random variables over time, then simulating a Turing machine becomes trivial: simply add a new variable each time step to model a newly reachable cell on the Turing machine’s tape. However, this requires some ‘first-order’ features in the language used to specify the network, and the computational effort required at each step of the simulation will grow without bound.

With a fixed number of random variables at each time step and the property that DBNs are first-order Markov, the computational effort per step remains constant. However, discrete DBNs have sub-Turing computational power. Intuitively, a discrete DBN cannot possibly simulate a Turing machine since there is no way to store the contents of the machine’s tape.

More formally, any discrete Bayesian network can be converted into a hidden Markov model [5]. This is done by ‘collapsing’ the hidden variables ($X_t$) of the DBN into a single random variable by taking the Cartesian product of their sample spaces. The ‘collapsed’ DBN models a probability distribution over an exponentially larger, but still finite, sample space. Hidden Markov models are equivalent to probabilistic finite automata [6], which recognize the stochastic languages. Stochastic languages are in the RP-complexity class and thus discrete DBNs are not Turing complete.
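The ‘collapsing’ argument can be made concrete with a short sketch (my own illustration, with randomly generated CPTs): two binary hidden variables of a discrete DBN are merged into a single four-state HMM variable by taking the Cartesian product of their state spaces and multiplying their transition probabilities. Intra-slice edges are ignored here for simplicity.

```python
import itertools
import numpy as np

# Hypothetical discrete DBN with two binary hidden variables X1, X2, where
# each variable at time t+1 depends on both variables at time t.
# T1[x1', x1, x2] = P(X1_{t+1} = x1' | X1_t = x1, X2_t = x2); similarly T2.
rng = np.random.default_rng(0)
T1 = rng.random((2, 2, 2)); T1 /= T1.sum(axis=0, keepdims=True)
T2 = rng.random((2, 2, 2)); T2 /= T2.sum(axis=0, keepdims=True)

# Collapse (X1, X2) into a single HMM state over the Cartesian product.
states = list(itertools.product((0, 1), (0, 1)))   # 4 joint states
A = np.zeros((4, 4))                               # HMM transition matrix
for j, (x1, x2) in enumerate(states):              # previous joint state
    for i, (y1, y2) in enumerate(states):          # next joint state
        A[i, j] = T1[y1, x1, x2] * T2[y2, x1, x2]

print(A.sum(axis=0))   # each column sums to 1: a valid stochastic matrix
```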

2.2 A dynamic Bayesian network with continuous and discrete variables

A 2TBN can be constructed to simulate the transitions of a two stack push-down automaton (PDA), which is equivalent to the standard one tape Turing machine. A two stack PDA consists of a finite control, two unbounded binary stacks and an input tape. At each step of computation, the machine reads and advances the input tape, reads the top element of each stack and can either push a new element, pop the top element or leave each stack unchanged. The state of the control can change as a function of the previous state and the read symbols. When the control reaches one of two possible halt states ($h_{\mathrm{accept}}, h_{\mathrm{reject}}$), the machine stops and its output to the decision problem it was computing is defined by which of the halt states it stops in.

A key part of the construction is using a Dirac distribution to simulate a stack. A Dirac distribution centered at $\mu$ can be defined as the limit of normal distributions:

$$\delta_{\mu}(x) = \lim_{\sigma \to 0^{+}} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

A single Dirac-distributed random variable is sufficient to simulate a stack. The stack construction, adapted from [7], encodes a binary string $b_1 b_2 \ldots b_n$ into the number:

$$q = \sum_{i=1}^{n} \frac{2 b_i + 1}{4^i}$$

Note that if the string begins with the value 1, then $q$ has a value of at least $3/4$, and if the string begins with 0, then $q$ is less than $1/2$ - there is never a need to distinguish between two very close numbers to read the most significant digit. In addition, the empty string is encoded as $q = 0$, but any non-empty string has a value of at least $1/4$.
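A short sketch of this encoding (an illustration under the construction above, using exact rational arithmetic) checks the stated bounds:

```python
from fractions import Fraction

def encode(bits):
    """Encode a binary string (most significant digit first) as
    q = sum_i (2*b_i + 1) / 4**i, using exact rational arithmetic."""
    q = Fraction(0)
    for i, b in enumerate(bits, start=1):
        q += Fraction(2 * b + 1, 4 ** i)
    return q

assert encode([]) == 0                       # empty stack encodes to 0
assert encode([1, 0, 1]) >= Fraction(3, 4)   # leading 1  ->  q >= 3/4
assert encode([0, 1, 1]) < Fraction(1, 2)    # leading 0  ->  q <  1/2
assert encode([0]) >= Fraction(1, 4)         # any non-empty stack: q >= 1/4
print(encode([1, 0, 1]))                     # 55/64
```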

All random variables, except for the stack random variables, are categorically distributed - thus, the conditional probability densities between them can be represented using standard conditional probability tables.

Extracting the top value from a stack requires a conditional probability distribution for a Bernoulli random variable ($\mathrm{top}_t$), given a Dirac-distributed parent ($q_t$). The Heaviside step function meets this requirement and can be defined as the limit of logistic functions (or, more generally, softmax functions), centered at $0$:

$$H(x) = \lim_{k \to \infty} \frac{1}{1 + e^{-kx}}$$

The linear operation $4q - 2$ transfers the range of $q$ to at least $1$ when the top element of the stack is 1 and to no more than $0$ when the top element of the stack is 0. Then, the conditional probability density function:

$$P(\mathrm{top}_t = 1 \mid q_t) = H(4 q_t - 2)$$

yields $\mathrm{top}_t = 1$ with probability 1 whenever the top element of the stack is 1 and with probability 0 whenever the top element of the stack is 0.

Similarly, a conditional probability distribution can be defined for a Bernoulli random variable $\mathrm{empty}_t$, as:

$$P(\mathrm{empty}_t = 1 \mid q_t) = 1 - H(4 q_t - 1)$$

to check if a stack is empty.

Finally, the linear operations $q/4 + (2b+1)/4$ and $4q - (2\,\mathrm{top} + 1)$ push and pop an element $b$, respectively, from a stack. The conditional probability density for a stack at time $t+1$, given the stack at time $t$, the top of the stack at time $t$, and the action to be performed on the stack ($a_t \in \{\mathrm{push}_0, \mathrm{push}_1, \mathrm{pop}, \mathrm{noop}\}$) is fully described as follows:

$$P(q_{t+1} = x \mid q_t, \mathrm{top}_t, a_t) = \begin{cases} \delta_{q_t/4 + 1/4}(x) & a_t = \mathrm{push}_0 \\ \delta_{q_t/4 + 3/4}(x) & a_t = \mathrm{push}_1 \\ \delta_{4 q_t - (2\,\mathrm{top}_t + 1)}(x) & a_t = \mathrm{pop} \\ \delta_{q_t}(x) & a_t = \mathrm{noop} \end{cases}$$
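The stack operations can be rendered directly in code (my own sketch; the Heaviside convention $H(x) = 1$ for $x \geq 0$ and the empty-stack threshold are assumptions consistent with the bounds above):

```python
from fractions import Fraction

def heaviside(x):
    # Step function used as the limit of logistic CPDs
    # (assumed convention: H(x) = 1 for x >= 0).
    return 1 if x >= 0 else 0

def top(q):
    # Top of stack: q >= 3/4 means the leading digit is 1, q < 1/2 means 0.
    return heaviside(4 * q - 2)

def is_empty(q):
    # The empty stack encodes to 0; any non-empty stack has q >= 1/4.
    return 1 - heaviside(4 * q - 1)

def push(q, b):
    # Prepend digit b: q' = q/4 + (2b + 1)/4.
    return q / 4 + Fraction(2 * b + 1, 4)

def pop(q):
    # Remove the leading digit: q' = 4q - (2*top(q) + 1).
    return 4 * q - (2 * top(q) + 1)

q = Fraction(0)              # empty stack
q = push(push(q, 0), 1)      # stack now holds "10" (1 on top)
assert top(q) == 1 and is_empty(q) == 0
q = pop(q)                   # back to "0"
assert top(q) == 0 and pop(q) == 0 and is_empty(pop(q)) == 1
```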

Since there are two stacks in the full construction, they are labeled, at time $t$, as $q_t^{(1)}$ and $q_t^{(2)}$. The rest of the construction is straightforward. The control state $s_{t+1}$ and the stack actions $a_{t+1}^{(1)}$ and $a_{t+1}^{(2)}$ are functions of $s_t$, the input $u_t$, and the stack reads $\mathrm{top}_t^{(1)}$, $\mathrm{empty}_t^{(1)}$, $\mathrm{top}_t^{(2)}$ and $\mathrm{empty}_t^{(2)}$. Since all of these are discrete random variables, the conditional probability densities are simply the transition function of the PDA, written as a (0, 1) stochastic matrix. As expected, $P(Y_t = \mathrm{accept} \mid s_t) = 1$ if $s_t$ is the accepting halt state $h_{\mathrm{accept}}$, and $0$ otherwise.

Finally, the priors for the dynamic Bayesian network are simply $P(q_1^{(1)} = 0) = P(q_1^{(2)} = 0) = 1$ and $P(s_1 = s_0) = 1$, where $s_0$ is the initial state.

As described, this construction is something of an abuse of the term ‘probabilistic graphical model’ - all probability mass is concentrated into a single event for every random variable in the system, for every time step. However, it is easy to see that this construction faithfully simulates a two stack machine, as each random variable in the construction corresponds exactly to a component of the simulated automaton.
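Continuing the previous sketch (and reusing its push, pop, top and is_empty helpers), one time slice of the construction can be viewed as a purely deterministic update, since every conditional density is degenerate; the transition function delta below is a hypothetical stand-in for the PDA's finite control.

```python
# One "time slice" of the construction as a deterministic update.
# `delta` is a hypothetical stand-in for the PDA's finite control: it maps
# (state, input, top1, empty1, top2, empty2) to (next_state, action1, action2),
# with actions in {'push0', 'push1', 'pop', 'noop'}.

def apply(action, q):
    if action == 'push0':
        return push(q, 0)
    if action == 'push1':
        return push(q, 1)
    if action == 'pop':
        return pop(q)
    return q                                   # 'noop'

def step(delta, state, u, q1, q2):
    reads = (top(q1), is_empty(q1), top(q2), is_empty(q2))
    state, a1, a2 = delta(state, u, *reads)    # the discrete CPTs of the DBN
    return state, apply(a1, q1), apply(a2, q2)

# When `state` reaches a halt state, the output variable Y_t takes the value
# accept or reject with probability 1, as in the construction above.
```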

2.3 Exact inference in continuous-discrete Bayesian networks

This construction requires continuous random variables, which raise concerns as to whether the marginal posterior probabilities can be effectively computed. The original junction tree algorithm [8] and cut-set conditioning [9] approaches to belief propagation compute exact marginals for arbitrary DAGs, but require discrete random variables. Lauritzen’s algorithm [10] conducts inference in mixed graphical models, but is limited to conditional linear Gaussian (CLG) continuous random variables. In a CLG model, let $X$ be a continuous node, $A$ be its discrete parents, and $Y_1, \ldots, Y_k$ be its continuous parents. Then:

$$P(X \mid A = a, Y_1 = y_1, \ldots, Y_k = y_k) = \mathcal{N}\!\left(w_{a,0} + \sum_{i=1}^{k} w_{a,i}\, y_i,\; \sigma_a^2\right)$$
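A minimal sketch of a CLG conditional density (illustrative only; the weights and the names w0, w and sigma2 are made up):

```python
import numpy as np

# Hypothetical CLG CPD: P(X | A = a, Y = y) = N(w0[a] + w[a] . y, sigma2[a]).
w0     = {0: 0.5, 1: -1.0}                 # intercept per discrete parent value
w      = {0: np.array([1.0, 2.0]),         # linear weights per discrete value
          1: np.array([0.3, -0.7])}
sigma2 = {0: 0.25, 1: 1.0}                 # variance per discrete parent value

def clg_density(x, a, y):
    """Density of the continuous child X given discrete a and continuous y."""
    mean = w0[a] + w[a] @ np.asarray(y)
    var = sigma2[a]
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

print(clg_density(1.2, a=0, y=[0.1, 0.2]))
```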

Lauritzen’s algorithm can only conduct approximate inference, since the true posterior marginals may be some multimodal mix of Gaussians, while the algorithm itself only supports CLG random variables. However, the algorithm is exact in the sense that it computes exact first and second moments for the posterior marginals, which is sufficient for the Turing machine simulation.

Lauritzen’s algorithm does not permit discrete random variables to be children of continuous random variables. Lerner’s algorithm [11] extends Lauritzen’s algorithm to support softmax conditional probability densities for discrete children of continuous parents. Let $A$ be a discrete node with possible values $a_1, \ldots, a_m$ and let $Y_1, \ldots, Y_k$ be its continuous parents. Then:

$$P(A = a_i \mid y_1, \ldots, y_k) = \frac{\exp\!\left(b_i + \sum_{j=1}^{k} w_{i,j}\, y_j\right)}{\sum_{l=1}^{m} \exp\!\left(b_l + \sum_{j=1}^{k} w_{l,j}\, y_j\right)}$$

Like Lauritzen’s algorithm, Lerner’s algorithm computes approximate posterior marginals - relying on the observation that the product of a softmax and a Gaussian is approximately Gaussian - but exact first and second moments, up to errors in the numerical integration used to compute the best Gaussian approximation of the product of a Gaussian and a softmax. This calculation is actually simpler in the case where the softmax is replaced with a Heaviside, and the Lerner algorithm can run essentially unmodified with a mixture of Heaviside and softmax conditional probability densities. In the case of Dirac-distributed parents with Heaviside conditional probability densities, numerical integration is unnecessary and no errors are introduced in computing the first and second moments of the posterior distribution.
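The following sketch illustrates the kind of moment computation involved (a generic illustration of Gaussian-times-logistic moment matching by numerical integration, not the paper's or Lerner's actual implementation); with a hard step in place of the logistic, the integrand becomes a truncated Gaussian and the same moments have closed forms.

```python
import numpy as np
from scipy import integrate, stats

# Gaussian prior on a continuous parent and a logistic (binary softmax)
# CPD for its discrete child; all parameters are made up for illustration.
prior = stats.norm(loc=0.5, scale=1.0)

def likelihood(x, k=4.0, threshold=0.0):      # P(child = 1 | x)
    return 1.0 / (1.0 + np.exp(-k * (x - threshold)))

def posterior_moments():
    """Moments of p(x | child = 1) by one-dimensional numerical integration."""
    f = lambda x: prior.pdf(x) * likelihood(x)
    z,  _ = integrate.quad(f, -np.inf, np.inf)               # evidence
    m1, _ = integrate.quad(lambda x: x * f(x), -np.inf, np.inf)
    m2, _ = integrate.quad(lambda x: x * x * f(x), -np.inf, np.inf)
    mean = m1 / z
    return z, mean, m2 / z - mean ** 2                       # Z, mean, variance

print(posterior_moments())
```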

Any non-zero variance for the continuous variables will ‘leak’ probability to other values for the ‘stack’ random variables in the Turing machine simulation, eventually leading to errors. Lauritzen’s original algorithm assumes positive-definite covariance matrices for the continuous random variables, but can be extended to handle degenerate Gaussians [12]. In summary: posterior marginals for the Turing machine simulation can be computed exactly, using a modified version of the Lerner algorithm when restricted to Dirac-distributed continuous random variables with Heaviside conditional probability densities. If Gaussian random variables and softmax conditional probability densities are also introduced, then the first and second moments of the posterior marginals can be computed ‘exactly’, up to errors in numerical integration, although this will slowly degrade the quality of the Turing machine simulation in later time steps.

Inference in Bayesian networks is NP-hard [13]. However, assuming that arithmetic operations can be computed in unit time over arbitrary-precision numbers (e.g. the real RAM model), the work necessary at each time step is constant. Thus, dynamic Bayesian networks can simulate Turing machines with only a constant time overhead in the real RAM model, and with slowdown proportional to the time complexity of arbitrary precision arithmetic otherwise.
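The arbitrary-precision overhead is easy to see in a small sketch (illustration only): simulating a stack exactly with Python rationals makes the encoding's bit-length grow linearly with stack depth, so each arithmetic operation is no longer unit cost outside the real RAM model.

```python
from fractions import Fraction
import random

# Exact simulation of one stack: each push multiplies the denominator by 4,
# so the bit-length of the encoding grows linearly with the stack depth.
random.seed(0)
q = Fraction(0)
for t in range(64):
    q = q / 4 + Fraction(2 * random.randint(0, 1) + 1, 4)   # push a random bit
print(q.denominator.bit_length())   # 129: the denominator is 4**64
```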

3 Discussion

This result suggests that causal Bayesian networks may be a richer language for modeling causality than currently appreciated. Halpern [14] suggests that for general causal reasoning, a richer language, including some first-order features, may be needed. First-order features will likely be very useful for causal modeling in practice, but the Turing-complete power of dynamic Bayesian networks suggests that first-order features may be unnecessary.

This result for dynamic Bayesian networks is analogous to Siegelmann and Sontag’s proof that a recurrent neural network can simulate a Turing machine in real time [7]. In fact, neural networks and Bayesian networks turn out to have very similar expressive power:

  1. Single perceptron ↔ Gaussian naive Bayes (logistic regression) [15]

  2. Multilayer perceptron ↔ Full Bayesian network (universal function approximation) [16, 17]

  3. Recurrent neural network ↔ Dynamic Bayesian network (Turing complete)

There is an interesting gap in decidability - it takes very little to turn a sub-Turing framework for modeling into a Turing-complete one. In the case of neural networks, a single recurrent layer with arbitrary-precision rational weights and a saturating linear transfer function is sufficient. With dynamic Bayesian networks, two time slices of continuous-valued random variables with a combination of linear and step-function conditional probability densities are sufficient.

Although such a simple recurrent neural network is theoretically capable of performing arbitrary computations, practical extensions include higher-order connections [18], ‘gates’ in long short-term memory [19], and even connections to an ‘external’ Turing machine [20]. These additions enrich the capabilities of standard neural networks and make it easier to train them for complex algorithmic tasks.

An interesting question is to what degree dynamic Bayesian networks can be similarly extended and how the ‘core’ dynamic Bayesian network being capable of Turing-complete computation affects the overall performance of such networks.

Acknowledgements

I would like to thank James Reggia, William Gasarch and Brendan Good for their discussions and helpful comments on early drafts of this paper.

References

  • [1] D. Poole and M. Crowley, “Cyclic causal models with discrete variables: Markov chain equilibrium semantics and sample ordering,” in Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, pp. 1060–1068, AAAI Press, 2013.
  • [2] J. Pearl, “Bayesian networks: A model of self-activated memory for evidential reasoning,” in Proceedings of the 7th Conference of the Cognitive Science Society, University of California, Irvine, pp. 329–334, Aug. 1985.
  • [3] E. Bareinboim, C. Brito, and J. Pearl, Graph Structures for Knowledge Representation and Reasoning: Second International Workshop, GKR 2011, Barcelona, Spain, July 16, 2011. Revised Selected Papers, ch. Local Characterizations of Causal Bayesian Networks, pp. 1–17. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012.
  • [4] T. Dean and K. Kanazawa, “A model for reasoning about persistence and causation,” Comput. Intell., vol. 5, pp. 142–150, Dec. 1989.
  • [5] K. Murphy, Dynamic Bayesian Networks: Representation, Inference and Learning. PhD thesis, University of California, Berkeley, 2002.
  • [6] P. Dupont, F. Denis, and Y. Esposito, “Links between probabilistic automata and hidden markov models: probability distributions, learning models and induction algorithms,” Pattern Recognition, vol. 38, no. 9, pp. 1349 – 1371, 2005. Grammatical Inference.
  • [7] H. T. Siegelmann and E. D. Sontag, “On the computational power of neural nets,” Journal of computer and system sciences, vol. 50, no. 1, pp. 132–150, 1995.
  • [8] S. L. Lauritzen and D. J. Spiegelhalter, “Local computations with probabilities on graphical structures and their application to expert systems,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 157–224, 1988.
  • [9] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., 1988.
  • [10] S. L. Lauritzen, “Propagation of probabilities, means, and variances in mixed graphical association models,” Journal of the American Statistical Association, vol. 87, no. 420, pp. 1098–1108, 1992.
  • [11] U. Lerner, E. Segal, and D. Koller, “Exact inference in networks with discrete children of continuous parents,” in Proceedings of the seventeenth conference on uncertainty in artificial intelligence, pp. 319–328, Morgan Kaufmann Publishers Inc., 2001.
  • [12] C. Raphael, “Bayesian networks with degenerate Gaussian distributions,” Methodology and Computing in Applied Probability, vol. 5, no. 2, pp. 235–263, 2003.
  • [13] G. F. Cooper, “The computational complexity of probabilistic inference using bayesian belief networks,” Artificial intelligence, vol. 42, no. 2, pp. 393–405, 1990.
  • [14] J. Y. Halpern, “Axiomatizing causal reasoning,” Journal of Artificial Intelligence Research, pp. 317–337, 2000.
  • [15] A. Ng and M. Jordan, “On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes,” Advances in Neural Information Processing Systems, vol. 14, p. 841, 2002.
  • [16] G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics of Control, Signals and Systems, vol. 2, no. 4, pp. 303–314, 1989.
  • [17] G. Varando, C. Bielza, and P. Larrañaga, “Expressive power of binary relevance and chain classifiers based on bayesian networks for multi-label classification,” in Probabilistic Graphical Models, pp. 519–534, Springer, 2014.
  • [18] F. J. Pineda, “Generalization of back propagation to recurrent and higher order neural networks,” in Neural information processing systems, pp. 602–611, 1988.
  • [19] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [20] A. Graves, G. Wayne, and I. Danihelka, “Neural turing machines,” arXiv preprint arXiv:1410.5401, 2014.