How To Train Your Program

by David Tolpin, et al.

We present a Bayesian approach to machine learning with probabilistic programs. In our approach, training on available data is implemented as inference on a hierarchical model. The posterior distribution of model parameters is then used to stochastically condition a complementary model, such that inference on new data yields the same posterior distribution of latent parameters corresponding to the new data as inference on a hierarchical model on the combination of both previously available and new data, at a lower computation cost. We frame the approach as a design pattern of probabilistic programming, referred to herein as ‘stump and fungus’, and illustrate realization of the pattern on a didactic case study.








1. Introduction

The ultimate Bayesian approach to learning from data is embodied by hierarchical models (Gelman et al., 2013; Goodman et al., 2016; McElreath, 2020). In a hierarchical model, each observation or group of observations y_i corresponding to a single item in the data set is conditioned on a parameter θ_i, and all parameters are conditioned on a hyperparameter τ:

    θ_i ∼ H(τ)
    y_i ∼ F(θ_i)        (1)

A hierarchical model can be thought of as a way of inferring, or ‘learning’, the prior of each θ_i from all observations in the data set. Consider the following example problem: multiple boxes are randomly filled by marbles from a bag containing a mixture of blue and white marbles. We are presented with a few draws with replacement from each of the boxes, y_{i,j} being the j-th draw from the i-th box; our goal is to infer the number θ_i of blue marbles in each box. Intuitively, since the boxes are filled from the same bag, the posterior distribution of θ_i should account both for draws from the i-th box and, indirectly, for draws from all other boxes. This is formalized by the following hierarchical model:
    π ∼ Uniform(0, 1)
    θ_i ∼ Binomial(n, π)
    y_{i,j} ∼ Bernoulli(θ_i / n)        (2)

where π is the proportion of blue marbles in the bag and n is the number of marbles in each box.


Model (2) learns from the data in the sense that inference for each box is influenced by draws from all boxes. However, learning from training data to improve inference on future data with a hierarchical model is computationally inefficient — if a new box is presented, one has to add observations of the new box to the previously available data and re-run inference on the extended data set. Inference performance can be improved by employing data subsampling (Korattikara et al., 2014; Bardenet et al., 2014, 2017; Maclaurin and Adams, 2014; Quiroz et al., 2018), but the whole training data set still needs to be kept and made accessible to the inference algorithm. A hierarchical model cannot ‘compress’, or summarize, training data for efficient inference on future observations.

An alternative approach, known as empirical Bayes (Robbins, 1951; Casella, 1985; Robbins, 1992), consists in adjusting the hyperprior based on the training data; e.g. by fixing τ at a likely value, or by replacing H(τ) in (1) with a suitable approximation of the posterior distribution of τ. However, empirical Bayes is not Bayesian. While the practical efficiency of empirical Bayes was demonstrated in a number of settings (Robbins, 1992), in other settings empirical Bayes may result in a critically misspecified model and overconfident or biased inference outcomes.

In this work, we propose an approach to learning from data in probabilistic programs which is both Bayesian in nature and computationally efficient. First, we state the problem of learning from data in the context of Bayesian generative models (Section 2). Then, we introduce the approach and discuss its implementation (Section 3). As an illustration, we apply the approach to a didactic example based on a case study of tumor incidence in rats (Section 4). Finally, we conclude with a discussion of related work and further research (Section 5).

2. Problem: Learning from Data

The challenge we tackle here is re-using inference outcomes on the training data set for inference on new data. Formally, a population Y = {y_1, y_2, …} is a set of sets of observations y_i = {y_{i,1}, y_{i,2}, …}. Members of each y_i are assumed to be drawn from a known distribution F with unobserved parameter θ_i, y_{i,j} ∼ F(θ_i). The parameters θ_i are in turn assumed to be drawn from a common distribution H(τ). Our goal is to devise a scheme that, given a training subset Y′ ⊂ Y, infers the posterior distribution of θ_k for any y_k ∈ Y in a shorter amortized time than running inference on a hierarchical model on Y′ ∪ {y_k}. By amortized time we mean here the average time per y_k as the number of new data items grows.

In other words, we look for a scheme that works in two stages. At the first stage, inference is performed on the training set only. At the second stage, the inference outcome of the first stage is used, together with y_k, to infer θ_k. We anticipate a scheme that ‘compresses’ the training set at the first stage, resulting in a shorter running time of the second stage. Such a scheme bears similarity to the conventional machine learning paradigm: an expensive computation on the training data results in shorter running times on new data.
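The two-stage interface can be sketched in Go. Stage one reduces the training set to a bag of posterior samples (mocked below by fixed values; in practice they come from MCMC on the hierarchical model), and stage two consumes them through a channel, mirroring the channel-based plumbing of the appendix model:

```go
package main

import "fmt"

// Train summarizes the training set as posterior samples of the
// group parameters. The values here are a mock stand-in for the
// output of inference on the hierarchical model.
func Train() []float64 {
	return []float64{0.12, 0.15, 0.09, 0.11}
}

// Feed exposes the samples as a channel, cycling through them, so
// the second-stage model can consume one sample per conditioning step.
func Feed(samples []float64) <-chan float64 {
	c := make(chan float64)
	go func() {
		for i := 0; ; i++ {
			c <- samples[i%len(samples)]
		}
	}()
	return c
}

func main() {
	p := Feed(Train())
	// The second stage reads K samples per log-density evaluation;
	// here we just average the consumed samples to show the flow.
	sum := 0.0
	for k := 0; k < 4; k++ {
		sum += <-p
	}
	fmt.Printf("mean of consumed samples: %.4f\n", sum/4)
	// prints: mean of consumed samples: 0.1175
}
```

The point of the sketch is the separation of concerns: the training set itself never reaches the second stage, only the sample summary does.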

3. Main Idea: Stump and Fungus

In quest of devising such a scheme, we make two observations which eventually help us arrive at a satisfactory solution:

  1. In Bayesian modelling, information about data is usually conveyed through conditioning of the model on various aspects of the data.

  2. In a hierarchical model, influence of the i-th group of observations on the hyperparameter τ and, consequently, on other groups, passes exclusively through the group parameter θ_i.

If, instead of conditioning on the training data, we could condition on the parameters θ_i corresponding to the training data, then we could perform inference on a new data item y_k at a lower time and space complexity. Continuing the well-known analogy between a hierarchical model and a tree, with the hyperparameter τ at the root and observations in the leaves, we can liken a model which receives all of the training data and a new data item to a stump (the hierarchical model with the trunk cut off just after the hyperparameters) and a fungus growing on the stump, the new data item. The problem is, of course, that we infer distributions, rather than fixed values, of θ_i, and the model must be, somewhat unconventionally, conditioned on the distributions of θ_i.

However, the recently introduced notion of stochastic conditioning (Tolpin et al., 2021) makes conditioning on distributions of θ_i possible, both theoretically and in the practical case when the posteriors of θ_i are approximated by Monte Carlo samples. Moreover, conditioning the model both stochastically on the posterior distributions of θ_i on the training data and deterministically on the new data yields the same posterior distribution of θ_k as inference on the full hierarchical model. Based on this, we propose the ‘stump-and-fungus’ pattern for learning from data in probabilistic programs:

  • Training is accomplished through inference on a hierarchical model, in the usual way.

  • Training outcomes are summarized as a collection of samples θ̃, representing the mixture of the posterior distributions of θ_i over all groups.

  • For inference on a new data item y_k, a stump-and-fungus model is employed:
    θ̃ ∼ H(τ)        (stochastically conditioned on the training samples θ̃)
    θ_k ∼ H(τ)
    y_k ∼ F(θ_k)        (3)


Although two models — hierarchical and stump-and-fungus — are involved in the pattern, the models are in fact two roles fulfilled by the same generative model, combining stochastic conditioning on training data and deterministic conditioning on new data (consisting potentially of multiple data items). This preserves a common practice in machine learning in which the same model is used for both training and inference.

4. Case Study: Tumor Incidence in Rats

In this case study, based on (Tarone, 1982) and discussed in (Gelman et al., 2013, Chapter 5), data on tumor incidence in rats in laboratory experiments is used to infer tumor incidence based on outcomes of yet another experiment. A different number of rats was involved in each experiment, and the number of tumor cases was reported. The posteriors of independent models for all experiments are shown in Figure 1.

Figure 1. Separate model

A schoolbook solution for the problem is to perform inference on a hierarchical model:
    log α, log β ∼ Uniform(−∞, ∞)        (improper flat hyperprior)
    p_i ∼ Beta(α, β)
    y_i ∼ Binomial(n_i, p_i)        (4)


Inference on Model (4) can be performed efficiently thanks to summarization of the n_i Bernoulli observations of each experiment as a single observation from a Binomial distribution. In general, however, the use of a hierarchical model would require carrying all observations of all previous experiments for learned inference on findings of a new experiment.

The stump-and-fungus pattern is straightforwardly applicable to the problem:
    log α, log β ∼ Uniform(−∞, ∞)
    p̃ ∼ Beta(α, β)        (stochastically conditioned on posterior samples of p from training)
    p_k ∼ Beta(α, β)
    y_k ∼ Binomial(n_k, p_k)        (5)


In (5), p̃ are stochastically observed from the posterior of the hierarchical model (4) conditioned on all but the k-th experiment. Figure 2 shows the posterior distributions of p_i inferred on the hierarchical model (Figure 2(a)) and through applications of stump-and-fungus (Figure 2(b), similar to leave-one-out cross-validation). The Infergo (Tolpin, 2019) source code of the model is provided in the appendix. The inference was performed with HMC (Neal, 2011) on the hierarchical model (4) and with stochastic gradient HMC (Chen et al., 2014) on the stump-and-fungus model (5); 1000 samples were used to visualize the posteriors. One can see that the posteriors obtained via either method appear to be the same, except for small discrepancies apparently caused by the finite sample size of the approximation. The data and code for the case study are available at

(a) Hierarchical model
(b) Stump-and-fungus models
Figure 2. Hierarchical vs. stump-and-fungus model. The posteriors obtained via either method are the same, except for small discrepancies apparently caused by the finite sample size of the approximation. Colored lines are posterior distributions of p_i for each of the 71 experiments. Black lines are posterior distributions of p for a future experiment.

5. Discussion

We presented a probabilistic programming pattern for Bayesian learning from data. The importance of learning from data is well appreciated in probabilistic programming. Along with empirical Bayes, applicable to probabilistic programming as well as to Bayesian generative models in general, probabilistic-programming-specific approaches have been proposed. One possibility is to induce a probabilistic program suited for a particular data set (Liang et al., 2010; Perov and Wood, 2014; Perov, 2018; Hwang et al., 2011). A related but different research direction is inference compilation (Le et al., 2017; Baydin et al., 2019), where the cost of inference is amortized through learning proposal distributions from data. Another line of research is concerned with speeding up inference algorithms by tuning them based on training data (Eslami et al., 2014; Mansinghka et al., 2018). Our approach to learning from data in probabilistic programs is different in that it requires neither a particular implementation of probabilistic programming nor introspection into the structure of probabilistic programs or inference algorithms. Instead, the approach uses inference in ubiquitously adopted hierarchical models for training, and conditioning on observations for incorporation of training outcomes in inference.

We thank PUB+ for supporting development of Infergo.


  • Bardenet et al. (2014) Rémi Bardenet, Arnaud Doucet, and Chris Holmes. 2014. Towards scaling up Markov chain Monte Carlo: an adaptive subsampling approach. In Proceedings of the 31st International Conference on Machine Learning. 405–413.
  • Bardenet et al. (2017) Rémi Bardenet, Arnaud Doucet, and Chris Holmes. 2017. On Markov chain Monte Carlo methods for tall data. Journal of Machine Learning Research 18, 47 (2017), 1–43.
  • Baydin et al. (2019) Atilim Gunes Baydin, Lei Shao, W. Bhimji, L. Heinrich, Lawrence Meadows, Jialin Liu, Andreas Munk, Saeid Naderiparizi, Bradley Gram-Hansen, Gilles Louppe, Mingfei Ma, X. Zhao, P. Torr, V. Lee, K. Cranmer, Prabhat, and Frank D. Wood. 2019. Etalumis: bringing probabilistic programming to scientific simulators at scale. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (2019).
  • Casella (1985) George Casella. 1985. An Introduction to Empirical Bayes Data Analysis. The American Statistician 39, 2 (1985), 83–87.
  • Chen et al. (2014) Tianqi Chen, Emily B. Fox, and Carlos Guestrin. 2014. Stochastic Gradient Hamiltonian Monte Carlo. In Proceedings of the 31st International Conference on Machine Learning. 1683–1691.
  • Eslami et al. (2014) Ali Eslami, Daniel Tarlow, Pushmeet Kohli, and John Winn. 2014. Just-In-Time Learning for Fast and Flexible Inference. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS’14). MIT Press, Cambridge, 154–162.
  • Gelman et al. (2013) A. Gelman, J.B. Carlin, H.S. Stern, D.B. Dunson, A. Vehtari, and D.B. Rubin. 2013. Bayesian Data Analysis. CRC Press.
  • Goodman et al. (2016) Noah D Goodman, Joshua B. Tenenbaum, and the ProbMods contributors. 2016. Probabilistic Models of Cognition (second ed.). electronic; retrieved: 2019-4-29.
  • Hwang et al. (2011) I. Hwang, Andreas Stuhlmüller, and Noah D. Goodman. 2011. Inducing Probabilistic Programs by Bayesian Program Merging. ArXiv abs/1110.5667 (2011).
  • Korattikara et al. (2014) Anoop Korattikara, Yutian Chen, and Max Welling. 2014. Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget. In Proceedings of the 31st International Conference on Machine Learning. 181–189.
  • Le et al. (2017) Tuan Anh Le, Atilim Gunes Baydin, and Frank Wood. 2017. Inference Compilation and Universal Probabilistic Programming. In Proceedings of the 34th International Conference on Machine Learning. 1338–1348.
  • Liang et al. (2010) Percy Liang, Michael I. Jordan, and Dan Klein. 2010. Learning Programs: A Hierarchical Bayesian Approach. In Proceedings of the 27th International Conference on Machine Learning. Omnipress, 639–646.
  • Maclaurin and Adams (2014) Dougal Maclaurin and Ryan P. Adams. 2014. Firefly Monte Carlo: Exact MCMC with subsets of data. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence. 543–552.
  • Mansinghka et al. (2018) Vikash K. Mansinghka, Ulrich Schaechtle, Shivam Handa, Alexey Radul, Yutian Chen, and Martin Rinard. 2018. Probabilistic Programming with Programmable Inference. SIGPLAN Not. 53, 4 (June 2018), 603–616.
  • McElreath (2020) Richard McElreath. 2020. Statistical Rethinking (2nd ed.). CRC Press.
  • Neal (2011) Radford M. Neal. 2011. MCMC using Hamiltonian dynamics. Chapter 5 of the Handbook of Markov Chain Monte Carlo.
  • Perov (2018) Yura Perov. 2018. Inference Over Programs That Make Predictions. arXiv:arXiv:1810.01190
  • Perov and Wood (2014) Yura N. Perov and Frank D. Wood. 2014. Learning Probabilistic Programs. arXiv:arXiv:1407.2646
  • Quiroz et al. (2018) Matias Quiroz, Mattias Villani, Robert Kohn, Minh-Ngoc Tran, and Khue-Dung Dang. 2018. Subsampling MCMC — an Introduction for the Survey Statistician. Sankhya A 80, 1 (01 Dec 2018), 33–69.
  • Robbins (1951) Herbert Robbins. 1951. Asymptotically Subminimax Solutions of Compound Statistical Decision Problems. In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability. University of California Press, Berkeley, Calif., 131–149.
  • Robbins (1992) Herbert Robbins. 1992. An Empirical Bayes Approach to Statistics. Springer New York, New York, NY, 388–394.
  • Tarone (1982) Robert Tarone. 1982. The Use of Historical Control Information in Testing for a Trend in Proportions. Biometrics 38, 1 (1982), 215–220.
  • Tolpin (2019) David Tolpin. 2019. Deployable Probabilistic Programming. In Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software. 1–16.
  • Tolpin et al. (2021) David Tolpin, Yuan Zhou, Tom Rainforth, and Hongseok Yang. 2021. Probabilistic Programs with Stochastic Conditioning. arXiv:arXiv:2010.00282

Appendix A Infergo Stump-and-Fungus Model

package model

import (
	. "bitbucket.org/dtolpin/infergo/dist" // Infergo distributions (Beta, Binomial, Cauchy)
	"bitbucket.org/dtolpin/infergo/mathx"  // Infergo math helpers (Sigm, LogDSigm)
	"math"
)

// Model is shared by both the tree and the fungus. When K is 0,
// the model reduces to a conventional hierarchical model.
// Otherwise, K is the amount of evidence flowing from the tree
// to the fungus, with stochastic conditioning fed through P.
type Model struct {
	P    <-chan float64
	K    int
	Y, N []float64
}

func (m *Model) Observe(x []float64) (lp float64) {
	// Hyperparameters
	alpha := math.Exp(x[0])
	beta := math.Exp(x[1])
	lp += x[0] + x[1]

	// Parameters
	x = x[2:]
	// x is a vector of log-odds, regularized through
	// imposing a Cauchy prior.
	lp += Cauchy.Logps(0, 10, x...)
	p := make([]float64, len(x))
	for i := range p {
		p[i] = mathx.Sigm(x[i])
		lp += mathx.LogDSigm(x[i])
	}

	// Parameters given hyperparameters
	lp += Beta.Logps(alpha, beta, p...)
	// Stochastic conditioning given hyperparameters
	for i := 0; i != m.K; i++ {
		lp += Beta.Logp(alpha, beta, <-m.P)
	}

	// Observations given parameters
	for i := range m.Y {
		lp += Binomial.Logp(m.N[i], p[i], m.Y[i])
	}

	return lp
}