Encapsulating models and approximate inference programs in probabilistic modules

12/14/2016 ∙ by Marco F. Cusumano-Towner, et al.

This paper introduces the probabilistic module interface, which allows encapsulation of complex probabilistic models with latent variables alongside custom stochastic approximate inference machinery, and provides a platform-agnostic abstraction barrier separating the model internals from the host probabilistic inference system. The interface can be seen as a stochastic generalization of a standard simulation and density interface for probabilistic primitives. We show that sound approximate inference algorithms can be constructed for networks of probabilistic modules, and we demonstrate that the interface can be implemented using learned stochastic inference networks and MCMC and SMC approximate inference programs.


1 Introduction

We present the probabilistic module interface, which allows encapsulation of complex latent variable models with custom stochastic approximate inference machinery. The module interface can be seen as a generalization of previously proposed interfaces for “elementary” random procedures in probabilistic programming languages: it does not require the module author to specify a marginal input-output density. Instead, module authors are only obligated to (i) provide a way to stochastically “regenerate” traces of the internal latent variables, subject to constraints on the module’s output, and (ii) provide a way to calculate a weight for this regeneration. We show this is sufficient for constructing sound approximate inference algorithms over networks of modules, including a Metropolis-Hastings procedure that can be seen as the module-level analogue of the single-site Metropolis-Hastings procedures that are commonly used with “lightweight” implementations of probabilistic programming languages (goodman2012church, wingate2011lightweight).

This paper illustrates module networks by defining the mathematical interface and providing an example application to linear regression with outliers. This application contains two modules: (i) a complex prior over a binary “model selection” variable determining the prior prevalence of outliers, using a learned bottom-up network for regeneration, and (ii) a linear regression model with binary outlier indicators, using sequential Monte Carlo for regeneration of the outlier indicators (thereby avoiding an exponential sum over all possible indicator settings).

Figure 1: Encapsulating latent variables in a probabilistic model into a probabilistic module: (a) shows the original probabilistic model, with the latent variables that are to be abstracted away in a dashed box. (b) shows the model in which the latent variables have been made into the internal auxiliary variables $w$ of a probabilistic module with output $y$ and input $x$. (c) shows the declarative semantics of the resulting module network.

2 The probabilistic module interface

In several existing probabilistic programming systems (goodman2012church, mansinghka2014venture, mansinghka2015bayesdb, saad2016probabilistic), probabilistic modeling primitives implement a simulation procedure (simulate), which samples outputs $y$ given inputs $x$ from a distribution $p(y \mid x)$, and a log-density evaluation procedure (logpdf), which evaluates $\log p(y \mid x)$. (Univariate versions of this interface are used in wood2014new and dippl.) Together, these two procedures enable inference programs to run valid approximate inference algorithms such as MCMC and SMC over the composite probabilistic model. This interface is summarized in Figure 2a.
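For concreteness, the sketch below shows one way this standard primitive interface could look in Python; the class and method names (simulate, logpdf) and the normal-distribution example are illustrative, not taken from any particular system.

```python
import math
import random

class Primitive:
    """Standard interface for an elementary random procedure (a sketch)."""

    def simulate(self, x):
        """Sample an output y given inputs x from p(y | x)."""
        raise NotImplementedError

    def logpdf(self, y, x):
        """Evaluate log p(y | x) exactly."""
        raise NotImplementedError

class NormalPrimitive(Primitive):
    """A univariate normal whose inputs are (mean, std)."""

    def simulate(self, x):
        mean, std = x
        return random.gauss(mean, std)

    def logpdf(self, y, x):
        mean, std = x
        return -0.5 * ((y - mean) / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))
```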

We propose a stochastic generalization of this interface, called the probabilistic modules interface, that replaces logpdf with a stochastic generalization called regenerate. Unlike the simulate and logpdf interface, the probabilistic modules interface, summarized in Figure 2b, is able to represent probabilistic computations that involve internal ‘auxiliary’ random variables $w$ that cannot be exactly marginalized out to compute the log-density $\log p(y \mid x)$. This is made possible by implementing a sampler that samples values for the auxiliary variables $w$ given inputs $x$ and outputs $y$ from a regeneration distribution, which is denoted $q(w; x, y)$.

Figure 2: (a) shows the standard simulate and logpdf interface for elementary random procedures, with inputs $x$ and outputs $y$. (b) shows its stochastic generalization, the probabilistic modules interface, with module inputs $x$, module auxiliary variables $w$ and module outputs $y$.

If there are no auxiliary variables $w$, then the regenerate procedure reduces to the deterministic logpdf procedure. In the presence of auxiliary variables $w$, regenerate may be understood as using an unbiased single-sample importance sampling estimate of the output probability $p(y \mid x)$, where $q(w; x, y)$ is the importance distribution: the weight $p(w, y \mid x) / q(w; x, y)$, with $w \sim q(\cdot\,; x, y)$, has expectation $p(y \mid x)$. Indeed, in the extreme setting in which the regeneration distribution is identical to the conditional distribution on auxiliary variables given inputs and outputs ($q(w; x, y) = p(w \mid x, y)$), this estimate is deterministic and exact, and regenerate is again identical to logpdf. Finally, note that the probabilistic modules interface does not require the auxiliary variables $w$ to be stored in memory all at once. This is useful when the log-weight $\log p(w, y \mid x) - \log q(w; x, y)$ can be incrementally computed during sampling of $w$ from $q$ and from $p$. Such cases are discussed in Section 4.

3 Implementing MCMC over probabilistic module networks

When we compose probabilistic modules in a directed acyclic graph, the resulting probabilistic module network has the same declarative semantics as a Bayesian network with nodes for the module outputs $y_i$. Both the module network and the Bayesian network represent the joint distribution on module outputs, with any module auxiliary variables $w_i$ marginalized out. The existence of auxiliary variables in the modules only changes how approximate inference is performed.

Valid MCMC algorithms can easily be constructed over probabilistic module networks. In fact, existing standard Metropolis-Hastings (MH) algorithms for inference in Bayesian networks need only a slight modification for use with modules (see Algorithm 1). The only change required is the storage of the current log-weight for each probabilistic module. The current log-weight $\ell_i$ for module $i$ is accessed with lookup-log-weight and updated with update-log-weight during MCMC inference. These values are initialized by running simulate for each module whose output is not observed and regenerate for each module whose output is observed, following a topological ordering of nodes in the network. Note that for single-site MH in a Bayesian network, the lookup-log-weight call and the regenerate call of Algorithm 1 are replaced with exact evaluations of the log-density $\log p(y_i \mid x_i)$. A Markov chain constructed from mixtures and cycles of the update of Algorithm 1 admits the posterior as a marginal of its stationary distribution, which is defined on the space of all unobserved module outputs $y_i$ and all module auxiliary variables $w_i$.
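For concreteness, a possible initialization pass might look like the following Python sketch; the network object and its attributes (value, log_weight, module, inputs, topological_order) are illustrative assumptions, as are the simulate/regenerate signatures from the sketches in Section 2.

```python
def initialize_log_weights(net, observations):
    """Initialize module outputs and per-module log-weights in topological order
    (a sketch). `observations` maps observed module indices to their
    constrained output values."""
    for i in net.topological_order():
        x_i = net.inputs(i, net.value)              # parents already initialized
        if i in observations:
            net.value[i] = observations[i]
            _, net.log_weight[i] = net.module[i].regenerate(net.value[i], x_i)
        else:
            net.value[i], _, net.log_weight[i] = net.module[i].simulate(x_i)
```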

1: Input: module $i$ whose output to update, proposal distribution $q_{\mathrm{prop}}$, previous value $y_i$.
2: $y_i' \sim q_{\mathrm{prop}}(\cdot\,; y_i)$  ▷ Sample proposed value for module output
3: for $j \in \{i\} \cup \mathrm{ch}(i)$ do
4:   $\ell_j \leftarrow$ lookup-log-weight$(j)$  ▷ Look up previous log-weight for module $j$
5:   $\ell_j' \leftarrow$ log-weight returned by regenerate for module $j$, with $y_i'$ substituted for $y_i$  ▷ Estimate new log-weight using the proposed value
6: end for
7: $\log \alpha \leftarrow \log q_{\mathrm{prop}}(y_i; y_i') - \log q_{\mathrm{prop}}(y_i'; y_i) + \sum_{j \in \{i\} \cup \mathrm{ch}(i)} (\ell_j' - \ell_j)$
8: if $\log u < \log \alpha$ for $u \sim \mathrm{Uniform}(0, 1)$ then
9:   $y_i \leftarrow y_i'$  ▷ Accept the proposal
10:  for $j \in \{i\} \cup \mathrm{ch}(i)$ do
11:    update-log-weight$(j, \ell_j')$
12:  end for
13: end if
Algorithm 1: Single-site Metropolis-Hastings (MH) update in a probabilistic module network
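A minimal Python rendering of Algorithm 1 under the same assumed network object is given below; propose and propose_logpdf stand in for the proposal distribution $q_{\mathrm{prop}}$ and, like the network attributes, are hypothetical names.

```python
import math
import random

def single_site_mh_update(net, i, propose, propose_logpdf):
    """One application of Algorithm 1: a single-site MH update of module i's output.

    `net` is a hypothetical network object holding, per module j: net.value[j]
    (current output y_j), net.log_weight[j] (stored log-weight), net.module[j]
    (an object with the regenerate method sketched in Section 2), plus
    net.children(j) and net.inputs(j, values). `propose(y)` samples from
    q_prop(. ; y) and `propose_logpdf(y_to, y_from)` evaluates log q_prop(y_to; y_from).
    """
    y_old = net.value[i]
    y_new = propose(y_old)                          # y_i' ~ q_prop(. ; y_i)

    affected = [i] + list(net.children(i))
    proposed_values = dict(net.value)
    proposed_values[i] = y_new

    new_logw = {}
    for j in affected:
        x_j = net.inputs(j, proposed_values)        # inputs with y_i' substituted
        y_j = proposed_values[j]
        _, new_logw[j] = net.module[j].regenerate(y_j, x_j)   # auxiliary vars discarded

    log_alpha = propose_logpdf(y_old, y_new) - propose_logpdf(y_new, y_old)
    log_alpha += sum(new_logw[j] - net.log_weight[j] for j in affected)

    if math.log(random.random()) < log_alpha:       # accept the proposal
        net.value[i] = y_new
        for j in affected:
            net.log_weight[j] = new_logw[j]
        return True
    return False
```

Mixing this update over all unobserved modules $i$ yields the random-scan MH algorithm analyzed in Appendix A.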

4 Encapsulating models and inference programs in probabilistic modules

We now show how to encapsulate a probabilistic model $p(z, y \mid x)$ with internal latents $z$ and outputs $y$ as a probabilistic module with the declarative semantics of the marginal distribution on outputs, $p(y \mid x)$, as shown in Figure 1. This is useful if $z$ is high dimensional or the marginal is analytically intractable, and we are unable to implement logpdf by marginalizing out $z$ exactly.

We begin by defining the module auxiliary variables as the model’s internal latents ($w = z$). Then, the probabilistic module interface requires us to construct a sampler for the regeneration distribution $q(z; x, y)$, where $q$ is an approximation to the posterior $p(z \mid x, y)$ such that we can efficiently compute the log-weight $\log p(z, y \mid x) - \log q(z; x, y)$. It is sometimes possible to learn a sampler $q$ (see morris2001recognition for a pioneering example of this approach, and the ‘stochastic inverses’ of stuhlmuller2013learning) or to learn the model and the regeneration sampler at the same time (e.g. the generative models and recognition networks of kingma2013auto) such that the log-weight is tractable. We illustrate this approach for the first module of Figure 3, which uses a learned stochastic inverse network trained using samples from the prior of the model, as described in stuhlmuller2013learning. The log-weight is tractable because the learned $q$ contains no additional random variables beyond those in the model itself ($w = z$).
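A sketch of this first strategy is shown below, assuming a hypothetical learned inverse object with sample and logpdf methods and a callable evaluating the model's log joint density; training of the inverse (from prior samples, as in stuhlmuller2013learning) is omitted.

```python
class LearnedInverseModule:
    """Module whose regeneration distribution is a learned stochastic inverse
    (a sketch). `model_log_joint(z, y, x)` evaluates log p(z, y | x),
    `model_sample(x)` samples (z, y) from the model, and `inverse` is a learned
    object with sample/logpdf methods approximating p(z | x, y); all three are
    assumptions, not defined here."""

    def __init__(self, model_log_joint, model_sample, inverse):
        self.model_log_joint = model_log_joint   # (z, y, x) -> log p(z, y | x)
        self.model_sample = model_sample         # x -> (z, y) sampled from p
        self.inverse = inverse                   # learned q(z; x, y)

    def simulate(self, x):
        z, y = self.model_sample(x)
        logw = self.model_log_joint(z, y, x) - self.inverse.logpdf(z, y, x)
        return y, z, logw

    def regenerate(self, y, x):
        z = self.inverse.sample(y, x)
        logw = self.model_log_joint(z, y, x) - self.inverse.logpdf(z, y, x)
        return z, logw
```

Because the learned inverse samples only the model latents z and nothing else, the log-weight is exactly log p(z, y | x) − log q(z; x, y), with no extra auxiliary randomness to account for.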

However, if we wish to use generally applicable stochastic inference programs implementing MCMC (andrieu2003introduction) and sequential Monte Carlo (SMC) (del2006sequential) for the regeneration distribution, it is not possible to compute the log-weight directly, because the marginal output density of the stochastic inference program is intractable. To handle these cases, we augment the auxiliary variables of the module to include the execution history $u$ of the stochastic inference program ($w = (z, u)$). We define the distribution sampled by regenerate as the joint distribution of the stochastic inference program over its execution history $u$ and output $z$, denoted $q(u, z; x, y)$. We then extend the distribution sampled by simulate to also sample an execution history $u$ alongside the model latents $z$, using a ‘meta-inference’ program (cusumano2016quantifying) that samples an inference execution history $u$ given inference output $z$ from a distribution $m(u; z, x, y)$ that approximates the conditional distribution on inference execution histories $q(u \mid z; x, y)$, so that the module’s joint distribution over auxiliary variables and outputs is $p(z, y \mid x)\, m(u; z, x, y)$.

As shown in cusumano2016quantifying and cusumanotowner2016smc, it is possible to construct meta-inference programs for sequential variants of MCMC using detailed balance transition kernels, and for multiple-particle SMC with optional detailed balance transition kernels, such that the log-weight can be efficiently computed on the fly when sampling from $q$ and from $m$. As the accuracy of the inference program improves (as happens when the number of particles in SMC increases), the log-weights sampled from the probabilistic module converge to the log-density $\log p(y \mid x)$. The second module of Figure 3 uses SMC for regeneration.
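To make the particle-based case concrete, the sketch below implements regeneration with an N-particle importance sampler, the single-step special case of SMC; the proposal object and model log-joint callable are assumptions, and the simulate direction (which requires the conditional, meta-inference run) is omitted.

```python
import math
import random

def _logmeanexp(log_ws):
    m = max(log_ws)
    return m + math.log(sum(math.exp(lw - m) for lw in log_ws) / len(log_ws))

def _categorical(log_ws):
    # Sample an index with probability proportional to exp(log_ws[i]).
    m = max(log_ws)
    probs = [math.exp(lw - m) for lw in log_ws]
    r, acc = random.random() * sum(probs), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1

class ParticleRegenerationModule:
    """Regeneration via an N-particle importance sampler, the single-step
    special case of SMC (a sketch). `model_log_joint(z, y, x)` evaluates
    log p(z, y | x) and `proposal` has sample/logpdf methods for a particle
    proposal r(z; x, y); both are assumptions. The returned log-weight is the
    log marginal-likelihood estimate log((1/N) * sum of particle weights),
    which under the meta-inference construction of cusumanotowner2016smc
    serves as the module log-weight."""

    def __init__(self, model_log_joint, proposal, num_particles=20):
        self.model_log_joint = model_log_joint
        self.proposal = proposal
        self.num_particles = num_particles

    def regenerate(self, y, x):
        particles, log_ws = [], []
        for _ in range(self.num_particles):
            z = self.proposal.sample(y, x)
            particles.append(z)
            log_ws.append(self.model_log_joint(z, y, x)
                          - self.proposal.logpdf(z, y, x))
        k = _categorical(log_ws)                 # select the output particle
        log_z_hat = _logmeanexp(log_ws)          # estimate of p(y | x)
        aux = (particles[k], particles)          # model latents + 'execution history'
        return aux, log_z_hat
```

As the number of particles grows, the spread of the returned log-weights around $\log p(y \mid x)$ shrinks, matching the convergence behavior described above.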

Figure 3: Illustration of module networks on an example application to linear regression with outliers. (a) shows two latent variable models encapsulated in probabilistic modules and composed in a probabilistic module network. The first module encodes a prior distribution on a binary model-selection variable, which determines the prior prevalence of outliers. The second module encodes the linear regression outlier model. To treat a module like a single node in a Bayesian network when we lack its marginal output density, we perform stochastic inversion (or “regeneration”) of the module. (b) shows the data flow of regeneration for the two modules: the first module uses a learned stochastic inverse network and the second module uses sequential Monte Carlo (SMC) for regeneration. Despite being approximate, these permit valid Metropolis-Hastings over the exposed latent variable, as if we had the marginal output densities of the two modules. (c) shows traces of the exposed latent variable and the total log-weight for an MCMC run using Algorithm 1 in this model, with observed values for the regression outputs; the total log-weight varies stochastically even when the exposed latent variable is static. (d) shows a rendering of the latent variables of the model encapsulated in the second module (the line and outlier statuses) and the dataset at a point in the chain. (e) shows the internal probabilistic model of the second module.

Acknowledgements

This research was supported by DARPA (PPAML program, contract number FA8750-14-2-0004), IARPA (under research contract 2015-15061000003), the Office of Naval Research (under research contract N000141310333), the Army Research Office (under agreement number W911NF-13-1-0212), and gifts from Analog Devices and Google. This research was conducted with Government support under and awarded by DoD, Air Force Office of Scientific Research, National Defense Science and Engineering Graduate (NDSEG) Fellowship, 32 CFR 168a.

References

Appendix A: Deriving single-site Metropolis-Hastings in a module network

Consider a network of probabilistic modules indexed by $i = 1, \ldots, K$. Let $w_i$ denote the internal auxiliary variables of module $i$, let $y_i$ denote the output of module $i$, and let $x_i$ denote the inputs to module $i$. Each module input is a tuple of module outputs, $x_i = (y_j)_{j \in \mathrm{pa}(i)}$, where $\mathrm{pa}(i)$ is the sequence of parent module indices of module $i$. Let $\mathrm{ch}(i)$ denote the set of children of module $i$: $\mathrm{ch}(i) = \{j : i \in \mathrm{pa}(j)\}$. Let $p_i(w_i, y_i \mid x_i)$ denote the simulation distribution for module $i$ and let $q_i(w_i; x_i, y_i)$ denote the regeneration distribution. Define the collection of all module auxiliary variables by $w = (w_1, \ldots, w_K)$ and the collection of all module outputs by $y = (y_1, \ldots, y_K)$. Define the joint simulation distribution over the network as $p(w, y) = \prod_i p_i(w_i, y_i \mid x_i)$. The declarative semantics of the module network are derived from the marginal distribution over outputs, $p(y) = \sum_w p(w, y)$. Suppose a subset $O$ of the modules are observed, meaning their outputs are constrained to values $\hat{y}_O$. The target distribution of the network is then defined by $\pi(y_{\setminus O}) \propto p(y_{\setminus O}, \hat{y}_O)$. We will derive a Metropolis-Hastings algorithm for which the stationary distribution is the extended distribution $\tilde{\pi}(y_{\setminus O}, w) \propto p(w, y)$ with $y_O$ fixed to $\hat{y}_O$. If the algorithm converges to this joint distribution, then the marginal over the outputs $y_{\setminus O}$ alone converges to the network’s target distribution $\pi$.

Consider a standard single-site Metropolis-Hastings update targeting the distribution $\tilde{\pi}(y_{\setminus O}, w)$ in which we propose a new value for the output of some unobserved module $i$. Let $y$ and $w$ denote the state prior to the update. We propose a new value $y_i' \sim q_{\mathrm{prop}}(\cdot\,; y_i)$, and then propose $w_i' \sim q_i(\cdot\,; x_i, y_i')$. We also propose $w_j' \sim q_j(\cdot\,; x_j', y_j)$ for all modules $j \in \mathrm{ch}(i)$, where $x_j'$ includes the updated value $y_i'$. We perform an MH accept/reject step using this proposal, and with the ‘local’ posterior over $(y_i, w_i, (w_j)_{j \in \mathrm{ch}(i)})$ as the target distribution. The acceptance ratio is:

$$\alpha = \frac{\tilde{\pi}(y', w')}{\tilde{\pi}(y, w)} \cdot \frac{q_{\mathrm{prop}}(y_i; y_i')}{q_{\mathrm{prop}}(y_i'; y_i)} \cdot \frac{q_i(w_i; x_i, y_i)}{q_i(w_i'; x_i, y_i')} \cdot \prod_{j \in \mathrm{ch}(i)} \frac{q_j(w_j; x_j, y_j)}{q_j(w_j'; x_j', y_j)} \qquad (1)$$

where $y'$ and $w'$ denote the proposed state. The log-acceptance ratio is:

$$\log \alpha = \log \frac{q_{\mathrm{prop}}(y_i; y_i')}{q_{\mathrm{prop}}(y_i'; y_i)} + \sum_{j \in \{i\} \cup \mathrm{ch}(i)} \left[ \log \frac{p_j(w_j', y_j' \mid x_j')}{p_j(w_j, y_j \mid x_j)} + \log \frac{q_j(w_j; x_j, y_j)}{q_j(w_j'; x_j', y_j')} \right] \qquad (2)$$

$$= \log \frac{q_{\mathrm{prop}}(y_i; y_i')}{q_{\mathrm{prop}}(y_i'; y_i)} + \sum_{j \in \{i\} \cup \mathrm{ch}(i)} \left[ \log \frac{p_j(w_j', y_j' \mid x_j')}{q_j(w_j'; x_j', y_j')} - \log \frac{p_j(w_j, y_j \mid x_j)}{q_j(w_j; x_j, y_j)} \right] \qquad (3)$$

$$= \log q_{\mathrm{prop}}(y_i; y_i') - \log q_{\mathrm{prop}}(y_i'; y_i) + \sum_{j \in \{i\} \cup \mathrm{ch}(i)} \left( \ell_j' - \ell_j \right) \qquad (4)$$

where $y_i'$ is the proposed output, $y_j' = y_j$ for $j \in \mathrm{ch}(i)$, $x_i' = x_i$, the factors of $\tilde{\pi}$ for modules outside $\{i\} \cup \mathrm{ch}(i)$ cancel between numerator and denominator, and $\ell_j = \log \left( p_j(w_j, y_j \mid x_j) / q_j(w_j; x_j, y_j) \right)$ and $\ell_j' = \log \left( p_j(w_j', y_j' \mid x_j') / q_j(w_j'; x_j', y_j') \right)$ are the stored and the newly regenerated log-weights for module $j$.

This is the acceptance ratio used in Algorithm 1. Therefore Algorithm 1 corresponds to a valid MH update that can be used to compose valid MH algorithms that converge to the posterior $\pi(y_{\setminus O})$, such as a single-site random-scan mixture over Algorithm 1 applications for all unobserved modules $i \notin O$.