Stochastically Differentiable Probabilistic Programs

03/02/2020 ∙ by David Tolpin, et al.

Probabilistic programs with mixed support (both continuous and discrete latent random variables) commonly appear in many probabilistic programming systems (PPSs). However, the presence of discrete random variables prevents the use of basic gradient-based inference engines, which makes inference in such models particularly challenging. Existing PPSs either require the user to manually marginalize out the discrete variables or perform composing inference, running separate inference procedures on the discrete and continuous variables. The former is infeasible in most cases, whereas the latter has fundamental shortcomings. We present a novel approach to efficient and robust inference in such programs using the stochastic gradient Markov chain Monte Carlo family of algorithms. We compare our stochastic gradient-based inference algorithm against conventional baselines on several important cases of probabilistic programs with mixed support, and demonstrate that it outperforms existing composing inference baselines and performs almost as well as inference on marginalized versions of the programs, but with less programming effort and at a lower computational cost.


1 Introduction

Probabilistic programming [GMR+08, MSP14, WVM14, GS15] represents statistical models as programs written in an otherwise general programming language that provides syntax for the definition and conditioning of random variables. Inference can be performed on probabilistic programs to obtain the posterior distribution or point estimates of the variables. Inference algorithms are provided by the probabilistic programming framework, and each algorithm is usually applicable to a wide class of probabilistic programs in a black-box automated manner. The algorithms include Markov chain Monte Carlo (MCMC) variants such as Metropolis-Hastings [WSG11, MSP14, YHG14] and Hamiltonian Monte Carlo (HMC) [N12, Stan17], as well as expectation propagation [MWG+10], extensions of sequential Monte Carlo [WVM14, MYM+15, PWD+14, RNL+2016, MS18], variational inference [WW13, KTR+17], and gradient-based optimization [Stan17, BCJ+19].

A probabilistic program computes the (unnormalized) probability of its execution [WVM14]. An execution is summarized as an instantiation of the program trace, and the probability is a function of the trace. Some probabilistic programming frameworks require that the trace shape be specified upfront, i.e. be static [Stan17, T19], while others allow trace components to be introduced dynamically [GMR+08, P09, GS15, GXG18, TMY+16], in the course of a program execution.

Efficient inference methods for probabilistic programs, especially in the high-dimensional scenario, often involve computation of the gradient with respect to the latent variables [S16, Stan17, GXG18, T19, BCJ+19]. This restricts gradient-based inference methods to differentiable programs with continuous variables only. For probabilistic programs with a mixture of continuous and discrete variables, such as mixture models, state space models, and factor models, one must resort to alternatives which do not rely solely on differentiability, at the cost of lower performance and poorer scalability.

Take a standard Gaussian mixture model (GMM) as an example. In a GMM, the continuous latent variables are usually the means and variances of the Gaussian distribution of each mixture component, and the discrete latent variables are the assignments of data points to mixture components. Inference in a GMM can be performed either by manually marginalizing out the discrete latent variables, or by combining conditional gradient-based inference on the continuous variables with conditional gradient-free inference (such as Gibbs sampling) on the discrete variables. Manual marginalization is straightforward in the case of a GMM, as suggested by Stan [S16], but is more complex in most other cases, and can even be intractable. Conditional inference on each group of variables is prone to slow mixing and poor robustness in the face of multimodality. A generic robust inference method which both exploits differentiability and is applicable to probabilistic programs with mixed support is highly desirable.

In this work, we focus on probabilistic programs with a mixture of continuous and discrete variables that are differentiable with respect to the continuous variables, and propose to treat them as stochastically differentiable. We derive an unbiased stochastic estimate of the gradient of the marginal log likelihood that can be computed efficiently. We then demonstrate how stochastic gradient-based inference methods [MCF15], in particular stochastic gradient Hamiltonian Monte Carlo (sgHMC) [CFG14], can be applied to stochastically differentiable models with this estimate. A reference implementation of a probabilistic programming facility that supports stochastically differentiable probabilistic programs is presented in Section 4. We compare our proposed adaptation of sgHMC on three stochastically differentiable models against manually marginalized inference and composing inference, the state-of-the-art automated method in PPSs, and empirically confirm the substantial improvements brought by our approach.

Contributions

This work brings the following contributions:

  • the notion of a stochastically differentiable probabilistic program;

  • an unbiased stochastic estimator for the gradient of the marginal log likelihood of such a program;

  • an adaptation of sgHMC with this estimator as the automated inference engine;

  • a reference implementation of probabilistic programming with support for stochastically differentiable probabilistic programs in the Go programming language [Golang].

Notation

We denote by $p(x)$ the probability or probability density of random variable $x$, and by $p(x|y)$ the conditional probability or probability density of $x$ given $y$, depending on the domain of $x$. We write $x \sim p(x)$ when $x$ is a random variable with probability $p(x)$. We denote by $\tilde p(x)$ a probability known up to a normalization constant, i.e. the unnormalized probability.

2 Stochastically Differentiable Probabilistic Program

To define a stochastically differentiable probabilistic program, we begin with definitions of a deterministic and a stochastic probabilistic program. Different definitions of (deterministic) probabilistic programs are given in the literature, reflecting different views and emphases. For the purpose of this work, let us give the following broad definition:

Definition 1.

A deterministic probabilistic program $\Gamma$ that defines a distribution over traces $\pmb{x}$ is a computer program that accepts a trace assignment $\pmb{x}$ as one of its arguments, and returns the unnormalized probability of $\pmb{x}$ conditioned on other arguments $\pmb{y}$:

$$\Gamma(\pmb{x}, \pmb{y}) \propto p(\pmb{x}|\pmb{y}) \qquad (1)$$

The trace assignment $\pmb{x}$ may have the form of a vector, or of a list of address-value pairs, or any other form suitable for a particular implementation. Accordingly, let us define a stochastic probabilistic program:

Definition 2.

A stochastic probabilistic program $\Gamma$ that defines a distribution over traces $\pmb{x}$ is a computer program that accepts a trace assignment $\pmb{x}$ as one of its arguments, and returns the unnormalized probability of $\pmb{x}$ conditioned on other arguments $\pmb{y}$ and random variable $\pmb{z}$ conditioned on $\pmb{y}$:

$$\Gamma(\pmb{x}, \pmb{y}) \propto p(\pmb{x}|\pmb{y}, \pmb{z}), \quad \pmb{z} \sim p(\pmb{z}|\pmb{y}) \qquad (2)$$

A rationale for this definition is that $\pmb{z}$ corresponds to the nuisance parameters or nondeterministic choices inside the program. Finally, let us define a stochastically differentiable probabilistic program:

Definition 3.

A stochastically differentiable probabilistic program is a stochastic probabilistic program (Definition 2) with trace $\pmb{x}$ such that $\tilde p(\pmb{x}|\pmb{y}, \pmb{z})$ is differentiable w.r.t. $\pmb{x}$ for any $\pmb{y}$ and $\pmb{z}$.

\begin{figure*}[!t]
    \centering
    \begin{subfigure}{0.45\textwidth}
        \centering
    \begin{lstlisting}[basicstyle=\small]
func Survey(theta float, y []bool) float {
    prob := Beta(1, 1).pdf(theta)
    for i := 0; i != len(y); i++ {
        coin := Bernoulli(0.5).sample()
        if coin {
            prob *= Bernoulli(theta).pdf(y[i])
        } else {
            prob *= Bernoulli(0.5).pdf(y[i])
        }
    }
    return prob
}\end{lstlisting}
  \subcaption{A probabilistic program for the compensation
    survey. Random variables \lstinline{coin} at line $4$ are the nuisance variables
     which make the program stochastic.}
  \label{fig:survey-stochastic}
        \end{subfigure}%
    ~
    \hspace{15pt}
    \begin{subfigure}{0.45\textwidth}
        \centering
    \vspace{-10pt}
    \begin{lstlisting}[basicstyle=\small]
func Survey(theta float, y []bool) float {
    prob := Beta(1, 1).pdf(theta)
    for i := 0; i != len(y); i++ {
        prob *= 0.5*Bernoulli(theta).pdf(y[i]) +
                0.5*Bernoulli(0.5).pdf(y[i])
    }
    return prob
}\end{lstlisting}
\subcaption{An equivalent deterministic version of the compensation
   survey model in Figure~\ref{fig:survey-stochastic}.
   The probability is marginalized over the coin flip
   in the code of the program.}
\label{fig:survey-deterministic}
    \end{subfigure}%
\caption{Compensation survey model}
\end{figure*}

Let us illustrate a stochastic (and stochastically
differentiable) probabilistic program on an example:
\begin{example}
    A survey is conducted among a company's employees. The survey
    contains a single question: ``Are you satisfied with your
    compensation?'' To preserve employees' privacy, each employee
    flips a coin before answering the question. If the coin
    shows heads, the employee answers honestly; otherwise, the
    employee flips the coin again, and answers `yes' on heads and
    `no' on tails. Based on the survey outcomes, we want to know how
    many of the employees are satisfied with their compensation.
    \label{ex:survey}
\end{example}

\paragraph{Stochastic probabilistic program} The stochastic
probabilistic program modelling the survey
(Figure~\ref{fig:survey-stochastic}) receives two parameters:
the probability \lstinline{theta} that a randomly chosen employee is
satisfied with the compensation, and the vector \lstinline{y} of
boolean survey outcomes. The program trace consists of a single
variable \lstinline{theta}. The nuisance random variables
\lstinline{coin}, representing coin flips, are sampled
inside the program from their prior distribution but are not
included in the trace; this makes the probabilistic program
stochastic. The unnormalized probability of the trace is
accumulated in variable \lstinline{prob}. First, a Beta prior
is imposed on \lstinline{theta} (line 2). Then,
\lstinline{prob} is multiplied by the probability of each answer
given \lstinline{theta} and \lstinline{coin} (lines 5--9).

\paragraph{Manual marginalization} Though this is not generally
possible, the program in
Figure~\ref{fig:survey-stochastic} can be rewritten as a
deterministic probabilistic program
(Figure~\ref{fig:survey-deterministic}). Instead of flipping a
coin as in line~4 of the stochastic program in
Figure~\ref{fig:survey-stochastic}, the probability of
observation \lstinline{y[i]} is computed in lines~4--5 as the
sum of probabilities given either heads or tails, weighted by the
probabilities of heads and tails (both are 0.5).
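In other words, the marginalized program computes, for each observation,
\begin{equation*}
p(y_i|\theta) = \tfrac{1}{2}\,\mathrm{Bernoulli}(y_i|\theta)
              + \tfrac{1}{2}\,\mathrm{Bernoulli}(y_i|0.5),
\end{equation*}
summing over both outcomes of the coin flip instead of sampling one of them.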
For the case studies (Section~\ref{sec:studies}), we chose
probabilistic programs for which manual marginalization was
possible, and compared inference in the stochastic and
marginalized versions.

\section{Inference}
\label{sec:infer}

Efficient posterior inference in probabilistic models for which
a stochastic gradient estimate of the logarithm of the
unnormalized posterior density is available can be performed
using a stochastic gradient Markov chain Monte
Carlo method~\cite{MCF15}. Stochastic gradient posterior inference was
originally motivated by the handling of large data sets, where the
gradient is estimated by subsampling, that is, by computing the
gradient on a small part of the data set~\cite{CFG14}. However,
stochastic gradient MCMC methods are agnostic to the way the
gradient is estimated, provided an unbiased estimate is available,
and can just as well be applied to stochastically differentiable
probabilistic programs. Note, though, that the naive estimate of the
gradient as $\nabla_{\pmb{x}} \log \tilde p(\pmb{x}|\pmb{y}, \pmb{z})$,
where $\pmb{z}$ is drawn from the prior $p(\pmb{z}|\pmb{y})$,
is a \textbf{biased} estimate of
$\nabla_{\pmb{x}} \log \tilde p(\pmb{x}|\pmb{y})$: the marginal
$\tilde p(\pmb{x}|\pmb{y})$ is an expectation over $\pmb{z}$, so its
logarithm is the logarithm of a sum, and the gradient of one term (or
of a few terms) under the logarithm, averaged over prior draws of
$\pmb{z}$, does not average to the gradient of the logarithm of the
whole sum.
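Explicitly, with $\pmb{z}_i \sim p(\pmb{z}|\pmb{y})$ drawn from the prior,
\begin{equation*}
\mathbb{E}\left[\frac 1 N \sum_{i=1}^N \nabla_{\pmb{x}} \log \tilde p(\pmb{x}|\pmb{y}, \pmb{z}_i)\right]
= \nabla_{\pmb{x}}\, \mathbb{E}_{\pmb{z}}\left[\log \tilde p(\pmb{x}|\pmb{y}, \pmb{z})\right]
\ne \nabla_{\pmb{x}} \log \mathbb{E}_{\pmb{z}}\left[\tilde p(\pmb{x}|\pmb{y}, \pmb{z})\right]
= \nabla_{\pmb{x}} \log \tilde p(\pmb{x}|\pmb{y}),
\end{equation*}
because the expectation and the logarithm do not commute.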

In what follows, we derive an
\textbf{unbiased} estimate of the stochastic gradient and elaborate on
the use of this estimate within the framework of stochastic
gradient Hamiltonian Monte Carlo~(sgHMC).

\subsection{Unbiased Stochastic Gradient Estimate}

The unnormalized probability density $\tilde p(\pmb{x}|\pmb{y})$
is a marginalization of $\tilde p(\pmb{x}|\pmb{y},\pmb{z})$ over
$\pmb{z}$:
\begin{equation}
    \tilde p(\pmb{x}|\pmb{y}) = \int_z p(\pmb{z}|\pmb{y}) \tilde p(\pmb{x}|\pmb{y},\pmb{z})d\pmb{z}
\end{equation}
Posterior inference and maximum \textit{a posteriori} estimation
require a stochastic estimate of $\nabla_{\pmb{x}} \log \tilde p(\pmb{x}|\pmb{y})$:
\begin{align}
    \nonumber \nabla_{\pmb{x}} \log \tilde p(\pmb{x}|\pmb{y}) &= \frac {\int_z p(\pmb{z}|\pmb{y}) \tilde p(\pmb{x}|\pmb{y},\pmb{z})\nabla_{\pmb{x}} \log \tilde p(\pmb{x}|\pmb{y},\pmb{z})d\pmb{z}} {\tilde p(\pmb{x}|\pmb{y})}  \\
    \nonumber &= \int_z p(\pmb{z}|\pmb{y}) \frac {\tilde p(\pmb{x}|\pmb{y},\pmb{z})} {\tilde p(\pmb{x}|\pmb{y})} \nabla_{\pmb{x}} \log \tilde p(\pmb{x}|\pmb{y},\pmb{z})d\pmb{z} \\ \nonumber
    &= \int_z  \frac {p(\pmb{x}|\pmb{y},\pmb{z})p(\pmb{z}|\pmb{y})} {p(\pmb{x}|\pmb{y})} \nabla_{\pmb{x}} \log \tilde p(\pmb{x}|\pmb{y},\pmb{z})d\pmb{z} \\
    &= \int_z p(\pmb{z}|\pmb{x},\pmb{y}) \nabla_{\pmb{x}} \log \tilde p(\pmb{x}|\pmb{y},\pmb{z})d\pmb{z}
\end{align}
By Monte Carlo approximation,
\begin{align}
    \label{eqn:grad-mc-marg}
    \nonumber \pmb{z}_i & \sim p(\pmb{z}|\pmb{x}, \pmb{y}) \\
    \nabla_{\pmb{x}} \log \tilde p(\pmb{x}|\pmb{y}) & \approx \frac 1 N \sum_{i=1}^N \nabla_{\pmb{x}} \log \tilde p(\pmb{x}|\pmb{y}, \pmb{z}_i)
\end{align}
Draws of $\pmb{z}_i$ are conditioned on $\pmb{x}$ and can be
approximated by a Monte Carlo method, such as Markov chain Monte
Carlo; $p(\pmb{z}|\pmb{y})$ need not be known for the
approximation.
Note that there are two Monte Carlo
approximations involved:
\begin{enumerate}
    \item gradient $\nabla_{\pmb{x}} \log \tilde p(\pmb{x}|\pmb{y})$
        is approximated by a finite sum of gradients
        $\nabla_{\pmb{x}} \log \tilde p(\pmb{x}|\pmb{y}, \pmb{z}_i)$;
    \item draws of $\pmb{z}_i$ from $p(\pmb{z}|\pmb{x}, \pmb{y})$ are
        approximated.
\end{enumerate}

The intuition behind estimate~(\ref{eqn:grad-mc-marg}) is that,
for marginalization, assignments to the nuisance parameters which
make the trace assignment more likely contribute more to the
gradient estimate.
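
To make the estimator concrete, the following sketch implements
estimate~(\ref{eqn:grad-mc-marg}) in Go. The helpers
\lstinline{SampleZPosterior} (an MCMC draw of $\pmb{z}$ from
$p(\pmb{z}|\pmb{x},\pmb{y})$) and \lstinline{GradLogProb}
(the gradient $\nabla_{\pmb{x}} \log \tilde p(\pmb{x}|\pmb{y},\pmb{z})$,
e.g. obtained by automatic differentiation) are hypothetical
stand-ins for facilities of the host framework:
\begin{lstlisting}
// gradEstimate sketches the Monte Carlo estimate above:
// it averages gradients of the conditional log density
// over N draws of the nuisance variables z from their
// posterior.
func gradEstimate(x, y []float64, N int) []float64 {
    grad := make([]float64, len(x))
    for i := 0; i != N; i++ {
        z := SampleZPosterior(x, y) // z_i ~ p(z|x, y), by MCMC
        g := GradLogProb(x, y, z)   // grad of log p~(x|y, z_i)
        for j := range grad {
            grad[j] += g[j] / float64(N)
        }
    }
    return grad
}
\end{lstlisting}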

\subsection{Inference with Stochastic Gradient HMC}

Stochastic gradient Hamiltonian Monte Carlo~\cite{CFG14} does
not involve a Metropolis-Hastings correction and requires only
that a routine computing a stochastic estimate of the
gradient of the logarithm of the unnormalized posterior density be
provided. We propose here to use a single-sample gradient
estimate, where $\pmb{z}$ is drawn using an MCMC method, such as
a Gibbs sampler or a variant of Metropolis-Hastings:
\begin{algorithm}[h]
    \caption{Single-sample gradient estimate}
    \label{alg:sge}
    \begin{enumerate}
        \item Draw $\pmb{z}$ from $p(\pmb{z}|\pmb{x}, \pmb{y})$ (MCMC).
        \item Return $\nabla_{\pmb{x}} \log \tilde p(\pmb{x}|\pmb{y}, \pmb{z})$.
    \end{enumerate}
\end{algorithm}

Note that $\pmb{z}$ is freshly drawn for each estimation of the gradient.
Hence, every step through $\pmb{x}$ is performed using a gradient estimate
based on \textit{a new sample} of $\pmb{z}$. While sgHMC was originally
formulated with mini-batch gradient estimates over large data sets, the
algorithm only requires an unbiased stochastic gradient estimate and does
not restrict how the estimate is obtained. Here, the gradient is computed
on the full data $\pmb{y}$, and the stochasticity comes from a different
source: instead of subsampling observations, we subsample assignments to
the nuisance variables $\pmb{z}$ from their posterior distribution.
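
As an illustration, a sketch of a single sgHMC transition with this
estimator follows; it uses the discretized dynamics of~\cite{CFG14}
(with the gradient-noise estimate $\hat\beta$ set to zero), the
\lstinline{gradEstimate} sketch from the previous section, and assumes a
\lstinline{Normal(mean, stddev).sample()} primitive as in the listings
above; the step size \lstinline{eta}, the friction \lstinline{alpha},
and the number of leapfrog steps \lstinline{L} are assumed to be tuned
elsewhere:
\begin{lstlisting}
// sgHMCStep sketches one sgHMC transition: the momentum r
// is refreshed with a new stochastic gradient on every
// leapfrog step, so z is re-drawn L times per transition.
func sgHMCStep(x, r, y []float64, eta, alpha float64, L int) {
    for l := 0; l != L; l++ {
        for j := range x {
            x[j] += eta * r[j] // position update
        }
        g := gradEstimate(x, y, 1) // fresh z on every step
        for j := range r {
            // momentum update with friction and injected
            // noise, following [CFG14] with beta-hat = 0
            r[j] += eta*g[j] - alpha*r[j] +
                Normal(0, math.Sqrt(2*alpha)).sample()
        }
    }
}
\end{lstlisting}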

The variance of this stochastic gradient estimate can be reduced by
computing and averaging multiple estimates, each for a new draw
of $\pmb{z}$. This is similar to the well-known
mini-batching in stochastic gradient descent. In the case
studies (Section~\ref{sec:studies}), we compare the statistical
efficiency and computation time of inference with the
single-sample and multi-sample stochastic gradient estimates.

Compare inference with sgHMC, using Algorithm~\ref{alg:sge} as the
gradient estimator, to alternating inference, with (non-stochastic)
HMC on $\pmb{x}$ and an MCMC variant on $\pmb{z}$:
\begin{algorithm}[h]
    \caption{Baseline: alternating HMC on $\pmb{x}$ and MCMC on $\pmb{z}$}
    \label{alg:hmc}
    \begin{enumerate}
        \item Draw $\pmb{z}$ from $p(\pmb{z}|\pmb{x}, \pmb{y})$ (MCMC).
        \item Draw $\pmb{x}$ from $p(\pmb{x}|\pmb{y}, \pmb{z})$ (HMC).
    \end{enumerate}
\end{algorithm}
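
For contrast with the sgHMC sketch above, here is a sketch of one
iteration of this baseline; \lstinline{HMCStep} is a hypothetical
standard HMC transition that runs all of its leapfrog steps with the
nuisance variables held fixed:
\begin{lstlisting}
// alternatingStep sketches one iteration of Algorithm 2:
// z is drawn once, and the whole HMC transition, including
// every leapfrog step, uses that same z.
func alternatingStep(x, y []float64, eta float64, L int) {
    z := SampleZPosterior(x, y) // one draw per HMC iteration
    HMCStep(x, y, z, eta, L)    // all gradients use this z
}
\end{lstlisting}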

Despite the apparent similarity between Algorithms~\ref{alg:sge}
and~\ref{alg:hmc}, one HMC iteration of Algorithm~\ref{alg:hmc}
involves multiple computations of the gradient for each draw of
$\pmb{x}$, one per leapfrog step, all for \textit{the same sample}
of $\pmb{z}$.
A new sample of
$\pmb{z}$ is only drawn between iterations of HMC. This
results in poorer mixing and may lead the sampler to get stuck
in one of the modes of a multimodal posterior distribution.

As an illustration, consider the probabilistic program in
Figure~\ref{fig:sge-vs-hmc-pp}.
\begin{figure*}[!t]
    \centering
    \hspace{5pt}
    \begin{subfigure}{0.45\textwidth}
\begin{lstlisting}
func TwoNormals(x float) float {
    z := Bernoulli(0.5).sample()
    if z {
        return Normal(1, 0.5).pdf(x)
    } else {
        return Normal(-1, 0.5).pdf(x)
    }
}\end{lstlisting}
\subcaption{Mixture of two Gaussians: nuisance variable $z$
    selects either $\mathcal{N}(-1, 0.5)$ or $\mathcal{N}(1,
    0.5)$.}
\label{fig:sge-vs-hmc-pp}
    \end{subfigure}%
    ~
    \hspace{5pt}
    \begin{subfigure}{0.45\textwidth}
 \includegraphics[width=\linewidth]{sge-vs-hmc}
                \subcaption{Mixture of two Gaussians: HMC conditioned on either $\pmb{z}=\mathit{false}$
                or $\pmb{z}=\mathit{true}$ will alternate between exploring each of the
    components, while sgHMC will directly draw samples from the posterior.}
\label{fig:sge-vs-hmc}
    \end{subfigure}%
    \caption{An illustrative example: sgHMC vs. MCMC + HMC}
\end{figure*}
This program specifies a Gaussian mixture model, $p(x) = 0.5 p(x|\mu=-1, \sigma=0.5) + 0.5
p(x|\mu=1, \sigma=0.5)$. Nuisance variable $z$ selects either
component with equal probability. Figure~\ref{fig:sge-vs-hmc}
shows the densities of each of the components and of the mixture.
sgHMC with Algorithm~\ref{alg:sge} as the gradient estimator
samples directly from the entire mixture density: it re-draws $\pmb z$
before each gradient step on $\pmb x$, which enables the posterior
samples of $\pmb x$ to cover both modes.
By contrast, the HMC step of Algorithm~\ref{alg:hmc} runs conditioned
on a fixed $\pmb z$, i.e. it samples $\pmb x$ for $L$ leapfrog steps
with $\pmb z$ held at either $\mathit{false}$ or $\mathit{true}$,
which makes the samples concentrate around one mode.
The samples of $\pmb x$ only move to the other mode once $\pmb z$
changes, and $\pmb z$ is re-drawn based on samples from the current
mode, resulting in a potentially low mixing rate, whereas
Algorithm~\ref{alg:sge} estimates the gradient for a new draw of
$\pmb{z}$ on each invocation. More evidence of the poorer mixing and
lower statistical efficiency of the alternating baseline is provided
in the case studies.

\section{Case Studies}
\label{sec:studies}

We implemented inference in stochastically differentiable
probabilistic programs using Infergo~\cite{T19}, a probabilistic
programming facility for the Go programming language. Because
Infergo relies on pure Go code for probabilistic programs,
no changes were necessary to the way probabilistic programs are
represented. The implementation of sgHMC for the case studies in
this paper will be made publicly available upon publication
and proposed for inclusion in Infergo.

We evaluated inference in stochastically differentiable
probabilistic programs on several models. In each evaluation, we
created two probabilistic programs:
the standard probabilistic program with mixed support,
as users may write it (akin to Figure~\ref{fig:survey-stochastic}),
and a corresponding marginalized program (akin to Figure~\ref{fig:survey-deterministic}).
Additionally, for comparison,
we implemented a marginalized program for each
model in Stan~\cite{Stan17};
we verified that the posterior
distributions inferred by each of the inference schemes on both
mixed support and marginalized programs are consistent with the
Stan version. We provide results for three models in this study:
the compensation survey (Section~\ref{sec:survey}), a Gaussian
mixture model, and a hidden Markov model. Details of each model
are provided later in this section. The full source code, data,
and evaluation scripts for the case studies will be made
publicly available upon publication.

The purpose of these case studies is to compare the statistical
efficiency and computation time of sgHMC, of a combination of MCMC
and HMC on mixed-support models, and of HMC on marginalized models.
For each model, we measured, and averaged over 10 independent
runs, the effective sample size and the computation time. Four
inference schemes were applied:
\begin{itemize}
    \item (ours) sgHMC with a single-sample gradient estimate;
    \item (ours) sgHMC with a 10-sample gradient estimate;
    \item a combination of MCMC (sitewise Metropolis-Hastings)
        on discrete variables and HMC on continuous variables;
    \item HMC on the marginalized model.
\end{itemize}
All inference schemes were run to produce $10\,000$ samples
(sufficient for convergence on all models in the studies),
with $10$ steps (gradient estimates) between samples.
The step size resulting in the largest effective sample size for HMC on the
marginalized model was chosen and fixed for all other schemes
on the same model. This choice of the step size favors HMC in
the comparison, meaning that the actual comparative performance of
sgHMC on stochastically differentiable probabilistic programs is
at least as good as, or better than, what follows from the evaluation.

\begin{table*}[!t]
    \centering
    \caption{Effective sample size}
    \label{tab:ess}

    \begin{tabular}{r | c | c | c | c }
        {\it model} & {\it sgHMC, 1 sample} & {\it sgHMC, 10 samples} & {\it MH+HMC} & {\it HMC, marginalized}  \\ \hline
        Survey & $4600\pm 180$ & $5000\pm 160$ & $2800\pm 300$ & $5200\pm 200$  \\
        GMM & $5900\pm240$ & $6800\pm200$ & $1900\pm200$ & $7200\pm150$  \\
        HMM & $6200\pm280$ & $6300\pm220$ & $4800\pm190$ & $6700\pm180$
    \end{tabular}
\end{table*}

\begin{table*}[!t]
    \centering
    \caption{Computation time, seconds}
    \label{tab:time}

    \begin{tabular}{r | c | c | c | c }
        {\it model} & {\it sgHMC, 1 sample} & {\it sgHMC, 10 samples} & {\it MH+HMC} & {\it HMC, marginalized}  \\ \hline
        Survey & $6.5 \pm 0.1$ & $40 \pm 1$ & $7.4\pm0.1$ & $21\pm0.5$ \\
        GMM & $35\pm 0.7$ & $340\pm 5$ & $38\pm 0.7$ &  $95\pm2$ \\
        HMM & $10\pm 0.3$ & $98 \pm 3$ & $10 \pm 0.3$ & $46 \pm 0.8$
    \end{tabular}
\end{table*}

\begin{table*}[!t]
    \centering
    \caption{Effective sample size per second}
    \label{tab:persec}

    \begin{tabular}{r | c | c | c}
        {\it model} & {\it sgHMC, 1 sample} & {\it MH+HMC} & {\it HMC, marginalized}  \\ \hline
        Survey & {\bf 650} & 380  & 240 \\
        GMM & {\bf 170} & 50  & 75 \\
        HMM & {\bf 620} & 480 & 145 \\
    \end{tabular}
\end{table*}

The measurements are summarized in Tables~\ref{tab:ess}
and~\ref{tab:time}, including empirical standard deviations over
multiple runs. Table~\ref{tab:ess} shows effective sample sizes;
larger numbers are better. The effective sample size of HMC on the
marginalized program is 15\%--20\% larger than that of sgHMC with a
single-sample gradient estimate and 5--7\% larger than that of sgHMC
with the 10-sample estimate. The MH+HMC combination consistently
produces effective sample sizes which are lower by 30--60\%.
Table~\ref{tab:time} shows computation times. The 10-sample
gradient estimate is computationally expensive and results in an
almost linear increase in the total computation time,
cancelling the improvement due to a less noisy gradient
estimate; sgHMC with a single-sample estimate exhibits the shortest
computation time for all models, 2.5--4.5 times shorter than HMC on the
marginalized program. This is because the marginalized version both has a
higher asymptotic complexity (more details are in the descriptions
of each of the case studies) and involves computationally
expensive exponentiation (calls to \lstinline{LogSumExp} in
Figures~\ref{fig:survey-marg}, \ref{fig:gmm-marg}, and~\ref{fig:hmm-marg}).
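
For reference, \lstinline{LogSumExp} computes $\log(e^a + e^b)$ while
avoiding overflow, by factoring out the maximum of its arguments. A
minimal two-argument sketch of such a helper, using the standard
\lstinline{math} package (the case-study code may instead operate on
slices of terms), is:
\begin{lstlisting}
// LogSumExp computes log(exp(a) + exp(b)) stably by
// factoring out the larger of the two arguments.
func LogSumExp(a, b float64) float64 {
    m := math.Max(a, b)
    if math.IsInf(m, -1) { // both terms are zero probability
        return m
    }
    return m + math.Log(math.Exp(a-m)+math.Exp(b-m))
}
\end{lstlisting}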

Table~\ref{tab:persec} combines Tables~\ref{tab:ess}
and~\ref{tab:time} to show the approximate effective sample size
per second. Despite the slightly lower total effective sample size of
sgHMC, the effective sample size per second of sgHMC with a single-sample
gradient estimate is 2--4 times larger than that of HMC on the
marginalized model. To summarize, sgHMC with a single-sample
gradient estimate has the best performance on the mixed support
probabilistic programs used in the case studies, outperforming
HMC on marginalized models severalfold, even for
parameters favoring HMC in the comparison.

\subsection{Compensation Survey}
\label{sec:survey}

\input{code_examples/survey.tex}

The compensation survey follows the description in
Example~\ref{ex:survey}. The mixed support program is shown in
Figure~\ref{fig:survey} and the marginalized version in
Figure~\ref{fig:survey-marg}. For evaluation, a data set of 60
observations, corresponding to 67\% satisfaction, was used. The
marginalized version has the same asymptotic complexity as the
mixed support version, but should take at least twice as long to
run due to the computation of the log probability for both coin flip
outcomes (lines~5--7). This is consistent with the empirical
measurements, 21 vs. 6.5 seconds (Table~\ref{tab:time}), with the
extra time spent in \lstinline{LogSumExp}.

\subsection{Gaussian Mixture Model}

\input{code_examples/gmm.tex}

The Gaussian mixture model infers the means and standard deviations
of the components, given a set of samples from the mixture and
the number of components. The mixed support program is shown in
Figure~\ref{fig:gmm} and the marginalized version in
Figure~\ref{fig:gmm-marg}. For evaluation, a data set of
100 observations, corresponding to two components with equal
probability of either component, was used. The computation time
of the marginalized version is at least the computation time of
the mixed support program times the number of components (at least
twice as long for two components), due to the inner loop in
lines~8--15. This is consistent with the empirical measurements,
95 vs. 35 seconds (Table~\ref{tab:time}), with the extra time spent
in \lstinline{LogSumExp}.
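
The case-study programs are provided in the supplementary code; for
orientation, a condensed sketch of what the mixed support GMM program
may look like, in the style of Figure~\ref{fig:survey-stochastic}, is
given below. \lstinline{LogNormal} is assumed to be provided by the
framework, the priors mirror the Stan version in the appendix, and the
actual program in Figure~\ref{fig:gmm} may differ in details:
\begin{lstlisting}
// Gmm sketches a two-component mixed support GMM: the
// component assignment of each observation is a nuisance
// variable sampled from its prior.
func Gmm(mu, sigma [2]float64, y []float64) float64 {
    prob := 1.0
    for j := 0; j != 2; j++ {
        prob *= Normal(0, 10).pdf(mu[j])       // prior on means
        prob *= LogNormal(0, 10).pdf(sigma[j]) // prior on stds
    }
    for i := 0; i != len(y); i++ {
        k := 0 // nuisance: component assignment, equal weights
        if Bernoulli(0.5).sample() {
            k = 1
        }
        prob *= Normal(mu[k], sigma[k]).pdf(y[i])
    }
    return prob
}
\end{lstlisting}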

\subsection{Hidden Markov Model}

\input{code_examples/hmm.tex}

The hidden Markov model infers the transition matrix given the
emission matrix and the observations. The mixed support program
is shown in Figure~\ref{fig:hmm} and the marginalized version in
Figure~\ref{fig:hmm-marg}. For evaluation, a data set of 16
observations, corresponding to three distinct hidden states, was
used. Following~\cite{SDT18}, the marginalized version
implements the costly forward pass of the forward-backward
algorithm. The computation time of the marginalized version, in
a na\"ive implementation, is
at least the computation time of the mixed support program times
the squared number of hidden states (at least nine times longer
for three hidden states), due to the doubly-nested loop
in lines~11--20. The forward pass also involves two distinct
calls to \lstinline{LogSumExp}: one in line~18, on each
iteration of the inner loop, and the other in line~22. The
marginalized version in Figure~\ref{fig:hmm-marg} optimizes the
inner loop by reusing computations, but the computation time
of the marginalized version is still five times longer than that of
the mixed support version, 46 vs. 10 seconds (Table~\ref{tab:time}).
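
For orientation, a condensed sketch of what the mixed support HMM
program may look like is given below; the hidden state sequence is
the nuisance variable, \lstinline{Categorical} is assumed to sample
an index given a vector of weights, and the actual program in
Figure~\ref{fig:hmm} may differ in details:
\begin{lstlisting}
// Hmm sketches a mixed support HMM: the hidden state
// sequence is sampled from the prior defined by the
// transition matrix theta, and state k emits observations
// around mean k with known noise.
func Hmm(theta [][]float64, y []float64, noise float64) float64 {
    prob := 1.0
    k := 0 // initial hidden state
    for t := 0; t != len(y); t++ {
        if t > 0 {
            k = Categorical(theta[k]).sample() // transition
        }
        prob *= Normal(float64(k), noise).pdf(y[t])
    }
    return prob
}
\end{lstlisting}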

\section{Related Work}
\label{sec:related}

Stochastically differentiable probabilistic programs as introduced in Section~\ref{sec:spp},
though they may be named differently,
appear commonly in many probabilistic programming systems~(PPSs).
However, the mixture of continuous and discrete random variables substantially complicates automated inference in a PPS.
In particular, inference can become extremely challenging
if the program trace is not static, i.e. if the support of the program is dynamic.

To perform inference on such models in PPSs such as
Stan~\cite{Stan17},
which are usually designed around one or two efficient gradient-based inference methods such as HMC,
the user needs to manually marginalize out the discrete nuisance variables, since they violate the prerequisites of the inference engines.
Unfortunately, this is not feasible in many cases, and it places an unnecessary burden on the user, defeating the purpose of automation in PPSs in the first place.

An alternative is offered by PPSs such as PyMC3~\cite{S16}, Turing.jl~\cite{GXG18}, and Gen~\cite{Gen19},
in which one can compose inference by assigning different kernels to different groups of variables.
For example, one can use a Metropolis-within-Gibbs sampler for the nuisance variables and HMC for the remaining continuous ones.
This is arguably the state-of-the-art method for this type of model, especially in PPSs oriented towards gradient-based inference engines.
However, as we discussed in Section~\ref{sec:infer} and empirically confirmed in Section~\ref{sec:studies},
this method has some fundamental shortcomings, and our approach provides substantial practical improvements over it.

There are also other approaches to improving inference performance on this type of model.
For example, LF-PPL~\cite{zhou2019lf} proposed a novel low-level language for PPSs to incorporate more sophisticated inference techniques for piecewise differentiable models;
Stochaskell~\cite{roberts2019reversible} provided a way to perform reversible jump MCMC in a PPS;
\cite{ZYT+19} introduced a new sampling scheme, called Divide, Conquer, and Combine, to perform inference in a general purpose PPS by dividing the program into sub-programs and conquering each of them individually before combining the results into an overall estimate.
However, LF-PPL imposes restrictions on the language of the PPS that are not required in our setup,
whereas the latter two do not exploit gradient information at all.
Therefore, they are less relevant to our approach, and we leave possible connections for future work.

\section{Discussion}
\label{sec:disc}

In this paper, we introduced the notion of a stochastically
differentiable probabilistic program, as well as an inference
scheme for such programs.
We provided a reference implementation
of probabilistic programming with stochastically differentiable
probabilistic programs.
Stochastically differentiable probabilistic programs facilitate natural specification of
probabilistic models and support efficient inference in those
models, providing a viable alternative to explicit model
marginalization, whether manual or algorithmic.

Our understanding of stochastically differentiable probabilistic
programs, of the classes of models for which they are best suited, and
of the corresponding inference schemes is still at an early stage and evolving.
In particular, we are working, towards future publications, on
maximum \textit{a posteriori} estimation in stochastically
differentiable probabilistic programs. Another research
direction is the selection and adaptation of stochastic
gradient-based inference algorithms for probabilistic programming
inference. Last but not least, we are looking into introducing
support for stochastically differentiable probabilistic programs
into existing probabilistic programming frameworks.

\bibliography{refs}
\bibliographystyle{plain}

\clearpage
\onecolumn
\appendix

\clearpage
\section{Stan versions of the programs in the case studies}

\subsection{Compensation Survey}

\begin{figure*}[!h]
\begin{lstlisting}[language=Stan]
data {
    int<lower=0> N;
    int<lower=0,upper=1> y[N];
}
parameters {
    real<lower=0,upper=1> theta;
}
model {
    for (n in 1:N) {
        target += log_mix(0.5,
            bernoulli_lpmf(y[n]|theta),
            bernoulli_lpmf(y[n]|0.5));
    }
}\end{lstlisting}
\end{figure*}

\subsection{Gaussian Mixture Model}

\begin{figure*}[!h]
\begin{lstlisting}[language=Stan]
data {
    int<lower=0> N;
    vector[N] y;
}
parameters {
    vector[2] mu;
    real<lower=0> sigma[2];
}
model {
    mu ~ normal(0, 10);
    sigma ~ lognormal(0, 10);
    for (n in 1:N) {
        target += log_mix(0.5,
            normal_lpdf(y[n] | mu[1], sigma[1]),
            normal_lpdf(y[n] | mu[2], sigma[2]));
    }
}\end{lstlisting}
\end{figure*}

\subsection{Hidden Markov Model}

\begin{figure*}[!h]
\begin{lstlisting}[language=Stan]
data {
    real<lower=0> noise;
    int<lower=1> K; // number of states
    int<lower=1> N;
    real y[N];
}
parameters {
    simplex[K] theta[K];
}
model {
    real acc[K];
    real gamma[N,K];
    for (k in 1:K)
        gamma[1,k] = normal_lpdf(y[1]|k-1, noise);
    for (t in 2:N) {
        for (k in 1:K) {
            for (j in 1:K)
                acc[j] = gamma[t-1,j] + log(theta[j,k])
                    + normal_lpdf(y[t]|k-1, noise);
            gamma[t,k] = log_sum_exp(acc);
        }
    }
    target += log_sum_exp(gamma[N]);
}\end{lstlisting}
\end{figure*}

\end{document}
24}