Meta Particle Flow for Sequential Bayesian Inference

02/02/2019 ∙ by Xinshi Chen, et al. ∙ Georgia Institute of Technology

We present a particle flow realization of Bayes' rule, where an ODE-based neural operator is used to transport particles from a prior to its posterior after a new observation. We prove that such an ODE operator exists and that its neural parameterization can be trained in a meta-learning framework, allowing this operator to reason about the effect of an individual observation on the posterior, and thus to generalize across different priors, observations and online Bayesian inference settings. We demonstrate the generalization ability of our particle flow Bayes operator in several canonical and high-dimensional examples.


1 Introduction

In many data analysis tasks, it is important to estimate unknown quantities $\theta$ from observations $x_{1:t} = \{x_1, \ldots, x_t\}$. Given prior knowledge $\pi(\theta)$ and likelihood functions $p(x_t \mid \theta)$, the essence of Bayesian inference is to compute the posterior $p(\theta \mid x_{1:t})$ by Bayes' rule. For many nontrivial models, the prior might not be conjugate to the likelihood, so the posterior is not available in closed form. Therefore, computing the posterior often involves intractable integration and poses significant challenges. Typically, one resorts to approximate inference methods such as sampling (e.g., MCMC) (Andrieu et al., 2003) or variational inference (Wainwright & Jordan, 2003).

In many real problems, observations arrive sequentially online, and Bayesian inference needs to be performed recursively,

$$p(\theta \mid x_{1:t}) \;\propto\; p(x_t \mid \theta)\, p(\theta \mid x_{1:t-1}), \quad t = 1, 2, \ldots \qquad (1)$$

That is, the estimate of $p(\theta \mid x_{1:t})$ should be computed based on the estimate of $p(\theta \mid x_{1:t-1})$ obtained in the last iteration and the presence of the new observation $x_t$. This requires algorithms that allow for efficient online inference. In this case, both standard MCMC and variational inference become inefficient, since the former requires a complete scan of the dataset in each iteration, and the latter requires solving an optimization problem for every new observation. Thus, sequential Monte Carlo (SMC) (Doucet et al., 2001; Balakrishnan & Madigan, 2006) and stochastic approximations, such as stochastic gradient Langevin dynamics (Welling & Teh, 2011) and stochastic variational inference (Hoffman et al., 2012), have been developed to improve the efficiency. However, SMC suffers from the path degeneracy problem in high dimensions (Daum & Huang, 2003; Snyder et al., 2008), and the rejuvenation steps designed to mitigate it may violate the online sequential update requirement (Canini et al., 2009; Chopin et al., 2013). Stochastic approximation methods are prescribed algorithms that cannot exploit the structure of the problem for further improvement.

To address the challenges above, the seminal work on Kernel Bayes' Rule (KBR) views the Bayes update as an operator in a reproducing kernel Hilbert space (RKHS), which can be learned and then directly produces the posterior from the prior after each observation (Fukumizu et al., 2012). In the KBR framework, the posterior is represented as an embedding using a feature map associated with a kernel function; the kernel Bayes operator then takes this embedding as input and produces the embedding of the updated posterior,

$$\mu_{t} = \mathcal{K}\big(\mu_{t-1},\, x_t\big), \qquad (2)$$

where $\mu_t$ denotes the kernel embedding of the posterior $p(\theta \mid x_{1:t})$.

Another novel aspect of the KBR method is that it contains a training phase and a testing phase: the structure of the problem at hand (e.g., the likelihood) is taken into account in the training phase, and in the testing phase the learned operator directly operates on the current posterior embedding to produce the output. However, despite the appealing concept of the KBR operator, it only works well for a limited range of problems due to its strong theoretical assumptions.

In this work, we aim to lift the limitations of the KBR operator and design a novel continuous particle flow operator to realize the Bayes update, which we call meta particle flow (MPF). In our MPF framework (Fig. 1), a prior distribution $\pi(\theta)$, or the current posterior $p(\theta \mid x_{1:t-1})$, is approximated by a set of $N$ equally weighted particles $\{\theta_{t-1}^i\}_{i=1}^N$; then, given an observation $x_t$, the flow operator $\mathcal{F}$ will transport each particle $\theta_{t-1}^i$ to a new particle $\theta_t^i$ corresponding to the updated posterior $p(\theta \mid x_{1:t})$. That is to say, for $i = 1, \ldots, N$,

$$\theta_t^i = \mathcal{F}\big(\theta_{t-1}^i,\, x_t\big), \qquad (3)$$

where the set $\{\theta_t^i\}_{i=1}^N$ will be used to represent the updated posterior. Furthermore, this particle flow operator can be applied recursively to $\{\theta_t^i\}_{i=1}^N$ and a new observation $x_{t+1}$ to produce $\{\theta_{t+1}^i\}_{i=1}^N$, and so on.

At a high level, we model the MPF operator $\mathcal{F}$ as a continuous deterministic flow, which propagates the locations of particles and the values of their probability density simultaneously through a dynamical system described by ordinary differential equations (ODEs). A natural question is whether a fixed ODE-based Bayes operator exists that is applicable to different prior distributions and likelihood functions. In this paper, we resolve this important theoretical question by making a novel connection between the MPF operator and the Fokker-Planck equation of Langevin dynamics. The proof of existence also provides a basis for our parameterization of the MPF operator using DeepSet (Zaheer et al., 2017).

Similar to KBR, the MPF method also has a training phase and a testing phase. However, our training procedure is very different, as it adopts a meta-learning framework (Andrychowicz et al., 2016), where multiple related Bayesian inference tasks with different priors and observations are created. The MPF operator learns from these tasks how to update the posteriors given new observations. During the testing phase, the learned MPF is applied directly to new observations, without re-optimization and without storing previous observations. We conduct various experiments to show that the learned MPF operator can generalize to new Bayesian inference tasks.

2 Bayesian Inference as Particle Flow

We present the details in this section from four aspects: (1) How can sequential Bayesian inference be mapped to particle flow? (2) What are the properties of such a particle flow? (3) Does a shared flow-based Bayes operator exist? (4) How can such an operator be parameterized?

2.1 Particle Flow Bayesian Operator

The problem mapping from sequential Bayesian inference to particle flow goes as follows. Initially, we start with $N$ particles $\{\theta_0^i\}_{i=1}^N$ sampled i.i.d. from a prior distribution $\pi(\theta)$. Given an observation $x_1$, the operator $\mathcal{F}$ will transport the particles to $\{\theta_1^i\}_{i=1}^N$ to estimate the posterior $p(\theta \mid x_1)$. We define this transformation as the solution of an initial value problem of an ordinary differential equation (ODE). That is, for $\tau \in [0, T]$,

$$\frac{d\theta(\tau)}{d\tau} = f\big(\theta(\tau),\, x_1,\, \tau\big),$$

where the flow velocity $f$ takes the observation $x_1$ as an input and determines both the direction and the speed of the change of $\theta(\tau)$. In the above ODE model, each particle $\theta_0^i$ sampled from the prior gives an initial value $\theta(0) = \theta_0^i$, and then the flow velocity $f$ will evolve the particle continuously and deterministically. At the terminal time $T$, we take the result of the evolution, $\theta_1^i := \theta(T)$, as the transformed particle for estimating the posterior.

Applying this ODE-based transformation sequentially as new observations arrive, we can define a recursive particle flow Bayes operator as

$$\theta_t^i = \mathcal{F}\big(\theta_{t-1}^i,\, x_t\big) := \theta_{t-1}^i + \int_0^T f\big(\theta^i(\tau),\, x_t,\, \tau\big)\, d\tau, \quad \text{with } \theta^i(0) = \theta_{t-1}^i. \qquad (4)$$

The set of obtained particles $\{\theta_t^i\}_{i=1}^N$ can be used to perform Bayesian inference, such as estimating the mean or quantifying the uncertainty of any test function $h(\theta)$ by averaging over these particles, i.e., $\mathbb{E}_{p(\theta \mid x_{1:t})}[h(\theta)] \approx \frac{1}{N}\sum_{i=1}^N h(\theta_t^i)$.
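To make the recursion in Eq. (4) concrete, the following is a minimal sketch of the particle transport using a forward-Euler ODE solver. It is an illustration under our own assumptions, not the released implementation; in particular, flow_velocity is only a hand-crafted placeholder for the learned velocity $f$.

```python
import numpy as np

def flow_velocity(theta, x_t, tau):
    # Placeholder for the learned flow velocity f(theta, x_t, tau).
    # A hypothetical drift, only to illustrate the interface.
    return -(theta - x_t) * np.exp(-tau)

def transport(particles, x_t, T=1.0, n_steps=20):
    """One Bayes update: move every particle along d(theta)/d(tau) = f."""
    dt = T / n_steps
    theta = particles.copy()
    for k in range(n_steps):
        theta = theta + dt * flow_velocity(theta, x_t, k * dt)  # forward Euler step
    return theta

# Sequential application of the operator: particles_t = F(particles_{t-1}, x_t).
particles = np.random.randn(256, 2)                  # samples from the prior pi(theta)
for x_t in [np.array([0.5, -0.2]), np.array([0.1, 0.3])]:
    particles = transport(particles, x_t)

# Posterior summaries by averaging a test function over the particles.
posterior_mean = particles.mean(axis=0)
```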

At this moment, we assume $f$ has the form $f(q_{t-1}, x_t, \theta(\tau), \tau)$, where $q_{t-1}$ denotes the density of the current particles, and that $f$ is shared across different sequential stages. In Section 2.3, a rigorous discussion of the existence of a shared flow velocity of this form will be given. Next, we discuss further properties of this continuous particle flow, which will help us study the existence of such an operator for Bayesian inference and design the parameterization and the learning of the flow velocity.

Figure 1: Sequential Bayesian inference as a deterministic flow of particles.

2.2 Property of Continuous Deterministic Flow

The continuous transformation of $\theta(\tau)$ described by the differential equation $\frac{d\theta(\tau)}{d\tau} = f$ defines a deterministic flow for each particle. Let $q(\theta, \tau)$ be the probability density of the continuous random variable $\theta(\tau)$. The change of this density is also determined by $f$. More specifically, the evolution of the density follows the widely known continuity equation (Batchelor, 2000):

$$\frac{\partial q(\theta, \tau)}{\partial \tau} = -\nabla \cdot \big(q(\theta, \tau)\, f\big), \qquad (5)$$

where $\nabla \cdot$ is the divergence operator. The continuity equation is the mathematical expression of the law of local conservation of mass: mass can neither be created nor destroyed, nor can it "teleport" from one place to another.

Given the continuity equation, one can describe the change of the log-density by another differential equation, which we state as Theorem 2.1.

Theorem 2.1.

If $\frac{d\theta(\tau)}{d\tau} = f\big(\theta(\tau), \tau\big)$, then the change in log-density follows the differential equation

$$\frac{d \log q\big(\theta(\tau), \tau\big)}{d\tau} = -\nabla \cdot f. \qquad (6)$$

Since, for any physical quantity $\phi(\theta, \tau)$, the distinction between the material derivative and the partial derivative is important, we clarify the definition before proving this theorem.

Definition 2.1.

The material derivative of a quantity $\phi(\theta, \tau)$ is defined as

$$\frac{d\phi}{d\tau} := \frac{\partial \phi}{\partial \tau} + f \cdot \nabla \phi. \qquad (7)$$

Note that $\frac{d\phi}{d\tau}$ defines the rate of change of $\phi$ in a given particle as it moves along its trajectory in the flow, while $\frac{\partial \phi}{\partial \tau}$ is the rate of change of $\phi$ at a particular point that is fixed in space.

Proof of Theorem 2.1.

By the continuity equation, $\frac{\partial q}{\partial \tau} = -\nabla \cdot (q f) = -q\, \nabla \cdot f - f \cdot \nabla q$. By the chain rule, we have $\frac{d \log q}{d \tau} = \frac{\partial \log q}{\partial \tau} + f \cdot \nabla \log q = \frac{1}{q}\Big(\frac{\partial q}{\partial \tau} + f \cdot \nabla q\Big) = -\nabla \cdot f$. ∎

Theorem 2.1 gives the same result as the Instantaneous Change of Variables Theorem stated by Chen et al. (2018). However, our statement is made more precise by the notation of material and partial derivatives, and our proof is simpler and intuitively clearer via the continuity equation. This also helps us see the connection to other physics problems such as fluid dynamics and electromagnetism.

With Theorem 2.1, we can compute the log-density of the particles associated with the Bayes operator $\mathcal{F}$ by integrating Eq. (6) across $\tau \in [0, T]$ for each particle $\theta^i$:

$$\log q\big(\theta^i(T), T\big) = \log q\big(\theta^i(0), 0\big) - \int_0^T \nabla \cdot f \; d\tau. \qquad (8)$$
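Continuing the earlier sketch (again an assumed implementation of ours, not the released code), the following integrates Eq. (8) jointly with the particle positions; here the divergence $\nabla \cdot f$ is estimated by finite differences, whereas an actual implementation could use automatic differentiation of the velocity network.

```python
import numpy as np

def divergence(f, theta, x_t, tau, eps=1e-4):
    """Finite-difference estimate of div f = sum_d d f_d / d theta_d at each particle."""
    d = theta.shape[-1]
    div = np.zeros(theta.shape[0])
    for j in range(d):
        e = np.zeros(d); e[j] = eps
        div += (f(theta + e, x_t, tau)[:, j] - f(theta - e, x_t, tau)[:, j]) / (2 * eps)
    return div

def transport_with_logq(f, particles, logq, x_t, T=1.0, n_steps=20):
    """Evolve (theta, log q) jointly: d theta/d tau = f, d log q/d tau = -div f."""
    dt = T / n_steps
    theta, logq = particles.copy(), logq.copy()
    for k in range(n_steps):
        tau = k * dt
        logq = logq - dt * divergence(f, theta, x_t, tau)   # Eq. (8), Euler step
        theta = theta + dt * f(theta, x_t, tau)
    return theta, logq
```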

2.3 Existence of Shared Flow Velocity

Does a shared flow velocity exist for different Bayesian inference tasks involving different priors and observations? If it does, what is the form of this function? Is such a flow time-invariant or time-varying? These questions are non-trivial even for the simple Gaussian case.

For instance, let the prior $\pi(\theta) = \mathcal{N}(0, \sigma_0^2)$ and the likelihood $p(x \mid \theta) = \mathcal{N}(x;\, \theta, \sigma_x^2)$ both be one-dimensional Gaussian distributions. Given the observation $x_1 = 0$, the posterior distribution of $\theta$ is also Gaussian, distributed as $\mathcal{N}\big(0,\, (\sigma_0^{-2} + \sigma_x^{-2})^{-1}\big)$. This means that the dynamical system needs to push a zero-mean Gaussian distribution with variance $\sigma_0^2$ to another zero-mean Gaussian distribution with variance $(\sigma_0^{-2} + \sigma_x^{-2})^{-1}$ for any $\sigma_0^2$. It is not clear whether such a shared flow velocity function exists, nor what form it takes.

To resolve the existence issue, we will first establish a connection between the deterministic flow in Section 2.2 and the stochastic flow of Langevin dynamics. Then we will leverage the connection between closed-loop control and open-loop control to show the existence of a shared $f$.

2.3.1 Connection to Stochastic Flow

Langevin dynamics is a stochastic process

$$d\theta(\tau) = -\nabla_\theta U\big(\theta(\tau)\big)\, d\tau + \sqrt{2}\, dW(\tau), \qquad (9)$$

where $W(\tau)$ is a standard Brownian motion. Given a fixed initial location $\theta(0)$, multiple runs of the Langevin dynamics up to time $\tau$ will result in multiple random locations of $\theta(\tau)$, due to the randomness of $W(\tau)$. This stochastic flow is very different in nature from the deterministic flow in Section 2.2, where a fixed initial location $\theta(0)$ will always end up at the same location $\theta(\tau)$. Nonetheless, we will establish their connection using the so-called Fokker-Planck equation.

While Langevin dynamics is a stochastic flow of a continuous random variable $\theta(\tau)$, the probability density $q(\theta, \tau)$ of $\theta(\tau)$ follows a deterministic evolution according to the associated Fokker-Planck equation (Jordan et al., 1998)

$$\frac{\partial q(\theta, \tau)}{\partial \tau} = \nabla \cdot \big(q(\theta, \tau)\, \nabla_\theta U(\theta)\big) + \Delta q(\theta, \tau), \qquad (10)$$

where $\Delta = \nabla \cdot \nabla$ is the Laplace operator.

Furthermore, if the so-called potential function $U(\theta) := -\log \pi(\theta) - \log p(x \mid \theta)$ is smooth and $e^{-U(\theta)}$ is integrable, the Fokker-Planck equation has a unique stationary solution in the form of a Gibbs distribution,

$$q_\infty(\theta) \;\propto\; \exp\big(-U(\theta)\big) = \pi(\theta)\, p(x \mid \theta). \qquad (11)$$

Clearly, this stationary solution is the posterior distribution $p(\theta \mid x)$ in Bayesian inference.

Now we will rewrite the Fokker-Planck equation in Eq. (10) into the form of the deterministic flow in Eq. (5) from Section 2.2, and hence identify the corresponding flow velocity.

Theorem 2.2.

Assume the deterministic transformation of a continuous random variable $\theta(\tau)$ follows $\frac{d\theta(\tau)}{d\tau} = f$, where

$$f = -\nabla_\theta U(\theta) - \nabla_\theta \log q(\theta, \tau), \qquad (12)$$

and $q(\theta, \tau)$ is the probability density of $\theta(\tau)$. If the potential function $U$ is smooth and $e^{-U}$ is integrable, then $q(\theta, \tau)$ converges to the posterior $p(\theta \mid x)$ as $\tau \to \infty$.

Proof.

By the continuity equation in Eq. (5), the probability density of $\theta(\tau)$ satisfies $\frac{\partial q}{\partial \tau} = -\nabla \cdot \big(q\, (-\nabla U - \nabla \log q)\big) = \nabla \cdot (q \nabla U) + \nabla \cdot \nabla q$. It is easy to see that this equation is the same as the Fokker-Planck equation in Eq. (10), by decomposing the Laplacian as $\Delta q = \nabla \cdot \nabla q$. Under the conditions on the potential function $U$, since the Fokker-Planck equation has a unique stationary distribution equal to the posterior distribution $p(\theta \mid x)$, the deterministic flow in Eq. (12) will also converge to $p(\theta \mid x)$. ∎

The implication of Theorem 2.2 is that we can construct a deterministic flow of particles to obtain the posterior, and hence establish the existence of such a flow. However, the flow velocity in Eq. (12) depends on the intermediate density $q(\theta, \tau)$, which changes continuously over time. This seemingly suggests that $f$ cannot be expressed as a fixed function of $\theta$ and $x$. In the next section, we show that this dependence on $q(\theta, \tau)$ can be removed using the theory of optimal control for deterministic systems.
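As a quick, self-contained numerical check of Theorem 2.2 (not part of the MPF method itself), the sketch below simulates the deterministic flow of Eq. (12) for a one-dimensional Gaussian prior and likelihood; since the density stays Gaussian along this flow, $\nabla \log q$ is approximated in closed form from the particle mean and variance. All numerical values are our own choices.

```python
import numpy as np

# 1-D Gaussian example: prior N(m0, s0^2), likelihood N(x; theta, s^2).
m0, s0, s, x = 0.0, 1.0, 0.5, 1.2
post_var = 1.0 / (1.0 / s0**2 + 1.0 / s**2)
post_mean = post_var * (m0 / s0**2 + x / s**2)

theta = np.random.randn(5000) * s0 + m0        # particles from the prior
dt, n_steps = 0.01, 2000
for _ in range(n_steps):
    grad_U = (theta - m0) / s0**2 + (theta - x) / s**2       # grad of potential U
    m, v = theta.mean(), theta.var()
    grad_logq = -(theta - m) / v                              # Gaussian approx of grad log q
    theta = theta + dt * (-grad_U - grad_logq)                # Eq. (12)

print(theta.mean(), post_mean)   # both close to the posterior mean
print(theta.var(), post_var)     # both close to the posterior variance
```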

2.3.2 Closed-loop to Open-loop Conversion

Now the question is whether the term $\nabla_\theta \log q(\theta, \tau)$ in Eq. (12) can be made independent of the intermediate density $q$, or whether there is an equivalent form which achieves the same flow. To investigate this question, we consider the following deterministic optimal control problem

$$\min_{f} \;\; D\big(q_\infty,\; p(\theta \mid x)\big) \qquad (13)$$
$$\text{s.t.} \quad \frac{\partial q(\theta, \tau)}{\partial \tau} = -\nabla \cdot \big(q(\theta, \tau)\, f\big), \quad q(\theta, 0) = \pi(\theta), \qquad (14)$$

where $q_\infty := \lim_{\tau \to \infty} q(\cdot, \tau)$,

for any metric $D$ defined on the set of densities over $\theta$. In Eq. (13), we are optimizing over $f$, which is usually called the control. By Theorem 2.2, $f = -\nabla_\theta U(\theta) - \nabla_\theta \log q(\theta, \tau)$ is apparently an optimal solution. Furthermore, this flow velocity, derived via the Fokker-Planck equation, can be regarded as the continuous-time steepest descent of the KL divergence under the Wasserstein distance (Jordan et al., 1998). We are seeking an alternative expression for this optimal solution which does not depend on the intermediate density $q(\theta, \tau)$. First, we introduce the terminology below from the optimal control literature.

Definition 2.2.

In the optimal control literature, a control in feedback form, which depends on the current state $q(\cdot, \tau)$, is called closed-loop. In contrast, another type of control is called open-loop. An open-loop control is determined once the initial state $q(\cdot, 0)$ is observed, whereas a closed-loop control can adapt to the encountered states $q(\cdot, \tau)$.

Theorem 2.3.

For the optimal control problem in Eq. (13) and Eq. (14), there exists an optimal open-loop control $f^*$ such that the induced optimal state satisfies $q_\infty = p(\theta \mid x)$. Moreover, $f^*$ has a fixed expression with respect to the initial density $q(\cdot, 0)$, the observation $x$, the state $\theta$ and the time $\tau$, across different tasks.

Proof.

By Theorem 2.2, the closed-loop control $f = -\nabla_\theta U(\theta) - \nabla_\theta \log q(\theta, \tau)$ can induce the optimal state and achieve a zero loss, $D\big(q_\infty, p(\theta \mid x)\big) = 0$. Hence, it is an optimal closed-loop control for this problem.

Although in general a closed-loop control gives a stronger characterization of the solution, in a deterministic system like Eq. (14), the optimal closed-loop control and the optimal open-loop control give the same control law and thus the same optimal value of the loss function (Dreyfus, 1964). Hence, there exists an optimal open-loop control $f^*$ such that the induced optimal state also gives a zero loss, and thus $q_\infty = p(\theta \mid x)$. More details are provided in Appendix A, which explicitly expresses $f^*$ as a fixed function of $q(\cdot, 0)$, $x$, $\theta$ and $\tau$. ∎

Conclusion of a shared $f$. In sequential Bayesian inference, we will set $q(\cdot, 0)$ to the current posterior $p(\theta \mid x_{1:t-1})$ (with $q(\cdot, 0) = \pi$ at the first stage). Therefore, Theorem 2.3 shows that there exists a fixed and deterministic flow velocity of the form

$$f^*\big(q_{t-1},\, x_t,\, \theta(\tau),\, \tau\big), \qquad (15)$$

which can transform $p(\theta \mid x_{1:t-1})$ into $p(\theta \mid x_{1:t})$ and in turn defines a shared particle flow Bayes operator $\mathcal{F}$. Furthermore, this flow velocity needs to be time-varying; that is, $f^*$ is also a function of the flow time $\tau$.

In practice, since we represent prior and posterior densities as particles, we do not have access to the entire density functions. Thus, in the next section, we discuss a practical parameterization of $f^*$ using particles from $q_{t-1}$. Furthermore, the deterministic flow requires the particles to evolve for an infinite time in order to attain the target posterior distribution. In practice, we will use a finite terminal time $T$ for the particle evolution, and optimize the parameters in $f$ to compensate for the discrepancy due to the finite-time evolution.

2.4 Parametrization

In Section 2.3, we introduced a shared flow velocity of the form $f^*(q_{t-1}, x_t, \theta(\tau), \tau)$, as indicated by Eq. (15). We design the parameterization of $f$ based on this expression.

(i) Density $q_{t-1}$: In our particle flow framework, since we do not have full access to the density $q_{t-1}$ but only samples $\{\theta_{t-1}^i\}_{i=1}^N$ from it, we can use these samples as surrogates for $q_{t-1}$. A related example is the feature space embedding of distributions (Smola et al., 2007), $\mu_q := \mathbb{E}_{\theta \sim q}[\phi(\theta)]$. Ideally, if the embedding is an injective mapping from the space of probability measures over $\theta$ to the feature space, the resulting embedding can be treated as a sufficient statistic of the density, and any information we need from $q_{t-1}$ can be preserved. Hence, we represent $q_{t-1}$ by $\frac{1}{N}\sum_{i=1}^N \phi(\theta_{t-1}^i)$, where $\phi$ is a nonlinear feature mapping to be learned. Since we use a neural version of $\phi$, this representation can also be regarded as a DeepSet (Zaheer et al., 2017).

(ii) Observation $x_t$: In both the Langevin dynamics and Eq. (15), the only term containing the likelihood is $\nabla_\theta \log p(x_t \mid \theta)$. Consequently, we can use this term as an input to $f$. In the case where the likelihood function is fixed, we can also simply use the observation $x_t$, which results in similar performance in our experiments.

Overall, we will parameterize the flow velocity as

$$f = h\Big(\tfrac{1}{N}\textstyle\sum_{i=1}^N \phi(\theta_{t-1}^i),\; \nabla_\theta \log p(x_t \mid \theta),\; \theta(\tau),\; \tau\Big), \qquad (16)$$

where both $\phi$ and $h$ are neural networks. For instance, the first two arguments can be combined into the context of this conditional flow by a dense feed-forward neural network; the specific neural architecture we use in later experiments is given in Eqs. (17) and (18), where $\odot$ denotes element-wise multiplication. The number of layers can be tuned, but in general $h$ is a shallow network.

Let $\mathbf{w}$ be the collection of parameters in $\phi$ and $h$. From now on, we will write the flow velocity as $f(\mathbf{w};\, q_{t-1}, x_t, \theta(\tau), \tau)$, where $\mathbf{w}$ is shared across stages and independent of $t$. In the next section, we will propose a meta-learning framework for learning these parameters.
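A hedged PyTorch sketch of the parameterization in Eq. (16) is given below. The module structure and layer sizes are our assumptions rather than the architecture of Eqs. (17)-(18); the density $q_{t-1}$ enters via a DeepSet-style mean embedding of its particles.

```python
import torch
import torch.nn as nn

class FlowVelocity(nn.Module):
    """Sketch of Eq. (16): f(w; q_{t-1}, x_t, theta(tau), tau)."""
    def __init__(self, dim, embed_dim=64, hidden=64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, embed_dim))        # particle feature map
        self.h = nn.Sequential(nn.Linear(embed_dim + dim + dim + 1, hidden), nn.Tanh(),
                               nn.Linear(hidden, dim))                # shallow velocity net

    def forward(self, prev_particles, grad_loglik, theta, tau):
        # prev_particles: (N, dim) samples of q_{t-1}; grad_loglik: (M, dim) values of
        # grad_theta log p(x_t | theta) at the current particles theta: (M, dim).
        context = self.phi(prev_particles).mean(dim=0)                # DeepSet embedding
        context = context.expand(theta.shape[0], -1)
        tau_col = torch.full((theta.shape[0], 1), float(tau))
        return self.h(torch.cat([context, grad_loglik, theta, tau_col], dim=-1))

f = FlowVelocity(dim=2)
vel = f(torch.randn(256, 2), torch.randn(128, 2), torch.randn(128, 2), tau=0.3)
```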

3 Meta Learning

Since we want to learn a generalizable Bayesian inference operator, we need to create multiple inference tasks and design the corresponding training and testing algorithm. We will discuss these details below.

Multi-task Framework. We are interested in a training set $\mathcal{D}_{\text{train}}$ containing multiple inference tasks. Each task is a tuple $\big(\pi(\theta),\, p(x \mid \theta),\, \{x_1, \ldots, x_T\}\big)$, which consists of a prior distribution, a likelihood function and a sequence of observations. As explained in the previous sections, we want to learn a Bayes operator that can be applied recursively to hit the target posteriors sequentially. Therefore, each task can also be interpreted as a sequence of sub-tasks:

$$\pi(\theta) \;\xrightarrow{x_1}\; p(\theta \mid x_1) \;\xrightarrow{x_2}\; p(\theta \mid x_{1:2}) \;\xrightarrow{x_3}\; \cdots \;\xrightarrow{x_T}\; p(\theta \mid x_{1:T}).$$

Thus each task is a sequential Bayesian inference problem and each sub-task corresponds to one step of the Bayesian update.

Cumulative Loss Function. For each sub-task, we define a loss $L_t = KL\big(q_t \,\|\, p(\theta \mid x_{1:t})\big)$, where $q_t$ is the distribution transported by $\mathcal{F}$ at the $t$-th stage and $p(\theta \mid x_{1:t})$ is the target posterior (see Fig. 1 for an illustration). Meanwhile, the loss for the corresponding sequential task will be $L = \sum_{t=1}^T L_t$, which sums up the sub-losses of all intermediate stages. Since its optimum is independent of the normalizing constants of $p(\theta \mid x_{1:t})$, it is equivalent to minimizing the negative evidence lower bound (ELBO)

$$L(\mathbf{w}) = \sum_{t=1}^{T} \frac{1}{N} \sum_{i=1}^{N} \Big[ \log q_t(\theta_t^i) - \log \pi(\theta_t^i) - \sum_{s=1}^{t} \log p(x_s \mid \theta_t^i) \Big]. \qquad (19)$$

The above expression is an empirical estimate using the particles $\{\theta_t^i\}_{i=1}^N$. In each iteration of the training phase, we randomly sample a task from $\mathcal{D}_{\text{train}}$ and compute the gradient of the above cumulative loss function to update the parameters $\mathbf{w}$ of our MPF operator $\mathcal{F}$.
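The sketch below evaluates the empirical cumulative loss of Eq. (19) under our assumption that the per-particle log-densities $\log q_t(\theta_t^i)$ have been tracked via Eq. (8); log_prior and log_lik are user-supplied callables and the names are ours.

```python
import torch

def cumulative_loss(log_q, particles, observations, log_prior, log_lik):
    """Empirical negative-ELBO loss of Eq. (19), summed over stages.
    log_q[t], particles[t]: (N,) log-densities and (N, dim) particles after stage t+1;
    observations: list of x_1, ..., x_T; log_prior, log_lik: callables returning (N,)."""
    loss = 0.0
    for t in range(len(observations)):
        theta_t, logq_t = particles[t], log_q[t]
        log_joint = log_prior(theta_t)
        for s in range(t + 1):                        # log p(x_1, ..., x_{t+1} | theta)
            log_joint = log_joint + log_lik(observations[s], theta_t)
        loss = loss + (logq_t - log_joint).mean()     # KL(q_t || posterior) up to a constant
    return loss
```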

3.1 Training Tasks Creation

Similar to other meta-learning problems, the distribution of tasks used during the training phase will essentially affect the learner's performance on testing tasks (Dai et al., 2017). Depending on the nature of the Bayesian problem, we propose two approaches to construct the multi-task training set: one is data driven, and the other is based on generative models. The general principle is that the collection of training priors has to be diverse enough to be representative of those that may be seen at testing time, and the observations in the training tasks should follow a distribution similar to the testing distribution.

Data-Driven Approach. We can use the posterior distribution obtained from a previous Bayesian inference step as the new prior distribution. If the posterior distribution has an analytical (but possibly unnormalized) expression, we directly use this expression.

Since each Bayesian inference step generates a set of particles $\{\theta_t^i\}_{i=1}^N$ corresponding to the posterior distribution $p(\theta \mid x_{1:t})$, we can also apply a kernel density estimator on top of this set of samples and obtain an empirical density function

$$\hat{q}_t(\theta) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{(2\pi h^2)^{d/2}} \exp\Big(-\frac{\|\theta - \theta_t^i\|^2}{2h^2}\Big), \qquad (20)$$

where $h$ is the kernel bandwidth. This density function, together with a set of samples from it, is then used as the new prior for the next training task with a different set of observations. This approach has two attractive properties. First, it does not require any prior knowledge of the model and thus is generally applicable to most problems. Second, it provides a way of breaking a long sequence with large $T$ into multiple tasks with shorter sequences,

$$\big(\pi, \{x_1, \ldots, x_T\}\big) \;\longrightarrow\; \big(\pi, \{x_1, \ldots, x_{t_0}\}\big),\; \big(\hat{q}_{t_0}, \{x_{t_0+1}, \ldots, x_T\}\big), \qquad (21)$$

which helps make the training procedure more efficient. This approach is particularly suited to sequential Bayesian inference and is used in the later experiments.
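A minimal sketch (with an assumed Gaussian kernel, and names of our own) of how Eq. (20) turns stage-$t$ particles into the prior of a new, shorter training task as in Eq. (21):

```python
import numpy as np

def kde_log_density(theta, samples, bandwidth):
    """Gaussian kernel density estimate of Eq. (20), evaluated at the rows of theta."""
    d = samples.shape[1]
    diff = theta[:, None, :] - samples[None, :, :]           # (M, N, d)
    sq = (diff ** 2).sum(-1) / (2 * bandwidth ** 2)
    log_norm = -0.5 * d * np.log(2 * np.pi * bandwidth ** 2)
    return log_norm + np.log(np.exp(-sq).mean(axis=1))

def sample_kde(samples, bandwidth, n):
    """Draw n samples from the KDE: pick a particle, then add Gaussian noise."""
    idx = np.random.randint(len(samples), size=n)
    return samples[idx] + bandwidth * np.random.randn(n, samples.shape[1])

# The KDE built from stage-t particles becomes the prior of a new training task.
new_prior_particles = sample_kde(np.random.randn(512, 2), bandwidth=0.1, n=512)
```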

Generative Model Approach.

Another possibility is to sample priors from a flexible generative model, such as a Dirichlet process Gaussian mixture model (Antoniak, 1974). In this case, one can pick a base measure $G_0$ and a concentration parameter $\alpha$, and a random prior can be drawn according to the generative process of the Dirichlet process mixture: draw a mixing measure $G \sim \mathrm{DP}(\alpha, G_0)$, draw component means from $G$, and draw particles from the corresponding Gaussian components. In this way, a prior distribution together with a set of particles from it is obtained.

We will leave the experimental evaluation of this approach for future investigation.
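Since the experimental evaluation of this approach is left for future work, the following is only a hypothetical sketch of drawing a random prior from a truncated stick-breaking Dirichlet process Gaussian mixture, with a standard normal base measure and an assumed component scale.

```python
import numpy as np

def sample_dp_mixture_prior(alpha=1.0, dim=2, n_particles=512, truncation=50):
    """Draw one random prior from a (truncated) DP Gaussian mixture and sample particles."""
    betas = np.random.beta(1.0, alpha, size=truncation)        # stick-breaking proportions
    weights = betas * np.cumprod(np.concatenate([[1.0], 1 - betas[:-1]]))
    weights /= weights.sum()
    means = np.random.randn(truncation, dim)                    # atoms from base measure N(0, I)
    comp = np.random.choice(truncation, size=n_particles, p=weights)
    particles = means[comp] + 0.3 * np.random.randn(n_particles, dim)  # 0.3: assumed scale
    return particles, (weights, means)

particles, prior_params = sample_dp_mixture_prior()
```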

3.2 Overall Training Algorithm

Learning the MPF operator is learning the parameters $\mathbf{w}$ of $f$ to minimize the sum of losses $\sum_{\text{tasks}} L(\mathbf{w})$ over multiple tasks, where $L(\mathbf{w})$ is the loss for a single task defined in Eq. (19). Both the particle locations $\theta_t^i$ and their density values $q_t(\theta_t^i)$ in $L(\mathbf{w})$ are results of forward propagation determined by $f$ according to the ODEs in Eq. (4) and Eq. (8). The training procedure is similar to other deep learning optimizations, except that an ODE technique called the adjoint method (Chen et al., 2018) is used to compute the gradient with very low memory cost. The overall algorithm under the meta-learning framework is summarized in Algorithm 1.

w ← random initialization
for k = 1 to #iterations do
    sample a task from D_train and draw initial particles from its prior    ▷ initial particles
    update w using Grad(·)    ▷ gradient of the cumulative loss, see Appendix B
    if k mod m = 0 then
        perform a validation step on D_val    ▷ validation
return the best w over all validation steps
Remark.

See Appendix B for both the derivation and the algorithm steps of Grad(·).

Algorithm 1 Overall Meta-Learning Algorithm
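Pulling the pieces together, here is a hedged end-to-end training loop corresponding to Algorithm 1 under simplifying assumptions of ours: a forward-Euler discretization with ordinary backpropagation replaces the adjoint method of Chen et al. (2018), the $\log q_t$ term of Eq. (19) is dropped, and the task interface (sample_prior, log_prior, log_lik) is our naming. The flow argument can be the FlowVelocity sketch from Section 2.4.

```python
import torch

def train_mpf(flow, tasks, n_iters=1000, lr=1e-3, T=1.0, n_steps=10):
    """Meta-train the MPF operator on a collection of tasks (a sketch, not the released code)."""
    opt = torch.optim.Adam(flow.parameters(), lr=lr)
    dt = T / n_steps
    for it in range(n_iters):
        task = tasks[it % len(tasks)]
        theta = task.sample_prior(256).requires_grad_(True)   # prior particles, (N, dim)
        prev, loss = theta.detach(), 0.0
        for t, x_t in enumerate(task.observations):
            for k in range(n_steps):                          # Euler-discretized flow, Eq. (4)
                grad_ll = torch.autograd.grad(task.log_lik(x_t, theta).sum(),
                                              theta, create_graph=True)[0]
                theta = theta + dt * flow(prev, grad_ll, theta, k * dt)
            # stage-t loss: negative log joint averaged over particles; the log q_t
            # entropy term of Eq. (19) is omitted here (Eq. (8) would supply it)
            log_joint = task.log_prior(theta) + sum(
                task.log_lik(task.observations[s], theta) for s in range(t + 1))
            loss = loss - log_joint.mean()
            prev = theta.detach()
        opt.zero_grad()
        loss.backward()
        opt.step()
```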

3.3 Efficient Training

In this section, we discuss two techniques to improve training efficiency when solving large-scale problems. Their application to an experiment containing millions of data points is demonstrated in Section 5. These two techniques can be summarized as mini-batch embedding and sequence segmentation.

The loss function in Eq. (19) contains a summation from $t = 1$ to $T$ and also a hidden inner summation

$$\log \pi(\theta) + \sum_{s=1}^{t} \log p(x_s \mid \theta). \qquad (22)$$

Thus the evaluation cost of the loss is quadratic in the length of the observation sequence. Therefore, we need to reduce this length for large-scale problems.

Mini-batch embedding. Previously, we regarded an observation $x_t$ as a single data point. In fact, we can also regard it as a mini-batch of data points, i.e., $x_t = \{x_t^1, \ldots, x_t^b\}$. The Bayesian update corresponding to this mini-batch becomes $p(\theta \mid x_{1:t}) \propto p(x_t^1, \ldots, x_t^b \mid \theta)\, p(\theta \mid x_{1:t-1})$. If we treat the mini-batch as a single compound observation, this is essentially the same as our previous expression. Therefore, we can replace $x_t$ by the mini-batch and input these samples simultaneously as a set to the flow $f$. To reduce the input dimension, we resort to a set embedding $g(\{x_t^1, \ldots, x_t^b\})$, where $g$ is a neuralized nonlinear function to be learned. Depending on the structure of the model, one can define the embedding as a DeepSet (Zaheer et al., 2017), or, if the posterior is not invariant with respect to the order of the observations (e.g., hidden Markov models), one needs a set embedding that is not order invariant. To conclude, for a mini-batch Bayesian update, the parameterization of the flow velocity in Eq. (16) is modified to take $g(\{x_t^1, \ldots, x_t^b\})$ in place of the single-observation input (see the sketch at the end of this section).

Sequence segmentation. We use the approach in Eq. (21) to break a long sequence into short ones. More precisely, suppose we have particles $\{\theta_{t_0}^i\}_{i=1}^N$ at the $t_0$-th stage. We can cut the original sequence at position $t_0$ and generate a new task using the second half. The prior for the new sequence is the empirical density estimate $\hat{q}_{t_0}$ defined in Eq. (20). Then, for all stages $t > t_0$, the term in Eq. (22) becomes

$$\log \hat{q}_{t_0}(\theta) + \sum_{s = t_0 + 1}^{t} \log p(x_s \mid \theta).$$

We can apply this technique multiple times to split a long sequence into multiple segments.
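As referenced above, here is a small sketch of the mini-batch set embedding $g$, under the assumption of an order-invariant DeepSet-style design; for order-sensitive models an RNN summary could be substituted. The module name and sizes are ours.

```python
import torch
import torch.nn as nn

class MiniBatchEmbedding(nn.Module):
    """Set embedding g of a mini-batch of observations x_t = {x_t^1, ..., x_t^b}."""
    def __init__(self, obs_dim, embed_dim=32, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, embed_dim))

    def forward(self, x_batch):                  # x_batch: (b, obs_dim)
        return self.net(x_batch).mean(dim=0)     # (embed_dim,) order-invariant summary

g = MiniBatchEmbedding(obs_dim=50)
batch_summary = g(torch.randn(32, 50))           # fed to the flow in place of a single x_t
```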

4 Related Works

SMC. Since MCMC and variational inference methods are either memory-inefficient or slow, sequential Monte Carlo methods are arguably the state of the art for sequential Bayesian inference, and we choose one-pass SMC (an improved version of SMC) (Balakrishnan & Madigan, 2006) as a baseline in this paper. However, SMC typically reweights a set of particles recursively using the likelihood function, and in high-dimensional cases it suffers from the degeneracy problem (Daum & Huang, 2003; Snyder et al., 2008). Resampling based on a Gaussian distribution or a kernel density estimator has been introduced to address this problem heuristically, but it may break the convergence guarantees of SMC. Others resort to occasional rejuvenation steps (Canini et al., 2009; Chopin et al., 2013), which, however, relax the online constraint. Sequential Bayesian inference remains a challenging problem.

KBR.

Kernel Bayes' rule is a kernel method for realizing Bayes' rule. It uses samples from the joint distribution $p(\theta, x)$ to learn a Bayes operator that can be applied sequentially at test time. However, in most real-world problems, we do not have samples of the latent variable $\theta$. Although KBR has strong theoretical guarantees, those results are restricted to the RKHS setting and its application is still limited.

ODE. There is a recent surge of interest in ODE-based Bayesian inference, including works such as neural ODE (Chen et al., 2018) and Monge-Ampere flow (Zhang et al., 2018), as well as continuous flow-based generative models (Grathwohl et al., 2018; Lei et al., 2017). These works focus on fitting one target distribution and do not take the structure of Bayesian inference problems into account. Consequently, they can only deal with a single dataset, and the learned continuous flow cannot be applied directly to a new dataset, a new prior or a sequential setting without re-optimization.

Learning to learn. Our approach is closely related to the recent line of work on learning optimization algorithms. Specifically, based on existing algorithms, Andrychowicz et al. (2016) used a recurrent neural network to perform optimization steps, which in turn yields a learnable optimization algorithm. Similarly, we construct the flow operator based on the expression of a shared flow velocity $f^*$, and we aim to achieve better performance when the testing tasks are similar to the training tasks.

5 Experiments

We conduct experiments on a multivariate Gaussian model, a hidden Markov model and Bayesian logistic regression to demonstrate the generalization ability of MPF as a Bayes operator and the posterior estimation performance of MPF as an inference method. (Python code will be released upon publication.)

Evaluation metric. For the multivariate normal model and the Gaussian linear dynamical system, we can calculate the true posterior. Therefore, we evaluate the performance of different methods using

  • Cross-entropy between the estimated posterior and the true posterior;

  • Kernel maximum mean discrepancy (MMD) (Gretton et al., 2012);

  • Integral evaluation discrepancy $\big|\mathbb{E}_{q_t}[h(\theta)] - \mathbb{E}_{p(\theta \mid x_{1:t})}[h(\theta)]\big|$ for test functions $h$;

where we use a Monte Carlo method to compute these quantities for the first two metrics. For the integral evaluation, we choose test functions $h$ for which the exact value of $\mathbb{E}_{p(\theta \mid x_{1:t})}[h(\theta)]$ has a closed-form expression. For the experiments on the real-world dataset, we report the commonly used prediction accuracy due to the intractability of the posterior.
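For reference, a short sketch of the squared-MMD metric with an RBF kernel (Gretton et al., 2012), as one would use it to compare MPF particles with samples from the true posterior; the bandwidth choice here is an arbitrary assumption of ours.

```python
import numpy as np

def squared_mmd_rbf(x, y, sigma=1.0):
    """Unbiased estimate of squared MMD between sample sets x and y with an RBF kernel."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    kxx, kyy, kxy = k(x, x), k(y, y), k(x, y)
    n, m = len(x), len(y)
    return ((kxx.sum() - np.trace(kxx)) / (n * (n - 1))
            + (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
            - 2 * kxy.mean())

# e.g. particles produced by MPF vs. samples from the true posterior
mmd2 = squared_mmd_rbf(np.random.randn(256, 3), np.random.randn(256, 3))
```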

Multivariate Gaussian Model. We consider a model where the prior $\pi(\theta)$, the observation $x$ conditioned on $\theta$, and hence the posterior, all follow Gaussian distributions. In our experiment, we consider models of dimension 3, 5 and 8. We test the learned operator on different sequences of 100 observations $x_{1:100}$, while the training set only contains sequences of 10 observations, which are ten times shorter than the sequences in the testing set. However, since we construct a set of different prior distributions to train the operator, the diversity of the prior distributions allows the learned operator $\mathcal{F}$ to generalize and compute posterior distributions over longer sequences.

The performance of MPF is compared with KBR (Fukumizu et al., 2012) and one-pass SMC (Balakrishnan & Madigan, 2006). We learn both the MPF operator and the KBR operator from the training set and then use them as algorithms on the testing set, which consists of 25 sequences of 100 observations conditioned on 25 different parameters $\theta$ sampled from the prior. We compare the estimated posteriors across stages $t$ from 1 to 100, and the results are shown in Fig. 2. Since KBR's performance is less competitive in this case, we leave its results to Appendix D. We can see from Fig. 2 that, as the dimension of the model increases, our MPF operator has a larger advantage over one-pass SMC.

Figure 2: Experimental results for the Gaussian model. (a) Cross-entropy for dimensions 3, 5 and 8. (b) Squared MMD with RBF kernel for dimensions 3, 5 and 8. We use the 256 obtained particles as samples from the estimated posterior and compare them with the true posteriors. Each evaluation result is computed over 25 tasks and the shaded area shows the standard error.

Gaussian Mixture Model. Following the same setting as Welling & Teh (2011), we conduct an experiment on an interesting mixture model, where the observations follow $x \sim \frac{1}{2}\mathcal{N}(\theta_1, \sigma_x^2) + \frac{1}{2}\mathcal{N}(\theta_1 + \theta_2, \sigma_x^2)$ and the prior is $\theta \sim \mathcal{N}\big(0, \mathrm{diag}(\sigma_1^2, \sigma_2^2)\big)$. As in Dai et al. (2016), we generate observations from this model with $(\theta_1, \theta_2) = (0, 1)$, so that the resulting posterior has two modes: $(0, 1)$ and $(1, -1)$. Within our meta-learning framework, multiple sequences are generated, each of which consists of 30 observations. We learn the MPF operator from training sequences and then apply it to testing sequences.

Compared to Welling & Teh (2011) and Dai et al. (2016), our experiment is more challenging. First, they fit only one posterior distribution given one sequence of observations by optimization, while our learned MPF operates on sequences that are not observed during training, without re-optimization. Second, their target is to estimate the final posterior given the whole sequence of data points, while we aim at fitting every intermediate posterior. Even for one fixed sequence, it is not easy to fit the posterior and capture both modes, as can be seen from the results reported by Dai et al. (2016). However, our learned operator can operate on testing sequences and the resulting posteriors are close to the true posteriors (see Fig. 3).

Figure 3: Visualization of the evolution of the posterior density from left to right (rows: true posterior, MPF, SMC). In the end, the MPF density is closer to the true density than the SMC density.

Hidden Markov Model — LDS. Our MPF method can be easily adapted to hidden Markov models, where we estimate the marginal posterior distribution $p(h_t \mid x_{1:t})$ over the latent state $h_t$. For instance, consider a Gaussian linear dynamical system (LDS):

$$h_t = A h_{t-1} + \epsilon_t, \qquad x_t = B h_t + \delta_t,$$

where $\epsilon_t$ and $\delta_t$ are Gaussian noise terms. The particles representing the marginal posterior distributions can be obtained by recursively applying two steps: a prediction step, which propagates the particles through the transition model $p(h_t \mid h_{t-1})$, and a Bayesian update from $p(h_t \mid x_{1:t-1})$ to $p(h_t \mid x_{1:t})$ given the likelihood $p(x_t \mid h_t)$. The only tricky part is that we do not have a tractable loss function in this case, because the predictive density involves the integration $\int p(h_t \mid h_{t-1})\, p(h_{t-1} \mid x_{1:t-1})\, dh_{t-1}$. Hence, at each stage $t$, we use the particles to construct an empirical density estimate $\hat{q}_t$ as defined in Eq. (20) and then define the loss at each stage with the intractable predictive density replaced by $\hat{q}_t$. With this loss, the operator will learn how to update particles from $p(h_t \mid x_{1:t-1})$ to the next posterior given a new likelihood term $p(x_t \mid h_t)$.

In the experiment, we sample a pair of 2-dimensional random matrices $A$ and $B$ for the LDS model. We learn both MPF and KBR from a training set containing multiple different sequences of observations. For KBR, we use an adapted version called the Kernel Monte Carlo Filter (Kanagawa et al., 2013), which is designed for state-space models. We use the Kalman filter (Welch & Bishop, 2006) to compute the true marginal posterior for evaluation purposes.

Figure 4: Experimental results for the LDS model. (a) Comparison of cross-entropy. (b) Squared MMD with RBF kernel. (c) Integral estimation on a test function. Only results evaluated on the testing set are shown. The left column compares performance at every stage $t$, whereas the right column compares the estimation errors with respect to different particle set sizes. Results in the right column are first averaged over 25 stages for each task, and then averaged over 25 tasks. The error bar shows the standard error over tasks. We use the same MPF operator trained with 1024 particles, even though during the testing phase we apply it to particle sets of different sizes. See Appendix D for more results.

Fig. 4 compares our method with KBR and SMC (importance sampling with resampling) on a testing set containing 25 new sequences of ordered observations. We see that our learned MPF can generalize to test sequences and achieves better and more stable performance.

Bayesian Logistic Regression (BLR). We consider logistic regression for digit classification on the MNIST8M 8 vs. 6 dataset, which contains about 1.6M training samples and 1932 testing samples. We follow the same setting as Dai et al. (2016) and reduce the dimension of the images to 50 by PCA. Two experiments are conducted on this dataset. For both experiments, we compare our method with SMC, SVI (stochastic variational inference (Hoffman et al., 2012)), PMD (particle mirror descent (Dai et al., 2016)), SGD Langevin (stochastic gradient Langevin dynamics (Welling & Teh, 2011)) and SGD NPV (a stochastic version of nonparametric variational inference (Gershman et al., 2012)). This is a large dataset, so we use the techniques discussed in Section 3.3 to improve training efficiency.

Figure 5: Bayesian logistic regression on MNIST. Left: the average online prediction accuracy $\frac{1}{t}\sum_{s=1}^{t} a_s$ is evaluated, where $a_s$ is the accuracy on the $s$-th batch of images. The shaded area shows the standard deviation of the results over 10 testing tasks with different rotation angles $\gamma$. Right: we collect some examples in which the random initialization of the particles is farther from the posterior and gives a worse initial prediction accuracy; the MPF operator updates those particles to gradually achieve higher accuracy.

BLR-Meta Learning. In the first experiment, we use a meta-learning setting similar to that of the synthetic experiments. A different sequence of image samples will result in a different posterior distribution and thus corresponds to a different inference task. Furthermore, we modify the dataset to make the learning task harder. More precisely, we create a multi-task environment by rotating the first and second components of the PCA-reduced features through an angle $\gamma$ sampled uniformly at random. Note that the first two components account for the most variability in the data. With a different rotation angle $\gamma$, the classification boundary changes and thus a different classification task is created.

We learn the operator $\mathcal{F}$ on a set of training tasks, where each task corresponds to a different rotation angle $\gamma$. After that, we use it as an online inference algorithm on the testing tasks and compare its performance with other stochastic and sampling methods. Testing is done in an online fashion: all algorithms start with a set of particles sampled from the prior (hence the prediction accuracy at the 0-th step is around 0.5). Each algorithm makes a prediction on the encountered batch of 32 images and then observes their true labels. After that, each algorithm updates its particles and makes a prediction on the next batch of images. Ideally, we should compare the estimated posteriors. However, since they are intractable, we evaluate at each stage the average accuracy of the predictions made using the particles. Results are shown in Fig. 5.

Note that we have conducted a sanity check to confirm that the learned operator does not simply ignore the first two rotated dimensions and use the remaining 48 components to make predictions. More precisely, if we zero out the first two components of the data and learn on the remaining ones, the accuracy of the particles drops to around 65%. This further verifies that the learned MPF can indeed generalize across different tasks.

BLR-Variational Inference. For the second experiment on MNIST, we use MPF as a variational inference method. That is to say, instead of accurately learning a generalizable Bayes operator, the estimation of the posterior is of primary importance. Thus, here we do not use the loss function in Eq. (19), which sums over all intermediate stages, but place more emphasis on the final-stage error. After training is finished, we only use the transported particles to perform classification on the test set and do not further use the operator $\mathcal{F}$. The result is shown in Fig. 6, where the x-axis shows the number of samples visited during training. Since we use a batch size of 128 and consider 10 stages, the first gradient step of our method occurs after around 1280 samples are visited.

Figure 6: MPF as a variational inference method. The prediction accuracy of MPF is comparable with state-of-the-art variational inference and sampling methods.

6 Discussion on Potential Architecture

In this section, we discuss two specific parametrizations of the flow velocity to increase the efficiency and stability of the particle flow.

6.1 Efficient Architecture

Consider a special case where the flow takes the form of a normalizing flow (Rezende & Mohamed, 2015)

$$\frac{d\theta(\tau)}{d\tau} = u\, \sigma\big(w^\top \theta(\tau) + b\big). \qquad (23)$$

Its approximation using a forward-Euler discretization scheme with step size $\epsilon$ gives the planar normalizing flow:

$$\theta_{k+1} = \theta_k + \epsilon\, u\, \sigma\big(w^\top \theta_k + b\big). \qquad (24)$$

The change-of-variables formula for this flow is particularly easy to compute due to its special architecture:

$$\log \Big| \det \frac{\partial \theta_{k+1}}{\partial \theta_k} \Big| = \log \big| 1 + \epsilon\, \sigma'(w^\top \theta_k + b)\, u^\top w \big|. \qquad (25)$$

To extend our meta-learning approach to this discrete normalizing flow, one can make the weights adaptive to the new observation and to the locations of the particles of the current posterior, for instance $(u, w, b) = g\big(\frac{1}{N}\sum_{i=1}^N \phi(\theta_{t-1}^i),\, x_t\big)$ for some network $g$. This is similar to a HyperNetwork (Ha et al., 2017), where the filters/weights are not fixed.
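A hedged sketch of Eqs. (24)-(25) with weights produced by a small hypernetwork from a context vector (e.g., the particle embedding concatenated with the observation). The invertibility constraint on $u$ and $w$ is omitted for brevity and all names are our own.

```python
import torch
import torch.nn as nn

class AdaptivePlanarFlow(nn.Module):
    """One planar-flow step whose parameters (u, w, b) come from a hypernetwork."""
    def __init__(self, dim, ctx_dim, hidden=64):
        super().__init__()
        self.hyper = nn.Sequential(nn.Linear(ctx_dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, 2 * dim + 1))   # -> u, w, b
        self.dim = dim

    def forward(self, theta, context):
        params = self.hyper(context)
        u, w, b = params[:self.dim], params[self.dim:2 * self.dim], params[-1]
        pre = theta @ w + b                                # (N,)
        new_theta = theta + torch.tanh(pre)[:, None] * u   # planar update, Eq. (24)
        # change of variables, Eq. (25): log|1 + sigma'(.) u.w| with sigma = tanh
        log_det = torch.log(torch.abs(1 + (1 - torch.tanh(pre) ** 2) * (u @ w)))
        return new_theta, log_det

flow = AdaptivePlanarFlow(dim=2, ctx_dim=16)
theta, log_det = flow(torch.randn(256, 2), torch.randn(16))
```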

6.2 Stable Architecture

An ODE solution is called stable if the long-term behavior of the system does not depend significantly on the initial conditions. This property coincides with the behavior of sequential Bayesian inference: with a longer sequence of observations, the initial prior distribution has less effect.

Mathematically, the solution of $\frac{d\theta(\tau)}{d\tau} = f(\theta, \tau)$ is stable if $\mathrm{Re}(\lambda_i) \leq 0$ for all $i$, where $\lambda_i$ are the eigenvalues of the Jacobian $\frac{\partial f}{\partial \theta}$ and $\mathrm{Re}(\cdot)$ denotes the real part. Meanwhile, if $\mathrm{Re}(\lambda_i)$ is too negative, it will lead to catastrophic forgetting of past observations. Recently, Haber & Ruthotto (2018) designed an architecture where $\mathrm{Re}(\lambda_i) = 0$. More precisely, the weight matrix is restricted to be antisymmetric:

$$W = \tfrac{1}{2}\big(B - B^\top\big). \qquad (26)$$

The flow $\frac{d\theta}{d\tau} = \sigma(W\theta + b)$ then depends on the matrix $B$ and the vector $b$. Similarly, one can make $B$ and $b$ depend on the particle embedding and the observation $x_t$.
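A minimal sketch, under our own assumptions, of a flow layer whose weight matrix is constrained to be antisymmetric as in Eq. (26), so that the linearized dynamics are (neutrally) stable:

```python
import torch
import torch.nn as nn

class StableFlowStep(nn.Module):
    """One Euler step of d(theta)/d(tau) = tanh(W theta + b) with antisymmetric W."""
    def __init__(self, dim):
        super().__init__()
        self.B = nn.Parameter(torch.randn(dim, dim) * 0.1)
        self.b = nn.Parameter(torch.zeros(dim))

    def forward(self, theta, step_size=0.1):
        W = 0.5 * (self.B - self.B.t())              # antisymmetric weight, Eq. (26)
        return theta + step_size * torch.tanh(theta @ W.t() + self.b)

layer = StableFlowStep(dim=2)
theta = layer(torch.randn(256, 2))
```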

7 Conclusion and Future Work

In this paper, we have explored the possibility of learning an ODE-based Bayes operator that can perform online Bayesian inference in the testing phase, and we have verified its generalization ability through both synthetic and real-world experiments. Further investigation into the parameterization of the flow velocity (e.g., using a stable neural architecture (Haber & Ruthotto, 2018) with a HyperNetwork (Ha et al., 2017)) and into generating diverse prior distributions through a Dirichlet process could lead to potentially better solutions to this challenging problem.

References

  • Andrieu et al. (2003) Andrieu, C., de Freitas, N., Doucet, A., and Jordan, M. I. An introduction to MCMC for machine learning. Machine Learning, 50:5–43, 2003.
  • Andrychowicz et al. (2016) Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M. W., Pfau, D., Schaul, T., and de Freitas, N. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pp. 3981–3989, 2016.
  • Antoniak (1974) Antoniak, C. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Annals of Statistics, 2:1152–1174, 1974.
  • Balakrishnan & Madigan (2006) Balakrishnan, S. and Madigan, D. A one-pass sequential monte carlo method for bayesian analysis of massive datasets. Bayesian Analysis, 1(2):345–361, 06 2006.
  • Batchelor (2000) Batchelor, G. Kinematics of the Flow Field, pp. 71–130. Cambridge Mathematical Library. Cambridge University Press, 2000.
  • Canini et al. (2009) Canini, K. R., Shi, L., and Griffiths, T. L. Online inference of topics with latent Dirichlet allocation. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS), 2009.
  • Chen et al. (2018) Chen, T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equations. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 6572–6583. Curran Associates, Inc., 2018.
  • Chopin et al. (2013) Chopin, N., Jacob, P. E., and Papaspiliopoulos, O. Smc2: an efficient algorithm for sequential analysis of state space models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(3):397–426, 2013.
  • Dai et al. (2016) Dai, B., He, N., Dai, H., and Song, L. Provable bayesian inference via particle mirror descent. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pp. 985–994, 2016.
  • Dai et al. (2017) Dai, H., Khalil, E. B., Zhang, Y., Dilkina, B., and Song, L. Learning combinatorial optimization algorithms over graphs. arXiv preprint arXiv:1704.01665, 2017.
  • Daum & Huang (2003) Daum, F. and Huang, J. Curse of dimensionality and particle filters. In 2003 IEEE Aerospace Conference Proceedings, volume 4, 2003.
  • Doucet et al. (2001) Doucet, A., de Freitas, N., and Gordon, N. Sequential Monte Carlo Methods in Practice. Springer-Verlag, 2001.
  • Dreyfus (1964) Dreyfus, S. Some types of optimal control of stochastic systems. Journal of the Society for Industrial and Applied Mathematics Series A Control, 2(1):120–134, 1964. doi: 10.1137/0302010.
  • Fukumizu et al. (2012) Fukumizu, K., Song, L., and Gretton, A. Kernel Bayes’ rule: Bayesian inference with positive definite kernels. Journal of Machine Learning Research (JMLR), 2012.
  • Gershman et al. (2012) Gershman, S., Hoffman, M., and Blei, D. M. Nonparametric variational inference. In Langford, J. and Pineau, J. (eds.), Proceedings of the 29th International Conference on Machine Learning (ICML-12), pp. 663–670, New York, NY, USA, 2012. ACM.
  • Grathwohl et al. (2018) Grathwohl, W., Chen, R. T., Betterncourt, J., Sutskever, I., and Duvenaud, D. Ffjord: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367, 2018.
  • Gretton et al. (2012) Gretton, A., Borgwardt, K., Rasch, M., Schoelkopf, B., and Smola, A. A kernel two-sample test. JMLR, 13:723–773, 2012.
  • Ha et al. (2017) Ha, D., Dai, A., and Le, Q. V. HyperNetworks. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.
  • Haber & Ruthotto (2018) Haber, E. and Ruthotto, L. Stable architectures for deep neural networks. Inverse Problems, 34(1):014004, 2018.
  • Hoffman et al. (2012) Hoffman, M., Blei, D. M., Wang, C., and Paisley, J. Stochastic variational inference. In International Conference on Machine Learning, 2012.
  • Jordan et al. (1998) Jordan, R., Kinderlehrer, D., and Otto, F. The variational formulation of the fokker–planck equation. SIAM Journal on Mathematical Analysis, 29(1):1–17, 1998.
  • Kanagawa et al. (2013) Kanagawa, M., Nishiyama, Y., Gretton, A., and Fukumizu, K. Filtering with state-observation examples via kernel monte carlo filter. arXiv e-prints, December 2013.
  • Lei et al. (2017) Lei, N., Su, K., Cui, L., Yau, S.-T., and Xianfeng Gu, D. A Geometric View of Optimal Transportation and Generative Model. arXiv e-prints, art. arXiv:1710.05488, October 2017.
  • Rezende & Mohamed (2015) Rezende, D. J. and Mohamed, S. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.
  • Smola et al. (2007) Smola, A., Gretton, A., Song, L., and Schölkopf, B. A hilbert space embedding for distributions. In Algorithmic learning theory, pp. 13–31. Springer, 2007.
  • Snyder et al. (2008) Snyder, C., Bengtsson, T., Bickel, P., and Anderson, J. Obstacles to high-dimensional particle filtering. Monthly Weather Review, 136(12):4629–4640, 2008.
  • Wainwright & Jordan (2003) Wainwright, M. J. and Jordan, M. I. Graphical models, exponential families, and variational inference. Technical Report 649, UC Berkeley, Department of Statistics, September 2003.
  • Welch & Bishop (2006) Welch, G. and Bishop, G. An introduction to the kalman filter. Technical Report TR-95-041, Department of Computer Science, University of North Carolina at Chapel Hill, 2006.
  • Welling & Teh (2011) Welling, M. and Teh, Y.-W. Bayesian learning via stochastic gradient langevin dynamics. In International Conference on Machine Learning (ICML), pp. 681–688, 2011.
  • Zaheer et al. (2017) Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., and Smola, A. J. Deep sets. In Advances in Neural Information Processing Systems, pp. 3391–3401, 2017.
  • Zhang et al. (2018) Zhang, L., E, W., and Wang, L. Monge-ampere flow for generative modeling. arXiv preprint arXiv:1809.10188, 2018.

Appendix A Existence of Shared Flow Operator

See Theorem 2.3.

Proof.

By Theorem 2.2, the closed-loop control $f = -\nabla_\theta U(\theta) - \nabla_\theta \log q(\theta, \tau)$ can induce the optimal state and achieve a zero loss, $D\big(q_\infty, p(\theta \mid x)\big) = 0$. Hence, it is an optimal closed-loop control for this problem.

Although in general a closed-loop control gives a stronger characterization of the solution, in a deterministic system like Eq. (14), the optimal closed-loop control and the optimal open-loop control give the same control law and thus the same optimal value of the loss function (Dreyfus, 1964). Hence, there exists an optimal open-loop control $f^*$ such that the induced optimal state also gives a zero loss, and thus $q_\infty = p(\theta \mid x)$.

More specifically, when the system is deterministic, a state is just a determinis