# The Theory and Algorithm of Ergodic Inference

Approximate inference is one of the fundamental research fields in machine learning. The two dominant theoretical inference frameworks in machine learning are variational inference (VI) and Markov chain Monte Carlo (MCMC). However, because of fundamental limitations in their theory, it is very challenging to improve existing VI and MCMC methods in both computational scalability and statistical efficiency. To overcome this obstacle, we propose a new theoretical inference framework called ergodic inference, based on the fundamental property of ergodic transformations. The key contribution of this work is to establish the theoretical foundation of ergodic inference for the development of practical algorithms in future work.


## 1 Introduction

Statistical inference is the cornerstone of probabilistic modelling in machine learning. Research on inference algorithms attracts great attention in the research community, because it is fundamentally important to the computation of Bayesian inference and deep generative models. The majority of research is focused on algorithmic development in two theoretical frameworks: variational inference (VI) and Markov chain Monte Carlo (MCMC). These two methods are significantly different. VI is an optimisation-based approach, which fits a simple distribution to a given target. In contrast, MCMC is a simulation-based approach, which sequentially generates asymptotically unbiased samples from an arbitrary target.

Unfortunately, both VI and MCMC suffer from fundamental limitations. VI methods are in general biased because the density function of the approximate distribution must be in closed form. MCMC methods are also biased in practice because the Markov property confines each simulated sample to a local region of the sample space close to the previous samples. However, VI is in general more scalable in computation. Optimising the variational distribution and simulating samples in VI are computationally efficient and can be accelerated by parallelisation on GPUs. In contrast, simulating Markov chains is computationally inefficient and, more importantly, asynchronous parallel simulation of multiple Markov chains does nothing to reduce sample correlations and only multiplies the computation.

Ergodic measure preserving flow (EMPF), introduced by (Zhang et al., 2018), is a recent optimisation-based inference method that overcomes the limitations of both MCMC and VI. However, there is no theoretical proof of the validity of EMPF. In this work, we generalise EMPF to a novel inference framework called ergodic inference. In particular, the purpose of this work is to establish the theoretical foundation of ergodic inference. The key contributions of this work are as follows:

• The mathematical foundation of ergodic inference. (Section 3 and 4)

• A tractable loss of ergodic inference and the proof of the validity of the loss. (Section 5)

• An ergodic inference model: deep ergodic inference networks (Section 6)

• Clarification of differences between ergodic inference, MCMC and VI (Section 6)

## 2 The background

Convergence of probability measures is the foundation of statistical inference, and distance metrics between probability measures are critical in the study of convergence. We review the basics of distance metrics between probability measures and connect these metrics to the theoretical foundation of inference methods.

### 2.1 Distance Metric of Probability Measures

Total variation (TV) distance is fundamentally important in probability theory, because it defines the strongest convergence of probability measures. Let $(\Omega,\mathcal{F})$ be a measurable space, where $\Omega$ denotes the sample space and $\mathcal{F}$ denotes the collection of measurable subsets of $\Omega$. Given two probability measures $Q$ and $P$ defined on $(\Omega,\mathcal{F})$, the TV distance between $Q$ and $P$ is defined as

$$D_{TV}(Q,P)=\sup_{A\in\mathcal{F}}|Q(A)-P(A)|. \tag{1}$$

Convergence in TV, that is $D_{TV}(Q,P)=0$, means $Q$ and $P$ cannot be distinguished on any measurable set.

The Kullback-Leibler (KL) divergence is an important measure of difference between probability measures in statistical methods. For a continuous sample space $\Omega$, the KL divergence is defined as

$$D_{KL}(Q\|P)=\int_\Omega dQ\,\log\frac{dQ}{dP}, \tag{2}$$

where $dQ$ and $dP$ denote the densities of the probability measures.
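As a small illustration of definitions (1) and (2), the sketch below evaluates both quantities on a finite sample space, where the supremum over measurable sets reduces to half the L1 distance between the probability vectors. The function names and the two example distributions are assumptions chosen for illustration, not part of the original work.

```python
import math
from itertools import combinations

def tv_distance(q, p):
    """TV distance on a finite space: half the L1 distance between the pmfs."""
    return 0.5 * sum(abs(qi - pi) for qi, pi in zip(q, p))

def tv_by_subsets(q, p):
    """Brute-force version of (1): maximise |Q(A) - P(A)| over every subset A."""
    n = len(q)
    best = 0.0
    for k in range(n + 1):
        for subset in combinations(range(n), k):
            diff = abs(sum(q[i] for i in subset) - sum(p[i] for i in subset))
            best = max(best, diff)
    return best

def kl_divergence(q, p):
    """Discrete analogue of (2); terms with q_i = 0 contribute nothing."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# Hypothetical example distributions on three outcomes.
Q = [0.5, 0.3, 0.2]
P = [0.4, 0.4, 0.2]
print("TV (half L1):    ", tv_distance(Q, P))
print("TV (sup over A): ", tv_by_subsets(Q, P))
print("KL(Q || P):      ", kl_divergence(Q, P))
```

The brute-force search is exponential in the number of outcomes and is only there to confirm that the two TV computations agree on a tiny example.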

### 2.2 Approximate Monte Carlo Inference

The Monte Carlo method is the most popular simulation-based inference technique in probabilistic modelling. For example, to fit a probabilistic model $\pi(z)=\pi^*(z)/Z(\theta)$ by maximum likelihood estimation, it is essential to compute the gradient of the log partition function $\log Z(\theta)$. Given the unnormalised density function $\pi^*(z)$, computing the gradient becomes a problem of expectation estimation

$$\partial_\theta\log Z(\theta)=E_{\pi(z)}[\partial_\theta\log\pi^*(z)].$$

Monte Carlo methods allow us to construct an unbiased estimator of the expectation, based on

$$E_{\pi(z)}[f(z)]=\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N}f(z_i),$$

where $z_i$ denotes samples from $\pi$. Unfortunately, it is intractable to generate samples from complex distributions, like the posterior distributions of model parameters or latent variables. Because of this challenge, approximate Monte Carlo inference is fundamentally important. We review the theoretical foundation of two important inference methods, variational inference (VI) and Markov chain Monte Carlo (MCMC), in the next two sections.
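The sample-average estimator above can be sketched in a few lines for a case where exact sampling is possible. Here the target is a standard normal and the test function is $f(z)=z^2$, whose expectation is 1; both choices, and the helper name `monte_carlo_mean`, are assumptions made only for this illustration.

```python
import random

def monte_carlo_mean(f, sampler, n):
    """Average f over n independent draws produced by sampler()."""
    return sum(f(sampler()) for _ in range(n)) / n

# Hypothetical setting: pi is N(0, 1) and f(z) = z^2, so E_pi[f] = 1.
rng = random.Random(0)
estimate = monte_carlo_mean(lambda z: z * z, lambda: rng.gauss(0.0, 1.0), 200_000)
print("Monte Carlo estimate of E[z^2]:", estimate)
```

The difficulty discussed in the text is that `sampler` is not available for complex posteriors, which is what VI and MCMC try to work around.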

### 2.3 Variational Inference

The theoretical foundation of VI is Pinsker's inequality, which states that the TV distance is bounded above by a function of the KL divergence:

$$D_{TV}(Q,P)\le\sqrt{\tfrac{1}{2}D_{KL}(Q\|P)}. \tag{3}$$

Given a parametric distribution $q_\phi$ and the target distribution $\pi$, minimising the KL divergence $D_{KL}(q_\phi\|\pi)$ therefore tightens the bound on the TV distance $D_{TV}(q_\phi,\pi)$. The key challenge of VI is how to construct the parametric family $\mathcal{Q}$ so that the estimation of the KL divergence is tractable and the family is expressive enough to approximate a complex target. This forces most VI methods to choose $q_\phi$ with a closed-form density function. Otherwise, the estimation of the entropy term becomes challenging. In practice, the approximation family $\mathcal{Q}$ in most VI methods is rather simple, like the Gaussian family, so the approximation bias due to an oversimplified $\mathcal{Q}$ is the key issue of VI.

However, a simple approximation family gives VI methods a great computational advantage in practice. First, the main loss function in VI is known as the evidence lower bound (ELBO)

$$\mathcal{L}_{ELBO}=\int_\Omega dQ\,\log\frac{d\pi^*}{dQ}\le\log\int d\pi^*. \tag{4}$$

With an analytic form of the entropy of $Q$, the ELBO can be efficiently computed and optimised using standard gradient descent algorithms. Second, simulating i.i.d. samples from a simple variational family is straightforward and very efficient.
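To make the bound in (4) concrete, the following sketch evaluates the ELBO for a one-dimensional Gaussian variational family against the unnormalised target $\pi^*(z)=\exp(-z^2/2)$, for which $\log\int d\pi^*=\tfrac12\log 2\pi$. The target, the parameter names `m` and `s`, and the helper functions are all assumptions for illustration; the closed form uses the analytic Gaussian entropy mentioned above.

```python
import math
import random

LOG_Z = 0.5 * math.log(2.0 * math.pi)  # log of the integral of exp(-z^2 / 2)

def log_target_unnorm(z):
    """Hypothetical unnormalised log target: log pi*(z) = -z^2 / 2."""
    return -0.5 * z * z

def log_q(z, m, s):
    """Log density of the Gaussian variational distribution N(m, s^2)."""
    return -0.5 * ((z - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2.0 * math.pi)

def elbo_closed_form(m, s):
    """E_q[log pi*] plus the analytic entropy of N(m, s^2)."""
    expected_log_target = -0.5 * (m * m + s * s)
    entropy = 0.5 * math.log(2.0 * math.pi * math.e * s * s)
    return expected_log_target + entropy

def elbo_monte_carlo(m, s, n, rng):
    """Sample-average ELBO using reparameterised draws z = m + s * eps."""
    total = 0.0
    for _ in range(n):
        z = m + s * rng.gauss(0.0, 1.0)
        total += log_target_unnorm(z) - log_q(z, m, s)
    return total / n

print("log Z:                ", LOG_Z)
print("ELBO at m=0, s=1:     ", elbo_closed_form(0.0, 1.0))
print("ELBO at m=1, s=0.5:   ", elbo_closed_form(1.0, 0.5))
print("Monte Carlo, m=1,s=.5:", elbo_monte_carlo(1.0, 0.5, 100_000, random.Random(1)))
```

The bound is tight only when the variational distribution matches the normalised target, which here is the case $m=0$, $s=1$.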

### 2.4 Markov Chain Monte Carlo

The theoretical foundation of Markov chain Monte Carlo (MCMC) is the ergodic theorem. The ergodic theorem states that, given an ergodic Markov chain $(Z_n)$ with a stationary distribution $\pi$, the average across the states of a chain is equivalent to the average over the state space of the chain, that is

$$E_\pi[f]=\lim_{M\to\infty}\frac{1}{M}\sum_{m=1}^{M}f(Z^m_\infty)=\lim_{N\to\infty}\frac{1}{N}\sum_{n=1}^{N}f(Z_n),$$

where $Z^m_\infty$ denotes the sample of the $m$-th well-mixed Markov chain after infinitely many transitions. The ergodic theorem implies that we can use the samples from every Markov transition without waiting forever for the chains to reach the stationary state. Therefore, we can trade computational efficiency against a bias that may take a long time to decrease. The key challenge of MCMC methods is to define ergodic Markov chains with any given stationary distribution $\pi$. This challenge was first solved by the Metropolis-Hastings algorithm, which we discuss in detail in Section 4.2.

Ergodic Markov chains enjoy strong stability. Irrespective of the distribution of the initial state $Q_0$ and the parameter of the Markov kernel $K$, the distribution of the state of the chain is guaranteed not to move away from the stationary distribution $\pi$ in total variation after any transition. Formally, the TV distance to the stationary distribution does not increase for all $L$:

$$D_{TV}(Q_{L+1},\pi)\le D_{TV}(Q_L,\pi),$$

where $Q_L$ denotes the marginal distribution of the $L$-th state and

$$q_{L+1}(dz')=\int K(z,dz')\,q_L(dz).$$

As $L$ increases, the distribution $Q_L$ converges to a unique stationary distribution

$$\lim_{l\to\infty}D_{TV}(Q_l,\pi)=0.$$

In spite of this theoretical convergence property, the convergence of MCMC chains is not guaranteed in practice. Because the burn-in stage cannot be infinitely long, the samples from MCMC methods are often biased. The problem is that there is no reliable measurement of such a sampling bias in terms of TV distance or KL divergence. The iterative simulation of Markov chains is another limitation on computational efficiency. Each sample from an MCMC method requires one simulation of a Markov transition, and this can only be executed sequentially due to the nature of Markov chains. Therefore, the sampling time of MCMC grows linearly with the number of samples.
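The stability property described in this section can be checked numerically on a finite state space, where the marginal of each state is a probability vector and one transition is a vector-matrix product. The sketch below builds a three-state Metropolis kernel for a hypothetical target and tracks the TV distance to the target over 30 transitions; the target values, the uniform proposal over the other states, and all function names are assumptions made for this toy example.

```python
def metropolis_matrix(pi):
    """Transition matrix of a Metropolis chain with uniform proposals over other states."""
    n = len(pi)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                K[i][j] = min(1.0, pi[j] / pi[i]) / (n - 1)
        K[i][i] = 1.0 - sum(K[i])  # rejected proposals stay at state i
    return K

def step(q, K):
    """One Markov transition of the marginal: q_{L+1}(j) = sum_i q_L(i) K[i][j]."""
    n = len(q)
    return [sum(q[i] * K[i][j] for i in range(n)) for j in range(n)]

def tv_distance(q, p):
    return 0.5 * sum(abs(qi - pi_) for qi, pi_ in zip(q, p))

pi = [0.5, 0.3, 0.2]          # hypothetical stationary distribution
K = metropolis_matrix(pi)
q = [0.0, 0.0, 1.0]           # initial state distribution far from pi
tv_path = [tv_distance(q, pi)]
for _ in range(30):
    q = step(q, K)
    tv_path.append(tv_distance(q, pi))

print("TV distance after 0, 1, 2, 3 transitions:", tv_path[:4])
print("TV distance after 30 transitions:", tv_path[-1])
```

In this toy chain the marginal is available exactly, which is precisely what is missing in practice: with samples only, the remaining TV distance after a finite burn-in cannot be read off directly.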

## 3 Ergodic Inference Principle

In this section, we present the mathematical foundation of ergodic inference principle.

### 3.1 Motivation

First, we propose the following properties of an ideal inference method:

• Parallelizable: the simulation of each sample is computationally independent;

• Statistically efficient: there is zero correlation between samples;

• Asymptotic unbiased: more computational power guarantees diminishing of simulation bias. The bias can be eliminated in theory with sufficient computation.

Both MCMC and VI fail to have all the properties above. For this reason, there are existing works on hybrid methods that combine MCMC and VI, for example, accelerating the burn-in of MCMC using a variational approximation (Hoffman, 2017) or optimising the ELBO based on the tractable density function of an MCMC kernel (Salimans et al., 2015). To some extent, such algorithmic hybrid approaches can be useful in practice. However, the limitations in the theoretical foundations of MCMC and VI cannot be eliminated by algorithmic modification. To achieve an ideal inference method, it is necessary to have a new theoretical foundation.

### 3.2 The Theoretical Foundation

Different from Pinsker's inequality and the ergodic theorem, the theoretical motivation of the proposed inference is the characteristic property of ergodic Markov transitions: there is a unique invariant distribution for every ergodic Markov kernel. Formally, let $K_\pi$ be an ergodic Markov transition kernel with an invariant distribution $\pi$. By construction of $K_\pi$, $\pi$ is guaranteed to be the only distribution that satisfies the condition $\pi(z')=\int K_\pi(z,z')\,\pi(dz)$.

Based on this property of ergodic Markov kernels, we construct the following criterion to verify whether a distribution is equivalent to the stationary distribution of the kernel. Given a distribution $q$, the distribution of $z'$ after one Markov transition by $K_\pi$ is given by

$$q_1(z')=\int K_\pi(z,z')\,q(dz). \tag{5}$$

We say the distribution $q$ is preserved by $K_\pi$ if

$$D_{TV}(q_1,q)=0. \tag{6}$$

By the uniqueness of the invariant distribution of the ergodic kernel $K_\pi$, the preservation of $q$ by $K_\pi$ as in (6) implies $q=\pi$. This motivates the following loss function.

###### Definition 3.1.

Given a Markov kernel $K_\pi$ that is ergodic w.r.t. a distribution $\pi$, the ergodic loss of a distribution $q$ is defined as

$$L^*(q,K_\pi)=D_{TV}\left(\int K_\pi(z,\cdot)\,q(dz),\;q(\cdot)\right).$$

As mentioned earlier, the loss is equal to 0 if and only if $D_{TV}(q,\pi)$ is equal to 0.

Let $\pi$ be the target distribution and $q_\phi$ be the approximate distribution in a parametric family $\mathcal{Q}$. Given an ergodic Markov kernel $K_\pi$, the $q_\phi$ closest to the target can be identified by the parameter optimising the ergodic loss

$$\phi^*=\arg\min_\phi L^*(q_\phi,K_\pi).$$

If the target distribution $\pi$ is in $\mathcal{Q}$, then the optimal parameter $\phi^*$ should have the loss

$$L^*(q_{\phi^*},K_\pi)=0,$$

otherwise the norm of the gradient of the loss should vanish at $\phi^*$

$$\left\|\partial_\phi L^*(q_{\phi^*},K_\pi)\right\|_2^2=0.$$
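The ergodic inference principle can be tried out directly on a finite state space, where the ergodic loss of Definition 3.1 is computable in closed form. The sketch below is a toy under stated assumptions: a three-state Metropolis kernel for a hypothetical target, a one-parameter family `q_family(phi)` that happens to contain the target at `phi = 0.5`, and a grid search standing in for gradient-based optimisation.

```python
def metropolis_matrix(pi):
    """Metropolis kernel with uniform proposals over the other states; pi is invariant."""
    n = len(pi)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                K[i][j] = min(1.0, pi[j] / pi[i]) / (n - 1)
        K[i][i] = 1.0 - sum(K[i])
    return K

def push(q, K):
    """Distribution after one transition, the discrete analogue of equation (5)."""
    n = len(q)
    return [sum(q[i] * K[i][j] for i in range(n)) for j in range(n)]

def tv_distance(q, p):
    return 0.5 * sum(abs(a - b) for a, b in zip(q, p))

def ergodic_loss(q, K):
    """TV distance between q and its image under the kernel."""
    return tv_distance(push(q, K), q)

def q_family(phi):
    """Hypothetical one-parameter family; phi = 0.5 recovers the target below."""
    return [phi, 0.6 * (1.0 - phi), 0.4 * (1.0 - phi)]

pi = [0.5, 0.3, 0.2]
K = metropolis_matrix(pi)
grid = [i / 100 for i in range(1, 100)]
phi_star = min(grid, key=lambda phi: ergodic_loss(q_family(phi), K))

print("loss at the target:", ergodic_loss(pi, K))
print("loss at phi = 0.9: ", ergodic_loss(q_family(0.9), K))
print("grid minimiser:    ", phi_star)
```

Note that the loss is evaluated without ever comparing `q` to `pi` directly; only the kernel is used, which is the point of the principle.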

### 3.3 Technical Challenges

There are two technical challenges of ergodic inference methods in practice. First, we need a tractable estimate of a loss function equivalent to $L^*(q_\phi,K_\pi)$. The estimate of the gradient of the loss should also be tractable for the optimisation of the parameter $\phi$. Second, we need a general parametric family $\mathcal{Q}$ that can approximate any target distribution up to a certain amount of error. More specifically, the error can be controlled and even eliminated by increasing the complexity of the approximation family $\mathcal{Q}$, i.e. the number of parameters of $q_\phi$ is unlimited. The computational cost of optimisation is associated with the complexity of $\mathcal{Q}$.

We will present the solution to the first challenge in Section 5 and the solution to the second challenge in Section 6.

## 4 Ergodic Transformations

The key to solving the technical challenges in ergodic inference is the reparameterisation of the ergodic Markov kernel. This is important in both algorithmic development and theoretical analysis.

### 4.1 Ergodic Transformations and Markov Kernels

Ergodic Markov kernels are essentially conditional distributions, which can be reparameterised by deterministic transformations known as measure preserving transformations (MPTs). Given a probability measure $\pi$, a deterministic transformation $T$ preserves $\pi$ if for any measurable subset $A$ of the sample space $\Omega$, $\pi(T^{-1}(A))=\pi(A)$. A shear transformation such as $T(x,y)=(x,\,y+f(x))$, which preserves Lebesgue measure, is a classic example of an MPT (Billingsley, 1986). The following conditions are often used in the literature on MCMC theory to verify the ergodic property:

1. Irreducibility: except and .

2. Density preservation: .

3. Lebesgue preservation: the determinant of the Jacobian of is equal to 1.

Formally, we define the reparameterisation of ergodic Markov chains as follows.

###### Definition 4.1.

(Ergodic Reparameterisation of MCMC) Given a target distribution $\pi(z)$, an MCMC kernel with invariant distribution $\pi$ can be reformed as two steps:

1. Simulate an auxiliary variable $r$ with distribution $\mu(r)$.

2. Apply the deterministic transformation $(z',r')=T_{\pi\mu}(z,r)$,

where $T_{\pi\mu}$ is an ergodic transformation that preserves the probability measure $\pi(z)\mu(r)$.

###### Remark.

The transformation $T_{\pi\mu}$ in ergodic reparameterisation is fundamentally different from a volume preserving transformation in the sample space of $z$ for two reasons.

• $T_{\pi\mu}$ does not preserve the volume/entropy in the sample space of $z$, but must preserve the volume/entropy in the space of $(z,r)$.

• preserves the probability measure , but does not preserve in general.

Ergodic transformations also allow us to write the expectation under a Markov transition as a composition of functions, a form that is not used in the classic MCMC literature. Formally, this is given by the following proposition.

###### Proposition 1.

Given an ergodic transformation $T$ w.r.t. $\pi$, the expectation is preserved by the transformation, which means, for any function $f$,

$$\int_\Omega f\circ T\,d\pi=\int_{T(\Omega)}f\,d(T_*\pi)=\int_\Omega f\,d\pi,$$

where $T(\Omega)$ is the image of $\Omega$ under $T$ and $T_*\pi$ denotes the pushforward probability measure of $\pi$ under $T$. Because $T$ preserves $\pi$, $T(\Omega)=\Omega$ and $T_*\pi=\pi$.

In the next two sections, we will demonstrate the ergodic reparameterization with two well-known MCMC kernels.

### 4.2 Metropolis-Hastings Transformations

The Metropolis-Hastings (MH) algorithm is the first and best-known MCMC method. We will show that it is straightforward to form the MH transition kernel as an ergodic transformation. Given a target distribution $\pi(z)$ and a transition proposal distribution $q(r|z)$, the MH kernel is described in most textbooks as the following two steps:

1. Sample $r$ from $q(r|z)$.

2. Return the new state of the chain as $r$ with probability

$$p_{MH}=\min\left\{1,\frac{\pi(r)q(z|r)}{\pi(z)q(r|z)}\right\}, \tag{7}$$

otherwise the state remains as $z$.

It is straightforward to verify that the MH transition kernel preserves the density function as

$$\pi(z)\left[q(r|z)\min\left\{1,\frac{\pi(r)q(z|r)}{\pi(z)q(r|z)}\right\}\right]=\min\{\pi(z)q(r|z),\,\pi(r)q(z|r)\}=\pi(r)\left[q(z|r)\min\left\{1,\frac{\pi(z)q(r|z)}{\pi(r)q(z|r)}\right\}\right],$$

where the MH transition kernel is in square brackets. This verification of the stationary distribution is known as detailed balance. It is important because it proves the existence of a stationary distribution.

Now we consider an alternative representation of the MH kernel. In particular, we define a stationary distribution as the joint distribution of all random variables involved in the target $\pi(z)$ and the MH kernel $q(r|z)$, that is $\pi(z,r,u)=\pi(z)\,q(r|z)\,U_{[0,1]}(u)$, where $U_{[0,1]}$ denotes the uniform distribution between 0 and 1. Following the ergodic reparameterisation (Definition 4.1), we can rewrite the MH algorithm as

1. Resample $r$ from $q(r|z)$ and $u$ from $U_{[0,1]}$.

2. Return the next state defined as

$$T_{MH}(z,r,u)=(z,r,u)\,\delta(u>p_{MH})+(r,z,u)\,\delta(u\le p_{MH}), \tag{8}$$

where $\delta$ denotes the indicator function.

Notice that the transformation above is a deterministic function. It is obvious that resampling $r$ and $u$ from their conditional distributions leaves $\pi(z,r,u)$ invariant. Then, it is straightforward to show the preservation of the density function

$$\pi(s)\,\delta(s'=T_{MH}(s))=\pi(s')\,\delta(s=T_{MH}(s')),$$

where $s$ denotes the triple $(z,r,u)$. It is also easy to verify that the determinant of the Jacobian of $T_{MH}$ always has absolute value 1.
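The two-step form of the MH kernel (resample the auxiliary variables, then apply a deterministic map) can be sketched as follows. This is a minimal illustration, assuming a standard normal target and a Gaussian random-walk proposal; the names `log_pi`, `log_proposal`, `t_mh` and `run_chain` are hypothetical and chosen only for this example.

```python
import math
import random

def log_pi(z):
    """Hypothetical unnormalised log target (standard normal)."""
    return -0.5 * z * z

def log_proposal(r, z, step):
    """Log density of the random-walk proposal q(r | z), up to a constant."""
    return -0.5 * ((r - z) / step) ** 2

def t_mh(z, r, u, step=1.0):
    """Deterministic MH map on the triple (z, r, u): swap z and r iff u <= p_MH."""
    log_ratio = (log_pi(r) + log_proposal(z, r, step)) - (log_pi(z) + log_proposal(r, z, step))
    p_mh = math.exp(min(0.0, log_ratio))
    if u <= p_mh:
        return (r, z, u)   # accepted: the proposal becomes the state
    return (z, r, u)       # rejected: the triple is left unchanged

def run_chain(n, z0=0.0, step=2.0, seed=0):
    """Alternate between resampling (r, u) and applying the deterministic map."""
    rng = random.Random(seed)
    z, samples = z0, []
    for _ in range(n):
        r = rng.gauss(z, step)   # step 1: resample r from q(r | z)
        u = rng.random()         #         and u from U[0, 1]
        z, _, _ = t_mh(z, r, u, step)  # step 2: deterministic transformation
        samples.append(z)
    return samples

samples = run_chain(50_000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print("sample mean:", mean, "sample variance:", var)
```

All randomness lives in the resampling step, so `t_mh` itself can be called repeatedly with the same arguments and always returns the same triple.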

### 4.3 Hamiltonian Measure Preserving Transformations

Hamiltonian Monte Carlo (HMC), originally known as Hybrid Monte Carlo, is an important MCMC method. HMC was originally considered a hybrid method because it combines both deterministic and stochastic simulation. The deterministic simulation in HMC essentially refers to any dynamics that generalise the classic Hamiltonian dynamics in physics.

A Hamiltonian system in physics is a system of moving particles in an energy field, where the energy of the system is constant over time. The state of a Hamiltonian system is defined by the position $z$ and the momentum $r$ of the particles. The position is associated with the potential energy $U(z)$ and the momentum is associated with the kinetic energy $K(r)$. The state evolves over time $t$ according to Hamilton's equations:

$$\dot{z}(t)=\partial_r K(r);\qquad\dot{r}(t)=-\partial_z U(z), \tag{9}$$

where $\dot{z}(t)$ denotes the derivative of $z$ w.r.t. time $t$. It is straightforward to verify that the total energy $H(z,r)=U(z)+K(r)$ does not change over time

$$\dot{H}(z,r)=(\partial_z U(z))^T\partial_r K(r)-(\partial_r K(r))^T\partial_z U(z)=0.$$

Given an initial condition $(z,r)$, the solution of Hamiltonian dynamics is a function of time

$$(z(t),r(t))=T_H(t,z,r).$$

Given a fixed time $t$, the solution $T_H(t,\cdot,\cdot)$ becomes a map between two states $(z,r)$ and $(z(t),r(t))$ with the same total energy $H$. Intuitively, $T_H$ forms the trajectory of a particle traversing the position space, and the velocity of the particle is given by $\dot{z}(t)=\partial_r K(r)$.

It is well known in the MCMC literature that $T_H$ is essentially a family of measure preserving transformations with any parameter $t$. It is clear that $T_H$ is irreducible if $t>0$ and density preserving w.r.t. $\pi(z,r)\propto\exp(-H(z,r))$. The volume preservation property of Hamiltonian dynamics in the state space $(z,r)$ is a well-known result of Liouville's theorem. Therefore, we know that $T_H$ with any $t>0$ is an ergodic transformation w.r.t. the distribution $\pi(z,r)$. This implies $T_H$ also preserves $\pi(z)$ by the definition of the marginal distribution.

In practice, Hamiltonian dynamics do not have closed-form solutions. Fortunately, there is a rich literature on the numerical simulation of Hamiltonian dynamics. The best-known approximation in HMC is the leapfrog algorithm, which is constructed as a sequence of shear transformations. The leapfrog algorithm enjoys strong stability, and its approximation error is of the order of the squared discretisation step size. See (Neal, 2010; Leimkuhler & Reich, 2004) for a more detailed analysis.
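A minimal leapfrog sketch for a one-dimensional system is given below, assuming the quadratic energies $U(z)=z^2/2$ and $K(r)=r^2/2$ (a harmonic oscillator) purely for illustration. Each update changes only one of $z$ or $r$ using the other, which is the shear structure mentioned above; the function names and the step size are assumptions.

```python
def grad_U(z):
    """Gradient of the hypothetical potential U(z) = z^2 / 2."""
    return z

def hamiltonian(z, r):
    """Total energy H(z, r) = U(z) + K(r) with K(r) = r^2 / 2."""
    return 0.5 * z * z + 0.5 * r * r

def leapfrog(z, r, step, n_steps):
    """Leapfrog integration of Hamilton's equations as a sequence of shear maps."""
    r = r - 0.5 * step * grad_U(z)          # initial half step in momentum
    for i in range(n_steps):
        z = z + step * r                    # full step in position (dK/dr = r)
        if i < n_steps - 1:
            r = r - step * grad_U(z)        # full step in momentum
    r = r - 0.5 * step * grad_U(z)          # final half step in momentum
    return z, r

z0, r0 = 1.0, 0.5
z1, r1 = leapfrog(z0, r0, step=0.1, n_steps=10)
print("energy before:", hamiltonian(z0, r0))
print("energy after: ", hamiltonian(z1, r1))

# Reversibility check: flip the momentum, integrate again, flip back.
z2, r2 = leapfrog(z1, -r1, step=0.1, n_steps=10)
print("recovered state:", z2, -r2)
```

The energy is not conserved exactly, only up to a discretisation error, which is why HMC implementations usually add an MH correction on top of the leapfrog map.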

## 5 Ergodic Loss

### 5.1 π-Ergodic Loss Function

By the definition of TV distance, we know that $q$ is the stationary distribution of $K_\pi$ if and only if for all measurable functions $f$ with $\|f\|_\infty\le 1$,

$$E_{q_1}[f(z)]=E_q[f(z)]. \tag{10}$$

However, it is impossible to compare the expectations of all possible functions $f$, but given a specific function $f$ it is possible to estimate

$$L_{K,f}(\phi)=\left|E_{q_1}[f(z)]-E_q[f(z)]\right|. \tag{11}$$

With the optimal choice of the function $f$ and under certain conditions, we can claim that $L_{K,f}(\phi)=0$ implies $q=\pi$. The log density function $\log\pi$ is an intuitive choice, because we can identify a distribution by its density function. Therefore, we define the following $\pi$-ergodic loss.

###### Definition 5.1.

(Ergodic Loss Function)

$$L_{K,\pi}(\phi)=\left|E_{q_1}[\log\pi(z)]-E_q[\log\pi(z)]\right|. \tag{12}$$

###### Theorem 1.

(Ergodic Loss Convergence Theorem) Given the ergodic Markov kernel $K_\pi$ with invariant distribution $\pi$, the loss $L_{K,\pi}(\phi)=0$ if and only if $E_\pi[\log\pi(z)]=E_q[\log\pi(z)]$.

###### Proof.

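The convergence of the loss, $L_{K,\pi}(\phi)=0$, implies

$$E_{q_1(z)\mu(r)}[\log\pi(z)]=E_{q(z)\mu(r)}[\log\pi(z)], \tag{13}$$

where $q_1$ is given by (5). Notice that $q_1$ is essentially the marginal of the pushforward of $q\mu$ under the measure preserving transformation $T_{\pi\mu}$. By Proposition 1, the expectations in (13) can be written as follows

$$E_{q_1(z)}[\log\pi(z)]\triangleq\int_\Omega\log\pi\circ T_{\pi\mu}\,d(q\mu)=\int_\Omega\log\pi\,d(q\mu), \tag{14}$$

where $q\mu$ is shorthand notation for $q(z)\mu(r)$. Replacing $q\mu$ on both sides of (14) with any distribution, the equality still holds. If we replace $q\mu$ in (14) with the pushforward probability measure of $q\mu$ under $T_{\pi\mu}$, denoted by $T_{\pi\mu*}(q\mu)$, we have

$$\int_\Omega\log\pi\circ T_{\pi\mu}\,d\big(T_{\pi\mu*}(q\mu)\big)=\int_\Omega\log\pi\,d\big(T_{\pi\mu*}(q\mu)\big),$$

which can be rewritten as

$$\int_\Omega\log\pi\circ T^1_{\pi\mu}\circ T_{\pi\mu}\,d(q\mu_1)=\int_\Omega\log\pi\circ T_{\pi\mu}\,d(q\mu), \tag{15}$$

where $T^1_{\pi\mu}$ denotes the transformation of the second transition, driven by a fresh auxiliary variable $r_1$, and $q\mu_1$ denotes the corresponding joint distribution of $(z,r,r_1)$. Notice that the LHS of (15) is an expectation under the distribution of $z$ after two ergodic Markov transitions from $q$, that is $E_{q_2(z)}[\log\pi(z)]$. Therefore, by (14) and (15), we have

$$E_{q_2(z)}[\log\pi(z)]\triangleq\int_\Omega\log\pi\circ T^1_{\pi\mu}\circ T_{\pi\mu}\,d(q\mu_1)=\int_\Omega\log\pi\circ T_{\pi\mu}\,d(q\mu)=E_{q(z)}[\log\pi(z)]. \tag{16}$$

By induction, we know the expectation of $\log\pi$ does not change after any number of measure preserving transformations $T_{\pi\mu}$, which gives

$$E_{q_\infty(z)}[\log\pi(z)]=E_{q(z)}[\log\pi(z)]. \tag{17}$$

By (17), we know that if we simulate an infinitely long ergodic Markov chain by the kernel $K_\pi$, then the expectation $E_{q_\infty}[\log\pi(z)]$ is the same as the initial expectation $E_q[\log\pi(z)]$.

Because an ergodic Markov chain has a unique invariant distribution, (17) implies

$$E_{\pi(z)}[\log\pi(z)]=E_{q(z)}[\log\pi(z)]. \tag{18}$$

∎

Note that the convergence of the loss $L_{K,\pi}$ alone is not sufficient for the convergence of the TV distance $D_{TV}(q,\pi)$. Fortunately, under a reasonable condition, the loss $L_{K,\pi}(\phi)=0$ implies convergence in TV distance. Formally, this is given by the following theorem.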

###### Theorem 2.

(Ergodic Measure Convergence Theorem) Let $K_\pi$ be an ergodic Markov kernel with invariant distribution $\pi$. Assume that the entropy of $q$ is not less than the entropy of $\pi$, that is $H(q)\ge H(\pi)$. Then the loss $L_{K,\pi}(\phi)=0$ if and only if $D_{TV}(q,\pi)=0$.

###### Proof.

By the definition of the KL divergence, we have

$$D_{KL}(q\|\pi)=E_q[\log q]-E_q[\log\pi]. \tag{19}$$

By Theorem 1, we have

$$D_{KL}(q\|\pi)=E_q[\log q]-E_\pi[\log\pi], \tag{20}$$

which is equivalent to

$$D_{KL}(q\|\pi)=H(\pi)-H(q).$$

Because the KL divergence is never less than 0, we have

$$H(\pi)\ge H(q).$$

Finally, by the assumption $H(q)\ge H(\pi)$, we know $H(q)=H(\pi)$, so $D_{KL}(q\|\pi)=0$, which implies $D_{TV}(q,\pi)=0$. ∎

By the monotonic convergence in TV distance of ergodic Markov chain, it is straightforward to show that

###### Proposition 2.

Given a smooth ergodic transformation $T_{\pi\mu}$ w.r.t. the probability measure $\pi\mu$, if $q\ne\pi$, the loss satisfies

$$E_q[\log\pi^*(z)]-E_{q_1}[\log\pi^*(z)]>0. \tag{21}$$

Assume that , we have

$$L^*_{K,\pi^*}(\phi)=E_q[|\log\pi^*(z)|]-E_{q_1}[|\log\pi^*(z)|]. \tag{22}$$

### 5.2 Optimising π∗-Ergodic Loss

Let $q_{01}$ be the joint distribution $q_{01}(z,z_1)=q(z)K_\pi(z,z_1)$. Then, we can rewrite (22) as

$$L^*_{K,\pi^*}(\phi)=E_{q_{01}}[\log\pi^*(z)-\log\pi^*(z_1)], \tag{23}$$

which can be estimated by samples of $(z,z_1)$. To optimise the loss, we need to compute the gradient $\partial_\phi L^*_{K,\pi^*}(\phi)$. Notice that $z$ and $z_1$ are coupled through the kernel $K_\pi$ and the density function of most MCMC kernels, which makes the computation of the gradient unstable. To avoid this, we reparameterise both $q_\phi$ and the ergodic Markov kernel $K_\pi$ by a transformation $T_\phi$ and a measure preserving transformation $T_\pi$ respectively. This allows us to transform simple random variables $r$ and $r_1$, which are independent of $\phi$, into $(z,z_1)$ as

$$z=T_\phi(r),\qquad z_1=T_\pi(z,r_1). \tag{24}$$

Therefore, we can compute the loss with the following reformulation

$$L^*_{K,\pi^*}(\phi)=E_{\mu(r)\mu_1(r_1)}\left[L_{\pi^*,T_\phi,T_\pi}(r,r_1)\right], \tag{25}$$

where $L_{\pi^*,T_\phi,T_\pi}(r,r_1)=\log\pi^*(z)-\log\pi^*(z_1)$ and $(z,z_1)$ are given by (24).

As discussed above, the only requirement on the approximation family $\mathcal{Q}$ in ergodic inference is that the transformation $T_\phi$ is known and is a measurable function. This is an important advantage over VI, where the density function of $q_\phi$ must be in closed form.
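The reparameterised estimator in (24) and (25) can be sketched with a sample average. The example below is a toy under explicit assumptions: the target is $\pi^*(z)=\exp(-z^2/2)$, the approximation is $z=T_\phi(r)=m+s\,r$ with $\phi=(m,s)$, and the kernel is a random-walk MH step written as a deterministic function of $z$ and $r_1=(\epsilon,u)$. It follows the sign convention of (23) as printed and estimates only the loss, not its gradient.

```python
import math
import random

def log_pi_star(z):
    """Hypothetical unnormalised log target."""
    return -0.5 * z * z

def t_phi(r, m, s):
    """Reparameterised approximate distribution: z = m + s * r with r ~ N(0, 1)."""
    return m + s * r

def t_pi(z, r1, step=1.0):
    """Random-walk MH step as a deterministic map of z and r1 = (eps, u)."""
    eps, u = r1
    proposal = z + step * eps
    accept_prob = math.exp(min(0.0, log_pi_star(proposal) - log_pi_star(z)))
    return proposal if u <= accept_prob else z

def ergodic_loss_estimate(m, s, n, seed=0):
    """Sample average of log pi*(z) - log pi*(z1), following (23) to (25)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        r = rng.gauss(0.0, 1.0)
        r1 = (rng.gauss(0.0, 1.0), rng.random())
        z = t_phi(r, m, s)
        z1 = t_pi(z, r1)
        total += log_pi_star(z) - log_pi_star(z1)
    return total / n

print("q equal to the target (m=0, s=1):", ergodic_loss_estimate(0.0, 1.0, 100_000))
print("q shifted away (m=3, s=1):       ", ergodic_loss_estimate(3.0, 1.0, 100_000))
```

Only samples of $r$ and $r_1$ and evaluations of $\log\pi^*$ are needed; the density of $q_\phi$ never appears. In a differentiable setting (for example with a leapfrog-based $T_\pi$ and an automatic differentiation library) the same sample average could be differentiated w.r.t. $\phi$, which this stdlib sketch does not attempt.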

## 6 Deep Ergodic Inference Model

Ergodic transformations are not only fundamentally important in the ergodic loss, they are also powerful tools for constructing a flexible approximation family $\mathcal{Q}$. In this section, we present how to construct and optimise the approximation family by stacking multiple layers of ergodic transformations.

### 6.1 Definition

Let $K_1,\dots,K_{N-1}$ be ergodic transition kernels with independent parameters $\phi_1,\dots,\phi_{N-1}$. Let $q_0$ be the distribution of the initial state, which also has a parameter $\phi_0$. By ergodic reparameterisation, we reform each ergodic Markov kernel $K_n$ as a transformation $T_n(z_n,r_n)$, where $T_n$ is a deterministic function that depends on the kernel parameter $\phi_n$ and $r_n$ is sampled from a standard distribution $\mu_n$. We also reparameterise the initial distribution $q_0$ from a simple distribution $\mu_0$ by a transformation $T_0$. Then, we can generate samples of $z_N$ by transforming samples of $r_{0:N-1}$ from $\mu$ as

$$z_N=T_{r_{N-1}}\circ\cdots\circ T_{r_1}\circ T_0(r_0), \tag{26}$$

where $T_{r_n}$ denotes $T_n(\cdot,r_n)$. We call this multiple-layer ergodic transformation a deep ergodic inference network (DEIN). The expectation of $f(z_N)$ can be reformed as

$$E_{q_N}[f(z_N)]=E_\mu\left[f\circ T_{r_{N-1}}\circ\cdots\circ T_{r_1}\circ T_0(r_0)\right],$$

which allows us to estimate the gradient of any function $f$ by the Monte Carlo method

$$\partial_\phi E_{q_N}[f(z_N)]\approx\frac{1}{M}\sum_{i=1}^{M}\partial_\phi\,f\circ T_{r^i_{N-1}}\circ\cdots\circ T_{r^i_1}\circ T_0(r^i_0).$$
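The layered construction in (26) can be sketched as a function that draws fresh auxiliary variables and pushes them through a stack of deterministic maps. The example below is a toy under stated assumptions: the target is $\pi^*(z)=\exp(-z^2/2)$, each layer is a random-walk MH map with its own step size playing the role of the layer parameter, and the initial transformation is $T_0(r_0)=m_0+s_0 r_0$. It estimates $E_{q_N}[\log\pi^*(z)]$ only; gradients are not computed.

```python
import math
import random

def log_pi_star(z):
    """Hypothetical unnormalised log target."""
    return -0.5 * z * z

def mh_layer(z, r_n, step):
    """One ergodic layer: a random-walk MH map of z driven by r_n = (eps, u)."""
    eps, u = r_n
    proposal = z + step * eps
    accept_prob = math.exp(min(0.0, log_pi_star(proposal) - log_pi_star(z)))
    return proposal if u <= accept_prob else z

def sample_dein(step_sizes, m0, s0, rng):
    """Draw one sample by composing T_0 and one MH layer per step size, as in (26)."""
    z = m0 + s0 * rng.gauss(0.0, 1.0)            # T_0(r_0)
    for step in step_sizes:                      # one layer per kernel parameter
        r_n = (rng.gauss(0.0, 1.0), rng.random())
        z = mh_layer(z, r_n, step)
    return z

def expected_log_target(step_sizes, m0, s0, n_samples, seed=0):
    """Monte Carlo estimate of E_{q_N}[log pi*(z)] from independent samples."""
    rng = random.Random(seed)
    draws = [sample_dein(step_sizes, m0, s0, rng) for _ in range(n_samples)]
    return sum(log_pi_star(z) for z in draws) / n_samples

shallow = expected_log_target([], 3.0, 1.0, 5_000)
deep = expected_log_target([1.5] * 30, 3.0, 1.0, 5_000)
print("no layers:", shallow)
print("30 layers:", deep)   # for this target, E_pi[log pi*] = -0.5
```

Every call to `sample_dein` uses its own auxiliary variables, so the samples are independent of each other and could be generated in parallel, in contrast to consecutive states of a single Markov chain.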

### 6.2 Optimisation and Convergence of DEINs

This is a non-parametric model because the number of parameters grows with the number of transformations. Unlike deep neural networks, DEINs have strong stability by the nature of ergodicity. In particular, DEINs can be arbitrarily deep, and the stability and simulation quality are guaranteed to improve with the depth.

First, we define a loss (12) for each transition as

$$L_n(\phi_n)=E_{q_n}[\log\pi^*(z)]-E_{q_{n-1}}[\log\pi^*(z)],$$

where $q_n$ denotes the marginal of the last state

$$q_n(z_n;\phi_{0:n})=\int K(z_{n-1},z_n)\,q_{n-1}(dz_{n-1};\phi_{0:n-1}). \tag{27}$$

###### Proposition 3.

Assume that , minimizing the ergodic loss in (22) with of deep ergodic Inference network is equivalent to maximizing the total ergodic loss

$$L_N(\phi)=E_{q_N}[\log\pi^*(z)]-E_{q_0}[\log\pi^*(z)], \tag{28}$$

which is equivalent to

$$L_N(\phi;\phi_0)=E_{q_N}[\log\pi^*(z)] \tag{29}$$

when the parameter $\phi_0$ of $q_0$ is fixed.

The total loss (29) is consistent with the loss proposed by (Zhang et al., 2018) in ergodic measure preserving flows.

By Proposition 2, it is straightforward to show that DEINs enjoy incremental improvement as the depth grows.

###### Theorem 3.

(Incremental Convergence of DEIN) Given an $N$-layer DEIN defined as (26), the optimal total ergodic loss $L_N$ increases monotonically as $N$ increases.

Similar to the convergence of ergodic Markov chains, we have the asymptotically unbiased convergence of DEINs as follows.

###### Theorem 4.

(Asymptotic Unbiased Convergence of DEINs) For arbitrarily small $\epsilon>0$, there always exists a DEIN with a finite number of layers $N$, so that the optimal distribution $q_N$ has ergodic loss $L^*_{K,\pi^*}<\epsilon$.

### 6.3 Comparison with Auto-Tuning MCMC

From an algorithmic perspective, auto-tuning MCMC (AMCMC) and DEIN are very similar, because both methods simulate ergodic Markov chains and optimise the parameters of the kernel w.r.t. a loss. This may give the false impression that AMCMC and DEIN share the same theoretical foundation.

To correct this impression, we discuss the fundamental difference between DEINs and AMCMC. First of all, AMCMC is essentially a class of MCMC methods with an auto-tuning strategy for the kernel parameters. In particular, the purpose of auto-tuning is to boost the statistical power of samples from MCMC by encouraging distant jumps between states in Euclidean space, which is inspired by the work of (Pasarica & Gelman, 2010) on reducing the sample correlation of MCMC. In contrast, a DEIN serves as a parametric family in ergodic inference methods. The parameters of DEINs are optimised w.r.t. the ergodic loss, which is based on the ergodic inference principle in Section 3.2.

This fundamental difference has two important effects in practice. The first effect is on the sample correlation. By the nature of the Markov property, optimising the auto-tuning loss can never eliminate the correlation of samples from MCMC. In contrast, the samples from DEINs are generated by deterministic transformations of i.i.d. samples from the initial distribution, so they are still i.i.d. samples. The second effect is on the MH correction. In particular, the MH correction is optional for DEINs for three reasons. First, a DEIN is a parametric approximation family rather than an unbiased simulation procedure. Second, by optimising the ergodic loss, DEINs guarantee convergence towards the target in TV distance. Finally, even with approximate ergodic transformations, the existence of a stationary distribution (not necessarily the target) is guaranteed by the measure preserving property, in particular because the depth of a DEIN is always finite. In contrast, the convergence of AMCMC chains is only guaranteed with the MH correction. Without the MH correction, the existence of a stationary distribution of MCMC chains becomes questionable. With an unlimited number of recurrent Markov transitions, Markov chains are not guaranteed to converge to any distribution. The existence of a stationary distribution is a necessary condition of the ergodic theorem (Robert & Casella, 2005). Therefore, without the MH correction (implicitly justified by the detailed balance condition), the bias of samples from MCMC may not be bounded. This is particularly true when the Markov kernel parameter is tuned to maximise the jumping distance between states.

### 6.4 Comparison with Normalising Flows

Normalizing Flow (NF), introduced by (Rezende & Mohamed, 2015), is a recent variational inference framework, where the variational parametric distribution is defined by an iterative procedure. The fundamental idea of NF is to define an expressive parametric family by a sequence of deterministic transformations with closed-form Jacobians. Let $z_0$ be a random variable from a simple distribution $q$, like a Gaussian, and let $f_1,\dots,f_M$ be deterministic functions from the sample space to itself. We define a sequence of random variables $z_1,\dots,z_M$ with $z_i=f_i(z_{i-1})$, so that

$$z_M=f_M\circ\cdots\circ f_1(z_0).$$

By the change of variables rule, the density function of $z_M$ is given by

$$\log p(z_M)=\log q(z_0)-\sum_{i=1}^{M}\log\left|\det\partial_{z_{i-1}}f_i(z_{i-1})\right|.$$

There are three important differences between DEINs and NFs. First, without manually engineering ergodic transformations, DEINs have a theoretical guarantee of better performance with more transformations (Theorem 4). In contrast, the transformations $f_i$ in NFs are predefined based on heuristics and experimental evidence. Second, ergodic transformations need not have closed-form solutions, whereas the transformations in NFs are limited to simple functions with tractable Jacobians. Finally, the distribution of a DEIN is very expressive and may not even have a closed form, as in (27). More importantly, there is no need to compute the density to optimise the parameters. The opposite holds for NFs: the density is required in training, and the computation of the Jacobian is one of the computational bottlenecks in optimisation.
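The change-of-variables bookkeeping for a flow can be illustrated with one-dimensional affine layers $f_i(z)=a_i z+b_i$, whose Jacobian determinant is simply $a_i$ and whose composition applied to a standard normal is again Gaussian, so the result can be checked analytically. The layer values and function names below are assumptions for illustration, not a flow architecture from the literature.

```python
import math

def log_normal_pdf(z, mean, std):
    """Log density of N(mean, std^2) at z."""
    return -0.5 * ((z - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2.0 * math.pi)

def flow_forward(z0, layers):
    """Apply affine layers f_i(z) = a_i * z + b_i and accumulate log |det Jacobian|."""
    z, log_det_sum = z0, 0.0
    for a, b in layers:
        z = a * z + b
        log_det_sum += math.log(abs(a))
    return z, log_det_sum

def flow_log_density(z0, layers):
    """Return z_M and log p(z_M) = log q(z_0) - sum_i log |det Jacobian of f_i|."""
    z_m, log_det_sum = flow_forward(z0, layers)
    return z_m, log_normal_pdf(z0, 0.0, 1.0) - log_det_sum

# Hypothetical three-layer flow; the composition maps z0 to -1.5 * z0 + 1.75.
layers = [(2.0, 1.0), (-0.5, 3.0), (1.5, -2.0)]
z_m, log_p = flow_log_density(0.3, layers)
print("z_M:", z_m)
print("log p(z_M) from the flow:  ", log_p)
print("log p(z_M) analytic N(1.75, 1.5^2):", log_normal_pdf(z_m, 1.75, 1.5))
```

For general multivariate layers the `math.log(abs(a))` term becomes a log-determinant of a Jacobian matrix, which is the cost that restricts NF layers to structured transformations.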

### 6.5 Comparison Overview

The key differences between ergodic inference, AMCMC and VI are highlighted in the following table.

| Method | TV-Loss | Implicit Simulation Density | Independent samples |
|---|---|---|---|
| VI | Yes | No | Yes |
| AMCMC | No | Yes | No |
| Ergodic inference | Yes | Yes | Yes |

• TV-Loss: Optimising the loss function leads to the convergence in TV distance.

• Independent samples: computationally and statistically independent sample simulation.

• Implicit Simulation Density: no closed-form density function of simulation distribution is required in training.

## 7 Related Works

Hamiltonian variational inference (HVI), introduced by (Salimans et al., 2015), is an interesting variational framework using MCMC kernel as variational parametric distribution. The motivation of HVI is that the joint density function of all the states of HMC chains is tractable to compute. Unfortunately, the variational lower bound is still intractable to compute, because the reverse probability of HMC chain given the last state is intractable. To overcome this problem, they propose to approximate the reverse density function using neural network. Although HVI shows improvement in performance over VAEs, the additional approximation limits the potential of this method. However, optimising the HMC kernel parameters w.r.t. ELBO is still an attractive feature of HVI.

Hoffman (2017) proposed another hybrid method based on VI and HMC without auxiliary approximation. The idea is to use a Monte Carlo estimation of the marginal likelihood obtained by averaging over samples from HMC chains that are initialized by the variational distribution. Han et al. (2017) proposed a very similar framework using Metropolis-adjusted Langevin dynamics. This idea is closely related to contrastive divergence (Hinton, 2002). The main disadvantage of these methods is that the HMC parameters are manually pretuned. In particular, as mentioned by Hoffman (2017), the No-U-Turn Sampler (NUTS), an adaptive HMC method, is not applicable due to engineering difficulties. Neal (2010) pointed out that HMC is very sensitive to the choice of leapfrog step size and number of leaps.
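The sensitivity to the leapfrog step size can be seen in a small experiment. The sketch below is a toy illustration under stated assumptions: the target is a one-dimensional standard normal, the kinetic energy uses an identity mass matrix, and the names `leapfrog`, `grad_U` and `energy_error` are hypothetical. It is not the tuning procedure of any of the cited methods.

```python
import numpy as np

def leapfrog(q, p, grad_U, step_size, n_leaps):
    """Leapfrog integration for H(q, p) = U(q) + 0.5 * |p|^2."""
    q, p = np.array(q, dtype=float), np.array(p, dtype=float)
    p = p - 0.5 * step_size * grad_U(q)      # initial half step for momentum
    for i in range(n_leaps):
        q = q + step_size * p                # full step for position
        if i < n_leaps - 1:
            p = p - step_size * grad_U(q)    # full step for momentum
    p = p - 0.5 * step_size * grad_U(q)      # final half step for momentum
    return q, p

# Toy target (assumption): standard normal, so U(q) = 0.5 * |q|^2.
U = lambda q: 0.5 * np.sum(q ** 2)
grad_U = lambda q: q
H = lambda q, p: U(q) + 0.5 * np.sum(p ** 2)

q0, p0 = np.array([1.0]), np.array([1.0])
energy_error = {}
for eps in (0.1, 1.0, 2.5):
    q1, p1 = leapfrog(q0, p0, grad_U, eps, n_leaps=20)
    # A large energy error means a near-zero Metropolis acceptance probability.
    energy_error[eps] = abs(H(q1, p1) - H(q0, p0))
```

For this quadratic potential the integrator is stable only for step sizes below 2, so the energy error stays small at 0.1 and grows without bound at 2.5. Since the acceptance probability of HMC depends on this error, a poorly chosen step size makes the chain reject almost every proposal.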

Stein Variational Gradient Descent (SVGD) is a recent particle-based dynamical inference method proposed by Liu (2017). In SVGD, the approximating distribution is a set of point masses generated by transforming a set of points, sampled from an initial distribution, with a perturbation function that lies in a function space with bounded norm. With this setup, the optimisation of the perturbation function w.r.t. the KL divergence between the approximating distribution and the target is transformed into a stochastic optimisation in the corresponding kernel space. The theoretical foundation of the convergence of SVGD is sound and appealing. However, this method faces two practical challenges. First, the optimisation complexity grows quadratically with the number of particles. Second, it is very difficult to approximate a high-dimensional distribution well with a limited number of point masses.
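The quadratic cost is visible in a direct implementation of the SVGD particle update. The following is a minimal sketch under assumptions: it uses an RBF kernel with a fixed bandwidth and a standard normal target, and the names `rbf_kernel` and `svgd_step` are illustrative rather than taken from the cited work.

```python
import numpy as np

def rbf_kernel(x, bandwidth):
    """Kernel matrix K[j, i] = k(x_j, x_i) and its gradient w.r.t. x_j."""
    diff = x[:, None, :] - x[None, :, :]              # (n, n, d): x_j - x_i
    sq_dist = np.sum(diff ** 2, axis=-1)              # (n, n) pairwise distances
    K = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
    grad_K = -diff * K[:, :, None] / bandwidth ** 2   # d k(x_j, x_i) / d x_j
    return K, grad_K

def svgd_step(x, grad_log_p, step_size, bandwidth=1.0):
    """One SVGD update of all n particles; builds an n-by-n kernel matrix."""
    n = x.shape[0]
    K, grad_K = rbf_kernel(x, bandwidth)
    # phi(x_i) = 1/n * sum_j [ k(x_j, x_i) grad log p(x_j) + grad_{x_j} k(x_j, x_i) ]
    phi = (K.T @ grad_log_p(x) + grad_K.sum(axis=0)) / n
    return x + step_size * phi

# Toy target (assumption): standard normal, so grad log p(x) = -x.
grad_log_p = lambda x: -x
rng = np.random.default_rng(0)
particles = rng.normal(loc=3.0, scale=0.5, size=(50, 1))
for _ in range(1000):
    particles = svgd_step(particles, grad_log_p, step_size=0.1)
```

Both `K` and `grad_K` hold one entry per pair of particles, so the time and memory of each update scale with the square of the number of particles. The first term in `phi` drives particles towards high-density regions, while the second term repels them from each other.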

## 8 Summary

We proposed a new generic inference method based on optimization and ergodic deterministic transformations. This work provides the foundation of ergodic inference, including the fundamental ergodic inference principle, a tractable estimation of the ergodic loss and its gradient, and a generic construction of the approximation family.

## References

• Billingsley (1986) Billingsley, P. Probability and Measure. John Wiley and Sons, third edition, 1986.
• Han et al. (2017) Han, T., Lu, Y., Zhu, S.-C., and Wu, Y. N. Alternating Back-Propagation for Generator Network. In AAAI, volume 3, pp. 13, 2017.
• Hinton (2002) Hinton, G. E. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.
• Hoffman (2017) Hoffman, M. D. Learning Deep Latent Gaussian Models with Markov Chain Monte Carlo. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 1510–1519. PMLR, 2017.
• Leimkuhler & Reich (2004) Leimkuhler, B. and Reich, S. Simulating Hamiltonian Dynamics, volume 14. Cambridge University Press, 2004.
• Liu (2017) Liu, Q. Stein variational gradient descent as gradient flow. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 3115–3123. Curran Associates, Inc., 2017.
• Neal (2010) Neal, R. M. MCMC using Hamiltonian Dynamics. 2010.
• Pasarica & Gelman (2010) Pasarica, C. and Gelman, A. Adaptively scaling the Metropolis algorithm using expected squared jumped distance. Statistica Sinica, pp. 343–364, 2010.
• Rezende & Mohamed (2015) Rezende, D. J. and Mohamed, S. Variational Inference with Normalizing Flows. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML’15, pp. 1530–1538, 2015.
• Robert & Casella (2005) Robert, C. P. and Casella, G. Monte Carlo Statistical Methods (Springer Texts in Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2005. ISBN 0387212396.
• Salimans et al. (2015) Salimans, T., Kingma, D., and Welling, M. Markov Chain Monte Carlo and Variational Inference: Bridging the Gap. In International Conference on Machine Learning, pp. 1218–1226, 2015.
• Zhang et al. (2018) Zhang, Y., Hernández-Lobato, J. M., and Ghahramani, Z. Ergodic measure preserving flows. CoRR, abs/1805.10377, 2018. URL http://arxiv.org/abs/1805.10377.