# An Introduction to Markov Chain Monte Carlo on Finite State Spaces

We elaborate the idea behind Markov chain Monte Carlo (MCMC) methods in a mathematically comprehensive way. Our focus is on simplicity. We give an elementary proof of the Perron-Frobenius theorem and of a convergence theorem for Markov chains. Subsequently we briefly discuss the well-known Gibbs sampler and the Metropolis-Hastings algorithm. Only basic knowledge of matrix multiplication, convergence of real sequences and probability theory is required.


## 1 Introduction

By sampling we mean the task of generating random numbers according to a given distribution. The distributions arising in practice are often complicated or not available in closed form. Therefore, direct and exact sampling approaches may not be available. In this case, we can rely on approximate sampling methods like Markov chain Monte Carlo (MCMC).

Let $\pi$ be a distribution over a finite state space $S$, i.e. $\pi_s \ge 0$ for all $s \in S$ and $\sum_{s \in S} \pi_s = 1$. Independent samples from $\pi$ (so-called i.i.d. samples) can be used universally to approximate expectations w.r.t. $\pi$. Let $z_1, \dots, z_m$ be a set of such samples and $f : S \to \mathbb{R}$, then

$$\frac{1}{m}\sum_{i=1}^{m} f(z_i) \approx \sum_{s \in S} \pi_s f(s) \tag{1}$$

The strong law of large numbers even justifies that with increasing $m$ the l.h.s. of Equation (1) converges to the r.h.s. with probability one.

A simple example for $f$ is the indicator function $\mathbb{1}\{s \in A\}$ of an event $A \subseteq S$. It is one if the condition in the brackets is true and zero otherwise. Its expectation yields the probability of $A$:

$$\sum_{s \in S} \pi_s \mathbb{1}\{s \in A\} = \sum_{s \in A} \pi_s$$

An approach to obtain i.i.d. samples from $\pi$ is the inversion method. For this, we apply an arbitrary ordering to the state space such that $S = \{s_1, \dots, s_n\}$. Thereafter we generate a uniform random number $u$ over $[0, 1)$ and obtain the sample $s_k$ with $\sum_{i < k} \pi_{s_i} \le u < \sum_{i \le k} \pi_{s_i}$. Even though this is guaranteed to work for any choice of $\pi$, its computation is often not feasible, for instance when $S$ is too large to enumerate, and this is where approximate schemes like MCMC come into play.
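The inversion method described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the distribution is given as a list of probabilities over an explicitly ordered list of states, and the function name `inversion_sample` and the example distribution are hypothetical choices, not part of the original text.

```python
import random
from itertools import accumulate


def inversion_sample(pi, states, rng=random):
    """Draw one sample from `pi` over the ordered list `states` by inversion.

    `pi[k]` is the probability of `states[k]`; the ordering is arbitrary.
    """
    u = rng.random()  # uniform random number in [0, 1)
    # Return the first state whose cumulative probability exceeds u.
    for state, cumulative in zip(states, accumulate(pi)):
        if u < cumulative:
            return state
    return states[-1]  # only reached through floating-point rounding


if __name__ == "__main__":
    # Hypothetical example distribution over three states.
    states = ["a", "b", "c"]
    pi = [0.2, 0.5, 0.3]
    samples = [inversion_sample(pi, states) for _ in range(10_000)]
    # Monte Carlo estimate of the probability of the event A = {"b"}, cf. Equation (1).
    print(samples.count("b") / len(samples))
```

The loop over the cumulative sums is exactly what becomes infeasible when the state space is too large to enumerate.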

A Markov chain on a finite state space $S$ is defined through an initial state $z_0 \in S$ and a transition kernel $P$. $P$ is a non-negative function over $S \times S$ such that $\sum_{s \in S} P_{zs} = 1$ for all $z \in S$. Thus, $P_{z\cdot}$ can be understood as a distribution w.r.t. $s$, but conditioned on $z$.

The Markov chain starts in state $z_0$ and then evolves according to $P$ in an iterative fashion. The distribution of the first link $z_1$ in the chain is $P_{z_0 s}$ w.r.t. $s$, and given the first link the distribution of the second link is $P_{z_1 s}$ and so forth. This can potentially be continued indefinitely. If we consider $P$ as a square matrix (a stochastic matrix), the distribution of the $n$-th link is nothing more than the row with index $z_0$ of the $n$-fold matrix product of $P$, i.e. $(P^n)_{z_0 s}$.
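As a small illustration of the matrix view, the following sketch computes the distribution of the $n$-th link as a row of the $n$-fold matrix product. The two-state kernel and the helper names `mat_mul`, `mat_pow` and `link_distribution` are assumptions made for this example only.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(P, n):
    """Return the n-fold product of P with itself (the identity for n = 0)."""
    size = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, P)
    return result


def link_distribution(P, z0, n):
    """Distribution of the n-th link of a chain started in state z0."""
    return mat_pow(P, n)[z0]


if __name__ == "__main__":
    # Hypothetical stochastic matrix on the state space {0, 1}.
    P = [[0.9, 0.1],
         [0.5, 0.5]]
    for n in (1, 2, 10):
        print(n, link_distribution(P, 0, n))
```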

We also refer to Markov chains by a sequence of random variables $X_0, X_1, X_2, \dots$, whereby the distribution of $X_i$ given $X_{i-1}$ is determined by $P$. We can set $X_0 = z_0$; however, $X_0$ may also follow a non-deterministic distribution. Probabilities w.r.t. random variables are expressed through the distribution $\mathbb{P}$, e.g.

$$\mathbb{P}(X_i \in A \mid X_{i-1} = z) = \sum_{s \in A} P_{zs}$$

The foundation of MCMC sampling is that under some circumstances Markov chains converge towards their unique invariant distribution $\pi$ regardless of their initial state. Thus, by simulating such a chain for sufficiently many steps we obtain an approximate sample of $\pi$. MCMC methods provide schemes to build Markov chains with a predefined unique invariant distribution $\pi$.

A tremendous number of scientific articles and books about MCMC are available. We recommend Bishop and Mitchell (2014) for a vivid introduction without mathematical proofs.

The Gibbs sampler is one of the earliest MCMC methods. It builds a Markov chain by decomposing $\pi$ into simpler conditional versions. This facilitates sampling of complex joint distributions, but is somewhat restricted in its ability to explore $S$. However, this strategy is employed intensively in more sophisticated MCMC algorithms as well.

The well-known Metropolis-Hastings algorithm is capable of incorporating user-defined proposal distributions. They enable the exploration of the state space in any desired fashion. That way, the Metropolis-Hastings algorithm even allows us to explore only parts of the state space accurately w.r.t. $\pi$, which enables sampling of conditional versions of $\pi$.

### 1.1 Convergence of Markov chains over finite state spaces

Here we consider the line-by-line convergence of certain finite Markov chains and try to convey a correct mathematical foundation for MCMC sampling. In order to understand this section, only very basic knowledge is required.

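Let $S$ be a finite state space and $P$ be the transition kernel of a Markov chain $(X_n)_{n \ge 0}$ that starts with an arbitrary but fixed $z \in S$, i.e. $X_0 = z$. For convenience, but without loss of generality, in this section we set $S = \{1, \dots, |S|\}$.

In the following we consider $P$ as a matrix over $S \times S$ and any distribution over $S$ as a row vector in $\mathbb{R}^{|S|}$. $P$ and therewith also $(X_n)_{n \ge 0}$ are called irreducible if for every $z, s \in S$ there exists an $n \in \mathbb{N}$ such that $(P^n)_{zs} > 0$. This means that, regardless of the state we are currently in, every state can be reached at some point with positive probability. We say that a distribution $\pi$ over $S$ is an invariant distribution of $P$ if $\pi P = \pi$ or equivalently $\sum_{z \in S} \pi_z P_{zs} = \pi_s$ for all $s \in S$. Thus, invariance states that transitioning according to $P$ doesn't affect the law $\pi$: if $X_n$ is distributed according to $\pi$, then so is $X_{n+1}$.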
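Both notions can be checked mechanically for a small kernel. The sketch below tests irreducibility via reachability along positive entries of $P$ (which, for a stochastic matrix, agrees with the definition through powers of $P$) and invariance via $\pi P = \pi$. The function names, the tolerance and the example matrices are assumptions for illustration.

```python
def is_irreducible(P):
    """Check that every state can be reached from every state.

    Reachability is computed by a depth-first search along entries P[z][s] > 0.
    """
    n = len(P)
    for start in range(n):
        seen = {start}
        stack = [start]
        while stack:
            z = stack.pop()
            for s in range(n):
                if P[z][s] > 0 and s not in seen:
                    seen.add(s)
                    stack.append(s)
        if len(seen) < n:
            return False
    return True


def is_invariant(pi, P, tol=1e-12):
    """Check pi P = pi up to the floating-point tolerance `tol`."""
    n = len(P)
    return all(abs(sum(pi[z] * P[z][s] for z in range(n)) - pi[s]) <= tol
               for s in range(n))


if __name__ == "__main__":
    # Hypothetical two-state kernel with invariant distribution (5/6, 1/6).
    P = [[0.9, 0.1],
         [0.5, 0.5]]
    print(is_irreducible(P), is_invariant([5 / 6, 1 / 6], P))
    # State 0 is absorbing here, so state 1 cannot be reached from it.
    print(is_irreducible([[1.0, 0.0], [0.5, 0.5]]))
```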

The following theorem (Frobenius et al., 1912) has been known for over a hundred years now. It is usually stated in a more general context and its proof appears to be fairly complicated. Here we give a simplified version of this theorem and provide an easy proof.

###### Lemma 1 (Perron-Frobenius Theorem).

An irreducible stochastic matrix $P$ has a unique invariant distribution $\pi$.

###### Proof.

Since any stochastic matrix has a right eigenvector with corresponding eigenvalue 1 (the vector of all ones), it also has such a left eigenvector. If this eigenvector has only non-negative or only non-positive entries, we can immediately derive $\pi$ from it by normalizing.

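Let us assume that we have a left eigenvector $x$ with negative entries $x_i$ for $i \in N$ and non-negative entries $x_i$ for $i \in \bar{N}$, where $N$ and $\bar{N} = S \setminus N$ are both non-empty. The following applies

$$\begin{aligned}
&\sum_{i \in N} x_i P_{ij} + \sum_{i \in \bar{N}} x_i P_{ij} = x_j \quad \text{for all } j \in S \\
\Rightarrow\ &\sum_{i \in N} x_i \sum_{j \in \bar{N}} P_{ij} + \sum_{i \in \bar{N}} x_i \sum_{j \in \bar{N}} P_{ij} = \sum_{i \in \bar{N}} x_i \\
\Leftrightarrow\ &\sum_{i \in N} x_i \sum_{j \in \bar{N}} P_{ij} - \sum_{i \in \bar{N}} x_i \Bigl(1 - \sum_{j \in \bar{N}} P_{ij}\Bigr) = \underbrace{\sum_{i \in N} x_i \sum_{j \in \bar{N}} P_{ij}}_{\le 0} \; \underbrace{- \sum_{i \in \bar{N}} x_i \sum_{j \in N} P_{ij}}_{\le 0} = 0
\end{aligned}$$

Both terms are non-positive and sum to zero, so each of them vanishes. This shows that all $x_i P_{ij}$ and $x_j P_{ji}$ are zero for $i \in N$ and $j \in \bar{N}$. Since $x_i < 0$ for $i \in N$, we get $P_{ij} = 0$ for all $i \in N$ and $j \in \bar{N}$. Thus, no state in $\bar{N}$ can be reached from $N$, which means that the Markov chain is not irreducible. Since negative and positive entries imply reducibility, we conclude that irreducibility implies that any 1-eigenvector has either non-positive or non-negative entries.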

Let us assume that there is another invariant distribution $\pi' \ne \pi$. In order to be a stochastic vector, not all components of $\pi'$ can be either larger or smaller than the components of $\pi$. Thus, $\pi - \pi'$ must have positive and negative entries. However, $\pi - \pi'$ is a left eigenvector (with eigenvalue 1) of $P$ and thus $P$ can't be irreducible, which contradicts the existence of $\pi'$. ∎

An irreducible stochastic matrix $P$ is called aperiodic if there exists an $N \in \mathbb{N}$ such that $P^n$ has solely positive entries for all $n \ge N$. This means that, regardless of the state we start in, after sufficiently many steps every state can be reached in exactly $n$ steps for every $n \ge N$.

The following convergence theorem forms the foundation for all MCMC sampling methods on finite spaces. The generalization to non-finite spaces closely resembles this approach, but is mathematically far more complex. The proof given here is a simplification of a proof that can be found in Koenig (2005).

###### Lemma 2.

If $P$ is irreducible and aperiodic with invariant distribution $\pi$, then $\lim_{n \to \infty} (P^n)_{zs} = \pi_s$ for all $z, s \in S$.

###### Proof.

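We consider a second Markov chain $(Z_n)_{n \ge 0}$, independent of $(X_n)_{n \ge 0}$, with transition kernel $P$ and initial distribution $\pi$, i.e. $\mathbb{P}(Z_0 = s) = \pi_s$ for all $s \in S$. By invariance, $\mathbb{P}(Z_n = s) = \pi_s$ for all $n$. Let $T$ be the first $n$ where $X_n = Z_n$, and let $Y_n = X_n$ for $n < T$ and $Y_n = Z_n$ for $n \ge T$. Consider an arbitrary path $i_{0:n} = (i_0, \dots, i_n)$ and let $i_0 = z$. We define $p_{i_{a:b}} = \prod_{j=a+1}^{b} P_{i_{j-1} i_j}$, write $Z_{0:n} \ne i_{0:n}$ if $Z_j \ne i_j$ for all $0 \le j \le n$, and observe that

$$\begin{aligned}
\mathbb{P}(Y_{0:n} = i_{0:n}) &= \mathbb{P}(Y_{0:n} = i_{0:n}, T > n) + \sum_{k=0}^{n} \mathbb{P}(Y_{0:n} = i_{0:n}, T = k) \\
&= p_{i_{0:n}} \mathbb{P}(Z_{0:n} \ne i_{0:n}) + \sum_{k=0}^{n} p_{i_{0:k}} \mathbb{P}(Z_k = i_k, Z_{0:k-1} \ne i_{0:k-1})\, p_{i_{k:n}} = p_{i_{0:n}}
\end{aligned}$$

The last equality holds because $p_{i_{0:k}}\, p_{i_{k:n}} = p_{i_{0:n}}$ and the remaining probabilities sum to one. This shows that $(Y_n)_{n \ge 0}$ is a Markov chain with transition kernel $P$ that starts in $z$.

Choose $N$ in such a way that $P^N$ has solely positive entries. Let $\varepsilon = \min_{z, s \in S} (P^N)_{zs} > 0$. Since $\mathbb{P}(X_{kN} = Z_{kN} \mid X_{(k-1)N} = a, Z_{(k-1)N} = b) = \sum_{s \in S} (P^N)_{as} (P^N)_{bs} \ge \varepsilon$ for all $a, b \in S$, we conclude

$$\mathbb{P}(T > \ell N) \le \mathbb{P}(X_{kN} \ne Z_{kN} \text{ for all } 0 < k \le \ell) \le (1 - \varepsilon)^{\ell} \xrightarrow{\ell \to \infty} 0$$

This shows that both chains will meet at the same state at some point with probability one.

Since $\mathbb{P}(Y_n = s) = (P^n)_{zs}$ and $\mathbb{P}(Z_n = s) = \pi_s$ we get $|(P^n)_{zs} - \pi_s| \le \mathbb{P}(Y_n \ne Z_n) \le \mathbb{P}(T > n) \to 0$ for all $z, s \in S$. ∎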

Given a distribution $\pi$, MCMC methods usually seek an irreducible and aperiodic transition kernel $P$ with invariant distribution $\pi$. Thus, it is possible to sample approximately from $\pi$ by simulating the corresponding Markov chain for sufficiently many steps. The last sample within this chain is then considered as a single approximate sample from $\pi$. In particular, this procedure is considered to be independent of the state it is started in.

In this section, we have developed a minimal set of lemmas and proofs in a minimal setting (finite state spaces) in order to demonstrate the basic idea behind MCMC. We have proven that the repeated multiplication of an irreducible and aperiodic transition matrix with itself converges line by line to its unique invariant distribution.
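The line-by-line convergence can be observed numerically. The following sketch measures the largest deviation between the rows of $P^n$ and $\pi$; the example kernels and the helper names are assumptions for illustration. The second kernel is irreducible but periodic, so the deviation does not shrink.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def max_row_deviation(P, pi, n):
    """Largest |(P^n)_{zs} - pi_s| over all states z and s, for n >= 1."""
    Pn = P
    for _ in range(n - 1):
        Pn = mat_mul(Pn, P)
    size = len(P)
    return max(abs(Pn[z][s] - pi[s]) for z in range(size) for s in range(size))


if __name__ == "__main__":
    # Hypothetical irreducible and aperiodic kernel with invariant (5/6, 1/6).
    P = [[0.9, 0.1],
         [0.5, 0.5]]
    pi = [5 / 6, 1 / 6]
    for n in (1, 5, 20, 50):
        print(n, max_row_deviation(P, pi, n))

    # Irreducible but periodic kernel: the rows of P^n keep alternating.
    flip = [[0.0, 1.0],
            [1.0, 0.0]]
    print(max_row_deviation(flip, [0.5, 0.5], 51))
```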

### 1.2 The Gibbs Sampler

The Gibbs sampler (Geman and Geman, 1984) is an early MCMC sampling algorithm which is based on a decomposition of the objective distribution into conditional versions. It is mainly used to sample from the joint distribution of a set of random variables. Each step involves sampling a single random variable given the values of the remaining random variables in the last sample.

Now we assume that $S = E^d$, with a finite space $E$. Our goal is to sample from a distribution $\pi$ over $S$. We draw samples in a stepwise manner. In step $i$, given the sample $z$ from the last step $i - 1$, we choose $j \in \{1, \dots, d\}$ and sample from the transition kernel $\tilde{\pi}^j$ which is defined through

$$(\tilde{\pi}_z^j)_s = \frac{\mathbb{1}\{z_\ell = s_\ell \text{ for } \ell \ne j\}\, \pi_s}{\sum_{s' \in S \text{ with } s'_\ell = z_\ell \text{ for } \ell \ne j} \pi_{s'}}$$

for all $z, s \in S$. $\tilde{\pi}_z^j$ is the conditional version of $\pi$ conditioned on $z_\ell$ for $\ell \ne j$.

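This describes a finite Markov chain that, at each step, only manipulates one single component of the previous state. $\pi$ is an invariant distribution of $\tilde{\pi}^j$ since for $s \in S$

$$\sum_{z \in S} \pi_z (\tilde{\pi}_z^j)_s = \sum_{z \in S} \pi_s (\tilde{\pi}_s^j)_z = \pi_s \underbrace{\sum_{z \in S} (\tilde{\pi}_s^j)_z}_{=1} = \pi_s$$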

A famous and quite old application of the Gibbs sampler is the Ising model (Ising, 1925). There, $S$ consists of the positive or negative values of the grid points of a finite grid, whereby conditional independence is induced by spatial separation. This yields very simple sampling steps, each conducted on a single grid point given all the others, but effectively depending only on its neighboring grid points. Higdon (1998) provides a very vivid and more sophisticated treatment of the Ising model.

However, the irreducibility and aperiodicity of the corresponding Markov chain relies on $\pi$. For example, assume that $S = \{0, 1\}^2$ and $\pi_{(0,0)} = \pi_{(1,1)} = \frac{1}{2}$. In this case we can never get from $(0,0)$ to $(1,1)$ and thus, the chain is not irreducible. A case where irreducibility and aperiodicity are guaranteed to be met is when $\pi$ is positive. This example reveals the disadvantage of the Gibbs sampler: we are not free in traversing the state space, which might impair or even prevent convergence.
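A minimal sketch of such a Gibbs sampler on a product space is given below. It assumes that $\pi$ is stored as a dictionary from state tuples to probabilities, that the component to update is chosen uniformly at random in every step (one possible choice, not prescribed by the text), and that the chain starts in a state with positive probability. The names `gibbs_step` and `gibbs_chain` and the example distributions are hypothetical.

```python
import random


def gibbs_step(pi, z, j, values, rng=random):
    """Resample component j of state z from pi conditioned on the other components.

    `pi` maps state tuples to probabilities; `values` lists the possible
    values of a single component.
    """
    candidates = [z[:j] + (v,) + z[j + 1:] for v in values]
    weights = [pi.get(s, 0.0) for s in candidates]
    # random.choices normalizes the weights, which yields the conditional distribution.
    return rng.choices(candidates, weights=weights)[0]


def gibbs_chain(pi, z0, values, n_steps, rng=random):
    """Simulate n_steps Gibbs updates, picking the component uniformly at random."""
    chain = [z0]
    for _ in range(n_steps):
        j = rng.randrange(len(z0))
        chain.append(gibbs_step(pi, chain[-1], j, values, rng))
    return chain


if __name__ == "__main__":
    values = (0, 1)
    # Positive distribution on {0, 1}^2: the chain explores the whole space.
    positive = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
    chain = gibbs_chain(positive, (0, 0), values, 50_000)
    print(chain.count((0, 0)) / len(chain))

    # All mass on (0, 0) and (1, 1): the chain can never leave its start state.
    blocked = {(0, 0): 0.5, (1, 1): 0.5}
    print(set(gibbs_chain(blocked, (0, 0), values, 1_000)))
```

In the second example every single-component change leads to a state of probability zero, which is why the chain stays in $(0,0)$ forever.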

### 1.3 The detailed balance condition

Now we introduce a sufficient condition for a Markov chain to have a given distribution as an invariant distribution. It is called the detailed balance condition and greatly facilitates the invariance proofs for MCMC algorithms.

###### Definition 1.

The transition kernel $P$ satisfies the detailed balance condition w.r.t. $\pi$ if

$$\pi_s P_{sz} = \pi_z P_{zs}$$

for all $s, z \in S$.

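If $P$ satisfies the detailed balance condition w.r.t. $\pi$, then $\pi$ is an invariant distribution of $P$ since $\sum_{s \in S} \pi_s P_{sz} = \sum_{s \in S} \pi_z P_{zs} = \pi_z$ for all $z \in S$. The opposite implication does not generally hold.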
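The relation between detailed balance and invariance can be checked numerically on small examples. In the sketch below the function names, the tolerance and the two kernels are assumptions for illustration. The cyclic three-state kernel leaves the uniform distribution invariant without satisfying detailed balance, which illustrates that the opposite implication fails.

```python
def satisfies_detailed_balance(pi, P, tol=1e-12):
    """Check pi_s P_sz = pi_z P_zs for all pairs of states, up to `tol`."""
    n = len(pi)
    return all(abs(pi[s] * P[s][z] - pi[z] * P[z][s]) <= tol
               for s in range(n) for z in range(n))


def is_invariant(pi, P, tol=1e-12):
    """Check pi P = pi up to the floating-point tolerance `tol`."""
    n = len(pi)
    return all(abs(sum(pi[s] * P[s][z] for s in range(n)) - pi[z]) <= tol
               for z in range(n))


if __name__ == "__main__":
    # Hypothetical kernel that satisfies detailed balance w.r.t. (5/6, 1/6).
    P = [[0.9, 0.1],
         [0.5, 0.5]]
    pi = [5 / 6, 1 / 6]
    print(satisfies_detailed_balance(pi, P), is_invariant(pi, P))

    # Deterministic cycle 0 -> 1 -> 2 -> 0 with the uniform distribution.
    cycle = [[0.0, 1.0, 0.0],
             [0.0, 0.0, 1.0],
             [1.0, 0.0, 0.0]]
    uniform = [1 / 3, 1 / 3, 1 / 3]
    print(satisfies_detailed_balance(uniform, cycle), is_invariant(uniform, cycle))
```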

### 1.4 The Metropolis-Hastings algorithm

In the following we describe the well-known Metropolis-Hastings algorithm. It is an MCMC sampler that traverses the state space by means of a user-defined proposal. Characteristic for this sampler is that each proposed value undergoes an accept-reject step which decides whether the proposed value or the previous sample is chosen to be the next sample. This acceptance step alone secures the detailed balance of this Markov chain and thus gives the user great freedom in designing proposals. A first version was published in Metropolis et al. (1953) and then extended in Hastings (1970).

The Metropolis-Hastings algorithm requires the user to provide a transition kernel $P$ which is referred to as the proposal. In step $i$, given a sample $s$, we propose a new sample $z$ according to $P_{s\cdot}$ and accept it with probability

$$a_{sz} = \min\left\{1, \frac{\pi_z P_{zs}}{\pi_s P_{sz}}\right\} \tag{2}$$

whereby we agree that dividing by zero yields $a_{sz} = 1$. The new sample is then either $z$ if we have accepted or $s$ if not.

This describes a Markov chain with invariant distribution $\pi$, which can be shown by checking the detailed balance condition for $s \ne z$ with

$$\pi_s a_{sz} P_{sz} = \min\{\pi_s P_{sz}, \pi_z P_{zs}\} = \pi_z a_{zs} P_{zs}$$

There is nothing to show if $s = z$.

The irreducibility and aperiodicity of this Markov chain have to be ensured by the acceptance probability together with the proposal. Equation (2) shows that we can only perform a transition from $s$ to $z$ if the corresponding backward transition is also feasible, more precisely, if $\pi_z P_{zs}$ is positive. In the most extreme case this could mean that we apply an irreducible and aperiodic proposal, but the resulting Metropolis-Hastings kernel will never move away from the initial state.
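The algorithm can be sketched as follows for states $0, \dots, n-1$. This is an illustration under stated assumptions, not a reference implementation: the proposal is a stochastic matrix given as a list of rows, a zero denominator in Equation (2) is treated as acceptance probability one (an assumed convention), and the names `acceptance`, `mh_step` and `mh_kernel` as well as the example target are hypothetical.

```python
import random


def acceptance(pi, proposal, s, z):
    """Acceptance probability of Equation (2) for a move from s to z."""
    denominator = pi[s] * proposal[s][z]
    if denominator == 0:
        return 1.0  # assumed convention for division by zero
    return min(1.0, pi[z] * proposal[z][s] / denominator)


def mh_step(pi, proposal, s, rng=random):
    """Propose a state from row s of the proposal, then accept or stay in s."""
    states = list(range(len(pi)))
    z = rng.choices(states, weights=proposal[s])[0]
    if rng.random() < acceptance(pi, proposal, s, z):
        return z
    return s


def mh_kernel(pi, proposal):
    """Transition matrix of the resulting Metropolis-Hastings chain."""
    n = len(pi)
    Q = [[0.0] * n for _ in range(n)]
    for s in range(n):
        for z in range(n):
            if z != s:
                Q[s][z] = acceptance(pi, proposal, s, z) * proposal[s][z]
        Q[s][s] = 1.0 - sum(Q[s])  # rejected proposals keep the chain in s
    return Q


if __name__ == "__main__":
    # Hypothetical target distribution and uniform proposal on three states.
    pi = [0.1, 0.2, 0.7]
    proposal = [[1 / 3] * 3 for _ in range(3)]
    state, visits = 0, [0, 0, 0]
    for _ in range(50_000):
        state = mh_step(pi, proposal, state)
        visits[state] += 1
    print([v / sum(visits) for v in visits])
```

The matrix returned by `mh_kernel` can be used to verify the detailed balance condition entry by entry for a concrete target and proposal.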

## References

• Bishop and Mitchell (2014) Bishop, C. M. and T. M. Mitchell (2014).
• Frobenius et al. (1912) Frobenius, G. (1912). Über Matrizen aus nicht negativen Elementen.
• Geman and Geman (1984) Geman, S. and D. Geman (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence.
• Hastings (1970) Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika.
• Higdon (1998) Higdon, D. M. (1998). Auxiliary variable methods for Markov chain Monte Carlo with applications. Journal of the American Statistical Association.
• Ising (1925) Ising, E. (1925). Contribution to the Theory of Ferromagnetism. Z. Phys.
• Koenig (2005) Koenig, W. (2005). Stochastische Prozesse I: Markovketten in diskreter und stetiger Zeit.
• Metropolis et al. (1953) Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics.