# Data Reduction in Markov Models Using the EM Algorithm

This paper describes a data reduction technique for a Markov chain of specified order. Instead of observing all the transitions in a Markov chain we record only a few of them and treat the remaining part as missing. The decision about which transitions to record is taken before the observation process starts. Based on the filtered chain we estimate the parameters of the Markov model using the EM algorithm. In the first half of the paper we characterize a class of filtering mechanisms for which all the parameters remain identifiable. In the second half we explain methods of estimation and testing of the transition probabilities of the Markov chain based on the filtered data. The methods are first developed assuming a simple Markov model in which every transition probability is positive, and are then generalized to models with structural zeroes in the transition probability matrix. A further extension covers multiple Markov chains. The performance of the proposed estimation method is studied using simulated data as well as a real-life dataset.


## 1. Introduction

Applications of statistical models often encounter datasets which are too large to store or to interpret meaningfully. This motivates the need for a suitable data filtering mechanism which stores only a subset of the available information, such that reasonable inferences about the parameters can be drawn from the observed data. This paper describes such a filtering mechanism for discrete Markov models. Discrete Markov chains are the simplest dependent structures one can think of and are very useful for modeling a wide range of scientific problems. Some important applications include the modeling of dry and wet spells (P. J. Avery and D. A. Henderson (1999)), deoxyribonucleic acid (DNA) sequences (P. J. Avery and D. A. Henderson (1999)), and the study of chronic diseases (B. A. Craig and A. A. Sendi (2002)).

Any stochastic process $\{X_n\}$ having a finite set as its state space is said to be a Markov process of order $s$ if

$$P(X_n = a_n \mid X_{n-1} = a_{n-1}, X_{n-2} = a_{n-2}, \cdots, X_1 = a_1) = P(X_n = a_n \mid X_{n-1} = a_{n-1}, X_{n-2} = a_{n-2}, \cdots, X_{n-s} = a_{n-s}).$$

For notational convenience let us denote the state space as $S = \{1, 2, \cdots, k\}$. Further we assume that the Markov process has stationary transition probabilities, which means

$$P(X_n = a_n \mid X_{n-1} = a_{n-1}, \cdots, X_{n-s} = a_{n-s}) = p_{a_{n-s}, \cdots, a_{n-1} : a_n}$$

does not depend on $n$. For $s = 1$ we have a simple Markov chain with finite state space. Any Markov chain of order $s$ can be treated as a simple Markov chain with suitable parameters, so in this paper we develop the methods assuming a simple Markov chain; they are equally applicable to any Markov process of higher order. A Markov chain can be completely described by its initial state and the set of transition probabilities. Here we shall consider the initial state of the Markov chain to be known and make inferences about the transition probabilities based on the observed data. More specifically, inferences regarding the transition probability matrix can help us answer many of the specific questions about the Markov process which we usually encounter.
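As a concrete illustration (not taken from the paper; the transition matrix and function names are our own), a simple Markov chain with known transition probabilities can be simulated as follows:

```python
import random

def simulate_chain(P, start, n_steps, seed=0):
    """Simulate n_steps transitions of a simple Markov chain.

    P      : k x k list of lists of transition probabilities (rows sum to 1)
    start  : initial state, coded 0..k-1
    returns the realized sequence of states, length n_steps + 1
    """
    rng = random.Random(seed)
    chain = [start]
    for _ in range(n_steps):
        u = rng.random()
        cum, state = 0.0, chain[-1]
        for j, p in enumerate(P[state]):
            cum += p
            if u < cum:
                chain.append(j)
                break
        else:  # guard against floating-point rounding in the row sum
            chain.append(len(P[state]) - 1)
    return chain

P = [[0.5, 0.3, 0.2],
     [0.1, 0.6, 0.3],
     [0.4, 0.4, 0.2]]
x = simulate_chain(P, start=0, n_steps=20)
```

Running the chain long enough (as assumed later in the paper) makes every transition count positive with high probability.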

There is an extensive literature on statistical inference for finite Markov chains based on complete data. Billingsley [4] gives a good account of the mathematical aspects of different techniques for inference about the transition probabilities, including Whittle's formula, maximum likelihood and chi-square methods. Estimation of transition probabilities and testing goodness of fit from a single realisation of a Markov chain have been studied by Bartlett [3]. Goodman and Anderson [1] derived estimates of the transition probabilities and tests of hypotheses about them when there is more than one realisation of a single Markov chain; their paper also described the asymptotic properties of the methods as the number of realisations increases. All these works assume the observed data to be one or more long, unbroken observations of the chain. In this paper we assume that there is a single realisation of the Markov chain which is not completely observed. The broken chain which results from the filtering mechanism is therefore not Markov.

Based on the filtering mechanism we observe only certain transitions of the Markov chain and treat the remaining part of the chain as missing. From the observed data we estimate the transition probabilities using the EM algorithm, a standard tool for maximum likelihood estimation in incomplete data problems. Since the missingness in the data occurs due to the filtering process, the data are not missing at random (NMAR); the missingness mechanism is nonignorable but known. Unlike conventional uses of the EM algorithm where missingness occurs naturally, here we introduce missingness deliberately to reduce the size of the observed data. The E step of the EM algorithm requires the conditional expectation of the missing data given the observed data. This is achieved by defining all possible missing paths for a transition of any order and finding their probabilities. The standard error of the EM estimate is obtained by the supplemented EM (SEM) algorithm (Meng and Rubin). Usually the standard error of the EM estimate is obtained by inverting the observed information matrix; in our case the observed likelihood cannot be obtained explicitly, and hence we avoid computing the observed information matrix. SEM is a technique for calculating the asymptotic dispersion matrix of the EM estimate without inverting the observed information matrix.

Section 2 describes the setup of the problem. Section 3 deals with the identifiability issues of the parameters that arise due to filtering of the data. In sections 2 and 3 we assume that the transition probability matrix consists of all positive elements. This assumption is relaxed in section 4, where we allow some structural zeroes in the transition probability matrix and describe the additional modifications needed in the filtering mechanism. Section 5 describes the methods of estimation and testing of the transition probabilities; in this section we also describe the estimation of standard errors of the estimates by the SEM algorithm. Section 6 describes the generalization of the above methods to multiple Markov chains. In section 7 we demonstrate the methods using simulated data. A real-life data analysis is presented in section 8. Section 9 is the appendix, which contains the proof of a major theorem of this paper.

## 2. Setup

Let $\{X_n\}$ be a simple Markov chain with finite state space $S = \{1, 2, \cdots, k\}$ and transition probability matrix $P = ((p_{ij}))$. Let us first assume that $p_{ij} > 0$ for all $i, j$. We shall relax this assumption later and consider the case where some $p_{ij}$'s are allowed to be zero. The transition probability matrix satisfies the standard condition

$$P\mathbf{1} = \mathbf{1}, \quad \text{i.e.} \quad \sum_j p_{ij} = 1 \;\; \forall i.$$

Hence we have $k(k-1)$ independent parameters. We define the vector of parameters

$$\theta = (p_{11}, p_{12}, \cdots, p_{1(k-1)}, p_{21}, p_{22}, \cdots, p_{2(k-1)}, \cdots, p_{k1}, p_{k2}, \cdots, p_{k(k-1)})' = (\theta_1, \theta_2, \cdots, \theta_d)'$$

where $d = k(k-1)$ and the parameter space is

$$\Theta = \Big\{\theta : \sum_{j=1}^{k-1} p_{ij} < 1 \;\text{ for } i = 1, \cdots, k\Big\} = \Big\{\theta : \sum_{j=1}^{k-1} \theta_{(i-1)(k-1)+j} < 1 \;\text{ for } i = 1, \cdots, k\Big\}.$$

Now suppose we consider a single realization of the chain, and let $n_{ij}$ be the number of transitions from state $i$ to state $j$ in this realization. We assume that the Markov process is continued sufficiently long so that the realization contains each transition at least once, that is, $n_{ij} \geq 1$ for all $i$ and $j$. The matrix of transition counts is

$$N = \begin{bmatrix} n_{11} & n_{12} & \cdots & n_{1k} \\ n_{21} & n_{22} & \cdots & n_{2k} \\ \vdots & & & \vdots \\ n_{(k-1)1} & n_{(k-1)2} & \cdots & n_{(k-1)k} \\ n_{k1} & n_{k2} & \cdots & n_{kk} \end{bmatrix}.$$

In this paper we propose a data acquisition protocol in which, instead of observing the entire realization, we record only some of the transitions and treat the remaining part of the chain as missing. The decision about which transitions to record is described in the form of a filter matrix $F = ((f_{ij}))$ which contains 0 and 1 as elements. In particular, we record the transition from state $i$ to state $j$ if and only if $f_{ij} = 1$. If $x$ is the complete chain, let $\phi_F(x)$ denote the chain filtered using $F$.

###### Example 1.

Consider a realization of a three-state Markov chain:

1 1 2 3 1 2 2 3 2 1 2 3 3 3 1 1 2 1 3 3 1

Suppose we are given a filter matrix

$$F = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 1 & 0 \end{bmatrix}.$$

Then the transitions we record in $\phi_F(x)$ are

$$(2.1) \qquad 1 \to 1, \quad 2 \to 2, \quad 3 \to 1, \quad 3 \to 2.$$

Then the filtered chain is

1 1 ∗ 3 1 2 2 3 2 ∗ ∗ ∗ ∗ 3 1 1 ∗ ∗ ∗ 3 1

In the filtered chain the missing states are indicated by the symbol "∗", which we call a "blank". The example shows that besides the transitions in (2.1) there may be some transitions which are indirectly recorded in the filtered chain (for example, $1 \to 2$ is recorded even though $f_{12} = 0$). A transition $i \to j$ may be recorded indirectly in the filtered chain if there exist states $a$ and $b$ such that $f_{ai} = 1$ and $f_{jb} = 1$.
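The filtering map described above can be sketched in code. This is our own minimal reading of $\phi_F$, with `None` playing the role of a blank: a state is kept exactly when it is an endpoint of at least one recorded transition.

```python
def filter_chain(x, F):
    """Apply filter matrix F to chain x: keep a state if it is an endpoint
    of at least one recorded transition, else replace it by None (blank)."""
    n = len(x)
    keep = [False] * n
    for t in range(n - 1):
        i, j = x[t], x[t + 1]
        if F[i - 1][j - 1] == 1:   # states are coded 1..k here
            keep[t] = keep[t + 1] = True
    return [s if kept else None for s, kept in zip(x, keep)]

# The chain and filter matrix of Example 1
x = [1, 1, 2, 3, 1, 2, 2, 3, 2, 1, 2, 3, 3, 3, 1, 1, 2, 1, 3, 3, 1]
F = [[1, 0, 0],
     [0, 1, 0],
     [1, 1, 0]]
y = filter_chain(x, F)
# y == [1, 1, None, 3, 1, 2, 2, 3, 2, None, None, None, None,
#       3, 1, 1, None, None, None, 3, 1]
```

Note how the indirectly recorded transitions (such as $1 \to 2$ at positions 5-6) appear in the output even though their filter entry is 0.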

Thus all the transitions in the filtered chain may be classified into one of three categories:

• directly recorded ($f_{ij} = 1$)

• indirectly recorded ($f_{ij} = 0$ but the transition occurs in the filtered chain, e.g. $1 \to 2$ in Example 1)

• unobserved ($f_{ij} = 0$ and the transition does not appear in the filtered chain, e.g. $1 \to 3$ in Example 1)

## 3. Identifiability

In this section we discuss the identifiability of the parameters based on the filtered chain. We note that our filtered chain no longer possesses the Markov property. If, while applying the filtering mechanism, we record only a very few transitions, then all the parameters of the Markov chain may not be identifiable. For example, if we record only one particular transition of the chain, then some of the parameters are not identifiable. We need to study how much data we can throw away so that the problem still remains identifiable. Thus our main aim in this section is to identify a class of filter matrices such that data generated by any filter matrix of that class retains the identifiability of the parameters. But first we define what is meant by identifiability of a parameter on the basis of a random sample.

###### Definition 2.

Let $x$ be a random sample from a distribution characterized by the parameter $\theta$, and let $L(\theta, x)$ be the likelihood. Then the parameter $\theta$ is said to be identifiable on the basis of $x$ if for any two distinct values $\theta_1$ and $\theta_2$ in the parameter space

$$L(\theta_1, x) \neq L(\theta_2, x).$$

Suppose $x$ is a random sample drawn from a population characterized by the parameter $\theta$, and let $y$ be a function of $x$. Given $x$ we can always construct $y$. So if $\theta$ is identifiable on the basis of $y$, we can identify $\theta$ from $x$ as well. On the contrary, if $\theta$ is unidentifiable on the basis of $x$, then it is also unidentifiable on the basis of $y$: if we assume $\theta$ to be identifiable on the basis of $y$, then given $x$ we can construct $y$ and identify $\theta$ from it, which is a contradiction. Thus in general we have the following two results:

###### Claim 3.

a) If $\theta$ is identifiable on the basis of $y$, then $\theta$ is also identifiable on the basis of $x$.

b) If $\theta$ is unidentifiable on the basis of $x$, then $\theta$ is also unidentifiable on the basis of $y$.

In the present situation, to prove that the parameters are identifiable it is enough to consider an observed sample $\phi_F(x)$ and prove that any two different values of the parameter yield different values of the observed likelihood $L(\theta, \phi_F(x))$.

Let $\mathcal{F}$ be the class of all filter matrices. We call a filter matrix $F$ identifiable if the parameter $\theta$ is identifiable on the basis of $\phi_F(x)$.

With this notation, the general fact stated in Claim 3 also applies to the data generated by the filter matrices.

###### Lemma 4.

For $F, G \in \mathcal{F}$, suppose $\phi_G(x) = h(\phi_F(x))$ for some function $h$. Then identifiability of $G$ implies identifiability of $F$, and unidentifiability of $F$ implies unidentifiability of $G$.

###### Example 5.

Let

$$H = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 1 & 0 \end{bmatrix} \qquad \text{and} \qquad M = \begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 0 \\ 1 & 1 & 0 \end{bmatrix}$$

$M$ is the same as $H$ except that with $M$ we directly observe one more transition than with $H$. Then $\phi_H(x)$ is a function of $\phi_M(x)$, and hence if $H$ is identifiable then $M$ is identifiable.

In general there are $2^{k^2}$ possible filter matrices in $\mathcal{F}$. Instead of searching over all possible filter matrices, we start with some definite structures of filter matrices which are identifiable. The above discussion motivates us to extend identifiability over a larger class of matrices; this requires some ordering of the filter matrices in terms of the data we store.

###### Definition 6.

For filter matrices $F = ((f_{ij}))$ and $D = ((d_{ij}))$ we say $F \succeq D$ if $f_{ij} \geq d_{ij}$ for all $i, j$, and $F \succ D$ if in addition $f_{ij} > d_{ij}$ for at least one pair $(i, j)$.

###### Lemma 7.

a) If $F \succeq D$ and $D$ is identifiable, then $F$ is identifiable.

b) If $F \succeq D$ and $F$ is unidentifiable, then $D$ is unidentifiable.

###### Proof.

a) $F \succeq D$ implies $\phi_D(x) = h(\phi_F(x))$ for some function $h$. Using Lemma 4, identifiability of $D$ implies identifiability of $F$.

b) $F \succeq D$ implies $\phi_D(x) = h(\phi_F(x))$ for some function $h$. Using Lemma 4, unidentifiability of $F$ implies unidentifiability of $D$. ∎

Thus if any filter matrix $D$ is identifiable, then all filter matrices which store more data than $D$ are also identifiable. This fact also holds for any subclass of filter matrices.

###### Definition 8.

If $\mathcal{D} \subseteq \mathcal{F}$, then the closure of $\mathcal{D}$ is defined as

$$\bar{\mathcal{D}} = \{F \in \mathcal{F} : F \succeq D \text{ for some } D \in \mathcal{D}\}.$$

###### Lemma 9.

If every filter matrix in $\mathcal{D}$ is identifiable, then every filter matrix in $\bar{\mathcal{D}}$ is identifiable.

###### Proof.

Let $F \in \bar{\mathcal{D}}$. Then $F \succeq D$ for some $D \in \mathcal{D}$. Since $D$ is identifiable, Lemma 7 implies that $F$ is identifiable. ∎

Thus given any class of identifiable filter matrices we can always extend it to a larger class of identifiable filter matrices.
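The dominance ordering and closure membership described above amount to elementwise comparisons; a minimal sketch (the helper names are ours):

```python
def dominates(F, D):
    """Return True if F stores at least as much data as D (F ⪰ D),
    i.e. f_ij >= d_ij for every entry."""
    return all(f >= d for rf, rd in zip(F, D) for f, d in zip(rf, rd))

def in_closure(F, Dclass):
    """Membership of F in the closure of a class: F ⪰ D for some D in the class."""
    return any(dominates(F, D) for D in Dclass)

# The matrices H and M of Example 5: M records one more transition than H
H = [[1, 0, 0], [0, 1, 0], [1, 1, 0]]
M = [[1, 1, 0], [0, 1, 0], [1, 1, 0]]
```

Here `dominates(M, H)` holds but not the reverse, so `M` lies in the closure of any class containing `H`.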

Our observed chain is a sequence of states and blanks. Given any observed chain, we want to find conditions under which the conditional probability of a given segment of the observed chain, given the initial state in the segment, is identifiable.

###### Definition 10.

For any finite sequence $\pi$ of states or blanks we define

$$S_\pi = \text{the set of all filtered segments in which } \pi \text{ occurs in consecutive positions}.$$

We note that if then .

###### Lemma 11.

For any filter matrix $F$, if $S_\pi \neq \emptyset$ then $P_\pi$ is identifiable, where $\pi$ is a sequence of states or blanks which starts and ends with states, and $P_\pi$ is the conditional probability of the sequence $\pi$ given the initial state in $\pi$.

###### Proof.

Let $\alpha$ denote the initial state of $\pi$, and let

$$A = \text{the set of subchains that end with } \alpha,$$
$$B = \text{the set of subchains that end with the sequence } \pi.$$

Any subchain in $B$ is obtained by extending a subchain in $A$ by the remainder of the sequence $\pi$. From the Markov property we get that $P(B) = P(A) \, P_\pi$. Thus if $P_\pi$ changes, $P(B)$ changes. Since the probability of a class of subchains changes, the joint distribution of the entire filtered chain must also change. Hence two distinct values of $P_\pi$ give two distinct values of the observed likelihood. Thus $P_\pi$ is identifiable. ∎

###### Corollary 12.

For any filter matrix $F$, the parameter $\theta$ is identifiable if every transition probability can be expressed in terms of the conditional probabilities $P_\pi$ of sequences $\pi$ with $S_\pi \neq \emptyset$.

As mentioned before, we want to start with filter matrices of definite structure which are identifiable and extend them to relatively larger classes. With this in mind, we define three classes of filter matrices, each of which is sufficient for a filter matrix to be identifiable.

Class 1:

We define the class $\mathcal{C}_1$, which consists of all filter matrices $F$ such that

1. there exists a row $i_0$ with $f_{i_0 j} = 0$ for all $j$, i.e. the $i_0$-th row of $F$ is zero;

2. there exists a column $j_0$ with $f_{i j_0} = 0$ for all $i$, i.e. the $j_0$-th column of $F$ is zero;

3. except the $i_0$-th row, every other row has exactly one element equal to 1;

4. except the $j_0$-th column, every other column has exactly one element equal to 1.
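Under our reading of the four conditions above, membership in Class 1 can be checked mechanically; the following sketch is an assumption-laden illustration, not the paper's definition verbatim:

```python
def in_class1(F):
    """Check the Class 1 conditions as read above: exactly one all-zero row i0,
    exactly one all-zero column j0, and every other row/column has exactly one 1."""
    k = len(F)
    row_sums = [sum(row) for row in F]
    col_sums = [sum(F[i][j] for i in range(k)) for j in range(k)]
    zero_rows = [i for i, s in enumerate(row_sums) if s == 0]
    zero_cols = [j for j, s in enumerate(col_sums) if s == 0]
    if len(zero_rows) != 1 or len(zero_cols) != 1:
        return False
    ok_rows = all(s == 1 for i, s in enumerate(row_sums) if i != zero_rows[0])
    ok_cols = all(s == 1 for j, s in enumerate(col_sums) if j != zero_cols[0])
    return ok_rows and ok_cols

# Row 1 is zero, column 3 is zero; every other row and column has a single 1
F = [[0, 0, 0],
     [1, 0, 0],
     [0, 1, 0]]
```

Such a filter records only $k-1$ of the $k^2$ possible transitions, which shows how much reduction these classes permit.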

Class 2:

We define the class $\mathcal{C}_2$, which consists of all filter matrices $F$ such that

1. there exist columns $j_1$ and $j_2$ such that the $j_1$-th and the $j_2$-th columns of $F$ are zero;

2. except the $j_1$-th and $j_2$-th columns, every other column has exactly one element equal to 1;

3. except the elements in the $j_1$-th and $j_2$-th columns, every element of the $j_1$-th and $j_2$-th rows is 1;

4. except the $j_1$-th and $j_2$-th rows, every other row has exactly one element equal to 1.

Class 3:

We define the class $\mathcal{C}_3$, which consists of all filter matrices $F$ such that

1. there exist rows $i_1$ and $i_2$ such that the $i_1$-th and the $i_2$-th rows of $F$ are zero;

2. except the $i_1$-th and $i_2$-th rows, every other row has exactly one element equal to 1;

3. except the elements in the $i_1$-th and $i_2$-th rows, every element of the $i_1$-th and $i_2$-th columns is 1;

4. except the $i_1$-th and $i_2$-th columns, every other column has exactly one element equal to 1.

The following theorem and its corollary provide sufficient conditions for filter matrices to be identifiable: any filter matrix which belongs to at least one of the three classes is identifiable. The proof of the theorem is given in the appendix.

###### Theorem 13.

Consider a simple Markov chain on a finite state space with transition probabilities $p_{ij}$, where $p_{ij} > 0$ for all $i, j$. Suppose $F$ is any filter matrix belonging to one of the classes Class 1, Class 2 and Class 3 defined above. Then $F$ must also belong to the class of identifiable filter matrices.

The following corollary to the above theorem is an immediate application of Lemma 9.

###### Corollary 14.

Every filter matrix in the closure of Class 1, Class 2 and Class 3 is identifiable.

Thus if we start with the definite structures of matrices in Class 1, Class 2 or Class 3, we get a relatively larger class of identifiable filter matrices. For the rest of the paper we shall work with filter matrices within this class. We shall find that any filter matrix in this class provides a considerable reduction in data.

## 4. Structural zeroes in the transition probability matrix

In the previous section, while obtaining the sufficient conditions for identifiability, we assumed $p_{ij} > 0$ for all $i, j$. This was a crucial assumption in developing the theory for the sufficient conditions. However, in many practical applications this assumption turns out to be too restrictive. For example, while modeling a disease status, the probability of an individual moving from one state to another may be zero (in case of chronic illness, the condition of an individual usually deteriorates). The case of structural zeroes in the transition probability matrix will also occur later in this paper when dealing with multiple Markov chains. In this section we generalize the sufficient conditions for a filter matrix to be identifiable even when some $p_{ij}$'s are zero.

We note that all zeroes (if any) in the transition probability matrix are structural zeroes; that is, we know the positions of the zeroes even before the collection of the data. Also, for any $i$, $p_{ij}$ must be positive for at least one $j$, since all the row sums of the transition probability matrix are 1. We further assume:

(A1) for any $j$, $p_{ij}$ is positive for at least one $i$.

This is a reasonable assumption to make, because if a state violates it, that state is never visited after the initial step and we shall simply ignore it in our analysis.

As before we have the classes of filter matrices Class 1, Class 2 and Class 3. Further, let us define an additional class $R$ of filter matrices as

$$R = \{F \in \mathcal{F} : \text{for any } i \in \{1, 2, \ldots, k\}, \; f_{ij} = 1 \text{ for at least one } j \text{ with } p_{ij} > 0\}.$$

This restriction means that in every row of the filter matrix we observe at least one probable transition. The restriction on the filter matrices is quite natural and does not in any way reduce the applicability of the filtering mechanism. The following theorem generalizes Theorem 13 to the case where some $p_{ij}$'s are allowed to be zero.

###### Theorem 15.

Consider a simple Markov chain on a finite state space with transition probabilities $p_{ij}$, where some $p_{ij}$'s may be zero. Let $F$ be any filter matrix belonging to the intersection of $R$ with the closure of Class 1, Class 2 or Class 3. Then under the assumption (A1), $F$ must also belong to the class of identifiable filter matrices.

The proof of the above theorem is similar to the proof of Theorem 13, because under the assumption (A1) and for filter matrices within the class $R$, the probabilities of all the sequences that we require in the proof of Theorem 13 remain positive. Finally, an application of Lemma 9 gives the required result.

## 5. Estimation and testing

As mentioned earlier, a Markov process can be completely characterized by specifying the transition probability matrix. This section deals with drawing inferences regarding the parameters. Instead of recording the entire Markov chain $x$, we apply a given filter matrix $F$ to record $\phi_F(x)$. $F$ is fixed and does not in any way depend on the data. The choice of $F$ may depend on the availability of the samples, storage facilities or past experience, subject to the constraint of identifiability. Based on $\phi_F(x)$ we shall find estimates of the transition probabilities and compute the standard errors of the estimates. Our main tool for estimation will be the EM algorithm; for the computation of the standard errors we shall use the supplemented EM (SEM) algorithm. The latter half of the section deals with testing hypotheses regarding the parameters.

### 5.1. Estimation of parameters:

In the present situation the complete data is $x$, which is unobserved, and the observed data is $\phi_F(x)$. As a natural tool of missing data analysis we apply the EM algorithm for the estimation of the parameter $\theta$. Each iteration of the EM algorithm consists of an E step (expectation step) and an M step (maximization step). In the E step of the algorithm we need to find the conditional expectation of the complete data log-likelihood given the observed data and the current iterated value of the parameters. In our case this requires the conditional expectation with respect to the conditional distribution of $x$ given $\phi_F(x)$ and the current iterated value of the parameter. The complete data log-likelihood is

$$\ell_{com}(\theta) \propto \sum_{\alpha, \beta = 1}^{k} n_{\alpha\beta} \log p_{\alpha\beta}.$$

Since $\ell_{com}$ is linear in $n_{\alpha\beta}$, we need to compute

$$E(n_{\alpha\beta} \mid \phi_F(x), \theta^{(t)}).$$

This conditional expectation cannot be computed directly, as the conditional distribution of $x$ given $\phi_F(x)$ cannot be found explicitly. We shall express $n_{\alpha\beta}$ as a sum of certain indicator variables to evaluate this conditional expectation; the computation is shown in subsection 5.1.2. We will show that this requires the following conditional probability:

$$P(\text{the chain moves from state } \alpha \text{ to state } \beta \mid \phi_F(x), \theta^{(t)}).$$

Since the observed chain has runs of missing states, the calculation of the above probability requires the probability of a transition from one state to another in any number of steps such that all the intermediate states are missing. If the complete chain were available, the probability of a transition from $i$ to $j$ in $n$ steps would simply be the $(i, j)$-th element of $P^n$; here, however, we need the probability of such a transition restricted to specific kinds of paths.

#### 5.1.1. Defining possible missing paths for a transition:

Consider two states $\alpha$ and $\beta$, and suppose we are interested in the transition from $\alpha$ to $\beta$ in $\nu$ steps. Each possible way of making the transition from $\alpha$ to $\beta$ in $\nu$ steps is called a path of order $\nu$; a path of order 1 is called an edge. Thus any given path consists of one or more edges. Clearly the transition from $\alpha$ to $\beta$ in $\nu$ steps can occur through one or more paths. We classify these paths into two categories based on the given filter matrix:

• observed paths ($O$): paths all of whose edges are observed;

• unobserved paths ($U$): paths all of whose edges are unobserved.

The two sets $O$ and $U$ are not exhaustive; a path with both observed and unobserved edges belongs to neither category.

###### Example 16.

Consider a two-state Markov chain and two filter matrices $F_1$ and $F_2$ such that

$$F_1 = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix} \qquad F_2 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}.$$

Suppose we consider the transition from state 1 to state 1 in two steps. The possible paths are:

$$w_1 : 1 \to 1 \to 1, \qquad w_2 : 1 \to 2 \to 1.$$

For filter matrix $F_1$, path $w_1 \in O$, i.e. path $w_1$ is observed, whereas $U$ is empty, i.e. no paths are unobserved. For filter matrix $F_2$, path $w_1 \in O$ and $w_2 \in U$. If we consider the transition from state 2 to state 2 in two steps, the possible paths are:

$$w_1 : 2 \to 2 \to 2, \qquad w_2 : 2 \to 1 \to 2.$$

For filter matrix $F_1$, path $w_1 \in O$ and $U$ is empty, whereas for filter matrix $F_2$, path $w_1 \in O$ and $w_2 \in U$.

Now consider the transition probability matrix $P$ of the Markov chain. We construct two matrices $P_0$ and $P_1$ as

$$(P_0)_{ij} = \begin{cases} 0 & \text{if } f_{ij} = 1 \\ p_{ij} & \text{if } f_{ij} = 0 \end{cases}$$

and

$$(P_1)_{ij} = \begin{cases} 0 & \text{if } f_{ij} = 0 \\ p_{ij} & \text{if } f_{ij} = 1. \end{cases}$$

Then the $(i, j)$-th element of $(P_0)^\nu$ gives the probability of going from state $i$ to state $j$ in $\nu$ steps through unobserved path(s), and the $(i, j)$-th element of $(P_1)^\nu$ gives the probability of going from state $i$ to state $j$ in $\nu$ steps through observed path(s).

###### Example 17.

Returning to the previous example, we see that for the filter matrix $F_1$,

$$P = \begin{bmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \end{bmatrix} \qquad P_0 = \begin{bmatrix} 0 & p_{12} \\ 0 & 0 \end{bmatrix} \qquad P_1 = \begin{bmatrix} p_{11} & 0 \\ p_{21} & p_{22} \end{bmatrix}.$$

Then

$$(P_0)^2 = \mathbf{0} \qquad (P_1)^2 = \begin{bmatrix} p_{11}^2 & 0 \\ p_{21}p_{11} + p_{22}p_{21} & p_{22}^2 \end{bmatrix}.$$

Thus for filter matrix $F_1$, the probability of going from any state to any state through the unobserved paths in 2 steps is zero. Also

$$(P_0)^\nu = \mathbf{0} \quad \text{for any } \nu \geq 2,$$

which means that the probability of going from any state to any state through the unobserved paths in two or more steps is zero.

Similarly, for filter matrix $F_2$,

$$P_0 = \begin{bmatrix} 0 & p_{12} \\ p_{21} & 0 \end{bmatrix} \qquad (P_0)^2 = \begin{bmatrix} p_{12}p_{21} & 0 \\ 0 & p_{21}p_{12} \end{bmatrix}.$$

Thus for $F_2$, the probability of going from state 1 to state 1 in 2 steps through the unobserved paths is $p_{12}p_{21}$, and the probability of going from state 2 to state 2 in 2 steps through the unobserved paths is $p_{21}p_{12}$.

Thus given a filter matrix, the probability of going from state $i$ to state $j$ in $\nu$ steps through the unobserved paths is the $(i, j)$-th element of $(P_0)^\nu$, which we denote by $p^{(\nu)}_{0\,ij}$.
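The construction of $P_0$ and $P_1$ and their powers can be checked numerically. The sketch below uses a two-state chain with the filter that records only the diagonal transitions (our reading of $F_2$ in Example 16) and concrete probabilities of our own choosing:

```python
def split(P, F):
    """Split P into (P0, P1): P0 keeps unrecorded transitions (f_ij = 0),
    P1 keeps recorded ones (f_ij = 1)."""
    k = len(P)
    P0 = [[P[i][j] if F[i][j] == 0 else 0.0 for j in range(k)] for i in range(k)]
    P1 = [[P[i][j] if F[i][j] == 1 else 0.0 for j in range(k)] for i in range(k)]
    return P0, P1

def matmul(A, B):
    k = len(A)
    return [[sum(A[i][m] * B[m][j] for m in range(k)) for j in range(k)]
            for i in range(k)]

P  = [[0.7, 0.3],
      [0.4, 0.6]]
F2 = [[1, 0],
      [0, 1]]           # record only the diagonal transitions
P0, P1 = split(P, F2)
P0_2 = matmul(P0, P0)
# (P0^2)[0][0] = p12 * p21 = 0.3 * 0.4 = 0.12, matching Example 17
```

The off-diagonal entries of `P0_2` vanish, exactly as the symbolic computation in Example 17 predicts.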

#### 5.1.2. Estimation by EM Algorithm:

For the $i$-th transition of the chain, let

$$Y_{1i} = \text{the state from which the } i\text{-th transition occurs},$$
$$Y_{2i} = \text{the state to which the } i\text{-th transition occurs}.$$

Thus $Y_{1i}$ and $Y_{2i}$ are two discrete random variables taking values in the state space $S$, for all $i$. Let us express the total number of transitions from state $\alpha$ to state $\beta$ as $n_{\alpha\beta} = \sum_{i=1}^{n} I(Y_{1i} = \alpha, Y_{2i} = \beta)$, where

$$I(Y_{1i} = \alpha, Y_{2i} = \beta) = \begin{cases} 1 & \text{if } Y_{1i} = \alpha, \, Y_{2i} = \beta \\ 0 & \text{otherwise.} \end{cases}$$

The complete data likelihood then can be written as

$$L_{com}(p) \propto \prod_{i=1}^{n} f(y_{1i}, y_{2i} \mid p) = \text{constant} \times \prod_{\alpha, \beta = 1}^{k} p_{\alpha\beta}^{n_{\alpha\beta}}$$

where

$$p_{\alpha k} = 1 - \sum_{j=1}^{k-1} p_{\alpha j} \quad \forall \, \alpha = 1, \cdots, k.$$

After $t$ iterations of the EM algorithm we write the E-step and the M-step as follows.

E-step:

Let $P^{(t)}$ be the value of the transition probability matrix after $t$ iterations, and let $\theta^{(t)}$ be the corresponding value of the parameter. The other matrices we construct take the values $P_0^{(t)}$ and $P_1^{(t)}$. Then we compute the expected complete data log-likelihood with respect to the conditional distribution of $x$ given $\phi_F(x)$ and $\theta^{(t)}$. The complete data log-likelihood is given by

$$\ell_{com}(\theta) = \text{constant} + \sum_{\alpha, \beta = 1}^{k} n_{\alpha\beta} \log p_{\alpha\beta}.$$

We then compute

$$Q(\theta) = E(\ell_{com}(\theta) \mid \phi_F(x), \theta^{(t)}).$$

Since $Q$ is linear in $n_{\alpha\beta}$, we need to compute

$$E(n_{\alpha\beta} \mid \phi_F(x), \theta^{(t)}) = \sum_{i=1}^{n} P(Y_{1i} = \alpha, Y_{2i} = \beta \mid \phi_F(x), \theta^{(t)}).$$

Let us denote $P_{i\alpha\beta} = P(Y_{1i} = \alpha, Y_{2i} = \beta \mid \phi_F(x), \theta^{(t)})$. Then for each $i$, $P_{i\alpha\beta}$ takes one of the three forms $P^{(1)}_{i\alpha\beta}$, $P^{(2)}_{i\alpha\beta}$ or $P^{(3)}_{i\alpha\beta}$, as follows:

• Case I ($Y_{1i}$ observed): Suppose the observed state $Y_{1i}$ is followed by a missing chain of length $\nu - 1$ with the next observed state $b$. Then

$$P_{i\alpha\beta} = \begin{cases} \dfrac{p_{\alpha\beta} \; p^{(\nu-1)}_{0\,\beta b}}{p^{(\nu)}_{0\,\alpha b}} & \text{if } Y_{1i} = \alpha \\ 0 & \text{if } Y_{1i} \neq \alpha \end{cases} \; =: P^{(1)}_{i\alpha\beta}, \text{ say.}$$

• Case II ($Y_{2i}$ observed): Suppose we have a missing chain of length $\nu - 1$ with the previous observed state $a$, followed by the observed state $Y_{2i}$. Then

$$P_{i\alpha\beta} = \begin{cases} \dfrac{p^{(\nu-1)}_{0\,a\alpha} \; p_{\alpha\beta}}{p^{(\nu)}_{0\,a\beta}} & \text{if } Y_{2i} = \beta \\ 0 & \text{if } Y_{2i} \neq \beta \end{cases} \; =: P^{(2)}_{i\alpha\beta}, \text{ say.}$$

• Case III (both $Y_{1i}$ and $Y_{2i}$ unobserved): Suppose we have a missing chain with the previous observed state $a$ and the next observed state $b$. Then

$$P_{i\alpha\beta} = \frac{p^{(m)}_{0\,a\alpha} \; p_{\alpha\beta} \; p^{(n)}_{0\,\beta b}}{p^{(\nu)}_{0\,ab}} \; =: P^{(3)}_{i\alpha\beta}, \text{ say,}$$

where $m$ is the number of steps from $a$ to $Y_{1i}$, $n$ is the number of steps from $Y_{2i}$ to $b$, and $\nu = m + n + 1$. If there is no such next observed state (that is, the observed chain ends in a run of blanks), then

$$P_{i\alpha\beta} = \frac{p^{(m)}_{0\,a\alpha} \; p_{\alpha\beta} \left(\sum_b p^{(n)}_{0\,\beta b}\right)}{\sum_b p^{(\nu)}_{0\,ab}} \; =: P^{(3)}_{i\alpha\beta}, \text{ say.}$$
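As a plausibility check on the Case III formula (with our reconstruction of the indices, and using $P_0$ for the middle factor, since inside a missing run a recorded transition cannot occur), the probabilities over all possible intermediate transitions at a fixed position should sum to 1, because $P_0^{m} \, P_0 \, P_0^{n} = P_0^{m+n+1}$:

```python
def matpow(A, n):
    """n-th power of a square matrix given as lists of lists (n >= 0)."""
    k = len(A)
    R = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    for _ in range(n):
        R = [[sum(R[i][m] * A[m][j] for m in range(k)) for j in range(k)]
             for i in range(k)]
    return R

def case3(P, F, a, b, m, n):
    """Case III probabilities for all (alpha, beta): the chance that the
    transition at distance m inside the missing run from a to b is alpha -> beta."""
    k = len(P)
    P0 = [[P[i][j] if F[i][j] == 0 else 0.0 for j in range(k)] for i in range(k)]
    nu = m + n + 1
    P0m, P0n, P0nu = matpow(P0, m), matpow(P0, n), matpow(P0, nu)
    denom = P0nu[a][b]
    return [[P0m[a][al] * P0[al][be] * P0n[be][b] / denom
             for be in range(k)] for al in range(k)]

P = [[0.7, 0.3],
     [0.4, 0.6]]
F = [[1, 0],
     [0, 1]]                 # our reading of F2 from Example 16
# A 2-step missing run from state 1 back to state 1; query the first transition
probs = case3(P, F, a=0, b=0, m=1, n=0)
total = sum(sum(row) for row in probs)
```

Here the only unobserved 2-step path from state 1 to itself is $1 \to 2 \to 1$, so `probs[1][0]` carries all the mass and `total` is 1.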

M-step:

We maximize $Q(\theta)$ with respect to $\theta$. Setting $\frac{\partial Q}{\partial \theta_j} = 0$ for each $j$ we get

$$\theta^{(t+1)} = (\theta^{(t+1)}_1, \theta^{(t+1)}_2, \cdots, \theta^{(t+1)}_d)$$

where

$$\theta^{(t+1)}_j = \frac{\sum_{l=1}^{n} P_{l1j}}{\sum_{\beta=1}^{k} \sum_{l=1}^{n} P_{l1\beta}} \quad \text{for } j = 1, 2, \cdots, (k-1)$$

and

$$\theta^{(t+1)}_{(i-1)(k-1)+j} = \frac{\sum_{l=1}^{n} P_{lij}}{\sum_{\beta=1}^{k} \sum_{l=1}^{n} P_{li\beta}} \quad \text{for } j = 1, \cdots, (k-1) \text{ and } i = 2, 3, \cdots, k.$$
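The M-step above reduces to a row normalization of the expected transition counts; a minimal sketch with made-up expected counts:

```python
def m_step(expected_counts):
    """Given E[n_ab | observed data, theta_t] as a k x k matrix,
    the M-step sets p_ab to the row-normalized expected count."""
    return [[c / sum(row) for c in row] for row in expected_counts]

E = [[4.0, 1.0, 5.0],
     [2.0, 2.0, 6.0],
     [3.0, 3.0, 4.0]]
P_next = m_step(E)
# each row of P_next sums to 1; e.g. P_next[0] == [0.4, 0.1, 0.5]
```

Iterating the E-step and this normalization until the parameter values stabilize gives the EM estimate.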

### 5.2. Estimation of Standard Errors:

Since the EM estimate of the parameters is the maximum likelihood estimate based on the observed likelihood, the large-sample covariance matrix can be obtained by inverting the observed information matrix. But in our problem the observed likelihood is not known explicitly. An alternative is the supplemented EM (SEM) algorithm, which allows us to find the large-sample covariance matrix without inverting an estimate of the observed information matrix. The SEM algorithm is a procedure for obtaining a numerically stable estimate of the covariance matrix of the estimated parameters using only the code for the EM steps, code for computing the large-sample complete-data covariance matrix, and standard matrix operations.

#### 5.2.1. Supplemented EM Algorithm:

Since each step of the EM algorithm produces a fresh estimate of the parameter from the previous one, the EM algorithm can be considered as a mapping $M$ on the parameter space. The derivative of the EM mapping, which we call $M^{(1)}$, can be expressed in the form

$$M^{(1)} = i_{mis} \, i_{com}^{-1} = I - i_{obs} \, i_{com}^{-1}.$$

The above equation implies

$$i_{obs}^{-1} = i_{com}^{-1} \, (I - M^{(1)})^{-1}$$

which in turn implies

$$V_{obs} = V_{com} \, (I - M^{(1)})^{-1}.$$

Now we note that

$$V_{obs} = V_{com} \, (I - M^{(1)} + M^{(1)})(I - M^{(1)})^{-1} = V_{com} + \triangle V,$$

where

$$\triangle V = V_{com} \, M^{(1)} (I - M^{(1)})^{-1}$$

is the increase in variance due to missingness.
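Given $V_{com}$ and a numerically obtained $M^{(1)}$, the SEM variance formula is a single matrix computation. The plain-Python $2 \times 2$ sketch below uses illustrative matrices of our own (a diagonal $M^{(1)}$ whose entries play the role of fractions of missing information):

```python
def sem_variance(V_com, M1):
    """V_obs = V_com (I - M1)^{-1} for 2x2 matrices, via the explicit
    2x2 inverse; the increase over V_com is the penalty for missingness."""
    a, b = 1.0 - M1[0][0], -M1[0][1]
    c, d = -M1[1][0], 1.0 - M1[1][1]
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    V_obs = [[sum(V_com[i][m] * inv[m][j] for m in range(2)) for j in range(2)]
             for i in range(2)]
    delta_V = [[V_obs[i][j] - V_com[i][j] for j in range(2)] for i in range(2)]
    return V_obs, delta_V

V_com = [[0.04, 0.0],
         [0.0, 0.09]]
M1 = [[0.5, 0.0],
      [0.0, 0.2]]            # illustrative fractions of missing information
V_obs, dV = sem_variance(V_com, M1)
# V_obs[0][0] == 0.04 / (1 - 0.5) == 0.08; V_obs[1][1] == 0.09 / 0.8 == 0.1125
```

With half the information about the first parameter missing, its variance doubles; this is the $\triangle V$ penalty in the decomposition above.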

#### 5.2.2. Calculation of $V_{com}$:

The complete data log-likelihood is

$$\ell_{com}(\theta) = \text{constant} + \sum_{i,j} n_{ij} \log p_{ij}, \quad \text{where } p_{ik} = 1 - \sum_{j=1}^{k-1} p_{ij} \;\; \forall i.$$

$$\therefore \quad \frac{\partial \ell_{com}}{\partial p_{ij}} = \frac{n_{ij}}{p_{ij}} - \frac{n_{ik}}{p_{ik}}$$

and the score vector is

$$S = \begin{bmatrix} \dfrac{n_{11}}{p_{11}} - \dfrac{n_{1k}}{p_{1k}} \\ \vdots \\ \dfrac{n_{k(k-1)}}{p_{k(k-1)}} - \dfrac{n_{kk}}{p_{kk}} \end{bmatrix}.$$

Now $-\dfrac{\partial^2 \ell_{com}}{\partial p_{ij} \, \partial p_{ij'}} = \dfrac{n_{ik}}{p_{ik}^2}$ for $j \neq j'$. Also we have $-\dfrac{\partial^2 \ell_{com}}{\partial p_{ij}^2} = \dfrac{n_{ij}}{p_{ij}^2} + \dfrac{n_{ik}}{p_{ik}^2}$.

Let $B$ be the matrix of the negatives of the second-order derivatives. Then $B$ is a block-diagonal matrix of order $k(k-1) \times k(k-1)$ with diagonal blocks $B_1, B_2, \cdots, B_k$, where $B_i$ involves only the counts and probabilities of the $i$-th row of the transition matrix.

Then the Fisher information matrix of the complete data is

$$i_{com} = E(B \mid \theta, \text{data}) = \text{block diagonal}\big(E(B_1), E(B_2), \cdots, E(B_k)\big)$$

where $E(B_i)$ is obtained by replacing each count in $B_i$ by its expectation. Thus the variance-covariance matrix of the complete data is $V_{com} = i_{com}^{-1}$.

#### 5.2.3. Computing $M^{(1)}$ by numerical differentiation:

For our problem the mapping $M$ is not known explicitly, so the derivative of $M$ at $\hat{\theta}$ is calculated numerically from the output of forced EM steps. $M^{(1)}$ is the matrix whose $(i, j)$-th element is the rate of change of the $j$-th component of $M(\theta)$ with respect to a change in the $i$-th element of $\theta$. For this we start with the EM estimate $\hat{\theta}$ and change its $i$-th element. We call the resultant estimate $\theta^{(t)}(i)$ and run one EM iteration on it to get $\tilde{\theta}^{(t+1)}(i)$. We then compute the ratio of the change in each component of the output to the change in the $i$-th input component. Thus we run a sequence of SEM iterations, where the $t$-th iteration is defined as follows: