1 Introduction
Finite Markov chain models and their probabilistic characteristics are widely used to explain the behavior of various physical systems and phenomena; such understanding of the underlying physical mechanisms is further applied to answer important research questions in psychology, genetics, epidemiology and several types of social studies (Iosifescu, 2007). For such applications, it is important to estimate the underlying probabilistic structure of the assumed Markov chain model based on the data observed from the associated physical process(es).
Consider one long, unbroken sequence $X_1, X_2, \ldots, X_n$ of random observations from a stationary Markov chain with finite state space $S = \{1, \ldots, k\}$ and transition probability matrix $\Pi = ((\pi_{ij}))$. Note that, for each $i \in S$, the vector $\pi_i = (\pi_{i1}, \ldots, \pi_{ik})$ is a probability vector over $S$; let us denote the set of all such probability vectors over $S$ by $\mathcal{P}_k$. By stationarity, the initial probability $p_i = P(X_t = i)$ is independent of $t$ for each $i \in S$. We assume that the Markov chain is ergodic (irreducible and aperiodic) and consider the problem of making inference about these unknown probabilities $p_i$'s and $\pi_{ij}$'s based on the observed sequence $X_1, \ldots, X_n$. Assuming no further structure, their nonparametric (maximum likelihood) estimates are, respectively, given by

(1) $\hat{p}_i = \dfrac{n_i}{n}, \qquad \hat{\pi}_{ij} = \dfrac{n_{ij}}{n_i}, \qquad i, j \in S,$

where $\mathbb{1}(A)$ denotes the indicator function of the event $A$ and

(2) $n_i = \sum_{t=1}^{n} \mathbb{1}(X_t = i), \qquad n_{ij} = \sum_{t=1}^{n-1} \mathbb{1}(X_t = i, X_{t+1} = j).$
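These empirical estimates are simple count ratios and can be computed directly from an observed sequence; the following is a minimal sketch in Python (state labels and helper names are illustrative, not from the paper):

```python
from collections import Counter

def empirical_estimates(seq, states):
    # Nonparametric estimates: initial probabilities as state frequencies,
    # transition probabilities as transition counts over departure counts.
    n = len(seq)
    visits = Counter(seq)                    # visits n_i over the whole sequence
    departures = Counter(seq[:-1])           # departures available from each state
    trans = Counter(zip(seq[:-1], seq[1:]))  # transition counts n_ij
    p_hat = {i: visits[i] / n for i in states}
    pi_hat = {(i, j): trans[(i, j)] / departures[i] if departures[i] else 0.0
              for i in states for j in states}
    return p_hat, pi_hat

# Toy two-state sequence for illustration
seq = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]
p_hat, pi_hat = empirical_estimates(seq, [0, 1])
```

Each estimated row of the transition matrix sums to one whenever the corresponding state has been visited at least once before the final observation.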
The estimated transition probability matrix is then given by $\hat{\Pi} = ((\hat{\pi}_{ij}))$. More details about these estimates and their asymptotic properties can be found in, e.g., Jones (2004), Rajarshi (2014) and the references therein. However, in several applications in epidemiology, biology, genomics, reliability studies, etc., we often model the transition probability matrix by a parametric family of transition matrices $\{\Pi(\theta) = ((\pi_{ij}(\theta))) : \theta \in \Theta\}$, where the $\pi_{ij}(\theta)$ are known functions depending on an unknown $d$-dimensional parameter vector $\theta \in \Theta$, the parameter space, and $\sum_j \pi_{ij}(\theta) = 1$ for every $\theta$ and each $i$. We assume, throughout this paper, that this model family is identifiable in the sense that $\Pi(\theta_1) = \Pi(\theta_2)$ for any two parameter values must imply $\theta_1 = \theta_2$. Then, any inference has to be performed based upon a consistent and asymptotically normal estimate of $\theta$. The maximum likelihood estimator (MLE) is an immediate (optimal) candidate for this purpose; it was studied by Billingsley (1961) and is still the most widely used method of inference for a finite Markov chain. Some modified likelihood-based approaches (e.g., PL, QL) have also been developed for computational feasibility; see Hjort and Varin (2008) and the references therein. Although asymptotically optimal, a well-known drawback of all these likelihood-based inference procedures is their non-robustness against outliers or data contamination, leading to erroneous insights. Since outliers are not infrequent in real-life applications, a robust statistical procedure that automatically takes care of the outliers is of great value for producing robust estimators and subsequent stable inference in such cases. However, to the best of the author's knowledge, there is no literature on robust inference methods for finite Markov chains. An alternative to the MLE based on the minimum distance approach was discussed by
Menéndez et al. (1999) using disparity measures, but they also did not discuss the issue of robustness. Here, we fill this gap in the literature by developing a robust methodology for parameter estimation and associated inference in finite Markov chain models. As a way to address the robustness issue, we consider the popular minimum distance approach based on the density power divergence (DPD) measure, originally introduced by Basu et al. (1998)
for IID data. The DPD measure is a one-parameter generalization of the Kullback-Leibler divergence (KLD); for any two densities $g$ and $f$, with respect to some common dominating measure $\mu$, the DPD measure is defined in terms of a tuning parameter $\alpha > 0$ as

(3) $d_\alpha(g, f) = \int \left[ f^{1+\alpha} - \left(1 + \frac{1}{\alpha}\right) g\, f^{\alpha} + \frac{1}{\alpha}\, g^{1+\alpha} \right] d\mu,$

with the limiting case at $\alpha = 0$ given by

(4) $d_0(g, f) = \lim_{\alpha \downarrow 0} d_\alpha(g, f) = \int g \log\left(\frac{g}{f}\right) d\mu.$
Note that the DPD at $\alpha = 0$ is nothing but the KLD measure, while at $\alpha = 1$ it is the squared $L_2$ distance. Since the MLE is a minimizer of the KLD measure between the data and the model, a generalized estimator can be obtained by minimizing the corresponding DPD measure for any given $\alpha > 0$. The resulting minimum DPD estimator (MDPDE) has recently become popular due to its simplicity of construction and computation along with its strong robustness properties; it is also highly efficient, with the tuning parameter $\alpha$ controlling the trade-off between efficiency and robustness of the MDPDE and the associated inference (see, e.g., Basu et al., 2011). This approach based on the MDPDE has recently been applied successfully to different models and data analysis problems to produce robust insights in the presence of possible data contamination; see, e.g., Basu et al. (2006, 2018); Ghosh and Basu (2013, 2018); Ghosh et al. (2016, 2018), among many others.
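For discrete distributions, the DPD and its KLD limit are straightforward to compute; the following small sketch uses $g$ and $f$ for the data and model probability vectors, following the description above (the specific vectors are arbitrary illustrations):

```python
import math

def dpd(g, f, alpha):
    # d_alpha(g, f) = sum_j [ f_j^(1+a) - (1 + 1/a) g_j f_j^a + (1/a) g_j^(1+a) ]
    return sum(fj ** (1 + alpha) - (1 + 1 / alpha) * gj * fj ** alpha
               + (1 / alpha) * gj ** (1 + alpha)
               for gj, fj in zip(g, f))

def kld(g, f):
    # Kullback-Leibler divergence, the alpha -> 0 limit of the DPD
    return sum(gj * math.log(gj / fj) for gj, fj in zip(g, f) if gj > 0)

g = [0.2, 0.5, 0.3]
f = [0.3, 0.4, 0.3]
# dpd(g, f, 1.0) equals the squared L2 distance between g and f,
# and dpd(g, f, alpha) tends to kld(g, f) as alpha decreases to 0.
```

The divergence is zero exactly when the two vectors coincide, and positive otherwise, for every $\alpha > 0$.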
In this paper, we develop the MDPDE for finite Markov chain models as a robust generalization of the MLE and use it for further robust inference. We first define the MDPDE as the minimizer of an appropriate (generalized) total discrepancy measure, built from the density power divergences between the rows of the empirical estimate and of the model transition matrix, and then derive its asymptotic and robustness properties. In particular, we prove the consistency and asymptotic normality of the MDPDE as the sequence length increases, and study its robustness via the classical influence function analysis. The proposed MDPDE and its performance are illustrated through four common examples of finite Markov chain models, including the simple random walk, binomial extensions of the random walk, and an important epidemic model. The asymptotic relative efficiency of the MDPDEs (compared to the MLE) is used to study the effect of the tuning parameter, and finite-sample simulation studies are performed to justify the robustness benefits of the MDPDE; these illustrations clearly indicate the usefulness of our proposed MDPDE for robust estimation under finite Markov chain models.
Further, we describe the application of the proposed MDPDE to the statistical testing of general composite hypotheses. The asymptotic distribution of the corresponding MDPDE-based Wald-type test statistic is derived under the null distribution and under a contiguous sequence of alternatives. The influence function of these test statistics is also derived. An example of testing for the Bernoulli-Laplace diffusion model against a suitable parametric family of alternatives is discussed. The MDPDE-based testing procedure is also developed for comparing the parametric transition matrices of two observed Markov chain sequences.
Finally, we discuss important extensions of the MDPDE concept to a few more complex finite Markov chain setups. These include the case of multiple sequences of observations obtained from the same finite Markov chain model, where the asymptotics of the MDPDE are discussed both for diverging sequence lengths (with a finite number of sequences) and for a diverging number of observed sequences (each of finite length). The MDPDE is also defined for parameter estimation in higher-order Markov chains. Brief discussions are additionally provided on the MDPDE for parametric Markov chain models with time-dependent (non-stationary) transition probabilities.
2 Robust Estimation for a Finite Markov Chain
2.1 The Minimum Density Power Divergence Estimator
Let us consider the setup and notation of Section 1. The widely popular MLE of $\theta$ is defined as the maximizer of the likelihood function, which is proportional to

$\prod_{i} \prod_{j} \pi_{ij}(\theta)^{n_{ij}}.$

Some algebra leads to the form of the corresponding log-likelihood function as given by

(5) $\ell_n(\theta) = \sum_{i} \sum_{j} n_{ij} \log \pi_{ij}(\theta) = n \sum_{i} \hat{p}_i \sum_{j} \hat{\pi}_{ij} \log \pi_{ij}(\theta),$
and hence the MLE can be equivalently obtained by minimizing a generalized KLD measure, namely a weighted average of the KLD measures between the estimated probability vectors $\hat{\pi}_i$ and the model probability vectors $\pi_i(\theta) = (\pi_{i1}(\theta), \ldots, \pi_{ik}(\theta))$ over different rows $i$, with weights $\hat{p}_i$. Since the DPD measure is a generalization of the KLD measure at $\alpha = 0$, in view of (5), we can define the MDPDE at any $\alpha > 0$ as the minimizer of the generalized DPD measure given by

$H_{n,\alpha}^{*}(\theta) = \sum_{i} \hat{p}_i \, d_\alpha\!\left(\hat{\pi}_i, \pi_i(\theta)\right)$

with respect to $\theta \in \Theta$. Since the last term within the bracket in the above equation does not depend on $\theta$, the MDPDE can indeed be obtained by minimizing, in $\theta$, the simpler objective function

(6) $H_{n,\alpha}(\theta) = \sum_{i} \hat{p}_i \sum_{j} \left[ \pi_{ij}(\theta)^{1+\alpha} - \left(1 + \frac{1}{\alpha}\right) \hat{\pi}_{ij} \, \pi_{ij}(\theta)^{\alpha} \right].$
Under the assumption of differentiability of $\pi_{ij}(\theta)$ in $\theta$, we can obtain the estimating equation of the MDPDE at any $\alpha > 0$ as given by

(7) $\sum_{i} \hat{p}_i \sum_{j} \pi_{ij}(\theta)^{\alpha-1} \left( \pi_{ij}(\theta) - \hat{\pi}_{ij} \right) \nabla \pi_{ij}(\theta) = 0_d,$

where $\nabla$ denotes the gradient with respect to $\theta$ and $0_d$ denotes the $d$-vector having all entries zero. Note that, as $\alpha \to 0$, the MDPDE estimating equation in (7) coincides with the score equation corresponding to the MLE, as expected from the relation between the DPD and KLD measures. Therefore, the estimating equation (7) is valid for the MDPDEs with any $\alpha \geq 0$; the MDPDE coincides with the MLE at $\alpha = 0$ and provides its robust generalization at $\alpha > 0$. It is easy to verify that the MDPDE estimating equations are unbiased at the model and that the estimator itself is Fisher consistent for all $\alpha \geq 0$.
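For a scalar parameter, the minimization defining the MDPDE can be carried out by direct numerical search. The sketch below assumes the simpler objective takes the weighted row-wise form described above, and uses a hypothetical two-state parametric family purely for illustration (neither the family nor the helper names come from the paper):

```python
def mdpde_objective(theta, p_hat, pi_hat, model, alpha):
    # Simpler MDPDE objective: weighted sum over rows i and columns j of
    # pi_ij(theta)^(1+alpha) - (1 + 1/alpha) * pihat_ij * pi_ij(theta)^alpha
    pi = model(theta)
    return sum(p_hat[i] * (pi[i][j] ** (1 + alpha)
                           - (1 + 1 / alpha) * pi_hat[i][j] * pi[i][j] ** alpha)
               for i in range(len(pi)) for j in range(len(pi[i]))
               if pi[i][j] > 0)

def mdpde(p_hat, pi_hat, model, alpha, grid_size=1000):
    # Crude grid search over a scalar theta in (0, 1); in practice one would
    # use a proper optimizer or solve the estimating equation directly.
    grid = [t / grid_size for t in range(1, grid_size)]
    return min(grid, key=lambda th: mdpde_objective(th, p_hat, pi_hat, model, alpha))

# Hypothetical two-state family (illustrative only)
model = lambda th: [[th, 1 - th], [(1 - th) ** 2, 1 - (1 - th) ** 2]]

# Sanity check of Fisher consistency: feeding the model itself as the
# "empirical" input recovers the underlying parameter value.
est = mdpde([0.5, 0.5], model(0.7), model, alpha=0.5)
```

The grid search is a dependable, if crude, stand-in for a gradient-based optimizer and works for any tuning parameter value.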
In this regard, we define the statistical functional, say $T_\alpha(Q)$, corresponding to the MDPDE with tuning parameter $\alpha$ at any general (true) transition matrix $Q = ((q_{ij}))$ as the minimizer of $\sum_i p_i(Q)\, d_\alpha(q_i, \pi_i(\theta))$ with respect to $\theta \in \Theta$, where $q_i$ denotes the $i$-th row of $Q$ and the $p_i(Q)$'s are the true initial probabilities depending on $Q$. Consistent with the MDPDE objective function in (6), the MDPDE functional can be obtained from the simpler objective function given by

(8) $H_{\alpha}(\theta; Q) = \sum_{i} p_i(Q) \sum_{j} \left[ \pi_{ij}(\theta)^{1+\alpha} - \left(1 + \frac{1}{\alpha}\right) q_{ij} \, \pi_{ij}(\theta)^{\alpha} \right].$

The corresponding estimating equation for the MDPDE functional has the form

(9) $\sum_{i} p_i(Q) \sum_{j} \pi_{ij}(\theta)^{\alpha-1} \left( \pi_{ij}(\theta) - q_{ij} \right) \nabla \pi_{ij}(\theta) = 0_d.$

Note that $p_i(\hat{\Pi}) = \hat{p}_i$ and $H_\alpha(\theta; \hat{\Pi}) = H_{n,\alpha}(\theta)$, which implies that $T_\alpha(\hat{\Pi})$ is indeed the proposed MDPDE. Further, if the model is correctly specified with the true transition matrix being $Q = \Pi(\theta_0)$ for some $\theta_0 \in \Theta$, then the estimating equation (9) has a solution at $\theta = \theta_0$. Under the assumption of identifiability of our model family, it further implies that this solution is unique, and hence $T_\alpha(\Pi(\theta_0)) = \theta_0$, i.e., the MDPDE functional is Fisher consistent at the model family. When the true transition matrix $Q$ does not belong to the model family, we will refer to the corresponding MDPDE functional value $\theta_\alpha^{Q} = T_\alpha(Q)$ as the "best fitting parameter" value (in the DPD sense), and we will show below that the corresponding MDPDE is also asymptotically consistent for this $\theta_\alpha^{Q}$.
2.2 Asymptotic Properties of the MDPDE
In order to derive the asymptotic properties of the proposed MDPDE under the finite Markov chain models, we first consider the following regularity conditions on the model transition probabilities.

(A1) For each $\theta \in \Theta$, the model transition probability matrix $\Pi(\theta)$ has the same set of zero elements, i.e., the set $S_{+} = \{(i,j) : \pi_{ij}(\theta) > 0\}$ is independent of $\theta$. Put $m = |S_{+}|$. Additionally, $\Pi(\theta)$ is regular in the sense that any Markov chain with transition probabilities $((\pi_{ij}))$ satisfying "$\pi_{ij} > 0$ if and only if $(i,j) \in S_{+}$" is irreducible.

(A2) For all $(i,j) \in S_{+}$, the functions $\pi_{ij}(\theta)$ are twice continuously differentiable for all $\theta \in \Theta$.

(A3) The $m \times d$ matrix $D(\theta) = \left(\left(\partial \pi_{ij}(\theta)/\partial \theta_l\right)\right)_{(i,j) \in S_{+},\; 1 \leq l \leq d}$ has rank $d$ for any $\theta \in \Theta$.
Based on (A1), for any transition matrix $Q$, we define the vector $q$ having elements $q_{ij}$ only for $(i,j) \in S_{+}$ (the elements are stacked row-wise in our convention) and denote the set of all such vectors as $\Gamma$. Then, in view of Theorem 3.1 of Billingsley (1961), for a stationary and ergodic finite Markov chain having true transition matrix $Q$, we have the asymptotic result:

(10) $\sqrt{n}\left(\hat{q} - q\right) \xrightarrow{\mathcal{D}} N_m\left(0_m, \Sigma(Q)\right),$

where $\hat{q}$ is obtained from (1) and $\Sigma(Q)$ is an $m \times m$ matrix having entries

$\Sigma_{(i,j),(i',j')}(Q) = \delta_{ii'}\, \frac{q_{ij}\left(\delta_{jj'} - q_{ij'}\right)}{p_i(Q)}, \qquad (i,j), (i',j') \in S_{+},$

with $\delta$ denoting the Kronecker delta. The convergence in (10) is uniform in a neighborhood of $q$, and $\hat{q} \to q$ almost surely (a.s.) as $n \to \infty$ (Lifshits, 1979; Sirazhdinov and Formanov, 1984).
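The almost-sure convergence behind (10) is easy to observe in a quick simulation; the sketch below uses a hypothetical two-state chain with arbitrarily chosen transition probabilities (all names and values are illustrative):

```python
import random

def simulate_chain(pi, n, seed=12345):
    # Simulate n steps of a finite Markov chain with transition matrix pi.
    rng = random.Random(seed)
    state, seq = 0, [0]
    for _ in range(n - 1):
        u, acc = rng.random(), 0.0
        for j, pij in enumerate(pi[state]):
            acc += pij
            if u < acc:
                state = j
                break
        else:
            state = len(pi) - 1  # guard against floating-point rounding
        seq.append(state)
    return seq

pi_true = [[0.7, 0.3], [0.4, 0.6]]
seq = simulate_chain(pi_true, 50_000)
n0 = sum(1 for s in seq[:-1] if s == 0)
n01 = sum(1 for a, b in zip(seq[:-1], seq[1:]) if (a, b) == (0, 1))
pi01_hat = n01 / n0
# pi01_hat should settle near the true value 0.3 as n grows
```

Repeating the simulation over many replications and rescaling the errors by the square root of the sample size would likewise exhibit the normal limit in (10).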
We also define a few matrices, required for our asymptotic derivations, as follows. For any $Q$ satisfying (A1) and any $\theta \in \Theta$, we define the matrices

$J_\alpha(\theta; Q) = \sum_{i} p_i(Q) \sum_{j} \left[ \pi_{ij}(\theta)^{\alpha-1}\, \nabla\pi_{ij}(\theta)\nabla\pi_{ij}(\theta)^{T} + \left(\pi_{ij}(\theta) - q_{ij}\right)\left\{(\alpha-1)\,\pi_{ij}(\theta)^{\alpha-2}\, \nabla\pi_{ij}(\theta)\nabla\pi_{ij}(\theta)^{T} + \pi_{ij}(\theta)^{\alpha-1}\, \nabla^{2}\pi_{ij}(\theta)\right\}\right],$

$K_\alpha(\theta; Q) = \sum_{i} p_i(Q) \left[\sum_{j} \pi_{ij}(\theta)^{2(\alpha-1)}\, q_{ij}\, \nabla\pi_{ij}(\theta)\nabla\pi_{ij}(\theta)^{T} - u_i(\theta; Q)\, u_i(\theta; Q)^{T}\right], \qquad u_i(\theta; Q) = \sum_{j} \pi_{ij}(\theta)^{\alpha-1}\, q_{ij}\, \nabla\pi_{ij}(\theta).$

Also define the following matrices, which are nonsingular by Assumption (A3):

(11) $J_\alpha(\theta) = J_\alpha(\theta; \Pi(\theta)) = \sum_{i} p_i(\theta) \sum_{j} \pi_{ij}(\theta)^{\alpha-1}\, \nabla\pi_{ij}(\theta)\nabla\pi_{ij}(\theta)^{T},$

(12) $K_\alpha(\theta) = K_\alpha(\theta; \Pi(\theta)) = \sum_{i} p_i(\theta) \left[\sum_{j} \pi_{ij}(\theta)^{2\alpha-1}\, \nabla\pi_{ij}(\theta)\nabla\pi_{ij}(\theta)^{T} - u_i(\theta)\, u_i(\theta)^{T}\right],$

where $p_i(\theta) = p_i(\Pi(\theta))$ and $u_i(\theta) = \sum_j \pi_{ij}(\theta)^{\alpha}\, \nabla\pi_{ij}(\theta)$.
Now, let us first restrict ourselves to the cases where the assumed parametric model family is correctly specified, so that the true transition probability matrix belongs to the model family, i.e., $Q = \Pi(\theta_0)$ for some $\theta_0 \in \Theta$. For simplicity, we put $\theta_\alpha^{Q} = \theta_0$. Note that, in such cases we have, for any $\alpha \geq 0$ (including $\alpha = 0$), $J_\alpha(\theta_0; \Pi(\theta_0)) = J_\alpha(\theta_0)$ and $K_\alpha(\theta_0; \Pi(\theta_0)) = K_\alpha(\theta_0)$.
Further, using the result (10) with $Q = \Pi(\theta_0)$ and extending the arguments of Menéndez et al. (1999), we now prove the asymptotic consistency of the MDPDE at the model, as presented in the following theorem. From now on, we will use the notation $p_i = p_i(\theta_0)$ for the initial probabilities when $Q = \Pi(\theta_0)$, and assume that $p_i > 0$ for all $i$.
Theorem 2.1
Consider a finite Markov chain that is stationary and ergodic having true transition matrix $\Pi(\theta_0)$ for some $\theta_0 \in \Theta$, and fix an $\alpha \geq 0$. Then, under Assumptions (A1)–(A3), we have the following results.

(i) There exists a solution $\hat{\theta}_\alpha$ (the MDPDE) of the estimating equation (7) which is unique a.s. in a neighborhood of $\theta_0$ and satisfies the relation

(13) $\hat{\theta}_\alpha - \theta_0 = J_\alpha(\theta_0)^{-1} \sum_{(i,j) \in S_{+}} p_i\, \pi_{ij}(\theta_0)^{\alpha-1} \left(\hat{\pi}_{ij} - \pi_{ij}(\theta_0)\right) \nabla\pi_{ij}(\theta_0) + o\left(\left\|\hat{q} - q\right\|\right) \quad \text{a.s.}$

(ii) The MDPDE $\hat{\theta}_\alpha$ is consistent for $\theta_0$ and also asymptotically normal with

(14) $\sqrt{n}\left(\hat{\theta}_\alpha - \theta_0\right) \xrightarrow{\mathcal{D}} N_d\left(0_d, V_\alpha(\theta_0)\right),$ where $V_\alpha(\theta) = J_\alpha(\theta)^{-1} K_\alpha(\theta) J_\alpha(\theta)^{-1}$.
Proof: Note that $q$ lies in the interior of the $m$-dimensional unit cube. Consider a neighborhood of $\theta_0$ on which the functions $\pi_{ij}(\theta)$, $(i,j) \in S_{+}$, have continuous partial derivatives; this is possible in view of Assumption (A2). Then, with slight abuse of notation, we consider the function
where each coordinate function is continuous in its arguments. By definition, the function has a zero at the point $(q, \theta_0)$.
Next, through standard differentiation, we get
(15) 
Since $J_\alpha(\theta_0)$ is nonsingular by Assumption (A3), we can now apply the implicit function theorem to this function at the point $(q, \theta_0)$ to get a neighborhood of $q$ in $\Gamma$ and a unique continuously differentiable function $g$ such that $g(q) = \theta_0$ and
Differentiating this last equation, via the chain rule, we get
Evaluating this at $(q, \theta_0)$ and simplifying using (15), we get
But a Taylor series expansion of $g$ around $q$ yields
Therefore, upon simplification, for any point in this neighborhood, we get
(16) 
Finally, in view of (10), we have $\hat{q} \to q$ almost surely as $n \to \infty$. Thus $\hat{q}$ falls in the above neighborhood of $q$ almost surely for sufficiently large $n$, and hence $g(\hat{q})$ is the unique solution of the equations
which are the MDPDE estimating equations in (7). Therefore, $g(\hat{q})$ is indeed our target MDPDE and is also almost surely unique. We can verify that it satisfies the required relation in (13) by substituting $\hat{q}$ in Equation (16), completing the proof of part (i) of the theorem.
Remark 2.1 (The special case $\alpha = 0$)
We have already argued that the MDPDE is a generalization of the classical MLE and, in fact, coincides with the MLE at $\alpha = 0$. In this special case, the estimating equation (7) is given by

(17) $\sum_{i} \sum_{j} n_{ij}\, \nabla \log \pi_{ij}(\theta) = 0_d,$
which is the usual score equation of the MLE. We can also obtain the asymptotic distribution of the MLE as a special case of Theorem 2.1 at $\alpha = 0$. Note that $\sum_{j} \nabla\pi_{ij}(\theta) = 0$ for each $i$, and some algebra then leads us to $K_0(\theta) = J_0(\theta)$. Therefore the asymptotic variance of ($\sqrt{n}$ times) the MLE turns out to be $J_0(\theta_0)^{-1}$. This coincides with the usual maximum likelihood theory, since $J_0(\theta)$ is indeed the Fisher information matrix of our model. Further, it is important to note that the minimum disparity estimators discussed in Menéndez et al. (1999) also have the same asymptotic distribution as that of the MDPDE at $\alpha = 0$.

The asymptotic variance formula in Theorem 2.1 can be used to study the asymptotic relative efficiency (ARE) of the MDPDEs at different $\alpha$. Further, it also helps us to compute an estimate of the standard errors of the proposed MDPDEs through a consistent estimate $V_\alpha(\hat{\theta}_\alpha)$ of $V_\alpha(\theta_0)$. In fact, as we increase $\alpha$, the asymptotic variance of the MDPDE increases slightly (the ARE decreases) with a significant gain in robustness. This fact is not easy to verify directly from the general variance formula; we will illustrate it through several examples in the next section.

Note that the above asymptotic properties of the MDPDE in Theorem 2.1 are obtained under the assumption of a perfectly specified model. However, they can easily be extended to model misspecification cases where the true transition probability matrix, say $Q$, does not belong to the assumed model family. In this case, we can talk about consistency only at the "best fitting parameter" value $\theta_\alpha^{Q}$ defined at the end of Section 2.1. Then, the conclusions of Theorem 2.1 still hold with slight modifications, as given in the following theorem. Its proof can be done using arguments similar to those used in the proof of Theorem 2.1, by replacing $J_\alpha(\theta_0)$ and $K_\alpha(\theta_0)$, respectively, by $J_\alpha(\theta_\alpha^{Q}; Q)$ and $K_\alpha(\theta_\alpha^{Q}; Q)$; the details are hence omitted.
Theorem 2.2
Consider a finite Markov chain that is stationary and ergodic having true transition matrix $Q$, which does not necessarily belong to the model family, and fix an $\alpha \geq 0$. Let $\theta_\alpha^{Q}$ denote the "best fitting parameter" value in the DPD sense. Then, under Assumptions (A1)–(A3), we have the following results.

(i) There exists a solution $\hat{\theta}_\alpha$ (the MDPDE) of the estimating equation (7) which is unique a.s. in a neighborhood of $\theta_\alpha^{Q}$ and satisfies the relation

(18) $\hat{\theta}_\alpha - \theta_\alpha^{Q} = J_\alpha(\theta_\alpha^{Q}; Q)^{-1} \sum_{(i,j) \in S_{+}} p_i(Q)\, \pi_{ij}(\theta_\alpha^{Q})^{\alpha-1} \left(\hat{\pi}_{ij} - q_{ij}\right) \nabla\pi_{ij}(\theta_\alpha^{Q}) + o\left(\left\|\hat{q} - q\right\|\right) \quad \text{a.s.}$

(ii) The MDPDE $\hat{\theta}_\alpha$ is consistent for $\theta_\alpha^{Q}$ and also asymptotically normal with

(19) $\sqrt{n}\left(\hat{\theta}_\alpha - \theta_\alpha^{Q}\right) \xrightarrow{\mathcal{D}} N_d\left(0_d,\; J_\alpha(\theta_\alpha^{Q}; Q)^{-1} K_\alpha(\theta_\alpha^{Q}; Q)\, J_\alpha(\theta_\alpha^{Q}; Q)^{-1}\right).$
Based on Theorem 2.2, a model-robust estimator of the standard error of the MDPDE can be obtained from the model-robust estimator of its asymptotic variance matrix, given by $J_\alpha(\hat{\theta}_\alpha; \hat{\Pi})^{-1} K_\alpha(\hat{\theta}_\alpha; \hat{\Pi})\, J_\alpha(\hat{\theta}_\alpha; \hat{\Pi})^{-1}$. This can be shown to be a consistent variance estimator under standard regularity conditions. It also performs better than the model-specific variance estimator under model misspecification, although the latter performs better against outliers with respect to a fixed model.
2.3 Influence Function of the MDPDE
The influence function (IF) is a classical measure of the local robustness of a statistical functional; it measures the amount of (asymptotic) bias of the functional under infinitesimal contamination at a distant outlying point (Hampel et al., 1986). Let us now study the IF of the proposed MDPDE functional $T_\alpha$ under the finite Markov chain model setup.

Suppose that the data are observed from a stationary and ergodic finite Markov chain having true transition matrix $Q$, which does not necessarily belong to the model family. Consider a contaminated transition matrix $Q_\epsilon = (1 - \epsilon) Q + \epsilon \Lambda_t$, where $\epsilon$ denotes the contamination proportion, $t = (t_1, \ldots, t_k)$ is the contamination point, and the contamination matrix $\Lambda_t$ has entry one at the $(i, t_i)$-th position for all $i$ and zero in all other positions. This leads to the contaminated probability vector $(1-\epsilon) q_i + \epsilon e_{t_i}$ for each row $i$ of the transition matrix, where $e_j$ denotes the $j$-th standard basis vector. The associated IF of the MDPDE functional at a fixed $Q$ is then defined as

$\mathrm{IF}(t; T_\alpha, Q) = \left.\frac{\partial}{\partial \epsilon} T_\alpha(Q_\epsilon)\right|_{\epsilon = 0}.$
In order to derive this IF, we note that $T_\alpha(Q_\epsilon)$ satisfies the estimating equation (9) with $Q$ replaced by $Q_\epsilon$, i.e., we have

Differentiating the above with respect to $\epsilon$ and evaluating at $\epsilon = 0$, we can get the IF of the MDPDE functional. The straightforward derivation steps are omitted for brevity, and the final result is presented in the following theorem.
Theorem 2.3
Consider a finite Markov chain that is stationary and ergodic having true transition matrix $Q$ and fix an $\alpha \geq 0$. Let $\theta_\alpha^{Q}$ denote the "best fitting parameter" value in the DPD sense. Then, the influence function of the MDPDE functional $T_\alpha$ is given by

The above formula can be further simplified at the model, where $Q = \Pi(\theta_0)$ for some $\theta_0 \in \Theta$.

Only one term of the IF depends on the contamination point $t$; the more bounded that term is, the more robust the estimator. We can quantify the extent of robustness through this IF in terms of the sensitivity measure defined as $s(T_\alpha) = \sup_{t} \left\|\mathrm{IF}(t; T_\alpha, Q)\right\|$. For most common examples, this sensitivity indeed decreases with increasing $\alpha$, indicating the gain in robustness of our MDPDE for larger $\alpha$.
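This gain in robustness for larger tuning parameter values can also be seen empirically by contaminating the input of the MDPDE objective. The sketch below uses a hypothetical binomial-type random walk (the model, its purely reflecting boundary rows, and the uniform row weights are illustrative assumptions, not taken from the paper): contaminating the interior rows shifts the small-$\alpha$ fit away from the true parameter noticeably more than the $\alpha = 1$ fit.

```python
def binom_rw(theta, k=4):
    # Hypothetical chain on states 0..k: interior state i moves to i-1, i, i+1
    # with Bin(2, theta) probabilities; boundary rows are purely reflecting.
    pi = [[0.0] * (k + 1) for _ in range(k + 1)]
    for i in range(1, k):
        pi[i][i - 1] = (1 - theta) ** 2
        pi[i][i] = 2 * theta * (1 - theta)
        pi[i][i + 1] = theta ** 2
    pi[0][1], pi[k][k - 1] = 1.0, 1.0
    return pi

def mdpde_fit(pi_hat, alpha, grid_size=2000):
    # Grid-search minimizer of the DPD-type objective (uniform row weights).
    k = len(pi_hat) - 1
    def obj(th):
        pi = binom_rw(th, k)
        return sum(pi[i][j] ** (1 + alpha)
                   - (1 + 1 / alpha) * pi_hat[i][j] * pi[i][j] ** alpha
                   for i in range(k + 1) for j in range(k + 1) if pi[i][j] > 0)
    grid = [t / grid_size for t in range(1, grid_size)]
    return min(grid, key=obj)

theta0, eps = 0.3, 0.2
clean = binom_rw(theta0)
# Contaminate every interior row: move eps of its mass onto the upper neighbor
contam = [row[:] for row in clean]
for i in range(1, 4):
    contam[i] = [(1 - eps) * p for p in clean[i]]
    contam[i][i + 1] += eps

est_small = mdpde_fit(contam, alpha=0.1)  # behaves like the non-robust MLE
est_large = mdpde_fit(contam, alpha=1.0)  # downweights the contaminated cells
```

In this particular sketch the $\alpha = 1$ estimate stays closer to the true value than the $\alpha = 0.1$ estimate, illustrating the bias reduction that the bounded influence function quantifies.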
3 Examples and Illustrations
3.1 Example 1: Simple Random Walk with Reflecting Barriers
Let us first consider a simple finite Markov chain, namely the random walk with reflecting barriers, having state space $S$ and parametric transition matrix
(21) 
Here our target parameter is a scalar and the associated parameter space is . It is easy to verify that this Markov chain is stationary and ergodic with initial (stationary) probabilities being
where the normalizing constant is defined so that the initial probabilities sum to one. Further, Assumptions (A1)–(A3) hold for the model in (21).
Let us now consider the problem of estimating $\theta$ from an observed sequence. The MLE of $\theta$ is given by
Now, to find the MDPDE of $\theta$ with tuning parameter $\alpha$, we simplify the estimating equation (7), which leads to
(22) 
Although the above estimating equation (22) is not directly solvable analytically, one can easily verify that the MLE is indeed a solution of (22) for any $\alpha \geq 0$. Therefore, for this example, the MDPDEs for all $\alpha \geq 0$ coincide with the MLE. Additionally, since (A1)–(A3) hold, one can obtain its asymptotic properties at the model from Theorem 2.1. In particular, with some algebra, we have
Thus, although these two quantities depend on $\alpha$, the asymptotic variance of the MDPDE becomes independent of $\alpha$. This is consistent with the fact that the MDPDEs themselves do not depend on $\alpha$. Further, the above asymptotic variance formula is exactly the same as that derived in Hjort and Varin (2008) for the MLE, and thus our Theorem 2.1 generalizes their result to the larger class of MDPDEs.
We conjecture that the MDPDEs will be independent of $\alpha$, and hence coincide with the MLE with no robustness benefit, as in the present example, whenever the elements of the transition matrix are linear functions of the parameters. This is certainly an interesting phenomenon, which has not been observed so prominently in the literature on the DPD and its wide range of applications.
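The conjecture can be checked numerically. The sketch below uses a hypothetical two-state family whose transition probabilities are linear in the scalar parameter (stay with probability $\theta$, switch otherwise); even for a deliberately distorted empirical input, the grid-search minimizer of the DPD-type objective does not move as $\alpha$ varies:

```python
def objective(theta, p_hat, pi_hat, alpha):
    # DPD-type objective for a hypothetical two-state family whose
    # transition probabilities are *linear* in theta.
    pi = [[theta, 1 - theta], [1 - theta, theta]]
    return sum(p_hat[i] * (pi[i][j] ** (1 + alpha)
                           - (1 + 1 / alpha) * pi_hat[i][j] * pi[i][j] ** alpha)
               for i in range(2) for j in range(2))

def argmin_theta(p_hat, pi_hat, alpha, grid_size=1000):
    grid = [t / grid_size for t in range(1, grid_size)]
    return min(grid, key=lambda th: objective(th, p_hat, pi_hat, alpha))

# A deliberately distorted "empirical" input (rows still sum to one)
p_hat = [0.45, 0.55]
pi_hat = [[0.9, 0.1], [0.4, 0.6]]
ests = [argmin_theta(p_hat, pi_hat, a) for a in (0.2, 0.5, 1.0)]
# For this linear family the estimating equation factorizes, so the
# minimizer p1*pi11 + p2*pi22 = 0.735 is the same for every alpha.
```

The factor of the estimating equation that depends on $\alpha$ is strictly positive, so the root, and hence the estimate, is determined entirely by the $\alpha$-free linear factor.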
3.2 Example 2: A Random Walk Type Model with Binomial Probabilities
We now consider another, more interesting, example of a finite Markov chain over a state space with reflecting barriers, where the moves from each internal position to its three nearest positions (itself and its two neighbors) follow Bin(2, $\theta$) probabilities. The corresponding transition matrix is then given by
(23) 
Such a model often arises in many real-life applications, e.g., in genetics, with different values of $\theta$. Once again the target parameter is scalar, and the Markov chain is stationary and ergodic with initial (stationary) probabilities , where
and 
Further, Assumptions (A1)–(A3) also hold for the model in (23) with
so that and
Now consider one long sequence observed from this Markov chain, based on which we wish to infer about the target parameter $\theta$. One can easily verify that the MLE of $\theta$ is given by
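Since the transition probabilities here are quadratic (not linear) in the parameter, the MDPDE genuinely depends on $\alpha$ and must be computed numerically. The following sketch builds the Bin(2, $\theta$) transition matrix (the boundary rows are assumed purely reflecting, an illustrative convention that may differ from the exact form of (23)) and verifies Fisher consistency of the grid-search minimizer:

```python
def binom_rw(theta, k=4):
    # Transition matrix on states 0..k: interior state i moves to i-1, i, i+1
    # with Bin(2, theta) probabilities; boundary rows taken as purely
    # reflecting (an assumed convention for illustration).
    pi = [[0.0] * (k + 1) for _ in range(k + 1)]
    for i in range(1, k):
        pi[i][i - 1] = (1 - theta) ** 2
        pi[i][i] = 2 * theta * (1 - theta)
        pi[i][i + 1] = theta ** 2
    pi[0][1], pi[k][k - 1] = 1.0, 1.0
    return pi

def mdpde(pi_hat, p_hat, alpha, grid_size=2000):
    # Grid-search minimizer of the simpler MDPDE objective over theta in (0, 1).
    k = len(pi_hat) - 1
    def obj(th):
        pi = binom_rw(th, k)
        return sum(p_hat[i] * (pi[i][j] ** (1 + alpha)
                               - (1 + 1 / alpha) * pi_hat[i][j] * pi[i][j] ** alpha)
                   for i in range(k + 1) for j in range(k + 1) if pi[i][j] > 0)
    grid = [t / grid_size for t in range(1, grid_size)]
    return min(grid, key=obj)

# Fisher consistency at the model: feeding the model itself as the
# "empirical" input recovers the true parameter for any alpha > 0.
theta0 = 0.3
est = mdpde(binom_rw(theta0), [0.2] * 5, alpha=0.5)
```

With a genuinely observed sequence, one would first compute the empirical row estimates and weights as in (1) and pass them to the same minimizer.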