Log In Sign Up

A Bayesian Approach to Multi-State Hidden Markov Models: Application to Dementia Progression

People are living longer than ever before, and with this arise new complications and challenges for humanity. Among the most pressing of these challenges is to understand the role of aging in the development of dementia. This paper is motivated by the Mayo Clinic Study of Aging data for 4742 subjects since 2004, and how it can be used to draw inference on the role of aging in the development of dementia. We construct a hidden Markov model (HMM) to represent progression of dementia from states associated with the buildup of amyloid plaque in the brain, and the loss of cortical thickness. A hierarchical Bayesian approach is taken to estimate the parameters of the HMM with a truly time-inhomogeneous infinitesimal generator matrix, and response functions of the continuous-valued biomarker measurements are cutoff point agnostic. A Bayesian approach with these features could be useful in many disease progression models. Additionally, an approach is illustrated for correcting a common bias in delayed enrollment studies, in which some or all subjects are not observed at baseline. Standard software is incapable of accounting for this critical feature, so code to perform the estimation of the model described below is made available online.


page 1

page 2

page 3

page 4


Mixture of Input-Output Hidden Markov Models for Heterogeneous Disease Progression Modeling

A particular challenge for disease progression modeling is the heterogen...

Modeling disease progression in longitudinal EHR data using continuous-time hidden Markov models

Modeling disease progression in healthcare administrative databases is c...

Efficient Learning and Decoding of the Continuous-Time Hidden Markov Model for Disease Progression Modeling

The Continuous-Time Hidden Markov Model (CT-HMM) is an attractive approa...

User-driven Analysis of Longitudinal Health Data with Hidden Markov Models for Clinical Insights

A goal of clinical researchers is to understand the progression of a dis...

Learning transition times in event sequences: the Event-Based Hidden Markov Model of disease progression

Progressive diseases worsen over time and are characterised by monotonic...

Bayesian Nonparametric Models for Synchronous Brain-Computer Interfaces

A brain-computer interface (BCI) is a system that aims for establishing ...

1 Introduction

People are living longer and with this arises new complications and challenges for humanity. Among the most pressing of these challenges is of understanding the role of aging in the development of dementia. Such is the initiative of the Mayo Clinic Study of Aging (MCSA), a large prospective study with the goal of understanding the natural history of dementia and particularly Alzheimer’s Disease.

This paper is motivated entirely by the MCSA and how the resulting data can be used to draw inference on the role of aging in the development of dementia. The goal is to create a model of progression to dementia which can accommodate: (1) A wide variation in age (the dominant variable under consideration), (2) Significant fluctuation in the time between subject visits, (3) Different amount of information available for each subject (e.g., missing visits and/or clinical data), and (4) Subject specific covariates.

The main contribution of this work is to provide an innovative statistical analysis of this important and unique data set via a continuous-time, discrete-state hidden Markov model (HMM) estimated within the Bayesian paradigm. Additionally, we demonstrate the existence of and provide solutions for various methodological gaps in the analysis of disease progression for studies like the MCSA. First, we provide an approach for correcting a common bias in delayed enrollment studies which has been overlooked in the literature. Second, we introduce a methodological framework for estimating the strength and persistence of a separate death rate bias specific to death rates, which could be present in any study relying on enrollment of subjects. Our final methodological innovation is a proposed Bayesian approach to estimating the biomarker regions most associated with high/low burden states in a manner that does not require the specification of cut-points.

The term delayed enrollment, here, is used to describe a study with a given baseline (age 50 in the case of the MCSA) such that some or all subjects are observed at baseline. We demonstrate empirically that the effects of this bias cannot be ignored, and existing software is not equipped to handle this feature.

We formalize the discrete-state space exhibited in Figure 1 in which many of the states are defined by continuous biomarkers. The previous work of Jack et al. (2016) defined a state space similar to Figure 1, but in which the high/low burden biomarker states were defined by practitioner chosen, hard biomarker cut-points. Hard cut-points for discretizing continuous measurements of biological processes are practically and philosophically problematic, and have to be chosen more or less arbitrarily.

Moreover, we illustrate a general and effective framework for fitting a continuous-time, discrete-state HMM within the Bayesian paradigm, and the infinitesimal generator matrix of the underlying Markov process is allowed to be truly time-inhomogeneous (as a function of an individual’s age). Time must be treated as continuous because, as in much of medical research, subjects are often observed irregularly in time.

Our final contribution is that in addition to the effect of age, the effects of the covariates gender, number of years of education, and presence of an APOE-4 allele on the infinitesimal transition rates are also estimated. The importance of these variables has been well documented in the medical literature but their effect on aging has not been studied in this context (i.e., how they affect the transition rates between states in Figure 1). In addition to the new insights these features bring to the medical community, flexible software to fit the models described below is provided at

Our analysis builds on the work of Jack et al. (2016) with more sophisticated modeling which allows for deeper insights. They found that a Markov model of disease progression for dementia is indeed a natural approach, that almost all rates are log-linear, and at age 50 nearly everyone is in state AN (i.e., low Amyloid burden and low cortical thickness loss burden which is state 1 in Figure 1) but that soon begins to change.

Most implementations of a continuous-time, discrete-state HMM, including Jack et al. (2016) estimate parameters in a maximum likelihood fashion. However, as mentioned in Jack et al. (2016)

optimization becomes exceedingly difficult as more parameters are introduced in the model. Convergence time for standard methods may become impractical, and analytical gradient formulas for use in more efficient optimization procedures can become intractable. Additionally, it is often difficult/awkward to fit prior information into an optimization-based frequentist approach (e.g., via constrained optimization or penalty functions which require tuning parameters), and deriving confidence intervals becomes a challenge. Further, as the model becomes more complex to better capture reality in our application, prior information becomes necessary for practical identifiability of HMM parameters which makes Bayesian methodology a natural approach. For these reasons, we propose a hierarchical Bayesian framework with model estimation via Markov chain Monte Carlo (MCMC). Using MCMC for the estimation of complicated models requires creative proposal strategies, but is extremely flexible for a variety of model specifications. Moreover, credible regions become a convenient way to represent uncertainty.

Accounts of continuous-time, discrete-state HMM are given by Lange & Minin (2013); Bureau et al. (2003); Jackson et al. (2003); Titman & Sharples (2008); Jack et al. (2016), and within a Bayesian framework, Zhao et al. (2016). Further, Satten & Longini (1996) is a very complete account of how to implement such a model, and is recommended reading for anyone not familiar with the methodology.

More relevant to the present application, the work of Jackson et al. (2003) uses an HMM to model the state misclassification error of a disease, and includes age as a covariate on the transition rates. However, they make a very restrictive assumption that the transition rates are constant between subject observation times. The work of Jack et al. (2016) is seemingly the first in the literature to estimate the transition rates as a function of age, in a truly continuous-time fashion for a multi-state model of dementia, however, Yu et al. (2010)

also treated transition probabilities as a function of age, in a discrete-time fashion.

The organization of the rest of the paper is as follows. Section 2 describes the MCSA data set and the HMM methodology. An illustration of the death rate bias and the delayed enrollment bias is then given in Section 3, and a simulation study is provided in Section 4 which implements our Bayesian estimation procedure on synthetic data generated to resemble the MCSA data set. Finally, a detailed analysis of the MCSA data is presented in Section 5, accompanied by a discussion and interpretation of the biological findings. This paper also has accompanying supplementary material containing details of the computations and further MCSA analysis results.

2 Methodology

2.1 Description of the Mayo Clinic Study of Aging data

The MCSA has enrolled a large age/sex stratified random sample from Olmsted County, Minnesota. Subjects are followed forward approximately every 15 months, and clinical visits collect information on major aspects of the disease. The MCSA study began in 2004, currently has 4742 subjects, and is still ongoing (at the time of this writing) and enrolling new subjects. Estimates of quantities such as expected time in a given state, probability of ever entering a given state, or the fraction of the population that will pass through a high amyloid burden state on the way to dementia, are all of clinical interest.

It is known in the medical community that amyloid protein buildup in the brain, and significant neuro-degeneration are strongly associated with dementia. Accordingly, amyloid buildup, as measured by Pittsburgh Compound B (PIB) from a Positron Emission Tomography (PET) scan, and neuro-degeneration, as measured by cortical thickness (Thickness) from Magnetic Resonance Imaging (MRI), are continuous outcomes measured during the regular clinical visits for approximately 50 percent of the subjects.

Figure 1:

State space. Emitted response variables are displayed in brackets above the respective hidden state. A+ corresponds to high amyloid burden, and N+ corresponds to high neuro-degenerative burden. States 1-4 are all non-demented.

With regard to dementia, high amyloid burden is a notion which refers to a build-up of amyloid plaques in the brain significant enough to effect pathways and lead to neuro-degeneration, but precise measurements of the extent of amyloid protein would require autopsy (PIB measurements serve as a proxy for measuring this extent). Likewise, high neuro-degeneration burden refers to a state of loss of neurons and synapses denoted by atrophy of the cerebral cortex in Alzheimer’s-sensitive areas.

For the (approximately 50 percent of) subjects who were not chosen to undergo regular brain scans less clinical information is available. However, the Mini Mental State Exam (MMSE) is almost always observed. The MMSE is a questionnaire-based test administered by a medical professional to assess cognitive impairment on an integer scale out of 30 points (Xu et al., 2015). Furthermore, baseline data such as age, sex, clinical and genetic markers is always recorded in the data.

Finally, at the time of observation all subjects are determined to be either cognitively unimpaired, or to be demented. This represents a substantial amount of information for making inference on the underlying cognitive state of a subject. However, diagnosing dementia is not an exact science, and so the observed label is not without error.

2.2 The HMM state space and emitted response variables

A simplistic formalization of the biology is to theorize a seven-state model to describe cognitive health in relation to dementia. Figure 1 illustrates such states, and depicts the allowed transitions with directional arrows. A notable feature of the state space is that an individual must be in a high neuro-degeneration burden state (i.e., N+) to develop dementia, but not necessarily in a high amyloid burden state (i.e., A+). In fact, the transition from A+N+ (state 4) to A+Dem (state 6) is identified as Alzheimer’s Disease, a particular type of dementia. Isolating this Alzheimer’s Disease transition is not possible using the previous state space of Jack et al. (2016).

Time must be treated as continuous because patient visit times are irregular, and the underlying sequence of states visited for an individual is hidden by uncertainty (even when PIB/Thickness are available, they are only proxies for the true level of amyloid/neuro-degeneration burden). Moreover, the states of high amyloid burden and neuro-degenerative burden are not precisely defined and are best treated as hidden. It is worth remarking that amyloid build-up and neuro-degeneration each develop on a continuum, but the time spent in any intermediate states, not explicitly represented by the state space in Figure 1, is believed to be relatively short and thus ignorable in these data.

The PIB and Thickness values associated with amyloid buildup and neuro-degeneration, respectively, are used as emitted response variables (in a traditional hidden Markov model sense) to make inference on an underlying sequence of states visited. States 2, 4, and 6 which correspond to increased amyloid burden will emit PIB values from a distribution corresponding to A+, while states 1, 3, and 5 will emit PIB from the distribution corresponding to A, and similarly for neuro-degenerative burden (N+ or N).

Figure 2: Observed response data for PIB which is a measure of amyloid buildup from a PET scan, and (cortical) Thickness which is associated with neuro-degeneration. Note that the response densities for PIB correspond to the data transformation, . The component density estimates correspond to the posterior mean estimates from Section 5. The blue dashed lines represent the normal mixture density estimates.
Figure 3: Observed response data for MMSE test scores associated with the six non-death states in the state space. The component density estimates here correspond to the posterior mean estimates from Section 5. The blue dashed line represents the normal mixture density estimate.

The prior distributions for these response distribution parameters was chosen to correspond to biomarker values which are consistent with the medical community’s most up-to-date understanding of the biology. Using a Gaussian distribution for Thickness and for

appears to be quite reasonable, as there is evidence that the error from the PIB measurements follows more closely to a constant coefficient of variation than to a constant variance. Figure 

2 displays histograms of the observed response data along with the respective normal mixture densities resulting from the posterior mean estimates from the full HMM described in the coming sections.

The MMSE score serves as an additional emitted response, and a separate Gaussian emission distribution for each of the first six states is assumed (deceased subjects do not emit cognitive test scores). Figure 3 overlays the estimated six-component normal mixture density function on top of a histogram of the observed MMSE scores from the study subjects’ visits.

Lastly, a simple misclassification response model is used to allow for a probability of a dementia diagnosis given the underlying state is a dementia state (i.e., states 5 or 6), and given it is not a dementia state. Death is the only state in the state space which is known without error, and the exact time of death is known, as well.

2.3 Continuous-time transition probabilities

This section serves to specify the hidden Markov model in the context of the state space illustrated in Figure 1. For and , the probability of transitioning from state at time to state at time is denoted by Assuming these probabilities are differentiable functions in and that the Markov process is time-homogeneous, it can be shown that they satisfy the Kolmogorov forward equations (Karlin & Taylor (1981)),


where is called the transition rate matrix, and is the matrix with components


Note that can be taken to be in (2) because the probabilities are assumed for now to be time-homogeneous. The off-diagonal components of are interpreted as the change in transition probabilities for an infinitesimal amount of time into the future, i.e.,


with diagonal elements .

The forward equations in (1) have the matrix exponential solution, . However, as discussed in Section 1, the transition rates will be expressed as a function of a subject’s age at the time of transition. That is, which violates the time-homogeneity of the Markov process. A simple work around is to discretize the effect of age and assume that the transition rates only change when a subject’s integer age changes. Doing so implies that subjects’ transition rates, , remain constant between birthdays and yields


for , where represents the subject’s current age, is the time (in years) into the future, and is the floor function.

Observe that with expression (4) all transition probabilities can be computed, as long as the components of are specified. As mentioned above, the transition rates will be modeled as a function of age, gender, presence of an APOE-4 allele, and number of years of education. Specifically, denoting each of the 13 nonzero transition rates illustrated in Figure 1 by for ,



Figure 4: Annualized natural logarithm of Minnesota overall population death rates. Solid line corresponds to female, and dashed line corresponds to male.

Overall death rates over age 50 are log-linear (see Figure 4) which makes the log-linear function of age a natural starting place. This functional form was also argued in Jack et al. (2016) to be reasonable for all of the rates except the rate from AN (state 1) to A+N (state 2). They compared log-linear rate to that obtained from log-cubic splines. We came to the same conclusion and accordingly, a cubic spline is used for estimating only the rate of transition from state , with knots at ages 55, 65, 75, 90, and boundary knots at 50 and 120.

2.4 Likelihood function

If the states were known for each subject at each observation, then the contribution to the likelihood function from each subject consists of a product of matrix exponentials, i.e., apply (4) to get the probability of being in a given state for each observation time. However, the underlying state sequences are not observed. Within an HMM, rather, responses emitted from the underlying process (conditional on the true state of the process at a given point in time) are used to inform of the underlying state. In this application, there are four emitted responses (i) , (ii) Thickness, (iii) MMSE (see Figures 2 and 3), and (iv) dementia diagnosis (binary). Denote the observations for each of these four responses by , respectively, for the subject’s clinical visit (observation). Further, the subject has clinical visits, and has for each visit an (unknown) state .

Thus, the likelihood contribution from the subject at the visit can be expressed as

where the sum is taken over all possible states (since the true state sequence is unknown). If a given response is missing (e.g., missing PIB scan), then the missing value is integrated out of the likelihood (i.e., the response density of the missing value contributes a 1 to the likelihood function).

Making the standard assumption that the responses are conditionally independent given an underlying state sequence , and applying the Markov property for the state sequence gives,


is the initial state probability vector,

is a diagonal matrix with diagonal components for each in the state space, is the transition probability matrix given in (4) with denoting the (continuous) age of subject at visit , and is a column vector of ones. There are three subtle, but important features of the MCSA data that need to be addressed which result in slight modifications of this likelihood function.

Finally, in order to complete the specification of the likelihood, the form of the response density and the baseline state probability vector must be specified. It is assumed that the four responses, , , , and , are conditionally independent given state and subject specific covariates. As illustrated in Figure 2

, the transformed PIB measurements are assumed to be generated according to two normal random variables with different means. One of the means, say

, corresponds to the distribution of transformed PIB measurements for an individual in a low amyloid burden state, and the other, say , corresponds to a high amyloid burden state. Specifically,


where is the indicator function equal to 1 if and 0 otherwise. The variance of both Gaussians are assumed to be equal to aid in identifiability of the two groups. The density function for Thickness and MMSE are defined analogously.

For the Thickness response variable,


and for the MMSE response variable,



The first four covariates are the same as those in Section 2.3, and ‘ntests’ is the number of times a subject has taken the MMSE by a given clinical visit. It is observed in the medical practice that scores on the MMSE may improve as an individual becomes familiar with the exam, and so the ‘ntests’ covariate is included to control for this effect.

The probability mass function for misdiagnosis of dementia is


where if the subject was diagnosed with dementia, and if not. Accordingly, and are misclassification probabilities, with the probability of an incorrect diagnosis of dementia, and the probability of an incorrect non-diagnosis of dementia.

Lastly, the baseline state probability vector is where for , for all subjects, . As the MCSA did not enroll demented or deceased individuals, subjects would necessarily have been in states 1-4 at baseline. Accordingly, , and the last three components of must be zero.

2.4.1 Treating time of death as known without error

One feature common to many population-based studies is that death is observed without error and time of death is known exactly, so the likelihood must be modified to account for this more precise information (Satten & Longini, 1996). Suppose that subject transitions to death at time . The final term in the subject’s contribution to the likelihood can be re-expressed as follows. For a state at time , and , let,

Then is the event that the subject dies precisely at time . Further,

Thus, dividing both sides by and taking the limit as gives a likelihood function value of


evaluated at the event . The quantity in (11) is interpreted as the average probability of being in each of the first six states the instant prior to death, each weighted by the probability of transitioning to death at the next instant (given by the instantaneous transition rates ). Note that the response functions are not needed/defined when because death is assumed observed without error.

3 Population-based study challenges for an HMM

The purpose of this section is to demonstrate the existence of and provide solutions for two critical methodological gaps in the analysis of disease progression for studies like the MCSA. After describing these two overlooked sources of bias in the literature, Section 3.3 serves to demonstrate empirically on a simple synthetic data set the impact that ensues when the delayed enrollment bias is not properly addressed. We argue that the effects of this bias cannot be ignored, and existing software is not equipped to handle this feature.

3.1 The death rate bias

A common feature of many human population based studies (delayed enrollment or not), is the following death rate bias. In addition to not representing those members of the population who would have enrolled if they were not already dead (which is tied into the delayed enrollment bias), it often the case that people are much less likely to enroll in a study if they are very sick and/or dying. As a result, the death rate for the sub-population of individuals who are most likely to enroll is probably smaller than the overall population death rate, for at least the first few years after enrollment into the study. Consistent with the difference between this healthier sub-population and the overall population, this phenomenon will lead to a reduced estimate of the death rate and higher likelihoods of other paths in the state space. This represents a bias with respect to the true parameter values of the overall population.

Our proposed approach to correct for the death rate bias is to explicitly estimate the bias on the non-dementia to death rate. This can be done in a linear fashion by including two additional parameters in the log-linear rate equation (5). The first, say , will be for estimating the baseline effect of the death rate bias, and the second, say , will be for estimating the linear slope at which the death rate bias vanishes for every integer year in the study (since the time effect is discretized annually). That is, equation (5) for only the non-dementia to death rate () becomes,


where ‘iyears’ is integer years enrolled, and . The death rate bias term is only allowed to decrease the non-dementia to death rate. Further, the smallest root of is the duration for which the death rate bias persisted in the study.

The coefficients in equation (12) become identifiable due to strong prior information available for the overall population death rate. We provide evidence that this bias exists in the MCSA data by appealing to the fact that and are both estimated to be nonzero (see Figure 9), and by demonstrating in Section 4 that all parameters in equation (12) can be accurately estimated in a similar, truth known synthetic data setting.

3.2 The delayed enrollment bias

The delayed enrollment bias occurs in situations in which the first observation time for the subject, , is not necessarily equal to the baseline age of 50 years old (in the case of the MCSA). In this case, the probability of transitioning from the (unknown) underlying state at baseline to the (unknown) underlying state at the first observation must be accounted for in the likelihood function. That is, the initial state probability vector, in (2.4), is replaced with,


where , and is the initial state probability vector for a subject at baseline. The last three components are set equal to zero due to the fact that demented and dead subjects are not enrolled into the study.

If the initial probabilities from baseline to enrollment age were not conditioned on the underlying states being non-demented and non-dead, then the transition rates (especially those to dementia and death) will exhibit strong downward bias (with respect to the true population parameter values) due to the fact that neither demented nor dead subjects are enrolled. We refer to this bias as the delayed enrollment bias, and it will be illustrated empirically in Section 3.3. Relevant standard software such as the msm package in (Jackson, 2011) does allow for specification of a common baseline in a delayed enrollment study (after manually adding censored observations in the data set at baseline), but unfortunately does not offer the ability to perform conditioning (rescaling) on the initial state probability vector, as in (13).

3.3 Demonstrating the effect of a delayed enrollment bias

The Cardiac Allograft Vasculopathy (CAV) data set from the msm package was collected from a study of the progression of CAV, which is a common cause of death after heart transplant (Sharples et al., 2003). The state space as described in Jackson (2011) includes four states labeled ‘no CAV’ (state 1), ‘mild/moderate CAV’ (state 2), ‘severe CAV’ (state 3), and ‘death’ (state 4), and forward-only transitions are assumed (patients can only get worse). Observed remissions in the state of a patient is considered a result of misclassification error. In the CAV data, baseline is defined as time of heart transplant. All patients are observed at baseline, and in fact at baseline all patients are in state 1 because CAV does not develop immediately.

In this simple data set, the only response variable is the observed state at each visit which is assumed to follow a categorical distribution. In particular, given a patient is in state 1, 2, or 3 there is a nonzero probability of observing an adjacent state. Additionally, as is the case for the MCSA, death and time of death are known without error. Formally, given a true state (row) the response probability mass function takes the form given in Table 1, where the probabilities , , , and are each interpreted as the probability a of particular misclassification. Note that the rows must sum to one.

Observed state
no CAV mild severe death
True state no CAV 0 0
mild 0
severe 0 0
death 0 0 0 1
Table 1: Misclassification response function.

Time is treated as continuous and discretized annually. Integer years since heart transplant, and gender are included as covariates on the transitions rates for the CAV data,



To replicate the CAV data in a ‘known truth’ simulation, first the HMM parameter estimates were obtained using the msm package on the original CAV data. As suggested in Jackson (2011), the ‘BFGS’ quasi-Newton optimization algorithm was specified to estimate the HMM parameters with the ‘msm’ function. The parameter estimates from the CAV data set are then used as the ‘true’ values in order to generate new data sets.

Important features of the MCSA as distinct from the CAV data are that not all subjects are in state 1 at baseline, and virtually none of the subjects are observed at baseline. Thus, to more closely resemble the MCSA data, each patient in the simulated CAV data is generated with a baseline state according to the following distribution,

no CAV mild severe death
0.95 0.04 0.01 0

See the Supplementary Material for complete detail on how synthetic data was generated.

Figure 5:

Intercept coefficient estimates for a traditional study in which all subjects are enrolled at a common baseline time. ‘Bayes’ corresponds to the Bayesian posterior mean estimates, and ‘MLE (msm)’ corresponds to maximum likelihood estimates computed from the ‘msm’ function. Green dashed lines represent the true values. Coverage is the proportion of .95 probability credible intervals (confidence intervals for the MLE) which contain the true parameter value.

A sample size of 2000 was generated for 100 simulated data sets. Figure 5 shows estimates of the log-rate intercept coefficients from equation (14) using the posterior mean from the proposed Bayesian approach, and the MLE obtained via msm. The box plots are over the 100 estimates of each parameter and demonstrate that in a simple idealized setting the Bayesian and MLE estimates are very similar.

In the synthetic data example above, all patients are observed at baseline; this is the type of study for which the msm

package was designed. To illustrate the bias that ensues when subjects are not observed at baseline, the 100 data sets are generated once more from the same random generator seeds. However, instead of beginning with the initial observations at baseline (zero years after heart transplant), the time of initial observation is generated, with probability 0.75, from a Gaussian distribution with a mean of 5 integer years and a standard deviation of 1, i.e., the initial observations remain at baseline with probability 0.25. Additionally, if a patient transitions to death prior to the generated initial observation time, then the patient is not included in this

delayed enrollment study. This is a critical point because it is the cause of the bias: the delayed enrollment study is less likely to include patients with immediate adverse reactions to heart transplant. If the sample were truly representative, then all 2000 patients would be represented. However, studies typically only sample from the living. The average sample size of the 100 synthetic data sets for the delayed enrollment study is 1686.

Figure 6: Intercept coefficient estimates for a delayed enrollment study. ‘Bayes’ corresponds to the Bayesian posterior mean estimates, ‘MLE’ corresponds to the MLE obtained via optimizing the likelihood function from Section 2.4 using the ‘optim’ function in , and ‘MLE (msm)’ corresponds to maximum likelihood estimates computed from the ‘msm’ function. Green dashed lines represent the true values. Coverage is the proportion of .95 probability credible intervals (confidence intervals for the MLE) which contain the true parameter value.

In Section 3.2 the procedure for accounting for the delayed enrollment bias in the likelihood function was described. It amounts to evaluating the transition rate matrix, , (here, annually) from baseline to initial observation, and then computing the conditional probabilities for the initial states of enrollment. This conditioning feature is not available in the msm package which was not designed for a delayed enrollment study. Figure 6 illustrates the effect of ignoring this feature and estimating as though enrollment is conditional on being alive, i.e., MLE (msm). The estimates from Bayes and MLE are analogous to those in Figure 5, however, they do explicitly account for the initial probability according to (13). Without accounting for the delayed enrollment effect in the likelihood function certain estimates are significantly biased downward, suggesting slower rates of transition. This is because, of the 2000 patients, those which happened to transition quickly through the state space are no longer observed in the data set. The biases become more extreme as fewer patients are observed at baseline; recall that about 25% of the patients are still observed at baseline in this example.

The bias also filters into other HMM parameter estimates; see the Supplementary Material for the full results of these two simulation setups. The objective of this synthetic delayed enrollment study example is to demonstrate one of the crucial reasons why the analysis of the MCSA data requires methodological developments which are not readily available in standard software. The msm package, as well as other similar software, are not flawed, rather they were simply not designed for this type of application.

4 Synthetic Mayo Clinic Study of Aging data

Section 3.3 presented the results of the proposed estimation procedure in a simple idealized simulation example. In this section a more realistic simulation is presented which is intended to closely replicate the MCSA data generating process with respect to sample size, the frequency of clinical visits, and the proportion of biomarker measurements available. The three objectives are to (i) provide evidence that the synthetic data reasonably resembles the real data in an effort to verify that the data generating mechanisms of the real data are sufficiently understood, (ii) validate the estimation procedure by demonstrating that credible regions concentrate around the true parameter values, and (iii) demonstrate the reliability of the estimates with respect to the true parameter values.

The standard package for estimating an HMM with arbitrary continuous-time observations is msm (Jackson (2011)). While msm is a well written and powerful package, it does not offer Bayesian estimation options, and as discussed previously it cannot correct for the biases on the transition rates that arise from a delayed enrollment sample.

The same techniques used to generate data resembling the CAV data set are applied here, just with more features to simulate such as the additional response functions and a death rate bias. See the Supplementary Material for the particulars of how the synthetic data was generated. Before diving right into the estimation results, a few estimation details are discussed, mainly relating to prior information.

In the state space of Figure 1, there are a number of assumptions which can reasonably be made about the biology of this process. For instance, the rate of transition to the N+ state with high amyloid burden should be at least as fast as with low amyloid burden. That is, the rate parameter from for transitioning from A+N to A+N+ should be at least as big as the rate parameter for transitioning from AN to AN+. Similar constraints should be true for transitioning from low to high amyloid burden with respect to high/low neuro-degeneration burden (i.e., having N+ cannot lower the rate of transitioning to A+), and those with dementia die at a rate no less than those without dementia. These constraints are a mathematical formulation of the assumption that having a larger burden may not escalate disease progression, but it certainly cannot help. A table with the full list of rate constraints is given in the Supplementary Material.

In addition to these constraints, it is reasonable to assume that all of the age coefficients on the rate parameters (see equation (5)) are nonnegative. That is, becoming older will not slow down an individual’s rates of progression. Next, constraints are placed on the response variable parameters to help with identifiability. The mean measurement associated with the low amyloid burden state should not exceed that of the high amyloid burden state. Analogously, the mean Thickness measurement associated with high neuro-degeneration burden should be less than or equal to that for low neuro-degeneration burden (low Thickness is associated with high burden). Lastly, the mean MMSE scores should be monotone non-increasing in the state ordering , the mean MMSE score for state 1 should be no smaller than that for states 2 and 3, and the mean MMSE score for states 2 and 3 should be no smaller than that for state 4. No constraint is placed between the mean scores for states 2 and 3.

One of the advantages of the Bayesian approach of estimation is that imposing this long list of important constraints in the model can be easily handled through specification of the priors, and the priors have a natural ability to accommodate other known information about the model parameters. In particular, the MCSA sampled subjects are from the greater Rochester, MN area, and Minnesota death rates for women and men of all ages (from 1970 to 2004) are made available in the US Decennial Census, which are captured in the “” data set in the survival package in (Therneau, 2015). These gender-specific rates are used to set the prior means placed on the baseline and gender coefficients for the log death transition rate (equation (12)), see Table 2. It is assumed that all non-dementia to death rates are the same (and hence share the same coefficients in equation (12)). The fact that the A and N states do not have obvious external manifestations makes this assumption reasonable.

A multivariate normal prior is placed on the vector of HMM parameters, and multivariate normal proposals are used. The full table of prior distribution specifications for each of the 81 parameters in the model is provided in Table 2

. For parameters which are non-negative, such as the variance parameters for the response functions, the Gaussian priors are placed (and Gaussian proposals are made) on the natural logarithm of the parameter. For parameters which are constrained to be between 0 and 1, such as the dementia misclassification parameters and the initial state probability parameters, the logit transformation of the parameters is used. For example, the initial state probability vector is re-expressed as


where , , and are assigned Gaussian priors and proposals.

Although commonly done in the literature (mostly due to conjugacy), we hesitate to use inverse-gamma priors on variances for reasons discussed in Gelman et al. (2006); namely, the inverse-gamma(,) family of priors is very sensitive to the choice of , which does not naturally lend itself as weakly-informative nor uninformative. Moreover, we view our priors in general as weakly-informative in the sense of Gelman et al. (2006). That is, they are set intentionally weaker than what we believe the expert domain knowledge warrants, with the exception of the priors for parameters associated with the non-dementia to death rate and the death rate bias. Further, without having more knowledge about the shape of the prior distributions, the symmetry, exponential tails, and mathematical simplicity of Gaussian priors make them a natural choice. They allow us to be as diffuse as we believe appropriate without affording a questionably large amount of mass at the extreme values of the distributions. We investigate/discuss sensitivities of our posterior estimates to specifications of our priors in Section 5.1.

Prior means and standard deviations
State transition parameters (see (5) and (12))
non-dem7 -4.41 (.1) .094 (.01) .47 (.05) 0 (.1) 0 (1) -.75 (.375) -.60 (.3)
57 and 67 -4 (1) .1 (.05) 0 (1) 0 (.1) 0 (1)
all others -3 (1) .1 (.05) 0 (1) 0 (.1) 0 (1)
Cubic spline parameters for state 1 to state 2 as a function of age
-5 (1) -4 (2) -3 (2) -2 (2) -1 (2) 0 (3) 1 (3) 2 (3)
response (see (7)) Thickness response (see (8))
-1.3 (.2) -.5 (.2) (2) 3.14 (.2) 2.34 (.2) (2)
MMSE response (see (9))
- -
-.28 (.75) -7.3 (3) 0 (1) 0 (1) 0 (1) 0 (1) 0 (1) -.7 (2)
Dem misclass (see (10)) Initial probabilities (see (15))
-3 (1) -3 (1) -3.5 (.25) -6 (1) -6 (1)
Table 2: Displayed are the means and standard deviations of the (independent) normal priors placed on the 81 model parameters. Standard deviations are in parentheses, and these priors assume that the data have been centered. The parameters are the control points used to estimate the cubic spline for the state 1 to state 2 transition (for baseline and age), as discussed at the end of Section 2.3.

The simulation study consisted of simulating 50 synthetic data sets resembling the MCSA data. The synthetic data sets were simulated to contain similar amounts of information to the real data set. That is, 4742 subjects were simulated starting from random ages and assigned other covariates randomly from the empirical distribution of the MCSA data. The simulated subjects are “observed” at times which are determined by sampling from the actual inter-observations times in the MCSA data.

The maximum length of time in the study for subjects in the MCSA is not much longer than 12 years. Thus, for reasonable comparison, subjects in the synthetic data sets are observed for 12 years or until time of death, whichever comes first. Exactly 2718 (approximately 57%) of subjects in the MCSA data set have at least one observed biomarker measurement, and both biomarkers were observed for 1740 subjects. Of the 2718 subjects, the proportions of observed biomarkers over all visits is presented in the following table,

measured not measured
Thickness measured 0.227 0.231
not measured 0.002 0.540


To keep consistent with this feature, 43% of the synthetically generated subjects were given no biomarker data for any of their visits, while the remaining 57% were randomly given observed PIB or Thickness measurements, at each clinical visit, according to the above distribution.

Death was observed for just over 28% of the actual MCSA study subjects, and the number of clinical visits for study subjects varied between 1 and 10 visits, with a median of 4. The synthetic data sets observe death for 31% of subjects on average, and the number of clinical visits for synthetic study subjects varies between 1 and 11 visits, with a median of around 6.

Figure 7: Intercept and death rate bias coefficient estimates for the synthetic MCSA data, see (5) and (12). Note that the covariates in the data are centered. Presented are box plots of posterior means of the labeled parameters, from 50 synthetic MCSA data sets. Green dashed lines represent the true values. Coverage is the proportion of .95 probability credible intervals which contain the true parameter value.

Figure 7 provides a summary of the results in the form of box plots of the posterior mean estimates, and coverage for 95 percent credible intervals, for 11 of the more interesting model parameters. The Supplementary Material contains a more detailed summary of the results, including box plots, coverages, histograms, and MCMC trace plots for all 81 model parameters. Overall, this simulation exercise lends some confidence to the results of the actual MCSA analysis presented next.

5 Analysis of the Mayo Clinic Study of Aging Data

This section presents analysis and interpretation of the HMM parameter estimates of the actual MCSA data. Since there are 81 HMM parameters, some playing very different roles, presenting the output in a concise manner is challenging and depends on the question being asked. We present a targeted summary of a few important questions here. See the Supplementary Material for estimates of all 81 HMM parameters, including the MCMC trace plots, histograms, and 95 percent credible intervals. We parallelize our likelihood computation (within each MCMC step) over 30 threads on a computing cluster, and for 10 unique random number generator seeds we run the MCMC algorithm (i.e., using a total of 300 threads) for about 3 days on the real MCSA data set.

Figure 8: Evolution of transition probabilities. The curves represent the probability of transitioning to the respective state for each given age, computed using the posterior mean estimates of the HMM parameters. The probabilities are conditional on not transitioning to state 7 (Dead), and correspond to an individual in state 1 (AN) at the baseline age of 50. The label ‘apoe4 negative’ corresponds to an individual with no APOE-4 alleles, and ‘apoe4 positive’ corresponds to an individual with at least one APOE-4 allele. The curve labeled ‘Alzheimer’s’ depicts the probability of making the transition from state 4 (A+N+) to state 6 (A+Dem), given not dead.

The state space as it is defined in Figure 1 allows for the computation of transition probabilities broken down for particular types of dementia. Most notably, the development of Alzheimer’s Disease as defined here corresponds to a transition from state 4 (A+N+) to state 6 (A+Dem), i.e., there was Amyloid build up, prior to the neuro-degeneration leading to Dementia. See Figure 8 for the estimates of how these transition probabilities evolve over time. While Alzheimer’s Disease is slightly more prominent among females of a given age versus males of the same age, it is interesting that the likelihood of dementia (of any kind) is nearly the same, i.e., males are more likely to develop non-Alzheimer’s related dementia.

Figure 9: Posterior mean estimate of the death rate bias (12). Values are interpreted as the proportion of the population death rate that is experienced by subjects in the MCSA, for each integer year a subject is enrolled in the study. For example, subjects just enrolled in the study experience a death rate which is 31 percent of the population death rate. The dotted lines are 95% credible bands.

The estimated death rate bias for this study is also of interest, particularly due to the novel approach taken to account for it. It is observed that individuals just enrolled in the study experience a death rate which is 31 percent (posterior mean) of the population death rate and it remains lower than the rest of the population for several years after enrollment; See Figure 9. This suggests that the death rate bias cannot be ignored.

Another feature of the analysis is that it is cut-point agnostic, by design. Instead of hardwiring cut-points, the suggested cut-points in the medical literature have been used as prior information for the response distributions from high/low amyloid burden and high/low cortical thickness loss burden in the and Thickness measurements, respectively. Recall, Figure 2 displayed the estimated distribution of these response biomarkers for high/low burden states.

Observed status
Diagnosed not demented Diagnosed demented
True status Not demented 0.992 0.008 [0.007, 0.009]
Demented 0.107 [0.072, 0.152] 0.893
Table 3: Posterior mean estimates of the dementia diagnosis response parameters. The components are probabilities. Note that each row corresponds to only one parameter, but both columns have been filled in for ease of interpretability (rows must sum to one). The brackets represent 95 percent credible intervals (for the components directly corresponding to the parameters which were estimated).

Table 3 shows the estimated dementia misclassification probabilities. These estimates indicate that physicians tend to be conservative in diagnosing dementia. That is, they very seldom diagnose an individual without dementia as demented, but about 1 out of 10 individuals that truly have dementia is not diagnosed as such.


Figure 10: Posterior mean estimates of the transition rate components of at age 60. The numerical values can be interpreted as the estimated mean times (in years) to transition, conditional on age. These plots correspond to an individual with a college degree. Recall that transitions to dead from states 1-4 are constrained to be equal, and so for ease of presentation only one transition arrow is shown.


Figure 11: Posterior mean estimates of the transition rate components of at age 80.

Finally, Figures 10 and 11 provide heat maps of the state space corresponding to individual specific estimated transition intensities (posterior mean estimates). These depictions of the state space in relation to the estimated parameters captures a holistic picture of the model as a physical system. With these plots, it is easy to identify which transition rates are most intense at different ages, and to get a general sense of when rates begin to ‘heat up’. In fact, these plots can be created for every integer age greater than or equal to 50, and they are presented as a movie in the Supplementary Material.

There are many variables whose effect is known medically that could be used to validate some of the results. For example, it is observed that the rates are relatively dormant for the first two decades starting from 50 years old. After that, transition rates among the first four states seem to increase slower than the transition rates to more advanced states. Transitions particularly to A+Dem from A+N+ or from ADem are the most intense over almost all ages. APOE-4 is known to increase risk of A+ by a factor of 2-3. Based on this, the model should yield a higher rate estimate of AN (state 1) to A+N (state 2) and AN+ (state 3) to A+N+ (state 4) among APOE-4 carriers than non-carriers. Examination of Figures 10 and 11 illustrates that at both ages and for both men and women, the relationship between both of these transition rates and APOE-4 in the model output is exactly as would be expected. Additionally, the rate of A+N+ to A+Dem should be greater in APOE-4 carriers than non-carriers. This is the case in the model for both ages shown and for both men and women as would be predicted from the known biology.

5.1 Sensitivity to prior densities

As with any Bayesian analysis one must consider the sensitivity of the resulting posterior estimates with respect to the prior specifications. Since the non-dementia to death rate is known to not deviate far from the overall Minnesota death rate, tight priors serve to untangle ambiguity between the death rate and other parameters in a broad sense, or “on average”. It should be noted that the non-demented to death rate and death rate bias parameters are quite sensitive to the variances placed on their respective priors. We remark once more that the methodology we propose in Section 3.1 for estimating/correcting the death rate bias relies on strong prior information available for the overall population death rate. In investigating the sensitivity of the HMM parameter estimates to the priors for the non-dementia to death rate parameters we find that the only unstable parameter estimates are those associated with the rate itself and the death rate bias, due to a lack of identifiability.

To study sensitivities to the specification of all priors other than those related to the non-dementia to death rate, we increased all other prior standard deviations by a factor of 10 and re-computed the HMM parameter estimates. See the Supplementary Material to compare the posterior estimates between the more and less diffuse prior specifications. It was observed that the estimated posterior distributions of each parameter were largely unchanged, with the exception of the log-cubic spline parameters for the transition from state 1 to state 2, and the initial state probabilities. This can most likely be attributed to the large amount of flexibility afforded by a cubic spline, and a degree of un-identifiability between the cubic spline and initial state parameters because all other parameter estimates are unaffected. Moreover, the original tighter priors set in Table 2 for the initial state probabilities are well-informed and reasonable, according to population data.

6 Conclusions & Future work

A continuous-time HMM was developed for the analysis of the MCSA data. Much care was taken to make this model as realistic of an approximation to the actual data generating process as possible, including the treatment of important features such as delayed enrollment and death rate bias. A Bayesian computational framework was developed in order to facilitate computation and quantification of uncertainty, as well as allow for essential prior information on many of the parameters. The model and its estimation performance was validated via several simulation studies, prior to presenting the results of its application to the MCSA data. Several important findings were that (i) the delayed enrollment and death rate bias play a significant role in this study, (ii) females of a certain age are more prone to Alzheimer’s related dementia than male counterparts, but they are less prone to dementia, in general, and (iii) individuals with at least one APOE-4 allele are more than twice as likely to develop Alzheimer’s than those with no alleles.

Our work builds on the simpler Markov model of the MCSA from Jack et al. (2016), and could be viewed as a competing model. We include the additional covariates for gender, education, and presence of an APOE-4 allele on the state transition rates, and introduce the emitted response variables associated with amyloid and cortical thickness to allow for agnostic biomarker cut-points. Additionally, we consider the emitted response variables for MMSE scores and physician diagnosis misclassification of dementia, as well as provide methodology for estimating/correcting the death rate bias. The model from Jack et al. (2016) used fixed death rates from Minnesota population data. These added features of our model allow for greater flexibility in the theorized data generating model, and deeper understanding of the biology. Admittedly, though, if the proposed model became much more complex with additional covariates (or other features) it would require some decisions about which covariates to include in which equations. Stochastic search variable selection is a viable option as it has been used successfully on event rate models (Storlie et al., 2013; George & McCulloch, 1993). Other options include spike and slab priors (Ishwaran et al., 2005), or computing competing models and comparing them with some criterion such as the Widely Applicable Information Criterion (WAIC, see Watanabe (2010)).

One limitation of the proposed approach is that it does not allow for rates to vary depending on the extent of Amyloid (or cortical thickness) burden. It is known that once there is sufficient Amyloid build up, then the rate of neuro-degeneration is elevated, however, this effect may not be constant across all levels of Amyloid build up. This could be tested by allowing for multiple discrete Amyloid states (e.g., low, medium, high). However, this feature would be best addressed with a continuous-state space for both Amyloid and cortical thickness burden. This is a subject of future work.

An additional remark is that the various constraints placed within our model such as on the transition rates and on the response function parameters are rather stringent, but they reflect expert domain knowledge. Moreover, constraints such as those on the PIB and Thickness mean parameters simply make the parameters identifiable. Nonetheless, imposition of such constraints should not be decided ad hoc. And when concerns arise as to the reasonability of various constraints, sensitivity tests should be performed to more fully understand the implications. A similar remark applies to prior specifications, and as we noted in Section 5.1 we find that our estimated model, particularly for parameters associated with the death rate and death rate bias, are heavily reliant on the availability of strong prior information for population death rates.

Lastly, it is likely the case that our computations could be made more efficient by implementing Hamiltonian Monte Carlo methods. This paper lends credibility to and provides verification for our theorized model and features of the MCSA data. Accordingly, a natural next step is to sharpen our computational strategies which would facilitate deeper explorations of the features of the data.


  • (1)
  • Bureau et al. (2003) Bureau, A., Shiboski, S. & Hughes, J. P. (2003), ‘Applications of continuous time hidden markov models to the study of misclassified disease outcomes’, Statistics in Medicine 22(3), 441–462.
  • Gelman et al. (2006) Gelman, A. et al. (2006), ‘Prior distributions for variance parameters in hierarchical models (comment on article by browne and draper)’, Bayesian analysis 1(3), 515–534.
  • George & McCulloch (1993) George, E. I. & McCulloch, R. E. (1993), ‘Variable selection via gibbs sampling’, Journal of the American Statistical Association 88(423), 881–889.
  • Ishwaran et al. (2005) Ishwaran, H., Rao, J. S. et al. (2005), ‘Spike and slab variable selection: frequentist and bayesian strategies’, The Annals of Statistics 33(2), 730–773.
  • Jack et al. (2016) Jack, C. R., Therneau, T. M., Wiste, H. J., Weigand, S. D., Knopman, D. S., Lowe, V. J., Mielke, M. M., Vemuri, P., Roberts, R. O., Machulda, M. M., Senjem, M. L., Gunter, J. L., Rocca, W. A. & Petersen, R. C. (2016), ‘Transition rates between amyloid and neurodegeneration biomarker states and to dementia: a population-based, longitudinal cohort study’, Lancet Neurol 15, 56–64.
  • Jackson (2011) Jackson, C. H. (2011), ‘Multi-state models for panel data: The msm package for r’, Journal of Statistical Software 38(8).
  • Jackson et al. (2003) Jackson, C. H., Sharples, L. D., Thompson, S. G., Duffy, S. W. & Couto, E. (2003), ‘Multistate markov models for disease progression with classification error’, The Statistician 52(2), 193–209.
  • Karlin & Taylor (1981) Karlin, S. & Taylor, H. M. (1981), A second course in stochastic processes, Academic Press, New York, NY.
  • Lange & Minin (2013) Lange, J. M. & Minin, V. N. (2013), ‘Fitting and interpreting continuous-time latent markov models for panel data’, Statistics in medicine 32(26), 4581–4595.
  • Satten & Longini (1996) Satten, G. A. & Longini, I. M. (1996), ‘Markov chains with measurement error: Estimating the ‘true’ course of a marker of the progression of human immunodeficiency virus disease’, Journal of the Royal Statistical Society. Series C (Applied Statistics) 45(3), 275–309.
  • Sharples et al. (2003) Sharples, L. D., Jackson, C. H., Parameshwar, J., Wallwork, J. & Large, S. R. (2003), ‘Diagnostic accuracy of coronary angiography and risk factors for post–heart-transplant cardiac allograft vasculopathy’, Transplantation 76(4), 679–682.
  • Storlie et al. (2013) Storlie, C. B., Michalak, S. E., Quinn, H. M., Dubois, A. J., Wender, S. A. & Dubois, D. H. (2013), ‘A Bayesian reliability analysis of neutron-induced errors in high performance computing hardware’, Journal of the American Statistical Association 108(502), 429–440.
  • Therneau (2015) Therneau, T. M. (2015), A Package for Survival Analysis in S. version 2.38.
  • Titman & Sharples (2008) Titman, A. C. & Sharples, L. D. (2008), ‘A general goodness-of-fit test for markov and hidden markov models’, Statistics in medicine 27(12), 2177–2195.
  • Watanabe (2010) Watanabe, S. (2010), ‘Asymptotic equivalence of bayes cross validation and widely applicable information criterion in singular learning theory’,

    Journal of Machine Learning Research

    11(Dec), 3571–3594.
  • Xu et al. (2015) Xu, X., Chong, E., Hilal, S., Ikram, M. K., Venketasubramanian, N. & Chen, C. (2015), ‘Beyond screening: Can the mini-mental state examination be used as an exclusion tool in a memory clinic?’, Diagnostics 5(4), 475–486.
  • Yu et al. (2010) Yu, L., Griffith, W. S., Tyas, S. L., Snowdon, D. A. & Kryscio, R. J. (2010), ‘A nonstationary markov transition model for computing the relative risk of dementia before death’, Statistics in Medicine 29, 639–648.
  • Zhao et al. (2016) Zhao, T., Wang, Z., Cumberworth, A., Gsponer, J., de Freitas, N. & Bouchard-Côté, A. (2016), ‘Bayesian analysis of continuous time markov chains with application to phylogenetic modeling’, Bayesian Analysis 11(4), 1203–1237.