# JUMP-Means: Small-Variance Asymptotics for Markov Jump Processes

Markov jump processes (MJPs) are used to model a wide range of phenomena from disease progression to RNA path folding. However, maximum likelihood estimation of parametric models leads to degenerate trajectories and inferential performance is poor in nonparametric models. We take a small-variance asymptotics (SVA) approach to overcome these limitations. We derive the small-variance asymptotics for parametric and nonparametric MJPs for both directly observed and hidden state models. In the parametric case we obtain a novel objective function which leads to non-degenerate trajectories. To derive the nonparametric version we introduce the gamma-gamma process, a novel extension to the gamma-exponential process. We propose algorithms for each of these formulations, which we call JUMP-means. Our experiments demonstrate that JUMP-means is competitive with or outperforms widely used MJP inference approaches in terms of both speed and reconstruction accuracy.

• 21 publications
• 40 publications
• 10 publications
• 29 publications
01/26/2018

### Parameter Estimation for Weak Variance-Alpha-Gamma Processes

The weak variance-alpha-gamma process is a multivariate Lévy process con...
01/25/2018

### Nonparametric Hawkes Processes: Online Estimation and Generalization Bounds

In this paper, we design a nonparametric online algorithm for estimating...
08/28/2021

### Nonparametric estimation of the incubation time distribution

We discuss nonparametric estimators of the distribution of the incubatio...
12/10/2012

### MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

The classical mixture of Gaussians model is related to K-means via small...
07/26/2017

### Dynamic Clustering Algorithms via Small-Variance Analysis of Markov Chain Mixture Models

Bayesian nonparametrics are a class of probabilistic models in which the...
03/30/2021

### Quantifying effect of geological factor on distribution of earthquake occurrences by inhomogeneous Cox processes

Research on the earthquake occurrences using a statistical methodology b...
11/30/2020

### Calibration for Multivariate Lévy-Driven Ornstein-Uhlenbeck Processes with Applications to Weak Subordination

Consider a multivariate Lévy-driven Ornstein-Uhlenbeck process where the...

## 1 Introduction

Markov jump processes (MJPs) are continuous-time, discrete-state Markov processes in which state durations are exponentially distributed according to state-specific rate parameters. A stochastic matrix controls the probability of transitioning between pairs of states. MJPs have been used to construct probabilistic models either when the state of a system is observed directly, such as with disease progression

(Mandel, 2010) and RNA path folding (Hajiaghayi et al., 2014), or when the state is only observed indirectly, as in corporate bond rating (Bladt & Sørensen, 2009). For example, consider the important clinical task of analyzing physiological signals of a patient in order to detect abnormalities. Such signals include heart rate, blood pressure, respiration, and blood oxygen level. For an ICU patient, an abnormal state might be the precursor to a cardiac arrest event while for an epileptic, the state might presage a seizure (Goldberger et al., 2000). How can the latent state of the patient be inferred by a Bayesian modeler, so that, for example, an attending nurse can be notified when a patient enters an abnormal state? MJPs offer one attractive approach to analyzing such physiological signals.

Applying an MJP model to physiological signals presents a challenge: the number of states is unknown and must be inferred using, for example, Bayesian nonparametric methods. However, efficient inference in nonparametric MJP models is a challenging problem, where existing methods based on particle MCMC scale poorly and mix slowly (Saeedi & Bouchard-Côté, 2011)

. Current optimization-based methods such as expectation maximization (EM) are inapplicable if the state size is countably infinite; hence, they cannot be applied to Bayesian nonparametric MJP models, as we would like to do for physiological signals.

Furthermore, although MJPs are viewed as more realistic than their discrete-time counterparts in many fields (Rao & Teh, 2013), degenerate solutions for the maximum likelihood (ML) trajectories for both directly and indirectly observed cases (Perkins, 2009), and non-existence of the ML transition matrix (obtained from EM) for some indirectly observed cases (Bladt & Sørensen, 2009) present inferential challenges. Degenerate ML trajectories occur when some of the jump times are infinitesimal, which severely undermines the practicality of such approaches. For instance, a trajectory which predicts a patient’s seizure for an infinitesimal amount of time is of limited use to the medical staff. Fig. 3 shows an example of the degeneracy problem.

In this paper, we take a small-variance asymptotics (SVA) approach to develop an optimization-based framework for efficiently estimating the most probable trajectories (states) for both parametric and nonparametric MJP-based models. Small-variance asymptotics has recently proven to be useful in estimating the parameters and inferring the latent states in rich probabilistic models. SVA extends the well-known connection between mixtures of Gaussians and -means: as the variances of the Gaussians approach zero, the maximum a posteriori solution to the mixture of Gaussians model degenerates to -means solution (Kulis & Jordan, 2012). The same idea can be applied to obtain well-motivated objective functions that correspond to a latent variable model for which scalable inference via standard methods like MCMC is challenging. SVA has been applied to (hierarchical) Dirichlet process mixture models (Kulis & Jordan, 2012; Jiang et al., 2012), Bayesian nonparametric latent feature models (Broderick et al., 2013)

, hidden Markov models (HMMs), and infinite-state HMMs

(Roychowdhury et al., 2013).

We apply the SVA approach to both parametric and Bayesian nonparametric MJP models to obtain what we call the JUMP-means objective functions. In the parametric case, we derive a novel objective function which does not suffer from maximum likelihood’s solution degeneracy, leading to more stable and robust inference procedures in both the directly observed and hidden state cases. Infinite-state MJPs (iMJPs) are constructed from the hierarchical gamma-exponential process (HEP) (Saeedi & Bouchard-Côté, 2011). In order to apply SVA to iMJPs, we generalize the HEP to obtain the first deterministic procedure (we know of) for inference in Bayesian nonparametric MJPs.

We evaluate JUMP-means on several synthetic and real-world datasets in both the parametric and Bayesian non-parametric cases. JUMP-means performs on par with or better than existing methods, offering an attractive speed-accuracy tradeoff. We obtain significant improvements in the non-parametric case, gaining up to a 20% reduction in mean error on the task of observation reconstruction. In summary, the JUMP-means approach leads to algorithms that 1) are applicable to MJPs with Bayesian nonparametric priors; 2) provide non-degenerate solutions for the most probable trajectories; and 3) are comparable to or outperform other standard methods of inference both in terms of speed and reconstruction accuracy.

## 2 Background

### 2.1 Markov Jump Processes

A Markov jump process (MJP) is defined by (a) a finite (or countable) state space, which we identify with the integers

; (b) an initial state probability distribution

; (c) a (stochastic) state transition matrix with for all

; and (d) a state dwell-time rate vector

. The process begins in a state . When the process enters a state , it remains there for a dwell time that is exponentially distributed with parameter . When the system leaves state , it transitions to state with probability .

A trajectory of the MJP is a sequence of states and a dwell time for each state, except for the final state: . Implicitly, (and thus

) is a random variable such that

and the system is in state at time . Let and be the sequences of states and times corresponding to . The probability of a trajectory is given by

where and is the indicator function. In many cases when the states are directly observed, the initial state and the final state are observed, in which case it is straightforward to obtain a likelihood from (2.1).

A hidden state MJP (HMJP) is an MJP in which the states are observed indirectly according to a likelihood model , , , where is some observation space. The times of the observations are chosen independent of , so the probability of the observations is given by where, with an abuse of notation, we write for the state of the MJP at time .

### 2.2 Previous Approaches to MJP Inference

There are a number of existing approaches to inference and learning in MJPs. An expectation-maximization (EM) algorithm can be derived, but it cannot be applied to models with countably infinite states, so it is not suitable for iMJPs (Lange, 2014) (iMJPs are detailed in Section 4). Moreover, with discretely observed data, the maximum-likelihood estimate with finite entries for the transition matrix obtained from EM may not exist (Bladt & Sørensen, 2005).

Maximum likelihood inference amounts to finding , which can be carried out efficiently using dynamic programming (Perkins, 2009). However, maximum likelihood solutions for the trajectory are degenerate: only an infinitesimal amount of time is spent in each state, except for the state visited with the smallest rate parameter (i.e., longest expected dwell time). Such a solution is unsatisfying and unintuitive because the dwell times are far from their expected values. Thus, maximum likelihood inference produces results that are unrepresentative of the model behavior.

Markov chain Monte Carlo methods have also been developed, but these can be slow and their convergence is often difficult to diagnose (Rao & Teh, 2013). Recently, a more efficient Monte Carlo method was proposed in Hajiaghayi et al. (2014) which is based on particle MCMC (PMCMC) methods (Andrieu et al., 2010). This approach addresses the issue of efficiency, but since it marginalizes over the jump points, it cannot provide probable trajectories.

### 2.3 Small-variance Asymptotics

Consider a Bayesian model in which the likelihood terms contain a variance parameter . Given some data , a point estimate for the parameters and latent variables of the model can obtained by maximizing the posterior , resulting in a maximum a posteriori (MAP) estimate. In the SVA approach (Broderick et al., 2013), the MAP optimization is considered in the limit as the likelihood variance parameter is taken to zero:

. Typically, the small-variance limit leads to a much simpler optimization than the MAP optimization with non-zero variance. For example, the MAP objective for a Gaussian mixture model simplifies to the

-means objective.

## 3 Parametric MJPs

### 3.1 Directly Observed MJP

Consider the task of inferring likely state/dwell-time sequences given , the times at which the system was directly observed and the states of the system at those times. For simplicity we assume that and that all times are in the interval . Let be the state of the system following trajectory at time . The likelihood of a sequence is

 \omit\span\omit\span\omit\@ADDCLASSltxeqnlefteqn$ℓ(\mcU\given\mcO,P,λ)=\ind[t⋅≤T]I∏i=1\ind[s(\mcU,~ti)=~si]$×(K∏k=1λsk−1e−λsk−1tk−1psk−1sk)e−λsK(T−t⋅)

We also place a gamma prior on the rate parameters

(detailed below). Instead of relying on MAP estimation, we apply a small variance asymptotics analysis to obtain a more stable objective function. Following

(Jiang et al., 2012), we scale the distributions by an inverse variance parameter and then maximize the scaled likelihood and prior in the limit (i.e., as the variance goes to zero).

Scaling the exponential distribution produces the two-parameter family with

which is the density of a gamma distribution with shape parameter

and rate parameter . Hence, the mean of the scaled distribution is and its variance is . Letting denote the CDF corresponding to , we have , where is the upper incomplete gamma function.

The multinomial distribution is scaled by the parameter . Writing the likelihood with the scaled exponential families (and dropping indicator variables) yields:

The modified likelihood is for a jump process which is no longer Markov when . We also place a prior on each and set . It can be shown (see the Supplementary Material for details) that, when , the MAP estimation problem becomes

The optimization problem (A) is very natural and offers far greater stability than maximum likelihood optimization. As with maximum likelihood, the terms penalize transitions with small probability. The term is convex and minimized when , the expected value of the dwell time for state . As , approaches , while for , grows approximately linearly. Thus, times very close to zero are heavily penalized while times close to the expected dwell time are penalized very little. The term penalizes the time spent in state so far in the same manner as a regular dwell time when is greater than the expected value of the dwell-time. However, when is less than the expected value there is no cost, which is quite natural since the system may remain in state for longer than — i.e., there should not be a large penalty for being less than its expected value. Finally, parameters and have a very natural interpretation (cf. (5) below): they correspond to a priori having dwell times of length for each state.

Comparison to Maximum Likelihood MJP trajectories estimated using maximum likelihood (MLE) are usually trivial, with the system spending almost all its time in a single state (with the smallest ), with infinitesimal dwell times for the other states. This poor behavior of MLE is due to the fact that the mode of , which is favored by the MLE, is 0, even though the mean is .111Note that placing priors on the rate parameters, as we do, does not affect the degeneracy of the ML trajectory. The SVA optimization, on the other hand, does give trajectories that are representative of the true behavior because the SVA terms of the form are optimized at (i.e., at the mean of ). We demonstrate the superior behavior of the SVA in the concrete example of estimating disease progression in patients in Section 5.

### 3.2 Hidden State MJP

For an HMJP, the likelihood of a valid trajectory is

 p(\mcU\given\mcX,τ,P,λ)=(L∏ℓ=1p(xℓ\givensτℓ))×(K∏k=1λsk−1e−λsk−1tk−1psk−1sk)e−λsK(T−t⋅).

Hence, the only difference between the directly observed case and the HMJP is the addition of the observation likelihood terms. Because multinomial observations are commonly used in MJP applications, that is the case we consider here. Let denote the number of possible observations and be the probability of observing when . The observation likelihoods are scaled in the same manner as the transition probabilities, but with . Thus, for the HMJP, we obtain:

### 3.3 Algorithm

Optimizing the JUMP-means objectives in (A) and (3.2) is non-trivial due to the fact that we do not know the number of jumps in the MJP, and the combinatorial explosion in the sequences with the number of jump points. The terms involving the continuous variables (dwell times) present an additional complexity.

We therefore resort to an alternating minimization procedure to optimize the JUMP-means objective function, similar in spirit to the one used in Roychowdhury et al. (2013). In each iteration of the optimization process, we first use a modified Viterbi algorithm to obtain the most likely state sequence. Then, we use convex optimization to distribute the jump points optimally with respect to the values from for the current state sequence.

Directly Observed MJP When optimizing (A), there may be many sequences (’s) available, representing distinct realizations of the process. We use the following algorithm to optimize (A):

1. [label=0., leftmargin=*]

2. Initialize the state transition matrix and rate vector with uniform values.

3. For every observation sequence , instantiate the jump points by adding one jump point between every pair of observations, in addition to the start and end points.

4. For each , use a modified Viterbi algorithm. to find the best state sequence to optimize (A), while keeping the jump points fixed. The modified algorithm includes the dwell time penalty terms, which are dependent upon the assignment of states to the time points.

5. Optimize the dwell times with the state sequences of the trajectories fixed.

6. Optimize and with the other variables fixed. The optimal values can be obtained in closed form. For example, if there is only a single observation sequence with corresponding inferred trajectory , then

where denotes to the number of transitions from state to state in .

7. Repeat steps 3-5 until convergence.

Beam Search Variant We note that the optimization procedure just described is restrictive since the number of jump points is fixed and the jump points are constrained by the observation boundaries. To eliminate this, we also tested a beam search variant of the algorithm to allow for the creation and removal of jump points, but found it did not have much impact in our experiments.

Hidden State MJP The algorithm to optimize the hidden state MJP JUMP-means objective (3.2) is similar to that for optimizing (A), but with three modifications. First, in place of , we have the indirect observations of the states . Second, observation likelihood terms containing are included in the objective minimized by the Viterbi optimization (step 3). Finally, an additional update is performed in step 5 for each of the observation distributions :

for and . If each is initialized to be uniform, then the algorithm converges to a poor local minimum, so we add a small amount of random noise to each uniform .

## 4 Bayesian Nonparametric MJPs

We now consider the Bayesian nonparametric MJP (iMJP) model. The iMJP is based on the hierarchical gamma-exponential process (HEP), which is constructed from the gamma-exponential process (EP). We denote the Moran gamma process with base measure and rate parameter by  (Kingman, 1993). The HEP generates a state/dwell-time sequence (with assumed known) according to (Saeedi & Bouchard-Côté, 2011):

where is the base probability measure, is a concentration parameter, , , and denotes the total mass of the measure . As in the parametric case, we must replace the exponential distribution in (4

) with the scaled exponential distribution. After an appropriate scaling of the rest of the hyperparameters, we obtain the hierarchical gamma-gamma process (H

). The definition and properties of the H are given in the Supplementary Material.

Let denote the number of used states, the number of transitions out of state , and the mass on the -th component of the measure . For , let and for , let . Let be the waiting times following state and define . In order to retain the effects of the hyperparameters in the asymptotics, set and . It can then be shown that (see the Supplementary Material for details), when , the iMJP MAP estimation problem becomes

Like its parametric counterpart, the Bayesian nonparametric cost function penalizes dwell durations very close to zero via the terms. In addition, there are penalties for the number of states and the state transitions. The observation likelihood term in (4) favors the creation of new states to minimize the JUMP-means objective, while the state penalty and the non-linear penalty term counteracts the formation of a long tail of states with very few data points. The hyperparameter introduces an additional, nonlinear cost for each additional state — if a state is occupied for time, then the term for that state does not have much effect on the cost. The KL divergence terms between and arise from the hierarchical structure of the prior, biasing the transition probabilities to be similar to the prior .

### 4.1 Algorithm

For the iMJP case, we have the extra variables and to optimize. In addition, the number of variables to optimize depends on the number of states in our model. The major change in the algorithm from the parametric case is that we must propose and then accept or reject the addition of new states. We propose the following algorithm for optimizing the iMJP:

1. [label=(0), leftmargin=*]

2. Initialize , and with uniform values and set the number of states .

3. For each observation sequence, apply the Viterbi algorithm and update the times using the new objective function in (4), analogously to steps (3) and (4) in the parametric algorithm.

4. Perform MAP updates for (as in (3.3)) and :

5. For every state pair , form a new state by considering all transitions from to and reassigning all observations that were assigned to to the new state. Update and to estimate the overall objective function for every new set of states formed in this way and accept the state set that minimizes the objective. If no such set exists, do not create a new state and revert back to the old and .

6. Repeat steps 2-4 until convergence.

If instead of multinomial observations we have Gaussian observations, the parameter is replaced with the mean parameter . In this case, we update the mean for each state using the data points assigned to the state, similar to the procedure for -means clustering (see, e.g., Jiang et al. (2012); Roychowdhury et al. (2013)).

## 5 Experiments

In this section we provide a quantitative analysis of the JUMP-means algorithm and compare its performance on synthetic and real datasets with standard inference methods for MJPs. For evaluation, we consider multiple sequences of discretely observed data and randomly hold out a subset of the data. We report reconstruction error for performance comparison.

### 5.1 Parametric Models

For the parametric models, we compare JUMP-means to maximum likelihood estimation of the MJP parameters learned by EM (Asger & Ledet, 2005), the MCMC method proposed in Rao & Teh (2013) and a simple baseline where we ignore the sequential structure of the data. We run three sets of experiments (2 synthetic, 1 real) for our evaluation.

Synthetic 1: Directly Observed States For evaluating the model on a directly observed process, we generate 100 different datasets randomly from various MJPs with 10 states. To generate each dataset, we first generate the rows of the transition probability matrix and transition rates independently from and , respectively. Next, given the rates and transition probabilities for each dataset, we sample 500 sequences of length 20. We hold out 30% of the observations at random for testing reconstruction error.

We run JUMP-means by initializing the algorithm with a uniform transition matrix and set the rate vector to be all ones. We run 300 iterations of the algorithm described in Section 3.3; each iteration is one scan through all the sequences. We set the hyperparameters , and equal to 1, 1, and .5, respectively. For MCMC, we initialize the jump points using the time points of the observations. We place independent priors on and independent priors on . We initialize EM with a uniform and an all-ones . We run both MCMC and EM for 300 iterations, then reconstruct observations using the Bayes estimator approximated from the 300 posterior samples. For our baseline we use the most common observation in the dataset as an estimate of the missing observations.

Table 1 gives the mean reconstruction error across sequences for the various methods. Note that JUMP-means performs better than MCMC, and is almost on par with EM. Fig. 2(a) shows the average error across all the datasets for each method versus number of iterations. In terms of CPU time, each iteration of JUMP-means (Java), EM (Java), and MCMC (Python) takes 0.3, 1.61 and 42 seconds, respectively. We also ran experiments with the beam search variant described in Section 3.3; however, we did not obtain any significant improvement in results.

Synthetic 2: Hidden States For the hidden state case, we generate 100 different datasets for MJPs with 5 hidden and 5 observed states, with varying parameters as above. In each dataset there are 500 sequences of length 20. In addition to parameters in the directly observed case, we generate observation likelihood terms for each state from .

We initialize the transition probabilities and the rate vectors for JUMP-means, MCMC and EM in a fashion similar to the directly observed case. For the observation likelihood , we use

as a prior for MCMC, uniform distributions for EM initialization and a uniform probability matrix with a small amount of random noise for JUMP-means initialization. We set

as before and to 1.

We run each algorithm for 300 iterations. For JUMP-means, we use the hidden state MJP algorithm described in Section 3.3. Table 1 and Fig. 2(b) again demonstrate that JUMP-means outperforms MCMC by a large margin and performs comparably to EM. The poor performance of MCMC is due to slow mixing over the parameters and state trajectories. The slow mixing is a result of the coupling between the latent states and the observations, which is induced by the observation likelihood.

Disease Progression in Multiple Sclerosis (MS) Estimating disease progression and change points in patients with Multiple Sclerosis (MS) is an active research area (see, e.g., Mandel (2010)). We can cast the progression of the disease in a single patient as an MJP, with different states representing the various stages of the disease. Obtaining the most-likely trajectory for this MJP can aid in understanding the disease progression and enable better care.

For our experiments, we use a real-world dataset collected from a phase III clinical trial of a drug for MS. This dataset tracks the progression of the disease for 72 different patients over three years. We randomly hold out 50 of the observations and evaluate on the observation reconstruction task. The observations are values of a disability measure known as EDSS, recorded at different time points. Initialization and hyperparameters are the same as Synthetic 1.

Table 1 shows that JUMP-means significantly outperforms MCMC, achieving almost a 50% relative reduction in reconstruction error. JUMP-means again achieves comparable results with EM. Fig. 3 (top panel) provides an example of the latent trajectories from JUMP-means and maximum likelihood estimate for a single patient. The MLE trajectory includes two infinitesimal dwell times, which do not reflect realistic behavior of the system (since we do not expect a patient to be in a disease state for an infinitesimal amount of time). On the other hand, the trajectory produced by JUMP-means takes into account the dwell times of the various stages of the disease and provides a more reasonable picture of its progression.

### 5.2 Nonparametric Model

Vital Signs Monitoring (MIMIC) We now consider a version of the problem of understanding physiological signals discussed in the introduction. We use data from the MIMIC database (Goldberger et al., 2000; Moody & Mark, 1996), which contains recordings of several vital parameters of ICU patients. Specifically, we consider blood pressure readings of 69 ICU patients collected over a 24-hour period and sub-sample observation sequences of length 32 for each patient, keeping the start and end times fixed.222We use a small dataset for testing since PMCMC cannot easily scale to larger datasets. For testing, we randomly hold out 25% of the observations.

To initialize JUMP-means, we choose uniform matrices for and and set = 1. The hyperparameters and are set to 5, while , and are set to 0.005. Using a Gaussian likelihood model for the observations, we run our model for 50 iterations. We compare with particle MCMC (PMCMC) (Andrieu et al., 2010) and EM. PMCMC is a state-of-the-art inference method for iMJPs (Saeedi & Bouchard-Côté, 2011), which we run for 300 iterations with 100 particles. For PMCMC, we first categorize the readings into the standard four categories for blood pressure provided by NIH. We run EM with a number of hidden states from 1 to 12 and report the best performance among all the results. For initializing the EM, we use the same setting as the Synthetic 2 case.

For evaluation, we consider the time point of a test observation and categorize the mean of the latent state at this time point (using the same categories obtained above) to compare against the actual category. Table 1 shows that JUMP-means significantly outperforms PMCMC and obtains a 21% relative reduction in average error rate. Fig. 2(c) plots the error against iterations of both algorithms. In terms of CPU time, each iteration of JUMP-means (Java) and PMCMC (Java) takes 0.17 and 1.95 seconds, respectively. Compared to EM’s error rate of 25.7%, JUMP-means reaches a rate of 24.3% without the need to separately train for different number of states. The second-best result for the EM had an error of 45%, which shows the importance of model selection when using EM.

Fig. 3 (bottom) provides an example of the latent trajectory inferred by JUMP-means. The observations are uniquely colored by the latent state they are assigned. We note that the model captures different levels of blood pressure readings and provides a non-degenerate latent trajectory.

Hyperparameters A well-known problem when applying SVA methods is that there are a number of hyperparameters to tune. In our objective functions, some of these hyperparameters (, , and ) have natural interpretations so prior knowledge and common sense can be used to set them, but others do not. Fig. 4 shows histograms over the errors we obtain for runs of JUMP-means on the MS and MIMIC datasets with different settings. We can see that a significant fraction of the runs converge to the minimum error, while some settings — in particular when the hyperparameters were of different orders of magnitude — led to larger errors. Hence, the sensitivity study indicates the robustness of JUMP-means to the choice of hyperparameters.

Scaling Fig. 5 shows the total runtime and reconstruction error of the non-parametric JUMP-means algorithm on increasingly large amounts of synthetic data. The algorithm is able to handle up to a million data points with the runtime scaling linearly with data size. Furthermore, the error rate decreases significantly as the amount of data increases. See the Supplementary Material for further details.

## 6 Conclusion

We have presented JUMP-means, a new approach to inference in MJPs using small-variance asymptotics. We derived novel objective functions for parametric and Bayesian nonparametric models and proposed efficient algorithms to optimize them. Our experiments demonstrate that JUMP-means can be used to obtain high-quality non-degenerate estimates of the latent trajectories. JUMP-means offers attractive speed-accuracy tradeoffs for both parametric and nonparametric problems, and achieved state-of-the-art reconstruction accuracy on nonparametric problems.

#### Acknowledgments

Thanks to Monir Hajiaghayi, Matthew Johnson, and Tejas Kulkarni for helpful comments and discussions. JHH was supported by the U.S. Government under FA9550-11-C-0028 and awarded by the DoD, Air Force Office of Scientific Research, National Defense Science and Engineering Graduate (NDSEG) Fellowship, 32 CFR 168a.

## References

• Andrieu et al. (2010) Andrieu, C., Doucet, A., and Holenstein, R. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010.
• Asger & Ledet (2005) Asger, H. and Ledet, J. J. Statistical inference in evolutionary models of dna sequences via the em algorithm. Statistical Applications in Genetics and Molecular Biology, 4(1):1–22, 2005.
• Banerjee et al. (2005) Banerjee, A., Merugu, S., Dhillon, I. S., and Ghosh, J. Clustering with bregman divergences.

The Journal of Machine Learning Research

, 6:1705–1749, 2005.
• Bladt & Sørensen (2005) Bladt, M. and Sørensen, M. Statistical inference for discretely observed markov jump processes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(3):395–410, 2005.
• Bladt & Sørensen (2009) Bladt, M. and Sørensen, M. Efficient estimation of transition rates between credit ratings from observations at discrete time points. Quantitative Finance, 9(2):147–160, 2009.
• Broderick et al. (2013) Broderick, T., Kulis, B., and Jordan, M. I. MAD-Bayes: MAP-based Asymptotic Derivations from Bayes. In International Conference on Machine Learning, 2013.
• (7) DLMF. NIST Digital Library of Mathematical Functions. http://dlmf.nist.gov/, Release 1.0.8 of 2014-04-25. Online companion to (Olver et al., 2010).
• Goldberger et al. (2000) Goldberger, A. L., Amaral, L. A. N., Glass, L., Hausdorff, J. M., Ivanov, P. C., Mark, R. G., Mietus, J. E., Moody, G. B., Peng, C.-K., and Stanley, H. E. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation, 101(23):e215–e220, 2000.
• Hajiaghayi et al. (2014) Hajiaghayi, M., Kirkpatrick, B., Wang, L., and Bouchard-Côté, A. Efficient Continuous-Time Markov Chain Estimation. In International Conference on Machine Learning, 2014.
• Jiang et al. (2012) Jiang, K., Kulis, B., and Jordan, M. I. Small-variance asymptotics for exponential family dirichlet process mixture models. In NIPS, pp. 3167–3175, 2012.
• Kingman (1993) Kingman, J. F. C. Poisson Processes. Oxford Studies in Probability. Oxford University Press, 1993.
• Kulis & Jordan (2012) Kulis, B. and Jordan, M. I.

Revisiting k-means: New Algorithms via Bayesian Nonparametrics.

In International Conference on Machine Learning, 2012.
• Lange (2014) Lange, J. Latent Continuous Time Markov Chains for Partially-Observed Multistate Disease Processes. PhD thesis, 2014.
• Mandel (2010) Mandel, M. Estimating disease progression using panel data. Biostatistics, 11(2):304–316, 2010.
• Moody & Mark (1996) Moody, G. and Mark, R. A database to support development and evaluation of intelligent intensive care monitoring. In Computers in Cardiology, 1996, pp. 657–660, Sept 1996.
• Olver et al. (2010) Olver, F. W. J., Lozier, D. W., Boisvert, R. F., and Clark, C. W. (eds.). NIST Handbook of Mathematical Functions. Cambridge University Press, New York, NY, 2010. Print companion to (DLMF, ).
• Perkins (2009) Perkins, T. J. Maximum likelihood trajectories for continuous-time markov chains. In NIPS, pp. 1437–1445, 2009.
• Rao & Teh (2013) Rao, V. and Teh, Y. W. Fast MCMC sampling for Markov jump processes and extensions. The Journal of Machine Learning Research, 14(1):3295–3320, 2013.
• Roychowdhury et al. (2013) Roychowdhury, A., Jiang, K., and Kulis, B. Small-Variance Asymptotics for Hidden Markov Models. In Advances in Neural Information Processing Systems, pp. 2103–2111, 2013.
• Saeedi & Bouchard-Côté (2011) Saeedi, A. and Bouchard-Côté, A. Priors over recurrent continuous time processes. In NIPS, volume 24, pp. 2052–2060, 2011.
• Teh et al. (2006) Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. Hierarchical Dirichlet Processes. Journal of the American Statistical Association, 101(476):1566–1581, December 2006.

## Appendix A Parametric MJPs for SVA

To obtain the SVA objective from the parametric MJP model, we begin by scaling the exponential distribution , which is an exponential family distribution with natural parameter , log-partition function , and base measure  (Banerjee et al., 2005). To scale the distribution, introduce the new natural parameter and log-partition function . The new base measure is uniquely defined by the integral equation (see Banerjee et al., 2005, Theorem 5) Choosing satisfies the condition, so we have It can now be seen that is the density of a gamma distribution with shape parameter and rate parameter . Hence, the mean of the scaled distribution is and its variance is . Letting denote the CDF corresponding to , we have , where is the upper incomplete gamma function.

For the state at the -th jump we use a 1-of- representation; that is, is an -dimensional binary random variable which satisfies and . Hence, we have:

 p(sk|sk−1,j=1)=M∏m=1pskmjm. (A.1)

Given the Bregman divergence for a multinomial distribution, where , this can be written in terms of exponential family notation in the following form (Banerjee et al., 2005):

 p(sk|sk−1,j=1)=bϕ(sk)exp(−dϕ(sk,\bpj)) (A.2)

where . For a scaled multinomial distribution we have , where is the scaling parameter for the multinomial distribution. Writing the trajectory probility with the scaled exponential families yields:

 p(\mcU|s0,sK,P,λ)∝exp{−β(lnΓ(β)−lnΓ(β,βλsKt⋅)β+ξK−1∑k=0\klsk+1\bpsk∝exp{+K−1∑k=0(λsktk−lnλsktk−βlnβ−lnΓ(β)β+lntkβ))},

Since , we can apply the asymptotic expansions for and . In particular, applying Stirling’s formula and the facts in (DLMF, ) we have:

We also place a prior on each . With , we obtain Hence, when , obtain

 min\mcU,λ,P{ξK−1∑k=0\klsk+1\bpsk+K−1∑k=0(λsktk−lnλsktk−1)+\ind[λsKt⋅≥1](λsKt⋅−lnλsKt⋅−1)+ξλM∑s=1(μλλs−lnλs−1)}

## Appendix B Bayesian Nonparametric MJPs for SVA

First we recall that the Moran gamma process is a distribution over measures. If is a random measure distributed according to a Moran gamma process with base measure on the probability space and rate parameter , then for all measurable partitions of , , satisfies

The hierarchical gamma-gamma process (H) is defined to be:

Consider the gamma-gamma process (), defined by (B)-(B) (with treated as an arbitrary fixed measure). We now show that the retains the key properties of the EP: conjugacy and exchangeability. Let and be the sufficient statistics of the observations. The is a conjugate family: , where and . [Proof sketch] The proof is analogous to that for Proposition 2 in (Saeedi & Bouchard-Côté, 2011). The key additional insight is that and are conjugate:

. In order to give the joint distribution of the times

, we first derive the predictive distribution for the , . We make use of the following family of densities. [Shaped Translated Pareto] Let . A random variable is shaped translated Pareto, denoted , if it has density where is the beta function. The predictive distribution of the is

 (sk+1,tk+1)\given\mcUk\dist¯μ′sk×\distNamedSTP(β,∥¯μ′sk∥,γ′sk).

By Proposition B, it suffices to show that if , , and , then , where . Letting , the distribution of is We can now show that the process is exchangeable by exhibiting the joint distribution of waiting times: Let be the waiting times following state . Then is an exchangeable sequence with joint distribution

 p(\bt∗m)=Γ(β(κ0+Km))Γ(β)Km(∏Kmj=1t∗mj)β−1(γ+∑Kmj=1τmj)β(κ0+Km)

[Proof sketch] Take the product of the predictive distributions of .

The measures and can be integrated out of the H generative model in a manner analogous to the way in the the Chinese restaurant franchise in obtained from the hierarchical Dirichlet process (Teh et al., 2006). However the mass of the measure cannot be integrated out. We omit details as they are essentially identical to those in case of the HEP (Saeedi & Bouchard-Côté, 2011).

First, we consider the case of integrating out . Let denote the number of used states, the number of transitions out of state , and the number of states that can be reached from state in one step. The contribution to the likelihood from the H prior is where . Taking the logarithm, using asymptotic expansions for the Gamma terms, and ignoring terms yields where . In order to retain the effects of the hyperparameters in the asymptotics, set and . Thus, as . We require that , so without loss of generality we can choose to obtain Thus, the objective function to minimize is

Alternatively, the small variance asymptotics can be derived in the case where is not integrated out. To do so, we first rewrite the H generative model in an equivalent form, with integrated out:

For , let and for , let . Integrating out , the contribution to the likelihood from the H prior is now

We use a slightly different limiting process, with , a positive constant, and scale the multinomial distributions (B) by . Taking the logarithm and and ignoring terms as before yields Thus, the objective function to minimize is

 + ζL∑ℓ=1lnρsτℓxℓ+ξK∑k=1ln¯πsk−1,sk+ξ1M+M∑m=1{ξ2\kl¯π0¯πm−∑Kmj=1lnt∗mj−Kmln([γ+t∗m⋅]/Km)}.

## Appendix C Time-accuracy plots for the experiments

In the main paper we include the error versus iteration as it is more objective than time-accuracy results. In Fig. C.6, we compare the time-accuracy across different methods for different datasets. EM, PMCMC, and JUMP-means are implemented in Java and MCMC is implemented in Python. To plot the MCMC results, we give a speed boost of 100x in the results to compensate for Python’s slow interpreter. From our experience with scientific computing applications, we believe this is a generous adjustment. Also we note that the EM implementation used in our experiments is not the most optimized in terms of time per iteration. However, our goal is to show that JUMP-means can achieve comparable performance with a reasonable implementation of MCMC and EM.

## Appendix D Scaling experiments

For the scaling experiments we generated 4 datasets consisting of to sequences. All datasets are sampled from a single hidden state MJP with 5 hidden states and 5 possible observations. For the 20 observations in each sequence a Gaussian likelihood is used. Finally, for the held out results, we categorized the observations in 5 bins, removed of the data points and predicted their category.