1 Introduction
Markov jump processes (MJPs) are continuoustime, discretestate Markov processes in which state durations are exponentially distributed according to statespecific rate parameters. A stochastic matrix controls the probability of transitioning between pairs of states. MJPs have been used to construct probabilistic models either when the state of a system is observed directly, such as with disease progression
(Mandel, 2010) and RNA path folding (Hajiaghayi et al., 2014), or when the state is only observed indirectly, as in corporate bond rating (Bladt & Sørensen, 2009). For example, consider the important clinical task of analyzing physiological signals of a patient in order to detect abnormalities. Such signals include heart rate, blood pressure, respiration, and blood oxygen level. For an ICU patient, an abnormal state might be the precursor to a cardiac arrest event while for an epileptic, the state might presage a seizure (Goldberger et al., 2000). How can the latent state of the patient be inferred by a Bayesian modeler, so that, for example, an attending nurse can be notified when a patient enters an abnormal state? MJPs offer one attractive approach to analyzing such physiological signals.Applying an MJP model to physiological signals presents a challenge: the number of states is unknown and must be inferred using, for example, Bayesian nonparametric methods. However, efficient inference in nonparametric MJP models is a challenging problem, where existing methods based on particle MCMC scale poorly and mix slowly (Saeedi & BouchardCôté, 2011)
. Current optimizationbased methods such as expectation maximization (EM) are inapplicable if the state size is countably infinite; hence, they cannot be applied to Bayesian nonparametric MJP models, as we would like to do for physiological signals.
Furthermore, although MJPs are viewed as more realistic than their discretetime counterparts in many fields (Rao & Teh, 2013), degenerate solutions for the maximum likelihood (ML) trajectories for both directly and indirectly observed cases (Perkins, 2009), and nonexistence of the ML transition matrix (obtained from EM) for some indirectly observed cases (Bladt & Sørensen, 2009) present inferential challenges. Degenerate ML trajectories occur when some of the jump times are infinitesimal, which severely undermines the practicality of such approaches. For instance, a trajectory which predicts a patient’s seizure for an infinitesimal amount of time is of limited use to the medical staff. Fig. 3 shows an example of the degeneracy problem.
In this paper, we take a smallvariance asymptotics (SVA) approach to develop an optimizationbased framework for efficiently estimating the most probable trajectories (states) for both parametric and nonparametric MJPbased models. Smallvariance asymptotics has recently proven to be useful in estimating the parameters and inferring the latent states in rich probabilistic models. SVA extends the wellknown connection between mixtures of Gaussians and means: as the variances of the Gaussians approach zero, the maximum a posteriori solution to the mixture of Gaussians model degenerates to means solution (Kulis & Jordan, 2012). The same idea can be applied to obtain wellmotivated objective functions that correspond to a latent variable model for which scalable inference via standard methods like MCMC is challenging. SVA has been applied to (hierarchical) Dirichlet process mixture models (Kulis & Jordan, 2012; Jiang et al., 2012), Bayesian nonparametric latent feature models (Broderick et al., 2013)
, hidden Markov models (HMMs), and infinitestate HMMs
(Roychowdhury et al., 2013).We apply the SVA approach to both parametric and Bayesian nonparametric MJP models to obtain what we call the JUMPmeans objective functions. In the parametric case, we derive a novel objective function which does not suffer from maximum likelihood’s solution degeneracy, leading to more stable and robust inference procedures in both the directly observed and hidden state cases. Infinitestate MJPs (iMJPs) are constructed from the hierarchical gammaexponential process (HEP) (Saeedi & BouchardCôté, 2011). In order to apply SVA to iMJPs, we generalize the HEP to obtain the first deterministic procedure (we know of) for inference in Bayesian nonparametric MJPs.
We evaluate JUMPmeans on several synthetic and realworld datasets in both the parametric and Bayesian nonparametric cases. JUMPmeans performs on par with or better than existing methods, offering an attractive speedaccuracy tradeoff. We obtain significant improvements in the nonparametric case, gaining up to a 20% reduction in mean error on the task of observation reconstruction. In summary, the JUMPmeans approach leads to algorithms that 1) are applicable to MJPs with Bayesian nonparametric priors; 2) provide nondegenerate solutions for the most probable trajectories; and 3) are comparable to or outperform other standard methods of inference both in terms of speed and reconstruction accuracy.
2 Background
2.1 Markov Jump Processes
A Markov jump process (MJP) is defined by (a) a finite (or countable) state space, which we identify with the integers
; (b) an initial state probability distribution
; (c) a (stochastic) state transition matrix with for all; and (d) a state dwelltime rate vector
. The process begins in a state . When the process enters a state , it remains there for a dwell time that is exponentially distributed with parameter . When the system leaves state , it transitions to state with probability .A trajectory of the MJP is a sequence of states and a dwell time for each state, except for the final state: . Implicitly, (and thus
) is a random variable such that
and the system is in state at time . Let and be the sequences of states and times corresponding to . The probability of a trajectory is given bywhere and is the indicator function. In many cases when the states are directly observed, the initial state and the final state are observed, in which case it is straightforward to obtain a likelihood from (2.1).
A hidden state MJP (HMJP) is an MJP in which the states are observed indirectly according to a likelihood model , , , where is some observation space. The times of the observations are chosen independent of , so the probability of the observations is given by where, with an abuse of notation, we write for the state of the MJP at time .
2.2 Previous Approaches to MJP Inference
There are a number of existing approaches to inference and learning in MJPs. An expectationmaximization (EM) algorithm can be derived, but it cannot be applied to models with countably infinite states, so it is not suitable for iMJPs (Lange, 2014) (iMJPs are detailed in Section 4). Moreover, with discretely observed data, the maximumlikelihood estimate with finite entries for the transition matrix obtained from EM may not exist (Bladt & Sørensen, 2005).
Maximum likelihood inference amounts to finding , which can be carried out efficiently using dynamic programming (Perkins, 2009). However, maximum likelihood solutions for the trajectory are degenerate: only an infinitesimal amount of time is spent in each state, except for the state visited with the smallest rate parameter (i.e., longest expected dwell time). Such a solution is unsatisfying and unintuitive because the dwell times are far from their expected values. Thus, maximum likelihood inference produces results that are unrepresentative of the model behavior.
Markov chain Monte Carlo methods have also been developed, but these can be slow and their convergence is often difficult to diagnose (Rao & Teh, 2013). Recently, a more efficient Monte Carlo method was proposed in Hajiaghayi et al. (2014) which is based on particle MCMC (PMCMC) methods (Andrieu et al., 2010). This approach addresses the issue of efficiency, but since it marginalizes over the jump points, it cannot provide probable trajectories.
2.3 Smallvariance Asymptotics
Consider a Bayesian model in which the likelihood terms contain a variance parameter . Given some data , a point estimate for the parameters and latent variables of the model can obtained by maximizing the posterior , resulting in a maximum a posteriori (MAP) estimate. In the SVA approach (Broderick et al., 2013), the MAP optimization is considered in the limit as the likelihood variance parameter is taken to zero:
. Typically, the smallvariance limit leads to a much simpler optimization than the MAP optimization with nonzero variance. For example, the MAP objective for a Gaussian mixture model simplifies to the
means objective.3 Parametric MJPs
3.1 Directly Observed MJP
Consider the task of inferring likely state/dwelltime sequences given , the times at which the system was directly observed and the states of the system at those times. For simplicity we assume that and that all times are in the interval . Let be the state of the system following trajectory at time . The likelihood of a sequence is
We also place a gamma prior on the rate parameters
(detailed below). Instead of relying on MAP estimation, we apply a small variance asymptotics analysis to obtain a more stable objective function. Following
(Jiang et al., 2012), we scale the distributions by an inverse variance parameter and then maximize the scaled likelihood and prior in the limit (i.e., as the variance goes to zero).Scaling the exponential distribution produces the twoparameter family with
which is the density of a gamma distribution with shape parameter
and rate parameter . Hence, the mean of the scaled distribution is and its variance is . Letting denote the CDF corresponding to , we have , where is the upper incomplete gamma function.The multinomial distribution is scaled by the parameter . Writing the likelihood with the scaled exponential families (and dropping indicator variables) yields:
The modified likelihood is for a jump process which is no longer Markov when . We also place a prior on each and set . It can be shown (see the Supplementary Material for details) that, when , the MAP estimation problem becomes
The optimization problem (A) is very natural and offers far greater stability than maximum likelihood optimization. As with maximum likelihood, the terms penalize transitions with small probability. The term is convex and minimized when , the expected value of the dwell time for state . As , approaches , while for , grows approximately linearly. Thus, times very close to zero are heavily penalized while times close to the expected dwell time are penalized very little. The term penalizes the time spent in state so far in the same manner as a regular dwell time when is greater than the expected value of the dwelltime. However, when is less than the expected value there is no cost, which is quite natural since the system may remain in state for longer than — i.e., there should not be a large penalty for being less than its expected value. Finally, parameters and have a very natural interpretation (cf. (5) below): they correspond to a priori having dwell times of length for each state.
Comparison to Maximum Likelihood MJP trajectories estimated using maximum likelihood (MLE) are usually trivial, with the system spending almost all its time in a single state (with the smallest ), with infinitesimal dwell times for the other states. This poor behavior of MLE is due to the fact that the mode of , which is favored by the MLE, is 0, even though the mean is .^{1}^{1}1Note that placing priors on the rate parameters, as we do, does not affect the degeneracy of the ML trajectory. The SVA optimization, on the other hand, does give trajectories that are representative of the true behavior because the SVA terms of the form are optimized at (i.e., at the mean of ). We demonstrate the superior behavior of the SVA in the concrete example of estimating disease progression in patients in Section 5.
3.2 Hidden State MJP
For an HMJP, the likelihood of a valid trajectory is
Hence, the only difference between the directly observed case and the HMJP is the addition of the observation likelihood terms. Because multinomial observations are commonly used in MJP applications, that is the case we consider here. Let denote the number of possible observations and be the probability of observing when . The observation likelihoods are scaled in the same manner as the transition probabilities, but with . Thus, for the HMJP, we obtain:
3.3 Algorithm
Optimizing the JUMPmeans objectives in (A) and (3.2) is nontrivial due to the fact that we do not know the number of jumps in the MJP, and the combinatorial explosion in the sequences with the number of jump points. The terms involving the continuous variables (dwell times) present an additional complexity.
We therefore resort to an alternating minimization procedure to optimize the JUMPmeans objective function, similar in spirit to the one used in Roychowdhury et al. (2013). In each iteration of the optimization process, we first use a modified Viterbi algorithm to obtain the most likely state sequence. Then, we use convex optimization to distribute the jump points optimally with respect to the values from for the current state sequence.
Directly Observed MJP When optimizing (A), there may be many sequences (’s) available, representing distinct realizations of the process. We use the following algorithm to optimize (A):

[label=0., leftmargin=*]

Initialize the state transition matrix and rate vector with uniform values.

For every observation sequence , instantiate the jump points by adding one jump point between every pair of observations, in addition to the start and end points.

For each , use a modified Viterbi algorithm. to find the best state sequence to optimize (A), while keeping the jump points fixed. The modified algorithm includes the dwell time penalty terms, which are dependent upon the assignment of states to the time points.

Optimize the dwell times with the state sequences of the trajectories fixed.

Optimize and with the other variables fixed. The optimal values can be obtained in closed form. For example, if there is only a single observation sequence with corresponding inferred trajectory , then
where denotes to the number of transitions from state to state in .

Repeat steps 35 until convergence.
Beam Search Variant We note that the optimization procedure just described is restrictive since the number of jump points is fixed and the jump points are constrained by the observation boundaries. To eliminate this, we also tested a beam search variant of the algorithm to allow for the creation and removal of jump points, but found it did not have much impact in our experiments.
Hidden State MJP The algorithm to optimize the hidden state MJP JUMPmeans objective (3.2) is similar to that for optimizing (A), but with three modifications. First, in place of , we have the indirect observations of the states . Second, observation likelihood terms containing are included in the objective minimized by the Viterbi optimization (step 3). Finally, an additional update is performed in step 5 for each of the observation distributions :
for and . If each is initialized to be uniform, then the algorithm converges to a poor local minimum, so we add a small amount of random noise to each uniform .
Data Set  Mean Error (%)  

# Points  Held Out  # States  BL  EM  MCMC  SVA  PMCMC  
Synthetic 1 (PDO)  10,000  30 %  10  69.7  40.2  41.9  41.2   
Synthetic 2 (PH)  10,000  30 %  5  51.8  42.9  74.6  46.5   
MS (PDO)  390  50 %  3  51.2  26.2  48.1  25.4   
MIMIC (NPH)  2,208  25 %    42.3  25.7*    24.3  30.9 
4 Bayesian Nonparametric MJPs
We now consider the Bayesian nonparametric MJP (iMJP) model. The iMJP is based on the hierarchical gammaexponential process (HEP), which is constructed from the gammaexponential process (EP). We denote the Moran gamma process with base measure and rate parameter by (Kingman, 1993). The HEP generates a state/dwelltime sequence (with assumed known) according to (Saeedi & BouchardCôté, 2011):
where is the base probability measure, is a concentration parameter, , , and denotes the total mass of the measure . As in the parametric case, we must replace the exponential distribution in (4
) with the scaled exponential distribution. After an appropriate scaling of the rest of the hyperparameters, we obtain the hierarchical gammagamma process (H
). The definition and properties of the H are given in the Supplementary Material.Let denote the number of used states, the number of transitions out of state , and the mass on the th component of the measure . For , let and for , let . Let be the waiting times following state and define . In order to retain the effects of the hyperparameters in the asymptotics, set and . It can then be shown that (see the Supplementary Material for details), when , the iMJP MAP estimation problem becomes
Like its parametric counterpart, the Bayesian nonparametric cost function penalizes dwell durations very close to zero via the terms. In addition, there are penalties for the number of states and the state transitions. The observation likelihood term in (4) favors the creation of new states to minimize the JUMPmeans objective, while the state penalty and the nonlinear penalty term counteracts the formation of a long tail of states with very few data points. The hyperparameter introduces an additional, nonlinear cost for each additional state — if a state is occupied for time, then the term for that state does not have much effect on the cost. The KL divergence terms between and arise from the hierarchical structure of the prior, biasing the transition probabilities to be similar to the prior .
4.1 Algorithm
For the iMJP case, we have the extra variables and to optimize. In addition, the number of variables to optimize depends on the number of states in our model. The major change in the algorithm from the parametric case is that we must propose and then accept or reject the addition of new states. We propose the following algorithm for optimizing the iMJP:

[label=(0), leftmargin=*]

Initialize , and with uniform values and set the number of states .

For each observation sequence, apply the Viterbi algorithm and update the times using the new objective function in (4), analogously to steps (3) and (4) in the parametric algorithm.

Perform MAP updates for (as in (3.3)) and :

For every state pair , form a new state by considering all transitions from to and reassigning all observations that were assigned to to the new state. Update and to estimate the overall objective function for every new set of states formed in this way and accept the state set that minimizes the objective. If no such set exists, do not create a new state and revert back to the old and .

Repeat steps 24 until convergence.
If instead of multinomial observations we have Gaussian observations, the parameter is replaced with the mean parameter . In this case, we update the mean for each state using the data points assigned to the state, similar to the procedure for means clustering (see, e.g., Jiang et al. (2012); Roychowdhury et al. (2013)).
5 Experiments
In this section we provide a quantitative analysis of the JUMPmeans algorithm and compare its performance on synthetic and real datasets with standard inference methods for MJPs. For evaluation, we consider multiple sequences of discretely observed data and randomly hold out a subset of the data. We report reconstruction error for performance comparison.
5.1 Parametric Models
For the parametric models, we compare JUMPmeans to maximum likelihood estimation of the MJP parameters learned by EM (Asger & Ledet, 2005), the MCMC method proposed in Rao & Teh (2013) and a simple baseline where we ignore the sequential structure of the data. We run three sets of experiments (2 synthetic, 1 real) for our evaluation.
Synthetic 1: Directly Observed States For evaluating the model on a directly observed process, we generate 100 different datasets randomly from various MJPs with 10 states. To generate each dataset, we first generate the rows of the transition probability matrix and transition rates independently from and , respectively. Next, given the rates and transition probabilities for each dataset, we sample 500 sequences of length 20. We hold out 30% of the observations at random for testing reconstruction error.
We run JUMPmeans by initializing the algorithm with a uniform transition matrix and set the rate vector to be all ones. We run 300 iterations of the algorithm described in Section 3.3; each iteration is one scan through all the sequences. We set the hyperparameters , and equal to 1, 1, and .5, respectively. For MCMC, we initialize the jump points using the time points of the observations. We place independent priors on and independent priors on . We initialize EM with a uniform and an allones . We run both MCMC and EM for 300 iterations, then reconstruct observations using the Bayes estimator approximated from the 300 posterior samples. For our baseline we use the most common observation in the dataset as an estimate of the missing observations.
Table 1 gives the mean reconstruction error across sequences for the various methods. Note that JUMPmeans performs better than MCMC, and is almost on par with EM. Fig. 2(a) shows the average error across all the datasets for each method versus number of iterations. In terms of CPU time, each iteration of JUMPmeans (Java), EM (Java), and MCMC (Python) takes 0.3, 1.61 and 42 seconds, respectively. We also ran experiments with the beam search variant described in Section 3.3; however, we did not obtain any significant improvement in results.
Synthetic 2: Hidden States For the hidden state case, we generate 100 different datasets for MJPs with 5 hidden and 5 observed states, with varying parameters as above. In each dataset there are 500 sequences of length 20. In addition to parameters in the directly observed case, we generate observation likelihood terms for each state from .
We initialize the transition probabilities and the rate vectors for JUMPmeans, MCMC and EM in a fashion similar to the directly observed case. For the observation likelihood , we use
as a prior for MCMC, uniform distributions for EM initialization and a uniform probability matrix with a small amount of random noise for JUMPmeans initialization. We set
as before and to 1.We run each algorithm for 300 iterations. For JUMPmeans, we use the hidden state MJP algorithm described in Section 3.3. Table 1 and Fig. 2(b) again demonstrate that JUMPmeans outperforms MCMC by a large margin and performs comparably to EM. The poor performance of MCMC is due to slow mixing over the parameters and state trajectories. The slow mixing is a result of the coupling between the latent states and the observations, which is induced by the observation likelihood.
Disease Progression in Multiple Sclerosis (MS) Estimating disease progression and change points in patients with Multiple Sclerosis (MS) is an active research area (see, e.g., Mandel (2010)). We can cast the progression of the disease in a single patient as an MJP, with different states representing the various stages of the disease. Obtaining the mostlikely trajectory for this MJP can aid in understanding the disease progression and enable better care.
For our experiments, we use a realworld dataset collected from a phase III clinical trial of a drug for MS. This dataset tracks the progression of the disease for 72 different patients over three years. We randomly hold out 50 of the observations and evaluate on the observation reconstruction task. The observations are values of a disability measure known as EDSS, recorded at different time points. Initialization and hyperparameters are the same as Synthetic 1.
Table 1 shows that JUMPmeans significantly outperforms MCMC, achieving almost a 50% relative reduction in reconstruction error. JUMPmeans again achieves comparable results with EM. Fig. 3 (top panel) provides an example of the latent trajectories from JUMPmeans and maximum likelihood estimate for a single patient. The MLE trajectory includes two infinitesimal dwell times, which do not reflect realistic behavior of the system (since we do not expect a patient to be in a disease state for an infinitesimal amount of time). On the other hand, the trajectory produced by JUMPmeans takes into account the dwell times of the various stages of the disease and provides a more reasonable picture of its progression.
5.2 Nonparametric Model
Vital Signs Monitoring (MIMIC) We now consider a version of the problem of understanding physiological signals discussed in the introduction. We use data from the MIMIC database (Goldberger et al., 2000; Moody & Mark, 1996), which contains recordings of several vital parameters of ICU patients. Specifically, we consider blood pressure readings of 69 ICU patients collected over a 24hour period and subsample observation sequences of length 32 for each patient, keeping the start and end times fixed.^{2}^{2}2We use a small dataset for testing since PMCMC cannot easily scale to larger datasets. For testing, we randomly hold out 25% of the observations.
To initialize JUMPmeans, we choose uniform matrices for and and set = 1. The hyperparameters and are set to 5, while , and are set to 0.005. Using a Gaussian likelihood model for the observations, we run our model for 50 iterations. We compare with particle MCMC (PMCMC) (Andrieu et al., 2010) and EM. PMCMC is a stateoftheart inference method for iMJPs (Saeedi & BouchardCôté, 2011), which we run for 300 iterations with 100 particles. For PMCMC, we first categorize the readings into the standard four categories for blood pressure provided by NIH^{3}^{3}3http://www.nhlbi.nih.gov/health/healthtopics/topics/hbp. We run EM with a number of hidden states from 1 to 12 and report the best performance among all the results. For initializing the EM, we use the same setting as the Synthetic 2 case.
For evaluation, we consider the time point of a test observation and categorize the mean of the latent state at this time point (using the same categories obtained above) to compare against the actual category. Table 1 shows that JUMPmeans significantly outperforms PMCMC and obtains a 21% relative reduction in average error rate. Fig. 2(c) plots the error against iterations of both algorithms. In terms of CPU time, each iteration of JUMPmeans (Java) and PMCMC (Java) takes 0.17 and 1.95 seconds, respectively. Compared to EM’s error rate of 25.7%, JUMPmeans reaches a rate of 24.3% without the need to separately train for different number of states. The secondbest result for the EM had an error of 45%, which shows the importance of model selection when using EM.
Fig. 3 (bottom) provides an example of the latent trajectory inferred by JUMPmeans. The observations are uniquely colored by the latent state they are assigned. We note that the model captures different levels of blood pressure readings and provides a nondegenerate latent trajectory.
Hyperparameters A wellknown problem when applying SVA methods is that there are a number of hyperparameters to tune. In our objective functions, some of these hyperparameters (, , and ) have natural interpretations so prior knowledge and common sense can be used to set them, but others do not. Fig. 4 shows histograms over the errors we obtain for runs of JUMPmeans on the MS and MIMIC datasets with different settings. We can see that a significant fraction of the runs converge to the minimum error, while some settings — in particular when the hyperparameters were of different orders of magnitude — led to larger errors. Hence, the sensitivity study indicates the robustness of JUMPmeans to the choice of hyperparameters.
Scaling Fig. 5 shows the total runtime and reconstruction error of the nonparametric JUMPmeans algorithm on increasingly large amounts of synthetic data. The algorithm is able to handle up to a million data points with the runtime scaling linearly with data size. Furthermore, the error rate decreases significantly as the amount of data increases. See the Supplementary Material for further details.
6 Conclusion
We have presented JUMPmeans, a new approach to inference in MJPs using smallvariance asymptotics. We derived novel objective functions for parametric and Bayesian nonparametric models and proposed efficient algorithms to optimize them. Our experiments demonstrate that JUMPmeans can be used to obtain highquality nondegenerate estimates of the latent trajectories. JUMPmeans offers attractive speedaccuracy tradeoffs for both parametric and nonparametric problems, and achieved stateoftheart reconstruction accuracy on nonparametric problems.
Acknowledgments
Thanks to Monir Hajiaghayi, Matthew Johnson, and Tejas Kulkarni for helpful comments and discussions. JHH was supported by the U.S. Government under FA955011C0028 and awarded by the DoD, Air Force Office of Scientific Research, National Defense Science and Engineering Graduate (NDSEG) Fellowship, 32 CFR 168a.
References
 Andrieu et al. (2010) Andrieu, C., Doucet, A., and Holenstein, R. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010.
 Asger & Ledet (2005) Asger, H. and Ledet, J. J. Statistical inference in evolutionary models of dna sequences via the em algorithm. Statistical Applications in Genetics and Molecular Biology, 4(1):1–22, 2005.

Banerjee et al. (2005)
Banerjee, A., Merugu, S., Dhillon, I. S., and Ghosh, J.
Clustering with bregman divergences.
The Journal of Machine Learning Research
, 6:1705–1749, 2005.  Bladt & Sørensen (2005) Bladt, M. and Sørensen, M. Statistical inference for discretely observed markov jump processes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(3):395–410, 2005.
 Bladt & Sørensen (2009) Bladt, M. and Sørensen, M. Efficient estimation of transition rates between credit ratings from observations at discrete time points. Quantitative Finance, 9(2):147–160, 2009.
 Broderick et al. (2013) Broderick, T., Kulis, B., and Jordan, M. I. MADBayes: MAPbased Asymptotic Derivations from Bayes. In International Conference on Machine Learning, 2013.
 (7) DLMF. NIST Digital Library of Mathematical Functions. http://dlmf.nist.gov/, Release 1.0.8 of 20140425. URL http://dlmf.nist.gov/. Online companion to (Olver et al., 2010).
 Goldberger et al. (2000) Goldberger, A. L., Amaral, L. A. N., Glass, L., Hausdorff, J. M., Ivanov, P. C., Mark, R. G., Mietus, J. E., Moody, G. B., Peng, C.K., and Stanley, H. E. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation, 101(23):e215–e220, 2000.
 Hajiaghayi et al. (2014) Hajiaghayi, M., Kirkpatrick, B., Wang, L., and BouchardCôté, A. Efficient ContinuousTime Markov Chain Estimation. In International Conference on Machine Learning, 2014.
 Jiang et al. (2012) Jiang, K., Kulis, B., and Jordan, M. I. Smallvariance asymptotics for exponential family dirichlet process mixture models. In NIPS, pp. 3167–3175, 2012.
 Kingman (1993) Kingman, J. F. C. Poisson Processes. Oxford Studies in Probability. Oxford University Press, 1993.

Kulis & Jordan (2012)
Kulis, B. and Jordan, M. I.
Revisiting kmeans: New Algorithms via Bayesian Nonparametrics.
In International Conference on Machine Learning, 2012.  Lange (2014) Lange, J. Latent Continuous Time Markov Chains for PartiallyObserved Multistate Disease Processes. PhD thesis, 2014.
 Mandel (2010) Mandel, M. Estimating disease progression using panel data. Biostatistics, 11(2):304–316, 2010.
 Moody & Mark (1996) Moody, G. and Mark, R. A database to support development and evaluation of intelligent intensive care monitoring. In Computers in Cardiology, 1996, pp. 657–660, Sept 1996.
 Olver et al. (2010) Olver, F. W. J., Lozier, D. W., Boisvert, R. F., and Clark, C. W. (eds.). NIST Handbook of Mathematical Functions. Cambridge University Press, New York, NY, 2010. Print companion to (DLMF, ).
 Perkins (2009) Perkins, T. J. Maximum likelihood trajectories for continuoustime markov chains. In NIPS, pp. 1437–1445, 2009.
 Rao & Teh (2013) Rao, V. and Teh, Y. W. Fast MCMC sampling for Markov jump processes and extensions. The Journal of Machine Learning Research, 14(1):3295–3320, 2013.
 Roychowdhury et al. (2013) Roychowdhury, A., Jiang, K., and Kulis, B. SmallVariance Asymptotics for Hidden Markov Models. In Advances in Neural Information Processing Systems, pp. 2103–2111, 2013.
 Saeedi & BouchardCôté (2011) Saeedi, A. and BouchardCôté, A. Priors over recurrent continuous time processes. In NIPS, volume 24, pp. 2052–2060, 2011.
 Teh et al. (2006) Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. Hierarchical Dirichlet Processes. Journal of the American Statistical Association, 101(476):1566–1581, December 2006.
Appendix A Parametric MJPs for SVA
To obtain the SVA objective from the parametric MJP model, we begin by scaling the exponential distribution , which is an exponential family distribution with natural parameter , logpartition function , and base measure (Banerjee et al., 2005). To scale the distribution, introduce the new natural parameter and logpartition function . The new base measure is uniquely defined by the integral equation (see Banerjee et al., 2005, Theorem 5) Choosing satisfies the condition, so we have It can now be seen that is the density of a gamma distribution with shape parameter and rate parameter . Hence, the mean of the scaled distribution is and its variance is . Letting denote the CDF corresponding to , we have , where is the upper incomplete gamma function.
For the state at the th jump we use a 1of representation; that is, is an dimensional binary random variable which satisfies and . Hence, we have:
(A.1) 
Given the Bregman divergence for a multinomial distribution, where , this can be written in terms of exponential family notation in the following form (Banerjee et al., 2005):
(A.2) 
where . For a scaled multinomial distribution we have , where is the scaling parameter for the multinomial distribution. Writing the trajectory probility with the scaled exponential families yields:
Since , we can apply the asymptotic expansions for and . In particular, applying Stirling’s formula and the facts in (DLMF, ) we have:
We also place a prior on each . With , we obtain Hence, when , obtain
Appendix B Bayesian Nonparametric MJPs for SVA
First we recall that the Moran gamma process is a distribution over measures. If is a random measure distributed according to a Moran gamma process with base measure on the probability space and rate parameter , then for all measurable partitions of , , satisfies
The hierarchical gammagamma process (H) is defined to be:
Consider the gammagamma process (), defined by (B)(B) (with treated as an arbitrary fixed measure). We now show that the retains the key properties of the EP: conjugacy and exchangeability. Let and be the sufficient statistics of the observations. The is a conjugate family: , where and . [Proof sketch] The proof is analogous to that for Proposition 2 in (Saeedi & BouchardCôté, 2011). The key additional insight is that and are conjugate:
. In order to give the joint distribution of the times
, we first derive the predictive distribution for the , . We make use of the following family of densities. [Shaped Translated Pareto] Let . A random variable is shaped translated Pareto, denoted , if it has density where is the beta function. The predictive distribution of the isBy Proposition B, it suffices to show that if , , and , then , where . Letting , the distribution of is We can now show that the process is exchangeable by exhibiting the joint distribution of waiting times: Let be the waiting times following state . Then is an exchangeable sequence with joint distribution
[Proof sketch] Take the product of the predictive distributions of .
The measures and can be integrated out of the H generative model in a manner analogous to the way in the the Chinese restaurant franchise in obtained from the hierarchical Dirichlet process (Teh et al., 2006). However the mass of the measure cannot be integrated out. We omit details as they are essentially identical to those in case of the HEP (Saeedi & BouchardCôté, 2011).
First, we consider the case of integrating out . Let denote the number of used states, the number of transitions out of state , and the number of states that can be reached from state in one step. The contribution to the likelihood from the H prior is where . Taking the logarithm, using asymptotic expansions for the Gamma terms, and ignoring terms yields where . In order to retain the effects of the hyperparameters in the asymptotics, set and . Thus, as . We require that , so without loss of generality we can choose to obtain Thus, the objective function to minimize is
Alternatively, the small variance asymptotics can be derived in the case where is not integrated out. To do so, we first rewrite the H generative model in an equivalent form, with integrated out:
For , let and for , let . Integrating out , the contribution to the likelihood from the H prior is now
We use a slightly different limiting process, with , a positive constant, and scale the multinomial distributions (B) by . Taking the logarithm and and ignoring terms as before yields Thus, the objective function to minimize is
Appendix C Timeaccuracy plots for the experiments
In the main paper we include the error versus iteration as it is more objective than timeaccuracy results. In Fig. C.6, we compare the timeaccuracy across different methods for different datasets. EM, PMCMC, and JUMPmeans are implemented in Java and MCMC is implemented in Python. To plot the MCMC results, we give a speed boost of 100x in the results to compensate for Python’s slow interpreter. From our experience with scientific computing applications, we believe this is a generous adjustment. Also we note that the EM implementation used in our experiments is not the most optimized in terms of time per iteration. However, our goal is to show that JUMPmeans can achieve comparable performance with a reasonable implementation of MCMC and EM.
Appendix D Scaling experiments
For the scaling experiments we generated 4 datasets consisting of to sequences. All datasets are sampled from a single hidden state MJP with 5 hidden states and 5 possible observations. For the 20 observations in each sequence a Gaussian likelihood is used. Finally, for the held out results, we categorized the observations in 5 bins, removed of the data points and predicted their category.