Inference of a sequence of estimated classes from a sequence of noisy observations is fundamental in many applications. The hidden Markov model (HMM) and its variants are the usual methods employed to do this, and have been used with conspicuous success in such domains as speech recognition, bioinformatics and natural language processing. As well as being computationally efficient, they are a popular choice due to their intuitive probabilistic interpretation. However, they have drawbacks in terms of classification accuracy, being primarily generative rather than discriminative models.
Discriminative training of HMM parameters [22] improves classification performance, though the requirement in general to limit such models to parametric forms means they still do not have the discriminative power of kernel-based and max-margin methods. Another idea therefore is to create models combining the structure of the HMM with the classification approach of support vector machines [1, 19, 20]. This considerably improves classification performance, though at the expense of losing the intuitive probabilistic interpretation of the HMM.
In this paper, we propose a different idea which combines the advantages of both the above approaches, using concepts from the field of direct density ratio estimation [16, 7, 2]. Our key observation is that rather than trying to quantify a set of likelihoods (i.e. how likely any possible observation is given some class, relative to the likelihood of making other observations given that same class), it is in fact only necessary to know about likelihood ratios (i.e. how likely any possible observation is given some class, relative to the likelihood of making the same observation given a different class). Because the forward-backward inference algorithm computes such ratios anyway, we therefore follow the principle that we should avoid solving a more general learning problem than is strictly necessary [21].
We therefore reformulate the forward-backward algorithm for HMM inference in terms of density ratios, so that the intermediate step of likelihood function estimation can be dispensed with. We demonstrate how efficient and highly discriminative nonparametric inference can be carried out in this framework using a kernel-based density ratio estimation procedure. Because density ratios are a natural parameterization of the forward-backward algorithm, the resulting inference procedure is also more numerically stable than the conventional version. We demonstrate these ideas using synthetic data and on a physiological monitoring problem. In the case of physiological monitoring, we also demonstrate an application to sequential anomaly detection.
The remainder of this paper is organized as follows. Section 2 reviews the conventional forward-backward inference procedure, Section 3 recasts this procedure in terms of ratios of probability densities, and Section 4 describes how nonparametric estimates of those density ratios can be calculated directly from training data in both supervised and unsupervised settings, and how estimates of several pairwise likelihood ratio functions can be obtained from a concise set of parameters. Experimental results in Section 5 show the performance of the density ratio HMM methodology on synthetic and real-world physiological monitoring data, showing striking improvements compared to conventional parametric and nonparametric sequential inference approaches.
A Matlab implementation and demo is available at http://cit.mak.ac.ug/staff/jquinn/software/densityratioHMM.html.
2 Forward-backward inference
Consider the estimation of a latent sequence $s_{1:T}$ from an observation sequence $y_{1:T}$. The variable $s_t$ is assumed to have first-order Markovian dynamics, such that $p(s_t \mid s_{1:t-1}) = p(s_t \mid s_{t-1})$, the values it can take on are a discrete set of classes or ‘states’ $\{1, \dots, K\}$, and each observation $y_t$ is independently drawn from a fixed emission distribution $p(y_t \mid s_t)$. Given a sequence $y_{1:T}$ and knowledge of both $p(s_t \mid s_{t-1})$ and $p(y_t \mid s_t)$ and the initial state distribution $p(s_1)$, the forward-backward algorithm can be used to estimate the probability of each state at each time frame, $p(s_t = i \mid y_{1:T})$. We use the following shorthand in describing the algorithm: $A$ denotes a matrix of state transition probabilities, such that $A_{ij} = p(s_t = j \mid s_{t-1} = i)$, and $\pi$ is a vector of initial state probabilities such that $\pi_i = p(s_1 = i)$.
The first stage of the algorithm involves recursively calculating forward messages $\alpha_t(i)$ for each of the $K$ states:
$$\alpha_t(i) = p(y_t \mid s_t = i) \sum_{j=1}^{K} A_{ji}\, \alpha_{t-1}(j), \qquad \alpha_1(i) = \pi_i\, p(y_1 \mid s_1 = i).$$
After each time step, it is conventional (but not necessary) to normalize the forward messages,
$$\hat{\alpha}_t(i) = \alpha_t(i) \Big/ \sum_{j=1}^{K} \alpha_t(j).$$
Without this normalization the algorithm is numerically unstable, as the values of $\alpha_t(i)$ otherwise become very small over repeated iterations. The step is optional, however, because normalization occurs anyway later in the procedure (in Eq. (7)). The forward messages can be interpreted as the posterior probability of each state given observations up to that time frame, $\hat{\alpha}_t(i) = p(s_t = i \mid y_{1:t})$, the process of calculating this being known as filtering. Note that the normalization means that the absolute values of $\alpha_t(i)$ are not directly significant for inference; we are ultimately interested only in the relative magnitudes.
To carry out smoothing (calculation of $p(s_t \mid y_{1:T})$) the backward messages $\beta_t(i)$ must first be similarly calculated:
$$\beta_t(i) = \sum_{j=1}^{K} A_{ij}\, p(y_{t+1} \mid s_{t+1} = j)\, \beta_{t+1}(j), \qquad \beta_T(i) = 1,$$
also with an optional normalization step carried out after each iteration,
$$\hat{\beta}_t(i) = \beta_t(i) \Big/ \sum_{j=1}^{K} \beta_t(j).$$
The two types of messages are then combined to give the final result
$$p(s_t = i \mid y_{1:T}) = \frac{\alpha_t(i)\,\beta_t(i)}{\sum_{j=1}^{K} \alpha_t(j)\,\beta_t(j)}.$$
The forward-backward algorithm therefore requires an explicit likelihood model for every dynamical regime, is numerically unstable (requiring message scaling or transforms in and out of log space to prevent underflow), and loses information during normalization steps. We next describe how a density ratio formulation overcomes these problems, by expressing inference in terms of parameters which are more natural to the problem.
3 Inference with density ratios
In this section we rearrange the above inference equations in terms of pairwise probability density ratios. Ratios of probabilities make it particularly convenient to express Bayesian updates, as the ratio of two posteriors is simply equal to the ratio of the priors multiplied by the ratio of the likelihoods. In order to express the forward-backward equations in this way, corresponding to the three types of values $\alpha_t(i)$, $\beta_t(i)$ and $p(s_t = i \mid y_{1:T})$ we define three types of ratios as
$$\bar{\alpha}_t(i,j) = \frac{\alpha_t(i)}{\alpha_t(j)}, \qquad \bar{\beta}_t(i,j) = \frac{\beta_t(i)}{\beta_t(j)}, \qquad \bar{\gamma}_t(i,j) = \frac{p(s_t = i \mid y_{1:T})}{p(s_t = j \mid y_{1:T})}.$$
We also treat likelihoods in terms of density ratios, and define
$$r_{ij}(y) = \frac{p(y \mid s = i)}{p(y \mid s = j)}.$$
Although these expressions are written in terms of all pairwise ratios, implying that $O(K^2)$ terms have to be calculated at each time frame, in fact these ratios can be estimated with a concise set of parameters as we demonstrate in Section 4.
The messages are finally combined simply with
$$\bar{\gamma}_t(i,j) = \bar{\alpha}_t(i,j)\,\bar{\beta}_t(i,j).$$
The filtering and smoothing probabilities, if needed, can be simply calculated from the ratios with $p(s_t = i \mid y_{1:t}) = \big( \sum_j \bar{\alpha}_t(j,i) \big)^{-1}$ and $p(s_t = i \mid y_{1:T}) = \big( \sum_j \bar{\gamma}_t(j,i) \big)^{-1}$. This completes the density ratio forward-backward algorithm.
To provide some intuition about the ratio-based procedure we look in some more detail at a two-state HMM, $K = 2$. The forward equation in this simplified case reduces to:
$$\bar{\alpha}_t(1,2) = \frac{A_{11}\,\bar{\alpha}_{t-1}(1,2) + A_{21}}{A_{12}\,\bar{\alpha}_{t-1}(1,2) + A_{22}}\; r_{12}(y_t), \tag{12}$$
and the backward equation to
$$\bar{\beta}_t(1,2) = \frac{A_{11}\, r_{12}(y_{t+1})\,\bar{\beta}_{t+1}(1,2) + A_{12}}{A_{21}\, r_{12}(y_{t+1})\,\bar{\beta}_{t+1}(1,2) + A_{22}}. \tag{13}$$
The first term of (12) specifies the evolution from $\bar{\alpha}_{t-1}(1,2)$ to $\bar{\alpha}_t(1,2)$ in terms of the transition probabilities in $A$; hence this is simply the ratio form of marginalizing out the transition probabilities. Figure 1 shows this evolution of the state probability ratio given transition matrices of the form $A = \begin{pmatrix} a & 1-a \\ 1-a & a \end{pmatrix}$ for different values of $a$. If $a = 0.5$ then any past information is lost; the first term of (12) is always equal to 1, hence both states are equally likely a priori at time $t$. If $a = 1$, there can be no state transitions. Inference is equivalent to accumulating evidence in favor of either state 1 or 2 globally, and Eq. (12) reduces to $\bar{\alpha}_t(1,2) = \bar{\alpha}_{t-1}(1,2)\, r_{12}(y_t)$. The ratio form of the Bayesian update given the observation $y_t$ is simply multiplication by $r_{12}(y_t)$.
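The two-state ratio recursion can be checked numerically with a short sketch (illustrative Python rather than the paper's Matlab; `ratio_filter` is our name, and the likelihood ratios are taken as given). Filtering with the single ratio recovers exactly the posterior that the conventional normalized forward pass produces:

```python
import numpy as np

def ratio_filter(lr, A, pi):
    """Two-state forward filtering using only likelihood ratios.

    lr : (T,) likelihood ratios p(y_t | s=1) / p(y_t | s=2)
    A  : (2, 2) transition matrix; pi : (2,) initial probabilities
    Returns (T,) filtered ratios r_t = p(s_t=1 | y_1:t) / p(s_t=2 | y_1:t).
    """
    T = len(lr)
    r = np.zeros(T)
    r[0] = (pi[0] / pi[1]) * lr[0]          # ratio form of the first step
    for t in range(1, T):
        # Dynamics update (ratio form of marginalizing transitions),
        # followed by the Bayesian update: multiply by the likelihood ratio.
        r[t] = (A[0, 0] * r[t - 1] + A[1, 0]) / \
               (A[0, 1] * r[t - 1] + A[1, 1]) * lr[t]
    return r
```

The filtered probability of state 1 is then simply $r_t / (1 + r_t)$; no scaling or log-space arithmetic is needed, since the dynamics update pulls extreme ratios back towards unity.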
The ratio formulation of the forward-backward algorithm is numerically stable, and Figure 1 provides an intuition of this; extreme ratio values tend to be mapped back to within a few orders of magnitude of unity after considering transition probabilities. Scaling of values to prevent underflow is therefore not necessary with this method as it is in the conventional forward-backward algorithm.
Hence, given a way to estimate $r_{ij}(y)$ from training data, it is possible to carry out inference in the HMM without ever needing to calculate the individual likelihoods. These steps are mathematically equivalent to the standard forward-backward algorithm, but do not require any normalization step. If the observation distribution is known exactly then the ratio formulation is equivalent and has no advantage. However, when the observation distribution needs to be approximated in some way – which is almost always the case with complex real-world data – the parameterization in terms of $r_{ij}(y)$ is more natural to the sequential inference problem than attempting to approximate each of the distributions $p(y \mid s = i)$ directly. We discuss ways in which the estimation of these ratios and other parameters can be carried out in the next section.
4 Parameter estimation
The parameters to be learned in the density ratio HMM model are the likelihood ratio functions $r_{ij}(y)$, the transition matrix $A$ and the initial class probabilities $\pi$. In this section we first suggest effective methods for estimating $r_{ij}(y)$ which bypass the difficult step of estimating the densities $p(y \mid s = i)$ and $p(y \mid s = j)$ individually. In Section 4.1 we introduce an efficient method for carrying this out when there are two classes to be modeled, and give a related method for several classes in Section 4.2 which does not need every pairwise ratio to be estimated individually. The estimation of $A$ and $\pi$ is then discussed in Section 4.3.
4.1 Two-class density ratio estimation
We begin by discussing how to estimate a likelihood ratio from data in the case that there are only two states, 1 and 2, in the model, for example as would be needed to calculate (12) and (13). There is just one ratio to learn, since $r_{21}(y) = 1/r_{12}(y)$. Several techniques have been developed for direct density ratio estimation [10, 2, 3, 7, 18, 17]. We use a least squares approach here [4], which yields a consistent estimator with very good computational efficiency. The estimator is of the following form:
$$\hat{r}_{12}(y) = \theta^{\top} \varphi(y) = \sum_{b=1}^{B} \theta_b\, \varphi_b(y), \tag{14}$$
for some number of parameters $B$, where
$$\varphi(y) = \left( \varphi_1(y), \dots, \varphi_B(y) \right)^{\top}$$
is a vector of kernel basis functions. We can set $\varphi(\cdot)$ to have a kernel basis function at every training point, or for large training sets use some random subset of the training points as centers. In this work we use the squared exponential kernel $\varphi_b(y) = \exp\left( -\|y - c_b\|^2 / (2\sigma^2) \right)$, where $c_b$ is the $b$th kernel center.
This model can be fitted using a squared loss objective function,
$$J(\theta) = \frac{1}{2} \int \left( \hat{r}_{12}(y) - r_{12}(y) \right)^2 p(y \mid s = 2)\, dy.$$
Expanding the squared term we obtain
$$J(\theta) = \frac{1}{2}\, \mathbb{E}_{p(y \mid s = 2)}\!\left[ \hat{r}_{12}(y)^2 \right] - \mathbb{E}_{p(y \mid s = 1)}\!\left[ \hat{r}_{12}(y) \right] + C,$$
where $C$ is a constant term that does not depend on any of the $\theta_b$ values. Empirically, we can approximate the expectations by sample averages. Ignoring the constant $C$ and including an $\ell_2$-regularizer, we have the following training criterion:
$$\hat{J}(\theta) = \frac{1}{2 n_2}\, \theta^{\top} \Phi^{\top} D_2 \Phi\, \theta - \frac{1}{n_1}\, \mathbf{1}_1^{\top} \Phi\, \theta + \frac{\lambda}{2}\, \theta^{\top} \theta,$$
where $\Phi$ is the $n \times B$ design matrix with rows $\varphi(y_t)^{\top}$, $\mathbf{1}_k$ is a column vector indicating membership of class $k$ such that the $t$th element is one if $s_t = k$ and zero otherwise, $n_k$ is the number of training points in class $k$, and $D_k$ is a square matrix with $\mathbf{1}_k$ along the diagonal and other entries set to zero. $\hat{J}(\theta)$ is minimized by
$$\hat{\theta} = \left( \frac{1}{n_2}\, \Phi^{\top} D_2 \Phi + \lambda I \right)^{-1} \frac{1}{n_1}\, \Phi^{\top} \mathbf{1}_1.$$
We select $\sigma$ and $\lambda$ with cross validation. Because of the nature of the estimator, it is sometimes possible to obtain negative values for $\hat{r}_{12}(y)$. We simply round up negative estimates to zero in such cases, which does not affect the consistency of the estimator [5]. This least-squares approach is very fast to compute in practice, finding a global optimum in a single step with no iterative parameter search required.
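The whole two-class estimator fits in a short function (a hedged Python sketch of the least-squares procedure for scalar observations; `fit_ratio` and its argument names are ours, and cross-validation of $\sigma$ and $\lambda$ is omitted):

```python
import numpy as np

def fit_ratio(y1, y2, centers, sigma, lam):
    """Least-squares estimate of the density ratio r(y) = p1(y) / p2(y).

    y1, y2  : 1-D samples from the numerator / denominator densities
    centers : 1-D kernel centre locations
    sigma   : squared-exponential kernel width; lam : ridge regularizer
    Returns a function mapping a 1-D array of points to ratio estimates.
    """
    def phi(y):
        # (n, B) squared-exponential design matrix
        d2 = (y[:, None] - centers[None, :]) ** 2
        return np.exp(-d2 / (2.0 * sigma ** 2))

    P2 = phi(y2)
    H = P2.T @ P2 / len(y2)          # empirical E_2[phi(y) phi(y)^T]
    h = phi(y1).mean(axis=0)         # empirical E_1[phi(y)]
    theta = np.linalg.solve(H + lam * np.eye(len(centers)), h)
    # Round negative estimates up to zero, as in the text.
    return lambda y: np.maximum(0.0, phi(np.asarray(y)) @ theta)
```

A single linear solve gives the global optimum; no density is ever estimated on either side of the ratio.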
4.2 Extension to multiple classes
In problems where $K > 2$, it would be possible to simply learn several pairwise likelihood ratio functions. Because $r_{ij}(y) = 1/r_{ji}(y)$ and $r_{ii}(y) = 1$, this would entail learning $K(K-1)/2$ ratio functions. Fewer still need to be estimated if relationships of the form $r_{ij}(y) = r_{ik}(y)/r_{jk}(y)$ are used, though this ‘ratio of ratios’ may become unstable when the denominator is close to zero.
For multiple-class ratio estimation another approach is to directly estimate conditional probabilities of class given the observation, $p(s \mid y)$, from i.i.d. samples $\{(y_t, s_t)\}_{t=1}^{n}$. Using the fact that $p(y \mid s = k) = p(s = k \mid y)\, p(y) / p(s = k)$, likelihood ratios in our problem can be estimated as
$$\hat{r}_{ij}(y) = \frac{\hat{p}(s = i \mid y)\; \hat{p}(s = j)}{\hat{p}(s = j \mid y)\; \hat{p}(s = i)}.$$
This is similar in principle to the scheme of using classifiers to calculate likelihoods in standard HMMs discussed in [9].
A standard approach to calculating the $p(s = k \mid y)$ terms is kernel logistic regression. This can be computationally demanding for large datasets, though, so we use a least squares procedure [15] similar in form to the two-class estimator we introduced in the previous section, which is known to give comparable accuracy to kernel logistic regression while requiring far less training time.
We construct functions $\hat{q}_k(y)$ to estimate $p(s = k \mid y)$, defined as
$$\hat{q}_k(y) = \theta_k^{\top} \varphi(y),$$
where $\varphi(y)$ is the same as in Eq. (14). The squared loss term in this case is
$$J_k(\theta_k) = \frac{1}{2} \int \left( \hat{q}_k(y) - p(s = k \mid y) \right)^2 p(y)\, dy.$$
Expanding and using $p(s = k \mid y)\, p(y) = p(y \mid s = k)\, p(s = k)$ we obtain
$$J_k(\theta_k) = \frac{1}{2}\, \mathbb{E}_{p(y)}\!\left[ \hat{q}_k(y)^2 \right] - p(s = k)\, \mathbb{E}_{p(y \mid s = k)}\!\left[ \hat{q}_k(y) \right] + C_k,$$
where $C_k$ is constant in $\theta_k$, which can be empirically approximated to give the following training criterion:
$$\hat{J}_k(\theta_k) = \frac{1}{2n}\, \theta_k^{\top} \Phi^{\top} \Phi\, \theta_k - \frac{1}{n}\, \mathbf{1}_k^{\top} \Phi\, \theta_k + \frac{\lambda}{2}\, \theta_k^{\top} \theta_k. \tag{16}$$
$\hat{J}_k(\theta_k)$ is minimized by
$$\hat{\theta}_k = \left( \frac{1}{n}\, \Phi^{\top} \Phi + \lambda I \right)^{-1} \frac{1}{n}\, \Phi^{\top} \mathbf{1}_k, \tag{17}$$
which is essentially kernel ridge regression. As in the two-class case, we select $\sigma$ and $\lambda$ with cross validation. Also as before we round up negative estimates to zero, $\hat{p}(s = k \mid y) = \max\left(0, \hat{q}_k(y)\right)$, though we note that negative estimates are unlikely when the number of training samples is large enough.
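The multi-class estimator amounts to one ridge solve per class against indicator targets. A compact Python sketch for scalar observations (our illustrative names; model selection omitted, and the clipped posteriors are renormalized for convenience):

```python
import numpy as np

def fit_class_posteriors(Y, s, centers, sigma, lam):
    """Least-squares posterior estimates q_k(y) ~ p(s = k | y).

    Y : (n,) training inputs; s : (n,) integer labels in 0..K-1
    centers, sigma : kernel centres and width; lam : ridge regularizer
    Returns a function mapping (m,) points to an (m, K) posterior matrix.
    """
    def phi(y):
        d2 = (y[:, None] - centers[None, :]) ** 2
        return np.exp(-d2 / (2.0 * sigma ** 2))

    P = phi(Y)
    n, B = P.shape
    K = int(s.max()) + 1
    ind = np.eye(K)[s]                 # (n, K) class indicator columns 1_k
    # All K ridge solves share one factorization of the same matrix.
    Theta = np.linalg.solve(P.T @ P / n + lam * np.eye(B), P.T @ ind / n)

    def predict(y):
        q = np.maximum(0.0, phi(np.asarray(y)) @ Theta)  # clip negatives
        return q / q.sum(axis=1, keepdims=True)
    return predict
```

Note that only one $B \times B$ system is factorized for all $K$ classes, which is what makes this so much cheaper than iterative kernel logistic regression.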
4.3 Setting other parameters
We now discuss setting the remaining parameters in the density ratio HMM given training data, first when state labels are available and second as an unsupervised learning problem when we only have observation sequences.
If $s_{1:T}$ is observed in training data, $A$ and $\pi$ can be estimated directly by frequency. The process of setting the remaining parameters in the case that the labels are present in the training data is therefore identical to the standard HMM; we refer the reader to the details in [12].
For the unsupervised case in which only the observation sequence is available for training, learning can be carried out by iterating between (1) a likelihood-maximization step to update $A$, $\pi$ and $\theta_k$ given estimates of $p(s_t = k \mid y_{1:T})$, and (2) an expectation step, the inference procedure given in §3. To do this some initial estimate of the state posteriors is required as a starting point, which could be obtained by running a standard non-dynamic clustering procedure such as $k$-means on the observations. The iterations are continued until convergence or some limit on the number of cycles is reached. For unsupervised learning we require one alteration to the parameter estimation procedure for $\theta_k$ in order to accommodate soft estimates of state probabilities rather than hard labels. The weighted version of Eq. (17) is
$$\hat{\theta}_k = \left( \frac{1}{n}\, \Phi^{\top} \Phi + \lambda I \right)^{-1} \frac{1}{n}\, \Phi^{\top} W_k \mathbf{1},$$
where $W_k$ is a diagonal matrix such that $(W_k)_{tt} = \hat{p}(s_t = k \mid y_{1:T})$ and $\mathbf{1}$ is a vector of ones. The maximization step updates for $A$ and $\pi$ given $\hat{p}(s_t = k \mid y_{1:T})$ are again identical to the standard HMM case [12].
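The soft-label solve differs from the supervised one only in replacing the hard indicator vector with posterior weights, and it is linear in those weights. A small sketch (our function name; `P` is the kernel design matrix from the previous section):

```python
import numpy as np

def weighted_theta(P, W, lam):
    """Soft-label version of the per-class ridge solve.

    P   : (n, B) kernel design matrix with rows phi(y_t)^T
    W   : (n, K) soft state posteriors (a one-hot matrix recovers
          the supervised estimate exactly)
    lam : ridge regularizer
    Returns (B, K) coefficients, one column theta_k per state.
    """
    n, B = P.shape
    # Same normal equations as the supervised case, with the class
    # indicator replaced by posterior weights in the right-hand side.
    return np.linalg.solve(P.T @ P / n + lam * np.eye(B), P.T @ W / n)
```

Because the left-hand matrix does not depend on the weights, it can be factorized once and reused across all EM-style iterations.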
We now give experimental results using the above methods when applied to both synthetic and real-world datasets.
5 Experimental results
5.1 Sequential classification on toy data
To give an illustration of HMM inference using density ratio estimation, we generated data from a simple switching model with three dynamical regimes. The transition probabilities $A$ and initial state probabilities $\pi$ in the model were fixed, and these parameters were then used to generate state sequences $s_{1:T}$. Scalar sequences $x_{1:T}$ were then generated conditioned on $s_{1:T}$, each regime producing a noisy sine wave with different dynamics. A sample sequence is shown in Figure 2 (top left). To construct the vector sequence used for testing inference, we took subsequences of $x$ using a sliding window of length $w$, such that $y_t = (x_{t-w+1}, \dots, x_t)$, for $t \in \{w, \dots, T\}$.
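The sliding-window construction of the observation vectors can be sketched in one line of numpy (an illustrative Python helper of our own; the window length used in the paper's experiment is not specified here):

```python
import numpy as np

def sliding_windows(x, w):
    """Stack length-w sliding windows of a scalar series into rows.

    x : 1-D array of scalar observations; w : window length
    Returns an (len(x) - w + 1, w) array whose row t is x[t : t + w],
    i.e. the vector observation ending at time t + w - 1.
    """
    x = np.asarray(x)
    return np.stack([x[t:t + w] for t in range(len(x) - w + 1)])
```

Each row then serves as one vector observation $y_t$ for the density ratio estimators.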
Figure 2 (bottom left) shows the results of using density ratios to find the most likely class independently at each time step, by finding $\arg\max_i \hat{p}(s_t = i \mid y_t)$. This is equivalent to filtering inference using a uniform transition matrix, $A_{ij} = 1/K$, and gives an idea of how informative the subsequences are about $s_t$ at individual time frames when no dynamical information is incorporated. In this plot the blue line shows the true values of $s_t$ in the sample sequence, and the red crosses show the MAP estimates. The top right panel shows MAP estimates with filtering inference using the forward equations only. The bottom right panel shows MAP estimates with smoothing inference using both forward and backward steps.
Inference results were compared to those from two alternative models. The first was an alternative nonparametric hidden Markov model, using kernel density estimation (with a squared exponential kernel and parameters chosen by cross-validation) to model $p(y \mid s = k)$ for each $k$, as proposed in [8]. The second was the conventional Gaussian mixture model approach to modeling the observation density of each regime, with the number of mixture components selected in each case being that which minimized the Bayesian Information Criterion (BIC) on training data. Training sequences were randomly generated from the above switching noisy sine wave example, of lengths between 50 and 500 subsequences for each of the three dynamic regimes. Test sequences of length 1000 were also generated as above. The density ratio HMM, KDE-HMM and GMM-HMM were trained and applied to the test sequence. The accuracy, calculated as the proportion of time frames for which the MAP estimate was equal to the true simulated state, was calculated for each method. This was repeated 500 times for each training sample size. Figure 3 shows the mean and standard deviation of accuracy for filtering and smoothing with all three methods. The density ratio HMM has consistently higher average accuracy than the other methods, particularly for small training set sizes.
5.2 Physiological monitoring
We next evaluated the density ratio HMM using time series of physiological measurements from premature infants receiving intensive care. The dataset we used consists of 24 hour sequences of monitoring data from 15 babies, a total of 360 hours of data with measurements taken at one second intervals. The measurements are of vital signs (heart rate, blood pressures, temperatures, blood gas concentrations) and environmental measurements (incubator temperature and humidity). The data is annotated with the occurrences of four common phenomena: bradycardia (temporary slowing or stopping of the heart), opening of the incubator, the taking of a blood sample, and disconnection of the core temperature probe. Any period in the data during which there was some clinically significant change not covered by one of the four conditions above was annotated as a fifth class. Finally, for each baby a period of around 10 minutes was annotated as ‘normal’, i.e. representative of that baby’s baseline physiology. The data is publicly available and described in [11].
We first trained a set of two-class density ratio HMMs with this data, treating each state in the annotation as a separate inference problem. To train the bradycardia model for a particular baby, for example, we would take the period for that baby annotated as normal as training data for the first state, and periods annotated as bradycardia from other babies as the training data for the second state.
We compared the output of the density ratio HMM to that of a factorial hidden Markov model (FHMM) and a factorial switching linear dynamical system (SLDS), recreating the evaluation of [11]. The latter methods were developed using extensive domain knowledge of the physical processes underlying the observations. As the problem associated with this dataset is real-time patient monitoring, we applied filtering inference only. The evaluation was done using 3-fold cross validation on the set of 15 sequences. Area under ROC curve (AUC) and equal error rate (EER) for each of the classes in the annotations were calculated for the three methods, shown in Table 1. The density ratio HMM gives either equivalent or superior results in all cases, for example achieving a 17% increase in AUC and 13% decrease in EER for detection of temperature probe disconnection compared to the next best method.
We also constructed an anomaly detection model for each test sequence, to assess the ability of this framework to identify any clinically significant deviation from known types of physiological variation. Using the multiple-class density ratio estimation procedure in §4.2, this can be modeled quite easily. (An outlier detection method which could be used with the two-class ratio estimation method in §4.1 is described in [4].)
Assuming that observations from outlier classes might be present in test data, we use $s_t = 0$ to denote any such class at time $t$. The method we propose for discriminating outliers from inliers is similar in essence to the one-class support vector machine [14] and the kernel Fisher discriminant method for outlier detection [13]. These methods are based on the assumption that outliers occupy low-density regions of the data space and that a kernel model can be used to characterize the high-density regions given training data. Any given significance threshold can then be used to separate the inlier and outlier level sets.
We estimate the conditional probability of an outlier with
$$\hat{p}(s_t = 0 \mid y_t) = 1 - \sum_{k=1}^{K} \hat{q}_k(y_t). \tag{18}$$
The problem of identifying outliers can then be equated with learning the $\hat{q}_k$ such that Eq. (18) is close to zero when $y_t$ is within a region in which training data has high density, and is close to one anywhere else. This is achieved by the squared-loss fit of §4.2, which drives $\sum_k \hat{q}_k(y)$ towards one in regions where training data is dense and towards zero away from the kernel centers. The parameters learned to model the inlier classes can therefore also be used for outlier detection.
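Given the clipped inlier posteriors from the least-squares fit, the outlier score is a one-line computation (an illustrative sketch under the assumption stated above that the outlier probability is one minus the summed inlier estimates; `outlier_probability` is our name):

```python
import numpy as np

def outlier_probability(Q):
    """Outlier score from unnormalized inlier class-posterior estimates.

    Q : (m, K) matrix of inlier posterior estimates q_k(y) for m points
    Returns an (m,) array of estimated outlier probabilities,
    1 - sum_k q_k(y), clipped to the [0, 1] interval.
    """
    return np.clip(1.0 - Q.sum(axis=1), 0.0, 1.0)
```

Points whose summed inlier posteriors fall well below one, i.e. points lying outside the regions the kernel model covers, receive scores near one.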
To apply this to the problem of anomaly detection in the physiological monitoring dataset, we trained inlier class parameters using the reference period of the test sequence annotated as normal, as well as any subsequence of training data annotated as bradycardia, incubator opening, blood sampling or temperature probe disconnection. Inference of $p(s_t = 0 \mid y_{1:t})$ in test data therefore gave an estimate of whether any significant physiological changes not consistent with any of the known dynamical regimes were occurring.
We compared this to the anomaly detection method based on the SLDS in [11]. Again using 3-fold cross-validation on the 360 hours of annotated data, the performance of our method was found to be superior to that of the SLDS, with a 6% increase in AUC and 5% decrease in EER.
6 Conclusion
In this paper we have demonstrated that direct density ratio estimation methods can be applied to sequential inference problems, improving the discriminative capability of HMMs without losing the probabilistic interpretation of such models. We believe this method is particularly effective when no parametric class-conditional distribution of observations is known a priori, which is usually the case in real-world problems. As density ratio estimation is a growing field, having already been successfully applied to various problems in statistical inference such as covariate shift and outlier detection, the ideas in this paper make further advances in the field directly applicable to sequence modeling. It would also be possible to apply the same principle to several other related sequential probabilistic models, which is another significant direction for future work.
References
[1] Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov support vector machines. In Proceedings of the 20th International Conference on Machine Learning, 2003.
[2] S. Bickel, M. Brückner, and T. Scheffer. Discriminative learning for differing training and test distributions. In Proceedings of the 24th Annual International Conference on Machine Learning (ICML2007), pages 81–88, 2007.
[3] A. Gretton, A. Smola, J. Huang, M. Schmittfull, K. Borgwardt, and B. Schölkopf. Covariate shift by kernel mean matching. In J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. Lawrence, editors, Dataset Shift in Machine Learning, pages 131–160, Cambridge, MA, USA, 2009. MIT Press.
[4] T. Kanamori, S. Hido, and M. Sugiyama. A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10:1391–1445, Jul. 2009.
[5] T. Kanamori, T. Suzuki, and M. Sugiyama. Statistical analysis of least-squares density ratio estimation. Machine Learning, 86(3):335–367, 2012.
[6] A. McCallum, D. Freitag, and F. Pereira. Maximum entropy Markov models for information extraction and segmentation. In Proceedings of the Seventeenth International Conference on Machine Learning, volume 951, pages 591–598, 2000.
[7] X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.
[8] M. Piccardi and O. Perez. Hidden Markov models with kernel density estimation of emission probabilities and their use in activity recognition. 2007.
[9] V. Punyakanok and D. Roth. The use of classifiers in sequential inference. In Advances in Neural Information Processing Systems, volume 14, 2001.
[10] J. Qin. Inferences for case-control and semiparametric two-sample density ratio models. Biometrika, 85(3):619–630, 1998.
[11] J. A. Quinn, C. K. I. Williams, and N. McIntosh. Factorial switching linear dynamical systems applied to physiological condition monitoring. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31:1537–1551, 2009.
[12] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[13] V. Roth. Kernel Fisher discriminants for outlier detection. Neural Computation, 18(4):942–960, 2006.
[14] B. Schölkopf, R. Williamson, A. Smola, and J. Shawe-Taylor. SV estimation of a distribution’s support. In Advances in Neural Information Processing Systems, volume 12, 1999.
[15] M. Sugiyama, H. Hachiya, M. Yamada, J. Simm, and H. Nam. Least-squares probabilistic classifier: A computationally efficient alternative to kernel logistic regression. In Proc. International Workshop on Statistical Machine Learning for Speech Processing (IWSML2012), pages 1–10, 2012.
[16] M. Sugiyama, T. Suzuki, and T. Kanamori. Density Ratio Estimation in Machine Learning. Cambridge University Press, Cambridge, UK, 2012.
[17] M. Sugiyama, T. Suzuki, and T. Kanamori. Density ratio matching under the Bregman divergence: A unified framework of density ratio estimation. Annals of the Institute of Statistical Mathematics, 64(5):1009–1044, 2012.
[18] M. Sugiyama, T. Suzuki, S. Nakajima, H. Kashima, P. von Bünau, and M. Kawanabe. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60(4):699–746, 2008.
[19] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In Advances in Neural Information Processing Systems, volume 16, 2004.
[20] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6(2):1453, 2006.
[21] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.
[22] P. C. Woodland and D. Povey. Large scale discriminative training of hidden Markov models for speech recognition. Computer Speech & Language, 16(1):25–47, 2002.