# Kernel Density Estimation-Based Markov Models with Hidden State

We consider Markov models of stochastic processes where the next-step conditional distribution is defined by a kernel density estimator (KDE), similar to Markov forecast densities and certain time-series bootstrap schemes. The KDE Markov models (KDE-MMs) we discuss are nonlinear, nonparametric, fully probabilistic representations of stationary processes, based on techniques with strong asymptotic consistency properties. The models generate new data by concatenating points from the training data sequences in a context-sensitive manner, together with some additive driving noise. We present novel EM-type maximum-likelihood algorithms for data-driven bandwidth selection in KDE-MMs. Additionally, we augment the KDE-MMs with a hidden state, yielding a new model class, KDE-HMMs. The added state variable captures non-Markovian long memory and signal structure (e.g., slow oscillations), complementing the short-range dependences described by the Markov process. The resulting joint Markov and hidden-Markov structure is appealing for modelling complex real-world processes such as speech signals. We present guaranteed-ascent EM-update equations for model parameters in the case of Gaussian kernels, as well as relaxed update formulas that greatly accelerate training in practice. Experiments demonstrate increased held-out set probability for KDE-HMMs on several challenging natural and synthetic data series, compared to traditional techniques such as autoregressive models, HMMs, and their combinations.

## 1 Introduction

Time series and other sequence data are ubiquitous in nature. To recognize patterns or make decisions based on observations in the face of uncertainty and natural variation, or to generate new data, e.g., for synthesizing speech, we need capable stochastic models.

Exactly what model to use depends on the situation. The standard approach is to propose a mathematical framework, and then let data fill in the unknowns by estimating parameters [1, 2]. If the data-generating process is well understood, it may be possible to write down an appropriate model form directly. If not, we must fall back on the extensive library of general-purpose statistical models available. With little data, simple, tried-and-true approaches are typically used. These tend to involve many assumptions on the nature of the data-generating process. Model accuracy may suffer if these assumptions are incorrect. Once more data becomes available, such as in today’s “big data” paradigm, it is often possible to get a better description by fitting more advanced models with additional parameters. Using a complex model is certainly no guarantee for better results, however, and finding out which particular description to use tends to be a laborious trial-and-error process.

In this article, we consider a class of nonparametric models for discrete-time, continuous-valued stationary data series, where conditional next-step distributions are defined by kernel density estimators. Unlike standard parametric techniques, these models can converge on a significantly broader class of ergodic finite-order Markov processes as the training data material grows large. This makes the models widely applicable, and is especially compelling for data where the generating process is complex and nonlinear, or otherwise poorly understood.

Our main contributions are 1) extending the nonparametric Markov models with a discrete-valued hidden state, and 2) presenting several guaranteed-ascent iterative update formulas for maximum-likelihood parameter estimation,¹ applicable both with and without hidden state. The added state variable allows models to capture long-range dependences, patterns with variable duration, and similar structure, on top of the short-range correlations described by the Markov model. This is shown to improve finite-sample performance on challenging real-world data. The state variable can also be used as an input signal to control process output, and is attractive for recognition and synthesis applications, particularly with speech signals, where the hidden states can be identified with language units such as phones.

¹Nonparametric models can be defined as sets of probability distributions indexed by a parameter whose dimensionality grows with the amount of data; hence a nonparametric model still has parameters that may need to be estimated.

The remainder of the paper is organized as follows: Section 2 discusses established techniques for modelling Markovian processes with and without hidden state, along with sample applications. Section 3 then introduces kernel density-based time series models and their properties. Parameter estimation is subsequently treated in Section 4. Sections 5 and 6 present experimental results, while Section 7 concludes.

## 2 Background

In this section we discuss Markov models and hidden-state models for strictly stationary and ergodic sequence data, and also address how the two approaches can be combined to describe more complex natural processes.

### 2.1 Markov Models

Let the underline notation $\underline{X}$ represent a sequence of variables from a stochastic process. A process satisfying

$$f_{X_t|\underline{X}_{-\infty}^{t-1}}(x_t \,|\, \underline{x}_{-\infty}^{t-1}) \equiv f_{X_t|\underline{X}_{t-p}^{t-1}}(x_t \,|\, \underline{x}_{t-p}^{t-1}) \quad (1)$$

is said to be a Markov process of order p.² The relation implies that future evolution is conditionally independent of the past, given the latest p observations (the context or state $\underline{x}_{t-p}^{t-1}$). In other words, the process has a short, finite memory, and knowing the p most recent samples suffices for optimal prediction. Setting p = 0 means that all variables are independent. The Bayesian network in Fig. 1(a) illustrates the between-variable dependences for such a Markov process.

²We will use capital letters to denote random variables (RVs), letting lower-case letters signify observations, i.e., specific, nonrandom outcomes of RVs.

Continuous-valued Markovian data can be modelled using linear as well as nonlinear models. Standard linear autoregressive (AR) and autoregressive moving-average (ARMA) models are perhaps the most well known. Among nonlinear models one finds piecewise-linear, regime-switching approaches such as self-exciting threshold autoregressive (SETAR) models [3], non-recurrent time-delay neural networks (TDNNs) [4], and kernel-based AR models [5, 6]. Similar to regular AR models, the latter can be made probabilistic by exciting them with a random process, typically white noise.

All the above models are parametric, and can therefore only converge on processes in their finite-dimensional parametric family. This may limit the achievable accuracy. As it turns out, convergent, general-purpose models are possible using nonparametric techniques. In Section 3 we highlight and extend a simple but flexible approach based on kernel density estimation, drawing on [7] and [8]. The technique is asymptotically consistent for stationary and ergodic Markov processes³ [10] despite having only a single free parameter, and outperforms traditional methods on multiple datasets.

³Not all nonparametric Markov models have this general convergence property, the residual bootstrap of [9] being one counterexample.

### 2.2 Hidden-State Models

To capture long-range dependences, a Markov model may require a high order, many parameters, and a large dataset to provide a reasonable description. A more efficient approach is to use models with a hidden, unobservable state variable that governs time-series evolution. The state is assumed to follow a Markov process, while the observed process values, in turn, are stochastic functions of the current state alone. This dependence structure is illustrated in Fig. 1(b). In this framework, discrete state spaces lead to hidden Markov models (HMMs) [11] and variants thereof, while continuous state spaces yield Kalman filters [12] and nonlinear extensions such as those presented in [13] and [14].

Even though the hidden state satisfies the Markov property, the distribution of the observed data typically cannot be represented by any finite-order Markov process. This is clear, e.g., from the forward algorithm

$$\alpha_t(q) = P(Q_t = q,\, \underline{X}_1^t = \underline{x}_1^t) = f_{X_t|Q_t}(x_t \,|\, q) \sum_{q'} a_{q'q}\, \alpha_{t-1}(q') \quad (2)$$

for state inference in HMMs [11], which recursively depends on all previous samples. Hidden-state models can thus integrate predictive information over arbitrary lengths of time, providing efficient representations of long-range signal structure such as, for instance, the order of sounds in speech signals or different chart patterns in financial technical analysis. At the same time, efficient inference is possible for HMMs using the forward-backward algorithm [11].
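To illustrate how the forward recursion in (2) accumulates information from the entire past, the following sketch implements it for a small HMM with discrete outputs (a simplification relative to the continuous-output models discussed in this paper); the function name, interface, and example matrices are ours, not from the paper.

```python
import numpy as np

def forward(obs, A, B, rho):
    """Forward algorithm: alpha[q] accumulates P(Q_t = q, x_1..x_t).

    obs : sequence of discrete observation indices
    A   : (Q, Q) transition matrix, A[q, q'] = P(Q_{t+1} = q' | Q_t = q)
    B   : (Q, M) emission matrix, B[q, m] = P(X_t = m | Q_t = q)
    rho : (Q,) initial state distribution
    """
    alpha = rho * B[:, obs[0]]          # initialization at t = 1
    for x in obs[1:]:
        # recursion mirrors Eq. (2): sum over previous states, then emit
        alpha = B[:, x] * (alpha @ A)
    return alpha.sum()                  # total likelihood P(x_1..x_T)
```

Because each α depends on the previous α, the state posterior at time t is a function of all samples up to t, not just a finite context.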

Like the components in a mixture model, the discrete state of hidden-Markov models essentially partitions the data into different subsets, each explained by a different distribution. This can capture more detail in the data distribution than can a single component. HMMs furthermore model how components correlate across time, making them capable of representing series data.

### 2.3 Combined Models

Real-world data series often exhibit both short-range correlations and long-range structure. In theory, this can be represented by a discrete-state HMM given a sufficient number of states, but in practice that number may be prohibitively large. We see that HMMs trade the limited memory range of Markov models for a limited memory resolution (a finite state space).

To get the best of both worlds, Markovian observation dependences and hidden-state dynamics can be combined within a single model. This results in a model structure as illustrated in Fig. 1(c), where between-sample dependences follow a Markov process with properties determined by the hidden state. Such a structure is seen in, e.g., GARCH models [15] from econometrics, or in SETAR models driven by an unobservable Markov chain (so-called Markov switching models [16, 3]). The addition of dynamic features (velocity and acceleration) to ordinary HMMs, a common practice in acoustic modelling of speech [17], can be viewed as a method for adding implicit between-frame correlations. Autoregressive HMMs (AR-HMMs) [18] and trajectory HMMs [19] are similar models that make these dependences explicit. We will use AR-HMMs as a baseline since they are based on standard AR models and have a directed structure that allows efficient sampling and parameter estimation.

The Markovian part in a combined model typically captures simple and predictable local aspects of the data, such as continuity, allowing the hidden state to concentrate on more complex, long-range interactions. In acoustic modelling of speech, e.g., [19, 20], dynamic features capture correlations between analysis frames due to physical constraints on the motion of speech articulators, while the hidden states are based on a language model (e.g., n-grams) and account for grammar and other large-scale language structures.

As a major contribution of this article, we extend kernel density-based Markov models as used in [21, 8] with a hidden state. This is found to improve the distribution-prediction performance for held-out data over comparable parametric approaches such as AR-HMMs.

## 3 KDE Models for Time Series

In this section, we first review kernel density estimation of probability distributions, and how it can be adapted to describe and generate Markovian processes. Novel models with hidden state are introduced in Section 3.5.

From now on, we will use the letters y and x to denote training data and test data, respectively. D represents a set of training data, e.g., a training data sequence $\underline{y}_1^N$, while boldface signifies vector-valued variables.

### 3.1 Kernel Density Estimation

Kernel density estimation (KDE) or Parzen estimation, introduced in [22, 23] and discussed in more detail in [24], is a nonparametric method for estimating a D-dimensional probability density f_X from a finite sample y_n, n = 1, …, N, by convolving the empirical density function with a kernel function k(·). The resulting estimated pdf can be written as

$$\hat{f}_X(x \,|\, D; h) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{h^D}\, k\!\left(\frac{x - y_n}{h}\right), \quad (3)$$

where the bandwidth h is a scale parameter adjusting the width of the kernel.

We see that the KDE in (3) always defines a proper density if we require that k(r) ≥ 0 and that k integrates to one. For technical reasons we also require the first moment of k to be zero and its second moment to be bounded. A wide variety of kernels exists, with the squared-exponential or Gaussian kernel

$$k(r) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\tfrac{1}{2} r^\intercal r\right) \quad (4)$$

being a common example. For simplicity, we generally assume that the kernel factors across dimensions as

$$k(r) = \prod_{d=1}^{D} k_d(r_d); \quad (5)$$

this is known as a product kernel. Furthermore, we often take all component functions k_d to be identical.

The bandwidth h controls the degree of smoothing the KDE applies to the empirical distribution function. Bandwidth selection, the task of choosing this h, is key to KDE performance. With an appropriate adaptive bandwidth selection scheme, such that h → 0 while Nh^D → ∞ as N → ∞, KDE can be shown to converge asymptotically on the true probability density function in a mean-squared error sense, regardless of the shape of f_X. (Kernel shape is less important, with many standard choices yielding near-optimal performance [25].) Essentially, given an ever-growing collection of progressively more localized basis functions with centres drawn from f_X, we can eventually represent arbitrarily small details in the pdf. Moreover, KDE attains the best possible convergence rate for nonparametric estimators [26], assuming optimal bandwidth selection.

In practice, KDE finite-sample performance depends heavily on the data and the particular bandwidth chosen. Smooth distributions are typically easy to learn and can use large h, whereas complicated distributions require more data and narrow bandwidths to bring out the details. Consequently, there are many techniques for choosing the bandwidth in a data-driven manner: see, e.g., [27] or [28] for reviews.
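As a concrete illustration of Eq. (3) with the Gaussian product kernel of Eqs. (4) and (5), the following sketch evaluates a KDE at a query point; the function name and interface are our own.

```python
import numpy as np

def gaussian_product_kde(x, data, h):
    """Evaluate the KDE of Eq. (3) with a Gaussian product kernel.

    x    : (D,) query point
    data : (N, D) training samples y_n
    h    : scalar bandwidth
    """
    r = (x - data) / h                                    # (N, D) scaled residuals
    # product kernel: product over dimensions of 1-D Gaussian kernels
    kvals = np.exp(-0.5 * np.sum(r * r, axis=1)) / (2 * np.pi) ** (data.shape[1] / 2)
    return kvals.mean() / h ** data.shape[1]              # average of scaled kernels
```

Each datapoint contributes one kernel bump of width h; the estimate is their average and integrates to one by construction.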

### 3.2 Kernel Conditional Density Estimation

We now consider how to approximate Markov processes. Using the Markov property (1), the density function for a univariate sequence from a general stationary nonlinear Markov process of order p can be written

$$f_{\underline{X}_1^T}(\underline{x}_1^T) = f_{\underline{X}_1^p}(\underline{x}_1^p) \prod_{t=p+1}^{T} f_{X_{p+1}|\underline{X}_1^p}(x_t \,|\, \underline{x}_{t-p}^{t-1}). \quad (6)$$

It is sufficient to specify the stationary conditional next-step distribution $f_{X_{p+1}|\underline{X}_1^p}$ to uniquely determine the X-process and its associated stationary distribution $f_{\underline{X}_1^p}$. A key idea in this paper is to use KDEs, specifically kernel conditional density estimation (KCDE), to estimate this conditional distribution from data.

Given two variables X and X′ and a kernel density estimate $\hat{f}_{X,X'}$ of their joint distribution from training data pairs (y_n, y′_n), the estimate also induces a conditional distribution

$$\hat{f}_{X|X'}(x \,|\, x'; h, h') = \frac{\hat{f}_{X,X'}(x, x'; h, h')}{\int \hat{f}_{X,X'}(\xi, x'; h, h')\, d\xi} \quad (7)$$
$$= \sum_{n=1}^{N} \frac{k_{X'}\!\left(\frac{x' - y'_n}{h'}\right)}{\sum_{n'=1}^{N} k_{X'}\!\left(\frac{x' - y'_{n'}}{h'}\right)}\, \frac{1}{h^D}\, k_X\!\left(\frac{x - y_n}{h}\right); \quad (8)$$

this is the KCDE for X given X′ [21, 10, 29]. (We have assumed that the kernel factors between X and X′.)

The estimator in (8) is (MSE) consistent if both bandwidths shrink towards zero at an appropriate rate as N → ∞. As always, this flexibility and consistency comes at a price of increased computational complexity and memory demands over parametric approaches, since all data points typically must be stored and used in calculations, making, e.g., bandwidth selection scale as O(N²). Fortunately, approximate nearest-neighbour techniques can be used to accelerate KDE computations. Reference [30] describes a method based on dual trees yielding speedup factors up to six orders of magnitude for KDE likelihood computation on 10,000 points from a nine-dimensional census dataset.
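A minimal sketch of the KCDE of Eq. (8) for a pair of scalar variables with Gaussian kernels (function name and interface ours): the conditioning value reweights the kernel centres, and the result is a proper density in x.

```python
import numpy as np

def kcde(x, x_prev, pairs, h, h_prev):
    """Kernel conditional density estimate f(x | x') as in Eq. (8).

    pairs          : (N, 2) array of training pairs (y_n, y'_n)
    h, h_prev      : bandwidths for the predicted and conditioning variable
    """
    g = lambda r: np.exp(-0.5 * r ** 2) / np.sqrt(2 * np.pi)  # 1-D Gaussian kernel
    w = g((x_prev - pairs[:, 1]) / h_prev)   # context weights k_{X'}
    w = w / w.sum()                          # normalization from the denominator of (8)
    return np.sum(w * g((x - pairs[:, 0]) / h) / h)
```

The estimate is a mixture of kernels around the y_n, with mixture weights determined by how well each y′_n matches the conditioning value x′.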

### 3.3 KDE Markov Models

Let D = $\underline{y}_1^N$ be a sampled data sequence (time series) from a stationary, ergodic Markov process X of interest. Since D can be seen as a set of samples from the stationary distribution of X, we can apply KCDE to estimate the conditional next-step distribution in (6). Restricting ourselves to product kernels where all k_d are identical and use the same bandwidth h (this is appealing since X is stationary), this KCDE next-step distribution takes the form

$$\hat{f}_{X_{p+1}|\underline{X}_1^p}(x_t \,|\, \underline{x}_{t-p}^{t-1}; h) = \frac{1}{h} \frac{\sum_{n=p+1}^{N} \prod_{l=0}^{p} k\!\left(\frac{x_{t-l} - y_{n-l}}{h}\right)}{\sum_{n=p+1}^{N} \prod_{l=1}^{p} k\!\left(\frac{x_{t-l} - y_{n-l}}{h}\right)}. \quad (9)$$

(Sequences of vectors are a straightforward extension.)

Eq. (9) defines a Markov model which approximates the data-generating process. We will call this construction a KDE Markov model (KDE-MM), and will write $\tilde{X}$ to distinguish variables generated from the KDE-MM next-step distribution in (9) from X, data distributed according to the reference process defined by f in (6).⁴

⁴Although (9) is by construction a proper conditional density, the stationary distribution of $\tilde{X}$ induced by (9) need not match the original KDE pdf. This is because the joint KDE typically cannot itself be a stationary distribution, as its marginal distributions for different lags are based on different sets of kernel centres (datapoints) and thus are unlikely to be identical. If we wish to ensure stationarity of the KDE itself, it is sufficient to perform periodic extension of the data series for non-positive indices (reminiscent of periodic extension in Fourier analysis), and change the summations over n in (9) to start at n = 1 rather than at n = p + 1. This makes all marginal distributions identical, though it may introduce out-of-character behaviour into $\tilde{X}$ in case the beginning and end of the training sequence do not match up well. Stationarity is then easily verified by direct computation. We have implemented the periodic extension for all KDE-MMs (but not KDE-HMMs) in our experiments. However, (9) will always define a proper stationary stochastic process even if this is not done.

The use of conditional distributions based on KDE to describe first-order (p = 1) Markovian data dates back to [8], predating the KCDE paper [10]. At p = 1, our kernel density-based Markov model coincides with the next-step distribution in [8], assuming the latter uses a product kernel. KDE-MMs can also be seen as a slight restriction of the Markov forecast density (MFD), introduced for economics by [31], when applied to one-step prediction (the methods differ in their predictions for multiple time steps). The maximum-likelihood-type parameter estimation algorithms we later present in Section 4 are new, however.

Using KCDE to describe the next-step conditional distribution has several advantages over parametric approaches. To begin with, we inherit the asymptotic consistency properties of KDEs, and the probabilities of finite substrings in $\tilde{X}$ converge on the true probabilities as N → ∞ (under certain regularity conditions and with appropriately chosen bandwidths). Moreover, the approach only has a single free parameter, the bandwidth h.

As with other nonparametric models, a potential downside of KDE-MMs is that per-point computational costs scale linearly with the database size N. At the same time, N may need to be large for KDE-based methods to overtake fast-to-converge but asymptotically biased parametric models.

### 3.4 KDE-MM Data Generation

Eq. (9) can be rewritten as

$$\hat{f}_{X_{p+1}|\underline{X}_1^p}(x_t \,|\, \underline{x}_{t-p}^{t-1}; h) = \sum_{n=p+1}^{N} \kappa_n(\underline{x}_{t-p}^{t-1}; h)\, \frac{1}{h}\, k\!\left(\frac{x_t - y_n}{h}\right) \quad (10)$$
$$\kappa_n(\underline{x}_{t-p}^{t-1}; h) = \frac{\prod_{l=1}^{p} k\!\left(\frac{x_{t-l} - y_{n-l}}{h}\right)}{\sum_{n'=p+1}^{N} \prod_{l=1}^{p} k\!\left(\frac{x_{t-l} - y_{n'-l}}{h}\right)}, \quad (11)$$

which can be interpreted as a mixture model with context-dependent weights. To generate a new datapoint given the context $\underline{x}_{t-p}^{t-1}$, one selects a mixture component Z_t = n according to the weights κ_n. For KDE-MMs this component corresponds to an index into the database D, identifying an exemplar y_n upon which we base the sample at time t. x_t is then generated from y_n with k-shaped noise added:

$$P(Z_t = n \,|\, \underline{X}_{t-p}^{t-1} = \underline{x}_{t-p}^{t-1}; h) = \kappa_n(\underline{x}_{t-p}^{t-1}; h) \quad (12)$$
$$\hat{f}_{X_{p+1}|Z_{p+1},\underline{X}_1^p}(x_t \,|\, n, \underline{x}_{t-p}^{t-1}; h) = \hat{f}_{X_{p+1}|Z_{p+1}}(x_t \,|\, n; h) \quad (13)$$
$$= \frac{1}{h}\, k\!\left(\frac{x_t - y_n}{h}\right). \quad (14)$$

This staggered structure is illustrated in Fig. 2 for a second-order model (p = 2). The procedure is also simple to implement in software. It is easy to see that

$$\hat{f}_{X_{p+1}|\underline{X}_1^p}(x_{p+1} \,|\, \underline{x}_1^p) \leq \max_{y_n \in D} \frac{1}{h}\, k\!\left(\frac{x_{p+1} - y_n}{h}\right), \quad (15)$$

regardless of the context. This bounds the tails of $\hat{f}$, so the $\tilde{X}$-process has as many finite moments as the kernel function does, ensuring stability.

The bandwidth h affects the character of the generated data in two ways. With a sufficiently narrow bandwidth, and assuming k has exponentially decreasing tails, the weight distribution for Z_t will strongly favour the component whose training-data context best matches the current context; call this component n*_t. Since the added k-shaped noise in (14) has small variance, we then have $\tilde{x}_t \approx y_{n^*_t}$ in the typical case. This makes sampled sequences closely follow long segments of consecutive values from the training data series. If h is increased, the weight distribution becomes more uniform, and sampled trajectories will increasingly often switch between different training-data segments, but the amount of added noise also grows. In Markov forecast densities and in KDE-HMMs as presented in the next section, the two roles of context sensitivity and additive noise level are assigned to separate parameters.
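The two-stage sampling procedure of Eqs. (12)-(14) can be sketched as follows for Gaussian kernels; the function name and the use of log-weights for numerical stability are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def kde_mm_step(context, y, p, h):
    """Draw one sample from the KDE-MM next-step distribution, Eqs. (10)-(14).

    context : array of the p most recent values (x_{t-p}, ..., x_{t-1})
    y       : training series y_1..y_N (0-based array)
    p, h    : model order and bandwidth
    """
    n = np.arange(p, len(y))                   # candidate exemplar indices
    # log context weights: product of Gaussian kernels over lags l = 1..p
    logw = np.zeros(len(n))
    for l in range(1, p + 1):
        logw += -0.5 * ((context[-l] - y[n - l]) / h) ** 2
    w = np.exp(logw - logw.max())
    w /= w.sum()                               # kappa_n of Eq. (11)
    z = rng.choice(n, p=w)                     # pick exemplar Z_t, Eq. (12)
    return y[z] + h * rng.standard_normal()    # add kernel-shaped noise, Eq. (14)
```

A trajectory is generated by repeatedly appending the returned sample to the context, which reproduces the concatenate-and-perturb behaviour described above.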

The idea of creating new samples by concatenating old data is reminiscent of the common block bootstrap for time series [32], albeit more refined, since the KDE-MM preferentially selects pieces that fit well together; this has advantages for the asymptotic convergence rate of bootstrap statistics [33]. Indeed, both [8] and [31] apply a bootstrap perspective when introducing KDE-based methods. The KDE-MM approach is also similar to waveform-similarity-based overlap-add techniques (WSOLA) [34] from signal processing, although these do not model probability distributions.

### 3.5 KDE Hidden Markov Models

Being Markov models, KDE-MMs capture short-range correlations in data series, but are not well equipped to describe long-range memory or structure such as the sequential order of sounds in a speech utterance, as discussed in Section 2. To remedy this, we introduce a hidden (unobservable), discrete state variable Q_t into the model in (9); Q_t is governed by a first-order Markov chain. This results in hidden Markov models where the state-conditional output distributions are given by KDEs, specifically KDE-MMs. In other words, the current state determines which of a set of KDE-MMs is used to generate the next sample value of the process. This construction is completely analogous to the setup of AR-HMMs [18], but with KDE-MMs instead of linear AR models. We call the new models kernel density hidden Markov models, or KDE-HMMs.

In this paper, we will work with models that, given the current state q and context $\underline{x}_{t-p}^{t-1}$, have the form

$$\hat{f}_{X_t|Q_t,\underline{X}_{t-p}^{t-1}}(x_t \,|\, q, \underline{x}_{t-p}^{t-1}; K) = \sum_{n=p+1}^{N} \kappa_{qn}(\underline{x}_{t-p}^{t-1}; K)\, \frac{1}{h_{q0}}\, k_{q0}\!\left(\frac{x_t - y_n}{h_{q0}}\right) \quad (16)$$
$$\kappa_{qn}(\underline{x}_{t-p}^{t-1}; K) = \frac{w_{qn} \prod_{l=1}^{p} k_{ql}\!\left(\frac{x_{t-l} - y_{n-l}}{h_{ql}}\right)}{\sum_{n'=p+1}^{N} w_{qn'} \prod_{l=1}^{p} k_{ql}\!\left(\frac{x_{t-l} - y_{n'-l}}{h_{ql}}\right)}; \quad (17)$$

here K denotes the set of KDE parameters (the kernels k_{ql}, bandwidths h_{ql}, and weights w_{qn} for every state q and lag l). The function of the weights w_{qn} is to allow the training data points to be assigned to different states. This assignment can be either hard (binary) or soft, but we require w_{qn} ≥ 0 and Σ_n w_{qn} = 1. We also allow kernel functions and bandwidths to depend on the state and lag considered. In addition to K, we also have standard HMM parameters describing the hidden-state evolution, specifically a matrix A of state transition probabilities defined by

$$a_{qq'} = P(Q_{t+1} = q' \,|\, Q_t = q). \quad (18)$$

(Since the model is stationary and ergodic, the leading left eigenvector of A defines the stationary state distribution of the Markov chain. This also gives the initial state probabilities of the model.) The full set of KDE-HMM parameters is thus θ = (A, K). In Section 4 we derive update formulas to estimate these parameters from data.

To the best of our knowledge, models of the above form have not been considered previously. KDE-HMMs generalize [8] and [31] by introducing hidden states and different kernels and bandwidths for every lag l. Our proposal is also more general than HMMs with KDE outputs, dubbed KDE/HMM, previously investigated in [35]. We recover KDE/HMMs by setting p = 0, making samples conditionally independent given the state sequence, similar to Fig. 1(b).

Like with KDE-MMs, the next-step distribution of KDE-HMMs can be broken into two parts: one, Z_t, choosing an exemplar from D for the next step, given the context and the current state Q_t, and the other adding kernel-shaped noise around the selected exemplar value. The dependences among variables in a KDE-HMM are illustrated in Fig. 3. Despite the increased complexity, sampling is straightforward, and efficient inference remains possible for any order p using standard forward-backward recursions. Specifically, past and future values of Q and Z are conditionally independent given the observation sequence and the state at time t, as can be verified, e.g., using the algorithm in [36].

Adding a hidden state has several advantages. Aside from the benefits discussed in Sections 2.2 and 2.3, the state essentially partitions the space of contexts into regions associated with different models, thus allowing different bandwidths for different situations. This can be a significant advantage in many cases, even when the data-generating process is Markovian, since it often is desirable to apply different degrees of smoothing in the modes versus the tails of distributions: see, for instance, [24] or [37], where the latter's filtering functions closely resemble our weights w_{qn}. Letting the bandwidth depend on the lag furthermore allows better representation of phenomena such as variable (typically decreasing) predictive value of older samples, and separates context sensitivity (h_{ql} for l ≥ 1) from the added noise level (h_{q0}) during sampling. This distinguishes KDE-HMMs from KDE-MMs even if the number of states is one. As before, the advantages come at a price of greater demands for data and computation, compared to parametric approaches.
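A sketch of the state-conditional next-step density of Eqs. (16)-(17) with Gaussian kernels (interface and names ours; w and h here are assumed to hold the state-dependent weights and bandwidths):

```python
import numpy as np

def kde_hmm_density(x, q, context, y, w, h, p):
    """State-conditional next-step density of Eqs. (16)-(17), Gaussian kernels.

    x       : point at which to evaluate the density
    q       : current hidden state (0-based)
    context : the p most recent observations
    y       : training series y_1..y_N (0-based array)
    w       : (Q, N) state-dependent component weights w_{qn}
    h       : (Q, p+1) bandwidths h_{ql}; column 0 is the output bandwidth
    """
    g = lambda r: np.exp(-0.5 * r ** 2) / np.sqrt(2 * np.pi)
    n = np.arange(p, len(y))
    # context weights kappa_{qn}: weighted product of kernels over lags 1..p
    kappa = w[q, n]
    for l in range(1, p + 1):
        kappa = kappa * g((context[-l] - y[n - l]) / h[q, l])
    kappa = kappa / kappa.sum()
    return np.sum(kappa * g((x - y[n]) / h[q, 0]) / h[q, 0])
```

With Q = 1 and all bandwidths tied, this reduces to the KDE-MM next-step density of Eq. (10).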

## 4 Parameter Estimation

To use models in practice, we must identify parameters that fit well for a given process of interest. In this section, we consider bandwidth selection and parameter estimation for KDE-MMs and KDE-HMMs. In particular, we derive guaranteed-ascent maximum likelihood update equations based on the EM-algorithm for the case of Gaussian kernels, along with relaxed update formulas which converge faster in practice.

### 4.1 Parameter Estimation Objective Function

Numerous principles have been proposed for choosing bandwidths in a data-driven manner (see [27] and [28] for reviews). In practice, the preferred scheme depends on the data and on the intended application, including the performance measure (loss function) of interest. In this paper we concentrate on the classic Kullback-Leibler (KL) divergence from information theory,

$$D_{\mathrm{KL}}(f_X \,\|\, \hat{f}_X) = \int f_X(x) \ln \frac{f_X(x)}{\hat{f}_X(x)}\, dx, \quad (19)$$

previously used with KDE in signal processing and machine learning by [35] and [30]. Minimizing the KL-divergence between empirical and estimated distributions leads to traditional maximum-likelihood (ML) parameter estimation, cf. [38]. While likelihood maximization is not considered appropriate for heavy-tailed reference densities [39, 27], likelihoods are asymptotically much faster to evaluate than other bandwidth-selection criteria such as the integrated squared error [30].

The likelihood function is given by Eqs. (9) and (16) applied to the training data sequence, but it is not appropriate to optimize this function directly. In the limit where the bandwidth goes to zero, the KDE shrinks to a set of point masses (spikes) placed at the points in D, and the likelihood diverges to positive infinity. A similar issue exists in common models such as Gaussian mixture models (GMMs) and Gaussian-output HMMs: by dedicating one component or state to a narrow spike centred on a single datapoint, arbitrarily high likelihood values can be achieved [1, pp. 433–434].

With KDE, a standard workaround is to use cross-validation, for every t omitting the component centred at y_t when computing the likelihood term at time t. This prevents points in the training data from "explaining themselves" and removes the degenerate optimum at zero bandwidth. The resulting objective function is known as the pseudo-likelihood. It can be shown that, under certain conditions, maximizing the pseudo-likelihood asymptotically minimizes the KL-divergence of the KDE [40].
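For plain (unconditional) 1-D KDE, the leave-one-out pseudo-likelihood and a simple grid search over bandwidths can be sketched as follows; the function names and the grid-search strategy are ours, and the paper's own estimation instead uses the EM-type updates of Section 4.

```python
import numpy as np

def loo_log_pseudolikelihood(y, h):
    """Leave-one-out log pseudo-likelihood of a 1-D Gaussian KDE.

    Each point y_t is scored under the KDE built from all other points,
    which removes the degenerate optimum at h -> 0.
    """
    n = len(y)
    r = (y[:, None] - y[None, :]) / h                 # pairwise scaled residuals
    K = np.exp(-0.5 * r ** 2) / (np.sqrt(2.0 * np.pi) * h)
    np.fill_diagonal(K, 0.0)                          # omit the self-contribution
    return np.sum(np.log(K.sum(axis=1) / (n - 1)))

def select_bandwidth(y, grid):
    """Pick the grid bandwidth maximizing the pseudo-likelihood."""
    return max(grid, key=lambda h: loo_log_pseudolikelihood(y, h))
```

Both very small and very large bandwidths score poorly under this criterion, so the selected h lies at an interior trade-off between undersmoothing and oversmoothing.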

For simplicity, we exclude the first p points of all sequences, where the context is incomplete, from the KDE-HMM likelihood evaluations in this paper. (KDE-MMs use the periodic extension from footnote 4.) Using (17) and (16) then leads to the pseudo-likelihood

$$\tilde{f}_{\underline{X}}(\underline{y}_1^N; \theta) = \sum_{\underline{q}_{p+1}^N} \sum_{\underline{z}_{p+1}^N} \tilde{f}_{\underline{X},\underline{Q},\underline{Z}}(\underline{y}_1^N, \underline{q}_{p+1}^N, \underline{z}_{p+1}^N; \theta) \quad (20)$$
$$\tilde{f}_{\underline{X},\underline{Q},\underline{Z}}(\underline{y}_1^N, \underline{q}_{p+1}^N, \underline{z}_{p+1}^N; \theta) = P(\underline{Q}_{p+1}^N = \underline{q}_{p+1}^N; A) \cdot \prod_{t=p+1}^{N} \frac{(1 - \delta_{z_t t})\, w_{q_t z_t} \prod_{l=0}^{p} k_{q_t l}\!\left(\frac{y_{t-l} - y_{z_t - l}}{h_{q_t l}}\right)}{h_{q_t 0} \sum_{n=p+1,\, n \neq t}^{N} w_{q_t n} \prod_{l=1}^{p} k_{q_t l}\!\left(\frac{y_{t-l} - y_{n-l}}{h_{q_t l}}\right)}. \quad (21)$$

The cross-validation terms (1 − δ_{z_t t}) are introduced to set the probabilities of sequences having z_t = t to zero.

To reduce clutter, we will from now on take the indices in expressions and summations to range as q ∈ {1, …, Q}, t ∈ {p + 1, …, N}, and n ∈ {p + 1, …, N}, unless otherwise specified. Primed indices follow the same limits as unprimed ones.

### 4.2 Expectation Maximization for KDE-HMMs

As with other latent-variable models such as GMMs and HMMs, direct analytic optimization of the (log) pseudo-likelihood is infeasible due to the sums over latent variables in (20). Instead, we seek an iterative optimization procedure based on the expectation maximization (EM) algorithm [41], by maximizing the auxiliary function

$$Q(\theta'; \hat{\theta}) = E\!\left(\ln \tilde{f}_{\underline{X},\underline{Q},\underline{Z}}(\underline{y}, \underline{Q}, \underline{Z}; \theta') \,\middle|\, \underline{X} = \underline{y}; \hat{\theta}\right). \quad (22)$$

Using Jensen's inequality, one can prove that any revised parameter estimate $\hat{\theta}^{(\text{new})}$ satisfying

$$Q(\hat{\theta}^{(\text{new})}; \hat{\theta}) \geq Q(\hat{\theta}; \hat{\theta}) \quad (23)$$

is guaranteed not to decrease the actual likelihood of the data; this is used to establish convergence. In particular, convergence does not require maximizing (22) at every step, and procedures that increase Q without necessarily maximizing it are termed generalized EM (GEM). Because the pseudo-likelihood is equivalent to the regular likelihood together with a specific cross-validation prior on the latent variables Z_t, convergence is assured also in our case.

At each EM iteration the shape of the auxiliary function is determined by the conditional posterior distribution of the latent variables Q_t and Z_t. For KDE-HMMs, the relevant forward-backward recursions to determine the conditional hidden-state distributions, also known as state occupancies,

$$\gamma_{qt} = P(Q_t = q \,|\, \underline{X}_1^N = \underline{y}_1^N; \hat{\theta}) \quad (24)$$

are identical to those of other HMMs with Markovian output distributions, such as [18], apart from enforcing z_t ≠ t due to cross-validation. The occupancies are used to update the state transition probabilities of the KDE-HMM following standard HMM formulas available in [11].

Determining conditional posterior distributions, or responsibilities, of the latent KDE component variables Z_t is similarly straightforward, and one obtains

$$\varrho^{\mathrm{num}}_{qnt} = P(Z_t = n \,|\, Q_t = q, \underline{X}_{t-p}^{t} = \underline{y}_{t-p}^{t}; \hat{K}) \quad (25)$$
$$= \frac{(1 - \delta_{nt})\, \kappa_{qn}(\underline{y}_{t-p}^{t-1}; \hat{K})\, k_{q0}\!\left(\frac{y_t - y_n}{\hat{h}_{q0}}\right)}{\sum_{n' \neq t} \kappa_{qn'}(\underline{y}_{t-p}^{t-1}; \hat{K})\, k_{q0}\!\left(\frac{y_t - y_{n'}}{\hat{h}_{q0}}\right)}. \quad (26)$$

The auxiliary function for weights and bandwidth parameters then takes the form

  Q_K(K'; \hat{K}) = \sum_{q,t,\,n \neq t} \gamma_{qt}\, \varrho^{\mathrm{num}}_{qnt} \bigl( \ln w'_{qn} - \tfrac{1}{2} \ln h'^{2}_{q0} \bigr) + \sum_{q,t,\,n \neq t} \gamma_{qt}\, \varrho^{\mathrm{num}}_{qnt} \sum_{l=0}^{p} \ln k_{ql}\!\left( \frac{y_{t-l} - y_{n-l}}{h'_{ql}} \right) - \sum_{q,t} \gamma_{qt} \ln\!\Bigl( \sum_{n \neq t} w'_{qn} \prod_{l} k_{ql}\!\left( \frac{y_{t-l} - y_{n-l}}{h'_{ql}} \right) \Bigr). \quad (27)

Optimal values for the next-step bandwidths h_{q0} are readily identified [35] and are presented in (28) below. The weights and context bandwidths (w_{qn} and h_{ql} for l ≥ 1) present an obstacle, however, as these appear in a negative term with a sum inside a logarithm and cannot be optimized analytically. Such terms are common in models that feature conditional probabilities with renormalization, e.g., standard maximum mutual information (MMI) discriminative classifiers [42].

To identify GEM updates of K with Gaussian kernels we construct a global lower bound on Q_K in (27) which is tight at K' = \hat{K}. Any updated parameter estimate that does not decrease this lower bound is then guaranteed not to decrease the true auxiliary function, and the key convergence relation (23) remains satisfied. This approach is known as minorize-maximization and can be seen as a generalization of EM [43].

While many bounds on log-sum-exp-type expressions exist [44], we chose to base our bound on so-called reverse-Jensen bounds [45, 46]. These apply to Gaussian and other exponential-family kernels that satisfy the conditions of Lemma 2, page 139 in [46]. Importantly, the reverse-Jensen bounds have the same parametric dependence on K' as the other terms in Eq. (27), enabling us to derive analytic expressions for the parameters that maximize the lower bound.

A mathematically similar application of reverse-Jensen bounds to MMI training of HMM classifiers can be found in [47]. The resulting updates resemble those produced by a common MMI training technique called extended Baum-Welch (EBW) [48, 49]. However, while standard EBW equations include a heuristically set [50] tuning parameter that limits the update magnitude, the reverse-Jensen procedure selects the tuning parameter automatically, requiring no user intervention.

### 4.3 Generalized EM Update Formulas for KDE-HMM

For Gaussian kernels as in (4), maximizing the reverse-Jensen lower bound yields the GEM update equations

  \hat{h}^{2\,(\mathrm{new})}_{q0} = \frac{\sum_{t,\,n \neq t} \gamma_{qt}\, \varrho^{\mathrm{num}}_{qnt} (y_t - y_n)^2}{\sum_{t,\,n \neq t} \gamma_{qt}\, \varrho^{\mathrm{num}}_{qnt}} \quad (28)
  \hat{h}^{2\,(\mathrm{new})}_{ql} = \frac{W_q \hat{h}^{2}_{ql} + \sum_{t,\,n \neq t} \gamma^{\mathrm{diff}}_{qnt} (y_{t-l} - y_{n-l})^2}{W_q + \sum_{t,\,n \neq t} \gamma^{\mathrm{diff}}_{qnt}} \quad (29)
  \hat{w}^{(\mathrm{new})}_{qn} = \frac{W_q \hat{w}_{qn} + \sum_{t} \gamma^{\mathrm{diff}}_{qnt}}{W_q + \sum_{t,\,n' \neq t} \gamma^{\mathrm{diff}}_{qn't}} \quad (30)
  \gamma^{\mathrm{diff}}_{qnt} = \gamma_{qt} \bigl( \varrho^{\mathrm{num}}_{qnt} - \varrho^{\mathrm{den}}_{qnt} \bigr) \quad (31)
  W_q = \sum_{t,\,n \neq t} \gamma_{qt} \bigl( \varrho^{\mathrm{den}}_{qnt} + \omega^{h}_{qnt} + \omega^{w}_{qnt} + \omega'_{qnt} \bigr); \quad (32)

the state occupancies \gamma_{qt} in these formulas are defined by (24), while the responsibilities \varrho^{\mathrm{num}}_{qnt} follow (26); the remaining quantities from the reverse-Jensen bound are

  \varrho^{\mathrm{den}}_{qnt} = P(Z_t = n \mid Q_t = q, \underline{X}_{t-p}^{t-1} = \underline{y}_{t-p}^{t-1}; \hat{K}) \quad (33)
  = \frac{(1 - \delta_{nt})\, \kappa_{qn}(\underline{y}_{t-p}^{t-1}; \hat{K})}{\sum_{n' \neq t} \kappa_{qn'}(\underline{y}_{t-p}^{t-1}; \hat{K})} \quad (34)
  \omega^{h}_{qnt} = 2\, G\bigl( \tfrac{1}{2} \varrho^{\mathrm{den}}_{qnt} \bigr) \sum_{l=1}^{p} (\xi^{h}_{ql})^2 \quad (35)
  \omega^{w}_{qnt} = 4\, G\bigl( \tfrac{1}{2} \varrho^{\mathrm{den}}_{qnt} \bigr) \xi^{w}_{qn} \quad (36)
  \omega'_{qnt} = \varrho^{\mathrm{den}}_{qnt} \max\bigl( \max_l (\xi^{h}_{ql}),\, \xi^{w}_{qn} \bigr) \quad (37)
  \xi^{h}_{ql} = (y_{t-l} - y_{n-l})^2\, \hat{h}^{-2}_{ql} - 1 \quad (38)
  \xi^{w}_{qn} = \hat{w}^{-1}_{qn} - 1. \quad (39)

The function

  G(\gamma) = \begin{cases} \left( \frac{\gamma - 1}{\ln \gamma} \right)^2 - \frac{1}{4 \ln \gamma} & \text{if } \gamma < \nicefrac{1}{6} \\ \left( \frac{\nicefrac{1}{6} - 1}{\ln \nicefrac{1}{6}} \right)^2 - \frac{1}{4 \ln \nicefrac{1}{6}} + \gamma - \frac{1}{6} & \text{otherwise} \end{cases} \quad (40)

follows Eq. (5.10) in [46, p. 99].
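A minimal sketch of G under the piecewise form in (40) follows; the exact coefficients are our transcription and should be checked against [46] before use:

```python
import math

ONE_SIXTH = 1.0 / 6.0

def G(gamma):
    """The function G from Eq. (40), as transcribed above: a curved
    branch for gamma < 1/6, continued linearly (slope one) beyond it.
    Assumes gamma > 0 so the logarithm is defined."""
    def curved(g):
        return ((g - 1.0) / math.log(g)) ** 2 - 1.0 / (4.0 * math.log(g))
    if gamma < ONE_SIXTH:
        return curved(gamma)
    # Linear continuation keeps G continuous at gamma = 1/6.
    return curved(ONE_SIXTH) + gamma - ONE_SIXTH
```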

The update formulas in (28) through (30) should increase the pseudo-likelihood except at local extreme points. Since

  W_q + \sum_{t,\,n \neq t} \gamma^{\mathrm{diff}}_{qnt} > 0, \quad (41)

the updated parameters are always well defined.

###### Prop. 1.

Let \hat{\theta}^{(i)} for i \in \{0, 1, 2, \ldots\} be a sequence of parameter estimates, computed iteratively from \hat{\theta}^{(0)} using the update procedure above, and let \tilde{\ell}^{(i)} be the corresponding pseudo-likelihoods in Eq. (20). Define the minimum separation d_{\min} through

  d_{\min} = \min_{t,\, n \neq t} |y_t - y_n|. \quad (42)

Assume d_{\min} is strictly positive and that the initial bandwidths and weights are finite and positive. Then \lim_{i \to \infty} \tilde{\ell}^{(i)} exists.

The proof proceeds by establishing that the pseudo-likelihood has a finite upper bound, and then showing that \tilde{\ell}^{(i+1)} \geq \tilde{\ell}^{(i)}, which together ensure convergence.

Studying the update equations, we see that the quantity W_q limits the update step length: if W_q is large compared to \sum_{t,\,n \neq t} \gamma^{\mathrm{diff}}_{qnt}, we have \hat{h}^{2\,(\mathrm{new})}_{ql} \approx \hat{h}^{2}_{ql}, and similarly for the weights. Unfortunately, W_q often substantially exceeds these sums on large datasets, and the time required until eventual convergence is then infeasibly long.
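The damping role of W_q in update (29) is easy to see in isolation. The sketch below (argument names are ours) shows how a large damping factor pins the new squared bandwidth to the old value, while a small one lets the data-driven ratio take over:

```python
def damped_bandwidth_update(h2_old, num_sum, den_sum, W):
    """The structure of update (29): a W-damped step for the squared
    context bandwidth. num_sum and den_sum stand for the gamma-diff
    weighted sums in the numerator and denominator, respectively."""
    return (W * h2_old + num_sum) / (W + den_sum)

# With these sums the undamped, data-driven target is 4.0 / 2.0 = 2.0.
slow = damped_bandwidth_update(1.0, num_sum=4.0, den_sum=2.0, W=1e9)
fast = damped_bandwidth_update(1.0, num_sum=4.0, den_sum=2.0, W=0.0)
```

Here `slow` stays essentially at the old value 1.0, whereas `fast` jumps straight to the target 2.0, mirroring why an oversized W_q makes convergence infeasibly slow.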

To reduce W_q and obtain formulas that produce larger updates, one may keep the weights w_{qn} fixed (as their updates drive up the \omega-factors the most) and only consider updating the bandwidths. We can then set \omega^{w}_{qnt} = 0 in all formulas. Moreover, we apply the approximation G(\gamma) \approx \gamma, which is the first of several steps involved in connecting reverse-Jensen update formulas to standard EBW heuristics [47]. This yields the approximate weights

  \tilde{\omega}^{h}_{qnt} = \varrho^{\mathrm{den}}_{qnt} \sum_{l} (\xi^{h}_{ql})^2, \quad (43)

and the associated relaxed W-factors

  \tilde{W}_q = \sum_{t,\,n \neq t} \gamma_{qt} \bigl( \varrho^{\mathrm{den}}_{qnt} + \tilde{\omega}^{h}_{qnt} + \omega'_{qnt} \bigr). \quad (44)

(It is important to remember that \omega'_{qnt} here should be computed with \xi^{w}_{qn} set to zero.) The resulting KDE-HMM training algorithm is summarized in Table I.

Since \tilde{W}_q in (44) tends to be of roughly the same order of magnitude as the \gamma^{\mathrm{diff}} sums in (29), EM-training with \tilde{W}_q instead of W_q converges quite quickly. On the other hand, \tilde{W}_q remains sufficiently conservative to virtually always increase the likelihood at every step in our experiments.

### 4.5 KDE-MM Bandwidth Selection

We now turn to consider the special case of KDE-MMs. While bandwidth selection formulas for KDE-MM-like models do exist in the literature, e.g., [51], these focus on mean squared error rather than pseudo-likelihood. Since KDE-MMs are essentially single-state KDE-HMMs with fixed, uniform weights and all bandwidths tied to be equal, the approach in this paper can also be used to derive pseudo-likelihood maximization formulas for the KDE-MM bandwidth h. Assuming a Gaussian kernel this yields the iterative updates