For many machine learning tasks we face the scenario of learning a model from non-stationary sequential data that results from studying several views of the same underlying phenomenon. Consider the following examples: replicated scientific experiments often vary in timing across trials; different subjects may require different amounts of time to perform the same task; the peaks of neuronal action potential waveforms do not match each other perfectly; telecommunication signals suffer from temporal jitter; climate patterns are often cyclic, though particular events take place at slightly different times of the year. However, most sample statistics, such as the mean and variance, are designed to capture variations in amplitude rather than in phase or timing; this results in increased sample variance, blurred fundamental data structures and an exaggerated number of principal components needed to describe the data. Therefore, the alignment of data is often performed as a non-trivial pre-processing step.
Traditionally, the notion of sequence correspondence is defined using a measure of pairwise similarity integrated across the sequences Berndt:1994 ; Keogh:2001 ; Drydmard:2016 ; Hsu:2005 ; Zhou:2009 ; Zhou:2012 . We build on Kazlauskaite:2018 , where the models of the individual sequences and the alignment across sequences are cast within a single framework. Unlike the commonly used approach of interpolating between observations, the underlying functions and the warps are modelled using Gaussian processes (GPs), which jointly regularise the solution space of the ill-posed alignment problem. The alignment is performed using dimensionality reduction, which preserves similarities in the observation space while imposing a preference for dissimilar data points to be placed far apart in the latent space. In particular, the authors propose to use a GP latent variable model (GP-LVM) Lawrence:2005 that places independent GPs over the data features and optimises the corresponding latent variables.
Similarly to Kazlauskaite:2018 , we use GPs to model the data and the warps, which allows us to reject observation noise in a principled manner and imposes a smoothness constraint on the warping functions. In contrast, however, we consider a Dirichlet process mixture model (DPMM) Ferguson:1973 as the model for alignment. The DPMM performs clustering explicitly and allows us to automatically find the optimal number of underlying functions explaining the observed sequences.
Let us assume that we are given $J$ noisy observed sequences $\{\mathbf{y}_j\}_{j=1}^J$, where each sequence comprises $N$ time samples. We consider each sequence to be generated by sampling a latent function $f_j$ at warped input points, i.e. $\mathbf{y}_j = f_j(g_j(\mathbf{x})) + \boldsymbol{\epsilon}_j$, where $\mathbf{x}$ are known evenly-spaced inputs (the same for all sequences), $g_j$ are the warping functions (different for each sequence) and $\boldsymbol{\epsilon}_j$ is i.i.d. Gaussian noise.
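To make the generative assumptions concrete, the following sketch draws misaligned sequences from this model (a minimal numpy sketch; the latent function, the power-law warp family and the noise level are illustrative choices of ours, not those used in the experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
J, N = 10, 50                           # number of sequences, samples per sequence
x = np.linspace(0.0, 1.0, N)            # evenly-spaced inputs shared by all sequences

f = lambda t: np.sin(2 * np.pi * t)     # a hypothetical latent function f_j

Y = []
for j in range(J):
    a = rng.uniform(0.5, 2.0)           # hypothetical monotonic warp g_j(x) = x**a
    y = f(x ** a) + 0.05 * rng.standard_normal(N)   # y_j = f_j(g_j(x)) + noise
    Y.append(y)
Y = np.stack(Y)                         # J x N matrix of misaligned sequences
```

Each row of `Y` is the same underlying signal observed through a different monotonic distortion of the time axis, which is exactly the misalignment the model is designed to undo.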
We are interested in the case where the number of distinct latent functions is smaller than the number of sequences $J$, i.e. some of the observed sequences are generated from the same underlying function. In this setting, the observed sequences are misaligned by the application of different time warpings to the input locations. We treat the aligned sequences $\hat{\mathbf{y}}_j$ as if they were observed, and hence refer to them as pseudo-observations. The rest of this section is structured as follows: first, we briefly describe the model over the individual sequences $\mathbf{y}_j$, which is the same as in Kazlauskaite:2018 (we refer the reader to that work for more details); next, we introduce an aligning objective based on the DPMM.
Model over time
We model the latent functions $f_j$ as stationary GPs with a squared exponential kernel. Denoting the finite-dimensional evaluation of the warping function as $\mathbf{g}_j = g_j(\mathbf{x})$, the joint likelihood of an observed and an aligned sequence is
\[
p(\mathbf{y}_j, \hat{\mathbf{y}}_j \mid \mathbf{g}_j) = \mathcal{N}\big([\mathbf{y}_j, \hat{\mathbf{y}}_j] \mid \mathbf{0},\ K([\mathbf{g}_j, \mathbf{x}], [\mathbf{g}_j, \mathbf{x}]) + \sigma_j^2 I\big), \quad (1)
\]
where $K$ denotes the squared exponential kernel and $\sigma_j^2$ the noise variance.
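The Gaussian marginal likelihood used here is straightforward to compute; below is a minimal numpy sketch (helper names and fixed hyper-parameter values are ours) of a zero-mean GP log likelihood with a squared exponential kernel:

```python
import numpy as np

def se_kernel(a, b, variance=1.0, lengthscale=0.2):
    """Squared exponential kernel k(a, b) = v * exp(-(a - b)^2 / (2 l^2))."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_log_likelihood(y, inputs, noise=0.05):
    """Log marginal likelihood log N(y | 0, K + noise^2 I) of a zero-mean GP."""
    K = se_kernel(inputs, inputs) + noise ** 2 * np.eye(len(inputs))
    L = np.linalg.cholesky(K)                      # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return float(-0.5 * y @ alpha
                 - np.log(np.diag(L)).sum()
                 - 0.5 * len(y) * np.log(2 * np.pi))
```

Evaluating such a likelihood at the concatenation of the warped and unwarped inputs couples the observed sequence and its aligned version through the shared kernel, which is the role the joint likelihood plays above.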
We encode our preference for smooth warping functions by also modelling $\mathbf{g}_j$ as a GP with a smooth kernel function. The warps $\mathbf{g}_j$ are constrained to be monotonic in the range $(0, 1]$ using a cumulative sum of a set of auxiliary variables $\mathbf{u}_j$ with a softmax renormalisation. The distribution over $\mathbf{g}_j$ is
\[
p(\mathbf{g}_j \mid \mathbf{x}) = \mathcal{N}\big(\mathbf{g}_j \mid \mathbf{x},\ K_g(\mathbf{x}, \mathbf{x})\big), \quad (2)
\]
i.e. a GP prior centred on the identity warp.
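The cumulative-softmax construction can be written in a few lines (a sketch with our own variable names): the softmax turns unconstrained auxiliary variables into positive increments summing to one, so their cumulative sum is strictly increasing and ends at 1.

```python
import numpy as np

def warp_from_aux(u):
    """Map unconstrained auxiliary variables u to a monotonic warp in (0, 1].

    softmax(u) gives strictly positive increments that sum to 1; their
    cumulative sum is therefore strictly increasing and reaches 1 exactly.
    """
    w = np.exp(u - u.max())     # numerically stable softmax
    w /= w.sum()
    return np.cumsum(w)

rng = np.random.default_rng(1)
g = warp_from_aux(rng.standard_normal(50))   # one random monotonic warp
```

Constant auxiliary variables yield uniform increments, i.e. a discretised identity warp, which is why penalising the variation of the auxiliary variables favours warps close to the identity.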
So far we have modelled each sequence in isolation, and there is nothing in the model that encourages it to use the smallest possible number of distinct latent functions to explain the data. One way to introduce such a constraint is to add a regularisation term that encourages clustering of the aligned sequences into a small number of clusters. Indeed, the subset of the aligned sequences that belong to the same cluster would be similar to each other, meaning that the corresponding latent functions are also similar when evaluated at the unwarped inputs $\mathbf{x}$, while the GP prior on $f_j$ enforces that the functions are smooth and that nearby values are correlated.
We propose clustering using the DPMM. As a non-parametric mixture model, the DPMM allows us to automatically infer the number of clusters (i.e. distinct aligned sequences) from the data. The locations of the mixture components and the mixing weights are distributed according to a sample from a Dirichlet process, and, depending on the parameters of the process, such a prior seeks to explain the data using only a few mixture components, corresponding to aligning the underlying latent functions. Formally, the DPMM can be defined as follows:
\[
G \sim \mathrm{DP}(\alpha, G_0), \qquad \theta_j \mid G \sim G, \qquad \hat{\mathbf{y}}_j \mid \theta_j \sim p(\hat{\mathbf{y}} \mid \theta_j),
\]
where $G$, a sample from the Dirichlet process, can be thought of as an infinite weighted sum of delta functions at locations $\theta_k$ sampled from the base distribution $G_0$ (e.g. a Gaussian). The scaling parameter $\alpha$ controls the entropy of those weights; for small values of $\alpha$ only a few weights are significantly above 0, meaning that the data are sampled from one of a few mixture distributions with parameters $\theta_k$. An explicit construction of the DPMM is given by the stick-breaking representation:
\[
v_k \sim \mathrm{Beta}(1, \alpha), \quad \pi_k = v_k \prod_{l=1}^{k-1} (1 - v_l), \quad \theta_k \sim G_0, \quad G = \sum_{k=1}^{\infty} \pi_k\, \delta_{\theta_k}.
\]
We use the DPMM to regularise the GPs that are fitted to the data, and hence we optimise the data likelihood of the DPMM jointly with the GP models. Since the data likelihood is not available in closed form, we approximate it with a variational lower bound Blei:2005 . Specifically, we approximate the posterior distributions over the stick proportions $\mathbf{v}$, the component locations $\boldsymbol{\theta}$ and the cluster assignments $\mathbf{z}$ with factorised Beta, Gaussian and Multinomial distributions respectively:
\[
q(\mathbf{v}, \boldsymbol{\theta}, \mathbf{z}) = \prod_{k=1}^{T-1} q_{\gamma_k}(v_k)\, \prod_{k=1}^{T} q_{\tau_k}(\theta_k)\, \prod_{j=1}^{J} q_{\phi_j}(z_j), \quad (3)
\]
where $T$ is the maximal number of clusters in the mixture (the infinite mixture model is truncated for the variational approximation). This approximation yields a lower bound on the data likelihood of the pseudo-observations $\hat{\mathbf{Y}} = \{\hat{\mathbf{y}}_j\}_{j=1}^J$:
\[
\log p(\hat{\mathbf{Y}}) \geq \mathbb{E}_q\big[\log p(\hat{\mathbf{Y}}, \mathbf{v}, \boldsymbol{\theta}, \mathbf{z})\big] - \mathbb{E}_q\big[\log q(\mathbf{v}, \boldsymbol{\theta}, \mathbf{z})\big] =: \mathcal{L}_{\mathrm{DPMM}}, \quad (4)
\]
where both terms are analytically tractable, and we refer the reader to Blei:2005 for their exact expressions.
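For intuition, a truncated variational DPMM of this kind is available off the shelf in scikit-learn; the sketch below (our toy data, not the authors' implementation or experiments) clusters two groups of noisy sequences, with the Dirichlet process prior switching off the unused components of the truncated mixture:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
# Two groups of noisy sequences, stand-ins for already-aligned pseudo-observations.
Y = np.vstack([np.sin(2 * np.pi * x) + 0.05 * rng.standard_normal((5, 30)),
               np.cos(2 * np.pi * x) + 0.05 * rng.standard_normal((5, 30))])

dpmm = BayesianGaussianMixture(
    n_components=10,                                   # truncation level T
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=0.1,                    # scaling parameter alpha
    covariance_type="diag",                            # diagonal-covariance components
    max_iter=500,
    random_state=0,
).fit(Y)

labels = dpmm.predict(Y)    # cluster assignments of the 10 sequences
```

The fitted model assigns each group of sequences to its own component and leaves the remaining components with negligible weight, mirroring the automatic choice of the number of underlying functions.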
We want the observed data and the aligned sequences to be modelled by the GPs ($\mathbf{y}_j$ by $f_j(g_j(\mathbf{x}))$ and $\hat{\mathbf{y}}_j$ by $f_j(\mathbf{x})$ respectively), and the aligned sequences to be clustered into groups; therefore, we simultaneously maximise the GP data likelihood (1) and the lower bound on the DPMM likelihood (4). The GP likelihood (1) includes the evaluation of the warping GP, $\mathbf{g}_j$, which we cannot integrate out analytically; therefore, we use a point estimate by including the likelihood of $\mathbf{g}_j$ in the objective and directly optimising the auxiliary variables $\mathbf{u}_j$, which parametrise $\mathbf{g}_j$. Overall, the optimisation objective is
\[
\mathcal{L} = \sum_{j=1}^{J} \big[ \log p(\mathbf{y}_j, \hat{\mathbf{y}}_j \mid \mathbf{g}_j) + \log p(\mathbf{g}_j \mid \mathbf{x}) \big] + \mathcal{L}_{\mathrm{DPMM}}, \quad (5)
\]
which we maximise w.r.t. the pseudo-observations of the aligned sequences $\hat{\mathbf{y}}_j$, the DPMM variational parameters $\{\gamma_k, \tau_k, \phi_j\}$, the auxiliary variables $\mathbf{u}_j$ of the warping functions, and the hyper-parameters of the GPs and the DPMM. Among the hyper-parameters of the DPMM are the scaling parameter $\alpha$, the variance of the base distribution $G_0$ (we assume it is a zero-mean Gaussian), and the variance of the mixture components (we assume diagonal-covariance Gaussians with the same variance for all components). The scaling parameter $\alpha$ and the variance of $G_0$ are directly optimised (we place a Gamma prior on them), while the variance of the mixture components is set to the estimated noise $\sigma_j^2$ in the GP fits, which we assume to be the same for all sequences.
We consider a set of 10 sequences generated from two underlying functions, each sampled at a linearly spaced input vector and warped using randomly generated monotonically increasing warping functions. We define (1) the mean (median) alignment error as the sum of the means (medians) of pairwise distances between observations within each group of sequences in the $N$-dimensional space, (2) the data fit as the standard deviation of the estimated observational noise, and (3) the warping complexity as the sum of the absolute values of the differences between the components of $\mathbf{u}_j$, which corresponds to the total variation of the variables that define the warps. We provide a quantitative comparison between our method and the GP-LVM-based method Kazlauskaite:2018 in terms of the three criteria in Fig. A2 (see Appendix), as a function of a warping parameter for which 0 corresponds to no warping and the warps get progressively larger as its value increases. Our model achieves a lower alignment error for small and intermediate warps, while the GP-LVM achieves smaller mean alignment errors for very large warps. However, the median errors of our model stay low even for large warps. An example of this behaviour is provided in Fig. A2 (see Appendix): if one of the sequences is an outlier due to a large warp, the DPMM tends to create a new cluster for it, while the GP-LVM favours a solution that recovers the two groups of sequences but is unable to align the sequences within the two groups accurately.
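The first and third criteria can be written down directly; a short sketch (function names are ours, and the per-group error is summed over groups to obtain the reported score):

```python
import numpy as np
from itertools import combinations

def alignment_error(group, reduce=np.mean):
    """Mean (or median, via reduce=np.median) of the pairwise Euclidean
    distances between the aligned sequences of one group."""
    return reduce([np.linalg.norm(a - b) for a, b in combinations(group, 2)])

def warp_complexity(u):
    """Total variation of the variables defining a warp: the sum of
    absolute differences between successive components."""
    return np.abs(np.diff(u)).sum()
```

Note that constant warp-defining variables give zero complexity, matching the intuition that the simplest warp is the identity.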
We consider a data set of heart beat sounds, which have a clear “lub dub, lub dub” pattern that varies temporally depending on the age, health, and state of the subject Bentley:2011 . Similarly to Kazlauskaite:2018 , our model is able to automatically align and cluster the heart sounds into two distinct patterns. Both approaches use a Matérn kernel, which captures the rapid variations in the recordings while limiting the effect of uninformative high-frequency noise. Fig. 1 compares the alignment and the clustering of a set of heartbeats. Our model achieves a smaller alignment error and estimates simpler warps (closer to the identity function).
We have presented a probabilistic model that is able to implicitly align inputs that contain temporal variations. Our approach builds on Kazlauskaite:2018 , replacing the latent variable model with a DPMM. Our model performs explicit clustering in the data space (and automatically estimates the number of clusters), while the GP-LVM performs dimensionality reduction and looks for a set of sequences that exhibit a simple structure in a low-dimensional latent space (and allows new sequences to be sampled). While the experimental results suggest that both models perform well on real and synthetic data sets, they display different behaviour. The GP-LVM aligns the sequences under the constraint that they lie on a low-dimensional structure; in comparison, the DPMM does not have this global constraint and aligns the sequences in each estimated cluster independently of the other clusters. In future work we propose developing a model that combines the global low-dimensional structure with the unconstrained alignment within clusters. The main limitation of our method is its sensitivity to the choice of hyper-parameters in the variational approximation of the lower bound on the DPMM likelihood; one possible way to overcome this limitation is to perform variational inference over the hyper-parameters instead of optimising them directly. A successful application of our method to new data sets relies on the choice of priors for alignment, and we propose to explore the effect of alternative prior processes, such as the Pitman-Yor process.
This work has been supported by EPSRC CDE (EP/L016540/1) and CAMERA (EP/M023281/1) grants.
- (1) P. Bentley, G. Nordehn, M. Coimbra, and S. Mannor. The PASCAL Classifying Heart Sounds Challenge 2011 (CHSC2011) Results. http://www.peterjbentley.com/heartchallenge.
- (2) D. J. Berndt and J. Clifford. Using Dynamic Time Warping to Find Patterns in Time Series. In International Conference on Knowledge Discovery and Data Mining (KDD), 1994.
- (3) David M. Blei and Michael I. Jordan. Variational Inference for Dirichlet Process Mixtures. Bayesian Analysis, 1:121–144, 2005.
- (4) I. L. Dryden and K. V. Mardia. Statistical Shape Analysis, with Applications in R. Second Edition. 2016.
- (5) Thomas S. Ferguson. A Bayesian Analysis of Some Nonparametric Problems. Annals of Statistics, 1(2):209–230, 1973.
- (6) Eugene Hsu, Kari Pulli, and Jovan Popović. Style translation for human motion. ACM Trans. Graph., 24(3):1082–1089, 2005.
- (7) I. Kazlauskaite, C. H. Ek, and N. D. F. Campbell. Gaussian Process Latent Variable Alignment Learning. ArXiv e-prints, March 2018.
- (8) E. J. Keogh and M. J. Pazzani. Derivative Dynamic Time Warping. In SIAM International Conference on Data Mining, 2001.
- (9) N. D. Lawrence. Probabilistic Non-Linear Principal Component Analysis with Gaussian Process Latent Variable Models. Journal of Machine Learning Research (JMLR), 6:1783–1816, 2005.
- (10) C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2006.
- (11) F. Zhou and F. De la Torre. Generalized Time Warping for Multi-modal Alignment of Human Motion. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
- (12) F. Zhou and F. de la Torre. Canonical Time Warping for Alignment of Human Behavior. In Advances in Neural Information Processing Systems (NIPS), 2009.