Manifold learning techniques have become of great interest when studying high dimensional data. The underlying idea behind these methods, is that high dimensional data often encapsulates redundant information. In these cases, the data have an extrinsic dimensionality that is artificially high, while its intrinsic structure is well-modeled as a low-dimensional manifold plus noise. Following the same line of reasoning, dynamical systems and time series can be regarded as processes governed by few underlying parameters, confined in a low-dimensional manifold[1, 2].
In particular, electroencephalographic (EEG) measures can be contemplated in this analytical framework. These measures are taken from different parts of the brain, resulting in a multivariate time series in a high dimensional space. It is known that these time series are highly correlated with each other. Therefore, it can be assumed that there is a low-dimensional representation of the intrinsic dynamics of the brain that can explain a broad spectrum of physical and psychological phenomena such as sleep stages. Additionally, it can be very useful for researchers to achieve meaningful visual representations of this phenomena in two or three dimensions. Such visualizations can be used to better understand the overall shape and finer patterns within the data.
In this paper we present DIG (dynamical information geometry), a dimensionality reduction tool that is designed for visualizing the inherent low-dimensional structure present in high-dimensional dynamical processes. DIG is built upon a diffusion framework adapted to dynamical processes, followed by an embedding of a novel group of information distances applied to the diffusion operator. The resulting embedding is noise resilient and presents a faithful visualization of the true structure at both local and global scales with respect to time and the overall structure of the data. We demonstrate our DIG on high-dimensional EEG data.
This paper is organized as follows. In Section 2 we present a brief background of diffusion methods. In Section 3, we present the steps of DIG, with a focus on the extension of the diffusion process to dynamical systems and the embedding of information distances. Then, in Section 4, we show and discuss some practical results for EEG data. And finally in Section 5, we conclude our work and propose some extensions.
2.1 Related Work
. Principal components analysis (PCA) and t-distributed stochastic neighborhood embedding (t-SNE)  are two of the most commonly used methods for visualization. However, these and other methods are inadequate in many applications. First, these methods tend to favor one aspect of the data at the expense of the other. For example, when used for visualization, PCA typically shows the large scale global structure of the data while neglecting the finer, local structure. In contrast, t-SNE is explicitly designed to focus on the local structure and often distorts the global structure, potentially leading to misinterpretations . Second, PCA and t-SNE fail to explicitly denoise the data for visualization. Thus in noisy settings, the true structure of the data can be obscured. In addition, none of these methods are designed to exploit the structure present in dynamical systems.
Diffusion maps (DM) is a popular nonlinear dimensionality reduction technique that effectively denoises the data while capturing both local and global structure . However, DM typically encodes the information in higher dimensions and is thus not optimized for visualization. A more recent visualization method called PHATE was introduced in 
to exploit the power of DM in denoising and capturing the structure of data while presenting the learned structure in low-dimensions by preserving an information distance between the diffusion probabilities.
DM has been extended to dynamical systems previously [13, 14, 15, 16]. In particular, Talmon and Coifman [14, 15] introduced an approach called empirical intrinsic geometry (EIG) that builds a diffusion geometry using a noise resilient distance. The resulting embedding learned from the geometry is thus noise-free and captures the true structure of the underlying process. However, EIG and other extensions of DM to dynamical systems are still not optimized for visualization as the learned structure of the data is encoded in higher dimensions. In this work, we introduce a new visualization method DIG that is well-suited for visualizing high-dimensional dynamical processes by preserving an information distance between the diffusion probabilities constructed from a noise resilient distance. This results in a visualization method that represents the true structure of the underlying dynamical process.
. In the latter DM is implemented by building the affinity matrix using both the cross-spectrum distance and the covariance matrix distance as similarity measures between multivariate time series. EIG has also been applied to data including both respiratory and EEG signals.
Here we provide background on DM  and PHATE . DM learns the geometry of the data by first constructing a graph based on local similarities. The graph is typically constructed by applying a Gaussian kernel to Euclidean distances:
where are the data and is a fixed kernel scale or bandwidth that controls the locality scale of the graph.
The resulting kernel or affinity matrix encodes the local structure of the data. The process of diffusion is then used to learn the global structure and denoise the data. The first step is to transform the affinity matrix into a probability transition matrix (also known as the diffusion operator), by dividing each row by the sum of its entries. The th entry of indicates the probability of transitioning from to in a single-step random walk, where the probabilities are calculated based on the relative affinity between points. Hence, if the affinity between two points is high, then the transition probability is high. The global structure of the dynamical process is then learned by performing a -step random walk for integers . The transition probabilities of these random walks are given by the -th power of the diffusion operator, (i.e., ), and rows of this matrix serve as -step diffusion representations of data points.
In DM, the information encoded in the diffused operator is typically embedded in lower dimensions via eigendecomposition. The Euclidean distances between the embedded coordinates are equivalent to the following scaled distance between entries in the diffused operator:
DM, as described above, has several weaknesses. First, in many applications the data are not sampled uniformly. In these cases, a fixed bandwidth for all points with the Gausssian kernel in (1) may not accurately capture the local data geometry in all settings. For example, a bandwidth tuned for a densely sampled region will be inappropriate for sparsely sampled regions and vice versa. PHATE counters this by replacing the Gaussian kernel with fixed bandwidth with the -decay kernel with adaptive bandwidth , eq. (3). The adaptive bandwidth enables the affinities to scale with the local density while the -decay kernel corrects inaccuracies that may be introduced by the adaptive bandwidth in sparse sampled regions:
The value of is the distance from to its -nearest neighbor. Since this value changes for different observations, it may result in a non-symmetric kernel. Taking the average, as shown in eq. (3), mitigates this issue.
Second, choosing an appropriate time scale is a difficult problem, as this parameter controls the resolution of the captured diffusion representation. In PHATE, the von Neumann entropy (VNE) is used to automatically tune this scale. The VNE (also known as spectral entropy) for each
is the entropy of eigenvalues of the diffused operator
(normalized to form a probability distribution), and provides a soft proxy for the number of significant eigenvalues of. As , the VNE converges to zero, since the diffusion process is constructed to have a unique stationary distribution. The rate of decay of the VNE as increases is thus used to determine the appropriate value of at the transition between rapid decay, which is interpreted in  as corresponding to the elimination of noise, and slow decay (interpreted there as losing meaningful information).
Third, DM is not well-suited for visualization as the eigendecomposition does not explicitly force as much information as possible into low dimensions. In fact, when the data have a branching structure, DM tends to place different branches in different dimensions . Additionally, attempts to directly embed the diffusion distances into low dimensions for visualization using multidimensional scaling (MDS)  can lead to unstable and inaccurate embeddings . PHATE counters this by constructing a potential distance (i.e., comparing energy potentials) from diffused probabilities and then directly embedding potential distances in two or three dimensions using MDS.
3 The DIG Algorithm
In this section, we extend principles of DM and PHATE to dynamical systems to derive DIG. In this context, we present a family of information distances and derive some of their properties.
3.1 Diffusion with Dynamical Systems
In the context of dynamical systems it is needed to learn the local structure by constructing a matrix that encodes the local distances between data points. These local distances can be taken as an input for (3) to build a diffusion operator, from which information is extracted for visualization. To do this, we build upon the EIG framework [15, 14] which uses a state-space formalism (4)-(5):
The multivariate time series represents the observed time series data while represents the hidden (unobserved) states that drive the process. can be viewed as a corrupted version of a clean process that is driven by the hidden states, where the corruption is a stationary process independent of . In general, we can view as being drawn from a conditional pdf . In the stochastic process (5), the unknown drift functions are independent from , . Therefore, we assume local independence between and , . The variables are Brownian motions.
It can be shown that the pdf
is a linear transformation of[15, 14]
. Since the pdfs are unknown, we use histograms as their estimators. Each histogramhas bins, and is built with the observations within a time window of length , centered at . The expected value of the histograms, e.g. , is a linear transformation of . Since the Mahalanobis distance is invariant under linear transformations, it can be deduced that the distance (6) is noise resilient :
where and are the covariance matrices in the histograms space, in a time window of length , centered at and , respectively. Also, it can be proved under certain assumptions that is a good approximation of the distance between the underlying state variables :
To learn the global relationships from the local information encoded by the distances in eq. (6), we use the diffusion process. We apply the -decay kernel in eq. (3) to the Mahalanobis distances in eq. (6), where the -nearest neighbor distances are also calculated with this distance. We use the -decay kernel to account for potentially different regions of density, as described in Section 2.2. The diffusion operator is then constructed by row-normalizing the resulting kernel matrix as before.
For comparison purposes, we also make use of an alternative distance to (6). Assuming that the data within time windows of length centered at
follows a multivariate Gaussian distribution, we can compute the geodesic distance between different time windows of data using the Fisher information as the Riemannian metric  as follows:
where are the roots of, . The diffusion operator can then be obtained using this distance as input to the -decay kernel.
3.2 Embedding Information Distances
Typically, information is extracted from the diffusion operator by either eigendecomposition or by embedding the diffusion distances. However, as mentioned previously, the former typically fails to provide a low-dimensional representation that is sufficient for visualization while the latter can result in unstable embeddings in some cases . To overcome this, DIG extracts the information from the diffusion operator by embedding an information distance instead. We focus on a broad family of information distances that are parametrized by :
The parameter controls the level of influence of the lower differences among probabilities in the overall distance. For example, the standard diffusion distances () are highly influenced by the highest absolute differences among probabilities. In contrast, the potential distances (), which were used in PHATE, account for the relative differences between them. Thus the standard diffusion distances and the potential distances can be viewed as two extremes of a general class of distances over the diffusion geometry.
It can be shown that for , the distance forms an M-divergence [22, 23]. Furthermore, when , becomes proportional to the Hellinger distance, which is an -divergence [24, 25]. Further, -divergences are directly related to the Fisher information and thus are well-suited for building an information geometry . Therefore, -divergences may be desirable for embedding the diffused probabilities.
We also consider another information distance based on -divergences that has not been applied to diffusion operators as far as we know. Since the rows of the diffusion matrix can be interpreted as multinomial distributions, we can compute the geodesic distance between them using the Fisher information as the Reimmanian metric . This can be seen as an extension of the KL divergence or the Hellinger distance, both -divergences, for distributions far apart from each other. Such distances are as follows [28, 21]:
After the information distances have been obtained, DIG applies metric MDS to the information distances to obtain a low-dimensional representation. Given an information distance , metric MDS minimizes the following stress function:
where the are the -dimensional embedded coordinates. For visualization, is chosen to be 2 or 3. See Algorithm 1 for pseudocode summarizing the described steps of DIG.
4 Experimental results
We now present a real-world data application using EEG data provided by [29, 30]. The original data is sampled at 512Hz and labeled for every 30 second interval, within six sleep categories according to R&K rules (REM, Awake, S-1, S-2, S-3, S-4). Due to the lack of observations in some stages, we group S-1 with S-2, and S-3 with S-4. We band-filtered the data between 8-40 Hz, and down-sampled it to 128Hz.
4.2 Experimental Setup
The tuning and influence of the parameters , and for nondynamical data have been covered in . For the EEG data, we found that the visualizations are highly robust for a wide range of configurations. Preliminary experiments showed that , , and give meaningful results. Therefore, we used these parameters for all experiments shown here unless otherwise stated. To compute the distances (6) and (8), we need to choose , the window size. Its selection is driven by the way the data is presented. In our case, we simply took , the number of observations in the 30s span. We also selected , and the number of bins for the histograms . We do not focus on these parameters since their impact have been already studied in the cited literature.
Conversely, as far as our knowledge goes, there has not been any previous study addressing the properties of the information distances mentioned in the previous section. Thus, we focus on how these distances may affect the learning process.
For this purpose, we wish to measure the relative distortion of the embeddings both locally and globally, when setting different values of , as well as using the geodesic distance in eq. (10). One commonly used approach to determine the local distortion, is the Trustworthiness proposed in . This measure provides a penalty when one of the -nearest neighbors of an observation in the low-dimensional embedding is not one of the -nearest neighbors in the original data space. For our particular case, we compare the low-dimensional embeddings with each other. Trustworthiness gives an index that goes from zero to one. The lower the distortion is, the closer Trustworthiness gets to one. Notice the trustworthiness does not need to be symmetric.
To measure global differences between embeddings, we employed the Mantel test , which gives a level of similarity between the distance matrices in the embedded dimensions. This gives an overall measure of the similarity between embeddings. The motivation for this test is the fact that distances are not independent from each other. For example, changing the position of a single observation will result in the distortion of distances. This implies that the direct calculation of the correlation between distances may not be enough to accurately assess the similarity between distance matrices. The Mantel test takes such dependencies between distances into account. The test outputs a correlation coefficient between 0 and 1, that can be interpreted similarly as the usual Pearson correlation.
Figure 1AB shows how using distance (6), the embedding becomes robust with respect to as increases. From a local perspective, the Trustworthiness measure among embeddings with different values of gets closer to one using than for . This shows a high local similarity independent of the value of gamma. From a global perspective, the Mantel test also shows a high dissimilarity for , while for the embeddings become more similar. The previous assessments can be visually corroborated by looking at the embeddings themselves in Figure 1C. Finally, in Figure 1C we can observe a good representation of the sleep dynamics as there is a visually clear discrimination among sleep stages, especially for higher values of or when using the diffusion geodesic distances. For these higher values of , the central structure of the embedding is more clearly defined than when using DM or lower values of . But largely, the embedding is robust to the choice of , suggesting that the specific choice of information distance is not too important when using the Mahalanobis distance in eq. (6) from the EIG framework. Furthermore, the results suggest that the use of the selected -divergences does not provide any apparent advantage in this setting over the other information distances.
For the previous case, different values of produced largely the same results. This is no longer the case when we use the Gaussian-based geodesic distance in eq. 8 as shown in Figure 2. First, Figure 2A and Figure 2B show quantitatively a greater difference between the embeddings than in Figure 1. This can be visually confirmed by looking at the embeddings in Figure 2C. In this case, the traditional DM tends to condense the structure together, and the use of the alternative values may reveal more details of the structure of the data. The most left embedding is a clear representation of such a situation, where DM does not show a suitable discrimination of the sleep stages. But when increasing the value of gamma, a more suitable representation is achieved.
There are clear visual differences between the embeddings implementing distances (6) and (8). In Figure 3, we show a comparison of two embeddings colored by time steps. The left embedding is built using distance (6), achieving a good, clean visualization of the process across time. In fact, the different branches in the central region of the embedding are created from different periods of time, suggesting that transitions from different sleep stages may differ slightly depending on the total sleep time. In contrast, the Gaussian-based geodesic distance (8), displayed in the right embedding, not only inherits more of the original noise but also obscures the path of the process across time. Thus, using the Mahalanobis distance appears to better denoise the data and preserve the overall structure and time progression of this data.
In this work, we introduced a manifold learning tool for visualizing dynamical processes based on a diffusion framework. We addressed some of the shortcomings of the traditional diffusion maps approach for visualization, and used elements from PHATE and EIG to overcome them. We showed that when using the EIG-based distance, the visualization is robust to the choice of information distance.
We presented experimental results where we were able to discover sleep dynamics using solely EEG recordings. The 2-dimensional visualizations showed a clear distinction among sleep stages, as well as the time-varying progress of the processes.
Future work includes extending the analysis and comparison between different distance measures for time series data, further analyzing the impact of different information distances, and applying DIG to financial and biological data.
R.-S. Lin, C.-B. Liu, M.-H. Yang, N. Ahuja, and S. Levinson,
“Learning nonlinear manifolds from time series,”
European Conference on Computer Vision. Springer, 2006, pp. 245–256.
-  R. Talmon, S. Mallat, H. Zaveri, and R.R. Coifman, “Manifold learning for latent variable inference in dynamical systems,” IEEE Transactions on Signal Processing, vol. 63, no. 15, pp. 3843–3856, 2015.
L. van der Maaten and G. Hinton,
“Visualizing data using t-SNE,”
Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.
-  J.B. Tenenbaum, V. De Silva, and J.C. Langford, “A global geometric framework for nonlinear dimensionality reduction,” science, vol. 290, no. 5500, pp. 2319–2323, 2000.
-  E. Becht, L. McInnes, J. Healy, C.-A. Dutertre, I.W.H. Kwok, L.G. Ng, F. Ginhoux, and E.W. Newell, “Dimensionality reduction for visualizing single-cell data using umap,” Nature Biotechnology, vol. 37, no. 1, pp. 38, 2019.
-  S.T. Roweis and L.K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” science, vol. 290, no. 5500, pp. 2323–2326, 2000.
-  T.F. Cox and M.A.A. Cox, Multidimensional Scaling, Chapman & Hall/CRC, 2 edition, 2001.
-  T.K. Moon and W.C. Stirling, Mathematical methods and algorithms for signal processing, Prentice Hall, 2000.
-  L. van Der Maaten, E. Postma, and J. Van den Herik, “Dimensionality reduction: a comparative,” Journal of Machine Learning Research, vol. 10, no. 66–71, pp. 13, 2009.
-  M. Wattenberg, F. Viégas, and I. Johnson, “How to use t-SNE effectively,” Distill, 2016.
-  R.R. Coifman and S. Lafon, “Diffusion maps,” Applied and computational harmonic analysis, vol. 21, no. 1, pp. 5–30, 2006.
-  K.R. Moon, D. van Dijk, Z. Wang, D. Burkhardt, W. Chen, A. van den Elzen, M.J. Hirn, R.R. Coifman, N.B. Ivanova, G. Wolf, and S. Krishnaswamy, “Visualizing transitions and structure for biological data exploration,” bioRxiv, p. 120378, 2019.
-  W. Lian, R. Talmon, H. Zaveri, L. Carin, and R.R. Coifman, “Multivariate time-series analysis and diffusion maps,” Signal Processing, vol. 116, pp. 13–28, 2015.
-  R. Talmon and R.R. Coifman, “Intrinsic modeling of stochastic dynamical systems using empirical geometry,” Applied and Computational Harmonic Analysis, vol. 39, no. 1, pp. 138–160, 2015.
-  R. Talmon and R.R. Coifman, “Empirical intrinsic geometry for nonlinear modeling and time series filtering,” Proceedings of the National Academy of Sciences, vol. 110, no. 31, pp. 12535–12540, 2013.
-  P.L.C. Rodrigues, M. Congedo, and C. Jutten, “Multivariate time-series analysis via manifold learning,” in 2018 IEEE Statistical Signal Processing Workshop (SSP). IEEE, 2018, pp. 573–577.
-  X.-W. Wang, D. Nie, and Bao-L. Lu, “Emotional state classification from eeg data using machine learning approach,” Neurocomputing, vol. 129, pp. 94–106, 2014.
-  P. Ataee, A. Yazdani, S.K. Setarehdan, and H.A. Noubari, “Manifold learning applied on eeg signal of the epileptic patients for detection of normal and pre-seizure states,” in 2007 29th Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE, 2007, pp. 5489–5492.
-  H.-t. Wu, R. Talmon, and Y.-L. Lo, “Assess sleep stage by modern signal processing techniques,” IEEE Transactions on Biomedical Engineering, vol. 62, no. 4, pp. 1159–1168, 2014.
-  L. Haghverdi, M. Buettner, F.A. Wolf, F. Buettner, and F.J. Theis, “Diffusion pseudotime robustly reconstructs lineage branching,” Nature Methods, vol. 13, no. 10, pp. 845–848, 2016.
-  C. Atkinson and A.F.S. Mitchell, “Rao’s distance measure,” Sankhyā: The Indian Journal of Statistics, Series A, pp. 345–365, 1981.
-  M. Salicrú and A.A. Pons, “Sobre ciertas propiedades de la m-divergencia en análisis de datos,” Qüestiió: quaderns d’estadística i investigació operativa, vol. 9, no. 4, pp. 251–256, 1985.
-  M. Salicrú, A. Sanchez, J. Conde, and P.S. Sanchez, “Entropy measures associated with K and M divergences,” Soochow Journal of Mathematics, vol. 21, no. 3, pp. 291–298, 1995.
-  I. Csiszár, “Eine informationstheoretische ungleichung und ihre anwendung auf beweis der ergodizitaet von markoffschen ketten,” Magyer Tud. Akad. Mat. Kutato Int. Koezl., vol. 8, pp. 85–108, 1964.
-  S.M. Ali and S.D. Silvey, “A general class of coefficients of divergence of one distribution from another,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 131–142, 1966.
-  V. Berisha and A.O. Hero, “Empirical non-parametric estimation of the fisher information,” IEEE Signal Processing Letters, vol. 22, no. 7, pp. 988–992, 2014.
-  S. Amari, Information geometry and its applications, vol. 194, Springer, 2016.
-  J. Lafferty and G. Lebanon, “Diffusion kernels on statistical manifolds,” Journal of Machine Learning Research, vol. 6, no. Jan, pp. 129–163, 2005.
-  M.G. Terzano, L. Parrino, A. Smerieri, R. Chervin, S. Chokroverty, C. Guilleminault, M. Hirshkowitz, M. Mahowald, H. Moldofsky, A. Rosa, R. Thomas, and A. Walters, “Atlas, rules, and recording techniques for the scoring of cyclic alternating pattern (CAP) in human sleep,” Sleep medicine, vol. 3, no. 2, pp. 187–199, 2002.
-  A.L. Goldberger, L.A.N. Amaral, L. Glass, J.M. Hausdorff, P.C. Ivanov, R.G. Mark, J.E. Mietus, G.B. Moody, C.-K. Peng, and H.E. Stanley, “Physiobank, physiotoolkit, and physionet: components of a new research resource for complex physiologic signals,” Circulation, vol. 101, no. 23, pp. e215–e220, 2000.
-  S. Kaski, J. Nikkilä, M. Oja, J. Venna, P. Törönen, and E. Castrén, “Trustworthiness and metrics in visualizing similarity of gene expression,” BMC bioinformatics, vol. 4, no. 1, pp. 48, 2003.
-  N. Mantel, “The detection of disease clustering and a generalized regression approach,” Cancer research, vol. 27, no. 2 Part 1, pp. 209–220, 1967.