1 Introduction
Manifold learning techniques have become of great interest when studying high dimensional data. The underlying idea behind these methods, is that high dimensional data often encapsulates redundant information. In these cases, the data have an extrinsic dimensionality that is artificially high, while its intrinsic structure is wellmodeled as a lowdimensional manifold plus noise. Following the same line of reasoning, dynamical systems and time series can be regarded as processes governed by few underlying parameters, confined in a lowdimensional manifold
[1, 2].In particular, electroencephalographic (EEG) measures can be contemplated in this analytical framework. These measures are taken from different parts of the brain, resulting in a multivariate time series in a high dimensional space. It is known that these time series are highly correlated with each other. Therefore, it can be assumed that there is a lowdimensional representation of the intrinsic dynamics of the brain that can explain a broad spectrum of physical and psychological phenomena such as sleep stages. Additionally, it can be very useful for researchers to achieve meaningful visual representations of this phenomena in two or three dimensions. Such visualizations can be used to better understand the overall shape and finer patterns within the data.
In this paper we present DIG (dynamical information geometry), a dimensionality reduction tool that is designed for visualizing the inherent lowdimensional structure present in highdimensional dynamical processes. DIG is built upon a diffusion framework adapted to dynamical processes, followed by an embedding of a novel group of information distances applied to the diffusion operator. The resulting embedding is noise resilient and presents a faithful visualization of the true structure at both local and global scales with respect to time and the overall structure of the data. We demonstrate our DIG on highdimensional EEG data.
This paper is organized as follows. In Section 2 we present a brief background of diffusion methods. In Section 3, we present the steps of DIG, with a focus on the extension of the diffusion process to dynamical systems and the embedding of information distances. Then, in Section 4, we show and discuss some practical results for EEG data. And finally in Section 5, we conclude our work and propose some extensions.
2 Background
2.1 Related Work
Many dimensionality reduction methods exist, some of which have been used for visualization [3, 4, 5, 6, 7, 8, 9]
. Principal components analysis (PCA)
[8] and tdistributed stochastic neighborhood embedding (tSNE) [3] are two of the most commonly used methods for visualization. However, these and other methods are inadequate in many applications. First, these methods tend to favor one aspect of the data at the expense of the other. For example, when used for visualization, PCA typically shows the large scale global structure of the data while neglecting the finer, local structure. In contrast, tSNE is explicitly designed to focus on the local structure and often distorts the global structure, potentially leading to misinterpretations [10]. Second, PCA and tSNE fail to explicitly denoise the data for visualization. Thus in noisy settings, the true structure of the data can be obscured. In addition, none of these methods are designed to exploit the structure present in dynamical systems.Diffusion maps (DM) is a popular nonlinear dimensionality reduction technique that effectively denoises the data while capturing both local and global structure [11]. However, DM typically encodes the information in higher dimensions and is thus not optimized for visualization. A more recent visualization method called PHATE was introduced in [12]
to exploit the power of DM in denoising and capturing the structure of data while presenting the learned structure in lowdimensions by preserving an information distance between the diffusion probabilities.
DM has been extended to dynamical systems previously [13, 14, 15, 16]. In particular, Talmon and Coifman [14, 15] introduced an approach called empirical intrinsic geometry (EIG) that builds a diffusion geometry using a noise resilient distance. The resulting embedding learned from the geometry is thus noisefree and captures the true structure of the underlying process. However, EIG and other extensions of DM to dynamical systems are still not optimized for visualization as the learned structure of the data is encoded in higher dimensions. In this work, we introduce a new visualization method DIG that is wellsuited for visualizing highdimensional dynamical processes by preserving an information distance between the diffusion probabilities constructed from a noise resilient distance. This results in a visualization method that represents the true structure of the underlying dynamical process.
EEG signals have been embedded in low dimensional representations for detecting emotional states [17], preseizure states [2, 18] and sleep dynamics [16]
. In the latter DM is implemented by building the affinity matrix using both the crossspectrum distance and the covariance matrix distance as similarity measures between multivariate time series. EIG has also been applied to data including both respiratory and EEG signals
[19].2.2 Preliminaries
Here we provide background on DM [11] and PHATE [12]. DM learns the geometry of the data by first constructing a graph based on local similarities. The graph is typically constructed by applying a Gaussian kernel to Euclidean distances:
(1) 
where are the data and is a fixed kernel scale or bandwidth that controls the locality scale of the graph.
The resulting kernel or affinity matrix encodes the local structure of the data. The process of diffusion is then used to learn the global structure and denoise the data. The first step is to transform the affinity matrix into a probability transition matrix (also known as the diffusion operator), by dividing each row by the sum of its entries. The th entry of indicates the probability of transitioning from to in a singlestep random walk, where the probabilities are calculated based on the relative affinity between points. Hence, if the affinity between two points is high, then the transition probability is high. The global structure of the dynamical process is then learned by performing a step random walk for integers . The transition probabilities of these random walks are given by the th power of the diffusion operator, (i.e., ), and rows of this matrix serve as step diffusion representations of data points.
In DM, the information encoded in the diffused operator is typically embedded in lower dimensions via eigendecomposition. The Euclidean distances between the embedded coordinates are equivalent to the following scaled distance between entries in the diffused operator:
(2) 
where
is the stationary distribution of the corresponding Markov chain, or equivalently the first left eigenvector of
.DM, as described above, has several weaknesses. First, in many applications the data are not sampled uniformly. In these cases, a fixed bandwidth for all points with the Gausssian kernel in (1) may not accurately capture the local data geometry in all settings. For example, a bandwidth tuned for a densely sampled region will be inappropriate for sparsely sampled regions and vice versa. PHATE counters this by replacing the Gaussian kernel with fixed bandwidth with the decay kernel with adaptive bandwidth [12], eq. (3). The adaptive bandwidth enables the affinities to scale with the local density while the decay kernel corrects inaccuracies that may be introduced by the adaptive bandwidth in sparse sampled regions:
(3) 
The value of is the distance from to its nearest neighbor. Since this value changes for different observations, it may result in a nonsymmetric kernel. Taking the average, as shown in eq. (3), mitigates this issue.
Second, choosing an appropriate time scale is a difficult problem, as this parameter controls the resolution of the captured diffusion representation. In PHATE, the von Neumann entropy (VNE) is used to automatically tune this scale. The VNE (also known as spectral entropy) for each
is the entropy of eigenvalues of the diffused operator
(normalized to form a probability distribution), and provides a soft proxy for the number of significant eigenvalues of
. As , the VNE converges to zero, since the diffusion process is constructed to have a unique stationary distribution. The rate of decay of the VNE as increases is thus used to determine the appropriate value of at the transition between rapid decay, which is interpreted in [12] as corresponding to the elimination of noise, and slow decay (interpreted there as losing meaningful information).Third, DM is not wellsuited for visualization as the eigendecomposition does not explicitly force as much information as possible into low dimensions. In fact, when the data have a branching structure, DM tends to place different branches in different dimensions [20]. Additionally, attempts to directly embed the diffusion distances into low dimensions for visualization using multidimensional scaling (MDS) [7] can lead to unstable and inaccurate embeddings [12]. PHATE counters this by constructing a potential distance (i.e., comparing energy potentials) from diffused probabilities and then directly embedding potential distances in two or three dimensions using MDS.
3 The DIG Algorithm
In this section, we extend principles of DM and PHATE to dynamical systems to derive DIG. In this context, we present a family of information distances and derive some of their properties.
3.1 Diffusion with Dynamical Systems
In the context of dynamical systems it is needed to learn the local structure by constructing a matrix that encodes the local distances between data points. These local distances can be taken as an input for (3) to build a diffusion operator, from which information is extracted for visualization. To do this, we build upon the EIG framework [15, 14] which uses a statespace formalism (4)(5):
(4)  
(5) 
The multivariate time series represents the observed time series data while represents the hidden (unobserved) states that drive the process. can be viewed as a corrupted version of a clean process that is driven by the hidden states, where the corruption is a stationary process independent of . In general, we can view as being drawn from a conditional pdf . In the stochastic process (5), the unknown drift functions are independent from , . Therefore, we assume local independence between and , . The variables are Brownian motions.
It can be shown that the pdf
is a linear transformation of
[15, 14]. Since the pdfs are unknown, we use histograms as their estimators. Each histogram
has bins, and is built with the observations within a time window of length , centered at . The expected value of the histograms, e.g. , is a linear transformation of . Since the Mahalanobis distance is invariant under linear transformations, it can be deduced that the distance (6) is noise resilient [14]:(6) 
where and are the covariance matrices in the histograms space, in a time window of length , centered at and , respectively. Also, it can be proved under certain assumptions that is a good approximation of the distance between the underlying state variables [14]:
(7) 
To learn the global relationships from the local information encoded by the distances in eq. (6), we use the diffusion process. We apply the decay kernel in eq. (3) to the Mahalanobis distances in eq. (6), where the nearest neighbor distances are also calculated with this distance. We use the decay kernel to account for potentially different regions of density, as described in Section 2.2. The diffusion operator is then constructed by rownormalizing the resulting kernel matrix as before.
For comparison purposes, we also make use of an alternative distance to (6). Assuming that the data within time windows of length centered at
follows a multivariate Gaussian distribution
, we can compute the geodesic distance between different time windows of data using the Fisher information as the Riemannian metric [21] as follows:(8) 
where are the roots of, . The diffusion operator can then be obtained using this distance as input to the decay kernel.
3.2 Embedding Information Distances
Typically, information is extracted from the diffusion operator by either eigendecomposition or by embedding the diffusion distances. However, as mentioned previously, the former typically fails to provide a lowdimensional representation that is sufficient for visualization while the latter can result in unstable embeddings in some cases [12]. To overcome this, DIG extracts the information from the diffusion operator by embedding an information distance instead. We focus on a broad family of information distances that are parametrized by :
(9) 
The parameter controls the level of influence of the lower differences among probabilities in the overall distance. For example, the standard diffusion distances () are highly influenced by the highest absolute differences among probabilities. In contrast, the potential distances (), which were used in PHATE, account for the relative differences between them. Thus the standard diffusion distances and the potential distances can be viewed as two extremes of a general class of distances over the diffusion geometry.
It can be shown that for , the distance forms an Mdivergence [22, 23]. Furthermore, when , becomes proportional to the Hellinger distance, which is an divergence [24, 25]. Further, divergences are directly related to the Fisher information and thus are wellsuited for building an information geometry [26]. Therefore, divergences may be desirable for embedding the diffused probabilities.
We also consider another information distance based on divergences that has not been applied to diffusion operators as far as we know. Since the rows of the diffusion matrix can be interpreted as multinomial distributions, we can compute the geodesic distance between them using the Fisher information as the Reimmanian metric [27]. This can be seen as an extension of the KL divergence or the Hellinger distance, both divergences, for distributions far apart from each other. Such distances are as follows [28, 21]:
(10) 
After the information distances have been obtained, DIG applies metric MDS to the information distances to obtain a lowdimensional representation. Given an information distance , metric MDS minimizes the following stress function:
(11) 
where the are the dimensional embedded coordinates. For visualization, is chosen to be 2 or 3. See Algorithm 1 for pseudocode summarizing the described steps of DIG.
4 Experimental results
4.1 Data
We now present a realworld data application using EEG data provided by [29, 30]. The original data is sampled at 512Hz and labeled for every 30 second interval, within six sleep categories according to R&K rules (REM, Awake, S1, S2, S3, S4). Due to the lack of observations in some stages, we group S1 with S2, and S3 with S4. We bandfiltered the data between 840 Hz, and downsampled it to 128Hz.
4.2 Experimental Setup
The tuning and influence of the parameters , and for nondynamical data have been covered in [12]. For the EEG data, we found that the visualizations are highly robust for a wide range of configurations. Preliminary experiments showed that , , and give meaningful results. Therefore, we used these parameters for all experiments shown here unless otherwise stated. To compute the distances (6) and (8), we need to choose , the window size. Its selection is driven by the way the data is presented. In our case, we simply took , the number of observations in the 30s span. We also selected , and the number of bins for the histograms . We do not focus on these parameters since their impact have been already studied in the cited literature.
Conversely, as far as our knowledge goes, there has not been any previous study addressing the properties of the information distances mentioned in the previous section. Thus, we focus on how these distances may affect the learning process.
For this purpose, we wish to measure the relative distortion of the embeddings both locally and globally, when setting different values of , as well as using the geodesic distance in eq. (10). One commonly used approach to determine the local distortion, is the Trustworthiness proposed in [31]. This measure provides a penalty when one of the nearest neighbors of an observation in the lowdimensional embedding is not one of the nearest neighbors in the original data space. For our particular case, we compare the lowdimensional embeddings with each other. Trustworthiness gives an index that goes from zero to one. The lower the distortion is, the closer Trustworthiness gets to one. Notice the trustworthiness does not need to be symmetric.
To measure global differences between embeddings, we employed the Mantel test [32], which gives a level of similarity between the distance matrices in the embedded dimensions. This gives an overall measure of the similarity between embeddings. The motivation for this test is the fact that distances are not independent from each other. For example, changing the position of a single observation will result in the distortion of distances. This implies that the direct calculation of the correlation between distances may not be enough to accurately assess the similarity between distance matrices. The Mantel test takes such dependencies between distances into account. The test outputs a correlation coefficient between 0 and 1, that can be interpreted similarly as the usual Pearson correlation.
4.3 Discussion
Figure 1AB shows how using distance (6), the embedding becomes robust with respect to as increases. From a local perspective, the Trustworthiness measure among embeddings with different values of gets closer to one using than for . This shows a high local similarity independent of the value of gamma. From a global perspective, the Mantel test also shows a high dissimilarity for , while for the embeddings become more similar. The previous assessments can be visually corroborated by looking at the embeddings themselves in Figure 1C. Finally, in Figure 1C we can observe a good representation of the sleep dynamics as there is a visually clear discrimination among sleep stages, especially for higher values of or when using the diffusion geodesic distances. For these higher values of , the central structure of the embedding is more clearly defined than when using DM or lower values of . But largely, the embedding is robust to the choice of , suggesting that the specific choice of information distance is not too important when using the Mahalanobis distance in eq. (6) from the EIG framework. Furthermore, the results suggest that the use of the selected divergences does not provide any apparent advantage in this setting over the other information distances.
For the previous case, different values of produced largely the same results. This is no longer the case when we use the Gaussianbased geodesic distance in eq. 8 as shown in Figure 2. First, Figure 2A and Figure 2B show quantitatively a greater difference between the embeddings than in Figure 1. This can be visually confirmed by looking at the embeddings in Figure 2C. In this case, the traditional DM tends to condense the structure together, and the use of the alternative values may reveal more details of the structure of the data. The most left embedding is a clear representation of such a situation, where DM does not show a suitable discrimination of the sleep stages. But when increasing the value of gamma, a more suitable representation is achieved.
There are clear visual differences between the embeddings implementing distances (6) and (8). In Figure 3, we show a comparison of two embeddings colored by time steps. The left embedding is built using distance (6), achieving a good, clean visualization of the process across time. In fact, the different branches in the central region of the embedding are created from different periods of time, suggesting that transitions from different sleep stages may differ slightly depending on the total sleep time. In contrast, the Gaussianbased geodesic distance (8), displayed in the right embedding, not only inherits more of the original noise but also obscures the path of the process across time. Thus, using the Mahalanobis distance appears to better denoise the data and preserve the overall structure and time progression of this data.
5 Conclusion
In this work, we introduced a manifold learning tool for visualizing dynamical processes based on a diffusion framework. We addressed some of the shortcomings of the traditional diffusion maps approach for visualization, and used elements from PHATE and EIG to overcome them. We showed that when using the EIGbased distance, the visualization is robust to the choice of information distance.
We presented experimental results where we were able to discover sleep dynamics using solely EEG recordings. The 2dimensional visualizations showed a clear distinction among sleep stages, as well as the timevarying progress of the processes.
Future work includes extending the analysis and comparison between different distance measures for time series data, further analyzing the impact of different information distances, and applying DIG to financial and biological data.
References

[1]
R.S. Lin, C.B. Liu, M.H. Yang, N. Ahuja, and S. Levinson,
“Learning nonlinear manifolds from time series,”
in
European Conference on Computer Vision
. Springer, 2006, pp. 245–256.  [2] R. Talmon, S. Mallat, H. Zaveri, and R.R. Coifman, “Manifold learning for latent variable inference in dynamical systems,” IEEE Transactions on Signal Processing, vol. 63, no. 15, pp. 3843–3856, 2015.

[3]
L. van der Maaten and G. Hinton,
“Visualizing data using tSNE,”
Journal of Machine Learning Research
, vol. 9, no. Nov, pp. 2579–2605, 2008.  [4] J.B. Tenenbaum, V. De Silva, and J.C. Langford, “A global geometric framework for nonlinear dimensionality reduction,” science, vol. 290, no. 5500, pp. 2319–2323, 2000.
 [5] E. Becht, L. McInnes, J. Healy, C.A. Dutertre, I.W.H. Kwok, L.G. Ng, F. Ginhoux, and E.W. Newell, “Dimensionality reduction for visualizing singlecell data using umap,” Nature Biotechnology, vol. 37, no. 1, pp. 38, 2019.
 [6] S.T. Roweis and L.K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” science, vol. 290, no. 5500, pp. 2323–2326, 2000.
 [7] T.F. Cox and M.A.A. Cox, Multidimensional Scaling, Chapman & Hall/CRC, 2 edition, 2001.
 [8] T.K. Moon and W.C. Stirling, Mathematical methods and algorithms for signal processing, Prentice Hall, 2000.
 [9] L. van Der Maaten, E. Postma, and J. Van den Herik, “Dimensionality reduction: a comparative,” Journal of Machine Learning Research, vol. 10, no. 66–71, pp. 13, 2009.
 [10] M. Wattenberg, F. Viégas, and I. Johnson, “How to use tSNE effectively,” Distill, 2016.
 [11] R.R. Coifman and S. Lafon, “Diffusion maps,” Applied and computational harmonic analysis, vol. 21, no. 1, pp. 5–30, 2006.
 [12] K.R. Moon, D. van Dijk, Z. Wang, D. Burkhardt, W. Chen, A. van den Elzen, M.J. Hirn, R.R. Coifman, N.B. Ivanova, G. Wolf, and S. Krishnaswamy, “Visualizing transitions and structure for biological data exploration,” bioRxiv, p. 120378, 2019.
 [13] W. Lian, R. Talmon, H. Zaveri, L. Carin, and R.R. Coifman, “Multivariate timeseries analysis and diffusion maps,” Signal Processing, vol. 116, pp. 13–28, 2015.
 [14] R. Talmon and R.R. Coifman, “Intrinsic modeling of stochastic dynamical systems using empirical geometry,” Applied and Computational Harmonic Analysis, vol. 39, no. 1, pp. 138–160, 2015.
 [15] R. Talmon and R.R. Coifman, “Empirical intrinsic geometry for nonlinear modeling and time series filtering,” Proceedings of the National Academy of Sciences, vol. 110, no. 31, pp. 12535–12540, 2013.
 [16] P.L.C. Rodrigues, M. Congedo, and C. Jutten, “Multivariate timeseries analysis via manifold learning,” in 2018 IEEE Statistical Signal Processing Workshop (SSP). IEEE, 2018, pp. 573–577.
 [17] X.W. Wang, D. Nie, and BaoL. Lu, “Emotional state classification from eeg data using machine learning approach,” Neurocomputing, vol. 129, pp. 94–106, 2014.
 [18] P. Ataee, A. Yazdani, S.K. Setarehdan, and H.A. Noubari, “Manifold learning applied on eeg signal of the epileptic patients for detection of normal and preseizure states,” in 2007 29th Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE, 2007, pp. 5489–5492.
 [19] H.t. Wu, R. Talmon, and Y.L. Lo, “Assess sleep stage by modern signal processing techniques,” IEEE Transactions on Biomedical Engineering, vol. 62, no. 4, pp. 1159–1168, 2014.
 [20] L. Haghverdi, M. Buettner, F.A. Wolf, F. Buettner, and F.J. Theis, “Diffusion pseudotime robustly reconstructs lineage branching,” Nature Methods, vol. 13, no. 10, pp. 845–848, 2016.
 [21] C. Atkinson and A.F.S. Mitchell, “Rao’s distance measure,” Sankhyā: The Indian Journal of Statistics, Series A, pp. 345–365, 1981.
 [22] M. Salicrú and A.A. Pons, “Sobre ciertas propiedades de la mdivergencia en análisis de datos,” Qüestiió: quaderns d’estadística i investigació operativa, vol. 9, no. 4, pp. 251–256, 1985.
 [23] M. Salicrú, A. Sanchez, J. Conde, and P.S. Sanchez, “Entropy measures associated with K and M divergences,” Soochow Journal of Mathematics, vol. 21, no. 3, pp. 291–298, 1995.
 [24] I. Csiszár, “Eine informationstheoretische ungleichung und ihre anwendung auf beweis der ergodizitaet von markoffschen ketten,” Magyer Tud. Akad. Mat. Kutato Int. Koezl., vol. 8, pp. 85–108, 1964.
 [25] S.M. Ali and S.D. Silvey, “A general class of coefficients of divergence of one distribution from another,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 131–142, 1966.
 [26] V. Berisha and A.O. Hero, “Empirical nonparametric estimation of the fisher information,” IEEE Signal Processing Letters, vol. 22, no. 7, pp. 988–992, 2014.
 [27] S. Amari, Information geometry and its applications, vol. 194, Springer, 2016.
 [28] J. Lafferty and G. Lebanon, “Diffusion kernels on statistical manifolds,” Journal of Machine Learning Research, vol. 6, no. Jan, pp. 129–163, 2005.
 [29] M.G. Terzano, L. Parrino, A. Smerieri, R. Chervin, S. Chokroverty, C. Guilleminault, M. Hirshkowitz, M. Mahowald, H. Moldofsky, A. Rosa, R. Thomas, and A. Walters, “Atlas, rules, and recording techniques for the scoring of cyclic alternating pattern (CAP) in human sleep,” Sleep medicine, vol. 3, no. 2, pp. 187–199, 2002.
 [30] A.L. Goldberger, L.A.N. Amaral, L. Glass, J.M. Hausdorff, P.C. Ivanov, R.G. Mark, J.E. Mietus, G.B. Moody, C.K. Peng, and H.E. Stanley, “Physiobank, physiotoolkit, and physionet: components of a new research resource for complex physiologic signals,” Circulation, vol. 101, no. 23, pp. e215–e220, 2000.
 [31] S. Kaski, J. Nikkilä, M. Oja, J. Venna, P. Törönen, and E. Castrén, “Trustworthiness and metrics in visualizing similarity of gene expression,” BMC bioinformatics, vol. 4, no. 1, pp. 48, 2003.
 [32] N. Mantel, “The detection of disease clustering and a generalized regression approach,” Cancer research, vol. 27, no. 2 Part 1, pp. 209–220, 1967.
Comments
There are no comments yet.