Recently there has been a profusion of multimodal data measured in parallel on the same system. Examples include multiple modalities of data collected on biological specimens, such as single cell RNA-sequencing or single cell ATAC-sequencing, or multiple measurements collected on hospitalized patients, such as lab tests and continuous monitoring systems. There is a dire need to integrate this data in order to perform a wide variety of downstream tasks such as clustering, differential or comparative analysis, denoising and cross-modality correlation of features. We believe that the key to integrating data is to discover which entities are similar to each other across modalities by creating a data affinity graph on the basis of information from all available modalities. However, it is not immediately clear how to determine distances or similarities between entities on the basis of multiple modalities of data, which could be measured on entirely different scales and suffer from different amounts of noise and sparsity. This is particularly problematic in the biomedical domain, where issues of ‘drop out’ or undersampling make correlation analysis in single cell technologies extremely difficult. In order to address this, we turn to the framework of data diffusion that was developed by Coifman and Lafon (2006).
According to the data diffusion framework, we can implicitly learn the intrinsic space of the data by raising a Markov transition matrix to a power $t$, which implicitly calculates a $t$-step random walk over the data. In Coifman and Lafon (2006), the powered diffusion operator is eigendecomposed to uncover intrinsic data dimensions. Since that seminal work, the diffusion operator, a Markov matrix describing affinities between datapoints as transition probabilities on a data graph, has been shown to be useful in a myriad of data processing tasks Moon et al. (2018), including clustering Burkhardt et al. (2020), denoising Van Dijk et al. (2018) and dimensionality reduction Moon et al. (2019).
Here, we define an integrated diffusion operator for multiple data modalities. We show that naive fusions of modalities via affinity addition, distance addition or feature concatenation yield poor results. Our operator instead accounts for the local noise and intrinsic dimensionality of each modality. Conceptually, diffusion probabilities in our integrated operator are computed by taking several steps on the data graph of one modality, followed by several steps on the data graph defined by the other modality. The number of steps is chosen based on the spectral entropy (the entropy of the eigenvalues) of each operator. Furthermore, we emphasize dominant directions in the diffusion operator by locally denoising using PCA-based low-rank approximations.
Empirically, we show that our method yields more accurate visualizations, more faithful denoising and more accurate clustering, both on datasets where ground truth is known and on exploratory biomedical datasets, as compared to a variety of alternative methods for combining multimodal data. These comparisons include diffusion-based methods, such as combining features, diffusion operators, distance matrices and affinity graphs, as well as non-diffusion-based methods for multimodal learning, such as cycleGAN Zhu et al. (2020), autoencoders, feature-concatenated PCA, and canonical correlation analysis (CCA) Hao et al. (2020a).
2 Related Work
Recently, there has been greater focus on integrating different modalities of data collected on the same or similar systems. When analyzing and integrating information from different measurement modalities, two primary lines of work have been established: domain transfer learning and data fusion. Domain transfer learning helps predict or impute a modality of data that is entirely missing, based either on supervised training examples or on unsupervised data alignment techniques. These include neural network-based techniques such as cycleGAN, MAGAN Amodio and Krishnaswamy (2018), and harmonic alignment Stanley et al. (2020). Instead, we focus on the problem of accurately analyzing datasets when both modalities are generally available, yet suffer from real world problems with data collection. We believe this problem has received little attention because it is often assumed that naive modality concatenation offers a viable solution.
The second line of research, data fusion, seeks to learn a common latent space from both modalities of data. This field includes techniques such as CCA, autoencoders, and alternating diffusion Katz et al. (2019); Yang et al. (2021). Here, we improve upon this line of research by representing each modality in a way that corrects modality-specific local and global noise. Furthermore, the particular latent space that we choose, the data diffusion operator, is a widely usable latent space which can be used for many supervised and unsupervised learning applications.
High dimensional data can often be modeled as a sampling of a $d$-dimensional manifold that is mapped to $n$-dimensional observations $X = \{x_1, \dots, x_N\}$ via a nonlinear function $f$. Intuitively, although measurement strategies, modeled here via $f$, create high dimensional observations, the intrinsic dimensionality, or degrees of freedom within the data, is relatively low. This manifold assumption is at the core of the vast field of manifold learning (e.g., Moon et al., 2018; Coifman and Lafon, 2006; Van Der Maaten et al., 2009; Izenman, 2012, and references therein), which leverages the intrinsic geometry of data, as modeled by a manifold, for exploring and understanding patterns, trends, and structure that displays significant non-linearity.
In Coifman and Lafon (2006), diffusion maps were proposed as a robust way to capture intrinsic manifold geometry in data by eigendecomposing a powered diffusion operator. Using $t$-step random walks that aggregate local affinity, Coifman and Lafon (2006) were able to reveal nonlinear relations in data and allow their embedding in low dimensional coordinates. These local affinities are commonly constructed using a Gaussian kernel:
$$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{\varepsilon}\right), \quad i, j = 1, \dots, N,$$
where $K$ is an $N \times N$ Gram matrix whose $(i,j)$ entry is denoted by $K(x_i, x_j)$ to emphasize the dependency on the data $X$, based on bandwidth parameter $\varepsilon$, which controls local neighborhood sizes. A diffusion operator is defined as the row-stochastic matrix $P = D^{-1} K$, where $D$ is a diagonal matrix with $D_{ii} = \sum_j K(x_i, x_j)$. The matrix $P$, or diffusion operator, defines single-step transition probabilities for a time-homogeneous diffusion process, or a Markovian random walk, over the data. Furthermore, as shown in Coifman and Lafon (2006), powers of this matrix $P^t$, for $t > 0$, can be used to simulate multi-step random walks over the data, helping understand the multiscale organization of $X$, which can be interpreted geometrically when the manifold assumption is satisfied.
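The construction above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `diffusion_operator`, the toy data, and the bandwidth value are assumptions made for the example. The sketch builds Gaussian affinities with bandwidth `epsilon`, divides each row by its sum to obtain a row-stochastic matrix, and raises that matrix to a power to simulate a multi-step random walk.

```python
import numpy as np

def diffusion_operator(X, epsilon):
    """Row-stochastic diffusion operator built from a Gaussian kernel.

    X has one data point per row; epsilon is the kernel bandwidth.
    """
    # Pairwise squared Euclidean distances between the rows of X.
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    K = np.exp(-sq_dists / epsilon)       # Gaussian affinities
    # Dividing each row by its sum is the same as multiplying by D^{-1}.
    return K / K.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))            # toy data: 50 points, 5 features
P = diffusion_operator(X, epsilon=5.0)  # single-step transition probabilities
P_t = np.linalg.matrix_power(P, 4)      # 4-step random walk
```

Each row of `P` and of `P_t` sums to one, so both can be read as transition probabilities. In practice the bandwidth is often chosen adaptively per point; a single global value is used here only to keep the sketch short.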
Prior work involving data diffusion for multimodal data has centered around the notion of alternating diffusion Katz et al. (2019). Intuitively, this generalizes the random walk to “hop” between different metric spaces; mathematically, this is expressed by taking a matrix product of the Markov transition matrices:
$$P = P_1 P_2,$$
where $P_1$ and $P_2$ are the transition matrices of the two modalities. Finally, $P$ is powered to simulate “hopping” across modalities. As explained by Katz et al. (2019), the diffusion distances in this joint space constitute the joint diffusion map embedding, which captures information shared between modalities but removes modality-specific information.
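As a rough illustration of alternating diffusion, the sketch below multiplies two modality-specific Markov matrices and powers the product. The toy modalities, the bandwidth, and the helper names `gaussian_affinity` and `row_normalize` are assumptions made for the example, not part of the method of Katz et al. (2019).

```python
import numpy as np

def gaussian_affinity(X, epsilon):
    """Gaussian kernel affinities between the rows of X."""
    diffs = X[:, None, :] - X[None, :, :]
    return np.exp(-np.sum(diffs ** 2, axis=2) / epsilon)

def row_normalize(K):
    """Turn an affinity matrix into a row-stochastic Markov matrix."""
    return K / K.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
latent = rng.normal(size=(40, 2))          # shared hidden structure
X1 = latent @ rng.normal(size=(2, 6))      # toy modality 1 (6 features)
X2 = latent @ rng.normal(size=(2, 9))      # toy modality 2 (9 features)

P1 = row_normalize(gaussian_affinity(X1, epsilon=4.0))
P2 = row_normalize(gaussian_affinity(X2, epsilon=4.0))

P_alt = P1 @ P2                              # one hop in each modality
P_alt_t = np.linalg.matrix_power(P_alt, 3)   # repeated hopping
```

Because a product of row-stochastic matrices is again row-stochastic, `P_alt` remains a valid transition matrix. Note that it is generally not symmetric, even when both kernels are.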
4.1 Problem Formulation
Let $X_1$ and $X_2$ be two sets of data, perhaps with different dimensionalities, capturing two modalities gathered (e.g., via different measurement techniques) from the same underlying system. We consider a setting where the underlying system of the data can be modeled via a $d$-dimensional manifold (with $d$ much smaller than the number of measured features) that is embedded in a high dimensional ambient space given by both modalities, but is only partially captured by each individual dataset. Here, we describe an unsupervised approach to integrate information from such multimodal settings based on the principles of data diffusion in order to recover the underlying joint manifold. By utilizing methods that capture both local and global manifold geometric information, our method is robust to vastly differing quantities of noise. Our method allows for visualization, data denoising, and clustering of this jointly recovered manifold.
Neighborhood low rank approximation for local noise correction
We begin by estimating a measure of local signal in various neighborhoods in the dataset. To do this, we first run spectral clustering on each modality to obtain partitions $X^{(1)}, \dots, X^{(m)}$, written as submatrices of the data matrix $X$ of one modality (and accordingly for the other modality). Next, we compute the SVD of the centered points in each partition, $X^{(i)} - C^{(i)} = U \Sigma V^{\top}$, where $C^{(i)}$ is a matrix with all rows containing the partition center, $U$ and $V$ consist of the left and right singular vectors, and $\Sigma$ contains the singular values. In order to denoise a neighborhood, we estimate the intrinsic dimensionality using an eigengap heuristic, counting the first k+1 singular values. Finally, we obtain a low rank approximation of the data in each local partition by using a truncated SVD, i.e., $U_k \Sigma_k V_k^{\top}$, where $\Sigma_k$ only takes the first (most significant) $k$ singular values, and $U_k$, $V_k$ consist of the first $k$ columns of $U$, $V$ (correspondingly). It is important that this method be highly local so as not to destroy the manifold structure via elimination of linear dimensions in the data.
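The local correction step can be sketched for a single partition as follows. This is an illustration under stated assumptions: the partition is taken as given (the spectral clustering step is omitted), the eigengap rule is implemented as "keep the singular values before the largest drop", and the partition center is added back after truncation. The function name `local_low_rank` and the toy data are hypothetical.

```python
import numpy as np

def local_low_rank(X_part):
    """Denoise one partition by a truncated SVD of its centered points.

    Returns the low-rank approximation and the estimated rank k.
    """
    center = X_part.mean(axis=0, keepdims=True)   # partition center
    U, s, Vt = np.linalg.svd(X_part - center, full_matrices=False)
    # Assumed eigengap heuristic: keep everything before the largest drop.
    gaps = s[:-1] - s[1:]
    k = int(np.argmax(gaps)) + 1
    approx = (U[:, :k] * s[:k]) @ Vt[:k, :]       # U_k Sigma_k V_k^T
    return approx + center, k                     # re-adding the center is an assumption

# Toy partition: a circle (2 intrinsic dimensions) embedded in 10 dimensions.
rng = np.random.default_rng(0)
angles = 2.0 * np.pi * np.arange(60) / 60
latent = np.column_stack([np.cos(angles), np.sin(angles)])
basis, _ = np.linalg.qr(rng.normal(size=(10, 2)))  # orthonormal embedding directions
clean = latent @ basis.T
noisy = clean + rng.normal(scale=0.05, size=clean.shape)

denoised, k = local_low_rank(noisy)
```

On this toy partition the two dominant singular values are nearly equal and far larger than the rest, so the heuristic selects a rank of two and the truncated reconstruction lies closer to the clean points than the noisy input does.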
Modality specific diffusion time scale calculation via spectral entropy
In addition to correcting for varying local noise within a single modality, it is crucial to estimate the intrinsic dimensionality of each modality to understand how much information it contains. Previous implementations of data diffusion methods, such as alternating diffusion and diffusion maps, provide no means of calculating the correct timescale. Here, we apply spectral entropy, computed on the diffusion operator, to estimate the ideal number of steps $t$ to take in each modality. This draws on the theory of graph signal processing Shuman et al. (2013), where the eigenvectors of the diffusion operator (equivalent to the eigenvectors of the graph Laplacian in reverse order) form frequency harmonics on a data graph. The spectral entropy of the operator then summarizes the amount of variability explained by each frequency in the graph spectrum, i.e., by each diffusion dimension.
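To quantify the significance of each diffusion dimension in describing the data geometry, we can observe the corresponding eigenvalues $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_N$ of the diffusion operator. Quantitatively, this is given by the spectral entropy defined on the probability distribution of eigenvalues normalized by their sum,
$$H(t) = -\sum_{i=1}^{N} \eta_i(t) \log \eta_i(t), \qquad \eta_i(t) = \frac{\lambda_i^t}{\sum_{j=1}^{N} \lambda_j^t}.$$
This is parameterized by the diffusion timescale $t$, as this spectrum changes with the powering of the diffusion operator $P^t$.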
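A minimal sketch of the spectral entropy computation is given below. It assumes the eigenvalues of the diffusion operator are already available and non-negative, uses the natural logarithm, and works on a hypothetical spectrum chosen only for illustration.

```python
import numpy as np

def spectral_entropy(eigenvalues, t):
    """Entropy of the eigenvalues of the t-th power, normalized to sum to one."""
    lam = np.asarray(eigenvalues, dtype=float) ** t  # eigenvalues of the powered operator
    eta = lam / lam.sum()                            # probability distribution over dimensions
    eta = eta[eta > 0]                               # treat 0 * log(0) as 0
    return float(-np.sum(eta * np.log(eta)))

# Hypothetical spectrum: three strong dimensions and a tail of weak ones.
eigs = np.array([1.0, 0.9, 0.8, 0.3, 0.2, 0.1, 0.05])
entropies = [spectral_entropy(eigs, t) for t in range(1, 11)]
```

For this spectrum the entropy falls as `t` grows, since powering concentrates the normalized distribution on the largest eigenvalues. The entropy can never exceed the logarithm of the number of eigenvalues, which corresponds to a flat spectrum.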
When the diffusion operator is powered to a value $t$, a low-pass filter is applied to the eigenspectrum of the operator, such that the eigenvalues corresponding to higher frequencies are diminished (see Supplement). Thus the spectral entropy decreases with subsequent powering of the operator, but not steadily. For low values of $t$ the spectral entropy decreases rapidly and then stabilizes, creating an elbow. We believe this elbow corresponds to the elimination of noise, with further powering removing signal. We find the elbow of this entropy curve for each modality-specific operator. In this manner, the higher frequency components of the data graph, corresponding to noise dimensions, will be eliminated in a frequency-specific manner globally on the graph, as opposed to locally in a vertex-specific manner using local PCA. We note that a similar heuristic is used in Moon et al. (2019), where any value beyond an elbow is chosen for visualization using PHATE. In cases where the intrinsic dimensionality is known, it may be possible to set $t$ to a value that makes the spectral entropy equivalent to the intrinsic dimensionality.
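The text does not specify how the elbow is located, so the sketch below uses one common heuristic as an assumption: rescale the entropy curve, draw a chord between its first and last points, and pick the timescale that deviates most from that chord. The helper names and the synthetic spectrum are hypothetical.

```python
import numpy as np

def spectral_entropy(eigenvalues, t):
    """Entropy of the normalized eigenvalues of the t-th power of the operator."""
    lam = np.asarray(eigenvalues, dtype=float) ** t
    eta = lam / lam.sum()
    eta = eta[eta > 0]
    return float(-np.sum(eta * np.log(eta)))

def find_elbow(values):
    """Index of the point farthest from the chord joining the curve's endpoints."""
    y = np.asarray(values, dtype=float)
    x = np.linspace(0.0, 1.0, len(y))               # rescaled horizontal axis
    y_scaled = (y - y.min()) / (y.max() - y.min())  # rescaled vertical axis
    chord = y_scaled[0] + (y_scaled[-1] - y_scaled[0]) * x
    return int(np.argmax(np.abs(y_scaled - chord)))

# Synthetic spectrum: 5 strong "signal" eigenvalues and 95 weaker "noise" ones.
eigs = np.concatenate([np.linspace(1.0, 0.9, 5), np.linspace(0.5, 0.01, 95)])
timescales = list(range(1, 31))
curve = [spectral_entropy(eigs, t) for t in timescales]
t_elbow = timescales[find_elbow(curve)]
```

Other elbow detectors (for example, thresholding the second difference of the curve) would serve equally well here; the chord rule is chosen only because it needs no tuning parameter.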
Fusion of operators
We compute $t$ using the spectral entropy heuristic for each modality taken independently, giving us an estimate of the relative quantities of information present in each modality. While the absolute degree of information within each view is informative, a ratio of information is perhaps more meaningful. We therefore reduce the view-specific values of $t$ computed via spectral entropy to their lowest integer ratio and raise each modality's diffusion operator to the corresponding power. For example, if we obtain time values of 2 and 8 for two individual modalities, then we will assume a 1:4 ratio of information. Intuitively, this ratio indicates that for every diffusion “step” taken in modality 1, four diffusion steps will be taken in modality 2. More generally, we can write our joint diffusion operator, $P_j$, to reflect the differing levels of global information between views as follows:
$$P_j = P_1^{t_1} P_2^{t_2},$$
where $t_1$ and $t_2$ are integer values obtained from the reduced ratio as described above, and $P_1$ and $P_2$ are modality specific diffusion operators. We use the reduced ratio instead of directly applying the values of $t$ obtained from the spectral entropy heuristic, as this joint operator is then powered once more to correct for spurious noise generated when integrating the datasets (i.e., noise present from one measurement modality affecting signal present from another measurement modality). Powering directly by the unreduced timescales would lead to an oversmoothing effect in the final computed manifold, which would collapse independent clusters together. We determine the adequate timescale for powering this joint diffusion operator via the same spectral entropy approach and calculate an embedding using the method of Coifman and Lafon (2006) or Moon et al. (2019), whose utility we explore on a variety of classification and visualization tasks.
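The fusion step can be sketched as below. The sketch reads "reduced ratio" as division of both timescales by their greatest common divisor, which reproduces the 2 and 8 to 1:4 example from the text but is otherwise an assumption. The function names and the random Markov matrices standing in for modality-specific operators are hypothetical.

```python
from math import gcd

import numpy as np

def reduced_ratio(t1, t2):
    """Reduce two integer timescales to their lowest integer ratio."""
    g = gcd(t1, t2)
    return t1 // g, t2 // g

def integrated_operator(P1, P2, t1, t2):
    """Joint operator: each modality's operator raised to its reduced power."""
    r1, r2 = reduced_ratio(t1, t2)
    return np.linalg.matrix_power(P1, r1) @ np.linalg.matrix_power(P2, r2)

def random_markov(n, rng):
    """Random row-stochastic matrix, used here only as a stand-in operator."""
    M = rng.random((n, n))
    return M / M.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
P1, P2 = random_markov(20, rng), random_markov(20, rng)

# Entropy-derived timescales of 2 and 8 reduce to a 1:4 ratio.
P_joint = integrated_operator(P1, P2, t1=2, t2=8)
```

The result is still row-stochastic, so it can be powered again with a timescale chosen by the same spectral entropy heuristic, as the text describes.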
5 Experimental Results
In the following experiments we evaluate integrated diffusion’s ability to visualize, denoise and cluster high dimensional multimodal data. To create multimodal synthetic data for our tasks, we generated two versions of the MNIST handwritten digits by adding different amounts of Gaussian noise to the MNIST pixels. We also created two versions of synthetically generated tree datasets by adding differing amounts of Gaussian noise to specific branches to simulate local noise. These datasets mimic the structure of the information contained in biological measurements due to their high dimensionality, similar global structures, and vastly different amounts of noise. In the Supplement we show an example of integrating three different modalities to show the generality of the method.
First, we generate multiple modalities of MNIST handwritten digits by adding Gaussian noise to the images, where each pixel value is perturbed by Gaussian noise whose scale changes based on the level of noise. To showcase the ability of our method to handle modalities with significant differences in global noise, we add a fixed amount of Gaussian noise to simulate one data modality and increasing amounts of Gaussian noise to simulate the second data modality (Figure 2A). Next, we generate multiple modalities of high dimensional artificial trees with varying amounts of global noise and local noise specific to branches. Similar to our MNIST multimodal datasets, we add a small amount of fixed noise to each tree before adding increasing amounts of noise to differing branches (Figure 2B). Finally, we apply our integrated diffusion approach to real world single cell biological data measuring RNA-sequencing, or gene expression, and ATAC-sequencing, or chromatin accessibility. With these datasets, we compare integrated diffusion to diffusion-based and non-diffusion-based multimodal learning approaches on visualization, denoising and clustering tasks.
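The way the synthetic modalities are built can be illustrated with a short sketch. Random values stand in for flattened MNIST images, and the noise scales are arbitrary; both are assumptions for the example, as is the function name `make_noisy_modalities`.

```python
import numpy as np

def make_noisy_modalities(clean, fixed_sigma, varying_sigma, seed=0):
    """Create two modalities by adding Gaussian noise of different scales."""
    rng = np.random.default_rng(seed)
    modality_1 = clean + rng.normal(scale=fixed_sigma, size=clean.shape)
    modality_2 = clean + rng.normal(scale=varying_sigma, size=clean.shape)
    return modality_1, modality_2

# Stand-in for 100 flattened 28x28 images with pixel values in [0, 1).
clean = np.random.default_rng(42).random((100, 784))
m1, m2 = make_noisy_modalities(clean, fixed_sigma=0.1, varying_sigma=0.8)
```

Sweeping `varying_sigma` while holding `fixed_sigma` constant gives a family of dataset pairs with a growing gap in global noise between the two modalities.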
To quantify the differences in visualizations produced by differing multimodal integration strategies, we performed two separate comparisons. We compared diffusion operator constructions based on multimodal feature concatenation, distance addition, affinity addition, affinity multiplication and alternating diffusion. We then performed an ablation study, comparing these techniques to various diffusion operators: alternating diffusion with local low rank approximation, alternating diffusion with modality specific powering of diffusion operators via spectral entropy ratio, and finally our integrated diffusion approach.
For our MNIST comparisons, we integrated both modalities of data and calculated the first 20 diffusion map components by eigendecomposition of the powered diffusion operator. From this embedding we train a kNN classifier to predict the MNIST digit of origin. For our tree comparisons, we distort the global and, more importantly, local geometries and try to reconstruct the original noiseless tree, so we measure how successful our reconstruction is using DeMAP (Denoised Manifold-Affinity Preservation), proposed in Moon et al. (2019). DeMAP takes in a noiseless dataset, in this case the noiseless ground truth tree, as well as an embedding, in our case the diffusion map of one of our comparison methods or the embedding layer of a neural network. DeMAP then correlates geodesic distances on the noiseless dataset with Euclidean distances in diffusion map space, determining whether the embedding accurately maintains known point to point distances in the compressed space.
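For a single symmetric kernel, the diffusion map coordinates mentioned above can be sketched as follows. The sketch relies on the standard fact that the row-normalized operator shares its eigenvalues with a symmetric conjugate matrix, which allows a stable symmetric eigensolver to be used. It does not cover a non-symmetric joint operator, which would need a general eigensolver; the function name and toy data are assumptions.

```python
import numpy as np

def diffusion_map(K, t, n_components):
    """Diffusion map coordinates from a symmetric affinity matrix K.

    Returns the coordinates and the eigenvalues they were scaled by.
    """
    d = K.sum(axis=1)
    # Symmetric conjugate D^{-1/2} K D^{-1/2} has the same eigenvalues as D^{-1} K.
    A = K / np.sqrt(np.outer(d, d))
    evals, evecs = np.linalg.eigh(A)
    order = np.argsort(evals)[::-1]          # sort eigenvalues in decreasing order
    evals, evecs = evals[order], evecs[:, order]
    psi = evecs / np.sqrt(d)[:, None]        # right eigenvectors of D^{-1} K
    keep = slice(1, n_components + 1)        # skip the trivial constant eigenvector
    return psi[:, keep] * evals[keep] ** t, evals[keep]

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
K = np.exp(-sq_dists / 8.0)
coords, lam = diffusion_map(K, t=3, n_components=5)
```

The rows of `coords` can then be passed to any off-the-shelf classifier, such as a k-nearest-neighbor model, in place of the raw features.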
All strategies performed comparably when both modalities had a similar degree of local and global noise. As the difference in global noise increased in our MNIST embedding classification task, however, strategies that accounted for global information with modality specific diffusion operator powering outperformed strategies that did not (Figure 2C). When embedding trees with varying degrees of local branch specific noise, methods that perform local correction significantly outperformed methods that did not (Figure 2D). Finally, we compared our integrated diffusion approach to other non-diffusion-based multimodal integration strategies, including cycleGANs, domain transfer autoencoders (DT), CCA, autoencoders trained on concatenated features, and PCA on concatenated features. Across both local noise comparisons in Table 2 and global noise comparisons in Table 1, integrated diffusion significantly outperformed other multimodal integration strategies.
Previous work on diffusion filters has shown that low pass filtering can correct many types of noise present in real world biological datasets, allowing for downstream analysis Van Dijk et al. (2018). Here, we compare several methods of diffusion operator construction with our integrated diffusion approach. As done previously, we created multimodal MNIST data by adding differing amounts of global noise. After computing the joint diffusion operator with each of these comparison methods, we filter the noisier MNIST modality as done previously Van Dijk et al. (2018) and as can be seen in Figure 3. To get quantitative results, we train a kNN classifier on the denoised pixel values to determine how well each operator is able to recover ground truth (Figure 3). As shown in Figure 4, across all denoising comparisons, classification accuracy on increasingly noisy MNIST digits was best recovered by integrated diffusion, followed by powered alternating diffusion; both methods account for global information within each noisy modality.
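Filtering a noisy modality with a diffusion operator amounts to repeatedly replacing each point's features by a weighted average over its graph neighbors. The sketch below illustrates this on toy data; for brevity the operator is built from a single clean "guide" feature rather than from a joint operator, and the cluster layout, bandwidth, and names are assumptions.

```python
import numpy as np

def diffuse_features(P, X, t):
    """Apply the Markov matrix P to the feature matrix X a total of t times."""
    out = np.array(X, dtype=float)
    for _ in range(t):
        out = P @ out   # each row becomes a weighted average of its neighbors
    return out

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 30)                                   # two clusters of 30 points
clean = np.where(labels[:, None] == 0, -1.0, 1.0) * np.ones((60, 5))
guide = 3.0 * clean[:, :1] + rng.normal(scale=0.1, size=(60, 1)) # low-noise view
noisy = clean + rng.normal(scale=1.0, size=clean.shape)          # high-noise view

K = np.exp(-((guide - guide.T) ** 2) / 1.0)   # Gaussian affinities on the guide view
P = K / K.sum(axis=1, keepdims=True)
denoised = diffuse_features(P, noisy, t=2)
```

Because the affinities connect points within a cluster and almost never across clusters, the averaging cancels much of the independent noise while leaving the cluster-level signal in place.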
Other multimodal integration strategies can also denoise noisy modalities. We similarly trained a kNN classifier to predict the original digit from the denoised pixel values produced by each technique (Table 3). Across all comparisons, integrated diffusion’s denoising significantly improved prediction of denoised handwritten digits when compared to CCA, feature concatenated PCA, domain transfer and standard autoencoders, as well as a cycleGAN.
In this next experiment, we showcase the utility of our integrated diffusion operator in partitioning high dimensional data by incorporating information from multiple modalities. A variation on spectral clustering has previously been performed with diffusion operators in Moon et al. (2019). Here, we computed spectral clusters on each of our multimodal datasets by performing k-means clustering on each diffusion operator as well as on the embedding space of non-diffusion-based multimodal learning approaches. Across comparisons on toy data, we see that increasing amounts of global noise disrupt integrated diffusion based spectral clustering less than other methods of jointly clustering multimodal data (Figure 5).
Next, we compared clustering accuracy across different clustering methods by running k-means on both the joint diffusion operators and the compressed feature spaces of other non-diffusion-based methods. We compared these computed cluster labels against ground truth MNIST labels using the adjusted Rand index (ARI). Using the noisy multimodal MNIST data modalities previously generated, we computed 10 clusters with each method and measured clustering accuracy. While clustering accuracy remained poor throughout all comparisons, it remained highest when clustering the powered alternating diffusion and integrated diffusion operators, both of which account for differences in global levels of information in each modality. Furthermore, as noise increased, integrated diffusion, which also corrects local noise, outperformed powered alternating diffusion (Table 4).
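For reference, the adjusted Rand index can be computed from pair counts alone. The sketch below is a standard-library illustration of the usual formula, not the evaluation code used for the tables; the function name and toy labelings are assumptions.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(labels_true)
    # Pairs of items placed together in both labelings.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs placed together in each labeling separately.
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = together_true * together_pred / comb(n, 2)   # chance agreement
    max_index = 0.5 * (together_true + together_pred)
    if max_index == expected:   # degenerate case, e.g. a single cluster in both
        return 1.0
    return (together_both - expected) / (max_index - expected)

same_up_to_renaming = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index is invariant to how clusters are named, equals one for identical partitions, and is close to zero or negative for partitions that agree no better than chance.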
6 Biological Applications
New methods allow for the measurement of tens to hundreds of thousands of features in single cells, allowing for unprecedented insight into biological and cell type specific processes. Until recently, only a single modality could be measured in each cell, be it expression of genes through RNA sequencing or the accessibility of chromatin regions through ATAC sequencing. Now novel techniques allow for the measurement of different modalities at single-cell resolution. Increasingly, individual cells are measured with a combination of chromatin accessibility, RNA expression, protein expression and spatial location Ma et al. (2020); Cao et al. (2018); Liu et al. (2020). This new type of data is powerful, as it not only allows for the study of each modality independently, but also allows for the discovery of regulatory mechanisms between modalities. Currently, no computational techniques are capable of modelling and predicting these dynamics, as there are no strategies that integrate different modalities of data to jointly visualize, cluster and denoise multimodal single-cell data.
We apply integrated diffusion to multimodal single cell data of 11,909 blood cells, visualizing the joint manifold, identifying known cell types and uncovering key cross-modality interactions. Visualizing each modality, gene expression and chromatin accessibility, independently reveals similar overall structure but different resolutions. Chromatin accessibility data, when compared to gene expression data, is incredibly sparse and generally considered to be far less informative. When computing the spectral entropy of each modality, we can clearly see that the chromatin accessibility diffusion operator has far fewer informative dimensions than the gene expression operator. The alternating diffusion approach, which does not take into account the information present within each modality, creates an embedding that blends the distinct structure of gene expression data with the less informative structure of chromatin accessibility data. Integrated diffusion, however, appears to better resolve differences in information across datasets, producing a visualization that contains sharper borders between populations and displays clear structure when visualized with PHATE (Figure 6A).
These more clearly resolved populations also correspond to more biologically relevant clusters. Using cellular annotations of this dataset, which predict cell type based on the expression of known marker genes and accessibility regions Hao et al. (2020b), we computed clusters from the diffusion operator of each modality as well as from the alternating and integrated operators. Clusters from the integrated operator best overlapped with annotated cell types (Figure 6C).
A major issue in single cell data is sparsity due to undersampling, which makes it very difficult to measure and model cross-modality interactions. Theoretically, if a gene is expressed, then the chromatin encoding that gene must be accessible. With this understanding of the data, we try to recover these known associations between gene expression and chromatin accessibility (Figure 6D). Due to sparsity, there is no association, as computed by mutual information, between these variables without denoising. There are several strategies for recovering these cross-modality interactions: denoising with modality specific diffusion operators, denoising with a single alternating diffusion operator or denoising with a single integrated diffusion operator. Using the integrated diffusion operator appears to best recover known gene expression and chromatin accessibility associations, as shown for the genes CD19, CD14 and CD4 (Figure 6D). We then computed these associations across all genes with each of our denoising strategies. Across 18,659 genes, integrated diffusion recovered significantly more information between a gene’s accessibility and its expression than alternating diffusion and modality-specific diffusion (Figure 6E).
We introduce the integrated diffusion operator, a method for learning the joint data geometry as described by multiple data measurement modalities applied to a single system. We show its improvement over more naive diffusion based methods (e.g., feature concatenation, alternating diffusion, affinity addition) and several non-diffusion-based methods such as PCA, CCA, autoencoders and cycleGANs on synthetic and biological datasets. We apply our method in the biomedical setting to a multi-omics dataset, where we generate rich joint manifolds, compute cell populations with increased accuracy and recover cross-modality gene-chromatin associations. Our flexible framework is extendable to multiple modalities and will allow for the successful integration and analysis of massive multi-omic datasets from a wide variety of fields. Future work will involve multiscale diffusion operators designed to integrate data at many levels of granularity.
- Amodio and Krishnaswamy (2018). MAGAN: aligning biological manifolds.
- Burkhardt et al. (2020). Quantifying the effect of experimental perturbations in single-cell RNA-sequencing data using graph signal processing. bioRxiv.
- Cao et al. (2018). Joint profiling of chromatin accessibility and gene expression in thousands of single cells. Science 361 (6409), pp. 1380–1385.
- Coifman and Lafon (2006). Diffusion maps. Applied and Computational Harmonic Analysis 21 (1), pp. 5–30.
- Hao et al. (2020a). Integrated analysis of multimodal single-cell data. bioRxiv.
- Hao et al. (2020b). Integrated analysis of multimodal single-cell data. bioRxiv.
- Izenman (2012). Introduction to manifold learning. Wiley Interdisciplinary Reviews: Computational Statistics 4 (5), pp. 439–446.
- Katz et al. (2019). Alternating diffusion maps for multimodal data fusion. Information Fusion 45, pp. 346–360.
- Liu et al. (2020). High-spatial-resolution multi-omics sequencing via deterministic barcoding in tissue. Cell 183 (6), pp. 1665–1681.e18.
- Ma et al. (2020). Chromatin potential identified by shared single-cell profiling of RNA and chromatin. Cell 183 (4), pp. 1103–1116.e20.
- Moon et al. (2018). Manifold learning-based methods for analyzing single-cell RNA-sequencing data. Current Opinion in Systems Biology 7, pp. 36–46.
- Moon et al. (2019). Visualizing structure and transitions in high-dimensional biological data. Nature Biotechnology 37 (12), pp. 1482–1492.
- Shuman et al. (2013). The emerging field of signal processing on graphs: extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine 30 (3), pp. 83–98.
- Stanley et al. (2020). Harmonic alignment.
- Van Der Maaten et al. (2009). Dimensionality reduction: a comparative. J Mach Learn Res 10, pp. 66–71.
- Van Dijk et al. (2018). Recovering gene interactions from single-cell data using data diffusion. Cell 174 (3), pp. 716–729.
- Yang et al. (2021). Multi-domain translation between single-cell imaging and sequencing data using autoencoders. Nature Communications 12 (1).
- Zhu et al. (2020). Unpaired image-to-image translation using cycle-consistent adversarial networks.
Appendix A Trimodal Data Integration
The integrated diffusion framework is generalizable to many modalities. Thus far, we have only applied our approach to toy and biological data generated via two measurement modalities, but theoretically we can integrate data generated from more modalities. In this section, we display results from each of our comparison strategies on multimodal MNIST and high dimensional artificial tree data.
To quantify the differences in visualizations produced from differing multimodal integration strategies, we leveraged the same experimental set up as done previously on tri-modal MNIST and artificial tree data. In these experiments, instead of generating two data modalities by adding varying amounts of local and global noise, we generated three modalities. To simulate real data, we added increasing amounts of noise to two of the modalities, while keeping the third fixed, thus establishing a ratio between noise levels.
We compared diffusion operator constructions based on multimodal feature concatenation, distance addition, affinity addition, affinity multiplication and alternating diffusion. We also compared to non-diffusion-based approaches: feature concatenated PCA, CCA and feature concatenated autoencoder as done previously. We also performed ablation studies, comparing these techniques to various diffusion operators: alternating diffusion with local low rank approximation, alternating diffusion with modality specific powering of diffusion operators via spectral entropy ratio, and finally our integrated diffusion approach.
As done previously, for our tri-modal MNIST comparisons, we integrated all modalities of data and calculated the first 20 diffusion map components by eigendecomposition of the powered diffusion operator for our diffusion based comparisons. From this embedding we train a kNN-classifier to predict MNIST digit of origin from the embedding. For our non-diffusion based comparisons, an embedding is created and used for classification. For our tree comparisons, we are trying to reconstruct the original noiseless tree from tri-modal locally noisy tree using our integrated approaches, evaluating performance using DeMAP.
All trimodal integration strategies performed comparably when all modalities had a similar degree of local and global noise. As the difference in global noise increased in our MNIST embedding classification task, however, strategies that accounted for global information with modality specific diffusion operator powering outperformed both strategies that did not and non-diffusion-based strategies (Table 5). When embedding trees with varying degrees of local branch specific noise, methods that perform local correction significantly outperformed both methods that did not and non-diffusion-based strategies (Table 6). These findings are in line with our previous results on integrated bimodal data visualization.
Here, we compare our integrated diffusion approach to other approaches for data denoising. As done previously, we created trimodal MNIST data by adding differing amounts of global noise. After computing the joint diffusion operator with each of these comparison methods, we filter the noisier MNIST modality as done previously Van Dijk et al. (2018) and as can be seen in Figure 3. To get quantitative results, we train a kNN classifier on the denoised pixel values to determine how well each operator is able to recover ground truth. As shown in Table 7, across denoising comparisons, classification accuracy on increasingly noisy MNIST digits was best recovered by integrated diffusion, followed by powered alternating diffusion; both methods account for global information within each noisy modality.
Finally, we compared spectral clustering accuracy across methods by running k-means on both the joint diffusion operators and the compressed feature spaces of the non-diffusion-based methods. As done previously, we compared the computed cluster labels against ground truth MNIST labels using the adjusted Rand index (ARI). Using the noisy trimodal MNIST modalities generated previously, we computed 10 clusters with each method and measured clustering accuracy. While clustering accuracy remained poor throughout all comparisons, it was highest when clustering the powered alternating diffusion and integrated diffusion operators, both of which account for differences in the global level of information in each modality. Furthermore, as noise increased, integrated diffusion, which also corrects local noise, outperformed powered alternating diffusion (Table 8).
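For reference, the adjusted Rand index used in this comparison can be computed from the contingency table of two labelings. The sketch below is a plain NumPy version written for illustration (the function name `adjusted_rand_index` is our own); it covers only the scoring step, not the k-means clustering itself.

```python
import numpy as np

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two flat labelings."""
    labels_true = np.asarray(labels_true).ravel()
    labels_pred = np.asarray(labels_pred).ravel()
    _, ti = np.unique(labels_true, return_inverse=True)
    _, pi = np.unique(labels_pred, return_inverse=True)

    # Contingency table: counts of points per (true, predicted) pair.
    cont = np.zeros((ti.max() + 1, pi.max() + 1), dtype=np.int64)
    np.add.at(cont, (ti, pi), 1)

    def comb2(x):
        # Number of unordered pairs, "x choose 2".
        return x * (x - 1) / 2.0

    sum_ij = comb2(cont).sum()
    sum_a = comb2(cont.sum(axis=1)).sum()
    sum_b = comb2(cont.sum(axis=0)).sum()
    expected = sum_a * sum_b / comb2(len(labels_true))
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        # Degenerate case, e.g. both labelings put all points in one cluster.
        return 1.0
    return float((sum_ij - expected) / (max_index - expected))

# The score is invariant to relabeling of the clusters.
ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
ari_partial = adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2])
```

Because the index is corrected for chance, random labelings score near zero regardless of the number of clusters, which makes it suitable for comparing methods at a fixed cluster count.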
Appendix B Methods Details
B.1 Fast Noise Decay, Slow Signal Decay
As discussed in Burkhardt et al. (2020); Stanley et al. (2020); Shuman et al. (2013), among other sources, the eigenvectors of the diffusion operator (closely related to the graph Laplacian) form frequency harmonics of the associated graph. In the case of the diffusion operator, the high-eigenvalued eigenvectors are low-frequency harmonics, and the low-eigenvalued eigenvectors represent high-frequency components, which are often noise in the data (causing nearby points to seem as if they have significantly different features). Therefore, the diffusion operator is often powered to reduce the noise.
To see this, we can first inspect the spectral properties of the diffusion operator defined on our data manifold as it is increasingly powered. By observing the distribution of eigenvalues as the diffusion timescale changes, we can see that smaller eigenvalues are gradually “shaved” off the spectrum (Figure 7). Thus, powering the diffusion operator can be viewed as a low-pass filter which removes high-frequency noise present in the data.
For numerical stability, we power a symmetric operator that is conjugate to the asymmetric diffusion operator $\mathbf{P}$:
$$\mathbf{M} = \mathbf{D}^{1/2}\,\mathbf{P}\,\mathbf{D}^{-1/2},$$
where $\mathbf{D}$ is the diagonal matrix containing the node degrees (i.e., row sums of the affinity matrix). Note that the matrix $\mathbf{M}^t$ is also positive semidefinite and has the same set of non-negative eigenvalues $\{\lambda_i^t\}$ as $\mathbf{P}^t$. We can then normalize these eigenvalues to obtain a probability distribution, $p_i(t) = \lambda_i^t / \sum_j \lambda_j^t$, such that $\sum_i p_i(t) = 1$.
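The relationship between the asymmetric operator and its symmetric conjugate can be checked numerically. The sketch below assumes the standard construction, with a Gaussian affinity matrix `K`, degrees `deg` given by its row sums, the Markov matrix `P` obtained by dividing each row of `K` by its degree, and the conjugate `M` obtained by dividing entry (i, j) of `K` by the square root of `deg[i] * deg[j]`. The random data, the kernel bandwidth, and the power `t = 4` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))

# Gaussian affinity matrix (the kernel choice is an assumption).
sq = np.sum(X ** 2, axis=1)
d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
K = np.exp(-d2 / 2.0)

deg = K.sum(axis=1)                    # node degrees (row sums)
P = K / deg[:, None]                   # asymmetric Markov operator
M = K / np.sqrt(np.outer(deg, deg))    # symmetric conjugate of P

# Both operators share the same real, non-negative spectrum.
lam_M = np.sort(np.linalg.eigvalsh(M))[::-1]
lam_P = np.sort(np.linalg.eigvals(P).real)[::-1]

# Eigenvalues of the powered operator, normalized to a distribution.
t = 4
lam_t = np.clip(lam_M, 0.0, None) ** t
p_t = lam_t / lam_t.sum()

# Small eigenvalues fall below a fixed threshold faster under powering.
n_above_before = int(np.sum(lam_M > 1e-3))
n_above_after = int(np.sum(lam_t > 1e-3))
```

Working with the symmetric matrix allows a symmetric eigensolver such as `numpy.linalg.eigvalsh`, which returns real eigenvalues directly, whereas the general solver applied to `P` may return values with spurious imaginary parts.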
For a particular dataset, we determine the powering necessary to denoise the data and retain signal by quantifying the rate of the information decay. We hypothesize that the noise dimensions, which have low eigenvalues (and do not explain much of the data), decay quickly while signal values are harder to “shave off.” Because of this difference in the rates of eigenvalue “decay”, we can estimate the optimal timescale at which to power our diffusion operator such that maximal signal is preserved while the largest sources of global noise are smoothed out.
To quantify the amount of information contained in the operator after powering to any timescale, we compute its associated spectral entropy. When we plotted the spectral entropy against the timescale, our hypothesis of fast noise decay and slow signal decay was confirmed by the negative exponential shape of the curve (Figure 6B). We generally choose the “elbow point” of the spectral entropy versus timescale curve.
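A minimal sketch of this timescale selection is given below. The spectral entropy is computed as the Shannon entropy of the normalized powered eigenvalues. The elbow is located with a simple geometric heuristic (the point farthest from the chord joining the ends of the curve), which is an assumption of this example and not necessarily the rule used in our experiments. The synthetic spectrum, with four large "signal" eigenvalues and many small "noise" eigenvalues, is likewise illustrative.

```python
import numpy as np

def spectral_entropy(eigenvalues, t):
    """Shannon entropy of the normalized eigenvalues raised to the power t."""
    lam = np.clip(np.asarray(eigenvalues, dtype=float), 0.0, None) ** t
    p = lam / lam.sum()
    p = p[p > 0]                      # treat 0 * log(0) as 0
    return float(-np.sum(p * np.log(p)))

def elbow_index(y):
    """Index of the point farthest from the chord between the endpoints.

    This elbow heuristic is an assumption made for the example.
    """
    y = np.asarray(y, dtype=float)
    x = np.linspace(0.0, 1.0, len(y))
    yn = (y - y.min()) / (y.max() - y.min())
    slope = yn[-1] - yn[0]
    dist = np.abs(slope * x - yn + yn[0]) / np.hypot(1.0, slope)
    return int(np.argmax(dist))

# Synthetic spectrum: a few large eigenvalues and many small ones.
rng = np.random.default_rng(0)
eigenvalues = np.concatenate([[1.0, 0.95, 0.9, 0.85],
                              rng.uniform(0.05, 0.4, size=96)])

timescales = np.arange(1, 31)
entropies = np.array([spectral_entropy(eigenvalues, t) for t in timescales])
t_elbow = int(timescales[elbow_index(entropies)])
```

On a spectrum of this kind the entropy drops quickly while the small eigenvalues vanish and then flattens once only the large eigenvalues contribute, so the elbow marks the transition between the two regimes.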
B.2 Spectral entropy as a measure of operator information
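To measure the information contained in an operator, we compute its eigenvalue entropy, which measures how information is spread through the eigenspace of the diffusion operator:
$$H(t) = -\sum_i p_i(t) \log p_i(t),$$
where $p_i(t) = \lambda_i^t / \sum_j \lambda_j^t$ are the normalized eigenvalues of the diffusion operator powered to timescale $t$.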
Note that this can also be thought of as the classic Shannon entropy calculated on the normalized spectrum of the diffusion operator. Due to the submodularity of the logarithmic function, the spectral entropy is largely determined by the significant eigenvalues corresponding to the “signal-rich” diffusion dimensions.
B.3 Neural Network Methods
We compared our diffusion map-based approaches to several neural network models: an autoencoder trained on concatenated features from both views (the “joint” model), an autoencoder adapted for domain transfer, and a cycle GAN.
For our joint autoencoders, the architecture for the classification task (handwritten digits) was as follows: an input layer of 128 (2 × 64) nodes, followed by layers of 55, 32, 32, and 20 nodes (at the bottleneck). On the tree dataset, the architecture was: an input layer of 200 (2 × 100) nodes, followed by layers of 128, 64, 32, and 20 nodes at the bottleneck. We chose a 20-dimensional representational space to stay consistent with our choice to draw the first 20 diffusion components when assessing diffusion-based methods.
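To make the layer sizes concrete, the following sketch builds the encoder for the digit task and runs an untrained forward pass on concatenated views. Only the encoder layer sizes are taken from the description above; the mirrored decoder, the ReLU activations, the weight initialization, and the random input data are assumptions for illustration, and no training is performed.

```python
import numpy as np

# Encoder sizes for the digit task: 128 inputs down to a 20-node bottleneck.
ENCODER_SIZES = [128, 55, 32, 32, 20]

def init_layers(sizes, rng):
    """Random weights and zero biases for consecutive layer sizes."""
    return [(rng.normal(scale=np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x, layers):
    """Dense layers with ReLU on all but the last layer (an assumption)."""
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return x

rng = np.random.default_rng(0)
encoder = init_layers(ENCODER_SIZES, rng)
decoder = init_layers(ENCODER_SIZES[::-1], rng)   # mirrored (assumption)

# Two 64-dimensional views of the same 8 samples, concatenated feature-wise.
view_a = rng.normal(size=(8, 64))
view_b = rng.normal(size=(8, 64))
joint_input = np.concatenate([view_a, view_b], axis=1)

code = forward(joint_input, encoder)     # 20-dimensional representation
reconstruction = forward(code, decoder)  # same width as the joint input
```

The 20-dimensional `code` plays the role of the embedding that is passed to the classifier, mirroring the 20 diffusion components used for the diffusion-based methods.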
Our domain transfer-adapted autoencoder followed largely the same architecture as the joint model; however, instead of learning to reconstruct a joint input, the domain transfer model aimed to reconstruct the same points in the other view. For example, corrupted digits would serve as input to the model, which would aim to learn a representation from which the other view of the digits could be reproduced.
The final neural model tested was a cycle GAN, adopted from the field of domain transfer. Briefly, this model aims to learn two mappings, from a “source” to a “target” distribution and vice versa. Because there is no shared latent space between the views in this architecture, we assessed classification accuracy by classifying “uncorrupted” mappings from the “noisy” view (we took this approach when comparing against Canonical Correlation Analysis as well).
Note that due to the complexity of making numerous pairwise latent spaces for the domain transfer autoencoders and cycle GANs, we opted to omit these methods from comparisons on the trimodal data.