
Geometric Multimodal Deep Learning with Multi-Scaled Graph Wavelet Convolutional Network

Multimodal data provide complementary information about a natural phenomenon by integrating data from various domains with very different statistical properties. Capturing the intra-modality and cross-modality information of multimodal data is the essential capability of multimodal learning methods. Geometry-aware data analysis approaches provide these capabilities by implicitly representing data in various modalities based on their underlying geometric structures. Also, in many applications, data are explicitly defined on an intrinsic geometric structure. Generalizing deep learning methods to non-Euclidean domains is an emerging research field, which has recently been investigated in many studies. Most of these popular methods are developed for unimodal data. In this paper, a multimodal multi-scaled graph wavelet convolutional network (M-GWCN) is proposed as an end-to-end network. M-GWCN simultaneously finds intra-modality representation by applying the multiscale graph wavelet transform to provide helpful localization properties in the graph domain of each modality, and cross-modality representation by learning permutations that encode correlations among various modalities. M-GWCN is not limited to homogeneous modalities with the same number of data samples, and it requires no prior knowledge of correspondences between modalities. Several semi-supervised node classification experiments have been conducted on three popular unimodal explicit graph-based datasets and five multimodal implicit ones. The experimental results indicate the superiority and effectiveness of the proposed method compared with both spectral graph domain convolutional neural networks and state-of-the-art multimodal methods.


I Introduction

Multimodal data are generally composed of several heterogeneous data gathered from multiple real-world phenomena. Different modalities provide parts of the description of a phenomenon from various source domains usually with very different statistical properties. Unlike the unimodal data that may indicate only partial information of an entity, multimodal data provide complementary information by leveraging and fusing modality-specific knowledge [3]. Although each modality has its distinct statistical properties, the modalities are usually semantically correlated. Multimodal learning tries to improve the performance in applications with multimodal data by discovering the hidden intra-modality and cross-modality correlations.

To apply geometry-aware data analysis to multimodal problems, data in various modalities can be implicitly represented based on their geometric structures such as graphs, manifolds, and meshed surfaces. Also, there is an increasing number of applications in which data are generated in non-Euclidean geometries and inherently defined, for example, as a graph. These applications represent complex relationships and interdependencies among objects [35], including social networks, citation networks, networks of the spread of epidemic diseases, e-commerce networks, the brain's neuronal networks, biological regulatory networks, and so on.

Due to the growth of non-Euclidean data with explicit or implicit geometric structures, and the success of deep learning models in capturing hidden patterns in various domains, many deep learning methods have recently been adapted to geometrically structured domains.

Graph neural networks (GNNs) are the most common networks for developing deep learning methods on graphs as the geometric structure of data; they perform filtering operations directly on the graph via the graph weights [7]. A graph convolutional network (GCN) can learn the local meaningful stationary properties of the input signals through a specially designed convolution operator on graphs [30]. GCN represents node features (also called node embeddings) within these local neighborhoods on the graph to learn graph embeddings for node classification, graph signal classification, graph regression, and so on [30]. Complex geometric structures in graphs can be encoded with strong mathematical tools in many spatial or spectral graph-based methods [10]. In spatial graph-based methods, the convolution on each vertex is directly defined by aggregating feature information from all its neighbors [26]. Spectral graph-based methods define convolution by leveraging the graph Fourier transform to convert signals defined in the vertex domain into the spectral domain using the convolution theorem.


Most recent multimodal data analysis methods are limited because they are based on two simplifying assumptions: 1) there is the same number of data samples in each modality (the homogeneity assumption), and 2) at least partial correspondences and/or non-correspondences between modalities are given as prior knowledge. These assumptions are unrealistic in many practical machine learning scenarios.


The success of GCNs for unimodal graph-based data and the limitations of current multimodal methods in practical problems motivate the approach proposed in this paper, namely extending spectral GCN models to multimodal graph-based data. Although various spectral and spatial methods have been developed to extend convolutional neural networks to graph-based data, few methods extend GCNs to multimodal graph-based domains.

The main purpose of this work is to introduce a novel spectral GCN on graph-based multimodal data in a more practical scenario, in which it is only known that the data points of various modalities are sampled from the same phenomena.

In this paper, we propose a multimodal graph wavelet convolutional network (M-GWCN) with end-to-end learning. Unlike existing spectral graph-based methods that apply graph Fourier transforms for providing localization properties of the spectral domain, M-GWCN uses graph wavelet transforms as a multiresolution analysis of graph signals. Besides being highly localized in the vertex domain, M-GWCN can simultaneously localize signal content in both the spatial and spectral domains (graph wavelets from the spectral graph theory point of view are studied in [16]).

To prevent over-smoothing, M-GWCN is equipped with an initial residual connection mechanism. It makes sure that the representation at each layer contains a portion of the initial features. To avoid the expensive spectral decomposition for diagonalizing the graph Laplacian, we apply the Chebyshev polynomial method for approximating the graph wavelet bases. This approximation enables the M-GWCN to be applicable for large graphs.

The proposed M-GWCN consists of two main parts. In the first part, a graph convolution is conducted by applying multi-scaled graph wavelet transforms in each modality. Graph wavelet bases obtained with different scales provide valuable localization properties in the graph domain of each modality. These properties are applied for the intra-modality representation of the feature vectors of each graph. In the second part, cross-modality representations are computed in a single layer, which is responsible for data fusion. In this layer, permutations that encode the cross-modality correlations are learned by applying a new loss function and regularization terms.

The main contributions of this paper are briefly summarized as follows:

  1. We generalize the spectral GCN model to multimodal graph-based data domains as a problem that has rarely been addressed.

  2. We consider a general scenario for multimodal problems with unpaired data in the heterogeneous modalities, which needs no prior knowledge.

  3. We introduce a new spectral GCN model that in parallel applies graph wavelet transforms with different scaling parameters in each modality. This method prevents over-smoothing by introducing initial residual connections.

  4. We develop a new efficient network with end-to-end learning that simultaneously applies feature mapping on each modality and explores permutations for representing the cross-modality correlations among various modalities.

  5. We experimentally demonstrate the superiority and effectiveness of the proposed M-GWCN model on three unimodal explicit graph-based datasets and five multimodal datasets, which are not explicitly defined as graphs, but the graphs are constructed based on them implicitly.

The remainder of the paper is organized as follows. Related works are reviewed in section II. In section III, we overview the background and the basic notations. Our proposed methods are introduced in section IV. Experimental results are presented and discussed in section V. Finally, section VI concludes the paper.

II Related works

II-A Multimodal learning

Many multimodal learning methods are limited to the problems with homogeneous data, which have the same number of data samples in all modalities. For example, MVML-LA is a multi-view method, which learns a common discriminative low-dimensional latent space by preserving the geometric structure of the original input [39]. MVDL-CV is a new multi-view method, which obtains the sparse representation of the sample by learning a particular dictionary for each view and determines the similarity of samples using a regularization term between two dictionaries [23]. GPLVM represents multiple modalities in a common subspace using the Gaussian process latent variable model [22]. Another work provides a unifying framework for multiclass classification by encompassing vector-valued manifold regularization and co-regularized multi-view learning [27].

Some multimodal learning methods are geometry-aware and try to extend different diffusion and spectral methods to the multimodal setting. A nonlinear multimodal data fusion method captures the intrinsic structure of data and relies on minimal prior model knowledge [19]. A multimodal image registration method uses the graph Laplacian to capture the intrinsic structure of data in each modality by introducing a new structure preservation criterion based on Laplacian commutativity [40].

These two methods assume that the data samples in various modalities are completely paired, and that all modalities have the same number of data samples. Another approach, presented in [12], assumes that partial correspondence information is predetermined through an expert manipulation process.

The correspondence knowledge between modalities, which is essential for the above multimodal learning methods, may not be available in many practical scenarios, so working independently of this prior knowledge is of key importance. The rest of this subsection therefore focuses on several methods that try to reduce the dependency on the expert manipulation process.

The method presented in [28] first extends the given correspondence information between modalities using the functional mapping idea on the data manifolds of the respective modalities, and then uses all correspondence information to simultaneously learn an underlying low-dimensional common manifold by aligning the manifolds of the different modalities. In [5], another multimodal manifold learning approach, called local signal expansion for joint diagonalization (LSEJD), was proposed, which uses the intrinsic local tangent spaces to expand the initial correspondence knowledge.

Although these two latter methods greatly expand the correspondence information, they still depend on a small amount of prior knowledge of correspondences. In recent work [4], we proposed a multimodal learning method that is independent of any prior correspondence knowledge between modalities. This method first uses the spectral graph wavelet transform to represent local descriptors of each modality, and then applies these descriptors to find point-wise correspondences between modalities using the functional map approach.

II-B Graph convolutional neural networks

Inspired by the success of CNNs in the Euclidean domain, a large number of methods are proposed to generalize CNNs to non-Euclidean and especially graph domains. These methods are classified into two categories, spatial methods and spectral ones.

Spatial methods directly perform convolution on a graph based on its topological structure. Different spatial methods provide various weighted average functions for characterizing the node influences in the neighborhoods. NN4G [26] performs graph convolutions by summing up the nodes neighborhood information directly. The message-passing neural network (MPNN) [14] treats graph convolution as a message-passing process in which information can be passed from one node to another along edges directly. To amend the MPNN-based methods in distinguishing different graph structures based on the graph embedding, the graph isomorphism network (GIN) [38] is proposed. GIN adjusts the weight of the central node in a neighborhood by a learnable parameter. GraphSage [15] performs graph convolution by adopting a sampling strategy to obtain a fixed number of neighbors for each node. Graph attention network (GAT) [32] applies graph convolution by adopting attention mechanisms to learn the relative weights between two connected nodes.

Spectral methods develop the graph convolution operation based on spectral graph theory, which generally relies on Fourier analysis in the graph domain. Spectral CNN (SCNN) [8] is the first spectral method; its graph convolution layer projects the input graph signal to a new space using the graph Fourier transform. To amend the limitations of the basic SCNN, Chebyshev spectral CNN (ChebyNet) [11] is introduced, which applies a fast localized filter approximation to find the desired approximate filter response through the Chebyshev expansion. CayleyNet is another spectral method that uses Cayley polynomial filters. Unlike ChebyNet, CayleyNet can detect narrow frequency bands with a small number of filter parameters [21]. To address the limitations of spectral graph CNN methods in providing beneficial localization properties, GWNN is presented, which applies graph wavelets as a set of bases instead of the eigenvectors of the graph Laplacian [37]. A new spectral method is proposed in [6] that offers a more flexible frequency response. It better captures the global graph structure by applying the Auto-Regressive Moving Average (ARMA) filter.

II-C Multimodal graph neural networks

Generalizing GNNs to graph-based multimodal data is an important problem that has rarely been addressed. A multimodal GNN method for visual question answering tasks is proposed in [13]. This method represents the image as three graph-based modalities and refines the features of nodes by passing messages from one graph to another. Inspired by the message-passing idea of graph neural networks, a multimodal GCN (MMGCN) framework is proposed in [34]. MMGCN captures user preferences in a recommender system by enriching the representation of each node through information interchange among various modalities. Motivated by the ability of graph-based structures to handle long unaligned sequences, a multimodal GCN-based method is proposed in [24] to investigate the effectiveness of GNNs in modeling multimodal sequential data. The Edge Adaptable GCN (EA-GCN) method for disease prediction is presented in [17]. EA-GCN represents various modalities as a population graph using an edge adapter and applies GCN for semi-supervised node classification.

III Background

III-A Multimodal problem formulation

We assume that multimodal data are acquired from $M$ different modalities or spaces with different dimensions $d_m$, $m = 1, \dots, M$. Data in each modality $m$ are represented by an undirected weighted graph $G_m = (V_m, E_m, W_m)$, where $V_m$ is a set of $n_m$ vertices, $E_m$ is the set of edges, and $W_m \in \mathbb{R}^{n_m \times n_m}$ is a weighted adjacency matrix representing connection weights between vertices.

Let the graph signal $X_m \in \mathbb{R}^{n_m \times d_m}$ be the collection of all feature vectors associated with the labeled and unlabeled vertices of modality $m$.

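Consider $Y_m \in \mathbb{R}^{n_m \times C}$ as the label matrix of the data samples in modality $m$ for a $C$-class classification problem, with $y_i^m$ denoting its $i$-th row. If the vertex $i$ belongs to the $c$-th class, then $y_i^m$ contains $1$ in the $c$-th location and $0$ in all others. For an unlabeled data sample $i$, the vector $y_i^m$ has $0$ in all locations.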
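The label encoding described above can be sketched in NumPy as follows. The function name `build_label_matrix` and the convention of marking unlabeled vertices with `-1` are assumptions made only for this illustration.

```python
import numpy as np

def build_label_matrix(labels, num_classes):
    """Return a one-hot label matrix with an all-zero row per unlabeled vertex.

    `labels` holds a class index per vertex; -1 marks an unlabeled vertex
    (an assumed convention for this sketch).
    """
    labels = np.asarray(labels)
    Y = np.zeros((labels.shape[0], num_classes))
    labeled = labels >= 0
    # Set a single 1 in the class column of every labeled vertex.
    Y[np.nonzero(labeled)[0], labels[labeled]] = 1.0
    return Y

# Four vertices, three classes, the second vertex is unlabeled.
Y = build_label_matrix([2, -1, 0, 1], num_classes=3)
```

Keeping unlabeled rows at zero means that any loss of the form "sum of label times log-probability" ignores those vertices without a separate mask.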

In this paper, it is assumed that the sample correspondence information across different modalities is unknown. Also, when there is no emphasis on a specific modality, symbols are written in generic notation, e.g., $G$ is used to denote the data graph instead of $G_m$, which indicates a specific modality $m$.

III-B Graph spectral geometry

The symmetric normalized Laplacian matrix of a graph $G$ with weighted adjacency matrix $W$ is defined as $L = I_n - D^{-1/2} W D^{-1/2}$, by discretizing the Laplace-Beltrami (LB) operator [29], where $D$ is the diagonal matrix of node degrees. For every Laplacian matrix $L$, there is a unitary eigenbasis $U$ such that $L = U \Lambda U^{\top}$, where the matrix $U = (u_1, \dots, u_n)$ and the diagonal matrix $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$ contain the orthogonal eigenvectors and their corresponding eigenvalues (spectrum), respectively.

The eigenvectors play the role of the Fourier basis in classical harmonic analysis, and the eigenvalues can be interpreted as frequencies. For a given graph signal $x$ on the vertices of $G$, $\hat{x} = U^{\top} x$ performs the graph Fourier transform, and $x = U \hat{x}$ is its inverse.
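As a minimal sketch of these definitions, the NumPy snippet below builds the symmetric normalized Laplacian of a small path graph, diagonalizes it, and applies the graph Fourier transform and its inverse. The helper name `normalized_laplacian`, the toy graph, and the signal values are illustrative assumptions.

```python
import numpy as np

def normalized_laplacian(W):
    """Symmetric normalized Laplacian I - D^{-1/2} W D^{-1/2}."""
    d = W.sum(axis=1)
    d_inv_sqrt = np.zeros_like(d, dtype=float)
    nonzero = d > 0
    d_inv_sqrt[nonzero] = 1.0 / np.sqrt(d[nonzero])  # isolated vertices keep 0
    return np.eye(W.shape[0]) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

# Path graph on four vertices (toy example).
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = normalized_laplacian(W)

# eigh returns ascending eigenvalues and orthonormal eigenvectors of a symmetric matrix.
lam, U = np.linalg.eigh(L)

x = np.array([1.0, -2.0, 0.5, 3.0])  # a graph signal, one value per vertex
x_hat = U.T @ x                      # graph Fourier transform
x_rec = U @ x_hat                    # inverse transform recovers the signal
```

The full eigendecomposition costs on the order of $n^3$ operations, which is the expense that polynomial approximations are meant to avoid.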

III-C Spectral graph convolutional network

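According to the convolution theorem, the spectral graph convolution of a signal $x$ with a filter $g$ on the graph $G$ can be defined as an element-wise product of their Fourier transforms as $x *_G g = U\big((U^{\top} g) \odot (U^{\top} x)\big)$, where $U$ is the matrix of Laplacian eigenvectors. By denoting $g_\theta = \mathrm{diag}(U^{\top} g)$, the spectral graph convolution can be written in the form of matrix multiplication as $x *_G g = U g_\theta U^{\top} x$ [11].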

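The spectral convolution layer is defined by extending CNNs to graphs as follows [8]:

$$X^{k+1}_{:,j} = \sigma\Big(\sum_{i=1}^{p} U_r\, \Theta^{k}_{i,j}\, U_r^{\top} X^{k}_{:,i}\Big), \qquad j = 1, \dots, q, \tag{1}$$

where $U_r$ is an $n \times r$ matrix of the first $r$ eigenvectors of the Laplacian matrix (corresponding to its $r$ smallest eigenvalues), $\Theta^{k}_{i,j}$ is an $r \times r$ diagonal matrix of spectral multipliers representing the learnable parameters in layer $k$, $X^{k} \in \mathbb{R}^{n \times p}$ is the input signal including $p$ features (channels), $X^{k+1} \in \mathbb{R}^{n \times q}$ is the output signal including $q$ features, and $\sigma$ is a nonlinear activation function (e.g., ReLU). In this equation, the parameter $r$ keeps the locality of the filter in the spectral domain using the lowest-frequency harmonics.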

The main drawback of spectral CNN is its computational complexity: the eigendecomposition of the Laplacian matrix is expensive, and it yields dense eigenvectors, which prevents taking advantage of sparse multiplications.

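To overcome this limitation, the ChebyNet model [11] approximates the desired filter response using a Chebyshev expansion as follows:

$$g_\theta(\tilde{\Lambda}) = \sum_{j=0}^{K} \theta_j T_j(\tilde{\Lambda}), \tag{2}$$

where $\tilde{\Lambda} = 2\Lambda/\lambda_{\max} - I_n$ is the rescaled spectrum in $[-1, 1]$, $\theta \in \mathbb{R}^{K+1}$ is the $(K+1)$-dimensional vector of polynomial coefficients that parametrizes the filter and is optimized during training, and $T_j$ is the Chebyshev polynomial of order $j$, defined recursively as $T_j(\lambda) = 2\lambda T_{j-1}(\lambda) - T_{j-2}(\lambda)$ with $T_0(\lambda) = 1$ and $T_1(\lambda) = \lambda$.

The convolution of a graph signal $x$ with this filter is obtained as follows:

$$x *_G g_\theta = \sum_{j=0}^{K} \theta_j T_j(\tilde{L})\, x, \tag{3}$$

where $\tilde{L} = 2L/\lambda_{\max} - I_n$. The resulting convolution layer is now defined as:

$$X^{k+1} = \sigma\Big(\sum_{j=0}^{K} T_j(\tilde{L})\, X^{k}\, \Theta_j^{k}\Big), \tag{4}$$

where $\Theta_j^{k}$ indicates the $j$-th trainable weight matrix in layer $k$.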
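A minimal sketch of Chebyshev filtering on a graph is given below. It evaluates the polynomial through the three-term recursion, so it needs only matrix-vector products with the rescaled Laplacian and no eigendecomposition. The function name `chebyshev_filter`, the indexing of coefficients from order 0, the default `lam_max = 2`, and the toy cycle graph are assumptions of this illustration.

```python
import numpy as np

def chebyshev_filter(L, x, theta, lam_max=2.0):
    """Apply sum_j theta[j] * T_j(L_tilde) to x via the Chebyshev recursion."""
    n = L.shape[0]
    L_tilde = (2.0 / lam_max) * L - np.eye(n)  # rescale the spectrum into [-1, 1]
    t_prev = x                                 # T_0(L_tilde) x = x
    out = theta[0] * t_prev
    if len(theta) > 1:
        t_curr = L_tilde @ x                   # T_1(L_tilde) x = L_tilde x
        out = out + theta[1] * t_curr
        for j in range(2, len(theta)):
            # T_j = 2 * L_tilde * T_{j-1} - T_{j-2}
            t_next = 2.0 * (L_tilde @ t_curr) - t_prev
            out = out + theta[j] * t_next
            t_prev, t_curr = t_curr, t_next
    return out

# Cycle graph on five vertices; every vertex has degree 2,
# so D^{-1/2} W D^{-1/2} = W / 2.
n = 5
W = np.zeros((n, n))
for i in range(n):
    W[i, (i + 1) % n] = W[(i + 1) % n, i] = 1.0
L = np.eye(n) - W / 2.0

x = np.arange(n, dtype=float)
theta = np.array([0.5, -0.3, 0.2])   # arbitrary coefficients for orders 0, 1, 2
y = chebyshev_filter(L, x, theta)
```

With a sparse Laplacian, each recursion step costs one sparse matrix-vector product, which is why this form scales to large graphs.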

A specific polynomial order $K$ in equation (4) covers the $K$-hop neighborhood and ignores the impact of farther neighbors. To cover larger structures in the graph, it is essential to apply a high-order polynomial, but such a polynomial leads to overfitting to the known graph. Furthermore, a high-order polynomial is more computationally expensive.

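Graph Convolutional Network (GCN) [20] presents a simplified version of the Chebyshev filter. It reduces complexity and overfitting by setting $K = 1$, $\lambda_{\max} = 2$, $\theta = \theta_0 = -\theta_1$, and substituting $I_n + D^{-1/2} W D^{-1/2}$ by $\tilde{D}^{-1/2} \tilde{W} \tilde{D}^{-1/2}$ in equation (4), where $\tilde{W} = W + I_n$ and $\tilde{D}_{ii} = \sum_j \tilde{W}_{ij}$. Thus, the convolution layer of GCN is obtained as:

$$X^{k+1} = \sigma\big(\tilde{D}^{-1/2} \tilde{W} \tilde{D}^{-1/2} X^{k} \Theta^{k}\big). \tag{5}$$

This model is able to cover large graph structures with high-order neighborhoods by stacking multiple GCN layers.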
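The GCN propagation rule can be sketched in a few lines of NumPy: add self-loops, symmetrically normalize the adjacency matrix, and apply a weight matrix followed by ReLU. The function name `gcn_layer`, the toy path graph, and the weight values are illustrative assumptions.

```python
import numpy as np

def gcn_layer(W, X, Theta):
    """One GCN layer: ReLU(D~^{-1/2} (W + I) D~^{-1/2} X Theta)."""
    n = W.shape[0]
    W_tilde = W + np.eye(n)                  # add self-loops
    d_tilde = W_tilde.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d_tilde)      # self-loops keep every degree positive
    A_hat = d_inv_sqrt[:, None] * W_tilde * d_inv_sqrt[None, :]
    return np.maximum(A_hat @ X @ Theta, 0.0)

# Path graph on three vertices with one-hot input features.
W = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.eye(3)
Theta = np.ones((3, 2))
H = gcn_layer(W, X, Theta)
```

Each layer mixes information only between direct neighbors, so stacking $k$ such layers reaches the $k$-hop neighborhood.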

The other limitations, such as representing narrow-band filter with ChebyNet and smoothing node features after few convolutions with GCN, have been addressed using Cayley [21] and ARMA filters [6]. The rational forms of these filters can offer a large variety of shapes for them.

III-D Spectral graph wavelet transform

Wavelet transform is a powerful multiresolution analysis tool that expresses a signal as a combination of several localized, shifted, and scaled bases (wavelet bases) [25]. Spectral graph wavelet transform refers to projecting graph signals from vertex domain into the spectral domain using a proper set of bases provided by wavelet transform. This transformation provides valuable localization property by applying a series of appropriate scaling operations of graph signals [16].

As initially shown in [16], the spectral graph wavelet localized at vertex $i$ with scale parameter $s$ is denoted by $\psi_{s,i}$, whose $j$-th element is given by:

$$\psi_{s,i}(j) = \sum_{\ell=1}^{n} g(s\lambda_\ell)\, u_\ell^{*}(i)\, u_\ell(j), \tag{6}$$

where $n$ is the number of vertices, $\lambda_\ell$ is the $\ell$-th eigenvalue of the normalized graph Laplacian matrix, $u_\ell$ is the associated eigenvector of the Laplacian, whose $i$-th element $u_\ell(i)$ is the value of the Laplace-Beltrami operator eigenfunction at vertex $i$, the symbol $*$ denotes the complex conjugate operator, and $g$ is the spectral graph wavelet generating kernel.

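Wavelet bases are defined by $\Psi_s = (\psi_{s,1}, \psi_{s,2}, \dots, \psi_{s,n})$, where each wavelet basis $\psi_{s,i}$ corresponds to a signal on vertex $i$ at scale $s$. With these wavelet bases, equation (6) can be written in the form of matrix multiplication:

$$\Psi_s = U G_s U^{\top}, \tag{7}$$

where $U$ is the matrix of Laplacian eigenvectors and $G_s = \mathrm{diag}\big(g(s\lambda_1), \dots, g(s\lambda_n)\big)$ is the scaling matrix. The graph wavelet transform of a given graph signal $x$ is defined by $\hat{x} = \Psi_s^{-1} x$, and its inverse is $x = \Psi_s \hat{x}$.

According to [1], $\Psi_s^{-1}$ can be obtained by replacing $g(s\lambda_i)$ in $\Psi_s$ with $g(-s\lambda_i)$, corresponding to a heat kernel.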
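The following sketch builds graph wavelet bases and their inverse from an explicit eigendecomposition. It assumes a heat kernel of the form $g(s\lambda) = e^{-s\lambda}$, so the inverse bases use $e^{s\lambda}$; this sign convention, the function name `wavelet_bases`, and the toy graph are assumptions of the illustration.

```python
import numpy as np

def wavelet_bases(L, s):
    """Return (Psi_s, Psi_s^{-1}) for an assumed heat kernel g(s*lam) = exp(-s*lam)."""
    lam, U = np.linalg.eigh(L)
    psi = U @ np.diag(np.exp(-s * lam)) @ U.T
    psi_inv = U @ np.diag(np.exp(s * lam)) @ U.T   # flip the sign of the scale
    return psi, psi_inv

# Normalized Laplacian of a path graph on four vertices.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
d = W.sum(axis=1)
L = np.eye(4) - W / np.sqrt(np.outer(d, d))

psi, psi_inv = wavelet_bases(L, s=0.5)

x = np.array([1.0, 0.0, 0.0, 0.0])   # unit impulse on the first vertex
x_hat = psi_inv @ x                  # graph wavelet transform
x_rec = psi @ x_hat                  # inverse transform
```

Column `psi[:, i]` is the wavelet centered at vertex `i`; smaller values of `s` concentrate it more tightly around that vertex.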

Computing the wavelet bases is still dependent on eigen-decomposition, which is inefficient for large graphs. As mentioned in [16], the graph wavelet bases can be approximated using Chebyshev polynomials as follows:


where is the Chebyshev polynomial of order for approximate with , , is the number of Chebyshev polynomials, and is the Bessel function of the first kind [2].

IV Proposed method

As mentioned in section III, the computational complexity of the graph Fourier transform for obtaining Fourier bases is the main drawback of most spectral methods. In addition, the graph Fourier transform, as a global transformation, does not provide helpful localization properties in the vertex domain.

Inspired by the superiority of spectral graph wavelet transforms in approximating with highly sparse wavelet bases which have more helpful localization properties, we developed a novel multimodal graph wavelet convolution network (M-GWCN) for analyzing multimodal data.

Due to the sparsity of wavelet bases, computations based on wavelet bases in the M-GWCN model are much more efficient than those in models based on graph Fourier bases. Furthermore, since each wavelet basis is related to a signal on the graph that diffuses away from a central node, the M-GWCN model with small scale values is localized in the vertex domain of each modality. Thus, different scales of wavelet bases enable this model to represent the feature vectors of each modality at different levels of locality in an efficient way.

The M-GWCN model also takes advantage of the complementary information provided by various modalities by learning cross-modal representations. Cross-modality representation is the process of representing feature vectors of each modality based on wavelet bases of the other modalities. Finding the correspondences among different vertices in various graphs is essential for cross-modality representation. The M-GWCN model explores these correspondences by learning permutation matrices that encode them among various modalities.

The proposed M-GWCN method has three advantages that distinguish it from previous networks:

  1. M-GWCN introduces a stacked architecture that utilizes graph wavelets convolution with multiple scaling parameters in parallel and provides more useful intra-modality localization properties by aggregating embedded features at different scales.

  2. M-GWCN adopts residual connections that not only are helpful to prevent the over-smoothing of each stack, but also encourage them to provide various filtering responses based on each scale.

  3. M-GWCN generalizes the benefit of the graph wavelet transform to cross-modality representation by finding permutations that encode cross-modality correlations among various modalities.

IV-A Multiscale adaptive graph wavelet

We define an Adaptive Graph Wavelet (AGW) as a building block of the proposed network. In each layer, an AGW consists of a graph wavelet transformation with a desired scale and a residual connection as an adaptive component. AGWs with different scales are concatenated in parallel to form a multiscale adaptive graph wavelet (MAGW), or a stack of AGWs. Multiple wavelet scales provide more helpful localization properties by decomposing a graph signal into components at different scales, or frequency ranges.

The designed graph filter with MAGW approximates graph frequency response with different scales, without knowing the underlying graph structure. The residual connection compensates over-smoothing of MAGW in deeper layers and provides different filtering responses for each AGW. Fig. 1 depicts a scheme of MAGW.

Fig. 1: Scheme of a stack of AGWs in layer with scales. Each AGW consists of graph wavelet bases with a desired scale for feature mapping and a residual connection, as an adaptive component, for preventing the over-smoothing of each stack. AGWs with different scales are concatenated to form a multiscale AGW (MAGW) or a stack of AGWs.

Given the superiority of rational filters over polynomial filters in approximating various filter shapes, we apply a rational filter, as a more versatile graph filter, for each AGW. Inspired by the frequency response of rational filters mentioned in [18], the response of filtering a signal at scale s can be implemented using the following first-order recursion:


where and are the filter coefficients in scale , and is any practical graph representing matrix used to capture comprehensive information of the graph.

Inspired by this filter response in graph signal processing, we design a machine learning approach for learning the parameters and in each scale using a new graph convolutional network. In the designed AGW, graph wavelet transform is used for localizing graph convolution by projecting signals in the vertex domain into the spectral domain.

The obtained AGW with scale is defined as follows:


where is the wavelet bases at scale and is its inverse, and are learning parameters, is the input signal in scale including features, is the output signal including features, is the initial node features, and is the nonlinear activation function.

The output of each AGW stack is computed with the average of the outputs of all AGWs as .
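The averaging over scales can be sketched as below. The per-scale layer used here, a wavelet-domain diagonal filter followed by a feature map plus an initial residual term, is only an assumed form chosen to match the ingredients listed in the text; it is not claimed to be the authors' exact AGW equation. The function name `agw_stack`, the heat-kernel wavelets, and all parameter values are hypothetical.

```python
import numpy as np

def agw_stack(psis, psi_invs, kernels, X, X0, Ws, Vs):
    """Average the outputs of several AGW-like units, one per scale.

    Assumed unit: ReLU(Psi_s diag(f_s) Psi_s^{-1} X W_s + X0 V_s),
    where X0 holds the initial node features (residual connection).
    """
    outputs = []
    for psi, psi_inv, f, W_s, V_s in zip(psis, psi_invs, kernels, Ws, Vs):
        filtered = psi @ (f[:, None] * (psi_inv @ X))   # filter in the wavelet domain
        outputs.append(np.maximum(filtered @ W_s + X0 @ V_s, 0.0))
    return np.mean(outputs, axis=0)

rng = np.random.default_rng(0)
n, p, q = 4, 3, 2

# Normalized Laplacian of a path graph and heat-kernel wavelets at three scales.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
deg = A.sum(axis=1)
lam, U = np.linalg.eigh(np.eye(n) - A / np.sqrt(np.outer(deg, deg)))
scales = [0.5, 1.0, 2.0]
psis = [U @ np.diag(np.exp(-s * lam)) @ U.T for s in scales]
psi_invs = [U @ np.diag(np.exp(s * lam)) @ U.T for s in scales]

X0 = rng.normal(size=(n, p))                    # initial node features
kernels = [np.ones(n) for _ in scales]          # identity filters for the demo
Ws = [rng.normal(size=(p, q)) for _ in scales]
Vs = [rng.normal(size=(p, q)) for _ in scales]

H = agw_stack(psis, psi_invs, kernels, X0, X0, Ws, Vs)
```

Because every unit receives the same initial features through its own residual weights, the units can produce different responses even when they share the same input.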

IV-B Multimodal graph wavelet convolution layers

The proposed multimodal graph wavelet convolutional network consists of three phases, intra-modality localization, cross-modality correlation, and node classification. Fig. 2 schematically shows the diagram of the proposed M-GWCN. The details are given below.

Fig. 2: Diagram of the proposed M-GWCN model. This model consists of three phases. The first phase consists of the first layers of the M-GWCN model. Each layer applies multi-scaled graph wavelet convolution for intra-modality localization. In the second phase, a new convolutional layer is defined to explore the cross-modality correlations among various modalities by embedding feature vectors of each modality based on the graph wavelet of the other modalities. Finally, node classification is conducted in the third phase.

1. Intra-modality localization

In this phase, feature mapping processes for node feature vectors of all modalities are conducted separately. For each modality , node features are represented through layers of network while each layer contains an parallel units of AGWs, where is the set of scales.

The output of the proposed network for modality in scale of -th layer is defined as:


where and are trainable parameter matrices for feature mapping in scale , is a diagonal matrix for graph convolution kernel, is the embedded feature vectors in layer and scale , is the initial node feature in modality , is the number of features in layer , and is the non-linear activation function.

Applying stochastic dropout to the initial node feature in equation (11) encourages each AGW to provide a response different from the others.

The final embedded feature vectors of layer k are defined by averaging the outputs of all units of AGW in -th AGW as:


2. Cross-modality correlations

Tuning the cross-modality correlations or finding the point-wise correspondences among various modalities is the most critical challenge in the multimodal problems.

In this phase, the cross-modality correlations on various modalities are explored and a new convolutional layer is defined to represent embedded features of each modality based on the graph wavelet of the other modalities.

To prevent the increase of learnable parameters, in this layer, we leverage wavelet with one scale. Our graph wavelet convolutional network learned in phase 1 is permutation invariant because the embedded feature vectors are insensitive to re-ordering the node index. We utilize this property to take advantage of applying correlated representational information of the other modalities discovered by cross-modality correlations.

To have better representation of embedded feature vectors in modality obtained after layers, , based on the wavelet bases of modality , we define the following cross-modality feature mapping between two modalities and :


where is a permutation matrix that encoded cross-modality correspondence between modalities and , and is a matrix of embedded feature vectors in modality obtained by representing based on the wavelet bases of modality , while their correlations are encoded in .

A permutation matrix is defined as a matrix including exactly one single unit value in each row and column, and zeros elsewhere. This matrix is used to represent the permutations of elements in an ordered sequence.

For example, for the following square matrix with rows, a permutation in order of its rows as is represented by the shown permutation matrix , which yields by a simple matrix-vector multiplication, as:

The permutation of a symmetric matrix $A$, in the order of both its rows and columns, is obtained by the matrix product $P A P^{\top}$, where $P$ is the permutation matrix.
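As a small worked illustration of these two operations, the snippet below builds a permutation matrix from a row order and applies it to a symmetric matrix. The matrix `A`, the order `[2, 0, 1]`, and the helper name `permutation_matrix` are arbitrary choices for this sketch.

```python
import numpy as np

def permutation_matrix(order):
    """Row i of the result selects row order[i] of the matrix it multiplies."""
    n = len(order)
    P = np.zeros((n, n))
    P[np.arange(n), order] = 1.0
    return P

A = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 3.0],
              [2.0, 3.0, 0.0]])   # symmetric toy matrix
P = permutation_matrix([2, 0, 1])

rows_permuted = P @ A             # reorders rows only
both_permuted = P @ A @ P.T       # reorders rows and columns consistently
```

Permuting rows and columns together keeps a symmetric matrix symmetric, which is what relabeling the vertices of an undirected graph requires.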

Based on the cross-modality feature mapping, for each modality , we conduct a cross-modality convolutional layer as follows:


where is the output embedded feature matrix, is kernel matrix, , is the total number of layers for intra-modality representation, and function is the concatenation operator that incorporates the extracted feature information in the cross-modality convolution layer.

3. Node classification

The embedded feature vectors of each modality are obtained through layers of feature mapping (intra-modality representation) and one layer of feature mapping based on the other modalities (cross-modality representation).

The obtained embedded feature vectors are fed into the last layer to conduct node classification as follows:


where , , and is number of classes.

4. Problem formulation and optimization

Considering the architecture of the proposed network, we present a unified regularization-based optimization problem that aims to simultaneously learn network parameters and permutations through network training.

Since permutations are discrete and too costly to enumerate, the stochastic gradient descent (SGD) method cannot directly optimize the proposed network, because it is designed for optimizing networks with continuous parameters.

We extend our formulation to the nearest convex surrogate by approximating all permutation matrices in equation (13) with doubly stochastic matrices. A doubly stochastic matrix $S$ is an $n \times n$ matrix of non-negative real numbers, each of whose rows and columns sums to $1$, i.e.:

$$S \geq 0, \qquad S\,\mathbf{1}_n = \mathbf{1}_n, \qquad S^{\top} \mathbf{1}_n = \mathbf{1}_n,$$

where $\mathbf{1}_n$ is an $n$-dimensional column vector of ones.
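The doubly stochastic constraints can be checked numerically as sketched below. The snippet also shows Sinkhorn normalization, a standard way to map an arbitrary score matrix to an approximately doubly stochastic one; Sinkhorn normalization is brought in here purely as an illustration and is not described in the text as the authors' procedure. The function names and the score matrix are hypothetical.

```python
import numpy as np

def is_doubly_stochastic(S, tol=1e-8):
    """Check non-negativity and unit row and column sums of a square matrix."""
    S = np.asarray(S, dtype=float)
    return bool(
        S.ndim == 2
        and S.shape[0] == S.shape[1]
        and (S >= -tol).all()
        and np.allclose(S.sum(axis=0), 1.0, atol=tol)
        and np.allclose(S.sum(axis=1), 1.0, atol=tol)
    )

def sinkhorn(scores, n_iters=200):
    """Alternately normalize rows and columns of exp(scores)."""
    S = np.exp(scores - scores.max())   # positive entries, numerically stable
    for _ in range(n_iters):
        S = S / S.sum(axis=1, keepdims=True)   # rows sum to 1
        S = S / S.sum(axis=0, keepdims=True)   # columns sum to 1
    return S

scores = np.array([[2.0, 0.0, 1.0],
                   [0.0, 1.0, 0.0],
                   [1.0, 0.0, 3.0]])
S = sinkhorn(scores)
```

Every permutation matrix satisfies these constraints, so the doubly stochastic set is a continuous relaxation that gradient-based training can move through.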

According to this relaxation, we propose a new network loss function as:


where is the set of network weights, is the set of doubly stochastic matrices used in extension of equation (13), is the categorical cross-entropy loss function, that will be computed by equation (18), is the loss function for doubly stochastic matrix, which will be given in equation (19), and is a trade-off parameter between them.

measures the empirical loss on the training data by summing up the discrepancy between the outputs of the network and the ground-truth for all modalities :


where is the output of the network, obtained in equation (15). This output is a function of the set of network weights in equations (11), (14), and (15) and the set of doubly stochastic matrices in the extension of equation (13).
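As an illustration of this empirical loss, the sketch below sums a categorical cross-entropy over the labeled nodes of every modality. It assumes that each modality provides a matrix of predicted class probabilities, a one-hot label matrix, and a Boolean mask marking the labeled training nodes; the function names and the masking convention are hypothetical and are not taken from equation (18).

```python
import numpy as np

def masked_cross_entropy(probs, labels_one_hot, train_mask, eps=1e-12):
    """Cross-entropy summed over the labeled (masked) nodes of one modality."""
    per_node = -np.sum(labels_one_hot * np.log(probs + eps), axis=1)
    return float(np.sum(per_node[train_mask]))

def multimodal_classification_loss(outputs, labels, masks):
    """Sum of the per-modality losses over all modalities."""
    return sum(masked_cross_entropy(p, y, m)
               for p, y, m in zip(outputs, labels, masks))

# Modality A: uninformative predictions, two labeled nodes out of three.
probs_a = np.array([[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]])
labels_a = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
mask_a = np.array([True, True, False])

# Modality B: perfect predictions, both nodes labeled.
probs_b = np.array([[1.0, 0.0], [0.0, 1.0]])
labels_b = np.array([[1.0, 0.0], [0.0, 1.0]])
mask_b = np.array([True, True])

total = multimodal_classification_loss(
    [probs_a, probs_b], [labels_a, labels_b], [mask_a, mask_b])
```

Here only modality A contributes to the loss, with a value of about 2 ln 2 for its two labeled nodes.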

The loss function for learning the doubly stochastic matrix between two modalities and is as follows:


Thus, the loss function considering all doubly stochastic matrices is defined as follows:


The final optimization problem also includes the following additional regularization terms:


where , , and are regularization parameters. The first term is the -norm regularization penalty on the network weights preventing overfitting by constraining the complexity of the learned kernels of all layers. The second term is the between-modality regularization term that restricts the output labels of all modality pairs m and e based on cross-modality relations encoded in permutation matrix as follows:


The third term of equation (20) is the intra-modality regularization term, which is defined based on preserving manifold constraints of multimodal data based on the following equation:


where is the weight of edge between -th and -th vertices.
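Manifold-preserving regularizers of this kind are commonly written as a weighted sum of squared differences between the outputs of connected vertices, which equals a quadratic form in the graph Laplacian. The sketch below checks that identity numerically. It assumes a symmetric weight matrix `W` and an output matrix `Y` with one row per vertex; whether equation (22) uses exactly this form is an assumption.

```python
import numpy as np

def pairwise_smoothness(W, Y):
    """Sum over all vertex pairs (i, j) of w_ij * ||y_i - y_j||^2."""
    diff = Y[:, None, :] - Y[None, :, :]
    return float(np.sum(W * np.sum(diff ** 2, axis=2)))

def laplacian_smoothness(W, Y):
    """Equivalent form 2 * trace(Y^T L Y) with L = D - W, valid for symmetric W."""
    L = np.diag(W.sum(axis=1)) - W
    return float(2.0 * np.trace(Y.T @ L @ Y))

rng = np.random.default_rng(0)
W = rng.random((4, 4))
W = (W + W.T) / 2.0          # symmetric edge weights (undirected graph)
np.fill_diagonal(W, 0.0)     # no self-loops
Y = rng.standard_normal((4, 3))
```

The Laplacian form is the one usually implemented, since it needs only matrix products and can exploit the sparsity of the graph.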

According to loss functions and regularization terms defined in equations (16) and (20), the final optimization problem on the objective function is obtained as follows:


To optimize this problem under its constraints, an iterative optimization algorithm can be adopted to alternately minimize the loss function with respect to and .

In the first pass, , the gradient of with respect to , is computed while all other parameters are held fixed. According to the stochastic gradient descent optimization method, all network kernel matrices can be updated using the following iterative equation until convergence:


where is the learning rate of the SGD method and is iteration number.

In the second pass, is computed as the gradient of with respect to while all other parameters are held fixed. The doubly stochastic matrices that relax the permutations can be updated using SGD as:


while the non-negative constraint is maintained by thresholding .

Practically, to optimize the above-mentioned problems, we use the Keras library, adding a new loss function and defining regularization terms on each layer; these penalties are summed into the loss function during optimization.

The proposed M-GWCN is summarized in Algorithm 1.

  • : number of modalities

  • : Undirected weighted graph of modality

  • : Label vector of data samples in modality

  • : Set of scales

  • : Number of layers

  • : Number of features in layer

  • : Dropout rate for initial node feature

  • : Nonlinearity activation function

  • : Number of Chebyshev polynomials

  • , , , : Regularization parameters

  • : Maximum iteration number

  1. Approximate for each modality in scale using equation (8).

  2. Initialize and randomly for all modalities and all layers .

  3. For ( to ) do:

    1. Compute and update kernel matrix using equation (24).

    2. Compute and update doubly stochastic matrices using equation (25).

    3. Maintain non-negative constraint by thresholding .

    End for

  • Network parameters .

  • Doubly stochastic matrices .

Algorithm 1 Multimodal Graph Wavelet Convolutional Network (M-GWCN).
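The alternating structure of Algorithm 1 can be sketched on a toy problem. The code below is not the M-GWCN objective: it uses a hypothetical least-squares fit between one modality's features, mapped through a relaxed correspondence matrix `P` and a kernel `W`, plus a squared penalty on the row and column sums of `P`. All names, the objective, and the hyperparameters are assumptions chosen only to show the two gradient passes and the non-negativity thresholding.

```python
import numpy as np

def objective(W, P, X, Y, gamma):
    """Toy loss: data fit plus a doubly stochastic penalty on P."""
    fit = 0.5 * np.sum((P @ X @ W - Y) ** 2)
    ds = np.sum((P.sum(axis=1) - 1.0) ** 2) + np.sum((P.sum(axis=0) - 1.0) ** 2)
    return float(fit + gamma * ds)

def alternating_sgd(X, Y, n_out, gamma=0.1, lr=0.01, max_iter=300, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = 0.1 * rng.standard_normal((d, n_out))  # random kernel initialization
    P = np.full((n, n), 1.0 / n)               # uniform doubly stochastic start
    history = [objective(W, P, X, Y, gamma)]
    for _ in range(max_iter):
        # Pass 1: gradient step on the kernel W with P fixed.
        R = P @ X @ W - Y
        W = W - lr * (P @ X).T @ R
        # Pass 2: gradient step on P with W fixed.
        R = P @ X @ W - Y
        row = P.sum(axis=1) - 1.0
        col = P.sum(axis=0) - 1.0
        grad_P = R @ (X @ W).T + 2.0 * gamma * (row[:, None] + col[None, :])
        P = P - lr * grad_P
        # Maintain the non-negativity constraint by thresholding at zero.
        P = np.maximum(P, 0.0)
        history.append(objective(W, P, X, Y, gamma))
    return W, P, history

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3))
P_true = np.eye(5)[rng.permutation(5)]          # hidden correspondence
W_true = 0.5 * rng.standard_normal((3, 2))
Y = P_true @ X @ W_true
W_hat, P_hat, history = alternating_sgd(X, Y, n_out=2)
```

In the actual network the two passes are handled by the automatic differentiation of the deep learning framework; the explicit gradients here only make the alternation visible.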

5. Computational complexity analysis

The computations of M-GWCN consist of two parts: computing the graph wavelet bases and computing the network outputs.

Since M-GWCN employs Chebyshev polynomials to approximate the graph wavelet bases, this step benefits from their linear computational complexity , where is the number of edges of and is the order of the Chebyshev polynomials.
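To make the Chebyshev approximation concrete, the following sketch approximates a heat-kernel wavelet basis, exp(-sL), on a small graph and compares it with the exact result obtained by eigendecomposition. It is a sketch under stated assumptions: the heat kernel is assumed as the wavelet generating function, the normalized Laplacian is used so that its spectrum lies in [0, 2], the graph has no isolated vertices, and the coefficients are computed by numerical quadrature rather than in closed form. Dense matrices are used for brevity; the linear complexity mentioned above relies on sparse matrix products.

```python
import numpy as np

def normalized_laplacian(A):
    """I - D^{-1/2} A D^{-1/2}; assumes every vertex has at least one edge."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def chebyshev_heat_kernel(L, s, order=20, lam_max=2.0, n_quad=200):
    """Approximate exp(-s L) with a truncated Chebyshev expansion."""
    a = lam_max / 2.0
    # Chebyshev coefficients of g(lambda) = exp(-s lambda) on [0, lam_max].
    theta = np.pi * (np.arange(n_quad) + 0.5) / n_quad
    g = np.exp(-s * (a * np.cos(theta) + a))
    coeffs = [(2.0 / n_quad) * np.sum(g * np.cos(k * theta))
              for k in range(order + 1)]
    # Shift and scale L so that its spectrum lies in [-1, 1].
    n = L.shape[0]
    L_tilde = (L - a * np.eye(n)) / a
    # Three-term recurrence: T_k = 2 L_tilde T_{k-1} - T_{k-2}.
    T_prev, T_curr = np.eye(n), L_tilde
    approx = 0.5 * coeffs[0] * T_prev + coeffs[1] * T_curr
    for k in range(2, order + 1):
        T_next = 2.0 * L_tilde @ T_curr - T_prev
        approx += coeffs[k] * T_next
        T_prev, T_curr = T_curr, T_next
    return approx

# Ring graph with six vertices.
n = 6
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
L = normalized_laplacian(A)

approx = chebyshev_heat_kernel(L, s=1.0)
lam, U = np.linalg.eigh(L)                       # exact reference (costly in general)
exact = U @ np.diag(np.exp(-1.0 * lam)) @ U.T
max_error = float(np.max(np.abs(approx - exact)))
```

The recurrence needs only repeated multiplications by the Laplacian, which is why the eigendecomposition can be avoided on large sparse graphs.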

Under the assumption of sparse wavelet bases, each scale s in modality can be implemented as a matrix multiplication between a sparse square matrix and the embedded feature matrix , which incurs linear complexity for each layer in the feature mapping phase.

The computation complexity in layer of each modality depends on computing cross-modality correlation. According to the sparsity nature of both permutation matrices and wavelet bases, computing the cross-modality correlation using matrix multiplication also has linear complexity.

Finally, given the linear time of the matrix multiplication between the embedded feature matrix and the kernel weight matrix in each modality , and the linearity of computing the sigmoid function, the label of each node in the final (classification) layer can be estimated in linear time.

IV-C Other versions of M-GWCN

1. Graph wavelet convolutional Network (GWCN)

Since most of the explicit graph-based data are unimodal, we simplify the proposed M-GWCN for unimodal tasks. GWCN, as a unimodal version of M-GWCN, is designed using the first layers integrated with the last classification layer (without cross-modality correlations).

2. Multi-view graph wavelet convolutional Network (MV-GWCN)

Many multimodal problems benefit from prior knowledge of full or partial correspondence information among modalities. Since these problems, sometimes called multi-view, are special cases of multimodal problems, we redesigned our proposed M-GWCN to cope with multi-view data, and call the result MV-GWCN. In MV-GWCN, permutation matrices encode the correspondence information among the modalities: if a sample in one modality corresponds to a sample in another modality, the permutation matrix between the two modalities contains 1 in the entry indexed by that pair of samples, and 0 otherwise.
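A fixed correspondence matrix of this kind can be built directly from a list of known sample pairs. The sketch below is illustrative: the function name `correspondence_matrix` and the zero-based `(i, j)` pair convention are assumptions, and a partial correspondence simply leaves the unmatched rows at zero.

```python
import numpy as np

def correspondence_matrix(pairs, n_m, n_e):
    """Binary matrix with 1 at (i, j) when sample i of modality m matches sample j of modality e."""
    P = np.zeros((n_m, n_e))
    for i, j in pairs:
        P[i, j] = 1.0
    return P

# Full correspondence between two modalities with three samples each.
P_full = correspondence_matrix([(0, 2), (1, 0), (2, 1)], n_m=3, n_e=3)

# Features of modality e, one row per sample.
X_e = np.array([[10.0, 11.0],
                [20.0, 21.0],
                [30.0, 31.0]])

# Rows of X_e reordered to follow the sample order of modality m.
aligned = P_full @ X_e

# Partial correspondence: only sample 0 of modality m has a known match.
P_partial = correspondence_matrix([(0, 2)], n_m=3, n_e=3)
```

Multiplying by the matrix reorders the samples of one modality into the order of the other, which is the role the learned doubly stochastic matrices play when no correspondence is given.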

MV-GWCN is trained similarly to M-GWCN, except that the permutation matrices are treated as fixed, non-learnable parameters.

V Experiments

We investigate the effectiveness of the proposed network with two types of experiments on unimodal explicit graph-based data and multimodal implicit graph-based ones.

The first experiment examines the efficiency of the proposed network on inherently graph-based data, including citation datasets, in semi-supervised node classification tasks. The purpose of the second experiment is to evaluate the effectiveness of the proposed method on multimodal data, implicitly considered as a graph, compared with state-of-the-art semi-supervised multimodal methods. Public implementations with the open-source GNN library Spektral (TensorFlow/Keras) are available in

V-A Evaluation on unimodal explicit graph-based data

To evaluate the performance of M-GWCN on explicit graph-based data, we focus on semi-supervised node classification of popular citation datasets. Since these datasets are unimodal, we apply the unimodal version of M-GWCN (GWCN) for semi-supervised node classification.

Specifications of three benchmark datasets used in the experiments, including the number of nodes, edges, node features, and classes are reported in Table I.

| Dataset | Nodes | Edges | Features | Classes |
|---|---|---|---|---|
| Cora | 2708 | 5429 | 1433 | 7 |
| Citeseer | 3327 | 9228 | 3703 | 6 |
| Pubmed | 19717 | 88648 | 500 | 3 |

TABLE I: The properties of citation datasets

We compare the efficiency of GWCN with the most popular state-of-the-art graph convolutional networks. These baselines include ChebyNet [11], GCN [20], CayleyNets [21], GNN-ARMA [6], and GWNN [37] as spectral methods, in addition to GAT [32], GraphSAGE [15], and GIN [38] as spatial methods.

In the semi-supervised problem, the features of all vertices are known, but only labels per class are given for training. Also, and labeled nodes are considered for validation and test, respectively.

The task is learning a network that takes the feature vectors as inputs and assigns a label to each vertex as outputs. The evaluation results of the learned network on testing nodes are reported as classification accuracies.

Table II reports the experimental results of GWCN on the citation datasets described in Table I and compares them with state-of-the-art methods. The accuracies of CayleyNets, GraphSAGE, and GIN are taken from their respective papers, while the other methods were implemented by the authors according to the mentioned settings. In this table, GWCN-1 indicates the GWCN model with only the first scale mentioned in Table III, e.g. for Cora, and GWCN-2 considers both scales.

The maximum number of training epochs is set to . Training of the network is terminated if the validation loss does not decrease for consecutive epochs. The learning rate is for all methods.

To maintain the sparsity structure of the graph wavelet bases, a threshold is defined to refine the values of and such that entries smaller than the threshold are set to . This parameter is set to in all experiments.
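The thresholding step can be sketched in a few lines. This is a minimal illustration that assumes the comparison is made on the magnitude of each entry and that small entries are set to zero; the function name `sparsify`, the example matrix, and the threshold value are hypothetical and are not the settings used in the experiments.

```python
import numpy as np

def sparsify(psi, threshold):
    """Return a copy of psi in which entries with magnitude below threshold are zeroed."""
    out = psi.copy()
    out[np.abs(out) < threshold] = 0.0
    return out

# Toy wavelet basis with a few negligible entries.
psi = np.array([[0.9, 1e-6, -0.05],
                [1e-6, 0.8, 2e-5],
                [-0.05, 2e-5, 0.7]])

psi_sparse = sparsify(psi, threshold=1e-4)
n_nonzero = int(np.count_nonzero(psi_sparse))
density = n_nonzero / psi_sparse.size
```

After thresholding, the matrix can be stored in a sparse format so that each convolution costs time proportional to the number of remaining non-zero entries.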

| Method | Cora | Citeseer | Pubmed |
|---|---|---|---|
| GIN | 75.1±1.7 | 63.1±0.2 | 77.1±0.7 |
| GraphSAGE | 73.7±1.8 | 65.9±0.9 | 78.5±0.6 |
| GAT | 83.1±0.4 | 70.7±0.1 | 78.4±0.3 |
| ChebyNet | 78.2±0.6 | 68.6±0.2 | 76.3±0.8 |
| GCN | 82.3±0.4 | 71.4±0.3 | 79.3±0.2 |
| CayleyNets | 81.2±1.2 | 67.1±2.4 | 75.6±3.6 |
| GNN-ARMA | 81.4±0.4 | 68.9±0.6 | 76.3±0.3 |
| GWNN | 82.6±0.3 | 70.4±0.9 | 79.4±0.7 |
| GWCN-1 | 83.2±0.1 | 71.6±0.6 | 79.5±0.3 |
| GWCN-2 | 83.8±0.5 | 72.7±0.3 | 80.3±0.6 |

TABLE II: Accuracies of different methods on citation datasets (mean ± standard deviation)

As proved in [16], with small values of the scale parameters, graph wavelet bases induce locality properties in such a way that each basis represents the neighborhood structure of a specific vertex. To limit model complexity in this experiment, we use two scales for graph wavelets, . The values of these scales are chosen among small values between and via grid search to ensure the locality of convolution in the vertex domain. The values of all parameters of GWCN used in the above experiments are summarized in Table III. Parameters of other methods are chosen according to their respective papers.

Parameter Cora Citeseer Pubmed
2 2 1
40 40 30
TABLE III: Parameters of the GWCN model

According to the classification accuracies reported in Table II, GWCN provides the best accuracies on all datasets. Among the baselines, the second-best accuracies are obtained by GAT on Cora, GCN on Citeseer, and GWNN on Pubmed. These results demonstrate the capability of graph wavelet bases in achieving a better representation of the vertex domain compared to other spectral methods.

Furthermore, GWCN performs better than all spatial methods reflecting the promising ability of spectral methods to achieve good performance. Since GAT assigns a self-attention weight to each edge, it captures more effectively the local similarity among neighborhoods and provides better accuracy compared to other spatial methods. However, computing the attention weights is inefficient for the large number of edges, and unlike GWCN, GAT cannot take advantage of the global structure of graphs.

Among the spectral baselines that do not use wavelet bases, GCN consistently outperforms the others (ChebyNet, CayleyNets, and GNN-ARMA) and achieves accuracies close to those of GWCN.

The key to the good performance of GCN is that, unlike CayleyNets, which applies graph Fourier bases to express the spectral graph convolution, GCN leverages the Laplacian matrix as a weight matrix in its formulation (equation (5)). In terms of sparsity, the Laplacian matrix is sparser than the Fourier bases and is similar to the wavelet bases.

GWNN presents results closer to those of GWCN because it also employs wavelet bases to localize the graph convolution. Nevertheless, GWCN has two main advantages that enable it to consistently outperform GWNN: 1) Aggregating features with different scaling parameter values makes GWCN flexible in locally exploring each sub-graph according to an appropriate scale, instead of using only one scale parameter for the whole graph. 2) Due to the sparsity of wavelet bases, after a few convolutions, the node features in GWNN become too smooth. Since the GWCN formulation includes a residual connection, the initial node features are reinjected, which avoids over-smoothing.

Table IV reports the training time and parameter complexity of GNN methods.

| Method | Cora sec/epoch | Cora parameters | Citeseer sec/epoch | Citeseer parameters |
|---|---|---|---|---|
| ChebyNet | 0.28492 | 46080 | 0.69860 | 118688 |
| GAT | 0.26980 | 92373 | 0.55968 | 237586 |
| GCN | 0.23140 | 23040 | 0.56030 | 59344 |
| GNN-ARMA | 0.23029 | 46103 | 0.37047 | 118710 |
| GWNN | 0.34075 | 23063 | 0.46709 | 59366 |
| GWCN-1 | 0.37979 | 46103 | 0.54539 | 178054 |
| GWCN-2 | 0.47666 | 91975 | 0.61975 | 355814 |

TABLE IV: Training time and parameter complexity of GNN methods

Since in citation networks the Laplacian is sparser than the graph wavelets [37], the two graph wavelet-based methods, GWNN and GWCN, have a slightly higher time complexity than the Laplacian-based methods ChebyNet, GCN, and GNN-ARMA.

Most GNN models have a moderate number of parameters, because increasing their parameters by adding layers aggravates the over-smoothing problem. For two main reasons, however, increasing the parameters of the GWCN model does not lead to over-smoothing: 1) the additional parameters belong to multiple stacks in each layer with different scaling parameter values, not to additional layers, and 2) the residual connection adopted in GWCN prevents the over-smoothing of each stack. Also, thanks to the efficient computations provided by sparse wavelet bases, GWCN achieves better performance with reasonable time complexity. Therefore, unlike most GNN models, the performance of GWCN improves with additional parameters.

V-B Evaluation of multimodal implicit graph-based data

The primary purpose of the proposed method is generalizing graph convolutional neural networks for multimodal problems. This experiment evaluates the effectiveness of the M-GWCN method for multimodal data.

To evaluate the performance of the proposed method, we conduct M-GWCN for the classification of multimodal implicit graph-based data on two categories of benchmark problems, including multimodal and multi-view datasets.

These categories consist of two multimodal datasets, Caltech and NUS, and three multi-view datasets, Caltech101-7, Caltech101-20, and MNIST, as introduced in [5]. In these datasets, each modality is implicitly defined as a graph.

In the first experiment, we evaluate M-GWCN on multimodal datasets in a more practical scenario, without any predefined knowledge among modalities. To ensure this property, we randomly shuffle the order of data samples in each modality, which makes the point-wise correspondences between the original modalities unknown.

We compare the efficiency of M-GWCN with state-of-the-art multimodal data modeling methods, including CD (pos) [12], CD (pos+neg) [12], SCSMM [28], m-LSJD [5], m-LSJD [5], and MCPC-u [4].

The first experiment runs M-GWCN on multimodal datasets with the parameters in Table V. In this experiment, the data in each modality are partitioned into for training, for validation, and for testing. The maximum number of training epochs is , and the training phase is terminated if the validation loss does not decrease for consecutive epochs. Experimental results over random data splits are reported in terms of mean accuracy and standard deviation. Parameters of other methods are chosen according to their respective papers.

Parameter Caltech NUS
2 2
2000 1000
100 100
TABLE V: Parameters of M-GWCN model

Table VI reports the performance of M-GWCN in terms of the mean and standard deviation of accuracies, and compares it with the baselines. Among the compared methods, MCPC-u has the experimental settings most similar to those of M-GWCN because it lacks prior correspondence knowledge. Although the other methods have been developed for multimodal datasets, they still depend to some degree on prior knowledge about correspondences among modalities. In the respective papers of CD (pos), CD (pos+neg), m-LSJD, and m-LSJD, the rate of dependency is defined as the ratio of the number of given corresponding samples to the total number of samples (in percent). To get closer to a fair comparison, we set the correspondence ratio as low as possible () in all mentioned methods.

| Method | Caltech | NUS |
|---|---|---|
| CD (pos) | 76.9±1.1 | 81.8±0.8 |
| CD (pos+neg) | 73.4±0.8 | 80.3±0.4 |
| SCSMM | - | 83.9±2.42 |
| m-LSJD | 84.1±1.4 | 83.2±1.3 |
| m-LSJD | 88.5±1.6 | 87.2±1.1 |
| MCPC-u | 84.8±0.7 | 86.4±0.3 |
| M-GWCN | 90.6±0.4 | 89.2±0.8 |

TABLE VI: Classification accuracies on multimodal datasets (mean ± standard deviation)

According to the classification accuracies reported in Table VI, M-GWCN achieves a significant improvement and consistently outperforms the other methods. These results indicate the superiority of the proposed graph convolutional network in effectively finding the cross-modal correlations in the absence of given coupling/decoupling information between the modalities.

The M-GWCN is able to represent feature vectors of each modality based on the correlated wavelet bases on other modalities. This capability enables it to be not only successfully applicable to the multimodal graph-based data, but also usable efficiently independent of initial correspondence information.

Since in many multimodal methods, the correspondences among modalities are predetermined, for more experiments, we evaluate the performance of our proposed network on the multi-view datasets. Therefore, in the second experiment, we use MV-GWCN as a particular version of M-GWCN and conduct it on multi-view graph-based datasets, according to parameters listed in Table VII. The other settings are similar to the first experiment.

Parameter Caltech101-7 Caltech101-20 MNIST
2 2 2
40 30 35
TABLE VII: Parameters of MV-GWCN model

To evaluate the performance of the proposed method on multi-view datasets, we compare the efficiency of MV-GWCN with some existing methods evaluated on mentioned multi-view datasets. The state-of-the-art methods used for comparison include MLDA [9], MLDA-m [9], MULDA [9], MULDA-m [9], MvMDA [31], OGMA [36], OMLDA [36], OMvMDA [36], and MCPC-p [4].

Table VIII reports the classification accuracy obtained by MV-GWCN compared with different multimodal methods on three multi-view datasets. In this table, MV-GWCN-1 indicates the proposed method with only the first scale mentioned in Table VII. Similarly, MV-GWCN-2 is based on both scales.

As can be seen from Table VIII, MV-GWCN consistently outperforms the other methods. Although our proposed method achieves the best accuracy among all methods on all multi-view datasets, the differences in accuracy are not as large as on the multimodal datasets. The main reason is that on multi-view datasets the correspondence knowledge is predetermined, so corresponding samples can be fused into a joint latent space to boost the classification, which improves accuracy without needing to explore the cross-modality correlations. When exploring the cross-modality correlations to discover the correspondences among modalities is a necessity, because this type of prior knowledge is lacking, the superiority of our proposed method is more pronounced.

| Method | Caltech101-7 | Caltech101-20 | MNIST |
|---|---|---|---|
| MLDA | 92.29±8e-3 | 76.59±12e-3 | 92.84±5e-3 |
| MLDA-m | 89.78±10e-3 | 73.77±114e-3 | 93.09±8e-3 |
| MULDA | 92.65±8e-3 | 82.20±11e-3 | 95.23±5e-3 |
| MULDA-m | 92.59±10e-3 | 82.17±6e-3 | 95.12±4e-3 |
| MvMDA | 92.65±8e-3 | 80.50±13e-3 | 93.78±9e-3 |
| OGMA | 95.01±5e-3 | 86.00±10e-3 | 96.09±6e-3 |
| OMLDA | 94.98±5e-3 | 86.85±10e-3 | 95.71±6e-3 |
| OMvMDA | 94.71±7e-3 | 82.28±10e-3 | 95.99±6e-3 |
| MCPC-p | 94.83±1.1 | 86.44±1.1 | - |
| MV-GWCN-1 | 95.25±1.3 | 87.83±0.8 | 96.45±1.4 |
| MV-GWCN-2 | 96.23±0.7 | 88.46±1.1 | 97.21±0.7 |

TABLE VIII: Classification accuracies on multi-view datasets (mean ± standard deviation)

In the last experiment, we develop new network architectures based on several conventional graph convolutional networks for applying to multi-view datasets. The architecture of these developed networks, that are named with a prefix ’M’ in Table IX to show this extension, is simple. Similar to MV-GWCN, each modality is represented through K layers using a specific graph convolutional network, and then embedded features in each modality are fused into a joint layer to have better representation based on all modalities.

In this way, we evaluate the abilities of various graph convolutional networks in both feature mapping in each modality as well as cross-modal feature fusion, and then compare these networks with MV-GWCN, which takes advantage of graph wavelet bases abilities.

The classification results of various graph convolutional networks on multi-view datasets are shown in Table IX. The results reported in this table demonstrate the superiority of applying wavelet bases on GNNs with similar modality fusion idea. These results confirm the effectiveness of applying wavelet bases in simultaneously representing each modality and utilizing correlation among all modalities efficiently.

| Method | Caltech101-7 | Caltech101-20 | MNIST |
|---|---|---|---|
| M-ChebyNet | 91.49±0.3 | 85.53±0.4 | 95.50±0.8 |
| M-GAT | 93.59±0.4 | 82.59±0.5 | 96.00±0.4 |
| M-GIN | 92.51±0.7 | 84.27±0.3 | 95.25±0.6 |
| M-GCN | 91.15±0.3 | 82.80±0.6 | 94.24±0.3 |
| M-GNN-ARMA | 94.89±1.1 | 86.37±0.5 | 95.75±0.3 |
| MV-GWCN-1 | 95.25±1.3 | 87.83±0.8 | 96.45±1.4 |

TABLE IX: Classification accuracies on multi-view datasets (mean ± standard deviation)

VI Conclusions

Extending convolutional neural networks to multimodal geometrically structured and/or graph-based data is an important problem that, to the best of the authors' knowledge, has rarely been addressed. This paper introduced a novel graph convolutional neural network based on wavelet bases to learn the representation of multimodal graph-based data in the spectral domain. Compared with the Fourier bases used in spectral GNNs, wavelet bases represent the feature vectors in each modality more effectively by means of the scaling parameter. Besides, because Chebyshev polynomials approximate the wavelet bases without requiring the costly eigendecomposition, and because the obtained wavelet bases are sparse, computations are performed efficiently.

The proposed M-GWCN simultaneously provides intra-modal localization by applying multi-scaled graph wavelet convolution and, as an alignment stage, estimates the cross-modal correlations between the modalities. We also introduced two particular versions of the proposed network for unimodal and multi-view tasks.

The proposed network was evaluated on both unimodal explicit graph-based datasets and multimodal implicit graph-based data. Extensive experiments demonstrated that M-GWCN outperforms state-of-the-art GNNs, in both unimodal and multimodal cases, without any prior knowledge.

To the best of our knowledge, our proposed M-GWCN model is the first work in developing graph convolutional networks for multimodal graph-based data in the practical scenarios.

Given the efficiency, generality, and flexibility of spatial methods, analyzing multimodal graph-based data using a spatial network is a valuable direction for future work.

Since multimodal data have a wide range of applications, developing multimodal graph-based networks for other tasks, including cross-modal retrieval, multimodal clustering, and domain adaptation, is another direction for our future work.


  • [1] Cited by: §III-D.
  • [2] G. B. Arfken, H. J. Weber, and F. E. Harris (2013) Mathematical Methods for Physicists (Seventh Edition) -Chapter 14 - Bessel Functions. Academic Press, Boston. Cited by: §III-D.
  • [3] T. Baltrusaitis, C. Ahuja, and L. P. Morency (2019) Multimodal Machine Learning: A Survey and Taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (2). External Links: Document, ISSN 19393539 Cited by: §I.
  • [4] M. Behmanesh, P. Adibi, J. Chanussot, and S. M. S. Ehsani (2021) Cross-modal and multimodal data analysis based on functional mapping of spectral descriptors and manifold regularization. ArXiv abs/2105.05631. External Links: Link Cited by: §I, §II-A, §V-B, §V-B.
  • [5] M. Behmanesh, P. Adibi, J. Chanussot, C. Jutten, and S. M. S. Ehsani (2021) Geometric multimodal learning based on local signal expansion for joint diagonalization. IEEE Transactions on Signal Processing 69 (), pp. 1271–1286. External Links: Document Cited by: §II-A, §V-B, §V-B.
  • [6] F. M. Bianchi, D. Grattarola, L. Livi, and C. Alippi (2021) Graph Neural Networks with Convolutional ARMA Filters. IEEE Transactions on Pattern Analysis and Machine Intelligence 8828 (c), pp. 1–12. External Links: Document, 1901.01343, ISSN 19393539 Cited by: §II-B, §III-C, §V-A.
  • [7] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst (2017) Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine 34 (4), pp. 18–42. External Links: Document Cited by: §I.
  • [8] J. Bruna, W. Zaremba, A. Szlam, and Y. Lecun (2014) Spectral networks and locally connected networks on graphs. In International Conference on Learning Representations (ICLR2014), CBLS, April 2014, (English (US)). Cited by: §I, §II-B, §III-C.
  • [9] G. Cao, A. Iosifidis, K. Chen, and M. Gabbouj (2018) Generalized Multi-View Embedding for Visual Recognition and Cross-Modal Retrieval. IEEE Transactions on Cybernetics 48 (9). External Links: Document, ISSN 21682267 Cited by: §V-B.
  • [10] F. R. K. Chung (2001) Lectures on Spectral Graph Theory. American Mathematical Society; UK ed. edition. Cited by: §I.
  • [11] M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. Advances in Neural Information Processing Systems. External Links: ISSN 10495258 Cited by: §II-B, §III-C, §III-C, §V-A.
  • [12] D. Eynard, A. Kovnatsky, M. M. Bronstein, K. Glashoff, and A. M. Bronstein (2015) Multimodal Manifold Analysis by Simultaneous Diagonalization of Laplacians. IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (12). External Links: Document, ISSN 01628828 Cited by: §II-A, §V-B.
  • [13] D. Gao, K. Li, R. Wang, S. Shan, and X. Chen (2020) Multi-modal graph neural network for joint reasoning on vision and scene text. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. External Links: Document, ISSN 10636919 Cited by: §II-C.
  • [14] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2017) Neural message passing for quantum chemistry. In 34th International Conference on Machine Learning, ICML 2017, Vol. 3. Cited by: §II-B.
  • [15] W. L. Hamilton, R. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, Vol. 2017-Decem. External Links: ISSN 10495258 Cited by: §II-B, §V-A.
  • [16] D. K. Hammond, P. Vandergheynst, and R. Gribonval (2011) Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis 30 (2). External Links: Document, ISSN 10635203 Cited by: §I, §III-D, §III-D, §III-D, §V-A.
  • [17] Y. Huang and A. C.S. Chung (2020) Semi-Supervised Multimodality Learning with Graph Convolutional Neural Networks for Disease Diagnosis. In Proceedings - International Conference on Image Processing, ICIP, Vol. 2020-Octob. External Links: Document, ISSN 15224880 Cited by: §II-C.
  • [18] E. Isufi, A. Loukas, A. Simonetto, and G. Leus (2017) Autoregressive Moving Average Graph Filtering. IEEE Transactions on Signal Processing 65 (2). External Links: Document, ISSN 1053587X Cited by: §IV-A.
  • [19] O. Katz, R. Talmon, Y. L. Lo, and H. T. Wu (2019) Alternating diffusion maps for multimodal data fusion. Information Fusion 45. External Links: Document, ISSN 15662535 Cited by: §II-A.
  • [20] T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. 5th International Conference on Learning Representations, ICLR 2017 - Conference Track Proceedings. Cited by: §III-C, §V-A.
  • [21] R. Levie, F. Monti, X. Bresson, and M. M. Bronstein (2019) CayleyNets: Graph Convolutional Neural Networks with Complex Rational Spectral Filters. IEEE Transactions on Signal Processing 67 (1). External Links: Document, ISSN 1053587X Cited by: §II-B, §III-C, §V-A.
  • [22] J. Li, Z. Li, G. Lu, Y. Xu, B. Zhang, and D. Zhang (2021) Asymmetric gaussian process multi-view learning for visual classification. Information Fusion 65, pp. 108–118. External Links: ISSN 1566-2535, Document, Link Cited by: §II-A.
  • [23] B. Liu, X. Chen, Y. Xiao, W. Li, L. Liu, and C. Liu (2021) An efficient dictionary-based multi-view learning method. Information Sciences 576, pp. 157–172. External Links: ISSN 0020-0255, Document Cited by: §II-A.
  • [24] S. Mai, S. Xing, J. He, Y. Zeng, and H. Hu (2020) Analyzing unaligned multimodal sequence via graph convolution and graph pooling fusion. arXiv, pp. 1–14. External Links: 2011.13572, ISSN 23318422 Cited by: §II-C.
  • [25] S. Mallat (2009) A Wavelet Tour of Signal Processing. Academic Press, Boston. External Links: Document Cited by: §III-D.
  • [26] A. Micheli (2009) Neural network for graphs: A contextual constructive approach. IEEE Transactions on Neural Networks 20 (3). External Links: Document, ISSN 10459227 Cited by: §I, §II-B.
  • [27] H. Q. Minh, L. Bazzani, and V. Murino (2016) A unifying framework in vector-valued reproducing kernel Hilbert spaces for manifold regularization and co-regularized multi-view learning. Journal of Machine Learning Research 17. External Links: ISSN 15337928 Cited by: §II-A.
  • [28] A. Pournemat, P. Adibi, and J. Chanussot (2021) Semisupervised charting for spectral multimodal manifold learning and alignment. Pattern Recognition 111. External Links: Document, ISSN 00313203 Cited by: §II-A, §V-B.
  • [29] S. Rosenberg (1997) The Laplacian on a Riemannian Manifold: An Introduction to Analysis on Manifolds.. London Mathematical Society Student Texts, Cambridge University Press. External Links: Document Cited by: §III-B.
  • [30] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini (2009) The graph neural network model. IEEE Transactions on Neural Networks 20 (1). External Links: Document, ISSN 10459227 Cited by: §I.
  • [31] S. Sun, X. Xie, and M. Yang (2016) Multiview Uncorrelated Discriminant Analysis. IEEE Transactions on Cybernetics 46 (12). External Links: Document, ISSN 21682267 Cited by: §V-B.
  • [32] P. Veličković, A. Casanova, P. Liò, G. Cucurull, A. Romero, and Y. Bengio (2018) Graph attention networks. 6th International Conference on Learning Representations, ICLR 2018 - Conference Track Proceedings. Cited by: §II-B, §V-A.
  • [33] L. Wang, L. Zhang, C. Shen, and R. Li (2020) Orthogonal Multi-view Analysis by Successive Approximations via Eigenvectors. arXiv. External Links: 2010.01632, Link Cited by: §V.
  • [34] Y. Wei, X. He, X. Wang, R. Hong, L. Nie, and T. S. Chua (2019) MMGCN: Multi-modal graph convolution network for personalized recommendation of micro-video. MM 2019 - Proceedings of the 27th ACM International Conference on Multimedia. External Links: Document Cited by: §II-C.
  • [35] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu (2021) A Comprehensive Survey on Graph Neural Networks. IEEE Transactions on Neural Networks and Learning Systems 32 (1). External Links: Document, ISSN 21622388 Cited by: §I.
  • [36] L. Wang, L. Zhang, C. Shen, and R. Li (2020) Orthogonal multi-view analysis by successive approximations via eigenvectors. ArXiv abs/2010.01632. Cited by: §V-B.
  • [37] B. Xu, H. Shen, Q. Cao, Y. Qiu, and X. Cheng (2019) Graph wavelet neural network. 7th International Conference on Learning Representations, ICLR 2019. Cited by: §II-B, §V-A, §V-A.
  • [38] K. Xu, S. Jegelka, W. Hu, and J. Leskovec (2019) How powerful are graph neural networks?. 7th International Conference on Learning Representations, ICLR 2019. Cited by: §II-B, §V-A.
  • [39] Y. Zhao, X. You, S. Yu, C. Xu, W. Yuan, X. Y. Jing, T. Zhang, and D. Tao (2018) Multi-view manifold learning with locality alignment. Pattern Recognition 78. External Links: Document, ISSN 00313203 Cited by: §II-A.
  • [40] V. A. Zimmer, M. Á. González Ballester, and G. Piella (2019) Multimodal image registration using Laplacian commutators. Information Fusion 49. External Links: Document, ISSN 15662535 Cited by: §II-A.