I Introduction
Multimodal data are generally composed of several heterogeneous data types gathered from multiple real-world phenomena. Different modalities provide parts of the description of a phenomenon from various source domains, usually with very different statistical properties. Unlike unimodal data, which may convey only partial information about an entity, multimodal data provide complementary information by leveraging and fusing modality-specific knowledge [3]. Although each modality has its distinct statistical properties, the modalities are usually semantically correlated. Multimodal learning aims to improve performance in applications with multimodal data by discovering the hidden intra-modality and cross-modality correlations.
To apply geometry-aware data analysis to multimodal problems, data in various modalities can be implicitly represented based on their geometric structures, such as graphs, manifolds, and meshed surfaces. Moreover, there is an increasing number of applications in which data are generated in non-Euclidean geometries and are inherently defined, for example, as a graph. These applications represent complex relationships and interdependencies among objects [35], including social networks, citation networks, networks of the spread of epidemic diseases, e-commerce networks, the brain's neuronal networks, biological regulatory networks, and so on.
Due to the growth of non-Euclidean data with explicit or implicit geometric structures, and the success of deep learning models in capturing hidden patterns in various domains, many deep learning methods have recently been adapted to geometrically structured domains.
Graph neural networks (GNNs) are the most common networks extending deep learning methods to graphs as the geometric structure of data; they perform filtering operations directly on the graph via the graph weights [7]. The graph convolutional network (GCN) can learn the locally meaningful stationary properties of the input signals through specially designed convolution operators on graphs [30]. The GCN represents node features (also called node embeddings) within local neighborhoods on the graph to learn graph embeddings for node classification, graph signal classification, graph regression, and so on [30]. Complex geometric structures in graphs can be encoded with strong mathematical tools in many spatial or spectral graph-based methods [10]. In spatial graph-based methods, the convolution on each vertex is defined directly by aggregating feature information from all its neighbors [26]. Spectral graph-based methods define convolution by leveraging the graph Fourier transform to convert signals defined in the vertex domain into the spectral domain using the convolution theorem
[8]. Most of the recent multimodal data analysis methods are limited because they are based on two simplifying assumptions: 1) each modality contains the same number of data samples (the homogeneity assumption), and 2) at least partial correspondences and/or non-correspondences between modalities are given as prior knowledge. These assumptions are clearly unrealistic in practical machine learning scenarios [4]. The success of GCNs on unimodal graph-based data and the limitations of current multimodal methods in practical problems motivate the approach proposed in this paper, namely extending spectral GCN models to multimodal graph-based data. Although various spectral and spatial methods have been developed to extend convolutional neural networks to graph-based data, few methods extend GCNs to multimodal graph-based domains.
The main purpose of this work is to introduce a novel spectral GCN for graph-based multimodal data in a more practical scenario, in which it is only known that the data points of the various modalities are sampled from the same phenomenon.
In this paper, we propose a multimodal graph wavelet convolutional network (MGWCN) with end-to-end learning. Unlike existing spectral graph-based methods that apply the graph Fourier transform to provide localization properties in the spectral domain, MGWCN uses graph wavelet transforms as a multi-resolution analysis of graph signals. Besides being highly localized in the vertex domain, MGWCN can simultaneously localize signal content in both the spatial and spectral domains (graph wavelets from the spectral graph theory point of view are studied in [16]).
To prevent oversmoothing, MGWCN is equipped with an initial residual connection mechanism, which ensures that the representation at each layer contains a portion of the initial features. To avoid the expensive spectral decomposition for diagonalizing the graph Laplacian, we apply the Chebyshev polynomial method to approximate the graph wavelet bases. This approximation makes MGWCN applicable to large graphs.
The proposed MGWCN consists of two main parts. In the first part, a graph convolution is conducted by applying multi-scale graph wavelet transforms in each modality. Graph wavelet bases obtained with different scales provide valuable localization properties in the graph domain of each modality. These properties are applied for the intra-modality representation of the feature vectors of each graph. In the second part, cross-modality representations are computed in a single layer that is responsible for data fusion. In this layer, permutations encoding the cross-modality correlations are learned by applying a new loss function and regularization terms.
The main contributions of this paper are briefly summarized as follows:

We generalize the spectral GCN model to multimodal graph-based data domains, a problem that has rarely been addressed.

We consider a general scenario for multimodal problems with unpaired data in the heterogeneous modalities, which needs no prior knowledge.

We introduce a new spectral GCN model that in parallel applies graph wavelet transforms with different scaling parameters in each modality. This method prevents oversmoothing by introducing initial residual connections.

We develop a new efficient network with end-to-end learning that simultaneously applies feature mapping on each modality and explores permutations for representing the cross-modality correlations among various modalities.

We experimentally demonstrate the superiority and effectiveness of the proposed MGWCN model on three unimodal datasets with explicit graph structure and five multimodal datasets that are not explicitly defined as graphs but on which graphs are implicitly constructed.
The remainder of the paper is organized as follows. Related works are reviewed in section II. In section III, we overview the background and the basic notations. Our proposed methods are introduced in section IV. Experimental results are presented and discussed in section V. Finally, section VI concludes the paper.
II Related works
II-A Multimodal learning
Many multimodal learning methods are limited to problems with homogeneous data, i.e., the same number of data samples in all modalities. For example, MVMLLA is a multi-view method that learns a common discriminative low-dimensional latent space by preserving the geometric structure of the original input [39]. MVDLCV is a multi-view method that obtains a sparse representation of each sample by learning a particular dictionary for each view and determines the similarity of samples using a regularization term between two dictionaries [23]. GPLVM represents multiple modalities in a common subspace using the Gaussian process latent variable model [22]. Another work provides a unifying framework for multi-class classification by encompassing vector-valued manifold regularization and co-regularized multi-view learning [27].
Some multimodal learning methods are geometry-aware and try to extend different diffusion and spectral methods to the multimodal setting. A nonlinear multimodal data fusion method captures the intrinsic structure of data and relies on minimal prior model knowledge [19]. A multimodal image registration method uses the graph Laplacian to capture the intrinsic structure of data in each modality by introducing a new structure preservation criterion based on Laplacian commutativity [40].
These two methods assume that the data samples in the various modalities are completely paired and that all modalities have the same number of data samples. Another approach, presented in [12], assumes that partial correspondence information is predetermined through an expert manipulation process.
Since the correspondence knowledge between modalities, which is essential for the above multimodal learning methods, may not be available in many practical scenarios, working independently of this prior knowledge is of key importance. In the following, we therefore focus on several methods that try to reduce the dependency on the expert manipulation process.
The method presented in [28] first extends the given correspondence information between modalities using the functional mapping idea on the data manifolds of the respective modalities, and then uses all correspondence information to simultaneously learn an underlying low-dimensional common manifold by aligning the manifolds of the different modalities. In [5], another multimodal manifold learning approach, called local signal expansion for joint diagonalization (LSEJD), was proposed, which uses the intrinsic local tangent spaces to expand the initial correspondence knowledge.
Although these two later methods greatly expand the correspondence information, they still depend on a small amount of basic prior correspondence knowledge. In recent work [4], we proposed a multimodal learning method that is independent of any prior expertise between modalities. This method first uses the spectral graph wavelet transform to represent local descriptors of each modality, and then applies these descriptors to find point-wise correspondences between modalities using the functional map approach.
II-B Graph convolutional neural networks
Inspired by the success of CNNs in the Euclidean domain, a large number of methods have been proposed to generalize CNNs to non-Euclidean and especially graph domains. These methods are classified into two categories: spatial methods and spectral ones.
Spatial methods directly perform convolution on a graph based on its topological structure. Different spatial methods provide various weighted average functions for characterizing the node influences in the neighborhoods. NN4G [26] performs graph convolutions by directly summing up the neighborhood information of the nodes. The message-passing neural network (MPNN) [14] treats graph convolution as a message-passing process in which information is passed directly from one node to another along edges. To amend MPNN-based methods in distinguishing different graph structures based on the graph embedding, the graph isomorphism network (GIN) [38] was proposed; GIN adjusts the weight of the central node in a neighborhood by a learnable parameter. GraphSage [15] performs graph convolution by adopting a sampling strategy to obtain a fixed number of neighbors for each node. The graph attention network (GAT) [32] applies graph convolution by adopting attention mechanisms to learn the relative weights between two connected nodes.
Spectral methods develop the graph convolution operation based on spectral graph theory, which generally relies on Fourier analysis in the graph domain. Spectral CNN (SCNN) [8] is the first spectral method; its graph convolution layer projects the input graph signal into a new space using the graph Fourier transform. To amend the limitations of the basic SCNN, the Chebyshev spectral CNN (ChebyNet) [11] was introduced, which applies a fast localized filter approximation to find the desired approximate filter response through a Chebyshev expansion. CayleyNet is another spectral method, which uses Cayley polynomial filters. Unlike ChebyNet, CayleyNet can detect narrow frequency bands with a small number of filter parameters [21]. To address the limitations of spectral graph CNN methods in providing beneficial localization properties, GWNN was presented, which applies graph wavelets as a set of bases instead of the eigenvectors of the graph Laplacian [37]. A new spectral method proposed in [6] offers a more flexible frequency response and captures the global graph structure better by applying an autoregressive moving average (ARMA) filter.
II-C Multimodal graph neural networks
Generalizing GNNs to graph-based multimodal data is an important problem that has rarely been addressed. A multimodal GNN method for visual question answering tasks was proposed in [13]. This method represents the image as three graph-based modalities and refines the node features by passing messages from one graph to another. Inspired by the message-passing idea of graph neural networks, a multimodal GCN (MMGCN) framework was proposed in [34]. MMGCN captures user preferences in a recommender system by enriching the representation of each node through information interchange among the various modalities. Motivated by the suitability of graph-based structures for handling long unaligned sequences, a multimodal GCN-based method was proposed in [24] to investigate the effectiveness of GNNs in modeling multimodal sequential data. The Edge Adaptable GCN (EAGCN) method for disease prediction was presented in [17]. EAGCN represents the various modalities as a population graph using an edge adapter and applies a GCN for semi-supervised node classification.
III Background
III-A Multimodal problem formulation
We assume that multimodal data are acquired from $M$ different modalities or spaces with different dimensions. The data in each modality $m$ are represented by an undirected weighted graph $G^m = (V^m, E^m, W^m)$, where $V^m$ is a set of vertices, $E^m$ is the set of edges, and $W^m$ is a weighted adjacency matrix representing the connection weights between vertices.
Let the graph signal $X^m$ be the collection of all feature vectors associated with the labeled and unlabeled vertices of modality $m$.
Consider $Y^m$ as the label matrix of the data samples in modality $m$ for a $C$-class classification problem. If a vertex belongs to the $c$th class, then its row in $Y^m$ contains $1$ in the $c$th location and $0$ in all others. For an unlabeled data sample, the corresponding row contains $0$ in all locations.
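As an illustration, the label matrix described above can be built with a minimal NumPy sketch (the function name and the list-based encoding of unlabeled samples are ours, for illustration only):

```python
import numpy as np

def label_matrix(labels, n_classes):
    """Build the label matrix: row i is the one-hot vector of
    sample i's class, or all zeros when the sample is unlabeled
    (encoded here as None)."""
    Y = np.zeros((len(labels), n_classes))
    for i, c in enumerate(labels):
        if c is not None:
            Y[i, c] = 1.0
    return Y

# Four samples, three classes; the third sample is unlabeled.
Y = label_matrix([0, 2, None, 1], n_classes=3)
```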
In this paper, it is assumed that the sample correspondence information across the different modalities is unknown. Also, when there is no emphasis on a specific modality, symbols are written in generic notation; e.g., $G$ is used to denote the data graph instead of $G^m$, which indicates a specific modality $m$.
III-B Graph spectral geometry
The symmetric normalized Laplacian matrix of a graph $G$ is defined as $L = I - D^{-1/2} W D^{-1/2}$, obtained by discretizing the Laplace-Beltrami (LB) operator [29], where $D$ is the diagonal matrix of node degrees. Every such Laplacian matrix admits the eigendecomposition $L = U \Lambda U^\top$, where the unitary matrix $U$ and the diagonal matrix $\Lambda$ contain the orthogonal eigenvectors and their corresponding eigenvalues (the spectrum), respectively.
The eigenvectors play the role of the Fourier basis in classical harmonic analysis, and the eigenvalues can be interpreted as frequencies. For a given graph signal $f$ on the vertices of $G$, $\hat{f} = U^\top f$ performs the graph Fourier transform, and $f = U \hat{f}$ is its inverse.
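The constructions above (normalized Laplacian, graph Fourier transform, and its inverse) can be illustrated with a small NumPy sketch on a toy path graph (the helper names are ours):

```python
import numpy as np

def normalized_laplacian(W):
    """L = I - D^{-1/2} W D^{-1/2} for a weighted adjacency matrix W."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    return np.eye(len(W)) - (W * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]

# Toy path graph on 4 vertices.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = normalized_laplacian(W)
lam, U = np.linalg.eigh(L)           # spectrum (ascending) and Fourier basis
f = np.array([1.0, 2.0, 3.0, 4.0])   # a graph signal on the vertices
f_hat = U.T @ f                      # graph Fourier transform
f_rec = U @ f_hat                    # inverse transform recovers f
```

For a connected graph, the smallest eigenvalue is zero, and all eigenvalues of the normalized Laplacian lie in $[0, 2]$.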
III-C Spectral graph convolutional network
According to the convolution theorem, the spectral graph convolution of a signal $f$ with a filter $g$ on the graph can be defined as the element-wise product of their Fourier transforms, $f \star g = U\big((U^\top g) \odot (U^\top f)\big)$. Denoting $g_\theta = \operatorname{diag}(U^\top g)$, the spectral graph convolution can be written in the form of a matrix multiplication as $f \star g = U g_\theta U^\top f$ [11].
The spectral convolution layer is defined by extending CNNs to graphs as follows [8]:

$X^{k+1}_{:,j} = \sigma\Big( \sum_{i=1}^{f_k} U_t \, \Theta^{k}_{i,j} \, U_t^\top X^{k}_{:,i} \Big), \quad j = 1, \dots, f_{k+1}$   (1)

where $U_t$ is an $N \times t$ matrix of the first $t$ eigenvectors of the Laplacian matrix (corresponding to its $t$ smallest eigenvalues), $\Theta^{k}_{i,j}$ is a $t \times t$ diagonal matrix of spectral multipliers representing the learnable parameters in layer $k$, $X^{k}$ is the input signal with $f_k$ features (channels), $X^{k+1}$ is the output signal with $f_{k+1}$ features, and $\sigma$ is a nonlinear activation function (e.g., ReLU). In this equation, the parameter $t$ keeps the filter localized in the spectral domain by using only the lowest-frequency harmonics. The main drawback of the spectral CNN is its computational complexity: the eigendecomposition of the Laplacian matrix is too expensive, and it also yields dense eigenvectors, which prevents taking advantage of sparse multiplications.
To overcome this limitation, the ChebyNet [11] model was presented, which approximates the desired filter response using a Chebyshev expansion as follows:

$g_\theta(\tilde{\Lambda}) = \sum_{k=0}^{K-1} \theta_k T_k(\tilde{\Lambda})$   (2)

where $\tilde{\Lambda} = 2\Lambda/\lambda_{max} - I$ is the spectrum rescaled into $[-1, 1]$, $\theta$ is the $K$-dimensional vector of polynomial coefficients parametrizing the filter, optimized during training, and $T_k$ is the Chebyshev polynomial of order $k$, defined recursively as $T_k(x) = 2x T_{k-1}(x) - T_{k-2}(x)$ with $T_0(x) = 1$ and $T_1(x) = x$.
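The Chebyshev recursion applied directly to the rescaled Laplacian, as in equation (3) below, can be sketched in NumPy (a minimal illustration; function and variable names are ours):

```python
import numpy as np

def chebyshev_filter(L, x, theta, lam_max=2.0):
    """Apply the filter sum_k theta_k T_k(L_tilde) x, where
    L_tilde = 2 L / lam_max - I is the rescaled Laplacian and
    T_k follows the recursion T_k = 2 L_tilde T_{k-1} - T_{k-2}."""
    n = L.shape[0]
    L_t = 2.0 * L / lam_max - np.eye(n)
    T_prev, T_cur = x, L_t @ x          # T_0 x and T_1 x
    out = theta[0] * T_prev
    if len(theta) > 1:
        out = out + theta[1] * T_cur
    for k in range(2, len(theta)):
        T_next = 2.0 * (L_t @ T_cur) - T_prev
        out = out + theta[k] * T_next
        T_prev, T_cur = T_cur, T_next
    return out
```

Note that only matrix-vector products with the (typically sparse) Laplacian are needed, which is the source of ChebyNet's efficiency.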
The convolution of a graph signal $x$ with this filter is obtained as follows:

$g_\theta \star x = \sum_{k=0}^{K-1} \theta_k T_k(\tilde{L}) \, x$   (3)

where $\tilde{L} = 2L/\lambda_{max} - I$. The resulting convolution layer is now defined as:

$X^{k+1} = \sigma\Big( \sum_{j=0}^{K-1} T_j(\tilde{L}) \, X^{k} \, \Theta^{k}_j \Big)$   (4)

where $\Theta^{k}_j$ indicates the $j$th trainable weight matrix in layer $k$.
A specific polynomial order $K$ in equation (4) covers the $K$-hop neighborhood and ignores the impact of farther neighbors. Covering larger structures in the graph requires a high-order polynomial, but such a polynomial leads to overfitting to the known graph and is also more computationally expensive.
The Graph Convolutional Network (GCN) [20] presents a simplified version of the Chebyshev filter. It reduces complexity and overfitting by setting $K = 2$, $\lambda_{max} \approx 2$, and $\theta = \theta_0 = -\theta_1$, and by substituting $I + D^{-1/2} W D^{-1/2}$ with its renormalized version $\tilde{D}^{-1/2} \tilde{W} \tilde{D}^{-1/2}$ in equation (4), where $\tilde{W} = W + I$ and $\tilde{D}_{ii} = \sum_j \tilde{W}_{ij}$. Thus, the convolution layer of GCN is obtained as:

$X^{k+1} = \sigma\big( \tilde{D}^{-1/2} \tilde{W} \tilde{D}^{-1/2} X^{k} \Theta^{k} \big)$   (5)

This model is able to cover large graph structures with high-order neighborhoods by stacking multiple GCN layers.
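The GCN propagation rule of equation (5) can be sketched in NumPy (a minimal illustration with our own helper name; the adjacency matrix is dense here for clarity):

```python
import numpy as np

def gcn_layer(W_adj, X, Theta):
    """One GCN layer: ReLU(D~^{-1/2} W~ D~^{-1/2} X Theta),
    where W~ = W + I adds self-loops and D~ is its degree matrix."""
    W_t = W_adj + np.eye(W_adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(W_t.sum(axis=1))
    W_norm = (W_t * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
    return np.maximum(W_norm @ X @ Theta, 0.0)   # ReLU

W_adj = np.array([[0., 1.], [1., 0.]])   # toy two-node graph
X = np.array([[1., 2.], [3., 4.]])
out = gcn_layer(W_adj, X, np.eye(2))
```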
III-D Spectral graph wavelet transform
Wavelet transform is a powerful multi-resolution analysis tool that expresses a signal as a combination of several localized, shifted, and scaled bases (wavelet bases) [25]. The spectral graph wavelet transform projects graph signals from the vertex domain into the spectral domain using a proper set of bases provided by the wavelet transform. This transformation provides a valuable localization property by applying a series of appropriate scaling operations to graph signals [16].
As initially shown in [16], the spectral graph wavelet localized at vertex $n$ with scale parameter $s$ is denoted by $\psi_{s,n}$, whose $i$th element is given by:

$\psi_{s,n}(i) = \sum_{l=1}^{N} g(s\lambda_l) \, u_l^*(n) \, u_l(i)$   (6)

where $N$ is the number of vertices, $\lambda_l$ is the $l$th eigenvalue of the normalized graph Laplacian matrix, $u_l$ is the Laplacian's associated eigenvector, whose $n$th element $u_l(n)$ is the value of the Laplace-Beltrami operator eigenfunction at vertex $n$, the symbol $*$ denotes the complex conjugate operator, and $g$ is the spectral graph wavelet generating kernel. The wavelet bases are defined by $\Psi_s = (\psi_{s,1}, \dots, \psi_{s,N})$, where each wavelet basis corresponds to a signal on a vertex at scale $s$. In terms of the wavelet bases, equation (6) can be written in the form of a matrix multiplication:
$\Psi_s = U G_s U^\top$   (7)

where $G_s = \operatorname{diag}\big(g(s\lambda_1), \dots, g(s\lambda_N)\big)$ is the scaling matrix. The graph wavelet transform of a given graph signal $f$ is defined by $\hat{f} = \Psi_s^{-1} f$, and its inverse is $f = \Psi_s \hat{f}$.
According to [1], $\Psi_s^{-1}$ can be obtained by replacing $g(s\lambda)$ in $\Psi_s$ with $g(-s\lambda)$, corresponding to a heat kernel.
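For the heat kernel $g(s\lambda) = e^{-s\lambda}$, the exact wavelet bases and their inverse can be computed by eigendecomposition, as the following NumPy sketch illustrates (function names are ours; this is only practical for small graphs):

```python
import numpy as np

def heat_wavelet_bases(L, s):
    """Exact heat-kernel wavelet bases Psi_s = U diag(e^{-s*lam}) U^T,
    with the inverse obtained by negating the scale, as described above.
    Requires a full eigendecomposition, so it suits only small graphs."""
    lam, U = np.linalg.eigh(L)
    psi = U @ np.diag(np.exp(-s * lam)) @ U.T
    psi_inv = U @ np.diag(np.exp(s * lam)) @ U.T
    return psi, psi_inv

# Laplacian of a path graph on three vertices.
L = np.array([[1., -1., 0.], [-1., 2., -1.], [0., -1., 1.]])
psi, psi_inv = heat_wavelet_bases(L, s=0.5)
```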
Computing the wavelet bases still depends on the eigendecomposition, which is inefficient for large graphs. As mentioned in [16], the graph wavelet bases can be approximated using Chebyshev polynomials as follows:

$\Psi_s \approx \frac{1}{2} c_{s,0} + \sum_{k=1}^{K} c_{s,k} \, T_k(\tilde{L})$   (8)

where $T_k$ is the Chebyshev polynomial of order $k$, evaluated at the rescaled Laplacian $\tilde{L} = 2L/\lambda_{max} - I$, $K$ is the number of Chebyshev polynomials, and the coefficients $c_{s,k}$ are computed from the generating kernel using the Bessel function of the first kind [2].
IV Proposed method
As mentioned in section III, the computational complexity of the graph Fourier transform for obtaining the Fourier bases is the main drawback of most spectral methods. In addition, the graph Fourier transform, as a global transformation, does not provide helpful localization properties in the vertex domain.
Inspired by the superiority of spectral graph wavelet transforms, whose highly sparse wavelet bases have more helpful localization properties, we develop a novel multimodal graph wavelet convolutional network (MGWCN) for analyzing multimodal data.
Due to the sparsity of the wavelet bases, the computations in the MGWCN model are much more efficient than in models based on graph Fourier bases. Furthermore, since each wavelet basis is related to a signal on the graph diffused away from a central node, the MGWCN model with small scale values is localized in the vertex domain of each modality. Thus, different scales of wavelet bases enable this model to represent the feature vectors of each modality efficiently at different levels of locality.
The MGWCN model also takes advantage of the complementary information provided by the various modalities by learning cross-modal representations. Cross-modality representation is the process of representing the feature vectors of each modality based on the wavelet bases of the other modalities. Finding the correspondences among the vertices of the various graphs is essential for cross-modality representation. The MGWCN model explores these correspondences by learning permutation matrices that encode them among the various modalities.
The proposed MGWCN method has three advantages distinguishing it from previous networks:

MGWCN introduces a stacked architecture that utilizes graph wavelet convolutions with multiple scaling parameters in parallel and provides more useful intra-modality localization properties by aggregating embedded features at different scales.

MGWCN adopts residual connections that not only help prevent the oversmoothing of each stack, but also encourage the stacks to provide various filtering responses based on each scale.

MGWCN generalizes the benefits of the graph wavelet transform to cross-modality representation by finding permutations that encode the cross-modality correlations among the various modalities.
IV-A Multi-scale adaptive graph wavelet
We define an adaptive graph wavelet (AGW) as the building block of the proposed network. In each layer, an AGW consists of a graph wavelet transformation with a desired scale and a residual connection as an adaptive component. AGWs with different scales are concatenated in parallel to form a multi-scale adaptive graph wavelet (MAGW), or a stack of AGWs. Multiple wavelet scales provide more helpful localization properties by decomposing a graph signal into components at different scales, or frequency ranges.
The graph filter designed with MAGW approximates the graph frequency response at different scales without knowing the underlying graph structure. The residual connection compensates for the oversmoothing of MAGW in deeper layers and provides different filtering responses for each AGW. Fig. 1 depicts a scheme of MAGW.
Owing to the superiority of rational filters over polynomial filters in approximating various filter shapes, we apply a rational filter, as a more versatile graph filter, in each AGW. Inspired by the frequency response of rational filters mentioned in [18], filtering a signal at scale $s$ can be implemented using the following first-order recursion:
$y^{(t+1)} = a_s \hat{A} \, y^{(t)} + b_s \, x$   (9)

where $a_s$ and $b_s$ are the filter coefficients at scale $s$, and $\hat{A}$ is any practical graph representation matrix used to capture comprehensive information about the graph.
Inspired by this filter response in graph signal processing, we design a machine learning approach for learning the parameters $a_s$ and $b_s$ at each scale using a new graph convolutional network. In the designed AGW, the graph wavelet transform is used to localize the graph convolution by projecting signals from the vertex domain into the spectral domain.
The obtained AGW with scale $s$ is defined as follows:

$X^{k+1}_s = \sigma\big( \Psi_s F^{k}_s \Psi_s^{-1} X^{k}_s \Theta^{k}_s + X^{0} \Theta^{k}_{s,0} \big)$   (10)

where $\Psi_s$ is the wavelet bases at scale $s$ and $\Psi_s^{-1}$ is its inverse, $F^{k}_s$ is a diagonal graph convolution kernel, $\Theta^{k}_s$ and $\Theta^{k}_{s,0}$ are learning parameters, $X^{k}_s$ is the input signal at scale $s$ with $f_k$ features, $X^{k+1}_s$ is the output signal with $f_{k+1}$ features, $X^{0}$ is the initial node features, and $\sigma$ is the nonlinear activation function.
The output of each AGW stack is computed as the average of the outputs of all its AGWs.
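Under our reading of the description above, one AGW unit and the stack average can be sketched in NumPy as follows (the exact placement of the residual term and the kernel parameterization are assumptions, not the authors' exact formulation):

```python
import numpy as np

def agw(psi, psi_inv, kernel_diag, X, Theta, X0, Theta0):
    """One AGW unit at a fixed scale: wavelet-domain filtering of X
    plus an initial residual connection to the input features X0.
    The exact composition is our assumption based on the text."""
    conv = psi @ np.diag(kernel_diag) @ psi_inv @ X @ Theta
    return np.maximum(conv + X0 @ Theta0, 0.0)   # ReLU

def magw(agw_outputs):
    """The stack output is the average of its parallel AGW outputs."""
    return sum(agw_outputs) / len(agw_outputs)
```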
IV-B Multimodal graph wavelet convolution layers
The proposed multimodal graph wavelet convolutional network consists of three phases: intra-modality localization, cross-modality correlation, and node classification. Fig. 2 schematically shows the diagram of the proposed MGWCN. The details are given below.
1. Intramodality localization
In this phase, the feature mapping processes for the node feature vectors of all modalities are conducted separately. For each modality $m$, the node features are represented through $K$ layers of the network, where each layer contains $|S|$ parallel AGW units and $S$ is the set of scales.
The output of the proposed network for modality $m$ at scale $s$ of the $k$th layer is defined as:

$X^{m,k+1}_s = \sigma\big( \Psi^m_s F^{m,k}_s (\Psi^m_s)^{-1} X^{m,k}_s \Theta^{m,k}_s + X^{m,0} \Theta^{m,k}_{s,0} \big)$   (11)

where $\Theta^{m,k}_s$ and $\Theta^{m,k}_{s,0}$ are trainable parameter matrices for feature mapping at scale $s$, $F^{m,k}_s$ is a diagonal matrix for the graph convolution kernel, $X^{m,k}_s$ is the embedded feature vectors in layer $k$ and scale $s$, $X^{m,0}$ is the initial node features of modality $m$, $f_k$ is the number of features in layer $k$, and $\sigma$ is the nonlinear activation function.
Applying stochastic dropout to the initial node feature in equation (11) encourages each AGW to provide a response different from the others.
The final embedded feature vectors of layer $k$ are defined by averaging the outputs of all AGW units in the $k$th stack as:

$Z^{m,k} = \frac{1}{|S|} \sum_{s \in S} X^{m,k}_s$   (12)
2. Cross-modality correlations
Tuning the cross-modality correlations, i.e., finding the point-wise correspondences among the various modalities, is the most critical challenge in multimodal problems.
In this phase, the cross-modality correlations among the various modalities are explored, and a new convolutional layer is defined to represent the embedded features of each modality based on the graph wavelets of the other modalities.
To prevent an increase in the number of learnable parameters, this layer leverages wavelets at a single scale. The graph wavelet convolutional network learned in phase 1 is permutation invariant, because the embedded feature vectors are insensitive to a reordering of the node indices. We utilize this property to take advantage of the correlated representational information of the other modalities discovered by the cross-modality correlations.
To better represent the embedded feature vectors of modality $m$ obtained after $K$ layers, $Z^{m,K}$, based on the wavelet bases of modality $e$, we define the following cross-modality feature mapping between two modalities $m$ and $e$:

(13)

where $P^{(m,e)}$ is a permutation matrix that encodes the cross-modality correspondence between modalities $m$ and $e$, and $Z^{(m,e)}$ is a matrix of embedded feature vectors of modality $m$ obtained by representing $Z^{m,K}$ based on the wavelet bases of modality $e$, while their correlations are encoded in $P^{(m,e)}$.
A permutation matrix is defined as a matrix containing exactly one unit value in each row and each column, and zeros elsewhere. This matrix is used to represent the permutation of the elements of an ordered sequence.
For example, for a square matrix $A$ with three rows, a permutation of its rows is represented by a permutation matrix $P$, and the permuted matrix is obtained by the simple matrix multiplication $PA$. The permutation of a symmetric matrix $A$, in the order of both rows and columns, is obtained by the matrix multiplication $P A P^\top$.
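A small NumPy illustration of these permutation operations:

```python
import numpy as np

# P sends the row order (1, 2, 3) to (3, 1, 2).
P = np.array([[0., 0., 1.],
              [1., 0., 0.],
              [0., 1., 0.]])
A = np.arange(9.0).reshape(3, 3)
rows_permuted = P @ A        # permutes the rows of A
both_permuted = P @ A @ P.T  # permutes both rows and columns
```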
Based on the cross-modality feature mapping, for each modality $m$, we construct a cross-modality convolutional layer as follows:

(14)

where $Z^{m}_{out}$ is the output embedded feature matrix, $F$ is the kernel matrix, $K$ is the total number of layers for intra-modality representation, and the concatenation operator incorporates the extracted feature information in the cross-modality convolution layer.
3. Node classification
The embedded feature vectors of each modality are obtained through $K$ layers of feature mapping (intra-modality representation) and one layer of feature mapping based on the other modalities (cross-modality representation).
The obtained embedded feature vectors are fed into the last layer to conduct node classification as follows:
(15) 
where the first two terms are the trainable weights of the classification layer and $C$ is the number of classes.
4. Problem formulation and optimization
Considering the architecture of the proposed network, we present a unified regularizationbased optimization problem that aims to simultaneously learn network parameters and permutations through network training.
Since permutations are highly discrete and too costly to enumerate, the stochastic gradient descent (SGD) method is not capable of optimizing the proposed network directly, because it applies to networks with continuous parameters.
We extend our formulation to the nearest convex surrogate by approximating all permutation matrices in equation (13) with doubly stochastic matrices. A doubly stochastic matrix $P$ is an $n \times n$ matrix of nonnegative real numbers, each of whose rows and columns sums to $1$, i.e.:

$P \mathbf{1} = \mathbf{1}, \quad P^\top \mathbf{1} = \mathbf{1}, \quad P \geq 0,$

where $\mathbf{1}$ is an $n$-dimensional column vector of ones.
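One standard way to push a positive matrix toward the doubly stochastic set is Sinkhorn-Knopp normalization; the following NumPy sketch illustrates the constraint (this projection scheme is our illustration, not necessarily the paper's):

```python
import numpy as np

def sinkhorn(M, n_iter=200):
    """Alternately normalize rows and columns (Sinkhorn-Knopp) to push
    a positive matrix toward the doubly stochastic set. This is one
    standard projection scheme; the paper's exact relaxation may differ."""
    M = np.maximum(M, 1e-12)
    for _ in range(n_iter):
        M = M / M.sum(axis=1, keepdims=True)   # rows sum to one
        M = M / M.sum(axis=0, keepdims=True)   # columns sum to one
    return M

rng = np.random.default_rng(0)
P = sinkhorn(rng.random((4, 4)))
```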
According to this relaxation, we propose a new network loss function as:

$L(\Theta, \mathcal{P}) = L_{CE} + \gamma L_{ds}$   (16)

where $\Theta$ is the set of network weights, $\mathcal{P}$ is the set of doubly stochastic matrices used in the relaxation of equation (13), $L_{CE}$ is the categorical cross-entropy loss function, computed by equation (17), $L_{ds}$ is the loss function for the doubly stochastic matrices, given in equation (19), and $\gamma$ is a trade-off parameter between them.
$L_{CE}$ measures the empirical loss on the training data by summing the discrepancy between the outputs of the network and the ground truth over all $M$ modalities:

$L_{CE} = \sum_{m=1}^{M} \ell_{CE}\big( \hat{Y}^m, Y^m \big)$   (17)

where $\hat{Y}^m$ is the output of the network for modality $m$, obtained in equation (15). This output is a function of the set of network weights $\Theta$ in equations (11), (14), and (15) and of the set of doubly stochastic matrices $\mathcal{P}$ in the relaxation of equation (13).
The loss function for learning the doubly stochastic matrix between two modalities $m$ and $e$ is as follows:

$L_{ds}^{(m,e)} = \big\| P^{(m,e)} \mathbf{1} - \mathbf{1} \big\|_2^2 + \big\| (P^{(m,e)})^\top \mathbf{1} - \mathbf{1} \big\|_2^2$   (18)

Thus, the loss function considering all doubly stochastic matrices is defined as follows:

$L_{ds} = \sum_{m=1}^{M} \sum_{e \neq m} L_{ds}^{(m,e)}$   (19)
The final optimization problem also includes the following additional regularization terms:

$R = \beta_1 \sum_{W \in \Theta} \| W \|_2^2 + \beta_2 R_{bet} + \beta_3 R_{int}$   (20)

where $\beta_1$, $\beta_2$, and $\beta_3$ are regularization parameters. The first term is the norm regularization penalty on the network weights, preventing overfitting by constraining the complexity of the learned kernels of all layers. The second term is the between-modality regularization term, which restricts the output labels of all modality pairs $m$ and $e$ based on the cross-modality relations encoded in the permutation matrix, as follows:

$R_{bet} = \sum_{m=1}^{M} \sum_{e \neq m} \big\| \hat{Y}^m - P^{(m,e)} \hat{Y}^e \big\|_F^2$   (21)
The third term of equation (20) is the intra-modality regularization term, which is defined based on preserving the manifold constraints of the multimodal data according to the following equation:

$R_{int} = \sum_{m=1}^{M} \sum_{i,j} w^m_{ij} \, \big\| \hat{y}^m_i - \hat{y}^m_j \big\|_2^2$   (22)

where $w^m_{ij}$ is the weight of the edge between the $i$th and $j$th vertices.
According to the loss functions and regularization terms defined in equations (16) and (20), the final optimization problem on the objective function $J$ is obtained as follows:

$\min_{\Theta, \mathcal{P}} \; J(\Theta, \mathcal{P}) = L(\Theta, \mathcal{P}) + R, \quad \text{s.t.} \; P^{(m,e)} \geq 0 \;\; \forall \, m \neq e$   (23)

To optimize this problem under its constraints, an iterative optimization algorithm can be adopted that alternately minimizes the loss function with respect to $\Theta$ and $\mathcal{P}$.
In the first pass, $\nabla_\Theta J$, the gradient of $J$ with respect to the network weights $\Theta$, is computed while all other parameters are kept fixed. According to the stochastic gradient descent optimization method, all network kernel matrices can be updated using the following iterative equation until convergence:

$\Theta^{(t+1)} = \Theta^{(t)} - \eta \, \nabla_\Theta J$   (24)

where $\eta$ is the learning rate of the SGD method and $t$ is the iteration number.
In the second pass, $\nabla_{P} J$, the gradient of $J$ with respect to the doubly stochastic matrices, is computed while all other parameters are fixed. The permutation matrices can be updated using SGD as:

$P^{(t+1)} = P^{(t)} - \eta \, \nabla_{P} J$   (25)

while the nonnegativity constraint is maintained by thresholding the entries of $P$ at zero.
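The projected update, i.e., a gradient step followed by thresholding at zero, can be sketched as follows (a minimal NumPy illustration; the function name is ours):

```python
import numpy as np

def projected_sgd_step(P, grad, lr):
    """One SGD step on a relaxed permutation matrix, followed by
    thresholding at zero to keep all entries nonnegative."""
    return np.maximum(P - lr * grad, 0.0)

P = np.array([[0.5, 0.5], [0.4, 0.6]])
grad = np.array([[1.0, -1.0], [0.0, 2.0]])
P_new = projected_sgd_step(P, grad, lr=1.0)
```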
Practically, to optimize the above-mentioned problems, we take advantage of the Keras library (https://keras.io/api/losses/) by adding a new loss function and defining regularization terms on each layer; these penalties are summed into the loss function during optimization. The proposed MGWCN is summarized in Algorithm 1.
5. Computational complexity analysis
The computation in MGWCN consists of two parts: computing the graph wavelet bases and computing the network outputs.
Since MGWCN employs Chebyshev polynomials to approximate the graph wavelet bases, it takes advantage of their linearity, with computational complexity $O(K|E|)$, where $|E|$ is the number of edges of the graph and $K$ is the order of the Chebyshev polynomials.
Under the assumption of sparse wavelet bases, each scale s in modality can be implemented as a matrix multiplication between sparse square matrix and embedded feature matrix , which has incurred linear complexity for each layer in feature mapping phase.
The computational complexity of layer in each modality also depends on computing the cross-modality correlation. Owing to the sparsity of both the permutation matrices and the wavelet bases, computing the cross-modality correlation via matrix multiplication likewise has linear complexity.
Finally, since the matrix multiplication between the embedded feature matrix and the kernel weight matrix in each modality takes linear time, and the sigmoid function is also computed in linear time, the label of each node in the final (classification) layer can be estimated in linear time.
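As a concrete illustration of the wavelet-basis computation discussed above, the sketch below approximates a heat-kernel graph wavelet basis with a truncated Chebyshev expansion in the spirit of [16]. Dense NumPy arrays are used for clarity, whereas the paper exploits sparsity; the kernel choice exp(-s*lambda) and all parameter values are assumptions for illustration:

```python
import numpy as np

def chebyshev_wavelet_basis(L, s, K, lam_max):
    """Approximate psi_s = exp(-s * L) with a truncated Chebyshev
    expansion (Hammond et al. [16]), avoiding eigendecomposition."""
    n = L.shape[0]
    Lt = 2.0 * L / lam_max - np.eye(n)  # rescale spectrum into [-1, 1]
    # Chebyshev coefficients of g(x) = exp(-s * lam_max * (x + 1) / 2),
    # computed by Gauss-Chebyshev quadrature.
    M = 200
    theta = np.pi * (np.arange(M) + 0.5) / M
    x = np.cos(theta)
    g = np.exp(-s * lam_max * (x + 1.0) / 2.0)
    c = [2.0 / M * np.sum(g * np.cos(k * theta)) for k in range(K + 1)]
    # Recurrence: T_0 = I, T_1 = Lt, T_k = 2 Lt T_{k-1} - T_{k-2}.
    T_prev, T_curr = np.eye(n), Lt
    psi = 0.5 * c[0] * T_prev + c[1] * T_curr
    for k in range(2, K + 1):
        T_prev, T_curr = T_curr, 2.0 * Lt @ T_curr - T_prev
        psi = psi + c[k] * T_curr
    return psi

# Toy graph: a path on 4 nodes, Laplacian L = D - A.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0],
              [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
lam_max = float(np.max(np.linalg.eigvalsh(L)))
psi = chebyshev_wavelet_basis(L, s=0.5, K=20, lam_max=lam_max)
```

Because the recurrence only multiplies by the (sparse) rescaled Laplacian, each term costs time proportional to the number of edges, which is the linearity advantage noted above.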
IV-C Other versions of MGWCN
1. Graph wavelet convolutional network (GWCN)
Since most explicit graph-based data are unimodal, we simplify the proposed MGWCN for unimodal tasks. GWCN, the unimodal version of MGWCN, consists of the first layers integrated with the final classification layer (without cross-modality correlations).
2. Multi-view graph wavelet convolutional network (MVGWCN)
Many multimodal problems benefit from prior knowledge of full or partial correspondence information among modalities. Since these problems, sometimes called multi-view problems, are special cases of multimodal problems, we redesign the proposed MGWCN to cope with multi-view data, calling the result MVGWCN. In MVGWCN, the permutation matrices encode the correspondence information among modalities: if sample in modality corresponds to sample in modality , then contains a one in the corresponding entry and zeros elsewhere.
MVGWCN is trained similarly to MGWCN, while the permutation matrices are treated as non-learnable parameters.
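A minimal sketch of how such a non-learnable correspondence matrix could be assembled from known pairings; the function name and the toy pairs are illustrative, not the paper's code:

```python
import numpy as np

def correspondence_matrix(pairs, n_m, n_mp):
    """Build the fixed permutation matrix used by MVGWCN: a one at
    entry (i, j) when sample i of one modality corresponds to sample j
    of the other, and zeros elsewhere. Partial correspondence simply
    leaves the unmatched rows/columns all-zero."""
    P = np.zeros((n_m, n_mp))
    for i, j in pairs:
        P[i, j] = 1.0
    return P

# Hypothetical example: 4 samples per view, 3 known correspondences.
P = correspondence_matrix([(0, 2), (1, 0), (3, 1)], 4, 4)
```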
V Experiments
We investigate the effectiveness of the proposed network through two types of experiments: on unimodal explicit graph-based data and on multimodal implicit graph-based data.
The first experiment examines the efficiency of the proposed network on inherently graph-based data, namely citation datasets, for semi-supervised node classification tasks. The purpose of the second experiment is to evaluate the effectiveness of the proposed method on multimodal data that are implicitly represented as graphs, compared with state-of-the-art semi-supervised multimodal methods. A public implementation built with the open-source GNN library Spektral [33] (TensorFlow/Keras) is available at https://github.com/maysambehmanesh/MGWCN.
V-A Evaluation on unimodal explicit graph-based data
To evaluate the performance of MGWCN on explicit graph-based data, we focus on semi-supervised node classification on popular citation datasets. Since these datasets are unimodal, we apply GWCN, the unimodal version of MGWCN.
Specifications of the three benchmark datasets used in the experiments, including the numbers of nodes, edges, node features, and classes, are reported in Table I.
Dataset  Nodes  Edges  Features  Classes 

Cora  2708  5429  1433  7 
Citeseer  3327  9228  3703  6 
Pubmed  19717  88648  500  3 
We compare the efficiency of GWCN with the most popular state-of-the-art graph convolutional networks. These baselines include ChebyNet [11], GCN [20], CayleyNets [21], GNN-ARMA [6], and GWNN [37] as spectral methods, in addition to GAT [32], GraphSAGE [15], and GIN [38] as spatial methods.
In the semi-supervised setting, the features of all vertices are known, but only labels per class are given for training; in addition, and labeled nodes are used for validation and testing, respectively.
The task is to learn a network that takes the feature vectors as input and assigns a label to each vertex. The evaluation results of the learned network on the test nodes are reported as classification accuracies.
Table II reports the experimental results of GWCN on the citation datasets described in Table I and compares them with state-of-the-art methods. The accuracies of CayleyNets, GraphSAGE, and GIN are taken from their respective papers, while the other methods were implemented by the authors according to the settings mentioned above. In this table, GWCN1 denotes the GWCN model using only the first scale listed in Table III (e.g., for Cora), and GWCN2 uses both scales.
The maximum number of training epochs is set to . Training is terminated if the validation loss does not decrease for consecutive epochs. The learning rate is for all methods. To maintain the sparsity structure of the graph wavelet bases, a threshold is defined to refine the values of and , such that entries smaller than the threshold are set to zero. This parameter is set to in all experiments.
Method  Cora  Citeseer  Pubmed 

GIN  75.1±1.7  63.1±0.2  77.1±0.7 
GraphSAGE  73.7±1.8  65.9±0.9  78.5±0.6 
GAT  83.1±0.4  70.7±0.1  78.4±0.3 
ChebyNet  78.2±0.6  68.6±0.2  76.3±0.8 
GCN  82.3±0.4  71.4±0.3  79.3±0.2 
CayleyNets  81.2±1.2  67.1±2.4  75.6±3.6 
GNNARMA  81.4±0.4  68.9±0.6  76.3±0.3 
GWNN  82.6±0.3  70.4±0.9  79.4±0.7 
GWCN1  83.2±0.1  71.6±0.6  79.5±0.3 
GWCN2  83.8±0.5  72.7±0.3  80.3±0.6 
As proved in [16], for small values of the scale parameters, graph wavelet bases induce locality properties in such a way that each basis represents the neighborhood structure of a specific vertex. To avoid model complexity in this experiment, we use two scales for the graph wavelets, . The values of these scales are chosen among small values between and via grid search to ensure the locality of the convolution in the vertex domain. The values of all GWCN parameters used in the above experiments are summarized in Table III. The parameters of the other methods are chosen according to their respective papers.
Parameter  Cora  Citeseer  Pubmed 

2  2  1  
40  40  30  
According to the classification accuracies reported in Table II, GWCN provides the best accuracy among all baselines on all datasets. The second-best accuracies for Cora and Pubmed are provided by GWNN, and for Citeseer by GCN. These results demonstrate the capability of graph wavelet bases to achieve a better representation of the vertex domain compared with other spectral methods.
Furthermore, GWCN performs better than all spatial methods, reflecting the ability of spectral methods to achieve strong performance. Since GAT assigns a self-attention weight to each edge, it captures the local similarity among neighborhoods more effectively and provides better accuracy than the other spatial methods. However, computing the attention weights is inefficient for graphs with a large number of edges, and unlike GWCN, GAT cannot take advantage of the global structure of the graph.
Among the spectral baselines, GCN consistently performs well and achieves accuracy competitive with GWCN.
The key to the good performance of GCN is that, unlike CayleyNets, which uses graph Fourier bases to express the spectral graph convolution, GCN leverages the Laplacian matrix as the weighting matrix in its formulation (equation (5)); in terms of sparsity, the Laplacian matrix is sparser than the Fourier bases and is similar to the wavelet bases.
GWNN yields results closest to GWCN because it also employs wavelet bases to localize the graph convolution. Nevertheless, GWCN has two main advantages that enable it to consistently outperform GWNN: 1) aggregating features with different scaling parameter values makes GWCN flexible in locally exploring each subgraph at an appropriate scale, instead of using a single scale parameter for the whole graph; 2) due to the sparsity of the wavelet bases, the node features in GWNN become too smooth after a few convolutions, whereas the GWCN formulation includes a residual connection that re-injects the initial node features and thus avoids over-smoothing.
Table IV reports the training time and parameter complexity of GNN methods.
Method  Cora (Sec/epoch)  Cora (Parameters)  Citeseer (Sec/epoch)  Citeseer (Parameters)

ChebyNet  0.28492  46080  0.69860  118688 
GAT  0.26980  92373  0.55968  237586 
GCN  0.23140  23040  0.56030  59344 
GNNARMA  0.23029  46103  0.37047  118710 
GWNN  0.34075  23063  0.46709  59366 
GWCN1  0.37979  46103  0.54539  178054 
GWCN2  0.47666  91975  0.61975  355814 
Since in citation networks the Laplacian is sparser than the graph wavelets [37], the two graph wavelet-based methods, GWNN and GWCN, have slightly higher time complexity than the Laplacian-based methods ChebyNet, GCN, and GNN-ARMA.
Most GNN models have a moderate number of parameters, because increasing the parameter count by adding more layers raises the over-smoothing problem. Increasing the parameters of GWCN, however, does not lead to over-smoothing, for two main reasons: 1) the additional parameters correspond to multiple stacks within each layer with different scaling parameter values, not to additional layers; 2) the residual connection adopted in GWCN prevents over-smoothing within each stack. Moreover, thanks to the efficient computations enabled by the sparse wavelet bases, GWCN achieves better performance with reasonable time complexity. Therefore, unlike most GNN models, the performance of GWCN improves with additional parameters.
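The effect of the residual connection can be illustrated with a toy layer. The mixing weight `alpha` and the uniform averaging operator are assumptions for illustration, not the paper's exact formulation: without the residual term, one application of the averaging operator makes all node features identical, while the residual version retains node-specific information even after many layers:

```python
import numpy as np

def gwcn_layer(psi, X, W, alpha=0.5):
    """One toy graph convolution step with a residual connection:
    the incoming features X are re-injected so that repeated smoothing
    by the basis psi does not wash them out."""
    return alpha * np.maximum(psi @ X @ W, 0.0) + (1.0 - alpha) * X

# Uniform averaging operator over 3 nodes (fully smooths in one step).
psi = np.full((3, 3), 1.0 / 3.0)
X0 = np.eye(3)  # distinct one-hot features per node

H = X0
for _ in range(10):  # stack many layers
    H = gwcn_layer(psi, H, np.eye(3))
```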
V-B Evaluation on multimodal implicit graph-based data
The primary purpose of the proposed method is to generalize graph convolutional networks to multimodal problems. This experiment evaluates the effectiveness of MGWCN on multimodal data.
To evaluate the performance of the proposed method, we apply MGWCN to the classification of multimodal implicit graph-based data on two categories of benchmark problems: multimodal and multi-view datasets.
These categories consist of two multimodal datasets, Caltech and NUS, and three multi-view datasets, Caltech101-7, Caltech101-20, and MNIST, as introduced in [5]. In these datasets, each modality is implicitly defined as a graph. In the first experiment, we evaluate MGWCN on the multimodal datasets in a more practical scenario, without any predefined knowledge shared among modalities. To ensure this property, we randomly shuffle the order of the data samples in each modality, which makes the pointwise correspondences between the original modalities unknown.
We compare the efficiency of MGWCN with state-of-the-art multimodal data modeling methods, including CD (pos) [12], CD (pos+neg) [12], SCSMM [28], two variants of mLSJD [5], and MCPCu [4].
The first experiment runs MGWCN on the multimodal datasets with the parameters in Table V. In this experiment, the data in each modality are partitioned into training, validation, and test sets. The maximum number of training epochs is , and training is terminated if the validation loss does not decrease for consecutive epochs. Experimental results over random data splits are reported as mean accuracy and standard deviation. The parameters of the other methods are chosen according to their respective papers.
Parameter  Caltech  NUS 

2  2  
2000  1000  
100  100 
Table VI reports the performance of MGWCN in terms of the mean and standard deviation of accuracy and compares it with the baselines. Among the compared methods, MCPCu has the experimental setting most similar to MGWCN because it likewise lacks prior correspondence knowledge. Although the other methods have been developed for multimodal datasets, they still depend, to varying degrees, on prior knowledge of the correspondences among modalities. In the respective papers of CD (pos), CD (pos+neg), and the mLSJD variants, the rate of dependency is defined as the ratio of the number of given corresponding samples to the total number of samples (in percent). To approach a fair comparison, we set the correspondence ratio as low as possible () in all of these methods.
Method  Caltech  NUS 

CD (pos)  76.9±1.1  81.8±0.8 
CD (pos+neg)  73.4±0.8  80.3±0.4 
SCSMM    83.9±2.42 
mLSJD  84.1±1.4  83.2±1.3 
mLSJD  88.5±1.6  87.2±1.1 
MCPCu  84.8±0.7  86.4±0.3 
MGWCN  90.6±0.4  89.2±0.8 
According to the classification accuracies reported in Table VI, MGWCN achieves a significant improvement and consistently outperforms the other methods. These results indicate the superiority of the proposed graph convolutional network in effectively finding the cross-modal correlations in the absence of given coupling/decoupling information between modalities.
MGWCN is able to represent the feature vectors of each modality based on the correlated wavelet bases of the other modalities. This capability enables it not only to be applied successfully to multimodal graph-based data, but also to be used efficiently without any initial correspondence information.
Since in many multimodal methods the correspondences among modalities are predetermined, we further evaluate the performance of the proposed network on the multi-view datasets. In the second experiment, we therefore apply MVGWCN, the particular version of MGWCN described above, to the multi-view graph-based datasets with the parameters listed in Table VII. The other settings are the same as in the first experiment.
Parameter  Caltech101-7  Caltech101-20  MNIST

2  2  2  
40  30  35  
To evaluate the performance of the proposed method on multi-view datasets, we compare the efficiency of MVGWCN with existing methods evaluated on these datasets. The state-of-the-art methods used for comparison include MLDA [9], MLDAm [9], MULDA [9], MULDAm [9], MvMDA [31], OGMA [36], OMLDA [36], OMvMDA [36], and MCPCp [4].
Table VIII reports the classification accuracy of MVGWCN compared with existing multi-view methods on the three multi-view datasets. In this table, MVGWCN1 denotes the proposed method using only the first scale listed in Table VII; similarly, MVGWCN2 is based on both scales.
As can be seen from Table VIII, MVGWCN consistently outperforms the other methods. Although the proposed method achieves the best accuracy on all multi-view datasets, the differences in accuracy are not as significant as on the multimodal datasets. The main reason is that on multi-view datasets the correspondence knowledge is predetermined, so corresponding samples can be fused into a joint latent space to boost classification, which improves accuracy without the need to explore the cross-modality correlations. When this prior knowledge is lacking and the cross-modality correlations must be explored to discover the correspondences among modalities, the superiority of the proposed method becomes more pronounced.
Method  Caltech101-7  Caltech101-20  MNIST

MLDA  92.29±8e-3  76.59±12e-3  92.84±5e-3
MLDAm  89.78±10e-3  73.77±114e-3  93.09±8e-3
MULDA  92.65±8e-3  82.20±11e-3  95.23±5e-3
MULDAm  92.59±10e-3  82.17±6e-3  95.12±4e-3
MvMDA  92.65±8e-3  80.50±13e-3  93.78±9e-3
OGMA  95.01±5e-3  86.00±10e-3  96.09±6e-3
OMLDA  94.98±5e-3  86.85±10e-3  95.71±6e-3
OMvMDA  94.71±7e-3  82.28±10e-3  95.99±6e-3
MCPCp  94.83±1.1  86.44±1.1   
MVGWCN1  95.25±1.3  87.83±0.8  96.45±1.4
MVGWCN2  96.23±0.7  88.46±1.1  97.21±0.7
In the last experiment, we develop new network architectures based on several conventional graph convolutional networks for application to multi-view datasets. The architecture of these extended networks, named with the prefix 'M' in Table IX, is simple: similar to MVGWCN, each modality is represented through K layers using a specific graph convolutional network, and the embedded features of all modalities are then fused into a joint layer to obtain a better representation based on all modalities.
In this way, we evaluate the ability of various graph convolutional networks in both feature mapping within each modality and cross-modal feature fusion, and compare these networks with MVGWCN, which takes advantage of the graph wavelet bases.
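The fusion architecture described above can be sketched minimally as follows. Concatenation is chosen here as an assumed joint-layer fusion, random arrays stand in for the per-modality embeddings produced by the K graph-convolutional layers, and all shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # Numerically stable row-wise softmax for the classification head.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical per-modality embeddings from any GNN backbone
# (ChebyNet, GAT, GCN, ...): 5 corresponding samples, 8-dim features.
H1 = rng.normal(size=(5, 8))  # modality / view 1
H2 = rng.normal(size=(5, 8))  # modality / view 2

# Joint layer: fuse the embedded features of all modalities, then
# classify from the fused representation (3 hypothetical classes).
H = np.concatenate([H1, H2], axis=1)
W_out = rng.normal(size=(16, 3))
Y = softmax(H @ W_out)
```

The design point is that each backbone only has to produce per-modality embeddings; the joint layer is shared, so swapping the backbone (as done for the 'M'-prefixed variants) requires no change to the fusion stage.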
The classification results of the various graph convolutional networks on the multi-view datasets are shown in Table IX. The results demonstrate the superiority of applying wavelet bases within GNNs under the same modality fusion idea, confirming the effectiveness of wavelet bases in simultaneously representing each modality and exploiting the correlation among all modalities.
Method  Caltech101-7  Caltech101-20  MNIST

MChebyNet  91.49±0.3  85.53±0.4  95.50±0.8
MGAT  93.59±0.4  82.59±0.5  96.00±0.4
MGIN  92.51±0.7  84.27±0.3  95.25±0.6
MGCN  91.15±0.3  82.80±0.6  94.24±0.3
MGNN-ARMA  94.89±1.1  86.37±0.5  95.75±0.3
MVGWCN1  95.25±1.3  87.83±0.8  96.45±1.4
VI Conclusions
Extending convolutional neural networks to multimodal geometrically structured and/or graph-based data is an important problem that, to the best of the authors' knowledge, has rarely been addressed. This paper introduced a novel graph convolutional neural network based on wavelet bases to learn representations of multimodal graph-based data in the spectral domain. Compared with the Fourier bases used in spectral GNNs, wavelet bases represent the feature vectors in each modality more effectively by exploiting a scaling parameter. Moreover, because the Chebyshev polynomials approximate the wavelet bases without requiring a costly eigendecomposition, and the resulting wavelet bases are sparse, the computations are performed efficiently.
The proposed MGWCN simultaneously provides intra-modal localization by applying multi-scale graph wavelet convolution and, as an alignment stage, estimates the cross-modal correlations between modalities. We also introduced two particular versions of the proposed network for unimodal and multi-view tasks.
The proposed network was evaluated on both unimodal explicit graph-based datasets and multimodal implicit graph-based data. Extensive experiments demonstrated that MGWCN outperforms state-of-the-art GNNs in both unimodal and multimodal cases, without any prior knowledge.
To the best of our knowledge, MGWCN is the first graph convolutional network developed for multimodal graph-based data in practical scenarios.
Given the efficiency, generality, and flexibility of spatial methods, analyzing multimodal graph-based data using a spatial network would be valuable future work. Since multimodal data have a wide range of applications, developing multimodal graph-based networks for other tasks, including cross-modal retrieval, multimodal clustering, and domain adaptation, is also a promising direction.
References
 [1] (incomplete entry)
 [2] (2013) Mathematical Methods for Physicists (Seventh Edition), Chapter 14: Bessel Functions. Academic Press, Boston.
 [3] (2019) Multimodal Machine Learning: A Survey and Taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence 41(2).
 [4] (2021) Cross-modal and multimodal data analysis based on functional mapping of spectral descriptors and manifold regularization. arXiv:2105.05631.
 [5] (2021) Geometric multimodal learning based on local signal expansion for joint diagonalization. IEEE Transactions on Signal Processing 69, pp. 1271–1286.
 [6] (2021) Graph Neural Networks with Convolutional ARMA Filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–12.
 [7] (2017) Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine 34(4), pp. 18–42.
 [8] (2014) Spectral networks and locally connected networks on graphs. In International Conference on Learning Representations (ICLR 2014).
 [9] (2018) Generalized Multi-View Embedding for Visual Recognition and Cross-Modal Retrieval. IEEE Transactions on Cybernetics 48(9).
 [10] (2001) Lectures on Spectral Graph Theory. American Mathematical Society.
 [11] (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems.
 [12] (2015) Multimodal Manifold Analysis by Simultaneous Diagonalization of Laplacians. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(12).
 [13] (2020) Multimodal graph neural network for joint reasoning on vision and scene text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
 [14] (2017) Neural message passing for quantum chemistry. In 34th International Conference on Machine Learning (ICML 2017), Vol. 3.
 [15] (2017) Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems.
 [16] (2011) Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis 30(2).
 [17] (2020) Semi-Supervised Multimodality Learning with Graph Convolutional Neural Networks for Disease Diagnosis. In Proceedings of the International Conference on Image Processing (ICIP).
 [18] (2017) Autoregressive Moving Average Graph Filtering. IEEE Transactions on Signal Processing 65(2).
 [19] (2019) Alternating diffusion maps for multimodal data fusion. Information Fusion 45.
 [20] (2017) Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations (ICLR 2017).
 [21] (2019) CayleyNets: Graph Convolutional Neural Networks with Complex Rational Spectral Filters. IEEE Transactions on Signal Processing 67(1).
 [22] (2021) Asymmetric Gaussian process multi-view learning for visual classification. Information Fusion 65, pp. 108–118.
 [23] (2021) An efficient dictionary-based multi-view learning method. Information Sciences 576, pp. 157–172.
 [24] (2020) Analyzing unaligned multimodal sequence via graph convolution and graph pooling fusion. arXiv:2011.13572.
 [25] (2009) A Wavelet Tour of Signal Processing. Academic Press, Boston.
 [26] (2009) Neural network for graphs: A contextual constructive approach. IEEE Transactions on Neural Networks 20(3).
 [27] (2016) A unifying framework in vector-valued reproducing kernel Hilbert spaces for manifold regularization and co-regularized multi-view learning. Journal of Machine Learning Research 17.
 [28] (2021) Semi-supervised charting for spectral multimodal manifold learning and alignment. Pattern Recognition 111.
 [29] (1997) The Laplacian on a Riemannian Manifold: An Introduction to Analysis on Manifolds. London Mathematical Society Student Texts, Cambridge University Press.
 [30] (2009) The graph neural network model. IEEE Transactions on Neural Networks 20(1).
 [31] (2016) Multi-view Uncorrelated Discriminant Analysis. IEEE Transactions on Cybernetics 46(12).
 [32] (2018) Graph attention networks. In 6th International Conference on Learning Representations (ICLR 2018).
 [33] (2020) Orthogonal Multi-view Analysis by Successive Approximations via Eigenvectors. arXiv:2010.01632.
 [34] (2019) MMGCN: Multi-modal graph convolution network for personalized recommendation of micro-video. In Proceedings of the 27th ACM International Conference on Multimedia (MM 2019).
 [35] (2021) A Comprehensive Survey on Graph Neural Networks. IEEE Transactions on Neural Networks and Learning Systems 32(1).
 [36] (2020) Orthogonal multi-view analysis by successive approximations via eigenvectors. arXiv:2010.01632.
 [37] (2019) Graph wavelet neural network. In 7th International Conference on Learning Representations (ICLR 2019).
 [38] (2019) How powerful are graph neural networks? In 7th International Conference on Learning Representations (ICLR 2019).
 [39] (2018) Multi-view manifold learning with locality alignment. Pattern Recognition 78.
 [40] (2019) Multimodal image registration using Laplacian commutators. Information Fusion 49.