I Introduction
Network mining is the basis for many network analysis tasks, such as classification and link prediction. The dimensionality of traditional node representations is proportional to the network scale, which requires a large amount of storage and computation resources for network analysis tasks. Thus, it is necessary to learn low-dimensional representations of nodes to capture and preserve the network features. Network embedding, also known as network representation learning, is a way of learning low-dimensional representations and preserving useful features to support subsequent network analysis tasks.
Existing network embedding methods [1, 2, 3] that emphasize preserving network structural features have achieved promising performance in several network analysis tasks. However, nodes in real-world networks have rich attribute information beyond the structural details, such as text information in citation networks and user profiles in social networks. Attribute features are essential to network analysis applications [4, 5], and it is insufficient to learn network representations based only on preserving structural features. Node attributes carry semantic information that largely alleviates the link sparsity problem and supplements incomplete structure information. The strong correlations between structures and attributes enable them to be integrated to learn network representations according to the principles of homophily [6] and social influence theory [7]. Therefore, we integrate the topological structures and node attributes to perform network embedding and preserve both the structural and attribute features of the network. Differing from task-oriented network embedding methods that learn network representations for a specific task, we aim to learn network representations that apply generally to various advanced network analysis tasks. We face three challenges: (1) Both the underlying network structures [8] and the complex interactions between attributes and structures [9] are highly nonlinear, so designing a model to capture these nonlinear relationships is difficult. (2) The structures and attributes are information from different sources, and sparsity and noise make it difficult to find direct correlations between the originally observed information. Modeling the correlations between network structures and attribute information is a tough problem. (3) Nodes with coherent links and similar attributes in the original network have strong proximity, and they are supposed to be close to each other in the embedding space as well.
Thus, mapping the proximity of nodes from both structure and attribute perspectives to the embedding space is critically important.
To address the above challenges, a Multimodal Deep Network Embedding method named MDNE is proposed in this paper. Most existing shallow models have limited ability to represent complex nonlinear relationships [10]. A deep model comprising multiple layers of nonlinear functions, with each layer capturing the nonlinear relationships of the units in the layer below, is able to extract the nonlinear relationships of data progressively during training [11]. Moreover, deep learning has been demonstrated to have powerful nonlinear representation and generalization ability [10]. To capture the highly nonlinear network structures and the complex interactions between structures and attributes, a deep model comprising multiple layers of nonlinear functions is proposed to learn compact representations of nodes. The original structure and attribute information, represented by the adjacency matrix and attribute matrix, respectively, are usually sparse and noisy, making it difficult for the deep model to extract the correlations between them directly. In this paper, a multimodal learning method [12] is adopted to preprocess the structure and attribute information to obtain their high-order features. High-order features are condensed and less noisy, so concatenating the two high-order features helps the deep model extract the high-order correlations between the network structures and node attributes. To ensure the obtained representations preserve both the structural and attribute features of the original network, we use the structural proximity and attribute proximity to define the loss function of the new model. We preserve the structural features by taking advantage of the first-order proximity and second-order proximity, which capture the local and global network structures [13]. The attribute proximity, which indicates the similarity of node attributes, is also utilized in the learning process to preserve the attribute features of the network.
Thus, the learned representations preserve both the structural and attribute features of nodes in the embedding space.
To evaluate the effectiveness and generality of the proposed method in a variety of scenarios, we conduct experiments to analyze the network representations obtained by different network embedding methods on four real-world network datasets in three analysis tasks: link prediction, attribute prediction, and classification. The results show that the network representations obtained by MDNE offer better performance on different tasks compared to other methods. This demonstrates that the proposed method effectively preserves the topological structure and attribute features of nodes in the embedding space, which improves the performance on diverse network analysis tasks.
The rest of the paper is organized as follows. Section 2 discusses the related works. The proposed method MDNE is described in detail in Section 3. Experimental results of different network analysis tasks on various real-world datasets are presented in Section 4. Finally, Section 5 concludes the paper.
II Related Works
The early works on network embedding are related to graph embedding [14, 15], which aims to embed an affinity graph into a low-dimensional vector space. The affinity graph is obtained by calculating the proximity between the feature vectors of nodes. Recent network embedding aims to embed naturally formed networks into a low-dimensional space, such as social networks, citation networks, etc. Most of the existing works [16, 17, 18] focused on reducing the dimensions of structure information while preserving the structural features of nodes. GraRep [3] defined different loss functions to preserve high-order proximities among nodes and optimized each model by matrix factorization techniques. The final representations of nodes combined the representations learned from the different models. M-NMF [19] proposed a novel modularized nonnegative matrix factorization model to incorporate the community structure into network embedding. The above shallow models have been applied in various network analysis tasks, but have limited ability to represent the highly nonlinear structure of networks. Thus, techniques with deep models were introduced to deal with the problem. LINE [20] designed an objective function based on the first-order proximity and second-order proximity and adopted a negative sampling approach to minimize it, obtaining low-dimensional representations that preserve the local and global structure of the network. DeepWalk [1] utilized random walks in the network to sample the neighbors of nodes. By regarding the generated paths as sentences, it adopted Skip-Gram, a general word representation learning model, to learn the node representations. Node2vec [2] modified the way of generating node sequences and proposed a flexible notion of a node's neighborhood. A biased random walk procedure was designed to explore diverse neighborhoods. SDNE [13] designed a clear objective function to preserve the first-order proximity and second-order proximity of nodes and mapped the network into a highly nonlinear latent space through an autoencoder-based model.
Besides structure information, recently obtained network datasets often carry a large amount of attribute information. However, it is difficult for pure structure-based methods to compress attribute information and obtain representations combining the structure and attribute information. Therefore, efforts have been made to jointly exploit structure and attribute information in network embedding, and representations integrating structure and attribute information have been demonstrated to improve the performance of network analysis tasks [4, 5, 21]. TADW [22] proved DeepWalk to be equivalent to matrix factorization and incorporated text features into network representation learning under the framework of matrix factorization. It can only handle text attributes. AANE [23] modeled and incorporated node attribute proximity into network embedding in a distributed way. The above matrix factorization methods did not preserve the attribute features directly, but performed the learning based on an attribute affinity matrix calculated by a specific affinity metric, which limited the attribute feature preservation ability of the obtained representations. UPPSNE [24] learned joint embedding representations by performing a nonlinear mapping on user profiles guided by the network structure. It mainly dealt with user profile information. TriDNR [25] separately learned embeddings from a coupled neural network architecture and linearly combined them in an iterative way. It lacked sufficient knowledge interactions between the two separate models. ASNE [26] proposed a multi-layer perceptron framework to integrate the structural and attribute features of nodes. It preserved the structural proximity and attribute proximity by maximizing a likelihood function defined based on random walks. Its model lacked a nonlinear preprocessing of the structure and attribute information, which could facilitate extracting the high-order correlations between attribute and structural features in the later learning. In this paper, a multimodal learning method is adopted to preprocess the original data.
Multimodal learning methods, which have attracted considerable research interest, aim to project data from multiple modalities into a latent space. The classical methods CCA, PLS, BLM and their variants [27, 28, 29] were widely applied in earlier work. Recent decades have seen the great power of deep learning methods to generate integrated representations for multimodal data. [12] proposed an autoencoder-based method to learn features over multiple modalities (video and audio) and achieved good results in speech recognition. [30] proposed a Deep Belief Network (DBN) architecture for learning a joint representation of multimodal data, which made it possible to create representations even when some data modalities are missing. The multimodal Deep Boltzmann Machine (DBM) model proposed in [31] fused modalities (image and tag) together and extracted unified representations useful for classification and information retrieval tasks. [32] learned consistent representations for two modalities and facilitated the cross-matching problem. [33] proposed a cross-modal hashing method to learn unified binary representations for multimodal data. Following these successful works, we introduce a multimodal learning method into network embedding. The structure and attribute information of the network are regarded as different modalities. An autoencoder-based multimodal model [12] is adopted to preprocess the bimodal data and form high-order features, which facilitate learning the fused representations. There are also methods that learn network representations for specific applications. PinSage [34] combined the recent Graph Convolutional Network (GCN) algorithm [35, 36] with efficient random walks to generate representations applied in web-scale recommender systems. In contrast, our MDNE learns integrated representations that apply generally to various network analysis tasks.
III Multimodal Deep Network Embedding Method
III-A Problem Definition
An attributed network is defined as G = (V, E, A), where V represents the set of nodes, E represents the set of edges, and A represents the attribute matrix. Edge information is represented by the adjacency matrix S.
The adjacency vector s_i and attribute vector a_i of node v_i represent its structure and attribute information, respectively. Thus, the goal of network embedding is to compress the two vectors into one low-dimensional vector while preserving the structure and attribute features in the low-dimensional (embedding) space.
The first-order proximity and second-order proximity capture the local and global network structural features, respectively [13].
Definition 1 (First-Order Structural Proximity).
The first-order proximity describes the local pairwise proximity between two nodes. For each pair of nodes v_i and v_j, the edge weight s_ij indicates the first-order proximity between v_i and v_j.
Definition 2 (Second-Order Structural Proximity).
The second-order proximity between a pair of nodes in a network describes the similarity between their neighborhood structures, which are represented by their adjacency vectors.
The first-order proximity and second-order proximity jointly compose the structural proximity between nodes. The attribute proximity captures the attribute features of nodes.
Definition 3 (Attribute Proximity).
The attribute proximity between a pair of nodes describes the proximity of their attribute information. It is determined by the similarity between their attribute vectors, i.e., a_i and a_j.
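As a toy illustration of the three proximities defined above, the sketch below computes them for a small invented network. Cosine similarity is one assumed choice of vector-similarity metric (the definitions leave the metric open), and the matrices S and A are made-up examples:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity; one assumed choice of vector-similarity metric."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

# Toy 4-node network: adjacency matrix S and binary attribute matrix A
# (all values are illustrative).
S = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A = np.array([[1, 0, 1],
              [1, 0, 1],
              [0, 1, 0],
              [0, 1, 1]], dtype=float)

first_order = S[0, 1]                # edge weight between nodes 0 and 1
second_order = cosine(S[0], S[1])    # similarity of neighborhood structures
attribute_prox = cosine(A[0], A[1])  # similarity of attribute vectors
print(first_order, second_order, attribute_prox)
```

Here nodes 0 and 1 are directly linked (first-order), share one common neighbor (second-order), and have identical attribute vectors (attribute proximity).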
The attribute proximity and structural proximity between nodes are the basis of many network analysis tasks. For example, community detection on social networks clusters nodes based on the structural proximity and attribute proximity [37]. In recommendation on citation networks, papers having strong structural and attribute proximity to a given manuscript are most likely to be its reference papers [38]. In user alignment across social networks, users are aligned based on their structural and attribute proximity on each network [39]. These applications benefit from utilizing both structural proximity and attribute proximity, which leads us to investigate the problem of learning low-dimensional representations of the network while preserving the two proximities. The problem is defined as follows.
Definition 4 (Attributed Network Embedding).
Given an attributed network G = (V, E, A) with n nodes and m attributes, attributed network embedding aims to learn a mapping function f: v_i → y_i, where y_i is a d-dimensional vector with d ≪ n. The objective of the function is to make the similarity between y_i and y_j explicitly preserve the attribute proximity and structural proximity of v_i and v_j.
III-B Framework
In order to address the attributed network embedding problem, a Multimodal Deep Network Embedding (MDNE) method is proposed. Figure 1 shows the MDNE framework. The parameters marked with a hat (^) are parameters of the reconstruction component. Table 1 lists the terms and notations. Note that the attribute preprocessing layer and the structure preprocessing layer have different weight matrices, W_A^(1) and W_S^(1), respectively. For simplicity, we denote W_A^(1) and W_S^(1) jointly as W^(1).
The strong interactions and complex dependencies between nodes in real-world networks result in the high nonlinearity of the network structures. The interactions between structure and attribute features are nonlinear as well. Deep neural networks have demonstrably strong representation and generalization abilities for such nonlinear relationships [40]. Therefore, the proposed model is built on a deep autoencoder, one of the most common deep neural network architectures. The autoencoder is an unsupervised learning model that performs well in data dimensionality reduction and feature extraction
[11]. An autoencoder consists of two parts, the encoder and the decoder. The encoder consists of one or multiple layers of nonlinear functions that map the input data into the representation space to obtain its feature vector; the decoder reconstructs the original input from the representation space by an inverse process. A shallow autoencoder has three layers (input, encoding and output), where the encoder has only one layer of nonlinear functions. The deep autoencoder of our implementation has more hidden layers and is able to learn higher-order features of the data. Given the input data vector x_i, the output feature vector of each layer is

y_i^(k) = σ(W^(k) y_i^(k-1) + b^(k)), k = 1, ..., K, with y_i^(0) = x_i,

where σ(·) denotes the nonlinear activation function that brings nonlinearity into the model. The activation functions must be chosen according to the loss function [41], the requirements of the applied representations, and the datasets. In practice, we can choose them based on their test performance. In this work, the sigmoid function σ(x) = 1/(1 + e^(-x)) is adopted as it provided the best performance in the experiments. (Regarding the choice of activation function, we have tried sigmoid, the Rectified Linear Unit (ReLU), the Scaled Exponential Linear Unit (SELU), and the hyperbolic tangent (tanh); empirically, the sigmoid function leads to the best performance in general.) After obtaining the mid-layer representation, i.e., the encoding result y_i^(K), we can obtain the decoding result through an inverse calculation process. The autoencoder optimizes its parameters by minimizing the reconstruction error between the input data and the reconstructed data. A typical loss function is the mean squared error (MSE). To alleviate the noise and redundant information in the input feature vectors, an undercomplete autoencoder is adopted to learn compact low-dimensional representations. The undercomplete autoencoder has a tower structure, with each upper layer having fewer neurons than the layer below it. The smaller number of neurons restricts the dimensionality of the learned features, so the autoencoder is forced to learn more abstract features of the data during training [41]. A layer-by-layer pretraining algorithm, such as one based on the Restricted Boltzmann Machine (RBM), enables each upper layer of the encoder to capture the high-order correlations between the feature units in the lower layer, which is an efficient way to extract nonlinear structures progressively [11]. Thus, the tower structure with stacked multiple layers of nonlinear functions is able to map the data into a compressed latent space and capture the highly nonlinear structures of the network, along with the complex interactions between structures and attributes, during training. The basic undercomplete autoencoder is chosen in our framework because of its generality and simplicity. Variants of the autoencoder can replace the basic autoencoder with slight modifications to accommodate specific scenarios, such as the denoising autoencoder, the contractive autoencoder, etc.
[41].

Table 1: Terms and notations.

Symbol | Definition
K | Number of layers of the encoder/decoder
W^(k) | Weight matrix of the k-th layer
b^(k) | Biases of the k-th layer
y^(k) | Representations of the k-th layer
An intuitive way to integrate both structure and attribute information in the representations is to concatenate the two feature vectors separately learned from the two modalities. However, learning the individual modalities separately is limited in its ability to extract the correlations between structures and attributes. Alternatively, the two kinds of information can be concatenated first at the input, and the integrated representations are learned by a unified model. The inputs of the unified model are the adjacency vectors describing the network structure and the attribute vectors describing the node attributes. Since the adjacency vectors and attribute vectors of nodes are sparse and noisy, inputting the concatenated adjacency vector and attribute vector to the deep autoencoder directly, as shown in Figure 2(a), increases the difficulty of training the model to capture the correlations between structure and attribute information. We have also found that, in practice, learning in this way results in hidden units that have strong connections to either structure or attribute variables, but few units that connect across the two modalities [12].
To enable the deep model to better capture the correlations between structure and attribute information, a multimodal learning method is introduced into the proposed model. The autoencoder-based multimodal learning model [12] is adopted to preprocess the original structure and attribute data. The preprocessing reduces the dimensionality of the data from the different modalities, removing noise and redundant information to obtain compact high-order features. The correlations across modalities are stronger between their high-order features. As shown in Figure 2(b), the structure information (adjacency vector) and attribute information (attribute vector) are input separately to one-layer neural networks serving as preprocessing layers. The use of a pretraining algorithm, such as a single-layer RBM, enables each preprocessing layer to extract high-order features of its modality. Then, the structure and attribute feature vectors are concatenated and input to the deep autoencoder for further learning. The high-order correlations between structure and attribute are more easily learned by the deep autoencoder from the high-order features obtained by the preprocessing layers. With the subsequent fine-tuning algorithm, the deep autoencoder provides a unified framework to integrate structure and attribute information.
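The preprocess-then-concatenate pipeline can be sketched as follows. All dimensions and weights are illustrative stand-ins: W_S and W_A play the role of the modality-specific preprocessing weights, and W1 the first shared layer of the deep autoencoder:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n, m = 6, 5                        # illustrative sizes: 6 nodes, 5 attributes
s = rng.random(n)                  # adjacency vector of one node
a = rng.random(m)                  # attribute vector of the same node

# Modality-specific one-layer preprocessing (separate weight matrices for
# structure and attributes; dimensions are illustrative).
W_S, b_S = rng.normal(0, 0.1, (4, n)), np.zeros(4)
W_A, b_A = rng.normal(0, 0.1, (3, m)), np.zeros(3)

h_s = sigmoid(W_S @ s + b_S)       # high-order structure features
h_a = sigmoid(W_A @ a + b_A)       # high-order attribute features

# Concatenate the two high-order feature vectors and feed them to the
# shared deep autoencoder (only its first layer is shown here).
h = np.concatenate([h_s, h_a])
W1, b1 = rng.normal(0, 0.1, (4, h.size)), np.zeros(4)
y = sigmoid(W1 @ h + b1)           # fused representation
print(h.shape, y.shape)
```

In the full model, further encoder layers would follow W1, and a mirrored decoder would reconstruct both modalities.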
The training goal of the model is to preserve the structural and attribute features in the embedding space. The structural and attribute features are captured by the structural and attribute proximities, respectively. Thus, the model's loss function is defined based on the two proximities, as detailed in the next subsection. By fine-tuning the model based on the optimization of the loss function, the obtained representations preserve both the structural and attribute features of the original network.
In comparison with SDNE [13], which adopts a basic autoencoder to directly reconstruct the input structure information, the proposed MDNE preprocesses the original adjacency matrix and attribute matrix using a multimodal learning method, and concatenates the resulting high-order structure and attribute features as input to the deep model. The loss function is defined based on the structural and attribute proximities to preserve the structural and attribute features of nodes in the embedding space.
III-C Loss Functions
The structural proximity includes the first-order proximity describing the local network structure and the second-order proximity describing the global network structure [13]. Both are included in the loss function so as to preserve the local and global structural features in the low-dimensional embedding space. With the first-order proximity indicating the proximity between directly connected nodes, a corresponding loss function is defined to guarantee that connected nodes with larger edge weight have a shorter distance in the embedding space, i.e.,

L_1st = sum_{i,j=1}^{n} s_ij ||y_i - y_j||_2^2    (1)
Minimizing L_1st forces the model to preserve the first-order proximity in the embedding space. The second-order proximity represents the similarity of the neighborhood structures of nodes. The neighborhood structure of each node can be described by its adjacency vector. Thus, the second-order proximity between two nodes is determined by the similarity of their adjacency vectors, and the goal of the corresponding loss function is to guarantee that nodes with similar adjacency vectors have a short distance in the embedding space. Minimizing the reconstruction error of the input data amounts to maximizing the mutual information between the input data and the learnt representations [42]. Intuitively, if the representation allows a good reconstruction of the input data, it has retained much of the information that was present in the input. That is, the MSE-based loss function prompts the basic autoencoder to latently preserve the similarity between input vectors in the embedding space during training. Since the adjacency vector describes the neighborhood structure of each node, minimizing the reconstruction error of the adjacency vectors preserves the similarity of neighborhood structures (i.e., the second-order proximity) between nodes in the embedding space. Thus, the loss function based on the second-order proximity is as follows:

L_2nd = sum_{i=1}^{n} ||(ŝ_i - s_i) ⊙ b_i||_2^2    (2)

where ⊙ denotes the Hadamard product, and b_i = {b_ij} are the penalty parameters for nonzero adjacency elements. If s_ij = 0, then b_ij = 1; else b_ij = β > 1. Increasing the penalty for the reconstruction error of nonzero elements counteracts the reconstruction process's tendency to reconstruct zero elements, making the model robust to sparse networks. Minimizing L_1st and L_2nd imposes a restriction that forces the model to preserve the first-order and second-order proximities between nodes.
The attribute proximity of nodes is determined by the similarity of their attribute vectors. The similarity metric of attribute vectors depends on whether the attributes are symmetric or asymmetric. In real-world networks, most attributes are highly asymmetric, such as word counts in citation networks. Moreover, symmetric attributes can also be transformed into asymmetric ones by regarding each element a_ij of node v_i's attribute vector as an asymmetric attribute indicating whether node v_i takes value j on that attribute. Therefore, the attribute vectors are treated as highly asymmetric to match real-world circumstances. The asymmetry of both attribute vectors and adjacency vectors results in the same similarity metric for the two forms of data. Training the autoencoder to minimize the reconstruction error enables the model to preserve the similarity between input vectors in the embedding space [42]. Meanwhile, experiments in [43] show that minimizing the reconstruction error of word-count vectors, a kind of highly asymmetric attribute vector, with a deep autoencoder makes similar input word-count vectors close to each other in the embedding space. Thus, to preserve the attribute proximity between nodes in the embedding space, the autoencoder is trained to minimize the reconstruction error of the attribute vectors. The corresponding loss function is

L_att = sum_{i=1}^{n} ||(â_i - a_i) ⊙ d_i||_2^2    (3)

where d_i = {d_ij} are the penalty parameters for nonzero attribute elements. If a_ij = 0, then d_ij = 1, and d_ij = γ > 1 otherwise. The larger penalty for the reconstruction error of nonzero attribute values reflects that the reconstruction of nonzero elements is more meaningful than that of zero ones. This is because there are significantly fewer nonzero elements than zero ones in highly asymmetric attribute vectors, and the nonzero elements are much more important in determining the similarity.
The final loss function combines the above structural and attribute proximity loss functions so as to preserve the structural and attribute proximities between nodes in the embedding space:

L = L_1st + α L_att + β L_2nd + η L_reg    (4)

where L_reg is an L2-norm regularization term to prevent overfitting, and α, β, and η are the weights of the attribute proximity loss, second-order proximity loss, and regularization term in the loss function, respectively. L_reg is defined as:

L_reg = (1/2) sum_{k=1}^{K} (||W^(k)||_F^2 + ||Ŵ^(k)||_F^2),

where W^(k) and Ŵ^(k) are the weight matrices of the k-th layer of the encoder and decoder, respectively.
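A direct transcription of the combined loss might look like the sketch below. The function name mdne_loss and the default penalty values are our own, and the three weights alpha (attribute term), beta (second-order term) and eta (regularization term) mirror the three weights of the final loss function:

```python
import numpy as np

def penalty_mask(X, pen):
    """1 for zero entries, pen > 1 for nonzero entries."""
    return np.where(X == 0, 1.0, pen)

def mdne_loss(S, S_hat, A, A_hat, Y, weights,
              alpha=1.0, beta=1.0, eta=1.0, b_pen=5.0, g_pen=5.0):
    n = S.shape[0]
    # First-order term: edge-weighted pairwise distances in embedding space.
    L1 = sum(S[i, j] * float(np.sum((Y[i] - Y[j]) ** 2))
             for i in range(n) for j in range(n))
    # Second-order term: penalized reconstruction of adjacency vectors.
    L2 = float(np.sum(((S_hat - S) * penalty_mask(S, b_pen)) ** 2))
    # Attribute term: penalized reconstruction of attribute vectors.
    La = float(np.sum(((A_hat - A) * penalty_mask(A, g_pen)) ** 2))
    # L2 regularization over all weight matrices (encoder and decoder).
    Lreg = 0.5 * sum(float(np.sum(W ** 2)) for W in weights)
    return L1 + alpha * La + beta * L2 + eta * Lreg
```

With perfect reconstruction and identical embeddings for linked nodes, only the regularization term remains.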
III-D Optimization
As presented so far, we seek to minimize the loss function to preserve the structural proximity and attribute proximity in the embedding space. Stochastic gradient descent is a general way to optimize a deep model. However, it is difficult to obtain the optimal result when using stochastic gradient descent directly over randomized weights due to the existence of many local optima [11]. In contrast, gradient descent works well when the initial weights are close to a good solution. Therefore, a Deep Belief Network [44] is adopted to pretrain the model and obtain initial weights, which have been shown to be close to the optimal weights [45]. Then, the model is optimized using stochastic gradient descent starting from the initial weights. By iterating and updating the parameters until the model converges, we obtain the optimal model. Experimental results show that the model optimization converges quickly during the first 10 iterations and slowly approaches the optimum in later iterations. Approximately 400 iterations produce satisfactory results. After proper optimization, informative representations are learned based on the trained model. Algorithm 1 presents the pseudocode of the proposed method. All the parameters are denoted as θ.
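The iterative fine-tuning phase can be illustrated with a toy gradient-descent loop on a linear autoencoder. Pretraining, the sigmoid nonlinearity, and the full MDNE loss are omitted for brevity; all sizes and the learning rate are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((20, 6))              # toy data: 20 samples, 6 features
W = rng.normal(0, 0.1, (3, 6))       # encoder weights
V = rng.normal(0, 0.1, (6, 3))       # decoder weights
lr = 0.1
n = X.size

losses = []
for _ in range(400):                 # roughly the iteration budget noted above
    E = (X @ W.T) @ V.T - X          # reconstruction error
    gV = (2.0 / n) * E.T @ (X @ W.T) # dL/dV for L = mean(E**2)
    gW = (2.0 / n) * (E @ V).T @ X   # dL/dW
    W -= lr * gW
    V -= lr * gV
    losses.append(float(np.mean(E ** 2)))
```

A decreasing loss curve over the iterations mirrors the convergence behavior described in the text.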
III-E Analysis and Discussions
In this section, we discuss and analyze the proposed MDNE model.
Time Complexity: The time complexity of MDNE is O((e + a)dI), where e is the number of edges, a is the total number of attributes carried by all the nodes, d is the maximum dimension of the hidden layers, and I is the number of iterations. Since d and I are independent of the other parameters, the overall training complexity of the model is linear in the sum of the number of edges and the number of attributes carried by all the nodes.
New nodes: A practical issue for network embedding is how to handle evolving networks. Many studies [23, 26] have shown interest in dealing with dynamic topological structures and node attributes. Since newly arriving nodes are an important factor in evolving networks, the proposed method provides a possible way to represent them. If new nodes have observable links connecting them to existing nodes and bring attribute information as well, their representations can be obtained by feeding their adjacency vectors and attribute vectors into the trained model. If the new nodes lack structure or attribute information, most existing methods cannot handle them [13]. However, MDNE can learn the representations of new nodes lacking one modality of information by replacing the missing vectors with zero vectors and inputting the existing vectors together with the zero vectors to the trained model.
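The zero-vector substitution for a new node missing one modality can be sketched as follows; all weights here are random stand-ins for the parameters of an already trained model, and the sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n, m = 6, 5                          # illustrative network/attribute sizes
# Stand-ins for the weights of a trained model (random here).
W_S = rng.normal(0, 0.1, (4, n))
W_A = rng.normal(0, 0.1, (3, m))
W1 = rng.normal(0, 0.1, (2, 7))

def embed(s_vec, a_vec):
    """Feed one node's adjacency and attribute vectors through the model."""
    h = np.concatenate([sigmoid(W_S @ s_vec), sigmoid(W_A @ a_vec)])
    return sigmoid(W1 @ h)

s_new = np.zeros(n)
s_new[2] = 1.0                       # new node links to existing node 2
a_missing = np.zeros(m)              # attribute modality missing: zero vector
y_new = embed(s_new, a_missing)
print(y_new.shape)
```

The new node thus receives an embedding from its structure modality alone, without retraining the model.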
IV Experimental Results
In this section, we empirically evaluate the effectiveness and generality of the proposed algorithm. First, the experimental setup is introduced, including the datasets, baseline methods and parameter settings. We also investigate the convergence of MDNE and verify the ability of all methods to reconstruct the network structure. Then, comparisons of the proposed method and the baselines are conducted on three real-world network analysis tasks, i.e., link prediction, attribute prediction and classification, to verify the quality of the obtained representations. Finally, the parameter sensitivity and the impact of preprocessing are discussed. Experiments run on a Dell Precision Tower 5810 with an Intel Xeon CPU E5-1620 v3 at 3.50 GHz and 16 GB of RAM.
IV-A Experiment Setup
IV-A1 Datasets
Four real-world network datasets are used in this work, including two citation networks and two social networks. Considering the characteristics of these datasets, one or more of them are chosen to evaluate the performance on each network analysis task. The four datasets are described as follows.
cora: cora (http://linqs.cs.umd.edu/projects//projects/lbc/index.html) is a citation network which contains 2,708 nodes and 5,278 edges. Each node indicates a machine learning paper, and each edge indicates a citation relation between papers. After stemming and removing stopwords, a vocabulary of 1,433 unique words is regarded as the attribute information of the papers. Each attribute indicates the absence/presence of the corresponding word in a paper. The papers are classified into one of the following seven classes: Case Based, Genetic Algorithms, Neural Networks, Probabilistic Methods, Reinforcement Learning, Rule Learning, and Theory.
citeseer: citeseer is a citation network which consists of 3,312 nodes and 4,551 edges. Similarly, nodes and edges represent scientific publications and their citations, respectively. A vocabulary of 3,703 words is extracted and set as the attributes. The papers are classified into one of the following six classes: Agents, AI, DB, IR, ML, HCI.
UNC, Oklahoma: These are two Facebook subnetworks, which contain 18,163 students from the University of North Carolina and 17,425 students from the University of Oklahoma, respectively, along with seven anonymized attributes: status, gender, major, second major, dorm/house, high school, and class year. Note that not all of the students have all seven attributes available.
Dataset  # nodes  # edges  # attributes 

UNC  18163  766800  2788 
Oklahoma  17425  892528  2305 
citeseer  3312  4551  3703 
cora  2708  5278  1433 
The statistics of the four datasets are summarized in Table 2. Experiments are conducted on both weighted and unweighted networks, and on both small and large ones. Such diverse datasets allow us to evaluate whether the proposed network embedding method performs well on networks with different characteristics.
IV-A2 Baseline Methods
Five typical methods are chosen as baselines.
LE [15]: It applies Laplacian Eigenmaps and spectral techniques to embed the data into a low-dimensional latent space. The solution reflects the features of the network structure.
node2vec [2]: It samples the network structure by biased random walks. Regarding the sampled paths as sentences, it adopts a natural language processing model to generate the network embedding. The hyperparameters p and q control the mix of breadth-first and depth-first sampling in the random walk. It recovers DeepWalk when p and q are both set to 1.
SDNE [13]: It exploits the first-order proximity and second-order proximity to preserve the local and global network structure. A deep model is adopted to address the highly nonlinear structure and the sparsity problem of networks.
AANE [23]: It proposes a scalable and efficient framework which incorporates node attribute proximity into network embedding. It processes each node efficiently by decomposing the complex modeling and optimization into many subproblems.
ASNE [26]: It adopts a multi-layer neural network to capture the complex interactions between features denoting the IDs and attributes of nodes, and performs network embedding by preserving the structural proximity and attribute proximity of nodes in the paths generated by random walks.
The first three methods are pure structure-based methods, and the others integrate attribute and structure information into network embedding.
IV-A3 Parameter Settings
The depth of a neural network and the number of neurons are essential factors in the learning effect. Recent evidence [13, 46, 47] reveals that the number of stacked layers (depth) and of neurons should be neither too large nor too small. Large numbers of layers and neurons increase the difficulty of training the model and bring overfitting problems. However, too few layers and neurons fail to extract effective low-dimensional representations [48], especially on large-scale datasets. Therefore, we vary MDNE's neural network structure according to the dataset, as shown in Table 3. The two numbers given for the first and second layers indicate the dimensions of the vectors related to the structure and attribute data, respectively.
We implemented MDNE using TensorFlow (https://www.tensorflow.org/). We fine-tuned the loss function hyperparameters using grid search based on the performance of network reconstruction [13], which is introduced as a basic quality criterion of the proposed method in Section IV-C. We first perform a coarse parameter sweep on each dataset, tuning the hyperparameters one by one iteratively until all of them converge. Then every hyperparameter is further fine-tuned by grid search on a smaller space around the optimal value obtained in the previous search for each dataset.
Dataset  # nodes in each layer

cora  (2708,1433)-(300,200)-128
citeseer  (3312,3703)-(250,250)-128
UNC  (18163,2788)-(3000,500)-128
Oklahoma  (17425,2305)-(3600,650)-128
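To illustrate the notation in Table 3, the following sketch runs a single untrained forward pass through the cora configuration {(2708,1433)-(300,200)-128} with random weights. It only demonstrates how the two modality-specific layers feed a joint embedding layer; it is not the actual MDNE implementation (no training, no decoder).

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def dense(d_in, d_out):
    """One fully connected layer: small random weights, zero bias."""
    return rng.normal(0.0, 0.01, (d_in, d_out)), np.zeros(d_out)

def forward(x, layer):
    W, b = layer
    return sigmoid(x @ W + b)

# cora configuration from Table 3: {(2708,1433)-(300,200)-128}
n_nodes, n_attrs = 2708, 1433
struct_layer = dense(n_nodes, 300)   # modality-specific layer for adjacency vectors
attr_layer = dense(n_attrs, 200)     # modality-specific layer for attribute vectors
embed_layer = dense(300 + 200, 128)  # joint layer producing the 128-d embedding

s = rng.random((5, n_nodes))         # a batch of 5 adjacency vectors
a = rng.random((5, n_attrs))         # the matching 5 attribute vectors
h = np.concatenate([forward(s, struct_layer), forward(a, attr_layer)], axis=1)
emb = forward(h, embed_layer)
print(emb.shape)                     # (5, 128)
```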
The parameters of the baseline methods are set to the optimal values reported in their papers. For the sake of fairness, we set the embedding dimension of all the methods to the same value on each task.
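The coarse-then-fine tuning loop described above can be sketched as follows. The grid values and the scoring function here are purely hypothetical stand-ins for training MDNE and measuring reconstruction precision.

```python
import itertools

def grid_search(evaluate, grid):
    """Score every combination in `grid` and return the best setting;
    `evaluate` would train the model and report a quality criterion
    such as network-reconstruction precision."""
    best, best_score = None, float("-inf")
    for values in itertools.product(*grid.values()):
        setting = dict(zip(grid.keys(), values))
        score = evaluate(setting)
        if score > best_score:
            best, best_score = setting, score
    return best

# Coarse sweep over one hypothetical hyperparameter, then a finer sweep
# around the winner (the toy scorer just prefers values near 0.02).
scorer = lambda s: -abs(s["alpha"] - 0.02)
coarse = grid_search(scorer, {"alpha": [0.0, 0.02, 0.04]})
fine = grid_search(scorer, {"alpha": [coarse["alpha"] - 0.005,
                                      coarse["alpha"],
                                      coarse["alpha"] + 0.005]})
```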
IV-B Convergence
Experiments are conducted to investigate the convergence of MDNE. We vary the number of iterations from 0 to 800 and plot the corresponding value of the loss function on a citation network (cora) and a social network (UNC). The learning curves are shown in Figure 3. The result indicates that MDNE converges at about 400 iterations on both datasets. Although the performance might improve further with more iterations, 400 iterations already achieve the best result among the baselines. To balance the effectiveness and efficiency of MDNE, the model is trained for about 400 iterations in the experiments.
IV-C Network Reconstruction
Network reconstruction verifies the ability of a method to reconstruct the network structure, which is a basic requirement for network embedding methods. Given the learned network representations, all links in the original network need to be predicted. Links are predicted by ranking all node pairs by similarity and predicting that a certain number of top pairs are linked by edges. The cosine distance between learned vectors measures the similarity between nodes: higher-ranking node pairs are more likely to have links in the original network. The evaluation indicator is precision@k [13], the ratio of the top-k node pairs that are connected in the original network. A larger precision@k indicates better reconstruction performance.
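The reconstruction indicator (precision@k: the fraction of the k most similar node pairs that are true edges) can be sketched as follows; this is an illustrative implementation over cosine similarities, not code from the paper.

```python
import numpy as np

def precision_at_k(embeddings, edges, k):
    """precision@k for network reconstruction: rank all node pairs by
    cosine similarity of their embeddings and report how many of the
    top-k pairs are actual edges of the original network."""
    n = embeddings.shape[0]
    norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = norm @ norm.T                    # pairwise cosine similarities
    iu = np.triu_indices(n, k=1)           # each unordered pair (i<j) once
    order = np.argsort(-sim[iu])           # pairs ranked by similarity
    top = set(zip(iu[0][order[:k]], iu[1][order[:k]]))
    true_edges = {tuple(sorted(e)) for e in edges}
    return len(top & true_edges) / k
```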
cora:
Algorithm  @1000  @3000  @5000  @7000  @9000  @10000
LE  0.661  0.481  0.408  0.353  0.316  0.300
node2vec  1.000  0.903  0.542  0.388  0.302  0.272
SDNE  0.924  0.703  0.543  0.432  0.353  0.323
AANE  0.792  0.465  0.318  0.239  0.194  0.179
ASNE  0.954  0.796  0.514  0.383  0.307  0.281
MDNE  0.996  0.871  0.701  0.581  0.491  0.455

citeseer:
Algorithm  @1000  @3000  @5000  @7000  @9000  @10000
LE  0.480  0.376  0.334  0.307  0.280  0.269
node2vec  1.000  1.000  0.654  0.467  0.364  0.327
SDNE  0.869  0.787  0.658  0.530  0.430  0.390
AANE  0.774  0.586  0.424  0.323  0.262  0.239
ASNE  0.962  0.908  0.713  0.543  0.438  0.400
MDNE  0.994  0.951  0.798  0.637  0.530  0.488
UNC:
Algorithm  @5000  @10000  @15000  @20000  @25000
LE  0.942  0.915  0.901  0.894  0.885
SDNE  0.997  0.988  0.968  0.943  0.915
AANE  0.012  0.010  0.008  0.008  0.008
ASNE  0.999  0.989  0.922  0.765  0.635
MDNE  0.998  0.998  0.997  0.982  0.963

Oklahoma:
Algorithm  @5000  @10000  @15000  @20000  @25000
LE  0.952  0.938  0.925  0.916  0.907
SDNE  0.998  0.986  0.981  0.978  0.976
AANE  0.022  0.018  0.015  0.014  0.013
ASNE  0.995  0.974  0.943  0.914  0.888
MDNE  0.999  0.996  0.993  0.983  0.969
Network reconstruction is performed on the four datasets, and the results are shown in Figure 4. Table 4 and Table 5 also provide the numeric results to help compare the close curves. Numbers in bold represent the best result in each column. Compared to UNC and Oklahoma, the performance of all the methods visibly decreases on cora and citeseer. This is because cora and citeseer suffer from the sparsity problem, as their average degree is much smaller than that of UNC and Oklahoma.
LE, a shallow model-based method, performs poorly. This indicates that going deep enhances the model's generalization ability and helps to capture the high nonlinearity of network structures. SDNE adopts a deep autoencoder model but only uses structure information; its inferior performance demonstrates the usefulness of attribute information in learning better node representations. Node2vec is slightly better than MDNE on the cora and citeseer networks when k is small. The reason might be that node2vec can capture the higher-order proximity between nodes by random walks in the network.
AANE has relatively poor performance, especially on UNC and Oklahoma. This is because AANE only considers the first-order proximity, and its performance largely depends on the computation of attribute similarity over the full attribute space, while the attribute similarity of nodes computed explicitly in a high-dimensional attribute space has little discriminability on certain networks. ASNE has slightly inferior performance because it can hardly capture the nonlinear correlations between structure and attribute information, as it preprocesses the structure and attribute data linearly before concatenating them. MDNE has the best performance on the four datasets in most cases. The good performance of MDNE comes from adopting a deep model to learn nonlinear features, using the multimodal learning method to better capture the correlations between attributes and structure, and preserving the attribute proximity by minimizing the reconstruction error instead of computing attribute similarity explicitly.
IV-D Link Prediction and Attribute Prediction
In this section, we evaluate the ability of the learned representations to predict missing links and attributes in the network, which is a practical task in realworld applications.
IV-D1 Link Prediction
Link prediction is the prediction of missing links based on the existing information. After randomly hiding a certain ratio of the links, the remaining network is used as a sub-dataset to perform network embedding. The test set consists of positive and negative instances: the hidden links are taken as positive instances, and an equal number of unconnected node pairs in the original network are randomly selected as negative instances. Similarities between the learned representations of the nodes in the test set are calculated and sorted in descending order. A higher ranking of a node pair corresponds to a greater possibility that they are connected. The Area Under the ROC Curve (AUC) is adopted as the evaluation metric, as it is commonly used to measure the quality of classification based on ranking. A large AUC indicates good performance; if an algorithm ranks all positive instances higher than all negative instances, the AUC is 1. The above steps are repeated 10 times and the average AUC is taken as the final result. All methods have extremely poor performance on the cora and citeseer networks, as the low average degrees of the two networks make link prediction very hard. Thus we only show the results on the UNC and Oklahoma networks in Figure 5 and Table 6. Numbers in bold represent the highest performance in each column.
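The ranking-based AUC used here can be sketched as follows: an illustrative pairwise implementation over cosine similarities, assuming the positive (hidden) and negative (unconnected) node pairs have already been sampled.

```python
import numpy as np

def link_auc(emb, pos_pairs, neg_pairs):
    """AUC for link prediction: the cosine similarities of hidden
    (positive) node pairs should rank above those of unconnected
    (negative) pairs. Ties contribute 0.5; a perfect ranking gives 1."""
    norm = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    score = lambda pairs: np.array([norm[i] @ norm[j] for i, j in pairs])
    p, n = score(pos_pairs), score(neg_pairs)
    diff = p[:, None] - n[None, :]         # every positive vs. every negative
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size
```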
UNC:
Test Ratio  0.05  0.15  0.25  0.35  0.45
LE  0.670  0.668  0.644  0.616  0.588
SDNE  0.915  0.902  0.896  0.880  0.873
AANE  0.501  0.500  0.500  0.500  0.499
ASNE  0.711  0.683  0.653  0.629  0.605
MDNE  0.912  0.907  0.901  0.900  0.906

Oklahoma:
Test Ratio  0.05  0.15  0.25  0.35  0.45
LE  0.682  0.685  0.685  0.670  0.637
SDNE  0.891  0.885  0.882  0.873  0.852
AANE  0.498  0.499  0.500  0.501  0.500
ASNE  0.899  0.889  0.879  0.868  0.855
MDNE  0.937  0.935  0.933  0.931  0.924
Baseline  UNC  Oklahoma 

LE  1.125  1.125 
SDNE  0.0084  1.125 
AANE  1.125  1.125 
ASNE  1.125  1.125 
Compared with the shallow model-based methods LE and AANE, the deep model-based methods MDNE, SDNE and ASNE perform significantly better. This is because deep models can better capture the highly nonlinear network structure. The reason for the extremely poor performance of AANE is the same as on the network reconstruction task. SDNE and MDNE perform well since they preserve both the first-order and second-order proximities between nodes in the embedding space. MDNE is slightly better than SDNE, which does not preserve attribute features in the learned representations; this result justifies the usefulness of attribute information in link prediction.
The Friedman test is conducted to better support the superiority of MDNE with respect to the other methods. The p-values are computed based on the rankings of the AUC values of MDNE and each method on different sub-datasets with different test ratios. In Table 7, all the p-values are less than 0.05. The results show that the performance of MDNE is significantly different from that of the compared methods on the link prediction task. The p-value of MDNE versus SDNE on UNC is slightly higher than the others because SDNE is slightly better than MDNE when the ratio of links held out for testing is 5%.
The network becomes sparser as the ratio of links held out for testing increases, and the AUC of MDNE stays stable while that of the other methods drops. This indicates that the penalty on non-zero elements in the loss function improves MDNE's performance in dealing with sparse networks. Such an advantage is pivotal for downstream applications, since links are often sparse, especially in large-scale real-world networks. Despite the link prediction task favoring pure structure-based methods, MDNE outperforms the others. This demonstrates the effectiveness of the learned representations in predicting missing links.
IV-D2 Attribute Prediction
Attribute prediction refers to predicting unknown attribute values of nodes based on the obtained information. It has attracted increasing interest in network analysis. For example, in social network recommendation, predicting attribute features is essential to help users locate information of interest [38].
In the attribute prediction experiments, a certain ratio of the attribute values (including values 1 and 0) in the original network are hidden randomly, i.e., the corresponding entries of the attribute matrix are hidden, and they form the test set. The remaining attribute information and the structure information are used to learn the representations of nodes, and the obtained representations are used to predict the attributes in the test set. Assuming that attribute j of node i is hidden, the prediction works as follows. Similarities between node i and all the other nodes in the embedding space are calculated. We denote the 10 nodes with the highest similarity to i as the set T, the subset of nodes in T whose attribute j has value 1 as T1, and the rest of the nodes in T, whose attribute j has value 0, as T0. The score |T1| / (|T1| + |T0|) is calculated, which indicates the possibility that the hidden value, i.e., attribute j of node i, is 1. All instances in the test set are sorted in descending order of this score. AUC is the metric used to evaluate the ranking list; a high AUC value indicates high prediction accuracy. The results are shown in Figure 6.
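The scoring rule above can be sketched as follows. This is illustrative only: `top=10` matches the 10 most similar nodes used in the procedure, and the attribute matrix is assumed to be binary.

```python
import numpy as np

def attribute_score(emb, attr, node, j, top=10):
    """Score a hidden attribute j of `node` as the fraction of the `top`
    most similar nodes (cosine similarity in the embedding space) whose
    attribute j equals 1, i.e. the estimate that the hidden value is 1."""
    norm = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = norm @ norm[node]                # similarity to every node
    sim[node] = -np.inf                    # exclude the node itself
    nearest = np.argsort(-sim)[:top]       # the set T of most similar nodes
    return attr[nearest, j].mean()         # |T1| / (|T1| + |T0|)
```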
Baseline  cora  citeseer  UNC  Oklahoma 
LE  1.125  1.125  1.125  1.125 
SDNE  1.125  1.125  1.125  1.125 
AANE  1.125  1.125  1.125  1.125 
ASNE  1.125  1.125  /  / 
The performances of LE, node2vec and SDNE are worse than that of MDNE because they do not preserve attribute features in the embedding space, which is important for predicting missing attributes of nodes. AANE still performs poorly because its attribute affinity matrix is calculated over the full attribute space, which decreases the discriminability of the representations. Compared with ASNE, the superior performance of MDNE is credited to the preprocessing of the original attribute and structure information based on the multimodal learning method: the high-order features of the attribute vectors and adjacency vectors obtained by the preprocessing layer help the subsequent layers better extract the high-order correlations between the structure and attribute features of nodes.
Also, the Friedman test is conducted comparing MDNE with the others. The p-values are listed in Table 8, all of which are less than 0.05. The results show that the performance of MDNE is significantly different from that of the baselines on the attribute prediction task.
The attribute sparsity of the datasets differs considerably, as the average number of attributes per node is 34.3, 31.7, 5.4 and 5.3 on cora, citeseer, UNC and Oklahoma, respectively. Moreover, the attributes of each network become sparser as the ratio of the test set increases. The proposed method performs well in all cases. This demonstrates that MDNE is effective on attribute prediction tasks and is robust to networks with different extents of attribute sparsity.
IV-E Classification
Classification is one of the important tasks in network analysis. It classifies nodes based on their features, and the generated representations are used as the features. The widely used classifier LIBLINEAR [49] is adopted. A portion of the node representations and their labels are taken as the training set, and the rest form the test set. For a fair comparison, the test ratio varies from 0.1 to 0.9 by an increment of 0.1. F-measure is a commonly adopted metric for binary classification; Micro-F1 and Macro-F1 are employed to judge the classification quality. Macro-average is defined as the arithmetic mean of the F-measure over all label categories, and Micro-average is the harmonic mean of the average precision and average recall. For both metrics, higher values indicate better performance. For each training ratio, we randomly split the training set and the test set 10 times and report the average result in Figure 7. The experiment is conducted on citeseer and cora, since they are the only datasets containing class labels for nodes.
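The two scores can be sketched as follows: an illustrative computation of Micro-F1 and Macro-F1 from predicted and true labels, independent of the classifier used.

```python
import numpy as np

def micro_macro_f1(y_true, y_pred):
    """Return (Micro-F1, Macro-F1). Macro-F1 averages the per-class
    F-measures; Micro-F1 is the harmonic mean of the precision and
    recall computed from the pooled per-class counts."""
    labels = np.unique(np.concatenate([y_true, y_pred]))
    tp = fp = fn = 0
    f1s = []
    for c in labels:
        tp_c = np.sum((y_pred == c) & (y_true == c))
        fp_c = np.sum((y_pred == c) & (y_true != c))
        fn_c = np.sum((y_pred != c) & (y_true == c))
        tp, fp, fn = tp + tp_c, fp + fp_c, fn + fn_c
        prec = tp_c / (tp_c + fp_c) if tp_c + fp_c else 0.0
        rec = tp_c / (tp_c + fn_c) if tp_c + fn_c else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    micro_p, micro_r = tp / (tp + fp), tp / (tp + fn)
    micro = 2 * micro_p * micro_r / (micro_p + micro_r)
    return micro, float(np.mean(f1s))
```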
MDNE has the best performance in all cases. Although node2vec performs satisfactorily on the network reconstruction task, it returns disappointing results on the classification task, which shows that the representations learned by node2vec have a task preference. AANE still performs poorly. We repeat the classification experiments with a nonlinear-kernel SVM classifier and 5-fold cross-validation, and the performance of AANE improves. This is because AANE can hardly capture the nonlinear correlations between structure and attribute features, and the representations learned by AANE are not linearly separable: an SVM with a nonlinear kernel can classify such representations, which is difficult for the linear classifier LIBLINEAR. MDNE performs well with both LIBLINEAR and the SVM. Considering that LIBLINEAR has advantages in time complexity, it is beneficial that the learned representations are suitable for linear classifiers. The poor performance of ASNE is due to its lack of nonlinear preprocessing of the original structure and attribute information; nonlinear preprocessing of the adjacency vector and the attribute vector helps the model capture the high-order correlations between the two kinds of information in the subsequent learning. SDNE and LE are worse than MDNE, as they do not consider attribute information when embedding networks. The significant improvement of MDNE over the baselines proves that adopting the multimodal deep model and optimizing the loss function defined on the structural proximity and attribute proximity together learn effective representations for classification tasks.
IV-F Parameter Sensitivity and the Impact of Preprocessing
In this section, we investigate how different choices of the attribute proximity weight α and of the embedding dimension, along with the preprocessing procedure, affect the performance of MDNE on the cora dataset. The results of classification tasks with different test ratios are reported. The results of the other tasks on the other datasets are similar and therefore omitted.
IV-F1 The weight of the attribute proximity loss
The hyperparameter α adjusts the importance of the attribute proximity loss in the loss function. The weight of the structural proximity loss is fixed, so α determines the relative importance of the attribute proximity loss with respect to the structural proximity loss.
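A minimal sketch of how the attribute proximity weight (called alpha below) combines the two loss terms, assuming simple squared-error reconstruction terms; the actual MDNE loss additionally penalizes the reconstruction of non-zero elements more heavily, which is omitted here.

```python
import numpy as np

def total_loss(S, S_hat, A, A_hat, alpha):
    """Weighted sum of the two reconstruction errors: the structural
    term has a fixed weight, and alpha scales the attribute term.
    (Simplified: the full loss also weights non-zero elements more.)"""
    structural = np.sum((S_hat - S) ** 2)   # structural proximity loss
    attribute = np.sum((A_hat - A) ** 2)    # attribute proximity loss
    return structural + alpha * attribute
```

Setting `alpha` to 0 recovers a purely structure-driven objective, which corresponds to the α = 0 case discussed below.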
Figure 8 shows the impact of α over the range [0, 0.04] at an interval of 0.005. The slight improvement in performance once α > 0 shows that the attribute proximity loss plays an important role in learning network representations. The relatively stable performance over the rest of the range indicates that MDNE is not sensitive to the value of α there, which means a suitable value of α can be found within a wide range in real-world applications. The great difference between α and the weight of the structural proximity loss is due to the inherent characteristics of the cora dataset: the total number of edges is 5,278, while the total number of attribute values is 49,216, so the raw attribute proximity loss is much larger than the structural proximity loss. To balance their effects, the smaller weight α is necessary. Besides, compared with Figure 7, when α = 0, which means the attribute proximity loss is ignored in the loss function, MDNE still outperforms the baselines. This observation indicates that the structure of the proposed multimodal deep autoencoder with the pretraining algorithm is able to capture the highly nonlinear relationship between structure and attribute features even without the attribute proximity loss in the loss function.
IV-F2 Embedding dimensions
The effect of the embedding dimension on classification performance is shown in Figure 9. The performance improves as the number of dimensions increases initially; once the number of dimensions exceeds a threshold, the performance becomes stable. The reason is twofold. When the number of dimensions is small, increasing it incorporates more useful information into the representations and the performance increases. However, too many dimensions bring noise and redundant information, which weakens the classification ability of the representations. Thus, it is important to select a reasonable embedding dimension. Figure 9 shows that the proposed method is not very sensitive to the embedding dimension once it is larger than 60. Taking both accuracy and complexity into account, the embedding dimension of MDNE is set to 128 in our experiments.
IV-F3 Preprocessing
Figure 10 shows the results of MDNE with and without the preprocessing procedure. The model with the preprocessing procedure, which corresponds to Figure 2(b), has the structure {(2708,1433)-(300,200)-128}: besides the preprocessing layer, the subsequent deep model has an input layer taking the concatenated high-order features and an output layer. The model without the preprocessing procedure, which corresponds to Figure 2(a), has the structure {(2708,1433)-500-128}: the corresponding deep model has an input layer taking the concatenated original vectors, a hidden layer, and an output layer. The total number of weight parameters in the model without the preprocessing procedure is larger than in the model with it. As Figure 10 shows, although the model with preprocessing has lower computational complexity, its result is slightly better. Moreover, compared with Figure 7, the proposed method without the preprocessing procedure still outperforms the baselines. This demonstrates that, besides the preprocessing procedure, both the deep model and the loss function of the proposed method contribute to the good performance of MDNE.
V Conclusion
In this paper, a Multimodal Deep Network Embedding method is proposed for learning informative network representations by integrating the structure and attribute information of nodes. Specifically, a deep model comprising multiple layers of nonlinear functions is adopted to capture the nonlinear network structure and its complex interactions with node attributes. To better extract the high-order correlations between the topological structures and attributes of nodes, the multimodal learning method is adopted to preprocess the original structure and attribute data. The structural proximity and attribute proximity are utilized to describe the structure and attribute features of the network, respectively, and the loss function of the model is defined based on these two proximities; minimizing it preserves both proximities in the embedding space. Experiments are conducted on four real-world networks to evaluate the performance of the obtained representations. Compared with the baselines, the results demonstrate that MDNE offers superior performance on various real-world applications. In the future, we will consider improving the efficiency of MDNE through a parallel processing framework and extending the model to learn task-oriented representations combined with the requirements of specific applications.
Acknowledgments
The authors would like to thank the anonymous referees for their critical appraisals and useful suggestions.
References
 [1] B. Perozzi, R. Al-Rfou, and S. Skiena, “Deepwalk: Online learning of social representations,” in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014, pp. 701–710.
 [2] A. Grover and J. Leskovec, “node2vec: Scalable feature learning for networks,” in Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2016, pp. 855–864.
 [3] S. Cao, W. Lu, and Q. Xu, “Grarep: Learning graph representations with global structural information,” in Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. ACM, 2015, pp. 891–900.

 [4] X. Hu, L. Tang, J. Tang, and H. Liu, “Exploiting social relations for sentiment analysis in microblogging,” in Proceedings of the sixth ACM international conference on Web search and data mining. ACM, 2013, pp. 537–546.
 [5] J. Tang, H. Gao, X. Hu, and H. Liu, “Exploiting homophily effect for trust prediction,” in Proceedings of the sixth ACM international conference on Web search and data mining. ACM, 2013, pp. 53–62.
 [6] M. McPherson, L. Smith-Lovin, and J. M. Cook, “Birds of a feather: Homophily in social networks,” Annual review of sociology, vol. 27, no. 1, pp. 415–444, 2001.
 [7] P. V. Marsden, “Homogeneity in confiding relations,” Social networks, vol. 10, no. 1, pp. 57–76, 1988.
 [8] D. Luo, F. Nie, H. Huang, and C. H. Ding, “Cauchy graph embedding,” in Proceedings of the 28th International Conference on Machine Learning (ICML11), 2011, pp. 553–560.
 [9] P. Cui, X. Wang, J. Pei, and W. Zhu, “A survey on network embedding,” arXiv preprint arXiv:1711.08752, 2017.
 [10] Y. Bengio et al., “Learning deep architectures for ai,” Foundations and trends® in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
 [11] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” science, vol. 313, no. 5786, pp. 504–507, 2006.
 [12] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, “Multimodal deep learning,” in Proceedings of the 28th international conference on machine learning (ICML11), 2011, pp. 689–696.
 [13] D. Wang, P. Cui, and W. Zhu, “Structural deep network embedding,” in Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2016, pp. 1225–1234.
 [14] S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” science, vol. 290, no. 5500, pp. 2323–2326, 2000.
 [15] M. Belkin and P. Niyogi, “Laplacian eigenmaps and spectral techniques for embedding and clustering,” in Advances in neural information processing systems, 2002, pp. 585–591.
 [16] Y. Jacob, L. Denoyer, and P. Gallinari, “Learning latent representations of nodes for classifying in heterogeneous social networks,” in Proceedings of the 7th ACM international conference on Web search and data mining. ACM, 2014, pp. 373–382.
 [17] T. Hofmann, “Unsupervised learning by probabilistic latent semantic analysis,” Machine learning, vol. 42, no. 1-2, pp. 177–196, 2001.
 [18] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” Journal of machine Learning research, vol. 3, no. Jan, pp. 993–1022, 2003.
 [19] X. Wang, P. Cui, J. Wang, J. Pei, W. Zhu, and S. Yang, “Community preserving network embedding,” in AAAI, 2017.
 [20] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, “Line: Largescale information network embedding,” in Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2015, pp. 1067–1077.

 [21] J. Li, X. Hu, L. Wu, and H. Liu, “Robust unsupervised feature selection on networked data,” in Proceedings of the 2016 SIAM International Conference on Data Mining. SIAM, 2016, pp. 387–395.
 [22] C. Yang, Z. Liu, D. Zhao, M. Sun, and E. Y. Chang, “Network representation learning with rich text information,” in IJCAI, 2015, pp. 2111–2117.
 [23] X. Huang, J. Li, and X. Hu, “Accelerated attributed network embedding,” in Proceedings of the 2017 SIAM International Conference on Data Mining. SIAM, 2017, pp. 633–641.

 [24] D. Zhang, J. Yin, X. Zhu, and C. Zhang, “User profile preserving social network embedding,” in Proceedings of the 26th International Joint Conference on Artificial Intelligence. AAAI Press, 2017, pp. 3378–3384.
 [25] S. Pan, J. Wu, X. Zhu, C. Zhang, and Y. Wang, “Tri-party deep network representation,” Network, vol. 11, no. 9, p. 12, 2016.
 [26] L. Liao, X. He, H. Zhang, and T. S. Chua, “Attributed social network embedding,” IEEE Transactions on Knowledge & Data Engineering, vol. PP, no. 99, pp. 1–1, 2017.
 [27] D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor, “Canonical correlation analysis: An overview with application to learning methods,” Neural computation, vol. 16, no. 12, pp. 2639–2664, 2004.
 [28] R. Rosipal and N. Krämer, “Overview and recent advances in partial least squares,” in International Statistical and Optimization Perspectives Workshop” Subspace, Latent Structure and Feature Selection”. Springer, 2005, pp. 34–51.
 [29] J. B. Tenenbaum and W. T. Freeman, “Separating style and content with bilinear models,” Neural computation, vol. 12, no. 6, pp. 1247–1283, 2000.
 [30] N. Srivastava and R. Salakhutdinov, “Learning representations for multimodal data with deep belief nets,” in International conference on machine learning workshop, vol. 79, 2012.
 [31] ——, “Multimodal learning with deep boltzmann machines,” Journal of Machine Learning Research, vol. 15, pp. 2949–2980, 2014.
 [32] C. Kang, S. Xiang, S. Liao, C. Xu, and C. Pan, “Learning consistent feature representation for cross-modal multimedia retrieval,” IEEE Transactions on Multimedia, vol. 17, no. 3, pp. 370–381, 2015.
 [33] X. Xu, F. Shen, Y. Yang, H. T. Shen, and X. Li, “Learning discriminative binary codes for large-scale cross-modal retrieval,” IEEE Transactions on Image Processing, vol. 26, no. 5, pp. 2494–2507, 2017.
 [34] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec, “Graph convolutional neural networks for web-scale recommender systems,” arXiv preprint arXiv:1806.01973, 2018.

 [35] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in Advances in Neural Information Processing Systems, 2016, pp. 3844–3852.
 [36] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
 [37] P. Wu and L. Pan, “Mining application-aware community organization with expanded feature subspaces from concerned attributes in social networks,” Knowledge-Based Systems, vol. 139, pp. 1–12, 2018.
 [38] X. Cai, J. Han, W. Li, R. Zhang, S. Pan, and L. Yang, “A three-layered mutually reinforced model for personalized citation recommendation,” IEEE Transactions on Neural Networks and Learning Systems, 2018.
 [39] W. Zhao, S. Tan, Z. Guan, B. Zhang, M. Gong, Z. Cao, and Q. Wang, “Learning to map social network users by unified manifold alignment on hypergraph,” IEEE Transactions on Neural Networks and Learning Systems, 2018.

 [40] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
 [41] D. Charte, F. Charte, S. García, M. J. del Jesus, and F. Herrera, “A practical tutorial on autoencoders for nonlinear feature fusion: Taxonomy, models, software and guidelines,” Information Fusion, vol. 44, pp. 78–96, 2018.
 [42] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” Journal of machine learning research, vol. 11, no. Dec, pp. 3371–3408, 2010.
 [43] M. Belkin and P. Niyogi, “Laplacian eigenmaps for dimensionality reduction and data representation,” Neural computation, vol. 15, no. 6, pp. 1373–1396, 2003.
 [44] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural computation, vol. 18, no. 7, pp. 1527–1554, 2006.
 [45] D. Erhan, Y. Bengio, A. Courville, P.A. Manzagol, P. Vincent, and S. Bengio, “Why does unsupervised pretraining help deep learning?” Journal of Machine Learning Research, vol. 11, no. Feb, pp. 625–660, 2010.
[46] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[47] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[48] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in European Conference on Computer Vision. Springer, 2014, pp. 818–833.
[49] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, “LIBLINEAR: A library for large linear classification,” Journal of Machine Learning Research, vol. 9, no. Aug, pp. 1871–1874, 2008.
Appendices
To better compare the closely overlapping curves in Figures 6 and 7, numeric results are provided in Tables 9-11. Numbers in bold represent the best result in each column.
Table 9 (columns: test ratio)

Dataset    Method      0.05   0.15   0.25   0.35   0.45
cora       LE          0.713  0.700  0.691  0.681  0.667
cora       node2vec    0.626  0.620  0.613  0.606  0.597
cora       SDNE        0.701  0.686  0.677  0.665  0.652
cora       AANE        0.630  0.620  0.613  0.605  0.597
cora       ASNE        0.710  0.688  0.676  0.660  0.643
cora       MDNE        0.744  0.726  0.714  0.698  0.680
citeseer   LE          0.700  0.691  0.681  0.671  0.657
citeseer   node2vec    0.629  0.623  0.615  0.606  0.595
citeseer   SDNE        0.715  0.704  0.693  0.680  0.664
citeseer   AANE        0.634  0.626  0.618  0.608  0.597
citeseer   ASNE        0.729  0.710  0.691  0.671  0.649
citeseer   MDNE        0.776  0.759  0.741  0.721  0.697
UNC        LE          0.803  0.796  0.793  0.784  0.778
UNC        SDNE        0.865  0.858  0.852  0.844  0.837
UNC        AANE        0.780  0.771  0.768  0.754  0.747
UNC        ASNE        0.780  0.773  0.770  0.760  0.751
UNC        MDNE        0.881  0.874  0.868  0.860  0.853
Oklahoma   LE          0.800  0.795  0.790  0.786  0.780
Oklahoma   SDNE        0.826  0.821  0.816  0.810  0.806
Oklahoma   AANE        0.759  0.755  0.751  0.745  0.741
Oklahoma   ASNE        0.870  0.863  0.857  0.851  0.844
Oklahoma   MDNE        0.872  0.863  0.857  0.851  0.845
Table 10 (columns: test ratio)

Dataset    Method      0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
cora       LE          0.716  0.700  0.701  0.715  0.688  0.681  0.655  0.621  0.535
cora       node2vec    0.139  0.139  0.133  0.133  0.137  0.139  0.137  0.140  0.142
cora       SDNE        0.525  0.504  0.505  0.503  0.474  0.456  0.417  0.371  0.341
cora       AANE        0.127  0.136  0.133  0.134  0.133  0.140  0.140  0.146  0.149
cora       ASNE        0.444  0.429  0.406  0.421  0.437  0.410  0.419  0.350  0.343
cora       MDNE        0.790  0.802  0.779  0.748  0.738  0.726  0.719  0.688  0.660
citeseer   LE          0.519  0.497  0.504  0.496  0.495  0.481  0.467  0.452  0.408
citeseer   node2vec    0.166  0.176  0.168  0.172  0.170  0.171  0.173  0.176  0.171
citeseer   SDNE        0.377  0.375  0.362  0.369  0.366  0.353  0.345  0.324  0.273
citeseer   AANE        0.161  0.173  0.177  0.172  0.176  0.176  0.173  0.176  0.169
citeseer   ASNE        0.328  0.325  0.323  0.311  0.324  0.306  0.317  0.315  0.286
citeseer   MDNE        0.634  0.648  0.629  0.624  0.623  0.609  0.583  0.509  0.473
Table 11 (columns: test ratio)

Dataset    Method      0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
cora       LE          0.725  0.710  0.708  0.721  0.696  0.691  0.666  0.636  0.555
cora       node2vec    0.163  0.163  0.157  0.158  0.161  0.163  0.161  0.163  0.165
cora       SDNE        0.541  0.526  0.528  0.524  0.497  0.485  0.446  0.395  0.376
cora       AANE        0.255  0.257  0.251  0.241  0.233  0.226  0.222  0.216  0.208
cora       ASNE        0.488  0.479  0.447  0.467  0.472  0.458  0.459  0.378  0.389
cora       MDNE        0.807  0.815  0.798  0.774  0.761  0.748  0.743  0.711  0.687
citeseer   LE          0.534  0.499  0.517  0.515  0.501  0.485  0.471  0.468  0.416
citeseer   node2vec    0.179  0.184  0.182  0.184  0.181  0.181  0.183  0.185  0.183
citeseer   SDNE        0.439  0.427  0.412  0.419  0.409  0.397  0.382  0.353  0.295
citeseer   AANE        0.196  0.210  0.214  0.206  0.204  0.203  0.198  0.203  0.192
citeseer   ASNE        0.355  0.360  0.359  0.341  0.355  0.335  0.347  0.341  0.308
citeseer   MDNE        0.691  0.700  0.691  0.678  0.673  0.656  0.623  0.545  0.502
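As a quick sanity check on the tabulated results, the per-column best method can be recovered programmatically. This is a minimal sketch in Python, not part of the original evaluation pipeline; the scores are transcribed from the cora block of Table 9 above:

```python
# Scores for cora from Table 9, transcribed from the table above
# (one list per method, one entry per test ratio).
scores = {
    "LE":       [0.713, 0.700, 0.691, 0.681, 0.667],
    "node2vec": [0.626, 0.620, 0.613, 0.606, 0.597],
    "SDNE":     [0.701, 0.686, 0.677, 0.665, 0.652],
    "AANE":     [0.630, 0.620, 0.613, 0.605, 0.597],
    "ASNE":     [0.710, 0.688, 0.676, 0.660, 0.643],
    "MDNE":     [0.744, 0.726, 0.714, 0.698, 0.680],
}
ratios = [0.05, 0.15, 0.25, 0.35, 0.45]

# For each test ratio (column), pick the method with the highest score,
# i.e. the entry that would be bolded in the table.
best = {r: max(scores, key=lambda m: scores[m][i]) for i, r in enumerate(ratios)}
print(best)  # MDNE wins every column on cora
```

The same column-wise maximum identifies the bolded entries in Tables 10 and 11 as well.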