1 Introduction
Analyzing and mining useful knowledge from graphs has been an actively researched topic for decades in both academia and industry. Among various graph mining techniques, network embedding, which learns low-dimensional vector representations for nodes in a graph, has been shown to be especially effective for various network-based tasks [31, 36, 22]. However, most existing network embedding methods assume that only a single type of relation exists between nodes [32, 33, 13], whereas in reality networks are multiplex [4] in nature, i.e., with multiple types of relations. Taking the publication network as an example, two papers can be connected for various reasons, such as authors (two papers are authored by a common author), citation (one paper cites the other), or keywords (two papers share common keywords). As another example, in a movie database network, two movies can be connected via a common director, or a common actor.
Although different types of relations can independently form different graphs, these graphs are related, and thus can mutually help each other for various downstream tasks. As a concrete example of the publication network, although it is hard to infer the topic of a paper only from its citations (citations can be diverse), also knowing other papers written by the same authors will help predict its topic, because authors usually work on a specific research topic. Furthermore, nodes in graphs may contain attribute information, which plays important roles in many applications [42]. For example, if we are additionally given the abstract of the papers in the publication network, it will be much easier to infer their topics. As such, the main challenge is to learn a consensus representation of a node that not only considers its multiplexity, but also its attributes.
Several recent studies have addressed multiplex network embedding; however, some issues remain that need further consideration. First, previous methods [25, 41, 29, 19] focus on the integration of multiple graphs, but overlook node attributes. Second, even those that consider node attributes [28, 37] require node labels for training. However, as node labeling is often expensive and time-consuming, it would be best if a method could show competitive performance even without any labels. Third, most of these methods fail to model the global properties of a graph, because they are based on the random walk-based skip-gram model or the graph convolutional network (GCN) [13], both of which are known to be effective for capturing the local graph structure [39]. More precisely, nodes that are “close” (i.e., within the same context window or neighborhoods) in the graph are trained to have similar representations, whereas nodes that are far apart do not have similar representations, even though they are structurally similar [27].
Keeping these limitations in mind, we propose a simple yet effective unsupervised method for embedding attributed multiplex networks. The core building block of our proposed method is Deep Graph Infomax (DGI) [33] that aims to learn a node encoder that maximizes the mutual information between local patches of a graph, and the global representation of the entire graph. DGI is the workhorse method for our task, because it 1) naturally integrates the node attributes by using a GCN, 2) is trained in a fully unsupervised manner, and 3) captures the global properties of the entire graph. However, it is challenging to apply DGI, which is designed for embedding a single network, to a multiplex network in which the interactions among multiple relation types, and the importance of each relation type should be considered.
In this paper, we present a systematic way to jointly integrate the embeddings from multiple types of relations between nodes, so that they can mutually help each other learn high-quality embeddings useful for various downstream tasks. More precisely, we introduce the consensus regularization framework that minimizes the disagreements among the relation-type specific node embeddings, and the universal discriminator that discriminates true samples, i.e., ground truth “(graph-level summary, local patch)” pairs, regardless of the relation types. Moreover, we demonstrate that through the attention mechanism, we can infer the importance of each relation type in generating the consensus node embeddings, which can be used for filtering unnecessary relation types as a preprocessing step. Our extensive experiments demonstrate that our proposed method, Deep Multiplex Graph Infomax (DMGI), outperforms the state-of-the-art attributed multiplex network embedding methods in terms of node clustering, similarity search, and especially, node classification even though DMGI is fully unsupervised.
2 Problem Statement
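Definition 1.
(Attributed Multiplex Network) An attributed multiplex network is a network $\mathcal{G} = \{\mathcal{G}^{(1)}, \ldots, \mathcal{G}^{(|\mathcal{R}|)}\} = \{\mathcal{V}, \mathcal{E}, \mathbf{X}\}$, where $\mathcal{G}^{(r)} = \{\mathcal{V}, \mathcal{E}^{(r)}, \mathbf{X}\}$ is a graph of the relation type $r \in \mathcal{R}$, $\mathcal{V}$ is the set of $n$ nodes, $\mathcal{E} = \bigcup_{r \in \mathcal{R}} \mathcal{E}^{(r)} \subseteq \mathcal{V} \times \mathcal{V}$ is the set of all edges with relation type $r \in \mathcal{R}$, and $\mathbf{X} \in \mathbb{R}^{n \times f}$ is a matrix that encodes node attributes information for $n$ nodes. Note that $|\mathcal{R}| > 1$ for multiplex networks, and $|\mathcal{R}| = 1$ for a single network. Given the network $\mathcal{G}$, $\mathcal{A} = \{\mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(|\mathcal{R}|)}\}$ is a set of adjacency matrices, where $\mathbf{A}^{(r)} \in \{0, 1\}^{n \times n}$ is an adjacency matrix of the network $\mathcal{G}^{(r)}$.
Task: Unsupervised Attributed Multiplex Network Embedding. Given an attributed multiplex network $\mathcal{G} = \{\mathcal{V}, \mathcal{E}, \mathbf{X}\}$, and the set of adjacency matrices $\mathcal{A}$, the task of unsupervised attributed multiplex network embedding is to learn a $d$-dimensional vector representation $\mathbf{z}_i \in \mathbb{R}^d$ for each node $v_i \in \mathcal{V}$ without using any labels.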
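As an illustration of Definition 1, the following minimal sketch stores an attributed multiplex network as one shared attribute matrix plus one adjacency matrix per relation type. The class name `AttributedMultiplexNetwork`, its fields, and the toy values are hypothetical and chosen only for this example; they are not taken from the authors' implementation.

```python
from dataclasses import dataclass
from typing import Dict, List

Matrix = List[List[float]]


@dataclass
class AttributedMultiplexNetwork:
    """Shared node set and attributes, one adjacency matrix per relation type."""

    attributes: Matrix            # X, shape (n, f)
    adjacency: Dict[str, Matrix]  # relation type r -> A^(r), shape (n, n)

    def __post_init__(self) -> None:
        # Every relation type must be defined over the same n nodes.
        n = len(self.attributes)
        for rel, adj in self.adjacency.items():
            if len(adj) != n or any(len(row) != n for row in adj):
                raise ValueError(f"adjacency for {rel!r} must be {n} x {n}")

    @property
    def num_nodes(self) -> int:
        return len(self.attributes)

    @property
    def is_multiplex(self) -> bool:
        # |R| > 1 for a multiplex network, |R| = 1 for a single network.
        return len(self.adjacency) > 1


# Toy publication network (made-up values): 3 papers, 2 attributes,
# connected by the relation types "PAP" and "PSP".
toy = AttributedMultiplexNetwork(
    attributes=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    adjacency={
        "PAP": [[0, 1, 0], [1, 0, 0], [0, 0, 0]],
        "PSP": [[0, 0, 1], [0, 0, 1], [1, 1, 0]],
    },
)
```

Keeping a single attribute matrix and a dictionary keyed by relation type mirrors the definition, in which all graphs $\mathcal{G}^{(r)}$ share the same nodes and attributes and differ only in their edges.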
3 Unsupervised Attributed Multiplex Network Embedding
We begin by introducing Deep Graph Infomax (DGI) [33], then discuss its limitations, and present our proposed method.
Deep Graph Infomax (DGI). Veličković et al. [33] proposed an unsupervised method for learning node representations, called DGI, that relies on the infomax principle [18]. More precisely, DGI aims to learn a low-dimensional vector representation for each node $v_i$, i.e., $\mathbf{h}_i \in \mathbb{R}^d$, such that the average mutual information (MI) between the graph-level (global) summary representation $\mathbf{s} \in \mathbb{R}^d$, and the representations of the local patches $\{\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_n\}$ is maximized. To this end, DGI introduces a discriminator $\mathcal{D}$ that discriminates the true samples, i.e., $(\mathbf{h}_i, \mathbf{s})$, from their negative counterparts, i.e., $(\tilde{\mathbf{h}}_j, \mathbf{s})$:

$$\mathcal{L} = \sum_{v_i \in \mathcal{V}} \log \mathcal{D}(\mathbf{h}_i, \mathbf{s}) + \sum_{j=1}^{n} \log\left(1 - \mathcal{D}(\tilde{\mathbf{h}}_j, \mathbf{s})\right) \tag{1}$$

where $\mathbf{h}_i = \sigma\left(\sum_{j \in N(i)} \frac{1}{c_{ij}} \mathbf{x}_j \mathbf{W}\right)$, $N(i)$ is the set of neighboring nodes of $v_i$ including $v_i$ itself, $\mathbf{W} \in \mathbb{R}^{f \times d}$, and $c_{ij}$ is a normalizing constant for edge $(v_i, v_j)$, $\mathbf{s} = \sigma\left(\frac{1}{n} \sum_{i=1}^{n} \mathbf{h}_i\right)$, and $\sigma$ is the sigmoid nonlinearity. The negative patch representation $\tilde{\mathbf{h}}_j$ is obtained by row-wise shuffling the original attribute matrix $\mathbf{X}$. Veličković et al. [33] theoretically proved that the binary cross entropy loss shown in Eqn. 1 amounts to maximizing the mutual information (MI) between $\mathbf{h}_i$ and $\mathbf{s}$, based on the Jensen-Shannon divergence. Refer to Section 3.3 of [33] for the detailed proof. As the local patch representations $\{\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_n\}$ are learned to preserve the MI with the graph-level representation $\mathbf{s}$, each $\mathbf{h}_i$ is expected to capture the global properties of the entire graph.

Limitation. Despite its effectiveness, DGI is designed for a single attributed network, and thus it is not straightforward to apply it to a multiplex network. As a naive extension of DGI to a multiplex attributed network, we can independently apply DGI to each graph formed by each relation type, and then compute the average of the embeddings obtained from each graph to get the final node representations. However, we argue that this fails to model the multiplexity of the network, because the interactions among the node embeddings from different relation types are not captured. Thus, we need a more systematic way to integrate multiple independent models to obtain the final consensus embedding that every model can agree on.
3.1 Deep Multiplex Graph Infomax: DMGI
We present our unsupervised method for embedding an attributed multiplex network. We first describe how to independently model each graph pertaining to each relation type, then explain how to jointly integrate them to finally obtain the consensus node embedding matrix.
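Relation-type specific Node Embedding. For each relation type $r \in \mathcal{R}$, we introduce a relation-type specific node encoder $g_r: \mathbb{R}^{n \times f} \times \mathbb{R}^{n \times n} \rightarrow \mathbb{R}^{n \times d}$ to generate the relation-type specific node embedding matrix $\mathbf{H}^{(r)}$ of nodes in $\mathcal{G}^{(r)}$. The encoder is a single-layer GCN:

$$\mathbf{H}^{(r)} = g_r\left(\mathbf{X}, \mathbf{A}^{(r)} \mid \mathbf{W}^{(r)}\right) = \sigma\left(\hat{\mathbf{D}}_r^{-\frac{1}{2}} \hat{\mathbf{A}}^{(r)} \hat{\mathbf{D}}_r^{-\frac{1}{2}} \mathbf{X} \mathbf{W}^{(r)}\right) \tag{2}$$

where $\hat{\mathbf{A}}^{(r)} = \mathbf{A}^{(r)} + w \mathbf{I}_n$, $\hat{D}_{ii} = \sum_j \hat{A}_{ij}$, $\mathbf{W}^{(r)} \in \mathbb{R}^{f \times d}$ is a trainable weight matrix of the relation-type specific encoder $g_r$, and $\sigma$ is the ReLU nonlinearity. Unlike conventional GCNs [13], we control the weight of the self-connections by introducing a weight $w \in \mathbb{R}$. A larger $w$ indicates that the node itself plays a more important role in generating its embedding, which in turn diminishes the importance of its neighboring nodes. Then, we compute the graph-level summary representation $\mathbf{s}^{(r)}$ that summarizes the global content of the graph $\mathcal{G}^{(r)}$. We employ a readout function $\text{Readout}: \mathbb{R}^{n \times d} \rightarrow \mathbb{R}^{d}$:

$$\mathbf{s}^{(r)} = \text{Readout}\left(\mathbf{H}^{(r)}\right) = \sigma\left(\frac{1}{n} \sum_{i=1}^{n} \mathbf{h}_i^{(r)}\right) \tag{3}$$

where $\sigma$ is the logistic sigmoid nonlinearity, and $\mathbf{h}_i^{(r)}$ denotes the $i$-th row vector of the matrix $\mathbf{H}^{(r)}$. We also note that various pooling methods such as maxpool, and SAGPool [15] can be used as $\text{Readout}(\cdot)$.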
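The encoder in Eqn. 2 and the readout in Eqn. 3 can be sketched in plain Python as follows. This is a minimal illustration under assumptions, not the authors' PyTorch code: the function names `gcn_encode` and `readout`, the toy matrices, and the value `w = 3.0` are hypothetical, and the sketch assumes `w > 0` so that every node has a positive degree after the self-connection is added.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_encode(x: Matrix, adj: Matrix, weight: Matrix, w: float) -> Matrix:
    """Single GCN layer: ReLU(D^-1/2 (A + w I) D^-1/2 X W), assuming w > 0."""
    n = len(adj)
    # Add self-connections with weight w (A_hat = A + w I).
    a_hat = [[adj[i][j] + (w if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetric degree normalization.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    pre = matmul(matmul(norm, x), weight)
    return [[max(0.0, v) for v in row] for row in pre]


def readout(h: Matrix) -> List[float]:
    """Graph-level summary: sigmoid of the column-wise mean of the node embeddings."""
    n = len(h)
    return [1.0 / (1.0 + math.exp(-sum(col) / n)) for col in zip(*h)]


# Toy inputs (made-up values): 3 nodes, 2 attributes, 2 embedding dimensions.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adj = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
weight = [[0.5, -0.5], [0.25, 1.0]]

h = gcn_encode(x, adj, weight, w=3.0)
s = readout(h)
```

In this toy graph the third node is isolated, so its normalized self-connection equals 1 and its embedding reduces to the ReLU of its own projected attributes. For connected nodes, increasing `w` shifts weight from the neighbors to the node itself.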
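Next, given the relation-type specific node embedding matrix $\mathbf{H}^{(r)}$, and its graph-level summary representation $\mathbf{s}^{(r)}$, we compute the relation-type specific cross entropy:

$$\mathcal{L}^{(r)} = \sum_{v_i \in \mathcal{V}} \log \mathcal{D}\left(\mathbf{h}_i^{(r)}, \mathbf{s}^{(r)}\right) + \sum_{j=1}^{n} \log\left(1 - \mathcal{D}\left(\tilde{\mathbf{h}}_j^{(r)}, \mathbf{s}^{(r)}\right)\right) \tag{4}$$

where $\mathcal{D}: \mathbb{R}^{d} \times \mathbb{R}^{d} \rightarrow \mathbb{R}$ is a discriminator that scores patch-summary representation pairs, i.e., $(\mathbf{h}_i^{(r)}, \mathbf{s}^{(r)})$. In this paper, we apply a simple bilinear scoring function as it empirically performs the best in our experiments:

$$\mathcal{D}\left(\mathbf{h}_i^{(r)}, \mathbf{s}^{(r)}\right) = \sigma\left(\mathbf{h}_i^{(r)\top} \mathbf{M}^{(r)} \mathbf{s}^{(r)}\right) \tag{5}$$

where $\sigma$ is the logistic sigmoid nonlinearity, and $\mathbf{M}^{(r)} \in \mathbb{R}^{d \times d}$ is a trainable scoring matrix. To generate the negative node embedding $\tilde{\mathbf{h}}_j^{(r)}$, we corrupt the original attribute matrix by shuffling it in the row-wise manner [33], i.e., $\mathbf{X} \rightarrow \tilde{\mathbf{X}}$, and reuse the encoder in Eqn. 2, i.e., $\tilde{\mathbf{H}}^{(r)} = g_r(\tilde{\mathbf{X}}, \mathbf{A}^{(r)} \mid \mathbf{W}^{(r)})$.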
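The bilinear discriminator, the row-wise corruption, and the resulting binary cross entropy can be sketched as below. This is a hedged illustration: the function names and toy values are hypothetical, and the loss is written as the negated and averaged log-likelihood (the quantity a gradient-based optimizer would minimize), which is an assumption about sign and scaling rather than a statement of the authors' exact implementation.

```python
import math
import random
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def sigmoid(t: float) -> float:
    # Plain logistic function; adequate for the small toy values used here.
    return 1.0 / (1.0 + math.exp(-t))


def bilinear_score(h_i: Vector, m: Matrix, s: Vector) -> float:
    """Bilinear discriminator: sigmoid(h_i^T M s)."""
    ms = [sum(a * b for a, b in zip(row, s)) for row in m]
    return sigmoid(sum(a * b for a, b in zip(h_i, ms)))


def shuffle_rows(x: Matrix, rng: random.Random) -> Matrix:
    """Corrupt the attribute matrix by permuting its rows (nodes)."""
    rows = [row[:] for row in x]
    rng.shuffle(rows)
    return rows


def infomax_bce(h: Matrix, h_neg: Matrix, m: Matrix, s: Vector) -> float:
    """Negated, averaged log-likelihood of true vs. corrupted patch-summary pairs."""
    eps = 1e-12  # guards against log(0)
    pos = sum(math.log(bilinear_score(hi, m, s) + eps) for hi in h)
    neg = sum(math.log(1.0 - bilinear_score(hj, m, s) + eps) for hj in h_neg)
    return -(pos + neg) / (len(h) + len(h_neg))


# Toy check (made-up values): a discriminator that already separates the pairs
# yields a lower loss than one that confuses them.
identity = [[1.0, 0.0], [0.0, 1.0]]
summary = [1.0, 1.0]
loss_good = infomax_bce([[2.0, 2.0]], [[-2.0, -2.0]], identity, summary)
loss_bad = infomax_bce([[-2.0, -2.0]], [[2.0, 2.0]], identity, summary)

x_corrupt = shuffle_rows([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], random.Random(0))
```

The corrupted matrix contains exactly the same rows as the original, only assigned to different nodes, so the negative embeddings are produced by passing it through the same encoder.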
Joint Modeling and Consensus Regularization.
So far, by independently maximizing the average MI between the local patches $\{\mathbf{h}_1^{(r)}, \ldots, \mathbf{h}_n^{(r)}\}$ and the graph-level summary $\mathbf{s}^{(r)}$ pertaining to each graph $\mathcal{G}^{(r)}$, we obtained the relation-type specific node embedding matrix $\mathbf{H}^{(r)}$ that captures the global information in $\mathcal{G}^{(r)}$. However, as each $\mathbf{H}^{(r)}$ is trained independently for each $r \in \mathcal{R}$, these embedding matrices only contain relevant information regarding each relation type, and therefore fail to take advantage of the multiplexity of the network. This motivates us to develop a systematic way to jointly integrate the embeddings from different relation types, so that they can mutually help each other learn high-quality embeddings.
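To this end, we introduce the consensus embedding matrix $\mathbf{Z} \in \mathbb{R}^{n \times d}$ on which every relation-type specific node embedding matrix $\mathbf{H}^{(r)}$ can agree. More precisely, we introduce the consensus regularization framework that consists of 1) a regularizer minimizing the disagreements between the set of original node embeddings, i.e., $\{\mathbf{H}^{(r)} \mid r \in \mathcal{R}\}$, and the consensus embedding $\mathbf{Z}$, and 2) another regularizer maximizing the disagreement between the corrupted node embeddings, i.e., $\{\tilde{\mathbf{H}}^{(r)} \mid r \in \mathcal{R}\}$, and the consensus embedding $\mathbf{Z}$, which are formulated as follows:

$$\ell_{cs} = \left[\mathbf{Z} - \mathcal{Q}\left(\{\mathbf{H}^{(r)} \mid r \in \mathcal{R}\}\right)\right]^2 - \left[\mathbf{Z} - \mathcal{Q}\left(\{\tilde{\mathbf{H}}^{(r)} \mid r \in \mathcal{R}\}\right)\right]^2 \tag{6}$$

where $\mathcal{Q}$ is an aggregation function that combines a set of node embedding matrices from multiple relation types into a single embedding matrix, i.e., $\mathbf{H} \in \mathbb{R}^{n \times d}$. $\mathcal{Q}$ can be any pooling method that can handle permutation invariant input, such as set2set [34] or Set Transformer [14]. However, considering the efficiency of the method, we simply employ average pooling, i.e., computing the average of the set of embedding matrices:

$$\mathbf{H} = \mathcal{Q}\left(\{\mathbf{H}^{(r)} \mid r \in \mathcal{R}\}\right) = \frac{1}{|\mathcal{R}|} \sum_{r \in \mathcal{R}} \mathbf{H}^{(r)} \tag{7}$$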
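A small numerical sketch of the consensus regularizer with average pooling follows. It assumes that the squared bracket in the regularizer denotes the sum of squared entries of the difference matrix; this reading, the function names, and the toy matrices are assumptions made for illustration only.

```python
from typing import List

Matrix = List[List[float]]


def average_pool(mats: List[Matrix]) -> Matrix:
    """Element-wise mean of a list of equally shaped matrices (average pooling)."""
    count = len(mats)
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(m[i][j] for m in mats) / count for j in range(cols)] for i in range(rows)]


def squared_distance(a: Matrix, b: Matrix) -> float:
    """Sum of squared entry-wise differences between two matrices."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def consensus_regularizer(z: Matrix, h_list: List[Matrix], h_neg_list: List[Matrix]) -> float:
    """Pull Z toward the pooled true embeddings and push it away from the corrupted ones."""
    agree = squared_distance(z, average_pool(h_list))
    disagree = squared_distance(z, average_pool(h_neg_list))
    return agree - disagree


# Toy embeddings for two relation types (made-up values).
h_true = [[[1.0, 0.0], [0.0, 1.0]], [[3.0, 0.0], [0.0, 3.0]]]
h_fake = [[[0.0, 1.0], [1.0, 0.0]], [[0.0, 3.0], [3.0, 0.0]]]

z_good = [[2.0, 0.0], [0.0, 2.0]]  # equals the average of the true embeddings
z_bad = [[0.0, 2.0], [2.0, 0.0]]   # equals the average of the corrupted embeddings

reg_good = consensus_regularizer(z_good, h_true, h_fake)
reg_bad = consensus_regularizer(z_bad, h_true, h_fake)
```

A consensus matrix that matches the pooled true embeddings and is far from the pooled corrupted ones gives a lower value, which is the behavior the two terms of the regularizer are designed to reward.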
It is important to note that the scoring matrix $\mathbf{M}^{(r)}$ in Eqn. 5 is shared among all the relations $r \in \mathcal{R}$, i.e., $\mathbf{M} = \mathbf{M}^{(r)}\ \forall r \in \mathcal{R}$. The intuition is to learn the universal discriminator that is capable of scoring the true pairs higher than the negative pairs regardless of the relation types. We argue that the universal discriminator facilitates the joint modeling of different relation types together with the consensus regularization.
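Finally, we jointly optimize the sum of all the relation-type specific losses in Eqn. 4, and the consensus regularization in Eqn. 6 to obtain the final objective as follows:

$$\mathcal{J} = \sum_{r \in \mathcal{R}} \mathcal{L}^{(r)} + \alpha \ell_{cs} + \beta \|\Theta\| \tag{8}$$

where $\alpha$ controls the importance of the consensus regularization, $\beta$ is a coefficient for l2 regularization on $\Theta$, which is a set of trainable parameters, i.e., $\Theta = \{\{\mathbf{W}^{(r)} \mid r \in \mathcal{R}\}, \mathbf{M}, \mathbf{Z}\}$, and $\mathcal{J}$ is optimized by the Adam optimizer. Figure 1 illustrates the overview of DMGI.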

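Table 1: Statistics of the datasets.

| Dataset | Relations (A-B) | Num. A | Num. B | Num. A-B | Relation type | Num. relations | Num. node attributes | Num. labeled data | Num. classes |
|---|---|---|---|---|---|---|---|---|---|
| ACM | Paper-Author | 3,025 | 5,835 | 9,744 | PAP | 29,281 | 1,830 (Paper abstract) | 600 | 3 |
| | Paper-Subject | 3,025 | 56 | 3,025 | PSP | 2,210,761 | | | |
| IMDB | Movie-Actor | 3,550 | 4,441 | 10,650 | MAM | 66,428 | 1,007 (Movie plot) | 300 | 3 |
| | Movie-Director | 3,550 | 1,726 | 3,550 | MDM | 13,788 | | | |
| DBLP | Paper-Author | 7,907 | 1,960 | 14,238 | PAP | 144,783 | 2,000 (Paper abstract) | 80 | 4 |
| | Paper-Paper | 7,907 | 7,907 | 10,522 | PPP | 90,145 | | | |
| | Author-Term | 1,960 | 1,975 | 57,269 | PATAP | 57,137,515 | | | |
| Amazon | Item-Item | 7,621 | 7,621 | 38,514 | Also-view | 266,237 | 2,000 (Item description) | 80 | 4 |
| | | | | 45,446 | Also-bought | 1,104,257 | | | |
| | | | | 9,783 | Bought-together | 16,305 | | | |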
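Discussion. Despite its efficiency, the above average pooling scheme in Eqn. 7 treats all the relations equally, whereas, as will be shown in the experiments, some relation types are more beneficial for a certain downstream task than others. For example, the co-authorship information between two papers plays a more significant role in predicting the topic of a paper compared with their citation information; eventually, these two kinds of information mutually help each other to more accurately predict the topic of a paper. Therefore, we can adopt the attention mechanism [1] to distinguish between different relation types as follows:

$$\mathbf{h}_i = \mathcal{Q}\left(\{\mathbf{h}_i^{(r)} \mid r \in \mathcal{R}\}\right) = \sum_{r \in \mathcal{R}} a_i^{(r)} \mathbf{h}_i^{(r)} \tag{9}$$

where $a_i^{(r)}$ denotes the importance of relation $r$ in generating the final embedding of node $v_i$ defined as:

$$a_i^{(r)} = \frac{\exp\left(\mathbf{q}^{(r)} \cdot \mathbf{h}_i^{(r)}\right)}{\sum_{r' \in \mathcal{R}} \exp\left(\mathbf{q}^{(r')} \cdot \mathbf{h}_i^{(r')}\right)} \tag{10}$$

where $\mathbf{q}^{(r)} \in \mathbb{R}^{d}$ is the feature vector of relation $r$.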
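The attention-based aggregation can be sketched as a per-node softmax over relation types. The function name `attention_pool`, the relation keys, and the toy vectors are hypothetical, and subtracting the maximum logit before exponentiating is a standard numerical-stability step added here rather than something stated in the text.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def attention_pool(
    h_by_rel: Dict[str, Matrix], q_by_rel: Dict[str, Vector]
) -> Tuple[Matrix, List[Dict[str, float]]]:
    """Combine relation-specific embeddings with per-node softmax attention weights."""
    rels = list(h_by_rel)
    n = len(h_by_rel[rels[0]])
    d = len(h_by_rel[rels[0]][0])
    pooled: Matrix = []
    weights: List[Dict[str, float]] = []
    for i in range(n):
        # Unnormalized importance of each relation for node i: q^(r) . h_i^(r).
        logits = {r: sum(q * h for q, h in zip(q_by_rel[r], h_by_rel[r][i])) for r in rels}
        top = max(logits.values())  # subtract the max for numerical stability
        exps = {r: math.exp(logits[r] - top) for r in rels}
        total = sum(exps.values())
        a_i = {r: exps[r] / total for r in rels}
        pooled.append([sum(a_i[r] * h_by_rel[r][i][k] for r in rels) for k in range(d)])
        weights.append(a_i)
    return pooled, weights


# Toy embeddings for two relation types (made-up values).
h_by_rel = {"PAP": [[1.0, 0.0], [0.0, 2.0]], "PSP": [[0.0, 1.0], [2.0, 0.0]]}

# Identical relation vectors give equal logits here, so attention reduces to averaging.
uniform_q = {"PAP": [1.0, 1.0], "PSP": [1.0, 1.0]}
pooled_uniform, weights_uniform = attention_pool(h_by_rel, uniform_q)

# A much larger relation vector for "PAP" concentrates the weight on that relation.
skewed_q = {"PAP": [10.0, 10.0], "PSP": [1.0, 1.0]}
pooled_skewed, weights_skewed = attention_pool(h_by_rel, skewed_q)
```

When the logits are equal across relation types the result coincides with average pooling, while strongly unequal logits make the pooled embedding close to that of the dominant relation.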
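Extension to Semi-Supervised Learning.
It is important to note that DMGI is trained in a fully unsupervised manner. However, in reality, nodes are sometimes associated with label information, which can guide the training of node embeddings even when only a small amount is available [13, 25]. To this end, we introduce a semi-supervised module into our framework that predicts the labels of labeled nodes from the consensus embedding $\mathbf{Z}$. More precisely, we minimize the cross-entropy error over the labeled nodes:

$$\ell_{sup} = -\frac{1}{|\mathcal{Y}_L|} \sum_{l \in \mathcal{Y}_L} \sum_{i=1}^{c} Y_{li} \ln \hat{Y}_{li} \tag{11}$$

where $\mathcal{Y}_L$ is the set of node indices with labels, $\mathbf{Y} \in \mathbb{R}^{n \times c}$ is the ground truth label over $c$ classes, $\hat{\mathbf{Y}} = \text{softmax}(f(\mathbf{Z}))$ is the output of a softmax layer, and $f: \mathbb{R}^{n \times d} \rightarrow \mathbb{R}^{n \times c}$ is a classifier that predicts the label of a node from its embedding, which is a single fully connected layer in this work. The final objective function with the semi-supervised module is:

$$\mathcal{J}_{semi} = \sum_{r \in \mathcal{R}} \mathcal{L}^{(r)} + \alpha \ell_{cs} + \beta \|\Theta\| + \gamma \ell_{sup} \tag{12}$$

where $\gamma$ is the coefficient of the semi-supervised module.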
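The supervised term can be illustrated with a single fully connected layer followed by a softmax, averaged over the labeled nodes only. The names `supervised_loss`, `weight`, and `bias`, as well as the toy values, are hypothetical; labels are passed as a mapping from node index to class index, which is one possible encoding of the one-hot label matrix.

```python
import math
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def softmax(v: Vector) -> Vector:
    """Numerically stable softmax over one vector of logits."""
    top = max(v)
    exps = [math.exp(t - top) for t in v]
    total = sum(exps)
    return [e / total for e in exps]


def supervised_loss(z: Matrix, labels: Dict[int, int], weight: Matrix, bias: Vector) -> float:
    """Mean cross-entropy over labeled nodes for a single linear layer plus softmax."""
    loss = 0.0
    for node, cls in labels.items():
        # Linear classifier: logits_k = z_node . weight[:, k] + bias_k.
        logits = [
            sum(z[node][j] * weight[j][k] for j in range(len(weight))) + bias[k]
            for k in range(len(bias))
        ]
        loss -= math.log(softmax(logits)[cls])
    return loss / len(labels)


# Toy consensus embeddings for 3 nodes; only nodes 0 and 2 carry labels (made-up values).
z = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
labels = {0: 0, 2: 1}

# An untrained all-zero classifier predicts a uniform distribution over 3 classes.
zero_weight = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
loss_untrained = supervised_loss(z, labels, zero_weight, [0.0, 0.0, 0.0])

# A classifier aligned with the labels produces a smaller loss.
aligned_weight = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
loss_aligned = supervised_loss(z, labels, aligned_weight, [0.0, 0.0, 0.0])
```

Because unlabeled nodes never enter the sum, this term only affects the embeddings of labeled nodes directly; the remaining nodes are influenced through the shared consensus embedding and the other terms of the objective.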
4 Experiments
Dataset. To make fair comparisons with HAN [37], which is the most relevant baseline method, we evaluate our proposed method on the datasets used in their original paper [37], i.e., ACM, DBLP, and IMDB. We used the publicly available ACM dataset [37], and preprocessed the DBLP and IMDB datasets. For the ACM and DBLP datasets, the task is to classify the papers into three classes (Database, Wireless Communication, Data Mining), and four classes (DM, AI, CV, NLP; DM: KDD, WSDM, ICDM; AI: ICML, AAAI, IJCAI; CV: CVPR; NLP: ACL, NAACL, EMNLP), respectively, according to the research topic. For the IMDB dataset, the task is to classify the movies into three classes (Action, Comedy, Drama). We note that the above datasets used by previous work are not truly multiplex in nature because the multiplexity between nodes is inferred via intermediate nodes (e.g., in ACM, Paper-Paper relationships are inferred via Authors and Subjects that connect two Papers, i.e., “PAP” and “PSP”). Thus, to make our evaluation more practical, we used the Amazon dataset [10], which genuinely contains a multiplex network of items, i.e., also-viewed, also-bought, and bought-together relations between items. We used datasets from four categories, i.e., Beauty, Automotive, Patio Lawn and Garden, and Baby, and the task is to classify items into the four classes. We chose these categories because the three types of item-item relations from these categories are similar in number. For the ACM and IMDB datasets, we used the same number of labeled data as in [37] for fair comparisons, and for the remaining datasets, we used 20 labeled data for each class. Table 1 summarizes the data statistics.
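Table 2: Properties of the compared methods (Mult.: multiplexity, Attr.: attributes, Unsup.: unsupervised, Glo.: global information).

| Method | Mult. | Attr. | Unsup. | Glo. |
|---|---|---|---|---|
| Dw/n2v | ✗ | ✗ | ✓ | ✗ |
| GCN/GAT | ✗ | ✓ | ✗ | ✗ |
| DGI | ✗ | ✓ | ✓ | ✗ |
| ANRL | ✗ | ✓ | ✓ | ✓ |
| CAN | ✗ | ✓ | ✓ | ✗ |
| DGCN | ✗ | ✓ | ✗ | ✓ |
| CMNA | ✓ | ✗ | ✓ | ✓ |
| MNE | ✓ | ✗ | ✓ | ✗ |
| mGCN | ✓ | ✗ | ✓ | ✗ |
| HAN | ✓ | ✓ | ✗ | ✗ |
| DMGI | ✓ | ✓ | ✓ | ✓ |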
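Table 3: Performance on the unsupervised tasks (node clustering, NMI; similarity search, Sim@5).

| Method | ACM NMI | ACM Sim@5 | IMDB NMI | IMDB Sim@5 | DBLP NMI | DBLP Sim@5 | Amazon NMI | Amazon Sim@5 |
|---|---|---|---|---|---|---|---|---|
| Deepwalk | 0.310 | 0.710 | 0.117 | 0.490 | 0.348 | 0.629 | 0.083 | 0.726 |
| node2vec | 0.309 | 0.710 | 0.123 | 0.487 | 0.382 | 0.629 | 0.074 | 0.738 |
| GCN/GAT | 0.671 | 0.867 | 0.176 | 0.565 | 0.465 | 0.724 | 0.287 | 0.624 |
| DGI | 0.640 | 0.889 | 0.182 | 0.578 | 0.551 | 0.786 | 0.007 | 0.558 |
| ANRL | 0.515 | 0.814 | 0.163 | 0.527 | 0.332 | 0.720 | 0.166 | 0.763 |
| CAN | 0.504 | 0.836 | 0.074 | 0.544 | 0.323 | 0.792 | 0.001 | 0.537 |
| DGCN | 0.691 | 0.690 | 0.143 | 0.179 | 0.462 | 0.491 | 0.143 | 0.194 |
| CMNA | 0.498 | 0.363 | 0.152 | 0.069 | 0.420 | 0.511 | 0.070 | 0.435 |
| MNE | 0.545 | 0.791 | 0.013 | 0.482 | 0.136 | 0.711 | 0.001 | 0.395 |
| mGCN | 0.668 | 0.873 | 0.183 | 0.550 | 0.468 | 0.726 | 0.301 | 0.630 |
| HAN | 0.658 | 0.872 | 0.164 | 0.561 | 0.472 | 0.779 | 0.029 | 0.495 |
| DMGI | 0.687 | 0.898 | 0.196 | 0.605 | 0.409 | 0.766 | 0.425 | 0.816 |
| DMGI$_{\text{attn}}$ | 0.702 | 0.901 | 0.185 | 0.586 | 0.554 | 0.798 | 0.412 | 0.825 |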
Methods Compared.

- Embedding methods for a single network
  - No attributes: Deepwalk [24], node2vec [8]: They learn node embeddings from random walks with the skip-gram model.
  - Attributed network embedding: GCN [13], GAT [32]: They learn node embeddings based on local neighborhood structures. As they perform similarly, we report the best performing method among them; DGI [33]: It maximizes the MI between the graph-level summary representation and the local patches; ANRL [42]: It uses a neighbor enhancement autoencoder to model the node attribute information, and the skip-gram model to capture the network structure; CAN [22]: It learns embeddings of both attributes and nodes in the same semantic space; DGCN [44]: It models the local and global properties of a graph by employing dual GCNs.
- Multiplex embedding methods
  - No attributes: CMNA [3]: It leverages the cross-network information to refine the inter-vector for network alignment and the intra-vector for other downstream tasks. We use the intra-vector for our evaluations; MNE [41]: It jointly models multiple networks by introducing a common embedding, and an additional embedding for each relation type.
  - Attributed multiplex network embedding: mGCN [21], HAN [37]: They apply GCNs and GATs on the multiplex network, considering the inter- and intra-network interactions. For fair comparisons, we initialized the node embeddings of mGCN by using the node attribute matrix, although the node attribute information is ignored in the original mGCN; DMGI$_{\text{attn}}$: DMGI with the attention mechanism (Eqn. 9).

For the sake of fair comparisons with DMGI, which considers the node attributes, we concatenated the raw attribute matrix $\mathbf{X}$ to the learned node embeddings $\mathbf{Z}$ of the methods that ignore the node attributes, i.e., Deepwalk, node2vec, CMNA, and MNE; that is, $\mathbf{Z} \leftarrow [\mathbf{Z}; \mathbf{X}]$. Moreover, regarding the embedding methods for a single network, i.e., the methods that belong to the first category in the above list, we obtain the final node embedding matrix $\mathbf{Z}$ by computing the average of the node embeddings obtained from each single graph, i.e., $\mathbf{Z} = \frac{1}{|\mathcal{R}|} \sum_{r \in \mathcal{R}} \mathbf{H}^{(r)}$. We provide a summary of the properties of the compared methods in Table 2.
Evaluation Metrics. Recall that DMGI is an unsupervised method that does not require any labeled data for training. Therefore, we evaluate the performance of DMGI in terms of node clustering and similarity search, both of which are classical performance measures for unsupervised methods. For node clustering, we use the most commonly used metric [37], i.e., Normalized Mutual Information (NMI). For similarity search, we compute the cosine similarity scores of the node embeddings between all pairs of nodes, and for each node, we rank the nodes according to the similarity score. Then, we calculate the ratio of the nodes that belong to the same class within the top-5 ranked nodes (Sim@5). Moreover, we also evaluate DMGI in terms of node classification. More precisely, after learning the node embeddings, we train a logistic regression classifier on the learned embeddings in the training set, and then evaluate on the nodes in the test set. We use Macro-F1 (MaF1) and Micro-F1 (MiF1) [37].

Experimental Settings. We randomly split each dataset into train/validation/test sets, and use an equal number of labeled data for the training and validation sets. We report the test performance at the point where the performance on the validation data is best. For DMGI, we fix the node embedding dimension $d$ and the self-connection weight $w$, and tune the coefficients $\alpha$, $\beta$, and $\gamma$. We implement DMGI in PyTorch (https://github.com/pcy1302/DMGI), and for all other methods, we used the source codes published by the authors, and tried to tune them to their best performance. More precisely, apart from the guidelines provided by the original papers, we tuned the learning rate and the coefficients for regularization from {0.0001, 0.0005, 0.001, 0.005} on the validation dataset. After learning the node embeddings, for fair comparisons, we conducted the evaluations within the same platform.

Table 4: Node classification results (MaF1 and MiF1).

| Method | ACM MaF1 | ACM MiF1 | IMDB MaF1 | IMDB MiF1 | DBLP MaF1 | DBLP MiF1 | Amazon MaF1 | Amazon MiF1 |
|---|---|---|---|---|---|---|---|---|
| Deepwalk | 0.739 | 0.748 | 0.532 | 0.550 | 0.533 | 0.537 | 0.663 | 0.671 |
| node2vec | 0.741 | 0.749 | 0.533 | 0.550 | 0.543 | 0.547 | 0.662 | 0.669 |
| GCN/GAT | 0.869 | 0.870 | 0.603 | 0.611 | 0.734 | 0.717 | 0.646 | 0.649 |
| DGI | 0.881 | 0.881 | 0.598 | 0.606 | 0.723 | 0.720 | 0.403 | 0.418 |
| ANRL | 0.819 | 0.820 | 0.573 | 0.576 | 0.770 | 0.699 | 0.692 | 0.690 |
| CAN | 0.590 | 0.636 | 0.577 | 0.588 | 0.702 | 0.694 | 0.498 | 0.499 |
| DGCN | 0.888 | 0.888 | 0.582 | 0.592 | 0.707 | 0.698 | 0.478 | 0.509 |
| CMNA | 0.782 | 0.788 | 0.549 | 0.566 | 0.566 | 0.561 | 0.657 | 0.665 |
| MNE | 0.792 | 0.797 | 0.552 | 0.574 | 0.566 | 0.562 | 0.556 | 0.567 |
| mGCN | 0.858 | 0.860 | 0.623 | 0.630 | 0.725 | 0.713 | 0.660 | 0.661 |
| HAN | 0.878 | 0.879 | 0.599 | 0.607 | 0.716 | 0.708 | 0.501 | 0.509 |
| DMGI | 0.898 | 0.898 | 0.648 | 0.648 | 0.771 | 0.766 | 0.746 | 0.748 |
| DMGI$_{\text{attn}}$ | 0.887 | 0.887 | 0.602 | 0.606 | 0.778 | 0.770 | 0.758 | 0.758 |
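The similarity-search metric described under Evaluation Metrics can be sketched as follows. This is an illustrative reading rather than the authors' evaluation script: the function name `sim_at_k` and the toy embeddings are hypothetical, and excluding the query node from its own ranking is an assumption, since the text does not say whether a node may retrieve itself.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def sim_at_k(embeddings: List[Vector], labels: List[int], k: int = 5) -> float:
    """Average fraction of same-class nodes among each node's k most similar nodes.

    Assumes at least two nodes; the query node is excluded from its own ranking.
    """
    ratios = []
    for i, query in enumerate(embeddings):
        others = [j for j in range(len(embeddings)) if j != i]
        ranked = sorted(others, key=lambda j: cosine(query, embeddings[j]), reverse=True)
        top = ranked[:k]
        ratios.append(sum(labels[j] == labels[i] for j in top) / len(top))
    return sum(ratios) / len(ratios)


# Toy embeddings forming two well-separated classes (made-up values).
emb = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
lab = [0, 0, 0, 1, 1, 1]

score_top2 = sim_at_k(emb, lab, k=2)  # each node's two nearest neighbors share its class
score_top5 = sim_at_k(emb, lab, k=5)  # all five other nodes are retrieved
```

With only six nodes, retrieving five neighbors necessarily includes all three nodes of the other class, so the second score is bounded by 2/5; on real datasets each class contains far more than five nodes.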
4.1 Performance Analysis
Overall evaluation. Table 3 and Table 4 show the evaluation results on the unsupervised and supervised tasks, respectively. We have the following observations: 1) Our proposed DMGI and DMGI$_{\text{attn}}$ outperform all the state-of-the-art baselines not only on the unsupervised tasks, but also on the supervised task, although the improvement is more significant on the unsupervised tasks, as expected. This verifies the benefit of our framework that models the multiplexity and the global property of a network together with the node attributes within a single framework. 2) Although DGI shows relatively good performance, the performance is unstable (poor performance on the Amazon dataset), indicating that multiple relation types should be jointly modeled. 3) Attribute-aware multiplex network embedding methods, such as mGCN and HAN, generally perform better than those that neglect the node attributes, i.e., CMNA and MNE, even though we concatenated node attributes to the node embeddings. This verifies not only the benefit of modeling the node attributes, but also that the attributes should be systematically incorporated into the model. 4) Multiplex network embedding methods generally outperform single network embedding methods, although the gap is not significant. This verifies that the multiplexity of a network should be carefully modeled; otherwise, a simple aggregation of multiple relation-type specific embeddings learned from independent single network embedding methods may perform better.
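Table 5: Similarity search (Sim@5) for each relation type and for all relation types merged.

| Dataset | Rel. type | GCN/GAT | DGI | ANRL | DMGI | DMGI$_{\text{attn}}$ |
|---|---|---|---|---|---|---|
| ACM | PAP | 0.822 | 0.875 | 0.795 | | |
| | PSP | 0.721 | 0.675 | 0.694 | | |
| | Merged | 0.867 | 0.889 | 0.814 | 0.898 | 0.901 |
| IMDB | MAM | 0.485 | 0.484 | 0.495 | | |
| | MDM | 0.548 | 0.562 | 0.520 | | |
| | Merged | 0.566 | 0.578 | 0.527 | 0.605 | 0.586 |
| DBLP | PAP | 0.730 | 0.779 | 0.692 | | |
| | PPP | 0.456 | 0.477 | 0.680 | | |
| | PATAP | 0.431 | 0.409 | OOM | | |
| | Merged | 0.724 | 0.786 | 0.720 | 0.766 | 0.799 |
| Amazon | Also-view | 0.355 | 0.367 | 0.563 | | |
| | Also-bought | 0.357 | 0.381 | 0.516 | | |
| | Bought-together | 0.662 | 0.639 | 0.770 | | |
| | Merged | 0.624 | 0.558 | 0.764 | 0.816 | 0.825 |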
Effect of the attention mechanism. In Table 5, we show the performance of DMGI and DMGI$_{\text{attn}}$, together with the performance of single network embedding methods (GCN/GAT, DGI, and ANRL). We observe that DMGI$_{\text{attn}}$ outperforms DMGI on all of the datasets except IMDB. To analyze the reason for this, we first plot the distribution of the attention weights on the DBLP dataset over the training epochs in Figure 4. The upper plot in Figure 4 demonstrates that the attention weights eventually end up at both extremes, i.e., close to 0 or close to 1, and the lower plots show that most of the attention weight is dedicated to a single relation type, i.e., “PAP”, which actually turns out to be the most important relation among the three (see Table 5). This phenomenon is common to every dataset. Next, we look at the performance of the single network embedding methods, especially DGI, on each relation type in Table 5. We observe that in the ACM, DBLP, and Amazon datasets the performance is strongly biased toward a single relation type, whereas in the IMDB dataset, the “MAM” and “MDM” relations show relatively similar performance. To summarize our findings, since the attention mechanism tends to favor the single most important relation type (“PAP” in ACM, “MDM” in IMDB, “PAP” in DBLP, and “Bought-together” in Amazon), DMGI$_{\text{attn}}$ outperforms DMGI on datasets where one relation type significantly outperforms the others, i.e., ACM, DBLP, and Amazon, by removing the noise from other relations. On the other hand, for datasets where all the relations show relatively even performance, i.e., IMDB, extremely favoring a single well-performing relation type (“MDM”) is rather detrimental to the overall performance because the relation “MAM” should also be considered to some extent.

We also note that since the attention mechanism of DMGI$_{\text{attn}}$ can infer the importance of each relation type, we can filter out unnecessary relation types as a preprocessing step. To verify this, we evaluated on all possible combinations of relation types in the DBLP dataset (Table 6). We observe that by removing the relation “PATAP”, which turned out to be the least useful relation type in Table 5, DMGI$_{\text{attn}}$ obtains even better results than when using all the relation types, whereas for GCN/GAT and DGI, considering all the relation types still shows the best performance. This indicates that the attention mechanism can be useful for filtering out unnecessary relation types, which will especially come in handy when the number of relation types is large.
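Table 6: Node clustering (NMI) on the DBLP dataset for different combinations of relation types.

| Relation types | GCN/GAT | DGI | DMGI$_{\text{attn}}$ |
|---|---|---|---|
| PAP+PPP | 0.464 | 0.543 | 0.565 |
| PAP+PATAP | 0.458 | 0.535 | 0.017 |
| PPP+PATAP | 0.332 | 0.237 | 0.201 |
| All | 0.465 | 0.551 | 0.554 |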
Ablation study. To measure the impact of each component of DMGI$_{\text{attn}}$, we conduct ablation studies on the largest dataset, i.e., DBLP, in Table 7. We have the following observations: 1) As expected, the semi-supervised module specifically helps improve the node classification performance, which is a supervised task, whereas the performance on the unsupervised tasks remains on par. 2) Various readout functions, including ones that contain trainable weights (linear projection and SAGPool [15]), do not have much impact on the performance, which supports our use of average pooling. 3) The second term in Eqn. 6 indeed plays a significant role in the consensus regularization framework. 4) Sharing the scoring matrix $\mathbf{M}$ helps DMGI model the interaction among multiple relation types. 5) Node attributes are crucial for representation learning of nodes. 6) Shuffling the adjacency matrix instead of the attribute matrix deteriorates the model performance.
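Table 7: Ablation study on the DBLP dataset.

| Variant | MaF1 | NMI | Sim@5 |
|---|---|---|---|
| DMGI$_{\text{attn}}$ | 0.778 | 0.554 | 0.798 |
| 1) DMGI$_{\text{attn}}$ + Semi-supervised | 0.791 | 0.555 | 0.798 |
| 2) Readout (Eqn. 3): Random sample | 0.774 | 0.555 | 0.797 |
| 2) Readout (Eqn. 3): Maxpool | 0.778 | 0.552 | 0.802 |
| 2) Readout (Eqn. 3): Linear projection | 0.783 | 0.565 | 0.803 |
| 2) Readout (Eqn. 3): SAGPool | 0.797 | 0.563 | 0.797 |
| 3) Without 2nd term of Eqn. 6 | 0.749 | 0.448 | 0.787 |
| 4) No sharing of the scoring matrix $\mathbf{M}$ | 0.645 | 0.076 | 0.677 |
| 5) No attributes (Adj. as attribute) | 0.377 | 0.053 | 0.763 |
| 6) Neg sample: Shuffle adj. | 0.364 | 0.156 | 0.504 |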
5 Related Work
Network embedding. Network embedding methods aim at learning low-dimensional vector representations for nodes in a graph while preserving the network structure [24, 8, 31], and various other properties such as node attributes [42, 22], structural role [27], and node label information [12].
Multiplex Network embedding. A multiplex network, which is also known as a multi-view network [31, 29] or a multi-dimensional network [20, 21] in the literature, consists of multiple relation types among a set of single-typed nodes. It can be thought of as a special type of heterogeneous network [5, 6] with a single type of node and multiple types of edges. Therefore, a multiplex network calls for special attention because there is no need to consider the semantics between different types of nodes, which is often addressed by the concept of meta-path [30]. Distinguished from heterogeneous networks, a key challenge in multiplex network embedding is to learn a consensus embedding for each node by taking into account the interrelationship among the multiple graphs. In this regard, existing methods mainly focused on how to integrate the information from multiple graphs. HAN [37] employed a graph attention network [32] on each graph, and then applied the attention mechanism to merge the node representations learned from each graph by considering the importance of each graph. However, the existing methods either require labels for training [37, 25, 28], or overlook the node attributes [19, 38, 16, 29, 41, 23, 3]. Most recently, Ma et al. [21] proposed a graph convolutional network (GCN) based method called mGCN, which is not only unsupervised, but also naturally incorporates the node attributes by using GCNs. However, since it is based on GCNs that capture the local graph structure [39], it fails to fully model the global properties of a graph [44, 35, 33].
Attributed Network Embedding. Nodes in a network are often affiliated with various contents, such as abstract text in the publication network, user profiles in social networks, and item description text in movie database or item networks. Such networks are called attributed networks, and have been extensively studied [17, 9, 40, 42, 7, 43, 33, 22]. Their goal is to preserve not only the network structure, but also the node attribute proximity in learning representations. Recently, GCNs [13, 32, 33] have been widely praised for their seamless integration of the network structure and node attributes into a single framework.
Mutual Information.
It has recently become possible to compute the MI between high-dimensional input/output pairs of deep neural networks [2]. Several recent works adopted the infomax principle [18] to learn unsupervised representations in different domains, such as images [11], speech [26] and graphs [33]. More precisely, Veličković et al. [33] proposed Deep Graph Infomax (DGI) for learning representations of graph-structured inputs by maximizing the MI between a high-level global representation, and the local patches of a graph.
6 Conclusion
We presented a simple yet effective unsupervised method for embedding attributed multiplex networks. DMGI can jointly integrate the embeddings from multiple types of relations between nodes through the consensus regularization framework, and the universal discriminator. Moreover, the attention mechanism of DMGI can infer the importance of each relation type, which facilitates the preprocessing of the multiplex network. Experimental results on not only unsupervised tasks, but also a supervised task verify the superiority of our proposed framework.
References
[1] (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
[2] (2018) MINE: mutual information neural estimation. ICML.
[3] (2019) Cross-network embedding for multi-network alignment. In WWW.
[4] (2013) Mathematical formulation of multilayer networks. Physical Review X.
[5] (2017) Metapath2vec: scalable representation learning for heterogeneous networks. In KDD.
[6] (2017) Hin2vec: explore meta-paths in heterogeneous information networks for representation learning. In CIKM.
[7] (2018) Deep attributed network embedding. In IJCAI.
[8] (2016) Node2vec: scalable feature learning for networks. In KDD.
[9] (2017) Inductive representation learning on large graphs. In NIPS.
[10] (2016) Ups and downs: modeling the visual evolution of fashion trends with one-class collaborative filtering. In WWW.
[11] (2019) Learning deep representations by mutual information estimation and maximization. ICLR.
[12] (2017) Label informed attributed network embedding. In WSDM.
[13] (2016) Semi-supervised classification with graph convolutional networks. ICLR.
[14] (2019) Set transformer. ICML.
[15] (2019) Self-attention graph pooling. ICML.
[16] (2018) Multi-layered network embedding. In SDM.
[17] (2017) Attributed network embedding for learning in a dynamic environment. In CIKM.
[18] (1988) Self-organization in a perceptual network. Computer.
[19] (2017) Principled multilayer network embedding. In ICDMW.
[20] (2018) Multi-dimensional network embedding with hierarchical structure. In WSDM.
[21] (2019) Multi-dimensional graph convolutional networks. In SDM.
[22] (2019) Co-embedding attributed networks. In WSDM.
[23] (2018) Co-regularized deep multi-network embedding. In WWW.
[24] (2014) Deepwalk: online learning of social representations. In KDD.
[25] (2017) An attention-based collaboration framework for multi-view network representation learning. In CIKM.
 [26] (2018) Learning speaker representations with mutual information. arXiv preprint arXiv:1812.00271. Cited by: §5.
 [27] (2017) Struc2vec: learning node representations from structural identity. In KDD, Cited by: §1, §5.
 [28] (2018) Modeling relational data with graph convolutional networks. In ESWC, Cited by: §1, §5.
 [29] (2018) Mvn2vec: preservation and collaboration in multiview network embedding. arXiv preprint arXiv:1801.06597. Cited by: §1, §5.
 [30] (2011) Pathsim: meta pathbased topk similarity search in heterogeneous information networks. VLDB. Cited by: §5.
 [31] (2015) Line: largescale information network embedding. In WWW, Cited by: §1, §5, §5.
 [32] (2017) Graph attention networks. ICLR. Cited by: §1, 2nd item, §5, §5.
 [33] (2019) Deep graph infomax. ICLR. Cited by: §1, §1, §3.1, §3, §3, 2nd item, §5, §5, §5.
 [34] (2015) Order matters: sequence to sequence for sets. NIPS. Cited by: §3.1.
 [35] (2016) Structural deep network embedding. In KDD, Cited by: §5.
 [36] (2017) Community preserving network embedding. In AAAI, Cited by: §1.
 [37] (2019) Heterogeneous graph attention network. In WWW, Cited by: §1, 2nd item, §4, §4, §5.
 [38] (2017) Multitask network embedding. In DSAA, Cited by: §5.
 [39] (2019) Lovasz convolutional networks. In AISTATS, Cited by: §1, §5.
 [40] (2015) Network representation learning with rich text information. In IJCAI, Cited by: §5.
 [41] (2018) Scalable multiplex network embedding. In AAAI, Cited by: §1, 1st item, §5.
 [42] (2018) ANRL: attributed network representation learning via deep neural networks.. In IJCAI, Cited by: §1, 2nd item, §5, §5.
 [43] (2018) Prre: personalized relation ranking embedding for attributed networks. In CIKM, Cited by: §5.
 [44] (2018) Dual graph convolutional networks for graphbased semisupervised classification. In WWW, Cited by: 2nd item, §5.