Unsupervised Attributed Multiplex Network Embedding

11/15/2019 ∙ by Chanyoung Park, et al. ∙ POSTECH, University of Illinois at Urbana-Champaign, Verizon Media

Nodes in a multiplex network are connected by multiple types of relations. However, most existing network embedding methods assume that only a single type of relation exists between nodes. Even for those that consider the multiplexity of a network, they overlook node attributes, resort to node labels for training, and fail to model the global properties of a graph. We present a simple yet effective unsupervised network embedding method for attributed multiplex network called DMGI, inspired by Deep Graph Infomax (DGI) that maximizes the mutual information between local patches of a graph, and the global representation of the entire graph. We devise a systematic way to jointly integrate the node embeddings from multiple graphs by introducing 1) the consensus regularization framework that minimizes the disagreements among the relation-type specific node embeddings, and 2) the universal discriminator that discriminates true samples regardless of the relation types. We also show that the attention mechanism infers the importance of each relation type, and thus can be useful for filtering unnecessary relation types as a preprocessing step. Extensive experiments on various downstream tasks demonstrate that DMGI outperforms the state-of-the-art methods, even though DMGI is fully unsupervised.




1 Introduction

Analyzing and mining useful knowledge in graphs has been an actively researched topic for decades in both academia and industry. Among various graph mining techniques, network embedding, which learns low-dimensional vector representations for nodes in a graph, has been shown to be especially effective for various network-based tasks [31, 36, 22].

However, most existing network embedding methods assume that only a single type of relation exists between nodes [32, 33, 13], whereas in reality networks are multiplex [4] in nature, i.e., with multiple types of relations. Taking the publication network as an example, two papers can be connected due to various reasons, such as authors (two papers are authored by a common author), citation (one paper cites the other), or keywords (two papers share common keywords). As another example, in a movie database network, two movies can be connected via a common director, or a common actor.

Although different types of relations can independently form different graphs, these graphs are related, and thus can mutually help each other for various downstream tasks. As a concrete example of the publication network, although it is hard to infer the topic of a paper only from its citations (citations can be diverse), also knowing other papers written by the same authors will help predict its topic, because authors usually work on a specific research topic. Furthermore, nodes in graphs may contain attribute information, which plays important roles in many applications [42]. For example, if we are additionally given the abstract of the papers in the publication network, it will be much easier to infer their topics. As such, the main challenge is to learn a consensus representation of a node that not only considers its multiplexity, but also its attributes.

Several recent studies have addressed multiplex network embedding; however, some issues remain that need further consideration. First, previous methods [25, 41, 29, 19] focus on the integration of multiple graphs, but overlook node attributes. Second, even those that consider node attributes [28, 37] require node labels for training. However, as node labeling is often expensive and time-consuming, it would be best if a method could show competitive performance even without any labels. Third, most of these methods fail to model the global properties of a graph, because they are based on the random walk-based skip-gram model or the graph convolutional network (GCN) [13], both of which are known to be effective for capturing the local graph structure [39]. More precisely, nodes that are “close” (i.e., within the same context window or neighborhood) in the graph are trained to have similar representations, whereas nodes that are far apart do not have similar representations, even though they are structurally similar [27].

Keeping these limitations in mind, we propose a simple yet effective unsupervised method for embedding attributed multiplex networks. The core building block of our proposed method is Deep Graph Infomax (DGI) [33] that aims to learn a node encoder that maximizes the mutual information between local patches of a graph, and the global representation of the entire graph. DGI is the workhorse method for our task, because it 1) naturally integrates the node attributes by using a GCN, 2) is trained in a fully unsupervised manner, and 3) captures the global properties of the entire graph. However, it is challenging to apply DGI, which is designed for embedding a single network, to a multiplex network in which the interactions among multiple relation types, and the importance of each relation type should be considered.

In this paper, we present a systematic way to jointly integrate the embeddings from multiple types of relations between nodes, so that they can mutually help each other learn high-quality embeddings useful for various downstream tasks. More precisely, we introduce the consensus regularization framework that minimizes the disagreements among the relation-type specific node embeddings, and the universal discriminator that discriminates true samples, i.e., ground truth “(graph-level summary, local patch)” pairs, regardless of the relation types. Moreover, we demonstrate that through the attention mechanism, we can infer the importance of each relation type in generating the consensus node embeddings, which can be used for filtering unnecessary relation types as a preprocessing step. Our extensive experiments demonstrate that our proposed method, Deep Multiplex Graph Infomax (DMGI), outperforms the state-of-the-art attributed multiplex network embedding methods in terms of node clustering, similarity search, and especially, node classification, even though DMGI is fully unsupervised.

2 Problem Statement

Definition 1.

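(Attributed Multiplex Network) An attributed multiplex network is a network $\mathcal{G} = \{\mathcal{G}^{(1)}, \ldots, \mathcal{G}^{(|\mathcal{R}|)}\} = \{\mathcal{V}, \mathcal{E}, \mathbf{X}\}$, where $\mathcal{G}^{(r)} = \{\mathcal{V}, \mathcal{E}^{(r)}, \mathbf{X}\}$ is a graph of the relation type $r \in \mathcal{R}$, $\mathcal{V}$ is the set of $n$ nodes, $\mathcal{E}^{(r)} \subseteq \mathcal{V} \times \mathcal{V}$ is the set of all edges with relation type $r$, and $\mathbf{X} \in \mathbb{R}^{n \times f}$ is a matrix that encodes node attributes information for $n$ nodes. Note that $|\mathcal{R}| > 1$ for multiplex networks, and $|\mathcal{R}| = 1$ for a single network. Given the network $\mathcal{G}$, $\mathcal{A} = \{\mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(|\mathcal{R}|)}\}$ is a set of adjacency matrices, where $\mathbf{A}^{(r)} \in \{0, 1\}^{n \times n}$ is an adjacency matrix of the network $\mathcal{G}^{(r)}$.

Task: Unsupervised Attributed Multiplex Network Embedding. Given an attributed multiplex network $\mathcal{G}$, and the set of adjacency matrices $\mathcal{A}$, the task of unsupervised attributed multiplex network embedding is to learn a $d$-dimensional vector representation $\mathbf{z}_i \in \mathbb{R}^{d}$ for each node $v_i \in \mathcal{V}$ without using any labels.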

3 Unsupervised Attributed Multiplex Network Embedding

We begin by introducing Deep Graph Infomax (DGI) [33], then discuss its limitations, and present our proposed method.

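Deep Graph Infomax (DGI). Veličković et al. [33] proposed an unsupervised method for learning node representations, called DGI, that relies on the infomax principle [18]. More precisely, DGI aims to learn a low-dimensional vector representation for each node $v_i$, i.e., $\mathbf{h}_i \in \mathbb{R}^{d}$, such that the average mutual information (MI) between the graph-level (global) summary representation $\mathbf{s} \in \mathbb{R}^{d}$, and the representations of the local patches $\{\mathbf{h}_1, \ldots, \mathbf{h}_n\}$ is maximized. To this end, DGI introduces a discriminator $\mathcal{D}$ that discriminates the true samples, i.e., $(\mathbf{h}_i, \mathbf{s})$, from its negative counterparts, i.e., $(\tilde{\mathbf{h}}_j, \mathbf{s})$:

$$\mathcal{L} = -\sum_{i=1}^{n} \log \mathcal{D}\left(\mathbf{h}_i, \mathbf{s}\right) - \sum_{j=1}^{n} \log\left(1 - \mathcal{D}\left(\tilde{\mathbf{h}}_j, \mathbf{s}\right)\right) \qquad (1)$$

where $\mathbf{h}_i = \sigma\left(\sum_{j \in N(i)} \frac{1}{c_{ij}} \mathbf{x}_j \mathbf{W}\right)$, $N(i)$ is the set of neighboring nodes of $v_i$ including itself, $\mathbf{W} \in \mathbb{R}^{f \times d}$ is a trainable weight matrix, and $c_{ij}$ is a normalizing constant for edge $(v_i, v_j)$, $\mathbf{s} = \sigma\left(\frac{1}{n} \sum_{i=1}^{n} \mathbf{h}_i\right)$, and $\sigma$ is the sigmoid nonlinearity. The negative patch representation $\tilde{\mathbf{h}}_j$ is obtained by row-wise shuffling the original attribute matrix X. Veličković et al. [33] theoretically proved that minimizing the binary cross entropy loss shown in Eqn. 1 amounts to maximizing the mutual information (MI) between $\mathbf{h}_i$ and s, based on the Jensen-Shannon divergence. Refer to Section 3.3 of [33] for the detailed proof. As the local patch representations $\{\mathbf{h}_1, \ldots, \mathbf{h}_n\}$ are learned to preserve the MI with the graph-level representation s, each $\mathbf{h}_i$ is expected to capture the global properties of the entire graph.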

Limitation. Despite its effectiveness, DGI is designed for a single attributed network, and thus it is not straightforward to apply it to a multiplex network. As a naive extension of DGI to a multiplex attributed network, we can independently apply DGI to each graph formed by each relation type, and then compute the average of the embeddings obtained from each graph to get the final node representations. However, we argue that this fails to model the multiplexity of the network, because the interactions among the node embeddings from different relation types are not captured. Thus, we need a more systematic way to integrate multiple independent models to obtain the final consensus embedding that every model can agree on.

3.1 Deep Multiplex Graph Infomax: DMGI

We present our unsupervised method for embedding an attributed multiplex network. We first describe how to independently model each graph pertaining to each relation type, then explain how to jointly integrate them to finally obtain the consensus node embedding matrix.

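Relation-type specific Node Embedding. For each relation type $r \in \mathcal{R}$, we introduce a relation-type specific node encoder $g_r$ to generate the relation-type specific node embedding matrix $\mathbf{H}^{(r)} \in \mathbb{R}^{n \times d}$ of nodes in $\mathcal{G}^{(r)}$. The encoder is a single-layered GCN:

$$\mathbf{H}^{(r)} = g_r\left(\mathbf{X}, \mathbf{A}^{(r)} \mid \mathbf{W}^{(r)}\right) = \sigma\left(\hat{\mathbf{D}}_r^{-\frac{1}{2}} \hat{\mathbf{A}}^{(r)} \hat{\mathbf{D}}_r^{-\frac{1}{2}} \mathbf{X} \mathbf{W}^{(r)}\right) \qquad (2)$$

where $\hat{\mathbf{A}}^{(r)} = \mathbf{A}^{(r)} + w \mathbf{I}_n$, $(\hat{\mathbf{D}}_r)_{ii} = \sum_j \hat{A}^{(r)}_{ij}$, $\mathbf{W}^{(r)} \in \mathbb{R}^{f \times d}$ is a trainable weight matrix of the relation-type specific encoder $g_r$, and $\sigma$ is the ReLU nonlinearity. Unlike conventional GCNs [13], we control the weight of the self-connections by introducing a weight $w \in \mathbb{R}$. Larger $w$ indicates that the node itself plays a more important role in generating its embedding, which in turn diminishes the importance of its neighboring nodes. Then, we compute the graph-level summary representation $\mathbf{s}^{(r)}$ that summarizes the global content of the graph $\mathcal{G}^{(r)}$. We employ a readout function $\text{Readout}: \mathbb{R}^{n \times d} \rightarrow \mathbb{R}^{d}$:

$$\mathbf{s}^{(r)} = \text{Readout}\left(\mathbf{H}^{(r)}\right) = \sigma\left(\frac{1}{n} \sum_{i=1}^{n} \mathbf{h}_i^{(r)}\right) \qquad (3)$$

where $\sigma$ is the logistic sigmoid nonlinearity, and $\mathbf{h}_i^{(r)}$ denotes the $i$-th row vector of the matrix $\mathbf{H}^{(r)}$. We also note that various pooling methods such as maxpool and SAGPool [15] can be used as $\text{Readout}(\cdot)$.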
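The encoder and readout described above can be sketched in a few lines. This is a minimal pure-Python illustration under stated assumptions, not the authors' PyTorch implementation: the helper names `matmul`, `gcn_encode`, and `readout` are hypothetical, dense lists of lists stand in for tensors, and the self-connection weight `w` and all toy values are chosen only for illustration.

```python
import math

def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gcn_encode(X, A, W, w):
    """Single-layer GCN: ReLU(D^-1/2 (A + w*I) D^-1/2 X W).

    X: n x f attributes, A: n x n adjacency of one relation type,
    W: f x d weights, w: weight of the self-connections (assumed > 0).
    """
    n = len(A)
    A_hat = [[A[i][j] + (w if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_hat]  # row sums of the self-loop-augmented adjacency
    A_norm = [[A_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    H = matmul(matmul(A_norm, X), W)
    return [[max(0.0, v) for v in row] for row in H]  # ReLU

def readout(H):
    """Average the node embeddings, then apply a logistic sigmoid."""
    n = len(H)
    return [1.0 / (1.0 + math.exp(-sum(col) / n)) for col in zip(*H)]

# Toy graph: a 3-node path with 2-dimensional attributes (illustrative values only).
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[0.5, -0.5], [0.25, 0.75]]
H = gcn_encode(X, A, W, w=1.0)
s = readout(H)
```

As `w` grows, the normalized adjacency approaches the identity, so each node's embedding depends almost entirely on its own attributes, which matches the role the text assigns to the self-connection weight.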

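Next, given the relation-type specific node embedding matrix $\mathbf{H}^{(r)}$, and its graph-level summary representation $\mathbf{s}^{(r)}$, we compute the relation-type specific cross entropy:

$$\mathcal{L}^{(r)} = -\sum_{i=1}^{n} \log \mathcal{D}\left(\mathbf{h}_i^{(r)}, \mathbf{s}^{(r)}\right) - \sum_{j=1}^{n} \log\left(1 - \mathcal{D}\left(\tilde{\mathbf{h}}_j^{(r)}, \mathbf{s}^{(r)}\right)\right) \qquad (4)$$

where $\mathcal{D}$ is a discriminator that scores patch-summary representation pairs, i.e., $(\mathbf{h}_i^{(r)}, \mathbf{s}^{(r)})$. In this paper, we apply a simple bilinear scoring function as it empirically performs the best in our experiments:

$$\mathcal{D}\left(\mathbf{h}_i^{(r)}, \mathbf{s}^{(r)}\right) = \sigma\left(\mathbf{h}_i^{(r)\top} \mathbf{M}^{(r)} \mathbf{s}^{(r)}\right) \qquad (5)$$

where $\sigma$ is the logistic sigmoid nonlinearity, and $\mathbf{M}^{(r)} \in \mathbb{R}^{d \times d}$ is a trainable scoring matrix. To generate the negative node embedding $\tilde{\mathbf{h}}_j^{(r)}$, we corrupt the original attribute matrix by shuffling it in the row-wise manner [33], i.e., $\mathbf{X} \rightarrow \tilde{\mathbf{X}}$, and reuse the encoder in Eqn. 2, i.e., $\tilde{\mathbf{H}}^{(r)} = g_r(\tilde{\mathbf{X}}, \mathbf{A}^{(r)} \mid \mathbf{W}^{(r)})$.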
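To make the scoring and corruption steps concrete, the sketch below implements a bilinear discriminator, row-wise shuffling, and the resulting binary cross entropy for one relation type. It is an illustration only: the function names (`bilinear_score`, `shuffle_rows`, `infomax_loss`) are hypothetical, the small `eps` inside the logarithm is an added numerical safeguard, and the toy vectors are not taken from the paper.

```python
import math
import random

def sigmoid(x):
    """Numerically stable logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def bilinear_score(h, M, s):
    """Score a (patch, summary) pair as sigmoid(h^T M s)."""
    Ms = [sum(m * s_k for m, s_k in zip(row, s)) for row in M]
    return sigmoid(sum(h_k * v for h_k, v in zip(h, Ms)))

def shuffle_rows(X, rng):
    """Corrupt the attribute matrix by permuting its rows."""
    idx = list(range(len(X)))
    rng.shuffle(idx)
    return [X[i] for i in idx]

def infomax_loss(H, H_neg, M, s, eps=1e-12):
    """Binary cross entropy: true patches labeled 1, corrupted patches labeled 0."""
    pos = sum(math.log(bilinear_score(h, M, s) + eps) for h in H)
    neg = sum(math.log(1.0 - bilinear_score(h, M, s) + eps) for h in H_neg)
    return -(pos + neg)

# Corruption keeps the same rows, only in a different order.
X = [[1, 0], [0, 1], [1, 1]]
X_neg = shuffle_rows(X, random.Random(0))

# A discriminator that separates true from corrupted patches yields a lower loss.
M = [[1.0, 0.0], [0.0, 1.0]]
s = [1.0, 1.0]
good = infomax_loss([[2.0, 2.0]], [[-2.0, -2.0]], M, s)
bad = infomax_loss([[-2.0, -2.0]], [[2.0, 2.0]], M, s)
```

In a full pipeline the corrupted matrix would be passed through the same encoder to obtain the negative embeddings; here the negative embeddings are supplied directly to keep the example short.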

Joint Modeling and Consensus Regularization.

Heretofore, by independently maximizing the average MI between the local patches $\{\mathbf{h}_1^{(r)}, \ldots, \mathbf{h}_n^{(r)}\}$ and the graph-level summary $\mathbf{s}^{(r)}$ pertaining to each graph $\mathcal{G}^{(r)}$, we obtained the relation-type specific node embedding matrix $\mathbf{H}^{(r)}$ that captures the global information in $\mathcal{G}^{(r)}$. However, as each $\mathbf{H}^{(r)}$ is trained independently for each $r \in \mathcal{R}$, these embedding matrices only contain relevant information regarding each relation type, and therefore fail to take advantage of the multiplexity of the network. This motivates us to develop a systematic way to jointly integrate the embeddings from different relation types, so that they can mutually help each other learn high-quality embeddings.

To this end, we introduce the consensus embedding matrix $\mathbf{Z} \in \mathbb{R}^{n \times d}$ on which every relation-type specific node embedding matrix $\mathbf{H}^{(r)}$ can agree. More precisely, we introduce the consensus regularization framework that consists of 1) a regularizer minimizing the disagreements between the set of original node embeddings, i.e., $\{\mathbf{H}^{(r)} \mid r \in \mathcal{R}\}$, and the consensus embedding Z, and 2) another regularizer maximizing the disagreement between the corrupted node embeddings, i.e., $\{\tilde{\mathbf{H}}^{(r)} \mid r \in \mathcal{R}\}$, and the consensus embedding Z, which are formulated as follows:

$$\ell_{\text{cs}} = \left[\mathbf{Z} - \mathcal{Q}\left(\{\mathbf{H}^{(r)} \mid r \in \mathcal{R}\}\right)\right]^2 - \left[\mathbf{Z} - \mathcal{Q}\left(\{\tilde{\mathbf{H}}^{(r)} \mid r \in \mathcal{R}\}\right)\right]^2 \qquad (6)$$

where $\mathcal{Q}$ is an aggregation function that combines a set of node embedding matrices from multiple relation types into a single embedding matrix, i.e., $\mathbf{H} \in \mathbb{R}^{n \times d}$. $\mathcal{Q}$ can be any pooling method that can handle permutation invariant input, such as set2set [34] or Set Transformer [14]. However, considering the efficiency of the method, we simply employ average pooling, i.e., computing the average of the set of embedding matrices:

$$\mathbf{H} = \mathcal{Q}\left(\{\mathbf{H}^{(r)} \mid r \in \mathcal{R}\}\right) = \frac{1}{|\mathcal{R}|} \sum_{r \in \mathcal{R}} \mathbf{H}^{(r)} \qquad (7)$$

It is important to note that the scoring matrix $\mathbf{M}^{(r)}$ in Eqn. 5 is shared among all the relations $r \in \mathcal{R}$, i.e., $\mathbf{M} = \mathbf{M}^{(1)} = \mathbf{M}^{(2)} = \cdots = \mathbf{M}^{(|\mathcal{R}|)}$. The intuition is to learn the universal discriminator that is capable of scoring the true pairs higher than the negative pairs regardless of the relation types. We argue that the universal discriminator facilitates the joint modeling of different relation types together with the consensus regularization.
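The consensus regularization with average pooling can be illustrated as follows. This is a hedged sketch rather than the reference implementation: the names `average_pool`, `squared_distance`, and `consensus_regularizer` are hypothetical, and the disagreement between two matrices is assumed here to be the sum of squared entry-wise differences.

```python
def average_pool(mats):
    """Element-wise mean of a list of equally shaped n x d matrices."""
    count = len(mats)
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(m[i][j] for m in mats) / count for j in range(cols)] for i in range(rows)]

def squared_distance(A, B):
    """Sum of squared entry-wise differences between two matrices (assumed disagreement measure)."""
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def consensus_regularizer(Z, H_list, H_neg_list):
    """Pull Z toward the pooled true embeddings and push it away from the pooled corrupted ones."""
    agree = squared_distance(Z, average_pool(H_list))
    disagree = squared_distance(Z, average_pool(H_neg_list))
    return agree - disagree

# One node, two relation types (illustrative values only).
H_list = [[[1.0, 0.0]], [[3.0, 2.0]]]
H_neg_list = [[[0.0, 0.0]], [[0.0, 0.0]]]
Z = [[2.0, 1.0]]
reg = consensus_regularizer(Z, H_list, H_neg_list)  # 0 - (2^2 + 1^2) = -5.0
```

Because the second term enters with a negative sign, minimizing the regularizer rewards a consensus embedding that sits close to the pooled true embeddings and far from the pooled corrupted ones.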

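Finally, we jointly optimize the sum of all the relation-type specific losses in Eqn. 4 and the consensus regularization in Eqn. 6 to obtain the final objective as follows:

$$\mathcal{J} = \sum_{r \in \mathcal{R}} \mathcal{L}^{(r)} + \alpha \ell_{\text{cs}} + \beta \|\Theta\| \qquad (8)$$

where $\alpha$ controls the importance of the consensus regularization, and $\beta$ is a coefficient for l2 regularization on $\Theta$, which is the set of trainable parameters, i.e., $\Theta = \{\{\mathbf{W}^{(r)}\}_{r \in \mathcal{R}}, \mathbf{M}, \mathbf{Z}\}$. $\mathcal{J}$ is optimized by the Adam optimizer. Figure 1 illustrates the overview of DMGI.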

| Dataset | Relations (A-B) | Num. A | Num. B | Num. A-B | Relation type | Num. relations | Num. node attributes | Num. labeled data | Num. classes |
|---|---|---|---|---|---|---|---|---|---|
| ACM | Paper-Author | 3,025 | 5,835 | 9,744 | P-A-P | 29,281 | 1,830 (Paper abstract) | 600 | 3 |
| | Paper-Subject | 3,025 | 56 | 3,025 | P-S-P | 2,210,761 | | | |
| IMDB | Movie-Actor | 3,550 | 4,441 | 10,650 | M-A-M | 66,428 | 1,007 (Movie plot) | 300 | 3 |
| | Movie-Director | 3,550 | 1,726 | 3,550 | M-D-M | 13,788 | | | |
| DBLP | Paper-Author | 7,907 | 1,960 | 14,238 | P-A-P | 144,783 | 2,000 (Paper abstract) | 80 | 4 |
| | Paper-Paper | 7,907 | 7,907 | 10,522 | P-P-P | 90,145 | | | |
| | Author-Term | 1,960 | 1,975 | 57,269 | P-A-T-A-P | 57,137,515 | | | |
| Amazon | Item-Item | 7,621 | 7,621 | 38,514 | Also-view | 266,237 | 2,000 (Item description) | 80 | 4 |
| | | | | 45,446 | Also-bought | 1,104,257 | | | |
| | | | | 9,783 | Bought-together | 16,305 | | | |

Table 1: Statistics of the datasets. The node attributes are bag-of-words of text associated with each node.

Figure 1: Overview of DMGI (Best viewed in color).

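Discussion. Despite its efficiency, the above average pooling scheme in Eqn. 7 treats all the relations equally, whereas, as will be shown in the experiments, some relation types are more beneficial for a certain downstream task than others. For example, the co-authorship information between two papers plays a more significant role in predicting the topic of a paper compared with their citation information; eventually, these two types of information mutually help each other to more accurately predict the topic of a paper. Therefore, we can adopt the attention mechanism [1] to distinguish between different relation types as follows:

$$\mathbf{h}_i = \mathcal{Q}\left(\{\mathbf{h}_i^{(r)} \mid r \in \mathcal{R}\}\right) = \sum_{r \in \mathcal{R}} a_i^{(r)} \mathbf{h}_i^{(r)} \qquad (9)$$

where $a_i^{(r)}$ denotes the importance of relation $r$ in generating the final embedding of node $v_i$ defined as:

$$a_i^{(r)} = \frac{\exp\left(\mathbf{q}^{(r)} \cdot \mathbf{h}_i^{(r)}\right)}{\sum_{r' \in \mathcal{R}} \exp\left(\mathbf{q}^{(r')} \cdot \mathbf{h}_i^{(r')}\right)} \qquad (10)$$

where $\mathbf{q}^{(r)} \in \mathbb{R}^{d}$ is the feature vector of relation $r$.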
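As a rough illustration of the attention-based aggregation for a single node, the sketch below computes a softmax over per-relation scores and uses the resulting weights to combine the relation-type specific embeddings. The function name `attention_pool` and the toy vectors are hypothetical, and the score is assumed to be the dot product between a relation's feature vector and the node's embedding under that relation.

```python
import math

def attention_pool(h_per_relation, q_per_relation):
    """Combine one node's relation-specific embeddings with softmax attention weights.

    h_per_relation[r]: embedding of the node under relation r (length d)
    q_per_relation[r]: feature vector of relation r (length d)
    """
    logits = [sum(q_k * h_k for q_k, h_k in zip(q, h))
              for q, h in zip(q_per_relation, h_per_relation)]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(h_per_relation[0])
    pooled = [sum(a * h[k] for a, h in zip(weights, h_per_relation)) for k in range(dim)]
    return pooled, weights

h = [[1.0, 0.0], [0.0, 1.0]]
# Zero relation vectors give equal weights, which reduces to average pooling.
uniform, w_uniform = attention_pool(h, [[0.0, 0.0], [0.0, 0.0]])
# A relation vector aligned with the first embedding concentrates the weight on it.
peaked, w_peaked = attention_pool(h, [[10.0, 0.0], [0.0, 0.0]])
```

The first call shows that average pooling is the special case in which every relation receives the same weight; the second shows how a single relation can come to dominate the aggregated embedding.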

Extension to Semi-Supervised Learning.

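It is important to note that DMGI is trained in a fully unsupervised manner. However, in reality, nodes are sometimes associated with label information, which can guide the training of node embeddings even with a small amount [13, 25]. To this end, we introduce a semi-supervised module into our framework that predicts the labels of labeled nodes from the consensus embedding Z. More precisely, we minimize the cross-entropy error over the labeled nodes:

$$\ell_{\text{sup}} = -\frac{1}{|\mathcal{Y}_L|} \sum_{l \in \mathcal{Y}_L} \sum_{i=1}^{c} Y_{li} \ln \hat{Y}_{li} \qquad (11)$$

where $\mathcal{Y}_L$ is the set of node indices with labels, $\mathbf{Y} \in \mathbb{R}^{n \times c}$ is the ground truth label, $\hat{\mathbf{Y}} = \text{softmax}(f(\mathbf{Z}))$ is the output of a softmax layer, and $f: \mathbb{R}^{n \times d} \rightarrow \mathbb{R}^{n \times c}$ is a classifier that predicts the label of a node from its embedding, which is a single fully connected layer in this work. The final objective function with the semi-supervised module is:

$$\mathcal{J}_{\text{semi}} = \sum_{r \in \mathcal{R}} \mathcal{L}^{(r)} + \alpha \ell_{\text{cs}} + \beta \|\Theta\| + \gamma \ell_{\text{sup}} \qquad (12)$$

where $\gamma$ is the coefficient of the semi-supervised module.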
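The semi-supervised term can be sketched as an averaged cross entropy over the labeled nodes only. The code below is a minimal illustration with hypothetical names (`softmax`, `classify`, `supervised_loss`); the classifier is assumed to be one fully connected layer with weights `W` and bias `b`, and labels are given as class indices rather than one-hot rows.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    top = max(scores)
    exps = [math.exp(v - top) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(z, W, b):
    """Single fully connected layer: embedding z (length d) -> class scores (length c)."""
    return [sum(z[k] * W[k][c] for k in range(len(z))) + b[c] for c in range(len(b))]

def supervised_loss(Z, labels, labeled_idx, W, b):
    """Mean cross entropy over labeled nodes; unlabeled nodes do not contribute."""
    total = 0.0
    for l in labeled_idx:
        probs = softmax(classify(Z[l], W, b))
        total -= math.log(probs[labels[l]])
    return total / len(labeled_idx)

# Three nodes, two of them labeled, three classes (illustrative values only).
Z = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
labels = {0: 2, 2: 1}
W_zero = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
b_zero = [0.0, 0.0, 0.0]
loss = supervised_loss(Z, labels, [0, 2], W_zero, b_zero)  # uniform predictions
```

With an untrained (all-zero) classifier the predictions are uniform, so the loss equals the logarithm of the number of classes; training lowers it from that baseline.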

4 Experiments

Dataset. To make fair comparisons with HAN [37], which is the most relevant baseline method, we evaluate our proposed method on the datasets used in their original paper [37], i.e., ACM, DBLP, and IMDB. We used the publicly available ACM dataset [37], and preprocessed the DBLP and IMDB datasets. For the ACM and DBLP datasets, the task is to classify the papers into three classes (Database, Wireless Communication, Data Mining) and four classes (DM, AI, CV, NLP; DM: KDD, WSDM, ICDM; AI: ICML, AAAI, IJCAI; CV: CVPR; NLP: ACL, NAACL, EMNLP), respectively, according to the research topic. For the IMDB dataset, the task is to classify the movies into three classes (Action, Comedy, Drama). We note that the above datasets used by previous work are not truly multiplex in nature because the multiplexity between nodes is inferred via intermediate nodes (e.g., ACM: Paper-Paper relationships are inferred via Authors and Subjects that connect two Papers, i.e., “PAP” and “PSP”). Thus, to make our evaluation more practical, we used the Amazon dataset [10] that genuinely contains a multiplex network of items, i.e., also-viewed, also-bought, and bought-together relations between items. We used datasets from four categories, i.e., Beauty, Automotive, Patio Lawn and Garden, and Baby (we chose these categories because the three types of item-item relations from these categories are similar in number), and the task is to classify items into the four classes. For the ACM and IMDB datasets, we used the same number of labeled data as in [37] for fair comparisons, and for the remaining datasets, we used 20 labeled data for each class. Table 1 summarizes the data statistics.

Table 2: Properties of the compared methods (Mult.: Multiplexity, Attr.: Attribute, Unsup.: Unsupervised, Glo.: Global).

Table 3: Performance for node clustering and similarity search on test data.

| Method | ACM NMI | ACM Sim@5 | IMDB NMI | IMDB Sim@5 | DBLP NMI | DBLP Sim@5 | Amazon NMI | Amazon Sim@5 |
|---|---|---|---|---|---|---|---|---|
| Deepwalk | 0.310 | 0.710 | 0.117 | 0.490 | 0.348 | 0.629 | 0.083 | 0.726 |
| node2vec | 0.309 | 0.710 | 0.123 | 0.487 | 0.382 | 0.629 | 0.074 | 0.738 |
| GCN/GAT | 0.671 | 0.867 | 0.176 | 0.565 | 0.465 | 0.724 | 0.287 | 0.624 |
| DGI | 0.640 | 0.889 | 0.182 | 0.578 | 0.551 | 0.786 | 0.007 | 0.558 |
| ANRL | 0.515 | 0.814 | 0.163 | 0.527 | 0.332 | 0.720 | 0.166 | 0.763 |
| CAN | 0.504 | 0.836 | 0.074 | 0.544 | 0.323 | 0.792 | 0.001 | 0.537 |
| DGCN | 0.691 | 0.690 | 0.143 | 0.179 | 0.462 | 0.491 | 0.143 | 0.194 |
| CMNA | 0.498 | 0.363 | 0.152 | 0.069 | 0.420 | 0.511 | 0.070 | 0.435 |
| MNE | 0.545 | 0.791 | 0.013 | 0.482 | 0.136 | 0.711 | 0.001 | 0.395 |
| mGCN | 0.668 | 0.873 | 0.183 | 0.550 | 0.468 | 0.726 | 0.301 | 0.630 |
| HAN | 0.658 | 0.872 | 0.164 | 0.561 | 0.472 | 0.779 | 0.029 | 0.495 |
| DMGI | 0.687 | 0.898 | 0.196 | 0.605 | 0.409 | 0.766 | 0.425 | 0.816 |
| DMGI_attn | 0.702 | 0.901 | 0.185 | 0.586 | 0.554 | 0.798 | 0.412 | 0.825 |

Methods Compared.

  1. Embedding methods for a single network

    • No attributes: Deepwalk [24], node2vec [8]: They learn node embeddings by random walks and skip-gram.

    • Attributed network embedding: GCN [13], GAT [32]: They learn node embeddings based on local neighborhood structures. As they perform similarly, we report the best performing method among them; DGI [33]: It maximizes the MI between the graph-level summary representation and the local patches; ANRL [42]: It uses a neighbor enhancement autoencoder to model the node attribute information, and a skip-gram model to capture the network structure; CAN [22]: It learns embeddings of both attributes and nodes in the same semantic space; DGCN [44]: It models the local and global properties of a graph by employing dual GCNs.

  2. Multiplex embedding methods

    • No attributes: CMNA [3]: It leverages the cross-network information to refine the inter-vector for network alignment and the intra-vector for other downstream tasks. We use the intra-vector for our evaluations; MNE [41]: It jointly models multiple networks by introducing a common embedding, and an additional embedding for each relation type.

    • Attributed multiplex network embedding: mGCN [21], HAN [37]: They apply GCNs and GATs on a multiplex network considering the inter- and intra-network interactions. For fair comparisons, we initialized the node embeddings of mGCN by using the node attribute matrix, although the node attribute information is ignored in the original mGCN; DMGI_attn: DMGI with the attention mechanism (Eqn. 9).

For the sake of fair comparisons with DMGI, which considers the node attributes, we concatenated the raw attribute matrix X to the learned node embeddings Z of the methods that ignore the node attributes, i.e., Deepwalk, node2vec, CMNA, and MNE; that is, $\mathbf{Z} \leftarrow [\mathbf{Z}; \mathbf{X}]$. Moreover, regarding the embedding methods for a single network, i.e., the methods that belong to the first category in the above list, we obtain the final node embedding matrix Z by computing the average of the node embeddings obtained from each single graph, i.e., $\mathbf{Z} = \frac{1}{|\mathcal{R}|} \sum_{r \in \mathcal{R}} \mathbf{H}^{(r)}$. We provide a summary of the properties of the compared methods in Table 2.

Evaluation Metrics. Recall that DMGI is an unsupervised method that does not require any labeled data for training. Therefore, we evaluate the performance of DMGI in terms of node clustering and similarity search, both of which are classical performance measures for unsupervised methods. For node clustering, we use the most commonly used metric [37], i.e., Normalized Mutual Information (NMI). For similarity search, we compute the cosine similarity scores of the node embeddings between all pairs of nodes, and for each node, we rank the nodes according to the similarity score. Then, we calculate the ratio of the nodes that belong to the same class within the top-5 ranked nodes (Sim@5). Moreover, we also evaluate DMGI in terms of node classification. More precisely, after learning the node embeddings, we train a logistic regression classifier on the learned embeddings in the training set, and then evaluate on the nodes in the test set. We use Macro-F1 (MaF1) and Micro-F1 (MiF1) as the evaluation metrics.
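The similarity search metric can be computed as in the following sketch. It reflects one plausible reading of the description and is not the authors' evaluation script: the names `cosine` and `sim_at_k` are hypothetical, the query node is assumed to be excluded from its own ranking, and the toy embeddings are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def sim_at_k(embeddings, labels, k=5):
    """Average fraction of same-class nodes among each node's k most similar nodes."""
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]  # assumption: exclude the query node
        others.sort(key=lambda j: cosine(embeddings[i], embeddings[j]), reverse=True)
        top = others[:k]
        total += sum(labels[j] == labels[i] for j in top) / len(top)
    return total / n

# Two well-separated classes of three nodes each (illustrative values only).
emb = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 1.0], [0.0, 0.9]]
lab = [0, 0, 0, 1, 1, 1]
score_top2 = sim_at_k(emb, lab, k=2)
score_top5 = sim_at_k(emb, lab, k=5)
```

With `k=2` every node retrieves only its two classmates, so the score is perfect; with `k=5` every other node is retrieved, and the score drops to the base rate of two same-class nodes out of five.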


Experimental Settings. We randomly split our dataset into train/validation/test, and we use an equal number of labeled data for the training and validation datasets. We report the test performance when the performance on validation data gives the best result. For DMGI, we set the node embedding dimension $d$ and the self-connection weight $w$, and tune the coefficients $\alpha$, $\beta$, and $\gamma$. We implement DMGI in PyTorch (https://github.com/pcy1302/DMGI), and for all other methods, we used the source codes published by the authors, and tried to tune them to their best performance. More precisely, apart from the guidelines provided by the original papers, we tuned the learning rate and the coefficients for regularization from {0.0001, 0.0005, 0.001, 0.005} on the validation dataset. After learning the node embeddings, for fair comparisons, we conducted the evaluations within the same platform.

| Method | ACM MaF1 | ACM MiF1 | IMDB MaF1 | IMDB MiF1 | DBLP MaF1 | DBLP MiF1 | Amazon MaF1 | Amazon MiF1 |
|---|---|---|---|---|---|---|---|---|
| Deepwalk | 0.739 | 0.748 | 0.532 | 0.550 | 0.533 | 0.537 | 0.663 | 0.671 |
| node2vec | 0.741 | 0.749 | 0.533 | 0.550 | 0.543 | 0.547 | 0.662 | 0.669 |
| GCN/GAT | 0.869 | 0.870 | 0.603 | 0.611 | 0.734 | 0.717 | 0.646 | 0.649 |
| DGI | 0.881 | 0.881 | 0.598 | 0.606 | 0.723 | 0.720 | 0.403 | 0.418 |
| ANRL | 0.819 | 0.820 | 0.573 | 0.576 | 0.770 | 0.699 | 0.692 | 0.690 |
| CAN | 0.590 | 0.636 | 0.577 | 0.588 | 0.702 | 0.694 | 0.498 | 0.499 |
| DGCN | 0.888 | 0.888 | 0.582 | 0.592 | 0.707 | 0.698 | 0.478 | 0.509 |
| CMNA | 0.782 | 0.788 | 0.549 | 0.566 | 0.566 | 0.561 | 0.657 | 0.665 |
| MNE | 0.792 | 0.797 | 0.552 | 0.574 | 0.566 | 0.562 | 0.556 | 0.567 |
| mGCN | 0.858 | 0.860 | 0.623 | 0.630 | 0.725 | 0.713 | 0.660 | 0.661 |
| HAN | 0.878 | 0.879 | 0.599 | 0.607 | 0.716 | 0.708 | 0.501 | 0.509 |
| DMGI | 0.898 | 0.898 | 0.648 | 0.648 | 0.771 | 0.766 | 0.746 | 0.748 |
| DMGI_attn | 0.887 | 0.887 | 0.602 | 0.606 | 0.778 | 0.770 | 0.758 | 0.758 |

Table 4: Node classification performance on test data.

Figure 4: Visualization of the attention weights on DBLP dataset.

4.1 Performance Analysis

Overall evaluation. Table 3 and Table 4 show the evaluation results on the unsupervised and supervised tasks, respectively. We have the following observations: 1) Our proposed DMGI and DMGI_attn outperform all the state-of-the-art baselines not only on the unsupervised tasks, but also on the supervised task, although the improvement is more significant in the unsupervised tasks as expected. This verifies the benefit of our framework that models the multiplexity and the global property of a network together with the node attributes within a single framework. 2) Although DGI shows relatively good performance, the performance is unstable (poor performance on the Amazon dataset), indicating that multiple relation types should be jointly modeled. 3) Attribute-aware multiplex network embedding methods, such as mGCN and HAN, generally perform better than those that neglect the node attributes, i.e., CMNA and MNE, even though we concatenated node attributes to the node embeddings. This verifies not only the benefit of modeling the node attributes, but also that the attributes should be systematically incorporated into the model. 4) Multiplex network embedding methods generally outperform single network embedding methods, although the gap is not significant. This verifies that the multiplexity of a network should be carefully modeled, otherwise a simple aggregation of multiple relation-type specific embeddings learned from independent single network embedding methods may perform better.

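| Dataset | Rel. type | GCN/GAT | DGI | ANRL | DMGI | DMGI_attn |
|---|---|---|---|---|---|---|
| ACM | PAP | 0.822 | 0.875 | 0.795 | | |
| | PSP | 0.721 | 0.675 | 0.694 | | |
| | Merged | 0.867 | 0.889 | 0.814 | 0.898 | 0.901 |
| IMDB | MAM | 0.485 | 0.484 | 0.495 | | |
| | MDM | 0.548 | 0.562 | 0.520 | | |
| | Merged | 0.566 | 0.578 | 0.527 | 0.605 | 0.586 |
| DBLP | PAP | 0.730 | 0.779 | 0.692 | | |
| | PPP | 0.456 | 0.477 | 0.680 | | |
| | PATAP | 0.431 | 0.409 | OOM | | |
| | Merged | 0.724 | 0.786 | 0.720 | 0.766 | 0.799 |
| Amazon | Also-V | 0.355 | 0.367 | 0.563 | | |
| | Also-B | 0.357 | 0.381 | 0.516 | | |
| | Bou.-T | 0.662 | 0.639 | 0.770 | | |
| | Merged | 0.624 | 0.558 | 0.764 | 0.816 | 0.825 |

Table 5: Performance of similarity search (Sim@5) of embedding methods for a single network. (Merged denotes the average of all the relation-type specific embeddings.)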

Effect of the attention mechanism. In Table 5, we show the performance of DMGI and DMGI_attn, together with the performance of single network embedding methods (GCN/GAT, DGI, and ANRL). We observe that DMGI_attn outperforms DMGI on most of the datasets except IMDB. To analyze the reason for this, we first plot the distribution of the attention weights on the DBLP dataset over the training epochs in Figure 4. The upper graph in Figure 4 demonstrates that the attention weights eventually end up at both extremes, i.e., close to 0 or close to 1, and the lower graphs show that most of the attention weight is dedicated to a single relation type, i.e., “PAP”, which actually turns out to be the most important relation among the three (see Table 5). This phenomenon is common in every dataset. Next, we look at the performance of the single network embedding methods, especially DGI, on each relation type in Table 5. We observe that the performance differences among relation types in the ACM, DBLP, and Amazon datasets are more biased toward a single relation type, whereas in the IMDB dataset, the “MAM” and “MDM” relations show relatively similar performance. To summarize our findings, since the attention mechanism tends to favor the single most important relation type (“PAP” in ACM, “MDM” in IMDB, “PAP” in DBLP, and “Bought-together” in Amazon), DMGI_attn outperforms DMGI on datasets where one relation type significantly outperforms the others, i.e., ACM, DBLP, and Amazon, by removing the noise from other relations. On the other hand, for datasets where all the relations show relatively even performance, i.e., IMDB, extremely favoring a single well performing relation type (“MDM”) is rather detrimental to the overall performance because the relation “MAM” should also be considered to some extent.

We also note that since the attention mechanism of DMGI_attn can infer the importance of each relation type, we can filter out unnecessary relation types as a preprocessing step. To verify this, we evaluated all possible combinations of relation types in the DBLP dataset (Table 6). We observe that by removing the relation “PATAP”, which turned out to be the least useful relation type in Table 5, DMGI_attn obtains even better results than using all the relation types, whereas for GCN and DGI, considering all the relation types still shows the best performance. This indicates that the attention mechanism can be useful to filter out unnecessary relation types, which will especially come in handy when the number of relation types is large.

| Relation types | GCN | DGI | DMGI_attn |
|---|---|---|---|
| PAP+PPP | 0.464 | 0.543 | 0.565 |
| PAP+PATAP | 0.458 | 0.535 | 0.017 |
| PPP+PATAP | 0.332 | 0.237 | 0.201 |
| All | 0.465 | 0.551 | 0.554 |

Table 6: NMI on various combinations of relation types.

Ablation study. To measure the impact of each component of DMGI, we conduct ablation studies on the largest dataset, i.e., DBLP, in Table 7. We have the following observations: 1) As expected, the semi-supervised module specifically helps improve the node classification performance, which is a supervised task, whereas the performance on the unsupervised task remains on par. 2) Various readout functions including ones that contain trainable weights (Linear projection and SAGPool [15]) do not have much impact on the performance, which promotes our use of average pooling. 3) The second term in Eqn. 6 indeed plays a significant role in the consensus regularization framework. 4) The sharing of the scoring matrix M facilitates DMGI to model the interaction among multiple relation types. 5) Node attributes are crucial for representation learning of nodes. 6) Shuffling adjacency matrix instead of attribute matrix deteriorates the model performance.

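| DBLP dataset | MaF1 | NMI | Sim@5 |
|---|---|---|---|
| DMGI_attn | 0.778 | 0.554 | 0.798 |
| 1) DMGI_attn + Semi-supervised | 0.791 | 0.555 | 0.798 |
| 2) Readout (Eqn. 3): Random sample | 0.774 | 0.555 | 0.797 |
| 2) Readout (Eqn. 3): Maxpool | 0.778 | 0.552 | 0.802 |
| 2) Readout (Eqn. 3): Linear projection | 0.783 | 0.565 | 0.803 |
| 2) Readout (Eqn. 3): SAGPool | 0.797 | 0.563 | 0.797 |
| 3) Without 2nd term of Eqn. 6 | 0.749 | 0.448 | 0.787 |
| 4) Without sharing the scoring matrix M | 0.645 | 0.076 | 0.677 |
| 5) No attributes (Adj. as attribute) | 0.377 | 0.053 | 0.763 |
| 6) Neg. sample: Shuffle adj. | 0.364 | 0.156 | 0.504 |

Table 7: Result for ablation studies of DMGI.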

5 Related Work

Network embedding. Network embedding methods aim at learning low-dimensional vector representation for nodes in a graph while preserving the network structure [24, 8, 31], and various other properties such as node attributes [42, 22], structural role [27], and node label information [12].

Multiplex Network embedding. A multiplex network, which is also known as a multi-view network [31, 29] or a multi-dimensional network [20, 21] in the literature, consists of multiple relation types among a set of single-typed nodes. It can be thought of as a special type of heterogeneous network [5, 6] with a single type of node and multiple types of edges. Therefore, a multiplex network calls for special attention because there is no need to consider the semantics between different types of nodes, which is often addressed by the concept of meta-path [30]. Distinguished from heterogeneous networks, a key challenge in multiplex network embedding is to learn a consensus embedding for each node by taking into account the interrelationship among the multiple graphs. In this regard, existing methods mainly focused on how to integrate the information from multiple graphs. HAN [37] employed a graph attention network [32] on each graph, and then applied the attention mechanism to merge the node representations learned from each graph by considering the importance of each graph. However, the existing methods either require labels for training [37, 25, 28], or overlook the node attributes [19, 38, 16, 29, 41, 23, 3]. Most recently, Ma et al. [21] proposed a graph convolutional network (GCN) based method called mGCN, which is not only unsupervised, but also naturally incorporates the node attributes by using GCNs. However, since it is based on GCNs that capture the local graph structure [39], it fails to fully model the global properties of a graph [44, 35, 33].

Attributed Network Embedding. Nodes in a network are often affiliated with various contents, such as abstract text in the publication network, user profiles in social networks, and item description text in movie database or item networks. Such networks are called attributed networks, and have been extensively studied [17, 9, 40, 42, 7, 43, 33, 22]. Their goal is to preserve not only the network structure, but also the node attribute proximity in learning representations. Recently, GCNs [13, 32, 33] have been widely praised for their seamless integration of the network structure and node attributes into a single framework.

Mutual Information. It has recently become possible to compute the mutual information (MI) between high-dimensional input/output pairs of deep neural networks [2]. Several recent works have adopted the infomax principle [18] to learn unsupervised representations in different domains, such as images [11], speech [26], and graphs [33]. More precisely, Veličković et al. [33] proposed Deep Graph Infomax (DGI), which learns representations of graph-structured inputs by maximizing the MI between a high-level global representation and the local patches of a graph.
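As a rough illustration of a DGI-style objective, the sketch below scores (local patch, global summary) pairs with a bilinear discriminator and contrasts them against patches computed from row-shuffled attributes. It follows the general recipe of DGI [33] (mean readout, bilinear scoring, binary cross-entropy), but the function names, the one-layer encoder, the ring graph, and all random parameters are assumptions chosen to keep the example self-contained.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dgi_loss(H, H_corrupt, M):
    """Binary cross-entropy form of a DGI-style objective.

    H: (N, d) patch representations from the real graph.
    H_corrupt: (N, d) patch representations from a corrupted graph.
    M: (d, d) bilinear scoring matrix (hypothetical name).
    """
    s = sigmoid(H.mean(axis=0))           # global summary via mean readout
    pos = sigmoid(H @ M @ s)              # scores for real (patch, summary) pairs
    neg = sigmoid(H_corrupt @ M @ s)      # scores for corrupted pairs
    eps = 1e-12                           # guards against log(0)
    return -(np.log(pos + eps).mean() + np.log(1.0 - neg + eps).mean()) / 2.0

# Toy setup: a 6-node ring graph and a one-layer encoder (both assumptions).
rng = np.random.default_rng(0)
N, f, d = 6, 5, 4
idx = np.arange(N)
A = np.zeros((N, N))
A[idx, (idx + 1) % N] = 1.0
A[(idx + 1) % N, idx] = 1.0
A_hat = (A + np.eye(N)) / 3.0             # every node has degree 3 with self-loop
X = rng.normal(size=(N, f))
W = rng.normal(size=(f, d))

def encode(feats):
    return np.maximum(A_hat @ feats @ W, 0.0)

H = encode(X)
H_corrupt = encode(X[rng.permutation(N)])  # corruption: shuffle attribute rows
loss = dgi_loss(H, H_corrupt, rng.normal(size=(d, d)))
```

Minimizing this loss with respect to the encoder and discriminator parameters pushes real patches to score high against the summary and corrupted patches to score low, which is how the MI between local and global representations is increased in practice.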

6 Conclusion

We presented a simple yet effective unsupervised method for embedding attributed multiplex networks. DMGI jointly integrates the embeddings from multiple types of relations between nodes through the consensus regularization framework and the universal discriminator. Moreover, the attention mechanism of DMGI can infer the importance of each relation type, which facilitates the preprocessing of the multiplex network. Experimental results on both unsupervised tasks and a supervised task verify the superiority of the proposed framework.


  • [1] D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §3.1.
  • [2] M. I. Belghazi, A. Baratin, S. Rajeswar, S. Ozair, Y. Bengio, A. Courville, and R. D. Hjelm (2018) Mine: mutual information neural estimation. ICML. Cited by: §5.
  • [3] X. Chu, X. Fan, D. Yao, Z. Zhu, J. Huang, and J. Bi (2019) Cross-network embedding for multi-network alignment. In WWW, Cited by: 1st item, §5.
  • [4] M. De Domenico, A. Solé-Ribalta, E. Cozzo, M. Kivelä, Y. Moreno, M. A. Porter, S. Gómez, and A. Arenas (2013) Mathematical formulation of multilayer networks. Physical Review X. Cited by: §1.
  • [5] Y. Dong, N. V. Chawla, and A. Swami (2017) Metapath2vec: scalable representation learning for heterogeneous networks. In KDD, Cited by: §5.
  • [6] T. Fu, W. Lee, and Z. Lei (2017) Hin2vec: explore meta-paths in heterogeneous information networks for representation learning. In CIKM, Cited by: §5.
  • [7] H. Gao and H. Huang (2018) Deep attributed network embedding. In IJCAI, Cited by: §5.
  • [8] A. Grover and J. Leskovec (2016) Node2vec: scalable feature learning for networks. In KDD, Cited by: 1st item, §5.
  • [9] W. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In NIPS, Cited by: §5.
  • [10] R. He and J. McAuley (2016) Ups and downs: modeling the visual evolution of fashion trends with one-class collaborative filtering. In WWW, Cited by: §4.
  • [11] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, A. Trischler, and Y. Bengio (2019) Learning deep representations by mutual information estimation and maximization. ICLR. Cited by: §5.
  • [12] X. Huang, J. Li, and X. Hu (2017) Label informed attributed network embedding. In WSDM, Cited by: §5.
  • [13] T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. ICLR. Cited by: §1, §1, §3.1, §3.1, 2nd item, §5.
  • [14] J. Lee, Y. Lee, J. Kim, A. R. Kosiorek, S. Choi, and Y. W. Teh (2019) Set transformer. ICML. Cited by: §3.1.
  • [15] J. Lee, I. Lee, and J. Kang (2019) Self-attention graph pooling. ICML. Cited by: §3.1, §4.1.
  • [16] J. Li, C. Chen, H. Tong, and H. Liu (2018) Multi-layered network embedding. In SDM, Cited by: §5.
  • [17] J. Li, H. Dani, X. Hu, J. Tang, Y. Chang, and H. Liu (2017) Attributed network embedding for learning in a dynamic environment. In CIKM, Cited by: §5.
  • [18] R. Linsker (1988) Self-organization in a perceptual network. Computer. Cited by: §3, §5.
  • [19] W. Liu, P. Chen, S. Yeung, T. Suzumura, and L. Chen (2017) Principled multilayer network embedding. In ICDMW, Cited by: §1, §5.
  • [20] Y. Ma, Z. Ren, Z. Jiang, J. Tang, and D. Yin (2018) Multi-dimensional network embedding with hierarchical structure. In WSDM, Cited by: §5.
  • [21] Y. Ma, S. Wang, C. C. Aggarwal, D. Yin, and J. Tang (2019) Multi-dimensional graph convolutional networks. In SDM, Cited by: 2nd item, §5.
  • [22] Z. Meng, S. Liang, H. Bao, and X. Zhang (2019) Co-embedding attributed networks. In WSDM, Cited by: §1, 2nd item, §5, §5.
  • [23] J. Ni, S. Chang, X. Liu, W. Cheng, H. Chen, D. Xu, and X. Zhang (2018) Co-regularized deep multi-network embedding. In WWW, Cited by: §5.
  • [24] B. Perozzi, R. Al-Rfou, and S. Skiena (2014) Deepwalk: online learning of social representations. In KDD, Cited by: 1st item, §5.
  • [25] M. Qu, J. Tang, J. Shang, X. Ren, M. Zhang, and J. Han (2017) An attention-based collaboration framework for multi-view network representation learning. In CIKM, Cited by: §1, §3.1, §5.
  • [26] M. Ravanelli and Y. Bengio (2018) Learning speaker representations with mutual information. arXiv preprint arXiv:1812.00271. Cited by: §5.
  • [27] L. F. Ribeiro, P. H. Saverese, and D. R. Figueiredo (2017) Struc2vec: learning node representations from structural identity. In KDD, Cited by: §1, §5.
  • [28] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling (2018) Modeling relational data with graph convolutional networks. In ESWC, Cited by: §1, §5.
  • [29] Y. Shi, F. Han, X. He, X. He, C. Yang, J. Luo, and J. Han (2018) Mvn2vec: preservation and collaboration in multi-view network embedding. arXiv preprint arXiv:1801.06597. Cited by: §1, §5.
  • [30] Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu (2011) Pathsim: meta path-based top-k similarity search in heterogeneous information networks. VLDB. Cited by: §5.
  • [31] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei (2015) Line: large-scale information network embedding. In WWW, Cited by: §1, §5, §5.
  • [32] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2017) Graph attention networks. ICLR. Cited by: §1, 2nd item, §5, §5.
  • [33] P. Veličković, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm (2019) Deep graph infomax. ICLR. Cited by: §1, §1, §3.1, §3, §3, 2nd item, §5, §5, §5.
  • [34] O. Vinyals, S. Bengio, and M. Kudlur (2015) Order matters: sequence to sequence for sets. NIPS. Cited by: §3.1.
  • [35] D. Wang, P. Cui, and W. Zhu (2016) Structural deep network embedding. In KDD, Cited by: §5.
  • [36] X. Wang, P. Cui, J. Wang, J. Pei, W. Zhu, and S. Yang (2017) Community preserving network embedding. In AAAI, Cited by: §1.
  • [37] X. Wang, H. Ji, C. Shi, B. Wang, Y. Ye, P. Cui, and P. S. Yu (2019) Heterogeneous graph attention network. In WWW, Cited by: §1, 2nd item, §4, §4, §5.
  • [38] L. Xu, X. Wei, J. Cao, and P. S. Yu (2017) Multi-task network embedding. In DSAA, Cited by: §5.
  • [39] P. Yadav, M. Nimishakavi, N. Yadati, S. Vashishth, A. Rajkumar, and P. Talukdar (2019) Lovasz convolutional networks. In AISTATS, Cited by: §1, §5.
  • [40] C. Yang, Z. Liu, D. Zhao, M. Sun, and E. Chang (2015) Network representation learning with rich text information. In IJCAI, Cited by: §5.
  • [41] H. Zhang, L. Qiu, L. Yi, and Y. Song (2018) Scalable multiplex network embedding. In AAAI, Cited by: §1, 1st item, §5.
  • [42] Z. Zhang, H. Yang, J. Bu, S. Zhou, P. Yu, J. Zhang, M. Ester, and C. Wang (2018) ANRL: attributed network representation learning via deep neural networks. In IJCAI, Cited by: §1, 2nd item, §5, §5.
  • [43] S. Zhou, H. Yang, X. Wang, J. Bu, M. Ester, P. Yu, J. Zhang, and C. Wang (2018) Prre: personalized relation ranking embedding for attributed networks. In CIKM, Cited by: §5.
  • [44] C. Zhuang and Q. Ma (2018) Dual graph convolutional networks for graph-based semi-supervised classification. In WWW, Cited by: 2nd item, §5.