Semi-Supervised Graph-to-Graph Translation

by Tianxiang Zhao, et al.
Penn State University

Graph translation is a very promising research direction with a wide range of potential real-world applications. A graph is a natural structure for representing relationships and interactions, and its translation can encode the intrinsic semantic changes of relationships in different scenarios. However, despite its seemingly wide possibilities, the use of graph translation so far is still quite limited. One important reason is the lack of high-quality paired datasets. For example, we can easily build graphs representing people's shared music tastes and graphs representing co-purchase behavior, but a well-paired dataset is much more expensive to obtain. Therefore, in this work, we seek to provide a graph translation model for the semi-supervised scenario. This task is non-trivial, because graph translation involves changing the semantics in the form of link topology and node attributes, which is difficult to capture due to its combinatorial nature and inter-dependencies. Furthermore, due to the high degree of freedom in a graph's composition, it is difficult to ensure the generalization ability of trained models. These difficulties impose a tighter requirement on the exploitation of unpaired samples. To address them, we propose to construct a dual representation space, where transformation is performed explicitly to model the semantic transitions. Special encoder/decoder structures are designed, and an auxiliary mutual information loss is also adopted to enforce the alignment of unpaired/paired examples. We evaluate the proposed method on three different datasets.




1. Introduction

Graph-to-graph translation aims at transforming a graph in the source domain to a new graph in the target domain, where different domains correspond to different states. Figure 1 gives an illustration of the graph-to-graph translation process. The graph in the source domain depicts shared music tastes of users, with each attributed node representing a user and the user's profile. We want to translate the attributed graph of shared music tastes to a graph in the target domain that represents similar reading preferences. The translation function is expected to generate the target graph based on the source graph. It is in effect learning to capture the intrinsic attributes, like aesthetic preferences and personal characters in this case, and discovering the patterns behind that domain transition. It can facilitate many real-world applications. For instance, given the brain network of a healthy subject, we may want to predict the brain network when the subject has a certain disease. Given the traffic flow of a city, we may want to estimate the traffic flow when an event such as a concert is held in the city.

Figure 1. An example of graph-to-graph translation

Graph translation is attracting increasing attention, and several efforts have been made. Most of them focus on modeling the transition of graph topology, in the form of edge existence and node attributes, across the source and target domains. For example, Guo et al. (2018) formally defined this task and proposed a basic framework based on edge-node interactions. Guo et al. (2019) extended this work by building two parallel paths for updating nodes and edges respectively, and introduced a graph frequency regularization term to maintain the inherent consistency of predicted graphs. Other works focus on the special needs of particular applications. For example, Do et al. (2018) introduced domain knowledge into the translator design for reaction prediction, and Shi et al. (2020) adopted multi-head attention to capture the patterns in skeleton mapping.

Despite these efforts, all the aforementioned graph-to-graph translation algorithms investigate the fully supervised setting, i.e., large-scale paired graphs in the source and target domains are available to provide supervision for learning translation patterns. However, for many domains, obtaining large-scale paired graphs is non-trivial, and sometimes even impossible. The source and target graphs are expected to share exactly the same node set, which is expensive to guarantee and to collect at large scale. For example, in Figure 1, the source-domain graphs can be obtained from music streaming platforms like Spotify, and the target-domain graphs from e-book platforms like Amazon Kindle. It is difficult to build the corresponding pairs, as users may use different IDs across the two platforms. Another example is brain networks, where the two domains are the brain activity patterns of healthy people and of dyslexia patients respectively. In this case, constructing pairs would require observing the same person in both the healthy and the diseased condition, which is impossible.

Although a large-scale paired graph dataset is difficult to collect, a dataset with a limited number of paired graphs and a large number of unpaired graphs is much easier to build, which enables us to explore semi-supervised techniques for this problem. Graph translation follows the classical auto-encoder framework, with an encoder in the source domain and a decoder in the target domain. By introducing auxiliary reconstruction tasks and building a source-domain decoder and a target-domain encoder, the mono-domain graphs can be utilized to boost the learning process. Recovering the original source graph requires the source-domain encoder to extract more complete and representative features, and reconstructing a target graph from its embedding guides the target-domain decoder to model the real graph distribution in that domain. However, directly extending previous works in this manner has its pitfalls. The source and target domains are assumed to share a single embedding space, and the extracted embeddings of source graphs are used for both the reconstruction task and the translation task, which makes it difficult to model the semantic drift across the two domains. Take Figure 1 for example: the embedding of node 2 would be close to that of node 3 as a result of the message-passing mechanism in the source domain, but the two are supposed to be distant in order to recover the non-connection in the target domain. This trade-off could impair the model's expressive ability and lead to sub-optimal performance.

To cope with this semantic gap, we design a new framework, which we call SegTran, as shown in Figure 2. Concretely, we design a dedicated translation module to model the semantic transition explicitly and enforce the alignment across the two domains. This translation module is trained on the limited number of paired graphs, while the other components can benefit from both paired cross-domain graphs and unpaired mono-domain graphs during training. Furthermore, to ensure that the patterns captured by the translation module are general, we also explore providing auxiliary training signals for it by maximizing the mutual information between the source graph and its translation result. The structures of the encoder and decoder are also specially designed, as graph translation faces some intrinsic difficulties due to diverse topology and the inter-dependency between nodes and edges (Guo et al., 2018). The main contributions of the paper are:

  • We propose a new framework for graph translation which is better at coping with the semi-supervised scenario. It can promote future research on graph translation algorithms, as the lack of large-scale paired datasets is one key obstacle to their exploration and application.

  • We design novel encoder/decoder architectures by using position embedding, multi-head attention, along with an explicit translation module, to achieve higher expressive ability. The design of each component is well-justified.

  • Experiments are performed on three large datasets, and our model achieves state-of-the-art results on all of them. Extensive analysis of our model’s behavior is also presented.

The rest of the paper is organized as follows. In Sec. 2, we review related work. In Sec. 3, we formally define the problem. In Sec. 4, we give the details of SegTran. In Sec. 5, we conduct experiments to evaluate the effectiveness of SegTran. In Sec. 6, we conclude with future work.

2. Related Work

In this section, we review related work, which includes graph translation and semi-supervised translation.

2.1. Graph Translation

Translation of complex structures has long been a hot research topic. Sequence-to-sequence learning is a standard methodology for many applications in the natural language processing domain (Sutskever et al., 2014; Bahdanau et al., 2015). Tree-structured translation was found to be beneficial for molecular generation tasks in chemistry (Jin et al., 2018, 2019). With the development of novel architectures, especially the transformer (Vaswani et al., 2017) and the graph convolutional network (Kipf and Welling, 2016a), the power of deep networks in modeling inter-dependency and interactions has been greatly improved, which makes it possible to translate a more general and flexible structure, the graph.

In a graph, information is contained in both nodes and edges, and its translation has some intrinsic difficulties. The first is the diversity of topological structures. The semantics of graphs are encoded not only in each node’s feature vector, but also in each edge’s (non-)existence, which is combinatorial in nature. Therefore, encoding a whole graph as a vector is often found to be brittle (Guo et al., 2018). Second, nodes and edges are inter-dependent, and they interact in an iterative way (You et al., 2018). This "recurrent" relationship is difficult to model, because it can easily explode or shrink and make the learning process unstable. Third, the relationships between nodes and their neighbors are non-local: distant nodes can still contribute to the existence of a target edge. These difficulties make it harder to capture the patterns and model the distributions of graphs.

Earlier works mainly focus on obtaining the intermediate representation during the translation process for downstream tasks (Simonovsky and Komodakis, 2018; Kipf and Welling, 2016b), and paid little attention to the design of special models. Li et al. (2018) translate across different graph views by modeling the correlations of different node types, and Liang et al. (2018) and Do et al. (2018) use a set of predefined rules to guide the target graph generation process. All these works require domain knowledge, and their models are domain-specific. Sun and Li (2019) introduce some techniques from the natural language processing domain, but their work is restricted to the topology and is not suitable for our setting. The works most related to ours are Guo et al. (2018, 2019), which both focus on building a general graph translation framework. Guo et al. (2018) update node attributes and edge existence iteratively, and adopt a GAN (Goodfellow et al., 2014) to improve the translation quality. Guo et al. (2019) construct a node-edge co-evolution block and design a spectral-related regularization to maintain the translation consistency.

2.2. Semi-supervised Translation

The semi-supervised problem, where only a small fraction of the training data has supervision, is also an important research problem. Previous works have exploited unlabeled samples in many ways: exploiting relationships among training examples (Kingma et al., 2014), enhancing feature extractors with auxiliary unsupervised tasks (Grandvalet and Bengio, 2005), reducing model bias with ensemble-based methods (Laine and Aila, 2016), etc. However, most of these methods are proposed for discriminative models, assuming the availability of a shared intermediate embedding space with limited dimensions, or expecting the output to be a vector of probabilities. Our task is in essence a generative task, with both the input and the output lying in the same very high-dimensional space, making those approaches infeasible.

The task most related to our problem setting is the utilization of monolingual datasets in neural machine translation, where unpaired sentences in the target language are used to boost the model’s performance. The most popular method there is based on back translation (Artetxe et al., 2018; Sennrich et al., 2016; Prabhumoye et al., 2018), where a pre-trained target-to-source translator is applied to those unpaired sentences to obtain pseudo source sentences. The translator is then trained to generate the real target sentences from those pseudo source sentences. In this way, a cycle-consistency can be built, and it has been shown to achieve state-of-the-art performance. However, in those tasks, the semantics across domains are the same, which enables the use of one shared embedding space for both the source and the target sentences. That is, the meaning of the sentence remains the same, and only the grammar and dictionary change. This difference in syntax and dictionaries can be encoded in the parameters of the decoder model. The case of unsupervised style transfer in computer vision (Zhu et al., 2017) is similar, where only low-level features like texture and color can change across domains.

3. Problem Setting

Many real-world problems can be viewed as graph-to-graph translation problems, i.e., given the graph of an object in the source status, predict the graph of the same object in the target status. For example, predict the shared reading preferences of a group of people given their similarity in music taste, or generate the brain activity network of a subject after a certain disease based on the network in the healthy state.

Throughout the paper, we use $G_s^i = \{V^i, A_s^i, F_s^i\}$ to denote an attributed graph of the $i$-th object in the source status, where $V^i = \{v_1^i, \dots, v_{n_i}^i\}$ is a set of $n_i$ nodes, $A_s^i \in \mathbb{R}^{n_i \times n_i}$ is the adjacency matrix of $G_s^i$, and $F_s^i \in \mathbb{R}^{n_i \times d_s}$ denotes the node attribute matrix, where $F_s^i[j,:] \in \mathbb{R}^{1 \times d_s}$ is the node attributes of node $j$ and $d_s$ is the dimension of the node attributes. Similarly, we define the corresponding target graph of $G_s^i$ as $G_t^i = \{V^i, A_t^i, F_t^i\}$, where $A_t^i \in \mathbb{R}^{n_i \times n_i}$ is the adjacency matrix of $G_t^i$, and $F_t^i \in \mathbb{R}^{n_i \times d_t}$ with $d_t$ being the dimensionality of node attributes in the target status. We set $d_s$ to be the same as $d_t$ in this work due to the datasets, but our model can also be applied to scenarios where they are different. Note that we assume $G_s^i$ and $G_t^i$ share the same set of nodes while their graph structures are different. This is a reasonable assumption for many applications. Taking Figure 1 as an example again, the nodes, referring to users, have to remain in the same set to build the correspondence across the two domains. But the edges, representing users’ relationships with each other, along with the node attributes can be different. For two graphs of different objects in the same status, say $G_s^i$ and $G_s^j$, we consider $V^i \neq V^j$, because $V^i = V^j$ is a special case which can also be handled by our model.

To learn a model that can perform graph-to-graph translation, we need a set of paired graphs to provide supervision. We use $\mathcal{G}^p = \{(G_s^i, G_t^i)\}_{i=1}^{N_p}$ to denote $N_p$ paired graphs. However, obtaining large-scale paired graphs is not easy. In many cases, it is difficult or even impossible to acquire the representation of one graph in two domains, as in the reading/music preference example and the brain network example. Thus, $N_p$ is usually small, which is insufficient to train a graph translation model well. Fortunately, it is relatively easy to obtain a set of unpaired graphs, i.e., graphs in one status with their corresponding graphs in the other status missing. For simplicity, we use $\mathcal{G}^s = \{G_s^i\}_{i=1}^{N_s}$ to denote a set of $N_s$ graphs in the source status, and $\mathcal{G}^t = \{G_t^i\}_{i=1}^{N_t}$ to represent a set of $N_t$ graphs in the target status. $\mathcal{G}^p$ is the overlapping part of $\mathcal{G}^s$ and $\mathcal{G}^t$, with $N_p \ll N_s$ and $N_p \ll N_t$. With the above notations, the problem of semi-supervised graph translation is formally defined as

Given $\mathcal{G}^s$, $\mathcal{G}^t$ and $\mathcal{G}^p$, we aim to learn a function $f$ that can transform a source-domain graph to the target domain, i.e.,

$$f(G_s^i) \rightarrow G_t^i$$


Note that though we only consider two statuses, it is straightforward to extend our setting to multi-status graph translation.

Figure 2. Overview of the framework

4. Methodology

In this section, we give the details of the proposed framework SegTran for semi-supervised graph-to-graph translation. It is challenging to directly translate a graph from one domain to another because (i) the graph structures are complex and discrete; and (ii) we have limited paired graphs to provide supervision. Thus, our basic idea is to learn good representations that preserve the topological and nodal features, translate the graph in the latent domain, and project the translated representation back to a graph in the target domain.

An illustration of the proposed framework is shown in Figure 2. It is composed of three components: (i) two encoders which aim to learn node and graph representations for graphs of each domain; (ii) two decoders which aim to reconstruct the attributed graph, guiding the encoder to learn good representations during the training phase and predicting the graph in the target domain during the test phase; and (iii) a translator which is designed to translate the graph in the latent domain at both the node and the graph level. In addition, the translator leverages both the paired and the unpaired graphs to learn a better transition ability. This design has several advantages: (i) it allows the unpaired graphs to provide auxiliary training signals for learning a better encoder; (ii) by sharing the same decoder for both the translation and the reconstruction task, its ability to model the graph distribution of the target domain can also be enhanced; and (iii) to handle the semantic gap, instead of constructing one shared embedding space and expecting it to be domain-agnostic, we explicitly model the semantic transition process through a translator module, forming the so-called “dual embedding” between the source and target domains. Next, we give the details of each component.

Given an input attributed graph, we first introduce an encoder to learn node embeddings that capture the network topology and node attributes, so that we can conduct translation in the latent space. Graph neural networks (GNNs) have shown promising results for learning node representations (Kipf and Welling, 2016a; Hamilton et al., 2017; Jin et al., 2020b; Tang et al., 2020; Jin et al., 2020a). Thus, we choose a GNN as the basic encoder. The basic idea of a GNN is to aggregate a node’s neighborhood information in each layer to update the representation of the node, which can be mathematically written in a message-passing framework (Gilmer et al., 2017):

$$h_v^{l+1} = \sigma\Big(\mathrm{UPDATE}\big(h_v^{l},\ \mathrm{MSG}(\{h_u^{l} : u \in N(v)\})\big)\Big)$$

$h_v^{l}$ is the embedding of node $v$ at layer $l$, $u$ belongs to $v$’s neighbor group $N(v)$, and $\sigma$ refers to the activation function. We use a linear layer as the updating function and the mean as the message function. However, previous works have found two main issues with classical GNNs. The first is that their expressiveness is significantly constrained by depth. For example, a $k$-layer graph convolutional network (GCN) can only capture $k$-order graph moments without the bias term (Dehmamy et al., 2019), and each layer is effectively a low-pass filter, resulting in over-smoothed graph features as the network goes deep (Wu et al., 2019). This issue weakens the GNN’s representation ability, making it harder to encode topology information in the processed node embedding. The second issue is that most GNNs are position-agnostic. Classical GNNs fail to distinguish two nodes with similar neighborhoods but at different positions in the graph (You et al., 2019; Zhao et al., 2019). This problem becomes more severe when we use a GNN to extract node embeddings for predicting the existence of edges. For example, if two nodes $u$ and $v$ and their neighborhoods have similar attributes, the encoder would tend to produce similar embeddings for them. This makes it hard for the decoder to learn that $u$ is not connected to a third node $w$ but $v$ is.

To address these issues, we extend classical message-passing GNNs by adding skip connections and using position embedding, as shown in Figure 3. With skip connections, higher GNN layers can still access lower-level features, and better representations can be learned. Concretely, we concatenate the initial node attributes to the output of each encoding layer, as in Figure 3. As to the position embedding, inspired by You et al. (2019), we represent each node using the lengths of its shortest paths to a set of selected anchors. Concretely, each time a graph $G$ is input, we randomly select $k$ nodes as anchors and calculate the position embedding of each node based on them. We use $P^0$ to denote this initial position embedding. As proven in their work, a bound on the position distortion can be calculated for an anchor set of size $O(\log^2 n)$, where $n$ is the number of nodes in the graph. Considering the size of graphs in our datasets, we preset $k$ to eight for simplicity. Although the absolute values of the obtained position embedding differ when the anchor set changes, the relative relationships between nodes remain the same and can be exploited.
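The anchor-based position embedding described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and the `1 / (distance + 1)` transform is an assumption borrowed from position-aware GNNs (the text only states that shortest-path lengths to the anchors are used).

```python
import random
from collections import deque


def bfs_distances(adj, source):
    """Hop distance from `source` to every reachable node of an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def position_embedding(adj, num_anchors=8, rng=None):
    """Return ({node: [one value per anchor]}, anchors).

    Each value is 1 / (d + 1) for shortest-path length d to that anchor
    (assumed transform), or 0.0 when the anchor is unreachable.
    """
    rng = rng or random.Random()
    nodes = sorted(adj)
    anchors = rng.sample(nodes, min(num_anchors, len(nodes)))
    per_anchor = [bfs_distances(adj, a) for a in anchors]
    emb = {v: [1.0 / (d[v] + 1) if v in d else 0.0 for d in per_anchor]
           for v in nodes}
    return emb, anchors
```

Because the anchors are resampled for every input graph, only the relative pattern of these values is meaningful, which matches the remark above about relative relationships.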

4.1. Encoder

Figure 3. Illustration of the encoder

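With these preparations, we can now present the formulation of our extended GNN structure. In the $l$-th block, the message passing and fusing process can be written as:

$$h_v^{l+1} = \sigma\big(W_h^{l} \cdot \mathrm{CONCAT}((H^{l})^{\top} A_{:,v},\ F_v)\big), \qquad p_v^{l+1} = \sigma\big(W_p^{l} \cdot \mathrm{CONCAT}((P^{l})^{\top} A_{:,v},\ P^0_v)\big)$$

$H^{l}$ represents the node embedding at layer $l$, $A_{:,v}$ is the $v$-th column of the adjacency matrix, and $P^{l}$ is the position embedding at that layer. $F$ and $P^0$ are the initial node attributes and position embedding respectively, with $F_v$ and $P^0_v$ their rows for node $v$. $W_h^{l}$ and $W_p^{l}$ are the weight parameters, and $\sigma$ refers to the activation function, such as ReLU.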
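To make the data flow of one encoder block concrete, the following NumPy sketch aggregates neighbor embeddings, concatenates the initial inputs as a skip connection, and applies a linear map with ReLU. It is an illustration under stated assumptions: the function name and argument layout are hypothetical, and row-normalizing the adjacency matrix to realize the mean message function is an assumed detail.

```python
import numpy as np


def encoder_block(A, H, P, F0, P0, W_h, W_p):
    """One encoder block for attribute embedding H and position embedding P.

    A: (n, n) adjacency; H, P: current embeddings; F0, P0: initial inputs.
    """
    deg = A.sum(axis=1, keepdims=True)
    A_mean = A / np.maximum(deg, 1.0)                 # mean over neighbours (assumed)
    H_in = np.concatenate([A_mean @ H, F0], axis=1)   # skip connection to raw attributes
    P_in = np.concatenate([A_mean @ P, P0], axis=1)   # skip connection to raw positions
    H_next = np.maximum(H_in @ W_h, 0.0)              # linear update + ReLU
    P_next = np.maximum(P_in @ W_p, 0.0)
    return H_next, P_next
```

For the first block, `H` and `P` would simply be `F0` and `P0`; stacking blocks keeps the raw inputs visible at every depth, which is the point of the skip connection.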

4.2. Decoder

With the representation learned by the encoder, we introduce a decoder to reconstruct the graph. Note that the decoder is not only used to perform the reconstruction task during the training phase, but is also used to complete the graph translation in the test phase. During testing, suppose we are given a source graph $G_s^i$; to predict $G_t^i$, it needs to first go through the source-domain encoder, then the translator, followed by the target-domain decoder.

After the encoding part, for each node $v$ we now have two representation vectors, $h_v$ and $p_v$. $h_v$ is the embedding of the node attributes, and $p_v$ is the processed embedding of its relative position in the graph. To predict the existence of edges between two nodes, clearly both of these features are helpful. Considering that the two features carry semantic meanings at different levels and lie in different embedding spaces, directly concatenating them might be improper. In order to learn to fuse the position and attribute information, we construct the decoder from a number of same-structured blocks. Inside each block, we apply a multi-head attention layer (Vaswani et al., 2017) with “Query”/“Key” being $P$ and “Value” being $H$. Its formulation can be written as:

$$\mathrm{head}_i = \mathrm{softmax}\Big(\frac{(P W_Q^i)(P W_K^i)^{\top}}{\sqrt{d_k}}\Big) H W_V^i, \qquad Z = \mathrm{CONCAT}(\mathrm{head}_1, \dots, \mathrm{head}_m)\, W_O$$

$W_Q^i \in \mathbb{R}^{d_p \times d_k}$, $W_K^i \in \mathbb{R}^{d_p \times d_k}$ and $W_V^i \in \mathbb{R}^{d_h \times d_v}$ are the query, key and value embedding matrices respectively. $W_O$ is the output matrix which fuses the results from the $m$ heads. $d_p$/$d_h$ is the dimensionality of $P$/$H$, the position/node embedding obtained from the encoder. $d_k$ is the embedding dimension of Query/Key, and $d_v$ is the embedding dimension of Value. In this way, the model can learn to utilize the position information when aggregating and updating the attribute embedding in a data-driven manner.

Then, after concatenating the position embedding and the processed attribute embedding, we use a weighted inner product to perform link prediction:

$$e_v = \mathrm{CONCAT}(z_v, p_v), \qquad \hat{A}_{u,v} = e_u^{\top} W_A\, e_v$$

Here, $z_v$ is the row of $Z$ for node $v$, and $W_A$ is the parameter matrix capturing the interaction between nodes. For node attribute prediction, we append a two-layer MLP after the multi-head attention module:

$$\hat{F}_v = \mathrm{MLP}(z_v)$$

Figure 4. Illustration of the decoder
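The decoder's two steps (attention-based fusion of position and attribute embeddings, then a weighted inner product for links) can be sketched as follows. This is a simplified illustration: it uses a single attention head instead of several, all function names are hypothetical, and the final sigmoid that maps scores to (0, 1) is an added assumption.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_fuse(P, H, W_q, W_k, W_v):
    """Single head: positions produce the attention weights, attributes are mixed."""
    Q, K, V = P @ W_q, P @ W_k, H @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    return softmax(scores, axis=1) @ V


def predict_links(Z, P, W_a):
    """Weighted inner product between concatenated node embeddings."""
    E = np.concatenate([Z, P], axis=1)
    logits = E @ W_a @ E.T
    return 1.0 / (1.0 + np.exp(-logits))   # sigmoid squashing (assumption)
```

Using positions as Query/Key means two nodes exchange attribute information according to how close they are in the graph, not according to how similar their attributes are.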

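Since graphs are usually sparse, i.e., the majority of elements in the adjacency matrix $A$ are zero, simply treating each element of the adjacency matrix equally in the graph reconstruction loss would make the loss function dominated by missing links, causing trivial results. To alleviate this problem, following existing work (Pan et al., 2008), we assign higher weight to the non-zero elements of $A$ as follows:

$$S_{uv} = \begin{cases} 1, & \text{if } A_{uv} > 0 \\ \delta, & \text{otherwise} \end{cases}$$

where $\delta$ is between 0 and 1 and controls the weight of the missing links in a graph. With $S$, the graph reconstruction loss can be written as:

$$\mathcal{L}_{rec} = \big\| S \odot (\hat{A} - A) \big\|_F^2 + \big\| \hat{F} - F \big\|_F^2 \tag{9}$$

where $\odot$ denotes element-wise multiplication, and $\hat{A}$ and $\hat{F}$ are the adjacency and attribute matrices predicted by the decoder.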
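A small numeric sketch shows how down-weighting missing links changes the loss. The function name and the default `delta=0.5` are illustrative assumptions, and including a squared-error term for node attributes alongside the adjacency term is also an assumption about the exact form.

```python
import numpy as np


def reconstruction_loss(A_pred, A_true, F_pred, F_true, delta=0.5):
    """Squared error with existing links weighted 1 and missing links weighted delta."""
    S = np.where(A_true > 0, 1.0, delta)                # per-entry weights
    adj_term = np.sum((S * (A_pred - A_true)) ** 2)     # weighted adjacency error
    attr_term = np.sum((F_pred - F_true) ** 2)          # attribute error
    return adj_term + attr_term
```

With a two-node graph containing one edge and a constant prediction of 0.5 everywhere, the two edge entries contribute 0.25 each while the two missing entries contribute only 0.0625 each, so an all-zero prediction is no longer the cheapest answer on sparse graphs.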

4.3. Translator Module

With the encoders and decoders in place, we can now introduce how we learn the translation patterns across the two domains. In this framework, we model the translation process at the intermediate level, by transitioning a source graph embedding to its target graph embedding. As shown in Figure 2, the translator module is a key component of our framework, which is required to build the mapping from the source domain to the target domain. Concretely, we adopt an MLP structure to implement it. The translator operates in a node-wise fashion, taking both the global feature and the node-level feature as input:

$$\hat{h}_{t,v} = \mathrm{MLP}\big(\mathrm{CONCAT}(h_{s,v},\ \mathrm{READOUT}(H_s))\big)$$

$\hat{H}_t$ refers to the translated result, and $\hat{h}_{t,v}$ represents the translated intermediate-level node attribute embedding of node $v$, computed from its source embedding $h_{s,v}$ (the row of $H_s$ for node $v$). The same is the case for the position embedding $\hat{p}_{t,v}$. $\mathrm{READOUT}$ is the global pooling function, which is used to fuse the representations of all nodes in a graph. In this way, the graph-level embedding is appended to the extracted node-level embedding, so that translation patterns can be learned with both local and global features.
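The node-wise translator described above can be sketched as a small MLP applied to each node embedding concatenated with a pooled graph embedding. The function name, the choice of mean pooling as the readout, and the two-layer ReLU MLP are all assumptions made for illustration.

```python
import numpy as np


def translate_nodes(H_src, W1, b1, W2, b2):
    """Map source node embeddings (n, d) to translated embeddings (n, d_out)."""
    g = H_src.mean(axis=0, keepdims=True)                    # global readout (assumed mean)
    X = np.concatenate([H_src, np.repeat(g, H_src.shape[0], axis=0)], axis=1)
    hidden = np.maximum(X @ W1 + b1, 0.0)                    # first MLP layer + ReLU
    return hidden @ W2 + b2                                  # second MLP layer
```

Because every node sees the same pooled vector and the same weights, reordering the input nodes simply reorders the output rows, so the translation does not depend on node numbering.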

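During training, this correspondence is easy to establish for paired graphs. We can perform a regression task and minimize the prediction loss at the intermediate embedding level as:

$$\mathcal{L}_{trans} = \sum_{(G_s^i, G_t^i) \in \mathcal{G}^p} \big\| \hat{H}_t^i - H_t^i \big\|_F^2 + \big\| \hat{P}_t^i - P_t^i \big\|_F^2 \tag{11}$$

In this equation, $H_t^i$ and $P_t^i$ refer to the embeddings obtained from the target graph by the target-domain encoder, and $\hat{H}_t^i$ and $\hat{P}_t^i$ are the results predicted from the source graph by the translator module.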

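However, for the large number of unpaired graphs, this kind of training signal cannot be obtained. Inspired by Belghazi et al. (2018) and Hjelm et al. (2018), we propose to use the mutual information (MI) score as an auxiliary supervision to better align the two spaces, as it can quantify the dependence between two random variables. If two graphs are paired, they should have a high MI score with each other, and if they are not paired, their MI score should be low. Therefore, by optimizing the MI score between the translated result and the source graph, the translator module is encouraged to produce more credible results. However, as the dimension of the embedding space is very high ($n \times d$ for a graph with $n$ nodes and $d$-dimensional node embeddings), working on it directly could suffer from the curse of dimensionality and give brittle results. To address this, we apply the MI score to the global-level embedding obtained through the $\mathrm{READOUT}$ function, which reduces the dimension to $d$. For simplicity of notation, we use $g_s$ to denote the global feature of a source graph after the readout,

$$g_s = \mathrm{READOUT}\big(\mathrm{CONCAT}(H_s, P_s)\big)$$

and use $\hat{g}_t$ to denote the output of the translator module, i.e., the global feature of the translated result:

$$\hat{g}_t = \mathrm{READOUT}\big(\mathrm{CONCAT}(\hat{H}_t, \hat{P}_t)\big)$$

The mutual information score is calculated on top of these, following the same procedure as Hjelm et al. (2018), which involves a specific estimator $T$:

$$\mathcal{L}_{MI} = -\Big( \mathbb{E}_{\mathbb{P}}\big[-\mathrm{sp}(-T(g_s, \hat{g}_t))\big] - \mathbb{E}_{\mathbb{P} \times \tilde{\mathbb{P}}}\big[\mathrm{sp}(T(g_s, \hat{g}_t'))\big] \Big) \tag{14}$$

where $\mathrm{sp}(x) = \log(1 + e^{x})$ is the softplus function and $\hat{g}_t'$ is the translated global feature of a different source graph.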
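As a rough illustration of how such a score can be computed on a batch, the sketch below evaluates the Jensen-Shannon style MI lower bound used in Deep InfoMax from a matrix of estimator outputs. Whether SegTran uses exactly this variant is an assumption; the function names and the convention that matched pairs lie on the diagonal are hypothetical.

```python
import numpy as np


def softplus(x):
    """log(1 + exp(x)), computed stably."""
    return np.logaddexp(0.0, x)


def jsd_mi_estimate(scores):
    """scores[i, j] = T(g_s of graph i, translated g_t of graph j).

    Diagonal entries are matched pairs; off-diagonal entries act as negatives.
    """
    n = scores.shape[0]
    pos = np.diag(scores)
    neg = scores[~np.eye(n, dtype=bool)]
    return np.mean(-softplus(-pos)) - np.mean(softplus(neg))
```

An estimator that scores matched pairs high and mismatched pairs low pushes this value toward 0, while an uninformative estimator that outputs 0 everywhere gives -2 log 2; the MI loss is the negative of this estimate.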


4.4. Objective Function of SegTran

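Putting the previous parts together, we get our full model architecture. The optimization goal during training can be written as:

$$\min\ \mathcal{L} = \mathcal{L}_{trans} + \lambda_1 \mathcal{L}_{rec} + \lambda_2 \mathcal{L}_{MI}$$

Besides the paired translation loss $\mathcal{L}_{trans}$ defined in Equation 11, we also add the reconstruction loss $\mathcal{L}_{rec}$ defined in Equation 9 and the MI loss $\mathcal{L}_{MI}$ defined in Equation 14. $\mathcal{L}_{rec}$ is applied to graphs from both the source and target domains, and its weight is controlled by the hyper-parameter $\lambda_1$. $\mathcal{L}_{MI}$ is applied only to unpaired source graphs, by computing the mutual information between the global feature $g_s$ of the source graph and that of its translation result $\hat{g}_t$, and its weight is controlled by $\lambda_2$.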

4.5. Training Algorithm

As our model is composed of multiple different components, we follow a pretrain-finetune pipeline to make the training process more stable. The full training algorithm can be found in Algorithm 1. We first pre-train the encoders and decoders using the reconstruction loss with both paired and unpaired graphs, so that a meaningful intermediate embedding space can be learned for each domain. Then, we fix them and train only the translator module, to learn the mapping between the two embedding spaces. After that, we fix the whole model and prepare the mutual information estimator. When all these preparations are done, we start the fine-tuning steps, alternately updating on paired and unpaired graphs, and train the whole model in an end-to-end manner.

1:  Initialize the encoder and decoder in both the source and target domains, by pretraining on the reconstruction loss with graphs from both domains;
2:  Fix other parts, only train the translator module, based on the translation loss;
3:  Fix the whole model, pretrain the mutual information estimator, following Hjelm et al. (2018);
4:  while Not Converged do
5:     Receive a batch of paired graphs;
6:     Update the model using the translation loss and the reconstruction loss;
7:     Update the mutual information estimator;
8:     Receive a batch of unpaired graphs;
9:     Update the model using the reconstruction loss and the MI loss;
10:  end while
11:  return  Trained encoder, decoder, and translator module.
Algorithm 1. Full Training Algorithm
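The control flow of Algorithm 1 can be sketched as a small driver that calls one function per stage. Everything here is a hypothetical scaffold: the step names are placeholders for the actual update rules, and the fixed number of fine-tuning rounds stands in for the convergence check.

```python
from itertools import cycle


def run_schedule(steps, paired_batches, unpaired_batches, finetune_rounds):
    """Run the pretrain-then-finetune schedule; returns the ordered list of stages.

    steps: dict mapping stage names to callables (placeholders for real updates).
    """
    log = []
    for stage in ("pretrain_autoencoders", "pretrain_translator", "pretrain_mi_estimator"):
        steps[stage]()
        log.append(stage)
    paired, unpaired = cycle(paired_batches), cycle(unpaired_batches)
    for _ in range(finetune_rounds):              # stands in for "while not converged"
        steps["update_on_paired"](next(paired))   # translation + reconstruction losses
        log.append("update_on_paired")
        steps["update_mi_estimator"]()
        log.append("update_mi_estimator")
        steps["update_on_unpaired"](next(unpaired))  # reconstruction + MI losses
        log.append("update_on_unpaired")
    return log
```

Cycling over the batch lists lets the smaller paired set be revisited many times while the larger unpaired set is consumed at the same rate.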

5. Experiments

In this section, we conduct experiments to evaluate the effectiveness of SegTran and the factors that could affect its performance. In particular, we aim to answer the following questions.

  • How effective is SegTran in graph translation by leveraging paired and unpaired graphs?

  • How do different ratios of unpaired graphs affect the translation performance of SegTran?

  • What is the contribution of each component of SegTran?

We begin by introducing the experimental settings, datasets and baselines. We then conduct experiments to answer these questions. Finally, we analyze parameter sensitivity of SegTran.

5.1. Experimental Settings

5.1.1. Datasets

We conduct experiments on one synthetic dataset, BA (Guo et al., 2018), and two widely used real-world datasets, DBLP (Tang et al., 2008) and Traffic. The details of the datasets are given as follows:

  • BA: In the BA dataset, the source domain is constructed by the Barabasi-Albert model (McInnes et al., 1999). Each graph is built by adding nodes sequentially with the preferential attachment mechanism, until it reaches 40 nodes. Each newly added node is connected to only one existing node $i$, chosen randomly with probability $p_i = \deg(i) / \sum_j \deg(j)$. Here, $\deg(i)$ means the current degree of node $i$. This method generates graphs that follow scale-free degree distributions. The target graph is constructed as the $k$-hop reachability graph, i.e., if node $i$ is $k$-hop reachable from node $j$ in the source graph, then they are connected in the target graph. As the generated graphs are unattributed, we initialize the node attribute matrix $F$ to be the same as the adjacency matrix $A$. We include this synthetic dataset to understand whether SegTran can really capture the translation patterns.

  • DBLP: DBLP is a multi-view citation network, with nodes representing researchers. It provides edges of three types, representing co-authorship, citation, and research overlap respectively. Each node has an attribute vector encoding the researcher's interests. In our experiment, we use the citation network as the source domain and the research overlap network as the target domain. This dataset is given in the transductive learning setting, with all nodes appearing in one single large graph. To address this, we manually split it by first selecting a center node and then sampling nodes within its 2-hop neighborhood to get a smaller graph. The sampled graphs are not required to have exactly the same number of nodes, and we control the graph size by setting an upper bound on the node degree.

  • Traffic: For the traffic dataset, we use one year of the publicly available New York Taxi Trip dataset. Each trip record contains the pick-up and drop-off places, as well as the trip start time. We follow prior studies (Yao et al., 2018, 2019b, 2019a; Wang et al., 2019) and split the city into regions, and build the edges based on the amount of taxi flow between each pair of regions within one hour. This results in 8760 graphs in total (see Table 1). On this dataset, we perform a traffic flow prediction task, and set the target domain as the graph state one hour in the future. Node attributes are initialized using the mean of historic taxi flows over the preceding hours.
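
The BA construction described in the first item can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' generator: the two-node seed graph, the reading of hop reachability as "reachable within a given number of hops", the hop count `k=2`, and all function names are assumptions; the 40-node size is taken from Table 1.

```python
import random
from collections import deque


def barabasi_albert_tree(num_nodes, rng):
    """Grow a graph by preferential attachment, one edge per new node.

    Returns an adjacency list (dict: node -> set of neighbours).
    """
    adj = {0: {1}, 1: {0}}  # assumed seed graph: a single edge
    # Each edge adds both endpoints to this list, so a uniform draw picks
    # node i with probability degree(i) / (sum of all degrees).
    endpoints = [0, 1]
    for new in range(2, num_nodes):
        chosen = rng.choice(endpoints)
        adj[new] = {chosen}
        adj[chosen].add(new)
        endpoints.extend([new, chosen])
    return adj


def k_hop_reachability(adj, k):
    """Connect u and v in the target graph if 1 <= dist(u, v) <= k in adj."""
    target = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            if dist[u] == k:
                continue  # do not expand beyond k hops
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        target[source] = {v for v in dist if v != source}
    return target


def to_matrix(adj):
    """Dense 0/1 adjacency matrix as a list of lists."""
    n = len(adj)
    return [[1 if j in adj[i] else 0 for j in range(n)] for i in range(n)]


rng = random.Random(0)
source_adj = barabasi_albert_tree(40, rng)        # 40 nodes, as in Table 1
target_adj = k_hop_reachability(source_adj, k=2)  # k=2 is an assumed value
source_matrix = to_matrix(source_adj)
# Unattributed graphs: node attributes are initialised to the adjacency matrix.
node_attributes = [row[:] for row in source_matrix]
```

Keeping a flat list of edge endpoints is a common way to sample proportionally to degree without recomputing the degree distribution after every insertion.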

| | BA | DBLP | Traffic |
|---|---|---|---|
| Dataset Size | 5000 | 22559 | 8760 |
| Average Graph Size | 40 | 125 | 100 |
| Paired Training | 500 | 2255 | 876 |
| Unpaired Training | 2000 | 9020 | 3504 |
| Paired Testing | 1000 | 4510 | 1752 |

Table 1. Statistics of the three datasets.

The statistics of the three datasets are summarized in Table 1, where “Dataset Size” is the total number of graphs and “Average Graph Size” is the average number of nodes in each graph. Note that in the BA and Traffic datasets, all graphs have the same size. As we adopt the semi-supervised problem setting, we use only a small subset of graphs as paired, and treat a larger subset as unpaired. The sizes of these subsets are listed in the “Paired Training” and “Unpaired Training” rows, respectively. The number of paired graphs for testing is listed in the “Paired Testing” row.
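
The subset sizes in Table 1 are consistent with a simple rule: one tenth of the graphs (rounded down) are paired training graphs, four times that many are unpaired, and twice that many are used for testing. The sketch below reproduces those sizes. The rule is inferred from the table rather than stated in the text, and the random shuffling, the function name, and the parameter names are assumptions.

```python
import random


def split_dataset(num_graphs, rng, paired_divisor=10):
    """Split graph indices into paired-train, unpaired-train and test lists."""
    n_paired = num_graphs // paired_divisor  # one tenth, rounded down
    n_unpaired = 4 * n_paired                # inferred 1:4 paired/unpaired ratio
    n_test = 2 * n_paired                    # inferred 1:2 paired/test ratio
    order = list(range(num_graphs))
    rng.shuffle(order)
    paired = order[:n_paired]
    unpaired = order[n_paired:n_paired + n_unpaired]
    test = order[n_paired + n_unpaired:n_paired + n_unpaired + n_test]
    return paired, unpaired, test


paired, unpaired, test = split_dataset(5000, random.Random(0))  # BA
```

Under this rule roughly 30% of each dataset is left unused, which matches the gap between “Dataset Size” and the sum of the three subset rows.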

5.1.2. Baselines

We compare SegTran with representative and state-of-the-art supervised and semi-supervised graph-to-graph translation approaches, which include:

  • DGT (Guo et al., 2018): This method belongs to the encoder-decoder framework. The encoder is composed of edge-to-edge blocks to update the edge representation and edge-to-node blocks to obtain the node embedding. The decoder performs the inverse process of the encoder, and maps the extracted node representation to the target graph. It also utilizes a discriminator and follows the adversarial training approach to generate more “real” graphs. For implementation, we use the code provided by the authors.

  • NEC-DGT (Guo et al., 2019): This is the state-of-the-art approach in graph translation, which also follows the encoder-decoder framework. To model the iterative and interactive translation process of nodes and edges, it splits each of its blocks into two branches, one for updating the node attributes and one for updating the edge representations. A special architecture is designed for each branch. Besides, it also introduces a spectral-based regularization term to learn and maintain the consistency of predicted nodes and edges. We use the implementation provided by the authors.

  • NEC-DGT-Enhanced: As both DGT and NEC-DGT can only work on paired graphs, they are unable to learn from the unpaired training samples, which would make the comparison unfair. To address this problem, we design an extension of NEC-DGT by adding a source-domain decoder, and call it NEC-DGT-Enhanced. With this auxiliary decoder, NEC-DGT-Enhanced can utilize unpaired graphs by performing reconstruction tasks, which is expected to improve the performance of the encoder. Note that this can be treated as a variant of SegTran without the dual embedding and translator module.

  • SegTran_p: To further validate our model design and to verify the improvement that can be gained from utilizing unpaired graphs, we also design a new baseline, SegTran_p. It has the same architecture as SegTran, but is trained only on the paired graphs.

5.1.3. Configurations

All experiments are conducted on a machine with an Nvidia GPU (Tesla V100, 1246 MHz, 16 GB memory), and the Adam optimization algorithm is used to train all the models.

For DGT, NEC-DGT, and NEC-DGT-Enhanced, the initial learning rate follows the settings in their code. SegTran and SegTran_p share a single initial learning rate. In SegTran, the two loss-weight hyper-parameters, which weight the reconstruction loss and the mutual information loss, are set to the same value. A more detailed analysis of the performance sensitivity to them can be found in Section 5.5. Besides, all models are trained until convergence, subject to a fixed maximum number of training epochs.


5.1.4. Evaluation Metrics

For the evaluation metrics, we adopt mean squared error (MSE) and mean absolute percentage error (MAPE). As MSE is more sensitive when the ground-truth value is large and MAPE is more sensitive when the ground-truth value is small, we believe that together they provide a more comprehensive comparison. Besides, class imbalance exists in this problem: edges are usually sparse, and directly training on them would lead to trivial results. To address this, we re-weight the importance of edges and non-edges during the loss calculation, in both training and testing.
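
The two metrics and the re-weighting can be sketched as follows. This is a minimal illustration rather than the authors' evaluation code: the inverse-frequency weights in `balanced_weights`, the choice to compute MAPE only over non-zero ground-truth entries, and all function names are assumptions.

```python
def balanced_weights(truth):
    """Inverse-frequency weights for edges and non-edges (an assumed scheme).

    Assumes the ground truth contains at least one edge and one non-edge.
    """
    entries = [t for row in truth for t in row]
    n_edges = sum(1 for t in entries if t > 0)
    n_non_edges = len(entries) - n_edges
    total = len(entries)
    return total / (2 * n_edges), total / (2 * n_non_edges)


def weighted_mse(pred, truth, edge_weight=1.0, non_edge_weight=1.0):
    """MSE over all matrix entries, with separate weights per class."""
    weighted_error, weight_sum = 0.0, 0.0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            w = edge_weight if t > 0 else non_edge_weight
            weighted_error += w * (p - t) ** 2
            weight_sum += w
    return weighted_error / weight_sum


def mape(pred, truth):
    """Mean absolute percentage error over non-zero ground-truth entries."""
    ratios = [abs(p - t) / abs(t)
              for pred_row, truth_row in zip(pred, truth)
              for p, t in zip(pred_row, truth_row) if t != 0]
    return sum(ratios) / len(ratios)


truth = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
pred = [[0.0, 0.5, 0.0], [0.5, 0.0, 0.0], [0.0, 0.0, 0.5]]
w_edge, w_non_edge = balanced_weights(truth)
plain = weighted_mse(pred, truth)                      # unweighted MSE
balanced = weighted_mse(pred, truth, w_edge, w_non_edge)
error_rate = mape(pred, truth)
```

In this toy example the two under-predicted edges count for more once the weights are balanced, so the balanced MSE (1/7) is larger than the plain MSE (1/12).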

Table 2. Comparison of different approaches in the semi-supervised scenario. Among these approaches, DGT, NEC-DGT and SegTran_p use only paired graphs, while NEC-DGT-Enhanced and SegTran use both paired and unpaired graphs during training.

5.2. Graph-to-Graph Translation Performance

To answer the first question, we compare the graph translation performance of SegTran with the baselines under the semi-supervised scenario on the three datasets. We train the model on the paired and unpaired training graphs, and conduct graph translation on the test graphs. Each experiment is repeated multiple times to reduce the effect of randomness. The average performance with standard deviation in terms of MSE and MAPE is reported in Table 2. From the table, we can make the following observations:

  • SegTran outperforms all other approaches by a reasonable margin. Compared with NEC-DGT-Enhanced, it achieves a lower MSE on the BA, DBLP, and Traffic datasets, and the same holds in terms of MAPE. This result validates the effectiveness of SegTran.

  • SegTran is much more stable than NEC-DGT-Enhanced, although both utilize unpaired graphs. Looking at the standard deviations, NEC-DGT-Enhanced shows higher performance variation; for example, its standard deviation on the BA dataset is about twice that of SegTran. This phenomenon could result from NEC-DGT-Enhanced’s difficulty in trading off the semantics of the source domain against those of the target domain when learning the intermediate embedding space.

  • The performance differences on the DBLP dataset are smaller than those on the other two. This could result from the dataset size. Referring to Table 1, DBLP is significantly larger than the other two datasets, so it also provides more paired training graphs, and the benefit of adding unpaired graphs is correspondingly smaller.

  • Generally, NEC-DGT-Enhanced performs better than NEC-DGT, and SegTran performs better than SegTran_p. This observation demonstrates the significance of learning from unpaired graphs. NEC-DGT-Enhanced follows exactly the same structure as NEC-DGT apart from an auxiliary decoder that performs reconstruction on the source-domain graphs, yet it achieves a lower MSE. Similar observations can be made when comparing SegTran and SegTran_p. This result shows that when paired graphs are limited, it is important to use unpaired graphs to boost the model’s performance.

To summarize, these results show the importance of introducing unpaired graphs into graph translation tasks when only a limited number of paired training samples is available. They also support the intuition that explicitly modeling the difference between the two domains in the intermediate space makes it easier for the model to utilize unpaired graphs.

Figure 5. Effects of the ratio of unpaired graphs: (a) BA; (b) Traffic.

5.3. Ratio of Unpaired Graphs

In this subsection, we analyze the sensitivity of our model and NEC-DGT-Enhanced to the amount of unpaired graphs, which answers the second question. We fix the number of paired training graphs and change the setting by taking different percentages of samples in the dataset as unpaired graphs for semi-supervised training. Concretely, we vary the ratio of unpaired to paired graphs and keep the other hyper-parameters unchanged. We only report the performance on BA and Traffic, as we have similar observations on DBLP. The results are presented in Figure 5. From the figure, we make the following observations:

  • In general, as the ratio of unpaired graphs increases, the performance of both NEC-DGT-Enhanced and SegTran improves, which implies that unpaired graphs help to learn a better encoder and decoder for graph-to-graph translation.

  • Across the ratios of unpaired graphs, SegTran consistently performs better than NEC-DGT-Enhanced, which validates the effectiveness of SegTran in learning from unpaired graphs.
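
One way to set up such a sweep is to keep the paired set fixed and draw nested unpaired subsets, so that each larger ratio only adds graphs to the previous one. The sketch below is an assumption about the protocol, not the authors' procedure: the ratio values, the nesting, and the function name are hypothetical, while the BA subset sizes come from Table 1.

```python
import random


def nested_unpaired_subsets(unpaired_pool, num_paired, ratios, rng):
    """Return {ratio: subset}, where each subset has ratio * num_paired graphs."""
    order = list(unpaired_pool)
    rng.shuffle(order)  # one shuffle, so the subsets are prefixes of each other
    subsets = {}
    for ratio in sorted(ratios):
        size = round(ratio * num_paired)
        if size > len(order):
            raise ValueError(f"ratio {ratio} needs {size} unpaired graphs, "
                             f"but only {len(order)} are available")
        subsets[ratio] = order[:size]
    return subsets


# BA sizes from Table 1: 500 paired and 2000 unpaired training graphs.
# The graph indices and the ratio values below are hypothetical.
subsets = nested_unpaired_subsets(range(500, 2500), num_paired=500,
                                  ratios=[0, 1, 2, 4], rng=random.Random(0))
```

Nesting the subsets means that a change in performance between two ratios reflects the added unpaired graphs rather than a different random draw.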

5.4. Ablation Study

To answer the third question, we conduct an ablation study in this subsection to understand the importance of each component of the proposed framework SegTran. All experiments are performed in the semi-supervised setting, with configurations exactly the same as those of SegTran unless stated otherwise.

| Methods | BA (semi-sup.) | DBLP (semi-sup.) |
|---|---|---|
| Shared Embedding | 0.196 | 0.245 |
| No position | 0.142 | 0.204 |
| No MI | 0.141 | 0.197 |
| No multi-head attention | 0.135 | 0.190 |
| SegTran | 0.132 | 0.192 |

Table 3. Evaluation of the significance of different designs in our approach on the BA and DBLP datasets. The scores are computed using MSE loss.
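
The gap between each ablated variant and the full model can be read off Table 3 with simple arithmetic. The snippet below only restates the MSE values from the table and subtracts the SegTran row; the dictionary layout and the function name are illustrative.

```python
# MSE values copied from Table 3 (lower is better).
MSE = {
    "BA": {"Shared Embedding": 0.196, "No position": 0.142, "No MI": 0.141,
           "No multi-head attention": 0.135, "SegTran": 0.132},
    "DBLP": {"Shared Embedding": 0.245, "No position": 0.204, "No MI": 0.197,
             "No multi-head attention": 0.190, "SegTran": 0.192},
}


def mse_increase(dataset):
    """MSE of each ablated variant minus the MSE of full SegTran."""
    full = MSE[dataset]["SegTran"]
    return {name: round(value - full, 3)
            for name, value in MSE[dataset].items() if name != "SegTran"}


ba_gaps = mse_increase("BA")
dblp_gaps = mse_increase("DBLP")
```

A positive gap means the removed component was helping. The only non-positive entry is multi-head attention on DBLP, where the ablated variant is marginally better.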

Gain from dual embedding. In our model design, “dual embedding” is adopted to distinguish between the source and the target domain, and to help the model learn more from unpaired graphs. To test its effect, we perform an ablation study by removing the translation module, which requires the two domains to share the same embedding space. Other parts are not affected, except the mutual information loss, which is no longer needed in this case. The performance of “Shared Embedding” in terms of MSE is shown in Table 3. From the table, we can see that on both the BA and DBLP datasets, replacing dual embedding with a shared embedding results in a significant performance drop compared with SegTran: the MSE rises from 0.132 to 0.196 on BA and from 0.192 to 0.245 on DBLP. This is because “dual embedding” eases the trade-off between the reconstruction and translation tasks. Without it, the same features would be used for both, which could result in sub-optimal performance. This result shows the effectiveness of this design in leveraging the semantic change from the source domain to the target domain.

Importance of position embedding. In this part, we test the effect of the position embedding and evaluate its contribution in guiding the encoding process. To remove the position embedding, we leave the model architecture untouched and use the first eight dimensions of the node attributes in place of the calculated position embedding. From the results in Table 3, this leads to a performance drop on both datasets: the MSE rises from 0.132 to 0.142 on BA and from 0.192 to 0.204 on DBLP. This result shows that the position embedding is important for extending the representation ability of the graph encoder module.

Importance of MI loss. We also test the benefit of the auxiliary mutual information loss in aligning the transformed embeddings of unpaired graphs. For this experiment, we simply remove this loss during the fine-tuning steps and observe the change in performance. According to Table 3, the MSE rises from 0.132 to 0.141 on BA and from 0.192 to 0.197 on DBLP. The drop is smaller on the DBLP dataset, which may again be a result of dataset size: as DBLP is much larger, the number of paired graphs is more sufficient for training, which makes this auxiliary loss less important. Overall, we can observe that this supervision helps to train the translator better.

Influence of multi-head attention. In this part, we test the influence of the multi-head attention module in the decoding process. This module is designed to fuse the processed embeddings of nodes based on their relative distances. The relative distances are measured using the processed position embedding along with the metric space constructed by each attention head. Through multi-head attention, higher-order interaction between the processed node embedding and the position embedding is supported. For comparison, we directly remove this layer from the decoder. This manipulation does not influence other parts of the translation network. From Table 3, we can see that on the BA dataset this module brings a moderate improvement, with the MSE falling from 0.135 to 0.132. On the DBLP dataset, the performance is similar whether or not this module is removed (0.190 without it versus 0.192 with it). Therefore, the importance of this layer depends on the complexity of the dataset, and sometimes it is not necessary.

Figure 6. Parameter sensitivity on BA: (a) MSE; (b) MAPE.

5.5. Parameter Sensitivity Analysis

In this subsection, we analyze the sensitivity of SegTran’s performance to the two hyper-parameters that weight the reconstruction loss and the mutual information loss. We vary both over the same set of values. These experiments all use the same pre-trained model, as these two hyper-parameters only influence the fine-tuning process. Other settings are the same as for SegTran. This experiment is performed on the BA dataset, and the result is shown in Figure 6. One axis shows the translation error measured using MSE loss, and the other two axes show the values of the two hyper-parameters.

From the figure, we can observe that, in general, the reconstruction loss and the mutual information loss are both important for achieving better performance. When the weight of the reconstruction loss is small, it is difficult for the model to achieve high performance. This observation makes sense because the reconstruction loss guides the encoder to extract more complete features from the input graphs, and consequently has a direct influence on the quality of the intermediate embedding space.
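
The sweep protocol described above, in which every weight combination is fine-tuned from the same pre-trained model, can be sketched as a small grid loop. This is an illustrative outline only: the weight values, the dictionary used as a stand-in for model state, and the `fine_tune_and_evaluate` callback are hypothetical and do not come from the paper.

```python
import copy
from itertools import product


def sensitivity_sweep(pretrained_state, weights, fine_tune_and_evaluate):
    """Evaluate every (reconstruction weight, MI weight) pair on a grid."""
    results = {}
    for w_rec, w_mi in product(weights, repeat=2):
        # Each run starts from its own copy of the same pre-trained state.
        state = copy.deepcopy(pretrained_state)
        results[(w_rec, w_mi)] = fine_tune_and_evaluate(state, w_rec, w_mi)
    return results


def count_steps(state, w_rec, w_mi):
    """Stand-in callback: records one fine-tuning step, returns the step count."""
    state["steps"] += 1
    return state["steps"]


pretrained = {"steps": 0}  # placeholder for a pre-trained checkpoint
results = sensitivity_sweep(pretrained, [0.1, 1.0, 10.0], count_steps)
```

Copying the checkpoint before each run keeps the grid cells independent, so differences between cells can be attributed to the two weights alone.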

5.6. Case study

In Figure 7, we provide an example from the BA dataset to show and compare the translation patterns captured by SegTran and NEC-DGT-Enhanced. Figure 7(a) is the source graph, and Figure 7(b) is the target graph. Figure 7(c) is the result translated by SegTran, and Figure 7(d) is the result translated by NEC-DGT-Enhanced. To show the prediction results clearly, we split the edges in Figure 7(c) and Figure 7(d) into three groups. When the predicted existence probability is above an upper threshold, we draw the edge in black if it is a true edge and in red otherwise. When the probability falls in an intermediate range, we draw it in grey. Edges with smaller probability are filtered out.
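
The grouping rule used for drawing the translated graphs can be written as a short function. This is a sketch under stated assumptions: the threshold values `high=0.5` and `low=0.2` are hypothetical, the intermediate range is assumed to run from `low` to `high`, the graph is assumed to be undirected, and the group names are illustrative.

```python
def classify_edges(prob, truth, high, low):
    """Group node pairs (i < j) by predicted edge probability.

    correct   -> above `high` and a true edge       (drawn in black)
    wrong     -> above `high` but not a true edge   (drawn in red)
    uncertain -> between `low` and `high`           (drawn in grey)
    Pairs below `low` are filtered out.
    """
    groups = {"correct": [], "wrong": [], "uncertain": []}
    n = len(prob)
    for i in range(n):
        for j in range(i + 1, n):
            p = prob[i][j]
            if p > high:
                key = "correct" if truth[i][j] else "wrong"
            elif p >= low:
                key = "uncertain"
            else:
                continue
            groups[key].append((i, j))
    return groups


# Toy 4-node example with hypothetical probabilities and thresholds.
truth = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]]
prob = [[0.0, 0.9, 0.3, 0.1],
        [0.9, 0.0, 0.8, 0.05],
        [0.3, 0.8, 0.0, 0.6],
        [0.1, 0.05, 0.6, 0.0]]
groups = classify_edges(prob, truth, high=0.5, low=0.2)
```

In the toy example, pairs (0, 1) and (1, 2) are confident and correct, pair (2, 3) is confident but wrong, pair (0, 2) is uncertain, and the remaining pairs are filtered out.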

It can be seen that SegTran achieves high translation quality and makes no erroneous predictions. In the target graph, one node is clearly the most popular. The same pattern appears in the result translated by SegTran, where that node has a high probability of linking to most other nodes. For distant node pairs that have no links in the target graph, SegTran assigns a relatively low existence probability.

The performance of NEC-DGT-Enhanced, on the other hand, is less satisfactory. It mistakenly splits the nodes into two groups, and assigns large weights to edges inside each group but little weight to edges between them. One node is connected to only a single neighbor in the source graph, so NEC-DGT-Enhanced tries to push its embedding, as well as those of its neighbors, away from the other nodes, which could be the cause of this phenomenon. This example shows that SegTran is better at capturing the graph distribution in the target domain and at learning the translation patterns.

Figure 7. Case study examples: (a) source; (b) target; (c) translated by SegTran; (d) translated by NEC-DGT-Enhanced.

6. Conclusion and Future Work

Graph-to-graph translation has many applications. However, for many domains, obtaining large-scale paired data is expensive. Thus, in this paper, we investigate a new problem of semi-supervised graph-to-graph translation. We propose a new framework, SegTran, which is composed of an encoder to learn graph representations, a decoder to reconstruct graphs, and a translator module to translate graphs in the latent space. SegTran adopts dual embedding to bridge the semantic gap between the source and target domains. Experimental results on synthetic and real-world datasets demonstrate the effectiveness of the proposed framework for semi-supervised graph translation. Further experiments are conducted to understand the contribution of each component and the framework’s hyper-parameter sensitivity.

There are several interesting directions that need further investigation. First, in this paper, we mainly focus on translating one graph in the source domain to another graph in the target domain. In the real world, there are many situations that require translation from one domain to many other domains. For example, in the molecular translation task (Jin et al., 2018, 2019), we could place different requirements on the properties of the translated compound, and each requirement would form a domain. Therefore, we plan to extend SegTran to one-to-many graph translation. Second, graph translation has many real-world applications. In this paper, we conduct experiments on citation networks and traffic networks. We would like to extend our framework to more application domains, such as graph translation in brain networks and medical domains (Tewarie et al., 2015; Bassett and Sporns, 2017).

7. Acknowledgement

This project was partially supported by NSF projects IIS-1707548, IIS-1909702, IIS-1955851, CBET-1638320 and Global Research Outreach program of Samsung Advanced Institute of Technology #225003.


  • M. Artetxe, G. Labaka, and E. Agirre (2018) Unsupervised statistical machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3632–3642. Cited by: §2.2.
  • D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473. Cited by: §2.1.
  • D. S. Bassett and O. Sporns (2017) Network neuroscience. Nature neuroscience 20 (3), pp. 353. Cited by: §6.
  • M. I. Belghazi, A. Baratin, S. Rajeshwar, S. Ozair, Y. Bengio, A. Courville, and D. Hjelm (2018) Mutual information neural estimation. In International Conference on Machine Learning, pp. 531–540. Cited by: §4.3.
  • N. Dehmamy, A. Barabási, and R. Yu (2019) Understanding the representation power of graph neural networks in learning graph topology. In Advances in Neural Information Processing Systems, pp. 15387–15397. Cited by: §4.
  • K. Do, T. Tran, and S. Venkatesh (2018) Graph transformation policy network for chemical reaction prediction. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. Cited by: §1, §2.1.
  • J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2017) Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1263–1272. Cited by: §4.
  • I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio (2014) Generative adversarial nets. In NIPS, Cited by: §2.1.
  • Y. Grandvalet and Y. Bengio (2005) Semi-supervised learning by entropy minimization. In Advances in neural information processing systems, pp. 529–536. Cited by: §2.2.
  • X. Guo, L. Wu, and L. Zhao (2018) Deep graph translation. ArXiv abs/1805.09980. Cited by: §1, §1, §2.1, §2.1, 1st item, §5.1.1.
  • X. Guo, L. Zhao, C. Nowzari, S. Rafatirad, H. Homayoun, and S. M. P. Dinakarrao (2019) Deep multi-attributed graph translation with node-edge co-evolution. ICDM. Cited by: §1, §2.1, 2nd item.
  • W. L. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In NIPS, Cited by: §4.
  • R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio (2018) Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670. Cited by: §4.3, 3.
  • W. Jin, T. Derr, H. Liu, Y. Wang, S. Wang, Z. Liu, and J. Tang (2020a) Self-supervised learning on graphs: deep insights and new direction. arXiv preprint arXiv:2006.10141. Cited by: §4.
  • W. Jin, Y. Ma, X. Liu, X. Tang, S. Wang, and J. Tang (2020b) Graph structure learning for robust graph neural networks. arXiv preprint arXiv:2005.10203. Cited by: §4.
  • W. Jin, R. Barzilay, and T. S. Jaakkola (2018) Junction tree variational autoencoder for molecular graph generation. In ICML, Cited by: §2.1, §6.
  • W. Jin, K. Yang, R. Barzilay, and T. S. Jaakkola (2019) Learning multimodal graph-to-graph translation for molecular optimization. ArXiv abs/1812.01070. Cited by: §2.1, §6.
  • D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling (2014) Semi-supervised learning with deep generative models. In Advances in neural information processing systems, pp. 3581–3589. Cited by: §2.2.
  • T. N. Kipf and M. Welling (2016a) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §2.1, §4.
  • T. Kipf and M. Welling (2016b) Variational graph auto-encoders. ArXiv abs/1611.07308. Cited by: §2.1.
  • S. Laine and T. Aila (2016) Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242. Cited by: §2.2.
  • J. Li, C. Chen, H. Tong, and H. Liu (2018) Multi-layered network embedding. In Proceedings of the 2018 SIAM International Conference on Data Mining, pp. 684–692. Cited by: §2.1.
  • X. Liang, Z. Hu, H. Zhang, L. Lin, and E. P. Xing (2018) Symbolic graph reasoning meets convolutions. In NeurIPS, Cited by: §2.1.
  • B. I. A. McInnes, J. S. McBride, N. J. Evans, D. D. Lambert, and A. S. Andrew (1999) Emergence of scaling in random networks. Cited by: 1st item.
  • R. Pan, Y. Zhou, B. Cao, N. N. Liu, R. Lukose, M. Scholz, and Q. Yang (2008) One-class collaborative filtering. In 2008 Eighth IEEE International Conference on Data Mining, pp. 502–511. Cited by: §4.2.
  • S. Prabhumoye, Y. Tsvetkov, R. Salakhutdinov, and A. W. Black (2018) Style transfer through back-translation. In ACL, Cited by: §2.2.
  • R. Sennrich, B. Haddow, and A. Birch (2016) Edinburgh neural machine translation systems for wmt 16. In WMT, Cited by: §2.2.
  • C. Shi, M. Xu, H. Guo, M. Zhang, and J. Tang (2020) A graph to graphs framework for retrosynthesis prediction. ArXiv abs/2003.12725. Cited by: §1.
  • M. Simonovsky and N. Komodakis (2018) GraphVAE: towards generation of small graphs using variational autoencoders. In ICANN, Cited by: §2.1.
  • M. Sun and P. Li (2019) Graph to graph: a topology aware approach for graph structures learning and generation. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2946–2955. Cited by: §2.1.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In NIPS, Cited by: §2.1.
  • J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su (2008) ArnetMiner: extraction and mining of academic social networks. In KDD, Cited by: §5.1.1.
  • X. Tang, H. Yao, Y. Sun, Y. Wang, J. Tang, C. Aggarwal, P. Mitra, and S. Wang (2020) Graph convolutional networks against degree-related biases. arXiv preprint arXiv:2006.15643. Cited by: §4.
  • P. Tewarie, E. van Dellen, A. Hillebrand, and C. J. Stam (2015) The minimum spanning tree: an unbiased method for brain network analysis. Neuroimage 104, pp. 177–188. Cited by: §6.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §2.1, §4.2.
  • H. Wang, X. Tang, Y. Kuo, D. Kifer, and Z. Li (2019) A simple baseline for travel time estimation using large-scale trip data. ACM Transactions on Intelligent Systems and Technology (TIST) 10 (2), pp. 1–22. Cited by: 3rd item.
  • F. Wu, A. Souza, T. Zhang, C. Fifty, T. Yu, and K. Weinberger (2019) Simplifying graph convolutional networks. In International Conference on Machine Learning, pp. 6861–6871. Cited by: §4.
  • H. Yao, Y. Liu, Y. Wei, X. Tang, and Z. Li (2019a) Learning from multiple cities: a meta-learning approach for spatial-temporal prediction. In The World Wide Web Conference, pp. 2181–2191. Cited by: 3rd item.
  • H. Yao, X. Tang, H. Wei, G. Zheng, and Z. Li (2019b) Revisiting spatial-temporal similarity: a deep learning framework for traffic prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 5668–5675. Cited by: 3rd item.
  • H. Yao, F. Wu, J. Ke, X. Tang, Y. Jia, S. Lu, P. Gong, J. Ye, and Z. Li (2018) Deep multi-view spatial-temporal network for taxi demand prediction. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: 3rd item.
  • J. You, R. Ying, and J. Leskovec (2019) Position-aware graph neural networks. In International Conference on Machine Learning, pp. 7134–7143. Cited by: §4, §4.
  • J. You, R. Ying, X. Ren, W. L. Hamilton, and J. Leskovec (2018) GraphRNN: generating realistic graphs with deep auto-regressive models. In ICML, Cited by: §2.1.
  • W. Zhao, Z. Cui, C. Xu, C. Li, T. Zhang, and J. Yang (2019) Hashing graph convolution for node classification. In CIKM, Cited by: §4.
  • J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2242–2251. Cited by: §2.2.