1. Introduction
Graph-to-graph translation aims at transforming a graph in the source domain to a new graph in the target domain, where different domains correspond to different states. Figure 1 gives an illustration of the graph-to-graph translation process. The graph in the source domain depicts shared music tastes of users, with each attributed node representing a user and the user’s portrait. We want to translate the attributed graph of shared music tastes to a co-watch graph in the target domain, which represents similar reading preferences. The translation function is expected to generate the target graph based on the source graph. In effect, it learns to capture the intrinsic attributes, such as aesthetic preferences and personal characteristics in this case, and to discover the patterns behind the domain transition. It can facilitate many real-world applications. For instance, given the brain network of a healthy subject, we may want to predict the brain network when the subject has a certain disease. Given the traffic flow of a city, we may want to estimate the traffic flow when an event such as a concert is held in the city.
Graph translation is attracting increasing attention, and several efforts have been made. Most of them focus on modeling the transition of graph topology, in the form of edge existence and node attributes, across the source and target domains. For example, (Guo et al., 2018) formally defined this task and proposed a basic framework based on edge-node interactions. (Guo et al., 2019) extended this work by building two parallel paths for updating nodes and edges respectively, and introduced a graph frequency regularization term to maintain the inherent consistency of predicted graphs. Other works focus on the special needs of particular applications. For example, (Do et al., 2018) introduced domain knowledge into the translator design for reaction prediction, and (Shi et al., 2020) adopted multi-head attention to capture the patterns in skeleton mapping.
Despite these efforts, all the aforementioned graph-to-graph translation algorithms investigate the fully-supervised setting, i.e., large-scale paired graphs in the source and target domains are available to provide supervision for learning translation patterns. However, for many domains, obtaining large-scale paired graphs is a non-trivial task, and sometimes even impossible. The source and target graphs are expected to share exactly the same node set, which is expensive to guarantee and collect at large scale. For example, in Figure 1, the source-domain graphs can be obtained from music streaming platforms like Spotify, and the target-domain graphs from e-book platforms like Amazon Kindle. It is difficult to build the corresponding pairs, as users could use different IDs across the two platforms. Another example is brain networks, where the two domains are the brain activity patterns of healthy people and of dyslexia patients respectively. In this case, constructing pairs would require the same person in both the healthy and the diseased state, which is impossible.
Although a large-scale paired graph dataset is difficult to collect, a dataset with a limited number of paired graphs and a large number of unpaired graphs is much easier to build, which enables us to explore semi-supervised techniques for this problem. Graph translation follows the classical auto-encoder framework, with an encoder in the source domain and a decoder in the target domain. By introducing auxiliary reconstruction tasks and building a source-domain decoder and a target-domain encoder, those mono-domain graphs can be utilized to boost the learning process. Recovering the original source graph imposes a higher requirement on the source-domain encoder to extract more complete and representative features, and reconstructing the graph from its embedding can also guide the target-domain decoder to model the real graph distribution in that domain. However, directly extending previous works in this manner has its pitfalls. The source and target domains are assumed to share the same embedding space, and the extracted embeddings of source graphs are used for both the reconstruction task and the translation task, which makes it difficult to model the semantic drift across the two domains. Take Figure 1 for example: the embedding of node 2 would be close to that of node 3 as a result of the message-passing mechanism in the source domain, but they are supposed to be distant in order to recover the non-connection information in the target domain. This trade-off could impair the model’s expressive ability and lead to suboptimal performance.
To cope with this semantic gap, we design a new framework, which we call SegTran, as shown in Figure 2. Concretely, we design a specific translation module to model the semantic transition explicitly and enforce the alignment across the two domains. This translation module is trained on the limited number of paired graphs, while the other components can benefit from both paired cross-domain graphs and unpaired mono-domain graphs during training. Furthermore, to ensure that the patterns captured by the translation module are general, we also explore providing auxiliary training signals for it by maximizing the mutual information between the source graph and its translation result. The structures of the encoder and decoder are also specially designed, as graph translation faces some intrinsic difficulties due to diverse topology and the interdependency between nodes and edges (Guo et al., 2018). The main contributions of the paper are:

We propose a new framework for graph translation which copes better with the semi-supervised scenario. It can promote future research on graph translation algorithms, as the lack of large-scale paired datasets is one key obstacle to its exploration and applications.

We design novel encoder/decoder architectures using position embedding and multi-head attention, along with an explicit translation module, to achieve higher expressive ability. The design of each component is well-justified.

Experiments are performed on three large datasets, and our model achieves state-of-the-art results on all of them. Extensive analysis of our model’s behavior is also presented.
2. Related Work
In this section, we review related work, which includes graph translation and semi-supervised translation.
2.1. Graph Translation
Translation of complex structures has long been a hot research topic. Sequence-to-sequence learning is a standard methodology for many applications in the natural language processing domain (Sutskever et al., 2014; Bahdanau et al., 2015). Tree-structured translation was found to be beneficial for molecular generation tasks in chemistry (Jin et al., 2018, 2019). With the development of novel architectures, especially the transformer (Vaswani et al., 2017) and the graph convolution network (Kipf and Welling, 2016a), the power of deep networks in modeling interdependency and interactions has been greatly improved, which makes it possible to translate a more general and flexible structure, the graph.
In a graph, information is contained in both nodes and edges, and its translation has some intrinsic difficulties. The first is the diverse topology. The semantics of graphs are encoded not only in each node’s feature vector, but also in each edge’s (non-)existence, which is combinatorial in nature. Therefore, encoding a whole graph as a vector is often found to be brittle (Guo et al., 2018). Second, nodes and edges are interdependent, and they interact in an iterative way (You et al., 2018). This ’recurrent’ relationship is difficult to model, because it can easily ’explode’ or ’shrink’ and make the learning process unstable. Third, the relationships between nodes and their neighbors are non-local: distant nodes can still contribute to the existence of the target edge. These difficulties make it harder to capture the patterns and model the distributions of graphs.
Earlier works mainly focus on obtaining the intermediate representation during the translation process for downstream tasks, like (Simonovsky and Komodakis, 2018; Kipf and Welling, 2016b), and paid little attention to the design of special models. (Li et al., 2018) translates across different graph views by modeling the correlations of different node types, and (Liang et al., 2018; Do et al., 2018) use a set of predefined rules to guide the target graph generation process. All these works require domain knowledge, and their models are domain-specific. (Sun and Li, 2019) introduces some techniques from the natural language processing domain, but their work is restricted to the topology and is not suitable for our setting. Works more related to ours are (Guo et al., 2018, 2019), which both focus on building a general graph translation framework. (Guo et al., 2018) updates node attributes and edge existence iteratively, and adopts a GAN (Goodfellow et al., 2014) to improve the translation quality. (Guo et al., 2019) constructs a node-edge co-evolution block, and designs a spectral-related regularization to maintain the translation consistency.
2.2. Semi-supervised Translation
Semi-supervised learning, where only a small fraction of the training data has supervision, is also an important research problem. Previous works have exploited unlabeled samples in many ways: exploiting relationships among training examples (Kingma et al., 2014), enhancing feature extractors with auxiliary unsupervised tasks (Grandvalet and Bengio, 2005), reducing model bias with ensemble-based methods (Laine and Aila, 2016), etc. However, most of these methods are proposed for discriminative models, assuming the availability of a shared intermediate embedding space with limited dimensions, or expecting the output to be a vector of probabilities. Our task is in essence a generative task, with both the input and the output lying in the same space of very large dimension, which makes those approaches infeasible.
The task most related to our problem setting is the utilization of monolingual datasets in neural machine translation, where unpaired sentences in the target language are used to boost the model’s performance. The most popular method there is based on back translation (Artetxe et al., 2018; Sennrich et al., 2016; Prabhumoye et al., 2018), where a pre-trained target-to-source translator is applied to those unpaired sentences to obtain pseudo source sentences. The translator is then trained to generate the real target sentences from those pseudo source sentences. In this way, cycle consistency can be built, and it has been shown to achieve state-of-the-art performance. However, in those tasks, the semantics across domains are the same, which enables them to use one shared embedding space for both the source and the target sentences. That is, the meaning of the sentence remains the same, and only the grammar and dictionary change. This difference in syntax and dictionaries can be encoded in the parameters of the decoder model. The case of unsupervised style transfer in computer vision (Zhu et al., 2017) is similar: only low-level features like texture and color can change across domains.
3. Problem Setting
Many real-world problems can be viewed as graph-to-graph translation problems, i.e., given the graph of an object in the source status, predict the graph of the same object in the target status. For example, predict the shared reading preferences of a group of people given their similarity in music taste, or generate the brain activity network of a subject after a certain disease based on the network in the healthy state.
Throughout the paper, we use to denote an attributed graph of the th object in source status, where is a set of nodes, is the adjacency matrix of , and denotes the node attribute matrix, where is the node attributes of node j and is the dimension of the node attributes. Similarly, we define the corresponding target graph of as , where is the adjacency matrix of , and with being the dimensionality of node attributes in target status. We set to be the same as in this work due to the dataset, but our model can also be applied to scenarios where they are different. Note that we assume and share the same set of nodes while their graph structures are different. This is a reasonable assumption for many applications. Taking Figure 1 as an example again, the nodes, referring to users, have to remain in the same set to build the correspondence across the two domains. But the edges, representing users’ relationships with each other, along with the node attributes can be different. For two graphs of different objects in the same status, say and , we consider because is a special case, which can also be handled by our model.
To learn a model that can perform graph-to-graph translation, we need a set of paired graphs to provide supervision. We use to denote paired graphs. However, obtaining large-scale paired graphs is not easy. In many cases, it is difficult or even impossible to acquire the representation of one graph in two domains, as in the reading/music preference example and the brain network example. Thus, is usually small, which is insufficient to train a graph translation model well. Fortunately, it is relatively easy to obtain a set of unpaired graphs, i.e., graphs in one status whose corresponding graphs in the other status are missing. For simplicity, we use to denote a set of graphs in the status, and to represent a set of graphs in the status. is the overlapping part of and with and . With the above notations, the problem of semi-supervised graph translation is formally defined as:
Given , and , we aim to learn a function that can transform a source-domain graph to the target domain, i.e.,
(1) 
Note that though we only consider two statuses, it is straightforward to extend our setting to multistatus graph translation.
4. Methodology
In this section, we give the details of the proposed framework SegTran for semi-supervised graph-to-graph translation. It is challenging to directly translate a graph from one domain to another because (i) graph structures are complex and discrete; and (ii) we have limited paired graphs to provide supervision. Thus, our basic idea is to learn good representations that preserve the topological and nodal features, translate the graph in the latent domain, and project the translated representation back to a graph in the target domain.
An illustration of the proposed framework is shown in Figure 2. It is composed of three components: (i) two encoders, which aim to learn node and graph representations for graphs of each domain; (ii) two decoders, which aim to reconstruct the attributed graph to guide the encoders to learn good representations during the training phase and to predict the graph in the target domain during the test phase; and (iii) a translator, which is designed to translate the graph in the latent domain at both the node and the graph level. In addition, the translator leverages both paired and unpaired graphs to learn a better transition ability. This design has several advantages: (i) it allows the unpaired graphs to provide auxiliary training signals for learning better encoders; (ii) by sharing the same decoder for both translation and reconstruction tasks, its ability to model the graph distribution of the target domain can also be enhanced; and (iii) due to the semantic gap issue, instead of constructing one shared embedding space and expecting it to be domain-agnostic, we explicitly model the semantic transition process through a translator module, and form the so-called “dual embedding” between the source and target domains. Next, we give the details of each component.
Given an input attributed graph, we first introduce an encoder to learn node embeddings that capture the network topology and node attributes, so that we can conduct translation in the latent space. Graph neural networks (GNNs) have shown promising results for learning node representations (Kipf and Welling, 2016a; Hamilton et al., 2017; Jin et al., 2020b; Tang et al., 2020; Jin et al., 2020a). Thus, we choose a GNN as the basic encoder. The basic idea of a GNN is to aggregate a node’s neighborhood information in each layer to update the representation of the node, which can be mathematically written in a message-passing framework (Gilmer et al., 2017):
(2)
is the embedding of node at layer , belongs to ’s neighbor groups , and
refers to the activation function. We use a linear layer as the updating function and mean as the message function. However, previous works have found two main issues with classical GNNs. The first is that their expressiveness is significantly constrained by the depth. For example, a k-layer graph convolution network (GCN) can only capture k-order graph moments without the bias term (Dehmamy et al., 2019), and each layer is effectively a low-pass filter, resulting in overly smooth graph features as the network goes deep (Wu et al., 2019). This issue weakens the GNN’s representation ability, making it harder to encode topology information in the processed node embedding. The second issue is that most of them are position-agnostic. Classical GNNs fail to distinguish two nodes with similar neighborhoods but at different positions in the graph (You et al., 2019; Zhao et al., 2019). This problem becomes more severe when we use a GNN to extract node embeddings to predict the existence of edges. For example, if both nodes and their neighborhoods have similar attributes, the encoder would tend to produce similar embeddings for them. This makes it hard for the decoder to learn that is not connected to but is.
To address these issues, we extend classical message-passing based GNNs by adding skip connections and using position embedding, as shown in Figure 3. With skip connections, higher GNN layers can still access lower-level features, and better representations can be learned. Concretely, we concatenate the initial node attributes to the output of each encoding layer, as in Figure 3. As to the position embedding, inspired by the work of (You et al., 2019), we represent each node using the lengths of its shortest paths to a few selected anchors. Concretely, each time a graph is to be inputted, say , we randomly select nodes as anchors, and calculate the position embedding of each node based on them. We use to denote its initial position embedding. As proven in their work, a bound on position distortions can be calculated for an anchor set of size , where is the number of nodes in that graph. Considering the size of graphs in our dataset, we preset it to eight for simplicity. Although the absolute values of the obtained position embedding would be different when the anchor set changes, their relative relationships remain the same and can be exploited.
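The anchor-based position embedding described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the adjacency-list input format, and the `1 / (d + 1)` scaling of shortest-path lengths (borrowed from the position-aware GNN literature) are all assumptions. Only the random choice of anchors and the default of eight anchors come from the text.

```python
import random
from collections import deque


def bfs_distances(adj, source):
    """Hop distance from `source` to every node; None marks unreachable nodes."""
    dist = [None] * len(adj)
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if dist[v] is None:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def anchor_position_embedding(adj, num_anchors=8, seed=None):
    """Return (anchors, emb) with emb[v][k] = 1 / (d(v, anchor_k) + 1).

    The scaling is an assumption; unreachable anchors contribute 0.0.
    """
    rng = random.Random(seed)
    n = len(adj)
    anchors = rng.sample(range(n), min(num_anchors, n))
    per_anchor = [bfs_distances(adj, a) for a in anchors]
    emb = [
        [0.0 if dist[v] is None else 1.0 / (dist[v] + 1) for dist in per_anchor]
        for v in range(n)
    ]
    return anchors, emb


# Hypothetical usage on a 4-node path graph 0-1-2-3 (adjacency lists).
path = [[1], [0, 2], [1, 3], [2]]
anchors, emb = anchor_position_embedding(path, num_anchors=2, seed=0)
```

Re-sampling the anchors changes the absolute values in `emb`, but nodes that are close to each other still receive similar vectors, which is the relative structure the encoder is meant to exploit.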
4.1. Encoder
With these preparations, now we can present the formulations of our extended GNN structure. In th block, the message passing and fusing process can be written as:
(3) 
(4) 
represents node embedding at layer , is the th column in adjacency matrix, and is the position embedding at that layer. and are the initial node attributes and position embedding respectively. and are the weight parameters, and refers to the activation function such as ReLU.
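To make the message-passing step with skip connections more concrete, the sketch below implements one layer with mean aggregation, a linear update, a ReLU activation, and concatenation of the initial node attributes to the layer output. It is an illustration under assumptions: the helper names are hypothetical, each node is included in its own neighborhood to avoid empty averages, and the exact form of Equations 3 and 4 is not reproduced here.

```python
def linear(x, weight):
    """Apply a bias-free linear map; `weight` holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def relu(x):
    return [max(0.0, v) for v in x]


def encoder_layer(adj, h, h0, weight):
    """One mean-aggregation layer followed by a skip concatenation.

    adj    : adjacency lists
    h      : current node embeddings, one list per node
    h0     : initial node attributes (concatenated to the output)
    weight : linear update weights, shape d_out x d_in
    """
    out = []
    for v, neighbors in enumerate(adj):
        group = [v] + list(neighbors)  # assumption: self-loop included
        dim = len(h[v])
        mean = [sum(h[u][k] for u in group) / len(group) for k in range(dim)]
        updated = relu(linear(mean, weight))
        out.append(updated + list(h0[v]))  # skip connection by concatenation
    return out


# Hypothetical usage: two connected nodes with scalar attributes.
adj = [[1], [0]]
h0 = [[1.0], [3.0]]
h1 = encoder_layer(adj, h0, h0, weight=[[1.0]])
```

Because the initial attributes are concatenated after every layer, the input width of the next layer's weight matrix grows by the attribute dimension, which a full implementation has to account for.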
4.2. Decoder
With the representation learned by the encoder, we introduce a decoder to reconstruct the graph. Note that the decoder is not only used to perform the reconstruction task during the training phase, but is also used to complete the graph translation in the test phase. During testing, suppose we are given ; to predict , it needs to first go through the source-domain encoder, then the translator, followed by the target-domain decoder.
After the encoding part, for each node , we now have two representation vectors, and . is the embedding of the node attributes, and is the processed embedding for its relative position in the graph. To predict the existence of edges between two nodes, clearly both features are helpful. Considering that these two features carry semantic meanings at different levels and are in different embedding spaces, directly concatenating them might be improper. In order to learn to fuse the position and attribute information, we construct the decoder with a number of blocks of the same structure. Inside each block, we apply a multi-head attention layer (Vaswani et al., 2017) with “Query”/“Key” being and “Value” being . Its formulation can be written as:
(5)  
and they are the query embedding matrix, key embedding matrix, and value embedding matrix respectively. is the output matrix which fuses the results from different heads. / is the dimensionality of /, which is the obtained position/node embedding from the encoder. is the embedding dimension of Query/Key, and is the embedding dimension of Value. In this way, the model can learn to utilize the position information in aggregating/updating the attribute embedding in a data-driven manner.
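The idea of letting position embeddings decide how attribute embeddings are mixed can be illustrated with a single attention head. The sketch below uses standard scaled dot-product attention with queries and keys computed from position embeddings and values computed from attribute embeddings. The function names and the weight arguments are hypothetical, and a multi-head version would repeat this with separate projections per head, concatenate the heads, and apply the output matrix.

```python
import math


def matmul(a, b):
    """Multiply an n x m matrix by an m x p matrix (lists of rows)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def position_guided_attention(pos, attr, w_q, w_k, w_v):
    """Single-head attention: Query/Key from `pos`, Value from `attr`."""
    q = matmul(pos, w_q)   # n x d_k
    k = matmul(pos, w_k)   # n x d_k
    v = matmul(attr, w_v)  # n x d_v
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    weights = [softmax(r) for r in scores]
    return matmul(weights, v)


# Hypothetical usage: two nodes with identical positions attend uniformly,
# so each output is the average of the two attribute values.
identity = [[1.0, 0.0], [0.0, 1.0]]
out = position_guided_attention(
    pos=[[1.0, 0.0], [1.0, 0.0]],
    attr=[[2.0], [4.0]],
    w_q=identity, w_k=identity, w_v=[[1.0]],
)
```

Since the attention weights depend only on positions, two nodes with similar attributes but different positions can still aggregate different attribute information.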
Then, after concatenating the position embedding and the processed attribute embedding, we use a weighted inner product to perform link prediction:
(6)  
Here, is the parameter matrix capturing the interaction between nodes. For node attribute prediction, we append a two-layer MLP after the multi-head attention module as:
(7) 
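The weighted inner product used for link prediction can be sketched as a bilinear score between two node representations. The sigmoid that maps the score to a probability is an assumption made for this illustration, and the function name is hypothetical.

```python
import math


def edge_probability(z_i, z_j, w):
    """Bilinear link score sigmoid(z_i^T W z_j); the sigmoid is an assumption."""
    wz = [sum(w_rc * zj for w_rc, zj in zip(row, z_j)) for row in w]
    logit = sum(a * b for a, b in zip(z_i, wz))
    # Numerically stable sigmoid for large positive or negative logits.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


# Hypothetical usage: with an identity interaction matrix, orthogonal
# embeddings give a logit of zero.
p = edge_probability([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0], [0.0, 1.0]])
```

A learned interaction matrix lets the decoder score node pairs whose embeddings are not simply similar, which a plain dot product cannot express.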
Since graphs are usually sparse, i.e., the majority of elements in the adjacency matrix are zero, simply treating each element in the adjacency matrix equally in the graph reconstruction loss would make the loss function dominated by missing links, causing trivial results. To alleviate this problem, following existing work (Pan et al., 2008), we assign higher weight to nonzero elements in as follows:
(8)
where is between 0 and 1 to control the weight of the missing links in a graph. With , the graph reconstruction loss can be written as:
(9) 
where denotes elementwise multiplication.
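A small sketch of the re-weighted adjacency reconstruction term follows. It assumes the weight is 1 for existing links and a value between 0 and 1 (called `delta` here, a hypothetical name) for missing links, and that the loss is the squared Frobenius norm of the element-wise weighted difference. The node attribute term is left out.

```python
def edge_weight_mask(adj_true, delta):
    """Weight 1.0 for existing links and `delta` for missing links."""
    return [[1.0 if a != 0 else delta for a in row] for row in adj_true]


def weighted_adjacency_loss(adj_true, adj_pred, delta=0.5):
    """Squared Frobenius norm of mask * (prediction - target), element-wise."""
    mask = edge_weight_mask(adj_true, delta)
    total = 0.0
    for m_row, p_row, t_row in zip(mask, adj_pred, adj_true):
        for m, p, t in zip(m_row, p_row, t_row):
            total += (m * (p - t)) ** 2
    return total


# Hypothetical usage: a 2-node graph with one undirected edge and a
# prediction of 0.5 everywhere.
loss = weighted_adjacency_loss(
    adj_true=[[0, 1], [1, 0]],
    adj_pred=[[0.5, 0.5], [0.5, 0.5]],
    delta=0.5,
)
```

With `delta` below 1, errors on missing links are down-weighted, so a predictor that outputs all zeros no longer achieves a near-minimal loss on sparse graphs.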
4.3. Translator Module
With the encoders and decoders in place, we can now introduce how we learn the translation patterns across the two domains. In this framework, we model the translation process at the intermediate level, by transitioning a source graph embedding to its target graph embedding. As shown in Figure 2, the translator module is a key component of our framework, which is required to build the mapping from the source domain to the target domain. Concretely, we adopt an MLP structure to implement it. The translator operates in a node-wise fashion, using both the global feature and the node-level feature as input:
(10)  
refers to the translated result, and represents the translated intermediate-level node attribute embedding of node . The same is the case for . is the global pooling function, which is used to fuse the representations of all nodes in a graph. In this way, the graph-level embedding is appended to the extracted node-level embedding, so that translation patterns can be learned from both local and global features.
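The node-wise translator can be sketched as follows: pool all node embeddings into one graph-level vector, append it to every node embedding, and pass the result through a small MLP. Mean pooling, the two-layer depth, and all names below are assumptions made for illustration, since the text only states that a global pooling function and an MLP are used.

```python
def mean_pool(h):
    """Graph-level feature: column-wise mean over all node embeddings."""
    n = len(h)
    return [sum(col) / n for col in zip(*h)]


def dense(x, weight, bias):
    """Linear layer; `weight` holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def translate(h, w1, b1, w2, b2):
    """Translate each node embedding using local and global features."""
    g = mean_pool(h)
    out = []
    for row in h:
        x = list(row) + g  # node-level feature followed by graph-level feature
        hidden = [max(0.0, v) for v in dense(x, w1, b1)]  # ReLU
        out.append(dense(hidden, w2, b2))
    return out


# Hypothetical usage with an identity hidden layer and a summing output layer.
eye4 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
translated = translate(
    h=[[1.0, 2.0], [3.0, 4.0]],
    w1=eye4, b1=[0.0] * 4,
    w2=[[1.0, 1.0, 1.0, 1.0]], b2=[0.0],
)
```

The same routine could be applied to position embeddings with a separate set of weights, which matches the statement that both kinds of embedding are translated.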
During training, this correspondence is easy to establish for paired graphs. We can perform a regression task and minimize the prediction loss at the intermediate embedding level as:
(11) 
In this equation, refers to the embedding obtained from the target graph, and means the predicted result based on the source graph by the translator module.
However, for the large number of unpaired graphs, this kind of training signal cannot be obtained. Inspired by (Belghazi et al., 2018; Hjelm et al., 2018), we propose to use the mutual information (MI) score as an auxiliary supervision to better align these two spaces, as it can quantify the dependence between two random variables. If two graphs are paired, they should have a high MI score with each other; if they are not paired, their MI score should be low. Therefore, by optimizing the MI score between the translated result and the source graph, the translator module is encouraged to produce more credible results. However, as the dimension of the embedding space is too high ( ), working on it directly could suffer from the curse of dimensionality and produce brittle results. To address this, we apply the MI score to the global-level embedding through the function, which can change the dimension to . For simplicity of notation, we use to denote the global feature of graph after the
(12)
and use to denote the output of this translator module, the global feature of the translated result:
(13)
The mutual information score is calculated on top of that, following the same procedures as (Hjelm et al., 2018), which involves a specific estimator :
(14) 
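As an illustration of how an MI score can be estimated from a learned critic, the sketch below implements the Jensen-Shannon-based lower bound described in the Deep InfoMax line of work. The text above does not say which estimator variant is used, so this choice is an assumption. The inputs are critic scores for matched pairs (a source graph with its own translation) and mismatched pairs (a source graph with the translation of another graph), and the function names are hypothetical.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def jsd_mi_lower_bound(pos_scores, neg_scores):
    """Jensen-Shannon MI estimate from critic scores.

    pos_scores: critic outputs for matched (source, translation) pairs
    neg_scores: critic outputs for mismatched pairs
    """
    pos_term = sum(-softplus(-s) for s in pos_scores) / len(pos_scores)
    neg_term = sum(softplus(s) for s in neg_scores) / len(neg_scores)
    return pos_term - neg_term


# Hypothetical usage: a critic that separates matched from mismatched
# pairs yields a higher estimate than an uninformative critic.
informative = jsd_mi_lower_bound([10.0, 8.0], [-9.0, -10.0])
uninformative = jsd_mi_lower_bound([0.0, 0.0], [0.0, 0.0])
```

Maximizing this quantity with respect to both the critic and the translator pushes the translation of a graph to stay distinguishable from translations of other graphs.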
4.4. Objective Function of SegTran
Putting previous parts together, we can get our full model architecture. The optimization goal during training can be written as:
(15)  
Besides the paired translation loss defined in Equation 11, we also add the reconstruction loss defined in Equation 9 and MI loss defined in Equation 14. is applied to graphs from both source and target domains, and its weight is controlled by the hyperparameter . is applied to only unpaired source graphs, by computing the mutual information between and translation result , and its weight is controlled by .
4.5. Training Algorithm
As our model is composed of multiple different components, we follow a pretrain-finetune pipeline to make the training process more stable. The full training algorithm can be found in Algorithm 1. We first pre-train the encoders and decoders using the reconstruction loss with both paired and unpaired graphs, so that a meaningful intermediate embedding space can be learned for each domain. Then, we fix them and train only the translator module to learn the mapping between the two embedding spaces. After that, we fix the whole model and prepare the mutual information estimator. When all these preparations are done, we start the fine-tuning steps, alternately updating on paired and unpaired graphs, and train the whole model in an end-to-end manner.
5. Experiments
In this section, we conduct experiments to evaluate the effectiveness of SegTran and the factors that could affect its performance. In particular, we aim to answer the following questions.

How effective is SegTran in graph translation by leveraging paired and unpaired graphs?

How do different ratios of unpaired graphs affect the translation performance of SegTran?

What are the contributions of the individual components of SegTran?
We begin by introducing the experimental settings, datasets and baselines. We then conduct experiments to answer these questions. Finally, we analyze parameter sensitivity of SegTran.
5.1. Experimental Settings
5.1.1. Datasets
We conduct experiments on one synthetic dataset, BA (Guo et al., 2018), and two widely used real-world datasets, DBLP (Tang et al., 2008) and Traffic (https://data.cityofnewyork.us/Transportation/2018YellowTaxiTripData/t29mgskq). The details of the datasets are given as follows:

BA: In the BA dataset, the source domain is constructed by the Barabasi-Albert model (McInnes et al., 1999). The graph is built by adding nodes to it sequentially with the preferential attachment mechanism, until it reaches nodes. Each newly added node is connected to only one existing node, chosen randomly with probability . Here, means the current degree of node . This method generates graphs that follow scale-free degree distributions. The target graph is constructed as the hop reachability graph, i.e., if node is hop reachable from node in the source graph, then they are connected in the target graph. As the generated graphs are unattributed, we initialize the node attribute matrix to be the same as the adjacency matrix . We include this synthetic dataset to understand whether SegTran can really capture the translation patterns.

DBLP: DBLP is a multi-view citation network, with nodes representing researchers. It provides edges of three types, representing co-authorship, citation, and research overlap respectively. Each node has a dimension vector available, encoding the research interests. In our experiment, we use the citation network as the source domain and the research overlap network as the target domain. This dataset is given in the transductive learning setting, with all nodes appearing in one single large graph. To address this, we manually split it by first selecting a center node and then sampling nodes within its 2-hop neighborhood to get a smaller graph. The sampled graphs are not required to have exactly the same number of nodes, and we control the graph size by setting the upper bound of node degree to .

Traffic: For the traffic dataset, we use the publicly available New York Taxi Trip dataset in year . Each trip record contains the pick-up and drop-off places, as well as the trip start time. We follow prior studies (Yao et al., 2018, 2019b, 2019a; Wang et al., 2019) and split the city into regions, and build the edges based on the amount of taxi flow between each pair of regions within one hour. This results in graphs in total. On this dataset, we perform a traffic flow prediction task, and set the target domain as the graph state one hour in the future. Node attributes are initialized using the mean of historic taxi flows in the past hours.
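The construction of the synthetic BA pairs described above can be sketched as follows. Each new node attaches to one existing node with probability proportional to that node's current degree, and the target graph connects nodes that can reach each other within a given number of hops. The function names are hypothetical, the reading of "reachable" as "within `k` hops" rather than "at exactly `k` hops" is an assumption, and the value of `k` in the usage lines is illustrative only.

```python
import random
from collections import deque


def barabasi_albert_tree(num_nodes, seed=None):
    """Preferential attachment with one edge per newly added node."""
    if num_nodes < 2:
        raise ValueError("need at least two nodes")
    rng = random.Random(seed)
    adj = [set() for _ in range(num_nodes)]
    adj[0].add(1)
    adj[1].add(0)
    # Every node appears once per incident edge, so a uniform draw from
    # this list picks a node with probability proportional to its degree.
    endpoints = [0, 1]
    for new in range(2, num_nodes):
        target = rng.choice(endpoints)
        adj[new].add(target)
        adj[target].add(new)
        endpoints.extend([new, target])
    return adj


def k_hop_graph(adj, k):
    """Connect every node to all nodes within k hops (depth-limited BFS)."""
    out = []
    for s in range(len(adj)):
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            if dist[u] == k:
                continue  # do not expand beyond k hops
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        out.append({v for v in dist if v != s})
    return out


# Hypothetical usage: one source/target pair with 40 nodes, the BA graph
# size listed in the dataset statistics; k=2 is an illustrative choice.
source = barabasi_albert_tree(40, seed=0)
target = k_hop_graph(source, k=2)
```

Because the target is a deterministic function of the source topology, this dataset gives a clean check of whether a translator recovers a known structural rule.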
Table 1: Statistics of the three datasets.
Statistic | BA | DBLP | Traffic
Dataset Size | 5000 | 22559 | 8760
Average Graph Size | 40 | 125 | 100
Paired Training | 500 | 2255 | 876
Unpaired Training | 2000 | 9020 | 3504
Paired Testing | 1000 | 4510 | 1752
The statistics of the three datasets are summarized in Table 1, where “Dataset Size” is the total number of graphs and “Average Graph Size” is the average number of nodes in each graph. Note that in the BA and Traffic datasets, all graphs have the same size. As we adopt the semi-supervised problem setting, we only use a small subset of graphs as paired, and treat a larger subset as unpaired. The size of each subset is listed in the “Paired Training” and “Unpaired Training” rows, respectively. The number of paired graphs for testing is listed in the “Paired Testing” row.
5.1.2. Baselines
We compare SegTran with representative and state-of-the-art supervised and semi-supervised graph-to-graph translation approaches, which include:

DGT (Guo et al., 2018): This method belongs to the encoder-decoder framework. The encoder is composed of edge-to-edge blocks to update the edge representation and edge-to-node blocks to obtain the node embedding. The decoder performs the inverse process of the encoder, and maps the extracted node representation to the target graph. It also utilizes a discriminator and follows the adversarial training approach to generate more “real” graphs. For implementation, we used the code provided by the authors (https://github.com/anonymous1025/DeepGraphTranslation).

NECDGT (Guo et al., 2019): This is the state-of-the-art approach in graph translation, which also follows the encoder-decoder framework. To model the iterative and interactive translation process of nodes and edges, it splits each of its blocks into two branches, one for updating the node attributes and one for updating the edge representations. A special architecture is designed for each branch. Besides, it also designs a spectral-based regularization term to learn and maintain the consistency of predicted nodes and edges. We use the implementation provided by the authors (https://github.com/xguo7/NECDGT).

NECDGTenhanced: As both DGT and NECDGT can only work on paired graphs, they are unable to learn from the unpaired training samples, which would make the comparison unfair. To address this problem, we design an extension of NECDGT by adding a source-domain decoder, and call it NECDGTenhanced. With this auxiliary decoder, NECDGTenhanced can utilize unpaired graphs by performing reconstruction tasks, which is supposed to improve the performance of the encoder. Note that this can be treated as a variant of SegTran without the dual embedding and the translator module.

SegTran_p: To further validate our model design and verify the improvement that can be gained from utilizing unpaired graphs, we also design a new baseline, SegTran_p. It has the same architecture as SegTran, but is trained only on the paired graphs.
5.1.3. Configurations
All experiments are conducted on a bit machine with an Nvidia GPU (Tesla V100, 1246 MHz, 16 GB memory), and the Adam optimization algorithm is used to train all the models.
For DGT, NECDGT, and NECDGTenhanced, the learning rate is initialized to , following the settings in their code. For SegTran and SegTran_p, the learning rate is initialized to . In SegTran, the values of hyperparameters and are both set as . A more detailed analysis of the performance sensitivity to them can be found in Section 5.5. Besides, all models are trained until convergence, with the maximum training epoch being .
5.1.4. Evaluation Metrics
For the evaluation metrics, we adopt mean squared error (MSE) and mean absolute percentage error (MAPE). As MSE is more sensitive when the ground-truth value is large and MAPE is more sensitive when the ground-truth value is small, we believe that together they provide a more comprehensive comparison. Besides, class imbalance exists in this problem: edges are usually sparse, and training directly on them would lead to trivial results. To address it, we reweight the importance of edges and non-edges during the loss calculation, in both training and testing.
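As a rough illustration of the reweighted metrics described above, the sketch below computes weighted MSE and MAPE over flattened adjacency entries. The exact weighting scheme is not specified in this section, so the parameters `edge_weight` and `non_edge_weight`, the `eps` guard for zero ground-truth entries, and the function names are assumptions made for this example.

```python
from typing import List, Sequence


def _entry_weights(targets: Sequence[float], edge_weight: float,
                   non_edge_weight: float) -> List[float]:
    """Assign one weight per entry: edges (non-zero targets) vs. non-edges."""
    if not targets:
        raise ValueError("targets must be non-empty")
    return [edge_weight if t != 0 else non_edge_weight for t in targets]


def weighted_mse(preds: Sequence[float], targets: Sequence[float],
                 edge_weight: float = 1.0, non_edge_weight: float = 1.0) -> float:
    """Weighted mean squared error over all entries."""
    w = _entry_weights(targets, edge_weight, non_edge_weight)
    num = sum(wi * (p - t) ** 2 for wi, p, t in zip(w, preds, targets))
    return num / sum(w)


def weighted_mape(preds: Sequence[float], targets: Sequence[float],
                  edge_weight: float = 1.0, non_edge_weight: float = 1.0,
                  eps: float = 1e-8) -> float:
    """Weighted mean absolute percentage error.

    `eps` is an assumed guard that avoids division by zero on non-edges.
    """
    w = _entry_weights(targets, edge_weight, non_edge_weight)
    num = sum(wi * abs(p - t) / max(abs(t), eps)
              for wi, p, t in zip(w, preds, targets))
    return num / sum(w)
```

Giving edges a larger weight than non-edges keeps a sparse graph from being dominated by its many zero entries, so an all-zero prediction no longer scores well.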
Methods  BA MSE  BA MAPE  DBLP MSE  DBLP MAPE  Traffic MSE  Traffic MAPE
DGT
NECDGT
NECDGTenhanced
SegTran_p
SegTran
5.2. Graph-to-Graph Translation Performance
To answer the first question, we compare the graph translation performance of SegTran with the baselines under the semi-supervised scenario on the three datasets. We train the model on the paired and unpaired training graphs, and conduct graph translation on the test graphs. Each experiment is conducted times to alleviate the randomness. The average performance with standard deviation in terms of MSE and MAPE is reported in Table 2. From the table, we can make the following observations:

SegTran outperforms all other approaches by a reasonable margin. Compared with NECDGTenhanced, it shows an improvement of on the BA dataset, on DBLP, and on the Traffic dataset measured by MSE loss. In terms of MAPE loss, the improvements are , , and respectively. This result validates the effectiveness of SegTran.

SegTran is much more stable than NECDGTenhanced, although both utilize unpaired graphs. Looking at the standard deviations, it can be observed that NECDGTenhanced has higher performance variation. For example, its standard deviation on the BA dataset is about two times that of SegTran. This phenomenon could result from the difficulty NECDGTenhanced has in trading off between the semantics of the source domain and the semantics of the target domain while learning the intermediate embedding space.

The performance differences on the DBLP dataset are smaller than those on the other two. This could result from the dataset size. Referring to Table 1, the DBLP dataset is significantly larger than the other two datasets. Therefore, the benefit of adding more unpaired graphs would be smaller, which explains the narrower gap.

Generally, NECDGTenhanced performs better than NECDGT, and SegTran performs better than SegTran_p. This observation demonstrates the significance of learning from unpaired graphs. NECDGTenhanced follows exactly the same structure as NECDGT apart from an auxiliary decoder that performs reconstruction on the source-domain graphs, and it achieves an improvement of , and respectively measured by MSE loss. Similar observations can be made when comparing SegTran and SegTran_p. This result shows that when paired graphs are limited, it is important to use unpaired graphs to boost the model’s performance.
To summarize, these results support the importance of introducing unpaired graphs to graph translation tasks when only a limited number of paired training samples is available. Besides, they also validate the intuition that explicitly modeling the difference between the two domains in the intermediate space makes it easier for the model to utilize unpaired graphs.
5.3. Ratio of Unpaired Graphs
In this subsection, we analyze the sensitivity of our model and NECDGTenhanced to the amount of unpaired graphs, which answers the second question. We fix the number of paired training graphs, and change the settings by taking different percentages of samples in the dataset as unpaired graphs for semi-supervised training. Concretely, we vary the ratio of unpaired graphs to paired graphs as , and keep other hyperparameters unchanged. We only report the performance on BA and Traffic as we have similar observations on DBLP. The results are presented in Figure 5. From the figure, we make the following observations:


In general, as the ratio of unpaired graphs increases, the performance of both NECDGTenhanced and SegTran improves, which implies that unpaired graphs help to learn a better encoder and decoder for graph-to-graph translation.

As the ratio of unpaired graphs increases, the performance of SegTran is consistently better than that of NECDGTenhanced, which validates the effectiveness of SegTran in learning from unpaired graphs.
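The experimental protocol of this subsection, a fixed paired set with a varying amount of unpaired graphs, can be sketched as follows. This is a hypothetical helper rather than the authors' code: the function name, the seeding, and the choice to cap the sample at the pool size are assumptions.

```python
import random
from typing import List, Sequence, TypeVar

G = TypeVar("G")  # any graph object


def sample_unpaired(pool: Sequence[G], num_paired: int, ratio: float,
                    seed: int = 0) -> List[G]:
    """Draw round(ratio * num_paired) unpaired graphs from `pool`.

    The paired set stays fixed; only the number of unpaired graphs changes.
    The request is capped at the pool size (an assumption of this sketch).
    """
    if ratio < 0 or num_paired < 0:
        raise ValueError("ratio and num_paired must be non-negative")
    k = min(len(pool), int(round(ratio * num_paired)))
    rng = random.Random(seed)  # fixed seed keeps subsets reproducible
    return rng.sample(list(pool), k)
```

Using the same seed across models ensures that both methods see the same unpaired subset at each ratio, so differences in the curves are not caused by sampling noise.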
5.4. Ablation Study
To answer the third question, in this subsection we conduct an ablation study to understand the importance of each component of the proposed framework SegTran. All experiments are performed in the semi-supervised setting, with configurations exactly the same as SegTran unless stated otherwise.
Methods  BA (semi-sup., MSE)  DBLP (semi-sup., MSE)
Shared Embedding  0.196  0.245
No position  0.142  0.204
No MI  0.141  0.197
No multi-head attention  0.135  0.190
SegTran  0.132  0.192
Gain from dual embedding. In our model design, “dual embedding” is adopted to distinguish between the source and the target domain, and to help the model learn more from unpaired graphs. To test its effect, we perform an ablation study by removing the translation module, requiring the two domains to share the same embedding space. Other parts are not influenced, except the mutual information loss, which is no longer needed in this case. The performance of “Shared Embedding” in terms of MSE is shown in Table 3. From the table, we can see that on both the BA and DBLP datasets, compared with SegTran, replacing dual embedding with shared embedding results in a significant performance drop, which is about and points, respectively. This is because “dual embedding” eases the trade-off between the reconstruction and translation tasks. Without it, the same features would be used for both, which could result in suboptimal performance. This result shows the effectiveness of this design in leveraging the semantic change from the source domain to the target domain.
Importance of position embedding. In this part, we test the effect of the position embedding, and evaluate its contribution in guiding the encoding process. To remove the position embedding, we leave the model architecture untouched, and use the first eight dimensions of the node attributes to replace the calculated position embedding. From the result in Table 3, it can be seen that this results in a performance drop of on BA, and on DBLP. This result shows that the position embedding is important for extending the representation ability of the graph encoder module.
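The “No position” ablation can be pictured with a short sketch: the model keeps the same input slot, but the slot is filled with the first eight node-attribute dimensions instead of the calculated position embedding. The function name, the list-of-lists layout, and the `use_position` switch are assumptions made for illustration.

```python
from typing import List, Sequence


def position_features(node_attrs: Sequence[Sequence[float]],
                      position_emb: Sequence[Sequence[float]],
                      use_position: bool = True,
                      dim: int = 8) -> List[List[float]]:
    """Return the per-node features placed in the position-embedding slot.

    With use_position=False, the first `dim` attribute dimensions of each
    node replace the calculated position embedding, so the architecture
    itself is left unchanged.
    """
    if use_position:
        return [list(row) for row in position_emb]
    if any(len(row) < dim for row in node_attrs):
        raise ValueError("every node needs at least `dim` attribute dimensions")
    return [list(row[:dim]) for row in node_attrs]
```

Keeping the feature width identical in both settings means any performance difference comes from the information carried by the position embedding rather than from a change in model capacity.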
Importance of MI loss. We also test the benefit of the auxiliary mutual information loss when aligning the transformed embedding of unpaired graphs. For this experiment, we simply remove this loss during the fine-tuning steps, and observe the change in performance. We find a drop of points on BA, and on DBLP. The drop is smaller on the DBLP dataset, which may again be a result of the dataset size. As DBLP is much larger, the number of paired graphs is more sufficient for training, which makes this auxiliary loss less important. Overall, however, we observe that this supervision helps to train the translator better.
Influence of multi-head attention. In this part, we test the influence of the multi-head attention module in the decoding process. This module is designed to fuse the processed embeddings of nodes based on their relative distances. The relative distances are measured using the processed position embedding along with the metric space constructed by each attention head. Through multi-head attention, higher-order interaction between the processed node embedding and the position embedding is supported. For comparison, we directly remove this layer from the decoder. This manipulation does not influence other parts of the translation network. From Table 3, we can see that on the BA dataset, this module brings a moderate improvement of around . On the DBLP dataset, the performance is similar whether this module is removed or not. Therefore, the importance of this layer depends on the complexity of the dataset, and sometimes it is not necessary.
5.5. Parameter Sensitivity Analysis
In this subsection, we analyze the sensitivity of SegTran’s performance to the hyperparameters and , the weights of the reconstruction loss and the mutual information loss. We vary both and as . These experiments all use the same pretrained model, as these two hyperparameters only influence the fine-tuning process. Other settings are the same as SegTran. This experiment is performed on the BA dataset, and the result is shown in Figure 6. The axis is the translation error measured using MSE loss, axis refers to the value of , and axis represents .
From the figure, we can observe that generally, reconstruction loss and mutual information loss are both important for achieving a better performance. When is small, it is difficult for this model to achieve a high performance. This observation makes sense because reconstruction loss guides the encoder to extract more complete features from input graphs, and consequently has a direct influence on the quality of intermediate embedding space.
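The sensitivity analysis above amounts to a grid sweep over the two loss weights. The sketch below shows one way to organize such a sweep; `evaluate` stands in for fine-tuning from the shared pretrained model and returning the test MSE, and both it and the candidate values in the usage note are hypothetical, since the actual grid is not given here.

```python
from itertools import product
from typing import Callable, Dict, Sequence, Tuple


def sweep(evaluate: Callable[[float, float], float],
          recon_weights: Sequence[float],
          mi_weights: Sequence[float]
          ) -> Tuple[Dict[Tuple[float, float], float], Tuple[float, float]]:
    """Evaluate every (reconstruction weight, MI weight) pair.

    Returns the full MSE surface and the pair with the lowest error.
    """
    results = {
        (r, m): evaluate(r, m)
        for r, m in product(recon_weights, mi_weights)
    }
    best = min(results, key=results.get)
    return results, best
```

For example, `sweep(finetune_and_score, [0.1, 0.5, 1.0], [0.1, 0.5, 1.0])` with a user-supplied `finetune_and_score` would yield nine MSE values that can be plotted as a surface like the one described for Figure 6.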
5.6. Case Study
In Figure 7, we provide an example from the BA dataset to show and compare the translation patterns captured by SegTran and NECDGTenhanced. Figure 7(a) is the source graph, and Figure 7(b) is the target graph. Figure 7(c) is the result translated by SegTran, and Figure 7(d) is the result translated by NECDGTenhanced. To show the prediction results clearly, we split the edges in Figure 7(c) and Figure 7(d) into three groups. When the predicted existence probability of an edge is above , we draw it in black if the edge exists in the target graph and in red otherwise. When the probability is between and , we draw it in grey. Edges with smaller probability are filtered out.
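The three-way grouping of predicted edges used for the figure can be expressed as a small helper. This is a sketch under stated assumptions: the thresholds `high` and `low` are left as required parameters because their values are not given in this text, and edges are assumed to be keyed by node-index pairs.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, int]  # assumed representation: (u, v) with u < v


def group_edges(probs: Dict[Edge, float], true_edges: Set[Edge],
                high: float, low: float) -> Dict[str, List[Edge]]:
    """Split predicted edges into the colour groups used for plotting.

    black: confident prediction that matches the target graph
    red:   confident prediction that is absent from the target graph
    grey:  probability between `low` and `high`
    Edges with probability below `low` are filtered out.
    """
    if low > high:
        raise ValueError("low must not exceed high")
    groups: Dict[str, List[Edge]] = {"black": [], "red": [], "grey": []}
    for edge, p in probs.items():
        if p > high:
            groups["black" if edge in true_edges else "red"].append(edge)
        elif p >= low:
            groups["grey"].append(edge)
    return groups
```

With hypothetical thresholds such as `high=0.5` and `low=0.1`, a translated graph without any entries in the `red` group corresponds to a result with no confident erroneous predictions.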
It can be seen that SegTran has a high translation quality, and makes no erroneous predictions. In the target graph, node is the most popular node. The same pattern can be found in the result translated by SegTran, where node has a high probability of linking to most other nodes. As for the distant node pairs, like node and , or node and , which have no links in the target graph, SegTran assigns a relatively low probability to their existence, below .
The performance of NECDGTenhanced, on the other hand, is less satisfactory. It mistakenly splits the nodes into two groups, and others, and assigns large weights to edges inside each group but small weights to edges between them. Node is connected only to node in the source graph; therefore, NECDGTenhanced tries to push its embedding, as well as those of its neighbors, away from other nodes, which could be the cause of this phenomenon. This example shows that SegTran is better at capturing the graph distributions in the target domain and learning the translation patterns.
6. Conclusion and Future Work
Graph-to-graph translation has many applications. However, for many domains, obtaining large-scale data is expensive. Thus, in this paper, we investigate a new problem of semi-supervised graph-to-graph translation. We propose a new framework, SegTran, which is composed of an encoder to learn graph representations, a decoder to reconstruct graphs, and a translator module to translate graphs in the latent space. SegTran adopts dual embedding to bridge the semantic gap between the source and target domains. Experimental results on synthetic and real-world datasets demonstrate the effectiveness of the proposed framework for semi-supervised graph translation. Further experiments are conducted to understand the contribution of each component of SegTran and its hyperparameter sensitivity.
There are several interesting directions that need further investigation. First, in this paper, we mainly focus on translating one graph in the source domain to another graph in the target domain. In the real world, there are many situations that require translation from one domain to many other domains. For example, in the molecular translation task (Jin et al., 2018, 2019), we could impose different requirements on the properties of the translated compound, and each requirement would form a domain. Therefore, we plan to extend SegTran to one-to-many graph translation. Second, graph translation has many real-world applications. In this paper, we conduct experiments on citation networks and traffic networks. We would like to extend our framework to more application domains, such as graph translation in brain networks and medical domains (Tewarie et al., 2015; Bassett and Sporns, 2017).
7. Acknowledgement
This project was partially supported by NSF projects IIS1707548, IIS1909702, IIS1955851, CBET1638320 and Global Research Outreach program of Samsung Advanced Institute of Technology #225003.
References
Unsupervised statistical machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3632–3642.
Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473.
Network neuroscience. Nature Neuroscience 20 (3), pp. 353.
Mutual information neural estimation. In International Conference on Machine Learning, pp. 531–540.
Understanding the representation power of graph neural networks in learning graph topology. In Advances in Neural Information Processing Systems, pp. 15387–15397.
Graph transformation policy network for chemical reaction prediction. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery Data Mining.
Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 1263–1272.
Generative adversarial nets. In NIPS.
Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems, pp. 529–536.
Deep graph translation. ArXiv abs/1805.09980.
Deep multi-attributed graph translation with node-edge co-evolution. ICDM.
Inductive representation learning on large graphs. In NIPS.
Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670.
Self-supervised learning on graphs: deep insights and new direction. arXiv preprint arXiv:2006.10141.
Graph structure learning for robust graph neural networks. arXiv preprint arXiv:2005.10203.
Junction tree variational autoencoder for molecular graph generation. In ICML.
Learning multimodal graph-to-graph translation for molecular optimization. ArXiv abs/1812.01070.
Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pp. 3581–3589.
Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
Variational graph autoencoders. ArXiv abs/1611.07308.
Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242.
Multilayered network embedding. In Proceedings of the 2018 SIAM International Conference on Data Mining, pp. 684–692.
Symbolic graph reasoning meets convolutions. In NeurIPS.
Emergence of scaling in random networks.
One-class collaborative filtering. In 2008 Eighth IEEE International Conference on Data Mining, pp. 502–511.
Style transfer through back-translation. In ACL.
Edinburgh neural machine translation systems for WMT 16. In WMT.
A graph to graphs framework for retrosynthesis prediction. ArXiv abs/2003.12725.
GraphVAE: towards generation of small graphs using variational autoencoders. In ICANN.
Graph to graph: a topology aware approach for graph structures learning and generation. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2946–2955.
Sequence to sequence learning with neural networks. In NIPS.
ArnetMiner: extraction and mining of academic social networks. In KDD.
Graph convolutional networks against degree-related biases. arXiv preprint arXiv:2006.15643.
The minimum spanning tree: an unbiased method for brain network analysis. Neuroimage 104, pp. 177–188.
Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
A simple baseline for travel time estimation using large-scale trip data. ACM Transactions on Intelligent Systems and Technology (TIST) 10 (2), pp. 1–22.
Simplifying graph convolutional networks. In International Conference on Machine Learning, pp. 6861–6871.
Learning from multiple cities: a meta-learning approach for spatial-temporal prediction. In The World Wide Web Conference, pp. 2181–2191.
Revisiting spatial-temporal similarity: a deep learning framework for traffic prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 5668–5675.
Deep multi-view spatial-temporal network for taxi demand prediction. In Thirty-Second AAAI Conference on Artificial Intelligence.
Position-aware graph neural networks. In International Conference on Machine Learning, pp. 7134–7143.
GraphRNN: generating realistic graphs with deep autoregressive models. In ICML.
Hashing graph convolution for node classification. In CIKM.
Unpaired image-to-image translation using cycle-consistent adversarial networks. 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2242–2251.