1. Introduction
Recently, many online personalized services have utilized users’ historical behavior data to characterize user preferences, such as online video sites (Covington et al., 2016), App stores (Cheng et al., 2016), online advertising (Guo et al., 2017) and e-commerce sites (Zhao et al., 2018; Wang et al., 2018). Learning representations from user-item interactions is an essential problem in most personalized services. Usually, low-dimensional embeddings can effectively represent the attributes of items and the preferences of users in a uniform latent semantic space, which helps to provide personalized services and improve user experience. Moreover, the representation of users and items has been widely applied to many research topics related to the above real-world scenarios, including large-scale recommendation (Wang et al., 2018; Ying et al., 2018), search ranking (Grbovic and Cheng, 2018; Chu et al., 2018) and the cold-start problem (Zhao et al., 2018).
In large-scale personalized services, there is usually only a small portion of active users, and the majority of non-active users interact with only a small number of items. Users’ behavior data is thus lacking or insufficient in an individual domain, which makes it difficult to learn effective embeddings (Wang et al., 2019). On the other hand, though data from a single domain is sparse, users’ behaviors from correlated domains regarding the same items are usually complementary (Zhuang et al., 2017). Take the App store as an example: there are two ways users interact with (e.g., download) items (i.e., Apps). One is downloading Apps recommended on the homepage or category pages of the App store (i.e., the recommendation domain); the other is downloading Apps by searching (i.e., the search domain). User behaviors in the search domain reflect a user’s current needs or intentions, while those in the recommendation domain represent a user’s relatively long-term interests. Leveraging the interaction data from the search domain can improve the performance of recommendation; conversely, interaction data from the recommendation domain can help to explore a user’s personalized interests and thereby optimize the ranking list in the search domain. Therefore, we are motivated to leverage the complementary information from correlated domains to alleviate the sparsity problem.
Generally, users’ behaviors are sequential (Hidasi et al., 2016) (take the App store as an example, as shown in Figure 1 (a)), and a graph can intuitively model users’ sequential behaviors (Wang et al., 2018). Specifically, in each domain (as shown in Figure 1 (b)), we can construct an item graph by modeling the items as nodes, the item co-occurrences as edges, and the number of co-occurrences in all users’ behavior sequences as the edge weights. Graph embedding methods such as DeepWalk (Perozzi et al., 2014; Wang et al., 2018) can then generate abundant item sequences by running random walks on the item graph, and use the Skip-Gram algorithm (Mikolov et al., 2013a; Mikolov et al., 2013b) to learn item embeddings. Compared with random walk based graph embedding methods (Perozzi et al., 2014; Grover and Leskovec, 2016), graph neural networks (GNNs) have recently shown great power for representation learning on graphs (Xu et al., 2019). As a state-of-the-art GNN, the graph convolutional network (GCN) (Kipf and Welling, 2017) builds on convolutional neural networks and generates node embeddings by operating convolution on the graph. The graph convolution operation in GCN encodes node attributes and graph structure using neural networks; GCN therefore performs well in graph embedding and can be used for item embedding.

However, these methods are developed for learning single-graph embedding, i.e., single-domain embedding. Users’ behaviors across domains are more complex, and it is more reasonable to model users’ behaviors as a multi-graph (as shown in Figure 1 (c)), which consists of a set of nodes and multiple types of edges (i.e., solid and dashed lines represent two types of edges). Concretely, nodes represent the same items across domains, and each type of edge denotes the co-occurrences of item pairs in one domain. In a multi-graph, there may exist multiple types of edges between a pair of nodes; each type of edge forms a certain subgraph (i.e., a domain), and these subgraphs are related to each other, as all of them share the same nodes. Thus, each node (i.e., item) is likely to have a different representation in each subgraph (i.e., domain), and all these representations of a node are relevant to each other. However, the existing single-graph embedding methods fail to fuse the complex relations in a multi-graph and generate effective node embeddings. The cross-domain scenario poses challenges for transferring information across domains and learning cross-domain representations. On the other hand, though GCN is effective, stacking many convolutional layers makes GCN difficult to train, as the iterative graph convolution operation is prone to overfitting, as stated in (Li et al., 2018). This brings additional complexity and challenges when applying GCN to learn cross-domain (or multi-graph) representations. Thus, to better utilize the power of GCN, dedicated efforts are needed to design a novel neural network architecture based on GCN for cross-domain representation learning, and to optimize the network efficiently to overcome the disadvantages of GCN.
To address these challenges, in this paper we propose a novel embedding model named Deep Multi-Graph Embedding (DMGE). We first construct the item graph as a multi-graph based on users’ sequential behaviors from different domains. Specifically, the nodes in the multi-graph represent items, and two nodes are connected by an edge if they occur consecutively in one user’s sequence. Thus, learning item embeddings is converted to learning node embeddings in the multi-graph. To utilize the power of GCN for graph embedding, we propose a graph neural network inspired by the multi-task learning regime, which extends GCN to learn cross-domain representations. Specifically, each domain is viewed as a task in the model: we design domain-specific layers to generate a domain-specific representation for each domain, and all domains are correlated by the domain-shared layers, which generate the domain-shared representation. The model is trained in an unsupervised manner by learning the graph structure. Besides, to overcome the disadvantages of GCN, we introduce a multiple-gradient descent optimizer to train the proposed model, which can adaptively adjust the weight of each domain. In particular, it updates the parameters of the domain-shared layers using a weighted sum of the gradients of all domains, and the parameters of the domain-specific layers using the gradients of the specific domain. The main contributions of our work are summarized as follows:

We focus on learning cross-domain representations. Innovatively, we model users’ behaviors across domains as a multi-graph, and propose a graph neural network to learn domain-shared and domain-specific representations simultaneously.

We propose a novel embedding model named Deep Multi-Graph Embedding (DMGE), which is a graph neural network based on multi-task learning. In particular, we present a multiple-gradient descent optimizer to efficiently train the model in an unsupervised manner.

We evaluate DMGE on large-scale real-world datasets, and the experimental results show that DMGE outperforms other state-of-the-art embedding methods in various tasks.
2. Deep Multi-Graph Embedding
In this section, we elaborate the Deep Multi-Graph Embedding (DMGE) model for cross-domain item embedding. We first present the problem definition. Then, we propose a multi-graph neural network to learn node embeddings in the multi-graph. Finally, we present a multiple-gradient descent optimizer to efficiently train the model in an unsupervised manner.
2.1. Problem Definition
Suppose there are $K$ domains. For each domain $k \in \{1, \dots, K\}$, we first construct the item graph as an undirected weighted graph $\mathcal{G}_k = (\mathcal{V}, \mathcal{E}_k)$. As these domains are correlated and share the same set of items, we then construct the cross-domain item graph as an undirected weighted multi-graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, which contains the node set $\mathcal{V}$ with $N = |\mathcal{V}|$ nodes and the edge set $\mathcal{E}$ with $K$ types of edges, i.e., $\mathcal{E} = \mathcal{E}_1 \cup \dots \cup \mathcal{E}_K$.
Our problem can be formally stated as follows: given an undirected weighted multi-graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ and the node feature matrix $X \in \mathbb{R}^{N \times F}$, which represents the input of each node as an $F$-dimensional feature vector, our goal is to learn a set of embeddings $\{Z_1, \dots, Z_K\}$ for all nodes in each subgraph, i.e., $Z_k \in \mathbb{R}^{N \times d}$ is the node embedding in subgraph $\mathcal{G}_k$, with each node having a $d$-dimensional embedding, by solving the following optimization problem:

\max \sum_{k=1}^{K} \sum_{v_i \in \mathcal{V}} \sum_{v_j \in \mathcal{N}_k(v_i)} \log p(v_j \mid v_i; \mathcal{G}_k), \quad (1)

where $p(v_j \mid v_i; \mathcal{G}_k)$ is the probability that there exists an edge between node $v_i$ and node $v_j$ in the subgraph $\mathcal{G}_k$, and $\mathcal{N}_k(v_i)$ is the set of neighbors of node $v_i$ in the subgraph $\mathcal{G}_k$.

2.2. Multi-Graph Neural Network
As discussed above, graph neural networks (GNNs) (Zhou et al., 2018; Xu et al., 2019), such as GCN (Kipf and Welling, 2017), have recently emerged as a powerful approach for representation learning on graphs. We therefore focus on applying GCN to multi-graph embedding. In a multi-graph, the same set of nodes is shared by all subgraphs. Each node has different neighbors in different subgraphs, so it is likely to have a different representation in each subgraph. Moreover, all these representations belong to the same node, and thus they are inherently related to each other. We present two types of node representations in the multi-graph: each node has a shared representation, which encodes the information shared across the multi-graph, and a specific representation in each subgraph, which encodes the information specific to that subgraph.
To learn these multiple types of node representations, the architecture of DMGE, presented in Figure 2, follows the multi-task learning regime (Caruana, 1997; Ruder, 2017). Specifically, the domain-shared layers are graph convolutional layers on the multi-graph, which are used to learn the representation shared across domains. The domain-specific layers are also graph convolutional layers, which are used to learn the specific representation on each subgraph for each domain. The outputs of the graph convolutional layers are node embeddings, which model the probability that a link exists between nodes.
The graph convolutional layers efficiently learn node embeddings based on the neighborhood aggregation scheme. In DMGE, the shared graph convolutional layers generate the shared node embedding by encoding node attributes and the multi-graph structure. Based on the shared embedding, the specific graph convolutional layers generate a specific node embedding on each subgraph by the same rule.
To learn the shared embedding, the shared graph convolutional layers are defined as follows:

H^{(l+1)} = \sigma\big( \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)} \big), \quad (2)

where $H^{(l)}$ is the shared embedding of the multi-graph in the $l$-th layer, and $H^{(0)} = X$ is the matrix of node features. $\tilde{A} = A + I_N$ is the adjacency matrix of the multi-graph with added self-connections, where $A$ is the adjacency matrix with $A_{ij} = 1$ if there are any links between node $v_i$ and node $v_j$, and $I_N$ is the identity matrix. $\tilde{D}$ is a diagonal matrix with $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$, and $W^{(l)}$ is the shared weight matrix of the $l$-th layer. The output of the $L$-th shared graph convolutional layer is the shared node embedding $Z_s = H^{(L)}$.

Based on the shared embedding, the specific graph convolutional layers are defined as follows:
H_k^{(l+1)} = \sigma\big( \tilde{D}_k^{-\frac{1}{2}} \tilde{A}_k \tilde{D}_k^{-\frac{1}{2}} H_k^{(l)} W_k^{(l)} \big), \quad (3)

where $H_k^{(l)}$ is the specific embedding of subgraph $\mathcal{G}_k$ in the $l$-th layer, and $H_k^{(0)} = Z_s$ is the shared embedding. $\tilde{A}_k = A_k + I_N$ is the adjacency matrix of subgraph $\mathcal{G}_k$ with added self-connections, where $A_k$ is the adjacency matrix and $(A_k)_{ij}$ is the weight of edge $(v_i, v_j)$. $\tilde{D}_k$ is a diagonal matrix with $(\tilde{D}_k)_{ii} = \sum_j (\tilde{A}_k)_{ij}$, and $W_k^{(l)}$ is the specific weight matrix of the $l$-th layer. The output of the specific graph convolutional layers is the set of node embeddings $\{Z_1, \dots, Z_K\}$.
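The propagation rule of the graph convolutional layers described above can be sketched numerically as follows. This is an illustrative NumPy version of a single layer (function and variable names are our own, and a ReLU activation is assumed), not the paper’s implementation:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph convolution step: symmetrically normalize the
    adjacency with self-connections, aggregate neighbor features,
    apply the layer weights and a ReLU nonlinearity."""
    A_tilde = A + np.eye(A.shape[0])          # add self-connections
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(d ** -0.5)           # \tilde{D}^{-1/2}
    H_next = D_inv_sqrt @ A_tilde @ D_inv_sqrt @ H @ W
    return np.maximum(H_next, 0.0)            # ReLU

# toy multi-graph adjacency (union of links) and feature matrix
A = np.array([[0.0, 1.0], [1.0, 0.0]])
X = np.eye(2)                                  # identity initialization
W = np.ones((2, 3))
H_shared = gcn_layer(A, X, W)                  # shared embedding of 2 nodes
```

A specific layer would apply the same function with the subgraph’s weighted adjacency and `H_shared` as input.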
To learn node embeddings, we train the neural network in an unsupervised manner by modeling the graph structure. We use the embeddings to predict the linkage between two nodes, i.e., the probability that there exists an edge between them. We therefore formulate embedding learning as a binary classification problem over pairs of node embeddings.
The probability that there exists an edge between node $v_i$ and node $v_j$ in subgraph $\mathcal{G}_k$ is defined in Eq. (4), and the probability that no edge exists between them is defined in Eq. (5):

p(v_i, v_j) = \sigma\big( z_i^k \cdot z_j^k \big), \quad (4)

p'(v_i, v_j) = 1 - p(v_i, v_j) = \sigma\big( -z_i^k \cdot z_j^k \big), \quad (5)

where $z_i^k$ is the $i$-th row of $Z_k$, which is the embedding vector of node $v_i$ in subgraph $\mathcal{G}_k$, and $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid function.
Therefore, the objective is to generate the embeddings by maximizing the following log-likelihood function:

\max \sum_{(v_i, v_j) \in P_k} \log \sigma\big( z_i^k \cdot z_j^k \big) + \sum_{(v_i, v_j) \in N_k} \log \sigma\big( -z_i^k \cdot z_j^k \big), \quad (6)

where $P_k$ is the set of positive samples in subgraph $\mathcal{G}_k$, which contains the tuples $(v_i, v_j)$ with an edge between node $v_i$ and node $v_j$ in subgraph $\mathcal{G}_k$, and $N_k$ is the set of negative samples in subgraph $\mathcal{G}_k$. The negative samples are drawn from the node set by negative sampling (Mikolov et al., 2013a; Mikolov et al., 2013b), and contain tuples $(v_i, v_j)$ with no edge between node $v_i$ and node $v_j$ in subgraph $\mathcal{G}_k$.
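The negative-sampling likelihood above can be sketched as follows; a minimal NumPy illustration (names are hypothetical) that scores observed edges with the sigmoid of the embedding dot product and sampled non-edges with its complement:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def edge_log_likelihood(Z, pos_pairs, neg_pairs):
    """Log-likelihood over one subgraph: observed edges should score
    high (sigmoid of the dot product near 1), sampled non-edges low."""
    ll = 0.0
    for i, j in pos_pairs:
        ll += np.log(sigmoid(Z[i] @ Z[j]))      # positive samples
    for i, j in neg_pairs:
        ll += np.log(sigmoid(-(Z[i] @ Z[j])))   # negative samples
    return ll

# toy embeddings: nodes 0 and 1 are similar, node 2 is opposite
Z = np.array([[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]])
ll = edge_log_likelihood(Z, pos_pairs=[(0, 1)], neg_pairs=[(0, 2)])
```

The per-subgraph loss is the negative of this quantity; maximizing it pulls connected nodes together and pushes sampled non-neighbors apart.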
Therefore, the objective function is defined as Eq. (7), in which $L_k$ is the loss function (the negative log-likelihood of Eq. (6)) of subgraph $\mathcal{G}_k$:

\min_{\theta^{sh}, \theta^k} L_k(\theta^{sh}, \theta^k), \quad k = 1, \dots, K. \quad (7)
2.3. Optimization
In our model, the parameter set $\theta^{sh}$ of the shared graph convolutional layers is shared across domains, while the parameter sets $\theta^k$ of the specific graph convolutional layers are domain-specific. To train DMGE and benefit all domains, we need to optimize all the objectives $L_1, \dots, L_K$. In multi-task learning (Caruana, 1997; Ruder, 2017), a commonly used method to optimize the objective function Eq. (7) is to minimize a weighted sum of all $L_k$. However, stacking multiple layers brings additional difficulties to training the model (Li et al., 2018), and it is time-consuming to tune the weights to obtain the optimal solution. Therefore, we formulate the problem as multi-objective optimization, and the optimization objective is defined as follows:

\min_{\theta^{sh}, \theta^1, \dots, \theta^K} \big( L_1(\theta^{sh}, \theta^1), L_2(\theta^{sh}, \theta^2), \dots, L_K(\theta^{sh}, \theta^K) \big)^{\top}. \quad (8)
The goal of Eq. (8) is to find a solution that is optimal for each objective (i.e., each domain). To solve the multi-objective optimization problem, we introduce a multiple-gradient descent optimizer. First, we state the Karush-Kuhn-Tucker (KKT) conditions (Kuhn and Tucker, 1951) for the multi-objective optimization in Eq. (8), which are necessary conditions for an optimal solution:

\exists \alpha_1, \dots, \alpha_K \geq 0 \;\; \text{with} \;\; \sum_{k=1}^{K} \alpha_k = 1 \;\; \text{such that} \;\; \sum_{k=1}^{K} \alpha_k \nabla_{\theta^{sh}} L_k(\theta^{sh}, \theta^k) = 0, \;\; \text{and} \;\; \nabla_{\theta^k} L_k(\theta^{sh}, \theta^k) = 0 \;\; \forall k, \quad (9)

where $\alpha_k$ is the weight of objective $L_k$.
As proved in (Désidéri, 2012), either the solution to Eq. (10) is 0 and the result satisfies the KKT conditions in Eq. (9), or the solution gives a descent direction that improves all objectives in Eq. (8). Thus, solving the KKT conditions in Eq. (9) is equivalent to optimizing Eq. (10):

\min_{\alpha_1, \dots, \alpha_K} \Big\| \sum_{k=1}^{K} \alpha_k \nabla_{\theta^{sh}} L_k(\theta^{sh}, \theta^k) \Big\|_2^2 \quad \text{s.t.} \quad \sum_{k=1}^{K} \alpha_k = 1, \;\; \alpha_k \geq 0 \;\; \forall k. \quad (10)
To clearly illustrate how the optimizer works, we consider the case of two domains. The optimization objective Eq. (10) can be simplified as:

\min_{\alpha \in [0, 1]} \big\| \alpha \nabla_{\theta^{sh}} L_1(\theta^{sh}, \theta^1) + (1 - \alpha) \nabla_{\theta^{sh}} L_2(\theta^{sh}, \theta^2) \big\|_2^2, \quad (11)

where $\alpha$ is the weight of $L_1$. The solution of Eq. (11) has the closed form:

\hat{\alpha} = \Bigg[ \frac{ \big( \nabla_{\theta^{sh}} L_2 - \nabla_{\theta^{sh}} L_1 \big)^{\top} \nabla_{\theta^{sh}} L_2 }{ \big\| \nabla_{\theta^{sh}} L_1 - \nabla_{\theta^{sh}} L_2 \big\|_2^2 } \Bigg]_{[0,1]}, \quad (12)

where $[x]_{[0,1]} = \max(\min(x, 1), 0)$ clips the value to the interval $[0, 1]$.
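The closed-form weight for the two-domain case can be sketched as follows (an illustrative implementation assuming the shared-parameter gradients are flattened into vectors; the clipping to [0, 1] follows the multiple-gradient descent literature):

```python
import numpy as np

def alpha_two_tasks(g1, g2):
    """Closed-form minimizer of ||a*g1 + (1-a)*g2||^2 over a in [0, 1]:
    a_hat = ((g2 - g1) . g2) / ||g1 - g2||^2, clipped to [0, 1]."""
    diff = g1 - g2
    denom = diff @ diff
    if denom == 0.0:
        return 0.5                 # gradients coincide; any weight works
    a = ((g2 - g1) @ g2) / denom
    return float(np.clip(a, 0.0, 1.0))

# when the two domain gradients are orthogonal, the weight balances them
a = alpha_two_tasks(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

When one gradient dominates the other, the clipping drives the weight to 0 or 1, so the shared update follows the smaller-norm (non-conflicting) direction.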
With the weight $\hat{\alpha}$, we update the parameters of the model as follows. As the parameter sets $\theta^1$ and $\theta^2$ are domain-specific, we update $\theta^1$ and $\theta^2$ using Eq. (13) for each domain, respectively. The parameter set $\theta^{sh}$ is shared across domains, so we apply the weighted gradient as the gradient update for the shared parameters, as defined in Eq. (14). Notice that $\hat{\alpha}$ is computed by the optimizer according to the gradient of each domain:

\theta^k \leftarrow \theta^k - \eta \nabla_{\theta^k} L_k(\theta^{sh}, \theta^k), \quad k = 1, 2, \quad (13)

\theta^{sh} \leftarrow \theta^{sh} - \eta \big( \hat{\alpha} \nabla_{\theta^{sh}} L_1 + (1 - \hat{\alpha}) \nabla_{\theta^{sh}} L_2 \big), \quad (14)

where $\eta$ is the learning rate.
Finally, we summarize the learning procedure of DMGE in Algorithm 1. The input of DMGE includes the multi-graph $\mathcal{G}$ and the node feature matrix $X$. In line 1, we initialize the parameter sets $\theta^{sh}, \theta^1, \dots, \theta^K$ and the weight of each domain. Then, we operate convolution on the multi-graph in line 2. For each subgraph $\mathcal{G}_k$, we sample a set of negative samples in line 5. We use the link information to train DMGE: we compute the gradients and update the parameters of the specific graph convolutional layers in line 6, and compute the gradients of the shared graph convolutional layers in line 7. Based on these gradients, we compute $\hat{\alpha}$ in line 8, and update the parameters of the shared graph convolutional layers in line 9. Finally, we return the set of node embeddings in line 12.
3. Experiments
In this section, we first present the research questions about DMGE. Then, we introduce the datasets and experimental settings. Finally, we present the experimental results to demonstrate the effectiveness of DMGE.
We first present the following three research questions:

RQ1: How does DMGE perform in the recommendation task compared with other state-of-the-art embedding methods for recommendation?

RQ2: How do the parameter settings affect the performance of DMGE for recommendation?

RQ3: How does DMGE perform in classic graph tasks (e.g., link prediction) compared with other state-of-the-art graph embedding methods?
3.1. Datasets
We evaluate our model on two real-world datasets. The details of the datasets are as follows, and their statistics are presented in Table 1.
Dataset            Domain/Relation   Nodes    Edges
Tencent App Store  Homepage          18,229   548,930
                   Search            18,229   936,065
YouTube            Friendship        15,088   76,765
                   Co-friends        15,088   1,940,806

Tencent App Store: App download records from a commercial App store, containing a recommendation domain and a search domain. The time span of the dataset is 31 days, the number of Apps is 18,229, and the number of users is 1,011,567. Based on users’ download records, we construct the item graph for each domain; the statistics of the graphs are presented in Table 1. We use this dataset for the App recommendation task.

YouTube (http://socialcomputing.asu.edu/datasets/YouTube): The YouTube dataset (Zafarani and Liu, 2009) contains two types of relations among users: friendship and co-friends. Specifically, the friendship relation means two users are friends, and the co-friends relation means two users have shared friends. We use this dataset for the link prediction task.
3.2. Experimental Settings
3.2.1. Baseline Methods
For both tasks, we choose the following state-of-the-art graph embedding methods as baselines:

DeepWalk (Perozzi et al., 2014): It applies random walks on the graph to generate node sequences, and uses the Skip-Gram algorithm to learn embeddings. We apply DeepWalk to each subgraph separately.

LINE (Tang et al., 2015): It learns node embeddings by preserving both local and global graph structure. We apply LINE to each subgraph separately.

node2vec (Grover and Leskovec, 2016): It designs a biased random walk procedure that can explore diverse neighborhoods. We apply node2vec to each subgraph separately.

GCN (Kipf and Welling, 2017): It operates convolution on the graph, and generates node embeddings based on neighborhoods. We apply GCN to each subgraph separately.

mGCN (Ma et al., 2019): It applies graph convolutional networks to multi-graph embedding. It generates both general embeddings, which capture node information over the entire graph, and dimension-specific embeddings, which capture node information in each subgraph.

DMGE ($\alpha$): A variant of DMGE that defines the objective function as the fixed weighted sum of the objectives in Eq. (8) for multi-graph embedding, in which $\alpha$ is the weight of the first domain.
For the App recommendation task, besides the above baselines, we also compare with matrix factorization (MF) (Koren et al., 2009), which factorizes the user-item matrix into user embeddings and item embeddings. We apply MF to each domain separately.
3.2.2. Evaluation Metrics
To evaluate the performance of recommendation, we compare the recommended top-$n$ list with the corresponding ground truth list for each user $u$, and use the following metrics to evaluate the top-$n$ recommended results:

Recall@$n$: It calculates the fraction of the ground truth (i.e., the Apps the user downloaded) that is recommended by each algorithm, as in Eq. (15), where $\mathcal{U}$ is the user set, $hit_u@n$ denotes the number of downloaded Apps that appear in the candidate top-$n$ App list of user $u$, and $|D_u|$ denotes the number of Apps downloaded by user $u$. A larger value of Recall@$n$ means better performance.

\text{Recall@}n = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \frac{hit_u@n}{|D_u|}. \quad (15)
MRR@$n$: Mean Reciprocal Rank (MRR) uses the multiplicative inverse of the rank of the first hit item in the top-$n$ list to evaluate ranking performance, as in Eq. (16), where $rank_u$ is the rank of the first hit item for user $u$. A larger value of MRR@$n$ means better performance.

\text{MRR@}n = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \frac{1}{rank_u}. \quad (16)
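The two metrics can be sketched per user as follows (a minimal illustration with names of our own; the per-user values are then averaged over the user set):

```python
def recall_at_k(recommended, ground_truth, k):
    """Fraction of a user's ground-truth items that appear in the
    top-k recommended list."""
    hits = len(set(recommended[:k]) & set(ground_truth))
    return hits / len(ground_truth)

def mrr_at_k(recommended, ground_truth, k):
    """Reciprocal rank of the first hit in the top-k list;
    0 if no ground-truth item is ranked within the top k."""
    for rank, item in enumerate(recommended[:k], start=1):
        if item in ground_truth:
            return 1.0 / rank
    return 0.0
```

For example, with recommendations ["a", "b", "c", "d"] and ground truth {"b", "d"}, Recall@3 is 1/2 (only "b" is hit) and MRR@3 is 1/2 ("b" is ranked second).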
To evaluate the performance of link prediction, we use the metrics of binary classification: AUC and F1.
3.2.3. Model Parameters
The parameters of DMGE are set as follows:

Network architecture. The numbers of shared and specific graph convolutional layers are both 1, the shared hidden size is 64, and the specific hidden size is 16.

Initialization. The node feature matrix $X$ can be initialized randomly or by other embedding methods; we initialize it as the identity matrix.

Gradient normalization. We normalize the gradient of the shared parameters for each domain, and then use the normalized gradients to calculate $\hat{\alpha}$ in Eq. (12). The normalized gradient of domain $k$ is $g_k / \|g_k\|_2$, where $g_k$ is the unnormalized gradient.

Other hyper-parameters. The number of negative samples is 2; the embedding dimension is 16; the dropout of the shared graph convolutional layers is 0.3 and that of the specific graph convolutional layers is 0.1; the batch size is 256; and we train the model for a maximum of 10 epochs using Adam.
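The gradient normalization step above can be sketched as follows. This assumes unit L2-norm scaling, which is one common reading of the normalization described (the exact formula is not fully legible in the source):

```python
import numpy as np

def normalize_gradient(g):
    """Scale a domain's shared-parameter gradient to unit L2 norm
    before it enters the weight computation, so that domains with
    larger gradient magnitudes do not dominate the weighting."""
    norm = np.linalg.norm(g)
    return g if norm == 0.0 else g / norm

g = normalize_gradient(np.array([3.0, 4.0]))
```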
In all methods, the dimension of embedding is set to 16. The parameters of the baselines are fine-tuned and set as follows:

MF. It is implemented using LibMF (https://www.csie.ntu.edu.tw/~cjlin/libmf/).

DeepWalk. The length of context window is 5; the length of random walk is 20; the number of walks per node is 50.

LINE. The number of negative samples is 2.

node2vec. The length of the context window is 5; the length of random walk is 20; the number of walks per node is 50; the number of negative samples is 2; $p$ is 1 and $q$ is 0.25.

GCN. The number of graph convolutional layers is 1.

mGCN. The initial general representation size is 64; other parameter settings are the same as in (Ma et al., 2019); and we train the model for a maximum of 20 epochs using Adam.

DMGE ($\alpha$). Considering that both domains are important, we set the weight $\alpha$ to 0.5; the other parameter settings are the same as for DMGE.
3.3. Embedding for Recommendation
To demonstrate the performance of DMGE in the recommendation task (RQ1), we compare DMGE with other state-of-the-art embedding methods. The intuition is that better item embeddings will lead to better recommendation performance.
Generally, users’ preferences can be characterized by the items they have interacted with, so we represent users by aggregating the embeddings of their interacted items. There are several ways to aggregate item embeddings, such as averaging (Zhao et al., 2018) and RNNs (Okura et al., 2017). We use averaging here, representing each user by the average embedding of the items they have interacted with:

u_k = \frac{1}{|I_u|} \sum_{i \in I_u} z_i^k, \quad (17)

where $u_k$ is the embedding of user $u$ in domain $k$, $I_u$ is the set of items user $u$ has interacted with, and $z_i^k$ is the embedding of item $i$ in domain $k$.
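The user aggregation and the similarity-based candidate scoring can be sketched as follows (a minimal NumPy illustration with hypothetical names, not the paper’s implementation):

```python
import numpy as np

def user_embedding(item_ids, item_emb):
    """Represent a user as the average of the embeddings of the items
    the user has interacted with in one domain."""
    return np.mean([item_emb[i] for i in item_ids], axis=0)

def cosine_similarity(u, v):
    """Cosine similarity used to rank candidate items for a user."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# toy item embeddings in one domain
item_emb = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0])}
u = user_embedding([0, 1], item_emb)
```

Ranking all items by this similarity to the user vector yields the candidate top-$n$ list evaluated below.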
For each domain, we measure user-item similarity by computing the cosine similarity between user embeddings and item embeddings. Based on the user-item similarity, we then generate the candidate top-$n$ items for each user. We use 26 consecutive days of data to train the item embeddings, and measure the recommendation performance in the next 5 days using the metrics Recall@$n$ and MRR@$n$. The performance of the different methods in the recommendation domain and the search domain is presented in Tables 2, 3, 4 and 5. (The best results are indicated in bold.)
Based on the results, we have the following observations:

We first compare the performance of the single-domain methods: MF, DeepWalk, LINE, node2vec and GCN. We observe that the graph embedding methods outperform MF, as MF only takes into account the explicit user-item interactions, while ignoring the item co-occurrences in users’ behaviors, which the graph embedding methods can capture.

The overall performance of the cross-domain methods (i.e., mGCN, DMGE ($\alpha$), DMGE) is better than that of the single-domain methods, which demonstrates that fusing information from correlated domains helps to learn better cross-domain representations and improves recommendation in both domains. When $n$ is less than 40 in the recommendation domain and less than 30 in the search domain, the Recall of mGCN is worse than that of the single-domain methods; the possible reason is that the weight between within-domain and across-domain information in mGCN is a hyper-parameter that must be tuned and cannot be adapted to the importance of each domain. Both DMGE and DMGE ($\alpha$) consistently outperform the single-domain methods.

Comparing the cross-domain embedding methods, both DMGE ($\alpha$) and DMGE outperform mGCN, which indicates that our proposed graph neural network learns better representations.

DMGE outperforms DMGE ($\alpha$) (except on Recall@1000 in the recommendation domain). The average of $\hat{\alpha}$ in DMGE is 0.4409, so DMGE ($\alpha$) with $\alpha = 0.5$ can also achieve good performance. However, for DMGE ($\alpha$), tuning the hyper-parameter $\alpha$ to obtain the optimal result is time-consuming and computationally expensive, whereas in DMGE, $\hat{\alpha}$ is computed adaptively during training. Thus, we recommend using the multiple-gradient descent optimizer to train the model.
Overall, the proposed DMGE outperforms the state-of-the-art embedding methods, and improves the performance of recommendation in both domains.
Domain  Recall@  10  20  30  40  50  60  70  80  90  100  1000 

Single  MF  0.0301  0.0453  0.0565  0.0658  0.0739  0.0812  0.0880  0.0942  0.1002  0.1065  0.2932 
DeepWalk  0.0730  0.1104  0.1363  0.1558  0.1720  0.1853  0.1975  0.2082  0.2186  0.2273  0.4744  
LINE  0.0471  0.0728  0.0933  0.1106  0.1258  0.1395  0.1525  0.1642  0.1754  0.1861  0.4977  
node2vec  0.0345  0.0579  0.0773  0.0936  0.1080  0.1207  0.1324  0.1436  0.1534  0.1630  0.4574  
GCN  0.0743  0.1078  0.1317  0.1516  0.1688  0.1848  0.1977  0.2098  0.2217  0.2321  0.5624  
Cross  mGCN  0.0431  0.0835  0.1273  0.1677  0.2002  0.2142  0.2261  0.2383  0.2505  0.2627  0.6323 
DMGE ($\alpha$)  0.1019  0.1607  0.2069  0.2436  0.2762  0.3035  0.3260  0.3471  0.3660  0.3826  0.7016  
DMGE  0.1024  0.1661  0.2109  0.2455  0.2767  0.3042  0.3277  0.3484  0.3669  0.3831  0.6929  
Domain  Recall@  10  20  30  40  50  60  70  80  90  100  1000 

Single  MF  0.0150  0.0251  0.0335  0.0408  0.0474  0.0533  0.0589  0.0642  0.0691  0.0739  0.2679 
DeepWalk  0.0638  0.1043  0.1338  0.1571  0.1761  0.1924  0.2064  0.2185  0.2291  0.2387  0.4676  
LINE  0.0392  0.0603  0.0769  0.0909  0.1030  0.1138  0.1238  0.1331  0.1414  0.1495  0.4293  
node2vec  0.0289  0.0471  0.0622  0.0753  0.0870  0.0982  0.1082  0.1174  0.1260  0.1346  0.4176  
GCN  0.0499  0.0764  0.0956  0.1111  0.1242  0.1363  0.1472  0.1575  0.1666  0.1754  0.4905  
Cross  mGCN  0.0478  0.0939  0.1454  0.1938  0.2218  0.2328  0.2399  0.2480  0.2565  0.2653  0.5920 
DMGE ($\alpha$)  0.0823  0.1363  0.1784  0.2134  0.2415  0.2652  0.2857  0.3037  0.3206  0.3360  0.6254  
DMGE  0.0885  0.1467  0.1900  0.2238  0.2517  0.2759  0.2971  0.3162  0.3328  0.3473  0.6263  
Domain  MRR@  10  20  30  40  50  60  70  80  90  100  1000 

Single  MF  0.0149  0.0170  0.0180  0.0185  0.0188  0.0191  0.0193  0.0194  0.0196  0.0197  0.0208 
DeepWalk  0.0510  0.0549  0.0563  0.0571  0.0575  0.0578  0.0581  0.0582  0.0584  0.0585  0.0594  
LINE  0.0371  0.0398  0.0410  0.0417  0.0421  0.0424  0.0427  0.0429  0.0431  0.0432  0.0444  
node2vec  0.0265  0.0290  0.0302  0.0308  0.0312  0.0315  0.0317  0.0319  0.0321  0.0322  0.0334  
GCN  0.0558  0.0592  0.0606  0.0613  0.0619  0.0622  0.0625  0.0627  0.0628  0.0630  0.0642  
Cross  mGCN  0.0264  0.0311  0.0338  0.0354  0.0364  0.0367  0.0370  0.0372  0.0373  0.0375  0.0389 
DMGE ($\alpha$)  0.0697  0.0756  0.0780  0.0793  0.0801  0.0807  0.0811  0.0814  0.0816  0.0817  0.0829  
DMGE  0.0699  0.0761  0.0785  0.0797  0.0805  0.0810  0.0814  0.0817  0.0819  0.0821  0.0832  
Domain  MRR@  10  20  30  40  50  60  70  80  90  100  1000 

Single  MF  0.0120  0.0134  0.0141  0.0145  0.0148  0.0150  0.0151  0.0152  0.0153  0.0154  0.0164 
DeepWalk  0.0468  0.0515  0.0534  0.0543  0.0549  0.0553  0.0556  0.0558  0.0560  0.0561  0.0571  
LINE  0.0345  0.0372  0.0384  0.0390  0.0395  0.0398  0.0400  0.0402  0.0403  0.0405  0.0417  
node2vec  0.0237  0.0260  0.0271  0.0278  0.0283  0.0286  0.0289  0.0290  0.0292  0.0293  0.0307  
GCN  0.0405  0.0437  0.0450  0.0457  0.0462  0.0465  0.0468  0.0470  0.0471  0.0473  0.0486  
Cross  mGCN  0.0324  0.0382  0.0415  0.0435  0.0443  0.0446  0.0448  0.0449  0.0450  0.0452  0.0465 
DMGE ($\alpha$)  0.0592  0.0653  0.0678  0.0691  0.0700  0.0705  0.0709  0.0712  0.0714  0.0716  0.0728  
DMGE  0.0629  0.0693  0.0718  0.0731  0.0739  0.0744  0.0748  0.0751  0.0753  0.0755  0.0766  
3.4. Parameter Sensitivity
The key parameter that affects the performance of embedding is the dimension size (RQ2), so we analyze how the dimension of the learned embeddings in DMGE affects recommendation performance. We test a range of dimension sizes. Figure 3 shows the results for different embedding dimensions in the recommendation and search domains, with Recall@100 as the evaluation metric. As shown in Figure 3, in both domains our model performs best with respect to Recall@100 when the embedding dimension is 16. Therefore, we set the embedding dimension to 16.
3.5. Link Prediction
To demonstrate the performance of DMGE in the link prediction task (RQ3), we compare DMGE with other state-of-the-art graph embedding methods. The intuition is that better node embeddings will lead to better link prediction performance.
In the multi-graph, we perform link prediction on each subgraph separately. In each subgraph, we randomly remove 30% of the edges, and aim to predict whether these removed edges exist. We formulate link prediction as a binary classification problem over the embeddings of two nodes, with two ways of combining them: element-wise addition and element-wise multiplication.
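The embedding combination step can be sketched as follows (a minimal illustration with names of our own; the resulting edge features would then be fed to a logistic regression classifier):

```python
import numpy as np

def edge_features(Z, pairs, combine="mul"):
    """Turn node-embedding pairs into edge feature vectors for the
    link-prediction classifier: element-wise addition or element-wise
    multiplication of the two endpoints' embeddings."""
    op = {"add": np.add, "mul": np.multiply}[combine]
    return np.stack([op(Z[i], Z[j]) for i, j in pairs])

# toy embeddings of two nodes, combined for the edge (0, 1)
F = edge_features(np.array([[1.0, 2.0], [3.0, 4.0]]), [(0, 1)], "mul")
```

Element-wise multiplication emphasizes dimensions where both endpoints are active, while addition keeps information from either endpoint; the better combination is selected per method.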
For the training set, we use the remaining node pairs as positive samples, and randomly sample an equal number of unconnected node pairs as negative samples. For the testing set, we use the removed node pairs as positive samples, and again sample an equal number of unconnected node pairs as negative samples. We train a logistic regression classifier on the training set, and evaluate link prediction performance on the testing set. For each method, we select the optimal combination of embeddings and present the best results. The results of the different methods are presented in Figure 4, including the results for each relation and the average performance over all dimensions.

Based on the results, we have the following observations:

The multi-graph embedding methods (i.e., mGCN, DMGE ($\alpha$), DMGE) outperform the single-graph embedding methods (i.e., DeepWalk, LINE, node2vec, GCN), which indicates that using the multiple relations in the multi-graph helps to learn better representations.

DMGE ($\alpha$) and DMGE outperform mGCN, which indicates that our proposed graph neural network learns better representations.

The average performance of DMGE is better than that of DMGE ($\alpha$), which indicates the effectiveness of training the model with multi-objective optimization.
3.6. Discussion
3.6.1. The Usage of Embedding
The embeddings learned by DMGE can be used for candidate item generation in the recall stage. By calculating pairwise similarities between the embeddings of users and items, we can generate a candidate set of items that users may like, which can be further used in the ranking stage to generate the final recommendation set (Covington et al., 2016). Besides, the embeddings can also be used for transfer learning (Ni et al., 2018) and for alleviating the sparsity and cold-start problems (Zhao et al., 2018).

3.6.2. Cross-Domain Representation Learning
3.6.3. Scalability
The graph convolutional layers in DMGE adopt the graph convolution operator defined in GCN (Kipf and Welling, 2017). However, GCN requires the full graph Laplacian, so it is computationally expensive to apply GCN to large-scale graph embedding.
To apply DMGE to large-scale multi-graph embedding, we have the following strategies: 1) we can adopt GraphSAGE (Hamilton et al., 2017a) as the graph convolutional layers in DMGE, as GraphSAGE generates embeddings by sampling and aggregating features from a node’s local neighborhood, and only requires local graph structure; 2) we can replace the graph convolutional layers in DMGE with the graph attentional layers presented in GAT (Veličković et al., 2018), as GAT is computationally efficient and parallelizable across all nodes in the graph, and does not require the entire graph structure upfront.
4. Related Work
4.1. Embedding Methods
Representation learning (Bengio et al., 2013) is one of the most fundamental problems in deep learning. As a practical application, effective embeddings have proved useful and achieved significant improvements in recommender systems (RSs), including e-commerce (Zhao et al., 2018; Wang et al., 2018), search ranking (Chu et al., 2018; Grbovic and Cheng, 2018) and social media (Ying et al., 2018).

The embedding methods in RSs can be divided into two categories: word-embedding-based methods and graph-embedding-based methods. The word-embedding-based methods (Zhao et al., 2018; Grbovic and Cheng, 2018) learn embeddings by modeling item co-occurrence in users' behavior sequences. Specifically, they treat items as words and users' behavior sequences as sentences, and apply word embedding methods (Mikolov et al., 2013a; Mikolov et al., 2013b) to represent items in a low-dimensional space. The graph-embedding-based methods (Ying et al., 2018; Wang et al., 2018) construct an item graph from users' behaviors, modeling items as nodes and item co-occurrences as edges, and apply graph embedding methods (Hamilton et al., 2017b; Cui et al., 2018; Perozzi et al., 2014; Hamilton et al., 2017a) to learn embeddings. However, these methods are designed to learn embeddings in a single domain and fail to learn effective cross-domain embeddings. Although several cross-domain recommendation methods exist (Man et al., 2017), they aim to improve recommendation in the target domain by transferring information from a source domain. In our work, we adopt a graph neural network to learn more effective cross-domain embeddings that benefit all domains.
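As an illustration of the word-embedding-based approach, the following self-contained sketch trains skip-gram embeddings with negative sampling directly on toy behavior sequences. The item IDs, dimensions, and hyperparameters are invented for the example; a production system would use an optimized word2vec implementation instead.

```python
import numpy as np

rng = np.random.default_rng(42)

# Users' behavior sequences, treated as "sentences" whose words are item IDs.
sequences = [["i1", "i2", "i3"], ["i2", "i3", "i4"], ["i1", "i3", "i4"]]
vocab = sorted({item for seq in sequences for item in seq})
idx = {item: i for i, item in enumerate(vocab)}

dim, window, n_neg, lr = 8, 1, 2, 0.05
W_in = rng.normal(scale=0.1, size=(len(vocab), dim))   # item ("word") embeddings
W_out = rng.normal(scale=0.1, size=(len(vocab), dim))  # context embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(200):
    for seq in sequences:
        for pos, item in enumerate(seq):
            for ctx in seq[max(0, pos - window):pos + window + 1]:
                if ctx == item:
                    continue
                i, j = idx[item], idx[ctx]
                # One positive context pair plus a few random negatives.
                pairs = [(j, 1.0)] + [(int(rng.integers(len(vocab))), 0.0)
                                      for _ in range(n_neg)]
                for t, label in pairs:
                    score = sigmoid(W_in[i] @ W_out[t])
                    grad = score - label
                    step_in = lr * grad * W_out[t].copy()  # save before update
                    W_out[t] -= lr * grad * W_in[i]
                    W_in[i] -= step_in

embeddings = {item: W_in[idx[item]] for item in vocab}
print(len(embeddings), embeddings["i1"].shape)  # 4 (8,)
```

The graph-embedding-based methods differ mainly in how they generate the training pairs: instead of sliding a window over raw sequences, they sample them from random walks or neighborhoods on the constructed item graph.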
4.2. Graph Neural Networks
Graph neural networks (GNNs) (Zhou et al., 2018; Xu et al., 2019) have recently emerged as a powerful approach for representation learning on graphs; representative models include GCN (Kipf and Welling, 2017) and GraphSAGE (Hamilton et al., 2017a). Through a recursive neighborhood aggregation scheme, GNNs generate node embeddings by aggregating the features of neighbors. In this part, we focus on reviewing convolution-based GNNs, which can be categorized into spectral approaches and non-spectral approaches.
The spectral approaches rely on the theory of spectral graph convolutions. Bruna et al. (Bruna et al., 2014) first propose a generalization of convolutional neural networks (CNNs) to graphs; however, it is computationally expensive. Defferrard et al. (Defferrard et al., 2016) design localized convolutional filters on graphs based on spectral graph theory, which is more computationally efficient. Kipf et al. (Kipf and Welling, 2017) limit the layer-wise convolution to a first-order approximation to avoid overfitting, and propose the graph convolutional network (GCN), which encodes both local graph structure and node features through layer-wise propagation. The non-spectral approaches operate spatial convolutions directly on the graph. Hamilton et al. (Hamilton et al., 2017a) propose GraphSAGE, which generates node embeddings by sampling and aggregating features from a node's local neighborhood and can be applied to large-scale graph embedding. However, these GNNs are developed for single-graph embedding and fail to learn effective multi-graph embedding.
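For reference, the layer-wise propagation rule of GCN can be sketched in a few lines of NumPy; the toy graph, features, and weights below are illustrative only.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation step, following Kipf & Welling (2017):
    H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt  # symmetric normalization
    return np.maximum(A_norm @ H @ W, 0.0)    # linear projection + ReLU

# Toy path graph on 3 nodes, one-hot input features, 2-dim output.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
H = np.eye(3)
W = np.full((3, 2), 0.5)
print(gcn_layer(A, H, W).shape)  # (3, 2)
```

Note that forming the normalized adjacency requires the whole graph, which is exactly the scalability limitation discussed in Section 3.6.3.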
5. Conclusion
In this paper, we focus on learning effective cross-domain representations. We propose the Deep Multi-Graph Embedding (DMGE) model, a multi-graph neural network based on multi-task learning. We construct item graphs from users' behaviors in different domains to form a multi-graph, and then design a graph neural network to learn multi-graph embeddings in an unsupervised manner. In particular, we introduce a multiple-gradient descent optimizer for efficiently training the model. We evaluate our approach on various large-scale real-world datasets, and the experimental results show that DMGE outperforms other state-of-the-art embedding methods on various tasks.
6. Acknowledgments
This work was partially supported by the National Key R&D Program of China (2017YFB1001800) and the National Natural Science Foundation of China (No. 61772428, 61725205).
References
 Bengio et al. (2013) Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35, 8 (2013), 1798–1828.
 Bruna et al. (2014) Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. 2014. Spectral networks and locally connected networks on graphs. In International Conference on Learning Representations. 1–14.
 Caruana (1997) Rich Caruana. 1997. Multitask learning. Machine learning 28, 1 (1997), 41–75.
 Cheng et al. (2016) Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. ACM, 7–10.
 Chu et al. (2018) Chen Chu, Zhao Li, Beibei Xin, Fengchao Peng, Chuanren Liu, Remo Rohs, Qiong Luo, and Jingren Zhou. 2018. Deep Graph Embedding for Ranking Optimization in E-commerce. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. ACM, 2007–2015.
 Covington et al. (2016) Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM conference on recommender systems. ACM, 191–198.
 Cui et al. (2018) Peng Cui, Xiao Wang, Jian Pei, and Wenwu Zhu. 2018. A survey on network embedding. IEEE Transactions on Knowledge and Data Engineering (2018).
 Defferrard et al. (2016) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems. 3844–3852.
 Désidéri (2012) Jean-Antoine Désidéri. 2012. Multiple-gradient descent algorithm (MGDA) for multiobjective optimization. Comptes Rendus Mathematique 350, 5–6 (2012), 313–318.
 Frank and Wolfe (1956) Marguerite Frank and Philip Wolfe. 1956. An algorithm for quadratic programming. Naval Research Logistics Quarterly 3, 1–2 (1956), 95–110.
 Grbovic and Cheng (2018) Mihajlo Grbovic and Haibin Cheng. 2018. Real-time personalization using embeddings for search ranking at Airbnb. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 311–320.
 Grover and Leskovec (2016) Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 855–864.

 Guo et al. (2017) Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: a factorization-machine based neural network for CTR prediction. In Proceedings of the 26th International Joint Conference on Artificial Intelligence. 1725–1731.
 Hamilton et al. (2017a) Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017a. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems. 1024–1034.
 Hamilton et al. (2017b) William L. Hamilton, Rex Ying, and Jure Leskovec. 2017b. Representation Learning on Graphs: Methods and Applications. IEEE Data Engineering Bulletin 40, 3 (2017), 52–74.

Hidasi et al. (2016)
Balázs Hidasi,
Alexandros Karatzoglou, Linas Baltrunas,
and Domonkos Tikk. 2016.
Sessionbased recommendations with recurrent neural networks. In
International Conference on Learning Representations. 1–10.  Jaggi (2013) Martin Jaggi. 2013. Revisiting FrankWolfe: ProjectionFree Sparse Convex Optimization. In International Conference on Machine Learning. 427–435.
 Kipf and Welling (2017) Thomas N Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations. 1–14.
 Koren et al. (2009) Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 8 (2009), 30–37.
 Kuhn and Tucker (1951) HW Kuhn and AW Tucker. 1951. Nonlinear Programming. In Second Berkeley Symposium on Mathematical Statistics and Probability. 481–492.

Li et al. (2018)
Qimai Li, Zhichao Han,
and XiaoMing Wu. 2018.
Deeper insights into graph convolutional networks for semisupervised learning. In
ThirtySecond AAAI Conference on Artificial Intelligence.  Ma et al. (2019) Yao Ma, Suhang Wang, Charu C Aggarwal, Dawei Yin, and Jiliang Tang. 2019. Multidimensional Graph Convolutional Networks. In Proceedings of the 2019 SIAM International Conference on Data Mining.
 Man et al. (2017) Tong Man, Huawei Shen, Xiaolong Jin, and Xueqi Cheng. 2017. Cross-Domain Recommendation: An Embedding and Mapping Approach. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence. 2464–2470.

 Mikolov et al. (2013a) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In Workshop on International Conference on Learning Representations. 1–11.
 Mikolov et al. (2013b) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. 3111–3119.
 Ni et al. (2018) Yabo Ni, Dan Ou, Shichen Liu, Xiang Li, Wenwu Ou, Anxiang Zeng, and Luo Si. 2018. Perceive Your Users in Depth: Learning Universal User Representations from Multiple Ecommerce Tasks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 596–605.
 Okura et al. (2017) Shumpei Okura, Yukihiro Tagami, Shingo Ono, and Akira Tajima. 2017. Embeddingbased news recommendation for millions of users. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1933–1942.
 Perozzi et al. (2014) Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 701–710.
 Ruder (2017) Sebastian Ruder. 2017. An overview of multitask learning in deep neural networks. arXiv preprint arXiv:1706.05098 (2017).
 Tang et al. (2015) Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. Line: Large-scale information network embedding. In Proceedings of the 24th international conference on world wide web. International World Wide Web Conferences Steering Committee, 1067–1077.
 Veličković et al. (2018) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph attention networks. In International Conference on Learning Representations. 1–12.
 Wang et al. (2018) Jizhe Wang, Pipei Huang, Huan Zhao, Zhibo Zhang, Binqiang Zhao, and Dik Lun Lee. 2018. Billion-scale commodity embedding for e-commerce recommendation in Alibaba. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 839–848.
 Wang et al. (2019) Yaqing Wang, Chunyan Feng, Caili Guo, Yunfei Chu, and Jenq-Neng Hwang. 2019. Solving the Sparsity Problem in Recommendations via Cross-Domain Item Embedding Based on Co-Clustering. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining. ACM, 717–725.
 Xu et al. (2019) Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2019. How Powerful are Graph Neural Networks?. In International Conference on Learning Representations. 1–17.
 Ying et al. (2018) Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. 2018. Graph convolutional neural networks for webscale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 974–983.
 Zafarani and Liu (2009) R. Zafarani and H. Liu. 2009. Social Computing Data Repository at ASU. http://socialcomputing.asu.edu
 Zhao et al. (2018) Kui Zhao, Yuechuan Li, Zhaoqian Shuai, and Cheng Yang. 2018. Learning and Transferring IDs Representation in E-commerce. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1031–1039.
 Zhou et al. (2018) Jie Zhou, Ganqu Cui, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, and Maosong Sun. 2018. Graph Neural Networks: A Review of Methods and Applications. arXiv preprint arXiv:1812.08434 (2018).
 Zhuang et al. (2017) Fuzhen Zhuang, Yingmin Zhou, Fuzheng Zhang, Xiang Ao, Xing Xie, and Qing He. 2017. Sequential Transfer Learning: Cross-domain Novelty Seeking Trait Mining for Recommendation. In Proceedings of the 26th International Conference on World Wide Web Companion. International World Wide Web Conferences Steering Committee, 881–882.