In the era of big data, graphs provide a general representation for many different types of inter-connected data collected from various disciplines. Besides the attributes possessed by individual nodes, the extensive connections among the nodes convey complex yet important information. Graph data are difficult to deal with because of their various shapes (e.g., small brain graphs vs. giant online social networks), complex structures (containing various kinds of nodes and extensive connections) and diverse attributes (attached to the nodes and links). Traditional machine learning algorithms usually take feature vectors as input, so they cannot handle graph data directly. Viewed from this perspective, learning feature vector representations of graph data is an important problem.
The graph data studied in research can generally be categorized into two main types, i.e., small graphs vs. giant networks, which differ considerably in size, number of instances and label annotation.
The small graphs we study each contain few nodes but come in a large number of instances, and each graph instance is annotated with certain labels. Representative examples include human brain graphs, molecular graphs and real-estate community graphs, whose nodes (usually only hundreds) represent brain regions, atoms and POIs (points of interest), respectively.
In contrast, giant networks usually involve a large number of nodes and links, but there is only one single network instance, and individual nodes are labeled instead of the network. Examples of giant networks include social networks (e.g., Facebook), eCommerce networks (e.g., Amazon) and bibliographic networks (e.g., DBLP), which all contain millions or even billions of nodes.
Due to these differences in properties, the representation learning algorithms proposed for small graphs and giant networks are very different. To solve the small graph oriented problems, the existing graph neural networks focus on learning a representation of the whole graph (not the individual nodes) based on the graph structure and attribute information. Several graph neural network models have been introduced already, including IsoNN (Isomorphic Neural Network), SDBN (Structural Deep Brain Network) and LF&ER (Deep Autoencoder based Latent Feature Extraction). These models are proposed for different small graph oriented application scenarios, covering both brain graphs and community POI graphs, and can also be applied to other application settings with minor extensions.
Meanwhile, for giant networks, many recent research works apply graph neural networks to learn low-dimensional feature representations, where each node is represented as a feature vector. With these learned node representations, the graph neural network model can directly infer the potential labels of the nodes/links. To achieve such objectives, several different types of graph neural network models have been introduced, including GCN (Graph Convolutional Network), GAT (Graph Attention Network), DifNN (Deep Diffusive Neural Network), GNL (Graph Neural Lasso), GraphSage (Graph Sample and Aggregate) and seGEN (Sample and Ensemble Genetic Evolutionary Network).
In this paper, we will introduce the aforementioned graph neural networks proposed for small graphs and giant networks, respectively. This tutorial paper will be updated accordingly as we observe the latest developments on this topic.
2 Graph Neural Networks for Small Graphs
In this section, we will introduce the graph neural networks proposed for the representation learning tasks on small graphs. Formally, we can represent the set of small graphs to be studied in this section as $\mathcal{G} = \{(g_i, \mathbf{y}_i)\}_i$, where $g_i$ denotes a small graph instance and $\mathbf{y}_i$ denotes its label vector. Given a graph $g_i = (\mathcal{V}_{g_i}, \mathcal{E}_{g_i})$, we can denote its network size as the number of involved nodes, i.e., $|\mathcal{V}_{g_i}|$. Normally, the small graphs to be studied in set $\mathcal{G}$ are of the same size. Meanwhile, depending on the application scenario, the objective labels of the graph instances can be binary-class or multi-class vectors. The small graph oriented graph neural networks aim at learning a mapping, i.e., $f: \mathcal{G} \to \mathbb{R}^{d_h}$, to project the graph instances to their feature vector representations, which will be further utilized to infer their corresponding labels. Specifically, the graph neural network models to be introduced in this section include IsoNN, SDBN and LF&ER. Readers are encouraged to refer to the original papers for detailed information when reading this tutorial paper.
2.1 IsoNN: Isomorphic Neural Network
The graph isomorphic neural network (IsoNN), proposed recently, aims at extracting meaningful sub-graph patterns from the input graph for representation learning. Sub-graph mining techniques have been demonstrated to be effective for feature extraction in existing works. Instead of designing the sub-graph templates manually, IsoNN integrates sub-graph based feature extraction into the neural network framework for automatic feature representation learning. As illustrated in Figure 1, IsoNN includes two main components: a graph isomorphic feature extraction component and a classification component. IsoNN can adopt a deeper architecture by stacking multiple graph isomorphic feature extraction components, so that the model learns more complex sub-graph patterns.
The graph isomorphic feature extraction component in IsoNN targets automatic sub-graph pattern learning and brain graph feature extraction with the following three layers: the graph isomorphic layer, min-pooling layer 1 and min-pooling layer 2, which are introduced below. Meanwhile, the classification component used in IsoNN involves several fully connected layers, which project the learned isomorphic features to the corresponding graph labels.
2.1.1 Graph Isomorphic Layer
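In IsoNN, the sub-graph based feature extraction process is achieved by a novel graph isomorphic layer. Formally, given a brain graph $G = (\mathcal{V}, \mathcal{E})$, its adjacency matrix can be represented as $\mathbf{A} \in \mathbb{R}^{|\mathcal{V}| \times |\mathcal{V}|}$. In order to detect specific sub-graph patterns in the input graph, IsoNN matches the input graph with a set of sub-graph templates. Instead of defining these sub-graph templates manually as in existing works, each template is denoted as a kernel variable $\mathbf{K}_i \in \mathbb{R}^{k \times k}$, $\forall i \in \{1, 2, \cdots, c\}$, and IsoNN learns these kernel variables automatically. Here, $k$ denotes the node number in the templates and $c$ is the channel number. Meanwhile, to match one template (i.e., the kernel variable matrix $\mathbf{K}_i$) with regions in the input graph (i.e., sub-matrices in $\mathbf{A}$), IsoNN uses a set of permutation matrices, which map both rows and columns of the kernel matrix to the sub-matrices in $\mathbf{A}$. A permutation matrix can be represented as $\mathbf{P} \in \{0, 1\}^{k \times k}$, which shares the same dimensions with the kernel matrix. Given a kernel variable matrix $\mathbf{K}_i$ and a regional sub-matrix $\mathbf{M}_{(s,t)} \in \mathbb{R}^{k \times k}$ in $\mathbf{A}$ (where $\mathbf{M}_{(s,t)} = \mathbf{A}(s:s+k-1, t:t+k-1)$ and index pair $s, t \in \{1, 2, \cdots, |\mathcal{V}|-k+1\}$), there may exist $k!$ different such permutation matrices and the optimal one can be denoted as $\mathbf{P}^*$:

$$\mathbf{P}^* = \arg\min_{\mathbf{P} \in \mathcal{P}} \left\| \mathbf{P} \mathbf{K}_i \mathbf{P}^\top - \mathbf{M}_{(s,t)} \right\|_F^2,$$

where $\mathcal{P} = \{\mathbf{P}_1, \mathbf{P}_2, \cdots, \mathbf{P}_{k!}\}$ covers all the potential permutation matrices. The F-norm term measures the mapping loss, which is also used as the graph isomorphic feature in IsoNN. Formally, the isomorphic feature extracted based on the kernel $\mathbf{K}_i$ for the regional sub-matrix $\mathbf{M}_{(s,t)}$ in $\mathbf{A}$ can be represented as

$$z_{i,(s,t)} = \left\| \mathbf{P}^* \mathbf{K}_i (\mathbf{P}^*)^\top - \mathbf{M}_{(s,t)} \right\|_F^2 = \min_{j \in \{1, \cdots, k!\}} \bar{\mathbf{z}}_{i,(s,t)}(j),$$

where vector $\bar{\mathbf{z}}_{i,(s,t)} \in \mathbb{R}^{k!}$ with $\bar{\mathbf{z}}_{i,(s,t)}(j) = \left\| \mathbf{P}_j \mathbf{K}_i \mathbf{P}_j^\top - \mathbf{M}_{(s,t)} \right\|_F^2$ denoting the features computed on the permutation matrix $\mathbf{P}_j \in \mathcal{P}$. Furthermore, by shifting the kernel matrix $\mathbf{K}_i$ on regional sub-matrices in $\mathbf{A}$, the isomorphic features extracted by IsoNN from the input graph can be denoted as a 3-way tensor $\bar{\mathcal{Z}}_i \in \mathbb{R}^{k! \times (|\mathcal{V}|-k+1) \times (|\mathcal{V}|-k+1)}$, where $\bar{\mathcal{Z}}_i(1:k!, s, t) = \bar{\mathbf{z}}_{i,(s,t)}$.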
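The matching step of the graph isomorphic layer can be illustrated with a brute-force sketch. This is only a minimal NumPy illustration under stated assumptions, not the IsoNN implementation: the function name `isomorphic_features` is hypothetical, the kernel is a fixed array rather than a learned variable, and all permutations of the kernel are enumerated explicitly, which is only practical for very small templates.

```python
import itertools
import numpy as np

def isomorphic_features(A, K):
    """Squared Frobenius matching loss for every permutation and every region.

    Returns an array of shape (k!, n-k+1, n-k+1), where entry [j, s, t] is the
    loss between the j-th permuted kernel and the k-by-k region of A whose
    top-left corner is (s, t).
    """
    n, k = A.shape[0], K.shape[0]
    perms = list(itertools.permutations(range(k)))
    m = n - k + 1
    Z_bar = np.empty((len(perms), m, m))
    for j, perm in enumerate(perms):
        idx = list(perm)
        # Permuting rows and columns together corresponds to P K P^T.
        K_perm = K[np.ix_(idx, idx)]
        for s in range(m):
            for t in range(m):
                region = A[s:s + k, t:t + k]
                Z_bar[j, s, t] = np.sum((K_perm - region) ** 2)
    return Z_bar

# Toy example: a directed 2-node template matched against a 2-node graph.
K = np.array([[0.0, 1.0], [0.0, 0.0]])
A = np.array([[0.0, 0.0], [1.0, 0.0]])
Z_bar = isomorphic_features(A, K)
```

In the toy example the identity permutation does not match the region, while swapping the two template nodes matches it exactly, so the smallest loss over permutations is zero.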
2.1.2 Min-pooling Layers
Min-pooling Layer 1: As indicated in Figure 1, IsoNN computes the final isomorphic features with the optimal permutation matrix for the kernel $\mathbf{K}_i$ via two steps: (1) computing all the potential isomorphic features via different permutation matrices with the graph isomorphic layer, and (2) identifying the optimal features with the min-pooling layer 1 and layer 2. Formally, given the tensor $\bar{\mathcal{Z}}_i$ computed by $\mathbf{K}_i$ in the graph isomorphic layer, IsoNN identifies the optimal permutation matrices via the min-pooling layer 1. From tensor $\bar{\mathcal{Z}}_i$, the features computed with the optimal permutation matrices can be denoted as $\mathbf{Z}_i$, where

$$\mathbf{Z}_i(s, t) = \min \left\{ \bar{\mathcal{Z}}_i(1:k!, s, t) \right\}.$$

The min-pooling layer 1 learns the optimal feature matrix $\mathbf{Z}_i$ for kernel $\mathbf{K}_i$ along the first dimension of tensor $\bar{\mathcal{Z}}_i$, whose entries are computed by the optimal permutation matrices. In a similar way, for the remaining kernels, their optimal graph isomorphic features can be obtained and denoted as matrices $\mathbf{Z}_1$, $\mathbf{Z}_2$, $\cdots$, $\mathbf{Z}_c$, respectively.
Min-pooling Layer 2: For the same region in the input graph, different kernels can be applied to match the regional sub-matrix. Inspired by this, IsoNN incorporates the min-pooling layer 2, so that the model can find the best kernels that match the regions in $\mathbf{A}$. With inputs $\mathbf{Z}_1$, $\mathbf{Z}_2$, $\cdots$, $\mathbf{Z}_c$, the min-pooling layer 2 in IsoNN can identify the optimal features across all the kernels, which can be denoted as matrix $\mathbf{Q}$ with

$$\mathbf{Q}(s, t) = \min \left\{ \mathbf{Z}_1(s, t), \mathbf{Z}_2(s, t), \cdots, \mathbf{Z}_c(s, t) \right\}.$$

Entry $\mathbf{Q}(s, t)$ denotes the graph isomorphic feature computed by the best sub-graph kernel on the regional matrix $\mathbf{M}_{(s,t)}$ in $\mathbf{A}$. Thus, via min-pooling layer 2, we obtain $\mathbf{Q}$ as the final isomorphic feature matrix, which preserves the best sub-graph patterns contributing to the classification result. In addition, min-pooling layer 2 also effectively shrinks the feature length and greatly reduces the number of variables to be learned in the following classification component.
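The two min-pooling steps reduce to minima along different axes. A minimal sketch follows, assuming each kernel has already produced a tensor of matching losses with the permutation index on the first axis; the function names `min_pool_1` and `min_pool_2` are hypothetical.

```python
import numpy as np

def min_pool_1(Z_bar):
    """Keep, for each region, the loss of the best permutation (first axis)."""
    return Z_bar.min(axis=0)

def min_pool_2(Z_list):
    """Keep, for each region, the loss of the best kernel across all kernels."""
    return np.stack(Z_list).min(axis=0)

# Two kernels, two permutations each, on a 1-by-2 grid of regions.
Z_bar_1 = np.array([[[3.0, 1.0]], [[2.0, 5.0]]])
Z_bar_2 = np.array([[[0.0, 4.0]], [[7.0, 9.0]]])
Q = min_pool_2([min_pool_1(Z_bar_1), min_pool_1(Z_bar_2)])
```

Because both steps are plain minima, their composition equals a single minimum over all (kernel, permutation) pairs; keeping them separate mirrors the two layers described in the text.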
2.1.3 Classification Component
Given a brain graph instance $(G, \mathbf{y}) \in \mathcal{T}$ ($\mathcal{T}$ denotes the training batch), its extracted isomorphic feature matrix can be denoted as $\mathbf{Q}$. By feeding its flattened vector representation $\mathbf{q}$ as the input into the classification component (with three fully-connected layers), the predicted label vector by IsoNN on the instance can be represented as $\hat{\mathbf{y}}$. Several frequently used loss functions, e.g., cross-entropy, can be used to measure the introduced loss between $\hat{\mathbf{y}}$ and the ground-truth label vector $\mathbf{y}$. Formally, the fully-connected (FC) layers and the loss function used in IsoNN can be represented as follows:

$$\mathbf{d}_1 = \sigma(\mathbf{W}_1 \mathbf{q} + \mathbf{b}_1), \quad \mathbf{d}_2 = \sigma(\mathbf{W}_2 \mathbf{d}_1 + \mathbf{b}_2), \quad \hat{\mathbf{y}} = \hat{\sigma}(\mathbf{W}_3 \mathbf{d}_2 + \mathbf{b}_3), \quad \ell(\hat{\mathbf{y}}, \mathbf{y}) = -\sum_{j} \mathbf{y}(j) \log \hat{\mathbf{y}}(j),$$

where $\mathbf{W}_i$ and $\mathbf{b}_i$ are the weight and bias in the $i$-th layer, $\sigma$ denotes the sigmoid activation function and $\hat{\sigma}$ is the softmax function for output normalization. Variables (including the kernel matrices and weight/bias terms) involved in the model can be effectively learned with the error back propagation algorithm by minimizing the above loss function. For more information about IsoNN, readers are encouraged to refer to the original paper for detailed descriptions.
2.2 SDBN: Structural Deep Brain Network
Structural Deep Brain Network (SDBN) applies the deep convolutional neural network to the brain graph mining problem, and it can be viewed as an integration of the convolutional neural network and the autoencoder. Similar to IsoNN, SDBN also observes the order-less property of brain graph data, and introduces a graph reordering approach to resolve the problem.
As illustrated in Figure 2, besides the necessary graph data processing and representation, SDBN involves three main steps to learn the final representations and labels of the graph instances, i.e., graph reordering, structural augmentation and convolutional feature learning, which are introduced below.
2.2.1 Graph Reordering
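Given the graph set $\mathcal{G}$, the goal of graph reordering is to find a node labeling $\ell$ such that for any two graphs randomly drawn from $\mathcal{G}$, the expected difference between the distance of the graph connectivity adjacency matrices based on $\ell$ and the distance of the graphs in the graph space is minimized. Formally, for each graph instance $g_i \in \mathcal{G}$, its connectivity adjacency matrix under labeling $\ell$ can be denoted as $\mathbf{A}_i^{\ell}$. Let $d_A$ and $d_G$ denote the distance metrics on the adjacency matrix domain and graph domain respectively; the graph reordering problem can then be formulated as the following optimization problem:

$$\ell^* = \arg\min_{\ell} \; \mathbb{E}_{g_i, g_j \in \mathcal{G}} \left[ \left| d_A(\mathbf{A}_i^{\ell}, \mathbf{A}_j^{\ell}) - d_G(g_i, g_j) \right| \right].$$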
Graph reordering is a combinatorial optimization problem, which has also been demonstrated to be NP-hard and is computationally infeasible to address in polynomial time. SDBN proposes to apply spectral clustering to help reorder the nodes and brain graph connectivity instead. Formally, based on the brain graph adjacency matrix $\mathbf{A}_i$ of $g_i$, its corresponding Laplacian matrix can be represented as $\mathbf{L}_i$. The spectral clustering algorithm aims at partitioning the brain graph into $k$ modules, where the node-module belonging relationships are denoted by matrix $\mathbf{F} \in \mathbb{R}^{|\mathcal{V}| \times k}$. The optimal $\mathbf{F}$ can be effectively learned with the following objective function:

$$\min_{\mathbf{F}} \; \mathrm{tr}\left( \mathbf{F}^\top \mathbf{L}_i \mathbf{F} \right), \quad \text{s.t. } \mathbf{F}^\top \mathbf{F} = \mathbf{I},$$

where $\mathbf{I}$ denotes an identity matrix and the constraint is added to ensure one node is assigned to one module only. From the learned optimal $\mathbf{F}$, SDBN can assign the nodes in graph $g_i$ to their modules $\mathcal{M} = \{M_1, M_2, \cdots, M_k\}$, where $M_j \subset \mathcal{V}$ and $\bigcup_j M_j = \mathcal{V}$ and $M_j \cap M_l = \emptyset, \forall j \neq l$. Such learned modules can help reorder the nodes in the graph into relatively compact regions, and the graph connectivity adjacency matrix after reordering can be denoted as $\tilde{\mathbf{A}}_i$. Similar operations will be performed on the other graph instances in the set $\mathcal{G}$.
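The idea of reordering nodes by spectral modules can be sketched in a few lines. The sketch below is a deliberately simplified illustration and rests on several assumptions that are not from the source: it uses the unnormalized Laplacian (degree matrix minus adjacency matrix), it assumes a connected undirected graph, and it splits the nodes into only two modules by the sign of the second-smallest eigenvector instead of running a full clustering step. The function name `spectral_reorder` is hypothetical.

```python
import numpy as np

def spectral_reorder(A):
    """Reorder nodes so that nodes of the same spectral module are adjacent.

    Returns the node order and the reordered adjacency matrix.
    """
    laplacian = np.diag(A.sum(axis=1)) - A
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    _, vectors = np.linalg.eigh(laplacian)
    fiedler = vectors[:, 1]
    modules = (fiedler > 0).astype(int)
    # A stable sort keeps the original relative order inside each module.
    order = np.argsort(modules, kind="stable")
    return order, A[np.ix_(order, order)]

# Two triangles {0, 2, 4} and {1, 3, 5} with interleaved node ids and one bridge.
A = np.zeros((6, 6))
for u, v in [(0, 2), (2, 4), (0, 4), (1, 3), (3, 5), (1, 5), (0, 1)]:
    A[u, v] = A[v, u] = 1.0
order, A_reordered = spectral_reorder(A)
```

After reordering, the two triangles occupy the two diagonal blocks of the adjacency matrix and only the bridge edge falls in the off-diagonal blocks. Which triangle comes first depends on the sign of the eigenvector returned by the solver.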
2.2.2 Structural Augmentation
Prior to feeding the reordered graph adjacency matrix to the deep convolutional neural network for representation learning, SDBN proposes to augment the network structure by both refining the connectivity adjacency matrix and creating an additional module identification channel.
Reordered Adjacency Matrix Refinement: Formally, for graph , based on its reordered adjacency matrix obtained from the previous step, SDBN proposes to refine its entry values with the following equation:
In the equation, term denotes a small constant.
Module Identification Channel Creation: From the reordered adjacency matrix for graph , the learned module identity information is actually not preserved. To effectively incorporate such information in the model, SDBN proposes to create one more channel for graph , whose entry values can be denoted as follows:
Formally, based on the above operations, the inputs for the representation learning component on graph will be , which encodes much more information and can help learn more useful representations.
2.2.3 Learning of the SDBN Model
As illustrated in Figure 3, based on the input matrix for the graphs in (here, the subscript of is not indicated and it can represent any graphs in ), SDBN proposes to apply the convolutional neural network for the graph representation learning. To be specific, the convolutional neural network used in SDBN involves two operators: conv and pool, which can be stacked together multiple times to form a deep architecture.
Formally, according to Figure 3, the intermediate representations of the input graphs as well as the corresponding labels in the SDBN can be represented with the following equations:
where flattens the matrix to a matrix and denotes the fully-connected layers in the model. In the above equations, denotes the involved variables in the model, which will be optimized.
Based on the above model, for all the graph instances , we can represent the introduced loss terms by the model as
where and represent the ground-truth label vector and the inferred label vector of graph , respectively.
Meanwhile, in addition to the above loss term, SDBN also incorporates the autoencoder into the model learning process via the depool and deconv operations. The conv and pool operators mentioned above compose the encoder part, whereas the deconv and depool operators will form the decoder part. Formally, based on the learned intermediate representation of the input graph matrix , SDBN computes the recovered representations as follows:
By minimizing the difference between and , as well as the difference between and , i.e.,
SDBN can effectively learn the involved variables in the model. As illustrated in Figure 3, the decoder step can work in different manners, which lead to different regularization terms on the intermediate representations. A performance comparison between IsoNN and SDBN has also been reported, and the readers may refer to [4, 7] for more detailed information on the models and the experimental evaluation results.
2.3 LF&ER: Deep Autoencoder based Latent Feature Extraction
Deep Autoencoder based Latent Feature Extraction (LF&ER) serves as a latent feature extraction component in the final model introduced in the original paper. Based on the input community real-estate POI graphs, LF&ER aims to learn the latent representations of the POI graphs, which will be used to infer the community vibrancy scores.
2.3.1 Deep Autoencoder Model
The LF&ER model is built on the deep autoencoder. An autoencoder is an unsupervised neural network model, which projects the instances in their original feature representations into a lower-dimensional feature space via a series of non-linear mappings. As Figure 4 shows, the autoencoder model involves two steps: encode and decode. The encode step projects the original feature vector to the objective feature space, while the decode step recovers the latent feature representation to a reconstruction space. In the autoencoder model, we generally need to ensure that the reconstructed feature representation of each instance is as similar to its original feature representation as possible.
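Formally, let $\mathbf{x}_i$ represent the original feature representation of instance $i$, and $\mathbf{h}_i^1, \mathbf{h}_i^2, \cdots, \mathbf{h}_i^o$ be the latent feature representations of the instance at hidden layers $1, 2, \cdots, o$ in the encode step respectively; the encoding result in the objective lower-dimension feature space can be represented as $\mathbf{z}_i \in \mathbb{R}^{d}$ with dimension $d$. Formally, the relationship between these vector variables can be represented with the following equations:

$$\mathbf{h}_i^1 = \sigma(\mathbf{W}_1 \mathbf{x}_i + \mathbf{b}_1), \quad \mathbf{h}_i^k = \sigma(\mathbf{W}_k \mathbf{h}_i^{k-1} + \mathbf{b}_k), \forall k \in \{2, \cdots, o\}, \quad \mathbf{z}_i = \sigma(\mathbf{W}_{o+1} \mathbf{h}_i^o + \mathbf{b}_{o+1}).$$

Meanwhile, in the decode step, the input will be the latent feature vector $\mathbf{z}_i$ (i.e., the output of the encode step), and the final output will be the reconstructed vector $\hat{\mathbf{x}}_i$. The latent feature vectors at the hidden layers can be represented as $\hat{\mathbf{h}}_i^o, \hat{\mathbf{h}}_i^{o-1}, \cdots, \hat{\mathbf{h}}_i^1$. The relationship between these vector variables can be denoted as

$$\hat{\mathbf{h}}_i^o = \sigma(\hat{\mathbf{W}}_{o+1} \mathbf{z}_i + \hat{\mathbf{b}}_{o+1}), \quad \hat{\mathbf{h}}_i^{k-1} = \sigma(\hat{\mathbf{W}}_k \hat{\mathbf{h}}_i^{k} + \hat{\mathbf{b}}_k), \forall k \in \{2, \cdots, o\}, \quad \hat{\mathbf{x}}_i = \sigma(\hat{\mathbf{W}}_1 \hat{\mathbf{h}}_i^1 + \hat{\mathbf{b}}_1).$$

In the above equations, $\mathbf{W}$ and $\mathbf{b}$ with different subscripts denote the weight matrices and bias terms to be learned in the model. The objective of the autoencoder model is to minimize the loss between the original feature vector $\mathbf{x}_i$ and the reconstructed feature vector $\hat{\mathbf{x}}_i$. Formally, the loss term can be represented as

$$\mathcal{L}(\Theta) = \sum_i \left\| \mathbf{x}_i - \hat{\mathbf{x}}_i \right\|_2^2,$$

where $\Theta$ denotes the variables involved in the autoencoder model.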
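The encode and decode steps can be sketched as a forward pass through two stacks of fully connected layers. This is a minimal illustration, not the LF&ER implementation: the layer sizes, the sigmoid activation, the random initialization and the helper names `init_layers`, `forward` and `autoencode` are all assumptions made for the example, and no training step is included.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_layers(sizes, rng):
    """Create (weight, bias) pairs for consecutive layer sizes."""
    return [(rng.normal(scale=0.1, size=(n_out, n_in)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(x, layers):
    """Apply a stack of sigmoid-activated fully connected layers."""
    h = x
    for W, b in layers:
        h = sigmoid(W @ h + b)
    return h

def autoencode(x, encoder, decoder):
    """Return the latent code, the reconstruction and the squared error."""
    z = forward(x, encoder)
    x_hat = forward(z, decoder)
    return z, x_hat, float(np.sum((x - x_hat) ** 2))

rng = np.random.default_rng(0)
encoder = init_layers([8, 4, 2], rng)   # 8-dim input -> 2-dim latent code
decoder = init_layers([2, 4, 8], rng)   # mirrored sizes for reconstruction
z, x_hat, loss = autoencode(rng.random(8), encoder, decoder)
```

Training would minimize the summed reconstruction error over all instances with respect to the encoder and decoder parameters, for example with gradient descent.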
2.3.2 Latent Representation Learning
LF&ER proposes to learn the community allocation information for the vibrancy inference and ranking. Formally, spatial structure denotes the distribution of POIs inside the community, e.g., a grocery store lies between two residential buildings; a school is next to the police office. The spatial structure can hardly be represented with the explicit features extracted before, so LF&ER proposes to represent it with a set of latent feature vectors extracted from the geographic distance graph and the mobility connectivity graph defined in the previous subsection. The autoencoder model is applied here for the latent feature extraction.
Autoencoder model has been applied to embed the graph data into lower-dimensional spaces in many of the research works, which will obtain a latent feature representation for the nodes inside the graph. Different from these works, instead of calculating the latent feature for the POI categories inside the communities, LF&ER aims at obtaining the latent feature vector for the whole community, i.e., embedding the graph as one latent feature vector.
As shown in Figure 4, LF&ER transforms the matrix of the geographic distance graph (involving the POI categories) into a vector, which can be denoted as
Vector will be used as the input feeding into the autoencoder model. The latent embedding feature vector of can be represented as (i.e., the vector as introduced in the autoencoder model in the previous section), which depicts the layout information of POI categories in the community in terms of the geographical distance. Besides the static layout based on geographic distance graph, the spatial structure of the POIs in the communities can also be revealed indirectly through the human mobility. For a pair of POI categories which are far away geographically, if people like to go between them frequently, it can display another type of structure of the POIs in terms of their functional correlations. Via the multiple fully connected layers, LF&ER will project such learned features to the objective vibrancy scores of the community. We will not introduce the model learning part here, since it also involves the ranking models and explicit feature engineering works, which is not closely related to the topic of this paper. The readers may refer to  for detailed description about the model and its learning process. In addition, autoencoder (i.e., the base model of LF&ER) is also compared against IsoNN, whose results are reported in .
3 Graph Neural Networks for Giant Networks
In this section, we will introduce the graph neural networks proposed for the representation learning tasks on giant networks instead. Formally, we can represent the giant network instance to be studied in this section as $G = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ and $\mathcal{E}$ denote the sets of nodes and links in the network, respectively. Different from the small graph data studied in Section 2, the nodes in the giant network are partially annotated with labels instead. Formally, we can represent the set of labeled nodes as $\mathcal{V}_l \subset \mathcal{V}$, where each node $v_i \in \mathcal{V}_l$ is annotated and $\mathbf{y}_i$ denotes its label vector; whereas the remaining unlabeled nodes can be represented as $\mathcal{V}_u = \mathcal{V} \setminus \mathcal{V}_l$. In the case where all the involved network nodes are labeled, we will have $\mathcal{V}_l = \mathcal{V}$ and $\mathcal{V}_u = \emptyset$, which is a special case of the general partially labeled learning setting studied in this paper. The giant network oriented graph neural networks aim at learning a mapping, i.e., $f: \mathcal{V} \to \mathbb{R}^{d_h}$, to obtain the feature vector representations of the nodes in the network, which can be utilized to infer their labels. To be more specific, the models to be introduced in this section include GCN, GAT, DifNN, GNL, GraphSage and seGEN.
3.1 GCN: Graph Convolutional Network
Graph convolutional network (GCN) initially proposed in  introduces a spectral graph convolution operator for the graph data representation learning, which also provides several different approximations of the operator to encode both the graph structure and features of the nodes. GCN works well for the partially labeled giant networks, and the learned node representations can be effectively applied for the node classification task.
3.1.1 Spectral Graph Convolution
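Formally, given an input network $G = (\mathcal{V}, \mathcal{E})$, its network structure information can be denoted as an adjacency matrix $\mathbf{A} \in \mathbb{R}^{|\mathcal{V}| \times |\mathcal{V}|}$. The corresponding normalized graph Laplacian matrix can be denoted as $\mathbf{L} = \mathbf{I} - \mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}}$, where $\mathbf{D}$ is a diagonal matrix with entries $\mathbf{D}(i,i) = \sum_j \mathbf{A}(i,j)$ on its diagonal and $\mathbf{I}$ is an identity matrix with ones on its diagonal. The eigen-decomposition of matrix $\mathbf{L}$ can be denoted as $\mathbf{L} = \mathbf{U} \mathbf{\Lambda} \mathbf{U}^\top$, where $\mathbf{U}$ denotes the eigen-vector matrix and diagonal matrix $\mathbf{\Lambda}$ has the eigen-values on its diagonal.
The spectral convolution operator defined on network $G$ in GCN is denoted as a multiplication of an input signal vector $\mathbf{x} \in \mathbb{R}^{|\mathcal{V}|}$ with a filter $g_{\boldsymbol{\theta}} = \mathrm{diag}(\boldsymbol{\theta})$ (parameterized by variable vector $\boldsymbol{\theta} \in \mathbb{R}^{|\mathcal{V}|}$) in the Fourier domain as follows:

$$g_{\boldsymbol{\theta}} * \mathbf{x} = \mathbf{U} g_{\boldsymbol{\theta}} \mathbf{U}^\top \mathbf{x}. \tag{19}$$

Here, $\mathbf{U}^\top \mathbf{x}$ is defined as the graph Fourier transform of $\mathbf{x}$ and $g_{\boldsymbol{\theta}}$ can be understood as a function on the eigen-values, i.e., $g_{\boldsymbol{\theta}}(\mathbf{\Lambda})$.
According to Equ. (19), the computation cost of the term on the right-hand-side will be $O(|\mathcal{V}|^2)$. For giant networks involving millions or even billions of nodes, the computation of the graph convolution term becomes infeasible, not to mention the eigen-decomposition of the Laplacian matrix defined before. Therefore, to resolve this problem, the filter function is approximated by a truncated expansion in terms of the Chebyshev polynomials $T_k(\cdot)$ up to order $K$ as follows:

$$g_{\boldsymbol{\theta}} * \mathbf{x} \approx \sum_{k=0}^{K} \boldsymbol{\theta}(k) \, T_k(\tilde{\mathbf{L}}) \, \mathbf{x},$$

where $\tilde{\mathbf{L}} = \frac{2}{\lambda_{max}} \mathbf{L} - \mathbf{I}$ and $\lambda_{max}$ is the largest eigen-value of matrix $\mathbf{L}$. Vector $\boldsymbol{\theta} \in \mathbb{R}^{K+1}$ is now a vector of Chebyshev coefficients. Note that the computational complexity of the term on the right-hand-side is $O(|\mathcal{E}|)$, i.e., linear in the number of edges, which is lower than that of Equ. (19) introduced before.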
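The truncated Chebyshev expansion can be evaluated with the three-term recurrence $T_k(y) = 2 y T_{k-1}(y) - T_{k-2}(y)$, which needs only matrix-vector products. The sketch below is an illustration under stated assumptions: the helper names `normalized_laplacian` and `chebyshev_filter` are hypothetical, dense matrices are used for brevity, the graph is assumed to have no isolated nodes, and the largest eigen-value is computed exactly here even though a large-scale implementation would estimate or bound it instead.

```python
import numpy as np

def normalized_laplacian(A):
    """I - D^{-1/2} A D^{-1/2}; assumes every node has at least one link."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def chebyshev_filter(L, x, theta):
    """Evaluate sum_k theta[k] * T_k(L_tilde) @ x with the Chebyshev recurrence."""
    lam_max = np.linalg.eigvalsh(L).max()   # exact here; estimated in practice
    L_tilde = (2.0 / lam_max) * L - np.eye(L.shape[0])
    t_prev, t_curr = x, L_tilde @ x         # T_0(L_tilde) x and T_1(L_tilde) x
    out = theta[0] * t_prev
    if len(theta) > 1:
        out = out + theta[1] * t_curr
    for k in range(2, len(theta)):
        t_prev, t_curr = t_curr, 2.0 * (L_tilde @ t_curr) - t_prev
        out = out + theta[k] * t_curr
    return out

A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
L = normalized_laplacian(A)
y = chebyshev_filter(L, np.array([1.0, -2.0, 0.5, 3.0]), np.array([0.5, -1.0, 0.25]))
```

With a sparse Laplacian, each step of the recurrence costs one sparse matrix-vector product, which is where the cost linear in the number of edges comes from.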
3.1.2 Graph Convolution Approximation
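By limiting the expansion to the first order (i.e., $K = 1$), approximating the largest eigen-value of the normalized Laplacian by $2$, tying the two remaining Chebyshev coefficients to a single parameter $\theta$, and renormalizing the resulting matrix, the graph convolution on an input signal vector $\mathbf{x}$ can be approximated as

$$g_{\theta} * \mathbf{x} \approx \theta \, \tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}} \mathbf{x},$$

where $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$ adds self-connections to the adjacency matrix $\mathbf{A}$ and $\tilde{\mathbf{D}}$ is the diagonal degree matrix defined on $\tilde{\mathbf{A}}$ instead.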
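As illustrated in Figure 5, in the case when there exist $d$ input channels, i.e., the input will be a matrix $\mathbf{X} \in \mathbb{R}^{|\mathcal{V}| \times d}$, and $d_h$ different filters defined above, the learned graph convolution feature representations will be

$$\mathbf{Z} = \tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}} \mathbf{X} \mathbf{W}, \tag{22}$$

where $\tilde{\mathbf{A}}$ is the adjacency matrix with self-connections added, $\tilde{\mathbf{D}}$ is its diagonal degree matrix, and matrix $\hat{\mathbf{A}} = \tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}}$ can be pre-computed in advance. Matrix $\mathbf{W} \in \mathbb{R}^{d \times d_h}$ is the filter parameter matrix and $\mathbf{Z} \in \mathbb{R}^{|\mathcal{V}| \times d_h}$ will be the learned convolved representations of all the nodes. The computational time complexity of the operation will be $O(|\mathcal{E}| d \, d_h)$.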
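A single graph convolution of this form amounts to one pre-computed normalization followed by two matrix products. The sketch below uses dense NumPy arrays for brevity; the function names `normalized_adjacency` and `gcn_layer` are hypothetical, and a practical implementation on a giant network would store the normalized matrix in a sparse format.

```python
import numpy as np

def normalized_adjacency(A):
    """Add self-loops, then scale symmetrically by the inverse square-root degrees."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

def gcn_layer(A_hat, X, W):
    """Aggregate neighbor features with A_hat, then apply the filter matrix W."""
    return A_hat @ X @ W

A = np.array([[0.0, 1.0], [1.0, 0.0]])   # two linked nodes
X = np.array([[1.0], [3.0]])             # one input channel
W = np.array([[2.0]])                    # one filter
Z = gcn_layer(normalized_adjacency(A), X, W)
```

In this two-node example every entry of the normalized matrix is 0.5, so both nodes receive the average of the two input features before the filter weight is applied.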
3.1.3 Deep Graph Convolutional Network Learning
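The GCN model can have a deeper architecture by involving multiple graph convolution operators defined in the previous sections. For instance, the GCN model with two layers can be represented with the following equation:

$$\mathbf{Z} = \mathrm{softmax}\left( \hat{\mathbf{A}} \, \mathrm{ReLU}\left( \hat{\mathbf{A}} \mathbf{X} \mathbf{W}_1 \right) \mathbf{W}_2 \right). \tag{23}$$

In the above equation, $\hat{\mathbf{A}}$ is the pre-computed normalized adjacency matrix with self-connections, $\mathbf{X}$ is the node feature matrix, and matrices $\mathbf{W}_1$ and $\mathbf{W}_2$ are the involved variables in the model. ReLU is used as the activation function for hidden layer 1, and the softmax function is used for the output result normalization. By comparing the inferred labels, i.e., $\mathbf{Z}$, of the labeled instances against their ground-truth labels, i.e., $\mathbf{Y}$, the model variables can be effectively learned by minimizing the following loss function:

$$\min_{\Theta} \; -\sum_{v_i \in \mathcal{V}_l} \sum_{j=1}^{d_y} \mathbf{Y}(i, j) \log \mathbf{Z}(i, j),$$

where $\mathcal{V}_l$ is the set of labeled nodes and $\Theta$ covers all the variables in the model.
For representation simplicity, node subscript $i$ is used as its corresponding index in the label matrix $\mathbf{Y}$. Notation $d_y$ denotes the number of labels in the studied problem, and $d_y = 2$ for the traditional binary classification tasks. The readers can also refer to the original paper for detailed information on the GCN model.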
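The two-layer model and the loss restricted to the labeled nodes can be sketched as follows. This is a forward-pass illustration only, under stated assumptions: the normalized adjacency matrix is taken as a pre-computed input, the function names are hypothetical, a small constant is added inside the logarithm for numerical safety, and gradient computation is left to an automatic differentiation framework.

```python
import numpy as np

def softmax_rows(S):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    e = np.exp(S - S.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def gcn_forward(A_hat, X, W1, W2):
    """Two graph convolutions: ReLU on the hidden layer, softmax on the output."""
    H = np.maximum(A_hat @ X @ W1, 0.0)
    return softmax_rows(A_hat @ H @ W2)

def masked_cross_entropy(Z, Y, labeled_idx, eps=1e-12):
    """Cross-entropy summed over the labeled nodes only."""
    return float(-np.sum(Y[labeled_idx] * np.log(Z[labeled_idx] + eps)))

A_hat = 0.5 * np.ones((2, 2))            # normalized adjacency of two linked nodes
X = np.eye(2)                            # one-hot node features
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(2, 4)), rng.normal(size=(4, 2))
Z = gcn_forward(A_hat, X, W1, W2)
Y = np.array([[1.0, 0.0], [0.0, 1.0]])
loss = masked_cross_entropy(Z, Y, labeled_idx=[0])  # only node 0 is labeled
```

Unlabeled nodes still take part in the propagation through the normalized adjacency matrix; they are only excluded from the loss.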
3.2 GAT: Graph Attention Network
Graph attention network (GAT) can be viewed as an extension of GCN. In updating the nodes’ representations, instead of assigning the neighbors fixed weights, i.e., the values in the normalized adjacency matrix $\tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}}$ in Equ. (22) and Equ. (23), GAT introduces an attention mechanism to compute the weights based on the representations of the target node as well as its neighbors.
3.2.1 Graph Attention Coefficient Computation
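Formally, given an input network $G = (\mathcal{V}, \mathcal{E})$ and the raw features of the nodes, the node features can be denoted as a matrix $\mathbf{X} \in \mathbb{R}^{|\mathcal{V}| \times d}$, where $d$ denotes the dimension of the node feature vectors. Furthermore, for node $v_i \in \mathcal{V}$, its feature vector can also be represented as $\mathbf{x}_i = \mathbf{X}(i, :)^\top$ for simplicity. Without considering the network structure, via a mapping matrix $\mathbf{W} \in \mathbb{R}^{d_h \times d}$, the nodes can be projected to their representations in the hidden layer. Meanwhile, to further incorporate the network structure information into the model, the neighbor set of node $v_i$ can be denoted as $\Gamma(v_i) = \{v_j \mid (v_i, v_j) \in \mathcal{E}\} \cup \{v_i\}$, where $v_i$ is also added and treated as the self-neighbor. As illustrated in Figure 6, GAT proposes to compute the attention coefficient between nodes $v_i$ and $v_j$ (if $v_j \in \Gamma(v_i)$) as follows:

$$e_{i,j} = \mathrm{LeakyReLU}\left( \mathbf{a}^\top \left( \mathbf{W} \mathbf{x}_i \sqcup \mathbf{W} \mathbf{x}_j \right) \right),$$

where $\mathbf{a} \in \mathbb{R}^{2 d_h}$ is a variable vector for the weighted sum of the entries in vector $\mathbf{W} \mathbf{x}_i \sqcup \mathbf{W} \mathbf{x}_j$ and $\sqcup$ denotes the concatenation operator of two vectors. The LeakyReLU function is added here mainly for model learning considerations.
To further normalize the coefficients among all the neighbors, GAT adopts the softmax function based on the coefficients defined above. Formally, the final computed weight between nodes $v_i$ and $v_j$ can be denoted as

$$\alpha_{i,j} = \frac{\exp(e_{i,j})}{\sum_{v_k \in \Gamma(v_i)} \exp(e_{i,k})}.$$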
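The coefficient computation and the softmax normalization over each neighborhood can be sketched directly. The sketch below is an illustration under stated assumptions: neighborhoods are passed as a plain dictionary that already includes each node itself, the function name `attention_weights` is hypothetical, and the negative slope of 0.2 for LeakyReLU is a choice made for this example.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def attention_weights(X, W, a, neighbors):
    """Softmax-normalized attention weights of every node over its neighbors.

    X: (n, d) node features; W: (d_h, d) mapping matrix; a: (2 * d_h,) vector;
    neighbors: dict mapping a node index to a list of neighbor indices
    (including the node itself). Returns a dict of weight vectors aligned
    with each neighbor list.
    """
    H = X @ W.T                                        # row i holds W x_i
    alpha = {}
    for i, nbrs in neighbors.items():
        scores = np.array([leaky_relu(a @ np.concatenate([H[i], H[j]]))
                           for j in nbrs])
        scores = np.exp(scores - scores.max())         # stable softmax
        alpha[i] = scores / scores.sum()
    return alpha

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
neighbors = {0: [0, 1], 1: [1, 0, 2], 2: [2, 1]}
alpha = attention_weights(X, np.eye(2), np.array([1.0, 0.0, 0.0, 1.0]), neighbors)
```

Each returned vector sums to one over the corresponding neighborhood, so the weights can be used directly in a weighted aggregation of the neighbors' projected features.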
3.2.2 Representation Update via Neighborhood Aggregation
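GAT effectively updates the nodes’ representations by aggregating the information from their neighbors (including the self-neighbor). Formally, with $\alpha_{i,j}$ denoting the normalized attention weight between node $v_i$ and its neighbor $v_j \in \Gamma(v_i)$, and $\mathbf{W} \mathbf{x}_j$ denoting the projected feature vector of $v_j$, the learned hidden representation of node $v_i$ can be represented as

$$\mathbf{h}_i = \sigma\left( \sum_{v_j \in \Gamma(v_i)} \alpha_{i,j} \mathbf{W} \mathbf{x}_j \right). \tag{27}$$

GAT can adopt a deeper architecture by stacking multiple attentive node representation updates. In the deep architecture, for the upper layers, the representation vector $\mathbf{h}_i$ will be treated as the input feature vector instead, and we will not elaborate on that here.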
3.2.3 Multi-Head Attention Aggregation
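To stabilize the learning process of the model, GAT can be further extended to include multi-head attention, as illustrated in Figure 6. Specifically, let $K$ denote the number of involved attention mechanisms. Based on each attention mechanism, the learned representations of node $v_i$ based on the above descriptions (i.e., Equ. (27)) can be denoted as $\mathbf{h}_i^{(1)}$, $\mathbf{h}_i^{(2)}$, $\cdots$, $\mathbf{h}_i^{(K)}$, respectively. By further aggregating such learned representations together, the ultimate learned representation of node $v_i$ can be denoted as

$$\mathbf{h}_i' = \mathrm{Aggregate}\left( \mathbf{h}_i^{(1)}, \mathbf{h}_i^{(2)}, \cdots, \mathbf{h}_i^{(K)} \right).$$

Several different aggregation functions have been tried, including concatenation and average:

$$\text{concatenation: } \mathbf{h}_i' = \bigsqcup_{k=1}^{K} \sigma\left( \sum_{v_j \in \Gamma(v_i)} \alpha_{i,j}^{(k)} \mathbf{W}^{(k)} \mathbf{x}_j \right), \qquad \text{average: } \mathbf{h}_i' = \sigma\left( \frac{1}{K} \sum_{k=1}^{K} \sum_{v_j \in \Gamma(v_i)} \alpha_{i,j}^{(k)} \mathbf{W}^{(k)} \mathbf{x}_j \right),$$

where $\alpha_{i,j}^{(k)}$ and $\mathbf{W}^{(k)}$ denote the attention weights and the mapping matrix of the $k$-th attention mechanism, and $\sqcup$ denotes vector concatenation.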
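The neighborhood aggregation and the two multi-head variants can be sketched with matrix products. The sketch below is an illustration under stated assumptions: each head's attention weights are given as a dense row-stochastic matrix with zeros for non-neighbors, the sigmoid is used as the non-linearity, and the function names `head_pre_activation` and `multi_head` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def head_pre_activation(alpha, X, W):
    """Attention-weighted sum of projected neighbor features, for all nodes.

    alpha: (n, n) attention matrix, zero outside each node's neighborhood;
    X: (n, d) node features; W: (d_h, d) mapping matrix of this head.
    """
    return alpha @ (X @ W.T)

def multi_head(alphas, X, Ws, mode="concat"):
    """Combine several attention heads by concatenation or by averaging."""
    pre = [head_pre_activation(al, X, W) for al, W in zip(alphas, Ws)]
    if mode == "concat":
        # Activate each head separately, then concatenate along the features.
        return np.concatenate([sigmoid(p) for p in pre], axis=1)
    if mode == "average":
        # Average the heads first, then apply the activation once.
        return sigmoid(np.mean(pre, axis=0))
    raise ValueError(f"unknown mode: {mode}")

X = np.eye(2)
alphas = [np.eye(2), np.eye(2)]           # each node attends only to itself
Ws = [np.eye(2), 2.0 * np.eye(2)]
H_concat = multi_head(alphas, X, Ws, mode="concat")    # shape (2, 4)
H_average = multi_head(alphas, X, Ws, mode="average")  # shape (2, 2)
```

Concatenation multiplies the output dimension by the number of heads, while averaging keeps it unchanged, which makes averaging the more natural choice for an output layer whose dimension must match the number of labels.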
The learning process of the GAT model is very similar to that of GCN introduced in Section 3.1.3, and we will not introduce that part again here. The readers can also refer to  for more detailed descriptions about the model as well as its experimental performance.
3.3 DifNN: Deep Diffusive Neural Network
The deep diffusive neural network (DifNN) model aims at modeling the diverse connections in heterogeneous information networks, which contain multiple types of nodes and links. DifNN is based on a new type of neuron, namely the gated diffusive unit (GDU), which can be extended to incorporate the inputs from various groups of neighbors.
3.3.1 Model Architecture
Given a heterogeneous input network $G = (\mathcal{V}, \mathcal{E})$, the node set $\mathcal{V}$ in the network can be divided into multiple subsets depending on the node types. The same holds for the links in set $\mathcal{E}$. Here, for simplicity of presentation, we will follow the news augmented heterogeneous social network example when introducing the model. As illustrated in Figure 8, there exist three different types of nodes (i.e., creator, news article and subject) and two different types of links (i.e., the creator-article link and article-subject link) in the network. Formally, the node set can be categorized into three subsets, i.e., $\mathcal{V} = \mathcal{U} \cup \mathcal{N} \cup \mathcal{S}$ (creators, news articles and subjects), and the link set can be categorized into two subsets, i.e., $\mathcal{E} = \mathcal{E}_{u,n} \cup \mathcal{E}_{n,s}$.
For each node in the network, e.g., $v_i \in \mathcal{V}$, its extracted raw feature vector can be denoted as $\mathbf{x}_i$. As introduced at the beginning of Section 3, in many cases, the network is partially labeled. Formally, the label vector of node $v_i$ is represented as $\mathbf{y}_i$. For each node in the input network, DifNN utilizes one GDU (which will be introduced in the following subsection) to model its representations and the connections with other neighboring nodes. For instance, based on the input network in Figure 8, its corresponding DifNN model architecture can be represented in Figure 8. Via the GDU neuron, DifNN can effectively project the node inputs to their corresponding labels. The parameters involved in the DifNN model can be effectively trained based on the labeled nodes via the back propagation algorithm. In the following two subsections, we will introduce the detailed information about GDU as well as the DifNN model training.
3.3.2 Gated Diffusive Unit
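To introduce the GDU neuron, we take news article nodes as an example here. Formally, among all the inputs of the GDU model, $\mathbf{x}_i$ denotes the extracted feature vector for news articles, $\mathbf{z}_i$ represents the input from other GDUs corresponding to subjects, and $\mathbf{t}_i$ represents the input from other GDUs about creators. Considering that the GDU for each news article may be connected to multiple GDUs of subjects and creators, the mean of the outputs from the GDUs corresponding to these subjects and creators will be computed as the inputs $\mathbf{z}_i$ and $\mathbf{t}_i$ instead, respectively, which is also indicated by the GDU architecture illustrated in Figure 9. For the inputs from the subjects, GDU has a gate called the “forget gate”, which may update some content of $\mathbf{z}_i$ to forget. The forget gate is important, since in the real world, different news articles may focus on different aspects of the subjects and “forgetting” part of the input from the subjects is necessary in modeling. Formally, the “forget gate” together with the updated input can be represented as

$$\tilde{\mathbf{z}}_i = \mathbf{f}_i \otimes \mathbf{z}_i, \quad \text{where } \mathbf{f}_i = \sigma\left( \mathbf{W}_f \left[ \mathbf{x}_i^\top, \mathbf{z}_i^\top, \mathbf{t}_i^\top \right]^\top \right).$$

Here, operator $\otimes$ denotes the entry-wise product of vectors and $\mathbf{W}_f$ represents the variable of the forget gate in GDU.
Meanwhile, for the input from the creator nodes, a new node-type “adjust gate” is introduced in GDU. Here, the term “adjust” models the necessary changes of information between different node categories (e.g., from creators to articles). Formally, the “adjust gate” as well as the updated input can be denoted as

$$\tilde{\mathbf{t}}_i = \mathbf{e}_i \otimes \mathbf{t}_i, \quad \text{where } \mathbf{e}_i = \sigma\left( \mathbf{W}_e \left[ \mathbf{x}_i^\top, \mathbf{z}_i^\top, \mathbf{t}_i^\top \right]^\top \right),$$

where $\mathbf{W}_e$ denotes the variable matrix in the adjust gate.
GDU allows different combinations of these input/state vectors, which are controlled by the selection gates $\mathbf{g}_i$ and $\mathbf{r}_i$ respectively. Formally, the final output of GDU will be

$$\begin{aligned}
\mathbf{h}_i = \; & \mathbf{g}_i \otimes \mathbf{r}_i \otimes \tanh\left( \mathbf{W}_u \left[ \mathbf{x}_i^\top, \tilde{\mathbf{z}}_i^\top, \tilde{\mathbf{t}}_i^\top \right]^\top \right) \\
& \oplus (\mathbf{1} \ominus \mathbf{g}_i) \otimes \mathbf{r}_i \otimes \tanh\left( \mathbf{W}_u \left[ \mathbf{x}_i^\top, \mathbf{z}_i^\top, \tilde{\mathbf{t}}_i^\top \right]^\top \right) \\
& \oplus \mathbf{g}_i \otimes (\mathbf{1} \ominus \mathbf{r}_i) \otimes \tanh\left( \mathbf{W}_u \left[ \mathbf{x}_i^\top, \tilde{\mathbf{z}}_i^\top, \mathbf{t}_i^\top \right]^\top \right) \\
& \oplus (\mathbf{1} \ominus \mathbf{g}_i) \otimes (\mathbf{1} \ominus \mathbf{r}_i) \otimes \tanh\left( \mathbf{W}_u \left[ \mathbf{x}_i^\top, \mathbf{z}_i^\top, \mathbf{t}_i^\top \right]^\top \right),
\end{aligned}$$

where $\mathbf{g}_i = \sigma\left( \mathbf{W}_g \left[ \mathbf{x}_i^\top, \mathbf{z}_i^\top, \mathbf{t}_i^\top \right]^\top \right)$, and $\mathbf{r}_i = \sigma\left( \mathbf{W}_r \left[ \mathbf{x}_i^\top, \mathbf{z}_i^\top, \mathbf{t}_i^\top \right]^\top \right)$, and term $\mathbf{1}$ denotes a vector filled with value $1$. Operators $\oplus$ and $\ominus$ denote the entry-wise addition and subtraction of vectors. Matrices $\mathbf{W}_u$, $\mathbf{W}_g$, $\mathbf{W}_r$ represent the variables involved in the components. Vector $\mathbf{h}_i$ will be the output of the GDU model.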
The introduced GDU model also works for both the news subject and creator nodes in the network. When applying the GDU to model the states of subject/creator nodes, which have only two inputs, the remaining input port can be assigned a default value (usually the zero vector $\mathbf{0}$). In the following section, we introduce how to learn the parameters of the DifNN model for concurrent inference of multiple nodes.
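The gate structure described in this subsection can be illustrated with a short sketch. It is only one plausible reading of the GDU, not the reference implementation: the function name `gdu_forward`, the use of sigmoid gates over concatenated inputs, the `tanh` candidate states and the randomly initialised weights are all assumptions made for illustration, and plain Python lists stand in for vectors and matrices.

```python
import math
import random


def sigmoid(v):
    """Entry-wise logistic function."""
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def tanh(v):
    """Entry-wise hyperbolic tangent."""
    return [math.tanh(a) for a in v]


def matvec(W, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def hadamard(*vectors):
    """Entry-wise product of several equally sized vectors."""
    out = [1.0] * len(vectors[0])
    for v in vectors:
        out = [o * a for o, a in zip(out, v)]
    return out


def one_minus(v):
    """Entry-wise (1 - v)."""
    return [1.0 - a for a in v]


def gdu_forward(x, z, t, W_f, W_e, W_g, W_r, W_u):
    """One GDU step (assumed form): forget gate on z, adjust gate on t,
    then two selection gates mixing the four input combinations."""
    raw = x + z + t                                    # list concatenation
    z_upd = hadamard(sigmoid(matvec(W_f, raw)), z)     # forget gate applied to z
    t_upd = hadamard(sigmoid(matvec(W_e, raw)), t)     # adjust gate applied to t
    upd = x + z_upd + t_upd
    g = sigmoid(matvec(W_g, upd))                      # selects updated vs. raw z
    r = sigmoid(matvec(W_r, upd))                      # selects updated vs. raw t

    def candidate(z_in, t_in):
        return tanh(matvec(W_u, x + z_in + t_in))

    terms = [
        hadamard(g, r, candidate(z_upd, t_upd)),
        hadamard(one_minus(g), r, candidate(z, t_upd)),
        hadamard(g, one_minus(r), candidate(z_upd, t)),
        hadamard(one_minus(g), one_minus(r), candidate(z, t)),
    ]
    return [sum(parts) for parts in zip(*terms)]


def random_matrix(rows, cols, rng):
    """Hypothetical weight initialisation, for the demo only."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_x, d_h = 3, 2
weights = [random_matrix(d_h, d_x + 2 * d_h, rng) for _ in range(5)]
x = [0.2, -0.1, 0.4]   # raw feature vector
z = [0.3, 0.7]         # aggregated input from one neighbour type
t = [0.0, 0.0]         # unused input port filled with the default zero vector
h = gdu_forward(x, z, t, *weights)
```

Because the four gate products sum to one in every entry, the output is an entry-wise convex combination of `tanh` candidates and therefore stays strictly inside (-1, 1).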
3.3.3 DifNN Model Learning
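In the DifNN model shown in Figure 8, based on the output state vectors of news articles, news creators and news subjects, the framework projects the state vectors to their labels. Formally, given the state vectors $\mathbf{h}_{n,i}$ of news article $n_i$, $\mathbf{h}_{u,j}$ of news creator $u_j$, and $\mathbf{h}_{s,l}$ of news subject $s_l$, their inferred labels can be denoted as vectors $\mathbf{y}_{n,i}$, $\mathbf{y}_{u,j}$, $\mathbf{y}_{s,l}$ respectively, which can be represented as

$$
\begin{aligned}
\mathbf{y}_{n,i} &= \text{softmax}\left(\mathbf{W}_n \mathbf{h}_{n,i}\right), \\
\mathbf{y}_{u,j} &= \text{softmax}\left(\mathbf{W}_u \mathbf{h}_{u,j}\right), \\
\mathbf{y}_{s,l} &= \text{softmax}\left(\mathbf{W}_s \mathbf{h}_{s,l}\right),
\end{aligned}
$$

where $\mathbf{W}_u$, $\mathbf{W}_n$ and $\mathbf{W}_s$ define the weight variables projecting state vectors to the output vectors.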
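Meanwhile, based on the news articles in the training set $\mathcal{T}_n$ with the ground-truth label vectors $\{\hat{\mathbf{y}}_{n,i}\}_{n_i \in \mathcal{T}_n}$, the loss function of the framework for news article label learning is defined as the cross-entropy between the prediction results and the ground truth:

$$\mathcal{L}(\mathcal{T}_n) = -\sum_{n_i \in \mathcal{T}_n} \sum_{k=1}^{|\mathcal{Y}|} \hat{\mathbf{y}}_{n,i}[k] \log \mathbf{y}_{n,i}[k].$$

Similarly, the loss terms introduced by news creators and subjects based on training sets $\mathcal{T}_u$ and $\mathcal{T}_s$ can be denoted as

$$
\begin{aligned}
\mathcal{L}(\mathcal{T}_u) &= -\sum_{u_j \in \mathcal{T}_u} \sum_{k=1}^{|\mathcal{Y}|} \hat{\mathbf{y}}_{u,j}[k] \log \mathbf{y}_{u,j}[k], \\
\mathcal{L}(\mathcal{T}_s) &= -\sum_{s_l \in \mathcal{T}_s} \sum_{k=1}^{|\mathcal{Y}|} \hat{\mathbf{y}}_{s,l}[k] \log \mathbf{y}_{s,l}[k],
\end{aligned}
$$

where $\mathbf{y}_{u,j}$ and $\hat{\mathbf{y}}_{u,j}$ (and $\mathbf{y}_{s,l}$ and $\hat{\mathbf{y}}_{s,l}$) denote the prediction result vector and ground-truth vector of creator (and subject) respectively, and $|\mathcal{Y}|$ denotes the number of label classes.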
Formally, the main objective function of the DifNN model can be represented as follows:

$$\min_{\mathbf{W}} \; \mathcal{L}(\mathcal{T}_n) + \mathcal{L}(\mathcal{T}_u) + \mathcal{L}(\mathcal{T}_s) + \alpha \cdot \mathcal{L}_{reg}(\mathbf{W}),$$

where $\mathbf{W}$ denotes all the involved variables to be learned, term $\mathcal{L}_{reg}(\mathbf{W})$ represents the regularization term (i.e., the sum of the $L_2$ norms of the variable vectors and matrices), and $\alpha$ denotes the regularization term weight. By solving this optimization problem, the variables in the model can be learned with the back-propagation algorithm. For the news articles, creators and subjects in the testing set, their predicted labels are output as the final result.
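As a rough illustration of how the joint objective is assembled, the sketch below sums a cross-entropy term per node type and adds a weighted regularization term. Several details are assumptions rather than the authors' implementation: the softmax projection, the squared $L_2$ penalty, the restriction of the penalty to the projection matrices only, and the names `difnn_objective` and `l2_penalty`.

```python
import math


def matvec(W, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def softmax(v):
    """Numerically stable softmax over a list of scores."""
    peak = max(v)
    exps = [math.exp(a - peak) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a ground-truth vector and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(y_true, y_pred))


def l2_penalty(matrices):
    """Sum of squared entries over all given weight matrices (assumed regularizer)."""
    return sum(w * w for W in matrices for row in W for w in row)


def difnn_objective(node_types, alpha):
    """node_types: list of (W, state_vectors, label_vectors), one entry per
    node type (articles, creators, subjects), covering labeled nodes only."""
    loss = 0.0
    for W, states, labels in node_types:
        for h, y in zip(states, labels):
            loss += cross_entropy(y, softmax(matvec(W, h)))
    return loss + alpha * l2_penalty([W for W, _, _ in node_types])


# Toy example: zero projection weights give a uniform two-class prediction.
W_zero = [[0.0, 0.0], [0.0, 0.0]]
example = [(W_zero, [[1.0, -1.0]], [[1.0, 0.0]])] * 3
objective = difnn_objective(example, alpha=0.01)
```

In this toy case each of the three labeled nodes contributes a loss of about $\ln 2$ and the penalty is zero, which makes the additive structure of the objective easy to check by hand.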
3.4 GNL: Graph Neural Lasso
Graph neural lasso (GNL) is a graph neural regression model that can incorporate the historical time-series data of multiple instances to address the dynamic network regression problem. GNL extends the GDU neuron (introduced in Section 3.3.2) to incorporate both the network internal relationships and the network dynamic relationships between sequential network snapshots.
3.4.1 Dynamic Gated Diffusive Unit
GNL also adopts GDU as the basic neuron unit and extends it to the dynamic network regression problem settings, which can model both the network snapshot internal connections and the temporal dependency relationships between sequential network snapshots for the nodes.
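Formally, given the time series data about a set of connected entities, such data can be represented as a dynamic network set $\mathcal{G} = \{G^{(1)}, G^{(2)}, \cdots, G^{(t)}\}$, where $t$ denotes the maximum timestamp. Each network $G^{(\tau)} \in \mathcal{G}$ can be denoted as $G^{(\tau)} = (\mathcal{V}^{(\tau)}, \mathcal{E}^{(\tau)})$, involving the node set $\mathcal{V}^{(\tau)}$ and link set $\mathcal{E}^{(\tau)}$, respectively. Given a node $v_i$ in network $G^{(\tau)}$, its in-neighbors and out-neighbors in the network can be denoted as sets $\Gamma_{in}(v_i) = \{v_j \mid (v_j, v_i) \in \mathcal{E}^{(\tau)}\}$ and $\Gamma_{out}(v_i) = \{v_j \mid (v_i, v_j) \in \mathcal{E}^{(\tau)}\}$. Here, the link direction denotes the direction of influence among the nodes. If the influences in the studied networks are bi-directional, the in/out neighbor sets will be identical, i.e., $\Gamma_{in}(v_i) = \Gamma_{out}(v_i)$.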
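The dynamic network set and the in/out neighbor sets can be made concrete with a small sketch. The representation chosen here (one `(nodes, links)` pair per timestamp, with links stored as `(source, target)` tuples) and the helper name `neighbor_sets` are assumptions for illustration only.

```python
def neighbor_sets(nodes, links):
    """Build in-neighbor and out-neighbor sets for one network snapshot.

    A link (src, dst) is read as "src influences dst", so src is an
    in-neighbor of dst and dst is an out-neighbor of src.
    """
    in_nb = {v: set() for v in nodes}
    out_nb = {v: set() for v in nodes}
    for src, dst in links:
        out_nb[src].add(dst)
        in_nb[dst].add(src)
    return in_nb, out_nb


# A dynamic network as a list of snapshots, one per timestamp.
snapshots = [
    (["a", "b", "c"], [("a", "b"), ("c", "b")]),   # directed influences
    (["a", "b", "c"], [("a", "b"), ("b", "a")]),   # bi-directional a <-> b
]

in_0, out_0 = neighbor_sets(*snapshots[0])
in_1, out_1 = neighbor_sets(*snapshots[1])
```

In the second snapshot every link appears in both directions, so the in-neighbor and out-neighbor sets of each node coincide, matching the bi-directional case described above.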
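For node $v_i$ in network $G^{(\tau)}$ of the $\tau$-th timestamp, the input attribute values of $v_i$ can be denoted as an input feature vector $\mathbf{x}_i^{(\tau)}$. GDU maintains a hidden state vector for each node, and the vector of node $v_i$ at timestamp $\tau$ can be denoted as $\mathbf{h}_i^{(\tau)}$. As illustrated in Figure 11, besides the feature vector and hidden state vector inputs, the GDU neuron of $v_i$ also accepts the inputs from $v_i$’s in-neighbor nodes, i.e., $\{\mathbf{h}_j^{(\tau)}\}_{v_j \in \Gamma_{in}(v_i)}$, where $\Gamma_{in}(v_i)$ denotes the in-neighbor set of $v_i$. These inputs are integrated via an aggregation operator:

$$\mathbf{z}_i^{(\tau)} = \text{Aggregate}\left(\{\mathbf{h}_j^{(\tau)}\}_{v_j \in \Gamma_{in}(v_i)}\right). \tag{39}$$

The $\text{Aggregate}(\cdot)$ operator used in GNL will be introduced in detail in the next subsection.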
A common problem with existing graph neural network models is over-smoothing, which reduces all the nodes in the network to similar hidden representations. This problem becomes more serious when the model has a deep architecture with multiple layers. To resolve it, besides the attention mechanism to be introduced later, GDU introduces several gates for the neural state adjustment, as introduced in Section 3.3.2. Formally, based on the input vectors $\mathbf{x}_i^{(\tau)}$, $\mathbf{z}_i^{(\tau)}$ and $\mathbf{h}_i^{(\tau)}$, the representation of node $v_i$ in the next timestamp $\tau + 1$ can be represented as

$$\mathbf{h}_i^{(\tau+1)} = \text{GDU}\left(\mathbf{x}_i^{(\tau)}, \mathbf{z}_i^{(\tau)}, \mathbf{h}_i^{(\tau)}\right).$$
3.4.2 Attentive Neighborhood Influence Aggregation
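In this part, we introduce the $\text{Aggregate}(\cdot)$ operator used in Equ. (39) for node neighborhood influence integration. The GNL model defines this operator based on an attention mechanism. Formally, given the node $v_i$ and its in-neighbor set $\Gamma_{in}(v_i)$, for any node $v_j \in \Gamma_{in}(v_i)$, GNL quantifies the influence coefficient of $v_j$ on $v_i$ based on their hidden state vectors $\mathbf{h}_j^{(\tau)}$ and $\mathbf{h}_i^{(\tau)}$ as follows:

$$e_{j,i}^{(\tau)} = \text{AttInf}\left(\mathbf{W}_a \mathbf{h}_j^{(\tau)}, \mathbf{W}_a \mathbf{h}_i^{(\tau)}\right).$$

In the above equation, operator $\text{AttInf}(\cdot, \cdot)$ denotes a linear sum of the input vectors parameterized by weight vector $\mathbf{a}$. For model learning reasons, the above influence coefficient term can be slightly changed by adding the LeakyReLU function into its definition. Formally, the final influence coefficient used in GNL is represented as follows:

$$\alpha_{j,i}^{(\tau)} = \frac{\exp\left(\text{LeakyReLU}\left(\mathbf{a}^\top \left(\mathbf{W}_a \mathbf{h}_j^{(\tau)} \sqcup \mathbf{W}_a \mathbf{h}_i^{(\tau)}\right)\right)\right)}{\sum_{v_k \in \Gamma_{in}(v_i)} \exp\left(\text{LeakyReLU}\left(\mathbf{a}^\top \left(\mathbf{W}_a \mathbf{h}_k^{(\tau)} \sqcup \mathbf{W}_a \mathbf{h}_i^{(\tau)}\right)\right)\right)}.$$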
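The attention-based influence coefficients can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the GNL implementation: the scores are taken as a weight vector `a` applied to the concatenation of the two linearly transformed hidden states, they are normalised with a softmax over the in-neighbors, and the LeakyReLU slope of 0.2 is a hypothetical default. The function name `influence_coefficients` is likewise illustrative.

```python
import math


def matvec(W, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def dot(u, v):
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def leaky_relu(value, slope=0.2):
    """LeakyReLU with an assumed negative slope of 0.2."""
    return value if value > 0 else slope * value


def influence_coefficients(h_i, neighbor_states, W, a):
    """Normalised influence of each in-neighbor state on node i."""
    wh_i = matvec(W, h_i)
    scores = []
    for h_j in neighbor_states:
        wh_j = matvec(W, h_j)
        # Linear sum over the concatenated pair, passed through LeakyReLU.
        scores.append(leaky_relu(dot(a, wh_j + wh_i)))
    peak = max(scores)                      # subtract the max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


W_identity = [[1.0, 0.0], [0.0, 1.0]]
a = [0.5, -0.5, 0.1, 0.2]
h_i = [0.3, 0.3]
neighbor_states = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
coefficients = influence_coefficients(h_i, neighbor_states, W_identity, a)
```

The coefficients form a distribution over the in-neighbors, so neighbors with identical hidden states receive identical weights, and a weighted sum of the transformed neighbor states with these coefficients is one natural way to realise the aggregation step.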