
Auto-GNN: Neural Architecture Search of Graph Neural Networks

Graph neural networks (GNNs) have been successfully applied to graph-structured data. Given a specific scenario, rich human expertise and tremendous laborious trials are usually required to identify a suitable GNN architecture, because the performance of a GNN architecture is significantly affected by the choice of graph convolution components, such as the aggregate function and hidden dimension. Neural architecture search (NAS) has shown its potential in discovering effective deep architectures for learning tasks in image and language modeling. However, existing NAS algorithms cannot be directly applied to the GNN search problem. First, the search space of GNNs differs from the ones in existing NAS work. Second, the representation learning capacity of a GNN architecture changes markedly with slight architecture modifications, which affects the search efficiency of traditional search methods. Third, widely used techniques in NAS, such as parameter sharing, might become unstable in GNNs. To bridge the gap, we propose the automated graph neural networks (AGNN) framework, which aims to find an optimal GNN architecture within a predefined search space. A reinforcement learning based controller is designed to greedily validate architectures via small steps. AGNN has a novel parameter sharing strategy that enables homogeneous architectures to share parameters, based on a carefully-designed homogeneity definition. Experiments on real-world benchmark datasets demonstrate that the GNN architecture identified by AGNN achieves the best performance compared with existing handcrafted models and traditional search methods.


1. Introduction

Graph neural networks (GNNs) (Gori et al., 2005; Scarselli et al., 2009) have been demonstrated to achieve superior performance in modeling graph-structured data in various domains, such as social media (Grover and Leskovec, 2016; Perozzi et al., 2014; Tang et al., 2015; Wang et al., 2016) and bioinformatics (Zitnik and Leskovec, 2017; Aynaz Taheri, 2018). Following the message passing strategy (Hamilton et al., 2017), a GNN iteratively learns a node's embedding representation by aggregating the representations of its neighbors and itself. The learned node representations can be employed by downstream machine learning algorithms to perform different tasks efficiently.

However, the success of GNNs is accompanied by laborious neural architecture tuning, which aims to adapt GNNs to different graph-structured data. For example, the attention heads in graph attention networks (Velickovic et al., 2017) are selected carefully for citation networks and protein-protein interactions, and GraphSAGE (Hamilton et al., 2017) has been shown to be sensitive to hidden dimensions. These handcrafted architectures not only require extensive search in the design space through many trials, but also tend to obtain suboptimal performance when they are transferred to other graph-structured datasets. Naturally, there is a rising demand for automated GNN search to identify the optimal architecture for different real-world scenarios.

Recently, neural architecture search (NAS) has attracted increasing research interest (Elsken et al., 2018). Its goal is to find the optimal neural architecture in a predefined search space that maximizes model performance on a given task. The deep architectures discovered by NAS algorithms have outperformed handcrafted ones in domains including image classification (Zoph and Le, 2016; Zoph et al., 2018; Liu et al., 2017; Pham et al., 2018; Jin et al., 2018; Luo et al., 2018; Liu et al., 2018a, b; Xie et al., 2019; Kandasamy et al., 2018), semantic image segmentation (Liu et al., 2019), and image generation (Wang and Huan, 2019). Motivated by the success of NAS, we extend NAS studies beyond the image domains to node classification.

However, the direct application of NAS algorithms to find GNN architectures is non-trivial due to three major challenges. First, the search space of GNN architectures is different from the ones in existing NAS work. Taking the search of convolutional neural network (CNN) based architectures (Zoph and Le, 2016) as an example, the convolution operation is specified only by the kernel size. In contrast, the message-passing based graph convolution in a GNN is described by a sequence of actions, including aggregation, combination, and activation. Second, the traditional controller is inefficient at discovering potentially well-performing GNN architectures, because the representation learning capacity of a GNN architecture varies significantly with slight architecture modifications. The widely-used controller samples a complete neural architecture at each search step, and is updated after validating the new architecture. It is hard for such a controller to learn the underlying causality: which part of the architecture modification improves or degrades the model performance. For example, when the traditional controller changes the whole action sequence of a new GNN architecture, it cannot distinguish the improvement brought only by replacing the max-pooling aggregate function with summation (Xu et al., 2018). Third, widely-used techniques in NAS, such as parameter sharing, are not suitable for GNN architectures. Parameter sharing transfers weights trained on one architecture to another, aiming to avoid training from scratch. However, it leads to unstable training when parameters are shared among heterogeneous GNN architectures. We say that two neural architectures are heterogeneous if they have different shapes of trainable weights or different output statistics. The weights of architectures with different shapes cannot be shared directly. Output statistics (Guo et al., 2019) are defined as the mean, variance, or interval of the output values in each graph convolutional layer of a GNN architecture. Suppose that we have parameters deeply trained in a layer with a bounded activation function, which restricts the output to a fixed interval. If we transfer these parameters to a layer with an unbounded activation function, the output values may be too large to be backpropagated stably by the gradient descent optimizer.

To tackle the aforementioned challenges, we investigate the automated graph neural architecture search problem. Specifically, it can be separated into two research questions. (i) How do we define the search space of GNN architectures, and explore it efficiently? (ii) How do we constrain parameter sharing among heterogeneous GNN architectures to make training more stable? In summary, our major contributions are described below.

  • We formally define the neural architecture search problem tailored to graph neural networks.

  • We design a more efficient controller by considering a key property of GNN architectures: the representation learning capacity varies significantly with slight architecture modifications.

  • We define heterogeneous GNN architectures in the context of parameter sharing, so that architectures can be trained more stably with shared weights.

  • The experiments show that the discovered neural architecture consistently outperforms state-of-the-art handcrafted models and other search methods.

2. Problem Statement

We formally define the graph neural architecture search problem as follows. Given a search space $\mathcal{F}$, a training set $D_{\text{train}}$, a validation set $D_{\text{valid}}$, and an evaluation metric $M$, we aim to find the optimal GNN architecture $f^* \in \mathcal{F}$ that achieves the best metric on $D_{\text{valid}}$. Mathematically, it is written as follows:

(1)   $f^* = \arg\max_{f \in \mathcal{F}} M\big(f(\theta^*_f), D_{\text{valid}}\big), \quad \text{s.t.}\;\; \theta^*_f = \arg\min_{\theta} L\big(f(\theta), D_{\text{train}}\big),$

where $\theta^*_f$ denotes the parameters learned for architecture $f$ and $L$ denotes the loss function. Metric $M$ could be the F1 score or accuracy for the node classification task. The characteristics of the GNN search problem can be viewed from three aspects. First, the search space is constructed based on graph convolutions. Second, an efficient controller is required to consider the relationship between model performance and slight architecture modifications in GNNs. Third, the parameter sharing strategy needs to ensure that weights can be transferred stably among heterogeneous GNN architectures.


Figure 1. Illustration of the AGNN search procedure. The controller takes the best architecture found so far as input, and removes one of the six action classes in turn to generate six subarchitectures. Their strings are fed to the RNN encoders to determine the best alternative actions for the missing class. We select the new best architecture from all completed subarchitectures, based on the accompanying decision entropy. Here the action guider selects the class of activation functions, and the retained architecture is modified by replacing the activation functions in all graph convolutional layers.

We propose an efficient and effective framework named AGNN to handle the GNN search problem. Figure 1 illustrates its core idea with a GNN architecture search example. In the search space, each graph convolutional layer is specified by an action sequence as listed in the left box. There are six action classes in total, which cover a wide variety of state-of-the-art GNN models. Instead of resampling a completely new neural architecture, we use independent RNN encoders to decide the new actions for each class, e.g., the hidden dimension and activation function. The controller keeps the best architecture found so far, and makes slight architecture modifications to it on specific classes. As shown on the right-hand side of the figure, the activation functions in all layers of the retained architecture are replaced. In this way, we are able to update each RNN encoder independently to learn the effect of a specific action class on model performance. A tailored parameter sharing strategy is designed: it defines homogeneous GNN architectures via three constraints, and weights are shared only from a homogeneous ancestor architecture, helping the offspring architecture train stably. We update the best architecture if the offspring architecture outperforms it; otherwise, we continue the search by reusing the old best architecture. Next, we introduce the search space, controller, and parameter sharing in detail.
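As a concrete illustration of this search loop, the following Python sketch mimics the conservative explore-and-modify behaviour on a reduced toy search space. The `validate` surrogate, the reduced candidate sets, and the uniformly random choice of the class to modify (standing in for the entropy-guided action guider of Section 4) are illustrative assumptions, not the paper's implementation.

```python
import random

# Toy sketch of AGNN's conservative search loop (names and values are illustrative).
SEARCH_SPACE = {
    "hidden_dim": [16, 32, 64, 128],
    "aggregate": ["add", "mean", "max"],
    "activation": ["sigmoid", "tanh", "relu"],
}

def random_architecture(num_layers=2):
    # One action per class for every graph convolutional layer.
    return [{cls: random.choice(opts) for cls, opts in SEARCH_SPACE.items()}
            for _ in range(num_layers)]

def validate(arch):
    # Placeholder for training `arch` and measuring validation accuracy.
    return random.random()

def conservative_search(steps=20):
    best = random_architecture()
    best_score = validate(best)
    for _ in range(steps):
        # Guided modification: resample only ONE action class in every layer,
        # keeping the rest of the best architecture found so far intact.
        cls = random.choice(list(SEARCH_SPACE))
        offspring = [dict(layer, **{cls: random.choice(SEARCH_SPACE[cls])})
                     for layer in best]
        score = validate(offspring)
        # Conservative explorer: keep whichever of parent/offspring is better.
        if score > best_score:
            best, best_score = offspring, score
    return best, best_score

print(conservative_search())
```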

3. Search Space

In this section, we describe the designed search space for the general GNN architecture, which is composed of layers of message-passing based graph convolutions. Formally, the $k$-th layer is given by:

(2)   $h^{(k)}_{\mathcal{N}(i)} = \text{AGGREGATE}\big(\{\alpha^{(k)}_{ij} W^{(k)} h^{(k-1)}_j : j \in \mathcal{N}(i)\}\big), \qquad h^{(k)}_i = \text{ACT}\Big(\text{COMBINE}\big(W^{(k)} h^{(k-1)}_i,\; h^{(k)}_{\mathcal{N}(i)}\big)\Big),$

where $h^{(k)}_i$ denotes the embedding of node $i$ at the $k$-th layer, $\mathcal{N}(i)$ denotes the set of nodes adjacent to node $i$, $W^{(k)}$ denotes the trainable matrix used to transform the embedding dimension, and $\alpha^{(k)}_{ij}$ denotes the attention coefficient between nodes $i$ and $j$ obtained from the additional attention layer. The function AGGREGATE is applied to aggregate neighbor representations and prepare the intermediate embedding $h^{(k)}_{\mathcal{N}(i)}$. The function COMBINE is used to combine information from the node itself with the intermediate embedding, and the function ACT activates the node embedding. Based on the message-passing graph convolutions defined in Equation (2), we decompose the search space into the following classes of actions:

  • Hidden dimension: The trainable matrix $W^{(k)}$ extracts representative features from the embedding of the previous layer, and maps the embedding to a $d$-dimensional space. The choice of dimension $d$ is crucial to the final node classification performance. We collect the set of dimensions widely adopted by existing work as the candidates, i.e., {4, 8, 16, 32, 64, 128, 256}.

  • Attention function: Real-world graph-structured data can be both complex and noisy (Lee et al., 2018), which may lead to inefficient information aggregation. The attention mechanism helps to focus on the most relevant neighbors to improve the representation learning of node embeddings. Following the NAS framework in (Gao et al., 2019), we collect the set of attention functions shown in Table 1 to compute the coefficient $\alpha^{(k)}_{ij}$.

  • Attention head: It has been found that multi-head attention can be beneficial to stabilize the learning process (Velickovic et al., 2017; Vaswani et al., 2017). We select the number of attention heads from the set {1, 2, 4, 6, 8, 16}.

  • Aggregate function: As shown in (Xu et al., 2018), the aggregate function is crucial for capturing the neighborhood structure when learning node representations. Our GNN architectures are developed on top of the PyTorch Geometric package (Fey and Lenssen, 2019), which provides aggregate functions including summation, mean, and max pooling.

  • Combine function: The embeddings of a node itself and its aggregated neighbors are usually concatenated to combine the two sources of information. A differentiable function can then be applied to enhance the node representation learning. We select between two types of combine functions, one of which applies an MLP (a multi-layer perceptron with a fixed hidden dimension) to the concatenated embedding.

  • Activation function: The set of available activation functions in our AGNN contains the standard choices used in existing GNN models, such as sigmoid, tanh, ReLU, and ELU.

Note that a wide variety of state-of-the-art models fall into the above message-passing based GNN architecture, including Chebyshev (Defferrard et al., 2016), GCN (Kipf and Welling, 2017), GraphSAGE (Hamilton et al., 2017), GAT (Velickovic et al., 2017) and LGCN (Gao et al., 2018). We apply fixed skip connections as in (Kipf and Welling, 2017; Velickovic et al., 2017); a skip connection action could easily be incorporated into the search space if necessary. Equipped with the above design, a GNN architecture can be specified by a string of length $6n$, where $n$ denotes the number of graph convolutional layers and each layer contributes one action from each of the six classes. Suppose we target a three-layer GNN architecture, i.e., $n = 3$, which is commonly adopted in GNN models; the number of unique architectures within this search space is extremely large and diverse.
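To make the action sequence concrete, the following PyTorch Geometric sketch builds one graph convolutional layer from an action dictionary. It is a simplified illustration rather than the paper's implementation: the attention coefficients, multi-head attention, and the searched combine function are omitted (the combine step is a plain sum here), and the class `SearchableConv` and its argument names are assumptions.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import MessagePassing

class SearchableConv(MessagePassing):
    """One graph convolutional layer whose behaviour is driven by an action
    dictionary (a sketch; attention and the searched combine are simplified)."""
    def __init__(self, in_dim, actions):
        super().__init__(aggr=actions["aggregate"])          # "add" / "mean" / "max"
        self.lin = torch.nn.Linear(in_dim, actions["hidden_dim"])
        self.act = {"relu": F.relu, "tanh": torch.tanh,
                    "sigmoid": torch.sigmoid, "elu": F.elu}[actions["activation"]]

    def forward(self, x, edge_index):
        h = self.lin(x)                              # transform the embedding dimension
        h_neigh = self.propagate(edge_index, x=h)    # aggregate neighbour messages
        return self.act(h + h_neigh)                 # combine with self embedding, activate

    def message(self, x_j):
        return x_j

# Example: a layer specified by one point of the search space.
layer = SearchableConv(in_dim=8, actions={"hidden_dim": 16, "aggregate": "mean",
                                          "activation": "relu"})
x = torch.randn(4, 8)                                 # 4 nodes, 8 features
edge_index = torch.tensor([[0, 1, 2, 3], [1, 0, 3, 2]])
print(layer(x, edge_index).shape)                     # torch.Size([4, 16])
```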

Table 1. The set of candidate attention functions used to compute the coefficient $\alpha^{(k)}_{ij}$, defined in terms of a concatenation operation, trainable vectors, and a trainable matrix applied to the node embeddings.
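As one representative candidate of Table 1, the GAT-style attention mechanism (Velickovic et al., 2017) computes the coefficient from the concatenated, linearly transformed embeddings of the two endpoint nodes; in the notation of Equation (2):

$$\alpha^{(k)}_{ij} \;=\; \operatorname{softmax}_{j \in \mathcal{N}(i)}\Big(\mathrm{LeakyReLU}\big(\vec{a}^{\top}\big[\,W^{(k)} h^{(k-1)}_i \,\|\, W^{(k)} h^{(k-1)}_j\,\big]\big)\Big),$$

where $\vec{a}$ is a trainable vector, $\|$ denotes concatenation, and the softmax normalizes over the neighbors of node $i$. The other candidates differ mainly in how the unnormalized score is computed.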

4. Reinforced Conservative Controller

In this section, we elaborate on the proposed controller, which aims to search GNN architectures efficiently. The controller framework is built upon RL-based exploration guided by conservative exploitation. In traditional RL-based NAS, an RNN is applied to specify the variable-length neural architecture, and a new candidate architecture is generated at each search step: all of the action components in the neural architecture are resampled and replaced with new ones. After validating the new architecture, a scalar reward is used to update the RNN. However, it is problematic to directly apply this traditional controller to find potentially well-performing GNN architectures. The main reason is that the representation learning capacity of a GNN architecture varies significantly with slight modifications of some action classes. Taking the aggregate function as an example, the classification performance of a GNN architecture may improve by only replacing max pooling with summation (Xu et al., 2018). It is hard for the conventional controller to learn which part of the architecture modification contributes most to the performance improvement.

In order to tackle the above challenge, we propose a new search algorithm named reinforced conservative neural architecture search (RCNAS). It consists of three components: (1) a conservative explorer, which maintains the best architecture found so far; (2) a guided architecture modifier, which slightly mutates certain actions in the retained best architecture; (3) a reinforcement learning trainer, which learns the causality of architecture modifications. In the following, we introduce these three components in detail.

4.1. Conservative Explorer

As the key exploitation component, the conservative explorer maintains the best neural architecture found so far. In this way, the following architecture modification is performed on a reliable, well-performing architecture, which ensures fast exploitation towards better architectures among the offspring generated by slight architecture modifications. If an offspring architecture outperforms its parent, we update the best neural architecture; otherwise, the best one is kept and reused to generate the next offspring architecture. In practice, multiple starting points could be randomly initialized to enhance the exploration ability and avoid getting trapped in local minima.

4.2. Guided Architecture Modifier

The main role of the guided architecture modifier is to modify the best architecture found so far by selecting and mutating the action classes that await exploration. As shown on the right-hand side of Figure 1, assume the class of activation functions is selected. Correspondingly, the activation functions in all layers of the GNN architecture are resampled and replaced. This helps the controller learn the effect of modifying a specific action class.

Specifically, the architecture modification is realized in three steps: (1) for each class, an independent RNN encoder decides a sequence of new actions; (2) an action guider receives the decision entropies and selects the action classes to be modified; (3) architecture modification generates the final offspring architecture. Details are introduced as follows.

4.2.1. RNN Encoders:

As shown in Figure 1, for each class, an independent RNN encoder is implemented to decide a sequence of new actions. First, a subarchitecture string is generated by removing the actions of the concerned class; for example, for the class of activation functions, the subarchitecture is obtained by removing the activation functions from all graph convolutional layers of the best architecture. Second, following an embedding layer, the subarchitecture string is taken as input to the RNN encoder. This string represents the input status that asks for action padding of the concerned class. Third, the RNN encoder iteratively outputs a candidate action, and each output is fed into the next step as input. Note that each candidate action is sampled by feeding the hidden state into a softmax classifier. The number of decoding steps of each RNN encoder equals the number of layers to be searched in the architecture.
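A minimal PyTorch sketch of such a per-class encoder is given below. The class name, the dimensions, and the token interface are assumptions, but it follows the description above: the subarchitecture string is encoded first, then one action per searched layer is sampled from a softmax classifier, returning the log-probabilities and entropies needed later by the action guider and the REINFORCE trainer.

```python
import torch
import torch.nn as nn

class ActionEncoder(nn.Module):
    """Sketch of one per-class RNN encoder (hypothetical implementation)."""
    def __init__(self, vocab_size, num_candidates, hidden=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)          # subarchitecture tokens
        self.lstm = nn.LSTMCell(hidden, hidden)
        self.classifier = nn.Linear(hidden, num_candidates)    # softmax over candidate actions
        self.action_embed = nn.Embedding(num_candidates, hidden)

    def forward(self, subarch_tokens, num_layers):
        h = c = torch.zeros(1, self.lstm.hidden_size)
        # Encode the input status: the best architecture with the concerned class removed.
        for tok in subarch_tokens:
            h, c = self.lstm(self.embed(tok).unsqueeze(0), (h, c))
        actions, log_probs, entropies = [], [], []
        inp = h
        # Decode one candidate action per searched layer.
        for _ in range(num_layers):
            h, c = self.lstm(inp, (h, c))
            probs = torch.softmax(self.classifier(h), dim=-1)
            dist = torch.distributions.Categorical(probs)
            a = dist.sample()
            actions.append(a.item())
            log_probs.append(dist.log_prob(a))
            entropies.append(dist.entropy())
            inp = self.action_embed(a)
        return actions, torch.stack(log_probs), torch.stack(entropies)

# Example usage with arbitrary token ids for a two-layer architecture.
enc = ActionEncoder(vocab_size=30, num_candidates=8)
acts, logp, ent = enc(torch.tensor([3, 7, 1, 4, 9]), num_layers=2)
```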

4.2.2. Action Guider:

The action guider is responsible for receiving the decision entropy of each RNN encoder and selecting the classes to be modified in the retained architecture. Consider the decision entropy of class $c$. At step $i$ of its RNN encoder, the hidden state is fed into the softmax classifier, and a probability vector $P_i$ is given as output, whose $j$-th element $P_{ij}$ represents the probability of sampling action $j$. The decision entropy of class $c$ is then given by $E_c = -\sum_{i=1}^{n}\sum_{j=1}^{|\mathcal{O}_c|} P_{ij}\log P_{ij}$, where $|\mathcal{O}_c|$ denotes the action cardinality of class $c$. The decision entropy represents the uncertainty of the current subarchitecture to be explored along action class $c$.

Given the decision entropies of the six action classes, the action guider samples a class list $\mathcal{C}$ of size $m$, which is used to modify the network architecture. For example, the class of activation functions is selected in Figure 1. The larger the decision entropy of a class, the more likely it is to be sampled. The action guider helps the controller search potential architectures along the direction of greatest uncertainty, which behaves similarly to Bayesian optimization methods (Jin et al., 2018).
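The following sketch shows one way to compute the decision entropy from an encoder's softmax outputs and to sample the class list proportionally to it; the exact sampling rule used by the action guider is an assumption here.

```python
import torch

def decision_entropy(prob_seq):
    """Decision entropy of one action class: `prob_seq` holds the softmax
    outputs of its RNN encoder, shape [num_layers, num_candidates]."""
    return -(prob_seq * prob_seq.clamp_min(1e-12).log()).sum().item()

def select_classes(entropies, size=1):
    """Sample `size` classes with probability proportional to their decision
    entropy (a sketch of the action guider's behaviour)."""
    names = list(entropies)
    weights = torch.tensor([entropies[n] for n in names], dtype=torch.float)
    idx = torch.multinomial(weights / weights.sum(), size, replacement=False)
    return [names[i] for i in idx.tolist()]

# Example: three classes with different uncertainties.
ent = {"activation": 1.9, "aggregate": 0.3, "hidden_dim": 1.1}
print(select_classes(ent, size=1))   # "activation" is the most likely pick
```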

4.2.3. Architecture Modification:

The best architecture found so far is modified by replacing the corresponding actions of each class in the list $\mathcal{C}$. In Figure 1, the newly decided actions are applied to replace the activation functions in all of the graph convolutional layers. When the list includes only one class, we modify the retained neural architecture at the minimum level. If the list covers all six classes, our controller resamples actions in the whole architecture, similar to the traditional controller.

4.3. Reinforcement Learning Trainer

We use the REINFORCE rule of policy gradient (Sutton et al., 2000) to update the parameters $\theta_c$ of the RNN encoder of class $c$. Let $a^c_{1:n}$ denote the list of actions decided for class $c$. We have the following update rule (Zoph and Le, 2016):

(3)   $\nabla_{\theta_c} J(\theta_c) = \sum_{t=1}^{n} \mathbb{E}\Big[(R_c - b_c)\, \nabla_{\theta_c} \log P\big(a^c_t \mid a^c_{(t-1):1}; \theta_c\big)\Big],$

where $R_c$ denotes the reward for taking the decisions of class $c$, and $b_c$ denotes the baseline of class $c$ used for variance reduction. Let $M(f)$ and $M(f')$ denote the model performances of the best architecture found so far and its offspring, respectively. We propose the reward shaping $R_c = M(f') - M(f)$, which represents the performance variation brought by modifying the retained architecture on class $c$.
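A minimal sketch of this update, assuming the log-probabilities come from a per-class encoder such as the one sketched in Section 4.2.1 and that a moving-average baseline is maintained externally:

```python
import torch

def reinforce_update(optimizer, log_probs, offspring_score, best_score, baseline=0.0):
    """One REINFORCE step for the RNN encoder of a single action class,
    following Equation (3). `log_probs` are the log-probabilities of the
    decided actions (with grad); the reward is M(offspring) - M(best)."""
    reward = offspring_score - best_score
    loss = -(reward - baseline) * log_probs.sum()   # policy-gradient surrogate loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward

# Example usage with the sketched ActionEncoder:
#   opt = torch.optim.Adam(enc.parameters(), lr=3e-4)
#   reinforce_update(opt, logp, offspring_score=0.82, best_score=0.80)
```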

5. Constrained Parameter Sharing

Compared to training from scratch, parameter sharing reduces the computation cost by letting the offspring architecture reuse weights that have already been trained in an ancestor architecture. However, the traditional strategy cannot be directly applied to share weights among heterogeneous GNN architectures. We say that two neural architectures are heterogeneous if they have different shapes of trainable weights or different output statistics. First, a different weight shape in the offspring architecture prevents direct transfer from an ancestor architecture. Second, the weights are deeply trained and coupled in the ancestor architecture; weights shared from a heterogeneous architecture with different output statistics may lead to output explosion and unstable training (Guo et al., 2019). For instance, weights trained under a bounded activation function may be unsuitable for a layer with an unbounded activation function, whose outputs fall in a very different interval. Third, the shared weights in the connection layers may not be immediately effective and adapted to the offspring architecture. The connection layers are given by batch normalization or skip connections, and may not be coupled to the offspring architecture.

To tackle the above challenges, we propose the constrained parameter sharing strategy, which limits how the offspring architecture inherits parameters from the ancestor architectures found before. As shown in Figure 2, we explain the three constraints as follows; a code sketch of the resulting sharing rule is given after the list.

  • The ancestor and offspring architectures must have the same shapes of input and output tensors for a graph convolutional layer. Based on the graph convolutions defined in Equation (2), both the trainable matrix $W^{(k)}$ and the transform weights used in the attention function can be shared directly only if they have the same shape.

  • The ancestor and offspring architectures must have the same attention function and activation function for a graph convolutional layer. The attention function defines the neighbor information to be aggregated, and the activation function squashes the output to a specific interval; hence both greatly determine the output statistics of a graph convolutional layer. Sharing parameters only between homogeneous architectures with similar output statistics is expected to avoid output explosion and improve training stability.

  • The parameters of batch normalization (BN) and skip connections (SC) are not shared, because we do not know the exact output statistics of each layer in the offspring architecture in advance; shared BN and SC parameters may fail to bridge two successive layers well. Instead, we train the whole offspring architecture for a few warm-up epochs to adapt these parameters to the new architecture.
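The sketch below captures the sharing rule implied by the three constraints; the layer interface (an `actions` dictionary and a `lin` sub-module per layer) is a hypothetical one.

```python
def share_parameters(ancestor, offspring):
    """Sketch of the constrained sharing rule; `ancestor`/`offspring` are
    assumed to expose aligned lists of layers, each carrying its action dict
    and its PyTorch sub-modules (hypothetical interface)."""
    for anc, off in zip(ancestor.layers, offspring.layers):
        same_shape = anc.lin.weight.shape == off.lin.weight.shape            # constraint 1
        same_funcs = (anc.actions["attention"] == off.actions["attention"]   # constraint 2
                      and anc.actions["activation"] == off.actions["activation"])
        if same_shape and same_funcs:
            off.lin.load_state_dict(anc.lin.state_dict())
        # Constraint 3: batch-norm and skip-connection parameters are never
        # copied; they are re-trained during the short warm-up instead.
```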


Figure 2. An illustration of the constrained parameter sharing strategy between the ancestor and offspring architectures. The trainable parameters of a convolutional layer can only be shared when the two layers have the same weight shape (constraint 1) and the same attention and activation functions (constraint 2). Constraint 3 removes parameter sharing for batch normalization (BN) and skip connections (SC).

6. Experiments

We apply our method to find the optimal GNN architecture for the node classification task, and answer the following four questions:

  • Q1: How does the GNN architecture discovered by AGNN compare with state-of-the-art handcrafted architectures and the ones searched by other methods?

  • Q2: How does the search efficiency of RCNAS controller compare with those of other search methods?

  • Q3: Does the constrained strategy share weights effectively, helping the offspring architectures achieve good classification performance?

  • Q4: How do different scales of architecture modification affect the search efficiency of the RCNAS controller?

More details about the datasets, baseline methods, experimental configuration and results are introduced as follows.

Table 2. Statistics of the datasets Cora, Citeseer, Pubmed (transductive, T) and PPI (inductive, I) (Velickovic et al., 2017; Gao et al., 2018): numbers of nodes, features, classes, and training/validation/testing nodes (graphs for PPI).

6.1. Datasets

We consider both transductive and inductive learning settings for the node classification task. Under transductive learning, the unlabeled data used for validation and testing are accessible during training; the training process can therefore make use of the complete graph structure and node features, except for the node labels of the held-out validation and testing sets. Under inductive learning, the training process has no access to the graph structure and node features of the validation and testing sets.

We utilize Cora, Citeseer and Pubmed (Sen et al., 2008) for transductive learning, and PPI for inductive learning (Zitnik and Leskovec, 2017). These benchmark datasets are commonly used for studying the node classification task, and their statistics are given in Table 2. The three datasets evaluated under transductive learning are citation networks, where each node corresponds to a document and each edge corresponds to a citation relation. The node features are given by the bag-of-words representation of a document, and each node is associated with a class label. Following the same experimental setting as the baseline methods, we use a small number of labeled nodes per class for training, and hold out separate sets of nodes for validation and testing. The PPI dataset evaluated under inductive learning consists of graphs corresponding to different human tissues. The node features include positional gene sets, motif gene sets and immunological signatures, and each node may carry several labels simultaneously. We use disjoint sets of graphs for training, validation and testing. The model metric is classification accuracy for transductive learning and the micro-averaged F1 score for inductive learning.
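For reference, both groups of benchmarks ship with PyTorch Geometric, so a minimal loading sketch looks as follows (the root paths are illustrative):

```python
from torch_geometric.datasets import Planetoid, PPI

# The citation networks (transductive) and PPI (inductive) are available
# directly from PyTorch Geometric.
cora = Planetoid(root="/tmp/Cora", name="Cora")
ppi_train = PPI(root="/tmp/PPI", split="train")

print(cora[0])            # a single graph used for transductive learning
print(len(ppi_train))     # the training split of PPI contains multiple graphs
```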

6.2. Baseline Methods

In order to evaluate our method, which is designed specifically for finding GNN architectures, we consider both state-of-the-art handcrafted architectures and other NAS approaches as baselines.

  • Handcrafted architectures: For a fair comparison, we only consider the message-passing based GNNs of the form in Equation (2), excluding those combined with pooling layers. The following baseline methods are included: Chebyshev (Defferrard et al., 2016), GCN (Kipf and Welling, 2017), GraphSAGE (Hamilton et al., 2017), GAT (Velickovic et al., 2017), and LGCN (Gao et al., 2018). Note that both Chebyshev and GCN perform information aggregation based on the Laplacian or adjacency matrix of the complete graph, and hence are only evaluated under the transductive setting. GraphSAGE aggregates information by sampling a fixed number of neighbors, and is compared only under the inductive setting. We consider variants of GraphSAGE with different aggregate functions, including GraphSAGE-GCN, GraphSAGE-mean, GraphSAGE-pool and GraphSAGE-LSTM.

  • NAS approaches: We compare with previous NAS approaches based on reinforcement learning and random search. The former uses an RNN to sample the whole neural architecture and applies the reinforcement rule to update the controller; GraphNAS (Gao et al., 2019) applies this approach directly to search GNN architectures. The latter samples architectures randomly, serving as a baseline to evaluate the efficiency of our controller.

6.3. Training Details

We train each sampled neural architecture on the training set, and update the controller using the reward received from the validation set. Following the model configurations of the baselines (Velickovic et al., 2017; Gao et al., 2018), the training experiments are set up for transductive learning and inductive learning, respectively, while the controller uses a unified configuration. More details about our experimental procedure are introduced as follows.

6.3.1. Transductive Learning

Here we explore a two-layer GNN architecture in the predefined search space. Except that the neural architecture is updated iteratively during the search process, we use the same training environment as the baselines. To deal with the small training sets, we apply L2 regularization, and dropout is applied to both layers' inputs as well as to the attention coefficients during training. For the Pubmed dataset, the L2 regularization is strengthened.

For each sampled architecture, the weights are initialized using Glorot initialization (Glorot and Bengio, 2010) and trained with the Adam optimizer (Kingma and Ba, 2014) to minimize the cross-entropy loss. The initial learning rate is set separately for Pubmed and for Cora and Citeseer. We have two different settings for training a new offspring architecture: with and without parameter sharing. The former uses a small number of warm-up epochs, while the latter trains each architecture for the full number of epochs.

Table 3. Test performance comparison under the transductive learning setting (Cora, Citeseer, Pubmed; #Params and accuracy per dataset): the state-of-the-art handcrafted architectures (Chebyshev, GCN, GAT, LGCN), the optimal architectures found by the NAS baselines (GraphNAS and random search, each with and without parameter sharing), and the optimal architectures found by AGNN (with and without parameter sharing).

6.3.2. Inductive Learning

Here we explore a three-layer GNN architecture. Skip connections between the intermediate graph convolutional layers are included to improve the representation learning. Since the PPI dataset is sufficiently large for training, L2 regularization and random dropout are removed from the GNN model, and training is performed with mini-batches of graphs.

We use the same parameter initialization and optimizer as in the transductive learning setting, with a fixed initial learning rate. The number of training epochs for a new architecture is set separately for the settings with and without parameter sharing.

6.3.3. Controller

For each action class, the RNN encoder is realized by a one-layer LSTM. Its weights are initialized uniformly and trained with the Adam optimizer. Following the controller configurations in previous NAS work, we apply a constant scaling and a sampling temperature to the hidden output. A fixed budget of architectures is explored iteratively during the search process and evaluated to obtain the rewards for updating the controller. The reward fed to the policy gradient combines the validation performance with a weighted controller entropy term.

Table 4. Test performance comparison on PPI (#Params and micro-F1 score) under the inductive learning setting: the handcrafted architectures (GraphSAGE-GCN, GraphSAGE-mean, GraphSAGE-pool, GraphSAGE-LSTM, GAT, LGCN), the NAS baselines (GraphNAS and random search, with and without parameter sharing), and AGNN (with and without parameter sharing).

6.4. Results

In this section, we show the comparative evaluation experiments to answer the above four research questions.

6.4.1. Test Performance Comparison

We compare the architecture discovered by AGNN with the handcrafted ones and those found by other search methods, aiming to provide a positive answer to research question Q1. For the architecture modification in AGNN, we use the default size of the class list. All NAS approaches select the optimal architecture achieving the best performance on the separate held-out validation set, which is then evaluated on the testing set only once. Comprehensive comparisons of architecture information and model performance are presented in Tables 3 and 4 for transductive and inductive learning, respectively. The test performance of the NAS approaches is averaged over several random reinitializations of the optimal architecture, while that of the handcrafted architectures is reported directly from their papers.

As can be seen from Tables 3 and 4, the neural architectures discovered by AGNN outperform the handcrafted ones and those found by other search methods. Compared with the handcrafted architectures, the discovered models generally improve the classification performance at the cost of a larger parameter size, as larger attention head numbers and hidden dimensions are explored to improve the representation learning capacity of the GNN. GraphNAS and random search sample and reconstruct the whole neural architecture at each step, similar to previous NAS frameworks. In contrast, AGNN explores offspring architectures by modifying only specific action classes, and the best architecture is retained to provide a good starting point for architecture modification. This helps the controller learn the causality between architecture modifications and model performance variations, and makes it more likely to find better architectures.

It is also observed that the architectures found without parameter sharing generally outperform those found with parameter sharing, because the shared parameters may not be fully coupled to the offspring architecture even after several warm-up epochs. Running on a single Nvidia GTX 1080Ti GPU, the search without parameter sharing takes a few times more GPU days than the search with parameter sharing. There is thus a trade-off between model performance and computation cost.

Figure 3. Progression of the averaged performance of the top architectures found by different search methods (AGNN, GraphNAS, and random search) on PPI, Cora, Citeseer, and Pubmed.

6.4.2. Search Efficiency Comparison

We compare the progression of the averaged performance of the top architectures found by AGNN, GraphNAS and random search, in order to provide a positive answer to research question Q2. All search methods are performed without parameter sharing so as to isolate the efficiency of the controllers. For each search method, the same number of architectures is explored in the same search space. The progression comparisons on the four datasets are shown in Figure 3.

As can be seen from Figure 3, AGNN finds well-performing architectures more efficiently during the search process. The top architectures discovered by AGNN have better averaged performance on PPI and Citeseer. This is because the best architecture found so far is retained and only slightly modified at the next step; only some actions are resampled to generate the offspring architecture, which accelerates the search towards better neural architectures among the offspring.

6.4.3. Effectiveness Validation of Parameter Sharing

Here we study whether the shared parameters are effective in the offspring architecture, i.e., whether they help it achieve good classification performance, aiming to answer research question Q3. We consider AGNN equipped with different parameter sharing strategies: the proposed constrained one, the relaxed one in GraphNAS, and training from scratch without parameter sharing. Note that the relaxed parameter sharing in GraphNAS is similar to that in previous NAS frameworks, where the offspring architecture directly shares weights of the same shape without any further constraint. The cumulative distributions of validation performance of the discovered architectures are compared in Figure 4.

Figure 4. The cumulative distribution of validation performance for AGNN on PPI, Cora, Citeseer, and Pubmed under different parameter sharing strategies: the proposed constrained one, the relaxed one in GraphNAS, and training from scratch without parameter sharing.

As can be seen from Figure 4, most of the neural architectures found with constrained parameter sharing perform better than those found with the relaxed strategy. This is because the designed constraints limit parameter sharing to homogeneous architectures with similar output statistics. Combined with a few warm-up epochs for the batch normalization and skip connection weights, the shared parameters become effective for the newly sampled architecture. In addition, each offspring architecture is generated by a slight modification of the best architecture found so far, which means the two potentially have similar structures and output statistics, so the well-trained weights can be transferred to the offspring architecture stably. Although training from scratch couples the weights to each architecture perfectly, it incurs a much higher computation cost.

Figure 5. The progression of the averaged performance of the top architectures found by AGNN on PPI, Cora, Citeseer, and Pubmed under different scales of architecture modification.

6.4.4. Influence of Architecture Modification

We study how different scales of architecture modification affect the search efficiency, in order to answer research question Q4. Recall that the action classes in the list $\mathcal{C}$ are used to modify the retained architecture, and we denote the size of this list by $m$. When $m = 1$, we perform the architecture modification at the minimum level, resampling the actions of only one class. When $m$ covers all action classes, we modify the retained architecture completely, similar to the traditional controller. We show the progression of the top architectures under the parameter sharing setting for several values of $m$ in Figure 5.

As can be seen from Figure 5, the architecture search tends to be more efficient as $m$ decreases. The top architectures found with the smallest modification scale achieve the best averaged performance on PPI and Citeseer. The efficiency of smaller $m$ benefits from two facts. First, the offspring architecture tends to have a similar structure and output statistics to the retained one, so the shared weights are more likely to be effective in the offspring architecture. Second, each independent RNN encoder can precisely learn the causality between the performance variation and the architecture modification of its own class, and thus tends to sample well-performing architectures at the next step.

7. Related Work

Our work is related to the graph neural networks and neural architecture search.

Graph Neural Networks. A wide variety of GNNs have been proposed to learn node representations effectively, e.g., recursive neural networks (Gori et al., 2005; Scarselli et al., 2009), graph convolutional networks (Bruna et al., 2013; Defferrard et al., 2016; Kipf and Welling, 2017; Hamilton et al., 2017; Gao et al., 2018) and graph attention networks (Velickovic et al., 2017; Vaswani et al., 2017). Most of these approaches are built upon message-passing based graph convolutions, where the underlying graph is viewed as a computation graph and node embeddings are generated via message passing, information transformation, neighbor aggregation and self update.

Neural Architecture Search. Most NAS frameworks are built upon one of two basic algorithms: RL (Zoph and Le, 2016; Zoph et al., 2018; Pham et al., 2018; Cai et al., 2017; Baker et al., 2016) and evolutionary algorithms (EA) (Liu et al., 2017; Real et al., 2017; Miikkulainen et al., 2019; Xie and Yuille, 2017; Real et al., 2019). In the former, an RNN controller is applied to specify the variable-length strings of neural architectures, and the controller is updated with policy gradient after evaluating the sampled architecture on the validation set. In the latter, a population of architectures is initialized first and evolved with mutation and crossover, and the architectures with competitive performance are retained during the search. A recent framework combines these two search algorithms to improve the search efficiency (Chen et al., 2018). Parameter sharing (Pham et al., 2018) transfers previously well-trained weights to a newly sampled architecture, avoiding training the offspring architecture from scratch to convergence.

8. Conclusion

In this paper, we present AGNN to find the optimal neural architecture for a given node classification task. The search space, the RCNAS controller and the constrained parameter sharing strategy are designed specifically for message-passing based GNNs. Experimental results show that the discovered neural architectures achieve quite competitive performance on both transductive and inductive learning tasks, that the proposed RCNAS controller searches well-performing architectures more efficiently, and that the shared weights are effective in the offspring networks under the proposed constraints. For future work, we will first apply AGNN to discover architectures for more applications such as graph classification and link prediction. Second, we plan to consider more advanced graph convolution techniques in the search space, to facilitate neural architecture search in different applications.

References

  • [1] T. B. Aynaz Taheri (2018) Learning graph representations with recurrent neural network autoencoders. In KDD'18 Deep Learning Day, Cited by: §1.
  • [2] B. Baker, O. Gupta, N. Naik, and R. Raskar (2016) Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167. Cited by: §7.
  • [3] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun (2013) Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203. Cited by: §7.
  • [4] H. Cai, T. Chen, W. Zhang, Y. Yu, and J. Wang (2017) Reinforcement learning for architecture search by network transformation. arXiv preprint arXiv:1707.04873. Cited by: §7.
  • [5] Y. Chen, G. Meng, Q. Zhang, S. Xiang, C. Huang, L. Mu, and X. Wang (2018) Reinforced evolutionary neural architecture search. arXiv preprint arXiv:1808.00193. Cited by: §7.
  • [6] M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3844–3852. Cited by: §3, 1st item, §7.
  • [7] T. Elsken, J. H. Metzen, and F. Hutter (2018) Neural architecture search: a survey. arXiv preprint arXiv:1808.05377. Cited by: §1.
  • [8] M. Fey and J. E. Lenssen (2019) Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, Cited by: 4th item.
  • [9] H. Gao, Z. Wang, and S. Ji (2018) Large-scale learnable graph convolutional networks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1416–1424. Cited by: §3, 1st item, §6.3, Table 2, §7.
  • [10] Y. Gao, H. Yang, P. Zhang, C. Zhou, and Y. Hu (2019) GraphNAS: graph neural architecture search with reinforcement learning. arXiv preprint arXiv:1904.09981. Cited by: 2nd item, 2nd item.
  • [11] X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256. Cited by: §6.3.1.
  • [12] M. Gori, G. Monfardini, and F. Scarselli (2005) A new model for learning in graph domains. In Neural Networks, 2005. IJCNN’05. Proceedings. 2005 IEEE International Joint Conference on, Vol. 2, pp. 729–734. Cited by: §1, §7.
  • [13] A. Grover and J. Leskovec (2016) Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855–864. Cited by: §1.
  • [14] Z. Guo, X. Zhang, H. Mu, W. Heng, Z. Liu, Y. Wei, and J. Sun (2019) Single path one-shot neural architecture search with uniform sampling. arXiv preprint arXiv:1904.00420. Cited by: §1, §5.
  • [15] W. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034. Cited by: §1, §1, §3, 1st item, §7.
  • [16] H. Jin, Q. Song, and X. Hu (2018) Auto-keras: efficient neural architecture search with network morphism. arXiv preprint arXiv:1806.10282. Cited by: §1, §4.2.2.
  • [17] K. Kandasamy, W. Neiswanger, J. Schneider, B. Poczos, and E. P. Xing (2018) Neural architecture search with bayesian optimisation and optimal transport. In Advances in Neural Information Processing Systems, pp. 2016–2025. Cited by: §1.
  • [18] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §6.3.1.
  • [19] T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. International Conference on Learning Representation. Cited by: §3, 1st item, §7.
  • [20] J. B. Lee, R. A. Rossi, S. Kim, N. K. Ahmed, and E. Koh (2018) Attention models in graphs: a survey. arXiv preprint arXiv:1807.07984. Cited by: 2nd item.
  • [21] C. Liu, L. Chen, F. Schroff, H. Adam, W. Hua, A. L. Yuille, and L. Fei-Fei (2019) Auto-deeplab: hierarchical neural architecture search for semantic image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 82–92. Cited by: §1.
  • [22] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy (2018) Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 19–34. Cited by: §1.
  • [23] H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu (2017) Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436. Cited by: §1, §7.
  • [24] H. Liu, K. Simonyan, and Y. Yang (2018) Darts: differentiable architecture search. arXiv preprint arXiv:1806.09055. Cited by: §1.
  • [25] R. Luo, F. Tian, T. Qin, E. Chen, and T. Liu (2018) Neural architecture optimization. In Advances in neural information processing systems, pp. 7816–7827. Cited by: §1.
  • [26] R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, H. Shahrzad, A. Navruzyan, N. Duffy, et al. (2019) Evolving deep neural networks. In Artificial Intelligence in the Age of Neural Networks and Brain Computing, pp. 293–312. Cited by: §7.
  • [27] B. Perozzi, R. Al-Rfou, and S. Skiena (2014) Deepwalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 701–710. Cited by: §1.
  • [28] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean (2018) Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268. Cited by: §1, §7.
  • [29] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le (2019) Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4780–4789. Cited by: §7.
  • [30] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le, and A. Kurakin (2017) Large-scale evolution of image classifiers. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2902–2911. Cited by: §7.
  • [31] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini (2009) The graph neural network model. IEEE Transactions on Neural Networks 20 (1), pp. 61–80. Cited by: §1, §7.
  • [32] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad (2008) Collective classification in network data. AI magazine 29 (3), pp. 93–93. Cited by: §6.1.
  • [33] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour (2000) Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063. Cited by: §4.3.
  • [34] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei (2015) Line: large-scale information network embedding. In Proceedings of the 24th international conference on world wide web, pp. 1067–1077. Cited by: §1.
  • [35] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: 3rd item, §7.
  • [36] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2017) Graph attention networks. arXiv preprint arXiv:1710.10903 1 (2). Cited by: §1, 3rd item, §3, 1st item, §6.3, Table 2, §7.
  • [37] D. Wang, P. Cui, and W. Zhu (2016) Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1225–1234. Cited by: §1.
  • [38] H. Wang and J. Huan (2019) AGAN: towards automated design of generative adversarial networks. arXiv preprint arXiv:1906.11080. Cited by: §1.
  • [39] L. Xie and A. Yuille (2017) Genetic cnn. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1379–1388. Cited by: §7.
  • [40] S. Xie, A. Kirillov, R. Girshick, and K. He (2019) Exploring randomly wired neural networks for image recognition. arXiv preprint arXiv:1904.01569. Cited by: §1.
  • [41] K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2018) How powerful are graph neural networks?. CoRR abs/1810.00826. External Links: 1810.00826 Cited by: §1, 4th item, §4.
  • [42] M. Zitnik and J. Leskovec (2017) Predicting multicellular function through multi-layer tissue networks. Bioinformatics 33 (14), pp. i190–i198. Cited by: §1, §6.1.
  • [43] B. Zoph and Q. V. Le (2016) Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578. Cited by: §1, §1, §4.3, §7.
  • [44] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2018) Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8697–8710. Cited by: §1, §7.