ESOL [delaney2004esol], FreeSolv [mobley2014freesolv], and Lipophilicity [hersey_chembl_2015] are three representative molecular benchmark datasets used in our experiments; they correspond to the tasks of predicting aqueous solubility, hydration free energy, and the octanol/water distribution coefficient, respectively. Their sizes are 1,128, 642, and 4,200 molecules, respectively. Each dataset is split into training, validation, and test sets with a fixed ratio. The training set is used to train the GNNs, the validation set guides the HPO, and the test set is used for the final evaluation. Compared with common HPO experimental designs, we made one modification: the evaluation of each solution (hyperparameter setting) is repeated three times, and the mean of the root mean squared errors (RMSE) is used to score the solution.

We employed the graph convolution model (GC) [duvenaud2015convolutional] to predict the properties in the above three datasets. GC was designed with molecular domain knowledge, which fits our research problem (molecular property prediction) better than other GNNs. For ease of implementation, we leveraged DeepChem (a Python toolkit for deep learning in drug discovery, materials science, quantum chemistry, and biology) [Ramsundar-et-al-2019] to preprocess the molecular data and to implement GC. In addition, we used Optuna [akiba2019optuna] to conduct the HPO experiments with CMA-ES.

To assess the impact of optimizing different types of hyperparameters on the performance of GNNs, we first consider the graph-related hyperparameters. For GC, these include the number of graph convolution layers, the sizes of the graph convolution layers, and the size of the dense layer that GC uses to generate molecular representations. The number of graph convolution layers is searched with a step size of 1, the graph convolution layer sizes with a step size of 32, and the dense layer size with a step size of 64. These ranges are set according to the default values provided in [wu2018moleculenet], while the step sizes follow [yuan2021novel].
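As a concrete illustration of the repeated-evaluation scheme, the scoring of one hyperparameter setting can be sketched in plain Python; `train_and_eval` is a hypothetical stand-in for training a GC model (with DeepChem, in our experiments) and returning validation-set targets and predictions.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def score_setting(train_and_eval, hyperparams, repeats=3):
    """Score a hyperparameter setting by the mean RMSE over repeated runs.

    `train_and_eval` is a hypothetical callable that trains a model with
    `hyperparams` and returns (y_true, y_pred) on the validation set.
    Repeating the evaluation and averaging smooths out training noise.
    """
    errors = []
    for _ in range(repeats):
        y_true, y_pred = train_and_eval(hyperparams)
        errors.append(rmse(y_true, y_pred))
    return sum(errors) / len(errors)
```

Averaging over three repeats is what makes the scores comparable across trials despite the stochasticity of training.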
As for the task-specific hyperparameters (i.e., the hyperparameters in the task-specific layers), we employed a simple feedforward neural network consisting of a few fully-connected layers to predict the molecular properties. The task-specific hyperparameters therefore include the number of fully-connected layers (excluding the output layer), the sizes of those layers, and the activation function, for which three options are available. The fully-connected layer sizes are searched with a step size of 64, and the arrangement of these layer-size hyperparameters mirrors that of the graph convolution layers to facilitate analysis. The above hyperparameters are summarized in Table Document. Furthermore, the numbers of graph convolution layers and fully-connected layers determine how many layer-size hyperparameters exist, so our search space is dynamic: its dimensionality changes during the search, which makes HPO more challenging.
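The two groups of hyperparameters can be written down as a small search-space sketch; the concrete ranges and activation options below are illustrative placeholders, not necessarily the exact values used in our experiments.

```python
# Illustrative hyperparameter space for GC, mirroring the two groups
# described above (graph-related vs. task-specific hyperparameters).
GRAPH_SPACE = {
    "n_graph_layers": range(1, 5),             # searched with step size 1
    "graph_layer_size": range(32, 257, 32),    # step size 32
    "dense_layer_size": range(64, 513, 64),    # step size 64
}
TASK_SPACE = {
    "n_fc_layers": range(0, 4),                # excluding the output layer
    "fc_layer_size": range(64, 513, 64),       # step size 64
    "activation": ["relu", "tanh", "sigmoid"], # three illustrative options
}

def dimensionality(n_graph_layers, n_fc_layers):
    """Number of layer-size/choice hyperparameters for given layer counts.

    One size per graph convolution layer, one dense layer size, one size
    per fully-connected layer, plus the activation choice -- so the layer
    counts themselves change the problem dimension, which is what makes
    the search space dynamic.
    """
    return n_graph_layers + 1 + n_fc_layers + 1
```

For example, two graph layers and one fully-connected layer give a 5-dimensional layer-configuration subspace, while three and two give a 7-dimensional one.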
Pseudo-dynamic Search Space
However, CMA-ES does not support a dynamic search space [akiba2019optuna]. We therefore implement a pseudo-dynamic process, and the process of HPO is shown in Algorithm Document. The input of Algorithm Document is the entire hyperparameter space together with the number of hyperparameters it contains. A dynamic hyperparameter is represented as a list; in our experiments, for example, the graph-layer-size hyperparameter is a list in which each element is the size of the corresponding graph convolution layer, and the number of elements is determined by the paired hyperparameter counting the graph convolution layers. Since it is not possible to change the dimensionality of the multivariate normal distribution after initialization, we let CMA-ES always sample the maximum number of elements, and use the paired count hyperparameter to decide how many of them are used to instantiate the model. In this way, the search space maintained by CMA-ES never changes, but in practice the sampled counts affect which models are generated.

Algorithm Document: pseudo-dynamic HPO with CMA-ES
  Input: the hyperparameter space H (for a dynamic hyperparameter, a paired
         hyperparameter in H determines its number of elements); a GNN g;
         the total number of trials T
  1: initialize CMA-ES
  2: sort(H)                       // move all dynamic hyperparameters backward
  3: for t = 1 to T do
  4:     S <- []                   // null list for collecting sampled hyperparameters
  5:     for each hyperparameter h in H do
  6:         if h is a dynamic hyperparameter then
  7:             look up the value k of its paired count hyperparameter in S
  8:             sample the maximum number of elements via CMA-ES.suggest()
                 and append only the first k of them to S
  9:         else
 10:             append CMA-ES.suggest(h) to S
 11:     instantiate g with S and evaluate it
 12:     update CMA-ES with the evaluation result

Overall, we designed four sets of experiments. The first set takes the default hyperparameter values from DeepChem [Ramsundar-et-al-2019]
to train a GC model thirty times on the ESOL, FreeSolv, and Lipophilicity datasets, respectively. The average root mean squared error (Mean RMSE) and standard deviation (Mean Std) are shown in Table Document. The batch size, the number of training epochs, and the learning rate are not optimized, because this study focuses on the impact of optimizing the two different types of GNN hyperparameters; these three values are fixed across all our experiments. In Table Document, the task-specific hyperparameters are 0 or none because the task-specific part only has a single output layer (i.e., no hidden layers) according to the original settings in DeepChem [Ramsundar-et-al-2019]. The purpose of this group of experiments is to set a baseline for the following experiments.
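The pseudo-dynamic sampling of Algorithm Document can be sketched in Python; `suggest` is a hypothetical stand-in for the CMA-ES sampler over a fixed search space (Optuna's CMA-ES sampler plays this role in our experiments), and the bound and ranges below are illustrative.

```python
MAX_GRAPH_LAYERS = 4  # illustrative upper bound; CMA-ES always samples this many sizes

def sample_trial(suggest):
    """Sample from a fixed-dimensional space, then truncate the dynamic part.

    `suggest(name, choices)` is a hypothetical stand-in for a CMA-ES sampler.
    The sampler's search space never changes: it always proposes
    MAX_GRAPH_LAYERS layer sizes, but only the first `n_layers` of them
    are used to instantiate the model.
    """
    n_layers = suggest("n_graph_layers", list(range(1, MAX_GRAPH_LAYERS + 1)))
    all_sizes = [suggest(f"size_{i}", list(range(32, 257, 32)))
                 for i in range(MAX_GRAPH_LAYERS)]
    return {"n_graph_layers": n_layers, "layer_sizes": all_sizes[:n_layers]}
```

The truncation in the last line is the whole trick: the distribution maintained by CMA-ES keeps its dimensionality, while the instantiated model sees a variable number of layers.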
The other three sets of experiments conduct HPO for GC on ESOL, FreeSolv, and Lipophilicity, respectively. In Section 3.2, we discuss the process of HPO on the three datasets (Figs. DocumentDocument). Thereafter, we analyse the results in detail for each dataset (Tables DocumentDocument).
The Process of Hyperparameter Optimization
In Figs. DocumentDocument, the x-axis denotes the index of each trial; one trial represents one process of sampling and evaluating a new hyperparameter setting, and CMA-ES was assigned 200 trials per dataset. The y-axis denotes the RMSE used to evaluate each trial; every trial in our experiments is evaluated three times, and the mean values are plotted. We use the mean of multiple evaluations because we observed that single evaluations are unstable, which may mislead the HPO. In Figs. DocumentDocument, the blue points represent the RMSEs when optimizing graph-related hyperparameters, the green points when optimizing task-specific hyperparameters, and the orange points when optimizing both types simultaneously. The nine lines represent the trends of performing HPO on the different types of hyperparameters; each line is drawn by connecting all current best points in time sequence. Most of the lines show clearly decreasing trends, which indicates that CMA-ES works for optimizing these hyperparameters. Furthermore, in Figs. Document and Document, the decreasing trends of the RMSEs for the red and purple lines are less pronounced than those of the blue lines. From these observations, Fig. Document implies that appropriate settings for both graph and task-specific layers are needed together, and that they may complement each other to achieve better performance in molecular prediction tasks. Overall, the three figures indicate that optimizing both types of hyperparameters yields larger gains given the same number of trials, even though the search space becomes larger because the number of possible hyperparameter combinations increases.
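The lines connecting the current best points correspond to a running minimum over the trial history, which can be computed as:

```python
def best_so_far(rmses):
    """Trace of the current best RMSE after each trial.

    Connecting these values in time sequence gives exactly the kind of
    monotonically non-increasing trend line drawn in the HPO figures.
    """
    trace, best = [], float("inf")
    for r in rmses:
        best = min(best, r)
        trace.append(best)
    return trace
```

A decreasing trace indicates the optimizer keeps finding better hyperparameter settings; a flat trace means the search has stalled.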
The best hyperparameter values obtained in Section 3.2 are used to instantiate GCs, and these GCs are trained on the ESOL, FreeSolv, and Lipophilicity datasets, respectively. The detailed results are shown in Tables DocumentDocument. In general, the models configured with the CMA-ES-optimized hyperparameters achieved better performance on all three datasets than the original ones (Table Document). For example, on ESOL, the RMSE of the GC with default hyperparameters (Table Document) is 1.1570 on the test set, while the models with HPO on graph layers, task-specific layers, and both achieve Mean RMSEs of 1.0854, 0.9505, and 0.8824, respectively (Table Document). To analyse the improvements statistically, we conducted t-tests on the test-set RMSEs between Table Document and Table Document, obtaining t-values of 4.0000, 12.7625, and 18.1311; at the chosen significance level, all three optimized configurations are significantly better than the original one. Tables DocumentDocument also show that HPO on both graph and fully-connected (task-specific) layers outperforms HPO on either the graph layers or the fully-connected layers alone. With the same number of trials, optimizing both types of layers faces a larger search space, yet it still achieved the most promising performance. Meanwhile, in Tables Document and Document
, we observed that optimizing only the fully-connected layers suffers from a relatively more serious over-fitting problem than optimizing both types of hyperparameters: it consistently obtained lower RMSE values on the training sets but larger RMSE values on the validation and test sets. This indicates that optimizing the fully-connected layers alone helps to fit the training data, but without optimizing the graph layers the molecular representations may not be learnt as well, which reduces the performance of the GNNs on the test set. Conducting HPO on the graph layers only achieved lower performance than HPO on the task-specific layers or on both, across all three datasets. We believe the reason is that the default setting of GC only provides an output layer without hidden layers; the molecular representations are thus only passed to a linear layer (without non-linear transformation), which dramatically restricts the learning capability. Interestingly, after HPO, the activation function hyperparameter was assigned the same value in all experiments, matching the choice described in [duvenaud2015convolutional], where the activation function was manually selected. In summary, although graph layers and task-specific layers play different roles in GNNs, they need to be optimized together when solving practical problems: a better graph representation learned by the graph layers needs to be supported by tailored task-specific layers to accomplish the task, and, similarly, the task-specific layers need appropriate graph representations to achieve good performance. From the above analysis, we conclude that even when only a limited computational budget is available, we should still optimize the hyperparameters of both types of layers rather than focusing on one of them.
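For reference, the two-sample t-statistic used to compare the test-set RMSEs can be computed with a short helper; this is a minimal sketch assuming the pooled-variance Student form, and the exact variant used in practice may differ.

```python
import math
from statistics import mean, variance

def t_statistic(a, b):
    """Two-sample Student t-statistic with pooled variance.

    `a` and `b` are the RMSE samples of the two configurations being
    compared; a large absolute value indicates a significant difference.
    """
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
```

In practice a library routine (e.g. a standard two-sample t-test implementation) would be used, which also reports the p-value for a chosen significance level.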
Conclusions and Future Work
With the rapid development of GNNs, applying them to molecular machine learning problems becomes increasingly compelling and meaningful. For example, accurate molecular property prediction can make the entire drug discovery process faster and cheaper. However, the performance of GNNs is largely affected by hyperparameter selection, so research on HPO for GNNs is extremely important. In this paper, we elaborated the problem of HPO for GNNs in molecular property prediction, and investigated in depth which types of hyperparameters should be optimized when computational resources are limited. Based on our experiments, we conclude that graph-related hyperparameters and task-specific hyperparameters should be optimized simultaneously; leaving either out results in reduced performance. Even though doing so means a larger search space, which seems more challenging given the same number of trials (limited computational resources), this strategy can, perhaps surprisingly, achieve better performance. Finally, we acknowledge that our experiments are based on a single type of GNN model and a single evolutionary strategy. However, we believe that our conclusion can be further generalised, because we selected a representative GNN model, used a state-of-the-art evolutionary HPO approach, and ran the experiments on benchmark datasets that are representative of molecular property prediction problems. Still, we propose two directions for future research. First, there exist various GNNs, and most of them comply with the rule of aggregating neighbour information to learn node representations; nevertheless, they can be classified into spectral and spatial GNNs [balcilar2020bridging]. In this research, we have extensively investigated the impact of HPO on GC, which is a representative spatial GNN.
Therefore, it would be interesting and worthwhile to investigate whether the same conclusion holds for HPO of spectral GNNs. Second, we employed CMA-ES as the HPO strategy because it is a state-of-the-art evolutionary HPO method; however, it does not support a dynamic search space, which constrains its scalability. In future work, other evolutionary HPO approaches can be applied to explore their effectiveness in optimizing hyperparameters with dynamic search spaces for GNNs.