Discovering the essentialome, i.e., the set of essential genes for the survival of a living organism, is an important research question in bioinformatics and genomics. A plethora of fundamental functions are played in the cell by proteins coded by essential genes , resulting in an irreplaceable role in controlling cellular processes. Therefore, essential genes are regarded as the basis of life, such that characterizing an organism’s essentialome can give us insights into the inner working and interdependence within its genome. Knowledge of essential proteins and genes not only improves our understanding about biological processes and molecular functions, but also has contributed to a range of fields, such as in synthetic biology, for the definition of a minimal genome ; in drug design, for the choice of drug targets of antimicrobial and anticancer compounds [15, 36]; in a better understanding of diseases, by helping discover pathogenic genes ; and in metabolic engineering .
Several experimental procedures, such as targeted single-gene knockout and transposon mutagenesis, have been developed to identify essential genes . Although they are generally robust and accurate, wet-lab approaches imply costs with laboratory resources and are time consuming, which may impose significant limitations to their use, especially for large scale analyses and for complex organisms like mice and humans . Considering this scenario, computational approaches have become appealing solutions as they provide fast and inexpensive results, which can reduce the workload in experimental procedures by suggesting a list of candidate genes and proteins to be tested. With the rapid development of high-throughput techniques, a large volume of biological data is available for in silico studies, covering a wide variety of organisms. This factor was crucial for the further development of computational methods to predict essential genes and proteins.
As recently reviewed 
, computational methods usually rely on network topology features extracted from Protein-Protein Interaction (PPI) networks as the main source of data to predict gene essentiality. These methods leverage the fact that highly connected genes111We note that although PPI networks encode interactions among proteins, we refer to the nodes of the network as genes or proteins interchangeably throughout the text, considering the genes by which these proteins are encoded. in the PPI network are more likely to be essential – the centrality-lethality rule . Therefore, correlation between node centralities and gene essentiality has been explored in several computational approaches previously proposed, including neighborhood-based methods such as degree centrality (DC), local average connectivity (LAC), and edge-clustering coefficient centrality (NC) .
Although information regarding interaction among genes is strongly related to gene essentiality, using only PPI network topological features leads to limited prediction accuracy . First, PPI networks are known for being noisy and containing many false positive connections . Second, there are many protein interaction databases, with significantly different networks among them, which not only emphasizes the potential incompleteness of single PPI networks, but may also generate varying and unstable results depending on the network used. Third, due to the complexity of living organisms, it is natural to assume that gene essentiality is determined by multiple biological factors that can not be fully captured by topological characteristics of the network .
Recent works aim at alleviating these issues by using multiomics datasets to either manually prune untrustworthy connections or as additional input for the statistical and computational methods being applied. In this direction, gene expression profiles are perhaps the most common choice of additional evidence among works in the field , being applied, for instance, in  and . Other types of biological information already used in conjunction with PPI networks are subcellular localization information , orthology information , protein complexes  and domains , sequence data , and functional annotations such as Gene Ontology. In addition, there is an increasing number of studies exploring three or more sources of biological information in the construction of statistical or computational models (e.g., [27, 25, 23]).
Given the nature of this prediction problem, computational approaches often apply machine learning (ML) algorithms, especially supervised learning methods, as part of the solution
. In this sense, the identification of essential genes and proteins is regarded as a binary classification task, and algorithms build a prediction model based on features related to gene and protein essentiality. Most works in this scope have applied shallow ML algorithms, among which Support Vector Machines (SVM), naïve Bayes 
, and decision trees are recurrent ones. This methodology undoubtedly helped advance the frontier of knowledge regarding essential genes and proteins. Nonetheless, whereas centrality-based methods needs prior knowledge to create good score functions, shallow ML methods demand the specialist’s input to select representative biological properties as features for the learning problem, which is not a trivial task, especially when knowledge regarding these features is still evolving and may not be fully elucidated .
Deep learning methods, on the contrary, avoid the critical step of extracting handcrafted features in the design of ML methods, automatically learning features from the training data during its operation. Thus, deep learning does not require prior knowledge or assumptions regarding relevant biological features of gene essentiality. Moreover, deep learning methods have demonstrated an excellent performance in a wide range of prediction and network-based problems in Bioinformatics [31, 18].
Zeng et al.  proposed a deep learning framework for identifying essential proteins (DeepEP) without any prior knowledge, adopting PPI network, gene expression data, and subcellular localization information. The node2vec algorithm 
was applied for network representation learning, a long short term memory network was used to process the gene expression data, and the output vectors originated from pre-processing the three data sources are concatenated and classified by a fully connected layer with the sigmoid activation function. Computational experiments with the PPI network ofSaccharomyces cerevisiae indicate an area under the Receiver Operating Characteristic (ROC) curve (AUC) of 0.832.
Also in this direction, PPI networks and gene expression data were used as input for the deep learning framework proposed in 
. The node2vec technique was used for encoding the PPI network into a low-dimensional space, which is concatenated with patterns extracted by a multi-scale Convolutional Neural Network (CNN) from an image-based representation of gene expression profiles. The output vector is analyzed by a sequence of two fully connected layers using the rectified linear unit (ReLU) and softmax activation functions to predict the final label of a protein. An AUC score of 0.82 is achieved forS. cerevisiae data. The works by  and  show that the dense vectors generated by node2vec technique contribute to an improved performance over commonly used centrality measures.
More recently, Zhang et al.  proposed DeepHE, a deep learning-based method to predict human essential genes. DeepHE integrates sequence features extracted from DNA and protein sequences with features learned from PPI network using node2vec. A deep neural network was trained with a cost-sensitive technique to address the imbalanced learning problem inherent to this domain, predicting human gene essentiality with an average AUC score higher than 0.94. The method is shown to outperform traditional shallow ML algorithms such as SVM, RF, and naïve Bayes.
Despite the success of these deep learning methods for gene or protein essentiality prediction, they have in common the need to use a node embedding technique (e.g., node2vec) to transform the PPI network to an ordered and fixed-size input lying in the Euclidean space, as expected by statistical and ML models. In other words, these approaches do not learn from the full graph data, but rather from low-dimensional representations obtained from them. Therefore, it is reasonable to assume that during this transformation, valuable information may be lost and more complex patterns may not be discovered, impacting in the performance and generalization power of the model.
Graph Neural Networks (GNNs) [11, 43] were introduced as a class of deep learning based methods capable of dealing with the non-Euclidean nature of graphs, automatically learning network topology-preserving node-level vector representations from networks. Graph Convolutional Networks (GCNs) 
are the current state-of-the-art for problems posed for GNNs, performing node classification based on node features and network topology by aggregating information from neighboring nodes in a hierarchical fashion. GCNs have significantly improved performance in other prediction tasks in Bioinformatics, such as estimating drug–target binding affinity, identifying cancer driver genes , and classifying breast cancer subtype .
In this work, we hypothesize that GNN-based models could offer high performance in the identification of essential genes and proteins compared to previous deep classifiers, by learning more complex relations directly from the PPI networks. We propose EPGAT, a novel computational method for gene essentiality prediction based on Graph Attention Networks (GATs) , an extension to GCNs that adds an attention mechanism to the original model. As far as we are aware, no previous work has adopted GCNs, or more especifically GATs, for prediction of essential genes and proteins.
Our approach aims at tackling the main limitations of current methods, i.e., (i) the low reliability of PPI networks, (ii) the need to integrate multiomics data to more broadly capture the biological notion of essentiality, and (iii) the limited use of graph-embedded knowledge by previously proposed network topology measures, shallow ML algorithms, and deep learning methods. Our solution operates directly over the graph structure by using GATs, and incorporates multiomics datasets, namely gene expression profiles, orthology information, and subcelular localization information, as node features, which are collectively used with PPI networks in the model learning process. To deal with issue (i), we evaluate our experiments on three distinct PPI network databases. Additionally, we benchmark our method on four organisms: Escherichia coli, Saccharomyces cerevisiae, Drosophila melanogaster, and Homo sapiens. Our contributions in this article are: 1) We show that EPGAT provides state-of-the-art results compared to other ML and network-based approaches and 2) we analyze and integrate different PPI and multiomics datasets in order to infer which of them are the most relevant for gene essentiality prediction.
The remainder of this paper is organized as follows. We introduce the relevant background and the setting of our experiments throughout Section II. In Section III, we present our experiments, results, and discussion of our findings. Lastly, in Section IV, we conclude our study and offer perspectives for future research.
Ii Materials and Methods
In this section, we present the basis of our proposed method, including an overview of the used data and learning algorithm, evaluation strategies, and baselines adopted for comparison purpose.
Ii-a Data Collection and Preprocessing
The proposed method was evaluated in the prediction of essential genes on four organisms: Escherichia coli, Saccharomyces cerevisiae, Drosophila melanogaster, and Homo sapiens. S. cerevisiae and E. coli were choosen as they have their complete essentialome published  and are widely used as testbed for statistical models. We also tested our method on the human genome since the results and performance of computational models in complex organisms that do not have a known essentialome is an open question and, therefore, a valuable research direction. D. melanogaster is also an interesting benchmark organism because, as we will later discuss, it is the most negatively biased organism of our dataset regarding annotation on essential genes.
Besides essential genes dataset, we used four kinds of biological datasets in our experiments: PPI networks, gene expression profiles, subcellular localization, and orthology information. To standardize nomenclature among these sources and allow data integration, the identification of genes and proteins was mapped to UniProtKB nomenclature. Elements without such correspondent were removed from our dataset. In what follows we explain our data sources.
Ii-A1 Essential Genes Dataset
Annotations for essential and non-essential genes for all organisms were downloaded from the database of Online GEne Essentiality (OGEE) (downloaded at 25/03/2020) . A total of 5,636 genes (18.63% essential) were obtained for S. cerevisiae, 4,322 genes (8.2% essential) for E. coli, 13,781 genes (2.96% essential) for D. melanogaster, and 21,556 (8.9% essential) genes for H. sapiens.
After the pre-processing step to standardize the nomenclature of genes using the UniProKB standard, the number of genes labeled for each species was: 5,636 genes (18.63% essential) for S. cerevisiae, 3,686 genes (6.56% essential) for E. coli, 12,329 genes (1.95% essential) for D. melanogaster, and 18,476 genes (9.88% essential) for H. sapiens. As we may observe, D. melanogaster is the most negatively biased organism since only 1.95% of labeled data refer to essential genes (i.e., positive instances, in the context of classification).
|S. cerevisiae||E. coli|
|N. labeled genes||5548||5455||4570||3489||3521||1902|
|N. essential genes||1048||1050||981||201||234||205|
|N. test labels||1110||1091||914||698||705||381|
|H. sapiens||D. melanogaster|
|N. labeled genes||14486||17894||3370||1778||8592||625|
|N. essential genes||1590||1806||726||79||161||41|
|N. test labels||2898||3579||674||356||1719||125|
Ii-A2 PPI Networks
As previously reported, PPI networks have many false positive interactions, and networks for the same organism from different databases are extremely heterogeneous. This may lead to results instability due to the choice of PPI dataset. To analyze and alleviate the effect of network choice, we evaluated the results for each organism in three different PPI networks: DIP (data as of 2017-02-05), BioGRID (Version 3.5.182), and STRING (Version 11.0).
DIP, the Database of Interacting Proteins , lists protein pairs that were experimentally shown to bind to each other. Although DIP has been used by previous works, it has the characteristic of being a sparse network. In our work, DIP dataset contains the lowest number of nodes and edges among all networks used, for all organisms evaluated.
BioGRID, the Biological General Repository for Interaction Datasets , provides the curation and storage of protein interactions reported in the biomedical literature for all major model organisms and for humans. Interactions in this database are divided between physical and genetic, both of which are used in our dataset.
dataset assembles PPI collected and scored from different ’evidence channels’, depending on the origin and type of the biological evidence. A combined and final score is computed for each interaction, which is typically used as an estimate of the likelihood that a given interaction is biologically meaningful, given the supporting evidence. Heuristically, we filtered out connections with a confidence score bellow 0.5 aiming at reducing false positive interactions. As part of our experimental approach, we further analyzed such choice, as discussed in SectionIII-D.
DIP dataset lists only proteins that bind to each other, while STRING and BioGRID may also include indirect interactions between proteins, which makes them much denser. We also note that we added self-connections for every node in the networks, a standard procedure in order to train GATs and GCNs in general (see Section II-B for further details).
Details of collected PPI networks are summarized in Table I. The number of nodes and edges refer to the structure of networks collected from the databases aforementioned. The number of labeled genes of each PPI network is the number of genes contained within the essential genes dataset that intersect with elements from the PPI network, including both positive and negative labels regarding essentiality. The positive labels, which are of special interest in our work, are shown in the ”N. essential genes” field. Finally, the number of test labels is the size of the partitions reserved for model evaluation from the labeled genes dataset.
Ii-A3 Gene Expression Profiles
In our multiomics-based approach, we used gene expression profiles from the Gene Expression Omnibus (GEO) database. For E. coli, we collected gene expression data from GSE7326 and GSE40693, which we found empirically to be highly correlated with the E. coli essentialome. GSE7326 evaluates E. coli gene expression during cell death, whereas GSE40693 measures the transcriptomic changes of E. coli in response to an antimicrobial compound. For S. cerevisiae, we used the expression profiles from the GSE3431 dataset, which was previously applied for essential gene prediction . GSE86354 and GSE67547 were used for H. sapiens and D. melanogaster, respectively. GSE86354 provides expression profiles for 1,558 samples across 8 tissue sites generated by the Genotype-Tissue Expression (GTEx) project, and GSE67547 expression is obtained over the lifespan of 120,000 fruit-flies. Further information about gene expression data may be obtained in the GEO database through datasets’ accession numbers.
Ii-A4 Subcelullar Localization and Orthology Information
As additional biological evidence, we used gene orthology information gathered from the InParanoid (v8)  database and protein subcellular localization data from the COMPARTMENTS database (version 8) . We note that the COMPARTMENTS databases does not provide information on subcellular localization for E. coli. For this reason, we evaluate this organism using only the gene expression and orthology information as additional data.
Ii-B Graph Neural Networks
A PPI dataset may be denoted by a graph composed by a set of nodes , representing proteins, and a set of edges reflecting the interactions among the proteins. Additionally, we may represent information of each protein in the network by adding attributes to the node it is represented by. With this reasoning we frame essential gene prediction as a node classification problem over a PPI network attributed with additional multiomics data.
Traditional ML methods usually rely on preprocessing procedures in order to parse graph data and generate a tabular dataset. As neighborhood information is important in a graph, the entry of each node in the resulting tabular dataset would need to contain features of neighboring nodes. As the number of neighbors for each node varies, the resulting dataset would be unwieldy. Hence, methods that deal with raw graph data are desirable.
Graph Neural Networks (GNNs) comprise a family of ML methods that handle graph data without pre-processing. In this work we use a specific GNN method, namely Graph Attention Networks (GATs) . GATs combine ideas of generalized convolutions , which allows graph nodes to aggregate information from their irregular neighborhoods, with self-attention mechanisms , which allows nodes to learn the relative importance of each neighbor during the aggregation process. Next, we describe GATs for node classification tasks, where the goal is to predict the class of each graph node out of possible classes.
In a GAT, each layer contains a vector-valued representation, also called embedding, , for each node . For an arbitrary node , the first layer, may contain features that reflect properties of the node (e.g. multiomics protein data), domain knowledge or just be randomly initialized. Embeddings of intermediate (hidden) layers represent increasingly “higher-level” latent features of the nodes. Embedding dimensions can be arbitrary in the hidden layers. The final layer, , is a vector of dimension
, containing the probability of nodeto belong to each one of the classes.
For a given node , the embeddings of all its neighbors in a layer, , are used to compute ’s embedding on the next layer’s . To make use of its own embedding, the set of ’s neighbors, , includes itself. Equation 1 shows the embedding calculation procedure, where is a non-linear activation function, applied element-wise to the resulting vector, is the matrix of learnable weights of layer and is the learnable attention between and , which indicates how strongly “listens” to information on node .
Thus, the goal in GATs is to adjust the weights of each layer and the attention coefficients between every pair of nodes so as to capture latent node features and, ultimately, generate good class predictions. Note that the weights of each layer, , are shared, i.e., all nodes use the same weight matrix to update their next layer’s embeddings. Hence the importance of the self-attention mechanism: it weighs the importance of each neighbor when updating a node’s embedding. Moreover, this value is also learned from data.
As node embeddings lie in a continuous space, GATs can be seen as a deep learning technique. This means that, if differentiable activation functions are used, GATs can be trained end-to-end with gradient descent methods.
Ii-C Model Training and Performance Evaluation
For model training and evaluation, we randomly split the labeled data into a training (80%) and a test set (20%) in a stratified manner, i.e., preserving original classes proportions in both sets. To better estimate the models generalization power, ten random splits were used and performance metrics were averaged across repetitions. Additionally, we reserved a subset of the trainining dataset and used it for validation while training statistical methods (including GAT). The validation data is used for early stoppage of the training procedure.
Essential genes compose only a small fraction of the coding genome of most organisms, including E. coli, S. cerevisiae, D. melanogaster, and H. sapiens. As a result, datasets of essential genes are heavily inclined towards negative cases. To account for this class imbalance, we trained EPGAT using the weighted binary cross-entropy (CE) function, defined as:
where is the true label of the instance and assumed to be 1 (i.e., positive or essential) or 0 (i.e., negative or non-essential), is the probability predicted by the model for the positive class (and thus is the probability predicted for the negative class), and and are inversely proportional to the number of instances in the training dataset for the positive and negative classes, respectively.
Training GAT models involves tuning several hyperparameters, which can be very time consuming, especially when a number of models are being developed, as in our work. To reduce computational costs, we first determined the best combination of hyperparameters for theS. cerevisiae networks using optuna , an algorithm for hyperparameter optimization in ML. Next, based on the results from this analysis, we performed an empirical evaluation for the other organisms by manually varying the hyperparameters using a smaller number of combinations that had given good results for the S. cerevisiae networks. The hyperparameters used for EPGAT for each organism are listed in Table II.
In order to improve the generalization and robustness of our models, we applied the regularization techniques dropout and L2 regularization. After every GAT layer, we added a dropout layer with probability , specified for each organism as summarized in Table II. We also added a regularization based on the L2-norm of the model’s hyperparameters when computing its gradient. This causes the weights to decay to smaller values, thus biasing the learning procedure towards simpler solutions.
The performance was assessed using the area under the ROC curve (AUC). The ROC curve shows the performance of a classification model in distinguishing between two classes at distinct thresholds settings, by plotting the corresponding True Positive Rate (TPR, in the y-axis) and the False Positive Rate (FPR, in the x-axis). The AUC score may be interpreted as the probability that the model ranks a random positive instance more highly than a random negative instance. Therefore, higher AUC scores indicate better classification models. The choice of this metric aims to allow a qualitative comparison with previous works and to better deal with the inherent class imbalance in our datasets in contrast to measures such as accuracy.
|Parameter||S. cerevisiae||D. melanogaster||E. coli & H. sapiens|
As baselines in our study, we adopted approaches from the three classes of computational strategies more common in the related literature, namely, network topology measures, shallow ML algorithms, and node embedding coupled with deep learning.
As for topology measures, we used degree centrality (DC), which calculates the number of neighbors of a node in the network; neighborhood centrality (NC), which is based on edge clustering coefficient; and local average connectivity (LAC), which evaluates the local connectivity of node ’s neighbors. For further details about these measures, we refer reader to the survey by Li et al. .
The traditional, shallow ML algorithms multilayer perceptrons (MLP) and support vector machines (SVM) were also applied as baselines. These algorithms do not explicitly handle graph data as input. Therefore, standard feature vectors for each gene were constructed by concatenating their information from our multiomics dataset (i.e., gene expression, orthology, and subcellular organization) and extracted from the STRING PPI network topology (i.e., node degree) into a single vector. Next, feature vectors for all genes are concatenated in a single table and used as input for training these models. The MLP network was configured with a single hidden layer of 32 units and trained with a dropout rate of
to reduce overfitting. The SVM was trained using the radial basis function (RBF) kernel.
Finally, we compared our results against models built upon network features extracted by the node2vec (N2V) embedding method, as applied in previous works (e.g., [53, 54, 56]). N2V uses biased random walks on graphs to learn a graph embedding for each node, which captures the network structure in a regular format. These embeddings are then used as additional features for the MLP training set, so that we carry out a fair performance comparison with our GAT model. The parameters used for training the N2V-based MLP model were the same as in the comparison with shallow ML algorithms.
Iii Results and Discussion
Our experiments were implemented in Python using the sklearn library 
for general training and evaluation methodology, and for the SVM model; Pytorch for the MLP models; and Pytorch Geometric extension  for the graph-based models, namely EPGAT and N2V.
We setup our experiments aiming at i) comparing our approach to other network-based and ML approaches (both shallow and deep learning methods) used for gene or protein essentiality prediction and ii) assessing the impact over performance of the different choices of PPI databases and multiomica datasets. Performance for the trained models are shown as the mean and the standard deviation of ten runs using random stratified splits of the dataset. We carried out statistical comparisons by means of two-tailed Student t-test with a 95% confidence level to evaluate the significance of different metrics.
Iii-a Comparison with network topology-based methods
Our first batch of experiments compared the proposed EPGAT model to the network centralities DC, LAC, and NC. For the analysis of topology-based methods, genes in the testing set were ordered by their centrality measure in a descending order and the resulting rank was used for AUC score analysis.
The results for this comparison considering the three PPI networks, namely STRING, BioGRID, and DIP, are shown in Figure 1. In most organisms and networks, our approach had a significant improvement over the baselines. Except for the D. melanogaster DIP and BioGrid networks, for which none of the comparisons achieved statistical significance, and for the E. coli DIP network, in which EPGAT was only significantly superior than LAC, in all other comparisons EPGAT presented the highest performance with statistical significance ().
Surprisingly, the topology-based methods were less consistent than EPGAT. This is mainly perceived in the results for E. coli and D. melanogaster. The improvement brought by the proposed model in terms of performance values and stability is caused by two main factors. First, these baseline methods are analytical approaches that provide only a crude simplification of nodes patterns and roles within the PPI network, thus probably only partially capturing the information correlated to gene essentiality that it is expected to be contained within this type of biological evidence (i.e., protein interaction networks). In contrast, GAT is an expressive neural network that is able to model and detect complex relationships by directly analyzing the complete graph structure, as opposed to centrality metrics derived from it. Second, our EPGAT model parses additional datasets, which provides richer and more complete information for the prediction task. As previously discussed, gene or protein essentiality is a multifactorial phenomenon, which is corroborated by our results. A distillation of these two factors is explored in a data ablation study presented in Section III-C
Regarding the poor performance of EPGAT for D. melanogaster, we note that the networks provided by BioGRID and DIP are severely restricted for this organism. First, the average degree for these networks is relatively low as compared to STRING and to other organisms. Second, while our dataset contains over twelve thousand labeled genes for the D. melanogaster, the DIP and BioGRID networks only map interactions among 625 and 1,778 of them, respectively, such that the training set is also limited in contrast to other scenarios. Finally, it should be noted that the D. melanogaster
essentialome was by far the most challenging one due to its severe class imbalance (1.95% of labeled genes are classified as essential). Considering the total number of nodes in the networks gathered from BioGRID and DIP databses, only 4.21% and 3.18% of them are essential genes. These constrictions, both in the number of samples and skewness of instances, are hard to be overcome by statistical methods. In the BioGRID network, EPGAT achieved comparable results to the baselines, whereas in the DIP network every method was equivalent to random guessing. We note, however, that theD. melanogaster STRING network contained over 8,000 labeled genes, and in this case, EPGAT had significant improvement over the baselines, which shows that given enough data our approach can learn important features even in highly biased datasets.
Noticeably, the DIP network had the worst performance on every organism. For the D. melanogaster and the H. sapiens networks, the performance for most methods were equivalent to random guessing. EPGAT was able to introduce improvements for the Human DIP network, but its performance was still inferior as those for BioGRID and STRING. As it was briefed in Table I, DIP is the database with the lowest number of mapped interactions. For the D. melanogaster and H. sapiens organisms, DIP provides extremely sparse networks, with average degree of 1.44 and 1.6, respectively, in contrast to 4.47 for S. cerevisiae and 4.18 for E. coli. This characteristic impairs classification performance given the underlying functioning of GATs, which depends on a node’s connectivity to update its embedding. Nonetheless, even for S. cerevisiae, for which DIP presents the highest average degree and ratio of labeled genes (i.e., 89.15% of nodes in the network) among all organisms, its performance is inferior than denser network choices. This is an important observation since much work on the field rely solely on the DIP PPI network to evaluate algorithms [48, 57]. In general, BioGRID and STRING had comparable performance among the analytical methods, although our model had a slightly better edge with the STRING network. Therefore, in the next experiments, we use it as a basis for our model.
Iii-B Comparison with machine learning methods
We compared the proposed approach with three ML classifiers: MLP, SVM, and a node2vec-based model to extract features that are further analyzed using a MLP (N2V). All classifiers were trained using the collected multiomics dataset (see Section II-A). Nonetheless, while the standard, shallow ML methods (SVM and MLP) relied on a structured table with features extracted from PPI networks and other omics data, the graph-based methods (i.e., EPGAT and N2V) are tailored to the task of learning the gene essentiality-related patterns directly from the input network. We note, however, that their underlying functioning for creating a low-dimensional feature vector for each node in the network is different, as well as the strategy to combine multiomics data with network-based features.
Figure 2 displays the performance of the EPGAT model in contrast to the baseline ML methods. All models use as basis the STRING PPI network. Our approach notably improved the prediction performance over MLP and SVM for S. cerevisiae, E. coli, and H. sapiens, achieving statistically significant differences () in these comparisons. MLP was significantly superior than SVM for E. coli, but in all other organisms the average performance for both shallow ML methods was very similar.
Regarding the comparison between EPGAT and N2V, overall, we observed a very competitive performance of our model in relation to the node2vec embedding, which is the state-of-the-art method used to learn from graph structured data in related works. EPGAT achieved a superior performance with statistical significance for E. coli () and S. cerevisiae (). For D. melanogaster, the difference was not statistically significant (), nonetheless, EPGAT had a very positive impact on the stability of the model, notably reducing the standard deviation obtained from the ten random splits of data in relation to the other methods, including N2V. Finally, EPGAT and N2V achieved very close performance for identification of human essential genes (90.95 0.0054 for EPGAT versus 91.300.0059 for N2V, with ).
Thus, in general, our approach improved the identification of essential genes over previous ML methods for all organisms, either by increasing the average AUC score or by generating a more robust model (i.e.,
presenting lower variance on predictive performance). Our results show the importance of a more elaborate approach to deal with graphs in this prediction task. The advantage in approaches that maximally preserve properties and information from graph whilst identifying gene essentiality patterns is especially clear in the comparison between the N2V and MLP, which are two models based on the MLP learning algorithm. By adding the node embeddings generated with node2vec to its feature set, N2V was significantly better () than the shallow MLP for all organisms except D. melanogaster.
|Method||S. cerevisiae||E. coli|
|H. sapiens||D. melanogaster|
* Statistically significant () performance improvement in comparison to previous model in the incremental analysis.
Iii-C Data ablation study
To elucidate the importance of each type of biological evidence in our multiomics-based approach, we carried out an ablation study for the proposed EPGAT model. Performance of EPGAT was evaluated with the PPI network (i.e.,, STRING, BioGRID, or DIP) without any additional attributes, followed by incremental inclusion of other omics data to investigate their impact over the model AUC score. Results are shown in Table III, in which the best performance per organism network are highlighted in bold.
We perceive mixed results and a variance in performance depending not only on the choice of data, but on the organism analyzed as well. Nonetheless, in general, additional types of biological evidence seems to positively impact models predictive power as compared to the exclusive use of PPI networks. Notably, gene expression data presented a great potential to boost performance over every choice of PPI network and organism. The only exception was for the D. melanogaster DIP model, which had a very unstable performance due to the sparseness of this network (see Figure 2-d) and the large class imbalance in the labeled dataset. Therefore, results for this specific scenario may not be statistically relevant as the previous AUC performance was already inferior to a random classifier. Interestingly, the use of additional omics data with the DIP network for S. cerevisiae, E. coli, and H. sapiens was able to decrease the performance gap in relation to the other networks, despite the general limitation of sparseness observed in the DIP database.
Subcellular localization information had only a slight quantitative impact on the results, but in four out of the 12 scenarios analyzed (i.e., combinations of PPI networks and organisms), the EPGAT model trained with PPI plus gene expression and sublocalization data led to the best performance. This is an interesting result, since although the complete multiomics models were the best in 5 of these scenarios, for E. coli we could not assess models trained with sublocalization due to data unavailability. Therefore, it is not possible to assert that models with ortholog were in fact the best framework, or if this result is associated with the lack of sublocalization information for E. coli.
Finally, we note that even without other sources of omics data, our EPGAT model surpasses the DC baseline in most cases. A statistically better performance () of EPGAT in relation to DC was observed in all models, except for the D. melanogaster’s DIP and BioGRID networks and for the E. coli’s DIP network. In these two models, DC achieved a higher AUC score, although without statistical significance considering a 95% confidence level.
Iii-D Analysis of PPI thresholds on STRING network
The impact on model performance according to the choice of PPI network database is quite noticeable from our experimental results summarized in Table III. This observation led us to examine the impact of filtering out the networks used as input for EPGAT based on distinct confidence measures, as in the previous experiments we kept only interactions with a STRING confidence score higher than 0.5.
We rerun the experiments with the EPGAT model in all organisms using different filtering thresholds. Results are given in Figure 3, in which the confidence thresholds are indicated in the x-axis. Experiments for H. sapiens were truncated at a confidence score of 0.3 due to memory constraints with denser networks.
The best AUC score was obtained at a threshold of 0.2 for E. coli (98.13%), of 0.3 for H. sapiens (91.00%), of 0.5 for S. cerevisiae (90.42%), and of 0.1 for D. melanogaster (82.53%). Hence, there is a clear trend towards smaller thresholds achieving better performance. This finding corroborates with our previous assertions on dense networks carrying more information about gene essentiality, even if at the cost of more misinformation, and yielding better results for our approach. Nonetheless, a surprising observation is that for most organisms, EPGAT performance was relatively consistent even with largely diverging threshold values and, thus, different number of interactions within the PPI network. The variation between the best and the worst performance was relatively small for E. coli (0.069), S. cerevisiae (0.029), and H. sapiens (0.047). This result reinforces the robustness and potential of EPGAT for this prediction task.
For the D. melanogaster dataset, we observed an erratic behavior. Not only the average performance was unstable across distinct PPI filter thresholds, but also the standard deviation was consistently higher when compared to other organisms. We believe this is caused by the limited volume and class imbalance characteristics for the essential genes data. D. melanogaster dataset comprised only 161 positive labeled genes in the STRING PPI network used in our work, which causes any perturbation, even if a small one, to affect classification and exert a significant impact on the AUC score. Moreover, as we may see in Table I, the D. melanogaster STRING PPI network has a relatively large dimension when compared to S. cerevisiae and E. coli, which aggravates the situation as changes in the filtering thresholds lead to a more prominent perturbation on the structure of the network. Finally, we note that EPGAT models could be enhanced by optimizing the interactions filtering threshold in a case-by-case basis, which was not explored in the current work.
Iii-E GAT model interpretation
Our model is based on an attention mechanism, which weights every interaction among two genes and . Intuitively, the weight is a value that assigns the importance of node for the prediction returned for node . In other words, the weights can be seen as the influence that a gene causes in the classification of another gene . Therefore, learned attentional weights may provide benefits in model interpretability.
We visualize the attention weights for the H. sapiens STRING PPI network in Figure 4. For better visualization purpose, we obtained a subgraph from highly connected nodes to reduce network dimension. The weight is displayed by the transparency of each edge, such that strongly shaded edges means that the interaction had a greater influence on the classification of the target gene. Green and red nodes correspond to essential and non-essential genes, respectively, while black nodes refer to network genes without a label (i.e., not comprised in the collected essential genes dataset).
Noticeably, we see an interesting phenomenon. The nodes with very few connections do not affect nor are strongly affected by their neighbors, which means that the model relies on the features given by the additional datasets for their prediction. Highly connected nodes (i.e., hubs), on the other hand, show the opposite effect as they greatly affect the predictions of their neighbors. Complementarily to the centrality-lethality rule that affirms that hubs have a higher likelihood of being essential, our model indicates that genes interacting with hubs are likely to have similar essentiality status. In this network visualization, this is characterized by the highly interacting negative labeled genes in the lower left of Figure 4.
In this study, we approached the gene essentiality prediction problem by using powerful GATs, a type of GNNs, combined with the integration of multiomics datasets. While most of the previous research on the field evaluate their algorithms on mostly one or two organisms (especially model organisms), we adopted a more comprehensive benchmark that included H. sapiens datasets. We showed that our approach, named EPGAT, achieved a significant improvement in performance compared to other network-based and machine learning methods commonly used in the field. Notably, results obtained by EPGAT were at least comparable (when not superior) to node2vec, a state-of-the-art method for gene or protein essentiality prediction (used, for instance, in [53, 54, 56]). While maintaining a slight edge, EPGAT still has the benefits of having a more straightforward training procedure and shorter training time.
Our experiments indicated that denser PPI networks are more informative and reliable for gene essentiality prediction. However, even under the challenged posed by network sparseness (i.e., D. melanogaster
network in our study), EPGAT was the most robust ML method. Lastly, we observed that our approach can also be helpful for insights over gene relations, as the GATs’ attention mechanisms is a measure of the influence among interacting genes. Further investigation towards interpretability of model attentional weights may help shed light on network-based features of genes and proteins essentiality patterns. Other interesting research directions include (i) expanding our approach to capture the context-dependent nature of gene essentiality, for instance, by integrating more omics datasets, and (ii) incorporating node-level feature selection techniques in our model to control dimensionality and improve its generalization power.
This work was partially supported by CNPq/Brazil and by CAPES Finance Code 001.
-  (2009) Towards the prediction of essential genes by integration of network topology, cellular localization and biological process information. BMC Bioinformatics 10 (1), pp. 290. Cited by: §I.
-  (2019) Optuna: a next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2623–2631. Cited by: §II-C.
-  (2006) Construction of Escherichia coli k-12 in-frame, single-gene knockout mutants: the keio collection. Molecular Systems Biology 2 (1), pp. 2006–0008. Cited by: §II-A.
-  (2012) NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Research 41 (D1), pp. D991–D995. Cited by: §II-A3.
-  (2012) Systems and synthetic metabolic engineering for amino acid production–the heartbeat of industrial strain development. Current Opinion in Biotechnology 23 (5), pp. 718–726. Cited by: §I.
-  (2014) COMPARTMENTS: unification and visualization of protein subcellular localization evidence. Database 2014. Cited by: §II-A4.
-  (2016) OGEE v2: an update of the online gene essentiality database with special focus on differentially essential genes in human cancer cell lines. Nucleic Acids Research, pp. gkw1013. Cited by: §II-A1.
-  (2011) Investigating the predictability of essential genes across distantly related organisms using an integrative approach. Nucleic Acids Research 39 (3), pp. 795–807. Cited by: §I.
-  (2016) A novel algorithm for identifying essential proteins by integrating subcellular localization. In 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Vol. , pp. 107–110. Cited by: §II-A3.
-  (2019) Fast graph representation learning with pytorch geometric. arXiv preprint arXiv:1903.02428. Cited by: §III.
-  (2005) A new model for learning in graph domains. In Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., Vol. 2, pp. 729–734. Cited by: §I.
-  (2016) Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864. Cited by: §I.
-  (2017) Accurate prediction of human essential genes using only nucleotide composition and association information. Bioinformatics 33 (12), pp. 1758–1764. Cited by: §I.
-  (2006) Towards the identification of essential genes using targeted genome sequencing and comparative analysis. BMC Genomics 7 (1), pp. 1–16. Cited by: §I.
-  (2015) High-resolution CRISPR screens reveal fitness genes and genotype-specific cancer liabilities. Cell 163 (6), pp. 1515–1526. Cited by: §I.
-  (2006) Why do hubs tend to be essential in protein networks?. PLoS Genetics 2 (6), pp. e88. Cited by: §I.
-  (2001-05-01) Lethality and centrality in protein networks. Nature 411 (6833), pp. 41–42. Cited by: §I.
-  (2020) Application of deep learning methods in biological networks. Briefings in Bioinformatics, pp. bbaa043. Cited by: §I.
-  (2011) Essence of life: essential genes of minimal genomes. Trends in Cell Biology 21 (10), pp. 562–568. Cited by: §I.
-  (2012) Prediction of essential proteins using topological properties in GO-pruned PPI network based on machine learning methods. Tsinghua Science and Technology 17 (6), pp. 645–658. Cited by: §I.
-  (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §I, §II-B.
-  (2000) How many genes can make a cell: the minimal-gene-set concept. Annual Review of Genomics and Human Genetics 1 (1), pp. 99–116. Cited by: §I.
-  (2018) Predicting essential proteins based on RNA-Seq, subcellular localization and GO annotation datasets. Knowledge-Based Systems 151, pp. 136–148. Cited by: §I.
-  (2018) United neighborhood closeness centrality and orthology for predicting essential proteins. IEEE/ACM Transactions on Computational Biology and Bioinformatics. Cited by: §I.
-  (2016) Predicting essential proteins based on subcellular localization, orthology and PPI networks. BMC Bioinformatics 17 (8), pp. 571–581. Cited by: §I.
-  (2015) United complex centrality for identification of essential proteins from PPI networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics 14 (2), pp. 370–380. Cited by: §I.
-  (2016) A reliable neighbor-based method for identifying essential proteins by integrating gene expressions, orthology, and subcellular localization information. Tsinghua Science and Technology 21 (6), pp. 668–677. Cited by: §I.
-  (2011) A local average connectivity-based method for identifying essential proteins from the network level. Computational Biology and Chemistry 35 (3), pp. 143–150. Cited by: §I.
-  (2019-02) Network-based methods for predicting essential genes or proteins: a survey. Briefings in Bioinformatics 21 (2), pp. 566–583. Cited by: §I, §I, §I, §II-D.
-  (2007-07-23) False positive reduction in protein-protein interaction predictions using gene ontology annotations. BMC Bioinformatics 8 (1), pp. 262. Cited by: §I, §II-A2.
-  (2017) Deep learning in bioinformatics. Briefings in Bioinformatics 18 (5), pp. 851–869. Cited by: §I.
-  (2019) GraphDTA: prediction of drug–target binding affinity using graph convolutional networks. BioRxiv, pp. 684662. Cited by: §I.
-  (2019) The BioGRID interaction database: 2019 update. Nucleic Acids Research 47 (D1), pp. D529–D541. Cited by: §II-A2.
-  (2008) Analysis of human disease genes in the context of gene essentiality. Genomics 92 (6), pp. 414–418. Cited by: §I.
-  (2017) Automatic differentiation in pytorch. In 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Cited by: §III.
-  (2014) Essential gene identification and drug target prioritization in Leishmania species. Molecular BioSystems 10 (5), pp. 1184–1195. Cited by: §I.
-  (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §III.
-  (2014) UDoNC: an algorithm for identifying essential proteins based on protein domains and protein-protein interaction networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics 12 (2), pp. 276–288. Cited by: §I.
-  (2015) Rechecking the centrality-lethality rule in the scope of protein subcellular localization interaction networks. PloS One 10 (6), pp. e0130743. Cited by: §I.
-  (2018) Emerging and evolving concepts in gene essentiality. Nature Reviews Genetics 19 (1), pp. 34. Cited by: §I, §I, §I.
-  (2017) Hybrid approach of relation network and localized graph convolutional filtering for breast cancer subtype classification. arXiv preprint arXiv:1711.05859. Cited by: §I.
-  (2004) The database of interacting proteins: 2004 update. Nucleic Acids Research 32 (suppl_1), pp. D449–D451. Cited by: §II-A2.
-  (2008) The graph neural network model. IEEE Transactions on Neural Networks 20 (1), pp. 61–80. Cited by: §I.
-  (2019) Graph convolutional networks improve the prediction of cancer driver genes. In International Conference on Artificial Neural Networks, pp. 658–668. Cited by: §I.
-  (2014) InParanoid 8: orthology analysis between 273 proteomes, mostly eukaryotic. Nucleic Acids Research 43 (D1), pp. D234–D239. Cited by: §II-A4.
-  (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §II-C.
-  (2019) STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Research 47 (D1), pp. D607–D613. Cited by: §II-A2.
-  (2014) Predicting essential proteins based on weighted degree centrality. IEEE/ACM Transactions on Computational Biology and Bioinformatics 11 (2), pp. 407–418. Cited by: §I, §III-A.
-  (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: §II-B.
-  (2017) Graph attention networks. External Links: Cited by: §I, §II-B.
-  (2011) Identification of essential proteins based on edge clustering coefficient. IEEE/ACM Transactions on Computational Biology and Bioinformatics 9 (4), pp. 1070–1080. Cited by: §I.
-  (1999) Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science 285 (5429), pp. 901–906. Cited by: §II-A.
-  (2019) A deep learning framework for identifying essential proteins by integrating multiple types of biological information. IEEE/ACM Transactions on Computational Biology and Bioinformatics. Cited by: §I, §I, §II-D, §IV.
-  (2019) DeepEP: a deep learning framework for identifying essential proteins. BMC Bioinformatics 20 (16), pp. 506. Cited by: §I, §II-D, §IV.
-  (2016) Predicting essential genes and proteins based on machine learning and network topological features: a comprehensive review. Frontiers in Physiology 7, pp. 75. Cited by: §I, §I.
-  (2020) DeepHE: accurately predicting human essential genes based on deep learning. bioRxiv. Cited by: §I, §II-D, §IV.
-  (2016) Essential protein discovery based on a combination of modularity and conservatism. Methods 110, pp. 54–63. Cited by: §I, §III-A.