1 Introduction
Knowing the function of a protein tells us about its biological role in the organism. With large numbers of genomes being sequenced every year, the number of newly discovered proteins is growing rapidly. Protein function is most reliably determined in wet lab experiments, but current experimental methods are too slow to keep pace with this influx of novel proteins. Therefore, the development of tools for automated prediction of protein functions is necessary. Fast and accurate prediction of protein function is especially important in the context of human diseases, since many of them are associated with specific protein functions.
The space of all known protein functions is defined by a directed acyclic graph known as the Gene Ontology (GO) (Ashburner et al., 2000), where each node represents one function and each edge encodes a hierarchical relationship between two functions, such as is-a or part-of (refer to Figure 2 for a visualisation). For every protein, its functions constitute a subgraph of GO, consistent in the sense that it is closed with respect to the predecessor relationship. GO contains thousands of nodes, with function subgraphs usually having dozens of nodes for each protein. Hence, the output of the protein function prediction problem is a subgraph of a hierarchically-structured graph.
This opens up a clear path of application for graph representation learning (Bronstein et al., 2017; Hamilton et al., 2017b; Battaglia et al., 2018), especially graph neural networks (GNNs) (Kipf and Welling, 2016; Veličković et al., 2017; Gilmer et al., 2017; Corso et al., 2020), given their natural inductive bias towards processing relational data.
One key aspect in which the protein function prediction task differs from most applications of graph representation learning, however, is that the graph is specified in the label space. That is, we are given a multi-label classification task in which we have known relational inductive biases over the individual labels (e.g. if a protein has a given function, it must also have all predecessor functions of that function under the closure constraint).
Driven by the requirement for a GNN to operate in the label space, we propose TailGNN, a graph neural network which learns representations of labels, introducing relational inductive biases into the flat label predictions of a feed-forward neural network. Our results demonstrate that introducing this inductive bias provides significant gains on the protein function prediction task, paving the way to many other possible applications in the sciences (e.g., prediction of spatial phenomena over several correlated locations (Radosavljevic et al., 2010; Djuric et al., 2015), traffic state estimation (Djuric et al., 2011), and polypharmacy side effect prediction (Zitnik et al., 2018; Deac et al., 2019a)).
2 TailGNNs
In this section, we will describe an abstract model which takes advantage of a TailGNN, followed by an overview and intuition for the specific architectural choices we used for the protein prediction task. The entire setup from this section may be visualised in Figure 1.
Generally, we have a multi-label prediction task, from inputs $x \in \mathcal{X}$ to outputs $y_i \in \mathcal{Y}_i$, for each label $i \in \mathcal{L}$. We are also aware that there exist relations between labels, which we explicitly encode using a binary adjacency matrix $\mathbf{A} \in \mathbb{B}^{|\mathcal{L}| \times |\mathcal{L}|}$, such that $\mathbf{A}_{ij} = 1$ implies that the prediction for label $i$ can be related with the prediction for label $j$ (different kinds of entries in $\mathbf{A}$ are also allowed, in case we would like to explicitly account for edge features).
Our setup consists of a labeller network
$$f : \mathcal{X} \to \mathcal{H}^{|\mathcal{L}|} \tag{1}$$
which attaches latent vectors $\vec{h}_i \in \mathcal{H}$ to each label $i$, for a given input $x$. Typically, these will be $k$-dimensional real-valued vectors, i.e. $\mathcal{H} = \mathbb{R}^k$.
These labels are then provided to the TailGNN layer $g$, which is a node-level predictor; treating each label $i$ as a node in a graph, $\vec{h}_i$ as its corresponding node features, and $\mathbf{A}$ as its corresponding adjacency matrix, it produces a prediction for each node:
$$g : \mathcal{H}^{|\mathcal{L}|} \times \mathbb{B}^{|\mathcal{L}| \times |\mathcal{L}|} \to \prod_{i \in \mathcal{L}} \mathcal{Y}_i \tag{2}$$
That is, $g(f(x), \mathbf{A})$ provides the final predictions for the model in each label. As implied, the TailGNN is typically implemented within the graph neural network (Scarselli et al., 2008) framework, explicitly including the relational information.
Assuming $f$ and $g$ are differentiable w.r.t. their parameters, the entire system can be optimised end-to-end via gradient descent on the label errors w.r.t. ground-truth values.
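To make the composition of the labeller and the TailGNN concrete, the following is a minimal NumPy sketch of the forward pass only. All sizes, the linear labeller, the sum aggregation and the ReLU are illustrative assumptions rather than the configuration used in this work; the names `labeller`, `tail_gnn`, `W_f`, `W_g` and `w_out` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 labels, 3 latent features per label, 5 input features.
num_labels, k, in_dim = 4, 3, 5

# Hypothetical label graph: A[i, j] = 1 means label j may inform label i.
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)

W_f = rng.normal(size=(in_dim, num_labels * k))  # labeller parameters
W_g = rng.normal(size=(k, k))                    # TailGNN parameters
w_out = rng.normal(size=k)                       # shared per-label output weights


def labeller(x):
    """Map one input vector to a k-dimensional latent vector per label."""
    return (x @ W_f).reshape(num_labels, k)


def tail_gnn(H, A):
    """One round of sum aggregation over A, then a per-label probability."""
    H_next = np.maximum(A @ H @ W_g, 0.0)            # aggregate, transform, ReLU
    return 1.0 / (1.0 + np.exp(-(H_next @ w_out)))   # logistic sigmoid


x = rng.normal(size=in_dim)
y_hat = tail_gnn(labeller(x), A)   # one prediction per label
```

Since every step is a differentiable array operation, the same composition written in an automatic differentiation framework could be trained end-to-end.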
In our specific case, the inputs are protein sequences of one-hot encoded amino acids, and the outputs are binary labels indicating presence or absence of individual functions for those proteins.
Echoing the protein modelling results of Fast-Parapred (Deac et al., 2019b), we have used a deep dilated-convolutional neural network for the labeller (similarly to ByteNet (Kalchbrenner et al., 2016) and WaveNet (Oord et al., 2016)). This architecture provides a parallelisable way of modelling amino-acid sequences without sacrificing performance compared to RNN encoders. This labelling network is fully convolutional (Springenberg et al., 2014): it predicts latent features for each amino acid, followed by global average pooling and reshaping the output to obtain one fixed-length latent vector per label.
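The labelling pipeline described above (per-position dilated convolutions, global average pooling, then a reshape into one latent vector per label) can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the trained architecture: the kernel width of 3, the three layers, the channel counts, the 20-letter alphabet and the random weights are all hypothetical, and `dilated_conv1d` is a helper written for this example.

```python
import numpy as np


def dilated_conv1d(x, kernel, dilation):
    """Zero-padded dilated 1D convolution that preserves sequence length.

    x: (length, in_ch); kernel: (width, in_ch, out_ch), width assumed odd.
    """
    width = kernel.shape[0]
    pad = dilation * (width - 1) // 2
    x_pad = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros((x.shape[0], kernel.shape[2]))
    for t in range(width):
        # Tap t reads the padded input shifted by t * dilation positions.
        out += x_pad[t * dilation : t * dilation + x.shape[0]] @ kernel[t]
    return out


rng = np.random.default_rng(0)
num_amino, num_labels, k, channels = 20, 6, 4, 8   # hypothetical sizes

seq = rng.integers(0, num_amino, size=50)          # toy sequence of residue ids
h = np.eye(num_amino)[seq]                         # one-hot encoding, (50, 20)

in_ch = num_amino
for layer in range(3):
    kernel = rng.normal(scale=0.1, size=(3, in_ch, channels))
    # Dilation doubles at every layer: 1, 2, 4.
    h = np.maximum(dilated_conv1d(h, kernel, dilation=2 ** layer), 0.0)
    in_ch = channels

W_out = rng.normal(scale=0.1, size=(channels, num_labels * k))
per_position = h @ W_out                           # latents for each amino acid
latents = per_position.mean(axis=0).reshape(num_labels, k)  # pool, then reshape
```

Because pooling happens only at the end, the same weights apply to sequences of any length.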
As we know that the gene ontology edges encode explicit containment relations between function labels, our TailGNN is closely related to the GCN model (Kipf and Welling, 2016). At each step, we update latent features $\vec{h}_i$ in each label by aggregating neighbourhood features across edges:
$$\vec{h}'_i = \mathrm{ReLU}\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij} \mathbf{W} \vec{h}_j\right) \tag{3}$$
where $\mathcal{N}_i$ is the one-hop neighbourhood of label $i$ in the GO, $\mathbf{W}$ is a shared weight matrix parametrising a linear transformation in each node, and $\alpha_{ij}$ is a coefficient of interaction from node $j$ to node $i$, for which we attempt several variants: sum-pooling (Xu et al., 2018) ($\alpha_{ij} = 1$), mean-pooling (Hamilton et al., 2017a) ($\alpha_{ij} = \frac{1}{|\mathcal{N}_i|}$), and graph attention ($\alpha_{ij} = a(\vec{h}_i, \vec{h}_j)$, where $a$ is an attention function producing scalar coefficients). We use the same attention mechanism as used in GAT (Veličković et al., 2017).
Lastly, we also attempt to explicitly align with the containment inductive bias by leveraging max-pooling:
$$\vec{h}'_i = \mathrm{ReLU}\left(\max_{j \in \mathcal{N}_i} \mathbf{W} \vec{h}_j\right) \tag{4}$$
where $\max$ is performed elementwise.
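The sum, mean and max aggregation variants described above can be sketched as a single NumPy function. This is a minimal illustration, assuming a dense binary adjacency matrix with `A[i, j] = 1` when label `j` is in the neighbourhood of label `i`, and assuming a ReLU nonlinearity; the attention variant is omitted, and the function name `tail_gnn_layer` is hypothetical.

```python
import numpy as np


def tail_gnn_layer(H, A, W, mode="sum"):
    """One label-space update. H: (n, k) label latents, A: (n, n), W: (k, k')."""
    M = H @ W                                   # shared linear map in each node
    if mode == "sum":
        agg = A @ M
    elif mode == "mean":
        deg = np.maximum(A.sum(axis=1, keepdims=True), 1.0)
        agg = (A @ M) / deg
    elif mode == "max":
        # Mask non-neighbours with -inf, then take the elementwise maximum.
        masked = np.where(A[:, :, None] > 0, M[None, :, :], -np.inf)
        agg = masked.max(axis=1)
    else:
        raise ValueError(f"unknown aggregation mode: {mode}")
    return np.maximum(agg, 0.0)                 # ReLU (assumed nonlinearity)


# Toy example: a 3-label chain with self-loops and an identity weight matrix.
H = np.array([[1.0, -2.0], [3.0, 0.0], [-1.0, 5.0]])
A = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]], dtype=float)
W = np.eye(2)

out_sum = tail_gnn_layer(H, A, W, "sum")    # [[4, 0], [3, 3], [2, 5]]
out_mean = tail_gnn_layer(H, A, W, "mean")  # [[2, 0], [1, 1], [1, 2.5]]
out_max = tail_gnn_layer(H, A, W, "max")    # [[3, 0], [3, 5], [3, 5]]
```

On this toy graph the max variant lets a single large neighbour value dominate each feature, whereas sum and mean blend all neighbours.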
The final layer of our network is a shared linear layer, followed by a logistic sigmoid activation. It takes the latent label representations produced by TailGNN and predicts a scalar value for each label, indicating the probability of the protein having the corresponding function. We optimise the entire network end-to-end using binary cross-entropy on the ground-truth functions.
It is interesting to note that, by performing constrained relational computations in the label space, the operation of the TailGNN can be closely related to conditional random fields (CRFs) (Lafferty et al., 2001; Krähenbühl and Koltun, 2011; Cuong et al., 2014; Belanger and McCallum, 2016; Arnab et al., 2018). CRFs have been combined with GNNs in prior work (Ma et al., 2018; Gao et al., 2019), primarily as a means of strengthening the GNN prediction. In our work, we express all computations using GNNs alone: if optimal, TailGNNs could learn to specialise to the computations of the CRF through neural execution (Veličković et al., 2019), while in principle retaining the opportunity to learn more data-driven rules for message passing between different labels.
Further, TailGNNs share some similarities with gated propagation networks (GPNs) (Liu et al., 2019), which leverage class relations to compute class prototypes for meta-learning (Snell et al., 2017). While both GPNs and TailGNNs perform GNN computations over a graph in the label space, the aim of GPNs is to compute structure-informed prototypes for a 1-NN classifier, while here we focus on multi-task predictions and directly produce outputs in an end-to-end differentiable fashion.
Beyond operating in the label space, GNNs have seen prior applications to protein function modelling through explicitly taking into account either the protein’s residue contact map (Gligorijevic et al., 2019) or existing protein-protein interaction (PPI) networks. In particular, Hamilton et al. (2017a) provide the first study of explicitly running GNNs over PPI graphs in order to predict gene ontology signatures (Zitnik and Leskovec, 2017). However, as these models rely on the existence of either a reliable contact map or a PPI graph, they cannot be reliably used to predict functions for novel proteins (for which these may not yet be known). Such information, if assumed available, may be explicitly included as a relational component within the labeller network.
3 Experimental Evaluation
3.1 Dataset
We used training sequences and functional annotations from CAFA3, a protein function prediction challenge (Zhou et al., 2019). The functional annotations were represented by functional terms of the hierarchical structure of the Gene Ontology (GO) (Ashburner et al., 2000), using the version released in April 2020. Out of the three large groups of functions represented in GO, we used the Molecular Function Ontology (MFO), which contains 11,113 terms. Function subgraphs for each protein were obtained by propagating functional annotations to the root. We discarded obsolete nodes and functions occurring in fewer than 500 proteins in the original dataset, obtaining a reduced ontology with 123 nodes and 145 edges. Next, we eliminated proteins whose function subgraph contained only the root node (which is always active), as well as proteins longer than 1,000 amino acids.
All of the above constraints were devised with the aim of keeping the downstream task relevant, while at the same time simpler for the dilated convolutions to model, delegating most of the subsequent representational effort to the TailGNN. The final dataset contains 31,243 proteins, with an average sequence length of 431 amino acids. The average number of functions per protein is 7.
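The step of propagating functional annotations to the root amounts to taking the ancestor closure of the annotated terms in the ontology graph. The sketch below illustrates this on a toy ontology; the term names and the `parents` mapping are hypothetical and are not taken from GO or the CAFA3 data.

```python
def propagate_to_root(annotated, parents):
    """Return the annotated terms together with all of their ancestors."""
    closed = set()
    stack = list(annotated)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            # Follow every parent edge (a DAG node may have several parents).
            stack.extend(parents.get(term, ()))
    return closed


# Hypothetical toy ontology, stored as child -> list of parents.
parents = {
    "root": [],
    "binding": ["root"],
    "catalytic_activity": ["root"],
    "ion_binding": ["binding"],
    "metal_ion_binding": ["ion_binding"],
}

subgraph = propagate_to_root({"metal_ion_binding"}, parents)
```

The resulting set is closed under the predecessor relationship, which is the consistency property required of a protein's function subgraph.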
3.2 Training specifics
The dataset was randomly split into training, validation and test sets, in proportions of roughly 68:17:15 percent. We counted the individual label occurrences within these sets, observing that the split was appropriately stratified across all of them. The time at which each protein function was characterised was not taken into account, since the aim was to examine whether a GNN method is able to cope with structural labels.
The architectural hyperparameters were determined based on the validation set performance, using the $F_1$ score, a suitable measure for imbalanced label problems which is also commonly used for evaluating models in CAFA challenges (Zhou et al., 2019). Via thorough hyperparameter sweeps, we decided on a labelling network of six dilated convolutional layers, with exponentially increasing dilation rate. Initially the individual amino acids are embedded into 16 features, and the individual layers compute features each, mirroring the results of Deac et al. (2019b).
For predicting functions directly from the labelling network, we follow with a linear layer of 123 features (one per function) and global average pooling across amino acid positions, predicting the probability of each function occurring.
When pairing with TailGNN, however, the linear layer computes $123k$ features, with $k$ being the number of latent features computed per label (i.e. the dimensionality of the latent label vectors). We swept various small values of $k$ (further increasing $k$ quickly leads to an increase in parameter count, leading to overfitting and memory issues), finding to perform optimally.
In addition, we concatenate five spectral features to each input node to the TailGNN, in the form of the five eigenvectors corresponding to the five largest eigenvalues of the graph Laplacian, inspired by the Graph Fourier Transform of Bruna et al. (2013).
For each choice of TailGNN aggregation, we evaluated one and two GNN layers of features each, followed by a linear classifier for protein functions. We also assessed performance without incorporating the spectral features.
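The spectral node features can be computed with a standard symmetric eigendecomposition. The sketch below is one possible reading of the description above, under assumptions that the text does not settle: it symmetrises the label graph, uses the unnormalised Laplacian (degree matrix minus adjacency matrix), and runs on a hypothetical six-label chain. The function name `laplacian_spectral_features` is illustrative.

```python
import numpy as np


def laplacian_spectral_features(A, num_features=5):
    """Eigenpairs for the largest eigenvalues of the unnormalised Laplacian."""
    A_sym = np.maximum(A, A.T).astype(float)    # treat edges as undirected
    np.fill_diagonal(A_sym, 0.0)                # drop self-loops
    L = np.diag(A_sym.sum(axis=1)) - A_sym      # L = D - A
    eigvals, eigvecs = np.linalg.eigh(L)        # eigenvalues in ascending order
    return eigvals[-num_features:], eigvecs[:, -num_features:]


# Hypothetical toy label graph: a chain of 6 labels with child -> parent edges.
n = 6
A = np.zeros((n, n))
for i in range(n - 1):
    A[i + 1, i] = 1.0

vals, feats = laplacian_spectral_features(A)

# Concatenate the five spectral features to (placeholder) label latents.
H = np.ones((n, 3))
H_aug = np.concatenate([H, feats], axis=1)      # shape (6, 3 + 5)
```

Each row of `feats` gives one label's coordinates in the chosen eigenvectors, so the features depend only on the label graph and can be precomputed once.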
All models optimise the binary cross-entropy on the function predictions using the Adam SGD optimiser (Kingma and Ba, 2014) (with learning rate and batch size of ), incorporating class weights to account for any imbalance. We train for epochs with early stopping on the validation $F_1$, with a patience of epochs.
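A class-weighted binary cross-entropy can be sketched as follows. The text does not state how the class weights were derived, so the inverse-frequency weighting of positive examples used here is an assumption, as are the function names `weighted_bce` and `inverse_frequency_weights` and the toy label matrix.

```python
import numpy as np


def inverse_frequency_weights(Y):
    """Per-label weight for positives: (#negatives) / (#positives)."""
    pos = Y.sum(axis=0)
    neg = Y.shape[0] - pos
    return neg / np.maximum(pos, 1.0)           # guard against empty labels


def weighted_bce(y_true, y_prob, pos_weight, eps=1e-7):
    """Mean binary cross-entropy with positives scaled by pos_weight."""
    p = np.clip(y_prob, eps, 1.0 - eps)         # avoid log(0)
    loss = -(pos_weight * y_true * np.log(p)
             + (1.0 - y_true) * np.log(1.0 - p))
    return loss.mean()


# Toy label matrix: 4 proteins, 2 functions (hypothetical data).
Y = np.array([[1.0, 0.0],
              [0.0, 0.0],
              [0.0, 1.0],
              [0.0, 1.0]])
weights = inverse_frequency_weights(Y)          # [3.0, 1.0]

probs = np.full_like(Y, 0.5)
loss = weighted_bce(Y, probs, weights)
```

With all weights equal to one, the function reduces to the ordinary binary cross-entropy; rarer labels receive larger weights on their positive examples.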
3.3 Results
We evaluate the recovered optimised models across five random seeds. Results are given in Table 1; the labelling network is the baseline dilated convolutional network without leveraging GNNs. Additionally, we provide results across a variety of TailGNN configurations. Our results are consistent with the top-10 performance metrics in the CAFA3 challenge (Zhou et al., 2019), but a direct comparison is not possible since we use a reduced ontology.
Our results demonstrate a significant performance gain associated with appending TailGNN to the labelling network, specifically when using the sum aggregator. While less aligned to the containment relation than maximisation, summation is also more “forgiving” with respect to any labelling mistakes: if TailGNN-max had learnt to perfectly implement containment, any mistakenly labelled leaves would cause large chunks of the ontology to be misclassified.
Further, we discover a performance gain associated with including the Laplacian eigenvectors: including them as node features, where they act as a low-frequency indicator of global graph features, further improves the results of TailGNN-sum.
While much of our analysis was centered around the protein function prediction task, we conclude by noting that the way TailGNNs are defined is task-agnostic, and could easily see application in other areas of the sciences (as discussed in the Introduction), with minimal modification to the setup.
Table 1
| Model | Validation | Test |
| --- | --- | --- |
| Labelling network | | |
| TailGNN-mean | | |
| TailGNN-GAT | | |
| TailGNN-max | | |
| TailGNN-sum | | |
| TailGNN-sum (no spectral fts.) | | |
References

Arnab et al. (2018). Conditional random fields meet deep neural networks for semantic segmentation: combining probabilistic graphical models with deep learning for structured prediction. IEEE Signal Processing Magazine 35 (1), pp. 37–52.
Ashburner et al. (2000). Gene ontology: tool for the unification of biology. Nature Genetics 25 (1), pp. 25–29.
Battaglia et al. (2018). Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261.
Belanger and McCallum (2016). Structured prediction energy networks. In ICML.
Bronstein et al. (2017). Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine 34 (4), pp. 18–42.
Bruna et al. (2013). Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203.
Corso et al. (2020). Principal neighbourhood aggregation for graph nets. arXiv preprint arXiv:2004.05718.
Cuong et al. (2014). Conditional random field with high-order dependencies for sequence labeling and segmentation. Journal of Machine Learning Research 15, pp. 981–1009.
Deac et al. (2019a). Drug-drug adverse effect prediction with graph co-attention. arXiv preprint arXiv:1905.00534.
Deac et al. (2019b). Attentive cross-modal paratope prediction. Journal of Computational Biology 26 (6), pp. 536–545.
Djuric et al. (2015). Gaussian conditional random fields for aggregation of operational aerosol retrievals. IEEE Geoscience and Remote Sensing Letters 12 (4), pp. 761–765.
Djuric et al. (2011). Travel speed forecasting by means of continuous conditional random fields. Transportation Research Record: Journal of the Transportation Research Board 2263, pp. 131–139.
Gao et al. (2019). Conditional random field enhanced graph convolutional neural networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 276–284.
Gilmer et al. (2017). Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1263–1272.
Gligorijevic et al. (2019). Structure-based function prediction using graph convolutional networks. bioRxiv, pp. 786236.
Hamilton et al. (2017a). Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034.
Hamilton et al. (2017b). Representation learning on graphs: methods and applications. arXiv preprint arXiv:1709.05584.
Kalchbrenner et al. (2016). Neural machine translation in linear time. arXiv preprint arXiv:1610.10099.
Kingma and Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kipf and Welling (2016). Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
Krähenbühl and Koltun (2011). Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS.
Lafferty et al. (2001). Conditional random fields: probabilistic models for segmenting and labeling sequence data. pp. 282–289.
Liu et al. (2019). Learning to propagate for graph meta-learning. In Advances in Neural Information Processing Systems, pp. 1037–1048.
Ma et al. (2018). CGNF: conditional graph neural fields.
Oord et al. (2016). WaveNet: a generative model for raw audio. arXiv preprint arXiv:1609.03499.
Radosavljevic et al. (2010). Continuous conditional random fields for regression in remote sensing. Vol. 215, pp. 809–814.
Scarselli et al. (2008). The graph neural network model. IEEE Transactions on Neural Networks 20 (1), pp. 61–80.
Snell et al. (2017). Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077–4087.
Springenberg et al. (2014). Striving for simplicity: the all convolutional net. arXiv preprint arXiv:1412.6806.
Veličković et al. (2017). Graph attention networks. arXiv preprint arXiv:1710.10903.
Veličković et al. (2019). Neural execution of graph algorithms. arXiv preprint arXiv:1910.10593.
Xu et al. (2018). How powerful are graph neural networks? arXiv preprint arXiv:1810.00826.
Zhou et al. (2019). The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Genome Biology 20 (1), pp. 1–23.
Zitnik et al. (2018). Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics 34 (13), pp. i457–i466.
Zitnik and Leskovec (2017). Predicting multicellular function through multi-layer tissue networks. Bioinformatics 33 (14), pp. i190–i198.