Applications in precision oncology, such as survival analysis and cancer subtype detection, use regression on gene expression data or hand crafted differential expression models. Particular genes are usually selected to build prediction models to avoid the impact of spurious correlations between genes expression levels on the predicted labels. An experienced bioinformatician would know the genes related to a particular tissue or disease given online repositories of gene function and interaction. In this work we explore how this information can be used to bias the feature construction in a deep learning model.
Due to high cost of data acquisition in biology, most predictive models must be trained in the “low data” regime. Thousands of studies each year are deposited into the NIH Gene Expression Omnibus (GEO) 111https://www.ncbi.nlm.nih.gov/geo/ database and the number of new samples is rising almost monotonically as seen in Figure 1. However, the median dataset size remains stagnant. In order for deep learning models to have an impact on biological research we must improve their performance in low data regimes.
Gene expression data produced from RNA-Seq or MicroArray is often interpreted as a set of independent variables. However, one can take advantage of the experimental data on protein-protein interactions, transcription factors, and gene co-expression to factorize the probability distribution of genes expression levels. These interactions between genes or their products are represented as graphs(Warde-Farley et al., 2010).
We explore how these graphs can be used to bias a model and subsequently the representation learned using what is known already about the biological system. The bulk of such work has been previously done using linear models (Zhang et al., 2017). Here we try to extend graph bias to modern deep learning models. This approach helps the model to ignore noise which correlates with a target prediction by chance. Also it can reduce the overall number of parameters, analogous to what is done with Convolutional Neural Networks in images.
2 Gene Graph Convolutions
Most existing work with graph convolutional networks focuses on settings where data is confined to a graph structure, for example points clouds, social networks, or protein structures. With gene expression data, the graphs are complementary to the main task and can be used as a bias. With low numbers of samples, known relationships between variables can help a model avoid learning spurious correlations.
Via regularization: The method proposed by Min et al. (2016)
is to regularize the weights of a Network-regularized Sparse Logistic Regression (NSLR) based on the connectivity of the nodes found in the interaction graph. This is achieved by adding a regularization term using the graph Laplacian
to the logistic regression loss function. This regularization encourages the weights associated with nodes that have a high number of interactions to remain important.
Via convolution: We can also use the structure of the graph as a bias. By performing convolution operations on a node to incorporate information from its neighbors we can extract and propagate the features along the edges of the graph, similar to what happens inside a Convolution Neural Network with adjacent pixels. This convolution over the features with parameters (denoted ) is not trivial when the structure of the graph is highly complex. Bruna et al. (2014)
explored the use of highly sparse Multi-Layer Perceptrons (MLP) where each feature is only linked to its neighbours. They also used a spectral convolution, by projecting the parametersinto the spectral space of the Laplacian matrix:
contains the eigenvectors of the Laplacian, and are the nodes features at layer and respectively.
However, the full projection of the eigenvectors represent paths of infinite length and will therefore take into account all nodes at once and prevent the network from reasoning about interactions with neighbors. This means that no locality would be present in the convolution, which makes the interpretability and the sharing the parameters more difficult.
To obtain locality in the convolution we can utilize methods in Defferrard et al. (2016) and Kipf & Welling (2016) and approximate the convolution to the first neighbouring nodes (paths of length 1). With ( being the adjacency matrix) and , they showed that:
Where , , , and the number of nodes, the input feature size, and the output feature size. This leaves us with only parameters to learn for one specific layer.
This approach does not allow us to have different types of interactions, since all nodes are aggregated before any transformation is done. While it is possible to have different sets of parameters for different interactions like in Bruna et al. (2014), gene interaction graphs do not contain the specific type of interactions between genes. For this reason, we followed Hamilton et al. (2017)
and added a skip connection at each convolution layer, which essentially preserves two kinds of signals: the neighborhood and the node itself. The full convolution is then followed by an activation function (in our case) and an aggregation clustering method to reduce the number of nodes at each layer:
In this work hierarchical clustering is used based on the node connectivity in the interaction graph to reduce the number of nodes by half after each convolution. A max pooling is then performed on each resulting cluster. At the last layer, the remaining nodes are concatenated together and fed into a linear layer to make the final prediction. The inputof the model are gene embeddings, learned during training, and scaled by their corresponding expression level.
To help with the low amount of data, we also use Drop-out (Srivastava et al., 2014). After every convolutional layer, each node has a 40% chance of being dropped. This results in the model not being able to rely on a specific node and has to spread the important information across the network, which in turn can make the learning of important features easier. An overview of the model can be seen in Figure 2.
This work uses 10,459 RNA-Seq samples from the TCGA PANCAN database (The Cancer Genome Atlas et al., 2013) spanning multiple tissues and measuring 16,300 genes for each sample. Most samples have been diagnosed with some form of cancer but many healthy examples are also included.
While our ultimate goal is to predict clinical attributes using all gene signals, we have not observed conclusive evidence of deep neural networks outperforming logistic regression baselines in this setting.
To allow us to obtain feedback when comparing different methods, we constructed an experimental setting where we can reduce the difficulty by using only a subset of all genes which are known to be related to the task at hand. Ideally this task would be cancer subtype or phenotypical trait but this does not seem possible for the following reasons: 1. we cannot make any assumptions on the relevant genes 2. we cannot guarantee that any complex relationship is necessary to solve the task.
The setting is as follows: we select a specific gene and convert its real-valued expression level to a binary variable representing if it is over or under expressed compared to the mean value for that gene in the 10,459 TCGA samples. This allows a simple binary prediction which we can evaluate using AUC. We then try to predict this value based on the expression level of other genes, not unlike the gene inference task done byChen et al. (2016). To vary the difficulty of the task, we choose different subsets of input genes, based on their connectivity as shown in Figure 3. If nodes are of equal distance they are sorted for repeatable experiments. In the easiest setting, we predict the gene over/under expressivity by providing a classification model with the expression values of its closest neighbors and then add more and more gene expression values until all 16k genes are considered by the model.
We perform experiments to illustrate why predicting labels using all genes (like cancer subtype or other clinical labels we care about) does not work well. We predict if a single gene value is over/under expressed given its neighbors in the GeneMania gene-gene interaction graph. For each plot the target gene varies and the number of gene signals given to each model is increased starting from the small set of neighboring genes and increased to contain all genes in the dataset. 20 trials are used and error bars are shown using standard error. A vertical bar is omitted when the number of first degree neighbors are less than 50.
3.1 Quality of graphs
This work explores two public undirected graph datasets containing a mixture of protein-protein interaction and gene co-expression data. The first, GeneMania (Warde-Farley et al., 2010), is a combination of previously published protein-protein interaction and co-expression graphs that contains 264,657 edges covering 16,297 genes. The second, RegNetwork (Liu et al., 2015), is composed of experimental and predicted up/down regulating effects between genes and includes information from KEGG (Kanehisa et al., 2017). RegNetwork contains 247,848 edges covering 7,220 nodes.
The performance of these graphs is compared in Figure 4. The percentage of genes where there is an improvement using first degree neighbors is a minority. However, it can yield a AUC improvement in these cases. Specifically, when using the RegNetwork graph the expression value of of target genes were better predicted by their first degree neighbors than the full gene set. This percentage climbs to for the GeneMania graph.
Predictions made with the GeneMania graph outperformed those made with RegNetwork even though RegNetwork has almost twice as many edges per node. This finding suggests that simply merging graphs will not yield improved performance. Future work should experiment with directed graphs and multiple edge-types.
3.2 Robustness to noise
As discussed earlier, naively comparing models with the whole input is prone to failure. In order to study at what point these models fail we vary the features considered between 50 and all 16k genes. We aim to observe the potential of a model helped by knowledge of a subset of related neighboring genes and how that signal is obscured as we include more and more unrelated genes. In Figure 5
we show the AUC of different models plotted side by side. We compare a MLP (2 layers, 512 channels, ReLU) with and without dropout, a Sparse Logistic Regression model with L1 and network regularization (SLR) and two Graph Convolution Network (GCN), as described in section2. A simple GCN (3 layers, 64 channels, 32 dim embeddings), and a bigger GCN with 4 pooling layers (24 layers, 32 channels, 32 dim embeddings). The GeneMania graph was used for this task. Each experiment is performed 20 times where each run selects 50 random samples for training, 50 for validation and early stopping, and 1000 for testing/reporting. We were not able to compute the clustering for all 16k genes so pooling cannot be done at that scale.
Three categories of model performance are observed. In Figure 5 using neighbors provides an improvement demonstrating that signal exists in the data but these methods are unable to construct features when all genes are considered. This confirms the necessity of doing experiments with a gradual level of difficulty, in order to adequately compare the different models. We observe that Graph Convolutional approaches perform better than a MLP or Logistic Regression model. In Figure 5 we observe that models performance is either consistent or increases as neighbors are added, demonstrating a limitation of using graph information, as the most relevant nodes were not in the immediate neighborhood.
In this paper we explore the difficulty in using deep learning models on gene expression data. We provided an experimental setup where we can evaluate different methods with an increasing level of difficulty. We demonstrate that gene-gene interaction graphs can be utilized in a deep model with Graph Convolutional Networks and that they perform well on the single gene inference task when compared to MLP or logistic regression models.
This work is partially funded by a grant from the U.S. National Science Foundation Graduate Research Fellowship Program (grant number: DGE-1356104) and the Institut de valorisation des donnees (IVADO). This work utilized the supercomputing facilities managed by the Montreal Institute for Learning Algorithms, NSERC, Compute Canada, and Calcul Quebec.
- Bruna et al. (2014) Bruna, Joan, Zaremba, Wojciech, Szlam, Arthur, and LeCun, Yann. Spectral Networks and Locally Connected Networks on Graphs. In International Conference on Learning Representations, 2014.
- Chen et al. (2016) Chen, Yifei, Li, Yi, Narayan, Rajiv, Subramanian, Aravind, and Xie, Xiaohui. Gene expression inference with deep learning. Bioinformatics, 2016. doi: 10.1093/bioinformatics/btw074.
- Defferrard et al. (2016) Defferrard, Michaël, Bresson, Xavier, and Vandergheynst, Pierre. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. In Neural Information Processing Systems, 2016.
- Hamilton et al. (2017) Hamilton, William L., Ying, Rex, and Leskovec, Jure. Inductive Representation Learning on Large Graphs. In Neural Information Processing Systems, 2017.
- Kanehisa et al. (2017) Kanehisa, Minoru, Furumichi, Miho, Tanabe, Mao, Sato, Yoko, and Morishima, Kanae. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research, 2017. doi: 10.1093/nar/gkw1092.
- Kipf & Welling (2016) Kipf, Thomas N. and Welling, Max. Semi-Supervised Classification with Graph Convolutional Networks. International Conference on Learning Representations, 2016.
- Liu et al. (2015) Liu, Zhi-Ping, Wu, Canglin, Miao, Hongyu, and Wu, Hulin. RegNetwork: an integrated database of transcriptional and post-transcriptional regulatory networks in human and mouse. Database: The Journal of Biological Databases and Curation, 2015. doi: 10.1093/database/bav095.
- Min et al. (2016) Min, Wenwen, Liu, Juan, and Zhang, Shihua. Network-regularized Sparse Logistic Regression Models for Clinical Risk Prediction and Biomarker Discovery. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2016. doi: 10.1109/TCBB.2016.2640303.
Srivastava et al. (2014)
Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and
Dropout: A Simple Way to Prevent Neural Networks from Overfitting.
Journal of Machine Learning Research, 2014.
- The Cancer Genome Atlas et al. (2013) The Cancer Genome Atlas, John N., Weinstein, John N, Collisson, Eric A, Mills, Gordon B, Shaw, Kenna R Mills, Ozenberger, Brad A, Ellrott, Kyle, Shmulevich, Ilya, Sander, Chris, and Stuart, Joshua M. The Cancer Genome Atlas Pan-Cancer analysis project. Nature genetics, 2013. doi: 10.1038/ng.2764.
- Warde-Farley et al. (2010) Warde-Farley, David, Donaldson, Sylva L., Comes, Ovi, Zuberi, Khalid, Badrawi, Rashad, Chao, Pauline, Franz, Max, Grouios, Chris, Kazi, Farzana, Lopes, Christian Tannus, Maitland, Anson, Mostafavi, Sara, Montojo, Jason, Shao, Quentin, Wright, George, Bader, Gary D., and Morris, Quaid. The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function. Nucleic Acids Research, 2010. doi: 10.1093/nar/gkq537.
- Zhang et al. (2017) Zhang, Wei, Chien, Jeremy, Yong, Jeongsik, and Kuang, Rui. Network-based machine learning and graph theory algorithms for precision oncology. npj Precision Oncology, dec 2017. doi: 10.1038/s41698-017-0029-7.