Cancer is caused by alterations and abnormal behavior of the genes that control cell division and cell growth. The genetic aberrations involved, such as somatic mutations, copy number variations (CNV), and different epigenetic alterations, produce profiles that are unique for each type of cancer [82Tomczak, 19Cruz]. Gene expression (GE) can thus be disrupted by cell division or environmental effects, or such disruptions can be genetically inherited from parents. Changes in GE sometimes alter the production of different proteins, affecting normal cell behavior. These damaged cells start reproducing more rapidly than usual and gradually accumulate in the affected area by forming a tumor. Such tumors occasionally turn into cancer [zuo2019identification, 24Podolsky]. This is one of the main reasons cancer incidence is increasing every year, making cancer the second leading cause of death worldwide. More than 200 types of cancer have been identified, each of which can be characterized by a different molecular profile requiring a unique therapeutic strategy [82Tomczak]. The most common cancers diagnosed in men are prostate, lung, and colorectal cancers, while for women, breast, lung, and colorectal cancers are most common [li2017comprehensive].
As the importance of genetic knowledge in cancer treatment is increasingly recognized, several projects have emerged, of which The Cancer Genome Atlas (TCGA) is the most well-known source of omics data. TCGA curates various omics data, e.g., gene mutations, gene expression (GE), DNA methylation, CNV, and miRNA expression. By acquiring deep insights from these data, treatment can focus on preventive measures. Besides, clinical outcomes, i.e., clinical and pathology information, are provided. TCGA further analyzed over 11,000 cancer cases from 33 prevalent forms of cancer, which fostered the accomplishment of the Pan-Cancer Atlas (PCA), which provides normalized GE data for about 20K protein-coding genes [lyu2018deep].
These data, however, are highly variable, high-dimensional, and sourced from heterogeneous platforms, which imposes significant challenges on existing bioinformatics tools and has stimulated the development of deep learning (DL)-based diagnosis and prognosis systems. Since DL algorithms work better with such high-dimensional data, recent studies have focused on deep architectures such as autoencoders, Convolutional Neural Networks (CNN), and Recurrent Neural Networks (RNN). Although these models have shown tremendous success in making high-confidence predictions, they are mostly perceived as 'black box' methods because of a lack of understanding of their internal functioning. This is a serious drawback, since interpretability is essential to generate insights into why a given cancer case is of a certain type, and since knowing the most important biomarkers can help in recommending more accurate treatments and drug repositioning. Further, the 'right to explanation' of the EU GDPR [kaminski2019right] gives patients the right to know why and how an algorithm made a diagnosis decision. However, existing approaches can neither make the diagnosis transparent nor be considered trustworthy.
Since GE data are very high-dimensional and a significant number of genes have only a trivial effect on the tumor, making them very weak features, we hypothesize that our approach, called OncoNetExplainer, which is based on neural networks (NN) and ML baselines with explanation capability, can be more effective at learning hierarchical features. We trained and evaluated CNN and VGG16 networks with a guided gradient-weighted class activation map (GradCAM++) [chattopadhay2018grad]. Using GradCAM++, we generated heat maps (HM) for all the classes, showing prominent pixels across GE values, and computed the feature importance in terms of mean absolute impact (MAI) to identify important biomarkers and provide interpretations of the predictions, making the cancer diagnosis transparent. We then validated our findings through functional analysis to make sure the selected genes are biologically trustworthy for the corresponding tumor types. Further, SHAP is used along with gradient boosted trees (GBT) to validate our findings based on the annotations provided by the TumorPortal, ensuring consistency and accuracy.
The rest of the paper is structured as follows: section II discusses related works and highlights their potential limitations. Section III chronicles data collection and preparation before constructing and training the network. Section IV demonstrates some experimental results and discusses the key findings of the study. Section V provides explanations and points out the relevance of the study, highlights its limitations and discusses future works before concluding the paper.
II Related work
Numerous approaches using genomic data, bioimaging, and clinical outcomes have been proposed for analyzing genomic profiles of patients for treatment decision making [li2017comprehensive]. However, early detection of tumors is particularly important for better treatment of patients, a notable issue being the discrimination of tumor samples from normal ones [liu2017tumor]. Unlike conventional cancer typing methods that work based on morphological appearances, GE levels of the tumor are used to differentiate tumors that have similar histo-pathological appearances, giving more accurate tumor typing results for colorectal cancer diagnosis [paroder2006na+]. Different types of mutation data, e.g., point mutation, single nucleotide variation, INDEL, and CNV are also used. Yuan et al. [yuan2018cancer] observed that these genomics phenomena are associated with complex diseases and contribute to the growth of different types of cancers.
Different ML algorithms have been trained using mixed data types, e.g., genomic data, bioimaging data, and clinical outcomes. These approaches not only proved useful at improving cancer prognosis, diagnosis, and treatment but also revealed subtype information for several cancer types [66Huang]. Li et al. [li2017comprehensive] employed a genetic algorithm for feature selection and a k-nearest neighbors classifier for classification based on GE data from the PCA project. Their approach can classify 90% of the tumor cases correctly using different 20-gene sets. However, since the data contain GE values of more than 20K protein-coding genes, these generic ML methods were found to be inefficient, with some genes appearing repeatedly in the sets because of the curse of dimensionality. Since DL algorithms work better with high-dimensional data, recent studies have focused on NN architectures, which, in comparison with traditional ML-based approaches, have shown more accurate and promising results for cancer identification. In particular, CNN have shown tremendous success and gained much attention for gene selection and classification based on microarray data [zeebaree2018gene]. Further, Cruz et al. [19Cruz]
trained a CNN using whole slide images and extracted deep features from different cohorts, which are used to detect cancer regions. Danaee et al. [17Danaee] used a stacked denoising autoencoder to extract features from RNA-seq data, which are then fed into an SVM and a shallow ANN to classify malignant or benign breast tumors [18Chen]. Although their study makes predictions on the cancer predisposition of unseen test groups of mixed DNAs with high confidence, it is limited to Caucasian and Korean cohorts. Elsadek et al. [elsadek2018supervised] used CNVs of 23,082 genes for 2,916 instances from cBioPortal to classify tumor samples of breast, bladder urothelial, colon, glioblastoma, kidney, and head and neck squamous cells, achieving an accuracy of 85%.
Lyu et al. [lyu2018deep] and Mostavi et al. [mostavi2019convolutional] embedded the RNA-Seq data from the PCA project into 2D images and trained a CNN to classify 33 tumor types, which outperforms the approach in [li2017comprehensive]. Besides, they provided a functional analysis of the genes with high intensities in the HM based on GradCAM and validated that these top genes are related to tumor-specific pathways. However, due to the stochastic nature of NN, the predictions and feature importance generated differ slightly across runs, i.e., they are not deterministic. This is no exception for tree-based ensemble models such as gradient boosted trees (GBT), which provide three options for measuring feature importance: i) weight, the number of times a feature is used to split the data across all trees; ii) cover, the number of times a feature is used to split the data across all trees, weighted by the number of training data points that go through those splits; and iii) gain, the average training-loss reduction gained when using a feature for splitting. Based on these measures, feature importance orderings (i.e., the order in which features were added) differ, since subsequent features get a disproportionately high weight.
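The gain measure above corresponds to scikit-learn's feature_importances_ attribute; the following sketch illustrates it on synthetic data (the dataset and model settings are illustrative, not those used in any of the cited studies):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Toy stand-in for GE features: 200 samples, 30 "genes", 3 classes.
X, y = make_classification(n_samples=200, n_features=30, n_informative=8,
                           n_classes=3, random_state=0)

gbt = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)

# scikit-learn's feature_importances_ corresponds to the "gain" measure:
# the (normalized) training-loss reduction obtained by splitting on a feature.
gain = gbt.feature_importances_
ranking = np.argsort(gain)[::-1]  # feature indices, most important first
```

XGBoost additionally exposes the weight and cover measures through `Booster.get_score(importance_type=...)`, which is why the ordering depends on which measure is chosen.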
Our proposed approach OncoNetExplainer first embeds high dimensional RNA-Seq data into 2D images and trains CNN and VGG16 networks with GradCAM++ activated to classify 33 tumor types based on patients’ GE profiles and provides a human-interpretable explanation (post-model explainability) to identify important biomarkers, which is further validated based on the annotations from TumorPortal. To provide a comparison with baselines, we also identify the top-K biomarkers for each cancer type and cancer specific driver genes based on GBT and SHAP (pre-model explainability).
III Materials and Methods
In this section, we discuss the data preparation, network construction, training, and biomarkers discovery with ranking.
III-A Data collection and preparation
We use the cancer transcriptomes from the Pan-Cancer Atlas project to interrogate GE states induced by deleterious mutations and copy number alterations. In particular, GE profiles of 33 prevalent tumor types for 9,074 samples are used in our approach. This dataset has been widely used as prior knowledge to generate tumor-specific biomarkers [way2018machine, hoadley2018cell, malta2018machine]. These data are hybridized on the Affymetrix 6.0 array, which allows us to examine the largest number of cases along with the highest probe density [31Park]. Table I shows the sample distribution.
Cohort  Samples  Tumor type
BRCA    981      Breast invasive carcinoma
LGG     507      Brain lower grade glioma
UCEC    507      Uterine endometrial carcinoma
HNSC    487      Head-neck squamous cell carcinoma
LUSC    464      Lung squamous cell carcinoma
BLCA    398      Bladder urothelial carcinoma
SKCM    363      Skin cutaneous melanoma
KIRC    352      Kidney renal clear cell carcinoma
LIHC    348      Liver hepatocellular carcinoma
CESC    272      Cervical & endocervical cancer
KIRP    271      Kidney papillary cell carcinoma
TGCT    144      Testicular germ cell tumor
LAML    115      Acute myeloid leukemia
DLBC     37      Diffuse large B-cell lymphoma
To apply the convolutional (conv) operations, we embed GE samples into 2D images: the GE values of each sample are reshaped from a 20,501×1 array into a 144×144 image by zero padding around the edges and normalized to [0, 255] without losing any information.
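This embedding can be sketched in NumPy as follows (a minimal sketch: the function name is ours, and for simplicity the zero padding here fills the tail of the square rather than surrounding the edges):

```python
import numpy as np

def ge_to_image(ge, size=144):
    """Embed a 1-D gene-expression profile into a size x size image.

    Values are min-max scaled to [0, 255]; the remainder of the square
    is zero-padded, so no information is lost.
    """
    ge = np.asarray(ge, dtype=np.float64)
    lo, hi = ge.min(), ge.max()
    scaled = (ge - lo) / (hi - lo) * 255.0 if hi > lo else np.zeros_like(ge)
    img = np.zeros(size * size)
    img[:scaled.size] = scaled   # zero padding fills the unused pixels
    return img.reshape(size, size)

# A 20,501-gene profile fits into a 144 x 144 image (20,736 pixels).
img = ge_to_image(np.random.rand(20501))
```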
III-B Network construction and training
We trained a shallow CNN from scratch alongside data augmentation, in which the output of each conv layer is passed to dropout and Gaussian noise layers to avoid overfitting and thus regularize the learning [vardropout]. This transforms the input feature space into a lower-dimensional representation, which is then further down-sampled by two different pooling layers and a max-pooling layer (MPL) by setting the pool size. The output of an MPL is considered an 'extracted feature' from each 2D GE image. Since each MPL 'flattens' the output space by taking the highest value in a feature map (FM), this produces a sequence vector from the last conv layer, which we expect to emphasize the GE values of specific genes that are highly indicative of a specific cancer type. This vector is then passed through another dropout layer and a fully connected softmax layer to produce the probability distribution over the classes.
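A minimal Keras sketch of such a network follows; the filter counts, kernel sizes, and dropout/noise rates are illustrative assumptions, not the tuned values from our experiments:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 33

# Shallow CNN: each conv block is regularized by dropout and Gaussian
# noise, then down-sampled by max pooling, as described above.
model = models.Sequential([
    layers.Input(shape=(144, 144, 1)),
    layers.Conv2D(32, (3, 3), padding='same', activation='relu'),
    layers.Dropout(0.25),
    layers.GaussianNoise(0.1),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Conv2D(64, (3, 3), padding='same', activation='relu'),
    layers.Dropout(0.25),
    layers.GaussianNoise(0.1),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Flatten(),                 # the 'extracted feature' vector
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation='softmax'),
])
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.01),
              loss='categorical_crossentropy', metrics=['accuracy'])
```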
The CNN is trained with AdaGrad to optimize the categorical cross-entropy loss between the predicted and actual cancer types. We also observe the performance when adding a Gaussian noise layer (GNL) after each conv layer to improve model generalization. Further, we used the pretrained VGG16 network, to which we added two dense layers at the end of the original architecture, followed by a GNL. Then, we fine-tuned the top layers with minor weight updates:
First, we instantiated the conv base of the VGG16 network and loaded its weights.
Then, we added our previously defined fully-connected layers on top with minor weight updates.
Finally, we placed a softmax layer by freezing up to the last conv block of the VGG16 model, which yields a probability distribution over 33 different classes.
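The three steps above can be sketched with Keras as follows (a sketch only: `weights=None` keeps the example self-contained and avoids a download, whereas in practice the pretrained weights would be loaded, and the dense-layer sizes are assumptions):

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Step 1: instantiate the conv base of VGG16 (weights would be loaded here).
base = VGG16(weights=None, include_top=False, input_shape=(144, 144, 3))

# Step 3 (freezing): keep everything up to the last conv block (block5)
# frozen so only the top layers receive weight updates.
for layer in base.layers:
    if not layer.name.startswith('block5'):
        layer.trainable = False

# Step 2: add fully-connected layers and a GNL on top, ending in softmax.
model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation='relu'),    # added dense layers (sizes assumed)
    layers.GaussianNoise(0.1),
    layers.Dense(33, activation='softmax'),  # distribution over 33 classes
])
```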
Since the data is very high-dimensional, we chose not to perform manual feature selection; rather, we let both the CNN and VGG16 networks extract the most important features. Guided back-propagation helps to generate more human-interpretable but less class-sensitive visualizations than saliency maps (SM) [nie2018theoretical]. Since SM use true gradients, the trained weights are likely to impose a stronger bias towards specific subsets of the input pixels; accordingly, class-relevant pixels are highlighted rather than producing random noise [nie2018theoretical]. Therefore, GradCAM++ is used to draw the HM to focus attention on the most important genes. Class-specific weights of each FM are collected from the final conv layer through globally averaged gradients (GAG) of FMs instead of pooling [chattopadhay2018grad]:
$$w_k^c = \frac{1}{Z}\sum_i\sum_j \frac{\partial y^c}{\partial A_{ij}^k} \tag{1}$$

where $Z$ is the number of pixels in a FM, $\frac{\partial y^c}{\partial A_{ij}^k}$ is the gradient of the class score $y^c$, and $A_{ij}^k$ is the value of the $k$-th FM at location $(i,j)$. Having gathered the relative weights, the coarse SM $L^c$ is computed as the weighted sum of the FMs [chattopadhay2018grad]:

$$L^c = \mathrm{ReLU}\Big(\sum_k w_k^c A^k\Big) \tag{2}$$

GradCAM++ replaces the GAG with a weighted average of the pixel-wise gradients (eq. 3), in which the weights $\alpha_{ij}^{kc}$ capture how much each pixel contributes to the final prediction (eq. 4); the final saliency map aggregates eq. 3 as in eq. 2 (eq. 5). In summary, it applies the following over the same activation maps $A^k$:

$$w_k^c = \sum_i\sum_j \alpha_{ij}^{kc}\,\mathrm{ReLU}\!\left(\frac{\partial y^c}{\partial A_{ij}^k}\right) \tag{3}$$

$$\alpha_{ij}^{kc} = \frac{\dfrac{\partial^2 y^c}{(\partial A_{ij}^k)^2}}{2\,\dfrac{\partial^2 y^c}{(\partial A_{ij}^k)^2} + \sum_a\sum_b A_{ab}^k\,\dfrac{\partial^3 y^c}{(\partial A_{ij}^k)^3}} \tag{4}$$

$$L^c = \mathrm{ReLU}\Big(\sum_k w_k^c A^k\Big) \tag{5}$$
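Given the activation maps and class-score gradients of a trained network, the GradCAM++ weighting can be sketched in NumPy (a sketch under the common exponential-score approximation, in which the higher-order derivatives reduce to powers of the first gradient; the function name is ours):

```python
import numpy as np

def gradcam_pp(A, grads, eps=1e-8):
    """Compute a Grad-CAM++ heat map.

    A:     (H, W, K) activation maps from the last conv layer.
    grads: (H, W, K) gradients of the class score w.r.t. A.
    """
    g2, g3 = grads ** 2, grads ** 3
    # Pixel-wise weighting coefficients alpha (denominator guarded at eps).
    denom = 2.0 * g2 + A.sum(axis=(0, 1), keepdims=True) * g3
    alpha = g2 / np.where(np.abs(denom) > eps, denom, eps)
    # Class-specific FM weights: weighted average of positive gradients.
    weights = (alpha * np.maximum(grads, 0.0)).sum(axis=(0, 1))
    # Final saliency map: ReLU of the weighted sum of activation maps.
    return np.maximum((weights * A).sum(axis=2), 0.0)

hm = gradcam_pp(np.random.rand(9, 9, 16), np.random.randn(9, 9, 16))
```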
Further, since an appropriate selection of hyperparameters can have a huge impact on the performance of a deep architecture, we perform hyperparameter optimization through random search and 5-fold cross-validation. In each of the 5 runs, 70% of the data is used for training and 30% for evaluation; 10% of the training set is used for validation of the networks to optimize the cross-entropy loss over the learning rate, batch size, number of filters, kernel size, and dropout/Gaussian noise probability.
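The random search itself can be sketched with scikit-learn's ParameterSampler; the search space below is hypothetical, not the one used in our experiments:

```python
from sklearn.model_selection import ParameterSampler

# Hypothetical search space over the hyperparameters listed above.
space = {
    'learning_rate': [1e-4, 1e-3, 1e-2],
    'batch_size': [32, 64, 128],
    'num_filters': [16, 32, 64],
    'kernel_size': [3, 5],
    'dropout': [0.1, 0.25, 0.5],
}
configs = list(ParameterSampler(space, n_iter=10, random_state=0))
# Each sampled config would then be scored by 5-fold cross-validation
# on the 70% training split, with 10% held out for validation.
```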
III-C Finding and validating important biomarkers
LABEL:algo:hm and LABEL:algo:imp_area depict the pseudocode for computing feature importance with gene ranking and for identifying important areas on the HM, respectively. We averaged all the normalized HM from the same class to generate a class-specific HM, inspired by Selvaraju et al. [chattopadhay2018grad]. In a HM, a higher-intensity pixel represents a higher significance to the final prediction, indicating a higher importance of the corresponding gene and its GE value. Top genes are then selected based on the intensity rankings and an MAI threshold. Since GradCAM++ requires all the samples to run through the network once, we let the trained CNN models record the activation maps in the forward pass and the gradient maps in back-propagation to collect the HM for each sample.
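The class-specific averaging and MAI-based ranking can be sketched as follows (function and variable names are ours; random arrays stand in for the real heat maps):

```python
import numpy as np

def top_genes_per_class(heatmaps, labels, n_genes=20501, k=20):
    """Average per-sample heat maps by class and rank genes by MAI.

    heatmaps: (N, 144, 144) normalized Grad-CAM++ maps, one per sample.
    labels:   (N,) tumor-type labels.
    Returns {class: indices of the top-k genes by mean absolute impact}.
    """
    top = {}
    for c in np.unique(labels):
        class_hm = heatmaps[labels == c].mean(axis=0)   # class-specific HM
        impact = np.abs(class_hm.ravel()[:n_genes])     # drop zero padding
        top[c] = np.argsort(impact)[::-1][:k]           # highest MAI first
    return top

hms = np.random.rand(12, 144, 144)
ys = np.array([0, 0, 1, 1, 2, 2, 0, 1, 2, 0, 1, 2])
top = top_genes_per_class(hms, ys)
```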
In contrast, Shapley values calculate the importance of a feature $i$ by comparing what a model predicts with and without that feature, over all possible subsets $S$ of the feature set $F$ in the dataset. Given a GE value of feature $i$, SHAP compares the model's prediction with and without $i$ included. The Shapley value $\phi_i$ is calculated as follows [NIPS2017_7062]:

$$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F|-|S|-1)!}{|F|!}\left[f_{S \cup \{i\}}\big(x_{S \cup \{i\}}\big) - f_S\big(x_S\big)\right]$$
However, since the order in which a model sees features can affect its predictions, this computation is repeated over all possible orderings to compare the features fairly. Features that have no effect on the predicted value are expected to produce a Shapley value of 0, and if two features contribute equally to the prediction, their Shapley values should be the same [NIPS2017_7062].
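For a small number of features, the Shapley value can be computed exactly by enumerating all subsets, which also illustrates the two properties above (a toy additive model of our own; SHAP approximates this computation for real models):

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, n):
    """Exact Shapley values for an n-feature model by subset enumeration.

    f(subset, x) returns the model prediction using only features in subset.
    Tractable only for small n.
    """
    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        val = 0.0
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                # Weight |S|! (n - |S| - 1)! / n! from the Shapley formula.
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                val += w * (f(set(S) | {i}, x) - f(set(S), x))
        phi.append(val)
    return phi

# Toy additive model: each feature contributes independently, so its
# Shapley value equals its own contribution; an unused feature gets 0.
coef = [2.0, -1.0, 0.0]
f = lambda S, x: sum(coef[j] * x[j] for j in S)
phi = shapley_values(f, [1.0, 1.0, 1.0], 3)   # → [2.0, -1.0, 0.0]
```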
Implementation was done in Python (https://github.com/rezacsedu/XAI_Cancer_Prediction) using a software stack comprising Scikit-learn and Keras with the TensorFlow backend. The network was trained on an Nvidia GTX 1080i GPU with CUDA and cuDNN enabled. Results based on hyperparameters produced through random search and 5-fold cross-validation are reported and discussed in a comparative analysis using macro-averaged precision and recall. Further, since the classes are imbalanced, Matthews correlation coefficient (MCC) scores are reported. Since cancer diagnosis requires both high precision and high recall [naulaerts2017precision], results with very different precision and recall are not useful for diagnosis and tumor type identification; hence, we did not report F1-scores.
IV Results

IV-A Performance of cancer type classification
The average accuracy obtained was 89.75% and 96.25% using the CNN and VGG16 models, respectively. However, since the classes are imbalanced, accuracy alone gives a very distorted estimate across cancer types. Thus, we report the class-specific classification results along with the corresponding MCC scores in table II. As can be seen, precision and recall for the majority of cancer types were high, and for these the VGG16 model mostly performs better. Notably, the VGG16 model classifies BRCA, UCEC, LUAD, HNSC, LUSC, THCA, PRAD, BLCA, STAD, KIRC, LIHC, COAD, CESC, KIRP, SARC, OV, PCPG, TGCT, GBM, READ, LAML, MESO, and DLBC cancer cases more confidently, whereas the CNN model classifies PAAD, CHOL, and UCS cancer cases more accurately.
[Table II: class-wise results for the CNN (89.75%) and VGG16 (96.25%) models]
The ROC curves generated by the VGG16 model in fig. 2 show that the AUC scores are consistent across the folds. This signifies that the predictions by the VGG16 model are much better than random guessing. Further, the class-specific MCC scores of the VGG16 model are 4% higher than those of the CNN model, which suggests that the predictions were strongly correlated with the ground truth, yielding a Pearson product-moment correlation coefficient higher than 0.70 for all classes except the CHOL tumor samples. The downside, however, is that both classifiers made a number of mistakes: e.g., VGG16 classifies ESCA, READ, UCS, and CHOL tumor cases correctly in only 89% of cases, while the CNN model made more mistakes, particularly for the READ, LUAD, LIHC, KIRP, COAD, and STAD tumor samples.
IV-B Feature importance and validation of top biomarkers
We identified top genes for which the change in expression has a significant impact on patients. Figure 3 shows examples of HM generated for each class across 5 different folds. As seen, the HM are similar across folds, while displaying both distinct and shared patterns when comparing different cancer types. The red circles highlight similar patterns, e.g., between KIRC and BRCA, and between PRAD and LUAD across folds, whereas COAD shows very different patterns. Although there are differences among folds, some patterns are clearly visible. Since intensities did not follow any regular pattern, we selected the top 660 genes across 33 tumor types (top-20 genes per class) as most significant, based on the MAI measure. Given more than 20K protein-coding genes, selecting 660 is reasonable, since the set of important biomarkers whose GE changes are sensitive to cancer growth should be small [zuo2019identification]. All genes in the top-20 list can thus be viewed as tumor-specific biomarkers, which contribute most toward making the predictions. As for the other 29 tumor types, only 3 genes were in the list; further hyperparameter tuning and training of both CNN models might improve this outcome.
We then further narrowed down the list to the top-5 genes, for which only 5 tumor types (i.e., BRCA, KIRC, COAD, LUAD, and PRAD) have at least five genes with a feature importance of at least 0.5 w.r.t. MAI; they are shown in table III. To further validate our findings, the saturation analyses of cancer genes across 33 tumor types (except for COAD) were obtained from the TumorPortal [lawrence2014discovery]. Validation for the COAD cancer follows a signature-based approach [zuo2019identification], which was used for predicting the progression of colorectal cancer. However, our approach makes some false identifications: 21 out of 25 genes are validated to be correct, leaving 4 false identifications.
IV-C Finding common biomarkers
Identifying significant common genes can help understand various aspects of a specific cancer type (e.g., BRCA carcinogenesis). These top genes have close relations to the corresponding tumor types and could be viewed as potential biomarkers. Figure 4 shows the top-10 common biomarkers, in which the KRTAP1-1, INPP5K, GAS8, MC1R, POLR2A, BET1P1, NAT2, PSD3, KAT6A, and INTS10 genes are common across cancer types, with the INTS10 protein-coding gene having the highest feature importance of 0.6.
IV-D Explanations with SHAP
The GBT model is trained to provide explanations generated by SHAP. Figure 5 shows how much each feature pushes the model's output from the base value (the average model output over the training dataset) towards the predicted output of 0.55. Features pushing the prediction higher are shown in red; those pushing the prediction lower are in blue. Further, to get an overview of which biomarkers are most important for the GBT model, we plot the SHAP values of each feature for each sample. The plot in fig. 6 sorts features by the sum of SHAP value magnitudes over all samples, shows the distribution of the impact of each feature on the model output, and gives the top-20 common biomarkers, where red represents high feature values and blue low. This reveals, e.g., that a low NACA2 (low GE value) lowers the predicted value. Since the common biomarkers predicted by VGG16 (fig. 4) and GBT (fig. 6) are very different, a more detailed analysis of biological signaling pathways is required to validate these findings.
IV-E Comparison with related works
OncoNetExplainer slightly outperforms the approach by Boyu et al. [lyu2018deep] and is 6.5% better than the approach by Yuanyuan et al. [li2017comprehensive]. Further, OncoNetExplainer improves the false prediction rate for the READ, UCS, ESCA, and CHOL tumor samples. In particular, against 35%, 81%, 77%, and 56% of cases correctly predicted by [lyu2018deep], our approach predicts 88.74%, 87.26%, 89.56%, and 84.55% (in cyan) of the same cases correctly. Although OncoNetExplainer performs slightly worse than [lyu2018deep] at classifying BRCA, THCA, and PRAD (in red), it is more consistent for the majority of cancer types and likely to perform more stably on new GE data. OncoNetExplainer provides both pre-model (GBT) and post-model interpretation (CNN and VGG16), whereas [lyu2018deep] provides only post-model interpretability. Some other studies also used GE data [zhang2016classification, elsadek2018supervised] for cancer prediction; however, since the GE data from the PCA project had more samples, a one-to-one comparison with these studies was not viable.
V Conclusion and outlook
In this paper, we proposed OncoNetExplainer, an explainable method for the prediction of cancer types based on GE data. Our approach is based on GradCAM++ with CNN and VGG16 networks, and a SHAP-based GBT model. Experimental results show that GE data are useful for predicting cancer types with high confidence, giving an accuracy of up to 96.25%. We also attempted to provide a more human-interpretable explanation by showing statistically significant biomarkers. These analyses are further validated against the scientific literature [lawrence2014discovery], which confirms that the identified genes are biologically relevant.
Although we attempted to open the CNN and VGG16 black-box models through biomarker validation and feature ranking, our approach is mostly post-hoc, in that the explainability is based on test cases and results, similar to layer-wise relevance propagation. Several further factors have hindered this research: i) lack of enough training samples, ii) lack of biological pathway analysis, and iii) since multiple factors are involved in cancer diagnosis (e.g., estrogen, progesterone, and epidermal growth receptors in BRCA), AI-based diagnoses might not be trustworthy based solely on a single modality, which calls for multimodal features combining DNA methylation, GE, miRNA expression, and CNV data.
In the future, we intend to extend this work by: i) leveraging more samples by combining genomics data from ICGC and COSMIC to train a multimodal architecture, and ii) improving the explanations of the predictions using an ante-hoc approach by seeding explainability into the model from the beginning. In particular, we will focus on multimodality with the reverse time attention model and Bayesian deep learning [choi2016retain].