Variational Autoencoder for Anti-Cancer Drug Response Prediction

Cancer is a primary cause of human death, but discovering drugs and tailoring cancer therapies are expensive and time-consuming. We seek to facilitate the discovery of new drugs and treatment strategies for cancer using variational autoencoders (VAEs) and multi-layer perceptrons (MLPs) to predict anti-cancer drug responses. Our model takes as input gene expression data of cancer cell lines and anti-cancer drug molecular data, and encodes them with our GeneVAE model, an ordinary VAE, and a rectified junction tree variational autoencoder (JTVAE) model, respectively. A multi-layer perceptron processes these encoded features to produce a final prediction. Our tests show that our system attains a high average coefficient of determination (R^2 = 0.83) in predicting drug responses for breast cancer cell lines and an average R^2 > 0.84 for pan-cancer cell lines. Additionally, we show that our model can generate effective drug compounds not previously used for specific cancer cell lines.






1 Introduction

With the development of molecular biology, the study of cancer genomics has enabled scientists to develop anti-cancer drugs according to cancers’ genomic features. These drugs are widely used and are of great significance in cancer therapy today. However, the efficacy of anti-cancer drugs varies greatly from one kind of tumor to another, making it considerably difficult to customize therapy strategies for patients. Moreover, the efficacy of anti-cancer drugs is closely related to their molecular structure, which is hard to predict even for sophisticated pharmacists. Many researchers study drug expression embeddings [7, 11, 13, 22, 15, 9] and genomic data [3, 1, 8, 20] separately, and some others combine them in deep neural networks [2, 18, 19, 23] to predict drug efficacy. To provide more precise treatment strategies, a sufficient analysis and understanding of cancers’ genomic data and drugs’ molecular structure is essential.

On the other hand, there exist many ways to learn features from drug molecular data and gene expression data, and a more appropriate method for encoding features from graph structures and gene expression can lead to a more accurate prediction of the drug response. Among supervised learning methods, random forests can rank the importance of each gene, which can help us filter genes at the very first step. However, they run into problems on unlabeled data, such as the Cancer Cell Line Encyclopedia (CCLE) datasets that we want to explore. Some dimensionality reduction methods, such as principal component analysis (PCA), independent component analysis (ICA), and manifold-learning-based t-distributed stochastic neighbor embedding (t-SNE), are common for analyzing medical data where features are numerous and unlabeled. However, they are primarily used for 2D visualization, and they may lose much important information when compressing data into latent features of higher dimension; in higher dimensions they do not perform well, and we cannot judge their performance intuitively by visualizing the latent space.


To extract features from a large amount of gene data, we take advantage of the variational autoencoder (VAE) [12], which has achieved great success in unsupervised learning of complex probability distributions. The ability of VAE models to capture the probabilistic distribution of latent information enables a more complete analysis of gene data, making it easier to predict the response of anti-cancer drugs when they are used on specific cancer cell lines. As for anti-cancer drugs, we transform their molecular graphs into junction trees by functional-group splitting and implement a junction tree VAE model to extract their low-dimensional features. Finally, we implement a fully connected neural network that combines the extracted features to produce the final result, the IC50 value of the anti-cancer drug used against the cancer cell line. Specifically, in this research, we first select breast cancer, the dominant cancer among women, for our study. Many drug efficacy predictions have been made for breast cancer, but only a few studies make full use of the encoded information for predictions and further generalization. We concentrate on judging the encoders’ efficiency at extracting features. Then we generalize to pan-cancer CCLE data to see whether the VAE models fit well. We visualize the latent space with t-SNE dimensionality reduction to assess the model’s robustness. Finally, we check the similarity of drugs via the Euclidean distance between latent vectors from the junction tree encoder model. Drugs with similar structures or properties may have close latent vectors, which enables us to compare their discrepancies on different cell lines.

Present work

Our present work focuses on using variational autoencoders (VAEs) to learn latent vectors from the latent space and using those vectors for further tasks. Our work includes two kinds of VAE: one for gene expression data and the other for a drug’s sequence data. Our aim is to extract explainable latent vectors from the VAE models by reducing the reconstruction loss as well as the Kullback-Leibler (KL) divergence loss. The drug response model is based on a deep neural network whose output, the IC50 value, is the final prediction. Besides KL loss, the coefficient of determination (R2) and the root mean squared error (RMSE) are the metrics we choose in our research. The datasets we adopt are the Cancer Cell Line Encyclopedia (CCLE) gene expression dataset, the ZINC molecular structure dataset, and the GDSC drug response dataset [1, 25]. Generally, we find the IC50 predictions explainable. While the present work is based on VAEs, more ideas that we intend to realize can be found in the future work section. All the figures, data, and hands-on notebooks can be found here:

Figure 1: Variational autoencoder on gene expression data. In this research we pay close attention to its latent space and use the latent vector to make predictions. We show the first 20 genes in each breast cancer cell line (51 breast cancer cell lines in total). The feasibility of the VAE is supported by the high similarity between the raw gene expression data and its reconstruction.
Figure 2: The structure of the original junction tree variational autoencoder (JTVAE), proposed by Jin in 2018. The model is split into two parts: tree and graph reconstruction. The key point of JTVAE is the tree structure expanded from functional groups, apart from the usual neighbor-by-neighbor graph node generation. Since we consider the latent vectors from both the graph and its tree structure, we supplement some code and use a different SMILES expression list from the original project to make sure that we extract the latent vectors correctly.
Figure 3: After we extract drug and gene latent vectors from the individual VAE models, we pass each through a single MLP model to reduce the dimension again, followed by a concatenation operation; finally we build a multi-layer perceptron to predict the drug response IC50. In our research, we take a support vector machine regressor as one of our baseline models since it is easy to train and intuitive. Another baseline model is based on a dense deep neural network, going directly from input gene expression data (with encoded drug latent vectors) to output drug response data.

2 Related work

Dimensionality reduction on features

Reducing the number of features or encoding features into lower dimensions is common in projects that use feature engineering to make predictions or analyze clustering effects. Supervised learning methods can select the gene subsets most related to the research task, such as random forests with gene feature importance in RNA-sequence case-control studies [24] and support vector machines (SVMs) with double RBF kernels to filter irrelevant gene features [16]. Unsupervised learning methods, such as principal component analysis (PCA) and hierarchical learning, can help explain genes’ group features and map certain PCs or hierarchical relationships to a lower-dimensional space [10]. Our idea is to compare traditional unsupervised learning methods with the VAE, since our CCLE and ZINC datasets are unlabeled. We try to find the differences between the latent spaces and discuss the feasibility of using a VAE.

Variational auto-encoders on gene expression

A plethora of work has been done on encoding important features from gene expression data. The core idea behind feature extraction is how to learn latent vectors effectively from input embeddings. Multi-layer perceptrons can avoid the curse of dimensionality and simply encode gene features from the input layer [2, 3, 18]. An encoder-decoder structure [3] extends the multi-layer perceptron and considers genes’ reconstruction; the bottleneck layer represents the latent information in this kind of autoencoder. Recently, the variational autoencoder (VAE) [12] has appeared frequently in pre-trained models that encode gene expression [8, 20, 21]. These studies focus primarily on latent space representations based on maximizing the likelihood of the gene distribution. Our approach to encoding CCLE gene expression data is mainly inspired by the VAE. We take a simple deep-neural-network-based VAE as our baseline model to form a pre-trained encoder on gene expression.

Representation learning on graph

Graph features can be encoded by deep learning methods such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), and message passing neural networks (MPNNs) [5, 6, 7]. Variational autoencoders (VAEs) are also widely used in graph generation and graph encoders [13, 22, 14, 15]. To avoid generating nodes one by one, which often produces nonsensical structures in drug design, a method combining a tree encoder with a graph encoder was proposed [11]; it treats functional groups as nodes for broadcasting. Attention mechanisms, common in the transformer architecture of natural language processing models, have also been applied to VAEs [18, 19]. They learn attention weights by multi-head attention or self-attention with a softmax operation in order to forget certain unimportant genes or drugs during propagation. Among all these studies, we choose the junction tree variational autoencoder (JTVAE) as our pre-trained baseline model for encoding drug structures. Although attention mechanisms are popular in recent work, training such a transformer takes time, so we leave it for future work.

Drug response prediction methods

Supervised learning methods are useful for predicting drug response from encoded information. Support vector machine regressors (SVRs) and random forest regressors are basic algorithms for regression. Recently, deep neural network methods have become popular in drug prediction networks [2, 3, 18, 19]. Our own drug prediction network is also based on a deep neural network, with some modifications. Our baseline models are mainly based on SVR and a single deep neural network; we compare them with our VAE-plus-MLP model to show their lower scores and limited novelty.

3 Method

Problem restatement: given CCLE gene expression data and drugs’ SMILES expressions, build a model that can precisely predict the IC50 response of each anti-cancer drug. We build three main variational autoencoders and compute latent vectors from their latent spaces, which we then feed into our prediction network. Our further tasks use the latent vectors for visualization and prediction, which are discussed in the results section. In this section we mainly describe how we apply VAEs to our datasets.

3.1 Data

Gene expression data

We obtain gene expression data of 1021 cancer cell lines with 57820 genes provided by the Cancer Cell Line Encyclopedia (CCLE) [1]. Each cell line belongs to a specific cancer type. Specifically, we choose breast cancer as our primary research object and then generalize our model to pan-cancer cell lines. After filtering by the keyword token [BREAST], we select 51 breast cancer cell lines from this dataset, namely [AU565_BREAST], [BT20_BREAST], [BT474_BREAST], ..., [ZR7530_BREAST]. Gene expression data is given by a matrix G, where g is the number of genes and c is the number of cancer cell lines. The elements of matrix G are the transcripts-per-million (TPM) values of each gene in the corresponding cell line. Moreover, we access the Cancer Gene Census (CGC) dataset [2], which classifies genes into two tiers. One tier is for genes that are closely associated with cancer and have a high probability of mutations in cancer that change the activity of the gene product. The other tier includes genes with strong indications of a role in cancer but less extensive evidence. Genes in both tiers are highly relevant to cancer, and we take all of them in our research. We select the 51 breast cancer cell lines from the CCLE dataset and remove expression data of genes that are not in the CGC dataset. Genes whose mean expression is less than 1 or whose standard deviation is less than 0.5 are also removed, for their low relevance to the cancer cell lines [3]. Eventually, we get gene expression data of 597 genes in 51 breast cancer cell lines.
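The filtering step above can be sketched with numpy; the matrix here is a random stand-in for the real CCLE TPM values, but the thresholds are the ones stated in the text.

```python
import numpy as np

# Hypothetical stand-in for the CGC-filtered CCLE matrix: genes x cell lines,
# non-negative TPM-like values. The real pipeline loads CCLE data instead.
rng = np.random.default_rng(0)
G = rng.gamma(shape=2.0, scale=2.0, size=(800, 51))

# Drop genes with mean expression < 1 or standard deviation < 0.5,
# mirroring the thresholds described in the text.
keep = (G.mean(axis=1) >= 1.0) & (G.std(axis=1) >= 0.5)
G_filtered = G[keep]

print(G_filtered.shape)  # remaining genes x 51 breast cancer cell lines
```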

Anti-cancer drug molecular structure data

In this research, we use the ZINC dataset of organic compound molecular structures to train the JTVAE model. Molecular structure data is given in simplified molecular-input line entry system (SMILES) strings. SMILES expressions are often used to define drug structures [2, 18, 19, 11, 13, 22, 15, 23] and are widely used as inputs in drug structure prediction tasks. SMILES expressions also make it easier for us to get embeddings from the vocabulary parsing library that we have generated. From the ZINC dataset, we select 10,000 SMILES strings to train our JTVAE model. The number of training SMILES strings far exceeds the actual number of 222 drugs in the processed GDSC dataset, because we would like better generalization across all drug structures rather than only the drugs for specific cancer types.

Drug response data

Drug response data is obtained from the Genomics of Drug Sensitivity in Cancer (GDSC) project [25], which contains response data of anti-cancer drugs used against numerous cancer cell lines. Data from the GDSC dataset is given by a matrix, where d is the number of drugs and c is the number of cancer cell lines. The elements of this matrix are IC50 values, the half-maximal inhibitory concentration of a drug used against a specific cancer cell line. We obtain molecular data of the anti-cancer drugs from the PubChem dataset via their unique PubChem IDs available from the GDSC dataset. Eventually, we get 3358 pieces of drug response data for breast cancer cell lines where both gene expression data and molecular structure are available.

3.2 Gene expression VAE(GeneVAE)

We first build a simple encoder-decoder model with fully connected layers. It extracts the latent vectors of the CCLE gene expression data, which are then fed into the combined MLP drug prediction network. We use a simple multi-layer perceptron for forward propagation, with a batch-norm layer before each activation.


The first dense layer computes h1 = f(BN(W1 G + b1)), where f is the activation function (ReLU in our model), W1 is the weight matrix, and b1 is the bias vector of the first dense layer. Batch normalization (BN) is used to train our model more efficiently, and the activation filters out unimportant information. h1, the output of the first layer of our MLP, is connected to the second layer, which produces the latent mean: mu = f(W2 h1 + b2), where f is the activation function, W2 is the weight matrix, and b2 is the bias vector of the second dense layer. The log-variance log sigma^2 is computed by another 2-layer MLP with the same architecture as the one producing mu. The latent vector z is randomly sampled from N(mu, sigma^2).

The decoder is constructed from two dense layers whose output dimension matches the input. The decoded gene expression is written as G'.


In this VAE, we need to compute the reconstruction loss L(G, G') and the Kullback-Leibler (KL) divergence loss KL(q(z|G) || p(z)) [12], where q is the approximate posterior distribution of z and p is the prior, a standard normal distribution. The total loss is the sum of the two: L_total = L(G, G') + KL(q(z|G) || p(z)). We minimize the total loss until it converges.
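As a minimal numpy sketch of this objective, the snippet below computes the reparameterized sample, a binary cross-entropy reconstruction loss, and the closed-form KL term against a standard normal prior. All shapes, weights, and inputs are random stand-ins, not trained GeneVAE parameters.

```python
import numpy as np

# Hedged sketch of the GeneVAE loss: reconstruction + KL divergence between
# the approximate posterior N(mu, sigma^2) and the standard normal prior.
rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

g = rng.uniform(0.01, 0.99, size=(1, 597))    # normalized gene input in (0, 1)
mu = rng.normal(size=(1, 56))                 # encoder mean (stand-in)
logvar = rng.normal(scale=0.1, size=(1, 56))  # encoder log-variance (stand-in)

# Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)
eps = rng.normal(size=mu.shape)
z = mu + np.exp(0.5 * logvar) * eps

# Decoder stand-in: random affine map followed by a sigmoid, so the
# reconstruction lies in (0, 1) like the normalized input.
W_dec = rng.normal(scale=0.05, size=(56, 597))
g_hat = sigmoid(z @ W_dec)

# Binary cross-entropy reconstruction loss, summed over genes
recon = -np.sum(g * np.log(g_hat) + (1 - g) * np.log(1 - g_hat))

# Closed-form KL divergence to the standard normal prior
kl = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))

total_loss = recon + kl
```

Note that the KL term is exactly zero when mu = 0 and logvar = 0, i.e. when the posterior already matches the prior.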

3.3 Junction tree VAE(JTVAE)

Graph encoder

We take the junction tree variational autoencoder (JTVAE) [11] as one of our encoding models to represent a drug's latent space, using a message passing network [11, 9] as the graph encoder. Suppose there are d nodes in the graph. Each node u carries an atom-type feature x_u, and x_uv encodes the bond type between nodes u and v. The vector nu_uv^(t) represents the message passed from node u to node v after t iterations, and it is initialized to 0. W1, W2, and W3 are three weight matrices. Following loopy belief propagation, the message passed from node u to v at iteration t is computed with a rectified linear unit as:

nu_uv^(t) = ReLU(W1 x_u + W2 x_uv + W3 sum_{w in N(u)\v} nu_wu^(t-1))

Having gathered the messages from the neighborhood of node u, we aggregate these message embeddings with its atom type, written as a summation with a rectified linear unit (equation 2); the final graph representation is denoted h_G. The mean mu_G and variance sigma_G can be computed from h_G by an affine layer, and the graph latent vector z_G is sampled from N(mu_G, sigma_G).
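The message passing loop above can be sketched in a few lines of numpy on a toy three-atom graph. The node features, edge features, and weight matrices are random placeholders, and the update follows the ReLU message-passing form described above.

```python
import numpy as np

# Illustrative sketch of the loopy-belief-propagation message passing used
# by the graph encoder. All features and weights are random stand-ins.
rng = np.random.default_rng(0)
relu = lambda a: np.maximum(a, 0.0)

d_node, d_edge, d_msg = 8, 4, 16
adj = {0: [1, 2], 1: [0], 2: [0]}            # tiny 3-atom toy graph
x = rng.normal(size=(3, d_node))             # atom-type features x_u
x_e = {(u, v): rng.normal(size=d_edge) for u in adj for v in adj[u]}

W1 = rng.normal(scale=0.1, size=(d_msg, d_node))
W2 = rng.normal(scale=0.1, size=(d_msg, d_edge))
W3 = rng.normal(scale=0.1, size=(d_msg, d_msg))

# Messages nu[(u, v)] start at zero; each iteration recomputes
# nu_uv = ReLU(W1 x_u + W2 x_uv + W3 * sum of incoming messages except v's).
nu = {e: np.zeros(d_msg) for e in x_e}
for _ in range(3):
    nu = {
        (u, v): relu(W1 @ x[u] + W2 @ x_e[(u, v)]
                     + W3 @ sum((nu[(w, u)] for w in adj[u] if w != v),
                                start=np.zeros(d_msg)))
        for (u, v) in nu
    }

# Aggregate incoming messages into per-node embeddings h_u, then pool
# them into a graph-level representation h_G.
U1 = rng.normal(scale=0.1, size=(d_msg, d_node))
U2 = rng.normal(scale=0.1, size=(d_msg, d_msg))
h = np.stack([relu(U1 @ x[u] + U2 @ sum(nu[(v, u)] for v in adj[u]))
              for u in adj])
h_G = h.mean(axis=0)
```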

Tree encoder

The architecture of the tree encoder is based on the Gated Recurrent Unit (GRU) [11, 4]. The hidden state h_uv^(t) of this tree encoder preserves the tree's message passing information from step t-1, together with the tree clusters {C_i, i = 1, 2, ..., d}. There are two kinds of gates in our tree encoder model: the reset gate r and the update gate z. The reset gate r controls how much of the previous state the system keeps, while the update gate z gives the probability that the system updates the message passing information at step t. If the reset gate is 0, the element-wise multiplication in equation (3) simplifies to a tanh of the input transform alone, meaning that no message from the previous stage is preserved. The total update, which depends on the previous activation and the candidate activation, can be written as an element-wise interpolation between the two. We obtain the tree's latent representation of node u by aggregating its updated messages at the t-th iteration (equation 9); the latent vector is calculated in the same way as in the graph encoder. Since the graph and tree decoders in JTVAE are also based on the GRU, we do not discuss them here and instead refer to the original JTVAE paper [11].
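The GRU-style update just described can be sketched in numpy. All weight matrices are random placeholders; the point is the gate structure: reset gate r, update gate z, candidate activation h_tilde, and the element-wise interpolation that produces the new state.

```python
import numpy as np

# Toy sketch of the GRU update in the tree encoder. Weights are random
# stand-ins, not trained parameters of the JTVAE tree encoder.
rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

dim = 16
x_t = rng.normal(size=dim)       # current cluster (functional-group) input
h_prev = rng.normal(size=dim)    # message state from step t-1

Wr, Ur = rng.normal(scale=0.1, size=(2, dim, dim))
Wz, Uz = rng.normal(scale=0.1, size=(2, dim, dim))
W, U = rng.normal(scale=0.1, size=(2, dim, dim))

r = sigmoid(Wr @ x_t + Ur @ h_prev)            # how much past state to keep
z = sigmoid(Wz @ x_t + Uz @ h_prev)            # how much to update
h_tilde = np.tanh(W @ x_t + U @ (r * h_prev))  # candidate activation
h_t = (1 - z) * h_prev + z * h_tilde           # element-wise interpolation
```

With r forced to zero, h_tilde reduces to tanh(W x_t), matching the remark above that no previous message is preserved.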


3.4 Drug response prediction network

Since the gene VAE and molecular VAE have been trained at this stage, we implement two multi-layer perceptron (MLP) models to post-process the outputs of the two VAE models respectively, and then build another MLP model that concatenates them and produces the final drug response prediction. The input to the final MLP model is the concatenation of z_gene and z_drug, the outputs of the two post-processing MLP models, so the total input dimension of the final MLP model is the sum of their dimensions. The value of the perceptrons in layer l of the final MLP model is computed according to a_l = f(W_l a_{l-1}), where W_l is the weight matrix of layer l in the final MLP model and f is a non-linear activation function, for which we choose PReLU in our model. The predicted IC50 is computed at the last layer of the final MLP model as a_L, where L is the number of layers in the final MLP model.
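A minimal numpy sketch of this prediction head follows. The latent dimensions (56 each, as mentioned later in the discussion) and all weights are illustrative stand-ins, not the trained network.

```python
import numpy as np

# Sketch of the final drug-response MLP: concatenate the gene and drug
# latent vectors, then apply dense layers with a PReLU-style activation.
rng = np.random.default_rng(0)

def prelu(a, alpha=0.25):
    # PReLU with a fixed slope; in the real model alpha is learned.
    return np.where(a > 0, a, alpha * a)

z_gene = rng.normal(size=56)   # output of the gene post-processing MLP
z_drug = rng.normal(size=56)   # output of the drug post-processing MLP
x = np.concatenate([z_gene, z_drug])   # input to the final MLP

W1 = rng.normal(scale=0.1, size=(64, x.size))
W2 = rng.normal(scale=0.1, size=(1, 64))

h = prelu(W1 @ x)
ic50_pred = (W2 @ h)[0]   # scalar IC50 prediction
```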

Baseline model

We use Support Vector Regression (SVR) as a substitute for the MLP in our baseline model, showing a convenient way to apply classical machine learning methods to drug response prediction. Our baseline models also rely on the output of the junction tree VAE.

4 Experiments

Experiment set-up

We first train our gene expression VAE (geneVAE) model and junction tree VAE (JTVAE) in an unsupervised manner. Then we use the geneVAE to encode gene expression data of breast cancer cell lines, both with and without filtering by the CGC dataset, and use the JTVAE to encode anti-cancer drugs. With these encoded features, we train our Support Vector Regression (SVR) and Multi-Layer Perceptron (MLP) models on breast cancer cell lines. Finally, we generalize our model and test it on pan-cancer cell lines.

4.1 Result of VAE on breast cancer

In training the variational autoencoder, the objective we minimize is the sum of the reconstruction loss and the latent loss. The reconstruction loss compares the initial input gene expression data G with the reconstructed data G'. It can be a mean squared loss (MSELoss) or a cross-entropy loss (CrossEntropyLoss). We choose cross-entropy loss as our reconstruction loss in our experiments, because we normalize the input data and add a sigmoid function in the last layer to make sure the input and output both consist of values between 0 and 1. We connect the input layer to the final custom variational layer in our program to compute this loss.

Filtering out a representative gene subset using the CGC dataset also matters in training our gene expression VAE model. As mentioned in Section 3.1, the number of CGC-selected genes for breast cancer cell lines is 597. We test our model on breast cancer cell lines with and without this gene-subset filtering, and the results indicate that filtering improves the accuracy of the IC50 prediction. Evaluating the total loss, our tests show that at the beginning of the training loop the validation VAE loss is much higher than the training VAE loss, and the VAE loss starts to converge after 100 epochs. The model on CGC-selected gene expression data has an average VAE loss of 27.3; the model without CGC selection has an average VAE loss of 68 after the validation loss becomes stable.

Figure 4: KLloss and lr with CGC
Figure 5: KLloss and lr without CGC

Model Comparison

We select two metrics, the Coefficient of Determination (R2 score) and the Root Mean Square Error (RMSE), to evaluate the discrepancy between our predicted drug responses and the true drug responses. We propose 6 models, with results listed in Table 1. The first 5 models target breast cancer, and the last is tested on pan-cancer cell lines:

1. Support Vector Regression model trained on drug molecular structure data encoded by the VAE model and gene expression data filtered by the CGC dataset.
2. Support Vector Regression model trained on gene expression data filtered by the CGC dataset and drug molecular structure data, both encoded by VAE models.
3. Multi-Layer Perceptron model trained on drug molecular structure data encoded by the VAE model and gene expression data filtered by the CGC dataset.
4. Multi-Layer Perceptron model trained on raw gene expression data (not filtered by the CGC dataset) and drug molecular structure data, both encoded by VAE models.
5. Multi-Layer Perceptron model trained on gene expression data filtered by the CGC dataset and drug molecular structure data, both encoded by VAE models.
6. The same model as 5, trained on the pan-cancer dataset.

The test results of these models are shown in Table 1. The MLP model and VAE model bring about a huge improvement in performance: the CGC + MLP model outperforms the CGC + SVR model by 0.143 R2 score, and the CGC + VAE + MLP model performs even better than the CGC + MLP model, with a 0.008 higher R2 score. Moreover, the selection of a representative gene subset is essential to the performance of our models: the CGC + VAE + MLP model on breast cancer cell lines reaches a 0.830 R2 score, 0.025 higher than the Raw + VAE + MLP model.

Table 1: Metrics evaluation on different gene subsets in the breast cancer dataset (average).

Models            Cancer type   R2     RMSE
CGC + SVR         Breast        0.679  1.489
CGC + VAE + SVR   Breast        0.700  1.439
CGC + MLP         Breast        0.822  1.133
Raw + VAE + MLP   Breast        0.805  1.163
CGC + VAE + MLP   Breast        0.830  1.130
CGC + VAE + MLP   Pan-cancer    0.845  1.080
CDRscan[2]        Pan-cancer    0.843  1.069

Figure 6: CGC+SVR
Figure 7: CGC+VAE+SVR
Figure 8: CGC+MLP
Figure 9: Raw+VAE+MLP
Figure 10: CGC+VAE+MLP
Figure 11: CGC+VAE+MLP(pan)

4.2 Generalization on pan-cancer

We achieved an ideal result testing our models on breast cancer cell lines to predict drug response, so we generalize our model to the pan-cancer cell lines in the CCLE dataset. The only difference in the pan-cancer gene expression data from that of breast cancer is that the total number of pan-cancer cell lines is 1021. Our model achieves an even higher R2 score of 0.845 on pan-cancer cell lines, matching the performance of CDRscan [2] on our dataset. To make our model more robust, we will incorporate more data, such as the TCGA dataset.

4.3 Exploring latent vectors from geneVAE

Taking advantage of the diversity of cancer types in the pan-cancer dataset, we find that the latent vectors encoded by geneVAE retain the features of the original data. We visualize the latent vectors of gene expression data in two-dimensional Euclidean space. The dimensionality reduction is evaluated by comparing a single t-SNE model with a t-SNE model combined with our pretrained VAE encoder. Generally, t-SNE is used only for visualization on a two-dimensional plane, since it performs worse in higher-dimensional spaces. We begin by assigning each cell line its tissue type, from "CERVIX" to "OVARY", extracted as the pattern after the first underscore in the CCLE dataset. In particular, we rename "HAEMATOPOIETIC_AND_LYMPHOID_TISSUE" to "HALT" since it is the longest string. The parameters of the single t-SNE model are the perplexity and the number of iterations: we set the perplexity to n/120, where n is the number of cell lines, and the iterations (n_iter in Python) to 3000. The same settings are applied to the combined model. The results show that many clusters are apparent in both the single t-SNE model and the combined model. Therefore, the latent vectors encoded by the geneVAE model retain the unique features of the input data.

Figure 12: t-SNE on pan-cancer dataset before applying VAE
Figure 13: t-SNE on latent vectors after applying VAE
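The visualization step above can be sketched with scikit-learn. The latent vectors here are random stand-ins for the geneVAE outputs; the real run uses perplexity n/120 and 3000 iterations, while this toy version keeps scikit-learn defaults apart from a small perplexity suited to the sample size.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical geneVAE latent vectors: one row per cell line.
rng = np.random.default_rng(0)
latents = rng.normal(size=(120, 56))

# Embed the latent vectors into 2-D for plotting; perplexity must be
# smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=5.0, random_state=0)
embedding = tsne.fit_transform(latents)  # one 2-D point per cell line
```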

In the single model, tissue-type labels such as [HALT], [AUTONOMIC_GANGLIA], [BREAST], and [SKIN] are clearly separated, while some other tissue types are clustered together with similar tissues. For example, gene expression differs little between "STOMACH" and "LARGE INTESTINE". Several tissue types are so rare among the cancer cell lines that they may be clustered with another tissue, because t-SNE does not exactly preserve the real distances between cancer types.

Eliminating rare cancer types helps improve the t-SNE results. We set a threshold of 30 cell lines to filter the tissue types, leaving 12 tissues: [BREAST, CENTRAL_NERVOUS_SYSTEM, FIBROBLAST, HALT, KIDNEY, LARGE_INTESTINE, LUNG, OVARY, PANCREAS, SKIN, STOMACH, UPPER_AERODIGESTIVE_TRACT]. We keep the same filtered gene subset as before. Visualizing again, we find that more clusters are apparent in the picture, which we mark with black frames. The clustering results of the latent vectors and the original data remain similar in Figure 14, where the primary cancer tissue types are separated clearly. Therefore, the latent vectors encoded by the geneVAE model robustly retain the essential features of the original data. With geneVAE, our models are able to focus on the low-dimensional critical features of the original data and produce a more accurate prediction.

Figure 14: t-SNE with threshold 30 before and after VAE

4.4 Exploring latent vectors from JTVAE

Drugs sharing similar molecular structures also have similar latent vectors. We access the latent vectors encoded by JTVAE and measure the similarity of different drugs by the Euclidean distance between their latent vectors; a shorter distance indicates higher similarity. For example, MG132 (inhibitor) and Proteasome (inhibitor) share the shortest distance, about 23.73. Looking up their molecular structures in the PubChem database, we find that the majority of their functional groups are similar; small differences lie in a carboxyl and an amide at the ends of the molecules. However, not all related drugs are so similar. Imatinib and Linifanib are even closer in terms of the Euclidean distance between their latent vectors, yet only the middle part of their functional groups is exactly the same. The JTVAE model may discover underlying similarity among functional groups that are not exactly identical. Also, the message passing network in JTVAE is based on the GRU, so it may forget some functional groups during neighbor-wise propagation.
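The similarity check above amounts to pairwise Euclidean distances between latent vectors. A sketch with random stand-in vectors (not real JTVAE encodings) follows; the drug names are only labels.

```python
import numpy as np

# Compare drugs by the Euclidean distance between their latent vectors;
# a smaller distance indicates a more similar structure. The vectors here
# are random stand-ins, not real JTVAE encodings.
rng = np.random.default_rng(0)
drug_latents = {name: rng.normal(size=56)
                for name in ["MG132", "Proteasome", "Imatinib", "Linifanib"]}

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

names = list(drug_latents)
pairs = {(a, b): euclidean(drug_latents[a], drug_latents[b])
         for i, a in enumerate(names) for b in names[i + 1:]}

# The most similar pair is the one with the smallest distance.
most_similar = min(pairs, key=pairs.get)
```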

Though similar drugs have close latent vectors, our MLP model is still able to capture subtle differences and produce an accurate prediction. We focus on the example of MG132 and Proteasome used against the HCC1187 cancer cell line. We remove these two pieces of data from the training set and test our trained model on them. The predicted IC50 values of MG132 and Proteasome on cell line HCC1187 are 0.84 and -0.866 in our best model; the true values of these two drugs are 1.589 and -0.181, respectively. Although the predictions are not very close to the expected values, neither falls within the range of the other's confidence interval. Therefore, despite the considerably high similarity between similar drugs, our MLP model is still able to differentiate them and produce a reasonable result.

Figure 15: Similar latent vectors on similar drug structures.

5 Discussion and conclusions

In this research we build a gene expression VAE (geneVAE) model, a junction tree VAE (JTVAE) model, a Support Vector Regression (SVR) model, and several Multi-Layer Perceptron (MLP) models. We extract latent vectors with the geneVAE and JTVAE models and feed them into our drug prediction network. We compare our combined models with the baseline SVR models mentioned in related work. Overall, our model achieves a strong coefficient of determination (0.845 R2 score), matching the state-of-the-art performance of CDRscan [2] on our dataset. Besides, we discuss the effectiveness of the geneVAE and JTVAE models from the perspectives of visualization and drug similarity, further supporting the validity of our pipeline.

We encountered several interesting issues during our research. Hyper-parameter tuning and layer configuration matter greatly: different hyper-parameters lead to different outcomes. For example, adding a Batch Normalization (BN) layer to the MLP model results in worse performance. Batch normalization is a widely used technique to avoid exploding and vanishing gradients. However, its effectiveness is doubtful in shallow networks with Rectified Linear Units, where exploding and vanishing gradients seldom occur. Moreover, inconsistency among mini-batches can badly degrade the performance of BN layers. Beyond the BN layer, the train-validation-test split is essential to the final result, as are the batch size and learning rate. We set a default batch size of 8 and a learning rate of 0.001 in the training loop; larger values of these two hyper-parameters lead to faster convergence but risk falling into local minima. A suitable train-validation-test split is 10:1:1, drawn randomly from the whole dataset for each epoch. K-fold cross validation is also a good choice.
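The 10:1:1 split redrawn randomly for each epoch can be sketched as follows; the dataset is stood in for by a dummy index array:

```python
import numpy as np

# Sketch of the 10:1:1 train/validation/test split described above,
# redrawn randomly each epoch (indices stand in for real samples).
rng = np.random.default_rng(0)
n = 1200
idx = rng.permutation(n)

n_train = n * 10 // 12  # 10 parts of 12
n_valid = n // 12       # 1 part of 12
train, valid, test = np.split(idx, [n_train, n_train + n_valid])
```

In a real training loop the permutation would be redrawn at the start of each epoch, and the resulting index arrays used to slice the expression and drug feature matrices.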

We suggest considering dimensionality reduction methods beyond PCA, t-SNE, and other common clustering techniques. PCA was not suitable for reducing the data to 56 dimensions or higher in our research. t-SNE is better than PCA for two-dimensional representation, but we also found that t-SNE introduces some confusing artifacts in two-dimensional space. In future work we will showcase more predictions for structurally similar drugs to further support our observation that, although drugs may be similar, their latent vectors remain distinguishable at prediction time, as long as the predicted values do not fall within each other's confidence intervals.
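One simple diagnostic behind the remark that PCA is unsuitable here is to check how much variance the top components actually retain. A minimal PCA via SVD on synthetic stand-in data (not our expression matrix) illustrates the check:

```python
import numpy as np

# Minimal PCA via SVD: compute the fraction of variance retained by
# the leading components before trusting a low-dimensional projection.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))        # synthetic stand-in for profiles
Xc = X - X.mean(axis=0)                # center each feature

_, s, _ = np.linalg.svd(Xc, full_matrices=False)
explained = (s ** 2) / np.sum(s ** 2)  # per-component variance ratio

k = 2
print(f"top-{k} components explain {explained[:k].sum():.1%} of variance")
```

If the retained ratio at the target dimensionality is low, a linear projection like PCA discards most of the signal, which motivates nonlinear encoders such as the geneVAE.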

6 Future work

Attention-based models are part of our future work. Attention mechanisms are popular not only in natural language processing models such as the Transformer and BERT, but are also widely used in drug structure translation [18, 19]. Apart from attention-based models, there are other sequence generation models such as GMMs and graph neural networks (GNNs). Moreover, we would like to build a toolkit that predicts drug response given cancer cell line data and the corresponding drug responses. Last but not least, we plan to select better gene subsets for each drug, since drug responses may have different gene contributions.


Thanks to Professor Manolis Kellis for reviewing this article.


  • [1] J. Barretina, G. Caponigro, N. Stransky, K. Venkatesan, A. A. Margolin, S. Kim, C. J. Wilson, J. Lehár, G. V. Kryukov, D. Sonkin, et al. (2012) The cancer cell line encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483 (7391), pp. 603–607. Cited by: §1, §1, §3.1.
  • [2] Y. Chang, H. Park, H. Yang, S. Lee, K. Lee, T. S. Kim, J. Jung, and J. Shin (2018) Cancer drug response profile scan (cdrscan): a deep learning model that predicts drug effectiveness from cancer genomic signature. Scientific reports 8 (1), pp. 1–11. Cited by: §1, §2, §2, §3.1, §3.1, §4.1, §4.2, §5.
  • [3] Y. Chiu, H. H. Chen, T. Zhang, S. Zhang, A. Gorthi, L. Wang, Y. Huang, and Y. Chen (2019) Predicting drug response of tumors from integrated genomic profiles by deep neural networks. BMC medical genomics 12 (1), pp. 18. Cited by: §1, §2, §2, §3.1.
  • [4] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555. Cited by: §3.3.
  • [5] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams (2015) Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pp. 2224–2232. Cited by: §2.
  • [6] C. Dyer, A. Kuncoro, M. Ballesteros, and N. A. Smith (2016) Recurrent neural network grammars. arXiv preprint arXiv:1602.07776. Cited by: §2.
  • [7] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2017) Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212. Cited by: §1, §2.
  • [8] C. H. Grønbech, M. F. Vording, P. N. Timshel, C. K. Sønderby, T. H. Pers, and O. Winther (2018) ScVAE: variational auto-encoders for single-cell gene expression data. bioRxiv, pp. 318295. Cited by: §1, §2.
  • [9] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe (2016) Deep clustering: discriminative embeddings for segmentation and separation. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 31–35. Cited by: §1, §3.3.
  • [10] H. Huang and K. Kim (2006) Unsupervised clustering analysis of gene expression. Chance 19 (3), pp. 49–51. Cited by: §2.
  • [11] W. Jin, R. Barzilay, and T. Jaakkola (2018) Junction tree variational autoencoder for molecular graph generation. arXiv preprint arXiv:1802.04364. Cited by: §1, §2, §3.1, §3.3, §3.3.
  • [12] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §1, §2, §3.2.
  • [13] M. J. Kusner, B. Paige, and J. M. Hernández-Lobato (2017) Grammar variational autoencoder. arXiv preprint arXiv:1703.01925. Cited by: §1, §2, §3.1.
  • [14] Y. Li, O. Vinyals, C. Dyer, R. Pascanu, and P. Battaglia (2018) Learning deep generative models of graphs. arXiv preprint arXiv:1803.03324. Cited by: §2.
  • [15] Q. Liu, M. Allamanis, M. Brockschmidt, and A. Gaunt (2018) Constrained graph variational autoencoders for molecule design. In Advances in neural information processing systems, pp. 7795–7804. Cited by: §1, §2, §3.1.
  • [16] S. Liu, C. Xu, Y. Zhang, J. Liu, B. Yu, X. Liu, and M. Dehmer (2018) Feature selection of gene expression data for cancer classification using double rbf-kernels. BMC bioinformatics 19 (1), pp. 1–14. Cited by: §2.
  • [17] L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-sne. Journal of machine learning research 9 (Nov), pp. 2579–2605. Cited by: §1.
  • [18] M. Manica, A. Oskooei, J. Born, V. Subramanian, J. Sáez-Rodríguez, and M. Rodríguez Martínez (2019) Toward explainable anticancer compound sensitivity prediction via multimodal attention-based convolutional encoders. Molecular Pharmaceutics. Cited by: §1, §2, §2, §2, §3.1, §6.
  • [19] A. Oskooei, J. Born, M. Manica, V. Subramanian, J. Sáez-Rodríguez, and M. R. Martínez (2018) PaccMann: prediction of anticancer compound sensitivity with multi-modal attention-based neural networks. arXiv preprint arXiv:1811.06802. Cited by: §1, §2, §2, §3.1, §6.
  • [20] L. Rampasek, D. Hidru, P. Smirnov, B. Haibe-Kains, and A. Goldenberg (2017) Dr. vae: drug response variational autoencoder. arXiv preprint arXiv:1706.08203. Cited by: §1, §2.
  • [21] L. Rampášek, D. Hidru, P. Smirnov, B. Haibe-Kains, and A. Goldenberg (2019) Dr. vae: improving drug response prediction via modeling of drug perturbation effects. Bioinformatics 35 (19), pp. 3743–3751. Cited by: §2.
  • [22] M. Simonovsky and N. Komodakis (2018) Graphvae: towards generation of small graphs using variational autoencoders. In International Conference on Artificial Neural Networks, pp. 412–422. Cited by: §1, §2, §3.1.
  • [23] M. Tsubaki, K. Tomii, and J. Sese (2019) Compound–protein interaction prediction with end-to-end learning of neural networks for graphs and sequences. Bioinformatics 35 (2), pp. 309–318. Cited by: §1, §3.1.
  • [24] S. Wenric and R. Shemirani (2018) Using supervised learning methods for gene selection in rna-seq case-control studies. Frontiers in genetics 9, pp. 297. Cited by: §2.
  • [25] W. Yang, J. Soares, P. Greninger, E. J. Edelman, H. Lightfoot, S. Forbes, N. Bindal, D. Beare, J. A. Smith, I. R. Thompson, et al. (2012) Genomics of drug sensitivity in cancer (gdsc): a resource for therapeutic biomarker discovery in cancer cells. Nucleic acids research 41 (D1), pp. D955–D961. Cited by: §1, §3.1.