## 1 Introduction

Despite tremendous advances in the pharmaceutical industry, many patients worldwide do not respond to the first medication they are prescribed. Personalized medicine, an approach that uses patients’ own genomic data, promises to tailor treatment to increase the probability of a positive response. Realizing this promise, however, requires training data for powerful predictive models. The space of all possible treatments and their combinations for a given condition is enormous, and the heterogeneity of patients with complex diseases is high. Thus, while much data has been collected, it is sparsely and inefficiently sampled, making this a very hard learning problem.

In the last decade there have been several public releases of high-throughput drug screens in cell lines. Cancer cell lines are cells taken from a patient’s tumor that are “immortalized”, i.e. modified to divide indefinitely. The greatest advantage of cell lines is that it is relatively inexpensive to test them with thousands of drugs, providing a rich basis for learning predictive models. This screening task was undertaken by several large consortia and pharmaceutical companies, resulting in public datasets of various sizes, e.g. Genomics of Drug Sensitivity in Cancer (GDSC) (Yang et al., 2013) with 138 drugs tested on 700 cancer cell lines, and the Cancer Cell Line Encyclopedia (CCLE) (Barretina et al., 2012) with 24 drugs tested on a panel of >1000 cell lines. The availability of these datasets spurred the development of predictive models. Jang et al. (2014) compared seven standard machine learning approaches and identified ridge and elastic net regressions as the best performers across the 24 compounds from the CCLE dataset and the 138 compounds from the GDSC dataset.

In addition, there is a perturbation database containing over 16,000 experiments showing how the expression of 1000 genes changed in response to a drug (gene expression is recorded before and after drug application) (Duan et al., 2014). This information allows for the assessment of biological change due to treating the cancer cells with drugs. Combining response and perturbation data is expected to ultimately yield a better and more biologically relevant model of drug response, though more experiments will likely be needed, since only a few drugs have been tested in each cell line.

In this paper we present two deep generative models, Perturbation Variational Autoencoder (PertVAE) and its semi-supervised extension, Drug Response Variational Autoencoder (Dr.VAE), that learn a latent representation of the underlying gene states before and after drug application. This representation depends on both the cell line’s overall response to the drug and the biological change of each of the landmark genes. We build on the ability of VAEs to leverage the expressiveness of deep neural networks for Bayesian learning and inference (Kingma et al., 2014). In addition, as Bayesian models they are better suited to settings with very little data, which is the case in our drug response prediction problem. To fit our model we use a combination of Stochastic Gradient Variational Bayes (Kingma and Welling, 2013) and Inverse Autoregressive Flow (Kingma et al., 2016), a recently introduced type of Normalizing Flow (Rezende and Mohamed, 2015).

We tested our methods on 19 drugs for which both perturbation and drug response data were available. Both Dr.VAE and Semi-Supervised VAE outperform the current published benchmark models (Jang et al., 2014) in both AUROC and AUPR. Our analysis of this problem and of the model performance shows the difficulty of fitting sparsely and inefficiently sampled high-dimensional data, but at the same time illustrates the flexibility of our approach and its potential improvement over currently available state-of-the-art models for the drug response prediction problem.

## 2 Methods

We propose two models. First, we discuss an approach for modeling drug perturbation effects: given the gene expression of a cell line before the drug is applied (pre-treatment gene expression), we aim to predict the gene expression after the drug is applied (post-treatment state). To this end we propose a deep generative model, Perturbation VAE (PertVAE).

We then develop a drug response prediction model, Drug Response Variational Autoencoder (Dr.VAE), a semi-supervised extension of PertVAE, to tackle the problem of drug response (treatment efficacy) prediction while harnessing the unsupervised information about the particular drug contained in observed pre- and post-treatment gene expression perturbation pairs.

### 2.1 Perturbation VAE

Perturbation Variational Autoencoder (PertVAE) is an unsupervised model of drug-induced gene expression perturbations that embeds the data space (gene expression) in a lower-dimensional latent space. In the latent space we model the drug-induced effect as a linear function, which is trained jointly with the embedding encoder and decoder.

We fit PertVAE on “perturbation pairs” $(x_1, x_2)$ of pre-treatment and post-treatment gene expression with a shared stochastic embedding encoder $q_\phi(z \mid x)$ and decoder $p_\theta(x \mid z)$. The original dimension of each vector $x$ is 903 genes. Additionally, we use unpaired pre-treatment data (with no known post-treatment state) to improve learning of the latent representation. The graphical representation of the PertVAE model is shown in Figure 1.

#### Joint distribution.

Our Perturbation VAE models the joint distribution $p_\theta(x_1, x_2, z_1, z_2)$, which we assume to factorize as:

$$p_\theta(x_1, x_2, z_1, z_2) = p_\theta(x_1 \mid z_1)\, p_\theta(x_2 \mid z_2)\, p_\theta(z_2 \mid z_1)\, p(z_1) \tag{1}$$

#### Generative distribution $p_\theta$.

Perturbation VAE’s generative process is as follows:

$$z_1 \sim p(z_1) = \mathcal{N}(0, I) \tag{2}$$

$$z_2 \mid z_1 \sim p_\theta(z_2 \mid z_1) = \mathcal{N}\!\left(\mu_{z_2}(z_1), \sigma^2_{z_2}(z_1)\right) \tag{3}$$

$$x_t \mid z_t \sim p_\theta(x_t \mid z_t) = \mathcal{N}\!\left(\mu_{x}(z_t), \sigma^2_{x}(z_t)\right), \quad t \in \{1, 2\} \tag{4}$$

The parameters of these distributions are computed by the functions $\mu_{z_2}, \sigma^2_{z_2}, \mu_x, \sigma^2_x$, which are neural networks with a total set of parameters $\theta$. For brevity we refer to these parameters as $\theta$ instead of more specific subsets such as $\theta_{z_2}$ or $\theta_x$ when such a level of detail would unnecessarily clutter the notation.

We constrain the mean function $\mu_{z_2}$ in $p_\theta(z_2 \mid z_1)$ to be a linear function of the following form:

$$\mu_{z_2}(z_1) = z_1 + W z_1 + b \tag{5}$$

with $W$ and $b$ initialized close to zero such that $\mu_{z_2}$ starts as an identity function. We found that, together with L2 penalization, this formulation improves stability and generalization of the model.
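The near-identity initialization of the latent drug-effect function can be sketched as follows (a minimal NumPy illustration, not the authors’ code; the dimension and initialization scale are assumptions):

```python
import numpy as np

def init_drug_effect(dim, rng, scale=1e-3):
    """W and b start near zero, so mu(z1) = z1 + W @ z1 + b starts near identity."""
    W = rng.normal(0.0, scale, size=(dim, dim))
    b = rng.normal(0.0, scale, size=dim)
    return W, b

def drug_effect_mean(z1, W, b):
    """Linear latent-space drug effect of Eq. (5): mu(z1) = z1 + W z1 + b."""
    return z1 + W @ z1 + b

rng = np.random.default_rng(0)
W, b = init_drug_effect(100, rng)
z1 = rng.normal(size=100)
z2_mean = drug_effect_mean(z1, W, b)
# With scale ~1e-3 the output stays very close to z1 at initialization.
```

The identity-plus-residual parameterization means the model starts from “drug does nothing” and has to learn the deviation, which pairs naturally with the L2 penalty on $W$ and $b$ mentioned above.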

#### Approximate posterior $q_\phi$.

Depending on the type of the data, we assume the approximate posterior with a set of parameters $\phi$ to factorize as:

$$\text{perturbation pairs:} \quad q_\phi(z_1, z_2 \mid x_1, x_2) = q_\phi(z_1 \mid x_1)\, q_\phi(z_2 \mid x_2) \tag{6}$$

$$\text{pre-treatment singleton:} \quad q_\phi(z_1 \mid x_1) \tag{7}$$

Analogously to the shared generative distribution, $q_\phi(z_t \mid x_t)$ is also shared for both $t \in \{1, 2\}$. Here, instead of directly using a diagonal Gaussian as the final approximate posterior

$$z^{(0)} \sim \mathcal{N}\!\left(\mu_\phi(x), \sigma^2_\phi(x)\right) \tag{8}$$

we apply two steps of “LSTM-type” Inverse Autoregressive Flow (IAF) (Kingma et al., 2016) updates to facilitate a richer family of approximate distributions. A sample from $q_\phi(z \mid x)$ is then derived by two iterations of the following step:

$$g = \operatorname{sigmoid}(s) \tag{9}$$

$$z^{(k)} = g \odot z^{(k-1)} + (1 - g) \odot m \tag{10}$$

The coefficients $m$ and $s$ of the IAF are computed by a Masked Autoencoder for Distribution Estimation (MADE) model (Germain et al., 2015):

$$[m, s] = \operatorname{MADE}_\phi\!\left(z^{(k-1)}, h\right) \tag{11}$$

where $h$ is an additional output of the encoder on which the flow is conditioned.

MADE is an autoregressive model, that is, the $i$-th elements of the $m$ and $s$ vectors depend only on the first $i-1$ elements of $z^{(k-1)}$. Using this property, the determinant of the Jacobian of each IAF step can be computed efficiently. As each IAF step is then an invertible function with a known Jacobian determinant, it is possible to efficiently compute the probability of the derived sample under the complex posterior, which does not have a parametric form (Kingma et al., 2016).

#### Fitting $\theta$ and $\phi$ parameters.
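A single “LSTM-type” IAF update and its log-Jacobian-determinant can be sketched as below. This is a simplified stand-in where a strictly lower-triangular linear map plays the role of the MADE network (enough to exhibit the autoregressive property); the real model uses masked neural networks:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def made_coeffs(z, Wm, bm, Ws, bs):
    """Toy stand-in for MADE: strictly lower-triangular weights make the
    i-th outputs depend only on z[:i] (the autoregressive property)."""
    m = np.tril(Wm, k=-1) @ z + bm
    s = np.tril(Ws, k=-1) @ z + bs
    return m, s

def iaf_step(z, Wm, bm, Ws, bs):
    """LSTM-type IAF update z' = g*z + (1-g)*m with gate g = sigmoid(s).
    The Jacobian dz'/dz is lower triangular with diagonal g, so
    log|det J| = sum(log g) -- no O(d^3) determinant needed."""
    m, s = made_coeffs(z, Wm, bm, Ws, bs)
    g = sigmoid(s)
    return g * z + (1.0 - g) * m, float(np.sum(np.log(g)))

rng = np.random.default_rng(1)
d = 4
params = (rng.normal(size=(d, d)), rng.normal(size=d),
          rng.normal(size=(d, d)), rng.normal(size=d))
z = rng.normal(size=d)
z_new, log_det = iaf_step(z, *params)

# Numerical check: a finite-difference Jacobian agrees with the cheap formula.
eps = 1e-6
J = np.zeros((d, d))
for j in range(d):
    zp = z.copy()
    zp[j] += eps
    J[:, j] = (iaf_step(zp, *params)[0] - z_new) / eps
num_log_det = float(np.log(abs(np.linalg.det(J))))
```

The finite-difference check makes the key point explicit: because output $i$ never depends on inputs $j > i$, the Jacobian is triangular and its determinant is just the product of the gates.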

We jointly optimize the generative model parameters $\theta$ and variational parameters $\phi$ with Stochastic Gradient Variational Bayes (SGVB) (Kingma and Welling, 2013; Rezende et al., 2014) to maximize the Evidence Lower Bound (ELBO) of the data:

$$\mathcal{L}(\theta, \phi) = \sum_{(x_1, x_2)} \mathcal{L}_{\text{pair}}(x_1, x_2) \tag{12}$$

$$\qquad\qquad + \sum_{x_1} \mathcal{L}_{\text{single}}(x_1) \tag{13}$$

which is a sum of the evidence lower bound of perturbation pairs and the lower bound of unpaired “singleton” examples, which we leverage to additionally train the latent-space Variational Autoencoder. The individual per-example lower bounds $\mathcal{L}_{\text{pair}}$ and $\mathcal{L}_{\text{single}}$ take the following form:

$$\mathcal{L}_{\text{pair}}(x_1, x_2) = \mathbb{E}_{q_\phi(z_1 \mid x_1)}\!\left[\log p_\theta(x_1 \mid z_1)\right] + \mathbb{E}_{q_\phi(z_2 \mid x_2)}\!\left[\log p_\theta(x_2 \mid z_2)\right] \tag{14}$$

$$\qquad - D_{\mathrm{KL}}\!\left(q_\phi(z_1 \mid x_1) \,\|\, p(z_1)\right) - \mathbb{E}_{q_\phi(z_1 \mid x_1)}\!\left[D_{\mathrm{KL}}\!\left(q_\phi(z_2 \mid x_2) \,\|\, p_\theta(z_2 \mid z_1)\right)\right] \tag{15}$$

$$\mathcal{L}_{\text{single}}(x_1) = \mathbb{E}_{q_\phi(z_1 \mid x_1)}\!\left[\log p_\theta(x_1 \mid z_1)\right] - D_{\mathrm{KL}}\!\left(q_\phi(z_1 \mid x_1) \,\|\, p(z_1)\right) \tag{16}$$

Using SGVB it is possible to backpropagate through the Monte Carlo samples of $z$, and we use Adam (Kingma and Ba, 2014) to compute gradient updates for both the $\theta$ and $\phi$ parameters. As we use IAF to model $q_\phi(z \mid x)$, the Kullback–Leibler divergence cannot be computed analytically and therefore we use a Monte Carlo estimate. Additionally, we follow Kingma et al. (2016) and allow “free bits” in the KL term to mitigate the problem of an overly strong prior causing the optimization to get stuck in bad local optima.

### 2.2 Drug Response VAE
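The Monte Carlo KL estimate with a “free bits” floor can be sketched as follows (a simplified per-dimension variant for a diagonal Gaussian posterior; the floor `lam` and sample count are illustrative, and the paper’s IAF posterior would replace the Gaussian sampler):

```python
import numpy as np

def gaussian_log_pdf(x, mu, log_var):
    return -0.5 * (np.log(2 * np.pi) + log_var + (x - mu) ** 2 / np.exp(log_var))

def mc_kl_free_bits(mu_q, log_var_q, lam=0.25, n_samples=1000, seed=0):
    """Monte Carlo estimate of KL(q || N(0, I)) per latent dimension,
    clamped from below at `lam` nats of "free bits": dimensions whose KL
    is already below the floor exert no further pressure toward the prior."""
    rng = np.random.default_rng(seed)
    z = mu_q + np.exp(0.5 * log_var_q) * rng.normal(size=(n_samples, mu_q.size))
    kl_per_dim = np.mean(gaussian_log_pdf(z, mu_q, log_var_q)
                         - gaussian_log_pdf(z, 0.0, 0.0), axis=0)
    return float(np.sum(np.maximum(kl_per_dim, lam)))

# Dim 0 matches the prior exactly (true KL = 0, clamped to 0.25 nats);
# dim 1 has true KL = 0.5 * 2^2 = 2 nats.
kl = mc_kl_free_bits(np.array([0.0, 2.0]), np.zeros(2))
```

The clamp is what prevents the “overly strong prior” failure mode: early in training the optimizer cannot reduce the loss by collapsing latent dimensions whose KL is already under the floor.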

Analogously to the Semi-Supervised VAE, we extend our unsupervised Perturbation VAE to a semi-supervised model by stacking a modified “M2 model” (Kingma et al., 2014). This model can be trained jointly to capture both drug-induced perturbation effects and treatment response outcomes. As such, we call this model Drug Response VAE (Dr.VAE).

We use a similar type of data to train Dr.VAE as for PertVAE; however, some of the perturbation pairs and pre-treatment singletons can now have a binary outcome label $y$ associated with them, denoting whether the drug treatment was successful. The schema of the Dr.VAE model is shown in Figure 2.

#### Joint distribution.

Dr.VAE models the joint distribution $p_\theta(x_1, x_2, y, z_1, z_2)$, which we assume to factorize as:

$$p_\theta(x_1, x_2, y, z_1, z_2) = p_\theta(x_1 \mid z_1)\, p_\theta(x_2 \mid z_2)\, p_\theta(z_2 \mid z_1, y)\, p(y)\, p(z_1) \tag{17}$$

#### Generative distributions $p_\theta$.

The individual generative distributions that Dr.VAE factorizes into have the following form:

$$z_1 \sim p(z_1) = \mathcal{N}(0, I) \tag{18}$$

$$y \sim p(y) = \operatorname{Bernoulli}(\pi) \tag{19}$$

$$z_2 \mid z_1, y \sim p_\theta(z_2 \mid z_1, y) = \mathcal{N}\!\left(\mu_{z_2}(z_1, y), \sigma^2_{z_2}(z_1, y)\right) \tag{20}$$

$$x_1 \mid z_1 \sim p_\theta(x_1 \mid z_1) = \mathcal{N}\!\left(\mu_x(z_1), \sigma^2_x(z_1)\right) \tag{21}$$

$$x_2 \mid z_2 \sim p_\theta(x_2 \mid z_2) = \mathcal{N}\!\left(\mu_x(z_2), \sigma^2_x(z_2)\right) \tag{22}$$

In the same way as in PertVAE, we share the “data decoder” $p_\theta(x_t \mid z_t)$ among both $t \in \{1, 2\}$.

#### Approximate posterior $q_\phi$.

Depending on the type of the data, we assume the approximate posterior to factorize as:

$$\text{labeled pair:} \quad q_\phi(z_1, z_2 \mid x_1, x_2, y) = q_\phi(z_1 \mid x_1)\, q_\phi(z_2 \mid x_2) \tag{23}$$

$$\text{unlabeled pair:} \quad q_\phi(z_1, z_2, y \mid x_1, x_2) = q_\phi(z_1 \mid x_1)\, q_\phi(z_2 \mid x_2)\, q_\phi(y \mid z_1) \tag{24}$$

$$\text{labeled singleton:} \quad q_\phi(z_1 \mid x_1, y) = q_\phi(z_1 \mid x_1) \tag{25}$$

$$\text{unlab. singleton:} \quad q_\phi(z_1, y \mid x_1) = q_\phi(z_1 \mid x_1)\, q_\phi(y \mid z_1) \tag{26}$$

The “data encoder” $q_\phi(z_t \mid x_t)$ is shared and parametrized the same way as in PertVAE. The additional approximate posterior distributions then take the following form:

$$q_\phi(y \mid z_1) = \operatorname{Bernoulli}\!\left(\pi_\phi(z_1)\right) \tag{27}$$

$$q_\phi(z_2 \mid z_1, y) = \mathcal{N}\!\left(\mu_\phi(z_1, y), \sigma^2_\phi(z_1, y)\right) \tag{28}$$

The aforementioned factorizations of the joint distribution and of the posteriors also provide a recipe for sampling and inference in the model via Monte Carlo sampling.

#### Fitting $\theta$ and $\phi$ parameters.

We have 4 different sets of partially observed variables, which correspond to the different types of data. Therefore there are 4 different evidence lower bounds to optimize:

$$\mathcal{L}_{\text{pair}}(x_1, x_2, y) \tag{29}$$

$$\mathcal{L}_{\text{pair}}(x_1, x_2) \tag{30}$$

$$\mathcal{L}_{\text{single}}(x_1, y) \tag{31}$$

$$\mathcal{L}_{\text{single}}(x_1) \tag{32}$$

The sum of these 4 specific evidence lower bounds, $\mathcal{L}$, is the evidence lower bound we need to maximize. We omit the derivation of these specific lower bounds in the main manuscript since it follows the same principles as shown above for PertVAE and in the derivation of the Semi-Supervised VAE (Kingma et al., 2014; Louizos et al., 2015).

Finally, we need to explicitly introduce a loss for the predictive posterior $q_\phi(y \mid z_1)$ in order for it to be trained on labeled data as well. This is required because for labeled data the random variable $y$ is observed, and therefore the lower bounds $\mathcal{L}_{\text{pair}}(x_1, x_2, y)$ and $\mathcal{L}_{\text{single}}(x_1, y)$ are conditioned on $y$ and do not contribute to the fitting of $q_\phi(y \mid z_1)$. Our final objective to maximize is

$$\mathcal{J} = \mathcal{L} + \alpha\, \mathbb{E}_{(x_1, y)}\!\left[\log q_\phi(y \mid z_1)\right] \tag{33}$$

where $\alpha$ weights the classification loss.
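The shape of this combined objective can be sketched numerically as follows (a toy illustration with made-up ELBO values and classifier probabilities, not the trained model; `alpha` is the classification-loss weight):

```python
import numpy as np

def dr_vae_objective(elbo_terms, y_true, y_prob, alpha=1.0):
    """Sketch of the final objective: the summed evidence lower bounds plus
    an alpha-weighted log-likelihood of the classifier q(y|z1) on labeled
    examples, which would otherwise receive no gradient from the labeled
    lower bounds (where y is observed, not inferred)."""
    log_q_y = np.where(y_true == 1, np.log(y_prob), np.log(1.0 - y_prob))
    return float(sum(elbo_terms) + alpha * np.sum(log_q_y))

# Toy numbers: two ELBO contributions and three labeled examples.
obj = dr_vae_objective([-10.0, -7.5], np.array([1, 0, 1]), np.array([0.9, 0.2, 0.8]))
```

Setting `alpha` trades off generative fidelity against discriminative accuracy, exactly the knob the semi-supervised VAE framework exposes.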

## 3 Datasets

In our experiments we used two main data resources: (i) high-throughput screens of cancer cell lines, including pre-treatment gene expression for all tested cell lines and drug response in terms of cell line viability, and (ii) high-throughput screens of gene expression perturbation effects induced by drugs in cancer cell lines.

We tested our methods on a panel of 19 drugs for which both response and perturbation experiments are available. These 19 drugs were also used in the recent AstraZeneca-Sanger DREAM Challenge, and we therefore use them as a representative sample of anti-cancer drugs.

### 3.1 Datasets of drug response in cancer cell lines

Large high-throughput screening efforts have been undertaken, resulting in publicly available datasets of cancer cell lines with post-treatment cell viability at various drug concentrations. In our experiments we utilize the Genomics of Drug Sensitivity in Cancer (GDSC) (Yang et al., 2013) and Cancer Cell Line Encyclopedia (CCLE) (Barretina et al., 2012) datasets. We obtained these datasets using the PharmacoGx R package (Smirnov et al., 2015). As the drug response outcome we use a binarized version of the dose-response curves, which were computed by PharmacoGx. For consistency, we use the response outcome from GDSC, while we use all the other cell lines in GDSC and CCLE not screened for response to a given drug as unlabeled cell line examples. A summary of our pooled dataset is detailed in the Supplementary Materials.

### 3.2 Drug-induced perturbations dataset

The Library of Network-Based Cellular Signatures (LINCS) consortium screened the perturbation effects that drugs have on the expression of the L1000 landmark genes in cancer cell lines (Duan et al., 2014). These experiments do not measure drug treatment efficacy; however, some of these cell lines were independently tested in GDSC for drug response. We cross-reference these cell lines and assign the corresponding label to their perturbation measurements.

The L1000 perturbation dataset is very sparse: for the 19 drugs, only up to 56 different cell lines were screened. In fact, only 8 drugs have been measured in over 50 cell lines; the remaining 11 have been measured in fewer than 20 cell lines, albeit at various concentrations and with many biological replicates. In our results we use measurements at the highest drug concentration and all the biological replicates of such experiments. In cross-validation of our models we use cell-line-wise splitting, so that the biological replicates for a particular cell line are in the same data fold.
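The cell-line-wise splitting above can be sketched as follows (a minimal stand-in for group-wise fold assignment, in the spirit of scikit-learn’s `GroupKFold`; the cell-line identifiers are hypothetical):

```python
import numpy as np

def cell_line_folds(cell_line_ids, n_folds=5, seed=0):
    """Assign every unique cell line to one of n_folds folds, so that all
    biological replicates of a cell line end up in the same fold."""
    rng = np.random.default_rng(seed)
    unique = np.unique(cell_line_ids)
    shuffled = rng.permutation(unique)
    fold_of = {line: i % n_folds for i, line in enumerate(shuffled)}
    return np.array([fold_of[c] for c in cell_line_ids])

# Hypothetical cell-line ids; replicates share an id.
ids = np.array(["MCF7", "MCF7", "A375", "A375", "A375", "PC3", "HT29", "HELA"])
folds = cell_line_folds(ids, n_folds=3)
```

Splitting by cell line rather than by measurement prevents replicates of the same biological sample from leaking between training and test folds, which would otherwise inflate the reported performance.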

## 4 Results

We tested the performance of our models on three tasks: (i) the drug response prediction task, (ii) drug perturbation prediction, and (iii) gene expression reconstruction from the latent representation.

#### Architecture.

All evaluated Variational Autoencoder-based models, i.e. our proposed models (Dr.VAE, PertVAE) and the published models we used for comparison (VAE and Semi-Supervised VAE (SSVAE)), use 100 stochastic latent units, i.e. all latent vectors $z$ are stochastic vectors of length 100, and have the same architecture for the “data encoder” $q_\phi(z \mid x)$ and the “data decoder” $p_\theta(x \mid z)$. For the encoder, there are 903 input units corresponding to the 903 landmark genes (we exclude the genes that we could not uniquely map between data sets). The encoder has hidden layers with 500 and 300 units, from which the parameters $\mu$ and $\sigma$ of the initial Gaussian distribution are computed, together with 200 hidden units on which the subsequent Inverse Autoregressive Flow is conditioned. We use 2 steps of IAF, each with one hidden layer of 300 units. The architecture of the data decoder mirrors that of the data encoder, but without IAF. Throughout all our models we use the ELU activation function (Clevert et al., 2015) and Weight Normalization (Salimans and Kingma, 2016). We preserve various parts of the architecture among the different models to help with further analysis of what contributes to the observed performance.

For both Dr.VAE and SSVAE, the classification posteriors $q_\phi(y \mid z_1)$ and $q_\phi(y \mid z)$, respectively, are linear functions with a soft-max activation over two output units. In our implementation, we use a slightly modified classifier input for Dr.VAE, which we found improves performance. Further, the distributions $q_\phi(z_2 \mid \cdot)$ and $p_\theta(z_2 \mid \cdot)$ (and their equivalents in SSVAE) are parametrized by a neural network with a single hidden layer of 100 units.

All our presented experiments are evaluated in 10-times randomized 5-fold cross-validation, and we report the average metric across these 50 data splits. The models were fitted independently for each of the 19 drugs, but with the same hyperparameters.

### 4.1 Predicting drug response

We compare our models to ridge (L2-regularized) logistic regression (LR), random forest (RF), and a support vector machine with a linear kernel (SVM), following Jang et al. (2014), who found ridge LR to be the best overall classifier for drug response in the GDSC dataset.

To assess the informativeness of drug-induced perturbations for the drug response prediction task, we also compare Dr.VAE to a Semi-Supervised VAE (Kingma et al., 2014). SSVAE is trained on the same data, however without the ability to model the drug effect, as the perturbation pairs are simply presented as independent unlabeled singletons. On average, Dr.VAE is the best performing of all tested models, improving over ridge logistic regression, which is considered state-of-the-art in the field. Over all 19 drugs, the average improvement in performance of Dr.VAE exceeds that of SSVAE. The only drug for which both Dr.VAE and SSVAE performed worse than ridge LR is paclitaxel. This is a chemotherapy drug (with no specific gene target) and a much smaller sample size, so it appears that the simpler model has an advantage here.

Dr.VAE and SSVAE learn the latent representation of the data and the classifier jointly. To understand the importance of learning a good latent embedding, we also explored the learning paradigm where we first optimize the latent representation in an unsupervised fashion and then train the classifier using the already-learnt embedding. To this end we trained an unsupervised PertVAE on all perturbation pairs and afterwards fitted ridge LR classifiers, one using PertVAE’s latent representation of pre-treatment gene expression (PertVAE+LR on Z1) and another on the latent representation of the predicted post-treatment state (PertVAE+LR on Z2). Additionally, we compared these results to the baseline models trained on a principal component analysis (PCA) projection of the dataset to the first 100 principal components. The third best average performance over all models was achieved by LR trained on the latent embedding of pre-treatment gene expression learned by the PertVAE model. Its improvement over ridge LR is small, and it does not beat SSVAE or Dr.VAE on any of the 19 drugs. Ridge LR performs better on PertVAE latent representations than on both the observed gene expression and the PCA representation with the same number of components as PertVAE has latent units.

As the evaluation metric we use the area under the precision-recall curve (AUPR); the area under the ROC curve is reported in the Supplementary Materials. The performance of all models is presented in Table 1.

### 4.2 Reconstructing gene expression

The Variational Autoencoder is an expressive non-linear model, while PCA has the best reconstruction loss among linear models. To evaluate how well a VAE with our architecture can model gene expression, we fitted VAEs with various numbers of stochastic latent variables and compared their reconstructions to reconstructions by a PCA with an equivalent number of principal components. As the measure of reconstruction quality we used Spearman’s $\rho$ between the reconstruction mean and the original gene expression. We plot the results in Figure 3. The Variational Autoencoder with our encoder/decoder architecture, as used in Dr.VAE and PertVAE, does better for small latent spaces, after which it seems to overfit compared to PCA. We chose this architecture and 100 stochastic units as the default for all our models. We expect our models then to have enough expressive power and capacity not just to model gene expression, but also to find a latent space that is informative for drug response and/or in which the drug effect can be modeled as a stochastic linear function.
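The reconstruction-quality metric, Spearman’s $\rho$ between reconstruction mean and original expression, can be computed as below (a pure-NumPy rank correlation, illustrated on a toy PCA reconstruction of a random matrix standing in for expression data):

```python
import numpy as np

def spearman_rho(a, b):
    """Spearman's rho = Pearson correlation of the ranks (no tie handling,
    which is adequate for continuous expression values)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float(ra @ rb / np.sqrt((ra @ ra) * (rb @ rb)))

def pca_reconstruct(X, k):
    """Reconstruct data from its top-k principal components via SVD."""
    mean = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return (U[:, :k] * S[:k]) @ Vt[:k] + mean

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 20))            # toy stand-in for an expression matrix
X_hat = pca_reconstruct(X, k=10)
rho = float(np.mean([spearman_rho(X[i], X_hat[i]) for i in range(len(X))]))
```

Rank correlation, unlike mean squared error, is insensitive to monotone rescaling of the reconstruction, which makes it a robust way to compare models with different output scales.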

### 4.3 Predicting post-treatment gene expression

We trained a PertVAE for each drug independently to see how well we can predict drug perturbation effects. That is, we optimized the ELBO and stopped training when the perturbation prediction loss started to increase on the validation set.

To evaluate prediction performance we computed Spearman’s correlation $\rho(\hat{x}_2, x_2)$ between the mean of the predicted post-treatment gene expression distribution and the observed post-treatment gene expression in the test set. We compare this correlation to the correlation $\rho(\hat{x}_1, x_2)$ between the mean of the pre-treatment reconstruction distribution and the true post-treatment gene expression. This is done to assess whether the drug perturbation function is in fact learning anything beyond reconstructing pre-treatment gene expression. Note that in the training step the “drug effect” mean function is initialized close to identity. If PertVAE were either to underfit or to overfit the training set, we would expect $\rho(\hat{x}_2, x_2)$ to be no larger than $\rho(\hat{x}_1, x_2)$. Therefore we calculate a one-sided Mann-Whitney test with the alternative hypothesis $\rho(\hat{x}_2, x_2) > \rho(\hat{x}_1, x_2)$ on the results of our 10-times randomized 5-fold CV. The average correlation values and p-values of the statistical test are in Table 2, showing that PertVAE can at least partially predict drug perturbations for 5 out of the 8 drugs for which the dataset contains perturbation experiments with at least 51 unique cell lines.

## 5 Discussion

In our explorations of optimizing the latent space, we found that doing well on the reconstruction task does not directly lead to improved classification performance. The ability of Dr.VAE to model drug-induced perturbation effects on gene expression leads to limited improvement in classification performance. Compared to a fitted PertVAE, a Dr.VAE model trained predominantly for classification does not learn to predict perturbation effects along the way. However, it is the best performing classification model. This set of observations compels us to conclude that the latent space serves a different role than simply compressing observed gene expression. Given a very small set of samples, very heterogeneous and noisy input, and likely noisy output, the goal of the latent space is to capture the essence of the observed gene expression that is most useful, and likely biased, for prediction. The original goal of our work was to create a rich paradigm where much of the available data can be incorporated to boost the predictive performance for drug response. We did achieve an improvement in predicting drug response within the flexible and powerful VAE framework, which we believe is the way to model such data in the future.

## References

- Barretina et al. (2012) Barretina, J., Caponigro, G., Stransky, N., Venkatesan, K., Margolin, A. A., Kim, S., Wilson, C. J., Lehár, J., Kryukov, G. V., Sonkin, D., et al. (2012). The cancer cell line encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature, 483(7391):603–607.
- Clevert et al. (2015) Clevert, D.-A., Unterthiner, T., and Hochreiter, S. (2015). Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289.
- Duan et al. (2014) Duan, Q., Flynn, C., Niepel, M., Hafner, M., Muhlich, J. L., Fernandez, N. F., Rouillard, A. D., Tan, C. M., Chen, E. Y., Golub, T. R., et al. (2014). Lincs canvas browser: interactive web app to query, browse and interrogate lincs l1000 gene expression signatures. Nucleic acids research, page gku476.
- Germain et al. (2015) Germain, M., Gregor, K., Murray, I., and Larochelle, H. (2015). Made: Masked autoencoder for distribution estimation. In ICML, pages 881–889.
- Jang et al. (2014) Jang, I. S., Neto, E. C., Guinney, J., Friend, S. H., and Margolin, A. A. (2014). Systematic assessment of analytical methods for drug sensitivity prediction from cancer cell line data. In Pac. Symp. Biocomput, volume 19, pages 63–74. World Scientific.
- Kingma and Ba (2014) Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Kingma et al. (2014) Kingma, D. P., Mohamed, S., Rezende, D. J., and Welling, M. (2014). Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589.
- Kingma et al. (2016) Kingma, D. P., Salimans, T., and Welling, M. (2016). Improving variational inference with inverse autoregressive flow. arXiv preprint arXiv:1606.04934.
- Kingma and Welling (2013) Kingma, D. P. and Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
- Louizos et al. (2015) Louizos, C., Swersky, K., Li, Y., Welling, M., and Zemel, R. (2015). The variational fair autoencoder. arXiv preprint arXiv:1511.00830.
- Rezende and Mohamed (2015) Rezende, D. J. and Mohamed, S. (2015). Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770.
- Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082.
- Salimans and Kingma (2016) Salimans, T. and Kingma, D. P. (2016). Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pages 901–909.
- Smirnov et al. (2015) Smirnov, P., Safikhani, Z., El-Hachem, N., Wang, D., She, A., Olsen, C., Freeman, M., Selby, H., Gendoo, D. M., Grossman, P., et al. (2015). Pharmacogx: an r package for analysis of large pharmacogenomic datasets. Bioinformatics, page btv723.
- Yang et al. (2013) Yang, W., Soares, J., Greninger, P., Edelman, E. J., Lightfoot, H., Forbes, S., Bindal, N., Beare, D., Smith, J. A., Thompson, I. R., et al. (2013). Genomics of drug sensitivity in cancer (gdsc): a resource for therapeutic biomarker discovery in cancer cells. Nucleic acids research, 41(D1):D955–D961.
