Precision oncology, the genetic sequencing of tumors to identify druggable targets, has quickly become the standard of care in the treatment of many cancers Schwartzberg2017. Here, targeted treatments show therapeutic activity in subsets of patients defined by tumor-specific genetic alterations, such as imatinib for chronic myelogenous leukemia Blumenthal2010. However, assigning patients to adequate treatments remains challenging. Current practice applies published and approved therapeutic protocols that consider the patient’s clinical characteristics, including the presence of (most often one) particular genetic mutation, to choose a therapeutic. For example, a treatment decision can be made based on the presence of a single genetic variant, such as the V600E mutation in BRAF (a gene involved in cell growth) Hou2011. However, this limits the ability to make high-confidence clinical decisions in a real-world scenario with a large number of both observable genetic features and treatment options to choose from. The implications for the practice of precision oncology are (I) high selectivity: only 4.9% of oncology patients are eligible for genome-targeted therapies with robust clinical evidence Marquart2018, and consequently (II) frequent compassionate use: the majority of oncology patients are left with limited treatment options outside of existing therapeutic protocols, and the resulting off-label use does not contribute systematically to the development of new clinical evidence.
Given the nature of precision oncology, treatment assignment can be modeled as a contextual bandit problem in which a patient’s information informs the choice of treatment. In contrast to supervised learning on the one end and reinforcement learning on the other, contextual bandit methods, especially when based on Thompson sampling, are well suited to this task because they allow agents to explore new treatment options while ensuring that every action taken has a non-zero chance of being optimal Riquelme2018. Put differently, Thompson sampling based agents would never make choices for a patient that are certainly non-optimal in order to improve decision making at a later time point, something that cannot be excluded for most full reinforcement learning algorithms. While contextual bandit applications in oncology have been previously proposed Durand2015; Durand2018ContextualBF, there are no established benchmarks to evaluate different algorithms, objectives, and state representations, due to a lack of biologically interpretable and complete observations of drug response in cancer. This is especially relevant as major ethical questions of how to balance the competing directives of individual utility and population utility remain.
Here we propose a benchmark for contextual bandits in precision oncology based on real in vitro drug response of approximately 900 cancer cell lines Iorio2016. For each cell line, mutation, copy number variation, and gene expression data are available to represent the sample’s state. After defining rewards based on treatment response, we used all available algorithms implemented in the Bayesian bandit showdown project Riquelme2018; Snoek2015 to sequentially choose the best treatments for randomly selected cell lines. In addition, we defined a rule-based agent based on a set of current evidence-based therapeutic protocols, both to serve as a reference for bandit performance and to include prior knowledge in the available state information during selected experiments.
2.1 Benchmark Construction
We derived all molecular and drug sensitivity data from the Genomics of Drug Sensitivity in Cancer database, a public research repository described by Iorio et al. Iorio2016 and available at https://www.cancerrxgene.org. In a first pre-processing step, we focused on a subset of 7 drugs that are currently used in clinical practice. We log-transformed the drug sensitivity values and normalized them relative to the median across cell lines for each drug. We used the resulting score to quantify drug response and calculate response-based rewards (Figure 1A).
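The normalization step described above can be sketched as follows. This is a minimal illustration, not the original pipeline: it assumes that the raw sensitivity values are strictly positive, that the natural logarithm is used, and that "normalized relative to the median" means subtracting the per-drug median in log space. The function name, drug names, and cell line names are hypothetical.

```python
import math
from statistics import median


def normalize_drug_response(raw):
    """Log-transform and median-center raw sensitivity values per drug.

    raw: dict mapping drug -> dict mapping cell line -> positive raw value.
    Returns the same nested structure holding the normalized scores.
    Assumption: normalization is a subtraction of the per-drug median
    of the log-transformed values.
    """
    scores = {}
    for drug, by_line in raw.items():
        logged = {line: math.log(value) for line, value in by_line.items()}
        drug_median = median(logged.values())
        scores[drug] = {line: x - drug_median for line, x in logged.items()}
    return scores


# Toy input with made-up values for two hypothetical drugs.
raw = {
    "drug_A": {"line_1": 1.0, "line_2": 10.0, "line_3": 100.0},
    "drug_B": {"line_1": 0.5, "line_2": 2.0, "line_3": 8.0},
}
scores = normalize_drug_response(raw)
```

Under this convention a score of zero marks the median cell line for a drug, and negative scores mark cell lines that respond more strongly than the median.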
In order to reduce the dimensionality of the state representation, we projected 18523 cell-line specific features, comprising scaled gene expression data and binarized mutation and copy-number variant information, into 20 dimensions by uniform manifold approximation and projection (UMAP) using default parameters. The UMAP-projected features recovered tissue types (Figure 1B) while not directly recovering overall drug sensitivities (Figure 1C).
Next, we manually curated therapeutic protocols based on current clinical evidence, established and recent databases, as well as trial protocols, with selected simplifications: (I) we excluded any protocols involving combination treatments, (II) we excluded any protocols that are based on the presence of oncogenic gene fusions, as these were not included in our dataset, and (III) in analogy to basket trials, we did not include tissue type restrictions in any protocol. Cell lines that did not qualify for any of the curated treatments were assigned to be treated with Cisplatin, an established chemotherapeutic used, among others, for treatment of cancers of unknown primary (Figure 2A).
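A rule-based assignment of this kind can be sketched as a lookup from genetic alterations to single-agent treatments with a Cisplatin fallback. The protocol table below is purely hypothetical and does not reproduce the curated protocols. The alteration labels, the placeholder drug names, and the binary encoding of recommendations (one indicator per available drug) are assumptions made for illustration.

```python
FALLBACK = "Cisplatin"

# Hypothetical protocol table: alteration label -> single-agent treatment.
# The real curated protocols are not reproduced here.
PROTOCOLS = [
    ("BRAF_V600E", "targeted_drug_A"),
    ("GENE_X_amplification", "targeted_drug_B"),
]

# Hypothetical list of available treatments (the benchmark uses 7 drugs).
DRUGS = ["targeted_drug_A", "targeted_drug_B", FALLBACK]


def recommend(alterations):
    """Return all treatments whose protocol matches, else the fallback."""
    matches = [drug for alteration, drug in PROTOCOLS if alteration in alterations]
    return matches or [FALLBACK]


def encode_recommendations(recommended, drugs=DRUGS):
    """Encode recommendations as one binary indicator per drug (assumption)."""
    return [1 if drug in recommended else 0 for drug in drugs]


braf_line = recommend({"BRAF_V600E", "TP53_mut"})
unmatched_line = recommend({"TP53_mut"})
```

Because no tissue restriction enters the lookup, the same alteration yields the same recommendation in every tissue type, in line with simplification (III).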
2.2 Contextual Bandit Formulation
We followed the definition of the contextual bandit problem as described in Riquelme2018. The algorithm assigns treatments to units sequentially. At time $t = 1, \dots, n$, the algorithm takes as input a context $X_t \in \mathbb{R}^d$ corresponding to the next unit (e.g., a cell line’s projected genetic data). The algorithm selects one of $k$ actions, $a_t$ (e.g., one of 7 available treatments). A reward $r_t = r_t(X_t, a_t)$ is then generated and returned. At the end, the cumulative reward for the algorithm is defined as $r = \sum_{t=1}^{n} r_t$. The goal is to maximize cumulative reward, and thus minimize the cumulative regret, defined as $R_A = \mathbb{E}[r^* - r]$, where $r^*$ is the cumulative reward of the optimal policy (i.e., the policy that always selects the ideal treatment given the context).
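Because the benchmark contains a measured response for every drug on every cell line, cumulative reward and regret can be computed directly from a full reward table. The following sketch illustrates the bookkeeping for a single run. The table layout (one row per unit, one column per treatment), the function names, and the toy values are assumptions.

```python
def cumulative_reward(reward_table, actions):
    """Sum the rewards of the actions actually taken, one per unit."""
    return sum(row[action] for row, action in zip(reward_table, actions))


def cumulative_regret(reward_table, actions):
    """Reward of the optimal policy minus the reward actually collected.

    The optimal policy picks the best treatment for every unit, which is
    only computable because all treatments are observed for every unit.
    """
    optimal = sum(max(row) for row in reward_table)
    return optimal - cumulative_reward(reward_table, actions)


# Toy example: 3 units (rows) and 3 treatments (columns), made-up rewards.
reward_table = [
    [0.2, 0.9, 0.1],
    [0.7, 0.3, 0.5],
    [0.4, 0.4, 0.8],
]
actions = [1, 2, 2]  # treatment index chosen for each unit

total = cumulative_reward(reward_table, actions)
regret = cumulative_regret(reward_table, actions)
```

In this toy run the agent is optimal for the first and third unit and loses 0.2 on the second, so the regret of the run is 0.2 up to floating-point rounding.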
Similar to the study in Riquelme2018, we exclusively examined the performance of decision making via Thompson sampling. In each round of Thompson sampling, parameters $\theta_t$ are sampled from the posterior distribution given all previous observations. Using these parameters and the current context $X_t$, an action $a_t$ is chosen to minimize the expected regret according to an internal model.
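One round of Thompson sampling can be illustrated with a simple Bayesian linear model per action. This is a sketch under simplifying assumptions rather than the library implementation: it uses a zero-mean Gaussian prior with precision `ridge`, a known noise variance, and independent parameters per action. The class and parameter names are hypothetical.

```python
import numpy as np


class LinearThompsonSampler:
    """Thompson sampling with an independent Bayesian linear model per action.

    Assumptions: Gaussian prior N(0, I / ridge) on each parameter vector and
    a known, fixed observation noise variance.
    """

    def __init__(self, n_actions, dim, ridge=0.25, noise_var=1.0, seed=0):
        self.rng = np.random.default_rng(seed)
        self.noise_var = noise_var
        # Posterior precision matrix and scaled X^T y vector for each action.
        self.precision = np.stack([ridge * np.eye(dim) for _ in range(n_actions)])
        self.xty = np.zeros((n_actions, dim))

    def select(self, context):
        """Sample parameters from each posterior and act greedily on the sample."""
        sampled_values = []
        for precision, xty in zip(self.precision, self.xty):
            cov = np.linalg.inv(precision)
            cov = (cov + cov.T) / 2.0  # guard against numerical asymmetry
            mean = cov @ xty
            theta = self.rng.multivariate_normal(mean, cov)
            sampled_values.append(context @ theta)
        return int(np.argmax(sampled_values))

    def update(self, context, action, reward):
        """Conjugate posterior update for the action that was taken."""
        self.precision[action] += np.outer(context, context) / self.noise_var
        self.xty[action] += context * reward / self.noise_var


# One round: 7 treatments and a 20-dimensional context, made-up values.
agent = LinearThompsonSampler(n_actions=7, dim=20)
context = np.zeros(20)
context[0] = 1.0
action = agent.select(context)
agent.update(context, action, reward=1.0)
```

Exploration arises only from posterior uncertainty: as observations accumulate for an action, its sampled parameters concentrate around the posterior mean and the agent's choices become increasingly greedy.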
In this study, we examined the effect of (I) different state representations, (II) different reward functions, and (III) different posterior distribution approximations Riquelme2018 on performance. We evaluated three different state representations by including only genetic features, only rule-based recommendations, or both data types in the state. Thus, in total, the state was represented by up to 27 features (20 UMAP + 7 recommendations). Further, we defined three different reward metrics:
(I) Difference-based: subtract the lowest drug response score (the strongest response) from the response score of the selected drug.
(II) Rank-based: rank the drugs by activity in ascending order, so that the most active drug (lowest response score) is ranked 7 and the least active drug is ranked 1.
(III) Percentile-based: for each drug, map its response score to its distribution over cell lines and use the percentile as reward.
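The three reward metrics can be sketched as small functions operating on response scores, where a lower score means a stronger response. Several details are assumptions made for illustration: the difference is negated so that a larger reward is better, ties in the ranking are broken by drug order, and the percentile is oriented so that stronger responses receive higher rewards. The function names are hypothetical.

```python
def difference_reward(scores, action):
    """Negated gap between the selected drug's score and the best (lowest) score.

    Assumption: the sign is flipped so that the best drug yields the
    maximal reward of zero.
    """
    return -(scores[action] - min(scores))


def rank_reward(scores, action):
    """Rank of the selected drug; the most active drug gets the highest rank."""
    # Order drug indices from weakest response (highest score) to strongest.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order.index(action) + 1


def percentile_reward(score, drug_scores_all_lines):
    """Fraction of cell lines that respond more weakly to this drug.

    Assumption: the percentile is oriented so that stronger responses
    (lower scores) map to higher rewards.
    """
    weaker = sum(1 for other in drug_scores_all_lines if other > score)
    return weaker / len(drug_scores_all_lines)


# Made-up response scores of one cell line to 7 drugs (lower is stronger).
scores = [0.5, -1.2, 0.0, 2.0, -0.3, 1.1, 0.7]
best_rank = rank_reward(scores, 1)   # drug 1 has the lowest score
worst_rank = rank_reward(scores, 3)  # drug 3 has the highest score
```

The difference and rank rewards compare drugs within one cell line, whereas the percentile reward compares one cell line against all others for the same drug, so the three metrics can favor different actions for the same unit.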
For posterior approximation, we considered all of the Bayesian bandit algorithms included in Riquelme2018 with default parameters. These included uniform sampling, Bayesian linear regression, Gaussian Processes, stochastic variational inference, and several neural-network based approximations. A full listing of methods and their hyperparameters is included in the appendix.
For each combination of state representation, reward function, and posterior approximator, we ran 100 epochs with a batch size of 512 for deep Bayesian network training to obtain the final results. We repeated each experiment with 5 independent random seeds, thus generating 5 random patient sequences. We did not define a separate validation dataset to measure agent performance, as is commonly done in methods such as cross-validation, because the UMAP representation was learned on the complete dataset; this leads to an overestimation of agent performance.
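The repetition over random seeds can be sketched as follows. This assumes that each patient sequence is a seeded permutation of the available cell lines, which is one plausible reading of the setup. The function name and cell line identifiers are hypothetical.

```python
import random


def patient_sequences(cell_lines, seeds):
    """Return one reproducibly shuffled sequence of cell lines per seed.

    Assumption: a sequence visits every cell line exactly once.
    """
    sequences = {}
    for seed in seeds:
        order = list(cell_lines)
        random.Random(seed).shuffle(order)  # independent generator per seed
        sequences[seed] = order
    return sequences


# Roughly 900 cell lines and 5 seeds, as in the experimental setup.
cell_lines = [f"line_{i}" for i in range(900)]
sequences = patient_sequences(cell_lines, seeds=range(5))
```

Using a separate generator per seed keeps each sequence reproducible regardless of the order in which the experiments are run.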
All experiments were run in Python 3.6 using modified code from the Deep Bayesian Bandits Library, including an additional rule-based "Clinical Guideline" agent that followed current therapeutic protocols and a logging function to export an agent’s actions over all experiment steps.
Overall, our results suggest that contextual bandit algorithms show promise in the precision oncology setting. In our experiments, contextual bandit methods were able to leverage genomic information to consistently achieve substantially lower regret than both uniform random allocation and rule-based clinical guidelines (Figure 2B). This main result is shown in Figure 2C, with an exemplary plot visualising agent activity in Figure 2D. Specifically, in a baseline experiment where each algorithm was only given the information needed to implement clinical guidelines, all agents outperformed uniform random allocation and had comparable performance to the rule-based reference agent (bottom row). This was to be expected, as clinical guidelines have already been tuned to take advantage of this information. However, when genomic information was available, most of the contextual bandit algorithms were able to improve on the rule-based protocol significantly (top and middle rows). Providing both genomic information and guideline input in the state information did not further improve model performance in most cases (top row vs. middle row).
These results were generally robust across reward definitions (columns), although the percentile-based reward showed the weakest results. Of note, three neural-network based algorithms (Bootstrapped, Greedy, and Dropout) consistently scored higher rewards than linear methods or Gaussian Processes.
In summary, we show that genomics-based assignment mechanisms in precision oncology programs can be framed as a contextual bandit problem. When provided with a representation of genomic information, contextual bandit agents can outperform simplified abstractions of current clinical standards based on in vitro drug response data. Among the most successful agents were bootstrapped or dropout-based dense neural networks.
This study has several limitations: (I) in vitro drug response data of cancer models have limited transferability into a clinical context, although they recover a considerable portion of clinically established genetic predictors of drug response Iorio2016; (II) the response scores are on average lower in treatments vs. reference agents; (III) we reduced the dimensionality of the available genomic data without dedicated learning of a shared multi-omics embedding, for example as described in Simidjievski719542; and (IV) Cisplatin is a limited reference treatment for all considered cancer types.
We decided to reduce the dimensionality of the available feature space in order to reduce model complexity and increase training efficiency. We chose UMAP for this purpose as it recovers both the global and the local structure of the dataset. As mentioned before, we believe that this dataset offers not only a benchmark for machine learning based treatment assignment for cancer, but also an action-oriented multi-omic feature representation of this disease.
In the future, we plan to address the limitations above and validate our findings in alternative in vitro and in vivo drug response datasets Gao2015, which were measured by perturbing cancer cell lines, patient-derived organoids or xenografts. Clinical outcome data, although valuable, do not lend themselves directly to benchmarking, as not all available treatments have been observed for every patient and thus no ideal policy beyond the standard of care is known. Nevertheless, we plan to validate our findings by analyzing agent behaviour for action patterns that correspond to current clinical best practices. In addition, we plan to measure the impact of certain genomic information types on model performance, for example by using only the information provided by current genetic testing services.
We would like to stimulate an open discussion about the limitations and potential benefits of bandit-guided treatment assignments in precision oncology programs to minimize collective treatment regret.
5 Code and Data availability
Appendix A Full Listing of Bayesian Bandit Algorithms
Here we list the full suite of Bayesian bandit algorithms that we evaluated with our benchmark.
Uniform Sampling (Takes each action at random with equal probability)
Bayesian linear (Noise prior , . Ridge prior = 0.25)
Neural Linear (Noise prior , . Ridge prior . Based on RMS2 net)
Neural Greedy (Greedy NN approach with fixed learning rate ())
Dropout (Dropout with probability p = 0.8. Based on RMS3 net)
Parameter-Noise (Initial noise = 0.01, and level = 0.01. Based on RMS2 net)
Bootstrapped Networks (Bootstrapped with models, and . Based on RMS3 net)
Stochastic Variational Inference (BayesByBackprop with noise )
Expectation-Propagation (Alpha Divergences BB -divergence with = 0.1, noise = 0.1, prior var .)
RMS2 net (Learning rate decays, and it is reset every training period)
RMS3 net (Learning rate decays, and it is not reset at all. Starts at )