Personalized Cancer Chemotherapy Schedule: a numerical comparison of performance and robustness in model-based and model-free scheduling methodologies

04/02/2019 ∙ by Jesus Tordesillas, et al. ∙ MIT 0

Reinforcement learning algorithms are gaining popularity in fields where optimal scheduling is important, and oncology is not an exception. The complex and uncertain dynamics of cancer limit the performance of traditional model-based scheduling strategies like Optimal Control. Some preliminary efforts have already been made to design chemotherapy schedules using Q-learning considering a discrete action space. Motivated by the recent success of model-free Deep Reinforcement Learning (DRL) in challenging control tasks, we suggest the use of the Deep Q-Network (DQN) and Deep Deterministic Policy Gradient (DDPG) algorithms to design a personalized cancer chemotherapy schedule. We show that both of them succeed in the task and outperform the Optimal Control solution in the presence of uncertainty. Furthermore, we show that DDPG can exterminate cancer more efficiently than DQN due to its continuous action space. Finally, we provide some intuition regarding the amount of samples required for the training.




Cancer is a common name that is given to a group of diseases that involve the repeated and uncontrolled division and spreading of abnormal cells. These abnormal tissues are called tumors [1]. Early diagnosis and effective treatment improve the survival rate of these diseases.

The optimal treatment schedule and drug dose vary according to the stage of the tumor, the weight of the patient, the white blood cell levels (immune cells), concurrent illness and age of the patient. Thus, proper scheduling and personalization of the chemotherapy treatment are vital to reduce the mortality rate. This motivated the use of techniques originating in engineering fields, such as optimal control, to derive optimal drug dosing for cancer chemotherapy [2]. A good review of model-based scheduling strategies is provided in [3].

One of the main challenges of studying cancer as a dynamical system is that it is known to be complex, nonlinear, and its mechanisms of action are uncertain. Consequently, first principle mathematical models may not be able to account for all the variations in the patient dynamics.

Fig. 1: Immune cells (in orange) attached to a tumour cell (brown). This image was captured by a scanning electron microscope [4].

Motivated by the challenging nature of generating accurate models of cancer dynamics and the recent success stories of using RL for control [5, 6], we use model-free Deep Reinforcement Learning (DRL) algorithms to design an optimal and personalized cancer chemotherapy schedule. In particular, we use Deep Q-Network (DQN) [7] and Deep Deterministic Policy Gradient (DDPG) [8]. We use DQN with a discrete action space and DDPG with a continuous action space, and compare the performance of both algorithms.

DRL has been successfully applied in different medical treatment applications. For instance, it was applied in [9] to design optimal treatment regimes from medical data. It was also used in [10] to develop automated radiation adaptation protocols in lung cancer. In [1], Q-learning was used to design a chemotherapy schedule with a discrete action space, and in [11] deep learning was used for metastatic breast cancer detection. Moreover, [12] used DDPG to control the dosing for suppressing cell growth using a logistic growth model with stochastic differential equations.

Ultimately, the goal of this work is to determine through in-silico trials whether the use of DRL could potentially be a feasible methodology for in-vivo cancer treatment scheduling and if so, provide some guidelines for in-vivo trials. In this preliminary work, our objectives are to:

  1. Solve the optimal control problem to establish a baseline for comparison and evaluate whether DQN and DDPG can provide similar performance.

  2. Get some qualitative intuition of how much training data (episodes) DQN and DDPG require to perform similarly to the baseline policy.

  3. Evaluate the robustness of the optimal control, DQN and DDPG policies in the presence of different types of relevant uncertainties: parametric in the model parameters, parametric in the initial conditions (diagnosis) and stochasticity in the tumor dynamics.

  4. Compare the performance of DQN and DDPG.

II Model

In order to compute the optimal control policy and the reward in DRL, a mathematical model that captures the distribution and effects of the chemotherapy drug is required. A realistic model should address tumor growth, the reaction of the human immune system to the tumor growth, and the effects of chemotherapy treatment on immune cells, normal cells and tumor growth [2, 3].

We will simulate patient’s response to the treatment through a pharmacological model of cancer chemotherapy, given by a nonlinear and coupled system of 4 deterministic Ordinary Differential Equations (ODEs)



The four states of the model are the number of immune cells, the number of normal cells, the number of tumor cells and the drug concentration. The action or control is the chemotherapy drug infusion rate, which we will try to optimally determine through DRL. The initial conditions are determined according to the diagnosis. From now on, we refer to this nonlinear model in compact state-space form, with the four states collected in a state vector and the infusion rate as the control input. An electron micrograph of real immune and cancer cells is shown in Figure 1.

The model parameters and their values are provided in Figure 2. Note that both the initial conditions and the model parameters need to be adapted to the patient.
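As a rough illustration of what such a four-state model looks like in code, the sketch below assumes the tumor-immune-drug form commonly used in this literature (cf. [1, 2]). The functional form, the variable names (`I`, `N`, `T`, `C`, `u`), the parameter names and every numerical value are assumptions made for illustration; they are not taken from Fig. 2. A hand-written Runge-Kutta step is used only to keep the example dependency-free.

```python
# Hypothetical parameter values, for illustration only (NOT taken from Fig. 2).
P = dict(s=0.33, rho=0.01, alpha=0.3, c1=1.0, d1=0.2, a1=0.2,   # immune cells
         r2=1.0, b2=1.0, c4=1.0, a3=0.1,                        # normal cells
         r1=1.5, b1=1.0, c2=0.5, c3=1.0, a2=0.3,                # tumor cells
         d2=1.0)                                                # drug decay


def rhs(x, u, p=P):
    """Assumed dynamics. x = (I, N, T, C); u = drug infusion rate."""
    I, N, T, C = x
    dI = (p["s"] + p["rho"] * I * T / (p["alpha"] + T)
          - p["c1"] * I * T - p["d1"] * I - p["a1"] * I * C)
    dN = p["r2"] * N * (1 - p["b2"] * N) - p["c4"] * N * T - p["a3"] * N * C
    dT = (p["r1"] * T * (1 - p["b1"] * T)
          - p["c2"] * I * T - p["c3"] * T * N - p["a2"] * T * C)
    dC = -p["d2"] * C + u
    return (dI, dN, dT, dC)


def rk4_step(x, u, dt):
    """One classical Runge-Kutta step with the control held constant over dt."""
    def shift(state, slope, h):
        return tuple(si + h * ki for si, ki in zip(state, slope))
    k1 = rhs(x, u)
    k2 = rhs(shift(x, k1, dt / 2), u)
    k3 = rhs(shift(x, k2, dt / 2), u)
    k4 = rhs(shift(x, k3, dt), u)
    return tuple(xi + dt / 6 * (a + 2 * b + 2 * c + d)
                 for xi, a, b, c, d in zip(x, k1, k2, k3, k4))


def simulate(x0, u, dt, steps):
    """Integrate the model for `steps` steps under a constant infusion rate u."""
    x = tuple(x0)
    for _ in range(steps):
        x = rk4_step(x, u, dt)
    return x


# Hypothetical initial state (I, N, T, C), chosen only to exercise the code.
x_end = simulate((0.15, 1.0, 0.2, 0.0), u=1.0, dt=0.01, steps=2000)
```

Under these assumptions the drug state is decoupled from the cell populations, so a constant infusion rate drives the concentration towards `u / d2`.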

Fig. 2: Parameters of the model, description and the values used in our optimal control policy computation and training of DRL algorithms unless otherwise indicated. Extracted from [1].

Note: ODEs and SDEs. In order to test the robustness of our optimal control and DRL policies, we simulated the previously presented system of ODEs with parametric uncertainty, but also a stochastic version of it (SDEs). The reader is referred to the next Section for further details.

III Scenarios Considered

A personalized scheduling of chemotherapy treatment is a key factor in patient’s recovery from the disease. When designing the treatment schedule, it is important to optimize the amount of drug used in order to regulate the potentially lethal side effects of chemotherapy, since often, as a side effect of the treatment, the patient’s immune system weakens and becomes prone to life-threatening infections. This diminishes the capability of the immune system to eradicate cancer [1].

Consequently, we implemented the different methodologies in two cases. The first is a preliminary and somewhat unrealistic Case 0, in which the goal is to exterminate cancer regardless of the state of the rest of the cells. We used this case to validate our algorithms, codes, and parameter and hyperparameter values. The second is Case Patient, in which cancer needs to be eradicated while preserving a minimum population of normal and immune cells in order to guarantee the patient's safety. In both cases, the initial condition was the same: .

III-A Case 0

The optimal control problem for Case 0 is formulated in the following nonlinear program:

s. t.

Both the policy and the length of the treatment are determined by the optimization. The hard constraints correspond to the dynamics of cancer and the initial state of the patient, bounds in the drug-rate and the target of cancer eradication, respectively. The cost is the area enclosed by the tumor cell population (note that is nonnegative) and the time axis, between and . For the DRL algorithms, the reward used is , where corresponds to the length of the timestep, and it is a fixed value. Note that is included in the reward just to make the comparison with the cost used in optimal control straightforward.

III-B Case Patient

In order to guarantee patient’s safety during the treatment, additional state constraints are added to the program (2). Particularly, the constraints and are included.

In the case of the DRL algorithms, the constraints are imposed softly by modifying the reward function:


where denotes the Iverson bracket. The last two terms of this expression penalize the cases where the state does not satisfy the constraints and . Note that, for the cases where the hard constraints are satisfied, this reward is the negative of the Riemann sum approximation of the area below the curve of the number of tumor cells (i.e. it is a rectangular approximation to the negative of the integral cost used in the optimal control problem).
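A minimal sketch of a soft-constrained reward of this kind is given below. The lower bounds `normal_min` and `immune_min` and the penalty weights are hypothetical placeholders, as is the exact way the penalties are combined; only the overall structure (negative rectangle area plus Iverson-bracket penalties) follows the description above.

```python
def iverson(condition):
    """Iverson bracket: 1 if the condition holds, 0 otherwise."""
    return 1 if condition else 0


def soft_constrained_reward(tumor, normal, immune, dt,
                            normal_min=0.4, immune_min=0.1,
                            penalty_normal=1.0, penalty_immune=1.0):
    """Per-step reward with softly imposed state constraints.

    All default thresholds and weights are assumed values for illustration.
    """
    # Negative rectangle area under the tumor curve over one time step.
    area_term = -tumor * dt
    # Each violated constraint subtracts a fixed penalty.
    penalty = (penalty_normal * iverson(normal < normal_min)
               + penalty_immune * iverson(immune < immune_min))
    return area_term - penalty


# With both constraints satisfied, only the area term remains.
r_ok = soft_constrained_reward(tumor=0.5, normal=1.0, immune=0.5, dt=0.25)
# Violating the normal-cell bound adds one penalty.
r_bad = soft_constrained_reward(tumor=0.5, normal=0.3, immune=0.5, dt=0.25)
```

Summing `r_ok`-type rewards over an episode gives the negative of the Riemann sum of the tumor population, which is what makes the comparison with the optimal control cost direct.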

Furthermore, in order to test the robustness of the obtained policies, different types of relevant sources of uncertainty will be considered regarding the diagnosis, growth-rate and dynamics of the tumor. The corresponding results are provided in Section VI.

III-B1 Parametric uncertainty: model parameter

The per-unit growth-rate of the tumor represents how aggressive the disease is. Thus, an accurate estimation of its value is important for the computation of the optimal control policy. We systematically varied its value and obtained the range for which the optimal control problem is feasible, as well as the sensitivity of the nominal optimal control policy to perturbations in the value of this parameter.

III-B2 Parametric uncertainty: initial condition

An incorrect estimate of the initial size of the tumor will also have an impact on the performance of the optimal policy. We will systematically evaluate the robustness of the nominal policy to perturbations in this initial size.

III-B3 Stochastic forcing in tumor dynamics

While numerical simulation of ODEs is straightforward, Stochastic Differential Equations (SDE) require more care, since concepts from traditional calculus do not hold for them. We simulated a system of SDEs of the form:


where the first term is the drift and corresponds to the deterministic dynamics given by the ODEs, and the second one is the diffusion term, modeled by a constant in our case and applied only to the equation that gives the evolution of the number of tumor cells . denotes a Wiener process, common in the motion of cells. We simulated the SDEs using the Euler-Maruyama scheme.
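The Euler-Maruyama scheme can be sketched for a scalar SDE with constant diffusion as follows. This is a generic illustration, not the simulation code used for the results: the function name, the toy logistic drift and all numerical values are assumptions, and in the full model the noise would enter only the tumor equation.

```python
import math
import random


def euler_maruyama(drift, sigma, x0, dt, steps, rng=None):
    """Simulate dX = drift(X) dt + sigma dW with the Euler-Maruyama scheme."""
    rng = rng or random.Random()
    x = x0
    path = [x]
    for _ in range(steps):
        # Wiener increment over one step: Gaussian with mean 0 and variance dt.
        dw = rng.gauss(0.0, math.sqrt(dt))
        x = x + drift(x) * dt + sigma * dw
        path.append(x)
    return path


def toy_drift(x):
    """Hypothetical logistic growth, used only to exercise the scheme."""
    return 1.5 * x * (1.0 - x)


noisy_path = euler_maruyama(toy_drift, sigma=0.05, x0=0.2, dt=0.01,
                            steps=500, rng=random.Random(0))
```

Note that the noise increment scales with the square root of the step size rather than with the step size itself, which is the main difference from the deterministic Euler method. Setting `sigma=0` recovers that method exactly.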

IV Methods and Algorithms

IV-A Optimal Control

The general form of a finite-horizon optimal control problem in Bolza form is:

s. t.

It is straightforward to see that problem (2) is a particular case of problem (5) and can thus be solved using standard optimal control techniques. Note that the control solution provided by (5) will, in general, be open-loop: the control policy is a function of time. In feedback or closed-loop control laws, by contrast, the control is a function of the state. The closed-loop case is usually more desirable, since it accounts for discrepancies between the real dynamics and the predictions made by the model and corrects the control input accordingly, making the controller more robust to uncertainty.
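The difference between open-loop and closed-loop control under model mismatch can be illustrated with a deliberately simple scalar example. The discrete-time system, the gain values and the function name below are all hypothetical and unrelated to the cancer model; the sketch only shows the mechanism.

```python
def run_toy(a_true, a_nominal, x0, steps, closed_loop):
    """Toy system x[k+1] = a_true * x[k] + u[k]; the controller aims at x = 0.

    The controller is designed with a_nominal, which may differ from a_true.
    """
    x = x0
    x_predicted = x0                      # state according to the nominal model
    for _ in range(steps):
        u_open = -a_nominal * x_predicted           # precomputed from the model only
        u = -a_nominal * x if closed_loop else u_open
        x = a_true * x + u                          # real system
        x_predicted = a_nominal * x_predicted + u_open   # nominal prediction
    return x


# Perfect model: the open-loop plan reaches the target.
exact = run_toy(a_true=1.2, a_nominal=1.2, x0=1.0, steps=10, closed_loop=False)

# Mismatched model: the open-loop plan cannot react to the residual error,
# whereas the feedback law keeps correcting it.
open_loop = run_toy(a_true=1.3, a_nominal=1.2, x0=1.0, steps=10, closed_loop=False)
feedback = run_toy(a_true=1.3, a_nominal=1.2, x0=1.0, steps=10, closed_loop=True)
```

In this toy case the open-loop error left after the first step grows with the unstable true dynamics, while the feedback error shrinks at every step.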

IV-B Deep Reinforcement Learning

DDPG and DQN algorithms are considered. Both DDPG and DQN share two common features: they are off-policy and model-free algorithms. Off-policy means that the behaviour policy used is not the same as the policy being improved. This allows the use of a memory replay, and the use of any exploration strategy. Model-free means that the algorithm does not try to estimate the transition matrix of the dynamics of the environment . Instead, it estimates the optimal policy or value function directly.

The differences between these two algorithms are highlighted below.

IV-B1 Deep Q Network (DQN)

Deep Q Network was proposed in [7], and the main difference with respect to standard Q-learning is the use of a neural network to approximate the action-value function. The DQN algorithm is shown in Alg. 1 (taken from [7]).

The inclusion of a neural network usually leads to unstable training due to two factors: the correlation between samples and the non-stationary targets. DQN addresses these two challenges using [13]:

  • An experience replay: A dataset of transition tuples is saved in a replay buffer. During training, the agent randomly samples mini-batches from this replay buffer (the minibatch-sampling step of Alg. 1). This stabilizes the training process and gives a better approximation to i.i.d. samples by removing the correlation between them.

  • Fixed Q-targets: The target network weights used in the target calculation are kept fixed during several updates (see the target-network initialization, the target computation and the periodic reset in Alg. 1).
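The two mechanisms in the list above can be sketched in a few lines. This is a framework-free illustration rather than the ChainerRL implementation: the class and function names are hypothetical, and `target_q` stands for any callable that evaluates the frozen target network.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of transitions, sampled uniformly at random."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)   # oldest transitions are dropped first

    def store(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling breaks the temporal correlation between samples.
        return rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


def td_target(reward, next_state, done, target_q, actions, gamma):
    """Target y = r if the episode ended, else r + gamma * max_a' Q_target(s', a').

    target_q is evaluated with the frozen target weights, not the online ones.
    """
    if done:
        return reward
    return reward + gamma * max(target_q(next_state, a) for a in actions)


buffer = ReplayBuffer(capacity=3)
for k in range(5):
    buffer.store(state=k, action=0, reward=-1.0, next_state=k + 1, done=False)
batch = buffer.sample(2, rng=random.Random(0))
```

Because the target is computed with weights that change only occasionally, the regression target stays fixed between resets, which is what removes the non-stationarity mentioned above.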

DQN is mainly used for problems with a discrete action space. However, some applications make use of the Normalized Advantage Function (NAF) to apply DQN in continuous action spaces.

IV-B2 Deep Deterministic Policy Gradient (DDPG)

DDPG was proposed in [8]. The algorithm is shown in Alg. 2 (taken from [8]), and its main characteristics are the following [8], [14], [15]:

  • Policy-gradient: DDPG tries to estimate the gradient of the expected return, and the policy is updated using this estimate.

  • Actor-critic: DDPG has two different structures (see also Fig. 3):

    • Actor: The actor contains the policy function. It takes the state of the environment as input, and produces an action.

    • Critic: From the state of the environment and the reward received by the action taken by the actor, the critic produces the temporal difference error. This error is used to update both the actor and the critic (see the corresponding update steps in Alg. 2).

Both the actor and critic described above are represented using neural networks.

Fig. 3: Actor and Critic used in DDPG. Adapted from [14].
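
Initialize replay memory $D$ to capacity $N$
Initialize action-value function $Q$ with random weights $\theta$
Initialize target action-value function $\hat{Q}$ with weights $\theta^- = \theta$
for episode $= 1, \dots, M$ do
    Initialize sequence $s_1 = \{x_1\}$ and preprocessed sequence $\phi_1 = \phi(s_1)$
    for $t = 1, \dots, T$ do
        With probability $\epsilon$ select a random action $a_t$,
            otherwise select $a_t = \arg\max_a Q(\phi(s_t), a; \theta)$
        Execute action $a_t$ in emulator and observe reward $r_t$ and image $x_{t+1}$
        Set $s_{t+1} = s_t, a_t, x_{t+1}$ and preprocess $\phi_{t+1} = \phi(s_{t+1})$
        Store transition $(\phi_t, a_t, r_t, \phi_{t+1})$ in $D$
        Sample random minibatch of transitions $(\phi_j, a_j, r_j, \phi_{j+1})$ from $D$
        Set $y_j = r_j$ if the episode terminates at step $j+1$,
            otherwise $y_j = r_j + \gamma \max_{a'} \hat{Q}(\phi_{j+1}, a'; \theta^-)$
        Perform a gradient descent step on $(y_j - Q(\phi_j, a_j; \theta))^2$ with respect to the network parameters $\theta$
        Every $C$ steps reset $\hat{Q} = Q$
    end for
end for
Algorithm 1 DQN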
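
Initialize critic network $Q(s, a \mid \theta^Q)$ and actor $\mu(s \mid \theta^\mu)$ with weights $\theta^Q$ and $\theta^\mu$
Initialize target networks $Q'$ and $\mu'$ with weights $\theta^{Q'} \leftarrow \theta^Q$, $\theta^{\mu'} \leftarrow \theta^\mu$
Initialize replay buffer $R$
for episode $= 1, \dots, M$ do
    Initialize a random process $\mathcal{N}$ for action exploration
    Receive initial observation state $s_1$
    for $t = 1, \dots, T$ do
        Select action $a_t = \mu(s_t \mid \theta^\mu) + \mathcal{N}_t$ according to the current policy and exploration noise
        Execute action $a_t$ and observe reward $r_t$ and new state $s_{t+1}$
        Store transition $(s_t, a_t, r_t, s_{t+1})$ in $R$
        Sample a random minibatch of $N$ transitions $(s_i, a_i, r_i, s_{i+1})$ from $R$
        Set $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'})$
        Update critic by minimizing the loss: $L = \frac{1}{N} \sum_i (y_i - Q(s_i, a_i \mid \theta^Q))^2$
        Update the actor policy using the sampled policy gradient:
            $\nabla_{\theta^\mu} J \approx \frac{1}{N} \sum_i \nabla_a Q(s, a \mid \theta^Q)|_{s = s_i, a = \mu(s_i)} \, \nabla_{\theta^\mu} \mu(s \mid \theta^\mu)|_{s_i}$
        Update the target networks:
            $\theta^{Q'} \leftarrow \tau \theta^Q + (1 - \tau) \theta^{Q'}$
            $\theta^{\mu'} \leftarrow \tau \theta^\mu + (1 - \tau) \theta^{\mu'}$
    end for
end for
Algorithm 2 DDPG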
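Two steps of the DDPG algorithm are easy to isolate: the noisy action selection and the slow tracking of the online weights by the target networks, as in [8]. The sketch below shows both on plain Python lists; the function names, the bounds and the value of the mixing factor `tau` are illustrative assumptions.

```python
def soft_update(target_weights, online_weights, tau):
    """Move each target weight a fraction tau towards its online counterpart."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_weights, target_weights)]


def noisy_action(policy_output, noise, low, high):
    """Add exploration noise to the deterministic action and clip to the bounds."""
    return min(high, max(low, policy_output + noise))


# tau = 1 copies the online weights outright (a hard reset, as in DQN),
# while a small tau makes the targets change slowly.
hard = soft_update([0.0], [3.0], tau=1.0)
slow = soft_update([0.0, 2.0], [1.0, 2.0], tau=0.5)
clipped = noisy_action(policy_output=0.9, noise=0.5, low=0.0, high=1.0)
```

A slowly moving target plays the same stabilizing role in DDPG as the periodic reset does in DQN.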

V Implementation

V-A Optimal Control

We solved the optimal control problem using direct collocation methods [16], which transcribe the continuous dynamics and control functions to a finite set of algebraic variables and then solve a high-dimensional nonlinear program (NLP). We used the MATLAB implementation of ICLOCS2 [17], the open-source Imperial College London Optimal Control Software, in conjunction with the open-source NLP solver IPOPT [18]. We modified the open-source code to introduce the dynamics of cancer (the ODE system), the cost functional and the constraints, as well as the desired options and tolerances for the solvers.

We used h-methods for the transcription, particularly the Hermite-Simpson method with automatic mesh refinement and an initial number of 200 nodes. Regarding numerical tolerances, we allowed errors of up to in the state and control bounds, and in the terminal condition of cancer eradication. The optimal control solution for both the Case 0 and Case Patient is provided in Section VI.
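To make the transcription step concrete, the sketch below evaluates the Hermite-Simpson defect on a single mesh interval for a scalar state, in the compressed form described in [16]. It is not ICLOCS2 code: the function name is hypothetical, and taking the midpoint control as the average of the endpoint controls is an assumption. In a direct collocation NLP, one such defect per interval is constrained to zero.

```python
import math


def hermite_simpson_defect(f, x_k, x_k1, u_k, u_k1, h):
    """Defect of the Hermite-Simpson rule on one interval of length h.

    f(x, u) is the right-hand side of the ODE dx/dt = f(x, u).
    """
    f_k = f(x_k, u_k)
    f_k1 = f(x_k1, u_k1)
    # Hermite interpolation of the state at the interval midpoint.
    x_mid = 0.5 * (x_k + x_k1) + h / 8.0 * (f_k - f_k1)
    u_mid = 0.5 * (u_k + u_k1)          # assumed linear control interpolation
    f_mid = f(x_mid, u_mid)
    # Simpson quadrature of the dynamics over the interval.
    return x_k1 - x_k - h / 6.0 * (f_k + 4.0 * f_mid + f_k1)


def decay(x, u):
    """Toy dynamics dx/dt = -x + u, used only to exercise the rule."""
    return -x + u


h = 0.1
# Endpoints on the exact solution x(t) = exp(-t) give a near-zero defect.
good = hermite_simpson_defect(decay, 1.0, math.exp(-h), 0.0, 0.0, h)
# Endpoints that ignore the dynamics leave a large defect.
bad = hermite_simpson_defect(decay, 1.0, 1.0, 0.0, 0.0, h)
```

The NLP solver searches over the node values of the state and control so that all defects vanish while the cost is minimized, which is why a finer mesh yields a larger but more accurate program.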

V-B Reinforcement Learning

To implement the DRL algorithms, we used the open-source library ChainerRL [19]. Moreover, we created an environment using the OpenAI Gym [20] framework. This environment takes an action and the current state, and returns the next observation and the reward obtained. In this environment, we implemented the system of ODEs, and solved it using the numerical integration methods provided by Scipy [21].
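The structure of such an environment can be sketched without the Gym dependency by mirroring its `reset`/`step` interface. Everything below is an illustrative assumption: the class name, the two-state toy dynamics, the forward-Euler sub-stepping (used here in place of the SciPy integrators) and all numerical values.

```python
class DosingEnvSketch:
    """Gym-style environment (reset/step) written without the gym package."""

    def __init__(self, rhs, x0, dt, horizon, tumor_index=0, substeps=20):
        self.rhs = rhs                  # rhs(state, action) -> tuple of derivatives
        self.x0 = tuple(x0)
        self.dt = dt                    # agent time step; action held constant over it
        self.horizon = horizon          # number of agent steps per episode
        self.tumor_index = tumor_index
        self.substeps = substeps        # forward-Euler sub-steps per agent step
        self.x = self.x0
        self.k = 0

    def reset(self):
        self.x = self.x0
        self.k = 0
        return self.x

    def step(self, action):
        h = self.dt / self.substeps
        x = self.x
        for _ in range(self.substeps):
            dx = self.rhs(x, action)
            x = tuple(xi + h * dxi for xi, dxi in zip(x, dx))
        self.x = x
        self.k += 1
        # Negative rectangle area under the tumor curve for this step.
        reward = -x[self.tumor_index] * self.dt
        done = self.k >= self.horizon
        return x, reward, done, {}


def toy_rhs(x, u):
    """Hypothetical two-state model: logistic tumor growth with drug kill."""
    tumor, drug = x
    return (1.5 * tumor * (1.0 - tumor) - 2.0 * tumor * drug, -drug + u)


def rollout(env, policy):
    """Run one episode and return the final state and the summed reward."""
    obs, total, done = env.reset(), 0.0, False
    while not done:
        obs, reward, done, _ = env.step(policy(obs))
        total += reward
    return obs, total


env = DosingEnvSketch(toy_rhs, x0=(0.2, 0.0), dt=0.25, horizon=8)
final_no_drug, return_no_drug = rollout(env, lambda obs: 0.0)
final_dosed, return_dosed = rollout(env, lambda obs: 1.0)
```

Keeping the integrator inside `step` means the agent only ever sees states at the decision times, so the choice of agent time step directly sets how finely the policy can modulate the dose.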

The values of the most relevant parameters of the DQN and DDPG implementations are shown in Table I. Note that we used to match the reward used in Optimal Control as well as possible.

To improve the convergence and the training times, all the states, actions and the reward were normalized to values in , and CuPy was used.

Note that both in DDPG and DQN, it is required to select the size of the time steps taken by the agent during an episode. The finer the time step, the better the results will be. After running tests with several time steps, we used a time step size of days, which achieved a reasonable training time.

Parameter                 DQN     DDPG
Replay Start Size
Layers                    2       3
Hidden Units per layer    100     300
Discount Factor           0.99    0.995
TABLE I: Parameters used in DQN and DDPG

VI Results

VI-A Case 0

The results for the preliminary Case 0 (in which there are no constraints on the normal and immune cell populations) are shown in Fig. 4. Note that the optimal policy found by all the algorithms (Optimal Control, DDPG and DQN) is the same: apply the maximum drug infusion rate until the cancer is exterminated, which is the expected solution.

Fig. 4: Results for the preliminary Case 0. (a) Trajectories of the states when the optimal drug-rate is applied. Optimal policies provided by (b) O.C. (c) DQN (d) DDPG. Note that (b)-(d) match. The shaded regions represent the feasible control region.
Fig. 5: Optimal policies provided by the different methods when sweeping the values of .

VI-B Case Patient

In this case, we add the constraints and to the program. The optimal policies found by the three algorithms for the nominal growth tumor rate () are shown in Fig. 6. All the policies are able to exterminate the cancer in days. Note also that the shapes of the curves of are similar: Maximum drug infusion at the beginning, followed by a value around . In this case, DDPG and DQN obtain a solution slightly worse than O.C.

Fig. 6: Results for the Case Patient. (a) O.C. (b) DQN (c) DDPG. Row 1 provides the policies and row 2 the state evolution when the corresponding policy is applied. The shaded regions represent the feasible control region.

The convergence rates for DDPG and DQN are shown in Fig. 7. The training was stopped when the average of the Q-value reached a stationary behaviour. After several trials with different normalizations of the reward, we achieved training times minutes.

Note also that the plot of the average Q has the shape that usually appears after a successful training (see [22] for example): after the replay buffer has been filled, the average Q increases at the beginning of the training, reaches a maximum (where the agent is overestimating the Q value), and then decreases to a stationary value.

Fig. 7: Convergence rates in both DQN10 and DDPG. For DDPG, both the Critic and Actor losses are shown. The values of these plots are normalized.

VI-B1 Parametric uncertainty: model parameter

To study how sensitive the optimal policy is with respect to changes in the tumor growth rate, we perturbed its value around the nominal one and obtained the range for which an optimal policy exists (i.e. the range in which the problem is feasible). The policies found are plotted in Fig. 5. The problem is feasible for values , becoming infeasible for larger values. Again, DDPG and DQN obtain a policy similar to that of O.C., although some of them present an oscillatory behaviour. Both for DQN and DDPG, these oscillations may be due to the length of the time step used ( days), and could be reduced by decreasing the time step size. Moreover, for the case of DQN we also have to consider the coarse discretization (10 points) of the action space.
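The effect of a coarse action discretization can be seen with a small sketch of how a discrete DQN action index maps to an infusion rate. The uniform grid, the function names and the maximum rate `u_max` are assumptions for illustration; the actual grid used for DQN may differ.

```python
def action_grid(u_max, n_nodes):
    """Uniform grid of n_nodes infusion rates between 0 and u_max."""
    return [u_max * k / (n_nodes - 1) for k in range(n_nodes)]


def nearest_action(u, grid):
    """Index of the grid value closest to a desired continuous rate u."""
    return min(range(len(grid)), key=lambda k: abs(grid[k] - u))


grid10 = action_grid(u_max=1.0, n_nodes=10)      # hypothetical u_max
desired = 0.30                                   # a rate a continuous policy might pick
index = nearest_action(desired, grid10)
quantization_error = abs(grid10[index] - desired)
```

With a uniform grid the worst-case quantization error is half the grid spacing, so a policy that would like to hold an intermediate rate may instead alternate between two neighbouring nodes.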

Once we obtained the values of the growth rate for which the problem is feasible, we tested the optimal policy obtained for in different scenarios where . The results are shown in Fig. 8. In all the algorithms, the policy found eradicates the cancer for , and does not cure it for . However, for the case , the DRL algorithms succeed in exterminating the cancer, but O.C. does not. We think that this effect might be due to the exploration performed in DRL: exploration makes the policies found by DRL algorithms contain more information than the optimal control solution, which only focuses on finding a minimum for the given model. Hence, we conclude that the minimum found by DRL policies is more robust to uncertainty in the tumor growth rate than the one provided by O.C.

Fig. 8: Test of the performance of the nominal policies (computed/trained with ) to perturbations in the value of . (a) O.C. (b) DQN10 (c) DDPG. Again, DRL methods show improved robustness to perturbations than O.C.

VI-B2 Parametric uncertainty: initial condition

For this case, we train the agent with the initial condition , and we test it with other values of . The results are shown in Fig. 9. Note that DRL is able to exterminate the cancer for all the cases, while O.C. fails to do it for . Again, this argues for a much more robust solution found by DRL compared to O.C.

Fig. 9: Evolution of the tumor cell population when the policy obtained for the nominal initial condition (computed/trained with ) is applied for perturbed initial size of the tumor. (a) O.C. (b) DQN10 (c) DDPG. It is remarkable that while optimal control does not manage to exterminate all the cases, DRL methods do, showing increased robustness. The first 5 days have been removed from the graph for the sake of visualization.

VI-B3 Stochastic forcing in tumor dynamics

In this last scenario, we test the agent with a stochastic forcing term in the equation that governs the dynamics of the tumor cells. The plots of the mean and standard deviations of the solutions found are shown in Fig. 10. DQN and DDPG drive the tumor closer to extermination than O.C. at the end of the treatment.

Fig. 10: Evolution of for the noisy case (left). Note the improved performance of DDPG and DQN at .
Fig. 11: Mean cost (-reward) of the policies found as a function of the training episodes for each algorithm compared to that of the optimal controller. Legend indicates the algorithms with the respective training episodes.

VI-C Sampling

One interesting question is how learning in DQN and DDPG depends on the number of episodes. This question is especially relevant in data-poor scenarios, where one may wonder if the size of the dataset is enough to obtain a decent policy. With this aim, we evaluated the agent found after each training episode, and compared the obtained cost with the optimal cost of O.C. In the case of DQN, we additionally tried 7 different discretizations of the action space using nodes. The resulting plot is shown in Fig. 11. It is observed that the DRL costs asymptote to the optimal one and that, after episodes, DDPG obtains a cost that is close to that of optimal control. Note also that, in general, DDPG tends to obtain better agents than DQN.

VII Conclusions

This work presented a comparison between classical O.C. and model-free DRL approaches, both in discrete and continuous action space. We showed that, with an accurate model of the dynamics, O.C. provides the best solution, but closely followed by DRL. Moreover, we showed that DRL outperforms O.C. in the cases when there is uncertainty on parameters of the model, on the initial condition, or when the dynamics is stochastic.

In Case 0, all the algorithms found the same optimal policy. In Case Patient, the policies found by DRL perform similarly to O.C., but they exhibit increased robustness to uncertainties. Regarding the relative performance of DQN and DDPG, we found that DDPG performs better, as expected due to its continuous action space. Furthermore, it seems to learn faster.

The sampling analysis of the algorithms showed that approximately 1500 calls to the model are needed for DDPG to obtain a performance close to optimal.


Thanks to Prof. David Sontag, Alejandro Rodriguez-Ramos and Dong-Ki Kim for valuable discussions and ideas. The authors would also like to give thanks to Fundacion Bancaria ”la Caixa” for financial support.


  • [1] Regina Padmanabhan, Nader Meskin, and Wassim M Haddad. Reinforcement learning-based control of drug dosing for cancer chemotherapy treatment. Mathematical biosciences, 293:11–20, 2017.
  • [2] L.G. de Pillis and A. Radunskaya. A mathematical tumor model with immune resistance and drug therapy: an optimal control approach. Comput. Math. Methods, 2001.
  • [3] H. Sbeity and R. Younes. Review of optimization methods for cancer chemotherapy treatment planning. J. Comput. Sci. Syst. Bio., 2015.
  • [4] Memorial Sloan Kettering Cancer Center. Accessed: 8-Dec-2018.
  • [5] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis. A general reinforcement learning algorithm that masters chess, shogi and go through self-play. Science, 2018.
  • [6] B. Recht. A tour of reinforcement learning: The view from continuous control. 2018.
  • [7] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 2015.
  • [8] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. ICLR, 2016.
  • [9] Ying Liu, Brent Logan, Ning Liu, Zhiyuan Xu, Jian Tang, and Yangzhi Wang. Deep reinforcement learning for dynamic treatment regimes on medical registry data. In 2017 IEEE International Conference on Healthcare Informatics (ICHI), pages 380–385. IEEE, 2017.
  • [10] Huan-Hsin Tseng, Yi Luo, Sunan Cui, Jen-Tzung Chien, Randall K Ten Haken, and Issam El Naqa. Deep reinforcement learning for automated radiation adaptation in lung cancer. Medical physics, 44(12):6690–6705, 2017.
  • [11] Google A.I.: Using deep-learning for breast tumor detection. Accessed: 11-Dec-2018.
  • [12] Dalit Engelhardt. Dynamic control of stochastic evolution: A deep reinforcement learning approach to adaptively targeting emergent drug resistance. arXiv preprint arXiv:1903.11373, 2019.
  • [13] Emma Brunskill. Cnns and deep q-learning. Accessed: 10-Dec-2018.
  • [14] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • [15] Deep deterministic policy gradients. Accessed: 10-Dec-2018.
  • [16] J.T. Betts. Practical methods for optimal control and estimation using nonlinear programming. Siam, 2010.
  • [17] P. Falugi, E. Kerrigan, and E. van Wyk. Imperial college london optimal control software: User guide (iclocs). 2010.
  • [18] A. Wächter and L.T. Biegler. On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Math. Program., Ser. A, 2005.
  • [19] Chainerrl, a deep reinforcement learning library. Accessed: 10-Dec-2018.
  • [20] Openai gym. Accessed: 10-Dec-2018.
  • [21] Scipy. Accessed: 10-Dec-2018.
  • [22] Ardi Tampuu, Tambet Matiisen, Dorian Kodelja, Ilya Kuzovkin, Kristjan Korjus, Juhan Aru, Jaan Aru, and Raul Vicente. Multiagent cooperation and competition with deep reinforcement learning. PloS one, 12(4):e0172395, 2017.