I Introduction and Objectives
Cancer is the common name given to a group of diseases that involve the repeated, uncontrolled division and spread of abnormal cells. The resulting abnormal tissues are called tumors. Early diagnosis and effective treatment improve the survival rate of these diseases.
The optimal treatment schedule and drug dose vary according to the stage of the tumor, the weight of the patient, the white blood cell (immune cell) levels, concurrent illness and the age of the patient. Thus, properly scheduling and personalizing the chemotherapy treatment is vital to reduce the mortality rate. This has motivated the use of techniques originating in engineering fields, such as optimal control, to derive optimal drug dosing for cancer chemotherapy. A good review of model-based scheduling strategies is provided in [3].
One of the main challenges of studying cancer as a dynamical system is that it is complex and nonlinear, and its mechanisms of action are uncertain. Consequently, first-principles mathematical models may not be able to account for all the variations in patient dynamics.
Motivated by the challenging nature of generating accurate models of cancer dynamics and the recent success stories of using RL for control [5, 6], we use model-free Deep Reinforcement Learning (DRL) algorithms to design an optimal and personalized cancer chemotherapy schedule. In particular, we use Deep Q-Network (DQN) [7] and Deep Deterministic Policy Gradient (DDPG) [8]. We use DQN with a discrete action space and DDPG with a continuous action space, and compare the performance of both algorithms.
DRL has been successfully applied in different medical treatment applications. For instance, it was applied in [9] to design optimal treatment regimes from medical data. It was also used in [10] to develop automated radiation adaptation protocols in lung cancer. In [1], Q-learning was used to design a chemotherapy schedule with a discrete action space, and in [11] deep learning was used for metastatic breast cancer detection. Moreover, [12] used DDPG to control the dosing for suppressing cell growth using a logistic growth model with stochastic differential equations.
Ultimately, the goal of this work is to determine through in-silico trials whether the use of DRL could potentially be a feasible methodology for in-vivo cancer treatment scheduling and if so, provide some guidelines for in-vivo trials. In this preliminary work, our objectives are to:
- Solve the optimal control problem to establish a baseline for comparison and evaluate whether DQN and DDPG can provide similar performance.
- Get some qualitative intuition of how much training data (episodes) DQN and DDPG require to perform similarly to the baseline policy.
- Evaluate the robustness of the optimal control, DQN and DDPG policies in the presence of different types of relevant uncertainties: parametric in the model parameters, parametric in the initial conditions (diagnosis) and stochasticity in the tumor dynamics.
- Compare the performance of DQN and DDPG.
In order to compute the optimal control policy and the reward in DRL, a mathematical model that captures the distribution and effects of the chemotherapy drug is required. A realistic model should address tumor growth, the reaction of the human immune system to the tumor growth, and the effects of chemotherapy treatment on immune cells, normal cells and tumor growth.
We simulate the patient’s response to the treatment through a pharmacological model of cancer chemotherapy, given by a nonlinear and coupled system of four deterministic Ordinary Differential Equations (ODEs).
The four states of the model are the number of immune cells, the number of normal cells, the number of tumor cells and the drug concentration. The action, or control, is the chemotherapy drug infusion rate, which we try to determine optimally through DRL. The initial conditions are determined according to the diagnosis. From now on, we refer to this nonlinear model in compact state-space form, with the four states collected in a state vector and the infusion rate as the input. An electron micrograph of real immune and cancer cells is shown in Figure 1.
The model parameters and their values are provided in Figure 2. Note that both the initial conditions and the model parameters need to be adapted to the patient.
Note: ODEs and SDEs. In order to test the robustness of our optimal control and DRL policies, we simulated the previously presented system of ODEs with parametric uncertainty, but also a stochastic version of it (SDEs). The reader is referred to the next Section for further details.
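The deterministic model described above can be sketched in a few lines of Python. This is a minimal sketch, assuming the four-state tumor, immune, normal-cell and drug model of de Pillis and Radunskaya listed in the references. The state ordering, the function names `dynamics`, `rk4_step` and `simulate`, and every value in `PARAMS` are illustrative assumptions, not the values of Figure 2. A hand-written Runge-Kutta step stands in for a library integrator so that the block has no dependencies.

```python
import math

# Illustrative placeholder parameters (assumed; NOT the values of Figure 2).
PARAMS = dict(a1=0.2, a2=0.3, a3=0.1, b1=1.0, b2=1.0, c1=1.0, c2=0.5,
              c3=1.0, c4=1.0, d1=0.2, d2=1.0, r1=1.5, r2=1.0,
              s=0.33, alpha=0.3, rho=0.01)


def dynamics(x, u, p=PARAMS):
    """Right-hand side for state x = (I, N, T, C) and drug infusion rate u."""
    I, N, T, C = x
    kill = 1.0 - math.exp(-C)  # saturating drug effect on every population
    dI = (p["s"] + p["rho"] * I * T / (p["alpha"] + T)
          - p["c1"] * I * T - p["d1"] * I - p["a1"] * kill * I)
    dN = p["r2"] * N * (1.0 - p["b2"] * N) - p["c4"] * T * N - p["a3"] * kill * N
    dT = (p["r1"] * T * (1.0 - p["b1"] * T)
          - p["c2"] * I * T - p["c3"] * T * N - p["a2"] * kill * T)
    dC = u - p["d2"] * C  # infusion minus first-order drug decay
    return (dI, dN, dT, dC)


def rk4_step(x, u, dt, p=PARAMS):
    """One classical fourth-order Runge-Kutta step with u held constant."""
    def shifted(base, slope, h):
        return tuple(b + h * s for b, s in zip(base, slope))

    k1 = dynamics(x, u, p)
    k2 = dynamics(shifted(x, k1, dt / 2), u, p)
    k3 = dynamics(shifted(x, k2, dt / 2), u, p)
    k4 = dynamics(shifted(x, k3, dt), u, p)
    return tuple(xi + dt / 6 * (a + 2 * b + 2 * c + d)
                 for xi, a, b, c, d in zip(x, k1, k2, k3, k4))


def simulate(x0, policy, dt, n_steps, p=PARAMS):
    """Roll the model forward; policy(t, x) returns the infusion rate."""
    xs = [tuple(x0)]
    for k in range(n_steps):
        u = policy(k * dt, xs[-1])
        xs.append(rk4_step(xs[-1], u, dt, p))
    return xs
```

For example, `simulate((0.15, 1.0, 0.25, 0.0), lambda t, x: 1.0, 0.01, 100)` applies a constant unit infusion rate; the initial state used here is a hypothetical one chosen only for illustration.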
III Scenarios Considered
Personalized scheduling of chemotherapy treatment is a key factor in the patient’s recovery from the disease. When designing the treatment schedule, it is important to optimize the amount of drug used in order to limit the potentially lethal side effects of chemotherapy: the treatment often weakens the patient’s immune system, which becomes prone to life-threatening infections. This also diminishes the capability of the immune system to eradicate cancer.
Consequently, we implemented the different methodologies in two cases. The first is a preliminary and somewhat unrealistic Case 0, in which the goal is to exterminate the cancer regardless of the state of the rest of the cells. We used this case to validate our algorithms, codes, and parameter and hyperparameter values. We then simulated a Case Patient, in which the cancer needs to be eradicated while preserving a minimum population of normal and immune cells in order to guarantee the patient’s safety. In both cases, the initial condition was the same.
III-A Case 0
The optimal control problem for Case 0 is formulated as a nonlinear program, referred to below as program (2).
Both the policy and the length of the treatment are determined by the optimization. The hard constraints correspond to the dynamics of cancer and the initial state of the patient, the bounds on the drug infusion rate and the target of cancer eradication, respectively. The cost is the area enclosed between the tumor cell population (which is nonnegative) and the time axis over the duration of the treatment. For the DRL algorithms, the reward at each step is the negative of the number of tumor cells multiplied by the length of the timestep, which is a fixed value. Note that the timestep length is included in the reward only to make the comparison with the cost used in optimal control straightforward.
III-B Case Patient
In order to guarantee the patient’s safety during the treatment, additional state constraints are added to program (2). In particular, lower bounds on the number of normal cells and on the number of immune cells are included.
In the case of the DRL algorithms, the constraints are imposed softly by modifying the reward function with penalty terms written using the Iverson bracket, which equals 1 when its argument is true and 0 otherwise.
The last two terms of the reward penalize the cases where the state violates the lower bounds on the normal and immune cell populations. Note that, when these constraints are satisfied, the accumulated reward is the negative of the Riemann sum approximation of the area below the curve of the number of tumor cells (i.e., it is a rectangular approximation to the negative of the integral cost used in the optimal control problem).
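The soft-constraint reward described above can be sketched as follows. This is a minimal sketch under stated assumptions: the state ordering (immune, normal, tumor, drug), the function name `step_reward`, the thresholds `n_min` and `i_min`, and the weight `penalty` are all hypothetical, since the text does not fix their values here. Leaving both thresholds unset gives the Case 0 reward.

```python
def step_reward(state, dt, n_min=None, i_min=None, penalty=1.0):
    """One-step reward: minus the tumor-area increment, minus soft penalties.

    state   -- (I, N, T, C): immune, normal, tumor cells, drug concentration
    dt      -- fixed length of the agent's time step
    n_min   -- assumed lower bound on normal cells (None disables the term)
    i_min   -- assumed lower bound on immune cells (None disables the term)
    penalty -- assumed weight of each violated constraint
    """
    I, N, T, C = state
    reward = -T * dt  # rectangular (Riemann) piece of the area under T(t)
    if n_min is not None:
        reward -= penalty * float(N < n_min)  # Iverson bracket [N < n_min]
    if i_min is not None:
        reward -= penalty * float(I < i_min)  # Iverson bracket [I < i_min]
    return reward
```

Summing `step_reward` over an episode in which no bound is violated gives the negative of the rectangular approximation to the integral cost, which is what makes the DRL return comparable with the optimal control cost.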
Furthermore, in order to test the robustness of the obtained policies, different types of relevant sources of uncertainty will be considered regarding the diagnosis, growth-rate and dynamics of the tumor. The corresponding results are provided in Section VI.
III-B1 Parametric uncertainty: model parameter
The per-unit growth rate of the tumor represents how aggressive the disease is. Thus, an accurate estimation of its value is important for the computation of the optimal control policy. We systematically varied its value and obtained the range for which the optimal control problem is feasible, as well as the sensitivity of the nominal optimal control policy to perturbations in the value of this parameter.
III-B2 Parametric uncertainty: initial condition
A wrong estimation of the initial size of the tumor will also have an impact on the performance of the optimal policy. We systematically evaluate the robustness of the nominal policy to perturbations in this initial tumor size.
III-B3 Stochastic forcing in tumor dynamics
While numerical simulation of ODEs is straightforward, Stochastic Differential Equations (SDEs) require more care, since the rules of ordinary calculus do not carry over to them directly. We simulated a system of SDEs in the usual drift-plus-diffusion form.
The first term is the drift and corresponds to the deterministic dynamics given by the ODEs. The second one is the diffusion term, which in our case is modeled by a constant and applied only to the equation that gives the evolution of the number of tumor cells. The noise is driven by a Wiener process, a common model for the motion of cells. We simulated the SDEs using the Euler-Maruyama scheme.
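The Euler-Maruyama scheme mentioned above can be sketched as follows. This is a minimal sketch, assuming additive noise with a constant diffusion coefficient `sigma` acting on a single state component; the function names, the argument `noisy_index` (index 2 is assumed to hold the tumor cells) and the use of Python's `random` module are illustrative choices, not the authors' implementation.

```python
import math
import random


def euler_maruyama_step(x, drift, sigma, dt, rng, noisy_index=2):
    """Advance x by one step of dX = drift(X) dt + sigma dW.

    The Wiener increment dW ~ Normal(0, dt) is added only to the component
    at noisy_index; every other component follows an explicit Euler step.
    """
    dw = rng.gauss(0.0, math.sqrt(dt))  # standard deviation is sqrt(dt)
    fx = drift(x)
    x_next = [xi + fi * dt for xi, fi in zip(x, fx)]
    x_next[noisy_index] += sigma * dw
    return x_next


def simulate_path(x0, drift, sigma, dt, n_steps, seed=None):
    """Simulate one sample path; a fixed seed makes the path reproducible."""
    rng = random.Random(seed)
    path = [list(x0)]
    for _ in range(n_steps):
        path.append(euler_maruyama_step(path[-1], drift, sigma, dt, rng))
    return path
```

With `sigma = 0` the scheme reduces to the explicit Euler method for the deterministic ODEs. Means and standard deviations over many paths can be estimated by calling `simulate_path` with different seeds.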
IV Methods and Algorithms
IV-A Optimal Control
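A finite-horizon optimal control problem in Bolza form, referred to below as problem (5), minimizes the sum of a terminal cost and an integral running cost, subject to the system dynamics, boundary conditions and path constraints.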
It is straightforward to see that problem (2) is a particular case of problem (5) and can thus be solved using standard optimal control techniques. Note that the control solution provided by (5) will, in general, be open-loop, i.e., the control policy will be a function of time. In feedback, or closed-loop, control laws the control is instead a function of the state. The closed-loop case is usually more desirable, since it accounts for discrepancies between the real dynamics and the predictions made by the model and corrects the control input accordingly, making the controller more robust to uncertainty.
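For reference, a Bolza problem and the way Case 0 fits into it can be written as below. The notation is an assumption made for illustration (state $x$, control $u$, tumor cell count $T$, final time $t_f$, maximum infusion rate $u_{\max}$) and need not match the symbols of the original equations; in a numerical solution the eradication condition is enforced up to a tolerance.

```latex
% General Bolza problem (assumed notation)
\begin{aligned}
\min_{u(\cdot),\, t_f} \quad & \Phi\big(x(t_f), t_f\big)
      + \int_{t_0}^{t_f} L\big(x(t), u(t), t\big)\, dt \\
\text{s.t.} \quad & \dot{x}(t) = f\big(x(t), u(t), t\big), \qquad x(t_0) = x_0, \\
& g\big(x(t), u(t), t\big) \le 0, \qquad \psi\big(x(t_f), t_f\big) = 0 .
\end{aligned}

% Case 0 as a special case: no terminal cost, running cost equal to the
% tumor cell count, box bounds on the infusion rate, eradication at t_f
\Phi \equiv 0, \qquad L = T(t), \qquad 0 \le u(t) \le u_{\max}, \qquad
\psi = T(t_f) .
```

The Case Patient program adds lower bounds on the normal and immune cell populations as further components of the path constraint $g$.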
IV-B Deep Reinforcement Learning
The DDPG and DQN algorithms are considered. They share two features: they are off-policy and model-free. Off-policy means that the behaviour policy used to collect data is not the same as the policy being improved. This allows the use of a replay memory and of any exploration strategy. Model-free means that the algorithm does not try to estimate the transition model of the environment. Instead, it estimates the optimal policy or value function directly.
The differences between these two algorithms are highlighted below.
IV-B1 Deep Q Network (DQN)
Deep Q Network was proposed in [7]. Its main difference with respect to standard Q-learning is the use of a neural network to approximate the action-value function. The DQN algorithm is shown in Alg. 1.
The inclusion of a neural network usually leads to unstable training due to two factors: the correlation between samples and the non-stationarity of the targets. DQN addresses these two challenges using:
- An experience replay: a dataset of transition tuples is saved in a replay buffer. During training, the agent randomly samples mini-batches from this replay buffer (line 1 of the algorithm). This stabilizes the training process and gives a better approximation to i.i.d. samples by removing the correlation between them.
- A target network: the targets are computed with a separate copy of the network whose weights are updated only periodically, which mitigates the non-stationarity of the targets.
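The experience replay mechanism can be sketched as a small fixed-capacity buffer. This is a minimal sketch for illustration; the class name `ReplayBuffer`, the tuple layout `(state, action, reward, next_state, done)` and the uniform sampling are assumptions, and a library implementation would normally be used in practice.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of transitions with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        # A bounded deque silently discards the oldest transition when full.
        self._buffer = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        """Store one transition tuple."""
        self._buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Draw a mini-batch without replacement.

        Sampling at random breaks the temporal correlation between
        consecutive transitions of the same episode.
        """
        return self._rng.sample(list(self._buffer), batch_size)

    def __len__(self):
        return len(self._buffer)
```

Training typically starts only once the buffer holds a minimum number of transitions, which is the role of a replay start size parameter.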
DQN is mainly used for problems with a discrete action space. However, some applications make use of the Normalized Advantage Function (NAF) to apply DQN in continuous action spaces.
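Because DQN works with a finite set of actions, a continuous drug infusion rate has to be discretized before it can be used. The sketch below shows one way to do this, together with the greedy action choice and the one-step Q-learning target. The uniform grid, the names `action_grid`, `greedy_action` and `td_target`, and the bound `u_max` are assumptions made for illustration.

```python
def action_grid(u_max, n_actions):
    """Uniform grid of n_actions infusion rates between 0 and u_max."""
    return [u_max * i / (n_actions - 1) for i in range(n_actions)]


def greedy_action(q_values):
    """Index of the action with the largest estimated Q-value."""
    return max(range(len(q_values)), key=lambda i: q_values[i])


def td_target(reward, next_q_values, gamma, done):
    """One-step Q-learning target: r + gamma * max_a Q(s', a).

    next_q_values are the Q-values of the next state; at a terminal
    transition the bootstrap term is dropped.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)
```

For instance, `action_grid(1.0, 5)` returns `[0.0, 0.25, 0.5, 0.75, 1.0]`, and the network output at index `greedy_action(q)` is mapped back to an infusion rate through this list. A coarser grid makes learning easier but limits how closely the policy can follow a continuous optimum.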
IV-B2 Deep Deterministic Policy Gradient (DDPG)
- Policy-gradient: DDPG estimates the gradient of the expected return, and the policy is updated using this estimate.
- Actor-critic: DDPG has two different structures (see also Fig. 3):
  - Actor: The actor contains the policy function. It takes the state of the environment as input, and produces an action.
  - Critic: The critic estimates the action-value function. It takes the state and the action as input, and produces an estimate of the Q-value.
Both the actor and the critic described above are represented using neural networks.
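The interplay between actor and critic can be illustrated with a one-parameter toy problem. This is a deliberately simplified sketch, not DDPG itself: the actor is a linear map, and the critic is a fixed analytic function instead of a learned neural network, so that only the deterministic policy gradient update (the critic's gradient with respect to the action, chained with the actor's gradient with respect to its parameter) is shown. All names and the quadratic critic are assumptions made for illustration.

```python
def actor(theta, s):
    """Deterministic policy mu_theta(s): state in, action out."""
    return theta * s


def critic(s, a):
    """Toy action-value function; the best action for state s is a = 2 s."""
    return -(a - 2.0 * s) ** 2


def dq_da(s, a):
    """Gradient of the toy critic with respect to the action."""
    return -2.0 * (a - 2.0 * s)


def policy_gradient_step(theta, states, lr):
    """One ascent step on the average of Q(s, mu_theta(s)) over a batch."""
    grad = 0.0
    for s in states:
        a = actor(theta, s)
        # Chain rule: dQ/dtheta = dQ/da * dmu/dtheta, with dmu/dtheta = s.
        grad += dq_da(s, a) * s
    return theta + lr * grad / len(states)
```

Repeating `policy_gradient_step` from `theta = 0.0` drives the parameter towards 2, the value at which the actor outputs the critic's preferred action for every state. In DDPG the same update is applied to the actor network while the critic is trained concurrently from replayed transitions.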
V-A Optimal Control
We solved the optimal control problem using direct collocation methods [16], which transcribe the continuous dynamics and control functions into a finite set of algebraic variables and then solve a high-dimensional nonlinear program (NLP). We used the MATLAB implementation of ICLOCS2, the open-source Imperial College London Optimal Control Software [17], in conjunction with the open-source NLP solver IPOPT [18]. We modified the open-source code to introduce the dynamics of cancer (the ODE system), the cost functional and the constraints, as well as the desired options and tolerances for the solvers.
We used h-methods for the transcription, in particular the Hermite-Simpson method with automatic mesh refinement and an initial number of 200 nodes. Regarding numerical tolerances, we set explicit error tolerances on the state and control bounds and on the terminal condition of cancer eradication. The optimal control solutions for both Case 0 and Case Patient are provided in Section VI.
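To make the transcription step concrete, the sketch below computes the Hermite-Simpson defect on a single mesh interval for a scalar state. This is a minimal sketch in Python for illustration only and is unrelated to the ICLOCS2 code base; the function name, the scalar state and the choice of averaging the control at the midpoint are assumptions. In a direct collocation method, the NLP solver drives this defect to zero on every interval.

```python
def hermite_simpson_defect(f, t0, x0, u0, t1, x1, u1):
    """Defect of the compressed Hermite-Simpson rule on [t0, t1].

    f(t, x, u) is the right-hand side of the ODE. The state at the interval
    midpoint is obtained from the cubic Hermite interpolant, and the
    dynamics are then integrated with Simpson's rule.
    """
    h = t1 - t0
    f0 = f(t0, x0, u0)
    f1 = f(t1, x1, u1)
    t_mid = t0 + h / 2
    x_mid = 0.5 * (x0 + x1) + h / 8 * (f0 - f1)  # Hermite interpolation
    u_mid = 0.5 * (u0 + u1)                      # assumed midpoint control
    f_mid = f(t_mid, x_mid, u_mid)
    # Zero when (x0, x1) is consistent with Simpson quadrature of f.
    return x1 - x0 - h / 6 * (f0 + 4 * f_mid + f1)
```

The rule is exact for cubic trajectories: with `f = lambda t, x, u: 3 * t**2` the exact solution is the cube of `t`, and the defect between the points `(0, 0)` and `(2, 8)` vanishes up to rounding.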
V-B Reinforcement Learning
To implement the DRL algorithms, we used the open-source library ChainerRL [19]. Moreover, we created an environment using the OpenAI Gym [20] framework. This environment takes an action and the current state, and returns the next observation and the reward obtained. In this environment, we implemented the system of ODEs, and solved it using the numerical integration methods provided by SciPy [21].
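The structure of such an environment can be sketched as follows. This is a minimal sketch that mimics the `reset`/`step` interface of Gym without importing the package; the class name `ChemoEnv`, the injected callables `step_fn` and `reward_fn`, the position of the tumor cells in the state tuple and the termination tolerance `tumor_tol` are all hypothetical.

```python
class ChemoEnv:
    """Gym-style chemotherapy environment (interface sketch only)."""

    def __init__(self, step_fn, reward_fn, x0, dt, max_steps, tumor_tol=1e-3):
        self.step_fn = step_fn      # (state, action, dt) -> next state
        self.reward_fn = reward_fn  # (state, dt) -> scalar reward
        self.x0 = tuple(x0)         # initial condition (the diagnosis)
        self.dt = dt                # agent time step
        self.max_steps = max_steps  # episode length limit
        self.tumor_tol = tumor_tol  # assumed eradication threshold
        self.state = self.x0
        self.steps = 0

    def reset(self):
        """Start a new episode from the initial condition."""
        self.state = self.x0
        self.steps = 0
        return self.state

    def step(self, action):
        """Apply an infusion rate for one time step."""
        self.state = tuple(self.step_fn(self.state, action, self.dt))
        self.steps += 1
        reward = self.reward_fn(self.state, self.dt)
        # Index 2 is assumed to hold the tumor cell count.
        eradicated = self.state[2] <= self.tumor_tol
        done = eradicated or self.steps >= self.max_steps
        return self.state, reward, done, {}
```

Passing the integrator and the reward as callables keeps the environment independent of the particular model, so the same class can wrap the nominal ODEs, a perturbed parameter set or a stochastic simulator.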
The values of the most relevant parameters of the DQN and DDPG implementations are shown in Table I. Note that these values were chosen so that the DRL reward matches the cost used in optimal control as closely as possible.
To improve the convergence and the training times, all the states, actions and the reward were normalized to a fixed range, and CuPy was used for GPU acceleration.
Note that both DDPG and DQN require selecting the size of the time steps taken by the agent during an episode. A finer time step generally gives better results, at the cost of longer training. After running tests with several time steps, we chose a step size that achieved a reasonable training time.
|Replay Start Size| | |
|Hidden Units per layer|100|300|
VI-A Case 0
The results for the preliminary Case 0 (in which there are no constraints on the normal and immune cell populations) are shown in Fig. 4. Note that the optimal policy found by all the algorithms (Optimal Control, DDPG and DQN) is the same: apply the maximum drug infusion rate until the cancer is exterminated, which is the expected solution.
VI-B Case Patient
In this case, we add the lower-bound constraints on the normal and immune cell populations to the program. The optimal policies found by the three algorithms for the nominal tumor growth rate are shown in Fig. 6. All the policies are able to exterminate the cancer. Note also that the drug infusion curves have similar shapes: maximum drug infusion at the beginning, followed by a lower, roughly constant value. In this case, DDPG and DQN obtain a solution slightly worse than O.C.
The convergence rates for DDPG and DQN are shown in Fig. 7. The training was stopped when the average of the Q-value reached a stationary behaviour. After several trials with different normalizations of the reward, we achieved training times on the order of minutes.
Note also that the plot of the average Q has the shape that usually appears after a successful training (see [22], for example): after the replay buffer has been filled, the average Q increases at the beginning of the training, reaches a maximum (where the agent is overestimating the Q-value), and then decreases to a stationary value.
VI-B1 Parametric uncertainty: model parameter
To study how sensitive the optimal policy is to changes in the tumor growth rate, we perturbed its value around the nominal one and obtained the range for which an optimal policy exists (i.e., the range in which the problem is feasible). The policies found are plotted in Fig. 5. The problem is feasible only up to a maximum growth rate and becomes infeasible for larger values. Again, DDPG and DQN obtain policies similar to that of O.C., although some of them exhibit oscillatory behaviour. For both DQN and DDPG, these oscillations may be due to the length of the time step used, and could be reduced by decreasing the time step size. Moreover, in the case of DQN, the coarse discretization of the action space (10 points) must also be taken into account.
Once we obtained the values of the growth rate for which the problem is feasible, we tested the optimal policy obtained for the nominal value in scenarios where the true growth rate differs from it. The results are shown in Fig. 8. For all the algorithms, the policy found eradicates the cancer in some of the perturbed scenarios and fails to cure it in others. However, there is one case in which the DRL algorithms succeed in exterminating the cancer but O.C. does not. We think that this effect might be due to the exploration performed in DRL: because of exploration, the policies found by DRL algorithms contain information about more of the state space than the optimal control solution, which only seeks a minimum for the given model. Hence, we conclude that the minimum found by DRL policies is more robust to uncertainty in the growth rate than the one provided by O.C.
VI-B2 Parametric uncertainty: initial condition
For this case, we train the agent with the nominal initial condition and test it with other values of the initial tumor size. The results are shown in Fig. 9. Note that DRL is able to exterminate the cancer in all the cases, while O.C. fails to do so in some of them. Again, this suggests that the solution found by DRL is considerably more robust than that of O.C.
VI-B3 Stochastic forcing in tumor dynamics
In this last scenario, we test the agent with a stochastic forcing term in the equation that governs the dynamics of the tumor cells. The plots of the mean and standard deviation of the solutions found are shown in Fig. 10. DQN and DDPG drive the tumor closer to extermination than O.C. at the end of the treatment.
One interesting question is how learning in DQN and DDPG depends on the number of episodes. This question is especially relevant in data-poor scenarios, where one may wonder whether the size of the dataset is enough to obtain a decent policy. With this aim, we evaluated the agent found after each training episode, and compared the obtained cost with the optimal cost of O.C. In the case of DQN, we additionally tried 7 different discretizations of the action space. The resulting plot is shown in Fig. 11. It is observed that the DRL costs asymptote to the optimal one and that, with enough training episodes, DDPG obtains a cost that is close to that of optimal control. Note also that, in general, DDPG tends to obtain better agents than DQN.
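The per-episode comparison described above can be sketched with three small helpers. This is a minimal sketch under stated assumptions: the cost is taken to be a rectangular sum of the tumor cell counts recorded at each agent step, and the function names and the tolerance argument are hypothetical.

```python
def tumor_area(tumor_counts, dt):
    """Rectangular approximation of the area under the tumor curve."""
    return sum(tumor_counts) * dt


def relative_gap(cost, optimal_cost):
    """Relative excess of an agent's cost over the optimal control cost."""
    return (cost - optimal_cost) / optimal_cost


def episodes_to_reach(gaps, tol):
    """Index of the first episode whose gap is within tol, or None."""
    for episode, gap in enumerate(gaps):
        if gap <= tol:
            return episode
    return None
```

Given one evaluation rollout per training episode, `relative_gap(tumor_area(counts, dt), optimal_cost)` produces the learning curve, and `episodes_to_reach` gives a simple estimate of how much training is needed to come within a chosen tolerance of the baseline.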
This work presented a comparison between classical O.C. and model-free DRL approaches, in both discrete and continuous action spaces. We showed that, with an accurate model of the dynamics, O.C. provides the best solution, closely followed by DRL. Moreover, we showed that DRL outperforms O.C. when there is uncertainty in the parameters of the model or in the initial condition, or when the dynamics are stochastic.
In Case 0, all the algorithms found the same optimal policy. In Case Patient, the policies found by DRL perform similarly to O.C., but they exhibit increased robustness to uncertainties. Regarding the relative performance of DQN and DDPG, we found that DDPG performs better, as expected due to its continuous action space. Furthermore, it seems to learn faster.
The sampling analysis of the algorithms showed that approximately 1500 calls to the model are needed for DDPG to obtain a performance close to optimal.
Thanks to Prof. David Sontag, Alejandro Rodriguez-Ramos and Dong-Ki Kim for valuable discussions and ideas. The authors would also like to thank Fundacion Bancaria “la Caixa” for financial support.
- [1] Regina Padmanabhan, Nader Meskin, and Wassim M. Haddad. Reinforcement learning-based control of drug dosing for cancer chemotherapy treatment. Mathematical Biosciences, 293:11–20, 2017.
- [2] L. G. de Pillis and A. Radunskaya. A mathematical tumor model with immune resistance and drug therapy: an optimal control approach. Comput. Math. Methods, 2001.
- [3] H. Sbeity and R. Younes. Review of optimization methods for cancer chemotherapy treatment planning. J. Comput. Sci. Syst. Bio., 2015.
- [4] Memorial Sloan Kettering Cancer Center. https://www.mskcc.org. Accessed: 8-Dec-2018.
- [5] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis. A general reinforcement learning algorithm that masters chess, shogi and Go through self-play. Science, 2018.
- [6] B. Recht. A tour of reinforcement learning: The view from continuous control. 2018.
- [7] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 2015.
- [8] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. ICLR, 2016.
- [9] Ying Liu, Brent Logan, Ning Liu, Zhiyuan Xu, Jian Tang, and Yangzhi Wang. Deep reinforcement learning for dynamic treatment regimes on medical registry data. In 2017 IEEE International Conference on Healthcare Informatics (ICHI), pages 380–385. IEEE, 2017.
- [10] Huan-Hsin Tseng, Yi Luo, Sunan Cui, Jen-Tzung Chien, Randall K. Ten Haken, and Issam El Naqa. Deep reinforcement learning for automated radiation adaptation in lung cancer. Medical Physics, 44(12):6690–6705, 2017.
- [11] Google AI: Using deep learning for breast tumor detection. https://ai.googleblog.com/2018/10/applying-deep-learning-to-metastatic.html. Accessed: 11-Dec-2018.
- [12] Dalit Engelhardt. Dynamic control of stochastic evolution: A deep reinforcement learning approach to adaptively targeting emergent drug resistance. arXiv preprint arXiv:1903.11373, 2019.
- [13] Emma Brunskill. CNNs and deep Q-learning. https://web.stanford.edu/class/cs234/slides/cs234_2018_l6.pdf. Accessed: 10-Dec-2018.
- [14] Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An introduction. MIT Press, 2018.
- [15] Deep deterministic policy gradients. https://pemami4911.github.io/blog/2016/08/21/ddpg-rl.html. Accessed: 10-Dec-2018.
- [16] J. T. Betts. Practical methods for optimal control and estimation using nonlinear programming. SIAM, 2010.
- [17] P. Falugi, E. Kerrigan, and E. van Wyk. Imperial College London Optimal Control Software: User guide (ICLOCS). 2010.
- [18] A. Wächter and L. T. Biegler. On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Math. Program., Ser. A, 2005.
- [19] ChainerRL, a deep reinforcement learning library. https://chainerrl.readthedocs.io/en/latest/. Accessed: 10-Dec-2018.
- [20] OpenAI Gym. https://gym.openai.com/. Accessed: 10-Dec-2018.
- [21] SciPy. https://docs.scipy.org/doc/scipy/reference/index.html. Accessed: 10-Dec-2018.
- [22] Ardi Tampuu, Tambet Matiisen, Dorian Kodelja, Ilya Kuzovkin, Kristjan Korjus, Juhan Aru, Jaan Aru, and Raul Vicente. Multiagent cooperation and competition with deep reinforcement learning. PloS ONE, 12(4):e0172395, 2017.