1 Introduction
Malaria is a mosquito-borne disease that continues to pose a heavy burden on sub-Saharan Africa (SSA) [moran2007malaria]. Recently, there has been significant progress in improving treatment efficiency and reducing the mortality rate of malaria. Unfortunately, due to financial constraints, policymakers face the challenge of ensuring continued success in disease control with insufficient resources. To support such decisions, learning control policies over the years has been formulated as a Reinforcement Learning (RL) problem [bent2018novel]. Nevertheless, applying RL to malaria control is tricky, since RL usually requires numerous trial-and-error searches to learn from scratch. Unlike simulation-based games, e.g., Atari games [mnih2013playing] and the game of Go [silver2017mastering], endless intervention trials are unacceptable for the affected regions, since the actual cost in lives and money is enormous. Hence, as in many human-in-the-loop systems [zou2019longterm, zou2020neural, zou2020pseudo], it is too expensive to apply RL directly to learn a malaria intervention policy from scratch.
Therefore, to reduce the heavy burden of malaria in SSA, it is urgent to improve the data efficiency of policy learning. In [bent2018novel], novel exploration techniques, such as the Genetic Algorithm [holland1992genetic], Batch Policy Gradient [sutton2000policy], and Upper/Lower Confidence Bound [auer2010ucb], were first applied to learn malaria control policies from scratch. However, these solutions operate under the Stochastic Multi-Armed Bandit (SMAB) setting, which myopically ignores the delayed impact of interventions and might cause serious problems. For example, large-scale use of spraying may lead to mosquito resistance and bring about an uncontrolled spread of malaria in the coming years. Hence, we must optimize disease control policies over the long run, which is far more challenging than exploration in the SMAB setting.

Considering these long-term effects, we model disease control as a finite-horizon continuous-space Markov Decision Process in this work. Under this setting, we propose a framework named Variance-Bonus Monte Carlo Tree Search (VB-MCTS) for data-efficient policy search, illustrated in Figure 1. In particular, it is a model-based training framework that iterates between updating the world model and collecting data. In model training, a Gaussian Process (GP) is used to approximate the state transition function from the collected rollouts. As a non-parametric probabilistic model, the GP avoids model bias and explicitly captures the uncertainty about the transitions, i.e., the variance of the predicted state. In data collection, we employ MCTS to generate the policy on the mean MDP augmented with a variance-bonus reward. The variance bonus decreases the uncertainty at state-action pairs with high potential reward by explicitly motivating the agent to sample the state-actions with the highest upper-bounded reward. Furthermore, the sample complexity of the proposed method indicates that it is a PAC-optimal exploration solution for malaria control. Finally, to verify the effectiveness of our policy search solution, extensive experiments are conducted on the malaria control simulators (https://github.com/IBM/ushiriki-policy-engine-library) [bent2018novel], which are gym-like (https://gym.openai.com) environments for KDD Cup 2019 [zhou2020kdd]. The outstanding performance in the competition and extensive experimental results demonstrate that our approach achieves unprecedented data efficiency on malaria control compared to state-of-the-art methods.

Our main contributions are: (1) We propose a highly data-efficient learning framework for malaria control, under which policymakers can successfully learn control policies from scratch within 20 rollouts. (2) We derive the sample complexity of the proposed method and verify that VB-MCTS is an efficient PAC-MDP algorithm. (3) Extensive experiments conducted on malaria control demonstrate that our solution outperforms state-of-the-art methods.
2 Related Work
As a highly pathogenic disease, malaria has been widely studied from the perspectives of predicting disease spread, diagnosis, and personalized care planning. However, little work focuses on applying RL to learn cost-effective intervention strategies, which play a crucial role in controlling the spread of malaria [moran2007malaria]. In [bent2018novel], malaria control was first formulated as a Stochastic Multi-Armed Bandit (SMAB) problem and solved with novel exploration techniques. Nevertheless, SMAB-based solutions only myopically maximize instant rewards, and this ignorance of delayed influences might result in disease outbreaks in the future. Therefore, in this work, a comprehensive solution is proposed to facilitate policy learning within a few trials under the finite-horizon MDP setting.
Another related topic is data-efficient RL. To increase data efficiency, we must extract more information from the available trials [deisenroth2011pilco], which involves both utilizing the samples in the most efficient way (i.e., exploitation) and choosing the samples that carry more information (i.e., exploration). Generally, for exploitation, model-based methods [ha2018recurrent, kamthe2017data] are more sample-efficient but require more computation time for planning. Model-free methods [szita2006learning, krause2016cma, van2009theoretical] are generally computationally light and can be applied without a planner, but need (sometimes exponentially) more samples and are usually not efficient PAC-MDP algorithms [strehl2009reinforcement]. For exploration, there are two options: (1) Bayesian approaches, which maintain a distribution over possible models and act to maximize expected reward; unluckily, these are intractable in all but very restricted cases, such as the linear policy assumption in PILCO [deisenroth2011pilco]. (2) Intrinsically motivated exploration, which implicitly negotiates the exploration/exploitation dilemma by always exploiting a modified reward that directly encourages exploration. However, on the one hand, the vast majority of papers only address the discrete-state case, providing incremental improvements on the complexity bounds, such as MMDP-RB [sorg2012variance], metric-E^3 [kakade2003exploration], and BED [kolter2009near]. On the other hand, for the more realistic continuous-state MDPs, over-exploration has been introduced to achieve polynomial sample complexity in many works, such as KWIK [li2011knows] and GP-Rmax [grande2014sample]; these methods explore all regions equally until the reward function is highly accurate everywhere. Drawing on the strengths of existing methods, our solution is a model-based RL framework that plans efficiently with MCTS and trades off exploitation and exploration by exploiting a variance-bonus reward.
3 Proposed Method: VB-MCTS
3.1 Malaria Control as MDP
Finding an optimal malaria control policy can be posed as a reinforcement learning task by modeling it as a Markov Decision Process (MDP). Specifically, we formulate the task as a finite-horizon MDP, defined by the tuple $(\mathcal{S}, \mathcal{A}, f, r, \gamma)$, with $\mathcal{S}$ as the potentially infinite state space, $\mathcal{A}$ as the finite set of actions, $f$ as the deterministic transition function, $r$ as the reward function, and $\gamma$ as the discount factor. In this case, we face the challenge of developing an effective policy for a population over a 5-year intervention time frame. As shown in Figure 2, the corresponding components in malaria control are defined as follows.
Action
The actions are the available means of intervention, including the mass distribution of long-lasting insecticide-treated nets (ITNs) and indoor residual spraying (IRS) with pyrethroids in SSA [stuckey2014modeling]. In this work, the action space is constructed as $a_t = (a^{\text{ITN}}_t, a^{\text{IRS}}_t)$ with $a^{\text{ITN}}_t, a^{\text{IRS}}_t \in [0, 1]$, which represent the population coverage of ITNs and IRS in a specific area. Without significantly affecting performance, we discretize the action space with an accuracy of 0.1 for simplicity.
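As a concrete illustration, the 0.1-accuracy discretization of the two-dimensional coverage space can be sketched as follows (a minimal sketch; the function name and list representation are ours, not the authors' code):

```python
import itertools

def build_action_space(step=0.1):
    """Enumerate (ITN, IRS) coverage pairs on a grid with the given accuracy.

    Each component is a population coverage in [0, 1]; with step=0.1 this
    yields an 11 x 11 = 121-action discrete space.
    """
    levels = [round(i * step, 10) for i in range(int(round(1 / step)) + 1)]
    return list(itertools.product(levels, levels))

actions = build_action_space()  # 121 discrete (ITN, IRS) actions
```

With the stated 0.1 accuracy, the planner only has to rank 121 candidate interventions per year, which keeps the tree search tractable.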
Reward
The reward is a scalar $r_t \in \mathbb{R}$, associated with the next state $s_{t+1}$. In malaria control, it is determined through an economic cost-effectiveness analysis; an overview of the reward calculation is given in [bent2018novel]. Without loss of generality, the reward function is assumed to be known, since an MDP with unknown rewards and unknown transitions can be represented as an MDP with known rewards and unknown transitions by adding additional states to the system.
State
The state contains the important observations for decision making at every time step. In malaria control, these include the number of lives with disability, life expectancy, lives lost, and treatment expenses [bent2018novel]. We set the state to the form $s_t = (r_{t-1}, a_{t-1}, t)$, including the current reward, the previous action, and the current intervention timestamp, which covers the most crucial and useful observations for malaria control, as shown in Figure 2. For the start state $s_1$, the reward and action are initialized to 0.
Let $\pi: \mathcal{S} \rightarrow \mathcal{A}$ denote a deterministic mapping from states to actions, and let $V^{\pi}(s)$ denote the expected discounted reward obtained by following policy $\pi$ from state $s$. The objective is to find a deterministic policy $\pi^{*}$ that maximizes the expected return at the start state $s_1$ as
$$\pi^{*} = \arg\max_{\pi} V^{\pi}(s_1) = \arg\max_{\pi} \mathbb{E}\left[\sum_{t=1}^{T} \gamma^{t-1} r_t \,\middle|\, \pi\right].$$
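The discounted return being maximized can be computed for a single rollout as below (a minimal sketch of the objective only; rewards are assumed to be indexed from the first intervention year):

```python
def discounted_return(rewards, gamma=0.99):
    """Return of one rollout: sum over t of gamma^(t-1) * r_t.

    `rewards` is the sequence of per-year rewards r_1..r_T; `gamma` is the
    discount factor from the MDP tuple.
    """
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```

For the 5-year malaria horizon, a rollout is simply the five yearly rewards produced by the simulator under a fixed policy.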
3.2 Modelbased Indirect Policy Search
In the following, we detail the key components of the proposed VB-MCTS framework: the world model, the planner, and the variance-bonus reward with its sample complexity.
World Model Learning
The probabilistic world model is implemented as a GP, where we use a predefined feature mapping $\phi(s_t, a_t)$ of the state-action pair as the training input and the successor state $s_{t+1}$ as the training target. The GP yields one-step predictions
$$\mu(x_*) = k_*^{\top} (K + \sigma_n^2 I)^{-1} y, \qquad \sigma^2(x_*) = k(x_*, x_*) - k_*^{\top} (K + \sigma_n^2 I)^{-1} k_*,$$
where $k(\cdot, \cdot)$ is the kernel function, $k_*$ denotes the vector of covariances between the test point $x_*$ and all training points, $y$ is the vector of corresponding training targets, $\sigma_n^2$ is the noise variance, and $K$ is the Gram matrix with entries $K_{ij} = k(x_i, x_j)$.

Throughout this paper, we consider a zero prior mean function and a squared exponential (SE) kernel with automatic relevance determination. The SE covariance function is defined as
$$k(x, x') = \sigma_f^2 \exp\left(-\frac{1}{2} (x - x')^{\top} \Lambda^{-1} (x - x')\right),$$
where $\sigma_f^2$ is the signal variance of the state transition and $\Lambda = \mathrm{diag}(\ell_1^2, \ldots, \ell_d^2)$. The characteristic length-scale $\ell_i$ controls the importance of the $i$th feature. Given training inputs $X$ and the corresponding targets $y$, the posterior hyperparameters of the GP (length-scales and signal variance) are determined through the evidence maximization technique [williams2006gaussian].
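A minimal sketch of the GP one-step prediction with the SE-ARD kernel, written directly in NumPy (variable names are ours; in practice a GP library would be used, and evidence maximization of the hyperparameters is omitted here):

```python
import numpy as np

def se_ard_kernel(X1, X2, signal_var, lengthscales):
    """Squared-exponential kernel with automatic relevance determination."""
    diff = X1[:, None, :] / lengthscales - X2[None, :, :] / lengthscales
    return signal_var * np.exp(-0.5 * np.sum(diff ** 2, axis=-1))

def gp_predict(X, y, x_star, signal_var, lengthscales, noise_var):
    """GP posterior mean and predictive variance at a single test input."""
    K = se_ard_kernel(X, X, signal_var, lengthscales) + noise_var * np.eye(len(X))
    k_star = se_ard_kernel(x_star[None, :], X, signal_var, lengthscales)[0]
    alpha = np.linalg.solve(K, y)            # (K + sigma_n^2 I)^{-1} y
    mean = k_star @ alpha                    # k_*^T (K + sigma_n^2 I)^{-1} y
    # predictive variance of a noisy observation at x_star
    var = signal_var + noise_var - k_star @ np.linalg.solve(K, k_star)
    return mean, var
```

The predictive variance returned here is exactly the quantity the variance-bonus reward relies on: it shrinks near observed state-action pairs and stays large in unexplored regions.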
Exploration with Variance-Bonus Reward
The algorithm we propose is itself straightforward and similar to many previously proposed exploration heuristics [kolter2009near, srinivas2009gaussian, sorg2012variance, grande2014computationally]. We call it the variance-bonus reward, since it chooses actions according to the current mean estimate of the reward plus an additional variance-based bonus for state-actions that have not been well explored:
$$\tilde{r}(s, a) = \bar{r}(s, a) + \beta_s \sigma_s^2(s, a) + \beta_r \sigma_r^2(s, a),$$
where $\beta_s$ and $\beta_r$ are the parameters that trade off exploitation and exploration, and $\sigma_s^2(s, a)$ and $\sigma_r^2(s, a)$ are the predicted variances of the state and the reward, respectively. The variance of the reward can be computed exactly following the law of iterated variances.
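The variance-bonus reward itself is a one-liner; the sketch below names the two trade-off parameters beta_s and beta_r (our naming) and takes the GP-predicted variances as inputs:

```python
def variance_bonus_reward(mean_reward, state_var, reward_var,
                          beta_s=1.0, beta_r=1.0):
    """Mean reward plus a variance bonus for poorly explored state-actions.

    beta_s and beta_r trade off exploitation against exploration: setting
    both to 0 recovers pure exploitation of the mean MDP.
    """
    return mean_reward + beta_s * state_var + beta_r * reward_var
```

Because the bonus is added to the reward seen by the planner, no separate exploration policy is needed: MCTS simply maximizes this modified reward.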
Planning with Mean MDP + Reward Bonus
MCTS is a strikingly successful planning algorithm [silver2017mastering] that can find the optimal solution given enough computation resources. Since disease control is not a real-time task, we propose to apply MCTS (Figure 1) as the planner for generating the policy that maximizes the variance-bonus reward. During execution, the MCTS planner incrementally builds an asymmetric search tree, guided toward the most promising direction by a tree policy. This process consists of four phases: selection, expansion, evaluation, and backup (as shown in Figure 3).
Specifically, each edge $(s, a)$ of the search tree stores an average action value $Q(s, a)$ and a visit count $N(s, a)$. In the selection phase, starting from the root state, the tree is traversed by simulation (that is, descending the tree with the mean prediction of states, without backup). At each time step of each simulation, an action is selected from state $s$ so as to maximize the action value $Q(s, a)$ plus a bonus that decays with repeated visits, to encourage exploration in the tree search. Here, $c$ is the constant determining the level of exploration. When the traversal reaches a leaf node at step $t$, the leaf node may be expanded, with each new edge initialized as $N(s, a) = 0$ and $Q(s, a) = 0$, and the corresponding leaf nodes are initialized with the mean GP prediction. Then, the leaf node is evaluated by the average outcome of rollouts, which are played out until the terminal step using fast rollout policies, such as a random policy or a greedy policy. At the end of the simulation, the action values and visit counts of all traversed edges are updated, i.e., the backup. Each edge accumulates the visit count and the mean evaluation of all simulations passing through that edge as
$$N(s, a) = \sum_{i=1}^{n} \mathbb{1}(s, a, i), \qquad Q(s, a) = \frac{1}{N(s, a)} \sum_{i=1}^{n} \mathbb{1}(s, a, i)\, V_i,$$
where $\mathbb{1}(s, a, i)$ indicates whether edge $(s, a)$ was traversed in the $i$th simulation and $V_i$ is the evaluated outcome of that simulation.
This cycle of selection, expansion, evaluation, and backup is repeated until the maximum number of iterations has been reached. At this point, the best action is chosen as the one leading to the highest value (max child):
$$\pi(s) = \arg\max_{a} Q(s, a), \qquad (1)$$
where $\pi$ is the policy generated by MCTS. Notably, the variance-bonus reward can be replaced with the mean predicted reward for generating the best-known policy, since the latter maximizes the expected reward under the posterior obtained so far.
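The selection bonus and the incremental backup can be sketched as follows. The exact form of the decaying bonus is an assumption on our part (the text only states that it decays with repeated visits and is scaled by a constant c); the backup maintains the running mean of simulation outcomes per edge:

```python
import math

def uct_select(children, c=1.0):
    """Pick the child action maximizing Q(s,a) plus a visit-decaying bonus.

    `children` maps action -> {"Q": mean value, "N": visit count}. The bonus
    sqrt(total_visits + 1) / (1 + N(s,a)) shrinks as an edge is revisited.
    """
    total = sum(child["N"] for child in children.values())
    return max(children, key=lambda a: children[a]["Q"]
               + c * math.sqrt(total + 1) / (1 + children[a]["N"]))

def backup(path, outcome):
    """Update visit counts and running-mean action values along the trace."""
    for edge in path:
        edge["N"] += 1
        edge["Q"] += (outcome - edge["Q"]) / edge["N"]
```

Under this scheme an unvisited edge receives the largest bonus and is tried first, after which its growing visit count lets the empirical mean Q dominate.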
Finally, we implement an iterative training procedure, as shown in Algorithm 1, where we specify the order in which each component occurs within the iteration.
3.3 Sample Complexity
In the following, we derive the sample complexity (i.e., the number of samples required to reach near-optimal performance) of our proposed solution. In the worst case, when the reward function is equal everywhere and all state-action pairs are explored equally, VB-MCTS has the same complexity bounds as over-exploration methods such as GP-Rmax [grande2014sample]. In practice, VB-MCTS learns the optimal policy in fewer steps than the over-exploration methods, because high-uncertainty regions with low reward are not worth exploring. Following the general PAC-MDP theorem (Theorem 10 in [strehl2009reinforcement]), we derive the polynomial sample complexity of VB-MCTS.
Theorem 1.
Assume that the feature space $\mathcal{X}$ of state-action pairs is a compact domain and that the target values are bounded. The reward and action value are Lipschitz continuous w.r.t. the state-action pairs; let $L_r$ and $L_Q$ be the Lipschitz constants of the reward and the action value, respectively. If VB-MCTS is executed with appropriate exploration parameters on any MDP, then VB-MCTS will follow an $\epsilon$-optimal policy from its current state on all but

(2)

timesteps, with probability at least $1 - \delta$, where $N(\mathcal{X})$ is the covering number of the domain $\mathcal{X}$, defined as the cardinality of the minimal set $\mathcal{C} \subseteq \mathcal{X}$ such that every $x \in \mathcal{X}$ lies within a given distance of some element of $\mathcal{C}$ under a distance measure $d(\cdot, \cdot)$.
Sketch of Proof.
The proof proceeds by showing that the key properties (optimism, accuracy, and learning complexity) required by the general PAC-MDP theorem [strehl2009reinforcement] are satisfied.
Optimism
Assume that MCTS can find the optimal decision in every state given enough computation resources. Then the optimal action value is upper-bounded by the value computed under the variance-bonus reward. The maximum error of propagating the mean prediction instead of the full state distribution is bounded, and this error is upper-bounded by the regularization error with high probability (Section A.2 in [grande2014computationally]). The same argument applies to the reward model. It follows that Algorithm 1 remains optimistic with probability at least $1 - 2\delta$.
Accuracy
Following Lemma 1 from [grande2014sample], the prediction error at a state-action pair is bounded in probability once the predictive variance of the GP at that pair is sufficiently small.
Learning complexity
(a) For any two inputs that are sufficiently close in $\mathcal{X}$, the posterior variance of the GP falls below the required threshold after sufficiently many observations in their neighborhood. (b) Since $\mathcal{X}$ is a compact domain, it can be covered by $N(\mathcal{X})$ balls centered at the elements of the covering set with a fixed radius.

Given (a) and (b), every state-action pair lies in one of the covering balls, and once each ball has received enough observations, the accuracy condition in (2) is satisfied. Combining Lemma 8 from [strehl2009reinforcement], the total number of updates that occur is bounded with probability at least $1 - \delta$.
Now that the key properties of optimism, accuracy, and learning complexity have been established, the general PAC-MDP theorem (Theorem 10 of [strehl2009reinforcement]) applies. ∎
Theorem 1 guarantees that, with probability $1 - \delta$, the number of steps on which the performance of VB-MCTS is significantly worse than that of an optimal policy starting from the current state is at most log-linear in the covering number of the state-action space.
4 Experiments
In this section, we conduct extensive experiments on two different OpenMalaria-based [smith2008towards] simulators: SeqDecChallenge and ProveChallenge, the testing environments used in KDD Cup 2019. In these simulators, the simulation parameters are hidden from the RL agents, since the true parameters for SSA are unknown to policymakers. Additionally, to simulate the real disease control problem, only 20 trials are allowed before generating the final decision, which is much more challenging than traditional RL tasks. The simulation environments are available at https://github.com/IBM/ushiriki-policy-engine-library.
Agents for Comparison
To show the advantages of VB-MCTS, several benchmark reinforcement learning methods and open-source solutions are deployed for comparison:
Random Policy: The random policy is executed for 20 trials, and the generated policy with the maximum reward is chosen as the final decision.
SMAB: This policy treats the problem as a Stochastic Multi-Armed Bandit and independently optimizes the policy for every year with Thompson sampling [chapelle2011empirical].
CEM: The Cross-Entropy Method is a simple gradient-free policy search method [szita2006learning].
CMA-ES: CMA-ES is a gradient-free evolutionary approach for optimizing non-convex objective functions [krause2016cma].
Q-learning-GA: It learns the malaria control policy by combining Q-learning with a Genetic Algorithm.
Expected-Sarsa: It collects 13 random episodes and runs Expected SARSA [van2009theoretical] for 7 episodes to improve the best policy using the collected statistics.
GP-Rmax: It uses GP learners to model the transition and reward functions and replaces the value of any state-action pair that is "unknown" with an optimistic value [li2011knows, grande2014sample].
GP-MC: It employs a Gaussian Process to regress the world model. The policy is generated by sampling from the posterior and choosing the action with the maximum reward.
VB-MCTS: Our proposed method.

Implementation Details
We build a 14-dimensional feature map for this task, which includes periodic features and a cross-term feature. Since the predicted variances of the state and the reward coincide in our setting, we empirically tie the exploration/exploitation parameters by fixing their sum in the experiments. In MCTS, the exploration constant is set to 5, and only the top-50 rewarded child nodes are expanded; the number of iterations does not exceed 100,000. For the Gaussian Process, to avoid overfitting, 5-fold cross-validation is performed during the updates of the GP world model. In particular, we use 1 fold for training and 4 folds for validation, which ensures the generalizability of the GP world model over different state-action pairs. Our implementation and all baseline code are available at https://github.com/zoulixin93/VB_MCTS.
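Since the concrete 14 features are not listed here, the sketch below is an illustrative guess at such a map, combining raw observations, periodic encodings of the intervention timestamp, and the ITN x IRS cross term (every individual feature choice is our assumption):

```python
import math

def feature_map(r_prev, a_itn, a_irs, t, horizon=5):
    """A hypothetical 14-D feature map with periodic and cross-term features.

    Inputs mirror the state definition: previous reward, previous ITN/IRS
    coverages, and the intervention year t within the 5-year horizon.
    """
    phase = 2 * math.pi * t / horizon
    return [
        1.0, r_prev, a_itn, a_irs, float(t),      # bias and raw observations
        a_itn ** 2, a_irs ** 2,                    # polynomial terms
        r_prev * a_itn, r_prev * a_irs,            # reward-action couplings
        math.sin(phase), math.cos(phase),          # periodic features
        math.sin(2 * phase), math.cos(2 * phase),
        a_itn * a_irs,                             # cross-term feature
    ]
```

A map of this shape lets the ARD length-scales of the GP down-weight whichever of the 14 features turn out to be irrelevant for predicting the next state.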



Table 1: Performance comparison on SeqDecChallenge and ProveChallenge (median, maximum, and minimum reward over 10 independent runs).

Agents          SeqDecChallenge
                Med. Reward  Max Reward  Min Reward
Random Policy   167.79       193.24      135.06
SMAB            209.05       386.28      6.44
CEM             179.30       214.87      120.92
CMA-ES          185.34       246.12      108.18
Q-learning-GA   247.75       332.40      171.33
Expected-Sarsa  462.76       495.03      423.93
GP-Rmax         233.95       292.99      200.35
GP-MC           475.99       499.60      435.51
VB-MCTS         533.38       552.78      519.61

Agents          ProveChallenge
                Med. Reward  Max Reward  Min Reward
Random Policy   248.25       464.92      55.24
SMAB            18.02        135.37      56.86
CEM             229.61       373.83      20.09
CMA-ES          289.03       314.57      92.95
Q-learning-GA   242.97       325.24      88.70
Expected-Sarsa  190.08       296.16      140.86
GP-Rmax         287.45       371.49      153.98
GP-MC           300.37       447.15      263.96
VB-MCTS         352.17       492.23      259.97
4.1 Results
Main Results
In Table 1, we report the median, maximum, and minimum reward over 10 independent repeated runs. The results are quite consistent with our intuition, and we make the following observations: (1) For this finite-horizon decision-making problem, treating it as an SMAB does not work; the delayed influence of actions cannot be ignored in malaria control. As presented in Table 1, SMAB's performance has a large variance and is even worse than the random policy on ProveChallenge. (2) Overall, the two model-based methods (GP-MC and VB-MCTS) consistently outperform the model-free methods (CEM, CMA-ES, Q-learning-GA, and Expected-Sarsa) on SeqDecChallenge and ProveChallenge, which indicates that model-based solutions are empirically more data-efficient than model-free ones. From the results in Table 1, model-free methods are outperformed by model-based methods by a large margin on SeqDecChallenge, and their performance is almost the same as the random policy on ProveChallenge. (3) The proposed VB-MCTS outperforms all baselines on SeqDecChallenge and ProveChallenge. Compared with GP-MC and GP-Rmax, the advantage gained from efficient MCTS planning with the variance-bonus reward leads to the success on both challenges.
Data Efficiency
This paragraph compares the data efficiency (number of required trials) of VB-MCTS with other RL methods that learn malaria policies from scratch. In Figures 4(a) and 4(b), we report the agents' performance after collecting each trial episode on SeqDecChallenge and ProveChallenge. The horizontal axis indicates the number of trials; the vertical axis shows the average performance after collecting each episode. Figures 4(a) and 4(b) highlight that our proposed VB-MCTS approach (brown) requires on average only 8 trials, including the first random trial, to achieve the best performance on SeqDecChallenge and ProveChallenge. The results indicate that VB-MCTS outperforms the state-of-the-art methods in both data efficiency and final performance.
5 Conclusion and Future Work
We proposed a model-based approach that employs a Gaussian Process to regress the state transitions for data-efficient RL in malaria control. By planning with the variance-bonus reward, our method naturally handles the exploration-exploitation dilemma through efficient MCTS planning. Extensive experiments conducted on the challenging malaria control task have demonstrated the advantage of VB-MCTS over the state of the art in both performance and efficiency. However, the stationary MDP setting may be unrealistic due to the development of disease control tools and the evolution of the disease. Therefore, data-efficient reinforcement learning under non-stationary settings will be a more realistic and more challenging task for future work.