
Policy Learning for Malaria Control

by   Van Bach Nguyen, et al.

Sequential decision making is a typical problem in reinforcement learning, and many algorithms exist to solve it. However, only a few of them work effectively with a very small number of observations. In this report, we describe our work on learning a policy for malaria control, framed as a Reinforcement Learning problem in the KDD Cup Challenge 2019, and propose several solutions to deal with the limited-observations problem. We apply a Genetic Algorithm, Bayesian Optimization, and Q-learning with sequence breaking to find the optimal policy for five consecutive years with only 20 episodes (100 evaluations). We evaluate these algorithms and compare their performance against Random Search as a baseline. Among these algorithms, Q-learning with sequence breaking was submitted to the challenge and ranked 7th in the KDD Cup.



1. Introduction

Malaria is caused by parasites that are transmitted to people through the bites of infected mosquitoes. It is one of the most dangerous diseases in the world. According to the WHO, nearly half of the world's population was at risk of malaria in 2017; there were 219 million cases and about 435,000 deaths. Sub-Saharan Africa is home to 92% of cases and 93% of deaths (WHO). Furthermore, about 450 million USD is spent on research and development each year to deal with this disease (Moran, 2007). This is why the KDD Cup Challenge 2019 chose as its topic the search for effective combinations of interventions to prevent malaria infection.

In this challenge, two interventions are considered to control malaria: distributing long-lasting insecticide-treated nets (ITNs) and performing indoor residual spraying (IRS) programs. Our goal is to determine the most effective five-year policies based on combinations of these two interventions. The cost-effectiveness of a policy depends on how much each intervention is used over the five years. For example, according to a report from IBM (Bent, 2018), ITNs are the best intervention in some transmission locations, whereas in Western Kenya performing IRS in a small proportion of households is more effective than deploying ITNs.

The most important element in tackling the malaria problem is the environment. Fortunately, a simulator called OpenMalaria (Smith and Tanner, 2008) models the real world in software, which lets us interact with the environment. OpenMalaria provides a simulation environment that returns a reward each time an agent takes an action. Based on it, interfaces were created for the challenge, each corresponding to one environment. In the KDD Cup, two interfaces were provided as training environments, and a secret interface was used as the test environment to evaluate solutions.

In this report, we first introduce the challenge as a Reinforcement Learning problem with a limited number of observations. Second, we propose our solutions: Random Search as a baseline, a Genetic Algorithm, Bayesian Optimization, and Q-learning with sequence breaking, which is also our final submission. Finally, we compare the results of these algorithms and outline future approaches.

2. Malaria control policy as Reinforcement learning problem

The malaria challenge is a sequential decision-making problem, which is a typical type of problem in Reinforcement Learning (RL). Therefore, in this section, we explain some RL concepts before introducing the malaria challenge.

2.1. Reinforcement learning

The reinforcement learning (RL) paradigm gives artificial intelligence (AI) systems a form of autonomy: we do not need to teach them by providing data or prior knowledge; instead, they learn how to behave by themselves. This idea is inspired by human development, as we learn new skills not only from teachers but also from many mistakes (trial-and-error learning). RL has a long history of development. It has achieved impressive results in robotics (Kohl and Stone, 2004; Bentivegna and Atkeson, 2001), control (Coates et al., 2006; Bagnell and Schneider, 2001), and games (Tesauro, 1995; Yan et al., 2005).

Figure 1. The agent-environment interaction (Sutton and Barto, 2018)


Besides supervised learning and unsupervised learning, reinforcement learning (RL) is an essential paradigm of machine learning. The idea of RL is depicted in Figure 1. Unlike supervised learning, where a model takes a set of examples with ground truth and learns from them, in RL an agent (similar to a model) learns from experience by interacting with an environment. Agents gain experience by taking actions and receiving rewards from the environment. The goal of an agent in RL is to maximize rewards by choosing proper actions in each situation. These situations are called states; a strategy for deciding on an action in a specific state is called a policy $\pi$, i.e., $\pi$ is a function that maps a state to an action. Agents start from the initial state, follow the policy, and stop when they reach the terminal state; all states that an agent goes through between the initial state and the terminal state form an episode.

2.2. Malaria Control Challenge

Malaria Control Challenge is a typical Reinforcement Learning problem that includes:

  • An Agent: It is the model that we need to build.

  • An Environment: It is a simulation provided by OpenMalaria. We have two training environments and one final test environment.

  • Actions: They combine two interventions, insecticide-treated nets (ITNs) and indoor residual spraying (IRS). The first component is the deployment of nets, which defines the coverage of the population ($a_{ITN} \in [0, 1]$). The second component is the application of seasonal spraying, which defines the proportion of the population covered by this intervention ($a_{IRS} \in [0, 1]$). Each action is a pair $a = (a_{ITN}, a_{IRS})$, so the action space $A = [0, 1]^2$ is continuous. A policy $\pi$ maps the five years to a set of actions $\{a_1, \dots, a_5\}$.

  • States: They are represented by the year, so the five years correspond to five states. In an RL problem, the next state usually depends on the action that the agent takes in the current state. In this problem, however, there is no such state transition: whatever the action is, the state always advances to the next year. An episode always consists of five states.

  • Rewards: They are modeled as floating-point numbers. The environment returns a reward after each year and a reward for the whole episode; the latter is the sum of the five intermediate rewards (one per state).

Our goal is to build an agent that explores the environment within 20 episodes (100 evaluations) and returns the policy with the highest reward. Reinforcement Learning tasks usually use hundreds or thousands of episodes to train the agent, but in this challenge we have only 20, and the action space is continuous. This is the most difficult part of the challenge. In the next section, we propose some methods to deal with this difficulty.
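The problem structure described in this section can be sketched as a small environment class. Everything in it is hypothetical: `ToyMalariaEnv`, its `step` method, and the made-up Gaussian-shaped reward are illustrative stand-ins, not the OpenMalaria interface used in the challenge.

```python
import math


class ToyMalariaEnv:
    """Hypothetical stand-in for a challenge environment (not OpenMalaria)."""

    N_YEARS = 5

    def __init__(self):
        self.year = 1            # the state is simply the current year
        self.prev_action = None

    def reset(self):
        self.year = 1
        self.prev_action = None
        return self.year

    def step(self, action):
        """Apply (itn, irs) for the current year; return (next_year, reward, done)."""
        itn, irs = action
        if not (0.0 <= itn <= 1.0 and 0.0 <= irs <= 1.0):
            raise ValueError("both coverages must lie in [0, 1]")
        # Made-up reward: one Gaussian bump, halved when last year's action is repeated,
        # so that a year's reward depends on an earlier action as well.
        reward = 100.0 * math.exp(-((itn - 0.2) ** 2 + (irs - 0.8) ** 2) / 0.1)
        if self.prev_action == action:
            reward *= 0.5
        self.prev_action = action
        self.year += 1           # the state always advances by one year
        done = self.year > self.N_YEARS
        return self.year, reward, done


def run_episode(env, policy):
    """Evaluate a policy given as five (itn, irs) pairs; return the episode reward."""
    env.reset()
    total = 0.0
    for action in policy:
        _, reward, _ = env.step(action)
        total += reward
    return total
```

One episode consumes five evaluations, which is why 20 episodes correspond to 100 evaluations.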

3. Solutions

3.1. Random Search (Baseline)

For the Random Search algorithm, the agent picks 20 arbitrary sets of actions, one per episode, stores the reward of each episode, and finally chooses the set of actions with the highest reward. The final policy is $\pi^* = \arg\max_{\pi_i,\, i \in \{1, \dots, 20\}} R(\pi_i)$, where $R(\pi_i)$ is the episode reward of the $i$-th sampled policy. This approach is simple and straightforward, and serves only as a baseline against which to find better solutions. However, with so few episodes, this baseline is already very hard to beat.
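A minimal sketch of this baseline follows. The reward function `toy_episode_reward` is a hypothetical placeholder for one episode in a challenge environment, and the function names are ours.

```python
import random


def toy_episode_reward(policy):
    """Hypothetical episode reward: highest when every action is (0.5, 0.5)."""
    return sum(1.0 - abs(itn - 0.5) - abs(irs - 0.5) for itn, irs in policy)


def random_policy(rng, n_years=5):
    """Sample one (itn, irs) pair in [0, 1) x [0, 1) per year."""
    return [(rng.random(), rng.random()) for _ in range(n_years)]


def random_search(evaluate, n_episodes=20, seed=0):
    """Evaluate n_episodes random policies and keep the best one."""
    rng = random.Random(seed)
    best_policy, best_reward = None, float("-inf")
    for _ in range(n_episodes):
        policy = random_policy(rng)
        reward = evaluate(policy)
        if reward > best_reward:
            best_policy, best_reward = policy, reward
    return best_policy, best_reward
```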

3.2. Genetic Algorithm

The Genetic Algorithm (GA) is a biologically inspired, population-based algorithm (Holland, 1992). It starts from an initial population of policies, called the first generation, and aims to improve it by creating new generations through operations that simulate natural selection.

GA uses the term fitness, which indicates how good a policy is. Fitness plays the role that accuracy plays in general machine learning algorithms, and is normalized to [0, 1] for this problem. Operations are the actions that modify policies in a genetic manner. In this paper, two operations, Mutation and Crossover, are used. The Crossover operation mixes two policies, either randomly or in order. Here, a policy contains five tuples of two values; hence, when we combine two policies using Crossover, some of their tuples are exchanged and new policies are created. The Mutation operation changes the value of a tuple randomly: for a given policy, the values of one or more of its tuples can be increased or decreased by adding a noise value.

From the current population, which is a set of policies and their rewards, we select candidates for crossover and mutation using the Roulette Wheel Selection algorithm (Goldberg and Deb, 1991), in which the probability $p_i$ of selecting policy $i$ is determined by dividing its fitness $f_i$ by the sum of the fitness of all $N$ policies, as defined in equation (1).

$$p_i = \frac{f_i}{\sum_{j=1}^{N} f_j} \qquad (1)$$


The complete procedure is as follows. The first generation is created stochastically. Its policies are then evaluated in the environment to obtain their rewards. Roulette Wheel Selection chooses two parent policies, favoring those with higher fitness, and they are mated using the Crossover and Mutation operations. The new child is added to the existing population, and the process continues.
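The loop above can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the fitness function `toy_fitness`, the initial population size, and the choice to mutate a single tuple with uniform noise are all ours.

```python
import random


def toy_fitness(policy):
    """Hypothetical fitness in [0, 1]; peaks when every action is (0.5, 0.5)."""
    return sum(1.0 - abs(i - 0.5) - abs(s - 0.5) for i, s in policy) / len(policy)


def roulette_select(population, fitnesses, rng):
    """Pick one policy with probability fitness / sum of all fitnesses."""
    total = sum(fitnesses)
    if total <= 0:
        return rng.choice(population)
    threshold = rng.uniform(0.0, total)
    cumulative = 0.0
    for policy, fitness in zip(population, fitnesses):
        cumulative += fitness
        if cumulative > threshold:
            return policy
    return population[-1]


def crossover(parent_a, parent_b, rng):
    """Build a child by taking each year's tuple from one parent at random."""
    return [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]


def mutate(policy, rng, noise=0.05):
    """Add uniform noise in [-noise, noise] to one random tuple, clipped to [0, 1]."""
    child = list(policy)
    i = rng.randrange(len(child))
    child[i] = tuple(min(1.0, max(0.0, v + rng.uniform(-noise, noise))) for v in child[i])
    return child


def genetic_search(evaluate, n_evals=20, init_size=5, seed=0):
    """Grow the population one child at a time until n_evals policies are evaluated."""
    rng = random.Random(seed)
    population = [[(rng.random(), rng.random()) for _ in range(5)] for _ in range(init_size)]
    fitnesses = [evaluate(p) for p in population]
    while len(population) < n_evals:
        mother = roulette_select(population, fitnesses, rng)
        father = roulette_select(population, fitnesses, rng)
        child = mutate(crossover(mother, father, rng), rng)
        population.append(child)
        fitnesses.append(evaluate(child))
    best = max(range(len(population)), key=fitnesses.__getitem__)
    return population[best], fitnesses[best]
```

A small `noise` keeps children close to their parents, which favors exploitation over exploration.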

3.3. Bayesian Optimisation (BO)

Before introducing Bayesian Optimization, we first need to present Active Learning.

3.3.1. What is Active Learning (AL):

For certain AI challenges, we have at our disposal an Oracle/Expert that can answer the targeted question. As an example, suppose we want to build an optical character recognition (OCR) system for Street View house numbers. In a supervised learning approach, a human (the Expert) has to label a large number of images with the house number. The trivial way would be to choose the images to label at random. In this case, some images would look very similar and would not help the algorithm learn more. An AL algorithm, on the other hand, would estimate which image would most improve the current model and then ask the Oracle for its label.

3.3.2. Active Learning in KDD Cup:

In the KDD Cup Challenge 2019, the online server plays the role of the Oracle. The agent can send only 100 queries to the Oracle. The goal of the AL module is to maximize the knowledge potentially gained from the $n$-th query based on the results of the previous queries. The AL algorithm can calculate the distance between all queried policies and an untested one. Choosing the candidate that maximizes this distance explores the action space; otherwise, querying around the best values found so far tends to refine the best policy.
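The exploration side of this idea can be illustrated with a short sketch. The Euclidean distance and the "farthest from its nearest queried neighbor" criterion are our assumptions about how such a distance could be used; the function names are hypothetical.

```python
import math


def policy_distance(p, q):
    """Euclidean distance between two policies, each flattened to 10 numbers."""
    return math.sqrt(sum((a - b) ** 2 for pa, qa in zip(p, q) for a, b in zip(pa, qa)))


def most_novel_candidate(candidates, queried):
    """Return the candidate whose nearest already-queried policy is farthest away."""
    return max(candidates, key=lambda c: min(policy_distance(c, q) for q in queried))
```

For example, if only the all-zero policy has been queried, a candidate near (0.9, 0.9) in every year is preferred over one near (0.1, 0.1).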

3.3.3. Bayesian Optimisation and Active Learning:

Bayesian Optimization is a global optimization algorithm based on Gaussian processes (fmfn, 2019). It approximates a function using the Upper Confidence Bound. The interpolation becomes more and more precise as more points are queried using the Oracle. The goal is to find the maximum of the function and to keep querying around this maximum thanks to its Active Learning module.

Figure 2. Function approximation using Bayesian Optimization

As seen in Figure 2, the algorithm is also able to approximate functions corrupted by white noise. The utility function estimates the next best point to query. A parameter, kappa, controls the balance between exploration and exploitation (i.e., looking for another maximum or querying around the one already found).
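The role of kappa can be illustrated with a common form of the upper confidence bound utility, the predicted mean plus kappa times the predicted standard deviation. In the sketch below, the candidate points and their predicted means and standard deviations are invented numbers standing in for a Gaussian process posterior, and the function names are ours.

```python
def ucb(mean, std, kappa):
    """Upper confidence bound: optimistic estimate of the reward at a point."""
    return mean + kappa * std


def next_query(candidates, kappa):
    """candidates: list of (point, predicted_mean, predicted_std) triples."""
    return max(candidates, key=lambda c: ucb(c[1], c[2], kappa))[0]


# Invented posterior values: a well-explored good point and an uncertain region.
candidates = [
    ((0.2, 0.8), 100.0, 1.0),   # high mean, low uncertainty
    ((0.8, 0.1), 60.0, 30.0),   # lower mean, high uncertainty
]
```

With `kappa = 0.5` the well-explored point wins (exploitation), while with `kappa = 2.0` the uncertain region is queried next (exploration).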

To summarize, the advantage of Bayesian Optimisation is that it includes:

  • Interpolation (even for noisy functions)

  • Inverse Reinforcement Learning (Lopes et al., 2009)

  • Active Learning

More information about global optimization with Gaussian processes can be found in (fmfn, 2019).

3.3.4. Implementation:

A trivial implementation of BO would be to approximate five functions $f_i$, with $i \in \{1, \dots, 5\}$ representing the year. The input is the pair of actions in year $i$, and the approximated output is the obtained reward.

Figure 3. Reward distribution in year 1. The axes represent the pair of actions. The reward value is scaled by a factor of 1/100

In Figure 3, $f_1$ is plotted using an exhaustive search: 1600 (40 by 40) queries were performed to obtain this precise representation. The queried environment was the first one provided for this challenge. We will call it Environment 1 or Sequential Decision Making.

Figure 4. Year 1’s reward, approximated using Bayesian Optimization

Figure 4 shows the approximation of $f_1$ using BO (fmfn, 2019). We notice that the shape of the distributions is not well captured: the low-valued Gaussian centered on the point (0.2, 0.9) is approximated differently from the one centered on the point (0.8, 0.9). Also, BO was not able to detect the separation between the two high-reward Gaussian curves (Figure 4, top left). Despite these errors, BO is able to detect the maximum of the function, and only 89 points were used. For Environment 1 specifically, 25 to 30 queries are enough to find the maximum.

Since the reward of year $i$ is affected by previous actions, estimating five separate functions is not a general solution. To solve this issue, we propose three different approaches:

  • Algorithm BO.1 (Figure 5) uses Bayesian Optimisation for each year: the five functions are approximated in a greedy way, with more query points reserved for the first years. For example, we first approximate year 1 with 15 points, then start approximating year 2 based on the optimal action of year 1. From this step on, for each new episode, we query year 1 to refine the current best action, then query year 2 to explore the action space. We iterate through the years in this way until we reach year 5.

    Figure 5. Architecture of the Algorithm BO.1
  • Algorithm BO.2 (Belaid, 2019) uses only one instance of Bayesian Optimisation: the approximated function takes ten dimensions as input (the full policy) and produces one dimension as output (the approximated total reward).

  • Algorithm BO.3, called "BO with a forward Boosting Network" (Belaid, 2019), combines both preceding algorithms through an Ensemble Learning technique: multiple instances of BO with different inputs and outputs are first trained. In a second step, a small neural network learns which instance of BO better approximates the environment. Based on these weights, the agent queries the online environment. Upon receiving the new rewards, the BO instances and the neural network are retrained. To keep the architecture simple, Figure 6 represents only the first 2 years.

Figure 6. Partial Architecture of the Algorithm BO.3

3.4. Q-learning with sequence breaking

From another perspective, we consider the malaria problem as a sequential decision-making problem in which the reward for a given year depends on all actions chosen in previous years. Q-learning was designed to solve this kind of problem. In this section, we introduce plain Q-learning and the way we modify it to apply it to the malaria problem. This is the solution that we submitted to the KDD competition.

3.4.1. Plain Q-Learning

Q-learning (Watkins and Dayan, 1992), an early breakthrough in reinforcement learning (RL) (Sutton and Barto, 2018), is defined by

$$Q(s_t, a_t) \leftarrow (1 - \alpha)\, Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) \right]$$

This equation expresses that the value of an action $a_t$ in a state $s_t$, which measures how good that action is in that state, is updated based on: (1) the current value $Q(s_t, a_t)$ that the agent already knows, (2) the reward $r_t$ that the agent receives after taking action $a_t$, (3) the maximum action value in the next state $s_{t+1}$, weighted by the discount factor $\gamma$, and (4) the learning rate $\alpha$, which sets the weights of the old value and the newly obtained value. If $\alpha = 0$, the agent learns nothing and the Q-values are unchanged, while if $\alpha = 1$ the agent forgets everything from its experience. $\gamma$ is a discount factor that sets the weight of each reward at a given time step ($\gamma$, $\gamma^2$, ...): the further away the time step, the lower the weight. All action values are stored in a table called the Q-table. This table stores every state with every possible action, and the agent relies on it to choose the best action. The Q-values in this table do not depend on the way the agent chooses which action to take, i.e., Q-learning is off-policy. In this challenge, however, with only 100 evaluations (20 episodes) and a continuous action space, it is challenging for Q-learning to perform well.
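As a worked illustration of the tabular update, the sketch below applies the standard Q-learning rule, new value = (1 - alpha) * old value + alpha * (reward + gamma * best next value), to a dictionary-based Q-table. The function name, the table layout, and the numbers are ours.

```python
from collections import defaultdict


def q_update(q_table, state, action, reward, next_state, actions, alpha, gamma):
    """Apply one Q-learning update in place and return the new Q(state, action)."""
    best_next = max(q_table[(next_state, a)] for a in actions)
    old_value = q_table[(state, action)]
    new_value = (1 - alpha) * old_value + alpha * (reward + gamma * best_next)
    q_table[(state, action)] = new_value
    return new_value


actions = [(0.1, 0.2), (0.3, 0.4)]
q_table = defaultdict(float)          # unseen (state, action) pairs default to 0.0
q_table[(1, (0.1, 0.2))] = 4.0        # current estimate for year 1
q_table[(2, (0.3, 0.4))] = 10.0       # best known value in year 2
```

With `reward = 2.0`, `alpha = 0.5` and `gamma = 0.9`, the update gives 0.5 * 4.0 + 0.5 * (2.0 + 0.9 * 10.0) = 7.5. Setting `alpha = 0` would leave the value at 4.0, and `alpha = 1` would replace it with 11.0.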

Figure 7. Grid search: the big blue point is the maximum after the grid search; from there, we query around it to find the best point


3.4.2. Q-learning with sequence breaking

Figure 8. Q-Learning with 200 episodes result


The original Q-learning requires a discrete action space. To satisfy this condition, we only consider actions with a zero integer part and a single decimal digit; for example, [0.1, 0.2] is a pair of actions for one year. This gives a total of 100 pairs of actions in our action space. With this discretization, we can easily apply algorithms that only work on discrete action spaces, such as Q-learning. Furthermore, since rewards in these test environments follow multivariate Gaussian distributions, we can quickly find areas that give high rewards. However, with this approach we cannot reach the global optimum, because the best action might not be among the 100 pairs in our space. With limited episodes, though, it is a good way to find a pair of actions that gives high rewards.
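The discretized action space can be built in a few lines. The helper `nearest_discrete_action` is a hypothetical convenience for snapping a continuous action onto the grid and is not part of the described method.

```python
from itertools import product

# One decimal digit with a zero integer part: 0.0, 0.1, ..., 0.9
levels = [i / 10 for i in range(10)]

# All (itn, irs) pairs: 10 x 10 = 100 discrete actions
action_space = list(product(levels, repeat=2))


def nearest_discrete_action(action):
    """Snap a continuous (itn, irs) pair to the closest grid point."""
    return tuple(min(levels, key=lambda v: abs(v - x)) for x in action)
```

Note that any continuous optimum lying between grid points, or above 0.9, is only approximated by its nearest grid neighbor.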

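Result: optimal policy $\pi^*$
define action space A with resolution = 0.3;
take each action in A, store reward in R table;
$a_{best}$ = action with the highest reward in R;
repeat
       $a$ = random action around $a_{best}$ with distance = 0.1;
       if $R(a) > R(a_{best})$ then
             $a_{best}$ = $a$; the next $a$ continues in the same direction;
       end if
until 4 times;
define action space A with resolution = 0.1;
initialize the exploration rate, the learning rate $\alpha$, and the discount factor $\gamma$;
initialize $Q(s, a)$, for all states $s$, actions $a \in A$;
for each of 16 episodes do
       take $a_{best}$ in the first year, observe $r$, $s'$; $\pi$ = [$a_{best}$];
       for 4 years do
             choose $a$ from A following the exploration policy;
             take $a$, observe $r$, $s'$; $\pi$.add($a$);
             update: $Q(s, a) \leftarrow (1 - \alpha) Q(s, a) + \alpha [r + \gamma \max_{a'} Q(s', a')]$;
             $s = s'$;
       end for
       if Reward($\pi$) > Reward($\pi^*$) then
             $\pi^* = \pi$;
       end if
end for
return $\pi^*$;
Algorithm 1. Q-learning with sequence breaking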

After discretizing the action space, plain Q-learning works well with a high number of episodes (Figure 8); with only 20 episodes, however, it is no better than the baseline. Therefore, we decided to gain an advantage at the beginning of each episode to boost the Q-learning result. The key idea is to break the sequence of actions into two parts: the first year and the other years. For the first year, we spend 20 evaluations to find the pair of actions that returns the maximum immediate reward for that year, ignoring the relationship between the first year and the other years. To determine this best action, we first use a 4x4 grid search to explore the environment and find areas with potentially high reward. We check 16 pairs of actions formed by combining two values from the list [0.0, 0.3, 0.6, 0.9], for example [0.3, 0.6] or [0.9, 0.0], and choose the pair that gives the highest reward among these 16. Second, we use the 4 remaining evaluations to exploit the area around the chosen pair. To do so, we query one random pair of actions around the current best action; if it gives a higher reward than the best action, we follow that direction and check the pair that lies further along it. For example, if [0.3, 0.6] is the best action after the grid search, one random action around it is checked, say [0.2, 0.6]. If this action is better than [0.3, 0.6], the next action we check is [0.1, 0.6]; otherwise, we choose another random action around the current best action. Figure 7 illustrates the idea.

After finding the best action for the first year, we keep it fixed and apply Q-learning to the other years. For Q-learning, we use an exploration policy to choose the action and a learning rate that depends on the number of times an action has been taken. The source code for this solution is available online.
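The first-year search described above (a 4x4 grid followed by a short local exploitation) can be sketched as follows. This is our reading of the procedure, not the submitted code: the reward function `toy_first_year_reward` is invented, and restricting neighbors to the four axis-aligned directions is an assumption.

```python
import math
import random


def toy_first_year_reward(action):
    """Hypothetical first-year reward: one Gaussian bump centered at (0.2, 0.8)."""
    itn, irs = action
    return 100.0 * math.exp(-((itn - 0.2) ** 2 + (irs - 0.8) ** 2) / 0.1)


def clip(value):
    """Keep a coordinate on the 0.1-resolution grid within [0.0, 0.9]."""
    return round(min(0.9, max(0.0, value)), 1)


def first_year_search(evaluate, n_local=4, seed=0):
    rng = random.Random(seed)
    coarse = [0.0, 0.3, 0.6, 0.9]
    # Step 1: 4x4 grid search, 16 evaluations.
    rewards = {(a, b): evaluate((a, b)) for a in coarse for b in coarse}
    best = max(rewards, key=rewards.get)
    # Step 2: at most n_local extra evaluations around the best point.
    direction = None
    for _ in range(n_local):
        if direction is None:
            direction = rng.choice([(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)])
        candidate = (clip(best[0] + direction[0]), clip(best[1] + direction[1]))
        if candidate not in rewards:
            rewards[candidate] = evaluate(candidate)
        if rewards[candidate] > rewards[best]:
            best = candidate      # improvement: keep moving in the same direction
        else:
            direction = None      # no improvement: try a fresh random neighbor
    return best, rewards[best]
```

The search never spends more than 20 evaluations, which leaves 80 evaluations (16 full episodes) for Q-learning on the remaining years.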

Another simple approach is to apply a grid search to all five years, ignoring the relation between years. This completely breaks the sequential relation of the five years and relies on grid search alone. The result was even better than the combination with Q-learning; however, due to the restrictions of the competition, we could not submit it.

4. Results

4.1. Genetic Algorithm

Figure 9. Genetic Algorithm with noise value of 0.05

The Mutation operation is the main way to control the exploration-exploitation tradeoff: the lower the noise value, the more exploitation-based the learning.

With a limit of only 100 evaluations, we use a noise value of 0.05, which is nearly pure exploitation. With this configuration, the Genetic Algorithm can learn new policies that yield rewards of around 250 on the environment. See Figure 9 and Table 3 for details.

4.2. Bayesian Optimization

Average rewards (percentage of the Random Search reward in parentheses)

Algorithm                    Env. 1             Env. 2
Random Search (5 years)      161.20 (100%)      158.991 (100%)
BO.1: 5 indep. B.O.          250 (155%)         500 (315%)
BO.2: 10-dim B.O.            400 (248%)         200 (126%)

Table 1. Comparing the results of BO.1 and BO.2

Concerning the first implementation (BO.1), only a few queries are left for episodes reaching the terminal state. Therefore, year 5's action space is not as well explored as that of the first years, but we generally get a high total reward. For Environment 1, the maximum reward is around 110 for each year; it is approximated using the simple exhaustive search algorithm used before. As shown in Table 1, BO.1 obtains a score close to the maximum in Environment 2, whereas BO.2 obtains a score close to the maximum in Environment 1. This is explained by the nature of each environment:

  • In Environment 1, the reward of year $i$ depends only on two actions: the current one and the one before. By breaking the sequence for 5 years, we learn the best action in year $i$ before proceeding with year $i+1$. Hence, the algorithm BO.1 can optimize the reward of year $i+1$ for a fixed action in year $i$.

  • In Environment 2, the reward of year $i$ depends on the current action and all previous actions. A 10-dimensional function, as used in BO.2, is able to capture this relation and output the best policy.

For BO.3, only a proof of concept was implemented: the algorithm can learn the best policy only for the first 2 years. Results are shown in table 2.

Average rewards (percentages in parentheses)

Algorithm                    Env. 1         Env. 2
Random Search (2 years)      35 (100%)      105 (100%)
BO.3: Boosting Network       70 (200%)      210 (140%)

Table 2. Results of BO.3.

Compared to the first two algorithms, BO.3 achieves stable results since it balances the two other solutions. This improvement is achieved by learning the two weights $w_1$ and $w_2$. As an example, in Figure 10 the best weights are ($w_1$: 0.78, $w_2$: 0.82). Usually, 5 to 12 episodes are used before reaching the best weights. Therefore, BO.3 cannot outperform the two other algorithms.

Figure 10. Mean Square error of the predicted reward w.r.t. the 2 weights [w1 , w2] of the boosting network.

4.3. Q-learning with sequence breaking algorithm

To evaluate the results, each algorithm is run 10 times, with 20 episodes per run and no knowledge transferred between runs, and we take the average over these 10 runs. This evaluation checks whether the algorithms produce better results than the baseline. Figure 11 shows the comparison between the algorithms. As we can see, plain Q-learning does not perform much better than random search; in some runs, it is even worse than the baseline. Meanwhile, both Q-learning with first-year breaking and the completely-breaking-sequence approach outperform the others. The best approach is to break the sequence completely; however, we were not allowed to submit it to the competition, so it remains only a proposed solution. Table 3 shows the average rewards of each algorithm and a comparison between them.

Figure 11. Result comparison


Average rewards (percentages in parentheses)

Algorithm                      Env. 1             Env. 2
Random Search (Baseline)       161.20 (100%)      158.991 (100%)
Break. Seq. for 5 years        434.36 (290%)      423.019 (267%)
Genetic Algorithm              148.18 (91%)       262.83 (163%)
Q-learning                     209.51 (130%)      185.055 (117%)
1st year break + Q-learning    289.93 (180%)      428.01 (269%)

Table 3. Comparison between algorithms

Both sequence-breaking algorithms, which break the sequence for the first year or for all five years, share common advantages and disadvantages. In short runs, they work better than the baseline and than traditional Reinforcement Learning algorithms such as plain Q-learning, and they are quite simple to implement. For long runs with enough episodes, plain Q-learning may be the better choice. The disadvantage of these approaches is that if the maximum reward in the first year leads to very bad rewards in the following four years, the overall result suffers badly. Furthermore, because we discretized the action space, we cannot reach the global optimum. With limited observations, however, these algorithms are reasonable choices.

5. Conclusion

Sequential decision making is an interesting task for Reinforcement Learning that can be applied to many real-world problems such as malaria control. However, with a very small number of observations, this task is very hard to solve, and traditional Reinforcement Learning algorithms cannot beat the baseline. This paper has introduced enhanced algorithms that deal with this difficulty and improve on the performance of traditional methods. We hope that our solutions will help in controlling the spread of malaria, or at least bring up helpful ideas.


References

  • J.A. Bagnell and J.G. Schneider (2001) Autonomous helicopter control using reinforcement learning policy search methods. IEEE International Conference on Robotics and Automation.
  • M. K. Belaid (2019) Malaria control using reinforcement learning and bayesian optimisation - kddcup 2019. GitHub.
  • O. Bent (2018) AI in the fight against malaria.
  • A. Y. N. Coates et al. (2006) Autonomous inverted helicopter flight via reinforcement learning. Experimental Robotics IX, pp. 363–372.
  • D.C. Bentivegna and C.G. Atkeson (2001) Learning from observation using primitives. IEEE International Conference on Robotics and Automation.
  • fmfn (2019) Bayesian optimisation algorithm. GitHub.
  • D.E. Goldberg and K. Deb (1991) A comparative analysis of selection schemes used in genetic algorithms. Foundations of Genetic Algorithms 1, pp. 69–93.
  • J.H. Holland (1992) Genetic algorithms - computer programs that "evolve" in ways that resemble natural selection can solve complex problems even their creators do not fully understand.
  • N. Kohl and P. Stone (2004) Policy gradient reinforcement learning for fast quadrupedal locomotion. IEEE International Conference on Robotics and Automation.
  • M. Lopes, F. Melo, and L. Montesano (2009) Active learning for reward estimation in inverse reinforcement learning. In Machine Learning and Knowledge Discovery in Databases, W. Buntine, M. Grobelnik, D. Mladenić, and J. Shawe-Taylor (Eds.), Berlin, Heidelberg, pp. 31–46. ISBN 978-3-642-04174-7.
  • M. Moran (2007) The malaria product pipeline: planning for the future.
  • World Health Organization (WHO) Malaria.
  • Smith and M. Tanner (2008) Towards a comprehensive simulation model of malaria epidemiology and control.
  • R. Sutton and A. Barto (2018) Reinforcement learning: an introduction. pp. 38.
  • G. Tesauro (1995) Temporal difference learning and TD-Gammon. Communications of the ACM.
  • C. J. Watkins and P. Dayan (1992) Q-learning. Mach. Learn. 8, pp. 279–292.
  • X. Yan, P. Diaconis, P. Rusmevichientong, and B. V. Roy (2005) Solitaire: man versus machine. Advances in Neural Information Processing Systems 17.