1. Introduction
Malaria is caused by parasites transmitted to people through the bites of infected mosquitoes, and it is one of the most dangerous diseases in the world. According to the WHO, in 2017 nearly half of the world's population was at risk of malaria; there were 219 million cases of malaria and about 435,000 malaria deaths. Sub-Saharan Africa is home to 92% of cases and 93% of deaths (WHO). Furthermore, about $450M is spent on research and development each year to deal with this disease (Moran, 2007). That is why finding effective combinations of interventions to prevent malaria infection was chosen as the topic of the KDD Cup 2019 challenge.
In this challenge, two interventions are considered to control malaria: distributing long-lasting insecticide-treated nets (ITNs) and performing indoor residual spraying (IRS) programs. Our goal is to determine the most effective policies over five years based on combinations of these two interventions. The cost-effectiveness of a policy depends on how much we use each intervention across the five years. For example, according to a report from IBM (Bent, 2018), in some transmission locations ITNs are the best intervention, while in Western Kenya performing IRS in a small proportion of households is more effective than deploying ITNs.
The most important ingredient needed to tackle the malaria problem is the environment. Fortunately, a simulator called OpenMalaria (Smith and Tanner, 2008) brings the real world to the machine world, allowing us to interact with the environment. OpenMalaria provides a simulation environment that returns a reward each time an agent takes an action. Based on it, interfaces were created for the challenge, each corresponding to an environment. In the KDD Cup, two interfaces were provided as training environments and a secret interface was used as the test environment to evaluate solutions.
In this report, we first introduce the challenge as a Reinforcement Learning problem with a limited number of observations. Second, we propose our solutions: Random Search as a baseline, a Genetic Algorithm, Bayesian Optimization, and Q-learning with sequence breaking, which is also our final submission. Finally, we compare the results of these algorithms and frame future approaches.
2. Malaria control policy as a Reinforcement Learning problem
The malaria challenge can be considered a sequential decision-making problem, a typical type of problem in Reinforcement Learning (RL). Therefore, in this section, we explain some RL notation before introducing the malaria challenge.
2.1. Reinforcement learning
The Reinforcement Learning (RL) paradigm enables autonomy in artificial intelligence (AI) machines: we do not need to teach them by providing data or prior knowledge; instead, they learn how to behave by themselves. This idea is inspired by the human development process, as we learn new skills not only from teachers but also from many mistakes (trial-and-error learning). RL has a long history of development and has achieved impressive results in robotics (Kohl and Stone, 2004; D.C. Bentivegna, 2001), control (Coates et al., 2006; Bagnell and Schneider, 2001), and games (Tesauro, 1995; Yan et al., 2005). Besides supervised learning and unsupervised learning, RL is an essential paradigm of machine learning. The idea of RL is depicted in Figure 1. Unlike supervised learning, where a model takes a set of examples with ground truth and learns from them, in RL an agent (similar to a model) learns from experience by interacting with an environment. Agents gain experience by taking actions and receiving rewards from the environment. The goal of an agent in RL is to maximize rewards by choosing proper actions in each situation. These situations are called states; a strategy for deciding an action in a specific state is called a policy, i.e. a function that maps a state to an action. Agents start from the initial state, follow the policy, and stop when they reach the terminal state; all states that agents go through between the initial state and the terminal state form an episode.
2.2. Malaria Control Challenge
The Malaria Control Challenge is a typical Reinforcement Learning problem that includes:

An Agent: It is the model that we need to build.

An Environment: It is a simulation provided by OpenMalaria. We have two training environments and one final test environment.

Actions: they combine the two interventions, Insecticide-Treated Nets (ITNs) and Indoor Residual Spraying (IRS). The first component is the deployment of nets, defined by the covered proportion of the population, a_ITN in [0, 1]. The second component is the application of seasonal spraying, defined by the proportion of the population covered by this intervention, a_IRS in [0, 1]. The action space is therefore the continuous space [0, 1] x [0, 1]. A policy maps the five years to a set of actions {(a_ITN, a_IRS) for each year}.

States: they are represented by the year, so we have five years corresponding to five states. In an RL problem, the next state usually depends on the action the agent takes in the current state; in this problem, however, there is no such state transition: whatever the action is, the state always advances to the next state (the next year). An episode always consists of five states.

Rewards: they are modeled as floating-point numbers; the environment returns a reward after each year and a reward for the whole episode. The latter is the sum of the five intermediate (per-state) rewards.
Our goal is to build an agent that explores the environment for 20 episodes, or 100 evaluations, to find the optimal policy with the highest reward. Reinforcement Learning tasks usually use hundreds or thousands of episodes to train the agent, but in this challenge we only have 20, and our action space is continuous. This is the most difficult part of the challenge. In the next section, we propose some methods to deal with this difficulty.
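To make this setup concrete, the following sketch mimics the structure of the problem: a policy assigns each of the five years an (ITN, IRS) coverage pair in [0, 1]^2, and the episode reward is the sum of the yearly rewards. The ToyMalariaEnv class and its reward surface are invented for illustration and do not reflect the real OpenMalaria-based interface.

```python
import random

class ToyMalariaEnv:
    """Hypothetical stand-in for the challenge environment."""

    def yearly_reward(self, year, itn, irs):
        # Arbitrary smooth reward surface, for illustration only.
        return 100 * (1 - (itn - 0.6) ** 2 - (irs - 0.2) ** 2)

    def evaluate(self, policy):
        """policy: {year: (itn, irs)} for years 1..5 -> total episode reward."""
        return sum(self.yearly_reward(y, itn, irs)
                   for y, (itn, irs) in sorted(policy.items()))

env = ToyMalariaEnv()
# One episode: a random coverage pair for each of the five years.
policy = {y: (random.random(), random.random()) for y in range(1, 6)}
total_reward = env.evaluate(policy)
```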
3. Solutions
3.1. Random Search (Baseline)
For the Random Search algorithm, the agent simply picks 20 arbitrary sets of actions corresponding to 20 episodes, stores the reward of each episode, and finally chooses the set of actions with the highest reward as the final policy. This approach is very simple and straightforward, and it serves only as a baseline for better solutions. However, in our case of very limited episodes, this baseline is already very hard to beat.
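The baseline can be sketched in a few lines; the `evaluate` callable stands in for one episode of the environment and is an assumption, not the challenge API:

```python
import random

def random_search(evaluate, n_episodes=20, n_years=5):
    """Baseline: sample n_episodes random policies and keep the best one.
    `evaluate` maps a policy {year: (itn, irs)} to its episode reward."""
    best_policy, best_reward = None, float("-inf")
    for _ in range(n_episodes):
        # One random coverage pair per year, each component in [0, 1].
        policy = {y: (random.random(), random.random())
                  for y in range(1, n_years + 1)}
        reward = evaluate(policy)
        if reward > best_reward:
            best_policy, best_reward = policy, reward
    return best_policy, best_reward
```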
3.2. Genetic Algorithm
The Genetic Algorithm (GA) is a biologically inspired, population-based algorithm (Holland, 1992). It starts from an initial population of policies, called the 1st generation, and aims to improve it through the creation of new generations that simulate natural selection.
GA uses the term fitness, which indicates how good a policy is. Fitness can be seen as the analogue of accuracy in general machine learning algorithms and is normalized to [0, 1] for this problem. Operations are actions that modify policies in a genetically inspired way. In this paper, two operations, Mutation and Crossover, are used. The Crossover operation mixes two policies randomly or in order: a policy contains 5 tuples of 2 values, so when we combine two policies using Crossover, some of their tuples are exchanged and new policies are created. The Mutation operation changes the value of a tuple randomly: given a policy, the values of one or more of its tuples can be increased or decreased by adding a noise value.
From the current population, which is a set of policies and their rewards, we select candidates for crossover and mutation with the Roulette Wheel Selection algorithm (Goldberg D.E., 1991), where the probability of selecting a policy is its fitness divided by the sum of the fitness of all policies, as defined in equation (1):

p_i = f_i / Σ_{j=1..N} f_j    (1)

where f_i is the fitness of policy i and N is the population size.
The complete workflow is as follows: the 1st generation is created stochastically, then evaluated in the environment to obtain its rewards. Roulette Wheel Selection chooses the 2 best policies, which are mated using the Crossover and Mutation operations. The new child is pushed into the existing population and the process continues.
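The three GA ingredients above (roulette wheel selection per equation (1), tuple-wise crossover, and noise-based mutation) can be sketched as follows; the exact noise range and per-tuple exchange probability are assumptions for illustration:

```python
import random

def fitness_probs(fitnesses):
    # Roulette Wheel Selection, eq. (1): p_i = f_i / sum_j f_j.
    total = sum(fitnesses)
    return [f / total for f in fitnesses]

def roulette_select(population, fitnesses):
    # Draw one policy with probability proportional to its fitness.
    return random.choices(population, weights=fitnesses, k=1)[0]

def crossover(p1, p2):
    # Exchange each year's (ITN, IRS) tuple between parents with prob. 0.5.
    return {y: (p1[y] if random.random() < 0.5 else p2[y]) for y in p1}

def mutate(policy, noise=0.1):
    # Perturb one random year's tuple by uniform noise, clipped to [0, 1].
    y = random.choice(list(policy))
    itn, irs = policy[y]
    mutated = dict(policy)
    mutated[y] = (min(1.0, max(0.0, itn + random.uniform(-noise, noise))),
                  min(1.0, max(0.0, irs + random.uniform(-noise, noise))))
    return mutated
```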
3.3. Bayesian Optimisation (BO)
Before introducing Bayesian Optimization, we first present Active Learning.
3.3.1. What is Active Learning (AL):
For certain AI challenges, we have at our disposal an Oracle/Expert that can answer the targeted question. As an example, suppose we want to build optical character recognition (OCR) for Street View house numbers. For a supervised learning approach, a human expert has to label a large number of images with the house number. The trivial way would be to randomly choose images to label; in this case, some images would look very similar and would not help the algorithm learn more. An AL algorithm, on the other hand, estimates which image would most improve the current model and then asks the Oracle for its label.
3.3.2. Active Learning in KDD Cup:
In the KDD Cup 2019 challenge, the online server represents an Oracle, and the agent can ask it only 100 queries. The goal of the AL module is to maximize the knowledge gained by the n-th query based on the results of the previous n-1 queries. The AL algorithm can compute the distance between all queried policies and an untested one: maximizing this distance explores the action space, while querying around the best values so far optimizes the best policy.
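The distance-based selection described above can be sketched as follows, with policies flattened to plain coordinate tuples; the function name and the explore/exploit switch are illustrative assumptions:

```python
import math

def next_query(candidates, queried, explore=True):
    """Pick the next policy to query.

    Exploration: the candidate farthest from every previously queried point.
    Exploitation: the candidate closest to a previously queried point
    (i.e. refining around known values)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Score each candidate by its distance to the nearest queried point.
    score = lambda c: min(dist(c, q) for q in queried)
    return max(candidates, key=score) if explore else min(candidates, key=score)
```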
3.3.3. Bayesian Optimisation and Active Learning:
Bayesian Optimization is a global optimization algorithm based on Gaussian processes (fmfn, 2019). It approximates a function using the Upper Confidence Bound (UCB); the interpolation becomes more and more precise as more points are queried from the oracle. The goal is to find the maximum of the function and keep querying around this maximum thanks to its Active Learning module.
As seen in Figure 2, the algorithm is also able to approximate functions perturbed by white noise. The utility function estimates the next best point to query; a parameter kappa controls the balance between exploration and exploitation (e.g. finding another maximum vs. querying around the one already found).
3.3.4. Implementation:
A trivial implementation of BO would be to approximate 5 functions f_i, with i representing the year. The input of f_i is the pair of actions in year i, and the approximated output is the obtained reward.
In Figure 3, f_1 is plotted using exhaustive search: 1600 (40 by 40) queries were performed to obtain this precise representation. The queried environment was the first one provided for this challenge; we will call it Environment 1, or Sequential Decision Making.
Figure 4 shows the approximation of f_1 using BO (fmfn, 2019). We notice that the shape of the distributions is not well drawn: the Gaussian with low values centered on the point (0.2, 0.9) is approximated differently from the one centered on the point (0.8, 0.9). Also, BO was not able to detect the separation between the 2 Gaussians with high reward (Figure 4, top left). Despite these errors, BO detects the maximum of the function using only 89 points. More precisely, for Environment 1, 25 to 30 queries are enough to find the maximum.
Since the reward of year i is affected by previous actions, estimating 5 separate functions is not a general solution. To solve this issue, we propose 3 different approaches:

Algorithm BO.1 (Figure 5) uses Bayesian Optimisation for each year: the 5 functions are approximated in a greedy way, since more query points are reserved for the first years. For example, we first approximate year 1 with 15 points, then we start approximating year 2 based on the optimal action of year 1. From this step on, for each new episode, we query year 1 to optimize the current best action, then we query year 2 to explore the action space. Iteratively, we keep going through the years until we reach year 5.

Algorithm BO.2 (Belaid, 2019) uses only one instance of Bayesian Optimisation: the approximated function has ten dimensions as input (the full policy) and one dimension as output (the approximated total reward).

Algorithm BO.3, called "BO with a forward Boosting Network" (Belaid, 2019), combines both preceding algorithms through an Ensemble Learning technique: multiple instances of BO with different inputs and outputs are first trained. In a second step, a small neural network learns which instance of BO approximates the environment better; based on these weights, the agent queries the online environment. Upon reception of the new rewards, the BO instances and the neural network are retrained. To keep the architecture simple, Figure 6 represents only the first 2 years.
3.4. Q-learning with sequence breaking
From another perspective, we can consider the malaria problem as a sequential decision-making problem in which the reward for a given year depends on all actions chosen in previous years. Q-learning was designed to solve this kind of problem. In this section, we introduce plain Q-learning and the way we modify it to apply it to the malaria problem. This is the solution that we submitted to the KDD competition.
3.4.1. Plain Q-learning
Q-learning (Watkins and Dayan, 1992), an early breakthrough in reinforcement learning (Sutton and Barto, 2018), is defined by

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]    (2)

This equation expresses that the value Q(s_t, a_t) of an action a_t in a state s_t, which evaluates the goodness of that action in that state, is updated based on: (1) the current value the agent already knows, (2) the reward r_{t+1} the agent received after taking action a_t, (3) the maximum action-value over the next state s_{t+1} with discount factor γ, and (4) the learning rate α, which weights the old value against the newly obtained one. If α = 0, the agent learns nothing and the Q-values are unchanged, while if α = 1, the agent forgets everything from its experience. The discount factor γ expresses the weight of the reward at each future time step (γ r, γ² r, ...): the further the time step, the smaller the weight. All action-values are stored in a table called the Q-table, which holds every state with every possible action the agent can take; the agent uses this table to choose the best action. The Q-values in this table do not depend on the way the agent chooses actions to take, i.e. Q-learning is off-policy. But in this challenge, with only 100 evaluations (20 episodes) and a continuous action space, it is hard for Q-learning to perform well.
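A single tabular update following equation (2) can be written directly; the state and action names below are illustrative:

```python
from collections import defaultdict

def q_update(Q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step, eq. (2):
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
    return Q

# The Q-table maps (state, action) pairs to values, defaulting to 0.
Q = defaultdict(float)
q_update(Q, "year1", (0.3, 0.6), 1.0, "year2", [(0.3, 0.6), (0.6, 0.0)])
```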
3.4.2. Q-learning with sequence breaking
Plain Q-learning requires a discrete action space. To satisfy this condition, we only consider actions with one decimal digit and an integer part of zero, i.e. multiples of 0.1 in [0, 0.9]; for example, [0.1, 0.2] is a pair of actions for a year. There are in total 100 pairs of actions in our action space. By doing this, we can easily apply algorithms that only work on discrete action spaces, like Q-learning. Furthermore, in the test environments, rewards follow multivariate Gaussian distributions, so we can quickly find areas that give us a high reward. With this approach we may not reach the global optimum, because the best action might not be among the 100 pairs in our space; but with limited episodes, it is a good way to come up with a pair of actions that gives a high reward.
After discretizing our action space, if we only apply simple Q-learning, it works well with a high number of episodes (Figure 8); with only 20 episodes, however, it is no better than the baseline. Therefore, we decided to exploit the beginning of each run to boost the Q-learning result. The key idea is to break the sequence of actions into 2 parts: the first year and the other years. For the first year, we spend 20 evaluations to find the pair of actions that returns the maximum immediate reward in that year, ignoring the relationship between the first year and the other years. To determine this best action, we first use a grid search of size 4x4 to explore the environment and find areas with potentially high reward: we check the 16 pairs of actions obtained by combining 2 values from the list [0.0, 0.3, 0.6, 0.9], for example [0.3, 0.6] or [0.9, 0.0], and choose the pair with the highest reward among these 16. Second, we use the 4 remaining evaluations to exploit the area around the chosen pair. To do so, we query one random pair of actions around the current best one; if it gives a higher reward than the best action, we follow that direction and check the pair of actions that lies further along it. For example, if [0.3, 0.6] is the best action after the grid search, one random action around it is checked, say [0.2, 0.6]; if this action is better than [0.3, 0.6], the next action checked is [0.1, 0.6]; otherwise, we simply pick another random action around the current best one. Figure 7 illustrates the idea. After finding the best action for the 1st year, we fix it and apply Q-learning to the other years. For Q-learning, we use an ε-greedy policy to choose actions, and a learning rate that decays with n, the number of times the action has been taken. The source code for this solution is available at https://github.com/bach1292/KDD_Cup_2019_LOLS_Team.git
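The first-year budget described above (16 grid evaluations plus 4 local ones) can be sketched as follows. This is a simplified sketch: it samples a random neighbour at each exploitation step rather than continuing along an improving direction, and `reward_fn` is a hypothetical stand-in for the year-1 immediate reward:

```python
import random

def best_first_year_action(reward_fn, budget=20):
    """Sequence breaking for year 1: coarse 4x4 grid search, then local
    exploitation with the remaining budget."""
    grid = [0.0, 0.3, 0.6, 0.9]
    best, best_r = None, float("-inf")
    # Exploration: 16 evaluations on the coarse grid.
    for itn in grid:
        for irs in grid:
            r = reward_fn(itn, irs)
            if r > best_r:
                best, best_r = (itn, irs), r
    # Exploitation: remaining evaluations on 0.1-step neighbours of the best.
    for _ in range(budget - 16):
        step = random.choice([(-0.1, 0), (0.1, 0), (0, -0.1), (0, 0.1)])
        cand = (round(min(0.9, max(0.0, best[0] + step[0])), 1),
                round(min(0.9, max(0.0, best[1] + step[1])), 1))
        r = reward_fn(*cand)
        if r > best_r:
            best, best_r = cand, r
    return best, best_r
```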
Another simple approach is to apply a grid search to all five years, ignoring the relation between the years. This completely breaks the sequential relation across the five years and only applies grid search. The result was even better than combining with Q-learning; however, due to the restrictions of the competition, we could not submit it.
4. Results
4.1. Genetic Algorithm
The Mutation operation is the main way to control the exploration-exploitation trade-off: the lower the noise value, the more exploitation-based the learning is.
4.2. Bayesian Optimization
Table 1. Average rewards.

Algorithms               | Env. 1         | Env. 2
Random Search (5 years)  | 161.20 (100%)  | 158.991 (100%)
BO.1: 5 indep. BO        | 250 (155%)     | 500 (315%)
BO.2: 10-dim BO          | 400 (248%)     | 200 (126%)
Concerning the first implementation (BO.1), only a few queries are left for the episodes reaching the terminal state. Therefore, year 5's action space is not well explored compared to the first years, but we generally get a high total reward. For Environment 1, the maximum reward is around 110 for each year, approximated using the exhaustive search described above. As shown in Table 1, BO.1 obtains a score close to the maximum in Environment 2, while BO.2 obtains a score close to the maximum in Environment 1. This is explained by the nature of each environment:

In Environment 1, the reward of year i depends only on 2 actions: the current one and the one before. By breaking the sequence over the 5 years, we learn the best action of year i before proceeding with year i+1. Hence, algorithm BO.1 can optimize year i+1's reward for a fixed action of year i.

In Environment 2, the reward of year i depends on the current action and all previous actions. A 10-dimensional function, as presented in BO.2, is able to capture this relation and output the best policy.
For BO.3, only a proof of concept was implemented: the algorithm can learn the best policy only for the first 2 years. Results are shown in Table 2.
Table 2. Average rewards over the first 2 years.

Algorithms               | Env. 1      | Env. 2
Random Search (2 years)  | 35 (100%)   | 105 (100%)
BO.3: Boosting Network   | 70 (200%)   | 210 (140%)
Compared to the first two algorithms, BO.3 achieves stable results, since it strikes a balance between the two other solutions. This improvement is achieved by learning the two weights of the BO instances; as an example, in Figure 10 the best weights are (0.78, 0.82). However, 5 to 12 episodes are usually needed before reaching the right weights; therefore, BO.3 cannot outperform the two other algorithms.
4.3. Q-learning with sequence breaking
To evaluate the results, each algorithm is run 10 times, with 20 episodes each time and no knowledge transferred between runs, and we take the average over these 10 runs. This evaluation makes sure that the algorithms produce better results than the baseline. Figure 11 shows the comparison between the algorithms. As we can see, simple Q-learning does not perform much better than Random Search; in some runs it is even worse than the baseline. Meanwhile, both Q-learning with first-year breaking and the complete sequence-breaking approach outperform the others. The best approach is completely breaking the sequence; however, we were not allowed to submit it to the competition, so it remains a proposed solution. Table 3 shows the average rewards of each algorithm and the comparison between them.
Table 3. Average rewards.

Algorithms                   | Env. 1         | Env. 2
Random Search (Baseline)     | 161.20 (100%)  | 158.991 (100%)
Break. Seq. for 5 years      | 434.36 (290%)  | 423.019 (267%)
Genetic Algorithm            | 148.18 (91%)   | 262.83 (163%)
Q-learning                   | 209.51 (130%)  | 185.055 (117%)
1st year break + Q-learning  | 289.93 (180%)  | 428.01 (269%)
Both sequence-breaking algorithms, breaking the sequence for the first year or for the whole five years, share common advantages and disadvantages. In short runs they work better than the baseline and than traditional Reinforcement Learning algorithms like plain Q-learning, and they are quite simple to implement. For long runs with enough episodes, simple Q-learning may be the better choice. The disadvantage of these approaches is that if the maximum reward in the first year leads to very bad rewards in the next four years, the result will be a disaster. Furthermore, because we discretized the action space, we cannot reach the global optimum. Nevertheless, with limited observations, these algorithms are reasonable choices.
5. Conclusion
Sequential decision making is always an interesting task for Reinforcement Learning and can be applied to many real-world problems such as malaria control. However, with a very small number of observations, this task is very hard to solve, and traditional Reinforcement Learning algorithms cannot beat the baseline. This paper has introduced enhanced algorithms that deal with this difficulty and improve the performance of traditional methods. We hope that our solutions will help in controlling the spread of malaria, or at least bring up helpful ideas.
References
Bagnell and Schneider (2001). Autonomous helicopter control using reinforcement learning policy search methods. IEEE International Conference on Robotics and Automation.
Belaid (2019). Malaria control using reinforcement learning and Bayesian optimisation, KDD Cup 2019. GitHub. https://github.com/Karim53/ReinforcementLearningMalariaControlKDDCup2019/
Bent (2018). AI in the fight against malaria.
Coates et al. (2006). Autonomous inverted helicopter flight via reinforcement learning. Experimental Robotics IX, pp. 363–372.
D.C. Bentivegna (2001). Learning from observation using primitives. IEEE International Conference on Robotics and Automation.
fmfn (2019). Bayesian optimisation algorithm. GitHub. https://github.com/fmfn/BayesianOptimization
Goldberg, D.E. (1991). A comparative analysis of selection schemes used in genetic algorithms. Foundations of Genetic Algorithms 1, pp. 69–93.
Holland (1992). Genetic algorithms: computer programs that "evolve" in ways that resemble natural selection can solve complex problems even their creators do not fully understand.
Kohl and Stone (2004). Policy gradient reinforcement learning for fast quadrupedal locomotion. IEEE International Conference on Robotics and Automation.
Active learning for reward estimation in inverse reinforcement learning. In Machine Learning and Knowledge Discovery in Databases, W. Buntine, M. Grobelnik, D. Mladenić, and J. Shawe-Taylor (Eds.), Berlin, Heidelberg, pp. 31–46. ISBN 9783642041747.
Moran (2007). The malaria product pipeline: planning for the future.
World Health Organization (WHO). Malaria.
Smith and Tanner (2008). Towards a comprehensive simulation model of malaria epidemiology and control.
Sutton and Barto (2018). Reinforcement learning: an introduction.
Tesauro (1995). Temporal difference learning and TD-Gammon. Communications of the ACM.
Watkins and Dayan (1992). Q-learning. Machine Learning 8, pp. 279–292.
Yan et al. (2005). Solitaire: man versus machine. Advances in Neural Information Processing Systems 17.