COVID-19 has changed the lives of people all over the world. Since its outbreak, several countries have adopted safety measures such as social distancing, lockdowns, and the closure of certain types of economic activities. These safety measures were taken to slow down the spread of the pandemic in the world population. However, some safety measures may have a significant impact on a country's economy. For this reason, it is important to find a trade-off between the spread of the pandemic and the economic losses.
To this end, some previous works focused on the use of simulators to estimate the virus propagation and the economic losses caused by a given policy (Kompella et al., 2020; Trott et al., 2021). In this setting, a policy decides which safety measures to adopt in a given scenario. However, while these works have been shown to find policies with better trade-offs between economic losses and pandemic spread than the ones adopted by governments, they do not employ interpretable AI methods. Thus, even though such models perform very well, their applicability is limited by their lack of understandability. More specifically, the main drawbacks of non-interpretable approaches for this task are: a) the trained models act as "oracles" and, thus, a policymaker cannot understand the rationale underlying a decision; and b) such models cannot be inspected and, thus, their behavior cannot be formally specified.
In this work, we propose interpretable models for policy-making in pandemics. These models, which combine decision trees (DTs) trained by means of grammatical evolution with Q-learning, are very simple and objectively interpretable. Moreover, our solutions exhibit better performance than non-interpretable state-of-the-art models. Since our evolved policies are both more effective and more interpretable than existing black-box models, they are well suited for devising policies that control the pandemic while avoiding unnecessary economic damage.
This paper is structured as follows. The next section briefly reviews the state of the art. Section 3 describes the method used to evolve interpretable DTs. In Section 4 we show the experimental results and, in Section 5, we interpret the trees obtained. Finally, in Section 6 we draw our conclusions and suggest future work.
2. Related work
2.1. Policy-making for pandemics
In (Kompella et al., 2020), the authors propose a pandemic simulator that models a population at a very fine granularity, simulating the activities that each agent can perform, as well as various types of economic activities. The goal of this simulator is to test policies that simultaneously minimize the spread of the pandemic and the economic damage. The authors test various handcrafted policies, policies used by a few countries, and a deep reinforcement learning based policy. The results show that the deep reinforcement learning based policy outperforms both the handcrafted policies and the governmental ones.
In (Trott et al., 2021), the authors propose a simulator to estimate the effects of closing economic activities on both the spread of the pandemic and the economic losses. The proposed simulator is tailored to the U.S. economy, simulating all the 51 states and a central entity that manages subsidies.
The main difference between the two approaches described above is that the simulator in (Kompella et al., 2020) acts at a small scale but with a high level of detail, being able to simulate aspects of everyday life, while the simulator in (Trott et al., 2021) aims to work at a very large scale with a lower level of detail.
In (Miikkulainen et al., 2021), the authors use a surrogate-assisted evolutionary process to optimize an agent that applies restrictions to limit the spread of the pandemic. They make use of two neural networks: one that acts as the policy, and another, trained on real data, that estimates the quality of a policy. While the performance of this approach is promising, the use of black-box policies such as neural networks makes the real-world application of such systems difficult (Rudin, 2019).
2.2. Interpretable AI
In recent years, the need for interpretable AI has emerged. Interpretable AI (Barredo Arrieta et al., 2020) is the branch of AI that focuses on models that are inherently interpretable by humans. Such models are extremely important in high-stakes and safety-critical scenarios, where the ability to inspect a model and understand its behavior becomes crucial (Rudin, 2019).
While the field of interpretable AI is making progress, some of its sub-fields are still seen as crucial open areas. One of these is interpretable reinforcement learning (Rudin et al., 2021).
Lately, several works approached the problem of interpretable reinforcement learning.
Silva et al. (Silva et al., 2020) make use of differentiable DTs that are trained by means of the proximal policy optimization algorithm (Schulman et al., 2017). While this approach has shown very promising performance in the tested environments, its performance degrades significantly when transforming the differentiable DT into a regular DT.
Dhebar et al. (Dhebar et al., 2020) propose an evolutionary approach for producing DTs with nonlinear splits (i.e., split conditions defined by nonlinear functions of the inputs) for reinforcement learning tasks. This approach obtains very good performance in the tested tasks. However, the high non-linearity of the splits hinders the interpretability of the resulting DTs.
In (Custode and Iacca, 2020), the authors propose a two-level optimization approach that combines grammatical evolution (Ryan et al., 1998) and Q-learning (Watkins, 1989) for evolving DTs for reinforcement learning tasks. The approach is tested on classic reinforcement learning tasks, and the results show that the systems obtained are competitive w.r.t. non-interpretable state-of-the-art models while being very easy to interpret. However, this approach only works in environments with discrete action spaces.
In (Custode and Iacca, 2021), the authors extend the approach proposed in (Custode and Iacca, 2020) to scenarios with continuous action spaces. To do so, they make use of a co-evolutionary approach (Potter and De Jong, 1994) that simultaneously evolves decision trees and "pools" of continuous actions.
In (Custode and Iacca, 2022), the authors use genetic programming (Koza, 1992) and CMA-ES (Hansen and Ostermeier, 1996) to evolve interpretable DTs that work in RL settings with images as input. However, the experimental results show that the proposed approach exhibits good performance only in scenarios that are not affected by noise.
3. Method
3.1. The simulator
Since our goal is to evolve general policies that minimize the spread of the pandemic while reducing the economic losses, we employ the simulator proposed in (Kompella et al., 2020). In fact, this simulator allows us to test rules that are applicable in every country (i.e., not tailored only to the U.S. as in (Trott et al., 2021)) and that do not make use of economic subsidies.
The simulator, at each step, provides the following features:
Number of infected people since the beginning of the simulation;
Number of recovered people since the beginning of the simulation;
Number of patients in critical conditions since the beginning of the simulation;
Number of dead people since the beginning of the simulation;
Number of people that did not contract the virus from the beginning of the simulation until the current simulation day;
Number of daily infected people;
Number of daily recovered people;
Number of patients in critical conditions in the current simulation day;
Number of dead people in the current simulation day;
Number of people that did not contract the virus in the current day (for consistency with the experiments reported in (Kompella et al., 2020), we kept both this value and its cumulative counterpart as inputs to the policy-making agent);
The current level of ongoing restrictions;
A Boolean variable indicating whether the capacity of the hospitals is saturated.
All the variables are normalized by means of min-max normalization before being fed to the policy-making agent. Note that the simulator returns noisy estimates of the variables described above, to simulate a real-world scenario in which not all test results are known.
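As an illustration, the normalization step can be sketched as follows. Note that the feature names and bounds below are illustrative assumptions, not the simulator's actual interface:

```python
def min_max_normalize(value, lo, hi):
    """Scale `value` from the range [lo, hi] into [0, 1]."""
    if hi == lo:  # degenerate range: avoid division by zero
        return 0.0
    return (value - lo) / (hi - lo)

def normalize_observation(obs, bounds):
    """Normalize every feature of an observation dict given per-feature bounds."""
    return {k: min_max_normalize(v, *bounds[k]) for k, v in obs.items()}

# Hypothetical observation and bounds, for illustration only.
obs = {"infected_total": 120, "critical_today": 4, "restriction_stage": 2}
bounds = {"infected_total": (0, 1000), "critical_today": (0, 50), "restriction_stage": (0, 4)}
norm = normalize_observation(obs, bounds)
```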
The agent can choose an action among a pool of 5 actions:
Stage 0: No restrictions.
Stage 1: Stay at home if sick; gathering limits: 50 for low-risk people, 25 for high-risk people.
Stage 2: Limitations of Stage 1 + wearing masks + light social distancing; schools and hair salons are closed; gathering limits: 25 for low-risk people, 10 for high-risk people.
Stage 3: Limitations of Stage 2 + moderate social distancing + no gatherings.
Stage 4: Limitations of Stage 3 + heavy social distancing + offices and retail stores closed.
A more detailed list of actions is described in (Kompella et al., 2020).
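For reference, the action space can be sketched as a simple enumeration. The member names below are ours; the full regulation details are defined inside the simulator:

```python
from enum import IntEnum

class Stage(IntEnum):
    """The five restriction stages available to the agent (names are illustrative)."""
    NO_RESTRICTIONS = 0
    STAY_HOME_IF_SICK = 1       # gathering limits: 50 low-risk / 25 high-risk
    MASKS_LIGHT_DISTANCING = 2  # + schools and hair salons closed; limits 25/10
    MODERATE_DISTANCING = 3     # + no gatherings
    HEAVY_DISTANCING = 4        # + offices and retail stores closed

# The discrete action space is simply the five stages.
ACTIONS = list(Stage)
```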
At each step, the agent receives a reward that penalizes both the number of patients in critical conditions, expressed relative to the capacity of the hospitals, and the stringency level of the current restrictions.
Thus, the reward function trades off the number of patients in critical conditions against the stringency level of the restrictions. Moreover, this function indirectly induces the agent to also minimize the closing of stores, limiting the economic losses.
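A minimal sketch of such a reward is given below. The weights ALPHA and BETA are illustrative assumptions, not the simulator's actual coefficients:

```python
# Illustrative weights: the simulator's actual coefficients are not reproduced here.
ALPHA, BETA = 1.0, 0.1
MAX_STAGE = 4

def reward(critical_patients, hospital_capacity, stage):
    """Negative reward combining hospital pressure and restriction stringency."""
    critical_term = critical_patients / hospital_capacity  # pressure on hospitals
    stringency_term = stage / MAX_STAGE                    # stringency of restrictions
    return -(ALPHA * critical_term + BETA * stringency_term)
```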
3.2. Evolution of decision trees
Since the action space of this environment is discrete and the interpretability of the policies is crucial for their application in real-world scenarios, we employ the method described in (Custode and Iacca, 2020) for evolving policies in the form of DTs.
Thus, we use grammatical evolution to evolve DTs with empty leaves, i.e., leaves that do not perform any action. In the fitness evaluation phase, the leaves are then trained by means of Q-learning (Watkins, 1989), by exploiting the reward signals coming from the environment.
A graphical illustration of the fitness evaluation phase is shown in Figure 1.
The evolutionary process is iterated for 50 generations using a population of 45 individuals. All the parameters (including the ones shown in the following subsections, e.g., mutation rate and tournament size) were empirically tuned to balance performance and computational cost. We perform 10 i.i.d. experimental runs to assess the statistical repeatability of our results.
3.2.1. Individual encoding
Each individual is encoded as a list of integers, ranging from 0 to a maximum value chosen to be significantly higher than the maximum number of productions in the grammar.
3.2.2. Genotype to phenotype mapping
Each genotype (i.e., a list of integers) is transformed into its corresponding phenotype, i.e., a Python program that encodes the corresponding DT by means of "if-then-else" statements. To do so, we make use of the grammar shown in Table 1. Starting from the first gene and from a string containing only the rule "dt", each gene is used to replace the first rule currently found in the string with one of its productions; the production chosen is the one whose index equals the value of the current gene modulo the number of productions associated with the current rule.
Note that the leaves of the resulting tree are not defined, i.e., they do not have a fixed action. This allows us to perform Q-learning in the fitness evaluation phase, as mentioned above.
3.2.3. Selection
To select the individuals for mutation/crossover, we use tournament selection with tournament size 2.
3.2.4. Mutation
The mutation operator used in our experiments is the uniform mutation, which replaces the current gene with a new integer uniformly sampled in the allowed range of values. The mutation operator is applied with probability 1 (i.e., to all the individuals in the population), while the mutation rate (i.e., the probability of mutating each gene) was tuned empirically.
3.2.5. Crossover
Since crossover proved to be too "destructive" in preliminary experiments, we do not employ it in the experiments reported below. By destructive, we mean that this operator can produce offspring that are significantly different from both parents, hindering the exploitation of the good structures that emerged during the evolutionary process.
3.2.6. Replacement
To increase the exploitation capabilities of our approach, we introduce a replacement operator. This operator replaces a parent with an offspring only if the offspring has (strictly) better fitness than the parent.
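The selection, mutation, and replacement operators can be sketched as follows. MAX_GENE and MUTATION_RATE are placeholder values, since the empirically tuned values are not reported here:

```python
import random

MAX_GENE = 1000       # placeholder upper bound for gene values
MUTATION_RATE = 0.05  # placeholder per-gene mutation probability

def tournament_selection(population, fitness, k=2):
    """Pick k random individuals and return the index of the fittest."""
    contestants = random.sample(range(len(population)), k)
    return max(contestants, key=lambda i: fitness[i])

def uniform_mutation(genotype):
    """Replace each gene, with probability MUTATION_RATE, by a uniform random integer."""
    return [random.randint(0, MAX_GENE) if random.random() < MUTATION_RATE else g
            for g in genotype]

def replace_if_better(population, fitness, parent_idx, offspring, offspring_fitness):
    """Replacement operator: keep the offspring only if it is STRICTLY fitter."""
    if offspring_fitness > fitness[parent_idx]:
        population[parent_idx] = offspring
        fitness[parent_idx] = offspring_fitness
```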
3.2.7. Fitness evaluation
The fitness of each phenotype is determined by using the corresponding policy to make decisions in the PandemicSimulator environment (https://github.com/SonyAI/PandemicSimulator). To learn the actions for the leaves, we employ epsilon-greedy Q-learning, with empirically tuned learning rate and exploration rate, and random initialization of the Q-values. Moreover, we evaluate each phenotype in 10 independent episodes, each of which simulates 100 days of pandemic.
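A minimal sketch of the leaf-level Q-learning follows. Each leaf stores one Q-value per action, while the evolved tree only decides which leaf a state falls into; the hyperparameter values below are illustrative, not the tuned ones:

```python
import random

N_ACTIONS = 5        # the five restriction stages
LEARNING_RATE = 0.1  # illustrative values, not the tuned ones
DISCOUNT = 0.9
EPSILON = 0.05

class Leaf:
    def __init__(self):
        # Random initialization of the Q-values (range is illustrative).
        self.q = [random.uniform(-1, 1) for _ in range(N_ACTIONS)]
        self.last_action = None

    def act(self):
        """Epsilon-greedy action selection over this leaf's Q-values."""
        if random.random() < EPSILON:
            self.last_action = random.randrange(N_ACTIONS)
        else:
            self.last_action = max(range(N_ACTIONS), key=self.q.__getitem__)
        return self.last_action

    def update(self, reward, next_leaf):
        """Standard Q-learning update on the last action taken in this leaf."""
        target = reward + DISCOUNT * max(next_leaf.q)
        self.q[self.last_action] += LEARNING_RATE * (target - self.q[self.last_action])
```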
4. Experimental results
Table 2 shows the scores obtained by the best agents in each independent run. Here, we differentiate between training and testing returns: training returns are the returns obtained by the agent during the training process, while testing returns are those obtained after the evolutionary process, i.e., when learning is disabled. Moreover, we also measure the interpretability of our solutions. To do so, we employ the interpretability metric proposed in (Virgolin et al., 2020), reworked as done in (Custode and Iacca, 2020). This metric is defined as a weighted linear combination of: the number of symbols appearing in the formula, the number of operations contained in the model, the number of non-arithmetical operations, and the maximum number of consecutive compositions of non-arithmetical operations.
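In generic form, the metric is a weighted sum of these four complexity counts. The sketch below uses placeholder weights of 1.0, not the coefficients from the cited papers:

```python
# Placeholder weights: the actual coefficients are those of (Virgolin et al., 2020),
# reworked as in (Custode and Iacca, 2020), and are NOT reproduced here.
WEIGHTS = {"symbols": 1.0, "ops": 1.0, "non_arith_ops": 1.0, "consec_non_arith": 1.0}

def interpretability_score(n_symbols, n_ops, n_non_arith, max_consec_non_arith):
    """Weighted linear combination of model-complexity counts (lower = simpler)."""
    return (WEIGHTS["symbols"] * n_symbols
            + WEIGHTS["ops"] * n_ops
            + WEIGHTS["non_arith_ops"] * n_non_arith
            + WEIGHTS["consec_non_arith"] * max_consec_non_arith)
```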
Figure 2, instead, compares the results obtained with the best DT produced by our method (i.e., the one corresponding to seed 3 in Table 2, which obtained the highest mean test return) to the policies reported in (Kompella et al., 2020). In particular, the policies used for the comparison are the following:
S0-4-0: Starts with stage 0; then, once 10 people have been infected, it switches to stage 4 and, after 30 days, it returns to stage 0.
S0-4-0FI: Similar to S0-4-0, but the return from stage 4 to stage 0 is performed gradually, reducing the restrictions by 1 stage every 5 days.
S0-4-0GI: Similar to S0-4-0, but the intermediate stages from stage 4 to stage 0 last 10 days instead of 5.
S0: Always applies stage 0 restrictions.
S1: Always applies stage 1 restrictions.
S2: Always applies stage 2 restrictions.
S3: Always applies stage 3 restrictions.
S4: Always applies stage 4 restrictions.
From Figure 2, we can see that our best DT outperforms all the hand-crafted policies under comparison (note that the reward is negative and is to be maximized). The statistical difference between the best evolved decision tree and the hand-crafted policies has been confirmed by a two-sided Wilcoxon test.
As for the deep reinforcement learning based policy presented in (Kompella et al., 2020), we could not test it directly, since neither the model nor its numerical results are publicly available. From the plots reported in the original paper, however, we estimate a cumulative reward of about -5, which is substantially lower than the cumulative reward obtained by our best DT.
Finally, in Figure 3 we compare the policy obtained by means of our best DT with the ones implemented by the Italian and Swedish governments (for which we use the implementations provided in (Kompella et al., 2020)). These policies act as follows:
ITA (approximation of the restrictions adopted by the Italian government): Increases the restrictions gradually from stage 0 to stage 4, and then gradually returns to stage 2.
SWE (approximation of the restrictions adopted by the Swedish government): Applies stage 0 regulations for 3 days and then stage 1 restrictions for the whole duration of the simulation.
Also in this case, we observe that the performance of our policy is significantly better than that of the compared policies, and a Wilcoxon test confirms the statistical significance of the results.
Moreover, we observe that the number of people infected when using our proposed policy is significantly smaller than with the other approaches. This fact, combined with the very high rewards obtained by our policy, suggests that the policy minimizes the economic losses by trying to stop the pandemic at its beginning, and then making the restrictions less stringent.
In the next section, we analyze the policy to understand the reasons underlying its significantly higher performance.
Table 2. Train and test mean returns of the best agent obtained in each independent run (Seed | Train mean return | Test mean return).
5. Interpretation of the best DT
The best DT obtained is shown in Figure 4. While the size of the DT is quite limited, its performance is extremely satisfactory. We can see that the DT makes use of two conditions to compute its output.
The first condition (the root) checks whether the fraction of never-infected people is greater than 90%. If so, the second condition is checked. Otherwise, stage 2 restrictions are applied, regardless of the other variables. We may hypothesize that, in the latter case, the policy does so because, once the number of people infected since the beginning of the pandemic exceeds 10%, the virus may spread very quickly if the restrictions are not adequate.
The second condition, instead, checks the number of people infected so far. If the known number of infected people is greater than zero, it applies stage 3 restrictions; otherwise, it applies no restrictions.
In essence, the combination of the two branches of the root induces a "strong" response to the initial wave of the pandemic, which is then slightly relaxed once the total number of cases increases. Thus, the overall strategy learned by the agent is to stop the pandemic at its beginning, and to keep slightly less stringent restrictions after the initial wave, to avoid new waves.
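The described behavior can be transcribed as a plain if-then-else policy. The variable names are ours; the thresholds are the ones read off the evolved tree:

```python
def evolved_policy(never_infected_fraction, total_infected):
    """Transcription of the evolved DT's behavior (variable names are illustrative)."""
    if never_infected_fraction > 0.90:
        if total_infected > 0:
            return 3  # strong response to the initial wave
        return 0      # no known cases yet: no restrictions
    return 2          # more than 10% ever infected: settle on stage 2
```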
6. Conclusions
In this paper, we leverage interpretable AI methodologies to train interpretable policies for managing a pandemic.
Despite the widely assumed trade-off between interpretability and performance, the policies obtained proved to be significantly better than handcrafted policies, deep learning policies, and governmental policies.
It is important to note that the results shown above have been obtained in a simulated scenario and, thus, their applicability to real-world scenarios must still be assessed.
A limitation of the current work is the simulation scenario adopted: in our work, we simulate a population of 1000 people. The assessment of the scalability of our results is left as future work.
Other future work includes testing the proposed methodology on different simulators with different objectives, and adopting scenarios that are more realistic w.r.t. economic aspects (e.g., subsidies) or epidemiological aspects (e.g., the emergence of new variants).
References
- Barredo Arrieta et al. (2020). Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion 58, pp. 82-115.
- Custode and Iacca (2020). Evolutionary learning of interpretable decision trees.
- Custode and Iacca (2021). A co-evolutionary approach to interpretable reinforcement learning in environments with continuous action spaces. In Symposium Series on Computational Intelligence (SSCI), New York, NY, USA, pp. 1-8.
- Custode and Iacca (2022). Interpretable pipelines with evolutionarily optimized modules for RL tasks with visual inputs.
- Dhebar et al. (2020). Interpretable-AI policies using evolutionary nonlinear decision trees for discrete action systems.
- Hansen and Ostermeier (1996). Adapting arbitrary normal mutation distributions in evolution strategies: the covariance matrix adaptation. In IEEE International Conference on Evolutionary Computation, New York, NY, USA, pp. 312-317.
- Kompella et al. (2020). Reinforcement learning for optimization of COVID-19 mitigation policies.
- Koza (1992). Genetic programming: On the programming of computers by means of natural selection. MIT Press, Cambridge, MA.
- Miikkulainen et al. (2021). From prediction to prescription: Evolutionary optimization of nonpharmaceutical interventions in the COVID-19 pandemic. IEEE Transactions on Evolutionary Computation 25(2), pp. 386-401.
- Potter and De Jong (1994). A cooperative coevolutionary approach to function optimization. In Parallel Problem Solving from Nature (PPSN III), Berlin, Heidelberg, pp. 249-257.
- Rudin et al. (2021). Interpretable machine learning: Fundamental principles and 10 grand challenges.
- Rudin (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1(5), pp. 206-215.
- Ryan et al. (1998). Grammatical evolution: Evolving programs for an arbitrary language. In European Conference on Genetic Programming, pp. 83-96.
- Schulman et al. (2017). Proximal policy optimization algorithms. arXiv:1707.06347.
- Silva et al. (2020). Optimization methods for interpretable differentiable decision trees applied to reinforcement learning. In International Conference on Artificial Intelligence and Statistics, Palermo, Italy, pp. 1855-1865.
- Trott et al. (2021). Building a foundation for data-driven, interpretable, and robust policy design using the AI Economist.
- Virgolin et al. (2020). Learning a formula of interpretability to learn interpretable formulas. In Parallel Problem Solving from Nature (PPSN XVI), Cham, pp. 79-93.
- Watkins (1989). Learning from delayed rewards. Ph.D. thesis, King's College, Cambridge, United Kingdom.