Interpretable Policies for Reinforcement Learning by Genetic Programming

12/12/2017 ∙ by Daniel Hein, et al. ∙ Technische Universität München

The search for interpretable reinforcement learning policies is of high academic and industrial interest. Especially for industrial systems, domain experts are more likely to deploy autonomously learned controllers if they are understandable and convenient to evaluate. Basic algebraic equations are supposed to meet these requirements, as long as they are restricted to an adequate complexity. Here we introduce the genetic programming for reinforcement learning (GPRL) approach, which is based on model-based batch reinforcement learning and genetic programming and autonomously learns policy equations from pre-existing default state-action trajectory samples. GPRL is compared to a straightforward method that utilizes genetic programming for symbolic regression, yielding policies that imitate an existing well-performing, but non-interpretable policy. Experiments on three reinforcement learning benchmarks, i.e., mountain car, cart-pole balancing, and industrial benchmark, demonstrate the superiority of our GPRL approach compared to the symbolic regression method. GPRL is capable of producing well-performing interpretable reinforcement learning policies from pre-existing default trajectory data.




1 Introduction

This work introduces a genetic programming (GP) approach for autonomously learning interpretable reinforcement learning (RL) (Sutton and Barto, 1998) policies from previously recorded state transitions. Although the search for interpretable RL policies is of high academic and industrial interest, little has been published concerning human-interpretable and understandable policies trained by data-driven learning methods (Maes et al., 2012). Recent research results show that using fuzzy rules in batch RL settings can be considered an adequate solution to this task (Hein et al., 2017b). However, in many cases the successful use of fuzzy rules requires prior knowledge about the shape of the membership functions, the number of fuzzy rules, the relevant state features, etc. Moreover, for some problems a policy represented as a set of fuzzy rules might generally be considered unfavorable by domain experts. Our genetic programming for reinforcement learning (GPRL) approach learns policies that are represented by basic algebraic equations of low complexity.

The GPRL approach is motivated by typical industrial application scenarios like wind or gas turbines. For industrial systems, low-level control is realized by dedicated expert-designed controllers, which guarantee safety and stability. However, we observed that high-level control is usually implemented by default control strategies, provided by best practice approaches or domain experts who are maintaining the system, based on personal experience and knowledge about the system’s dynamics. One reason for the lack of autonomously generated real-world controllers is that modeling system dependencies for high-level control by a first principle model is a complicated and often infeasible approach. Since in many real-world applications such representations cannot be found, training high-level controllers has to be performed on data samples from the system. RL (Sutton and Barto, 1998) is capable of yielding high-level controllers based solely on available system data.

RL is concerned with learning a policy for a system that can be modeled as a Markov decision process. This policy maps from system states to actions in the system. Repeatedly applying an RL policy generates a trajectory in the state-action space (Section 2). Based on our experience, learning such RL controllers in a way that produces interpretable high-level controllers is of high interest, especially for real-world industry problems, because interpretable RL policies are expected to yield higher acceptance from domain experts than black-box solutions.

GP has been utilized for creating rule-based policies since its introduction by Koza (1992). Since then, the field of GP has grown significantly and has produced numerous results that can compete with human-produced results, including controllers, game playing, and robotics (Koza, 2010). Keane et al. (2002) automatically synthesized a controller by using GP, outperforming conventional PID controllers for an industrially representative set of plants. Another approach using genetic algorithms for RL policy design is to learn a set of fuzzy “if-then” rules by modifying membership functions, rule sets, and consequent types (Juang et al., 2000). Recently, Koshiyama et al. (2014) introduced GPFIS, a genetic fuzzy controller based on multi-gene GP, and demonstrated its superiority over other genetic fuzzy controllers on the cart-centering and inverted pendulum problems. On a similar benchmark, a movable inverted pendulum, Shimooka and Fujimoto (1999) applied GP to generate equations that calculate the control force, evaluating the individuals’ performance on predefined fitness functions.

A fundamental drawback of all of the former methods is that in many real-world scenarios such dedicated expert-generated fitness functions do not exist. In RL, the goal is to derive well-performing policies only by (i) interacting with the environment, or by (ii) extracting knowledge from pre-generated data recorded while running the system with an arbitrary policy (Sutton and Barto, 1998). (i) is referred to as the online RL problem, for which Q-learning methods are known to produce excellent results. For (ii), the offline RL problem, model-based algorithms are usually more stable and yield better performing policies (Hein et al., 2017b).

GP in conjunction with online RL Q-learning has been used in (Downing, 2001) on standard maze search problems and in (Kamio and Iba, 2005) to enable a real robot to adapt its actions to a real environment. Katagiri et al. (2002) introduced genetic network programming (GNP), which has been applied to online RL in (Mabu et al., 2002) and improved by Q-tables in (Mabu et al., 2004). In these publications, the efficiency of GNP for generating RL policies has been discussed. This performance gain, in comparison to standard GP, comes at the cost of interpretability, since complex network graphs have to be traversed to compute the policy outputs.

To the best of our knowledge, GP-generated policies have never been combined with a model-based batch RL approach. In the proposed GPRL approach, the performance of a population of basic algebraic equations is evaluated by testing the individuals on a world model using the Monte Carlo method (Sutton and Barto, 1998). The combined return value of a number of action sequences is the fitness value that is maximized iteratively from GP generation to generation.

In batch RL, we consider applications where online learning approaches, such as classical temporal-difference learning (Sutton, 1988), are prohibited for safety reasons, since these approaches require exploration of system dynamics. In contrast, batch RL algorithms generate a policy based on existing data and deploy this policy to the system after training. In this setting, either the value function or the system dynamics are trained using historic operational data comprising a set of four-tuples of the form (observation, action, reward, next observation), which is referred to as a data batch. Research from the past two decades (Gordon, 1995; Ormoneit and Sen, 2002; Lagoudakis and Parr, 2003; Ernst et al., 2005) suggests that such batch RL algorithms satisfy real-world system requirements, particularly when involving neural networks (NNs) modeling either the state-action value function (Riedmiller, 2005a, b; Schneegaß et al., 2007a, b; Riedmiller et al., 2009) or system dynamics (Bakker, 2004; Schäfer, 2008; Depeweg et al., 2016). Moreover, batch RL algorithms are data-efficient (Riedmiller, 2005a; Schäfer et al., 2007) because batch data is utilized repeatedly during the training phase.

GPRL is a model-based approach, i.e., training is conducted on an environment approximation referred to as world model. Generating a world model from real system data in advance and training a GP policy using this model has several advantages. (i) In many real-world scenarios, data describing system dynamics is available in advance or is easily collected. (ii) Policies are not evaluated on the real system, thereby avoiding the detrimental effects of executing a bad policy. (iii) Expert-driven reward function engineering, yielding a closed-form differentiable equation, utilized during policy training is not required, i.e., it is sufficient to sample from the system’s reward function and model the underlying dependencies by using supervised machine learning.

Little has been published concerning GP and model-based batch RL. Gearhart (2003) examined GP as a policy search technique for Markov decision processes. Given a simulation of the Freecraft tactical problem, he performed Monte Carlo simulations to evaluate the fitness of each individual. Similarly, in (Maes et al., 2012), Monte Carlo simulations were performed in order to identify the best policies. However, the policy search itself was performed by formalizing a search over a space of simple closed-form formulas as a multi-armed bandit problem.

The remainder of this paper is organized as follows. The RL and GP methods employed in our framework are reviewed in Sections 2 and 3. Specifically, the problem of finding policies via RL is formalized as an optimization task. In addition, GP in general and the specific implementation that we used for experiments are motivated and presented. An overview of how the proposed GPRL approach is derived from different methods is given in Section 4. Experiments using three benchmark problems, i.e., the mountain car (MC) problem, the cart-pole balancing (CPB) task, and the industrial benchmark (IB), are described in Section 5. Experimental results are discussed in Section 6. The results demonstrate that the proposed GPRL approach can solve the benchmark problems and is able to produce interpretable RL policies. To benchmark GPRL, we compare the obtained results to an alternative approach in which GP is used to mimic an existing non-interpretable NN policy by symbolic regression.

2 Model-based Reinforcement Learning

Inspired by behaviourist psychology, RL is concerned with how software agents ought to take actions in an environment in order to maximize their received accumulated rewards. In RL, the acting agent is not explicitly told which actions to implement. Instead, the agent must learn the best action strategy from the observed environment’s rewards in response to the agent’s actions. Generally, such actions affect both the next reward and subsequent rewards (Sutton and Barto, 1998).

In RL formalism, at each discrete time step t = 0, 1, 2, …, the agent observes the system’s state s_t ∈ S and applies an action a_t ∈ A, where S is the state space and A is the action space. Depending on s_t and a_t, the system transitions to the next state s_{t+1} and the agent receives a real-valued reward r_{t+1} ∈ ℝ. In deterministic systems, the state transition can be expressed as a function g: S × A → S with g(s_t, a_t) = s_{t+1}. The related reward is given by a reward function r: S × A × S → ℝ with r(s_t, a_t, s_{t+1}) = r_{t+1}. Hence, the desired solution to an RL problem is a policy π: S → A that maximizes the expected accumulated rewards.

In our proposed setup, the goal is to find the best policy among the set of all possible equations that can be built from a pre-defined set of function building blocks, with respect to a certain maximum complexity. For every state s_t, the policy outputs an action, i.e., π(s_t) = a_t. The policy’s performance, when starting from s_t, is measured by the return R(s_t, π), i.e., the accumulated future rewards obtained by executing the policy π. To account for increasing uncertainties when accumulating future rewards, the reward r_{t+k+1} for future time steps is weighted by γ^k, where 0 < γ ≤ 1. Furthermore, adopting a common approach, we include only a finite number T of future rewards in the return (Sutton and Barto, 1998), which is expressed as follows:

R(s_t, π) = Σ_{k=0}^{T−1} γ^k · r(s_{t+k}, π(s_{t+k}), s_{t+k+1}).    (1)

Herein, we select the discount factor γ such that, at the end of the time horizon T, the last reward accounted for is weighted by a chosen factor q, yielding γ = q^{1/(T−1)}. The overall state-independent policy performance F(π) is obtained by averaging over all starting states s ∈ S, using their respective probabilities as weight factors. Thus, optimal solutions to the RL problem are policies π* with

π* ∈ arg max_π F(π),  where  F(π) = E_s[R(s, π)].    (2)

In optimization terminology, the policy performance function F(π) is referred to as a fitness function.

For most real-world industrial control problems, the cost of executing a potentially bad policy is prohibitive. Therefore, in model-based RL (Busoniu et al., 2010), the state transition function g is approximated using a model g̃, which can be a first-principle model or can be created from previously gathered data. By substituting the model g̃ in place of the real system g in (1), we obtain a model-based approximation of the true fitness function (2). In this study, we employ models based on NNs. However, the proposed method can be extended to other models, such as Bayesian NNs (Depeweg et al., 2016) and Gaussian process models (Rasmussen and Williams, 2006).
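To make the model-based fitness evaluation concrete, the following minimal Python sketch estimates a policy's fitness by Monte Carlo rollouts on a world model. All function names and the toy world model are hypothetical illustrations, not the paper's implementation:

```python
# Sketch of model-based Monte Carlo fitness estimation (hypothetical names).
# world_model(state, action) -> (next_state, reward) stands in for the
# trained NN approximation of the system dynamics and reward.

def rollout_return(policy, world_model, start_state, horizon, gamma):
    """Discounted return of `policy` from `start_state` on the world model."""
    state, ret = start_state, 0.0
    for k in range(horizon):
        action = policy(state)
        state, reward = world_model(state, action)
        ret += (gamma ** k) * reward
    return ret

def fitness(policy, world_model, start_states, horizon=100, gamma=0.99):
    """Average return over a fixed set of benchmark start states."""
    returns = [rollout_return(policy, world_model, s, horizon, gamma)
               for s in start_states]
    return sum(returns) / len(returns)
```

In GPRL, `policy` would be a candidate algebraic equation and `world_model` the trained NN model; the averaged return is the fitness value maximized from generation to generation.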

3 Genetic Programming

GP is a technique that encodes computer programs as a set of genes. Applying a so-called genetic algorithm (GA) to these genes to modify (evolve) them drives the optimization of the population. Generally, the space of solutions consists of computer programs that perform well on predefined tasks (Koza, 1992). Since we are interested in interpretable equations as RL policies, the genes in our setting include basic algebraic functions, as well as constant float numbers and state variables. Such basic algebraic functions can be depicted as function trees and stored efficiently in memory arrays.

The GA drives the optimization by applying selection and reproduction to the populations. The basis for both concepts is a fitness value, which represents how well each individual performs the predefined task. Selection means that only the best portion of the current generation will survive each iteration and continue existing in the next generation. Analogous to biological sexual breeding, two individuals are selected for reproduction based on their fitness, and two offspring individuals are created by crossing their chromosomes. Technically, this is realized by selecting compatible cutting points in the function trees and interchanging the subtrees beneath these cuts. Subsequently, the two resulting individuals are introduced to the population of the next generation (Figure 1). Herein, we applied tournament selection (Blickle and Thiele, 1995) to select the individuals to be crossed.

Figure 1: GP individuals as function trees. Depicted are four exemplary GP individuals from two consecutive generations with their respective complexity measures. Crossover cutting points are marked in the tree diagrams of two of the policies.
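The crossover operation described above can be sketched as follows; the nested-list tree encoding (e.g. `["+", ["*", "x", 2.0], "y"]`) and the helper names are illustrative assumptions, not the paper's actual data structures:

```python
import random

# Sketch of subtree crossover on function trees encoded as nested lists
# (hypothetical encoding): a function node is a list whose head is the
# function name and whose tail holds the child subtrees.

def nodes(tree, path=()):
    """Enumerate (path, subtree) pairs of a tree."""
    yield path, tree
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            yield from nodes(child, path + (i,))

def replace(tree, path, new):
    """Return a copy of `tree` with the subtree at `path` replaced by `new`."""
    if not path:
        return new
    out = list(tree)
    out[path[0]] = replace(out[path[0]], path[1:], new)
    return out

def crossover(a, b, rng=random):
    """Swap one randomly chosen subtree between two parent trees."""
    pa, sa = rng.choice(list(nodes(a)))
    pb, sb = rng.choice(list(nodes(b)))
    return replace(a, pa, sb), replace(b, pb, sa)
```

Because complete subtrees are exchanged, the two offspring are always syntactically valid function trees, and the total node count of the pair is conserved.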

In our experiments, it has proven advantageous to apply automatic equation cancellation to a certain fraction (defined by the auto cancellation ratio) of the best-performing individuals of one generation. For canceling, an algorithm searches the chromosomes for easy-to-cancel subtree structures, calculates the result of each such subtree, and replaces the subtree by this result. For example, if a subtree with root node + is found whose children are two float terminal nodes a and b, the subtree can be replaced by one float terminal node c, given by c = a + b. Similar cancellation rules can be derived for all functions in the function set. Since the tree depth is limited in our GP setup, this algorithm can reduce the complexity of substructures and even of the whole individual, as well as generate space for potentially more important subtrees.
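A minimal sketch of such a cancellation pass, assuming an illustrative nested-list tree encoding: a function node whose children are all float terminals is folded into a single float terminal.

```python
import operator

# Sketch of the automatic cancellation step (hypothetical encoding and
# function set): constant subtrees are folded bottom-up into their value.

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def cancel(tree):
    """Recursively fold constant subtrees, e.g. ["+", 1.0, 2.0] -> 3.0."""
    if not isinstance(tree, list):
        return tree  # variable or terminal: nothing to fold
    folded = [tree[0]] + [cancel(c) for c in tree[1:]]
    if all(isinstance(c, float) for c in folded[1:]):
        return OPS[folded[0]](*folded[1:])
    return folded
```

Folding bottom-up means a constant subtree deep inside an individual frees tree depth for potentially more useful structure, as described above.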

As a mutation operator, we adopted the so-called Gaussian mutator for float terminals, which is common in evolutionary algorithms (Schwefel, 1981, 1995). In each generation, a certain fraction (according to the terminal mutation ratio) of the best-performing individuals for each complexity is selected. Subsequently, these individuals are copied and their original float terminals are mutated by drawing replacement terminals from a normal distribution centered at the original value. If the performance of the best copy is superior to that of the original individual, it is added to the new population. This strategy provides an option for conducting a local search in the policy search space, because the basic structure of the individual’s genotype remains untouched.
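This local-search step can be sketched as follows (the tree encoding and names are illustrative assumptions): float terminals of a copy are perturbed with Gaussian noise while the tree structure is preserved; the copy would then only be kept if its fitness improves.

```python
import random

# Sketch of the Gaussian mutator for float terminals (hypothetical
# nested-list encoding): only float leaves are perturbed; function names
# and state variables stay untouched, so the genotype's shape is preserved.

def mutate_terminals(tree, sigma=0.1, rng=random):
    if isinstance(tree, float):
        return rng.gauss(tree, sigma)
    if isinstance(tree, list):
        return [tree[0]] + [mutate_terminals(c, sigma, rng) for c in tree[1:]]
    return tree  # variable name: unchanged
```

Keeping the structure fixed while jittering only constants is what makes this a local search around the individual's current genotype.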

Initially, the population is generated randomly, and so is a certain fraction of each population in every new generation (according to the new random individual ratio). A common strategy to randomly generate valid individuals is to apply the so-called grow method. In our implementation, growing a chromosome is realized as follows:

  1. Randomly draw tree depth d from {1, …, d_max}

  2. select_next_gene(d)

  3. Procedure: select_next_gene(d)

    1. If d = 0, randomly draw gene g from the set of terminals and variables; otherwise, randomly draw gene g from the set of functions

    2. Add g to chromosome c

    3. Randomly select one leaf l of g

    4. Build chromosome c_l = select_next_gene(d − 1)

    5. Add c_l to c at leaf l

    6. For all remaining leafs l′ of node g:

      1. Randomly draw subtree depth d′ from {0, …, d − 1}

      2. Build chromosome c′ = select_next_gene(d′)

      3. Add c′ to c at leaf l′

    7. Return c

Note that this algorithm enforces a broad variety of individuals, since it randomizes the length of each subtree (steps 1 and 3.6.1), as well as the position of the biggest subtree (step 3.3). Both properties save the GA from generating only small initial chromosomes, which eventually would result in an early convergence of the population.
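The grow procedure can be sketched as below (the function and terminal sets are illustrative placeholders): one child of every function node carries the remaining depth, while the other children draw their subtree depths at random.

```python
import random

# Sketch of the grow method (hypothetical gene sets and encoding).
# One randomly chosen child always receives depth - 1, so the tree reaches
# the drawn depth exactly; the remaining children draw shorter depths.

FUNCTIONS = {"+": 2, "*": 2}          # function name -> arity
TERMINALS = ["x", "y", 0.5, 1.0]      # state variables and float terminals

def grow(depth, rng=random):
    if depth == 0:
        return rng.choice(TERMINALS)
    name = rng.choice(list(FUNCTIONS))
    arity = FUNCTIONS[name]
    deep = rng.randrange(arity)       # position of the deepest subtree
    children = []
    for i in range(arity):
        d = depth - 1 if i == deep else rng.randrange(depth)
        children.append(grow(d, rng))
    return [name] + children
```

Randomizing both the depth of every subtree and the position of the deepest one yields the broad variety of initial shapes described above.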

The overall GA used in the experiments is given as follows:

  1. Randomly initialize the population of a fixed size

  2. Determine the fitness value of each individual (in parallel)

  3. Evolve the next generation

    1. Crossover (depending on the crossover ratio)

      1. Select individuals by tournament selection

      2. Cross two tournament winners

      3. Add the resulting individuals to the new population

    2. Reproduction (depending on the reproduction ratio)

      1. Select individuals by tournament selection

      2. Add the tournament winner to the new population

    3. Automatic cancellation and terminal adjustment (depending on the auto cancel ratio and the terminal adjustment ratio)

      1. Apply automatic cancellation to all individuals

      2. Add canceled individuals according to the auto cancel ratio

      3. Select the best individuals of the old population

      4. Randomly mutate the float terminals and create adjusted individuals from each best individual

      5. Determine the fitness value of each adjusted individual (in parallel)

      6. Add the best adjusted individuals to the new population according to the terminal adjustment ratio

    4. Fill the new population with new randomly generated individuals (according to the new individuals ratio)

    5. Determine the fitness value of each individual (in parallel)

    6. If none of the stopping criteria is met

      1. Go back to 3.

  4. Return the best individual found so far for each complexity level
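The generation loop above can be condensed into the following simplified Python skeleton. The operator signatures and the ratio defaults are assumptions, and the cancellation, terminal-adjustment, and per-complexity bookkeeping steps are omitted for brevity:

```python
import random

# Simplified skeleton of the GA loop (hypothetical operator signatures).
# `crossover(a, b, rng)` returns two offspring; `new_individual(rng)`
# returns a fresh random individual.

def tournament(population, fitness, size=3, rng=random):
    """Return the fittest of `size` randomly drawn individuals."""
    return max(rng.sample(population, size), key=fitness)

def next_generation(population, fitness, crossover, new_individual,
                    ratios=(0.6, 0.2), rng=random):
    n = len(population)
    n_cross, n_repro = int(ratios[0] * n), int(ratios[1] * n)
    new_pop = []
    while len(new_pop) < n_cross:                  # crossover
        a = tournament(population, fitness, rng=rng)
        b = tournament(population, fitness, rng=rng)
        new_pop.extend(crossover(a, b, rng=rng))
    for _ in range(n_repro):                       # reproduction
        new_pop.append(tournament(population, fitness, rng=rng))
    while len(new_pop) < n:                        # fresh random individuals
        new_pop.append(new_individual(rng=rng))
    return new_pop[:n]
```

Repeating `next_generation` until a stopping criterion is met, while tracking the best individual per complexity level, corresponds to steps 3 and 4 of the listing above.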

Since in this work we are searching for interpretable solutions, a measure of complexity has to be established. The complexity of an individual can generally be measured with respect to its genotype (structural complexity) or its phenotype (functional complexity) (Le et al., 2016). Here, we decided to use a simple node-counting strategy in which different types of functions, variables, and terminals are counted with different weights. Hence, the domain experts to whom the RL policies might be delivered can pre-define which types of genes they are more likely to accept as interpretable compared to others. In particular, we adopted the complexity weightings from Eureqa, a commercially available software for symbolic regression (Dubčáková, 2011). Table 1 lists the weightings we applied in our experiments, and Figure 1 gives four examples of how to calculate the respective complexities.

  Variables   1
  Terminals   1
  if          5
Table 1: GP complexities
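The weighted node-counting measure can be sketched as below. Only the weights for variables, terminals, and `if` come from Table 1; the default function weight and the tree encoding are assumptions for illustration:

```python
# Sketch of the weighted node-counting complexity measure (hypothetical
# nested-list encoding; only the listed weights come from Table 1).

WEIGHTS = {"variable": 1, "terminal": 1, "if": 5}

def complexity(tree, function_weight=1):
    if isinstance(tree, float):
        return WEIGHTS["terminal"]
    if isinstance(tree, str):
        return WEIGHTS["variable"]
    head = WEIGHTS.get(tree[0], function_weight)
    return head + sum(complexity(c, function_weight) for c in tree[1:])
```

With such a measure, a domain expert can raise the weight of gene types they find hard to interpret, which directly penalizes policies using them.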
Individual representation: tree-based
Initialization method: grow method
Selection method: tournament selection (size = 3)
Terminal set: state variables, random float numbers
Function set: basic algebraic functions
Maximal gene amount: 100
Maximal tree depth: 5
Maximal complexity: 100
Ratios for new generation: crossover, auto cancel, terminal mutation, new random individuals
Population/generations/training states for GPRL: MC: 100/1,000/1,000; CPB: 1,000/1,000/1,000; IB: 1,000/1,000/100
Population/generations/samples for symbolic regression: MC: 1,000/1,000/70,000; CPB: 10,000/1,000/70,000; IB: 10,000/1,000/100,000
Table 2: GP parameters. Note that the population and generation numbers have been determined empirically. The generation numbers are chosen such that after these points no substantially better individuals have been found throughout the population.

4 Genetic Programming Reinforcement Learning

The basis for the proposed GPRL approach is a data set D that contains state transition samples gathered from the dynamics of a real system. These samples are represented by tuples (s, a, s′, r), where, in state s, action a was applied and resulted in a state transition to s′. Subsequently, this transition yielded a real-valued reward r. Note that generally D can be generated using any (even a random) policy prior to policy training, as long as sufficient exploration is involved (Sutton and Barto, 1998).

In a first step, we generate world models g̃ with inputs (s, a) to predict s′, using data set D. To yield better approximative quality, we observed that for many problems it is advantageous to learn the differences between s and s′ and to train a single model per state variable separately, i.e., each model predicts the change Δ_i ≈ s′_i − s_i of one state variable. Hence, the resulting state is calculated according to s′ ≈ s + Δ. Note that the reward r is also given in data set D; thus, the reward function can also be approximated using D.
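The delta-model idea can be illustrated with a deliberately simplified sketch in which a one-feature least-squares fit stands in for the per-variable NN regression; all names are hypothetical:

```python
# Sketch of learning one delta model per state variable (hypothetical setup):
# each model predicts the change s'_i - s_i from the inputs; here a trivial
# one-feature least-squares fit stands in for the NN regression.

def fit_delta_model(xs, deltas):
    """Least-squares (slope, intercept) for delta = w * x + b."""
    n = len(xs)
    mx, md = sum(xs) / n, sum(deltas) / n
    w = sum((x - mx) * (d - md) for x, d in zip(xs, deltas)) / \
        sum((x - mx) ** 2 for x in xs)
    return w, md - w * mx

def predict_next(state_value, feature, model):
    """Resulting state variable: current value plus predicted change."""
    w, b = model
    return state_value + (w * feature + b)
```

Predicting the change rather than the absolute next value keeps the regression targets small and centered, which is the approximative-quality advantage noted above.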

The interpretable policies we generate by applying our GPRL approach in Section 5 are basic algebraic equations. Given that GPRL is able to find rather short (non-complex) equations, we expect to reveal substantial knowledge about the underlying dependencies between available state variables and well-performing control policies with respect to a certain RL problem. To rate the quality of each policy candidate, a fitness value has to be provided for the GP algorithm to advance. For our GPRL approach, the fitness of each individual is calculated by generating trajectories using the world model, starting from a fixed set of initial benchmark states (Section 2).

The performance of GPRL is compared to a rather straightforward approach that utilizes GP to conduct symbolic regression on a data set generated by a well-performing but non-interpretable RL policy π̂. This data set contains tuples (s, π̂(s)), where π̂(s) are the actions generated by policy π̂ in state s. The states originate from trajectories created by policy π̂ on the world model g̃. One might think that, given an adequate policy π̂ of any form, using GP to mimic this policy by minimizing some regression error with respect to π̂ could also yield successful interpretable RL policies. However, our results clearly indicate that this strategy is only successful for rather small and simple problems and produces highly non-stable and unsatisfactory results for more complex tasks.
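For contrast, the baseline's fitness can be sketched as an imitation score (names hypothetical): a candidate is rated by the negative mean squared error against the recorded actions of the black-box policy, rather than by its return on the world model:

```python
# Sketch of the symbolic-regression baseline's fitness (hypothetical names):
# `dataset` holds (state, action) pairs recorded from the black-box policy.

def imitation_fitness(candidate, dataset):
    """Negative mean squared error w.r.t. the recorded policy actions."""
    errors = [(candidate(s) - a) ** 2 for s, a in dataset]
    return -sum(errors) / len(errors)
```

Maximizing such a regression fitness does not guarantee high returns when the candidate is later executed on the system, which is consistent with the instability observed for the more complex tasks.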

In our experiments, the well-performing but non-interpretable RL policy π̂ we used to generate the data set is an NN policy. To yield comparable results, we always trained the weights of this policy by model-based RL on the very same world models as applied for GPRL. Note that we usually expect π̂ to yield higher fitness values during training, since it is able to utilize significantly more degrees of freedom to compute an optimal state-action mapping than the basic algebraic equations found by GPRL.

Note that we use NNs as world models for the GPRL experiments (not to be confused with the NN policy π̂). In many real-world industrial problem domains, e.g., those with continuous and rather smooth system dynamics, NNs are known to serve as adequate world models with excellent generalization properties. Given a batch of previously generated transition samples, the NN training process is known to be data-efficient. Moreover, the training errors are excellent indicators of how well the model will perform in model-based RL training. Nevertheless, for other problem domains, alternative types of world models might be preferable. For example, Gaussian processes (Rasmussen and Williams, 2006) provide a good approximation of the mean of the target value, and this technique indicates the level of confidence about this prediction, which may be of value for stochastic system dynamics. Another alternative modeling technique is the use of regression trees (Breiman et al., 1984). While typically lacking data efficiency, regression tree predictions are less affected by nonlinearities in the system dynamics because they do not rely on a closed-form functional approximation.

Figure 2 gives an overview of the relationships between the different policy implementations, i.e., GPRL, GP for symbolic regression, and NN policy, and the environment instances, i.e., NN system model and the real dynamics, used for training and evaluation, respectively.

Figure 2: The proposed GPRL approach and its integration in the experimental setup. The world model is the result of supervised ML regression on data originated from the real dynamics (A). GPRL generates a GP model-based policy by training on this world model (B.1), which is an NN model in our experimental setup. Similarly, the NN policy is trained in a model-based manner by utilizing the same NN model (B.2). In contrast to both other policy training approaches, the GP regression policy mimics an already existing policy by learning to minimize an error with respect to the existing policy’s action (C). All of the policies are finally evaluated by comparing their performance on the real dynamics (D.1-D.3).

5 Experiments

5.1 Mountain Car

In the MC benchmark, an underpowered car must be driven to the top of a hill (Figure 3) (Moore, 1990). This is achieved by building up sufficient potential energy by first driving in the direction opposite to the final direction. The system is fully described by the two-dimensional state space representing the car’s position and velocity.

Figure 3: Mountain car benchmark. The task is to first build up momentum by driving to the left in order to subsequently reach the top of the hill on the right.

We conducted the MC experiments using the freely available clsquare software, an RL benchmark system that applies the Runge-Kutta fourth-order method to approximate the closed-loop dynamics. The task for the RL agent is to find a policy producing action sequences that drive the car up the hill, which is achieved when the car reaches the goal position.

The agent receives a reward subsequent to each state transition. When the car reaches the goal position, its position becomes fixed and the agent receives the maximum reward in each following time step, regardless of the applied actions.

5.2 Cart-pole Balancing

The CPB experiments described in the following section were also conducted using the clsquare software. The objective of the CPB benchmark is to apply forces to a cart moving on a one-dimensional track to keep a pole hinged to the cart in an upright position (Figure 4). Here, the four Markov state variables are the pole angle, the pole angular velocity, the cart position, and the cart velocity. These variables describe the Markov state completely, i.e., no additional information about the system’s past behavior is required. The task for the RL agent is to find a sequence of force actions that prevent the pole from falling over (Fantoni and Lozano, 2002).

Figure 4: Cart-pole benchmark. The task is to balance the pole in the upright position while moving the cart to the goal position by applying positive or negative force to the cart.

In the CPB task, the angle of the pole and the cart’s position are restricted to fixed intervals. Once the cart has left the restricted area, the episode is considered a failure, and the system remains in the failure state for the rest of the episode. The RL policy can apply bounded positive or negative force actions to the cart at fixed time intervals.

The reward function for the balancing problem is designed such that the primary goal of the policy is to avoid reaching the failure state. The secondary goal is to drive the system to the goal state region and keep it there for the rest of the episode.

5.3 Industrial Benchmark

The IB was designed to emulate several challenging aspects eminent in many industrial applications (Hein et al., 2017a, c). It is not designed to be an approximation of any specific real-world system, but to pose a hardness and complexity comparable to those found in many industrial applications.

State and action spaces are continuous. Moreover, the state space is high-dimensional and only partially observable. The actions consist of three continuous components and affect three control inputs. Moreover, the IB includes stochastic and delayed effects. The optimization task is multi-criterial in the sense that there are two reward components that show opposite dependencies on the actions. The dynamical behavior is heteroscedastic with state-dependent observation noise and state-dependent probability distributions, based on latent variables. Furthermore, it depends on an external driver that cannot be influenced by the actions.

At any time step t, the RL agent can influence the IB via actions a_t that are three-dimensional vectors in [−1, 1]³. Each action component can be interpreted as a proposed change to one of three observable state control variables: velocity, gain, and shift. Each of these variables is limited to a fixed range and updated by applying the corresponding action component with a variable-specific scaling factor.

After applying the action a_t, the environment transitions to the next time step t+1, yielding the internal state s_{t+1}. State s_t and successor state s_{t+1} are the Markovian states of the environment, which are only partially observable by the agent. In addition to the three control variables velocity, gain, and shift, a setpoint is applied to the system. The setpoint simulates an external force, like the demanded load in a power plant or the wind speed actuating a wind turbine, which cannot be controlled by the agent but still has a major influence on the system dynamics. Depending on the setpoint and the choice of control values, the system suffers from detrimental fatigue and consumes resources such as power, fuel, etc., represented by consumption. In each time step, the IB generates output values for fatigue and consumption, which are part of the internal state. The reward is determined solely by fatigue and consumption.
Note that the complete Markov state of the IB remains unobservable. Only an observation vector consisting of:

  • the current control variables velocity, gain, and shift,

  • the external driver setpoint, and

  • the reward-relevant variables consumption and fatigue

can be observed externally.

In Section 2, the optimization task in model-based RL is described as operating on the Markovian state of the system dynamics. Since this state is not observable in the IB environment, it is approximated by a sufficient number of historic observations over a time horizon of 30 steps. Given such a history, a system model with adequate prediction performance could be achieved during the IB experiments. Note that the observation size of six variables in combination with this time horizon results in a 180-dimensional approximation vector of the Markovian state.

5.4 Neural Network World Models

The model-based policy training has been performed on NN world models, which yield the approximate fitness function (Section 2). For these experiments, we created one NN per state variable. Prior to training, the respective data sets were split into blocks of 80%, 10%, and 10% (training, validation, and generalization sets, respectively). While the weight updates during training were computed by utilizing the training sets, the weights that performed best on the validation sets were used as training results. Finally, those weights were evaluated using the generalization sets to rate the overall approximation quality on unseen data.

The MC NNs were trained with a data set containing tuples from trajectories generated by applying random actions to the benchmark dynamics. The start states for these trajectories were sampled uniformly at a random position on the track with zero velocity. The data set contains 10,000 transition samples. The following three NNs were trained to approximate the MC task:
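The data-generation procedure can be sketched as below. The concrete dynamics equations, action set, and track bounds are assumptions based on the classic mountain car formulation (Moore, 1990), since the text elides the exact values:

```python
import math
import random

def mc_step(position, velocity, action):
    """Classic mountain car dynamics (assumed here; cf. Moore, 1990)."""
    velocity += 0.001 * action - 0.0025 * math.cos(3.0 * position)
    velocity = max(-0.07, min(0.07, velocity))
    position = max(-1.2, min(0.6, position + velocity))
    return position, velocity

def generate_random_transitions(n_samples, seed=0):
    """Sample (state, action, next state) tuples with random actions,
    restarting at a random track position with zero velocity."""
    rng = random.Random(seed)
    data, pos, vel = [], rng.uniform(-1.2, 0.6), 0.0
    while len(data) < n_samples:
        action = rng.choice([-1.0, 1.0])  # random full-throttle action
        nxt = mc_step(pos, vel, action)
        data.append(((pos, vel), action, nxt))
        pos, vel = nxt
        if pos >= 0.6:                    # reached the top: restart
            pos, vel = rng.uniform(-1.2, 0.6), 0.0
    return data

dataset = generate_random_transitions(10_000)
```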

Similarly, for the CPB dynamic model state we created the following four networks:

An approximation of the next state is given by the following formula:


The result of this formula can subsequently be used to approximate the state transition’s reward by


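Since one NN is trained per state variable, the transition model is simply the concatenation of the individual per-variable predictions, and the reward is computed from the predicted successor state. A minimal sketch, with illustrative linear stand-ins for the trained networks:

```python
def predict_next_state(models, state, action):
    """One regression model per state variable: the predicted successor
    state concatenates the individual per-variable predictions."""
    return tuple(model(state, action) for model in models)

def approximate_reward(reward_fn, state, action, models):
    """Reward of a transition, computed from the *predicted* next state."""
    return reward_fn(predict_next_state(models, state, action))

# toy example with linear stand-in "models" for a 2-dimensional state
models = [lambda s, a: s[0] + a, lambda s, a: s[1] - a]
next_state = predict_next_state(models, (1.0, 2.0), 0.5)
```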
For the training set of the CPB benchmark, the samples originate from trajectories of 100 state transitions each, generated by a random walk on the benchmark dynamics. The start states for these trajectories were sampled uniformly. The data set contains 10,000 transition samples.

The experiments for MC and CPB were conducted with a network complexity of three hidden layers with 10 hidden neurons each and rectifier activation functions. For training, we used the Vario-Eta algorithm (Neuneier and Zimmermann, 2012).

For the IB benchmark two recurrent neural networks (RNNs) have been trained:

Note that the Markov decision process extraction topology (Duell et al., 2012) of the RNNs that we applied here is well-suited for partially observable problems like the IB. Detailed information on this topology, other design decisions, the training data, and the training process have been previously published by Hein et al. (2017c).

6 Results

6.1 Mountain Car

For the MC experiments, a non-interpretable NN policy with fitness value (equivalent to a penalty value of 41.0) has been trained prior to the GP experiments. A policy with this fitness value is capable of driving the car to the top of the hill from every state in the test set. The NN policy has two hidden layers with activation function and 10 hidden neurons on each layer. Note that recreating such a policy with function trees as considered for our GPRL approach would result in a complexity value of 1581.

The ten GPRL runs learned interpretable policies with a median model penalty of 41.8 at higher complexities (Figure 5(a)). However, even the policies of complexity 1 managed to drive the car to the top of the hill, by simply applying the force along the direction of the car's velocity. Policies with lower penalty managed to reach the top of the hill in fewer time steps.
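This complexity-1 policy amounts to a bang-bang controller on the sign of the velocity; a minimal sketch, where the maximum force value is an assumption:

```python
def mc_bang_bang_policy(velocity, max_force=1.0):
    """Complexity-1 mountain car policy: push along the current
    direction of motion to build up momentum."""
    return max_force if velocity >= 0.0 else -max_force
```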

(a) MC
(b) CPB
(c) IB
Figure 5: Pareto fronts from ten model-based GPRL trainings. Depicted is the median (green line) together with the minimum and maximum (semi-transparent green area) Pareto front penalty from all experiments. The dashed line depicts the performance baseline of the NN policy on the NN model.

The resulting GPRL Pareto front individuals from complexity 1 to 15 are shown in Figure 6.

Figure 6: Interpretable MC policy results by our GPRL approach for complexities 1-15

Performing ten symbolic regression runs on the non-interpretable NN policy yielded interpretable policies with a best median regression error of 0.028 (Figure 7(a)). This rather low regression error suggests a good performance in imitating the NN policy.

(a) MC
(b) CPB
(c) IB
Figure 7: Pareto fronts from ten symbolic regression GP trainings. Depicted is the median (blue line) together with the minimum and maximum (semi-transparent blue area) Pareto front regression error from all experiments.

In Figure 8(a) the performances of the GPRL and GP regression approaches are evaluated by testing the policies of their Pareto fronts with different start states on the real MC dynamics. Note that on average our GPRL approach produced the best interpretable policies for all complexities. However, the performance of the symbolic regression approach is quite similar, which suggests that for the MC benchmark such a procedure for creating interpretable RL policies is also feasible.

(a) MC
(b) CPB
(c) IB
Figure 8: Squashed Pareto fronts from evaluating both GP results with a different set of states on the real benchmark dynamics. On average, the GPRL individuals (green) outperform the GP regression individuals (blue).
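Squashing Pareto fronts as in Figure 8 can be sketched as pooling the individuals of all runs and recomputing a single non-dominated front over (complexity, penalty) pairs; a minimal sketch:

```python
def pareto_front(individuals):
    """Non-dominated (complexity, penalty) pairs; lower is better in both."""
    front, best = [], float("inf")
    for comp, pen in sorted(individuals):
        if pen < best:            # strictly improves on all lower complexities
            front.append((comp, pen))
            best = pen
    return front

def squash(runs):
    """Squashed Pareto front of several runs: pool all individuals
    and recompute a single front."""
    pooled = [ind for run in runs for ind in run]
    return pareto_front(pooled)

# two toy runs with (complexity, penalty) individuals
runs = [[(1, 50.0), (3, 42.0)], [(2, 45.0), (3, 43.5)]]
```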

6.2 Cart-pole Balancing

For the CPB experiments, a non-interpretable NN policy with fitness value (equivalent to a penalty value of 27.1) has been trained prior to the GP experiments. Based on our experience, this performance value represents a successful CPB policy. The NN policy has two hidden layers with activation function and 10 hidden neurons on each layer, i.e., complexity 2471.

In Figure 5(b) the results of the GP model-based training are compared to the NN policy baseline. Note that all ten independent GP runs produced Pareto fronts of very similar performance. In comparison to the NN policy, individuals of low complexity performed significantly worse with respect to the model penalty. Individuals of higher complexity, on the other hand, yielded a median penalty of 27.5 or below, which corresponds to an excellent CPB policy.

Figure 9 depicts all individuals of the Pareto fronts of the ten experiment runs from complexity 1 to 15. Note how the solutions agree not only on the utilized state variables but also on the float values of the respective factors. Differences often arise only from multiplying the whole terms by different factors, i.e., the ratios between the important state variables remain very similar. Provided with such a policy Pareto chart, experts are more likely to succeed in selecting interpretable policies, since common policy concepts are conveniently identified with respect to both their complexity and their model-based performance.

Figure 9: Interpretable CPB policy results by our GPRL approach for complexities 1-15

The Pareto front results of the GP regression experiments are presented in Figure 7(b). Here, the fitness value driving the GP optimization was the regression error with respect to the NN policy. As expected, individuals of higher complexity achieve lower errors. Note that, compared to the GP model-based training, the Pareto front results of the ten experiments are spread over a larger area. This suggests that the NN policy is highly non-linear in its outputs, which makes it harder for the GP to find solutions in this large search space.

To evaluate the true performance of the two approaches, GP model-based training and GP regression training, the individuals of both sets of Pareto fronts have been tested on the true CPB dynamics. Figure 8(b) shows the resulting squashed Pareto fronts compared to the performance of the NN policy. For almost every complexity, the GP model-based approach GPRL is superior to the GP regression approach. Not only are the median results of significantly lower penalty, but the variance of the GPRL solutions is also much lower compared to the GP regression results. Interestingly, the median GP results for complexity 11 and above even outperformed the NN policy result. This indicates that the NN policy already overfitted the NN model and exploited its inaccuracies, while the simple GP policy equations generalize better because of their rather restricted structure.

6.3 Industrial Benchmark

For the IB experiments, a non-interpretable NN policy with fitness value (equivalent to a penalty value of 165.5) has been trained prior to the GP experiments. Based on our experience, this performance value represents a successful IB policy. The NN policy consists of three separate NNs with one hidden layer, activation functions, and 20 hidden neurons each, resulting in a correspondingly high complexity value.

The results of ten individual GPRL runs are depicted in Figure 5(c). Although the median model penalty of the Pareto fronts is worse than that of the non-interpretable NN policy, one of the GPRL runs produced a slightly better performing policy. This comes as a surprise, since the NN policy generally has more degrees of freedom with which to find an optimal policy.

The resulting GPRL policies between complexity 21 and 29 are presented in Figure 10. Note that independently learned policies share similar concepts of how policy actions are computed from related state variables with similar time lags. For example, one of the action equations always uses a linear combination of the shift value from time lag 2, 3, or 4 and the constant setpoint value. Another action is always computed from a velocity value with similar time lags. Moreover, it is possible to reveal common concepts and relevant differences between a rich set of possible solutions. This presentation can provide domain experts with important insights into what well-performing policies for the system at hand look like, which state variables they react to, and how they generalize in state space areas where currently insufficient training data is available.

Figure 10: Interpretable IB policies for complexities 21-29. The presented 3-dimensional policies are the result of ten independent GPRL trainings. Each box contains three policy equations (from top to bottom) to calculate the actions for the next time step. The actions are limited to -1 and +1 before they are applied to the system dynamics. The input variables are normalized by subtracting their respective mean and dividing by their respective standard deviation; these values can easily be calculated from the training data set.

The results of applying GP symbolic regression to the non-interpretable NN policy are shown in Figure 7(c). For each policy action, an independent GP run has been conducted. After training, multi-dimensional policies have been created such that the accumulated regression errors of the three actions are as low as possible for every possible complexity value. This procedure has been repeated ten times, yielding ten independent IB policies.

In the final step of the experiment, the GPRL and the GP regression solutions have been evaluated on the real IB dynamics. Figure 8(c) clearly reveals the strengths of our GPRL approach. First, the model-based GP training performs significantly better than the symbolic regression training. While GPRL already outperformed the regression approach in the MC and CPB experiments, the experiments with the IB illustrate a complete performance breakdown of the latter. Second, the good generalization properties of GPRL led to interpretable policies which even outperformed the non-interpretable NN policy from complexity 11 on. Given this result and the superior performance during model-based NN policy training, it can be concluded that the NN policy overfitted with respect to the NN model penalty.

7 Conclusion

Our GPRL approach conducts model-based batch RL to learn interpretable policies for control problems on the basis of already existing default system trajectories. The policies can be represented by compact algebraic equations or Boolean logic terms. Autonomous learning of such interpretable policies is of high interest to industry domain experts. Presented with a number of GPRL results for a preferred range of complexity, new concepts for controlling an industrial plant can be revealed. Moreover, safety concerns can be addressed more easily if the policy itself, as well as its generalization to certain state space areas, is completely understandable.

The complete GPRL procedure of (i) training a model from existing system trajectories, (ii) learning interpretable policies by GP, and (iii) selecting a favorable solution candidate from a resulting Pareto front has been evaluated on three RL benchmarks, i.e., MC, CPB, and IB. First, the control performance was compared to a non-interpretable NN policy. This comparison showed that the GPRL performance on the approximation model can be slightly worse than the NN policy result. However, when evaluated on the real system dynamics, even interpretable policies of rather low complexity could outperform the non-interpretable approach on many occasions. This suggests that simple algebraic equations used as policies generalize better to new system states. In a second evaluation, our GPRL approach has been compared to a straightforward use of GP as a symbolic regression tool, i.e., fitting the existing non-interpretable NN policy by GP to yield interpretable policies of similar control performance. All of our experiments showed that this strategy is significantly less suitable for producing policies of adequate performance.

Especially the experiments with the IB indicated that the application of the proposed GPRL approach in industry settings could prove to be of significant interest. In many cases, data from systems is readily available and interpretable simple algebraic policies are favored over black-box RL solutions, such as non-interpretable NN policies.


The project on which this report is based was supported with funds from the German Federal Ministry of Education and Research under project number 01IB15001. The sole responsibility for the report’s contents lies with the authors.



  • Bakker (2004) Bakker, B., 2004. The state of mind: Reinforcement learning with recurrent neural networks. Ph.D. thesis, Leiden University, Netherlands.
  • Blickle and Thiele (1995) Blickle, T., Thiele, L., 1995. A mathematical analysis of tournament selection. In: ICGA. pp. 9–16.
  • Breiman et al. (1984) Breiman, L., Friedman, J., Olshen, R., Stone, C., 1984. Classification and Regression Trees. CRC Press, Boca Raton, FL.
  • Busoniu et al. (2010) Busoniu, L., Babuska, R., De Schutter, B., Ernst, D., 2010. Reinforcement Learning and Dynamic Programming Using Function Approximators. CRC Press.
  • Depeweg et al. (2016) Depeweg, S., Hernández-Lobato, J. M., Doshi-Velez, F., Udluft, S., 2016. Learning and policy search in stochastic dynamical systems with Bayesian neural networks. arXiv preprint arXiv:1605.07127.
  • Downing (2001) Downing, K. L., 2001. Adaptive genetic programs via reinforcement learning. In: Proceedings of the 3rd Annual Conference on Genetic and Evolutionary Computation. GECCO'01. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp. 19–26.
  • Dubčáková (2011) Dubčáková, R., 2011. Eureqa: software review. Genetic programming and evolvable machines 12 (2), 173–178.
  • Duell et al. (2012) Duell, S., Udluft, S., Sterzing, V., 2012. Solving partially observable reinforcement learning problems with recurrent neural networks. In: Neural Networks: Tricks of the Trade. Springer, pp. 709–733.
  • Ernst et al. (2005) Ernst, D., Geurts, P., Wehenkel, L., Littman, L., 2005. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research 6, 503–556.
  • Fantoni and Lozano (2002) Fantoni, I., Lozano, R., 2002. Non-linear control for underactuated mechanical systems. Springer.
  • Gearhart (2003) Gearhart, C., 2003. Genetic programming as policy search in markov decision processes. In: Koza, J. R. (Ed.), Genetic Algorithms and Genetic Programming at Stanford 2003. Stanford Bookstore, Stanford, California, 94305-3079 USA, pp. 61–67.
  • Gordon (1995) Gordon, G., 1995. Stable function approximation in dynamic programming. In: In Machine Learning: Proceedings of the Twelfth International Conference. Morgan Kaufmann.
  • Hein et al. (2017a) Hein, D., Depeweg, S., Tokic, M., Udluft, S., Hentschel, A., Runkler, T. A., Sterzing, V., 2017a. A benchmark environment motivated by industrial control problems. arXiv preprint arXiv:1709.09480.
  • Hein et al. (2017b) Hein, D., Hentschel, A., Runkler, T., Udluft, S., 2017b. Particle swarm optimization for generating interpretable fuzzy reinforcement learning policies. Engineering Applications of Artificial Intelligence 65, 87–98.
  • Hein et al. (2017c) Hein, D., Udluft, S., Tokic, M., Hentschel, A., Runkler, T. A., Sterzing, V., 2017c. Batch reinforcement learning on the industrial benchmark: First experiences. In: 2017 International Joint Conference on Neural Networks (IJCNN). pp. 4214–4221.
  • Juang et al. (2000) Juang, C.-F., Lin, J.-Y., Lin, C.-T., Apr. 2000. Genetic reinforcement learning through symbiotic evolution for fuzzy controller design. Trans. Sys. Man Cyber. Part B 30 (2), 290–302.
  • Kamio and Iba (2005) Kamio, S., Iba, H., Jun. 2005. Adaptation technique for integrating genetic programming and reinforcement learning for real robots. Trans. Evol. Comp 9 (3), 318–333.
  • Katagiri et al. (2002) Katagiri, H., Hirasawa, K., Hu, J., Murata, J., Kosaka, M., 2002. Network structure oriented evolutionary model: Genetic network programming. Transactions of the Society of Instrument and Control Engineers 38 (5), 485–494.
  • Keane et al. (2002) Keane, M. A., Koza, J. R., Streeter, M. J., 2002. Automatic synthesis using genetic programming of an improved general-purpose controller for industrially representative plants. In: Proceedings of the 2002 NASA/DoD Conference on Evolvable Hardware (EH’02). EH ’02. IEEE Computer Society, Washington, DC, USA, pp. 113–123.
  • Koshiyama et al. (2014) Koshiyama, A. S., Escovedo, T., Vellasco, M. M. B. R., Tanscheit, R., July 2014. Gpfis-control: A fuzzy genetic model for control tasks. In: 2014 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE). pp. 1953–1959.
  • Koza (1992) Koza, J. R., 1992. Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge, MA, USA.
  • Koza (2010) Koza, J. R., 2010. Human-competitive results produced by genetic programming. Genetic Programming and Evolvable Machines 11 (3), 251–284.
  • Lagoudakis and Parr (2003) Lagoudakis, M., Parr, R., 2003. Least-squares policy iteration. Journal of Machine Learning Research, 1107–1149.
  • Le et al. (2016) Le, N., Xuan, H. N., Brabazon, A., Thi, T. P., 2016. Complexity measures in genetic programming learning: A brief review. In: Evolutionary Computation (CEC), 2016 IEEE Congress on. IEEE, pp. 2409–2416.
  • Mabu et al. (2004) Mabu, S., Hirasawa, K., Hu, J., 2004. Genetic Network Programming with Reinforcement Learning and Its Performance Evaluation. Springer Berlin Heidelberg, Berlin, Heidelberg, pp. 710–711.
  • Mabu et al. (2002) Mabu, S., Hirasawa, K., Hu, J., Murata, J., 2002. Online learning of genetic network programming (GNP). In: Fogel, D. B., El-Sharkawi, M. A., Yao, X., Greenwood, G., Iba, H., Marrow, P., Shackleton, M. (Eds.), Proceedings of the 2002 Congress on Evolutionary Computation CEC2002. IEEE Press, pp. 321–326.
  • Maes et al. (2012) Maes, F., Fonteneau, R., Wehenkel, L., Ernst, D., 2012. Policy search in a space of simple closed-form formulas: towards interpretability of reinforcement learning. Discovery Science, 37–50.
  • Moore (1990) Moore, A. W., Nov. 1990. Efficient memory-based learning for robot control. Tech. Rep. UCAM-CL-TR-209, University of Cambridge, Computer Laboratory.
  • Neuneier and Zimmermann (2012) Neuneier, R., Zimmermann, H.-G., 2012. How to train neural networks. In: Montavon, G., Orr, G., Müller, K.-R. (Eds.), Neural Networks: Tricks of the Trade, Second Edition. Springer, pp. 369–418.
  • Ormoneit and Sen (2002) Ormoneit, D., Sen, S., 2002. Kernel-based reinforcement learning. Machine learning 49 (2), 161–178.
  • Rasmussen and Williams (2006) Rasmussen, C. E., Williams, C. K. I., 2006. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). MIT Press.
  • Riedmiller (2005a) Riedmiller, M., 2005a. Neural fitted Q iteration — first experiences with a data efficient neural reinforcement learning method. In: Machine Learning: ECML 2005. Vol. 3720. Springer, pp. 317–328.
  • Riedmiller (2005b) Riedmiller, M., 2005b. Neural reinforcement learning to swing-up and balance a real pole. In: Systems, Man and Cybernetics, 2005 IEEE International Conference on. Vol. 4. pp. 3191–3196.
  • Riedmiller et al. (2009) Riedmiller, M., Gabel, T., Hafner, R., Lange, S., 2009. Reinforcement learning for robot soccer. Autonomous Robots 27 (1), 55–73.
  • Schäfer (2008) Schäfer, A. M., 2008. Reinforcement learning with recurrent neural networks. Ph.D. thesis, University of Osnabrück, Germany.
  • Schäfer et al. (2007) Schäfer, A. M., Udluft, S., Zimmermann, H.-G., 2007. A recurrent control neural network for data efficient reinforcement learning. In: Proceedings of IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning. pp. 151–157.
  • Schneegaß et al. (2007a) Schneegaß, D., Udluft, S., Martinetz, T., 2007a. Improving optimality of neural rewards regression for data-efficient batch near-optimal policy identification. In: Proceedings of the International Conference on Artificial Neural Networks.
  • Schneegaß et al. (2007b) Schneegaß, D., Udluft, S., Martinetz, T., 2007b. Neural rewards regression for near-optimal policy identification in Markovian and partial observable environments. In: Proceedings of the European Symposium on Artificial Neural Networks. pp. 301–306.
  • Schwefel (1981) Schwefel, H.-P., 1981. Numerical optimization of computer models. John Wiley & Sons, Inc.
  • Schwefel (1995) Schwefel, H.-P., 1995. Evolution and Optimum Seeking. Sixth-Generation Computer Technology Series.
  • Shimooka and Fujimoto (1999) Shimooka, H., Fujimoto, Y., 1999. Generating equations with genetic programming for control of a movable inverted pendulum. In: Selected Papers from the Second Asia-Pacific Conference on Simulated Evolution and Learning on Simulated Evolution and Learning. SEAL’98. Springer-Verlag, London, UK, UK, pp. 179–186.
  • Sutton (1988) Sutton, R., 1988. Learning to predict by the methods of temporal differences. Machine learning 3 (1), 9–44.
  • Sutton and Barto (1998) Sutton, R., Barto, A., 1998. Reinforcement learning: an introduction. A Bradford book.