1. Introduction
In typical industrial applications, such as controlling wind or gas turbines, control strategies, known as policies, that can be interpreted and controlled by humans are of great interest (Maes et al., 2012). However, having domain experts manually design such policies is complicated and sometimes infeasible, since it requires the plant's system dependencies to be modeled in great detail with dedicated mathematical representations. Since such representations cannot be found for many real-world applications, policies have to be learned from reward samples taken from the plant itself. Reinforcement learning (RL) (Sutton and Barto, 1998) is capable of determining such policies using only the available system data.
Recently, fuzzy particle swarm reinforcement learning (FPSRL) has been proposed, and it has been shown that an evolutionary computation method, namely particle swarm optimization (PSO), can be successfully combined with fuzzy rule-based systems to generate interpretable RL policies (Hein et al., 2017b). This is achieved by first training a model on a batch of pre-existing state-action trajectory samples and subsequently conducting model-based RL. This step uses PSO to optimize a predefined set of fuzzy rule parameters. FPSRL has been applied to several well-known RL benchmarks, such as the mountain car and cart-pole problems (Hein et al., 2017b). While such simple benchmark problems are well suited to introducing a new method and comparing its performance with that of standard approaches, their easy-to-model dynamics and low-dimensional state and action spaces share few similarities with real-world industrial applications, which usually have high-dimensional continuous state and action spaces. Applying FPSRL to systems with many state features often yields non-interpretable fuzzy systems, since by default every fuzzy rule includes all the state dimensions, even redundant or irrelevant ones, in its membership function.
In this paper, we propose an approach to efficiently determine the most important state features with respect to the optimal policy. Selecting only the most important features prior to policy parameter training makes the production of interpretable fuzzy policies using FPSRL possible again.
However, performing a heuristic feature selection first and then creating policy structures manually is a feasible but limited approach; in high-dimensional state and action spaces, the effort involved grows exponentially.
Instead, as the main contribution of this work, we propose fuzzy genetic programming reinforcement learning (FGPRL), an approach that, like FPSRL, is based on model-based batch RL. By creating fuzzy rules using genetic programming (GP) rather than tuning the fuzzy rule parameters via PSO, FGPRL eliminates the manual feature selection process. The GP technique automatically selects both the most important features and the most compact fuzzy rule representation for a given level of performance. Moreover, it returns not just one solution to the problem but a whole Pareto front containing the best-performing solutions at many different levels of complexity.
Although genetic fuzzy systems have demonstrated their ability to learn and adapt to solve different types of problems in various application domains, GP-generated fuzzy logic controllers have not previously been combined with a model-based batch RL approach.
Combining a fuzzy system's approximate reasoning with an evolutionary algorithm's ability to learn allows the proposed method to learn human-interpretable soft-computing solutions autonomously.
Cordón et al. (2004) provide an extensive overview of previous genetic fuzzy rule-based systems. While most of the existing research in this area has focused on genetic tuning of scaling and membership functions as well as genetic learning of rule and knowledge bases, less attention has been paid to using GP to design fuzzy rule-based systems. Since GP is concerned with automatically generating computer programs (Koza, 1992), it should theoretically be able to learn rule and knowledge bases and tune scale and membership functions simultaneously (Geyer-Schulz, 1995). Fuzzy rule-based systems have been combined with GP for modeling (Hoffmann and Nelles, 2001) and classification (Ramos and González, 2000; Sánchez et al., 2001; Chien et al., 2002; Berlanga et al., 2010) tasks. In the optimal system control field considered in this paper, early applications combining GP and fuzzy rule-based systems for mobile robot path tracking have been demonstrated (Tunstel and Jamshidi, 1996). Type-constrained GP has also been used to define fuzzy logic controller rule bases for the cart-centering problem (Alba et al., 1996, 1999). Memetic GP, which combines local and global optimization, has been used to train Takagi-Sugeno fuzzy controllers to solve the cart-pole balancing problem (Tsakonas, 2013). Recently, GPFIS-control, based on the GP fuzzy inference system (GPFIS), has been proposed (Koshiyama et al., 2014); its authors used multi-gene GP to automatically train a fuzzy logic controller and tested its performance on the cart-centering and inverted pendulum problems.
In this study, we apply FPSRL and FGPRL to two different benchmarks, namely the cart-pole swing-up and industrial benchmarks, to compare their RL policy performance and the interpretability of their fuzzy system controllers.
2. Policy Generation Methods
This paper compares two approaches for generating fuzzy RL policies from a batch of previously generated state-action trajectories (Fig. 1). FPSRL, first proposed in (Hein et al., 2017b), tunes the fuzzy membership parameters of a predefined fuzzy rule set. Here, we extend FPSRL by adding an initial feature selection step, thus enabling its application to RL problems with high-dimensional state spaces. In addition, we compare FPSRL with a new approach, called FGPRL, that uses GP to create fuzzy RL policies using the same underlying model-based RL fitness function as FPSRL.
2.1. Model-based Reinforcement Learning
Inspired by behaviorist psychology, RL is concerned with the actions software agents should take in an environment to maximize their received accumulated rewards. In RL, the agents are not explicitly told the actions they are supposed to take; instead, they must learn the best strategy by observing the rewards given by the environment in response to their actions. In general, their actions can affect both the next reward and all subsequent rewards (Sutton and Barto, 1998).
In the RL formalism, each agent observes the system state s_t ∈ S at each discrete time step t and takes an action a_t ∈ A, where S and A are the state and action spaces, respectively. In deterministic systems, state transitions can be expressed as a function g: S × A → S with s_{t+1} = g(s_t, a_t). The corresponding rewards are given by a reward function r: S × A × S → R, with r_{t+1} = r(s_t, a_t, s_{t+1}). Thus, the RL problem's optimal solution is the policy that maximizes the expected accumulated rewards.
In the proposed approach, the goal is to find the best policy π* ∈ Π, with Π being the set of all possible fuzzy RL policies. Policies π associate every state s with an action a = π(s), and their performance, for a given starting state s_t, is measured by the return R(s_t, π), i.e., the accumulated future rewards obtained by executing them. To account for the increasing uncertainties associated with future rewards, the reward r_{t+k+1} received k time steps in the future is weighted by γ^k, where γ ∈ [0, 1]. In addition, we adopt the common approach of only including a finite number T of future rewards in the return (Sutton and Barto, 1998), as follows:

(1)  R(s_t, π) = Σ_{k=0}^{T−1} γ^k · r_{t+k+1},  with s_{t+k+1} = g(s_{t+k}, π(s_{t+k})) and r_{t+k+1} = r(s_{t+k}, π(s_{t+k}), s_{t+k+1}).

The overall state-independent policy performance F(π) is obtained by averaging the return over all starting states s ∈ S_0. Thus, the optimal solutions to the RL problem are the policies π* with

(2)  π* ∈ arg max_{π∈Π} F(π),  where F(π) = (1/|S_0|) Σ_{s∈S_0} R(s, π).
In optimization terminology, the policy performance function is known as the fitness function.
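As a concrete illustration, the return and fitness computations of Eqs. (1) and (2) can be sketched as follows. The `model` and `policy` callables are hypothetical stand-ins for the learned world model and a candidate fuzzy policy, not the actual implementations used in the paper.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Finite-horizon discounted return of Eq. (1): sum_k gamma^k * r_k."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

def fitness(policy, model, start_states, gamma, horizon):
    """Fitness of Eq. (2): the return averaged over all start states,
    with next states and rewards predicted by the (approximate) model."""
    returns = []
    for s in start_states:
        rewards = []
        for _ in range(horizon):
            a = policy(s)       # action chosen by the candidate policy
            s, r = model(s, a)  # model predicts next state and reward
            rewards.append(r)
        returns.append(discounted_return(rewards, gamma))
    return float(np.mean(returns))
```

For example, a toy model that always returns reward 1 yields a fitness of 1 + γ + γ² + … over the horizon, regardless of the start state.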
For most real-world industrial control problems, the cost of executing a potentially bad policy is prohibitive. Therefore, in model-based RL (Busoniu et al., 2010), the state transition function g is approximated by a model g̃ that is either a first-principles model or has been created from previously recorded data. Substituting g̃ for the real system in (1) allows us to obtain a model-based approximation of the true fitness function (2). Here, we consider models based on neural networks (NNs), but the proposed method could be extended to other models, such as Bayesian NNs (Depeweg et al., 2016) and Gaussian process models (Rasmussen and Williams, 2006). The model-based RL approaches considered in this paper are based on data sets D of state transition samples gathered from a real system. These samples are tuples (s, a, r, s') representing a start state s transitioning to a next state s' owing to action a and yielding a reward r. The set D can be generated using any policy (even a random one) prior to policy training and is subsequently used to generate world models g̃ that take s and a as inputs and predict s' and r.
2.2. Fuzzy Controller
Fuzzy set theory was first proposed by Zadeh (1965). Based on this theory, Mamdani and Assilian (1975) subsequently introduced so-called fuzzy controllers, specified by sets of linguistic if-then rules whose membership functions can be activated independently to produce a combined output, computed by a suitable defuzzification function.
In a system with n inputs, a single output, and K rules, the fuzzy rules can be expressed as follows:

(3)  R_i: IF s is m_i THEN o_i,  for i = 1, …, K,

where s = (s_1, …, s_n) is the input vector (the environment state, in our case), m_i is the membership function of a fuzzy set of the input vector in the premise part, and o_i is a real number in the consequent part. We use Gaussian membership functions (Wang and Mendel, 1992). These multivariate Gaussian functions are formed from products over all membership dimensions; they yield smooth outputs, are local, and never produce zero activation. We define each rule's membership function as follows:

(4)  m_i(s) = Π_{j=1}^{n} m_{i,j}(s_j),

(5)  m_{i,j}(s_j) = exp( −(1/2) · ((s_j − c_{i,j}) / σ_{i,j})² ),

where m_{i,j} is the j-th parameterized Gaussian of rule i, with center c_{i,j} and width σ_{i,j}. The output is determined by the following equation:

(6)  π(s) = tanh( β · Σ_{i=1}^{K} m_i(s) · o_i / Σ_{i=1}^{K} m_i(s) ),

where the hyperbolic tangent limits the output to be between −1 and 1, and the parameter β can be used to change the function's slope.
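A minimal NumPy sketch of this controller is shown below. The placement of the slope parameter (here `beta`, scaling the tanh argument) is an assumption; the original formulation may place it differently.

```python
import numpy as np

def gaussian_membership(s, centers, widths):
    # Product of per-dimension Gaussians (Eqs. (4)-(5)); strictly positive.
    return float(np.prod(np.exp(-0.5 * ((s - centers) / widths) ** 2)))

def fuzzy_policy(s, rules, beta=1.0):
    """Weighted-average defuzzification squashed by tanh (cf. Eq. (6)).
    `rules` is a list of (centers, widths, output) triples; `beta` is the
    assumed slope parameter."""
    m = np.array([gaussian_membership(s, c, w) for c, w, _ in rules])
    o = np.array([out for _, _, out in rules])
    return float(np.tanh(beta * np.dot(m, o) / np.sum(m)))
```

With a single rule centered on the current state, the membership is 1 and the output reduces to tanh of the rule's consequent; two symmetric rules cancel to an output of 0.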
2.3. Fuzzy Particle Swarm Reinforcement Learning
FPSRL is a PSO-based approach for solving model-based batch RL problems (Hein et al., 2017b). PSO is a population-based, non-convex, stochastic optimization heuristic that can be applied to any search space that is a bounded subspace of a finite-dimensional vector space (Kennedy and Eberhart, 1995).
The position of each particle in the swarm represents a potential solution to the given problem. The particles fly iteratively through the multidimensional search space, which is referred to as the fitness landscape. At each iteration, the particles move and receive fitness values for their new positions. These values are used to update each particle’s velocity vector as well as those of all the other particles in a certain neighborhood.
For a given maximization problem, the best position y_i found by particle i up to iteration t+1 is calculated as follows:

(7)  y_i(t+1) = y_i(t),     if F(x_i(t+1)) ≤ F(y_i(t)),
     y_i(t+1) = x_i(t+1),   otherwise,

where, in our framework, F is the fitness function given in (2) and the particle positions x represent the policy parameters.

The parameter vector x ∈ X, where X is the set of valid Gaussian fuzzy parameterizations, is of size K(2n + 1) and can be represented as follows:

(8)  x = (c_{1,1}, σ_{1,1}, …, c_{1,n}, σ_{1,n}, o_1, …, c_{K,1}, σ_{K,1}, …, c_{K,n}, σ_{K,n}, o_K).
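A minimal global-best PSO for maximization, illustrating the personal-best update of Eq. (7), can be sketched as follows. The hyperparameter values are illustrative defaults, not the settings used in the experiments.

```python
import numpy as np

def pso_maximize(fitness, dim, n_particles=20, iters=60, bounds=(-1.0, 1.0),
                 w=0.7, c1=1.4, c2=1.4, seed=0):
    """Minimal global-best PSO for a maximization problem."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, (n_particles, dim))   # particle positions
    v = np.zeros_like(x)                          # particle velocities
    pbest = x.copy()                              # personal bests y_i
    pbest_f = np.array([fitness(p) for p in x])
    gbest = pbest[np.argmax(pbest_f)].copy()      # best position in the swarm
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = np.clip(x + v, lo, hi)
        f = np.array([fitness(p) for p in x])
        improved = f > pbest_f                    # the update of Eq. (7)
        pbest[improved] = x[improved]
        pbest_f[improved] = f[improved]
        gbest = pbest[np.argmax(pbest_f)].copy()
    return gbest, float(np.max(pbest_f))
```

On a simple concave fitness landscape, the swarm converges close to the maximizer within a few dozen iterations.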
2.3.1. Rule Construction
2.3.2. Feature Selection
Industrial applications usually have dozens or even hundreds of possible state features, collected by different plant sensors. However, in our experience, very often only a small subset of them is required to create a policy that performs well. Unfortunately, determining the most important features is an expensive and ambiguous process, but it is nonetheless essential if we are to apply promising techniques such as FPSRL to industrial applications. Therefore, we propose a two-step approach that yields a list of features for each action, ordered by relevance to an optimal policy.
First, an optimal trajectory is generated by applying the PSO-P receding horizon controller (Hein et al., 2016, 2018) to the system model g̃. Here, PSO-P uses the model to determine optimal action sequences starting from each state s. The first action a of the optimal sequence is stored in the tuple (s, a), and this process is repeated for all states in the data set D. Note that no explicit policy representation is required to use the model to generate optimal actions.
Second, the AMIFS feature selection heuristic (Tesmer and Estévez, 2004) is used to order the possible features of state s in terms of their relevance to the individual action dimensions. AMIFS uses mutual information as a measure of feature relevance and redundancy to reveal nonlinear feature-action relations.
Finally, the resulting ordered feature lists, one for each action dimension, are used to construct compact and interpretable fuzzy rule representations, whose parameters are then optimized by FPSRL.
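The ranking step can be illustrated with a plain mutual-information estimate. Note that AMIFS additionally discounts redundancy among already-selected features; this sketch omits that adjustment and uses a simple histogram MI estimator.

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram estimate of the mutual information I(X; Y) in nats."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1)
    py = pxy.sum(axis=0)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz])))

def rank_features(states, actions, bins=16):
    """Order state dimensions by estimated relevance to one action
    dimension, most relevant first."""
    scores = [mutual_information(states[:, j], actions, bins)
              for j in range(states.shape[1])]
    return list(np.argsort(scores)[::-1])
```

Given synthetic data in which the action depends on only one state feature, that feature is ranked first.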
2.4. Fuzzy Genetic Programming Reinforcement Learning
In this section, we introduce a GPbased approach that can generate interpretable fuzzy rule policies automatically using modelbased batch RL. Analogously to FPSRL, FGPRL uses an approximate system model to predict the performance of policy candidates and subsequently uses this knowledge to iteratively generate highperforming policies. Unlike FPSRL, FGPRL not only optimizes the predefined policy parameters but also automatically selects the relevant state features, finds the necessary number of rules, and returns a Pareto front of policy candidates with different levels of complexity.
FGPRL is based on GP, which encodes computer programs as sets of genes and then modifies (evolves) them using a so-called genetic algorithm (GA) to drive the optimization of the population. The solution spaces comprise computer programs that perform well on the given tasks (Koza, 1992). Since we are interested in using interpretable fuzzy controllers as RL policies, the genes in our case include membership and defuzzification functions, as well as constant floating-point numbers and state variables. These fuzzy policies can be represented as function trees (Fig. 2) and stored efficiently in arrays. The GA drives the optimization process using selection and reproduction of population members, both of which are based on the members' fitness values, which represent how well each individual performs the given task. Selection ensures that only the fittest individuals survive into the next generation. Similar to a biological breeding process, pairs of individuals are selected for reproduction based on their fitness, and two offspring are created for each pair by crossing their chromosomes. Technically, this is achieved by selecting compatible cutting points in the function trees and interchanging the subtrees below these cuts. Here, we apply tournament selection (Blickle and Thiele, 1995) to select the parent individuals. In addition, we use strongly-typed GP for FGPRL to avoid constructing ill-defined rules (Alba et al., 1999). This means that each building block is assigned a type (Table 1). The different colors in Fig. 2 highlight the example GP individual's type structure. During the crossover process, only cutting points of equal type (color) are selected, ensuring that only legal offspring are created.
Table 1. Complexity weights assigned to the different building-block types.

Type                   Complexity
Variable               0
Floating-point number  1
Dimension              1
Rule                   2
Policy                 10
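Tournament selection, used above to choose the parent individuals, can be sketched as:

```python
import random

def tournament_select(population, fitnesses, k=3, rng=None):
    """Return the fittest of k individuals drawn uniformly at random."""
    rng = rng or random.Random()
    contestants = rng.sample(range(len(population)), k)
    winner = max(contestants, key=lambda i: fitnesses[i])
    return population[winner]
```

Larger tournament sizes k increase selection pressure; with k equal to the population size, the globally fittest individual always wins.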
We adopt the so-called Gaussian mutator as the mutation operator for floating-point terminals, which is common for evolutionary algorithms (Schwefel, 1981, 1995). In each generation, a given fraction of the best-performing individuals is selected for each level of rule complexity. These individuals are then copied, and their original floating-point values are mutated by drawing replacement values from a normal distribution. If the best copy's performance is better than that of the original individual, it is added to the new population. This allows us to conduct a local search in the policy space, because it does not affect the individual's basic genotype structure.

To yield fuzzy rules with the structure described in Section 2.2, we must apply an additional tree correction. The GA can construct rules in which two or more activation functions act on the same state variable, as in Fig. 2, where the same state variable appears twice below the same rule. Such rules are expected to be difficult to interpret, since their shape does not conform to the standard Takagi-Sugeno fuzzy inference model. For FGPRL, we decided to check every tree before evaluating its fitness, looking for recurring state variables and cutting out their corresponding activation functions. Note that the structures of the subsequent activation functions are not affected.

Since we are looking for interpretable solutions, we need to establish a suitable complexity measure. An individual's complexity can generally be measured in terms of its genotype (structural) or phenotype (functional) (Le et al., 2016). Here, we use a simple node-counting strategy in which different types of functions, variables, and terminals are weighted differently. Table 1 lists the weights (complexities) used in our experiments. Note that the weights yield a problem-specific balance between learning controllers consisting of more rules with fewer dimensions and vice versa.
Finally, we decided to create new generations for FGPRL using the following ratios: 45% crossover, 5% reproduction, 10% mutation, and 40% new random individuals.
2.4.1. Local Search
Since FGPRL's GP process searches the entire fuzzy policy structure space, it is prone to underestimating the importance of local fuzzy parameter tuning (Moscato, 1989; Tsakonas, 2013). We propose to counteract this by applying an additional parameter tuning step to all the terminals after optimization is complete (Fig. 1). Applying PSO to every individual in the final Pareto front yields an updated front comprising at most the same number of individuals with equal or higher fitness values.
3. Experiments
In this section, we evaluate both approaches, FPSRL and FGPRL, using two benchmarks. The first is the so-called cart-pole swing-up problem (CP), a widely known RL benchmark that was selected to demonstrate the methods' performance on a task where they can easily be compared with other RL methods. However, CP has a low-dimensional state space, and its dynamics are deterministic and smooth. To investigate the methods' performance on real-world industrial applications, we also selected a second benchmark, the industrial benchmark (IB). This combines a high-dimensional state space with stochastic dynamics, making its results more meaningful for industrial applications, such as controlling wind and gas turbines.
To compare the computational costs of FPSRL and FGPRL, we used the parameter values shown in Table 2 for our experiments. Unlike FPSRL, which only yields one solution derived from its initial rule set, FGPRL produces a whole Pareto front of solutions with different complexity and fitness values. Note that although additional local optimization is useful for FGPRL, FPSRL does not require it. Consequently, we chose the additional-local-search and runs-for-multiple-complexities settings so as to produce similar total numbers of fitness value calculations, yielding the results presented below.
Table 2. Parameter settings used to compare the computational costs of FPSRL and FGPRL.

                                             FPSRL           FGPRL
Particles/Individuals
Iterations/Generations
Fitness value calculations (A)
Optimization result                          Single policy   Pareto front of policies
Policies per run                             1               22 to 41
Additional local search
Additional local fitness value calc. (B)                     to
Runs for multiple complexities               4               1
Fitness value calc. for multiple compl. (C)
Total fitness value calc. (B)+(C)                            to
3.1. Cart-pole Swing-up
3.1.1. Dynamics
The objective of the CP benchmark is to apply forces to a cart moving along a one-dimensional track so as to keep a pole (hinged to the cart) in an upright position. The four Markov state variables are the pole's angle and angular velocity and the cart's position and velocity. These variables completely describe the Markov state; therefore, no additional information about the system's previous behavior is required. The RL agent's task is to find a sequence of force actions that prevents the pole from falling over (Fantoni and Lozano, 2002). The CP experiments described in this paper were conducted using the CLSquare software framework (http://ml.informatik.uni-freiburg.de/research/clsquare).
There are no restrictions on the cart's position or the pole's angle. Consequently, the pole can swing through, which is an important property of the CP. Since the pole's angle can initially lie anywhere on the full circle, it is often necessary for the policy to swing the pole from one side to the other to gain sufficient momentum to raise it and consequently receive the highest reward.
CP policies can apply forces between fixed negative and positive limits (in newtons) to the cart, and the reward function is given as follows:
(9) 
3.1.2. Benchmark Setup
Initially, we generated the data set D by applying random actions to the real CP dynamics. Generating 100 state-action trajectories of length 100 gave us 10,000 state transition samples. The trajectories' initial states were sampled uniformly at random. Then, we trained five NN system models, one for each state variable and one for predicting the probability of reaching the goal region (Hein et al., 2017b).

These system models were subsequently used in model-based RL, with a time horizon of 500 and a discount factor of 0.994. The training process involved 100 training states. Solutions exceeding a fixed fitness threshold were considered successful, because such policies could swing up the pole from more than 99% of the given test states.

Finally, the best policies were tested against the real system dynamics using the same time horizon and discount factor but a different set of 100 test states.
3.1.3. Results
Here, we compare the results of CP experiments in which we ran the benchmark 10 times for each method. FPSRL produced 40 policies at 4 complexity levels, while FGPRL produced 278 policies at 96 complexity levels. Note that both methods involved a similar number of fitness value calculations (Table 2).
Since it is known that all four Markov state variables are required to produce well-performing CP policies, we skipped the feature selection step for FPSRL (Fig. 1) and invested the fitness-value-calculation budget in evaluating different numbers of rules, i.e., 2, 4, 6, and 8 rules, yielding complexities of 63, 125, 187, and 249, respectively.
Fig. 3 shows that for problems such as the CP, which have lowdimensional state spaces and no irrelevant or redundant state variables, applying prior knowledge to the rule construction step yields a rule structure that can be easily tuned to produce highperformance, interpretable fuzzy policies. FPSRL can utilize all the available computational resources to tune a fixed set of parameters.
However, FGPRL had to employ the same resources to search a significantly larger space of possible solutions. Note that although FGPRL is theoretically able to produce exactly the same fuzzy policies at a complexity of 63 as FPSRL, it was unable to find a comparable solution for the system models in any of our experiments (Fig. 3). The best individual it produced with a complexity of 63 or less had a penalty value of 48.88, which was even above the median FPSRL penalty value of 47.05.
Comparing the model and real dynamics penalties yields another interesting observation. Even though the best FGPRL policies never surpassed FPSRL’s performance for complexities of 300 and below on the system model, when they were evaluated using the real dynamics, some FGPRL policies actually performed better than the FPSRL policies. This could possibly be because the swarm optimization had already started to overfit the fuzzy parameters with respect to the system model; this means that FPSRL was exploiting model inaccuracies, thus reducing its performance on the real dynamics.
3.2. Industrial Benchmark
3.2.1. Dynamics
The IB (http://github.com/siemens/industrialbenchmark) was designed to simulate several of the common challenges associated with many industrial applications (Hein et al., 2017a). It was not designed to approximate any specific real-world system, but rather to be of comparable hardness and complexity to many industrial applications.
The IB's state space is continuous, high-dimensional, and only partially observable. The actions are made up of three continuous components and affect three control inputs. In addition, the IB includes stochastic and delayed effects. The optimization task also involves multiple criteria; there are two reward components with opposite dependencies on the actions. The dynamics are heteroscedastic, with state-dependent observation noise and probability distributions based on latent variables. Finally, the IB depends on an external driver that cannot be influenced by the actions.
At any given time step t, the RL agent can influence the state via actions a that change the three observable control variables, namely the velocity, gain, and shift.
The state s and successor state s' are the Markov environment states, which can only be partially observed by the agent. In addition to the three control variables, there is a setpoint that simulates an external force, such as the load on a power plant or the speed of the wind driving a turbine, which the agent cannot control but which still has a significant influence on the system's dynamics. The system also suffers from a detrimental fatigue, which depends on the setpoint and the chosen control values, and consumes resources such as power and fuel, represented by the consumption. At each time step, the system generates output values for consumption and fatigue, which are part of the internal state, and the reward is calculated from these two values.
Note that the IB system's complete Markov state is unobservable; only a six-dimensional observation vector can be observed externally. The Markov state can, however, be approximated using a sufficient number of historic observations. A system model using a time horizon of 30 past observations achieved adequate prediction performance during our IB experiments. Note that combining the six-dimensional observation vector with a time horizon of 30 results in a 180-dimensional approximate Markovian state vector.
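Building the approximate Markov state by stacking recent observations can be sketched as:

```python
import numpy as np

def approximate_markov_state(observations, horizon=30):
    """Stack the last `horizon` observation vectors (most recent last) into
    a single flat vector; with 6-dim observations and a horizon of 30 this
    yields the 180-dim approximate Markov state."""
    window = np.asarray(observations[-horizon:])
    return window.reshape(-1)
```

Each call simply flattens the trailing window of the observation history, so the state dimension is the observation dimension times the horizon.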
3.2.2. Benchmark Setup
The system was initialized for each of the considered setpoints, and random trajectories of length 1,000 were then produced. This process was repeated 10 times, resulting in the data set D. Following the approach reported in (Hein et al., 2017c), we trained two recurrent NNs to predict the consumption and fatigue. These models were then used in model-based RL, with a time horizon of 100 and a discount factor of 0.97. The training process involved 100 training states, drawn randomly from the states in D.

Finally, the best policies were tested against the real system dynamics, using the same time horizon and discount factor, but a different set of 100 test states drawn randomly from the states in D.
3.2.3. Results
As with the CP experiments, we ran the IB benchmark 10 times for each method. FPSRL produced 40 policies at 4 complexity levels, while FGPRL produced 368 policies at 86 complexity levels.
To construct interpretable fuzzy rules using FPSRL, we have to select suitable features before the swarm optimization of the fuzzy parameters (Section 2.3). The proposed method produced an ordered list of the most important state variables for each of the three action dimensions, with the variables listed in descending order of importance and the indices representing the time elapsed since the observation. Four different fuzzy rule structures were constructed based on these lists. The first policy, with a complexity of 99, incorporated only the first variable from each list into two rules per action dimension, while the other policies, with complexities of 129, 159, and 189, incorporated the first two, the first three, or all four variables, respectively.
Using the proposed feature selection heuristic, FPSRL was able to generate policies with adequate performance for complexities of 129 or higher (Fig. 4). However, FGPRL was able to generate policies of significantly lower complexity with higher performance, achieving better results at a complexity of only 94 (Fig. 5). Moreover, the FGPRL search space covers all possible combinations of state dimensions and numbers of rules for each individual action. For industrial problems where the state-to-action dependencies are not known a priori, we expect this ability to search autonomously to be highly valuable to control system designers and domain experts.
Comparing the policies' performance on the approximate IB model with their performance on the real IB dynamics shows that the training results can be transferred to the real system as long as the models' regression and generalization quality is adequate (Fig. 4).
4. Conclusion
In this paper, we have evaluated two approaches for learning fuzzy control policies autonomously, in terms of their performance and interpretability in industrial applications. We have considered applications with high-dimensional continuous state and action spaces and have proposed a feature selection heuristic that enables the previously presented FPSRL approach to be applied successfully in such industrial domains. Our second contribution is a GP-based fuzzy policy learning approach, called FGPRL, that utilizes the same model-based batch RL technique as FPSRL. However, instead of only tuning the parameters of fixed fuzzy policy structures, FGPRL searches the full space of all possible fuzzy controllers, determining the important state variables and the number of rules required, and subsequently tunes all the rule parameters.
Experiments using the standard CP RL benchmark showed that FPSRL has a significant advantage when no feature selection is necessary and the number of rules required can easily be determined by testing a few different options. In this case, FGPRL's significantly wider search space is a drawback, and it was far less likely to converge to a solution with performance similar to that produced by FPSRL using a similar number of fitness value calculations. However, FGPRL was occasionally able to produce high-performance solutions for the CP benchmark.
Experiments using the IB, a benchmark that mimics real industrial systems such as gas and wind turbines, yielded a significant advantage for FGPRL over FPSRL. This benchmark has a high-dimensional state space, a multi-dimensional action space, stochastic and delayed effects, and a reward function with multiple criteria. Applying feature selection to FPSRL and manually testing different fuzzy policy structures did not yield satisfactory performance for low-complexity solutions. In contrast, FGPRL was able to find high-quality interpretable solutions of low complexity with a similar number of fitness value calculations.
These results indicate that FGPRL is better than FPSRL at creating interpretable fuzzy policies autonomously from existing transition samples.
Acknowledgments
The project this report is based on was supported with funds from the German Federal Ministry of Education and Research under project number 01IB15001. The sole responsibility for the report’s contents lies with the authors.
References
 Alba et al. (1996) E. Alba, C. Cotta, and J.M. Troya. 1996. Typeconstrained genetic programming for rulebase definition in fuzzy logic controllers. In Proceedings of the 1st annual conference on genetic programming. MIT Press, 255–260.
 Alba et al. (1999) E. Alba, C. Cotta, and J.M. Troya. 1999. Evolutionary design of fuzzy logic controllers using stronglytyped GP. Mathware and Soft Computing 6, 1 (1999), 109–124.
 Berlanga et al. (2010) F.J. Berlanga, A.J. Rivera, M.J. del Jesús, and F. Herrera. 2010. GPCOACH: Genetic Programmingbased learning of COmpact and ACcurate fuzzy rulebased classification systems for Highdimensional problems. Information Sciences 180, 8 (2010), 1183–1200.
 Blickle and Thiele (1995) T. Blickle and L. Thiele. 1995. A Mathematical Analysis of Tournament Selection. In ICGA. 9–16.
 Busoniu et al. (2010) L. Busoniu, R. Babuska, B. De Schutter, and D. Ernst. 2010. Reinforcement Learning and Dynamic Programming Using Function Approximators. CRC Press.
 Chien et al. (2002) B.C. Chien, J.Y. Lin, and T.P. Hong. 2002. Learning discriminant functions with fuzzy attributes for classification using genetic programming. Expert Systems with Applications 23, 1 (2002), 31–37.
 Cordón et al. (2004) O. Cordón, F. Gomide, F. Herrera, F. Hoffmann, and L. Magdalena. 2004. Ten years of genetic fuzzy systems: current framework and new trends. Fuzzy sets and systems 141, 1 (2004), 5–31.
 Depeweg et al. (2016) S. Depeweg, J.M. Hernández-Lobato, F. Doshi-Velez, and S. Udluft. 2016. Learning and policy search in stochastic dynamical systems with Bayesian neural networks. arXiv preprint arXiv:1605.07127 (2016).
 Fantoni and Lozano (2002) I. Fantoni and R. Lozano. 2002. Nonlinear control for underactuated mechanical systems. Springer.

 Geyer-Schulz (1995) A. Geyer-Schulz. 1995. Fuzzy Rule-Based Expert Systems and Genetic Machine Learning. Physica-Verlag, Heidelberg.
 Hein et al. (2017a) D. Hein, S. Depeweg, M. Tokic, S. Udluft, A. Hentschel, T.A. Runkler, and V. Sterzing. 2017a. A benchmark environment motivated by industrial control problems. In 2017 IEEE Symposium Series on Computational Intelligence (SSCI). 1–8. https://doi.org/10.1109/SSCI.2017.8280935
 Hein et al. (2016) D. Hein, A. Hentschel, T.A. Runkler, and S. Udluft. 2016. Reinforcement Learning with Particle Swarm Optimization Policy (PSOP) in Continuous State and Action Spaces. International Journal of Swarm Intelligence Research (IJSIR) 7, 3 (2016), 23–42.

 Hein et al. (2017b) D. Hein, A. Hentschel, T.A. Runkler, and S. Udluft. 2017b. Particle swarm optimization for generating interpretable fuzzy reinforcement learning policies. Engineering Applications of Artificial Intelligence 65 (2017), 87–98.
 Hein et al. (2018) D. Hein, A. Hentschel, T.A. Runkler, and S. Udluft. 2018. Particle swarm optimization for model predictive control in reinforcement learning environments. In Critical Developments and Applications of Swarm Intelligence, Y. Shi (Ed.). IGI Global, Hershey, PA, USA, Chapter 16, 401–427.
 Hein et al. (2017c) D. Hein, S. Udluft, M. Tokic, A. Hentschel, T. A. Runkler, and V. Sterzing. 2017c. Batch reinforcement learning on the industrial benchmark: First experiences. In 2017 International Joint Conference on Neural Networks (IJCNN). 4214–4221.
 Hoffmann and Nelles (2001) F. Hoffmann and O. Nelles. 2001. Genetic programming for model selection of TSK-fuzzy systems. Information Sciences 136, 1–4 (2001), 7–28.
 Kennedy and Eberhart (1995) J. Kennedy and R.C. Eberhart. 1995. Particle swarm optimization. Proceedings of the IEEE International Joint Conference on Neural Networks (1995), 1942–1948.
 Koshiyama et al. (2014) A.S. Koshiyama, T. Escovedo, M.M.B.R. Vellasco, and R. Tanscheit. 2014. GPFIS-Control: A fuzzy Genetic model for Control tasks. In Fuzzy Systems (FUZZ-IEEE), 2014 IEEE International Conference on. IEEE, 1953–1959.
 Koza (1992) J.R. Koza. 1992. Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge, MA, USA.
 Le et al. (2016) N. Le, H.N. Xuan, A. Brabazon, and T.P. Thi. 2016. Complexity measures in Genetic Programming learning: A brief review. In Evolutionary Computation (CEC), 2016 IEEE Congress on. IEEE, 2409–2416.
 Maes et al. (2012) F. Maes, R. Fonteneau, L. Wehenkel, and D. Ernst. 2012. Policy search in a space of simple closedform formulas: towards interpretability of reinforcement learning. Discovery Science (2012), 37–50.
 Mamdani and Assilian (1975) E.H. Mamdani and S. Assilian. 1975. An experiment in linguistic synthesis with a fuzzy logic controller. International Journal of Man-Machine Studies 7, 1 (1975), 1–13.
 Moscato (1989) P. Moscato. 1989. On evolution, search, optimization, genetic algorithms and martial arts: Towards memetic algorithms. Caltech concurrent computation program, C3P Report 826 (1989), 1989.

 Ramos and González (2000) L.S. Ramos and J.A.C. González. 2000. A niching scheme for steady state GA-P and its application to fuzzy rule based classifiers induction. Mathware and Soft Computing 7, 2–3 (2000), 337–350.
 Rasmussen and Williams (2006) C.E. Rasmussen and C.K.I. Williams. 2006. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). MIT Press.
 Sánchez et al. (2001) L. Sánchez, I. Couso, and J.A. Corrales. 2001. Combining GP operators with SA search to evolve fuzzy rule based classifiers. Information Sciences 136, 1–4 (2001), 175–191.
 Schwefel (1981) H.P. Schwefel. 1981. Numerical optimization of computer models. John Wiley & Sons, Inc.
 Schwefel (1995) H.P. Schwefel. 1995. Evolution and optimum seeking. Sixth-generation computer technology series (1995).
 Sutton and Barto (1998) R.S. Sutton and A.G. Barto. 1998. Reinforcement learning: an introduction. A Bradford book.
 Tesmer and Estévez (2004) M. Tesmer and P.A. Estévez. 2004. AMIFS: Adaptive feature selection by using mutual information. In Neural Networks, 2004. Proceedings. 2004 IEEE International Joint Conference on, Vol. 1. IEEE, 303–308.
 Tsakonas (2013) A. Tsakonas. 2013. Local and global optimization for Takagi–Sugeno fuzzy system by memetic genetic programming. Expert Systems with Applications 40, 8 (2013), 3282–3298.
 Tunstel and Jamshidi (1996) E. Tunstel and M. Jamshidi. 1996. On genetic programming of fuzzy rulebased systems for intelligent control. Intelligent Automation & Soft Computing 2, 3 (1996), 271–284.
 Wang and Mendel (1992) L.X. Wang and J.M. Mendel. 1992. Fuzzy basis functions, universal approximation, and orthogonal least-squares learning. IEEE Transactions on Neural Networks 3, 5 (1992), 807–814.
 Zadeh (1965) L.A. Zadeh. 1965. Fuzzy sets. Information and Control 8 (1965), 338–353.