1 Introduction
Structure optimization of a feedforward neural network (FNN) and its impact on the FNN's generalization ability inspired the flexible neural tree (FNT) chen2005time . FNN components such as weights, structure, and activation function are the potential candidates for optimization, which improves the FNN's generalization ability to a great extent yao1997new . These efforts are notable because of the FNN's ability to solve a large range of real-world problems basheer2000artificial ; maren2014handbook ; sethi2014artificial ; tkavc2016artificial . The following are significant structure-optimization methods: constructive and pruning algorithms fahlman1989cascade ; nadal1989study , EPNet yao1997new , NeuroEvolution of Augmenting Topologies stanley2002evolving , sparse neural trees zhang1997evolutionary , the cooperative coevolution approach potter2000cooperative , etc. Similarly, many efforts focus on hybrid training of FNNs, such as yaghini2013hybrid ; wang2015feed ; wang2015fruit . FNT was an additional step in this series of efforts; it was proposed as a tree-like feedforward neural network model, where probabilistic incremental program evolution (PIPE) salustowicz1997probabilistic was applied to optimize the tree structure chen2005time . The underlying parameter vector of the developed FNT (the weights associated with the edges and the arguments of the activation functions) was optimized by metaheuristic algorithms, which are nature-inspired parameter-optimization algorithms kar2016bio . The evolutionary process allowed FNT to select significant input features from an input feature set. In the design of FNT, the non-leaf nodes are computational nodes, each of which takes an activation function. Hence, rather than relying on a fixed activation function, the selection of the activation function at each computational node can be left to the evolutionary process. This produces heterogeneous FNTs (HFNTs), with heterogeneity in their structure, computational nodes, and input sets. In addition, heterogeneous functions allow an HFNT to capture the different characteristics of a dataset efficiently, since each input in a dataset possesses different features. The evolutionary process provides adaptation in structure, weights, activation functions, and input features. Therefore, an optimum HFNT is one that offers the lowest approximation error with the simplest tree structure and the smallest input feature set. However, approximation-error minimization and structure simplification are two conflicting objectives jin2008pareto . Hence, a multi-objective evolutionary approach deb2001multi may offer optimal solution(s) by maintaining a balance between these objectives.
Moreover, in the proposed work, an evolutionary process guides a population of HFNTs towards Pareto-optimal solutions. Hence, the final population may contain several solutions that are close to the best solution. Therefore, an ensemble system was constructed by exploiting many candidates of the population (candidate, solution, and model are synonymous in this article). Such an ensemble system takes advantage of many solutions, including the best one yao1998making . Diversity among the chosen candidates holds the key to making a good ensemble system kuncheva2003measures . Therefore, the solutions in a final population should fulfill the following objectives: low approximation error, structural simplicity, and high diversity. However, these objectives conflict with each other. A fast elitist non-dominated sorting genetic algorithm (NSGA-II)-based multi-objective genetic programming (MOGP) was employed to guide a population of HFNTs deb2000fast . The underlying parameters of the selected models were further optimized by using the differential evolution (DE) algorithm das2016recent . The key contributions of this work may be summarized as follows:
A heterogeneous flexible neural tree (HFNT) for function approximation and feature selection was proposed.

HFNT was studied under an NSGA-II-based multi-objective genetic programming framework. Thus, it was termed HFNT.

Alongside approximation error and tree size (complexity), a diversity index was introduced to maintain diversity among the candidates in the population.

HFNT was found competitive with other algorithms when compared and cross-validated over classification, regression, and time-series datasets.

The proposed evolutionary weighted ensemble of the HFNTs in the final population further improved performance.
A detailed literature review provides an overview of FNT usage over the past few years (Section 2). The conclusions derived from the literature survey support our HFNT approach, in which Pareto-based multi-objective genetic programming was used for HFNT optimization (Section 3.1). Section 3.2 provides a detailed discussion of the basics of HFNT: MOGP for HFNT structure optimization, and DE for HFNT parameter optimization. The efficiency of the above-mentioned hybrid multi-objective FNT algorithm (HFNT) was tested over various prediction problems using a comprehensive experimental setup (Section 4). The experimental results support the merits of the proposed approach (Section 5). Finally, we provide a discussion of the experimental outcomes in Section 6, followed by conclusions in Section 7.
2 Literature Review
The literature survey covers the following points: the basics of FNT, approaches that improved FNT, and FNT's successful application to various real-life problems. Subsequently, the shortcomings of the basic FNT version that inspired us to propose HFNT are summarized.
FNT was first proposed by Chen et al. chen2005time , where a tree-like structure was optimized by using PIPE. Its approximation ability was then tested on time-series forecasting chen2005time and intrusion detection chen2005feature , where a variant of simulated annealing (called degraded ceiling) sanchez2001combining and particle swarm optimization (PSO) kennedy2001swarm , respectively, were used for FNT parameter optimization. Since FNT is capable of input feature selection, in chen2006feature FNT was applied to selecting input features in several classification tasks, in which the FNT structure was optimized by using genetic programming (GP) riolo2014genetic and the parameter optimization was accomplished by using a memetic algorithm chen2011multi . Additionally, the authors defined five different mutation operators, namely: changing one terminal node, changing all terminal nodes, growing a randomly selected subtree, pruning a randomly selected subtree, and pruning redundant terminals. Li et al. li2006flexible proposed an FNT-based construction of decision trees whose nodes were conditionally replaced by neural nodes (activation nodes) to deal with continuous attributes when solving classification tasks. In many other FNT-based approaches, such as Chen2006137 , GP was applied to evolve a hierarchical radial-basis-function network model, and in Chen2007373 a multi-input-multi-output FNT model was evolved. Wu et al. wu2007grammar proposed to use grammar-guided GP Shan2004478 for FNT structure optimization. Similarly, in jia2008mep , the authors proposed to apply multi-expression programming (MEP) oltean2003evolving for FNT structure optimization and an immune programming algorithm musilek2006immune for the parameter-vector optimization. To improve the classification accuracy of FNT, Yang et al. Yang2010690 proposed a hybridization of FNT with a further-division-of-partition-space method. In bouaziz2016evolving , the authors illustrated crossover and mutation operators for evolving FNT using GP and optimized the tree parameters using the PSO algorithm.
A model is considered efficient if it has generalization ability, and we know that a consensus decision is better than an individual decision. Hence, an ensemble of FNTs may lead to better generalized performance than a single FNT. To address this, in Chen2007697 , the authors proposed to make an ensemble of FNTs to predict the chaotic behavior of stock market indices. Similarly, in yang2013ensemble , the proposed FNT ensemble predicted breast cancer and network traffic better than an individual FNT. In ojha2016ensemble , protein dissolution prediction was easier using an ensemble than an individual FNT.
To improve computational efficiency, Peng et al. peng2011parallel proposed a parallel evolving algorithm for FNT, where the parallelization took place in both the tree-structure and parameter-vector populations. In another parallel approach, Wang et al. wang2012modeling used gene expression programming (GEP) ferreira2006gene for evolving FNT and PSO for parameter optimization. A multi-agent system weiss1999multiagent based FNT (MAS-FNT) algorithm was proposed in ammar2015negotiation , which used GEP and PSO for structure and parameter optimization, respectively. The MAS-FNT algorithm relied on the division of the main population into sub-populations, where each sub-population offered local solutions and the best local solution was picked by analyzing tree complexity and accuracy.
Chen et al. chen2005time ; chen2006feature referred to an arbitrary choice of activation function at the non-leaf nodes; however, they restricted themselves to Gaussian functions. A performance analysis of various activation functions is available in burianekperformance . Bouaziz et al. bouaziz2013hybrid ; bouaziz2014universal proposed to use beta-basis functions at the non-leaf nodes of an FNT. Since the beta-basis function has several controlling parameters, such as shape, size, and center, they claimed that it has advantages over other two-parametric activation functions. Similarly, many other forms of neural tree formation, such as the balanced neural tree micheloni2012balanced , the generalized neural tree foresti2002generalized , and the convex-objective-function neural tree rani2015neural , focused on the improvement of the tree's neural nodes.
FNT was chosen over conventional neural-network-based models for various real-world applications related to prediction modeling, pattern recognition, feature selection, etc. Some examples of such applications are cement-decomposing-furnace production-process modeling shou2008modeling , time-series prediction from gene expression profiling yang2013reverse , stock-index modeling Chen2007697 , anomaly detection in peer-to-peer traffic Chen2009685 , intrusion detection novosad2010fast , face identification pan2007face , gesture recognition Guo20121099 , shareholders' management risk prediction qu2011controlling , cancer classification rajini2012swarm , somatic mutation, risk prediction in grid computing abdelwahab2016ensemble , etc.
The following conclusions can be drawn from the literature survey. First, FNT was successfully used in various real-world applications with better performance than other existing function-approximation models; however, it was mostly used in time-series analysis. Second, the lowest approximation error obtained by an individual FNT during an evolutionary phase was considered the best structure and propagated to the parameter-optimization phase; hence, there was no consideration of structural simplicity or generalization ability. Third, the computational nodes of the FNT were fixed initially, and little effort was made to allow their automatic adaptation. Fourth, little attention was paid to the statistical validation of the FNT model; e.g., mostly the single best model was presented as the experimental outcome. However, since the evolutionary process and the metaheuristics are stochastic in nature, statistical validation is crucial for performance comparisons. Finally, to create a generalized model, ensembles of FNTs were used; however, the FNTs were created separately for making the ensemble. Due to the stochastic nature of the evolutionary process, FNTs can be structurally distinct when created at different instances. Therefore, no explicit attention was paid to creating diverse FNTs within a population itself for making an ensemble. In this article, a heterogeneous FNT, called HFNT, is proposed to improve the basic FNT model and its performance by addressing the above-mentioned shortcomings.
3 Multi-objectives and Flexible Neural Tree
In this section, first, Pareto-based multi-objective optimization is discussed. Second, we offer a detailed discussion of FNT and its structure and parameter optimization using NSGA-II-based MOGP and DE, respectively. This is followed by a discussion of making an evolutionary weighted ensemble of candidates from the final population.
3.1 Pareto-Based Multi-objectives
Usually, learning algorithms own a single objective, i.e., approximation-error minimization, which is often achieved by minimizing the mean squared error (MSE) on the learning data. The MSE on a learning set is computed as:

(1) $\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(d_i - y_i\right)^2$

where $d_i$ and $y_i$ are the desired output and the model's output, respectively, and $N$ indicates the total number of data pairs in the learning set. Additionally, a statistical goodness measure, called the correlation coefficient $r$, that tells the relationship between two variables (i.e., between the desired output $d$ and the model's output $y$) may also be used as an objective. The correlation coefficient is computed as:

(2) $r = \frac{\sum_{i=1}^{N}\left(d_i - \bar{d}\right)\left(y_i - \bar{y}\right)}{\sqrt{\sum_{i=1}^{N}\left(d_i - \bar{d}\right)^2 \sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2}}$

where $\bar{d}$ and $\bar{y}$ are the means of the desired output and the model's output, respectively.
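For concreteness, the two measures can be computed as follows; this is an illustrative sketch (the function names are ours, not from the paper):

```python
import math

def mse(d, y):
    """Mean squared error between desired outputs d and model outputs y, Eq. (1)."""
    n = len(d)
    return sum((di - yi) ** 2 for di, yi in zip(d, y)) / n

def correlation(d, y):
    """Pearson correlation coefficient between d and y, Eq. (2)."""
    n = len(d)
    d_bar = sum(d) / n
    y_bar = sum(y) / n
    num = sum((di - d_bar) * (yi - y_bar) for di, yi in zip(d, y))
    den = math.sqrt(sum((di - d_bar) ** 2 for di in d) *
                    sum((yi - y_bar) ** 2 for yi in y))
    return num / den
```

Note that a perfect model gives an MSE of 0 and a correlation coefficient of 1.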
However, a single objective comes at the expense of the model's complexity or its generalization ability on unseen data, where the generalization ability broadly depends on the model's complexity jin2005evolutionary . A common model-complexity indicator is the number of free parameters in the model. Minimizing the approximation error (1) and minimizing the number of free parameters are two conflicting objectives. One approach is to combine these two objectives as:

(3) $f = \mathrm{MSE} + \alpha\,k$

where $\alpha$ is a constant, $\mathrm{MSE}$ is the error (1), and $k$ is the total number of free parameters in a model. The scalarized objective in (3), however, has two disadvantages. First, it is difficult to determine an appropriate $\alpha$ that controls the conflicting objectives; hence, the generalization ability of the produced model will be a mystery das1997closer . Second, the scalarized objective in (3) leads to a single best model that tells nothing about how the conflicting objectives were traded off. In other words, no single solution exists that may satisfy both objectives simultaneously.
We study a multi-objective optimization problem of the form:

minimize $\left\{ f_1(\mathbf{x}), f_2(\mathbf{x}), \ldots, f_m(\mathbf{x}) \right\}$ subject to $\mathbf{x} \in S$,

where we have $m \ge 2$ objective functions $f_i : \mathbb{R}^n \to \mathbb{R}$. We denote the vector of objective functions by $\mathbf{f}(\mathbf{x}) = \left(f_1(\mathbf{x}), f_2(\mathbf{x}), \ldots, f_m(\mathbf{x})\right)^T$. The decision (variable) vectors $\mathbf{x} = (x_1, x_2, \ldots, x_n)^T$ belong to the feasible set $S$, which is a subset of the decision variable space $\mathbb{R}^n$. The word 'minimize' means that we want to minimize all the objective functions simultaneously.
A non-dominated solution is one in which no objective function can be improved without a simultaneous detriment to at least one of the other objectives deb2000fast . A non-dominated solution is also known as a Pareto-optimal solution.
Definition 1.
Pareto-dominance: A solution $\mathbf{x}^1$ is said to dominate a solution $\mathbf{x}^2$ if $f_i(\mathbf{x}^1) \le f_i(\mathbf{x}^2)$ for all $i = 1, 2, \ldots, m$, and there exists an index $j$ such that $f_j(\mathbf{x}^1) < f_j(\mathbf{x}^2)$ holds.
Definition 2.
Pareto-optimal: A solution is called Pareto-optimal if there does not exist any other solution that dominates it. The set of Pareto-optimal solutions is called the Pareto-front.
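The two definitions translate directly into code. The following minimal sketch (the names `dominates` and `pareto_front` are ours) treats every objective as minimized:

```python
def dominates(f1, f2):
    """True if objective vector f1 Pareto-dominates f2 (Definition 1):
    f1 is no worse in every objective and strictly better in at least one."""
    return (all(a <= b for a, b in zip(f1, f2)) and
            any(a < b for a, b in zip(f1, f2)))

def pareto_front(population):
    """Return the non-dominated (rank-1) objective vectors (Definition 2)."""
    return [p for p in population
            if not any(dominates(q, p) for q in population if q is not p)]
```

NSGA-II repeats this non-domination test to peel off successive fronts (rank 1, rank 2, ...) from the population.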
Algorithm 1 is a basic framework of the NSGA-II-based MOGP, which was used for computing Pareto-optimal solutions from an initial HFNT population. The individuals in MOGP were sorted according to their dominance in the population. Note that the population-size function returns the total number of rows for a 2-D matrix and the total number of elements for a vector. Moreover, individuals were sorted according to their rank/Pareto-front. MOGP is an elitist algorithm that allows the best individuals to propagate into the next generation. Diversity in the population was maintained by measuring the crowding distance among the individuals deb2000fast .
3.2 Heterogeneous Flexible Neural Tree
HFNT is analogous to a multi-layer feedforward neural network that has over-layer connections and activation functions at its nodes. HFNT construction has two phases chen2005time : 1) the tree-construction phase, in which evolutionary algorithms are applied to construct a tree-like structure; and 2) the parameter-tuning phase, in which the genotype of the HFNT (the underlying parameters of the tree structure) is optimized by using parameter-optimization algorithms. To create a near-optimum model, phase one starts with random tree-like structures (a population of initial solutions), where the parameters of each tree are fixed by a random guess. Once a near-optimum tree structure is obtained, the parameter-tuning phase optimizes its parameters. The phases are repeated until a satisfactory solution is obtained. Figure 1 is a lucid illustration of these two phases, which work in a co-evolutionary manner. From Figure 1, it may be observed that the two global search algorithms, MOGP (for structure optimization) and DE (for parameter optimization), work in a nested manner to obtain a near-optimum tree that may have a less complex tree structure and better parameters. Moreover, the evolutionary algorithm allowed HFNT to select activation functions and input features at its nodes from sets of activation functions and input features, respectively. Thus, HFNT possesses automatic feature-selection ability.
3.2.1 Basic Idea of HFNT
An HFNT is a collection of a function set $F$ and an instruction set $T$:

(4) $S = F \cup T = \{+_2, +_3, \ldots, +_N\} \cup \{x_1, x_2, \ldots, x_d\}$

where $+_n$ ($n = 2, 3, \ldots, N$) denotes a non-leaf instruction (a computational node). It receives $n$ arguments and randomly takes an activation function from a set of activation functions. The maximum number of arguments to a computational node is predefined. A set of seven activation functions is shown in Table 1. A leaf node's instruction $x_i$ denotes an input variable. Figure 2 is an illustration of a typical HFNT. Similarly, Figure 3 is an illustration of a typical node in an HFNT.
The $j$-th computational node (Figure 3) of a tree (say, a node in Figure 2) receives $n_j$ inputs (denoted $x_1, x_2, \ldots, x_{n_j}$) through $n_j$ connection weights (denoted $w_1, w_2, \ldots, w_{n_j}$) and takes two adjustable parameters $a_j$ and $b_j$ that represent the arguments of the activation function at that node. The purpose of using an activation function at a computational node is to limit the output of the node within a certain range. For example, if the $j$-th node contains a Gaussian function (Table 1), then its output $y_j$ is computed as:

(5) $y_j = e^{-\left(\frac{z_j - a_j}{b_j}\right)^2}$

where $z_j$ is the weighted summation of the inputs and weights at the $j$-th computational node (Figure 3), also known as the excitation of the node. The net excitation $z_j$ of the $j$-th node is computed as:

(6) $z_j = \sum_{i=1}^{n_j} w_i x_i$

where $x_i \in \{x_1, x_2, \ldots, x_d\}$ or $x_i = y_k$, i.e., $x_i$ can be either an input feature (leaf-node value) or the output $y_k$ of another computational node in the tree. The weight $w_i$ is a connection weight of real value in the range $[-1, 1]$. Similarly, the output of a tree is computed from the root node of the tree, which is obtained by recursively computing each node's output using (5) from right to left in a depth-first manner.
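A computational node of this kind can be sketched as follows, assuming the Gaussian activation of (5) and the net excitation of (6); the function name and argument layout are illustrative:

```python
import math

def node_output(inputs, weights, a, b):
    """Output of a computational node with a Gaussian activation.
    inputs  -- leaf values or child-node outputs feeding this node
    weights -- connection weights in [-1, 1], one per input
    a, b    -- the node's adjustable activation-function arguments"""
    # Net excitation: weighted sum of the node's inputs, Eq. (6).
    z = sum(w * x for w, x in zip(weights, inputs))
    # Gaussian activation squashes the excitation, Eq. (5).
    return math.exp(-((z - a) / b) ** 2)
```

Evaluating a whole tree amounts to calling this function bottom-up, feeding each node's output into its parent until the root is reached.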
The fitness of a tree depends on the problem. Usually, the learning algorithm uses the approximation error, i.e., MSE (1). Other fitness measures associated with the tree are tree size and diversity index. The tree size is the number of nodes (excluding the root node) in a tree; e.g., the number of computational nodes and leaf nodes in the tree in Figure 2 is 11 (three computational nodes and eight leaf nodes). The number of distinct activation functions (including the root-node function) randomly selected from the set of activation functions gives the diversity index of a tree. The total number of activation functions selected by the tree in Figure 2 is three; hence, its diversity index is three.
No. | Activation function
1 | Gaussian function
2 | Tangent-hyperbolic
3 | Fermi function
4 | Linear Fermi
5 | Linear tangent-hyperbolic
6 | Bipolar sigmoid
7 | Unipolar sigmoid
3.3 Structure and Parameter Learning (Near-optimal Tree)
A tree that offers the lowest approximation error and the simplest structure is a near-optimal tree, which can be obtained by using an evolutionary algorithm such as GP riolo2014genetic , PIPE salustowicz1997probabilistic , GEP ferreira2006gene , MEP oltean2003evolving , and so on. To optimize the tree parameters, algorithms such as the genetic algorithm eiben2015ec , evolution strategy eiben2015ec , artificial bee colony karaboga2007powerful , PSO kennedy2001swarm ; zhang2015comprehensive , DE das2016recent , or any hybrid algorithm, such as GA and PSO juang2004hybrid , can be used.
3.3.1 Tree-construction
The proposed multi-objective optimization of FNT has three fitness measures: approximation-error (1) minimization, tree-size minimization, and diversity-index maximization. These objectives are simultaneously optimized during the tree-construction phase using MOGP, which guides an initial population of random tree structures according to Algorithm 1. The detailed description of the components of Algorithm 1 is as follows:
Selection
In the selection operation, a mating pool of predefined size is created using binary tournament selection, where two candidates are randomly selected from the population and the better of the two (according to rank and crowding distance) is placed into the mating pool. This process is continued until the mating pool is full. An offspring population is generated by using the individuals of the mating pool: two distinct individuals (parents) are randomly selected from the pool to create new individuals using the genetic operators crossover and mutation, which are applied with their respective probabilities.
Crossover
In the crossover operation, randomly selected subtrees of two parent trees were swapped. The swapping included the exchange of activation nodes, weights, and inputs, as described in bouaziz2016evolving ; eiben2015ec ; Wongseree2007771 .
Mutation
The mutation of an individual selected from the mating pool took place in one of the following ways bouaziz2016evolving ; eiben2015ec ; Wongseree2007771 :

A randomly selected terminal node is replaced by a newly generated terminal node.

All terminal nodes of the selected tree are replaced by randomly generated new terminal nodes.

A randomly selected terminal node or a computational node is replaced by a randomly generated subtree.

A randomly selected terminal node is replaced by a randomly generated computational node.
In the proposed MOGP, during each mutation event, one of the above-mentioned four mutation operators was randomly selected to mutate the tree.
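On a toy nested-list representation of a tree (a leaf is a feature name; an internal node is a list of subtrees), the first two operators and the random dispatch can be sketched as follows. The representation, names, and feature set are hypothetical, and the two subtree operators (growth and pruning) are omitted for brevity:

```python
import random

FEATURES = ['x1', 'x2', 'x3', 'x4']   # hypothetical terminal set

def replace_one_terminal(tree):
    """Operator 1: replace one randomly chosen terminal with a new terminal."""
    if isinstance(tree, str):                      # reached a leaf
        return random.choice(FEATURES)
    i = random.randrange(len(tree))                # descend into a random child
    return tree[:i] + [replace_one_terminal(tree[i])] + tree[i + 1:]

def replace_all_terminals(tree):
    """Operator 2: replace every terminal with a randomly generated terminal."""
    if isinstance(tree, str):
        return random.choice(FEATURES)
    return [replace_all_terminals(c) for c in tree]

def mutate(tree, operators=(replace_one_terminal, replace_all_terminals)):
    """Each mutation event applies one operator chosen uniformly at random."""
    return random.choice(operators)(tree)
```

Both operators return a new tree of the same shape, which keeps the mutated individual a valid HFNT genotype.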
Recombination
In the recombination operation, the offspring population and the main population are merged to make a combined population.
Elitism
In this step, the worst individuals are weeded out; in other words, the best individuals are propagated to the new generation as the main population.
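The selection-recombination-elitism cycle above can be sketched as a single generation. Here a scalar `fitness` key stands in for NSGA-II's (rank, crowding distance) ordering, and `breed` abstracts crossover/mutation; all names are illustrative:

```python
import random

def evolve_one_generation(population, fitness, breed, pool_frac=0.5):
    """One MOGP generation sketch: binary-tournament mating pool -> offspring
    -> recombination with parents -> elitist truncation to the original size."""
    n = len(population)
    # Selection: binary tournaments fill the mating pool (lower fitness wins).
    pool = []
    while len(pool) < max(2, int(pool_frac * n)):
        a, b = random.sample(population, 2)
        pool.append(a if fitness(a) <= fitness(b) else b)
    # Crossover/mutation: breed pairs of pool members into offspring.
    offspring = [breed(*random.sample(pool, 2)) for _ in range(len(pool))]
    # Recombination + elitism: merge and keep only the best n individuals.
    combined = population + offspring
    return sorted(combined, key=fitness)[:n]
```

Because truncation acts on the merged population, the best individual can never be lost between generations, which is the elitism property of NSGA-II.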
3.3.2 Parameter-tuning
In the parameter-tuning phase, a single objective, i.e., the approximation error, was used in the optimization of HFNT parameters by DE. The tree parameters, such as the weights of tree edges and the arguments of activation functions, were encoded into a vector for the optimization. In addition, a cross-validation (CV) phase was used for the statistical validation of HFNTs.
The basics of DE are as follows. For an initial population $P$ of parameter vectors $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_{|P|}$, DE repeats its steps, mutation, recombination, and selection, until an optimum parameter vector is obtained. DE updates each parameter vector $\mathbf{v}_i$ by selecting three random vectors $\mathbf{v}_{r_1}$, $\mathbf{v}_{r_2}$, and $\mathbf{v}_{r_3}$ from $P$ such that $i \ne r_1 \ne r_2 \ne r_3$ holds. Hence, for each $i = 1, 2, \ldots, |P|$ and each variable index $j = 1, 2, \ldots, n$, the $j$-th variable of the $i$-th trial vector $\mathbf{u}_i$ is generated by using crossover, mutation, and recombination as:

(7) $u_{ij} = \begin{cases} v_{r_1 j} + F\left(v_{r_2 j} - v_{r_3 j}\right) & \text{if } \mathrm{rand}_j \le \mathrm{CR} \text{ or } j = j_{\mathrm{rand}} \\ v_{ij} & \text{otherwise} \end{cases}$

where $j_{\mathrm{rand}}$ is a random index in $[1, n]$, $\mathrm{rand}_j$ is a uniform random number within $[0, 1]$, $\mathrm{CR} \in [0, 1]$ is the crossover probability, and $F$ is the mutation factor. The trial vector $\mathbf{u}_i$ is selected if

(8) $\mathbf{v}_i = \begin{cases} \mathbf{u}_i & \text{if } f(\mathbf{u}_i) \le f(\mathbf{v}_i) \\ \mathbf{v}_i & \text{otherwise} \end{cases}$

where $f(\cdot)$ returns the fitness of a vector as per (1). Hence, the process of crossover, mutation, recombination, and selection is repeated until an optimal parameter vector solution is found.
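Assuming the classic DE/rand/1/bin scheme (the exact DE variant used in the paper may differ), the parameter-tuning loop can be sketched as:

```python
import random

def de_optimize(f, dim, pop_size=20, cr=0.9, fmut=0.7, iters=200,
                bounds=(-1.0, 1.0)):
    """Minimize f over dim-dimensional vectors with DE/rand/1/bin.
    cr is the crossover probability CR; fmut is the mutation factor F."""
    lo, hi = bounds
    pop = [[random.uniform(lo, hi) for _ in range(dim)] for _ in range(pop_size)]
    fit = [f(v) for v in pop]
    for _ in range(iters):
        for i in range(pop_size):
            # Three mutually distinct random vectors, none equal to the target.
            r1, r2, r3 = random.sample([k for k in range(pop_size) if k != i], 3)
            j_rand = random.randrange(dim)
            # Trial vector per Eq. (7): binomial crossover of target and mutant.
            trial = [pop[r1][j] + fmut * (pop[r2][j] - pop[r3][j])
                     if (random.random() <= cr or j == j_rand) else pop[i][j]
                     for j in range(dim)]
            # Greedy selection per Eq. (8): keep the trial if it is no worse.
            f_trial = f(trial)
            if f_trial <= fit[i]:
                pop[i], fit[i] = trial, f_trial
    best = min(range(pop_size), key=lambda k: fit[k])
    return pop[best], fit[best]
```

For HFNT, `f` would decode the vector into edge weights and activation-function arguments and return the tree's MSE on the training data.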
3.4 Ensemble: Making use of MOGP Final Population
In the tree-construction phase, MOGP provides a population from which we can select tree models for making the ensemble. Three conflicting objectives, approximation error, tree size, and diversity index, allow the creation of Pareto-optimal solutions, where the solutions are distributed on various Pareto-optimal fronts according to their rank in the population. Ensemble candidates can be selected from the first line of solutions (front 1), or they can be chosen by examining the three objectives depending on the user's need and preference. Accuracy of and diversity among the ensemble candidates are important kuncheva2003measures . Hence, in this work, approximation error and diversity among the candidates were given preference over tree size. Note that the 'diversity index' should not be confused with 'diversity': the diversity index is an objective in MOGP, whereas the diversity is derived from the number of distinct candidates in an ensemble. A collection of diverse candidates is called a bag of candidates hastie2009elements . In this work, any two trees were considered diverse (distinct) if any of the following held: 1) the two trees were of different size; 2) the numbers of function nodes or leaf nodes in the two trees were dissimilar; 3) the two models used different sets of input features; 4) the two models used different sets of activation functions. Hence, the diversity of an ensemble (a bag of solutions) was computed as:
(9) $\delta = \frac{\phi(B)}{m}$

where $\phi(\cdot)$ is a function that returns the total number of distinct models in an ensemble $B$ and $m$ is the total number of models in the bag.
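Given a hashable signature per model (e.g., tree size, node counts, feature set, activation set), Eq. (9) reduces to a one-liner; the signature encoding is an assumption of this sketch:

```python
def ensemble_diversity(signatures):
    """Diversity of a bag, Eq. (9): the number of distinct models divided by
    the total number of models. Each model is summarized by a hashable
    signature built from the four distinctness criteria in the text."""
    return len(set(signatures)) / len(signatures)
```

A bag of fully distinct candidates gives a diversity of 1.0; a bag of clones gives 1/m.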
Now, for a classification problem, to compute the combined vote of the respective candidates' outputs $y_1, y_2, \ldots, y_m$ of a bag $B$ and classes $\Omega = \{c_1, c_2, \ldots, c_k\}$, we used an indicator function $I(\cdot)$, which takes 1 if '$\cdot$' is true and 0 if '$\cdot$' is false. Thus, the ensemble decision by weighted majority voting is computed as polikar2006ensemble ; zhou2012ensemble :

(10) $y = \arg\max_{c_j \in \Omega} \sum_{i=1}^{m} w_i\, I\!\left(y_i = c_j\right)$

where $w_i$ is the weight associated with the $i$-th candidate in the ensemble, and $y$ is set to class $c_j$ if the total weighted vote received by $c_j$ is higher than the total vote received by any other class. Similarly, the ensemble of regression methods was computed by the weighted arithmetic mean as polikar2006ensemble :

(11) $y = \sum_{i=1}^{m} w_i\, y_i$

where $w_i$ and $y_i$ are the weight and output of the $i$-th candidate in the bag $B$, respectively, and $y$ is the ensemble output, which is then used for computing the MSE (1) and correlation coefficient (2). The weights may be computed according to the fitness of the models or by using a metaheuristic algorithm. In this work, DE was applied to compute the ensemble weights $w_i$, where the population size was set to 100 and the number of function evaluations was set to 300,000.
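Both combination rules can be sketched as follows; note that the weighted mean here normalizes by the weight sum, which is an assumption of this sketch (it is equivalent to Eq. (11) when the weights sum to one):

```python
from collections import defaultdict

def weighted_vote(outputs, weights):
    """Weighted majority voting, Eq. (10): each candidate's class vote is
    scaled by its weight; the class with the largest total wins."""
    score = defaultdict(float)
    for y, w in zip(outputs, weights):
        score[y] += w
    return max(score, key=score.get)

def weighted_mean(outputs, weights):
    """Weighted arithmetic mean for regression ensembles, cf. Eq. (11)."""
    return sum(w * y for y, w in zip(outputs, weights)) / sum(weights)
```

With uniform weights these rules reduce to plain majority voting and the simple average, so the DE-tuned weights only ever refine that baseline.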
3.5 Multi-objective: A General Optimization Strategy
A summary of the general HFNT learning algorithm is as follows:

Initialize the HFNT training parameters.

Apply the tree-construction phase to guide the initial HFNT population towards Pareto-optimal solutions.

Select tree model(s) from the MOGP final population according to their approximation error, tree size, and diversity index along the Pareto-front.

Apply the parameter-tuning phase to optimize the selected tree model(s).

Go to Step 2 if no satisfactory solution is found; else, go to Step 6.

Use a cross-validation (CV) method to validate the chosen model(s).

Use the chosen tree model(s) for making an ensemble (recommended).

Compute the results of the ensemble model (recommended).
4 Experimental Setup
Several experiments were designed for evaluating the proposed HFNT, and a careful parameter-setting was used for testing its efficiency. A detailed description of the parameter-setting is given in Table 2, which includes definitions, default ranges, and selected values. The phases of the algorithm were repeated until the stopping criteria were met, i.e., either the lowest predefined approximation error was achieved or the maximum number of function evaluations was reached. The repetition holds the key to obtaining a good solution: a carefully designed repetition of these two phases may offer a good solution in fewer function evaluations.
In this experiment, three general repetitions were used, with 30 tree-construction iterations and 1000 parameter-tuning iterations (Figure 1). Hence, the maximum number of function evaluations^1 was 154,080 (^1: initial GP population + three repetitions × ((GP population + mating-pool size) × MOGP iterations + MH population × MH iterations) = 30 + 3 × ((30 + 15) × 30 + 50 × 1000) = 154,080). The DE version das2016recent with CR equal to 0.9 and F equal to 0.7 was used in the parameter-tuning phase.
Parameter | Definition | Value
Scaling | Input-features scaling range. | [0, 1]
Tree height | Maximum depth (layers) of a tree model. | 4
Tree arity | Maximum arguments of a node. | 5
Node range | Search space of function arguments. | [0, 1]
Edge range | Search space for edges (weights) of tree. | [-1, 1]
Population | MOGP population. | 30
Mutation | Mutation probability. | 0.3
Crossover | Crossover probability. | 0.7
Mating pool | Size of the pool of selected candidates. | 0.5
Tournament | Tournament selection size. | 2
Population | DE population. | 50
General | Maximum number of trials. | 3
Structure | MOGP iterations. | 30
Parameter | DE iterations. | 1000
The experiments were conducted over classification, regression, and time-series datasets. A detailed description of the chosen datasets from the UCI machine learning UCILichman2013 and KEEL alcala2009keel repositories is available in Table 17. The parameter-setting mentioned in Table 2 was used for the experiments over each dataset. Since stochastic algorithms depend on random initialization, a pseudo-random number generator, the Mersenne Twister algorithm matsumoto1998mersenne , which draws values from a probability distribution in a pseudo-random manner, was used for the initialization of HFNTs. Hence, each run of the experiment was conducted with a random seed drawn from the system. We compared HFNT performance with various other approximation models collected from the literature; a list of such models is provided in Table 18. A software tool based on the proposed HFNT algorithm, developed for predictive modeling, is available in ojha2016mogp .
To construct good ensemble systems, highly diverse and accurate candidates were selected into the ensemble bag. To increase diversity (9) among the candidates, the Pareto-optimal solutions were examined by giving preference to candidates with low approximation error, small tree size, and distinctness from the other selected candidates. Hence, candidates were selected from the population. An illustration of such a selection method is shown in Figure 4, which represents an MOGP final population of 50 candidate solutions computed over the dataset MGS.
MOGP simultaneously optimized three objectives. Hence, the solutions were arranged on a three-dimensional map (Figure 4(a)), in which the error was plotted along the x-axis, the tree size along the y-axis, and the diversity index along the z-axis. However, for simplicity, we also arranged the solutions in 2-D plots (Figure 4(b)), in which the computed error was plotted along the x-axis, and the tree size (indicated by blue dots) and the diversity index (indicated by red squares) were plotted along the y-axis. From Figure 4(b), it is evident that a clear choice is difficult, since decreasing the approximation error increases the model's tree size (blue dots in Figure 4(b)) and the diversity index (red squares in Figure 4(b)). Hence, the solutions along the Pareto-front (rank 1), i.e., the Pareto surface indicated in the 3-D map of the solutions in Figure 4(a), were chosen for the ensemble. For all datasets, ensemble candidates were selected by examining Pareto-fronts in a fashion similar to that described for the dataset MGS in Figure 4.
The purpose of our experiment was to obtain sufficiently good prediction models by enhancing predictability and lowering complexity. We used MOGP for optimization of HFNTs. Hence, we were compromising fitness by lowering models complexity. In single objective optimization, we only looked for models fitness. Therefore, we did not possess control over model’s complexity. Figure 5 illustrates eight runs of both single and multiobjective optimization course of HFNT, where models tree size (complexity) is indicated along yaxis and xaxis indicates fitness value of the HFNT models. The results shown in Figure 5 was conducted over MGS dataset. For each single objective GP and multiobjective GP, optimization course was noted, i.e., successive fitness reduction and tree size were noted for 1000 iterations.
It is evident from Figure 5 that the multi-objective approach guides HFNT optimization while keeping the model's complexity low, whereas in the single-objective case, the model's complexity was unbounded and increased abruptly. The average tree sizes over eight runs of single- and eight runs of multi-objective optimization were 39.265 and 10.25, respectively, whereas the average fitness values were 0.1423 and 0.1393, respectively. In single-objective optimization, since the tree size is unbounded, the fitness of a model may improve at the expense of the model's complexity. Hence, the experiments were set up for multi-objective optimization, which provides a balance between both objectives, as described in Figure 4.
5 Results
Experimental results were classified into three categories: classification, regression, and time-series. Each category has two parts: 1) the first part describes the best and average results obtained from the experiments; 2) the second part describes the ensemble results in tabular and graphical form.
5.1 Classification dataset
We chose five classification datasets for evaluating HFNT, and the classification accuracy was computed as:

accuracy = (TP + TN) / (TP + TN + FP + FN),    (12)

where TP is the total number of positive samples correctly classified as positive, TN is the total number of negative samples correctly classified as negative, FP is the total number of negative samples incorrectly classified as positive, and FN is the total number of positive samples incorrectly classified as negative. Here, for a binary classification problem, a positive sample indicates the class labeled '1' and a negative sample the class labeled '0'. Similarly, for a three-class classification problem (classes c1, c2, and c3), the samples labeled as class c1 are set to 1, 0, 0, i.e., positive for class c1 and negative for c2 and c3; the samples labeled as class c2 are set to 0, 1, 0; and the samples labeled as class c3 are set to 0, 0, 1.
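The accuracy measure of (12) can be sketched directly from a pair of binary label vectors; a minimal sketch (the one-versus-rest encoding for multi-class problems reduces to this binary case per class):

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for binary (0/1) label vectors."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Classification accuracy as in (12): correct samples over all samples."""
    return (tp + tn) / (tp + tn + fp + fn)
```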
5.1.1 10-Fold CV
The experiments on the classification datasets were conducted in three batches that produced 30 models, and each model was cross-validated using 10-fold CV, in which a dataset is divided equally into 10 sets and the training of a model is repeated 10 times. Each time, a distinct set was used for testing the model and the remaining nine sets were used for training. The obtained results are summarized in Table 3. Each batch of experiments produced an ensemble system of 10 models, whose results are shown in Table 7.
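The 10-fold CV protocol above can be sketched as an index-splitting routine; a minimal sketch (the shuffling and fold assignment here are illustrative, not the exact partitioning used in the experiments):

```python
import random

def k_fold_indices(n_samples, k=10, seed=None):
    """Split sample indices into k roughly equal folds; each fold serves
    once as the test set while the remaining k-1 folds form the
    training set."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # round-robin split of shuffled indices
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```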
The results presented in Table 3 describe the best and mean results of the 30 models. We present a comparative study of the best 10-fold CV results of HFNT and the results reported in the literature in Table 4. In Table 4, the results of HDT and FNT li2006flexible are 10-fold CV results on the test dataset, whereas the result of FNT chen2006ensemble is the best test accuracy and not a CV result. The results summarized in Table 4 suggest a comparatively better performance of the proposed HFNT over the previous approaches. To illustrate a model created by the HFNT approach, we chose the best model of dataset WDB, which has a test accuracy of 96.96% (shown in Table 3). A pictorial representation of the WDB model is shown in Figure 6, where the model's tree size is 7, the total number of selected input features is 5, and the selected activation function at both non-leaf nodes is the tangent hyperbolic. Similarly, models of all other datasets may be represented.
Best of 30 models  Mean of 30 models  
Data  train  test  tree size  Features  train  test  avg. tree size  diversity 
AUS  87.41%  87.39%  4  3  86.59%  85.73%  5.07  0.73 
HRT  87.41%  87.04%  8  5  82.40%  80.28%  7.50  0.70 
ION  90.92%  90.29%  5  3  87.54%  86.14%  6.70  0.83 
PIM  78.67%  78.03%  10  5  71.12%  70.30%  6.33  8.67 
WDB  97.02%  96.96%  6  5  94.51%  93.67%  7.97  0.73 
Algorithms  AUS  HRT  ION  PIM  WDB  
test  test  test  test  test  
HDT li2006flexible  86.96%  2.058  76.86%  2.086  89.65%  1.624  73.95%  2.374  
FNT li2006flexible  83.88%  4.083  83.82%  3.934  88.03%  0.953  77.05%  2.747  
FNT chen2006ensemble  93.66%  n/a  
HFNT  87.39%  0.029  87.04%  0.053  90.29%  0.044  78.03%  0.013  96.96%  0.005 
Table 4 also reports the variance of the algorithms. In this work, the Friedman test was conducted to examine the significance of the algorithms. For this purpose, the classification accuracy (test results) was considered (Table 4). The average rank obtained by each method in the Friedman test is shown in Table 5. The critical value of the chi-square distribution with 2 degrees of freedom at significance level α = 0.05 is 5.991. The test statistic obtained from the Friedman test is 6. Since 6 > 5.991, the null hypothesis that "there is no difference between the algorithms" is rejected. In other words, the p-value computed by the Friedman test is 0.049787, which is less than or equal to 0.05; hence, we reject the null hypothesis.
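The statistic can be reproduced from the average ranks in Table 5; a minimal sketch using the standard Friedman formula, assuming the test covers the four datasets (AUS, HRT, ION, PIM) for which all three algorithms report results (the closed-form survival function for 2 degrees of freedom stands in for a statistics library):

```python
import math

# Average ranks from Table 5 over N = 4 datasets and k = 3 algorithms
avg_ranks = {"HFNT": 1.0, "HDT": 2.5, "FNT": 2.5}
N, k = 4, 3

# Friedman statistic: chi2_F = 12N/(k(k+1)) * sum(R_j^2) - 3N(k+1)
chi2_f = (12 * N / (k * (k + 1))) * sum(r ** 2 for r in avg_ranks.values()) \
         - 3 * N * (k + 1)

# For 2 degrees of freedom, the chi-square survival function is exp(-x/2)
p_value = math.exp(-chi2_f / 2)
print(chi2_f, round(p_value, 6))  # 6.0 0.049787
```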
Table 6 describes the significance of the differences between the algorithms. To compare the best-ranked algorithm in the Friedman test, i.e., the proposed algorithm HFNT, against the other two algorithms, Holm's method holm1979simple was used. Holm's method rejects the hypothesis of equality between the best algorithm (HFNT) and another algorithm if the p-value is less than α/i, where i is the position of the algorithm in a list sorted in ascending order of p-values (Table 6). From the post-hoc analysis, it was observed that the proposed algorithm HFNT outperformed both the HDT li2006flexible and FNT li2006flexible algorithms.
Algorithm  Ranking 
HFNT  1.0 
HDT  2.5 
FNT  2.5 
i  algorithm  z  p  α/i  Hypothesis
2  HDT  2.12132  0.033895  0.05  rejected 
1  FNT  2.12132  0.033895  0.1  rejected 
5.1.2 Ensembles
The best accuracy and the average accuracy of the 30 models presented in Table 3 are evidence of HFNT's efficiency. However, as mentioned earlier, a generalized solution may be obtained by using an ensemble. All 30 models were created in three batches; hence, three ensemble systems were obtained. The results of these ensemble systems are presented in Table 7, where the ensemble results are the accuracies obtained by weighted majority voting (10). In Table 7, the classification accuracies were computed over the CV test dataset. From Table 7, it may be observed that high diversity among the ensemble candidates offered comparatively higher accuracy. Hence, an ensemble model may be adopted by examining the performance of an ensemble system, i.e., the average tree size (complexity) of the candidates within the ensemble and the selected input features.
An ensemble system created from genetic evolution and adaptation is crucial for feature selection and analysis. The ensemble results summarized in Table 7 give the following useful information about the HFNT feature selection ability: 1) TSF - total selected features; 2) MSF - most significant (frequently selected) features; and 3) MIF - most infrequently selected features.
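The weighted majority vote of (10) can be sketched as follows; the weight vector here is an illustrative assumption (in the experiments the weights were tuned by DE):

```python
import numpy as np

def weighted_majority_vote(predictions, weights):
    """Combine the class predictions of ensemble candidates by a weighted
    vote: each model contributes its weight to the class it predicts, and
    the class with the largest total weight wins per sample.

    predictions: (n_models, n_samples) array of integer class labels
    weights: (n_models,) array of model weights
    """
    predictions = np.asarray(predictions)
    weights = np.asarray(weights, dtype=float)
    classes = np.unique(predictions)
    # Per class: sum the weights of the models voting for it, per sample
    scores = np.stack([(weights[:, None] * (predictions == c)).sum(axis=0)
                       for c in classes])
    return classes[scores.argmax(axis=0)]
```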
Data  Batch  test  avg. tree size  diversity (9)  TSF  MSF  MIF
AUS  1  86.96%  5  0.7  4  , , ,  , , , , 
2  85.51%  6  0.7  5  
3  86.81%  4.2  0.8  5  
HRT  1  77.41%  6.8  0.5  6  , , ,  
2  70.37%  7.6  0.6  9  
3  87.04%  8.1  1  10  
ION  1  82.86%  7.2  0.9  15  , , ,  , , , , , , , , 
2  90.29%  7.3  1  16  
3  86.57%  5.6  0.6  6  
PIM  1  76.32%  6.9  1  8  , , , , ,  
2  64.74%  5.6  0.7  7  
3  64.21%  7.4  0.9  8  
WDB  1  94.29%  8.2  0.7  15  , , ,  , , , , , , 
2  93.75%  5  1  15  
3  94.29%  10.7  0.5  15 
5.2 Regression dataset
5.2.1 5-Fold CV
For the regression datasets, the performance of HFNT was examined using the 5-fold CV method, in which the dataset was divided into 5 sets, each 20% in size, and the process was repeated five times. Each time, four sets were used for training and one set for testing; hence, a total of 5 runs were used for each model. As described in gacto2014metsk , the MSE (1) was used for evaluating HFNT, and the training MSE and test MSE are reported separately. This setting of MSE computation and cross-validation was adopted to compare with the results collected from gacto2014metsk . Table 8 presents the 5-fold CV results of each dataset for 30 models; hence, each presented result is averaged over a total of 150 runs of the experiment. Similarly, Table 9 shows a comparison between HFNT and other algorithms collected from the literature. It is evident from the comparative results that HFNT performs very competitively with the other algorithms. The literature results were averaged over 30 runs of experiments, whereas the HFNT results were averaged over 150 runs; hence, the competitive results of HFNT are evidence of its efficiency.
Moreover, HFNT is distinct from the other algorithms mentioned in Table 9 because it performs feature selection and model complexity minimization simultaneously, whereas the other algorithms used all available features. Therefore, the comparison of results was limited to assessing the average MSE, where HFNT, which gives simpler models than the others, stands firmly competitive. An illustration of the best model of regression dataset DEE is provided in Figure 7, where the model offered a test MSE of 0.077, a tree size equal to 10, and four selected input features. The selected activation functions were the unipolar sigmoid, bipolar sigmoid, tangent hyperbolic, and Gaussian functions. Note that while creating the HFNT models, the datasets were normalized as described in Table 2 and the outputs of the models were denormalized accordingly. Therefore, normalized inputs should be presented to the tree (Figure 7), and the output of the tree should be denormalized.
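The normalize/denormalize step mentioned above can be sketched as follows, assuming a min-max scaling to [0, 1]; the exact per-dataset normalization is the one described in Table 2, so both the scaling scheme and ranges here are illustrative assumptions:

```python
def normalize(x, lo, hi):
    """Min-max scale a raw value x into [0, 1] given the feature's range."""
    return (x - lo) / (hi - lo)

def denormalize(y, lo, hi):
    """Map a model output in [0, 1] back to the target's original range."""
    return y * (hi - lo) + lo
```

Inputs are normalized before being presented to the tree, and the tree's output is mapped back before computing the reported MSE.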
Best of 30 models  Mean of 30 models  
Data  train  test  tree size  #Features  train  test  tree size  diversity 
ABL  2.228  2.256  14  5  2.578  2.511  11.23  0.7 
BAS  198250  209582  11  5  261811  288688.6  7.69  0.6 
DEE  0.076  0.077  10  4  0.0807  0.086  11.7  0.7 
ELV  8.33  8.36  11  7  1.35  1.35  7.63  0.5 
FRD  2.342  2.425  6  5  3.218  3.293  6.98  0.34 
Note: Results of ELV should be multiplied with 10 
Algorithms  ABL  BAS  DEE  ELV  FRD  
MLP    2.694    540302    0.101    2.04  3.194  
ANFISSUB  2.008  2.733  119561  1089824  3087  2083  61.417  61.35  0.085  3.158 
TSKIRL  2.581  2.642  0.545  882.016  0.433  1.419  
LINEARLMS  2.413  2.472  224684  269123  0.081  0.085  4.254  4.288  3.612  3.653 
LELTSK  2.04  2.412  9607  461402  0.662  0.682  0.322  1.07  
METSKHD  2.205  2.392  47900  368820  0.03  0.103  6.75  7.02  1.075  1.887 
HFNT  2.578  2.511  261811  288688.6  0.0807  0.086  1.35  1.35  3.218  3.293 
Note: ELV results should be multiplied with 10, HFNT results were averaged over 150 runs compared to MLP, ANFISSUB, TSKIRL, LINEARLMS, LELTSK, and METSKHD, which were averaged over 30 runs. 
For the regression datasets, the Friedman test was conducted to examine the significance of the algorithms. For this purpose, the best test MSE of the algorithms MLP, ANFISSUB, TSKIRL, LINEARLMS, LELTSK, and METSKHD was taken from Table 9 and the best test MSE of the algorithm HFNT was taken from Table 8. The average rank obtained by each method in the Friedman test is shown in Table 10. The critical value of the chi-square distribution with 5 degrees of freedom at significance level α = 0.05 is approximately 11. The test statistic obtained from the Friedman test is 11; hence, the null hypothesis that "there is no difference between the algorithms" is rejected. In other words, the p-value computed by the Friedman test is 0.05, which is less than or equal to 0.05; hence, we reject the null hypothesis.
Algorithm  Ranking 
HFNT  1.5 
METSKHD  2.75 
LELTSK  3.25 
LINEARLSM  3.5 
MLP  4.5 
ANFISSUB  5.5 
From the Friedman test, it is clear that the proposed algorithm HFNT performed best among all the algorithms. The post-hoc analysis presented in Table 11 describes the significance of the differences between the algorithms. For this purpose, we applied Holm's method holm1979simple , which rejects the hypothesis of equality between the best algorithm (HFNT) and another algorithm if the p-value is less than α/i, where i is the position of the algorithm in a list sorted in ascending order of p-values (Table 11).
In the obtained results, the equality of ANFISSUB and MLP with HFNT was rejected, whereas HFNT's equality with the other algorithms cannot be rejected at α = 0.1, i.e., with 90% confidence. However, the p-values shown in Table 11 indicate the quality of their performance and their statistical closeness to HFNT. It can be observed that the algorithm METSKHD performed closest to HFNT, followed by LELTSK and LINEARLSM.
i  algorithm  z  p  α/i  Hypothesis
5  ANFISSUB  3.023716  0.002497  0.02  rejected 
4  MLP  2.267787  0.023342  0.025  rejected 
3  LINEARLSM  1.511858  0.13057  0.033  
2  LELTSK  1.322876  0.185877  0.05  
1  METSKHD  0.944911  0.344704  0.1 
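The step-down procedure behind Tables 6 and 11 can be sketched as follows, assuming the convention used there (p-values sorted ascending, the smallest compared against α/i with i counting down from the number of hypotheses):

```python
def holm_rejections(p_values, alpha=0.1):
    """Holm's method against the best-ranked algorithm: sort unadjusted
    p-values ascending and reject while p < alpha / i, where i counts
    down from the number of hypotheses (the 'i' column of Table 11).

    p_values: dict {algorithm: unadjusted p-value vs. the best algorithm}
    Returns the list of algorithms whose equality hypothesis is rejected.
    """
    ordered = sorted(p_values.items(), key=lambda kv: kv[1])
    m = len(ordered)
    rejected = []
    for pos, (name, p) in enumerate(ordered):
        i = m - pos
        if p < alpha / i:
            rejected.append(name)
        else:
            break  # step-down: stop at the first non-rejection
    return rejected
```

Running this on the p-values of Table 11 with α = 0.1 rejects exactly ANFISSUB and MLP, matching the table.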
5.2.2 Ensembles
For each dataset, we constructed five ensemble systems using the 10 models of each batch. In each batch, 10 models were created and cross-validated using 2-fold CV, in which a dataset is randomly divided into two equal sets, A and B. This partition of the dataset was repeated five times, and each time set A was presented for training, set B was presented for testing, and vice versa. Hence, a total of 10 runs of the experiment were performed for each model. The collected ensemble results are presented in Table 12, where the ensemble outputs were obtained by using the weighted arithmetic mean as mentioned in (11).
The weights of the models were computed using the DE algorithm, where the parameter setting was similar to the one mentioned for the classification datasets. The ensemble results shown in Table 12 are the MSE and correlation coefficient computed on the CV test dataset. From the ensemble results, it can be said that the ensembles with higher diversity offered better results than the ensembles with lower diversity. The models of each ensemble were examined to evaluate the MSF and MIF presented in Table 12. A graphical illustration of the ensemble results is shown in Figure 8 using scatter (regression) plots, where a scatter plot shows how much one variable is affected by another (in this case, the model's and the desired outputs), i.e., their correlation. The plots shown in Figure 8 represent the best ensemble batches (numbers indicated in bold in Table 12): four, five, three, four, and five, whose MSEs are 2.2938, 270706, 0.1085, 1.1005, and 2.3956, respectively. The correlation values shown in the plots indicate the quality of the regression fit over the CV test datasets. In other words, the ensemble models were obtained with generalization ability.
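The DE-based weight tuning can be sketched as a classic DE/rand/1/bin loop minimizing the ensemble MSE. This is a minimal sketch under stated assumptions; the exact DE variant, population size, and control parameters used in the experiments may differ:

```python
import numpy as np

def de_ensemble_weights(preds, target, pop=20, gens=100, F=0.7, CR=0.9,
                        seed=0):
    """Tune ensemble-candidate weights with DE by minimizing the MSE of
    the weighted arithmetic mean (11) of the candidates' outputs.

    preds: (n_models, n_samples) candidate outputs
    target: (n_samples,) desired outputs
    """
    rng = np.random.default_rng(seed)
    n = preds.shape[0]

    def mse(w):
        w = np.abs(w)
        w = w / w.sum()                      # convex combination of models
        return float(np.mean((w @ preds - target) ** 2))

    P = rng.random((pop, n)) + 0.01          # avoid an all-zero weight vector
    fit = np.array([mse(w) for w in P])
    for _ in range(gens):
        for i in range(pop):
            # DE/rand/1 mutation (base vectors not excluded from i, for brevity)
            a, b, c = P[rng.choice(pop, 3, replace=False)]
            mutant = a + F * (b - c)
            cross = rng.random(n) < CR       # binomial crossover
            cross[rng.integers(n)] = True    # ensure at least one gene crosses
            trial = np.where(cross, mutant, P[i])
            f = mse(trial)
            if f < fit[i]:                   # greedy one-to-one selection
                P[i], fit[i] = trial, f
    best = np.abs(P[fit.argmin()])
    return best / best.sum()
```

Given an accurate candidate and a biased one, the tuned weights concentrate on the accurate model, which is the behavior the weighted-mean ensemble relies on.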
Data  batch  MSE  r  avg. tree size  diversity (9)  TSF  MSF  MIF
ABL  1  3.004  0.65  5  0.1  3  , , ,  
2  2.537  0.72  8.3  1  7  
3  3.042  0.65  8.5  0.5  5  
4  2.294  0.75  10.7  1  7  
5  2.412  0.73  11.2  0.7  7  
BAS  1  2.932  0.79  5.6  0.3  5  , , , , ,  , , , , 
2  3.275  0.76  8.2  0.3  6  
3  3.178  0.77  5  0.2  7  
4  3.051  0.78  5.7  0.3  5  
5  2.707  0.81  7.3  0.7  9  
DEE  1  0.112  0.88  4.3  0.2  4  , , , ,  
2  0.115  0.88  8.9  0.6  6  
3  0.108  0.88  5.4  0.5  3  
4  0.123  0.87  10.8  0.9  5  
5  0.111  0.88  5.2  0.6  4  
EVL  1  1.126  0.71  9.3  0.1  12  , , , ,  , , 
2  1.265  0.67  9.6  0.1  12  
3  1.124  0.71  10.4  0.1  15  
4  1.097  0.72  9.2  0.2  10  
5  2.047  0.31  3.8  0.4  3  
FRD  1  3.987  0.86  6.2  0.2  4  , , ,  
2  4.154  0.83  8  0.2  4  
3  4.306  0.83  5.2  0.4  5  
4  3.809  0.86  7.8  0.5  4  
5  2.395  0.91  7.7  0.4  5  
Note: BAS results should be multiplied with 10, ELV results should be multiplied with 10. 
5.3 Timeseries dataset
5.3.1 2-Fold CV
In the literature survey, it was found that the efficiency of most FNT-based models was evaluated over time-series datasets, mostly the Mackey-Glass (MGS) dataset; however, only the best-obtained results were reported. For the time-series prediction problems, the performance was computed using the root mean squared error (RMSE), i.e., the square root of the MSE given in (1). Additionally, the correlation coefficient (2) was used for evaluating the algorithms' performance.
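For reference, the MGS benchmark series is commonly generated from the Mackey-Glass delay differential equation dx/dt = a·x(t−τ)/(1 + x(t−τ)^10) − b·x(t) with a = 0.2, b = 0.1, τ = 17. A minimal Euler-integration sketch (the exact sampling settings and embedding used in the experiments are not restated here):

```python
import numpy as np

def mackey_glass(n=1000, tau=17, a=0.2, b=0.1, dt=1.0, x0=1.2):
    """Generate n samples of the Mackey-Glass chaotic series by Euler
    integration of dx/dt = a*x(t-tau)/(1 + x(t-tau)**10) - b*x(t)."""
    history = int(tau / dt)
    x = np.full(n + history, x0)          # constant initial history x0
    for t in range(history, n + history - 1):
        x_tau = x[t - history]            # delayed state x(t - tau)
        x[t + 1] = x[t] + dt * (a * x_tau / (1.0 + x_tau ** 10) - b * x[t])
    return x[history:]
```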
For the experiments, the first 50% of the dataset was taken for training and the remaining 50% was used for testing. Table 13 describes the results obtained by HFNT in terms of training-set and test-set RMSE. The best test RMSEs obtained by HFNT were 0.00798 and 0.06349 on datasets MGS and WWR, respectively. The HFNT results are competitive with most of the algorithms listed in Table 14; only a few algorithms, such as LNF and FWNNM, reported better results than those obtained by HFNT. FNT-based algorithms such as FNT chen2005time and FBBFNTEGP&PSO reported RMSEs close to the results obtained by HFNT. The average RMSEs and their variances over the test set of 70 models were 0.10568 and 0.00283 on dataset MGS, and 0.097783 and 0.00015 on dataset WWR. The low variance indicates that most models were able to produce results around the average RMSE value. The results reported for the other function approximation algorithms (Table 14) were merely the best RMSEs; hence, the robustness of the other algorithms cannot be compared with that of HFNT. However, the advantage of using HFNT over the other algorithms is evident from the fact that the average complexities of the predictive models were 8.15 and 8.05 for datasets MGS and WWR, respectively.
The best model obtained for dataset WWR is shown in Figure 9, where the tree size is equal to 17 and the selected activation functions are the tangent hyperbolic, Gaussian, unipolar sigmoid, bipolar sigmoid, and linear tangent hyperbolic functions. The selected input features are indicated at the leaf nodes of the tree (Figure 9). Since the time-series category contains only two datasets and, for each dataset, HFNT was compared with different models from the literature, no statistical test was conducted in this category; the differences between the algorithms are easy to determine from Table 14.
Best of 70 models  Mean of 70 models  
Data  train  test  tree size  Features  train  test  avg. tree size
MGS  0.00859  0.00798  21  4  0.10385  0.10568  8.15 
WWR  0.06437  0.06349  17  4  0.10246  0.09778  8.05 
Algorithms  MGS  WWR  
CPSO  0.0199  0.0322  
PSOBBFN    0.027  
HCMSPSO  0.0095  0.0208  
HMDDEBBFNN  0.0094  0.017  
GBBFNN    0.013  
Classical RBF  0.0096  0.0114  
FNT chen2005time  0.0071  0.0069  
FBBFNTEGP&PSO  0.0053  0.0054  
FWNNM  0.0013  0.00114  
LNF  0.0007  0.00079  
BPNN        0.200 
EFuNNs      0.1063  0.0824 
HFNT  0.00859  0.00798  0.064377  0.063489 
5.3.2 Ensembles
The ensemble results of the time-series datasets are presented in Table 15, where the best ensemble system of dataset MGS (marked bold in Table 15) offered a test RMSE of 0.018 with a test correlation coefficient of 0.99. Similarly, the best ensemble system of dataset WWR (marked bold in Table 15) offered a test RMSE of 0.063 with a test correlation coefficient of 0.96. Apart from the best results, most of the ensembles produced low RMSEs, i.e., high correlation coefficients. The best ensemble batches (marked bold in Table 15) of datasets MGS and WWR were used for the graphical plots in Figure 10. The one-to-one fitting of the target and predicted values is evidence of a high correlation between the model's output and the desired output, which is a significant indicator of the model's efficient performance.
Data  batch  RMSE  r  avg. tree size  diversity (9)  TSF  MSF  MIF
MGS  1  0.018  0.99  9.4  0.6  4  , ,   
2  0.045  0.98  5.8  0.2  3  
3  0.026  0.99  15.2  0.5  3  
4  0.109  0.92  5.1  0.4  3  
5  0.156  0.89  7  0.2  3  
6  0.059  0.97  8.2  0.5  3  
7  0.054  0.98  6.4  0.4  4  
WWR  1  0.073  0.94  5  0.1  3  ,   
2  0.112  0.85  6  0.2  2  
3  0.097  0.91  10.6  0.3  4  
4  0.113  0.84  5  0.1  2  
5  0.063  0.96  14.4  0.9  4  
6  0.099  0.89  8.5  0.7  3  
7  0.101  0.88  6.9  0.4  3  
Note: RMSE, r, and diversity (9) indicate test RMSE, test correlation coefficient, and diversity, respectively
6 Discussions
HFNT was examined over three categories of datasets: classification, regression, and time-series. The results presented in Section 5 clearly suggest a superior performance of the HFNT approach. In this approach, MOGP guided an initial HFNT population towards Pareto-optimal solutions, where the final HFNT population was a mixture of heterogeneous HFNTs. Alongside accuracy and simplicity, the Pareto-based multi-objective approach ensured diversity among the candidates in the final population. Hence, the HFNTs in the final population were fairly accurate, simple, and diverse with respect to structure, parameters, activation functions, and input features. The model selection from the Pareto-fronts, as indicated in Section 4, therefore led to a good ensemble system.
Activation function
Data  1  2  3  4  5  6  7 
AUS  10      2       
HRT  10    9  4    5  3 
ION  6  5      2  4  4 
PIM  3  8  2  5  2  1   
WDB    3    7  8  10  8 
ABL  2  10        10   
BAS  2  5      2  10   
DEE    6  6  4  4  10   
EVL  10  5    3      6 
FRD  10  10           
MGS  4  1    2  1  10  10 
WWR  10    4    4  7   
Total  67  53  21  27  23  67  31 
Note: 67 is the best and 21 is the worst 
HFNT was applied to solve classification, regression, and time-series problems. Since HFNT is stochastic in nature, its performance was affected by several factors: the random generator algorithm, the random seed, the efficiency of the metaheuristic algorithm used in the parameter-tuning phase, the activation functions selected at the nodes, etc. Therefore, to examine the performance of HFNT, several HFNT models were created using different random seeds, and the best and average approximation errors of all created models were examined. In Section 5, as far as the best models are concerned, the performance of HFNT surpasses that of the other approximation models mentioned from the literature. Additionally, for each dataset, a very good average value (high accuracy in the case of classification and low approximation error in the case of regression and time-series) was obtained, which strongly suggests that HFNT often led to good solutions. Similarly, in the case of the ensembles, it is clear from the results that the combined output of diverse and accurate candidates offered high-quality (in terms of generalization ability and accuracy) approximation/prediction models. The final population of HFNT offered the best ensemble when the models were carefully examined based on approximation error, average complexity (tree size), and selected features.
Moreover, the performances of the best-performing activation functions were examined. For this purpose, the best ensemble system obtained for each dataset was considered, and the performance of the activation functions was evaluated as follows. The best ensemble system of each dataset had 10 models; therefore, the number of models (out of 10) in which an activation function appeared was counted. Hence, for a dataset, if an activation function appeared in all models of an ensemble system, the total count was 10. This counting was performed over all the activation functions for the best ensemble systems of all the datasets. Table 16 shows the performance of the activation functions. It can be observed that the Gaussian and bipolar sigmoid functions performed best among all the activation functions, followed by the tangent hyperbolic function. Hence, no single activation function performed exceptionally well across all datasets. Therefore, the effort of selecting activation functions adaptively by MOGP was essential to HFNT's performance.
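The counting procedure behind Table 16 can be sketched as follows; the dataset names and activation-function identifiers are illustrative placeholders:

```python
from collections import Counter

def activation_counts(ensembles):
    """Count, per dataset, in how many ensemble models each activation
    function appears (each function counted at most once per model).

    ensembles: {dataset: list of models, each a collection of activation ids}
    """
    table = {}
    for data, models in ensembles.items():
        c = Counter()
        for funcs in models:
            c.update(set(funcs))  # a function counts once per model
        table[data] = c
    return table
```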
In this work, we were limited to examining the performance of our approach on benchmark problems only. Therefore, in the presence of the no free lunch theorem wolpert1997no ; koppen2001remarks and the algorithm's dependence on the random number generator, which is platform-, programming-language-, and implementation-sensitive l2005fast , it is clear that the performance of the presented approach is subject to a careful choice of training conditions and parameter settings when it comes to dealing with other real-world problems.
7 Conclusion
Effective use of the final population of heterogeneous flexible neural trees (HFNTs), evolved using Pareto-based multi-objective genetic programming (MOGP) with subsequent parameter tuning by differential evolution, led to the formation of high-quality ensemble systems. The simultaneous optimization of accuracy, complexity, and diversity solved the problem of structural complexity that is inevitably imposed when a single objective is used. The MOGP used in the tree construction phase guided an initial HFNT population towards a population in which the candidates were highly accurate, structurally simple, and diverse; the selected candidates therefore helped in the formation of a good ensemble system. The results obtained by the HFNT approach support its superior performance over the algorithms collected for comparison. In addition, HFNT provides adaptation in structure, computational nodes, and input feature space. Hence, HFNT is an effective algorithm for automatic feature selection, data analysis, and modeling.
Acknowledgment
This work was supported by the IPROCOM Marie Curie Initial Training Network, funded through the People Programme (Marie Curie Actions) of the European Union’s Seventh Framework Programme FP7/2007–2013/, under REA grant agreement number 316555.
References
 (1) Y. Chen, B. Yang, J. Dong, A. Abraham, Timeseries forecasting using flexible neural tree model, Information Sciences 174 (3) (2005) 219–235.
 (2) X. Yao, Y. Liu, A new evolutionary system for evolving artificial neural networks, IEEE Transactions on Neural Networks 8 (3) (1997) 694–713.
 (3) I. Basheer, M. Hajmeer, Artificial neural networks: Fundamentals, computing, design, and application, Journal of Microbiological Methods 43 (1) (2000) 3–31.
 (4) A. J. Maren, C. T. Harston, R. M. Pap, Handbook of neural computing applications, Academic Press, 2014.
 (5) I. K. Sethi, A. K. Jain, Artificial neural networks and statistical pattern recognition: Old and new connections, Vol. 1, Elsevier, 2014.
 (6) M. Tkáč, R. Verner, Artificial neural networks in business: Two decades of research, Applied Soft Computing 38 (2016) 788–804.
 (7) S. E. Fahlman, C. Lebière, The cascadecorrelation learning architecture, in: D. S. Touretzky (Ed.), Advances in Neural Information Processing Systems 2, Morgan Kaufmann Publishers Inc., 1990, pp. 524–532.
 (8) J.P. Nadal, Study of a growth algorithm for a feedforward network, International Journal of Neural Systems 1 (1) (1989) 55–59.

 (9) K. O. Stanley, R. Miikkulainen, Evolving neural networks through augmenting topologies, Evolutionary Computation 10 (2) (2002) 99–127.
 (10) B.T. Zhang, P. Ohm, H. Mühlenbein, Evolutionary induction of sparse neural trees, Evolutionary Computation 5 (2) (1997) 213–236.
 (11) M. A. Potter, K. A. De Jong, Cooperative coevolution: An architecture for evolving coadapted subcomponents, Evolutionary computation 8 (1) (2000) 1–29.

 (12) M. Yaghini, M. M. Khoshraftar, M. Fallahi, A hybrid algorithm for artificial neural network training, Engineering Applications of Artificial Intelligence 26 (1) (2013) 293–301.
 (13) S. Wang, Y. Zhang, Z. Dong, S. Du, G. Ji, J. Yan, J. Yang, Q. Wang, C. Feng, P. Phillips, Feedforward neural network optimized by hybridization of PSO and ABC for abnormal brain detection, International Journal of Imaging Systems and Technology 25 (2) (2015) 153–164.
 (14) S. Wang, Y. Zhang, G. Ji, J. Yang, J. Wu, L. Wei, Fruit classification by waveletentropy and feedforward neural network trained by fitnessscaled chaotic abc and biogeographybased optimization, Entropy 17 (8) (2015) 5711–5728.
 (15) R. Salustowicz, J. Schmidhuber, Probabilistic incremental program evolution, Evolutionary Computation 5 (2) (1997) 123–141.
 (16) A. K. Kar, Bio inspired computing–a review of algorithms and scope of applications, Expert Systems with Applications 59 (2016) 20–32.
 (17) Y. Jin, B. Sendhoff, Paretobased multiobjective machine learning: An overview and case studies, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 38 (3) (2008) 397–415.
 (18) K. Deb, Multiobjective optimization using evolutionary algorithms, Vol. 16, John Wiley & Sons, 2001.
 (19) X. Yao, Y. Liu, Making use of population information in evolutionary artificial neural networks, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 28 (3) (1998) 417–425.
 (20) L. I. Kuncheva, C. J. Whitaker, Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy, Machine Learning 51 (2) (2003) 181–207.
 (21) K. Deb, S. Agrawal, A. Pratap, T. Meyarivan, A fast elitist nondominated sorting genetic algorithm for multiobjective optimization: NSGAII, in: Parallel Problem Solving from Nature PPSN VI, Vol. 1917 of Lecture Notes in Computer Science, Springer, 2000, pp. 849–858.
 (22) S. Das, S. S. Mullick, P. Suganthan, Recent advances in differential evolution–an updated survey, Swarm and Evolutionary Computation 27 (2016) 1–30.
 (23) Y. Chen, A. Abraham, J. Yang, Feature selection and intrusion detection using hybrid flexible neural tree, in: Advances in Neural Networks–ISNN, Vol. 3498 of Lecture Notes in Computer Science, Springer, 2005, pp. 439–444.
 (24) L. Sánchez, I. Couso, J. A. Corrales, Combining GP operators with SA search to evolve fuzzy rule based classifiers, Information Sciences 136 (1) (2001) 175–191.
 (25) J. Kennedy, R. C. Eberhart, Y. Shi, Swarm Intelligence, Morgan Kaufmann, 2001.
 (26) Y. Chen, A. Abraham, B. Yang, Feature selection and classification using flexible neural tree, Neurocomputing 70 (1) (2006) 305–313.
 (27) R. Riolo, J. H. Moore, M. Kotanchek, Genetic programming theory and practice XI, Springer, 2014.
 (28) X. Chen, Y.S. Ong, M.H. Lim, K. C. Tan, A multifacet survey on memetic computation, IEEE Transactions on Evolutionary Computation 15 (5) (2011) 591–607.
 (29) H.J. Li, Z.X. Wang, L.M. Wang, S.M. Yuan, Flexible neural tree for pattern recognition, in: Advances in Neural Networks–ISNN, Vol. 3971 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2006, pp. 903–908.
 (30) Y. Chen, Y. Wang, B. Yang, Evolving hierarchical RBF neural networks for breast cancer detection, in: Neural Information Processing, Vol. 4234 of Lecture Notes in Computer Science, Springer, 2006, pp. 137–144.
 (31) Y. Chen, F. Chen, J. Yang, Evolving MIMO flexible neural trees for nonlinear system identification, in: International Conference on Artificial Intelligence, Vol. 1, 2007, pp. 373–377.
 (32) P. Wu, Y. Chen, Grammar guided genetic programming for flexible neural trees optimization, in: Advances in Knowledge Discovery and Data Mining, Springer, 2007, pp. 964–971.
 (33) Y. Shan, R. McKay, R. Baxter, H. Abbass, D. Essam, H. Nguyen, Grammar modelbased program evolution, in: Congress on Evolutionary Computation, Vol. 1, 2004, pp. 478–485.
 (34) G. Jia, Y. Chen, Q. Wu, A MEP and IP based flexible neural tree model for exchange rate forecasting, in: Fourth International Conference on Natural Computation, Vol. 5, IEEE, 2008, pp. 299–303.
 (35) M. Oltean, C. Groşan, Evolving evolutionary algorithms using multi expression programming, in: Advances in Artificial Life, Springer, 2003, pp. 651–658.
 (36) P. Musilek, A. Lau, M. Reformat, L. WyardScott, Immune programming, Information Sciences 176 (8) (2006) 972–1002.
 (37) B. Yang, L. Wang, Z. Chen, Y. Chen, R. Sun, A novel classification method using the combination of FDPS and flexible neural tree, Neurocomputing 73 (4–6) (2010) 690 – 699.
 (38) S. Bouaziz, H. Dhahri, A. M. Alimi, A. Abraham, Evolving flexible beta basis function neural tree using extended genetic programming & hybrid artificial bee colony, Applied Soft Computing.
 (39) Y. Chen, B. Yang, A. Abraham, Flexible neural trees ensemble for stock index modeling, Neurocomputing 70 (4–6) (2007) 697 – 703.

 (40) B. Yang, M. Jiang, Y. Chen, Q. Meng, A. Abraham, Ensemble of flexible neural tree and ordinary differential equations for smalltime scale network traffic prediction, Journal of Computers 8 (12) (2013) 3039–3046.
 (41) V. K. Ojha, A. Abraham, V. Snasel, Ensemble of heterogeneous flexible neural tree for the approximation and featureselection of Poly (Lacticcoglycolic Acid) microand nanoparticle, in: Proceedings of the Second International AfroEuropean Conference for Industrial Advancement AECIA 2015, Springer, 2016, pp. 155–165.
 (42) L. Peng, B. Yang, L. Zhang, Y. Chen, A parallel evolving algorithm for flexible neural tree, Parallel Computing 37 (10–11) (2011) 653–666.
 (43) L. Wang, B. Yang, Y. Chen, X. Zhao, J. Chang, H. Wang, Modeling early-age hydration kinetics of Portland cement using flexible neural tree, Neural Computing and Applications 21 (5) (2012) 877–889.
 (44) C. Ferreira, Gene expression programming: mathematical modeling by an artificial intelligence, Vol. 21, Springer, 2006.
 (45) G. Weiss, Multiagent systems: A modern approach to distributed artificial intelligence, MIT Press, 1999.
 (46) M. Ammar, S. Bouaziz, A. M. Alimi, A. Abraham, Negotiation process for bi-objective multi-agent flexible neural tree model, in: International Joint Conference on Neural Networks (IJCNN), 2015, IEEE, 2015, pp. 1–9.
 (47) T. Burianek, S. Basterrech, Performance analysis of the activation neuron function in the flexible neural tree model, in: Proceedings of the Dateso 2014 Annual International Workshop on DAtabases, TExts, Specifications and Objects, 2014, pp. 35–46.
 (48) S. Bouaziz, H. Dhahri, A. M. Alimi, A. Abraham, A hybrid learning algorithm for evolving flexible beta basis function neural tree model, Neurocomputing 117 (2013) 107–117.
 (49) S. Bouaziz, A. M. Alimi, A. Abraham, Universal approximation propriety of flexible beta basis function neural tree, in: International Joint Conference on Neural Networks, IEEE, 2014, pp. 573–580.
 (50) C. Micheloni, A. Rani, S. Kumar, G. L. Foresti, A balanced neural tree for pattern classification, Neural Networks 27 (2012) 81–90.
 (51) G. L. Foresti, C. Micheloni, Generalized neural trees for pattern classification, IEEE Transactions on Neural Networks 13 (6) (2002) 1540–1547.
 (52) A. Rani, G. L. Foresti, C. Micheloni, A neural tree for classification using convex objective function, Pattern Recognition Letters 68 (2015) 41–47.
 (53) Q. Shouning, L. Zhaolian, C. Guangqiang, Z. Bing, W. Sujuan, Modeling of cement decomposing furnace production process based on flexible neural tree, in: Information Management, Innovation Management and Industrial Engineering, Vol. 3, IEEE, 2008, pp. 128–133.
 (54) B. Yang, Y. Chen, M. Jiang, Reverse engineering of gene regulatory networks using flexible neural tree models, Neurocomputing 99 (2013) 458–466.
 (55) Z. Chen, B. Yang, Y. Chen, A. Abraham, C. Grosan, L. Peng, Online hybrid traffic classifier for peer-to-peer systems based on network processors, Applied Soft Computing 9 (2) (2009) 685–694.
 (56) T. Novosad, J. Platos, V. Snásel, A. Abraham, Fast intrusion detection system based on flexible neural tree, in: International Conference on Information Assurance and Security, IEEE, 2010, pp. 106–111.
 (57) Y.-Q. Pan, Y. Liu, Y.-W. Zheng, Face recognition using kernel PCA and hybrid flexible neural tree, in: International Conference on Wavelet Analysis and Pattern Recognition, 2007. ICWAPR’07, Vol. 3, IEEE, 2007, pp. 1361–1366.
 (58) Y. Guo, Q. Wang, S. Huang, A. Abraham, Flexible neural trees for online hand gesture recognition using surface electromyography, Journal of Computers 7 (5) (2012) 1099–1103.
 (59) S. Qu, A. Fu, W. Xu, Controlling shareholders management risk warning based on flexible neural tree, Journal of Computers 6 (11) (2011) 2440–2445.
 (60) A. Rajini, V. K. David, Swarm optimization and flexible neural tree for microarray data classification, in: International Conference on Computational Science, Engineering and Information Technology, ACM, 2012, pp. 261–268.
 (61) S. Abdelwahab, V. K. Ojha, A. Abraham, Ensemble of flexible neural trees for predicting risk in grid computing environment, in: Innovations in Bio-Inspired Computing and Applications, Springer, 2016, pp. 151–161.
 (62) Y. Jin, B. Sendhoff, E. Körner, Evolutionary multi-objective optimization for simultaneous generation of signal-type and symbol-type representations, in: Evolutionary Multi-Criterion Optimization, Vol. 3410 of Lecture Notes in Computer Science, Springer, 2005, pp. 752–766.
 (63) I. Das, J. E. Dennis, A closer look at drawbacks of minimizing weighted sums of objectives for Pareto set generation in multicriteria optimization problems, Structural Optimization 14 (1) (1997) 63–69.
 (64) A. E. Eiben, J. E. Smith, Introduction to Evolutionary Computing, Springer, 2015.
 (65) D. Karaboga, B. Basturk, A powerful and efficient algorithm for numerical function optimization: Artificial bee colony (ABC) algorithm, Journal of Global Optimization 39 (3) (2007) 459–471.
 (66) Y. Zhang, S. Wang, G. Ji, A comprehensive survey on particle swarm optimization algorithm and its applications, Mathematical Problems in Engineering 2015 (2015) 1–38.
 (67) C.F. Juang, A hybrid of genetic algorithm and particle swarm optimization for recurrent network design, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 34 (2) (2004) 997–1006.
 (68) W. Wongseree, N. Chaiyaratana, K. Vichittumaros, P. Winichagoon, S. Fucharoen, Thalassaemia classification by neural networks and genetic programming, Information Sciences 177 (3) (2007) 771 – 786.
 (69) T. Hastie, R. Tibshirani, J. Friedman, The elements of statistical learning, Vol. 2, Springer, 2009.
 (70) R. Polikar, Ensemble based systems in decision making, IEEE Circuits and Systems Magazine 6 (3) (2006) 21–45.
 (71) Z.H. Zhou, Ensemble methods: Foundations and algorithms, CRC Press, 2012.
 (72) M. Lichman, UCI machine learning repository, http://archive.ics.uci.edu/ml Accessed on: 01.05.2016 (2013).
 (73) J. Alcala-Fdez, L. Sanchez, S. Garcia, M. J. del Jesus, S. Ventura, J. Garrell, J. Otero, C. Romero, J. Bacardit, V. M. Rivas, et al., KEEL: A software tool to assess evolutionary algorithms for data mining problems, Soft Computing 13 (3) (2009) 307–318.
 (74) M. Matsumoto, T. Nishimura, Mersenne twister: A 623dimensionally equidistributed uniform pseudorandom number generator, ACM Transactions on Modeling and Computer Simulation 8 (1) (1998) 3–30.
 (75) V. K. Ojha, MOGP-FNT multi-objective flexible neural tree tool, http://dap.vsb.cz/aat/ Accessed on: 01.05.2016 (May 2016).
 (76) Y. Chen, A. Abraham, Y. Zhang, et al., Ensemble of flexible neural trees for breast cancer detection, The International Journal of Information Technology and Intelligent Computing 1 (1) (2006) 187–201.
 (77) S. Holm, A simple sequentially rejective multiple test procedure, Scandinavian Journal of Statistics (1979) 65–70.
 (78) M. J. Gacto, M. Galende, R. Alcalá, F. Herrera, METSK-HD: A multi-objective evolutionary algorithm to learn accurate TSK-fuzzy systems in high-dimensional and large-scale regression problems, Information Sciences 276 (2014) 63–79.
 (79) D. H. Wolpert, W. G. Macready, No free lunch theorems for optimization, IEEE Transactions on Evolutionary Computation 1 (1) (1997) 67–82.
 (80) M. Koppen, D. H. Wolpert, W. G. Macready, Remarks on a recent paper on the "no free lunch" theorems, IEEE Transactions on Evolutionary Computation 5 (3) (2001) 295–296.
 (81) P. L’Ecuyer, F. Panneton, Fast random number generators based on linear recurrences modulo 2: Overview and comparison, in: Proceedings of the 2005 Winter Simulation Conference, IEEE, 2005, 10 pp.
 (82) S. Haykin, Neural networks and learning machines, Vol. 3, Pearson Education Upper Saddle River, 2009.
 (83) Z.H. Zhou, Z.Q. Chen, Hybrid decision tree, KnowledgeBased Systems 15 (8) (2002) 515–528.
 (84) J.-S. R. Jang, ANFIS: Adaptive-network-based fuzzy inference system, IEEE Transactions on Systems, Man and Cybernetics 23 (3) (1993) 665–685.
 (85) O. Cordón, F. Herrera, A two-stage evolutionary process for designing TSK fuzzy rule-based systems, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 29 (6) (1999) 703–715.
 (86) J. S. Rustagi, Optimization techniques in statistics, Academic Press, 1994.
 (87) R. Alcalá, J. Alcalá-Fdez, J. Casillas, O. Cordón, F. Herrera, Local identification of prototypes for genetic learning of accurate TSK fuzzy rule-based systems, International Journal of Intelligent Systems 22 (9) (2007) 909–941.
 (88) K. B. Cho, B. H. Wang, Radial basis function based adaptive fuzzy systems and their applications to system identification and prediction, Fuzzy Sets and Systems 83 (3) (1996) 325–339.
 (89) F. Van den Bergh, A. P. Engelbrecht, A cooperative approach to particle swarm optimization, IEEE Transactions on Evolutionary Computation 8 (3) (2004) 225–239.
 (90) H. Dhahri, A. M. Alimi, F. Karray, Designing beta basis function neural network for optimization using particle swarm optimization, in: IEEE Joint Conference on Neural Networks, 2008, pp. 2564–2571.
 (91) C. Aouiti, A. M. Alimi, A. Maalej, A genetic designed beta basis function neural network for approximating multivariables functions, in: International Conference on Artificial Neural Nets and Genetic Algorithms, Springer, 2001, pp. 383–386.
 (92) C.F. Juang, C.M. Hsiao, C.H. Hsu, Hierarchical cluster-based multispecies particle-swarm optimization for fuzzy-system optimization, IEEE Transactions on Fuzzy Systems 18 (1) (2010) 14–26.
 (93) S. Yilmaz, Y. Oysal, Fuzzy wavelet neural network models for prediction and identification of dynamical systems, IEEE Transactions on Neural Networks 21 (10) (2010) 1599–1609.
 (94) H. Dhahri, A. M. Alimi, A. Abraham, Hierarchical multidimensional differential evolution for the design of beta basis function neural network, Neurocomputing 97 (2012) 131–140.
 (95) A. Miranian, M. Abdollahzade, Developing a local least-squares support vector machines-based neuro-fuzzy model for nonlinear and chaotic time series prediction, IEEE Transactions on Neural Networks and Learning Systems 24 (2) (2013) 207–218.
 (96) N. K. Kasabov, Foundations of neural networks, fuzzy systems, and knowledge engineering, MIT Press, 1996.
 (97) N. Kasabov, Evolving fuzzy neural networks for adaptive, online intelligent agents and systems, in: Recent Advances in Mechatronics, Springer, Berlin, 1999.
 (98) S. Bouaziz, A. M. Alimi, A. Abraham, Extended immune programming and opposite-based PSO for evolving flexible beta basis function neural tree, in: IEEE International Conference on Cybernetics, IEEE, 2013, pp. 13–18.
Appendix A Dataset Description
Index  Name  Features  Samples  Output  Type 
AUS  Australia  14  691  2  Classification 
HRT  Heart  13  270  2  Classification 
ION  Ionosphere  33  351  2  Classification 
PIM  Pima  8  768  2  Classification 
WDB  Wdbc  30  569  2  Classification 
ABL  Abalone  8  4177  1  Regression 
BAS  Baseball  16  337  1  Regression 
DEE  DEE  6  365  1  Regression 
EVL  Elevators  18  16599  1  Regression 
FRD  Friedman  5  1200  1  Regression 
MGS  Mackey-Glass  4  1000  1  Time-series 
WWR  Waste Water  4  475  1  Time-series 
Appendix B Algorithms from literature
Ref.  Algorithms  Definition 
haykin2009neural  MLP  Multilayer Perceptron 
zhou2002hybrid  HDT  Hybrid Decision Tree 
chen2006ensemble  FNT  Flexible Neural Tree 
jang1993anfis  ANFIS-SUB  Adaptive Neuro-Fuzzy Inference System Using Subtractive Clustering 
cordon1999two  TSK-IRL  Genetic Learning of TSK-rules Under Iterative Rule Learning 
rustagi1994optimization  LINEAR-LMS  Least Mean Squares Linear Regression 
alcala2007local  LEL-TSK  Local Evolutionary Learning of TSK-rules 
cho1996radial  RBF  Classical Radial Basis Function 
van2004cooperative  CPSO  Cooperative Particle Swarm Optimization (PSO) 
psoBBNN  PSO-BBFN  PSO-based Beta Basis Function Neural Network 
gBBNN  G-BBFNN  GA-based BBFNN 
juang2010hierarchical  HCMSPSO  Hierarchical Cluster-Based Multispecies PSO 
fwnn  FWNN-M  Fuzzy Wavelet Neural Network Models 
dhahri2012hierarchical  HMDDE-BBFNN  Hierarchical Multidimensional DE-Based BBFNN 
miranian2013developing  LNF  Local Least-Squares Support Vector Machines-Based Neuro-Fuzzy Model 
kasabov1996foundations  BPNN  Backpropagation Neural Network 
kasabov1999evolving  EFuNNs  Evolving Fuzzy Neural Networks 
bouaziz2013extended  FBBFNT-EGP&OPSO  Extended Immune Programming and Opposite-Based PSO for Flexible BBFNN 
gacto2014metsk  METSK-HD  Multi-objective Evolutionary Learning of TSK-rules for High-Dimensional Problems 