Structure optimization of a feedforward neural network (FNN) and its impact on the FNN’s generalization ability inspired the flexible neural tree (FNT) chen2005time . FNN components such as weights, structure, and activation functions are candidates for optimization, and optimizing them can considerably improve an FNN’s generalization ability yao1997new . These efforts are notable because FNNs can solve a large range of real-world problems basheer2000artificial ; maren2014handbook ; sethi2014artificial ; tkavc2016artificial . Significant structure optimization methods include constructive and pruning algorithms fahlman1989cascade ; nadal1989study , EPNet yao1997new , NeuroEvolution of Augmenting Topologies stanley2002evolving , sparse neural trees zhang1997evolutionary , and the cooperative co-evolution approach potter2000cooperative . Similarly, many efforts focus on hybrid training of FNNs, such as yaghini2013hybrid ; wang2015feed ; wang2015fruit . FNT was a further step in this series of efforts: it was proposed as a tree-like feedforward neural network model, where probabilistic incremental program evolution (PIPE) salustowicz1997probabilistic was applied to optimize the tree structure chen2005time . The underlying parameter vector of the developed FNT (weights associated with the edges and arguments of the activation functions) was optimized by metaheuristic algorithms, which are nature-inspired parameter optimization algorithms kar2016bio . The evolutionary process allowed FNT to select significant input features from an input feature set.
In the design of FNT, the non-leaf nodes are computational nodes, each of which takes an activation function. Hence, rather than relying on a fixed activation function, the selection of the activation function at each computational node can be left to the evolutionary process. This produces heterogeneous FNTs (HFNT) with heterogeneity in structure, computational nodes, and input set. In addition, heterogeneous functions allow HFNT to capture different features of the datasets efficiently, since each input in a dataset possesses different features. The evolutionary process provides adaptation in structure, weights, activation functions, and input features. Therefore, an optimum HFNT is the one that offers the lowest approximation error with the simplest tree structure and the smallest input feature set. However, approximation error minimization and structure simplification are two conflicting objectives jin2008pareto . Hence, a multiobjective evolutionary approach deb2001multi may offer one or more optimal solutions by maintaining a balance between these objectives.
Moreover, in the proposed work, an evolutionary process guides a population of HFNTs towards Pareto-optimum solutions. Hence, the final population may contain several solutions that are close to the best solution. Therefore, an ensemble system was constructed by exploiting many candidates of the population (candidate, solution, and model are synonymous in this article). Such an ensemble system takes advantage of many solutions, including the best solution yao1998making . Diversity among the chosen candidates holds the key to making a good ensemble system kuncheva2003measures . Therefore, the solutions in a final population should fulfill the following objectives: low approximation error, structural simplicity, and high diversity. However, these objectives conflict with each other. A fast elitist nondominated sorting genetic algorithm (NSGA-II)-based multiobjective genetic programming (MOGP) was employed to guide a population of HFNTs deb2000fast . The underlying parameters of the selected models were further optimized by using the differential evolution (DE) algorithm das2016recent . The key contributions of this work are as follows:
A heterogeneous flexible neural tree (HFNT) for function approximation and feature selection was proposed.
HFNT was studied under an NSGA-II-based multiobjective genetic programming framework. Thus, it was termed HFNT.
Alongside approximation error and tree size (complexity), a diversity index was introduced to maintain diversity among the candidates in the population.
HFNT was found competitive with other algorithms when compared and cross-validated over classification, regression, and time-series datasets.
The proposed evolutionary weighted ensemble built from the final HFNT population further improved performance.
A detailed literature review provides an overview of FNT usage over the past few years (Section 2). Conclusions derived from the literature survey support our HFNT approach, where a Pareto-based multiobjective genetic programming was used for HFNT optimization (Section 3.1). Section 3.2 provides a detailed discussion on the basics of HFNT: MOGP for HFNT structure optimization, and DE for HFNT parameter optimization. The efficiency of the above-mentioned hybrid and complex multiobjective FNT algorithm (HFNT) was tested over various prediction problems using a comprehensive experimental set-up (Section 4). The experimental results support the merits of the proposed approach (Section 5). Finally, we provide a discussion of the experimental outcomes in Section 6, followed by conclusions in Section 7.
2 Literature Review
The literature survey covers the following points: the basics of FNT, approaches that improved FNT, and FNT’s successful application to various real-life problems. Subsequently, the shortcomings of the basic FNT version that inspired us to propose HFNT are summarized.
FNT was first proposed by Chen et al. chen2005time , where a tree-like structure was optimized by using PIPE. Its approximation ability was then tested for time-series forecasting chen2005time and intrusion detection chen2005feature , where a variant of simulated annealing (called degraded ceiling) sanchez2001combining and particle swarm optimization (PSO) kennedy2001swarm , respectively, were used for FNT parameter optimization. Since FNT is capable of input feature selection, in chen2006feature FNT was applied for selecting input features in several classification tasks, in which the FNT structure was optimized by using genetic programming (GP) riolo2014genetic and the parameter optimization was accomplished by using a memetic algorithm chen2011multi . Additionally, they defined five different mutation operators, namely, changing one terminal node, changing all terminal nodes, growing a randomly selected sub-tree, pruning a randomly selected sub-tree, and pruning redundant terminals. Li et al. li2006flexible proposed an FNT-based construction of decision trees whose nodes were conditionally replaced by neural nodes (activation nodes) to deal with continuous attributes when solving classification tasks. Among many other FNT-based approaches, in Chen2006137 GP was applied to evolve a hierarchical radial-basis-function network model, and in Chen2007373 a multi-input-multi-output FNT model was evolved. Wu et al. wu2007grammar proposed to use grammar-guided GP Shan2004478 for FNT structure optimization. Similarly, in jia2008mep , the authors proposed to apply multi-expression programming (MEP) oltean2003evolving for FNT structure optimization and an immune programming algorithm musilek2006immune for parameter vector optimization. To improve the classification accuracy of FNT, Yang et al. Yang2010690 proposed a hybridization of FNT with a further-division-of-partition-space method. In bouaziz2016evolving , the authors illustrated crossover and mutation operators for evolving FNT using GP and optimized the tree parameters using the PSO algorithm.
A model is considered efficient if it has generalization ability. We know that a consensus decision is better than an individual decision. Hence, an ensemble of FNTs may lead to a better-generalized performance than a single FNT. To address this, in Chen2007697 , authors proposed to make an ensemble of FNTs to predict the chaotic behavior of stock market indices. Similarly, in yang2013ensemble , the proposed FNTs ensemble predicted the breast cancer and network traffic better than individual FNT. In ojha2016ensemble , protein dissolution prediction was easier using ensemble than the individual FNT.
To improve computational efficiency, Peng et al. peng2011parallel proposed a parallel evolving algorithm for FNT, where the parallelization took place in both the tree-structure and parameter vector populations. In another parallel approach, Wang et al. wang2012modeling used gene expression programming (GEP) ferreira2006gene for evolving FNT and used PSO for parameter optimization.
A multi-agent system weiss1999multiagent based FNT (MAS-FNT) algorithm was proposed in ammar2015negotiation , which used GEP and PSO for the structure and parameter optimization, respectively. The MAS-FNT algorithm relied on the division of the main population into sub-population, where each sub-population offered local solutions and the best local solution was picked-up by analyzing tree complexity and accuracy.
Chen et al. chen2005time ; chen2006feature referred to the arbitrary choice of activation function at non-leaf nodes. However, they restricted themselves to Gaussian functions. A performance analysis of various activation functions is available in burianekperformance . Bouaziz et al. bouaziz2013hybrid ; bouaziz2014universal proposed to use a beta-basis function at the non-leaf nodes of an FNT. Since the beta-basis function has several controlling parameters such as shape, size, and center, they claimed that it has advantages over other two-parameter activation functions. Similarly, many other forms of neural tree, such as the balanced neural tree micheloni2012balanced , the generalized neural tree foresti2002generalized , and the convex objective function neural tree rani2015neural , focused on improving trees of neural nodes.
FNT was chosen over conventional neural-network-based models for various real-world applications related to prediction modeling, pattern recognition, feature selection, etc. Some examples of such applications are cement-decomposing-furnace production-process modeling shou2008modeling , time-series prediction from gene expression profiling yang2013reverse , stock-index modeling Chen2007697 , anomaly detection in peer-to-peer traffic Chen2009685 , intrusion detection novosad2010fast , face identification pan2007face , gesture recognition Guo20121099 , shareholder’s management risk prediction qu2011controlling , cancer classification rajini2012swarm , somatic mutation, risk prediction in grid computing abdelwahab2016ensemble , etc.
The following conclusions can be drawn from the literature survey. First, FNT was successfully used in various real-world applications with better performance than other existing function approximation models. However, it was mostly used in time-series analysis. Second, the structure with the lowest approximation error obtained by an individual FNT during the evolutionary phase was taken as the best structure and passed on to the parameter optimization phase. Hence, structural simplicity and generalization ability were not taken into consideration. Third, the computational nodes of the FNT were fixed initially, and little effort was made to allow for their automatic adaptation. Fourth, little attention was paid to the statistical validation of the FNT model, e.g., mostly the single best model was presented as the experimental outcome. However, since the evolutionary process and the metaheuristics are stochastic in nature, statistical validation is crucial for performance comparisons. Finally, to create a generalized model, an ensemble of FNTs was used. However, the FNTs were created separately for making the ensemble. Due to the stochastic nature of the evolutionary process, FNTs can be structurally distinct when created at different instances. Therefore, no explicit attention was paid to creating diverse FNTs within a single population for making the ensemble. In this article, a heterogeneous FNT called HFNT is proposed to improve the basic FNT model and its performance by addressing the above-mentioned shortcomings.
3 Multi-objectives and Flexible Neural Tree
In this section, Pareto-based multiobjective optimization is discussed first. Second, we offer a detailed discussion of FNT and of its structure and parameter optimization using NSGA-II-based MOGP and DE, respectively. This is followed by a discussion on making an evolutionary weighted ensemble of the candidates from the final population.
3.1 Pareto-Based Multi-objectives
Usually, a learning algorithm has a single objective, i.e., approximation error minimization, which is often achieved by minimizing the mean squared error (MSE) on the learning data. The MSE on the learning data is computed as:
$$E = \frac{1}{N} \sum_{i=1}^{N} (d_i - y_i)^2, \tag{1}$$
where $d_i$ and $y_i$ are the desired output and the model’s output, respectively, and $N$ indicates the total number of data pairs in the learning set. Additionally, a statistical goodness measure called the correlation coefficient, which tells the relationship between two variables (i.e., between the desired output $d$ and the model’s output $y$), may also be used as an objective. The correlation coefficient is computed as:
$$r = \frac{\sum_{i=1}^{N} (d_i - \bar{d})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (d_i - \bar{d})^2 \sum_{i=1}^{N} (y_i - \bar{y})^2}}, \tag{2}$$
where $\bar{d}$ and $\bar{y}$ are the means of the desired output $d$ and the model’s output $y$, respectively.
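The two measures described above can be computed directly from paired desired and model outputs. The following is a minimal Python sketch; the function names `mse` and `correlation` and the sample values are illustrative assumptions, not part of the original work.

```python
import math

def mse(desired, predicted):
    """Mean squared error between desired outputs and model outputs."""
    n = len(desired)
    return sum((d - y) ** 2 for d, y in zip(desired, predicted)) / n

def correlation(desired, predicted):
    """Pearson correlation coefficient between desired and model outputs."""
    n = len(desired)
    d_mean = sum(desired) / n
    y_mean = sum(predicted) / n
    cov = sum((d - d_mean) * (y - y_mean) for d, y in zip(desired, predicted))
    d_var = sum((d - d_mean) ** 2 for d in desired)
    y_var = sum((y - y_mean) ** 2 for y in predicted)
    return cov / math.sqrt(d_var * y_var)

# Made-up example: the model output is the desired output shifted by 0.5.
desired = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]
error = mse(desired, predicted)            # constant offset of 0.5 per sample
corr = correlation(desired, predicted)     # perfectly linear relationship
```

The example illustrates why the two measures are complementary: a constant offset yields a nonzero error even though the correlation is perfect.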
However, a single objective comes at the expense of the model’s complexity or its generalization ability on unseen data, where generalization ability broadly depends on the model’s complexity jin2005evolutionary . A common model complexity indicator is the number of free parameters in the model. Minimizing the approximation error (1) and minimizing the number of free parameters are two conflicting objectives. One approach is to combine these two objectives as:
$$f = \alpha E + (1 - \alpha) D, \tag{3}$$
where $\alpha$ is a constant, $E$ is the MSE (1), and $D$ is the total number of free parameters in a model. The scalarized objective in (3), however, has two disadvantages. First, it is difficult to determine an appropriate $\alpha$ that controls the conflicting objectives; hence, the generalization ability of the produced model remains uncertain das1997closer . Second, the scalarized objective in (3) leads to a single best model that tells nothing about how the conflicting objectives were traded off. In other words, no single solution exists that satisfies both objectives simultaneously.
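We study a multiobjective optimization problem of the form:
$$\text{minimize } \{f_1(\mathbf{w}), f_2(\mathbf{w}), \ldots, f_m(\mathbf{w})\} \quad \text{subject to } \mathbf{w} \in W,$$
where we have $m \ge 2$ objective functions $f_i : \mathbb{R}^n \to \mathbb{R}$. We denote the vector of objective functions by $\mathbf{f}(\mathbf{w}) = \langle f_1(\mathbf{w}), f_2(\mathbf{w}), \ldots, f_m(\mathbf{w}) \rangle$. The decision (variable) vectors $\mathbf{w} = \langle w_1, w_2, \ldots, w_n \rangle$ belong to the set $W$, which is a subset of the decision variable space $\mathbb{R}^n$. The word ‘minimize’ means that we want to minimize all the objective functions simultaneously.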
A nondominated solution is one in which no one objective function can be improved without a simultaneous detriment to at least one of the other objectives of the solution deb2000fast . The nondominated solution is also known as a Pareto-optimal solution.
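Pareto-dominance - A solution $\mathbf{w}_1$ is said to dominate a solution $\mathbf{w}_2$ if $f_k(\mathbf{w}_1) \le f_k(\mathbf{w}_2)$ for all objectives $k = 1, 2, \ldots, m$, and there exists an objective $j$ such that $f_j(\mathbf{w}_1) < f_j(\mathbf{w}_2)$ holds.
Pareto-optimal - A solution is called Pareto-optimal if there does not exist any other solution that dominates it. The set of Pareto-optimal solutions is called the Pareto front.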
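The two definitions above translate directly into a dominance test and a filter over a set of objective vectors. The sketch below is a minimal illustration that assumes all objectives are minimized; the names `dominates` and `pareto_front` and the sample (error, tree size) pairs are hypothetical.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better

def pareto_front(points):
    """Return the points that are not dominated by any other point."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]

# Made-up (approximation error, tree size) pairs for five candidates.
candidates = [(0.10, 12), (0.12, 8), (0.20, 5), (0.15, 9), (0.25, 6)]
front = pareto_front(candidates)
```

Here `(0.15, 9)` is dominated by `(0.12, 8)` and `(0.25, 6)` by `(0.20, 5)`, so only the remaining three candidates stay on the front.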
Algorithm 1 is a basic framework of the NSGA-II-based MOGP, which was used for computing Pareto-optimal solutions from an initial HFNT population. The individuals in MOGP were sorted according to their dominance in the population, i.e., according to their rank (Pareto front). Note that the size function in Algorithm 1 returns the total number of rows (population size) for a 2-D matrix and the total number of elements for a vector. MOGP is an elitist algorithm that allows the best individuals to propagate into the next generation. Diversity in the population was maintained by measuring the crowding distance among the individuals deb2000fast .
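The ranking and crowding-distance steps mentioned above follow the general NSGA-II scheme. The following Python sketch shows one common way to implement them; it is an illustration under the assumption that all objectives are minimized, not a transcription of Algorithm 1, and the function names are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def nondominated_sort(objs):
    """Return fronts as lists of indices; fronts[0] holds the rank-1 solutions."""
    n = len(objs)
    dominated_by = [[] for _ in range(n)]   # indices that solution i dominates
    counts = [0] * n                        # number of solutions dominating i
    for i in range(n):
        for j in range(n):
            if i != j and dominates(objs[i], objs[j]):
                dominated_by[i].append(j)
            elif i != j and dominates(objs[j], objs[i]):
                counts[i] += 1
    fronts = [[i for i in range(n) if counts[i] == 0]]
    while fronts[-1]:
        nxt = []
        for i in fronts[-1]:
            for j in dominated_by[i]:
                counts[j] -= 1
                if counts[j] == 0:
                    nxt.append(j)
        fronts.append(nxt)
    return fronts[:-1]                      # drop the trailing empty front

def crowding_distance(objs, front):
    """Crowding distance of each index in one front (larger = less crowded)."""
    dist = {i: 0.0 for i in front}
    for k in range(len(objs[0])):
        order = sorted(front, key=lambda i: objs[i][k])
        lo, hi = objs[order[0]][k], objs[order[-1]][k]
        dist[order[0]] = dist[order[-1]] = float("inf")  # keep boundary points
        if hi == lo:
            continue
        for pos in range(1, len(order) - 1):
            gap = objs[order[pos + 1]][k] - objs[order[pos - 1]][k]
            dist[order[pos]] += gap / (hi - lo)
    return dist

# Made-up (error, tree size) objective vectors.
objs = [(0.10, 12), (0.12, 8), (0.20, 5), (0.15, 9), (0.25, 6)]
fronts = nondominated_sort(objs)
distances = crowding_distance(objs, fronts[0])
```

Boundary solutions of each front receive an infinite distance so that the extremes of every objective are always preferred when candidates of equal rank are compared.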
3.2 Heterogeneous Flexible Neural Tree
HFNT is analogous to a multi-layer feedforward neural network that has over-layer connections and activation functions at the nodes. HFNT construction has two phases chen2005time : 1) the tree construction phase, in which evolutionary algorithms are applied to construct a tree-like structure; and 2) the parameter-tuning phase, in which the genotype of HFNT (the underlying parameters of the tree structure) is optimized by using parameter optimization algorithms.
To create a near-optimum model, phase one starts with random tree-like structures (a population of initial solutions), where the parameters of each tree are fixed by a random guess. Once a near-optimum tree structure is obtained, the parameter-tuning phase optimizes its parameters. The phases are repeated until a satisfactory solution is obtained. Figure 1 illustrates these two phases, which work in a co-evolutionary manner. From Figure 1, it may be observed that two global search algorithms, MOGP (for structure optimization) and DE (for parameter optimization), work in a nested manner to obtain a near-optimum tree that may have a less complex structure and better parameters. Moreover, the evolutionary algorithm allows HFNT to select activation functions and input features at the nodes from sets of activation functions and input features, respectively. Thus, HFNT possesses automatic feature selection ability.
3.2.1 Basic Idea of HFNT
An HFNT $S$ is a collection of a function set $F$ and an instruction set $T$:
$$S = F \cup T = \{+_2^{k}, +_3^{k}, \ldots, +_{tn}^{k}\} \cup \{x_1, x_2, \ldots, x_d\}, \tag{4}$$
where $+_j^k$ ($j = 2, 3, \ldots, tn$) denotes a non-leaf instruction (a computational node). It receives $2 \le j \le tn$ arguments, and $k$ is a function that randomly takes an activation function from a set of activation functions. The maximum number of arguments $tn$ to a computational node is predefined. A set of seven activation functions is shown in Table 1. A leaf node’s instruction $x_1, x_2, \ldots, x_d$ denotes the input variables. Figure 2 is an illustration of a typical HFNT. Similarly, Figure 3 is an illustration of a typical node in an HFNT.
The $i$-th computational node (Figure 3) of a tree (say, the $i$-th node in Figure 2) receives $n^i$ inputs (denoted as $z_j^i$) through $n^i$ connection weights (denoted as $w_j^i$) and takes two adjustable parameters $a_i$ and $b_i$ that represent the arguments of the activation function at that node. The purpose of using an activation function at a computational node is to limit the output of the computational node within a certain range. For example, if the $i$-th node contains a Gaussian function (Table 1), then its output $y_i$ is computed as:
$$y_i = f(a_i, b_i, o_i) = \exp\left(-\left(\frac{o_i - a_i}{b_i}\right)^2\right), \tag{5}$$
where $o_i$ is the weighted summation of the inputs $z_j^i$ and weights $w_j^i$ at the $i$-th computational node (Figure 3), also known as the excitation of the node. The net excitation $o_i$ of the $i$-th node is computed as:
$$o_i = \sum_{j=1}^{n^i} w_j^i z_j^i, \tag{6}$$
where $z_j^i = x_j$ or $z_j^i = y_j$, i.e., $z_j^i$ can be either an input feature (a leaf node value) or the output of another node (a computational node output) in the tree. Weight $w_j^i$ is a real-valued connection weight in the range $[w_l, w_u]$. Similarly, the output $y$ of a tree is computed from the root node of the tree, which is recursively computed by computing each node’s output using (5) from right to left in a depth-first manner.
The fitness of a tree depends on the problem. Usually, a learning algorithm uses the approximation error, i.e., the MSE (1). Other fitness measures associated with the tree are the tree size and the diversity index. The tree size is the number of nodes (excluding the root node) in a tree, e.g., the number of computational nodes and leaf nodes in the tree in Figure 2 is 11 (three computational nodes and eight leaf nodes). The diversity index of a tree is the number of distinct activation functions (including the root node’s function) randomly selected from a set of activation functions. The tree in Figure 2 selected three distinct activation functions; hence, its diversity index is three.
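The node computation and the two structural measures described above can be sketched recursively. The Python code below is a minimal illustration, not the authors’ implementation: the `Leaf` and `Node` classes, the `tanh_act` activation (which ignores its two arguments here), and the example tree are assumptions made for the sketch.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Union

def gaussian(a, b, o):
    """Gaussian activation with center a and width b applied to excitation o."""
    return math.exp(-((o - a) / b) ** 2)

def tanh_act(a, b, o):
    """Hypothetical second activation; a and b are unused in this sketch."""
    return math.tanh(o)

@dataclass
class Leaf:
    feature: int                       # index of the input feature

@dataclass
class Node:
    act: Callable[[float, float, float], float]
    a: float
    b: float
    weights: List[float]               # one weight per child edge
    children: List[Union["Node", Leaf]]

def evaluate(node, x):
    """Recursively compute a node output for one input vector x."""
    if isinstance(node, Leaf):
        return x[node.feature]
    o = sum(w * evaluate(c, x) for w, c in zip(node.weights, node.children))
    return node.act(node.a, node.b, o)

def tree_size(node, is_root=True):
    """Number of nodes in the tree, excluding the root node."""
    own = 0 if is_root else 1
    if isinstance(node, Leaf):
        return own
    return own + sum(tree_size(c, False) for c in node.children)

def diversity_index(node):
    """Number of distinct activation functions used, root included."""
    def collect(n, seen):
        if isinstance(n, Node):
            seen.add(n.act)
            for c in n.children:
                collect(c, seen)
        return seen
    return len(collect(node, set()))

# Small made-up tree: a Gaussian root with one leaf and one tanh sub-node.
inner = Node(tanh_act, 0.0, 1.0, [0.5, -0.5], [Leaf(1), Leaf(2)])
root = Node(gaussian, 0.0, 1.0, [1.0, 1.0], [Leaf(0), inner])
output = evaluate(root, [0.0, 1.0, 1.0])
```

For this input the inner node’s excitation is zero, so the root’s excitation is also zero and the Gaussian root returns its peak value.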
3.3 Structure and Parameter Learning (Near optimal Tree)
A tree that offers the lowest approximation error and the simplest structure is a near optimal tree, which can be obtained by using an evolutionary algorithm such as GP riolo2014genetic , PIPE salustowicz1997probabilistic , GEP ferreira2006gene , MEP oltean2003evolving , and so on. To optimize tree parameters, algorithms such as genetic algorithm eiben2015ec , evolution strategy eiben2015ec , artificial bee colony karaboga2007powerful , PSO kennedy2001swarm ; zhang2015comprehensive , DE das2016recent , and any hybrid algorithm such as GA and PSO juang2004hybrid can be used.
The proposed multiobjective optimization of FNT has three fitness measures: approximation error (1) minimization, tree size minimization, and diversity index maximization. These objectives are simultaneously optimized during the tree construction phase using MOGP, which guides an initial population of random tree structures according to Algorithm 1. A detailed description of the components of Algorithm 1 is as follows:
In the selection operation, a mating pool of predefined size (Table 2) is created using binary tournament selection, where two candidates are randomly selected from the population and the better of the two (according to rank and crowding distance) is placed into the mating pool. This process is continued until the mating pool is full. An offspring population is generated by using the individuals of the mating pool. Two distinct individuals (parents) are randomly selected from the mating pool to create new individuals using the genetic operators crossover and mutation. The crossover and mutation operators are applied with their respective predefined probabilities.
A randomly selected terminal node is replaced by a newly generated terminal node.
All terminal nodes of the selected tree are replaced by randomly generated new terminal nodes.
A randomly selected terminal node or a computational node is replaced by a randomly generated sub-tree.
A randomly selected terminal node is replaced by a randomly generated computational node.
In the proposed MOGP, during each mutation event, one of the above-mentioned four mutation operators was randomly selected for mutating the tree.
The offspring population and the main population are merged to make a combined population.
In this step, the worst individuals of the combined population are weeded out. In other words, the best individuals are propagated to the new generation as the main population.
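The selection and elitism components described above can be sketched as follows. This is a simplified Python illustration: candidates are represented as dictionaries with hypothetical `rank` and `crowding` fields, and the sketch assumes that rank and crowding distance have already been computed for whichever population is passed in (a full NSGA-II implementation recomputes them on the combined population).

```python
import random

def better_key(candidate):
    """Lower rank wins; ties are broken by the larger crowding distance."""
    return (candidate["rank"], -candidate["crowding"])

def binary_tournament(population, rng):
    """Pick two random candidates and return the better one."""
    a, b = rng.sample(population, 2)
    return min(a, b, key=better_key)

def build_mating_pool(population, pool_size, rng):
    """Repeat binary tournaments until the mating pool is full."""
    return [binary_tournament(population, rng) for _ in range(pool_size)]

def survivors(parents, offspring, size):
    """Elitist reduction: merge both populations and keep the best `size`."""
    merged = parents + offspring
    merged.sort(key=better_key)
    return merged[:size]

# Made-up candidates; only rank and crowding distance matter here.
rng = random.Random(0)
parents = [
    {"name": "a", "rank": 1, "crowding": 0.5},
    {"name": "b", "rank": 2, "crowding": 9.0},
]
offspring = [{"name": "c", "rank": 1, "crowding": 2.0}]
pool = build_mating_pool(parents, 5, rng)
next_generation = survivors(parents, offspring, 2)
```

Because the parents take part in the reduction, a good parent can never be displaced by a worse offspring, which is what makes the scheme elitist.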
In parameter-tuning phase, a single objective, i.e., approximation error was used in optimization of HFNT parameter by DE. The tree parameters such as weights of tree edges and arguments of activation functions were encoded into a vector for the optimization. In addition, a cross-validation (CV) phase was used for statistical validation of HFNTs.
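The basics of DE are as follows. For an initial population $H$ of parameter vectors $\mathbf{w}_i$, DE repeats its steps of mutation, recombination, and selection until an optimum parameter vector $\mathbf{w}^*$ is obtained. DE updates each parameter vector $\mathbf{w}_i \in H$ by selecting the best vector $\mathbf{w}^g$ and three random vectors $\mathbf{r}_0$, $\mathbf{r}_1$, and $\mathbf{r}_2$ from $H$ such that $\mathbf{r}_0 \ne \mathbf{r}_1 \ne \mathbf{r}_2$ holds. The random vector $\mathbf{r}_0$ is considered as a trial vector $\mathbf{w}_i^t$. Hence, for all $i = 1, 2, \ldots, |H|$ and $j = 1, 2, \ldots, n$, the $j$-th variable $w_{ij}^t$ of the $i$-th trial vector $\mathbf{w}_i^t$ is generated by using crossover, mutation, and recombination as:
$$w_{ij}^t = \begin{cases} r_{0j} + F (w_j^g - r_{0j}) + F (r_{1j} - r_{2j}), & u_j < cr \text{ or } j = k, \\ r_{0j}, & \text{otherwise}, \end{cases} \tag{7}$$
where $k$ is a random index in $\{1, 2, \ldots, n\}$, $u_j$ is a random number within $[0, 1]$, $cr$ is the crossover probability, and $F$ is the mutation factor. The trial vector $\mathbf{w}_i^t$ is selected if
$$\mathbf{w}_i = \begin{cases} \mathbf{w}_i^t, & f(\mathbf{w}_i^t) < f(\mathbf{w}_i), \\ \mathbf{w}_i, & \text{otherwise}, \end{cases} \tag{8}$$
where $f(\cdot)$ returns the fitness of a vector as per (1). Hence, the process of crossover, mutation, recombination, and selection is repeated until an optimal parameter vector $\mathbf{w}^*$ is found.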
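One DE generation of the kind described above can be sketched in a few lines of Python. The exact mutation rule is an assumption here: the sketch uses a rand-to-best style update built from the best vector and three random vectors, with greedy selection. The function name `de_step`, the sphere fitness function, and the parameter values are illustrative only.

```python
import random

def de_step(pop, fitness, F, cr, rng):
    """One DE generation over a list of parameter vectors (needs >= 4 vectors)."""
    n = len(pop[0])
    best = min(pop, key=fitness)                 # best vector in the population
    new_pop = []
    for i, target in enumerate(pop):
        others = [v for idx, v in enumerate(pop) if idx != i]
        r0, r1, r2 = rng.sample(others, 3)       # three distinct random vectors
        k = rng.randrange(n)                     # index that is always mutated
        trial = []
        for j in range(n):
            if rng.random() < cr or j == k:
                # assumed rand-to-best mutation around the base vector r0
                trial.append(r0[j] + F * (best[j] - r0[j]) + F * (r1[j] - r2[j]))
            else:
                trial.append(r0[j])
        # greedy selection: keep the trial only if it improves on the target
        new_pop.append(trial if fitness(trial) < fitness(target) else target)
    return new_pop

def sphere(v):
    """Toy fitness function standing in for the approximation error."""
    return sum(x * x for x in v)

rng = random.Random(1)
population = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(8)]
for _ in range(20):
    population = de_step(population, sphere, F=0.7, cr=0.9, rng=rng)
```

Because of the greedy selection, the fitness of each population slot never gets worse from one generation to the next.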
3.4 Ensemble: Making use of MOGP Final Population
In the tree construction phase, MOGP provides a population from which we can select tree models for making the ensemble. Three conflicting objectives, namely approximation error, tree size, and diversity index, allow the creation of Pareto-optimal solutions, where solutions are distributed on various Pareto-optimal fronts according to their rank in the population. Ensemble candidates can be selected from the first line of solutions (Front 1), or they can be chosen by examining the three objectives depending on the user’s need and preference. Accuracy and diversity among the ensemble candidates are important kuncheva2003measures . Hence, in this work, approximation error and diversity among the candidates were given preference over tree size. Note that “diversity index” should not be confused with “diversity”: the diversity index is an objective in MOGP, whereas the diversity is the number of distinct candidates in an ensemble. A collection of diverse candidates is called a bag of candidates hastie2009elements . In this work, two trees were considered diverse (distinct) according to the following criteria: 1) the two trees were of different size; 2) the number of function nodes or leaf nodes in the two trees was dissimilar; 3) the two models used a different set of input features; 4) the two models used a different set of activation functions. Hence, the diversity of an ensemble $\mathcal{M}$ (a bag of solutions) was computed as:
$$\mathrm{Div}(\mathcal{M}) = \frac{\mathrm{distinct}(\mathcal{M})}{M}, \tag{9}$$
where $\mathrm{distinct}(\mathcal{M})$ is a function that returns the total number of distinct models in an ensemble $\mathcal{M}$ and $M$ is the total number of models in the bag.
Now, for a classification problem, to compute the combined vote of the respective candidates’ outputs $m_1, m_2, \ldots, m_M$ of bag $\mathcal{M}$ over classes $c_1, c_2, \ldots, c_C$, we used an indicator function $I(\cdot)$, which takes 1 if ‘$\cdot$’ is true and takes 0 if ‘$\cdot$’ is false. Thus, the ensemble decision by weighted majority voting is computed as polikar2006ensemble ; zhou2012ensemble :
$$y = \arg\max_{j \in \{1, \ldots, C\}} \sum_{t=1}^{M} w_t \, I(m_t = c_j), \tag{10}$$
where $w_t$ is the weight associated with the $t$-th candidate $m_t$ in an ensemble and $y$ is set to class $c_j$ if the total weighted vote received by $c_j$ is higher than the total vote received by any other class. Similarly, the ensemble of regression methods was computed by the weighted arithmetic mean as polikar2006ensemble :
$$y = \frac{\sum_{t=1}^{M} w_t m_t}{\sum_{t=1}^{M} w_t}, \tag{11}$$
which reduces to $\sum_{t=1}^{M} w_t m_t$ when the weights sum to one, where $w_t$ and $m_t$ are the weight and output of the $t$-th candidate in a bag $\mathcal{M}$, respectively, and $y$ is the ensemble output, which is then used for computing the MSE (1) and the correlation coefficient (2). The weights may be computed according to the fitness of the models, or by using a metaheuristic algorithm. In this work, DE was applied to compute the ensemble weights $w_t$, where the population size was set to 100 and the number of function evaluations was set to 300,000.
3.5 Multiobjective: A General Optimization Strategy
A summary of general HFNT learning algorithm is as follows:
Step 1. Initialize the HFNT training parameters.
Step 2. Apply the tree construction phase to guide the initial HFNT population towards Pareto-optimal solutions.
Step 3. Select tree model(s) from the MOGP final population according to their approximation error, tree size, and diversity index from the Pareto front.
Step 4. Apply the parameter-tuning phase to optimize the selected tree model(s).
Step 5. Go to Step 2 if no satisfactory solution is found; otherwise, go to Step 6.
Step 6. Use a cross-validation (CV) method to validate the chosen model(s).
Step 7. Use the chosen tree model(s) for making an ensemble (recommended).
Step 8. Compute the results of the ensemble model (recommended).
4 Experimental Set-Up
Several experiments were designed for evaluating the proposed HFNT. A careful parameter setting was used for testing its efficiency. A detailed description of the parameter setting is given in Table 2, which includes definitions, default ranges, and selected values. The phases of the algorithm were repeated until the stopping criteria were met, i.e., either the lowest predefined approximation error was achieved or the maximum number of function evaluations was reached. The repetition holds the key to obtaining a good solution. A carefully designed repetition of these two phases may offer a good solution in fewer function evaluations.
In this experiment, three general repetitions were used, with 30 tree construction iterations and 1000 parameter-tuning iterations (Figure 1). Hence, the maximum number of function evaluations was given by: initial GP population + three repetitions × ((GP population + mating pool size) × MOGP iterations + MH population × MH iterations), where MH denotes the metaheuristic used for parameter tuning. The DE version das2016recent with crossover probability equal to 0.9 and mutation factor equal to 0.7 was used in the parameter-tuning phase.
|Parameter||Definition||Value|
|Scaling||Input-feature scaling range.||[0,1]|
|Tree height||Maximum depth (layers) of a tree model.||4|
|Tree arity||Maximum arguments of a node.||5|
|Node range||Search space of function arguments.||[0,1]|
|Edge range||Search space for edges (weights) of the tree.||[-1,1]|
|Mating pool||Size of the pool of selected candidates.||0.5|
|Tournament||Tournament selection size.||2|
|General||Maximum number of trials.||3|
The experiments were conducted over classification, regression, and time-series datasets. A detailed description of the chosen datasets from the UCI machine learning UCILichman2013 and KEEL alcala2009keel repositories is available in Table 17. The parameter setting mentioned in Table 2 was used for the experiments over each dataset. Since stochastic algorithms depend on random initialization, a pseudorandom number generator, the Mersenne Twister algorithm matsumoto1998mersenne , which draws random values from a probability distribution in a pseudorandom manner, was used for the initialization of HFNTs. Each run of the experiment was conducted with a random seed drawn from the system. We compared HFNT performance with various other approximation models collected from the literature. A list of such models is provided in Table 18. A software tool based on the proposed HFNT algorithm for predictive modeling is available in ojha2016mogp .
To construct good ensemble systems, highly diverse and accurate candidates were selected for the ensemble bag. To increase diversity (9) among the candidates, the Pareto-optimal solutions were examined by giving preference to candidates that had low approximation error, had small tree size, and were distinct from the other selected candidates. Hence, candidates were selected from the final population. An illustration of such a selection method is shown in Figure 4, which represents an MOGP final population of 50 candidate solutions computed over the dataset MGS.
MOGP simultaneously optimized three objectives. Hence, the solutions were arranged on a three-dimensional map (Figure 4(a)), in which error was plotted along the x-axis, tree size along the y-axis, and diversity index (diversity) along the z-axis. However, for simplicity, we have also arranged the solutions in 2-D plots (Figure 4(b)), in which the computed error was plotted along the x-axis, and tree size (indicated by blue dots) and diversity index (indicated by red squares) were plotted along the y-axis. From Figure 4(b), it is evident that a clear choice is difficult, since decreasing the approximation error increases the models’ tree size (blue dots in Figure 4(b)). Similarly, decreasing the approximation error increases the models’ tree size and diversity (red squares in Figure 4(b)). Hence, solutions along the Pareto front (rank 1), i.e., the Pareto surface indicated in the 3-D map of the solutions in Figure 4(a), were chosen for the ensemble. For all datasets, ensemble candidates were selected by examining the Pareto fronts in a similar fashion as described for the dataset MGS in Figure 4.
The purpose of our experiment was to obtain sufficiently good prediction models by enhancing predictability and lowering complexity. We used MOGP for the optimization of HFNTs. Hence, we were compromising fitness by lowering the models’ complexity. In single-objective optimization, we only looked at the models’ fitness and, therefore, had no control over the models’ complexity. Figure 5 illustrates eight runs of both single-objective and multiobjective optimization of HFNT, where the y-axis indicates the models’ tree size (complexity) and the x-axis indicates the fitness value of the HFNT models. The results shown in Figure 5 were obtained over the MGS dataset. For both single-objective GP and multiobjective GP, the optimization course was noted, i.e., the successive fitness reduction and tree size were noted for 1000 iterations.
It is evident from Figure 5 that the multiobjective HFNT approach guides the optimization while keeping the model’s complexity low, whereas in the single-objective case the model’s complexity was unbounded and increased abruptly. The average tree sizes over eight runs of single-objective and eight runs of multiobjective optimization were 39.265 and 10.25, respectively, whereas the average fitness values were 0.1423 and 0.1393, respectively. In single-objective optimization, given that the tree size is unbounded, the fitness of a model may improve at the expense of the model’s complexity. Hence, the experiments were set up for multiobjective optimization, which provides a balance between both objectives, as described in Figure 4.
Experimental results were classified into three categories: classification, regression, and time-series. Each category has two parts: 1) the first part describes the best and average results obtained from the experiments; 2) the second part describes the ensemble results in tabular and graphical form.
5.1 Classification dataset
We chose five classification datasets for evaluating HFNT, and the classification accuracy was computed as:

$$\text{accuracy} = \frac{tp + tn}{tp + tn + fp + fn},$$

where $tp$ is the total number of positive samples correctly classified as positive samples, $tn$ is the total number of negative samples correctly classified as negative samples, $fp$ is the total number of negative samples incorrectly classified as positive samples, and $fn$ is the total number of positive samples incorrectly classified as negative samples. Here, for a binary classification problem, the positive sample indicates the class labeled with ‘1’ and the negative sample indicates the class labeled with ‘0’. Similarly, for a three-class ($c_1$, $c_2$, and $c_3$) classification problem, the samples labeled as class $c_1$ are set to 1, 0, 0, i.e., set to positive for class $c_1$ and negative for $c_2$ and $c_3$. The samples labeled as class $c_2$ are set to 0, 1, 0, and the samples labeled as class $c_3$ are set to 0, 0, 1.
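As a small illustration of the accuracy computation and the class encoding described above, the following sketch may help. The function names and the sample labels are hypothetical; the accuracy is taken as the fraction of correctly classified samples among all samples.

```python
from typing import List, Sequence, Tuple


def confusion_counts(actual: Sequence[int], predicted: Sequence[int]) -> Tuple[int, int, int, int]:
    """Count (tp, tn, fp, fn) for binary labels, where 1 is the positive class."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of correctly classified samples."""
    return (tp + tn) / (tp + tn + fp + fn)


def one_hot(label: str, classes: Sequence[str]) -> List[int]:
    """Encode a class label as a 0/1 vector, one position per class."""
    return [1 if label == c else 0 for c in classes]


# Hypothetical labels for eight samples.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 0, 1, 0, 1]
tp, tn, fp, fn = confusion_counts(actual, predicted)
print(accuracy(tp, tn, fp, fn))
print(one_hot("c2", ["c1", "c2", "c3"]))
```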
5.1.1 10-Fold CV
The experiments on the classification datasets were conducted in three batches that produced 30 models, and each model was cross-validated using 10-fold CV, in which a dataset is divided equally into 10 sets and the training of a model is repeated 10 times. Each time, a distinct set was picked for testing the model, and the remaining nine sets were picked for training it. The obtained results are summarized in Table 3. Each batch of experiments produced an ensemble system of 10 models, whose results are shown in Table 7.
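The 10-fold partitioning described above can be sketched as follows. This is only an illustration: shuffling the indices before splitting and the fixed seed are assumptions, and `k_fold_indices` is a hypothetical helper rather than part of the tool used in the experiments.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int = 10, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # assumption: random assignment to folds
    folds = [indices[i::k] for i in range(k)]     # k folds of (nearly) equal size
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(100, k=10))
for train, test in splits[:2]:
    print(len(train), len(test))
```

Each model would be trained on `train` and evaluated on `test` for every split, and the ten test scores would then be averaged.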
The results presented in Table 3 describe the best and mean results of the 30 models. In Table 4, we present a comparative study of the best 10-fold CV results of HFNT and the results reported in the literature. In Table 4, the results of HDT and FNT li2006flexible are 10-fold CV results on the test dataset, whereas the result of FNT chen2006ensemble is the best test accuracy and not a CV result. The results summarized in Table 4 suggest a comparatively better performance of the proposed HFNT over the previous approaches. To illustrate a model created by the HFNT approach, we chose the best model of dataset WDB that has a test accuracy of (shown in Table 3). A pictorial representation of the WDB model is shown in Figure 6, where the model’s tree size is 7, the total number of input features is 5 ( and ), and the selected activation function is tangent hyperbolic () at both the non-leaf nodes. Models of all other datasets may be represented similarly.
|Best of 30 models||Mean of 30 models|
|Data||train||test||tree size||Features||train||test||avg. tree size||diversity|
In this work, the Friedman test was conducted to examine the significance of the algorithms. For this purpose, the classification accuracy (test results) was considered (Table 4). The average ranks obtained by each method in the Friedman test are shown in Table 5. The Friedman statistic at $\alpha = 0.05$ (distributed according to chi-square with 2 degrees of freedom) is 5.991, i.e., $\chi^2_{(\alpha,2)} = 5.991$. The obtained test value $Q$ according to the Friedman statistic is 6. Since $Q > \chi^2_{(\alpha,2)}$, the null hypothesis that “there is no difference between the algorithms” is rejected. In other words, the $p$-value computed by the Friedman test is 0.049787, which is less than or equal to 0.05, i.e., $p$-value $\leq$ $\alpha$-value. Hence, we reject the null hypothesis.
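For two degrees of freedom, the chi-square survival function has the closed form $e^{-x/2}$, so the critical value and the $p$-value quoted above can be checked without a statistics library. The sketch below does this; the function names are illustrative only.

```python
import math


def chi2_sf_2dof(x: float) -> float:
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-x / 2.0)


def chi2_critical_2dof(alpha: float) -> float:
    """Value x such that chi2_sf_2dof(x) equals alpha."""
    return -2.0 * math.log(alpha)


alpha = 0.05
q = 6.0                                   # Friedman test value reported in the text
critical = chi2_critical_2dof(alpha)      # approximately 5.991
p_value = chi2_sf_2dof(q)                 # approximately 0.049787
reject_null = q > critical

print(round(critical, 3), round(p_value, 6), reject_null)
```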
Table 5 describes the significance of the differences between the algorithms. To compare the best-ranked algorithm in the Friedman test, i.e., the proposed algorithm HFNT, with the other two algorithms, Holm’s method holm1979simple was used. Holm’s method rejects the hypothesis of equality between the best algorithm (HFNT) and another algorithm if the $p$-value is less than $\alpha/i$, where $i$ is the position of the algorithm in a list sorted in ascending order of $z$-value (Table 6). From the post hoc analysis, it was observed that the proposed algorithm HFNT outperformed both the HDT li2006flexible and FNT li2006flexible algorithms.
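A minimal sketch of Holm's step-down procedure is given below. It follows the standard formulation, in which the smallest $p$-value is compared with $\alpha/m$ for $m$ comparisons, the next with $\alpha/(m-1)$, and so on, and testing stops at the first hypothesis that cannot be rejected. The algorithm names and $p$-values are hypothetical and do not reproduce Table 6.

```python
from typing import Dict


def holm(p_values: Dict[str, float], alpha: float = 0.05) -> Dict[str, bool]:
    """Return, for each comparison, whether equality with the control is rejected."""
    ordered = sorted(p_values.items(), key=lambda item: item[1])  # smallest p first
    m = len(ordered)
    decisions: Dict[str, bool] = {}
    still_rejecting = True
    for rank, (name, p) in enumerate(ordered):
        i = m - rank                      # divisor: m for the smallest p, 1 for the largest
        if still_rejecting and p < alpha / i:
            decisions[name] = True
        else:
            still_rejecting = False       # once one test fails, all later ones are retained
            decisions[name] = False
    return decisions


# Hypothetical unadjusted p-values of three algorithms against a control.
example = {"A": 0.010, "B": 0.030, "C": 0.200}
result = holm(example, alpha=0.05)
print(result)
```

Here A is rejected (0.010 is below 0.05/3), while B fails its threshold of 0.05/2 and therefore neither B nor C is rejected.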
The best accuracy and the average accuracy of the 30 models presented in Table 3 are evidence of HFNT's efficiency. However, as mentioned earlier, a generalized solution may be obtained by using an ensemble. All 30 models were created in three batches; hence, three ensemble systems were obtained. The results of those ensemble systems are presented in Table 7, where the ensemble results are the accuracies obtained by weighted majority voting (10). In Table 7, the classification accuracies were computed over the CV test dataset. From Table 7, it may be observed that high diversity among the ensemble candidates offered comparatively higher accuracy. Hence, an ensemble model may be adopted by examining the performance of an ensemble system, i.e., the average tree size (complexity) of the candidates within the ensemble and the selected input features.
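Weighted majority voting can be sketched as below. This is a generic formulation in which each model adds its weight to the class it predicts and the class with the largest total wins; it is an assumption that this matches the form of (10), and the predictions and weights are hypothetical.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def weighted_majority_vote(predictions: Sequence[Hashable], weights: Sequence[float]) -> Hashable:
    """Return the class label with the largest sum of model weights.

    Ties are resolved in favor of the label that was seen first.
    """
    scores = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    return max(scores, key=scores.get)


# Three hypothetical models voting on one sample.
print(weighted_majority_vote([1, 0, 0], [0.6, 0.2, 0.3]))   # class 1: 0.6 vs 0.5
print(weighted_majority_vote([1, 0, 0], [0.3, 0.3, 0.3]))   # class 0: 0.6 vs 0.3
```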
An ensemble system created from genetic evolution and adaptation is crucial for feature selection and analysis. The ensemble results summarized in Table 7 give the following useful information about the feature selection ability of HFNT: 1) TSF - total selected features; 2) MSF - most significant (frequently selected) features; and 3) MIF - most infrequently selected features. Table 7 illustrates the feature selection results.
|AUS||1||86.96%||5||0.7||4||, , ,||, , , ,|
|HRT||1||77.41%||6.8||0.5||6||, , ,|
|ION||1||82.86%||7.2||0.9||15||, , ,||, , , , , , , ,|
|PIM||1||76.32%||6.9||1||8||, , , , ,|
|WDB||1||94.29%||8.2||0.7||15||, , ,||, , , , , ,|
5.2 Regression dataset
5.2.1 5-Fold CV
For the regression datasets, the performance of HFNT was examined using the 5-fold CV method, in which the dataset was divided into 5 sets, each 20% in size, and the process was repeated five times. Each time, four sets were used for training and one set for testing. Hence, a total of 5 runs were used for each model. As described in gacto2014metsk , MSE was used for evaluating HFNT, where the MSE was computed as per (1). The training MSE is represented as and test MSE is represented as . This setting of MSE computation and cross-validation was adopted for comparison with the results collected from gacto2014metsk . Table 8 presents the 5-fold CV results of each dataset for 30 models. Hence, each presented result is averaged over a total of 150 runs of experiments. Similarly, Table 9 shows a comparison between HFNT and other algorithms collected from the literature. It is evident from the comparative results that HFNT is very competitive with the other algorithms. The literature results were averaged over 30 runs of experiments, whereas the HFNT results were averaged over 150 runs. Hence, a competitive result of HFNT is evidence of its efficiency.
Moreover, HFNT is distinct from the other algorithms mentioned in Table 9 because it performs feature selection and model complexity minimization simultaneously, whereas the other algorithms used all the available features. Therefore, the comparison of results was limited to assessing the average MSE, where HFNT, which gives simpler models than the others, remains firmly competitive. An illustration of the best model of the regression dataset DEE is provided in Figure 7, where the model offered a test MSE of 0.077, a tree size equal to 10, and four selected input features (, , , and ). The selected activation functions were unipolar sigmoid (), bipolar sigmoid (), tangent hyperbolic (), and Gaussian (). Note that while creating HFNT models, the datasets were normalized as described in Table 2, and the outputs of the models were denormalized accordingly. Therefore, normalized inputs should be presented to the tree (Figure 7), and the output of the tree (Figure 7) should be denormalized.
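The normalization and denormalization step can be illustrated with a min-max scaling sketch. It is an assumption that the scaling is linear between the minimum and maximum of each variable and that the target range is [0, 1]; the actual ranges are those listed in Table 2, and the function names are hypothetical.

```python
def normalize(x: float, lo: float, hi: float, a: float = 0.0, b: float = 1.0) -> float:
    """Map x from the original range [lo, hi] to the target range [a, b]."""
    return a + (x - lo) * (b - a) / (hi - lo)


def denormalize(y: float, lo: float, hi: float, a: float = 0.0, b: float = 1.0) -> float:
    """Map a model output y from [a, b] back to the original range [lo, hi]."""
    return lo + (y - a) * (hi - lo) / (b - a)


# Hypothetical variable whose observed values lie between 0 and 10.
scaled = normalize(5.0, lo=0.0, hi=10.0)
restored = denormalize(scaled, lo=0.0, hi=10.0)
print(scaled, restored)
```

Inputs would be passed through `normalize` before being presented to a tree, and the tree's output would be passed through `denormalize` using the range of the target variable.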
|Best of 30 models||Mean of 30 models|
|Data||train||test||tree size||#Features||train||test||tree size||diversity|
|Note: Results of ELV should be multiplied with 10|
|Note: ELV results should be multiplied with 10, HFNT results were averaged over 150 runs compared to MLP, ANFIS-SUB, TSK-IRL, LINEAR-LMS, LEL-TSK, and METSK-HD, which were averaged over 30 runs.|
For the regression datasets, the Friedman test was conducted to examine the significance of the algorithms. For this purpose, the best test MSE of the algorithms MLP, ANFIS-SUB, TSK-IRL, LINEAR-LMS, LEL-TSK, and METSK-HD was taken from Table 9, and the best test MSE of the algorithm HFNT was taken from Table 8. The average ranks obtained by each method in the Friedman test are shown in Table 10. The Friedman statistic at $\alpha = 0.05$ (distributed according to chi-square with 5 degrees of freedom) is 11, i.e., $\chi^2_{(\alpha,5)} = 11$. The obtained test value $Q$ according to the Friedman statistic is 11. Since $Q \geq \chi^2_{(\alpha,5)}$, the null hypothesis that “there is no difference between the algorithms” is rejected. In other words, the $p$-value computed by the Friedman test is 0.05, which is less than or equal to 0.05, i.e., $p$-value $\leq$ $\alpha$-value. Hence, we reject the null hypothesis.
From the Friedman test, it is clear that the proposed algorithm HFNT performed best among all the algorithms. The post hoc analysis presented in Table 11 describes the significance of the differences between the algorithms. For this purpose, we applied Holm’s method holm1979simple , which rejects the hypothesis of equality between the best algorithm (HFNT) and another algorithm if the $p$-value is less than $\alpha/i$, where $i$ is the position of the algorithm in a list sorted in ascending order of $z$-value (Table 11).
In the obtained results, the equality between ANFIS-SUB, MLP, and HFNT was rejected, whereas the equality of HFNT with the other algorithms cannot be rejected with $\alpha = 0.1$, i.e., with 90% confidence. However, the $p$-value shown in Table 11 indicates the quality of their performance and their statistical closeness to the algorithm HFNT. It can be observed that the algorithm METSK-HD performed closest to the algorithm HFNT, followed by LEL-TSK and LINEAR-LMS.
For each dataset, we constructed five ensemble systems by using 10 models in each batch. In each batch, 10 models were created and cross-validated using $5\times2$-fold CV. In $5\times2$-fold CV, a dataset is randomly divided into two equal sets, A and B. This partition of the dataset was repeated five times, and each time, when set A was presented for training, set B was presented for testing, and vice versa. Hence, a total of 10 runs of experiments were performed for each model. The collected ensemble results are presented in Table 12, where the ensemble outputs were obtained by using the weighted arithmetic mean as mentioned in (11).
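The repeated two-way partitioning can be sketched as follows. The helper name and the fixed seed are hypothetical; the sketch only shows how five random halvings yield ten train/test runs.

```python
import random
from typing import Iterator, List, Tuple


def five_by_two_splits(n_samples: int, repeats: int = 5, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists: two runs per random halving of the data."""
    rng = random.Random(seed)
    for _ in range(repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        half = n_samples // 2
        set_a, set_b = indices[:half], indices[half:]
        yield set_a, set_b    # train on A, test on B
        yield set_b, set_a    # train on B, test on A


runs = list(five_by_two_splits(20))
print(len(runs))
```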
The weights of the models were computed using the DE algorithm, where the parameter setting was similar to the one mentioned for the classification datasets. The ensemble results shown in Table 12 are the MSE and correlation coefficient computed on the CV test dataset. From the ensemble results, it can be said that an ensemble with higher diversity offered better results than an ensemble with lower diversity. The models of the ensemble were examined to evaluate the MSF and MIF presented in Table 12. A graphical illustration of the ensemble results is shown in Figure 8 using scatter (regression) plots, where a scatter plot shows how much one variable is affected by another (in this case, the model’s output and the desired output). Moreover, it shows the relationship between the two variables, i.e., their correlation. The plots shown in Figure 8 represent the best ensemble batches (numbers indicated in bold in Table 12), namely batches four, five, three, four, and five, where the MSEs are 2.2938, 270706, 0.1085, 1.1005, and 2.3956, respectively. The values of in the plots indicate the quality of the regression curve fitting over the CV test datasets. In other words, it can be said that the ensemble models were obtained with generalization ability.
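A weighted arithmetic mean of model outputs can be sketched as below. It is an assumption that the weights are normalized by their sum, which may differ in detail from (11); in the experiments the weights come from DE, whereas the values here are hypothetical.

```python
from typing import List, Sequence


def weighted_mean_ensemble(outputs: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Combine per-model outputs sample by sample.

    outputs[j][k] is the output of model j for sample k.
    """
    total = sum(weights)
    n_samples = len(outputs[0])
    return [
        sum(w * model[k] for w, model in zip(weights, outputs)) / total
        for k in range(n_samples)
    ]


# Two hypothetical models, two samples.
combined = weighted_mean_ensemble([[1.0, 2.0], [3.0, 4.0]], weights=[1.0, 3.0])
print(combined)
```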
|ABL||1||3.004||0.65||5||0.1||3||, , ,|
|BAS||1||2.932||0.79||5.6||0.3||5||, , , , ,||, , , ,|
|DEE||1||0.112||0.88||4.3||0.2||4||, , , ,|
|EVL||1||1.126||0.71||9.3||0.1||12||, , , ,||, ,|
|FRD||1||3.987||0.86||6.2||0.2||4||, , ,|
|Note: BAS results should be multiplied with 10, ELV results should be multiplied with 10.|
5.3 Time-series dataset
5.3.1 2-Fold CV
In the literature survey, it was found that the efficiency of most FNT-based models was evaluated on time-series datasets. The Mackey-Glass (MGS) dataset was mostly used for this purpose. However, only the best-obtained results were reported. For time-series prediction problems, the performance was computed using the root mean squared error (RMSE), i.e., we took the square root of the MSE given in (1). Additionally, the correlation coefficient (2) was used for evaluating the algorithms' performance.
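The two measures can be sketched as follows. The sketch assumes that the MSE is the plain mean of squared differences and that the correlation coefficient is Pearson's; the exact forms of (1) and (2) may differ in scaling, and the sample values are hypothetical.

```python
import math
from typing import Sequence


def rmse(desired: Sequence[float], predicted: Sequence[float]) -> float:
    """Square root of the mean squared difference between targets and outputs."""
    n = len(desired)
    return math.sqrt(sum((d - y) ** 2 for d, y in zip(desired, predicted)) / n)


def correlation(desired: Sequence[float], predicted: Sequence[float]) -> float:
    """Pearson correlation coefficient between targets and outputs."""
    n = len(desired)
    mean_d = sum(desired) / n
    mean_y = sum(predicted) / n
    cov = sum((d - mean_d) * (y - mean_y) for d, y in zip(desired, predicted))
    var_d = sum((d - mean_d) ** 2 for d in desired)
    var_y = sum((y - mean_y) ** 2 for y in predicted)
    return cov / math.sqrt(var_d * var_y)


target = [1.0, 2.0, 3.0]
print(rmse(target, [1.0, 2.0, 5.0]))
print(correlation(target, [2.0, 4.0, 6.0]))
```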
For the experiments, the first 50% of the dataset was taken for training and the remaining 50% was used for testing. Table 13 describes the results obtained by HFNT, where is RMSE for training set and is RMSE for test-set. The best test RMSE obtained by HFNT was and on datasets MGS and WWR, respectively. HFNT results are competitive with most of the algorithms listed in Table 14. Only a few algorithms, such as LNF and FWNN-M, reported better results than those obtained by HFNT. FNT-based algorithms such as FNT chen2005time and FBBFNT-EGP&PSO reported RMSEs close to the results obtained by HFNT. The average RMSEs and their variances over the test set of 70 models were 0.10568 and 0.00283 on dataset MGS, and 0.097783 and 0.00015 on dataset WWR, respectively. The low variance indicates that most models produced results around the average RMSE value. The results reported by other function approximation algorithms (Table 14) were merely the best RMSEs. Hence, the robustness of the other reported algorithms cannot be compared with that of HFNT. However, the advantage of using HFNT over other algorithms is evident from the fact that the average complexity of the predictive models was 8.15 and 8.05 for datasets MGS and WWR, respectively.
The best model obtained for dataset WWR is shown in Figure 9, where the tree size is equal to 17 and the selected activation functions are the following: tangent hyperbolic, Gaussian, unipolar sigmoid, bipolar sigmoid, and linear tangent hyperbolic. The selected input features in the tree (Figure 9) are , , and . In the time-series category, we have only two datasets, and for each dataset HFNT was compared with different models from the literature. Hence, a statistical test was not conducted in this category, since the differences between the algorithms are easy to determine from Table 14.
|Best of 70 models||Mean of 70 models|
The ensemble results of the time-series datasets are presented in Table 15, where the best ensemble system of dataset MGS (marked bold in Table 15) offered a test RMSE with a test correlation coefficient . Similarly, the best ensemble system of dataset WWR (marked bold in Table 15) offered a test RMSE with a test correlation coefficient . Apart from the best results, most of the ensembles produced low RMSEs, i.e., high correlation coefficients. The best ensemble batches (marked bold in Table 15) of datasets MGS and WWR were used for the graphical plots in Figure 10. A one-to-one fitting of target and prediction values is evidence of a high correlation between the model’s output and the desired output, which is a significant indicator of the model’s efficient performance.
|Data||batch||avg. tree size||(9)||TSF||MSF||MIF|
|Note: , , and indicate test RMSE, test correlation coefficient, and diversity, respectively|
HFNT was examined over three categories of datasets: classification, regression, and time-series. The results presented in Section 5 clearly suggest a superior performance of the HFNT approach. In the HFNT approach, MOGP guided an initial HFNT population towards Pareto-optimal solutions, where the final HFNT population was a mixture of heterogeneous HFNTs. Alongside accuracy and simplicity, the Pareto-based multiobjective approach ensured diversity among the candidates in the final population. Hence, the HFNTs in the final population were fairly accurate, simple, and diverse. Moreover, the HFNTs in the final population were diverse in structure, parameters, activation functions, and input features. Hence, the selection of models from Pareto-fronts, as indicated in Section 4, led to a good ensemble system.
|activation function ()|
|Note: 67 is the best and 21 is the worst|
HFNT was applied to solve classification, regression, and time-series problems. Since HFNT is stochastic in nature, its performance was affected by several factors: the random generator algorithm, the random seed, the efficiency of the meta-heuristic algorithm used in the parameter-tuning phase, the activation functions selected at the nodes, etc. Therefore, to examine the performance of HFNT, several HFNT models were created using different random seeds, and the best and average approximation errors of all created models were examined. In Section 5, as far as the best model is concerned, the performance of HFNT surpasses that of the other approximation models collected from the literature. Additionally, for each dataset, a very good average value (high accuracy in the case of classification and low approximation error in the case of regression and time-series) was obtained, which strongly suggests that HFNT often led to good solutions. Similarly, in the case of the ensembles, it is clear from the results that the combined output of diverse and accurate candidates offered a high-quality (in terms of generalization ability and accuracy) approximation/prediction model. From the results, it is clear that the final population of HFNT offered the best ensemble when the models were carefully examined based on approximation error, average complexity (tree size), and selected features.
Moreover, the performance of the activation functions was examined. For this purpose, the best ensemble system obtained for each dataset was considered. The performance of the activation functions was evaluated as follows. The best ensemble system of each dataset had 10 models; therefore, we counted the number of models (among 10) in which an activation function appeared. Hence, for a dataset, if an activation function appeared in all models of an ensemble system, then the total count was 10. Subsequently, counting was performed for all the activation functions for the best ensemble systems of all the datasets. Table 16 shows the performance of the activation functions. It can be observed that the activation functions Gaussian () and bipolar sigmoid () performed the best among all the activation functions, followed by the tangent-hyperbolic () function. Hence, no single activation function performed exceptionally well. Therefore, the adaptive selection of activation functions by MOGP was essential to the performance of HFNTs.
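The counting scheme can be sketched as below, where each model contributes at most one count per activation function regardless of how many of its nodes use that function. The function labels and the small three-model ensemble are hypothetical and do not reproduce Table 16.

```python
from collections import Counter
from typing import Iterable, Sequence


def activation_counts(ensemble: Iterable[Sequence[str]]) -> Counter:
    """Count, for each activation function, the number of models in which it appears."""
    counts: Counter = Counter()
    for functions_in_model in ensemble:
        counts.update(set(functions_in_model))   # a model counts once per function
    return counts


# Hypothetical ensemble: one list of node activation functions per model.
ensemble = [
    ["gaussian", "tanh", "gaussian"],
    ["bipolar_sigmoid"],
    ["gaussian", "bipolar_sigmoid"],
]
counts = activation_counts(ensemble)
print(counts["gaussian"], counts["bipolar_sigmoid"], counts["tanh"])
```

Summing such counters over the best ensemble of every dataset would give one total per activation function.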
In this work, we were limited to examining the performance of our approach on benchmark problems only. Therefore, in view of the no free lunch theorem wolpert1997no ; koppen2001remarks and the algorithm’s dependence on the random number generator, which is sensitive to platform, programming language, and implementation l2005fast , it is clear that the performance of the mentioned approach is subject to a careful choice of training conditions and parameter settings when it comes to dealing with other real-world problems.
Effective use of the final population of the heterogeneous flexible neural trees (HFNTs) evolved using Pareto-based multiobjective genetic programming (MOGP) and the subsequent parameter tuning by differential evolution led to the formation of high-quality ensemble systems. The simultaneous optimization of accuracy, complexity, and diversity solved the problem of structural complexity that was inevitably imposed when a single objective was used. MOGP used in the tree construction phase often guided an initial HFNT population towards a population in which the candidates were highly accurate, structurally simple, and diverse. Therefore, the selected candidates helped in the formation of a good ensemble system. The results obtained by the HFNT approach support its superior performance over the algorithms collected for the comparison. In addition, HFNT provides adaptation in structure, computational nodes, and input feature space. Hence, HFNT is an effective algorithm for automatic feature selection, data analysis, and modeling.
This work was supported by the IPROCOM Marie Curie Initial Training Network, funded through the People Programme (Marie Curie Actions) of the European Union’s Seventh Framework Programme FP7/2007–2013/, under REA grant agreement number 316555.
- (1) Y. Chen, B. Yang, J. Dong, A. Abraham, Time-series forecasting using flexible neural tree model, Information Sciences 174 (3) (2005) 219–235.
- (2) X. Yao, Y. Liu, A new evolutionary system for evolving artificial neural networks, IEEE Transactions on Neural Networks 8 (3) (1997) 694–713.
- (3) I. Basheer, M. Hajmeer, Artificial neural networks: Fundamentals, computing, design, and application, Journal of Microbiological Methods 43 (1) (2000) 3–31.
- (4) A. J. Maren, C. T. Harston, R. M. Pap, Handbook of neural computing applications, Academic Press, 2014.
- (5) I. K. Sethi, A. K. Jain, Artificial neural networks and statistical pattern recognition: Old and new connections, Vol. 1, Elsevier, 2014.
- (6) M. Tkáč, R. Verner, Artificial neural networks in business: Two decades of research, Applied Soft Computing 38 (2016) 788–804.
- (7) S. E. Fahlman, C. Lebière, The cascade-correlation learning architecture, in: D. S. Touretzky (Ed.), Advances in Neural Information Processing Systems 2, Morgan Kaufmann Publishers Inc., 1990, pp. 524–532.
- (8) J.-P. Nadal, Study of a growth algorithm for a feedforward network, International Journal of Neural Systems 1 (1) (1989) 55–59.
- (9) K. O. Stanley, R. Miikkulainen, Evolving neural networks through augmenting topologies, Evolutionary Computation 10 (2) (2002) 99–127.
- (10) B.-T. Zhang, P. Ohm, H. Mühlenbein, Evolutionary induction of sparse neural trees, Evolutionary Computation 5 (2) (1997) 213–236.
- (11) M. A. Potter, K. A. De Jong, Cooperative coevolution: An architecture for evolving coadapted subcomponents, Evolutionary computation 8 (1) (2000) 1–29.
- (12) M. Yaghini, M. M. Khoshraftar, M. Fallahi, A hybrid algorithm for artificial neural network training, Engineering Applications of Artificial Intelligence 26 (1) (2013) 293–301.
- (13) S. Wang, Y. Zhang, Z. Dong, S. Du, G. Ji, J. Yan, J. Yang, Q. Wang, C. Feng, P. Phillips, Feed-forward neural network optimized by hybridization of PSO and ABC for abnormal brain detection, International Journal of Imaging Systems and Technology 25 (2) (2015) 153–164.
- (14) S. Wang, Y. Zhang, G. Ji, J. Yang, J. Wu, L. Wei, Fruit classification by wavelet-entropy and feedforward neural network trained by fitness-scaled chaotic abc and biogeography-based optimization, Entropy 17 (8) (2015) 5711–5728.
- (15) R. Salustowicz, J. Schmidhuber, Probabilistic incremental program evolution, Evolutionary Computation 5 (2) (1997) 123–141.
- (16) A. K. Kar, Bio inspired computing–a review of algorithms and scope of applications, Expert Systems with Applications 59 (2016) 20–32.
- (17) Y. Jin, B. Sendhoff, Pareto-based multiobjective machine learning: An overview and case studies, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 38 (3) (2008) 397–415.
- (18) K. Deb, Multi-objective optimization using evolutionary algorithms, Vol. 16, John Wiley & Sons, 2001.
- (19) X. Yao, Y. Liu, Making use of population information in evolutionary artificial neural networks, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 28 (3) (1998) 417–425.
- (20) L. I. Kuncheva, C. J. Whitaker, Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy, Machine Learning 51 (2) (2003) 181–207.
- (21) K. Deb, S. Agrawal, A. Pratap, T. Meyarivan, A fast elitist non-dominated sorting genetic algorithm for multi-objective optimization: NSGA-II, in: Parallel Problem Solving from Nature PPSN VI, Vol. 1917 of Lecture Notes in Computer Science, Springer, 2000, pp. 849–858.
- (22) S. Das, S. S. Mullick, P. Suganthan, Recent advances in differential evolution–an updated survey, Swarm and Evolutionary Computation 27 (2016) 1–30.
- (23) Y. Chen, A. Abraham, J. Yang, Feature selection and intrusion detection using hybrid flexible neural tree, in: Advances in Neural Networks–ISNN, Vol. 3498 of Lecture Notes in Computer Science, Springer, 2005, pp. 439–444.
- (24) L. Sánchez, I. Couso, J. A. Corrales, Combining GP operators with SA search to evolve fuzzy rule based classifiers, Information Sciences 136 (1) (2001) 175–191.
- (25) J. Kennedy, R. C. Eberhart, Y. Shi, Swarm Intelligence, Morgan Kaufmann, 2001.
- (26) Y. Chen, A. Abraham, B. Yang, Feature selection and classification using flexible neural tree, Neurocomputing 70 (1) (2006) 305–313.
- (27) R. Riolo, J. H. Moore, M. Kotanchek, Genetic programming theory and practice XI, Springer, 2014.
- (28) X. Chen, Y.-S. Ong, M.-H. Lim, K. C. Tan, A multi-facet survey on memetic computation, IEEE Transactions on Evolutionary Computation 15 (5) (2011) 591–607.
- (29) H.-J. Li, Z.-X. Wang, L.-M. Wang, S.-M. Yuan, Flexible neural tree for pattern recognition, in: Advances in Neural Networks–ISNN, Vol. 3971 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2006, pp. 903–908.
- (30) Y. Chen, Y. Wang, B. Yang, Evolving hierarchical RBF neural networks for breast cancer detection, in: Neural Information Processing, Vol. 4234 of Lecture Notes in Computer Science, Springer, 2006, pp. 137–144.
- (31) Y. Chen, F. Chen, J. Yang, Evolving MIMO flexible neural trees for nonlinear system identification, in: International Conference on Artificial Intelligence, Vol. 1, 2007, pp. 373–377.
- (32) P. Wu, Y. Chen, Grammar guided genetic programming for flexible neural trees optimization, in: Advances in Knowledge Discovery and Data Mining, Springer, 2007, pp. 964–971.
- (33) Y. Shan, R. McKay, R. Baxter, H. Abbass, D. Essam, H. Nguyen, Grammar model-based program evolution, in: Congress on Evolutionary Computation, Vol. 1, 2004, pp. 478–485.
- (34) G. Jia, Y. Chen, Q. Wu, A MEP and IP based flexible neural tree model for exchange rate forecasting, in: Fourth International Conference on Natural Computation, Vol. 5, IEEE, 2008, pp. 299–303.
- (35) M. Oltean, C. Groşan, Evolving evolutionary algorithms using multi expression programming, in: Advances in Artificial Life, Springer, 2003, pp. 651–658.
- (36) P. Musilek, A. Lau, M. Reformat, L. Wyard-Scott, Immune programming, Information Sciences 176 (8) (2006) 972–1002.
- (37) B. Yang, L. Wang, Z. Chen, Y. Chen, R. Sun, A novel classification method using the combination of FDPS and flexible neural tree, Neurocomputing 73 (4–6) (2010) 690–699.
- (38) S. Bouaziz, H. Dhahri, A. M. Alimi, A. Abraham, Evolving flexible beta basis function neural tree using extended genetic programming & hybrid artificial bee colony, Applied Soft Computing.
- (39) Y. Chen, B. Yang, A. Abraham, Flexible neural trees ensemble for stock index modeling, Neurocomputing 70 (4–6) (2007) 697–703.
- (40) B. Yang, M. Jiang, Y. Chen, Q. Meng, A. Abraham, Ensemble of flexible neural tree and ordinary differential equations for small-time scale network traffic prediction, Journal of Computers 8 (12) (2013) 3039–3046.
- (41) V. K. Ojha, A. Abraham, V. Snasel, Ensemble of heterogeneous flexible neural tree for the approximation and feature-selection of Poly (Lactic-co-glycolic Acid) micro-and nanoparticle, in: Proceedings of the Second International Afro-European Conference for Industrial Advancement AECIA 2015, Springer, 2016, pp. 155–165.
- (42) L. Peng, B. Yang, L. Zhang, Y. Chen, A parallel evolving algorithm for flexible neural tree, Parallel Computing 37 (10–11) (2011) 653–666.
- (43) L. Wang, B. Yang, Y. Chen, X. Zhao, J. Chang, H. Wang, Modeling early-age hydration kinetics of portland cement using flexible neural tree, Neural Computing and Applications 21 (5) (2012) 877–889.
- (44) C. Ferreira, Gene expression programming: mathematical modeling by an artificial intelligence, Vol. 21, Springer, 2006.
- (45) G. Weiss, Multiagent systems: A modern approach to distributed artificial intelligence, MIT Press, 1999.
- (46) M. Ammar, S. Bouaziz, A. M. Alimi, A. Abraham, Negotiation process for bi-objective multi-agent flexible neural tree model, in: International Joint Conference on Neural Networks (IJCNN), 2015, IEEE, 2015, pp. 1–9.
- (47) T. Burianek, S. Basterrech, Performance analysis of the activation neuron function in the flexible neural tree model, in: Proceedings of the Dateso 2014 Annual International Workshop on DAtabases, TExts, Specifications and Objects, 2014, pp. 35–46.
- (48) S. Bouaziz, H. Dhahri, A. M. Alimi, A. Abraham, A hybrid learning algorithm for evolving flexible beta basis function neural tree model, Neurocomputing 117 (2013) 107–117.
- (49) S. Bouaziz, A. M. Alimi, A. Abraham, Universal approximation propriety of flexible beta basis function neural tree, in: International Joint Conference on Neural Networks, IEEE, 2014, pp. 573–580.
- (50) C. Micheloni, A. Rani, S. Kumar, G. L. Foresti, A balanced neural tree for pattern classification, Neural Networks 27 (2012) 81–90.
- (51) G. L. Foresti, C. Micheloni, Generalized neural trees for pattern classification, IEEE Transactions on Neural Networks 13 (6) (2002) 1540–1547.
- (52) A. Rani, G. L. Foresti, C. Micheloni, A neural tree for classification using convex objective function, Pattern Recognition Letters 68 (2015) 41–47.
- (53) Q. Shou-ning, L. Zhao-lian, C. Guang-qiang, Z. Bing, W. Su-juan, Modeling of cement decomposing furnace production process based on flexible neural tree, in: Information Management, Innovation Management and Industrial Engineering, Vol. 3, IEEE, 2008, pp. 128–133.
- (54) B. Yang, Y. Chen, M. Jiang, Reverse engineering of gene regulatory networks using flexible neural tree models, Neurocomputing 99 (2013) 458–466.
- (55) Z. Chen, B. Yang, Y. Chen, A. Abraham, C. Grosan, L. Peng, Online hybrid traffic classifier for peer-to-peer systems based on network processors, Applied Soft Computing 9 (2) (2009) 685–694.
- (56) T. Novosad, J. Platos, V. Snásel, A. Abraham, Fast intrusion detection system based on flexible neural tree, in: International Conference on Information Assurance and Security, IEEE, 2010, pp. 106–111.
- (57) Y.-Q. Pan, Y. Liu, Y.-W. Zheng, Face recognition using kernel PCA and hybrid flexible neural tree, in: International Conference on Wavelet Analysis and Pattern Recognition, 2007. ICWAPR’07, Vol. 3, IEEE, 2007, pp. 1361–1366.
- (58) Y. Guo, Q. Wang, S. Huang, A. Abraham, Flexible neural trees for online hand gesture recognition using surface electromyography, Journal of Computers 7 (5) (2012) 1099–1103.
- (59) S. Qu, A. Fu, W. Xu, Controlling shareholders management risk warning based on flexible neural tree, Journal of Computers 6 (11) (2011) 2440–2445.
- (60) A. Rajini, V. K. David, Swarm optimization and flexible neural tree for microarray data classification, in: International Conference on Computational Science, Engineering and Information Technology, ACM, 2012, pp. 261–268.
- (61) S. Abdelwahab, V. K. Ojha, A. Abraham, Ensemble of flexible neural trees for predicting risk in grid computing environment, in: Innovations in Bio-Inspired Computing and Applications, Springer, 2016, pp. 151–161.
- (62) Y. Jin, B. Sendhoff, E. Körner, Evolutionary multi-objective optimization for simultaneous generation of signal-type and symbol-type representations, in: Evolutionary Multi-Criterion Optimization, Vol. 3410 of Lecture Notes in Computer Science, Springer, 2005, pp. 752–766.
- (63) I. Das, J. E. Dennis, A closer look at drawbacks of minimizing weighted sums of objectives for pareto set generation in multicriteria optimization problems, Structural optimization 14 (1) (1997) 63–69.
- (64) A. E. Eiben, J. E. Smith, Introduction to Evolutionary Computing, Springer, 2015.
- (65) D. Karaboga, B. Basturk, A powerful and efficient algorithm for numerical function optimization: Artificial bee colony (ABC) algorithm, Journal of Global Optimization 39 (3) (2007) 459–471.
- (66) Y. Zhang, S. Wang, G. Ji, A comprehensive survey on particle swarm optimization algorithm and its applications, Mathematical Problems in Engineering 2015 (2015) 1–38.
- (67) C.-F. Juang, A hybrid of genetic algorithm and particle swarm optimization for recurrent network design, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 34 (2) (2004) 997–1006.
- (68) W. Wongseree, N. Chaiyaratana, K. Vichittumaros, P. Winichagoon, S. Fucharoen, Thalassaemia classification by neural networks and genetic programming, Information Sciences 177 (3) (2007) 771–786.
- (69) T. Hastie, R. Tibshirani, J. Friedman, The elements of statistical learning, Vol. 2, Springer, 2009.
- (70) R. Polikar, Ensemble based systems in decision making, IEEE Circuits and Systems Magazine 6 (3) (2006) 21–45.
- (71) Z.-H. Zhou, Ensemble methods: Foundations and algorithms, CRC Press, 2012.
- (72) M. Lichman, UCI machine learning repository, http://archive.ics.uci.edu/ml Accessed on: 01.05.2016 (2013).
- (73) J. Alcala-Fdez, L. Sanchez, S. Garcia, M. J. del Jesus, S. Ventura, J. Garrell, J. Otero, C. Romero, J. Bacardit, V. M. Rivas, et al., KEEL: a software tool to assess evolutionary algorithms for data mining problems, Soft Computing 13 (3) (2009) 307–318.
- (74) M. Matsumoto, T. Nishimura, Mersenne twister: A 623-dimensionally equidistributed uniform pseudo-random number generator, ACM Transactions on Modeling and Computer Simulation 8 (1) (1998) 3–30.
- (75) V. K. Ojha, MOGP-FNT multiobjective flexible neural tree tool, http://dap.vsb.cz/aat/ Accessed on: 01.05.2016 (May 2016).
- (76) Y. Chen, A. Abraham, Y. Zhang, et al., Ensemble of flexible neural trees for breast cancer detection, The International Journal of Information Technology and Intelligent Computing 1 (1) (2006) 187–201.
- (77) S. Holm, A simple sequentially rejective multiple test procedure, Scandinavian Journal of Statistics (1979) 65–70.
- (78) M. J. Gacto, M. Galende, R. Alcalá, F. Herrera, METSK-HD: A multiobjective evolutionary algorithm to learn accurate TSK-fuzzy systems in high-dimensional and large-scale regression problems, Information Sciences 276 (2014) 63–79.
- (79) D. H. Wolpert, W. G. Macready, No free lunch theorems for optimization, IEEE Transactions on Evolutionary Computation 1 (1) (1997) 67–82.
- (80) M. Koppen, D. H. Wolpert, W. G. Macready, Remarks on a recent paper on the "no free lunch" theorems, IEEE Transactions on Evolutionary Computation 5 (3) (2001) 295–296.
- (81) P. L’Ecuyer, F. Panneton, Fast random number generators based on linear recurrences modulo 2: Overview and comparison, in: Proceedings of the 2005 Winter Simulation Conference, IEEE, 2005, 10 pp.
- (82) S. Haykin, Neural networks and learning machines, Vol. 3, Pearson Education Upper Saddle River, 2009.
- (83) Z.-H. Zhou, Z.-Q. Chen, Hybrid decision tree, Knowledge-Based Systems 15 (8) (2002) 515–528.
- (84) J.-S. R. Jang, ANFIS: adaptive-network-based fuzzy inference system, IEEE Transactions on Systems, Man and Cybernetics 23 (3) (1993) 665–685.
- (85) O. Cordón, F. Herrera, A two-stage evolutionary process for designing TSK fuzzy rule-based systems, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 29 (6) (1999) 703–715.
- (86) J. S. Rustagi, Optimization techniques in statistics, Academic Press, 1994.
- (87) R. Alcalá, J. Alcalá-Fdez, J. Casillas, O. Cordón, F. Herrera, Local identification of prototypes for genetic learning of accurate TSK fuzzy rule-based systems, International Journal of Intelligent Systems 22 (9) (2007) 909–941.
- (88) K. B. Cho, B. H. Wang, Radial basis function based adaptive fuzzy systems and their applications to system identification and prediction, Fuzzy Sets and Systems 83 (3) (1996) 325–339.
- (89) F. Van den Bergh, A. P. Engelbrecht, A cooperative approach to particle swarm optimization, IEEE Transactions on Evolutionary Computation 8 (3) (2004) 225–239.
- (90) A. M. A. H. Dhahri, F. Karray, Designing beta basis function neural network for optimization using particle swarm optimization, in: IEEE Joint Conference on Neural Network, 2008, pp. 2564–2571.
- (91) C. Aouiti, A. M. Alimi, A. Maalej, A genetic designed beta basis function neural network for approximating multi-variables functions, in: International Conference on Artificial Neural Nets and Genetic Algorithms, Springer, 2001, pp. 383–386.
- (92) C.-F. Juang, C.-M. Hsiao, C.-H. Hsu, Hierarchical cluster-based multispecies particle-swarm optimization for fuzzy-system optimization, IEEE Transactions on Fuzzy Systems 18 (1) (2010) 14–26.
- (93) S. Yilmaz, Y. Oysal, Fuzzy wavelet neural network models for prediction and identification of dynamical systems, IEEE Transactions on Neural Networks 21 (10) (2010) 1599–1609.
- (94) H. Dhahri, A. M. Alimi, A. Abraham, Hierarchical multi-dimensional differential evolution for the design of beta basis function neural network, Neurocomputing 97 (2012) 131–140.
- (95) A. Miranian, M. Abdollahzade, Developing a local least-squares support vector machines-based neuro-fuzzy model for nonlinear and chaotic time series prediction, IEEE Transactions on Neural Networks and Learning Systems 24 (2) (2013) 207–218.
- (96) N. K. Kasabov, Foundations of neural networks, fuzzy systems, and knowledge engineering, MIT Press, 1996.
- (97) N. Kasabov, Evolving fuzzy neural networks for adaptive, on-line intelligent agents and systems, in: Recent Advances in Mechatronics, Springer, Berlin, 1999.
- (98) S. Bouaziz, A. M. Alimi, A. Abraham, Extended immune programming and opposite-based PSO for evolving flexible beta basis function neural tree, in: IEEE International Conference on Cybernetics, IEEE, 2013, pp. 13–18.
Appendix A Dataset Description
Appendix B Algorithms from literature
| Ref. | Algorithm | Definition |
|---|---|---|
| zhou2002hybrid | HDT | Hybrid Decision Tree |
| chen2006ensemble | FNT | Flexible Neural Tree |
| jang1993anfis | ANFIS-SUB | Adaptive Neuro-Fuzzy Inference System Using Subtractive Clustering |
| cordon1999two | TSK-IRL | Genetic Learning of TSK-rules Under Iterative Rule Learning |
| | | Least Mean Squares Linear Regression |
| alcala2007local | LEL-TSK | Local Evolutionary Learning of TSK-rules |
| cho1996radial | RBF | Classical Radial Basis Function |
| van2004cooperative | CPSO | Cooperative Particle Swarm Optimization (PSO) |
| psoBBNN | PSO-BBFN | PSO-based Beta Basis Function Neural Network |
| juang2010hierarchical | HCMSPSO | Hierarchical Cluster-Based Multispecies PSO |
| fwnn | FWNN-M | Fuzzy Wavelet Neural Network Models |
| dhahri2012hierarchical | HMDDE-BBFNN | Hierarchical Multidimensional DE-Based BBFNN |
| miranian2013developing | LNF | Local Least-Squares Support Vector Machines-Based Neuro-Fuzzy Model |
| kasabov1996foundations | BPNN | Back-propagation Neural Network |
| kasabov1999evolving | EFuNNs | Evolving Fuzzy Neural Networks |
| bouaziz2013extended | FBBFNT-EGP&PSO | Extended Immune Programming and Opposite-PSO for Flexible BBFNN |
| gacto2014metsk | METSK-HD | Multiobjective Evolutionary Learning of TSK-rules for High-Dimensional Problems |