A fuzzy inference system (FIS)—composed of a fuzzifier to fuzzify input information, an inference engine to infer information from a rule base (RB), and a defuzzifier to return crisp information—solves a wide range of problems that are ambiguous, uncertain, inaccurate, and noisy. An RB of an FIS is a set of rules of the form IF-THEN, i.e., the antecedent and the consequent form. The Takagi–Sugeno–Kang (TSK) is a widely used FIS model . It embraces the IF-THEN form, where the antecedent part consists of type-1 fuzzy sets (T1FS) and/or type-2 fuzzy sets (T2FS), and the consequent part consists of real values or a linear/nonlinear function.
Type-1 FIS (T1FIS) and type-2 FIS (T2FIS) differ when it comes to the representation of the antecedent part and the consequent part of a rule, and T1FS and T2FS differ in the definitions of their membership functions. Unlike the crisp output of a T1FS membership function (MF) , the output of a T2FS MF is fuzzy in nature . Such nature of the T2FS MFs is advantageous in processing uncertain information more effectively than with T1FS MFs . Hence, a T2FIS can overcome the inability of a T1FIS to fully handle or accommodate the linguistic and numerical uncertainties associated with a changing and dynamic environment .
However, a T2FIS is computationally expensive because it has a larger number of parameters than a T1FIS, and it requires a type-reduction mechanism in its defuzzification part. The interval T2FIS (IT2FIS) reduces the computational cost by employing a simplified T2FS, known as interval T2FS (IT2FS) . An IT2FS MF is bounded by a lower MF (LMF) and an upper MF (UMF), and the area between the LMF and UMF is called the footprint of uncertainty . Then, a type-reducer reduces IT2FS to interval-valued T1FS. Subsequently, the output of IT2FIS is produced by averaging the intervals.
The construction and tuning of the rules are among the vital tasks in the optimization of an FIS, where the rule’s construction is met by combining the fuzzy sets and the rule’s tuning is met by adjusting the MF’s parameters and the consequent part’s parameters. Such a form of rule optimization is often achieved by mapping the rule’s parameters onto a real-valued genetic vector, and it is known as the Michigan Approach
The construction and tuning of the rules are among the vital tasks in the optimization of an FIS, where the rule’s construction is met by combining the fuzzy sets and the rule’s tuning is met by adjusting the MF’s parameters and the consequent part’s parameters. Such a form of rule optimization is often achieved by mapping the rule’s parameters onto a real-valued genetic vector, and it is known as the Michigan Approach. Similarly, the construction/optimization of the RB is met by the genetic selection of the rules at the RB. Such a form of RB optimization is often achieved by mapping the rules onto a binary-valued genetic vector , and it is known as the Pittsburgh Approach .
However, FIS optimization is not limited only to its mapping onto the genetic vector, but a structural/network-like implementation of FIS is often performed . Additionally, TSK-based hierarchical self-organizing learning dynamics have also been proposed . Moreover, several researchers have focused on
the FIS and neural network (NN) integration and its parameter optimization using various learning methods including gradient-decent and the metaheuristic algorithms[11, 12, 13, 14]. The summaries of such optimization paradigms are described as follows:
A self-constructing neural fuzzy inference network (SONFIN), proposed by Juang et al. , is a six layered network structure whose optimization begins with no rule and then rules are incrementally added during the learning process. SONFIN uses a clustering method to partition the input space that governs the number of rules extracted from the data , then the parameters (MF’s arguments) of the determined SONFIN structure are tuned by the backpropagation algorithm.
, then the parameters (MF’s arguments) of the determined SONFIN structure are tuned by the backpropagation algorithm.Later, in , SONFIN’s
concept was extended for the construction of T2FIS, where a self-evolving IT2FIS (SEIT2FNN) that implements a TSK-type FIS model was proposed, and the parameters of the evolved structure were tuned by using the Kalman-filtering algorithm. Additionally, a simplified type-reduction process for SEIT2FNN was proposed in. Like SONFIN, in , a TSK-type FIS model, called a dynamic evolving neural-fuzzy inference system (DENFIS),
was proposed, which evolved incrementally by choosing active rules from a set of rules and employed an evolving clustering method to partition the input space and the least-square estimator to optimize its parameters.
To overcome some limitations of the self-organizing fuzzy NN paradigm, Tung et al.  proposed a self-adaptive fuzzy inference network (SaFIN) that applied a categorical learning induced partitioning algorithm to eliminate two limitations: 1) the need for predefined numbers of fuzzy clusters and 2) the stability–plasticity trade-off that addresses the difficulty in finding a balance between past knowledge and current knowledge during the learning process. SaFIN also employed a rule consistency checking mechanism to avoid inconsistent RB construction. Additionally, the Levenberg-Marquardt method was applied for RB’s parameters tuning. In , to improve the efficiency of IT2FIS, a mutually recurrent interval type-2 neural fuzzy system (MRIT2NFS) was proposed which used weighted feedback loops in the antecedent parts of the formed rules and applied gradient-decent learning and a Kalman-filter algorithm to tune the recurrent weights and the rules’ parameters, respectively. In , a self-evolving T2FIS model was proposed that employed a compensatory operator in the type-2 inference mechanism and a variable-expansive Kalman-filter algorithm for parameter tuning.
Further, a simplified interval type-2 fuzzy NN with a simplified type-reduction process (SIT2FIS) was proposed in , and a growing online self-learning IT2FIS that used the dynamics of a
growing Gaussian mixture model was proposed in. Recently, in , a meta-cognitive interval type-2 neuro FIS (McIT2FIS) was proposed, which employs a self-regulatory meta-cognitive system that extracts the knowledge contained in minimal samples by accepting or discarding data samples based on sample’s contribution to knowledge. For the parameters tuning, McIT2FIS employed the Kalman-filtering algorithm.
However, the self-organizing fuzzy NN paradigm discussed above has to employ a clustering method to partition the input space during the FIS structure’s design. Contrary to this, a hierarchical FIS (HFIS) constructs an FIS by using a hierarchical arrangement of several low-dimensional fuzzy subsystems . Initially, the input variables selection, the levels of hierarchy, and the number of parameters was fully up to the experts to determine. Moreover, HFIS design overcomes the curse of dimensionality , and it possesses a universal approximation ability [27, 28, 29, 30].
Torra et al.  summarized the contributions where the expert’s role in the HFIS design process was minimized/eliminated. For example, in , HFIS was realized as a feedforward network like structure in which the output of the previous layer’s subsystem was only fed to the consequent part of the next layer, and so on. Similarly, in , a two-layered HFIS was developed, where, for each layer, the knowledge bases (KB) were generated by linguistics rule generation method and the KB rules
were selected by genetic algorithm (GA). In, an adaptive fuzzy hierarchical sliding-mode control method was proposed, which was an arrangement of many subsystems, and the top layer accommodated all the subsystems’ outputs. Moreover, in , to optimize the structure of a hierarchical arrangement of low-dimensional TSK-type FISs, probabilistic incremental program evolution  was employed. Similarly, the importance of the hierarchical arrangements of the low-dimensional T2FSs is explained in [5, 37].
For FIS models that have a structural representation (e.g., self-organizing fuzzy NN and HFIS models), multiobjective optimization is inherent since accuracy maximization and complexity minimization are two desirable objectives . Hence, to make trade-offs between interpretability and accuracy, or, in other words, to make trade-offs between approximation error minimization and complexity minimization, a multiobjective orientation of FIS optimization can be used [39, 40, 41]. Complexity minimization can be defined in many ways, such as a reduced number of rules, reduced number of parameters, etc. [42, 41].
Since a single solution may not satisfy both objectives simultaneously, a Pareto-based multiobjective optimization algorithm can be used in FIS optimization, the scope of which spans from the rule selection, to rule mining, rule learning, etc. [43, 44, 45, 46]. Similarly, in [47, 48, 49, 50], simultaneous learning of KB was proposed, which included feature selection, rule complexity minimization together with approximation error minimization, etc.
Moreover, in , a co-evolutionary approach that aimed at combining a multiobjective approach with a single objective approach was presented where, at first, a multiobjective GA determined a Pareto-optimal solution by finding a trade-off between accuracy and rule complexity. Then, a single objective GA was applied to reduce training instances. Such a process was then repeated until a satisfactory solution was obtained. A summary of research works focused on multiobjective optimization of FIS is provided in .
In conclusion, the following are the necessary practices for an FIS model design: 1) input space partitioning; 2) rule formation; 3) rule tuning; 4) FIS structural representation; 5) improving accuracy and minimizing a model’s complexity. Therefore, in this work, a multiobjective optimization of HFIS, called a hierarchical fuzzy inference tree (HFIT), was proposed.
Unlike the self-organizing paradigm that has a network-like structure and uses a clustering algorithm for partitioning of input
space, the proposed HFIT constructs a tree-like structure and uses the dynamics of the evolutionary algorithm for partitioninginput space . The HFIT is analogous to a multi-layered network and automatically partitions input space during the structure optimization phase, i.e., during the tree construction phase. The parameter tuning of the HFIT was performed by the differential evolution (DE) algorithm , which is a metaheuristic algorithm inspired by the dynamics of the evolutionary process. Metaheuristic algorithms, being independent of the problems, solve complex optimization problems. Hence, they are useful in finding the appropriate parameter values for an FIS .
In this work, the proposed HFIT implements a TSK-type FIS for both T1FIS and T2FIS, and HFIT was studied under both single objective and multiobjective optimization orientations. Hence, a total of four versions of HFIT algorithms were proposed: type-1 single objective HFIT (T1HFIT), type-1 multiobjective objective HFIT (T1HFIT), type-2 single objective HFIT (T2HFIT), and type-2 multiobjective objective HFIT (T2HFIT). In the construction of type-2 HFITs, the type-reduction algorithm of the Karnik-Mendel method described in  was used with an improvement in its termination criteria. In summary, the following are the main and novel contributions of this work.
The proposed hierarchical tree-like design (HFIT) forms a natural hierarchical structure by combining several low-dimensional fuzzy subsystems.
MOGP driven optimization provided a trade-off between model’s accuracy and complexity. Moreover, in the obtained tree, each node has a different input’s combination, where the MOGP governs the input’s combination. Hence, HFIT nodes are heterogeneous in nature, which leads to a high diversity among the rules generated by the HFIT. Such a diverse rule generation methods is a distinguished aspect of the proposed HFIT.
A comprehensive theoretical study of HFIT shows that when it comes to the partitioning of input space, membership function design, and even rule formation, it has advantages over network-like layered architecture models, which have to use clustering methods when they do input space partitioning. Clustering methods generate overlapping MFs in fuzzy sets, whereas HFIT’s MOGP driven MFs selection avoid such a overlapping of MFs.
Unlike many models in the literature, HFIT performed an inclusive automatic feature selection, which led to the simplification of the RB in fuzzy subsystems and incorporated only relevant knowledge contained in the dataset into HFIT’s structural representation.
A comprehensive performance comparison of the proposed four versions of the HFIT algorithms both in theoretical and empirical sense with the recently proposed FIS algorithms found in the literature suggests that HFIT design offers a high approximation ability with simple model complexity.
The structure of this article is as follows: Section II provides an introduction to T1FIS and T2FIS; Section III describes the proposed multiobjective strategy for developing HFIT and its parameter optimization; Section IV provides a comprehensive theoretical evaluation of HFIT; Section V provides a detailed description of parameter setting and a comprehensive empirical evaluation the proposed HFIT compared with the algorithms reported in the literature; finally, the obtained results are discussed in Section VI followed by a concise conclusion in Section VII.
Ii TSK Fuzzy Inference Systems
Ii-a Type-1 Fuzzy Inference Systems
A TSK-type FIS is governed by the IF–THEN rules of the form :
where is the -th rule in an FIS, are the T1FSs, is a function of an input vector that returns a crisp output , and is the total number of the inputs presented to the -th rule. Note that the number of inputs may vary from rule-to-rule. Hence, the dimension of inputs in a rule is denoted as . In TSK, the function is usually expressed as:
where for = to is the free parameters in the consequent part of a rule. The defuzzified crisp output of FIS is computed as follows: First, the inference engine fires up the RB rules. The firing strength of the -th rule is computed as:
where is the value of -th T1FS MF at the -th rule. Then, the defuzzified output of an FIS is computed as:
where is the total rules in the RB. In this work, as shown in Fig. 1(a), the T1FS was of the form:
where and are the center and the width of MF , respectively.
Ii-B Type-2 Fuzzy Inference Systems
A T2FS is characterized by a 3-dimensional (3-D) MF . The three axes of T2FS are defined as follows. The x-axis is called the primary variable, the y-axis is called the secondary variable (or primary MF, which is denoted by ), and the z-axis is called the MF value (or secondary MF value), which is denoted by . Hence, in a universal set , a T2FS has the form:
where the MF value has a 2-dimensional support called the footprint of uncertainty of , which is bounded by an LMF and a UMF (Fig. 1(b)). A Gaussian function, with an uncertain mean within, is an IT2FS MF (Fig. 1(b)), which is written as:
In this work, the LMF was defined as :
and the UMF was defined as :
In Fig. 1(b), the point along the x-axis of 3-D IT2FS MF cuts the LMF and UMF along the y-axis, and the value of the IT2FS is considered to be along the z-axis (not shown in Fig. 1(b)) are and . Considering IT2FS MFs, -th IF–THEN rule of type-2 TSK-FIS for an input vector takes the following form:
where are the T2FSs, is a function of x that returns a pair called the left and right weights of the consequent part of the -th rule. In TSK, is usually written as:
where for = to is the free parameter in the consequent part of a rule and for = to are the deviation factors of the free parameters. The firing strength of IT2FS is computed as:
At this stage, inference engine fires up the rule and the type-reducer reduces the IT2FS to T1FS. In this work, the center of set type-reducer , prescribed in , was used:
where and are the left and the right end of the interval. For the ascending order of and , and are computed as:
where and are the switch point, determined by
respectively. The defuzzified crisp output is then computed as:
Iii Multiobjective Optimization of Hierarchical Fuzzy Inference Trees
Iii-a Hierarchical Tree Formation
A hierarchical fuzzy inference tree (HFIT) is a tree-based system. Its hierarchical structure is analogous to a multilayer feedforward NN, where the nodes (the low-dimensional FISs) are connected using weighted links. The concept of forming a hierarchical fuzzy inference tree is inherited from the flexible neural tree proposed by Chen et al. , which has two learning phases. First, in the tree construction phase, an evolutionary algorithm is employed to construct/optimize a tree-like structure. Second, in the parameter tuning phase, a genotype representing the underlying parameters of the tree structure is optimized by using parameter optimization algorithms.
To create an optimum tree based model, firstly, a population of randomly generated trees is formed. Once a satisfactory tree structure (a tree with a small approximation error and low complexity) is obtained using an evolutionary algorithm, the parameter tuning phase optimizes its parameters. The phases are repeated until a satisfactory solution is obtained. Fig. 2 is a clear representation of HFIT’s two-phase construction approach.
Iii-B Tree Encoding
An HFIT is a collection of nodes and terminal nodes :
where denotes non-leaf instruction and has arguments. The leaf node’s instruction takes no argument and represents the input variable/instruction. A typical HFIT is shown in Fig. 3(a); whereas, Fig. 3(b) illustrates an HFIT’s node that takes inputs. The inputs for to a node is either from the input layer or from other nodes in HFIT. Each node in an HFIT receives a weighted input , where is the weight. In this work, however, the weights in HFIT were set to 1 because the objective of this work was also to reduce the complexity of the produced tree along with approximation error. Setting weights to 1 also allow raw input to be fed to the fuzzy sets.
Iii-C Rule Formation at the Nodes
Iii-C1 Rules for Type-1 FIS Node
Each node in an HFIT is an FIS of either type-1 or type-2. Hence, the rules at a node were created as follows: Considering a reference to the node from Fig. 3(a) that has two arguments/inputs and and assuming that each input and has two T1FSs and , respectively, the rules for T1FIS are generated as:
The consequent part of the rules at the node is computed by using (2). Finally, the output of node is computed as:
where the firing strength is computed as:
Similar to node , node has two inputs and, if each input at node is partitioned into two T1FSs, then the output of node is computed in a similar way to how the output of the node is computed.
The output of the HFIT shown in Fig. 3(a) is computed from node , which revives inputs and and , where and are the outputs of nodes and , respectively. Therefore, the rules at node , considering each input , , and has two T1FSs , , and respectively, is represented as:
Output of node , which is also the output of the tree (Fig. 3(a)), is computed as:
where the consequent part is computed using (2) and the firing strength is computed as:
Iii-C2 Rules for Type-2 FIS Node
If the nodes of the HFIT in Fig. 3(a) are type-2 nodes, then, assuming that node has two T2FSs and , respectively, the rules for T2FIS at node are generated as:
and the lower and upper firing strengths and at node are computed as:
Then, the left and right weights and of the consequent part of the rules are produced by using (11). Thereafter, the type-reduction of the node is performed as described in , where the left and right intervals and are computed as per (14) and (15). During type-reduction , an early stopping mechanism was adopted to reduce computation time. Finally, output of node is computed as .
The output computation at node of the tree in Fig. 3(a) is similar to that of the output computation of node because, at node , there are two inputs and each of these are partitioned into two T2FSs.
The output of the type-2 HFIT shown in Fig. 3(a) is computed from node , which receives inputs and and , where and are the outputs of nodes and , respectively. Therefore, the rules at node , considering each input , , and has two T2FSs , , and respectively, are represented as:
The lower and upper firing strengths and at node are computed as:
After computing the firing strengths and the left and right weights and of the rules, the type-reduction at the node is performed by using (13), where the left and right intervals and are computed as per (14) and (15). Output of node , which is also the output of the tree, is computed by averaging and as .
Iii-D Structure Tuning (Pareto-based Multiobjective Optimization)
Usually, a learning algorithm owns a single objective (approximation error minimization) that is often achieved by minimizing the root mean squared error (RMSE) on the learning data:
where and are the desired and the model’s outputs, respectively, and is the number of data pairs in the training set. However, a single objective comes at the expense of a model’s complexity or the generalization ability on unseen data. The generalization ability broadly depends on the model’s complexity (e.g., the number of parameters in the model) . The minimization of the approximation error and the number of free parameters are conflicting objectives. Hence, a Pareto-based multiobjective optimization can be applied to obtained a Pareto set of nondominated solutions, in which no one objective function can be improved without a simultaneous detriment to at least one of the other objectives of the solution 
Therefore, an HFIT that offers the lowest approximation error and simplest structure is the most desirable one. To obtain such a set of Pareto-optimal (nondominated) solutions, a nondominated sorting based MOGP was applied.
The proposed MOGP acquires the nondominated sorting algorithm  for computing Pareto-optimal solutions from an initial population of fuzzy inference trees. The individuals in MOGP were sorted according to their dominance in population. Moreover, individuals were sorted according to the rank/Pareto-front/line. MOGP is an elitist algorithm that allows the best individuals to propagate into the next generation. Diversity in population was maintained by measuring the crowding distance among the individuals .
A detailed description of MOGP algorithm is as follows:
Iii-D1 Initial Population
Two fitness measures were considered: approximation error minimization and parameter count minimization. To simultaneously optimize these objectives during the structure-tuning phase, an initial population of randomly generated HFITs was formed and sorted according to their nondominance.
In selection operation, a mating pool of was obtained using binary tournament selection that selects two candidates randomly at a time from a population , and the best solution (according to its rank and crowding distance) is copied into the mating pool . This process is continued until the mating pool becomes full.
An offspring population was generated using the individuals of the mating pool . Two distinct individuals (parents) were randomly selected from the mating pool to create new individuals using the genetic operators crossover and mutation.
In crossover operation, randomly selected sub-trees of two parent trees are swapped (Fig. 4(a)). The swapping includes the exchange of nodes. A detailed description of the crossover operation in genetic programming is available in [59, 60]. The crossover operation is selected with the
Replacing a randomly selected terminal with a newly generated terminal for .
Replacing all terminal nodes of an HFIT with a new set of terminal nodes derived from .
Replacing a randomly selected FIS node with a newly generated FIS node for .
Replacing a randomly selected terminal node with a newly created FIS node .
Deleting a randomly selected terminal node or deleting a randomly selected FIS node .
The mutation operation was selected with the probability , and the type of mutation operator (a or b or c or d or e) was chosen randomly during the mutation operation (Fig. 4(b)).
The offspring population and the main population were mixed together to make a combined population .
In this work, elitism was decided according to the rank (based on both RMSE and the model’s complexity) of the individuals (HFITs) in the population. Therefore, in this step, worst (poorer rank) individuals were weeded out from the combined population . In other words, best individuals are propagated into the new generation as the main population .
Iii-E Parameter Tuning
In the structure tuning phase, an optimum phenotype (HFIT) was derived with the parameters being initially fixed by random guesswork. Hence, the obtained phenotype was further tuned in the parameter tuning phase by using a parameter optimization algorithm. To tune the parameters of the derived phenotype, its parameters were mapped onto a genotype, i.e., onto a real vector, called a solution vector. The selection of the best phenotype in a single objective training was solely based on a comparison of the RMSEs. However, selecting a solution in a multiobjective training is a difficult choice. In this work, after the multiobjective training of HFIT, the best solution for parameter tuning was picked from the Pareto front. Strictly, the solution that gave the best RMSE among the solutions marked rank-one in the Pareto-optimal set was chosen. Fig. 5 is an illustration of the solutions that belong to the Pareto-front.
The genotype mapping of a T1FIS and a T2FIS differ only in regard to their number of parameters. In HFIT, a T1FIS uses the MF mentioned in (5), which has two arguments and and each rule in T1FIS has variables in the consequent part as referred to in (2), where is the number of inputs to the -th rule. On the other hand, a T2FIS uses IT2FSs, which are bounded by LMFs and UMFs (Fig. 1(b)) and have two Gaussian means and
and a varianceto be optimized. The Gaussian means and for type-2 Gaussian MF (7) were defined as:
whereis the center of the Gaussian means and taken from [0, 1]. Similarly, the variance of type-2 Gaussian MF (7) was taken from [0, 1]. The consequent part of the T2FIS was computed according to (11), which led to variables.
Assume that an HFIT (a tree like Fig. 3(a)) has many nodes and each node in the phenotype takes inputs, where each input is partitioned into two fuzzy sets (MFs). Then, the number of the fuzzy sets at a node is . Since the number of inputs at a node is and each input is partitioned into two fuzzy sets, the number of rules at a node is . Hence, the number of parameters at a T1FIS node is and the number of parameters at a T2FIS node is . Therefore, the total number of parameters in an HFIT is the summation of the number of parameters at all nodes in the tree. For example, the number of parameters in the type-1 HFIT and type-2 HFIT shown in Fig. 3(a) are 84 and 154 respectively.
Assuming is the total number of parameters in a tree, the genotype or the solution vector w corresponding to the tree (phenotype) is expressed as:
Now, to optimize parameter vector w, a parameter optimizer can be used: genetic algorithms , evolution strategy , artificial bee colony , PSO , DE , gradient-based algorithms , backpropagation , Klaman-filter , etc.
In this work, the differential evolution (DE) version “DE/rand-to-best/1/bin”  was used, which is a metaheuristic algorithm that uses a crossover operator inspired by the dynamics of “natural selection.” The basic principle of the DE is as follows: First, an initial population matrix at the iteration is randomly initialized. The population contains many solution vectors. A solution vector w in the population is an -dimensional vector representing the free parameters of an HFIT. Secondly, the population is created using binomial trials. Hence, to create a new solution vector for the population , three distinct solution vectors , and and the best solution vector are selected from the population . Then, for a random index and for the selected trial vector , the -th variable of the modified trial vector are created as:
where is a uniform random sample, is the crossover rate, and is the differential weight. Similarly, all the variables of the trial vector is created using (28). After creation of the modified trial vector , it is recombined as:
where is the function that returns the fitness of a solution vector using (26). In DE , operators, such as selection, crossover, and recombination were repeated until a satisfactory solution vector was found or no improvement was observed compared to an obtained solution over a fixed period (100 DE iterations).
Iv Theoretical Evaluation
Efficiency of the proposed HFIT comes from a combined influence of three basic operations involved in the model’s development: tree construction through MOGP, combining several low-dimensional fuzzy systems in a hierarchical manner, and parameters tuning through differential evolution (DE). Hence, HFIT bears many distinguished properties that define its prediction efficiency compared to many models invoked from literature for comparison. Following are the HFIT’s properties: 1) Convergence ability of the evolutionary class algorithms (EA) or for that matter MOGP. 2) Approximation ability of the evolved hierarchical fuzzy system (tree model). 3) Convergence ability of DE in tree’s parameters tuning. Subsequent discussions theoretically analyze each of these properties one-by-one.
Iv-a Optimal tree structure through MOGP convergence
Evaluating the convergence of evolutionary class algorithms has been a challenging task because of their stochastic nature. Theoretical studies of EAs performed through various perspectives show that indeed an optimal solution is possible in a finite time. Initially, Goldberg and Sergret 
showed convergence property of GA using a finite Markov chain analysis, where they considered GA with a finite population and recombination and mutation operators.
A different viewpoint of MOGP convergence (EAs in general) can be referred to as by using Banach fixpoint theorem described in . Banach fixpoint theorem  states that on a metric space a constructive mapping has a unique fixpoint, i.e., for an element , . Therefore, Banach fixpoint theorem can explain MOGP convergence with only assumption that there should be an improvement of the population (not necessarily of the optimal solution) from one generation to another. Banach fixpoint theorem also indicates that if MOGP semantics is to be considered as a transformation between one population to another and if it is possible to obtain a metric space in which transformation is constructive, then MOGP converges to a optimal population , i.e., to a population containing optimal solution.
defined on elements of ordered pair setis constructive if the distance between and is less than and for any . Now, distance mapping is a metric space iff for any the following condition satisfy:
Let be a metric space and be a mapping, then is constructive iff there is a constant such that for all
Therefore, for Banach theorem’s formulation, the completeness of the metric space needs to be defined. Now, metric space elements are a Cauchy sequence iff for any , there exist such that for all . It also follows that, if such Cauchy sequence has a limit , then metric space is complete.
For a complete metric space and constructive mapping , mapping has a unique fixpoint such that for any
A proof of Banach theorem can be found in [70, p. 60] described as method of successive approximation. ∎
In this article, it is necessary to show that if a metric space for MOGP population can be obtained, then any constructive mapping in MOGP will contain a unique fixpoint. The proposed MOGP has a fixed population size (say ), i.e., each population contain individuals, and in each generation, the total fitness of the population is expected to increase. Let be a function that computes the fitness of a population, which is expressed as:
where function evaluates RMSE of each . Now, distance mapping , where is a set of MOGP populations, can be defined as:
It follows that
and if holds for any population and in MOGP.
is obvious and
Therefore, MOGP has a metric space . Now, it only remains to show that the MOGP follows a constructive mapping , i.e., in each subsequent generation of MOGP, an improvement is possible. Altenberg  showed that by maintaining genetic operators, such as selection, crossover, and mutation, the evolvability of genetic programming can be increased. Additionally, Altenberg  analyzed the probability of a population containing fitter individuals than the previous population and offered the subsequent proof. It was observed that even for a random crossover operation, genetic programming evolvability can be ensured. It is then necessary to say that, indeed an MOGP can produce fitter population than the previous ones.
Let’s depart from MOGP operations descriptions to continue with Banach theorem since it is now known that MOGP offers constructive mapping , for which -th iteration population offers constructive mapping. In other words, , i.e., mapping holds. It follows that
Moreover, it satisfies Banach fixpoint theorem. Hence,
It indicates that MOGP converges to a population , which is a unique fixpoint in a population space.
It is evident from MOGP operation that it produces an optimal tree structure from a population space. Although obtaining optimality in the tree design using MOGP is sufficient to claim the formation of a function that can approximate to a high degree of accuracy, it is necessary to investigate the approximation capability of the hierarchical fuzzy system developed in the form of a tree structure.
Iv-B Approximation ability of hierarchical fuzzy inference tree
This Section describes the approximation capability of an HFIT, which is a result of MOGP operation. Theoretical studies of special cases of the hierarchical fuzzy systems are provided in [27, 32]. Whereas, the proposed HFIT produces a general hierarchical fuzzy system. In HFIT, not only a cascaded hierarchy of fuzzy system (a fuzzy subsystem takes input only from its previous fuzzy subsystem ) can be produced, but a general hierarchical fuzzy system, in which a fuzzy subsystem can take inputs from any previous layer fuzzy subsystem, can be produced. A hierarchical fuzzy system described in  resembles the hierarchical fuzzy system produced by HFIT. To show the approximation capability of the proposed HFIT, it requires coming to the conclusion that the proposed HFIT is analogous to the hierarchical fuzzy system described by Zeng and Keane .
Let’s perform an analogy between the proposed HFIT and the concept of a natural hierarchical fuzzy system described by Zeng and Keane . To show such an analogy, at first, it needs to establish the definition of the natural hierarchical structure of a continuous function, then it will be necessary to show that, for any such continuous function, a hierarchical fuzzy system exists.
Let’s take the example of the HFIT shown in Fig. 3(a), which can be represented as natural hierarchical structure of a continuous function. The tree in Fig. 3(a) gives the output from node . Moreover, the tree in Fig. 3(a) gives the following functions:
It can also be expressed as:
It follows that, for a given function , if there exist functions such that function (34) can be obtained, then function can be represented as hierarchical structure.
For simplicity’s sake, let’s take the case of a two-stage tree, where the top layer node is denoted by and its output is by . Similarly, second layer nodes are denoted by and their outputs are by , for . Therefore, a natural hierarchical structure can be defined as:
Definition 1 (Natural Hierarchical Structure).
Let be a multi-input-single-output continuous function with input variables defined on input space and the output defined on the output space . If there exist continuous functions
and the functions have inputs and , where and are input dimensions at the top and second stage of hierarchy, respectively, such that
then is a continuous function with natural hierarchal structure.
Such form of natural hierarchical structure also possesses separable or arbitrarily separable hierarchical structural property, i.e., the individual functions can be decomposed . Now, from Kolmogorov’s Theorem , the following can be stated: Any continuous function on ( and define the input range) can be represented as a sum of continuous functions with an arbitrarily separable hierarchical structure. This statement concludes to the following theorem.
Let be any continuous function on and its hierarchical structure representation be , in which are continuous functions with natural hierarchical structure, then for any given , there exists a hierarchical fuzzy system
which has the same natural hierarchical structure as such that
and the same holds between the sub-functions of and the fuzzy subsystems of .
It is to note that Theorem 2 shows that hierarchical structure of fuzzy systems is universal approximators. Therefore, they can approximate any continuous function to any degree of accuracy as-well-as they can approximate each component of that function. Hence, the proposed HFIT that can form a natural hierarchical structure can achieve universal structure approximation.
Another property of the proposed HFIT is the parameter tuning, which is performed by a global optimizer (e.g., DE was applied in this research). Hence, it is required to investigate the convergence ability of the DE algorithm in parameter tuning of HFIT.
Iv-C Optimal parameter through differential evolution convergence
Convergence property and efficiency of DE is well studied [73, 74]. A probabilistic viewpoint of DE convergence followed by a description of global convergence condition for DE is described in . They show that indeed DE converges to an optimal solution. Similarly, Zhang and Sanderson  studied the various property of DE, such as mutation, crossover and recombination operators that influence the DE convergence. DE follows a similar property as of EA class algorithms described in Section IV-A. Hence, its global convergence ability is not different than the one described for MOGP, and indeed it finds an optimal parameter vector for HFIT.
Iv-D Comparative study of HFIT with other models
The proposed HFIT learns knowledge contained in the training data through adaptation in its structure and the rules generated at its nodes. Such a process of learning/acquiring knowledge from data is somehow similar to the models having network-like layered architecture, i.e., ANFIS-like appr