I Introduction
Fuzzy systems have achieved great success in numerous applications [1, 2]. As shown in Fig. 1, a fuzzy system consists of four components: fuzzifier, rulebase, inference engine, and defuzzifier. The fuzzifier maps each crisp input into a fuzzy set; the inference engine performs inference on these fuzzy sets, using the rulebase, to obtain another fuzzy set; and the defuzzifier converts the inferred fuzzy set into a crisp output.
There are two kinds of rules for a fuzzy system: Zadeh rules [3], whose consequents are fuzzy sets, and Takagi-Sugeno-Kang (TSK) rules [4], whose consequents are functions of the inputs. Zadeh rules were the earliest rules proposed. However, TSK rules are much more popular in practice, due to their simplicity and flexibility. This paper mainly considers TSK fuzzy systems for regression.
As an example, a TSK fuzzy system with two inputs ($x_1$ and $x_2$), two membership functions (MFs) for each input, and one output ($y$), has the following rulebase:

Rule 1: IF $x_1$ is $X_{1,1}$ and $x_2$ is $X_{2,1}$, THEN $y_1(x) = a_{1,0} + a_{1,1}x_1 + a_{1,2}x_2$
Rule 2: IF $x_1$ is $X_{1,1}$ and $x_2$ is $X_{2,2}$, THEN $y_2(x) = a_{2,0} + a_{2,1}x_1 + a_{2,2}x_2$
Rule 3: IF $x_1$ is $X_{1,2}$ and $x_2$ is $X_{2,1}$, THEN $y_3(x) = a_{3,0} + a_{3,1}x_1 + a_{3,2}x_2$
Rule 4: IF $x_1$ is $X_{1,2}$ and $x_2$ is $X_{2,2}$, THEN $y_4(x) = a_{4,0} + a_{4,1}x_1 + a_{4,2}x_2$

where $X_{m,i}$ ($m=1,2$; $i=1,2$) are fuzzy sets for $x_m$, and $a_{k,0}$, $a_{k,1}$ and $a_{k,2}$ ($k=1,...,4$) are adjustable regression coefficients.

For a particular input $x=(x_1,x_2)$, the membership grade of $x_m$ on $X_{m,i}$ is $\mu_{X_{m,i}}(x_m)$, and the firing levels of the rules are:

$f_1(x) = \mu_{X_{1,1}}(x_1)\,\mu_{X_{2,1}}(x_2)$
$f_2(x) = \mu_{X_{1,1}}(x_1)\,\mu_{X_{2,2}}(x_2)$
$f_3(x) = \mu_{X_{1,2}}(x_1)\,\mu_{X_{2,1}}(x_2)$
$f_4(x) = \mu_{X_{1,2}}(x_1)\,\mu_{X_{2,2}}(x_2)$

The output of the TSK fuzzy system is:

$y(x) = \frac{\sum_{k=1}^{4} f_k(x)\, y_k(x)}{\sum_{k=1}^{4} f_k(x)}$ (1)

Or, if we define the normalized firing levels as:

$\bar{f}_k(x) = \frac{f_k(x)}{\sum_{l=1}^{4} f_l(x)}$ (2)

then (1) can be rewritten as:

$y(x) = \sum_{k=1}^{4} \bar{f}_k(x)\, y_k(x)$ (3)
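As a concrete illustration, the rulebase and Eqs. (1)-(3) can be sketched in a few lines of Python. This is only a sketch: Gaussian MFs are assumed, and all MF centers and regression coefficients below are hypothetical values chosen for the example.

```python
import numpy as np

def gauss(x, c, sigma):
    """Gaussian membership function."""
    return np.exp(-(x - c) ** 2 / (2 * sigma ** 2))

# Hypothetical parameters: MF centers (std = 1) and rule coefficients a_{k,m}.
C1 = [-1.0, 1.0]   # centers of X_{1,1}, X_{1,2}
C2 = [-1.0, 1.0]   # centers of X_{2,1}, X_{2,2}
A = np.array([[0.1, 1.0, -0.5],
              [0.3, -0.2, 0.8],
              [-0.4, 0.5, 0.5],
              [0.2, 0.9, -0.1]])

def tsk_output(x1, x2):
    # Firing levels: product of the two antecedent membership grades per rule.
    f = np.array([gauss(x1, c1, 1.0) * gauss(x2, c2, 1.0)
                  for c1 in C1 for c2 in C2])
    fbar = f / f.sum()                # normalized firing levels, Eq. (2)
    yk = A @ np.array([1.0, x1, x2])  # rule consequents y_k(x)
    return fbar @ yk                  # Eq. (3)
```

Since the normalized firing levels sum to one, the output is always a convex combination of the rule consequents.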
This paper gives a comprehensive overview of the functional equivalence^1 of TSK fuzzy systems to four classical machine learning algorithms: neural networks [7], mixture of experts (ME) [8], classification and regression trees (CART) [9], and stacking ensemble regression [10].

^1 It has been shown that many machine learning algorithms, e.g., fuzzy systems and neural networks, are universal approximators [5, 6]. However, the fact that two algorithms are both universal approximators does not mean that they are functionally equivalent: universal approximation usually requires a very large number of nodes or parameters, so it is theoretically important, but may not be useful in real-world algorithm design. By functional equivalence, we emphasize that two algorithms can implement exactly the same function with a relatively small number of parameters.

Although a few publications on the connections of TSK fuzzy systems to some of these approaches are scattered in the literature, to our knowledge no one has put everything together in one place, so that the reader can easily see the big picture and get inspired. Moreover, we also discuss some promising hybridizations between TSK fuzzy systems and each of the four algorithms, which could be interesting new research directions. For example:

By making use of the functional equivalence between TSK fuzzy systems and some neural networks, we can design more efficient training algorithms for TSK fuzzy systems.

By making use of the functional equivalence between TSK fuzzy systems and ME, we may be able to achieve a better tradeoff between cooperation and competition among the rules of a TSK fuzzy system.

By making use of the functional equivalence between TSK fuzzy systems and CART, we can better initialize a TSK fuzzy system for high-dimensional problems.

Inspired by the connections between TSK fuzzy systems and stacking ensemble regression, we may be able to design better stacking models, and increase the robustness of a TSK fuzzy model.
II TSK Fuzzy Systems and Neural Networks
Neural networks have a longer history^2 than fuzzy systems, and are now at the center stage of machine learning, because of the booming of deep learning [11]. Researchers started to discover in the early 1990s that a TSK fuzzy system can be represented similarly to a neural network [12, 13, 14, 15, 16], so that a neural network learning algorithm, such as back-propagation [7], can be used to train it. Such fuzzy systems are called neuro-fuzzy systems in the literature [2].

^2 https://cs.stanford.edu/people/eroberts/courses/soco/projects/neuralnetworks/History/history1.html
II-A ANFIS
Some neuro-fuzzy systems resemble the structure of the 3-layer multi-layer perceptron (MLP) [7] in Fig. 2. The first layer represents the inputs, the middle (hidden) layer represents the fuzzy rules, and the third layer represents the outputs. Fuzzy sets are encoded as connection weights.

Among the many variants of neuro-fuzzy systems, the most popular one may be the adaptive-network-based fuzzy inference system (ANFIS) [13], which has been cited over 14,600 times on Google Scholar, and is implemented in the Matlab Fuzzy Logic Toolbox. The ANFIS structure of the two-input one-output TSK fuzzy system introduced in the Introduction is shown in Fig. 3. It has five layers:

Layer 1: The membership grade of $x_m$ on $X_{m,i}$ ($m=1,2$; $i=1,2$) is computed.

Layer 2: The firing level of each rule is computed, by multiplying the membership grades of the corresponding rule antecedents.

Layer 3: The normalized firing levels of the rules are computed, using (2).

Layer 4: Each normalized firing level is multiplied by its corresponding rule consequent.

Layer 5: The output is computed by (3).
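The five layers above map directly onto a forward pass. Below is a minimal sketch, assuming Gaussian MFs and treating all MF centers, the standard deviation, and the rule coefficients as given (hypothetical) parameters:

```python
import numpy as np

def anfis_forward(x1, x2, centers1, centers2, sigma, A):
    """One forward pass through the five ANFIS layers (Gaussian MFs assumed).
    centers1/centers2: MF centers for x1/x2; A: (4, 3) rule coefficients."""
    # Layer 1: membership grades of x1 on X_{1,i} and of x2 on X_{2,j}.
    mu1 = np.exp(-(x1 - centers1) ** 2 / (2 * sigma ** 2))
    mu2 = np.exp(-(x2 - centers2) ** 2 / (2 * sigma ** 2))
    # Layer 2: firing level of each rule (product of antecedent grades).
    f = np.outer(mu1, mu2).ravel()
    # Layer 3: normalized firing levels, Eq. (2).
    fbar = f / f.sum()
    # Layer 4: each normalized firing level times its rule consequent.
    weighted = fbar * (A @ np.array([1.0, x1, x2]))
    # Layer 5: the overall output, Eq. (3).
    return weighted.sum()
```

Note that Layer 4 needs the raw input again to evaluate the consequents, which is the skip-connection-like structure discussed below.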
All parameters of the ANFIS, i.e., the shapes of the MFs and the rule consequents, can be trained by a gradient descent algorithm [13]. Or, to speed up the training, the antecedent parameters can be tuned by gradient descent, and the consequent parameters by least squares estimation [13].

However, it should be noted that ANFIS is fundamentally different from the MLP in several aspects:

The MLP always uses fully connected layers (also called dense layers in deep learning), whereas the ANFIS does not. For example, in ANFIS, the inputs are selectively connected to the nodes in Layer 1, and the nodes in Layer 1 are also selectively connected to those in Layer 2, to reflect the structure of the rule antecedents.

In the MLP, the output of a node in the hidden and output layers is always computed as a weighted sum followed by an activation function, whereas an ANFIS uses many different operations. For example, Layer 1 computes the membership grade of an input on the corresponding MF, and Layer 2 uses direct multiplication.

Layer 4 of the ANFIS also uses the input $x$ directly (to compute the rule consequents), which resembles a skip connection in deep learning; an MLP usually does not have such connections.

The MLP is a black-box model, whereas the ANFIS can be expressed by IF-THEN rules, which are easier to interpret and understand.
II-B Functional Equivalence between TSK Fuzzy Systems and Radial Basis Function Networks (RBFN)
Although an MLP is different from a TSK fuzzy system, there is a variant of neural networks, the radial basis function network (RBFN) [17], that is functionally equivalent to a TSK fuzzy system under certain constraints.

An RBFN [17] uses local receptive fields, inspired by biological receptive fields, for function mapping. Its diagram is shown in Fig. 4. For input $x=(x_1,x_2)$, the output of the $k$th ($k=1,...,K$) receptive field unit, using a Gaussian response function, is:

$f_k(x) = \exp\left(-\frac{(x_1-c_{k,1})^2+(x_2-c_{k,2})^2}{2\sigma_k^2}\right)$ (4)

where $c_{k,1}$ and $c_{k,2}$ are the centers of the Gaussian functions for $x_1$ and $x_2$, respectively, and $\sigma_k$ is the common standard deviation of the Gaussian functions.
With the addition of lateral connections (not shown in Fig. 4) between the receptive field units, the output of the RBFN is:

$y(x) = \frac{\sum_{k=1}^{K} f_k(x)\, w_k}{\sum_{k=1}^{K} f_k(x)}$ (5)

where $w_k$ is a constant output associated with the $k$th receptive field unit^3.

^3 There is a related machine learning approach called local model networks [18], which can be viewed as a decomposition of a complex nonlinear system into a set of locally accurate sub-models, smoothly integrated by their associated basis functions. It replaces the constant output of each receptive field unit in an RBFN by a function of the inputs.
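A minimal sketch of (4) and (5), with all centers, standard deviations, and unit outputs supplied by the caller (so nothing here is tied to a particular dataset):

```python
import numpy as np

def rbfn_output(x, centers, sigmas, w):
    """Normalized RBFN output, Eq. (5). centers: (K, 2) Gaussian centers;
    sigmas: (K,) per-unit standard deviations; w: (K,) constant unit outputs."""
    d2 = ((x - centers) ** 2).sum(axis=1)   # squared distance to each center
    f = np.exp(-d2 / (2 * sigmas ** 2))     # receptive field responses, Eq. (4)
    return f @ w / f.sum()                  # weighted average of unit outputs
```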
Jang and Sun [19] have shown that a TSK fuzzy system [see (3)] is functionally equivalent to an RBFN [see (5)], if the following constraints are satisfied:

The number of receptive field units equals the number of fuzzy rules.

The output of each fuzzy rule is a constant, instead of a function of the inputs.

The antecedent MFs of each fuzzy rule are Gaussian functions with the same variance.

The product norm is used to compute the firing level of each rule.

The fuzzy system and the RBFN use the same method (i.e., either weighted average or weighted sum) to compute the final output.
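Under these constraints the equivalence can also be checked numerically. The sketch below builds both sides with the same randomly chosen (hypothetical) parameters and verifies that the outputs coincide; the key observation is that a product of per-input Gaussians with a shared standard deviation equals one multivariate Gaussian response:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4
centers = rng.normal(size=(K, 2))   # one Gaussian center per rule / unit
sigma = 0.8                         # shared standard deviation (constraint 3)
w = rng.normal(size=K)              # constant consequents / unit outputs
x = np.array([0.3, -0.7])

# TSK side: product t-norm over per-input Gaussian membership grades.
mu = np.exp(-(x - centers) ** 2 / (2 * sigma ** 2))   # (K, 2) grades
f_tsk = mu.prod(axis=1)
y_tsk = f_tsk @ w / f_tsk.sum()

# RBFN side: one multivariate Gaussian response per receptive field unit.
d2 = ((x - centers) ** 2).sum(axis=1)
f_rbf = np.exp(-d2 / (2 * sigma ** 2))
y_rbf = f_rbf @ w / f_rbf.sum()

assert np.isclose(y_tsk, y_rbf)   # identical under the five constraints
```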
Hunt et al. [20] proposed a generalized RBFN, which has the following main features, compared with the standard RBFN above:

A receptive field unit may be connected to only a subset of the inputs, instead of all inputs as in the standard RBFN.

The output associated with each receptive field unit can be a linear or nonlinear function of the inputs, instead of a constant as in the standard RBFN.

The Gaussian response functions of the receptive field units can have different variances for different inputs, instead of an identical variance as in the standard RBFN.
Then, the generalized RBFN is functionally equivalent to a TSK fuzzy system, under the following constraints [20]:

The number of receptive field units equals the number of fuzzy rules.

The antecedent MFs of each fuzzy rule are Gaussian.

The product norm is used to compute the firing level of each rule.

The fuzzy system and the RBFN use the same method (i.e., either weighted average or weighted sum) to compute the final output.
II-C Discussions and Future Research
As ANFIS is an efficient and popular training algorithm for type-1 TSK fuzzy systems, it is natural to consider whether it can also be used for interval and general type-2 fuzzy systems [1], which have demonstrated better performance than type-1 fuzzy systems in many applications. There has been limited research in this direction [21, 22]. Unfortunately, it was found that an interval type-2 ANFIS may not outperform a type-1 ANFIS. One possible reason is that when the Karnik-Mendel algorithms [1] are used for type-reduction of the interval type-2 fuzzy system, the least squares estimator in the interval type-2 ANFIS does not always give the optimal solution, due to switch point mismatch [21]. A remedy may be to use an alternative type-reduction and defuzzification approach [23] that does not involve the switch points, e.g., the Wu-Tan method [24]. This is a direction that we are currently working on.
Many novel approaches have been proposed in the last few years to speed up the training and increase the generalization ability of neural networks, particularly deep neural networks, e.g., dropout [25], DropConnect [26], and batch normalization [27]. Dropout randomly discards some neurons and their connections during training. DropConnect randomly sets some connection weights to zero during training. Batch normalization normalizes the activations of the hidden units, and hence reduces internal covariate shift^4. Similar concepts may also be used to expedite the training and increase the robustness of TSK fuzzy systems.

^4 As explained in [27], internal covariate shift means "the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities."

Although deep learning has achieved great success in numerous applications, its model is essentially a black box, because it is difficult to explain the acquired knowledge or decision rationale. This may keep it out of safety-critical applications such as medical diagnosis. Explainability of deep learning models has attracted rapidly growing research interest in the past few years. According to Amarasinghe and Manic [28], there have been two groups of research on this: 1) altering the learning algorithms to learn explainable features; and 2) using additional methods alongside the standard learning algorithm to explain existing deep learning algorithms. They [28] also presented an interesting methodology for linguistically explaining the knowledge a deep neural network classifier has acquired in training, using linguistic summarization [29], which generates Zadeh fuzzy rules. This work shows a novel and promising application of fuzzy rules in deep learning. Similarly, TSK fuzzy rules could also be used to linguistically explain a deep regression model.

III TSK Fuzzy Systems and ME
ME was first proposed by Jacobs et al. in 1991 [8]. It is based on the divide-and-conquer principle, in which the problem space is divided among multiple local experts, supervised by a gating network, as shown in Fig. 5. ME trains multiple local experts, each taking care of only a small local region of the problem space; for a new input, the gating network determines which experts should be used, and then aggregates the outputs of these experts by a weighted average.
III-A The ME
Assume there are $N$ training examples $(x_n, y_n)$, $n=1,...,N$. The experts are trained by minimizing the following error^5:

$E = \sum_{n=1}^{N}\sum_{k=1}^{K} g_k(x_n)\left[y_n - y_k(x_n)\right]^2$ (6)

where $y_k(x_n)$ is the output of the $k$th expert for input $x_n$, and $g_k(x_n)$ is the corresponding normalized weight for the $k$th expert, assigned by the gating network:

$g_k(x) = \frac{\exp(v_k(x))}{\sum_{l=1}^{K}\exp(v_l(x))}$ (7)

in which $v_k(x)$ is a tunable function.

Once the training is done, the final output of the ME is:

$y(x) = \sum_{k=1}^{K} g_k(x)\, y_k(x)$ (8)

^5 In practice, transforms of (6) may be used to speed up the optimization [8].
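A minimal sketch of (7) and (8), assuming linear experts and linear gating functions $v_k(x)$; both are hypothetical modeling choices, since the experts and gates can be arbitrary models:

```python
import numpy as np

def me_output(x, experts, V):
    """ME output, Eqs. (7)-(8), with linear experts and linear gating.
    experts: (K, d+1) expert coefficients; V: (K, d+1) gating coefficients."""
    z = np.append(1.0, x)        # input with a bias term
    v = V @ z                    # tunable gating functions v_k(x)
    g = np.exp(v - v.max())
    g /= g.sum()                 # normalized gate weights, Eq. (7)
    return g @ (experts @ z)     # gated mixture of expert outputs, Eq. (8)
```

Subtracting `v.max()` before exponentiating is a standard numerical stabilization of the softmax and does not change the weights.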
III-B Functional Equivalence between TSK Fuzzy Systems and ME
It is easy to see that, when the number of experts $K$ equals the number of rules and $g_k(x) = \bar{f}_k(x)$, $y(x)$ in (8) is functionally equivalent to the output of the TSK fuzzy system in (3).
Researchers have long made the connection between TSK fuzzy systems and ME. The regression function in each rule consequent of the TSK fuzzy system can be viewed as an expert, and the rule antecedents work as the gating network: for each input, they determine how much weight should be assigned to each rule consequent (expert) in the final aggregation. Of course, the experts and the gating network in ME can be constructed from more complex models, such as neural networks and support vector machines [32], but the structural resemblance remains unchanged.

As summarized in [33], two categories of strategies have been used to partition the problem space among the experts in ME:

Mixture of implicitly localized experts (MILE), which stochastically partitions the problem space into a number of subspaces using a specific error function, so that the experts become specialized in different subspaces.

Mixture of explicitly localized experts (MELE), which explicitly partitions the problem space into subspaces, e.g., by clustering, and then assigns an expert to each subspace.

Both categories have equivalent TSK fuzzy system design strategies.
For example, in evolutionary fuzzy system design [36], the input MFs are randomly initialized, and a fitness function is used to select the configuration that achieves the best performance. In Yen et al.'s approach [37] for increasing the interpretability of a TSK fuzzy system, the antecedent fuzzy partitions are determined by starting with an oversized number of partitions, and then removing redundant and less important ones using the SVD-QR algorithm [38]. These ideas are very close to MILE in ME design.
There have also been many approaches for generating initial fuzzy rule partitions through clustering [39, 40, 41, 42]. Different clustering algorithms, e.g., mountain clustering [39], fuzzy $c$-means clustering [40], aligned clustering [42], etc., have been used. Then, one rule is generated for each cluster. This is essentially the MELE strategy in ME.
III-C Discussions and Future Research
Much progress has been made on ME since it was first proposed in 1991 [32, 33], e.g., different training algorithms, gating networks, and expert models. Since ME is essentially identical to a TSK fuzzy system, these ideas could also be applied to fuzzy systems, particularly the training algorithms and expert models (the gating network is more challenging, because a TSK fuzzy system always uses MFs to perform gating, so there is not much freedom).
First, in training a TSK fuzzy system for regression, the error function is usually defined as:

$E = \sum_{n=1}^{N}\left[y_n - \sum_{k=1}^{K}\bar{f}_k(x_n)\, y_k(x_n)\right]^2$ (9)
However, as pointed out in [8], "this error measure compares the desired output with a blend of the outputs of the local experts, so, to minimize the error, each local expert must make its output cancel the residual error that is left by the combined effects of all the other experts. When the weights in one expert change, the residual error changes, and so the error derivatives for all other local experts change." This strong coupling between the experts facilitates their cooperation, but may lead to solutions in which many experts are used for each input. That is why, in training the local experts, the error function is defined as (6), to facilitate competition among them: (6) requires each expert to approximate $y_n$, instead of a residual. Hence, each local expert is not directly affected by the other experts (it is indirectly affected through the gating network, though).
It is thus interesting to study whether changing the error function from (9) to (6) in training a TSK fuzzy system can improve its performance, in terms of speed and accuracy. Or, the error function could be a hybridization of (9) and (6), to facilitate both cooperation and competition among the local experts, i.e.,

$E = \alpha\sum_{n=1}^{N}\left[y_n - \sum_{k=1}^{K}\bar{f}_k(x_n)\, y_k(x_n)\right]^2 + (1-\alpha)\sum_{n=1}^{N}\sum_{k=1}^{K}\bar{f}_k(x_n)\left[y_n - y_k(x_n)\right]^2$ (10)

where $\alpha\in[0,1]$ is a hyperparameter defining the tradeoff between cooperation and competition. This idea was first explored in [37], which proposed a TSK fuzzy system design strategy to increase interpretability, by forcing each rule consequent to be a reasonable local model (the regression function in each rule consequent needs to fit well the training data covered by the rule antecedent MFs), and also the overall TSK fuzzy system to be a good global model. However, the algorithms in [37] are very memory-hungry^6, and hence may not be applicable when the number of training examples is large. A more efficient solution to this problem is needed.

^6 The algorithms in [37] need to construct matrices whose sizes grow rapidly with the number of training examples, the dimensionality of the input, and the number of rules, and hence are hardly scalable.
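The hybrid error (10) is straightforward to compute once the normalized firing levels and the rule consequent outputs are available; a sketch:

```python
import numpy as np

def hybrid_error(y, fbar, yk, alpha):
    """Hybrid of the cooperative error (9) and the competitive error (6).
    y: (N,) targets; fbar: (N, K) normalized firing levels; yk: (N, K)
    rule consequent outputs; alpha in [0, 1] sets the tradeoff."""
    coop = np.sum((y - np.sum(fbar * yk, axis=1)) ** 2)   # Eq. (9)
    comp = np.sum(fbar * (y[:, None] - yk) ** 2)          # Eq. (6)
    return alpha * coop + (1.0 - alpha) * comp
```

With `alpha = 1` this reduces to the usual global error (9); with `alpha = 0` it reduces to the competitive ME error (6).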
Second, when the performance of an initially designed TSK fuzzy system is not satisfactory, there are two strategies to improve it: 1) increase the number of rules, so that each rule covers a smaller region of the input domain, and hence may better approximate the training examples in that region; and 2) increase the fitting power (nonlinearity) of the consequent function, so that it can better fit the training examples in its local region. The first strategy is frequently used in practice; however, it can increase the number of parameters of the TSK fuzzy system very rapidly. Juang and Lin [42] proposed an interesting approach that incrementally adds linear terms to the rule consequent to increase its fitting power. However, they only considered linear terms. Inspired by ME, whose experts can be complex models like neural networks and support vector machines [32], the TSK rule consequents (local experts) could also use more sophisticated models, particularly support vector regression [43], which outperforms simple linear regression in many applications. The feasibility of this idea has been verified in [44, 45].

Third, TSK fuzzy rules could also be used as experts in ME. For example, Leski [46] proposed such an approach for classification: each expert model in the ME was constructed as a TSK fuzzy rule (whose input region was determined by fuzzy $c$-means clustering), and then a gating network was used to aggregate them. This may increase the interpretability of ME. The idea can also be extended to regression problems.
IV TSK Fuzzy Systems and CART
CART [9] is a popular and powerful strategy for constructing classification and regression trees. It has also been used in ensemble learning approaches such as random forests [47] and gradient boosting machines [48]. This section focuses on regression only.

IV-A CART
Assume there are two numerical inputs, $x_1$ and $x_2$, and one output, $y$. An example of CART is shown in Fig. 6. It is constructed by a divide-and-conquer strategy, in which the input space is partitioned by a hierarchy of Boolean tests into multiple non-overlapping partitions. Each Boolean test corresponds to an internal node of the decision tree. Each leaf node (terminal node) is computed as the mean of all training examples falling into the corresponding partition; thus, CART implements a piecewise constant regression function, as shown in Fig. 6. The route leading to each leaf node can be written as a crisp rule of the form "IF $x_1 \le a$ and $x_2 \le b$, THEN $y = c$". Note that each leaf node can also be a function of the inputs [49, 50, 51, 52, 53], instead of a constant. In this way, the implemented regression function is smoother; however, the trees are more difficult to train.

IV-B Functional Equivalence between TSK Fuzzy Systems and CART
Both CART and fuzzy systems use rules. The rules in CART are crisp: each input belongs to exactly one rule, and the output is the leaf node of that rule. In contrast, the rules in a fuzzy system are fuzzy: each input may fire more than one rule, and the output is a weighted average of the fired rule consequents.
The regression output of a traditional CART has discontinuities, which may be undesirable in practice. So, fuzzy CART, which allows an input to belong to different leaf nodes with different degrees, has been proposed to accommodate this [54, 55, 56, 57]. As pointed out by Suarez and Lutsko [57], "in regression problems, it is seen that the continuity constraint imposed by the function representation of the fuzzy tree leads to substantial improvements in the quality of the regression and limits the tendency to overfitting." An example of fuzzy CART for regression is shown in Fig. 7, where the crisp Boolean tests are replaced by fuzzy sets on $x_1$ and $x_2$. Its input-output mapping, also shown in Fig. 7, is continuous.
Let the $k$th constant leaf node in a fuzzy CART be $c_k$. Then, given an input $x$, the output of the fuzzy CART can be a weighted average of the predictions at all $K$ leaves:

$y(x) = \frac{\sum_{k=1}^{K} f_k(x)\, c_k}{\sum_{k=1}^{K} f_k(x)}$ (11)

where $f_k(x)$ is the product of all membership grades on the path to the $k$th leaf. In this way, $y(x)$ is a smooth function. Clearly, $y(x)$ is functionally equivalent to the output of a TSK fuzzy system in (1), when each rule consequent of the fuzzy system is a constant (instead of a function).
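A minimal sketch of (11) for a small fuzzy tree with two soft splits; the sigmoidal split functions, thresholds, steepness, and leaf constants are all hypothetical choices for illustration:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fuzzy_cart_output(x1, x2):
    """Fuzzy regression tree, Eq. (11): each split is a soft sigmoidal test,
    and a leaf's weight is the product of the grades along its path."""
    left = sigmoid(5.0 * (0.0 - x1))    # degree to which "x1 < 0" holds
    low = sigmoid(5.0 * (0.5 - x2))     # degree to which "x2 < 0.5" holds
    f = np.array([left,                      # leaf 1: x1 < 0
                  (1 - left) * low,          # leaf 2: x1 >= 0, x2 < 0.5
                  (1 - left) * (1 - low)])   # leaf 3: x1 >= 0, x2 >= 0.5
    c = np.array([1.0, -0.5, 2.0])      # constant leaf predictions c_k
    return f @ c / f.sum()
```

Because the soft memberships of sibling branches sum to one at every split, the path weights already sum to one, and the output varies continuously across split boundaries.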
Chaudhuri et al. [51] proposed smoothed and unsmoothed piecewise-polynomial regression trees (SUPPORT), in which each leaf node is a polynomial function of the inputs. A SUPPORT tree is generally much shorter than a traditional CART tree, and hence enjoys better interpretability. The following three-step procedure is used to ensure that its output is smooth:

The input space is recursively partitioned until the data in each partition are adequately fitted by a fixed-order polynomial. Partitioning is guided by analyzing the distributions of the residuals and the cross-validation estimates of the mean squared prediction error.

The data within a neighborhood of the $k$th ($k=1,...,K$) partition are fitted by a polynomial $p_k(x)$.

The prediction for an input $x$ is a weighted average of $p_1(x),...,p_K(x)$, where the $k$th weighting function diminishes rapidly to zero outside the $k$th partition.
If the weighting functions are Gaussian-like, i.e.,

$w_k(x) = \prod_{m=1}^{M}\exp\left(-\frac{(x_m-c_{k,m})^2}{2\sigma_{k,m}^2}\right)$ (12)

where $c_{k,m}$ is the mean of the Gaussian function for the $m$th input, and $\sigma_{k,m}$ is the standard deviation, then the output of SUPPORT is:

$y(x) = \frac{\sum_{k=1}^{K} w_k(x)\, p_k(x)}{\sum_{k=1}^{K} w_k(x)}$ (13)
Clearly, is functionally equivalent to the output of the TSK fuzzy system in (1).
IV-C Discussions and Future Research
It is well-known that fuzzy systems are subject to the curse of dimensionality. Assume a fuzzy system has $M$ inputs, each with $G$ MFs in its domain. Then, the total number of rules is $G^M$, i.e., the number of rules increases exponentially with the number of inputs, and the fuzzy system quickly becomes unmanageable. Clustering can be used to reduce the number of rules (one rule is extracted for each cluster) [39, 40, 41, 42]. However, the validity of the clusters also decreases as the feature dimensionality increases, especially when different features have different importance in different clusters [58, 59].
CART may offer a solution to this problem. For example, Jang [55] first performed CART on a regression dataset to roughly estimate the structure of a TSK fuzzy system, i.e., the number of MFs in each input domain, and the number of rules. Then, each crisp rule antecedent was converted into a fuzzy set, and each consequent into a linear function of the inputs. For example, a crisp rule:

IF $x_1 \le a$ and $x_2 \le b$, THEN $y = c$

can be converted to a TSK fuzzy rule:

IF $x_1$ is $X_1$ and $x_2$ is $X_2$, THEN $y = d_0 + d_1 x_1 + d_2 x_2$

where $d_0$, $d_1$ and $d_2$ are regression coefficients, and $X_1$ and $X_2$ are fuzzy sets defined by sigmoidal MFs:

$\mu_{X_1}(x_1) = \frac{1}{1+e^{\alpha_1(x_1-a)}}$ (14)

$\mu_{X_2}(x_2) = \frac{1}{1+e^{\alpha_2(x_2-b)}}$ (15)

in which $\alpha_1$ and $\alpha_2$ are tunable parameters. Once all crisp rules have been converted to TSK fuzzy rules, ANFIS [13] can be used to optimize the parameters of all rules together, e.g., $a$, $b$, $\alpha_1$, $\alpha_2$, $d_0$, $d_1$, and $d_2$.
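The softening of a crisp split into a sigmoidal MF, in the spirit of (14) and (15), can be sketched as follows; the threshold and steepness values used here are hypothetical:

```python
import numpy as np

def crisp_to_fuzzy_mf(threshold, alpha):
    """Soften the crisp test 'x <= threshold' into a sigmoidal MF, as in
    Eqs. (14)-(15); alpha controls the steepness of the transition."""
    return lambda x: 1.0 / (1.0 + np.exp(alpha * (x - threshold)))

mu = crisp_to_fuzzy_mf(threshold=2.0, alpha=4.0)
# mu(2.0) = 0.5 at the split point; the MF approaches 1 well below the
# threshold and 0 well above it, recovering the crisp test as alpha grows.
```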
The above TSK fuzzy system design strategy offers at least three advantages:

We can prune the CART tree on a high-dimensional dataset to obtain a regression tree with a desired number of leaf nodes, and hence a TSK fuzzy system with a desired number of rules.

Rules in a traditionally designed fuzzy system usually have the same number of antecedents (which equals the number of inputs), but rules initialized from CART may have different numbers of antecedents (usually smaller than the number of inputs), depending on the depths of the corresponding leaf nodes; i.e., we can extract simple and novel rules that may not be extractable using a traditional fuzzy system design approach.

In a traditional fuzzy system, each input (feature) is considered independently in the rule antecedents. However, some variants of CART [49] allow splits on linear combinations of the inputs, which is equivalent to using new (usually more informative) features in splitting. These new features are inherited by the fuzzy rules converted from the CART leaf nodes, which may be difficult to achieve in traditional fuzzy system design.
In summary, initializing TSK fuzzy systems from CART regression trees is a promising solution to deal with highdimensional problems, and hence deserves further research.
V TSK Fuzzy Systems and Stacking
Ensemble regression [10] improves the regression performance by integrating multiple base models. Stacking may be the simplest supervised ensemble regression approach. Its final regression model is a weighted average of the base models, where the weights are trained from the labeled training data.
V-A Stacking
The base models in stacking may be trained from other related tasks or datasets [60]. However, when there are enough training examples for the problem under consideration, the base models may also be trained directly from them. Fig. 8 illustrates such a strategy. For a given training dataset, we can resample it (e.g., using bootstrap [61]) multiple times to obtain multiple new training datasets, each of which is slightly different from the original one. Then, a base model can be trained on each resampled dataset. These base models could use the same regression algorithm, but different regression algorithms, e.g., LASSO [62], ridge regression [63], support vector regression [43], etc., can also be used to increase their diversity. Because the training datasets are different, the trained base models will be different even if the same regression algorithm is used.

Once the base models are obtained, stacking trains another (linear or nonlinear) regression model to fuse them. Assume the outputs of the $K$ base regression models are $y_k(x)$, $k=1,...,K$. Stacking finds a regression model $y(x)$ on the training dataset to aggregate them.
V-B Connections between TSK Fuzzy Systems and Stacking
A TSK fuzzy system for regression can be viewed as a stacking model. Each rule consequent is a base regression model, and the rule antecedent MFs determine the weights of the base models in stacking. Note that in stacking the aggregated output is usually a function of the base model outputs $y_k(x)$ only, but in a TSK fuzzy system the aggregation also depends on the input $x$, as the weights are computed from it and change with it. So, a TSK fuzzy system is actually an adaptive stacking regression model.
In stacking, the base regression models can be built from resampling, reweighting, or different partitions of the original training dataset. This concept has also been used to construct or initialize TSK fuzzy systems.
For example, Nozaki et al. [64] proposed a simple yet powerful heuristic approach for generating TSK fuzzy rules (whose consequents are constants, instead of functions of the inputs) from numerical data. Assume there are $N$ training examples $(x_{n,1}, x_{n,2}, y_n)$, $n=1,...,N$, i.e., the fuzzy system has two inputs and one output. Then, Nozaki et al.'s approach consists of the following three steps [64]:

1) Determine how many MFs should be used for each input, and define the shapes of the MFs. Once this is done, the input space is partitioned into several fuzzy regions.

2) Generate a fuzzy rule in the form of:

IF $x_1$ is $X_{1,i}$ and $x_2$ is $X_{2,j}$, THEN $y = b_k$

in the $k$th fuzzy region, where the MFs $X_{1,i}$ and $X_{2,j}$ have been determined in Step 1), and

$b_k = \frac{\sum_{n=1}^{N}\left[f_k(x_n)\right]^{\alpha} y_n}{\sum_{n=1}^{N}\left[f_k(x_n)\right]^{\alpha}}$ (16)

in which $f_k(x_n)$ is the firing level of the rule for $x_n$, and $\alpha$ is a positive constant. $b_k$ could also be computed using a least squares approach [64].

3) Aggregate all such rules into the final fuzzy system, whose output is the firing-level-weighted average of the rule consequents, as in (1).
Essentially, the above rule-construction approach reweights each training example in a fuzzy partition by the firing levels of the rule antecedents, and then computes a simple base model from them. The final fuzzy system is an aggregation of all such rules. This is exactly the idea of stacking in Fig. 8.
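The reweighting in (16) can be sketched in a few lines of Python; the firing levels are supplied by the caller, and the default exponent value is hypothetical:

```python
import numpy as np

def rule_consequents(F, y, alpha=5.0):
    """Constant rule consequents via Nozaki et al.'s heuristic, Eq. (16):
    a firing-level-weighted average of the targets, with exponent alpha.
    F: (N, K) firing levels f_k(x_n); y: (N,) targets."""
    W = F ** alpha
    return (W * y[:, None]).sum(axis=0) / W.sum(axis=0)
```

A larger `alpha` concentrates each rule's consequent on the examples it fires most strongly on.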
Another example is the local learning part of [37], which proposes an approach for constructing a local TSK rule for each rule partition. Using again the two-input one-output example in the Introduction, a local TSK rule has the form:

IF $x_1$ is $X_{1,i}$ and $x_2$ is $X_{2,j}$, THEN $y_k(x) = a_{k,0} + a_{k,1}x_1 + a_{k,2}x_2$

Given $X_{1,i}$ and $X_{2,j}$, the weight of each training example $(x_n, y_n)$ is the firing level $f_k(x_n)$, and the regression coefficients $a_{k,0}$, $a_{k,1}$ and $a_{k,2}$ are found by minimizing the following weighted loss:

$E_k = \sum_{n=1}^{N} f_k(x_n)\left[y_n - y_k(x_n)\right]^2$ (17)

Each local rule is equivalent to a base model in stacking.
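Minimizing (17) is a weighted least squares problem with a closed-form solution; a sketch, where scaling both sides by the square root of the firing levels reduces it to ordinary least squares:

```python
import numpy as np

def local_rule_fit(X, y, f):
    """Fit one rule's consequent coefficients by minimizing the weighted
    loss (17) in closed form. X: (N, 2) inputs; y: (N,) targets;
    f: (N,) firing levels of this rule, used as example weights."""
    Z = np.hstack([np.ones((len(X), 1)), X])   # design matrix [1, x1, x2]
    r = np.sqrt(f)                             # sqrt weights reduce (17) to OLS
    a, *_ = np.linalg.lstsq(r[:, None] * Z, r * y, rcond=None)
    return a                                   # a_{k,0}, a_{k,1}, a_{k,2}
```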
V-C Discussions and Future Research
Traditional stacking assigns each base model a constant weight. As pointed out in the previous subsection, a TSK fuzzy system can be viewed as an adaptive stacking model, because the weights of the base models (rule consequents) change with the inputs. Inspired by this observation, we can design more powerful stacking strategies by replacing each constant weight with a linear or nonlinear function^7 of the input $x$. The rationale is that the weight of a base model should depend on its performance, whereas its performance is usually related to the location of the input: each base model may perform well in some input regions, but not in the rest. A well-trained function of $x$ may be able to reflect the expertise of the corresponding base model, and hence help achieve better aggregation performance.

^7 This idea was first used in [65] for classification, under the name "modified stacked generalization." It outperformed traditional stacking.
Moreover, if the weighting functions and the base models are trained simultaneously, the weighting functions may encourage the base models to cooperate: each focuses on a partition of the input domain, instead of the entire domain as in traditional stacking. This could yield even better performance than training the base models first and then separately training the weighting functions to aggregate them.
Some proven strategies in stacking may also be used to improve the performance of a TSK fuzzy system. For example, regularization is frequently used to increase the robustness of the stacking model [60, 66]. When LASSO [62] is used to build the stacking model, $\ell_1$ regularization is added, and hence some regression coefficients may be zero, i.e., it increases the sparsity of the solution. When ridge regression [63] or support vector regression [43] is used, $\ell_2$ regularization is added, and hence the regression coefficients usually have small magnitudes, which reduces overfitting. New regularization terms, e.g., negative correlation [66], can be used to create negatively correlated base models, to encourage specialization and cooperation among them. These concepts may also be used in training the rule consequents (base models) of a TSK fuzzy system, and also the antecedent MFs (so that the MFs for the same input are neither too crowded nor too far away from each other).
VI Conclusion
TSK fuzzy systems have achieved great success in numerous applications. However, there are still many challenges in designing an optimal TSK fuzzy system, e.g., how to efficiently train its parameters, how to improve its performance without adding too many parameters, how to balance the tradeoff between cooperation and competition among the rules, how to overcome the curse of dimensionality, etc. The literature has shown that by making appropriate connections between fuzzy systems and other machine learning approaches, good practices from other domains can be used to improve fuzzy systems, and vice versa.
This paper has given an overview of the functional equivalence between TSK fuzzy systems and four classic machine learning approaches, namely neural networks, ME, CART, and stacking, for regression problems. We also pointed out some promising new research directions, inspired by this functional equivalence, that could lead to solutions to the aforementioned problems. For example, by exploiting the functional equivalence between TSK fuzzy systems and certain neural networks, we can design more efficient training algorithms for TSK fuzzy systems; by exploiting the equivalence with ME, we may achieve a better tradeoff between cooperation and competition among the rules of a TSK fuzzy system; by exploiting the equivalence with CART, we can better initialize a TSK fuzzy system to cope with the curse of dimensionality; and, inspired by the connections with stacking, we may design better stacking models and increase the robustness of TSK fuzzy models.
To our knowledge, this paper is so far the most comprehensive overview of the connections between fuzzy systems and other popular machine learning approaches, and we hope it will stimulate more hybridization between different machine learning algorithms.
References
 [1] J. M. Mendel, Uncertain Rule-Based Fuzzy Systems: Introduction and New Directions, 2nd ed. Springer, 2017.
 [2] C.-T. Lin and C. S. G. Lee, Neural Fuzzy Systems: A Neuro-Fuzzy Synergism to Intelligent Systems. Upper Saddle River, NJ: Prentice Hall, 1996.
 [3] S. S. L. Chang and L. A. Zadeh, “On fuzzy mapping and control,” IEEE Trans. on Systems, Man, and Cybernetics, vol. SMC-2, no. 1, pp. 30–34, 1972.
 [4] T. Takagi and M. Sugeno, “Fuzzy identification of systems and its application to modeling and control,” IEEE Trans. on Systems, Man, and Cybernetics, vol. 15, pp. 116–132, 1985.
 [5] L.-X. Wang and J. M. Mendel, “Fuzzy basis functions, universal approximation, and orthogonal least-squares learning,” IEEE Trans. on Neural Networks, vol. 3, pp. 807–813, 1992.
 [6] K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural Networks, vol. 2, no. 5, pp. 359–366, 1989.
 [7] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford, UK: Oxford University Press, 1995.
 [8] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural Computation, vol. 3, no. 1, pp. 79–87, 1991.
 [9] L. Breiman, J. Friedman, C. J. Stone, and R. Olshen, Classification and Regression Trees, 1st ed. Routledge, 2017.
 [10] Z.H. Zhou, Ensemble Methods: Foundations and Algorithms. Boca Raton, FL: CRC press, 2012.
 [11] G. E. Hinton, S. Osindero, and Y.W. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, pp. 1527–1554, 2006.
 [12] S. K. Halgamuge and M. Glesner, “Neural networks in designing fuzzy systems for real world applications,” Fuzzy Sets and Systems, vol. 65, no. 1, pp. 1–12, 1994.
 [13] J.-S. R. Jang, “ANFIS: adaptive-network-based fuzzy inference system,” IEEE Trans. on Systems, Man, and Cybernetics, vol. 23, no. 3, pp. 665–685, 1993.
 [14] H. R. Berenji and P. Khedkar, “Learning and tuning fuzzy logic controllers through reinforcements,” IEEE Trans. on Neural Networks, vol. 3, no. 5, pp. 724–740, 1992.
 [15] J. J. Buckley and Y. Hayashi, “Fuzzy neural networks: A survey,” Fuzzy Sets and Systems, vol. 66, no. 1, pp. 1–13, 1994.
 [16] L.-X. Wang and J. M. Mendel, “Back-propagation of fuzzy systems as nonlinear dynamic system identifiers,” in Proc. IEEE Int’l Conf. on Fuzzy Systems, San Diego, CA, 1992, pp. 1409–1418.
 [17] J. Moody and C. J. Darken, “Fast learning in networks of locally-tuned processing units,” Neural Computation, vol. 1, no. 2, pp. 281–294, 1989.
 [18] R. Murray-Smith and T. A. Johansen, “Local learning in local model networks,” in Proc. 4th IEEE Int’l Conf. on Artificial Neural Networks, Perth, Australia, June 1995, pp. 40–46.
 [19] J.-S. Jang and C.-T. Sun, “Functional equivalence between radial basis function networks and fuzzy inference systems,” IEEE Trans. on Neural Networks, vol. 4, no. 1, pp. 156–159, 1993.
 [20] K. J. Hunt, R. Haas, and R. Murray-Smith, “Extending the functional equivalence of radial basis function networks and fuzzy inference systems,” IEEE Trans. on Neural Networks, vol. 7, no. 3, pp. 776–781, 1996.
 [21] C. Chen, R. John, J. Twycross, and J. M. Garibaldi, “An extended ANFIS architecture and its learning properties for type1 and interval type2 models,” in IEEE Int’l Conf. on Fuzzy Systems, Vancouver, Canada, Jul. 2016, pp. 602–609.
 [22] C. Chen, R. John, J. Twycross, and J. M. Garibaldi, “Type1 and interval type2 ANFIS: A comparison,” in IEEE Int’l Conf. on Fuzzy Systems, Naples, Italy, July 2017, pp. 1–6.
 [23] D. Wu, “Approaches for reducing the computational cost of interval type2 fuzzy logic systems: Overview and comparisons,” IEEE Trans. on Fuzzy Systems, vol. 21, no. 1, pp. 80–99, 2013.
 [24] D. Wu and W. W. Tan, “Computationally efficient type-reduction strategies for a type-2 fuzzy logic controller,” in Proc. IEEE Int’l Conf. on Fuzzy Systems, Reno, NV, May 2005, pp. 353–358.
 [25] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
 [26] L. Wan, M. Zeiler, S. Zhang, Y. LeCun, and R. Fergus, “Regularization of neural networks using DropConnect,” in Proc. Int’l Conf. on Machine Learning, Atlanta, GA, June 2013, pp. 1058–1066.
 [27] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proc. Int’l Conf. on Machine Learning, Lille, France, July 2015, pp. 448–456.
 [28] K. Amarasinghe and M. Manic, “Explaining what a neural network has learned: Toward transparent classification,” in Proc. Int’l Joint Conf. on Neural Networks, Rio, Brazil, Jul. 2018.
 [29] D. Wu and J. M. Mendel, “Linguistic summarization using IFTHEN rules and interval type2 fuzzy sets,” IEEE Trans. on Fuzzy Systems, vol. 19, no. 1, pp. 136–151, 2011.
 [30] H. Bersini and G. Bontempi, “Now comes the time to defuzzify neuro-fuzzy models,” Fuzzy Sets and Systems, vol. 90, no. 2, pp. 161–169, 1997.
 [31] H. Andersen, A. Lotfi, and L. Westphal, “Comments on ‘Functional equivalence between radial basis function networks and fuzzy inference systems’ [and author’s reply],” IEEE Trans. on Neural Networks, vol. 9, no. 6, pp. 1529–1532, 1998.
 [32] S. E. Yuksel, J. N. Wilson, and P. D. Gader, “Twenty years of mixture of experts,” IEEE Trans. on Neural Networks and Learning Systems, vol. 23, no. 8, pp. 1177–1193, 2012.
 [33] S. Masoudnia and R. Ebrahimpour, “Mixture of experts: a literature survey,” Artificial Intelligence Review, vol. 42, no. 2, pp. 275–293, 2014.
 [34] B. Tang, M. I. Heywood, and M. Shepherd, “Input partitioning to mixture of experts,” in Proc. IEEE Int’l Joint Conf. on Neural Networks, Honolulu, HI, May 2002, pp. 227–232.
 [35] S. Gutta, J. R. Huang, P. Jonathon, and H. Wechsler, “Mixture of experts for classification of gender, ethnic origin, and pose of human faces,” IEEE Trans. on Neural Networks, vol. 11, no. 4, pp. 948–960, 2000.
 [36] D. Wu and W. W. Tan, “Genetic learning and performance evaluation of type2 fuzzy logic controllers,” Engineering Applications of Artificial Intelligence, vol. 19, no. 8, pp. 829–841, 2006.
 [37] J. Yen, L. Wang, and C. W. Gillespie, “Improving the interpretability of TSK fuzzy models by combining global learning and local learning,” IEEE Trans. on Fuzzy Systems, vol. 6, no. 4, pp. 530–537, 1998.
 [38] G. Golub, V. Klema, and G. W. Stewart, “Rank degeneracy and least squares problems,” Department of Computer Science, Stanford University, Tech. Rep. STAN-CS-76-559, 1976.
 [39] R. R. Yager and D. P. Filev, “Generation of fuzzy rules by mountain clustering,” Journal of Intelligent & Fuzzy Systems, vol. 2, no. 3, pp. 209–219, 1994.
 [40] M. Delgado, A. F. GómezSkarmeta, and F. Martín, “A fuzzy clusteringbased rapid prototyping for fuzzy rulebased modeling,” IEEE Trans. on Fuzzy Systems, vol. 5, no. 2, pp. 223–233, 1997.
 [41] S. L. Chiu, “Fuzzy model identification based on cluster estimation,” Journal of Intelligent & Fuzzy Systems, vol. 2, no. 3, pp. 267–278, 1994.
 [42] C.-F. Juang and C.-T. Lin, “An online self-constructing neural fuzzy inference network and its applications,” IEEE Trans. on Fuzzy Systems, vol. 6, no. 1, pp. 12–32, 1998.
 [43] A. J. Smola and B. Schölkopf, “A tutorial on support vector regression,” Statistics and Computing, vol. 14, no. 3, pp. 199–222, 2004.
 [44] C. Juang, R. Huang, and W. Cheng, “An interval type-2 fuzzy-neural network with support-vector regression for noisy regression problems,” IEEE Trans. on Fuzzy Systems, vol. 18, no. 4, pp. 686–699, 2010.
 [45] M. Komijani, C. Lucas, B. N. Araabi, and A. Kalhor, “Introducing evolving Takagi-Sugeno method based on local least squares support vector machine models,” Evolving Systems, vol. 3, no. 2, pp. 81–93, 2012.
 [46] J. Leski, “A fuzzy if-then rule-based nonlinear classifier,” Int’l Journal of Applied Mathematics and Computer Science, vol. 13, pp. 215–223, 2003.
 [47] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
 [48] J. H. Friedman, “Greedy function approximation: a gradient boosting machine,” Annals of Statistics, vol. 29, no. 5, pp. 1189–1232, 2001.
 [49] W.Y. Loh, “Fifty years of classification and regression trees,” International Statistical Review, vol. 82, no. 3, pp. 329–348, 2014.
 [50] W. P. Alexander and S. D. Grimshaw, “Treed regression,” Journal of Computational and Graphical Statistics, vol. 5, no. 2, pp. 156–175, 1996.
 [51] P. Chaudhuri, M.-C. Huang, W.-Y. Loh, and R. Yao, “Piecewise-polynomial regression trees,” Statistica Sinica, pp. 143–167, 1994.
 [52] J. R. Quinlan, “Learning with continuous classes,” in Proc. 5th Australian Joint Conf. on Artificial Intelligence, vol. 92, Hobart, Tasmania, Nov. 1992, pp. 343–348.
 [53] A. Dobra and J. Gehrke, “SECRET: a scalable linear regression tree algorithm,” in Proc. 8th ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, Jul. 2002, pp. 481–487.
 [54] R. L. Chang and T. Pavlidis, “Fuzzy decision tree algorithms,” IEEE Trans. on Systems, Man, and Cybernetics, vol. 7, no. 1, pp. 28–35, 1977.
 [55] J.-S. R. Jang, “Structure determination in fuzzy modeling: a fuzzy CART approach,” in Proc. IEEE Int’l Conf. on Fuzzy Systems, Orlando, FL, Jun. 1994, pp. 480–485.
 [56] C. Z. Janikow, “Fuzzy decision trees: issues and methods,” IEEE Trans. on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 28, no. 1, pp. 1–14, 1998.
 [57] A. Suárez and J. F. Lutsko, “Globally optimal fuzzy decision trees for classification and regression,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 21, no. 12, pp. 1297–1311, 1999.
 [58] L. Jing, M. K. Ng, and J. Z. Huang, “An entropy weighting k-means algorithm for subspace clustering of high-dimensional sparse data,” IEEE Trans. on Knowledge & Data Engineering, vol. 19, no. 8, pp. 1026–1041, 2007.
 [59] H. Jia and Y. Cheung, “Subspace clustering of categorical and numerical data with an unknown number of clusters,” IEEE Trans. on Neural Networks and Learning Systems, vol. 29, no. 8, pp. 3308–3325, 2018.
 [60] D. Wu, F. Liu, and C. Liu, “Active stacking for heart rate estimation,” Information Fusion, 2019, submitted.
 [61] B. Efron and R. Tibshirani, An Introduction to the Bootstrap. New York, NY: Chapman & Hall, 1993.
 [62] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society, vol. 58, no. 1, pp. 267–288, 1996.
 [63] A. E. Hoerl and R. W. Kennard, “Ridge regression: Biased estimation for nonorthogonal problems,” Technometrics, vol. 12, no. 1, pp. 55–67, 1970.
 [64] K. Nozaki, H. Ishibuchi, and H. Tanaka, “A simple but powerful heuristic method for generating fuzzy rules from numerical data,” Fuzzy Sets and Systems, vol. 86, no. 3, pp. 251–270, 1997.
 [65] R. Ebrahimpour, H. Nikoo, S. Masoudnia, M. R. Yousefi, and M. S. Ghaemi, “Mixture of MLP-experts for trend forecasting of time series: A case study of the Tehran stock exchange,” International Journal of Forecasting, vol. 27, no. 3, pp. 804–816, 2011.
 [66] Y. Liu and X. Yao, “Ensemble learning via negative correlation,” Neural Networks, vol. 12, no. 10, pp. 1399–1404, 1999.