On the Functional Equivalence of TSK Fuzzy Systems to Neural Networks, Mixture of Experts, CART, and Stacking Ensemble Regression

03/25/2019 ∙ by Dongrui Wu, et al. ∙ University of Technology Sydney ∙ Huazhong University of Science & Technology

Fuzzy systems have achieved great success in numerous applications. However, there are still many challenges in designing an optimal fuzzy system, e.g., how to efficiently train its parameters, how to improve its performance without adding too many parameters, how to balance the trade-off between cooperation and competition among the rules, how to overcome the curse of dimensionality, etc. Literature has shown that by making appropriate connections between fuzzy systems and other machine learning approaches, good practices from other domains may be used to improve the fuzzy systems, and vice versa. This paper gives an overview of the functional equivalence between Takagi-Sugeno-Kang fuzzy systems and four classic machine learning approaches -- neural networks, mixture of experts, classification and regression trees, and stacking ensemble regression -- for regression problems. We also point out some promising new research directions, inspired by the functional equivalence, that could lead to solutions to the aforementioned problems. To our knowledge, this is so far the most comprehensive overview on the connections between fuzzy systems and other popular machine learning approaches, and hopefully will stimulate more hybridization between different machine learning algorithms.


I Introduction

Fuzzy systems have achieved great success in numerous applications [1, 2]. As shown in Fig. 1, a fuzzy system consists of four components: fuzzifier, rulebase, inference engine, and defuzzifier. The fuzzifier maps each crisp input into a fuzzy set, the inference engine performs inferences on these fuzzy sets to obtain another fuzzy set, utilizing the rulebase, and the defuzzifier converts the inferred fuzzy set into a crisp output.

Fig. 1: Flowchart of a fuzzy system.

There are two kinds of rules for a fuzzy system: Zadeh [3], where the rule consequents are fuzzy sets, and Takagi-Sugeno-Kang (TSK) [4], where the rule consequents are functions of the inputs. Zadeh rules were the earliest rules proposed. However, TSK rules are much more popular in practice due to their simplicity and flexibility. This paper considers mainly TSK fuzzy systems for regression.

As an example, a TSK fuzzy system with two inputs ($x_1$ and $x_2$), two membership functions (MFs) for each input, and one output ($y$), has the following rulebase:

Rule 1: IF $x_1$ is $X_{1,1}$ and $x_2$ is $X_{2,1}$, THEN $y_1(\mathbf{x}) = a_{1,0} + a_{1,1}x_1 + a_{1,2}x_2$
Rule 2: IF $x_1$ is $X_{1,1}$ and $x_2$ is $X_{2,2}$, THEN $y_2(\mathbf{x}) = a_{2,0} + a_{2,1}x_1 + a_{2,2}x_2$
Rule 3: IF $x_1$ is $X_{1,2}$ and $x_2$ is $X_{2,1}$, THEN $y_3(\mathbf{x}) = a_{3,0} + a_{3,1}x_1 + a_{3,2}x_2$
Rule 4: IF $x_1$ is $X_{1,2}$ and $x_2$ is $X_{2,2}$, THEN $y_4(\mathbf{x}) = a_{4,0} + a_{4,1}x_1 + a_{4,2}x_2$

where $X_{i,j}$ ($i=1,2$; $j=1,2$) are fuzzy sets for $x_i$, and $a_{k,0}$, $a_{k,1}$ and $a_{k,2}$ ($k=1,\ldots,4$) are adjustable regression coefficients.

For a particular input $\mathbf{x}=(x_1,x_2)$, the membership grade of $x_i$ on $X_{i,j}$ is $\mu_{X_{i,j}}(x_i)$, and the firing levels of the rules are:

$$f_1(\mathbf{x}) = \mu_{X_{1,1}}(x_1)\,\mu_{X_{2,1}}(x_2), \quad f_2(\mathbf{x}) = \mu_{X_{1,1}}(x_1)\,\mu_{X_{2,2}}(x_2)$$
$$f_3(\mathbf{x}) = \mu_{X_{1,2}}(x_1)\,\mu_{X_{2,1}}(x_2), \quad f_4(\mathbf{x}) = \mu_{X_{1,2}}(x_1)\,\mu_{X_{2,2}}(x_2)$$

The output of the TSK fuzzy system is:

$$y(\mathbf{x}) = \frac{\sum_{k=1}^{4} f_k(\mathbf{x})\,y_k(\mathbf{x})}{\sum_{k=1}^{4} f_k(\mathbf{x})} \tag{1}$$

Or, if we define the normalized firing levels as:

$$\bar{f}_k(\mathbf{x}) = \frac{f_k(\mathbf{x})}{\sum_{i=1}^{4} f_i(\mathbf{x})}, \quad k=1,\ldots,4 \tag{2}$$

then, (1) can be rewritten as:

$$y(\mathbf{x}) = \sum_{k=1}^{4} \bar{f}_k(\mathbf{x})\,y_k(\mathbf{x}) \tag{3}$$
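To make the notation concrete, the following minimal sketch computes (1)-(3) for the two-input, four-rule example above; the Gaussian MF parameters and consequent coefficients are hypothetical, chosen only for illustration.

```python
import numpy as np

def gauss_mf(x, c, sigma):
    """Gaussian membership function with center c and width sigma."""
    return np.exp(-(x - c) ** 2 / (2 * sigma ** 2))

# Hypothetical MF parameters: two Gaussian MFs per input, keyed by (input, MF index).
mf_params = {
    (1, 1): (0.0, 1.0), (1, 2): (5.0, 1.0),   # X_{1,1}, X_{1,2} for x1
    (2, 1): (0.0, 1.0), (2, 2): (5.0, 1.0),   # X_{2,1}, X_{2,2} for x2
}
# Rule antecedents: which MF of x1 and x2 each of the 4 rules uses.
rules = [(1, 1), (1, 2), (2, 1), (2, 2)]
# Hypothetical consequent coefficients a_{k,0}, a_{k,1}, a_{k,2} for each rule.
A = np.array([[1.0, 0.5, -0.2],
              [0.0, 1.0, 0.3],
              [-1.0, 0.2, 0.8],
              [2.0, -0.5, 0.1]])

def tsk_output(x1, x2):
    # Firing levels: product of the antecedent membership grades of each rule.
    f = np.array([gauss_mf(x1, *mf_params[(1, m)]) * gauss_mf(x2, *mf_params[(2, n)])
                  for m, n in rules])
    f_bar = f / f.sum()                      # normalized firing levels, Eq. (2)
    y_rules = A @ np.array([1.0, x1, x2])    # rule consequent outputs y_k(x)
    return f_bar @ y_rules                   # weighted average, Eq. (3)

print(tsk_output(1.2, 3.4))
```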

This paper gives a comprehensive overview of the functional equivalence of TSK fuzzy systems to four classical machine learning algorithms: neural networks [7], mixture of experts (ME) [8], classification and regression trees (CART) [9], and stacking ensemble regression [10]. (It has been shown that many machine learning algorithms, e.g., fuzzy systems and neural networks, are universal approximators [5, 6]. However, the fact that two algorithms are both universal approximators does not mean that they are functionally equivalent: universal approximation usually requires a very large number of nodes or parameters, so it is theoretically important, but may not be usable in real-world algorithm design. By functional equivalence, we emphasize that two algorithms can implement exactly the same function with a relatively small number of parameters.) Although a few publications on the connections of TSK fuzzy systems to some of these approaches are scattered in the literature, to our knowledge, no one has put everything together in one place so that the reader can easily see the big picture and get inspired. Moreover, we also discuss some promising hybridizations between TSK fuzzy systems and each of the four algorithms, which could be interesting new research directions. For example:

  1. By making use of the functional equivalence between TSK fuzzy systems and some neural networks, we can design more efficient training algorithms for TSK fuzzy systems.

  2. By making use of the functional equivalence between TSK fuzzy systems and ME, we may be able to achieve a better trade-off between cooperation and competition among the rules in a TSK fuzzy system.

  3. By making use of the functional equivalence between TSK fuzzy systems and CART, we can better initialize a TSK fuzzy system for high-dimensional problems.

  4. Inspired by the connections between TSK fuzzy systems and stacking ensemble regression, we may be able to design better stacking models, and increase the robustness of a TSK fuzzy model.

The remainder of this paper is organized as follows: Sections II-V describe the functional equivalence of TSK fuzzy systems to neural networks, ME, CART, and stacking ensemble regression, respectively. Section VI draws conclusions.

II TSK Fuzzy Systems and Neural Networks

Neural networks have a longer history than fuzzy systems (see https://cs.stanford.edu/people/eroberts/courses/soco/projects/neural-networks/History/history1.html), and are now at the center stage of machine learning, because of the boom of deep learning [11].

Researchers started to discover in the early 1990s that a TSK fuzzy system can be represented similarly to a neural network [12, 13, 14, 15, 16], so that a neural network learning algorithm, such as back-propagation [7], can be used to train it. These fuzzy systems are called neuro-fuzzy systems in the literature [2].

II-A ANFIS

Some neuro-fuzzy systems resemble the structure of the 3-layer multi-layer perceptron (MLP) [7] in Fig. 2. The first layer represents inputs, the middle (hidden) layer represents fuzzy rules, and the third layer represents outputs. Fuzzy sets are encoded as connection weights.

Fig. 2: An MLP with two inputs, one output, and one hidden layer.

Among the many variants of neuro-fuzzy systems, the most popular one may be the adaptive-network-based fuzzy inference system (ANFIS) [13], which has been cited over 14,600 times on Google Scholar, and implemented in the Matlab Fuzzy Logic Toolbox. The ANFIS structure of the two-input one-output TSK fuzzy system, introduced in the Introduction, is shown in Fig. 3. It has five layers:

  • Layer 1: The membership grade of $x_i$ on $X_{i,j}$ ($i=1,2$; $j=1,2$) is computed.

  • Layer 2: The firing level of each rule is computed, by multiplying the membership grades of the corresponding rule antecedents.

  • Layer 3: The normalized firing levels of the rules are computed, using (2).

  • Layer 4: Each normalized firing level is multiplied by its corresponding rule consequent.

  • Layer 5: The output is computed by (3).

All parameters of the ANFIS, i.e., the shapes of the MFs and the rule consequents, can be trained by a gradient descent algorithm [13]. Or, to speed up the training, the antecedent parameters can be tuned by gradient descent, and the consequent parameters by least squares estimation [13].

Fig. 3: The TSK fuzzy system introduced in the Introduction, represented as a 5-layer ANFIS.
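As a rough sketch of the hybrid option (least squares for the consequents with the antecedents fixed), the snippet below exploits the fact that the output (3) is linear in the consequent coefficients. The MF parameterization and helper names follow the earlier sketch and are purely illustrative; this is not the exact procedure of [13].

```python
import numpy as np

def gauss_mf(x, c, sigma):
    return np.exp(-(x - c) ** 2 / (2 * sigma ** 2))

def fit_consequents_ls(X, y, mf_params, rules):
    """Least squares estimation of all consequent coefficients (a_k0, a_k1, a_k2),
    with the antecedent MFs (hence the normalized firing levels) held fixed."""
    N, K = len(X), len(rules)
    Phi = np.zeros((N, 3 * K))        # design matrix: [fbar_k * (1, x1, x2)] per rule
    for n, (x1, x2) in enumerate(X):
        f = np.array([gauss_mf(x1, *mf_params[(1, m)]) * gauss_mf(x2, *mf_params[(2, j)])
                      for m, j in rules])
        fbar = f / f.sum()            # normalized firing levels, Eq. (2)
        for k in range(K):
            Phi[n, 3 * k: 3 * k + 3] = fbar[k] * np.array([1.0, x1, x2])
    coeffs, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return coeffs.reshape(K, 3)       # one row [a_k0, a_k1, a_k2] per rule
```

Here X would be an (N, 2) array of training inputs and y the corresponding (N,) target vector, with mf_params and rules in the same format as the earlier sketch.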

However, it should be noted that ANFIS is fundamentally different from the MLP in several aspects:

  1. The MLP always uses fully connected layers (also called dense layers in deep learning), whereas the ANFIS does not. For example, in ANFIS, the inputs are selectively connected to the nodes in Layer 1, and the nodes in Layer 1 are also selectively connected to those in Layer 2, to reflect the structure of the rule antecedents.

  2. In the MLP, the output of a node in the hidden layer and output layer is always computed by weighted sum followed by an activation function, whereas there are many different operations in an ANFIS. For example, Layer 1 computes the membership grade of an input on the corresponding MF, and Layer 2 uses direct multiplication.

  3. Layer 4 of the ANFIS also takes $\mathbf{x}$ as an input (to compute the rule consequents), something called a skip connection in deep learning, but usually an MLP does not have such connections.

  4. The MLP is a black-box model, whereas the ANFIS can be expressed by IF-THEN rules, which are easier to interpret and understand.

II-B Functional Equivalence between TSK Fuzzy Systems and Radial Basis Function Networks (RBFN)

Although an MLP is different from a TSK fuzzy system, there is a variant of neural networks, the radial basis function network (RBFN) [17], that is functionally equivalent to a TSK fuzzy system under certain constraints.

An RBFN [17] uses local receptive fields, inspired by biological receptive fields, for function mapping. Its diagram is shown in Fig. 4. For input $\mathbf{x}=(x_1,x_2)$, the output of the $k$th ($k=1,\ldots,K$) receptive field unit, using a Gaussian response function, is:

$$r_k(\mathbf{x}) = \exp\left(-\frac{(x_1-c_{k,1})^2+(x_2-c_{k,2})^2}{2\sigma_k^2}\right) \tag{4}$$

where $c_{k,1}$ and $c_{k,2}$ are the centers of the Gaussian functions for $x_1$ and $x_2$, respectively, and $\sigma_k$ is the common standard deviation of the Gaussian functions.

Fig. 4: The RBFN.

With the addition of lateral connections (not shown in Fig. 4) between the receptive field units, the output of the RBFN is:

$$y(\mathbf{x}) = \frac{\sum_{k=1}^{K} r_k(\mathbf{x})\,w_k}{\sum_{k=1}^{K} r_k(\mathbf{x})} \tag{5}$$

where $w_k$ is a constant output associated with the $k$th receptive field unit. (There is a related machine learning approach called local model networks [18], which can be viewed as a decomposition of a complex nonlinear system into a set of locally accurate sub-models smoothly integrated by their associated basis functions. It replaces the constant output of each receptive field unit in an RBFN by a function of the inputs.)
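A minimal sketch of (4) and (5), with hypothetical centers, widths, and unit outputs, could look as follows.

```python
import numpy as np

def rbfn_output(x, centers, sigmas, w):
    """Normalized RBFN output, Eqs. (4)-(5).
    centers: (K, d) Gaussian centers; sigmas: (K,) common std per unit;
    w: (K,) constant output of each receptive field unit."""
    d2 = ((x - centers) ** 2).sum(axis=1)     # squared distance to each center
    r = np.exp(-d2 / (2 * sigmas ** 2))       # receptive field activations, Eq. (4)
    return (r * w).sum() / r.sum()            # weighted average, Eq. (5)

# Hypothetical example with K = 3 units in a 2-D input space.
centers = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 0.0]])
sigmas = np.array([1.0, 1.5, 1.0])
w = np.array([0.0, 1.0, -0.5])
print(rbfn_output(np.array([1.0, 1.0]), centers, sigmas, w))
```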

Jang and Sun [19] have shown that a TSK fuzzy system [see (3)] is functionally equivalent to an RBFN [see (5)], if the following constraints are satisfied:

  1. The number of receptive field units equals the number of fuzzy rules.

  2. The output of each fuzzy rule is a constant, instead of a function of the inputs.

  3. The antecedent MFs of each fuzzy rule are Gaussian functions with the same variance.

  4. The product t-norm is used to compute the firing level of each rule.

  5. The fuzzy system and the RBFN use the same method (i.e., either weighted average or weighted sum) to compute the final output.
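To see these constraints in action, the following sketch builds a small TSK fuzzy system with constant consequents and identical-variance Gaussian MFs (all parameters hypothetical), and checks numerically that its output coincides with that of the normalized RBFN in (5).

```python
import numpy as np

def gauss(x, c, s):
    return np.exp(-(x - c) ** 2 / (2 * s ** 2))

# Shared hypothetical parameters: 3 rules / 3 receptive field units.
centers = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 0.0]])  # MF centers per rule
sigma = 1.0                                                # same variance everywhere
w = np.array([0.0, 1.0, -0.5])                             # constant rule consequents

def tsk(x):
    # Product t-norm over the per-input Gaussian MFs (constraints 3 and 4).
    f = gauss(x[0], centers[:, 0], sigma) * gauss(x[1], centers[:, 1], sigma)
    return (f * w).sum() / f.sum()            # weighted average, Eq. (3)

def rbfn(x):
    d2 = ((x - centers) ** 2).sum(axis=1)
    r = np.exp(-d2 / (2 * sigma ** 2))        # Eq. (4): the joint Gaussian response
    return (r * w).sum() / r.sum()            # Eq. (5)

x = np.array([1.3, 0.7])
assert np.isclose(tsk(x), rbfn(x))            # functionally equivalent outputs
```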

Hunt et al. [20] proposed a generalized RBFN, which has the following main features, compared with the above standard RBFN:

  1. A receptive field unit may be connected with only a subset of the inputs, instead of all inputs in the standard RBFN.

  2. The output associated with each receptive field unit can be a linear or nonlinear function of the inputs, instead of a constant in the standard RBFN.

  3. The Gaussian response functions of the receptive field units can have different variances for different inputs, instead of identical variance in the standard RBFN.

Then, the generalized RBFN is functionally equivalent to a TSK fuzzy system, under the following constraints [20]:

  1. The number of receptive field units equals the number of fuzzy rules.

  2. The antecedent MFs of each fuzzy rule are Gaussian.

  3. The product t-norm is used to compute the firing level of each rule.

  4. The fuzzy system and the RBFN use the same method (i.e., either weighted average or weighted sum) to compute the final output.

II-C Discussions and Future Research

As ANFIS is an efficient and popular training algorithm for type-1 TSK fuzzy systems, it is natural to consider whether it can also be used for interval and general type-2 fuzzy systems [1], which have demonstrated better performance than type-1 fuzzy systems in many applications. There has been limited research in this direction [21, 22]. Unfortunately, it was found that interval type-2 ANFIS may not outperform type-1 ANFIS. One possible reason is that when the Karnik-Mendel algorithms [1] are used in type-reduction of the interval type-2 fuzzy system, the least squares estimator in the interval type-2 ANFIS does not always give the optimal solution, due to the switch point mismatch [21]. A remedy may be to use an alternative type-reduction and defuzzification approach [23], which does not involve the switch points, e.g., the Wu-Tan method [24]. This is a direction that we are currently working on.

Many novel approaches have been proposed in the last few years to speed up the training and increase the generalization ability of neural networks, particularly deep neural networks, e.g., dropout [25], DropConnect [26], and batch normalization [27]. Dropout randomly discards some neurons and their connections during the training. DropConnect randomly sets some connection weights to zero during the training. Batch normalization normalizes the activation of the hidden units, and hence reduces internal covariate shift (as explained in [27], internal covariate shift means "the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities."). Similar concepts may also be used to expedite the training and increase the robustness of TSK fuzzy systems.

Although deep learning has achieved great success in numerous applications, its model is essentially a black box, because it is difficult to explain the acquired knowledge or decision rationale. This may hinder its use in safety-critical applications such as medical diagnosis. Explainability of deep learning models has attracted rapidly growing research interest in the past few years. According to Amarasinghe and Manic, there have been two groups of research on this [28]: 1) altering the learning algorithms to learn explainable features; and, 2) using additional methods with the standard learning algorithm to explain existing deep learning algorithms. They [28] also presented an interesting methodology for linguistically explaining the knowledge a deep neural network classifier has acquired in training, using linguistic summarization [29], which generates Zadeh fuzzy rules. This work shows a novel and promising application of fuzzy rules in deep learning. Similarly, TSK fuzzy rules could also be used to linguistically explain a deep regression model.

III TSK Fuzzy Systems and ME

ME was first proposed by Jacobs et al. in 1991 [8]. It is based on the divide-and-conquer principle, in which the problem space is divided among multiple local experts, supervised by a gating network, as shown in Fig. 5. ME trains multiple local experts, each taking care of only a small local region of the problem space; for a new input, the gating network determines which experts should be used for it, and then aggregates the outputs of these experts by a weighted average.

Fig. 5: Mixture of experts (ME).

III-A The ME

Assume there are $N$ training examples $(\mathbf{x}_n, y_n)$, $n=1,\ldots,N$. The experts are trained by minimizing the following error (in practice, transforms of (6) may be used to speed up the optimization [8]):

$$E = \sum_{n=1}^{N}\sum_{k=1}^{K} g_k(\mathbf{x}_n)\left[y_n - y_k(\mathbf{x}_n)\right]^2 \tag{6}$$

where $y_k(\mathbf{x}_n)$ is the output of the $k$th expert for input $\mathbf{x}_n$, and $g_k(\mathbf{x}_n)$ is the corresponding normalized weight for the $k$th expert, assigned by the gating network:

$$g_k(\mathbf{x}) = \frac{\exp\left(v_k(\mathbf{x})\right)}{\sum_{i=1}^{K}\exp\left(v_i(\mathbf{x})\right)} \tag{7}$$

in which $v_k(\mathbf{x})$ is a tunable function. Once the training is done, the final output of the ME is:

$$y(\mathbf{x}) = \sum_{k=1}^{K} g_k(\mathbf{x})\,y_k(\mathbf{x}) \tag{8}$$
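Below is a minimal sketch of the ME forward pass in (7) and (8), with linear experts and a linear softmax gating network; the parameter names (expert_W, gate_V) are hypothetical, and training (not shown) would minimize (6) with respect to both sets of parameters.

```python
import numpy as np

def me_output(x, expert_W, gate_V):
    """Mixture-of-experts output, Eqs. (7)-(8).
    expert_W: (K, d+1) linear expert coefficients; gate_V: (K, d+1) gating coefficients."""
    xb = np.append(1.0, x)                    # bias-augmented input
    y_experts = expert_W @ xb                 # expert outputs y_k(x)
    v = gate_V @ xb                           # tunable gating functions v_k(x)
    g = np.exp(v - v.max()); g /= g.sum()     # softmax gating weights, Eq. (7)
    return g @ y_experts                      # weighted average, Eq. (8)

def me_error(X, y, expert_W, gate_V):
    """Competitive error of Eq. (6): each expert is asked to match y directly."""
    E = 0.0
    for xn, yn in zip(X, y):
        xb = np.append(1.0, xn)
        v = gate_V @ xb
        g = np.exp(v - v.max()); g /= g.sum()
        E += (g * (yn - expert_W @ xb) ** 2).sum()
    return E
```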

III-B Functional Equivalence between TSK Fuzzy Systems and ME

It is easy to see that, when the gating weights satisfy $g_k(\mathbf{x}) = \bar{f}_k(\mathbf{x})$ and the experts $y_k(\mathbf{x})$ are the rule consequents, $y(\mathbf{x})$ in (8) is functionally equivalent to the output of the TSK fuzzy system in (3).

A few publications [30, 31] have made the connection between TSK fuzzy systems and ME. The regression function in each rule consequent of the TSK fuzzy system can be viewed as an expert, and the rule antecedents work as the gating network: for each input, they determine how much weight should be assigned to each rule consequent (expert) in the final aggregation. Of course, the experts and gating network in ME can be constructed by more complex models, such as neural networks and support vector machines [32], but the structural resemblance remains unchanged.

As summarized in [33], two categories of strategies have been used to partition the problem space among the experts in ME:

  1. Mixture of implicitly localized experts (MILE), which stochastically partitions the problem space into a number of subspaces using a specific error function, and experts become specialized in each subspace.

  2. Mixture of explicitly localized experts (MELE), which explicitly partitions the problem space via clustering, and assigns one expert to each cluster. Research [34, 35] has found that MELE may outperform MILE in classification problems.

Both categories have equivalent TSK fuzzy system design strategies.

For example, in evolutionary fuzzy systems design [36], the input MFs are randomly initialized, and a fitness function is used to select the configuration that achieves the best performance. In Yen et al.’s approach [37] to increase the interpretability of a TSK fuzzy system, the antecedent fuzzy partitions are determined by starting with an oversized number of partitions, and then removing redundant and less important ones using the SVD-QR algorithm [38]. These ideas are very close to MILE in ME design.

There have also been many approaches for generating initial fuzzy rule partitions through clustering [39, 40, 41, 42]. Different clustering algorithms, e.g., mountain clustering [39], fuzzy c-means clustering [40], aligned clustering [42], etc., have been used. Then, one rule is generated for each cluster. This is essentially the MELE strategy in ME.

III-C Discussions and Future Research

A lot of progress on ME has been made since it was first proposed in 1991 [32, 33], e.g., different training algorithms, different gating networks, and different expert models. Since ME is essentially identical to a TSK fuzzy system, these ideas could also be applied to fuzzy systems, particularly the training algorithms and expert models (the gating network is a little more challenging, because in a TSK fuzzy system we always use MFs to perform gating; there is not much freedom).

First, in training a TSK fuzzy system for regression, the error function is usually defined as:

$$E = \sum_{n=1}^{N}\left[y_n - \sum_{k=1}^{K}\bar{f}_k(\mathbf{x}_n)\,y_k(\mathbf{x}_n)\right]^2 \tag{9}$$

However, as pointed out in [8], "this error measure compares the desired output with a blend of the outputs of the local experts, so, to minimize the error, each local expert must make its output cancel the residual error that is left by the combined effects of all the other experts. When the weights in one expert change, the residual error changes, and so the error derivatives for all other local experts change." This strong coupling between the experts facilitates their cooperation, but may lead to solutions in which many experts are used for each input. That is why in training the local experts, the error function is defined as (6), to facilitate the competition among them. (6) requires each expert to approximate $y_n$, instead of a residual. Hence, each local expert is not directly affected by other experts (it is indirectly affected by other experts through the gating network, though).

It is thus interesting to study whether changing the error function from (9) to (6) in training a TSK fuzzy system can improve its performance, in terms of speed and accuracy. Or, the error function could be a hybridization of (9) and (6), to facilitate both cooperation and competition among the local experts, i.e.,

$$E = \alpha\sum_{n=1}^{N}\left[y_n - \sum_{k=1}^{K}\bar{f}_k(\mathbf{x}_n)\,y_k(\mathbf{x}_n)\right]^2 + (1-\alpha)\sum_{n=1}^{N}\sum_{k=1}^{K}\bar{f}_k(\mathbf{x}_n)\left[y_n - y_k(\mathbf{x}_n)\right]^2 \tag{10}$$

where $\alpha\in[0,1]$ is a hyper-parameter defining the trade-off between cooperation and competition. This idea was first explored in [37], which proposed a TSK fuzzy system design strategy to increase its interpretability, by forcing each rule consequent to be a reasonable local model (the regression function in each rule consequent needs to fit well the training data that are covered by the rule antecedent MFs), and also the overall TSK fuzzy system to be a good global model. However, the algorithms in [37] are very memory-hungry (let $N$ be the number of training examples, $d$ the dimensionality of the input, and $K$ the number of rules; the algorithms in [37] need to construct matrices whose sizes grow rapidly with $N$, $d$ and $K$, which are hardly scalable), and hence may not be applicable when $N$ is large. A more efficient solution to this problem is needed.
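A minimal sketch of (10) is given below; fbar(x) and consequents(x) are assumed to be callables returning the normalized firing levels and the rule consequent outputs (e.g., as in the sketch in the Introduction), and alpha is the cooperation-competition trade-off.

```python
import numpy as np

def hybrid_loss(X, y, fbar, consequents, alpha=0.5):
    """Hybrid error of Eq. (10): combines the global (cooperative) error (9)
    with the local (competitive) error (6)."""
    coop, comp = 0.0, 0.0
    for xn, yn in zip(X, y):
        f = fbar(xn)                         # normalized firing levels, Eq. (2)
        yk = consequents(xn)                 # rule consequent outputs y_k(x_n)
        coop += (yn - f @ yk) ** 2           # global error term, Eq. (9)
        comp += (f * (yn - yk) ** 2).sum()   # local error term, Eq. (6)
    return alpha * coop + (1 - alpha) * comp
```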

Second, when the performance of an initially designed TSK fuzzy system is not satisfactory, there could be two strategies to improve it: 1) increase the number of rules, so that each rule covers a smaller region in the input domain, and hence may better approximate the training examples in that region; and, 2) increase the fitting power (nonlinearity) of the consequent function, so that it can better fit the training examples in its local region. The first strategy is frequently used in practice; however, it can increase the number of parameters of the TSK fuzzy system very rapidly. Juang and Lin [42] proposed an interesting approach to incrementally add linear terms to the rule consequent to increase its fitting power. However, they only considered linear terms. Inspired by ME, whose expert models could use complex models like neural networks and support vector machines [32], the TSK rule consequents (local experts) could also use more sophisticated models, particularly support vector regression [43], which outperforms simple linear regression in many applications. The feasibility of this idea has been verified in [44, 45].

Third, TSK fuzzy rules could also be used as experts in ME. For example, Leski [46] proposed such an approach for classification: each expert model in the ME was constructed as a TSK fuzzy rule (whose input region was determined by fuzzy c-means clustering), and then a gating network was used to aggregate them. This may increase the interpretability of ME. This idea can also be extended to regression problems.

IV TSK Fuzzy Systems and CART

CART [9] is a popular and powerful strategy for constructing classification and regression trees. It has also been used in ensemble learning approaches such as random forests [47] and gradient boosting machines [48]. This section focuses on regression only.

IV-A CART

Assume there are two numerical inputs, $x_1$ and $x_2$, and one output, $y$. An example of CART is shown in Fig. 6. It is constructed by a divide-and-conquer strategy, in which the input space is partitioned by a hierarchy of Boolean tests into multiple non-overlapping partitions. Each Boolean test corresponds to an internal node of the decision tree. Each leaf node (terminal node) is computed as the mean of all training examples falling into the corresponding partition; thus, CART implements a piecewise constant regression function, as shown in Fig. 6. The route leading to each leaf node can be written as a crisp rule, e.g., IF $x_1 \le a$ and $x_2 > b$, THEN $y = c$. Note that each leaf node can also be a function of the inputs [49, 50, 51, 52, 53], instead of a constant. In this way, the implemented regression function is smoother; however, the trees are more difficult to train.

Fig. 6: (a) An example of CART for regression; and, (b) its input-output mapping.

IV-B Functional Equivalence between TSK Fuzzy Systems and CART

Both CART and fuzzy systems use rules. The rules in CART are crisp: each input belongs to only one rule, and the output is the leaf node of that rule. On the contrary, the rules in a fuzzy system are fuzzy: each input may fire more than one rule, and the output is a weighted average of these rule consequents.

The regression output of a traditional CART has discontinuities, which may be undesirable in practice. So, fuzzy CART, which allows an input to belong to different leaf nodes with different degrees, has been proposed to accommodate this [54, 55, 56, 57]. As pointed out by Suarez and Lutsko [57], "in regression problems, it is seen that the continuity constraint imposed by the function representation of the fuzzy tree leads to substantial improvements in the quality of the regression and limits the tendency to overfitting." An example of fuzzy CART for regression is shown in Fig. 7(a), where the crisp tests on $x_1$ and $x_2$ are replaced by fuzzy sets. Its input-output mapping, shown in Fig. 7(b), is continuous.

Fig. 7: (a) An example of fuzzy CART for regression; and, (b) its input-output mapping.

Let the $k$th constant leaf node in a fuzzy CART be $c_k$. Then, given an input $\mathbf{x}$, the output of a fuzzy CART can be a weighted average of the predictions at all leaves:

$$y(\mathbf{x}) = \frac{\sum_{k=1}^{K} w_k(\mathbf{x})\,c_k}{\sum_{k=1}^{K} w_k(\mathbf{x})} \tag{11}$$

where $w_k(\mathbf{x})$ is the product of all membership grades on the path to leaf $k$. In this way, $y(\mathbf{x})$ is a smooth function. Clearly, $y(\mathbf{x})$ is functionally equivalent to the output of a TSK fuzzy system in (1), when each rule consequent of the fuzzy system is a constant (instead of a function).
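Below is a minimal sketch of (11) for a two-level fuzzy tree in the spirit of Fig. 7; the tree structure, soft-split thresholds, and leaf values are hypothetical, chosen only to illustrate the path-product weighting.

```python
import numpy as np

def soft_leq(x, a, gamma):
    """Soft version of the crisp test x <= a (gamma controls the fuzziness)."""
    return 1.0 / (1.0 + np.exp(gamma * (x - a)))

def fuzzy_cart(x1, x2):
    # Root: soft split on x1; both children: soft splits on x2 (hypothetical thresholds).
    left = soft_leq(x1, a=2.0, gamma=4.0)       # membership of "x1 is small"
    low = soft_leq(x2, a=1.0, gamma=4.0)        # membership of "x2 is small"
    # Path weights w_k: product of membership grades along each root-to-leaf path.
    w = np.array([left * low, left * (1 - low),
                  (1 - left) * low, (1 - left) * (1 - low)])
    leaves = np.array([0.5, 1.5, -0.2, 2.0])    # constant leaf predictions c_k
    return (w * leaves).sum() / w.sum()         # weighted average, Eq. (11)

print(fuzzy_cart(1.8, 0.9))
```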

Chaudhuri et al. [51] proposed smoothed and unsmoothed piecewise-polynomial regression trees (SUPPORT), in which each leaf node is a polynomial function of the inputs. The SUPPORT tree is generally much shorter than a traditional CART tree, and hence enjoys better interpretability. The following three-step procedure is used to ensure that its output is smooth:

  1. The input space is recursively partitioned until the data in each partition are adequately fitted by a fixed-order polynomial. Partitioning is guided by analyzing the distributions of the residuals and the cross-validation estimates of the mean squared prediction error.

  2. The data within a neighborhood of the $k$th ($k=1,\ldots,K$) partition are fitted by a polynomial $p_k(\mathbf{x})$.

  3. The prediction for an input $\mathbf{x}$ is a weighted average of $p_1(\mathbf{x}),\ldots,p_K(\mathbf{x})$, where the $k$th weighting function diminishes rapidly to zero outside the $k$th partition.

If the weighting functions are Gaussian-like, i.e.,

$$w_k(\mathbf{x}) = \prod_{i=1}^{d}\exp\left(-\frac{(x_i-c_{k,i})^2}{2\sigma_{k,i}^2}\right) \tag{12}$$

where $c_{k,i}$ is the mean of the Gaussian function for the $i$th input, and $\sigma_{k,i}$ is the standard deviation, then the output of SUPPORT is:

$$y(\mathbf{x}) = \frac{\sum_{k=1}^{K} w_k(\mathbf{x})\,p_k(\mathbf{x})}{\sum_{k=1}^{K} w_k(\mathbf{x})} \tag{13}$$

Clearly, $y(\mathbf{x})$ is functionally equivalent to the output of the TSK fuzzy system in (1).

IV-C Discussions and Future Research

It is well-known that fuzzy systems are subject to the curse of dimensionality. Assume a fuzzy system has $d$ inputs, each with $M$ MFs in its domain. Then, the total number of rules is $M^d$, i.e., the number of rules increases exponentially with the number of inputs, and the fuzzy system quickly becomes unmanageable. Clustering could be used to reduce the number of rules (one rule is extracted for each cluster) [39, 40, 41, 42]. However, the validity of the clusters also decreases as the feature dimensionality increases, especially when different features have different importance in different clusters [58, 59].

CART may offer a solution to this problem. For example, Jang [55] first performed CART on a regression dataset to roughly estimate the structure of a TSK fuzzy system, i.e., the number of MFs in each input domain, and the number of rules. Then, each crisp rule antecedent was converted into a fuzzy set, and each consequent into a linear function of the inputs. For example, a crisp rule:

IF $x_1 \le a$ and $x_2 > b$, THEN $y = c$

can be converted to a TSK fuzzy rule:

IF $x_1$ is $X_1$ and $x_2$ is $X_2$, THEN $y(\mathbf{x}) = c_0 + c_1 x_1 + c_2 x_2$

where $c_0$, $c_1$ and $c_2$ are regression coefficients, and $X_1$ and $X_2$ are fuzzy sets defined by sigmoidal MFs:

$$\mu_{X_1}(x_1) = \frac{1}{1+e^{\gamma_1(x_1-a)}} \tag{14}$$

$$\mu_{X_2}(x_2) = \frac{1}{1+e^{-\gamma_2(x_2-b)}} \tag{15}$$

in which $\gamma_1$ and $\gamma_2$ are tunable parameters. Once all crisp rules have been converted to TSK fuzzy rules, ANFIS [13] can be used to optimize the parameters of all rules together, e.g., $a$, $b$, $\gamma_1$, $\gamma_2$, $c_0$, $c_1$, and $c_2$.
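The following sketch illustrates the fuzzification in (14) and (15): each crisp threshold test from a CART split is replaced by a sigmoidal MF whose steepness $\gamma$ is tunable. The thresholds and $\gamma$ values below are hypothetical, chosen only to show the conversion.

```python
import numpy as np

def mf_leq(x, a, gamma):
    """Fuzzified version of the crisp test 'x <= a', Eq. (14)."""
    return 1.0 / (1.0 + np.exp(gamma * (x - a)))

def mf_gt(x, b, gamma):
    """Fuzzified version of the crisp test 'x > b', Eq. (15)."""
    return 1.0 / (1.0 + np.exp(-gamma * (x - b)))

# Crisp CART rule:  IF x1 <= 2.0 and x2 > 1.0, THEN y = 0.7
# Converted TSK rule: IF x1 is X1 and x2 is X2, THEN y = c0 + c1*x1 + c2*x2,
# whose firing level is the product t-norm of the two sigmoidal MFs below.
def rule_firing_level(x1, x2, a=2.0, b=1.0, gamma1=4.0, gamma2=4.0):
    return mf_leq(x1, a, gamma1) * mf_gt(x2, b, gamma2)

print(rule_firing_level(1.5, 1.4))
```

As $\gamma_1$ and $\gamma_2$ grow large, the sigmoidal MFs approach the original crisp tests, so the converted fuzzy rule is a smooth relaxation of the CART rule.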

The above TSK fuzzy system design strategy offers at least three advantages:

  1. We can prune the CART tree on a high-dimensional dataset to obtain a regression tree with a desired number of leaf nodes, and hence a TSK fuzzy system with a desired number of rules.

  2. Rules in a traditionally designed fuzzy system usually have the same number of antecedents (which equals the number of inputs), but rules initialized from CART may have different numbers of antecedents (usually smaller than the number of inputs), depending on the depths of the corresponding leaf nodes, i.e., we can extract simple and novel rules that may not be extractable using a traditional fuzzy system design approach.

  3. In a traditional fuzzy system, each input (feature) is independently considered in rule antecedents. However, some variants of CART [49] allow splits on linear combinations of the inputs, which is equivalent to using new (usually more informative) features in splitting. These new features are also used by the fuzzy rules converted from CART leaf nodes, which may be difficult to achieve in traditional fuzzy systems.

In summary, initializing TSK fuzzy systems from CART regression trees is a promising solution to deal with high-dimensional problems, and hence deserves further research.

V TSK Fuzzy Systems and Stacking

Ensemble regression [10] improves the regression performance by integrating multiple base models. Stacking may be the simplest supervised ensemble regression approach. Its final regression model is a weighted average of the base models, where the weights are trained from the labeled training data.

V-A Stacking

The base models in stacking may be trained from other related tasks or datasets [60]. However, when there are enough training examples for the problem under consideration, the base models may also be trained directly from them. Fig. 8 illustrates such a strategy. For a given training dataset, we can re-sample it (e.g., using bootstrap [61]) multiple times to obtain multiple new training datasets, each of which is slightly different from the original training dataset. Then, a base model can be trained using each re-sampled dataset. These base models could use the same regression algorithm, but different regression algorithms, e.g., LASSO [62], ridge regression [63], support vector regression [43], etc., can also be used to increase their diversity. Because the training datasets are different, the trained base models will be different even if the same regression algorithm is used.

Fig. 8: Stacking ensemble regression.

Once the base models are obtained, stacking trains another (linear or nonlinear) regression model to fuse them. Assume the outputs of the $K$ base regression models are $y_1(\mathbf{x}),\ldots,y_K(\mathbf{x})$. Stacking finds a regression model on the training dataset to aggregate them.
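As a minimal sketch (one plausible choice of aggregation model, not prescribed above), the stacking model could simply be a linear regression fitted by least squares on the base model outputs:

```python
import numpy as np

def fit_stacking_weights(base_preds, y):
    """Fit constant stacking weights by least squares.
    base_preds: (N, K) predictions of the K base models on the training set;
    y: (N,) training targets."""
    Phi = np.column_stack([np.ones(len(y)), base_preds])  # intercept + base outputs
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w   # aggregation model: y_hat(x) = w[0] + sum_k w[k] * y_k(x)
```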

V-B Connections between TSK Fuzzy Systems and Stacking

A TSK fuzzy system for regression can be viewed as a stacking model. Each rule consequent is a base regression model, and the rule antecedent MFs determine the weights of the base models in stacking. Note that in stacking the aggregated output is usually a function of $y_1(\mathbf{x}),\ldots,y_K(\mathbf{x})$ only, but in a TSK fuzzy system the aggregation also depends on the input $\mathbf{x}$, as the weights $\bar{f}_k(\mathbf{x})$ are computed from it, and change with it. So, a TSK fuzzy system is actually an adaptive stacking regression model.

In stacking, the base regression models can be built from resampling, re-weighting, or different partitions of the original training dataset. This concept has also been used to construct or initialize TSK fuzzy systems.

For example, Nozaki et al. [64] proposed a simple yet powerful heuristic approach for generating TSK fuzzy rules (whose rule consequents are constants, instead of functions of the inputs) from numerical data. Assume there are $N$ training examples $(x_{1,n}, x_{2,n}; y_n)$, $n=1,\ldots,N$, i.e., the fuzzy system has two inputs and one output. Then, Nozaki et al.'s approach consists of the following steps [64]:

  1. Determine how many MFs should be used for each input, and define the shapes of the MFs. Once this is done, the input space is partitioned into several fuzzy regions.

  2. Generate a fuzzy rule in the form of:

     IF $x_1$ is $A_{k,1}$ and $x_2$ is $A_{k,2}$, THEN $y = b_k$

     in the $k$th fuzzy region, where the MFs $A_{k,1}$ and $A_{k,2}$ have been determined in Step (1), and

     $$b_k = \frac{\sum_{n=1}^{N}\left[\mu_{A_{k,1}}(x_{1,n})\,\mu_{A_{k,2}}(x_{2,n})\right]^{\alpha} y_n}{\sum_{n=1}^{N}\left[\mu_{A_{k,1}}(x_{1,n})\,\mu_{A_{k,2}}(x_{2,n})\right]^{\alpha}} \tag{16}$$

     in which $\alpha$ is a positive constant. $b_k$ could also be computed using a least squares approach [64].

Essentially, the above rule-construction approach re-weights each training example in a fuzzy partition using the firing levels of the rule antecedents, and then computes a simple base model from them. The final fuzzy system is an aggregation of all such rules. This is exactly the idea of stacking in Fig. 8.

Another example is the local learning part in [37], where an approach for constructing a local TSK rule for each rule partition is proposed. Using again the two-input one-output example in the Introduction, a local TSK rule is of the form:

IF $x_1$ is $X_{1,i}$ and $x_2$ is $X_{2,j}$, THEN $y_k(\mathbf{x}) = a_{k,0} + a_{k,1}x_1 + a_{k,2}x_2$

Given $X_{1,i}$ and $X_{2,j}$, the weight for each training example is $w_{k,n} = \mu_{X_{1,i}}(x_{1,n})\,\mu_{X_{2,j}}(x_{2,n})$, and then the regression coefficients $a_{k,0}$, $a_{k,1}$ and $a_{k,2}$ are found by minimizing the following weighted loss:

$$E_k = \sum_{n=1}^{N} w_{k,n}\left[y_n - y_k(\mathbf{x}_n)\right]^2 \tag{17}$$

Each local rule is equivalent to a base model in stacking.
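A minimal sketch of the weighted least squares fit in (17) is given below, assuming the weights $w_{k,n}$ have already been computed from the antecedent MFs (e.g., as firing levels of this rule on the training data).

```python
import numpy as np

def fit_local_rule(X, y, w):
    """Weighted least squares for one rule consequent, Eq. (17).
    X: (N, 2) inputs; y: (N,) targets; w: (N,) firing-level weights of this rule."""
    Phi = np.column_stack([np.ones(len(X)), X])    # regressors [1, x1, x2]
    ws = np.sqrt(w)
    # Minimizing sum_n w_n (y_n - Phi_n a)^2 is ordinary least squares on
    # the rows of Phi and y scaled by sqrt(w_n).
    coeffs, *_ = np.linalg.lstsq(ws[:, None] * Phi, ws * y, rcond=None)
    return coeffs                                  # [a_k0, a_k1, a_k2]
```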

V-C Discussions and Future Research

Traditional stacking assigns each base model a constant weight. As pointed out in the previous subsection, a TSK fuzzy system can be viewed as an adaptive stacking model, because the weights for the base models (rule consequents) change with the inputs. Inspired by this observation, we can design more powerful stacking strategies, by replacing each constant weight by a linear or nonlinear function of the input $\mathbf{x}$ (this idea was first used in [65] for classification, under the name "modified stacked generalization"; it outperformed traditional stacking). The rationale is that the weight for a base model should depend on its performance, whereas its performance is usually related to the location of the input: each base model may perform well in some input regions, but not in others. A well-trained function of $\mathbf{x}$ may be able to reflect the expertise of the corresponding base model, and hence help achieve better aggregation performance.

Moreover, if the weighting functions and the base models are trained simultaneously, then the weighting functions may encourage the base models to cooperate: each focuses on a partition of the input domain, instead of the entire domain as in traditional stacking. Even better performance could then be expected than when the base models are trained first and the weighting functions are trained separately afterwards to aggregate them.

Some proven strategies in stacking may also be used to improve the performance of a TSK fuzzy system. For example, regularization is frequently used to increase the robustness of the stacking model [60, 66]. When LASSO [62] is used to build the stacking model, $\ell_1$ regularization is added, and hence some regression coefficients may be zero, i.e., it increases the sparsity of the solution. When ridge regression [63] or support vector regression [43] is used to build the stacking model, $\ell_2$ regularization is added, and hence the regression coefficients usually have small magnitudes, i.e., it reduces overfitting. Some new regularization terms, e.g., negative correlation [66], can be used to create negatively correlated base models to encourage specialization and cooperation among them. These concepts may also be used in training the rule consequents (base models) of a TSK fuzzy system, and also the antecedent MFs (so that the MFs for the same input are neither too crowded, nor too far away from each other).

VI Conclusion

TSK fuzzy systems have achieved great success in numerous applications. However, there are still many challenges in designing an optimal TSK fuzzy system, e.g., how to efficiently train its parameters, how to improve its performance without adding too many parameters, how to balance the trade-off between cooperation and competition among the rules, how to overcome the curse of dimensionality, etc. Literature has shown that by making appropriate connections between fuzzy systems and other machine learning approaches, good practices from other domains may be used to improve the fuzzy systems, and vice versa.

This paper has given an overview of the functional equivalence between TSK fuzzy systems and four classic machine learning approaches – neural networks, ME, CART, and stacking – for regression problems. We also pointed out some promising new research directions, inspired by the functional equivalence, that could lead to solutions to the aforementioned problems. For example, by making use of the functional equivalence between TSK fuzzy systems and some neural networks, we can design more efficient training algorithms for TSK fuzzy systems; by making use of the functional equivalence between TSK fuzzy systems and ME, we may be able to achieve a better trade-off between cooperation and competition among the rules in a TSK fuzzy system; by making use of the functional equivalence between TSK fuzzy systems and CART, we can better initialize a TSK fuzzy system to deal with the curse of dimensionality; and, inspired by the connections between TSK fuzzy systems and stacking, we may design better stacking models, and increase the robustness of a TSK fuzzy model.

To our knowledge, this paper is so far the most comprehensive overview on the connections between fuzzy systems and other popular machine learning approaches, and hopefully will stimulate more hybridization between different machine learning algorithms.

References

  • [1] J. M. Mendel, Uncertain rule-based fuzzy systems: introduction and new directions, 2nd ed.   Springer, 2017.
  • [2] C.-T. Lin and C. S. G. Lee, Neural Fuzzy Systems: a Neuro-Fuzzy Synergism to Intelligent Systems.   Upper Saddle River, NJ: Prentice Hall, 1996.
  • [3] S. S. L. Chang and L. A. Zadeh, “On fuzzy mapping and control,” IEEE Trans. on Systems, Man, and Cybernetics, vol. SMC-2, no. 1, pp. 30–34, 1972.
  • [4] T. Takagi and M. Sugeno, “Fuzzy identification of systems and its application to modeling and control,” IEEE Trans. on Systems, Man, and Cybernetics, vol. 15, pp. 116–132, 1985.
  • [5] L.-X. Wang and J. M. Mendel, “Fuzzy basis functions, universal approximation, and orthogonal least-squares learning,” IEEE Trans. on Neural Networks, vol. 3, pp. 807–813, 1992.
  • [6] K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural Networks, vol. 2, no. 5, pp. 359–366, 1989.
  • [7] C. M. Bishop, Neural Networks for Pattern Recognition.   Oxford, UK: Oxford University Press, 1995.
  • [8] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural Computation, vol. 3, no. 1, pp. 79–87, 1991.
  • [9] L. Breiman, J. Friedman, C. J. Stone, and R. Olshen, Classification and Regression Trees, 1st ed.   Routledge, 2017.
  • [10] Z.-H. Zhou, Ensemble Methods: Foundations and Algorithms.   Boca Raton, FL: CRC press, 2012.
  • [11] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, pp. 1527–1554, 2006.
  • [12] S. K. Halgamuge and M. Glesner, “Neural networks in designing fuzzy systems for real world applications,” Fuzzy Sets and Systems, vol. 65, no. 1, pp. 1–12, 1994.
  • [13] J. R. Jang, “ANFIS: adaptive-network-based fuzzy inference system,” IEEE Trans. on Systems, Man, and Cybernetics, vol. 23, no. 3, pp. 665–685, 1993.
  • [14] H. R. Berenji and P. Khedkar, “Learning and tuning fuzzy logic controllers through reinforcements,” IEEE Trans. on Neural Networks, vol. 3, no. 5, pp. 724–740, 1992.
  • [15] J. J. Buckley and Y. Hayashi, “Fuzzy neural networks: A survey,” Fuzzy Sets and Systems, vol. 66, no. 1, pp. 1–13, 1994.
  • [16] L.-X. Wang and J. M. Mendel, “Back-propagation of fuzzy systems as nonlinear dynamic system identifiers,” in Proc. IEEE Int’l Conf. on Fuzzy Systems, San Diego, CA, 1992, pp. 1409–1418.
  • [17] J. Moody and C. J. Darken, “Fast learning in networks of locally-tuned processing units,” Neural Computation, vol. 1, no. 2, pp. 281–294, 1989.
  • [18] R. Murray-Smith and T. A. Johansen, “Local learning in local model networks,” in Proc. 4th IEEE Int’l Conf. on Artificial Neural Networks, Perth, Australia, June 1995, pp. 40–46.
  • [19] J.-S. Jang and C.-T. Sun, “Functional equivalence between radial basis function networks and fuzzy inference systems,” IEEE Trans. on Neural Networks, vol. 4, no. 1, pp. 156–159, 1993.
  • [20] K. J. Hunt, R. Haas, and R. Murray-Smith, “Extending the functional equivalence of radial basis function networks and fuzzy inference systems,” IEEE Trans. on Neural Networks, vol. 7, no. 3, pp. 776–781, 1996.
  • [21] C. Chen, R. John, J. Twycross, and J. M. Garibaldi, “An extended ANFIS architecture and its learning properties for type-1 and interval type-2 models,” in IEEE Int’l Conf. on Fuzzy Systems, Vancouver, Canada, Jul. 2016, pp. 602–609.
  • [22] C. Chen, R. John, J. Twycross, and J. M. Garibaldi, “Type-1 and interval type-2 ANFIS: A comparison,” in IEEE Int’l Conf. on Fuzzy Systems, Naples, Italy, July 2017, pp. 1–6.
  • [23] D. Wu, “Approaches for reducing the computational cost of interval type-2 fuzzy logic systems: Overview and comparisons,” IEEE Trans. on Fuzzy Systems, vol. 21, no. 1, pp. 80–99, 2013.
  • [24] D. Wu and W. W. Tan, “Computationally efficient type-reduction strategies for a type-2 fuzzy logic controller,” in Proc. IEEE Int’l Conf. on Fuzzy Systems, Reno, NV, May 2005, pp. 353–358.
  • [25] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
  • [26] L. Wan, M. Zeiler, S. Zhang, Y. LeCun, and R. Fergus, “Regularization of neural networks using DropConnect,” in Proc. Int’l Conf. on Machine Learning, Atlanta, GA, June 2013, pp. 1058–1066.
  • [27] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proc. Int’l Conf. on Machine Learning, Lille, France, July 2015, pp. 448–456.
  • [28] K. Amarasinghe and M. Manic, “Explaining what a neural network has learned: Toward transparent classification,” in Proc. Int’l Joint Conf. on Neural Networks, Rio, Brazil, Jul. 2018.
  • [29] D. Wu and J. M. Mendel, “Linguistic summarization using IF-THEN rules and interval type-2 fuzzy sets,” IEEE Trans. on Fuzzy Systems, vol. 19, no. 1, pp. 136–151, 2011.
  • [30] H. Bersini and G. Bontempi, “Now comes the time to defuzzify neuro-fuzzy models,” Fuzzy Sets and Systems, vol. 90, no. 2, pp. 161–169, 1997.
  • [31] H. Andersen, A. Lotfi, and L. Westphal, “Comments on ‘functional equivalence between radial basis function networks and fuzzy inference systems’ [and author’s reply],” IEEE Trans. on Neural Networks, vol. 9, no. 6, pp. 1529–1532, 1998.
  • [32] S. E. Yuksel, J. N. Wilson, and P. D. Gader, “Twenty years of mixture of experts,” IEEE Trans. on Neural Networks and Learning Systems, vol. 23, no. 8, pp. 1177–1193, 2012.
  • [33] S. Masoudnia and R. Ebrahimpour, “Mixture of experts: a literature survey,” Artificial Intelligence Review, vol. 42, no. 2, pp. 275–293, 2014.
  • [34] B. Tang, M. I. Heywood, and M. Shepherd, “Input partitioning to mixture of experts,” in Proc. IEEE Int’l Joint Conf. on Neural Networks, Honolulu, HI, May 2002, pp. 227–232.
  • [35] S. Gutta, J. R. Huang, P. Jonathon, and H. Wechsler, “Mixture of experts for classification of gender, ethnic origin, and pose of human faces,” IEEE Trans. on Neural Networks, vol. 11, no. 4, pp. 948–960, 2000.
  • [36] D. Wu and W. W. Tan, “Genetic learning and performance evaluation of type-2 fuzzy logic controllers,” Engineering Applications of Artificial Intelligence, vol. 19, no. 8, pp. 829–841, 2006.
  • [37] J. Yen, L. Wang, and C. W. Gillespie, “Improving the interpretability of TSK fuzzy models by combining global learning and local learning,” IEEE Trans. on Fuzzy Systems, vol. 6, no. 4, pp. 530–537, 1998.
  • [38] G. Golub, V. Klema, and G. W. Stewart, “Rank degeneracy and least squares problems,” Department of Computer Science, Stanford University, Tech. Rep. STAN-CS-76-559, 1976.
  • [39] R. R. Yager and D. P. Filev, “Generation of fuzzy rules by mountain clustering,” Journal of Intelligent & Fuzzy Systems, vol. 2, no. 3, pp. 209–219, 1994.
  • [40] M. Delgado, A. F. Gómez-Skarmeta, and F. Martín, “A fuzzy clustering-based rapid prototyping for fuzzy rule-based modeling,” IEEE Trans. on Fuzzy Systems, vol. 5, no. 2, pp. 223–233, 1997.
  • [41] S. L. Chiu, “Fuzzy model identification based on cluster estimation,” Journal of Intelligent & fuzzy systems, vol. 2, no. 3, pp. 267–278, 1994.
  • [42] C.-F. Juang and C.-T. Lin, “An online self-constructing neural fuzzy inference network and its applications,” IEEE Trans. on Fuzzy Systems, vol. 6, no. 1, pp. 12–32, Feb 1998.
  • [43] A. J. Smola and B. Schölkopf, “A tutorial on support vector regression,” Statistics and Computing, vol. 14, no. 3, pp. 199–222, 2004.
  • [44] C. Juang, R. Huang, and W. Cheng, “An interval type-2 fuzzy-neural network with support-vector regression for noisy regression problems,” IEEE Trans. on Fuzzy Systems, vol. 18, no. 4, pp. 686–699, 2010.
  • [45] M. Komijani, C. Lucas, B. N. Araabi, and A. Kalhor, “Introducing evolving Takagi-Sugeno method based on local least squares support vector machine models,” Evolving Systems, vol. 3, no. 2, pp. 81–93, 2012.
  • [46] J. Leski, “A fuzzy if-then rule-based nonlinear classifier,” Int’l Journal of Applied Mathematics and Computer Science, vol. 13, pp. 215–223, 2003.
  • [47] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
  • [48] J. H. Friedman, “Greedy function approximation: a gradient boosting machine,” Annals of statistics, vol. 29, no. 5, pp. 1189–1232, 2001.
  • [49] W.-Y. Loh, “Fifty years of classification and regression trees,” International Statistical Review, vol. 82, no. 3, pp. 329–348, 2014.
  • [50] W. P. Alexander and S. D. Grimshaw, “Treed regression,” Journal of Computational and Graphical Statistics, vol. 5, no. 2, pp. 156–175, 1996.
  • [51] P. Chaudhuri, M.-C. Huang, W.-Y. Loh, and R. Yao, “Piecewise-polynomial regression trees,” Statistica Sinica, pp. 143–167, 1994.
  • [52] J. R. Quinlan, “Learning with continuous classes,” in Proc. 5th Australian Joint Conf. on Artificial Intelligence, vol. 92, Hobart, Tasmania, Nov. 1992, pp. 343–348.
  • [53] A. Dobra and J. Gehrke, “SECRET: a scalable linear regression tree algorithm,” in Proc. 8th ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, Jul. 2002, pp. 481–487.
  • [54] R. L. Chang and T. Pavlidis, “Fuzzy decision tree algorithms,” IEEE Trans. on Systems, Man, and Cybernetics, vol. 7, no. 1, pp. 28–35, 1977.
  • [55] J.-S. R. Jang, “Structure determination in fuzzy modeling: a fuzzy CART approach,” in Proc. IEEE Int’l Conf. on Fuzzy Systems, Orlando, FL, Jun. 1994, pp. 480–485.
  • [56] C. Z. Janikow, “Fuzzy decision trees: issues and methods,” IEEE Trans. on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 28, no. 1, pp. 1–14, 1998.
  • [57] A. Suárez and J. F. Lutsko, “Globally optimal fuzzy decision trees for classification and regression,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 21, no. 12, pp. 1297–1311, 1999.
  • [58] L. Jing, M. K. Ng, and J. Z. Huang, “An entropy weighting k-means algorithm for subspace clustering of high-dimensional sparse data,” IEEE Trans. on Knowledge & Data Engineering, vol. 19, no. 8, pp. 1026–1041, 2007.
  • [59] H. Jia and Y. Cheung, “Subspace clustering of categorical and numerical data with an unknown number of clusters,” IEEE Trans. on Neural Networks and Learning Systems, vol. 29, no. 8, pp. 3308–3325, 2018.
  • [60] D. Wu, F. Liu, and C. Liu, “Active stacking for heart rate estimation,” Information Fusion, 2019, submitted.
  • [61] B. Efron and R. Tibshirani, An Introduction to the Bootstrap.   New York, NY: Chapman & Hall, 1993.
  • [62] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society, vol. 58, no. 1, pp. 267–288, 1996.
  • [63] A. E. Hoerl and R. W. Kennard, “Ridge regression: Biased estimation for nonorthogonal problems,” Technometrics, vol. 12, no. 1, pp. 55–67, 1970.
  • [64] K. Nozaki, H. Ishibuchi, and H. Tanaka, “A simple but powerful heuristic method for generating fuzzy rules from numerical data,” Fuzzy Sets and Systems, vol. 86, no. 3, pp. 251–270, 1997.
  • [65] R. Ebrahimpour, H. Nikoo, S. Masoudnia, M. R. Yousefi, and M. S. Ghaemi, “Mixture of MLP-experts for trend forecasting of time series: A case study of the Tehran stock exchange,” International Journal of Forecasting, vol. 27, no. 3, pp. 804–816, 2011.
  • [66] Y. Liu and X. Yao, “Ensemble learning via negative correlation,” Neural Networks, vol. 12, no. 10, pp. 1399–1404, 1999.