TSK-Streams: Learning TSK Fuzzy Systems on Data Streams

11/10/2019 · Ammar Shaker et al. · Universität Paderborn, NEC Corp.

The problem of adaptive learning from evolving and possibly non-stationary data streams has attracted a lot of interest in machine learning in the recent past, and also stimulated research in related fields, such as computational intelligence and fuzzy systems. In particular, several rule-based methods for the incremental induction of regression models have been proposed. In this paper, we develop a method that combines the strengths of two existing approaches rooted in different learning paradigms. More concretely, our method adopts basic principles of the state-of-the-art learning algorithm AMRules and enriches them by the representational advantages of fuzzy rules. In a comprehensive experimental study, TSK-Streams is shown to be highly competitive in terms of performance.


1 Introduction

In many practical applications of machine learning and predictive modeling, data is produced incrementally in the course of time and observed in the form of a continuous, potentially unbounded stream of observations. Correspondingly, the problem of learning from data streams has recently received increasing attention (Gama, 2012). Algorithms for learning on streams must be able to process the data in a single pass, which implies an incremental mode of learning, and to adapt to changes of the underlying data-generating process (Domingos and Hulten, 2003).

A popular approach for learning on data streams, both for classification and regression, is rule induction, in the fuzzy logic and computational intelligence community also known as “evolving fuzzy systems” (Lughofer, 2011). Shaker et al. (2017) proposed a method for regression that builds on a very efficient and effective technique for rule induction, which is inspired by the state-of-the-art machine learning algorithm AMRules, and combines it with the strengths of fuzzy modeling. Thus, the method induces a set of fuzzy rules, which, compared to conventional rules with Boolean antecedents, has the advantage of producing smooth regression functions. The method presented in this paper, called TSK-Streams, is a revised and improved variant. The main modifications and novel contributions are as follows.

  • We give a concise overview of regression learning on data streams as well as a systematic comparison of existing methods with regard to properties such as discretization of features, splitting criteria for rules, etc. This overview helps to better understand the specificities of approaches originating from different research fields, as well as to position our own approach.

  • We introduce a new strategy for the induction of TSK fuzzy rules and realize it in the form of two concrete variants: variance reduction and error reduction. While the latter is still close to (Shaker et al., 2017), the variance reduction approach has not been considered for online learning of fuzzy systems so far. Compared with error reduction and other state-of-the-art methods, it leads to models with superior predictive performance.

  • In (Shaker et al., 2017), rule antecedents may contain disjunctions and negations, which makes them difficult to understand and interpret. The representation of TSK rules used in this paper is simpler and more concise. This is achieved by means of an improved technique for splitting fuzzy sets (and extending corresponding rules).

  • We propose the induction of candidate fuzzy rules using a discretization technique that is based on an extended Binary Search Tree (eBST) structure. Compared to the three-layered discretization architecture used by Shaker et al. (2017), the use of eBST for constructing candidate fuzzy sets has a number of advantages in the context of online learning. Most notably, it comes with a reduction of complexity from linear to logarithmic (in the number of candidate extensions).

  • Our empirical evaluation is more extensive and comprises a couple of additional large-scale data sets with up to 100k instances. The evaluation is also extended by including an additional method that has been introduced recently.

2 Related Work

In the machine learning community, research on supervised learning from data streams has mainly focused on classification problems so far. As one of the first methods, Hoeffding trees (Domingos and Hulten, 2000) have been proposed for learning classifiers on high-speed data streams. Since then, the tree-based approach has been developed further, and various modifications and variants can be found in the current literature (Bifet and Gavaldà, 2009). Closely related to tree-based approaches is the induction of decision rules. For example, the Adaptive Very Fast Decision Rules (AVFDR) method (Kosina and Gama, 2012) is an extension of the Very Fast Decision Rules (VFDR) classifier (Gama and Kosina, 2011), which learns a compact set of rules in an incremental manner. Most recently, Bifet et al. (2017) developed an extremely fast version of Hoeffding trees with an implementation that is ready to be used in industrial environments.

Less research has been done on regression for data streams. Notable exceptions include AMRules (Almeida et al., 2013), which is an extension of AVFDR for handling numeric target values, and FIMTDD (Ikonomovska et al., 2011), which induces model trees. In contrast to the machine learning community, the fuzzy systems community has put more emphasis on regression than on classification (Angelov, 2002; Angelov et al., 2010; Lughofer, 2011). In particular, FLEXFIS (Lughofer, 2008) is a method for inducing Takagi-Sugeno-Kang (TSK) rules (Takagi and Sugeno, 1985) from data streams.

In the following, we elaborate a bit more on those approaches that are especially relevant for our own method and the experimental study presented later on.

In the Adaptive Model Rules (AMRules) approach, the rule premises are represented in the form of conjunctive combinations of literals on the input variables. Moreover, the rule consequents are specified as linear functions of the variables, which are fitted to the data using least squares regression. Each rule maintains various statistics characterising the part of the instance space covered by that rule. Starting with a single literal, each rule is expanded by new literals step by step, using the Hoeffding bound as a selection criterion. A distinction between unordered rule sets and decision lists is made by Almeida et al. (2013), who propose two prediction and update schemes. In the first scheme, the rules are sorted in the order in which they have been learned, and only the first rule activated by an example is used for prediction. In the second scheme, the rules are treated as a set, and their predictions are aggregated in terms of a weighted sum (while the concrete weight of a rule is not specified in (Almeida et al., 2013), the implementation suggests that it is determined on the basis of the rule's previous errors); moreover, all rules activated by an example are updated. Since the second scheme achieved better performance, the authors used it in their study.

Fast Incremental Model Trees with Drift Detection (FIMTDD) is a method for learning model trees for regression. To determine splits of the model tree, candidate attributes are assessed according to how much they help to reduce the variance of the target variable. Moreover, a linear function on a corresponding subspace is specified for each leaf of the induced tree, and learning these functions is accomplished using stochastic gradient descent. An ensemble version of FIMTDD (adaptive random forest, ARF-Reg) was proposed by Gomes et al. (2018), using an online version of bagging for creating the ensemble members (Oza and Russell, 2001).

The Flexible Fuzzy Inference Systems (FLEXFIS) approach by Lughofer (2008) is a method for learning fuzzy rules, or, more specifically, Takagi-Sugeno-Kang (TSK) rules, on data streams. This type of rule will be formally introduced in Section 4.1. In contrast to Boolean rules, fuzzy rules are of a gradual nature and can cover an instance to a certain degree, which in turn allows for modulating the influence of a rule on a prediction in a more fine-granular way. In FLEXFIS, the fuzzy support of a rule, i.e., the region it covers in the input space, is determined by (incrementally) clustering the training data and associating each rule with a cluster. Rule consequents are specified in terms of linear functions of the input variables, and the estimation of these functions is successively adapted through recursive weighted least squares (RWLS) (Ljung, 1999).

The main motivation of our approach is to take advantage of the effectiveness and efficiency of algorithmic techniques for rule learning as implemented by methods such as AMRules, and to combine them with the expressiveness of fuzzy rules as used in approaches like FLEXFIS and eTS+ (Angelov, 2010), as well as related formalisms such as fuzzy pattern trees (Shaker et al., 2013).

3 Learning Regression Models on Data Streams

In the following, we categorize and distinguish the learning algorithms discussed above according to several properties. Along the way, we highlight potential advantages of combining different algorithms and their features.

3.1 Trees versus rules

Most tree and rule induction methods are based on refining rules in a general-to-specific manner, i.e., they share the property of moving from general to more specific hypotheses. In FIMTDD, for example, leaf nodes are split into more specific leaf nodes. Likewise, in AMRules and TSK-Streams, rules are specialized by adding terms to the premise part.

Trees can be seen as rule sets with a specific structure. Thus, while a direct transformation from a tree to a set of rules can usually be done in a straightforward manner, the other direction is not always possible. In AMRules, for example, some of the rules are removed upon detection of a concept change, which makes it impossible to map the current rules to an equivalent tree model.

FLEXFIS and eTS+ do not follow the aforementioned general-to-specific induction scheme. Instead, they learn and maintain rules in the form of clusters directly in the instance space. In general, these rules cannot be represented in terms of an equivalent tree structure.

3.2 Binary versus gradual membership

The application of fuzzy logic in decision tree and rule learning leads to two important distinctions from conventional learning. First, hard conditions (in rule antecedents) are replaced by soft conditions, so that an example can satisfy a condition to a certain degree. Therefore, in a tree structure, an instance can be propagated to different sibling nodes/leaves simultaneously, perhaps with different weights. Likewise, in a system of rules, it can be covered by multiple rules with different membership degrees.

The second difference is the ability to aggregate the decisions made by different rules in a weighted manner, as done by TSK-Streams, FLEXFIS, and eTS+, instead of merely computing an unweighted average of the outputs of all rules covering an instance. Thus, more weight can be given to the more relevant and less to the less relevant rules.

Likewise, gradual membership allows for more general inference in the case of tree structures. While decision and model trees restrict tree traversal to a single branch from the root to a leaf node, an equivalent fuzzy model tree (the authors are not aware of any fuzzy model tree induction method for regression on data streams) would follow several such paths simultaneously, branching an instance at an inner node in a weighted manner depending on how much it agrees with the conditions associated with each branch.

3.3 Discretization

Discretization is usually needed to create a finite number of candidate values for splitting points (thresholds) in the case of continuous features; these splitting points are then validated using a splitting criterion to decide how a tree/rule should be extended.

Both AMRules and FIMTDD apply a supervised discretization technique that is tailored to each rule and leaf node; this is achieved by considering the target values of all instances that reached a given leaf node or are covered by a rule.

TSK-Streams, as we will explain later, applies a supervised discretization technique for the creation of fuzzy sets that are evaluated for future extensions.

3.4 Splitting criteria

As already said, refining a model normally means extending a rule with additional conditions, thereby splitting it into two more specific rules or shrinking the region covered by that rule. A splitting criterion is used to find the presumably best among the (typically large) set of candidate splits. To quantify the usefulness of a split, different measures are conceivable.

A splitting criterion employed by many methods, including AMRules, is variance reduction: for a rule $r$, the set $E$ of instances covered by that rule is split into two groups $E_1$ and $E_2$ based on an attribute $x_i$ and a threshold $v$, i.e.,

$$E_1 = \{ \boldsymbol{x} \in E \mid x_i \leq v \}, \qquad E_2 = \{ \boldsymbol{x} \in E \mid x_i > v \} .$$

The sets $E_1$ and $E_2$ then specify new rules $r_1$ and $r_2$, respectively. Both $x_i$ and $v$ are chosen so as to achieve a maximal reduction of variance

$$\operatorname{VR} = \operatorname{Var}(E) - \left( \frac{|E_1|}{|E|} \operatorname{Var}(E_1) + \frac{|E_2|}{|E|} \operatorname{Var}(E_2) \right) , \qquad (1)$$

where $\operatorname{Var}(\cdot)$ is the variance of the target attribute (the $y$-values) of the instances in the respective set.

Variance reduction has its roots in the earliest decision tree induction methods, in which splits are chosen that decrease the impurity of leaf nodes. For categorical target attributes, this is usually put in practice by reducing the information entropy. In the case of classification, the majority class is then used for prediction at a leaf node. In regression, where the target attribute is numerical, averaging is a more reasonable aggregation strategy; it was already adopted by the first regression tree learner CART (Breiman et al., 1984). With the aim of minimizing the sum of squared errors, variance reduction becomes the right splitting criterion, since the sum of weighted variances (the second part of (1)) can be written as the sum of squared errors:

$$\sum_{j=1}^{2} \frac{|E_j|}{|E|} \operatorname{Var}(E_j) = \frac{1}{|E|} \sum_{j=1}^{2} \sum_{\boldsymbol{x} \in E_j} \big( y(\boldsymbol{x}) - \hat{y}_j \big)^2 , \qquad (2)$$

where $\hat{y}_j$ is the (constant) prediction produced by the rule $r_j$.
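As an illustration, the following minimal Python sketch (with made-up toy data) computes the variance reduction (1) for a single candidate threshold; a learner would evaluate this quantity for all candidate attributes and thresholds and pick the maximizer.

```python
import numpy as np

def variance_reduction(x, y, v):
    """Variance reduction (1) obtained by splitting on threshold v of attribute x."""
    left, right = y[x <= v], y[x > v]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    weighted = (len(left) * left.var() + len(right) * right.var()) / len(y)
    return y.var() - weighted

# toy data: the target depends on whether x exceeds 0.5
x = np.array([0.1, 0.2, 0.4, 0.6, 0.8, 0.9])
y = np.array([1.0, 1.1, 0.9, 3.0, 3.2, 2.9])
print(variance_reduction(x, y, 0.5))  # large reduction: the split separates the two regimes
```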

M5 (Quinlan, 1992), one of the most popular regression approaches, induces trees similar to regression trees, except that it learns a linear function in the leaf nodes instead of predicting a constant (the average in CART), while still employing variance reduction as a splitting criterion. FIMTDD extends M5 for learning model trees from data streams; it also applies variance reduction as a splitting criterion.

Despite the popularity of variance reduction, it has been criticized by Karalič (1992) as “not an appropriate measure for impurity of an example set since example sets with large variance and very low impurity can arise”. For example, a set of data points might be located perfectly on a hyperplane that is non-orthogonal to the target axis, and still have a high variance.

FLEXFIS and eTS+ do not apply a splitting criterion directly, but utilize an extension mechanism that decides when to add rules to the current rule set. More specifically, FLEXFIS applies an incremental clustering method, namely an incremental version of vector quantization (Gray, 1984), such that a new example forms a new cluster if its distance to the nearest cluster is larger than a “vigilance” parameter. This parameter controls the tradeoff between major structural changes (creating a new cluster) and minor adaptations of the current structure. Likewise, eTS+ utilizes a density-based incremental clustering method, eClustering+ (Angelov, 2004). In both approaches, the clusters found are eventually transformed into rules.

Finally, we mention that most of the presented approaches consider only a single attribute for splitting, which leads to axis-parallel splits, not only in the standard case (FIMTDD and AMRules) but also in the case of fuzzy methods. FLEXFIS and eTS+ constitute an exception, since they find multivariate Gaussian clusters with non-diagonal covariance matrices.

3.5 Statistical tests versus engineered parameters

Learning on data streams, including the choice of the next split, must be done in an online manner. To answer the question whether or not an additional split is required, i.e., whether or not a significant improvement can be achieved through a split, statistical tests can be applied. A statistical test based on the Hoeffding bound has been extensively used by recent machine learning approaches for classification and regression, including Hoeffding trees, FIMTDD, AMRules, and TSK-Streams.

Instead of applying statistical tests, FLEXFIS and eTS+ make use of more engineered solutions, such as creating a new rule whenever an example is distant from all existing rules, as in FLEXFIS, or when adding an example reduces the density of existing ones, as in eTS+.

4 The Learning Algorithm TSK-Streams

TSK-Streams is an incremental, adaptive algorithm for learning rule-based regression models in a streaming mode. More specifically, TSK-Streams produces a widely used type of fuzzy rule system called Takagi-Sugeno-Kang (TSK) (Takagi and Sugeno, 1985).

4.1 TSK Fuzzy Systems

A TSK rule has the following structure:

$$r_j: \ \text{IF } (x_1 \text{ IS } A_{j,1}) \text{ AND } \ldots \text{ AND } (x_d \text{ IS } A_{j,d}) \ \text{THEN } \ l_j(\boldsymbol{x}) = w_{j,0} + w_{j,1} x_1 + \ldots + w_{j,d} x_d , \qquad (3)$$

with $\boldsymbol{x} = (x_1, \ldots, x_d)$ the representation of an instance in terms of feature values, and each predicate $(x_i \text{ IS } A_{j,i})$ in the antecedent of $r_j$ a soft constraint. The consequent part of the rule is specified by the vector $\boldsymbol{w}_j = (w_{j,0}, w_{j,1}, \ldots, w_{j,d})$, which defines an affine function of the input features. In what follows, we denote a rule by $r_j = (A_j, \boldsymbol{w}_j)$, with $A_j = (A_{j,1}, \ldots, A_{j,d})$ the fuzzy sets defining the rule antecedent, and $\boldsymbol{w}_j$ the coefficients specifying the linear function.

A fuzzy set $A_{j,i}$ with membership function $\mu_{A_{j,i}}: \operatorname{dom}(x_i) \rightarrow [0,1]$ is used to model the soft constraint $(x_i \text{ IS } A_{j,i})$. Thus, the predicate has truth degree $\mu_{A_{j,i}}(x_i)$, which corresponds to the membership degree of $x_i$ in $A_{j,i}$. The overall degree to which an instance satisfies the rule premise is

$$\mu_{r_j}(\boldsymbol{x}) = \top\big( \mu_{A_{j,1}}(x_1), \ldots, \mu_{A_{j,d}}(x_d) \big), \qquad (4)$$

where the triangular norm $\top$ (a binary operator $\top: [0,1]^2 \rightarrow [0,1]$, which is commutative, associative, non-decreasing in both arguments, and with neutral element $1$ and absorbing element $0$) models the logical conjunction (Klement et al., 2000). We will adopt the Gödel t-norm, which is given by $\top(a, b) = \min(a, b)$. Notice that $A_{j,i}$ might be a void constraint, which corresponds to setting $\mu_{A_{j,i}} \equiv 1$; in that case, the feature $x_i$ is effectively removed from the premise of the rule (3).

Now, given an instance $\boldsymbol{x}$ as an input to a TSK system with rules $r_1, \ldots, r_k$, each of these rules will be “activated” with the degree (4). Therefore, the system's output is specified by the weighted average of the outputs suggested by the individual rules:

$$\hat{y}(\boldsymbol{x}) = \sum_{j=1}^{k} \Psi_j(\boldsymbol{x}) \, l_j(\boldsymbol{x}), \qquad (5)$$

with

$$\Psi_j(\boldsymbol{x}) = \frac{\mu_{r_j}(\boldsymbol{x})}{\sum_{m=1}^{k} \mu_{r_m}(\boldsymbol{x})} . \qquad (6)$$
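To make the inference step (4)–(6) concrete, here is a minimal Python sketch (illustrative only, not the MOA implementation); rules are given by per-feature membership functions and consequent coefficient vectors, and the Gödel t-norm (min) is used for the premise. It assumes that at least one rule has positive activation.

```python
import numpy as np

def tsk_predict(x, rules):
    """x: feature vector; rules: list of (membership_funcs, w) pairs, where
    membership_funcs[i](x[i]) is the degree of the i-th soft constraint and
    w = (w_0, w_1, ..., w_d) are the consequent coefficients."""
    activations, outputs = [], []
    for membership_funcs, w in rules:
        mu = min(f(xi) for f, xi in zip(membership_funcs, x))  # Gödel t-norm, eq. (4)
        activations.append(mu)
        outputs.append(w[0] + np.dot(w[1:], x))                # affine consequent, eq. (3)
    activations = np.asarray(activations)
    weights = activations / activations.sum()                  # normalization, eq. (6)
    return float(np.dot(weights, outputs))                     # weighted average, eq. (5)
```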

Fuzzy sets can be characterized by any function of the form $\mu: \operatorname{dom}(x_i) \rightarrow [0,1]$, which leads to membership functions with different shapes and properties (Pedrycz and Gomide, 1998). In our approach, we employ the family of “S-shaped” parametrized functions: a fuzzy set $A$ has a support $[a, d]$ and a core $[b, c]$, with $a \leq b \leq c \leq d$, such that the degree of membership is $1$ inside $[b, c]$ and $0$ outside $[a, d]$. The left boundary of the fuzzy set is modeled in terms of an “S-shaped” transition between zero and full membership:

$$S(x; p, q) = \begin{cases} 0 & x \leq p \\ 2\left(\frac{x-p}{q-p}\right)^2 & p < x \leq \frac{p+q}{2} \\ 1 - 2\left(\frac{x-q}{q-p}\right)^2 & \frac{p+q}{2} < x \leq q \\ 1 & x > q \end{cases} \qquad (7)$$

with $(p, q) = (a, b)$, and the right boundary in terms of the mirrored transition $1 - S(x; c, d)$. An S-shaped membership function can also be left- or right-unbounded:

$$\mu_A(x) = 1 - S(x; c, d) \qquad \text{(left-unbounded)}, \qquad (8)$$
$$\mu_A(x) = S(x; a, b) \qquad \text{(right-unbounded)}. \qquad (9)$$
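A direct transcription of this family of membership functions, assuming the quadratic transition given in (7) (an assumption; other smooth transitions would work analogously):

```python
def s_shape(x, p, q):
    """S-shaped transition from 0 (at x <= p) to 1 (at x >= q), cf. eq. (7)."""
    if x <= p:
        return 0.0
    if x >= q:
        return 1.0
    if x <= (p + q) / 2:
        return 2 * ((x - p) / (q - p)) ** 2
    return 1 - 2 * ((x - q) / (q - p)) ** 2

def membership(x, a, b, c, d):
    """Membership in a fuzzy set with support [a, d] and core [b, c]."""
    return min(s_shape(x, a, b), 1 - s_shape(x, c, d))
```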

4.2 Online Rule Induction

The TSK-Streams algorithm begins with a single default rule and then learns rules in an incremental manner. The default rule has an empty premise for each feature (that is, the membership function $\mu \equiv 1$) and covers the complete input space. Then, the algorithm continuously checks whether, for any of the rules, one of its extensions could possibly improve the performance of the current fuzzy system.

An expansion of a rule $r$ with a predicate on the attribute $x_i$ means that the rule is split into two new rules $r_1$ and $r_2$ with predicates $(x_i \text{ IS } A_{i,1})$ and $(x_i \text{ IS } A_{i,2})$, respectively, where the fuzzy sets $A_{i,1}$ and $A_{i,2}$ together cover the original fuzzy set $A_i$. We denote the membership functions modeling the fuzzy sets $A_{i,1}$ and $A_{i,2}$ by $\mu_{A_{i,1}}$ and $\mu_{A_{i,2}}$, respectively. These membership functions are chosen after a fuzzy partitioning of the domain of feature $x_i$. To this end, we apply a supervised discretization technique that divides a fuzzy set into two new fuzzy sets so as to improve the overall performance. Here, we focus on two criteria (cf. the discussion in Section 3), to be detailed in the following.

4.2.1 Variance reduction

Similar to the AMRules principle of reducing the variance, based on the fuzzy set $A_i$, two fuzzy sets $A_{i,1}$ and $A_{i,2}$ are created such that a maximum reduction in the target attribute's variance is achieved. For example, let $A_i$ be a fuzzy set (for the attribute $x_i$ in the rule $r$) characterized by the S-shaped membership function $\mu_{A_i}$, which is parametrized by the quadruple $(a, b, c, d)$. Let $E$ be the set of examples covered by the rule $r$, i.e., the examples $\boldsymbol{x}$ for which $\mu_r(\boldsymbol{x}) > 0$. We then seek to find the value $v$ such that the reduction in variance

$$\operatorname{VR}(v) = \operatorname{Var}(E) - \left( \frac{|E_1|}{|E|} \operatorname{Var}(E_1) + \frac{|E_2|}{|E|} \operatorname{Var}(E_2) \right)$$

is maximal, where

$$E_1 = \{ \boldsymbol{x} \in E \mid x_i \leq v \}, \qquad E_2 = \{ \boldsymbol{x} \in E \mid x_i > v \},$$

with $\operatorname{Var}(E)$ the variance of the target values of the set $E$.

Similar to AMRules and FIMTDD, we compute the variance reduction by storing candidate values in an extended binary search tree (E-BST). This data structure allows for computing the variance reduction for each candidate value in time that is linear in the size of the tree; moreover, it can be updated in logarithmic time (Ikonomovska, 2012).
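The following sketch illustrates the idea of such an extended binary search tree (a simplification in Python; field and function names are ours, not the MOA data structure): each node keeps aggregate target statistics for the examples in its subtree whose attribute value does not exceed its key, which is enough to evaluate the variance on both sides of any stored threshold.

```python
class EBSTNode:
    """Node of an extended binary search tree (E-BST) over candidate thresholds.
    Each node keeps count / sum / sum-of-squares of the target values of all
    examples inserted into its subtree whose attribute value is <= its key."""
    def __init__(self, key, y):
        self.key, self.n, self.s, self.ss = key, 1, y, y * y
        self.left = self.right = None

    def insert(self, key, y):                      # O(depth) per example
        node = self
        while True:
            if key <= node.key:
                node.n += 1; node.s += y; node.ss += y * y
                if key == node.key: return
                if node.left is None: node.left = EBSTNode(key, y); return
                node = node.left
            else:
                if node.right is None: node.right = EBSTNode(key, y); return
                node = node.right

def best_split(root, n, s, ss):
    """Scan all stored thresholds once (linear in the tree size) and return the
    threshold with maximal variance reduction; (n, s, ss) are the overall totals."""
    var = lambda n_, s_, ss_: ss_ / n_ - (s_ / n_) ** 2 if n_ else 0.0
    best = (float("-inf"), None)
    def visit(node, an, as_, ass):                 # (an, as_, ass): stats accumulated to the left of this subtree
        nonlocal best
        if node is None: return
        visit(node.left, an, as_, ass)
        ln, ls, lss = an + node.n, as_ + node.s, ass + node.ss
        rn, rs, rss = n - ln, s - ls, ss - lss
        if ln and rn:
            vr = var(n, s, ss) - (ln * var(ln, ls, lss) + rn * var(rn, rs, rss)) / n
            if vr > best[0]: best = (vr, node.key)
        visit(node.right, ln, ls, lss)
    visit(root, 0, 0.0, 0.0)
    return best
```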

4.2.2 Error reduction

Extending the current model with new rules so as to improve the system’s overall performance requires, for each existing rule, the creation and evaluation of all possible extensions—evaluating an extension here means determining the empirical performance of the (modified) system as a whole. As before, by a possible rule extension we mean replacing a fuzzy set in a rule antecedent by new fuzzy sets and , which are produced by bisecting the support of at some suitable splitting point. Even if these splitting points were organized in a binary search tree structure, the number of updates required after observing a new example would no longer be logarithmic but linear. Indeed, every possible extension means fitting a step-wise linear function, at each splitting value, on the entire training data (or updating the linear function on the new data instance).

To counter the aforementioned problem, we suggest a heuristic that simultaneously chooses a promising splitting value and fits a stepwise linear function for each candidate extension rule. The splitting value is chosen by adaptively shifting (increasing or decreasing) it based on the performance of the new candidate rules. More formally, let $A_i$ be a fuzzy set characterized by the S-shaped membership function $\mu_{A_i}$, parametrized by $(a, b, c, d)$, and let $E$ be the set of instances covered by the rule $r$. Let $v$ be the initial splitting value from which $A_{i,1}$ and $A_{i,2}$ are constructed via suitable parametrizations of their membership functions $\mu_{A_{i,1}}$ and $\mu_{A_{i,2}}$, respectively. We initialize $v$ by the current mean of the observed values of $x_i$. Two width parameters control the steepness of the S-shaped transitions and are chosen in proportion to the observed variance. From the membership functions $\mu_{A_{i,1}}$ and $\mu_{A_{i,2}}$ and the parent rule $r$, the new candidate rules $r_1$ and $r_2$ are created (see lines 1–9 of Algorithm 1).

Upon observing a new example $(\boldsymbol{x}, y)$, both the membership degrees $\mu_{r_1}(\boldsymbol{x})$ and $\mu_{r_2}(\boldsymbol{x})$ and the errors committed by each candidate rule are computed. If the “winner rule”, i.e., the candidate rule by which the example is covered the most, commits an error that is larger than the error committed by the other candidate rule (covering the example to a lesser degree), we consider this an inconsistency. The latter can be mitigated by shifting the splitting value to the right or to the left, in proportion to the error committed by each candidate extension (see lines 11–21 of Algorithm 1).

Input:
r = (A, w): the rule whose extensions should be created/updated
A: the set of fuzzy sets conjugated in the premise
w: the vector of coefficients of the linear function
Ext(r): the set of candidate extensions of rule r
(x, y): a new training example

If Ext(r) is empty (lines 1–9), a pair of candidate rules is created for every attribute: the splitting value is initialized with the current mean of that attribute, the two fuzzy sets are parametrized around it, and the corresponding candidate rules are added to Ext(r). Otherwise (lines 11–21), for every attribute the membership degrees and errors of the two candidate rules on (x, y) are computed; if the more strongly activated candidate commits the larger error, the splitting value is shifted to the left or to the right in proportion to the errors, the membership functions are updated accordingly, and the consequents of the candidates are updated via UpdateConsequent (Algorithm 2).

Algorithm 1 GenUpdateERCandidates – ErrorReduction
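To convey the flavor of this shifting heuristic, here is a small Python sketch (our own simplification, not Algorithm 1 itself); the splitting value moves away from the candidate that covers the example more strongly but predicts it worse, and the orientation of the two candidates around v is an assumption.

```python
def update_split(v, step, m1, e1, m2, e2):
    """v: current splitting value; step: base shift size; m1, m2: membership
    degrees of the example in the two candidate rules; e1, e2: their absolute
    errors. Candidate 1 is assumed to cover values below v, candidate 2 above."""
    if m1 >= m2 and e1 > e2:
        # the lower candidate wins the coverage but predicts worse:
        # shrink its region by shifting the split to the left
        v -= step * (e1 - e2) / (e1 + e2)
    elif m2 > m1 and e2 > e1:
        # symmetric case: shift the split to the right
        v += step * (e2 - e1) / (e1 + e2)
    return v
```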

In the explanations above, we outlined two ways of splitting an S-shaped function into two such functions of similar shape. In the beginning, however, the default rule contains only unbounded fuzzy sets characterized by $\mu \equiv 1$. A split of an unbounded fuzzy set produces two sets with a left-unbounded and a right-unbounded membership function, respectively, which cover the resulting half-spaces (with some degree of overlap). Similarly, a split of a right- or left-unbounded membership function leads to a right- or left-unbounded and an S-shaped function.

Recall that AMRules adopts only a single rule from the two candidates emerging from a rule expansion (cf. Section 3.1). More specifically, AMRules keeps the rule with minimum weighted variance and discards the other candidate as well as the parent rule from the original rule set. Since the resulting rule set does not form a partition of the instance space, this strategy requires a default rule covering the space that is not covered by any other rule. Motivated by this strategy, we also study the effect of adopting only a single instead of both rule extensions. Thus, we distinguish the following two strategies.

  1. Single Extension: Only the best extension is added to the rule set, while the other one is discarded. The parent rule is also discarded unless it is the default rule. The choice of the best rule depends on the criterion used for splitting: either the weighted variance reduction or the weighted SSE.

  2. All Extensions: Both extensions are added to the rule set, and the parent rule is removed. This approach makes the whole system of rules equivalent to a tree structure.

The two adaptation strategies will be revisited in the context of change detection in Section 4.5. A more detailed exposition of the adaptation strategies is given in Algorithms 3 and 4.

4.3 Rule Consequents

FLEXFIS makes use of recursive weighted least squares estimation (RWLS) (Ljung, 1999) to fit linear functions as rule consequents. This approach is computationally expensive, as it requires multiple matrix inversions. In our approach, and similar to AMRules, we learn consequents more efficiently using gradient methods.

When a new training instance $(\boldsymbol{x}, y)$ arrives, TSK-Streams produces a prediction $\hat{y}(\boldsymbol{x})$, the squared error of which can be obtained as follows:

$$L = \big( y - \hat{y}(\boldsymbol{x}) \big)^2 \qquad (10)$$
$$\phantom{L} = \Big( y - \sum_{r_j \in R} \Psi_j(\boldsymbol{x}) \, l_j(\boldsymbol{x}) \Big)^2 , \qquad (11)$$

where $R$ is the current set of rules. According to the technique of stochastic gradient descent, the coefficients $\boldsymbol{w}_j$ are then moved into the negative direction of the gradient, with the length of the shift being controlled by the learning rate $\eta$:

$$\boldsymbol{w}_j \leftarrow \boldsymbol{w}_j - \eta \, \nabla_{\boldsymbol{w}_j} L . \qquad (12)$$

Thus, the following (component-wise) update rule is obtained:

$$w_{j,i} \leftarrow w_{j,i} + 2 \, \eta \, \big( y - \hat{y}(\boldsymbol{x}) \big) \, \Psi_j(\boldsymbol{x}) \, x_i ,$$

with $x_0 = 1$ for the intercept term.

The process of updating the rule consequents is summarized in Algorithm 2, which also updates the consequents of the rule’s extension (when the error reduction strategy is used).

Input:
R: the set of all rules and their extensions
r = (A, w): the rule whose consequent and the consequents of its extensions should be updated
A: the set of fuzzy sets conjugated in the premise
w: the vector of coefficients of the linear function
Ext(r): the set of candidate extensions of rule r
(x, y): a new training example

The prediction of the current system and the resulting error on (x, y) are computed first; the coefficient vector w of r is then updated according to the gradient step (12). If Ext(r) is non-empty, the same gradient update is applied to the consequent coefficients of each candidate extension, using its own activation degree on x.

Algorithm 2 UpdateConsequent
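A minimal sketch of such a gradient-based consequent update (illustrative only; names are ours, and the factor 2 from the derivative is folded into the learning rate):

```python
import numpy as np

def update_consequents(x, y, rules, activations, eta=0.01):
    """One stochastic gradient step (12) on the consequent coefficients.
    rules: list of NumPy coefficient vectors w = (w_0, ..., w_d);
    activations: normalized degrees Psi_j(x) of eq. (6)."""
    xe = np.concatenate(([1.0], x))                 # prepend 1 for the intercept w_0
    y_hat = sum(psi * np.dot(w, xe) for w, psi in zip(rules, activations))
    err = y - y_hat
    for w, psi in zip(rules, activations):
        w += eta * err * psi * xe                   # in-place component-wise update
    return y_hat
```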

4.4 Model Structure

TSK-Streams adapts the TSK rule system (that is, the fuzzy sets in the rule antecedents and the linear function in the consequents) in a continuous manner. While the adaptations discussed so far essentially concern the parameters of the system, the replacement of a rule by one of its expansions corresponds to a (more substantial) structural change.

For obvious reasons, such changes should be handled with caution, especially when they lead to an increased complexity of the model. Learning methods therefore tend to maintain the current model unless being sufficiently convinced that an expansion will yield an improvement. To decide whether or not a possible expansion should be adopted, the estimated performance difference is typically taken as a criterion: this difference should be significant in a statistical sense.

In our algorithm, we make use of Hoeffding's inequality to support these decisions. The latter bounds the difference between the empirical mean $\bar{X}$ of $n$ i.i.d. random variables $X_1, \ldots, X_n$ (having support of range $R$) and its expectation in terms of

$$P\big( |\bar{X} - \mathbb{E}[\bar{X}]| \geq \epsilon \big) \leq \exp\!\left( - \frac{2 n \epsilon^2}{R^2} \right) . \qquad (13)$$

More specifically, when using the error reduction criterion, we replace a rule $r$ by two rules $r_1$ and $r_2$, considering the reduction in the sum of squared errors (SSE). That is, the SSE of the current rule set $R$ is compared with the SSE of all alternative systems $(R \setminus \{r\}) \cup \{r_1, r_2\}$. With $s_1$ and $s_2$ denoting the expansion with the lowest and the second lowest error, respectively, the best expansion is adopted if

$$\frac{\operatorname{SSE}(s_1)}{\operatorname{SSE}(s_2)} < 1 - \epsilon , \qquad (14)$$

or when $\epsilon$ falls below a tie-breaking constant $\tau$. The constant $\epsilon$ is obtained from (13) by setting the probability to a desired degree of confidence $1 - \delta$, i.e., setting the right-hand side to $\delta$ and solving for $\epsilon$; noting that the ratio in (14) is bounded in $[0, 1]$, the range $R$ is set to 1. Algorithm 3 depicts the system expansion procedure when the error reduction criterion is applied. The same technique can be used for the single extension variant, except that the rule $r$ is replaced with the single extension that achieves the lowest weighted SSE (provided $r$ is not the default rule; otherwise, $r$ is also kept).
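For illustration, the Hoeffding-based decision can be written in a few lines (a sketch under the assumptions above, with range R = 1; the default parameter values shown here are placeholders, not the paper's settings):

```python
import math

def should_expand(sse_best, sse_second, n, delta=1e-7, tau=0.05):
    """Decide whether to adopt the best candidate expansion.
    n: number of examples seen; delta: confidence parameter; tau: tie-breaking constant.
    Implements the test (14) with the Hoeffding epsilon derived from (13)."""
    eps = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return sse_best / sse_second < 1.0 - eps or eps < tau
```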

As an alternative to the global error reduction criterion, the variance reduction approach checks for the decrease in variance for each rule locally. The Hoeffding inequality is then applied to the ratio of the variance reductions of the best two candidate extensions of the same rule. The procedure that performs the expansion is depicted in Algorithm 4. This strategy can be seen as a model adaptation through local improvements.

Input:
R: rules and extensions
VarRed: the variance reduction caused by each candidate extension of a rule
δ: confidence level
τ: tie-breaking constant
n: number of examples seen by the current system
(x, y): a new training example

For each rule r in R, the candidate extensions with the largest and the second largest variance reduction are determined, and the Hoeffding bound (with the complexity penalty added) is computed. If the two best reductions differ significantly according to this bound, or the bound falls below τ, the rule is expanded: in the single extension variant, only the candidate with the largest weighted variance reduction is added (and r is discarded unless it is the default rule); otherwise, both candidates are added and r is removed.

Algorithm 3 ExpandSystemVR – VarianceReduction

Input:
R: rules and extensions
SSE: the sum of squared errors committed by each candidate extension of a rule
δ: confidence level
τ: tie-breaking constant
n: number of examples seen by the current system
(x, y): a new training example

The SSE of the current system and of the candidate systems are updated on (x, y), and the extensions with the smallest and the second smallest SSE are determined (with the complexity penalty added). If the SSE of the best candidate system is significantly smaller than that of the current system and of the second-best candidate according to the Hoeffding bound, or the bound falls below τ, the system is expanded: in the single extension variant, only the candidate with the smallest weighted SSE is added (and the parent rule is discarded unless it is the default rule); otherwise, both candidates are added and the parent rule is removed.

Algorithm 4 ExpandSystemER – ErrorReduction

Finally, we propose a penalization mechanism to avoid the danger of overfitting due to an excessive increase of the number of rules. This mechanism consists of adding a complexity term to the Hoeffding bound $\epsilon$. For both variants (variance reduction and error reduction), this term is determined by the number of features and the size of the current rule set.

4.5 Change Detection

A concept drift may cause a drop in the performance of a rule. To detect such cases, we make use of the adaptive windowing (ADWIN) (Bifet and Gavaldà, 2007) drift detector. Compared to the Page-Hinkley test (PH) (Page, 1954), which is used by AMRules, ADWIN has the advantage of being non-parametric, which means that it makes no assumptions about the observed random variable. Besides, only a single parameter needs to be chosen, namely the tolerance towards false alarms (the confidence parameter δ). In our approach, ADWIN is applied locally in each rule: whenever an example is covered by a rule, the detector is applied to the absolute error committed by that rule on this example.
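Schematically, the per-rule monitoring could look as follows (a sketch; `RuleMonitor`, `covers`, `predict`, and the `DriftDetector`-style interface `add`/`drift_detected` are hypothetical names standing in for an ADWIN implementation, which is not reproduced here):

```python
class RuleMonitor:
    """Feeds the absolute error of a single rule into its own drift detector."""
    def __init__(self, rule, detector_factory):
        self.rule = rule
        self.detector = detector_factory()            # e.g., one ADWIN instance per rule

    def observe(self, x, y):
        if self.rule.covers(x):                        # membership degree > 0
            error = abs(y - self.rule.predict(x))
            self.detector.add(error)
            return self.detector.drift_detected()      # signal that the rule should be removed/repaired
        return False
```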

For the single extension strategy, a rule that suffers from a drop in performance can simply be discarded. In the all extensions strategy, upon detecting a drift in a rule $r$, we find its sibling rule $r'$, from which it differs by only a single literal (i.e., there is exactly one attribute on which the two rules carry different fuzzy sets, while their fuzzy sets coincide on all other attributes). To remove the rule $r$, it is retracted from the rule set, and its sibling rule is updated by replacing its fuzzy set on that attribute with the original (parent) fuzzy set. In case the sibling rule has already been extended before the drift is detected, the same procedure is applied recursively to the children of this rule.

5 Empirical Evaluation

To compare our method TSK-Streams with existing algorithms, we conducted a series of experiments, in which we investigated the algorithms’ predictive accuracy, their runtime, and the size of the models they produce.

5.1 Methods, Data, and Experimental Setup

TSK-Streams is implemented in MOA (Massive Online Analysis, http://moa.cms.waikato.ac.nz) (Bifet et al., 2010), which is an open source software framework for mining and analyzing large data sets in a streaming mode. In our experiments, TSK-Streams is compared with AMRules, FIMTDD, ARF-Reg, and FLEXFIS. Both AMRules and FIMTDD are implemented in MOA's distribution, and we use them in their default settings for the significance level of the Hoeffding inequality and the tie-breaking constant. We implement ARF-Reg as described in the original paper (Gomes et al., 2018), fixing the ensemble size and the number of features per ensemble member (the latter specified as a function of the total number of features). As for the parametrization of TSK-Streams, maximal comparability with AMRules and FIMTDD is assured by using the same values for these parameters. FLEXFIS is implemented in Matlab, and its parameters were tuned with the help of a function specifically offered for that purpose. The only exception is the “forgetting parameter”, whose value was (manually) tuned to provide the best performance.

The test-then-train protocol was used for all experiments. According to this protocol, each instance is used for both testing and training: the model is evaluated on the instance first, and a learning step is carried out afterward. Experiments are performed on benchmark data sets (the first 14 data sets are the same as those used in (Almeida et al., 2013)) collected from the UCI repository (http://archive.ics.uci.edu/ml/) (Lichman, 2013) and other repositories (https://github.com/renatopp/arff-datasets/tree/master/regression, http://tunedit.org/repo/UCI/numeric); a summary of the type, the number of attributes, and the number of instances of each data set is given in Table 1.
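The test-then-train (prequential) protocol is straightforward to express in code; a sketch with hypothetical `model` and `stream` objects:

```python
def prequential_evaluation(model, stream):
    """Test-then-train protocol: each instance is first used for testing, then for training."""
    squared_error, n = 0.0, 0
    for x, y in stream:
        y_hat = model.predict(x)        # evaluate on the incoming instance first
        squared_error += (y - y_hat) ** 2
        n += 1
        model.learn(x, y)               # ... then use it for a learning step
    return (squared_error / n) ** 0.5   # RMSE over the stream
```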

The data sets starting with the prefix BNG- are obtained from the online machine learning platform OpenML (Bischl et al., 2017); these large data streams are drawn from Bayesian networks as generative models, after constructing each network from a relatively small data set (we refer to van Rijn et al. (2014) for more details).

# Name Synthetic Instances Attributes
1 2dplanes yes 40768 11
2 ailerons no 13750 41
3 bank8FM yes 8192 9
4 calHousing no 20640 8
5 elevators no 8752 19
6 fried yes 40769 11
7 house16H no 22784 16
8 house8L no 22784 8
9 kin8nm - 8192 9
10 mvnumeric yes 40768 10
11 pol no 15000 49
12 puma32H yes 8192 32
13 puma8NH yes 8192 9
14 ratingssweetrs - 17903 2
15 BNG-stock semi 59049 10
16 BNG-cholesterol semi 100000 14
17 BNG-echoMonths semi 17496 10
18 BNG-wine-quality semi 100000 14
Table 1: Data Sets

5.2 Results

In the first part of the evaluation, we compare the four variants of our own proposal: variance reduction versus error reduction, and the extension using a single candidate versus the extension for both candidates.

Table 2 shows the average RMSE and the corresponding standard error over ten rounds for each data set. In this table, the last row shows the number of wins/losses of the first three variants against the fourth variant (variance reduction with consideration of both candidates); these tests apply the Wilcoxon signed-rank test over the paired performances of the 10 iterations at a fixed confidence level. From the results, the fourth variant appears to be superior to the other variants. Therefore, we adopt this variant (simply referred to as TSK-Streams in the following) and consider it for further comparison with state-of-the-art methods.

Table 3 presents the performance comparison between TSK-Streams and the other approaches, AMRules, FIMTDD, ARF-Reg, and FLEXFIS. Overall, TSK-Streams compares quite favourably and performs best in terms of the average rank statistic. Moreover, on at least 10 of the 18 data sets, its performance is statistically better (again according to the Wilcoxon signed-rank test, at the same significance level) than that of any other approach. FIMTDD is the weakest performing method, while its incremental random forest variant, ARF-Reg, performs slightly better.

Other criteria important for the applicability of an approach in the setting of data streams include model complexity and efficiency. Obviously, these properties are not independent of each other, because more time is needed to maintain and adapt larger models. We measure the two criteria, respectively, in terms of the number of rules/leaf nodes in the model eventually produced by a learning algorithm and the average time (in milliseconds) the algorithm needs to process a single instance. We consider the latter more informative than the total runtime on an entire data set (stream), because the processing time per instance is more relevant for the possible application of an algorithm under real-time conditions. Table 4 shows that TSK-Streams tends to produce smaller models than FIMTDD and ARF-Reg, which are still slightly larger than those of FLEXFIS and AMRules. Table 5 shows that TSK-Streams is also a bit slower on average. We would argue, however, that this is not important, as it is still extremely fast in terms of absolute runtime: being able to predict and learn from each new instance in just a few milliseconds certainly meets the requirements for learning on data streams.

Table 2: Average RMSE (with standard error) over ten rounds, together with the corresponding ranks, for the four TSK-Streams variants (error reduction vs. variance reduction, each with one or both candidate extensions) on the data sets of Table 1.