## I Introduction

A growing share of data is generated as streams over long periods, so classification methods must support lifelong learning to remain reliable over time.
As an example, a command-gesture system [Bouillon2017Nov], where the user draws a gesture to trigger a command, produces this kind of data stream. The set of gestures is chosen by the user and can change at any time. For example, a novice user often draws a gesture slowly and, once used to it, speeds up, which shifts the target concept in the feature space. This modification is called *concept drift*. Such an environment is said to be non-stationary: the probability distribution that generates the data (here, the user) depends on time. To maintain high gesture-recognition performance over time, the classification system must adapt its parameters and/or structure to the concept drift. Thus, the learning process has to be *incremental*: for each new data point, the system extends its knowledge (i.e., it learns and adapts its parameters). Moreover, to keep complexity constant over time and avoid memory overflow during lifelong learning, data should not be stored.

Recent research has shown that Evolving Fuzzy Systems (EFS, also called Incremental Fuzzy Inference Systems) can deal with such environments thanks to their high structural flexibility [Lughofer2018, angelov2011]. The structure (number of rules, and antecedent/consequent parameters) of these fuzzy rule-based systems evolves with the data stream. In particular, we distinguish two scales of adaptation to concept drift. One concerns the adaptation of the rules’ parameters and tackles incremental (or smooth) drifts. The other concerns the adaptation of the structure, by adding or deleting rules, and tackles brutal drifts (abrupt shifts of the data distribution). Both adaptations rely on two main algorithmic approaches. The first uses temporal (adaptive) sliding windows in which only the most recent data are taken into account. As an example, the fuzzy windows concept drift adaptation method (FW-DA) [Liu2017Jul] relies on this approach and shows promising results, outperforming the state of the art. The second weights the data according to their age and interest. An example is [Manuel2013], which uses decremental learning on the premise and conclusion parts with Differed Directional Forgetting (DDF).

However, both approaches have to set a “forgetting” parameter that controls the relevance of old data. If the forgetting parameter is too high, the system becomes less stable, which lowers performance. If it is too low, the system does not react to changes in the environment, which also degrades performance. This issue is often called *the plasticity-stability dilemma*. This dilemma can result in instability, particularly with EFS, where the number of rules, their size and their reactivity depend greatly on parameters given a priori. To prevent instability, [Lughofer2018] recently proposed a new rule splitting method. The idea is to split in two a rule that makes too many mistakes or is too large. However, after the split, the rules take time to re-adjust to the local distribution of points, which introduces inertia in the learning. Conversely, [Song2018ASF] proposed a method to accelerate the learning of new rules after a drift, based on generating data from the new drifted concept using a GAN [NIPS2014_5423].

Similarly, we propose a complementary method to prevent instability of the rules while improving reactivity. To do so, we introduce a new EFS architecture, called *ParaFIS*. In ParaFIS, a generalized evolving fuzzy system (the principal system) is learned synchronously with *an anticipation module*. For each rule of the principal system, two rules are anticipated in the anticipation module. Thus, the anticipation module forces the system to fit the local distribution of points with two rules rather than just one, in order to anticipate a concept drift before it occurs.

The paper is organized as follows. Section II introduces the generalized EFS and its learning model, with a discussion of its drawbacks. Section III introduces our contribution, the *ParaFIS* evolving system. The experiments, which show that anticipation improves the plasticity of the system while preserving stability, are detailed in Section IV. In that section, we propose to measure the performance of the system on artificial brutal drifts using the fitted parameters of a handcrafted model. This evaluation protocol makes it possible to quantify the reaction time of the system and its stability in the steady state.

## II Generalized Evolving Fuzzy Systems

In this paper, we focus on the generalized evolving fuzzy system, which uses a generalized version of Takagi-Sugeno fuzzy systems and has already been used in [Almaksour2011, lemos2011, Lughofer2018].

### II-A Model Architecture

A Takagi-Sugeno (TS) fuzzy system is a set of fuzzy inference rules with an antecedent part (also called premise), and a consequent part. Each rule’s antecedent is defined with a prototype that is set by a cluster with a center . The structure of a rule , is as follows:

IF x is close to THEN .. | (1) |

With a polynomial function for of class ; the number of classes and the number of rules. The degree of the polynomial function is set to with the polynomial coefficients (see Eq. (2)).

(2) |

The membership of x to a rule , denoted , is given by a normalized Radial Basis Function (RBF), of the distance from to (see Eq. (3)). The RBF is often a multivariate Gaussian or Cauchy function [Almaksour2011, lemos2011].

(3) |

In the generalized version of TS, the Mahalanobis distance is used to get rotated hyper-ellipsoid clusters as follows:

(4) |

With the covariance matrix. Finally, the predicted class for x is given by Eq. (5),(6).

(5) | ||||

(6) |
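As a rough illustration of Eq. (1) to (6), the sketch below implements one plausible reading of generalized TS inference: a Cauchy membership on the Mahalanobis distance, normalized activations, and first-order (linear) consequents, with the predicted class taken as the arg max of the aggregated outputs. The names (`Rule`, `memberships`, `predict`, `center`, `cov`, `coeffs`), the exact Cauchy form `1 / (1 + d^2)` and the first-order consequent are assumptions made for this example, not the paper's notation.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Rule:
    center: np.ndarray   # prototype center, shape (d,)
    cov: np.ndarray      # covariance matrix, shape (d, d)
    coeffs: np.ndarray   # linear consequent coefficients, shape (n_classes, d + 1)


def mahalanobis_sq(x, rule):
    """Squared Mahalanobis distance from x to the rule's center."""
    diff = x - rule.center
    return float(diff @ np.linalg.solve(rule.cov, diff))


def memberships(x, rules):
    """Normalized Cauchy activations (assumed form: 1 / (1 + d^2))."""
    raw = np.array([1.0 / (1.0 + mahalanobis_sq(x, r)) for r in rules])
    return raw / raw.sum()


def predict(x, rules):
    """Return (predicted class index, per-class scores)."""
    beta = memberships(x, rules)
    x_ext = np.concatenate(([1.0], x))  # bias term followed by the features
    scores = sum(b * (r.coeffs @ x_ext) for b, r in zip(beta, rules))
    return int(np.argmax(scores)), scores


# Toy usage: two rules, two classes, each rule votes for one class.
rules = [
    Rule(np.array([0.0, 0.0]), np.eye(2), np.array([[1.0, 0, 0], [0.0, 0, 0]])),
    Rule(np.array([5.0, 5.0]), np.eye(2), np.array([[0.0, 0, 0], [1.0, 0, 0]])),
]
label, scores = predict(np.array([4.5, 5.2]), rules)
```

Using `np.linalg.solve` rather than an explicit inverse is a numerical convenience chosen for the sketch; an incremental implementation would more likely maintain the inverse covariance directly.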

### II-B Rule’s adaptation

Each new incoming data point is used to adapt the model parameters. In the premise part, only the most activated rule adapts its center and covariance matrix according to Eq. (7),(8), where is the number of samples of the rule.

(7) | ||||

A | (8) |

The forgetting capacity is introduced in the equation by setting (see [Manuel2013]), with a threshold that defines the forgetting capacity, and the number of samples that most activated the rule. is often written with a forgetting factor where when there is no forgetting capacity.
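The premise update can be sketched as a recursive mean and covariance estimate. The version below is a minimal sketch under stated assumptions, not the paper's Eq. (7),(8): it uses the standard recursive (population) covariance update, and it models forgetting by capping the effective sample count at a hypothetical threshold `n_max`, so that each new point keeps a weight of at least `1 / n_max`. The function and parameter names are illustrative.

```python
import numpy as np


def update_prototype(center, cov, n, x, n_max=None):
    """One recursive update of a rule's center and covariance.

    n is the number of samples already absorbed by the rule. If n_max is
    given, the effective count is capped at n_max, so old samples are
    progressively forgotten (assumed forgetting scheme).
    """
    n_new = n + 1
    k = n_new if n_max is None else min(n_new, n_max)
    delta_old = x - center                 # deviation from the old center
    center_new = center + delta_old / k
    delta_new = x - center_new             # deviation from the new center
    cov_new = (k - 1) / k * cov + np.outer(delta_old, delta_new) / k
    return center_new, cov_new, n_new


# Without forgetting, the recursion reproduces the batch statistics.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 2))
center, cov, n = data[0].copy(), np.zeros((2, 2)), 1
for x in data[1:]:
    center, cov, n = update_prototype(center, cov, n, x)
```

With `n_max=None` the result matches `data.mean(axis=0)` and `np.cov(data.T, bias=True)`; with a finite `n_max` the estimate follows recent points more closely, at the cost of a noisier covariance.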

The consequent part is learned using a Weighted Recursive Least Squares (WRLS) method.
The membership functions are assumed to be almost constant so that the method converges to the optimal solution. To reduce computation time, local learning of the consequent part is often preferred, and the rules are assumed to be independent so that RLS can be applied to each one.
The conclusion matrix of the rule at time (i.e. after data points) is recursively computed according to:

(9) | ||||

(10) |

With a correlation matrix initialized by where is the identity matrix and a constant often fixed to (see [Almaksour2011],[angelov2004]).

### II-C Rule creation condition

Parameter adaptation is not suited to handling brutal drifts, such as when a class must be represented by several clusters or when the target concept shifts in the feature space. Dealing with brutal drifts requires adapting the structure of the fuzzy system, for example by adding rules. In the context of online learning from scratch, all classes start with one prototype (i.e. one rule). Structure adaptation relies on specific conditions observed on the data, via the existing rules. Several criteria have been defined to detect brutal drifts. Most EFS use a distance-based criterion [Lughofer2018],[Almaksour2011] that compares a threshold (depending on parameters ) with the distance between the prototype of the closest rule (its center ) and the new incoming point . If , then the rule creation criterion is met and a new rule is created on the last incoming point according to Eq. (11).

(11) |

### II-D Discussion on problems in the generalized EFS

Two scales of adaptation co-exist in EFS: the adaptation of rule parameters and the adaptation of the structure. Smooth drifts are tackled by introducing forgetting capacity in the parameter adaptation, whereas brutal drifts are tackled by creating new rules. However, the system is degraded by the parameter adaptation when a brutal drift occurs. Indeed, all rule creation conditions involve a trade-off between speed of detection and sensitivity to noise. In every case, there is a delay between the true occurrence of the drift and its detection. As shown in Figures 1 and 2, during this delay, one (or more) rules try to adapt to the drift and change their parameters, which makes them fit their previous concept less well. When the rule creation is finally triggered, this adaptation is not cancelled, which may leave the old rule unstable. Moreover, the new rule is created on a single point (the one that triggers the rule creation), although all points during could have been used to initialize it. As a result, the new rule takes longer to adapt to the new concept.

## III Contribution: ParaFIS system

To achieve both stability and plasticity, we present the ParaFIS system in this section.

### III-A Model’s architecture

As shown in Figure 3, the proposed model is based on a generalized EFS similar to the one described in Section II. This part is dedicated to smooth drifts and keeps stability (no structural change). In addition, for each rule , an anticipation module is added to deal with brutal drifts. This anticipation module is composed of two sub-rules and that have an antecedent part (a center, a covariance matrix, a Cauchy membership function) and a consequent part with hyperplanes ( the number of classes). The system classifies the data at any time independently of the anticipation module, as in the generalized EFS. The sub-rules are only used in the learning phase, where different forgetting factors are applied so that they adapt differently to the distribution of points over time. Thus, the system gets information at different time scales.

### III-B Model’s learning

Each new point coming into the system will be used to learn both the principal system and the anticipation module. As in the generalized EFS, the principal system will adapt the antecedent part of the most activated rule by updating its center and its covariance matrix using Eq. (7),(8) with a factor (no forgetting capacity). Then, the sub-rules and are updated using the same equations with . The two sub-rules have two different forgetting factors leading to two temporal scales of learning. has a low forgetting factor and is learned on the most recent data whereas has a high forgetting factor and is learned on a long history. In this way, quickly reacts to a change in the distribution of points whereas preserves the old concept with a slow adaptation.

The consequent part in the principal system is learned as usual (Eq. (9),(10)). In the anticipation module, , have the same consequent part as (the hyperplanes with ).
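The effect of learning the same stream with two forgetting factors can be illustrated on a one-dimensional toy stream. The sketch below is a simplification made for the example: it replaces the center update of Eq. (7) with a plain exponentially weighted mean, and the factor values 0.90 and 0.99 are arbitrary choices, not the paper's settings.

```python
def ewma_update(mean, x, forgetting):
    """Exponentially weighted mean: a forgetting factor close to 1 keeps a long memory."""
    return forgetting * mean + (1.0 - forgetting) * x


# Stream with an abrupt shift: 200 points at 0.0, then 50 points at 5.0.
stream = [0.0] * 200 + [5.0] * 50

fast_mean, slow_mean = stream[0], stream[0]
for x in stream[1:]:
    fast_mean = ewma_update(fast_mean, x, forgetting=0.90)  # short memory
    slow_mean = ewma_update(slow_mean, x, forgetting=0.99)  # long memory
```

After the shift, the center with the low forgetting factor has almost reached the new concept, while the one with the high forgetting factor is still closer to the old concept. This gap between the two centers is what a separability test between the two sub-rules can exploit.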

### III-C Detection of brutal concept drifts

Contrary to current rule creation criteria, which use no neighborhood information [Lughofer2018],[Almaksour2011], we propose to integrate a brutal drift detector based on a cluster separability criterion. The idea is to assume that if the two sub-rules of the anticipation module are sufficiently separated, then a brutal drift has occurred. In that case, learned over the long history matches the old concept and learned over the last few points matches the new drifted concept. Eq. (12) presents the proposed separability criterion, which is based on the covariance of both clusters.

(12) |

Where, as depicted in Figure 4, (resp. ) is the distance between (resp. ) and the envelope of the hyper-ellipsoid of cluster (resp. ), along axis.

In addition, to force the rules to take into account a certain number of points (20 by default) before deciding to create a new rule, the following inertia criterion is added:

(13) |
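One way to read the separability and inertia conditions is sketched below. It is an interpretation, not the paper's Eq. (12),(13): the two sub-rule clusters are declared separated when the distance between their centers exceeds the sum of the two ellipsoid extents measured along the axis joining the centers, and the inertia condition is taken to require that both sub-rules have seen at least `min_points` samples. The `scale` parameter (Mahalanobis radius of the envelope) and all function names are assumptions.

```python
import numpy as np


def extent_along(cov, direction, scale=1.0):
    """Distance from an ellipsoid's center to its envelope along `direction`.

    The envelope is taken as the set of points at Mahalanobis distance `scale`.
    """
    u = direction / np.linalg.norm(direction)
    return scale / np.sqrt(u @ np.linalg.solve(cov, u))


def brutal_drift_detected(c1, cov1, n1, c2, cov2, n2, min_points=20, scale=1.0):
    """Assumed criterion: both sub-rules have seen enough points and their
    ellipsoids do not overlap along the axis joining the two centers."""
    if min(n1, n2) < min_points:
        return False  # inertia: not enough evidence yet
    axis = c2 - c1
    gap = np.linalg.norm(axis)
    if gap == 0.0:
        return False  # identical centers cannot be separated
    reach = extent_along(cov1, axis, scale) + extent_along(cov2, axis, scale)
    return bool(gap > reach)


# Toy usage: two unit-covariance clusters, far apart and then close together.
far = brutal_drift_detected(np.zeros(2), np.eye(2), 50,
                            np.array([5.0, 0.0]), np.eye(2), 30)
near = brutal_drift_detected(np.zeros(2), np.eye(2), 50,
                             np.array([1.5, 0.0]), np.eye(2), 30)
```

A larger `scale` makes the detector more conservative, since the envelopes grow and the centers must move further apart before a new rule is created.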

### III-D Initialization of new rules based on the anticipation module

If the rule creation conditions are met for the most activated rule from the principal system, then is replaced by and in the principal system. Then for each new of the principal system, two new sub-rules , are initialized in the anticipation module, as follows:

(14) |

The idea is to keep the learned information of in the sub-rule , and to initialize the second from scratch. All steps of the learning are summarized in Algorithm 1.

## IV Experimental Validation

### IV-A Evaluation protocol

#### IV-A1 Prequential test with artificial brutal drift

To evaluate the performance of an online classifier, it is common to use the prequential test [Gama2013]. In this test, data are given one by one to the system. The system first tests each new data point to get a score (1 if it is correctly classified, 0 otherwise) and then learns from it. In this way, all the data are used to test the system and then to train it, while maintaining independence between the two phases. Scores are then averaged over a window (of size ) to get a smooth curve of the performance over time. This test simulates a real use case, as in many online applications where the system must adapt to the behavior of a user over time. In the following experiments, (except when plotting the figures, where the score is smoothed).
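The test-then-train loop described above can be sketched as follows. The `MajorityClassifier` is a toy stand-in used only to make the example runnable; any incremental model exposing `predict` and `learn` methods (names assumed here) could be plugged in, and the window size of 50 is arbitrary.

```python
from collections import deque


class MajorityClassifier:
    """Toy incremental classifier: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = {}

    def predict(self, x):
        if not self.counts:
            return None
        return max(self.counts, key=self.counts.get)

    def learn(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1


def prequential(model, stream, window=50):
    """Test-then-train loop; returns the windowed accuracy after each sample."""
    recent = deque(maxlen=window)
    curve = []
    for x, y in stream:
        recent.append(1 if model.predict(x) == y else 0)  # test first
        model.learn(x, y)                                  # then train
        curve.append(sum(recent) / len(recent))
    return curve


# Toy stream whose label changes abruptly after 100 samples.
stream = [(None, "a")] * 100 + [(None, "b")] * 300
curve = prequential(MajorityClassifier(), stream, window=50)
```

On this toy stream the windowed score drops to zero after the label change and only recovers once the new label becomes the majority, which is the kind of slow reaction that a forgetting or rule creation mechanism is meant to shorten.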

No existing dataset is annotated with the nature of the drifts or with their occurrence times, which makes the evaluation and comparison of online classifiers difficult. To make the task easier, the paper proposes to generate artificial brutal drifts in real data at chosen times.
To do so, each dataset is split into three sub-datasets , , to create a specific data stream. As illustrated in Figure 5, is composed of the first data belonging to classes. (resp. ) is composed of the data in the stream between and (resp. between and ); these data belong to (resp. ) different classes and are relabeled with the labels to produce the brutal drift. In this way, brutal drifts occur at time and others at time . Hereafter, we call this approach the protocol .
In the context of a command-gesture system, the protocol is equivalent to a user changing the gesture of a command while keeping the possibility to make the old gesture.
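A possible construction of such a stream is sketched below, using the parameter names of Table II (`T1`, `T2`, `T3`, `n1`, `n2`, `n3`). Several details are assumptions made for the example: the classes are grouped in their order of appearance, each phase contains only data from its own group of classes, and the j-th class of a later group is given the label of the j-th class of the first group. The function name and the toy data are hypothetical.

```python
def build_drift_stream(samples, class_order, n1, n2, n3, t1, t2, t3):
    """Build a three-phase stream with relabeling-induced drifts.

    samples: list of (features, label); class_order: labels in file order.
    Returns a list of (features, label) of length at most t3.
    """
    g1 = class_order[:n1]
    g2 = class_order[n1:n1 + n2]
    g3 = class_order[n1 + n2:n1 + n2 + n3]
    # Assumed mapping: the j-th class of a later group takes the label of the
    # j-th class of the first group (wrapping around if the group is larger).
    from_g2 = {c: g1[j % n1] for j, c in enumerate(g2)}
    from_g3 = {c: g1[j % n1] for j, c in enumerate(g3)}

    phase1 = [(x, y) for x, y in samples if y in g1][:t1]
    phase2 = [(x, from_g2[y]) for x, y in samples if y in g2][:t2 - t1]
    phase3 = [(x, from_g3[y]) for x, y in samples if y in g3][:t3 - t2]
    return phase1 + phase2 + phase3


# Toy usage: 6 classes, 2 per phase, 10 samples per class.
samples = [((c, i), c) for i in range(10) for c in "abcdef"]
stream = build_drift_stream(samples, list("abcdef"), 2, 2, 2, t1=8, t2=14, t3=20)
```

After each phase boundary, a known label suddenly points to data from a previously unseen class, which reproduces a brutal drift at a known time.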

#### IV-A2 Benchmark dataset

Given the command-gesture context, three datasets from the handwritten pattern recognition community have been chosen to assess the performance of the system: PenDigits [PenDigits], Letters [Letters] and LaViola [Laviola]. They are all available in the UCI machine learning repository [UCI]. All three datasets contain different features extracted from the handwritten patterns (digits, letters or symbols handwritten by different writers). A description of each dataset is given in Table I.

Dataset | Classes | Features | Samples | Scriptwriters |
---|---|---|---|---|
Letters | 26 | 16 | 20000 | 20 |
LaViola | 48 | 50 | 16891 | 34 |
PenDigits | 10 | 16 | 10992 | 44 |

These datasets are built in a static context meaning that there is no order between data. Thus, the prequential test with artificial brutal drifts can be done several times () by shuffling the dataset. All results given thereafter are averaged over prequential tests with . The experimental parameters are given in Table II. The classes of each dataset follow the order of occurrence in the data file from UCI.

Dataset | T1 | T2-T1 | T3-T2 | n1 | n2 | n3 |
---|---|---|---|---|---|---|
Letters | 2000 | 4000 | 4000 | 10 | 10 | 6 |
PenDigits | 2000 | 3000 | 3000 | 4 | 3 | 3 |
LaViola | 2000 | 3000 | 3000 | 10 | 10 | 10 |

#### IV-A3 Characterization of the plasticity and stability

In order to measure the plasticity and stability of the system, we propose to fit the prequential score with a handcrafted model given by Eq. (15).

(15) |

Where we define a characteristic time , which represents the reaction time (i.e., the plasticity). In particular, after a time , the score has covered 63% of the gap to its steady-state value. At the steady state, the score is given by . A least-squares method is used to determine the parameters that best fit the curve with the model. The example fits on the Letters dataset (obtained with the protocol P), given in Figure 6, show that this model fits the prequential score well.
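The fitting step can be sketched with a simple saturation model. The form used below, a score rising as `s_inf * (1 - exp(-t / tau))` from the drift time, is an assumption consistent with the 63% property stated above, not necessarily the exact Eq. (15). The names `saturation`, `fit_saturation`, `s_inf` and `tau` are illustrative, and the grid search stands in for whatever least-squares solver is actually used.

```python
import math


def saturation(t, s_inf, tau):
    """Assumed score model: exponential rise toward the steady-state score."""
    return s_inf * (1.0 - math.exp(-t / tau))


def fit_saturation(times, scores, tau_grid):
    """Least-squares fit: grid search on tau, closed-form s_inf for each tau."""
    best = None
    for tau in tau_grid:
        basis = [1.0 - math.exp(-t / tau) for t in times]
        denom = sum(b * b for b in basis)
        if denom == 0.0:
            continue
        s_inf = sum(b * y for b, y in zip(basis, scores)) / denom
        sse = sum((s_inf * b - y) ** 2 for b, y in zip(basis, scores))
        if best is None or sse < best[0]:
            best = (sse, s_inf, tau)
    return best[1], best[2]


# Synthetic check: recover known parameters from noiseless data.
times = list(range(0, 2000, 10))
scores = [saturation(t, 0.93, 200.0) for t in times]
s_inf_hat, tau_hat = fit_saturation(times, scores, tau_grid=range(10, 1001, 10))
```

For a fixed `tau` the model is linear in `s_inf`, so the amplitude has a closed-form least-squares solution and only `tau` needs to be searched. A smaller fitted `tau` indicates a more plastic system, and a higher `s_inf` a better steady-state score.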

### IV-B Evaluation of the anticipation

#### IV-B1 Importance of covariance matrix initialization

No study has yet tackled the optimization of the initialization parameters of the antecedent part.
Yet the initialization of the prototype (the antecedent part) greatly influences the score. To show this, we compare three methods currently used in incremental fuzzy systems to initialize the covariance of a new rule: Method I1 (Eq. (16), [Manuel2013]), Method I2 (Eq. (17), [Lughofer2018],[lemos2011]) and Method I3 (Eq. (18)). For all methods, the center of the prototype is initialized on the last point only.

(16) | ||||

(17) | ||||

(18) |

To compare them, we use the generalized evolving fuzzy system described in Section II, with Condition 1 and Condition 2 to create rules (Section III). Results are obtained on the three datasets and fitted using Eq. (15). They are displayed in Figure 7, column A, and the fitted parameters are shown in Table III. The results show that the best choice of initialization depends on the dataset. However, it is clear that the initialization of the rules strongly impacts the results. For instance, there is of difference for the parameters between methods and for the Letters dataset in phase B. This highlights the importance of initializing the covariance matrix of the new rule well. This can be explained by the fact that the covariance matrix plays a major role in the learning. Indeed, only the antecedent part of the most activated rule is learned on the new incoming point, and the activation, computed with a Mahalanobis distance, depends crucially on the shape of the covariance matrix. If it is too small, few data from the new concept will activate the new rule. If it is too big, data from other concepts may also activate the new rule.

Dataset | Method | A: steady-state score (%) | A: characteristic time | B: steady-state score (%) | B: characteristic time | C: steady-state score (%) | C: characteristic time |
---|---|---|---|---|---|---|---|
Letters | I1 | 93.4 | 218 | 87.9 | 443 | 94.0 | 377 |
Letters | I2 | 93.4 | 215 | 83.3 | 556 | 89.8 | 697 |
Letters | I3 | 92.6 | 157 | 89.9 | 243 | 94.1 | 276 |
PenDigits | I1 | 98.9 | 80 | 99.3 | 29 | 98.1 | 233 |
PenDigits | I2 | 98.9 | 80 | 98.7 | 59 | 96.9 | 415 |
PenDigits | I3 | 98.9 | 77 | 98.7 | 36 | 97.4 | 235 |
LaViola | I1 | 98.7 | 66 | 96.7 | 88 | 96.5 | 131 |
LaViola | I2 | 98.7 | 68 | 96.7 | 90 | 96.3 | 135 |
LaViola | I3 | 98.2 | 67 | 97.1 | 66 | 97.2 | 92 |

#### IV-B2 Comparison with our proposition

We now evaluate the impact of the anticipation. The antecedent part initialization using the best method among {I1, I2, I3} for each dataset is compared with the initialization using the anticipation module (Eq. (14)).
However, the drift detector can depend on the parameters of the system, as ours does (Eq. (12) depends on the covariance matrices). Thus, different prototype initializations change the detection times. To avoid this bias, the drift detector is first used on the system with anticipation, and the times , at which the drifts are detected, are saved. Then, rather than using the detector, we use the saved times to create rules. In this way, the systems can be compared fairly, without any bias from the detection.

Two sets of parameters are used for the anticipation, and with (resp. ) the forgetting factor of the sub-rules (resp. ) of the secondary system for .

Moreover, a comparison is made with our own implementation of an EFS similar to Gen-Smart EFS [Lughofer2018], called GEFS*. In GEFS*, the rule creation criterion (given by Eq. (19)-(20)) is the same as in Gen-Smart EFS [Lughofer2018], with the same rule initialization method (I2). However, unlike Gen-Smart EFS, it has no rule merging or rule splitting methods.

(19) |

(20) |

Results are displayed in Figure 7, column B, and in Table IV. The fitted parameters of each configuration are averaged over the three phases. The mean accuracy score is also given to compare the global performance of each configuration. The results show that anticipation brings a gain not only in reactivity, with a lower time , but also in the score reached in the steady state. This results in a better mean accuracy score. It means that the anticipation effectively accelerates the learning of the new concept when new rules are created, and also stabilizes the covariance matrix so that it better matches the target concept in the end. Moreover, our global system ParaFIS, with detector and anticipation, outperforms, in terms of reactivity and stability, an equivalent system GEFS*, with detector and initialization from the state of the art [Lughofer2018].

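Dataset | Configuration | Steady-state score (%) | Characteristic time | Mean accuracy (%) |
---|---|---|---|---|
Letters | Anticipation (first parameter set) | 94.3 | 200 | 90.3 |
Letters | Anticipation (second parameter set) | 93.8 | 188 | 90.1 |
Letters | I3 | 92.2 | 214 | 88.1 |
Letters | GEFS* | 89.9 | 471 | 80.7 |
PenDigits | Anticipation (first parameter set) | 98.8 | 56 | 97.9 |
PenDigits | Anticipation (second parameter set) | 98.9 | 56 | 98.0 |
PenDigits | I1 | 98.8 | 103 | 97.2 |
PenDigits | GEFS* | 98.7 | 156 | 96.3 |
LaViola | Anticipation (first parameter set) | 98.2 | 74 | 96.2 |
LaViola | Anticipation (second parameter set) | 98.2 | 84 | 95.9 |
LaViola | I3 | 97.7 | 76 | 95.7 |
LaViola | GEFS* | 97.1 | 158 | 94.4 |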

## V Conclusion & Outlooks

This paper has introduced a new EFS design that integrates an anticipation module to deal with brutal drifts. Anticipation opens up an interesting new way to tackle non-stationary environments. It makes it possible to detect brutal drifts and to adapt new rules to the drifted concept with a gain in reactivity and stability. The new design also opens new perspectives, since the two sub-rules map the rule creation problem onto a two-cluster clustering problem. Future work may use other online cluster validity criteria to detect brutal drifts, or add new anticipation modules to tackle other kinds of drifts, such as gradual drifts (where data switch from one target concept to another several times).
