Ensuring Learning Guarantees on Concept Drift Detection with Statistical Learning Theory

06/24/2020 ∙ by Lucas Pagliosa, et al. ∙ Universidade de São Paulo

Concept Drift (CD) detection aims to continuously identify changes in data stream behaviors, supporting researchers in the study and modeling of real-world phenomena. Motivated by the lack of learning guarantees in current CD algorithms, we take advantage of the Statistical Learning Theory (SLT) to formalize the requirements necessary to ensure probabilistic learning bounds, so that reported drifts refer to actual changes in data rather than to chance. As discussed throughout this paper, a set of mathematical assumptions must hold in order to rely on SLT bounds, and some of them are especially controversial in CD scenarios. Based on this issue, we propose a methodology to address those assumptions in CD scenarios and therefore ensure learning guarantees. In addition, we assess a set of relevant and well-known CD algorithms from the literature in light of our methodology. As our main contribution, we expect this work to support researchers while designing and evaluating CD algorithms in different domains.

1 Introduction

Data streams are seen as open-ended sequences of uni- or multidimensional observations rather than batch-driven datasets (lichman:uci:13) (footnote 1: in the context of this paper, we approach unidimensional streams; however, our strategy can be extended to address multiple dimensions without loss of generality, as performed in (serra:09:njp)). Those observations are generated by processes modeled as stochastic and/or chaotic dynamical systems (kantz:97:book), which simulate several phenomena at different periods of time (agarwal1995dynamical; Metzger1997; RIOS201511). Nevertheless, those processes or their parameters may eventually change over time, such as how a disease or medicine affects someone’s blood temperature (Andrievskii2003). Those cases, hereinafter referred to as Concept Drifts (CD), are important to consider while modeling data, especially chaotic data, whose systems present recurrent behaviors. Later, specialists can analyze those anomalous and decisive instants to better understand the studied phenomenon.

In practice, CD algorithms compare features from current to next observations in order to detect relevant changes (gama:14:acm; lu:kde:18). Such features are usually represented by classification performances (gama:sac:04; EDDM; bifet:sigkdd:09) or by statistical measurements (gama:14:acm; page:bio:54; bifet:sigkdd:09). Although the former methods generally lead to more robust results, they demand class labels to perform supervised learning, which may be impractical when dealing with data continuously collected over time. Conversely, statistical-based methods have the advantage of requiring no labels, but their simplistic models are usually not enough to properly distinguish general processes, especially when dealing with non-stationary and chaotic phenomena (costa:eswa:16).

Despite their relevant contributions, neither branch provides learning guarantees to support CD detection, although some authors claim that performance measures such as Mean Time between False Alarms (MTBFA), Mean Time for Detection (MTD) and Missed Detection Rate (MDR) (costa:eswa:17) can ensure such a commitment, e.g., in terms of accuracy. In that sense, we remind the reader that those measures cannot be trusted when the algorithm generalizes poorly (by under- or overfitting the data). In an extreme case, a CD algorithm that randomly issues drifts (such as by flipping a coin) or whose model has memorized all training data may still provide adequate performances according to those measurements. Thus, instead of considering specific measurements on particular scenarios, we propose a general and formal approach to ensure CD detection relying on the Statistical Learning Theory (SLT), proposed by vapnik:98:book. In summary, our strategy provides the necessary probabilistic foundation to ensure learning while analyzing data streams. As a consequence, drifts are not reported by chance, and validation methods can be fully employed later on.

According to the SLT, learning occurs when the empirical risk (computed over the training data) does not significantly differ from the expected risk (inferred over the whole population) and the empirical risk is small enough according to the specialist (mello:book:18). In order to formulate such a theoretical framework, Vapnik had to employ a strategy to prove the convergence of the adopted classifier to the best possible model inside the algorithm bias, which motivated him to consider the Law of Large Numbers (LLN) (devroye:96:book) in this process. However, in order to employ the LLN, Vapnik had to follow a set of assumptions, two of which are especially controversial in CD scenarios: (i) input data must be sampled in an independent and identically distributed (i.i.d.) manner; and (ii) the Joint Probability Distribution (JPD), mapping the relationship between input and output spaces, must be static/fixed.

This controversy arises, firstly, from the fact that real-world data streams typically present time-dependent observations, given that they represent the same phenomenon over different timestamps. Secondly, if a CD is supposed to happen, some changes in the JPD are also expected to occur. Even so, given that the SLT is the most complete framework to ensure learning for supervised algorithms (we are not aware of any other theoretical basis that tackles the problem at the same level), we address such drawbacks by proposing a set of adaptations to satisfy both requirements. More precisely, our goal in this manuscript is to elaborate the necessary conditions a CD algorithm should satisfy to ensure learning while reporting drifts. It is worth making clear, however, that we do not intend to propose a new algorithm in the process. Nonetheless, we have analyzed some of the CD literature in light of our theoretical point of view to show that many algorithms are, in fact, in disagreement with the SLT. This does not mean those algorithms will not work or generate fair results, but that no theoretical guarantee can be derived from their application. Lastly, after employing our strategy to ensure learning bounds, other performance measures can be safely used to validate the quality of reported drifts.

The remainder of this manuscript is structured as follows. Section 2 introduces the background involving CD algorithms, the SLT and the main concepts from the area of Dynamical Systems. Section 3 shows the related work on theoretical guarantees in CD problems. Section 4 discusses our proposed methodology to ensure learning bounds while tackling the concept drift scenario. Section 5 describes important algorithms on concept drift detection, highlighting when they satisfy (or not) Vapnik’s assumptions. Concluding remarks are drawn in Section 6.

2 Concepts and Nomenclature

This section introduces the CD nomenclature used in this manuscript. In addition, Dynamical System concepts, such as phase space reconstruction, are briefly covered. Lastly, a short description of the SLT and its assumptions is given. Although not commonly found in the concept-drift literature, the two latter topics must be presented in order to explain our methodology.

2.1 Concept Drift Detection

Let a data stream be defined as the sequence of observations:

$$S = \{x_0, x_1, \ldots, x_t, \ldots\} \tag{1}$$

describing the behavior of some phenomenon over time. Moreover, a data stream defines a continuous flow of incoming data, whose observations are derived from (potentially) multiple Joint Probability Distributions (JPDs). Thus, given a data window $W_i$ representing the $i$th set of observations in a specific interval of time, we have that $W_i \subseteq \{x_0, \ldots, x_t\}$ (see Figure 1), where $t$ is the current timestamp.

Figure 1: A data stream divided into non-overlapping windows (red boxes), each containing $n$ observations.

Additionally, although the configuration of windows may vary from application to application, it is common to assume a fixed length $n$ for every window, without overlapping observations, such that:

$$W_i = \{x_{in}, x_{in+1}, \ldots, x_{(i+1)n-1}\} \tag{2}$$

Thus, if $t_0$ represents the timestamp from which observations started to be collected from some phenomenon (initially set to zero), a CD algorithm induces the indicator function:

$$g_{t_0}(W_i) = \begin{cases} 1, & \text{if a drift is reported at } W_i, \\ 0, & \text{otherwise,} \end{cases} \tag{3}$$

which basically classifies whether or not the current window continues to represent the same phenomenon as past windows. To avoid redundancy, the index of $g$ is omitted unless explicitly necessary to express a specific time (Section 3). Moreover, function $h(\cdot)$ extracts the vector of features $\mathbf{v}_i$ from the model:

$$f_i : X_i \to Y_i \tag{4}$$

where $X_i$ and $Y_i$ are the input and output (class) spaces of window $W_i$, respectively, derived either after applying dynamical system reconstructions or by using statistical measurements. Those spaces are necessary in any CD algorithm to perform supervised learning, and more details about both are given in Section 4.1. In addition, $\mathbf{v}_i$ can simply be the result of $f_i$ itself, so that $h$ is the identity function (e.g., if the model is based on the mean, variance, or entropy of the window) or, in more sophisticated scenarios, $h$ can be a more complex kernel function (e.g., if $f_i$ is inferred from an Artificial Neural Network (haykin:09:book), features could be given by unit weights or the results of activation functions).

Formally, we define $\mathbf{v}_i$ as the features obtained for the current window $W_i$, and $V$ as the set of features extracted from past windows of the same phenomenon. In this context, the CD algorithm, here responsible for inducing function $g$, reports a drift whenever $\mathbf{v}_i$ differs from $V$ by more than an acceptable threshold. On the other hand, if the divergence between features is acceptable, then $g$ infers that $W_i$ and the past windows belong to the same phenomenon, and the analysis is carried on to the next windows. At this step, $g$ has two options: either it updates the set of features in the form $V \leftarrow V \cup \{\mathbf{v}_i\}$ or it merely sets $V \leftarrow \{\mathbf{v}_i\}$ ($i$ is naturally incremented in both cases). Note, however, that in the former scenario, $|V|$ is much greater than $|\mathbf{v}_i|$. Thus, $g$ must either perform aggregations or apply kernel functions to make sure that $V$ has the same number of features as $\mathbf{v}_i$, so that a fair comparison between them is made. In this context, the following strategy is commonly used to compare them:

$$g(W_i) = \begin{cases} 1, & \text{if } |\mathbf{v}_i - \mu_V| > \lambda \sigma_V, \\ 0, & \text{otherwise,} \end{cases} \tag{5}$$

where $\mu_V$ and $\sigma_V$ are the average and standard deviation of past features, and $\lambda$ controls the detection sensitivity.
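The comparison strategy described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each window is summarized by a single scalar feature, and the names `reports_drift`, `update_or_reset` and `sensitivity` are hypothetical, not taken from any cited algorithm.

```python
from statistics import mean, stdev


def reports_drift(past_features, current_feature, sensitivity=3.0):
    """True when the current feature lies more than `sensitivity` standard
    deviations away from the average of the past features."""
    if len(past_features) < 2:
        return False  # too little history to estimate a standard deviation
    mu = mean(past_features)
    sigma = stdev(past_features)
    return abs(current_feature - mu) > sensitivity * sigma


def update_or_reset(past_features, current_feature, sensitivity=3.0):
    """Return (new_history, drift_flag): accumulate the feature when no drift
    is reported, otherwise restart the history from the current window."""
    if reports_drift(past_features, current_feature, sensitivity):
        return [current_feature], True
    return past_features + [current_feature], False
```

Accumulating features when no drift is reported corresponds to the first of the two update options discussed in the text; replacing the history with the current feature on every step would correspond to the second.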

2.2 Dynamical Systems

This section describes the Dynamical System concepts necessary to automatically derive the input and output spaces from a window. Although those sets can be generated in different ways, by means of Fourier coefficients (bracewell:78:book) or statistical measurements, for instance, we suggest deriving them from states in the phase space for a more robust modeling, as we explain next. Furthermore, this section presents some concepts not commonly adopted in the CD literature. Still, they are important to satisfy the SLT assumptions, from our point of view.

A dynamical system is composed of a set $\Phi$ of $d$-dimensional states $\phi_t$ that, driven by a generating (a.k.a. governing) rule $R$, models the behavior of some phenomenon as a function of state trajectories, such that:

$$\phi_{t+1} = R(\phi_t) \tag{6}$$

where $d$ corresponds to the number of degrees of freedom the system has, i.e., the number of variables/dimensions needed to describe $R$.

Moreover, when the number of states is enough to represent all system dynamics, then the set of states is referred to as the phase space, in which the time variable is no longer explicitly required for modeling, given that all possible trajectories are bound by the structure of a potentially low-dimensional manifold known as the attractor (alligood:96:book). Figure 2(a) illustrates such a structure for the well-known $3$-dimensional Lorenz system (tucker:1999).

Figure 2: (a) Continuous Lorenz attractor. (b) Discrete reconstruction of the Lorenz attractor using . As it can be noticed, no matter the order of how phase states are drawn, the attractor is still depicted the same way, which reinforces the fact that states are independent from each other.

As a consequence, phase states become independent and can be treated as identically distributed, which satisfies the i.i.d. assumption required by the SLT (pagliosa:eswa:17).

When dealing with data streams that represent observations collected from a single dimension of such a phase space, we rely on the embedding theorem proposed by takens:1981, a direct extension of Whitney’s studies in the area of differential manifolds (whitney:1936), to reconstruct the phase space from the observations. This embedding theorem has been widely employed and confirmed to be the most adequate to reconstruct different types of dynamical systems (ravindra:1998). As proposed, a time series (a finite sequence of observations) in the form $T = \{x_0, x_1, \ldots, x_{n-1}\}$ can be reconstructed into a set of states whose structure is a manifold diffeomorphic ($\cong$) to the original phase space, if states are in the form:

$$\phi_t = (x_t, x_{t+\tau}, \ldots, x_{t+(m-1)\tau}) \tag{7}$$

in which $m$ corresponds to the embedding dimension and $\tau$ is the time delay. Figure 2(b) shows the result of the reconstruction for the Lorenz data (tucker:1999). In our context, $T$ corresponds to some window (Equation 2).
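As an illustration of the delay-coordinate reconstruction just described, the following minimal Python sketch builds phase states from a window of observations. The function name `takens_embedding` and the choice of plain tuples as states are assumptions made for this example; selecting the embedding dimension and the time delay is a separate problem that is not addressed here.

```python
def takens_embedding(series, m, tau):
    """Build delay-coordinate states (x_t, x_{t+tau}, ..., x_{t+(m-1)tau}).

    `m` is the embedding dimension and `tau` the time delay; both must be
    positive integers and the series must be long enough for one state.
    """
    if m < 1 or tau < 1:
        raise ValueError("m and tau must be positive integers")
    n_states = len(series) - (m - 1) * tau
    if n_states <= 0:
        raise ValueError("series is too short for this embedding")
    return [
        tuple(series[t + j * tau] for j in range(m))
        for t in range(n_states)
    ]
```

For example, `takens_embedding([0, 1, 2, 3, 4, 5], m=3, tau=2)` yields the two states `(0, 2, 4)` and `(1, 3, 5)`.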

2.3 Statistical Learning Theory

In the context of supervised learning, data is divided into input and class spaces, represented as $X$ and $Y$, respectively, whose instances are sampled from a Joint Probability Distribution (JPD) $P(X \times Y)$ and then employed to induce some learning model $f: X \to Y$, which is expected to provide the smallest possible error or loss. Based on this definition, the Statistical Learning Theory (SLT) (vapnik:98:book) provides a theoretical framework to ensure that the model $f$ will generalize on unknown examples similarly to how it performs on known data. As a consequence, the SLT supports finding good estimates for the expected value of some loss function, allowing one to infer it based on performance measurements computed on known (training/test) samples.

Given a problem and its representative loss function $\ell(\cdot)$, vapnik:98:book defined the empirical risk, i.e., the error computed over some sample of $n$ examples, as:

$$R_{emp}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(x_i, y_i, f(x_i)) \tag{8}$$

and the expected (a.k.a. real or actual) risk, which is computed by assessing the whole JPD $P(X \times Y)$, in the form:

$$R(f) = E\left[\ell(X, Y, f(X))\right] \tag{9}$$

with $E$ being the expected value. In this context, Vapnik relied on the Law of Large Numbers (LLN) (revesz:67:book) to prove the following:

$$\lim_{n \to \infty} P\left(\sup_{f \in \mathcal{F}} |R_{emp}(f) - R(f)| > \epsilon\right) = 0 \tag{10}$$

i.e., the empirical risk of a classifier probabilistically converges to the expected risk as the sample size tends to infinity, considering $f \in \mathcal{F}$ (the set of functions the algorithm can admit), $\epsilon > 0$, and that some assumptions are satisfied (as described next). From this statement, a.k.a. the Empirical Risk Minimization Principle (ERMP), Vapnik defined that a model generalizes when the difference $|R_{emp}(f) - R(f)|$ approaches zero, and learns when it generalizes and its empirical risk is small enough according to the specialist.

As Equation 9 cannot be calculated in practice, since it demands an infinite number of observations to analyze the whole joint space, Vapnik used the Symmetrization Lemma (devroye:96:book) to quantify learning in real scenarios as follows:

$$P\left(\sup_{f \in \mathcal{F}} |R_{emp}(f) - R(f)| > \epsilon\right) \leq 2 P\left(\sup_{f \in \mathcal{F}} |R_{emp}(f) - R'_{emp}(f)| > \frac{\epsilon}{2}\right) \tag{11}$$

where $\epsilon$ is the acceptable divergence for the probabilistic measure, $\mathcal{F}$ is the set containing all functions an algorithm is capable of admitting, a.k.a. the algorithm bias, and $R_{emp}(f)$ and $R'_{emp}(f)$ are the empirical risks of two different samples having the same size $n$. In summary, this lemma states that if two independent samples have empirical risks that do not diverge by more than $\epsilon/2$ as $n$ increases, then the empirical risk is expected to be a good estimator for the expected risk.

Finally, it is worth mentioning that the algorithm bias $\mathcal{F}$ must be in parsimony with the Bias-Variance Dilemma (BVD) (luxburg:11:book), otherwise Equation 11 is inconsistent. In other words, if the class of functions in $\mathcal{F}$ has a weaker bias (less restrictive), then $\mathcal{F}$ contains many more distinct functions to represent some training set, which generally leads to overfitting. Conversely, if $\mathcal{F}$ has a strong bias (more restrictive), fewer functions compose such a space, making it prone to underfitting. Thus, a balanced complexity of the function class is recommended to achieve the best possible risk minimization (geman:92:nc). This is important because two underfitted models, for instance, might still provide similar empirical errors, as they both learned from the average, but they will not generalize on future data.

While elaborating the SLT, Vapnik took advantage of the LLN to formulate and prove learning bounds for the ERMP and the Symmetrization Lemma. Due to his formulation, the learning guarantees only hold if the following assumptions are satisfied: examples must be independent from each other and sampled in an identical manner (A1); no assumption is made about the joint probability distribution (JPD), otherwise one could simply estimate its parameters (A2); labels can assume non-deterministic values due to noise and class overlapping (A3); the JPD is fixed, i.e., it cannot change over time (A4); and, finally, the data distribution is still unknown at the time of training, thus it must be estimated using data examples (A5).

It is simple to observe that assumptions A2, A3 and A5 are straightforward, since they describe most real-world scenarios. However, assumptions A1 and A4 are more difficult to hold, especially in (but not limited to) the CD scenario, in which observations are time dependent and different phenomena (with distinct JPDs) are expected to happen, as widely discussed in the CD related work (gama:14:acm). Nevertheless, even with such controversies, we still choose to rely on the SLT due to its robust framework towards probabilistic convergence in supervised learning. Moreover, we can adapt the CD algorithm to satisfy the SLT assumptions, as we show in Section 4.

3 Related Work

Regarding the task of learning in drifting concepts, we reinforce that validation methods (e.g., accuracy, false positive/negative rates) are insufficient to draw conclusions on that aspect. Although they are a good way to measure the quality of reported drifts on a limited number of observations, those metrics do not give any probabilistic guarantee that the reported performances will continue over time. Regardless, there are a few studies in the literature aiming to support theoretical learning (tsymbal:04). For instance, kuh:nips:1991 relied on PAC-Learning (valiant:acm:1984) to estimate the minimum window length necessary to trust reported drifts, whereas helmbold:ml:1994 proposed a weak upper bound to delimit how fast drifts should occur based on the same framework. As a consequence, both conclude that learning can only be ensured if concept changes occur “slowly” enough according to the window length.

With respect to those articles, we highlight that, despite the contributions of PAC-Learning, such as the introduction of computational complexity theory in Machine Learning, the SLT consists of a much more complete and robust framework for supervised learning. Even so, there are equivalences between both, such that if the algorithm bias is PAC-learnable, then its Vapnik-Chervonenkis (VC) dimension (luxburg:11:book; mello:book:18) should be finite (footnote 2: further comments on the VC-dimension are out of scope; see the cited references for details). In fact, the upper and lower bounds formulated in the previously cited articles explicitly consider the VC-dimension in their equations. Therefore, as the VC-dimension is a direct consequence of the ERMP (Equation 10), one should firstly be in accordance with the SLT requirements (by using our methodology, for instance) in order to rely on such bounds. Lastly, we admit that the VC-dimension is a difficult measurement to compute in practice, which might invalidate both approaches for general cases.

More recently, mello:yuli:18 elaborated a framework to ensure unsupervised learning guarantees based on Algorithmic Stability (AS) (bousquet:jlmr:02), which presents conditions for the probabilistic convergence between an arbitrary function and its expected value using McDiarmid’s Inequality (mcdiarmid:sic:1989). However, as the authors themselves stated, “the SLT has a strong connection to the whole theory employed to ensure Algorithmic Stability” (mello:yuli:18). As can be noticed, the comparison between the current and past windows, proposed by AS, is related to the Symmetrization Lemma (Equation 11), and most of the inequalities used are based on the Law of Large Numbers (LLN), which also requires input data to be i.i.d. Therefore, our methodology could also be applied to their framework.

In conclusion, we observe that the literature lacks supervised learning guarantees in CD scenarios, which has motivated us to associate the SLT with CD detection. In that sense, our methodology comes as a novel but simple approach to ensure that reported drifts are not due to chance, but to actual changes in data behaviors.

4 Ensuring Learning in Concept Drift Scenarios

In this section, we translate some of the presented SLT bounds (Section 2.3) towards the CD context. From that, we elaborate the necessary conditions a CD algorithm should satisfy to meet such theoretical framework.

4.1 Adapting the SLT to CD scenarios

In this section, we summarize all previously presented concepts to introduce our methodology. Firstly, we remind the reader that CD algorithms typically assume the learning model $f_i : X_i \to Y_i$ is inferred over some window $W_i$, where $X_i$ corresponds to the input examples and $Y_i$ is associated with the respective classes. Those classes are usually not available, as it is difficult for a specialist to continuously label observations collected over time, most especially for high-frequency streams.

From that, we conclude that class labels must be somehow devised from the data stream itself in an online fashion. As a consequence, the input examples might also change according to the procedure chosen to determine classes. Two possible strategies have been used to tackle this issue: (i) if the model is the result of a regression performed on the phase space, then each input is a tuple composed of the first $m-1$ dimensions of a phase state, while the respective class label is the value of the $m$th dimension, as shown in Table 1; and (ii) when the class information is the simple result of a measurable function computed over the data window, such as the average, variance, kurtosis, etc., then the input is simply the window and its output comprises the label (bifet:sigkdd:09).
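The two labeling strategies above can be sketched as follows. This is a hedged illustration: the helper names `states_to_supervised` and `window_to_supervised` are hypothetical, phase states are assumed to be tuples whose last component plays the role of the class label, and the average is used only as one example of a measurable function.

```python
from statistics import mean


def states_to_supervised(states):
    """Strategy (i): the first m-1 coordinates of every phase state form the
    input and the last coordinate is the value to be regressed."""
    inputs = [state[:-1] for state in states]
    labels = [state[-1] for state in states]
    return inputs, labels


def window_to_supervised(window, statistic=mean):
    """Strategy (ii): the whole window is the input and a statistic computed
    over it (the average by default) is the label."""
    return list(window), statistic(window)
```

For instance, the states `(0, 2, 4)` and `(1, 3, 5)` give the inputs `(0, 2)` and `(1, 3)` with labels `4` and `5`.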

States
Table 1: Input and output spaces for the Lorenz system, embedded with and (Equation 7) and illustrated in Figure 2(b). In the example, the phase space was reconstructed from the window but, for clarity, its states were represented as a generic time series, such that .

In the next step, function extracts the vector of features from the inferred model , such that . For instance, if is in the form of a neural network, then the empirical risk or the trained weights might represent . The indicator function is then responsible for mapping every feature vector into a binary space, indicating whether a drift has happened given the current data window (Equation 3). If no drift is detected, then can either change its current model to the newest () or update it based on the new features (). As we show, the latter option is the best as continuously approximates the true set of features corresponding to the analyzed phenomenon, allowing us to elaborate the following connection with the ERMP (Equation 10):

(12)

In other words, if the difference between the empirical and true risk decreases as the sample size increases, then the features extracted over time should also converge to the features computed over the entire data population. Ideally, we should set some window length large enough to contain all observations from the analyzed phenomenon. However, this becomes a great challenge since: (i) we do not have access to all observations from the phenomenon, and (ii) several drifts are expected to happen in early windows. Thus, we decided to adapt the Symmetrization Lemma (Equation 11) to represent learning in terms of window features, in the form:

(13)

remembering that represents an aggregation of all measurements for past windows, such that the sample size of and is the same. Therefore, if such a difference is held as new windows are processed, we have probabilistic support that is actually learning from data.

4.2 Satisfying SLT assumptions

In order to rely on Equation 13, however, we must satisfy assumptions A1 and A4 listed in Section 2.3. Moreover, we also need to ensure such equation is consistent by choosing a CD algorithm whose complexity is in parsimony with the BVD (luxburg:11:book).

Firstly, we draw attention to the fact that, as we propose, drifts will occur only between windows, not among observations. According to our approach, the algorithm responsible for inferring the model $f$ is expected to deal with A1, while the indicator function $g$ faces the challenge in A4. Therefore, models should employ some strategy to map observations into a different space, ensuring data becomes i.i.d. For instance, the Fourier transform (bracewell:78:book) could map windows into the frequency space, or Takens’ embedding theorem (takens:1981) may be used to map such observations into phase spaces (footnote 3: we encourage the reader to use this option, as the reconstruction of phase spaces allows a more robust data analysis). In addition, $g$ assumes that each data window may come from distinct but fixed/unique probability distributions, so when this indicator function reports a drift, any previous model should be discarded, allowing a fresh start to analyze the next incoming distribution while still ensuring learning guarantees.

Regarding under/overfitting, one should choose a model $f$ and an indicator function $g$ whose bias complexity is considered moderate according to the BVD (luxburg:11:book). When the model is based on statistical measures, the search space usually consists of a single function, making it more prone to underfitting. Further, such a model is only effective to test particular hypotheses, e.g., when data is statistically stationary (which we claim is unlikely to happen when dealing with real, nonlinear and/or chaotic datasets).

Alternatively, when the model is inferred based on Dynamical System approaches, it usually relies on the distances among phase states based on an open-ball radius $r$, given the topological space of attractors (tu:book:10). In those cases, models using a small $r$ typically overfit, as the model memorizes each state. On the other hand, the use of an excessively large radius makes the model learn from the attractor/space average, leading to underfitting (mello:book:18). Thus, a balanced-complexity model should be based on a fair and adaptive percentage of distances among states, e.g., $r$ can be defined in terms of the open ball containing the $k$ nearest neighbors or some distance quantile computed on the phase space. Regarding the indicator function $g$, the comparison between windows should follow a strategy such as the one defined in Equation 5, since simpler functions would lead to underfitting and more complex indicators to overfitting.
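One way to obtain an adaptive radius of the kind suggested above is to average, over all phase states, the distance to the $k$th nearest neighbor. The sketch below is a brute-force illustration under that assumption; the function name `knn_radius` is hypothetical, and a distance quantile could be used in the same spirit.

```python
import math


def knn_radius(states, k):
    """Average, over all states, of the Euclidean distance from a state to its
    k-th nearest neighbor (the state itself is excluded)."""
    if not 1 <= k < len(states):
        raise ValueError("k must satisfy 1 <= k < number of states")
    total = 0.0
    for i, a in enumerate(states):
        # Sorted distances from state i to every other state
        distances = sorted(
            math.dist(a, b) for j, b in enumerate(states) if j != i
        )
        total += distances[k - 1]
    return total / len(states)
```

The quadratic cost of this version is acceptable for small windows; larger phase spaces would call for a spatial index, which is outside the scope of this sketch.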

In summary, our methodology to satisfy SLT requirements in CD scenarios is composed as follows: (R1) the indicator function $g$ should be updated based on past data, so that the underlying phenomenon is better represented; (R2) the model $f$ must receive i.i.d. data, something ensured by a dedicated pre-processing approach (e.g., Fourier transform or phase space reconstruction); (R3) the function $g$ should compare features from the same JPD, otherwise a reset in $g$ is necessary; (R4) the algorithm bias of both $f$ and $g$ should be in parsimony with the BVD.

5 Concept Drift Algorithms According to our Methodology

As discussed in Section 2.1, a CD algorithm can be divided in two components: (i) the first responsible for extracting features from data windows, using function ; and (ii) the second capable of comparing those features using some indicator function . In light of such perspective, we present state-of-the-art algorithms and highlight how they approach requirements R1–R4. Firstly, the Cumulative Sum (CUSUM) (page:bio:54) algorithm reports a drift whenever an incoming observation is significantly different from the sum of past data. Thus, knowing that is initially set to zero, a drift occurs when:

(14)

in which (here and next) is an acceptable threshold, and consists in a single observation such that (window length ). In this scenario, and correspond to the identity function while is a model directly correlated to the average of such a phenomenon. A drift is reported when results in a value greater than the threshold , and resets the analysis for a new phenomenon (satisfying R3). If negative values are considered, the minimum is used instead of the maximum in Equation 14 and drifts are triggered when is smaller than . In summary, CUSUM respects R1 as is updated along new incoming data. However, the identity functions and do not break time dependency, not satisfying R2 and leading to inconsistencies in . Lastly, R4 is not satisfied since overfits (memorizing the current observation) and underfits (too restrictive bias) data since a single cumulative linear model may not be enough to represent more complex behaviors.
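For reference, the textbook one-sided CUSUM recursion can be sketched as below. This is not a transcription of Equation 14: the `slack` term (the magnitude of change tolerated before accumulating) and the function name `cusum_drifts` are assumptions of this example, while the reset after each reported drift follows the behavior described above.

```python
def cusum_drifts(stream, slack, threshold):
    """Return the indices at which a one-sided (upward) CUSUM reports a drift.

    The cumulative statistic starts at zero, never drops below zero, and is
    reset to zero after every reported drift.
    """
    drifts = []
    cumulative = 0.0
    for t, x in enumerate(stream):
        cumulative = max(0.0, cumulative + x - slack)
        if cumulative > threshold:
            drifts.append(t)
            cumulative = 0.0  # start analyzing a new phenomenon
    return drifts
```

With `stream = [0, 0, 0, 5, 5, 5]`, `slack = 1` and `threshold = 6`, the statistic reaches 4 at index 3 and 8 at index 4, so a single drift is reported at index 4.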

The Page-Hinkley Test (PHT), also proposed by page:bio:54, is a variation of CUSUM (using the same window configuration) in the sense it assesses data changes in terms of standard deviation measures rather than using averages. Thus, given the average estimation , where interval representing the evolution of some phenomenon from the start () to the current window (), PHT reports a drift whenever:

(15)

where , .

Therefore, a drift occurs whenever the cumulative standard deviation exceeds, by more than the acceptable threshold, the minimum standard deviation observed up to the current moment. Similarly to CUSUM, the model is updated as new windows are processed and a reset occurs in case a drift is issued, such that both requirements R1 and R3 are satisfied. However, since the model is computed over a time-dependent sequence of observations, R2 is not respected. In addition, although PHT is slightly more complex than CUSUM, it is still prone to overfitting (failing R4).
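A commonly used formulation of the Page-Hinkley Test accumulates deviations of each observation from the running mean and compares the cumulative value against its historical minimum. The sketch below follows that textbook form, which may differ in detail from Equation 15; the names `page_hinkley_drifts` and `delta` (a tolerance subtracted at every step) are assumptions of this example.

```python
def page_hinkley_drifts(stream, delta, threshold):
    """Return the indices at which the Page-Hinkley Test reports an upward
    drift, resetting all statistics after each report."""
    drifts = []
    count, running_mean, cumulative, minimum = 0, 0.0, 0.0, 0.0
    for t, x in enumerate(stream):
        count += 1
        running_mean += (x - running_mean) / count  # incremental average
        cumulative += x - running_mean - delta
        minimum = min(minimum, cumulative)
        if cumulative - minimum > threshold:
            drifts.append(t)
            count, running_mean, cumulative, minimum = 0, 0.0, 0.0, 0.0
    return drifts
```

On a stream of ten zeros followed by five tens, with `delta = 0.1` and `threshold = 5`, the first large observation raises the gap between the cumulative value and its minimum to roughly 9, so one drift is reported at index 10.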

The Adaptive Sliding Window (ADWIN), proposed by bifet:sigkdd:09, also comprises an extension of CUSUM, but applied over different window configurations. More precisely, given a timestamp between the start and current time of some phenomenon, the data stream is divided into two adaptive windows and , for each . In this context, ADWIN reports a drift whenever:

(16)

in which is the average of . As soon as a drift is issued, then in order to reset the past model and start a new phenomenon (respecting R3). However, the algorithm just compares averages between consecutive windows, taking no advantage from past data to update , thus not satisfying R1. Further, as was inferred directly from data stream observations, R2 is not respected either. Lastly, despite the search space of is greater than the ones considered by CUSUM and PHT (since more windows are taken into account), the usage of an average model and the fact that is too simplistic are still prone to underfitting.

From a different perspective, vallim:eswa:14 proposed the Unidimensional Fourier Transform (UDFT) to infer a model based on the set of Fourier coefficients (bracewell:78:book) of a window of $n$ observations $x_0, \ldots, x_{n-1}$, as follows:

$$c_k = \sum_{t=0}^{n-1} x_t \, e^{-2\pi \mathbf{i} k t / n} \tag{17}$$

where $k \in [0, n-1]$ and $\mathbf{i}$ is the imaginary unit typically used to express the imaginary component of complex numbers, such that $\mathbf{i}^2 = -1$. Moreover, we write $\mathbf{i}$ in bold since the variable $i$ already represents the window index. In this context, the feature function is the identity function and the indicator function reports a drift whenever:

(18)

from which we conclude R1 is not satisfied since simply compares two consecutive windows, so that nothing is learned from past data. Furthermore, R3 is automatically respected since requires no reset. Moreover, the method is in accordance with R2, as Fourier coefficients are independent from each other. Lastly, this proposal is less prone to underfitting as the Fourier coefficients better represent data than averages and standard deviations. However, despite improving R4, such requirement is not fulfilled as still memorizes data and is ambiguous, since completely different coefficients may lead to similar Euclidean distances.
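The comparison of consecutive windows through their Fourier coefficients can be illustrated with the sketch below. It is an assumption-laden example rather than the UDFT implementation: the coefficients come from a direct discrete Fourier transform, the windows are assumed to have equal length, and the names `dft`, `coefficient_distance` and `fourier_drift` are hypothetical.

```python
import cmath
import math


def dft(window):
    """Discrete Fourier transform computed directly from its definition."""
    n = len(window)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(window))
        for k in range(n)
    ]


def coefficient_distance(window_a, window_b):
    """Euclidean distance between the Fourier coefficients of two windows."""
    if len(window_a) != len(window_b):
        raise ValueError("windows must have the same length")
    coeffs_a, coeffs_b = dft(window_a), dft(window_b)
    return math.sqrt(sum(abs(p - q) ** 2 for p, q in zip(coeffs_a, coeffs_b)))


def fourier_drift(previous_window, current_window, threshold):
    """Report a drift when the coefficient distance exceeds the threshold."""
    return coefficient_distance(previous_window, current_window) > threshold
```

Because only two consecutive windows are compared, nothing is accumulated from earlier windows, which is the limitation regarding R1 pointed out in the text.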

Later, costa:eswa:16 proposed the Cross Recurrence Concept Drift Detection (CRCDD) algorithm, which compares reconstructed phase spaces of consecutive windows in search of behavior changes using the Cross-Recurrence Analysis (marwan:pr:07; marwan:book:15). Formally speaking, they analyze phase states using open balls with radius in terms of an matrix :

(19)

indicating when phase states and are close enough to each other (given the open ball). Later, they compute the Maximum Diagonal Length (MDL), i.e., the longest diagonal formed by consecutive values equal to in . In summary, (respects R2), is the identity function, and is in the form:

(20)

Therefore, R1 is not even considered as no knowledge is accumulated from past observations (R3 is automatically satisfied). R4 is partially satisfied for (since is the memory function), as MDL is computed using an open ball whose radius is set as the average of the maximum distances from all -nearest neighbors to each phase state.
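
The ingredients described above can be sketched as follows. This is a minimal illustration, not the CRCDD implementation: the embedding dimension and delay, the parameter name `k`, and the choice of computing the radius over a single set of phase states are assumptions, and the drift rule of Equation (20) is not reproduced.

```python
import math


def embed(series, dim=2, delay=1):
    """Time-delay embedding: each phase state is (x[t], x[t+delay], ...)."""
    last = len(series) - (dim - 1) * delay
    return [tuple(series[t + j * delay] for j in range(dim)) for t in range(last)]


def knn_radius(states, k=3):
    """Average, over all states, of the distance to the k-th nearest neighbor
    (the maximum distance among the k nearest neighbors of each state)."""
    total = 0.0
    for i, state in enumerate(states):
        dists = sorted(math.dist(state, other)
                       for j, other in enumerate(states) if j != i)
        total += dists[k - 1]
    return total / len(states)


def cross_recurrence(states_a, states_b, radius):
    """Binary matrix with 1 where a state of A lies in the open ball of a state of B."""
    return [[1 if math.dist(a, b) < radius else 0 for b in states_b]
            for a in states_a]


def max_diagonal_length(matrix):
    """Longest run of consecutive 1s along diagonals parallel to the main one."""
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    run = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(rows):
        for j in range(cols):
            if matrix[i][j]:
                run[i][j] = 1 + (run[i - 1][j - 1] if i and j else 0)
                best = max(best, run[i][j])
    return best


previous = embed([math.sin(0.3 * t) for t in range(60)])
current = embed([math.sin(0.3 * t) for t in range(60)])
radius = knn_radius(previous, k=3)
print(max_diagonal_length(cross_recurrence(previous, current, radius)))
```

A long maximum diagonal indicates that the two windows follow similar trajectories in phase space, whereas a short one suggests a behavior change between them.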

Comparisons among those methods in light of our methodology are detailed in Table 2. Although there are other algorithms in the literature (Gama2004; EDDM; 6706768; Lev, etc.), we could not include them due to lack of space.

Method  Update (R1)  IID (R2)  Fixed JPD (R3)  BVD(, ) (R4)
CUSUM   Yes          No        Yes             (No, No)
PHT     Yes          No        Yes             (No, No)
ADWIN   No           No        Yes             (No, No)
UDFT    No           Yes       Yes             (No, No)
CRCDD   No           Yes       Yes             (Yes, No)
Table 2: Comparison of concept drift methods regarding requirements R1–R4.

As observed, CRCDD has the strongest learning guarantees among the compared methods, meaning its drifts are the most likely to result from actual changes in data behavior rather than from chance. Accordingly, we suggest that authors revisit their approaches in light of our methodology.

6 Conclusions

This paper proposes a methodology to overcome the complexity involved in labeling data streams and the lack of theoretical learning guarantees in the CD scenario. In this context, a CD algorithm should be built according to the following steps: (i) window observations should be reconstructed into another space in order to ensure data independence and allow identically distributed sampling. Among the alternatives, we suggest mapping them into phase spaces, using Dynamical System tools, to automatically define spaces and , as discussed throughout this paper. (ii) Given that features from are extracted using function , the indicator function must compare past against current features and, in case no drift is issued, the model should be updated to improve the representation of the current phenomenon. (iii) Conversely, if a drift is confirmed, then model should be reset to start analyzing a new phenomenon based on another JPD. Finally, (iv) the biases of both and should respect the BVD to avoid under/overfitting.
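
The control flow implied by steps (i) to (iii) can be sketched as a small skeleton. This is only an illustration under stated assumptions: the class `DriftDetector` and its three pluggable callables are hypothetical names, and step (iv) is not captured, since the bias of the chosen functions cannot be enforced by the control flow.

```python
class DriftDetector:
    """Skeleton of a window-based detector following steps (i) to (iii)."""

    def __init__(self, reconstruct, extract, is_drift):
        self.reconstruct = reconstruct  # step (i): window -> reconstructed space
        self.extract = extract          # step (ii): reconstructed window -> features
        self.is_drift = is_drift        # indicator: (past features, current) -> bool
        self.history = []               # features learned for the current phenomenon

    def process(self, window):
        """Return True when the window is reported as a drift."""
        features = self.extract(self.reconstruct(window))
        if self.history and self.is_drift(self.history, features):
            # Step (iii): discard the past model and start a new phenomenon.
            self.history = [features]
            return True
        # Step (ii): no drift, so the model is updated with the new features.
        self.history.append(features)
        return False
```

A detector built this way accumulates knowledge while no drift is reported and forgets it once a drift is confirmed; whether the resulting guarantees hold still depends on the reconstruction providing independent, identically distributed samples and on the bias of the plugged-in functions.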

As elaborated in Section 5, where we review related methods in light of our methodology, we observed that, although the stated requirements are typically “known”, they are not fulfilled at different steps. Nevertheless, although it is relatively easy to adapt algorithms to meet requirements R1, R2 and R3 ( is updated, input data is i.i.d. and disregards accumulated data when a novelty occurs, respectively), the algorithm bias (R4) cannot be changed in most cases (at least not without changing the nature of the algorithm itself). For example, CUSUM is too simplistic: even when forced to respect R1, R2 and R3, its use of the window average to report drifts conflicts with the Bias-Variance Dilemma. Moreover, our contribution also explains why those requirements exist. As far as we observe in practice, even when other researchers consider R1–R4, they do not know why they are doing so. In that sense, we have brought this to light by associating those topics with the SLT framework.

According to our methodology, CRCDD provides the strongest learning guarantees. This does not mean that the other methods will not work or produce fair results; rather, it implies that CRCDD has stronger probabilistic convergence to keep reporting fair results for incoming windows (unseen observations). We expect such analysis to be helpful to other researchers who intend to design new CD algorithms or evaluate existing ones. Lastly, after employing our strategy to ensure learning bounds, other performance measures can be safely used to validate the quality of reported drifts, such as MTBFA, MTD and MDR.

Acknowledgements

We acknowledge the sponsorship of FAPESP (São Paulo Research Foundation) and CNPq (National Council for Scientific and Technological Development). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of FAPESP or CNPq.

Disclosure statement

The authors declare that they have no conflict of interest.

Funding

This research was supported by FAPESP and CNPq, grants 2018/10652-9, 2017/16548-6 and 302077/2017-0.

References