Classification of streaming data is one of the most difficult problems in modern pattern recognition theory and practice. A typical data stream is characterized by several features that significantly impede making the correct classification decision: continuous flow, huge data volume, rapid arrival rate, and susceptibility to change. If a streaming data classifier aspires to practical applications, it must face these requirements and satisfy numerous constraints (e.g. bounded memory, single-pass processing, real-time response, changing data concept) to an acceptable extent. This is not easy, which is why the methodology of recognizing stream data has been developing intensively for over two decades, producing ever more effective classification methods [8, 24].
Incremental learning is a vital capability for classifiers used in stream data classification. It allows the classifier to utilize new objects generated by the stream to improve the model built so far. It also allows, to some extent, dealing with concept drift. Some well-known classifiers are naturally capable of being trained iteratively; examples include neural networks, nearest neighbours classifiers, and probabilistic methods such as the naive Bayes classifier. Other classifiers were tailored to be learned incrementally; an example of such a method is the well-known Hoeffding Tree classifier. These types of classifiers can be easily used in stream classification systems. On the other hand, when a classifier is unable to learn incrementally, the options for using it in stream classification are very limited. The only option is to keep a set of objects and rebuild the classifier from scratch whenever necessary.
To bridge this gap, we propose a wrapping classifier based on the soft confusion matrix (SCM) approach. The wrapping classifier may be used to add incremental learning functionality to any batch classifier. The classifier based on the idea of the soft confusion matrix proved to be an efficient tool for solving such practical problems as hand gesture recognition. An additional advantage of this approach is the ability to use imprecise feedback information about class assignment. The SCM-based algorithm was also successfully used in multilabel learning.
Dealing with concept drift using incremental learning alone is insufficient, because incremental classifiers deal effectively only with incremental drift. To handle sudden concept drift, additional mechanisms such as single/multiple window approaches, forgetting mechanisms, or drift detectors must be used. In this study, we decided to use the ADWIN algorithm to detect the drift and to manage the set of stored objects. We use the ADWIN-based detector because this approach was shown to be an effective method [13, 1].
Concept drift may also be dealt with using ensemble classifiers. There is a plethora of ensemble-based approaches [12, 3, 19]; however, in this work we focus on single-classifier-based systems.
The rest of the paper is organized as follows. Section 2 presents the corrected classifier and gives insight into its two-level structure and the original concepts of RRC and SCM, which are the basis of its construction. Section 3 describes the adopted model of a concept-drifting data stream, provides details of the chunk-based learning scheme of base classifiers and the online dynamic learning of the correcting procedure, and describes the method of combining ensemble members. Section 4 describes the experimental procedure. The results are presented and discussed in Section 5. Section 6 concludes the paper.
2 Classifier with Correction
Let us consider the pattern recognition problem in which $x \in \mathcal{X}$ denotes a feature vector of an object and $j \in \mathcal{M}$ is its class number ($\mathcal{X}$ and $\mathcal{M}$ are the feature space and the set of class numbers, respectively). Let $\psi$ be a classifier trained on the learning set $\mathcal{S}$, which assigns a class number to the recognized object. We assume that $\psi$ is described by the canonical model, i.e. for a given $x$ it first produces values of normalized classification functions (supports) $d_j(x)$, $j \in \mathcal{M}$, $\sum_{j} d_j(x) = 1$, and then classifies the object according to the maximum support rule:

$$\psi(x) = \operatorname{argmax}_{j \in \mathcal{M}} d_j(x).$$
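The canonical model can be sketched in a few lines; the support vector and the argmax decision below are illustrative (any classifier exposing normalized supports fits this interface):

```python
import numpy as np

def max_support_decision(supports):
    """Maximum support rule: pick the class with the largest
    normalized classification function d_j(x)."""
    supports = np.asarray(supports, dtype=float)
    assert abs(supports.sum() - 1.0) < 1e-9  # supports are normalized
    return int(np.argmax(supports))
```

For example, supports of (0.2, 0.5, 0.3) yield class 1 (0-based indexing).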
To recognize the object $x$ we apply an original procedure which, using additional information about the local (relative to $x$) properties of $\psi$, can change its decision to increase the chance of correct classification of $x$.
The proposed correcting procedure, which has the form of a classifier built over $\psi$, will be called a wrapping classifier. The wrapping classifier acts according to the following Bayes scheme:

$$\psi_{\mathrm{corr}}(x) = \operatorname{argmax}_{j \in \mathcal{M}} p(j|x),$$
where the a posteriori probabilities $p(j|x)$ can be expressed in a form depending on the probabilistic properties of classifier $\psi$:

$$p(j|x) = \sum_{k \in \mathcal{M}} P(j|k, x)\, P(k|x),$$

where $P(j|k, x)$ denotes the probability that $x$ belongs to the $j$-th class given that $\psi(x) = k$, and $P(k|x)$ is the probability of assigning $x$ to class $k$ by $\psi$. Since for a deterministic classifier $\psi$ both probabilities are equal to 0 or 1, we will use two concepts for their approximate calculation: the randomized reference classifier (RRC) and the soft confusion matrix (SCM).
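The correction can be sketched as a marginalisation over the classifier's possible outcomes; the matrix of local probabilities below stands in for the SCM estimates described in the following subsections (names and shapes are illustrative, not the paper's exact code):

```python
import numpy as np

def corrected_posterior(outcome_probs, local_probs):
    """Corrected a posteriori probabilities (illustrative sketch).

    outcome_probs[k]   : P(psi assigns class k | x), e.g. from the RRC model
    local_probs[j, k]  : local estimate of P(true class j | psi assigns k, x)
    The correction marginalises over the classifier's possible outcomes."""
    post = np.asarray(local_probs, dtype=float) @ np.asarray(outcome_probs, dtype=float)
    s = post.sum()
    return post / s if s > 0 else post

def corrected_decision(outcome_probs, local_probs):
    # Bayes scheme: maximum a posteriori class after correction
    return int(np.argmax(corrected_posterior(outcome_probs, local_probs)))
```

With an identity matrix of local probabilities the correction changes nothing; with swapped rows it flips the decision of a classifier that systematically confuses two classes.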
2.2 Randomized Reference Classifier (RRC)
The RRC is a randomized model of classifier $\psi$, and with its help the probabilities $P(k|x)$ are calculated.
The RRC, as a probabilistic classifier, is defined by a probability distribution over the set of class labels. Its classifying functions $d_j(x)$ are observed values of random variables $\Delta_j(x)$ that meet, in addition to the normalizing conditions, the following condition:

$$E[\Delta_j(x)] = d_j(x), \quad j \in \mathcal{M},$$

where $E$ is the expected value operator. Formula (4) states that the RRC acts, on average, as the modeled classifier $\psi$; hence the following approximation is fully justified:

$$P(k|x) \approx \Pr\big[\Delta_k(x) = \max_{j \in \mathcal{M}} \Delta_j(x)\big].$$
These probabilities can be easily determined if we assume, as in the original work of Woloszynski and Kurzynski, that $\Delta_j(x)$ follows the beta distribution.
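Under the beta-distribution assumption, the probability that the RRC assigns a given class can be estimated by sampling. The sketch below uses Beta(c·d_j, c·(1−d_j)) so that each randomized support has expected value d_j(x); the concentration c is an illustrative assumption (the original RRC derivation fixes the parameters differently):

```python
import numpy as np

rng = np.random.default_rng(0)

def rrc_class_probabilities(supports, concentration=10.0, n_samples=20000):
    """Monte-Carlo estimate of P(RRC assigns class k | x).

    Each randomized support is a Beta random variable whose expected
    value equals the base classifier's support d_j(x)."""
    supports = np.asarray(supports, dtype=float)
    a = np.maximum(concentration * supports, 1e-9)
    b = np.maximum(concentration * (1.0 - supports), 1e-9)
    draws = rng.beta(a, b, size=(n_samples, supports.size))
    winners = draws.argmax(axis=1)  # class with the maximal randomized support
    return np.bincount(winners, minlength=supports.size) / n_samples
```

The returned vector sums to one and concentrates on the classes with the largest supports, as Formula (4) requires on average.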
2.3 Soft Confusion Matrix (SCM)
The SCM will be used to assess the probabilities $P(j|k, x)$, which denote the class-dependent probabilities of correct classification (for $j = k$) and misclassification (for $j \neq k$) of $\psi$ at the point $x$. The method defines the neighborhood of the point $x$ containing validation objects in terms of fuzzy sets, allowing for flexible selection of membership functions and assigning weights to individual validation objects depending on their distance from $x$.
The SCM, providing an image of the classifier's local (relative to $x$) probabilities, has the form of a two-dimensional table in which the rows correspond to the true classes while the columns correspond to the outcomes of classifier $\psi$, as shown in Table 1.
Each value of the SCM is determined from the validation set $\mathcal{V}$ and is defined as the following ratio:
where the numerator and denominator are cardinalities of fuzzy sets specified on the validation set ($|\cdot|$ denotes the cardinality of a fuzzy set).
The set $\mathcal{D}_i$ denotes the set of validation objects from the $i$-th class. Formulating this set in terms of fuzzy sets theory, it can be assumed that the grade of membership of a validation object in $\mathcal{D}_i$ is its class indicator, which leads to the following definition of $\mathcal{D}_i$:
The fuzzy set $\mathcal{D}^k$, related to the outcome $\psi = k$, is defined as follows:
where the membership grade is calculated according to (5) and (6). Formula (10) demonstrates that the membership of a validation object in the set $\mathcal{D}^k$ is not determined by the crisp decision of classifier $\psi$. Instead, the grade of membership of an object in $\mathcal{D}^k$ depends on the potential chance of classifying it to the $k$-th class by the classifier $\psi$. We assume that this potential chance is equal to the probability calculated approximately using the randomized model RRC of classifier $\psi$.
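The fuzzy-set machinery used above reduces to element-wise operations on membership-grade vectors; the sigma-count cardinality and the product t-norm intersection below are common choices, assumed here for illustration:

```python
import numpy as np

def fuzzy_cardinality(mu):
    # sigma-count: the cardinality of a fuzzy set is the sum of
    # its membership grades
    return float(np.sum(mu))

def fuzzy_intersection(mu_a, mu_b):
    # product t-norm: one common model of fuzzy-set intersection
    return np.asarray(mu_a, dtype=float) * np.asarray(mu_b, dtype=float)
```

Under the definitions in the text, each SCM entry is a ratio of such cardinalities computed over intersections of the class sets, the outcome sets, and the neighbourhood of $x$.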
The set $\mathcal{N}(x)$ plays the crucial role in the proposed concept of SCM, because it decides which validation objects, and with which weights, will be involved in determining the local properties of the classifier $\psi$ and, as a consequence, in the procedure of correcting its classifying decision. Formally, $\mathcal{N}(x)$ is also a fuzzy set:
its membership function, however, is not defined uniquely because it depends on many circumstances. By choosing the shape of the membership function we can freely model the adopted concept of “locality” (relative to $x$).
The membership grade depends on the distance between a validation object and the test object $x$: its value is equal to 1 at $x$ and decreases with increasing distance from $x$. This leads to the following form of the proposed membership function of the set $\mathcal{N}(x)$:
where the Euclidean distance in the feature space $\mathcal{X}$ is used, the scaling distance is the Euclidean distance between $x$ and its $K$-th nearest neighbor in the validation set, and $\beta$ is a normalizing coefficient. The first factor in (12) limits the concept of “locality” (relative to $x$) to the set of $K$ nearest neighbors, with a Gaussian model of the membership grade.
Since under the stream classification framework there should be only one pass over the data, the parameters $\beta$ and $K$ cannot be found using the extensive grid-search approach applied in the originally proposed method [30, 22]. Consequently, in this work we set $\beta$ to 1. Additionally, the initial number of nearest neighbours is found using a simple rule of thumb (the square root of the number of available validation objects):
To avoid ties, the final number of neighbours is set as follows:
Additionally, the computational cost of determining the neighbourhood may be further reduced by using the kd-tree algorithm to find the nearest neighbours.
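Putting the pieces together, a minimal sketch of the streaming neighbourhood (sqrt-of-N rule of thumb for K, Gaussian membership grades, kd-tree lookup via SciPy) might look as follows; the exact tie-avoiding adjustment of K and the normalization in (12) follow the paper and are simplified here:

```python
import numpy as np
from scipy.spatial import cKDTree

def fuzzy_neighbourhood(x, V, beta=1.0):
    """Membership grades of validation objects in the fuzzy
    neighbourhood N(x): Gaussian decay of distance, truncated to
    the K nearest neighbours with K ~ sqrt(len(V))."""
    V = np.asarray(V, dtype=float)
    K = max(1, int(round(np.sqrt(len(V)))))  # simple rule of thumb
    dist, idx = cKDTree(V).query(np.asarray(x, dtype=float), k=K)
    dist, idx = np.atleast_1d(dist), np.atleast_1d(idx)
    mu = np.zeros(len(V))
    mu[idx] = np.exp(-beta * dist ** 2)  # grade 1 at x, decaying outwards
    return mu
```

Objects outside the K nearest neighbours receive grade 0, which is what keeps the correction local and cheap to compute.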
2.4 Creating the validation set
In this section, the procedure of creating the validation set from the training set is described. In the original work describing SCM, the set of labelled data was split into the learning set and the validation set, which were disjoint. The cardinality of the validation set was controlled by a dedicated coefficient, which was usually set to a default value; however, to achieve the highest classification quality, it should be determined using a grid-search procedure. As said above, in this work we want to avoid the grid-search procedure. Therefore, we construct the validation set using a three-fold cross-validation procedure that allows using the entire learning set as a validation set. The procedure is described in Algorithm 1.
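Algorithm 1 is not reproduced here, but the idea of using the whole learning set as a validation set can be sketched with a small cross-validation loop; `train_fn` and `predict_proba` are assumed interfaces, not the paper's exact API:

```python
import numpy as np

def build_validation_set(X, y, train_fn, n_folds=3, seed=0):
    """Use the whole training set as a validation set via cross-validation.

    Each object receives the supports of a classifier trained on the
    other folds, so no object validates the model it helped train.
    train_fn(X, y) must return an object with predict_proba(X); it is
    assumed every training part contains all classes."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    supports = np.zeros((len(X), len(np.unique(y))))
    for k in range(n_folds):
        hold = folds[k]
        train = np.concatenate([folds[m] for m in range(n_folds) if m != k])
        model = train_fn(X[train], y[train])
        supports[hold] = model.predict_proba(X[hold])
    return supports  # one row of class supports per training object
```

This way every training object carries out-of-fold supports, which is what the SCM needs to estimate the local probabilities without an extra held-out set.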
3 Classification of Data Stream
The main goal of this work is to develop a wrapping classifier that adds incremental learning to classifiers that are unable to learn incrementally. In this section, we describe the incremental learning procedure used by the SCM-based wrapping classifier.
3.1 Model of Data Stream
We assume that instances from a data stream appear as a sequence of labeled examples $(x_t, j_t)$, $t = 1, 2, \ldots$, where $x_t \in \mathcal{X}$ represents a $d$-dimensional feature vector of an object that arrived at time $t$ and $j_t \in \mathcal{M}$ is its class number. In this study we consider a completely supervised learning approach, which means that the true class number $j_t$ is available after the arrival of the object $x_t$ and before the arrival of the next object, and this information may be used by the classifier for the classification of subsequent objects. Such a framework is one of the most often considered in the related literature [4, 25].
In addition, we assume that a data stream can be generated with a time-varying distribution, yielding the phenomenon of concept drift. We do not impose any restrictions on the concept drift: it can be real drift, referring to changes in class distributions, or virtual drift, referring to changes in the distribution of features. We allow sudden, incremental, gradual, and recurrent changes in the distribution of instances creating the data stream. Changes in the distribution can also cause class imbalance to appear in a changing configuration.
3.2 Incremental learning for SCM classifier
We assume that the base classifier wrapped by the SCM classifier is unable to learn incrementally. Consequently, an initial training set has to be used to build the classifier. This initial data set is called the initial chunk, and its desired size is a parameter of the method. The initial data set is built by storing incoming examples from the data stream. Until the initial batch is collected, prediction with the base classifier is impossible; during that period, the prediction is made on the basis of a priori probabilities estimated from the incomplete initial batch.
Since the base classifier is unable to learn incrementally, incremental learning is handled by changing the validation set. Incoming instances are added to the validation set until the ADWIN-based drift detector detects that concept drift has occurred. The ADWIN-based drift detector analyses the outcomes of the corrected classifier for the instances stored in the validation set. When there is a significant difference between the older and the newer part of the validation set, the detector removes the older part. The remaining part of the validation set is then used to correct the outcome of the base classifier. The ADWIN-based drift detector also controls the size of the neighbourhood: even if there is no concept drift, the detector may detect a deterioration of classification quality when the neighbourhood becomes too large.
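The management of the validation set can be sketched as follows; the `detector` is assumed to expose an ADWIN-like interface (`add_element` returning True on detected change, `width` giving the retained window length), which matches several stream-learning libraries but is an assumption here:

```python
from collections import deque

class ValidationSetManager:
    """Incremental learning via the validation set (simplified sketch).

    The base classifier stays fixed; each labelled instance is appended
    to the validation set, and a drift detector monitoring the corrected
    classifier's 0/1 error outcomes drops the older part of the window
    when the error distribution changes."""

    def __init__(self, detector):
        self.detector = detector
        self.validation = deque()

    def update(self, x, y_true, y_corrected):
        self.validation.append((x, y_true))
        err = int(y_corrected != y_true)
        if self.detector.add_element(err):
            # keep only the newer part of the window the detector retained
            while len(self.validation) > self.detector.width:
                self.validation.popleft()
```

The key design point is that the detector watches the corrected decisions, so both concept drift and an over-grown neighbourhood show up as a change in the monitored error stream.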
4 Experimental Setup
To validate the classification quality obtained by the proposed approaches, the experimental evaluation whose setup is described below was performed.
The following base classifiers were employed:
– Hoeffding tree classifier
– Naive Bayes classifier
– Nearest neighbours (KNN) classifier
– SVM classifier trained with SGD
The classifiers implemented in the WEKA framework were used. If not stated otherwise, the classifier parameters were set to their defaults. We chose classifiers that offer both batch and incremental learning procedures.
The experimental code was implemented using the WEKA framework. The source code of the algorithms is available online (https://github.com/ptrajdos/rrcBasedClassifiers/tree/develop, https://github.com/ptrajdos/StreamLearningPT/tree/develop).
During the experimental evaluation, the following classifiers were compared:
– The ADWIN-driven classifier created using the unmodified base classifier (the base classifier is updated incrementally).
– The ADWIN-driven classifier created using the unmodified base classifier with incremental learning disabled; the base classifier is only retrained whenever the ADWIN-based detector detects concept drift.
– The ADWIN-driven approach using the SCM correction scheme with online learning, as described in Section 3.
– The ADWIN-driven approach using the SCM correction scheme with online learning disabled; the SCM-corrected classifier is only retrained whenever the ADWIN-based detector detects concept drift.
To evaluate the proposed methods, the following classification-loss criteria are used: macro-averaged 1-precision, macro-averaged 1-recall, and the Matthews correlation coefficient (MCC). The Matthews coefficient is rescaled so that 0 denotes perfect classification and 1 the worst one. Quality measures from the macro-averaging group are considered because such measures are more sensitive to the performance on minority classes. For many real-world classification problems, the minority class is the class that attracts the most attention.
Following recommendations from the literature, the statistical significance of the obtained results was assessed using a two-step procedure. The first step is to perform the Friedman test for each quality criterion separately. Since multiple criteria were employed, the familywise error rate (FWER) should be controlled. To do so, the Holm procedure of controlling the FWER of the conducted Friedman tests was employed. When the Friedman test shows that there is a significant difference within the group of classifiers, pairwise tests using the Wilcoxon signed-rank test are performed. To control the FWER of the Wilcoxon-testing procedure, the Holm approach was also employed. For all tests, the significance level was set to .
The experiments were conducted using 48 synthetic datasets generated using the stream-learn library (https://github.com/w4k2/stream-learn). The properties of the datasets were as follows: dataset size: 30k examples; number of attributes: 8; types of drift generated: incremental, sudden; noise: 0%, 10%, 20%; imbalance ratio: 0 – 4.
Datasets used in this experiment are available online (https://github.com/ptrajdos/MLResults/blob/master/data/stream_data.tar.xz?raw=true).
To examine the effectiveness of the incremental update algorithms, we applied an experimental procedure based on the methodology characteristic of data stream classification, namely the test-then-update procedure. The chunk size for evaluation purposes was set to 200.
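The test-then-update (prequential) protocol can be sketched in a few lines; `predict` and `update` are assumed model methods, and accuracy per 200-instance chunk stands in for the loss criteria used in the paper:

```python
def prequential_evaluation(stream, model, chunk_size=200):
    """Test-then-update: each labelled example is first used to test
    the model, then to update it; quality is reported per chunk."""
    accuracies, correct, seen = [], 0, 0
    for x, y in stream:
        correct += int(model.predict(x) == y)  # test first
        model.update(x, y)                     # then update
        seen += 1
        if seen == chunk_size:
            accuracies.append(correct / seen)
            correct, seen = 0, 0
    return accuracies
```

Testing before updating ensures every prediction is made on an instance the model has never seen, which is what makes the estimate honest for streams.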
5 Results and Discussion
To compare multiple algorithms on multiple benchmark sets, the average-ranks approach is used. In this approach, the winning algorithm achieves a rank equal to 1, the second a rank equal to 2, and so on. In the case of ties, the ranks of algorithms that achieve the same results are averaged.
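The average-ranks computation, including the tie handling described above, is a one-liner with SciPy's `rankdata` (shown here for loss criteria, where lower is better and hence rank 1 is best):

```python
import numpy as np
from scipy.stats import rankdata

def average_ranks(losses):
    """Average ranks across datasets.

    losses : (n_datasets, n_algorithms) array of loss values.
    Rank 1 = lowest loss per dataset; ties share the mean rank
    (rankdata's default 'average' method); ranks are then averaged
    per algorithm over all datasets."""
    ranks = np.apply_along_axis(rankdata, 1, np.asarray(losses, dtype=float))
    return ranks.mean(axis=0)
```

For example, two datasets with losses (0.1, 0.2, 0.2) and (0.3, 0.1, 0.2) give the tied pair the shared rank 2.5 on the first dataset.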
The numerical results are given in Tables 2 to 5. Each table is structured as follows. The first row contains the names of the investigated algorithms. Then, the table is divided into six sections, one section per evaluation criterion. The first row of each section is the name of the quality criterion investigated in the section. The second row shows the p-value of the Friedman test. The third one shows the average ranks achieved by the algorithms. The following rows show p-values resulting from the pairwise Wilcoxon tests. P-values reported as zero are lower than the displayed precision; p-values lower than the significance level are bolded. Due to the page limit, the raw results are published online (https://github.com/ptrajdos/MLResults/blob/master/RandomizedClassifiers/Results_cldd_2021.tar.xz?raw=true).
To provide a visualization of the average ranks and the outcomes of the statistical tests, rank plots are used. In these plots, each classifier is placed along a line representing the values of the achieved average ranks. The classifiers between which there are no significant differences (in terms of the pairwise Wilcoxon test) are connected with a horizontal bar placed below the axis representing the average ranks. The results are visualised in Figures 1 – 4.
Let us begin with an analysis of the correction ability of the SCM approach when incremental learning is disabled. Although this kind of analysis has already been done [30, 22], in this work it must be repeated since the definition of the neighbourhood has been significantly changed (see Section 2.3). To assess the impact of the SCM-based correction, we compare the batch variants with and without the correction for different base classifiers. For the weaker base classifiers, the employment of SCM-based correction allows achieving a significant improvement in terms of all quality criteria (see Figures 1 and 2). For the remaining base classifiers, on the other hand, there are no significant differences between the two variants. These results confirm observations previously made in [30, 22]: the correction ability of the SCM approach is more noticeable for classifiers that are considered weaker ones. This correction ability holds although the extensive grid-search technique is not applied.
In this paper, the SCM-based approach is proposed as a wrapping classifier that handles incremental learning for base classifiers that cannot be updated incrementally. Consequently, we now analyse the SCM approach in that scenario. The results show that the online-learning SCM variant significantly outperforms the retrained-only SCM variant for all base classifiers and quality criteria, which means that it works well as the incremental-learning-handling wrapping classifier. What is more, it also outperforms the incrementally updated base classifiers for all base classifiers and criteria. This clearly shows that the source of the achieved improvement does not lie only in the batch-learning improvement ability; the ability to handle incremental learning is also present. Moreover, the proposed approach handles incremental learning more effectively than the base classifiers designed to do so.
6 Conclusions
In this paper, we proposed a modified SCM classifier to be used as a wrapping classifier that allows incremental learning for classifiers that are not designed to be incrementally updated. We applied two modifications to the SCM wrapping classifier originally described in [30, 22]. The first is a modified neighbourhood definition: the newly proposed neighbourhood does not need an extensive grid-search procedure to find the best set of parameters, and due to the modified definition the computational cost of performing the SCM-based correction is significantly smaller. The second modification is to incorporate an ADWIN-based approach to create and manage the validation set used by the SCM-based algorithm. This modification not only allows the proposed method to deal effectively with concept drift but also shrinks the neighbourhood when it becomes too wide.
The experimental results show that the proposed approach outperforms the reference methods for all investigated base classifiers in terms of all considered quality criteria.
The results obtained in this study are very promising. Consequently, we are going to continue our research on the employment of randomised classifiers in the task of stream learning. Our next step will be to propose a stream learning ensemble that uses the SCM correction method introduced in this paper.
This work was supported by the statutory funds of the Department of Systems and Computer Networks, Wroclaw University of Science and Technology.
- (2019-12) An overview and comprehensive comparison of ensembles for concept drift. Information Fusion 52, pp. 213–244.
- (2007-04) Learning from time-changing data with adaptive windowing. In Proceedings of the 2007 SIAM International Conference on Data Mining.
- (2014-01) Reacting to different types of concept drift: the accuracy updated ensemble algorithm. IEEE Trans. Neural Netw. Learning Syst. 25 (1), pp. 81–94.
- (2014-05) Combining block-based and online methods in learning ensembles from concept drifting data streams. Information Sciences 265, pp. 50–67.
- (2006) Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research 7, pp. 1–30.
- (1996) A probabilistic theory of pattern recognition. Springer New York.
- (2013-05) On cardinality of fuzzy sets. IJISA 5 (6), pp. 47–52.
- (2014-03) A survey on concept drift adaptation. CSUR 46 (4), pp. 1–37.
- (2010-05) Knowledge discovery from data streams. 1st edition, Chapman and Hall/CRC.
- (2008-12) An extension on “Statistical comparisons of classifiers over multiple data sets” for all pairwise comparisons. Journal of Machine Learning Research 9, pp. 2677–2694.
- (2000) A note on the utility of incremental learning. AI Communications 13 (4), pp. 215–223.
- (2017-03) A survey on ensemble learning for data stream classification. CSUR 50 (2), pp. 1–36.
- (2014-12) A comparative study on concept drift detectors. Expert Syst. Appl. 41 (18), pp. 8144–8156.
- (2003) KNN model-based approach in classification. In On The Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE, pp. 986–996.
- (2009-11) The WEKA data mining software. SIGKDD Explor. Newsl. 11 (1), pp. 10.
- (2001-12) Idiot's Bayes: not so stupid after all? International Statistical Review / Revue Internationale de Statistique 69 (3), pp. 385.
- (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6 (2), pp. 65–70.
- (2018-12) An advanced k nearest neighbor classification algorithm based on KD-tree. In 2018 IEEE International Conference of Safety Produce Informatization (IICSPI).
- (2017-09) Ensemble learning for data stream analysis: a survey. Information Fusion 37, pp. 132–156.
- (2009-11) On the window size for classification in changing environments. IDA 13 (6), pp. 861–872.
- (2014-09) Combining pattern classifiers. John Wiley & Sons, Inc.
- (2016-02) Multiclassifier system with hybrid learning applied to the control of bioprosthetic hand. Comput. Biol. Med. 69, pp. 286–297.
- (2018-11) A survey on addressing high-class imbalance in big data. J Big Data 5 (1).
- (2017) Concept drift in streaming data classification: algorithms, platforms and issues. Procedia Comput. Sci. 122, pp. 804–811.
- (2014-12) A survey on data stream clustering and classification. Knowl Inf Syst 45 (3), pp. 535–569.
- (2007) New options for Hoeffding trees. In AI 2007: Advances in Artificial Intelligence, M. A. Orgun and J. Thornton (Eds.), Berlin, Heidelberg, pp. 90–99.
- (2012) Batch-incremental versus instance-incremental learning in dynamic and evolving data. In Advances in Intelligent Data Analysis XI, pp. 313–323.
- (2017-03) Minimum precision requirements for the SVM-SGD learning algorithm. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- (2009-07) A systematic analysis of performance measures for classification tasks. Information Processing & Management 45 (4), pp. 427–437.
- (2016-03) A dynamic model of classifier competence based on the local fuzzy confusion matrix and the random reference classifier. Int. J. Appl. Math. Comput. Sci. 26 (1), pp. 175–189.
- (2018-09) A correction method of a binary classifier applied to multi-label pairwise models. Int. J. Neur. Syst. 28 (09), pp. 1750062.
- (2011-10) A probabilistic model of classifier competence for dynamic ensemble selection. Pattern Recognit. 44 (10-11), pp. 2656–2668.
- (2011-06) Combining similarity in time and space for training set formation under concept drift. IDA 15 (4), pp. 589–611.