I Introduction
The amount of data gathered from different sources is growing exponentially. As these data usually arrive in streams, there is a need for fast, reliable, and incremental data stream mining algorithms [1, 2, 3]. These algorithms have to deal with new challenges, such as concept drift (CD) and time and memory constraints, as well as well-known problems such as class imbalance and overfitting. Many efforts have been made to tackle these challenges. To increase predictive performance and deal with CD, ensemble learners have been employed with high success in data stream scenarios [4, 5], although at the cost of requiring more computational resources. On the other hand, although a single classifier demands fewer resources, its predictive performance is usually lower and it is more prone to failing to detect and adapt to CD. Hence, a solution that increases predictive performance without severely impacting computational resources is highly desirable [1, 6, 7, 5].
Online solutions can fail to detect important aspects of the data, e.g., instances lying on decision boundaries. This is because they process each instance only once, unlike batch solutions, which exploit additional information from the data distribution to improve predictive performance. The well-known Boosting ensemble approach is built upon this idea: concentrate particularly on the challenging examples. AdaBoost [8], a boosting algorithm, assigns different weights to the samples, giving more importance to the cases that are harder to predict. The same reasoning can be applied in streaming scenarios. Our idea is to use difficult samples more than once when updating the learning models, thereby reducing the time needed to learn potentially complex patterns. Besides achieving better predictive performance, this approach does not significantly impact learning time or memory resources.
Taking these aspects into consideration, in this paper we present a novel algorithm for data stream classification, named Online Local Boosting (OLBoost). OLBoost, proposed here for online decision trees (ODTs), works inside each leaf to dynamically adjust the weights of incoming instances in order to increase predictive performance. It runs in parallel with the ODT induction algorithm without interfering with it, and is used solely to predict new incoming instances. Our proposal performs local boosting in the sense that only the leaf predictors are boosted. This work assesses the impact of OLBoost on eleven benchmark datasets when used inside the Very Fast Decision Tree (VFDT) [1] and the Strict VFDT (SVFDT) [3]. Experimental results showed that, when coupled with ODTs, OLBoost improves accuracy without high overheads in time and memory costs.
The paper is organised as follows: Section II presents works related to our proposal. Section III gives some background on ODT building, including a brief description of VFDT and SVFDT. Next, Section IV presents our proposal. Our experimental setup is detailed in Section V. We discuss the obtained results in Section VI and present our final considerations and some avenues for future research in Section VII.
II Related Work
Many techniques have been proposed to increase the predictive performance of ODTs. They can be divided into three main groups: structural modifications of the decision tree; additional prediction strategies with the same structure; and ensembles.
Modifying the structure of the decision tree is a very well-explored area. The Concept-adapting VFDT (CVFDT) algorithm, proposed by Hulten et al. [9], keeps secondary trees in memory, constantly assesses whether they have a higher predictive performance than the original tree, and uses a sliding window to discard outdated instances. The Hoeffding Option Tree (HOT), proposed by Pfahringer et al. [10], introduces option nodes in place of normal split nodes. An option node is essentially a split node that tests multiple conditions at the same time. When a new instance arrives at an option node, it travels along all child nodes whose conditions are true. Predictions are made by averaging over all paths. These techniques generally revolve around building a larger decision tree or using complex algorithms to discard outdated information, which also increases computational costs.
The second group, which uses additional prediction strategies in the leaves, is where our work is situated. Gama et al. [6] first introduced the idea of functional leaves to increase the predictive performance of the VFDT. These leaves use a Naive Bayes (NB) or an Adaptive NB (ANB) algorithm to further increase predictive performance [6]. To the best of our knowledge, no further works explore other solutions belonging to this group.
Lastly, many ensembles have been proposed. Oza [4] first adapted bagging and boosting to the online scenario with the OzaBagging and OzaBoosting ensembles. Bifet et al. [11] improved OzaBagging with Leveraging Bagging (LevBag) by adding an ADWIN change detector to monitor the error of each base learner and by increasing the variability of the instances’ weights when performing bagging. Adaptive Random Forests, proposed by Gomes et al. [12], further improved LevBag with ideas from the Random Forest algorithm. Online Accuracy Updated Ensemble [13] uses a sliding window to maintain a set of weighted base learners. All these algorithms offered some increase in predictive performance. Nonetheless, this is accompanied by a large increase in computational (memory and time) costs without a sufficient increase in performance to justify it [14]. The most recent studies addressing ODT predictive improvements focused on the last group, ensemble solutions. However, we believe that predictive performance can be improved with only a slight impact on time and memory costs by exploring a boosting approach in the prediction strategy, without modifying the ODT induction.
III Online Decision Trees
Two ODT algorithms were used in the experiments carried out in this work: the VFDT and the SVFDT, a recent variation that focuses on reducing memory costs.
III-A Very Fast Decision Tree
The VFDT [1] is a tree-based machine learning algorithm for data streams based on the Hoeffding Bound (HB) theorem. When growing a tree, the VFDT employs the HB to decide when to perform a node split. After evaluating the candidate features at a split attempt with a heuristic measure (e.g., Information Gain (IG) or Gini Index (GI)), VFDT uses the HB theorem to check whether the best split candidate would remain the best if the tree received additional instances.
VFDT keeps and updates the class distribution of the instances in a vector at each leaf, counting the number of instances from each class. Likewise, counting procedures and numerical estimators are employed to maintain the relationship between feature values and class distributions. By doing so, VFDT can induce a model from a single instance at a time using limited memory resources. Additionally, under realistic assumptions, it has the same asymptotic performance as the induction of a decision tree by a standard batch algorithm [7].
Finally, VFDT has a tie-breaking hyperparameter to support tree growth when features have similar heuristic values; uses an Adaptive Naive Bayes (ANB) [6] at the leaves to increase predictive performance; and has a grace period hyperparameter that defines the number of instances needed by each leaf between split attempts.
III-B Strict Very Fast Decision Tree
The SVFDT algorithm [3] modifies the VFDT to create smaller decision trees while maintaining predictive performance. SVFDT has two different versions, SVFDT-I and SVFDT-II, both following the assumptions that:

A leaf node should only split if there is a minimum uncertainty of class assumption associated with the instances, according to previous and current statistics (i.e., a high entropy).

All leaf nodes should observe a similar number of instances to be turned into split nodes.

The feature used for splitting should have a minimum relevance according to previous statistics (i.e., a high IG).
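The SVFDT applies additional rules to hold tree growth using the following function:
$$\varphi(x, X) = \begin{cases} \text{True}, & \text{if } x \geq \bar{X} - \sigma(X) \\ \text{False}, & \text{otherwise} \end{cases}$$
where $X$ is a set of observed values, $\bar{X}$ is their mean, $\sigma(X)$ is their standard deviation, and $x$ is a new observation.
First, consider that statistics computed at the time a leaf satisfies the split conditions of the VFDT (according to the HB or the tie-breaking hyperparameter) are marked with satisfiedVFDT. For example, the $i$-th time that this occurred, the entropy of that leaf would be marked as $H^{satisfiedVFDT}_i$, the IG value of the best feature as $IG^{satisfiedVFDT}_i$, and the number of instances seen as $n^{satisfiedVFDT}_i$.
The SVFDT splits a leaf when it satisfies all the conditions imposed by the VFDT and four additional constraints applied to each leaf $l$:

$\varphi(H_l, \{H_0, H_1, \ldots, H_L\})$,

$\varphi(H_l, \{H^{satisfiedVFDT}_0, H^{satisfiedVFDT}_1, \ldots, H^{satisfiedVFDT}_S\})$,

$\varphi(IG_l, \{IG^{satisfiedVFDT}_0, IG^{satisfiedVFDT}_1, \ldots, IG^{satisfiedVFDT}_S\})$,

$\varphi(n_l, \{n^{satisfiedVFDT}_0, n^{satisfiedVFDT}_1, \ldots, n^{satisfiedVFDT}_S\})$,
where $L$ is the total number of leaves and $S$ is the total number of split attempts that satisfied the VFDT constraints.
The SVFDT-II uses additional skipping mechanisms to speed up growth when class uncertainty is too high or when this uncertainty is largely reduced. It employs the following function:
$$\varsigma(x, X) = \begin{cases} \text{True}, & \text{if } x \geq \bar{X} + \sigma(X) \\ \text{False}, & \text{otherwise} \end{cases}$$
to check whether either

$\varsigma(H_l, \{H_0, H_1, \ldots, H_L\})$
or

$\varsigma(IG_l, \{IG^{satisfiedVFDT}_0, IG^{satisfiedVFDT}_1, \ldots, IG^{satisfiedVFDT}_S\})$
holds true; if so, all the other constraints are ignored.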
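The SVFDT split rules can be illustrated with a minimal Python sketch. It rests on assumptions rather than on the reference implementation: the growth-holding function is taken to compare a new observation with the mean minus one standard deviation of the stored values, the skipping function of SVFDT-II with the mean plus one standard deviation, and the population standard deviation is used. The names `phi`, `varsigma` and `svfdt_allows_split`, as well as all argument names, are hypothetical.

```python
from statistics import mean, pstdev


def phi(x, observed):
    """Assumed growth-holding test: x >= mean(observed) - std(observed)."""
    if not observed:
        return True  # nothing to compare against yet
    return x >= mean(observed) - pstdev(observed)


def varsigma(x, observed):
    """Assumed skipping test of SVFDT-II: x >= mean(observed) + std(observed)."""
    if not observed:
        return False
    return x >= mean(observed) + pstdev(observed)


def svfdt_allows_split(h, ig, n, leaf_h, sat_h, sat_ig, sat_n, version=1):
    """Decide whether a leaf that already passed the VFDT test may split.

    h, ig, n : entropy, best information gain and instance count of the leaf
    leaf_h   : entropies of all current leaves
    sat_*    : statistics stored each time a leaf satisfied the VFDT test
    """
    # SVFDT-II: skip the remaining constraints when uncertainty or gain is high.
    if version == 2 and (varsigma(h, leaf_h) or varsigma(ig, sat_ig)):
        return True
    # SVFDT-I and SVFDT-II: all four constraints must hold.
    return (phi(h, leaf_h) and phi(h, sat_h)
            and phi(ig, sat_ig) and phi(n, sat_n))
```

Under these assumptions, a leaf that has seen far fewer instances than previous splitting leaves is held back by SVFDT-I, but may still split under SVFDT-II if its entropy is well above that of the other leaves.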
IV Online Local Boosting
Online Local Boosting (OLBoost) is a simple algorithm designed to increase predictive performance, primarily when coupled with ODTs. It is based on the assumption that, to increase predictive performance, instances that are wrongly classified should be used more times when inducing a model. On the other hand, instances that are easily classified can even be disregarded during the training phase. However, this procedure may introduce some pitfalls (e.g., overfitting), since it changes the original data stream distribution. To deal with these pitfalls, OLBoost works independently from tree growth, which empirically reduces the problem. More specifically, OLBoost works as the predictor of each leaf in a tree. OLBoost can use as its core either a most common (MC) class prediction, NB or ANB. Consequently, different statistics about the instances need to be stored according to the strategy used. For MC, a simple class distribution is sufficient, while NB and ANB need additional statistics about the nominal and numerical feature distributions. These additional statistics are computed in the same way as in a simple VFDT that uses the ANB as leaf predictor.
Fig. 1 describes how OLBoost works. Since we evaluate in this work two versions of OLBoost when coupled with VFDT and SVFDT, these couplings are illustrated in the figure. Both training and prediction phases are presented in the figure. Nonetheless, the presented scheme could be easily adapted to other ODT algorithms.
The instances of an (unbounded) data stream are presented one at a time to the online learning algorithm. Each instance is defined as $(x, y)$, where $x$ is the feature vector and $y$ the real class. First, an instance is presented to the algorithm and sorted to a leaf in the ODT. This process corresponds to traversing the tree according to its split nodes until the instance arrives at a leaf. Then, OLBoost tries to predict the class of this instance, outputting a probability vector, which is stored. When using NB or ANB as the core of OLBoost, the values outputted by these algorithms are not true probabilities (between 0 and 1 and summing to 1). To deal with this, we tested both softmax and simply dividing each value by the sum of all values, and found that the latter yielded better results.
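The two normalisation options mentioned above can be sketched as follows. This is only an illustration under assumptions: the function names are hypothetical, the scores are taken to be non-negative class scores, and the uniform fallback for an all-zero vector is our own choice, not something described in the text.

```python
import math


def normalise_by_sum(scores):
    """Divide each non-negative score by the sum of all scores."""
    total = sum(scores)
    if total <= 0:
        # Assumed fallback: no evidence yet, so return a uniform vector.
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]


def softmax(scores):
    """Softmax with the maximum subtracted for numerical stability."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Dividing by the sum preserves the ratios between the class scores, whereas softmax reshapes them through the exponential.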
Afterwards, the boosting procedure is performed in a manner similar to that described in [4], by sampling weights from a Poisson distribution. It works by computing a $\lambda$ variable that is then used to select the instance weight $k$. Note that $k$ is equivalent to the number of times that an instance is used to update the OLBoost. Our proposal computes the instance weights by linearly combining a range of allowed Poisson distribution parameters and the probability outputted by the ODT model. In this sense, to compute $\lambda$ the following equation is used:
$$\lambda = \lambda_{min} + (\lambda_{max} - \lambda_{min})(1 - p)$$
where $\lambda_{min}$ and $\lambda_{max}$ are hyperparameters which delimit the minimum and maximum possible values for $\lambda$, and $p$ corresponds to the probability estimated by the leaf of the instance being from its real class. Then, $k$ is drawn by sampling from the Poisson($\lambda$) distribution. Note that $\lambda$ and $p$ are inversely related, i.e., as the prediction becomes more accurate, $\lambda$ decreases, which in turn statistically reduces the chance of $k$ being a larger value. Lastly, the statistics used by the predictor inside the OLBoost (MC, NB or ANB) are updated $k$ times with $(x, y)$. After this, the OLBoost does not interfere with tree growth. The instance is then used to update the leaf statistics with a normal weight of 1. Then, a split attempt is performed.
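The leaf-level update described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the reference implementation: the Poisson parameter is assumed to be a linear interpolation that equals its minimum when the true class receives probability 1 and its maximum when it receives probability 0; `MajorityClassCore` is a hypothetical MC core; and the hyperparameter values used in any call are illustrative, not the recommended defaults.

```python
import math
import random


def boosting_lambda(p_true, lambda_min, lambda_max):
    """Assumed linear map: p_true = 1 gives lambda_min, p_true = 0 gives lambda_max."""
    return lambda_min + (lambda_max - lambda_min) * (1.0 - p_true)


def sample_poisson(lam, rng):
    """Draw k ~ Poisson(lam) with Knuth's method (adequate for small lam)."""
    if lam <= 0:
        return 0
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


class MajorityClassCore:
    """Hypothetical MC core: stores only a class distribution."""

    def __init__(self):
        self.counts = {}

    def predict_proba(self, x):
        total = sum(self.counts.values())
        if total == 0:
            return {}
        return {label: c / total for label, c in self.counts.items()}

    def learn(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1


def olboost_update(core, x, y, lambda_min, lambda_max, rng):
    """Predict, derive the Poisson parameter, then train the core k times."""
    p_true = core.predict_proba(x).get(y, 0.0)
    lam = boosting_lambda(p_true, lambda_min, lambda_max)
    k = sample_poisson(lam, rng)
    for _ in range(k):
        core.learn(x, y)
    return lam, k
```

In this sketch only the statistics of the core predictor are updated repeatedly; the leaf statistics used for split attempts would still be updated once, with weight 1, by the tree itself.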
Considering the time costs of using OLBoost, there are different scenarios. Compared with a plain MC or NB predictor, OLBoost adds the time costs of making a prediction (which vary according to the predictor inside the OLBoost), computing the Poisson parameter, sampling the instance weight, and updating its statistics. On the other hand, compared with an ANB, there are no additional prediction costs, since ANB already has to perform a prediction to see which predictor (MC or NB) is performing better. Memory costs increase according to the predictor used inside the OLBoost, being higher when using NB or ANB. In contrast, when ensembles are used to improve predictive performance, memory costs increase by around $n$ times and, when the models are updated sequentially, the processing cost also increases by around $n$ times, where $n$ is the number of models inside the ensemble. Nonetheless, when predictive performance needs to be maximised and the resources are available, OLBoost coupled with an ODT can be used in ensembles.
V Experimental setup
To evaluate the impact of the OLBoost on the VFDT and SVFDTs, eleven benchmark datasets commonly used in data stream mining experiments were selected: agrawal, cover_type, electricity, hyper, led24, poker, rbf (with 500k instances and 10 features, 1M instances and 10 features, and 250k instances and 50 features), sea and usenet. In all cases, the prequential evaluation scheme was employed for evaluating the algorithms [7].
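For readers unfamiliar with prequential (test-then-train) evaluation, the scheme can be sketched as below: each instance is first used to test the model and only then to train it. The `MajorityClassModel` class and the `predict`/`learn` interface are hypothetical stand-ins for an ODT, chosen only to keep the sketch self-contained.

```python
class MajorityClassModel:
    """Hypothetical stand-in for an online classifier."""

    def __init__(self):
        self.counts = {}

    def predict(self, x):
        if not self.counts:
            return None  # no class seen yet
        return max(self.counts, key=self.counts.get)

    def learn(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1


def prequential_accuracy(stream, model):
    """Test on each instance first, then train on it; return the accuracy."""
    correct = total = 0
    for x, y in stream:
        if model.predict(x) == y:
            correct += 1
        total += 1
        model.learn(x, y)
    return correct / total if total else 0.0
```

Because every instance is predicted before the model has seen its label, no separate hold-out set is needed.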
The VFDT and SVFDTs used ANB and and were varied: and . Likewise, and were varied, with performance increasing as increased. Based on empirical evaluations the best all around results plateaued for around and , hereby recommended as default values.
We evaluated the compared algorithms in terms of predictive performance, running time, and the amount of required memory. The predictive performance was measured in terms of accuracy. We accounted for the total running time of the algorithms in seconds and measured the final size of the models (in MB) at the end of the data streams.
All algorithms were implemented in Python 3.7 (https://www.python.org/) and Cython (http://cython.org/), a superset of the Python language that allows code to be written in Python and compiled to C extensions, and are available online at https://github.com/vturrisi/pystream.
VI Results and Discussions
Fig. 2 presents the boxplots and violin plots of the performance metrics for each dataset separately while varying the grace period and tie-breaking hyperparameters. The violins represent the algorithms without OLBoost, while the overlapped boxplots represent the same algorithms with OLBoost. As previously discussed, the minimum and maximum values allowed for the Poisson parameter were kept constant during the experiments.
Considering accuracy, it is possible to see that excluding agrawal and usenet datasets, using OLBoost increases the median performance. Moreover, OLBoost outperforms the traditional ODT variants in all hyperparameter configurations for the cover_type and electricity datasets. While the traditional ODT algorithm can reach similar predictive performance in some datasets by selecting high values for (creating a deeper tree), using OLBoost grants higher or very similar performance while using a much shallower tree.
Concerning memory usage, OLBoost increases consumption in all cases, but this was expected since it essentially doubles the memory sizes of the leaves. This fact can, however, be neglected for most of the applications, given that our proposal reaches higher predictive performances than standard ODT algorithms. OLBoost can also use shallower structures while maintaining competitive accuracy. In the future, we intend to evaluate new strategies for tree growth stopping based on the measured accuracy. This fact could enhance the advantages of our proposal against other algorithms.
For time costs, a small variation is perceived, but without a substantial increase. Given the previously discussed aspects, settings that lead OLBoost to shallower structures would be preferred to decrease processing time. The robustness of our proposal accuracy-wise enables this kind of balance between size and performance. Hence, we advise using OLBoost while tuning the ODT’s hyperparameters to create smaller trees, maximising the trade-off between predictive performance and memory consumption.
It is also worth mentioning that the SVFDT variants, when coupled with OLBoost, were able to obtain predictive performance comparable to that of the VFDT. Nonetheless, we limited our analysis to pairwise comparisons of tree settings, i.e., we compared the performance of the same hyperparameter set with and without OLBoost. An interesting avenue for future evaluation would be comparing the performance obtained by SVFDT with OLBoost against the traditional VFDT algorithm. This could lead to more accurate decision models that are restrictive in the usage of computational resources.
We also evaluated the benefits of using OLBoost as the tree structures grow. Fig. 3 presents the average relative accuracy considering all trees when using OLBoost, sorted by tree size. The relative accuracy is calculated by dividing the accuracy of the OLBoost-based models by the accuracy of their standard counterparts. Therefore, relative accuracies greater than 1 imply that our proposal led to higher accuracies, whereas the contrary holds for values below 1. As the trees grow, the gains obtained by OLBoost become smaller. We expected such behaviour, since the prediction models become more specialised in the incoming concepts as they process more examples. However, considering all ODTs evaluated and the datasets, OLBoost always presents a gain in accuracy.
Lastly, the statistical difference between the six algorithm combinations (VFDT, SVFDT-I, SVFDT-II, and their OLBoost variants: O_VFDT, O_SVFDT-I, and O_SVFDT-II) was assessed using the Wilcoxon signed-rank test [15]. We used the results obtained by each pair of algorithms, considering all the evaluated hyperparameters. Thus, the sample size of each test was the number of different hyperparameter combinations evaluated per dataset. First, a two-sided test was computed to verify whether the median of the differences between each algorithm pair was zero (null hypothesis) or not (alternative hypothesis). In case the null hypothesis was rejected, an additional one-sided test was performed to verify whether the median of the differences was positive or negative. Therefore, for each algorithm pair, we can state the number of datasets in which the first algorithm was statistically better than the second, and vice versa. We present the observed results in Tables I(a), I(b), and I(c).
Considering accuracy (Table I(a)), the O_VFDT was statistically better than the other algorithms more times than any other algorithm. Moreover, we can also observe that by adding OLBoost to the SVFDT-I and SVFDT-II, both algorithms were able to surpass the accuracy achieved by even the best-performing traditional technique (VFDT). From a memory perspective (Table I(b)), it is possible to see that adding OLBoost to any of the algorithms statistically increases memory costs. Nonetheless, O_SVFDT-I and O_SVFDT-II were still able to consume statistically less memory than the VFDT in six and five datasets, respectively. Considering time (Table I(c)), OLBoost always increases computational costs regardless of the base ODT. Although OLBoost is slower, the actual differences are around 10 seconds over the processing of a whole stream. This amount can be neglected depending on the type of data stream application.



VII Conclusion and Future Work
This work presented a novel algorithm to increase the predictive performance of ODTs for data stream mining. We considered a large and varied set of benchmark datasets to compare our proposal against traditional ODT algorithms. According to the experimental results, OLBoost is able to increase accuracy. Additionally, OLBoost produced shallower trees with better predictive performance than the other algorithms, reducing the memory needed, speeding up testing and potentially improving the model’s interpretability. Reducing model size also helps to combat overfitting. In addition, we observed that variants of SVFDT with OLBoost were capable of presenting predictive performance competitive with that obtained by the traditional VFDT. As future work, we intend to evaluate the possibility of using OLBoost in a restricted set of leaves to increase predictive performance while reducing memory costs. Moreover, the evaluation of OLBoost in ensemble algorithms can also be explored. The adaptation of OLBoost to regression tasks is another possible avenue for future research. Finally, our proposal can be extended to other online prediction algorithms, such as k-Nearest Neighbours.
References
[1] Pedro Domingos and Geoff Hulten. Mining high-speed data streams. In KDD, volume 2, page 4, 2000.
[2] T. R. Hoens, R. Polikar, and N. V. Chawla. Learning from streaming data with concept drift and imbalance: an overview. Progress in Artificial Intelligence, 1(1):89–101, 2012.
[3] V. G. T. da Costa, A. C. P. de L. F. de Carvalho, and S. Barbon Junior. Strict very fast decision tree: A memory conservative algorithm for data stream mining. Pattern Recognition Letters, 116:22–28, 2018.
[4] N. C. Oza. Online bagging and boosting. In 2005 IEEE International Conference on Systems, Man and Cybernetics, volume 3, pages 2340–2345, Oct 2005.
[5] Bartosz Krawczyk, Leandro L. Minku, João Gama, Jerzy Stefanowski, and Michał Woźniak. Ensemble learning for data stream analysis: A survey. Information Fusion, 37:132–156, 2017.
[6] J. Gama, R. Rocha, and P. Medas. Accurate decision trees for mining high-speed data streams. In Proc. of the IX ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’03, pages 523–528, New York, NY, USA, 2003. ACM.
[7] J. Gama. Knowledge Discovery from Data Streams. Chapman & Hall/CRC, 1st edition, 2010.
[8] Thomas G Dietterich. Ensemble methods in machine learning. In International workshop on multiple classifier systems, pages 1–15. Springer, 2000.
[9] Geoff Hulten, Laurie Spencer, and Pedro Domingos. Mining time-changing data streams. Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’01, pages 97–106, 2001.
[10] Bernhard Pfahringer, Geoffrey Holmes, and Richard Kirkby. New options for Hoeffding trees. In Mehmet A. Orgun and John Thornton, editors, AI 2007: Advances in Artificial Intelligence, pages 90–99, Berlin, Heidelberg, 2007. Springer Berlin Heidelberg.
[11] A. Bifet, G. Holmes, and B. Pfahringer. Leveraging bagging for evolving data streams. In Proceedings of the 2010 European Conference on Machine Learning and Knowledge Discovery in Databases: Part I, ECML PKDD’10, pages 135–150, Berlin, Heidelberg, 2010. Springer-Verlag.
[12] H. M. Gomes, A. Bifet, J. Read, J. P. Barddal, F. Enembreck, B. Pfahringer, G. Holmes, and T. Abdessalem. Adaptive random forests for evolving data stream classification. Machine Learning, 106(9):1469–1495, Oct 2017.
[13] Dariusz B. and Jerzy S. Combining block-based and online methods in learning ensembles from concept drifting data streams. Information Sciences, 265:50–67, 2014.
[14] Victor G. Turrisi da Costa, Saulo Martiello Mastelini, André C. P. L. F. de Carvalho, and Sylvio Barbon. Making data stream classification tree-based ensembles lighter. In 7th Brazilian Conference on Intelligent Systems, BRACIS 2018, São Paulo, Brazil, October 22-25, 2018, pages 480–485, 2018.
[15] Frank Wilcoxon. Individual comparisons by ranking methods. In Breakthroughs in statistics, pages 196–202. Springer, 1992.