Online Multi-target regression trees with stacked leaf models

03/29/2019 ∙ by Saulo Martiello Mastelini, et al. ∙ State University of Londrina ∙ Universidade de São Paulo

The amount of available data grows at a rapid pace. Developing machine learning strategies that cope with high-throughput, changing data streams is therefore a research topic of high relevance. Among the prediction tasks in online machine learning, multi-target regression has gained increased attention due to its applicability to many real-world problems. While reliable and effective solutions have been proposed for batch multi-target regression, the few existing solutions in the online scenario present gaps which should be further investigated. Among these problems, none of the existing solutions considers inter-target correlations when making predictions. In this work, we propose an extension to existing decision-tree-based solutions for online multi-target regression which tackles the problem mentioned above. Our proposal, called Stacked Single-target Hoeffding Tree (SST-HT), uses the inter-target dependencies as an additional source of information to enhance accuracy. Through an extensive experimental setup, we evaluate our proposal against state-of-the-art decision-tree-based solutions for online multi-target regression on sixteen datasets. Our observations show that SST-HT achieves significantly smaller errors than the other methods, while increasing the required time and memory only by small amounts.


1 Introduction

With the evolution of computer hardware, the amount of generated data has increased rapidly, resulting in streams of potentially unbounded size and varied characteristics. While previous generations of machine learning (ML) algorithms were concerned with processing relatively small amounts of data (in batches), without strong restrictions on time or memory, the new challenges brought by the big data era changed these necessities and shifted research efforts in other directions.

Data stream solutions must process data whose distribution can change over time, and adapt to the restrictions this type of evaluation environment brings with it. Firstly, the models must process each incoming sample just once, since the data cannot be stored indefinitely (Read et al., 2012). Besides, the algorithmic solutions must be ready to predict new cases at any point and expect an infinite data stream, despite using finite and limited resources regarding time and memory (Read et al., 2012; Kocev et al., 2013; Sousa and Gama, 2018).

These data streams are varied and can originate from many sources, ranging from sensor networks and manufacturing processes to video streams and user operations in a web browser. Although the characteristics of such data impose restrictions on the algorithms, well-known qualities of the resulting learning models are still expected: fast and reliable predictions, generalization capabilities, among other aspects.

In the ML data stream literature, most efforts were devoted to single-target (ST) tasks, mainly classification (Gama, 2010; Nguyen et al., 2015; Gomes et al., 2017; Krawczyk et al., 2017), while a smaller number of publications addressed single-target regression (STR) tasks (Ikonomovska et al., 2011b; Duarte et al., 2016; Gouk et al., 2019). Moreover, little attention was paid to structured output tasks (Kocev et al., 2013; Borchani et al., 2015; Waegeman et al., 2018), i.e., when multiple target variables are related to the same set of input features. Nonetheless, this type of prediction task reflects many aspects of real-world problems, including streaming ones, such as predicting river flow properties, sales of multiple products, or airline ticket prices (Read et al., 2012; Spyromitros-Xioufis et al., 2016; Sousa and Gama, 2018). In this work, we focus on multi-target regression (MTR) tasks, which concern, as the name implies, the prediction of multiple continuous output variables. These targets can present correlations among themselves, since they are explained by the same set of inputs or represent correlated quantities in real-world problems, information which can be used when modeling the learning algorithms to reduce the overall prediction error.

MTR is a relatively new research field even for batch problems (Borchani et al., 2015; Spyromitros-Xioufis et al., 2016; Mastelini et al., 2017; Melki et al., 2017; Mastelini et al., 2018; Santana et al., 2018), and the few existing online solutions (Ikonomovska et al., 2011a; Duarte and Gama, 2015; Osojnik et al., 2018) still have room for improvement, for instance, by effectively exploring the inter-target dependencies to further improve the quality of the predictions. In fact, some of the ideas employed in the batch setting can be ported to online scenarios without a large impact on the required computational resources. On the other hand, newly tailored online MTR methods must cope with the requirements of learning from data streams, considering that limited computational resources are available. Hence, the search for a balance between sophistication and feasibility is essential for the success of new methods.

In this sense, we propose an extension to existing incremental decision tree algorithms which combines the targets' predictions to explore their inter-dependencies. In our proposal, called Stacked Single-target Hoeffding Tree (SST-HT), we follow the common trend in the batch MTR literature of employing stacked single-target (SST) regression models to improve predictions (Spyromitros-Xioufis et al., 2016; Mastelini et al., 2017, 2018; Melki et al., 2017; Santana et al., 2018). Nevertheless, our proposal does not change how the decision tree models are built, i.e., how the decision splits are performed, nor does it greatly impact the amount of computational resources needed. Therefore, we can expect the same tree structure while obtaining predictions with smaller errors than the traditional approaches for MTR in data streams. Two variations of SST-HT are evaluated: a first one that only employs the stacked regressors as predictors, and a second one which dynamically selects between using the stacked regressors or the traditional mean and linear predictors.

We verify the superiority of our proposal regarding predictive performance through an extensive experimental evaluation comprising sixteen MTR datasets. To the best of our knowledge, this is the largest number of datasets employed as benchmarks for data stream MTR algorithms so far. Six of the evaluated datasets were employed as MTR resources for the first time in this research. Among these new datasets, SCFP was specially tailored for this work by adapting an existing dataset and adding textual (provided by a word embedding model (Pennington et al., 2014)) and geolocated information (see Appendix A for detailed information).

Our experimental results show that SST-HT was capable of obtaining the smallest prediction errors while adding little extra memory consumption and a processing time linearly comparable to the existing solutions. The remainder of this work is organized as follows. Section 2 presents the background for understanding MTR solutions for data streams, as well as a literature review on the research subject. Section 3 provides the theoretical foundation of traditional Hoeffding Tree (HT) algorithms, which are the basis of ours and preceding proposals. SST-HT is described in detail in Section 4. Section 5 presents our experimental setup, including the datasets, evaluation strategy, and metrics, as well as the configurations used for the decision tree algorithms. The obtained results are discussed in Section 6, and our final considerations are presented in Section 7. Finally, detailed information concerning the employed datasets and the obtained results is presented in Appendices A and B.

2 Background and Related Work

Multi-target regression (MTR) deals with the prediction of multiple continuous target variables associated with a shared set of input (explanatory) variables. This task can be seen as an extension of single-target regression (STR) (Osojnik et al., 2018). However, MTR problems aim not only at modeling the input-to-output relations, but also possible inter-output dependencies. This consideration can lead to a better representation of the problem at hand and also reduce the obtained prediction error. On the other hand, this additional effort demands solutions specially developed for MTR tasks, which are frequently more complex than employing a separate STR model for each target variable.

Formally, an MTR task can be described as the search for a function $f$ (or a set thereof) which models the relations between a set of $m$ input variables (real numbers, ordinal or nominal values) and a set of $d$ continuous output variables. Therefore, an MTR task can be represented by the expression

$$f : X_1 \times X_2 \times \dots \times X_m \rightarrow \mathbb{R}^d$$

From $f$ it is possible to obtain a prediction $\hat{y} = f(x)$ for an instance $x$. When constructing $f$, one expects that the obtained values of $\hat{y}$ are as close as possible to the real measured target values $y$. Regarding algorithmic solutions, according to Kocev et al. (2013), MTR problems have been solved using two main strategies: global methods and local methods. Global methods correspond to solutions that use a single model to predict all target variables at once. Those methods implicitly model the inter-target dependencies, and also offer more compact and less computationally costly solutions for MTR tasks, being more viable for online scenarios. Local methods, on the other hand, employ traditional STR solutions and often manipulate or modify the input space to insert the inter-target dependency information into the problem modeling. In this sense, local methods use multiple ST regressors to solve an MTR task, frequently more than one regressor per target variable (Spyromitros-Xioufis et al., 2016; Mastelini et al., 2017; Santana et al., 2018), which in turn makes them more costly than the global ones. The simplest local solution for MTR tasks, as previously mentioned, consists in creating one ST regressor for each target and simply ignoring the underlying inter-target dependencies.

MTR problems have gained increasing attention in the batch setting (Borchani et al., 2015; Spyromitros-Xioufis et al., 2016; Mastelini et al., 2017; Melki et al., 2017; Mastelini et al., 2018; Santana et al., 2018), as they relate to multiple real-life problems, such as the prediction of river flow properties, online sales, airline ticket prices, and poultry meat properties, among others (Spyromitros-Xioufis et al., 2016; Santana et al., 2018). On the other hand, little attention has been given to MTR (and even STR) problems in the online setting. Dealing with MTR in data streams not only assumes that the same computational resource constraints apply as in online ST classification tasks, but also brings the additional challenge of producing multiple predictions at the same time for possibly inter-correlated targets (Borchani et al., 2015; Waegeman et al., 2018).

One of the first efforts to tackle regression problems in an online setting was made by Ikonomovska et al. (2011b). The authors proposed an online and incremental method to build regression trees and deal with concept drift. Their proposal, called FIMT-DD (Fast Incremental Model Tree with Drift Detection), employs the Hoeffding bound to decide whether a split must be made (Domingos and Hulten, 2000; Gama, 2010), much like the VFDT (Very Fast Decision Tree) (Domingos and Hulten, 2000). The authors proposed using perceptrons without activation functions at the tree's leaves to provide the responses. As one of the first works to tackle regression problems on data streams, the authors mostly evaluated their approach against traditional batch regression algorithms, but their work pioneered the research on Hoeffding (tree and decision rule) algorithms for STR and MTR. FIMT-DD was designed to deal only with numerical attributes, limiting its application to nominal data unless some data transformation technique, e.g., one-hot encoding, is applied.

The same authors also proposed FIMT-MT (Fast Incremental Model Tree - Multi-target) (Ikonomovska et al., 2011a), an extension of the FIMT-DD algorithm to MTR settings. In their proposal, the authors apply the principles of predictive clustering trees (Kocev et al., 2013) to make decision splits on multiple targets, which makes their method a global solution. The main idea is to consider each split as the induction of a cluster. In this sense, the root node corresponds to the cluster that contains all data. Each new split aims at reducing the intra-cluster variance of the newly created partitions, while maximizing the inter-cluster variance. Similarly to the FIMT-DD algorithm, its MTR counterpart employs perceptrons at the leaves, one model per target. On the other hand, no mechanism for dealing with concept drift is inherited from the original STR method. Like FIMT-DD, FIMT-MT supports only numeric attributes.

Deviating from the tree-based solutions, Almeida et al. (2013) proposed the Adaptive Model Rules (AMRules) algorithm for online STR tasks, which was later expanded by Duarte et al. (2016). The authors propose using decision rules for STR problems. Similarly to the FIMT-DD algorithm, perceptron models are employed to generate the responses. AMRules uses a built-in mechanism for dealing with concept drift based on the Page-Hinkley (PH) test (Ikonomovska et al., 2011b), which simply drops outdated decision rules. In addition, the decision rule algorithm also has a routine to detect anomalous samples, e.g., noisy data, which are then not employed to update the decision models.

Duarte and Gama (2015) also expanded the AMRules framework to deal with MTR tasks. The authors again employed the idea that the created partitions must reduce the variance in the output space. Unlike the previous solutions, however, AMRules does not fit the global/local categorization. Instead, the MTR version of AMRules is capable of specializing in subsets of targets by performing as follows: when executing a rule expansion test, if the variance in the target space is reduced only for some targets, a new decision rule encompassing the targets benefiting from the split is created; a complementary rule without the expansion is also created for the remaining targets. Therefore, AMRules creates decision rules which can encompass all the targets, some of them, or even a single target, hence being a hybrid between a local and a global method. AMRules was also adapted to deal with multi-label classification tasks (Sousa and Gama, 2018).

Following the trend of using tree-based methods in streaming scenarios, Osojnik et al. (2015a, 2018) proposed an extension of the FIMT-MT algorithm called iSOUP-Tree (incremental Structured Output Prediction Tree). Their proposal builds upon the work of Ikonomovska et al. (2011a) by enabling support for categorical features and employing an adaptive prediction model at the leaves. Instead of only using perceptrons at the leaves, iSOUP-Tree also maintains a mean predictor for each target and selects the best current model by monitoring a faded error metric for each of them. The authors adapted the iSOUP-Tree algorithm to multiple settings, including ensembles (Bagging and Random Forest) and Option Trees (Osojnik et al., 2018), and employed all the variations of the MTR method to tackle multi-label classification tasks (Osojnik et al., 2015b, 2017).

Despite the efforts made to build methods tailored to split the decision space considering all the targets, none of the presented solutions effectively takes advantage of the inter-target dependencies at the moment of generating predictions. In all of the presented cases, individual models are created for each target, simply ignoring how their values relate to each other. Inspired by the work of Spyromitros-Xioufis et al. (2016), we propose employing the Stacked Single-target (SST) strategy in the leaves' models to further improve the predictive performance of MTR decision tree models. Also, considering the mutable characteristics of streaming tasks, our proposal, called Stacked Single-target Hoeffding Tree (SST-HT), can be set up to dynamically select, for each target, when to use its SST model, the standard perceptron, or the most straightforward mean predictor.

Considering that our work builds upon the iSOUP-Tree algorithm, and consequently FIMT-MT, the following section presents in more detail the base algorithm for building incremental MTR decision trees. Next, SST-HT is presented in detail.

3 Online Multi-target regression Trees

This section presents the traditional strategies for building decision tree algorithms for data streams. Firstly, the general Hoeffding Tree algorithm is presented, followed by its application to MTR tasks, which can be referred to as the Multi-target regression Hoeffding Tree (MTR-HT).

3.1 Hoeffding Tree algorithm

All of the previously presented tree-based solutions for STR and MTR in data streams rely on the Hoeffding bound (HB) (Hoeffding, 1963) as the split decision criterion. This idea was first proposed by Domingos and Hulten (2000) in their well-known work presenting the VFDT. The HB provides statistical evidence that the currently most promising split decision would be the best one, given that enough observations have been made. Therefore, splits are only performed when the decision tree has gathered enough statistical evidence to support this operation. Hence, the split decisions have statistical guarantees to deviate from the expected value by at most a value $\epsilon$.

Suppose a heuristic measure $H$ that provides a score for each attempted split decision in a feature. At time step (instance) $t$, the current heuristic value is denoted by $H_t$. The greater $H_t$, the better the candidate input space partitioning. Let $X_a$ be the input feature holding the current best split candidate, which has a score of $H_t(X_a)$, and let $X_b$ be the feature holding the second best split candidate, with heuristic score $H_t(X_b)$. By monitoring these scores through time, a new random variable, the ratio between them, can be derived:

$$r_t = \frac{H_t(X_b)}{H_t(X_a)}$$

The Hoeffding Tree (HT) algorithm assumes that, given enough data, $r$ would follow a normal distribution, and therefore its expected value would be equal to the population mean. However, considering that a stream can be potentially unbounded, it is not possible to use all observations to calculate this average. On the other hand, the sample mean at time step $t$ can be easily calculated as:

$$\bar{r}_t = \frac{1}{t} \sum_{i=1}^{t} r_i$$

Using Hoeffding's inequality (Hoeffding, 1963), we can state that the expected value of $r$ does not deviate from its sample mean at time step $t$ by more than a factor $\epsilon$, given a parameter $\delta$ which determines the confidence level $1 - \delta$. For brevity, from here onward the time/instance indicator $t$ is omitted from the mathematical expressions. Equation 1 gives the simplified form of Hoeffding's inequality (considering the range $R$ of $r$) subjected to $1 - \delta$.

$$P\left(\mathbb{E}[r] - \bar{r} \geq \epsilon\right) \leq e^{-2t\epsilon^2 / R^2} = \delta \quad (1)$$

From Equation 1, we can isolate $\epsilon$ in terms of $\delta$ in the following form

$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2t}} \quad (2)$$

The value of $\epsilon$ obtained from Equation 2 enables us to bound a deviation interval within which we do not expect the sample mean to differ from its expected value, i.e., $\mathbb{E}[r] \in [\bar{r} - \epsilon, \bar{r} + \epsilon]$ with confidence level $1 - \delta$. Thus, if $\bar{r} + \epsilon < 1$, that implies $\mathbb{E}[r] < 1$. Hence, we can assume that the split decision in the feature which generated $H(X_a)$ is really the best choice for making a new partition.

Nonetheless, in some cases two decision splits may achieve almost equal heuristic scores, which means they are equally good choices. In these cases, even if the value of $\epsilon$ substantially shrinks, no split decision will be made. To avoid that, an additional threshold or tie-break parameter $\tau$ is added. In this sense, a split is performed whenever $\bar{r} + \epsilon < 1$ or, in case $\bar{r}$ becomes too close to 1, if $\epsilon < \tau$.
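To make the split rule concrete, the minimal sketch below checks the HB condition for the ratio of heuristic scores. It is our illustration rather than code from any particular framework; the range $R = 1$ and the default values of $\delta$ and $\tau$ are assumptions.

```python
import math

def hoeffding_bound(value_range, delta, n):
    # Epsilon such that the true mean deviates from the sample mean of
    # n observations by more than epsilon with probability at most delta.
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(ratio_mean, n, delta=0.01, tau=0.05, value_range=1.0):
    # Split if the mean ratio between the second-best and best heuristic
    # scores is confidently below 1, or if the two candidates are tied
    # (epsilon already below the tie-break threshold tau).
    eps = hoeffding_bound(value_range, delta, n)
    return (ratio_mean + eps < 1.0) or (eps < tau)
```

Since all the tree methods compared in this work share this split mechanism and differ only in their leaf predictors, the same check applies to MTR-HT, iSOUP-Tree, and SST-HT.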

3.2 Multi-target regression Hoeffding Trees

Both FIMT-MT and iSOUP-Tree employ the intra-cluster variance reduction (ICVarR) as the heuristic score, following the steps of the predictive clustering framework (Kocev et al., 2013). In this proposal, the variance is treated as a measure of the dispersion of the objects in a partition (i.e., a cluster) around its center of mass (the centroid) (Ikonomovska et al., 2011a). The ICVarR calculation for a set of partitions $\{S_1, \dots, S_p\}$ over a sample $S$ is given by

$$\text{ICVarR}(S, \{S_1, \dots, S_p\}) = \text{ICVar}(S) - \sum_{i=1}^{p} \frac{|S_i|}{|S|} \, \text{ICVar}(S_i)$$

where the intra-cluster variance (ICVar) is calculated for a sample $S$ with $d$ targets as follows

$$\text{ICVar}(S) = \frac{1}{d} \sum_{j=1}^{d} \frac{1}{|S|} \sum_{y \in S} \left(y^j - \bar{y}^j\right)^2$$
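As a concrete reading of the two expressions above, the sketch below scores a candidate split from per-partition target matrices. It is a direct, non-incremental transcription for illustration; the incremental version relies on the sufficient statistics discussed next.

```python
import numpy as np

def icvar(Y):
    # Mean, over the d targets, of the per-target variance within
    # the partition; Y has shape (n_samples, d).
    return float(np.mean(np.var(Y, axis=0)))

def icvarr(Y, partitions):
    # Variance reduction obtained by splitting sample Y into the given
    # disjoint sub-samples: higher scores mean more homogeneous children.
    n = Y.shape[0]
    return icvar(Y) - sum(len(P) / n * icvar(P) for P in partitions)
```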

In order to incrementally estimate the variance values for each target, and also evaluate the split candidates, sufficient statistics must be stored. As shown in the recent MTR data stream literature (Ikonomovska et al., 2011a; Osojnik et al., 2018), maintaining, for each leaf node, a counter of the number of elements seen ($n$), the sum of each target ($\sum y^j$), and the sum of their squared values ($\sum (y^j)^2$) is enough to calculate the required measures. Besides, maintaining the same set of statistics for the input features enables their standardization and combination in the leaf models. This is especially relevant when computing the ICVar, to avoid possibly different scales of the targets impacting the obtained heuristic scores. The input features and targets are standardized using the z-score approach, i.e., they are centered by their mean values and scaled by their standard deviations (Osojnik et al., 2018).
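As an illustration of how such sufficient statistics can be kept per leaf node and per variable, consider the sketch below; the class and method names are ours, and it assumes at least one observation before the statistics are queried.

```python
class RunningStats:
    # Tracks n, sum(y) and sum(y^2) for one variable, which is enough to
    # recover the mean, the variance, and z-scores incrementally.
    def __init__(self):
        self.n, self.total, self.sq_total = 0, 0.0, 0.0

    def update(self, y):
        self.n += 1
        self.total += y
        self.sq_total += y * y

    def mean(self):
        return self.total / self.n

    def variance(self):
        # Var(y) = (sum(y^2) - (sum(y))^2 / n) / n, from the stored sums
        return (self.sq_total - self.total ** 2 / self.n) / self.n

    def z_score(self, y):
        # Standardization applied to both inputs and targets at the leaves
        sd = self.variance() ** 0.5
        return (y - self.mean()) / sd if sd > 0 else 0.0
```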

Numerical attributes are monitored using the Extended Binary Search Tree (E-BST), as proposed by Ikonomovska et al. (2011b, a) and later expanded by Osojnik et al. (2018). The FIMT-MT algorithm does not support categorical features, as previously mentioned, but this functionality is added by iSOUP-Tree by creating a tree branch for each possible feature value in the case of a split.
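For illustration, the following sketch maintains a single-target E-BST; in the MTR case, each node would keep one set of statistics per target. The structure follows the cited descriptions, but the names and details are our assumptions rather than code from the original papers.

```python
class EBSTNode:
    # Extended Binary Search Tree node: besides ordering the observed
    # attribute values, each node aggregates target statistics for the
    # instances routed to its "<= key" and "> key" sides, so that every
    # observed value can later be scored as a candidate split threshold.
    def __init__(self, key):
        self.key = key
        self.left = self.right = None
        self.le = [0, 0.0, 0.0]  # n, sum(y), sum(y^2) for values <= key
        self.gt = [0, 0.0, 0.0]  # same statistics for values > key

    @staticmethod
    def _accumulate(stats, y):
        stats[0] += 1
        stats[1] += y
        stats[2] += y * y

def ebst_insert(node, value, y):
    # Insert one (attribute value, target) observation, creating nodes
    # for previously unseen values.
    if value <= node.key:
        EBSTNode._accumulate(node.le, y)
        if value < node.key:
            if node.left is None:
                node.left = EBSTNode(value)
                EBSTNode._accumulate(node.left.le, y)
            else:
                ebst_insert(node.left, value, y)
    else:
        EBSTNode._accumulate(node.gt, y)
        if node.right is None:
            node.right = EBSTNode(value)
            EBSTNode._accumulate(node.right.le, y)
        else:
            ebst_insert(node.right, value, y)
```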

The original proposal of FIMT-MT only employs perceptron models without activation functions at the leaves, one predictor per target. These models are incrementally trained using the Delta rule (Ikonomovska et al., 2011b) for updating the linear weights. iSOUP-Tree introduces the use of adaptive models for each target, i.e., choosing between the perceptron and a more straightforward mean predictor at the moment of making predictions. To this end, a faded error metric is monitored to assess which is the current best performer for each target.

As previously mentioned, neither of the employed prediction strategies for the leaf nodes effectively takes into consideration the existence of inter-target correlations. We reason that this possibility should be considered when making predictions, in order to increase the accuracy of the whole tree model, as well as to better capture the characteristics of the task at hand. Next, we describe how SST-HT considers this aspect during tree construction.

4 Online Multi-target regression Tree with stacked leaf models

Our proposal, the Stacked Single-target Hoeffding Tree (SST-HT), was tailored to encompass the best aspects of the existing tree-based solutions while increasing the predictive performance. Our reasoning is that if a partition was induced considering the target space, the targets within this region should present inter-correlations. By using stacked models (Gama and Brazdil, 2000; Spyromitros-Xioufis et al., 2016) at the leaves, those inter-correlations can be explored and used to decrease the prediction error.

The traditional usage of linear models at the leaf nodes consists of creating as many perceptrons as the number of targets $d$. Therefore, the predictions are computed separately, considering only the original problem's $m$ features and a bias term for each linear model. As previously mentioned, the input features of each instance $x$ are standardized using the z-score strategy, resulting in a normalized instance $x'$. The normalized prediction for the $j$-th target is calculated as follows

$$\hat{y}_j' = \sum_{i=1}^{m} w_{ji} x_i' + b_j$$

where $w_{ji}$, $i \in \{1, \dots, m\}$, are the weights of the linear model and $b_j$ its bias. Given the standardized value of the expected response, $y_j'$, the linear predictor's weights are updated with the Delta rule

$$w_{ji} \leftarrow w_{ji} + \eta \left(y_j' - \hat{y}_j'\right) x_i'$$

where $\eta$ represents the learning factor. In the standard MTR HT models, the final predictions are computed by transforming these values back to their original scales and ranges. Our proposal, in turn, adds another layer of linear models to combine and enhance the responses of the previously defined ones. We call these newly added predictors meta models, whereas the linear regressors which use the normalized input features are called base models. The new normalized responses are computed as follows:

$$\hat{y}_j'' = \sum_{k=1}^{d} v_{jk} \hat{y}_k' + c_j$$

and their corresponding weights $v_{jk}$, $k \in \{1, \dots, d\}$, and bias $c_j$ are updated using the Delta rule as well:

$$v_{jk} \leftarrow v_{jk} + \eta \left(y_j' - \hat{y}_j''\right) \hat{y}_k'$$

Note that in both presented weight update expressions there are no explicit input values for the bias terms; this input is typically set to the unit value, as we did. Also note that both model update expressions use the same learning factor ($\eta$). In our implementation of SST-HT, the same value was employed for both the base and the meta models. The usage of different learning rates for the base and meta models can be a target of future research, as well as the use of decaying factors for these parameters. Such an investigation, however, is out of the scope of the current work.
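The following sketch condenses the stacked leaf predictor described above, assuming inputs and targets are already standardized; the class name, the default learning factor, and the use of numpy are our illustrative choices.

```python
import numpy as np

class StackedLeafModel:
    # d base perceptrons map the m standardized inputs to per-target
    # responses; d meta perceptrons re-predict each target from the d
    # base outputs (plus a unit bias input), exposing inter-target
    # dependencies.
    def __init__(self, n_features, n_targets, eta=0.01, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        self.eta = eta
        # +1 column for the bias weight, whose input is fixed to 1
        self.W_base = rng.uniform(-1, 1, (n_targets, n_features + 1))
        self.W_meta = rng.uniform(-1, 1, (n_targets, n_targets + 1))

    @staticmethod
    def _with_bias(v):
        return np.append(v, 1.0)

    def predict(self, x_norm):
        y_base = self.W_base @ self._with_bias(x_norm)
        y_meta = self.W_meta @ self._with_bias(y_base)
        return y_base, y_meta

    def update(self, x_norm, y_norm):
        # Delta-rule update of both layers with the same learning factor;
        # the meta layer consumes the base layer's current outputs.
        y_base, y_meta = self.predict(x_norm)
        self.W_base += self.eta * np.outer(y_norm - y_base,
                                           self._with_bias(x_norm))
        self.W_meta += self.eta * np.outer(y_norm - y_meta,
                                           self._with_bias(y_base))
```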

Similarly to the iSOUP-Tree strategy, SST-HT implements the possibility of using different predictors adaptively. While the former chooses between the perceptron and mean predictors, our proposal adds a third model: the stacked perceptron predictor. In the same way as the preceding method, SST-HT uses a faded error metric for online predictor selection. Following the proposal of iSOUP-Tree's authors, the faded Mean Absolute Error (fMAE) is employed to monitor the performance of each predictor for each target variable. This metric, presented in Equation 3, uses an exponential decay to assign more importance to the most recent cases, while giving less relevance to examples distant in time from the current evaluation. In the equation, $p \in$ {mean, perceptron, stacked perceptron} identifies the predictor, $f \in (0, 1)$ is the fading factor, and $\hat{y}_{i,p}^j$ is the prediction of $p$ for target $j$ at time $i$.

$$\text{fMAE}_p^j(t) = \frac{\sum_{i=1}^{t} f^{t-i} \left| y_i^j - \hat{y}_{i,p}^j \right|}{\sum_{i=1}^{t} f^{t-i}} \quad (3)$$

The user can select a specific prediction model to use, or rely on the dynamic selection, as presented. It is worth mentioning that the choice of predictors does not impact the tree structure, since the splits only consider the increase in the partitions' homogeneity, regardless of the observed prediction errors. Therefore, we expect the same tree structures for SST-HT as the traditional MTR HT ones. On the other hand, a slight increase in memory usage and in training time is also observed, considering that an additional set of predictors, together with its monitored error, is required at each leaf node. Note, however, that as the number of targets is typically much smaller than the number of input features, i.e., $d \ll m$, the meta models at the leaves have fewer adjustable parameters (artificial neuron weights) than their base counterparts.
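Equation 3 can be maintained in constant time and memory by decaying both its numerator and denominator at each step. A sketch follows; the default fading factor of 0.95 is an assumption for illustration, not a value prescribed by the text.

```python
class FadedMAE:
    # Faded MAE: err and weight decay by f at each step, so that
    # fMAE(t) = sum_i f^(t-i) |e_i| / sum_i f^(t-i) holds recursively.
    def __init__(self, fading=0.95):
        self.f, self.err, self.weight = fading, 0.0, 0.0

    def update(self, y_true, y_pred):
        self.err = self.f * self.err + abs(y_true - y_pred)
        self.weight = self.f * self.weight + 1.0

    def get(self):
        return self.err / self.weight if self.weight > 0 else float('inf')

def pick_predictor(fmaes):
    # Per-target adaptive choice among the monitored predictors, e.g.,
    # {'mean': ..., 'perceptron': ..., 'stacked': ...} -> best name.
    return min(fmaes, key=lambda name: fmaes[name].get())
```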

5 Experimental Setup

This section describes the experimental setup employed in our experiments. It includes the datasets, the settings of the tree predictors, as well as the performance metrics and evaluation strategy employed during the experiments. All experiments were executed using the scikit-multiflow framework (available at https://github.com/smastelini/scikit-multiflow), an open and free platform for data stream analysis and prediction.

5.1 Datasets

A total of 16 datasets were considered in the experiments. They comprise a vast range of synthetic and real-world problems, belonging to different application domains and having different characteristics. The main characteristics of the considered sets are summarized in Table 1. Some of the employed datasets were already reported in previous MTR research (Duarte and Gama, 2015; Osojnik et al., 2018), whereas the datasets CPU, NPSDecay, SCFP, Sulfur, and Wine are applied in the context of online MTR for the first time in this work. In particular, the employed version of SCFP was specially tailored for this work. A description of each of the considered datasets is presented in Appendix A.

Dataset        #Examples  #Numeric Inputs  #Categorical Inputs  #Outputs  Source
2Dplanes         256,000        20                 0                8     Duarte and Gama (2015)
Bicycles          17,379        13                 9                3     Duarte and Gama (2015)
CPU †              8,192        22                 0                4     -
Electricity †     45,312         6                 0                2     Gama et al. (2004)
Eunite03           8,064        29                 0                5     Duarte and Gama (2015)
FriedD           256,000        10                 0                4     Duarte and Gama (2015)
FriedAsyncD      256,000        10                 0                4     Duarte and Gama (2015)
MV               256,000        16                 4                9     Duarte and Gama (2015)
NPSDecay †       455,109        25                 0                5     Cipollini et al. (2018)
RF1                9,005        64                 0                8     Spyromitros-Xioufis et al. (2016)
RF2                7,679       576                 0                8     Spyromitros-Xioufis et al. (2016)
SCFP †           223,129        54                 3                3     -
SCM1d              9,803       280                 0               16     Spyromitros-Xioufis et al. (2016)
SCM20d             8,966        61                 0               16     Spyromitros-Xioufis et al. (2016)
Sulfur †          10,081         5                 0                2     Fortuna et al. (2007)
Wine †             6,497         8                 0                4     Cortez et al. (2009)
Table 1: Datasets considered in the experiments. Sets marked with † were first proposed or adapted to MTR tasks in this research (refer to Appendix A for more details)

5.2 Settings used in the tree predictors

During our experiments, we fixed some parameters of the tree predictors following typical configurations from the literature (Domingos and Hulten, 2000; Duarte and Gama, 2015; Osojnik et al., 2018). Split attempts were performed at fixed intervals of incoming samples, and the significance level $\delta$ for the HB calculation and the tie-break parameter $\tau$ were set to the values typically adopted in those works.

In addition, in all cases an initial set of samples was employed to warm-start the tree predictors. Lastly, the perceptron weights were initialized with uniform random values. In case of a split, the new leaf nodes inherit their ancestors' weights.

Regarding the decision tree algorithms, we evaluated two variants of our proposal against three variants of iSOUP-Tree, which is the predecessor of our method. Table 2 summarizes the compared method variants, including their main characteristics and the acronyms used from here onward.

Acronym             Description
MTR-HT(mean)        MTR-HT variant which uses the mean of the targets as the responses at the leaf nodes
MTR-HT(perceptron)  MTR-HT variant which uses a perceptron model per target at the leaf nodes
iSOUP-Tree          iSOUP-Tree method, which dynamically selects between the two previous prediction variants
SST-HT              Version of our proposal which always uses the stacked regressors for making predictions
SST-HT(adaptive)    Variant of our proposal which dynamically selects between the mean, perceptron, and stacked perceptron predictors for each target
Table 2: Description of the methods compared in our experiments

5.3 Evaluation strategy

To evaluate the compared methods for MTR problems we employed the prequential strategy (Gama, 2010). In this scheme, a sample is first used to test the learning model, and only afterwards is it used to update the predictor. For all the considered metrics we computed their mean value over the complete data streams, and also considered windowed measurements using non-overlapping sliding windows of fixed size, as suggested in the recent literature (Osojnik et al., 2018). All MTR methods were run thirty times for each dataset and the results averaged over the repetitions, aiming at reducing the effects of randomness in our evaluation.

In this sense, to monitor errors, the average Root Mean Square Error (aRMSE) was computed, considering both errors per sliding window and an overall measurement using all the seen data. For $n$ evaluated samples and $d$ targets, the calculation is given by Equation 4.

$$\text{aRMSE} = \frac{1}{d} \sum_{j=1}^{d} \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i^j - \hat{y}_i^j \right)^2} \quad (4)$$
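The sketch below illustrates the prequential (test-then-train) protocol combined with the windowed aRMSE of Equation 4; the stream and the model interface (predict/learn) are assumptions for illustration.

```python
import numpy as np

def prequential_armse(stream, model, window=1000):
    # Test-then-train: predict on each incoming sample before using it
    # to update the model; report one aRMSE per non-overlapping window.
    sq_err, seen, results = None, 0, []
    for x, y in stream:
        y = np.asarray(y, dtype=float)
        y_hat = np.asarray(model.predict(x), dtype=float)
        err = (y - y_hat) ** 2
        sq_err = err if sq_err is None else sq_err + err
        model.learn(x, y)          # the model is updated only after testing
        seen += 1
        if seen % window == 0:
            # aRMSE: mean of the per-target RMSEs within this window
            results.append(float(np.mean(np.sqrt(sq_err / window))))
            sq_err = None
    return results
```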

Additionally, the average amount of time spent by each method (in seconds) and the total memory consumed by the predictors (in MB) were also reported. All metrics were recorded at fixed intervals of samples in all cases. We also performed statistical tests to verify whether a prediction model was significantly better than the others regarding the considered evaluation metrics. To this end, the Friedman test and the post-hoc Nemenyi test were employed, as described in Demšar (2006). These tests considered the windowed evaluations, in order to describe how the compared methods evolved through time regarding the performance metrics.

6 Results and Discussion

This section presents our experimental findings regarding the compared MTR techniques. We discuss the obtained results in terms of predictive performance, running time, and model size. We also highlight some cases in detail, while detailed information about all the evaluated datasets is presented as supplementary material (see Appendix B).

6.1 Predictive Performance

Firstly, regarding the observed error values, Table 3 summarizes the mean measured errors, i.e., considering the whole processed streams. The smallest observed errors per dataset are highlighted in bold.

Table 3: Mean aRMSE values observed (after processing the whole stream) for each dataset and compared method

As depicted in the table, SST-HT(adaptive) was the most accurate method in the majority of the datasets. The standard SST-HT, which always uses the stacked perceptron predictors, reached the smallest aRMSE in just one case (Eunite03). The same was observed for MTR-HT(mean) (Bicycles) and MTR-HT(perceptron) (Wine). The second best performer in this analysis was iSOUP-Tree, which reached the smallest errors in the remaining cases. In general, as expected, the adaptive variants of the tree predictors obtained the most accurate results.

In addition to the mean observed error over the whole stream, the evolution of the observed error through time was also considered. We encountered different patterns depending on the dataset. For visualization purposes, we present line plots for two of the evaluated datasets, Bicycles and SCM1d, in Figures 3(a) and 3(b), respectively. In the first case, almost all methods presented the same behavior regarding the error values. Until around 8000 samples, SST-HT presented the smallest errors, but from this point until the end of the stream, the most straightforward MTR-HT(mean) achieved much smaller errors than the other methods. In the second case, SST-HT(adaptive) maintained the most accurate predictions throughout the whole stream, whereas MTR-HT originated the worst ones. Its standard counterpart, SST-HT, which had presented the second smallest errors for much of the stream, showed a sudden increase in its error, yielding the second worst results until the end of the stream. Similar plots for all the considered datasets are presented in Appendix B.1.

Figure 3: Exemplification of the different patterns observed for the compared methods in the datasets (a) Bicycles and (b) SCM1d, regarding the aRMSE

Considering the different observed error behaviors, we employed a statistical test to evaluate how the methods compared through time regarding their errors. To this end, we considered the errors measured in the sliding windows, as presented in Section 5.3. The Friedman statistical test, followed by the Nemenyi post-hoc test, was performed to compare the methods. We graphically organized our findings as proposed by Demšar (2006). The mentioned analysis is presented in Figure 4.

Figure 4: Friedman and post-hoc Nemenyi test results for the windowed aRMSE values, considering all the evaluated datasets and MTR methods

As previously indicated by Table 3, SST-HT(adaptive) achieved the smallest errors among all the compared techniques. The second best performer was iSOUP-Tree, which also employs adaptive models at the leaf nodes. Surprisingly, the tree variant using only the mean predictor (MTR-HT(mean)) was better than the perceptron-based methods (both the simple and stacked variants). Nevertheless, employing stacked linear predictors resulted in smaller errors than using only one perceptron per target variable.

6.2 Running time

Regarding the observed running times, as expected, the simplest alternative (MTR-HT(mean)) was also the fastest one in almost all cases, as shown in Table 4. Again, the smallest running times per dataset are in bold. The MTR-HT(perceptron) variant achieved the second fastest running times in most cases. SST-HT, in general, performed as fast as iSOUP-Tree for the majority of the datasets. Also, as expected, SST-HT(adaptive) was the worst performer regarding running time, since besides dynamically choosing between predictors it also maintains and updates three different prediction models per leaf node. These slightly increased running times, however, can be disregarded given the gains in predictive performance.

Table 4: Running time for each dataset and compared method (in seconds)

Indeed, considering that all the compared tree methods only differ in the strategy the leaf nodes employ to generate predictions, an approximately linear relationship between the running times of the algorithms can be observed. For instance, we present such a comparison for the SCFP dataset in Figure 5.

Figure 5: Running time for the SCFP dataset

Similar behaviors were observed for all the datasets when considering the time spent by the decision tree models to process the complete data streams. Plots for all the evaluated datasets are presented in Appendix B.2.

6.3 Model size

We obtained results very similar to those of the running times when considering the size of the generated models. This, again, was expected given the extra layer of linear predictors used by our proposal. The total sizes of the trained models in megabytes (MB) at the end of the data streams are summarized in Table 5. Excluding our proposal's variations (SST-HT and SST-HT(adaptive)), the remaining methods differed only by small amounts (with the wins split among MTR-HT(mean), iSOUP-Tree, and MTR-HT(perceptron)), regardless of the dataset. Our method indeed spent more memory than the other competitors, but the additional amount of required resources was minimal in almost all cases, rendering this difference irrelevant for most real-world applications.

Table 5: Total model size for each dataset and compared method (in MB)

Similarly to the running time observations, the relation between the amount of memory spent by the different methods through time was linear in nearly all cases. Taking into consideration that our proposal does not impact the tree growth characteristics, the extra memory usage is constant for each dataset. This can be verified in detail for each dataset in Appendix B.3. As an illustration, Figure 6 shows the memory usage over time for the NPSDecay dataset, the set with the largest number of samples in our experiments.

Figure 6: Memory usage by the compared methods on the NPSDecay dataset

7 Final Considerations

In this work, we presented an extension to online MTR decision tree algorithms which further takes into consideration the characteristics of this kind of problem. Our proposal, called SST-HT, can improve the predictive performance without affecting the structure of the tree models. The main idea is to employ stacked linear models at the leaf nodes to capture and model possibly existing inter-target dependencies. In this sense, the split decisions are made in the same way as those of traditional online MTR tree methods. Similarly to existing solutions, our proposal is also able to dynamically select the adequate predictor for each instance, depending on the state of the data stream. Our method, however, selects among three predictors: the mean, perceptron, or stacked perceptron predictors.

We evaluated two variations of our proposal, SST-HT and SST-HT(adaptive), against three well-known tree algorithms for dealing with MTR tasks in data streams. A broad set of 16 benchmark datasets was employed in the experimental evaluation, to the best of our knowledge the most extensive set of online MTR tasks considered so far. We have shown that our proposal achieved the most accurate predictions in the majority of cases without demanding large amounts of extra computational resources.

As future research, we intend to verify the possibility of extending our ideas to ensembles of decision rules, like those in AMRules. In this sense, the modeling of inter-target dependencies could be further improved, since the mentioned method creates rules which encompass the subsets of targets that are most correlated. The employment of SST-HT as the base model for traditional online ensemble algorithms is another avenue for future research. Besides, we also want to evaluate alternatives for monitoring the statistics necessary for splitting on numerical attributes, making this procedure less costly. Finally, the application of our proposal to related tasks, e.g., online multi-label classification, could also be investigated.

Acknowledgments

The authors would like to thank FAPESP (São Paulo Research Foundation) for its financial support in grant #2018/07319-6. Besides, this research was carried out using the computational resources of the Center for Mathematical Sciences Applied to Industry (CeMEAI) funded by FAPESP (grant 2013/07375-0). Lastly, we also would like to give our special thanks to Ricardo Sousa and professor João Gama, for kindly providing some of the datasets employed in our experiments.

Appendix A Employed Datasets

This appendix describes the datasets employed in the experiments. Firstly, the datasets already reported as online MTR tasks in the literature are described. Next, the datasets used for the first time in this work are presented.

A.1 Existing datasets

This section briefly describes the datasets used in the experiments that were already reported in the literature (Duarte and Gama, 2015; Spyromitros-Xioufis et al., 2016; Osojnik et al., 2018).

Bicycles

The Bicycles dataset has been used in multiple online MTR studies (Duarte and Gama, 2015; Osojnik et al., 2018). It describes the hourly count of rental bikes in the Capital Bikeshare system between 2011 and 2012 (Duarte and Gama, 2015). The data contains weather and seasonal information for each rental event. The task consists in predicting the counts of casual (non-registered), registered, and total users.

Eunite03

The Eunite03 dataset was used during the competition of the 3rd European Symposium on Intelligent Technologies, Hybrid Systems and their implementation on Smart Adaptive Systems (2003). The dataset describes a process of continuous production of manufactured glasses (Duarte and Gama, 2015). The input features describe parameters employed when producing the glass products, while the outputs refer to the glass quality.

2Dplanes, FriedD, FriedAsyncD, and MV

2Dplanes, FriedD, and FriedAsyncD are artificial MTR datasets generated by Duarte and Gama (2015). They are modifications of well-known artificial ST regression tasks (Breiman, 2017). The FriedD and FriedAsyncD datasets contain one concept drift for each of the output targets. In FriedD the concept drifts occur simultaneously for all the target variables in the middle of the data stream, while in FriedAsyncD the concept drifts occur asynchronously (Duarte and Gama, 2015). Lastly, MV was also constructed by Duarte and Gama (2015), based on an artificial ST regression problem.

RF1 and RF2

The RF1 and RF2 (River Flow) datasets were first reported by Spyromitros-Xioufis et al. (2016) and have since been employed in MTR data streaming tasks (Duarte and Gama, 2015; Osojnik et al., 2018). The datasets concern the prediction of river network flows 48 hours in the future at specific locations. Hourly flow observations were registered for 8 sites in the Mississippi River network (US) over a period of one year (from September 2011 to September 2012). The data was obtained from the US National Weather Service. Each observation includes the most recent data, as well as delayed measurements from some hours in the past. The first dataset, RF1, uses only the sensor data, whereas the second one, RF2, adds precipitation forecast information (expected rainfall) for each of the measurement sites.

SCM1d and SCM20d

The SCM (Supply Chain Management) datasets were extracted from the Trading Agent Competition in Supply Chain Management (TACSCM) tournament of 2010. Again, these datasets were first proposed by Spyromitros-Xioufis et al. (2016), and have been applied to MTR data stream problems (Duarte and Gama, 2015; Osojnik et al., 2018). Each example corresponds to an observation day in the tournament (from a total of 220 days in each game and 18 games during the whole tournament). The input variables correspond to the observed prices for a specific tournament day. Additionally, four time-delayed observations are added for each observed product and component (delays of 1, 2, 4, and 8 days), aiming at facilitating the anticipation of trends. Each dataset has 16 targets, which correspond to the predictions of the next-day mean price (SCM1d) or the mean price 20 days in the future (SCM20d) for each product in the simulation.

A.2 New datasets

This section describes the datasets that were evaluated as MTR tasks in streaming scenarios for the first time in this work.

CPU

The Computer Activity database (https://www.cs.toronto.edu/~delve/data/comp-activ/desc.html), collected around 1996 at the University of Toronto, records multiple performance measures, such as the number of bytes read and written from system memory. All data was collected from a Sun SPARCstation 20/712, which had 2 CPUs (Central Processing Units) and 128 megabytes of main memory. The records concern the monitoring of ordinary computer usage, for example, browsing the web or using text editors, and were gathered at intervals of five seconds. Originally, the task related to this dataset concerned predicting the portion of time the CPU ran in user mode. However, taking into consideration that the data also contains the amount of time the CPU ran in system mode, and the time it stayed idle waiting for block I/O or for other reasons, the task was tackled here as an incremental MTR problem.

Electricity

The Electricity dataset is an adapted version of the well-known ELEC2 dataset (Gama et al., 2004), which is commonly employed in online classification tasks. The original task corresponds to identifying the direction of price changes (up or down) in the Australian New South Wales Electricity Market. In this market the prices are not fixed, being affected by demand and supply, and are set every five minutes. The data covers the interval between 1996 and 1998, and each example refers to a period of 30 minutes. It is a scenario with potential for multiple changes, given that transfers to/from the neighboring state of Victoria are performed to alleviate fluctuations. In this adapted version of the task, the original label attribute was discarded, and the prices for the New South Wales and Victoria states were set as the new targets to be predicted. As input features, we selected the remaining data attributes: the measured electricity demands of those markets, the measurement time stamp, the day of the week, and the scheduled electricity transfer between the two states.

NPSDecay

The NPSDecay dataset (Cipollini et al., 2018) concerns the prediction of performance decay in a Naval Propulsion System (NPS) over time. The data comes from a vessel (frigate) simulator which was specially tailored and fine-tuned over the years to represent the components of a possible real vessel. The simulated vessel has a combined diesel-electric and gas propulsion plant. The targets correspond to decay coefficients of the main components of the NPS, namely the gas turbine, the gas turbine compressor, the hull, and the propeller. Hence, in this task the following coefficients must be predicted:

  • Propeller Thrust decay state coefficient (Kkt);

  • Propeller Torque decay state coefficient (Kkq);

  • Hull decay state coefficient (Khull);

  • Gas Turbine Compressor decay state coefficient (KMcompr);

  • Gas Turbine decay state coefficient (KMturb).

A total of 25 features related to parameters that indirectly represent the system state are available for each measurement of the performance decay coefficients. The dataset is available on OpenML, as well as on a website provided by its authors (https://sites.google.com/view/cbm/home).

SCFP

The See Click Fix Prediction (SCFP) competition (https://www.kaggle.com/c/see-click-predict-fix) was first held by Kaggle as a hackathon. Later on, the dataset employed in that competition was used in a new competition promoted by the same organization. The dataset concerns records of issues submitted by the population to the Open311 (http://www.open311.org/) service. The original task consists in predicting the number of views, comments, and votes an issue would receive. The original dataset contains textual information (a summary and a description of the issue), as well as geolocated data, the publication source (mobile, desktop, etc.), and a tag type for the publication. The original dataset contains missing data in many fields. A random sample of this dataset was used by Spyromitros-Xioufis et al. (2016) in batch scenarios. However, their version simply ignored the textual information contained in the examples, using only the other fields along with some hand-engineered features.

In our processed version of the original dataset, the categorical values were encoded as numeric values, and missing fields were encoded with a constant flag value. Following the approach of Spyromitros-Xioufis et al. (2016), in addition to the latitude and longitude fields, an attribute giving the distance of the published issue to its city's downtown (in meters) was added. Besides, another field denoting the time interval (in hours) since the last registered issue was included in the dataset. Moreover, our main contribution over the previous, reduced version of SCFP was taking into account the textual information of the dataset. To this end, the summary of the issues was considered. We employed a pre-trained word embedding model (Pennington et al., 2014) (https://nlp.stanford.edu/projects/glove/) to encode each of the non-stopwords in the summary field of the issues. The mean vector of all the considered words was then taken as the representation of the issue's summary. Therefore, one additional input feature per embedding dimension was added to our version of SCFP.
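A minimal sketch of this encoding step is given below; it assumes the GloVe vectors were loaded into a dict mapping words to numpy arrays, and the function and parameter names are ours.

```python
import numpy as np

def summary_embedding(summary, glove, dim, stopwords):
    # Mean of the pre-trained vectors of the summary's non-stopwords;
    # words absent from the embedding vocabulary are skipped.
    vectors = [glove[w] for w in summary.lower().split()
               if w not in stopwords and w in glove]
    if not vectors:
        return np.zeros(dim)  # no known word: fall back to a zero vector
    return np.mean(vectors, axis=0)
```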

Sulfur

The Sulfur dataset concerns the prediction of pollutant concentrations (H2S and SO2), given air and gas flows as inputs. The dataset is available on OpenML (Vanschoren et al., 2014) and corresponds to the data described in Fortuna et al. (2007). No pre-processing step was performed on the Sulfur dataset.

Wine

The Wine dataset (Cortez et al., 2009) describes the chemical properties of red and white wine samples. The input features correspond to objective tests, for instance, acidity and pH measurements. Originally, the only target was the sensory score (a human-based score, given by the median of three evaluations made by wine experts). Notwithstanding, for the purpose of evaluating a multi-output scenario, the fixed and volatile acidity, as well as the citric acid amounts, were joined with the quality score as new targets. Thus, the new task consists in predicting acidity levels and a quality score, modeling how those quantities relate to each other.

Appendix B Time-varying observations for error, running time and model size

This appendix presents line plots for the observed errors, running time, and model size considering all the evaluated datasets.

B.1 Measured error (aRMSE)

Figure 15: Time-varying results for the measured aRMSE values (2DPlanes, Bicycles, CPU, Electricity, Eunite03, FriedD, FriedAsyncD, MV)
Figure 23: Time-varying results for the measured aRMSE values (continuation: NPSDecay, RF1, RF2, SCFP, SCM1d, SCM20d, Sulfur, Wine)

B.2 Running time

Figure 34: Accounted running times for all the evaluated datasets (2DPlanes, Bicycles, CPU, Electricity, Eunite03, FriedD, FriedAsyncD, MV, NPSDecay, RF1)
Figure 40: Accounted running times for all the evaluated datasets (continuation: RF2, SCFP, SCM1d, SCM20d, Sulfur, Wine)

B.3 Model size

Figure 51: Time-varying model size for all the evaluated datasets (2DPlanes, Bicycles, CPU, Electricity, Eunite03, FriedD, FriedAsyncD, MV, NPSDecay, RF1)
Figure 57: Time-varying model size for all the evaluated datasets (continuation: RF2, SCFP, SCM1d, SCM20d, Sulfur, Wine)

References

  • Almeida et al. (2013) Almeida, E., Ferreira, C., and Gama, J. (2013). Adaptive model rules from data streams. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 480–492. Springer.
  • Borchani et al. (2015) Borchani, H., Varando, G., Bielza, C., and Larrañaga, P. (2015). A survey on multi-output regression. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 5(5):216–233.
  • Breiman (2017) Breiman, L. (2017). Classification and regression trees. Routledge.
  • Cipollini et al. (2018) Cipollini, F., Oneto, L., Coraddu, A., Murphy, A. J., and Anguita, D. (2018). Condition-based maintenance of naval propulsion systems: Data analysis with minimal feedback. Reliability Engineering & System Safety, 177:12–23.
  • Cortez et al. (2009) Cortez, P., Cerdeira, A., Almeida, F., Matos, T., and Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4):547–553.
  • Demšar (2006) Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7(Jan):1–30.
  • Domingos and Hulten (2000) Domingos, P. and Hulten, G. (2000). Mining high-speed data streams. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 71–80. ACM.
  • Duarte and Gama (2015) Duarte, J. and Gama, J. (2015). Multi-target regression from high-speed data streams with adaptive model rules. In 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 1–10. IEEE.
  • Duarte et al. (2016) Duarte, J., Gama, J., and Bifet, A. (2016). Adaptive model rules from high-speed data streams. ACM Transactions on Knowledge Discovery from Data (TKDD), 10(3):30.
  • Fortuna et al. (2007) Fortuna, L., Graziani, S., Rizzo, A., and Xibilia, M. G. (2007). Soft sensors for monitoring and control of industrial processes. Springer Science & Business Media.
  • Gama (2010) Gama, J. (2010). Knowledge discovery from data streams. Chapman and Hall/CRC.
  • Gama and Brazdil (2000) Gama, J. and Brazdil, P. (2000). Cascade generalization. Machine learning, 41(3):315–343.
  • Gama et al. (2004) Gama, J., Medas, P., Castillo, G., and Rodrigues, P. (2004). Learning with drift detection. In Brazilian Symposium on Artificial Intelligence, pages 286–295. Springer.
  • Gomes et al. (2017) Gomes, H. M., Barddal, J. P., Enembreck, F., and Bifet, A. (2017). A survey on ensemble learning for data stream classification. ACM Computing Surveys (CSUR), 50(2):23.
  • Gouk et al. (2019) Gouk, H., Pfahringer, B., and Frank, E. (2019). Stochastic Gradient Trees. arXiv e-prints, page arXiv:1901.07777.
  • Hoeffding (1963) Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American statistical association, 58(301):13–30.
  • Ikonomovska et al. (2011a) Ikonomovska, E., Gama, J., and Džeroski, S. (2011a). Incremental multi-target model trees for data streams. In Proceedings of the 2011 ACM symposium on applied computing, pages 988–993. ACM.
  • Ikonomovska et al. (2011b) Ikonomovska, E., Gama, J., and Džeroski, S. (2011b). Learning model trees from evolving data streams. Data mining and knowledge discovery, 23(1):128–168.
  • Kocev et al. (2013) Kocev, D., Vens, C., Struyf, J., and Džeroski, S. (2013). Tree ensembles for predicting structured outputs. Pattern Recognition, 46(3):817–833.
  • Krawczyk et al. (2017) Krawczyk, B., Minku, L. L., Gama, J., Stefanowski, J., and Woźniak, M. (2017). Ensemble learning for data stream analysis: A survey. Information Fusion, 37:132–156.
  • Mastelini et al. (2018) Mastelini, S. M., da Costa, V. G. T., Santana, E. J., Nakano, F. K., Guido, R. C., Cerri, R., and Barbon, S. (2018). Multi-output tree chaining: An interpretative modelling and lightweight multi-target approach. Journal of Signal Processing Systems, pages 1–25.
  • Mastelini et al. (2017) Mastelini, S. M., Santana, E. J., Cerri, R., and Barbon, S. (2017). Dstars: A multi-target deep structure for tracking asynchronous regressor stack. In 2017 Brazilian Conference on Intelligent Systems (BRACIS), pages 19–24. IEEE.
  • Melki et al. (2017) Melki, G., Cano, A., Kecman, V., and Ventura, S. (2017). Multi-target support vector regression via correlation regressor chains. Information Sciences, 415:53–69.
  • Nguyen et al. (2015) Nguyen, H.-L., Woon, Y.-K., and Ng, W.-K. (2015). A survey on data stream clustering and classification. Knowledge and information systems, 45(3):535–569.
  • Osojnik et al. (2015a) Osojnik, A., Panov, P., and Džeroski, S. (2015a). Comparison of tree-based methods for multi-target regression on data streams. In International Workshop on New Frontiers in Mining Complex Patterns, pages 17–31. Springer.
  • Osojnik et al. (2015b) Osojnik, A., Panov, P., and Džeroski, S. (2015b). Multi-label classification via multi-target regression on data streams. In International Conference on Discovery Science, pages 170–185. Springer.
  • Osojnik et al. (2017) Osojnik, A., Panov, P., and Džeroski, S. (2017). Multi-label classification via multi-target regression on data streams. Machine Learning, 106(6):745–770.
  • Osojnik et al. (2018) Osojnik, A., Panov, P., and Džeroski, S. (2018). Tree-based methods for online multi-target regression. Journal of Intelligent Information Systems, 50(2):315–339.
  • Pennington et al. (2014) Pennington, J., Socher, R., and Manning, C. (2014). Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
  • Read et al. (2012) Read, J., Bifet, A., Holmes, G., and Pfahringer, B. (2012). Scalable and efficient multi-label classification for evolving data streams. Machine Learning, 88(1-2):243–272.
  • Santana et al. (2018) Santana, E. J., Geronimo, B. C., Mastelini, S. M., Carvalho, R. H., Barbin, D. F., Ida, E. I., and Barbon, S. (2018). Predicting poultry meat characteristics using an enhanced multi-target regression method. Biosystems Engineering, 171:193–204.
  • Sousa and Gama (2018) Sousa, R. and Gama, J. (2018). Multi-label classification from high-speed data streams with adaptive model rules and random rules. Progress in Artificial Intelligence, pages 1–11.
  • Spyromitros-Xioufis et al. (2016) Spyromitros-Xioufis, E., Tsoumakas, G., Groves, W., and Vlahavas, I. (2016). Multi-target regression via input space expansion: treating targets as inputs. Machine Learning, 104(1):55–98.
  • Vanschoren et al. (2014) Vanschoren, J., van Rijn, J. N., Bischl, B., and Torgo, L. (2014). Openml: Networked science in machine learning. SIGKDD Explor. Newsl., 15(2):49–60.
  • Waegeman et al. (2018) Waegeman, W., Dembczynski, K., and Huellermeier, E. (2018). Multi-Target Prediction: A Unifying View on Problems and Methods. arXiv e-prints, page arXiv:1809.02352.