I. Introduction
Multitarget regression, also called multivariate or multioutput regression, is an instance of multitarget prediction [41]
where the goal is to estimate multiple continuous variables based on a common set of predictors. When statistical dependencies exist between the target variables, modeling them jointly has been shown to be advantageous compared to modeling each variable independently
[20]. This is for instance the case in applications such as stock prediction [21], energy production forecasting in photovoltaic farms [13] and water quality monitoring [23]. Similarly to methods for multilabel classification [36], multitarget regression methods can be broadly categorized into two groups: (i) algorithm adaptation and (ii) problem transformation. Algorithm adaptation methods extend specific learning algorithms (e.g. support vector regression [33], trees [3], etc.) in order to handle multiple outputs, while problem transformation methods transform the learning task into one or more single-output tasks that can be solved with existing learning algorithms. Algorithm adaptation methods often generate a single multioutput model that is easier to interpret and are more scalable to large output spaces compared to existing problem transformation methods. On the other hand, problem transformation methods can be easily adapted to the problem at hand by employing suitable base learners and have been found superior to algorithm adaptation methods in terms of accuracy [35].
Despite the fact that problem transformation methods have seen widespread use in multilabel classification (largely owing to their excellent predictive performance [43]), their potential in the context of multitarget regression has only recently been explored. More specifically, methods that expand the input space of the independent regressions baseline using estimates of the other target variables as meta-inputs were introduced in [35], while [39] proposed an ensemble approach that is based on constructing random linear combinations of the target variables. Both these methods were found competitive with the state-of-the-art but scale linearly (at best) with respect to the number of targets, thereby having limited applicability in problems with large output spaces.
In this paper we propose Multitarget Regression via Quantization (MRQ), a novel problem transformation method that is based on the idea of using vector quantization [22] in order to map the original real-valued output vectors into a finite set of prototype vectors or centroids. After this transformation, the multitarget regression problem becomes one of multiclass classification, where the task is to predict the centroid that lies closest to the actual output vector. This type of reduction is motivated by the fact that in many real-world problems, variables that are originally continuous in nature are discrete by observation, and hence it is reasonable and convenient to model an appropriate discrete approximation [14].
Compared to existing problem transformation approaches, MRQ has the advantage that it directly models the joint distribution of the targets through a discrete approximation and has a complexity that is practically independent of the number of target variables, making it scalable to problems with very large output spaces.
In addition to the basic version of MRQ where a single quantizer is used to encode the whole output space, we also introduce an ensemble version of the method (eMRQ) where multiple quantizers are used, each one encoding a randomly selected subset of the target variables. Hence, in eMRQ the multitarget regression problem is transformed into multiple multiclass classification problems and the predictions are obtained by averaging multiple centroids. Although eMRQ has a higher computational complexity, it provides significantly higher accuracy as it strikes a better balance between the two main error components of the approach, i.e. the quantization and the classification error.
The rest of the paper is organized as follows. Section II presents related work on multitarget regression and draws parallels between our approach and some widely used problem transformation approaches for multilabel classification. Section III describes MRQ and its ensemble version eMRQ, including a discussion of some interesting theoretical properties of the method. Section IV describes the experimental setup and Section V presents and discusses the experimental results. Finally, Section VI concludes the paper and discusses directions for future work.
II. Related Work
II-A. Algorithm adaptation methods
The first multitarget regression methods were developed by statisticians in the 1970s and 1980s, with Reduced Rank Regression [24], FICYREG [40], two-block PLS [42] and Curds and Whey [9] being some of the most characteristic examples. As shown in [9], all these methods have the same generic form, under which the estimates obtained by applying ordinary least squares regression on the target variables are modified by a shrinkage matrix (calculated differently in each method) in order to provide a more accurate prediction, under the assumption that the targets are correlated.
More recently, a number of multitarget methods were derived from the predictive clustering tree framework [8]
. Predictive clustering trees differ from standard decision trees in that the variance and prototype functions are treated as parameters that can be instantiated to fit a variety of learning tasks, including multitarget regression. Following this approach,
[7] developed multiobjective decision trees where the variance function is computed as the sum of the variances of the targets, and the prototype function is the vector mean of the target vectors of the training examples falling in each leaf. Other approaches that fall under the same framework are the ensembles of multiobjective decision trees proposed in [28] and a number of rule learning algorithms whose primary focus is on improving model interpretability (e.g. [1]).

Another large group of methods stems from a regularization perspective [45] (most of these methods were originally developed to solve the more general learning task of multitask learning [12] but are commonly applied to multitarget regression tasks as well). These methods minimize a penalized loss of the form L(W) + λΩ(W), where L(W) is an empirical loss calculated on the training data, W is an estimated parameter matrix, and Ω(W) is a regularization term whose particular form depends on the underlying task relatedness assumption. Most methods assume that all tasks are related to each other [4, 15, 31], while there are methods assuming that tasks are organized in structures such as clusters [44], trees [27] and graphs [16]. A systematic analysis of the connections between these regularization-based methods and related techniques from the Gaussian Processes framework is provided in [2].
Finally, a number of methods for multitarget regression have been derived by extending artificial neural networks to handle multiple outputs (e.g.
[11, 6, 21, 17]). As the aforementioned techniques do not involve an explicit transformation of the multitarget regression problem but rather extend specific learning paradigms to handle multitarget regression tasks directly, they are regarded as algorithm adaptation methods.
II-B. Problem transformation methods
Contrary to the majority of multitarget regression methods, MRQ follows a problem transformation approach. This type of approach reduces the multitarget prediction problem to one or more single-target prediction problems, for which a multitude of well-developed algorithms are readily available. The challenge is then to devise effective reduction approaches.
Problem transformation methods have been extensively studied in the context of multilabel classification, with Binary Relevance (BR) and Label Powerset (LP) being two of the simplest but also widely used and theoretically justified (see, e.g., [18]) methods. BR transforms the multilabel classification task into multiple binary classification tasks, one for each label, while LP reduces multilabel classification to multiclass classification by treating each label combination as a distinct class value. As discussed in [18], although BR ignores label dependencies, it is well tailored for losses whose risk minimizer can be expressed in terms of marginal distributions, such as Hamming loss. LP, on the other hand, can be seen as a method to estimate the conditional joint distribution of the labels, and while its basic form is tailored for the subset 0/1 loss, it can be extended to any loss function.
While BR and LP are reasonable baselines, they have been extended and outperformed by more recent approaches. Two notable examples that have achieved state-of-the-art performance and received significant attention in the multilabel classification literature are Classifier Chains (CC)
[32] and RAkEL [37]. CC is an extension of BR which manages to model label dependencies by augmenting the input space of each binary classifier with extra features that correspond to other labels. RAkEL extends LP by building multiple LP classifiers on random subsets of the original label space and combining their predictions with voting. This way, RAkEL tackles a number of LP's limitations, such as the modeling of scarce label sets and the increased computational cost in problems with many distinct label sets. Recently, both CC and RAkEL have offered inspiration for developing analogous approaches in multitarget regression. Regressor Chains (RC) [35] is a direct adaptation of CC where regressors are used as base models instead of binary classifiers. Equipped with a mechanism to deal with noise propagation at inference time, RC was shown to achieve state-of-the-art performance in multitarget regression. RAkEL, on the other hand, inspired RLC [39]
, an approach that builds an ensemble of regression models, each one concerning a random linear combination of a random subset of the target variables. RLC uses a sparse random matrix to project the original output space into a new output space of higher dimensionality, where each transformed variable represents the linear combination of two or more of the original targets. At inference time, an overdetermined system of linear equations is solved to recover the original space. While RLC was found competitive with other multitarget regression approaches in terms of performance, it suffers from increased computational complexity, as it requires a number of regression models that is significantly larger than the number of targets to perform competitively. Concurrently to RLC, an approach based on random output space projections was developed in
[26] in the context of multilabel classification. Differently from RLC, which aimed at improving prediction accuracy, the main goal of that approach was to reduce learning time complexity while maintaining the accuracy of predictions.

The proposed approach bears a number of striking parallels with LP and RAkEL. Firstly, each cluster centroid in MRQ can be considered the equivalent of a labelset (or label combination) in multilabel classification. In that sense, MRQ resembles LP, as it uses a multiclass classifier to predict the most likely centroid. In fact, it is easy to show that MRQ becomes equivalent to LP when the quantization error becomes zero. Given that eMRQ extends MRQ in the same way that RAkEL extends LP, it is straightforward to see their resemblance. Contrary to other problem transformation methods for multitarget regression, which have a linear (RC) or higher (RLC) complexity with respect to the number of targets, MRQ and eMRQ have a practically constant complexity with respect to the number of targets and can be tuned to provide a good tradeoff between prediction accuracy and computational efficiency by appropriate parameterization of the quantization scheme.
Recently, we developed a transformation approach for feature selection on multitarget data based on the idea of output space quantization and found it to perform competitively against other feature selection methods
[34]. To the best of our knowledge, this is the first time that this idea is applied in the context of multitarget regression.

III. Method
III-A. Background and notation
III-A.1 Multitarget regression
Given a set of n training examples {(x_i, y_i)}, where x_i ∈ R^d and y_i ∈ R^m are realizations of the joint random variables X and Y, the goal in multitarget prediction is to induce a model h that, given an input vector x, predicts an output vector ŷ = h(x) that closely approximates the true output vector y. In multitarget regression, all m output variables are continuous (i.e. y ∈ R^m) while the input variables can have a real, ordinal or nominal domain.

The baseline SingleTarget (ST) approach consists of building an independent regression model for each target variable. Despite the obvious limitation of ignoring dependencies between targets, when coupled with a strong base learner, ST is very competitive in both multilabel classification (called BR in this context) [29] and multitarget regression [35], especially on targetwise decomposable loss functions [18].
III-A.2 Vector quantization
Vector quantization is a technique that has its roots in information theory and was originally used for analog-to-digital conversion and data compression [22]. In vector quantization, the goal is to reduce the cardinality of the representation space of high-dimensional, real-valued input data, while minimizing an objective distortion criterion. Formally, a vector quantizer is a function Q that maps each m-dimensional vector y ∈ R^m to a vector Q(y) ∈ C, where C = {c_1, …, c_K} is a finite set of K reproduction values or centroids. Typically, a vector quantizer seeks to minimize the squared error ‖y − Q(y)‖² between the input vector and its reproduction value and is learned using Lloyd's algorithm (k-means). In this work, VQ is applied to transform the output space of multitarget regression problems by replacing each output vector with the index of the quantizer centroid that lies closest to that vector (Section III-B). This way, multitarget regression is reduced to multiclass classification.

As the dimensionality of the vectors that we want to quantize increases, so does the number of centroids K that are required to maintain a small quantization error. As K increases, it becomes impossible to learn a quantizer using k-means, because both the learning complexity and the number of required training samples grow with K. To address this issue, more efficient quantization techniques such as Product Quantization (PQ) [25] are used when dealing with high-dimensional vectors. In PQ, the vectors are split into v non-overlapping subvectors of dimensionality m/v and a distinct lower-complexity subquantizer is learned on each subspace using k-means. The reproduction values of such a quantizer are defined as the concatenation of the centroids of the subquantizers; thus the product quantizer maps each original vector to a vector from the Cartesian product C = C_1 × … × C_v. Assuming that all subquantizers have the same number of reproduction values k, the product quantizer effectively generates a quantizer with k^v reproduction values. In this work, we use a quantization approach that is similar to PQ in order to improve the accuracy of our method in problems with large output spaces (Section III-C).
III-B. MRQ
The main idea behind MRQ is the use of vector quantization to transform the multidimensional continuous output space into a unidimensional discrete output space Z = {1, …, K}, where the levels of the categorical variable z correspond to the indices of the centroids {c_1, …, c_K} of a vector quantizer learned using k-means on the original output space (output variables are standardized to ensure equal variances before applying k-means). At training time, a mapping h : X → Z is learned using a multiclass classifier. At inference time, given an unknown instance x, the multiclass classifier is first applied to get ẑ = h(x) and the corresponding centroid c_ẑ is returned as the final prediction.

The squared error of MRQ can be written as:
SE_MRQ(x, y) = ‖y − c_{h(x)}‖²    (1)
Assuming that the quantizer satisfies the first Lloyd optimality condition, i.e. that all vectors are quantized to their nearest centroids in terms of Euclidean distance:

Q(y) = c_z  with  z = argmin_{j ∈ {1, …, K}} ‖y − c_j‖²    (2)
it is easy to see that the lower bound of the squared error of MRQ equals the squared error of the underlying quantizer. Thus, using a quantizer with a small quantization error is a necessary condition for good performance in MRQ. However, this lower bound is realized only when h is an oracle classifier that always predicts the correct class, and any classification error can cause an arbitrarily large increase. Thus, achieving a small classification error is also crucial.
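The whole MRQ pipeline can be sketched as follows, under simplifying assumptions: a plain numpy k-means and a 1-nearest-neighbour stand-in for the multiclass classifier (the paper uses ensembles of classification trees); class and function names are hypothetical:

```python
import numpy as np

def kmeans(Y, K, iters=25, seed=0):
    # Lloyd's algorithm; returns centroids and per-example class labels z.
    rng = np.random.default_rng(seed)
    C = Y[rng.choice(len(Y), size=K, replace=False)].copy()
    for _ in range(iters):
        z = ((Y[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(K):
            if np.any(z == j):
                C[j] = Y[z == j].mean(0)
    return C, z

class MRQSketch:
    """Toy MRQ: k-means quantizes the output space into K classes, and a
    1-NN classifier (stand-in for any multiclass learner) maps inputs to
    centroid indices."""
    def fit(self, X, Y, K=8):
        self.C, self.z = kmeans(Y, K)   # centroids c_1..c_K and labels z
        self.Xtr = X                    # 1-NN "training" just stores X
        return self
    def predict(self, X):
        nn = ((X[:, None, :] - self.Xtr[None, :, :]) ** 2).sum(-1).argmin(1)
        return self.C[self.z[nn]]       # return the predicted centroid

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
Y = np.hstack([np.sin(X[:, :1]), np.cos(X[:, :1]), X[:, 1:2]])  # 3 targets
pred = MRQSketch().fit(X, Y, K=16).predict(X)
```

On the training inputs the 1-NN classifier is exact, so the residual error of this sketch is precisely the quantization error, i.e. the lower bound in the discussion above.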
Interestingly, both error components of MRQ (i.e. the quantization and the classification error) are highly dependent on the parameter K. The quantization error, on one hand, is a monotonically decreasing function of K for a Lloyd-optimal quantizer. The classification error, on the other hand, is expected to increase with an increasing number of classes and a decreasing number of examples per class. Thus, finding a K that strikes a good balance between the two error components is crucial for achieving good performance in MRQ.
This tradeoff is highlighted in Figure 1, which plots the error of MRQ (on two datasets), using an ensemble of classification trees as the multiclass classifier, against that of MRQ*, an oracle version of the method whose underlying multiclass classifier is assumed to provide perfect predictions (i.e. ẑ = z). We see that while the error of MRQ* decreases with K, the error of MRQ decreases initially as a result of a decreasing quantization error but then starts to increase as the classification error begins to dominate. We also notice that while in osales MRQ* has a smaller error than ST already for small values of K, a much larger K is required in oes10. This is reflected in the performance of MRQ: in osales it obtains a smaller error than ST for a wide range of K values, while in oes10 it is always worse.
III-C. eMRQ
eMRQ extends MRQ by employing a PQ-like approach to quantize the output space. More concretely, instead of building a single k-means quantizer on the whole output space, eMRQ builds v subquantizers Q_1, …, Q_v on groups of variables Y_1, …, Y_v which are defined as random subsets of the original targets, i.e. Y_j ⊆ {y_1, …, y_m}. Contrary to PQ, where only disjoint subsets are considered, in eMRQ we allow the same target variable to participate in multiple groups and thus be redundantly quantized.
By employing this quantization scheme, eMRQ effectively transforms the multitarget regression problem into v multiclass classification subproblems, where the categorical class variables z_1, …, z_v encode the indices of the centroids of the subquantizers Q_1, …, Q_v. At inference time, all v multiclass classifiers are queried and the predictions for each of the original targets are obtained by average pooling (for a predicate P, the expression [[P]] evaluates to 1 if P is true and to 0 if P is false):

ŷ_t = ( Σ_{j=1..v} [[y_t ∈ Y_j]] · c_{ẑ_j}(t) ) / ( Σ_{j=1..v} [[y_t ∈ Y_j]] )

where (with a slight abuse of notation) c_{ẑ_j}(t) denotes the component of centroid c_{ẑ_j} that corresponds to variable y_t.
eMRQ has three parameters: (a) the number of centroids k in each subquantizer, (b) the number of original targets considered in each subquantizer, i.e. |Y_j|, henceforth denoted b, and (c) the total number of quantizers v. We observe that b can take values in {1, …, m} and controls the degree of joint modelling of the target variables in the ensemble. The special case b = 1 corresponds to building an independent model for each target variable, while b = m with v = 1 makes eMRQ equivalent to MRQ. We further observe that each of the original target variables is expected to participate in v·b/m of the subproblems induced by eMRQ, and therefore a v ≥ m/b should be used to guarantee that, on average, all target variables will be considered in at least one of the subproblems.
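The eMRQ scheme above can be sketched as follows, again with numpy k-means and a 1-NN classifier as stand-ins for the paper's learners. For determinism the sketch forces target j mod m into subset j, which guarantees coverage of every target (the paper only guarantees coverage on average via v ≥ m/b); all names are illustrative:

```python
import numpy as np

def kmeans(Y, k, iters=25, seed=0):
    rng = np.random.default_rng(seed)
    C = Y[rng.choice(len(Y), size=k, replace=False)].copy()
    for _ in range(iters):
        z = ((Y[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(z == j):
                C[j] = Y[z == j].mean(0)
    return C, z

class EMRQSketch:
    """Toy eMRQ: v sub-quantizers on random size-b target subsets, one
    1-NN multiclass classifier per sub-quantizer, average pooling at
    inference time."""
    def fit(self, X, Y, v=8, b=2, k=8, seed=0):
        rng = np.random.default_rng(seed)
        self.Xtr, self.m = X, Y.shape[1]
        self.subs = []
        for j in range(v):
            others = [t for t in range(self.m) if t != j % self.m]
            idx = np.array([j % self.m] +
                           list(rng.choice(others, size=b - 1, replace=False)))
            C, z = kmeans(Y[:, idx], k, seed=j)
            self.subs.append((idx, C, z))
        return self
    def predict(self, X):
        nn = ((X[:, None, :] - self.Xtr[None, :, :]) ** 2).sum(-1).argmin(1)
        num = np.zeros((len(X), self.m))
        cnt = np.zeros(self.m)
        for idx, C, z in self.subs:
            num[:, idx] += C[z[nn]]   # centroid components for this subset
            cnt[idx] += 1             # [[y_t in Y_j]] indicator counts
        return num / cnt              # average pooling over covering subsets

rng = np.random.default_rng(3)
proto = rng.normal(size=(6, 4))            # 6 prototype output vectors
Y = proto[rng.integers(0, 6, size=300)] + 0.05 * rng.normal(size=(300, 4))
X = Y + 0.01 * rng.normal(size=(300, 4))   # informative inputs
pred = EMRQSketch().fit(X, Y, v=8, b=2, k=8).predict(X)
```

Each subproblem sees only b targets and k classes, which is the "fewer classes, more examples per class" effect discussed below.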
Similarly to PQ, the quantization approach adopted by eMRQ allows it to implicitly induce a quantizer with up to k^v reproduction values (since the subquantizers in eMRQ are overlapping, the effective number of reproduction values will be smaller than k^v). Thus, it requires subquantizers with a significantly smaller number of centroids than the single quantizer in MRQ to achieve the same quantization error. In turn, this has a positive impact on the classification error of eMRQ, as each multiclass classification problem becomes simpler (fewer classes, more training examples per class). This effect can be seen in Figure 1, where we notice that eMRQ*, the oracle version of eMRQ (with the non-oracle variant using the same multiclass classifier as MRQ), obtains a significantly smaller error than MRQ* for the same number of centroids per quantizer. This results in eMRQ outperforming MRQ in both datasets.
III-D. Computational Complexity
The complexities of both MRQ and eMRQ depend on the complexities of the underlying multiclass classification and quantization algorithms. Given a multiclass classifier with training complexity C(n, d, K) for a dataset with n examples, d input variables and K classes, the complexity of MRQ is O(Q(n, m, K) + C(n, d, K)), where Q(n, m, K) is the complexity of running k-means (O(n·m·K) per iteration) for a fixed number of iterations. Similarly, the complexity of eMRQ is O(v·(Q(n, b, k) + C(n, d, k))), since it learns v quantizers on b-dimensional vectors and v multiclass classifiers. In practice, the training complexity of both methods is dominated by the complexity of building the multiclass classifiers. As shown in Section V, by using a multiclass classifier with sublinear complexity with respect to the number of classes (e.g. an ensemble of decision trees), both methods are significantly faster than other problem transformation approaches.
IV. Experimental setup
In this section we describe our experimental setup. We first present the datasets and their main characteristics, then provide the details of the setup used for MRQ, eMRQ and the other competing methods, and finally describe the evaluation methodology and the approach used to check for statistically significant performance differences.
IV-A. Datasets
The experiments are carried out on a large and diverse collection of multitarget regression datasets (available at http://mulan.sourceforge.net/datasets-mtr.html), whose main characteristics are summarized in Table I (see [35] for a detailed description of each dataset). In addition to the number of examples, features and targets in each dataset, Table I also reports ρ̄ and H̄. ρ̄ is the average Pearson correlation coefficient between all distinct pairs of targets in each dataset and is a rough measure of target interdependence. H̄ is the average entropy of the probability density functions (PDF) of the target variables, where each PDF is calculated by applying kernel density estimation using a Gaussian kernel and a bandwidth tuned to maximize likelihood using 3-fold cross-validation. H̄ aims to serve as a measure of the expected quantization error in each dataset, as lower values point to a distribution that is concentrated around few specific values, while higher values point to a dispersed distribution (the uniform being the most dispersed distribution, having an entropy of 4.61 in this setup).

Name  # ex.  # feat.  # targets  ρ̄  H̄
edm  154  16  2  0.01  3.06 
enb  768  8  2  0.98  4.51 
jura  359  15  3  0.20  3.97 
scpf  1137  23  3  0.73  2.35 
sf1  323  10  3  0.23  1.66 
sf2  1066  10  3  0.20  1.31 
slump  103  7  3  0.42  4.37 
andro  49  30  6  0.40  4.52 
atp1d  337  411  6  0.82  3.91 
atp7d  296  411  6  0.64  3.90 
rf1  9125  64  8  0.39  3.96 
rf2  9125  576  8  0.39  3.96 
osales  639  401  12  0.62  2.44 
wq  1060  16  14  0.10  2.13 
oes10  403  298  16  0.82  2.90 
oes97  334  263  16  0.79  3.02 
scm1d  9803  280  16  0.64  4.07 
scm20d  8966  61  16  0.60  4.03 
IV-B. Methods and parameters
In section V, the performance of MRQ and eMRQ is compared to the performance of ST, RLC [39] and ERC [35]. Similarly to MRQ and eMRQ, ST, RLC and ERC take a problem transformation approach to multitarget regression and have been found significantly better than algorithm adaptation approaches such as ensembles of multiobjective decision trees (e.g. [28]) and multitask learning methods (e.g. [4]) in a previous empirical study [35].
RLC and ERC are parametrized using the setup that leads to the best results according to the corresponding papers, i.e. 100 random linear combinations of 2 target variables are used in RLC, while in ERC we use the variant that generates out-of-sample estimates with 10 internal cross-validation folds and an ensemble size of 10. A crucial factor for the performance of any problem transformation approach is the underlying base learning algorithm. Throughout this study, ST, RLC and ERC are instantiated using a Bagging [10] ensemble of 100 regression trees as the base regressor, following the recommendations of [35]. To keep the comparison of MRQ and eMRQ with these methods as fair as possible, we instantiate them with a Bagging ensemble of 100 classification trees.
The proposed methods are implemented in Java and are integrated in the Mulan library [38], which already contains implementations of ST, RLC and ERC. Thus, all methods are evaluated under a common framework (https://github.com/lefman/mulanextended).
IV-C. Evaluation methodology
The performance of the methods is measured using average Relative Mean Absolute Error (aRMAE). The aRMAE of a model h on a dataset D with m targets is defined as:

aRMAE(h, D) = (1/m) Σ_{t=1..m} ( Σ_{(x,y)∈D} |y_t − ŷ_t| ) / ( Σ_{(x,y)∈D} |y_t − ȳ_t| )

where ȳ_t is the mean value of y_t over D and ŷ_t is the prediction of h for y_t. Intuitively, aRMAE measures how much better (aRMAE < 1) or worse (aRMAE > 1) the model is (on average) compared to a naive baseline that always predicts the mean value of each target. To estimate aRMAE we use either repeated random subsampling (with 90% of the data used for training and 10% for validation) (Sections V-A and V-B) or cross-validation (Section V-C).
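A direct transcription of this metric in numpy, under the assumption that the reconstructed definition above matches the intended one (the function name is illustrative):

```python
import numpy as np

def armae(Y_true, Y_pred):
    # Per target: MAE of the model divided by the MAE of always
    # predicting the target's mean; then average the ratios over targets.
    mae_model = np.abs(Y_true - Y_pred).mean(axis=0)
    mae_naive = np.abs(Y_true - Y_true.mean(axis=0)).mean(axis=0)
    return float((mae_model / mae_naive).mean())

Y = np.array([[0.0, 10.0], [2.0, 20.0], [4.0, 30.0]])
naive = np.tile(Y.mean(axis=0), (3, 1))  # the mean-predicting baseline
```

By construction, a perfect model scores 0 and the mean-predicting baseline scores exactly 1.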
To test the statistical significance of the observed differences between the methods, we follow the methodology suggested by [19] for comparing multiple methods on multiple datasets: we use the Friedman test to check the validity of the null hypothesis (all methods are equivalent) and, when the null hypothesis is rejected, we proceed with the Nemenyi post-hoc test. Instead of reporting the outcomes of all pairwise comparisons, we employ the simple graphical presentation of the test's results introduced in [19], i.e. all methods being compared are placed on a horizontal axis according to their average ranks and groups of methods that are not significantly different (at a certain significance level) are connected (see Figure 5 for an example). To generate such a diagram, a critical difference (CD) needs to be calculated that corresponds to the minimum difference in average ranks required for two methods to be considered significantly different. The CD for a given number of methods and datasets depends on the desired significance level. Due to the known conservativeness of the Nemenyi test [19], we use a 0.05 significance level for computing the CD throughout the paper.

V. Results and discussion
V-A. Empirical evaluation of MRQ
In this section we study the performance of MRQ for different values of its parameter K and compare it against ST. The results are summarized in Table II, which shows the error of MRQ for K ∈ {2, 5, 10, 20, 50, 100, 200, 500, 1000} as well as the error of ST on each dataset. The best performance per dataset is highlighted in bold, while the best performance among the MRQ variants is underlined (whenever K is larger than the number of distinct output vectors in a dataset, the latter number is used instead).
We see that the choice of K can have a big impact on the method's performance, as different values lead to optimal results on each dataset. Trying to shed some light on the factors that affect the optimal value of K, we observe that datasets with larger output spaces typically benefit from larger values of K and vice versa. As a result, the best performance on scm1d and scm20d, which have the highest number of targets in the collection, is obtained with K = 1000, while the best performance on edm, which has the smallest number of targets, is obtained with K = 5. However, there are notable exceptions to this rule: (a) enb has the same number of targets as edm but the best performance is obtained with K = 200, (b) oes10 and oes97 have the same number of targets as scm1d and scm20d but the best performance is obtained with much smaller values of K (20 and 50 respectively). In the case of enb, we observe that it is among the two datasets with the highest output space entropy. Hence, despite the small number of targets, a high K is required to reduce the quantization error. In the case of oes10 and oes97, these datasets have relatively small output space entropies and, at the same time, a small number of examples (403 and 334 respectively), which makes it difficult to learn an accurate quantizer with a large K. Summarizing the above observations, we conclude that the optimal value of K is affected by the following factors: (a) the number of targets, (b) the entropy of the output space, (c) the number of training examples.
Comparing the performance of MRQ to that of ST, we observe that in 11 out of 18 datasets ST is outperformed by one of the MRQ variants, which suggests that MRQ can be very competitive with appropriate parametrization. In fact, even with a fixed K, we see (last row of Table II) that MRQ with K = 50 obtains a similar average rank to ST (4.333 vs 4.056). As shown in the critical difference diagram of Figure 2, only the variant that uses K = 2 is found statistically significantly worse than ST.
Dataset  K=2  K=5  K=10  K=20  K=50  K=100  K=200  K=500  K=1000  ST
edm  1.076  0.545  0.545  0.545  0.545  0.545  0.545  0.545  0.545  0.840 
enb  0.350  0.217  0.151  0.127  0.119  0.110  0.102  0.115  0.177  0.085 
jura  0.880  0.709  0.664  0.649  0.617  0.610  0.639  1.008  1.008  0.511 
scpf  0.835  0.516  0.473  0.490  0.514  0.542  0.533  0.525  0.525  0.607 
sf1  0.695  0.591  0.581  0.581  0.581  0.581  0.581  0.581  0.581  0.957 
sf2  0.590  0.454  0.408  0.408  0.408  0.408  0.408  0.408  0.408  0.843 
slump  0.883  0.838  0.844  0.821  0.839  0.950  0.950  0.950  0.950  0.703 
andro  0.922  0.472  0.446  0.546  0.685  0.685  0.685  0.685  0.685  0.476 
atp1d  0.614  0.481  0.442  0.402  0.360  0.307  0.318  0.285  0.285  0.325 
atp7d  0.805  0.575  0.489  0.430  0.394  0.287  0.332  0.332  0.332  0.429 
rf1  0.792  0.529  0.385  0.275  0.180  0.121  0.080  0.045  0.031  0.045 
rf2  0.792  0.529  0.385  0.275  0.180  0.121  0.080  0.045  0.031  0.050 
osales  0.864  0.639  0.628  0.651  0.632  0.725  0.757  0.924  0.924  0.702 
wq  0.950  0.899  0.888  0.897  0.884  0.885  0.890  0.918  0.961  0.858 
oes10  0.752  0.532  0.482  0.455  0.455  0.460  0.500  0.627  0.627  0.399 
oes97  0.843  0.668  0.637  0.635  0.632  0.685  0.755  0.822  0.822  0.570 
scm1d  0.704  0.566  0.512  0.458  0.389  0.345  0.307  0.266  0.241  0.236 
scm20d  0.740  0.614  0.557  0.504  0.416  0.359  0.324  0.278  0.260  0.334 
Av. rank  9.389  6.750  5.083  5.000  4.333  4.694  4.917  5.361  5.417  4.056 
V-B. Empirical evaluation of eMRQ
In this section we study the performance of eMRQ with respect to its parameters k, b and v. To simplify the analysis, we first fix v = m, i.e. we use as many subquantizers as the number of targets in each dataset, and study the interaction between k and b. Then, we fix b and study the interaction between k and v.
Figure 3 shows the aRMAE of eMRQ as a function of k for different values of b on andro and scm20d, as well as the average aRMAE obtained for each combination of k and b values across all datasets (except for edm and enb, which have only two target variables and hence larger values of b are not applicable). Note that b = 1 corresponds to independently quantizing each target variable and was included to directly examine the advantages of joint modelling in eMRQ.
Looking at the average performance, we see that joint modelling is indeed advantageous, as the best results are obtained with b > 1. However, larger values of b do not lead to better results on average. On the other hand, we also see that a different set of parameters leads to the best results in each dataset. This suggests that careful per-dataset tuning of these parameters is required for optimal performance.
Figure 4 shows the aRMAE of eMRQ as a function of one of its parameters (for several values of another) on andro and scm20d, as well as the average aRMAE obtained for each combination of values across all datasets. We see that the average performance of eMRQ improves with larger values of the varied parameter across all settings of the other, and this behavior is consistent across the datasets shown here as well as the remaining datasets.
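Although the exact construction is not reproduced here, the core idea evaluated above (quantize the joint output space, then treat regression as classification over the learned centroids) can be sketched as follows. This is a minimal, assumed sketch: the k-means quantizer and random-forest classifier are stand-ins for the paper's components, and all names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

class SubQuantizerRegressor:
    """Sketch of one subquantizer: quantize a group of target variables
    with k-means, then predict the centroid index from the inputs.
    Assumed components, not the authors' implementation."""

    def __init__(self, n_centroids=8, random_state=0):
        self.n_centroids = n_centroids
        self.random_state = random_state

    def fit(self, X, Y):
        # Quantization step: replace each target vector by its nearest centroid.
        self.quantizer_ = KMeans(n_clusters=self.n_centroids, n_init=10,
                                 random_state=self.random_state)
        codes = self.quantizer_.fit_predict(Y)
        # Classification step: learn to predict the centroid index from X.
        self.clf_ = RandomForestClassifier(random_state=self.random_state)
        self.clf_.fit(X, codes)
        return self

    def predict(self, X):
        # Decode: map each predicted code back to a full target vector.
        codes = self.clf_.predict(X)
        return self.quantizer_.cluster_centers_[codes]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Y = X @ rng.normal(size=(5, 3))  # three correlated targets
model = SubQuantizerRegressor(n_centroids=16).fit(X, Y)
Y_hat = model.predict(X)
print(Y_hat.shape)  # (200, 3)
```

Note how a single classifier covers all three targets jointly: the centroid set is a discrete approximation of the joint target distribution, which is what distinguishes this scheme from independent per-target quantization.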
V-C Comparison with the state of the art
In this section, we compare the performance of MRQ and eMRQ with that of state-of-the-art multitarget regression methods. In particular, the comparison includes ST, RLC [39] and ERC [35], using the setup described in Section IV-B.
In the case of MRQ, the parameter value suggested by the analysis of Section V-A is used. In the case of eMRQ, we instantiate two variants that differ in the number of subquantizers. However, instead of fixing or tuning the remaining parameters, we employ a randomized version of eMRQ (denoted as eMRQr) where, for each subquantizer, each of these parameters is chosen uniformly at random from a predefined range. The advantage of this approach is that it avoids the need for parameter tuning, which increases computational time and can be unstable on smaller datasets, and that it increases the diversity of the eMRQ ensemble.
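The randomization just described can be sketched as follows. The function, parameter names and ranges are illustrative assumptions, not the paper's actual settings.

```python
import random

def sample_subquantizer_params(n_targets, n_subquantizers,
                               centroid_range=(2, 32), group_range=None,
                               seed=0):
    """Sketch of the eMRQr idea: rather than tuning the per-subquantizer
    parameters, draw each one uniformly at random from a range.
    Ranges here are hypothetical, not the paper's values."""
    if group_range is None:
        # Illustrative default: group between 1 and half the targets.
        group_range = (1, max(1, n_targets // 2))
    rng = random.Random(seed)  # seeded for reproducibility
    return [
        {
            "n_centroids": rng.randint(*centroid_range),
            "targets_per_group": rng.randint(*group_range),
        }
        for _ in range(n_subquantizers)
    ]

params = sample_subquantizer_params(n_targets=8, n_subquantizers=3)
print(params)
```

Each subquantizer in the ensemble then gets its own independently drawn configuration, which is what provides the extra diversity mentioned above.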
Table III shows the results obtained by each method on each dataset, as well as their average ranks and total running times. We first see that MRQ and the eMRQr variants outperform the competing methods in 12 out of 18 datasets. More specifically, MRQ and one eMRQr variant have the best performance in two datasets each, while the other eMRQr variant outperforms all other methods in 10 datasets. Looking at the average ranks, we see that the latter eMRQr variant obtains the lowest average rank, followed by ERC and the other eMRQr variant. As can be seen in the critical difference diagram of Figure 5, the best eMRQr variant performs statistically significantly better than RLC and MRQ, while the experimental data is not sufficient to reach any conclusion with respect to the other methods.
The last row of Table III reports the total running time of each method (per-dataset running times are omitted due to space limitations). Experiments were run using 10 cores of a 64-bit CentOS Linux machine equipped with Intel Xeon E7-4860 processors running at 2.27 GHz, leveraging a parallelized implementation of the base learner. We observe that MRQ is the fastest method overall, while eMRQr is an order of magnitude faster than ERC. Taking into account that eMRQr is also the best overall performer in terms of accuracy, this makes it a very appealing multitarget regression model.
Dataset  ST  RLC  ERC  MRQ  eMRQr  eMRQr 

edm  0.822  0.817  0.823  0.544  0.566  0.546 
enb  0.085  0.087  0.082  0.117  0.109  0.105 
jura  0.529  0.546  0.531  0.658  0.664  0.563 
scpf  0.625  0.626  0.629  0.553  0.494  0.473 
sf1  0.972  0.975  0.971  0.463  0.463  0.463 
sf2  0.883  0.927  0.868  0.462  0.460  0.461 
slump  0.669  0.677  0.671  0.810  0.813  0.761 
andro  0.583  0.569  0.548  0.870  0.747  0.726 
atp1d  0.313  0.325  0.312  0.343  0.280  0.261 
atp7d  0.459  0.470  0.441  0.365  0.307  0.301 
rf1  0.063  0.080  0.060  0.182  0.070  0.051 
rf2  0.072  0.087  0.067  0.182  0.070  0.050 
osales  0.690  0.679  0.646  0.599  0.581  0.567 
wq  0.857  0.851  0.858  0.873  0.702  0.697 
oes10  0.412  0.410  0.412  0.466  0.494  0.454 
oes97  0.563  0.566  0.562  0.647  0.648  0.615 
scm1d  0.271  0.267  0.252  0.412  0.266  0.238 
scm20d  0.405  0.404  0.322  0.441  0.299  0.266 
Av. rank  3.722  4.056  3.000  4.556  3.472  2.194 
Time  1.27  9.89  90.86  0.14  1.72  5.56 
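The "Av. rank" rows in the tables above are obtained by ranking the methods per dataset and averaging. A minimal sketch of that computation, using made-up aRMAE scores for three hypothetical methods on four datasets (lower is better):

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical aRMAE scores: rows are datasets, columns are methods.
scores = np.array([
    [0.82, 0.80, 0.75],
    [0.10, 0.12, 0.09],
    [0.55, 0.50, 0.52],
    [0.40, 0.41, 0.39],
])
# Rank methods within each dataset (rank 1 = best, ties averaged),
# then average the per-dataset ranks per method.
ranks = rankdata(scores, axis=1)
avg_rank = ranks.mean(axis=0)
print(avg_rank)  # method 3 has the lowest (best) average rank
```

These average ranks are also the quantities fed into the Friedman/Nemenyi analysis behind critical difference diagrams such as Figure 5.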
VI Conclusion and future work
We presented MRQ, a new problem transformation approach for multitarget regression that was shown to offer accuracy comparable to state-of-the-art methods while being significantly more efficient. In addition, a more computationally expensive, ensemble version of MRQ was found to be more accurate than the other approaches on most of the studied datasets. MRQ has two important characteristics that distinguish it from other approaches for multitarget regression: (a) it models a discrete approximation of the joint distribution of the target variables, and (b) it scales to problems with very large output spaces, as it builds a constant number of models.
In the future, we would like to evaluate MRQ on real-world and synthetic datasets with significantly larger output spaces. We would also like to perform a deeper theoretical analysis of the two main sources of error in MRQ, i.e., the quantization error and the classification error, and devise better ways to choose quantization parameters that strike a good balance between these two error components. Finally, we would like to explore more sophisticated quantization schemes (e.g. [5, 30]) that reduce the redundancy between the different subquantizers and can hence achieve a smaller quantization error for a fixed quantizer complexity.
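The quantization-error component mentioned above can be measured directly as the reconstruction error of the codebook. The following is a minimal NumPy sketch (not the procedure used in the paper) showing how this error shrinks as quantizer complexity grows, which is one side of the tradeoff against classification error:

```python
import numpy as np

def kmeans_quantization_rmse(Y, k, n_iter=50, seed=0):
    """RMSE between target vectors and their nearest centroid for a
    k-means codebook of size k (plain-NumPy Lloyd iterations; a sketch,
    not the authors' procedure)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct data points.
    centers = Y[rng.choice(len(Y), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        d = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        # Recompute centroids (skip empty clusters).
        for j in range(k):
            if (assign == j).any():
                centers[j] = Y[assign == j].mean(0)
    d = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return float(np.sqrt(d.min(1).mean()))

rng = np.random.default_rng(1)
Y = rng.normal(size=(300, 4))  # synthetic 4-target output space
errs = [kmeans_quantization_rmse(Y, k) for k in (2, 8, 32)]
print(errs)  # quantization error shrinks as the codebook grows
```

Larger codebooks reduce quantization error but make the downstream classification problem harder, which is precisely the balance the parameter-selection question above is about.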
References

[1] (2012) Multi-target regression with rule ensembles. Journal of Machine Learning Research 13, pp. 2367-2407.
[2] (2011) Kernels for vector-valued functions: a review. arXiv preprint arXiv:1106.6251.
[3] (2007) Stepwise induction of multi-target model trees. In ECML.
[4] (2008) Convex multi-task feature learning. Machine Learning 73 (3), pp. 243-272.
[5] (2014) Additive quantization for extreme vector compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 931-938.
[6] (1995) Learning internal representations. In Proceedings of the Eighth Annual Conference on Computational Learning Theory, COLT 1995, Santa Cruz, California, USA, July 5-8, 1995, pp. 311-320.
[7] (1999) Simultaneous prediction of multiple chemical parameters of river water quality with TILDE. In Principles of Data Mining and Knowledge Discovery, Third European Conference, PKDD '99, Prague, Czech Republic, September 15-18, 1999, Proceedings, pp. 32-40.
[8] (1998) Top-down induction of clustering trees. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML 1998), Madison, Wisconsin, USA, July 24-27, 1998, pp. 55-63.
[9] (1997) Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 59 (1), pp. 3-54.
[10] (1996) Bagging predictors. Machine Learning 24 (2), pp. 123-140.
[11] (1994) Learning many related tasks at the same time with backpropagation. In Advances in Neural Information Processing Systems 7, NIPS Conference, Denver, Colorado, USA, 1994, pp. 657-664.
[12] (1997) Multitask learning. Machine Learning 28 (1), pp. 41-75.
[13] (2016) Predictive modeling of PV energy production: how to set up the learning task for a better prediction? IEEE Transactions on Industrial Informatics 13 (3), pp. 956-966.
[14] (2015) Generating discrete analogues of continuous probability distributions: a survey of methods and constructions. Journal of Statistical Distributions and Applications 2 (1), pp. 6.
[15] (2010) Learning incoherent sparse and low-rank patterns from multiple tasks. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, July 25-28, 2010, pp. 1179-1188.
[16] (2010) Graph-structured multi-task regression and an efficient optimization method for general fused lasso. arXiv preprint arXiv:1005.3579.
[17] (2008) A unified architecture for natural language processing: deep neural networks with multitask learning. In Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), Helsinki, Finland, June 5-9, 2008, pp. 160-167.
[18] (2012) On label dependence and loss minimization in multi-label classification. Machine Learning 88 (1-2), pp. 5-45.
[19] (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7, pp. 1-30.
[20] (2001) The Elements of Statistical Learning. Springer Series in Statistics, Springer, Berlin.
[21] (1996) Multi-task learning for stock selection. In Advances in Neural Information Processing Systems 9, NIPS, Denver, CO, USA, December 2-5, 1996, pp. 946-952.
[22] (1998) Quantization. IEEE Transactions on Information Theory 44 (6), pp. 2325-2383.
[23] (2008) An empirical study on sea water quality prediction. Knowledge-Based Systems 21 (6), pp. 471-478.
[24] (1975) Reduced-rank regression for the multivariate linear model. Journal of Multivariate Analysis 5 (2), pp. 248-264.
[25] (2011) Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (1), pp. 117-128.
[26] (2014) Random forests with random projections of the output space for high dimensional multi-label classification. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 607-622.
[27] (2010) Tree-guided group lasso for multi-task regression with structured sparsity. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, pp. 543-550.
[28] (2007) Ensembles of multi-objective decision trees. In Machine Learning: ECML 2007, 18th European Conference on Machine Learning, Warsaw, Poland, September 17-21, 2007, Proceedings, pp. 624-631.
[29] (2012) Binary relevance efficacy for multilabel classification. Progress in AI 1 (4), pp. 303-313.
[30] (2016) Revisiting additive quantization. In European Conference on Computer Vision, pp. 137-153.
[31] (2010) Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing 20 (2), pp. 231-252.
[32] (2011) Classifier chains for multi-label classification. Machine Learning 85 (3), pp. 333-359.
[33] (2004) SVM multiregression for nonlinear channel estimation in multiple-input multiple-output systems. IEEE Transactions on Signal Processing 52 (8), pp. 2298-2307.
[34] (2019) Information theoretic multi-target feature selection via output space quantization. Entropy 21 (9), pp. 855.
[35] (2016) Multi-target regression via input space expansion: treating targets as inputs. Machine Learning 104 (1), pp. 55-98.
[36] (2010) Mining multi-label data. In Data Mining and Knowledge Discovery Handbook, 2nd ed., pp. 667-685.
[37] (2011) Random k-labelsets for multilabel classification. IEEE Transactions on Knowledge and Data Engineering 23, pp. 1079-1089.
[38] (2011) Mulan: a Java library for multi-label learning. Journal of Machine Learning Research 12, pp. 2411-2414.
[39] (2014) Multi-target regression via random linear target combinations. In Machine Learning and Knowledge Discovery in Databases, European Conference, ECML PKDD 2014, Nancy, France, September 15-19, 2014, Proceedings, Part III, pp. 225-240.
[40] (1980) Multivariate regression analysis and canonical variates. Canadian Journal of Statistics 8 (1), pp. 27-39.
[41] (2019) Multi-target prediction: a unifying view on problems and methods. Data Mining and Knowledge Discovery 33 (2), pp. 293-324.
[42] (1985) Partial least squares. Encyclopedia of Statistical Sciences.
[43] (2014) A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering 26 (8), pp. 1819-1837.
[44] (2011) Clustered multi-task learning via alternating structure optimization. In Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011, December 12-14, 2011, Granada, Spain, pp. 702-710.
[45] (2012) Multi-task learning: theory, algorithms, and applications. 2012 SIAM International Conference on Data Mining.