1 Introduction and Related Work
In traditional supervised learning, each instance is associated with one single outcome. Multi-output (or multi-target) prediction is a supervised learning task, where multiple targets can be assigned to each observation. In this learning problem, target variables can be of any kind (real-valued, discrete, categorical).
When all target variables are binary, this problem is known as multi-label classification [19, 24, 27, 34]. Multi-label classification originated from text classification and is increasingly being used in many different applications such as music categorization or semantic scene categorization.
On the other hand, if all target variables are real-valued, the multi-output prediction problem is known as multivariate regression. Broad overviews of this topic are available in the literature. Applications appear in many different fields, such as ecological modeling of multiple real-valued target variables describing the quality of vegetation, predicting wind noise (represented by several variables), or the estimation of multiple gas tank levels of a gas converter system.
Multi-output prediction can be seen as the most generalized and flexible form of learning to predict multiple targets, as it allows the target variables to be of mixed kind as well. Important use cases for mixed target variables can be found in psychological research. For instance, much work in the field of personality psychology is focused on the prediction of personality and demographic traits based on behavioral data [12, 22]. As traits like gender and age [8, 9] have been found to be related to personality, it would be very useful to simultaneously predict personality via regression, gender via classification, and age via ordinal regression, instead of predicting them independently.
Currently, few methods can handle learning tasks with objectives of different kinds. Instead of adapting existing methods to handle more than one target, we will use the problem transformation method for predicting multiple targets. For this, we will analyze the similarity-enforcing method of using predicted targets as feature representation, which has been studied extensively in the multi-label community [16, 18, 17, 23] and has been adapted to multivariate regression [3, 25]. We will define this method for the more general multi-output prediction problem and introduce a component-wise boosting approach for learning and visualizing the target dependencies. Since the interpretability of black-box models has become an important topic in the machine learning community, we aimed for a method that not only uses target dependencies for predictions, but also makes them easy to understand.
For general discussions about multi-output prediction in a broader context we refer to [30, 31]. Another method for multi-label classification learns label dependencies in the form of rules. The problem transformation method for multi-label learning is extensively discussed in many papers [34, 16, 17, 18]. This method has also been used for multivariate regression in [26, 3].
Main contributions of this paper:
A formal definition of the problem transformation method for multi-output prediction problems.
A novel method similar to the two-step stacking method, which allows interpretations and visualizations of target dependencies.
2 Definition: Multi-Output Prediction
A multi-output prediction problem can be characterized by instances $x^{(i)} \in \mathcal{X}$, $i = 1, \dots, n$, and targets $y_j$, $j = 1, \dots, m$. The relationship between an instance $x^{(i)}$ and the target $y_j$ can be characterized by a one-dimensional score $y_j^{(i)} \in \mathcal{Y}_j$, which can be nominal, ordinal, or real-valued. A multi-output prediction problem can thus be written as a dataset $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$, where the target variable $y^{(i)} = (y_1^{(i)}, \dots, y_m^{(i)})$ is a vector. This dataset can be portrayed in matrix form:
$$
\begin{array}{c|cccc}
 & y_1 & y_2 & \cdots & y_m \\ \hline
x^{(1)} & y_1^{(1)} & y_2^{(1)} & \cdots & y_m^{(1)} \\
x^{(2)} & y_1^{(2)} & y_2^{(2)} & \cdots & y_m^{(2)} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x^{(n)} & y_1^{(n)} & y_2^{(n)} & \cdots & y_m^{(n)}
\end{array}
$$
We can get the formal definition for multivariate regression by only allowing real values for $y_j^{(i)}$. By limiting $y_j^{(i)}$ to binary values 0 or 1, we get the formal definition for multi-label classification. However, since we do not need this limitation and want to deal with prediction problems with heterogeneous output spaces as well, we allow $y_j^{(i)}$ to be of any single-valued kind. We call this problem multi-output prediction, which can be seen as a generalization of multi-label classification and multivariate regression. We use the term multi-output prediction to refer to the general prediction task and only use the terms multi-label classification, multivariate regression, or mixed-type prediction if we specifically refer to them.
3 Measuring Performance in Multi-Output Prediction Problems
For traditional single-target machine learning problems, performance measurement is intuitive and there are many metrics, such as accuracy, F-measure, and AUC for classification, or the mean squared error (MSE) and mean absolute error (MAE) for regression. Once we have multiple target variables, measuring performance becomes non-trivial.
There are many ways of handling this problem. First, we can compare the actual target vector with the predicted target vector and then calculate one performance metric. Many performance measures have been constructed this way for multi-label classification and multivariate regression problems [34, 3].
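For multi-label learning, an example would be the so-called Hamming-loss, which compares the predicted labels $\hat{y}_j^{(i)}$ of an instance $i$ with its actual labels $y_j^{(i)}$, $j = 1, \dots, m$:
$$
\mathrm{HL}\big(y^{(i)}, \hat{y}^{(i)}\big) = \frac{1}{m} \sum_{j=1}^{m} \mathbb{1}\big[y_j^{(i)} \neq \hat{y}_j^{(i)}\big]
$$
This value is calculated instance-wise, and the performance on a test set is the mean Hamming-loss over all instances.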
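As a minimal sketch of the instance-wise computation described above, the following plain-Python functions compute the Hamming-loss of one instance and its mean over a test set. The function names and the toy label matrices are illustrative assumptions, not part of the benchmark code.

```python
def hamming_loss_instance(y_true, y_pred):
    """Fraction of labels of one instance that are predicted incorrectly."""
    if len(y_true) != len(y_pred):
        raise ValueError("label vectors must have the same length")
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def hamming_loss(Y_true, Y_pred):
    """Mean of the instance-wise Hamming-losses over a test set."""
    losses = [hamming_loss_instance(t, p) for t, p in zip(Y_true, Y_pred)]
    return sum(losses) / len(losses)


# Hypothetical toy data: two instances with three binary labels each.
Y_true = [[1, 0, 1], [0, 0, 1]]
Y_pred = [[1, 1, 1], [0, 0, 0]]
print(hamming_loss(Y_true, Y_pred))
```

Each instance in this toy example has exactly one wrong label out of three, so both instance-wise losses, and hence their mean, equal one third.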
There are many more multi-label performance measures, like the 0/1-loss, Accuracy, Precision, or Ranking-loss, which can be defined intuitively because of the binary structure of multi-label learning problems.
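For multivariate regression, an example is the multivariate mean squared error (MMSE), which is the mean of the MSEs of all $m$ targets over $n$ observations:
$$
\mathrm{MMSE} = \frac{1}{m} \sum_{j=1}^{m} \mathrm{MSE}_j = \frac{1}{m} \sum_{j=1}^{m} \frac{1}{n} \sum_{i=1}^{n} \big(y_j^{(i)} - \hat{y}_j^{(i)}\big)^2
$$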
When every target is a regression task, many multivariate performance metrics can be defined. When using such metrics for multivariate regression problems, one should pay attention to the value range of the target variables. Targets with larger value ranges have more influence on the metric than targets with smaller value ranges. One possible way of handling this problem is to standardize the target values.
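The effect of the value range can be illustrated with a small sketch. It computes the mean of the per-target MSEs once on raw targets and once on standardized targets. The data are made up, and standardizing with the mean and standard deviation of the true target columns is a simplifying assumption for this illustration (the benchmark in this paper uses training-set statistics).

```python
from statistics import pstdev


def mse(y_true, y_pred):
    """Mean squared error of one target column."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mmse(cols_true, cols_pred):
    """Mean of the per-target MSEs; each argument is a list of target columns."""
    errors = [mse(t, p) for t, p in zip(cols_true, cols_pred)]
    return sum(errors) / len(errors)


def standardize(column, mu, sd):
    return [(v - mu) / sd for v in column]


# Two hypothetical targets with the same relative error but different scales.
true_cols = [[1.0, 2.0, 3.0], [1000.0, 2000.0, 3000.0]]
pred_cols = [[1.5, 2.5, 3.5], [1500.0, 2500.0, 3500.0]]

raw = mmse(true_cols, pred_cols)  # dominated by the large-scale target

# Assumption: population standard deviation of the true values as the scale.
stats = [(sum(c) / len(c), pstdev(c)) for c in true_cols]
std_true = [standardize(c, mu, sd) for c, (mu, sd) in zip(true_cols, stats)]
std_pred = [standardize(c, mu, sd) for c, (mu, sd) in zip(pred_cols, stats)]
scaled = mmse(std_true, std_pred)  # both targets now contribute equally

print(raw, scaled)
```

On the raw scale the second target contributes almost the entire value, whereas after standardization both targets contribute the same per-target MSE.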
However, in the more general multi-output prediction problem, calculating one single performance value from possibly mixed target spaces is not trivial. Note that many multi-label and multivariate regression performance metrics are a weighted sum of performance metrics for each target. We could write a general performance metric like this:
$$
L(Y, \hat{Y}) = \sum_{j=1}^{m} w_j \, L_j(y_j, \hat{y}_j)
$$
where $L_j$ is a performance metric for target $j$ and $w_j$ is its weight.
The Hamming-loss and the MMSE are just special cases of this more general performance metric.
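A short sketch can make the weighted-sum view concrete. The function `weighted_performance` below is a hypothetical helper written for this illustration. With equal weights and the misclassification error per label it reproduces the Hamming-loss, and with user-chosen weights it combines a classification and a regression metric. The data and the weights are made up.

```python
def mmce(y_true, y_pred):
    """Mean misclassification error of one classification target."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error of one regression target."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_performance(cols_true, cols_pred, metrics, weights):
    """Weighted sum of per-target metrics; inputs are lists of target columns."""
    return sum(
        w * metric(t, p)
        for w, metric, t, p in zip(weights, metrics, cols_true, cols_pred)
    )


# Multi-label special case: weights 1/m and mmce per label give the Hamming-loss.
labels_true = [[1, 0], [0, 0], [1, 1]]  # three label columns, two instances
labels_pred = [[1, 0], [1, 0], [1, 0]]
hl = weighted_performance(labels_true, labels_pred, [mmce] * 3, [1 / 3] * 3)

# Mixed case: one classification and one regression target, arbitrary weights.
mixed_true = [["f", "m", "f"], [0.2, -1.0, 0.5]]
mixed_pred = [["f", "f", "f"], [0.0, -1.0, 1.0]]
mixed = weighted_performance(mixed_true, mixed_pred, [mmce, mse], [0.5, 0.5])

print(hl, mixed)
```

The mixed value depends entirely on the chosen weights and on the scale of the regression target, which is one reason why the choice of such a metric is best left to the user.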
Since datasets with mixed target spaces can differ considerably, and classification performance metrics would have to be combined with regression performance metrics during evaluation, a general definition of a performance metric is infeasible and should thus be left to the user. One could also handle multi-output prediction problems as multi-objective optimization problems, where trade-offs between multiple (possibly conflicting) objectives (such as minimizing the MSE for a regression target and maximizing the AUC for a classification target) need to be considered. For multi-label classification, this has been discussed in the literature.
Nevertheless, a further motivation to consider multi-output prediction methods instead of modeling each target independently is that improvements can be made for each target respectively. Each target can be treated independently and we can analyse whether more complex methods are feasible for each target.
For problems with mixed target variables, we will focus on target-wise comparisons and use the mean classification error for classification problems and the mean squared error for regression problems. For multi-label classification and multivariate regression we will also report the Hamming-loss and MMSE.
4 Learning Target Dependencies
There are two main ways to model problems with more than one target that are extensively studied in the multi-label community [34, 17] and could also be applied to multi-output prediction. One of them is the algorithm adaptation method, which aims at adapting existing algorithms to handle multiple outputs. The other one is the problem transformation method, which aims to transform the multi-label learning problem into more established one-target prediction problems [34, 18, 16]. The problem transformation method has the advantage that any already established one-target machine learning model can be used.
In this paper, we will focus on the problem transformation method and how to use it for multi-output prediction problems. Originally used in the multi-label community, these methods were later adapted to multivariate regression. The idea of modeling target dependencies by using other target information as features is not restricted by the type of outputs and can thus be used for multi-output prediction problems as well.
4.1 Independent Models (IM)
The easiest problem transformation method (called binary relevance method in the multi-label community) is to use one model for each target independently and to combine the predictions afterwards. Target dependencies are thus not being considered when using independent models.
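Given a dataset $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$ with target vectors $y^{(i)} = (y_1^{(i)}, \dots, y_m^{(i)})$ and $m$ (possibly mixed) targets, we train $m$ models, one for each target independently:
For $j = 1, \dots, m$ train model $f_j$ on $\mathcal{D}_j = \{(x^{(i)}, y_j^{(i)})\}_{i=1}^{n}$ (5)
A new observation $x^*$ will get the prediction $\hat{y}^* = (f_1(x^*), \dots, f_m(x^*))$.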
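The independent-models scheme can be sketched in a few lines. The paper's benchmark uses random forests as one-target learners; to keep the example dependency-free, the sketch below swaps in a simple 1-nearest-neighbour learner, which works for numeric and categorical targets alike. The class `NearestNeighbour`, the helper names, and the data are assumptions made for this illustration.

```python
class NearestNeighbour:
    """Stand-in one-target learner (1-NN); not the learner used in the paper."""

    def fit(self, X, y):
        self.X, self.y = [list(x) for x in X], list(y)
        return self

    def predict_one(self, x):
        # Squared Euclidean distance to every training instance.
        dists = [sum((a - b) ** 2 for a, b in zip(x, xi)) for xi in self.X]
        return self.y[dists.index(min(dists))]


def fit_independent_models(X, Y, make_learner):
    """Train one model per target column; targets may be of mixed type."""
    m = len(Y[0])
    return [make_learner().fit(X, [row[j] for row in Y]) for j in range(m)]


def predict_independent(models, x):
    """Combine the per-target predictions into one prediction vector."""
    return [model.predict_one(x) for model in models]


# Hypothetical data: one feature, one real-valued and one categorical target.
X = [[0.0], [1.0], [2.0], [3.0]]
Y = [[0.1, "a"], [1.1, "a"], [1.9, "b"], [3.2, "b"]]

models = fit_independent_models(X, Y, NearestNeighbour)
print(predict_independent(models, [2.2]))
```

Because every model sees only its own target column, no information about the other targets enters the predictions.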
4.2 Stacking (STA): Using Targets as Features
One way to model target variable dependencies is to use target variables as features. A distinction can be made between different ways in which these target variables are obtained. For instance, the real target values can be used as features, since these are available during training time. Examples would be classifier chains or dependent binary relevance. The alternative would be to create predicted target values by using an inner cross-validation loop (e.g. nested stacking, stacking [30, 16]). Comparisons between these methods are discussed in the literature. In this paper, however, we will discuss the stacking method in more detail. After fitting the same independent models (5), as they are needed at prediction time, we obtain predicted targets through an inner cross-validation strategy:
For $j = 1, \dots, m$ use inner-CV on $\mathcal{D}_j = \{(x^{(i)}, y_j^{(i)})\}_{i=1}^{n}$ to obtain out-of-fold predictions $\hat{y}_j^{(i)}$, $i = 1, \dots, n$ (6)
The inner cross-validation strategy can become resource-intensive, as many models have to be fit. Hence, a trade-off between a sufficient cross-validation strategy and available computing resources needs to be made.
In a next step, these predicted target variables are used to extend the feature space, and a second set of models is fit for each target:
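For $j = 1, \dots, m$ train model $g_j$ on $\{((x^{(i)}, \hat{y}^{(i)}), y_j^{(i)})\}_{i=1}^{n}$, where $\hat{y}^{(i)}$ denotes the predicted targets of instance $i$ obtained by (6)
At prediction time, we first get predicted targets $\hat{y}^*$ with the independent models (5), which are then added to the new observation $x^*$: $\tilde{x}^* = (x^*, \hat{y}^*)$. The final prediction is $(g_1(\tilde{x}^*), \dots, g_m(\tilde{x}^*))$.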
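The two-stage procedure can be sketched as follows. This is an illustration under several assumptions: a 1-nearest-neighbour learner stands in for the random forests of the benchmark, the inner folds are assigned round-robin, all targets are numeric so that predicted targets can be appended to the features directly, and every second-stage model receives all predicted targets. All names and data are made up.

```python
class NearestNeighbour:
    """Stand-in one-target learner (1-NN); not the learner used in the paper."""

    def fit(self, X, y):
        self.X, self.y = [list(x) for x in X], list(y)
        return self

    def predict_one(self, x):
        dists = [sum((a - b) ** 2 for a, b in zip(x, xi)) for xi in self.X]
        return self.y[dists.index(min(dists))]


def out_of_fold(X, y, make_learner, k):
    """Inner k-fold CV: each prediction comes from a model that never saw that row."""
    preds = [None] * len(X)
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        test = [i for i in range(len(X)) if i % k == fold]
        model = make_learner().fit([X[i] for i in train], [y[i] for i in train])
        for i in test:
            preds[i] = model.predict_one(X[i])
    return preds


def fit_stacking(X, Y, make_learner, k=2):
    m = len(Y[0])
    cols = [[row[j] for row in Y] for j in range(m)]
    first = [make_learner().fit(X, col) for col in cols]          # models from (5)
    oof = [out_of_fold(X, col, make_learner, k) for col in cols]  # predictions from (6)
    # Extend the feature space with the out-of-fold predicted targets.
    X_ext = [list(x) + [oof[j][i] for j in range(m)] for i, x in enumerate(X)]
    second = [make_learner().fit(X_ext, col) for col in cols]
    return first, second


def predict_stacking(first, second, x):
    y_hat = [f.predict_one(x) for f in first]   # step 1: independent models
    x_ext = list(x) + y_hat                     # step 2: extended observation
    return [g.predict_one(x_ext) for g in second]


X = [[0.0], [1.0], [2.0], [3.0]]
Y = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
first, second = fit_stacking(X, Y, NearestNeighbour, k=2)
print(predict_stacking(first, second, [3.0]))
```

A 1-nearest-neighbour learner reproduces its own training targets exactly, which shows why in-sample predictions would be misleading as second-stage features and why out-of-fold predictions are used instead.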
4.3 Component-wise Multi-output Boosting (CMOB)
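For our novel method, we propose to use component-wise boosting to learn the target dependency structure. As for most machine learning models, the aim of component-wise boosting is to minimize the empirical risk of a model $f$ with respect to a loss function $L$:
$$
\mathcal{R}_{\mathrm{emp}}(f) = \frac{1}{n} \sum_{i=1}^{n} L\big(y^{(i)}, f(x^{(i)})\big)
$$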
Component-wise boosting, also called model-based boosting, generalizes the boosting framework to multiple base-learners. In each boosting iteration $t$, the algorithm fits every base-learner from a space of base-learners to the pseudo residuals and selects the one with the smallest sum of squared errors. The selected base-learner $\hat{b}^{[t]}$ improves the empirical risk of the current model $\hat{f}^{[t]}$, which is computed via stage-wise additive modeling with a learning rate $\nu$:
$$
\hat{f}^{[t]}(x) = \hat{f}^{[t-1]}(x) + \nu \, \hat{b}^{[t]}(x)
$$
For our purpose, numerical features are included as a linear effect $\hat{b}^{[t]}(x) = \theta^{[t]} x_{k(t)}$, where $k(t)$ is a mapping from iteration $t$ to the selected feature. For categorical features, each group is added as a single one-hot-coded base-learner that just includes an intercept for that group. Boosting this kind of base-learner maintains interpretability because of the additive structure of the model and the repeated selection of the same base-learners.
An important property of component-wise boosting is its intrinsic feature selection. This is achieved by selecting just one base-learner per iteration. After training for $T$ iterations, we get a subset of all features that are required to predict the target. This provides information about the importance of each feature. In our multi-output prediction case, we use this internal feature selection to learn which predicted target variables are required to explain the target $y_j$.
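The selection mechanism can be sketched for the squared-error loss, where the pseudo residuals are the ordinary residuals. The function below is a simplified illustration and not the implementation used in the benchmark. It assumes one linear base-learner per numeric feature, centres the features internally so that each base-learner effectively carries its own intercept, and uses made-up data in which only the first feature is related to the target.

```python
def componentwise_boost(X, y, n_iter=100, lr=0.1):
    """Component-wise L2 boosting with one linear base-learner per feature.

    Returns the intercept, one coefficient per feature, and the index of the
    feature selected in each iteration.
    """
    n, p = len(X), len(X[0])
    means = [sum(row[k] for row in X) / n for k in range(p)]
    cols = [[row[k] - means[k] for row in X] for k in range(p)]  # centred features
    fitted = [sum(y) / n] * n  # start from the mean of the target
    coefs = [0.0] * p
    selected = []
    for _ in range(n_iter):
        resid = [yi - fi for yi, fi in zip(y, fitted)]
        best = None
        for k, col in enumerate(cols):
            denom = sum(v * v for v in col)
            if denom == 0.0:
                continue  # constant feature carries no information
            theta = sum(v * r for v, r in zip(col, resid)) / denom
            sse = sum((r - theta * v) ** 2 for v, r in zip(col, resid))
            if best is None or sse < best[0]:
                best = (sse, k, theta)
        if best is None:
            break
        _, k, theta = best
        coefs[k] += lr * theta  # only the selected base-learner is updated
        fitted = [f + lr * theta * v for f, v in zip(fitted, cols[k])]
        selected.append(k)
    intercept = sum(y) / n - sum(c * mu for c, mu in zip(coefs, means))
    return intercept, coefs, selected


# Hypothetical data: the target depends on feature 0 only.
X = [[-2, 1, 0.5], [-1, -1, -0.5], [0, 0, 0.5], [1, 1, -0.5], [2, -1, 0.0]]
y = [3 * row[0] + 1 for row in X]
intercept, coefs, selected = componentwise_boost(X, y)
print(coefs, sorted(set(selected)))
```

In this toy example the unrelated features are never selected, so their coefficients stay at exactly zero, which is the sparsity property the text refers to.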
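To go one step further, we would also like to know which of the selected features are more important than others. Therefore, we can again use the additive structure of component-wise boosting to calculate a feature importance for all selected features. After $T$ boosting iterations, we calculate the feature importance of the $k$-th feature as the sum of the empirical risk improvements achieved in the iterations in which it was selected:
$$
\mathrm{FI}_k = \sum_{t=1}^{T} \mathbb{1}\big[k(t) = k\big] \Big( \mathcal{R}_{\mathrm{emp}}\big(\hat{f}^{[t-1]}\big) - \mathcal{R}_{\mathrm{emp}}\big(\hat{f}^{[t]}\big) \Big)
$$
where $k(t)$ denotes the feature selected in iteration $t$ and $\hat{f}^{[t]}$ the model after iteration $t$.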
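One requirement for calculating meaningful feature importance scores is to choose an adequate number of iterations $T$, which can be done by using early stopping. This stops the procedure if the relative improvement $\big(\mathcal{R}_{\mathrm{emp}}(\hat{f}^{[t-1]}) - \mathcal{R}_{\mathrm{emp}}(\hat{f}^{[t]})\big) / \mathcal{R}_{\mathrm{emp}}(\hat{f}^{[t-1]})$ of the empirical risk falls below a pre-defined value $\varepsilon$ for several consecutive iterations, where $\hat{f}^{[t]}$ denotes the model after iteration $t$. We chose component-wise boosting over other methods that produce sparse and interpretable models (like lasso regression) because of the flexibility in the choice of the base-learners. Non-linear effects can easily be modeled using splines as base-learners.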
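Feature importance and early stopping can be added to a component-wise boosting loop with little bookkeeping. The following sketch is an illustration under assumptions: squared-error loss, linear base-learners on centred numeric features, and default values for `eps` and `patience` chosen to mirror the 0.01% and 5-iteration setting reported in the benchmark. Names and data are made up.

```python
def risk(y, fitted):
    """Empirical risk under squared-error loss (mean squared residual)."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / len(y)


def boost_with_importance(X, y, max_iter=10000, lr=0.1, eps=1e-4, patience=5):
    n, p = len(X), len(X[0])
    means = [sum(row[k] for row in X) / n for k in range(p)]
    cols = [[row[k] - means[k] for row in X] for k in range(p)]  # centred features
    fitted = [sum(y) / n] * n
    risks = [risk(y, fitted)]
    importance = [0.0] * p
    stalled = 0  # consecutive iterations with too little relative improvement
    while len(risks) - 1 < max_iter and stalled < patience and risks[-1] > 0.0:
        resid = [yi - fi for yi, fi in zip(y, fitted)]
        candidates = []
        for k, col in enumerate(cols):
            denom = sum(v * v for v in col)
            if denom > 0.0:
                theta = sum(v * r for v, r in zip(col, resid)) / denom
                sse = sum((r - theta * v) ** 2 for v, r in zip(col, resid))
                candidates.append((sse, k, theta))
        if not candidates:
            break
        _, k, theta = min(candidates)  # base-learner with the smallest SSE
        fitted = [f + lr * theta * v for f, v in zip(fitted, cols[k])]
        risks.append(risk(y, fitted))
        gain = risks[-2] - risks[-1]
        importance[k] += gain  # credit the risk reduction to the selected feature
        stalled = stalled + 1 if gain / risks[-2] < eps else 0
    return importance, risks


# Hypothetical data: feature 0 explains most of the target, feature 1 very little.
X = [[-2.0, 1.0], [-1.0, -1.0], [0.0, 0.0], [1.0, 1.0], [2.0, -1.0]]
y = [-6.0, -3.0, 1.0, 3.0, 5.0]
importance, risks = boost_with_importance(X, y)
print(importance, len(risks) - 1)
```

Because the risk reductions telescope, the importances sum to the difference between the initial and the final empirical risk.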
We now introduce Component-Wise Multi-output Boosting (CMOB) (see algorithm 1). The idea is to use component-wise boosting to learn the target dependencies in a sparse and interpretable manner. CMOB aims at modeling target dependencies through a dataset of predicted target variables $\hat{y}^{(i)} = (\hat{y}_1^{(i)}, \dots, \hat{y}_m^{(i)})$, $i = 1, \dots, n$, just like the stacking algorithm (see section 4.2).
One difference is that in our algorithm the original features are omitted, because we are only interested in the interactions between the target variables. Interactions between predicted target variables and features are thus not modeled.
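Given the dataset of predicted target variables $\hat{y}^{(i)} = (\hat{y}_1^{(i)}, \dots, \hat{y}_m^{(i)})$, $i = 1, \dots, n$ (obtained by (6)), we train $m$ component-wise boosting models, one for each target:
For j = 1,…,m train component-wise boosting model $h_j$ on $\{(\hat{y}^{(i)}, y_j^{(i)})\}_{i=1}^{n}$
A new observation $x^*$ will be predicted in a two-step procedure:
Use the independent models $f_1, \dots, f_m$ from (5) to create predicted targets: $\hat{y}^* = (f_1(x^*), \dots, f_m(x^*))$
Use the boosting models to create final predictions: $(h_1(\hat{y}^*), \dots, h_m(\hat{y}^*))$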
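The second stage of this procedure can be sketched as follows, assuming that the out-of-fold predicted targets are already available as a numeric matrix. The sketch covers regression targets with squared-error loss only; classification targets would need a different loss. The compact boosting routine, the helper names `fit_cmob` and `predict_cmob`, and the data are assumptions made for this illustration and are not taken from the benchmark code.

```python
def componentwise_boost(Z, y, n_iter=500, lr=0.1):
    """Component-wise L2 boosting on centred features; returns (intercept, coefs)."""
    n, p = len(Z), len(Z[0])
    means = [sum(row[k] for row in Z) / n for k in range(p)]
    cols = [[row[k] - means[k] for row in Z] for k in range(p)]
    fitted = [sum(y) / n] * n
    coefs = [0.0] * p
    for _ in range(n_iter):
        resid = [yi - fi for yi, fi in zip(y, fitted)]
        best = None
        for k, col in enumerate(cols):
            denom = sum(v * v for v in col)
            if denom == 0.0:
                continue
            theta = sum(v * r for v, r in zip(col, resid)) / denom
            sse = sum((r - theta * v) ** 2 for v, r in zip(col, resid))
            if best is None or sse < best[0]:
                best = (sse, k, theta)
        if best is None:
            break
        _, k, theta = best
        coefs[k] += lr * theta
        fitted = [f + lr * theta * v for f, v in zip(fitted, cols[k])]
    intercept = sum(y) / n - sum(c * mu for c, mu in zip(coefs, means))
    return intercept, coefs


def fit_cmob(Y_hat, Y):
    """One boosting model per target; features are the predicted targets only."""
    m = len(Y[0])
    return [componentwise_boost(Y_hat, [row[j] for row in Y]) for j in range(m)]


def predict_cmob(models, y_hat_new):
    """Second step: map first-stage predictions to final predictions."""
    return [b0 + sum(c * v for c, v in zip(coefs, y_hat_new)) for b0, coefs in models]


# Hypothetical out-of-fold predictions (two targets) and true target values.
Y_hat = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0], [5.0, 0.0]]
Y = [[2.0 * a, b + 0.5 * a] for a, b in Y_hat]

models = fit_cmob(Y_hat, Y)
print(predict_cmob(models, [3.0, 1.0]))
```

In this toy setup the model for the first target relies almost entirely on that target's own prediction, while the model for the second target also picks up the first predicted target. The fitted coefficients can be read off directly, which is the kind of interpretation CMOB is designed to support.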
We use openly available datasets that can be downloaded from OpenML [29, 7]. Since datasets for mixed-type prediction are quite uncommon, we mainly used multi-label and multivariate regression datasets. We have limited the number of targets to a maximum of 7 in order to keep the computing time reasonable and the visualizations more understandable. The multi-label classification and multivariate regression datasets are described in detail in [28, 26, 17]. The mixed-type datasets are both personality prediction datasets [2, 22]. See table 1 for more details on the datasets.
To analyze the potential of learning the target dependency structure, we compare the performance of the proposed CMOB algorithm with a stacking model (STA), which uses all other predicted labels as features. We compare these algorithms with independent models (IM) as baseline. See table 2 for an overview of the benchmark settings.
For CMOB, we use linear base-learners for the underlying component-wise boosting algorithm with a maximum of 10000 iterations. Since we strive for sparse models, we have applied an early-stopping strategy. The boosting process stops when no improvement of at least 0.01% has been achieved for 5 consecutive iterations.
As one-target algorithms for classification and regression we will use random forests, as they typically perform well in many different scenarios without the need for tuning hyperparameters.
Performance will be evaluated with an outer 10-fold cross-validation strategy. For classification tasks we will use the mean misclassification error (mmce) as performance metric. For regression tasks we will use the mean squared error (MSE) of the standardized target values (test set target values are standardized using the mean and standard deviation of the respective training sets). In the inner training sets, the predicted targets are created with an inner 10-fold cross-validation strategy. The outer test sets are only used for prediction and performance evaluation. Finally, the models are trained on the whole datasets. For full reproducibility, the benchmark code is available online [1].
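The resampling bookkeeping described above can be sketched as follows. The round-robin fold assignment, the use of the population standard deviation, and the toy target are assumptions made for this illustration; the benchmark code may differ in these details.

```python
from statistics import fmean, pstdev


def kfold_indices(n, k):
    """Split range(n) into k disjoint test folds (round-robin assignment)."""
    return [[i for i in range(n) if i % k == fold] for fold in range(k)]


def standardize_with_train_stats(y, train_idx, test_idx):
    """Standardize training and test targets with training mean and sd only."""
    train = [y[i] for i in train_idx]
    mu, sd = fmean(train), pstdev(train)
    y_train = [(y[i] - mu) / sd for i in train_idx]
    y_test = [(y[i] - mu) / sd for i in test_idx]
    return y_train, y_test


y = [float(v) for v in range(20)]  # hypothetical regression target
for test_idx in kfold_indices(len(y), k=10):            # outer 10-fold CV
    train_idx = [i for i in range(len(y)) if i not in test_idx]
    y_train_std, y_test_std = standardize_with_train_stats(y, train_idx, test_idx)
    # Inner folds are drawn from the outer training rows only, so the outer
    # test rows never influence the predicted targets used for training.
    inner_folds = kfold_indices(len(train_idx), k=10)
```

Using only training statistics keeps the outer test set untouched until the final performance evaluation.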
Table 2: Overview of the benchmark settings.

| Setting | Value |
| --- | --- |
| Multi-output algorithms | IM, STA, CMOB |
| Outer resampling strategy | 10-fold CV |
| Resampling strategy for creating predictions | 10-fold CV |
| One-target regression learner | Random forest |
| One-target classification learner | Random forest |
| Regression measure | MSE (of standardized target values) |
| Maximum iterations for boosting | 10000 |
| Early stopping strategy | No improvement of at least 0.01% for 5 consecutive iterations |
We summarized the results of the benchmark in table 3. Reported values are mean values (over the outer test sets) of MSE or mmce (depending on the task) for each dataset and each target. For multi-label classification tasks we also included the Hamming-loss (HL) and for multivariate regression tasks the multivariate mean squared error (MMSE).
CMOB could not improve the overall Hamming-loss of the multi-label datasets used in this benchmark. Looking at the performance values for each target individually, we can see that for some targets (e.g. for the dataset emotions) CMOB could improve the mmce, but for others (e.g. for the dataset image) our algorithm did worse. However, the stacking algorithm (STA) did not perform very well on the multi-label datasets either and could only improve the Hamming-loss on the image dataset by a small margin. Modeling each target independently seems to be a strong baseline here.
More interesting is the performance of CMOB for the multivariate regression datasets. For 4 (andro, jura, sf1, slump) of the 10 multivariate regression datasets, CMOB could improve the MMSE over independent models. In 2 of these tasks, CMOB could even beat the stacking algorithm. Looking at each target variable individually, we can see that CMOB performs comparably to the stacking algorithm, showing improvements in the MSE when stacking also improves over independent models (with some exceptions, e.g. in the dataset sf2).
For the mixed-type datasets youtube and sens, we can see improvements for some targets when using CMOB. This suggests that the use of multi-output methods can be useful for personality prediction.
Based on datasets for which considerable improvements have been achieved, we show the interpretations of the target dependencies in the following section.
5.3 Interpretation of Target Dependencies
Example: Andromeda Dataset
The Andromeda dataset (andro) deals with the prediction of water quality variables (temperature, pH, conductivity, salinity, oxygen, turbidity). CMOB performed well on this dataset: it made improvements for every target variable and performed almost as well as the stacking algorithm.
To further inspect the target dependencies, we plot the base-learner coefficients for each target (for effect size and direction) together with the corresponding relative risk reduction of the underlying boosting algorithm (for feature importance) in figure 1. The relative risk reduction of a base-learner $k$ is the proportion of the base-learner's risk reduction $\mathrm{RR}_k$ to the total risk reduction:
$$
\mathrm{RRR}_k = \frac{\mathrm{RR}_k}{\sum_{l} \mathrm{RR}_l}
$$
The numbers in the plots are the base learner’s coefficients and the background color displays the relative risk reduction.
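Computing the relative risk reduction amounts to normalizing the per-base-learner risk reductions of one boosting model so that they sum to one. A minimal sketch, with made-up risk reductions for two predicted targets and a hypothetical helper name:

```python
def relative_risk_reduction(risk_reductions):
    """Share of each base-learner in the total risk reduction of one model."""
    total = sum(risk_reductions.values())
    return {name: rr / total for name, rr in risk_reductions.items()}


# Hypothetical risk reductions of the base-learners selected for one target.
rr_salinity = {"conductivity": 1.0, "salinity": 3.0}
print(relative_risk_reduction(rr_salinity))
```

Base-learners that were never selected have a risk reduction of zero and therefore a relative risk reduction of zero as well.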
Figure 1 needs to be read row-wise, e.g. for the target variable salinity, the predicted targets conductivity and salinity have been selected by the boosting algorithm and both have a positive effect on the value of salinity. It is quite clear that the predicted target value of the target itself should normally be the most important feature and should have a coefficient of around 1. However, we can see some anomalies, e.g. for the target turbidity, the predicted target value of oxygen seems to be more important. A possible reason could be that the initial prediction of the target turbidity was not accurate. Nevertheless, we can also see that the resulting boosting models are quite sparse, since only a few base-learners were chosen for most target variables.
Example: Slump Dataset
The Slump dataset deals with the prediction of three properties of concrete (slump, flow and compressive strength). CMOB and STA could make considerable improvements in the prediction of the target compressive strength.
One might argue that the improvements are due to exploiting target dependencies. But if we look more closely at the selected base-learners of CMOB (see figure 2), we can see that the only base-learner chosen for the target compressive strength is the target itself. This was also often the case for the models in the cross-validation iterations. Other targets were rarely chosen and had small coefficients. Interestingly, we could achieve a performance improvement only by linearly transforming the predictions of the target compressive strength.
6 Conclusion and Outlook
In this paper we defined the problem transformation method for multi-output prediction problems of possibly mixed target spaces. We introduced a novel algorithm, CMOB (component-wise multi-output boosting), which learns dependencies among target variables in a sparse and interpretable manner. Through a benchmark experiment with real-world datasets, we showed that, at least for some datasets, the performance of CMOB was comparable to the performance of the stacking method (STA). In contrast to STA, which trains (possibly black-box) machine learning models in a second step, CMOB learns the target dependencies with an inherently interpretable model. With the help of CMOB, we were able to find an example where improvements in predictive performance could be made for one target without using information from other targets. This would otherwise have been attributed to the exploitation of target dependencies.
We limited the choice of datasets to those with a rather small number of targets (at most 7). Future work should investigate the performance of CMOB on datasets with many targets. Since CMOB tries to model target dependencies in a sparse manner, this could be an advantage over STA, which, depending on the choice of the underlying machine learning models, cannot handle noisy variables very well.
This work has been partially supported by the German Federal Ministry of Education and Research (BMBF) under Grant No. 01IS18036A. The authors of this work take full responsibility for its content.
-  Au, Q.: Benchmark Code for: Component-Wise Boosting of Targets for Multi-Output Prediction (4 2019). https://doi.org/10.6084/m9.figshare.7957292.v2
-  Biel, J.I., Gatica-Perez, D.: The youtube lens: Crowdsourced personality impressions and audiovisual analysis of vlogs. Multimedia, IEEE Transactions on 15(1), 41–55 (2013)
-  Borchani, H., Varando, G., Bielza, C., Larrañaga, P.: A survey on multi-output regression. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 5(5), 216–233 (2015)
-  1771, 1–26 (2003)
-  Breiman, L.: Random forests. Machine learning 45(1), 5–32 (2001)
-  Bühlmann, P., Yu, B.: Boosting with the L2 loss: regression and classification. Journal of the American Statistical Association 98(462), 324–339 (2003)
-  Casalicchio, G., Bossek, J., Lang, M., Kirchhoff, D., Kerschke, P., Hofner, B., Seibold, H., Vanschoren, J., Bischl, B.: Openml: An r package to connect to the machine learning platform openml. Computational Statistics 32(3), 1–15 (2017)
-  Chapman, B.P., Duberstein, P.R., Sörensen, S., Lyness, J.M.: Gender differences in five factor model personality traits in an elderly cohort. Personality and individual differences 43(6), 1594–1603 (2007)
-  Donnellan, M.B., Lucas, R.E.: Age differences in the big five across the life span: evidence from two national samples. Psychology and aging 23(3), 558 (2008)
-  Hatzikos, E.V., Tsoumakas, G., Tzanis, G., Bassiliades, N., Vlahavas, I.: An empirical study on sea water quality prediction. Knowledge-Based Systems 21(6), 471–478 (2008)
-  Ishwaran, H., Kogalur, U.: Random Forests for Survival, Regression, and Classification (RF-SRC) (2019), https://cran.r-project.org/package=randomForestSRC, r package version 2.8.0
-  Kosinski, M., Stillwell, D., Graepel, T.: Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences 110(15), 5802–5805 (2013)
-  Li, T.: Detecting Emotion in Text (November 2003) (2012)
-  Loza Mencía, E., Janssen, F.: Learning rules for multi-label classification: a stacking and a separate-and-conquer approach. Machine Learning 105(1), 77–126 (oct 2016)
-  Molnar, C.: Interpretable Machine Learning (2019)
-  Montañes, E., Senge, R., Barranquero, J., Ramón Quevedo, J., José Del Coz, J., Hüllermeier, E.: Dependent binary relevance models for multi-label classification. Pattern Recognition 47(3), 1494–1508 (2014)
-  Probst, P., Au, Q., Casalicchio, G., Stachl, C., Bischl, B.: Multilabel Classification with R Package mlr. The R Journal 9(1), 352–369 (2017)
-  Read, J., Pfahringer, B., Holmes, G., Frank, E.: Classifier chains for multi-label classification. Machine Learning 85, 333–359 (2011)
-  Read, J., Reutemann, P., Pfahringer, B., Holmes, G.: MEKA: A Multi-label/Multi-target extension to WEKA. Journal of Machine Learning Research 17(21), 1–5 (2016). http://jmlr.org/papers/volume17/12-164/12-164.pdf
-  Schapire, R.E., Singer, Y.: BoosTexter: A Boosting-based System for Text Categorization. Machine Learning 39, 135–168 (2000)
-  Schmid, M., Hothorn, T.: Boosting additive models using component-wise p-splines. Computational Statistics & Data Analysis 53(2), 298–311 (2008)
-  Schoedel, R., Au, Q., Völkel, S.T., Lehmann, F., Becker, D., Bühner, M., Bischl, B., Hussmann, H., Stachl, C.: Digital Footprints of Sensation Seeking. Zeitschrift für Psychologie 226(4), 232–245 (2018)
-  Senge, R., José Del Coz, J., Hüllermeier, E.: Rectifying Classifier Chains for Multi-Label Classification. Tech. rep.
-  Shi, C., Kong, X., Yu, P.S., Wang, B.: Multi-Objective Multi-Label Classification. Proceedings of the 2012 SIAM International Conference on Data Mining pp. 355–366 (2012)
-  Spyromitros-Xioufis, E., Tsoumakas, G., Groves, W., Vlahavas, I.: Multi-Target Regression via Input Space Expansion: Treating Targets as Inputs. Machine Learning 104(1), 55–98 (nov 2012)
-  Spyromitros-Xioufis, E., Tsoumakas, G., Groves, W., Vlahavas, I.: Multi-target regression via input space expansion: treating targets as inputs. Machine Learning 104(1), 55–98 (2016)
-  Tsoumakas, G., Katakis, I.: Multi-label classification: An overview. International Journal of Data Warehousing and Mining 3, 1–13 (2007)
-  Tsoumakas, G., Spyromitros-Xioufis, E., Vilcek, J., Vlahavas, I.: Mulan: A java library for multi-label learning. Journal of Machine Learning Research 12, 2411–2414 (2011)
-  Vanschoren, J., van Rijn, J.N., Bischl, B., Torgo, L.: Openml: Networked science in machine learning. SIGKDD Explorations 15(2), 49–60 (2013)
-  Waegeman, W., Dembczynski, K., Huellermeier, E.: Multi-Target Prediction: A Unifying View on Problems and Methods (sep 2018)
-  Xu, D., Shi, Y., Tsang, I.W., Ong, Y.S., Gong, C., Shen, X.: A survey on multi-output learning. CoRR abs/1901.00248 (2019)
-  Yeh, I.C.: Modeling slump flow of concrete using second-order regressions and artificial neural networks. Cement and Concrete Composites 29(6), 474–480 (2007)
-  Zhang, M.L., Zhou, Z.H.: ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition 40(7), 2038–2048 (jul 2007)
-  Zhang, M.L., Zhou, Z.H.: A review on multi-label learning algorithms (2014)