Aggregation as Unsupervised Learning and its Evaluation

Regression uses supervised machine learning to find a model that combines several independent variables to predict a dependent variable based on ground truth (labeled) data, i.e., tuples of independent and dependent variables (labels). Similarly, aggregation also combines several independent variables into a dependent variable. The dependent variable should preserve properties of the independent variables, e.g., the ranking or relative distance of the independent variable tuples, and/or represent a latent ground truth that is a function of these independent variables. However, ground truth data is not available for finding the aggregation model. Consequently, aggregation models are data agnostic or can only be derived with unsupervised machine learning approaches. We introduce a novel unsupervised aggregation approach based on intrinsic properties of unlabeled training data, such as the cumulative probability distributions of the single independent variables and their mutual dependencies. We present an empirical evaluation framework that allows assessing the proposed approach against other aggregation approaches from two perspectives: (i) how well the aggregation output represents properties of the input tuples, and (ii) how well the aggregated output can predict a latent ground truth. To this end, we use data sets for assessing supervised regression approaches that contain explicit ground truth labels. The ground truth is not used for deriving the aggregation models, but it allows for the assessment from perspective (ii). More specifically, we use regression data sets from the UCI machine learning repository and benchmark several data-agnostic and unsupervised approaches for aggregation against ours. The benchmark results indicate that our approach outperforms the other data-agnostic and unsupervised aggregation approaches. It is almost on par with linear regression.


1 Introduction

Aggregation combines several input variables of an object into a single output score. This output should represent meaningful information about the object. To tackle this challenge, many approaches have been proposed and widely used both in research and in societal and technical decision making, ranking, and assessment applications.

Different aggregation approaches are evaluated on how well the output score represents the input variables, according to some measures. However, even the input variables are often just an approximation of the latent object properties. On the other hand, aggregation has a goal, e.g., decision making, ranking, and assessment, which anyway abstracts from these latent object properties. This makes it hard to objectively evaluate and compare different aggregation approaches. The objective of the present paper is to suggest a way to objectively evaluate aggregation approaches.

Similar to aggregation, regression models also combine several input variables (predictors) into a single output (response). There are many different (supervised) machine learning approaches, ranging from simple linear regression to deep learning, that derive regression models from data, i.e., objects with known predictor and response variable values. However, in many practical situations, the response variable cannot be easily observed. The lack of this so-called ground truth data makes it impossible to use supervised regression approaches. Aggregation is then an alternative to regression, but the problem remains: how to objectively choose an appropriate aggregation approach.

To address the challenges above, we propose to evaluate and compare aggregation approaches in cases where labeled training data, i.e., ground truth, is available. In these cases, we can compare the aggregation outcome with the ground truth and with an outcome proposed by regression models trained on this ground truth.

More specifically, we developed an empirical quantitative comparison of aggregation approaches by means of both external evaluation, i.e., comparison against ground truth and against what is achievable with supervised learning, and internal evaluation, i.e., comparison against information in the input variables. We compare basic aggregation functions and data-driven aggregation approaches created by unsupervised learning.

To this end, we created a benchmark from a collection of 169 regression problems including the whole collection of regression data sets of the UCI machine learning repository (except for some data sets excluded for technical reasons). We trained the data-driven aggregation approaches on the predictors of the regression data sets (unsupervised learning). For comparison, we also trained regression models on these data sets (supervised learning). This allows for a statistical evaluation of different aggregation approaches and even their comparison against regression models. We expect that the basic aggregation models are inferior to the data-driven aggregation models (created with unsupervised learning) that, in turn, are inferior to the regression models (created with supervised learning). We will confirm this hypothesis and quantitatively assess the differences between the approaches.

In summary, this work contributes:

  1. An approach for evaluation of aggregation approaches in a context of machine learning regression tasks.

  2. An implementation of this approach in a general and flexible framework for evaluation. This framework contains the benchmark data sets, the performance measures, and the implementations of the compared aggregation approaches (extensible with other approaches).

  3. An extensive empirical comparison of several aggregation approaches.

The remainder of the paper is structured as follows. We define aggregation and motivate the choice of aggregation operators we compare in Section 2. We summarize related concepts in Section 3. We introduce an evaluation framework in Section 4. In Section 5, we report and discuss the results of the experimental comparison of several aggregation operators. Finally, Section 6 concludes the research and points out directions of future work.

2 Aggregation

In this section, we define and classify aggregation and introduce the approaches we evaluate and compare in our study.

2.1 Definition

Aggregation maps several input variables to a single output variable. We assume that the number of input variables is fixed, say $n$. W.l.o.g., we also assume that all input variables and the output variable take values in the unit interval $[0,1]$. Formally, a function that maps the ($n$-dimensional) unit cube onto the unit interval

$$f : [0,1]^n \to [0,1]$$

is called an aggregation function if it satisfies the following properties:

Monotonicity (non-decreasing in each argument):

$$x_i \le y_i \ \text{for all } i \in \{1, \dots, n\} \;\Rightarrow\; f(x_1, \dots, x_n) \le f(y_1, \dots, y_n)$$

Boundary conditions:

$$f(0, \dots, 0) = 0, \qquad f(1, \dots, 1) = 1$$

A special case of aggregation is the aggregation of a singleton, i.e., the unary operator $f : [0,1] \to [0,1]$, which is usually used to get a score or index for a single variable. In this paper, we do not consider aggregation of a singleton, i.e., from now on we assume $n > 1$.

2.2 Basic Aggregation

Basic aggregation functions can be classified into three main classes with specific behavior and semantics beliakov2007aggregation: conjunctive, disjunctive, and averaging. These classes are described below.

Conjunctive aggregation combines values like a logical AND operator, i.e., the result of aggregation can be large only if all values are large. The basic rule is that the output of aggregation is bounded from above by the lowest input. If one of the inputs equals $1$, then the output of aggregation is equal to the degree of our satisfaction with the other input variables. If any input is $0$, then the output must be $0$ as well. For example, if to obtain a driving license one has to pass both theory and driving tests, no matter how well one does in the theory test, it does not compensate for failing the driving test. From the set of basic conjunctive aggregations, we compare minimum MIN and product PROD in our evaluation.

Disjunctive aggregation combines values like a logical OR operator, i.e., the result of aggregation can be large if at least one of the values is large. The basic rule is that the output of aggregation is bounded from below by the highest input. In contrast to conjunctive aggregation, satisfaction of any one of the input variables is enough by itself. For example, when you come home, both an open door and the alarm are indicators of a burglary, and either one is sufficient to raise suspicion. If both happen at the same time, they can reinforce each other, and our suspicion might become even stronger than the suspicion caused by any of these indicators by itself. From the set of basic disjunctive aggregations, we compare maximum MAX and sum SUM in our evaluation.


Averaging aggregation (also known as compensative or compromising aggregation) lets a high (low) value of one input variable be compensated by a low (high) value of another one, so that the result is something in between. The basic rule is that the output of aggregation is bounded from above by the highest input and from below by the lowest input. Note that the aggregation function MIN (MAX) is at the same time conjunctive (disjunctive) and an extreme case of an averaging aggregation. In this paper, we do not consider basic averaging aggregation functions, e.g., arithmetic and geometric mean (median), since their output is (almost) proportional to the output of SUM and PROD, resp.
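The bounds that characterize the conjunctive and disjunctive classes can be checked on a small example. The following Python sketch is only an illustration (the function names `agg_min`, `agg_prod`, `agg_max`, and `agg_sum` are hypothetical and not taken from the replication package); it assumes inputs already scaled to the unit interval.

```python
from math import prod


def agg_min(x):
    """Conjunctive: the output equals the lowest input."""
    return min(x)


def agg_prod(x):
    """Conjunctive: bounded from above by the lowest input; 0 annihilates."""
    return prod(x)


def agg_max(x):
    """Disjunctive: the output equals the highest input."""
    return max(x)


def agg_sum(x):
    """Disjunctive: bounded from below by the highest input (not capped at 1)."""
    return sum(x)


# One object described by three input variables in [0, 1].
x = [0.5, 0.25, 1.0]
scores = {
    "MIN": agg_min(x),
    "PROD": agg_prod(x),
    "MAX": agg_max(x),
    "SUM": agg_sum(x),
}

# Conjunctive outputs never exceed the lowest input,
# disjunctive outputs never fall below the highest input.
assert agg_prod(x) <= agg_min(x) <= agg_max(x) <= agg_sum(x)
```

For this tuple, the conjunctive outputs (0.125 and 0.25) stay at or below the lowest input, while the disjunctive outputs (1.0 and 1.75) stay at or above the highest one. Note that the plain sum leaves the unit interval unless it is rescaled.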

2.3 Data-Driven Aggregation

Data-driven approaches need tuples of variable values to define the aggregation function. Unsupervised approaches only require tuples of input variable values, while supervised approaches require tuples of input and output variable values. Once the aggregation function is learned, it can be applied to all possible input variable tuples.

Regression: Widely used in research and applications are instances of weighted arithmetic mean. Weights usually indicate the importance of the input variables and can be set by experts or calculated from raw data velasquez2013analysis. In case both input and response variables are known, weights can be adjusted to fit the raw data by solving an optimization problem that minimizes an error. One basic way to solve this problem is to use linear regression beliakov2007aggregation. We refer to this supervised machine learning technique as REG and compare it with the other unsupervised techniques of basic and data-driven aggregation.

Weighted quality scoring (WQS) is a fully automated unsupervised approach based on the weighted product model for aggregation ulan2021weighted. Based on input tuples, it normalizes the input variables such that they are not negatively correlated, i.e., they have the same direction. It then calculates weights that account for the variation of values of a single input variable and for the interdependence between all input variables.

WQS was originally designed for software quality assessment. It was evaluated in the context of defect prediction (i.e., the proxy of the ground truth is the number of bugs). The authors motivated the choice of a weighted product model for aggregation as it provides a clear interpretation of the aggregation output in the context of software quality assessment: in order to easily spot software artifacts with extremely bad values in a single metric, the aggregated quality should be poor even if only a single metric indicates that, i.e., $0$ is an annihilator or so-called "veto" element. In this paper, in addition to the original WQS approach, we also compare its weighted sum variant: normalization and weighting are the same, and the only difference is in the final aggregation step, which is a sum (WSM) or a product (WPM).
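The difference between the two final aggregation steps, and the "veto" behavior of the product, can be illustrated with a minimal Python sketch. It assumes that the normalized scores and the weights are already given; the helper names `weighted_sum` and `weighted_product` and all numbers are hypothetical, and the sketch does not reproduce how WQS learns normalization and weights.

```python
from math import prod


def weighted_sum(scores, weights):
    """Final step of a weighted sum model: sum of w_i * s_i."""
    return sum(w * s for s, w in zip(scores, weights))


def weighted_product(scores, weights):
    """Final step of a weighted product model: product of s_i ** w_i."""
    return prod(s ** w for s, w in zip(scores, weights))


# Hypothetical weights that sum to one.
weights = [0.5, 0.25, 0.25]

# An artifact with one extremely bad (zero) normalized score ...
vetoed = [0.25, 1.0, 0.0]
# ... and the same artifact without the bad score.
regular = [0.25, 1.0, 1.0]

wsm_vetoed = weighted_sum(vetoed, weights)        # the zero is compensated
wpm_vetoed = weighted_product(vetoed, weights)    # the zero acts as a veto
wsm_regular = weighted_sum(regular, weights)
wpm_regular = weighted_product(regular, weights)
```

In this example, the weighted sum of the vetoed tuple is still 0.375, whereas the weighted product drops to 0, which makes the bad artifact easy to spot.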

3 Related Concepts

In this section, we relate aggregation to the concepts of data fusion, decision making, and machine learning.

3.1 Data fusion

The integration of information from several sources is known as data fusion. Different fusion techniques have been used in areas such as statistics, machine learning, sensor networks, robotics, and computer vision, to name a few cocchi2019data. The goal of fusion is to combine data from multiple sources to produce information better than would have been possible by using a single source. Improved information could mean less expensive, more accurate, or more relevant information. The goal of data aggregation is to combine data from multiple variables by reducing and abstracting it into a single one. In this sense, data aggregation is a subset of data fusion. In this paper, we restrict ourselves to data aggregation of a finite number of numerical input variables into a single output variable that represents meaningful properties of the input data.

3.2 Multi-criteria decision making

Multi-Criteria Decision Making (MCDM) evaluates alternatives according to several criteria that are numerical variables. In order to choose the best alternative, one needs to aggregate the values of the criteria in some way. One popular approach is called the Multi-Attribute Utility Theory (MAUT) von1975multi. MAUT assigns a numerical score to each alternative, called its utility. The total utility is a function of individual utilities for the criteria. The rational decision-making axiom implies that one cannot prefer an alternative over another alternative if it performs better with respect to some individual utilities, but worse with respect to the other ones. Mathematically, this means that the total utility is a monotone non-decreasing function with respect to all arguments. If we scale the utilities to $[0,1]$ and add the boundary conditions, we obtain that the total utility is an aggregation function beliakov2007aggregation.

3.3 Machine learning

Machine learning (ML) defines models that map input to output variables using example values of the variables. Based on the learning approach, we distinguish unsupervised ML, where models are learned solely based on tuples of input variables, from supervised ML, where models are learned based on tuples of input and output variables, and from feedback ML, where models are learned incrementally based on feedback on suggested output values. The aggregation techniques WSM and WPM are examples of unsupervised ML; the aggregation technique REG is an example of supervised ML. The basic aggregation techniques are not ML examples. Feedback ML is not relevant in the context of the present paper.

The UCI machine learning dataset repository Bache+Lichman:2013 has been widely used by the machine learning community for the empirical analysis of different ML approaches. In this paper, we use the subset of UCI for the regression task for our evaluation. We consider each data set as an aggregation problem, where both input variables and output variable (ground truth) are known. We use these data sets to develop a realistic and significant evaluation of the aggregation approaches, disregarding the output data when training the data-driven aggregation approaches WSM and WPM. Using such datasets for evaluation purposes is not new. For example, a similar subset was used for an extensive experimental survey of regression methods fernandez2019extensive. The goal of that study was to evaluate the predictive performance of supervised models. In this paper, we include a slightly larger collection of datasets (51 regression datasets were added to UCI after the date of that publication). The main difference is, however, that we evaluate basic and data-driven aggregation functions, the latter using unsupervised ML, on datasets created for assessing regression models, which use supervised ML. We add simple linear regression REG, a supervised ML approach, as a reference point. We compare aggregation functions on how well their output preserves properties of the input data (internal evaluation), and how well their output agrees with the ground truth (external evaluation). To the best of our knowledge, such an empirical comparison of aggregation functions has not been performed before.

Orthogonally to a classification by the availability of training data (in supervised, unsupervised, and feedback learning), ML approaches can also be classified based on their purpose. If the models are used for prediction, they are considered as black-boxes and their prediction accuracy is the foremost selection criterion of the ML approach. If they are used for inference, i.e., human knowledge gain, then their understandability is more important than accuracy for selecting an appropriate ML approach. It is well known that understandability and accuracy of the ML approaches are negatively correlated. In the present paper, we focus on prediction and, hence, evaluate the accuracy of the aggregation approaches, not their interpretability.

Dimensionality reduction is the embedding of elements of a high-dimensional vector space in a lower-dimensional target space. It is an unsupervised ML approach. Implementations of dimensionality reduction include principal component analysis (PCA) doi:10.1080/14786440109462720, stochastic neighbor embedding (SNE) NIPS2002_6150ccc6, and its $t$-distributed variant $t$-SNE JMLR:v9:vandermaaten08a. One could think that the special case of reducing a multi-dimensional vector space (of input variables) to just one dimension (of a single output variable) is a problem equivalent to aggregation. However, dimensionality reduction only aims at preserving the neighborhood of vectors of the original space in the reduced space. In contrast, aggregation assumes a latent ground truth inducing a total order in the data that is related to the orders induced by each input variable that is to be aggregated. Consequently, the accuracy of dimensionality reduction can be evaluated based on the observed data, i.e., the elements of a high-dimensional vector space, while the accuracy evaluation of aggregation additionally needs an explicit ground truth. Also, dimensionality reduction is an ML technique for inference, while aggregation is a technique for predicting a (latent) ground truth.

4 Evaluation framework

This section describes the setup used in our evaluation framework. We first present the data sets. We then provide the details for approaches we used for comparison followed by performance measures used to assess their performance.

4.1 Data

We analyzed the snapshot of UCI from February 2021 (repository visited on February 22, 2021). We adapt the collection of datasets for the regression task, which spans a wide variety of domains, to build a benchmark for comparison. The datasets included in the benchmark were chosen by the following criteria: (i) data is available and its format is numerical, excluding, e.g., text and image files, (ii) the response variable is clearly defined and there are more than 10 different output values, since otherwise the task is potentially confused with classification, and (iii) the number of instances is greater than 50.

From the 134 available datasets, we excluded 4 since they are duplicates, i.e., identical to another regression task in UCI. Then we removed 6 because of a missing/broken link to the data and 4 since their input variables are not numerical (i). We removed 26 since the response variable is not clearly defined and 30 since there are fewer than 10 different output values (ii). Finally, we removed 2 datasets with too few data points (iii). See Table 3 in Appendix A for the detailed list of datasets with reasons for exclusion.

We selected the remaining 62 original UCI datasets that come from different domain areas: 10 from Life Sciences, 14 from Physical Sciences, 24 from CS/Engineering, 4 from Social Sciences, 7 from Business, and 3 from other fields.

Some of them contain several sub-datasets and regression tasks and we used all of them. Therefore, the final benchmark consists of 169 datasets. See Table 4 and Table 5 in Appendix for a detailed list of the datasets selected for this study.

The number of inputs for the aggregation problem in these datasets ranges from 2 to 373, the number of instances from 60 to 4 208 261, and the number of different output values from 14 to 72 746. Table 1 summarizes some descriptive statistics for these datasets.

|        | #instances | #input variables | #output values |
|--------|------------|------------------|----------------|
| min    | 60         | 2                | 14             |
| median | 9 784      | 9                | 515            |
| max    | 4 208 261  | 373              | 72 746         |

Table 1: Descriptive statistics for the datasets used in experiments.

4.2 Aggregation approaches

Below we formally define the aggregation approaches selected for comparison (we motivated our choice in Section 2).

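PROD, MIN, MAX, and SUM apply the respective basic aggregation functions to an input tuple $x = (x_1, \dots, x_n)$:

$$\mathrm{PROD}(x) = \prod_{i=1}^{n} x_i, \qquad \mathrm{MIN}(x) = \min_{1 \le i \le n} x_i, \qquad \mathrm{MAX}(x) = \max_{1 \le i \le n} x_i, \qquad \mathrm{SUM}(x) = \sum_{i=1}^{n} x_i$$

REG calculates a weighted arithmetic mean of the input values, WSM and WPM calculate a weighted sum and product, resp., of the normalized input values:

$$\mathrm{REG}(x) = \sum_{i=1}^{n} w_i\, x_i, \qquad \mathrm{WSM}(x) = \sum_{i=1}^{n} w_i\, s_i(x_i), \qquad \mathrm{WPM}(x) = \prod_{i=1}^{n} s_i(x_i)^{w_i}$$

For the ML approaches REG, WSM, and WPM, the weights $w_i$ and the normalization functions $s_i$ are learned from data. Let each input variable $x_i$ and the response variable $y$ have $m$ instances; we denote their $j$-th values by $x_i^{(j)}$ and $y^{(j)}$, resp.

For REG, weights are learned using the least squares approach to minimize the sum of error squares in the training data:

$$(w_1, \dots, w_n) = \arg\min_{w_1, \dots, w_n} \sum_{j=1}^{m} \Bigl( y^{(j)} - \sum_{i=1}^{n} w_i\, x_i^{(j)} \Bigr)^2 \tag{11}$$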
For WSM and WPM, scores and weights are learned as follows. Let be the indicator function with if and , otherwise, and let be Pearson’s coefficient of correlation.


For the entropy based weights , we define , the entropy of the normalized variable . Let .


where is the empirical frequency of .

For the dependency based weights , we define , the dependency of aggregation on each variable . Therefore, let and be two rankings of the data points ascending in and , resp., where:


Then is the absolute value of Spearman’s rank order correlation of these two rankings:


Equations 12–18 give a self-contained definition of how the learning of the normalization functions and the weights works for WSM and WPM. However, they are better motivated and explained in depth in ulan2021weighted.

Note that the supervised approach REG requires the input variables and the output variable for learning the weights, cf. Equation 11, whereas the unsupervised approaches WSM and WPM only learn from the input variables, cf. Equations 12–18.
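To make the supervised reference point concrete, the following Python sketch fits least squares weights by solving the normal equations with Gauss-Jordan elimination. It is a minimal illustration under stated assumptions: the model has no intercept term, the function name `least_squares_weights` and the toy data are hypothetical, and the experiments in this paper were implemented in R, not with this code.

```python
def least_squares_weights(X, y):
    """Return weights w minimizing sum_j (y_j - sum_i w_i * X[j][i]) ** 2.

    X is a list of input tuples (one per instance), y the list of responses.
    Assumption: no intercept, and X^T X is non-singular.
    """
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y] of the normal equations.
    A = [
        [sum(row[i] * row[k] for row in X) for k in range(n)]
        + [sum(row[i] * t for row, t in zip(X, y))]
        for i in range(n)
    ]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; the weights are not unique")
        A[col], A[pivot] = A[pivot], A[col]
        # Eliminate this column from all other rows (Gauss-Jordan).
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


# Toy data generated from known weights (0.3, 0.7) without noise.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.25]]
y = [0.3 * a + 0.7 * b for a, b in X]
w = least_squares_weights(X, y)
```

On the noise-free toy data, the fitted weights recover the generating weights up to rounding. The response `y` is needed here, which is exactly what distinguishes REG from the unsupervised approaches.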

4.3 Evaluation measures

The choice of evaluation measures depends on the purpose of aggregation and on the availability of a ground truth. Our evaluation aims at highlighting the best aggregation for prediction assuming a (latent) ground truth. Recall that the selected datasets from the UCI repository contain both input variables and response variables. In this study, we consider the response variable as an explicitly available proxy of the ground truth.

We use the following four measures in order to compare aggregation approaches from different points of view. We evaluate aggregation approaches taking into account both external (ground truth) and internal (raw data) information. For external evaluation, we study how well aggregation output agrees with a ground truth. Therefore, we measure predictive power and similarity. For internal evaluation, we study how well aggregation output represents the properties of the input variables. Therefore, we measure consensus and sensitivity.

Predictive power. We measure the correlation between aggregation output and ground truth to assess the ordering, relative spacing, and possible functional dependency. We use Spearman’s rho spearman1904general as a correlation coefficient to measure the pairwise degree of association between the values, i.e., how well a monotonic function describes the relationship. High values indicate a strong predictive power.
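As an illustration of this measure, Spearman's rho can be computed as Pearson's correlation of the ranks. The Python sketch below is one possible implementation, assuming ties receive average ranks and that neither variable is constant; the function names are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson's correlation coefficient (undefined for constant input)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5


def spearman_rho(output, truth):
    """Spearman's rho between aggregation output and ground truth."""
    return pearson(average_ranks(output), average_ranks(truth))


# A monotonic but non-linear relation still yields rho = 1.
rho_monotone = spearman_rho([0.1, 0.4, 0.9], [1, 4, 100])
# Two swapped neighbors lower the value.
rho_partial = spearman_rho([0.1, 0.4, 0.9, 0.6], [1, 3, 2, 4])
```

The first example shows why a rank correlation suits this purpose: the aggregation output only needs to be a monotonic function of the ground truth, not a linear one.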

Similarity. Moreover, we rank the data points according to the aggregation output and to the ground truth in ascending order. We measure the distance between the two rankings based on Kendall's tau distance to assess the number of pairwise disagreements between two rankings kendall1948rank. It corresponds to the number of transpositions that bubble sort requires to turn one ranking into the other. Low values indicate a high similarity.
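The pairwise-disagreement count can be sketched in a few lines of Python. The version below is a simple quadratic-time illustration with two assumptions that are not stated in the text: tied pairs are counted as agreements, and the count is optionally divided by the number of pairs so that values fall into $[0,1]$. The function name is hypothetical.

```python
from itertools import combinations


def kendall_tau_distance(a, b, normalize=True):
    """Count pairs of data points that a and b order differently.

    a and b hold one value per data point (e.g., aggregation output and
    ground truth). Assumption: pairs tied in a or b count as agreements.
    """
    n = len(a)
    discordant = sum(
        1
        for i, j in combinations(range(n), 2)
        if (a[i] - a[j]) * (b[i] - b[j]) < 0
    )
    if not normalize:
        return discordant
    return discordant / (n * (n - 1) / 2)


truth = [1, 2, 3, 4]
same = kendall_tau_distance([0.1, 0.2, 0.3, 0.4], truth)       # identical order
reverse = kendall_tau_distance([0.4, 0.3, 0.2, 0.1], truth)    # reversed order
one_swap = kendall_tau_distance([0.2, 0.1, 0.3, 0.4], truth, normalize=False)
```

Identical orderings give a distance of 0, fully reversed orderings give the maximum (1 after normalization), and a single swap of neighbors corresponds to exactly one bubble sort transposition.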

Consensus. We rank the data points according to the aggregation output and according to each input variable in ascending order. We use the Kemeny distance dwork2001rank, i.e., the sum of the Kendall tau distances between the aggregation output ranking and the input rankings, to assess the consensus between the aggregation output and the input variables. Low values indicate a strong consensus.
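A minimal Python sketch of this measure sums the Kendall tau distances between the output and every input variable. The normalization of each distance by the number of pairs and the optional averaging over the input variables are assumptions made here to keep values in $[0,1]$; the function names are hypothetical.

```python
from itertools import combinations


def kendall_tau_distance(a, b):
    """Share of data point pairs ordered differently by a and b.

    Assumption: pairs tied in a or b count as agreements.
    """
    n = len(a)
    discordant = sum(
        1
        for i, j in combinations(range(n), 2)
        if (a[i] - a[j]) * (b[i] - b[j]) < 0
    )
    return discordant / (n * (n - 1) / 2)


def kemeny_distance(input_columns, output, average=True):
    """Sum (or mean) of Kendall tau distances between output and each input.

    input_columns holds one list of values per input variable.
    """
    distances = [kendall_tau_distance(col, output) for col in input_columns]
    total = sum(distances)
    return total / len(distances) if average else total


# Two input variables that rank three data points in opposite directions.
inputs = [[0.1, 0.5, 0.9], [0.8, 0.4, 0.2]]
# An output that follows the first input and contradicts the second.
output = [0.2, 0.5, 0.7]

consensus_sum = kemeny_distance(inputs, output, average=False)
consensus_mean = kemeny_distance(inputs, output)
```

Because the two inputs disagree completely in this example, no output ranking can reach a distance of 0 to both; the best achievable consensus is a compromise between them.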

Sensitivity. Finally, we measure how well the aggregation preserves a variety of the input data. We use the sensitivity ratio ulan2021cop, i.e., the ratio between unique aggregation outputs and the unique tuples in the raw data of input variables. High values indicate a strong sensitivity.
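The ratio can be illustrated with a short Python sketch that counts unique outputs and unique input tuples. It compares values by exact equality, which is an assumption; with floating point outputs, rounding before counting may be preferable. The function name and the data are hypothetical.

```python
def sensitivity_ratio(input_rows, outputs):
    """Ratio of unique aggregation outputs to unique input tuples."""
    unique_inputs = len({tuple(row) for row in input_rows})
    unique_outputs = len(set(outputs))
    return unique_outputs / unique_inputs


# Four data points, three of them distinct.
rows = [(0.1, 0.9), (0.1, 0.5), (0.3, 0.9), (0.1, 0.9)]

# MIN maps two distinct tuples to the same output and loses variety.
ratio_min = sensitivity_ratio(rows, [min(r) for r in rows])
# An aggregation that separates all distinct tuples reaches a ratio of 1.
ratio_sum = sensitivity_ratio(rows, [a + b for a, b in rows])
```

Here MIN yields only two different outputs for three distinct tuples, so its sensitivity ratio is 2/3, whereas the sum keeps all three tuples apart.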

5 Experiments and discussion

5.1 Experimental design

We assess the basic aggregation functions PROD and MIN (conjunctive), MAX and SUM (disjunctive), and the data-driven aggregation functions WSM and WPM (unsupervised). The averaging aggregation REG (supervised) is a baseline reference for the basic and the unsupervised models. Since the unsupervised aggregation approaches do not use the ground truth information to build the prediction, they are not expected to perform better than the supervised model REG trained on this information in the external evaluation.

It is a well-known fact in machine learning that variables with large value ranges dominate the outcome, while those with smaller value ranges get marginalized. Also in aggregation, an input variable with a larger value range could dominate the aggregation output. To have a fair comparison of the aggregation approaches, we have normalized the values of the input variables to $[0,1]$ using min-max scaling.
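Min-max scaling maps each value $v$ of a variable to $(v - \min) / (\max - \min)$. A minimal Python sketch follows; mapping a constant column to zeros is an assumption made here to avoid a division by zero, and the function name is hypothetical.

```python
def min_max_scale(column):
    """Scale the values of one input variable linearly to [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant variable carries no ordering information.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


scaled = min_max_scale([10, 20, 30])
```

After scaling, the smallest value of every variable is 0 and the largest is 1, so no variable dominates the aggregation merely because of its original value range.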

All aggregation approaches compared in this study use the same data sets (see Table 4 and Table 5), the same input variables, and the same ground truth. We did not consider categorical variables as inputs, nor their artificial representations such as dummy variables. We removed the instances with missing values for either the response variable or the input variables. We also removed the identification variables, e.g., time and id, since they should not contribute to the aggregation. We implemented all algorithms and statistical analyses in R (The R Project for Statistical Computing). We provide all R scripts that are used to conduct the experiments in a downloadable replication package.

5.2 Summary of results

We ran the aggregation approaches over 169 datasets. We report the 25th percentile, the median, and the 75th percentile values of the distributions of Spearman's rho, Kendall's tau distance, Kemeny distance, and Sensitivity ratio for each aggregation over all datasets. The boxplots in Figure 1, Figure 3, Figure 5, and Figure 7 visualize the different aggregation approaches and the distribution of their performance in the evaluation measures.

5.2.1 External evaluation.

Figure 1 shows the performance comparisons in terms of Predictive power. Recall that high values indicate good performance. We observe that all basic aggregation approaches perform worse than REG aggregation and the weighted scoring approaches WPM and WSM. Moreover, REG aggregation is (slightly) better than the WSM and WPM approaches. This is not a surprise, since the REG aggregation model was defined with a supervised ML approach.

Figure 2 shows the pairwise correlation between values of Predictive power calculated for each approach on each dataset. We observe a very strong positive correlation between corresponding results for the data-driven approaches REG, WSM, and WPM. We also observe the same for the basic PROD and MIN approaches. Moreover, results for SUM have a high correlation with results for all other approaches.

Figure 3 shows the performance comparisons in terms of Similarity. Recall that low values indicate good performance. We observe that all basic aggregation approaches perform worse than the data-driven approaches REG, WPM, and WSM. Moreover, WPM aggregation is (slightly) better than both WSM and REG.

Figure 4 shows the pairwise correlation between values of Similarity calculated for each approach on each dataset. We observe a very strong positive correlation between similarity results for PROD and MIN, and the same for WPM and WSM. The results for REG have high, moderate, or even low correlation with results for other approaches.

Figure 1: Performance comparison in terms of Predictive power.
Figure 2: Correlation matrix for Predictive power.
Figure 3: Performance comparison in terms of Similarity.
Figure 4: Correlation matrix for Similarity.

5.2.2 Internal evaluation.

Figure 5 shows the performance comparisons in terms of Consensus. Recall that low values indicate good performance. We observe that all basic aggregation approaches and REG perform worse than the weighted scoring approaches WPM and WSM. WPM, in turn, is slightly better than WSM.

Figure 6 shows the pairwise correlation between the Consensus results calculated for each approach on each dataset. We observe a very strong positive correlation between the results for PROD and MIN, and a strong correlation between corresponding results for WPM and WSM. The results for REG have a high, moderate, or even low correlation with results for other approaches.

Figure 7 shows the performance comparisons in terms of Sensitivity. Recall that high values indicate good performance. We observe that REG, WPM, WSM, and SUM perform equally well, while the other basic aggregation approaches perform significantly worse.

Figure 8 shows the pairwise correlation between values of Sensitivity calculated for each approach on each dataset. We observe a very strong positive correlation between the results for PROD and MIN, and a strong correlation between the results for WPM and WSM. We observe a high or moderate correlation between the results for basic aggregation approaches, and the same for the results between REG and the SUM, WPM, and WSM approaches. We also observe an extremely low correlation between the results for data-driven and basic aggregation approaches (except for the correlation between the results for REG and SUM). Note that this does not mean that the approaches have completely different sensitivity results. Instead, it is an effect of the sensitivity results for data-driven approaches having too few different values: for almost all datasets, the sensitivity is one, and the correlation results are dominated by the few outliers in different datasets.

Figure 5: Performance comparison in terms of Consensus.
Figure 6: Correlation matrix for Consensus.
Figure 7: Performance comparison in terms of Sensitivity.
Figure 8: Correlation matrix for Sensitivity.

5.2.3 Summary and Discussion

Table 2 summarizes the median values of the performance measures over all datasets. The Beijing Multi-Site Air Quality Data datasets represent almost half (72 out of 169 datasets). Since this fact might bias the results, we also calculated the median values for performance measures discarding these datasets. The results are presented in parentheses.

Aggregation   Predictive power     Similarity                Consensus            Sensitivity
              (Spearman’s rho)     (Kendall tau distance)    (Kemeny distance)    (Sensitivity Ratio)
PROD          0.09 (0.15)          0.41 (0.41)               0.47 (0.4)           0.04 (0.51)
MIN           0.09 (0.15)          0.41 (0.42)               0.47 (0.42)          0.01 (0.11)
MAX           0.18 (0.19)          0.5 (0.45)                0.45 (0.41)          0.1 (0.29)
SUM           0.28 (0.31)          0.49 (0.39)               0.38 (0.32)          1 (1)
WPM           0.59 (0.77)          0.21 (0.13)               0.26 (0.22)          1 (1)
WSM           0.52 (0.68)          0.29 (0.24)               0.31 (0.32)          1 (1)
REG           0.64 (0.79)          0.25 (0.18)               0.45 (0.39)          1 (1)
Table 2: Median values of the performance measures.
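Each cell of Table 2 pairs a median over all datasets with a median over the datasets that remain after one group is excluded. A minimal sketch of that computation follows; the dataset names, the `beijing_` prefix, and the scores are hypothetical and are not taken from the benchmark.

```python
from statistics import median

def summarize(scores, excluded_prefix):
    """Return (median over all datasets, median without the excluded group).

    `scores` is a list of (dataset_name, value) pairs for one approach
    and one performance measure.
    """
    all_values = [value for _, value in scores]
    kept_values = [value for name, value in scores
                   if not name.startswith(excluded_prefix)]
    return median(all_values), median(kept_values)

# Hypothetical per-dataset results for one approach and one measure.
scores = [
    ("beijing_site_1", 0.50),
    ("beijing_site_2", 0.55),
    ("beijing_site_3", 0.60),
    ("wine", 0.80),
    ("housing", 0.70),
]

overall, without_group = summarize(scores, "beijing_")
print(f"{overall:.2f} ({without_group:.2f})")
```

Reporting both numbers side by side makes it visible when a single large group of related datasets pulls the overall median in one direction.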

We observe that the REG and WPM approaches perform (slightly) better than WSM in the external evaluation. Moreover, REG performs slightly better than WPM in terms of Predictive power, whereas WPM performs slightly better than REG in terms of Similarity. In the internal evaluation, WPM and WSM perform better than or as well as the other approaches. Moreover, WPM performs better than WSM in terms of Consensus. Among the basic aggregations, SUM performs as well as the data-driven approaches REG, WPM, and WSM in terms of Sensitivity.

Studying the pairwise correlations between the performances in the different evaluation measures for each dataset leads to the following observations. The data-driven approaches REG, WPM, and WSM are closely associated with each other in the external evaluation measures, i.e., their respective results for Predictive power and Similarity are highly correlated. This means that in-depth studies are unlikely to find properties of the datasets that favor any one of these methods.

Among the basic aggregations, SUM is more closely associated with REG, WPM, and WSM than the others in the external evaluation (i.e., moderate correlation).

The unsupervised approaches WPM and WSM are closely associated with each other in the internal evaluation measures, i.e., their respective results for Consensus and Sensitivity are highly correlated. REG differs from the other approaches; however, it is associated with SUM in terms of Consensus. We conclude that, among the basic aggregation functions, SUM is quite competitive with REG in the internal evaluation.

We also conclude that in terms of their overall performance (including both internal and external evaluation) the unsupervised, weighted scoring approaches WSM and WPM are on par with the supervised REG aggregation.

The experiments were performed on more than a hundred datasets with different sizes from different domains. However, this specific sample might be a threat to external validity. Further replications of this study on other datasets are necessary to confirm the generalization of the above conclusions.

6 Conclusion and future work

We evaluated aggregation approaches on a benchmark of 169 regression datasets of the UCI machine learning repository from different sciences and application domains. We empirically compared six unsupervised aggregation approaches, more specifically, four basic aggregation functions PROD, MIN, MAX, and SUM, and two data-driven aggregation approaches WSM and WPM. As a point of reference, we also assessed a supervised approach REG, i.e., a weighted sum aggregation with weights defined using linear regression.

The aggregation approaches were evaluated externally, i.e., we compared the aggregation outputs against the response variable as a proxy for a ground truth that is usually latent in aggregation. They were also evaluated internally, i.e., we compared the aggregation outputs against the values of the input variables.
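The external evaluation can be sketched for the four basic aggregation functions as follows. This is a minimal illustration, not the evaluation framework itself: the input tuples and response values are hypothetical, the helper names `ranks` and `spearman` are introduced here, and any preprocessing of the input variables is assumed to have been done already. Spearman's rho is computed as the Pearson correlation of average ranks.

```python
from math import prod, sqrt

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Basic aggregation functions applied to one tuple of input values.
AGGREGATORS = {"SUM": sum, "MIN": min, "MAX": max, "PROD": prod}

# Hypothetical data: four input tuples and the held-out response variable.
rows = [(0.1, 0.9), (0.4, 0.5), (0.6, 0.7), (0.9, 0.2)]
response = [1.0, 1.1, 1.6, 1.4]

# The response is used only for scoring, never for fitting the aggregation.
predictive_power = {
    name: spearman([aggregate(row) for row in rows], response)
    for name, aggregate in AGGREGATORS.items()
}

for name, rho in predictive_power.items():
    print(f"{name}: {rho:.2f}")
```

Because only ranks enter the score, any strictly increasing transformation of an aggregation output leaves its predictive power unchanged.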

For the external evaluation, the supervised aggregation REG achieves the best results, closely followed by the unsupervised approaches WPM and WSM. For the internal evaluation, WPM achieves the best results, closely followed by WSM and SUM. We conclude that basic aggregation functions can be significantly improved by unsupervised ML-based approaches, and that the latter are quite competitive with linear regression, a (simple) supervised learning approach. As a consequence, these data-driven aggregation approaches are well-suited for prediction tasks when ground truth is not available for training a regression model. This confirms the results of ulan2021weighted and generalizes them from the field of bug prediction to a wider range of scientific and application fields.

For the evaluation, we developed a reusable and extensible evaluation framework, i.e., new aggregation approaches as well as new performance measures can be easily plugged in. This way, other researchers can easily compare their aggregation approaches against the ones studied in the present paper. Provided that our benchmark and selection of aggregation approaches are considered representative enough, the proposed evaluation framework can serve as a practical guideline for selecting an appropriate aggregation approach. It can also stimulate research in the field of aggregation by highlighting the current champion approaches and setting aside the poorly performing ones.

In the future, we plan to develop a visualization tool for easy comparison. Also, we plan to extend the framework by considering classification and clustering tasks using aggregation. One possible way is to apply the logit function to the aggregation output, and then evaluate its performance as a classifier. Another interesting direction for future research is to evaluate not only the aggregation output for prediction, but its parameters, such as weights, for inference tasks.