The purpose of multiple criteria sorting (also called ordinal classification) is to help a decision maker (DM) assign a finite set of alternatives to pre-defined, preference-ordered classes according to their performances on multiple criteria. In the past decade, sorting has been among the fastest-growing areas in Multiple Criteria Decision Aiding (MCDA), addressing problems in disciplines such as credit rating, policy making and assessment, inventory control, project management, supplier segmentation, recommendation systems, risk assessment, and competitiveness analysis.
Most recently proposed sorting methods require the DM to express his/her preferences in the form of assignment examples concerning a subset of reference alternatives. Such information is used to construct a preference model compatible with the DM’s preferences, which is subsequently employed for comparing the alternatives against some class profiles or for establishing a preference relation among alternatives from which the class assignments can be derived. In many such approaches, constructing a preference model is organized as a series of interactions in which the DM provides incremental preference information in order to calibrate the model to better fit his/her preferences. Meanwhile, the DM can verify the consequences of the provided preference information on the decision outcomes, which allows him/her to shape his/her preferences progressively and, finally, to be convinced of the validity of the resulting sorting recommendation.
Nowadays, data-driven decision support has become an important issue for many businesses. Specifically, with the recent development of information technology, decision support systems are used to assist humans in deriving insights and making decisions through the analysis of increasingly complex data. For example, financial institutions develop systems that evaluate the credit risk of firms and individuals according to their transaction data and financial indicators, and that decide whether to grant a loan. Furthermore, firms rely on customer relationship management systems to construct customer profiles from on-line and off-line behaviours and to perform market segmentation in order to tailor marketing policies to targeted segments. Although these two real-world applications can be viewed in terms of sorting, there is an intrinsic distinction between such data-driven decision problems and traditional MCDA problems: the former require preference discovery to be performed automatically, without further intervention of the DM, whereas the latter expect the DM’s active participation in the preference construction process.
Preference discovery (also called preference learning) has been an important field in the Machine Learning (ML) community. Its primary focus is on automatically constructing a model from a given training sample and predicting preferences for new data, so that the outcomes obtained with the discovered model are “similar” to the provided training data in some sense. In contrast to MCDA, where the preference model is co-constructed with the DM so as to ensure its interpretability and descriptive character, preference learning in ML is mainly concerned with the ability to capture complex interdependencies between the input and the output, and with the predictive performance of the discovered model. This difference is further reflected in the form of preference models employed in the two fields. On the one hand, additive models are widely used in MCDA because they are intuitive and comprehensible. On the other hand, non-linear models, such as kernel methods and neural networks, are often used in ML to capture interdependencies and other complex patterns in data. Although such non-linear models offer greater flexibility in fitting the learning data and recognizing patterns, they are too complex to be interpreted by users, and are therefore often referred to as “black boxes”.
In this paper, we bridge the gap between the fields of MCDA and ML by proposing a new preference learning approach for data-driven multiple criteria sorting problems. We aim to learn a preference model from historical decision examples (also called training samples) so that it can be used to recommend a decision for the non-reference alternatives. The model should not only have a high predictive performance, but also allow for an interpretable description of preferences. Specifically, the proposed approach can capture potential interactions among criteria, which is relevant for numerous real-world applications. For example, let us consider computers evaluated in terms of the number of CPU cores, CPU speed, and price. On the one hand, there may exist a negative interaction between the number of CPU cores and CPU speed, because a computer with a large number of CPU cores usually also has a high CPU speed. Thus, when considering such a pair of criteria jointly, its impact on the comprehensive quality of a computer should be smaller than the sum of the two impacts generated by considering each criterion separately. On the other hand, there may exist a positive interaction between CPU speed and price, because a high CPU speed usually comes with a high price. Thus, a computer with a high CPU speed and a low price is much appreciated, and the joint impact of such a pair of criteria on the overall desirability of a computer should be larger than the sum of the two impacts viewed individually.
In MCDA, several models for capturing the interactions between criteria have been developed. Firstly, a multi-linear utility function is a more general form of a value function, which aggregates products of marginal utilities over all subsets of criteria. Secondly, the Choquet integral can be seen as an average of marginal values according to a capacity that measures the relative importance of every subset of criteria. If there is a positive (negative) interaction between two criteria, the weight assigned to such a pair is larger (smaller) than the sum of weights assigned to each of the two criteria separately. In particular, the Choquet integral has been advocated as an aggregation model for preference learning and incorporated within an extension of logistic regression. The main limitation of the two aforementioned models derives from the need to express the performances on different criteria on the same scale, or to bring them to a joint scale by means of marginal value functions that must be specified beforehand. This poses a serious burden for the use of such preference models in real-world decision problems. The third type of preference model handling interactions between criteria is a general additive value function augmented by a pair of components that capture the positive and negative interactions for pairs of criteria. This model requires neither the specification of all performances on the same scale nor an a priori definition of marginal values. Its construction has traditionally been based on linear programming techniques used within an interactive procedure during which the DM can progressively discover the pairs of interacting criteria.
In the proposed preference learning approach, we consider an additive value model with piecewise-linear marginal functions under the preferential independence hypothesis, and then extend it to capture interactions. For this purpose, we adapt the additive value function augmented by interaction components, using two types of expressions for quantifying the positive and negative interactions among pairs of criteria. Consequently, our approach belongs to the family of value-based MCDA methods, which establish preference relations among alternatives by comparing their numerical scores, thus preserving the advantage of intuitiveness and comprehensibility. Our approach does not require all criteria to be expressed on the same scale, and it admits an assessment of both the relative importance of criteria and the potential interaction intensities between pairs of criteria.
We also introduce methodological advances in ML to enhance the predictive ability of the constructed preference model and the computational efficiency of the preference learning procedure. We formulate the learning problem in the regularization framework and use regularization terms to improve the generalization ability of the constructed model on new instances. Moreover, by utilizing the properties of value functions, we formulate a convex quadratic programming model for constructing the preference model. Since the complexity of this technique does not depend on the number of training samples, it is suitable for data-intensive tasks, and the respective models can be solved with popular solvers without extraordinary effort. In addition, we propose four methods for classifying non-reference alternatives once the preference model with the optimal fitting performance is obtained. Consequently, the generalization performance can be improved by selecting whichever of the four procedures proves most advantageous for a given dataset.
The various variants of the proposed approach in terms of different interaction expressions and methods for classifying non-reference alternatives are validated within an extensive computational study. In particular, the practical usefulness of the proposed method is demonstrated on a problem of parametric evaluation of research units. In this perspective, we discuss how to interpret information on the relative importance of criteria and the interaction coefficients between pairs of criteria. Moreover, we compare the proposed approach with the UTADIS method and the Choquet integral-based sorting model in terms of their predictive performances on nine monotone learning datasets.
The remainder of the paper is organized in the following way. In Section 2, we present the learning approach for addressing sorting problems with potentially interacting criteria. In Section 3, we apply the proposed approach to a problem of parametric evaluation of Polish research units. We also discuss the experimental results derived from the comparison of the introduced method with UTADIS and the Choquet integral-based model on several public datasets. Section 4 concludes the paper and provides avenues for future research.
2 Preference learning approach for sorting problems with potentially interacting criteria
2.1 Problem description
We describe the considered sorting problems with the following notation:
$A^R$ – a set of reference alternatives (training sample) for which the classification is known;
$A$ – a set of non-reference alternatives to be classified;
$Cl = \{Cl_1, \ldots, Cl_q\}$ – a set of decision classes, such that $Cl_{s+1}$ is preferred to $Cl_s$ (denoted by $Cl_{s+1} \succ Cl_s$), $s = 1, \ldots, q-1$, and $Cl_1$ and $Cl_q$ are, respectively, the least and the most preferred ones;
$G = \{g_1, \ldots, g_n\}$ – a family of evaluation criteria, $g_j : A \cup A^R \rightarrow \mathbb{R}$, and $g_j(a)$ denotes the performance of alternative $a$ on criterion $g_j$; without loss of generality, we assume that all criteria are of gain type, i.e., the greater $g_j(a)$, the more preferred $a$ on $g_j$, $j = 1, \ldots, n$.
The task consists in learning a preference model from the training samples composed of reference alternatives $a \in A^R$ and their assignments (denoted by $Cl(a)$) to determine the classification of the non-reference alternatives in $A$. Let us first define a simple additive value function under the preferential independence hypothesis, and later extend such a preference model to consider the interactions among criteria. The additive value model aggregates the performances of each alternative on all criteria into a comprehensive score:
$$U(a) = \sum_{j=1}^{n} u_j(g_j(a)), \tag{1}$$
where $U(a)$ is a comprehensive value of $a$, and $u_j(g_j(a))$ is a marginal value on criterion $g_j$, $j = 1, \ldots, n$.
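Since the marginal value functions $u_j(\cdot)$ for each criterion $g_j$, $j = 1, \ldots, n$, are unknown, we employ piecewise-linear marginal value functions to approximate the actual ones. Such an estimation technique has been adopted in many ordinal regression problems. Specifically, let $X_j = [\alpha_j, \beta_j]$ be the performance scale of $g_j$, such that $\alpha_j$ and $\beta_j$ are the worst and best performances, respectively. To define a piecewise-linear marginal value function $u_j(\cdot)$, we divide $X_j$ into $\gamma_j \geq 1$ equal sub-intervals, denoted by $[x_j^0, x_j^1], [x_j^1, x_j^2], \ldots, [x_j^{\gamma_j - 1}, x_j^{\gamma_j}]$, where $x_j^k = \alpha_j + \frac{k}{\gamma_j}(\beta_j - \alpha_j)$, $k = 0, 1, \ldots, \gamma_j$. Then, the marginal value of alternative $a$ on criterion $g_j$ can be estimated through linear interpolation:
$$u_j(g_j(a)) = u_j(x_j^{k_j}) + \frac{g_j(a) - x_j^{k_j}}{x_j^{k_j+1} - x_j^{k_j}} \left( u_j(x_j^{k_j+1}) - u_j(x_j^{k_j}) \right), \quad \text{for } g_j(a) \in [x_j^{k_j}, x_j^{k_j+1}].$$
One can observe that the piecewise-linear marginal value function is fully determined by the marginal values at the characteristic points, i.e., $u_j(x_j^0), u_j(x_j^1), \ldots, u_j(x_j^{\gamma_j})$. Given a sufficient number of characteristic points, the piecewise-linear marginal value function can approximate any form of non-linear value function.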
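The interpolation step can be illustrated with a minimal Python sketch. The function name `marginal_value`, the scale $[0, 100]$, and the marginal values at the characteristic points are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import numpy as np

def marginal_value(x, alpha, beta, values):
    """Piecewise-linear marginal value at performance x.

    values[k] is the marginal value at the k-th characteristic point; the
    points split [alpha, beta] into len(values) - 1 equal sub-intervals.
    """
    breakpoints = np.linspace(alpha, beta, num=len(values))
    # np.interp interpolates linearly between consecutive breakpoints and
    # clamps to the end values outside [alpha, beta].
    return float(np.interp(x, breakpoints, values))

# Hypothetical criterion on the scale [0, 100] with 4 equal sub-intervals.
u_points = [0.0, 0.10, 0.15, 0.30, 0.40]
print(marginal_value(62.5, 0.0, 100.0, u_points))
```

Here the performance 62.5 lies halfway through the sub-interval $[50, 75]$, so the result is the midpoint of the marginal values at its two end points.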
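Let $x_j^t$ denote the $t$-th characteristic point and $\gamma_j$ the number of sub-intervals on criterion $g_j$. When assuming $u_j(x_j^0) = 0$, $j = 1, \ldots, n$, the marginal value $u_j(g_j(a))$ can be rewritten as:
$$u_j(g_j(a)) = \sum_{t=1}^{k_j} \Delta u_j^t + \frac{g_j(a) - x_j^{k_j}}{x_j^{k_j+1} - x_j^{k_j}} \, \Delta u_j^{k_j+1}, \quad \text{for } g_j(a) \in [x_j^{k_j}, x_j^{k_j+1}],$$
where $\Delta u_j^t = u_j(x_j^t) - u_j(x_j^{t-1})$, $t = 1, \ldots, \gamma_j$.
Having gathered all $\Delta u_j^t$, $t = 1, \ldots, \gamma_j$, as a vector $\mathbf{u}_j = (\Delta u_j^1, \ldots, \Delta u_j^{\gamma_j})^{\mathrm{T}}$ for criterion $g_j$, $j = 1, \ldots, n$, for each alternative $a$ and each criterion $g_j$ we can define a vector $\mathbf{V}_j(a) = (v_j^1(a), \ldots, v_j^{\gamma_j}(a))^{\mathrm{T}}$, such that, for each $t = 1, \ldots, \gamma_j$:
$$v_j^t(a) = \begin{cases} 1, & \text{if } g_j(a) > x_j^t, \\ \dfrac{g_j(a) - x_j^{t-1}}{x_j^t - x_j^{t-1}}, & \text{if } x_j^{t-1} \leq g_j(a) \leq x_j^t, \\ 0, & \text{otherwise}. \end{cases}$$
Then, the marginal values $u_j(g_j(a))$, $j = 1, \ldots, n$, can be represented as an inner product between vectors as follows:
$$u_j(g_j(a)) = \mathbf{u}_j^{\mathrm{T}} \mathbf{V}_j(a).$$
Subsequently, let us denote $\mathbf{u} = (\mathbf{u}_1^{\mathrm{T}}, \ldots, \mathbf{u}_n^{\mathrm{T}})^{\mathrm{T}}$ and $\mathbf{V}(a) = (\mathbf{V}_1(a)^{\mathrm{T}}, \ldots, \mathbf{V}_n(a)^{\mathrm{T}})^{\mathrm{T}}$, and the comprehensive value can be expressed in the following way:
$$U(a) = \mathbf{u}^{\mathrm{T}} \mathbf{V}(a).$$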
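The inner-product form can be sketched in Python as follows. The sketch assumes that the parameter vector stores the increments of the marginal value between consecutive characteristic points (with a zero value at the worst performance) and that each alternative is encoded by how much of every sub-interval its performance covers. The names `criterion_features` and `comprehensive_value`, and all numbers, are hypothetical.

```python
import numpy as np

def criterion_features(x, alpha, beta, n_sub):
    """Coverage of each of the n_sub equal sub-intervals of [alpha, beta] by x.

    An entry is 1 if x lies at or above the sub-interval's upper end, 0 if x
    lies at or below its lower end, and the covered fraction otherwise.
    """
    pos = (x - alpha) / (beta - alpha) * n_sub  # position in sub-interval units
    return np.clip(pos - np.arange(n_sub), 0.0, 1.0)

def comprehensive_value(perf, scales, n_subs, increments):
    """Inner product of the stacked increments and the stacked feature vectors."""
    feats = np.concatenate([
        criterion_features(x, lo, hi, m)
        for x, (lo, hi), m in zip(perf, scales, n_subs)
    ])
    return float(np.concatenate(increments) @ feats)

# Two hypothetical criteria with 4 and 2 sub-intervals, respectively.
scales = [(0.0, 100.0), (0.0, 10.0)]
n_subs = [4, 2]
increments = [np.array([0.10, 0.05, 0.15, 0.10]), np.array([0.20, 0.40])]
print(comprehensive_value((62.5, 5.0), scales, n_subs, increments))
```

Because the comprehensive value is linear in the stacked increments, any requirement on comprehensive values becomes a linear constraint on the unknown parameters.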
2.2 Learning preference model from reference alternatives
In this section, we propose a new method for learning a preference model in the form of an additive value function (1) from a set of reference alternatives and their associated precise class assignments. Before describing the estimation procedure, let us present the underlying consistency principle for characterizing the preference relation between alternatives in sorting problems.
- Definition 1.
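For any pair of alternatives $a, b \in A^R$, value function $U(\cdot)$ is said to be consistent with the assignments of $a$ and $b$ (denoted by $Cl(a)$ and $Cl(b)$, respectively, and $Cl(a), Cl(b) \in Cl$) iff:
$$U(a) \geq U(b) \Rightarrow Cl(a) \succsim Cl(b), \tag{2}$$
where $\succsim$ means “at least as good as”. Observe that implication (2) is equivalent to:
$$Cl(a) \succ Cl(b) \Rightarrow U(a) > U(b). \tag{3}$$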
According to Definition 1, a value function $U(\cdot)$ inferred from the analysis of assignment examples should ensure $U(a) > U(b)$ for pairs of reference alternatives $(a, b)$ such that $Cl(a) \succ Cl(b)$. However, no value function may guarantee perfect compatibility, because some assignment examples may be inconsistent with the assumed preference model. Instead, we can estimate a value function that maximizes the difference between $U(a)$ and $U(b)$ for pairs of reference alternatives such that $Cl(a) \succ Cl(b)$. This can be implemented by solving the following linear programming model:
where denotes a set of reference alternatives assigned to class , and is the cardinality of . For any , , the value difference for such a pair of alternatives, denoted by , is identified by constraint (5). In constraint (6), is the average value difference for pairs . Then, corresponds to the minimum of such value differences for all consecutive classes. By maximizing the minimum value difference (i.e., ), model (P0) aims to find value function that restores the assignment consistency as accurately as possible.
Note that it would not be appropriate to maximize the minimal value difference between reference alternatives from the consecutive classes (i.e., replace constraint (6) with constraint , , , ). In case of an inconsistent reference set, such a minimal value is smaller than zero (i.e., ), and then it is meaningless to maximize it. Moreover, we do not maximize the sum of value differences between reference alternatives from the consecutive classes (i.e., remove constraint (6) and replace objective (4) with ). The underlying reason for avoiding doing so can be illustrated through a simple example: let us consider three reference alternatives , and . Then, and . If we aimed to maximize the sum of value differences between reference alternatives from the consecutive classes, we would just maximize the difference between and , because and then and would be completely neglected.
When addressing a large number of reference alternatives, model (P0) contains a huge number of constraints (5), exceeding the processing capacity of existing linear programming solvers. Thus, we propose a method for transforming model (P0) into an equivalent model that is suitable for data-intensive tasks. The key idea is to aggregate constraints (5) over all possible pairs of reference alternatives from a given pair of consecutive classes, and obtain:
By dividing , constraint (7) is equivalent to:
Then, since , constraint (8) can be transformed to:
For each class , , let us average for all and derive:
Then, constraint (10) can be written as:
In this way, model (P0) can be reformulated as:
Note that the number of constraints (13) is related only to the number of classes (i.e., ) rather than the number of pairs of reference alternatives. Thus, model (P1) can deal with a large set of reference alternatives efficiently.
- Proposition 1.
The optimal value of the objective function of model (P0) is equal to that of model (P1).
Proof. See Appendix A. ∎
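The aggregation behind this reformulation rests on a simple identity: the average value difference over all pairs drawn from two classes equals the difference between the two class means. A minimal numerical check in Python is given below; the function names and the randomly generated comprehensive values are hypothetical and serve only as an illustration.

```python
import numpy as np

def mean_pairwise_gap(upper, lower):
    """Average of U(a) - U(b) over all pairs (a in upper class, b in lower class)."""
    upper = np.asarray(upper, dtype=float)
    lower = np.asarray(lower, dtype=float)
    # Broadcasting builds one difference per pair: len(upper) * len(lower) terms.
    return float(np.mean(upper[:, None] - lower[None, :]))

def class_mean_gap(upper, lower):
    """Difference of the class means: one term per class, no pairs needed."""
    return float(np.mean(upper) - np.mean(lower))

rng = np.random.default_rng(0)
upper = rng.random(40)  # hypothetical values of alternatives in the better class
lower = rng.random(25)  # hypothetical values of alternatives in the worse class
print(mean_pairwise_gap(upper, lower), class_mean_gap(upper, lower))
```

The two printed numbers agree up to floating-point error, although the first is computed from 1000 pairs and the second from two averages. This is why the number of constraints can depend on the number of classes only.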
In addition to maximizing the value difference for pairs of reference alternatives from the consecutive classes, we also propose to minimize the value difference among reference alternatives from the same class. In this way, the distribution of comprehensive values of reference alternatives from the same class would be more concentrated, and consecutive classes could be clearly delimited. For this purpose, we aim at minimizing the second objective:
where is a matrix and . Since always holds, must be positive semi-definite. Then, putting together the above two objectives, we propose the following convex quadratic programming model:
where the vector in constraint (16) has all entries equal to one, and the constraint bounds the comprehensive value to the interval [0,1]. Since each class is preferred to the one directly below it, constraint (17) imposes the corresponding requirement on consecutive classes.
Besides the two aforementioned objectives, a regularization term is added to the objective of model (P2) to avoid over-fitting. Specifically, since the performance scale on each criterion is divided into a certain number of equal sub-intervals, the fitting ability of the estimated function improves as the number of sub-intervals grows, which at the same time increases the risk of over-fitting. The regularization term, also named Tikhonov regularization, penalizes functions that are “too wiggly” and favours marginal value functions that are as “smooth” as possible, which alleviates the over-fitting caused by dividing the performance scale into too many sub-intervals.
The two constants in the objective make a trade-off between the two objectives and the regularization term. Their values can be chosen through $k$-fold cross-validation in the following manner. The whole set of reference alternatives is randomly partitioned into $k$ (usually 5 or 10) equal-sized folds such that the proportion of reference alternatives from each decision class in every fold is the same as in the whole set. For a given setting of the constants, $k-1$ folds are used as the training data and the remaining fold is retained as the validation data for testing the model. The process is repeated $k$ times, so that each fold serves once as the validation data, and the results are averaged to evaluate the performance of the developed model (e.g., classification accuracy). Finally, the values of the two constants corresponding to the best performance are chosen.
Let us remark that, since the data-dependent terms of model (P2) can be specified in advance, the model does not involve pairwise comparisons among reference alternatives, and thus the number of constraints is small. Hence, model (P2) can be easily solved using popular optimization packages, such as LINGO, CPLEX, or MATLAB.
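The cross-validation procedure for choosing the two trade-off constants can be sketched in Python as below. This is only an outline under stated assumptions: the helper names (`stratified_folds`, `cross_validate`, `select_constants`) are hypothetical, and training model (P2) is delegated to a user-supplied callable rather than implemented here.

```python
import numpy as np

def stratified_folds(labels, k, seed=0):
    """Split indices into k folds with near-identical class proportions."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    folds = [[] for _ in range(k)]
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        # Cut the shuffled members of this class into k nearly equal chunks,
        # one chunk per fold.
        for i, chunk in enumerate(np.array_split(idx, k)):
            folds[i].extend(chunk.tolist())
    return [np.array(sorted(f), dtype=int) for f in folds]

def cross_validate(labels, k, fit_and_score, seed=0):
    """Average validation score over the k train/validation splits.

    fit_and_score(train_idx, valid_idx) is a user-supplied callable that trains
    on train_idx and returns a score (e.g. accuracy) measured on valid_idx.
    """
    folds = stratified_folds(labels, k, seed)
    scores = []
    for i, valid_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        scores.append(fit_and_score(train_idx, valid_idx))
    return float(np.mean(scores))

def select_constants(labels, k, grid, make_scorer):
    """Return the pair of constants in grid with the best cross-validated score.

    make_scorer(c1, c2) must return a fit_and_score callable for that setting.
    """
    return max(grid, key=lambda c: cross_validate(labels, k, make_scorer(*c)))
```

In practice, `make_scorer` would wrap a quadratic programming solver that fits the model on the training folds and reports the classification accuracy on the validation fold.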
2.3 Considering interactions among criteria
Even though the additive value function model is widely used in real-world decision aiding, it cannot represent interactions among criteria due to the underlying preferential independence hypothesis. To handle such interactions, we adopt and adjust the additive model with interaction components, yielding a new method that can address a large set of reference alternatives efficiently. The underlying model is an additive value function augmented by “bonus” and “penalty” components for, respectively, positive and negative interactions among criteria, which is formulated as:
where and are the bonus and penalty values for modelling the interactions between and . The extended form (19) of value function should fulfil the following basic conditions:
normalization: if , , and if , ,
monotonicity (a): , if and , then and ,
monotonicity (b): , if , , then:
The normalization conditions require to be bounded to the interval [0,1]. Monotonicity (a) ensures that and are monotone non-decreasing with respect to their arguments. Monotonicity (b) can be interpreted as follows: when comparing any pair of alternatives on a subset of criteria , if is at least as good as for all , the comprehensive value of derived from the analysis of should be not worse than that of . Note that monotonicity (b) induces numerous constraints as has non-empty subsets. For this reason, we assume that any criterion interacts with at most one other. This makes both the inference of a value function more tractable and the constructed model more interpretable. Under this assumption, value function (19) can be reformulated as:
denotes the set of all pairs of interacting criteria. Value function (20) divides a set of criteria into two disjoint subsets: one consisting of criteria not interacting with the remaining ones, and the other composed of the interacting criteria. In this way, monotonicity (b) can be reduced to that, , if and , then:
In this study, we propose to define the bonus and penalty components and in the following way:
where are the coefficients for modelling the positive and negative interactions on criterion ’s -th sub-interval and criterion ’s -th sub-interval. Since the above definition of and is based on the product of and , it ensures that the bonus and penalty components are monotone and strictly increasing with respect to and for any pair of criteria .
An alternative definition of the interaction components would consist in deriving the minimum from and as follows:
This makes these components monotone non-decreasing with respect to and . It is easy to validate that such two types of definitions of and satisfy monotonicity (a). To ensure monotonicity (b), let us consider the following proposition.
See Appendix B. ∎
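The two ways of defining the interaction components can be contrasted with a short Python sketch. It assumes that each criterion of an interacting pair is described by a vector of sub-interval coverage values in $[0, 1]$ and that one non-negative coefficient is attached to every combination of sub-intervals. This encoding, the function name `interaction_value`, and the numbers are assumptions made for illustration, not the paper's exact formulation.

```python
import numpy as np

def interaction_value(feat_j, feat_k, coeffs, kind="product"):
    """Bonus (or penalty) magnitude for one pair of interacting criteria.

    feat_j, feat_k: sub-interval coverage vectors with entries in [0, 1].
    coeffs[s, t]: non-negative coefficient for sub-interval s of the first
    criterion combined with sub-interval t of the second one.
    """
    fj = np.asarray(feat_j, dtype=float)[:, None]
    fk = np.asarray(feat_k, dtype=float)[None, :]
    if kind == "product":
        pair = fj * fk              # product-based definition
    elif kind == "min":
        pair = np.minimum(fj, fk)   # minimum-based definition
    else:
        raise ValueError("kind must be 'product' or 'min'")
    return float(np.sum(np.asarray(coeffs, dtype=float) * pair))

# Hypothetical pair of criteria with 2 and 3 sub-intervals.
feat_j = [1.0, 0.5]
feat_k = [1.0, 1.0, 0.2]
coeffs = np.full((2, 3), 0.01)
print(interaction_value(feat_j, feat_k, coeffs, "product"))
print(interaction_value(feat_j, feat_k, coeffs, "min"))
```

With non-negative coefficients, both variants cannot decrease when either performance improves, since each coverage value is non-decreasing in the underlying performance.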
Let us now introduce the method for estimating the coefficients and , , , , from the assignments of reference alternatives. For the convenience of the analysis, let us define the following vectors:
Then, the bonus and penalty components and can be reformulated in the form of an inner product between the above vectors as follows:
By considering the interactions between criteria, we can redefine and as follows: