Machine learning has recently been widely adopted to address challenging decision-making problems in a variety of managerial contexts such as marketing (Cui and Curry 2005, Cui et al. 2006), credit-risk evaluation (Baesens et al. 2003) and healthcare management (Gartner et al. 2015). Many machine learning models, such as support vector machines (SVMs) (Cortes and Vapnik 1995), boosted trees (Friedman 2001) and neural network-based methods (Rumelhart et al. 1988, LeCun et al. 2015), are applied to diverse real-world prediction problems due to their capacity to analyze high-dimensional data. Although higher complexity usually brings higher accuracy, it comes at the expense of interpretability (Lou et al. 2012). In practice, model interpretability is as important as (if not more important than) accuracy in many mission-critical applications such as clinical decision-making, in which understanding how the model makes a prediction is key to helping physicians trust the model and use its predictions (Caruana et al. 2015, Fox et al. 2007). Recently, technology giants such as Google, IBM, and Microsoft have been investigating techniques for enhancing model interpretability (Mohseni et al. 2018). As stated in a comprehensive overview by David Gunning, the program manager in the Information Innovation Office (I2O) of the Defense Advanced Research Projects Agency (DARPA), “machine learning models are opaque, non-intuitive and difficult for people to understand” (Gunning 2017). DARPA has since funded the development of interpretable machine learning techniques in academia. In DARPA’s latest budget plan, explainable artificial intelligence (XAI) is listed as a key funding area for fiscal year 2019-2020, with a total amount of 26.05 million US dollars (see https://www.darpa.mil/about-us/budget).
1.1 Benefits of interpretable models
Both machine learning and management research can benefit from the interpretability of models. First, an interpretable model is trustworthy because it exploits patterns and rules that are consistent with prior human knowledge and experience. Unreasonable learned patterns and rules can be easily identified and corrected by a decision maker (DM). If a model tends to make mistakes on cases that the DM can easily classify correctly, it requires the DM’s supervision and modification (Ribeiro et al. 2016). Second, an interpretable model helps in understanding causality (Miller 2018). Interpretable models can extract the associations between predictors and predictions, which can facilitate downstream managerial decision making. Third, an interpretable model incorporates the DM’s domain knowledge. A DM usually possesses rich domain knowledge but not the technical skills to construct a model. An interpretable model can be used to learn a DM’s decision behavior by tuning key parameters and then provide an in-depth understanding of the data and patterns (Aggarwal and Fallah Tehrani 2019).
1.2 Multiple criteria decision aiding
Multiple criteria decision aiding (MCDA) has been a fast-growing area of operational research during the last several decades (Dyer et al. 1992, Figueira et al. 2005, Wallenius et al. 2008, Saaty 2013, Ramesh et al. 1988, 1989). (MCDA is also known as multiple criteria decision making; in this paper, we use multiple criteria decision aiding, the term of the “European school”, for consistency (Vincke 1986).) It involves a finite set of alternatives (e.g. actions, items, policies) that are evaluated on a set of multiple conflicting criteria or attributes. (In machine learning, criteria refer to attributes or features with preference-ordered scales (Corrente et al. 2013); for consistency, we use “attribute” in this paper.) The DM’s decision is driven by his/her underlying global value (utility) function (Keeney 1976, Keeney and Raiffa 1993). This global value measures the desirability of an alternative to the DM and can be disaggregated into a set of per-attribute marginal value functions that represent the DM’s evaluation of the corresponding attribute (Kadziński et al. 2017). These marginal value functions can be learned from the DM’s judgments on learning examples (e.g. pairwise comparisons between two alternatives). Once the marginal value functions are deciphered, we can understand the decision-making rationale, based on which we can predict the judgment of the DM. This process is referred to as the preference disaggregation approach of MCDA.
Many machine learning frameworks can help MCDA accomplish its learning objectives because both aim to learn a decision model from data. Thus, MCDA and machine learning naturally have reciprocal interactions (Doumpos and Zopounidis 2011). MCDA and machine learning are integrated in two directions. First, we can apply machine learning techniques to various tasks in a decision aiding context, such as learning to rank and multi-label classification. The opposite direction is to implement MCDA concepts in a machine learning framework. There is a trend of using MCDA approaches to adapt machine learning models to various topics, such as feature selection and extraction, pruning decision rules and multiple objective optimization. Our work belongs to the second stream. We aim to construct a hybrid model, which utilizes value function-based preference disaggregation approaches of MCDA to enhance the interpretability of “black-box” machine learning models.
The motivation for introducing the value function-based preference disaggregation approaches of MCDA to machine learning stems from their powerful capacity to depict the human decision-making process. The deciphered marginal value functions reveal the rationale of the DM’s judgment, and thus provide convincing evidence to help comprehend the decision-making behavior (Aggarwal and Fallah Tehrani 2019, Lou et al. 2012). Our task of learning an interpretable model is essentially to capture the characteristics of the marginal value functions, based on which we obtain a certain degree of interpretability. This study differs from statistical models for management problems in that we use the characteristics of the marginal value function (instead of a single coefficient) to represent the effect of each attribute on the outcomes. For example, suppose that a hypothetical DM’s preference system is composed of the four marginal value functions in Figure 1.2. We can analyze the DM’s preference from the following perspectives. First, we focus on the ranges of the marginal values. If the marginal values are close to 0 (see marginal value function 1), either the corresponding attribute is not important to the DM or we have wrongly captured its characteristic. Further interaction with the DM is needed to determine whether to keep this variable or calibrate the model. Unlike statistical model selection methods (e.g. LASSO, Bayesian information criteria (Friedman et al. 2001)), this invokes an interactive model calibration process that incorporates the DM’s domain knowledge (Stewart 1993, Wallenius et al. 2008, Doumpos and Zopounidis 2011). Second, the increasing and decreasing tendencies of the marginal value function curves unveil changes in the DM’s preference. Moreover, the negative and positive marginal values directly show the negative and positive effects of the attributes on the outcomes (see marginal value functions 2 and 3). Statistical models usually generate a fixed coefficient that cannot capture such preference inflexion points. Third, the convexity and concavity of the marginal value function are crucial for interpreting the DM’s rational behavior in the decision-making process (see marginal value function 4). To summarize, unlike traditional statistical models, MCDA aims to extract more interpretable patterns of the DM’s behavior and builds a solid link between the underlying model and the actual decision-making processes, and thus facilitates the effective use of the model in real-world decision making.
1.3 An overview of this paper.
This paper proposes a framework for a Neural Network-based Multiple Criteria Decision Aiding (NN-MCDA) approach. NN-MCDA combines an additive model and a fully-connected multilayer perceptron (MLP) to achieve both model interpretability and complexity. The additive model is learned from the value function-based preference disaggregation models of MCDA. It uses marginal value functions to approximate the relationship between the outcome and individual attributes, whereas the MLP is used to capture the high-order correlations between attributes in the model. We estimate the parameters in the model under a neural network framework that automatically balances the trade-off between the two components.
We tested our proposed model using a set of synthetic datasets and two real datasets. Specifically, the simulation experiments respectively show the impact of pre-defined parameters on the model and the goodness of the model when data is either extremely complex or simple. Two real datasets on ranking universities regarding employment reputation and predicting the risk for geriatric depression are utilized to illustrate the proposed model in real cases. We explain the obtained models and compare them to other interpretable models, i.e., GAM and logistic regression models.
The contributions of this paper are fourfold. First, we advocate a new perspective on interpretable models that both quantifies the impact of individual attributes on the outcome and captures the possible high-order correlations between attributes in the model. It helps the DM to understand the main effect of a single attribute and to make better decisions. Second, to the best of our knowledge, this paper is the first pilot work that introduces the value function-based preference disaggregation approaches of MCDA to machine learning models to enhance model interpretability. The trained parameters in the proposed framework determine the shapes of the marginal value functions in the additive model. The proposed model is free from the preference independence, preference monotonicity, and small learning set assumptions of MCDA approaches, which makes MCDA approaches more general and practical for real-world management problems. Third, we examine the model effectiveness given different model parameters and datasets. The empirical conclusions about the relationships between model interpretability and data complexity offer managerial intuition for future research. Fourth, the proposed framework is flexible and extensible, especially the nonlinear part, which can be modified or replaced by other models according to different types of data. Our work offers intuition for developing interpretable models for both management and computational science.
The rest of the paper is organized as follows. We discuss the related work in Section 2. In Section 3, we introduce the framework for the proposed interpretable model. The simulation and real case experiments are presented in Section 4, and a discussion of the proposed framework is provided in Section 5. We conclude the paper in Section 6.
2 Related work.
2.1 Value function-based preference disaggregation approach of MCDA.
The value function-based preference disaggregation approaches of MCDA provide explicit marginal value functions and numerical scores. A DM can understand the importance of a particular attribute and how the individual attributes contribute to the final decision. This procedure encourages the DM to participate in the decision making process and it provides a comprehensive preference model (Corrente et al. 2013). These approaches have been successfully applied to many scenarios, such as consumer preference analysis (Hauser 1978), financial decisions (Zopounidis et al. 2015), nano-particles synthesis assessment (Kadziński et al. 2018) and territorial transformation management (Ciomek et al. 2018). However, the applications of value function-based preference disaggregation approaches are limited due to some strong assumptions, such as (1) preference independence, (2) monotonic preference, and (3) small set of alternatives.
Recently, many novel models have been proposed to generalize the value function-based preference disaggregation approaches of MCDA. Preference independence allows the model to be additive. Considering interacting attributes, Angilella et al. (2010) utilize a fuzzy measure to model the preference system, where the alternatives are evaluated in terms of the Choquet integral. However, it is difficult for the DM to understand the impact of an individual attribute evaluated through the Choquet integral. Angilella et al. (2014) account for positive and negative interactions among attributes, and add an interaction term to the additive global value function for each alternative. They require the DM to provide some knowledge about the interacting pairs that are mined by the models. These studies only consider the interaction between pairs of attributes because higher-order interactions require more cognitive effort and more computational cost.
The majority of existing studies assume that the marginal value functions are monotonic and piecewise linear. This assumption reduces the model complexity, but it fails to describe preference inflexions. To address this problem, Ghaderi et al. (2017) and Liu et al. (2019) relax this assumption and constrain the variations of the slope to obtain non-monotonic marginal value functions without serious over-fitting. Both approaches obtain non-smooth value functions, from which attitudes towards risk are difficult to interpret because the functions are non-differentiable. Since a differentiable marginal value function is essential for analyzing consumer behavior, Sobrie et al. (2018) use semidefinite programming to infer the key parameters of polynomial marginal value functions. This gives a more flexible and interpretable preference model. However, it still assumes that the DM’s preference is monotonic.
The monotonic piece-wise form of the marginal value functions has a low expressibility for large learning sets (Sobrie et al. 2018). Nowadays, MCDA approaches are expected to deal with large amount of data in many disciplines (Pelissari et al. 2019, Liu et al. 2019). Liu et al. (2019) embed the MCDA approach into a regularization framework to approximate marginal value functions in any piece-wise linear shapes, and provide efficient algorithms to handle larger learning sets.
Most existing studies focus on expanding the MCDA approaches from only one perspective. Compared with these recent advances, the proposed framework tries to address all the aforementioned limitations of MCDA by providing a non-monotonic, smoother, and more powerful MCDA approach for real-world applications involving more complex decision-making scenarios.
2.2 Interpretable models.
There is usually a trade-off between model interpretability and prediction accuracy (shown in Figure 2.2). Interpretable machine learning, or XAI, aims to create a suite of techniques that produce more explainable models while maintaining a high accuracy (Gunning 2017).
Generalized additive model (GAM) uses a link function to build a connection between the mean of the prediction and a smooth function of the predictors (Hastie and Tibshirani 1986). It is good at both dealing with and presenting the nonlinear and non-monotonic relationship between the predictors and the prediction (Lou et al. 2012). Therefore, GAM is usually more accurate than linear additive models. Although GAM does not outperform full complexity models, it possesses more interpretability than these “black-box” models. Lou et al. (2013) explore the co-effect of pairwise interactions and apply the improved GAM to predicting pneumonia risk and 30-day readmission. This model helps the DM (physician) to find useful patterns in the data and quantifies the contributions of individual attributes. Based on these promising results, they argue that it is necessary to develop more interpretable models in mission-critical applications such as management problems (Caruana et al. 2015).
Another solution is to infer a new model to approximate the true black-box model. The new model may not be as accurate as the original black-box model, but can identify patterns and rules to explain how the predictions are made. In Baesens et al. (2003), explanatory rules are extracted to help the credit-risk managers in explaining their decisions. Similarly, Letham et al. (2015) discretize a high-dimensional attribute space into a series of simpler interpretable if-then statements. They firstly make predictions using complex machine learning techniques and then use Bayesian rule lists to reconstruct the predictions. Given approximately accurate predictions, the obtained model is more interpretable.
According to Ribeiro et al. (2016), why and how the model produces a prediction are important for the DM to trust the underlying model. An interpretable model should be able to answer these questions and give the reasons behind a prediction. In this regard, they develop an algorithm named LIME, which approximates a prediction locally with a simpler model, for instance a linear model that is easier to interpret. It can be extended to explain the predictions of any model in an interpretable manner.
2.3 Machine learning in MCDA.
There have been a few attempts to integrate the machine learning algorithms with MCDA. In a pioneering work by Wang and Malakooti (1992), a single-layered feed-forward artificial neural network is proposed to learn MCDA objectives. The advantages of neural networks are that they are independent of functional forms. However, it only gives a final recommendation without any interpretable marginal value functions or patterns. Doumpos and Zopounidis (2011) explore the differences and similarities between machine learning and MCDA. Although there are several studies introducing MCDA into machine learning models, few utilize the MCDA concepts to enhance machine learning models’ interpretability.
As a new sub-field of machine learning, preference learning has attracted extraordinary attention from the MCDA community. Corrente et al. (2013) explore the relationship between MCDA and preference learning. They find that higher performance of machine learning models is usually associated with a lower degree of interpretability, which undermines confidence in employing machine learning models in scenarios where we need to understand the underlying process. A recent study utilizes preference learning to model human decision behavior under an MCDA framework. Such a model can facilitate the understanding of the DM’s behavior by tuning well-defined model parameters (Aggarwal and Fallah Tehrani 2019).
3 Framework for the intelligible model.
Let $\mathcal{D}=\{(\mathbf{x}_i,y_i)\}_{i=1}^{N}$ be the training dataset of size $N$, $\mathbf{x}_i=(x_{i1},\dots,x_{im})$ be the $i$-th attribute vector with $m$ attributes (in MCDA, $\mathbf{x}_i$ is called an alternative with $m$ criteria/attributes), and $y_i$ be the target/response value. In this study, we consider a binary classification problem where $y_i\in\{0,1\}$. The proposed framework can be easily extended to multi-class classification and regression problems.
3.1 The additive model.
The value function-based preference disaggregation approaches of MCDA assume that for each attribute vector $\mathbf{x}_i$, there is a global value function in the following form:

$$U(\mathbf{x}_i)=\sum_{j=1}^{m} w_j\,u_j(x_{ij}), \tag{1}$$

where $w_j$ represents the importance of the $j$-th attribute and $u_j(\cdot)$ is a marginal value function. Note that we rely on the shape and the positive/negative effect of the marginal value function to capture the contribution of individual attributes. Thus we set the weight $w_j$ to be positive to represent the relative importance of the $j$-th attribute, which can positively or negatively affect the global value. The global value function linearly sums the contributions of individual attributes. (In GAM, $u_j(\cdot)$ is called a shape function, and the function that links the expected response to the additive predictor is called the link function.)
Although the global value function is in an additive linear form, the marginal value functions themselves can take any form and are often nonlinear. It has been recognized that the preference in human decision behavior is rational, and thus the marginal value functions should be stable and smooth. In the literature, the marginal value function can be in a simple linear (weighted sum) form (Saaty and Decision 1990, Saaty 2013, Korhonen et al. 2012), monotonic and non-monotonic piecewise linear forms (Stewart 1993, Jacquet-Lagreze and Siskos 2001, Greco et al. 2008, Ghaderi et al. 2017, Liu et al. 2019), and a monotonic polynomial form (Sobrie et al. 2018). To capture the first-order (e.g. monotonicity) and second-order (e.g. marginal rate of substitution) derivative patterns of the attributes’ contributions to the prediction, we extend and generalize state-of-the-art MCDA models (Liu et al. 2019, Sobrie et al. 2018) to allow the marginal value function to take any polynomial form. In this paper, we allow the $j$-th marginal value function to be in a smooth and non-monotonic form of $D_j$ degrees:

$$u_j(x_{ij})=\sum_{d=1}^{D_j} a_{jd}\,x_{ij}^{d}, \tag{2}$$

where $a_{jd}$ is the coefficient of the $d$-th degree and $D_j$ is the highest degree on the $j$-th attribute.
The motivation for using Eq.(2) as a marginal value function is twofold. First, we enhance the expressiveness of the preference model to capture non-monotonic preferences. For example, piecewise linear or monotonic polynomial functions fail to capture all the information in a larger learning set (Sobrie et al. 2018). The nonlinearity and non-monotonicity of Eq.(2) can better fit complex relationships between attributes and the outcome, leading to better model performance. Second, when analyzing human behavior, it is critical to examine the trade-offs or marginal rates of substitution in economics and management studies. A non-differentiable value function, for instance the boosted bagged trees model in Lou et al. (2012), cannot capture the inflexion point where the marginal rate of substitution grows or diminishes more quickly (Keeney and Raiffa 1993). A model that reflects human behavior is more convincing and has more managerial meaning for the DM in management scenarios.
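To make the additive structure concrete, the following is a minimal NumPy sketch of a weighted sum of polynomial marginal value functions. It is an illustration under stated assumptions, not the authors' implementation: the function names `marginal_value` and `global_value`, the example weights and coefficients, and the choice to start the polynomial at degree 1 (no constant term) are all hypothetical.

```python
import numpy as np

def marginal_value(x, coeffs):
    """Polynomial marginal value: sum over d of coeffs[d-1] * x**d.

    Assumption: the polynomial has no constant term, so the value at x = 0 is 0.
    """
    x = np.asarray(x, dtype=float)
    coeffs = np.asarray(coeffs, dtype=float)
    degrees = np.arange(1, coeffs.size + 1)
    # x[..., None] ** degrees has one column per degree; the dot product sums them.
    return (x[..., None] ** degrees) @ coeffs

def global_value(X, weights, coeff_list):
    """Additive global value: weighted sum of the per-attribute marginal values."""
    X = np.asarray(X, dtype=float)
    per_attribute = np.column_stack(
        [marginal_value(X[:, j], c) for j, c in enumerate(coeff_list)]
    )
    return per_attribute @ np.asarray(weights, dtype=float)

# Hypothetical example: two alternatives described by two attributes.
X = np.array([[0.2, 0.5],
              [0.8, 0.1]])
weights = [0.6, 0.4]
coeff_list = [[1.0, -1.0],       # first marginal value:  x - x^2 (non-monotonic)
              [0.5, 0.0, 2.0]]   # second marginal value: 0.5x + 2x^3
U = global_value(X, weights, coeff_list)
print(U)
```

Because each attribute keeps its own coefficient vector, the degree can differ per attribute, and each marginal value function can be plotted and inspected separately.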
3.2 Neural network-based MCDA.
Full complexity models perform well on many machine learning tasks because they can model both the nonlinearity and the interactions between attributes. An additive model like Eq.(1) does not model any interactions between attributes. Therefore, we propose a neural network-based multiple criteria decision aiding (NN-MCDA) model in the following form:

$$g\big(\mathbb{E}[y_i]\big)=\alpha\,U(\mathbf{x}_i)+(1-\alpha)\,f(\mathbf{x}_i), \tag{4}$$

where $U(\mathbf{x}_i)$ is the global score of $\mathbf{x}_i$, $f(\cdot)$ is a latent function of all attributes, and $\alpha$ is a trade-off coefficient. Eq.(4) describes (a) a regression model if $g$ is the identity, and (b) a classification model if $g$ is the logistic function of the identity. $f(\cdot)$ is used to capture the high-order interrelations between attributes in the model. We can use any complex model to fit $f(\cdot)$ for better performance; in this paper we use an MLP (Rosenblatt 1958). With an MLP, $f(\cdot)$ is not transparent, meaning that we do not know its exact structure. Since $U(\cdot)$ captures the explainable form of the marginal value functions, the non-transparent $f(\cdot)$ describes the complex patterns that are not readily useful to the DM. Coefficient $\alpha$ balances the trade-off between $U(\cdot)$ and $f(\cdot)$. If $\alpha$ is close to 1, the model tends toward a simple additive MCDA form. If $\alpha$ is close to 0, we obtain a full complexity model.
The joint training process is shown in Figure 3.2. The input attribute vectors are first transformed into polynomial form, i.e., $\mathbf{z}_i=(x_{i1},x_{i1}^{2},\dots,x_{i1}^{D_1},\dots,x_{im},x_{im}^{2},\dots,x_{im}^{D_m})$. In the input layer, a single-layer network without any activation function is used to reconstruct Eq.(1). It has $\sum_{j=1}^{m}D_j$ units and the weight for each unit corresponds to a particular coefficient $a_{jd}$. We denote the output of the linear component by $U^{L}(\mathbf{x}_i)$, and

$$U^{L}(\mathbf{x}_i)=\mathbf{w}^{\top}\mathbf{u}(\mathbf{x}_i),\qquad u_j(x_{ij})=\mathbf{a}_j^{\top}\big(x_{ij},x_{ij}^{2},\dots,x_{ij}^{D_j}\big)^{\top}, \tag{5}$$

where $\mathbf{a}_j$ is the vector of coefficients in the $j$-th polynomial marginal value function, $\mathbf{w}$ is the vector of weights of attributes, and $\mathbf{u}(\mathbf{x}_i)$ contains the marginal values of the $i$-th attribute vector. Note that Eq.(5) is a specific case of Eq.(1). In Eq.(1), the marginal value functions can take any shape (e.g. piecewise linear), whereas in this study they take the polynomial form of Eq.(2). Thus, $U$ is a generalization of $U^{L}$.
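The correspondence between the additive model and a single linear layer over polynomial inputs can be written out explicitly. The derivation below is a sketch in assumed notation: $w_j$ denotes the weight of attribute $j$, $a_{jd}$ the coefficient of degree $d$ in its marginal value function, $D_j$ its highest degree, and $\theta_{jd}$ is a hypothetical name for the weight of the unit fed by $x_{ij}^{d}$.

```latex
\begin{aligned}
U(\mathbf{x}_i)
  &= \sum_{j=1}^{m} w_j \, u_j(x_{ij})
   = \sum_{j=1}^{m} w_j \sum_{d=1}^{D_j} a_{jd} \, x_{ij}^{d} \\
  &= \sum_{j=1}^{m} \sum_{d=1}^{D_j} \theta_{jd} \, x_{ij}^{d},
  \qquad \theta_{jd} = w_j \, a_{jd}.
\end{aligned}
```

Under these assumptions, a layer with $\sum_j D_j$ units and no activation reproduces the additive model, and summing the $D_j$ terms that belong to attribute $j$ recovers its weighted marginal value $w_j\,u_j(x_{ij})$.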
The nonlinear component is a standard MLP. It is used to learn high-order correlations between attributes. Similarly, by summing every $D_j$ units we can obtain a marginal value on the $j$-th attribute. For activation functions, we opt for the rectified linear unit (ReLU), which is the most commonly used activation function in neural networks (Glorot et al. 2011). Other activation functions such as the sigmoid and tanh functions can also be used. An $L$-layer MLP is defined as:

$$\mathbf{h}_1=\sigma_1(\mathbf{W}_1\mathbf{z}_i+\mathbf{b}_1),\qquad \mathbf{h}_l=\sigma_l(\mathbf{W}_l\mathbf{h}_{l-1}+\mathbf{b}_l)\ \ (l=2,\dots,L),\qquad f(\mathbf{x}_i)=\mathbf{h}_L,$$

where $\mathbf{W}_l$, $\mathbf{b}_l$ and $\sigma_l$ denote the weight matrix, bias vector and activation function for the $l$-th layer, respectively. The input of the MLP is the same as the input of the linear part, i.e., the polynomial-transformed attribute vector $\mathbf{z}_i$.
The output $\hat{y}_i$ is the probability of $y_i=1$; we have

$$\hat{y}_i=\sigma\big(\alpha\,U(\mathbf{x}_i)+(1-\alpha)\,f(\mathbf{x}_i)\big),$$

where $\sigma(\cdot)$ is a sigmoid function. To estimate the parameters, we minimize the mean square error (MSE):

$$\mathcal{L}=\frac{1}{N}\sum_{i=1}^{N}\big(y_i-\hat{y}_i\big)^{2}. \tag{8}$$

We can adopt a variety of optimization methods to minimize Eq.(8); see … (2011) for details of the optimization procedure. The interpretability of the model refers to its capacity to develop marginal value functions, which capture the relationship between individual attributes and the prediction. With the proposed model, the DM can learn which attributes are more important for the prediction, which values of an attribute are positively or negatively associated with the prediction, and where the function switches between convexity and concavity.
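The forward pass and loss described in this subsection can be sketched in a few lines of NumPy. This is a hedged illustration rather than the authors' code: the names `poly_features`, `forward`, `mse`, `theta` and `mlp_layers` are hypothetical, all attributes share one polynomial degree, the MLP output layer is assumed to be linear, and the parameters are random rather than trained.

```python
import numpy as np

def poly_features(X, degree):
    """Expand each attribute x into (x, x^2, ..., x^degree), attribute by attribute."""
    X = np.asarray(X, dtype=float)
    powers = np.arange(1, degree + 1)
    return (X[:, :, None] ** powers).reshape(X.shape[0], -1)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def forward(X, degree, theta, mlp_layers, alpha):
    """Predicted probability: sigmoid(alpha * additive + (1 - alpha) * latent)."""
    Z = poly_features(X, degree)
    additive = Z @ theta                     # linear layer without activation
    h = Z
    for W, b in mlp_layers[:-1]:
        h = np.maximum(h @ W + b, 0.0)       # ReLU hidden layers
    W_out, b_out = mlp_layers[-1]
    latent = (h @ W_out + b_out).ravel()     # assumed linear output layer
    return sigmoid(alpha * additive + (1.0 - alpha) * latent)

def mse(y, y_hat):
    """Mean square error between binary targets and predicted probabilities."""
    return float(np.mean((np.asarray(y, dtype=float) - y_hat) ** 2))

# Hypothetical data and untrained parameters, for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(size=(5, 3))
y = np.array([0, 1, 1, 0, 1])
degree = 2
theta = rng.normal(size=3 * degree)
mlp_layers = [(rng.normal(size=(6, 4)), np.zeros(4)),
              (rng.normal(size=(4, 1)), np.zeros(1))]

p_additive = forward(X, degree, theta, mlp_layers, alpha=1.0)  # additive part only
p_hybrid = forward(X, degree, theta, mlp_layers, alpha=0.5)    # equal mix of both parts
loss = mse(y, p_hybrid)
```

Setting `alpha=1.0` removes the MLP contribution, which gives a quick check that the additive part alone reproduces a purely interpretable model. In practice the parameters would be fitted by a gradient-based optimizer in an automatic differentiation framework.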
3.3 Application to multiple criteria ranking problems.
In this subsection, we will show how to apply NN-MCDA to traditional multiple criteria ranking problems where alternatives are ranked based on the DM’s preference. In this paper, alternatives are represented as attribute vectors.
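Let $\mathbf{x}_i\succsim\mathbf{x}_k$ denote that an attribute vector $\mathbf{x}_i$ is at least as good as $\mathbf{x}_k$, and $\mathbf{x}_i\succ\mathbf{x}_k$ denote that $\mathbf{x}_i$ is better than $\mathbf{x}_k$. Note that the symbol ‘$\succsim$’ (or ‘$\succ$’) does not necessarily require that each element in $\mathbf{x}_i$ is at least as good as (or better than) that in $\mathbf{x}_k$. It indicates that one alternative is at least as good as (or better than) another based on the DM’s judgment. For each pair $(\mathbf{x}_i,\mathbf{x}_k)$, we define $y_{ik}$ as follows:

$$y_{ik}=\begin{cases}1, & \text{if } \mathbf{x}_i\succsim\mathbf{x}_k,\\ 0, & \text{otherwise},\end{cases}$$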
and the difference between global scores of and is:
where and . Let be the aggregated vector for :
and be a function of . We fit function to approximate the value of . Note that in some decision problems, the attribute weights in Eq.(4) are normalized to $w_j\ge 0$ and $\sum_{j=1}^{m}w_j=1$, which is useful for interpreting the trade-offs between attributes. (Note that the trade-off between attributes is similar to attribute importance, but the trade-off emphasizes that assigning more weight to one attribute decreases the weights of the others. That usually leads to a situation where some attributes have almost no effect on the predictions, which is unexpected because the selected attributes are often summarized based on the DM’s prior knowledge and requirements. In this regard, we tend to train our model without normalization but provide normalized weights to evaluate the trade-offs between attributes (Liu et al. 2019). Moreover, there are only minor differences in performance between using normalized weights or not.) To address this issue, we apply the following transformation:
For each attribute , the normalized weight is .
The new global score is . Moreover, the ordinal relations among all attribute vectors are preserved since and .
Given the input data , instead of mathematical programming, we can now use the machine learning scheme in section 3.2 to infer the preference model and rank other attribute vectors. The output is the probability that is at least as good as . We can pre-define two thresholds and , where . If , then , and if , then and otherwise, . If we use the normalized weights, since the probability is transformed nonlinearly, the pre-defined thresholds should also be transformed as follows, , to preserve the ordinal relations. In this way, the traditional multiple criteria ranking approaches can handle larger datasets and obtain smoother and more flexible marginal value functions to assist the DM. We present the simulation results in Section 4.1 and the results using real datasets in Section 4.2.
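The two-threshold decision rule can be illustrated with a short Python sketch. Several details are assumptions made for illustration: the function name `preference_relation`, the default threshold values 0.4 and 0.6, the returned labels, and the reading that a probability above the upper threshold means the first alternative is strictly preferred, below the lower threshold means the second is strictly preferred, and anything in between means indifference.

```python
def preference_relation(prob, lower=0.4, upper=0.6):
    """Map the predicted probability that alternative i is at least as good as k
    to a preference statement.

    Assumption: `lower` and `upper` are hypothetical thresholds chosen by the analyst.
    """
    if not 0.0 <= lower < upper <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= lower < upper <= 1")
    if prob > upper:
        return "i preferred to k"
    if prob < lower:
        return "k preferred to i"
    return "indifferent"

# Hypothetical predicted probabilities for three pairs of alternatives.
labels = [preference_relation(p) for p in (0.91, 0.50, 0.12)]
print(labels)
```

Widening the gap between the two thresholds makes the rule more conservative, since more pairs are declared indifferent instead of being strictly ranked.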
3.4 Usefulness of the proposed framework in decision making.
As we introduce MCDA into machine learning, the main objective is shifted from achieving the best predictive performance to facilitating the DM in gaining insights into the characteristics of the decision making process and the interpretations of the results (Doumpos and Zopounidis 2011). Once the marginal value functions are obtained by the proposed NN-MCDA framework, we can further analyze the DM’s preference from the following perspectives.
First, attribute importance usually has a long-tail distribution, with a few attributes being very important and the majority being less important (Caruana et al. 2015). The characteristics of the marginal value functions can reveal the importance of the corresponding attribute. If a marginal value function is close to 0 over the whole scale of the attribute values, either the attribute is not important to the DM or the characteristic of the marginal value function is wrongly captured, because a change in this attribute has little influence on the predictions. When this is the case, we need to interact with the DM to determine whether to preserve this attribute or calibrate the model. In this regard, the proposed framework can perform model selection and modification (similar to statistical approaches like LASSO). For example, when predicting whether a patient has the flu, if the marginal value function of “room humidity” has a shape like marginal value function 1 in Figure 1.2, it is possible that “room humidity” contributes little to the flu. However, whether we abandon it should be determined by a physician.
Second, the increasing and decreasing tendencies of the marginal value function curves reveal changes in the DM’s non-monotonic preference. We focus on the monotonicity inflexion points because they identify the attribute values to which the DM is more sensitive. Moreover, if we partition the marginal value function curve at these points, we can discretize the continuous attribute into smaller ranges in which the DM’s preference is monotonic. Such smaller intervals are useful for personalization (e.g. customer segmentation) and strategy-making tasks in management. For example, when evaluating a company’s performance, if the marginal value function of “cash to total assets ratio” is like the second function in Figure 1.2, we can learn that a company with a very small or very large “cash to total assets ratio” is in a bad condition. Companies with a large ratio are advised to use the cash for more investments, whereas companies with a small ratio are advised to cut general expenses so that more cash can be used for new investments.
Third, since the marginal value function returns a “score” that is added to the global value, it is crucial to determine whether the attribute contributes positively or negatively to the outcome. If a marginal value function is above/below zero, the corresponding attribute is positively/negatively associated with the prediction. The marginal value function can capture the sign change (if any) of an attribute’s contribution and provide the DM with the exact attribute value where the sign changes. This is more informative than statistical models that only provide a fixed coefficient representing either a positive or a negative effect of the attribute. For example, when predicting the risk of depression among adults, the marginal value function of “age” may have a shape similar to the third function in Figure 1.2 (please also refer to Figure 4.3, which is drawn from the real data). The shape of this curve indicates that the risk of depression does not increase with age if the adult is younger than a threshold. The risk increases once the adult is older than that threshold (the threshold is 71.58 in the real data introduced in Section 4.3). Statistical models, on the other hand, can only conclude that age has either a negative or a positive effect on the depression risk. With such models, we would need to segment the adults into pre-defined age groups to capture such a sign-change effect.
Fourth, the concavity (and convexity) of the marginal value function directly reflects the rate of change of the DM’s preference. Such information is important in both economics and marketing problems. For example, if the consumer’s preference for “discount rate” has the same shape as marginal value function 4 in Figure 1.2, it implies that, at first, the consumer’s utility (propensity to consume the product) grows more quickly as the discount rate increases. However, once the discount rate exceeds a specific value, it signals that the product is possibly of bad quality: although the consumer’s utility still grows, its rate of increase starts to slow down. This suggests to the DM that keeping the discount rate at a medium level could maximize the profit.
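The characteristics discussed above (sign changes, monotonicity inflexion points and changes of concavity) can be read off a polynomial marginal value function numerically. The following is a minimal sketch, not the procedure used in this study: the function names, the attribute range [0, 1] and the example coefficients are assumptions made for illustration only.

```python
import numpy as np
from numpy.polynomial import Polynomial


def real_roots_in(poly, lo, hi, tol=1e-9):
    """Return the real roots of `poly` that lie strictly inside (lo, hi)."""
    if poly.degree() < 1:  # a constant polynomial has no isolated roots
        return []
    roots = np.atleast_1d(poly.roots())
    real = roots[np.abs(roots.imag) < tol].real
    return sorted(float(r) for r in real if lo < r < hi)


def describe_marginal_value(coefs, lo=0.0, hi=1.0):
    """Summarize a polynomial marginal value function u(x).

    `coefs` are in increasing order of degree (hypothetical values).
    Roots of u are candidate sign-change points, roots of u' are candidate
    monotonicity inflexion points, and roots of u'' are candidate points
    where the curve switches between concave and convex.
    """
    u = Polynomial(coefs)
    return {
        "zero_crossings": real_roots_in(u, lo, hi),
        "turning_points": real_roots_in(u.deriv(1), lo, hi),
        "curvature_changes": real_roots_in(u.deriv(2), lo, hi),
    }


# Hypothetical inverted-U curve: u(x) = -(x - 0.2)(x - 0.8)
example = describe_marginal_value([-0.16, 1.0, -1.0])
print(example)
```

For this illustrative curve, the contribution is positive only between the two zero crossings, there is a single turning point, and the second derivative never vanishes, so the curve is concave over the whole range. A root is only a candidate: a repeated root does not produce an actual change of sign.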
To validate the proposed NN-MCDA model, we perform experiments with both synthetic and real datasets. We use the area under the receiver operating characteristic (ROC) curve (AUC) to measure model performance. In Section 4.1, three simulation experiments examine (a) the influence of the degree of polynomial on prediction performance, (b) the influence of the value of the trade-off coefficient on prediction performance, and (c) how well the proposed NN-MCDA approach fits the given marginal value functions. In Section 4.2, we apply the NN-MCDA model to a multiple criteria decision problem in which we rank universities based on employer reputation. In Section 4.3, we predict the risk for geriatric depression and interpret the risk factors at a higher resolution.
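For readers unfamiliar with the metric, AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition; the function name `auc` is ours, and the pairwise comparison is written for clarity rather than efficiency.

```python
import numpy as np


def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("AUC needs at least one positive and one negative sample")
    diff = pos[:, None] - neg[None, :]          # all positive-negative score gaps
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))


print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 3 of 4 pairs are ordered correctly
```

The memory use grows with the product of the two class sizes, so a rank-based implementation is preferable for large datasets.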
For brevity, we set equal pre-defined degrees of polynomial for all marginal value functions in the subsequent experiments. We generate three typical synthetic datasets (from the simplest to very complex) as follows:
Uniformly draw attribute vectors with attributes whose values are within [0,1].
We generate three datasets. (a) For the first dataset, all attributes have equal importance and the actual marginal value functions are identity functions. The global score for each attribute vector is a linear summation of attribute values, without any attribute interactions, plus a noise term that follows a standard normal distribution. (b) For the second dataset, we randomly generate 3-degree polynomial marginal value functions for the attributes, and the global score is the summation of the marginal values, all attribute interactions and a standard normal noise term. (c) The third dataset is extremely complex. The global score is the summation of 15-degree polynomial marginal values, all possible attribute interactions (pairwise, triple-wise and higher interactions) and a standard normal noise term.
We compare the global scores of each pair of attribute vectors and assign a binary label to the pair according to which vector has the higher global score. Note that the actual input is the transformed attribute vector.
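The generation steps above can be sketched for the simplest (linear) dataset as follows. This is an illustration under stated assumptions rather than the exact generator used in the experiments: the number of vectors, the number of attributes, the label convention (1 if the first vector scores at least as high as the second, 0 otherwise) and the function names are all hypothetical, and the transformation applied to the attribute vectors before they enter the model is not reproduced.

```python
from itertools import combinations

import numpy as np


def make_linear_dataset(n_vectors=250, n_attrs=5, seed=0):
    """Draw attribute vectors uniformly from [0, 1] and score them linearly."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n_vectors, n_attrs))
    # Identity marginal value functions with equal importance, no interactions,
    # plus standard normal noise.
    scores = X.sum(axis=1) + rng.standard_normal(n_vectors)
    return X, scores


def pairwise_labels(scores):
    """Label every unordered pair (i, j), i < j, by comparing global scores."""
    pairs = np.array(list(combinations(range(len(scores)), 2)))
    labels = (scores[pairs[:, 0]] >= scores[pairs[:, 1]]).astype(int)
    return pairs, labels


X, scores = make_linear_dataset()
pairs, labels = pairwise_labels(scores)
print(pairs.shape, labels.mean())
```

With 250 vectors, the number of unordered pairs is 250 × 249 / 2 = 31,125, which agrees with the number of data points reported in Experiment II.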
4.1.1 Experiment I: Relationship between degree of polynomial and model performance
The first simulation experiment explores the relationship between the pre-defined degree of polynomial and AUC. The parameters used in the experiment are shown in Table 4.1.1. For each setting, we repeat the experiment 10 times and record the averaged AUC. In this experiment, the number of iterations is determined using fivefold cross-validation: we partition the training set into five subsets and set aside one of them as a validation set. We then train the model on the other four subsets and use the validation set to check convergence. This procedure is repeated five times, and the averaged number of iterations is used to train the final model on the whole dataset (Lou et al. 2012).
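The cross-validation procedure for choosing the number of iterations can be sketched as follows. The sketch makes two assumptions that are not taken from the study: a plain logistic regression trained by gradient descent stands in for the NN-MCDA model, and "convergence" is read as the iteration with the lowest validation loss. All function names and hyperparameters are illustrative.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def log_loss(w, X, y):
    p = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


def best_iteration(X_tr, y_tr, X_val, y_val, max_iter=200, lr=0.5):
    """Train a stand-in model; return the iteration with lowest validation loss."""
    w = np.zeros(X_tr.shape[1])
    best_it, best_loss = 0, log_loss(w, X_val, y_val)
    for it in range(1, max_iter + 1):
        grad = X_tr.T @ (sigmoid(X_tr @ w) - y_tr) / len(y_tr)
        w -= lr * grad
        loss = log_loss(w, X_val, y_val)
        if loss < best_loss:
            best_it, best_loss = it, loss
    return best_it


def cv_iterations(X, y, n_folds=5, seed=0):
    """Average the selected iteration count over the folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    counts = []
    for k in range(n_folds):
        val = folds[k]
        tr = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        counts.append(best_iteration(X[tr], y[tr], X[val], y[val]))
    return int(round(np.mean(counts)))


rng = np.random.default_rng(1)
X_demo = rng.uniform(size=(200, 3))
y_demo = (X_demo.sum(axis=1) + rng.normal(scale=0.3, size=200) > 1.5).astype(float)
n_iter = cv_iterations(X_demo, y_demo)
print(n_iter)
```

The returned count would then be used as the fixed number of iterations when training the final model on the whole training set.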
Figures 4.1.1, 4.1.1 and 4.1.1 report the averaged AUC on the testing set for different training sizes using the three synthetic datasets. Although there is no obvious relationship between training size and model performance, we find two interesting patterns. First, higher pre-defined degrees of polynomial can lead to higher accuracy at convergence. This results from the ability of the underlying model to capture more complicated nonlinearity. However, higher degrees of polynomial usually require more iterations to converge. More specifically, we depict the averaged computational time for each training process in Figure 4.1.1. As model complexity increases, for example by using marginal value functions of higher polynomial degree or by considering more attributes, the average computational time to converge also increases almost linearly. Second, the AUC curves (Figures 4.1.1, 4.1.1 and 4.1.1) are generally concave in shape: as the degree increases, the AUC improvement over the model with the next smaller degree becomes smaller. For example, the improvement is more obvious when we change the pre-defined degree from 1 to 3 than when we change it from 3 to 5 or from 5 to 10. The greatest AUC improvement occurs when we increase the degree to 3, while the improvement from further increasing the degree to 5 and 10 is slim. The results suggest that it is not necessary to set a very large degree for the sake of a minor improvement, because the computational cost increases much faster as the degree increases. In general, we believe that a polynomial of degree 3 is sufficient to capture the characteristics of all three datasets. Higher degrees of polynomial carry a risk of over-fitting and cost more computational time, but contribute little to accuracy.
We also compare the proposed NN-MCDA with baseline machine learning models, including the standard MLP, polynomial linear regression (PLR) of degrees 1, 2, 3, 5 and 10, GAM with 10 splines of polynomial degrees 3 and 10, SVMs with linear, radial basis function (RBF) and polynomial kernels, and single decision tree (DeciTr) models with maximum depths of 6, 10 and 20. Table 1 presents the results for the simplest dataset. All machine learning models, both the interpretable ones (including NN-MCDA) and the full complexity ones, perform well. The performance drops rapidly when we use them to fit the nonlinear and high-order datasets (shown in Tables 2 and 3). Since the proposed NN-MCDA model and MLP can model nonlinearity and attribute interactions, both achieve much higher AUCs than the rest. As expected, the performance of NN-MCDA is lower than that of MLP, but the difference is relatively small. Both NN-MCDA and MLP outperform the other baseline machine learning models significantly. More specifically, GAM, SVM and DeciTr have similar accuracy, which is higher than that of PLR but lower than that of NN-MCDA. NN-MCDA’s performance is close to that of the full complexity model while retaining strong interpretability (as will be shown in the subsequent experiments).
4.1.2 Experiment II: Impact of the trade-off coefficient on AUC
In this section, we assess how the performance of the NN-MCDA model is affected by the value of the trade-off coefficient, i.e., the weight for the linear component. We evenly sample 20 values within [0, 1] as pre-defined values of the coefficient. For each fixed value, we train the NN-MCDA model using the synthetic datasets introduced in the previous subsection. In this experiment, we use the SGD algorithm to optimize the parameters, and the number of iterations is set to 250.
In Figure 4.1.2 (results with the linearly generated datasets), although the three curves have many monotonicity inflexions, there is a general trend that the greater the value of the trade-off coefficient, the better the performance. NN-MCDA obtains the best results when the coefficient is between 0.8 and 1.0 on these linearly generated datasets. In contrast, the AUC curves show a generally decreasing trend as the coefficient increases for the nonlinearly generated datasets (Figures 4.1.2 and 4.1.2). In the extreme cases, where only the nonlinear MLP component works (i.e., the coefficient is 0) or only the linear component works (i.e., the coefficient is 1), the model obtains the greatest or smallest average AUC.
Note that the three datasets have very different patterns. The first dataset uses a set of simple linear marginal value functions, whereas the second and third simulate more complicated patterns. Theoretically, a full complexity model (MLP) can perfectly capture any pattern in the data, at the cost of a very large number of iterations and data samples for convergence. In practice, we often do not have sufficient data or computational time to reach the optimal MLP solution. In this simulation experiment (31,125 data points and 250 iterations to fit the model), a pure MLP model (NN-MCDA with the trade-off coefficient set to 0) does not always lead to the best outcome. This result indicates that, in real-world managerial decision making, a full complexity model is usually not the best choice, not only because it lacks interpretability but also because the data and computational resources available to optimize the model are limited. It is sensible to let the model automatically adjust the trade-off coefficient, to avoid scenarios in which a very complex model is used to fit simple data or a simple model is used to fit complex data.
4.1.3 Experiment III: Performance in fitting actual marginal value functions
This experiment studies the ability of the NN-MCDA model to reconstruct the actual marginal value functions. From Experiments I and II, we find that the NN-MCDA model with degree equal to 3 has a good balance between prediction performance and computational cost. Therefore, in this experiment, we generate four typical synthetic models with different complexities. Each hypothetical model has three marginal value functions to estimate. Then, we use the synthetic models to generate datasets with the same attribute vectors (model input). We approximate the marginal value functions using an NN-MCDA model with a degree of 3. We also compare the obtained function with a baseline linear regression model. The four synthetic models (from the simplest to extremely complicated) are described as follows:
Synthetic model 1: Three linear marginal value functions. The global scores are the linear summation of three marginal values without interactions.
Synthetic model 2: Three polynomial functions of degree 3. The global scores are the linear summation of three marginal values with all possible pairwise interactions.
Synthetic model 3: A polynomial function of degree 15, a sigmoid function and an exponential function. The global scores are the linear summation of three marginal values without interactions.
Synthetic model 4: Synthetic model 3 with pairwise and triple-wise attribute interactions.
Figures 4.1.3 to 4.1.3 reveal the actual and fitted marginal value functions obtained by the proposed NN-MCDA model and the baseline linear regression model. For a simple model with linear marginal value functions (Synthetic model 1), both the baseline linear regression model and the NN-MCDA can fit the actual functions well. Both models successfully capture the monotonicity of the original marginal value functions. This indicates that the NN-MCDA model is also applicable to simple prediction tasks that do not have attribute interactions or nonlinear associations between attributes and predictions.
When attribute interactions are considered (Synthetic model 2), the proposed model outperforms the baseline linear regression model. In the first row of Figure 4.1.3, the NN-MCDA model captures the correct monotonicity changes of all three actual marginal value functions. Moreover, for the first and third attributes, it captures the concavity of the original functions. The linear regression model, in contrast, gives the opposite monotonicity to the actual functions except for the second attribute, and it does not capture the concavity of the actual functions. The poor performance of the linear regression model is due to the fact that it cannot capture the interactions between attributes, which often distorts the interpretation.
For the model with more complex marginal value functions (Synthetic model 3), the linear regression model performs rather badly. In Figure 4.1.3, the first and third attributes have a negligible impact on the prediction in the fitted linear regression model. The NN-MCDA model, on the other hand, correctly captures the main characteristics of the three attributes. Both models fail to capture the inflexion point for the second attribute. These results demonstrate that a low-degree NN-MCDA model (degree 3 in this study) can fit both lower- and higher-degree marginal value functions, because the nonlinear component helps deal with the complexities that cannot be captured by the predefined linear component. In Figure 4.1.3, when the marginal value functions are extremely complex (Synthetic model 4), the low-degree NN-MCDA cannot fully capture their characteristics, but it still outperforms the baseline linear regression model. However, real-world DM behaviors are usually not that complex (degree 15 in Synthetic models 3 and 4). We further validate the applicability and generalizability of the proposed NN-MCDA model with real datasets in the following sections.
4.2 A multiple criteria ranking problem.
The QS world university ranking organization (an annual publication of university rankings by Quacquarelli Symonds (QS), https://www.topuniversities.com) provides five carefully chosen indicators to measure universities’ capacity to produce the most employable graduates: employer reputation, employer-student connection, alumni outcomes, partnerships with employers and graduate employment rate. The metric of employer reputation, regarded as a key performance indicator, is based on over 40,000 responses to the QS Employer Survey. In this experiment, we apply the proposed NN-MCDA model to a multiple criteria ranking problem that predicts the employer reputation (human decision) using the other four quantitative indicators, which are defined as follows:
Employer-student connection (EC). This indicator involves the number of active presences of employers on a university’s campus over the past 12 months. Such presences take the form of providing students with opportunities to network and acquire information, or organizing company presentations and other self-promoting activities, which increase the probability that students participate in career-launching internships and research opportunities.
Alumni outcomes (AO). The scores are based on the outcomes produced by a university’s graduates. A university is successful if its graduates tend to produce more wealth and scientific research.
Partnerships with employers (PE). The number of citable and transformative research outputs produced by a university collaborating successfully with global companies.
Graduate employment rate (GER). This indicator is essential for understanding how successful universities are at nurturing employability. It involves measuring the proportion of graduates (excluding those opting to pursue further study or unavailable to work) in full or part time employment within 12 months of graduation.
The descriptive statistics are shown in Table 4.2.
There are pairwise comparisons among 250 universities. To determine the pre-defined degree of polynomial, we set it to 1, 2, and 3 in turn. We use the same fivefold cross-validation process to fit the model and record the averaged AUC for both the training and testing sets. We also select three baseline models: a 3-layer MLP, a logistic regression model and a GAM. The average AUC results are presented in Table 4.2. For each pre-defined degree, the full complexity model always obtains the best results, whereas the logistic regression model performs the worst. Since NN-MCDA can model attribute interactions, it slightly outperforms the GAM. For the three interpretable models, we depict the marginal value functions obtained by NN-MCDA, the linear model and GAM in Figures 4.2, 4.2 and 4.2, respectively.
In Figures 4.2 and 4.2, the vertical axis is the individual attribute’s contribution to employer reputation. For the attributes alumni outcomes, partnerships with employers and graduate employment rate, the value functions obtained by NN-MCDA and the logistic regression model exhibit a monotonically increasing trend, which is consistent with common knowledge. For the attribute employer-student connection, GAM obtains a generally increasing curve. Although it also captures a slight dip in the increasing trend for employer-student connection greater than 0.8, this effect is not as clear as in the curve captured by NN-MCDA. For the attributes alumni outcomes, partnerships with employers and graduate employment rate, GAM obtains quite different and unstable curves that are difficult to explain.
4.3 Predicting geriatric depression risk.
Depression is a major cause of emotional suffering in later life. Geriatric depression reduces older adults’ quality of life and increases the risk of acquiring other diseases and of committing suicide (Alexopoulos 2005). The existing literature has empirically studied the risk factors of geriatric depression, but few studies give detailed insight into how these risk factors affect the prevalence of geriatric depression (i.e., the shape of the marginal value functions). Understanding how each risk factor influences the risk for depression at different scales is managerially more helpful for clinical decision making aimed at preventing depression in older adults.
The Health and Retirement Study (HRS) is a nationally representative longitudinal study of US adults aged 51 years and older (Bugliari et al. 2016). It has been widely used in medical studies because of its extensive information about older adults’ demographics, health status, health care utilization and costs, and other useful variables (Pool et al. 2018). We sample the data in 2014 (). In this experiment, given five pre-determined attributes (risk factors) that have been found to be associated with geriatric depression, we want to capture the detailed effect of these risk factors at different scales, as represented by marginal value functions. The descriptive statistics are given in Table 4.3.
Respondents with CES-D scores higher than or equal to 1 are assumed to be at risk for depression (positive samples). In total, there are 9,816 positive samples and 7,880 negative samples. We randomly choose 90% of both the positive and the negative samples to train the model. We train the NN-MCDA, MLP, GAM, and logistic regression models 30 times each and present the average results in Table 4.3. The obtained marginal value functions are visualized in Figures 4.3, 4.3, and 4.3.
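The class-wise 90% sampling described above can be sketched as a stratified split. This is an illustrative helper, not the code used in the study; the function name, the rounding rule and the random seed are assumptions, and the labels are synthetic placeholders with the reported class sizes.

```python
import numpy as np


def stratified_split(y, train_frac=0.9, seed=0):
    """Sample `train_frac` of the indices of every class for training."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    train_parts = []
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        n_train = int(round(train_frac * len(idx)))
        train_parts.append(idx[:n_train])
    train_idx = np.sort(np.concatenate(train_parts))
    test_idx = np.setdiff1d(np.arange(len(y)), train_idx)
    return train_idx, test_idx


# Placeholder labels with the class sizes reported in the text.
y = np.array([1] * 9816 + [0] * 7880)
train_idx, test_idx = stratified_split(y)
print(len(train_idx), len(test_idx))
```

Repeating the split with different seeds and averaging the resulting test AUCs would correspond to the repeated training runs reported in the text.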
We first analyze the conclusions shared by the interpretable models. For the last four attributes, both the NN-MCDA and logistic regression models capture similar monotonic trends. For degree of education and marital status, the curves lie below the baseline rate, indicating that higher education and a longer marriage can reduce the risk for depression. This is consistent with the medical literature (Penninx et al. 1998, Ladin 2008). Both models also find that the attributes out-of-pocket expenditure and body mass index increase the risk for depression. Since the obtained value functions are convex, the risk grows faster as these attribute values increase.
Both the NN-MCDA and logistic regression obtain a convex curve for attribute age with part of the curve being negative and the rest being positive (see Figures 4.3 and 4.3). This indicates that the risk of depression does not increase while aging if the adult is younger than a threshold. The risk of depression increases fast after an adult passes an age threshold. The threshold for NN-MCDA is 71.58, which makes sense because most adults younger than 71.58 could be enjoying their retirement and their body functions do not degrade much. However, the threshold for the logistic regression is 95.83, which seems unrealistic and inconsistent with the literature (Blazer et al. 1991).
As in the previous experiment, GAM obtains less stable curves, which are more difficult to interpret. For the attribute age, GAM obtains a pattern similar to those of the NN-MCDA and logistic regression models, but with an age threshold around 90, which is inconsistent with the literature (Blazer et al. 1991). For the attributes degree of education and marital status, GAM even obtains counter-intuitive (if not wrong) results, indicating that increases in education and length of marriage result in a higher risk of depression.
The experiments with real data demonstrate that the proposed NN-MCDA can effectively capture the patterns in human decision behavior through learning the marginal value functions, which characterize the contribution of individual attributes to the predictions. The NN-MCDA presents good potential in enhancing the empirical studies through providing a detailed marginal value function instead of a single coefficient for each attribute. Moreover, the prediction performance of NN-MCDA is close to a full complexity model (MLP), and much better than that of baseline interpretable models (GAM and logistic regression model).
In this section, we summarize the insights from the experiments, discuss the use and extension of the proposed NN-MCDA model, and compare the proposed NN-MCDA with ensemble learning.
5.1 Attribute importance.
To explain the importance of each attribute, we present the normalized attribute weights obtained from the previous experiments in Figure 5.1. In the university ranking problem, NN-MCDA assigns 0.2732, 0.228, 0.258, and 0.239 to the attributes employer-student connection, alumni outcomes, partnerships with employers and graduate employment rate, whereas a regression model assigns 0.3198, 0.3107, 0.1906, and 0.1789 to them, respectively. Both models determine that employer-student connection is the most important attribute; however, they give different orders for the other three attributes. Given limited resources and the obtained attribute importance, maintaining a good employer-student connection with frequent employer presences on campus is the most effective way for a university to achieve a good employer reputation.
As for the depression prediction problem, a regression model assigns almost equal importance to the five attributes, which is not intuitive for the DM. The NN-MCDA, on the other hand, provides an order of the attributes according to their importance: (0.221, 0.218, 0.208, 0.198, 0.155). This order suggests that obesity and the burden of expenditure are the most important risk factors for becoming depressed. When estimating an older adult’s risk for depression, a DM (general physicians, specialists and geriatricians) should prioritize problems related to the older adult’s body weight and economic conditions.
5.2 Interpreting the trade-off coefficient.
The trade-off coefficient is a key quantity, and its value is important in practical applications. It reflects the influence of high-order interactions and complex nonlinearity of variables on the final decisions. In this regard, the NN-MCDA model can be used to explore the complexity of the learning problem. If the value of the coefficient at convergence is very small, it indicates that the data are highly complex and the assumption of preference independence is not valid. The DM should then use models that account for attribute interactions, such as a Choquet integral-based model (Aggarwal and Fallah Tehrani 2019) or full complexity machine learning models. Conversely, if the coefficient is close to 1, the DM is recommended to use a simpler model to avoid massive computational time and non-interpretable results.
In the case of prediction for depression (Table 4.3), the value of the coefficient at convergence is around 0.4. This indicates that the attributes involved in this problem possibly interact. Related medical studies have also empirically demonstrated interactions between attributes. For example, older adults who are relatively young and have a higher degree of education may be involved in more social activities and have a more contented life, which leads to a lower risk for depression (Li et al. 2014). Moreover, older adults with a longer marriage obtain more family support and thus have a lower risk for depression (Pearlin and Johnson 1977). We usually address pairwise interactions because they are easier to interpret and can be visualized with a heat map (Caruana et al. 2015). Given the converged coefficient and the marginal value functions obtained in the presence of high-order correlations among attributes, we could develop algorithms that use lower-order attribute interactions, e.g., pairwise and triple-wise interactions, to approximate the higher-order ones. The framework can then be extended to a two-step procedure: determining the marginal value functions and then deducing possible lower-order correlations.
5.3 Extending the NN-MCDA framework.
The proposed NN-MCDA presents a general modelling framework that can be easily extended to enhance performance and adaptivity for various problems. In this section, we discuss three extensions: adding regularizations, replacing the model in the nonlinear component, and incorporating attributes in the nonlinear component.
5.3.1 Adding regularizations
For some mission-critical cases where the data are complex, the value of the trade-off coefficient at convergence could be very small. However, the DM still requires a certain level of model interpretability to facilitate decision making. Therefore, we opt for adding a regularization term to prevent the model from being too complicated and non-interpretable. The inclusion of the regularization term also helps prevent over-fitting. For example, we can revise the original MSE by adding a regularization term that allocates more weight to the linear component at the cost of lower fitting accuracy. We can also change the regularization term to one that favors equal weights for the linear and nonlinear components, which leads to a balanced model. The exact form of the loss function should be selected according to the problem settings.
5.3.2 Replacing the model in the nonlinear component.
Given different types of datasets, the proposed NN-MCDA model can be modified by replacing the neural network in the nonlinear component with other network structures. In subsection 4.1.1, NN-MCDA has difficulty in handling extremely complex data (the second and third synthetic datasets). To improve the performance on such data, we can introduce more layers in the MLP or increase the number of neurons in each layer. For image classification problems, we can replace the MLP with a convolutional neural network (CNN) and use the features obtained from the CNN as the input for the linear component. To fit time-series or free-text data, we can replace the MLP with a recurrent neural network.
In addition, the proposed model can be progressively modified by iteratively interacting with the DM. We provide a user-interactive process to determine the ultimate model. The framework is shown in Figure 5.3.2 and explained as follows:
Step 1. We first apply the NN-MCDA model to the management problem. Once the model converges, we obtain the value of the trade-off coefficient.
Step 2. If the coefficient is not very small, go to Step 3. If it is very small, which indicates that the data are potentially very complex, we opt for a full complexity black-box model to achieve higher accuracy and present the results to the DM. If the DM agrees to use the black-box model, the process ends. Otherwise, we add a regularization term to the original NN-MCDA and go back to Step 1.
Step 3. If the coefficient is not close to 1, go to Step 4. If it is close to 1, which indicates that an interpretable model is sufficient to fit the data, we explain the results to the DM. If the DM is satisfied with the accuracy, the process ends. Otherwise, we modify the NN-MCDA model by replacing the model in the nonlinear component (e.g., using a deeper MLP or another neural network-based model) and go back to Step 1.
Step 4. If the coefficient takes an intermediate value, we present the underlying model and results to the DM. If there are no further requirements, the process ends. If the DM requires further modifications (such as adding regularization terms or modifying the nonlinear component), we modify the model accordingly and go back to Step 1.
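The interactive procedure above can be summarized as a small decision function. This is only a sketch of the control flow: the threshold values `low` and `high`, the function name and the returned action strings are hypothetical placeholders and do not come from the study.

```python
def next_action(alpha, dm_accepts, low=0.2, high=0.8):
    """Map the converged trade-off coefficient and the DM's answer to an action.

    `alpha` is the weight of the linear component in [0, 1]; `low` and `high`
    are illustrative thresholds for "very small" and "close to 1".
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if alpha < low:
        # Data appear very complex: propose a black-box model.
        if dm_accepts:
            return "stop: use the black-box model"
        return "add a regularization term and retrain"
    if alpha > high:
        # An interpretable model appears sufficient.
        if dm_accepts:
            return "stop: use the interpretable model"
        return "replace the nonlinear component and retrain"
    # Intermediate value: present NN-MCDA as it is.
    if dm_accepts:
        return "stop: use the NN-MCDA model"
    return "modify the model as requested and retrain"


print(next_action(0.4, dm_accepts=True))
```

Each "retrain" outcome corresponds to returning to the first step of the procedure with the modified model.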
5.3.3 Flexible inclusion of attributes in the nonlinear component
In practice, the human decision behavior usually focuses on a small number of key attributes/criteria (Ribeiro et al. 2016). However, there could exist other minor attributes that do not directly contribute to the prediction, but could affect the prediction through non-traceable complex interactions with other attributes (for example, the interaction between the nonlinear transformation of an attribute and the nonlinear transformation of five other attributes). These minor attributes can be incorporated by the nonlinear component.
In the geriatric depression experiment, the gender of an older adult may not directly indicate a difference in the risk for depression; however, it might still influence the prediction through complex interactions with other attributes. We further extend the NN-MCDA model to incorporate gender and smoking status into the nonlinear component (as shown in Figure 5.3.3). We find that incorporating these attributes indeed improves the prediction accuracy (the AUC on the testing set increases from 0.669 to 0.675) while maintaining similar marginal value functions in the linear component (see Figure 5.3.3). If we add two more attributes, for instance whether the respondent received any home care in the last two years and whether a health problem limited his/her work, the AUC increases more markedly (from 0.675 to 0.708) and the marginal value functions still provide convincing results (see Figure 5.3.3).
5.4 Joint training process.
In the NN-MCDA model, the linear and nonlinear components are combined through the trade-off coefficient. Their sum is then fed to a common logistic function for a joint training process. Note that this joint training process is different from ensemble learning (Cheng et al. 2016), in which multiple classifiers are trained individually and their predictions are simply combined after every model has been optimized separately. For example, an ensemble learning approach could use a linear logistic regression model and an MLP model to make predictions for the same dataset separately, and then integrate the prediction results of the two models. Joint training means that the linear and nonlinear components are connected: when we tune the parameters in one component, the other component is affected. Once the model reaches its global optimum, predictions can be made.
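The difference between the joint model and an ensemble can be illustrated with a forward pass. The sketch below is an assumption-laden illustration rather than the NN-MCDA implementation: we assume the combination takes the form alpha × linear + (1 − alpha) × nonlinear, use a one-hidden-layer MLP with tanh activations, treat the input as an already transformed attribute matrix, and use simple averaging as one possible way of combining ensemble predictions. All names and shapes are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def linear_component(X, coefs):
    """Sum of polynomial marginal values: sum_j sum_d coefs[j, d-1] * x_j**d."""
    degree = coefs.shape[1]
    powers = np.stack([X ** d for d in range(1, degree + 1)], axis=2)  # (n, m, D)
    return np.einsum("nmd,md->n", powers, coefs)


def mlp_component(X, W1, b1, w2, b2):
    """One-hidden-layer network returning a scalar score per sample."""
    return np.tanh(X @ W1 + b1) @ w2 + b2


def joint_predict(X, alpha, coefs, W1, b1, w2, b2):
    """Joint model: both scores pass through ONE logistic function, so a
    gradient of the loss reaches the parameters of both components."""
    u = linear_component(X, coefs)
    v = mlp_component(X, W1, b1, w2, b2)
    return sigmoid(alpha * u + (1.0 - alpha) * v)


def ensemble_predict(X, coefs, W1, b1, w2, b2):
    """Ensemble: each model has its own logistic output (and would be trained
    separately); only the final predictions are averaged."""
    p_lin = sigmoid(linear_component(X, coefs))
    p_mlp = sigmoid(mlp_component(X, W1, b1, w2, b2))
    return 0.5 * (p_lin + p_mlp)


rng = np.random.default_rng(0)
X = rng.uniform(size=(4, 3))
coefs = rng.normal(size=(3, 3))            # degree-3 marginal value functions
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)
w2, b2 = rng.normal(size=5), 0.0
print(joint_predict(X, 0.4, coefs, W1, b1, w2, b2))
```

Under this assumed form, setting alpha to 1 reduces the joint model to the linear component alone and setting it to 0 reduces it to the MLP alone, which matches the two extreme cases examined in Experiment II.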
6 Conclusion and future work.
In this paper, we proposed a framework for an interpretable model, named NN-MCDA, which combines a traditional MCDA model with neural networks. MCDA uses marginal value functions to describe the contribution of individual attributes to the predictions, while the neural network considers high-order interrelations among attributes. The framework automatically balances the trade-off between the two components. NN-MCDA is more interpretable than a full complexity model and maintains similar predictability.
We present simulation experiments to demonstrate the effectiveness of NN-MCDA. The experiments show that (1) polynomials of higher degree do not always improve accuracy; (2) there is a trade-off between the interpretability and the predictability of the model, and NN-MCDA can achieve a good balance between them; and (3) given simple data, NN-MCDA performs as well as an interpretable model, while given more complex data, NN-MCDA outperforms an interpretable model. We also show how to apply the NN-MCDA framework to real-world decision making problems. These experiments with real data demonstrate the good prediction performance of NN-MCDA and its ability to capture the detailed contributions of individual attributes.
To the best of our knowledge, this research is the first to introduce the interpretability into machine learning models from the perspective of MCDA. The proposed framework sheds light on how to use MCDA techniques to enhance the interpretability of machine learning models, and how to use machine learning techniques to free MCDA from strong assumptions and enhance its generalizability and predictability.
We envisage the following directions for future research based on the NN-MCDA framework. First, we can further enhance the interpretability of the model by proposing algorithms that approximate the attribute interactions after the marginal value functions have been obtained. Second, additional simulations are needed to validate the effectiveness of the NN-MCDA variants introduced in the discussion section. Last but not least, applying the proposed framework to a variety of real-world decision making and prediction problems constitutes another interesting direction for future work.
- Aggarwal and Fallah Tehrani (2019) Aggarwal M, Fallah Tehrani A (2019) Modelling human decision behaviour with preference learning. INFORMS Journal on Computing 31(2):318–334, URL http://dx.doi.org/10.1287/ijoc.2018.0823.
- Alexopoulos (2005) Alexopoulos GS (2005) Depression in the elderly. The Lancet 365(9475):1961–1970.
- Angilella et al. (2014) Angilella S, Corrente S, Greco S, Słowiński R (2014) MUSA-INT: Multicriteria customer satisfaction analysis with interacting criteria. Ome