1 Introduction.
Machine learning has recently been widely adopted to address challenging decision-making problems in a variety of managerial contexts, such as marketing (Cui and Curry 2005, Cui et al. 2006), credit-risk evaluation (Baesens et al. 2003), and healthcare management (Gartner et al. 2015). Many machine learning models, such as support vector machines (SVMs) (Cortes and Vapnik 1995), boosted trees (Friedman 2001), and neural network-based methods (Rumelhart et al. 1988, LeCun et al. 2015), are applied to diverse real-world prediction problems due to their capacity to analyze high-dimensional data. Although higher complexity usually brings higher accuracy, it comes at the expense of interpretability (Lou et al. 2012). In practice, model interpretability is as important as (if not more important than) accuracy in many mission-critical applications such as clinical decision-making, in which understanding how the model makes a prediction is key to helping physicians trust the model and utilize its results (Caruana et al. 2015, Fox et al. 2007). Recently, technology giants such as Google, IBM, and Microsoft have been investigating techniques for enhancing model interpretability (Mohseni et al. 2018). As stated in a comprehensive overview by David Gunning, program manager in the Information Innovation Office (I2O) of the Defense Advanced Research Projects Agency (DARPA), "machine learning models are opaque, non-intuitive and difficult for people to understand" (Gunning 2017). DARPA has since funded the development of interpretable machine learning techniques in academia. In DARPA's latest budget plan, explainable artificial intelligence (XAI) is listed as a key funding area for fiscal years 2019-2020, with a total of 26.05 million US dollars.^1

^1 https://www.darpa.mil/aboutus/budget

1.1 Benefits of interpretable models.
Both machine learning and management research could benefit from model interpretability. First, an interpretable model is trustworthy because it exploits patterns and rules that are consistent with prior human knowledge and experience. Unreasonable learned patterns and rules can be easily identified and corrected by a decision maker (DM). If the model makes mistakes on cases that the DM can easily classify correctly, it can be supervised and modified by the DM (Ribeiro et al. 2016). Second, an interpretable model helps in understanding causality (Miller 2018). Interpretable models can extract the associations between predictors and predictions, which can facilitate downstream managerial decision making. Third, an interpretable model incorporates the DM's domain knowledge. A DM usually possesses rich domain knowledge but not the technical skills to construct a model. An interpretable model can be used to learn a DM's decision behavior through tuning key parameters and then provide an in-depth understanding of the data and patterns (Aggarwal and Fallah Tehrani 2019).

1.2 Multiple criteria decision aiding.
Multiple criteria decision aiding (MCDA)^2 has been a fast-growing area of operational research during the last several decades (Dyer et al. 1992, Figueira et al. 2005, Wallenius et al. 2008, Saaty 2013, Ramesh et al. 1988, 1989). It involves a finite set of alternatives (e.g. actions, items, policies) that are evaluated on a set of conflicting criteria or attributes^3. The DM's decision is driven by his/her underlying global value (utility) function (Keeney 1976, Keeney and Raiffa 1993). This global value measures the DM's desirability for an alternative and can be disaggregated into a set of per-attribute marginal value functions that represent the DM's evaluation of the corresponding attribute (Kadziński et al. 2017). These marginal value functions can be learned from the DM's judgments on learning examples (e.g. pairwise comparisons between two alternatives). Once the marginal value functions are deciphered, we can understand the decision-making rationale, based on which we can predict the judgment of the DM. This process is referred to as the preference disaggregation approach of MCDA.

^2 Multiple criteria decision aiding is also known as multiple criteria decision making. In this paper, we use multiple criteria decision aiding (the "European school") for consistency (Vincke 1986).
^3 In machine learning, criteria refer to attributes or features with preference-ordered scales (Corrente et al. 2013). For consistency, we use "attribute" in this paper.
Many machine learning frameworks can help MCDA accomplish its learning objectives because both aim to learn a decision model from data. Thus, MCDA and machine learning naturally have reciprocal interactions (Doumpos and Zopounidis 2011). MCDA and machine learning are integrated in two directions. First, machine learning techniques can be applied to various tasks in a decision aiding context, such as learning to rank and multi-label classification. The opposite direction is to implement MCDA concepts in a machine learning framework; there is a growing tendency to utilize MCDA approaches to adapt machine learning models to various topics, such as feature selection and extraction, pruning decision rules, and multi-objective optimization. Our work belongs to the second stream. We aim to construct a hybrid model that utilizes value function-based preference disaggregation approaches of MCDA to enhance the interpretability of "black-box" machine learning models.
The motivation for introducing the value function-based preference disaggregation approaches of MCDA to machine learning stems from their powerful capacity to depict the human decision-making process. The deciphered marginal value functions reveal the rationale of the DM's judgment, and thus provide convincing evidence for comprehending the decision-making behavior (Aggarwal and Fallah Tehrani 2019, Lou et al. 2012). Our task of learning an interpretable model is essentially to capture the characteristics of the marginal value functions, from which we obtain a certain degree of interpretability. This study differs from statistical models for management problems in that we utilize the characteristics of the marginal value function (instead of a single coefficient) to represent the effect of each attribute on the outcomes. For example, suppose that a hypothetical DM's preference system is composed of the four marginal value functions in Figure 1.2. We can analyze the DM's preference from the following perspectives. First, we focus on the ranges of the marginal values. If the marginal values are close to 0 (see marginal value function 1), the corresponding attribute is either not important to the DM or its characteristic has been wrongly captured. Further interaction with the DM is needed to determine whether to keep this variable or calibrate the model. Unlike statistical model selection methods (e.g. LASSO, the Bayesian information criterion (Friedman et al. 2001)), incorporating the DM's domain knowledge invokes an interactive model calibration process (Stewart 1993, Wallenius et al. 2008, Doumpos and Zopounidis 2011). Second, the increasing and decreasing tendencies of the marginal value function curves unveil the changes in the DM's preference. Moreover, negative and positive marginal values directly show the negative and positive effects of the attributes on the outcomes (see marginal value functions 2 and 3). Statistical models usually generate a fixed coefficient that cannot capture such preference inflexion points. Third, the convexity and concavity of the marginal value function are crucial for interpreting the DM's rational behavior in the decision-making process (see marginal value function 4). To summarize, unlike traditional statistical models, MCDA aims to extract more interpretable patterns of the DM's behavior and builds a solid link between the underlying model and the actual decision-making process, thus facilitating the effective use of the model in real-world decision making.
1.3 An overview of this paper.
This paper proposes a framework for a Neural Network-based Multiple Criteria Decision Aiding (NN-MCDA) approach. NN-MCDA combines an additive model and a fully-connected multilayer perceptron (MLP) to achieve both model interpretability and complexity. The additive model is learned from the value function-based preference disaggregation models of MCDA. It uses marginal value functions to approximate the relationship between the outcome and individual attributes, whereas the MLP captures the high-order correlations between attributes. We estimate the parameters in the model under a neural network framework that automatically balances the trade-off between the two components.
We test the proposed model on a set of synthetic datasets and two real datasets. The simulation experiments show the impact of the predefined parameters on the model and the goodness of fit when the data is either extremely complex or very simple. Two real datasets, on ranking universities with regard to employment reputation and on predicting the risk of geriatric depression, illustrate the proposed model in real cases. We explain the obtained models and compare them with other interpretable models, i.e., GAM and logistic regression models.
The contributions of this paper are fourfold. First, we advocate a new perspective on interpretable models that both quantifies the impact of individual attributes on the outcome and captures possible high-order correlations between attributes. It helps the DM understand the main effects of individual attributes and make better decisions. Second, to the best of our knowledge, this paper is the first pilot work to introduce the value function-based preference disaggregation approaches of MCDA into machine learning models to enhance model interpretability. The trained parameters in the proposed framework determine the shapes of the marginal value functions in the additive model. The proposed model is free from the preference independence, preference monotonicity, and small learning set assumptions of MCDA approaches, which makes MCDA approaches more general and practical for real-world management problems. Third, we examine the model's effectiveness under different model parameters and datasets. The empirical conclusions about the relationships between model interpretability and data complexity are managerially intuitive for future research. Fourth, the proposed framework is flexible and extensible, especially the nonlinear part, which can be modified or replaced by other models according to different types of data. Our work is instructive for developing interpretable models in both management and computational science.
The rest of the paper is organized as follows. We discuss the related work in Section 2. In Section 3, we introduce the framework for the proposed interpretable model. The simulation and real-case experiments are presented in Section 4, and some discussions of the proposed framework are provided in Section 5. We conclude the paper in Section 6.
2 Related work.
2.1 Value function-based preference disaggregation approach of MCDA.
The value function-based preference disaggregation approaches of MCDA provide explicit marginal value functions and numerical scores. A DM can understand the importance of a particular attribute and how the individual attributes contribute to the final decision. This procedure encourages the DM to participate in the decision-making process, and it provides a comprehensive preference model (Corrente et al. 2013). These approaches have been successfully applied in many scenarios, such as consumer preference analysis (Hauser 1978), financial decisions (Zopounidis et al. 2015), nanoparticle synthesis assessment (Kadziński et al. 2018) and territorial transformation management (Ciomek et al. 2018). However, the applications of value function-based preference disaggregation approaches are limited by some strong assumptions, namely (1) preference independence, (2) monotonic preferences, and (3) a small set of alternatives.
Recently, many novel models have been proposed to generalize the value function-based preference disaggregation approaches of MCDA. Preference independence is what allows the model to be additive. To handle interacting attributes, Angilella et al. (2010) utilize a fuzzy measure to model the preference system, where the alternatives are evaluated in terms of the Choquet integral. However, it is difficult for the DM to understand the impact of an individual attribute evaluated through the Choquet integral. Angilella et al. (2014) account for positive and negative interactions among attributes and add an interaction term to the additive global value function for each alternative. They require the DM to provide some knowledge about the interacting pairs, which are then mined by the models. These studies only consider interactions between pairs of attributes because higher-order interactions require more cognitive effort and more computational cost.
The majority of existing studies assume that the marginal value functions are monotonic and piecewise linear. This assumption reduces model complexity, but it fails to describe preference inflexions. To address this problem, Ghaderi et al. (2017) and Liu et al. (2019) relax this assumption and constrain the variation of the slope to obtain nonmonotonic marginal value functions without serious overfitting. Both approaches obtain non-smooth value functions from which it is difficult to interpret attitudes towards risk, due to the use of non-differentiable functions. Since a differentiable marginal value function is essential for analyzing consumer behavior, Sobrie et al. (2018) utilize semidefinite programming to infer the key parameters of polynomial marginal value functions. This gives a more flexible and interpretable preference model. However, it still assumes that the DM's preference is monotonic.
The monotonic piecewise linear form of the marginal value functions has low expressiveness for large learning sets (Sobrie et al. 2018). Nowadays, MCDA approaches are expected to deal with large amounts of data in many disciplines (Pelissari et al. 2019, Liu et al. 2019). Liu et al. (2019) embed the MCDA approach into a regularization framework to approximate marginal value functions of any piecewise linear shape, and provide efficient algorithms to handle larger learning sets.
Most existing studies focus on extending MCDA approaches from only one perspective. Compared with these recent advances, the proposed framework aims to address all the aforementioned limitations of MCDA by providing a nonmonotonic, smoother, and more expressive MCDA approach for real-world applications with more complex decision-making scenarios.
2.2 Interpretable models.
There is usually a trade-off between model interpretability and prediction accuracy (shown in Figure 2.2). Interpretable machine learning, or XAI, aims to create a suite of techniques that produce more explainable models while maintaining high accuracy (Gunning 2017).
The generalized additive model (GAM) uses a link function to build a connection between the mean of the prediction and a smooth function of the predictors (Hastie and Tibshirani 1986). It can both model and present nonlinear and nonmonotonic relationships between the predictors and the prediction (Lou et al. 2012). Therefore, GAM is usually more accurate than linear additive models. Although GAM does not outperform full complexity models, it possesses more interpretability than these "black-box" models. Lou et al. (2013) explore the effect of pairwise interactions and apply the improved GAM to predicting pneumonia risk and 30-day readmission. This model helps the DM (physician) find useful patterns in the data and quantifies the contributions of individual attributes. Based on these promising results, they argue that it is necessary to develop more interpretable models in mission-critical applications such as management problems (Caruana et al. 2015).
Another solution is to infer a new model that approximates the true black-box model. The new model may not be as accurate as the original black-box model, but it can identify patterns and rules that explain how the predictions are made. In Baesens et al. (2003), explanatory rules are extracted to help credit-risk managers explain their decisions. Similarly, Letham et al. (2015) discretize a high-dimensional attribute space into a series of simpler, interpretable if-then statements. They first make predictions using complex machine learning techniques and then use Bayesian rule lists to reconstruct the predictions. Given approximately accurate predictions, the obtained model is more interpretable.
According to Ribeiro et al. (2016), why and how a model produces a prediction are important for the DM to trust the underlying model. An interpretable model should enable the DM to answer these questions and give the reasons behind a prediction. In this regard, they develop an algorithm named LIME, which approximates a prediction locally with a simpler model, for instance a linear model that is easier to interpret. It is extensible to explaining the predictions of any model in an interpretable manner.
2.3 Machine learning in MCDA.
There have been a few attempts to integrate machine learning algorithms with MCDA. In a pioneering work by Wang and Malakooti (1992), a single-layer feedforward artificial neural network is proposed to learn MCDA objectives. The advantage of neural networks is that they are independent of functional form assumptions. However, the network only gives a final recommendation without any interpretable marginal value functions or patterns. Doumpos and Zopounidis (2011) explore the differences and similarities between machine learning and MCDA. Although several studies introduce MCDA into machine learning models, few utilize MCDA concepts to enhance the interpretability of machine learning models.
As a new subfield of machine learning, preference learning has attracted considerable attention from the MCDA community. Corrente et al. (2013) explore the relationship between MCDA and preference learning. They find that the higher performance of machine learning models is usually associated with a lower degree of interpretability, which negatively affects confidence in employing machine learning models in scenarios where we need to understand the underlying process. A recent study utilizes preference learning to model human decision behavior under an MCDA framework. Such a model can facilitate the understanding of the DM's behavior by tuning well-defined model parameters (Aggarwal and Fallah Tehrani 2019).
3 Framework for the intelligible model.
Let $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ be the training dataset of size $N$, $\mathbf{x}_i = (x_{i1}, \dots, x_{im})$ be the $i$th attribute vector with $m$ attributes^4, and $y_i$ be the target/response value. In this study, we consider a binary classification problem where $y_i \in \{0, 1\}$. The proposed framework can be easily extended to multi-class classification and regression problems.

^4 In MCDA, $\mathbf{x}_i$ is called an alternative with $m$ criteria/attributes.
3.1 The additive model.
The value function-based preference disaggregation approaches of MCDA assume that for each attribute vector $\mathbf{x}_i$, there is a global value function in the following form:

$$U(\mathbf{x}_i) = \sum_{j=1}^{m} w_j v_j(x_{ij}), \quad (1)$$

where $w_j$ represents the importance of the $j$th attribute and $v_j(\cdot)$ is a marginal value function. Note that we rely on the shape and positive/negative effect of the marginal value function to capture the contribution of individual attributes. Thus we restrict the weight $w_j$ to be positive so that it represents the relative importance of the $j$th attribute, while the marginal value function itself can positively or negatively affect the global value. The global value function linearly sums the contributions of the individual attributes^5.

^5 In GAM, $v_j(\cdot)$ is called a shape function and the function mapping the global value to the prediction is called a link function.
Although the global value function is in an additive linear form, the marginal value functions themselves can take any form, often nonlinear. It has been recognized that the preference in human decision behavior is rational, and thus the marginal value functions should be stable and smooth. In the literature, the marginal value function can take a simple linear (weighted sum) form (Saaty 1990, Saaty 2013, Korhonen et al. 2012), monotonic and nonmonotonic piecewise linear forms (Stewart 1993, Jacquet-Lagreze and Siskos 2001, Greco et al. 2008, Ghaderi et al. 2017, Liu et al. 2019), and a monotonic polynomial form (Sobrie et al. 2018). To capture the first-order (e.g. monotonicity) and second-order (e.g. marginal rate of substitution) derivative patterns of the attributes' contributions to the prediction, we extend and generalize state-of-the-art MCDA models (Liu et al. 2019, Sobrie et al. 2018) to allow the marginal value functions to take any polynomial form. In this paper, we allow the $j$th marginal value function to take a smooth and nonmonotonic polynomial form of degree $d_j$:

$$v_j(x_{ij}) = \sum_{k=1}^{d_j} \beta_{jk}\, x_{ij}^{k}, \quad (2)$$

where $\beta_{jk}$ is the coefficient of the $k$th degree term and $d_j$ is the highest degree for the $j$th attribute.
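As a minimal illustration, a polynomial marginal value function of the form in Eq. (2) can be evaluated as follows; the coefficients here are hypothetical, chosen only to produce a nonmonotonic shape, and the function name is ours.

```python
# Sketch of a polynomial marginal value function v_j, cf. Eq. (2).
# The coefficients (betas) are hypothetical, for illustration only.
def marginal_value(x, betas):
    """v(x) = sum_{k=1}^{d} betas[k-1] * x**k."""
    return sum(b * x ** k for k, b in enumerate(betas, start=1))

# A degree-2 example, v(x) = x - x^2: increasing then decreasing on [0, 1],
# i.e., a nonmonotonic preference that peaks at x = 0.5.
values = [marginal_value(x / 10, [1.0, -1.0]) for x in range(11)]
peak = max(range(11), key=lambda i: values[i])  # index of the peak (x = 0.5)
```

Such a curve could not be represented by a single linear coefficient, which is exactly the expressiveness argument made above.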
The motivation for using Eq. (2) as a marginal value function is twofold. First, we enhance the expressiveness of the preference model to capture nonmonotonic preferences. For example, piecewise linear or monotonic polynomial functions fail to restore all the information in a larger learning set (Sobrie et al. 2018). The nonlinearity and nonmonotonicity of Eq. (2) can better fit complex relationships between the attributes and the outcome, leading to better model performance. Second, when analyzing human behavior, it is critical to examine the trade-offs, or marginal rates of substitution, studied in economics and management. A non-differentiable value function, for instance the boosted bagged trees model in Lou et al. (2012), cannot capture the inflexion point where the marginal rate of substitution grows or diminishes more quickly (Keeney and Raiffa 1993). A model that exploits such behavioral patterns is more convincing and has more managerial meaning for the DM in management scenarios.
3.2 Neural network-based MCDA.
Full complexity models perform well on many machine learning tasks because they can model both the nonlinearity and the interactions between attributes. An additive model like Eq. (1) does not model any interactions between attributes. Therefore, we propose a neural network-based multiple criteria decision aiding (NN-MCDA) model in the following form:

$$U(\mathbf{x}_i) = F\big(\lambda V(\mathbf{x}_i) + (1 - \lambda)\, g(\mathbf{x}_i)\big), \quad (4)$$

where $U(\mathbf{x}_i)$ is the global score of $\mathbf{x}_i$, $g(\cdot)$ is a latent function of all attributes, and $\lambda \in [0, 1]$ is a trade-off coefficient. Eq. (4) describes (a) a regression model if $F$ is the identity, and (b) a classification model if $F$ is the logistic function of the identity. $g(\cdot)$ is used to capture the high-order interrelations between attributes. We can use any full complexity model to fit $g(\cdot)$ for better performance; in this paper we use an MLP (Rosenblatt 1958). An MLP form of $g(\cdot)$ is not transparent, meaning that we do not know its exact structure. Since the additive component $V(\cdot)$ captures the explainable form of the marginal value functions, the non-transparent $g(\cdot)$ describes the complex patterns that are not readily useful to the DM. The coefficient $\lambda$ balances the trade-off between $V(\cdot)$ and $g(\cdot)$. If $\lambda$ is close to 1, the model tends toward a simple additive MCDA form; if $\lambda$ is close to 0, we obtain a full complexity model.
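To make the role of the trade-off coefficient concrete, here is a small sketch of the combination in Eq. (4), assuming the additive score and the latent score have already been computed; the function and parameter names are ours, not the paper's.

```python
import math

def global_score(v_additive, g_latent, lam, classification=True):
    """U = F(lam * V + (1 - lam) * g), cf. Eq. (4): F is the logistic
    function for classification and the identity for regression."""
    z = lam * v_additive + (1.0 - lam) * g_latent
    return 1.0 / (1.0 + math.exp(-z)) if classification else z

# lam = 1 recovers the purely additive MCDA score (regression link);
# lam = 0 recovers the full complexity model.
```

The point of the sketch is only that the two components enter through a convex combination, so a single scalar controls how interpretable the overall model is.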
The joint training process is shown in Figure 3.2. The input attribute vectors are first transformed into polynomial form, i.e., $\mathbf{z}_{ij} = (x_{ij}, x_{ij}^{2}, \dots, x_{ij}^{d_j})$. In the input layer, a single-layer network without any activation functions is provided to reconstruct Eq. (1). It has $\sum_{j=1}^{m} d_j$ units, and the weight of each unit corresponds to a particular $\beta_{jk}$. We denote the output of the linear component by $V(\mathbf{x}_i)$:

$$V(\mathbf{x}_i) = \sum_{j=1}^{m} w_j\, \boldsymbol{\beta}_j^{\top} \mathbf{z}_{ij}, \quad (5)$$

where $\boldsymbol{\beta}_j = (\beta_{j1}, \dots, \beta_{jd_j})$ is the vector of coefficients of the $j$th polynomial marginal value function, $\mathbf{w} = (w_1, \dots, w_m)$ is the vector of attribute weights, and $\mathbf{z}_{ij}$ contains the polynomial terms of the $j$th attribute of the $i$th attribute vector. Note that Eq. (5) is actually a specific case of Eq. (1): in Eq. (1) the marginal value functions can take any shape (e.g. piecewise linear), whereas in this study we restrict them to the polynomial form in Eq. (2). Thus, Eq. (1) is a generalization of Eq. (5).
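The linear component described above can be sketched as follows, with hypothetical weights and coefficients; each attribute value is expanded into its polynomial terms before the weighted sum, mirroring Eq. (5).

```python
def polynomial_terms(x, degree):
    """Expand an attribute value into (x, x^2, ..., x^degree)."""
    return [x ** k for k in range(1, degree + 1)]

def additive_score(x_vec, weights, betas, degree):
    """V(x) = sum_j w_j * (beta_j . z_j), cf. Eq. (5)."""
    total = 0.0
    for x, w, beta in zip(x_vec, weights, betas):
        z = polynomial_terms(x, degree)
        total += w * sum(b * t for b, t in zip(beta, z))
    return total

# Two attributes, degree 2, hypothetical parameters:
# attribute 1 uses v(x) = x (weight 1), attribute 2 uses v(x) = x^2 (weight 2).
score = additive_score([0.5, 1.0], [1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]], 2)
```

In the actual framework these weights and coefficients are not set by hand but learned jointly with the MLP component.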
The nonlinear component $g(\cdot)$ is a standard MLP used to learn high-order correlations between attributes. Similarly, by summing every $d_j$ units we can obtain a marginal value on the $j$th attribute. For the activation function, we opt for the Rectifier (ReLU), which is the most commonly used activation function in neural networks (Glorot et al. 2011); other activation functions such as the Sigmoid and TanH can also be used. An $L$-layer MLP is defined as:

$$\mathbf{h}_l = a_l(\mathbf{W}_l \mathbf{h}_{l-1} + \mathbf{b}_l), \quad l = 1, \dots, L, \quad (6)$$

where $\mathbf{W}_l$, $\mathbf{b}_l$ and $a_l$ denote the weight matrix, bias vector and activation function of the $l$th layer, respectively. The input $\mathbf{h}_0$ of the MLP is the same as the input of the linear part, i.e., the polynomial terms $\mathbf{z}_i$. The output is the probability of $y_i = 1$:

$$\hat{y}_i = \sigma\big(\lambda V(\mathbf{x}_i) + (1 - \lambda)\, g(\mathbf{x}_i)\big), \quad (7)$$
where $\sigma(\cdot)$ is a sigmoid function. To estimate the parameters, we minimize the mean squared error (MSE):

$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2. \quad (8)$$
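A compact sketch of the forward pass and the MSE objective, using NumPy; the layer sizes and parameter values are illustrative, not those used in the paper, and the function names are ours.

```python
import numpy as np

def mlp_forward(z, weights, biases):
    """ReLU hidden layers followed by a linear output, cf. Eq. (6).
    weights[l] has shape (units_l, units_{l-1})."""
    h = z
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, W @ h + b)      # ReLU activation
    return weights[-1] @ h + biases[-1]     # latent score g(x)

def predict(v_additive, g_latent, lam):
    """sigma(lam * V + (1 - lam) * g), cf. Eq. (7)."""
    return 1.0 / (1.0 + np.exp(-(lam * v_additive + (1.0 - lam) * g_latent)))

def mse(y_true, y_pred):
    """Mean squared error objective, cf. Eq. (8)."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))
```

In practice the gradients of this loss with respect to both the additive parameters and the MLP parameters are computed by backpropagation, which is what allows the two components to be trained jointly.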
We can adopt a variety of optimization methods to minimize Eq. (8), such as Stochastic Gradient Descent (SGD), the Adaptive Gradient Algorithm (Adagrad), and Adaptive Moment Estimation (Adam). Please refer to Le et al. (2011) for details of the optimization procedure. The interpretability of the model refers to its capacity to develop marginal value functions that capture the relationship between individual attributes and the prediction. With the proposed model, the DM can learn which attributes are more important for the prediction, which values of an attribute are positively or negatively associated with the prediction, and where the convexity and concavity of the function change.

3.3 Application to multiple criteria ranking problems.
In this subsection, we will show how to apply NNMCDA to traditional multiple criteria ranking problems where alternatives are ranked based on the DM’s preference. In this paper, alternatives are represented as attribute vectors.
Let $\mathbf{x}_i \succsim \mathbf{x}_j$ denote that attribute vector $\mathbf{x}_i$ is at least as good as $\mathbf{x}_j$, and $\mathbf{x}_i \succ \mathbf{x}_j$ denote that $\mathbf{x}_i$ is better than $\mathbf{x}_j$. Note that the symbol '$\succsim$' (or '$\succ$') does not necessarily require that each element in $\mathbf{x}_i$ is at least as good as (or better than) the corresponding element in $\mathbf{x}_j$; it indicates that one alternative is at least as good as (or better than) another based on the DM's judgment. For each pair $(\mathbf{x}_i, \mathbf{x}_j)$, we define $y_{ij}$ as follows:

$$y_{ij} = \begin{cases} 1, & \text{if } \mathbf{x}_i \succsim \mathbf{x}_j, \\ 0, & \text{otherwise,} \end{cases} \quad (9)$$
and the difference between the global scores of $\mathbf{x}_i$ and $\mathbf{x}_j$ is $U(\mathbf{x}_i) - U(\mathbf{x}_j)$, where $U(\cdot)$ takes the form of Eq. (4). Let $\mathbf{z}_{ij}$ be the aggregated vector for the pair $(\mathbf{x}_i, \mathbf{x}_j)$, and let $\hat{y}_{ij}$ be a function of $\mathbf{z}_{ij}$. We fit this function to approximate the value of $y_{ij}$. Note that in some decision problems the attribute weights in Eq. (4) are normalized so that $\sum_{j=1}^{m} w_j = 1$ and $w_j \geq 0$, which is useful for interpreting the trade-offs between attributes^6. To address this issue, we apply the following transformation: for each attribute $j$, the normalized weight is $w'_j = w_j / \sum_{k=1}^{m} w_k$. The new global score is $U'(\mathbf{x}_i) = U(\mathbf{x}_i) / \sum_{k=1}^{m} w_k$. Moreover, the ordinal relations among all attribute vectors are preserved, since $\sum_{k=1}^{m} w_k > 0$ implies $U(\mathbf{x}_i) \geq U(\mathbf{x}_j) \Leftrightarrow U'(\mathbf{x}_i) \geq U'(\mathbf{x}_j)$.

^6 Note that the trade-off between attributes is similar to attribute importance, but the trade-off emphasizes that assigning more weight to one attribute would decrease the weights of others. That usually leads to a situation where some attributes have almost no effect on the predictions, which is unexpected because the selected attributes are often chosen based on the DM's prior knowledge and requirements. In this regard, we tend to train our model without normalization but provide normalized weights to evaluate the trade-offs between attributes (Liu et al. 2019). Moreover, there are few minor differences in performance between using normalized weights and not.
Given the input data $\{(\mathbf{z}_{ij}, y_{ij})\}$, instead of mathematical programming, we can now use the machine learning scheme in Section 3.2 to infer the preference model and rank other attribute vectors. The output $\hat{y}_{ij}$ is the probability that $\mathbf{x}_i$ is at least as good as $\mathbf{x}_j$. We can predefine two thresholds $\delta_1$ and $\delta_2$ with $\delta_1 < \delta_2$: if $\hat{y}_{ij} \geq \delta_2$, then $\mathbf{x}_i \succ \mathbf{x}_j$; if $\hat{y}_{ij} \leq \delta_1$, then $\mathbf{x}_j \succ \mathbf{x}_i$; and otherwise, $\mathbf{x}_i \sim \mathbf{x}_j$. If we use the normalized weights, since the probability is transformed nonlinearly, the predefined thresholds should also be transformed accordingly to preserve the ordinal relations. In this way, traditional multiple criteria ranking approaches can handle larger datasets and obtain smoother and more flexible marginal value functions to assist the DM. We present the simulation results in Section 4.1 and the results on real datasets in Section 4.2.
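The weight normalization and the two-threshold decision rule described above can be sketched as follows; the threshold values here are hypothetical, chosen only for illustration.

```python
def normalize_weights(weights):
    """Rescale positive attribute weights to sum to 1. Dividing every
    global score by the same positive constant preserves all ordinal
    relations among the alternatives."""
    total = sum(weights)
    return [w / total for w in weights]

def relation(prob, delta1=0.4, delta2=0.6):
    """Map the predicted probability that x_i is at least as good as x_j
    to a preference relation, using two hypothetical thresholds."""
    if prob >= delta2:
        return "x_i preferred"
    if prob <= delta1:
        return "x_j preferred"
    return "indifferent"
```

The middle band between the two thresholds plays the role of an indifference zone, which is why the rule needs two thresholds rather than a single 0.5 cutoff.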
3.4 Usefulness of the proposed framework in decision making.
As we introduce MCDA into machine learning, the main objective is shifted from achieving the best predictive performance to facilitating the DM in gaining insights into the characteristics of the decision making process and the interpretations of the results (Doumpos and Zopounidis 2011). Once the marginal value functions are obtained by the proposed NNMCDA framework, we can further analyze the DM’s preference from the following perspectives.
First, the attribute importance usually has a long-tail distribution, with a few attributes being very important and the majority being less important (Caruana et al. 2015). The characteristics of a marginal value function can reveal the importance of the corresponding attribute. If a marginal value function is close to 0 over the whole scale of the attribute values, the attribute is either not important to the DM or the characteristic of the marginal value function has been wrongly captured, because changes in this attribute have little influence on the predictions. When this is the case, we need to interact with the DM to determine whether we preserve this attribute or calibrate the model. In this regard, the proposed framework can perform model selection and modification (similar to statistical approaches like LASSO). For example, when predicting whether a patient has the flu, if the marginal value function of "room humidity" has a shape like marginal value function 1 in Figure 1.2, it is possible that "room humidity" contributes little to the flu. However, whether to abandon it should be determined by a physician.
Second, the increasing and decreasing tendencies of the marginal value function curves reveal the changes in the DM's nonmonotonic preference. We focus on the monotonicity inflexion points because they determine to which attribute values the DM is more sensitive. Moreover, if we partition the marginal value function curve at these points, we can discretize a continuous attribute into smaller ranges in which the DM's preference is monotonic. Such smaller intervals are useful for personalization (e.g. customer segmentation) and strategy-making tasks in management. For example, when evaluating a company's performance, if the marginal value function of the "cash to total assets ratio" looks like the second function in Figure 1.2, we can learn that a company with a very small or very large ratio is in a bad condition. Companies with a large ratio are advised to use the cash for more investments, whereas companies with a small ratio are advised to cut general expenses so that more cash can be used for new investments.
Third, since the marginal value function returns a "score" that is added to the global value, it is crucial to determine whether the attribute contributes positively or negatively to the outcome. If a marginal value function is above/below zero, the corresponding attribute is positively/negatively associated with the prediction. The marginal value function can capture the sign change (if any) of an attribute's contribution and provide the DM with the exact attribute value where the sign changes. This is more informative than statistical models that only provide a fixed coefficient representing either a positive or a negative effect of the attribute. For example, when predicting the risk of depression among adults, the marginal value function of "age" may have a shape similar to the third function in Figure 1.2 (please also refer to Figure 4.3, which is drawn from the real data). The shape of this curve indicates that the risk of depression does not increase with age while the adult is younger than a threshold; the risk increases once the adult is older than that threshold (the threshold is 71.58 in the real data introduced in Section 4.3). Statistical models, on the other hand, can only conclude that age has either a negative or a positive effect on depression risk; we would need to segment the adults into predefined age groups to capture such a sign-change effect.
Fourth, the concavity (and convexity) of the marginal value function directly reflects the changing rate of the DM’s preference. Such information is important to both economics and marketing problems. For example, if the consumer’s preference for “discount rate” has the same shape as the fourth marginal value function in Figure 1.2, it implies that at the beginning, as the discount rate increases, the consumer’s utility (propensity to buy the product) grows quickly. However, once the discount rate exceeds a specific value, it signals that the product is possibly of bad quality; although the consumer’s utility still grows, its rate of increase starts to slow down. This suggests to the DM that keeping the discount rate at a medium level could maximize profit.
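To make the notion of monotonicity inflexion points concrete, the sketch below locates the points where a polynomial marginal value function switches between increasing and decreasing, which is how the monotonic sub-ranges discussed above could be obtained. The inverted-U example function is purely hypothetical, not one fitted in the paper.

```python
import numpy as np

def monotonicity_changes(coeffs):
    """Locate the points in (0, 1) where a polynomial marginal value
    function switches between increasing and decreasing.
    coeffs are polynomial coefficients, highest degree first (np.poly1d)."""
    dv = np.poly1d(coeffs).deriv()
    xs = np.linspace(0.0, 1.0, 10_000)
    signs = np.sign(dv(xs))
    flips = np.where(np.diff(signs) != 0)[0]  # sign flips bracket a turning point
    return [(xs[i] + xs[i + 1]) / 2 for i in flips]

# hypothetical inverted-U preference, e.g. "cash to total assets ratio":
# v(x) = x(1 - x) rises then falls, turning at x = 0.5
turning_points = monotonicity_changes([-1.0, 1.0, 0.0])
```

Partitioning the attribute range at the returned points yields the smaller intervals on which the DM’s preference is monotonic.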
4 Experiments.
To validate the proposed NN-MCDA model, we perform experiments with both synthetic and real datasets. We use the area under the receiver operating characteristic (ROC) curve (AUC) to measure model performance. In Section 4.1, three simulation experiments examine (a) the influence of the degree of polynomial on prediction performance, (b) the influence of the trade-off coefficient on prediction performance, and (c) how well the proposed NN-MCDA approach fits given marginal value functions. In Section 4.2, we first apply the NN-MCDA model to a multiple criteria decision problem in which we rank universities based on employer reputation. We then predict the risk for geriatric depression, with useful interpretations of the risk factors at a higher resolution.
4.1 Simulations.
For brevity, we set the same predefined degree of polynomial for all marginal value functions in the subsequent experiments. We generate three typical synthetic datasets (from the simplest to the most complex) as follows:

Uniformly draw attribute vectors whose attribute values lie within [0, 1].

We generate three datasets. (a) For the first dataset, all attributes have equal importance and the actual marginal value functions are identity functions. The global score for each attribute vector is a linear summation of attribute values, without any attribute interactions, plus a noise term drawn from a standard normal distribution. (b) For the second dataset, we randomly generate degree-3 polynomial marginal value functions for the attributes, and the global score is the summation of the marginal values, all attribute interactions, and a standard normal noise term. (c) The third dataset is extremely complex: the global score is the summation of degree-15 polynomial marginal values, all possible attribute interactions (pairwise, triple-wise and higher) and a standard normal noise term.
We compare the global scores of each pair of attribute vectors: if the first vector’s global score is greater than the second’s, the pair is labeled 1; otherwise, it is labeled 0. Note that the actual model input is the transformed attribute vector.
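The generation of the simplest dataset (a) and the pairwise labeling step can be sketched as follows; the sample sizes and random seed are illustrative, not the settings used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_linear_pairwise_dataset(n_vectors=200, n_attrs=3):
    """Sketch of synthetic dataset (a): identity marginal value functions,
    equal attribute weights, no interactions, standard normal noise.
    (Sizes and the seed are illustrative, not the paper's settings.)"""
    X = rng.uniform(0.0, 1.0, size=(n_vectors, n_attrs))
    scores = X.sum(axis=1) + rng.standard_normal(n_vectors)
    # label each unordered pair (i, j): 1 if vector i outscores vector j
    pairs, labels = [], []
    for i in range(n_vectors):
        for j in range(i + 1, n_vectors):
            pairs.append((i, j))
            labels.append(1 if scores[i] > scores[j] else 0)
    return X, np.array(pairs), np.array(labels)

X, pairs, y = make_linear_pairwise_dataset()
```

Datasets (b) and (c) would replace the identity marginals with random polynomials and add interaction terms to the score.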
4.1.1 Experiment I: Relationship between degree of polynomial and model performance
The first simulated experiment explores the relationship between the predefined degree of polynomial and AUC. The parameters used in the experiment are shown in Table 4.1.1. For each setting, we repeat the experiment 10 times and record the averaged AUC. In this experiment, the numbers of iterations are determined using five-fold cross-validation: we partition the training set into five parts and set aside one of them as a validation set. We then train the model using the other four partitions and use the validation set to check convergence. This procedure is repeated five times, and the averaged number of iterations is used to train the final model on the whole dataset (Lou et al. 2012).
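This cross-validation procedure for fixing the number of iterations can be sketched as below; the toy gradient-descent logistic model, learning rate, and patience rule are hypothetical stand-ins for the actual NN-MCDA training routine.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_logreg_iters(X_tr, y_tr, X_val, y_val, max_iters=500, lr=0.1, patience=10):
    """Gradient-descent logistic model; return the iteration at which
    validation loss stops improving (simple early stopping)."""
    w = np.zeros(X_tr.shape[1])
    best_loss, best_iter, stall = np.inf, 0, 0
    for t in range(1, max_iters + 1):
        p = 1.0 / (1.0 + np.exp(-X_tr @ w))
        w -= lr * X_tr.T @ (p - y_tr) / len(y_tr)
        p_val = 1.0 / (1.0 + np.exp(-X_val @ w))
        loss = -np.mean(y_val * np.log(p_val + 1e-12)
                        + (1 - y_val) * np.log(1 - p_val + 1e-12))
        if loss < best_loss - 1e-6:
            best_loss, best_iter, stall = loss, t, 0
        else:
            stall += 1
            if stall >= patience:
                break
    return best_iter

def cv_iterations(X, y, k=5):
    """Average the early-stopping iteration over k folds; the averaged count
    is then used to train the final model on the whole training set."""
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    iters = []
    for i in range(k):
        val = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        iters.append(sgd_logreg_iters(X[tr], y[tr], X[val], y[val]))
    return int(round(np.mean(iters)))

# toy data just to exercise the procedure (illustrative only)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo @ np.array([1.0, -1.0, 0.5, 0.0]) > 0).astype(float)
n_iters = cv_iterations(X_demo, y_demo)
```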
Figures 4.1.1, 4.1.1 and 4.1.1 report the averaged AUC on the testing set for different training sizes on the three synthetic datasets. Though there is no obvious relationship between training size and model performance, we find two interesting patterns. First, higher predefined degrees of polynomials can lead to higher accuracy upon convergence, because the underlying model can capture more complicated nonlinearity. However, higher degrees of polynomials usually require more iterations to converge. More specifically, we depict the averaged computational time for each training process in Figure 4.1.1: as the model complexity increases, for example, by using higher-degree polynomial marginal value functions and considering more attributes, the average computational time to converge grows almost linearly. The second pattern is that the AUC curves (Figures 4.1.1, 4.1.1 and 4.1.1) are generally concave in the degree: each increase in the degree yields a smaller AUC improvement than the previous one. For example, the improvement from raising the predefined degree from 1 to 3 is more obvious than that from 3 to 5 or from 5 to 10. The greatest AUC improvement occurs when we increase the degree to 3, while the improvement from further increasing the degree to 5 and 10 is slim. These results suggest that it is not necessary to set a very large degree for the sake of a minor improvement, because the computational cost increases much faster as the degree grows. Generally, we believe that a polynomial of degree 3 is sufficient to capture the characteristics of all three datasets; higher-degree polynomials risk overfitting and clearly cost more computational time, while contributing little to accuracy.
Table 1: Average AUC (± standard deviation) on the first (linear) synthetic dataset.
n=3  n=5

Training size  0.5  0.6  0.7  0.8  0.9  0.5  0.6  0.7  0.8  0.9
NN-MCDA-D1  0.949±0.023  0.959±0.021  0.960±0.011  0.955±0.009  0.951±0.005  0.953±0.021  0.955±0.014  0.951±0.019  0.941±0.013  0.945±0.012
NN-MCDA-D2  0.956±0.022  0.958±0.027  0.978±0.019  0.971±0.007  0.965±0.003  0.962±0.018  0.955±0.021  0.968±0.009  0.957±0.022  0.962±0.018
NN-MCDA-D3  0.965±0.031  0.963±0.024  0.984±0.019  0.981±0.004  0.976±0.003  0.980±0.019  0.972±0.019  0.972±0.021  0.986±0.013  0.978±0.010
NN-MCDA-D5  0.983±0.026  0.971±0.020  0.988±0.019  0.989±0.005  0.988±0.001  0.983±0.028  0.989±0.018  0.876±0.037  0.992±0.009  0.982±0.010
NN-MCDA-D10  0.986±0.011  0.993±0.010  0.991±0.009  0.991±0.002  0.994±0.001  0.998±0.013  0.989±0.029  0.996±0.009  0.992±0.007  0.985±0.008
MLP-D1  0.969±0.001  0.951±0.005  0.960±0.000  0.957±0.000  0.998±0.001  0.970±0.001  0.969±0.002  0.977±0.001  0.981±0.001  0.966±0.000
MLP-D2  0.970±0.002  0.969±0.003  0.973±0.003  0.970±0.001  0.999±0.002  0.979±0.001  0.980±0.000  0.989±0.001  0.988±0.001  0.988±0.001
MLP-D3  0.992±0.001  0.990±0.001  0.987±0.002  0.989±0.001  0.973±0.001  0.987±0.002  0.991±0.001  0.991±0.001  0.995±0.000  0.996±0.000
MLP-D5  0.999±0.001  0.993±0.000  0.990±0.001  0.998±0.000  0.999±0.001  0.995±0.001  0.996±0.001  0.995±0.002  0.996±0.000  0.996±0.000
MLP-D10  0.999±0.002  0.997±0.001  0.995±0.000  0.999±0.001  0.999±0.001  0.999±0.001  0.999±0.001  0.998±0.001  0.998±0.001  0.999±0.000
SVM-Linear  0.998±0.001  0.998±0.001  0.998±0.001  0.998±0.000  0.998±0.000  0.997±0.002  0.998±0.003  0.997±0.001  0.997±0.000  0.997±0.000
SVM-RBF  0.998±0.001  0.997±0.000  0.998±0.001  0.998±0.000  0.999±0.001  0.995±0.001  0.996±0.000  0.996±0.003  0.996±0.001  0.996±0.001
SVM-D3poly  0.967±0.002  0.969±0.001  0.962±0.000  0.971±0.001  0.973±0.000  0.987±0.003  0.991±0.000  0.986±0.001  0.988±0.002  0.989±0.001
GAM-D3  0.983±0.011  0.983±0.017  0.983±0.009  0.982±0.003  0.985±0.002  0.986±0.010  0.985±0.011  0.984±0.010  0.986±0.009  0.984±0.010
GAM-D10  0.985±0.010  0.984±0.018  0.984±0.010  0.983±0.002  0.985±0.001  0.987±0.008  0.985±0.009  0.985±0.011  0.986±0.010  0.986±0.011
DeciTr-MaxDep6  0.842±0.001  0.842±0.003  0.843±0.001  0.835±0.001  0.831±0.000  0.931±0.001  0.927±0.000  0.937±0.000  0.933±0.000  0.938±0.000
DeciTr-MaxDep10  0.896±0.002  0.893±0.000  0.890±0.002  0.897±0.001  0.898±0.000  0.965±0.001  0.965±0.000  0.971±0.001  0.971±0.000  0.967±0.000
DeciTr-MaxDep20  0.900±0.003  0.898±0.001  0.895±0.000  0.908±0.000  0.903±0.000  0.966±0.001  0.966±0.001  0.972±0.001  0.974±0.001  0.970±0.000
PLR-D1  0.899±0.014  0.903±0.022  0.901±0.021  0.900±0.011  0.901±0.000  0.903±0.011  0.902±0.019  0.900±0.016  0.910±0.018  0.911±0.009
PLR-D2  0.910±0.017  0.913±0.023  0.928±0.012  0.923±0.016  0.922±0.001  0.923±0.017  0.931±0.011  0.929±0.019  0.931±0.011  0.936±0.016
PLR-D3  0.945±0.020  0.933±0.017  0.945±0.014  0.941±0.004  0.914±0.001  0.941±0.010  0.952±0.009  0.946±0.014  0.956±0.012  0.944±0.010
PLR-D5  0.950±0.018  0.941±0.023  0.955±0.020  0.949±0.003  0.925±0.000  0.958±0.011  0.959±0.012  0.961±0.012  0.964±0.010  0.961±0.011
PLR-D10  0.959±0.011  0.957±0.009  0.960±0.008  0.955±0.010  0.933±0.000  0.963±0.013  0.970±0.023  0.972±0.014  0.977±0.009  0.973±0.009
Mean  0.957±0.010  0.955±0.011  0.955±0.008  0.958±0.004  0.957±0.001  0.970±0.011  0.958±0.009  0.957±0.009  0.974±0.007  0.972±0.006
Table 2: Average AUC (± standard deviation) on the second (nonlinear) synthetic dataset.
n=3  n=5

Training size  0.5  0.6  0.7  0.8  0.9  0.5  0.6  0.7  0.8  0.9
NN-MCDA-D1  0.753±0.018  0.750±0.023  0.753±0.021  0.748±0.010  0.745±0.008  0.812±0.021  0.828±0.038  0.818±0.029  0.810±0.021  0.824±0.015
NN-MCDA-D2  0.772±0.024  0.769±0.021  0.774±0.012  0.764±0.008  0.766±0.009  0.841±0.023  0.838±0.033  0.843±0.021  0.840±0.019  0.837±0.021
NN-MCDA-D3  0.773±0.026  0.771±0.020  0.775±0.021  0.765±0.011  0.767±0.011  0.844±0.031  0.846±0.052  0.853±0.041  0.847±0.037  0.841±0.031
NN-MCDA-D5  0.772±0.018  0.772±0.019  0.776±0.015  0.768±0.011  0.769±0.008  0.849±0.029  0.853±0.027  0.856±0.039  0.851±0.015  0.847±0.011
NN-MCDA-D10  0.775±0.020  0.775±0.021  0.778±0.012  0.769±0.010  0.773±0.006  0.854±0.032  0.853±0.052  0.861±0.025  0.856±0.030  0.849±0.028
MLP-D1  0.787±0.010  0.793±0.002  0.790±0.001  0.791±0.003  0.800±0.004  0.837±0.002  0.835±0.001  0.831±0.002  0.836±0.001  0.833±0.011
MLP-D2  0.790±0.005  0.793±0.001  0.792±0.003  0.797±0.002  0.802±0.008  0.848±0.001  0.851±0.003  0.845±0.004  0.849±0.010  0.847±0.008
MLP-D3  0.791±0.002  0.799±0.002  0.799±0.001  0.801±0.002  0.808±0.002  0.855±0.004  0.855±0.001  0.851±0.000  0.855±0.000  0.857±0.000
MLP-D5  0.793±0.001  0.806±0.003  0.801±0.002  0.805±0.001  0.810±0.006  0.861±0.002  0.856±0.001  0.854±0.003  0.861±0.001  0.860±0.001
MLP-D10  0.795±0.001  0.810±0.002  0.804±0.004  0.810±0.003  0.811±0.009  0.866±0.004  0.864±0.001  0.865±0.000  0.862±0.001  0.862±0.000
SVM-Linear  0.687±0.004  0.683±0.004  0.686±0.004  0.681±0.006  0.678±0.001  0.748±0.003  0.748±0.003  0.755±0.003  0.748±0.007  0.743±0.007
SVM-RBF  0.688±0.003  0.685±0.003  0.686±0.007  0.682±0.005  0.680±0.000  0.748±0.003  0.748±0.002  0.755±0.003  0.748±0.004  0.743±0.006
SVM-D3poly  0.682±0.004  0.680±0.003  0.687±0.006  0.682±0.005  0.676±0.000  0.750±0.002  0.749±0.003  0.756±0.005  0.752±0.004  0.749±0.008
GAM-D3  0.687±0.002  0.684±0.001  0.688±0.002  0.681±0.001  0.681±0.001  0.749±0.011  0.746±0.006  0.752±0.008  0.750±0.007  0.742±0.003
GAM-D10  0.687±0.004  0.684±0.006  0.687±0.009  0.681±0.001  0.680±0.002  0.749±0.010  0.746±0.010  0.752±0.012  0.749±0.011  0.741±0.009
DeciTr-MaxDep6  0.685±0.003  0.678±0.003  0.682±0.004  0.679±0.007  0.676±0.006  0.724±0.003  0.725±0.002  0.734±0.005  0.736±0.006  0.725±0.009
DeciTr-MaxDep10  0.672±0.002  0.669±0.004  0.672±0.003  0.663±0.007  0.665±0.006  0.717±0.002  0.719±0.002  0.721±0.004  0.722±0.002  0.722±0.008
DeciTr-MaxDep20  0.617±0.003  0.623±0.002  0.620±0.005  0.621±0.009  0.630±0.008  0.666±0.003  0.663±0.004  0.672±0.003  0.671±0.005  0.665±0.009
PLR-D1  0.604±0.011  0.609±0.021  0.611±0.014  0.603±0.011  0.610±0.003  0.703±0.039  0.698±0.010  0.700±0.019  0.702±0.010  0.708±0.017
PLR-D2  0.621±0.013  0.632±0.018  0.639±0.013  0.625±0.010  0.621±0.010  0.712±0.010  0.702±0.018  0.710±0.010  0.711±0.015  0.713±0.029
PLR-D3  0.638±0.021  0.644±0.020  0.640±0.009  0.637±0.009  0.635±0.010  0.730±0.011  0.729±0.021  0.721±0.014  0.729±0.029  0.720±0.018
PLR-D5  0.651±0.017  0.657±0.017  0.649±0.019  0.650±0.011  0.650±0.011  0.733±0.011  0.735±0.013  0.729±0.020  0.734±0.016  0.736±0.019
PLR-D10  0.669±0.009  0.667±0.007  0.670±0.011  0.669±0.011  0.668±0.008  0.739±0.021  0.739±0.023  0.732±0.009  0.741±0.009  0.739±0.014
Mean  0.713±0.010  0.708±0.010  0.713±0.009  0.712±0.006  0.713±0.006  0.780±0.009  0.779±0.014  0.781±0.012  0.775±0.009  0.778±0.012
Table 3: Average AUC (± standard deviation) on the third (extremely complex) synthetic dataset.
n=3  n=5

Training size  0.5  0.6  0.7  0.8  0.9  0.5  0.6  0.7  0.8  0.9
NN-MCDA-D1  0.607±0.010  0.622±0.021  0.591±0.009  0.607±0.010  0.612±0.010  0.711±0.024  0.721±0.032  0.699±0.012  0.708±0.017  0.719±0.022
NN-MCDA-D2  0.643±0.008  0.638±0.016  0.643±0.012  0.650±0.008  0.635±0.010  0.727±0.027  0.723±0.029  0.711±0.026  0.711±0.016  0.729±0.021
NN-MCDA-D3  0.653±0.023  0.663±0.018  0.657±0.019  0.656±0.021  0.659±0.013  0.741±0.028  0.739±0.043  0.738±0.037  0.726±0.029  0.749±0.013
NN-MCDA-D5  0.678±0.019  0.685±0.020  0.676±0.005  0.687±0.023  0.670±0.009  0.752±0.035  0.758±0.039  0.757±0.032  0.735±0.035  0.761±0.010
NN-MCDA-D10  0.690±0.023  0.686±0.021  0.683±0.004  0.689±0.009  0.674±0.006  0.773±0.022  0.761±0.048  0.766±0.019  0.771±0.029  0.762±0.031
MLP-D1  0.710±0.001  0.711±0.005  0.710±0.000  0.713±0.000  0.712±0.001  0.790±0.002  0.787±0.004  0.780±0.001  0.779±0.002  0.789±0.003
MLP-D2  0.713±0.003  0.712±0.002  0.714±0.002  0.718±0.003  0.715±0.002  0.794±0.003  0.790±0.003  0.782±0.001  0.783±0.002  0.791±0.001
MLP-D3  0.719±0.004  0.719±0.003  0.720±0.001  0.721±0.002  0.719±0.003  0.799±0.003  0.794±0.002  0.791±0.002  0.788±0.001  0.800±0.003
MLP-D5  0.721±0.002  0.720±0.000  0.724±0.003  0.727±0.003  0.725±0.004  0.803±0.001  0.799±0.005  0.801±0.002  0.794±0.003  0.803±0.001
MLP-D10  0.728±0.003  0.724±0.001  0.729±0.002  0.730±0.002  0.729±0.003  0.809±0.002  0.803±0.001  0.805±0.002  0.806±0.003  0.808±0.002
SVM-Linear  0.579±0.005  0.579±0.002  0.584±0.003  0.572±0.005  0.579±0.006  0.658±0.003  0.659±0.005  0.649±0.004  0.645±0.002  0.658±0.005
SVM-RBF  0.586±0.003  0.584±0.004  0.586±0.002  0.578±0.002  0.589±0.005  0.660±0.001  0.661±0.002  0.654±0.003  0.648±0.003  0.658±0.003
SVM-D3poly  0.574±0.002  0.579±0.006  0.580±0.006  0.572±0.004  0.579±0.005  0.659±0.004  0.662±0.002  0.652±0.002  0.647±0.002  0.655±0.006
GAM-D3  0.579±0.001  0.579±0.002  0.584±0.003  0.572±0.001  0.576±0.003  0.659±0.002  0.659±0.004  0.650±0.003  0.647±0.005  0.656±0.004
GAM-D10  0.583±0.001  0.582±0.003  0.585±0.003  0.579±0.002  0.583±0.005  0.660±0.004  0.659±0.003  0.651±0.003  0.645±0.002  0.657±0.006
DeciTr-MaxDep6  0.583±0.003  0.583±0.003  0.583±0.003  0.581±0.005  0.589±0.009  0.649±0.002  0.650±0.001  0.644±0.004  0.638±0.005  0.651±0.008
DeciTr-MaxDep10  0.565±0.004  0.579±0.002  0.574±0.004  0.570±0.007  0.562±0.008  0.631±0.004  0.635±0.003  0.632±0.005  0.628±0.006  0.638±0.009
DeciTr-MaxDep20  0.536±0.005  0.548±0.006  0.539±0.004  0.541±0.007  0.547±0.008  0.592±0.003  0.583±0.004  0.591±0.002  0.588±0.008  0.582±0.007
PLR-D1  0.560±0.015  0.562±0.026  0.558±0.021  0.559±0.017  0.557±0.011  0.632±0.024  0.630±0.017  0.629±0.019  0.631±0.022  0.633±0.020
PLR-D2  0.562±0.014  0.568±0.021  0.560±0.020  0.560±0.010  0.562±0.017  0.637±0.019  0.631±0.021  0.633±0.011  0.633±0.017  0.635±0.019
PLR-D3  0.569±0.010  0.569±0.019  0.564±0.019  0.565±0.028  0.563±0.028  0.642±0.023  0.637±0.019  0.637±0.023  0.639±0.028  0.639±0.020
PLR-D5  0.571±0.014  0.572±0.023  0.569±0.020  0.569±0.027  0.569±0.027  0.647±0.022  0.642±0.020  0.640±0.029  0.641±0.018  0.641±0.018
PLR-D10  0.578±0.015  0.575±0.012  0.572±0.023  0.575±0.019  0.573±0.019  0.652±0.021  0.649±0.011  0.644±0.016  0.645±0.019  0.647±0.022
Mean  0.621±0.005  0.623±0.010  0.621±0.008  0.621±0.005  0.621±0.005  0.699±0.012  0.697±0.014  0.693±0.011  0.690±0.013  0.698±0.011
We also compare the proposed NN-MCDA with baseline machine learning models, including the standard MLP, polynomial linear regression (PLR) of degrees 1, 2, 3, 5 and 10, GAM with 10 splines of degree 3 and degree 10, SVMs with linear, radial basis function (RBF) and polynomial kernels, and single decision tree (DeciTr) models with maximum depths of 6, 10 and 20. Table 1 presents the results for the simplest dataset. All machine learning models, both the interpretable ones (including NN-MCDA) and the full complexity ones, perform well. The performance drops rapidly when we use them to fit the nonlinear and high-order datasets (shown in Tables 2 and 3). Since the proposed NN-MCDA model and MLP can model nonlinearity and attribute interactions, both achieve much higher AUCs than the rest. As expected, although the performance of NN-MCDA is lower than that of MLP, the difference is relatively small, and both significantly outperform the other baseline machine learning models. More specifically, GAM, SVM and DeciTr have similar accuracy, which is higher than that of PLR but lower than that of NN-MCDA. NN-MCDA's performance is close to that of the full complexity model while retaining strong interpretability (shown in the next experiment).
4.1.2 Experiment II: Impact of the trade-off coefficient on AUC
In this section, we assess how the performance of the NN-MCDA model is affected by the trade-off coefficient, i.e., the weight of the linear component. We evenly sample 20 values within [0, 1] as the predefined coefficient. For each fixed value, we train the NN-MCDA model using the synthetic datasets introduced in the previous subsection. In this experiment, we use the SGD algorithm to optimize the parameters, and the number of iterations is set to 250.
In Figure 4.1.2 (results on the linearly generated datasets), though the three curves have many monotonicity inflexions, there is a general trend that the larger the coefficient, the better the performance; NN-MCDA obtains the best results when the coefficient is between 0.8 and 1.0 on these linearly generated datasets. In contrast, the AUC curves show a general decreasing trend as the coefficient increases for the nonlinearly generated datasets (Figures 4.1.2 and 4.1.2). In the extreme cases, where only the nonlinear MLP component or only the linear component is active, the model obtains the greatest or the smallest average AUC, respectively.
Note that the three datasets have very different patterns. The first dataset uses a set of simple linear marginal value functions, whereas the other two simulate more complicated patterns. Theoretically, a full complexity model (MLP) can capture any pattern in the data, at the cost of very large numbers of iterations and data samples for convergence. In practice, we often do not have sufficient data or computational time to reach the optimal MLP solution. In this simulation experiment (31,125 data points and 250 iterations to fit the model), a pure MLP model (NN-MCDA with the trade-off coefficient set to 0) does not always lead to the best outcome. This result indicates that, in real-world managerial decision making, a full complexity model is usually not the best choice, not only because it lacks interpretability, but also because the data and computational resources to optimize it are limited. It is sensible to allow the model to automatically adjust the trade-off coefficient, avoiding scenarios where a very complex model is used to fit simple data, or a simple model is used to fit complex data.
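For intuition, a schematic forward pass of an NN-MCDA-style model, blending an interpretable polynomial-marginal component with a small MLP through the trade-off coefficient, might look as follows. The parameterization (tanh hidden layer, random coefficients, the name nn_mcda_forward) is an illustrative assumption, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def polynomial_marginals(x, coeffs):
    """Evaluate each attribute's polynomial marginal value function v_i(x_i).
    coeffs has shape (n_attrs, degree + 1), column p holding the x**p term."""
    powers = np.stack([x ** p for p in range(coeffs.shape[1])], axis=-1)
    return (powers * coeffs).sum(axis=-1)

def nn_mcda_forward(x, weights, coeffs, mlp_params, alpha):
    """Schematic global value: alpha * interpretable linear-in-marginals part
    plus (1 - alpha) * a small MLP that absorbs attribute interactions."""
    linear_part = weights @ polynomial_marginals(x, coeffs)
    W1, b1, W2, b2 = mlp_params
    hidden = np.tanh(W1 @ x + b1)
    mlp_part = (W2 @ hidden + b2).item()
    return alpha * linear_part + (1.0 - alpha) * mlp_part

n_attrs, degree, hidden_dim = 3, 3, 8
coeffs = rng.normal(size=(n_attrs, degree + 1))
weights = np.full(n_attrs, 1.0 / n_attrs)   # equal attribute weights
mlp_params = (rng.normal(size=(hidden_dim, n_attrs)), np.zeros(hidden_dim),
              rng.normal(size=(1, hidden_dim)), np.zeros(1))
u = nn_mcda_forward(rng.uniform(size=n_attrs), weights, coeffs, mlp_params, alpha=0.4)
```

Setting alpha to 1 reduces the model to the pure interpretable component, while alpha equal to 0 reduces it to the pure MLP, which is what the extreme cases in this experiment exercise.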
4.1.3 Experiment III: Performance in fitting actual marginal value functions
This experiment studies the ability of the NN-MCDA model to reconstruct the actual marginal value functions. From Experiments I and II, we find that the NN-MCDA model with degree 3 strikes a good balance between prediction performance and computational cost. Therefore, in this experiment, we generate four typical synthetic models of different complexities. Each hypothetical model has three marginal value functions to estimate. We then use the synthetic models to generate datasets with the same attribute vectors (model input), and approximate the marginal value functions using an NN-MCDA model of degree 3. We also compare the obtained functions with those of a baseline linear regression model. The four synthetic models (from the simplest to the most complicated) are described as follows:

Synthetic model 1: Three linear marginal value functions. The global scores are the linear summation of three marginal values without interactions.

Synthetic model 2: Three polynomial functions of degree 3. The global scores are the linear summation of three marginal values with all possible pairwise interactions.

Synthetic model 3: A polynomial function of degree 15, a sigmoid function and an exponential function. The global scores are the linear summation of three marginal values without interactions.

Synthetic model 4: Model 3 with pairwise and triple-wise attribute interactions added.
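Synthetic models 3 and 4 could be generated along the following lines; the specific constants in the polynomial, sigmoid and exponential marginals are placeholders, as the paper's exact generating functions are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthetic_model_3(x):
    """Sketch of synthetic model 3: a high-degree polynomial, a sigmoid and
    an exponential marginal value function, summed without interactions."""
    v1 = (2.0 * x[0] - 1.0) ** 15                      # degree-15 polynomial
    v2 = 1.0 / (1.0 + np.exp(-10.0 * (x[1] - 0.5)))    # sigmoid
    v3 = np.exp(2.0 * x[2]) - 1.0                      # exponential
    return v1 + v2 + v3

def synthetic_model_4(x):
    """Synthetic model 4: model 3 plus pairwise and triple-wise interactions
    and a standard normal noise term."""
    pairwise = x[0] * x[1] + x[0] * x[2] + x[1] * x[2]
    triple = x[0] * x[1] * x[2]
    return synthetic_model_3(x) + pairwise + triple + rng.standard_normal()
```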
Figures 4.1.3 to 4.1.3 show the actual and fitted marginal value functions obtained by the proposed NN-MCDA model and the baseline linear regression model. For a simple model with linear marginal value functions (Synthetic model 1), both the baseline linear regression model and NN-MCDA fit the actual functions well, and both successfully capture the monotonicity of the original marginal value functions. This indicates that the NN-MCDA model is also applicable to simple prediction tasks without attribute interactions or nonlinear associations between attributes and predictions.
When attribute interactions are present (Synthetic model 2), the proposed model outperforms the baseline linear regression model. In the first row of Figure 4.1.3, the NN-MCDA model captures the correct monotonicity changes of all three actual marginal value functions; moreover, for the first and third attributes, it captures the concavity of the original functions. The linear regression model, by contrast, gives the opposite monotonicity of the actual functions except for the second attribute, and it does not capture the concavity of the actual functions. The poor performance of the linear regression model stems from its inability to capture the interactions between attributes, which often distorts interpretability.
For the model with more complex marginal value functions (Synthetic model 3), the linear regression model performs rather badly. In Figure 4.1.3, the first and third attributes have a negligible impact on the prediction in the fitted linear regression model, whereas the NN-MCDA model correctly captures the main characteristics of the three attributes. Both models fail to capture the inflexion point of the second attribute. These results demonstrate that a low-degree NN-MCDA model (degree 3 in this study) can fit both lower- and higher-degree marginal value functions, because the nonlinear component helps handle the complexities that cannot be captured by the predefined linear component. In Figure 4.1.3, when the marginal value functions are extremely complex (Synthetic model 4), the low-degree NN-MCDA cannot fully capture their characteristics, but it still outperforms the baseline linear regression model. Real-world DM behaviors, however, are usually not that complex (degree 15 in Synthetic models 3 and 4). We further validate the applicability and generalizability of the proposed NN-MCDA model with real datasets in the following section.
4.2 A multiple criteria ranking problem.
The QS World University Rankings organization (an annual publication of university rankings by Quacquarelli Symonds (QS); https://www.topuniversities.com) provides five carefully chosen indicators to measure universities’ capacity to produce the most employable graduates: employer reputation, employer-student connection, alumni outcomes, partnerships with employers and graduate employment rate. The metric of employer reputation, regarded as a key performance indicator, is based on over 40,000 responses to the QS Employer Survey. In this experiment, we apply the proposed NN-MCDA model to a multiple criteria ranking problem that predicts employer reputation (the human decision) using the other four quantitative indicators:

Employer-student connection (EC). The number of active presences of employers on a university’s campus over the past 12 months. Such presences take the form of providing students with opportunities to network and acquire information, or organizing company presentations and other self-promoting activities, which increase students’ chances of participating in career-launching internships and research opportunities.

Alumni outcomes (AO). A score based on the outcomes produced by a university’s graduates. A university is successful if its graduates tend to produce more wealth and scientific research.

Partnerships with employers (PE). The number of citable and transformative research outputs that a university produces in successful collaboration with global companies.

Graduate employment rate (GER). This indicator is essential for understanding how successful universities are at nurturing employability. It measures the proportion of graduates (excluding those opting to pursue further study or unavailable to work) in full- or part-time employment within 12 months of graduation.

The descriptive statistics are shown in Table 4.2. We form pairwise comparisons among 250 universities. To determine the predefined degree of polynomials, we set it to 1, 2, and 3, respectively, and use the same five-fold cross-validation process to fit the model, recording the averaged AUC for both training and testing sets. We also select three baseline models: a 3-layer MLP, a logistic regression model, and a GAM. The average AUC results are presented in Table 4.2. For each predefined degree, the full complexity model always obtains the best results, whereas the logistic regression model performs the worst. Since NN-MCDA can model attribute interactions, it slightly outperforms the GAM. For the interpretable models, we depict the marginal value functions obtained by NN-MCDA, the logistic regression model, and GAM in Figures 4.2, 4.2 and 4.2, respectively.
In Figures 4.2 and 4.2, the vertical axis shows each attribute’s contribution to employer reputation. For the attributes alumni outcomes, partnerships with employers and graduate employment rate, the value functions obtained by NN-MCDA and the logistic regression model exhibit a monotonically increasing trend, which makes sense based on common knowledge. For the attribute employer-student connection, GAM obtains a generally increasing curve; although it also captures a slight dip in the increasing trend for employer-student connection values greater than 0.8, such an effect is not as clear as in the curve captured by NN-MCDA. For the attributes alumni outcomes, partnerships with employers and graduate employment rate, GAM obtains quite different and unstable curves that are difficult to explain.
4.3 Predicting geriatric depression risk.
Depression is a major cause of emotional suffering in later life. Geriatric depression reduces older adults’ quality of life and increases the risk of acquiring other diseases and of suicide (Alexopoulos 2005). The existing literature has empirically studied the risk factors of geriatric depression, but few studies give insight into how these risk factors affect the prevalence of geriatric depression in detail (i.e., the shape of the marginal value functions). Clinical decision making aimed at preventing depression in older adults would benefit from understanding how each risk factor influences the risk for depression at different value ranges.
The Health and Retirement Study (HRS) is a nationally representative longitudinal study of US adults aged 51 years and older (Bugliari et al. 2016). It has been widely used in many medical studies because of its extensive information about older adults’ demographics, health status, health care utilization and costs, and other useful variables (Pool et al. 2018). We sample the data from 2014. In this experiment, given five predetermined attributes (risk factors) that have been found to be associated with geriatric depression, we want to capture the detailed effect of these risk factors at different scales, represented by marginal value functions. The descriptive statistics are shown in Table 4.3. Respondents with CES-D scores higher than or equal to 1 are assumed to be at risk for depression (positive samples). There are 9,816 positive samples and 7,880 negative samples in total. We randomly choose 90% of both positive and negative samples to train the model. We train the NN-MCDA, MLP, GAM, and logistic regression models 30 times each, and present the average results in Table 4.3. The obtained marginal value functions are visualized in Figures 4.3, 4.3, and 4.3.
We first analyze the conclusions on which the interpretable models agree. For the last four attributes, both the NN-MCDA and logistic regression models capture similar monotonic trends. For degree of education and marital status, the curves are below the baseline rate, indicating that higher education and a longer marriage can reduce the risk for depression. This is consistent with the medical literature (Penninx et al. 1998, Ladin 2008). More specifically, both models find that the attributes out-of-pocket expenditure and body mass index increase the risk for depression. Since the obtained value functions are convex, the risk grows increasingly quickly as these attribute values increase.
Both NN-MCDA and logistic regression obtain a convex curve for the attribute age, with part of the curve being negative and the rest positive (see Figures 4.3 and 4.3). This indicates that the risk of depression does not increase with aging while the adult is younger than a threshold, but increases quickly once the adult passes that threshold. The threshold for NN-MCDA is 71.58, which makes sense because most adults younger than 71.58 could still be enjoying their retirement, and their body functions have not yet degraded much. However, the threshold for the logistic regression is 95.83, which seems unrealistic and inconsistent with the literature (Blazer et al. 1991).
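The reported age threshold is simply the zero crossing of the fitted marginal value function, which can be recovered numerically, e.g. by bisection. The age_curve below is a hypothetical convex curve chosen to cross zero near 71.58; it is not the fitted model.

```python
def zero_crossing(marginal, lo, hi, tol=1e-6):
    """Bisection for the attribute value where a marginal value function
    changes sign (e.g. the age past which depression risk turns positive).
    Assumes exactly one sign change on [lo, hi]."""
    f_lo = marginal(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if marginal(mid) * f_lo > 0:
            lo = mid   # still on the same side as the left endpoint
        else:
            hi = mid
    return 0.5 * (lo + hi)

# hypothetical convex age curve crossing zero near 71.58 (not the fitted model)
age_curve = lambda a: 0.001 * (a - 50.0) ** 2 - 0.4658
threshold = zero_crossing(age_curve, 50.0, 100.0)
```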
Similar to the previous experiment, GAM obtains less stable curves, which are relatively more difficult to interpret. For the attribute age, GAM obtains a pattern similar to those of the NN-MCDA and logistic regression models, but with an age threshold around 90, which is inconsistent with the literature (Blazer et al. 1991). For the attributes degree of education and marital status, GAM even obtains quite counterintuitive (if not wrong) results, indicating that more education and a longer marriage result in a higher risk of depression.
The experiments with real data demonstrate that the proposed NN-MCDA can effectively capture the patterns in human decision behavior by learning the marginal value functions, which characterize the contribution of individual attributes to the predictions. NN-MCDA shows good potential to enhance empirical studies by providing a detailed marginal value function instead of a single coefficient for each attribute. Moreover, its prediction performance is close to that of a full complexity model (MLP) and much better than that of the baseline interpretable models (GAM and logistic regression).
5 Discussion.
In this section, we summarize the insights from the experiments, discuss the use and extension of the proposed NN-MCDA model, and compare NN-MCDA with ensemble learning.
5.1 Attribute importance.
To examine the importance of each attribute, we present the normalized attribute weights obtained in the previous experiments in Figure 5.1. In the university ranking problem, NN-MCDA assigns 0.2732, 0.228, 0.258, and 0.239 to the attributes employer-student connection, alumni outcomes, partnerships with employers and graduate employment rate, whereas the regression model assigns 0.3198, 0.3107, 0.1906, and 0.1789 to them, respectively. Both models determine that employer-student connection is the most important attribute; however, they rank the other three attributes differently. Given limited resources and the obtained attribute importances, maintaining a good employer-student connection with frequent employer presences on campus is the most effective way for a university to achieve a good employer reputation.
For the depression prediction problem, the regression model assigns almost equal importance to the five attributes, which is not informative for the DM. In contrast, NN-MCDA provides an ordering of the attributes by importance: (0.221, 0.218, 0.208, 0.198, 0.155). This ordering suggests that obesity and heavy out-of-pocket expenditure are the most important risk factors for depression. When estimating an older adult’s risk for depression, a DM (general physicians, specialists and geriatricians) should prioritize problems related to the older adult’s body weight and economic conditions.
5.2 Interpreting the tradeoff coefficient.
As a key coefficient, the tradeoff coefficient's value is important to practical applications. It reflects the influence of high-order interactions and complex nonlinearity among variables on the final decisions. In this regard, the NN-MCDA model can be used to explore the complexity of the learning problem. If the coefficient converges to a very small value, the data are highly complex and the assumption of preference independence does not hold. The DM should then use models that account for attribute interactions, such as a Choquet integral-based model (Aggarwal and Fallah Tehrani 2019), or full-complexity machine learning models. On the contrary, if the coefficient converges to a value close to 1, the DM is recommended to use a simpler model to avoid excessive computational time and non-interpretable results.
In the case of prediction for depression (Table 4.3), the tradeoff coefficient converges to around 0.4, which indicates that the attributes in this problem likely interact. Related medical studies have also empirically demonstrated interactions between attributes: for example, relatively young older adults with a higher degree of education may engage in more social activities and have a more contented life, which leads to a lower risk for depression (Li et al. 2014). Moreover, older adults with longer marriages obtain more family support and thus have a lower risk for depression (Pearlin and Johnson 1977). We usually focus on pairwise interactions because they are easier to interpret and can be visualized with a heat map (Caruana et al. 2015). Given the converged coefficient and the marginal value functions obtained in the presence of high-order correlations among attributes, we could develop algorithms that use lower-order attribute interactions, e.g., pairwise and triple-wise interactions, to approximate the higher-order ones. The framework can then be extended to a two-step procedure: determining the marginal value functions and deducing possible lower-order correlations.
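The lower-order approximation idea can be sketched as a simple feature expansion with pairwise products; the function name and interface are illustrative assumptions, not the paper's algorithm.

```python
from itertools import combinations
import numpy as np

def pairwise_interactions(X):
    """Expand a feature matrix with all pairwise products x_i * x_j.
    Pairwise terms are easy to interpret and can be shown as a heat map;
    this is a minimal sketch, not the paper's approximation algorithm."""
    X = np.asarray(X, dtype=float)
    pairs = [X[:, i] * X[:, j] for i, j in combinations(range(X.shape[1]), 2)]
    return np.column_stack([X] + pairs)

X = np.array([[1.0, 2.0, 3.0]])
X_aug = pairwise_interactions(X)  # columns: x1, x2, x3, x1*x2, x1*x3, x2*x3
```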
5.3 Extending the NN-MCDA framework.
The proposed NN-MCDA presents a general modelling framework, which can be easily extended to enhance its performance and adaptivity for various problems. In this section, we discuss three extensions: adding regularization, replacing the model in the nonlinear component, and incorporating additional attributes in the nonlinear component.
5.3.1 Adding regularizations
For some mission-critical cases where the data are complex, the tradeoff coefficient could converge to a very small value. However, the DM may still require a certain level of model interpretability to facilitate decision making. Therefore, we opt to add a regularization term that prevents the model from becoming too complicated and non-interpretable. The regularization term also helps prevent overfitting. For example, letting λ denote the weight of the linear component, we can revise the original MSE loss to MSE + η(1 − λ)², where η controls the strength of the penalty. The added regularization term allocates more weight to the linear component at the cost of lower fitting accuracy. We can also change the regularization term to η(λ − 1/2)², which leads to a balanced model that favors equal weights for the linear and nonlinear components. The exact form of the loss function should be selected according to the problem setting.
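A regularized loss of this kind can be sketched as follows, assuming `lam` denotes the weight of the linear component and `eta` the penalty strength; both names, and the quadratic penalty form, are our illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def regularized_mse(y_true, y_pred, lam, eta=0.1, target=1.0):
    """MSE plus a quadratic penalty on the tradeoff coefficient `lam`.
    target=1.0 pushes weight toward the linear component;
    target=0.5 favors equal weights for the two components."""
    mse = np.mean((np.asarray(y_true, dtype=float)
                   - np.asarray(y_pred, dtype=float)) ** 2)
    return mse + eta * (lam - target) ** 2

# Perfect fit, so only the penalty remains: 0.1 * (0.5 - 1.0)^2 = 0.025
loss_toward_linear = regularized_mse([1, 0], [1, 0], lam=0.5)
```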
5.3.2 Replacing the model in the nonlinear component.
Given different types of datasets, the proposed NN-MCDA model can be modified by replacing the neural network in the nonlinear component with other network structures. In Subsection 4.1.1, NN-MCDA has difficulty handling extremely complex data. To improve performance on such data, we can add more layers to the MLP or increase the number of neurons in each layer. For image classification problems, we can replace the MLP with a convolutional neural network (CNN) and use the features obtained from the CNN as input for the linear component. To fit time-series or free-text data, we can replace the MLP with a recurrent neural network.
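This pluggable design can be sketched as a toy skeleton in which the nonlinear component is any callable, so an MLP, CNN, or RNN could be swapped in without touching the rest; the class and parameter names are illustrative, not the paper's implementation.

```python
import numpy as np

class NNMCDASketch:
    """Toy two-component skeleton: a linear score plus a swappable
    nonlinear score, combined by a tradeoff coefficient `lam` and fed
    through a shared logistic function."""

    def __init__(self, linear_weights, nonlinear_fn, lam=0.5):
        self.w = np.asarray(linear_weights, dtype=float)
        self.nonlinear_fn = nonlinear_fn  # any callable: features -> score
        self.lam = lam                    # tradeoff coefficient

    def predict_proba(self, X):
        X = np.asarray(X, dtype=float)
        linear = X @ self.w
        nonlinear = np.array([self.nonlinear_fn(x) for x in X])
        z = self.lam * linear + (1.0 - self.lam) * nonlinear
        return 1.0 / (1.0 + np.exp(-z))  # shared logistic output

# The "nonlinear component" here is a stand-in lambda; in practice it
# would be an MLP, CNN, or RNN forward pass.
model = NNMCDASketch([0.5, -0.25], nonlinear_fn=lambda x: x[0] * x[1], lam=0.7)
p = model.predict_proba([[1.0, 2.0]])
```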
In addition, the proposed model can be progressively modified by iteratively interacting with the DM. We provide a user-interactive process to determine the ultimate model. The framework is shown in Figure 5.3.2 and explained as follows:

We first apply the NN-MCDA model to the management problem. When the model converges, we obtain the value of the tradeoff coefficient.

If the coefficient is not very small, go to Step 3. If it is very small, the data are potentially very complex. We then opt for a full-complexity black-box model to achieve higher accuracy and present the results to the DM. If the DM agrees to use the black-box model, the process ends. Otherwise, we add a regularization term (e.g., the one described in Subsection 5.3.1) to the original NN-MCDA and go back to Step 1.

If the coefficient is not close to 1, go to Step 4. If it is close to 1, an interpretable model is sufficient to fit the data, and we explain the results to the DM. If the DM is satisfied with the accuracy, the process ends. Otherwise, we can modify the NN-MCDA model by replacing the model in the nonlinear component (e.g., using a deeper MLP or another neural-network-based model). Then, we go back to Step 1.

Otherwise, we present the underlying model and results to the DM. If there are no further requirements, the process ends. If the DM requires further modifications (such as adding regularization terms or modifying the nonlinear component), we modify the model accordingly and then go to Step 1.
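The interactive loop above can be sketched in code; the numeric thresholds and the DM-approval callback are our assumptions, since the text does not fix cut-off values for the tradeoff coefficient.

```python
def choose_model(lam, low=0.3, high=0.7, dm_accepts=lambda choice: True):
    """One pass of the interactive model-selection process.

    lam        -- converged tradeoff coefficient
    low, high  -- hypothetical thresholds for "very small" / "close to 1"
    dm_accepts -- stand-in for asking the decision maker to approve a choice
    """
    if lam < low:
        # Data look highly complex: propose a black-box model; if the DM
        # declines, fall back to a regularized NN-MCDA and iterate.
        return "black-box" if dm_accepts("black-box") else "regularized NN-MCDA"
    if lam > high:
        # An interpretable model suffices; if the DM wants more accuracy,
        # deepen the nonlinear component and iterate.
        return "NN-MCDA" if dm_accepts("NN-MCDA") else "deeper nonlinear component"
    # Middle range: present the model and results to the DM as-is.
    return "NN-MCDA"
```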
5.3.3 Flexible inclusion of attributes in the nonlinear component
In practice, human decision behavior usually focuses on a small number of key attributes/criteria (Ribeiro et al. 2016). However, there may exist other minor attributes that do not directly contribute to the prediction but could affect it through non-traceable, complex interactions with other attributes (for example, the interaction between the nonlinear transformation of one attribute and the nonlinear transformations of five other attributes). These minor attributes can be incorporated into the nonlinear component.
In the geriatric depression experiment, the gender of an older adult may not directly indicate a difference in the risk for depression; however, it might still influence the prediction through complex interactions with other attributes. We further extend the NN-MCDA model to incorporate gender and smoking status into the nonlinear component (as shown in Figure 5.3.3). We find that incorporating these attributes indeed improves the prediction accuracy (the AUC on the testing set increases from 0.669 to 0.675), while the marginal value functions in the linear component remain similar (see Figure 5.3.3). If we add two more attributes, for instance whether the respondent received any home care in the last two years and whether a health problem limited his/her work, the AUC increases more noticeably (from 0.675 to 0.708) and the marginal value functions still provide convincing results (see Figure 5.3.3).
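A minimal sketch of this routing, under the assumption that key attributes feed the linear component while the nonlinear component additionally sees the minor attributes (index lists and function name are illustrative for the depression example):

```python
import numpy as np

def split_inputs(X, linear_idx, nonlinear_extra_idx):
    """Route key attributes to the linear component; the nonlinear
    component sees both the key and the minor attributes (e.g. gender,
    smoking status), which may interact with them."""
    X = np.asarray(X, dtype=float)
    X_linear = X[:, linear_idx]
    X_nonlinear = X[:, linear_idx + nonlinear_extra_idx]
    return X_linear, X_nonlinear

X = np.arange(12.0).reshape(3, 4)  # 3 respondents, 4 attributes
# Hypothetical layout: columns 0-1 are key attributes, 2-3 are minor ones
X_lin, X_non = split_inputs(X, linear_idx=[0, 1], nonlinear_extra_idx=[2, 3])
```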
5.4 Joint training process.
In the NN-MCDA model, the linear and nonlinear components are combined by a tradeoff coefficient, and their weighted sum is fed to a common logistic function for a joint training process. Note that this joint training differs from ensemble learning (Cheng et al. 2016), in which multiple classifiers are trained individually and their predictions are combined only after every model has been optimized separately. For example, an ensemble approach could train a logistic regression model and an MLP on the same dataset separately and then integrate the two models' predictions. In the joint training process, by contrast, the linear and nonlinear components are connected: tuning the parameters in one component affects the other. Predictions are made once the jointly trained model reaches its optimum.
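The contrast can be made concrete with a toy comparison: the joint model applies one logistic function to the weighted sum of component scores, whereas an ensemble averages independently produced probabilities. All names here are illustrative, and the two recipes generally give different outputs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def joint_predict(x, w, nonlinear_fn, lam):
    """Joint model: a single logistic over the combined score."""
    z = lam * (x @ w) + (1.0 - lam) * nonlinear_fn(x)
    return sigmoid(z)

def ensemble_predict(x, w, nonlinear_fn):
    """Ensemble baseline: each model predicts separately,
    and the probabilities are averaged afterwards."""
    p_linear = sigmoid(x @ w)
    p_nonlinear = sigmoid(nonlinear_fn(x))
    return 0.5 * (p_linear + p_nonlinear)

x = np.array([1.0, 2.0])
w = np.array([1.0, -1.0])
nl = lambda v: v[0] * v[1]  # stand-in for a trained nonlinear component
p_joint = joint_predict(x, w, nl, lam=0.5)
p_ens = ensemble_predict(x, w, nl)
```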
6 Conclusion and future work.
In this paper, we proposed a framework for an interpretable model, named NN-MCDA, which combines a traditional MCDA model with neural networks. MCDA uses marginal value functions to describe the contribution of individual attributes to the predictions, while the neural network captures high-order interrelations among attributes. The framework automatically balances the tradeoff between the two components. NN-MCDA is more interpretable than a full-complexity model while maintaining similar predictive performance.
We present simulation experiments to demonstrate the effectiveness of NN-MCDA. The experiments show that (1) polynomials of higher degree do not always improve accuracy; (2) there is a tradeoff between the interpretability and the predictability of a model, and NN-MCDA can achieve a good balance between them; (3) given simple data, NN-MCDA performs as well as an interpretable model, while given more complex data, NN-MCDA outperforms an interpretable model. We also show how to apply the NN-MCDA framework to real-world decision-making problems. The experiments with real data demonstrate the good prediction performance of NN-MCDA and its ability to capture the detailed contributions of individual attributes.
To the best of our knowledge, this research is the first to introduce interpretability into machine learning models from the perspective of MCDA. The proposed framework sheds light on how to use MCDA techniques to enhance the interpretability of machine learning models, and how to use machine learning techniques to free MCDA from strong assumptions and enhance its generalizability and predictability.
We envisage the following directions for future research based on the NN-MCDA framework. First, we can further enhance the interpretability of the model by proposing algorithms to approximate attribute interactions after obtaining the marginal value functions. Second, additional simulations are needed to validate the effectiveness of the NN-MCDA variants introduced in the discussion section. Last but not least, applying the proposed framework to a variety of real-world decision-making and prediction problems constitutes another interesting direction for future work.
References
 Aggarwal and Fallah Tehrani (2019) Aggarwal M, Fallah Tehrani A (2019) Modelling human decision behaviour with preference learning. INFORMS Journal on Computing 31(2):318–334, URL http://dx.doi.org/10.1287/ijoc.2018.0823.
 Alexopoulos (2005) Alexopoulos GS (2005) Depression in the elderly. The Lancet 365(9475):1961–1970.
 Angilella et al. (2014) Angilella S, Corrente S, Greco S, Słowiński R (2014) MUSAINT: Multicriteria customer satisfaction analysis with interacting criteria. Ome